Logistic regression is a statistical method for modeling and predicting the probability of a binary outcome (e.g., success/failure, yes/no) from one or more independent variables. Unlike linear regression, which predicts continuous values, logistic regression predicts probabilities, which can then be thresholded to classify outcomes.
- **Binary Dependent Variable:** The outcome variable ($Y$) has two possible values, typically coded as 0 and 1. Example: 1 = Approved, 0 = Not Approved.
- **Independent Variables:** These can be continuous, categorical, or a mix of both. Example: income, age, education level.
- **Logistic Function (Sigmoid Function):** Converts the linear combination of inputs into a probability between 0 and 1:
  $$P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k)}}$$
- **Log-Odds Transformation:** The logistic model estimates the log-odds of the probability:
  $$\text{Log-Odds} = \ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_kX_k$$
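The two formulas above can be sketched in a few lines of Python (the function names here are illustrative, not from any particular library):

```python
import math

def sigmoid(z):
    """Map a log-odds value z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(coefficients, features):
    """P(Y=1 | X) for one observation.

    coefficients: [beta_0, beta_1, ..., beta_k] (intercept first)
    features:     [x_1, ..., x_k]
    """
    log_odds = coefficients[0] + sum(
        b * x for b, x in zip(coefficients[1:], features)
    )
    return sigmoid(log_odds)

print(sigmoid(0.0))  # log-odds of 0 correspond to a probability of 0.5
```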
Logistic regression relies on several key assumptions:

- **Binary Outcome:** The dependent variable is binary.
- **Independence of Observations:** Observations are independent of one another.
- **Linearity in the Log-Odds:** The relationship between the independent variables and the log-odds is linear.
- **No Multicollinearity:** The independent variables are not highly correlated with one another.
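One quick way to screen for multicollinearity is to inspect pairwise correlations between predictors. A sketch using NumPy; the predictor values below are made up for illustration:

```python
import numpy as np

# Hypothetical predictor columns: income, age, years of education.
X = np.array([
    [40_000, 25, 12],
    [50_000, 32, 16],
    [60_000, 41, 14],
    [70_000, 38, 18],
    [80_000, 55, 16],
], dtype=float)

# Pairwise Pearson correlations between predictors; |r| close to 1
# for a pair of predictors is a warning sign of multicollinearity,
# which can destabilize the estimated coefficients.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```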
There are three common types of logistic regression:

- **Binary Logistic Regression:** Predicts a binary outcome (e.g., pass/fail).
- **Multinomial Logistic Regression:** Predicts outcomes with more than two unordered categories (e.g., types of transportation).
- **Ordinal Logistic Regression:** Predicts outcomes with ordered categories (e.g., customer satisfaction levels: low, medium, high).
Fitting and using a logistic regression model involves four steps:

1. Specify the dependent and independent variables.
2. Estimate the coefficients ($\beta_0, \beta_1, \dots, \beta_k$), typically by maximum likelihood estimation.
   - A **positive coefficient** increases the log-odds of $Y = 1$.
   - A **negative coefficient** decreases the log-odds of $Y = 1$.
3. Convert the log-odds to probabilities using the logistic function.
4. Assess model performance using accuracy, precision, recall, and AUC-ROC.
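As a rough illustration of step 2, the toy function below fits a one-predictor model by gradient ascent on the log-likelihood. This is a simplified stand-in for the maximum likelihood routines in statistical packages, and the data is made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Estimate beta_0, beta_1 for one predictor by gradient ascent
    on the log-likelihood (toy maximum likelihood estimation)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient of the log-likelihood
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Toy data: the outcome tends to be 1 for larger x.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0,   0,   0,   1,   1,   1]
b0, b1 = fit_logistic(xs, ys)
print(round(sigmoid(b0 + b1 * 0.5), 3), round(sigmoid(b0 + b1 * 4.0), 3))
```

With this separable toy data the fitted curve assigns a low probability to small x and a high probability to large x, matching the sign interpretation of the coefficients above.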
A bank wants to predict whether a loan applicant will default ($Y = 1$) or not ($Y = 0$) based on income and credit score:
| Income ($X_1$) | Credit Score ($X_2$) | Default ($Y$) |
|---|---|---|
| 40,000 | 600 | 1 |
| 50,000 | 650 | 0 |
| 60,000 | 700 | 0 |
| 70,000 | 750 | 0 |
| 80,000 | 800 | 1 |
- **Logistic Model:**
  $$P(Y=1 \mid X_1, X_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2)}}$$
- **Estimated Coefficients:** $\beta_0 = -10$, $\beta_1 = 0.0001$, $\beta_2 = 0.02$
- **Prediction for a New Applicant:** For an applicant with income $X_1 = 65,000$ and credit score $X_2 = 720$:
  $$\text{Log-Odds} = -10 + 0.0001(65{,}000) + 0.02(720) = -10 + 6.5 + 14.4 = 10.9$$
  $$P(Y=1) = \frac{1}{1 + e^{-10.9}} \approx 0.99998$$
  With a predicted probability near 1, the model indicates the applicant is very likely to default.
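The arithmetic above is easy to verify directly (the coefficients are the given example values, not estimates from the table):

```python
import math

# Coefficients from the worked example.
beta0, beta1, beta2 = -10.0, 0.0001, 0.02

income, credit_score = 65_000, 720
log_odds = beta0 + beta1 * income + beta2 * credit_score
probability = 1.0 / (1.0 + math.exp(-log_odds))

print(log_odds)     # 10.9 (up to floating-point rounding)
print(probability)  # just under 1 (about 0.99998)
```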
- **Confusion Matrix:**
  - True Positive (TP): defaults correctly predicted as defaults.
  - False Positive (FP): non-defaults incorrectly predicted as defaults.
  - True Negative (TN): non-defaults correctly predicted as non-defaults.
  - False Negative (FN): defaults incorrectly predicted as non-defaults.
- **Accuracy:**
  $$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{Total Observations}}$$
- **Precision:**
  $$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
- **Recall:**
  $$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
- **AUC-ROC:** Measures the model's ability to distinguish between classes. A higher value indicates better performance.
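These formulas can be packaged into a small helper; the confusion-matrix counts below are hypothetical:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts for a default-prediction model.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(metrics)  # accuracy 0.85, precision 0.8, recall ~0.889
```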
Logistic regression is widely applied across domains:

- **Healthcare:** Predicting the likelihood of a disease based on patient attributes.
- **Finance:** Determining loan defaults or creditworthiness.
- **Marketing:** Classifying customers as likely or unlikely to purchase.
Logistic regression is a powerful and widely used statistical tool for binary classification problems. By understanding its principles, assumptions, and evaluation metrics, you can use logistic regression to make accurate predictions and informed decisions.
Next Steps: ANOVA and MANOVA