
Application of Logistic Regression

Introduction and Description

Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model

Why use logistic regression?

There are many important research topics for which the dependent variable is "limited."
For example: voting, morbidity or mortality, and participation data are not continuous or normally distributed.
Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).

The Linear Probability Model

In the OLS regression:
Y = α + βX + e ;  where Y = (0, 1)
The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0

An Example: Earthquake Evacuations

Say we want to model whether an individual evacuated his home before the Gujarat earthquake.
Dependent Variable: EVAC (0: Did not evacuate; 1: Evacuated)
Independent Variables: (i) whether they had pets, (ii) whether they lived in a rented house, (iii) time spent in the current house, (iv) educational qualification (coded 10 to 18)

The Data

EVAC  PETS  Rented  TENURE  EDUC
0     1     0       16      16
0     1     0       26      12
0     1     1       11      13
1     1     1        1      10
1     0     0        5      12
0     0     0       34      12
0     0     0        3      14
0     1     0        3      16
0     1     0       10      12
0     0     0        2      18
0     0     0        2      12
0     1     0       25      16
1     1     1       20      12

OLS Results

Variable    B       t-value
(Constant)  0.296   2.058
PETS       -0.139  -3.308
Rented      0.357   5.865
TENURE     -0.004  -2.520
EDUC        0.010   1.039

R2 = 0.086, F-stat = 12.422

Problems: Predicted values outside the (0, 1) range

Residuals Statistics

                      Minimum    Maximum    Mean
Std. Predicted Value  -2.12109   3.251766   0

Three Known Issues with Application of OLS When the Dependent Variable Is a Decision Variable

Error terms are heteroskedastic
Error terms are not normally distributed
Predicted values can be >1 or <0 (see the sketch below)
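The third problem is easy to see numerically. Below is a minimal sketch (not from the slides; synthetic data, assuming numpy and statsmodels are available) that fits OLS to a 0/1 outcome and shows the fitted values escaping the [0, 1] range:

```python
# Synthetic demo: OLS on a binary outcome (the linear probability model)
# can produce fitted "probabilities" below 0 or above 1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=200)
# True relationship is logistic; the observed outcome is binary.
y = (rng.random(200) < 1 / (1 + np.exp(-3 * x))).astype(int)

lpm = sm.OLS(y, sm.add_constant(x)).fit()
print("min fitted:", lpm.fittedvalues.min())  # typically < 0
print("max fitted:", lpm.fittedvalues.max())  # typically > 1
```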

The Logistic Regression Model

The "logit" model solves these problems:
ln[p/(1-p)] = α + βX + e

p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the "odds"
ln[p/(1-p)] is the log odds, or "logit"

More:
The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
The estimated probability is:
p = 1/[1 + exp(-α - βX)]
If α + βX = 0, then p = .50
As α + βX gets very large, p approaches 1
As α + βX gets very small (large and negative), p approaches 0
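As a quick numeric illustration (my own sketch, assuming numpy; not part of the original slides), the logistic transform reproduces all three facts above:

```python
import numpy as np

def logistic(alpha, beta, x):
    # p = 1 / (1 + exp(-(alpha + beta*x)))
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

print(logistic(0, 1, 0))                        # 0.5 when alpha + beta*x = 0
print(logistic(0, 1, np.array([-50.0, 50.0])))  # -> [~0, ~1]: p never leaves (0, 1)
```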

Maximum Likelihood Estimation (MLE)

MLE is a statistical method for estimating the coefficients of a model.
It is most important when modeling non-linear, non-normal data.
The likelihood function (L) measures the probability of observing the particular set of dependent variable values (p1, p2, ..., pn) that occur in the sample:
L = Prob(p1 * p2 * ... * pn)
The higher the L, the higher the probability of observing the ps in the sample.

MLE involves finding the coefficients (α, β) that make the log of the likelihood function (LL < 0) as large as possible.
Equivalently, it finds the coefficients that make -2 times the log of the likelihood function (-2LL) as small as possible.
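For concreteness, here is the binary log-likelihood that MLE maximizes, written out in Python (a sketch of the standard formula, assuming numpy; the variable names are mine):

```python
import numpy as np

def log_likelihood(alpha, beta, x, y):
    """LL = sum_i [ y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i) ]"""
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# SPSS reports -2LL; maximizing LL is the same as minimizing:
# neg2ll = -2 * log_likelihood(alpha_hat, beta_hat, x, y)
```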

Interpreting Coefficients

Since:
ln[p/(1-p)] = α + βX + e
the slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes, which is not very useful.
Since:
p = 1/[1 + exp(-α - βX)]
an interpretation of the logit coefficient which is usually more intuitive is the "odds ratio".
Since:
[p/(1-p)] = exp(α + βX)
exp(β) is the multiplicative effect of a one-unit change in the independent variable on the odds.

From SPSS Output:

Variable    B        S.E.    Wald
pet        -0.6593   0.2012  10.7323
Rented      1.5583   0.2874  29.3895
tenure     -0.0198   0.008    6.1238
education   0.0501   0.0468   1.1483
Constant   -0.916    0.69     1.7624

Households without pets have 1.933 times the odds of evacuating compared with those with pets.

From SPSS Output:

Variable    B        Exp(B)   1/Exp(B)
pet        -0.6593   0.5172   1.933
Rented      1.5583   4.7508
tenure     -0.0198   0.9804   1.020
education   0.0501   1.0514
Constant   -0.916
Households without pets have 1.933 times the odds of evacuating compared with those with pets.
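The Exp(B) and 1/Exp(B) columns can be recomputed directly from the printed B values (a quick check assuming numpy; the coefficients are the ones in the table above):

```python
import numpy as np

b = {"pet": -0.6593, "Rented": 1.5583, "tenure": -0.0198, "education": 0.0501}
for name, coef in b.items():
    exp_b = np.exp(coef)
    print(f"{name:10s} Exp(B) = {exp_b:.4f}   1/Exp(B) = {1 / exp_b:.3f}")
# pet: Exp(B) = 0.5172 and 1/Exp(B) = 1.933, the figure quoted above.
```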

Evaluating the Performance of the Model

There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:
Model Chi-Square
Percent Correct Predictions
Pseudo-R2

Model Chi-Square

The model likelihood ratio (LR) statistic is
LR[i] = -2[LL(α) - LL(α, β)]
{Or, as you read it off the SPSS printout:
LR[i] = [-2LL (of beginning model)] - [-2LL (of ending model)]}

The LR statistic is distributed chi-square with i degrees of freedom, where i is the number of independent variables.
Use the Model Chi-Square statistic to determine if the overall model is statistically significant.
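As a sketch (assuming scipy; the -2LL values are the ones used in the pseudo-R2 example later in these slides, and taking i = 4 as in the four-predictor evacuation model):

```python
from scipy.stats import chi2

beginning_neg2ll = 687.36  # -2LL of the beginning (intercept-only) model
ending_neg2ll = 641.84     # -2LL of the ending (fitted) model

lr = beginning_neg2ll - ending_neg2ll  # model chi-square = 45.52
df = 4                                 # i = number of independent variables
print(f"LR = {lr:.2f}, p = {chi2.sf(lr, df):.3g}")  # highly significant
```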

Percent Correct Predictions

The "Percent Correct Predictions" statistic assumes that if the estimated p is greater than or equal to .5 then the event is expected to occur, and not to occur otherwise.
By assigning these probabilities 0s and 1s and comparing them to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.

An Example:

            Predicted
Observed    0     1     % Correct
0           328   24    93.18%
1           139   44    24.04%
Overall                 69.53%
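The three percentages follow directly from the 2x2 table (a quick check in plain Python):

```python
# Rows: observed; columns: predicted (using the .5 cutoff).
obs0_pred0, obs0_pred1 = 328, 24
obs1_pred0, obs1_pred1 = 139, 44

pct_no = obs0_pred0 / (obs0_pred0 + obs0_pred1)    # correct among observed 0s
pct_yes = obs1_pred1 / (obs1_pred0 + obs1_pred1)   # correct among observed 1s
total = obs0_pred0 + obs0_pred1 + obs1_pred0 + obs1_pred1
overall = (obs0_pred0 + obs1_pred1) / total
print(f"{pct_no:.2%} {pct_yes:.2%} {overall:.2%}")  # 93.18% 24.04% 69.53%
```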

Pseudo-R2

One pseudo-R2 statistic is McFadden's R2:
McFadden's R2 = 1 - [LL(α,β)/LL(α)]
{= 1 - [-2LL(α,β)/-2LL(α)] (from the SPSS printout)}
where the R2 is a scalar measure which varies between 0 and (somewhat close to) 1, much like the R2 in a linear probability model.

An Example:

Beginning -2LL        687.36
Ending -2LL           641.84
Ending/Beginning      0.9338
McFadden's R2 = 1 - E/B = 0.0662

Strength of association (pseudo R-square)

McFadden's:
McFadden's R2 = 1 - LL(B)/LL(0)

This value tends to be smaller than R-square, and values of .2 to .4 are considered highly satisfactory.

Strength of association (pseudo R-square)

Cox and Snell is also based on log-likelihood, but it takes the sample size into account:
Cox & Snell R2 = 1 - exp[(2/n)(LL(0) - LL(B))]

But it cannot reach a maximum of 1, as we would like, so:

Strength of association (pseudo R-square)

The Nagelkerke measure adjusts the Cox and Snell measure by its maximum possible value, so that 1 can be achieved:
Nagelkerke R2 = R2_CS / R2_MAX,  where R2_MAX = 1 - exp[(2/n) LL(0)]
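The three measures are easy to compute side by side from the two log-likelihoods (a sketch assuming numpy; LL(0) and LL(B) are the intercept-only and fitted log-likelihoods):

```python
import numpy as np

def pseudo_r2(ll_null, ll_model, n):
    """McFadden, Cox & Snell, and Nagelkerke pseudo-R2 from log-likelihoods."""
    mcfadden = 1 - ll_model / ll_null
    cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))
    r2_max = 1 - np.exp((2 / n) * ll_null)
    return mcfadden, cox_snell, cox_snell / r2_max

# With the -2LL figures from the earlier example (LL = -2LL / -2):
print(pseudo_r2(ll_null=-687.36 / 2, ll_model=-641.84 / 2, n=535)[0])  # 0.0662
```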

Hosmer-Lemeshow Test Statistic

Divides the subjects into ten ordered groups and then compares the numbers (actual vs. predicted) in each group.
Decile groups are based on the predicted probability of occurrence, from .1 to 1.
Null Hypothesis: No difference between observed and predicted classifications (non-significance is actually good!).
It assumes that there are enough cases in each group for the chi-square approximation to be valid.

HL result (Sample)

Hosmer-Lemeshow Test
Chi-square   Df   Sig.
4.361        8    .823

Table 5.7 Contingency Table for Hosmer-Lemeshow Test

        Death Status = Alive     Death Status = Dead
Group   Observed   Expected      Observed   Expected     Total
1         27       26.991           0          .009        27
2         25       24.977           0          .023        25
3         23       22.948           0          .052        23
4         28       27.879           0          .121        28
5         25       25.717           1          .283        26
6         27       26.420           0          .580        27
7         24       23.815           1         1.185        25
8         22       22.933           4         3.067        26
9         20       17.814           6         8.186        26
10         4        5.505          22        20.495        26

Application of Logistic Regression to Default Prediction: an application in SPSS

You are a loan officer in a bank, and you want to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad loans. You have information on 850 past and prospective clients. The first 700 cases are people who have previously been given loans. Use these 700 records to create a logistic regression model, and then use this model to classify the 150 prospective customers as good or bad.

Variable Description

Step 1: Select the 700 observations for logistic regression

Step 2: Construct a logistic model
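The same two steps can be scripted outside SPSS. A minimal sketch (assuming pandas and statsmodels; the file name and predictor names here are hypothetical placeholders for the variable description above):

```python
import pandas as pd
import statsmodels.api as sm

bank = pd.read_csv("bankloan.csv")            # hypothetical export of the 850 cases
train = bank[bank["default"].notna()]         # Step 1: the 700 previous loans
prospects = bank[bank["default"].isna()]      # the 150 prospective customers

predictors = ["age", "employ", "debtinc"]     # hypothetical predictor columns
X = sm.add_constant(train[predictors])
model = sm.Logit(train["default"], X).fit()   # Step 2: fit the logistic model

# Classify prospects: p >= .5 -> predicted bad loan (likely default).
p_hat = model.predict(sm.add_constant(prospects[predictors]))
prospects = prospects.assign(predicted_default=(p_hat >= 0.5).astype(int))
```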

Diagnostics

Cook's. The logistic regression analog of Cook's influence statistic: a measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients.
Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Points with a large Cook's distance are considered to merit closer examination in the analysis.

Use of Cook's statistic

There are different opinions regarding what cutoff values to use for spotting outliers. A simple operational guideline of Di > 1 has been suggested. Others have indicated that Di > 4/n, where n is the number of observations, might be used.
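A sketch of how one might pull Cook's distances for a logistic fit and apply the two cutoffs (this assumes statsmodels' GLM influence API behaves as in its OLS counterpart, returning distances as the first element; `train` and `X` are the hypothetical data from the earlier sketch):

```python
import statsmodels.api as sm

# Refit the same model through the GLM interface to access influence measures.
glm = sm.GLM(train["default"], X, family=sm.families.Binomial()).fit()
cooks_d = glm.get_influence().cooks_distance[0]   # one distance per case

n = len(cooks_d)
flag_strict = cooks_d > 1         # the simple operational guideline
flag_loose = cooks_d > 4 / n      # the Di > 4/n rule
print("cases to examine:", flag_loose.nonzero()[0])
```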

Hosmer-Lemeshow Goodness of Fit

Hosmer-Lemeshow goodness-of-fit statistic. This goodness-of-fit statistic is more robust than the traditional goodness-of-fit statistic used in logistic regression, particularly for models with continuous covariates and studies with small sample sizes. It is based on grouping cases into deciles of risk and comparing the observed probability with the expected probability within each decile.

The Hosmer-Lemeshow statistic is distributed as chi-square with 8 degrees of freedom.

Challenger Space Shuttle

On Jan 28th, 1986, the NASA space shuttle program launched its 25th space shuttle flight from Kennedy Space Center in Florida. 73 seconds into the flight, the external fuel tank collapsed and spilled liquid oxygen and hydrogen.
http://www.youtube.com/watch?v=j4JOjcDFtBE

Rogers Commission Report

O-ring resiliency is directly related to its temperature... A warm O-ring that has been compressed will return to its original shape much quicker than will a cold O-ring when compression is relieved. ... A compressed O-ring at 75 degrees Fahrenheit is five times more responsive in returning to its uncompressed shape than a cold O-ring at 30 degrees Fahrenheit.... At the cold launch temperature experienced, the O-ring would be very slow in returning to its normal rounded shape. ... It would remain in its compressed position in the O-ring channel and not provide a space between itself and the upstream channel wall. Thus, it is probable the O-ring would not... seal the gap in time to preclude joint failure due to blow-by and erosion from hot combustion gases... Of 21 launches with ambient temperatures of 61 degrees Fahrenheit or greater, only four showed signs of O-ring thermal distress; i.e., erosion or blow-by and soot. Each of the launches below 61 degrees Fahrenheit resulted in one or more O-rings showing signs of thermal distress.

Recommendation of Thiokol Corp

A lamentable aspect of this disaster was that the problem with the O-rings was already understood by some engineers prior to the Challenger launch. In February of 1984, the Marshall Configuration Control Board sent a memo about the O-ring erosion that occurred on STS 41-B (the 10th space shuttle flight and the 4th mission for the Challenger shuttle). These messages continued to increase in intensity, as evidenced by a 1985 internal memo from Thiokol Corp., the company that designed the O-ring. Employees from Thiokol wrote the following to their Vice President of Engineering:

"This letter is written to ensure that management is fully aware of the seriousness of the current O-ring erosion problem in the SRM joints from an engineering standpoint...."

Could Logit have helped save seven precious lives?

The temperature on the launch date was 31° F.

Data Description

n = 23 Space Shuttle lift-offs prior to Challenger
Response: Presence/absence of erosion or blow-by on at least one O-ring field joint
Y = 1 if occurred, 0 if not
Predictor Variable: Temperature at lift-off
X = Temperature (degrees Fahrenheit)

Data

[Scatter plot: O-Ring Problem versus Temperature (degrees Fahrenheit)]

Flight#  Temp  O-Ring Problem     Flight#  Temp  O-Ring Problem
1        66    0                  13       67    0
2        70    1                  14       53    1
3        69    0                  15       67    0
4        68    0                  16       75    0
5        67    0                  17       70    0
6        72    0                  18       81    0
7        73    0                  19       76    0
8        70    0                  20       79    0
9        57    1                  21       75    1
10       63    1                  22       76    0
11       70    1                  23       58    1
12       78    0


Logistic Regression Model

Distribution of responses: Binomial
Link function: Logit

π(X) = P(Y = 1 | X)
g(π) = ln[π(X) / (1 - π(X))] = β0 + β1X
π(X) = exp(β0 + β1X) / [1 + exp(β0 + β1X)]

Model Estimation/Inference

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   15.0429      7.3786    2.039    0.0415 *
degrees       -0.2322      0.1082   -2.145    0.0320 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 28.267 on 22 degrees of freedom
Residual deviance: 20.315 on 21 degrees of freedom

H0: No association between incidence of O-Ring failure and temperature (β1 = 0)
HA: Association between incidence of O-Ring failure and temperature (β1 ≠ 0)
Reject H0 (zobs = -2.145, P = 0.032). Conclude a negative association exists.
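The output above is from R's glm; the same maximum likelihood estimates can be reproduced in, for example, Python's statsmodels (a sketch using the 23 flights tabulated earlier):

```python
import numpy as np
import statsmodels.api as sm

temp = np.array([66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58], dtype=float)
problem = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                    0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])

fit = sm.Logit(problem, sm.add_constant(temp)).fit()
print(fit.params)    # approx. (15.0429, -0.2322), matching the output above

# Predicted failure probability at the 31 F launch temperature:
print(fit.predict(np.array([[1.0, 31.0]])))  # ~0.9996: failure all but certain
```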

π̂(X) = exp(15.04 - 0.23X) / [1 + exp(15.04 - 0.23X)]

[Plot: fitted P(O-Ring Problem) versus Temperature, 50 to 90 degrees Fahrenheit, with the observed 0/1 outcomes overlaid]

Odds Ratio

odds(X) = π(X) / [1 - π(X)]
        = {exp(β0 + β1X) / [1 + exp(β0 + β1X)]} / {1 / [1 + exp(β0 + β1X)]}
        = exp(β0 + β1X)

odds(X+1) = exp(β0 + β1(X+1)) = exp(β0 + β1X) exp(β1)

Odds Ratio: OR = odds(X+1) / odds(X) = exp(β0 + β1X) exp(β1) / exp(β0 + β1X) = exp(β1)

Estimated Odds Ratio: OR̂ = exp(β̂1)

95% CI for the population Odds Ratio: ( exp[β̂1 - 1.96 SE(β̂1)], exp[β̂1 + 1.96 SE(β̂1)] )

Challenger Data:

OR̂ = exp(β̂1) = exp(-0.23) = 0.795

95% CI: ( exp[-0.23 - 1.96(0.1082)], exp[-0.23 + 1.96(0.1082)] ) = ( exp(-0.4421), exp(-0.0179) ) = (0.6427, 0.9823)

Note: The odds of failure decrease as temperature increases (interval below 1).
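The same numbers drop out of a two-line computation (a sketch assuming numpy, using the estimates printed above):

```python
import numpy as np

b1, se = -0.2322, 0.1082
print(np.exp(b1))                                # ~0.79, the estimated OR
print(np.exp([b1 - 1.96 * se, b1 + 1.96 * se]))  # ~(0.64, 0.98), below 1
```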
