
Application of Logistic Regression

Introduction and Description

Why use logistic regression?
Estimation by maximum likelihood
Interpreting coefficients
Hypothesis testing
Evaluating the performance of the model

Why use logistic regression?

There are many important research topics for which the dependent variable is "limited."
For example: voting, morbidity or mortality, and participation data are not continuous or normally distributed.
Binary logistic regression is a type of regression analysis where the dependent variable is a dummy variable: coded 0 (did not vote) or 1 (did vote).

The Linear Probability Model

In the OLS regression:
Y = α + βX + e ;  where Y = (0, 1)
The error terms are heteroskedastic
e is not normally distributed because Y takes on only two values
The predicted probabilities can be greater than 1 or less than 0

An Example: Earthquake Evacuations

Say we want to model whether an individual evacuated his home before the Gujarat earthquake.
Dependent Variable: EVAC (0: Did not evacuate; 1: Evacuated)
Independent Variables: (i) whether they had pets, (ii) whether they lived in a rented house, (iii) time spent in the current house, (iv) educational qualification (coded 10 to 18)

The Data

EVAC  PETS  Rented  TENURE  EDUC
0     1     0       16      16
0     1     0       26      12
0     1     1       11      13
1     1     1        1      10
1     0     0        5      12
0     0     0       34      12
0     0     0        3      14
0     1     0        3      16
0     1     0       10      12
0     0     0        2      18
0     0     0        2      12
0     1     0       25      16
1     1     1       20      12

OLS Results

Variable    B       t-value
(Constant)  0.296   2.058
PETS       -0.139  -3.308
Rented      0.357   5.865
TENURE     -0.004  -2.520
EDUC        0.010   1.039

R2 = 0.086, F-stat = 12.422

Problems: Predicted values outside the (0, 1) range

Residuals Statistics

                      Minimum    Maximum    Mean
Std. Predicted Value  -2.12109   3.251766   0

Three Known Issues with Application of OLS When the Dependent Variable Is a Decision Variable

Error terms are heteroskedastic
Error terms are not normally distributed
Predicted values can be >1 or <0 (see the sketch below)
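The third problem is easy to see numerically. Below is a minimal sketch (not from the slides; synthetic data, assuming numpy and statsmodels are available) that fits OLS to a 0/1 outcome and shows the fitted values escaping the [0, 1] range:

```python
# Synthetic demo: OLS on a binary outcome (the linear probability model)
# can produce fitted "probabilities" below 0 or above 1.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
x = rng.normal(size=200)
# True relationship is logistic; the observed outcome is binary.
y = (rng.random(200) < 1 / (1 + np.exp(-3 * x))).astype(int)

lpm = sm.OLS(y, sm.add_constant(x)).fit()
print("min fitted:", lpm.fittedvalues.min())  # typically < 0
print("max fitted:", lpm.fittedvalues.max())  # typically > 1
```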

The Logistic Regression Model

The "logit" model solves these problems:
ln[p/(1-p)] = α + βX + e

p is the probability that the event Y occurs, p(Y=1)
p/(1-p) is the "odds"
ln[p/(1-p)] is the log odds, or "logit"

More:
The logistic distribution constrains the estimated probabilities to lie between 0 and 1.
The estimated probability is:
p = 1/[1 + exp(-α - βX)]
If α + βX = 0, then p = .50
As α + βX gets very large, p approaches 1
As α + βX gets very small (large and negative), p approaches 0
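As a quick numeric illustration (my own sketch, assuming numpy; not part of the original slides), the logistic transform reproduces all three facts above:

```python
import numpy as np

def logistic(alpha, beta, x):
    # p = 1 / (1 + exp(-(alpha + beta*x)))
    return 1.0 / (1.0 + np.exp(-(alpha + beta * x)))

print(logistic(0, 1, 0))                        # 0.5 when alpha + beta*x = 0
print(logistic(0, 1, np.array([-50.0, 50.0])))  # -> [~0, ~1]: p never leaves (0, 1)
```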

Maximum Likelihood Estimation (MLE)

MLE is a statistical method for estimating the coefficients of a model.
It is most important when modeling non-linear, non-normal data.
The likelihood function (L) measures the probability of observing the particular set of dependent variable values (p1, p2, ..., pn) that occur in the sample:
L = Prob(p1 * p2 * ... * pn)
The higher the L, the higher the probability of observing the ps in the sample.

MLE involves finding the coefficients (α, β) that make the log of the likelihood function (LL < 0) as large as possible.
Equivalently, it finds the coefficients that make -2 times the log of the likelihood function (-2LL) as small as possible.
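For concreteness, here is the binary log-likelihood that MLE maximizes, written out in Python (a sketch of the standard formula, assuming numpy; the variable names are mine):

```python
import numpy as np

def log_likelihood(alpha, beta, x, y):
    """LL = sum_i [ y_i * ln(p_i) + (1 - y_i) * ln(1 - p_i) ]"""
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# SPSS reports -2LL; maximizing LL is the same as minimizing:
# neg2ll = -2 * log_likelihood(alpha_hat, beta_hat, x, y)
```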

Interpreting Coefficients

Since:
ln[p/(1-p)] = α + βX + e
the slope coefficient (β) is interpreted as the rate of change in the "log odds" as X changes, which is not very useful.
Since:
p = 1/[1 + exp(-α - βX)]
an interpretation of the logit coefficient which is usually more intuitive is the "odds ratio".
Since:
[p/(1-p)] = exp(α + βX)
exp(β) is the multiplicative effect of a one-unit change in the independent variable on the odds.

From SPSS Output:

Variable    B        S.E.    Wald
pet        -0.6593   0.2012  10.7323
Rented      1.5583   0.2874  29.3895
tenure     -0.0198   0.008    6.1238
education   0.0501   0.0468   1.1483
Constant   -0.916    0.69     1.7624

Households without pets have 1.933 times the odds of evacuating compared with those with pets.

From SPSS Output:

Variable    B        Exp(B)   1/Exp(B)
pet        -0.6593   0.5172   1.933
Rented      1.5583   4.7508
tenure     -0.0198   0.9804   1.020
education   0.0501   1.0514
Constant   -0.916
Households without pets have 1.933 times the odds of evacuating compared with those with pets.
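The Exp(B) and 1/Exp(B) columns can be recomputed directly from the printed B values (a quick check assuming numpy; the coefficients are the ones in the table above):

```python
import numpy as np

b = {"pet": -0.6593, "Rented": 1.5583, "tenure": -0.0198, "education": 0.0501}
for name, coef in b.items():
    exp_b = np.exp(coef)
    print(f"{name:10s} Exp(B) = {exp_b:.4f}   1/Exp(B) = {1 / exp_b:.3f}")
# pet: Exp(B) = 0.5172 and 1/Exp(B) = 1.933, the figure quoted above.
```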

Evaluating the Performance of the Model

There are several statistics which can be used for comparing alternative models or evaluating the performance of a single model:
Model Chi-Square
Percent Correct Predictions
Pseudo-R2

Model Chi-Square

The model likelihood ratio (LR) statistic is
LR[i] = -2[LL(α) - LL(α, β)]
{Or, as you read it off the SPSS printout:
LR[i] = [-2LL (of beginning model)] - [-2LL (of ending model)]}

The LR statistic is distributed chi-square with i degrees of freedom, where i is the number of independent variables.
Use the Model Chi-Square statistic to determine if the overall model is statistically significant.
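As a sketch (assuming scipy; the -2LL values are the ones used in the pseudo-R2 example later in these slides, and taking i = 4 as in the four-predictor evacuation model):

```python
from scipy.stats import chi2

beginning_neg2ll = 687.36  # -2LL of the beginning (intercept-only) model
ending_neg2ll = 641.84     # -2LL of the ending (fitted) model

lr = beginning_neg2ll - ending_neg2ll  # model chi-square = 45.52
df = 4                                 # i = number of independent variables
print(f"LR = {lr:.2f}, p = {chi2.sf(lr, df):.3g}")  # highly significant
```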

Percent Correct Predictions

The "Percent Correct Predictions" statistic assumes that if the estimated p is greater than or equal to .5 then the event is expected to occur, and not to occur otherwise.
By assigning these probabilities 0s and 1s and comparing them to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated.

An Example:

            Predicted
Observed    0     1     % Correct
0           328   24    93.18%
1           139   44    24.04%
Overall                 69.53%
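The three percentages follow directly from the 2x2 table (a quick check in plain Python):

```python
# Rows: observed; columns: predicted (using the .5 cutoff).
obs0_pred0, obs0_pred1 = 328, 24
obs1_pred0, obs1_pred1 = 139, 44

pct_no = obs0_pred0 / (obs0_pred0 + obs0_pred1)    # correct among observed 0s
pct_yes = obs1_pred1 / (obs1_pred0 + obs1_pred1)   # correct among observed 1s
total = obs0_pred0 + obs0_pred1 + obs1_pred0 + obs1_pred1
overall = (obs0_pred0 + obs1_pred1) / total
print(f"{pct_no:.2%} {pct_yes:.2%} {overall:.2%}")  # 93.18% 24.04% 69.53%
```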

Pseudo-R2

One pseudo-R2 statistic is McFadden's R2:
McFadden's R2 = 1 - [LL(α,β)/LL(α)]
{= 1 - [-2LL(α,β)/-2LL(α)] (from the SPSS printout)}
where the R2 is a scalar measure which varies between 0 and (somewhat close to) 1, much like the R2 in a linear probability model.

An Example:

Beginning -2LL        687.36
Ending -2LL           641.84
Ending/Beginning      0.9338
McFadden's R2 = 1 - E/B = 0.0662

Strength of association (pseudo R-square)

McFadden's:
McFadden's R2 = 1 - LL(B)/LL(0)

This value tends to be smaller than R-square, and values of .2 to .4 are considered highly satisfactory.

Strength of association (pseudo R-square)

Cox and Snell is also based on log-likelihood, but it takes the sample size into account:
Cox & Snell R2 = 1 - exp[(2/n)(LL(0) - LL(B))]

But it cannot reach a maximum of 1, as we would like, so:

Strength of association (pseudo R-square)

The Nagelkerke measure adjusts the Cox and Snell measure by its maximum possible value, so that 1 can be achieved:
Nagelkerke R2 = R2_CS / R2_MAX,  where R2_MAX = 1 - exp[(2/n) LL(0)]
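The three measures are easy to compute side by side from the two log-likelihoods (a sketch assuming numpy; LL(0) and LL(B) are the intercept-only and fitted log-likelihoods):

```python
import numpy as np

def pseudo_r2(ll_null, ll_model, n):
    """McFadden, Cox & Snell, and Nagelkerke pseudo-R2 from log-likelihoods."""
    mcfadden = 1 - ll_model / ll_null
    cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))
    r2_max = 1 - np.exp((2 / n) * ll_null)
    return mcfadden, cox_snell, cox_snell / r2_max

# With the -2LL figures from the earlier example (LL = -2LL / -2):
print(pseudo_r2(ll_null=-687.36 / 2, ll_model=-641.84 / 2, n=535)[0])  # 0.0662
```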

Hosmer-Lemeshow Test Statistic

Divides the subjects into ten ordered groups and then compares the numbers (actual vs. predicted) in each group.
Decile groups are based on the predicted probability of occurrence, from .1 to 1.
Null Hypothesis: No difference between observed and predicted classifications (non-significance is actually good!).
It assumes that there are enough cases in each group for the chi-square approximation to be valid.

HL result (Sample)

Hosmer-Lemeshow Test
Chi-square   Df   Sig.
4.361        8    .823

Table 5.7 Contingency Table for Hosmer-Lemeshow Test

        Death Status = Alive     Death Status = Dead
Group   Observed   Expected      Observed   Expected     Total
1         27       26.991           0          .009        27
2         25       24.977           0          .023        25
3         23       22.948           0          .052        23
4         28       27.879           0          .121        28
5         25       25.717           1          .283        26
6         27       26.420           0          .580        27
7         24       23.815           1         1.185        25
8         22       22.933           4         3.067        26
9         20       17.814           6         8.186        26
10         4        5.505          22        20.495        26

Application of Logistic Regression to Default Prediction: an application in SPSS

You are a loan officer in a bank, and you want to identify characteristics that are indicative of people who are likely to default on loans, and use those characteristics to identify good and bad loans. You have information on 850 past and prospective clients. The first 700 cases are people who have previously been given loans. Use these 700 records to create a logistic regression model, and then use this model to classify the 150 prospective customers as good or bad.

Variable Description

Step 1: Select the 700 observations for logistic regression

Step 2: Construct a logistic model
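The same two steps can be scripted outside SPSS. A minimal sketch (assuming pandas and statsmodels; the file name and predictor names here are hypothetical placeholders for the variable description above):

```python
import pandas as pd
import statsmodels.api as sm

bank = pd.read_csv("bankloan.csv")            # hypothetical export of the 850 cases
train = bank[bank["default"].notna()]         # Step 1: the 700 previous loans
prospects = bank[bank["default"].isna()]      # the 150 prospective customers

predictors = ["age", "employ", "debtinc"]     # hypothetical predictor columns
X = sm.add_constant(train[predictors])
model = sm.Logit(train["default"], X).fit()   # Step 2: fit the logistic model

# Classify prospects: p >= .5 -> predicted bad loan (likely default).
p_hat = model.predict(sm.add_constant(prospects[predictors]))
prospects = prospects.assign(predicted_default=(p_hat >= 0.5).astype(int))
```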

Diagnostics

Cook's. The logistic regression analog of Cook's influence statistic: a measure of how much the residuals of all cases would change if a particular case were excluded from the calculation of the regression coefficients.
Cook's distance measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression. Points with a large Cook's distance are considered to merit closer examination in the analysis.

Use of Cook's statistic

There are different opinions regarding what cutoff values to use for spotting outliers. A simple operational guideline of Di > 1 has been suggested. Others have indicated that Di > 4/n, where n is the number of observations, might be used.
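A sketch of how one might pull Cook's distances for a logistic fit and apply the two cutoffs (this assumes statsmodels' GLM influence API behaves as in its OLS counterpart, returning distances as the first element; `train` and `X` are the hypothetical data from the earlier sketch):

```python
import statsmodels.api as sm

# Refit the same model through the GLM interface to access influence measures.
glm = sm.GLM(train["default"], X, family=sm.families.Binomial()).fit()
cooks_d = glm.get_influence().cooks_distance[0]   # one distance per case

n = len(cooks_d)
flag_strict = cooks_d > 1         # the simple operational guideline
flag_loose = cooks_d > 4 / n      # the Di > 4/n rule
print("cases to examine:", flag_loose.nonzero()[0])
```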

Hosmer-Lemeshow Goodness of Fit

Hosmer-Lemeshow goodness-of-fit statistic. This goodness-of-fit statistic is more robust than the traditional goodness-of-fit statistic used in logistic regression, particularly for models with continuous covariates and studies with small sample sizes. It is based on grouping cases into deciles of risk and comparing the observed probability with the expected probability within each decile.

The Hosmer-Lemeshow statistic is distributed as chi-square with 8 degrees of freedom.

Challenger Space Shuttle

On Jan 28th, 1986, the NASA space shuttle program launched its 25th space shuttle flight from Kennedy Space Center in Florida. 73 seconds into the flight, the external fuel tank collapsed and spilled liquid oxygen and hydrogen.
http://www.youtube.com/watch?v=j4JOjcDFtBE

Rogers Commission Report

O-ring resiliency is directly related to its temperature... A warm O-ring that has been compressed will return to its original shape much quicker than will a cold O-ring when compression is relieved. ... A compressed O-ring at 75 degrees Fahrenheit is five times more responsive in returning to its uncompressed shape than a cold O-ring at 30 degrees Fahrenheit.... At the cold launch temperature experienced, the O-ring would be very slow in returning to its normal rounded shape. ... It would remain in its compressed position in the O-ring channel and not provide a space between itself and the upstream channel wall. Thus, it is probable the O-ring would not... seal the gap in time to preclude joint failure due to blow-by and erosion from hot combustion gases... Of 21 launches with ambient temperatures of 61 degrees Fahrenheit or greater, only four showed signs of O-ring thermal distress; i.e., erosion or blow-by and soot. Each of the launches below 61 degrees Fahrenheit resulted in one or more O-rings showing signs of thermal distress.

Recommendation of Thiokol Corp

A lamentable aspect of this disaster was that the problem with the O-rings was already understood by some engineers prior to the Challenger launch. In February of 1984, the Marshall Configuration Control Board sent a memo about the O-ring erosion that occurred on STS 41-B (the 10th space shuttle flight and the 4th mission for the Challenger shuttle). These messages continued to increase in intensity, as evidenced by a 1985 internal memo from Thiokol Corp., the company that designed the O-ring. Employees from Thiokol wrote the following to their Vice President of Engineering:

"This letter is written to ensure that management is fully aware of the seriousness of the current O-ring erosion problem in the SRM joints from an engineering standpoint...."

Could Logit have helped save seven precious lives?

The temperature on the launch date was 31° F.

Data Description

n = 23 Space Shuttle lift-offs prior to Challenger
Response: Presence/absence of erosion or blow-by on at least one O-ring field joint
Y = 1 if occurred, 0 if not
Predictor Variable: Temperature at lift-off
X = Temperature (degrees Fahrenheit)

Data

[Scatter plot: O-Ring Problem versus Temperature (degrees Fahrenheit)]

Flight#  Temp  O-Ring Problem     Flight#  Temp  O-Ring Problem
1        66    0                  13       67    0
2        70    1                  14       53    1
3        69    0                  15       67    0
4        68    0                  16       75    0
5        67    0                  17       70    0
6        72    0                  18       81    0
7        73    0                  19       76    0
8        70    0                  20       79    0
9        57    1                  21       75    1
10       63    1                  22       76    0
11       70    1                  23       58    1
12       78    0


Logistic Regression Model

Distribution of responses: Binomial
Link function: Logit

π(X) = P(Y = 1 | X)
g(π) = ln[π(X) / (1 - π(X))] = β0 + β1X
π(X) = exp(β0 + β1X) / [1 + exp(β0 + β1X)]

Model Estimation/Inference

Coefficients:
             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   15.0429      7.3786    2.039    0.0415 *
degrees       -0.2322      0.1082   -2.145    0.0320 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null deviance: 28.267 on 22 degrees of freedom
Residual deviance: 20.315 on 21 degrees of freedom

H0: No association between incidence of O-Ring failure and temperature (β1 = 0)
HA: Association between incidence of O-Ring failure and temperature (β1 ≠ 0)
Reject H0 (zobs = -2.145, P = 0.032). Conclude a negative association exists.
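The output above is from R's glm; the same maximum likelihood estimates can be reproduced in, for example, Python's statsmodels (a sketch using the 23 flights tabulated earlier):

```python
import numpy as np
import statsmodels.api as sm

temp = np.array([66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                 67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58], dtype=float)
problem = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                    0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1])

fit = sm.Logit(problem, sm.add_constant(temp)).fit()
print(fit.params)    # approx. (15.0429, -0.2322), matching the output above

# Predicted failure probability at the 31 F launch temperature:
print(fit.predict(np.array([[1.0, 31.0]])))  # ~0.9996: failure all but certain
```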

π̂(X) = exp(15.04 - 0.23X) / [1 + exp(15.04 - 0.23X)]

[Plot: fitted P(O-Ring Problem) versus Temperature, 50 to 90 degrees Fahrenheit, with the observed 0/1 outcomes overlaid]

Odds Ratio

odds(X) = π(X) / [1 - π(X)]
        = {exp(β0 + β1X) / [1 + exp(β0 + β1X)]} / {1 / [1 + exp(β0 + β1X)]}
        = exp(β0 + β1X)

odds(X+1) = exp(β0 + β1(X+1)) = exp(β0 + β1X) exp(β1)

Odds Ratio: OR = odds(X+1) / odds(X) = exp(β0 + β1X) exp(β1) / exp(β0 + β1X) = exp(β1)

Estimated Odds Ratio: OR̂ = exp(β̂1)

95% CI for the population Odds Ratio: ( exp[β̂1 - 1.96 SE(β̂1)], exp[β̂1 + 1.96 SE(β̂1)] )

Challenger Data:

OR̂ = exp(β̂1) = exp(-0.23) = 0.795

95% CI: ( exp[-0.23 - 1.96(0.1082)], exp[-0.23 + 1.96(0.1082)] ) = ( exp(-0.4421), exp(-0.0179) ) = (0.6427, 0.9823)

Note: The odds of failure decrease as temperature increases (interval below 1).
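The same numbers drop out of a two-line computation (a sketch assuming numpy, using the estimates printed above):

```python
import numpy as np

b1, se = -0.2322, 0.1082
print(np.exp(b1))                                # ~0.79, the estimated OR
print(np.exp([b1 - 1.96 * se, b1 + 1.96 * se]))  # ~(0.64, 0.98), below 1
```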
