Model Checking
Prof. Sharyn O'Halloran
Sustainable Development U9611
Econometrics II
Regression Diagnostics
Heteroskedasticity: non-constant variance
Multicollinearity: non-independence of x variables
Outliers
[Figure: plot marking the largest positive outliers and the largest negative outliers]
Leverage
[Figure: the marked cases have relatively large leverage]
Influence
[Figure: the regression line without the influential data point compared with the regression line with the influential data point]
Introduction
Tools
Scatterplots
Residual plots
Tentative fits of models with one or more cases set aside
A strategy for dealing with influential observations (11.3)
Tools to help detect outliers and influential cases (11.4)
Cook's distance
Leverage
Studentized residual
Difficulties to overcome
General strategy
Influence
By influential observation(s) we mean
Is there reason to believe the case belongs to a population other than the one under investigation? (branches: Yes / No)
More data are needed to resolve the question
(Section 11.1.1)
[Figure: scatterplot of METABOL and GASTRIC with fitted values, marking possible influential outliers]
Interaction term
gen femgas=female*gastric
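As a rough illustration of what the generated variable does, the sketch below builds the same product term in Python and fits a model with gastric, female, and femgas. The numbers are made up for illustration and are not the lecture's data; the helpers `solve` and `ols` are hypothetical names written only for this example. In a model with the indicator and the interaction, the female line coincides with a separate fit to the female cases alone.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[a] * r[c] for r in X) for c in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


# Made-up illustrative data (NOT the lecture's data set).
gastric = [1.0, 1.5, 2.0, 2.5, 3.0, 1.0, 1.6, 2.2, 2.8, 3.4]
female = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
metabol = [1.1, 1.2, 1.6, 1.7, 2.1, 2.9, 4.3, 5.2, 6.8, 7.7]

# Same construction as: gen femgas = female*gastric
femgas = [f * g for f, g in zip(female, gastric)]

# Model: metabol = b0 + b_gas*gastric + b_fem*female + b_femgas*femgas
X = [[1.0, g, f, fg] for g, f, fg in zip(gastric, female, femgas)]
b0, b_gas, b_fem, b_femgas = ols(X, metabol)

male_slope = b_gas                 # slope when female = 0
female_slope = b_gas + b_femgas    # slope when female = 1
female_intercept = b0 + b_fem

# Separate fit using the female cases only, for comparison.
Xf = [[1.0, g] for g, f in zip(gastric, female) if f == 1]
yf = [m for m, f in zip(metabol, female) if f == 1]
f0, f1 = ols(Xf, yf)

print("male slope:", round(male_slope, 4))
print("female slope:", round(female_slope, 4), "separate fit:", round(f1, 4))
```

Because the female line depends only on the female cases in this model, setting aside male cases leaves it unchanged.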
Scatterplots
Female line does not change
Introduction
These help identify influential observations and help to
clarify the course of action.
Use them when:
outlierness
Note: i = 1,2,..., n
Cook's Distance:
Measure of overall influence
predict D, cooksd
graph twoway spike D subject
$D_i = \sum_{j=1}^{n} \frac{(\hat{y}_{j(i)} - \hat{y}_j)^2}{p\,\hat{\sigma}^2}$
Note: observations 31 and 32 have large Cook's distances.
The impact that omitting a case has on the estimated regression coefficients.
One formula:
$D_i = \sum_{j=1}^{n} \frac{(\hat{y}_{j(i)} - \hat{y}_j)^2}{p\,\hat{\sigma}^2}$
$D_i = \frac{1}{p}\,(\mathrm{studres}_i)^2\,\frac{h_i}{1 - h_i}$
The term $(\mathrm{studres}_i)^2$ is big if case i is unusual in the y-direction.
$h_i = \frac{1}{n-1}\left(\frac{x_i - \bar{x}}{s_x}\right)^2 + \frac{1}{n}$
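A minimal sketch of the leverage formula for a single explanatory variable, using made-up x values in which one case sits far from the rest. The function name `leverages` is hypothetical. With one x and an intercept, the leverages always add up to 2, the number of regression coefficients.

```python
from statistics import mean, stdev


def leverages(x):
    """Leverage h_i = (1/(n-1)) * ((x_i - xbar)/s_x)^2 + 1/n for simple regression."""
    n = len(x)
    xbar = mean(x)
    sx = stdev(x)  # sample standard deviation (divisor n - 1)
    return [((xi - xbar) / sx) ** 2 / (n - 1) + 1.0 / n for xi in x]


# Made-up illustrative values; the last case is far from the others in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
h = leverages(x)

for xi, hi in zip(x, h):
    print(f"x = {xi:5.1f}   leverage = {hi:.4f}")
print("sum of leverages:", round(sum(h), 6))
```

The case farthest from the mean of x gets the largest leverage, regardless of its y value.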
[Figure: scatterplot with one case marked as having relatively large leverage; axis label X2]
Formula: $\mathrm{studres}_i = \frac{\mathrm{res}_i}{SE(\mathrm{res}_i)}$
Fact: $SE(\mathrm{res}_i) = \hat{\sigma}\sqrt{1 - h_i}$
i.e.
For
STATA commands:
Cook's distance
Studentized residuals
Leverage
1. predict D, cooksd
2. graph twoway scatter D subject, msymbol(i) mlabel(subject) ///
   ytitle("Cook's D") xlabel(0(5)35) ylabel(0(0.5)1.5) ///
   title("Cook's Distance by subject")
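For readers who want to see the three case statistics computed by hand, here is a small Python sketch for a simple regression on made-up data (not the lecture's). It computes leverage, the studentized residual res_i / (sigma_hat * sqrt(1 - h_i)), and Cook's distance in two ways: from the studentized residual and leverage, and by actually refitting with each case set aside. The helper name `fit_line` is hypothetical, and the leverage is written with the sum of squared deviations of x, which equals (n - 1) * s_x^2.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


# Made-up illustrative data; the last case is unusual in both x and y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 12.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 6.0]
n, p = len(x), 2  # p = number of regression coefficients

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
res = [yi - fi for yi, fi in zip(y, fitted)]
sigma2 = sum(r * r for r in res) / (n - p)  # estimate of sigma^2

xbar = mean(x)
sxx = sum((xi - xbar) ** 2 for xi in x)
h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]  # leverage

# Studentized residual: res_i / SE(res_i), with SE(res_i) = sigma_hat*sqrt(1 - h_i)
studres = [r / (sigma2 * (1.0 - hi)) ** 0.5 for r, hi in zip(res, h)]

# Cook's distance from studentized residual and leverage
cooks_closed = [s * s / p * hi / (1.0 - hi) for s, hi in zip(studres, h)]

# Cook's distance by deletion: refit without case i, compare all fitted values
cooks_deletion = []
for i in range(n):
    a0, a1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    shift = sum((a0 + a1 * xj - fj) ** 2 for xj, fj in zip(x, fitted))
    cooks_deletion.append(shift / (p * sigma2))

for i in range(n):
    print(i + 1, round(h[i], 3), round(studres[i], 3), round(cooks_closed[i], 3))
```

The two versions of Cook's distance agree case by case, which is why no refitting is needed in practice.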
Question:
Is
$= \beta_0 + \beta_1\,\mathrm{takers} + f(\mathrm{expend})$
After controlling for the number of students taking
the test, does expenditures impact performance?
Step 1: Scatterplots
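$y - (\beta_0 + \beta_1 x_1)$ versus $x_2$ in
$\mu(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
$\mathrm{pres} = y - (\hat{\beta}_0 + \hat{\beta}_1 x_1)$
$\mathrm{pres} = \mathrm{res} + \hat{\beta}_2 x_2$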
Graph: cprplot x1
$e_i + b_1 x_{1i}$
Graphs each observation's residual plus its component predicted from x1 against values of x1i
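The partial-residual quantities can be checked numerically. The sketch below uses made-up data and hypothetical helper names (`solve`, `ols`, `slope`); it is not a reimplementation of cprplot, only the arithmetic behind it. It confirms that the two expressions for pres agree, and that the slope of the partial residuals against the corresponding x equals that x's coefficient in the full fit.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    XtX = [[sum(r[a] * r[c] for r in X) for c in range(k)] for a in range(k)]
    Xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    return solve(XtX, Xty)


def slope(x, y):
    """Slope of the simple least-squares line of y on x."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    return sxy / sum((a - xbar) ** 2 for a in x)


# Made-up illustrative data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [5.0, 1.0, 4.0, 8.0, 2.0, 7.0, 3.0, 6.0]
y = [11.7, 3.9, 10.8, 18.8, 7.6, 17.7, 10.7, 16.9]

b0, b1, b2 = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)
res = [yi - (b0 + b1 * a + b2 * b) for yi, a, b in zip(y, x1, x2)]

# Partial residuals for x2, written both ways
pres_direct = [yi - (b0 + b1 * a) for yi, a in zip(y, x1)]
pres_from_res = [r + b2 * b for r, b in zip(res, x2)]

# Component-plus-residual for x1: e_i + b1 * x1i
cpr_x1 = [r + b1 * a for r, a in zip(res, x1)]

print("b2 =", round(b2, 4), " slope of pres on x2 =", round(slope(x2, pres_direct), 4))
print("b1 =", round(b1, 4), " slope of cpr on x1 =", round(slope(x1, cpr_x1), 4))
```

Plotting `pres_direct` against `x2` (or `cpr_x1` against `x1`) gives the picture in which curvature or outliers in that one relationship become visible.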
Heteroskedasticity (Non-constant Variance)
Heteroskedastic: systematic variation in the size of the residuals.
Here, for instance, the variance for smaller fitted values is greater than for larger ones.
Fails hettest
Fails whitetst
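To give a feel for what such a test looks at, the sketch below regresses the squared residuals on the fitted values and forms n times the R-squared of that auxiliary regression, which is compared with a chi-square distribution with 1 degree of freedom. This is the n*R-squared (Koenker) variant of a Breusch-Pagan style check, chosen here for simplicity; it is an assumption of this example and may not match the exact statistic that hettest or whitetst report. The data are generated deterministically so that the spread is larger at small fitted values, and `fit_line` is a hypothetical helper name.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1


# Made-up data: alternating errors whose size shrinks as x grows.
x = [float(i) for i in range(1, 21)]
noise = [(-1) ** i * 0.2 * (21 - i) for i in range(1, 21)]
y = [2.0 + xi + e for xi, e in zip(x, noise)]
n = len(x)

# Step 1: ordinary fit, fitted values, squared residuals
b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
res2 = [(yi - fi) ** 2 for yi, fi in zip(y, fitted)]

# Step 2: auxiliary regression of squared residuals on fitted values
a0, a1 = fit_line(fitted, res2)
mean_res2 = sum(res2) / n
sst = sum((r - mean_res2) ** 2 for r in res2)
sse = sum((r - (a0 + a1 * f)) ** 2 for r, f in zip(res2, fitted))
r_squared = 1.0 - sse / sst

# Step 3: test statistic, compared with the 5% chi-square(1) cutoff of about 3.84
lm = n * r_squared
constant_variance_rejected = lm > 3.84

print("auxiliary slope:", round(a1, 4))
print("n * R^2:", round(lm, 2), " reject constant variance:", constant_variance_rejected)
```

A negative auxiliary slope corresponds to the pattern described above: residuals are larger in size where the fitted values are smaller.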
Another Example
These error terms are really bad!
Previous analysis suggested logging enrollment to correct skewness.
Much better.
Errors look more-or-less normal now.
Adding enrollment keeps errors normal.
Don't need to take the log of enrollment this time.
Suppose:
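$\mu(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
$\mathrm{var}(y \mid x_1, x_2) = \sigma^2 / w_i$
and the $w_i$'s are known.
Weighted least squares chooses the coefficients to minimize
$\sum_{i=1}^{n} w_i \,(y_i - \beta_0 - \beta_1 x_{1i} - \beta_2 x_{2i})^2$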
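A minimal sketch of this weighted fit, assuming known weights and using made-up numbers. The helpers `solve` and `wls` are hypothetical names for this example; `wls` solves the weighted normal equations (X'WX) b = X'Wy, which is one standard way to minimize the weighted sum of squares. At the solution, the weighted residuals are orthogonal to every column of X.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def wls(X, y, w):
    """Coefficients minimizing sum_i w_i * (y_i - x_i'b)^2."""
    k = len(X[0])
    XtWX = [[sum(wi * r[a] * r[c] for wi, r in zip(w, X)) for c in range(k)]
            for a in range(k)]
    XtWy = [sum(wi * r[a] * yi for wi, r, yi in zip(w, X, y)) for a in range(k)]
    return solve(XtWX, XtWy)


# Made-up illustrative data; w holds the known weights.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [5.0, 1.0, 4.0, 8.0, 2.0, 7.0, 3.0, 6.0]
y = [11.7, 3.9, 10.8, 18.8, 7.6, 17.7, 10.7, 16.9]
w = [1.0, 4.0, 2.0, 0.5, 3.0, 1.0, 2.0, 0.25]

X = [[1.0, a, b] for a, b in zip(x1, x2)]
beta_w = wls(X, y, w)                      # weighted fit
beta_eq = wls(X, y, [2.0] * len(y))        # equal weights
beta_ols = wls(X, y, [1.0] * len(y))       # ordinary least squares

# Weighted residuals are orthogonal to each column of X at the minimum.
resid = [yi - sum(b * v for b, v in zip(beta_w, r)) for yi, r in zip(y, X)]
gradient = [sum(wi * ri * r[j] for wi, ri, r in zip(w, resid, X)) for j in range(3)]

print("weighted coefficients:", [round(b, 4) for b in beta_w])
print("ordinary coefficients:", [round(b, 4) for b in beta_ols])
```

Multiplying all weights by the same constant leaves the estimates unchanged, so only the relative sizes of the weights matter.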
Multicollinearity
This means that two or more regressors are
highly correlated with each other.
Doesn't bias the estimates of the dependent variable
So
Multicollinearity
We could:
1.
2.
3.
Multicollinearity
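[Venn diagram: overlapping variation in Oil, Temp, and Insulation]
Only this information is used in the estimation of $\beta_2$.
This information is NOT used in the estimation of $\beta_1$ nor $\beta_2$.
These variables interfere with each other.
[Venn diagram: Oil, Temp, and Insulation with a large overlap between Temp and Insulation]
Large overlap in variation of Temp and Insulation is used in explaining the variation in Oil but NOT in estimating $\beta_1$ and $\beta_2$.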
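The overlap picture can be put in numbers. In the sketch below, with made-up Temp and Insulation values that are highly correlated, the variation in Temp is split into the part shared with Insulation and the part that is left over after regressing Temp on Insulation. With two regressors, the leftover share equals 1 - r^2, where r is their correlation. The function names are hypothetical and the data are not from the lecture.

```python
def centered_ss(v):
    """Sum of squared deviations from the mean."""
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v)


def centered_sp(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Made-up illustrative data in which the two regressors move together.
temp = [40.0, 27.0, 40.0, 73.0, 64.0, 34.0, 9.0, 8.0, 23.0, 63.0]
insulation = [3.8, 2.9, 4.1, 7.0, 6.6, 3.2, 1.1, 0.9, 2.5, 6.1]

total = centered_ss(temp)  # all variation in Temp

# Regress Temp on Insulation; what is left over is Temp's own variation.
slope = centered_sp(temp, insulation) / centered_ss(insulation)
intercept = sum(temp) / len(temp) - slope * sum(insulation) / len(insulation)
unique = sum((t - (intercept + slope * i)) ** 2 for t, i in zip(temp, insulation))

r = centered_sp(temp, insulation) / (total * centered_ss(insulation)) ** 0.5
unique_share = unique / total

print("correlation between Temp and Insulation:", round(r, 4))
print("share of Temp variation not shared with Insulation:", round(unique_share, 4))
print("1 - r^2:", round(1.0 - r * r, 4))
```

When the leftover share is small, the Temp coefficient is estimated from very little information, which shows up as a large standard error.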
These results are not too bad.
Much worse.
Problem: education variables are highly correlated.
Solution: delete collinear factors.
Measurement errors in x's