
Outliers and Influential Observations

August 16, 2014


1 Motivation
Refer to graphs presented in class to distinguish between outliers (observations with large residuals) and influential observations (observations that may or may not be outliers, but which influence a subset or all of the coefficients, fits, or variances in a substantial way).
2 Single-row Diagnostics
All tests consider what happens if an observation is dropped: what happens to fit, the estimated coefficients, t-ratios, etc.
Let the model with all observations be denoted as usual as
\[ y = X\beta + \varepsilon \]
with the OLS estimator $b = (X'X)^{-1}X'y$.
Denote the $t$th diagonal of the projection matrix $P = X(X'X)^{-1}X'$ as $h_t$, and the $t$th column of $(X'X)^{-1}X'$ as $c_t = (X'X)^{-1}x_t$, whose $k$th element is $c_{kt}$. Note that $h_t$ can also be written as $x_t'(X'X)^{-1}x_t$.
Say the $t$th observation is dropped. Denote the corresponding dependent variable as $y[t]$, the $X$ matrix as $X[t]$, the residual vector as $e[t]$, etc.
The $t$th observation can be considered to be influential if its omission has a large impact on parameter estimates, fit of the model, etc. This is determined by using some rules of thumb:
1. DFBETA: As shown below,
\[ b_k - b_k[t] = \frac{c_{kt}\, e_t}{1 - h_t} \]
Proof: Without loss of generality, let the $t$th observation be placed last, i.e., write the data matrices in partitioned form as
\[ X' = [\,X'[t] \;\; x_t\,], \qquad y' = [\,y'[t] \;\; y_t\,] \]
where $X$ is $(n \times K)$, $X[t]$ is $((n-1) \times K)$, and $x_t'$ is $(1 \times K)$; $y_t$ is a scalar, and $y[t]$ is $((n-1) \times 1)$.
\[ X'X = X'[t]X[t] + x_t x_t', \quad \text{or} \quad X'[t]X[t] = X'X - x_t x_t' \]
\[ X'y = X'[t]y[t] + x_t y_t, \quad \text{or} \quad X'[t]y[t] = X'y - x_t y_t \]
Given that for any invertible matrix $A$ and vector $c$,
\[ (A - cc')^{-1} = A^{-1} + A^{-1}c\,(1 - c'A^{-1}c)^{-1}c'A^{-1}. \]
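This rank-one update identity (a form of the Sherman-Morrison formula) can be checked numerically. A minimal sketch in Python with NumPy; the matrix size, seed, and scaling are arbitrary choices made so that $A - cc'$ stays invertible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary positive definite A, and c small enough that A - cc' stays invertible
M = rng.normal(size=(4, 4))
A = M @ M.T + 10.0 * np.eye(4)
c = 0.5 * rng.normal(size=(4, 1))

A_inv = np.linalg.inv(A)
scalar = 1.0 - (c.T @ A_inv @ c).item()   # the (1 - c'A^{-1}c) term, a scalar
downdated = A_inv + (A_inv @ c @ c.T @ A_inv) / scalar

# Agrees with inverting A - cc' directly
assert np.allclose(downdated, np.linalg.inv(A - c @ c.T))
```

The formula is what makes single-row diagnostics cheap: the inverse after deletion is obtained from $(X'X)^{-1}$ without refitting.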
Substitute $(X'X)$ for $A$ and $c = x_t$:
\[ (X'[t]X[t])^{-1} = (X'X)^{-1} + (X'X)^{-1}x_t\,(1 - x_t'(X'X)^{-1}x_t)^{-1}x_t'(X'X)^{-1} \]
Substituting $h_t = x_t'(X'X)^{-1}x_t$, a scalar,
\[ = (X'X)^{-1} + \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}}{1 - h_t} \]
\[ b[t] = (X'[t]X[t])^{-1}X'[t]y[t] = \left[ (X'X)^{-1} + \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}}{1 - h_t} \right](X'y - x_t y_t) \]
\[ = (X'X)^{-1}X'y - (X'X)^{-1}x_t y_t + \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}X'y}{1 - h_t} - \frac{(X'X)^{-1}x_t x_t'(X'X)^{-1}x_t y_t}{1 - h_t} \]
\[ = b - (X'X)^{-1}x_t y_t + \frac{(X'X)^{-1}x_t x_t' b}{1 - h_t} - \frac{(X'X)^{-1}x_t h_t y_t}{1 - h_t} \]
\[ b - b[t] = \frac{(X'X)^{-1}x_t y_t (1 - h_t) - (X'X)^{-1}x_t x_t' b + (X'X)^{-1}x_t h_t y_t}{1 - h_t} \]
Recognizing that $h_t$ and $y_t$ are scalars, and that $x_t' b = \hat{y}_t$ so that $y_t - x_t' b = e_t$, after cancellation we get
\[ b - b[t] = \frac{(X'X)^{-1}x_t (y_t - x_t' b)}{1 - h_t} = \frac{c_t e_t}{1 - h_t} \]
Focusing only on the $k$th coefficient, we get the expression above:
\[ b_k - b_k[t] = \frac{c_{kt}\, e_t}{1 - h_t} \]
Some standardization is necessary to determine cut-offs:
\[ \text{DFBETA}_k = \frac{b_k - b_k[t]}{s[t]\sqrt{\textstyle\sum_{\tau} c_{k\tau}^2}} \qquad \text{Cutoff: } \frac{2}{\sqrt{n}} \]
(The term under the root, $\sum_{\tau} c_{k\tau}^2$, is the $k$th diagonal element of $(X'X)^{-1}$.)
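The closed-form DFBETA expression can be checked against a brute-force refit that actually drops the observation. A minimal sketch in Python with NumPy; the simulated data, seed, and dropped index are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # h_t = x_t'(X'X)^{-1}x_t
C = XtX_inv @ X.T                             # column t is c_t

t = 7
# Closed-form change in all coefficients from dropping observation t
delta_closed = C[:, t] * e[t] / (1.0 - h[t])

# Brute force: refit without row t
mask = np.arange(n) != t
b_t = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
assert np.allclose(b - b_t, delta_closed)
```

The same vectors `C` and `h` serve every observation, so all $n$ single-row deletions cost one pass rather than $n$ refits.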
2. DFFITS: It can be shown that
\[ \hat{y}_t - \hat{y}_t[t] = x_t'\,[b - b[t]] = \frac{h_t e_t}{1 - h_t} \]
With standardization:
\[ \text{DFFIT}_t = \frac{\hat{y}_t - \hat{y}_t[t]}{s[t]\sqrt{h_t}} \qquad \text{Cutoff: } 2\sqrt{\frac{K}{n}} \]
This was the impact of deleting the $t$th observation on the $t$th predicted value. One can analogously consider $\hat{y}_j - \hat{y}_j[t]$.
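The DFFITS algebra can also be confirmed against a refit. A minimal sketch in Python with NumPy; the data, seed, and deleted index are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = 0.5 + 1.5 * X[:, 1] + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_t

t = 3
mask = np.arange(n) != t
b_del = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]

# Closed form for the change in the t-th fitted value
fit_change = h[t] * e[t] / (1.0 - h[t])
assert np.isclose(X[t] @ b - X[t] @ b_del, fit_change)

# Standardized version (DFFITS), with s[t] taken from the refit
resid_del = y[mask] - X[mask] @ b_del
s_del = np.sqrt(resid_del @ resid_del / (n - 1 - K))
dffits = fit_change / (s_del * np.sqrt(h[t]))
```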
3. RSTUDENT:
\[ \text{RSTUDENT} = \frac{e_t}{s[t]\sqrt{1 - h_t}} \qquad \text{Cutoff: } 2 \]
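In practice $s[t]$ need not come from a refit: the known identity $(n - K - 1)s^2[t] = (n - K)s^2 - e_t^2/(1 - h_t)$ gives it from full-sample quantities. A sketch in Python with NumPy, on arbitrary simulated data, checking the shortcut against the refit:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - K)

t = 0
h_t = X[t] @ XtX_inv @ X[t]

# s[t]^2 by brute-force refit without observation t
mask = np.arange(n) != t
b_del = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
resid_del = y[mask] - X[mask] @ b_del
s2_del = resid_del @ resid_del / (n - 1 - K)

# Shortcut: (n - K - 1) s[t]^2 = (n - K) s^2 - e_t^2 / (1 - h_t)
s2_del_alt = ((n - K) * s2 - e[t] ** 2 / (1.0 - h_t)) / (n - K - 1)
assert np.isclose(s2_del, s2_del_alt)

rstudent = e[t] / (np.sqrt(s2_del) * np.sqrt(1.0 - h_t))
```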
4. COVRATIO:
\[ \text{COVRATIO} = \frac{\left| s^2[t]\,(X[t]'X[t])^{-1} \right|}{\left| s^2\,(X'X)^{-1} \right|} \]
Cutoff: $< 1 - \frac{3K}{n}$ bad; $> 1 + \frac{3K}{n}$ good.
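COVRATIO can be computed from the determinant ratio directly, and checked against the equivalent closed form that follows from $|X[t]'X[t]| = |X'X|\,(1 - h_t)$. A sketch in Python with NumPy, on arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
n, K = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (n - K)

t = 10
h_t = X[t] @ XtX_inv @ X[t]
mask = np.arange(n) != t
Xd, yd = X[mask], y[mask]
b_del = np.linalg.lstsq(Xd, yd, rcond=None)[0]
e_del = yd - Xd @ b_del
s2_del = e_del @ e_del / (n - 1 - K)

# Ratio of determinants of the two estimated covariance matrices
covratio = (np.linalg.det(s2_del * np.linalg.inv(Xd.T @ Xd))
            / np.linalg.det(s2 * XtX_inv))

# Equivalent closed form via the rank-one determinant identity
assert np.isclose(covratio, (s2_del / s2) ** K / (1.0 - h_t))

flagged = abs(covratio - 1.0) > 3.0 * K / n   # the 3K/n rule of thumb
```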
3 Multiple-row Diagnostics
If there is a cluster of more than one outlier, it is clear that single-row diagnostics will not be able to identify influential observations because of the masking effect, demonstrated in class.
Multiple-row diagnostics can. Let $m$ denote the subset of $m$ deleted observations.
The measures defined above can be analogously determined:
\[ \text{DFBETA}_k = \frac{b_k - b_k[m]}{\sqrt{Var(b_k)}} \]
\[ \text{MDFFIT} = (b - b[m])'\,(X[m]'X[m])\,(b - b[m]) \]
\[ \text{VARRATIO} = \frac{\left| s^2[m]\,(X[m]'X[m])^{-1} \right|}{\left| s^2\,(X'X)^{-1} \right|} \]
This is, however, not practical, although there are packages that can consider every combination of 2, 3, 4, … data points, and also methods to help identify $m$.
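A brute-force illustration of masking: two planted outliers at nearby high-leverage points hide each other from single-row deletion, but searching over pairs with MDFFIT finds them jointly. A sketch in Python with NumPy; the planted positions, shifts, and seed are arbitrary choices:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
n = 20
x = rng.normal(size=n)
x[0], x[1] = 2.5, 2.6              # give the planted pair high leverage
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(size=n)
y[0] += 8.0                        # two outliers at nearby points:
y[1] += 8.0                        # each masks the other under single deletion

b = np.linalg.lstsq(X, y, rcond=None)[0]

# Brute-force MDFFIT over every pair of deleted rows
best_pair, best_val = None, -np.inf
for pair in combinations(range(n), 2):
    mask = np.ones(n, dtype=bool)
    mask[list(pair)] = False
    b_m = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    d = b - b_m
    val = d @ (X[mask].T @ X[mask]) @ d
    if val > best_val:
        best_pair, best_val = pair, val

assert best_pair == (0, 1)         # only joint deletion exposes the masked pair
```

The combinatorial search is what makes this impractical at scale, which is exactly the point made above.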
3.1 Partial Regression Plots
In a simple regression model (with one independent variable), influential observations, be they single or multiple, are easy to detect visually. But what about a multiple regression model? One easy and practical solution is to collapse a multiple regression model to a series of simple regressions using the FWL Theorem.
For example, say there are four explanatory variables:
\[ y = \beta_1 + X_2\beta_2 + \dots + X_4\beta_4 + \varepsilon \]
To know if there are observations influencing the estimated $b_2$:
1. Regress $y$ on $X_3$ and $X_4$ and obtain the residual $u$.
2. Regress $X_2$ on $X_3$ and $X_4$ and obtain the residual $w$.
By the FWL Theorem, we know that the regression of $u$ on $w$ yields the OLS slope coefficient for $X_2$. So, a plot of $u$ on $w$ enables us to collapse a multi-dimensional problem into a two-dimensional one.
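The two-step FWL construction above can be verified numerically: the simple-regression slope of $u$ on $w$ reproduces the multiple-regression coefficient on $X_2$. A sketch in Python with NumPy, on arbitrary simulated data (an intercept is included in the partialling-out regressions):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
X2, X3, X4 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * X2 - 1.0 * X3 + 0.5 * X4 + rng.normal(size=n)

# Full multiple regression: y on intercept, X2, X3, X4
Z = np.column_stack([np.ones(n), X2, X3, X4])
b_full = np.linalg.lstsq(Z, y, rcond=None)[0]

# FWL: partial intercept, X3, X4 out of both y and X2
W = np.column_stack([np.ones(n), X3, X4])
u = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]     # residual of y
w = X2 - W @ np.linalg.lstsq(W, X2, rcond=None)[0]   # residual of X2

slope = (w @ u) / (w @ w)   # simple regression of u on w (no intercept needed)
assert np.isclose(slope, b_full[1])
```

A scatter of `u` against `w` is the partial regression plot; influential points stand out in it just as they would in a simple regression.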
Visual inspection of such partial regression plots, along the lines presented earlier, for each of the key parameters of interest can identify influential observations, singly or as a cluster.
4 What to do
The point is that an influential observation or set of observations is not necessarily to be jettisoned. A cluster of influential observations could well be an indication of structural change, for example.
5 References
There is no comparable treatment in Greene, or in Wooldridge. I have drawn my class notes from the following classic references:
David Belsley, Edwin Kuh and Roy Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, Wiley, 2004.
Samprit Chatterjee and Ali S. Hadi, Sensitivity Analysis in Linear Regression, Wiley, 1988.
These are NOT, however, REQUIRED.
