
Multicollinearity - violation of the assumption that no independent variable is a perfect linear function of one or more other independent variables.

β1 is the impact X1 has on Y holding all other factors constant. If X1 is related to X2, then β1 will also capture the impact of changes in X2.
In other words, interpretation of the parameters becomes
difficult.
With perfect multicollinearity one cannot technically estimate the
parameters.
Dominant variable - a variable that is so highly correlated with the
dependent variable that it completely masks the effects of all
other independent variables in the equation.
Example: Wins = f(PTS, DPTS)
Any other explanatory variable will tend to be insignificant.
Imperfect Multicollinearity - a statistical relationship exists between
two or more independent variables that significantly affects the
estimation of the model.
Estimates will remain unbiased:
Estimates will still be centered around the true values.
The variances of the estimates will increase.
In essence we are asking the model to tell us something we know
very little about (i.e. what is the impact of changing X1 on Y
holding everything else constant).
Because we are not holding all else constant, the error associated
with our estimate increases.
Hence it is possible to observe coefficients of the opposite sign from what is expected due to multicollinearity.
The computed t-stats will fall.
Variances and standard errors are increased. WHY?
Estimates will become very sensitive to changes in
specification
Overall fit of the equation will be generally unaffected.
If the multicollinearity occurs in the population as well
as the sample, then the predictive power of the model
is unaffected.
Note: It is possible that multicollinearity is a result of
the sample, so the above may not always be true.
The severity of the multicollinearity worsens its
consequences.
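These consequences are easy to see in a small simulation. The sketch below (Python with numpy and statsmodels; the data and true coefficient values are made up for illustration) fits the same two-regressor model once with uncorrelated X1 and X2 and once with highly correlated ones: the estimates stay centered on the true values, but their standard errors inflate.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
b1, b2 = 2.0, -1.0  # true coefficients, chosen only for illustration

def fit_once(corr):
    # Draw X1 and X2 with the requested correlation, generate Y, and fit OLS.
    x1 = rng.normal(size=n)
    x2 = corr * x1 + np.sqrt(1 - corr ** 2) * rng.normal(size=n)
    y = b1 * x1 + b2 * x2 + rng.normal(size=n)
    res = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    return res.params[1:], res.bse[1:]

for corr in (0.0, 0.95):
    coefs, ses = fit_once(corr)
    print(f"corr = {corr}: estimates = {coefs.round(2)}, std errors = {ses.round(2)}")
```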
High R² with all low t-scores
If this is the case, you have multicollinearity.
If this is not the case, you may or may not have multicollinearity.
If all the t-scores are significant and in the expected direction, then we can conclude that multicollinearity is not likely to be a problem.
High Simple Correlation Coefficients
A high r between two variables (.80 is the rule of thumb) indicates the
potential for multicollinearity. In a model with more than two
independent variables, though, this test will not tell us if a relationship
exists between a collection of independent variables.
1. Run an OLS regression that has Xi as a function of all
the other explanatory variables in the equation.
2. Calculate the VIF
VIF(i) = 1 / (1 - Ri²), where Ri² is the R² from the regression in step 1.
3. Analyze the degree of multicollinearity by evaluating
the size of VIF.
4. There is no table of critical VIF values. The rule of
thumb is if VIF > 5 then multicollinearity is an issue.
Other authors suggest a rule of thumb of VIF > 10.
5. What is the R² necessary to reach the rules of thumb?
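A rough sketch of this procedure in Python (statsmodels; the data are hypothetical). On the arithmetic behind step 5: an auxiliary R² of .80 gives VIF = 1/(1 - .80) = 5, and an R² of .90 gives VIF = 10.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data matrix: three regressors, two of them strongly related.
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=n)   # nearly a linear function of x1
x3 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x1, x2, x3]))  # constant in column 0

# Steps 1-2: regress each Xi on the other regressors and form 1/(1 - Ri^2).
# variance_inflation_factor does exactly this for the given column index.
for i, name in enumerate(["x1", "x2", "x3"], start=1):
    print(name, round(variance_inflation_factor(X, i), 1))

# Rules of thumb: Ri^2 = .80 -> VIF = 5; Ri^2 = .90 -> VIF = 10.
```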
Do nothing
If you are only interested in prediction, multicollinearity is not an
issue.
t-stats may be deflated but still significant, in which case multicollinearity is not a serious problem.
The cure is often worse than the disease.
Drop one or more of the multicollinear variables.
This solution can introduce specification bias. WHY?
A researcher may have introduced the multicollinearity by adding variables in an effort to avoid specification bias; in that case it would be appropriate to drop a variable.
Transform the multicollinear variables.
Form a linear combination of the multicollinear variables.
Transform the equation into first differences or logs.
Increase the sample size.
The issue of micronumerosity.
Micronumerosity is the problem of (n) not exceeding (k). The
symptoms are similar to the issue of multicollinearity (lack of
variation in the independent variables).
A solution to each problem is to increase the sample size.
This will solve the problem of micronumerosity but not
necessarily the problem of multicollinearity.
Heteroskedasticity
Heteroskedasticity - violation of the
classic assumption that the observations of
the error term are drawn from distributions
that have a constant variance.
Why does this occur?
1. Substantial differences in the dependent variables across units
of observation in the cross-sectional data set.
2. People learn over time.
3. Improvements in data collection over time
4. The presence of outliers (an observation that is significantly
different than all other observations in the sample).
Pure Heteroskedasticity
Classic assumption: Homoskedasticity, or var(e_i) = σ² = a constant
If this assumption is violated then var(e_i) = σ_i²
What is the difference? The variance is not a constant but varies
across the sample.
Heteroskedasticity can take a variety of precise forms. Our discussion will focus primarily on one form.
Proportionality factor (Z) - variance of the error term changes
proportionally to some factor (Z).
Therefore var(e_i) = σ²Z_i
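A minimal simulation of this form of heteroskedasticity (Python; the model and the choice of Z are hypothetical): the errors are generated with var(e_i) = σ²Z_i, and the residual spread rises with Z.

```python
import numpy as np
import statsmodels.api as sm

# Generate errors whose variance is proportional to Z, i.e. var(e_i) = sigma^2 * Z_i,
# fit OLS, and compare the residual spread for low and high values of Z.
rng = np.random.default_rng(2)
n = 300
sigma2 = 1.0
Z = rng.uniform(1, 20, size=n)               # proportionality factor
x = rng.normal(size=n)
e = rng.normal(scale=np.sqrt(sigma2 * Z))    # heteroskedastic errors
y = 1.0 + 2.0 * x + e                        # assumed true model

res = sm.OLS(y, sm.add_constant(x)).fit()
low, high = Z < np.median(Z), Z >= np.median(Z)
print("var(resid | low Z) =", round(res.resid[low].var(), 2))
print("var(resid | high Z) =", round(res.resid[high].var(), 2))
```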
Consequences
1. Pure heteroskedasticity does not cause bias in the coefficient estimates.
The variance of the coefficient estimates is now a function of the proportionality factor. Hence the variance of the estimates of β increases. These estimates are still unbiased, since over-estimation and under-estimation are still as likely.
2. Heteroskedasticity causes OLS to underestimate the variances (and standard errors) of the coefficients.
This is true as long as increases in the independent variable are related to increases in the variance of the error term. This positive relationship will cause the standard error of the coefficient to be biased negatively.
In most economic applications, this is the nature of the
heteroskedasticity problem.
Hence the t-stats and F-stats cannot be relied upon for statistical inference.

Testing for Heteroskedasticity:
First Questions
1. Are there any specification errors?
2. Is research in this area prone to the problem of
heteroskedasticity? Cross-sectional studies tend to have this
problem.
3. Does a graph of the residuals show evidence of
heteroskedasticity?
Park Test
Estimate the model and save the residuals.
Log the squared residuals and regress these upon the log of the
proportionality factor (Z).
Use the t-test to test the significance of Z.
PROBLEM: Picking Z
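A sketch of the Park test in Python (statsmodels; it assumes y, an exog matrix X that includes a constant, and a candidate proportionality factor Z have already been chosen — all names here are hypothetical).

```python
import numpy as np
import statsmodels.api as sm

def park_test(y, X, Z):
    resid = sm.OLS(y, X).fit().resid                  # step 1: save the residuals
    aux_y = np.log(resid ** 2)                        # step 2: log the squared residuals
    aux = sm.OLS(aux_y, sm.add_constant(np.log(Z))).fit()
    # Step 3: t-test on the slope of log(Z); a significant slope suggests
    # heteroskedasticity related to Z.
    return aux.tvalues[1], aux.pvalues[1]
```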
White Test
Estimate the model and save the residuals.
Square the residuals (don't log) and regress these on each X, the square of each X, and the product of each X with every other X.
Use the chi-square test to test the overall significance of the equation. The test stat is NR², where N is the sample size and R² is the unadjusted R². Degrees of freedom equals the number of slope coefficients in the equation.
If the test stat is larger than the critical value, reject the null
hypothesis and conclude that you probably have
heteroskedasticity.
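statsmodels provides a version of this test; a hedged sketch, assuming y and an exog matrix X (including the constant) are already defined.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_white

def white_test(y, X):
    res = sm.OLS(y, X).fit()
    # het_white runs the auxiliary regression of the squared residuals on the
    # X's, their squares, and their cross products, and returns the N*R^2
    # (chi-square) statistic along with its p-value.
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(res.resid, X)
    return lm_stat, lm_pvalue
```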
I GOT MORE TESTS!!!
Remedies for
Heteroskedasticity
Use weighted least squares.
Dividing the dependent and independent variables by Z will
remove the problem of heteroskedasticity.
This is a form of generalized least squares.
Problem: How do we identify Z?
How do we identify the form of Z?
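A sketch of the weighted least squares remedy in statsmodels, assuming a proportionality factor Z has been identified (all names are hypothetical).

```python
import statsmodels.api as sm

# WLS sketch, assuming y, X (with constant), and Z are already defined.
# statsmodels takes weights proportional to the inverse of the error variance,
# so with var(e_i) = sigma^2 * Z_i the weights are 1/Z_i. (If instead the
# variance were proportional to Z_i^2, the weights would be 1/Z_i^2, which is
# the same as dividing the whole equation through by Z_i.)
def wls_fit(y, X, Z):
    return sm.WLS(y, X, weights=1.0 / Z).fit()
```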
White's heteroskedasticity-consistent variances and standard errors
The math is beyond the scope of the class, but statistical packages (like E-Views) do allow you to estimate the standard errors so that asymptotically valid statistical inferences can be made about the true parameter values.
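In Python/statsmodels the analogous heteroskedasticity-consistent standard errors are requested through the cov_type option; a minimal sketch, assuming y and X are already defined.

```python
import statsmodels.api as sm

def robust_fit(y, X):
    # cov_type="HC1" requests heteroskedasticity-consistent standard errors
    # (a White-type correction); the coefficient estimates are the same as
    # plain OLS, only the standard errors and t-stats are recomputed.
    return sm.OLS(y, X).fit(cov_type="HC1")
```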
More Remedies
Re-estimate the model in double-log form instead of linear form.
Such a model may be theoretically appealing since one
is estimating the elasticities.
Logging the variables compresses the scales in which
the variables are measured hence reducing the
problem of heteroskedasticity.
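A sketch of the double-log remedy (Python; assumes strictly positive, hypothetical y and x arrays).

```python
import numpy as np
import statsmodels.api as sm

def log_log_fit(y, x):
    # Double-log specification: the slope is interpreted as an elasticity,
    # and logging compresses the scale of the variables.
    return sm.OLS(np.log(y), sm.add_constant(np.log(x))).fit()
```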
In the end, you may be unable to resolve the problem of
heteroskedasticity. In this case, report White's heteroskedasticity-consistent variances and standard errors and note the substantial efforts you made to reach this conclusion.

Serial Correlation
Which classic assumption are we talking about?
The correlation between any two observations of the
error term is zero.
Issue: Is e_t related to e_t-1? Such would be the case in a time series when a random shock has an impact over a number of time periods.
Why is this important? Y = a + bX + e_t
With serial correlation, e_t = f(e_t-1),
hence Y = f(X_t, e_t, e_t-1).
Therefore, serial correlation also impacts the
accuracy of our estimates of the parameters.
Pure Serial Correlation
Pure serial correlation - a violation of the
classical assumption that assumes uncorrelated
observations of the error term.
In other words, we assume cov(e_i, e_j) = 0 for i ≠ j.
If the covariance of two error terms is not equal
to zero, then serial correlation exists.
First Order Serial Correlation
The most common form of serial correlation is
first order serial correlation.
e_t = ρ e_t-1 + u_t
where e = error term of the equation in question
ρ = (rho) parameter depicting the functional relationship between observations of the error term.
u = classical (non-serially correlated) error term
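A short simulation of first-order serial correlation (Python; ρ and the sample size are arbitrary illustration choices).

```python
import numpy as np

# Generate first-order serially correlated errors: e_t = rho * e_(t-1) + u_t,
# with u_t a classical (non-serially-correlated) error term.
rng = np.random.default_rng(3)
T, rho = 200, 0.8
u = rng.normal(size=T)
e = np.zeros(T)
for t in range(1, T):
    e[t] = rho * e[t - 1] + u[t]

# The sample correlation between e_t and e_(t-1) is roughly rho.
print(round(np.corrcoef(e[1:], e[:-1])[0, 1], 2))
```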
Strength of Serial Correlation
Magnitude of ρ indicates the strength of the serial correlation.
If ρ = 0, there is no serial correlation.
As ρ approaches 1 in absolute value, the error terms become more strongly serially correlated.
If ρ > 1, then the error term continually increases over time, or explodes. Such a result is unreasonable.
So we expect -1 < ρ < 1.
Values of ρ greater than zero indicate positive serial correlation; values less than zero indicate negative serial correlation.
Negative serial correlation indicates that the error term switches
signs from observation to observation.
In the data utilized most commonly in economics, negative serial
correlation is not expected.
Consequences
1. Pure serial correlation does not cause bias in the coefficient estimates.
2. Serial correlation increases the variance of the distributions of the estimates.
1. Pure serial correlation causes the dependent variable to fluctuate in a fashion that the estimation procedure (OLS) attributes to the independent variables. Hence the variance of the estimates of β increases. These estimates are still unbiased, since over-estimation and under-estimation are still as likely.
3. Serial correlation causes OLS to underestimate the variances (and standard
errors) of the coefficients.
1. Intuitively - serial correlation increases the fit of the model, so the estimated variances and standard errors are lower. This can lead the researcher to conclude a relationship exists when in fact the variables in question are unrelated.
2. Hence the t-stats and F-stats cannot be relied upon for statistical inference.
3. Spurious Regressions.
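Consequence 3 can be checked by simulation. The sketch below (Python; a made-up design with a trending regressor and ρ = .9) compares the standard error OLS reports with the actual spread of the slope estimates across replications; the reported standard error is noticeably too small.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
T, rho, reps = 100, 0.9, 500
x = np.arange(T) / T                     # slowly moving (trending) regressor
X = sm.add_constant(x)

slopes, reported_se = [], []
for _ in range(reps):
    # AR(1) errors: e_t = rho * e_(t-1) + u_t
    u = rng.normal(size=T)
    e = np.zeros(T)
    for t in range(1, T):
        e[t] = rho * e[t - 1] + u[t]
    y = 1.0 + 0.0 * x + e                # true slope is zero
    res = sm.OLS(y, X).fit()
    slopes.append(res.params[1])
    reported_se.append(res.bse[1])

print("actual sd of slope estimates:", round(np.std(slopes), 2))
print("average OLS standard error:  ", round(np.mean(reported_se), 2))
```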
REMEDIES???
