Rómulo A. Chumacero
Functional Form
Motivation
What?: Extend OLS framework
Why?: Crucial in practice
How?: Using what we have learned
Outline
1. Scaling
2. Dummy variables / Time trends
3. Possible nonlinearities
4. Diagnostic tests
5. Measurement errors in variables
6. Omitting relevant variables
7. Including irrelevant variables
8. Multicollinearity
9. Influential analysis
10. Model selection
11. Specification searches
Effects of Scaling
Data are not always conveniently scaled
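Changing the scale of $x$ by a constant $c$:
$y = \beta_1 + \beta_2 x + u = \beta_1 + (\beta_2 / c)(c x) + u$
The intercept is unchanged and the slope is divided by $c$
Changing the scale of $y$ by a constant $c$:
$c y = c \beta_1 + c \beta_2 x + c u$
All coefficients (and the disturbance) are multiplied by $c$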
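The effect of rescaling can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the coefficient values and the rescaling factor `c` are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(10.0, 2.0, n)
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, n)   # simulated data (assumption)

def ols(X, y):
    """Return OLS coefficients for y = X b + u."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

c = 100.0                                      # hypothetical change of units
ones = np.ones(n)
b = ols(np.column_stack([ones, x]), y)         # original units
b_x = ols(np.column_stack([ones, c * x]), y)   # x rescaled: slope divided by c
b_y = ols(np.column_stack([ones, x]), c * y)   # y rescaled: all coefficients times c

print(b, b_x, b_y)
```

Rescaling changes the reported coefficients but not the fit: residuals (in the units of $y$) and t statistics are unaffected.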
Dummy Variables
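Example: $E(y \mid \text{gender})$. Equivalent ways to model this:
Define a dummy variable
$d_1 = 1$ if female, $d_1 = 0$ if male
Thus, $y = \beta_0 + \beta_1 d_1 + u$. $\beta_0 = E(y \mid \text{male})$ and $\beta_0 + \beta_1 = E(y \mid \text{female})$
Alternatively, define the variable
$d_2 = 0$ if female, $d_2 = 1$ if male
Thus, $y = \beta_0 + \beta_1 d_2 + u$. $\beta_0 = E(y \mid \text{female})$ and $\beta_0 + \beta_1 = E(y \mid \text{male})$
Or $y = \beta_1 d_1 + \beta_2 d_2 + u$. $\beta_1 = E(y \mid \text{female})$ and $\beta_2 = E(y \mid \text{male})$
Standard mistake: include an intercept, $d_1$ and $d_2$. Perfectly collinear ($d_1 + d_2 = 1$)
If equation of interest is $E(y \mid \text{gender}, x)$, with $x$ denoting education:
$y = \beta_0 + \beta_1 d_1 + \beta_2 x + u$
Intercept effect for gender but return on education is the same
A regression model allowing for slope differences (interactions) is
$y = \beta_0 + \beta_1 d_1 + \beta_2 x + \beta_3 d_1 x + u$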
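The equivalence of the parametrizations, and the dummy-variable trap, can be illustrated on simulated data. This is a sketch under assumed values (group share, means, seed); the names `d1` and `d2` simply mirror the female and male dummies above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
d1 = (rng.random(n) < 0.4).astype(float)    # 1 = female, 0 = male (simulated)
d2 = 1.0 - d1                                # 1 = male, 0 = female
y = 2.0 + 0.7 * d1 + rng.normal(0.0, 1.0, n)

def ols(X, y):
    """Return OLS coefficients for y = X b + u."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
a = ols(np.column_stack([ones, d1]), y)      # intercept plus female dummy
b = ols(np.column_stack([d1, d2]), y)        # both dummies, no intercept

mean_f, mean_m = y[d1 == 1].mean(), y[d1 == 0].mean()

# Dummy-variable trap: intercept, d1 and d2 are perfectly collinear,
# so the design matrix has rank 2 instead of 3.
trap_rank = np.linalg.matrix_rank(np.column_stack([ones, d1, d2]))

print(a, b, (mean_m, mean_f), trap_rank)
```

Both regressions reproduce the two group means; they differ only in how the means are split across coefficients.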
It is interesting to see how our estimators algebraically handle dummy variables
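$y = d_1 \beta_1 + d_2 \beta_2 + u$
By construction $d_1' d_2 = 0$. Thus,
$\hat\beta_1 = (d_1' d_1)^{-1} d_1' y = \frac{1}{n_1} \sum_{i:\, d_{1i} = 1} y_i = \bar y_1$, and similarly $\hat\beta_2 = \bar y_2$
$\hat V(\hat\beta) = \hat\sigma^2 \begin{bmatrix} (d_1' d_1)^{-1} & 0 \\ 0 & (d_2' d_2)^{-1} \end{bmatrix} = \begin{bmatrix} \hat\sigma^2 / n_1 & 0 \\ 0 & \hat\sigma^2 / n_2 \end{bmatrix}$
where $n_j = d_j' d_j$ is the number of observations with $d_j = 1$, for $j = 1, 2$, and $\hat\sigma^2$ is the estimated disturbance variance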
Time Trend
Many economic variables exhibit trends
Consider a series growing at a constant rate:
$Y_t = Y_0 (1 + g)^t$
where $g$ is the rate of growth per period
Taking logs (ln) of both sides:
$\ln Y_t = \ln Y_0 + t \ln(1 + g)$
Adding a shock and changing notation:
$y_t = \beta_1 + \beta_2 t + u_t$
where $y_t = \ln Y_t$, $\beta_1 = \ln Y_0$, $\beta_2 = \ln(1 + g) \simeq g$
Easily checked by taking the first difference and ignoring the disturbance:
$\Delta y_t = \beta_2$
Important: choice of unit for $t$ is irrelevant if it is used consistently
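A log-linear trend regression recovers the growth rate, and shifting the origin of the time index leaves the slope unchanged. The sketch below uses a simulated series; the growth rate `g`, the initial level and the 1987 origin are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
g = 0.03                                  # assumed growth rate per period
t = np.arange(80)
Y = 50.0 * (1.0 + g) ** t * np.exp(rng.normal(0.0, 0.02, t.size))

def trend_fit(time_index, log_y):
    """OLS of log_y on a constant and a time index; returns (b1, b2)."""
    X = np.column_stack([np.ones(time_index.size), time_index])
    return np.linalg.lstsq(X, log_y, rcond=None)[0]

b1, b2 = trend_fit(t, np.log(Y))
g_hat = np.exp(b2) - 1.0                  # implied growth rate; b2 itself is close to g

# Same regression with a different origin for t (e.g. calendar years):
# only the intercept changes, the slope does not.
_, b2_shifted = trend_fit(t + 1987, np.log(Y))

print(b2, g_hat, b2_shifted)
```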
Seasonality
Economic time series:
$y_t = T_t + C_t + S_t + I_t$ (trend, cycle, seasonal and irregular components)
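$y_t = \beta_0 + \beta_1 d_{1t} + \beta_2 d_{2t} + \beta_3 d_{3t} + \beta_4 t + u_t$ (1)
where $d_{1t}, d_{2t}, d_{3t}$ are seasonal dummies and $t$ is a time trend
Estimated seasonal component: $\hat s_t = \hat\beta_1 d_{1t} + \hat\beta_2 d_{2t} + \hat\beta_3 d_{3t}$
[Figure: GDP and detrended series, 1987 to 1997]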
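A regression with seasonal dummies and a linear trend can be sketched as follows. The quarterly series is simulated; the seasonal pattern, trend slope and noise level are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 44                                        # 11 years of quarterly data (assumption)
t = np.arange(n)
quarter = t % 4                               # 0, 1, 2, 3
D = np.column_stack([(quarter == q).astype(float) for q in range(3)])  # d1, d2, d3
season_true = np.array([0.10, -0.05, 0.02, 0.0])[quarter]
y = 4.4 + 0.02 * t + season_true + rng.normal(0.0, 0.01, n)

X = np.column_stack([np.ones(n), D, t])       # intercept, three dummies, trend
b = np.linalg.lstsq(X, y, rcond=None)[0]

seasonal = D @ b[1:4]                         # estimated seasonal component
detrended = y - b[0] - b[4] * t               # series net of intercept and trend
resid = y - X @ b                             # what is left after both are removed

print(b)
```

Only three dummies are included because the fourth quarter is absorbed by the intercept.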
Nonlinearity in Regressors
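We are interested in $E(y \mid x) = m(x)$, $x \in \mathbb{R}$, and the form of $m$ is unknown
Common approach: polynomial approximation:
$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + u$
Let $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$ and $z = (1, x, \ldots, x^p)'$; this is $y = z'\beta + u$, which is LRM.
Typically, $p$ is kept small
If $x \in \mathbb{R}^2$, a simple quadratic approximation is
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + u$
As dimensionality increases, approximations become non-parsimonious
Most applications use quadratic terms, some add cubics without interactions (or neural nets, Fourier series, splines, wavelets, etc.):
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \beta_6 x_1^3 + \beta_7 x_2^3 + u$
Since these nonlinear models are linear in parameters, they can be estimated by OLS, and inference is conventional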
However, model is nonlinear, so interpretation must take this into account
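For example, in cubic model, slope with respect to $x_1$ is
$\frac{\partial E(y \mid x)}{\partial x_1} = \beta_1 + 2 \beta_3 x_1 + \beta_5 x_2 + 3 \beta_6 x_1^2$
which is a function of $x_1$ and $x_2$, making reporting of the slope difficult
Important to report slopes for different values of the regressors, chosen to illustrate the point of interest
In other applications, average slope may be sufficient. Two obvious candidates:
Derivative evaluated at sample averages
$\left. \frac{\partial E(y \mid x)}{\partial x_1} \right|_{x = \bar x} = \beta_1 + 2 \beta_3 \bar x_1 + \beta_5 \bar x_2 + 3 \beta_6 \bar x_1^2$
and average derivative
$\frac{1}{n} \sum_{i=1}^{n} \frac{\partial E(y \mid x_i)}{\partial x_1} = \beta_1 + 2 \beta_3 \bar x_1 + \beta_5 \bar x_2 + 3 \beta_6 \frac{1}{n} \sum_{i=1}^{n} x_{1i}^2$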
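The two summary slopes can be computed directly from an OLS fit of the cubic model. The following sketch uses simulated regressors and made-up coefficients; the coefficient ordering in `b` follows the ordering of terms in the cubic equation.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = (1.0 + 0.5 * x1 - 0.3 * x2 + 0.2 * x1**2 + 0.1 * x2**2
     + 0.4 * x1 * x2 - 0.05 * x1**3 + 0.02 * x2**3
     + rng.normal(0.0, 0.5, n))               # simulated data (assumption)

# Columns: 1, x1, x2, x1^2, x2^2, x1*x2, x1^3, x2^3  ->  b[0], ..., b[7]
X = np.column_stack([np.ones(n), x1, x2, x1**2, x2**2, x1 * x2, x1**3, x2**3])
b = np.linalg.lstsq(X, y, rcond=None)[0]

def slope_x1(b, x1, x2):
    """Estimated derivative of E(y | x) with respect to x1."""
    return b[1] + 2 * b[3] * x1 + b[5] * x2 + 3 * b[6] * x1**2

slope_at_means = slope_x1(b, x1.mean(), x2.mean())   # derivative at sample averages
average_slope = slope_x1(b, x1, x2).mean()           # average derivative

print(slope_at_means, average_slope)
```

The two measures differ by exactly $3 \hat\beta_6$ times the sample variance of $x_1$, so they coincide only when the cubic term is absent or $x_1$ does not vary.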
Transformations
Even simplest model usually considers nonlinearities
Example: Cobb-Douglas production function
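$Y = K^{\beta_1} L^{\beta_2} e^{u}$ (2)
Take logs of (2):
$y = \beta_1 k + \beta_2 l + u$, with lowercase letters denoting logs
Or: $y = x'\beta + u$
Instances in which no transformation can be used: CES:
$Y = \left[ \delta K^{\rho} + (1 - \delta) L^{\rho} \right]^{1/\rho} + u$
OLS cannot be applied (NLLS)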
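Function | Slope ($dy/dx$) | Elasticity ($(dy/dx)(x/y)$)
$y = \beta_1 + \beta_2 x$ | $\beta_2$ | $\beta_2 x / y$
$y = \beta_1 + \beta_2 x^2$ | $2 \beta_2 x$ | $2 \beta_2 x^2 / y$
$y = \beta_1 + \beta_2 x^3$ | $3 \beta_2 x^2$ | $3 \beta_2 x^3 / y$
$\ln y = \beta_1 + \beta_2 \ln x$ | $\beta_2 y / x$ | $\beta_2$
$\ln y = \beta_1 + \beta_2 x$ | $\beta_2 y$ | $\beta_2 x$
$y = \beta_1 + \beta_2 \ln x$ | $\beta_2 x^{-1}$ | $\beta_2 y^{-1}$
If $\ln$ is used, a positive argument ($> 0$) is required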
ln(y)
Econometrician can estimate $y = x'\hat\beta + \hat u$ or $\ln(y) = x'\hat\beta + \hat u$ (or both). Which is preferable?
Plain truth: either is fine, in the sense that $E(y \mid x)$ and $E(\ln(y) \mid x)$ are well-defined (so long as $y > 0$)
To select one specification over the other requires the imposition of additional structure (as
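Let $z = h(x)$ denote nonlinear functions of $x$. Fit $y = x'\tilde\beta + z'\tilde\gamma + \tilde u$ by OLS, and test $H_0: \gamma = 0$
Alternatively, fit $y = x'\beta + u$ by OLS, obtain the fitted values $\hat y = x'\hat\beta$, and let
$z = (\hat y^2, \ldots, \hat y^m)'$
Then run the regression
$y = x'\tilde\beta + z'\tilde\gamma + \tilde u$ (3)
by OLS, and form the Wald statistic for $H_0: \gamma = 0$, which is asymptotically $\chi^2_{m-1}$ under the null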
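A test of this kind (commonly known as RESET) can be sketched as below. The data are simulated with a deliberately omitted quadratic term; `m = 3` and the homoskedastic form of the Wald statistic, $W = n (SSR_r - SSR_u) / SSR_u$, are choices made here for illustration rather than taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 400
x = rng.uniform(0.0, 3.0, n)
y = 1.0 + x + 0.8 * x**2 + rng.normal(0.0, 1.0, n)   # true model is quadratic

def ssr(X, y):
    """Sum of squared OLS residuals and the fitted values."""
    fitted = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.sum((y - fitted) ** 2), fitted

X = np.column_stack([np.ones(n), x])                  # (misspecified) linear model
ssr_r, y_hat = ssr(X, y)

m = 3
Z = np.column_stack([y_hat**p for p in range(2, m + 1)])   # powers 2, ..., m of y_hat
ssr_u, _ = ssr(np.column_stack([X, Z]), y)

# Wald statistic for gamma = 0 under homoskedasticity (sigma^2 estimated by SSR_u / n)
W = n * (ssr_r - ssr_u) / ssr_u
reject = W > 5.991     # 5% critical value of chi-square with m - 1 = 2 d.o.f.

print(W, reject)
```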
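$JB = \frac{n}{6} \left[ S^2 + \frac{(K - 3)^2}{4} \right] \xrightarrow{d} \chi^2_2$
$S$ (Skewness) is a measure of asymmetry of the distribution around the mean
$S = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{\hat u_i}{\hat\sigma} \right)^3 = \frac{1}{n} \sum_{i=1}^{n} \tilde u_i^3$
where $\tilde u_i = \hat u_i / \hat\sigma$ are the standardized residuals; $K$ (Kurtosis) is defined analogously with the fourth power
If $\ln y \sim N(\mu, \sigma^2)$:
$E(y) = e^{\mu + 0.5 \sigma^2}$, Median$(y) = e^{\mu}$, $V(y) = e^{2\mu + \sigma^2} \left( e^{\sigma^2} - 1 \right)$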
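The normality statistic built from skewness and kurtosis (the Jarque-Bera form, $n/6 \, [S^2 + (K - 3)^2 / 4]$) is easy to compute by hand. This is a minimal sketch; the function name `jarque_bera` and the simulated samples are illustrative.

```python
import numpy as np

def jarque_bera(resid):
    """Return (JB, S, K) with JB = n/6 * (S^2 + (K - 3)^2 / 4)."""
    e = np.asarray(resid, dtype=float)
    n = e.size
    z = (e - e.mean()) / e.std()       # standardized residuals (ddof = 0)
    S = np.mean(z**3)                  # skewness
    K = np.mean(z**4)                  # kurtosis
    return n / 6.0 * (S**2 + (K - 3.0) ** 2 / 4.0), S, K

rng = np.random.default_rng(6)
jb_normal, _, _ = jarque_bera(rng.normal(size=2000))
jb_skewed, S_skewed, _ = jarque_bera(rng.lognormal(size=2000))

print(jb_normal, jb_skewed)
```

Under normality the statistic is compared with a $\chi^2_2$ critical value (5.991 at the 5% level); the lognormal sample is strongly right-skewed and is rejected by a wide margin.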
Measurement Errors
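$y^* = \beta x^* + u$
$y^*$ or $x^*$ not observed; $y = y^* + v$, $x = x^* + w$; $v \sim (0, \sigma_v^2)$, $w \sim (0, \sigma_w^2)$
Measurement error in $y$ only:
$y = \beta x^* + u + v = \beta x^* + \varepsilon$; $\varepsilon = u + v$
Model satisfies the classical assumptions, $\hat\beta$ unbiased and efficient (not as efficient as when $y^*$ is observed)
Measurement error in $x$:
$y = \beta (x - w) + u = \beta x + \eta$
where $\eta = u - \beta w$. Since $x = x^* + w$, regressor is correlated with the disturbance, given that
$\mathrm{Cov}(x, \eta) = \mathrm{Cov}(x^* + w, u - \beta w) = -\beta \sigma_w^2$
$\hat\beta$ biased and inconsistent.
Violates assumption of no correlation between $x$ and error term.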
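The inconsistency caused by a mismeasured regressor can be seen in a small simulation. The sketch below assumes unit variances for the true regressor and the measurement error, so the standard attenuation result implies a probability limit of $\beta \cdot 1 / (1 + 1) = \beta / 2$; all numerical values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
beta = 2.0
x_star = rng.normal(0.0, 1.0, n)              # true regressor, variance 1
y = beta * x_star + rng.normal(0.0, 1.0, n)
x = x_star + rng.normal(0.0, 1.0, n)          # observed with error, variance of w = 1

b_true = (x_star @ y) / (x_star @ x_star)     # OLS (no intercept) with the true regressor
b_obs = (x @ y) / (x @ x)                     # OLS with the mismeasured regressor

plim = beta * 1.0 / (1.0 + 1.0)               # beta * V(x*) / (V(x*) + V(w))

print(b_true, b_obs, plim)
```

Increasing `n` does not help: the estimate settles near `plim`, not near `beta`.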
Omitted Variables
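Correct Model: $y = X_1 \beta_1 + X_2 \beta_2 + u$
Estimated Model: $y = X_1 \beta_1 + u$
$\hat\beta_1 = (X_1' X_1)^{-1} X_1' y = \beta_1 + (X_1' X_1)^{-1} X_1' X_2 \beta_2 + (X_1' X_1)^{-1} X_1' u$
$E(\hat\beta_1 \mid X) = \beta_1 + (X_1' X_1)^{-1} X_1' X_2 \beta_2$
Direction of bias difficult to assess in the general case; consider the case in which $x_1$ and $x_2$ are scalars
$E(\hat\beta_1) = \beta_1 + \frac{\mathrm{Cov}(x_1, x_2)}{V(x_1)} \beta_2$
If $\mathrm{sgn}(\mathrm{Cov}(x_1, x_2) \beta_2) > 0$, $E(\hat\beta_1) > \beta_1$ and estimator will overestimate effect of $x_1$ on $y$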
(Friedman: Permanent Income)
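The bias term has an exact in-sample counterpart: the coefficient from the short regression equals the long-regression coefficient plus $(X_1'X_1)^{-1} X_1' X_2$ times the estimated coefficient on the omitted variable. A sketch on simulated data (all parameter values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)            # x2 positively correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)  # both variables are relevant

X1, X2 = x1[:, None], x2[:, None]
X = np.column_stack([X1, X2])

b_long = np.linalg.lstsq(X, y, rcond=None)[0]    # correct model
b_short = np.linalg.lstsq(X1, y, rcond=None)[0]  # x2 omitted

# Sample analogue of (X1'X1)^{-1} X1'X2 beta_2
P = np.linalg.solve(X1.T @ X1, X1.T @ X2)
implied = b_long[0] + (P @ b_long[1:])[0]

print(b_long, b_short, implied)
```

Here the covariance and $\beta_2$ are both positive, so the short regression overstates the effect of $x_1$.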
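$V(\hat\beta_1 \mid X_1) = \sigma^2 (X_1' X_1)^{-1}$
If we had estimated the correct model, $E(\hat\beta_1) = \beta_1$ and $V(\hat\beta_1 \mid X_1, X_2)$ would be the upper left block of $\sigma^2 (X'X)^{-1}$, that is, $\sigma^2 (X_1' M_2 X_1)^{-1}$ with $M_2 = I - X_2 (X_2' X_2)^{-1} X_2'$. The difference
$\sigma^2 (X_1' M_2 X_1)^{-1} - \sigma^2 (X_1' X_1)^{-1}$ (4)
is positive semidefinite (p.d. whenever $X_1' X_2$ has full row rank)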
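Proceeding as usual (thinking that the estimated model is correct) we would obtain
$\tilde\sigma^2 = \frac{\hat u' \hat u}{n - k_1}$
but $\hat u = M_1 y = M_1 (X_1 \beta_1 + X_2 \beta_2 + u) = M_1 X_2 \beta_2 + M_1 u$. Then
$E(\hat u' \hat u) = \beta_2' X_2' M_1 X_2 \beta_2 + \sigma^2 \mathrm{tr}(M_1)$
$= \beta_2' X_2' M_1 X_2 \beta_2 + \sigma^2 (n - k_1)$
In conclusion
If we omit a relevant variable, $\hat\beta_1$ and $\tilde\sigma^2$ are biased
Even when $\hat\beta_1$ may be more precise than the estimator from the correct model, $\sigma^2$ cannot be estimated consistently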
Irrelevant Variables
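Correct Model: $y = X_1 \beta_1 + u$
Estimated Model: $y = X_1 \beta_1 + X_2 \beta_2 + u$
$\hat\beta_1 = (X_1' M_2 X_1)^{-1} X_1' M_2 y = \beta_1 + (X_1' M_2 X_1)^{-1} X_1' M_2 u$
$E(\hat\beta) = E \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} \beta_1 \\ 0 \end{bmatrix}$; $E(\tilde\sigma^2) = E\left[ \frac{\hat u' \hat u}{n - k_1 - k_2} \right] = \sigma^2$
$V(\hat\beta_1 \mid X) = \sigma^2 (X_1' M_2 X_1)^{-1}$
whereas estimating the correct model gives $V(\hat\beta_1 \mid X_1) = \sigma^2 (X_1' X_1)^{-1}$
Asymptotically as efficient if $X_1$ and $X_2$ orthogonal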
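The efficiency cost of including $X_2$ can be checked by comparing the two variance matrices (up to the common factor $\sigma^2$). The sketch below builds an irrelevant regressor that is correlated with $X_1$; the design and the correlation of 0.8 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = (0.8 * X1[:, 1] + rng.normal(size=n))[:, None]   # irrelevant but correlated
X = np.column_stack([X1, X2])

# Annihilator matrix M2 = I - X2 (X2'X2)^{-1} X2'
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)

V_short = np.linalg.inv(X1.T @ X1)          # V(b1) / sigma^2 without X2
V_long = np.linalg.inv(X1.T @ M2 @ X1)      # V(b1) / sigma^2 with X2 included

# V_long - V_short is positive semidefinite: no eigenvalue is negative
eig = np.linalg.eigvalsh(V_long - V_short)

print(np.diag(V_short), np.diag(V_long), eig)
```

`V_long` is also the upper left block of $(X'X)^{-1}$ for the full regressor matrix, which is the partitioned-inverse result used in the text.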
Multicollinearity
Arises when measured variables are too highly intercorrelated to allow for precise analysis of
the individual effects of each one
We will discuss:
nature
ways to detect it
effects
remedies.
Perfect Collinearity
If $\mathrm{rank}(X'X) < k$, $\hat\beta$ is not defined. Happens iff columns of $X$ are linearly dependent.
When this happens, the error is quickly discovered: software will be unable to construct $(X'X)^{-1}$
Since error is quickly discovered, this is rarely a problem of applied econometric practice
Thus, problem with perfect collinearity is not with data, but with bad specification.
Near Multicollinearity
In contrast to perfect collinearity, near multicollinearity is a statistical problem
Problem is not identification but precision
The higher the correlation between regressors, the less precise will be the estimates
Troubling about definition of problem is that our complaint is with the sample that was given to us!
The usual symptoms of the problem are:
Small changes in data produce wide swings in estimates
$t$ statistics are not significant, yet $R^2$ is high (excuse?)
Coefficients have wrong sign or implausible magnitudes
Problem arises when $X'X$ is near singular and columns of $X$ are close to linear dependence (loose)
One implication of near singularity is that numerical reliability of calculations is reduced (more likely that reported calculations will be in error due to floating-point calculation difficulties)
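Problem is with $(X'X)^{-1}$; its $j$-th diagonal element is ($j = 1$ for convenience, with $X_2$ collecting the other regressors):
$(x_1' M_2 x_1)^{-1} = \left[ x_1' x_1 - x_1' X_2 (X_2' X_2)^{-1} X_2' x_1 \right]^{-1}$
$= \left[ x_1' x_1 \left( 1 - \frac{x_1' X_2 (X_2' X_2)^{-1} X_2' x_1}{x_1' x_1} \right) \right]^{-1}$
$= \frac{1}{x_1' x_1 (1 - R_1^2)}$
$R_1^2$ is (uncentered) $R^2$ of regression of $x_1$ on the other regressors. Thus,
$V(\hat\beta_1) = \frac{\sigma^2}{x_1' x_1 (1 - R_1^2)}$
if a set of regressors is highly correlated to $x_1$, $R_1^2 \to 1$ and $V(\hat\beta_1) \to \infty$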
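The expression for the diagonal element of $(X'X)^{-1}$ can be verified numerically. In this sketch the first regressor is constructed to be nearly collinear with another one; the data and the degree of collinearity are assumptions.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 300
z = rng.normal(size=n)
x1 = z + 0.1 * rng.normal(size=n)             # x1 nearly collinear with z
X2 = np.column_stack([np.ones(n), z])         # the other regressors
X = np.column_stack([x1, X2])

# Uncentered R^2 from regressing x1 on the other regressors
fitted = X2 @ np.linalg.lstsq(X2, x1, rcond=None)[0]
R2_1 = (fitted @ fitted) / (x1 @ x1)

lhs = np.linalg.inv(X.T @ X)[0, 0]            # first diagonal element of (X'X)^{-1}
rhs = 1.0 / ((x1 @ x1) * (1.0 - R2_1))        # 1 / (x1'x1 (1 - R_1^2))

print(R2_1, lhs, rhs)
```

The factor $1 / (1 - R_1^2)$ is what inflates the variance: with $R_1^2$ around 0.99 it is roughly 100.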
Detection
Rule of thumb: concerned when overall $R^2 <$ any $R_j^2$
Alternative measure (Belsley) based on the conditioning number ($\kappa$)
$\kappa = \sqrt{\lambda_{\max} / \lambda_{\min}}$
$\lambda$s eigenvalues of $B = S (X'X) S$, $S = \mathrm{diag}\left( 1 / \sqrt{x_1' x_1}, \ldots, 1 / \sqrt{x_k' x_k} \right)$
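The conditioning number is straightforward to compute. The sketch below scales each column to unit length before taking eigenvalues, as in the definition of $S$; the function name `condition_number` and the two simulated designs are illustrative.

```python
import numpy as np

def condition_number(X):
    """sqrt(lambda_max / lambda_min) of S X'X S, with S = diag(1 / sqrt(x_j'x_j))."""
    s = 1.0 / np.sqrt(np.sum(X**2, axis=0))
    Xs = X * s                                 # columns scaled to unit length
    lam = np.linalg.eigvalsh(Xs.T @ Xs)        # eigenvalues of S X'X S
    return np.sqrt(lam.max() / lam.min())

rng = np.random.default_rng(11)
n = 300
z = rng.normal(size=n)
X_ok = np.column_stack([z, rng.normal(size=n)])                # unrelated columns
X_bad = np.column_stack([z, z + 0.01 * rng.normal(size=n)])    # nearly identical columns

kappa_ok = condition_number(X_ok)
kappa_bad = condition_number(X_bad)

print(kappa_ok, kappa_bad)
```

Because of the scaling by $S$, the measure does not depend on the units in which each regressor is expressed.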
Bottom Line
There is no pair of words that is more misused than multicollinearity problem
That explanatory variables are highly collinear is a fact of life
It is clear that there are realizations of $X'X$ which would be much preferred to the actual data
To complain about the apparent malevolence of nature is not constructive
Ad-hoc cures for a bad sample can be disastrously inappropriate
Better to rightly accept the fact that non-experimental data is sometimes not very informative about parameters of interest
Example to clarify what we are really talking about
Consider $y = \beta_1 x_1 + \beta_2 x_2 + u$
A regression of $x_2$ on $x_1$ yields $x_2 = \hat\delta x_1 + \hat e$, where $\hat e$ (by construction) is orthogonal to $x_1$
Substitute this auxiliary relationship into the original one to obtain the model
$y = \beta_1 x_1 + \beta_2 (\hat\delta x_1 + \hat e) + u$
$= (\beta_1 + \beta_2 \hat\delta) x_1 + \beta_2 \hat e + u$
$= \theta_1 z_1 + \theta_2 z_2 + u$
where $\theta_1 = \beta_1 + \beta_2 \hat\delta$, $\theta_2 = \beta_2$, $z_1 = x_1$ and $z_2 = x_2 - \hat\delta x_1$
Researcher who used $x_1$ and $x_2$ and the parameters $\beta_1$ and $\beta_2$ reports that $\beta_2$ is estimated inaccurately because of the collinearity problem
Researcher who happened to stumble on the model with variables $z_1$ and $z_2$ and parameters $\theta_1$ and $\theta_2$ would report that there is no collinearity problem because $z_1$ and $z_2$ are orthogonal (recall that $x_1$ and $\hat e$ are orthogonal by construction). This researcher would nonetheless report that $\theta_2 (= \beta_2)$ is estimated inaccurately, not because of collinearity, but because $z_2$ does not vary adequately
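The two researchers' regressions can be run side by side. In this sketch (simulated data, assumed coefficients) the second regressor is replaced by its residual from the auxiliary regression; the estimate and standard error of the second coefficient come out identical in both parametrizations.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)     # highly collinear with x1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ols_se(X, y):
    """OLS coefficients and conventional standard errors."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    s2 = u @ u / (len(y) - X.shape[1])
    return b, np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

delta = (x1 @ x2) / (x1 @ x1)                 # auxiliary regression of x2 on x1
z2 = x2 - delta * x1                          # orthogonal to x1 by construction

b, se_b = ols_se(np.column_stack([x1, x2]), y)      # collinear regressors
th, se_th = ols_se(np.column_stack([x1, z2]), y)    # orthogonal regressors

print(b, se_b)
print(th, se_th)
```

The imprecision of the second coefficient is the same whether it is attributed to collinearity or to the small variation of `z2`.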
Example illustrates that collinearity as a cause of weak evidence is indistinguishable from
inadequate variability as a cause of weak evidence
In light of that fact, surprising that all econometrics texts have sections dealing with the
collinearity problem but none has a section on the inadequate variability problem
In summary
Collinearity is bound to be present in applied econometric practice
There is no simple solution to this problem
Fortunately, multicollinearity does not lead to errors in inference
Asymptotic distribution is still valid. Estimates are asymptotically normal, and estimated standard errors are consistent
Confidence intervals are not misleading. They are large, correctly indicating inherent
uncertainty about the true parameter values
Influential Analysis
OLS seeks to prevent a few large residuals at the expense of incurring many relatively small residuals
A few observations can be extremely influential in the sense that dropping them from the sample substantially changes elements of $\hat\beta$
$\hat\beta_{(-i)} = \hat\beta - \frac{1}{1 - h_i} (X'X)^{-1} x_i \hat u_i$ (5)
where $h_i = x_i' (X'X)^{-1} x_i$
$\sum_{i=1}^{n} h_i = k$, so $h_i$ equals $k/n$ on average.
What should be done with influential observations? Keep or drop?
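The leave-one-out formula avoids re-estimating the model for every dropped observation. The sketch below plants one high-leverage point in simulated data and checks the formula against a direct re-estimation; the data and the planted value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(13)
n, k = 50, 2
x = rng.normal(size=n)
x[0] = 8.0                                     # one high-leverage observation
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
u = y - X @ b
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # leverage h_i = x_i'(X'X)^{-1} x_i

i = 0                                          # drop the high-leverage observation
b_loo_formula = b - XtX_inv @ X[i] * u[i] / (1.0 - h[i])
b_loo_direct = np.linalg.lstsq(np.delete(X, i, axis=0),
                               np.delete(y, i), rcond=None)[0]

print(h[i], h.sum(), b, b_loo_formula)
```

Observations with $h_i$ well above $k/n$ and a sizeable residual are the ones whose removal moves the estimates most.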
[Figure: scatter plots of Growth against Policy Rate, with the observation 1998:09 labeled]
Model Selection
We discussed costs and benefits of inclusion/exclusion of variables
How to select specification, when theory does not provide complete guidance?
This is the question of model selection
Question: "What is the right model for $y$?" is not well posed; it does not make clear the conditioning set
Question: "Which subset of $(x_1, \ldots, x_K)$ enters $E(y \mid x_1, \ldots, x_K)$?" is well posed.
In many cases, model selection reduces to comparing two nested models
$y = X_1 \beta_1 + X_2 \beta_2 + u$
$X_1$ is $n \times k_1$ and $X_2$ is $n \times k_2$. Compare
$\mathcal{M}_1: y = X_1 \beta_1 + u$
$\mathcal{M}_2: y = X_1 \beta_1 + X_2 \beta_2 + u$
Note that $\mathcal{M}_1 \subset \mathcal{M}_2$
We say that $\mathcal{M}_2$ is true if $\beta_2 \neq 0$
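Measures of fit:
$R^2 = 1 - \frac{\hat u' \hat u}{n \hat\sigma_y^2}$, $\hat\sigma^2 = \frac{\hat u' \hat u}{n}$, with $\hat\sigma_y^2$ the sample variance of $y$
Gaussian log-likelihood: $\hat L = -(n/2) \ln \hat\sigma^2 + C$ ($C$ is a constant)
It might be thought attractive to base model selection on one of these measures of fit
Problem: measures are monotonic between nested models, $\hat u_1' \hat u_1 \geq \hat u_2' \hat u_2$, $R_1^2 \leq R_2^2$ and $\hat L_1 \leq \hat L_2$, so $\mathcal{M}_2$ would always be selected, regardless of the actual data and probability structure
Clearly an inappropriate decision rule!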
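$W = n \frac{\hat\sigma_1^2 - \hat\sigma_2^2}{\hat\sigma_2^2}$
Model selection rule is: for a critical level $\alpha$, let $c_\alpha$ satisfy $\Pr\left( \chi^2_{k_2} > c_\alpha \right) = \alpha$. Select $\mathcal{M}_1$ if $W \leq c_\alpha$, else select $\mathcal{M}_2$.
Major problem with this approach is that critical level is indeterminate
Reasoning which helps guide choice of $\alpha$ in hypothesis testing (controlling Type I error) is not relevant for model selection. If $\alpha$ is set to be a small number, then $\Pr[\widehat{\mathcal{M}} = \mathcal{M}_1 \mid \mathcal{M}_1] \approx 1 - \alpha$ but $\Pr[\widehat{\mathcal{M}} = \mathcal{M}_2 \mid \mathcal{M}_2]$ could vary dramatically, depending on the sample size, etc.
Another problem is that if $\alpha$ is held fixed, model selection procedure is inconsistent, as
$\Pr[\widehat{\mathcal{M}} = \mathcal{M}_1 \mid \mathcal{M}_1] \to 1 - \alpha < 1$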
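$\bar R^2 = 1 - \frac{\hat u' \hat u / (n - k)}{\sum_{i=1}^{n} (y_i - \bar y)^2 / (n - 1)} = 1 - \frac{\tilde\sigma^2}{s_y^2}$, where $\tilde\sigma^2 = \frac{\hat u' \hat u}{n - k}$ and $s_y^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar y)^2$
Since $\bar R^2$ is monotonically decreasing in $\tilde\sigma^2$, rule is the same as selecting model with smaller $\tilde\sigma^2$
$\ln \tilde\sigma^2 = \ln\left( \hat\sigma^2 \frac{n}{n - k} \right) = \ln \hat\sigma^2 + \ln\left( 1 + \frac{k}{n - k} \right)$
$\simeq \ln \hat\sigma^2 + \frac{k}{n - k}$
$\simeq \ln \hat\sigma^2 + \frac{k}{n}$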
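It turns out that model selection based on any criterion of the form
$\ln \hat\sigma^2 + c \frac{k}{n}$, $c > 0$ (7)
is inconsistent. Under $\mathcal{M}_1$,
$n \left( \ln \hat\sigma_1^2 - \ln \hat\sigma_2^2 \right) \simeq W \xrightarrow{d} \chi^2_{k_2}$ (8)
For $\bar R^2$ (that is, $c = 1$):
$\Pr[\widehat{\mathcal{M}} = \mathcal{M}_1 \mid \mathcal{M}_1] = \Pr[\bar R_1^2 > \bar R_2^2 \mid \mathcal{M}_1]$
$= \Pr[\ln \tilde\sigma_1^2 < \ln \tilde\sigma_2^2 \mid \mathcal{M}_1]$
$\simeq \Pr\left[ \ln \hat\sigma_1^2 + \frac{k_1}{n} < \ln \hat\sigma_2^2 + \frac{k_1 + k_2}{n} \mid \mathcal{M}_1 \right]$
$= \Pr[W < k_2 \mid \mathcal{M}_1]$
$\to \Pr\left( \chi^2_{k_2} < k_2 \right) < 1$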
Akaike information criterion:
$AIC = -\frac{2 \hat L}{n} + \frac{2k}{n} \simeq \ln \hat\sigma^2 + \frac{2k}{n}$ (up to a constant)
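For the Schwarz criterion, $BIC \simeq \ln \hat\sigma^2 + \frac{k \ln(n)}{n}$:
$\Pr[\widehat{\mathcal{M}} = \mathcal{M}_1 \mid \mathcal{M}_1] = \Pr[W < k_2 \ln(n) \mid \mathcal{M}_1]$
$= \Pr\left[ \frac{W}{\ln(n)} < k_2 \mid \mathcal{M}_1 \right]$
$\to \Pr(0 < k_2 \mid \mathcal{M}_1) = 1$
Under $\mathcal{M}_2$, $W / \ln(n) \xrightarrow{p} \infty$, so
$\Pr[\widehat{\mathcal{M}} = \mathcal{M}_2 \mid \mathcal{M}_2] = \Pr[BIC_2 < BIC_1 \mid \mathcal{M}_2]$
$= \Pr\left[ \frac{W}{\ln(n)} > k_2 \mid \mathcal{M}_2 \right] \to 1$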
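Hannan-Quinn criterion:
$HQ = -\frac{2 \hat L}{n} + \frac{2 k \ln(\ln(n))}{n} \simeq \ln \hat\sigma^2 + \frac{2 k \ln(\ln(n))}{n}$ (up to a constant)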
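Criteria of the form $\ln \hat\sigma^2$ plus a penalty in $k/n$ can be compared on two nested models. The sketch below uses the penalties $2$, $\ln(n)$ and $2 \ln(\ln(n))$ (the usual AIC, BIC and Hannan-Quinn forms); the simulated data, in which the smaller model is true, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(14)
n, k1, k2 = 500, 2, 1
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, k2))
y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=n)    # M1 is true (beta_2 = 0)

def sigma2_hat(X, y):
    """Residual variance u'u / n from an OLS fit."""
    u = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return u @ u / len(y)

def criteria(s2, k, n):
    """ln(sigma^2) plus the AIC, BIC and Hannan-Quinn penalties."""
    return {"AIC": np.log(s2) + 2 * k / n,
            "BIC": np.log(s2) + k * np.log(n) / n,
            "HQ": np.log(s2) + 2 * k * np.log(np.log(n)) / n}

s2_1 = sigma2_hat(X1, y)
s2_2 = sigma2_hat(np.column_stack([X1, X2]), y)        # never larger than s2_1
c1, c2 = criteria(s2_1, k1, n), criteria(s2_2, k1 + k2, n)
chosen = {name: ("M1" if c1[name] < c2[name] else "M2") for name in c1}

print(s2_1, s2_2, chosen)
```

The fit term always favors the larger model; only the penalty can tip the choice toward the smaller one, and the BIC penalty is the heaviest of the three at this sample size.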
Specification Searches
Theory is often vague about the relationship between variables
As a result, many relations are established from empirical regularities
If not accounted for, this practice can generate serious biases in inference
Names: Data mining, data snooping, data grubbing, data fishing
Examples:
Because of space limitations, only the best of a variety of alternative models are
presented here.
The precise variables included in the regression were determined on the basis of
extensive experimentation (on the same body of data).
Since there is no firmly validated theory, we avoided a priori specification of the
functions we wished to fit.
We let the data specify the model.
Newsletter scam
Conventional hypothesis testing valid when a priori considerations rather than exploratory
data mining determine set of variables included
When miner uncovers t-statistics that appear significant at 0.05 level by running a large
number of alternative regressions on the same body of data, the probability of Type I error
is much greater than claimed 5%
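The inflation of the Type I error under a specification search can be illustrated by simulation. In the sketch below the dependent variable is pure noise, a "miner" tries `m` candidate regressors one at a time and keeps the most significant; the values of `n`, `m` and `reps` are arbitrary choices, and the benchmark $1 - 0.95^m$ treats the `m` tests as independent, which is only an approximation here.

```python
import numpy as np

rng = np.random.default_rng(15)
n, m, reps = 100, 20, 500     # sample size, candidate regressors, replications

false_positives = 0
for _ in range(reps):
    y = rng.normal(size=n)                     # y is unrelated to every candidate
    X = rng.normal(size=(n, m))
    t_max = 0.0
    for j in range(m):                         # one simple regression per candidate
        Z = np.column_stack([np.ones(n), X[:, j]])
        ZtZ_inv = np.linalg.inv(Z.T @ Z)
        b = ZtZ_inv @ Z.T @ y
        u = y - Z @ b
        se = np.sqrt(u @ u / (n - 2) * ZtZ_inv[1, 1])
        t_max = max(t_max, abs(b[1] / se))
    false_positives += t_max > 1.96            # the "best" regression looks significant

rate = false_positives / reps                  # empirical Type I error of the search
benchmark = 1 - 0.95 ** m                      # value under m independent 5% tests

print(rate, benchmark)
```

With 20 candidates, a nominal 5% test applied to the best-looking regression rejects a true null far more often than 5% of the time.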