
Economics 513

Lecture 2: The Classical Linear Regression Model

Introduction

In lecture 1 we

- introduced the concept of an economic relation,
- noted that empirical economic relations do not resemble textbook relations,
- introduced a method to find the best fitting linear relation.

No reference to mathematical statistics in all of this. Initially econometrics did not use the tools of mathematical statistics.

Mathematical statistics develops methods for the analysis of data generated by a random experiment in order to learn about that random experiment.


Is this relevant in economics?

Consider

- Wage equation: relation between wage and education, work experience, gender, ...
- Macro consumption function: relation between (national) consumption and (national) income

What is the random experiment?

To make progress we start with the assumption that all economic relations are essentially deterministic, if we include all variables:

$$y = f(x_1, \ldots, x_W)$$


Hence, if we have data $y_i, x_{i1}, \ldots, x_{iW}$, $i = 1, \ldots, n$, then

$$y_i = f(x_{i1}, \ldots, x_{iW}), \quad i = 1, \ldots, n$$

Let $\bar{x}_1, \ldots, \bar{x}_W$ be the sample averages of the variables and assume that $f$ is sufficiently many times differentiable to have a Taylor series expansion around $\bar{x}_1, \ldots, \bar{x}_W$, i.e. a polynomial approximation

$$y_i = \beta_0^* + \beta_1 (x_{i1} - \bar{x}_1) + \cdots + \beta_W (x_{iW} - \bar{x}_W) + \gamma_1 (x_{i1} - \bar{x}_1)^2 + \cdots + \gamma_W (x_{iW} - \bar{x}_W)^2 + \delta_{12} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) + \cdots$$

where $\beta_0^* = f(\bar{x}_1, \ldots, \bar{x}_W)$ is the value of $f$ at the sample averages.


We divide $x_1, \ldots, x_W$ into three groups:

1. Variables that do not vary in the sample (take these to be the last $W - V$ variables), i.e. for $i = 1, \ldots, n$, $x_{i,V+1} = \bar{x}_{V+1}, \ldots, x_{iW} = \bar{x}_W$. Example: gender if we consider a sample of women.
2. Variables in the relation that are omitted or cannot be included because they are unobservable. Let these be the next $V - K + 1$ variables, $x_K, \ldots, x_V$.
3. Variables included in the relation, i.e. $x_1, \ldots, x_{K-1}$.


To keep the relation simple we concentrate on the linear part. Hence the observations satisfy

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{K-1} x_{i,K-1} + \varepsilon_i$$

with

$$\beta_0 = \beta_0^* - \sum_{j=1}^{K-1} \beta_j \bar{x}_j$$

The remainder term contains all the omitted terms

$$\varepsilon_i = \beta_K (x_{iK} - \bar{x}_K) + \cdots + \beta_V (x_{iV} - \bar{x}_V) + \gamma_1 (x_{i1} - \bar{x}_1)^2 + \cdots + \gamma_V (x_{iV} - \bar{x}_V)^2 + \delta_{12} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2) + \cdots$$

We call $\varepsilon_i$ the disturbance (of the exact linear relation).


Note that

$$\beta_j = \frac{\partial f}{\partial x_j}(\bar{x}_1, \ldots, \bar{x}_W), \quad j = 1, \ldots, K-1$$

$$\beta_0 = f(\bar{x}_1, \ldots, \bar{x}_W) - \sum_{j=1}^{K-1} \beta_j \bar{x}_j$$

Conclusions:

1. The slope coefficient $\beta_j$ is the partial effect of $x_j$ on $y$.
2. The slope coefficients and the intercept depend on the variables that are constant in the sample, e.g. in a sample of women the coefficients in a wage relation may be different from those in a sample of only men.
3. Only in very special cases will the relation have a 0 intercept.


Consider the following experiment: prediction of $y_i$ on the basis of $x_{i1}, \ldots, x_{i,K-1}$.

Assume that we have observed $x_{i1}, \ldots, x_{i,K-1}$. This does not tell us anything about the disturbance $\varepsilon_i$, which depends on (many) variables besides $x_1, \ldots, x_{K-1}$. Hence, even if we know the coefficients $\beta_0, \ldots, \beta_{K-1}$, we cannot predict with certainty what $y_i$ is.

A variable with a value that is unknown before the experiment is performed is a random variable in probability theory.

As in classical random experiments (flipping a coin, rolling a die), the randomness is due to lack of knowledge.


In the prediction experiment $y_i$ is a random variable and hence so is

$$\varepsilon_i = y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_{K-1} x_{i,K-1}$$

If we have a sample of size $n$, we have $n$ replications of the random experiment

$$y_1 = \beta_0 + \beta_1 x_{11} + \cdots + \beta_{K-1} x_{1,K-1} + \varepsilon_1$$
$$\vdots$$
$$y_n = \beta_0 + \beta_1 x_{n1} + \cdots + \beta_{K-1} x_{n,K-1} + \varepsilon_n$$

The distribution of the disturbance reflects the fact that we do not know the value of the disturbance if we know $x_1, \ldots, x_{K-1}$.

In fact we make a more extreme assumption: the value of $\varepsilon$ is completely unpredictable from $x_1, \ldots, x_{K-1}$.


Assumption 1: $\mathrm{E}(\varepsilon_i \mid x_{i1}, \ldots, x_{i,K-1}) = 0$

In words: the disturbance is mean-independent of $x_1, \ldots, x_{K-1}$.

Why do we need this assumption?

Consider

$$(1) \quad \varepsilon_i = \gamma (x_{i1} - \bar{x}_1) + \eta_i, \quad i = 1, \ldots, n$$

Hence, $\varepsilon$ is (partially) predictable using $x_1$ (but not completely so).

Substitution gives

$$y_i = \beta_0 - \gamma \bar{x}_1 + (\beta_1 + \gamma) x_{i1} + \cdots + \beta_{K-1} x_{i,K-1} + \eta_i$$


Remember that

$$\beta_1 = \frac{\partial f}{\partial x_1}(\bar{x}_1, \ldots, \bar{x}_W)$$

Conclusion: If (1) holds then the coefficient of $x_1$ is not equal to the partial effect of $x_1$ on $y$.

Failure of Assumption 1 is a failure of the ceteris paribus condition in the sample: a change in $x_1$ has two effects on $y$, a direct effect $\beta_1$ and an indirect effect $\gamma$. The latter effect is due to the fact that in a sample we cannot hold the other relevant variables fixed/constant. Hence we only measure the partial effect if the omitted variables are not related to $x_1$.

Measuring partial effects is the goal of most empirical research in economics (and other social sciences). The biggest challenge in empirical research is to ensure that Assumption 1 holds.


Whether we care depends on our objectives. Compare a homeowner who is interested in the relation between house price and the square footage of the house.

- If he/she wants to predict the sales price of the house as is, there is no reason to be worried about the interpretation of the regression coefficient as a partial effect.
- If he/she wants to evaluate the investment in an addition to the house, the estimation of the partial effect is essential.

Two strategies to estimate the partial effect:

- Include all variables that are correlated with $x_1$ in the relation.
- Assign $x_1$ randomly, i.e. using a random experiment that is independent of anything, e.g. by flipping a coin if $x_1$ is dichotomous (see the sketch below).
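A minimal simulation sketch of why randomization helps; the variable names, coefficient values, and sample size here are illustrative and not from the lecture. When $x_1$ is correlated with an omitted variable the regression coefficient mixes the direct and indirect effects; when $x_1$ is assigned by an independent coin flip the coefficient recovers the partial effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta1 = 2.0                              # true partial effect of x1 (illustrative)
omitted = rng.normal(size=n)             # relevant variable left out of the regression

# Case A: x1 correlated with the omitted variable -> Assumption 1 fails
x1_corr = (omitted + rng.normal(size=n)) > 0
y_corr = 1.0 + beta1 * x1_corr + 3.0 * omitted + rng.normal(size=n)

# Case B: x1 assigned by an independent coin flip -> Assumption 1 holds
x1_rand = rng.integers(0, 2, size=n)
y_rand = 1.0 + beta1 * x1_rand + 3.0 * omitted + rng.normal(size=n)

def slope(y, x):
    """OLS slope in a regression of y on a constant and x."""
    return np.polyfit(np.asarray(x, dtype=float), y, 1)[0]

print("correlated x1:", slope(y_corr, x1_corr))   # picks up an indirect effect, far from 2
print("randomized x1:", slope(y_rand, x1_rand))   # close to the partial effect 2
```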


The CLR model

Re-label the variables $x_1, \ldots, x_{K-1}$ as $x_2, \ldots, x_K$ and introduce the variable

$$x_1 \equiv 1$$

i.e. $x_{i1} = 1$, $i = 1, \ldots, n$. Also re-label $\beta_0, \ldots, \beta_{K-1}$ to $\beta_1, \ldots, \beta_K$.

In the new notation, for $i = 1, \ldots, n$,

$$(2) \quad y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_K x_{iK} + \varepsilon_i$$

This is the multiple linear regression model. The relation (2) is linear in $x_1, \ldots, x_K$ and also linear in $\beta_1, \ldots, \beta_K$. The latter is essential. By re-interpreting $x_1, \ldots, x_K$ we can deal with relations that are non-linear in these variables.


Examples:

Polynomial in $x$

$$y = \beta_1 + \beta_2 x + \beta_3 x^2 + \beta_4 x^3 + \varepsilon$$

Define: $x_1 \equiv 1$, $x_2 = x$, $x_3 = x^2$, $x_4 = x^3$.
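A short sketch of this re-interpretation (the data and coefficient values are made up): the relation is non-linear in $x$, but after defining the regressors $x_2 = x$, $x_3 = x^2$, $x_4 = x^3$ it is linear in the $\beta$'s and can be estimated by ordinary linear regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-2, 2, size=n)
# illustrative true coefficients (beta1, ..., beta4) = (1, 0.5, -0.3, 0.2)
y = 1.0 + 0.5 * x - 0.3 * x**2 + 0.2 * x**3 + rng.normal(scale=0.1, size=n)

# re-interpret the regressors: x1 = 1, x2 = x, x3 = x^2, x4 = x^3
X = np.column_stack([np.ones(n), x, x**2, x**3])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)   # estimates close to (1, 0.5, -0.3, 0.2)
```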


Log-linear relation

$$\ln y = \beta_1 + \beta_2 \ln x + \varepsilon$$

Define: $x_1 \equiv 1$, $x_2 = \ln x$.

Note

$$\beta_2 = \frac{\partial \ln y}{\partial \ln x}$$

i.e. $\beta_2$ is the elasticity of $y$ w.r.t. $x$.


Semi-log relation

$$\ln y = \beta_1 + \beta_2 x + \varepsilon$$

Note: no re-definition is needed.

Also

$$\beta_2 = \frac{\partial \ln y}{\partial x} = \frac{\partial y / y}{\partial x}$$

This is a semi-elasticity.


Assumption 1 in the new notation:

$$(3) \quad \mathrm{E}(\varepsilon_i \mid x_{i1}, \ldots, x_{iK}) = 0, \quad i = 1, \ldots, n$$

Note that this implies that $\mathrm{E}(\varepsilon_i) = 0$. This is without loss of generality because a non-zero mean can be absorbed into the intercept.


In matrix notation (2) and (3) are

$$y = X\beta + \varepsilon$$

and

$$\mathrm{E}(\varepsilon_i \mid x_i) = 0, \quad i = 1, \ldots, n$$

with

$$(4) \quad X = \begin{pmatrix} x_1' \\ \vdots \\ x_n' \end{pmatrix}$$

Note that $x_i$ is a $K$-vector.


The Classical (Multiple) Linear Regression (CLR) model is a set of assumptions, mainly on the conditional distribution of $\varepsilon$ given $x_1, \ldots, x_K$.

Some assumptions are essential, some are convenient initial assumptions that can/will be relaxed.

The CLR model is appropriate in many practical situations and is the starting point for the use of mathematical statistical inference to measure economic relations.


CLR model

$$y = X\beta + \varepsilon$$

Assumption 1: Fundamental assumption

$$\mathrm{E}(\varepsilon \mid X) = 0$$

Assumption 2: Spherical disturbances

$$\mathrm{E}(\varepsilon\varepsilon' \mid X) = \sigma^2 I$$

Assumption 3: Full rank

$$\mathrm{rank}(X) = K$$


Discussion of the assumptions

Assumption 1 is shorthand for

$$\mathrm{E}(\varepsilon_i \mid X) = 0, \quad i = 1, \ldots, n$$

Hence this is equivalent to

$$\mathrm{E}(\varepsilon_i \mid x_1, \ldots, x_n) = 0, \quad i = 1, \ldots, n$$

Compare this with

$$\mathrm{E}(\varepsilon_i \mid x_i) = 0, \quad i = 1, \ldots, n$$

By the law of iterated expectations, the current assumption implies the latter.


The current assumption states that not only $x_i$ but also $x_j$, $j \neq i$, is not related to $\varepsilon_i$. This is not stronger than the previous assumption if $x_1, \ldots, x_n$ are independent, as in a random sample from a population. If these are not independent, as e.g. in time-series data, then this additional assumption may be too strong.

Assumption 1 is satisfied if $X$ is chosen independently of $\varepsilon$. In that case we can treat $X$ as a matrix of known constants. Therefore instead of Assumption 1 one sometimes sees

Assumption 1: $X$ is a matrix of known constants determined independently of $\varepsilon$.

Note: Choosing/controlling $X$ is not enough.


Next, we consider Assumption 2.

Note

$$\varepsilon\varepsilon' = \begin{pmatrix} \varepsilon_1^2 & \varepsilon_1\varepsilon_2 & \cdots & \varepsilon_1\varepsilon_n \\ \varepsilon_2\varepsilon_1 & \varepsilon_2^2 & & \vdots \\ \vdots & & \ddots & \\ \varepsilon_n\varepsilon_1 & \varepsilon_n\varepsilon_2 & \cdots & \varepsilon_n^2 \end{pmatrix}$$

Hence Assumption 2 implies that

$$\mathrm{E}(\varepsilon_i^2 \mid X) = \sigma^2, \quad i = 1, \ldots, n$$

This is called homoskedasticity.

Assumption 2 also implies that

$$\mathrm{E}(\varepsilon_i \varepsilon_j \mid X) = 0, \quad i \neq j$$

Hence, given $X$ the disturbances are uncorrelated.


Example of failure of homoskedasticity: random coefficients

$$y = \beta_0 + (\beta_1 + u) x + \varepsilon = \beta_0 + \beta_1 x + \varepsilon + u x$$

where $u$ is the population variation in the coefficient.

For the composite disturbance, if $\mathrm{E}(u \mid x) = 0$,

$$\mathrm{E}(\varepsilon + ux \mid x) = 0$$

but

$$\mathrm{Var}(\varepsilon + ux \mid x) = \sigma^2 + 2\sigma_{\varepsilon u} x + \sigma_u^2 x^2$$

with $\sigma_{\varepsilon u} = \mathrm{E}(\varepsilon u)$, $\sigma_u^2 = \mathrm{E}(u^2)$. Hence the composite error is heteroskedastic.
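A simulation sketch of this random-coefficient example (the distributions and parameter values are illustrative): because the composite disturbance is $\varepsilon + ux$, its conditional variance grows with $x^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
x = rng.uniform(0.5, 3.0, size=n)
u = rng.normal(scale=0.8, size=n)       # population variation in the slope coefficient
eps = rng.normal(scale=1.0, size=n)     # disturbance, independent of u and x
composite = eps + u * x                 # composite disturbance of y = b0 + b1*x + eps + u*x

# sample variance of the composite disturbance in bins of x
for lo, hi in [(0.5, 1.0), (1.5, 2.0), (2.5, 3.0)]:
    sel = (x >= lo) & (x < hi)
    print(f"x in [{lo}, {hi}): Var = {composite[sel].var():.2f}")
# the variance rises with x, roughly 1 + 0.64 * x^2: heteroskedasticity
```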


Failure of uncorrelated disturbances: serial correlation

Assume serial correlation of order 1 in the disturbances

$$\varepsilon_i = \rho \varepsilon_{i-1} + u_i$$

This applies in time-series data.
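A small sketch of first-order serial correlation (the value of $\rho$ and the sample size are made up): adjacent disturbances are correlated, so the off-diagonal terms of $\mathrm{E}(\varepsilon\varepsilon' \mid X)$ are not zero.

```python
import numpy as np

rng = np.random.default_rng(3)
n, rho = 10_000, 0.7
u = rng.normal(size=n)

eps = np.zeros(n)
for i in range(1, n):
    eps[i] = rho * eps[i - 1] + u[i]    # eps_i = rho * eps_{i-1} + u_i

# first-order sample autocorrelation is close to rho, not 0
print(np.corrcoef(eps[1:], eps[:-1])[0, 1])
```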
Finally, consider Assumption 3.

If $\mathrm{rank}(X) = K$, then $Xa = 0$ if and only if $a = 0$, with $a$ a $K$-vector, i.e. there is no linear relation between the $K$ variables.


Failure of Assumption 3: wage equation

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \varepsilon_i, \quad i = 1, \ldots, n$$

with

$x_2$ = schooling (years)
$x_3$ = age
$x_4$ = potential experience (age − years in school − 6)


Hence

$$x_4 = x_3 - x_2 - 6$$


Define for any constant $c$:

$$\tilde{\beta}_1 = \beta_1 - 6c, \quad \tilde{\beta}_2 = \beta_2 - c, \quad \tilde{\beta}_3 = \beta_3 + c, \quad \tilde{\beta}_4 = \beta_4 - c$$

Then also

$$y_i = \tilde{\beta}_1 x_{i1} + \tilde{\beta}_2 x_{i2} + \tilde{\beta}_3 x_{i3} + \tilde{\beta}_4 x_{i4} + \varepsilon_i, \quad i = 1, \ldots, n$$

Conclusion: We cannot distinguish between $\beta_1, \ldots, \beta_4$ and $\tilde{\beta}_1, \ldots, \tilde{\beta}_4$. For all $c$ these parameters are observationally equivalent.

The problem is also clear if we substitute for $x_4$:

$$y_i = (\beta_1 - 6\beta_4) x_{i1} + (\beta_2 - \beta_4) x_{i2} + (\beta_3 + \beta_4) x_{i3} + \varepsilon_i$$
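A numerical sketch of this failure (the data are made up, not from a real wage survey): with $x_4 = x_3 - x_2 - 6$ the columns of $X$ are linearly dependent, so $\mathrm{rank}(X) < K$ and $X'X$ cannot be inverted.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
school = rng.integers(8, 18, size=n).astype(float)     # x2: years of schooling
age = rng.integers(25, 60, size=n).astype(float)       # x3: age
experience = age - school - 6                          # x4: potential experience

X = np.column_stack([np.ones(n), school, age, experience])
print(np.linalg.matrix_rank(X))          # 3, not K = 4
print(np.linalg.cond(X.T @ X))           # enormous (or inf): X'X is singular
# inverting X'X here would fail or give meaningless results
```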


By Assumptions 1 and 2, $\mathrm{E}(\varepsilon \mid X) = 0$, $\mathrm{Var}(\varepsilon \mid X) = \sigma^2 I$. Sometimes it is assumed that

$$\varepsilon \mid X \sim N(0, \sigma^2 I)$$

Why is the normal distribution a natural choice?

In the sequel we sometimes assume:

Assumption 4: $\varepsilon \mid X \sim N(0, \sigma^2 I)$


Linear regression as projection

An alternative interpretation of linear regression is as a conditional mean function.

In the previous derivation the regression coefficient $\beta_j$ in the multiple linear regression is the partial effect on $y$:

$$\beta_j = \frac{\partial f}{\partial x_j}(\bar{x}_1, \ldots, \bar{x}_W)$$

For this result Assumption 1 was necessary.

Alternative: Consider $\beta_j$ as the coefficient of $x_j$ in a linear conditional mean function.

To keep things simple we consider the case of one explanatory variable (and an intercept).


The dependent variable $y$ and the independent variable $x$ have a joint population distribution with joint frequency distribution $f(x, y)$.

Example: $y$ = savings rate, $x$ = income, and $f(x, y)$ is the frequency distribution over all US households.

By a sample survey we obtain $y_i, x_i$, $i = 1, \ldots, n$. This survey can be used to obtain an estimate of the population $f(x, y)$, denoted by $\hat{f}(x, y)$ (see Table 1.1).

Note: the savings rate and income have been discretized.


From the joint frequency in the sample we obtain the sample conditional frequency distribution (see Table 1.2)

$$\hat{f}(y \mid x) = \frac{\hat{f}(y, x)}{\hat{f}(x)}$$

If there is an exact relation between $y$ and $x$, then every column should have one 1 and the rest 0s.

If not, we can consider the average of $y$ for every value of $x$. This is the conditional mean function.

In the population

$$m(x) = \mathrm{E}(y \mid x) = \sum_y y \, f(y \mid x)$$

with estimate

$$\hat{m}(x) = \sum_y y \, \hat{f}(y \mid x)$$
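A sketch of estimating $m(x)$ by the sample conditional mean; the data below are simulated, not the savings-rate table from the lecture. For each distinct value of $x$, $\hat{m}(x)$ is simply the average of the $y$'s observed at that value.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
x = rng.integers(1, 6, size=n)                    # discretized regressor (e.g. income class)
y = 0.05 * x + rng.normal(scale=0.1, size=n)      # e.g. savings rate

# sample conditional mean: average of y at each value of x
for val in np.unique(x):
    m_hat = y[x == val].mean()
    print(f"m_hat({val}) = {m_hat:.3f}")          # estimates E(y | x = val)
```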


Note that $\hat{m}(x)$ may be rough: sampling variation around the smooth population $m(x)$.


Why is $m(x) = \mathrm{E}(y \mid x)$ interesting?

One reason is optimal prediction. Assume the joint population distribution $f(x, y)$ is known and that you have a random draw from this distribution. Only $x$ is revealed and you must predict $y$. What is the best predictor $h(x)$?

Criterion: minimize the expected squared prediction error

$$\mathrm{E}\big[(y - h(x))^2\big] = \mathrm{E}\big[\big((y - m(x)) + (m(x) - h(x))\big)^2\big]$$
$$= \mathrm{E}\big[(y - m(x))^2\big] + 2\,\mathrm{E}\big[(y - m(x))(m(x) - h(x))\big] + \mathrm{E}\big[(m(x) - h(x))^2\big] \geq \mathrm{E}\big[(y - m(x))^2\big]$$

(the cross term is 0 by the law of iterated expectations), and this lower bound is achieved if $h(x) = m(x)$.

Conclusion: The optimal predictor is $h(x) = \mathrm{E}(y \mid x)$.


Now let us restrict ourselves to linear prediction

$$h(x) = a + bx$$

Best linear predictor:

$$\min_{a,b} \ \mathrm{E}\big[(y - a - bx)^2\big]$$

First-order conditions, with $u = y - a - bx$:

$$-2\,\mathrm{E}(u) = 0 \;\Rightarrow\; \mathrm{E}(y) = a + b\,\mathrm{E}(x)$$
$$-2\,\mathrm{E}(ux) = 0 \;\Rightarrow\; \mathrm{E}(xy) = a\,\mathrm{E}(x) + b\,\mathrm{E}(x^2)$$


Solution (compare with the OLS solution):

$$b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}$$
$$a = \mathrm{E}(y) - b\,\mathrm{E}(x)$$

If we replace population moments with sample moments we obtain

$$b = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$
$$a = \bar{y} - b\bar{x}$$

This is the OLS solution!
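A quick numerical check of this equivalence (simulated data, illustrative coefficient values): the slope and intercept computed from sample moments coincide with the least-squares fit of $y$ on a constant and $x$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

# coefficients from sample moments
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# least-squares fit of y on a constant and x
X = np.column_stack([np.ones(n), x])
a_ols, b_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(a, b)
print(a_ols, b_ols)   # identical up to rounding: the OLS solution
```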


There is an important difference between the case where the conditional mean is linear and the case where it is not.

If $\mathrm{E}(y \mid x) = a + bx$, we have for $u = y - a - bx$

$$\mathrm{E}(u \mid x) = \mathrm{E}(y - a - bx \mid x) = \mathrm{E}\big(y - \mathrm{E}(y \mid x) \mid x\big) = 0$$

This is Assumption 1 in the CLR model.

However, $b$ does not have a structural interpretation as a partial effect!


If the conditional mean is not linear, we have from the first-order condition $\mathrm{E}(ux) = 0$.

This is weaker:

$$\mathrm{E}(u \mid x) = 0 \;\Rightarrow\; \mathrm{E}(ux) = 0$$

but

$$\mathrm{E}(ux) = 0 \;\not\Rightarrow\; \mathrm{E}(u \mid x) = 0$$

Hence in that case Assumption 1 of the CLR model is not satisfied, but only a weaker uncorrelatedness assumption.


The Ordinary Least Squares (OLS) estimator

We have $n$ observations $y_i, x_{i2}, \ldots, x_{iK}$, $i = 1, \ldots, n$. We organize the data in the $n \times 1$ vector $y$ and the $n \times K$ matrix $X$ with

$$X = \begin{pmatrix} 1 & x_{12} & \cdots & x_{1K} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n2} & \cdots & x_{nK} \end{pmatrix}$$

The observations are a sample from a population and we assume that the joint distribution of $y, X$ is such that the CLR model is the appropriate statistical model, i.e. $y, X$ satisfy

$$y = X\beta + \varepsilon$$


for some $K \times 1$ vector of regression coefficients $\beta$ and some $n \times 1$ vector of random errors $\varepsilon$ with a distribution that satisfies

$$\mathrm{E}(\varepsilon \mid X) = 0$$
$$\mathrm{E}(\varepsilon\varepsilon' \mid X) = \sigma^2 I$$

This specifies the random experiment of which $y, X$ is the outcome (the CLR model).

We can now discuss statistical inference: estimation of the population parameters and tests of hypotheses concerning the population parameters and other aspects of the population distribution.


The setup applies to both cross-section and time-series data. In the first case $y_i, x_{i2}, \ldots, x_{iK}$, $i = 1, \ldots, n$ is a random sample and the CLR assumptions on the population distribution can be made on

$$y_1 = x_1'\beta + \varepsilon_1$$

Because the observations are independent we can obtain the joint distribution of $y, X$ from the marginal distributions. For time-series data the observations are not independent and the CLR model applies directly to the joint distribution of $y, X$.


Estimation of $\beta$ and $\sigma^2$

The solution to the minimization of the sum of squared deviations/residuals is

$$b = (X'X)^{-1}X'y$$

Note that this requires $\mathrm{rank}(X) = K$. This is an estimator of $\beta$ (it only depends on the data): the Ordinary Least Squares (OLS) estimator of $\beta$.
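A sketch of computing $b = (X'X)^{-1}X'y$ directly on simulated data (the coefficient values are illustrative; in practice a least-squares routine or QR decomposition is numerically preferable to forming the inverse explicitly).

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 200, 3
beta = np.array([1.0, 2.0, -0.5])                     # illustrative true coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
y = X @ beta + rng.normal(size=n)

b = np.linalg.inv(X.T @ X) @ X.T @ y                  # b = (X'X)^{-1} X'y
print(b)                                              # OLS estimate of beta
```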

Is the OLS estimator a good estimator?

In mathematical statistics estimators are evaluated by considering their sampling distribution, i.e. their distribution in repeated samples $(y_s, X_s)$, $s = 1, \ldots, S$.


All samples are realizations of the CLR random experiment

$$y_s = X_s\beta + \varepsilon_s, \quad s = 1, \ldots, S$$

The sampling distribution of the OLS estimator $b$ is the frequency distribution of $b_s$, $s = 1, \ldots, S$, for $S$ large. We can obtain this distribution by computer simulation (as in assignment 2).
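A simulation sketch in the spirit of such an exercise (the sample size, number of replications, and parameter values are made up): draw $S$ samples from the CLR experiment with $X$ held fixed and look at the distribution of the resulting OLS estimates.

```python
import numpy as np

rng = np.random.default_rng(8)
n, S, sigma = 100, 5_000, 1.0
beta = np.array([1.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # keep X fixed across replications

b_draws = np.empty((S, 2))
for s in range(S):
    eps = rng.normal(scale=sigma, size=n)               # new disturbances each replication
    y = X @ beta + eps
    b_draws[s] = np.linalg.lstsq(X, y, rcond=None)[0]

print(b_draws.mean(axis=0))    # close to beta: the OLS estimator is unbiased
```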


An alternative is to use the CLR assumptions and the rules of probability theory to derive (features of) the sampling distribution of $b$.

Consider

$$b = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon$$

From this we can, using the CLR assumptions, find the conditional average of $b$ given $X$ (in the sampling distribution)

$$\mathrm{E}(b \mid X) = \beta + (X'X)^{-1}X'\,\mathrm{E}(\varepsilon \mid X) = \beta$$

Hence the unconditional average of $b$ is (law of iterated expectations)

$$\mathrm{E}(b) = \mathrm{E}_X\big(\mathrm{E}(b \mid X)\big) = \beta$$

In words: under the CLR assumptions the OLS estimator is unbiased for $\beta$.


Besides the mean, consider the variance of $b$:

$$\mathrm{Var}(b) = \mathrm{E}\big[(b - \mathrm{E}(b))(b - \mathrm{E}(b))'\big] = \mathrm{E}\big[(b - \beta)(b - \beta)'\big]$$

We have

$$b - \beta = (X'X)^{-1}X'\varepsilon$$

Upon substitution

$$\mathrm{Var}(b) = \mathrm{E}\big[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}\big]$$


We have

$$\mathrm{E}\big[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1} \mid X\big] = (X'X)^{-1}X'\,\mathrm{E}(\varepsilon\varepsilon' \mid X)\,X(X'X)^{-1} = \sigma^2(X'X)^{-1}$$

and hence

$$\mathrm{Var}(b) = \mathrm{E}_X\big[\sigma^2(X'X)^{-1}\big] = \sigma^2\,\mathrm{E}\big[(X'X)^{-1}\big]$$

Note that $\sigma^2(X'X)^{-1}$ is an unbiased estimator of this variance.
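A sketch checking this result by simulation (same kind of made-up CLR setup as above, with $\sigma^2$ treated as known): the empirical covariance matrix of the OLS draws across replications should be close to $\sigma^2(X'X)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, S, sigma = 100, 20_000, 1.0
beta = np.array([1.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X fixed across replications

b_draws = np.empty((S, 2))
for s in range(S):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    b_draws[s] = np.linalg.lstsq(X, y, rcond=None)[0]

emp_var = np.cov(b_draws, rowvar=False)            # empirical Var(b) across replications
theory = sigma**2 * np.linalg.inv(X.T @ X)         # sigma^2 (X'X)^{-1}
print(emp_var)
print(theory)                                      # the two matrices should be close
```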


In the special case of a constant and one regressor we have the unbiased variance estimator

$$\widehat{\mathrm{Var}}(b_2) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

Note that this decreases with the variation in $x$ and increases with $\sigma^2$.


Optimality of the OLS estimator in the CLR model

Consider the class of estimators for $\beta$ that are linear in $y$, i.e.

$$b_L = Cy$$

For the OLS estimator $C = (X'X)^{-1}X'$, i.e. $C$ in general depends on $X$.

Gauss-Markov Theorem: In the CLR model the OLS estimator is the Best Linear Unbiased (BLU) estimator of $\beta$, i.e. it has the smallest variance among linear unbiased estimators.
