
Estimation and Inference for Linear Regression in Matrix Notation
Simon Jackman
Department of Political Science
Stanford University
January 29, 2007
The Linear Model
y = Xb + e
where
X is an n-by-k matrix of data (independent variables),
y is an n-by-1 vector of data (dependent variable),
e is an n-by-1 vector of disturbances, and
b is a k-by-1 vector of parameters to be estimated.
Estimation via Least Squares
Take as our estimator that value of the unknown population parameter which minimizes the sum of squared errors.
Apply Least Squares to the Linear Regression Model
Note that e = y - Xb. Thus
SSE = \sum_{i=1}^{n} e_i^2 = [e_1 \; e_2 \; \ldots \; e_n]_{(1 \times n)} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}_{(n \times 1)}
    = e'e_{(1 \times 1)}
    = (y - Xb)'(y - Xb)
    = (y' - b'X')(y - Xb)
    = y'y - y'Xb - b'X'y + b'X'Xb, since (Xb)' = b'X'
    = y'y - 2b'X'y + b'X'Xb,
since b'X'y is a scalar, it is equal to its transpose y'Xb.
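A quick numerical spot-check of the expansion above can be reassuring. The following is an illustrative sketch (Python/NumPy, simulated data and an arbitrary trial value of b; not part of the original notes) confirming that y'y - 2b'X'y + b'X'Xb equals the sum of squared errors:

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design matrix with a column of ones
y = rng.normal(size=n)
b = rng.normal(size=k)                       # an arbitrary trial value of b

e = y - X @ b                                # errors at this trial b
sse_direct = e @ e                           # e'e
sse_expanded = y @ y - 2 * b @ X.T @ y + b @ X.T @ X @ b
print(np.allclose(sse_direct, sse_expanded))  # True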
Differentiate SSE
To find the value of b that minimizes
SSE = y'y - 2b'X'y + b'X'Xb,
differentiate wrt b:
\frac{\partial SSE}{\partial b} = -2X'y + 2X'Xb,
noting that \frac{\partial(b'X'y)}{\partial b} = X'y and \frac{\partial(b'X'Xb)}{\partial b} = 2X'Xb.
First-Order Condition
Set 1st derivative to zero (1st order condition), and solve for b:
\frac{\partial SSE}{\partial b} = 0_{(k \times 1)},
-2X'y + 2X'Xb = 0
X'X b^* = X'y    (the normal equations)
(X'X)^{-1} X'X b^* = (X'X)^{-1} X'y
b^* = (X'X)^{-1} X'y,
provided (X'X)^{-1} exists.
Second-Order Condition
Check that this solution yields a minimum (2nd order condition):
\frac{\partial^2 SSE}{\partial b \, \partial b'} = 2X'X,
i.e., b^* minimizes SSE provided X'X is positive definite.
For a definition of positive definiteness, see Graybill (1983), Matrices with Applications in Statistics, 2nd edition, Wadsworth: Belmont, California; section 12.2:
Definition: An n-by-n matrix A is positive definite if and only if
1. A = A' (i.e., A is symmetric)
2. z'Az > 0 for all z \in R^n, z \neq 0.
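As a practical aside (an illustrative NumPy sketch with simulated data, not from the original notes): when X has full column rank, X'X is symmetric and positive definite, which can be checked numerically via its eigenvalues or a Cholesky factorization:

import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

XtX = X.T @ X
print(np.allclose(XtX, XtX.T))              # True: X'X is symmetric
print(np.all(np.linalg.eigvalsh(XtX) > 0))  # True: all eigenvalues positive, so positive definite
np.linalg.cholesky(XtX)                     # succeeds; would raise LinAlgError if not positive definite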
Least Squares Estimate of b
Having verified that we have a minimizer of the SSE, we can now write
\hat{b}_{OLS} = (X'X)^{-1} X'y,
where the hat or caret denotes an estimate of the population quantity b.
OLS: Ordinary Least Squares.
Later in the quarter we will consider a generalized least squares (GLS) estimator of b.
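A minimal computational sketch of \hat{b}_{OLS} = (X'X)^{-1} X'y (Python/NumPy, simulated data and illustrative coefficient values; not part of the original notes). Solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])      # n-by-k design matrix, k = 3
b_true = np.array([1.0, 2.0, -0.5])            # illustrative "population" coefficients
y = X @ b_true + rng.normal(size=n)

# b_hat solves the normal equations X'X b = X'y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)                                   # close to b_true, but not exactly equal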
Existence of the Least Squares Estimator
\hat{b}_{OLS} = (X'X)^{-1} X'y.
Given data X and y, the only possibly non-trivial part of computing the least squares estimator is inverting X'X.
(X'X)^{-1} exists \iff X has full column rank,
i.e., there are no linear dependencies among the columns of X.
What is X'X?
If x_i = [x_{i1} \; \ldots \; x_{ik}], i = 1, \ldots, n, then
X = \begin{bmatrix} x_{11} & \ldots & x_{1k} \\ x_{21} & \ldots & x_{2k} \\ \vdots & & \vdots \\ x_{n1} & \ldots & x_{nk} \end{bmatrix}
X usually has a unit vector (a column of ones) so as to generate an intercept term in the model:
y_i = b_1 \cdot 1 + b_2 x_{i2} + \ldots + b_k x_{ik} + e_i.
When an intercept is present, we have k - 1 predictors, plus the intercept's column of ones, to make up the k columns of X.
What is X'X?
X'X = \begin{bmatrix}
\sum_{i=1}^{n} x_{i1}^2 & \sum_{i=1}^{n} x_{i1}x_{i2} & \ldots & \sum_{i=1}^{n} x_{i1}x_{ik} \\
\sum_{i=1}^{n} x_{i2}x_{i1} & \sum_{i=1}^{n} x_{i2}^2 & \ldots & \sum_{i=1}^{n} x_{i2}x_{ik} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{i=1}^{n} x_{ik}x_{i1} & \sum_{i=1}^{n} x_{ik}x_{i2} & \ldots & \sum_{i=1}^{n} x_{ik}^2
\end{bmatrix}
n.b., a square (k-by-k), symmetric matrix, with sums of squares on the leading diagonal and cross-products on the off-diagonals.
What is X'X?
X'X is the same k-by-k matrix of sums of squares and cross-products displayed above.
If any of the variables in X are linear combinations of other variables, then this will create linear dependencies in the columns of X'X: this is known as perfect multicollinearity.
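A small sketch of what perfect multicollinearity does in practice (Python/NumPy, simulated data with an illustrative linear dependence; not part of the original notes): adding a column that is an exact linear combination of other columns leaves X rank-deficient, so X'X is singular and (X'X)^{-1} does not exist.

import numpy as np

rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = 2.0 * x1 - 3.0 * x2                    # an exact linear combination of x1 and x2
X = np.column_stack([np.ones(n), x1, x2, x3])

print(np.linalg.matrix_rank(X))             # 3, not 4: X lacks full column rank
print(np.linalg.eigvalsh(X.T @ X).min())    # essentially 0: X'X is singular, so it cannot be inverted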
Existence of (X'X)^{-1}
We also need the condition n > k, or, that we have more observations than independent variables.
Definition: Rank of a matrix. If A is an r-by-c matrix, then
1. the column rank of A is the number of linearly independent columns
2. the row rank of A is the number of linearly independent rows
3. column rank = row rank = rank of A, written r(A)
Results on ranks
If A is an r-by-c matrix, then
1. r(A) \leq \min(r, c)
2. r(A) = r(A') = r(AA') = r(A'A)
3. if B is a square p-by-p matrix, then B is singular if r(B) < p.
Thus, X'X formed with n < k cannot be inverted: r(X'X) = r(X) \leq n < k. In the limiting case n = k we can invert X, but the resulting regression model overfits (perfectly fits!) the data: R^2 = 1.0.
Properties of \hat{b}_{OLS}
1. Is the OLS estimator unbiased?
2. What is its variability in repeated samples? Is it efficient?
3. Asymptotic properties, i.e., consistency (but not today).
Assumptions
1. X is full-rank
2. random sampling: the (x_i, y_i) are independent draws from their joint distribution
3. weak exogeneity: E(e_i | x_i) = 0, i = 1, \ldots, n
4. conditional homoskedasticity: var(e_i | x_i) = \sigma^2, i = 1, \ldots, n
Strict Exogeneity
Assumption 2 (random sampling) and Assumption 3 (weak exogeneity) imply strict exogeneity:
E(e_i | x_1, \ldots, x_n) = E(e_i | X) = 0, i = 1, \ldots, n,
or
E(e | X) = 0.
iid disturbances (conditional on X)
Assumption 2 (random sampling), Assumption 3 (weak exogeneity), and Assumption 4 (conditional homoskedasticity) together imply that, conditional on X, the disturbances are iid (identically and independently distributed):
var(e | X) = E(ee' | X) = \sigma^2 I_n,
i.e., the (conditional) variance-covariance matrix of the disturbances is a diagonal matrix (covariances are all zero), with a constant term \sigma^2 on the diagonal (variances are all \sigma^2). We'll come back to this presently.
Unbiasedness
We seek the conditions under which E(\hat{b}) = b. Begin by considering the expectation of \hat{b} conditional on X:
E(\hat{b} | X) = E[(X'X)^{-1} X'y | X]
             = E[(X'X)^{-1} X'(Xb + e) | X],    since y = Xb + e
             = E[(X'X)^{-1} X'Xb + (X'X)^{-1} X'e | X]
             = E[b + (X'X)^{-1} X'e | X],    since (X'X)^{-1} X'X = I_k
             = b + (X'X)^{-1} X' E(e | X)
             = b,    since by assumption E(e | X) = 0.
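The conditional-expectation algebra above can be illustrated by simulation: hold X fixed, redraw the disturbances many times, and average the resulting estimates. A minimal sketch (Python/NumPy, illustrative values; not part of the original notes):

import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X held fixed across replications
b = np.array([1.0, 2.0])                                 # illustrative true coefficients
sigma = 1.5

b_hats = np.empty((reps, 2))
for r in range(reps):
    e = rng.normal(scale=sigma, size=n)                  # E(e|X) = 0 by construction
    y = X @ b + e
    b_hats[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(b_hats.mean(axis=0))   # close to (1.0, 2.0): E(b_hat | X) = b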
Unbiasedness
We have
E(\hat{b} | X) = b,
and we get the unconditional result E(\hat{b}) = b by applying the Law of Iterated Expectations.
Unbiasedness
To summarize: if
1. E(e | X) = 0 (strict exogeneity), and
2. (X'X)^{-1} exists (n > k and no perfect multicollinearity in X; full column rank),
then
E(\hat{b}) = b.
Violation of E(e | X) = 0
Omitted Variable Bias
The true model is
y = X_1 b_1 + X_2 b_2 + e,
but the model estimated is
y = X_1 b_1 + e^*,
where
e^* = X_2 b_2 + e.
LS estimate:
\hat{b}_1 = (X_1'X_1)^{-1} X_1' y
Omitted Variable Bias
Let X = [X_1 \; X_2]. Taking conditional expectations,
E(\hat{b}_1 | X) = E[(X_1'X_1)^{-1} X_1' y | X]
              = E[(X_1'X_1)^{-1} X_1' (X_1 b_1 + X_2 b_2 + e) | X]
              = b_1 + E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X] + E[(X_1'X_1)^{-1} X_1' e | X].
Since our assumption is that the true model is correctly specified, E(e | X) = 0, which implies E(X_1'e) = 0, and so
E(\hat{b}_1 | X) = b_1 + \underbrace{E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X]}_{\text{bias}}
Omitted Variable Bias
E(\hat{b}_1 | X) = b_1 + \underbrace{E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X]}_{\text{bias}}
The bias is eliminated if and only if
1. X_1'X_2 = 0 (i.e., X_1 and X_2 are uncorrelated); or
2. b_2 = 0 (i.e., X_2 wasn't really in the true model to begin with).
Omitted Variable Bias
Thus, omitting a relevant independent variable usually biases the coefficients of the variables in the estimated model.
How consequential is this in practice?
Omitted Variable Bias
We can sign the omitted variable bias. The bias in \hat{b}_1 is
E(\hat{b}_1 | X) - b_1 = E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X].
Since (X_1'X_1)^{-1} is a positive definite matrix, the sign of the omitted variable bias depends on
1. the sign of X_1'X_2 (the covariances between X_1 and X_2), and
2. the sign of b_2.
Omitted Variable Bias
Bias in \hat{b}_1 from omission of X_2:
E(\hat{b}_1 | X) - b_1 = (X_1'X_1)^{-1} X_1' X_2 b_2
Consider the following cases:
X_1'X_2     b_2         Bias: E(\hat{b}_1) - b_1     \hat{b}_1 is
positive    positive    positive                     over-estimated
positive    negative    negative                     under-estimated
negative    positive    negative                     under-estimated
negative    negative    positive                     over-estimated
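The sign predictions in the table can be checked by simulation. The sketch below (Python/NumPy, illustrative data-generating process; not part of the original notes) generates positively correlated x1 and x2 with a positive b2, omits x2 from the fitted model, and shows that the coefficient on x1 is over-estimated on average, with the average bias close to (X_1'X_1)^{-1} X_1'X_2 b_2:

import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 2000
b1, b2 = 1.0, 2.0                               # illustrative true coefficients
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)   # x1 and x2 positively correlated; X held fixed
X1 = np.column_stack([np.ones(n), x1])          # mis-specified design: x2 omitted

coef_on_x1 = np.empty(reps)
for r in range(reps):
    y = 0.5 + b1 * x1 + b2 * x2 + rng.normal(size=n)
    coef_on_x1[r] = np.linalg.solve(X1.T @ X1, X1.T @ y)[1]

analytic_bias = np.linalg.solve(X1.T @ X1, X1.T @ x2)[1] * b2   # slope element of (X1'X1)^{-1} X1'X2 b2
print(coef_on_x1.mean() - b1, analytic_bias)    # both positive and approximately equal: over-estimation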
Model Specification is a Compromise
The flip side of omitted variable bias is multicollinearity (see Fox, Ch. 13), giving rise to a bias-variance tradeoff:
Bias: Excluding relevant X variables risks biasing the coefficients of the included variables.
Variance: Including (possibly irrelevant) X variables inflates the variances (and hence the standard errors) of the estimated coefficients (resulting from the collinearity among the X variables). In addition, we chew up degrees of freedom and clutter our model (lack of parsimony).
Properties of \hat{b}_{OLS} Related to Inference
Classical hypothesis testing requires an assessment of the sampling variability of the estimator. Since the least squares estimator \hat{b} = (X'X)^{-1} X'y is a vector of length k, its sampling distribution is a multivariate distribution with
1. a mean vector, and
2. a variance-covariance matrix.
Conditional Variance-Covariance Matrix of \hat{b}
var(\hat{b} | X) = E[(\hat{b} - E(\hat{b}))(\hat{b} - E(\hat{b}))' | X],    (i.e., the definition of variance)
              = E[(\hat{b} - b)(\hat{b} - b)' | X],    since E(\hat{b}) = b
              = E[(X'X)^{-1} X'e [(X'X)^{-1} X'e]' | X],    since \hat{b} - b = (X'X)^{-1} X'e
              = E[(X'X)^{-1} X'ee'X(X'X)^{-1} | X]
              = \sigma^2 (X'X)^{-1},    since, by assumption, E(ee' | X) = \sigma^2 I_n.
Estimating the Conditional Variance-Covariance Matrix of \hat{b}
We estimate
var(\hat{b} | X) = \sigma^2 (X'X)^{-1}
using the sample estimate of \sigma^2 (derived below),
\hat{\sigma}^2 = \frac{\hat{e}'\hat{e}}{n - k},
yielding
\widehat{var}(\hat{b} | X) = \hat{\sigma}^2 (X'X)^{-1}.
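Putting this in code: a minimal sketch (Python/NumPy, simulated data and illustrative coefficients; not part of the original notes) that computes \hat{\sigma}^2 = \hat{e}'\hat{e}/(n - k), the estimated variance-covariance matrix \hat{\sigma}^2 (X'X)^{-1}, and the standard errors as square roots of its diagonal:

import numpy as np

rng = np.random.default_rng(6)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
k = X.shape[1]
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y
e_hat = y - X @ b_hat                         # residuals
sigma2_hat = e_hat @ e_hat / (n - k)          # sigma^2_hat = e_hat'e_hat / (n - k)
vcov = sigma2_hat * XtX_inv                   # estimated k-by-k var-covar matrix of b_hat
se = np.sqrt(np.diag(vcov))                   # standard errors of the coefficients
print(b_hat, se)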
Estimating the Conditional Variance-Covariance Matrix of \hat{b}
\widehat{var}(\hat{b} | X) = \hat{\sigma}^2 (X'X)^{-1}
is a scalar times a k-by-k symmetric matrix.
Conditional variances of \hat{b} are on the diagonal; conditional covariances are on the off-diagonals.
The standard errors of the estimated coefficients are obtained by taking the square root of the diagonal.
Estimating the Conditional Variance-Covariance Matrix of \hat{b}
\widehat{var}(\hat{b} | X) = \hat{\sigma}^2 (X'X)^{-1}    (a k-by-k matrix)
= \begin{bmatrix}
\widehat{var}(\hat{b}_1 | X) & \widehat{cov}(\hat{b}_1, \hat{b}_2 | X) & \ldots & \widehat{cov}(\hat{b}_1, \hat{b}_k | X) \\
\widehat{cov}(\hat{b}_2, \hat{b}_1 | X) & \widehat{var}(\hat{b}_2 | X) & \ldots & \widehat{cov}(\hat{b}_2, \hat{b}_k | X) \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{cov}(\hat{b}_k, \hat{b}_1 | X) & \widehat{cov}(\hat{b}_k, \hat{b}_2 | X) & \ldots & \widehat{var}(\hat{b}_k | X)
\end{bmatrix}
\widehat{se}(\hat{b}_j | X) = \sqrt{\widehat{var}(\hat{b}_j | X)}, j = 1, \ldots, k.
Estimate of \sigma^2
How to estimate \sigma^2 = var(e_i)? Note we don't observe e, just
\hat{e} = y - X\hat{b}
      = y - X(X'X)^{-1}X'y
      = [I_n - X(X'X)^{-1}X']y
      = My,
where M = I_n - X(X'X)^{-1}X' is a symmetric idempotent matrix (i.e., M has the property that M'M = M). Substituting for y,
\hat{e} = M(Xb + e)
      = Me,
since MX = (I_n - X(X'X)^{-1}X')X = I_n X - X(X'X)^{-1}X'X = X - X = 0.
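The claims about the "residual maker" M can be verified numerically. A sketch (Python/NumPy, simulated data; not part of the original notes): M is symmetric and idempotent, MX = 0, and \hat{e} = My = Me.

import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
b = np.array([1.0, -1.0, 0.5])                          # illustrative coefficients
e = rng.normal(size=n)
y = X @ b + e

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(M, M.T))          # symmetric
print(np.allclose(M @ M, M))        # idempotent: MM = M
print(np.allclose(M @ X, 0))        # MX = 0
e_hat = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(e_hat, M @ y), np.allclose(e_hat, M @ e))   # e_hat = My = Me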
Estimate of \sigma^2
\hat{e} = Me.
Because M is symmetric idempotent (M = M'M),
\hat{e}'\hat{e} = e'M'Me = e'Me.
We'll exploit this identity shortly; contrast the derivation in Fox 10.3.
Properties of trace
But first, a little more theory from linear algebra...
Definition: the trace of a square matrix A, tr(A), is equal to the sum of the elements on its diagonal.
tr(ABC) = tr(CAB) = tr(BCA)
tr(E(A)) = E(tr(A))
tr(I_p) = p
tr(A + B) = tr(A) + tr(B)
if k is a scalar, tr(kA) = k tr(A)
We now make use of these properties in evaluating E(\hat{e}'\hat{e} | X) = E(e'Me | X).
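These trace properties are easy to spot-check numerically; a small sketch with random matrices (Python/NumPy, illustrative dimensions; not part of the original notes):

import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))    # cyclic permutation
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))
print(np.trace(np.eye(7)))                                     # tr(I_p) = p
D, E = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(np.isclose(np.trace(D + E), np.trace(D) + np.trace(E)))  # additivity
print(np.isclose(np.trace(3.0 * D), 3.0 * np.trace(D)))        # scalar multiplication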
Estimate of \sigma^2
E(\hat{e}'\hat{e} | X) = E(e'Me | X)
                  = E[tr(e'Me) | X],    (the trace of a scalar equals the scalar)
                  = E[tr(Mee') | X],    (changing the order of multiplication inside the trace)
                  = tr(M E(ee' | X)),    (the trace and expectations operators are both linear)
                  = tr(M \sigma^2 I_n),    (by assumption, E(ee' | X) = \sigma^2 I_n)
                  = \sigma^2 tr(M)
                  = \sigma^2 tr(I_n - X(X'X)^{-1}X')
                  = \sigma^2 (tr(I_n) - tr[(X'X)^{-1}X'X])
                  = \sigma^2 (tr(I_n) - tr(I_k))
                  = \sigma^2 (n - k).
Estimate of \sigma^2
That is,
\hat{\sigma}^2 = \frac{\hat{e}'\hat{e}}{n - k} = \frac{\sum_{i=1}^{n} \hat{e}_i^2}{n - k}
is a conditionally unbiased estimator of \sigma^2, i.e.,
if E(ee' | X) = \sigma^2 I_n, then E(\hat{\sigma}^2 | X) = \sigma^2.
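The division by n - k (rather than n) is what delivers unbiasedness, and a quick Monte Carlo makes the point: averaging \hat{e}'\hat{e}/(n - k) over many draws of the disturbances recovers \sigma^2, while dividing by n comes up short. A sketch (Python/NumPy, illustrative values; not part of the original notes):

import numpy as np

rng = np.random.default_rng(9)
n, k, reps, sigma2 = 30, 4, 20000, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # X held fixed
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

s2_unbiased, s2_naive = np.empty(reps), np.empty(reps)
for r in range(reps):
    e = rng.normal(scale=np.sqrt(sigma2), size=n)
    e_hat = M @ e                              # residuals (Xb drops out since MX = 0)
    s2_unbiased[r] = e_hat @ e_hat / (n - k)
    s2_naive[r] = e_hat @ e_hat / n

print(s2_unbiased.mean())   # about 2.0
print(s2_naive.mean())      # about 2.0 * (n - k) / n, i.e. too small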
Consequence of Unbiased Estimation of \sigma^2
The consequence of having a conditionally unbiased estimator of \sigma^2 is that we recover an unbiased estimate of the marginal variance of \hat{b}, V(\hat{b}), instead of the conditional variance V(\hat{b} | X).
This is important since we don't want to limit ourselves to conclusions about the sampling variance of \hat{b} conditional on the X we obtained.
Using the Law of Iterated Expectations (and see Ruud 8.5),
E[\widehat{var}(\hat{b} | X)] = E[\hat{\sigma}^2 (X'X)^{-1}] = E_X[E(\hat{\sigma}^2 | X)(X'X)^{-1}]
                          = E_X[\sigma^2 (X'X)^{-1}]    (since \hat{\sigma}^2 is conditionally unbiased for \sigma^2)
                          = E_X[var(\hat{b} | X)]
                          = var(\hat{b}).
Standard Error of the Regression
The square root of \hat{\sigma}^2, \hat{\sigma}, is alternately called
the standard error of the regression
the standard error of the estimate
and is usefully thought of as the standard deviation of the residuals \hat{e}.
Standard Error of the Regression
\hat{\sigma} is a goodness-of-fit measure:
As R^2 \to 1.0, \hat{\sigma} \to 0.
As R^2 \to 0.0, \hat{\sigma} \to \sqrt{var(y)}.
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
We used this assumption in deriving the conditional variance-covariance matrix of \hat{b}:
recall that if E(ee' | X) = \sigma^2 I_n then
var(\hat{b} | X) = \sigma^2 (X'X)^{-1}.
But what does this mean?
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
E(ee' | X) is the (conditional) variance-covariance matrix of the disturbances, e. By the definition of variance,
var(e | X) = E{[e - E(e | X)][e - E(e | X)]' | X}
         = E(ee' | X),    since by assumption E(e | X) = 0.
Our assumption is thus that
var(e | X) = E(ee' | X) = \sigma^2 I_n.
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
var(e | X) = E(ee' | X) = \sigma^2 I_n is an n-by-n matrix:
var(e | X) = \sigma^2 I_n = \begin{bmatrix}
var(e_1 | X) & cov(e_1, e_2 | X) & \ldots & cov(e_1, e_n | X) \\
cov(e_2, e_1 | X) & var(e_2 | X) & \ldots & cov(e_2, e_n | X) \\
\vdots & \vdots & \ddots & \vdots \\
cov(e_n, e_1 | X) & cov(e_n, e_2 | X) & \ldots & var(e_n | X)
\end{bmatrix}
with variances on the diagonal and covariances on the off-diagonals.
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
E(ee' | X) = \begin{bmatrix}
var(e_1 | X) & cov(e_1, e_2 | X) & \ldots & cov(e_1, e_n | X) \\
cov(e_2, e_1 | X) & var(e_2 | X) & \ldots & cov(e_2, e_n | X) \\
\vdots & \vdots & \ddots & \vdots \\
cov(e_n, e_1 | X) & cov(e_n, e_2 | X) & \ldots & var(e_n | X)
\end{bmatrix}
= \sigma^2 I_n = \sigma^2 \begin{bmatrix}
1 & 0 & \ldots & 0 \\
0 & 1 & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & 1
\end{bmatrix}
= \begin{bmatrix}
\sigma^2 & 0 & \ldots & 0 \\
0 & \sigma^2 & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & \sigma^2
\end{bmatrix}
n.b., \sigma^2 on the diagonal (variances), 0 on the off-diagonals (covariances).
E(ee' | X) = \sigma^2 I_n: two assumptions in one
Conditionally Uncorrelated Errors:
E(e_i e_j | X) = 0, i \neq j.
This assumption is implied by the assumptions
1. random sampling: (x_i, y_i), i = 1, \ldots, n are iid draws from their joint distribution
2. weak exogeneity: E(e_i | x_i) = 0, i = 1, \ldots, n.
E(ee' | X) = \sigma^2 I_n: two assumptions in one
Identical Conditional Variances (conditional homoskedasticity):
var(e_i | X) = E(e_i^2 | X) = \sigma^2, i = 1, \ldots, n.
E(ee' | X) = \sigma^2 I_n: two assumptions in one
Together, these assumptions add up to the conditional iid assumption: i.e., the disturbances are conditionally independently and identically distributed.
Violation of E(ee' | X) = \sigma^2 I_n
Thus, violation of E(ee' | X) = \sigma^2 I_n arises in two ways:
1. Off-diagonal elements of E(ee' | X) are non-zero (auto-correlation, or spatial correlation, among the disturbances); in scalar terms, E(e_i e_j | X) \neq 0, i \neq j.
2. The elements on the leading diagonal of E(ee' | X) vary (heteroskedasticity); in scalar terms, var(e_i | X) = E(e_i^2 | X) = \sigma_i^2 \neq \sigma^2, i = 1, \ldots, n.
In both these cases
E(ee' | X) = \Sigma = \sigma^2 \Omega, \Omega \neq I_n.
Consequences of Violation of conditional iid assumption
The ordinary least squares estimator \hat{b}_{OLS} = (X'X)^{-1} X'y is unbiased (i.e., E(\hat{b}_{OLS}) = b). But the usual expression for the conditional variance-covariance matrix of \hat{b}_{OLS} is incorrect:
E[(\hat{b}_{OLS} - b)(\hat{b}_{OLS} - b)' | X] = E[(X'X)^{-1} X'ee'X(X'X)^{-1} | X]
                                        = (X'X)^{-1} X' \Sigma X (X'X)^{-1}
                                        = \sigma^2 (X'X)^{-1} X' \Omega X (X'X)^{-1}
                                        \neq \sigma^2 (X'X)^{-1}, \Omega \neq I_n,
which differs from the usual OLS estimate \hat{\sigma}^2 (X'X)^{-1}. As a consequence, estimates of standard errors will be wrong, and hence hypothesis tests may be invalid.
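To see the consequence concretely, the sketch below (Python/NumPy, an illustrative heteroskedastic data-generating process; not part of the original notes) compares the true conditional variance of the OLS slope, given by the "sandwich" form (X'X)^{-1} X' \Sigma X (X'X)^{-1} with known \Sigma, against what the usual formula \hat{\sigma}^2 (X'X)^{-1} reports; the two differ, so the usual standard errors are wrong (here, too small).

import numpy as np

rng = np.random.default_rng(10)
n, reps = 200, 5000
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])                  # X held fixed
sd_i = 0.5 * x                                         # disturbance sd grows with x: heteroskedasticity
XtX_inv = np.linalg.inv(X.T @ X)

b_hats, naive_var = np.empty(reps), np.empty(reps)
for r in range(reps):
    e = rng.normal(scale=sd_i, size=n)
    y = 1.0 + 2.0 * x + e
    b_hat = XtX_inv @ X.T @ y
    e_hat = y - X @ b_hat
    s2 = e_hat @ e_hat / (n - 2)
    b_hats[r] = b_hat[1]
    naive_var[r] = (s2 * XtX_inv)[1, 1]                # usual OLS variance estimate for the slope

sandwich = XtX_inv @ X.T @ np.diag(sd_i**2) @ X @ XtX_inv   # (X'X)^{-1} X' Sigma X (X'X)^{-1}
print(b_hats.var())          # Monte Carlo sampling variance of the slope
print(sandwich[1, 1])        # matches the Monte Carlo variance
print(naive_var.mean())      # the usual formula: noticeably smaller here, so its SEs are too small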
Under ideal conditions \hat{b}_{OLS} is BLUE:
Best Linear (conditionally) Unbiased Estimator
1. strict exogeneity: E(e | X) = 0
2. conditionally iid disturbances: E(ee' | X) = \sigma^2 I_n (built up from more primitive assumptions)
3. X has full column rank (existence of \hat{b}_{OLS})
Best means smallest sampling variance in the class of linear unbiased estimators.
Gauss Markov Theorem
Given the assumptions above, the ordinary least squares regression estimator is BLUE. That is, given the model y = Xb + e, consider any other estimator of b, say \hat{b}^*, of the form Qy with the property (unbiasedness) that E(\hat{b}^*) = b. Then
var(\hat{b}^* | X) - var(\hat{b}_{OLS} | X) = D,
where D is a positive semi-definite matrix.
Gauss Markov Theorem
Consider the entire class of linear estimators \hat{b}^* = Ay + C. The least squares estimator \hat{b}_{OLS} is a special case, with A = (X'X)^{-1}X' and C = 0. Under what conditions is \hat{b}^* unbiased? Taking expectations,
E(\hat{b}^*) = E[A(Xb + e) + C] = E[AXb + Ae + C].
So E(\hat{b}^* - b) = 0 if
1. E(C) = 0,
2. E(e) = 0, and
3. E(AX) = I.
One such A is A = (X'X)^{-1}X', i.e., \hat{b}^* = (X'X)^{-1}X'y = \hat{b}_{OLS}.
Linear Unbiased Estimators
Now consider properties of the linear estimator
\hat{b}^* = [(X'X)^{-1}X' + C]y.
Proposition: this estimator is unbiased if and only if CX = 0. Proof:
E(\hat{b}^* | X) = E{[(X'X)^{-1}X' + C][Xb + e] | X}
             = E[(X'X)^{-1}X'Xb + CXb + (X'X)^{-1}X'e + Ce | X]
             = b + (X'X)^{-1}X' E(e | X) + C E(e | X)    (assuming CX = 0)
             = b,    assuming E(e | X) = 0.
Variance-Covariance Matrix of LUEs of b
var(\hat{b}^* | X) = E{[\hat{b}^* - E(\hat{b}^*)][\hat{b}^* - E(\hat{b}^*)]' | X}
               = E{[(X'X)^{-1}X'e + Ce][(X'X)^{-1}X'e + Ce]' | X}
               = E{[(X'X)^{-1}X'e + Ce][e'X(X'X)^{-1} + e'C'] | X}
               = E[(X'X)^{-1}X'ee'X(X'X)^{-1} + (X'X)^{-1}X'ee'C' + Cee'X(X'X)^{-1} + Cee'C' | X]
               = \sigma^2(X'X)^{-1} + \sigma^2(X'X)^{-1}X'C' + \sigma^2 CX(X'X)^{-1} + \sigma^2 CC'
               = var(\hat{b}_{OLS} | X) + \sigma^2 CC',
since unbiasedness requires CX = 0 and subject to the assumption that E(ee' | X) = \sigma^2 I_n.
Variance-Covariance Matrix of LUEs of b
var(\hat{b}^* | X) = var(\hat{b}_{OLS} | X) + \sigma^2 CC'.
The term \sigma^2 CC' is a positive semi-definite matrix, and so the sampling variance of a given element of \hat{b}^* is equal to or greater than the sampling variance of the corresponding element of \hat{b}_{OLS} = (X'X)^{-1}X'y. Note also that if C = 0, then \hat{b}^* = \hat{b}_{OLS} = (X'X)^{-1}X'y.
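A simulation can make the Gauss-Markov result tangible: construct another linear unbiased estimator \hat{b}^* = [(X'X)^{-1}X' + C]y with CX = 0, and its sampling variance exceeds that of OLS by (approximately, in the Monte Carlo) \sigma^2 CC'. A sketch (Python/NumPy, with an illustrative construction of C; not part of the original notes):

import numpy as np

rng = np.random.default_rng(11)
n, k, reps, sigma = 80, 2, 20000, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X held fixed
b = np.array([1.0, 2.0])                                 # illustrative true coefficients

XtX_inv = np.linalg.inv(X.T @ X)
A_ols = XtX_inv @ X.T
M = np.eye(n) - X @ A_ols                    # annihilator, MX = 0
C = 0.05 * rng.normal(size=(k, n)) @ M       # any C of this form satisfies CX = 0
print(np.allclose(C @ X, 0))                 # True: the alternative estimator is unbiased

ols_slope, alt_slope = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = X @ b + rng.normal(scale=sigma, size=n)
    ols_slope[r] = (A_ols @ y)[1]
    alt_slope[r] = ((A_ols + C) @ y)[1]

print(ols_slope.mean(), alt_slope.mean())    # both close to 2.0 (unbiased)
print(ols_slope.var(), alt_slope.var())      # the alternative LUE has the larger variance
print(sigma**2 * (C @ C.T)[1, 1])            # approximately alt_slope.var() - ols_slope.var()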
OLS is the Best LUE
No other linear unbiased estimator has smaller sampling variance than the ordinary least squares estimator.
OLS is best in the class of linear, unbiased estimators of b.
If
1. all the assumptions discussed above hold, and
2. we focus attention on linear, unbiased estimators,
then we can't do any better than the least squares estimator; i.e., under the maintained assumptions, LS has minimum sampling variance in its class (LUEs).
OLS is BLUE
And so, finally, if
1. E(e | X) = 0
2. E(ee' | X) = \sigma^2 I_n
3. (X'X)^{-1} exists
then the ordinary least squares estimator
\hat{b}_{OLS} = (X'X)^{-1} X'y
is BLUE.
Linking Properties and Assumptions
What assumptions establish which properties of the estimator?
Assumption                   Relevant Property
E(e | X) = 0                 unbiasedness
E(ee' | X) = \sigma^2 I_n    smallest sampling variance ("best")
(X'X)^{-1} exists            estimator & var-covar matrix exist