
Estimation and Inference for Linear Regression in Matrix Notation
Simon Jackman
Department of Political Science
Stanford University
January 29, 2007
The Linear Model
y = Xb + e
where
X is an n-by-k matrix of data (independent variables),
y is an n-by-1 vector of data (dependent variable),
e is an n-by-1 vector of disturbances, and
b is a k-by-1 vector of parameters to be estimated.
Estimation via Least Squares
Take as our estimator that value of the unknown population parameter which minimizes the sum of squared errors.
Apply Least Squares to the Linear Regression Model
Note that e = y - Xb. Thus
SSE = \sum_{i=1}^{n} e_i^2 = [e_1 \; e_2 \; \ldots \; e_n]_{(1 \times n)} \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix}_{(n \times 1)}
    = e'e_{(1 \times 1)}
    = (y - Xb)'(y - Xb)
    = (y' - b'X')(y - Xb)
    = y'y - y'Xb - b'X'y + b'X'Xb, since (Xb)' = b'X'
    = y'y - 2b'X'y + b'X'Xb,
since b'X'y is a scalar, it is equal to its transpose y'Xb.
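A quick numerical spot-check of the expansion above can be reassuring. The following is an illustrative sketch (Python/NumPy, simulated data and an arbitrary trial value of b; not part of the original notes) confirming that y'y - 2b'X'y + b'X'Xb equals the sum of squared errors:

import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # design matrix with a column of ones
y = rng.normal(size=n)
b = rng.normal(size=k)                       # an arbitrary trial value of b

e = y - X @ b                                # errors at this trial b
sse_direct = e @ e                           # e'e
sse_expanded = y @ y - 2 * b @ X.T @ y + b @ X.T @ X @ b
print(np.allclose(sse_direct, sse_expanded))  # True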
Differentiate SSE
To find the value of b that minimizes
SSE = y'y - 2b'X'y + b'X'Xb,
differentiate wrt b:
\frac{\partial SSE}{\partial b} = -2X'y + 2X'Xb,
noting that \frac{\partial(b'X'y)}{\partial b} = X'y and \frac{\partial(b'X'Xb)}{\partial b} = 2X'Xb.
First-Order Condition
Set 1st derivative to zero (1st order condition), and solve for b:
\frac{\partial SSE}{\partial b} = 0_{(k \times 1)},
-2X'y + 2X'Xb = 0
X'X b^* = X'y    (the normal equations)
(X'X)^{-1} X'X b^* = (X'X)^{-1} X'y
b^* = (X'X)^{-1} X'y,
provided (X'X)^{-1} exists.
Second-Order Condition
Check that this solution yields a minimum (2nd order condition):
\frac{\partial^2 SSE}{\partial b \, \partial b'} = 2X'X,
i.e., b^* minimizes SSE provided X'X is positive definite.
For a definition of positive definiteness, see Graybill (1983), Matrices with Applications in Statistics, 2nd edition, Wadsworth: Belmont, California; section 12.2:
Definition: An n-by-n matrix A is positive definite if and only if
1. A = A' (i.e., A is symmetric)
2. z'Az > 0 for all z \in R^n, z \neq 0.
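As a practical aside (an illustrative NumPy sketch with simulated data, not from the original notes): when X has full column rank, X'X is symmetric and positive definite, which can be checked numerically via its eigenvalues or a Cholesky factorization:

import numpy as np

rng = np.random.default_rng(1)
n, k = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

XtX = X.T @ X
print(np.allclose(XtX, XtX.T))              # True: X'X is symmetric
print(np.all(np.linalg.eigvalsh(XtX) > 0))  # True: all eigenvalues positive, so positive definite
np.linalg.cholesky(XtX)                     # succeeds; would raise LinAlgError if not positive definite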
Least Squares Estimate of b
Having verified that we have a minimizer of the SSE, we can now write
\hat{b}_{OLS} = (X'X)^{-1} X'y,
where the hat or caret denotes an estimate of the population quantity b.
OLS: Ordinary Least Squares.
Later in the quarter we will consider a generalized least squares (GLS) estimator of b.
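A minimal computational sketch of \hat{b}_{OLS} = (X'X)^{-1} X'y (Python/NumPy, simulated data and illustrative coefficient values; not part of the original notes). Solving the normal equations with np.linalg.solve is numerically preferable to forming the inverse explicitly:

import numpy as np

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])      # n-by-k design matrix, k = 3
b_true = np.array([1.0, 2.0, -0.5])            # illustrative "population" coefficients
y = X @ b_true + rng.normal(size=n)

# b_hat solves the normal equations X'X b = X'y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(b_hat)                                   # close to b_true, but not exactly equal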
Existence of the Least Squares Estimator
\hat{b}_{OLS} = (X'X)^{-1} X'y.
Given data X and y, the only possibly non-trivial part of computing the least squares estimator is inverting X'X.
(X'X)^{-1} exists \iff X has full column rank,
i.e., there are no linear dependencies among the columns of X.
What is X'X?
If x_i = [x_{i1} \; \ldots \; x_{ik}], i = 1, \ldots, n, then
X = \begin{bmatrix} x_{11} & \ldots & x_{1k} \\ x_{21} & \ldots & x_{2k} \\ \vdots & & \vdots \\ x_{n1} & \ldots & x_{nk} \end{bmatrix}
X usually has a unit vector (a column of ones) so as to generate an intercept term in the model:
y_i = b_1 \cdot 1 + b_2 x_{i2} + \ldots + b_k x_{ik} + e_i.
When an intercept is present, we have k - 1 predictors, plus the intercept's column of ones, to make up the k columns of X.
What is X'X?
X'X = \begin{bmatrix}
\sum_{i=1}^{n} x_{i1}^2 & \sum_{i=1}^{n} x_{i1}x_{i2} & \ldots & \sum_{i=1}^{n} x_{i1}x_{ik} \\
\sum_{i=1}^{n} x_{i2}x_{i1} & \sum_{i=1}^{n} x_{i2}^2 & \ldots & \sum_{i=1}^{n} x_{i2}x_{ik} \\
\vdots & \vdots & \ddots & \vdots \\
\sum_{i=1}^{n} x_{ik}x_{i1} & \sum_{i=1}^{n} x_{ik}x_{i2} & \ldots & \sum_{i=1}^{n} x_{ik}^2
\end{bmatrix}
n.b., a square (k-by-k), symmetric matrix, with sums of squares on the leading diagonal and cross-products on the off-diagonals.
What is X'X?
X'X is the same k-by-k matrix of sums of squares and cross-products displayed above.
If any of the variables in X are linear combinations of other variables, then this will create linear dependencies in the columns of X'X: this is known as perfect multicollinearity.
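A small sketch of what perfect multicollinearity does in practice (Python/NumPy, simulated data with an illustrative linear dependence; not part of the original notes): adding a column that is an exact linear combination of other columns leaves X rank-deficient, so X'X is singular and (X'X)^{-1} does not exist.

import numpy as np

rng = np.random.default_rng(3)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
x3 = 2.0 * x1 - 3.0 * x2                    # an exact linear combination of x1 and x2
X = np.column_stack([np.ones(n), x1, x2, x3])

print(np.linalg.matrix_rank(X))             # 3, not 4: X lacks full column rank
print(np.linalg.eigvalsh(X.T @ X).min())    # essentially 0: X'X is singular, so it cannot be inverted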
Existence of (X'X)^{-1}
We also need the condition n > k, or, that we have more observations than independent variables.
Definition: Rank of a matrix. If A is an r-by-c matrix, then
1. the column rank of A is the number of linearly independent columns
2. the row rank of A is the number of linearly independent rows
3. column rank = row rank = rank of A, written r(A)
Results on ranks
If A is an r-by-c matrix, then
1. r(A) \leq \min(r, c)
2. r(A) = r(A') = r(AA') = r(A'A)
3. if B is a square p-by-p matrix, then B is singular if r(B) < p.
Thus, X'X formed with n < k cannot be inverted: r(X'X) = r(X) \leq n < k. In the limiting case n = k we can invert X, but the resulting regression model overfits (perfectly fits!) the data: R^2 = 1.0.
Properties of \hat{b}_{OLS}
1. Is the OLS estimator unbiased?
2. What is its variability in repeated samples? Is it efficient?
3. Asymptotic properties, i.e., consistency (but not today).
Assumptions
1. X is full-rank
2. random sampling: the (x_i, y_i) are independent draws from their joint distribution
3. weak exogeneity: E(e_i | x_i) = 0, i = 1, \ldots, n
4. conditional homoskedasticity: var(e_i | x_i) = \sigma^2, i = 1, \ldots, n
Strict Exogeneity
Assumption 2 (random sampling) and Assumption 3 (weak exogeneity) imply strict exogeneity:
E(e_i | x_1, \ldots, x_n) = E(e_i | X) = 0, i = 1, \ldots, n,
or
E(e | X) = 0.
iid disturbances (conditional on X)
Assumption 2 (random sampling), Assumption 3 (weak exogeneity), and Assumption 4 (conditional homoskedasticity) together imply that, conditional on X, the disturbances are iid (identically and independently distributed):
var(e | X) = E(ee' | X) = \sigma^2 I_n,
i.e., the (conditional) variance-covariance matrix of the disturbances is a diagonal matrix (covariances are all zero), with a constant term \sigma^2 on the diagonal (variances are all \sigma^2). We'll come back to this presently.
Unbiasedness
We seek the conditions under which E(\hat{b}) = b. Begin by considering the expectation of \hat{b} conditional on X:
E(\hat{b} | X) = E[(X'X)^{-1} X'y | X]
             = E[(X'X)^{-1} X'(Xb + e) | X],    since y = Xb + e
             = E[(X'X)^{-1} X'Xb + (X'X)^{-1} X'e | X]
             = E[b + (X'X)^{-1} X'e | X],    since (X'X)^{-1} X'X = I_k
             = b + (X'X)^{-1} X' E(e | X)
             = b,    since by assumption E(e | X) = 0.
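The conditional-expectation algebra above can be illustrated by simulation: hold X fixed, redraw the disturbances many times, and average the resulting estimates. A minimal sketch (Python/NumPy, illustrative values; not part of the original notes):

import numpy as np

rng = np.random.default_rng(4)
n, reps = 100, 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X held fixed across replications
b = np.array([1.0, 2.0])                                 # illustrative true coefficients
sigma = 1.5

b_hats = np.empty((reps, 2))
for r in range(reps):
    e = rng.normal(scale=sigma, size=n)                  # E(e|X) = 0 by construction
    y = X @ b + e
    b_hats[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(b_hats.mean(axis=0))   # close to (1.0, 2.0): E(b_hat | X) = b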
Unbiasedness
We have
E(\hat{b} | X) = b,
and we get the unconditional result E(\hat{b}) = b by applying the Law of Iterated Expectations.
Unbiasedness
To summarize: if
1. E(e | X) = 0 (strict exogeneity), and
2. (X'X)^{-1} exists (n > k and no perfect multicollinearity in X; full column rank),
then
E(\hat{b}) = b.
Violation of E(e | X) = 0
Omitted Variable Bias
The true model is
y = X_1 b_1 + X_2 b_2 + e,
but the model estimated is
y = X_1 b_1 + e^*,
where
e^* = X_2 b_2 + e.
LS estimate:
\hat{b}_1 = (X_1'X_1)^{-1} X_1' y
Omitted Variable Bias
Let X = [X_1 \; X_2]. Taking conditional expectations,
E(\hat{b}_1 | X) = E[(X_1'X_1)^{-1} X_1' y | X]
              = E[(X_1'X_1)^{-1} X_1' (X_1 b_1 + X_2 b_2 + e) | X]
              = b_1 + E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X] + E[(X_1'X_1)^{-1} X_1' e | X].
Since our assumption is that the true model is correctly specified, E(e | X) = 0, which implies E(X_1'e) = 0, and so
E(\hat{b}_1 | X) = b_1 + \underbrace{E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X]}_{\text{bias}}
Omitted Variable Bias
E(\hat{b}_1 | X) = b_1 + \underbrace{E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X]}_{\text{bias}}
The bias is eliminated if and only if
1. X_1'X_2 = 0 (i.e., X_1 and X_2 are uncorrelated); or
2. b_2 = 0 (i.e., X_2 wasn't really in the true model to begin with).
Omitted Variable Bias
Thus, omitting a relevant independent variable usually biases the coefficients of the variables in the estimated model.
How consequential is this in practice?
Omitted Variable Bias
We can sign the omitted variable bias. The bias in \hat{b}_1 is
E(\hat{b}_1 | X) - b_1 = E[(X_1'X_1)^{-1} X_1' X_2 b_2 | X].
Since (X_1'X_1)^{-1} is a positive definite matrix, the sign of the omitted variable bias depends on
1. the sign of X_1'X_2 (the covariances between X_1 and X_2), and
2. the sign of b_2.
Omitted Variable Bias
Bias in \hat{b}_1 from omission of X_2:
E(\hat{b}_1 | X) - b_1 = (X_1'X_1)^{-1} X_1' X_2 b_2
Consider the following cases:
X_1'X_2     b_2         Bias: E(\hat{b}_1) - b_1     \hat{b}_1 is
positive    positive    positive                     over-estimated
positive    negative    negative                     under-estimated
negative    positive    negative                     under-estimated
negative    negative    positive                     over-estimated
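The sign predictions in the table can be checked by simulation. The sketch below (Python/NumPy, illustrative data-generating process; not part of the original notes) generates positively correlated x1 and x2 with a positive b2, omits x2 from the fitted model, and shows that the coefficient on x1 is over-estimated on average, with the average bias close to (X_1'X_1)^{-1} X_1'X_2 b_2:

import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 2000
b1, b2 = 1.0, 2.0                               # illustrative true coefficients
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(scale=0.8, size=n)   # x1 and x2 positively correlated; X held fixed
X1 = np.column_stack([np.ones(n), x1])          # mis-specified design: x2 omitted

coef_on_x1 = np.empty(reps)
for r in range(reps):
    y = 0.5 + b1 * x1 + b2 * x2 + rng.normal(size=n)
    coef_on_x1[r] = np.linalg.solve(X1.T @ X1, X1.T @ y)[1]

analytic_bias = np.linalg.solve(X1.T @ X1, X1.T @ x2)[1] * b2   # slope element of (X1'X1)^{-1} X1'X2 b2
print(coef_on_x1.mean() - b1, analytic_bias)    # both positive and approximately equal: over-estimation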
Model Specification is a Compromise
The flip side of omitted variable bias is multicollinearity (see Fox, Ch. 13), giving rise to a bias-variance tradeoff:
Bias: Excluding relevant X variables risks biasing the coefficients of the included variables.
Variance: Including (possibly irrelevant) X variables inflates the variances (and hence the standard errors) of the estimated coefficients (resulting from the collinearity among the X variables). In addition, we chew up degrees of freedom and clutter our model (lack of parsimony).
Properties of \hat{b}_{OLS} Related to Inference
Classical hypothesis testing requires an assessment of the sampling variability of the estimator. Since the least squares estimator \hat{b} = (X'X)^{-1} X'y is a vector of length k, its sampling distribution is a multivariate distribution with
1. a mean vector, and
2. a variance-covariance matrix.
Conditional Variance-Covariance Matrix of \hat{b}
var(\hat{b} | X) = E[(\hat{b} - E(\hat{b}))(\hat{b} - E(\hat{b}))' | X],    (i.e., the definition of variance)
              = E[(\hat{b} - b)(\hat{b} - b)' | X],    since E(\hat{b}) = b
              = E[(X'X)^{-1} X'e [(X'X)^{-1} X'e]' | X],    since \hat{b} - b = (X'X)^{-1} X'e
              = E[(X'X)^{-1} X'ee'X(X'X)^{-1} | X]
              = \sigma^2 (X'X)^{-1},    since, by assumption, E(ee' | X) = \sigma^2 I_n.
Estimating the Conditional Variance-Covariance Matrix of \hat{b}
We estimate
var(\hat{b} | X) = \sigma^2 (X'X)^{-1}
using the sample estimate of \sigma^2 (derived below),
\hat{\sigma}^2 = \frac{\hat{e}'\hat{e}}{n - k},
yielding
\widehat{var}(\hat{b} | X) = \hat{\sigma}^2 (X'X)^{-1}.
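Putting this in code: a minimal sketch (Python/NumPy, simulated data and illustrative coefficients; not part of the original notes) that computes \hat{\sigma}^2 = \hat{e}'\hat{e}/(n - k), the estimated variance-covariance matrix \hat{\sigma}^2 (X'X)^{-1}, and the standard errors as square roots of its diagonal:

import numpy as np

rng = np.random.default_rng(6)
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
k = X.shape[1]
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
b_hat = XtX_inv @ X.T @ y
e_hat = y - X @ b_hat                         # residuals
sigma2_hat = e_hat @ e_hat / (n - k)          # sigma^2_hat = e_hat'e_hat / (n - k)
vcov = sigma2_hat * XtX_inv                   # estimated k-by-k var-covar matrix of b_hat
se = np.sqrt(np.diag(vcov))                   # standard errors of the coefficients
print(b_hat, se)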
Estimating the Conditional Variance-Covariance Matrix of \hat{b}
\widehat{var}(\hat{b} | X) = \hat{\sigma}^2 (X'X)^{-1}
is a scalar times a k-by-k symmetric matrix.
Conditional variances of \hat{b} are on the diagonal; conditional covariances are on the off-diagonals.
The standard errors of the estimated coefficients are obtained by taking the square root of the diagonal.
Estimating the Conditional Variance-Covariance Matrix of \hat{b}
\widehat{var}(\hat{b} | X) = \hat{\sigma}^2 (X'X)^{-1}    (a k-by-k matrix)
= \begin{bmatrix}
\widehat{var}(\hat{b}_1 | X) & \widehat{cov}(\hat{b}_1, \hat{b}_2 | X) & \ldots & \widehat{cov}(\hat{b}_1, \hat{b}_k | X) \\
\widehat{cov}(\hat{b}_2, \hat{b}_1 | X) & \widehat{var}(\hat{b}_2 | X) & \ldots & \widehat{cov}(\hat{b}_2, \hat{b}_k | X) \\
\vdots & \vdots & \ddots & \vdots \\
\widehat{cov}(\hat{b}_k, \hat{b}_1 | X) & \widehat{cov}(\hat{b}_k, \hat{b}_2 | X) & \ldots & \widehat{var}(\hat{b}_k | X)
\end{bmatrix}
\widehat{se}(\hat{b}_j | X) = \sqrt{\widehat{var}(\hat{b}_j | X)}, j = 1, \ldots, k.
Estimate of \sigma^2
How to estimate \sigma^2 = var(e_i)? Note we don't observe e, just
\hat{e} = y - X\hat{b}
      = y - X(X'X)^{-1}X'y
      = [I_n - X(X'X)^{-1}X']y
      = My,
where M = I_n - X(X'X)^{-1}X' is a symmetric idempotent matrix (i.e., M has the property that M'M = M). Substituting for y,
\hat{e} = M(Xb + e)
      = Me,
since MX = (I_n - X(X'X)^{-1}X')X = I_n X - X(X'X)^{-1}X'X = X - X = 0.
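The claims about the "residual maker" M can be verified numerically. A sketch (Python/NumPy, simulated data; not part of the original notes): M is symmetric and idempotent, MX = 0, and \hat{e} = My = Me.

import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
b = np.array([1.0, -1.0, 0.5])                          # illustrative coefficients
e = rng.normal(size=n)
y = X @ b + e

M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
print(np.allclose(M, M.T))          # symmetric
print(np.allclose(M @ M, M))        # idempotent: MM = M
print(np.allclose(M @ X, 0))        # MX = 0
e_hat = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(e_hat, M @ y), np.allclose(e_hat, M @ e))   # e_hat = My = Me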
Estimate of \sigma^2
\hat{e} = Me.
Because M is symmetric idempotent (M = M'M),
\hat{e}'\hat{e} = e'M'Me = e'Me.
We'll exploit this identity shortly; contrast the derivation in Fox 10.3.
Properties of trace
But first, a little more theory from linear algebra...
Definition: the trace of a square matrix A, tr(A), is equal to the sum of the elements on its diagonal.
tr(ABC) = tr(CAB) = tr(BCA)
tr(E(A)) = E(tr(A))
tr(I_p) = p
tr(A + B) = tr(A) + tr(B)
if k is a scalar, tr(kA) = k tr(A)
We now make use of these properties in evaluating E(\hat{e}'\hat{e} | X) = E(e'Me | X).
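These trace properties are easy to spot-check numerically; a small sketch with random matrices (Python/NumPy, illustrative dimensions; not part of the original notes):

import numpy as np

rng = np.random.default_rng(8)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 5))
C = rng.normal(size=(5, 3))

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))    # cyclic permutation
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))
print(np.trace(np.eye(7)))                                     # tr(I_p) = p
D, E = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
print(np.isclose(np.trace(D + E), np.trace(D) + np.trace(E)))  # additivity
print(np.isclose(np.trace(3.0 * D), 3.0 * np.trace(D)))        # scalar multiplication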
Estimate of \sigma^2
E(\hat{e}'\hat{e} | X) = E(e'Me | X)
                  = E[tr(e'Me) | X],    (the trace of a scalar equals the scalar)
                  = E[tr(Mee') | X],    (changing the order of multiplication inside the trace)
                  = tr(M E(ee' | X)),    (the trace and expectations operators are both linear)
                  = tr(M \sigma^2 I_n),    (by assumption, E(ee' | X) = \sigma^2 I_n)
                  = \sigma^2 tr(M)
                  = \sigma^2 tr(I_n - X(X'X)^{-1}X')
                  = \sigma^2 (tr(I_n) - tr[(X'X)^{-1}X'X])
                  = \sigma^2 (tr(I_n) - tr(I_k))
                  = \sigma^2 (n - k).
Estimate of \sigma^2
That is,
\hat{\sigma}^2 = \frac{\hat{e}'\hat{e}}{n - k} = \frac{\sum_{i=1}^{n} \hat{e}_i^2}{n - k}
is a conditionally unbiased estimator of \sigma^2, i.e.,
if E(ee' | X) = \sigma^2 I_n, then E(\hat{\sigma}^2 | X) = \sigma^2.
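The division by n - k (rather than n) is what delivers unbiasedness, and a quick Monte Carlo makes the point: averaging \hat{e}'\hat{e}/(n - k) over many draws of the disturbances recovers \sigma^2, while dividing by n comes up short. A sketch (Python/NumPy, illustrative values; not part of the original notes):

import numpy as np

rng = np.random.default_rng(9)
n, k, reps, sigma2 = 30, 4, 20000, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])   # X held fixed
M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

s2_unbiased, s2_naive = np.empty(reps), np.empty(reps)
for r in range(reps):
    e = rng.normal(scale=np.sqrt(sigma2), size=n)
    e_hat = M @ e                              # residuals (Xb drops out since MX = 0)
    s2_unbiased[r] = e_hat @ e_hat / (n - k)
    s2_naive[r] = e_hat @ e_hat / n

print(s2_unbiased.mean())   # about 2.0
print(s2_naive.mean())      # about 2.0 * (n - k) / n, i.e. too small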
Consequence of Unbiased Estimation of \sigma^2
The consequence of having a conditionally unbiased estimator of \sigma^2 is that we recover an unbiased estimate of the marginal variance of \hat{b}, V(\hat{b}), instead of the conditional variance V(\hat{b} | X).
This is important since we don't want to limit ourselves to conclusions about the sampling variance of \hat{b} conditional on the X we obtained.
Using the Law of Iterated Expectations (and see Ruud 8.5),
E[\widehat{var}(\hat{b} | X)] = E[\hat{\sigma}^2 (X'X)^{-1}] = E_X[E(\hat{\sigma}^2 | X)(X'X)^{-1}]
                          = E_X[\sigma^2 (X'X)^{-1}]    (since \hat{\sigma}^2 is conditionally unbiased for \sigma^2)
                          = E_X[var(\hat{b} | X)]
                          = var(\hat{b}).
Standard Error of the Regression
The square root of \hat{\sigma}^2, \hat{\sigma}, is alternately called
the standard error of the regression
the standard error of the estimate
and is usefully thought of as the standard deviation of the residuals \hat{e}.
Standard Error of the Regression
\hat{\sigma} is a goodness-of-fit measure:
As R^2 \to 1.0, \hat{\sigma} \to 0.
As R^2 \to 0.0, \hat{\sigma} \to \sqrt{var(y)}.
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
We used this assumption in deriving the conditional variance-covariance matrix of \hat{b}:
recall that if E(ee' | X) = \sigma^2 I_n then
var(\hat{b} | X) = \sigma^2 (X'X)^{-1}.
But what does this mean?
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
E(ee' | X) is the (conditional) variance-covariance matrix of the disturbances, e. By the definition of variance,
var(e | X) = E{[e - E(e | X)][e - E(e | X)]' | X}
         = E(ee' | X),    since by assumption E(e | X) = 0.
Our assumption is thus that
var(e | X) = E(ee' | X) = \sigma^2 I_n.
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
var(e | X) = E(ee' | X) = \sigma^2 I_n is an n-by-n matrix:
var(e | X) = \sigma^2 I_n = \begin{bmatrix}
var(e_1 | X) & cov(e_1, e_2 | X) & \ldots & cov(e_1, e_n | X) \\
cov(e_2, e_1 | X) & var(e_2 | X) & \ldots & cov(e_2, e_n | X) \\
\vdots & \vdots & \ddots & \vdots \\
cov(e_n, e_1 | X) & cov(e_n, e_2 | X) & \ldots & var(e_n | X)
\end{bmatrix}
with variances on the diagonal and covariances on the off-diagonals.
Understanding the assumption: E(ee' | X) = \sigma^2 I_n
E(ee' | X) = \begin{bmatrix}
var(e_1 | X) & cov(e_1, e_2 | X) & \ldots & cov(e_1, e_n | X) \\
cov(e_2, e_1 | X) & var(e_2 | X) & \ldots & cov(e_2, e_n | X) \\
\vdots & \vdots & \ddots & \vdots \\
cov(e_n, e_1 | X) & cov(e_n, e_2 | X) & \ldots & var(e_n | X)
\end{bmatrix}
= \sigma^2 I_n = \sigma^2 \begin{bmatrix}
1 & 0 & \ldots & 0 \\
0 & 1 & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & 1
\end{bmatrix}
= \begin{bmatrix}
\sigma^2 & 0 & \ldots & 0 \\
0 & \sigma^2 & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \ldots & \sigma^2
\end{bmatrix}
n.b., \sigma^2 on the diagonal (variances), 0 on the off-diagonals (covariances).
E(ee' | X) = \sigma^2 I_n: two assumptions in one
Conditionally Uncorrelated Errors:
E(e_i e_j | X) = 0, i \neq j.
This assumption is implied by the assumptions
1. random sampling: (x_i, y_i), i = 1, \ldots, n are iid draws from their joint distribution
2. weak exogeneity: E(e_i | x_i) = 0, i = 1, \ldots, n.
E(ee' | X) = \sigma^2 I_n: two assumptions in one
Identical Conditional Variances (conditional homoskedasticity):
var(e_i | X) = E(e_i^2 | X) = \sigma^2, i = 1, \ldots, n.
E(ee' | X) = \sigma^2 I_n: two assumptions in one
Together, these assumptions add up to the conditional iid assumption: i.e., the disturbances are conditionally independently and identically distributed.
Violation of E(ee' | X) = \sigma^2 I_n
Thus, violation of E(ee' | X) = \sigma^2 I_n arises in two ways:
1. Off-diagonal elements of E(ee' | X) are non-zero (auto-correlation, or spatial correlation, among the disturbances); in scalar terms, E(e_i e_j | X) \neq 0, i \neq j.
2. The elements on the leading diagonal of E(ee' | X) vary (heteroskedasticity); in scalar terms, var(e_i | X) = E(e_i^2 | X) = \sigma_i^2 \neq \sigma^2, i = 1, \ldots, n.
In both these cases
E(ee' | X) = \Sigma = \sigma^2 \Omega, \Omega \neq I_n.
Consequences of Violation of conditional iid assumption
The ordinary least squares estimator \hat{b}_{OLS} = (X'X)^{-1} X'y is unbiased (i.e., E(\hat{b}_{OLS}) = b). But the usual expression for the conditional variance-covariance matrix of \hat{b}_{OLS} is incorrect:
E[(\hat{b}_{OLS} - b)(\hat{b}_{OLS} - b)' | X] = E[(X'X)^{-1} X'ee'X(X'X)^{-1} | X]
                                        = (X'X)^{-1} X' \Sigma X (X'X)^{-1}
                                        = \sigma^2 (X'X)^{-1} X' \Omega X (X'X)^{-1}
                                        \neq \sigma^2 (X'X)^{-1}, \Omega \neq I_n,
which differs from the usual OLS estimate \hat{\sigma}^2 (X'X)^{-1}. As a consequence, estimates of standard errors will be wrong, and hence hypothesis tests may be invalid.
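To see the consequence concretely, the sketch below (Python/NumPy, an illustrative heteroskedastic data-generating process; not part of the original notes) compares the true conditional variance of the OLS slope, given by the "sandwich" form (X'X)^{-1} X' \Sigma X (X'X)^{-1} with known \Sigma, against what the usual formula \hat{\sigma}^2 (X'X)^{-1} reports; the two differ, so the usual standard errors are wrong (here, too small).

import numpy as np

rng = np.random.default_rng(10)
n, reps = 200, 5000
x = rng.uniform(1, 5, size=n)
X = np.column_stack([np.ones(n), x])                  # X held fixed
sd_i = 0.5 * x                                         # disturbance sd grows with x: heteroskedasticity
XtX_inv = np.linalg.inv(X.T @ X)

b_hats, naive_var = np.empty(reps), np.empty(reps)
for r in range(reps):
    e = rng.normal(scale=sd_i, size=n)
    y = 1.0 + 2.0 * x + e
    b_hat = XtX_inv @ X.T @ y
    e_hat = y - X @ b_hat
    s2 = e_hat @ e_hat / (n - 2)
    b_hats[r] = b_hat[1]
    naive_var[r] = (s2 * XtX_inv)[1, 1]                # usual OLS variance estimate for the slope

sandwich = XtX_inv @ X.T @ np.diag(sd_i**2) @ X @ XtX_inv   # (X'X)^{-1} X' Sigma X (X'X)^{-1}
print(b_hats.var())          # Monte Carlo sampling variance of the slope
print(sandwich[1, 1])        # matches the Monte Carlo variance
print(naive_var.mean())      # the usual formula: noticeably smaller here, so its SEs are too small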
Under ideal conditions \hat{b}_{OLS} is BLUE:
Best Linear (conditionally) Unbiased Estimator
1. strict exogeneity: E(e | X) = 0
2. conditionally iid disturbances: E(ee' | X) = \sigma^2 I_n (built up from more primitive assumptions)
3. X has full column rank (existence of \hat{b}_{OLS})
Best means smallest sampling variance in the class of linear unbiased estimators.
Gauss Markov Theorem
Given the assumptions above, the ordinary least squares regression estimator is BLUE. That is, given the model y = Xb + e, consider any other estimator of b, say \hat{b}^*, of the form Qy with the property (unbiasedness) that E(\hat{b}^*) = b. Then
var(\hat{b}^* | X) - var(\hat{b}_{OLS} | X) = D,
where D is a positive semi-definite matrix.
Gauss Markov Theorem
Consider the entire class of linear estimators \hat{b}^* = Ay + C. The least squares estimator \hat{b}_{OLS} is a special case, with A = (X'X)^{-1}X' and C = 0. Under what conditions is \hat{b}^* unbiased? Taking expectations,
E(\hat{b}^*) = E[A(Xb + e) + C] = E[AXb + Ae + C].
So E(\hat{b}^* - b) = 0 if
1. E(C) = 0,
2. E(e) = 0, and
3. E(AX) = I.
One such A is A = (X'X)^{-1}X', i.e., \hat{b}^* = (X'X)^{-1}X'y = \hat{b}_{OLS}.
Linear Unbiased Estimators
Now consider properties of the linear estimator
\hat{b}^* = [(X'X)^{-1}X' + C]y.
Proposition: this estimator is unbiased if and only if CX = 0. Proof:
E(\hat{b}^* | X) = E{[(X'X)^{-1}X' + C][Xb + e] | X}
             = E[(X'X)^{-1}X'Xb + CXb + (X'X)^{-1}X'e + Ce | X]
             = b + (X'X)^{-1}X' E(e | X) + C E(e | X)    (assuming CX = 0)
             = b,    assuming E(e | X) = 0.
Variance-Covariance Matrix of LUEs of b
var(\hat{b}^* | X) = E{[\hat{b}^* - E(\hat{b}^*)][\hat{b}^* - E(\hat{b}^*)]' | X}
               = E{[(X'X)^{-1}X'e + Ce][(X'X)^{-1}X'e + Ce]' | X}
               = E{[(X'X)^{-1}X'e + Ce][e'X(X'X)^{-1} + e'C'] | X}
               = E[(X'X)^{-1}X'ee'X(X'X)^{-1} + (X'X)^{-1}X'ee'C' + Cee'X(X'X)^{-1} + Cee'C' | X]
               = \sigma^2(X'X)^{-1} + \sigma^2(X'X)^{-1}X'C' + \sigma^2 CX(X'X)^{-1} + \sigma^2 CC'
               = var(\hat{b}_{OLS} | X) + \sigma^2 CC',
since unbiasedness requires CX = 0 and subject to the assumption that E(ee' | X) = \sigma^2 I_n.
Variance-Covariance Matrix of LUEs of b
var(\hat{b}^* | X) = var(\hat{b}_{OLS} | X) + \sigma^2 CC'.
The term \sigma^2 CC' is a positive semi-definite matrix, and so the sampling variance of a given element of \hat{b}^* is equal to or greater than the sampling variance of the corresponding element of \hat{b}_{OLS} = (X'X)^{-1}X'y. Note also that if C = 0, then \hat{b}^* = \hat{b}_{OLS} = (X'X)^{-1}X'y.
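A simulation can make the Gauss-Markov result tangible: construct another linear unbiased estimator \hat{b}^* = [(X'X)^{-1}X' + C]y with CX = 0, and its sampling variance exceeds that of OLS by (approximately, in the Monte Carlo) \sigma^2 CC'. A sketch (Python/NumPy, with an illustrative construction of C; not part of the original notes):

import numpy as np

rng = np.random.default_rng(11)
n, k, reps, sigma = 80, 2, 20000, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # X held fixed
b = np.array([1.0, 2.0])                                 # illustrative true coefficients

XtX_inv = np.linalg.inv(X.T @ X)
A_ols = XtX_inv @ X.T
M = np.eye(n) - X @ A_ols                    # annihilator, MX = 0
C = 0.05 * rng.normal(size=(k, n)) @ M       # any C of this form satisfies CX = 0
print(np.allclose(C @ X, 0))                 # True: the alternative estimator is unbiased

ols_slope, alt_slope = np.empty(reps), np.empty(reps)
for r in range(reps):
    y = X @ b + rng.normal(scale=sigma, size=n)
    ols_slope[r] = (A_ols @ y)[1]
    alt_slope[r] = ((A_ols + C) @ y)[1]

print(ols_slope.mean(), alt_slope.mean())    # both close to 2.0 (unbiased)
print(ols_slope.var(), alt_slope.var())      # the alternative LUE has the larger variance
print(sigma**2 * (C @ C.T)[1, 1])            # approximately alt_slope.var() - ols_slope.var()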
OLS is the Best LUE
No other linear unbiased estimator has smaller sampling variance than the ordinary least squares estimator.
OLS is best in the class of linear, unbiased estimators of b.
If
1. all the assumptions discussed above hold, and
2. we focus attention on linear, unbiased estimators,
then we can't do any better than the least squares estimator; i.e., under the maintained assumptions, LS has minimum sampling variance in its class (LUEs).
OLS is BLUE
And so, finally, if
1. E(e | X) = 0
2. E(ee' | X) = \sigma^2 I_n
3. (X'X)^{-1} exists
then the ordinary least squares estimator
\hat{b}_{OLS} = (X'X)^{-1} X'y
is BLUE.
Linking Properties and Assumptions
What assumptions establish which properties of the estimator?
Assumption                   Relevant Property
E(e | X) = 0                 unbiasedness
E(ee' | X) = \sigma^2 I_n    smallest sampling variance ("best")
(X'X)^{-1} exists            estimator & var-covar matrix exist