Suppose we sample a set of goods for quality and find 5 defective items
in a sample of 10. What is our estimate of the proportion of bad items in
the whole population?
$$P = \frac{n!}{B!\,(n-B)!}\,\Pi^{B}\,(1-\Pi)^{n-B}$$

where $\Pi$ is the proportion of bad items in the population, $n$ is the sample size and $B$ is the number of bad items found in the sample.
If the true proportion is 0.1, $P = 0.0015$; if it is 0.2, $P = 0.0264$; and so on. We
could search numerically for the most likely value, or we can solve the problem
analytically:
$$\frac{\partial P}{\partial \hat\Pi} = \frac{n!}{B!\,(n-B)!}\left[B\,\hat\Pi^{B-1}(1-\hat\Pi)^{n-B} - (n-B)\,\hat\Pi^{B}(1-\hat\Pi)^{n-B-1}\right] = 0$$

$$B\,\hat\Pi^{B-1}(1-\hat\Pi)^{n-B} = (n-B)\,\hat\Pi^{B}(1-\hat\Pi)^{n-B-1}$$

$$B\,\hat\Pi^{-1} = (n-B)\,(1-\hat\Pi)^{-1}$$

$$B\,(1-\hat\Pi) = (n-B)\,\hat\Pi$$

which gives $\hat\Pi = B/n = 5/10 = 0.5$.
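To make this concrete, here is a minimal numerical sketch in Python (using numpy) of both routes: the grid search over candidate values suggested above, and the analytic solution $\hat\Pi = B/n$. The values $n = 10$, $B = 5$ are from the example.

```python
import numpy as np
from math import comb

n, B = 10, 5  # sample size and number of defectives, from the example

def likelihood(pi):
    """Binomial probability of observing B defectives in n draws."""
    return comb(n, B) * pi**B * (1 - pi)**(n - B)

# Grid search over candidate proportions, as suggested in the text
grid = np.linspace(0.01, 0.99, 99)
probs = [likelihood(p) for p in grid]
print("grid-search MLE:", grid[int(np.argmax(probs))])  # 0.5

# Analytic solution from the first-order condition
print("analytic MLE:", B / n)  # 0.5
```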
This basic procedure can be applied in many cases: once we can define
the probability density function for a particular event, we have a general
estimation strategy.
A general statement
For a sample of independent observations the joint likelihood factorises:

$$P(X_1, \ldots, X_n \mid A) = \prod_{i=1}^{n} P(X_i \mid A)$$

and taking logs gives the log-likelihood

$$\log(L(A)) = \sum_{i=1}^{n} \log(P(X_i \mid A))$$
$$Y = f(X, \beta) + e, \qquad e \sim N(0, \Theta)$$

$$L(\beta, \Theta) = \frac{1}{(2\pi)^{0.5}\,|\Theta|^{0.5}}\,\exp\left[-0.5\,(Y - f(X,\beta))'\,\Theta^{-1}\,(Y - f(X,\beta))\right]$$
Dropping some constants and taking logs,

$$\log L(\beta, \Theta) = -0.5\log|\Theta| - 0.5\,(Y - f(X,\beta))'\,\Theta^{-1}\,(Y - f(X,\beta))$$

The score is the vector of first derivatives of the log-likelihood,

$$S(\beta) = \frac{\partial \log L(\beta)}{\partial \beta}$$

which is made up of the first derivatives at each point in time; at the maximum likelihood estimate the score is zero. The outer product of the scores also provides a measure of the dispersion of the maximum likelihood estimate.
The information matrix (Hessian)
This is defined as

$$I(\beta) = E\left[-\frac{\partial^2 \log L(\beta)}{\partial \beta\,\partial \beta'}\right]$$

This is a measure of how 'pointy' the likelihood function is.
The variance of the parameters is given either by the inverse
Hessian or by the inverse of the outer product of the score matrix:

$$\operatorname{Var}(\hat\beta_{ML}) = [I(\hat\beta)]^{-1} \approx [S(\hat\beta)'\,S(\hat\beta)]^{-1}$$
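A sketch of these two variance estimates for a scalar parameter, assuming (purely for illustration) a unit-variance normal model and simulated data: the inverse of a numerical Hessian and the inverse of the outer product of per-observation scores, both of which should be close to the theoretical $1/T$ here.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)  # simulated data (illustrative)

def loglik(mu):
    # Normal log-likelihood in the mean, variance fixed at 1 for simplicity
    return -0.5 * np.sum((y - mu) ** 2)

mu_hat = y.mean()  # the ML estimate of the mean
h = 1e-5

# Information via the negative second derivative at the maximum
hess = (loglik(mu_hat + h) - 2 * loglik(mu_hat) + loglik(mu_hat - h)) / h**2
var_hessian = -1.0 / hess                 # inverse of I(beta_hat)

# Outer product of the per-observation scores s_t = (y_t - mu_hat)
scores = y - mu_hat
var_opg = 1.0 / np.sum(scores**2)

print(var_hessian, var_opg, 1.0 / len(y))  # all close to 1/T here
```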
The Cramer-Rao Lower Bound
This is an important theorem which establishes the
superiority of the ML estimate over all others. The Cramer-Rao
lower bound is the smallest theoretical variance which
can be achieved; ML gives this, so any other estimation
technique can at best only equal it. If $\beta^*$ is another estimate of $\beta$, then

$$\operatorname{Var}(\beta^*) \geq I(\beta)^{-1}$$
Concentrating the likelihood
$$L(\beta) = L(\beta_1, \beta_2)$$
Now suppose that, if we knew $\beta_1$, we could sometimes derive a
formula for the ML estimate of $\beta_2$, e.g.
$$\beta_2 = g(\beta_1)$$
Then we could write the likelihood function as
$$L(\beta_1, \beta_2) = L(\beta_1, g(\beta_1)) = L^*(\beta_1)$$
For example, in a regression model with error variance $\sigma^2$ (ignoring constants),

$$\log L(\beta, \sigma^2) = -T\log(\sigma^2) - \sum e^2/\sigma^2$$

$$\frac{\partial \log L}{\partial \sigma^2} = -T/\sigma^2 + \sum e^2/(\sigma^2)^2 = 0$$

which implies that

$$\hat\sigma^2 = \sum e^2 / T$$

Substituting this back into the likelihood gives the concentrated function

$$\log L^*(\beta) = -T\log\left(\sum e^2 / T\right) - T$$
Prediction error decomposition
The joint density of an observed series can be factorised into a sequence of conditional densities:

$$L(Y_1, \ldots, Y_T) = L(Y_T \mid Y_1, \ldots, Y_{T-1})\, L(Y_1, \ldots, Y_{T-1})$$

The first term is the conditional probability of $Y_T$ given all past values. We
can then condition the second term in the same way, and so on, to give

$$\log L = \sum_{i=0}^{T-2} \log(L(Y_{T-i} \mid Y_1, \ldots, Y_{T-i-1})) + \log(L(Y_1))$$

that is, a series of one step ahead prediction errors conditional on actual
lagged $Y$.
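As an illustration of the decomposition, here is a sketch of a Gaussian AR(1) log-likelihood assembled from one-step-ahead prediction errors plus the unconditional term for the first observation; the AR(1) model and the simulated data are my choice for illustration, not from the text.

```python
import numpy as np

def ar1_loglik(params, y):
    """Gaussian AR(1) log-likelihood via the prediction error decomposition:
    one-step-ahead errors for t = 2..T plus the unconditional term for Y_1."""
    rho, sigma2 = params
    var1 = sigma2 / (1 - rho**2)          # stationary variance of Y_1
    ll = -0.5 * (np.log(2 * np.pi * var1) + y[0] ** 2 / var1)
    e = y[1:] - rho * y[:-1]              # one-step prediction errors
    ll += np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + e**2 / sigma2))
    return ll

rng = np.random.default_rng(1)
y = np.zeros(300)
for t in range(1, 300):                   # simulate a mean-zero AR(1)
    y[t] = 0.6 * y[t - 1] + rng.normal()
print(ar1_loglik((0.6, 1.0), y) > ar1_loglik((0.0, 1.0), y))  # True (w.h.p.)
```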
Testing hypotheses
This gives us a very general basis for constructing hypothesis tests, but
to implement the tests we need some definite metric to judge the tests
against, i.e. a standard for what is significant.
[Figure: the likelihood function plotted against $\beta$, showing the unrestricted maximum $L_U$ and the restricted value $L_R$ at $\beta^*$.]
Consider how the likelihood function changes as we move around the
parameter space. We can evaluate this by taking a Taylor series
expansion around the ML point:

$$\log L(\beta) = \log L(\hat\beta) + (\beta - \hat\beta)'\,\frac{\partial \log L(\hat\beta)}{\partial \beta} + 0.5\,(\beta - \hat\beta)'\,\frac{\partial^2 \log L(\hat\beta)}{\partial \beta\,\partial \beta'}\,(\beta - \hat\beta) + \text{higher-order terms}$$
and of course, at the maximum,

$$\frac{\partial \log L(\hat\beta)}{\partial \beta} = S(\hat\beta) = 0, \qquad -\frac{\partial^2 \log L(\hat\beta)}{\partial \beta\,\partial \beta'} = I(\hat\beta)$$
So, evaluating the expansion at a restricted estimate $\hat\beta_r$ imposing $m$ restrictions,

$$\log L(\hat\beta) - \log L(\hat\beta_r) = 0.5\,(\hat\beta - \hat\beta_r)'\,I(\hat\beta)\,(\hat\beta - \hat\beta_r)$$

$$(\hat\beta - \hat\beta_r)'\,I(\hat\beta)\,(\hat\beta - \hat\beta_r) \sim \chi^2(m) \qquad \text{(the Wald test)}$$

$$2\,[\log L(\hat\beta) - \log L(\hat\beta_r)] \sim \chi^2(m) \qquad \text{(the likelihood ratio test)}$$
The Lagrange multiplier test uses the score at the restricted estimate:

$$LM = S(\hat\beta_r)'\,[I(\hat\beta_r)]^{-1}\,S(\hat\beta_r) \sim \chi^2(m)$$
Now suppose

$$Y_t = f(X_t, \beta_1, \beta_2) + e_t$$

where we assume that the subset of parameters $\beta_1$ is fixed according to a
set of restrictions $g = 0$ ($G$ is the derivative of this restriction). Now

$$S(\beta_1) = \sigma^{-2}\,G'e, \qquad I(\beta_1) = \sigma^{-2}\,E(G'G)$$

If $E(G'G) = G'G$,

$$LM = \sigma^{-2}\,e'G(G'G)^{-1}G'e$$
Suppose

$$Y = X\beta + u, \qquad u_t = \rho u_{t-1} + e_t$$

Estimate the model, save the residuals $\hat u$, and run the auxiliary regression

$$\hat u_t = X_t\gamma + \sum_{i=1}^{m} \theta_i\,\hat u_{t-i} + v_t$$

then $TR^2$ from this regression is an LM($m$) test for serial correlation.
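A sketch of this $TR^2$ procedure (commonly known as the Breusch-Godfrey test), assuming simulated data and plain numpy least squares; the function name and data are illustrative.

```python
import numpy as np

def lm_serial_correlation(y, X, m):
    """LM test for serial correlation of order m: regress y on X, then
    regress the residuals on X and m of their own lags; T * R^2 from the
    auxiliary regression is asymptotically chi2(m) under the null."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    T = len(u)
    # Lagged residuals, padding the first observations with zeros
    lags = np.column_stack([np.concatenate([np.zeros(i), u[:-i]])
                            for i in range(1, m + 1)])
    Z = np.column_stack([X, lags])
    g = np.linalg.lstsq(Z, u, rcond=None)[0]
    resid = u - Z @ g
    r2 = 1 - resid @ resid / (u @ u)  # u has mean ~0 if X has a constant
    return T * r2                     # compare with a chi2(m) critical value

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=200)
print(lm_serial_correlation(y, X, m=2))  # small under the null
```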
Quasi Maximum Likelihood
If the assumed distribution is misspecified, the ML estimator may remain consistent but the standard variance formulae are no longer valid; a robust ('sandwich') covariance matrix is

$$C(\hat\beta) = I(\hat\beta)^{-1}\,[S(\hat\beta)'\,S(\hat\beta)]\,I(\hat\beta)^{-1}$$
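A minimal sketch of this sandwich covariance as a function, assuming the user supplies the estimated information matrix and a $T \times k$ matrix of per-observation score contributions; the names are illustrative.

```python
import numpy as np

def sandwich_cov(information, scores):
    """Quasi-ML ('sandwich') covariance: inv(I) (S'S) inv(I).

    information : (k, k) estimated information matrix I(beta_hat)
    scores      : (T, k) matrix of per-observation score contributions
    """
    I_inv = np.linalg.inv(information)
    opg = scores.T @ scores          # outer product of the scores, S'S
    return I_inv @ opg @ I_inv
```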
Important classes of maximisation techniques
A leading example is the Newton-Raphson update, which uses the gradient and the Hessian of the likelihood:

$$\beta_{i+1} = \beta_i - \left[\frac{\partial^2 L}{\partial \beta\,\partial \beta'}\right]^{-1}\frac{\partial L}{\partial \beta}$$
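A bare-bones sketch of this iteration, assuming the user supplies functions for the gradient and Hessian of the log-likelihood; the toy quadratic objective is purely illustrative.

```python
import numpy as np

def newton_raphson(grad, hess, beta0, tol=1e-8, max_iter=100):
    """Maximise a likelihood by the Newton-Raphson update
    beta_{i+1} = beta_i - inv(d2L/db db') dL/db."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(beta), grad(beta))
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy example: maximise L(b) = -(b - 3)^2, whose maximum is at b = 3
print(newton_raphson(lambda b: np.array([-2 * (b[0] - 3)]),
                     lambda b: np.array([[-2.0]]),
                     [0.0]))  # [3.]
```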
Limited dependent variables: suppose

$$Y_t = \beta X_t + u_t$$

but we only observe certain limited information, e.g. an indicator $z = 1$ or $0$ related to $Y$:

$$z = 1 \text{ if } Y > 0, \qquad z = 0 \text{ if } Y \leq 0$$

Then we can group the data into the two cases and form a likelihood
function of the following form, where $F$ is the cumulative distribution function of $u$:

$$L = \prod_{z=0} F(-\beta X_t)\,\prod_{z=1}\left[1 - F(-\beta X_t)\right]$$
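A sketch of this grouped likelihood with $F$ taken to be the standard normal cdf (i.e. a probit model), maximised numerically with scipy; the simulated latent data are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_loglik(beta, X, z):
    """Negative grouped-data log-likelihood with F = standard normal cdf:
    P(z=0) = F(-X b), P(z=1) = 1 - F(-X b)."""
    p0 = np.clip(norm.cdf(-X @ beta), 1e-12, 1 - 1e-12)
    return -(np.sum(np.log(p0[z == 0])) + np.sum(np.log(1 - p0[z == 1])))

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y_star = X @ np.array([0.2, 1.0]) + rng.normal(size=500)  # latent, unobserved
z = (y_star > 0).astype(int)                              # observed indicator
res = minimize(neg_loglik, x0=np.zeros(2), args=(X, z))
print(res.x)  # roughly [0.2, 1.0]
```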
Suppose

$$Y_t = \beta X_t + e_t, \qquad e_t \sim N(0, h_t)$$

$$h_t = \alpha_0 + \alpha_1 h_{t-1} + \alpha_2 e_{t-1}^2$$

so that the error variance itself evolves over time (a GARCH-type model).
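A sketch of the resulting likelihood, evaluated recursively via the prediction error decomposition; the start-up value for $h_1$ and the simulated residuals are my assumptions for illustration.

```python
import numpy as np

def garch_loglik(params, e):
    """Log-likelihood of e_t ~ N(0, h_t) with
    h_t = a0 + a1 * h_{t-1} + a2 * e_{t-1}^2,
    built up recursively from one-step prediction errors."""
    a0, a1, a2 = params
    h = np.empty_like(e)
    h[0] = np.var(e)                  # simple start-up value for h_1
    for t in range(1, len(e)):
        h[t] = a0 + a1 * h[t - 1] + a2 * e[t - 1] ** 2
    return -0.5 * np.sum(np.log(2 * np.pi * h) + e**2 / h)

rng = np.random.default_rng(4)
e = rng.normal(size=500)              # placeholder for regression residuals
print(garch_loglik((0.1, 0.8, 0.1), e))
```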
Method of moments
Suppose a distribution has $k$ parameters $\phi_1, \ldots, \phi_k$. The $k$ sample moments are set equal to the functions of the parameters which generate the
moments, and the system is inverted:

$$\phi = f^{-1}(m)$$
A simple example
Suppose the first moment (the mean) is generated by the distribution $f(x \mid \phi_1)$, and that the mean is the parameter $\phi_1$ itself. The observed moment from a sample of $n$
observations is

$$m_1 = (1/n)\sum_{i=1}^{n} x_i$$

Setting the sample moment equal to the theoretical moment and inverting gives

$$\hat\phi_1 = f^{-1}(m_1) = m_1$$
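A one-line numerical version, assuming a Poisson distribution (whose mean is its single parameter), so the method of moments estimate is just the sample mean.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.poisson(lam=3.0, size=1000)  # distribution whose mean is the parameter

m1 = x.mean()   # observed first moment
phi_hat = m1    # inverting f: the estimate is the sample mean
print(phi_hat)  # close to 3.0
```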
Method of Moments Estimation (MM)
The idea here is that we have a model which implies certain things about
the distribution or covariance’s of the variables and the errors. So we
know what some moments of the distribution should be. We then invert
the model to give us estimates of the unknown parameters of the model
which match the theoretical moments for a given sample.
$$Y = f(\phi, X)$$

where $\phi$ is a vector of $k$ parameters, and we have $k$ conditions (or moments)
which should be met by the model:

$$E(g(Y, X \mid \phi)) = 0$$

We then approximate $E(g)$ with a sample measure and invert $g$:

$$\hat\phi = g^{-1}(Y, X, 0)$$
Examples
OLS
In OLS estimation we make the assumption that the regressors (Xs) are
orthogonal to the errors. Thus
$$E(Xe) = 0$$
The sample analogue for each $x_i$ is

$$(1/n)\sum_{t=1}^{n} x_{it}\,e_t = 0$$

and so, solving these $k$ conditions $X'(Y - X\phi) = 0$, we obtain the OLS estimator $\hat\phi = (X'X)^{-1}X'Y$.
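A quick numerical check of this inversion on simulated data, solving the sample moment conditions directly:

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)

# Impose the sample moment conditions (1/n) X'e = 0, i.e. X'(y - Xb) = 0,
# and invert them: b = (X'X)^{-1} X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # roughly [1.0, 2.0]
```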
Maximum likelihood
ML can also be viewed as a method of moments estimator. The log-likelihood is

$$\ln(L) = \sum \ln(f(y, x \mid \phi))$$

and this will be maximised when the following $k$ first order conditions are
met:

$$E\left(\partial \ln(f(y, x \mid \phi))/\partial \phi\right) = 0$$

This gives rise to the following $k$ sample conditions:

$$(1/n)\sum_{i=1}^{n} \partial \ln(f(y_i, x_i \mid \phi))/\partial \phi = 0$$
Generalised Method of Moments (GMM)
Basically, if we cannot satisfy all the moment conditions at the same time, we
have to trade them off against each other: we need to make them all as close
to zero as possible simultaneously, and for that we need a criterion function to
minimise. Suppose we have $k$ parameters but $L$ moment conditions, $L > k$:

$$E(m_j(\phi)) = 0, \qquad (1/n)\sum_{t=1}^{n} m_j(\phi) = 0, \qquad j = 1, \ldots, L$$
$$\min_{\phi}\; q = \bar m(\phi)'\,A\,\bar m(\phi)$$

That is, the weighted squared sum of the moments.
This gives a consistent estimator for any positive definite matrix $A$ (not a
function of $\phi$).
The optimal A
If any positive definite weighting matrix gives a consistent estimator, they clearly cannot all be equally
efficient, so what is the optimal choice of $A$?
The optimal weighting matrix is the inverse of the covariance matrix of the sample moments, $A = \Phi^{-1}$, where

$$\Phi = \operatorname{var}\left(n^{1/2}(\bar m - \eta)\right)$$

Thus

$$n^{1/2}(\hat\phi_{gmm} - \phi) \sim N(0, V_{gmm})$$
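A sketch of two-step GMM on a deliberately over-identified toy problem: an exponential distribution, where $E[x] = \mu$ and $E[x^2] = 2\mu^2$ give $L = 2$ conditions for $k = 1$ parameter. The first step uses the identity weighting matrix (any positive definite $A$ is consistent); the second uses $A = \hat\Phi^{-1}$ estimated from the first-step moments. The example and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=1000)  # exponential with mean mu = 2

def mbar(mu):
    """Sample averages of the two moment conditions:
    E[x] - mu = 0 and E[x^2] - 2 mu^2 = 0."""
    return np.array([np.mean(x) - mu, np.mean(x**2) - 2 * mu**2])

def q(mu, A):
    m = mbar(mu)
    return m @ A @ m  # the GMM criterion m' A m

# First step: identity weighting
mu1 = minimize_scalar(lambda mu: q(mu, np.eye(2)),
                      bounds=(0.1, 10), method='bounded').x

# Second step: weight by the inverse covariance of the moments, A = Phi^{-1}
M = np.column_stack([x - mu1, x**2 - 2 * mu1**2])
Phi = (M.T @ M) / len(x)
mu2 = minimize_scalar(lambda mu: q(mu, np.linalg.inv(Phi)),
                      bounds=(0.1, 10), method='bounded').x
print(mu1, mu2)  # both near 2.0
```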