Suppose we sample a set of goods for quality and find 5 defective items
in a sample of 10. What is our estimate of the proportion of bad items in
the whole population?
$$P = \frac{n!}{B!\,(n-B)!}\,\Pi^{B}\,(1-\Pi)^{n-B}$$

where $\Pi$ is the proportion of bad items in the population, $n$ is the sample size and $B$ is the number of bad items found in the sample.
If the true proportion is 0.1, $P = 0.0015$; if it is 0.2, $P = 0.0264$; and so on. We
could search numerically for the most likely value, or we can solve the problem
analytically:
$$\frac{\partial P}{\partial \hat\Pi} = \frac{n!}{B!\,(n-B)!}\left[B\,\hat\Pi^{B-1}(1-\hat\Pi)^{n-B} - (n-B)\,\hat\Pi^{B}(1-\hat\Pi)^{n-B-1}\right] = 0$$

$$B\,\hat\Pi^{B-1}(1-\hat\Pi)^{n-B} = (n-B)\,\hat\Pi^{B}(1-\hat\Pi)^{n-B-1}$$

$$B\,\hat\Pi^{-1} = (n-B)\,(1-\hat\Pi)^{-1}$$

$$B\,(1-\hat\Pi) = (n-B)\,\hat\Pi$$

which gives $\hat\Pi = B/n = 5/10 = 0.5$.
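To make this concrete, here is a minimal numerical sketch in Python (using numpy) of both routes: the grid search over candidate values suggested above, and the analytic solution $\hat\Pi = B/n$. The values $n = 10$, $B = 5$ are from the example.

```python
import numpy as np
from math import comb

n, B = 10, 5  # sample size and number of defectives, from the example

def likelihood(pi):
    """Binomial probability of observing B defectives in n draws."""
    return comb(n, B) * pi**B * (1 - pi)**(n - B)

# Grid search over candidate proportions, as suggested in the text
grid = np.linspace(0.01, 0.99, 99)
probs = [likelihood(p) for p in grid]
print("grid-search MLE:", grid[int(np.argmax(probs))])  # 0.5

# Analytic solution from the first-order condition
print("analytic MLE:", B / n)  # 0.5
```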
This basic procedure can be applied in many cases: once we can define
the probability density function for a particular event, we have a general
estimation strategy.
A general statement
For a sample of independent observations the joint likelihood factorises:

$$P(X_1, \ldots, X_n \mid A) = \prod_{i=1}^{n} P(X_i \mid A)$$

and taking logs gives the log-likelihood

$$\log(L(A)) = \sum_{i=1}^{n} \log(P(X_i \mid A))$$
$$Y = f(X, \beta) + e, \qquad e \sim N(0, \Theta)$$

$$L(\beta, \Theta) = \frac{1}{(2\pi)^{0.5}\,|\Theta|^{0.5}}\,\exp\left[-0.5\,(Y - f(X,\beta))'\,\Theta^{-1}\,(Y - f(X,\beta))\right]$$
Dropping some constants and taking logs,

$$\log L(\beta, \Theta) = -0.5\log|\Theta| - 0.5\,(Y - f(X,\beta))'\,\Theta^{-1}\,(Y - f(X,\beta))$$

The score is the vector of first derivatives of the log-likelihood,

$$S(\beta) = \frac{\partial \log L(\beta)}{\partial \beta}$$

which is made up of the first derivatives at each point in time; at the maximum likelihood estimate the score is zero. The outer product of the scores also provides a measure of the dispersion of the maximum likelihood estimate.
The information matrix (Hessian)
This is defined as

$$I(\beta) = E\left[-\frac{\partial^2 \log L(\beta)}{\partial \beta\,\partial \beta'}\right]$$

This is a measure of how 'pointy' the likelihood function is.
The variance of the parameters is given either by the inverse
Hessian or by the inverse of the outer product of the score matrix:

$$\operatorname{Var}(\hat\beta_{ML}) = [I(\hat\beta)]^{-1} \approx [S(\hat\beta)'\,S(\hat\beta)]^{-1}$$
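A sketch of these two variance estimates for a scalar parameter, assuming (purely for illustration) a unit-variance normal model and simulated data: the inverse of a numerical Hessian and the inverse of the outer product of per-observation scores, both of which should be close to the theoretical $1/T$ here.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=500)  # simulated data (illustrative)

def loglik(mu):
    # Normal log-likelihood in the mean, variance fixed at 1 for simplicity
    return -0.5 * np.sum((y - mu) ** 2)

mu_hat = y.mean()  # the ML estimate of the mean
h = 1e-5

# Information via the negative second derivative at the maximum
hess = (loglik(mu_hat + h) - 2 * loglik(mu_hat) + loglik(mu_hat - h)) / h**2
var_hessian = -1.0 / hess                 # inverse of I(beta_hat)

# Outer product of the per-observation scores s_t = (y_t - mu_hat)
scores = y - mu_hat
var_opg = 1.0 / np.sum(scores**2)

print(var_hessian, var_opg, 1.0 / len(y))  # all close to 1/T here
```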
The Cramer-Rao Lower Bound
This is an important theorem which establishes the
superiority of the ML estimate over all others. The Cramer-Rao
lower bound is the smallest theoretical variance which
can be achieved; ML gives this, so any other estimation
technique can at best only equal it. If $\beta^*$ is another estimate of $\beta$, then

$$\operatorname{Var}(\beta^*) \geq I(\beta)^{-1}$$
Concentrating the likelihood
$$L(\beta) = L(\beta_1, \beta_2)$$
Now suppose that, if we knew $\beta_1$, we could sometimes derive a
formula for the ML estimate of $\beta_2$, e.g.
$$\beta_2 = g(\beta_1)$$
Then we could write the likelihood function as
$$L(\beta_1, \beta_2) = L(\beta_1, g(\beta_1)) = L^*(\beta_1)$$
For example, in a regression model with error variance $\sigma^2$ (ignoring constants),

$$\log L(\beta, \sigma^2) = -T\log(\sigma^2) - \sum e^2/\sigma^2$$

$$\frac{\partial \log L}{\partial \sigma^2} = -T/\sigma^2 + \sum e^2/(\sigma^2)^2 = 0$$

which implies that

$$\hat\sigma^2 = \sum e^2 / T$$

Substituting this back into the likelihood gives the concentrated function

$$\log L^*(\beta) = -T\log\left(\sum e^2 / T\right) - T$$
Prediction error decomposition
The joint density of an observed series can be factorised into a sequence of conditional densities:

$$L(Y_1, \ldots, Y_T) = L(Y_T \mid Y_1, \ldots, Y_{T-1})\, L(Y_1, \ldots, Y_{T-1})$$

The first term is the conditional probability of $Y_T$ given all past values. We
can then condition the second term in the same way, and so on, to give

$$\log L = \sum_{i=0}^{T-2} \log(L(Y_{T-i} \mid Y_1, \ldots, Y_{T-i-1})) + \log(L(Y_1))$$

that is, a series of one step ahead prediction errors conditional on actual
lagged $Y$.
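As an illustration of the decomposition, here is a sketch of a Gaussian AR(1) log-likelihood assembled from one-step-ahead prediction errors plus the unconditional term for the first observation; the AR(1) model and the simulated data are my choice for illustration, not from the text.

```python
import numpy as np

def ar1_loglik(params, y):
    """Gaussian AR(1) log-likelihood via the prediction error decomposition:
    one-step-ahead errors for t = 2..T plus the unconditional term for Y_1."""
    rho, sigma2 = params
    var1 = sigma2 / (1 - rho**2)          # stationary variance of Y_1
    ll = -0.5 * (np.log(2 * np.pi * var1) + y[0] ** 2 / var1)
    e = y[1:] - rho * y[:-1]              # one-step prediction errors
    ll += np.sum(-0.5 * (np.log(2 * np.pi * sigma2) + e**2 / sigma2))
    return ll

rng = np.random.default_rng(1)
y = np.zeros(300)
for t in range(1, 300):                   # simulate a mean-zero AR(1)
    y[t] = 0.6 * y[t - 1] + rng.normal()
print(ar1_loglik((0.6, 1.0), y) > ar1_loglik((0.0, 1.0), y))  # True (w.h.p.)
```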
Testing hypotheses
This gives us a very general basis for constructing hypothesis tests, but
to implement the tests we need some definite metric to judge the tests
against, i.e. a standard for what is significant.
[Figure: the likelihood function plotted against $\beta$, showing the unrestricted maximum $L_U$ and the restricted value $L_R$ at $\beta^*$.]
Consider how the likelihood function changes as we move around the
parameter space. We can evaluate this by taking a Taylor series
expansion around the ML point:

$$\log L(\beta) = \log L(\hat\beta) + (\beta - \hat\beta)'\,\frac{\partial \log L(\hat\beta)}{\partial \beta} + 0.5\,(\beta - \hat\beta)'\,\frac{\partial^2 \log L(\hat\beta)}{\partial \beta\,\partial \beta'}\,(\beta - \hat\beta) + \text{higher-order terms}$$
and of course, at the maximum,

$$\frac{\partial \log L(\hat\beta)}{\partial \beta} = S(\hat\beta) = 0, \qquad -\frac{\partial^2 \log L(\hat\beta)}{\partial \beta\,\partial \beta'} = I(\hat\beta)$$
So, evaluating the expansion at a restricted estimate $\hat\beta_r$ imposing $m$ restrictions,

$$\log L(\hat\beta) - \log L(\hat\beta_r) = 0.5\,(\hat\beta - \hat\beta_r)'\,I(\hat\beta)\,(\hat\beta - \hat\beta_r)$$

$$(\hat\beta - \hat\beta_r)'\,I(\hat\beta)\,(\hat\beta - \hat\beta_r) \sim \chi^2(m) \qquad \text{(the Wald test)}$$

$$2\,[\log L(\hat\beta) - \log L(\hat\beta_r)] \sim \chi^2(m) \qquad \text{(the likelihood ratio test)}$$
The Lagrange multiplier test uses the score at the restricted estimate:

$$LM = S(\hat\beta_r)'\,[I(\hat\beta_r)]^{-1}\,S(\hat\beta_r) \sim \chi^2(m)$$
Now suppose

$$Y_t = f(X_t, \beta_1, \beta_2) + e_t$$

where we assume that the subset of parameters $\beta_1$ is fixed according to a
set of restrictions $g = 0$ ($G$ is the derivative of this restriction). Now

$$S(\beta_1) = \sigma^{-2}\,G'e, \qquad I(\beta_1) = \sigma^{-2}\,E(G'G)$$

If $E(G'G) = G'G$,

$$LM = \sigma^{-2}\,e'G(G'G)^{-1}G'e$$
Suppose

$$Y = X\beta + u, \qquad u_t = \rho u_{t-1} + e_t$$

Estimate the model, save the residuals $\hat u$, and run the auxiliary regression

$$\hat u_t = X_t\gamma + \sum_{i=1}^{m} \theta_i\,\hat u_{t-i} + v_t$$

then $TR^2$ from this regression is an LM($m$) test for serial correlation.
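A sketch of this $TR^2$ procedure (commonly known as the Breusch-Godfrey test), assuming simulated data and plain numpy least squares; the function name and data are illustrative.

```python
import numpy as np

def lm_serial_correlation(y, X, m):
    """LM test for serial correlation of order m: regress y on X, then
    regress the residuals on X and m of their own lags; T * R^2 from the
    auxiliary regression is asymptotically chi2(m) under the null."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    T = len(u)
    # Lagged residuals, padding the first observations with zeros
    lags = np.column_stack([np.concatenate([np.zeros(i), u[:-i]])
                            for i in range(1, m + 1)])
    Z = np.column_stack([X, lags])
    g = np.linalg.lstsq(Z, u, rcond=None)[0]
    resid = u - Z @ g
    r2 = 1 - resid @ resid / (u @ u)  # u has mean ~0 if X has a constant
    return T * r2                     # compare with a chi2(m) critical value

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=200)
print(lm_serial_correlation(y, X, m=2))  # small under the null
```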
Quasi Maximum Likelihood
If the assumed distribution is misspecified, the ML estimator may remain consistent but the standard variance formulae are no longer valid; a robust ('sandwich') covariance matrix is

$$C(\hat\beta) = I(\hat\beta)^{-1}\,[S(\hat\beta)'\,S(\hat\beta)]\,I(\hat\beta)^{-1}$$
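A minimal sketch of this sandwich covariance as a function, assuming the user supplies the estimated information matrix and a $T \times k$ matrix of per-observation score contributions; the names are illustrative.

```python
import numpy as np

def sandwich_cov(information, scores):
    """Quasi-ML ('sandwich') covariance: inv(I) (S'S) inv(I).

    information : (k, k) estimated information matrix I(beta_hat)
    scores      : (T, k) matrix of per-observation score contributions
    """
    I_inv = np.linalg.inv(information)
    opg = scores.T @ scores          # outer product of the scores, S'S
    return I_inv @ opg @ I_inv
```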
Important classes of maximisation techniques
A leading example is the Newton-Raphson update, which uses the gradient and the Hessian of the likelihood:

$$\beta_{i+1} = \beta_i - \left[\frac{\partial^2 L}{\partial \beta\,\partial \beta'}\right]^{-1}\frac{\partial L}{\partial \beta}$$
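A bare-bones sketch of this iteration, assuming the user supplies functions for the gradient and Hessian of the log-likelihood; the toy quadratic objective is purely illustrative.

```python
import numpy as np

def newton_raphson(grad, hess, beta0, tol=1e-8, max_iter=100):
    """Maximise a likelihood by the Newton-Raphson update
    beta_{i+1} = beta_i - inv(d2L/db db') dL/db."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(beta), grad(beta))
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Toy example: maximise L(b) = -(b - 3)^2, whose maximum is at b = 3
print(newton_raphson(lambda b: np.array([-2 * (b[0] - 3)]),
                     lambda b: np.array([[-2.0]]),
                     [0.0]))  # [3.]
```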
Limited dependent variables: suppose

$$Y_t = \beta X_t + u_t$$

but we only observe certain limited information, e.g. an indicator $z = 1$ or $0$ related to $Y$:

$$z = 1 \text{ if } Y > 0, \qquad z = 0 \text{ if } Y \leq 0$$

Then we can group the data into the two cases and form a likelihood
function of the following form, where $F$ is the cumulative distribution function of $u$:

$$L = \prod_{z=0} F(-\beta X_t)\,\prod_{z=1}\left[1 - F(-\beta X_t)\right]$$
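A sketch of this grouped likelihood with $F$ taken to be the standard normal cdf (i.e. a probit model), maximised numerically with scipy; the simulated latent data are illustrative.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def neg_loglik(beta, X, z):
    """Negative grouped-data log-likelihood with F = standard normal cdf:
    P(z=0) = F(-X b), P(z=1) = 1 - F(-X b)."""
    p0 = np.clip(norm.cdf(-X @ beta), 1e-12, 1 - 1e-12)
    return -(np.sum(np.log(p0[z == 0])) + np.sum(np.log(1 - p0[z == 1])))

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y_star = X @ np.array([0.2, 1.0]) + rng.normal(size=500)  # latent, unobserved
z = (y_star > 0).astype(int)                              # observed indicator
res = minimize(neg_loglik, x0=np.zeros(2), args=(X, z))
print(res.x)  # roughly [0.2, 1.0]
```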
Suppose

$$Y_t = \beta X_t + e_t, \qquad e_t \sim N(0, h_t)$$

$$h_t = \alpha_0 + \alpha_1 h_{t-1} + \alpha_2 e_{t-1}^2$$

so that the error variance itself evolves over time (a GARCH-type model).
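A sketch of the resulting likelihood, evaluated recursively via the prediction error decomposition; the start-up value for $h_1$ and the simulated residuals are my assumptions for illustration.

```python
import numpy as np

def garch_loglik(params, e):
    """Log-likelihood of e_t ~ N(0, h_t) with
    h_t = a0 + a1 * h_{t-1} + a2 * e_{t-1}^2,
    built up recursively from one-step prediction errors."""
    a0, a1, a2 = params
    h = np.empty_like(e)
    h[0] = np.var(e)                  # simple start-up value for h_1
    for t in range(1, len(e)):
        h[t] = a0 + a1 * h[t - 1] + a2 * e[t - 1] ** 2
    return -0.5 * np.sum(np.log(2 * np.pi * h) + e**2 / h)

rng = np.random.default_rng(4)
e = rng.normal(size=500)              # placeholder for regression residuals
print(garch_loglik((0.1, 0.8, 0.1), e))
```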
Method of moments
Suppose a distribution has $k$ parameters $\phi_1, \ldots, \phi_k$. The $k$ sample moments are set equal to the functions of the parameters which generate the
moments, and the system is inverted:

$$\phi = f^{-1}(m)$$
A simple example
Suppose the first moment (the mean) is generated by the distribution $f(x \mid \phi_1)$, and that the mean is the parameter $\phi_1$ itself. The observed moment from a sample of $n$
observations is

$$m_1 = (1/n)\sum_{i=1}^{n} x_i$$

Setting the sample moment equal to the theoretical moment and inverting gives

$$\hat\phi_1 = f^{-1}(m_1) = m_1$$
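A one-line numerical version, assuming a Poisson distribution (whose mean is its single parameter), so the method of moments estimate is just the sample mean.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.poisson(lam=3.0, size=1000)  # distribution whose mean is the parameter

m1 = x.mean()   # observed first moment
phi_hat = m1    # inverting f: the estimate is the sample mean
print(phi_hat)  # close to 3.0
```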
Method of Moments Estimation (MM)
The idea here is that we have a model which implies certain things about
the distribution or covariance’s of the variables and the errors. So we
know what some moments of the distribution should be. We then invert
the model to give us estimates of the unknown parameters of the model
which match the theoretical moments for a given sample.
$$Y = f(\phi, X)$$

where $\phi$ is a vector of $k$ parameters, and we have $k$ conditions (or moments)
which should be met by the model:

$$E(g(Y, X \mid \phi)) = 0$$

We then approximate $E(g)$ with a sample measure and invert $g$:

$$\hat\phi = g^{-1}(Y, X, 0)$$
Examples
OLS
In OLS estimation we make the assumption that the regressors (Xs) are
orthogonal to the errors. Thus
$$E(Xe) = 0$$
The sample analogue for each $x_i$ is

$$(1/n)\sum_{t=1}^{n} x_{it}\,e_t = 0$$

and so, solving these $k$ conditions $X'(Y - X\phi) = 0$, we obtain the OLS estimator $\hat\phi = (X'X)^{-1}X'Y$.
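A quick numerical check of this inversion on simulated data, solving the sample moment conditions directly:

```python
import numpy as np

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=200)

# Impose the sample moment conditions (1/n) X'e = 0, i.e. X'(y - Xb) = 0,
# and invert them: b = (X'X)^{-1} X'y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # roughly [1.0, 2.0]
```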
Maximum likelihood
ML can also be viewed as a method of moments estimator. The log-likelihood is

$$\ln(L) = \sum \ln(f(y, x \mid \phi))$$

and this will be maximised when the following $k$ first order conditions are
met:

$$E\left(\partial \ln(f(y, x \mid \phi))/\partial \phi\right) = 0$$

This gives rise to the following $k$ sample conditions:

$$(1/n)\sum_{i=1}^{n} \partial \ln(f(y_i, x_i \mid \phi))/\partial \phi = 0$$
Generalised Method of Moments (GMM)
Basically, if we cannot satisfy all the moment conditions at the same time, we
have to trade them off against each other: we need to make them all as close
to zero as possible simultaneously, and for that we need a criterion function to
minimise. Suppose we have $k$ parameters but $L$ moment conditions, $L > k$:

$$E(m_j(\phi)) = 0, \qquad (1/n)\sum_{t=1}^{n} m_j(\phi) = 0, \qquad j = 1, \ldots, L$$
$$\min_{\phi}\; q = \bar m(\phi)'\,A\,\bar m(\phi)$$

That is, the weighted squared sum of the moments.
This gives a consistent estimator for any positive definite matrix $A$ (not a
function of $\phi$).
The optimal A
If any positive definite weighting matrix gives a consistent estimator, they clearly cannot all be equally
efficient, so what is the optimal choice of $A$?
The optimal weighting matrix is the inverse of the covariance matrix of the sample moments, $A = \Phi^{-1}$, where

$$\Phi = \operatorname{var}\left(n^{1/2}(\bar m - \eta)\right)$$

Thus

$$n^{1/2}(\hat\phi_{gmm} - \phi) \sim N(0, V_{gmm})$$
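A sketch of two-step GMM on a deliberately over-identified toy problem: an exponential distribution, where $E[x] = \mu$ and $E[x^2] = 2\mu^2$ give $L = 2$ conditions for $k = 1$ parameter. The first step uses the identity weighting matrix (any positive definite $A$ is consistent); the second uses $A = \hat\Phi^{-1}$ estimated from the first-step moments. The example and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(7)
x = rng.exponential(scale=2.0, size=1000)  # exponential with mean mu = 2

def mbar(mu):
    """Sample averages of the two moment conditions:
    E[x] - mu = 0 and E[x^2] - 2 mu^2 = 0."""
    return np.array([np.mean(x) - mu, np.mean(x**2) - 2 * mu**2])

def q(mu, A):
    m = mbar(mu)
    return m @ A @ m  # the GMM criterion m' A m

# First step: identity weighting
mu1 = minimize_scalar(lambda mu: q(mu, np.eye(2)),
                      bounds=(0.1, 10), method='bounded').x

# Second step: weight by the inverse covariance of the moments, A = Phi^{-1}
M = np.column_stack([x - mu1, x**2 - 2 * mu1**2])
Phi = (M.T @ M) / len(x)
mu2 = minimize_scalar(lambda mu: q(mu, np.linalg.inv(Phi)),
                      bounds=(0.1, 10), method='bounded').x
print(mu1, mu2)  # both near 2.0
```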