MATH5063
7.3 Remarks.
An estimator is a function of the sample, while an estimate is the realized value of an estimator. Notationally, $W(X_1, \dots, X_n)$ is an estimator and $W(x_1, \dots, x_n)$ is an estimate. Some authors use the word "estimate" for both $W(X_1, \dots, X_n)$ and $W(x_1, \dots, x_n)$.
Method of Moments:
Let $X_1, \dots, X_n$ be a random sample from a pdf/pmf $f(x \mid \theta_1, \dots, \theta_k)$ which depends on $k$ parameters $\theta_1, \dots, \theta_k$. Method-of-moments estimators are found by equating the first $k$ sample moments to the corresponding population moments and solving the resulting system of equations. More precisely, for $l = 1, \dots, k$, let $m_l = \frac{1}{n}\sum_{i=1}^n X_i^l$ and solve
$$m_l = \mu'_l(\theta_1, \dots, \theta_k), \quad l = 1, \dots, k.$$
For example, for a sample with population mean $\theta$ and variance $\sigma^2$, equating the first two sample moments to the corresponding population moments gives
$$\bar X = \tilde\theta, \qquad \frac{1}{n}\sum_{i=1}^n X_i^2 = \tilde\theta^2 + \tilde\sigma^2.$$
That is,
$$\tilde\theta = \bar X, \qquad \tilde\sigma^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \tilde\theta^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 = \frac{n-1}{n}\, S^2 < S^2.$$
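To see these formulas in action, here is a minimal Python sketch (an added illustration, not part of the original notes) computing $\tilde\theta$ and $\tilde\sigma^2$ from a simulated sample; the distribution, parameter values, and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)  # simulated sample; true theta = 2, sigma^2 = 9

m1 = x.mean()           # first sample moment  (1/n) sum X_i
m2 = np.mean(x**2)      # second sample moment (1/n) sum X_i^2

theta_mm = m1           # theta-tilde = X-bar
sigma2_mm = m2 - m1**2  # sigma^2-tilde = (1/n) sum X_i^2 - X-bar^2

# sigma2_mm equals (n-1)/n * S^2, hence is slightly smaller than S^2.
S2 = x.var(ddof=1)
print(theta_mm, sigma2_mm, (len(x) - 1) / len(x) * S2)
```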
7.6 Remarks.
Unless $\hat\theta(x)$ occurs at the boundary of the parameter space, $\hat\theta(x)$ is a solution of the equations
$$\frac{\partial}{\partial\theta_i}\, L(\theta \mid x) = 0, \quad i = 1, \dots, k.$$
In most cases, the MLE cannot be found analytically. Thus, one needs to
resort to numerical methods to obtain the MLE approximately.
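As an illustration of such numerical maximization (a sketch added here, not from the notes), the following Python code finds an approximate MLE for a gamma(shape, scale) model by minimizing the negative log-likelihood with scipy; the simulated data and the Nelder-Mead method are arbitrary choices, and the method-of-moments estimates serve as the starting point.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.5, scale=1.5, size=500)  # data whose parameters we pretend not to know

def neg_log_lik(params):
    a, scale = params
    if a <= 0 or scale <= 0:  # stay inside the parameter space
        return np.inf
    return -np.sum(stats.gamma.logpdf(x, a=a, scale=scale))

# Method-of-moments estimates make a reasonable starting value.
m, v = x.mean(), x.var()
res = optimize.minimize(neg_log_lik, x0=[m**2 / v, v / m], method="Nelder-Mead")
print(res.x)  # approximate MLE (shape-hat, scale-hat), close to (2.5, 1.5)
```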
7.7 Remarks.
Maximizing $L(\theta \mid x)$ is equivalent to maximizing $\log L(\theta \mid x)$ (the log-likelihood function). For example, if $X_1, \dots, X_n$ are iid $n(\theta, 1)$, then
$$\log L(\theta \mid x) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2.$$
The equation $\frac{d}{d\theta}\log L(\theta \mid x) = 0$ reduces to
$$\sum_{i=1}^n (x_i - \theta) = 0,$$
whose unique solution is $\hat\theta = \bar x$.
When both the mean $\theta$ and the variance $\sigma^2$ are unknown,
$$\log L(\theta, \sigma^2 \mid x) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2}\sum_{i=1}^n \frac{(x_i - \theta)^2}{\sigma^2},$$
and
$$\frac{\partial}{\partial\theta}\log L(\theta, \sigma^2 \mid x) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \theta),$$
$$\frac{\partial}{\partial(\sigma^2)}\log L(\theta, \sigma^2 \mid x) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \theta)^2.$$
Setting both partial derivatives to zero yields $\hat\theta = \bar x$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$. It can be rigorously verified that $(\hat\theta, \hat\sigma^2)$ uniquely maximizes the (log-)likelihood function.
Bayes Estimators:
Classical (frequentist) approach: The parameter θ is fixed but unknown.
Bayesian approach: θ is random with some prior distribution (which is a
subjective distribution based on the experimenter’s prior knowledge about θ
before the data are collected). When a sample is taken from a pdf/pmf
f (x | θ), the prior distribution is then updated with this sample
information. The updated prior is called the posterior distribution.
Specifically, let π(θ) be the prior distribution. The sampling distribution
f (x | θ) is the conditional distribution of the random sample X given θ.
Thus the joint distribution of (θ, X) is f (x | θ) π(θ). When X is observed
with X = x, the updated prior (i.e. posterior) distribution of θ is the
conditional distribution of θ given X = x, i.e.
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{m(x)},$$
where $m(x) = \int f(x \mid \theta)\, \pi(\theta)\, d\theta$ is the marginal distribution of $X$.
We often use the mean of the posterior distribution (called the posterior
mean) of θ as a point estimate of θ. Note that the Bayes estimator
depends on the prior distribution. Different prior distributions result in
different Bayes estimators.
For example, if $Y \sim \text{binomial}(n, p)$ and $p$ has a $\text{beta}(\alpha, \beta)$ prior, the posterior distribution of $p$ given $Y = y$ is $\text{beta}(y + \alpha,\, n - y + \beta)$:
$$f(p \mid y) = \frac{\Gamma(n + \alpha + \beta)}{\Gamma(y + \alpha)\,\Gamma(n - y + \beta)}\, p^{y + \alpha - 1}(1 - p)^{n - y + \beta - 1},$$
with posterior mean (the Bayes estimator) $(y + \alpha)/(n + \alpha + \beta)$.
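To make the update concrete, here is a small Python sketch (my addition, not from the notes) of the beta-binomial calculation above; the data and prior hyperparameters are arbitrary choices.

```python
from scipy import stats

n, y = 20, 7            # 7 successes in 20 trials
alpha, beta = 2.0, 2.0  # beta(2, 2) prior on p

# Posterior: beta(y + alpha, n - y + beta) by conjugacy.
posterior = stats.beta(y + alpha, n - y + beta)

bayes_estimate = (y + alpha) / (n + alpha + beta)  # posterior mean
print(bayes_estimate, posterior.mean())            # these agree
print(posterior.interval(0.95))                    # a 95% credible interval
```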
The EM Algorithm:
The EM Algorithm is designed to find MLEs when the likelihood function
involves incomplete data. Specifically, let Y = (Y1 , . . . , Yn ) be the
incomplete (observed) data, and X = (X1 , . . . , Xn ) be the augmented (or
missing) data, making (Y , X) the complete data.
The densities $g(\cdot \mid \theta)$ of $Y$ and $f(\cdot \mid \theta)$ of $(Y, X)$ have the relationship
$$g(y \mid \theta) = \int f(y, x \mid \theta)\, dx.$$
To find θ̂ that maximizes L(θ | y) (the MLE of θ), it may be easier to work
with L(θ | y, x) via the EM algorithm as described below.
(i) Start with an initial value $\theta^{(0)}$.
(ii) At the $r$-th step with current value $\theta^{(r)}$ ($r = 0, 1, \dots$), find the value (denoted $\theta^{(r+1)}$) that maximizes $E\big[\log L(\theta \mid y, X) \,\big|\, \theta^{(r)}, y\big]$.
Note that the $r$-th step of the EM algorithm consists of two parts: the "E-step" calculates the expected log-likelihood $E\big[\log L(\theta \mid y, X) \,\big|\, \theta^{(r)}, y\big]$ and the "M-step" finds its maximizer.
7.14 Theorem.
$\log L(\theta^{(r)} \mid y)$ increases monotonically in $r$.
7.15 Lemma.
For two pdfs $g$ and $h$, we have
$$\int \log\Big[\frac{g(x)}{h(x)}\Big]\, g(x)\, dx \;\ge\; 0.$$
Proof.
Since $e^y \ge 1 + y$ for all $y \in \mathbb{R}$, we have, for $x$ with $g(x) > 0$,
$$\frac{h(x)}{g(x)} = \exp\Big(\log\frac{h(x)}{g(x)}\Big) \;\ge\; 1 + \log\frac{h(x)}{g(x)}.$$
It follows that
$$\int_{\{x:\, g(x)>0\}} \Big[\log\frac{h(x)}{g(x)}\Big]\, g(x)\, dx \;\le\; \int_{\{x:\, g(x)>0\}} \Big[\frac{h(x)}{g(x)} - 1\Big]\, g(x)\, dx = \int_{\{x:\, g(x)>0\}} h(x)\, dx - 1 \;\le\; 1 - 1 = 0,$$
which is the asserted inequality since $\log[g(x)/h(x)] = -\log[h(x)/g(x)]$.
7.16 Remarks.
The Kullback-Leibler information
$$D(g \,\|\, h) = \int \log\Big[\frac{g(x)}{h(x)}\Big]\, g(x)\, dx$$
is nonnegative by Lemma 7.15. In the proof of Theorem 7.14, the definition of $\theta^{(r+1)}$ as a maximizer of the expected complete-data log-likelihood gives
$$\int \big[\log f(y, x \mid \theta^{(r+1)})\big]\, f(x \mid \theta^{(r)}, y)\, dx \;\ge\; \int \big[\log f(y, x \mid \theta^{(r)})\big]\, f(x \mid \theta^{(r)}, y)\, dx, \tag{7.18}$$
while Lemma 7.15, applied to the pdfs $f(\cdot \mid \theta^{(r)}, y)$ and $f(\cdot \mid \theta^{(r+1)}, y)$, gives
$$\int \big[\log f(x \mid \theta^{(r+1)}, y)\big]\, f(x \mid \theta^{(r)}, y)\, dx \;\le\; \int \big[\log f(x \mid \theta^{(r)}, y)\big]\, f(x \mid \theta^{(r)}, y)\, dx. \tag{7.19}$$
Since $\log g(y \mid \theta) = \log f(y, x \mid \theta) - \log f(x \mid \theta, y)$, subtracting (7.19) from (7.18) shows that $\log L(\theta^{(r+1)} \mid y) \ge \log L(\theta^{(r)} \mid y)$.
7.20 Remarks.
The fact that $\log L(\theta^{(r)} \mid y)$ increases monotonically does not guarantee that $\theta^{(r)}$ converges to the MLE. Under suitable conditions, it can be shown that all limit points of an EM sequence $\{\theta^{(r)}\}$ are stationary points of $L(\theta \mid y)$, and $L(\theta^{(r)} \mid y)$ converges monotonically to $L(\hat\theta \mid y)$ for some stationary point $\hat\theta$ (which may be a local maximum or saddle point). In practice, one should try different initial values in the hope that one of them yields the MLE.
(c) At the $r$-th step, with current value $p^{(r)}$, we have, for the E-step, that
$$E\big[\log L(p \mid Z, x) \,\big|\, p^{(r)}, x\big] = E\Big[\sum_{i=1}^n Z_i \log\big(p f(x_i)\big) + (1 - Z_i)\log\big((1-p)\, g(x_i)\big) \,\Big|\, p^{(r)}, x\Big]$$
$$= \sum_{i=1}^n \Big\{\frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}\, \log\big(p f(x_i)\big) + \Big[1 - \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}\Big]\log\big((1-p)\, g(x_i)\big)\Big\}.$$
For the M-step, setting the derivative with respect to $p$ equal to zero gives
$$0 = \frac{1}{p}\sum_{i=1}^n \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)} \;-\; \frac{1}{1-p}\sum_{i=1}^n \Big[1 - \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}\Big] = \frac{1}{p}\, K - \frac{1}{1-p}\,(n - K),$$
where $K = \sum_{i=1}^n \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}$. So,
$$p^{(r+1)} = \frac{K}{n} = \frac{1}{n}\sum_{i=1}^n \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}.$$
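The fixed-point update $p^{(r+1)}$ is straightforward to implement. Below is a small Python sketch (added for illustration); it assumes, arbitrarily, that $f$ and $g$ are known normal densities and that only the mixing weight $p$ is unknown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulate from the mixture p*f + (1-p)*g with f = n(0,1), g = n(4,1), true p = 0.3.
n, p_true = 2000, 0.3
z = rng.random(n) < p_true
x = np.where(z, rng.normal(0.0, 1.0, n), rng.normal(4.0, 1.0, n))

f = stats.norm(0.0, 1.0).pdf  # component densities are taken as known here;
g = stats.norm(4.0, 1.0).pdf  # only the weight p is estimated

p = 0.5                                          # initial value p^(0)
for _ in range(200):
    w = p * f(x) / (p * f(x) + (1 - p) * g(x))   # E-step: E[Z_i | p^(r), x]
    p_new = w.mean()                             # M-step: p^(r+1) = K/n
    if abs(p_new - p) < 1e-10:
        break
    p = p_new
print(p)  # close to 0.3
```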
7.23 Remarks.
The subscript θ in Eθ refers to the pdf f (· | θ). Thus,
$$E_\theta (W - \theta)^2 = \int \big(W(x) - \theta\big)^2 f(x \mid \theta)\, dx.$$
The MSE measures the averaged squared difference between the estimator
W and the parameter θ. In general, any increasing function of the absolute
distance |W − θ| would serve to measure the goodness of an estimator
(mean absolute error, Eθ |W − θ|, is a reasonable alternative).
The MSE is a popular and convenient measure for its mathematical
tractability. Moreover, the MSE incorporates two components, one
measuring the variability of the estimator (precision) and the other
measuring its bias (accuracy). More precisely, we have
$$E_\theta (W - \theta)^2 = E_\theta (W - E_\theta W + E_\theta W - \theta)^2 = \text{Var}_\theta\, W + (E_\theta W - \theta)^2 = (\text{variance of } W) + (\text{squared bias of } W).$$
7.25 Example.
Let $X_1, \dots, X_n$ be a sample from $f(x \mid \theta)$. Let $\mu = \mu(\theta) = \int x\, f(x \mid \theta)\, dx$ and $\sigma^2 = \sigma^2(\theta) = \int x^2 f(x \mid \theta)\, dx - \mu^2$. Then the sample mean $\bar X$ and sample variance $S^2$ are unbiased estimators of $\mu$ and $\sigma^2$, respectively. Now suppose $f(x \mid \theta) = n(\mu, \sigma^2)$. Then
$$E(\bar X - \mu)^2 = \text{Var}\, \bar X = \frac{\sigma^2}{n}, \qquad E(S^2 - \sigma^2)^2 = \text{Var}\, S^2.$$
Since $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, we may write
$$S^2 = \frac{\sigma^2 (Z_1^2 + \dots + Z_{n-1}^2)}{n-1},$$
where $Z_1, \dots, Z_{n-1}$ are iid $n(0, 1)$. Then
$$E S^4 = \frac{\sigma^4}{(n-1)^2}\, E(Z_1^2 + \dots + Z_{n-1}^2)^2 = \frac{\sigma^4}{(n-1)^2}\big[(n-1)\, E Z_1^4 + (n-1)(n-2)\, E Z_1^2 Z_2^2\big] = \frac{\sigma^4}{(n-1)^2}\big[3(n-1) + (n-1)(n-2)\big] = \frac{n+1}{n-1}\,\sigma^4,$$
so that
$$\text{MSE}(S^2) = \text{Var}\, S^2 = E S^4 - \sigma^4 = \frac{2\sigma^4}{n-1}.$$
Consider the alternative estimator
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 = \frac{n-1}{n}\, S^2.$$
The bias of $\hat\sigma^2$ is $E\hat\sigma^2 - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}$ and $\text{Var}\, \hat\sigma^2 = \big(\frac{n-1}{n}\big)^2 \text{Var}\, S^2 = \big(\frac{n-1}{n}\big)^2 \frac{2\sigma^4}{n-1} = \frac{2(n-1)\,\sigma^4}{n^2}$. We have
$$\text{MSE}(\hat\sigma^2) = E(\hat\sigma^2 - \sigma^2)^2 = \Big(-\frac{\sigma^2}{n}\Big)^2 + \frac{2(n-1)\,\sigma^4}{n^2} = \frac{2n-1}{n^2}\,\sigma^4 \;<\; \frac{2}{n-1}\,\sigma^4 = \text{MSE}(S^2).$$
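A quick Monte Carlo check (an added illustration, not part of the notes) confirms the comparison $\text{MSE}(\hat\sigma^2) = \frac{2n-1}{n^2}\sigma^4 < \frac{2}{n-1}\sigma^4 = \text{MSE}(S^2)$ for normal samples; the constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, reps = 10, 4.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
S2 = x.var(axis=1, ddof=1)        # unbiased sample variance
sig2_hat = x.var(axis=1, ddof=0)  # divides by n: biased, but smaller MSE

print(np.mean((S2 - sigma2) ** 2), 2 * sigma2**2 / (n - 1))               # MSE(S^2) vs 2*sigma^4/(n-1)
print(np.mean((sig2_hat - sigma2) ** 2), (2 * n - 1) * sigma2**2 / n**2)  # MSE(sigma-hat^2) vs (2n-1)*sigma^4/n^2
```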
7.27 Remarks.
In general, no estimator exists which is better than any other estimator at
every parameter value. For example, the trivial estimator W = 2.5, which
does not depend on the data, has MSE = 0 if θ = 2.5. This estimator is
obviously of no interest. Below we restrict our attention to the class of
unbiased estimators. (It should be pointed out that the class of unbiased
estimators is often too restrictive. It is instructive to consider estimators
with a small bias.) We will see that in some cases there exists a best unbiased estimator in the following sense: an estimator $W^*$ is a best unbiased estimator of $\tau(\theta)$ if
(i) $E_\theta W^* = \tau(\theta)$ for all $\theta$, and
(ii) $\text{Var}_\theta\, W^* \le \text{Var}_\theta\, W$ for all $\theta$, for any estimator $W$ with $E_\theta W = \tau(\theta)$.
7.29 Remarks.
A best unbiased estimator of τ (θ) is also called a uniformly minimum
variance unbiased estimator (UMVUE) of τ (θ).
Proof.
Let $\tau(\theta) = E_\theta W(X)$. Then, by the Cauchy-Schwarz inequality,
$$\Big(\frac{d}{d\theta}\tau(\theta)\Big)^2 = \Big[\frac{d}{d\theta} E_\theta W(X)\Big]^2 = \Big[\int W(x)\, \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx\Big]^2$$
$$= \Big[\int \big(W(x) - \tau(\theta)\big)\, \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx\Big]^2 \qquad \Big(\text{since } \int \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = \frac{d}{d\theta}\, 1 = 0\Big)$$
$$= \Big[\int \big(W(x) - \tau(\theta)\big)\Big(\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big) f(x \mid \theta)\, dx\Big]^2$$
$$= \Big[E_\theta\Big(\big(W(X) - \tau(\theta)\big)\, \frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big)\Big]^2$$
$$\le\; E_\theta\big(W(X) - \tau(\theta)\big)^2\; E_\theta\Big(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big)^2,$$
from which it follows that
$$\text{Var}_\theta\, W(X) \;\ge\; \frac{\Big[\dfrac{d}{d\theta}\, E_\theta W(X)\Big]^2}{E_\theta\Big[\dfrac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2}.$$
Proof.
Since $f(x \mid \theta) = f(x_1 \mid \theta) \cdots f(x_n \mid \theta)$, we have
$$E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2 = E_\theta\Big[\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \theta)\Big]^2 = n\, E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\Big]^2 + n(n-1)\, E\Big[\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\, \frac{\partial}{\partial\theta}\log f(X_2 \mid \theta)\Big].$$
The corollary follows by noting that, by independence,
$$E\Big[\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\, \frac{\partial}{\partial\theta}\log f(X_2 \mid \theta)\Big] = E\,\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\; E\,\frac{\partial}{\partial\theta}\log f(X_2 \mid \theta)$$
and
$$E\,\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta) = \int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx = \int \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = 0.$$
7.32 Remarks.
Theorem 7.30 and Corollary 7.31 hold for the discrete case as well. In
either the discrete or continuous case, in order for the Cramér-Rao lower
bound to hold, what is required is to allow interchange of integration (over
x) and differentiation (with respect to θ).
7.33 Remarks.
The quantity $I_\theta = E_\theta\big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\big]^2$ is called the Fisher information per observation. For a sample consisting of $n$ iid observations, the total Fisher information is $nI_\theta$. Fisher information may be regarded as the limiting version of Kullback-Leibler information in the following sense. Fix $\theta_0$ and let $\theta$ be close to $\theta_0$. Recall that
$$D\big(f(\cdot \mid \theta_0) \,\|\, f(\cdot \mid \theta)\big) = \int \log\Big[\frac{f(x \mid \theta_0)}{f(x \mid \theta)}\Big]\, f(x \mid \theta_0)\, dx.$$
By Taylor's expansion,
$$\log f(x \mid \theta) = \log f(x \mid \theta_0) + \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta_0)\Big](\theta - \theta_0) + \frac{1}{2}\Big[\frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta_0)\Big](\theta - \theta_0)^2 + \text{higher-order terms}.$$
Taking expectations under $f(\cdot \mid \theta_0)$ (the first-order term has mean $0$) and using Lemma 7.34 below, we obtain
$$D\big(f(\cdot \mid \theta_0) \,\|\, f(\cdot \mid \theta)\big) \approx \frac{1}{2}\, I_{\theta_0}\, (\theta - \theta_0)^2.$$
7.34 Lemma.
Under regularity conditions (to allow interchange of integration and
differentiation), we have
$$E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2 = -E_\theta\Big[\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\Big].$$
Proof.
We have
$$0 = \frac{\partial}{\partial\theta}\, 1 = \frac{\partial}{\partial\theta}\int f(x \mid \theta)\, dx = \int \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = \int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx.$$
Differentiating once more,
$$0 = \frac{\partial}{\partial\theta}\int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx = \int \Big[\frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx + \int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big]^2 f(x \mid \theta)\, dx$$
$$= E_\theta\Big[\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\Big] + E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2,$$
completing the proof.
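As a numerical sanity check of Lemma 7.34 (an added sketch, not part of the notes): for Poisson($\lambda$), $\frac{\partial}{\partial\lambda}\log f(x \mid \lambda) = x/\lambda - 1$ and $\frac{\partial^2}{\partial\lambda^2}\log f(x \mid \lambda) = -x/\lambda^2$, so both sides of the identity equal $1/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)

score = x / lam - 1.0     # d/d(lambda) log f(x | lambda)
second = -x / lam**2      # d^2/d(lambda)^2 log f(x | lambda)

print(np.mean(score**2))  # approximately 1/lambda
print(-np.mean(second))   # approximately 1/lambda as well, per Lemma 7.34
```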
For the Poisson($\lambda$) family, $I_\lambda = 1/\lambda$, so the Cramér-Rao lower bound equals $\lambda/n$ and
$$\text{Var}_\lambda\, W(X) \ge \frac{\lambda}{n}$$
for any unbiased estimator $W(X)$ of $\lambda$. Since $\text{Var}_\lambda\, \hat\lambda = \text{Var}_\lambda\, \bar X = \lambda/n$ attains this bound, the MLE $\hat\lambda = \bar X$ is a best unbiased estimator of $\lambda$. In particular, we have $\text{Var}_\lambda\, \hat\lambda \le \text{Var}_\lambda\, S^2$ since the sample variance $S^2$ is also an unbiased estimator of $\lambda$.
Consider a one-parameter exponential family in natural form, $f(x \mid \theta) = g(x)\, e^{\theta h(x) - H(\theta)}$, and let $\tau(\theta) = E_\theta h(X)$. Differentiating the identity $\int g(x)\, e^{\theta h(x) - H(\theta)}\, dx = 1$ with respect to $\theta$ gives $\int \big(h(x) - H'(\theta)\big)\, g(x)\, e^{\theta h(x) - H(\theta)}\, dx = 0$, so that
$$H'(\theta) = \int h(x)\, g(x)\, e^{\theta h(x) - H(\theta)}\, dx = E_\theta h(X).$$
Differentiating once more, we get
$$H''(\theta) + \big(H'(\theta)\big)^2 = \int \big(h(x)\big)^2\, g(x)\, e^{\theta h(x) - H(\theta)}\, dx = E_\theta\big(h(X)\big)^2,$$
i.e. $\text{Var}_\theta\, h(X) = H''(\theta)$. The MLE $\hat\theta$ solves the likelihood equation $\frac{d}{d\theta}\log L(\theta \mid x) = \sum_{i=1}^n h(x_i) - n H'(\theta) = 0$, i.e.
$$\tau(\hat\theta) = \frac{1}{n}\sum_{i=1}^n h(x_i).$$
Note that
$$E_\theta\, \tau(\hat\theta) = E_\theta\, h(X) = \tau(\theta), \qquad \text{Var}_\theta\, \tau(\hat\theta) = \frac{\text{Var}_\theta\, h(X)}{n} = \frac{H''(\theta)}{n} = \frac{\tau'(\theta)}{n}.$$
The Fisher information is
$$I_\theta = E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2 = -E_\theta\Big[\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\Big] = -E_\theta\big(-H''(\theta)\big) = H''(\theta) = \tau'(\theta),$$
so the Cramér-Rao lower bound for unbiased estimators of $\tau(\theta)$ is $\big(\tau'(\theta)\big)^2/(n I_\theta) = \tau'(\theta)/n$, which is attained by $\tau(\hat\theta)$. Hence $\tau(\hat\theta)$ is a best unbiased estimator of $\tau(\theta)$.
For the Bernoulli family in natural form (see Remarks 7.39 below), $h(x) = x$ and $H(\theta) = \log(1 + e^\theta)$. Note that $H'(\theta) = \frac{e^\theta}{1 + e^\theta} = \frac{1}{1 + e^{-\theta}} = p$. Thus the MLE $\hat p$ of $p$ is $\bar X$, which is a best unbiased estimator of $p$ as established above. Indeed, we may directly prove this as follows. First,
$$\text{Var}_p\, \hat p = \frac{p(1-p)}{n}.$$
Second, the Fisher information for $p$ is $I_p = \frac{1}{p(1-p)}$, so the Cramér-Rao lower bound $\frac{1}{nI_p} = \frac{p(1-p)}{n}$ is attained by $\hat p$.
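A short simulation (an added illustration) confirms that $\text{Var}_p\, \hat p$ matches the Cramér-Rao bound $p(1-p)/n$; the values of $p$, $n$, and the number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.3, 50, 200_000

p_hat = rng.binomial(n, p, size=reps) / n  # MLE p-hat = X-bar, over many simulated samples
print(p_hat.var(), p * (1 - p) / n)        # empirical variance vs Cramer-Rao bound
```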
7.39 Remarks.
Fisher information depends on the parameterization as follows. Consider two parameterizations $\theta$ and $\phi$ with a 1-1 correspondence between them, written $\theta = \theta(\phi)$ and $\phi = \phi(\theta)$. The pair $(\theta, \phi)$ refers to the same pdf $f(x \mid \theta) = g(x \mid \phi)$. That is,
$$f(x \mid \theta) = g\big(x \mid \phi(\theta)\big) \quad\text{and}\quad g(x \mid \phi) = f\big(x \mid \theta(\phi)\big).$$
By the chain rule, $\frac{\partial}{\partial\theta}\log f(x \mid \theta) = \big[\frac{\partial}{\partial\phi}\log g(x \mid \phi)\big]_{\phi = \phi(\theta)}\, \phi'(\theta)$, so the two Fisher informations are related by $I_\theta = \big(\phi'(\theta)\big)^2\, I_{\phi(\theta)}$. For example, the Bernoulli($p$) pmf $f(x \mid p) = p^x (1-p)^{1-x}$ can be written in the natural parameterization $\theta = \log\frac{p}{1-p}$ as
$$g(x \mid \theta) = e^{x\theta - \log(1 + e^\theta)}.$$
7.40 Remarks.
When the support of $f(\cdot \mid \theta)$ depends on $\theta$, Fisher information may not be well defined. For example, let $f(x \mid \theta) = \frac{1}{\theta}$ for $0 \le x \le \theta$. The support of $f(\cdot \mid \theta)$ is $[0, \theta]$. Note that
$$0 = \frac{\partial}{\partial\theta}\, 1 = \frac{\partial}{\partial\theta}\int_0^\theta f(x \mid \theta)\, dx \;\ne\; \int_0^\theta \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = -\frac{1}{\theta},$$
so the interchange of integration and differentiation fails. "Formally", $\frac{\partial}{\partial\theta}\log f(x \mid \theta) = \frac{\partial}{\partial\theta}\log\theta^{-1} = -\frac{1}{\theta}$, so that $I_\theta$ would appear to be equal to $E_\theta\big(-\frac{1}{\theta}\big)^2 = \frac{1}{\theta^2}$. However, for $\Delta\theta > 0$,
$$\int_0^\theta \Big[\frac{\log f(x \mid \theta) - \log f(x \mid \theta - \Delta\theta)}{\Delta\theta}\Big]^2 f(x \mid \theta)\, dx \;\ge\; \int_{\theta - \Delta\theta}^{\theta} \Big[\frac{\log\frac{1}{\theta} - \log 0}{\Delta\theta}\Big]^2 \frac{1}{\theta}\, dx = \infty,$$
since $f(x \mid \theta - \Delta\theta) = 0$ for $x \in (\theta - \Delta\theta, \theta]$.
For the multiparameter case $\theta = (\theta_1, \dots, \theta_k)$, the generalized Cramér-Rao inequality states that $\text{Var}_\theta\, W(X) \ge \big(\nabla\tau(\theta)\big)^T I_\theta^{-1}\, \nabla\tau(\theta)$ for any unbiased estimator $W(X)$ of $\tau(\theta)$, where $I_\theta$ denotes the Fisher information matrix. ($I_\theta$ is assumed to be non-singular.)
Proof.
We have
$$1 = \int f(x \mid \theta)\, dx \quad\text{for all } \theta,$$
yielding
$$0 = \frac{\partial}{\partial\theta_i}\, 1 = \frac{\partial}{\partial\theta_i}\int f(x \mid \theta)\, dx = \int \frac{\partial}{\partial\theta_i} f(x \mid \theta)\, dx = \int \frac{\partial \log f(x \mid \theta)}{\partial\theta_i}\, f(x \mid \theta)\, dx. \tag{7.42}$$
Similarly, differentiating both sides of $\tau(\theta) = \int W(x)\, f(x \mid \theta)\, dx$ gives
$$\frac{\partial\tau(\theta)}{\partial\theta_i} = \int W(x)\, \frac{\partial \log f(x \mid \theta)}{\partial\theta_i}\, f(x \mid \theta)\, dx. \tag{7.43}$$
Combining (7.42) and (7.43), we have, for constants $c_1, \dots, c_k$,
$$\sum_{i=1}^k c_i\, \frac{\partial\tau}{\partial\theta_i} = \int \big[W(x) - \tau(\theta)\big]\Big[\sum_{i=1}^k c_i\, \frac{\partial}{\partial\theta_i}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx = E_\theta\Big(\big[W(X) - \tau(\theta)\big]\Big[\sum_{i=1}^k c_i\, \frac{\partial \log f(X \mid \theta)}{\partial\theta_i}\Big]\Big).$$
By the Cauchy-Schwarz inequality,
$$\Big(\sum_{i=1}^k c_i\, \frac{\partial\tau}{\partial\theta_i}\Big)^2 \;\le\; \text{Var}_\theta\, W(X)\; E\Big[\sum_{i=1}^k c_i\, \frac{\partial \log f(X \mid \theta)}{\partial\theta_i}\Big]^2.$$
Thus, writing $\tilde c = I_\theta^{1/2} c$, where $I_\theta^{1/2}$ is the symmetric positive-definite matrix such that $I_\theta^{1/2} I_\theta^{1/2} = I_\theta$,
$$\text{Var}_\theta\, W(X) \;\ge\; \sup_{c}\, \frac{\big(c^T \nabla\tau(\theta)\big)^2}{c^T I_\theta\, c} = \sup_{c}\, \frac{\big(c^T I_\theta^{1/2}\, I_\theta^{-1/2}\, \nabla\tau(\theta)\big)^2}{(c^T I_\theta^{1/2})(I_\theta^{1/2} c)} = \sup_{\tilde c}\, \frac{\big(\tilde c^T I_\theta^{-1/2}\, \nabla\tau(\theta)\big)^2}{\tilde c^T \tilde c}.$$
By the Cauchy-Schwarz inequality, the supremum is attained at $\tilde c = I_\theta^{-1/2}\, \nabla\tau(\theta)$, giving
$$\text{Var}_\theta\, W(X) \;\ge\; \frac{\Big[\big(I_\theta^{-1/2}\nabla\tau(\theta)\big)^T \big(I_\theta^{-1/2}\nabla\tau(\theta)\big)\Big]^2}{\big(I_\theta^{-1/2}\nabla\tau(\theta)\big)^T \big(I_\theta^{-1/2}\nabla\tau(\theta)\big)} = \big(\nabla\tau(\theta)\big)^T I_\theta^{-1}\, \nabla\tau(\theta).$$
This proves the (generalized) Cramér-Rao inequality.
Proof.
Let $\phi(T) = E_\theta\big(W(X) \mid T\big)$. We have
$$\tau(\theta) = E_\theta\, W(X) = E_\theta\big(E_\theta(W(X) \mid T)\big) = E_\theta\, \phi(T),$$
so $\phi(T)$ is also unbiased for $\tau(\theta)$. Moreover, by the conditional variance identity,
$$\text{Var}_\theta\, W(X) = \text{Var}_\theta\, \phi(T) + E_\theta\big[\text{Var}(W(X) \mid T)\big] \;\ge\; \text{Var}_\theta\, \phi(T).$$
$W' = aW + b$.
Clearly, the risk function is minimized at $b = 1$. Thus, under Stein's loss, $S^2$ has the smallest risk in the class $\{bS^2 : b > 0\}$.
A sequence of estimators $W_n$ is said to be consistent for $\theta$ if, for every $\varepsilon > 0$,
$$\lim_{n\to\infty} P_\theta\big(|W_n - \theta| \ge \varepsilon\big) = 0.$$
7.50 Example.
Let $W_n$ be a sequence of estimators such that $E_\theta\big[(W_n - \theta)^2\big] \to 0$ as $n \to \infty$. Then by Chebychev's inequality, for $\varepsilon > 0$,
$$P_\theta\big(|W_n - \theta| \ge \varepsilon\big) \;\le\; \frac{E_\theta\big[(W_n - \theta)^2\big]}{\varepsilon^2} \;\longrightarrow\; 0 \quad\text{as } n \to \infty.$$
So $W_n$ is consistent for $\theta$. In particular, the sample mean $\bar X_n$ is consistent for the population mean $\mu$ since
$$E\big[(\bar X_n - \mu)^2\big] = \frac{\sigma^2}{n} \;\longrightarrow\; 0 \quad\text{as } n \to \infty,$$
where $\sigma^2$ is the population variance.
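The Chebyshev bound can be watched at work in a small simulation (an added sketch; the normal model and the particular constants are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, eps, reps = 1.0, 2.0, 0.2, 1000

for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) >= eps)
    print(n, prob, sigma**2 / (n * eps**2))  # empirical probability vs Chebyshev bound
```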
7.52 Remarks.
By considering the Kullback-Leibler information, we showed consistency of
θ̂n when the parameter space Θ consists of finitely many points. The
general case requires more delicate arguments.
7.54 Remarks.
Recall the Cramér-Rao lower bound
$$\frac{\big(\tau'(\theta)\big)^2}{n\, E_\theta\big(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\big)^2}.$$
Proof.
Fix $\theta_0 \in \Theta$. Let $X_1, X_2, \dots$ be iid $f(x \mid \theta_0)$. Let $\log L(\theta \mid \boldsymbol{X}_n) = \sum_{i=1}^n \log f(X_i \mid \theta)$ be the log-likelihood function. By Taylor's expansion at $\theta_0$,
$$\frac{\partial}{\partial\theta}\log L(\theta \mid \boldsymbol{X}_n) = \frac{\partial}{\partial\theta}\log L(\theta_0 \mid \boldsymbol{X}_n) + (\theta - \theta_0)\, \frac{\partial^2}{\partial\theta^2}\log L(\theta_0 \mid \boldsymbol{X}_n) + \text{higher-order terms},$$
so that
$$0 = \frac{\partial}{\partial\theta}\log L(\hat\theta_n \mid \boldsymbol{X}_n) \approx \frac{\partial}{\partial\theta}\log L(\theta_0 \mid \boldsymbol{X}_n) + (\hat\theta_n - \theta_0)\, \frac{\partial^2}{\partial\theta^2}\log L(\theta_0 \mid \boldsymbol{X}_n),$$
from which it follows that
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \;\approx\; \frac{\dfrac{1}{\sqrt{n}}\displaystyle\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)}{-\dfrac{1}{n}\displaystyle\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(X_i \mid \theta_0)}. \tag{7.56}$$
Since $-\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(X_i \mid \theta_0)$ converges in probability to $-E_{\theta_0}\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta_0) = I_{\theta_0}$ and $\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)$ converges in distribution to $n(0, I_{\theta_0})$, the right-hand side of (7.56) converges in distribution to $n(0, 1/I_{\theta_0})$. By the delta method, $\sqrt{n}\big(\tau(\hat\theta_n) - \tau(\theta_0)\big)$ converges in distribution to $n\big(0,\, \big(\tau'(\theta_0)\big)^2 / I_{\theta_0}\big)$.
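A simulation sketch (added for illustration) of this asymptotic normality: for exponential data with mean $\theta_0$, the MLE is $\hat\theta_n = \bar X_n$ and $I_\theta = 1/\theta^2$, so $\sqrt{n}(\hat\theta_n - \theta_0)$ should have mean near 0 and variance near $\theta_0^2$; the constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 2.0, 500, 20_000

x = rng.exponential(theta0, size=(reps, n))
theta_hat = x.mean(axis=1)            # MLE of theta for the exponential(mean theta) model
z = np.sqrt(n) * (theta_hat - theta0)

# Asymptotic theory: z is approximately n(0, 1/I_theta0) with I_theta = 1/theta^2.
print(z.mean(), z.var())              # approximately 0 and theta0**2 = 4
```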
7.57 Remarks.
We may say that θ̂n is asymptotically normal with mean θ and variance
(nIθ )−1 . Thus the variance of θ̂n decreases at the rate of n−1 as n
increases. Method-of-moments estimators are also asymptotically normal
with variances decaying at the rate of n−1 . That is, asymptotic variances
are of the form c(θ)/n for some c(θ) ≥ Iθ−1 . If c(θ) > Iθ−1 , then the
estimator is not asymptotically efficient.
7.59 Remarks.
If ARE(Vn , Wn ) > 1, Vn is said to be asymptotically more efficient than
Wn . Note that ARE(Vn , Wn ) depends on θ in general. It may happen that
ARE(Vn , Wn ) is greater than 1 for some values of θ but less than 1 for some
other values of θ. If ARE$(V_n, W_n) = 2$ (say), then $W_{2n}$ has approximately the same variance as $V_n$ for large $n$ (since the asymptotic variances satisfy $\sigma_W^2/(2n) = \sigma_V^2/n$ when $\sigma_W^2 = 2\sigma_V^2$). In other words, for the two estimators $W$ and $V$ to have the same variance, $W$ requires about twice the sample size as $V$ does.
Here $\tau(\lambda) = e^{-\lambda} = P_\lambda(X = 0)$ is estimated either by the MLE $e^{-\hat\lambda_n}$, whose asymptotic variance is $v(\lambda)/n$ with
$$v(\lambda) = \frac{\big(\tau'(\lambda)\big)^2}{I_\lambda} = \frac{e^{-2\lambda}}{\lambda^{-1}} = \lambda e^{-2\lambda},$$
or by $\hat\tau_n$, the sample proportion of zeros, whose variance is $e^{-\lambda}(1 - e^{-\lambda})/n$. Thus,
$$\text{ARE}(\hat\tau_n,\, e^{-\hat\lambda_n}) = \frac{\lambda e^{-2\lambda}}{e^{-\lambda}(1 - e^{-\lambda})} = \frac{\lambda}{e^\lambda - 1},$$
which is less than 1 for $\lambda > 0$ and decreases to 0 as $\lambda \to \infty$.
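Tabulating the final expression (an added snippet) shows how quickly this ARE drops below 1 as $\lambda$ grows:

```python
import numpy as np

lam = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
are = lam / (np.exp(lam) - 1.0)  # ARE(tau-hat_n, exp(-lambda-hat_n)) = lambda / (e^lambda - 1)
for l, a in zip(lam, are):
    print(f"lambda = {l:5.1f}   ARE = {a:.4f}")
```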