MATH5063
7.3 Remarks.
An estimator is a function of the sample, while an estimate is the realized value of an estimator. Notationally, $W(X_1, \dots, X_n)$ is an estimator and $W(x_1, \dots, x_n)$ is an estimate. Some authors use the word "estimate" for both $W(X_1, \dots, X_n)$ and $W(x_1, \dots, x_n)$.
Method of Moments:
Let $X_1, \dots, X_n$ be a random sample from a pdf/pmf $f(x \mid \theta_1, \dots, \theta_k)$ which depends on $k$ parameters $\theta_1, \dots, \theta_k$. Method-of-moments estimators are found by equating the first $k$ sample moments to the corresponding population moments and solving the resulting system of equations. More precisely, for $l = 1, \dots, k$, let $m_l = \frac{1}{n}\sum_{i=1}^n X_i^l$ and solve
$$m_l = \mu'_l(\theta_1, \dots, \theta_k), \quad l = 1, \dots, k.$$
For example, for a sample with population mean $\theta$ and variance $\sigma^2$, equating the first two sample moments to the corresponding population moments gives
$$\bar X = \tilde\theta, \qquad \frac{1}{n}\sum_{i=1}^n X_i^2 = \tilde\theta^2 + \tilde\sigma^2.$$
That is,
$$\tilde\theta = \bar X, \qquad \tilde\sigma^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \tilde\theta^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar X^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 = \frac{n-1}{n}\, S^2 < S^2.$$
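To see these formulas in action, here is a minimal Python sketch (an added illustration, not part of the original notes) computing $\tilde\theta$ and $\tilde\sigma^2$ from a simulated sample; the distribution, parameter values, and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)  # simulated sample; true theta = 2, sigma^2 = 9

m1 = x.mean()           # first sample moment  (1/n) sum X_i
m2 = np.mean(x**2)      # second sample moment (1/n) sum X_i^2

theta_mm = m1           # theta-tilde = X-bar
sigma2_mm = m2 - m1**2  # sigma^2-tilde = (1/n) sum X_i^2 - X-bar^2

# sigma2_mm equals (n-1)/n * S^2, hence is slightly smaller than S^2.
S2 = x.var(ddof=1)
print(theta_mm, sigma2_mm, (len(x) - 1) / len(x) * S2)
```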
7.6 Remarks.
Unless $\hat\theta(x)$ occurs at the boundary of the parameter space, $\hat\theta(x)$ is a solution of the equations
$$\frac{\partial}{\partial\theta_i}\, L(\theta \mid x) = 0, \quad i = 1, \dots, k.$$
In most cases, the MLE cannot be found analytically. Thus, one needs to
resort to numerical methods to obtain the MLE approximately.
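As an illustration of such numerical maximization (a sketch added here, not from the notes), the following Python code finds an approximate MLE for a gamma(shape, scale) model by minimizing the negative log-likelihood with scipy; the simulated data and the Nelder-Mead method are arbitrary choices, and the method-of-moments estimates serve as the starting point.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.5, scale=1.5, size=500)  # data whose parameters we pretend not to know

def neg_log_lik(params):
    a, scale = params
    if a <= 0 or scale <= 0:  # stay inside the parameter space
        return np.inf
    return -np.sum(stats.gamma.logpdf(x, a=a, scale=scale))

# Method-of-moments estimates make a reasonable starting value.
m, v = x.mean(), x.var()
res = optimize.minimize(neg_log_lik, x0=[m**2 / v, v / m], method="Nelder-Mead")
print(res.x)  # approximate MLE (shape-hat, scale-hat), close to (2.5, 1.5)
```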
7.7 Remarks.
Maximizing $L(\theta \mid x)$ is equivalent to maximizing $\log L(\theta \mid x)$ (the log-likelihood function). For example, if $X_1, \dots, X_n$ are iid $n(\theta, 1)$, then
$$\log L(\theta \mid x) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^n (x_i - \theta)^2.$$
The equation $\frac{d}{d\theta}\log L(\theta \mid x) = 0$ reduces to
$$\sum_{i=1}^n (x_i - \theta) = 0,$$
whose unique solution is $\hat\theta = \bar x$.
When both the mean $\theta$ and the variance $\sigma^2$ are unknown,
$$\log L(\theta, \sigma^2 \mid x) = -\frac{n}{2}\log 2\pi - \frac{n}{2}\log\sigma^2 - \frac{1}{2}\sum_{i=1}^n \frac{(x_i - \theta)^2}{\sigma^2},$$
and
$$\frac{\partial}{\partial\theta}\log L(\theta, \sigma^2 \mid x) = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \theta),$$
$$\frac{\partial}{\partial(\sigma^2)}\log L(\theta, \sigma^2 \mid x) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \theta)^2.$$
Setting both partial derivatives to zero yields $\hat\theta = \bar x$ and $\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2$. It can be rigorously verified that $(\hat\theta, \hat\sigma^2)$ uniquely maximizes the (log-)likelihood function.
Bayes Estimators:
Classical (frequentist) approach: The parameter θ is fixed but unknown.
Bayesian approach: θ is random with some prior distribution (which is a
subjective distribution based on the experimenter’s prior knowledge about θ
before the data are collected). When a sample is taken from a pdf/pmf
f (x | θ), the prior distribution is then updated with this sample
information. The updated prior is called the posterior distribution.
Specifically, let π(θ) be the prior distribution. The sampling distribution
f (x | θ) is the conditional distribution of the random sample X given θ.
Thus the joint distribution of (θ, X) is f (x | θ) π(θ). When X is observed
with X = x, the updated prior (i.e. posterior) distribution of θ is the
conditional distribution of θ given X = x, i.e.
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{m(x)},$$
where $m(x) = \int f(x \mid \theta)\, \pi(\theta)\, d\theta$ is the marginal distribution of $X$.
We often use the mean of the posterior distribution (called the posterior
mean) of θ as a point estimate of θ. Note that the Bayes estimator
depends on the prior distribution. Different prior distributions result in
different Bayes estimators.
For example, if $Y \sim \text{binomial}(n, p)$ and $p$ has a $\text{beta}(\alpha, \beta)$ prior, the posterior distribution of $p$ given $Y = y$ is $\text{beta}(y + \alpha,\, n - y + \beta)$:
$$f(p \mid y) = \frac{\Gamma(n + \alpha + \beta)}{\Gamma(y + \alpha)\,\Gamma(n - y + \beta)}\, p^{y + \alpha - 1}(1 - p)^{n - y + \beta - 1},$$
with posterior mean (the Bayes estimator) $(y + \alpha)/(n + \alpha + \beta)$.
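To make the update concrete, here is a small Python sketch (my addition, not from the notes) of the beta-binomial calculation above; the data and prior hyperparameters are arbitrary choices.

```python
from scipy import stats

n, y = 20, 7            # 7 successes in 20 trials
alpha, beta = 2.0, 2.0  # beta(2, 2) prior on p

# Posterior: beta(y + alpha, n - y + beta) by conjugacy.
posterior = stats.beta(y + alpha, n - y + beta)

bayes_estimate = (y + alpha) / (n + alpha + beta)  # posterior mean
print(bayes_estimate, posterior.mean())            # these agree
print(posterior.interval(0.95))                    # a 95% credible interval
```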
The EM Algorithm:
The EM Algorithm is designed to find MLEs when the likelihood function
involves incomplete data. Specifically, let Y = (Y1 , . . . , Yn ) be the
incomplete (observed) data, and X = (X1 , . . . , Xn ) be the augmented (or
missing) data, making (Y , X) the complete data.
The densities $g(\cdot \mid \theta)$ of $Y$ and $f(\cdot \mid \theta)$ of $(Y, X)$ have the relationship
$$g(y \mid \theta) = \int f(y, x \mid \theta)\, dx.$$
To find θ̂ that maximizes L(θ | y) (the MLE of θ), it may be easier to work
with L(θ | y, x) via the EM algorithm as described below.
(i) Start with an initial value $\theta^{(0)}$.
(ii) At the $r$-th step with current value $\theta^{(r)}$ ($r = 0, 1, \dots$), find the value (denoted $\theta^{(r+1)}$) that maximizes $E\big[\log L(\theta \mid y, X) \,\big|\, \theta^{(r)}, y\big]$.
Note that the $r$-th step of the EM algorithm consists of two parts: the "E-step" calculates the expected log-likelihood $E\big[\log L(\theta \mid y, X) \,\big|\, \theta^{(r)}, y\big]$ and the "M-step" finds its maximizer.
7.14 Theorem.
$\log L(\theta^{(r)} \mid y)$ increases monotonically in $r$.
7.15 Lemma.
For two pdfs $g$ and $h$, we have
$$\int \log\Big[\frac{g(x)}{h(x)}\Big]\, g(x)\, dx \;\ge\; 0.$$
Proof.
Since $e^y \ge 1 + y$ for all $y \in \mathbb{R}$, we have, for $x$ with $g(x) > 0$,
$$\frac{h(x)}{g(x)} = \exp\Big(\log\frac{h(x)}{g(x)}\Big) \;\ge\; 1 + \log\frac{h(x)}{g(x)}.$$
It follows that
$$\int_{\{x:\, g(x)>0\}} \Big[\log\frac{h(x)}{g(x)}\Big]\, g(x)\, dx \;\le\; \int_{\{x:\, g(x)>0\}} \Big[\frac{h(x)}{g(x)} - 1\Big]\, g(x)\, dx = \int_{\{x:\, g(x)>0\}} h(x)\, dx - 1 \;\le\; 1 - 1 = 0,$$
which is the asserted inequality since $\log[g(x)/h(x)] = -\log[h(x)/g(x)]$.
7.16 Remarks.
The Kullback-Leibler information
$$D(g \,\|\, h) = \int \log\Big[\frac{g(x)}{h(x)}\Big]\, g(x)\, dx$$
is nonnegative by Lemma 7.15. In the proof of Theorem 7.14, the definition of $\theta^{(r+1)}$ as a maximizer of the expected complete-data log-likelihood gives
$$\int \big[\log f(y, x \mid \theta^{(r+1)})\big]\, f(x \mid \theta^{(r)}, y)\, dx \;\ge\; \int \big[\log f(y, x \mid \theta^{(r)})\big]\, f(x \mid \theta^{(r)}, y)\, dx, \tag{7.18}$$
while Lemma 7.15, applied to the pdfs $f(\cdot \mid \theta^{(r)}, y)$ and $f(\cdot \mid \theta^{(r+1)}, y)$, gives
$$\int \big[\log f(x \mid \theta^{(r+1)}, y)\big]\, f(x \mid \theta^{(r)}, y)\, dx \;\le\; \int \big[\log f(x \mid \theta^{(r)}, y)\big]\, f(x \mid \theta^{(r)}, y)\, dx. \tag{7.19}$$
Since $\log g(y \mid \theta) = \log f(y, x \mid \theta) - \log f(x \mid \theta, y)$, subtracting (7.19) from (7.18) shows that $\log L(\theta^{(r+1)} \mid y) \ge \log L(\theta^{(r)} \mid y)$.
7.20 Remarks.
The fact that $\log L(\theta^{(r)} \mid y)$ increases monotonically does not guarantee that $\theta^{(r)}$ converges to the MLE. Under suitable conditions, it can be shown that all limit points of an EM sequence $\{\theta^{(r)}\}$ are stationary points of $L(\theta \mid y)$, and $L(\theta^{(r)} \mid y)$ converges monotonically to $L(\hat\theta \mid y)$ for some stationary point $\hat\theta$ (which may be a local maximum or saddle point). In practice, one should try different initial values in the hope that one of them yields the MLE.
(c) At the $r$-th step, with current value $p^{(r)}$, we have, for the E-step, that
$$E\big[\log L(p \mid Z, x) \,\big|\, p^{(r)}, x\big] = E\Big[\sum_{i=1}^n Z_i \log\big(p f(x_i)\big) + (1 - Z_i)\log\big((1-p)\, g(x_i)\big) \,\Big|\, p^{(r)}, x\Big]$$
$$= \sum_{i=1}^n \Big\{\frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}\, \log\big(p f(x_i)\big) + \Big[1 - \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}\Big]\log\big((1-p)\, g(x_i)\big)\Big\}.$$
For the M-step, setting the derivative with respect to $p$ equal to zero gives
$$0 = \frac{1}{p}\sum_{i=1}^n \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)} \;-\; \frac{1}{1-p}\sum_{i=1}^n \Big[1 - \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}\Big] = \frac{1}{p}\, K - \frac{1}{1-p}\,(n - K),$$
where $K = \sum_{i=1}^n \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}$. So,
$$p^{(r+1)} = \frac{K}{n} = \frac{1}{n}\sum_{i=1}^n \frac{p^{(r)} f(x_i)}{p^{(r)} f(x_i) + (1 - p^{(r)})\, g(x_i)}.$$
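The fixed-point update $p^{(r+1)}$ is straightforward to implement. Below is a small Python sketch (added for illustration); it assumes, arbitrarily, that $f$ and $g$ are known normal densities and that only the mixing weight $p$ is unknown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulate from the mixture p*f + (1-p)*g with f = n(0,1), g = n(4,1), true p = 0.3.
n, p_true = 2000, 0.3
z = rng.random(n) < p_true
x = np.where(z, rng.normal(0.0, 1.0, n), rng.normal(4.0, 1.0, n))

f = stats.norm(0.0, 1.0).pdf  # component densities are taken as known here;
g = stats.norm(4.0, 1.0).pdf  # only the weight p is estimated

p = 0.5                                          # initial value p^(0)
for _ in range(200):
    w = p * f(x) / (p * f(x) + (1 - p) * g(x))   # E-step: E[Z_i | p^(r), x]
    p_new = w.mean()                             # M-step: p^(r+1) = K/n
    if abs(p_new - p) < 1e-10:
        break
    p = p_new
print(p)  # close to 0.3
```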
7.23 Remarks.
The subscript θ in Eθ refers to the pdf f (· | θ). Thus,
$$E_\theta (W - \theta)^2 = \int \big(W(x) - \theta\big)^2 f(x \mid \theta)\, dx.$$
The MSE measures the averaged squared difference between the estimator
W and the parameter θ. In general, any increasing function of the absolute
distance |W − θ| would serve to measure the goodness of an estimator
(mean absolute error, Eθ |W − θ|, is a reasonable alternative).
The MSE is a popular and convenient measure for its mathematical
tractability. Moreover, the MSE incorporates two components, one
measuring the variability of the estimator (precision) and the other
measuring its bias (accuracy). More precisely, we have
$$E_\theta (W - \theta)^2 = E_\theta (W - E_\theta W + E_\theta W - \theta)^2 = \text{Var}_\theta\, W + (E_\theta W - \theta)^2 = (\text{variance of } W) + (\text{squared bias of } W).$$
7.25 Example.
Let $X_1, \dots, X_n$ be a sample from $f(x \mid \theta)$. Let $\mu = \mu(\theta) = \int x\, f(x \mid \theta)\, dx$ and $\sigma^2 = \sigma^2(\theta) = \int x^2 f(x \mid \theta)\, dx - \mu^2$. Then the sample mean $\bar X$ and sample variance $S^2$ are unbiased estimators of $\mu$ and $\sigma^2$, respectively. Now suppose $f(x \mid \theta) = n(\mu, \sigma^2)$. Then
$$E(\bar X - \mu)^2 = \text{Var}\, \bar X = \frac{\sigma^2}{n}, \qquad E(S^2 - \sigma^2)^2 = \text{Var}\, S^2.$$
Since $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, we may write
$$S^2 = \frac{\sigma^2 (Z_1^2 + \dots + Z_{n-1}^2)}{n-1},$$
where $Z_1, \dots, Z_{n-1}$ are iid $n(0, 1)$. Then
$$E S^4 = \frac{\sigma^4}{(n-1)^2}\, E(Z_1^2 + \dots + Z_{n-1}^2)^2 = \frac{\sigma^4}{(n-1)^2}\big[(n-1)\, E Z_1^4 + (n-1)(n-2)\, E Z_1^2 Z_2^2\big] = \frac{\sigma^4}{(n-1)^2}\big[3(n-1) + (n-1)(n-2)\big] = \frac{n+1}{n-1}\,\sigma^4,$$
so that
$$\text{MSE}(S^2) = \text{Var}\, S^2 = E S^4 - \sigma^4 = \frac{2\sigma^4}{n-1}.$$
Consider the alternative estimator
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar X)^2 = \frac{n-1}{n}\, S^2.$$
The bias of $\hat\sigma^2$ is $E\hat\sigma^2 - \sigma^2 = \frac{n-1}{n}\sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}$ and $\text{Var}\, \hat\sigma^2 = \big(\frac{n-1}{n}\big)^2 \text{Var}\, S^2 = \big(\frac{n-1}{n}\big)^2 \frac{2\sigma^4}{n-1} = \frac{2(n-1)\,\sigma^4}{n^2}$. We have
$$\text{MSE}(\hat\sigma^2) = E(\hat\sigma^2 - \sigma^2)^2 = \Big(-\frac{\sigma^2}{n}\Big)^2 + \frac{2(n-1)\,\sigma^4}{n^2} = \frac{2n-1}{n^2}\,\sigma^4 \;<\; \frac{2}{n-1}\,\sigma^4 = \text{MSE}(S^2).$$
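A quick Monte Carlo check (an added illustration, not part of the notes) confirms the comparison $\text{MSE}(\hat\sigma^2) = \frac{2n-1}{n^2}\sigma^4 < \frac{2}{n-1}\sigma^4 = \text{MSE}(S^2)$ for normal samples; the constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, reps = 10, 4.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
S2 = x.var(axis=1, ddof=1)        # unbiased sample variance
sig2_hat = x.var(axis=1, ddof=0)  # divides by n: biased, but smaller MSE

print(np.mean((S2 - sigma2) ** 2), 2 * sigma2**2 / (n - 1))               # MSE(S^2) vs 2*sigma^4/(n-1)
print(np.mean((sig2_hat - sigma2) ** 2), (2 * n - 1) * sigma2**2 / n**2)  # MSE(sigma-hat^2) vs (2n-1)*sigma^4/n^2
```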
7.27 Remarks.
In general, no estimator exists which is better than any other estimator at
every parameter value. For example, the trivial estimator W = 2.5, which
does not depend on the data, has MSE = 0 if θ = 2.5. This estimator is
obviously of no interest. Below we restrict our attention to the class of
unbiased estimators. (It should be pointed out that the class of unbiased
estimators is often too restrictive. It is instructive to consider estimators
with a small bias.) We will see that in some cases there exists a best unbiased estimator in the following sense: an estimator $W^*$ is a best unbiased estimator of $\tau(\theta)$ if
(i) $E_\theta W^* = \tau(\theta)$ for all $\theta$, and
(ii) $\text{Var}_\theta\, W^* \le \text{Var}_\theta\, W$ for all $\theta$, for any estimator $W$ with $E_\theta W = \tau(\theta)$.
7.29 Remarks.
A best unbiased estimator of τ (θ) is also called a uniformly minimum
variance unbiased estimator (UMVUE) of τ (θ).
Proof.
Let $\tau(\theta) = E_\theta W(X)$. Then, by the Cauchy-Schwarz inequality,
$$\Big(\frac{d}{d\theta}\tau(\theta)\Big)^2 = \Big[\frac{d}{d\theta} E_\theta W(X)\Big]^2 = \Big[\int W(x)\, \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx\Big]^2$$
$$= \Big[\int \big(W(x) - \tau(\theta)\big)\, \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx\Big]^2 \qquad \Big(\text{since } \int \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = \frac{d}{d\theta}\, 1 = 0\Big)$$
$$= \Big[\int \big(W(x) - \tau(\theta)\big)\Big(\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big) f(x \mid \theta)\, dx\Big]^2$$
$$= \Big[E_\theta\Big(\big(W(X) - \tau(\theta)\big)\, \frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big)\Big]^2$$
$$\le\; E_\theta\big(W(X) - \tau(\theta)\big)^2\; E_\theta\Big(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big)^2,$$
from which it follows that
$$\text{Var}_\theta\, W(X) \;\ge\; \frac{\Big[\dfrac{d}{d\theta}\, E_\theta W(X)\Big]^2}{E_\theta\Big[\dfrac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2}.$$
Proof.
Since $f(x \mid \theta) = f(x_1 \mid \theta) \cdots f(x_n \mid \theta)$, we have
$$E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2 = E_\theta\Big[\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \theta)\Big]^2 = n\, E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\Big]^2 + n(n-1)\, E\Big[\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\, \frac{\partial}{\partial\theta}\log f(X_2 \mid \theta)\Big].$$
The corollary follows by noting that, by independence,
$$E\Big[\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\, \frac{\partial}{\partial\theta}\log f(X_2 \mid \theta)\Big] = E\,\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta)\; E\,\frac{\partial}{\partial\theta}\log f(X_2 \mid \theta)$$
and
$$E\,\frac{\partial}{\partial\theta}\log f(X_1 \mid \theta) = \int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx = \int \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = 0.$$
7.32 Remarks.
Theorem 7.30 and Corollary 7.31 hold for the discrete case as well. In
either the discrete or continuous case, in order for the Cramér-Rao lower
bound to hold, what is required is to allow interchange of integration (over
x) and differentiation (with respect to θ).
7.33 Remarks.
The quantity $I_\theta = E_\theta\big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\big]^2$ is called the Fisher information per observation. For a sample consisting of $n$ iid observations, the total Fisher information is $nI_\theta$. Fisher information may be regarded as the limiting version of Kullback-Leibler information in the following sense. Fix $\theta_0$ and let $\theta$ be close to $\theta_0$. Recall that
$$D\big(f(\cdot \mid \theta_0) \,\|\, f(\cdot \mid \theta)\big) = \int \log\Big[\frac{f(x \mid \theta_0)}{f(x \mid \theta)}\Big]\, f(x \mid \theta_0)\, dx.$$
By Taylor's expansion,
$$\log f(x \mid \theta) = \log f(x \mid \theta_0) + \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta_0)\Big](\theta - \theta_0) + \frac{1}{2}\Big[\frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta_0)\Big](\theta - \theta_0)^2 + \text{higher-order terms}.$$
Taking expectations under $f(\cdot \mid \theta_0)$ (the first-order term has mean $0$) and using Lemma 7.34 below, we obtain
$$D\big(f(\cdot \mid \theta_0) \,\|\, f(\cdot \mid \theta)\big) \approx \frac{1}{2}\, I_{\theta_0}\, (\theta - \theta_0)^2.$$
7.34 Lemma.
Under regularity conditions (to allow interchange of integration and
differentiation), we have
$$E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2 = -E_\theta\Big[\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\Big].$$
Proof.
We have
$$0 = \frac{\partial}{\partial\theta}\, 1 = \frac{\partial}{\partial\theta}\int f(x \mid \theta)\, dx = \int \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = \int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx.$$
Differentiating once more,
$$0 = \frac{\partial}{\partial\theta}\int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx = \int \Big[\frac{\partial^2}{\partial\theta^2}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx + \int \Big[\frac{\partial}{\partial\theta}\log f(x \mid \theta)\Big]^2 f(x \mid \theta)\, dx$$
$$= E_\theta\Big[\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\Big] + E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2,$$
completing the proof.
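As a numerical sanity check of Lemma 7.34 (an added sketch, not part of the notes): for Poisson($\lambda$), $\frac{\partial}{\partial\lambda}\log f(x \mid \lambda) = x/\lambda - 1$ and $\frac{\partial^2}{\partial\lambda^2}\log f(x \mid \lambda) = -x/\lambda^2$, so both sides of the identity equal $1/\lambda$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 3.0
x = rng.poisson(lam, size=1_000_000)

score = x / lam - 1.0     # d/d(lambda) log f(x | lambda)
second = -x / lam**2      # d^2/d(lambda)^2 log f(x | lambda)

print(np.mean(score**2))  # approximately 1/lambda
print(-np.mean(second))   # approximately 1/lambda as well, per Lemma 7.34
```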
For the Poisson($\lambda$) family, $I_\lambda = 1/\lambda$, so the Cramér-Rao lower bound equals $\lambda/n$ and
$$\text{Var}_\lambda\, W(X) \ge \frac{\lambda}{n}$$
for any unbiased estimator $W(X)$ of $\lambda$. Since $\text{Var}_\lambda\, \hat\lambda = \text{Var}_\lambda\, \bar X = \lambda/n$ attains this bound, the MLE $\hat\lambda = \bar X$ is a best unbiased estimator of $\lambda$. In particular, we have $\text{Var}_\lambda\, \hat\lambda \le \text{Var}_\lambda\, S^2$ since the sample variance $S^2$ is also an unbiased estimator of $\lambda$.
Consider a one-parameter exponential family in natural form, $f(x \mid \theta) = g(x)\, e^{\theta h(x) - H(\theta)}$, and let $\tau(\theta) = E_\theta h(X)$. Differentiating the identity $\int g(x)\, e^{\theta h(x) - H(\theta)}\, dx = 1$ with respect to $\theta$ gives $\int \big(h(x) - H'(\theta)\big)\, g(x)\, e^{\theta h(x) - H(\theta)}\, dx = 0$, so that
$$H'(\theta) = \int h(x)\, g(x)\, e^{\theta h(x) - H(\theta)}\, dx = E_\theta h(X).$$
Differentiating once more, we get
$$H''(\theta) + \big(H'(\theta)\big)^2 = \int \big(h(x)\big)^2\, g(x)\, e^{\theta h(x) - H(\theta)}\, dx = E_\theta\big(h(X)\big)^2,$$
i.e. $\text{Var}_\theta\, h(X) = H''(\theta)$. The MLE $\hat\theta$ solves the likelihood equation $\frac{d}{d\theta}\log L(\theta \mid x) = \sum_{i=1}^n h(x_i) - n H'(\theta) = 0$, i.e.
$$\tau(\hat\theta) = \frac{1}{n}\sum_{i=1}^n h(x_i).$$
Note that
$$E_\theta\, \tau(\hat\theta) = E_\theta\, h(X) = \tau(\theta), \qquad \text{Var}_\theta\, \tau(\hat\theta) = \frac{\text{Var}_\theta\, h(X)}{n} = \frac{H''(\theta)}{n} = \frac{\tau'(\theta)}{n}.$$
The Fisher information is
$$I_\theta = E_\theta\Big[\frac{\partial}{\partial\theta}\log f(X \mid \theta)\Big]^2 = -E_\theta\Big[\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta)\Big] = -E_\theta\big(-H''(\theta)\big) = H''(\theta) = \tau'(\theta),$$
so the Cramér-Rao lower bound for unbiased estimators of $\tau(\theta)$ is $\big(\tau'(\theta)\big)^2/(n I_\theta) = \tau'(\theta)/n$, which is attained by $\tau(\hat\theta)$. Hence $\tau(\hat\theta)$ is a best unbiased estimator of $\tau(\theta)$.
For the Bernoulli family in natural form (see Remarks 7.39 below), $h(x) = x$ and $H(\theta) = \log(1 + e^\theta)$. Note that $H'(\theta) = \frac{e^\theta}{1 + e^\theta} = \frac{1}{1 + e^{-\theta}} = p$. Thus the MLE $\hat p$ of $p$ is $\bar X$, which is a best unbiased estimator of $p$ as established above. Indeed, we may directly prove this as follows. First,
$$\text{Var}_p\, \hat p = \frac{p(1-p)}{n}.$$
Second, the Fisher information for $p$ is $I_p = \frac{1}{p(1-p)}$, so the Cramér-Rao lower bound $\frac{1}{nI_p} = \frac{p(1-p)}{n}$ is attained by $\hat p$.
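A short simulation (an added illustration) confirms that $\text{Var}_p\, \hat p$ matches the Cramér-Rao bound $p(1-p)/n$; the values of $p$, $n$, and the number of replications are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.3, 50, 200_000

p_hat = rng.binomial(n, p, size=reps) / n  # MLE p-hat = X-bar, over many simulated samples
print(p_hat.var(), p * (1 - p) / n)        # empirical variance vs Cramer-Rao bound
```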
7.39 Remarks.
Fisher information depends on the parameterization as follows. Consider two parameterizations $\theta$ and $\phi$ with a 1-1 correspondence between them, written $\theta = \theta(\phi)$ and $\phi = \phi(\theta)$. The pair $(\theta, \phi)$ refers to the same pdf $f(x \mid \theta) = g(x \mid \phi)$. That is,
$$f(x \mid \theta) = g\big(x \mid \phi(\theta)\big) \quad\text{and}\quad g(x \mid \phi) = f\big(x \mid \theta(\phi)\big).$$
By the chain rule, $\frac{\partial}{\partial\theta}\log f(x \mid \theta) = \big[\frac{\partial}{\partial\phi}\log g(x \mid \phi)\big]_{\phi = \phi(\theta)}\, \phi'(\theta)$, so the two Fisher informations are related by $I_\theta = \big(\phi'(\theta)\big)^2\, I_{\phi(\theta)}$. For example, the Bernoulli($p$) pmf $f(x \mid p) = p^x (1-p)^{1-x}$ can be written in the natural parameterization $\theta = \log\frac{p}{1-p}$ as
$$g(x \mid \theta) = e^{x\theta - \log(1 + e^\theta)}.$$
7.40 Remarks.
When the support of $f(\cdot \mid \theta)$ depends on $\theta$, Fisher information may not be well defined. For example, let $f(x \mid \theta) = \frac{1}{\theta}$ for $0 \le x \le \theta$. The support of $f(\cdot \mid \theta)$ is $[0, \theta]$. Note that
$$0 = \frac{\partial}{\partial\theta}\, 1 = \frac{\partial}{\partial\theta}\int_0^\theta f(x \mid \theta)\, dx \;\ne\; \int_0^\theta \frac{\partial}{\partial\theta} f(x \mid \theta)\, dx = -\frac{1}{\theta},$$
so the interchange of integration and differentiation fails. "Formally", $\frac{\partial}{\partial\theta}\log f(x \mid \theta) = \frac{\partial}{\partial\theta}\log\theta^{-1} = -\frac{1}{\theta}$, so that $I_\theta$ would appear to be equal to $E_\theta\big(-\frac{1}{\theta}\big)^2 = \frac{1}{\theta^2}$. However, for $\Delta\theta > 0$,
$$\int_0^\theta \Big[\frac{\log f(x \mid \theta) - \log f(x \mid \theta - \Delta\theta)}{\Delta\theta}\Big]^2 f(x \mid \theta)\, dx \;\ge\; \int_{\theta - \Delta\theta}^{\theta} \Big[\frac{\log\frac{1}{\theta} - \log 0}{\Delta\theta}\Big]^2 \frac{1}{\theta}\, dx = \infty,$$
since $f(x \mid \theta - \Delta\theta) = 0$ for $x \in (\theta - \Delta\theta, \theta]$.
For the multiparameter case $\theta = (\theta_1, \dots, \theta_k)$, the generalized Cramér-Rao inequality states that $\text{Var}_\theta\, W(X) \ge \big(\nabla\tau(\theta)\big)^T I_\theta^{-1}\, \nabla\tau(\theta)$ for any unbiased estimator $W(X)$ of $\tau(\theta)$, where $I_\theta$ denotes the Fisher information matrix. ($I_\theta$ is assumed to be non-singular.)
Proof.
We have
$$1 = \int f(x \mid \theta)\, dx \quad\text{for all } \theta,$$
yielding
$$0 = \frac{\partial}{\partial\theta_i}\, 1 = \frac{\partial}{\partial\theta_i}\int f(x \mid \theta)\, dx = \int \frac{\partial}{\partial\theta_i} f(x \mid \theta)\, dx = \int \frac{\partial \log f(x \mid \theta)}{\partial\theta_i}\, f(x \mid \theta)\, dx. \tag{7.42}$$
Similarly, differentiating both sides of $\tau(\theta) = \int W(x)\, f(x \mid \theta)\, dx$ gives
$$\frac{\partial\tau(\theta)}{\partial\theta_i} = \int W(x)\, \frac{\partial \log f(x \mid \theta)}{\partial\theta_i}\, f(x \mid \theta)\, dx. \tag{7.43}$$
Combining (7.42) and (7.43), we have, for constants $c_1, \dots, c_k$,
$$\sum_{i=1}^k c_i\, \frac{\partial\tau}{\partial\theta_i} = \int \big[W(x) - \tau(\theta)\big]\Big[\sum_{i=1}^k c_i\, \frac{\partial}{\partial\theta_i}\log f(x \mid \theta)\Big] f(x \mid \theta)\, dx = E_\theta\Big(\big[W(X) - \tau(\theta)\big]\Big[\sum_{i=1}^k c_i\, \frac{\partial \log f(X \mid \theta)}{\partial\theta_i}\Big]\Big).$$
By the Cauchy-Schwarz inequality,
$$\Big(\sum_{i=1}^k c_i\, \frac{\partial\tau}{\partial\theta_i}\Big)^2 \;\le\; \text{Var}_\theta\, W(X)\; E\Big[\sum_{i=1}^k c_i\, \frac{\partial \log f(X \mid \theta)}{\partial\theta_i}\Big]^2.$$
Thus, writing $\tilde c = I_\theta^{1/2} c$, where $I_\theta^{1/2}$ is the symmetric positive-definite matrix such that $I_\theta^{1/2} I_\theta^{1/2} = I_\theta$,
$$\text{Var}_\theta\, W(X) \;\ge\; \sup_{c}\, \frac{\big(c^T \nabla\tau(\theta)\big)^2}{c^T I_\theta\, c} = \sup_{c}\, \frac{\big(c^T I_\theta^{1/2}\, I_\theta^{-1/2}\, \nabla\tau(\theta)\big)^2}{(c^T I_\theta^{1/2})(I_\theta^{1/2} c)} = \sup_{\tilde c}\, \frac{\big(\tilde c^T I_\theta^{-1/2}\, \nabla\tau(\theta)\big)^2}{\tilde c^T \tilde c}.$$
By the Cauchy-Schwarz inequality, the supremum is attained at $\tilde c = I_\theta^{-1/2}\, \nabla\tau(\theta)$, giving
$$\text{Var}_\theta\, W(X) \;\ge\; \frac{\Big[\big(I_\theta^{-1/2}\nabla\tau(\theta)\big)^T \big(I_\theta^{-1/2}\nabla\tau(\theta)\big)\Big]^2}{\big(I_\theta^{-1/2}\nabla\tau(\theta)\big)^T \big(I_\theta^{-1/2}\nabla\tau(\theta)\big)} = \big(\nabla\tau(\theta)\big)^T I_\theta^{-1}\, \nabla\tau(\theta).$$
This proves the (generalized) Cramér-Rao inequality.
Proof.
Let $\phi(T) = E_\theta\big(W(X) \mid T\big)$. We have
$$\tau(\theta) = E_\theta\, W(X) = E_\theta\big(E_\theta(W(X) \mid T)\big) = E_\theta\, \phi(T),$$
so $\phi(T)$ is also unbiased for $\tau(\theta)$. Moreover, by the conditional variance identity,
$$\text{Var}_\theta\, W(X) = \text{Var}_\theta\, \phi(T) + E_\theta\big[\text{Var}(W(X) \mid T)\big] \;\ge\; \text{Var}_\theta\, \phi(T).$$
$W' = aW + b$.
Clearly, the risk function is minimized at $b = 1$. Thus, under Stein's loss, $S^2$ has the smallest risk in the class $\{bS^2 : b > 0\}$.
A sequence of estimators $W_n$ is said to be consistent for $\theta$ if, for every $\varepsilon > 0$,
$$\lim_{n\to\infty} P_\theta\big(|W_n - \theta| \ge \varepsilon\big) = 0.$$
7.50 Example.
Let $W_n$ be a sequence of estimators such that $E_\theta\big[(W_n - \theta)^2\big] \to 0$ as $n \to \infty$. Then by Chebychev's inequality, for $\varepsilon > 0$,
$$P_\theta\big(|W_n - \theta| \ge \varepsilon\big) \;\le\; \frac{E_\theta\big[(W_n - \theta)^2\big]}{\varepsilon^2} \;\longrightarrow\; 0 \quad\text{as } n \to \infty.$$
So $W_n$ is consistent for $\theta$. In particular, the sample mean $\bar X_n$ is consistent for the population mean $\mu$ since
$$E\big[(\bar X_n - \mu)^2\big] = \frac{\sigma^2}{n} \;\longrightarrow\; 0 \quad\text{as } n \to \infty,$$
where $\sigma^2$ is the population variance.
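The Chebyshev bound can be watched at work in a small simulation (an added sketch; the normal model and the particular constants are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma, eps, reps = 1.0, 2.0, 0.2, 1000

for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    prob = np.mean(np.abs(xbar - mu) >= eps)
    print(n, prob, sigma**2 / (n * eps**2))  # empirical probability vs Chebyshev bound
```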
7.52 Remarks.
By considering the Kullback-Leibler information, we showed consistency of
θ̂n when the parameter space Θ consists of finitely many points. The
general case requires more delicate arguments.
7.54 Remarks.
Recall the Cramér-Rao lower bound
$$\frac{\big(\tau'(\theta)\big)^2}{n\, E_\theta\big(\frac{\partial}{\partial\theta}\log f(X \mid \theta)\big)^2}.$$
Proof.
Fix $\theta_0 \in \Theta$. Let $X_1, X_2, \dots$ be iid $f(x \mid \theta_0)$. Let $\log L(\theta \mid \boldsymbol{X}_n) = \sum_{i=1}^n \log f(X_i \mid \theta)$ be the log-likelihood function. By Taylor's expansion at $\theta_0$,
$$\frac{\partial}{\partial\theta}\log L(\theta \mid \boldsymbol{X}_n) = \frac{\partial}{\partial\theta}\log L(\theta_0 \mid \boldsymbol{X}_n) + (\theta - \theta_0)\, \frac{\partial^2}{\partial\theta^2}\log L(\theta_0 \mid \boldsymbol{X}_n) + \text{higher-order terms},$$
so that
$$0 = \frac{\partial}{\partial\theta}\log L(\hat\theta_n \mid \boldsymbol{X}_n) \approx \frac{\partial}{\partial\theta}\log L(\theta_0 \mid \boldsymbol{X}_n) + (\hat\theta_n - \theta_0)\, \frac{\partial^2}{\partial\theta^2}\log L(\theta_0 \mid \boldsymbol{X}_n),$$
from which it follows that
$$\sqrt{n}\,(\hat\theta_n - \theta_0) \;\approx\; \frac{\dfrac{1}{\sqrt{n}}\displaystyle\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)}{-\dfrac{1}{n}\displaystyle\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(X_i \mid \theta_0)}. \tag{7.56}$$
Since $-\frac{1}{n}\sum_{i=1}^n \frac{\partial^2}{\partial\theta^2}\log f(X_i \mid \theta_0)$ converges in probability to $-E_{\theta_0}\frac{\partial^2}{\partial\theta^2}\log f(X \mid \theta_0) = I_{\theta_0}$ and $\frac{1}{\sqrt{n}}\sum_{i=1}^n \frac{\partial}{\partial\theta}\log f(X_i \mid \theta_0)$ converges in distribution to $n(0, I_{\theta_0})$, the right-hand side of (7.56) converges in distribution to $n(0, 1/I_{\theta_0})$. By the delta method, $\sqrt{n}\big(\tau(\hat\theta_n) - \tau(\theta_0)\big)$ converges in distribution to $n\big(0,\, \big(\tau'(\theta_0)\big)^2 / I_{\theta_0}\big)$.
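A simulation sketch (added for illustration) of this asymptotic normality: for exponential data with mean $\theta_0$, the MLE is $\hat\theta_n = \bar X_n$ and $I_\theta = 1/\theta^2$, so $\sqrt{n}(\hat\theta_n - \theta_0)$ should have mean near 0 and variance near $\theta_0^2$; the constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
theta0, n, reps = 2.0, 500, 20_000

x = rng.exponential(theta0, size=(reps, n))
theta_hat = x.mean(axis=1)            # MLE of theta for the exponential(mean theta) model
z = np.sqrt(n) * (theta_hat - theta0)

# Asymptotic theory: z is approximately n(0, 1/I_theta0) with I_theta = 1/theta^2.
print(z.mean(), z.var())              # approximately 0 and theta0**2 = 4
```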
7.57 Remarks.
We may say that θ̂n is asymptotically normal with mean θ and variance
(nIθ )−1 . Thus the variance of θ̂n decreases at the rate of n−1 as n
increases. Method-of-moments estimators are also asymptotically normal
with variances decaying at the rate of n−1 . That is, asymptotic variances
are of the form c(θ)/n for some c(θ) ≥ Iθ−1 . If c(θ) > Iθ−1 , then the
estimator is not asymptotically efficient.
7.59 Remarks.
If ARE(Vn , Wn ) > 1, Vn is said to be asymptotically more efficient than
Wn . Note that ARE(Vn , Wn ) depends on θ in general. It may happen that
ARE(Vn , Wn ) is greater than 1 for some values of θ but less than 1 for some
other values of θ. If ARE$(V_n, W_n) = 2$ (say), then $W_{2n}$ has approximately the same variance as $V_n$ for large $n$ (since the asymptotic variances satisfy $\sigma_W^2/(2n) = \sigma_V^2/n$ when $\sigma_W^2 = 2\sigma_V^2$). In other words, for the two estimators $W$ and $V$ to have the same variance, $W$ requires about twice the sample size as $V$ does.
Here $\tau(\lambda) = e^{-\lambda} = P_\lambda(X = 0)$ is estimated either by the MLE $e^{-\hat\lambda_n}$, whose asymptotic variance is $v(\lambda)/n$ with
$$v(\lambda) = \frac{\big(\tau'(\lambda)\big)^2}{I_\lambda} = \frac{e^{-2\lambda}}{\lambda^{-1}} = \lambda e^{-2\lambda},$$
or by $\hat\tau_n$, the sample proportion of zeros, whose variance is $e^{-\lambda}(1 - e^{-\lambda})/n$. Thus,
$$\text{ARE}(\hat\tau_n,\, e^{-\hat\lambda_n}) = \frac{\lambda e^{-2\lambda}}{e^{-\lambda}(1 - e^{-\lambda})} = \frac{\lambda}{e^\lambda - 1},$$
which is less than 1 for $\lambda > 0$ and decreases to 0 as $\lambda \to \infty$.
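Tabulating the final expression (an added snippet) shows how quickly this ARE drops below 1 as $\lambda$ grows:

```python
import numpy as np

lam = np.array([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
are = lam / (np.exp(lam) - 1.0)  # ARE(tau-hat_n, exp(-lambda-hat_n)) = lambda / (e^lambda - 1)
for l, a in zip(lam, are):
    print(f"lambda = {l:5.1f}   ARE = {a:.4f}")
```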