
Chapter 7

Introduction to point estimation
Methods of finding estimators
Methods of evaluating estimators

MATH5063

Instructor: Yi-Ching Yao

Text: Statistical Inference (2nd ed.) by G. Casella and R.L. Berger

Week #7, Fall 2017


Outline

Introduction to point estimation


Methods of finding estimators
Methods of evaluating estimators


Consider a random sample X1, . . . , Xn from an unknown pdf/pmf f(x | θ)
which belongs to a (known) family of distributions {f(x | θ) : θ ∈ Θ}. We
are interested in estimating the unknown value of θ based on the sample
X1, . . . , Xn. Note that estimating θ is equivalent to estimating the (true)
pdf/pmf f(x | θ). More generally, it may be desired to estimate τ(θ), which
is a known function of θ.
7.1 Remarks.
The family of distributions {f(x | θ) : θ ∈ Θ} is called a parametric
family of distributions if θ is finite-dimensional. Let γ : Θ → Φ be a 1-1
mapping. Let ϕ = γ(θ) and g(x | ϕ) = f(x | γ⁻¹(ϕ)), so that
{f(x | θ) : θ ∈ Θ} and {g(x | ϕ) : ϕ ∈ Φ} are the same family of
distributions with the former being parameterized by θ and the latter by ϕ.
Often a particular parameterization is preferred because of easy
interpretations (e.g. θ may correspond to the mean of the distribution
f(x | θ)).


7.2 Definition (point estimator).


A point estimator is any function W (X1 , . . . , Xn ) of a sample, i.e. any
statistic is a point estimator.

7.3 Remarks.
An estimator is a function of the sample while an estimate is the realized
value of an estimator. Notationally, W(X1, . . . , Xn) is an estimator and
W(x1, . . . , xn) is an estimate. (Some authors use the word “estimate” for
both W(X1, . . . , Xn) and W(x1, . . . , xn).)


Method of Moments:
Let X1 , . . . , Xn be a random sample from a pdf/pmf f (x | θ1 , . . . , θk ) which
depends on k parameters θ1 , . . . , θk . Method-of-moments estimators are
found by equating the first k sample moments to the corresponding
population moments, and solving the resulting system of equations. More
precisely, for l = 1, . . . , k, let

m_l = (1/n) ∑_{i=1}^n Xi^l and µ′_l = µ′_l(θ1, . . . , θk) = E X^l.

Then, the solution (θ̃1 , . . . , θ̃k ) to the system of k equations

ml = µ′l (θ1 , . . . , θk ), l = 1, . . . , k

is the method-of-moments estimator.


7.4 Example (Normal family).


Let X1 , . . . , Xn be iid n(θ, σ 2 ). We have

µ′_1 = θ and µ′_2 = θ² + σ²,

so that the method-of-moments estimator (θ̃, σ̃²) satisfies

X̄ = θ̃ and (1/n) ∑_{i=1}^n Xi² = θ̃² + σ̃².

That is,

θ̃ = X̄ and
σ̃² = (1/n) ∑_{i=1}^n Xi² − θ̃² = (1/n) ∑_{i=1}^n Xi² − X̄² = (1/n) ∑_{i=1}^n (Xi − X̄)² = ((n − 1)/n) S² < S².

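A quick numerical check of these formulas, as a minimal Python sketch (the data are simulated here, so all numbers are purely illustrative):

import numpy as np

rng = np.random.default_rng(0)
theta, sigma2, n = 2.0, 4.0, 500
x = rng.normal(theta, np.sqrt(sigma2), size=n)   # iid n(theta, sigma^2) sample

# Method of moments: match the first two sample moments to the population moments.
m1 = x.mean()                 # first sample moment
m2 = np.mean(x**2)            # second sample moment
theta_tilde = m1              # solves m1 = theta
sigma2_tilde = m2 - m1**2     # solves m2 = theta^2 + sigma^2; equals ((n-1)/n) S^2

S2 = x.var(ddof=1)            # unbiased sample variance S^2
print(theta_tilde, sigma2_tilde, (n - 1) / n * S2)  # last two values agree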

7.5 Example (Poisson family).


Let X1 , . . . , Xn be iid Poisson(λ). Since µ′1 = λ, a natural
method-of-moments estimator of λ is λ̃ = X̄. Alternatively, since the
(population) variance of the Poisson(λ) distribution also equals λ, a reasonable
estimator of λ is λ̃′ = S² (the sample variance), which is also a
method-of-moments estimator. We will see that X̄ is better than S² in
terms of mean squared error (MSE). On the other hand, if S 2 is
considerably different from X, the assumption of the Poisson distribution is
questionable.


Maximum Likelihood Estimators (MLE):


For a sample X1 , . . . , Xn from a pdf/pmf f (x | θ1 , . . . , θk ), the likelihood
function is given by

L(θ | x) = L(θ1, . . . , θk | x1, . . . , xn) = ∏_{i=1}^n f(xi | θ1, . . . , θk).

Let θ̂(x) be a parameter value at which L(θ | x) attains its maximum. A


maximum likelihood estimator (MLE) of θ based on a sample X is θ̂(X).


7.6 Remarks.
Unless θ̂(x) occurs at the boundary of the parameter space, θ̂(x) is a
solution of the equations

(∂/∂θi) L(θ | x) = 0, i = 1, . . . , k.
In most cases, the MLE cannot be found analytically. Thus, one needs to
resort to numerical methods to obtain the MLE approximately.

7.7 Remarks.
Maximizing L(θ | x) is equivalent to maximizing log L(θ | x) (the log
likelihood function).


7.8 Example (normal with known variance).


Let X1 , . . . , Xn be iid n(θ, 1). Then the likelihood function is
L(θ | x) = (2π)^{−n/2} exp( −(1/2) ∑_{i=1}^n (xi − θ)² ),

and the log likelihood function is

log L(θ | x) = −(n/2) log(2π) − (1/2) ∑_{i=1}^n (xi − θ)².

The equation (d/dθ) log L(θ | x) = 0 reduces to

∑_{i=1}^n (xi − θ) = 0,

whose solution is θ̂ = x̄. It follows that θ̂(X) = X̄ is the MLE of θ.


7.9 Example (continued).


If θ is known to be non-negative, then θ̂ is the solution to the constrained
maximization problem: maximize log L(θ | x) subject to θ ≥ 0, or equivalently,
minimize h(θ) = ∑_{i=1}^n (xi − θ)² subject to θ ≥ 0. Note that h(θ) is a quadratic
function of θ. Since x̄ is the (unconstrained) MLE, we have θ̂ = x̄ if x̄ ≥ 0.
If x̄ < 0, it is readily seen that θ̂ = 0 maximizes the (constrained) (log)
likelihood function. So,
θ̂ = max{x̄, 0}.



7.10 Example (normal with unknown variance).


Let X1 , . . . , Xn be iid n(θ, σ 2 ). Then,

log L(θ, σ² | x) = −(n/2) log 2π − (n/2) log σ² − (1/2) ∑_{i=1}^n (xi − θ)²/σ²,

and

(∂/∂θ) log L(θ, σ² | x) = (1/σ²) ∑_{i=1}^n (xi − θ),

(∂/∂(σ²)) log L(θ, σ² | x) = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^n (xi − θ)².


7.10 (contd.) Example (normal with unknown variance).


Thus, solving
(∂/∂θ) log L(θ, σ² | x) = 0 and (∂/∂(σ²)) log L(θ, σ² | x) = 0

yields

θ̂ = x̄ and σ̂² = (1/n) ∑_{i=1}^n (xi − θ̂)² = ((n − 1)/n) S².

It can be rigorously verified that (θ̂, σb2 ) uniquely maximizes the (log)
likelihood function.

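The closed-form solution can be checked against a generic numerical optimizer. A minimal sketch, assuming scipy is available (the data and starting values are arbitrary):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(1.5, 2.0, size=200)          # iid n(theta, sigma^2) data
n = x.size

def neg_loglik(params):
    theta, log_sigma2 = params               # optimize log(sigma^2) so that sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum((x - theta) ** 2) / (2 * sigma2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
theta_hat_num, sigma2_hat_num = res.x[0], np.exp(res.x[1])

# Closed-form MLEs from the slide: theta_hat = x_bar, sigma2_hat = ((n-1)/n) S^2
theta_hat, sigma2_hat = x.mean(), x.var(ddof=0)
print(theta_hat, theta_hat_num)      # agree to optimizer tolerance
print(sigma2_hat, sigma2_hat_num)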

Bayes Estimators:
Classical (frequentist) approach: The parameter θ is fixed but unknown.
Bayesian approach: θ is random with some prior distribution (which is a
subjective distribution based on the experimenter’s prior knowledge about θ
before the data are collected). When a sample is taken from a pdf/pmf
f (x | θ), the prior distribution is then updated with this sample
information. The updated prior is called the posterior distribution.
Specifically, let π(θ) be the prior distribution. The sampling distribution
f (x | θ) is the conditional distribution of the random sample X given θ.
Thus the joint distribution of (θ, X) is f (x | θ) π(θ). When X is observed
with X = x, the updated prior (i.e. posterior) distribution of θ is the
conditional distribution of θ given X = x, i.e.
π(θ | x) = f(x | θ) π(θ) / m(x),

where m(x) = ∫ f(x | θ) π(θ) dθ is the marginal distribution of X.


We often use the mean of the posterior distribution (called the posterior
mean) of θ as a point estimate of θ. Note that the Bayes estimator
depends on the prior distribution. Different prior distributions result in
different Bayes estimators.


7.11 Example (Binomial Bayes estimation).


Let X1, . . . , Xn be iid Bernoulli(p), so that Y = ∑_{i=1}^n Xi is binomial(n, p).
Let π(p) (the prior of p) be beta(α, β). Then, the joint distribution of (Y, p)
is

f(y, p) = [ (n choose y) p^y (1 − p)^{n−y} ] [ ( Γ(α + β)/(Γ(α)Γ(β)) ) p^{α−1} (1 − p)^{β−1} ]
        = (n choose y) ( Γ(α + β)/(Γ(α)Γ(β)) ) p^{y+α−1} (1 − p)^{n−y+β−1}.

The posterior distribution of p is

f(p | y) = f(y, p)/f(y) ∝ p^{y+α−1} (1 − p)^{n−y+β−1}.


7.11 (contd.) Example (Binomial Bayes estimation).


Since ∫₀¹ f(p | y) dp = 1, necessarily we have

f(p | y) = ( Γ(n + α + β)/(Γ(y + α)Γ(n − y + β)) ) p^{y+α−1} (1 − p)^{n−y+β−1},

which is beta(y + α, n − y + β). The (posterior) mean of
beta(y + α, n − y + β) is

p̂B = (y + α)/( (y + α) + (n − y + β) ) = (y + α)/(n + α + β).


7.11 (contd.) Example (Binomial Bayes estimation).


Note that the MLE is y/n and the (prior) mean of beta(α, β) is α/(α + β). We
have

p̂B = (y + α)/(n + α + β) = ( n/(n + α + β) )( y/n ) + ( (α + β)/(n + α + β) )( α/(α + β) ),
a weighted average of the MLE and the prior mean. When the sample size
n is large, p̂B is close to the MLE. In this example, the prior is chosen to be
beta, resulting in a posterior which is also beta. In general, if the prior is
not beta, the posterior may not admit a closed-form expression. The family
of beta priors is called a conjugate family for the binomial distributions.

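A small numerical illustration of the conjugate update and of the weighted-average form of p̂B (the prior hyperparameters and data below are arbitrary choices, not from the text):

import numpy as np

rng = np.random.default_rng(2)
n, p_true = 40, 0.3
y = rng.binomial(n, p_true)          # observed number of successes

alpha, beta = 2.0, 5.0               # beta(alpha, beta) prior (arbitrary choice)

# Posterior is beta(y + alpha, n - y + beta); its mean is the Bayes estimate.
post_mean = (y + alpha) / (n + alpha + beta)

# Same estimate written as a weighted average of the MLE y/n and the prior mean.
w = n / (n + alpha + beta)
weighted = w * (y / n) + (1 - w) * (alpha / (alpha + beta))

print(post_mean, weighted)           # identical up to floating point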

7.12 Definition (conjugate family).


Let F be a class of pdfs/pmfs f (x | θ) (indexed by θ). A class Π of prior
distributions is a conjugate family for F if the posterior is in the class Π
whenever the prior is in Π.


7.13 Example (Normal Bayes estimation).


Let X1, . . . , Xn be iid n(θ, σ²). Suppose the prior distribution of θ is n(µ, τ²).
Here σ², µ, and τ² are assumed known. Then, the joint distribution of
(X, θ) is

f(x, θ) = [ ( 1/√(2πσ²) )^n e^{−(1/2) ∑_{i=1}^n (xi−θ)²/σ²} ] [ ( 1/√(2πτ²) ) e^{−(1/2)(θ−µ)²/τ²} ].


7.13 (contd.) Example (Normal Bayes estimation).


Thus, the posterior of θ is
f(θ | x) = f(x, θ)/f(x)
  ∝ exp[ −(1/2)( ∑_{i=1}^n (xi − θ)²/σ² + (θ − µ)²/τ² ) ]
  = exp[ −(1/2)( ∑_{i=1}^n (xi − x̄)²/σ² + n(θ − x̄)²/σ² + (θ − µ)²/τ² ) ]
  ∝ exp[ −(1/2)( (θ − x̄)²/(σ²/n) + (θ − µ)²/τ² ) ]
  ∝ exp[ −(1/2)( n/σ² + 1/τ² )( θ − ( (n/σ²)x̄ + (1/τ²)µ )/( n/σ² + 1/τ² ) )² ].


7.13 (contd.) Example (Normal Bayes estimation).


So the posterior of θ is n( θ̂B , (n/σ² + 1/τ²)⁻¹ ), where

θ̂B = ( (n/σ²)x̄ + (1/τ²)µ ) / ( n/σ² + 1/τ² ).

Note that the normal family is its own conjugate family (when σ² is
known). The posterior mean (the Bayes estimator) of θ is a weighted
average of the MLE (X̄) and the prior mean µ, i.e.

θ̂B = ( (n/σ²)/(n/σ² + 1/τ²) ) X̄ + ( (1/τ²)/(n/σ² + 1/τ²) ) µ.

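A minimal sketch of the same shrinkage formula for the normal model (σ², µ, τ² are treated as known, as in the setup above; the numerical values are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
sigma2, mu, tau2 = 1.0, 0.0, 4.0     # known sampling variance and prior n(mu, tau2)
theta_true, n = 1.2, 25
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)

prec_data, prec_prior = n / sigma2, 1.0 / tau2      # precisions
theta_B = (prec_data * x.mean() + prec_prior * mu) / (prec_data + prec_prior)
post_var = 1.0 / (prec_data + prec_prior)

print(theta_B, post_var)             # posterior mean shrinks x_bar toward mu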

The EM Algorithm:
The EM Algorithm is designed to find MLEs when the likelihood function
involves incomplete data. Specifically, let Y = (Y1 , . . . , Yn ) be the
incomplete (observed) data, and X = (X1 , . . . , Xn ) be the augmented (or
missing) data, making (Y , X) the complete data.
The densities g(· | θ) of Y and f (· | θ) of (Y , X) have the relationship

g(y | θ) = ∫ f(y, x | θ) dx.

We call L(θ | y) = g(y | θ) and L(θ | y, x) = f (y, x | θ) the incomplete-data


likelihood and complete-data likelihood, respectively.


To find θ̂ that maximizes L(θ | y) (the MLE of θ), it may be easier to work
with L(θ | y, x) via the EM algorithm as described below.
(i) Start with an initial value θ(0) .

(ii) At the r-th step with current value θ(r) (r = 0, 1, . . . ), find the value
(denoted θ(r+1)) that maximizes E[ log L(θ | y, X) | θ(r), y ].

Note that the r-th step of the EM algorithm consists of two parts: the
“E-step” calculates the expected log likelihood E[ log L(θ | y, X) | θ(r), y ]
and the “M-step” finds its maximum.


7.14 Theorem.
log L(θ(r) | y) increases monotonically in r.

To prove the theorem, we need the following lemma.

7.15 Lemma.
For two pdfs g and h, we have
∫ log[ g(x)/h(x) ] g(x) dx ≥ 0.

(The integral is known as the Kullback-Leibler information.)


Proof.
Since ey ≥ 1 + y for all y ∈ R, we have, for x with g(x) > 0,
h(x)/g(x) = exp( log[ h(x)/g(x) ] ) ≥ 1 + log[ h(x)/g(x) ].
It follows that

∫_{x: g(x)>0} log[ h(x)/g(x) ] g(x) dx ≤ ∫_{x: g(x)>0} [ h(x)/g(x) − 1 ] g(x) dx
  = ∫_{x: g(x)>0} h(x) dx − 1 ≤ 1 − 1 = 0,

proving the lemma.


7.16 Remarks.
The Kullback-Leibler information
D(g ∥ h) = ∫ log[ g(x)/h(x) ] g(x) dx

is also known as the Kullback-Leibler divergence or relative entropy. It
is a measure of how far a distribution h is away from g. Note that
D(g ∥ h) ≥ 0 and it equals 0 if and only if g = h. Note also that
D(g ∥ h) ≠ D(h ∥ g), so it is not a distance function.

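A numerical illustration of D(g ∥ h) ≥ 0 and of its asymmetry, approximating the integral on a grid for two normal densities (grid limits and parameters are arbitrary):

import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def kl(p, q, x):
    # D(p || q) = integral of log(p/q) * p, approximated on a fine grid
    px, qx = p(x), q(x)
    return np.sum(px * np.log(px / qx)) * (x[1] - x[0])

x = np.linspace(-20, 20, 200001)
g = lambda t: normal_pdf(t, 0.0, 1.0)
h = lambda t: normal_pdf(t, 1.0, 2.0)

print(kl(g, h, x), kl(h, g, x))      # both nonnegative, and clearly unequal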

7.17 Remarks on consistency of the MLE when the parameter space is finite.
Suppose the parameter space consists of k + 1 points corresponding to k + 1
distributions f0(x), f1(x), . . . , fk(x). Suppose a sample X1, . . . , Xn of (large)
size n is taken from f0. Let f̂ denote the MLE based on X1, . . . , Xn,
i.e. f̂ = fl if

∑_{i=1}^n log fl(Xi) ≥ ∑_{i=1}^n log fl′(Xi) for all l′ ∈ {0, . . . , k}.

Note that, for l = 1, . . . , k,

(1/n) ∑_{i=1}^n log f0(Xi) − (1/n) ∑_{i=1}^n log fl(Xi) = (1/n) ∑_{i=1}^n log[ f0(Xi)/fl(Xi) ]
  ≈ ∫ log[ f0(x)/fl(x) ] f0(x) dx
  = D(f0 ∥ fl) > 0,

implying that P(f̂ = f0) ≈ 1 for large n.



Proof of Theorem 7.14.

We want to show log L(θ(r+1) | y) ≥ log L(θ(r) | y). Since θ(r+1) maximizes
E[ log L(θ | y, X) | θ(r), y ], we have

E[ log L(θ(r+1) | y, X) | θ(r), y ] ≥ E[ log L(θ(r) | y, X) | θ(r), y ],

or equivalently,

7.18   ∫ [ log f(y, x | θ(r+1)) ] f(x | θ(r), y) dx ≥ ∫ [ log f(y, x | θ(r)) ] f(x | θ(r), y) dx.


(contd.) Proof of Theorem 7.14.


By Lemma 7.15 with g(·) = f (· | θ(r) , y) and h(·) = f (· | θ(r+1) , y),
∫ log[ f(x | θ(r), y) / f(x | θ(r+1), y) ] f(x | θ(r), y) dx ≥ 0, i.e.

7.19   ∫ [ log f(x | θ(r+1), y) ] f(x | θ(r), y) dx ≤ ∫ [ log f(x | θ(r), y) ] f(x | θ(r), y) dx.


(contd.) Proof of Theorem 7.14.


Note that

log f (y, x | θ(r+1) ) − log f (x | θ(r+1) , y) = log f (y | θ(r+1) )


log f (y, x | θ(r) ) − log f (x | θ(r) , y) = log f (y | θ(r) ).

Subtracting (7.19) from (7.18) yields


∫ [ log f(y | θ(r+1)) ] f(x | θ(r), y) dx ≥ ∫ [ log f(y | θ(r)) ] f(x | θ(r), y) dx,

which is nothing but

log L(θ(r+1) | y) = log f (y | θ(r+1) ) ≥ log f (y | θ(r) ) = log L(θ(r) | y),

completing the proof.


7.20 Remarks.
The fact that log L(θ(r) | y) increases monotonically does not guarantee
that θ(r) converges to the MLE. Under suitable conditions, it can be shown
that all limit points of an EM sequence {θ(r) } are stationary points of
L(θ | y), and L(θ(r) | y) converges monotonically to L(θ̂ | y) for some
stationary point θ̂ (which may be a local maximum or saddle point). In
practice, one should try different initial values in the hope that one of them
yields the MLE.


7.21 Example (Exercise 7.30 on page 360).


Suppose that we have a mixture density pf (x) + (1 − p)g(x), where the
“mixing” probability p is unknown but f and g are known (for simplicity).
A sample X1 , . . . , Xn is taken from this mixture density, so that the log
likelihood is

log L(p | x) = ∑_{i=1}^n log[ p f(xi) + (1 − p) g(xi) ],

which cannot be dealt with analytically. To apply the EM algorithm, we


augment the observed (or incomplete) data with Z = (Z1 , . . . , Zn ), where
Zi tells which component of the mixture Xi came from. In other words,

Xi | zi = 1 ∼ f (xi ) and Xi | zi = 0 ∼ g(xi )

and P(Zi = 1) = p. (Note that if Z were observable, Z would contain all


the information about p.)


7.21 (contd.) Example (Exercise 7.30 on page 360).


(a) The joint density of (X, Z) is ∏_{i=1}^n [ p f(xi) ]^{zi} [ (1 − p) g(xi) ]^{1−zi}, so
that

log L(p | x, z) = ∑_{i=1}^n { zi log[ p f(xi) ] + (1 − zi) log[ (1 − p) g(xi) ] }.

(b) The conditional distribution of Zi | xi, p is Bernoulli with success
probability p f(xi) / ( p f(xi) + (1 − p) g(xi) ).
Note that

P(Zi = 1 | Xi ∈ dxi) = P(Zi = 1, Xi ∈ dxi) / P(Xi ∈ dxi)
  = p f(xi) dxi / ( p f(xi) dxi + (1 − p) g(xi) dxi )
  = p f(xi) / ( p f(xi) + (1 − p) g(xi) ).


7.21 (contd.) Example (Exercise 7.30 on page 360).

(c) At the r-th step, with current value p(r), we have, for the E-step, that

E[ log L(p | Z, x) | p(r), x ]
  = E[ ∑_{i=1}^n Zi log[ p f(xi) ] + (1 − Zi) log[ (1 − p) g(xi) ] | p(r), x ]
  = ∑_{i=1}^n { [ p(r) f(xi) / ( p(r) f(xi) + (1 − p(r)) g(xi) ) ] log[ p f(xi) ]
      + [ 1 − p(r) f(xi) / ( p(r) f(xi) + (1 − p(r)) g(xi) ) ] log[ (1 − p) g(xi) ] }.


7.21 (contd.) Example (Exercise 7.30 on page 360).


For the M-step, the above expression (as a function of p) attains the
maximum at p = p(r+1) which satisfies

0 = (1/p) ∑_{i=1}^n p(r) f(xi) / ( p(r) f(xi) + (1 − p(r)) g(xi) )
    − (1/(1 − p)) ∑_{i=1}^n [ 1 − p(r) f(xi) / ( p(r) f(xi) + (1 − p(r)) g(xi) ) ]
  = (1/p) K − (1/(1 − p)) (n − K),

where K = ∑_{i=1}^n p(r) f(xi) / ( p(r) f(xi) + (1 − p(r)) g(xi) ). So,

p(r+1) = K/n = (1/n) ∑_{i=1}^n p(r) f(xi) / ( p(r) f(xi) + (1 − p(r)) g(xi) ).

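A minimal Python sketch of this EM iteration for a two-component mixture in which, purely for illustration, f and g are taken to be the n(0, 1) and n(3, 1) densities and the true mixing probability is 0.25:

import numpy as np

rng = np.random.default_rng(4)

def f(x):   # known first component: n(0, 1) density
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def g(x):   # known second component: n(3, 1) density
    return np.exp(-0.5 * (x - 3)**2) / np.sqrt(2 * np.pi)

# Simulate from the mixture p*f + (1-p)*g with p = 0.25
p_true, n = 0.25, 2000
z = rng.random(n) < p_true
x = np.where(z, rng.normal(0.0, 1.0, n), rng.normal(3.0, 1.0, n))

p = 0.5                                            # initial value p^(0)
for r in range(200):
    w = p * f(x) / (p * f(x) + (1 - p) * g(x))     # E-step: P(Z_i = 1 | x_i, p^(r))
    p_new = w.mean()                               # M-step: p^(r+1) = (1/n) * sum_i w_i
    if abs(p_new - p) < 1e-10:
        break
    p = p_new

print(p)   # close to 0.25 for a sample this large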

7.22 Definition (Mean squared error).


The mean squared error (MSE) of an estimator W (X) of a
parameter θ is

Eθ (W − θ)2 (as a function of θ).

7.23 Remarks.
The subscript θ in Eθ refers to the pdf f (· | θ). Thus,

Eθ (W − θ)² = ∫ ( W(x) − θ )² f(x | θ) dx.


The MSE measures the averaged squared difference between the estimator
W and the parameter θ. In general, any increasing function of the absolute
distance |W − θ| would serve to measure the goodness of an estimator
(mean absolute error, Eθ |W − θ|, is a reasonable alternative).
The MSE is a popular and convenient measure for its mathematical
tractability. Moreover, the MSE incorporates two components, one
measuring the variability of the estimator (precision) and the other
measuring its bias (accuracy). More precisely, we have

Eθ (W − θ)² = Eθ (W − Eθ W + Eθ W − θ)²
  = Varθ W + (Eθ W − θ)²   (the cross term vanishes since Eθ(W − Eθ W) = 0)
  = (variance of W) + (squared bias of W).


7.24 Definition (Bias).


The bias of a point estimator W of a parameter θ is defined by
Biasθ W = Eθ W − θ. An estimator of θ is unbiased if its bias (as a
function of θ) is identically 0.

An unbiased estimator has the MSE equal to the variance, i.e.


Eθ (W − θ)2 = Varθ W if W is unbiased.


7.25 Example.

Let X1, . . . , Xn be a sample from f(x | θ). Let µ = µ(θ) = ∫ x f(x | θ) dx
and σ² = σ²(θ) = ∫ x² f(x | θ) dx − µ². Then, the sample mean X̄ and
sample variance S² are unbiased estimators of µ and σ², respectively. Now
suppose that f(x | θ) = n(µ, σ²). Then,

E (X̄ − µ)² = Var X̄ = σ²/n and E (S² − σ²)² = Var S².

Recall that S² may be expressed as

S² = σ² (Z1² + · · · + Z_{n−1}²) / (n − 1),
where Z1 , . . . , Zn−1 are iid n(0, 1).


7.25 (contd.) Example.


Consequently,

E S⁴ = ( σ⁴/(n − 1)² ) E (Z1² + · · · + Z_{n−1}²)²
  = ( σ⁴/(n − 1)² ) [ (n − 1) E Z1⁴ + (n − 1)(n − 2) E Z1² Z2² ]
  = ( σ⁴/(n − 1)² ) [ 3(n − 1) + (n − 1)(n − 2) ] = ( (n + 1)/(n − 1) ) σ⁴.

MSE(S²) = Var S² = E S⁴ − σ⁴ = 2σ⁴/(n − 1).


7.25 (contd.) Example.


Recall that the MLE of σ² is

σ̂² = (1/n) ∑_{i=1}^n (Xi − X̄)² = ((n − 1)/n) S².

The bias of σ̂² is E σ̂² − σ² = ((n − 1)/n) σ² − σ² = −σ²/n, and

Var σ̂² = Var( ((n − 1)/n) S² ) = ( (n − 1)/n )² ( 2σ⁴/(n − 1) ) = 2(n − 1) σ⁴/n².

We have

MSE(σ̂²) = E (σ̂² − σ²)² = ( −σ²/n )² + 2(n − 1) σ⁴/n² = ( (2n − 1)/n² ) σ⁴
  < ( 2/(n − 1) ) σ⁴ = MSE(S²).

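A Monte Carlo check of the two MSE formulas above (sample size and σ² are arbitrary):

import numpy as np

rng = np.random.default_rng(5)
n, sigma2, reps = 10, 2.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
S2 = x.var(axis=1, ddof=1)            # unbiased sample variance
sig2_hat = x.var(axis=1, ddof=0)      # MLE, equal to ((n-1)/n) S^2

mse_S2 = np.mean((S2 - sigma2) ** 2)
mse_mle = np.mean((sig2_hat - sigma2) ** 2)

print(mse_S2, 2 * sigma2**2 / (n - 1))            # approx 2 sigma^4 / (n-1)
print(mse_mle, (2 * n - 1) / n**2 * sigma2**2)    # approx (2n-1)/n^2 sigma^4, smaller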

7.26 Remarks (cf. page 332).

Although σ̂² is a biased estimator of σ², its MSE is smaller than that of
the unbiased estimator S². This does not imply that σ̂² is better than S²,
since MSE may not be the appropriate measure for a scale parameter. Note
that MSE penalizes equally for overestimation and underestimation, which
is fine for a location parameter. For a scale parameter, 0 is a natural lower
bound, so the estimation of a scale parameter is not symmetric. Use of
MSE in this case tends to be forgiving of underestimation.

7.27 Remarks.
In general, no estimator exists which is better than any other estimator at
every parameter value. For example, the trivial estimator W = 2.5, which
does not depend on the data, has MSE = 0 if θ = 2.5. This estimator is
obviously of no interest. Below we restrict our attention to the class of
unbiased estimators. (It should be pointed out that the class of unbiased
estimators is often too restrictive. It is instructive to consider estimators
with a small bias.) We will see that in some cases, there exists a best
unbiased estimator in this class in the following sense.


7.28 Definition (Best unbiased estimator).


An estimator W ∗ is a best unbiased estimator of τ (θ) if
(i) Eθ W ∗ = τ (θ) for all θ.

(ii) Varθ W ∗ ≤ Varθ W for all θ for any estimator W with Eθ W = τ (θ).

7.29 Remarks.
A best unbiased estimator of τ (θ) is also called a uniformly minimum
variance unbiased estimator (UMVUE) of τ (θ).


7.30 Theorem (Cramér-Rao inequality).

Let X1, . . . , Xn be a sample with joint pdf f(x | θ) such that
∫ (∂/∂θ) f(x | θ) dx = 0. Let W(X) = W(X1, . . . , Xn) be any estimator
satisfying

(d/dθ) Eθ W(X) = ∫ W(x) (∂/∂θ) f(x | θ) dx

and

Varθ W(X) < ∞.

Then,

Varθ W(X) ≥ [ (d/dθ) Eθ W(X) ]² / Eθ[ (∂/∂θ) log f(X | θ) ]².


Proof.
Let τ (θ) = Eθ W (X). Then, by Cauchy-Schwarz inequality,
( (d/dθ) τ(θ) )² = ( (d/dθ) Eθ W(X) )²
  = [ ∫ W(x) (∂/∂θ) f(x | θ) dx ]²
  = [ ∫ ( W(x) − τ(θ) ) (∂/∂θ) f(x | θ) dx ]²
  = [ ∫ ( W(x) − τ(θ) ) ( (∂/∂θ) log f(x | θ) ) f(x | θ) dx ]²
  = ( Eθ[ ( W(X) − τ(θ) ) (∂/∂θ) log f(X | θ) ] )²
  ≤ Eθ( W(X) − τ(θ) )² · Eθ( (∂/∂θ) log f(X | θ) )²,


(contd.) Proof.
from which it follows that
Varθ W(X) ≥ [ (d/dθ) Eθ W(X) ]² / Eθ[ (∂/∂θ) log f(X | θ) ]².


7.31 Corollary (iid case).

If X1 , . . . , Xn are iid f (x | θ), then


Varθ W(X) ≥ [ (d/dθ) Eθ W(X) ]² / ( n Eθ[ (∂/∂θ) log f(X | θ) ]² ).

Proof.
Since f(x | θ) = f(x1 | θ) · · · f(xn | θ), we have

Eθ[ (∂/∂θ) log f(X | θ) ]² = Eθ[ ∑_{i=1}^n (∂/∂θ) log f(Xi | θ) ]²
  = n Eθ[ (∂/∂θ) log f(X1 | θ) ]² + n(n − 1) E[ (∂/∂θ) log f(X1 | θ) · (∂/∂θ) log f(X2 | θ) ].


(contd.) Proof.
The corollary follows by noting that
E[ (∂/∂θ) log f(X1 | θ) · (∂/∂θ) log f(X2 | θ) ] = E[ (∂/∂θ) log f(X1 | θ) ] · E[ (∂/∂θ) log f(X2 | θ) ]

and

E[ (∂/∂θ) log f(X1 | θ) ] = ∫ [ (∂/∂θ) log f(x | θ) ] f(x | θ) dx = ∫ (∂/∂θ) f(x | θ) dx = 0.


7.32 Remarks.
Theorem 7.30 and Corollary 7.31 hold for the discrete case as well. In
either the discrete or continuous case, in order for the Cramér-Rao lower
bound to hold, what is required is to allow interchange of integration (over
x) and differentiation (with respect to θ).


7.33 Remarks.
The quantity Iθ = Eθ[ (∂/∂θ) log f(X | θ) ]² is called the Fisher information per
observation. For a sample consisting of n iid observations, the total Fisher
information is nIθ. Fisher information may be regarded as the limiting
version of Kullback-Leibler information in the following sense. Fix θ0 and
let θ be close to θ0. Recall that

D(θ0 ∥ θ) = D(f(· | θ0) ∥ f(· | θ)) = ∫ log[ f(x | θ0)/f(x | θ) ] f(x | θ0) dx.

By Taylor’s expansion,

log f(x | θ) = log f(x | θ0) + [ (∂/∂θ) log f(x | θ0) ](θ − θ0)
  + (1/2)[ (∂²/∂θ²) log f(x | θ0) ](θ − θ0)² + higher-order terms.


7.33 (contd.) Remarks.


It follows from the lemma below that for θ close to θ0 ,
D(θ0 ∥ θ) ≈ −∫ [ (∂/∂θ) log f(x | θ0) ](θ − θ0) f(x | θ0) dx
    − (1/2) ∫ [ (∂²/∂θ²) log f(x | θ0) ](θ − θ0)² f(x | θ0) dx
  = −(1/2) Eθ0[ (∂²/∂θ²) log f(X | θ0) ](θ − θ0)²
  = (1/2) Eθ0[ (∂/∂θ) log f(X | θ0) ]²(θ − θ0)² = (1/2) Iθ0 (θ − θ0)².


7.34 Lemma.
Under regularity conditions (to allow interchange of integration and
differentiation), we have
Eθ[ (∂/∂θ) log f(X | θ) ]² = −Eθ[ (∂²/∂θ²) log f(X | θ) ].

Proof.
We have

0 = (∂/∂θ) 1 = (∂/∂θ) ∫ f(x | θ) dx
  = ∫ (∂/∂θ) f(x | θ) dx
  = ∫ [ (∂/∂θ) log f(x | θ) ] f(x | θ) dx


(contd.) Proof.
0 = ∫ (∂/∂θ)[ ( (∂/∂θ) log f(x | θ) ) f(x | θ) ] dx
  = ∫ [ (∂²/∂θ²) log f(x | θ) ] f(x | θ) dx + ∫ [ (∂/∂θ) log f(x | θ) ][ (∂/∂θ) f(x | θ) ] dx
  = ∫ [ (∂²/∂θ²) log f(x | θ) ] f(x | θ) dx + ∫ [ (∂/∂θ) log f(x | θ) ]² f(x | θ) dx
  = Eθ[ (∂²/∂θ²) log f(X | θ) ] + Eθ[ (∂/∂θ) log f(X | θ) ]²,
completing the proof.


7.35 Example (Poisson).


Let X1, . . . , Xn be iid Poisson(λ). The MLE of λ is λ̂ = X̄, which is
unbiased.

Varλ(λ̂) = (1/n) Var X1 = (1/n) λ.

The Fisher information is

Iλ = Eλ[ (∂/∂λ) log f(X | λ) ]²
  = −Eλ[ (∂²/∂λ²) log f(X | λ) ]
  = −Eλ (∂²/∂λ²) log( e^{−λ} λ^X / X! )
  = −Eλ (∂²/∂λ²) ( −λ + X log λ − log X! )
  = −Eλ[ X(−λ⁻²) ] = λ⁻² Eλ X = λ⁻¹.


7.35 (contd.) Example (Poisson).


By the Cramér-Rao inequality, we have
Varλ W(X) ≥ 1/(nIλ) = λ/n = Varλ(λ̂)

for any unbiased estimator W (X) of λ. This shows that the MLE λ̂ is a
best unbiased estimator of λ. In particular, we have Varλ λ̂ ≤ Varλ S 2 since
the sample variance S 2 is also an unbiased estimator of λ.

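A simulation comparing the two unbiased estimators X̄ and S² of λ with the Cramér-Rao bound λ/n (parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(6)
lam, n, reps = 3.0, 30, 100_000

x = rng.poisson(lam, size=(reps, n))
xbar = x.mean(axis=1)
S2 = x.var(axis=1, ddof=1)

print(xbar.var(), lam / n)   # variance of the MLE is close to the bound lambda/n
print(S2.var())              # strictly larger than lambda/n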

More generally, consider a one-parameter exponential family
f(x | θ) = g(x) e^{θh(x)−H(θ)}. Note that

7.36   e^{H(θ)} = ∫ g(x) e^{θh(x)} dx.

Differentiating (7.36) yields

7.37   H′(θ) e^{H(θ)} = ∫ h(x) g(x) e^{θh(x)} dx,

so that

H′(θ) = ∫ h(x) g(x) e^{θh(x)−H(θ)} dx = Eθ h(X).

Differentiating (7.37) yields

[ H″(θ) + ( H′(θ) )² ] e^{H(θ)} = ∫ ( h(x) )² g(x) e^{θh(x)} dx,


so that

H″(θ) + ( H′(θ) )² = ∫ ( h(x) )² g(x) e^{θh(x)−H(θ)} dx = Eθ( h(X) )².

This shows that

H″(θ) = Eθ( h(X) )² − ( H′(θ) )² = Varθ h(X) > 0.

In particular, H(θ) is a convex function and H′(θ) is increasing in θ.

Suppose we are interested in estimating τ(θ) = H′(θ), which is a
reparameterization of θ. The MLE of θ satisfies

0 = (∂/∂θ) log L(θ | x1, . . . , xn)
  = (∂/∂θ) [ θ ∑_{i=1}^n h(xi) − nH(θ) ]
  = ∑_{i=1}^n h(xi) − nH′(θ) = ∑_{i=1}^n h(xi) − nτ(θ),


i.e.

τ(θ̂) = H′(θ̂) = (1/n) ∑_{i=1}^n h(xi).

In other words, the MLE of τ(θ) is

τ(θ̂) = (1/n) ∑_{i=1}^n h(xi).

Note that

Eθ τ(θ̂) = Eθ h(X) = τ(θ) and
Varθ τ(θ̂) = (1/n) Varθ h(X) = H″(θ)/n = τ′(θ)/n.

The Fisher information is

Iθ = Eθ[ (∂/∂θ) log f(X | θ) ]²
  = −Eθ[ (∂²/∂θ²) log f(X | θ) ]
  = −Eθ( −H″(θ) ) = H″(θ) = τ′(θ).


By the Cramér-Rao inequality, we have


Varθ W(X) ≥ ( τ′(θ) )²/(nIθ) = τ′(θ)/n = Varθ τ(θ̂)

for any unbiased estimator W(X) of τ(θ). This shows that the MLE
τ (θ̂) = H ′ (θ̂) is a best unbiased estimator of τ (θ) = H ′ (θ).


7.38 Example (Bernoulli).


The Bernoulli family is a one-parameter exponential family where, for x = 0, 1,

f(x | p) = p^x (1 − p)^{1−x} = e^{x log[p/(1−p)] + log(1−p)}.

Let θ = log[ p/(1 − p) ], so that p = 1/(1 + e^{−θ}) and

f( x | p(θ) ) = e^{xθ−H(θ)},

where H(θ) = log(1 + e^θ). Note that H′(θ) = e^θ/(1 + e^θ) = 1/(1 + e^{−θ}) = p. Thus the
MLE p̂ of p is X̄, which is a best unbiased estimator of p as established
above. Indeed, we may directly prove this as follows. First,

Varp p̂ = p(1 − p)/n.


7.38 (contd.) Example (Bernoulli).


The Fisher information is

Iθ = H″(θ) = e^θ/(1 + e^θ)² = p(1 − p).

It follows from the Cramér-Rao inequality that

Varθ W(X) ≥ ( τ′(θ) )²/(nIθ) = ( H″(θ) )²/( n H″(θ) ) = p(1 − p)/n = Varθ p̂

for any unbiased estimator W (X) of p. So the MLE p̂ of p is a best


unbiased estimator of p.


7.39 Remarks.
Fisher information depends on the parameterization as follows. Consider
two parameterizations θ and ϕ. More precisely, there is a 1-1
correspondence between θ and ϕ, written θ = θ(ϕ) and ϕ = ϕ(θ). The pair
(θ, ϕ) refers to the same pdf f (x | θ) = g(x | ϕ). That is,
( )
f (x | θ) = g x | ϕ(θ) and
( )
g(x | ϕ) = f x | θ(ϕ) .

Taking the Bernoulli case as an example, we have

f (x | p) = px (1 − p)1−x = ex log p/(1−p)+log(1−p)

and

g(x | θ) = e^{xθ−log(1+e^θ)},

where θ = log p/(1 − p).


7.39 (contd.) Remarks.


The Fisher information with respect to θ is

Iθ = Eθ[ (∂/∂θ) log f(X | θ) ]²

while the Fisher information with respect to ϕ is

Iϕ = Eϕ[ (∂/∂ϕ) log g(X | ϕ) ]².


7.39 (contd.) Remarks.


Note that
(∂/∂ϕ) log g(X | ϕ) = (∂/∂ϕ) log f( X | θ(ϕ) ) = [ (∂/∂θ) log f(X | θ) ] θ′(ϕ),

implying that

Iϕ = Eϕ[ (∂/∂ϕ) log g(X | ϕ) ]² = ( θ′(ϕ) )² Eθ[ (∂/∂θ) log f(X | θ) ]² = ( θ′(ϕ) )² Iθ,

which in turn implies that

Iθ = ( θ′(ϕ) )⁻² Iϕ = ( ϕ′(θ) )² Iϕ.


7.40 Remarks.
In case that the support of f(· | θ) depends on θ, Fisher information may
not be well defined. For example, let f(x | θ) = 1/θ for 0 ≤ x ≤ θ. The
support of f(· | θ) is [0, θ]. Note that

0 = (∂/∂θ) 1 = (∂/∂θ) ∫₀^θ f(x | θ) dx ≠ ∫₀^θ (∂/∂θ) f(x | θ) dx = −1/θ.

“Formally”, (∂/∂θ) log f(x | θ) = (∂/∂θ) log θ⁻¹ = −1/θ, so that Iθ would appear to
be equal to Eθ(−1/θ)² = 1/θ².
However, for ∆θ > 0,

∫₀^θ [ ( log f(x | θ) − log f(x | θ − ∆θ) ) / ∆θ ]² f(x | θ) dx
  ≥ ∫_{θ−∆θ}^θ [ ( log(1/θ) − log 0 ) / ∆θ ]² (1/θ) dx = ∞.

This would suggest that Iθ = ∞.


7.40 (contd.) Remarks.


For an iid sample X1 , . . . , Xn from f (x | θ) = 1/θ, 0 < x < θ, let
Y = max(X1 , . . . , Xn ), the largest order statistic. Since
fY (y | θ) = ny n−1 /θn , 0 < y < θ, we have
Eθ Y = ∫₀^θ ( n yⁿ/θⁿ ) dy = ( n/(n + 1) ) θ.

Thus ((n + 1)/n) Y is an unbiased estimator of θ, whose variance equals

Varθ( ((n + 1)/n) Y ) = ( (n + 1)/n )² Varθ Y
  = ( (n + 1)/n )² [ Eθ Y² − ( (n/(n + 1)) θ )² ]
  = ( (n + 1)/n )² [ ( n/(n + 2) ) θ² − ( n/(n + 1) )² θ² ]
  = θ²/( n(n + 2) ) < 1/( n(1/θ²) ),

which violates the Cramér-Rao inequality if Iθ = 1/θ². If Iθ = ∞, the
Cramér-Rao inequality holds trivially.

7.41 Example (Normal variance).


For the normal case, the sample variance S 2 has mean σ 2 and variance
Var(S 2 | µ, σ 2 ) = 2σ 4 /(n − 1). Does S 2 attain the Cramér-Rao lower
bound?

To answer this question, we need to extend the Cramér-Rao lower bound to


the multi-parameter setting. Let θ = (θ1 , . . . , θk ) be a k-dimensional
parameter vector. Let X = (X1, . . . , Xn) have joint pdf f(x | θ) and let
W (X) = W (X1 , . . . , Xn ) be a (real-valued) estimator with
τ (θ) = Eθ W (X).


Under suitable regularity conditions, we have the following


(generalized) Cramér-Rao inequality:
Varθ W(X) ≥ ( ∇τ(θ) )ᵀ I_θ⁻¹ ∇τ(θ),

where ∇τ(θ) = (∂τ/∂θ1, . . . , ∂τ/∂θk)ᵀ and I_θ = ( I_θ(i, j) )_{i,j=1}^k with

I_θ(i, j) = Eθ[ ( ∂ log f(X | θ)/∂θi )( ∂ log f(X | θ)/∂θj ) ].

(I θ is assumed to be non-singular.)


Proof.
We have

1 = ∫ f(x | θ) dx for all θ,

yielding

0 = (∂/∂θi) 1 = (∂/∂θi) ∫ f(x | θ) dx = ∫ (∂/∂θi) f(x | θ) dx

7.42   = ∫ [ ∂ log f(x | θ)/∂θi ] f(x | θ) dx.

Similarly, differentiating both sides of τ(θ) = ∫ W(x) f(x | θ) dx gives

7.43   ∂τ(θ)/∂θi = ∫ W(x) [ ∂ log f(x | θ)/∂θi ] f(x | θ) dx.


(contd.) Proof.
Combining (7.42) and (7.43), we have for constants c1 , . . . , ck ,

∑_{i=1}^k ci ∂τ/∂θi = ∫ [ W(x) − τ(θ) ] [ ∑_{i=1}^k ci (∂/∂θi) log f(x | θ) ] f(x | θ) dx
  = Eθ{ [ W(X) − τ(θ) ] [ ∑_{i=1}^k ci ∂ log f(X | θ)/∂θi ] }.

By the Cauchy-Schwarz inequality,

( ∑_{i=1}^k ci ∂τ/∂θi )² ≤ Varθ W(X) · E[ ∑_{i=1}^k ci ∂ log f(X | θ)/∂θi ]².

In matrix form, we have with c = (c1, . . . , ck)ᵀ

( cᵀ ∇τ(θ) )² ≤ Varθ W(X) · cᵀ I_θ c.


(contd.) Proof.
Thus,
Varθ W(X) ≥ sup_c ( cᵀ ∇τ(θ) )² / ( cᵀ I_θ c )
  = sup_c ( cᵀ I_θ^{1/2} I_θ^{−1/2} ∇τ(θ) )² / ( (cᵀ I_θ^{1/2})(I_θ^{1/2} c) )
  = sup_{c̃} ( c̃ᵀ I_θ^{−1/2} ∇τ(θ) )² / ( c̃ᵀ c̃ )
  = [ ( I_θ^{−1/2} ∇τ(θ) )ᵀ ( I_θ^{−1/2} ∇τ(θ) ) ]² / [ ( I_θ^{−1/2} ∇τ(θ) )ᵀ ( I_θ^{−1/2} ∇τ(θ) ) ]
  = ( ∇τ(θ) )ᵀ I_θ⁻¹ ∇τ(θ),

where I_θ^{1/2} is symmetric positive-definite such that I_θ^{1/2} I_θ^{1/2} = I_θ. This proves
the (generalized) Cramér-Rao inequality.


When X1 , . . . , Xn are iid f (x | θ), we have f (x | θ) = f (x1 | θ) · · · f (xn | θ),


so that
I_θ(i, j) = Eθ[ ( ∂ log f(X | θ)/∂θi )( ∂ log f(X | θ)/∂θj ) ]
  = Eθ{ [ (∂/∂θi) ∑_{l=1}^n log f(Xl | θ) ][ (∂/∂θj) ∑_{l=1}^n log f(Xl | θ) ] }
  = n Eθ[ (∂/∂θi) log f(X1 | θ) ][ (∂/∂θj) log f(X1 | θ) ]
  = −n Eθ[ ( ∂²/(∂θi ∂θj) ) log f(X1 | θ) ].


Then the Cramér-Rao lower bound is


( ∇τ(θ) )ᵀ (n I_θ)⁻¹ ( ∇τ(θ) ),

where Fisher information matrix (per observation) I_θ is given by

I_θ(i, j) = Eθ[ (∂/∂θi) log f(X | θ) ][ (∂/∂θj) log f(X | θ) ]
  = −Eθ[ ( ∂²/(∂θi ∂θj) ) log f(X | θ) ].


7.44 Example (Normal variance, continued).


We have
log f(x | µ, σ²) = −(1/2) log(2πσ²) − (x − µ)²/(2σ²),

so that

(∂²/∂µ²) log f(x | µ, σ²) = −1/σ²
(∂²/(∂µ ∂(σ²))) log f(x | µ, σ²) = −(x − µ)/σ⁴
(∂²/∂(σ²)²) log f(x | µ, σ²) = 1/(2σ⁴) − (x − µ)²/σ⁶.

It follows that

I_θ = ( 1/σ²     0
        0        1/(2σ⁴) ).


7.44 (contd.) Example (Normal variance, continued).


For the sample variance S 2 , we have τ (µ, σ 2 ) = E S 2 = σ 2 and the
Cramér-Rao lower bound equals
( 0  1 ) (nI_θ)⁻¹ ( 0  1 )ᵀ = (2/n) σ⁴,

which is less than 2σ⁴/(n − 1) = Var(S² | µ, σ²). While S² fails to attain the Cramér-Rao
lower bound, it can be shown that S 2 is a best unbiased estimator of σ 2 .
We will not pursue this issue further, which requires the concept of
completeness of a statistic.


7.45 Remarks (Rao-Blackwell Theorem).


Let W(X) be any unbiased estimator of τ(θ). Let T(X) be sufficient for θ.
Define ϕ(T) = E( W(X) | T ). Then Eθ ϕ(T) = τ(θ) and
Varθ ϕ(T ) ≤ Varθ W (X) for all θ. That is, ϕ(T ) is a uniformly better
unbiased estimator of τ (θ).

Proof.
We have
τ(θ) = Eθ W(X) = Eθ( Eθ( W(X) | T ) ) = Eθ ϕ(T),

showing that ϕ(T ) is an unbiased estimator of τ (θ). Moreover, we have


Varθ W(X) = Varθ( Eθ( W(X) | T ) ) + Eθ( Varθ( W(X) | T ) )
  = Varθ ϕ(T) + Eθ( Varθ( W(X) | T ) )
  ≥ Varθ ϕ(T).


7.45 (contd.) Remarks (Rao-Blackwell Theorem).


Rao-Blackwell Theorem says that if an estimator is not a function of a
(minimal) sufficient statistic, one can get a better estimator by considering
the conditional expectation of the estimator given the (minimal) sufficient
statistic.


7.46 Remarks (Uniqueness of a best unbiased estimator).


Let W and W′ both be best unbiased estimators of τ(θ). Consider
W* = (1/2)(W + W′), which is also unbiased. Moreover,

Varθ W ≤ Varθ W* = Varθ( (1/2) W + (1/2) W′ )
  = (1/4) Varθ W + (1/4) Varθ W′ + (1/2) Covθ(W, W′)
  ≤ (1/4) Varθ W + (1/4) Varθ W′ + (1/2) [ (Varθ W)(Varθ W′) ]^{1/2}
  = Varθ W,

showing that the two inequalities are in fact equalities.


7.46 (contd.) Remarks (Uniqueness of a best unbiased estimator).


In particular,
Covθ(W, W′) = [ (Varθ W)(Varθ W′) ]^{1/2}.
This can happen only when

W ′ = a W + b.

By Eθ W = Eθ W ′ and Varθ W = Varθ W ′ = Covθ (W, W ′ ), we conclude


that a = 1 and b = 0, i.e. W = W ′ .


Loss function optimality


Mean squared error is associated with the squared error loss. More general
loss functions may be considered. Let L(θ, a) be a loss function involving
the parameter value θ and “action” a. For simplicity, assume θ is
one-dimensional. If a = θ, then L(θ, a) = 0 (no loss is incurred if the correct
action is taken). As a moves away from θ, the loss increases. The special
case L(θ, a) = (a − θ)2 is the squared error loss function while
L(θ, a) = |a − θ| is the absolute error loss function. For a specified loss
function L(θ, a), the risk function for an estimator δ(X) is defined by
( )
R(θ, δ) = Eθ L θ, δ(X) .

We say an estimator δ is better than another estimator δ ′ if

R(θ, δ) ≤ R(θ, δ ′ ) for all θ

and the strict inequality holds for some θ.


7.47 Example (Risk of normal variance).


Let X1 , . . . , Xn be iid n(µ, σ 2 ). Consider estimating σ 2 using squared error
loss. It seems reasonable to restrict attention to the class of estimators
δb = bS 2 where S 2 is the sample variance and b is a positive constant. Then
R( (µ, σ²), δb ) = E (bS² − σ²)²
  = Var bS² + (E bS² − σ²)²
  = b² Var S² + (bσ² − σ²)²
  = b² ( 2σ⁴/(n − 1) ) + (b − 1)² σ⁴
  = [ 2b²/(n − 1) + (b − 1)² ] σ⁴.

It is easily seen that b = (n − 1)/(n + 1) minimizes R( (µ, σ²), δb ). In other words, in
the class of estimators { bS² : b > 0 }, the estimator ((n − 1)/(n + 1)) S² = (1/(n + 1)) ∑_{i=1}^n (Xi − X̄)²
has the smallest mean squared error.

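A quick numerical check of the minimizing constant, evaluating the bracketed risk factor on a grid of b values (σ⁴ factors out and is dropped):

import numpy as np

n = 10
b = np.linspace(0.01, 2.0, 100_000)
risk = 2 * b**2 / (n - 1) + (b - 1) ** 2     # risk / sigma^4 for delta_b = b * S^2

b_star = b[np.argmin(risk)]
print(b_star, (n - 1) / (n + 1))             # numerical minimizer vs (n-1)/(n+1)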

7.48 Example (Variance estimation using Stein’s loss).


A criticism of squared error loss in a variance estimation problem is that
underestimation has only a finite penalty while overestimation has an
infinite penalty. C. Stein considered the loss function
L(σ², a) = a/σ² − 1 − log( a/σ² ).

Note that L(σ², a) is a convex function of a with the minimum L(σ², a) = 0
at a = σ². Moreover, L(σ², a) tends to ∞ as a → 0 or a → ∞. Again, let’s
consider the class of estimators { bS² : b > 0 }. Below we need not assume
normality. We have

R(σ², δb) = E( bS²/σ² − 1 − log( bS²/σ² ) )
  = b E( S²/σ² ) − 1 − E log( bS²/σ² )
  = b − log b − 1 − E log( S²/σ² ).

Clearly, the risk function is minimized at b = 1. Thus, under Stein’s loss,
S² has the smallest risk in the class { bS² : b > 0 }.

Large Sample Theory


Let X1 , X2 , . . . be an iid sample from f (x | θ) where θ ∈ Θ is unknown. For
simplicity, we assume that θ is one-dimensional. Let Wn = Wn (X1 , . . . , Xn )
be an estimator of θ based on X1 , . . . , Xn . Thus {Wn : n = 1, 2, . . . } is a
sequence of estimators of θ.

7.49 Definition (Consistency).


A sequence of estimators Wn = Wn (X1 , . . . , Xn ) is a consistent sequence
of estimators of θ if for every θ ∈ Θ and every ε > 0,

lim Pθ (|Wn − θ| ≥ ε) = 0.
n→∞


7.50 Example.
Let Wn be a sequence of estimators such that Eθ[ (Wn − θ)² ] → 0 as
n → ∞. Then by Chebychev’s inequality, for ε > 0,

Pθ(|Wn − θ| ≥ ε) ≤ Eθ[ (Wn − θ)² ] / ε² −→ 0 as n → ∞.

So Wn is consistent for θ. In particular, the sample mean X̄n is consistent
for the population mean µ since

E[ (X̄n − µ)² ] = σ²/n −→ 0 as n → ∞,
where σ 2 is the population variance.


7.51 Theorem (Consistency of MLEs).


Let X1, X2, . . . be iid f(x | θ). Let θ̂n be the MLE of θ based on
X1, . . . , Xn. Under regularity conditions, θ̂n is consistent for θ, i.e. for
ε > 0, limn→∞ Pθ (|θ̂n − θ| ≥ ε) = 0. Moreover, if τ (θ) is a continuous
function of θ, then τ (θ̂n ) is a consistent estimator of τ (θ).

7.52 Remarks.
By considering the Kullback-Leibler information, we showed consistency of
θ̂n when the parameter space Θ consists of finitely many points. The
general case requires more delicate arguments.


7.53 Definition (Asymptotic efficiency).


A sequence of estimators Wn is asymptotically efficient for τ(θ) if
√n ( Wn − τ(θ) ) → n( 0, v(θ) ) in distribution, where v(θ) = ( τ′(θ) )²/Iθ, Iθ
being the Fisher information of a single observation X1
( i.e. Iθ = Eθ[ (∂/∂θ) log f(X | θ) ]² ).
∂θ
log f (X | θ) .

7.54 Remarks.
Recall the Cramér-Rao lower bound
( τ′(θ) )² / ( n Eθ[ (∂/∂θ) log f(X | θ) ]² ).


7.55 Theorem (Asymptotic efficiency of MLEs).


Let X1, X2, . . . be iid f(x | θ) and let θ̂n denote the MLE of θ based on
X1, . . . , Xn. Let τ(θ) be a smooth function of θ. Under regularity
conditions, √n ( τ(θ̂n) − τ(θ) ) → n( 0, v(θ) ) in distribution.

Proof.
Fix θ0 ∈ Θ. Let X1, X2, . . . be iid f(x | θ0). Let
log L(θ | X_n) = ∑_{i=1}^n log f(Xi | θ) be the log likelihood function. By
Taylor’s expansion at θ0,

(∂/∂θ) log L(θ | X_n) = (∂/∂θ) log L(θ0 | X_n) + (θ − θ0) (∂²/∂θ²) log L(θ0 | X_n)
  + higher-order terms,

where X_n = (X1, . . . , Xn). Ignoring the higher-order terms,

0 = (∂/∂θ) log L(θ̂n | X_n) ≈ (∂/∂θ) log L(θ0 | X_n) + (θ̂n − θ0) (∂²/∂θ²) log L(θ0 | X_n),


(contd.) Proof.
from which it follows that

7.56   √n (θ̂n − θ0) ≈ [ (1/√n) ∑_{i=1}^n (∂/∂θ) log f(Xi | θ0) ] / [ −(1/n) ∑_{i=1}^n (∂²/∂θ²) log f(Xi | θ0) ].

Since −(1/n) ∑_{i=1}^n (∂²/∂θ²) log f(Xi | θ0) converges in probability to
−Eθ0 (∂²/∂θ²) log f(X | θ0) = Iθ0 and (1/√n) ∑_{i=1}^n (∂/∂θ) log f(Xi | θ0) converges in
distribution to n(0, Iθ0), the right-hand side of (7.56) converges in
distribution to n(0, 1/Iθ0). By the delta method, √n ( τ(θ̂n) − τ(θ0) )
converges in distribution to n( 0, ( τ′(θ0) )²/Iθ0 ).

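A simulation sketch of this limiting normality for the Bernoulli MLE p̂n = X̄n, whose asymptotic variance 1/Ip = p(1 − p) was computed in Example 7.38 (parameter values are arbitrary):

import numpy as np

rng = np.random.default_rng(7)
p, n, reps = 0.3, 200, 20_000

x = rng.binomial(1, p, size=(reps, n))
p_hat = x.mean(axis=1)
z = np.sqrt(n) * (p_hat - p)        # approximately n(0, p(1-p)) for large n

print(z.mean(), z.var(), p * (1 - p))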

7.57 Remarks.
We may say that θ̂n is asymptotically normal with mean θ and variance
(nIθ )−1 . Thus the variance of θ̂n decreases at the rate of n−1 as n
increases. Method-of-moments estimators are also asymptotically normal
with variances decaying at the rate of n−1 . That is, asymptotic variances
are of the form c(θ)/n for some c(θ) ≥ Iθ−1 . If c(θ) > Iθ−1 , then the
estimator is not asymptotically efficient.


7.58 Definition (Asymptotic relative efficiency).


For two sequences Wn and Vn of estimators of τ(θ) satisfying

√n [ Wn − τ(θ) ] −→ n(0, σ²_W)
√n [ Vn − τ(θ) ] −→ n(0, σ²_V),

the asymptotic relative efficiency (ARE) of Vn with respect to Wn is

ARE(Vn, Wn) = σ²_W / σ²_V.

7.59 Remarks.
If ARE(Vn , Wn ) > 1, Vn is said to be asymptotically more efficient than
Wn . Note that ARE(Vn , Wn ) depends on θ in general. It may happen that
ARE(Vn , Wn ) is greater than 1 for some values of θ but less than 1 for some
other values of θ. If ARE(Vn, Wn) = 2 (say), then W2n has approximately
the same variance as Vn for large n (since σ²_W/(2n) = σ²_V/n). In other words, for
the two estimators W and V to have the same variance, W requires about
twice the sample size as V does.


7.60 Example (ARE of Poisson estimates).


Let X1, X2, . . . be iid Poisson(λ). Suppose we would like to estimate τ(λ) = e^{−λ}.
Since P(X1 = 0) = e^{−λ}, we may estimate τ(λ) by τ̂n = (1/n) ∑_{i=1}^n Yi where
Yi = 1(Xi = 0). Since Y1, Y2, . . . are iid Bernoulli(e^{−λ}), we have

√n (τ̂n − e^{−λ}) −→ n( 0, e^{−λ}(1 − e^{−λ}) ).

On the other hand, the MLE e^{−λ̂n} of e^{−λ} satisfies

√n (e^{−λ̂n} − e^{−λ}) −→ n( 0, v(λ) )

where

v(λ) = ( τ′(λ) )²/Iλ = e^{−2λ}/λ⁻¹ = λ e^{−2λ}.

Thus,

ARE(τ̂n, e^{−λ̂n}) = λ e^{−2λ} / ( e^{−λ}(1 − e^{−λ}) ) = λ/(e^λ − 1),

which is less than 1 for λ > 0 and decreases to 0 as λ → ∞.

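A Monte Carlo illustration of this ARE, comparing the empirical variances of the two estimators of e^{−λ} (λ and n are arbitrary):

import numpy as np

rng = np.random.default_rng(8)
lam, n, reps = 2.0, 200, 20_000

x = rng.poisson(lam, size=(reps, n))
tau_hat = (x == 0).mean(axis=1)          # proportion of zeros
tau_mle = np.exp(-x.mean(axis=1))        # MLE exp(-lambda_hat)

print(tau_mle.var() / tau_hat.var())     # empirical variance ratio
print(lam / np.expm1(lam))               # theoretical ARE = lambda / (e^lambda - 1)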