
Advanced Statistics II

Institut für Statistik und Ökonometrie


(Christian-Albrechts-Universität zu Kiel)

April 16, 2016

Contents

1. Statistical models

2. Point estimation
   2.1. Stochastic Models
   2.2. Estimators and their properties
        2.2.1. Finite-Sample Properties
        2.2.2. Asymptotic Properties
   2.3. Sufficient Statistics
   2.4. Minimal Sufficient Statistics
   2.5. Minimum Variance Unbiased Estimation
        2.5.1. Cramér-Rao Lower Bound (CRLB)
        2.5.2. Sufficiency and Completeness

3. Point estimation methods
   3.1. Least-Squares Estimators for Linear Regression Models
        3.1.1. The classical LRM assumptions
        3.1.2. The Least-Squares estimator for β in the classical LRM
        3.1.3. Properties of the LS estimator in the classical LRM
   3.2. The Method of Maximum Likelihood
        3.2.1. The Likelihood function and the ML estimator
        3.2.2. Finite sample properties of the ML estimator
        3.2.3. Large sample properties of the ML estimator
        3.2.4. MLE invariance principle
   3.3. The (generalized) method of moments
   3.4. Bayesian Estimation
        3.4.1. Prior and Posterior Distribution
        3.4.2. The Loss-Function Approach

4. Hypothesis testing
   4.1. Fundamental Notations and Terminology of Hypothesis Testing
   4.2. Parametric Tests and Test Properties
   4.3. Construction of UMP Tests
   4.4. Hypothesis-Testing Methods
        4.4.1. Likelihood Ratio Tests
        4.4.2. Lagrange Multiplier (LM) Tests
        4.4.3. Wald Tests

Appendix

A. Tables

1. Statistical models
In the course Advanced Statistics I we discussed fundamental ideas of probability theory and
the theory of distributions. There we considered the probability space of a random experiment,
given by the 3-tuple
{S, Y, P },
where

S : the sample space, i.e. the set of all outcomes of the experiment,

Y : the event space, i.e. the set of all events (typically a sigma-algebra on S),

P : the probability set function having domain Y, used to assign probabilities to events.

There, a typical question was:


Given the probability space, what can we say about the characteristics and properties of outcomes of an experiment?

Example 1.1 Consider the experiment of tossing a fair coin 50 times. Assume that we are
interested in the number of heads, say X. We know that this rv has a binomial distribution
with parameters n = 50 and p = 0.5. Hence, we have a completely specified probability space
with a sample space S = {0, 1, ..., 50}, an event space Y consisting of all subsets of S, and a
probability set function P characterized by the pdf of a binomial distribution. From this we can
deduce the characteristics of the outcomes of X like the expected number of heads (np = 25) or
the shape of the pdf of X.

In this course we now turn the question of probability theory and the theory of distributions
around:
Given the observed characteristics and properties of outcomes of an experiment,
what can we say (infer) about the probability space?

Example 1.2 Assume that we have a sample of daily returns observed for the German stock
index DAX, which we interpret as the outcomes of a random process/experiment. As a financial
analyst we might be interested in finding a probability distribution (i.e. the probability space)
which can be used to describe or approximate the observed behavior of the returns.
Problems associated with this kind of question are addressed by the methods of statistical
inference.
In general, the term statistical inference refers to the inductive process of generating information
about characteristics of a population or process by analyzing a sample of objects or outcomes
from the population or process. A typical problem of statistical inference is as follows.
Let X be a rv that represents the population under investigation, and let f(x; θ) denote
the parametric family of pdfs of X. The set of possible parameter values is denoted by Ω.
Then the job of the statistician is to decide, on the basis of a sample randomly drawn from
the population, say {X_i, i = 1, ..., n}, which member of the family of pdfs {f(x; θ), θ ∈ Ω}
can represent the pdf of X.
Example 1.3 Consider a random sample of daily DAX returns, say {X_1, X_2, ..., X_n}. Assume
that the returns represent a random sample from a normal distribution, i.e.,

    X_i ∼ iid N(μ, σ²),   i = 1, ..., n,

where μ and σ² are unknown parameters. Our task is to generate statistical inferences based
upon the random sample about the population values for μ and σ².
Statistics
In statistical inference, we use functions of the random sample X1 , ..., Xn to map/transform
sample information into inferences regarding the population characteristics of interest. The
functions used for this mapping are called statistics, defined as follows.
Definition (Statistic): Let X1 , ..., Xn be a random sample
from a population and let T (x1 , ..., xn ) be a real-valued function, which does not depend on unobservable quantities. Then
the random variable
Y = T (X1 , ..., Xn ) is called a (sample) statistic.
Often used statistics are
the sample mean  X̄_n = (1/n) ∑_{i=1}^n X_i,

the r-th order non-central sample moment  M'_r = (1/n) ∑_{i=1}^n X_i^r,

the sample variance  S_n² = (1/n) ∑_{i=1}^n (X_i − X̄_n)².

If statistics have certain qualifying statistical properties, they can be used for the estimation of
population parameters or hypothesis testing. Then they are called estimators or test statistics,
respectively.
Example 1.4 The most popular statistic is the sample mean X̄_n, which has a lot of useful
statistical properties, some of which are summarized in the following. Let X_1, ..., X_n be a random
sample from a population with expectation EX = μ and variance var(X) = σ². Then the sample
mean has the following properties:

    E X̄_n = μ,

    var(X̄_n) = σ²/n,

    plim X̄_n = μ, which follows from the WLLN,

    X̄_n ∼_a N(μ, σ²/n), which follows from the CLT of Lindeberg-Lévy.
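As a quick illustration, the following Python sketch (not part of the original notes) checks these properties of the sample mean by Monte Carlo simulation; the normal population, the sample size n = 50, and the number of replications are illustrative assumptions.

import numpy as np

# Simulation check of Example 1.4: for samples of size n from a population with
# mean mu and variance sigma^2, the sample mean should satisfy E(Xbar) = mu,
# var(Xbar) = sigma^2 / n, and be approximately normal (Lindeberg-Levy CLT).
rng = np.random.default_rng(0)
mu, sigma, n, replications = 2.0, 3.0, 50, 100_000

samples = rng.normal(mu, sigma, size=(replications, n))  # each row is one sample
xbar = samples.mean(axis=1)

print("mean of Xbar :", xbar.mean(), " (theory:", mu, ")")
print("var  of Xbar :", xbar.var(), " (theory:", sigma**2 / n, ")")

# CLT check: standardized sample means should be roughly standard normal
z = (xbar - mu) / (sigma / np.sqrt(n))
print("P(|Z| <= 1.96):", np.mean(np.abs(z) <= 1.96), " (theory: about 0.95)")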

2. Point estimation
The rationale behind point estimation is loosely described as follows. Assume that we have a
realization of a random sample X_1, ..., X_n from a joint pdf f(x_1, ..., x_n; θ), where the form of
the pdf f is assumed to be known except that it contains a parameter θ with an unknown
value, say θ_0. Then the objective of point estimation is to utilize the random sample outcome
x_1, ..., x_n to generate good (in some sense) estimates of the unknown value of θ_0, or the value
of some function, say q(θ_0).
This estimation can be made in two ways.
The first is called point estimation: There the outcome of some statistic, say t(X_1, ..., X_n),
represents the estimate of the unknown θ_0 or q(θ_0).
The second is called interval estimation: There we define two statistics, say t_1(X_1, ..., X_n)
and t_2(X_1, ..., X_n), so that

    [t_1(X_1, ..., X_n), t_2(X_1, ..., X_n)]

is an interval for which the probability can be determined that it contains the unknown
θ_0 or q(θ_0).
In the following we will focus on point estimation. Point estimation admits two problems.
The first is to devise some means of obtaining a statistic to use as an estimator;
the second is to select criteria and techniques to define and find the best estimator among
many possible estimators.
Here we will be concerned with the second problem, and in the following chapter with the first
one.

2.1. Stochastic Models


The problem of point estimation begins with specifying a stochastic model which contains all
our assumptions about the generating process for the random sample X = (X_1, ..., X_n) whose
outcome x = (x_1, ..., x_n) constitutes the observed data to be analyzed.
Definition (Statistical model): A statistical model for a random sample x consists of
a parametric functional form, f(x; θ), for the joint pdf of x indexed by the parameter vector θ,
together with a parameter space, Ω, that defines the set of potential candidates for the true
joint pdf of x as {f(x; θ), θ ∈ Ω}.

Remark: The true model is the joint pdf with θ = θ_0, i.e., f(x; θ_0), and the estimation
problem consists in approximating θ_0 or some function q(θ_0). Examples of such functions
q(θ_0) are e.g. the population mean and variance

    EX = ∫ x f(x; θ_0) dx,        var(X) = ∫ (x − EX)² f(x; θ_0) dx.

These are, at the same time, functionals¹ of the density f (or, correspondingly, of the cdf F).
The method used to estimate θ will generally depend on the degree of specificity with which
one can define the stochastic model for the random sample X. Accordingly, one can distinguish
between distribution-specific estimation methods and distribution-free estimation methods.

Distribution-specific estimation

In this case, the estimation of θ_0 or q(θ_0) is associated with a fully specified stochastic
model assuming a specific parametric family of pdfs for the random sample x, represented by
{f(x; θ), θ ∈ Ω}.
Example 2.1 A fully specified model for a random sample of n daily returns of the DAX index,
say {X_i, i = 1, ..., n}, might be defined as

    f(x_1, ..., x_n; θ) = ∏_{i=1}^n N(x_i; μ, σ²),   (μ, σ²) ∈ Ω,

where Ω = (−∞, ∞) × (0, ∞).

Distribution-specific estimation methods are the maximum-likelihood estimation procedure and
Bayesian estimation procedures based upon the posterior distribution.
These methods require that the parametric form of the joint pdf of the random sample is fully
algebraically specified.

¹ Loosely speaking, functionals are real functions taking functions as arguments; the integral is one of the
most common functionals.
Distribution-free estimation

In this case, a specific functional form for the joint pdf f(x; θ) is not assumed, and the joint pdf
may or may not be fully specified.

Example 2.2 A partially specified model for the sample of the DAX returns would be

    f(x_1, ..., x_n; θ) = ∏_{i=1}^n f(x_i; μ, σ²),   (μ, σ²) ∈ Ω,   with EX_i = μ, var(X_i) = σ²,

where Ω = (−∞, ∞) × (0, ∞) and f(x_i; μ, σ²) is some continuous pdf.

This specification is very general, since it is the collection of all continuous joint pdfs for which
f(x_1, ..., x_n; θ) = ∏_{i=1}^n f(x_i; θ) with EX_i = μ and Var(X_i) = σ².

Estimation methods for partially specified models are the least-squares and the method-of-moments
estimation procedures. Should one be interested in the density f itself, the Advanced
Statistics III course discusses non-parametric estimation.
The advantage of distribution-free estimation: It is based upon a general specification
for the joint distribution of the random sample. Hence, we can have great confidence that the
actual distribution is contained within that specification. This implies that the validity and
reliability of the estimation result are robust w.r.t. distributional assumptions.
The disadvantage of distribution-free estimation: In a context without specific distributional
assumptions, the interpretation of the properties of point estimates is not as specific
or as detailed as when the family of distributions is defined with greater specificity. Moreover,
estimators may have different statistical properties.
In the context of a point estimation problem, two important assumptions regarding the parameter
space Ω are made.

Assumption 1: Ω contains the true parameter value, so that θ_0 ∈ Ω.

This implies that the stochastic model {f(x; θ), θ ∈ Ω} can be assumed to contain the true
distribution for the random sample under consideration.
Since Ω represents the entire set of possible values for θ_0, the relevance of this assumption is
perhaps obvious: if our aim is to estimate θ_0, we do not want to preclude θ_0 from the set
of potential estimates. In practice, this assumption may be a tentative assumption that needs
to be verified by statistical tests. More important is the assumption that the data actually are
informative about the parameters.
Assumption 2: Ω is such that the parameter vector θ is identified.

The notion of the identifiability of θ is defined as follows.

Definition (Parameter identifiability): Let {f(x; θ), θ ∈ Ω} be a statistical model for the
random sample x. The parameter vector θ is said to be identified iff, ∀ θ_1 and θ_2 ∈ Ω,
f(x; θ_1) and f(x; θ_2) are distinct whenever θ_1 ≠ θ_2.
The definition states that if the parameter vector θ is not identified, then two or more different
θ-values, say θ_1 and θ_2, are associated with exactly the same sampling distribution for x.
In this event, random-sample outcomes x cannot be used to discriminate between θ_1 and θ_2,
since the stochastic behavior of X under either possibility is indistinguishable.
The identifiability assumption ensures that different θ-values are associated with different
stochastic behavior of the random-sample outcomes. By this we make sure that the sample
outcomes are able to provide discriminatory information regarding the choice of θ to be used
in estimating θ_0.
Example 2.3 A random sample X_1, ..., X_n is assumed to be generated by the process

    X_i = μ_x + V_i,   V_i ∼ iid N(μ_v, σ_v²),   μ_x, μ_v, σ_v² > 0,

such that X_i ∼ iid N(μ_x + μ_v, σ_v²). Hence, the stochastic model for that random sample can be
represented as

    f(x; θ) = ∏_{i=1}^n N(x_i; μ_x + μ_v, σ_v²),   θ = (μ_x, μ_v, σ_v²) ∈ Ω,

where Ω = ×_{i=1}^3 (0, ∞).

The question is whether or not θ = (μ_x, μ_v, σ_v²)′ is identified. The answer is no.
To see why, define μ = μ_x + μ_v and consider the stochastic model for the X_i's given by

    f(x; μ, σ_v²) = ∏_{i=1}^n N(x_i; μ, σ_v²),   (μ, σ_v²) ∈ Ω*,

where Ω* = ×_{i=1}^2 (0, ∞).

Note that any choice of positive values for μ_x and μ_v that results in a given positive value for μ
results in exactly the same sampling distribution for the X_i's. (Also note that there is an infinite
number of such choices.)
In order to identify the model, one can impose identifying restrictions on θ = (μ_x, μ_v, σ_v²)′ such
as, e.g., μ_v = 0, if they are plausible.
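A small Python sketch (not part of the original notes) can make the identification failure concrete: under the illustrative parameter values chosen below, two parameter vectors with the same sum μ_x + μ_v produce exactly the same log-likelihood for any data set.

import numpy as np

# Identification failure of Example 2.3: the likelihood of X_i ~ N(mu_x + mu_v, sigma_v^2)
# depends on (mu_x, mu_v) only through the sum mu = mu_x + mu_v, so parameter
# vectors with the same sum are observationally equivalent.
rng = np.random.default_rng(1)
x = rng.normal(1.0 + 2.0, 1.5, size=200)   # data generated with mu_x = 1, mu_v = 2

def log_likelihood(mu_x, mu_v, sigma_v2, data):
    mu = mu_x + mu_v
    return np.sum(-0.5 * np.log(2 * np.pi * sigma_v2)
                  - (data - mu) ** 2 / (2 * sigma_v2))

# two different parameter points with the same mu_x + mu_v = 3
ll_a = log_likelihood(1.0, 2.0, 1.5**2, x)
ll_b = log_likelihood(2.5, 0.5, 1.5**2, x)
print(ll_a, ll_b, "identical:", np.isclose(ll_a, ll_b))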

2.2. Estimators and their properties


Point estimation is concerned with estimating θ or q(θ) (i.e. providing an educated guess
for the actual value of the quantity of interest) using the outcome of a random sample X =
(X_1, ..., X_n)′. The estimate is generated via an appropriate sample statistic, i.e. a function of
the random sample, which is called an estimator.

Definition (Point estimator): A statistic, T = t(x), whose outcomes are used to estimate the
value of a scalar or vector function, q(θ), of the parameter vector, θ, is called a point estimator.
An observed outcome of an estimator is called a point estimate.

Note that according to this definition, there is literally an infinite number of possible sample
statistics t(x) that are potential estimators for q(θ). Hence, a fundamental problem in point
estimation is the choice of a good estimator.
In order to rank the efficacy of potential estimators and to choose the optimal estimator, we need
some criteria measuring the goodness of estimators. Competing estimators can be compared
by using measures of closeness of the estimates to q(θ); since we deal with random quantities,
closeness should be understood in an expected or probabilistic sense.
In what follows, we discuss specific properties of estimators that are used to measure this
closeness. These properties of estimators are differentiated on the basis of whether they are
small-sample (exact) properties or large-sample properties.
Small-sample (finite-sample) properties refer to the exact sampling distribution of the
estimator for finite samples;
Large-sample (asymptotic) properties refer to approximations to sampling distribution
properties based upon asymptotic distributions and convergence in probability considerations.
In general, it is preferable to rank estimators based upon their exact sampling distribution,
but this is not always possible, since the finite sample properties are sometimes intractable to
analyze. In such cases we must rely on asymptotic properties.


2.2.1. Finite-Sample Properties


Definition (Mean square error (scalar case)): The mean square error (MSE) of an estimator
T = t(x) of q(θ) is defined as

    MSE_θ(T) = E_θ[T − q(θ)]².

Remark: The notation E_θ(·) is used to emphasize that the expectation depends upon the value
of θ. In the continuous case we have

    MSE_θ(T) = ∫_{ℝⁿ} [t(x) − q(θ)]² f(x; θ) dx.

The MSE measures the expected squared distance between the estimator T and the quantity
to be estimated, q(θ). The MSE can be decomposed into the variance and the bias of the
estimator, as

    E_θ[T − q(θ)]² = E_θ[T − E_θT + E_θT − q(θ)]²
                   = E_θ(T − E_θT)² + E_θ[E_θT − q(θ)]² + 2 E_θ[(T − E_θT)(E_θT − q(θ))],

where the last (cross) term equals 0, so that

    MSE_θ(T) = Var_θ(T) + [E_θT − q(θ)]²,

the second term being the squared bias. Hence, the MSE-criterion penalizes an estimator for
having a large variance, a large bias, or both. It also allows a trade-off between variance and
bias in ranking estimators.
The definition of the MSE given above for scalar-valued estimators T can be generalized to the
mean square error matrix for multivariate estimators T; see Mittelhammer (1996, Def. 7.7).
Estimators with smaller MSEs are preferred. Note, however, that since θ is unknown, we must
consider the MSE-performance for all possible true values of θ, i.e., for all θ ∈ Ω.
It is quite often the case that an estimator will have lower MSEs than another estimator for
some θ values but not for others. A comparison of two estimators using the MSE-criterion
leads to the concept of relative efficiency.

Definition (Relative Efficiency (scalar case)): Let T and T* be two estimators of a scalar q(θ).
The relative efficiency of T w.r.t. T* is given by

    RE_θ(T, T*) = MSE_θ(T*) / MSE_θ(T).

T is relatively more efficient than T* if

    RE_θ(T, T*) ≥ 1 ∀ θ ∈ Ω   and   ∃ θ ∈ Ω such that RE_θ(T, T*) > 1.

The definition says that if T is more efficient than T*, then there is no θ value for which T* is
preferred to T on the basis of MSE, and for one or more θ values, T is preferred to T*. In this
case T* can be discarded as an estimator; T* is called inadmissible for estimating q(θ)
and T admissible.
Example 2.4 Suppose (X_1, ..., X_n) is a random sample from a Bernoulli distribution with
P(x_i = 1) = p and n = 25. Consider the following two estimators for p ∈ [0, 1]:

    T = (1/n) ∑_{i=1}^n X_i   and   T* = (1/(n+1)) ∑_{i=1}^n X_i.

The expectation and variance for T and T* are

    ET = EX_i = p,                      var(T) = (1/n) var(X_i) = (1/n) p(1 − p),

    ET* = (n/(n+1)) EX_i = np/(n+1),    var(T*) = (n/(n+1)²) p(1 − p).

Note that T is unbiased and T* is biased. However, T has a larger variance than T*. The
MSEs of the two estimators are

    MSE(T) = var(T) + (ET − p)² = p(1 − p)/25,

    MSE(T*) = (n/(n+1)²) p(1 − p) + (np/(n+1) − p)² = p(1 − p)/27.04 + p²/676.

The ratio of the MSEs is

    RE(T, T*) = MSE(T*)/MSE(T) = .9246 + .037 p/(1 − p).

Since this ratio depends on the unknown value of p, we must consider all the possible
contingencies for p ∈ [0, 1]. Note that for

    p → 0:   RE(T, T*) → .9246 < 1,

while for

    p → 1:   RE(T, T*) → ∞ > 1.

Hence, neither estimator is preferred to the other on the basis of MSE, and thus neither
estimator is inadmissible relative to the other.
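The following Python sketch (not part of the original notes) evaluates the two exact MSE expressions of this example on a grid of p values; the grid itself is an illustrative choice.

import numpy as np

# Exact MSEs of T = (1/n) sum X_i and T* = (1/(n+1)) sum X_i for Bernoulli(p)
# data with n = 25, and the relative efficiency RE(T, T*) = MSE(T*) / MSE(T).
n = 25
p = np.linspace(0.01, 0.99, 9)

mse_T = p * (1 - p) / n                                          # unbiased: variance only
mse_Tstar = n * p * (1 - p) / (n + 1) ** 2 + (p / (n + 1)) ** 2  # variance + bias^2
re = mse_Tstar / mse_T

for pi, ri in zip(p, re):
    print(f"p = {pi:4.2f}   RE(T, T*) = {ri:6.3f}")
# RE < 1 for small p (T* preferred) and RE > 1 for large p (T preferred),
# so neither estimator dominates the other in MSE over the whole of [0, 1].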

A natural question to ask is whether or not an MSE-optimal estimator exists that has, for all
θ ∈ Ω, the smallest MSE among all estimators for q(θ). In general, no such MSE-optimal
estimator exists. This can be shown as follows.
Assume that we want to estimate the scalar θ. Consider the degenerate estimator T_1 = t_1(X) =
θ_1 (a fixed value, ignoring the sample information) with

    MSE_θ(T_1) = E(θ_1 − θ)² = (θ_1 − θ)² = 0   for θ = θ_1.

Now we can define for each value of θ ∈ Ω such a degenerate estimator. Then for an estimator,
say T, to have minimum MSE for every potential value of θ, it would be necessary that
MSE_θ(T) = 0 ∀ θ ∈ Ω.
(Otherwise, we would find a θ-value where the corresponding degenerate estimator has a smaller
MSE than T.) However, note that MSE_θ(T) = 0 implies that

    E_θT = θ   and   var_θ(T) = 0   ∀ θ ∈ Ω,

and thus that

    P(T = θ) = 1   ∀ θ ∈ Ω.

This means that for an estimator T to have its MSE identically 0 for every θ value, it must
always estimate θ correctly, which would imply that it is possible to identify the true θ value
directly from the random-sample outcome.
In general, this is not possible in practice, and hence an estimator that has uniformly (i.e. for
all θ ∈ Ω) the smallest MSE typically does not exist. While there generally does not exist an
estimator that has uniformly the smallest MSE relative to all estimators, it is often possible to
find an MSE-optimal estimator if one restricts the class of estimators under consideration.
Two such restrictions are unbiasedness and linearity, leading to the classes of unbiased and
linear estimators, which we consider in the following.
Definition (Unbiased estimator): An estimator T is said to be an unbiased estimator of q(θ) iff

    E_θT = q(θ)   ∀ θ ∈ Ω.

Otherwise, the estimator is said to be biased.
Unbiasedness means that the mean of the estimator's distribution is equal to the parameter to
be estimated. Hence, an unbiased estimator has the appealing property that its outcomes are
equal to q(θ) on average.
Example 2.5 Let (X_1, ..., X_n) be a random sample from an exponential distribution with pdf

    f(x; θ) = (1/θ) e^{−x/θ} I_{(0,∞)}(x),   with EX = θ.

The estimator T = (1/n) ∑_{i=1}^n X_i is unbiased for estimating θ, since

    ET = (1/n) ∑_{i=1}^n EX_i = θ.

Example 2.6 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². Consider the sample variance

    T = (1/n) ∑_{i=1}^n (X_i − X̄)²,   with X̄ = (1/n) ∑_{i=1}^n X_i,

as an estimator for σ². Its expectation is

    ET = (1/n) E[ ∑_{i=1}^n (X_i − X̄)² ] = (1/n) E[ ∑_{i=1}^n (X_i − μ)² − n(X̄ − μ)² ]
       = (1/n) [ ∑_{i=1}^n E(X_i − μ)² − n E(X̄ − μ)² ]
       = (1/n) [ n σ² − n (σ²/n) ]       (since E(X_i − μ)² = var(X) = σ² and E(X̄ − μ)² = var(X̄) = σ²/n)
       = (1 − 1/n) σ² ≠ σ².

Hence, the sample variance T is a biased estimator for the population variance σ². An unbiased
estimator obtains as

    S² = (1/(1 − 1/n)) T = (1/(n−1)) ∑_{i=1}^n (X_i − X̄)².
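The bias factor (1 − 1/n) can be checked quickly by simulation. The Python sketch below (not part of the original notes) uses an illustrative normal population and sample size.

import numpy as np

# Simulation check of Example 2.6: T = (1/n) sum (X_i - Xbar)^2 has expectation
# (1 - 1/n) sigma^2, while S^2 = (1/(n-1)) sum (X_i - Xbar)^2 is unbiased.
rng = np.random.default_rng(2)
mu, sigma, n, replications = 0.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(replications, n))
T = samples.var(axis=1, ddof=0)    # divides by n
S2 = samples.var(axis=1, ddof=1)   # divides by n - 1

print("E(T)   approx:", T.mean(),  " theory:", (1 - 1 / n) * sigma**2)
print("E(S^2) approx:", S2.mean(), " theory:", sigma**2)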
In addition to this desirable property, we also want an estimator whose distribution is not too
spread out, since a spread-out distribution could generate an estimate far away from q(θ). This
motivates the objective that an estimator has minimum variance among all unbiased
estimators, which is formally defined as follows.
Definition (Minimum variance unbiased estimator (MVUE) (scalar case)): An estimator T is
said to be a minimum-variance unbiased estimator of q(θ) iff

    1. E_θT = q(θ) ∀ θ ∈ Ω,   that is, T is unbiased, and

    2. var_θ(T) ≤ var_θ(T*) ∀ θ ∈ Ω   for any other unbiased estimator T*.
The definition states that an estimator is a MVUE if the estimator is unbiased and if there is no
other unbiased estimator that has a smaller variance for any θ ∈ Ω. Note that the MVUE has
the smallest MSE within the class of unbiased estimators. (Remember that MSE = Variance
+ Bias².)
Unfortunately, without the aid of theorems that facilitate the discovery of MVUEs, finding a
MVUE for q(θ) is, if such an estimator exists at all, typically quite challenging.²

² For an example showing how to find the MVUE without the aid of theorems see Mittelhammer (1996,
Example 7.4).
Hence, one sometimes restricts attention to estimators that are unbiased and that have the
minimum variance among all unbiased estimators that are linear. These estimators, which are
called BLUE, are defined as follows.
Definition (Best linear unbiased estimator (BLUE) (scalar case)): An estimator T is said to be
a BLUE of q(θ) iff

    1. T is a linear function of the random sample X = (X_1, ..., X_n)′, i.e.,
       T = a′X = a_1 X_1 + ... + a_n X_n,

    2. E_θT = q(θ) ∀ θ ∈ Ω,   that is, T is unbiased, and

    3. var_θ(T) ≤ var_θ(T*) ∀ θ ∈ Ω   for any other linear and unbiased estimator T*.

Note that the BLUE has the smallest MSE within the class of linear and unbiased estimators.
Example 2.7 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². The BLUE for μ is obtained as follows.
As a linear estimator the BLUE must have the form

    T = a_0 + a_1 X_1 + ... + a_n X_n.

The expectation of T is

    ET = a_0 + a_1 EX_1 + ... + a_n EX_n = a_0 + μ ∑_{i=1}^n a_i.

Hence, for T to be unbiased (that is, ET = μ for all μ), we require that

    ∑_{i=1}^n a_i = 1   and   a_0 = 0.

The variance of T is

    Var(T) = σ² ∑_{i=1}^n a_i² = σ² [ ∑_{i=1}^{n−1} a_i² + (1 − ∑_{i=1}^{n−1} a_i)² ],

since unbiasedness requires a_n = 1 − ∑_{i=1}^{n−1} a_i. The first-order conditions for the a_i's
minimizing this variance are

    ∂Var(T)/∂a_i = σ² (2a_i − 2a_n) = 0,   i = 1, ..., n−1,   ⟹   a_i = a_n.

Thus, all a_i's need to be equal. This together with the restriction ∑_{i=1}^n a_i = 1 implies that

    a_1 = a_2 = ... = a_n = 1/n.

Thus, the BLUE for μ is

    T = 0 + (1/n) X_1 + ... + (1/n) X_n = (1/n) ∑_{i=1}^n X_i.

A prominent BLUE arises in the context of least-squares estimation of the parameters of a linear
regression model, which we will discuss in the next chapter.
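As a numerical cross-check (not part of the original notes), the Python sketch below compares the variance of the equal-weight estimator with that of randomly drawn weight vectors rescaled to satisfy the unbiasedness constraint ∑ a_i = 1; the sample size and σ² are illustrative assumptions.

import numpy as np

# Among linear unbiased estimators T = sum a_i X_i with sum a_i = 1, the
# variance sigma^2 * sum a_i^2 is smallest for the equal weights a_i = 1/n.
rng = np.random.default_rng(3)
n, sigma2 = 10, 4.0

equal_weights = np.full(n, 1.0 / n)
best = sigma2 * np.sum(equal_weights ** 2)   # = sigma^2 / n

# random alternative weight vectors, rescaled so that they sum to one (unbiasedness)
for _ in range(5):
    a = rng.random(n)
    a /= a.sum()
    print("var with random weights:", sigma2 * np.sum(a ** 2), ">=", best)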

2.2.2. Asymptotic Properties


When the finite sample properties of estimators are intractable, we may have to rely on asymptotic properties to rank the estimators.
The finite sample properties of an estimator T can be intractable due to the fact that we do
not know its exact sampling distribution,
either because the estimator T is a complicated function of the random sample t(X),
or because our stochastic model does not assume a population distribution for X.
Asymptotic properties are essentially equivalent in concept to the finite sample properties,
except that the asymptotic properties are based upon the asymptotic distributions of estimators
rather than the estimators' exact finite-sample distributions.
The first asymptotic property we consider is the consistency, which is defined as follows.
Definition (Consistent estimator): An estimator T_n is said to be a consistent estimator of q(θ)
iff plim T_n = q(θ) ∀ θ ∈ Ω.
A consistent estimator converges in probability to what is being estimated. Thus, for large
enough n, there is a high probability that the outcome of T_n will be in the interval
[q(θ) − ε, q(θ) + ε] for arbitrarily small ε > 0.
Equivalently, the sampling density of T_n concentrates on the true value q(θ) as n → ∞.
Recall that convergence in mean square implies convergence in probability. Hence,

    T_n →_m q(θ)   ⟹   T_n →_p q(θ).

Thus, mean-squared convergence to q(θ) is a sufficient condition for an estimator T_n to be
consistent for estimating q(θ). Specifically, if

    lim_{n→∞} ET_n = q(θ)   and   lim_{n→∞} var(T_n) = 0,

then T_n is a consistent estimator for q(θ) by mean-squared convergence.


Example 2.8 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². Then the sample mean T_n = X̄_n = (1/n) ∑_{i=1}^n X_i is a consistent estimator for μ,
since

    E X̄_n = μ   and   var(X̄_n) = σ²/n → 0   as n → ∞.

Now, consider as an alternative estimator T_n* = (1/(n−k)) ∑_{i=1}^n X_i for a fixed value of k.
Even though T_n* is a biased estimator for μ, it is consistent, since

    ET_n* = (n/(n−k)) μ → μ   and   var(T_n*) = n σ²/(n−k)² → 0   as n → ∞.

This example shows that we typically have many consistent estimators for an estimation problem.
The following example shows that an estimator can be consistent for q(θ) without converging
in mean square to q(θ).
Example 2.9 Let θ be a population parameter. Consider an estimator T_n for θ with the two
possible outcomes {θ, n} and pdf

    f(t_n; θ) = (1 − 1/n) I_{{θ}}(t_n) + (1/n) I_{{n}}(t_n).

This estimator is consistent, with plim T_n = θ, since lim_{n→∞} f(t_n; θ) = I_{{θ}}(t_n). However, since

    ET_n = (1 − 1/n) θ + n (1/n) = θ + 1 − θ/n → θ + 1 ≠ θ   as n → ∞,

T_n does not converge in mean square to θ.
Note that the failure of the expectation to converge to θ is due to the fact that the pdf of T_n,
although collapsing to the point θ (which ensures consistency), is not collapsing at a fast enough
rate for the expectation to converge to θ.
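The behavior described in Example 2.9 is easy to reproduce by simulation. The following Python sketch (not part of the original notes) uses an illustrative value θ = 5 and shows that the exceedance probability vanishes while the mean does not approach θ and the MSE grows.

import numpy as np

# T_n equals theta with probability 1 - 1/n and equals n with probability 1/n:
# consistent, but E(T_n) -> theta + 1 and the MSE (n - theta)^2 / n diverges.
rng = np.random.default_rng(4)
theta, replications = 5.0, 200_000

for n in (10, 100, 1000, 10000):
    t = np.where(rng.random(replications) < 1.0 / n, n, theta)
    miss = np.mean(np.abs(t - theta) > 0.1)   # estimate of P(|T_n - theta| > 0.1)
    print(f"n={n:6d}  P(miss)={miss:.5f}  E(T_n)={t.mean():8.3f}  MSE={np.mean((t - theta)**2):12.1f}")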

Let us consider more closely how convergence might take place.


Definition (Consistent asymptotically normal (CAN) estimator): An estimator T_n is said to be
a CAN estimator of q(θ) iff

    √n (T_n − q(θ)) →_d N(0, Σ),

where Σ is a p.d. matrix.

The asymptotic distribution of a CAN estimator T_n is obtained as

    Z_n = √n (T_n − q(θ)) →_d N(0, Σ)   ⟹   T_n ∼_a N(q(θ), n⁻¹ Σ).

The consistency of a CAN estimator T_n follows immediately, since by Slutsky's theorem

    n^{−1/2} · √n (T_n − q(θ)) = T_n − q(θ) →_d 0 · Z = 0,

where n^{−1/2} →_p 0 and √n (T_n − q(θ)) →_d Z ∼ N(0, Σ) (Slutsky), which implies
T_n − q(θ) →_p 0, or T_n →_p q(θ).

Example 2.10 Let (X_1, ..., X_n) be a random sample from a population with EX = μ and
var(X) = σ². Then the sample mean T_n = X̄_n = (1/n) ∑_{i=1}^n X_i is a CAN estimator for μ, since by
the Lindeberg-Lévy CLT

    √n (X̄_n − μ) →_d N(0, σ²)   and   X̄_n ∼_a N(μ, n⁻¹ σ²).

By defining the class of CAN estimators in terms of the limiting distribution of the transformation
√n (T_n − q(θ)) (which utilizes as a sequence of centering the quantity to be estimated,
q(θ), and as a sequence of scaling √n), we remove a potential nonuniqueness problem of
asymptotic distributions and their properties.
To clarify this nonuniqueness problem, let T_n denote an estimator of the scalar q(θ) and
suppose that

    √n (T_n − q(θ)) →_d N(0, σ²)   and thus   T_n ∼_a N(q(θ), n⁻¹ σ²).
However, by Slutsky's theorem it follows that for any constant k < n,

    √(n/(n−k)) · √n (T_n − q(θ)) = (n/√(n−k)) (T_n − q(θ)) →_d 1 · Z ∼ N(0, σ²),

where √(n/(n−k)) →_p 1 and √n (T_n − q(θ)) →_d Z ∼ N(0, σ²) (Slutsky), and thus

    T_n ∼_a N(q(θ), ((n−k)/n²) σ²).

Hence, we have for T_n two alternative asymptotic distributions, which would lead to different
asymptotic properties. The problem is that the centering and scaling required to achieve a limiting
distribution are not unique. By restricting the use of asymptotic properties to the class of CAN
estimators, which utilize the same sequences of centering and scaling, we avoid nonuniqueness
of asymptotic properties.
Asymptotic versions of the MSE, bias and variance can be defined w.r.t. the unique asymptotic
distribution of CAN estimators. The asymptotic MSE for a CAN estimator T_n for the
scalar q(θ) with T_n ∼_a N(q(θ), n⁻¹ σ²) is

    AMSE_θ(T_n) = E_A (T_n − q(θ))²               (E_A: expectation w.r.t. the asymptotic distribution)
               = Avar(T_n) + [E_A T_n − q(θ)]²    (Avar: variance of the asymptotic distribution; the asymptotic bias is 0)
               = Avar(T_n) = n⁻¹ σ².

Note that a CAN estimator for q(θ) is necessarily asymptotically unbiased.


Based upon the AMSE, we can now define asymptotic relative efficiency uniquely within
the context of CAN estimators.

Definition (Asymptotic relative efficiency (scalar case)): Let T_n and T_n* be CAN estimators
of q(θ) such that

    n^{1/2} (T_n − q(θ)) →_d N(0, σ²_T)   and   n^{1/2} (T_n* − q(θ)) →_d N(0, σ²_{T*}).

The asymptotic relative efficiency of T_n with respect to T_n* is

    ARE_θ(T_n, T_n*) = AMSE_θ(T_n*) / AMSE_θ(T_n) = σ²_{T*} / σ²_T.

T_n is asymptotically relatively more efficient than T_n* if

    ARE_θ(T_n, T_n*) ≥ 1 ∀ θ ∈ Ω   and   ARE_θ(T_n, T_n*) > 1 for at least one θ ∈ Ω.
If the estimator T_n is asymptotically relatively more efficient than T_n*, then T_n* is called
asymptotically inadmissible; otherwise, T_n* is asymptotically admissible.
The definition of asymptotic relative efficiency given above refers to CAN estimators for a scalar
q(θ). For a multivariate generalization of this definition to CAN estimators for vectors q(θ)
see Mittelhammer (1996, Def. 7.16).
The definition of asymptotic relative efficiency suggests defining asymptotic efficiency in
terms of a choice of estimator in the CAN class that has uniformly the smallest variance.
However, such an estimator does not exist without further restrictions on the CAN class. In
particular, one can show that for any CAN estimator, there is an alternative estimator that
has a smaller variance for at least one θ ∈ Ω. Hence, we cannot define an achievable lower
bound to the asymptotic variance of CAN estimators.
On the other hand, one can show that under mild regularity conditions there does exist a lower
bound for the asymptotic variance of a CAN estimator that holds for all θ ∈ Ω except on
a finite set of θ values (this is the so-called Cramér-Rao lower bound). This result, shown
by LeCam (1953)³, allows us to state a general definition of asymptotic efficiency for CAN
estimators.
Definition (Asymptotic efficiency (scalar case)): If T_n is a CAN estimator of q(θ) having the
smallest asymptotic variance among all CAN estimators ∀ θ ∈ Ω, except on a finite set of θ
values, T_n is said to be asymptotically efficient.

2.3. Sufficient Statistics


If a small collection of sufficient statistics can be found for a given stochastic model, then for
defining an estimator of q(θ), it is sufficient to consider only functions of this smaller set of
sufficient statistics as opposed to functions of all n outcomes of the original random sample.
Loosely speaking, the sufficiency of a statistic or a collection of statistics means that they
represent/summarize all of the information in a random sample that is useful for estimating
any unknown q(θ). Thus, in place of the original random-sample outcome, it is sufficient to
have the outcomes of those statistics to estimate any q(θ).
Hence, sufficient statistics condense all the relevant information in a random sample on
q(θ) and allow a data-reduction step to occur in point estimation problems.
³ Lucien Marie Le Cam (1953), On some asymptotic properties of maximum likelihood estimates and related
Bayes estimates. University of California Publications in Statistics.
As we shall see later, sufficient statistics facilitate the construction of estimators with the MVUE
property or small MSEs.
Definition (Sufficient statistics): Let (X_1, ..., X_n) ∼ f(x_1, ..., x_n; θ) be a random sample, and
let S_1 = s_1(X_1, ..., X_n), ..., S_r = s_r(X_1, ..., X_n) be r statistics. The r statistics are said to be
sufficient statistics for f(x; θ) iff

    f(x_1, ..., x_n; θ | s_1, ..., s_r) = h(x_1, ..., x_n),

i.e., the conditional density of x, given s = [s_1, ..., s_r]′, does not depend on the parameter θ.
In order to interpret this definition, note first that the conditional pdf f(x_1, ..., x_n; θ | s_1, ..., s_r)
represents the probability distribution of the various ways in which the sample outcomes x occur
so as to generate exactly the value s = (s_1, ..., s_r)′. According to the definition, this probability
distribution has nothing to do with θ if S = (S_1, ..., S_r) is sufficient.
Thus analyzing the various ways in which a given value s can occur cannot provide any
additional information about θ, since the behavior of the outcomes of x, conditional on the fact
that s(x) = s, is totally unrelated to θ.
Example 2.11 Let X = (X_1, X_2, X_3)′ be a random sample from a Bernoulli population with
P(x = 1) = p. Consider the two statistics

    S = s(X) = X_1 + X_2 + X_3   and   T = t(X) = X_1 X_2 + X_3.

We now want to show that S is sufficient for p and T is not. The conditional pdfs f(x_1, x_2, x_3; p | s)
and f(x_1, x_2, x_3; p | t) are represented in the following table.

    Values in the range R(X)   Values of S   Values of T   f(x; p | s(x) = s)   f(x; p | t(x) = t)
    (0, 0, 0)                  0             0             1                    (1 − p)/(1 + p)
    (0, 0, 1)                  1             1             1/3                  (1 − p)/(1 + 2p)
    (0, 1, 0)                  1             0             1/3                  p/(1 + p)
    (1, 0, 0)                  1             0             1/3                  p/(1 + p)
    (0, 1, 1)                  2             1             1/3                  p/(1 + 2p)
    (1, 0, 1)                  2             1             1/3                  p/(1 + 2p)
    (1, 1, 0)                  2             1             1/3                  p/(1 + 2p)
    (1, 1, 1)                  3             2             1                    1
The conditional probabilities given in the last two columns are obtained as follows. For instance,
the probability P(x_1 = 0, x_2 = 1, x_3 = 0 | s = 1) is obtained as

    P(x_1 = 0, x_2 = 1, x_3 = 0 | s = 1) = P(x_1 = 0, x_2 = 1, x_3 = 0, s = 1) / P(s = 1)        (Def.)
                                        = (1 − p) p (1 − p) / [ C(3,1) p¹ (1 − p)^{3−1} ]        (since S ∼ binomial(n = 3, p))
                                        = 1 / C(3,1) = 1/3.

The probability P(x_1 = 0, x_2 = 1, x_3 = 0 | t = 0) obtains as

    P(x_1 = 0, x_2 = 1, x_3 = 0 | t = 0) = P(x_1 = 0, x_2 = 1, x_3 = 0, t = 0) / P(t = 0)        (Def.),

where (see the table)

    P(t = 0) = P(x_1 = 0, x_2 = 1, x_3 = 0) + P(x_1 = 0, x_2 = 0, x_3 = 0) + P(x_1 = 1, x_2 = 0, x_3 = 0).

Thus

    P(x_1 = 0, x_2 = 1, x_3 = 0 | t = 0) = p(1 − p)² / [ 2p(1 − p)² + (1 − p)³ ] = p/(1 + p).

Since the conditional pdf f(x_1, x_2, x_3; p | s) has nothing to do with the value of p, the
statistic S is sufficient. However, the conditional pdf f(x_1, x_2, x_3; p | t) depends on p; so T is
not sufficient.
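The table of Example 2.11 can be reproduced mechanically. The Python sketch below (not part of the original notes) enumerates the eight outcomes and evaluates the conditional pmfs at two illustrative values of p; the conditionals given S agree for both values, those given T do not.

from itertools import product

# Conditional pmf of x given S = x1+x2+x3 and given T = x1*x2+x3, evaluated at
# two p values: p-free conditionals indicate sufficiency (S), p-dependence does not (T).
outcomes = list(product((0, 1), repeat=3))

def joint(x, p):
    s = sum(x)
    return p**s * (1 - p)**(3 - s)

def conditional(stat, p):
    # conditional probability of each outcome given the observed value of the statistic
    return {x: joint(x, p) / sum(joint(y, p) for y in outcomes if stat(y) == stat(x))
            for x in outcomes}

S = lambda x: x[0] + x[1] + x[2]
T = lambda x: x[0] * x[1] + x[2]

for name, stat in (("S", S), ("T", T)):
    c1, c2 = conditional(stat, 0.3), conditional(stat, 0.7)
    same = all(abs(c1[x] - c2[x]) < 1e-12 for x in outcomes)
    print(f"conditional pmf given {name} identical for p=0.3 and p=0.7: {same}")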

In any problem of estimating q(θ), once the outcome of a set of sufficient statistics is observed,
the random sample outcome x can effectively be ignored for the remainder of the estimation
problem, since s(x) captures all the relevant information that the sample has to offer regarding
q(θ).
On the other hand, this implies that any estimator which is not based upon a sufficient statistic
must be inefficient, since it does not capture all the relevant information that the sample has to
offer.
A practical problem in the use of sufficient statistics is their identification. A criterion which can
be helpful for identification of sufficient statistics is that given by the Neyman factorization
theorem.

Theorem 2.1 (Neyman's Factorization Theorem) Let f(x; θ) be the pdf of the random
sample (X_1, ..., X_n). The statistics S_1, ..., S_r are sufficient statistics for f(x; θ) iff f(x; θ)
can be factored as

    f(x; θ) = g(s_1(x), ..., s_r(x); θ) h(x),

where g is a function of only s_1(x), ..., s_r(x) and θ, and h(x) does not depend on θ.
Proof
(Discrete case) Sufficiency of the factorization: Suppose that the factorization criterion is met. Let
B(a_1, ..., a_r) denote the set of sample outcomes x generating s_1 = a_1, ..., s_r = a_r, i.e.,

    B(a) = {(x_1, ..., x_n) : s_1(x) = a_1, ..., s_r(x) = a_r; x ∈ R(X)},

where a = (a_1, ..., a_r)′. Now note that

    P(s_1 = a_1, ..., s_r = a_r) = ∑_{x∈B(a)} f(x; θ),

which becomes under the factorization

    = g(s_1(x) = a_1, ..., s_r(x) = a_r; θ) ∑_{x∈B(a)} h(x),

where the g-term involves only the fixed values a_1, ..., a_r. Furthermore, we have

    f(x; θ | s_1 = a_1, ..., s_r = a_r) = P(x, s_1 = a_1, ..., s_r = a_r; θ) / P(s_1 = a_1, ..., s_r = a_r)   (by def. of conditional probability)
                                       = f(x; θ) / P(s_1 = a_1, ..., s_r = a_r)   ∀ x ∈ B(a),

which becomes under the factorization

    = [ g(s_1(x) = a_1, ..., s_r(x) = a_r; θ) h(x) ] / [ g(s_1(x) = a_1, ..., s_r(x) = a_r; θ) ∑_{x∈B(a)} h(x) ]
    = h*(x),

which does not depend on θ. Hence, if the factorization criterion is met, S_1, ..., S_r are sufficient
statistics.
Necessity of the factorization: Suppose S_1, ..., S_r are sufficient statistics. Note that by the definition
of the (discrete) conditional pdf,

    P(x, s_1 = a_1, ..., s_r = a_r; θ) = f(x; θ | s_1 = a_1, ..., s_r = a_r) P(s_1 = a_1, ..., s_r = a_r).

Now, since P(x, s_1 = a_1, ..., s_r = a_r; θ) = f(x; θ) for x ∈ B(a), we get

    f(x; θ) = f(x; θ | s_1 = a_1, ..., s_r = a_r) P(s_1 = a_1, ..., s_r = a_r),

where the first factor is independent of θ, since the S_i's are supposed to be sufficient statistics
(denote this function by h(·)), and the second factor depends on θ via the S_i's (denote this
function by g(·)).
Hence, if S_1, ..., S_r are sufficient statistics, we can factor f(x; θ) into the product of a function g
of the s_i's and θ and a function h which does not depend on θ.

Example 2.12 Let (X_1, ..., X_n) be a random sample from a Bernoulli population with pdf

    f(x; p) = p^x (1 − p)^{1−x} I_{{0,1}}(x),   p ∈ [0, 1].

Note that the joint pdf of the random sample is given by

    f(x_1, ..., x_n; p) = ∏_{i=1}^n p^{x_i} (1 − p)^{1−x_i} I_{{0,1}}(x_i)
                       = p^{∑_{i=1}^n x_i} (1 − p)^{n − ∑_{i=1}^n x_i} · ∏_{i=1}^n I_{{0,1}}(x_i),

where, setting S = ∑_{i=1}^n X_i, the first term corresponds to g(s(x); p) and the second term is
independent of p and corresponds to h(x).
Thus, from Neyman's factorization theorem, we can conclude that S = ∑_{i=1}^n X_i is a sufficient
statistic for f(x; p).

It follows that the value of the sum of the sample outcomes contains all the sample information
relevant for estimating q(p). Suppose, e.g., that n = 3 and that we observe s = 2. Then it is
irrelevant which of the following outcomes has generated s = 2:

    x = (1, 1, 0),   x = (1, 0, 1),   x = (0, 1, 1).

Example 2.13 Let (X_1, ..., X_n) be a random sample from a N(μ, σ²) population with θ =
(μ, σ²)′. The joint pdf of the random sample is given by

    f(x_1, ..., x_n; θ) = ∏_{i=1}^n (2πσ²)^{−1/2} e^{−(x_i − μ)²/(2σ²)}
                       = (2πσ²)^{−n/2} e^{−(1/(2σ²)) ∑_{i=1}^n (x_i − μ)²}
                       = (σ²)^{−n/2} e^{−(1/(2σ²)) ( ∑_{i=1}^n x_i² − 2μ ∑_{i=1}^n x_i + nμ² )} · (2π)^{−n/2},

where, setting S_1 = ∑_{i=1}^n X_i and S_2 = ∑_{i=1}^n X_i², the first term corresponds to
g(s_1(x), s_2(x); θ) and the factor (2π)^{−n/2} is independent of θ and corresponds to h(x).
Thus, from Neyman's factorization theorem, we can conclude that S_1 = ∑_{i=1}^n X_i and
S_2 = ∑_{i=1}^n X_i² are sufficient statistics for estimating θ = (μ, σ²)′.

Note that the sample mean X̄_n = (1/n) ∑_{i=1}^n X_i and variance S² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄_n)²
are invertible functions of S_1 and S_2. Hence, they provide the same information about θ as
S_1 and S_2. As we shall see, this implies that X̄_n and S² are also sufficient statistics.

For further examples see Mittelhammer (1996) and Mood, Graybill and Boes (1974).

The use of Neyman's factorization criterion for identifying sufficient statistics requires that we
are able to define the appropriate g(s(x); θ) and h(x) functions that achieve the required
factorization. However, the appropriate definition of such functions is not always readily apparent.
In the following section we will discuss an approach that might be useful for providing direction
to the search for sufficient statistics.

2.4. Minimal Sufficient Statistics


The objective of the use of sufficient statistics is to condense the data without losing any
information about the parameter θ by reducing the number of functions of the random sample
needed to represent all the sample information.
For example, in sampling from a N(μ, σ²)-distribution, we have noted that the two random-sample
functions

    ∑_{i=1}^n X_i   and   ∑_{i=1}^n X_i²

are sufficient statistics condensing the sample information about θ = (μ, σ²).
Note that the random sample itself, consisting of the n random-sample functions

    X_1, X_2, ..., X_n,

is also a sufficient statistic, since the conditional distribution of the sample given the sample
does not depend on θ (it is 1). However, it is clear that (∑_{i=1}^n X_i², ∑_{i=1}^n X_i) ∈ ℝ_+¹ × ℝ¹
condenses the data more than (X_1, X_2, ..., X_n) ∈ ℝⁿ.
A natural question to consider is what is the smallest number of random-sample functions
that represent all the sample information about the parameter to be estimated. This relates to
the notion of a minimal sufficient statistic, which is, loosely speaking, the sufficient statistic
that is defined using the fewest number of functionally independent coordinate functions of the
random sample.
Definition (Minimal sufficient statistic): A sufficient statistic S = s(x) for f(x; θ) is said to be
a minimal sufficient statistic if, for every other sufficient statistic T = t(x), there exists a
function h_T(·) such that s(x) = h_T(t(x)) ∀ x ∈ R_Ω(x).

The notation for the sample space R_Ω(x) indicates that the range of x is taken over all θ's
in the parameter space Ω. If the support of the pdf does not change with θ (e.g., Normal,
Gamma, etc.) then R_Ω(x) = R(x).
This definition implies that the minimal sufficient statistic S utilizes the minimal set of
points for representing the sample information. This follows from the fact that, by definition,
a function can never have more elements in its range than in its domain. (Recall that for each
argument we have only one function value, but one function value might be associated with
more than one argument.) Thus, if S = h_T(T) for any other sufficient statistic T, then the
number of elements in R(S) does not exceed the number of elements in R(T), for any sufficient
statistic T. Hence minimal sufficient statistics provide the most parsimonious representation
of the sample information about the unknown parameters.
The definition is of little use in finding minimal sufficient statistics. The Lehmann-Scheffé
Minimal Sufficiency Theorem provides an approach for finding minimal sufficient statistics. Rather than
presenting this theorem, we consider a corollary of the theorem, which is helpful for identifying
minimal sufficient statistics in cases where the sample range is independent of the distribution
parameter⁴.

⁴ For a discussion of the Lehmann-Scheffé Minimal Sufficiency Theorem, which includes cases where the
sample range is dependent on the distribution parameter, see Mittelhammer (1996, p. 395-396).
Corollary 2.1 (Minimal sufficiency when R is independent of θ) Let x ∼ f(x; θ), and
suppose that R(x) does not depend on θ. If the statistic S = s(x) is such that

    f(x; θ)/f(y; θ)   does not depend on θ   iff   (x, y) satisfies s(x) = s(y),

then S = s(x) is a minimal sufficient statistic.


Proof
See Mittelhammer (1996), p. 396.


This Lehmann-Scheffé result for identifying a minimal sufficient statistic requires that we are
able to find an appropriate function S = s(x). However, in many cases this result allows us to
transform the problem into one where a choice of S is readily apparent.
Example 2.14 Let (X_1, ..., X_n) be a random sample from a nondegenerate Bernoulli population
with joint pdf

    f(x; p) = p^{∑_{i=1}^n x_i} (1 − p)^{n − ∑_{i=1}^n x_i} ∏_{i=1}^n I_{{0,1}}(x_i)   for p ∈ (0, 1).

The Lehmann-Scheffé procedure for identifying a minimal sufficient statistic for p requires the
examination of the ratio

    f(x; p)/f(y; p) = [ p^{∑_{i=1}^n x_i} (1 − p)^{n − ∑_{i=1}^n x_i} ∏_{i=1}^n I_{{0,1}}(x_i) ]
                      / [ p^{∑_{i=1}^n y_i} (1 − p)^{n − ∑_{i=1}^n y_i} ∏_{i=1}^n I_{{0,1}}(y_i) ],   x, y ∈ ×_{i=1}^n {0, 1}.

Obviously, we have that

    f(x; p)/f(y; p)   does not depend on p   iff   the constraint ∑_{i=1}^n x_i = ∑_{i=1}^n y_i is imposed.

Since the sample range R(x) is independent of p, it follows by Corollary 2.1 that ∑_{i=1}^n X_i is a
minimal sufficient statistic for p.

The exponential class of distributions represents a collection of parametric families of distributions for which minimal sufficient statistics are straightforwardly defined.
Theorem 2.2 (Exponential class and sufficient statistics) Let f(x; θ) be a member of the
exponential class of density functions

    f(x; θ) = exp[ ∑_{i=1}^k c_i(θ) g_i(x) + d(θ) + z(x) ].

Then s(x) = [g_1(x), ..., g_k(x)]′ is a k-variate sufficient statistic, and if c_1(θ), ..., c_k(θ) are linearly
independent, the sufficient statistic is a minimal sufficient statistic.

Proof
That s(x) is a sufficient statistic follows directly from the Neyman factorization theorem by defining

    g(g_1(x), ..., g_k(x); θ) = exp[ ∑_{i=1}^k c_i(θ) g_i(x) + d(θ) ]   and   h(x) = exp[z(x)]

in the theorem.
That s(x) is a minimal sufficient statistic follows from the Lehmann-Scheffé approach of Corollary
2.1. In fact, note that

    f(x; θ)/f(y; θ) = exp{ ∑_{i=1}^k c_i(θ) [g_i(x) − g_i(y)] + z(x) − z(y) }.

Assuming that the c_i(θ)'s are linearly independent, this ratio does not depend on θ iff
x and y satisfy the constraints g_i(x) = g_i(y) for i = 1, ..., k.



Example 2.15 Let (X_1, ..., X_n) be a random sample from a Gamma population with a joint
pdf which belongs to the exponential class:

    f(x; α, β) = ∏_{i=1}^n [1/(Γ(α) β^α)] x_i^{α−1} e^{−x_i/β}
              = exp[ (α − 1) ∑_{i=1}^n ln x_i − (1/β) ∑_{i=1}^n x_i − n ln(Γ(α) β^α) ],

with c_1(θ) = α − 1, g_1(x) = ∑_{i=1}^n ln x_i, c_2(θ) = −1/β and g_2(x) = ∑_{i=1}^n x_i.
Thus, by Theorem 2.2 regarding the exponential class and sufficient statistics, it follows that
[g_1(x), g_2(x)]′ = [∑_{i=1}^n ln x_i, ∑_{i=1}^n x_i]′ is a bivariate minimal sufficient statistic for (α, β).
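The role of the sufficient statistics can be verified numerically. The Python sketch below (not part of the original notes) evaluates the Gamma log joint density of an illustrative simulated sample both directly and through (g_1, g_2) = (∑ ln x_i, ∑ x_i) only, using the shape/scale parameterization assumed in this example.

import math
import numpy as np

# The Gamma log joint density depends on the data only through
# g1 = sum(log x_i) and g2 = sum(x_i), the minimal sufficient statistic.
rng = np.random.default_rng(5)
alpha, beta, n = 2.5, 1.7, 30
x = rng.gamma(shape=alpha, scale=beta, size=n)

# direct evaluation of the log joint pdf
loglik_direct = np.sum((alpha - 1) * np.log(x) - x / beta
                       - math.lgamma(alpha) - alpha * math.log(beta))

# evaluation through the sufficient statistics only
g1, g2 = np.sum(np.log(x)), np.sum(x)
loglik_suff = (alpha - 1) * g1 - g2 / beta - n * (math.lgamma(alpha) + alpha * math.log(beta))

print(loglik_direct, loglik_suff, "equal:", np.isclose(loglik_direct, loglik_suff))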

Sufficient statistics are not unique. This means that any one-to-one (i.e., invertible) function
of a (minimal) sufficient statistic S is also a (minimal) sufficient statistic. This fact follows
from the observation that a one-to-one transformation of a (minimal) sufficient statistic S
provides the same sample information about the unknown parameter θ as that provided by S.
The following theorem formalizes this observation.

Theorem 2.3 (Sufficiency of invertible functions of sufficient statistics) Let S = s(x)
be an r-dimensional sufficient statistic for f(x; θ). If τ[s(x)] is an r-dimensional invertible
function of s(x), then

    1. τ[s(x)] is an r-dimensional sufficient statistic for f(x; θ);

    2. if s(x) is a minimal sufficient statistic, then τ[s(x)] is a minimal sufficient statistic.

Proof
1. Since τ is assumed to be invertible, it follows that s(x) = τ^{−1}{ τ[s(x)] }.
Furthermore, since s(x) is sufficient, it follows by Neyman's factorization theorem that

    f(x; θ) = g( s(x) ; θ) h(x)
            = g( τ^{−1}{ τ[s(x)] } ; θ) h(x)
            = g*( τ[s(x)] ; θ) h(x),

where g*(·; θ) denotes the composition g(τ^{−1}(·); θ). Thus, by Neyman's factorization theorem,
τ[s(x)] is a sufficient statistic for θ.

2. See Mittelhammer (1996), p. 405.



Example 2.16 Let (X_1, ..., X_n) be a random sample from a N(μ, σ²) population with θ =
(μ, σ²)′.
Recall that S = (∑_{i=1}^n X_i, ∑_{i=1}^n X_i²) is a bivariate sufficient statistic for estimating θ.
Furthermore, note that by Corollary 2.1, S is also a minimal sufficient statistic.
Consider the sample mean X̄_n = (1/n) ∑_{i=1}^n X_i and variance S² = (1/(n−1)) ∑_{i=1}^n (X_i − X̄_n)²,
which define an invertible function of S:

    (X̄_n, S²) = τ( ∑_{i=1}^n X_i, ∑_{i=1}^n X_i² ).

By Theorem 2.3 this implies that (X̄_n, S²) is also a minimal sufficient statistic.

In the last two sections, we have argued that
In the last two sections, we have argued that
sufficient statistics fully represent the sample information about the unknown parameter;
minimal sufficient statistics utilize the most parsimonious form for the representation of
the sample information.
This immediately suggests that (minimal) sufficient statistics play a crucial role in the process
of constructing optimal point estimators. This issue will be discussed in the next section.

2.5. Minimum Variance Unbiased Estimation


Since estimators with minimum MSE typically do not exist, a reasonable procedure is to restrict
the class of estimators under consideration and to look for estimators with minimum MSE within
the restricted class.
One way of restricting the class of estimators is to consider only unbiased estimators and then
among them search for the MSE-optimal estimator.
Since the MSE obtains as MSE = Variance + Bias2 , seeking an MSE-optimal estimator in the
class of unbiased estimators is tantamount to seeking an unbiased estimator with minimum
variance (i.e. the MVUE).
In this section we discuss a number of results that aid the search for the MVUE. However, it
should be noted at the outset that MVUEs do not always exist; and when they exist, MVUEs
may be difficult to find.
In the first subsection we will derive a lower bound for the variance of unbiased estimators,
the Cramér-Rao Lower Bound (CRLB), and show how it can be useful in finding MVUEs.
In the second subsection we introduce the concept of complete sufficient statistics and show
how it can sometimes be used to identify MVUEs.

2.5.1. Cramér-Rao Lower Bound (CRLB)


The CRLB defines a lower bound for the variance of an unbiased estimator of q(θ). This implies
that if an estimator can be found whose variance attains the lower bound, then this estimator
is the MVUE for q(θ)⁵.
The applicability of the CRLB relies on the following regularity conditions on the joint pdf
f(x; θ) of the random sample under investigation.

⁵ In the following discussion we will focus on the case where the parameter θ as well as q(θ) are scalars and
where the sampling distribution is continuous. For an extension to the multivariate and/or discrete case
see Mittelhammer (1996), p. 408-418.

Definition (CRLB regularity conditions (scalar case)):

    1. The parameter space Ω for the parameter θ indexing the pdf f(x; θ) is an open interval
       with Ω ⊂ ℝ¹.

    2. The support of f(x; θ), say A, is the same ∀ θ ∈ Ω.

    3. ∂ ln f(x; θ)/∂θ exists and is finite ∀ x ∈ A and ∀ θ ∈ Ω.

    4. We can differentiate under the integral as follows:

           ∂/∂θ ∫ ... ∫ f(x; θ) dx_1 ... dx_n = ∫ ... ∫ ∂f(x; θ)/∂θ dx_1 ... dx_n.

    5. For all unbiased estimators t(x) for q(θ) with finite variance, we can differentiate under
       the integral as follows:

           ∂/∂θ ∫ ... ∫ t(x) f(x; θ) dx_1 ... dx_n = ∫ ... ∫ t(x) ∂f(x; θ)/∂θ dx_1 ... dx_n.

    6. 0 < E_θ[ ( ∂ ln f(x; θ)/∂θ )² ] < ∞.

In practice, the CRLB regularity conditions (1), (2), (3), (4), and (6) are generally not difficult
to verify. However, condition (5) can be complicated, since it must hold true for all unbiased
estimators of q(θ).
There is a wide class of distributions, namely the exponential class, that satisfies the CRLB
regularity conditions (see Mittelhammer, 1996, Theorem 7.15).

Theorem 2.4 (Cramér-Rao Lower Bound (scalar case)) Let X_1, ..., X_n be a random sample
from a population with pdf f(x; θ) and let T = t(x) be an unbiased estimator for q(θ). Then,
under the CRLB regularity conditions for the joint pdf f(x; θ) given above,

    var_θ(T) ≥ [∂q(θ)/∂θ]² / ( n E_θ[ ( ∂ ln f(X; θ)/∂θ )² ] ).
Equality prevails iff there exists a function, say K(θ, n), such that

    ∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = K(θ, n) [t(x) − q(θ)].

Proof
Since T = t(x) is an unbiased estimator for q(θ), we have

    q(θ) = E_θT = ∫ t(x) f(x; θ) dx
    ⟹ ∂q(θ)/∂θ = ∂/∂θ ∫ t(x) f(x; θ) dx = ∫ t(x) ∂f(x; θ)/∂θ dx   (by condition 5).

The r.h.s. of the last equation is extended by a term which is equal to 0,

    ∂q(θ)/∂θ = ∫ t(x) ∂f(x; θ)/∂θ dx − q(θ) ∫ ∂f(x; θ)/∂θ dx,

where ∫ ∂f(x; θ)/∂θ dx = ∂/∂θ ∫ f(x; θ) dx = 0 by condition 4 (since ∫ f(x; θ) dx = 1). Hence

    ∂q(θ)/∂θ = ∫ [t(x) − q(θ)] ∂f(x; θ)/∂θ dx
             = ∫ [t(x) − q(θ)] [∂ ln f(x; θ)/∂θ] f(x; θ) dx      (since ∂ ln f/∂θ = (1/f) ∂f/∂θ)
             = E_θ{ [t(x) − q(θ)] ∂ ln f(x; θ)/∂θ }.

By the Cauchy-Schwarz inequality (saying that [E(XY)]² ≤ EX² EY²) applied to the r.h.s.,

    [∂q(θ)/∂θ]² ≤ E_θ[t(x) − q(θ)]² · E_θ[ ( ∂ ln f(x; θ)/∂θ )² ],

where E_θ[t(x) − q(θ)]² = var_θ[t(x)] and the second factor lies in (0, ∞) by condition 6.
Solving the last inequality for var_θ[t(x)] yields

    var_θ[t(x)] ≥ [∂q(θ)/∂θ]² / E_θ[ ( ∂ ln f(x; θ)/∂θ )² ].

By the independence of the X_i's in x, it follows for the denominator of the r.h.s. that

    E_θ[ ( ∂ ln f(x; θ)/∂θ )² ] = E_θ[ ( ∑_{i=1}^n ∂ ln f(X_i; θ)/∂θ )² ]
                                = ∑_{i=1}^n ∑_{j=1}^n E_θ[ ( ∂ ln f(X_i; θ)/∂θ )( ∂ ln f(X_j; θ)/∂θ ) ]
                                = n E_θ[ ( ∂ ln f(X_i; θ)/∂θ )² ],

where the last equality obtains by independence of the X_i's, such that

    E[ ∂ ln f(X_i; θ)/∂θ · ∂ ln f(X_j; θ)/∂θ ] = E[ ∂ ln f(X_i; θ)/∂θ ] E[ ∂ ln f(X_j; θ)/∂θ ]   ∀ i ≠ j,

and by noting that

    E[ ∂ ln f(X; θ)/∂θ ] = ∫ [∂ ln f(x; θ)/∂θ] f(x; θ) dx = ∫ ∂f(x; θ)/∂θ dx = ∂/∂θ ∫ f(x; θ) dx = 0.

Together, these further yield

    Var_θ(T) ≥ [∂q(θ)/∂θ]² / ( n E_θ[ ( ∂ ln f(X; θ)/∂θ )² ] ),

which completes the proof for the first part of the CRLB theorem.
The inequality in the Cauchy-Schwarz inequality used above becomes an equality iff one function
is proportional to the other, i.e.

    ∂ ln f(x; θ)/∂θ ∝ [t(x) − q(θ)],

which is equivalent to the fact that there exists a factor K(θ, n), independent of x, such that

    ∂ ln f(x; θ)/∂θ = K(θ, n) [t(x) − q(θ)]
    ⟺ ∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = K(θ, n) [t(x) − q(θ)]   (by independence of the X_i's).

This completes the second part of the CRLB theorem.


The CRLB regularity conditions, which were stated for continuous sampling distributions, can
be modified for discrete sampling distributions, leaving the statement of the CRLB theorem
unchanged. Furthermore, the CRLB regularity conditions, which were stated for the case where
θ and q(θ) are scalars, can be modified for the case where θ and q(θ) are vectors, leading
to a multivariate version of the CRLB theorem (see Mittelhammer, 1996, Theorems 7.16 and
7.17).
The CRLB theorem has two uses. First, it gives a lower bound for the variance of unbiased
estimators (first part of the theorem). Second, if an unbiased estimator whose variance coincides
with the CRLB can be found, then this estimator is the MVUE. The condition under which
an unbiased estimator has a variance that achieves the CRLB (second part of the theorem)
aids in finding a MVUE.
In fact, if there exists an unbiased estimator T = t(x) such that

    ∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = K(θ, n) [t(x) − q(θ)]

for some function K(θ, n), then var(T) coincides with the CRLB and T is the MVUE.

Example 2.17 Let (X_1, ..., X_n) be a random sample from an exponential distribution with pdf

    f(x; θ) = θ e^{−θx},   x, θ ∈ (0, ∞),   with EX = 1/θ,   var(X) = 1/θ².

Find the CRLB for the variance of unbiased estimators of θ and 1/θ, and find the MVUEs of
θ and 1/θ.
First, we need to show that the six CRLB conditions are satisfied.

1. Since Ω = (0, ∞), the parameter space is an open interval. ✓

2. The support of the exponential density is the same for all θ ∈ Ω. ✓

3. Note that the joint pdf of the random sample is

       f(x; θ) = θⁿ e^{−θ ∑_{i=1}^n x_i}   and   ln f(x; θ) = n ln θ − θ ∑_{i=1}^n x_i,

   so that

       ∂ ln f(x; θ)/∂θ = n/θ − ∑_{i=1}^n x_i

   exists and is finite ∀ x on the support of f and ∀ θ ∈ Ω. ✓

4. Regularity condition 4 assumes that we can differentiate under the integral so that

       ∂/∂θ ∫ f(x; θ) dx = ∫ ∂f(x; θ)/∂θ dx.

   First consider the r.h.s. and note that

       ∂f(x; θ)/∂θ = [∂ ln f(x; θ)/∂θ] f(x; θ) = ( n/θ − ∑_{i=1}^n x_i ) f(x; θ),

   so that

       ∫ ∂f(x; θ)/∂θ dx = E( n/θ − ∑_{i=1}^n X_i ) = n/θ − n (1/θ) = 0.

   Now note that the l.h.s. of condition 4 is also equal to zero, since ∫ f(x; θ) dx = 1 because
   f(x; θ) is a pdf and ∂1/∂θ = 0. ✓

5. Regularity condition 5 assumes that we can differentiate under the integral so that we have,
   for any unbiased estimator t(x) for θ,

       ∂/∂θ ∫ t(x) f(x; θ) dx = ∫ t(x) ∂f(x; θ)/∂θ dx.

   As mentioned above, the verification of this condition is rather complicated, since we have
   to show that it holds true for all unbiased estimators of θ. Here, where we consider a sample
   from a pdf which belongs to the exponential class, it is satisfied, as discussed in Mittelhammer
   (1996, p. 410-411). ✓

6. Regarding the last condition, first note that (see the proof of the CRLB theorem)

       E_θ[ ( ∂ ln f(x; θ)/∂θ )² ] = n E_θ[ ( ∂ ln f(X_i; θ)/∂θ )² ],

   where

       E_θ[ ( ∂ ln f(X_i; θ)/∂θ )² ] = E[ ( 1/θ − X_i )² ] = E[ ( X_i − EX_i )² ] = var(X_i) = 1/θ².

   Thus, E_θ[ ( ∂ ln f(x; θ)/∂θ )² ] is strictly positive and finite ∀ θ ∈ Ω. ✓

Thus, the CRLB regularity conditions are met for our estimation problem. The CRLB for the
variance of an unbiased estimator T for q(θ) = θ with ∂q(θ)/∂θ = 1 is

    var_θ(T) ≥ [∂q(θ)/∂θ]² / ( n E_θ[ ( ∂ ln f(X; θ)/∂θ )² ] ) = 1 / ( n (1/θ²) ) = θ²/n.

Regarding the existence of an unbiased estimator whose variance achieves the CRLB (second
part of the CRLB theorem), note that

    ∑_{i=1}^n ∂ ln f(x_i; θ)/∂θ = ∑_{i=1}^n ( 1/θ − x_i ) = −n ( (1/n) ∑_{i=1}^n x_i − 1/θ ),

where X̄_n = (1/n) ∑_{i=1}^n x_i is an unbiased estimator of EX = 1/θ. Hence, by taking K(θ, n) = −n,
t(x) = X̄_n and q(θ) = 1/θ, we have

    ∑_{i=1}^n ∂ ln f(X_i; θ)/∂θ = K(θ, n) [t(x) − q(θ)].

Thus, X̄_n is the MVUE for 1/θ, since its variance, given by var(X̄_n) = 1/(nθ²), coincides with
the CRLB for q(θ) = 1/θ, which is [∂(1/θ)/∂θ]² / ( n (1/θ²) ) = (1/θ⁴)/(n/θ²) = 1/(nθ²).

Further examples illustrating the use of the CRLB can be found in Mittelhammer (1996,
p. 411ff) and Mood, Graybill and Boes (1973, p. 319f).
33
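The attainment of the CRLB by X̄n is also easy to check numerically. The following Python sketch (an illustration only, not part of the original notes; the rate θ, the sample size n and the number of replications are arbitrary values chosen for the demonstration) compares the Monte Carlo variance of X̄n with the CRLB 1/(nθ²) for unbiased estimators of 1/θ.

```python
# Monte Carlo check (illustrative sketch): for an exponential(theta) sample,
# X_bar is unbiased for 1/theta and its variance should match the CRLB 1/(n*theta^2).
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 200_000          # arbitrary illustration values

# X ~ Exp(rate = theta) has mean 1/theta; numpy parameterizes by the scale = 1/rate
xbar = rng.exponential(scale=1.0 / theta, size=(reps, n)).mean(axis=1)

print("mean of X_bar:", xbar.mean(), " (target 1/theta =", 1 / theta, ")")
print("var  of X_bar:", xbar.var(),  " (CRLB 1/(n theta^2) =", 1 / (n * theta**2), ")")
```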

We conclude the discussion of the CRLB with remarks on its form and its attainability.

An alternative form of the CRLB that utilizes the second-order derivative of ln f(x;θ) w.r.t. θ is sometimes useful in practice. Specifically, it is often the case that the so-called information equality, given by

E{[∂ ln f(x;θ)/∂θ]²} = −E[∂² ln f(x;θ)/∂θ²],

holds true. Now, note that under the assumption that the Xi's in x are iid, the l.h.s. becomes

E{[∂ ln f(x;θ)/∂θ]²} = n E{[∂ ln f(Xi;θ)/∂θ]²},

which is the denominator of the CRLB given above. For the r.h.s. of the information equality we obtain

−E[∂² ln f(x;θ)/∂θ²] = −Σ_{i=1}^n E[∂² ln f(Xi;θ)/∂θ²]  (independence)
 = −n E[∂² ln f(Xi;θ)/∂θ²]  (identical distribution).

Thus, under the information equality the CRLB can be represented as

CRLB = [q′(θ)]² / ( n E{[∂ ln f(X;θ)/∂θ]²} ) = [q′(θ)]² / ( −n E[∂² ln f(X;θ)/∂θ²] ).

In order to see what is required for the information equality

E{[∂ ln f(·;θ)/∂θ]²} = −E[∂² ln f(·;θ)/∂θ²]

to hold, note that we obtain for the r.h.s.

−E[∂² ln f(·;θ)/∂θ²] = −E[ ∂/∂θ { (1/f(·;θ)) ∂f(·;θ)/∂θ } ]   (since ∂ ln f/∂θ = (1/f) ∂f/∂θ)
 = −E[ (1/f(·;θ)) ∂²f(·;θ)/∂θ² − (1/f(·;θ)²) (∂f(·;θ)/∂θ)² ]   (by the product rule)
 = −E[ (1/f(·;θ)) ∂²f(·;θ)/∂θ² ] + E{ [∂ ln f(·;θ)/∂θ]² },

where the last term is the l.h.s. of the information equality. Thus, for the information equality to hold, we require that E[ (1/f(·;θ)) ∂²f(·;θ)/∂θ² ] = 0. Now note that this condition can be reformulated as

E[ (1/f(·;θ)) ∂²f(·;θ)/∂θ² ] = ∫ (1/f(x;θ)) [∂²f(x;θ)/∂θ²] f(x;θ) dx = ∫ ∂²f(x;θ)/∂θ² dx = 0,

which is satisfied if we may twice differentiate ∫ f(x;θ) dx under the integral sign, because then

∫ ∂²f(x;θ)/∂θ² dx = ∂²/∂θ² ∫ f(x;θ) dx = ∂²(1)/∂θ² = 0.

Thus, the information equality and the alternative form of the CRLB hold if the joint pdf f(x;θ) for the random sample under consideration allows us to interchange twice the order of integration w.r.t. x and differentiation w.r.t. θ. This is typically the case if the joint pdf f(x;θ) belongs to the exponential class (see Mittelhammer, 1996, p. 414).

Often the CRLB for the variance of an unbiased estimator is not attainable; that is, there often exists a lower bound for the variance that is greater than the CRLB. In that case the variance of the MVUE is greater than the CRLB, which implies that the CRLB theorem is of limited use for the identification of the MVUE. The following proposition indicates the cases where both the CRLB regularity conditions are satisfied and the CRLB is attainable.⁶

Theorem 2.5 (Exponential class and CRLB) If T is an unbiased estimator of some q(θ) whose variance coincides with the CRLB, then the pdf f(x;θ) belongs to the exponential class; and, conversely, if f(x;θ) belongs to the exponential class, then there exists a unique unbiased estimator T of some q(θ) whose variance coincides with the CRLB.

Proof
Omitted. ∎

The theorem tells us that we will be able to find an unbiased estimator whose variance attains the CRLB iff the sampling density belongs to the exponential class; and there is only one function q(θ) for which such an estimator exists.

Recall the example where we considered a random sample from an exponential distribution (belonging to the exponential class)

f(x;θ) = θ e^{−θx},  with EX = 1/θ.

There we have shown that T = X̄n is an unbiased estimator for q(θ) = 1/θ attaining the CRLB. This implies that for q(θ) = θ, e.g., no such estimator exists.

Since the CRLB is often of limited use in finding the MVUE, we need alternative approaches for finding MVUEs. In the next subsection, we will discuss such an alternative approach.

⁶ See Mittelhammer (1996, p. 418) and Mood, Graybill and Boes (1973, p. 320).

2.5.2. Sufficiency and Completeness

In this subsection we continue our discussion of how to find MVUEs. The first result shows that one may focus on sufficient statistics.

Theorem 2.6 (Rao-Blackwell Theorem (scalar case)) Let S = (S1, ..., Sr)′ be an r-dimensional sufficient statistic for f(x;θ), θ ∈ Ω, and let T = t(x) be any unbiased estimator for the scalar q(θ). Define

T′ = t′(x) = E[t(x) | S1, ..., Sr].

Then
1. T′ is a statistic and it is a function of S1, ..., Sr;
2. ET′ = q(θ), that is, T′ is an unbiased estimator of q(θ);
3. var(T′) ≤ var(T) ∀θ ∈ Ω, where the equality is attained only if P(T′ = T) = 1.

Proof
1. First note that since S = (S1, ..., Sr)′ is a sufficient statistic, the conditional density f(x;θ | s) = h(x) does not depend on θ. Hence,

T′ = E(T | S) = ∫ t(x) f(x;θ | s) dx = g(s),

which is independent of θ. So T′ is a statistic (does not depend on θ) and is a function of S.

2. By the law of iterated expectations,

E(T′) = E[E(T | S)] = E(T) = q(θ).

So T′ is an unbiased estimator of q(θ).

3. The variance of T is

var(T) = E[(T − ET)²] = E[(T − T′ + T′ − ET′)²]  (since ET = ET′)
 = E[(T − T′)²] + 2E[(T − T′)(T′ − ET′)] + var(T′).

Now examine the second term on the r.h.s. By the law of iterated expectations,

E[(T − T′)(T′ − ET′)] = E{ E[(T − T′)(T′ − ET′) | S] }
 = E{ (T′ − ET′) E[(T − T′) | S] }  (since T′ and ET′ are constants given S)
 = 0  (since E[(T − T′) | S] = T′ − T′ = 0),

and therefore

var(T) = E[(T − T′)²] + var(T′) ≥ var(T′).

The equality is attained iff E[(T − T′)²] = 0, which requires that P(T′ = T) = 1. ∎

The Rao-Blackwell theorem says: given an unbiased estimator, another unbiased estimator that is a function of a sufficient statistic can be constructed; it will never have a larger variance and may well have a smaller one. In other words, conditioning an unbiased estimator on a sufficient statistic might improve its MSE performance while it will never deteriorate the MSE performance. This implies that the search for an MVUE can be restricted to functions of sufficient statistics!

Example 2.18 Let (X1, X2, X3) be a random sample from a uniform distribution on the interval [0, θ] with pdf

f(x;θ) = (1/θ) I_[0,θ](x).

Note that this pdf does not belong to the exponential class, so that the CRLB theorem is not applicable to find the MVUE for θ.

An unbiased estimator for θ is T = 2X_[2], that is, two times the sample median, since

ET = 2EX_[2] = 2EX = 2(θ/2) = θ  (since the pdf f is symmetric).

A sufficient statistic for the upper bound θ of the sample range is given by the sample maximum, that is, S = X_[3]. (Note that the sample maximum contains all the information on the upper bound θ that the sample has to offer.) According to the Rao-Blackwell theorem, the estimator of θ constructed as

T′ = E(T | S) = E(2X_[2] | X_[3])

should have a variance which is not larger than that of T.

The explicit functional form of T′ = E(T | S) is obtained as follows. First note that

T′ = E(2X_[2] | X_[3]) = ∫ (2x_[2]) f(x_[2] | x_[3]) dx_[2],

where (see order statistics)

f(x_[2] | x_[3]) = f(x_[2], x_[3]) / f(x_[3]) = (6x_[2]/θ³) / (3x_[3]²/θ³) = 2x_[2]/x_[3]²  for x_[2] ≤ x_[3].

Thus, we have

T′ = E(2X_[2] | X_[3]) = ∫_0^{x_[3]} (2x_[2]) (2x_[2]/x_[3]²) dx_[2] = ∫_0^{x_[3]} (4x_[2]²/x_[3]²) dx_[2] = (4/3) x_[3].

The variance of T is

var(T) = var(2X_[2]) = ∫_0^θ (2x_[2] − θ)² f(x_[2]) dx_[2] = θ²/5,  with f(x_[2]) = 6x_[2](θ − x_[2])/θ³,

and the variance of T′ is

var(T′) = var((4/3)X_[3]) = ∫_0^θ ((4/3)x_[3] − θ)² f(x_[3]) dx_[3] = θ²/15,  with f(x_[3]) = 3x_[3]²/θ³.

Thus, we have that var(T′) < var(T) ∀θ, which is consistent with the Rao-Blackwell theorem.
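The variance reduction achieved by Rao-Blackwellization in this example can be illustrated by simulation. The following Python sketch (an illustration only; θ = 1 and the number of replications are arbitrary choices, the sample size 3 is that of the example) compares the Monte Carlo variances of both estimators with θ²/5 and θ²/15.

```python
# Monte Carlo comparison (sketch) of T = 2*median and T' = (4/3)*max
# for samples of size 3 from Uniform[0, theta].
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 1.0, 500_000                  # theta fixed at 1; reps arbitrary

x = rng.uniform(0.0, theta, size=(reps, 3))
T  = 2.0 * np.median(x, axis=1)             # unbiased estimator based on the median
Tp = (4.0 / 3.0) * np.max(x, axis=1)        # Rao-Blackwellized estimator E(T | X_[3])

print("E[T], E[T']:", T.mean(), Tp.mean(), " (both close to theta =", theta, ")")
print("var(T)     :", T.var(),  " (theory theta^2/5  =", theta**2 / 5, ")")
print("var(T')    :", Tp.var(), " (theory theta^2/15 =", theta**2 / 15, ")")
```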

Before moving on, two comments are appropriate. First, if an unbiased estimator T is already a function of a sufficient statistic S, then the estimator T′ derived according to the Rao-Blackwell theorem will be identical to T. Second, the Rao-Blackwell theorem tells us how to improve on an unbiased estimator by conditioning on a sufficient statistic. This raises the question whether or not such an unbiased estimator, obtained by conditioning on a sufficient statistic, will be the MVUE. As we shall discuss now, this is the case if the sufficient statistic S used for conditioning an unbiased estimator is complete.

Definition (Complete sufficient statistic): Let S = [S1, ..., Sr]′ be a sufficient statistic for f(x;θ), θ ∈ Ω. The sufficient statistic S is said to be complete iff

E[z(S)] = 0 ∀θ ∈ Ω implies that P[z(s) = 0] = 1 ∀θ ∈ Ω,

where z(S) is a statistic.

One implication of this definition is: if a sufficient statistic S is complete, then two different functions of S cannot have the same expected value. To see this, consider two functions of a complete sufficient statistic S, say t(S) and t′(S), and suppose that they have the same expected value,

E[t(S)] = E[t′(S)] = q(θ).

Now define the statistic

z(S) = t(S) − t′(S), so that E[z(S)] = 0 ∀θ ∈ Ω.

Since z(S) is a function of a complete sufficient statistic, it must be the case (according to the definition) that

P[z(s) = 0] = 1 and hence P[t(s) = t′(s)] = 1.

Thus, t(S) and t′(S) are the same function with probability 1 if they have the same expected value and if S is a complete sufficient statistic.

An important implication of this result is that any unbiased estimator of q(θ) that is a function of a complete sufficient statistic is unique: there cannot be more than one unbiased estimator of q(θ) defined as a function of a complete sufficient statistic.

Example 2.19 Let (X1, ..., Xn) be a random sample from a Bernoulli population with P(X = 1) = p, and consider the statistic

S = Σ_{i=1}^n Xi,

which is a sufficient statistic for p. To determine whether S is a complete sufficient statistic we need to show that a function z(S) of S for which E[z(S)] = 0 ∀p ∈ [0, 1] is characterized by P[z(s) = 0] = 1 ∀p ∈ [0, 1].

First note that S ~ binomial(n, p), so that E[z(S)] = 0 implies

E[z(S)] = Σ_{j=0}^n z(j) (n choose j) p^j (1 − p)^{n−j} = (1 − p)^n Σ_{j=0}^n z(j) (n choose j) λ^j = 0,

where λ = p/(1 − p). Hence, E[z(S)] = 0 ∀p ∈ [0, 1] requires that

Σ_{j=0}^n z(j) (n choose j) λ^j = 0 for all admissible λ.

For that polynomial in λ to be equal to 0 ∀λ, all coefficients z(j)(n choose j) need to be equal to 0, that is,

z(j) (n choose j) = 0 with (n choose j) ≠ 0, so that z(j) = 0 ∀j ∈ {0, 1, ..., n}.

Hence, E[z(S)] = 0 ∀p requires that z(j) = 0 ∀j, such that E[z(S)] = 0 implies that P[z(s) = 0] = 1. Thus, S = Σ_{i=1}^n Xi is a complete sufficient statistic for p.

In general, the verification of the completeness of a sufficient statistic can require tricky analysis.
However, the following theorem identifies a large collection of distributions for which complete
sufficient statistics are relatively easy to identify.

Theorem 2.7 (Completeness in the exponential class) Let the joint density f(x;θ) of the random sample (X1, ..., Xn) be a member of a parametric family of densities belonging to the exponential class of densities with pdf

f(x;θ) = exp[ Σ_{i=1}^k ci(θ) gi(x) + d(θ) + z(x) ].

If the range of [c1(θ), ..., ck(θ)]′, θ ∈ Ω, contains an open k-dimensional rectangle⁷, then s(x) = [g1(x), ..., gk(x)]′ is a complete sufficient statistic for f(x;θ), θ ∈ Ω.

Proof
See Rohatgi and Saleh (2001), An Introduction to Probability Theory and Mathematical Statistics, John Wiley and Sons, p. 367f. ∎

If complete sufficient statistics exist for a statistical model {f(x;θ), θ ∈ Ω}, then an alternative to the CRLB approach is available to identify the MVUE of q(θ). The approach is based upon the Lehmann-Scheffé completeness theorem.

⁷ The condition that the range of [c1(θ), ..., ck(θ)]′ contains an open k-dimensional rectangle excludes cases where the ci(θ)'s are linearly dependent. For a random sample from a N(μ, σ²) distribution with (μ, σ²) ∈ R¹ × R¹₊, for example, the range of [c1(θ), c2(θ)]′ = [μ/σ², −1/(2σ²)]′ is the set R¹ × R¹₋ and contains an open 2-dimensional rectangle.
Theorem 2.8 (Lehmann-Scheffé completeness theorem (scalar case)) Let S = (S1, ..., Sr)′ be a complete sufficient statistic for f(x;θ), θ ∈ Ω. Let T = t(S) be an unbiased estimator for the function q(θ). Then T = t(S) is the MVUE of q(θ).

Proof
Let T′ be any unbiased estimator of q(θ) which is a function of the complete sufficient statistic S, that is, T′ = t′(S). Then

E(T − T′) = 0 and T − T′ is a function of S,

so by completeness of S,

P[t(S) = t′(S)] = 1.

Hence, there is only one unbiased estimator of q(θ) that is a function of S. Now let T* be any unbiased estimator of q(θ). Then T must be equal to

T = E(T* | S),

since E(T* | S) is an unbiased estimator of q(θ) depending on S. By the Rao-Blackwell theorem,

var(T) ≤ var(T*),

so that T is the MVUE of q(θ). ∎

The Lehmann-Scheffé completeness theorem says that if a complete sufficient statistic S exists and if there is an unbiased estimator of q(θ), then there is an MVUE for q(θ); there are two possible procedures for identifying the MVUE for q(θ):

1. Find a statistic of the form t(S) such that E t(S) = q(θ). Then t(S) is necessarily the MVUE of q(θ).
2. Find any unbiased estimator of q(θ), say t*(x). Then t(S) = E(t*(x) | S) is the MVUE of q(θ).

Example 2.20 Let (X1, ..., Xn) be a random sample from a Poisson distribution with pdf

f(x;λ) = e^{−λ} λ^x / x!  for x = 0, 1, 2, ...,  with EX = var(X) = λ.

To find the MVUE of q(λ) = λ, note first that the joint pdf f(x;λ) is a member of the exponential class of densities and has the form

f(x;λ) = Π_{i=1}^n f(xi;λ) = e^{−nλ} λ^{Σ_{i=1}^n xi} / Π_{i=1}^n xi! = exp[ ln(λ) Σ_{i=1}^n xi − nλ − ln(Π_{i=1}^n xi!) ],

with c(λ) = ln(λ) and g(x) = Σ_{i=1}^n xi. Hence, by Theorem 2.7, the statistic g(x) = Σ_{i=1}^n Xi is a complete sufficient statistic for λ.

To identify the MVUE for λ, it suffices to find a function of the complete sufficient statistic Σ_{i=1}^n Xi whose expectation is λ. Since X̄n = (1/n) Σ_{i=1}^n Xi is an unbiased estimator for EX = λ, it is the obvious choice; so X̄n = (1/n) Σ_{i=1}^n Xi is the MVUE of λ.

Does the variance of the MVUE of λ, given by

var(X̄n) = var( (1/n) Σ_{i=1}^n Xi ) = λ/n,

attain the CRLB? For the CRLB for the variance of an unbiased estimator T of q(λ) = λ we obtain

var(T) ≥ [q′(λ)]² / ( n E{[∂ ln f(X;λ)/∂λ]²} ),

where ln f(x;λ) = −λ + x ln(λ) − ln(x!), such that

[∂ ln f(x;λ)/∂λ]² = (−1 + x/λ)² = 1 − 2x/λ + x²/λ²

and

E{[∂ ln f(X;λ)/∂λ]²} = 1 − 2EX/λ + (var(X) + [EX]²)/λ² = 1 − 2 + (λ + λ²)/λ² = 1/λ,

so that

var(T) ≥ 1 / ( n (1/λ) ) = λ/n.

Thus, the variance of the MVUE X̄n = (1/n) Σ_{i=1}^n Xi of λ attains the CRLB.

Example 2.21 Let (X1, ..., Xn) be a random sample from a Poisson distribution with pdf f(x;λ) = e^{−λ} λ^x / x!. Find the MVUE of q(λ) = P(x = 0) = e^{−λ}.

According to the Lehmann-Scheffé theorem, the MVUE can be derived by calculating the conditional expectation of some unbiased estimator T of e^{−λ} given the complete sufficient statistic S = Σ_{i=1}^n Xi.

Since we can use any unbiased estimator, we may choose a simple one that makes the calculations easy. Such a simple unbiased estimator of P(x = 0) = e^{−λ} is

T = I_{0}(X1),  where X1 is the first rv in the sample.

Note that this is indeed an unbiased estimator since

E[I_{0}(X1)] = 1 · P(x1 = 0) + 0 · P(x1 > 0) = e^{−λ}.

By the Lehmann-Scheffé theorem,

E[I_{0}(X1) | Σ_{i=1}^n Xi] = 1 · P(x1 = 0 | Σ_{i=1}^n Xi) + 0 · P(x1 > 0 | Σ_{i=1}^n Xi)

is the MVUE for e^{−λ}. To find this conditional expectation, we need to derive the conditional probability P(x1 = 0 | Σ_{i=1}^n Xi = s). This conditional probability obtains as

P(x1 = 0 | Σ_{i=1}^n Xi = s) = P(x1 = 0, Σ_{i=1}^n Xi = s) / P(Σ_{i=1}^n Xi = s)   (by definition)
 = P(x1 = 0, Σ_{i=2}^n Xi = s) / P(Σ_{i=1}^n Xi = s)
 = P(x1 = 0) P(Σ_{i=2}^n Xi = s) / P(Σ_{i=1}^n Xi = s)   (by independence of the Xi's).

Now we exploit the additivity property of the Poisson distribution, which implies

Xi ~ iid Poisson(λ)  ⇒  Σ_{i=1}^n Xi ~ Poisson(nλ).

Hence

P(Σ_{i=1}^n Xi = s) = e^{−nλ} (nλ)^s / s!  and  P(Σ_{i=2}^n Xi = s) = e^{−(n−1)λ} ([n−1]λ)^s / s!,

such that

P(x1 = 0 | Σ_{i=1}^n Xi = s) = e^{−λ} { e^{−(n−1)λ} ([n−1]λ)^s / s! } / { e^{−nλ} (nλ)^s / s! } = ( (n−1)/n )^s.

Therefore,

T = E[I_{0}(X1) | Σ_{i=1}^n Xi] = P(x1 = 0 | Σ_{i=1}^n Xi) = ( (n−1)/n )^{Σ_{i=1}^n Xi}

is the MVUE of e^{−λ}. Simple algebra reveals that

var(T) = (e^{λ/n} − 1) e^{−2λ}

(the check is left for an exercise).

Note that this variance has to be larger than the CRLB for the variance of an unbiased estimator of e^{−λ}, since the variance of the MVUE of λ attains the CRLB (see the previous example) and there is only one function of λ for which the MVUE attains the CRLB. In fact, the CRLB for the variance of an unbiased estimator of q(λ) = e^{−λ} is

var(T) ≥ [q′(λ)]² / ( n E{[∂ ln f(X;λ)/∂λ]²} ) = e^{−2λ} / ( n (1/λ) ) = (λ/n) e^{−2λ}.
3. Point estimation methods


In this chapter, we consider generally applicable methods which can be used to obtain specific functional forms for estimators of q(θ) that have desirable finite-sample or asymptotic properties.

Of course, there is no single method for generating estimators of q(θ) that will always, i.e. for all estimation problems, lead to the best estimator or even an estimator that always has good properties.

In this chapter, we will discuss four specific estimation methods that, under appropriate circumstances, lead to good estimators for q(θ). The four methods we shall discuss are
1. Least-squares estimators for linear regression models
2. Maximum-likelihood method
3. Method-of-moments estimators
4. Bayesian a-posteriori estimators.

3.1. Least-Squares Estimators for Linear Regression Models


The linear regression model (LRM) is the single most useful tool in the tool kit of econometricians. It is used to study the relationship between a dependent or explained random variable and one or more independent or explanatory variables. This is a basic econometric tool; see the Advanced Econometrics I course.

In a nutshell, the LRM decomposes the ith dependent random variable in the random sample Y1, ..., Yn into the sum of its expectation and the deviation from its expectation, as

Yi = μi + εi,  i = 1, ..., n,

where

μi : expectation of Yi, i.e., EYi;
εi : deviation of Yi from its expectation (random error / disturbance).

Note that this representation implies that by construction E(εi) = E(Yi − μi) = 0.

The LRM specializes this representation of Yi by assuming that E(Yi) = μi is a function of k observable explanatory variables xi1, ..., xik with a linear form

μi = β1 xi1 + β2 xi2 + ... + βk xik,

where the βj's are unknown regression parameters measuring the marginal effects of the x-variables on the expectation of Yi. The "linearity" in the linear regression model refers to the assumption that the regression function μi is linear in the parameters.

Thus, the generic form of the LRM is

Yi = β1 xi1 + β2 xi2 + ... + βk xik + εi,  i = 1, ..., n,

and a compact matrix representation of the LRM has the form

Y = xβ + ε,

where

Y = (Y1, ..., Yn)′ is the (n×1) vector of dependent variables, x is the (n×k) matrix of explanatory variables with rows xi· = (xi1, ..., xik), β = (β1, ..., βk)′ is the (k×1) vector of regression parameters, and ε = (ε1, ..., εn)′ is the (n×1) vector of disturbances.

Note that the errors εi represent unobservable random variables, since they are deviations of Yi from the unknown mean EYi = β1 xi1 + ... + βk xik.

In order to illustrate the generality of the LRM, note that a specification like

Yi = β1 + β2 zi2² + β3 sin(zi3) + β4 (zi4/zi5) + εi,  i = 1, ..., n,

is consistent with the linear form of a LRM, and a representation that is linear in explanatory variables can be obtained upon defining

xi1 = 1,  xi2 = zi2²,  xi3 = sin(zi3),  xi4 = zi4/zi5.

Moreover, a relationship between dependent and independent variables that is initially nonlinear in the parameters might be transformable into a LRM. Consider, e.g., the following stochastic Cobb-Douglas production function

Qi = γ ki^{β1} ℓi^{β2} εi,

with Qi being output, ki and ℓi being capital and labor inputs, and εi denoting a random error term. A logarithmic transformation yields

ln Qi = ln γ + β1 ln ki + β2 ln ℓi + ln εi,

which has the form of a LRM with dependent variable ln Qi, regressors ln ki and ln ℓi, and error term ln εi.

However, note that a nonlinear relationship like

Qi = γ ki^{β1} ℓi^{β2} + εi

is not transformable into a LRM.

3.1.1. The classical LRM assumptions

The typical aim in analyzing an LRM will be to estimate the regression parameters in β utilizing a random sample realization y = (y1, ..., yn)′ together with the values of the explanatory variables x.

At this point, we have not yet specified all assumptions of the LRM sufficiently for estimation purposes. In the following, we discuss a set of assumptions, referred to as the classical assumptions of the LRM, which imply appropriate estimation procedures.

Assumption (A1): EY = xβ and Eε = 0.

The first assumption (A1) says that (as already discussed above) the mean of the dependent variables Y is a linear function in the regression parameters β and has the form xβ. This necessarily implies that Eε = 0.

Assumption (A2): x is a non-random n × k matrix with rank rk(x) = k (full column rank).

The first part of (A2) says that the explanatory variables in x are, in contrast to the dependent variable Y, deterministic, and x is a fixed matrix of numbers. There are two alternative interpretations of this part of Assumption (A2).

Under the first interpretation, the elements in x are considered as having been controlled or fixed by the researcher in repeated random experiments. In this case the statistician can select/design the vectors xi· = (xi1, ..., xik), i = 1, ..., n, and then she can observe the yi values associated with the chosen xi values.

As an example, suppose we choose various levels of inputs to the process of beer production (xi1 = water, xi2 = yeast) and observe the output level of beer (yi) corresponding to each fixed level of inputs. We might expect that we can produce for a given level of inputs xi on average a certain output quantity EYi. However, for any given input-output observation (xi, yi), deviations from the average EYi might occur due to a myriad of factors that we cannot control (quality of the water and the yeast, outdoor temperature and humidity, etc.). Then we might aggregate the impact of those random factors into the error term εi.

However, in many situations in economics and business, we cannot control/design the x-matrix. There we are typically passive observers of the (xi, yi)-values, which have been generated by consumers, entrepreneurs, the economy, or the market. In those cases it is more natural to consider the observed x as the outcome of a random matrix x, just like y is the outcome of the random vector Y. Then the assumption EY = xβ of the LRM is interpreted in a conditional sense, i.e., the expectation of Y is conditional on the outcome x of x. A more precise notation reflecting this interpretation would be given by E(Y | x) = xβ and for the LRM model

Y | x = xβ + ε | x,  with E(ε | x) = 0.

In this case, the entire analysis is then cast in the framework of being conditional on the observed x. It is important to note that the estimation methods to be discussed below and their properties will be the same under either interpretation of the non-stochastic regressors.

The second part of Assumption (A2), rk(x) = k, says that there is no exact linear relationship among the explanatory variables in the k columns of x. This ensures that the model is identified. To see this, consider the following example.

Example 3.1 Consider the LRM for total consumption (Ci)

Ci = β1 + β2 nonlabor incomei + β3 salaryi + β4 total incomei + εi.

Since total incomei = nonlabor incomei + salaryi, there is an exact linear dependency among the regressors with

λ · nonlabor incomei + λ · salaryi − λ · total incomei = 0,

where λ is any number. Hence,

Ci = β1 + (β2 + λ) nonlabor incomei + (β3 + λ) salaryi + (β4 − λ) total incomei + εi

is observationally equivalent to the LRM given above, which means that a substitution of β2, β3, β4 by (β2 + λ), (β3 + λ), (β4 − λ) does not change the right-hand side of Ci. This implies that the model is not identified, which is excluded by Assumption (A2).

Assumption (A3): Cov(ε) = Eεε′ = σ²I.

This assumption implies that the covariance matrix of ε is the (n×n) matrix

Cov(ε) with diagonal entries var(ε1), ..., var(εn) all equal to σ² and off-diagonal entries cov(εi, εj) all equal to 0, i.e., Cov(ε) = σ²I.

The fact that all of the variances of the elements in ε have the same value, i.e.,

var(εi) = σ²,  i = 1, ..., n,

is referred to as homoskedasticity. The fact that the off-diagonal entries are zero implies that covariances are zero, i.e.,

cov(εi, εj) = 0,  i ≠ j,

such that there is no correlation between any two elements in ε.

The three assumptions (A1)-(A3), i.e.

(A1): EY = xβ and Eε = 0
(A2): x is a non-random n × k matrix with rank rk(x) = k
(A3): Cov(ε) = Eεε′ = σ²I

are the classical assumptions of the LRM

Y = xβ + ε.

Note that the LRM with the classical assumptions is not a fully specified stochastic model for the random sample variables Y, since we have not yet specified a parametric family of distributions for Y (or ε). Nevertheless it is possible to suggest an estimator for β, namely the least-squares (LS) estimator, that has a number of desirable properties. In the next subsections, we define this estimator and analyze its properties in the absence of a distributional assumption for Y.

3.1.2. The Least-Squares estimator for β in the classical LRM

The LS estimate of β, denoted by b, solves for given observations of the dependent variable and the vector of regressors {yi, xi·}_{i=1}^n the minimization problem

b = arg min_β S(β),  where  S(β) = Σ_{i=1}^n (yi − xi·β)² = (y − xβ)′(y − xβ) = y′y − 2β′x′y + β′x′xβ.

The k first-order conditions for a minimum can be represented as

∂S(b)/∂β = −2x′y + 2x′xb = 0
⇔ x′xb = x′y   (k LS normal equations)
⇔ b = (x′x)⁻¹x′y.

Note that x is assumed to have full rank (A2). This implies that the (k×k) matrix x′x has full rank and is thus invertible. The minimizer

b = (x′x)⁻¹x′y

defines the LS estimate for β. The LS estimator of β is then defined by the random vector

β̂ = (x′x)⁻¹x′Y = (x′x)⁻¹x′(xβ + ε) = β + (x′x)⁻¹x′ε.

Note that the second-order conditions for the minimization are satisfied, since

∂²S(β)/∂β∂β′ = 2x′x is a p.d. matrix

(the matrix 2x′x is p.d. since for any vector ℓ ≠ 0 we obtain ℓ′2x′xℓ = 2(xℓ)′(xℓ) > 0), such that the objective function S(β) is convex.

Based upon the LS estimate b we can compute the fitted/estimated yi-values, i.e.,

ŷi = xi·b,  with ŷ = xb,

which are estimates of the expectations EYi = xi·β. The LS residuals, given by

ei = yi − ŷi = yi − xi·b,  with e = y − ŷ = y − xb,

are estimates for the errors/disturbances εi.

Let us now discuss some algebraic aspects of the LS solution. The normal equations defining the LS estimate are

x′xb − x′y = −x′(y − xb) = −x′e = 0  ⇔  x′e = 0.

Hence, we have x′e = (Σ_{i=1}^n xi1 ei, ..., Σ_{i=1}^n xik ei)′ = 0, so that the regressor matrix x and the LS-residual vector e are orthogonal. If the LRM contains an intercept such that the first column of x is a column of 1s, then the LS residuals sum to 0. This follows from Σ_{i=1}^n 1·ei = Σ_{i=1}^n ei = 0.

The vector of LS residuals can be written as

e = y − xb = y − x[(x′x)⁻¹x′y] = [I_(n) − x(x′x)⁻¹x′] y = M y.

The (n×n) matrix M = I_(n) − x(x′x)⁻¹x′ produces the LS residuals in the regression of y on x when it pre-multiplies the vector y. Hence it is a residual-generating matrix. The matrix M is both symmetric (M = M′) and idempotent (MM′ = M). Furthermore we have

Mx = (I_(n) − x(x′x)⁻¹x′)x = 0.

Hence, the LS residuals of a regression of x on x are equal to zero.

As a measure of how well the Y outcomes have been explained by the fitted regression plane xb we can use the coefficient of determination, which is defined as

R² = variation in y explained by the regression (SSR) / total variation in y (SST),

where

SST = Σ_{i=1}^n (yi − ȳ)²,  SSR = Σ_{i=1}^n (ŷi − ȳ)²   (LRM with intercept).

For an LRM with intercept we have R² ∈ [0, 1], and the larger the R², the better the fit of the LRM.
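The LS computations above translate directly into a few lines of matrix code. The following Python sketch (with simulated data; all numerical values are arbitrary illustration choices and not part of the original notes) computes b from the normal equations, the residuals e, and R² for a model with intercept.

```python
# Least-squares estimate via the normal equations (illustrative sketch).
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.column_stack([np.ones(n),                   # intercept column of 1s
                     rng.normal(size=n),
                     rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])             # arbitrary illustration values
y = x @ beta_true + rng.normal(scale=1.0, size=n)  # classical LRM with sigma = 1

# Solve x'x b = x'y (numerically preferable to forming the inverse explicitly)
b = np.linalg.solve(x.T @ x, x.T @ y)

y_hat = x @ b                                      # fitted values
e = y - y_hat                                      # LS residuals, x'e = 0
R2 = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)

print("LS estimate b      :", b)
print("x'e (should be ~ 0):", x.T @ e)
print("R^2                :", R2)
```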

3.1.3. Properties of the LS estimator in the classical LRM

The LS estimator β̂ is a linear and unbiased estimator for β with covariance matrix σ²(x′x)⁻¹. To show this, write

β̂ = (x′x)⁻¹x′Y = β + (x′x)⁻¹x′ε.

Hence, the LS estimator has the form of a linear function of the random sample variables Yi, i.e.,

β̂ = AY + d,  with A = (x′x)⁻¹x′ and d = 0.

Taking the expectation yields

Eβ̂ = E(β + (x′x)⁻¹x′ε) = β + (x′x)⁻¹x′ Eε = β   (since Eε = 0).

The covariance matrix obtains as

Cov(β̂) = E[(β̂ − β)(β̂ − β)′] = E[(x′x)⁻¹x′εε′x(x′x)⁻¹] = (x′x)⁻¹x′ E[εε′] x(x′x)⁻¹ = σ²(x′x)⁻¹   (since Eεε′ = σ²I).

The following theorem establishes that the LS estimator has the smallest variance among all linear and unbiased estimators.

Theorem 3.1 (Gauss-Markov Theorem) Under the classical assumptions of the LRM, β̂ = (x′x)⁻¹x′Y is the best linear unbiased estimator of β.

Proof
Let

β̂₀ = AY + d

be any linear estimator of β. Its expectation is

Eβ̂₀ = AE(xβ + ε) + d = Axβ + d.

Hence, for the linear estimator β̂₀ to be unbiased, it is required that d = 0 and Ax = I. Its covariance matrix obtains as

Cov(β̂₀) = A Cov(Y) A′ = σ²AA′   (since Cov(Y) = σ²I).

Since A can be represented as

A = (x′x)⁻¹x′ + [A − (x′x)⁻¹x′] = (x′x)⁻¹x′ + D,  where D is (k×n),

we can rewrite Cov(β̂₀) as

Cov(β̂₀) = σ²AA′ = σ²[(x′x)⁻¹x′ + D][(x′x)⁻¹x′ + D]′ = σ²[(x′x)⁻¹ + DD′],

where the last equation follows from the fact that

Dx(x′x)⁻¹ = [A − (x′x)⁻¹x′] x(x′x)⁻¹ = Ax(x′x)⁻¹ − (x′x)⁻¹(x′x)(x′x)⁻¹ = (x′x)⁻¹ − (x′x)⁻¹ = 0   (using Ax = I by unbiasedness).

Thus we have

Cov(β̂₀) = Cov(β̂) + σ²DD′,

where σ²DD′ is a p.s.d. matrix, so that for any linear combination ℓo′β̂₀

var(ℓo′β̂₀) ≥ var(ℓo′β̂). ∎

The classical assumptions of the LRM are sufficient to allow the definition of an unbiased estimator of the disturbance variance var(εi) = σ². This unbiased estimator is given by¹

S²_ε = ε̂′ε̂ / (n − k),  where ε̂ = Y − xβ̂.

To show that S²_ε is an unbiased estimator of σ², write

ε̂ = Y − xβ̂ = xβ + ε − x[β + (x′x)⁻¹x′ε] = [I_(n) − x(x′x)⁻¹x′] ε = Mε,

with M the residual-generating matrix. Hence we obtain

E(ε̂′ε̂) = E(ε′M′Mε) = E(ε′Mε) = E(tr[ε′Mε]) = E(tr[Mεε′]) = tr(E[Mεε′])
   (since ε′Mε is a scalar, tr(AB) = tr(BA), and E tr(W) = tr(EW))
 = tr(σ²M) = σ² tr(M) = σ² tr(I_(n) − x(x′x)⁻¹x′) = σ² (n − tr[(x′x)⁻¹x′x]) = σ²(n − k).

It follows that

ES²_ε = E(ε̂′ε̂)/(n − k) = σ².

With this unbiased estimator for σ² we can compute

Ĉov(β̂) = S²_ε (x′x)⁻¹,

which is an unbiased estimator for the covariance matrix of the LS estimator Cov(β̂) = σ²(x′x)⁻¹. The square root of the jth diagonal element of this matrix, {[S²_ε(x′x)⁻¹]_jj}^{1/2}, is the standard error of the LS estimator β̂j.

¹ Note the difference between the random vector ε̂ = Y − xβ̂ and its outcome, the residuals e = y − xb.
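Both the unbiasedness of S²_ε and the computation of the standard errors can be illustrated numerically. The sketch below (arbitrary simulated design; an illustration, not part of the original notes) repeats the regression many times, averages S²_ε, and then computes {[S²_ε(x′x)⁻¹]_jj}^{1/2} for one sample.

```python
# Unbiasedness of S^2 = e'e/(n-k) and LS standard errors (illustrative sketch).
import numpy as np

rng = np.random.default_rng(4)
n, k, sigma2, reps = 50, 3, 2.0, 20_000                         # arbitrary values
x = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # fixed design
beta = np.array([1.0, -1.0, 0.5])
xtx_inv = np.linalg.inv(x.T @ x)

s2_draws = np.empty(reps)
for r in range(reps):
    y = x @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    b = xtx_inv @ (x.T @ y)
    e = y - x @ b
    s2_draws[r] = e @ e / (n - k)                # S^2 with divisor n - k

print("average S^2 over replications:", s2_draws.mean(), " (target sigma^2 =", sigma2, ")")

# standard errors for the last replication: sqrt of the diagonal of S^2 (x'x)^{-1}
se = np.sqrt(s2_draws[-1] * np.diag(xtx_inv))
print("standard errors of b:", se)
```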
Theorem 3.2 (Consistency of β̂) Under the classical assumptions of the LRM, if

(x′x)⁻¹ → 0 as n → ∞,

then β̂ = (x′x)⁻¹x′Y is a consistent estimator of β.

Proof
Under the classical LRM we have that

Eβ̂ = β and Cov(β̂) = σ²(x′x)⁻¹.

If (x′x)⁻¹ → 0, then Cov(β̂) → 0 and thus β̂ converges to β in mean square. But since convergence in mean square implies convergence in probability, it follows that plim β̂ = β. ∎

Theorem 3.2 says that the condition that (x′x)⁻¹ → 0 is sufficient for consistency of the LS estimator. This condition ensures that the x matrix is well-behaved in large samples. It is fairly weak and is likely to be satisfied by typical data sets encountered in practice. The convergence (x′x)⁻¹ → 0 will occur iff each of the diagonal entries of (x′x)⁻¹ goes to zero. The necessity of this condition is obvious. The sufficiency follows from the fact that for p.d. matrices (such as (x′x)⁻¹) the (i, j)th entry is bounded above in absolute value by the square root of the product of the (i, i)th and (j, j)th (diagonal) entries. To see this, consider the following (2×2) example.

Example 3.2 Let x be (n×2) and consider the symmetric (2×2) matrix

A = (x′x)⁻¹ with entries a11, a12, a21, a22.

Note that A is symmetric and p.d. since (x′x) is symmetric and p.d. Since A is symmetric and p.d., it follows that

|A| = a11 a22 − a12² > 0,  so that a11 a22 > a12² and √(a11 a22) > |a12|.

This is the boundedness result mentioned above. Therefore, if a11 → 0 and a22 → 0, then a12 → 0.

In a LRM without intercept and one regressor, i.e., Yi = βxi + εi, we have

(x′x) = Σ_{i=1}^n xi²,

and the condition (x′x)⁻¹ → 0 is satisfied if Σ_{i=1}^n xi² increases with n without bound.

Theorem 3.3 (Consistency of S²_ε - iid case) Under the classical assumptions of the LRM, if the disturbances εi are iid, then S²_ε →p σ², so that S²_ε is a consistent estimator of σ².

Proof
Recall that the unbiased estimator S²_ε can be represented by

S²_ε = ε̂′ε̂/(n−k) = ε′Mε/(n−k) = ε′[I_(n) − x(x′x)⁻¹x′]ε/(n−k)   (using ε̂ = Mε).

This yields

S²_ε = ε′ε/(n−k) − ε′x(x′x)⁻¹x′ε/(n−k) = Vn − Zn,  where Vn →p σ² and Zn →p 0.

Regarding the limiting behavior of the second term, note that

Zn = ε′x(x′x)⁻¹x′ε/(n−k) ≥ 0,

since (x′x)⁻¹ is p.d. with ℓo′(x′x)⁻¹ℓo > 0 for any ℓo ≠ 0. Also note that, under the classical LRM,

EZn = (1/(n−k)) E[ε′x(x′x)⁻¹x′ε] = (1/(n−k)) E[tr(ε′x(x′x)⁻¹x′ε)] = (1/(n−k)) tr[x(x′x)⁻¹x′ E(εε′)] = (σ²/(n−k)) tr[(x′x)⁻¹x′x] = σ²k/(n−k).

Then, by Markov's inequality for non-negative random variables, ∀c > 0,

P(Zn ≥ c) ≤ EZn/c = σ²k/[(n−k)c].

Since lim_{n→∞} σ²k/(n−k) = 0, it follows that Zn →p 0.

Regarding the probability limit of the first term, note that

plim Vn = plim ε′ε/(n−k) = plim [n/(n−k)] · plim (1/n)Σ_{i=1}^n εi².

The iid assumption for the εi's implies that the εi²'s are also iid with Eεi² = σ². So by Khinchin's WLLN,

plim (1/n)Σ_{i=1}^n εi² = σ²,  while plim n/(n−k) = 1.

Collecting all terms, we have plim S²_ε = σ². ∎

Theorem 3.3 establishes consistency of S²_ε for estimating σ² by assuming that the disturbances εi are iid. This assumption is fairly restrictive and in practice often violated. A theorem for consistency of S²_ε which relaxes the iid conditions is found in Mittelhammer (1996, p. 442). There the iid assumption is replaced with the assumption that Eεi⁴ < ∞ together with certain conditions on the dependence structure of {εi}.

With a consistent estimator S²_ε for σ² we can compute

Ĉov(β̂) = S²_ε (x′x)⁻¹,

which is a consistent estimator for the covariance matrix of the LS estimator Cov(β̂) = σ²(x′x)⁻¹.
Theorem 3.4 (Asymptotic Normality of β̂ - iid case) Assume the classical assumptions of the LRM. In addition, assume that

1. the εi's are iid with P(|εi| < m) = 1 for m < ∞ and ∀i;
2. the regressors are such that |xij| < ξ < ∞ ∀i and ∀j;
3. lim_{n→∞} (1/n)x′x = Q, where Q is a finite, p.d. matrix.

Then

√n(β̂ − β) →d N(0, σ²Q⁻¹)  and  β̂ ~a N(β, n⁻¹σ²Q⁻¹).

Proof
Recall that the LS estimator can be expressed as

β̂ = β + (x′x)⁻¹x′ε  ⇒  √n(β̂ − β) = √n(x′x)⁻¹x′ε = [(1/n)x′x]⁻¹ · (1/√n)x′ε = Wn · Vn,

where Wn →p Q⁻¹ and Vn →d V ~ N(0, σ²Q).

Regarding the limiting behavior of the first term Wn, note that

plim Wn = lim [(1/n)x′x]⁻¹ = [lim (1/n)x′x]⁻¹ = Q⁻¹   (by assumption 3).

The second term Vn can be written as

Vn = (1/√n) x′ε = (1/√n) Σ_{i=1}^n xi·′ εi   ((k×1) vector),

where the sequence of vectors {xi·′εi} has the following properties:

1. E(xi·′εi) = 0 and Cov(xi·′εi) = xi·′ var(εi) xi· = σ² xi·′xi·;
2. the (xi·′εi)'s are independent, since the εi's are independent;
3. since |xij| < ξ and P(|εi| < m) = 1, the entries in (xi·′εi) are bounded in absolute value with probability 1, so that P(|xij εi| < ξm) = 1;
4. lim_{n→∞} (1/n) Σ_{i=1}^n Cov(xi·′εi) = σ² lim_{n→∞} (1/n) Σ_{i=1}^n xi·′xi· = σ²Q   (note that Σ_{i=1}^n xi·′xi· = x′x).

Thus the random vectors in {xi·′εi} satisfy the conditions of the CLT for independent bounded random vectors (see Adv. Stat. I, or Mittelhammer, 1996, Theorem 5.32), so that

Vn = (1/√n) Σ_{i=1}^n xi·′εi →d V ~ N(0, σ²Q).

Collecting all terms, we have by Slutsky's theorem

√n(β̂ − β) = Wn Vn →d Q⁻¹V ~ N(0, σ²Q⁻¹),

so that

√n(β̂ − β) →d N(0, σ²Q⁻¹)  and  β̂ ~a N(β, (σ²/n)Q⁻¹). ∎

The asymptotic distribution of the LS estimator given above is expressed in terms of the unobservable limit matrix Q = lim_{n→∞}(1/n)x′x. If we use the approximation Q ≈ (1/n)x′x, so that (σ²/n)Q⁻¹ ≈ σ²(x′x)⁻¹, we can approximate the distribution of the LS estimator by

β̂ ~a N(β, σ²(x′x)⁻¹).

Theorem 3.4 says that under certain conditions the LS estimator β̂ is, regardless of the distribution of the disturbances εi, approximately normally distributed, which is a consequence of the CLT. As we shall see later, if the εi's are normally distributed, then the normal distribution for β̂ holds in every sample, so it holds asymptotically as well.

Theorem 3.4 assumes certain conditions on the values of the regressors and regarding the stochastic behavior of the disturbances, which deserve some discussion.

1. In practice, the iid assumption is often violated, e.g., due to heteroskedastic disturbances. Also, the boundedness condition excludes many distributions for εi, like the normal, Gamma, lognormal, student-t, etc. For a theorem for asymptotic normality of β̂ which replaces those conditions by the weaker assumption that the εi's are independent with Eεi⁴ < ∞ see Mittelhammer (1996, p. 445).
2. In practice, this condition is typically met.
3. This condition requires that

lim_{n→∞} (1/n) Σ_i xij² < ∞  and  lim_{n→∞} |(1/n) Σ_i xij xil| < ∞,

which is typically satisfied in cross-sectional applications. However, note that this condition is violated if we have a (time) trend such that xij = i, since lim_{n→∞} (1/n) Σ_i i² = ∞.
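The content of Theorem 3.4 can be visualized by simulation: even with markedly non-normal (here uniform, hence bounded) disturbances, the standardized LS slope estimator is approximately standard normal. The sketch below is an illustration only; all design choices and parameter values are arbitrary, and the true σ² is used in the standardization for simplicity.

```python
# Asymptotic normality of the LS estimator (illustrative sketch).
# Disturbances are uniform: bounded and non-normal, consistent with Theorem 3.4.
import numpy as np

rng = np.random.default_rng(5)
n, reps = 200, 20_000                            # arbitrary illustration values
x = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
beta = np.array([0.5, 1.0])
xtx_inv = np.linalg.inv(x.T @ x)
sigma2 = (2.0**2) / 12.0                         # var of Uniform(-1, 1) = 1/3

z = np.empty(reps)
for r in range(reps):
    eps = rng.uniform(-1.0, 1.0, size=n)         # iid, mean 0, bounded
    y = x @ beta + eps
    b = xtx_inv @ (x.T @ y)
    se_b1 = np.sqrt(sigma2 * xtx_inv[1, 1])      # true sigma^2 used for simplicity
    z[r] = (b[1] - beta[1]) / se_b1

print("mean of standardized slope:", z.mean(), " (target 0)")
print("var  of standardized slope:", z.var(),  " (target 1)")
print("P(|z| > 1.96):", np.mean(np.abs(z) > 1.96), " (target approx 0.05)")
```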
Theorem 3.5 (Asymptotic Normality of S²_ε - iid case) Under the classical assumptions of the LRM, if the εi's are iid, and if Eεi⁴ = μ′₄ < ∞, then

√n(S²_ε − σ²) →d N(0, μ′₄ − σ⁴)  and  S²_ε ~a N(σ², n⁻¹[μ′₄ − σ⁴]).

Proof
Note that

√n(S²_ε − σ²) = √n( [1/(n−k)] ε′Mε − σ² ) = √n( [1/(n−k)] ε′[I − x(x′x)⁻¹x′]ε − σ² )
 = [ (√n/(n−k)) ε′ε − √n σ² ] − [ (√n/(n−k)) ε′x(x′x)⁻¹x′ε ] = Un − Wn,

where Un →d N(0, μ′₄ − σ⁴) and Wn →p 0.

Regarding the limiting behavior of the second term Wn, note that Wn is a positive random variable with EWn = σ²k√n/(n−k) (see the proof of Theorem 3.3). Thus, by Markov's inequality, ∀c > 0,

P(Wn ≥ c) ≤ EWn/c = σ²k√n/[(n−k)c].

Since lim_{n→∞} σ²k√n/(n−k) = 0, it follows that Wn →p 0.

Regarding the first term Un, note that by Slutsky's theorem, both Un and

Un* = [(n−k)/n] Un − kσ²/√n,  where (n−k)/n →p 1 and kσ²/√n → 0,

have the same limiting distribution. Note that

Un* = [(n−k)/n] (√n/(n−k)) ε′ε − [(n−k)/n] √n σ² − kσ²/√n = (1/√n) ε′ε − √n σ² = √n( (1/n)Σ_{i=1}^n εi² − σ² ).

Furthermore, Eεi² = σ² and Eεi⁴ = μ′₄ < ∞, so that var(εi²) = μ′₄ − σ⁴ < ∞. Hence, a direct application of the Lindeberg-Levy CLT to the sequence of iid variables {εi²} yields

Un* = √n( (1/n)Σ_{i=1}^n εi² − σ² ) →d U ~ N(0, μ′₄ − σ⁴),  so that Un →d U.

Therefore,

√n(S²_ε − σ²) →d N(0, μ′₄ − σ⁴). ∎
The classical assumptions (A1)-(A3) of the LRM form a basic-level set of assumptions on which useful properties (unbiasedness, BLUE property, consistency, asymptotic normality) of the LS estimator depend. Hence, it is instructive to discuss the effect that violations of the classical assumptions have on those basic properties of the LS estimator.

1. Violation of (A1), Eε ≠ 0: In general, this would imply that the LS estimator β̂ is a biased estimator for β, since

Eβ̂ = β + (x′x)⁻¹x′Eε ≠ β.

(Only for the special case that x′Eε = 0 does the LS estimator remain unbiased.) This violation of (A1) also implies in general that the variance estimator S²_ε would be biased and that S²_ε would be inconsistent.

2. Violation of (A2), rk(x) < k: This violation implies that there is a linear dependency among the regressors in the columns of x (see the example regarding nonlabor income, salary and total income discussed in the context of Assumption (A2)). In this case, the LS estimator β̂ = (x′x)⁻¹x′Y does not exist, since (x′x)⁻¹ does not exist. This problem is also referred to as perfect multicollinearity.

3. Violation of (A3), Cov(ε) ≠ σ²I: In this case, the error terms are heteroskedastic and/or (auto)correlated. This violation does not change the result that β̂ is unbiased for β, because its representation

β̂ = β + (x′x)⁻¹x′ε

is unaffected, and taking the expectation still yields Eβ̂ = β. However, note that the covariance matrix of β̂ is no longer σ²(x′x)⁻¹ but

Cov(β̂) = (x′x)⁻¹x′ Φ x(x′x)⁻¹,  where Φ = Eεε′.

Furthermore, under mild conditions on Φ ensuring that Cov(β̂) → 0 as n → ∞, the estimator β̂ is still consistent. However, the proof of the Gauss-Markov result (Theorem 3.1) that β̂ is BLUE breaks down, so that β̂ is no longer the best linear unbiased estimator. Finally, one can show that under the condition Eεε′ = Φ ≠ σ²I the estimator S²_ε is typically no longer² unbiased and consistent for σ².

Up to this point, we investigated the properties of the LS estimator under the classical LRM, without assuming a specific parametric distribution for Y or ε. In the classical LRM, we now introduce the additional assumption that

ε ~ N(0, σ²I)  ⇔  Y ~ N(xβ, σ²I).

Note that this assumption necessarily implies that εi ~ iid N(0, σ²).

Of course, our entire preceding discussion of the properties of β̂ and S²_ε then applies equally well to the case where ε is normally distributed, since the discussion was for arbitrary distributions of ε. In the following, we address the question what additional properties can be attributed to the estimators β̂ and S²_ε when ε is normally distributed.

Theorem 3.6 Under the classical assumptions of the LRM, if ε ~ N(0, σ²I), then

1. β̂ ~ N(β, σ²(x′x)⁻¹);
2. (n−k)S²_ε/σ² ~ χ²_(n−k);
3. β̂ and S²_ε are independent.

² See, e.g., Mittelhammer (1996, p. 456).


Proof
1. The normality of β̂ = (x′x)⁻¹x′Y follows immediately from the fact that β̂ is a linear function of the normally distributed vector Y.

2. In order to obtain the distribution of (n−k)S²_ε/σ², we write this variable as

(n−k)S²_ε/σ² = ε′Mε/σ²,  where M = I − x(x′x)⁻¹x′,

and where M is a symmetric and idempotent matrix with rk(M) = tr(M) = n − k.³ Using the spectral decomposition of M, i.e., M = PΛP′, where Λ and P are, respectively, the diagonal matrix of eigenvalues and the matrix of eigenvectors of M, we obtain

(n−k)S²_ε/σ² = ε′PΛP′ε/σ² = Z′ΛZ,  where Z = P′ε/σ ~ N(0, (1/σ²)P′(σ²I)P) = N(0, I)  (since P′P = I).

Since M is idempotent, its eigenvalues are either 0 or 1, and the number of unit eigenvalues equals tr(M) = n − k. Then

(n−k)S²_ε/σ² = Z′ [ I_(n−k) 0 ; 0 0 ] Z = Σ_{i=1}^{n−k} Zi² ~ χ²_(n−k),  since Zi ~ iid N(0, 1).

3. The independence of β̂ and S²_ε follows from an application of suitable theorems from Adv. Stat. I.⁴ ∎

From the fact that (n−k)S²_ε/σ² ~ χ²_(n−k) it follows that the variance estimator S²_ε is Gamma distributed.⁵ The first two moments of S²_ε obtain as

E[ (n−k)S²_ε/σ² ] = n − k,  so that ES²_ε = σ²,
var[ (n−k)S²_ε/σ² ] = 2(n−k),  so that [(n−k)²/σ⁴] var(S²_ε) = 2(n−k) and hence var(S²_ε) = 2σ⁴/(n−k).
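Part 2 of Theorem 3.6 is easy to check by simulation. The sketch below (an illustration with arbitrary design and parameter values) draws normal disturbances, computes (n−k)S²_ε/σ² repeatedly, and compares its first two moments with those of a χ²_(n−k) variable (mean n−k, variance 2(n−k)).

```python
# (n-k) * S^2 / sigma^2 ~ chi-square(n-k) under normal errors (illustrative sketch).
import numpy as np

rng = np.random.default_rng(6)
n, k, sigma2, reps = 30, 4, 1.5, 50_000          # arbitrary illustration values
x = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.zeros(k)                               # the true beta is irrelevant for S^2
M = np.eye(n) - x @ np.linalg.inv(x.T @ x) @ x.T # residual-generating matrix

q = np.empty(reps)
for r in range(reps):
    eps = rng.normal(scale=np.sqrt(sigma2), size=n)
    y = x @ beta + eps
    e = M @ y                                    # LS residuals (= M eps, since Mx = 0)
    q[r] = (e @ e) / sigma2                      # = (n-k) S^2 / sigma^2

print("mean:", q.mean(), " (target n-k =", n - k, ")")
print("var :", q.var(),  " (target 2(n-k) =", 2 * (n - k), ")")
```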

We now come to a further important additional property of the LS estimator that results when the disturbances are normally distributed. In that case, β̂ is not only the BLUE but also the MVUE, as stated in the following theorem.

³ See our discussion above of the unbiasedness of S²_ε.
⁴ The linear form Bx and the quadratic form x′Ax are independent if x ~ N(0, σ²I) and BA = 0, which can also be used to prove that the sample mean X̄n and the sample variance Sn² from a normal population are independent.
⁵ See also Adv. Stat. I on the distribution of the sample variance from a normal population.

Theorem 3.7 (MVUE Property of (β̂, S²_ε) Under Normality) Assume the classical assumptions of the LRM, and assume that ε ~ N(0, σ²I). Then (β̂, S²_ε) is the MVUE for (β, σ²).

Proof
Note that the normality assumption for the disturbances implies that the vector of the random sample variables Y follows a N(xβ, σ²I) distribution, which belongs to the exponential class of distributions. The form of the joint pdf of Y indexed by θ = (β′, σ²)′ is

f(y;θ) = (2π)^{−n/2} |σ²I|^{−1/2} exp{ −(1/2)(y − xβ)′(σ²I)⁻¹(y − xβ) }
 = (2πσ²)^{−n/2} exp{ −(1/(2σ²))(y − xβ)′(y − xβ) }
 = (2πσ²)^{−n/2} exp{ −(1/(2σ²)) β′x′xβ } · exp{ −(1/(2σ²)) y′y + (1/σ²) β′x′y },

with c1(θ)g1(y) = −(1/(2σ²)) · y′y and c2(θ)′g2(y) = (1/σ²)β′ · x′y.

By Theorem 2.7 (completeness in the exponential class),

g1(Y) = Y′Y and g2(Y) = x′Y

are complete sufficient statistics for estimating β and σ², since the range of

[c1(θ), c2(θ)′]′ = [−1/(2σ²), (1/σ²)β′]′

contains an open (k+1)-dimensional rectangle. Then since

β̂ = (x′x)⁻¹x′Y  and  S²_ε = (1/(n−k)) (Y − xβ̂)′(Y − xβ̂) = (1/(n−k)) (Y′Y − Y′x(x′x)⁻¹x′Y)

are functions of the complete sufficient statistics Y′Y and x′Y, and since β̂ and S²_ε are unbiased, it follows from the Lehmann-Scheffé completeness theorem (Theorem 2.8) that (β̂, S²_ε) is the MVUE for (β, σ²). ∎

Under the normality assumption, the covariance of the estimator β̂, given by σ²(x′x)⁻¹, also achieves the CRLB, whereas the variance of S²_ε, given by 2σ⁴/(n−k), does not. To show this, consider the multivariate form of the CRLB for an unbiased estimator of the m-dimensional vector q(θ), given by⁶

CRLB = [∂q(θ)/∂θ′] { E[ (∂ ln f(Y;θ)/∂θ)(∂ ln f(Y;θ)/∂θ′) ] }⁻¹ [∂q(θ)/∂θ′]′
 = [∂q(θ)/∂θ′] { −E[ ∂² ln f(Y;θ)/∂θ∂θ′ ] }⁻¹ [∂q(θ)/∂θ′]′   (under the information equality).

Since under the classical LRM with normally distributed disturbances the joint pdf f(Y;θ) is a member of the exponential class, the information equality holds true. Furthermore note that for q(θ) = θ = (β′, σ²)′ we have

∂q(θ)/∂θ′ = I_(k+1),

and the derivatives of the log of the joint pdf, given by

ln f(y;θ) = −(n/2) ln(2πσ²) − (1/(2σ²))(y − xβ)′(y − xβ)
 = −(n/2) ln(2πσ²) − (1/(2σ²)) β′x′xβ − (1/(2σ²)) y′y + (1/σ²) β′x′y,

are

∂ ln f(θ)/∂β = (1/σ²) x′(y − xβ)  and  ∂ ln f(θ)/∂σ² = −n/(2σ²) + (1/(2σ⁴))(y − xβ)′(y − xβ).

Then the second derivatives obtain as

∂² ln f(θ)/∂β∂β′ = −(1/σ²) x′x,  with E[∂² ln f(θ)/∂β∂β′] = −(1/σ²) x′x;
∂² ln f(θ)/∂β∂σ² = −(1/σ⁴) x′(y − xβ),  with E[∂² ln f(θ)/∂β∂σ²] = −(1/σ⁴) x′Eε = 0;
∂² ln f(θ)/∂[σ²]² = n/(2σ⁴) − (1/σ⁶)(y − xβ)′(y − xβ),  with E[∂² ln f(θ)/∂[σ²]²] = n/(2σ⁴) − nσ²/σ⁶ = −n/(2σ⁴).

Collecting all terms, we obtain for the CRLB for (β, σ²)

CRLB = I { −E[∂² ln f(θ)/∂θ∂θ′] }⁻¹ I = [ (1/σ²)x′x 0 ; 0 n/(2σ⁴) ]⁻¹ = [ σ²(x′x)⁻¹ 0 ; 0 2σ⁴/n ].

Then since Cov(β̂) = σ²(x′x)⁻¹, the LS estimator β̂ achieves the CRLB for unbiased estimators of β, proving that β̂ is the MVUE. However, since

var(S²_ε) = 2σ⁴/(n−k) > 2σ⁴/n,

the MVUE S²_ε does not achieve the CRLB.

⁶ This multivariate form of the CRLB obtains by a straightforward extension of the univariate CRLB; see also Mittelhammer (1996, Theorem 7.16).

3.2. The Method of Maximum Likelihood

The method of maximum likelihood (ML) can be used to estimate unknown parameters corresponding to the joint pdf of a random sample. In contrast to the LS estimator for the classical LRM, the ML method inevitably requires a fully specified stochastic model assuming a parametric family of distributions for the random sample variables.

When the parametric family of distributions of the random sample variables is fully specified, apart from the unknown distribution parameters, then the joint pdf of the random sample variables is the likelihood function. This function contains all the information available in the sample about the underlying population. The strategy by which that information is used in estimation defines the estimator.

3.2.1. The Likelihood function and the ML estimator

The ML method leads to an estimate of the parameter θ or q(θ) by maximizing the likelihood function of the parameters, given the outcome of the random sample.

The likelihood function is identical in functional form to the joint pdf of the random sample. In particular, let f(x;θ) denote the joint pdf of the random sample variables x = (X1, ..., Xn) indexed by the unknown parameter θ. Then the likelihood function is defined as

L(θ; x) ≡ f(x;θ).

Note that we write the joint pdf as a function in the data x indexed/conditioned on the parameter θ, whereas when we form the likelihood, we write this function in reverse, as a function in the parameters θ for given values of the data x.

The maximum likelihood (ML) estimate θ̂ is obtained as the value of θ that maximizes the likelihood function. Thus

θ̂ = arg max_{θ∈Ω} L(θ; x).

The ML method can be interpreted as choosing, from all candidates, the value of θ indexing the joint pdf f(x;θ) that assigns the highest probability (discrete case) or highest density weighting (continuous case) to the random sample outcome x actually observed.

Put another way, the ML estimate θ̂ defines a particular member of a parametric family of pdfs {f(x;θ), θ ∈ Ω} that assigns the highest likelihood to generating the data actually observed. The estimator implied by this procedure is the maximum likelihood estimator (MLE)

θ̂ = arg max_{θ∈Ω} L(θ; x).

If the likelihood L(θ; x) is differentiable w.r.t. θ and has a maximum in the interior of the parameter space Ω, the ML estimate θ̂ can be found as the solution of the 1st-order conditions (f.o.c.) for the maximizing value of θ, i.e.,

∂L(θ̂; x)/∂θ = ( ∂L(θ̂; x)/∂θ1, ..., ∂L(θ̂; x)/∂θk )′ = 0.

The 2nd-order condition for the maximizing value requires that the Hessian ∂²L(θ̂; x)/∂θ∂θ′ is a negative definite matrix.

Note that it may not be possible to explicitly solve the f.o.c., consisting of a system of k equations in the k unknown estimates θ̂1, ..., θ̂k. In this case, numerical methods are required to find θ̂ satisfying the f.o.c. Even if there is no interior solution or if the likelihood is not differentiable, a value θ̂ that solves max_{θ∈Ω} L(θ; x) is a ML estimate, no matter how it is obtained.

Note that in some situations the (numerical) calculations are simplified by maximizing ln L(θ; x) as opposed to L(θ; x). Since the log-transformation is strictly monotonically increasing, if θ̂ maximizes L(θ; x), it also maximizes ln L(θ; x), and vice versa, i.e.,

θ̂ = arg max_{θ∈Ω} L(θ; x) = arg max_{θ∈Ω} ln L(θ; x).
Example 3.3 Let x = (X1, ..., Xn) be a random sample from an exponential population with pdf

f(x;θ) = (1/θ) exp(−x/θ),  x ∈ (0, ∞),  Ω = (0, ∞).

Then the likelihood function is given by

L(θ; x) ≡ f(x;θ) = Π_{i=1}^n f(xi;θ) = (1/θ^n) exp( −(1/θ) Σ_{i=1}^n xi ).

The log-likelihood function is

ln L(θ; x) = −n ln θ − (1/θ) Σ_{i=1}^n xi.

The f.o.c. for maximizing ln L(θ; x) w.r.t. θ is given by

d ln L(θ̂; x)/dθ = −n/θ̂ + (Σ_{i=1}^n xi)/θ̂² = 0.

Thus the solution θ̂ = (1/n) Σ_{i=1}^n xi is the ML estimate of θ, and the MLE is θ̂ = (1/n) Σ_{i=1}^n Xi.

Note that the 2nd-order condition for a maximizer is met, since

d² ln L(θ̂; x)/dθ² = n/θ̂² − 2 Σ_{i=1}^n xi/θ̂³ = n/θ̂² − 2nθ̂/θ̂³ = −n/θ̂² < 0.
Example 3.4 Let x = (X1, ..., Xn) be a random sample from a normal population with pdf

f(x; μ, σ²) = (1/√(2πσ²)) exp( −(x−μ)²/(2σ²) ),  x ∈ (−∞, ∞),  (μ, σ²) ∈ Ω = (−∞, ∞) × (0, ∞).

Then the log-likelihood function is

ln L(μ, σ²; x) = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (xi − μ)².

The f.o.c. for maximizing ln L(μ, σ²; x) is given by

∂ ln L(μ̂, σ̂²; x)/∂μ = (1/σ̂²) Σ_{i=1}^n (xi − μ̂) = 0
∂ ln L(μ̂, σ̂²; x)/∂σ² = −n/(2σ̂²) + (1/(2[σ̂²]²)) Σ_{i=1}^n (xi − μ̂)² = 0.

The solution defines the ML estimates for μ and σ², i.e.,

μ̂ = (1/n) Σ_{i=1}^n xi  and  σ̂² = (1/n) Σ_{i=1}^n (xi − μ̂)².

Thus the sample mean X̄n = (1/n) Σ_{i=1}^n Xi and the sample variance Sn² = (1/n) Σ_{i=1}^n (Xi − X̄n)² are the MLEs of the population mean and variance for a random sample from a normal population.

To see that the 2nd-order condition for a maximizer is met, compute the Hessian of ln L(θ; x) with θ = (μ, σ²)′. There we have, evaluated at θ̂,

∂² ln L(μ, σ²; x)/∂μ² = −n/σ² = −n/σ̂²,
∂² ln L(μ, σ²; x)/∂μ∂σ² = −(1/σ⁴) Σ_{i=1}^n (xi − μ) = 0,
∂² ln L(μ, σ²; x)/∂[σ²]² = n/(2σ⁴) − (1/σ⁶) Σ_{i=1}^n (xi − μ)² = n/(2σ̂⁴) − nσ̂²/σ̂⁶ = −n/(2σ̂⁴).

Thus the Hessian evaluated at θ̂ is

∂² ln L(θ̂; x)/∂θ∂θ′ = [ −n/σ̂² 0 ; 0 −n/(2σ̂⁴) ],

which is a n.d. matrix.
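The closed-form ML estimates μ̂ = x̄ and σ̂² = (1/n)Σ(xi − x̄)² can be cross-checked against a direct numerical maximization of the log-likelihood. The sketch below (simulated data with arbitrary parameter values; an illustration, not part of the original notes) does exactly that, using a log-parameterization of σ² only to keep the variance positive during the search.

```python
# Normal MLE: closed form vs. numerical maximization (illustrative sketch).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
x = rng.normal(loc=2.0, scale=1.5, size=500)     # arbitrary illustration values

# closed-form ML estimates
mu_hat = x.mean()
sig2_hat = np.mean((x - mu_hat) ** 2)            # divisor n, not n - 1

def neg_loglik(p):
    mu, log_sig2 = p                             # parameterize sigma^2 > 0 via its log
    sig2 = np.exp(log_sig2)
    return 0.5 * len(x) * np.log(2 * np.pi * sig2) + np.sum((x - mu) ** 2) / (2 * sig2)

res = minimize(neg_loglik, x0=np.array([0.0, 0.0]))

print("closed form:", mu_hat, sig2_hat)
print("numerical  :", res.x[0], np.exp(res.x[1]))
```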

Example 3.5 Let x = (X1, ..., Xn) be a random sample from a uniform population with pdf

f(x; a, b) = (1/(b − a)) I_[a,b](x),  −∞ < a < b < ∞.

Then the likelihood function is given by

L(a, b; x) = (1/(b − a))^n Π_{i=1}^n I_[a,b](xi).

The ML estimates of a, b are obtained by the solution of the constrained maximization problem

max_{a,b} L(a, b; x)  s.t.  a ≤ xi ≤ b ∀i = 1, ..., n.

Note that L(·) is monotonically increasing in a, while the admissible values of a are bounded above by the smallest order statistic min(x1, ..., xn), and L(·) is monotonically decreasing in b, while the admissible values of b are bounded below by the largest order statistic max(x1, ..., xn). Thus the ML estimates are

â = min(x1, ..., xn)  and  b̂ = max(x1, ..., xn).
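Because the maximizer lies on the boundary of the admissible region, the uniform MLE cannot be found from first-order conditions; it is simply the pair of extreme order statistics. A minimal numerical illustration (with arbitrary sample values, as a sketch only) confirms that widening the interval beyond [min, max] strictly lowers the likelihood.

```python
# MLE for Uniform[a, b]: boundary solution at the extreme order statistics (sketch).
import numpy as np

x = np.array([0.8, 2.3, 1.1, 1.9, 0.95])         # arbitrary illustration sample
a_hat, b_hat = x.min(), x.max()

def likelihood(a, b):
    # (1/(b-a))^n if all observations lie in [a, b], else 0
    return (1.0 / (b - a)) ** len(x) if (x.min() >= a and x.max() <= b) else 0.0

print("a_hat, b_hat           :", a_hat, b_hat)
print("L at (a_hat, b_hat)    :", likelihood(a_hat, b_hat))
print("L at a wider interval  :", likelihood(a_hat - 0.5, b_hat + 0.5))  # strictly smaller
```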

3.2.2. Finite sample properties of the ML estimator

There are several reasons why we might suspect that the ML method would lead to good estimates of θ. First of all, if an unbiased estimator exists that achieves the CRLB, then, under certain conditions, the MLE will be this estimator.

Theorem 3.8 (MLE Attainment of the CRLB) If there exists an unbiased estimator, T = t(x), of θ that has a covariance matrix equal to the CRLB, and if the MLE can be defined by solving the f.o.c. for maximizing the likelihood function, then the MLE is equal to T = t(x) with probability 1.

Proof
(Univariate case) Recalling the proof of the CRLB, it is recognized that an unbiased estimator t(x) for the scalar θ attains the CRLB iff

∂ ln f(x;θ)/∂θ = K(θ, n)[t(x) − θ],

where K(θ, n) is a factor independent of x. Hence, an unbiased estimator of θ attaining the CRLB must have the form

t(x) = θ + (1/K(θ, n)) ∂ ln f(x;θ)/∂θ,

with a variance

var[t(x)] = (1/K(θ, n)²) var[ ∂ ln f(x;θ)/∂θ ] = 1 / E{[∂ ln f(x;θ)/∂θ]²}   (the CRLB).

Solving the last equation for K yields

K(θ, n) = { var[ ∂ ln f(x;θ)/∂θ ] · E{[∂ ln f(x;θ)/∂θ]²} }^{1/2} = E{[∂ ln f(x;θ)/∂θ]²},

since E[∂ ln f(x;θ)/∂θ] = ∫ [∂ ln f(x;θ)/∂θ] f(x;θ) dx = ∫ ∂f(x;θ)/∂θ dx = 0, so that var[∂ ln f(x;θ)/∂θ] = E{[∂ ln f(x;θ)/∂θ]²}.

Accounting for this functional form of K, we obtain for an unbiased estimator of θ attaining the CRLB the form

t(x) = θ + [∂ ln f(x;θ)/∂θ] / E{[∂ ln f(x;θ)/∂θ]²},

and since ln L(θ; x) = ln f(x;θ),

t(x) = θ + [∂ ln L(θ; x)/∂θ] / E{[∂ ln L(θ; x)/∂θ]²}.

Substituting the ML estimate θ̂ for θ in the preceding equality implies that t(x) = θ̂, since θ̂ satisfies the f.o.c.

∂ ln L(θ̂; x)/∂θ = 0.

Then outcomes of the MLE and t(x) coincide with probability 1. The extension of the corresponding arguments to the multivariate case is straightforward; for details see Mittelhammer (1996, p. 470). ∎

Theorem 3.8 says that

- if there exists an unbiased estimator for θ which attains the CRLB and
- if the MLE is defined by solving the f.o.c. for maximizing the likelihood,

then the MLE will be the MVUE for θ.

This raises the question whether or not the MLE will still be the MVUE in cases where there is no unbiased estimator attaining the CRLB. Hints to address this issue are provided by the following theorem.

3. Point estimation methods


Theorem 3.9 (Unique MLEs are Functions of any Sufficient Statistics for f (x; ))
is uniquely defined in terms of x. If S = [S1 , . . . , Sr ]0 is
Assume that the MLE of , say ,
any vector of sufficient statistics for f (x; ) L(; x), then there exists a function of S, say
= (s).
(S), such that

Proof
The Neyman factorization theorem states that we can decompose the likelihood function as
L(; x) f (x; ) = g(s1 , ..., sr ; ) h(x),
where
S1 , ..., Sr are sufficient statistics that can be complete sufficient statistics if they exist;
g and h are nonnegative-valued functions;
h is independent of .
It follows that for a given value of x, if the MLE is unique, then

= arg max L(; x)

= arg max g(s1 , ..., sr ; ),

(hdoes not depend on )

maximizes L(; x) iff


maximizes g(s1 , ..., sr ; ). But the latter maximization problem
i.e.,
implies that the unique maximizer w.r.t. is then a function of s1 , ..., sr , i.e.,
= (s1 , ..., sr ).


If the sufficient statistics S1 , ..., Sr used in the Neyman factorization are complete, then the
is a function of the complete sufficient statistics, by Theorem 3.9. It
unique MLE
is additionally
follows from the Lehmann-Scheff completeness theorem that if the MLE
unbiased, then the MLE is the MVUE for . Thus if there exist complete sufficient
statistics and if the MLE is unique and unbiased then the MLE is the MVUE.

Example 3.6 Let x = (X1 , ..., Xn ) be a random sample from an exponential population with
pdf
x (0, ), = (0, )
f (x; ) = 1 exp( x ),

69

3. Point estimation methods


with EX = . The MLE of is unique and given by (see the example discussing the MLE for
the parameter of an exponential population)
=

1
n

Pn

i=1

Xi

= .
E

with

Since the joint pdf of the random sample variables belongs to the exponential class and has the
form
P
f (x, ) = exp{ 1 ni=1 xi n ln },
|

{z

c()g(x)

is a function of
g(x) = ni=1 is a complete sufficient statistic. By Theorem 3.9 the MLE
Pn
the complete sufficient statistic ( i=1 Xi ). Since the MLE is additionally unbiased it is the
MVUE of .
P

Up to this point, we have seen that
- the ML method is a rather straightforward approach to define estimators,
- and in some cases the resulting estimator will be unbiased and be the MVUE.
- On the other hand, the MLE need not be unbiased nor be the MVUE.
It is useful to examine asymptotic properties of MLEs since, even if the MLE does not possess desirable finite-sample properties, it might possess desirable large-sample properties under general conditions; MLEs are in fact most attractive because of their large-sample properties.

3.2.3. Large sample properties of the ML estimator


There are two general approaches that we can follow in establishing the asymptotic properties of the MLE.
The first approach can be followed if the MLE can be represented as an explicit function of the random sample variables, $\hat\theta = \hat\theta(x)$. Then it might be possible to apply WLLNs and/or CLTs to $\hat\theta(x)$ directly to investigate the asymptotic properties of the MLE.
However, there are cases where the MLE cannot be represented as an explicit function of the
random sample since the f.o.c. for maximizing the likelihood is a system of highly non-linear
equations without an explicit solution which needs to be solved numerically.
Then one might follow the second approach, which tries to identify regularity conditions on
the likelihood functions that ensure that the MLE is consistent, asymptotically normal and
asymptotically efficient.

Note the following examples.

Example 3.7 Recall the example discussing the MLE for the parameter of an exponential population. There we found that the MLE for $\theta$ is given by
\[ \hat\theta = \hat\theta(x) = \tfrac{1}{n}\sum_{i=1}^n X_i. \]
By application of Khinchin's WLLN and the CLT of Lindberg-Levy to this MLE (available as an explicit function of $x$) we can directly establish that
\[ \operatorname{plim}\hat\theta = \theta \qquad\text{and}\qquad \hat\theta \overset{a}{\sim} N(\theta,\ \theta^2/n). \]

Example 3.8 Let $x = (X_1,\ldots,X_n)$ be a random sample from a Gamma population with pdf
\[ f(x;\theta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,x^{\alpha-1}\exp(-\tfrac{x}{\beta}), \qquad x\in(0,\infty),\quad \theta=(\alpha,\beta). \]
The likelihood function is
\[ L(\theta;x) = \frac{1}{\beta^{n\alpha}\Gamma(\alpha)^n}\prod_{i=1}^n x_i^{\alpha-1}\exp\Big(-\frac{\sum_{i=1}^n x_i}{\beta}\Big), \]
and the log-likelihood is
\[ \ln L(\theta;x) = -n\alpha\ln(\beta) - n[\ln\Gamma(\alpha)] + (\alpha-1)\sum_{i=1}^n \ln(x_i) - \frac{\sum_{i=1}^n x_i}{\beta}. \]
The f.o.c.s characterizing the maximizer $\hat\theta = (\hat\alpha,\hat\beta)$ of the log-likelihood are given by
\[ \frac{\partial\ln L}{\partial\alpha} = -n\ln(\beta) - n\,\frac{d\Gamma(\alpha)/d\alpha}{\Gamma(\alpha)} + \sum_{i=1}^n\ln(x_i) = 0, \qquad \frac{\partial\ln L}{\partial\beta} = -\frac{n\alpha}{\beta} + \frac{1}{\beta^2}\sum_{i=1}^n x_i = 0. \]
Note that there is no explicit solution of this nonlinear system of f.o.c.s for $\hat\theta$ in terms of $x$, although $\hat\theta$ is implicitly a function of $x$. A unique value for $\hat\theta$ satisfying the f.o.c.s needs to be obtained numerically.
Since an explicit functional form for the MLE is not identifiable, an analysis of consistency and asymptotic normality using the first approach is quite difficult. In such cases, we might try to identify appropriate regularity conditions on $L(\theta;x)$ that represent sufficient conditions for the MLE to be consistent and asymptotically normal and that are met by the specific likelihood $L(\theta;x)$ under consideration.
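In practice the f.o.c.s above are solved numerically. The following sketch (Python with NumPy/SciPy assumed; the data are simulated, and the negative log-likelihood is minimized directly rather than solving the f.o.c.s) illustrates one way to obtain the ML estimate of θ = (α, β) for the Gamma model.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(1)
x = rng.gamma(shape=3.0, scale=2.0, size=500)   # simulated sample, true (alpha, beta) = (3, 2)
n = x.size

def neg_loglik(params):
    a, b = params
    if a <= 0 or b <= 0:                         # keep the optimizer inside the parameter space
        return np.inf
    return -(-n * a * np.log(b) - n * gammaln(a)
             + (a - 1) * np.sum(np.log(x)) - np.sum(x) / b)

res = minimize(neg_loglik, x0=np.array([1.0, 1.0]), method="Nelder-Mead")
print("ML estimate (alpha, beta):", res.x)
```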

Consistency of the MLE



Theorem 3.10 (MLE Consistency - iid and scalar case) Let $X_1,\ldots,X_n$ be an iid random sample from a population with pdf $f(x,\theta)$, where $\theta$ is a scalar. Assume that
R1. the joint pdfs $\{f(x;\theta),\ \theta\in\Omega\}$ have common support, $\forall\theta\in\Omega$;
R2. the parameter space, $\Omega$, is an open interval;
R3. $\ln L(\theta;x)$ is continuously differentiable w.r.t. $\theta$, $\forall x$ and $\forall\theta\in\Omega$;
R4. $\partial\ln L(\theta;x)/\partial\theta = 0$ has a unique solution for $\theta$, and the solution defines the unique ML estimate, $\hat\theta(x)$, $\forall x$.
Then $\hat\theta \xrightarrow{p} \theta_0$ (the true value of $\theta$), and the MLE is thus consistent for $\theta$.

Proof
For any $\epsilon>0$, let
\[ \theta_\ell = \theta_0-\epsilon, \qquad \theta_h = \theta_0+\epsilon, \qquad\text{such that } \theta_\ell,\theta_h\in\Omega \]
(such $\theta$'s exist by R2). Now define the following event set
\[ H_n = \{x : \ln L(\theta_0;x) > \ln L(\theta_\ell;x) \ \&\ \ln L(\theta_0;x) > \ln L(\theta_h;x)\}. \]
Note that for all $x\in H_n$ we have
\[ \hat\theta\in(\underbrace{\theta_0-\epsilon}_{\theta_\ell},\ \underbrace{\theta_0+\epsilon}_{\theta_h}), \]
since the MLE is unique and is defined by $\partial\ln L(\hat\theta;x)/\partial\theta = 0$. Hence, in order to show the consistency of the MLE we need to show that $\lim_{n\to\infty}P(x\in H_n) = 1$.
Now define the event set
\[ A_n(\theta) = \{x : \ln L(\theta_0;x) > \ln L(\theta;x)\} \qquad\text{for } \theta\ne\theta_0, \]
and note that the event $x\in A_n(\theta)$ can equivalently be represented as
\begin{align*}
x\in A_n(\theta) &\Longleftrightarrow \ln L(\theta;x) - \ln L(\theta_0;x) < 0\\
&\Longleftrightarrow \frac{1}{n}\sum_{i=1}^n\big[\ln f(x_i;\theta) - \ln f(x_i;\theta_0)\big] < 0 &&\text{(by the iid property of the $X_i$'s)}\\
&\Longleftrightarrow \frac{1}{n}\sum_{i=1}^n \ln\Big(\frac{f(x_i;\theta)}{f(x_i;\theta_0)}\Big) < 0\\
&\Longleftrightarrow \lambda_n(x) < 0.
\end{align*}
Because $\lambda_n(x)$ is the sample mean of iid random variables, Khinchin's WLLN and Jensen's inequality imply that
\[ \lambda_n(x) \xrightarrow{p} \underbrace{E\ln\Big[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\Big] < \ln E\Big[\frac{f(X_i;\theta)}{f(X_i;\theta_0)}\Big]}_{\text{(by Jensen's inequality and the concavity of $\ln(z)$)}} = \underbrace{\ln(1)}_{(\int\frac{f(\cdot;\theta)}{f(\cdot;\theta_0)}f(\cdot;\theta_0)\,dx\,=\,1)} = 0. \]
Thus,
\[ \operatorname{plim}\lambda_n(x) < 0, \qquad\text{which implies that}\qquad \lim_{n\to\infty}P[x\in A_n(\theta)] = 1 \quad\text{when } \theta\ne\theta_0. \]
Since the last equation holds for all $\theta$'s, including $\theta_\ell$ and $\theta_h$, this in turn implies that
\[ \lim_{n\to\infty}P[x\in A_n(\theta_\ell)] = \lim_{n\to\infty}P[x\in A_n(\theta_h)] = 1. \]
Then, since $H_n = A_n(\theta_\ell)\cap A_n(\theta_h)$, we have
\[ \lim_{n\to\infty}P[x\in H_n] = \lim_{n\to\infty}P[x\in\{A_n(\theta_\ell)\cap A_n(\theta_h)\}] = 1 \]
(since by Bonferroni's inequality $P(A_\ell\cap A_h)\ge 1-P(\bar A_\ell)-P(\bar A_h)$, and $\lim_{n\to\infty}P(\bar A_\ell)=0$ and $\lim_{n\to\infty}P(\bar A_h)=0$).


Theorem 3.10 establishing the consistency of the MLE is based upon the assumption that the
random sample variables are iid, which in many practical applications is hard to justify.
This fairly restrictive iid assumption is not needed if we add to the list of the 4 regularity
conditions in Theorem 3.10 the (fairly weak) condition
R5. $\lim_{n\to\infty}P[\ln L(\theta_0;x) > \ln L(\theta;x)] = 1$ for $\theta\ne\theta_0$.
This condition essentially requires that the likelihood is such that as $n\to\infty$ the true value $\theta_0$ maximizes the likelihood (and hence satisfies the definition of the ML estimate) with probability 1. For further details see Mittelhammer (1996, Theorem 8.14).
Theorem 3.10 establishes the consistency of the MLE in the scalar case. The extension of the arguments to the multivariate case, when $\theta$ is a $k$-dimensional vector, is straightforward. The following theorem establishes such a multivariate extension.

Theorem 3.11 (MLE Consistency - Sufficient Conditions) Let $\{f(x;\theta),\ \theta\in\Omega\}$ be the statistical model for the random sample $x$. Let $N(\delta) = \{\theta : d(\theta,\theta_0) < \delta\}$ be an open neighborhood of $\theta_0$ (true value of $\theta$).$^7$ Assume
M1. the pdfs $f(x;\theta)$, $\theta\in\Omega$, have common support, $\forall\theta\in\Omega$;
M2. $\ln L(\theta;x)$ has continuous first-order partial derivatives w.r.t. $\theta$, $\forall x$;
M3. $\partial\ln L(\theta;x)/\partial\theta = 0$ has a unique solution that defines the unique ML estimate $\hat\theta = \arg\max_{\theta\in\Omega} L(\theta;x)$, $\forall x$;
M4. $\lim_{n\to\infty} P[\ln L(\theta_0;x) > \max_{\theta\notin N(\delta)}\ln L(\theta)] = 1$ $\forall\delta > 0$, with $\Omega$ being an open rectangle containing $\theta_0$.
Then the MLE, $\hat\theta$, is such that $\hat\theta \xrightarrow{p} \theta_0$.

$^7$ $N(\delta)$ is an open interval, the interior of a circle, the interior of a sphere, and the interior of a hypersphere in 1, 2, 3, and 4 dimensions, respectively.
Proof
See Mittelhammer (1996, p.477-478).


For the MLE to be asymptotically normally distributed, additional conditions on the likelihood
function are needed. The following Theorem presents a collection of sufficient conditions that
ensure the asymptotic normality of the MLE of in the scalar and iid case.

Theorem 3.12 (MLE Asymptotic Normality - iid and scalar case) In addition to conditions (R1)-(R4) of Theorem 3.10 assume that
R6. $\partial^2\ln L(\theta;x)/\partial\theta^2$ exists and is continuous in $\theta$, $\forall x$;
R7. $\operatorname{plim}\frac{1}{n}\big(\partial^2\ln L(\theta_n^*;x)/\partial\theta^2\big) = H(\theta_0)\ne 0$ for any sequence of random variables $\{\theta_n^*\}$ such that $\operatorname{plim}\theta_n^* = \theta_0$.
Then the MLE $\hat\theta$ is such that
\[ \sqrt{n}(\hat\theta-\theta_0) \xrightarrow{d} N\Big(0,\ \frac{E\big[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\big]}{H(\theta_0)^2}\Big) \qquad\text{and}\qquad \hat\theta \overset{a}{\sim} N\Big(\theta_0,\ \frac{1}{n}\,\frac{E\big[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\big]}{H(\theta_0)^2}\Big). \]

Proof
The f.o.c. for maximizing the likelihood implies
\[ \frac{\partial\ln L(\hat\theta;x)}{\partial\theta} = 0. \]
Representing the l.h.s. by a Taylor series expansion around $\theta_0$ yields
\[ \frac{\partial\ln L(\hat\theta;x)}{\partial\theta} = \frac{\partial\ln L(\theta_0;x)}{\partial\theta} + \frac{\partial^2\ln L(\theta^*;x)}{\partial\theta^2}(\hat\theta-\theta_0) = 0, \qquad\text{where } \theta^* = \lambda\hat\theta+(1-\lambda)\theta_0 \text{ for } \lambda\in[0,1]. \]
Then, multiplying the Taylor series expansion by $1/\sqrt{n}$ and exploiting the iid assumption for the random sample variables obtains
\[ \sqrt{n}(\hat\theta-\theta_0) = -\Big[\frac{1}{n}\frac{\partial^2\ln L(\theta^*;x)}{\partial\theta^2}\Big]^{-1}\Big[\frac{1}{\sqrt{n}}\frac{\partial\ln L(\theta_0;x)}{\partial\theta}\Big] \overset{\text{(iid)}}{=} -\underbrace{\Big[\frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ln f(X_i;\theta^*)}{\partial\theta^2}\Big]^{-1}}_{[U_n]^{-1},\ \ U_n\to H(\theta_0)}\ \underbrace{\Big[\frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Big]}_{W_n\ \to\ N\big(0,\,E[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2]\big)}. \]
Regarding the limiting behavior of the first term $U_n$, note that $\hat\theta\xrightarrow{p}\theta_0$ implies that $\theta^* = \lambda\hat\theta+(1-\lambda)\theta_0\xrightarrow{p}\theta_0$. Hence we obtain
\[ U_n = \frac{1}{n}\sum_{i=1}^n\frac{\partial^2\ln f(X_i;\theta^*)}{\partial\theta^2} \xrightarrow{p} E\Big[\frac{\partial^2\ln f(X_i;\theta_0)}{\partial\theta^2}\Big] \overset{\text{(say)}}{=} H(\theta_0)\ne 0 \]
(by the WLLN and the plim for continuous functions).
Regarding the second term $W_n$, note that the iid assumption for the $X_i$'s implies that the summands $\{\partial\ln f(X_i;\theta_0)/\partial\theta\}$ are iid random variables
1. with $E[\partial\ln f(X_i;\theta_0)/\partial\theta] = 0$ (since $E[\partial\ln f/\partial\theta] = \int[\partial\ln f/\partial\theta]f\,dx = \int[1/f][\partial f/\partial\theta]f\,dx = \int[\partial f/\partial\theta]\,dx = \partial[\int f\,dx]/\partial\theta = 0$),
2. and with $\mathrm{Var}[\partial\ln f(X_i;\theta_0)/\partial\theta] = E[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2]$.
Hence, we can use the CLT of Lindberg-Levy to obtain
\[ W_n = \frac{1}{\sqrt{n}}\sum_{i=1}^n\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta} \xrightarrow{d} W\sim N\Big(0,\ E\Big[\Big(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Big)^2\Big]\Big). \]
Collecting all terms, we have by Slutsky
\[ \sqrt{n}(\hat\theta-\theta_0) = -[U_n]^{-1}W_n \xrightarrow{d} -[H(\theta_0)]^{-1}W \sim N\Big(0,\ \frac{E\big[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\big]}{H(\theta_0)^2}\Big). \]


If the joint pdf defining the likelihood function meets the conditions for the information equality (as pdfs from the exponential class do), then we can write
\[ E\Big[\Big(\frac{\partial\ln f(X_i;\theta)}{\partial\theta}\Big)^2\Big] = -E\Big[\frac{\partial^2\ln f(X_i;\theta)}{\partial\theta^2}\Big] = -H(\theta), \]
so that $E[(\partial\ln f(X_i;\theta)/\partial\theta)^2]/H(\theta)^2 = [-H(\theta)]^{-1}$, which leads to the following form of the asymptotic distribution:
\[ \sqrt{n}(\hat\theta-\theta_0) \xrightarrow{d} N\Big(0,\ [-H(\theta_0)]^{-1}\Big) = N\Big(0,\ \Big(E\Big[\Big(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Big)^2\Big]\Big)^{-1}\Big), \]
\[ \hat\theta \overset{a}{\sim} N\Big(\theta_0,\ [-n\,H(\theta_0)]^{-1}\Big) = N\Big(\theta_0,\ \underbrace{\Big(n\,E\Big[\Big(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Big)^2\Big]\Big)^{-1}}_{\text{CRLB for unbiased estimators}}\Big). \]

In such cases, the MLE is the best asymptotically normal estimator, i.e. asymptotically
efficient.
In general, an estimator T of is said to be asymptotically efficient if it has an asymptotic
normal distribution with mean and a variance equal to the CRLB (see Mittelhammer, 1996,
Definition 7.22).
Note that the variance of the asymptotic distribution of the MLE is unknown, since it depends via
\[ H(\theta_0) = E\Big[\frac{\partial^2\ln f(X_i;\theta_0)}{\partial\theta^2}\Big] \qquad\text{and}\qquad E\Big[\Big(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Big)^2\Big] \]
on the unknown parameter $\theta_0$ that must be estimated. Consistent estimates for the variance of the asymptotic distribution of MLEs (which are necessary for asymptotic hypothesis tests) can be obtained by replacing the two unknown expectations by consistent estimates, given by
\[ \widehat{E}\Big[\frac{\partial^2\ln f(X_i;\theta_0)}{\partial\theta^2}\Big] = \frac{1}{n}\sum_{i=1}^n \frac{\partial^2\ln f(X_i;\hat\theta)}{\partial\theta^2}, \qquad \widehat{E}\Big[\Big(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Big)^2\Big] = \frac{1}{n}\sum_{i=1}^n \Big(\frac{\partial\ln f(X_i;\hat\theta)}{\partial\theta}\Big)^2. \]
Based upon those estimates we can compute the so-called asymptotic standard errors of MLEs.
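To illustrate how such asymptotic standard errors can be computed in practice, the following sketch (Python; it assumes the exponential model from the examples, with the derivatives of ln f evaluated analytically) estimates the two expectations above by their sample counterparts and forms the resulting standard error.

```python
import numpy as np

rng = np.random.default_rng(2)
theta0, n = 2.0, 200
x = rng.exponential(scale=theta0, size=n)

theta_hat = x.mean()                               # MLE in the exponential model

# derivatives of ln f(x; theta) = -ln(theta) - x/theta, evaluated at theta_hat
score = -1 / theta_hat + x / theta_hat**2          # d ln f / d theta
hess = 1 / theta_hat**2 - 2 * x / theta_hat**3     # d^2 ln f / d theta^2

H_hat = hess.mean()                                # estimate of H(theta_0)
I_hat = (score**2).mean()                          # estimate of E[(d ln f / d theta)^2]

avar = I_hat / (n * H_hat**2)                      # estimated asymptotic variance of the MLE
print("asymptotic std. error :", np.sqrt(avar))
print("theoretical theta^2/n :", np.sqrt(theta_hat**2 / n))
```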

Example 3.9 Let $x = (X_1,\ldots,X_n)$ be a random sample from an exponential population with pdf
\[ f(x;\theta) = \tfrac{1}{\theta}\exp(-\tfrac{x}{\theta}), \qquad x\in(0,\infty),\quad \Omega=(0,\infty), \]
and a log-likelihood function
\[ \ln L(\theta;x) = -n\ln\theta - \frac{1}{\theta}\sum_{i=1}^n x_i. \]
Now we can attempt to utilize the consistency Theorem 3.10 and the asymptotic-normality Theorem 3.12 for demonstrating the consistency and asymptotic normality of the MLE of $\theta$. This requires us to check whether the regularity conditions (R1)-(R4) (for consistency) and (R6), (R7) (for asymptotic normality) are satisfied by the particular (log-)likelihood function.
R1. The support of the joint pdf $f(x,\theta)$ consists of $x\in\mathbb{R}^n_+$ and is independent of $\theta$. ✓
R2. The parameter space is $\Omega=(0,\infty)$, which is an open interval. ✓
R3. The log-likelihood is continuously differentiable w.r.t. $\theta$ for all $x\in\mathbb{R}^n_+$ and $\theta>0$. ✓
R4. The f.o.c. given by
\[ \frac{\partial\ln L(\theta;x)}{\partial\theta} = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^n x_i = 0 \]
has a unique solution defining the ML estimate $\hat\theta = (\sum_{i=1}^n x_i)/n$. ✓
Hence, by Theorem 3.10, the MLE $\hat\theta$ is consistent.
As to the regularity conditions (R6) and (R7) of Theorem 3.12:
R6. The second derivative given by
\[ \frac{\partial^2\ln L(\theta;x)}{\partial\theta^2} = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{i=1}^n x_i \]
exists and is continuous in $\theta$ for all $x\in\mathbb{R}^n_+$ and $\theta>0$. ✓
R7. Assuming that $\operatorname{plim}\theta_n^* = \theta_0$, it follows by Slutsky that
\[ \operatorname{plim}\frac{1}{n}\frac{\partial^2\ln L(\theta_n^*;x)}{\partial\theta^2} = \operatorname{plim}\Big[\frac{1}{[\theta_n^*]^2} - \frac{2}{[\theta_n^*]^3}\,\frac{1}{n}\sum_{i=1}^n X_i\Big] = -\frac{1}{\theta_0^2} = H(\theta_0)\ne 0. \ \checkmark \]
Hence, it follows by Theorem 3.12 that the MLE $\hat\theta$ is such that
\[ \hat\theta \overset{a}{\sim} N\Big(\theta_0,\ \frac{1}{n}\,\frac{E\big[(\partial\ln f(X_i;\theta_0)/\partial\theta)^2\big]}{H(\theta_0)^2}\Big). \]
With regard to the particular form of the asymptotic variance note that
\[ H(\theta_0)^2 = \Big(-\frac{1}{\theta_0^2}\Big)^2 = \frac{1}{\theta_0^4}, \]
and
\[ E\Big[\Big(\frac{\partial\ln f(X_i;\theta_0)}{\partial\theta}\Big)^2\Big] = E\Big[\Big(-\frac{1}{\theta_0}+\frac{X_i}{\theta_0^2}\Big)^2\Big] = E\Big[\frac{1}{\theta_0^2} - \frac{2X_i}{\theta_0^3} + \frac{X_i^2}{\theta_0^4}\Big] = \frac{1}{\theta_0^2} = -H(\theta_0), \]
such that the information equality is satisfied. Collecting all terms of the asymptotic variance, we have
\[ \hat\theta \overset{a}{\sim} N\Big(\theta_0,\ \frac{1}{n}\theta_0^2\Big). \]

Theorem 3.12 establishes the asymptotic normality of the MLE in the scalar and iid case. The
following theorem provides an extension to the multivariate case when is a k-dimensional
vector and to the case where the random sample variables are not iid.

Theorem 3.13 (MLE Asymptotic Normality - Sufficient Conditions) In addition to conditions (M1)-(M4) of Theorem 3.11, assume
M5. $\partial^2\ln L(\theta;x)/\partial\theta\partial\theta'$ exists and is continuous in $\theta$, $\forall x$;
M6. $\operatorname{plim}\frac{1}{n}\big(\partial^2\ln L(\theta_n^*;x)/\partial\theta\partial\theta'\big) = H(\theta_0)$ is a nonsingular matrix for any sequence of random variables $\{\theta_n^*\}$ such that $\operatorname{plim}\theta_n^* = \theta_0$;
M7. $n^{-1/2}\big[\partial\ln L(\theta_0;x)/\partial\theta\big] \xrightarrow{d} N(0, M(\theta_0))$, where $M(\theta_0)$ is a symmetric, p.d. matrix.
Then the MLE, $\hat\theta$, is such that
\[ \sqrt{n}\big(\hat\theta-\theta_0\big) \xrightarrow{d} N\big(0,\ H(\theta_0)^{-1}M(\theta_0)H(\theta_0)^{-1}\big). \]

Proof
See Mittelhammer (1996, p.480-482).


The regularity conditions (M5) and (M6) correspond to the conditions (R6) and (R7) used for the univariate case in Theorem 3.12, while condition (M7) replaces the iid assumption used in Theorem 3.12.

3.2.4. MLE invariance principle


A useful property of the MLE is what has come to be known as the invariance property of
MLEs.


Suppose that the joint pdf for a random sample is indexed by a parameter $\theta$ and the interest is in finding an estimator for some function of $\theta$, say $q(\theta)$.
In a nutshell, the invariance property says that if $\hat\theta$ is the MLE of $\theta$, then $q(\hat\theta)$ is the MLE of $q(\theta)$.
The invariance property is a mathematical result of the method of computing MLEs and is established in the following theorem.$^8$
Theorem 3.14 (MLE Invariance Property - scalar case) Let $\hat\theta$ be an MLE of the scalar $\theta$, and let $q(\theta)$ be a real-valued function of $\theta$. Then $q(\hat\theta)$ is an MLE of $q(\theta)$.

Proof
Consider for $\gamma = q(\theta)$ the induced likelihood, say $L^*$, defined as the maximum likelihood value which can be achieved for the set of $\theta$-values generating a fixed $\gamma$, i.e., by
\[ L^*(\gamma;x) = \max_{\{\theta:\, q(\theta)=\gamma\}} L(\theta;x). \]
By definition, the value $\hat\gamma$ that maximizes $L^*(\gamma;x)$ is the MLE of $\gamma = q(\theta)$. Hence, we must show that $L^*(\hat\gamma;x) = L^*[q(\hat\theta);x]$. Note that
\begin{align*}
L^*(\hat\gamma;x) &= \max_\gamma\ \max_{\{\theta:\, q(\theta)=\gamma\}} L(\theta;x) &&\text{(by definition of $L^*$)}\\
&= \max_\theta L(\theta;x) &&\text{(since the iterated maximization $=$ unconditional maximization over $\theta$)}\\
&= L(\hat\theta;x). &&\text{(by definition of $\hat\theta$)}
\end{align*}
Furthermore,
\begin{align*}
L(\hat\theta;x) &= \max_{\{\theta:\, q(\theta)=q(\hat\theta)\}} L(\theta;x) &&\text{(since $\hat\theta$ is the ML estimate of $\theta$)}\\
&= L^*[q(\hat\theta);x]. &&\text{(by definition of $L^*$)}
\end{align*}
Hence, the sequence of equalities shows that $L^*(\hat\gamma;x) = L^*[q(\hat\theta);x]$ and that $q(\hat\theta)$ is the MLE of $q(\theta)$.

Example 3.10 We found that the MLE for the parameter $\theta$ of an exponential population is given by $\hat\theta = \frac{1}{n}\sum_{i=1}^n X_i$. Then by the invariance principle, the MLE of $\theta^2$ is given by $\widehat{\theta^2} = \big(\frac{1}{n}\sum_{i=1}^n X_i\big)^2$.
8

For a multivariate version of this theorem see Mittelhammer (1996, Theorem 8.20).


The invariance property provides the ML estimate for functions $q(\theta)$ once an ML estimate for $\theta$ has been obtained. This raises the question whether or not the properties of the MLE $\hat\theta$ transfer to the estimator $q(\hat\theta)$. Under fairly general conditions, one can show that
- if $\hat\theta$ is a consistent estimator of $\theta$, then $q(\hat\theta)$ is a consistent estimator of $q(\theta)$, and
- if $\hat\theta$ is asymptotically normal, then $q(\hat\theta)$ is also asymptotically normal.
The first property directly follows from the continuous mapping theorem, which says that $\operatorname{plim} q(\hat\theta) = q(\operatorname{plim}\hat\theta) = q(\theta)$.
The second property directly follows from the delta method, which says that a smooth function of a random variable which is asymptotically normal is also asymptotically normal.
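To make the delta-method step concrete, the following sketch (Python; the exponential example is reused, and the analytic derivative of q(θ) = θ² is an assumption of this illustration) computes an asymptotic standard error for the MLE of θ² from the asymptotic variance of the MLE of θ.

```python
import numpy as np

rng = np.random.default_rng(3)
theta0, n = 2.0, 400
x = rng.exponential(scale=theta0, size=n)

theta_hat = x.mean()                  # MLE of theta; avar(theta_hat) ~= theta^2 / n
avar_theta = theta_hat**2 / n

q_hat = theta_hat**2                  # MLE of q(theta) = theta^2 (invariance)
dq = 2 * theta_hat                    # derivative q'(theta) evaluated at the MLE

avar_q = dq**2 * avar_theta           # delta method: avar(q_hat) ~= q'(theta)^2 * avar(theta_hat)
print("q_hat:", q_hat, " asymptotic s.e.:", np.sqrt(avar_q))
```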
Summary of the MLE Properties
We have seen that the ML procedure is a relatively straightforward approach to obtain estimates
of parameters indexing the joint pdf of random sample variables or of any function q().
The MLE possesses the following useful properties, which makes the ML approach very attractive:
- If an unbiased estimator of $\theta$ that achieves the CRLB exists, and if the MLE is defined by solving the f.o.c.s, then the MLE achieves the CRLB.
- If an MLE is unique, then the MLE is a function of any set of sufficient statistics, including complete sufficient statistics provided that they exist.
- If an MLE is unique and unbiased, and complete sufficient statistics exist, then the MLE is the MVUE.
- However, an MLE is not necessarily unbiased.
- Under general regularity conditions on the likelihood function, the MLE is consistent, asymptotically normal and asymptotically efficient.
- If $\hat\theta$ is the MLE of $\theta$, then $q(\hat\theta)$ is the MLE of $q(\theta)$.

3.3. The (generalized) method of moments


The MLE has the attractive properties of being consistent, asymptotically normal and asymptotically efficient in the context of the fully specified stochastic model. The possible shortcoming in
this result is that to possess those properties, it is necessary to make possibly strong, restrictive
assumptions about the parametric distribution of the population under consideration.
The Method of Moments (MM) approach discussed in this section allows us to move away
from parametric assumptions, towards estimators which are robust to some variations in the
underlying population. The MM method leads to an estimate of the parameter by matching population (or theoretical) moments (which are functions of ) with appropriate sample
moments.
The first step is to define the moment conditions which specify which moments to match.
Definition (Moment Conditions): Let $Y = (Y_1,\ldots,Y_n)$ be a random sample from a statistical model $\{f(y;\theta),\ \theta\in\Omega\subset\mathbb{R}^k\}$ with true parameter value $\theta_0$. Let $g(Y_t,\theta)$ be a continuous $\ell$-dimensional vector function of $\theta$ with $\ell\ge k$ such that
\[ Eg(Y_t,\theta_0) = 0, \qquad t = 1,\ldots,n. \]
This set of $\ell$ equations is called moment conditions and the vector function $g$ is called the moment function.
Example 3.11 Let $Y = (Y_1,\ldots,Y_n)$ be a random sample from a Gamma population with pdf
\[ f(y;\theta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,y^{\alpha-1}\exp(-\tfrac{y}{\beta}), \qquad y\in(0,\infty),\quad \theta=(\alpha,\beta). \]
The first two non-central moments are
\[ EY_t = \alpha_0\beta_0, \qquad EY_t^2 = \alpha_0\beta_0^2(1+\alpha_0). \]
Hence, by setting
\[ g(Y_t,\theta) = \begin{bmatrix} Y_t-\alpha\beta\\ Y_t^2-\alpha\beta^2(1+\alpha)\end{bmatrix}, \]
we obtain the following moment conditions
\[ Eg(Y_t,\theta_0) = E\begin{bmatrix} Y_t-\alpha_0\beta_0\\ Y_t^2-\alpha_0\beta_0^2(1+\alpha_0)\end{bmatrix} = 0. \]
In the case where ` = k, the parameter is exactly identified by the moment conditions
Eg(Yt , ) = 0.
Then the moment conditions represent a set of k equations for k unknowns (see the previous
example). Solving these equations would give us the value of $\theta$ which satisfies the moment
conditions, and this would be the true value 0 . However, note that (for a given value of )
we cannot observe Eg(Yt , ), only g(Yt , ) !
In the case where $\ell > k$, the parameter $\theta$ is over-identified by the moment conditions, since we have more equations than unknowns. For the exactly identified case ($k=\ell$), the Method of Moments (MM) estimate $\hat\theta$ based upon the moment conditions $Eg(Y_t,\theta)=0$ is obtained as the value of $\theta$ which solves the sample counterpart to the moment conditions. Thus
\[ \frac{1}{n}\sum_{t=1}^n g(y_t,\hat\theta) \overset{\text{(say)}}{=} \bar g_n(y,\hat\theta) = 0. \]
The MM approach can be motivated as follows. If the sample mean $\bar g_n(y,\theta)$ provides a good estimate for the population mean $Eg(Y_t,\theta)$ for each $\theta$, then we might expect that the solution of
\[ \bar g_n(y,\theta) = 0, \quad\text{i.e. } \hat\theta, \]
would provide a good estimate of the solution of the moment conditions
\[ Eg(Y_t,\theta) = 0, \quad\text{i.e. } \theta_0. \]
The estimator implied by this procedure is the MM estimator $\hat\theta$ defined by
\[ \frac{1}{n}\sum_{t=1}^n g(Y_t,\hat\theta) = \bar g_n(Y,\hat\theta) = 0. \]

In more general contexts with over-identifying moment conditions (` > k), a generalized
MM estimate (GMM) of is obtained by minimizing a weighted distance between g n (y, )
and 0. See below.
Example 3.12 For the Gamma distribution in the previous example, the sample moment counterpart to the moment conditions
\[ Eg(Y_t,\theta) = E\begin{bmatrix} Y_t-\alpha\beta\\ Y_t^2-\alpha\beta^2(1+\alpha)\end{bmatrix} = 0 \]
is given by
\[ \bar g_n(y,\theta) = \frac{1}{n}\sum_{t=1}^n\begin{bmatrix} y_t-\alpha\beta\\ y_t^2-\alpha\beta^2(1+\alpha)\end{bmatrix} = \begin{bmatrix} m'_1-\alpha\beta\\ m'_2-\alpha\beta^2(1+\alpha)\end{bmatrix} = 0. \]
Taking the last equation and solving it for $\alpha$, $\beta$, we find that the MM estimates are given by
\[ \begin{bmatrix}\hat\alpha\\ \hat\beta\end{bmatrix} = \begin{bmatrix} (m'_1)^2/[m'_2-(m'_1)^2]\\ [m'_2-(m'_1)^2]/m'_1\end{bmatrix}. \]
Note that those MM estimates for the parameters of a Gamma distribution can be represented as an explicit function of the random sample variables. This is in sharp contrast to the ML estimates of those parameters, for which no such explicit form exists.

Example 3.13 Let $Y_1,\ldots,Y_n$ be a random sample from a $N(\mu,\sigma^2)$ population. Based upon the first two non-central moments, we can specify the following moment conditions
\[ Eg(Y_t,\theta) = E\begin{bmatrix} Y_t-\mu\\ Y_t^2-(\mu^2+\sigma^2)\end{bmatrix} = 0, \qquad \theta = (\mu,\sigma^2)'. \]
The sample moment counterpart to the moment conditions is given by
\[ \bar g_n(y,\theta) = \frac{1}{n}\sum_{t=1}^n\begin{bmatrix} y_t-\mu\\ y_t^2-(\mu^2+\sigma^2)\end{bmatrix} = \begin{bmatrix} m'_1-\mu\\ m'_2-(\mu^2+\sigma^2)\end{bmatrix} = 0. \]
Taking the last equation and solving it for $\mu$, $\sigma^2$, we find that the MM estimates are given by
\[ \begin{bmatrix}\hat\mu\\ \hat\sigma^2\end{bmatrix} = \begin{bmatrix} m'_1\\ m'_2-(m'_1)^2\end{bmatrix}. \]
Note that the MM estimates $\hat\mu$ and $\hat\sigma^2$ are the simple sample mean and the sample variance, respectively.

In order to discuss the properties of the MM estimator, we consider the following estimation
context. It is assumed that the sample variables Y = (Y1 , ..., Yn ) are a collection of iid random
variables. Given a statistical model for the random sample {f (y; ), Rk }, our interest
is to estimate . In general the first k non-central moments of Y are functions of , i.e.
\[ \mu'_r = EY_t^r = h_r(\theta); \qquad r = 1,\ldots,k. \]
Hence, we can define the following moment conditions
\[ Eg(Y_t,\theta) = E\begin{bmatrix} Y_t-h_1(\theta)\\ Y_t^2-h_2(\theta)\\ \vdots\\ Y_t^k-h_k(\theta)\end{bmatrix} = 0 \quad\Longleftrightarrow\quad \begin{bmatrix}\mu'_1\\ \mu'_2\\ \vdots\\ \mu'_k\end{bmatrix} = \begin{bmatrix}h_1(\theta)\\ h_2(\theta)\\ \vdots\\ h_k(\theta)\end{bmatrix} \overset{\text{(say)}}{=} h(\theta), \]
where $h(\theta)$ is assumed to be invertible, so that $\theta = h^{-1}(\mu'_1,\ldots,\mu'_k)$. The sample moment counterparts to the moment conditions are then
\[ \frac{1}{n}\sum_{t=1}^n\begin{bmatrix} y_t-h_1(\theta)\\ y_t^2-h_2(\theta)\\ \vdots\\ y_t^k-h_k(\theta)\end{bmatrix} = 0 \quad\Longleftrightarrow\quad \begin{bmatrix}m'_1\\ m'_2\\ \vdots\\ m'_k\end{bmatrix} = h(\theta), \]
and the solution for $\theta$ defines the MM estimate via the inverse function $h^{-1}$ as
\[ \hat\theta = h^{-1}(m'_1,\ldots,m'_k). \]
The corresponding MM estimator is
\[ \hat\theta = h^{-1}(M'_1,\ldots,M'_k), \qquad\text{where } M'_r = \frac{1}{n}\sum_{t=1}^n Y_t^r. \]
It is very often the case that the inverse function $h^{-1}(m'_1,\ldots,m'_k)$ is continuous, in which case the MM estimator inherits consistency from the consistency of the sample moments $M'_r$ for the population moments $\mu'_r$.
Theorem 3.15 (Consistency of MM Estimator) Let $\hat\theta = h^{-1}(M'_1,\ldots,M'_k)$ be such that $h^{-1}(\mu'_1,\ldots,\mu'_k)$ is continuous $\forall(\mu'_1,\ldots,\mu'_k)\in\{(\mu'_1,\ldots,\mu'_k): \mu'_i = h_i(\theta),\ i=1,\ldots,k,\ \theta\in\Omega\}$. Then $\hat\theta\xrightarrow{p}\theta$.
Proof
Given the assumed continuity property of $h^{-1}$ we have
\begin{align*}
\operatorname{plim}\hat\theta &= \operatorname{plim} h^{-1}(M'_1,\ldots,M'_k)\\
&= h^{-1}(\operatorname{plim}M'_1,\ldots,\operatorname{plim}M'_k) &&\text{(since the plim of a continuous function is the function at the plim of the argument; see Theorem 5.2, Adv. Stat. I)}\\
&= h^{-1}(\mu'_1,\ldots,\mu'_k) &&\text{(by consistency of $M'_r$ for $\mu'_r$)}\\
&= \theta. &&\text{(by the invertibility of $h$)}
\end{align*}

Example 3.14 Recall the example where we considered the MM estimator of $(\mu,\sigma^2)$ for a normal population using the first two non-central moments. In this case, the MM estimator is
\[ \begin{bmatrix}\hat\mu\\ \hat\sigma^2\end{bmatrix} = \begin{bmatrix} M'_1\\ M'_2-(M'_1)^2\end{bmatrix} \xrightarrow{p} \begin{bmatrix} \mu'_1\\ \mu'_2-(\mu'_1)^2\end{bmatrix} = \begin{bmatrix}\mu\\ \sigma^2\end{bmatrix}. \]

A consistent MM estimator will be asymptotically normally distributed if the inverse function $\theta = h^{-1}(\mu'_1,\ldots,\mu'_k)$ is continuously differentiable and its Jacobian has full rank.
Theorem 3.16 (Asymptotic Normality of MM Estimator) Let $\hat\theta = h^{-1}(M'_1,\ldots,M'_k)$ be such that $h^{-1}(\mu'_1,\ldots,\mu'_k)$ is differentiable for all $(\mu'_1,\ldots,\mu'_k)\in\{(\mu'_1,\ldots,\mu'_k): \mu'_i = h_i(\theta),\ i=1,\ldots,k,\ \theta\in\Omega\}$, and let the elements of the Jacobian
\[ A(\mu'_1,\ldots,\mu'_k) = \begin{bmatrix} \dfrac{\partial h_1^{-1}(\mu'_1,\ldots,\mu'_k)}{\partial\mu'_1} & \cdots & \dfrac{\partial h_1^{-1}(\mu'_1,\ldots,\mu'_k)}{\partial\mu'_k}\\ \vdots & \ddots & \vdots\\ \dfrac{\partial h_k^{-1}(\mu'_1,\ldots,\mu'_k)}{\partial\mu'_1} & \cdots & \dfrac{\partial h_k^{-1}(\mu'_1,\ldots,\mu'_k)}{\partial\mu'_k} \end{bmatrix} \]
be continuous functions with $A(\mu'_1,\ldots,\mu'_k)$ having full rank for all such $(\mu'_1,\ldots,\mu'_k)$. Then
\[ \sqrt{n}(\hat\theta-\theta) \xrightarrow{d} N(0,\ A\Sigma A') \qquad\text{and}\qquad \hat\theta \overset{a}{\sim} N(\theta,\ n^{-1}A\Sigma A'), \]
where $\underset{(k\times k)}{\Sigma} = \mathrm{Cov}(M'_1,\ldots,M'_k)$.
Proof
Recall that the sample moments defining the MM estimator $\hat\theta = h^{-1}(M'_1,\ldots,M'_k)$ converge in distribution to a normal limiting distribution as
\[ \sqrt{n}\begin{bmatrix} M'_1-\mu'_1\\ \vdots\\ M'_k-\mu'_k\end{bmatrix} \xrightarrow{d} N(0,\Sigma). \]
Since a) the MM estimator is a continuous function of those random variables (given by $h^{-1}$), b) the partial derivatives of this function contained in the Jacobian matrix $A$ are continuous, and c) $A$ has full rank, the asymptotic normal distribution of $\hat\theta$ stated by the theorem follows directly from the theorem on the asymptotic distribution of continuous functions of asymptotically normally distributed random variables (see the delta method, e.g. Mittelhammer, 1996, Theorem 5.40, p. 288).


In the discussion above, we have seen that the MM estimator is conceptually and computationally very simple, and is consistent and asymptotically normal under fairly general conditions.
However, the MM estimator is often not unbiased, BLUE, MVUE, or asymptotically efficient.
Furthermore, its small sample properties are often unknown and critically dependent upon the
particular estimation problem under consideration. The MM estimator has been most often

applied in cases where the definition or computation of other estimators (such as the MLE) is
complex, or the sample size is quite large so that the consistency ensures a reasonably accurate
estimate. The following examples show how the MM estimator can be formulated to subsume
the LS estimator and the MLE as special cases.

Example 3.15 Consider the classical LRM given by
\[ Y_t = x_{t\cdot}\beta + \epsilon_t, \qquad\text{with}\quad E\epsilon_t = 0, \quad t = 1,\ldots,n, \]
where $\beta$ is a $k$-dimensional parameter vector. Moment conditions can be specified by the $k$-dimensional vector function
\[ Eg(Y_t,\beta) = E\,x_{t\cdot}'\epsilon_t = E\,x_{t\cdot}'(Y_t-x_{t\cdot}\beta) = 0. \]
The corresponding sample moment conditions are given by
\[ \frac{1}{n}\sum_{t=1}^n g(y_t,\beta) = \frac{1}{n}\sum_{t=1}^n x_{t\cdot}'(y_t-x_{t\cdot}\beta) = \frac{1}{n}(x'y - x'x\beta) = 0. \]
This system of equations solved for $\beta$ defines the MM estimate of $\beta$ as $\hat\beta = (x'x)^{-1}x'y$. Thus the LS estimator is an MM estimator.
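A minimal numerical sketch of this equivalence (Python; simulated regressors and errors) solves the sample moment conditions directly and compares the result with the LS formula.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(size=n)

# MM estimate: solve (1/n) X'(y - X b) = 0  <=>  (X'X) b = X'y
b_mm = np.linalg.solve(X.T @ X, X.T @ y)
b_ls = np.linalg.lstsq(X, y, rcond=None)[0]     # LS estimate for comparison
print("MM estimate:", b_mm)
print("LS estimate:", b_ls)
```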

Example 3.16 Let $Y = (Y_1,\ldots,Y_n)$ be an iid random sample from a population with pdf $f(y;\theta)$. Then the log-likelihood is
\[ \ln L(\theta;y) = \sum_{t=1}^n \ln f(y_t;\theta). \]
Under the usual regularity conditions we have (in the continuous case)
\[ E\Big[\frac{\partial\ln f(Y_t;\theta)}{\partial\theta}\Big] = \int \frac{\partial\ln f(y_t;\theta)}{\partial\theta}\,f(y_t;\theta)\,dy_t = \int \frac{\partial f(y_t;\theta)}{\partial\theta}\,dy_t = 0. \]
Hence, moment conditions can be defined by the $k$-dimensional vector function
\[ Eg(Y_t,\theta) = E\Big[\frac{\partial\ln f(Y_t;\theta)}{\partial\theta}\Big] = 0. \]
The corresponding sample moment conditions are then
\[ \frac{1}{n}\sum_{t=1}^n g(y_t,\theta) = \frac{1}{n}\sum_{t=1}^n \frac{\partial\ln f(y_t;\theta)}{\partial\theta} = \frac{1}{n}\frac{\partial\ln L(\theta;y)}{\partial\theta} = 0. \]
This system solved for $\theta$ defines the MM estimate of $\theta$.
Assuming the MLE of $\theta$ is defined as the solution to the f.o.c., the MLE is an MM estimator.

The GMM estimator is used when the $k$-dimensional parameter vector $\theta$ is over-identified by the $\ell > k$ moment conditions $Eg(Y_t;\theta) = 0$.
In this case the corresponding sample moment conditions
\[ \bar g_n(y;\theta) = \frac{1}{n}\sum_{t=1}^n g(y_t;\theta) = 0 \]
define a system with more equations than unknowns, such that we cannot find a vector $\hat\theta$ that exactly satisfies the sample moment conditions. In this over-identified case the GMM estimate is defined as the value of $\theta$ that satisfies the sample moment conditions as closely as possible. Formally, the GMM estimate $\hat\theta$ is defined to be the value that minimizes a weighted measure of the distance between the sample moments $\bar g_n(y;\theta)$ and the zero vector, as follows.
Definition (GMM estimator): Let
\[ \hat\theta = \arg\min_\theta Q_n(\theta;y), \qquad\text{where}\quad Q_n(\theta;y) = \bar g_n(y;\theta)'\,W_n\,\bar g_n(y;\theta), \]
and $W_n$ is an $(\ell\times\ell)$ symmetric, p.d. weighting matrix such that $W_n\xrightarrow{p}w$, with $w$ being a nonrandom, symmetric, and p.d. matrix.
Note that for $\ell > k$ different weighting matrices $W_n$ lead to different GMM estimators. The simplest weighting matrix would be $W_n = I$. Also note that
\[ Q_n(\theta;y)\ge 0 \]
(since the GMM objective function $Q_n(\theta;y)$ has a quadratic form with a p.d. weighting matrix $W_n$), and
\[ Q_n(\theta;y) = 0 \quad\Longleftrightarrow\quad \bar g_n(y;\theta) = 0. \]

Example 3.17 Let $Y = (Y_1,\ldots,Y_n)$ be a random sample from a Gamma population with pdf
\[ f(y;\theta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,y^{\alpha-1}\exp(-\tfrac{y}{\beta}), \qquad y\in(0,\infty),\quad \theta=(\alpha,\beta). \]
Based on the moments $EY$, $EY^2$, $E(\ln Y)$, and $E(1/Y)$ we can specify the following moment conditions
\[ Eg(Y_t,\theta) = E\Big[\,Y_t-\alpha\beta,\ \ Y_t^2-\alpha\beta^2(1+\alpha),\ \ \ln Y_t - \tfrac{d\ln\Gamma(\alpha)}{d\alpha} - \ln\beta,\ \ 1/Y_t - \tfrac{1}{\beta(\alpha-1)}\,\Big]' = 0. \]
Hence the two-dimensional parameter vector $\theta$ is over-identified by those four moment conditions. If we select $W_n = I$, the GMM estimate would obtain from minimizing the following objective function:
\[ Q_n(\theta;y) = \Big[\tfrac{1}{n}\textstyle\sum_{t=1}^n y_t - \alpha\beta\Big]^2 + \Big[\tfrac{1}{n}\sum_{t=1}^n y_t^2 - \alpha\beta^2(1+\alpha)\Big]^2 + \Big[\tfrac{1}{n}\sum_{t=1}^n \ln(y_t) - \tfrac{d\ln\Gamma(\alpha)}{d\alpha} - \ln\beta\Big]^2 + \Big[\tfrac{1}{n}\sum_{t=1}^n 1/y_t - \tfrac{1}{\beta(\alpha-1)}\Big]^2. \]
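The following sketch (Python with NumPy/SciPy assumed; data simulated from a Gamma population) minimizes exactly this identity-weighted GMM objective numerically.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import digamma

rng = np.random.default_rng(6)
y = rng.gamma(shape=3.0, scale=2.0, size=2000)   # true (alpha, beta) = (3, 2)

def gbar(theta):
    a, b = theta
    # sample means of the four moment functions
    return np.array([y.mean() - a * b,
                     (y**2).mean() - a * b**2 * (1 + a),
                     np.log(y).mean() - digamma(a) - np.log(b),
                     (1 / y).mean() - 1 / (b * (a - 1))])

def Qn(theta):                                   # identity-weighted GMM objective
    a, b = theta
    if a <= 1 or b <= 0:                         # keep 1/(b*(a-1)) well defined
        return np.inf
    g = gbar(theta)
    return g @ g

res = minimize(Qn, x0=np.array([2.0, 1.0]), method="Nelder-Mead")
print("GMM estimate (alpha, beta):", res.x)
```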

As was the case for the MM estimator, general statements regarding the finite sample properties of the GMM estimator are not possible to derive. In the following we will discuss9 a
set of sufficient conditions for the consistency and the asymptotic normality of the GMM
estimator. We will also consider the asymptotic efficiency of the GMM estimator for a given
particular set of moment conditions.
For consistency of the GMM estimator we require some conditions on the behavior of the GMM
objective function. A set of sufficient conditions is given in the following definition.

$^9$ This discussion is based on L. Mátyás (1999), Generalized Method of Moments Estimation, Cambridge: Cambridge University Press, Chapter 1. The seminal paper on the asymptotic properties of the GMM estimator is that of L.P. Hansen (1982), Large sample properties of generalized method of moments estimators, Econometrica, p. 1029-1054.


Definition (GMM Consistency Conditions):
C1. The expectation of the moment function $g(Y_t,\theta)$ used to define the moment conditions, $Eg(Y_t,\theta)\overset{\text{(say)}}{=}h(\theta)$, exists and is finite $\forall\theta\in\Omega$.
C2. There exists a $\theta_0\in\Omega$ such that $Eg(Y_t,\theta) = 0 \Longleftrightarrow \theta = \theta_0$.
C3. Let $\bar g_{n,j}(Y,\theta)$ be the $j$th sample moment in $\bar g_n(Y,\theta)$ and $h_j(\theta)$ the corresponding population moment in $h(\theta)$. Then
\[ \sup_{\theta\in\Omega}\big|\bar g_{n,j}(Y,\theta) - h_j(\theta)\big| \xrightarrow{p} 0 \qquad\text{for } j=1,\ldots,\ell. \]
Condition (C1) ensures that the population moments defining the moment conditions exist.
Condition (C2) says that the population moments take the value 0 at $\theta_0$ and at no other value of $\theta$. This ensures that $\theta_0$ can be identified by the population moment conditions $Eg(Y_t,\theta) = 0$.
Condition (C3), saying that $\sup_{\theta\in\Omega}|\bar g_{n,j}(Y,\theta) - h_j(\theta)|\xrightarrow{p}0$, implies that the $j$th sample moment (as a function of $\theta$) converges in probability uniformly to the corresponding population moment (as a function of $\theta$).
The uniformity of the convergence is a stronger requirement than the usual pointwise convergence in probability, which simply requires that $\bar g_{n,j}(Y,\theta) - h_j(\theta)\xrightarrow{p}0$ for every single fixed value $\theta\in\Omega$ separately. The importance of the uniformity of the convergence for the consistency proof is that it implies that $\bar g_{n,j}(Y,\theta_n) - h_j(\theta_n)\xrightarrow{p}0$, where $\theta_n$ is some sequence of $\theta$ values. This may not be true if we only have pointwise convergence.
With the conditions (C1)-(C3), the consistency of the GMM estimator can be shown.
Theorem 3.17 (Consistency of the GMM Estimator) Let the GMM estimator of $\theta$ be defined as
\[ \hat\theta = \arg\min_\theta Q_n(\theta;Y), \qquad\text{where}\quad Q_n(\theta;Y) = \bar g_n(Y;\theta)'\,W_n\,\bar g_n(Y;\theta), \]
and where $W_n\xrightarrow{p}w$, with $w$ being a nonrandom, symmetric, and p.d. matrix. Then under the conditions (C1) to (C3), $\hat\theta\xrightarrow{p}\theta_0$.

Proof
The conditions (C1)-(C3) imply that the GMM objective function $Q_n(\theta;Y)$ converges in probability uniformly to the nonrandom (limit) function
\[ Q(\theta) = h(\theta)'\,w\,h(\theta) = E[g(Y_t,\theta)]'\,w\,E[g(Y_t,\theta)], \]
i.e.
\[ \sup_{\theta\in\Omega}\big|Q_n(\theta;Y) - Q(\theta)\big| \xrightarrow{p} 0. \]
Furthermore note that condition (C2) implies that
\[ Q(\theta) = E[g(Y_t,\theta)]'\,w\,E[g(Y_t,\theta)] = 0 \quad\Longleftrightarrow\quad \theta = \theta_0, \]
and note that $Q(\theta)$ has a quadratic form such that $Q(\theta)\ge 0$. Hence it follows that
\[ \theta_0 = \arg\min_\theta Q(\theta). \]
Now since
- the GMM estimator $\hat\theta$ minimizes the objective function $Q_n(\theta;Y)$,
- the true value $\theta_0$ minimizes the corresponding limit function $Q(\theta)$,
- and $Q_n(\theta;Y)$ converges in probability uniformly to $Q(\theta)$,
it follows that $\hat\theta\xrightarrow{p}\theta_0$.$^{10}$


Under additional conditions relating to the moment conditions, it can be shown that the GMM
estimator is asymptotically normally distributed. A set of such conditions is given in the
following definition.

$^{10}$ For the formal details of the proof see T. Amemiya (1985, p. 107), Advanced Econometrics, Cambridge: Harvard University Press.


Definition (GMM Asymptotic Normality Conditions):
C4. The sample moment $\bar g_n(y;\theta) = \frac{1}{n}\sum_{t=1}^n g(y_t;\theta)$ is continuously differentiable w.r.t. $\theta$, with Jacobian matrix
\[ G_n(y;\theta) = \frac{\partial\bar g_n(y;\theta)}{\partial\theta'} = \frac{1}{n}\sum_{t=1}^n \frac{\partial g(y_t;\theta)}{\partial\theta'}. \]
C5. For any sequence $\{\theta_n\}$ such that $\theta_n\xrightarrow{p}\theta_0$,
\[ G_n(Y;\theta_n)\xrightarrow{p}G(\theta_0), \]
where $G(\theta_0)$ is a nonrandom $\ell\times k$ matrix.
C6. The sequence of moment functions $\{g(Y_t;\theta)\}$ satisfies a CLT, so that
\[ \sqrt{n}\,\bar g_n(Y;\theta_0) \xrightarrow{d} Z\sim N[0, V(\theta_0)], \qquad\text{where } V(\theta_0) = n\,\mathrm{Cov}[\bar g_n(Y;\theta_0)]. \]
Condition (C4) ensures that the GMM estimator $\hat\theta$ is defined as the solution of the f.o.c.s for a minimizer of the GMM objective function.
Note that condition (C5) follows from a WLLN applied to the derivative of the moment function $\partial g(y_t;\theta)/\partial\theta'$, and the plim for continuous functions.
Condition (C6) may follow from the standard Lindberg-Levy CLT, which would require that the moment functions $\{g(Y_t;\theta)\}$ are a sequence of iid random variables. In this case the asymptotic covariance becomes
\[ V(\theta_0) = n\,\mathrm{Cov}[\bar g_n(Y;\theta_0)] = n\,\mathrm{Cov}\Big[\frac{1}{n}\sum_{t=1}^n g(Y_t;\theta_0)\Big] = \mathrm{Cov}[g(Y_t;\theta_0)]. \]
In cases where the moment functions $\{g(Y_t;\theta)\}$ are heteroskedastic and/or correlated random variables, the standard Lindberg-Levy CLT needs to be replaced by a CLT for heteroskedastic and/or correlated processes.

Theorem 3.18 (Asymptotic Normality of GMM Estimators) Under the conditions (C1) to (C6), the GMM estimator $\hat\theta$ of $\theta$ satisfies
\[ \sqrt{n}(\hat\theta-\theta_0) \xrightarrow{d} N(0,\Sigma), \qquad\text{where}\quad \Sigma = \big[G(\theta_0)'wG(\theta_0)\big]^{-1}G(\theta_0)'w\,V(\theta_0)\,w\,G(\theta_0)\big[G(\theta_0)'wG(\theta_0)\big]^{-1}. \]
Proof
The GMM objective function to be minimized (ignoring its dependence on $Y$) is
\[ Q_n(\theta) = \bar g_n(\theta)'\,W_n\,\bar g_n(\theta) \]
with first-order derivatives
\[ \frac{\partial Q_n(\theta)}{\partial\theta} = 2\Big[\frac{\partial\bar g_n(\theta)}{\partial\theta'}\Big]'W_n\,\bar g_n(\theta) = 2G_n(\theta)'W_n\,\bar g_n(\theta), \]
so that the f.o.c.s defining the GMM estimator $\hat\theta$ are
\[ G_n(\hat\theta)'W_n\,\bar g_n(\hat\theta) = 0. \]
Now consider a first-order Taylor series representation of the sample moments $\bar g_n(\hat\theta)$ given by
\[ \bar g_n(\hat\theta) = \bar g_n(\theta_0) + G_n(\theta_n^*)(\hat\theta-\theta_0), \qquad\text{where } \theta_n^* = \lambda\hat\theta + (1-\lambda)\theta_0 \text{ for } \lambda\in[0,1]. \]
Note that since $\hat\theta\xrightarrow{p}\theta_0$, we have $\theta_n^*\xrightarrow{p}\theta_0$.
If we pre-multiply the equation of the Taylor expansion by $G_n(\hat\theta)'W_n$, we get the following representation of the GMM f.o.c.s:
\[ G_n(\hat\theta)'W_n\,\bar g_n(\hat\theta) = G_n(\hat\theta)'W_n\,\bar g_n(\theta_0) + G_n(\hat\theta)'W_n\,G_n(\theta_n^*)(\hat\theta-\theta_0) = 0. \]
Multiplying the last equation by $\sqrt{n}$ and solving it for $\sqrt{n}(\hat\theta-\theta_0)$ yields
\[ \sqrt{n}(\hat\theta-\theta_0) = -\Big[\underbrace{G_n(\hat\theta)'}_{\xrightarrow{p}\,G(\theta_0)'}\underbrace{W_n}_{\xrightarrow{p}\,w}\underbrace{G_n(\theta_n^*)}_{\xrightarrow{p}\,G(\theta_0)}\Big]^{-1}\underbrace{G_n(\hat\theta)'}_{\xrightarrow{p}\,G(\theta_0)'}\underbrace{W_n}_{\xrightarrow{p}\,w}\underbrace{\sqrt{n}\,\bar g_n(\theta_0)}_{\xrightarrow{d}\,N[0,V(\theta_0)]}. \]
Then the limiting distribution for $\sqrt{n}(\hat\theta-\theta_0)$ follows from Slutsky's theorem.


The asymptotic covariance of the GMM estimator $\hat\theta$ is given by
\[ \mathrm{ACov}(\hat\theta) = \frac{1}{n}\big[G(\theta_0)'wG(\theta_0)\big]^{-1}G(\theta_0)'w\,V(\theta_0)\,w\,G(\theta_0)\big[G(\theta_0)'wG(\theta_0)\big]^{-1}. \]

Hence, the asymptotic covariance of the GMM estimator for a given set of moment conditions and, therefore, its asymptotic efficiency depends upon the selected weighting matrix $W_n\xrightarrow{p}w$.
One can show that the optimal choice $W_n^*$ of $W_n$ which minimizes this asymptotic covariance for a given set of moment conditions is characterized by$^{11}$
\[ W_n^* \xrightarrow{p} \big[n\,\mathrm{Cov}[\bar g_n(Y;\theta_0)]\big]^{-1} = V(\theta_0)^{-1}. \]
Note that the asymptotic covariance of the GMM estimator based upon this optimal weighting matrix is
\[ \mathrm{ACov}(\hat\theta)\Big|_{W_n=W_n^*} = \frac{1}{n}\big[G(\theta_0)'V(\theta_0)^{-1}G(\theta_0)\big]^{-1}, \]
which is the smallest possible for the set of moment conditions under consideration. The computation of the optimal weighting matrix $W_n^*$ and, thereby, the implementation of the asymptotically optimal GMM, requires a consistent estimate for
\[ V(\theta_0) = n\,\mathrm{Cov}[\bar g_n(Y;\theta_0)] = n\,\mathrm{Cov}\Big[\tfrac{1}{n}\textstyle\sum_{t=1}^n g(Y_t;\theta_0)\Big]. \]
In cases where $\{g(Y_t;\theta_0)\}$ is an iid sequence, such that
\[ V(\theta_0) = \mathrm{Cov}[g(Y_t;\theta_0)] = E[g(Y_t;\theta_0)g(Y_t;\theta_0)'], \]
such a consistent estimate obtains as
\[ \widehat V = \frac{1}{n}\sum_{t=1}^n g(Y_t;\tilde\theta_n)g(Y_t;\tilde\theta_n)', \qquad\text{where } \operatorname{plim}\tilde\theta_n = \theta_0. \]
Note that the computation of the optimal weighting matrix $W_n^*$ requires a consistent estimate for $\theta$. For this reason we typically use in practice the following two-step procedure to obtain the asymptotically efficient estimator (a sketch follows the two steps):
Step (1): Select some arbitrary weighting matrix (say $W_n = I$) to obtain an initial consistent estimate of $\theta$.
Step (2): Use this consistent estimate of $\theta$ to construct $W_n^*$ and to obtain the asymptotically efficient GMM estimate.
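A compact sketch of this two-step procedure (Python; it reuses the Gamma moment functions of Example 3.17 and the iid estimate of V(θ₀) given above; the helper names are illustrative) could look as follows.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import digamma

rng = np.random.default_rng(7)
y = rng.gamma(shape=3.0, scale=2.0, size=2000)

def g(theta):
    """n x 4 matrix of moment functions g(y_t; theta), one row per observation."""
    a, b = theta
    return np.column_stack([y - a * b,
                            y**2 - a * b**2 * (1 + a),
                            np.log(y) - digamma(a) - np.log(b),
                            1 / y - 1 / (b * (a - 1))])

def gmm_objective(theta, W):
    a, b = theta
    if a <= 1 or b <= 0:
        return np.inf
    gbar = g(theta).mean(axis=0)
    return gbar @ W @ gbar

# Step (1): identity weighting gives an initial consistent estimate
step1 = minimize(gmm_objective, x0=np.array([2.0, 1.0]), args=(np.eye(4),), method="Nelder-Mead")

# Step (2): estimate V(theta_0) by (1/n) sum g g' at the step-1 estimate, then use W = V^{-1}
G1 = g(step1.x)
V_hat = (G1.T @ G1) / len(y)
step2 = minimize(gmm_objective, x0=step1.x, args=(np.linalg.inv(V_hat),), method="Nelder-Mead")
print("two-step GMM estimate (alpha, beta):", step2.x)
```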
11

For details see, e.g. L.P. Hansen (1982) Large sample properties of generalized method of moments estimators, Econometrica, p. 1029-1054.


The optimal weighting matrix $W_n^*$ ensures asymptotic efficiency within the class of GMM estimators based upon a particular set of moment conditions.
Results relating to the optimal choice of moment conditions with which to define the GMM estimator in the general case are not available. A discussion of this issue can be found in R. Gallant and G. Tauchen (1996), Which moments to match?, Econometric Theory, 12, p. 657-681.
Example 3.18 Hansen and Singleton (1982)$^{12}$ considered the estimation of a consumption-based capital asset pricing model. This is a model where a representative agent maximizes expected utility.
Let $C_t$ denote consumption in period $t$, and $U(C_t)$ the corresponding utility function. The agent is presumed to maximize his intertemporal utility by solving
\[ \max_{\{c_{t+\tau}\}}\ E\Big[\sum_{\tau=0}^{\infty}\beta^{\tau}U(C_{t+\tau})\,\Big|\,x_t\Big], \]
where
$x_t$: $(k\times 1)$ vector representing the investor's information set at date $t$;
$\beta\in(0,1)$: discount factor (smaller values of $\beta$ mean smaller weights on future events).
There exist $L$ different assets yielding a gross return for an investment from $t$ to $t+1$ of $(1+R_{i,t+1})$, $i=1,\ldots,L$. This rate is unknown in period $t$ and treated as a random variable.
If the agent takes a position in each of these $L$ assets, the optimal portfolio maximizing the intertemporal utility satisfies the Euler equations
\[ \underbrace{\frac{dU(C_t)}{dC_t}}_{\substack{\text{utility loss caused by}\\ \text{saving 1\$ in } t}} = \underbrace{E\Big[\beta(1+R_{i,t+1})\frac{dU(C_{t+1})}{dC_{t+1}}\,\Big|\,x_t\Big]}_{\substack{\text{expected utility gain caused by investing 1\$ in period } t\\ \text{and consuming the gross return } (1+R_{i,t+1}) \text{ in } t+1}}, \qquad i=1,\ldots,L. \]
Suppose that the utility function is parameterized as
\[ U(C_t) = \frac{C_t^{1-\gamma}}{1-\gamma} \qquad (\gamma:\ \text{coefficient of relative risk aversion}). \]
Then the optimality condition becomes
\[ C_t^{-\gamma} = E\big[\beta(1+R_{i,t+1})\,C_{t+1}^{-\gamma}\,\big|\,x_t\big] \quad\Longleftrightarrow\quad 1 = E\big[\beta(1+R_{i,t+1})(C_{t+1}/C_t)^{-\gamma}\,\big|\,x_t\big], \]
and hence
\[ E\big[1-\beta(1+R_{i,t+1})(C_{t+1}/C_t)^{-\gamma}\,\big|\,x_t\big] = 0 \quad\Longrightarrow\quad E\big[\big(1-\beta(1+R_{i,t+1})(C_{t+1}/C_t)^{-\gamma}\big)X_t\big] = 0 \quad\text{(by l.i.e.)}. \]
The aim is to estimate the parameter $\theta=(\beta,\gamma)$. Note that without further assumptions about the distribution of the involved random variables we cannot derive a likelihood function, which would be required for ML estimation. However, the optimality conditions allow us to define a set of moment conditions that can be used to estimate $\theta$ by GMM.
Let $Y_t = (R_{1,t+1},\ldots,R_{L,t+1},\,C_{t+1}/C_t,\,X_t')'$. Then stacking the optimality conditions for the $L$ assets produces a set of $L\cdot k$ moment conditions of the form
\[ Eg(Y_t;\theta) = E\begin{bmatrix} \big(1-\beta(1+R_{1,t+1})(C_{t+1}/C_t)^{-\gamma}\big)X_t\\ \vdots\\ \big(1-\beta(1+R_{L,t+1})(C_{t+1}/C_t)^{-\gamma}\big)X_t \end{bmatrix} = 0. \qquad (L\cdot k\times 1) \]
The GMM estimate obtains by minimizing the GMM objective function
\[ Q_n(\theta) = \Big[\frac{1}{n}\sum_{t=1}^n g(y_t;\theta)\Big]' W_n \Big[\frac{1}{n}\sum_{t=1}^n g(y_t;\theta)\Big]. \]

$^{12}$ L.P. Hansen and K. Singleton (1982), Generalized instrumental variable estimation of nonlinear rational expectation models, Econometrica 50, p. 1029-1054.
g(y t ; ) .

3.4. Bayesian Estimation


Up to this point, we considered estimation problems from the classical statistical perspective.
There we assume that our random sample Y comes from some distribution with pdf f (y; ).
Moreover, we have assumed that is some fixed, though unknown parameter. Then the data
are used to obtain an estimate for , e.g., by maximizing the likelihood.
In Bayesian statistics, by contrast, itself is regarded as a random variable. This interpretation
is based on a subjective view of probability, which argues that our uncertainty about the
unknown can be expressed in terms of a probability distribution using rules of probability.
In this context, the central probability distribution is the conditional distribution of the unknown $\theta$ given the observed data $y$, with density $f(\theta|y)$. This distribution, referred to as the posterior distribution, is of fundamental interest for the statistician in using the data $y$ to learn about $\theta$.

3.4.1. Prior and Posterior Distribution


Assume that Y = (Y1 , ..., Yn ) is a random sample from the joint pdf f (y|), defining the
sample likelihood. Note that we denote the pdf of Y as f (y|) instead of using the notation
f (y; ) used up to this point. This notation should indicate that the parameter is viewed
as the value of a random variable.
Any information we have about before observing the data y is represented by the so-called
prior density denoted by f (). This is a density which we need to specify in such a way that
it reflects our confidence in our prior information.
Probability statements that would be made about after the data y have been observed are
based upon the so-called posterior density, which is defined as the conditional density for
given the data y, say f (|y).
Using Bayes's theorem, the posterior density obtains as
\[ f(\theta|y) = \frac{f(\theta,y)}{f(y)} = \frac{f(y|\theta)f(\theta)}{f(y)} = \frac{f(y|\theta)f(\theta)}{\int_\Omega f(y|\theta)f(\theta)\,d\theta}. \]

The posterior combines our prior beliefs about summarized by the prior f () with the
information about in the data y, contained in the likelihood f (y|). Hence, the posterior
represents our revised beliefs about the distribution of after seeing the data y. It obtains as
a mixture of the prior information and current information, that is, the data.
Once obtained, the posterior is available to be the prior when the next body of data is available.
The principle involved is one of continual updating our knowledge about . This appears
nowhere in the classical analysis.
In this setting, f (y) is referred to as the marginal data density, which does not involve the
parameter of interest. It represents an inessential (integrating) constant. Since the data density
is an inessential constant not involving , it is often dropped. We then write
\[ \underbrace{f(\theta|y)}_{\text{posterior}} \ \propto\ \underbrace{f(y|\theta)}_{\text{likelihood}}\ \cdot\ \underbrace{f(\theta)}_{\text{prior}}, \]
where the symbol $\propto$ means "is proportional to". Note that the product $f(y|\theta)f(\theta)$ does not define a proper density. It represents a so-called density kernel for the posterior density of $\theta$.

A Bayesian estimator for the value of $\theta$ is the mean of the posterior distribution, as stated in the following definition.
Definition (Posterior Bayes Estimate): Let $Y = (Y_1,\ldots,Y_n)$ be a random sample from the joint pdf $f(y|\theta)$ and $f(\theta)$ the prior density for $\theta$. The posterior Bayes estimator of $q(\theta)$ is defined to be
\[ E[q(\theta)|y] = \int_\Omega q(\theta)f(\theta|y)\,d\theta = \frac{\int_\Omega q(\theta)f(y|\theta)f(\theta)\,d\theta}{\int_\Omega f(y|\theta)f(\theta)\,d\theta}. \]

Note that the posterior Bayes estimator is a quantity that is obtained by integration. Only in
special cases, this integral can be worked out analytically (see the following examples). Whether
or not an analytical characterization can be obtained, critically depends on the functional forms
of the likelihood and the chosen prior density.
However, in general, we cannot work out this integral analytically and we need to rely on numerical integration. Here we can use either deterministic integration methods (Simpson's rule, Laplace approximations, quadrature rules) or Monte-Carlo techniques (importance sampling, Gibbs sampling, Metropolis-Hastings sampling, Markov-Chain Monte-Carlo). We don't go into the details here.
Example 3.19 Let $Y = (Y_1,\ldots,Y_n)$ be a random sample from a $N(\theta,\sigma^2)$ population. The variance $\sigma^2$ is assumed to be known. Suppose that prior information about $\theta$ can be represented by a $N(m, v^2)$ prior distribution, where the prior parameters $(m, v^2)$ are known. The functional form of the posterior distribution for $\theta$,
\[ f(\theta|y) \propto f(y|\theta)\,f(\theta), \]
obtains as follows. Ignoring the inessential integrating constants, the likelihood and the prior density have the form
\[ f(y|\theta) \propto \exp\Big\{-\frac{1}{2\sigma^2}\sum_i (y_i-\theta)^2\Big\}, \qquad f(\theta) \propto \exp\Big\{-\frac{1}{2v^2}(\theta-m)^2\Big\}. \]
The product yields
\[ f(\theta|y) \propto \exp\Big\{-\frac{1}{2\sigma^2}\Big[\sum_i y_i^2 - 2\theta\sum_i y_i + n\theta^2\Big] - \frac{1}{2v^2}\big[\theta^2 - 2\theta m + m^2\big]\Big\}. \]
Ignoring the inessential constant factors $\exp\{-\frac{1}{2\sigma^2}\sum_i y_i^2\}$ and $\exp\{-\frac{1}{2v^2}m^2\}$, we get
\[ f(\theta|y) \propto \exp\Big\{-\frac{1}{2}\Big[\frac{n\theta^2}{\sigma^2} + \frac{\theta^2}{v^2} - 2\theta\frac{\sum_i y_i}{\sigma^2} - 2\theta\frac{m}{v^2}\Big]\Big\} \]
or
\[ f(\theta|y) \propto \exp\Big\{-\frac{1}{2}\Big[\underbrace{\frac{nv^2+\sigma^2}{\sigma^2 v^2}}_{1/\sigma_*^2}\,\theta^2 - 2\theta\underbrace{\Big(\frac{\sum_i y_i}{\sigma^2} + \frac{m}{v^2}\Big)}_{\mu_*/\sigma_*^2}\Big]\Big\}. \]
Note that the r.h.s. has the form of a density kernel of a normal distribution for $\theta$. To see this, consider the density for an $X\sim N(\mu_x,\sigma_x^2)$ given by
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}}\exp\Big\{-\frac{1}{2}\Big(\frac{1}{\sigma_x^2}x^2 - \frac{2\mu_x}{\sigma_x^2}x + \frac{\mu_x^2}{\sigma_x^2}\Big)\Big\}. \]
Hence, the posterior $f(\theta|y)$ is a $N(\mu_*,\sigma_*^2)$ density with
\[ \sigma_*^2 = \frac{\sigma^2 v^2}{nv^2+\sigma^2}, \qquad \mu_* = \frac{n\bar y\,v^2 + m\,\sigma^2}{nv^2+\sigma^2}, \qquad\text{where } \bar y = \frac{1}{n}\sum_i y_i, \]
and the posterior Bayes estimate for $\theta$ is
\[ E(\theta|y) = \mu_* = \frac{n}{n+(\sigma^2/v^2)}\,\bar y + \frac{(\sigma^2/v^2)}{n+(\sigma^2/v^2)}\,m. \]
Note that the classical ML estimate for $\theta$ is $\bar y$. Thus, the Bayesian estimate is a weighted average of the classical estimate and an estimate based on prior information, namely the prior mean $m$.
Furthermore note that the Bayesian estimate $E(\theta|y) = \mu_*$ depends on the prior variance $v^2$. Smaller values of $v^2$ correspond to greater confidence in prior information, and this would make the Bayesian estimate closer to $m$. In contrast, as $v^2$ becomes larger, the Bayesian estimate approaches the classical estimate $\bar y$.
For the limit $v^2\to\infty$ the prior density becomes a so-called diffuse or improper prior density. In this case the prior information is so poor that it is completely ignored in forming the estimate.
Recall that the ML estimator $\bar Y$ is an unbiased estimator of $\theta$ and the MVUE. This implies that the posterior Bayes estimator is biased for a finite $v^2$. One can show that in general a posterior Bayes estimator is biased.$^{13}$

$^{13}$ See, e.g., Mood, Graybill and Boes (1974, p. 343).
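For this conjugate normal-normal case the posterior is available in closed form, so the posterior Bayes estimate can be computed directly; the following sketch (Python; the data, prior parameters, and known σ² are arbitrary illustrative choices) implements the formulas above.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma2 = 4.0                          # known sampling variance
m, v2 = 0.0, 1.0                      # prior mean and prior variance for theta
theta_true, n = 1.5, 30
y = rng.normal(loc=theta_true, scale=np.sqrt(sigma2), size=n)

ybar = y.mean()
post_var = sigma2 * v2 / (n * v2 + sigma2)                      # sigma*^2
post_mean = (n * ybar * v2 + m * sigma2) / (n * v2 + sigma2)    # mu* = E(theta | y)

print("ML estimate (ybar)      :", ybar)
print("posterior Bayes estimate:", post_mean, " posterior variance:", post_var)
```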


Example 3.20 Consider the classical LRM
\[ \underset{(n\times 1)}{Y} = \underset{(n\times k)}{x}\ \underset{(k\times 1)}{\beta} + \underset{(n\times 1)}{\epsilon}, \qquad \epsilon\sim N(0,\sigma^2 I), \]
where the variance $\sigma^2$ is assumed to be known and where we want to estimate $\beta$.
The likelihood has the form
\[ f(y|\beta) \propto \exp\Big\{-\frac{1}{2\sigma^2}(y-x\beta)'(y-x\beta)\Big\}. \]
Suppose that prior information about $\beta$ can be represented by a $N(m,\sigma^2 V)$ distribution with density
\[ f(\beta) \propto \exp\Big\{-\frac{1}{2\sigma^2}(\beta-m)'V^{-1}(\beta-m)\Big\}. \]
The density kernel of the posterior, given by the product $f(y|\beta)f(\beta)$, is obtained in the same way as in the previous univariate example and is
\[ f(\beta|y) \propto \exp\Big\{-\frac{1}{2\sigma^2}(\beta-\beta_*)'(V^{-1}+x'x)(\beta-\beta_*)\Big\}, \qquad\text{where}\quad \beta_* = (V^{-1}+x'x)^{-1}(V^{-1}m + x'y). \]
Hence the posterior for $\beta$ is a $N[\beta_*,\ \sigma^2(V^{-1}+x'x)^{-1}]$ density.
The posterior Bayes estimate is
\[ E(\beta|y) = \beta_* = (V^{-1}+x'x)^{-1}(V^{-1}m + x'y) \]
and can be interpreted as follows:
Poor prior information about $\beta$ corresponds to a large prior variance $V$, that means a small value of $V^{-1}$. The diffuse prior in this case can be represented as the limit $V^{-1}\to 0$, for which the posterior Bayes estimate becomes the LS estimate
\[ \hat\beta = (x'x)^{-1}x'y. \]
The variance of the posterior distribution becomes $\sigma^2(x'x)^{-1}$. Thus the classical LS approach for the LRM is reproduced as a special case of Bayesian inference with a diffuse prior.
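A direct numerical sketch of this posterior mean (Python; simulated design matrix and data, with illustrative prior parameters m and V) and of its diffuse-prior limit is given below.

```python
import numpy as np

rng = np.random.default_rng(9)
n, k, sigma2 = 100, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

m = np.zeros(k)                    # prior mean
V = 10.0 * np.eye(k)               # prior scale matrix (prior for beta is N(m, sigma2 * V))
V_inv = np.linalg.inv(V)

# posterior Bayes estimate beta* = (V^{-1} + X'X)^{-1} (V^{-1} m + X'y)
beta_star = np.linalg.solve(V_inv + X.T @ X, V_inv @ m + X.T @ y)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)        # diffuse-prior limit (V^{-1} -> 0)

print("posterior Bayes estimate:", beta_star)
print("LS estimate             :", beta_ls)
```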

In the previous example, we have assumed that the residual variance $\sigma^2$ is known. In general, however, both $\beta$ and $\sigma^2$ are unknown, and a Bayesian analysis requires a prior distribution for $\sigma^2$. For this application it is convenient to assume for $1/\sigma^2$ (the so-called precision) a Gamma prior distribution. This ensures that the posterior Bayes estimate of $1/\sigma^2$ can be worked out analytically.
In particular, if we assume for $\beta$ the same normal prior as used in the previous example and if we assume for $1/\sigma^2$ a Gamma prior, then$^{14}$
(1) the marginal posterior distribution of the precision $1/\sigma^2\,|\,y$ is a Gamma distribution, and
(2) the marginal posterior distribution of $\beta\,|\,y$ is a $k$-dimensional Student-t distribution.

3.4.2. The Loss-Function Approach


Up to this point, parameter estimates have been treated as if they were an end in themselves.
But in many settings they are obtained so as to enable us to make decisions. In such a context,
the estimate t = t(y) of q() is called a decision, the estimator T = t(Y ) a decision rule and
the estimation error [t q()] a decision error.
In order to measure the severity of a decision error, a so-called loss function, denoted by $\ell(t;\theta)\ge 0$, is used. Popular loss functions are
\[ \ell_1(t;\theta) = [t-q(\theta)]^2 \quad\text{(quadratic loss function)}, \qquad \ell_2(t;\theta) = |t-q(\theta)| \quad\text{(absolute-error loss function)}. \]

Under both loss functions, the loss associated with large errors is larger than for small errors,
and the loss is zero when the estimation error is zero.
The average loss associated with an estimator (decision rule) $T$ is the so-called risk function, defined as
\[ R_t(\theta) = E\big[\ell(T;\theta)\,\big|\,\theta\big] = E\big[\ell\big(t(Y);\theta\big)\,\big|\,\theta\big] = \int \ell\big(t(y);\theta\big)\,f(y|\theta)\,dy. \]
The risk function measures the average loss computed across different realizations of the random sample $Y$ for a given value of the parameter $\theta$.
Note that the risk function for a quadratic loss function is
\[ E\big[\ell_1(T;\theta)\,\big|\,\theta\big] = E\big[[T-q(\theta)]^2\big] \quad\text{(mean-squared error)}, \]
and for an absolute-error loss function
\[ E\big[\ell_2(T;\theta)\,\big|\,\theta\big] = E\big[|T-q(\theta)|\big] \quad\text{(mean absolute error)}. \]

$^{14}$ See, e.g., G. Koop (2003), Bayesian Econometrics, Chichester: Wiley; Chapter 3.


Now observe that, in general, the risk function is a function of the parameter . Thus, since
the value of is unknown, it is difficult to make a choice between two estimators based upon
their respective risk functions.
In a Bayesian context, where we assume that is the value of a random variable with a known
(prior) distribution, we have the possibility to remove the dependence of the risk function from
by computing the mean of the risk function under the prior distribution.
Formally, this average risk of an estimator $T$ for a given prior density $f(\theta)$ is given by
\[ r_t = E_{f(\theta)}[R_t(\theta)] = \int_\Omega \underbrace{R_t(\theta)}_{\text{risk}}\,\underbrace{f(\theta)}_{\text{prior}}\,d\theta, \]
and is referred to as the Bayes risk.


Note that the Bayes risk for an estimator for a given loss function and a given prior is a real
number. Thus, two estimators can readily be compared by comparing their respective Bayes
risks. We now can define the best estimator to be that one which has the smallest Bayes risk.
Definition (Bayes Estimator): The Bayes estimator of $q(\theta)$ is that estimator which has, for a given loss function and a given prior distribution for $\theta$, the smallest Bayes risk.
The following theorem helps us to identify the Bayes estimator. It says that the Bayes estimator minimizes the expected value of the loss function under the posterior distribution (i.e., the posterior expected loss or posterior risk).
Theorem 3.19 Let $\ell(t;\theta)\ge 0$ be the loss function for estimating $q(\theta)$ and let $f(\theta|y)$ denote the posterior density obtained from the prior $f(\theta)$ (with domain $\Omega$) and the likelihood $f(y|\theta)$. Then, the Bayes estimator is that estimator $T$ which minimizes the posterior risk
\[ E\big[\ell(T;\theta)\,\big|\,y\big] = \int_\Omega \ell(t;\theta)\,f(\theta|y)\,d\theta. \]

Proof
The Bayes estimator has the smallest Bayes risk. The Bayes risk obtains as
\begin{align*}
r_t = E_{f(\theta)}[R_t(\theta)] &= E_{f(\theta)}\Big[E\big(\ell(T;\theta)\,\big|\,\theta\big)\Big]\\
&= \int_\Omega\Big[\int \ell(t;\theta)\,f(y|\theta)\,dy\Big]f(\theta)\,d\theta\\
&= \int\Big[\int_\Omega \ell(t;\theta)\,f(y|\theta)\,f(\theta)\,d\theta\Big]dy &&\text{(since $f(\theta)$ does not depend on $y$)}\\
&= \int\Big[\int_\Omega \ell(t;\theta)\,\underbrace{\frac{f(y|\theta)f(\theta)}{f(y)}}_{\text{posterior}}\,d\theta\Big]f(y)\,dy &&\text{(by multiplying with $f(y)/f(y)$)}\\
&= \int E\big[\ell(t;\theta)\,\big|\,y\big]\,f(y)\,dy.
\end{align*}
Since the integrand in the last equation is nonnegative, the value of the integral $r_t$ is minimized if the inner integral $E[\ell(t;\theta)\,|\,y]$, i.e. the posterior risk, is minimized for each $y$.


We can thus establish the result that the Bayes estimator of $\theta$ for a quadratic loss function is given by the mean of the posterior distribution of $\theta$ (i.e. the posterior Bayes estimator).
Corollary 3.1 Under a quadratic loss function $\ell(t;\theta)$, the Bayes estimator of $q(\theta)$ is given by the posterior expectation of $q(\theta)$, i.e.
\[ E\big[q(\theta)\,\big|\,y\big] = \int_\Omega q(\theta)\,f(\theta|y)\,d\theta. \]
Proof
The Bayes risk for a quadratic loss function is minimized if the corresponding posterior risk, i.e.
\[ E\big[\ell(t;\theta)\,\big|\,y\big] = E\big[[t-q(\theta)]^2\,\big|\,y\big], \]
is minimized. Now note that an MSE of the form $E[(a-Z)^2]$ is minimized as a function of $a$ by $a = EZ$. To see this, consider the f.o.c.
\[ \frac{dE[(a-Z)^2]}{da} = 2E[(a-Z)] = 0 \quad\Longleftrightarrow\quad a = EZ. \]
Thus, the posterior risk under a quadratic loss function, $E[[t-q(\theta)]^2\,|\,y]$, is minimized by $t = E[q(\theta)\,|\,y]$.

Example 3.21 Recall the example where we considered the posterior Bayes estimator of $\theta$ for a random sample from a $N(\theta,\sigma^2)$ population with a known variance $\sigma^2$ and a prior given by a $N(m,v^2)$ distribution. For a quadratic loss function the Bayes estimator of $\theta$ is equal to the posterior Bayes estimator and is given by
\[ E(\theta|y) = \mu_* = \frac{n}{n+(\sigma^2/v^2)}\,\bar y + \frac{(\sigma^2/v^2)}{n+(\sigma^2/v^2)}\,m. \]

4. Hypothesis testing
There are two major areas of statistical inference: the estimation of parameters discussed
in the previous two chapters, and the testing of hypotheses, which we shall discuss in this
chapter.
Statistical hypothesis testing concerns the use of a random sample of observations from the
population under consideration to judge the validity of a statement or hypothesis about this
population in such a way that the probability of making incorrect decisions can be controlled.
Examples of statements that might be tested are
A certain brand of batteries lasts at least 3 hours;
The average monthly return of a certain portfolio of risky assets exceeds 5%;
The probability that a consumer will buy a certain brand of coffee depends on his age.
In the following section, various concepts of statistical hypothesis testing are introduced.

4.1. Fundamental Notations and Terminology of Hypothesis Testing
The starting point for statistical hypothesis testing is the formulation of a statistical hypothesis.
Definition (Statistical hypothesis): A set of potential probability distributions for a random sample from a population
is called a statistical hypothesis.
If the statistical hypothesis completely and uniquely identifies
the probability distribution, the hypothesis is called simple;
If the statistical hypothesis contains two or more potential
probability distributions, the hypothesis is called composite;
It is customary to represent a statistical hypothesis by the capital letter H and to use the
subscripts, when needed, to distinguish between various hypotheses under consideration, such
as H0 (null hypothesis) and H1 (alternative hypothesis).
Example 4.1 A manufacturer of light bulbs claims that the percentage of defective light bulbs in a shipment is no more than 2%. This claim can be transformed into a statistical hypothesis as follows:
We take a random sample of $n$ light bulbs and define the random variable
\[ X_i = \begin{cases} 1 & \text{if the $i$th light bulb is defective}\\ 0 & \text{if the $i$th light bulb is nondefective} \end{cases} \qquad i = 1,\ldots,n. \]
Note that by construction $X_i$ is an iid Bernoulli distributed variable with $P(x_i = 1) = p$.
Now we can translate the manufacturer's claim into the following statistical hypothesis:
\[ H = \Big\{f(x;p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i},\quad p\in[0,\,0.02]\Big\}. \]
$p\in[0,\,0.02]$ defines the set of all distributions implied by the manufacturer's claim. Since this claim involves more than one distribution, the statistical hypothesis is a composite hypothesis.
Since the functional form for the joint pdf $f(x;p)$ can be assumed to be known, the hypothesis can be represented, as is usually the case, in abbreviated form as
\[ H: 0\le p\le 0.02, \]
and the complementary hypothesis as $\bar H: 0.02 < p\le 1$.
Suppose that the manufacturer's claim was that the percentage of defective light bulbs in a shipment is exactly 2%. Then we have a simple statistical hypothesis given by
\[ H = \Big\{f(x;p) = \prod_{i=1}^n 0.02^{x_i}(1-0.02)^{1-x_i}\Big\}, \qquad\text{or simply}\quad H: p = 0.02. \]

Note that a statistical hypothesis $H$ implicitly defines a set of potential outcomes for the random sample $x$. Formally, this set obtains as
\[ R(x|H) = \{x : f(x;\theta) > 0 \ \text{and}\ f(x;\theta)\in H\}, \]
and is referred to as the range of $x$ over $H$.
Analogous definitions of the range of $x$ over $\bar H$ (the complement of $H$) and over $H\cup\bar H$ can be given as
\[ R(x|\bar H) = \{x : f(x;\theta) > 0 \ \text{and}\ f(x;\theta)\in\bar H\}, \qquad R(x|H\cup\bar H) = \{x : f(x;\theta) > 0 \ \text{and}\ f(x;\theta)\in H\cup\bar H\}. \]
In many applications it is the case that the ranges over $H$, $\bar H$ and $H\cup\bar H$ are the same, which occurs when all of the supports of the pdfs in $H$ and $\bar H$ are the same.
Example 4.2 Recall the previous example, where we considered the hypothesis that the percentage of defective light bulbs in a shipment is no more than 2%, such that
\[ H: 0\le p\le 0.02 \qquad\text{and}\qquad \bar H: 0.02 < p\le 1. \]
The ranges of a sample $x = (X_1,\ldots,X_n)$ over $H$, $\bar H$ and $H\cup\bar H$ are
\[ R(x|H) = R(x|\bar H) = R(x|H\cup\bar H) = \times_{i=1}^n\{0,1\}. \]
Suppose that the manufacturer's claim was that there are no defective bulbs, such that
\[ H: p = 0 \qquad\text{and}\qquad \bar H: 0 < p\le 1. \]
Then the ranges are
\[ R(x|H) = \times_{i=1}^n\{0\} \qquad\text{and}\qquad R(x|\bar H) = R(x|H\cup\bar H) = \times_{i=1}^n\{0,1\}. \]

Statistical hypothesis test


The objective of a test of a statistical hypothesis is to assess the validity of the hypothesis. In
a hypothesis testing problem, given the sample observations we need to decide either to accept
the hypothesis as true or to reject the hypothesis as false.
Definition (Statistical hypothesis test): A statistical hypothesis test is a rule, based on a random sample outcome x,
used to decide whether or not to reject a hypothesis H.
It follows from this definition that a statistical test transforms/maps the random sample outcome x into a 0/1 decision. This transformation is facilitated by partitioning the sample space
(that is, the set of all potential sample outcomes) into two disjoint subsets, namely
the rejection or critical region leading to a rejection of H,
and the acceptance region leading to an acceptance of H.
Definition (Critical region): A subset Cr of the sample
range such that if x Cr , then the hypothesis H is rejected is
called the critical region or rejection region. (The complement
of Cr is the acceptance region Ca with Cr Ca = ).
The definition of a critical region fully specifies a rule for deciding whether or not to reject H.
Example 4.3 Let x = (X_1, ..., X_n) be a sample from a N(μ, 25) population, where μ is unknown. Consider the hypothesis

    H : μ < 17.

Intuitively, one would reject H if the sample mean x̄_n is significantly larger than 17. Hence, as
a critical region one could define the following set of random sample outcomes

    Cr = {x : x̄_n > 17 + 5/√n }.
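To make the rule concrete, the following is a minimal Python sketch of this decision rule (NumPy assumed available); the simulated sample and the helper name reject_H are illustrative only, not part of the original example.

import numpy as np

def reject_H(x, mu0=17.0, sigma=5.0):
    # Decision rule of Example 4.3: reject H: mu < 17 iff the sample mean
    # exceeds mu0 + sigma/sqrt(n), with sigma = 5 known.
    x = np.asarray(x, dtype=float)
    n = x.size
    return x.mean() > mu0 + sigma / np.sqrt(n)

rng = np.random.default_rng(0)
sample = rng.normal(loc=18.0, scale=5.0, size=25)   # hypothetical data with mu = 18
print(reject_H(sample))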

Type I error and type II error


The key issue of designing a test is the partitioning of the sample space into a critical region
and an acceptance region such that the test is good in some appropriate sense.
Operationally this amounts to designing tests that minimize the probability of incorrect decisions w.r.t. the validity of H.
There are two types of incorrect decisions (errors) that can be made.
Definition (Type I and type II error): The type I error
of a test for the hypothesis H is the random event that H is
rejected when H is true, i.e. the event

    {x ∈ Cr and H is true}.

The type II error is the random event that H is accepted
when H is false, i.e. the event

    {x ∉ Cr and H is not true}.

The different situations associated with the type I and type II errors are depicted below.

                          H is true            H is false
    x ∈ Cr (reject H)     type I error         correct decision
    x ∉ Cr (accept H)     correct decision     type II error

Ideal statistical test
Clearly, the ideal statistical test would be such that it leads with probability 1 to the correct
decision, implying that

    P(type I error) = P(type II error) = 0.

For such an ideal test to exist, it must be possible to define a critical region Cr such that

    x ∈ Cr  ⟹  H is not true   and   x ∉ Cr  ⟹  H is true.

Note that the definition of such a critical region requires that R(x|H) and R(x|H̄) are two
disjoint sets partitioning the sample space, so that

    x ∈ Cr = R(x|H̄) implies with certainty that H is not true,   and
    x ∈ Ca = R(x|H) implies with certainty that H is true.

Hence, in this case Cr = R(x|H̄) would define an ideal error-free test.

If R(x|H) ∩ R(x|H̄) ≠ ∅, there are potential outcomes x that belong to R(x|H) as well as to
R(x|H̄). Those outcomes cannot be used to discriminate with certainty between H and H̄.
Hence if R(x|H) ∩ R(x|H̄) ≠ ∅, which is virtually always the case in practice, there would not
exist an ideal test. In those cases, we might define tests that control the incidence of errors
such that they occur with acceptable probabilities.
Test statistic
A scalar statistic whose outcomes are used to define critical regions for a test is called a test
statistic.
Definition (Test statistic): Let Cr define the critical region
associated with a test of the hypothesis H versus H̄. If T =
t(x) is a scalar statistic such that Cr = {x : t(x) ∈ CrT}, i.e.,
the critical region can be defined in terms of outcomes, CrT,
of the statistic T, then T is referred to as a test statistic for
the hypothesis H versus H̄. The set CrT will be referred to as
the critical (or rejection) region of the test statistic T.
Example 4.4 Consider a sample from a N(μ, 25) population and the hypothesis H : μ < 17.
Then the critical region

    Cr = {x : x̄_n > 17 + 5/√n }

is defined in terms of the test statistic X̄_n.

4.2. Parametric Tests and Test Properties


In parametric hypothesis testing it is assumed that the distributions in both H and H̄ are all
characterized by members of a known parametric family of distributions {f(x; θ), θ ∈ Ω},
and that the only unknown component is the true value of the parameter θ.
In this case, the hypothesis H and its complement H̄ can be defined entirely in terms of sets
of parameter values indexing the assumed family of distributions. Hence the format of H and
H̄ is

    H : θ ∈ Ω_H   and   H̄ : θ ∈ Ω_H̄ = Ω − Ω_H,

where Ω_H and Ω_H̄ denote the sets of hypothesized parameter values.
In nonparametric hypothesis testing the hypotheses to be tested are not defined in terms of
parameter values per se. Rather, hypotheses refer to functional forms of pdfs. In the following
we focus on parametric hypothesis testing.
Power function
The power function describes the relationship between the value of the parameter θ and the
probability to reject H.

Definition (Power function): Let Cr be the critical region
of a test of H : θ ∈ Ω_H, where θ indexes a parametric
family of densities {f(x; θ), θ ∈ Ω = Ω_H ∪ Ω_H̄}. Then the
power function of the test is defined by

    π(θ) = P(x ∈ Cr; θ) = ∫_{x∈Cr} f(x; θ) dx   (continuous case)
                        = ∑_{x∈Cr} f(x; θ)      (discrete case).

From the definition of the power function it follows that

    for θ ∈ Ω_H:   π(θ) = P(type I error),       1 − π(θ) = P(correct decision);
    for θ ∈ Ω_H̄:   π(θ) = P(correct decision),   1 − π(θ) = P(type II error).

Hence, the power function summarizes all of the characteristics of a test w.r.t. the probabilities
of making a correct/incorrect decision. This makes it a useful tool for the comparison of
alternative tests for a particular parametric hypothesis.
Example 4.5 Let x be a random sample of size n = 200 from a Bernoulli population with
P(X = 1) = θ. Consider testing H : 0 ≤ θ ≤ 0.02 using a test based on

    Cr = {x : ∑_{i=1}^{200} x_i > 5}.

The power function for this test is

    π(θ) = P(x ∈ Cr; θ) = P(∑_{i=1}^{200} x_i > 5; θ),   where ∑_{i=1}^{200} X_i ~ Binomial(200, θ),

         = ∑_{j=6}^{200} [200! / (j!(200 − j)!)] θ^j (1 − θ)^{200−j},   θ ∈ [0, 1].
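As a numerical illustration, the following Python sketch (scipy assumed available; the evaluation points are arbitrary) evaluates this power function; binom.sf gives the upper tail of the Binomial distribution.

from scipy.stats import binom

def power(theta, n=200, cutoff=5):
    # pi(theta) = P(sum_i X_i > cutoff; theta), with sum_i X_i ~ Binomial(n, theta)
    return binom.sf(cutoff, n, theta)

for theta in (0.01, 0.02, 0.05, 0.10):
    print(theta, round(power(theta), 4))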

The power function of an ideal test is obviously given by

    π(θ) = I_{Ω_H̄}(θ) = 1 if θ ∈ Ω_H̄,   and 0 else.

When comparing two tests for a given H, a test is better if it has higher power for θ ∈ Ω_H̄ and
lower power for θ ∈ Ω_H, which implies that the better test has lower probabilities of both type
I and type II error.
Properties of statistical tests
The power function can be used to define important properties of a test, including the size and
the significance level of the test, and whether it is unbiased and consistent.
The size of a test is the maximum probability of a type I error.
Definition (Size of test): Let π(θ) be the power function
of a test Cr for the hypothesis H. Then

    α = sup_{θ∈Ω_H} π(θ) = sup_{θ∈Ω_H} P(x ∈ Cr; θ)

is called the size of the test Cr.


The definition of the size implies that it is a measure of the minimum level of protection against
a type I error. The lower the size of a test, the lower the maximum probability of a type I
error.
A closely related concept to that of the size is the significance level of a test.
Definition (Significance level of test): A test of significance
level α is any test for which

    P(type I error) = P(x ∈ Cr | θ ∈ Ω_H) ≤ α.
The significance level is an upper bound of the type I error probability. The difference between
the size and the significance level is that the former is the sup_{θ∈Ω_H} P(type I error), while the
latter is only a bound that might not be equal to P(type I error) for any θ ∈ Ω_H nor equal to
sup_{θ∈Ω_H} P(type I error). Thus a test of H having size α is a test of significance level α* for any
α* ≥ α.

Example 4.6 Let x = (X_1, ..., X_n)' be a random sample from a N(μ, 25) population and consider the hypothesis H : μ ≤ 17. What is the size of the test with critical region

    Cr = {x : x̄_n > 17 + 5/√n } ?

To answer this question, we first need to derive the power function. The power function obtains
as

    π(μ) = P(x ∈ Cr; μ) = P(x̄_n > 17 + 5/√n ; μ)
         = 1 − P(x̄_n ≤ 17 + 5/√n ; μ)
         = 1 − P( (x̄_n − μ)/(5/√n) ≤ (17 + 5/√n − μ)/(5/√n) ; μ ),   where (x̄_n − μ)/(5/√n) ~ N(0, 1),
         = 1 − Φ( (17 + 5/√n − μ)/(5/√n) ),

where Φ(·) denotes the cdf of a N(0, 1) distribution. Now note that the argument
(17 + 5/√n − μ)/(5/√n) is monotonically decreasing in μ, so that π(μ) is monotonically increasing in μ.
Thus the maximizer of the power function under the constraint H : μ ≤ 17 is the upper bound
of Ω_H, namely μ = 17. Hence the size of the test is

    α = sup_{μ≤17} π(μ) = π(17) = 1 − Φ(1) = 0.159.
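A short Python check of this result (scipy assumed available): the function below reproduces π(μ) = 1 − Φ((17 + 5/√n − μ)/(5/√n)), and evaluating it at μ = 17 gives the size 0.159 for any n; the choice n = 25 is illustrative only.

import numpy as np
from scipy.stats import norm

def power(mu, n, mu0=17.0, sigma=5.0):
    # pi(mu) = 1 - Phi((mu0 + sigma/sqrt(n) - mu) / (sigma/sqrt(n)))
    se = sigma / np.sqrt(n)
    return 1 - norm.cdf((mu0 + se - mu) / se)

print(round(power(17.0, n=25), 3))   # 0.159 = 1 - Phi(1), independent of n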

The concept of unbiasedness within the context of tests refers to a test that has smaller probability of rejecting H when H is true compared to when H is false.

Definition (Unbiasedness of a test): Let π(θ) be the power
function of a test for the hypothesis H. The test is called
unbiased iff

    sup_{θ∈Ω_H} π(θ) ≤ inf_{θ∈Ω_H̄} π(θ).

The definition implies that if the height of the power function graph is everywhere lower for
θ ∈ Ω_H than for θ ∈ Ω_H̄, the test is unbiased. Clearly, unbiasedness is a desirable property of
a test.
Another desirable property of a test is that, for a given size (supremum of the type-I error
probability), the test exhibits the smallest possible type-II error probability. Such a test is
called uniformly most powerful.
Definition (Uniformly most powerful (UMP) size-α test):
Let C_α = {Cr : sup_{θ∈Ω_H} π_{Cr}(θ) ≤ α} be the set of all critical
regions with a size of at most α for the hypothesis H. A test
with critical region Cr* ∈ C_α and with power function π_{Cr*}(θ) is
called uniformly most powerful of size α iff

    sup_{θ∈Ω_H} π_{Cr*}(θ) = α,   and
    π_{Cr*}(θ) ≥ π_{Cr}(θ)   ∀ θ ∈ Ω_H̄ and ∀ Cr ∈ C_α.

The definition implies that the UMP test of H is the best test of H providing protection against
type I error equal to α. Equivalently, it is the size-α test having the most power uniformly in
θ for θ ∈ Ω_H̄. In the case where H̄ is simple, referring to one value of θ, the test Cr* is called
most powerful (the adverb uniformly being redundant in this case).
Unfortunately, in many cases of practical interest, UMP tests do not exist. But, as we shall
see later, it is sometimes possible to restrict attention to the class of unbiased tests and define
a UMP test within this class.
Definition (Admissibility of test): Let Cr be a test of H. If
there exists an alternative test Cr* such that

    π_{Cr*}(θ) ≤ π_{Cr}(θ)   ∀ θ ∈ Ω_H   and   π_{Cr*}(θ) ≥ π_{Cr}(θ)   ∀ θ ∈ Ω_H̄,

with strict inequality holding for some θ ∈ Ω_H ∪ Ω_H̄, then Cr
is inadmissible.
From the definition, it follows that an inadmissible test is one that is dominated by another
test in terms of protection against both type I and type II errors. Inadmissible tests can be
eliminated from any further consideration in testing applications.
A further desirable property of a test is that for a given significance level the probability
P (type II error) 0 as the sample size n . This property is called consistency.
In the definition, since the critical region of a test will generally change as n changes, we use
the notation Crn to indicate the dependence of the critical region on n.
Definition (Consistency of test): Let {Crn} be a sequence
of tests of H based on a random sample (X_1, ..., X_n). Let
the significance level of the test Crn be α ∀ n. Then the
sequence of tests {Crn} is said to be a consistent sequence of
significance level-α tests iff

    lim_{n→∞} π_{Crn}(θ) = 1   ∀ θ ∈ Ω_H̄.

From the definition, it follows that a consistent test is such that, in the limit, the probability
is 1 that H is rejected whenever H is false.

4.3. Construction of UMP Tests


In this section we discuss results that can be used to identify UMP tests, that is, tests which
have for a given size the smallest possible type-II error probability. In particular, we will
discuss two approaches for identifying UMP tests: The Neyman-Pearson and the monotone likelihood ratio approach. The discussion will be restricted to the scalar parameter
case, which is motivated by the fact that UMP tests do not exist for most multidimensional
parameter testing contexts.
Neyman-Pearson approach - simple hypotheses
We begin with the case where the Null and the alternative hypothesis are simple and can be
represented as
    H0 : θ = θ0,    H1 : θ = θ1.
This case, which is not representative of a typical testing problem, serves as a starting point
to motivate the basic principles. Note that since the alternative H1 refers to a single point, we
drop the adverb uniformly and try to identify the most powerful (MP) test. The following
theorem is useful in finding a MP test for a simple H0 and a simple H1 hypothesis.
Theorem 4.1 (Neyman-Pearson Lemma) Let x be a random sample from f(x; θ). Furthermore, let k > 0 be a positive constant and Cr a critical region which satisfy

    1.  P(x ∈ Cr; θ0) = α,   0 < α < 1;

    2.  f(x; θ0)/f(x; θ1) ≤ k   ∀ x ∈ Cr;

    3.  f(x; θ0)/f(x; θ1) > k   ∀ x ∉ Cr.

Then Cr is the most powerful critical region of size α for testing the hypothesis H0 : θ = θ0
versus H1 : θ = θ1.

Proof
Let Cr* represent any other critical region of size α, so that

    P(x ∈ Cr*; θ0) = α.

Next consider the indicator functions

    I_Cr(x) = 1 if x ∈ Cr, 0 else,   and   I_Cr*(x) = 1 if x ∈ Cr*, 0 else,

and note that

    I_Cr(x) − I_Cr*(x) = 1 − I_Cr*(x) ≥ 0   ∀ x ∈ Cr,
    I_Cr(x) − I_Cr*(x) = 0 − I_Cr*(x) ≤ 0   ∀ x ∉ Cr.

Also note that conditions (2) and (3) imply that

    f(x; θ1) ≥ (1/k) f(x; θ0)   if x ∈ Cr,   and   f(x; θ1) < (1/k) f(x; θ0)   if x ∉ Cr.

The preceding results together imply that

    [I_Cr(x) − I_Cr*(x)] f(x; θ1) ≥ (1/k) [I_Cr(x) − I_Cr*(x)] f(x; θ0)   ∀ x.

Assuming that x is discrete, the summation of both sides of this inequality over all x values in the
sample space R yields (if x is continuous, substitute the summation by an integration)

    ∑_{x∈R} [I_Cr(x) − I_Cr*(x)] f(x; θ1) ≥ (1/k) ∑_{x∈R} [I_Cr(x) − I_Cr*(x)] f(x; θ0)

    P(x ∈ Cr; θ1) − P(x ∈ Cr*; θ1) ≥ (1/k) [ P(x ∈ Cr; θ0) − P(x ∈ Cr*; θ0) ],

where both probabilities on the right-hand side equal the size, fixed to be α. Hence we obtain

    P(x ∈ Cr; θ1) − P(x ∈ Cr*; θ1) ≥ 0.

Hence, the size-α test Cr has a power for θ1 which is larger or equal to that of any other size-α test
Cr*. Thus Cr is the most powerful size-α test.


The Neyman-Pearson lemma facilitates the construction of a most powerful test of H0 : θ = θ0
versus H1 : θ = θ1. In particular, the most powerful critical region is defined by a test statistic,
namely the likelihood ratio

    λ = f(x; θ0) / f(x; θ1).

Note that the smaller λ, the less plausible H0 : θ = θ0 is relative to H1 : θ = θ1.
Using the likelihood ratio, the most powerful critical region can be represented as

    Cr = {x : f(x; θ0)/f(x; θ1) ≤ k},

where the critical value k is selected such that the test Cr has a given size α, i.e.

    P( f(x; θ0)/f(x; θ1) ≤ k ; θ0 ) = α.

Example 4.7 Let x be a random sample of size n from an exponential population with pdf

    f(x; θ) = θ exp{−θx},   x ∈ (0, ∞),   θ > 0.

Consider the hypotheses H0 : θ = θ0 versus H1 : θ = θ1 with θ0 < θ1. The most powerful critical
region is defined in terms of the likelihood ratio, which obtains as

    f(x; θ0)/f(x; θ1) = [θ0^n exp{−θ0 ∑_{i=1}^n x_i}] / [θ1^n exp{−θ1 ∑_{i=1}^n x_i}]
                      = (θ0/θ1)^n exp{(θ1 − θ0) ∑_{i=1}^n x_i}.

According to the Neyman-Pearson lemma, the most powerful critical region has the form

    Cr = {x : f(x; θ0)/f(x; θ1) ≤ k}
       = {x : (θ0/θ1)^n exp{(θ1 − θ0) ∑_{i=1}^n x_i} ≤ k}

(solving the inequality for ∑_i x_i, and noting that θ0 < θ1)

       = {x : ∑_{i=1}^n x_i ≤ (θ1 − θ0)^{−1} ln[(θ1/θ0)^n k] =: k̄},

defining the critical region in terms of the test statistic ∑_i X_i. Now, the value for k̄ is selected
such that Cr has a given size α, i.e.

    P(x ∈ Cr; θ0) = P(∑_{i=1}^n x_i ≤ k̄; θ0) = α.

Now note that since X_i ~ Exponential(θ), the sum ∑_{i=1}^n X_i has a Gamma(n, θ)-distribution;
hence the relationship between the critical value k̄ and the size α is

    P(∑_{i=1}^n x_i ≤ k̄; θ0) = ∫_0^{k̄} [θ0^n / Γ(n)] v^{n−1} exp{−θ0 v} dv = α,

which can be solved for k̄ (which is the α-quantile of the Gamma(n, θ0)-distribution). Thus
the most powerful test of size α for H0 : θ = θ0 versus H1 : θ = θ1 is based upon the following
decision rule:

    Reject H0 if ∑_{i=1}^n x_i ≤ k̄,   where k̄ = α-quantile of the Gamma(n, θ0)-distribution.
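A minimal Python sketch of this decision rule (scipy assumed available; the values of n, θ0, θ1 and α below are illustrative): k̄ is the α-quantile of the Gamma(n, θ0) distribution, and the power at θ1 is the Gamma(n, θ1) cdf evaluated at k̄.

from scipy.stats import gamma

def mp_test_critical_value(n, theta0, alpha=0.05):
    # kbar = alpha-quantile of Gamma(n, rate = theta0); scipy parameterizes by scale = 1/rate
    return gamma.ppf(alpha, a=n, scale=1.0 / theta0)

n, theta0, theta1, alpha = 20, 1.0, 2.0, 0.05
kbar = mp_test_critical_value(n, theta0, alpha)
power_at_theta1 = gamma.cdf(kbar, a=n, scale=1.0 / theta1)   # P(sum x_i <= kbar; theta1)
print(kbar, power_at_theta1)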

It can be shown that whenever the random vector x is continuous, a Neyman-Pearson most
powerful test for H0 : θ = θ0 versus H1 : θ = θ1 will exist. In cases where the random vector x
is discrete, a Neyman-Pearson most powerful test will typically not exist. In those cases, the
most powerful test exists only for a limited number of sizes α.
In order to see this, consider the construction of the Neyman-Pearson most powerful critical
region, whereby Cr obtains as

    Cr = {x : f(x; θ0)/f(x; θ1) ≤ k},   where k is selected such that   P(f(x; θ0)/f(x; θ1) ≤ k; θ0) = α.

Now if x is discrete, the random variable f(x; θ0)/f(x; θ1) is also discrete with a cdf that
exhibits discontinuities. This implies that the equality defining the value for k can be satisfied
at a limited set of α values only.

Example 4.8 Let x be a random sample of size n from a Bernoulli population with pdf

    f(x; θ) = θ^x (1 − θ)^{1−x},   x ∈ {0, 1},   θ ∈ [0, 1].

Consider the hypotheses H0 : θ = θ0 versus H1 : θ = θ1 with θ0 < θ1. The likelihood ratio is

    f(x; θ0)/f(x; θ1) = [θ0^{∑_i x_i} (1 − θ0)^{n−∑_i x_i}] / [θ1^{∑_i x_i} (1 − θ1)^{n−∑_i x_i}]
                      = [θ0(1 − θ1) / (θ1(1 − θ0))]^{∑_i x_i} [(1 − θ0)/(1 − θ1)]^n,

so that the Neyman-Pearson most powerful critical region has the form

    Cr = { x : [θ0(1 − θ1)/(θ1(1 − θ0))]^{∑_i x_i} [(1 − θ0)/(1 − θ1)]^n ≤ k }

(solving the inequality for ∑_i x_i, and noting that θ0 < θ1)

       = { x : ∑_{i=1}^n x_i ≥ ( ln(k) − n ln[(1 − θ0)/(1 − θ1)] ) / ln[θ0(1 − θ1)/(θ1(1 − θ0))] =: k̄ }.

Since ∑_{i=1}^n X_i is Binomial(n, θ) distributed, the relationship between the choice of k̄ and the
size α of the test implied by Cr is given by

    α = P(x ∈ Cr; θ0) = P(∑_{i=1}^n x_i ≥ k̄; θ0) = ∑_{j=k̄}^n (n choose j) θ0^j (1 − θ0)^{n−j}.

For θ0 = 0.2, θ1 = 0.8, and n = 20, the possible choices of the test size α are

    k̄    α          k̄     α
    1    0.9885      7    0.0867
    2    0.9308      8    0.0321
    3    0.7939      9    0.0100
    4    0.5886     10    0.0026
    5    0.3704     11    0.0006
    6    0.1958     12    0.0001

Note that there are no choices of α within the range [0.0001, 0.9885] other than the ones displayed
in the table. Hence, a Neyman-Pearson most powerful test of size, say, α = 0.05 does not exist.
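The table of attainable sizes can be reproduced with a few lines of Python (scipy assumed available); binom.sf(kbar - 1, n, theta0) returns P(∑ X_i ≥ k̄; θ0).

from scipy.stats import binom

n, theta0 = 20, 0.2
for kbar in range(1, 13):
    print(kbar, round(binom.sf(kbar - 1, n, theta0), 4))   # attainable sizes alpha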
One can show that the Neyman-Pearson most powerful test of H0 : θ = θ0 versus H1 : θ = θ1
is also an unbiased test, that is, a test which has smaller probability of rejecting H0 when H0
is true compared to when H0 is false (see Mittelhammer, 1996, Theorem 9.2).
Neyman-Pearson approach - composite hypotheses
Up to this point, we considered the construction of the most powerful test for a simple H0
against a simple H1 by means of the Neyman-Pearson lemma. In some cases, the Neyman-Pearson approach can also be used to identify UMP tests when the alternative hypothesis is
composite, i.e.

    H0 : θ = θ0,    H1 : θ ∈ Ω_1.

The idea is to show via Neyman-Pearson that the critical region Cr of an MP test of H0 : θ = θ0
versus H1 : θ = θ1 is the same ∀ θ1 ∈ Ω_1. It would then follow that this Cr defines a uniformly
most powerful test for H0 : θ = θ0 versus H1 : θ ∈ Ω_1.
Theorem 4.2 (UMP Test of a simple H0 versus a composite H1) Let x be a random
sample from f(x; θ). Furthermore, let {k(θ1) > 0} be a set of positive constants with
θ1 ∈ Ω_1 and Cr a critical region which satisfy

    1.  P(x ∈ Cr; θ0) = α,   0 < α < 1;

    2.  f(x; θ0)/f(x; θ1) ≤ k(θ1)   ∀ x ∈ Cr and ∀ θ1 ∈ Ω_1;

    3.  f(x; θ0)/f(x; θ1) > k(θ1)   ∀ x ∉ Cr and ∀ θ1 ∈ Ω_1.

Then Cr is the uniformly most powerful critical region of size α for testing the hypothesis
H0 : θ = θ0 versus H1 : θ ∈ Ω_1.

Proof
According to the Neyman-Pearson lemma, the region Cr satisfying the conditions (1)-(3) defines the
most powerful size-α test for H0 : θ = θ0 versus H1 : θ = θ1. Since the same critical region, Cr,
applies for any θ1 ∈ Ω_1, the region Cr is also most powerful for any value of θ1 ∈ Ω_1. Thus Cr is a
UMP size-α test.

Example 4.9 Recall the example where we considered a sample x of size n from an exponential
population with pdf f(x; θ) = θ exp{−θx} and simple hypotheses H0 and H1. Now consider the
hypotheses

    H0 : θ = θ0   versus   H1 : θ > θ0.

As shown above, the most powerful size-α test for H0 : θ = θ0 and H1 : θ = θ1 with θ0 < θ1 is

    Cr = {x : ∑_{i=1}^n x_i ≤ (θ1 − θ0)^{−1} ln[(θ1/θ0)^n k] = k̄},

where k̄ is selected to be the α-quantile of a Gamma(n, θ0) density, i.e.,

    ∫_0^{k̄} [θ0^n / Γ(n)] v^{n−1} exp{−θ0 v} dv = α.

The form of this critical region Cr is independent of the particular value of θ1 (as long as
θ1 > θ0); it is the same for each value of θ1 > θ0. (Note that, since k̄ is fixed, the values for
k depend upon θ1, so that k is a function k(θ1).) Thus, by the above theorem, the test Cr also
defines a UMP test for H0 : θ = θ0 versus H1 : θ > θ0.

Note that for the hypotheses

    H0 : θ = θ0   versus   H1 : θ < θ0

(instead of H1 : θ > θ0) the UMP size-α test is

    Cr = {x : ∑_{i=1}^n x_i ≥ (θ1 − θ0)^{−1} ln[(θ1/θ0)^n k] = k̄},

where k̄ is selected to be the (1 − α)-quantile of a Gamma(n, θ0) density, i.e.,

    ∫_{k̄}^∞ [θ0^n / Γ(n)] v^{n−1} exp{−θ0 v} dv = α.

Thus, relative to the initial case where we considered H0 : θ = θ0 versus H1 : θ > θ0, the critical
region for the UMP size-α test has changed! This implies that there will not exist a UMP test
for

    H0 : θ = θ0   versus   H1 : θ ≠ θ0.

In the previous discussion we considered a testing situation with a simple Null and a composite
alternative hypothesis, i.e.,

    H0 : θ = θ0   versus   H1 : θ ∈ Ω_1.

Note that a composite alternative can be either a one-sided alternative hypothesis, such
as

    H1 : θ > θ0   or   H1 : θ < θ0,

or a two-sided alternative hypothesis, such as

    H1 : θ ≠ θ0.

In practice, UMP size-α tests of simple H0's versus one-sided H1's typically exist when the
parameter θ is a scalar and x is continuous (when x is discrete, the UMP test typically
exists only for a limited number of sizes). In sharp contrast, UMP size-α tests of simple H0's
versus two-sided H1's will typically not exist (see the previous example). In such cases one must
generally resort to seeking a UMP test within a subclass of tests, such as the class of unbiased
tests.
Monotone likelihood ratio approach
The monotone likelihood ratio approach can be used to identify UMP size-α tests of composite
Null hypotheses versus composite one-sided alternative hypotheses, i.e.,

    H0 : θ ≤ θ0 (or H0 : θ ≥ θ0)   versus   H1 : θ > θ0 (or H1 : θ < θ0).
This procedure for defining UMP size-α tests relies on the concept of monotone likelihood
ratios in statistics T = t(x). As we shall see, the concept of monotone likelihood ratios
allows us to apply the Neyman-Pearson approach to identify UMP tests of composite Null
hypotheses versus composite one-sided alternative hypotheses.

Definition (Monotone likelihood ratio): A family of density functions {f(x; θ), θ ∈ Ω} is said to have a monotone
likelihood ratio in the statistic T = t(x) iff ∀ θ1 > θ2
the likelihood ratio

    L(θ1; x)/L(θ2; x) = f(x; θ1)/f(x; θ2)

can be expressed as a nondecreasing function of t(x) ∀ x.

The fact that a family of densities has a monotone likelihood ratio in T implies that the smaller
the value of T, the less plausible the parameter value θ1 will be relative to θ2.
Example 4.10 For the family of exponential densities {f(x; θ) = θ exp{−θx}, θ > 0} we have
the likelihood ratio

    L(θ1; x)/L(θ2; x) = [θ1^n exp{−θ1 ∑_{i=1}^n x_i}] / [θ2^n exp{−θ2 ∑_{i=1}^n x_i}]
                      = (θ1/θ2)^n exp{−(θ1 − θ2) ∑_{i=1}^n x_i},

where (θ1/θ2)^n > 0 and θ1 − θ2 > 0 for θ1 > θ2. Hence, ∀ θ1 > θ2 the likelihood ratio is a
nonincreasing function of ∑_{i=1}^n x_i and hence a nondecreasing function of t(x) = 1/∑_{i=1}^n x_i. Thus
the family of densities {f(x; θ)} for an exponential population has a monotone likelihood ratio
in the statistic t(x) = 1/∑_{i=1}^n x_i.

The verification of whether a particular family of densities has a monotone likelihood ratio can
be quite difficult. However, if the family belongs to the exponential class of densities, the
verification is simplified by the following result.
Theorem 4.3 (Monotone likelihood ratio and the exponential class) Let {f(x; θ),
θ ∈ Ω} be a density family belonging to the exponential class of densities, with

    f(x; θ) = exp{c(θ)g(x) + d(θ) + z(x)}.

If c(θ) is a nondecreasing function of θ, then {f(x; θ), θ ∈ Ω} has a monotone likelihood ratio
in the statistic g(x).
Proof
Let θ1 > θ2, and examine the likelihood ratio

    L(θ1; x)/L(θ2; x) = exp{ [c(θ1) − c(θ2)] g(x) + d(θ1) − d(θ2) },

where c(θ1) − c(θ2) ≥ 0, since c(θ) is nondecreasing and θ1 > θ2.
Thus the likelihood ratio can be expressed as a nondecreasing function of g(x).

If the family of densities can be shown to have a monotone likelihood ratio in some statistic,
then UMP size-α tests of

    H0 : θ ≤ θ0 (or H0 : θ ≥ θ0)   versus   H1 : θ > θ0 (or H1 : θ < θ0)

will exist and can be identified by the Neyman-Pearson approach.

Theorem 4.4 (Monotone likelihood ratios and UMP size-α tests) Let {f(x; θ), θ ∈ Ω}
be a density family having a monotone likelihood ratio in the statistic t(x). Then

    1.  Cr = {x : t(x) ≥ k},   where k is such that P(t(x) ≥ k; θ0) = α,
        is a UMP size-α test for H0 : θ ≤ θ0 versus H1 : θ > θ0;

    2.  Cr = {x : t(x) ≤ k},   where k is such that P(t(x) ≤ k; θ0) = α,
        is a UMP size-α test for H0 : θ ≥ θ0 versus H1 : θ < θ0.

Proof
See Mittelhammer (1996), Theorem 9.6 and Corollary 9.1.

The intuition behind the form of the UMP critical region implied by this result is the following.
For H0 : θ ≤ θ0 versus H1 : θ > θ0 the UMP critical region is given by Cr = {x : t(x) ≥ k}.
Now note that for all θ_i such that θ_i < θ0:

    the larger t(x),   the larger L(θ0; x)/L(θ_i; x),
    the less plausible the θ_i's < θ0 relative to θ0,
    and the less plausible H0 : θ ≤ θ0 relative to H1 : θ > θ0.

Hence we should reject H0 if t(x) is too large.

Example 4.11 Let x be a random sample from an exponential population with density

    f(x; θ) = θ exp{−θx},   θ > 0.

Define a UMP size-α test for H0 : θ ≤ θ0 versus H1 : θ > θ0. As shown in the previous
example, the family of density functions {f(x; θ)} for an exponential population has a monotone
likelihood ratio in the statistic t(x) = 1/∑_{i=1}^n X_i. Hence, we can rely on the UMP size-α test for
monotone likelihood ratios, so that the UMP size-α test is given by

    Cr = {x : 1/∑_{i=1}^n x_i ≥ k},   where k is selected such that
    P(1/∑_{i=1}^n x_i ≥ k; θ0) = P(∑_{i=1}^n x_i ≤ 1/k; θ0) = α.

Since ∑_{i=1}^n X_i has a Gamma(n, θ)-distribution, the appropriate value of 1/k is the solution to
the integral equation

    ∫_0^{1/k} [θ0^n / Γ(n)] v^{n−1} exp{−θ0 v} dv = α

(k is the reciprocal of the α-quantile of a Gamma(n, θ0)-distribution).
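The following Python sketch (scipy assumed available; n, θ0 and α are illustrative) computes the critical value 1/k and evaluates the resulting power function, which equals α at θ0 and increases in θ, as a UMP test for this one-sided problem should.

from scipy.stats import gamma

n, theta0, alpha = 20, 1.0, 0.05
inv_k = gamma.ppf(alpha, a=n, scale=1.0 / theta0)   # 1/k = alpha-quantile of Gamma(n, rate=theta0)

def power(theta):
    # pi(theta) = P(sum x_i <= 1/k; theta), with sum X_i ~ Gamma(n, rate=theta)
    return gamma.cdf(inv_k, a=n, scale=1.0 / theta)

for theta in (0.5, 1.0, 1.5, 2.0):
    print(theta, round(power(theta), 4))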

4.4. Hypothesis-Testing Methods


In this chapter, we consider generally applicable methods which can be used to define explicit rules for testing statistical hypotheses. Examples for generally applicable methods for
constructing statistical tests are the likelihood ratio, Wald and Lagrange multiplier approach.
None of these methods is guaranteed to produce a test with optimal properties in all cases.
However, the virtues of these methods are that
they are straightforward to apply
they are applicable to a wide class of problems

they generally have excellent asymptotic properties
and they often have good power in finite samples.

4.4.1. Likelihood Ratio Tests


The likelihood ratio (LR) test is based upon the ratio of maximized likelihood functions (the
generalized likelihood ratio). It is a natural procedure to use for testing hypotheses about θ
when ML estimation is being used to estimate θ.

Definition (Likelihood ratio test): Let L(θ; x) be the likelihood function for a sample x = (X_1, ..., X_n). The generalized likelihood ratio (GLR) is defined as

    λ(x) = sup_{θ∈Ω_{H0}} L(θ; x) / sup_{θ∈Ω_{H0}∪Ω_{H1}} L(θ; x).

A likelihood ratio test for testing H0 versus H1 is given by the critical region

    Cr = {x : λ(x) ≤ c}.

For a size-α test, the constant c is chosen to satisfy

    sup_{θ∈Ω_{H0}} π(θ) = sup_{θ∈Ω_{H0}} P(λ(x) ≤ c; θ) = α.
Two notes are in order:

1. Since H0 can be interpreted as a restriction, the numerator of the GLR

       λ(x) = sup_{θ∈Ω_{H0}} L(θ; x) / sup_{θ∈Ω_{H0}∪Ω_{H1}} L(θ; x)

   represents the likelihood at the constrained maximizer (constrained ML estimate), while
   the denominator is, for Ω_{H0} ∪ Ω_{H1} = Ω, the likelihood at the unconstrained maximizer (ML
   estimate). Since the likelihood values are non-negative, it follows that

       λ(x) ∈ [0, 1].

   Note also that in cases where the restriction H0 is not binding, the constrained ML
   estimate is equal to the unconstrained one, so that λ(x) = 1.
2. Note that the GLR

       λ(x) = sup_{θ∈Ω_{H0}} L(θ; x) / sup_{θ∈Ω_{H0}∪Ω_{H1}} L(θ; x)

   tends to be small (large) when the restriction H0 is not true (is true). Hence, it
   appears reasonable to use a critical region of the form

       Cr = {x : λ(x) ≤ c},

   where we reject H0 when the GLR is too small.
Example 4.12 Let x be a random sample of size n from an exponential population with density

    f(x; θ) = θ exp{−θx},   θ > 0.

Define the LR test for

    H0 : θ ≤ θ0   versus   H1 : θ > θ0.

Recall that the (unconstrained) ML estimate of θ is θ̂ = n/∑_i x_i. Since Ω_{H0} ∪ Ω_{H1} = Ω, it
follows that the denominator of the GLR is

    sup_{θ∈Ω} L(θ; x) = sup_{θ>0} [θ^n exp{−θ ∑_i x_i}] = [n/∑_i x_i]^n exp{−n}.

The numerator of the GLR depends on whether θ ≤ θ0 is binding, and obtains as

    sup_{θ∈Ω_{H0}} L(θ; x) = sup_{0<θ≤θ0} [θ^n exp{−θ ∑_i x_i}]
        = [n/∑_i x_i]^n exp{−n}        if θ̂ ≤ θ0   (non-binding)
        = θ0^n exp{−θ0 ∑_i x_i}        if θ̂ > θ0   (binding)

(note that the likelihood function is strictly concave in θ). Hence the GLR obtains as

    λ(x) = 1                                           if θ̂ ≤ θ0   (non-binding)
         = [θ0 ∑_i x_i / n]^n exp{−θ0 ∑_i x_i + n}      if θ̂ > θ0   (binding),

where the latter expression is < 1. The critical region of the LR test is given by

    Cr = {x : λ(x) ≤ c}   for   0 < c < 1,

where c = 1 is excluded. (Setting c = 1 would imply that we always reject H0.) Hence, Cr
can also be represented as

    Cr = {x : [θ0 ∑_i x_i / n]^n exp{−θ0 ∑_i x_i + n} ≤ c  and  θ̂ = n/∑_i x_i > θ0 (restriction binding)}.

For a size-α test, c is selected such that

    sup_{0<θ≤θ0} P(x ∈ Cr; θ) = α.

In order to simplify the computation of P(x ∈ Cr; θ), we define the variable y ≡ θ0 (1/n) ∑_i x_i and
rewrite Cr as

    Cr = {x : y^n exp{−n(y − 1)} ≤ c  and  y < 1}.

Now note that the function y^n exp{−n(y − 1)} is nondecreasing on [0, 1] and has a maximum
at y = 1. Hence we can rewrite Cr as

    Cr = {x : y ≤ k and 0 < k < 1} = {x : ∑_i θ0 x_i ≤ nk and 0 < k < 1}.

Thus the power function obtains as

    P(x ∈ Cr; θ) = P(∑_i θ0 x_i ≤ nk; θ),

with

    sup_{0<θ≤θ0} P(x ∈ Cr; θ) = sup_{0<θ≤θ0} P(∑_i θ0 x_i ≤ nk; θ) = P(∑_i θ0 x_i ≤ nk; θ0)

(since P(∑_i θ0 x_i ≤ nk; θ) ≤ P(∑_i θ0 x_i ≤ nk; θ0) ∀ θ ≤ θ0).

Finally note that for θ = θ0, the random variable ∑_i θ0 X_i has a Gamma(n, 1) distribution.
It follows that the appropriate value of k for a size-α LR test is the solution to the integral
equation

    ∫_0^{nk} [1/Γ(n)] v^{n−1} exp{−v} dv = α.
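A minimal Python sketch of this LR test (scipy assumed available; the simulated data are illustrative): nk is the α-quantile of the Gamma(n, 1) distribution, and H0 is rejected when θ0 ∑ x_i falls below it.

import numpy as np
from scipy.stats import gamma

def lr_test_exponential(x, theta0, alpha=0.05):
    # size-alpha LR test of H0: theta <= theta0 vs H1: theta > theta0
    n = len(x)
    nk = gamma.ppf(alpha, a=n)           # alpha-quantile of Gamma(n, 1)
    return theta0 * np.sum(x) <= nk      # True = reject H0

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0 / 3.0, size=30)   # hypothetical sample with true theta = 3
print(lr_test_exponential(x, theta0=1.0))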
Finite sample properties of the LR test
Whether the LR test is unbiased and/or a UMP test for a given hypothesis must be established
on a case-by-case basis, since it depends on both the characteristics of f(x; θ) and the sets of
parameter values defining H0 and H1. However, there are some parallels between the LR test and the Neyman-Pearson
UMP test. In particular, if both H0 and H1 are simple, then the size-α LR test and the
Neyman-Pearson MP size-α test will be equivalent.

Theorem 4.5 (Equivalence of LR and NP tests when H0 and H1 are simple) Suppose
a size-α LR test of H0 : θ = θ0 versus H1 : θ = θ1 exists with critical region

    Cr^LR = {x : λ(x) ≤ c},   where P(x ∈ Cr^LR; θ0) = α ∈ (0, 1).

Furthermore, suppose a Neyman-Pearson most powerful size-α test also exists with critical
region

    Cr = {x : L(θ0; x)/L(θ1; x) ≤ k},   where P(x ∈ Cr; θ0) = α.

Then the LR test and the Neyman-Pearson most powerful test are equivalent.
Proof
Since λ(x) ∈ [0, 1], and given that α ∈ (0, 1), it follows that

    P(λ(x) ≤ c; θ0) = α   only if   c < 1.

Hence, for a size α < 1, the critical value c of the LR test must be less than 1. Now let

    θ̂ = arg max_{θ∈{θ0,θ1}} L(θ; x),   so that the GLR is   λ(x) = L(θ0; x)/L(θ̂; x).

Then the partitioning of the sample space leads to the following relations:

    x ∈ A = {x : θ̂ = θ1}  ⟹  λ(x) = L(θ0; x)/L(θ̂; x) = L(θ0; x)/L(θ1; x) ≤ 1   (simple LR),
    x ∈ B = {x : θ̂ = θ0}  ⟹  λ(x) = L(θ0; x)/L(θ̂; x) = 1   and   L(θ0; x)/L(θ1; x) ≥ 1.

It follows that

    λ(x) = L(θ0; x)/L(θ̂; x) ≤ c < 1   only if   θ̂ = θ1 and λ(x) = L(θ0; x)/L(θ1; x)

(GLR = simple LR of Neyman-Pearson). Thus, for c < 1 and α ∈ (0, 1), if

    P( L(θ0; x)/L(θ̂; x) ≤ c; θ0 ) = α   and   P( L(θ0; x)/L(θ1; x) ≤ c; θ0 ) = α,

then Cr = Cr^LR and the LR test is equivalent to the Neyman-Pearson MP test of size α.

The theorem implies that the LR test for a simple H0 versus a simple H1 is a most powerful
test, if an LR test and an MP test exist.
Example 4.13 Recall the example in which we considered a sample x of size n from an exponential population with pdf

    f(x; θ) = θ exp{−θx},   x ∈ (0, ∞),   θ > 0,

and the simple hypotheses H0 : θ = θ0 versus H1 : θ = θ1 with θ0 < θ1. The Neyman-Pearson
MP size-α test was obtained as

    Cr = {x : ∑ x_i ≤ k̄},   where k̄ is the α-quantile of the Gamma(n, θ0)-distribution.

Now the GLR for this problem is

    λ(x) = L(θ0; x) / max_{θ∈{θ0,θ1}} L(θ; x)
         = 1                                             if θ̂ = θ0, i.e. L(θ0; x) > L(θ1; x),
         = (θ0/θ1)^n exp{(θ1 − θ0) ∑_i x_i}               if θ̂ = θ1, i.e. L(θ0; x) ≤ L(θ1; x).

It follows that, for c < 1, the probability P(λ(x) ≤ c; θ) is given by

    P(λ(x) ≤ c; θ) = P( (θ0/θ1)^n exp{(θ1 − θ0) ∑_i x_i} ≤ c ; θ )

(solving the inequality for ∑_i x_i, and noting that θ0 < θ1)

                   = P( ∑_i x_i ≤ (θ1 − θ0)^{−1} ln[(θ1/θ0)^n c] ; θ ).

Since ∑_i X_i has a Gamma(n, θ) distribution, it follows that the size-α LR test is given by

    Cr^LR = {x : ∑ x_i ≤ c̄},   where c̄ is the α-quantile of the Gamma(n, θ0)-distribution.

Thus, the critical regions of the LR test and the Neyman-Pearson MP test are identical.

As discussed above, the Neyman-Pearson lemma for simple H0's versus simple H1's can be
extended to the case of defining UMP tests for simple H0's versus composite H1's (see Theorem
4.2). Similarly, the result on the equivalence of the LR test and the Neyman-Pearson MP
test for simple H0's versus simple H1's (see Theorem 4.5) can be extended to the case of simple
H0's versus composite H1's.

Theorem 4.6 (Equivalence of LR and NP tests when H0 is simple and H1 is composite)
Consider the hypotheses H0 : θ = θ0 versus H1 : θ ∈ Ω_1 and suppose the given critical region

    Cr^LR = {x : λ(x) ≤ c}

of the LR test defines a size-α ∈ (0, 1) test. Furthermore, suppose that ∀ θ1 ∈ Ω_1 ∃ c(θ1) ≥ 0
such that

    Cr^LR = {x : λ_1(x) ≤ c(θ1)},   where   λ_1(x) = L(θ0; x) / max_{θ∈{θ0,θ1}} L(θ; x)
    and   P(λ_1(x) ≤ c(θ1); θ0) = α.

Finally, suppose a Neyman-Pearson UMP test Cr of H0 versus H1 having size α exists. Then
the size-α LR test Cr^LR and the size-α Neyman-Pearson UMP test Cr are equivalent.
Proof
Given that a size-α Neyman-Pearson UMP test for H0 : θ = θ0 versus H1 : θ ∈ Ω_1 exists, this test is
the size-α Neyman-Pearson MP test for H0 : θ = θ0 versus H1 : θ = θ1 for every θ1 ∈ Ω_1. Because
it is assumed that for every θ1 ∈ Ω_1 both the Neyman-Pearson MP and the LR test of size α for
H0 : θ = θ0 versus H1 : θ = θ1 exist, both tests are equivalent (see Theorem 4.5). It follows that, for
H0 : θ = θ0 versus H1 : θ ∈ Ω_1, the size-α LR test is equivalent to the size-α Neyman-Pearson UMP
test.

The theorem implies that if the critical region of the size-α LR test of H0 : θ = θ0 versus
H1 : θ = θ1 is the same ∀ θ1 ∈ Ω_1, then, if a size-α Neyman-Pearson UMP test for H0 : θ = θ0
versus H1 : θ ∈ Ω_1 exists, it is given by the LR test.
Example 4.14 Consider a sample x of size n from an exponential population with pdf

    f(x; θ) = θ exp{−θx},   x ∈ (0, ∞),   θ > 0,

and the hypotheses H0 : θ = 1 versus H1 : θ ∈ Ω_1 = (0, 1). The GLR for this problem obtains
as

    λ(x) = L(1; x) / sup_{θ∈(0,1]} L(θ; x),

where L(1; x) = exp{−∑ x_i} and

    sup_{θ∈(0,1]} L(θ; x) = exp{−∑ x_i}                    if θ̂ = n/∑_i x_i ≥ 1,
                          = (n/∑ x_i)^n exp{−n}            if θ̂ < 1,

with θ̂ the (unconstrained) ML estimate. Hence we get

    λ(x) = 1                                       if θ̂ ≥ 1,
         = (∑_i x_i / n)^n exp{n − ∑_i x_i}          if θ̂ < 1.

It follows that, for c < 1, the probability P(λ(x) ≤ c; θ) is given by

    P(λ(x) ≤ c; θ) = P( (∑_i x_i / n)^n exp{n − ∑_i x_i} ≤ c ; θ ) = P( ∑_i x_i ≥ c̄ ; θ )

(see the example of the LR test for H0 : θ ≤ θ0 vs. H1 : θ > θ0; here θ0 = 1).

Since ∑ x_i has a Gamma(n, θ) distribution, it follows that the size-α LR test is given by

    Cr^LR = {x : ∑ x_i ≥ c̄},   where c̄ is the (1 − α)-quantile of the Gamma(n, 1)-distribution.

Recall that the Neyman-Pearson UMP size-α test for H0 : θ = θ0 versus H1 : θ < θ0 is

    Cr = {x : ∑ x_i ≥ k̄},   where k̄ is the (1 − α)-quantile of the Gamma(n, 1)-distribution

(see the example where we considered UMP tests for H0 : θ = θ0 vs. H1 : θ > θ0 and H0 : θ = θ0
vs. H1 : θ < θ0). Thus, the critical regions of the LR test and the Neyman-Pearson UMP test
are identical.

Asymptotic properties of the LR test

In the preceding examples illustrating the construction of LR tests, we used a monotonically
increasing or decreasing function of the GLR λ(x), say h[λ(x)], whose pdf had a known tractable
form. Then the power function obtains using the pdf of h[λ(x)] as

    P(λ(x) ≤ c; θ) = P(h[λ(x)] ≤ c̄; θ)   or   P(h[λ(x)] ≥ c̄; θ),

which allows us to define a size-α critical region in terms of the test statistic h[λ(x)]. Note that
we used such a function h[λ(x)] since the pdf of the GLR λ(x) itself typically has an intractable
form.
Unfortunately, in practice, it is not always possible to define such a statistic h[λ(x)] whose pdf
has a known tractable form. In this case, we can use a specific transformation of λ(x), namely

    −2 ln λ(x),

whose asymptotic distribution has a known tractable form. In particular, we assume that the
Null hypothesis relating to a k-dimensional parameter θ is of the following form:

    H0 : R(θ) = r,   with R(θ) a known (q × 1) vector function and r a (q × 1) vector of known constants,

where R(θ) = r places q linear and/or nonlinear restrictions on the elements of θ. Examples
are 2θ1 + 3θ2 = 3 and exp{θ2} = 3. It is also assumed that none of the q restrictions is redundant.
In this case it can be shown that when H0 is true,

    −2 ln λ(x) ~a χ²_(q).
Thus an asymptotically valid size-α LR test of H0 : R(θ) = r versus H1 : R(θ) ≠ r is

    Cr = {x : −2 ln λ(x) ≥ χ²_{q,α}},   where χ²_{q,α} is the (1 − α)-quantile of a χ²-distribution with q d.o.f.

The formal result on the asymptotic distribution of the transformed GLR −2 ln λ(x) when H0
is true is given in the following theorem.

Theorem 4.7 (Asymptotic distribution of the GLR when H0 is true) Assume that the
MLE of the (k × 1) vector θ is consistent, asymptotically normal and asymptotically efficient.
Let

    λ(x) = sup_{θ∈Ω_{H0}} L(θ; x) / sup_{θ∈Ω_{H0}∪Ω_{H1}} L(θ; x)

be the GLR statistic for testing H0 : R(θ) = r versus H1 : R(θ) ≠ r, where R(θ) is a (q × 1)
continuously differentiable vector function having nonredundant coordinate functions and q ≤ k.
Then, when H0 is true,

    −2 ln λ(x) →d χ²_q.

Proof
See Mittelhammer (1996), Theorem 10.5.

Example 4.15 Consider a sample x of size n = 10 from a Poisson population with pdf

    f(x; θ) = exp{−θ} θ^x / x!,   x ∈ {0, 1, 2, ...},   θ > 0,

and assume that ∑_i x_i = 20. Use an asymptotically valid size-α = 0.05 LR test to test the
hypotheses H0 : θ = 1.8 versus H1 : θ ≠ 1.8.
The unrestricted ML estimate of θ is θ̂ = ∑_i x_i / n = 2; the restricted ML estimate of θ is
θ̂_r = 1.8. It follows that the value of the GLR is given by

    λ(x) = L(1.8; x) / sup_{θ∈(0,∞)} L(θ; x) = [exp{−10 · 1.8} 1.8^20] / [exp{−10 · 2} 2^20],

so that

    −2 ln λ(x) = −2[−18 + 20 + 20(ln(1.8) − ln(2))] = 0.2144.

The critical region for an asymptotic size of α = 0.05 is

    Cr = {x : −2 ln λ(x) ≥ χ²_{1,0.05} = 3.84},

so that H0 cannot be rejected.
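The computation can be verified with a few lines of Python (scipy assumed available); the term −∑ ln(x_i!) of the Poisson log-likelihood cancels in the ratio and is therefore omitted.

import numpy as np
from scipy.stats import chi2

n, sum_x, theta_r = 10, 20, 1.8
theta_hat = sum_x / n                     # unrestricted ML estimate = 2

def loglik(theta):                        # Poisson log-likelihood up to an additive constant
    return -n * theta + sum_x * np.log(theta)

lr_stat = -2 * (loglik(theta_r) - loglik(theta_hat))
print(round(lr_stat, 4), lr_stat >= chi2.ppf(0.95, df=1))   # 0.2144, not rejected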

In order to establish an asymptotically valid method of investigating the power of the LR
test, we need to know the asymptotic distribution of the test statistic −2 ln λ(x) when H1
is true. For this purpose, one can analyze the following local alternatives to the Null
H0 : R(θ) = r:

    R(θ) = r + n^{−1/2} δ,

where the vector δ specifies the direction of the deviation from the Null. Note that since
lim_{n→∞} n^{−1/2} δ = 0, alternatives that are locally close to H0 : R(θ) = r for large enough n are
analyzed.
One can show that the limiting distribution of −2 ln λ(x) under the sequence of those local
alternatives is noncentral χ², i.e.

    −2 ln λ(x) →d χ²_q(κ),   where κ is a noncentrality parameter.

(For details see Mittelhammer, 1996, Theorem 10.6.)

4.4.2. Lagrange Multiplier (LM) Tests


As discussed in the previous section, the LR test utilizes as a measure of the discrepancy
between the restricted (H0) and unrestricted (H0 ∪ H1) parameter estimates the ratio of the
likelihood at the restricted estimate and the unrestricted estimate. A small value of this ratio
casts doubt on the validity of the restriction.
The LM test utilizes an alternative measure of discrepancy, namely the Lagrange multiplier
appearing in the restricted ML optimization problem or, equivalently, the gradient of the log-likelihood evaluated at the restricted estimates. Like the LR test, the LM test is a natural testing
procedure when the parameter of interest is estimated by (restricted) ML.
In order to derive the form of the LM test for q restrictions H0 : R(θ) = r versus H1 : R(θ) ≠ r,
consider the constrained maximization problem

    max_{θ,λ} ln L(θ; x) − λ'[R(θ) − r],

where λ is a (q × 1) vector of LMs. The f.o.c.s for this problem are

    ∂ ln L(θ̂_r; x)/∂θ − [∂R(θ̂_r)/∂θ']' λ̂_r = 0   and   R(θ̂_r) − r = 0,

where θ̂_r is the restricted ML estimate that solves the f.o.c.s and λ̂_r the corresponding vector of LMs.
Note that large values of the LM λ̂_r and large values of the log-likelihood gradient ∂ ln L(θ̂_r; x)/∂θ
indicate that large likelihood increases are possible if we relax the constraint H0 : R(θ) = r.
Hence, an LM and a gradient which are significantly different from 0 suggest that the restriction
H0 is false and should be rejected.

The following theorem introduces the two versions of the LM test statistic (G) and establishes
its asymptotic distribution under H0.

Theorem 4.8 (Asymptotic distribution of the LM test statistic when H0 is true) Assume
that the MLE of the (k × 1) vector θ is consistent, asymptotically normal and asymptotically
efficient. Let θ̂_r and λ̂_r denote the restricted MLE and the LM that solve

    max_{θ,λ} ln L(θ; x) − λ'[R(θ) − r],

where R(θ) is a continuously differentiable (q × 1) vector function that contains no redundant
coordinate functions. If ∂R(θ0)/∂θ' has full row rank, then under H0 : R(θ) = r it follows
that:

    G = λ̂_r' [∂R(θ̂_r)/∂θ'] [−∂² ln L(θ̂_r; x)/∂θ∂θ']^{−1} [∂R(θ̂_r)/∂θ']' λ̂_r
      = [∂ ln L(θ̂_r; x)/∂θ]' [−∂² ln L(θ̂_r; x)/∂θ∂θ']^{−1} [∂ ln L(θ̂_r; x)/∂θ]  →d  χ²_q.

Proof
See Mittelhammer (1996), Theorem 10.7.


Theorem 4.8 provides the asymptotic distribution of the LM test statistic G under the restriction
R(θ) = r. From this we can construct the following asymptotic size-α test of H0 : R(θ) = r
versus H1 : R(θ) ≠ r:

    Cr = {x : g ≥ χ²_{q,α}},   where χ²_{q,α} is the (1 − α)-quantile of a χ²-distribution with q d.o.f.

The LM test can have a computational advantage over the LR test: the latter needs both
the restricted and the unrestricted ML estimate of θ, whereas the former requires only the
restricted estimate.
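For a single scalar restriction the matrices in G collapse to scalars. As an illustration (not taken from the text), here is the score form of G for the Poisson example 4.15 above, sketched in Python with scipy assumed available; its value is close to the LR statistic, as expected.

from scipy.stats import chi2

n, sum_x, theta_r = 10, 20, 1.8            # restricted estimate under H0: theta = 1.8
score = sum_x / theta_r - n                # d lnL / d theta at theta_r
info  = sum_x / theta_r**2                 # -d^2 lnL / d theta^2 at theta_r
G = score**2 / info                        # = 0.20 here, compared with the LR value 0.2144
print(G, G >= chi2.ppf(0.95, df=1))        # H0 is again not rejected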
Example 4.16 See Mittelhammer (1996), Example 10.11.

4.4.3. Wald Tests


The Wald test utilizes a third alternative measure of the discrepancy between the restricted
(H0) and unrestricted (H0 ∪ H1) parameter estimates. In particular, it assesses the plausibility
of H0 : R(θ) = r by the significance of the measure R(θ̂) − r, that is, the restriction evaluated
at the unrestricted estimate θ̂.
Significant values of R(θ̂) − r indicate a significant discrepancy between the hypothesized value
of R(θ), namely r, and the unrestricted estimate of R(θ) provided by the data, namely R(θ̂),
suggesting that H0 is false; see below.
In contrast to the LR and LM test, it is not necessary that the estimator θ̂ be an MLE for the
Wald test to be applicable. Hence, the Wald test is more general than either the LR or LM
test.

The following theorem introduces the Wald test statistic (W) and establishes its asymptotic
distribution under H0.

Theorem 4.9 (Asymptotic distribution of the Wald test statistic when H0 is true)
Let the random sample x of size n have the joint probability density function f(x; θ0), let θ̂ be
a consistent estimator for θ0 such that √n(θ̂ − θ0) →d N(0, Σ), and let Σ̂_n be a consistent
estimator of Σ. Furthermore, consider the hypotheses H0 : R(θ) = r versus H1 : R(θ) ≠ r,
where R(θ) is a (q × 1) continuously differentiable vector function of θ for which q ≤ k and
R(θ) contains no redundant coordinate functions. Finally, let ∂R(θ0)/∂θ' have full row rank.
Then under H0 it follows that:

    W = n [R(θ̂) − r]' [ (∂R(θ̂)/∂θ') Σ̂_n (∂R(θ̂)/∂θ')' ]^{−1} [R(θ̂) − r]  →d  χ²_q.

Proof
See Mittelhammer (1996), Theorem 10.9.


Theorem 4.9 provides the asymptotic distribution of the Wald test statistic W under the restriction R(θ) = r. From this we can construct the following asymptotic size-α test of H0 : R(θ) = r
versus H1 : R(θ) ≠ r:

    Cr = {x : w ≥ χ²_{q,α}},   where χ²_{q,α} is the (1 − α)-quantile of a χ²-distribution with q d.o.f.
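As an illustration (again not taken from the text), the Wald statistic for the Poisson example 4.15 takes a particularly simple form, since R(θ) = θ and the asymptotic variance of √n(θ̂ − θ) is θ, estimated here by θ̂; a minimal Python sketch with scipy assumed available:

from scipy.stats import chi2

n, sum_x, r = 10, 20, 1.8
theta_hat = sum_x / n                        # unrestricted MLE = 2
W = n * (theta_hat - r)**2 / theta_hat       # = 0.20 here; same conclusion as the LR and LM tests
print(W, W >= chi2.ppf(0.95, df=1))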

Example 4.17 See Mittelhammer (1996), Examples 10.12 and 10.13.


Appendix

A. Tables
Table A.1.: Quantiles of the χ² distribution

 ν     0.5%      1%    2.5%      5%     10%     90%     95%   97.5%     99%   99.5%
 1    0.000   0.000   0.001   0.004   0.016   2.706   3.841   5.024   6.635   7.879
 2    0.010   0.020   0.051   0.103   0.211   4.605   5.991   7.378   9.210  10.597
 3    0.039   0.100   0.213   0.353   0.587   6.252   7.816   9.350  11.346  12.836
 4    0.195   0.292   0.484   0.712   1.065   7.780   9.488  11.144  13.277  14.859
 5    0.406   0.552   0.831   1.146   1.611   9.237  11.071  12.833  15.086  16.749
 6    0.673   0.871   1.237   1.636   2.204  10.645  12.592  14.450  16.812  18.547
 7    0.987   1.238   1.690   2.168   2.833  12.017  14.067  16.013  18.475  20.277
 8    1.343   1.646   2.180   2.733   3.490  13.362  15.507  17.535  20.090  21.955
 9    1.734   2.088   2.700   3.325   4.168  14.684  16.919  19.023  21.666  23.589
10    2.155   2.558   3.247   3.940   4.865  15.987  18.307  20.483  23.209  25.188
11    2.603   3.053   3.816   4.575   5.578  17.275  19.675  21.920  24.725  26.757
12    3.074   3.570   4.404   5.226   6.304  18.549  21.026  23.337  26.217  28.299
13    3.565   4.107   5.009   5.892   7.042  19.812  22.362  24.736  27.688  29.819
14    4.075   4.660   5.629   6.571   7.790  21.064  23.685  26.119  29.141  31.319
15    4.601   5.229   6.262   7.261   8.547  22.307  24.996  27.488  30.578  32.801
16    5.142   5.812   6.908   7.962   9.312  23.542  26.296  28.845  32.000  34.267
17    5.697   6.408   7.564   8.672  10.085  24.769  27.587  30.191  33.409  35.718
18    6.265   7.015   8.231   9.390  10.865  25.989  28.869  31.526  34.805  37.156
19    6.844   7.633   8.907  10.117  11.651  27.204  30.144  32.852  36.191  38.582
20    7.434   8.260   9.591  10.851  12.443  28.412  31.410  34.170  37.566  39.997

X ~ χ²(ν): Quantiles χ²_p(ν) of the χ² distribution with ν degrees of freedom.

Table A.2.: Quantiles of the standard normal distribution

  p      0.000   0.001   0.002   0.003   0.004   0.005   0.006   0.007   0.008   0.009
0.50x   0.0000  0.0025  0.0050  0.0075  0.0100  0.0125  0.0150  0.0175  0.0201  0.0226
0.51x   0.0251  0.0276  0.0301  0.0326  0.0351  0.0376  0.0401  0.0426  0.0451  0.0476
0.52x   0.0502  0.0527  0.0552  0.0577  0.0602  0.0627  0.0652  0.0677  0.0702  0.0728
0.53x   0.0753  0.0778  0.0803  0.0828  0.0853  0.0878  0.0904  0.0929  0.0954  0.0979
0.54x   0.1004  0.1030  0.1055  0.1080  0.1105  0.1130  0.1156  0.1181  0.1206  0.1231
0.55x   0.1257  0.1282  0.1307  0.1332  0.1358  0.1383  0.1408  0.1434  0.1459  0.1484
0.56x   0.1510  0.1535  0.1560  0.1586  0.1611  0.1637  0.1662  0.1687  0.1713  0.1738
0.57x   0.1764  0.1789  0.1815  0.1840  0.1866  0.1891  0.1917  0.1942  0.1968  0.1993
0.58x   0.2019  0.2045  0.2070  0.2096  0.2121  0.2147  0.2173  0.2198  0.2224  0.2250
0.59x   0.2275  0.2301  0.2327  0.2353  0.2378  0.2404  0.2430  0.2456  0.2482  0.2508
0.60x   0.2533  0.2559  0.2585  0.2611  0.2637  0.2663  0.2689  0.2715  0.2741  0.2767
0.61x   0.2793  0.2819  0.2845  0.2871  0.2898  0.2924  0.2950  0.2976  0.3002  0.3029
0.62x   0.3055  0.3081  0.3107  0.3134  0.3160  0.3186  0.3213  0.3239  0.3266  0.3292
0.63x   0.3319  0.3345  0.3372  0.3398  0.3425  0.3451  0.3478  0.3505  0.3531  0.3558
0.64x   0.3585  0.3611  0.3638  0.3665  0.3692  0.3719  0.3745  0.3772  0.3799  0.3826
0.65x   0.3853  0.3880  0.3907  0.3934  0.3961  0.3989  0.4016  0.4043  0.4070  0.4097
0.66x   0.4125  0.4152  0.4179  0.4207  0.4234  0.4261  0.4289  0.4316  0.4344  0.4372
0.67x   0.4399  0.4427  0.4454  0.4482  0.4510  0.4538  0.4565  0.4593  0.4621  0.4649
0.68x   0.4677  0.4705  0.4733  0.4761  0.4789  0.4817  0.4845  0.4874  0.4902  0.4930
0.69x   0.4959  0.4987  0.5015  0.5044  0.5072  0.5101  0.5129  0.5158  0.5187  0.5215
0.70x   0.5244  0.5273  0.5302  0.5330  0.5359  0.5388  0.5417  0.5446  0.5476  0.5505
0.71x   0.5534  0.5563  0.5592  0.5622  0.5651  0.5681  0.5710  0.5740  0.5769  0.5799
0.72x   0.5828  0.5858  0.5888  0.5918  0.5948  0.5978  0.6008  0.6038  0.6068  0.6098
0.73x   0.6128  0.6158  0.6189  0.6219  0.6250  0.6280  0.6311  0.6341  0.6372  0.6403
0.74x   0.6433  0.6464  0.6495  0.6526  0.6557  0.6588  0.6620  0.6651  0.6682  0.6713
0.75x   0.6745  0.6776  0.6808  0.6840  0.6871  0.6903  0.6935  0.6967  0.6999  0.7031
0.76x   0.7063  0.7095  0.7128  0.7160  0.7192  0.7225  0.7257  0.7290  0.7323  0.7356
0.77x   0.7388  0.7421  0.7454  0.7488  0.7521  0.7554  0.7588  0.7621  0.7655  0.7688
0.78x   0.7722  0.7756  0.7790  0.7824  0.7858  0.7892  0.7926  0.7961  0.7995  0.8030
0.79x   0.8064  0.8099  0.8134  0.8169  0.8204  0.8239  0.8274  0.8310  0.8345  0.8381
0.80x   0.8416  0.8452  0.8488  0.8524  0.8560  0.8596  0.8633  0.8669  0.8705  0.8742
0.81x   0.8779  0.8816  0.8853  0.8890  0.8927  0.8965  0.9002  0.9040  0.9078  0.9116
0.82x   0.9154  0.9192  0.9230  0.9269  0.9307  0.9346  0.9385  0.9424  0.9463  0.9502
0.83x   0.9542  0.9581  0.9621  0.9661  0.9701  0.9741  0.9782  0.9822  0.9863  0.9904
0.84x   0.9945  0.9986  1.0027  1.0069  1.0110  1.0152  1.0194  1.0237  1.0279  1.0322
0.85x   1.0364  1.0407  1.0450  1.0494  1.0537  1.0581  1.0625  1.0669  1.0714  1.0758
0.86x   1.0803  1.0848  1.0893  1.0939  1.0985  1.1031  1.1077  1.1123  1.1170  1.1217
0.87x   1.1264  1.1311  1.1359  1.1407  1.1455  1.1503  1.1552  1.1601  1.1650  1.1700
0.88x   1.1750  1.1800  1.1850  1.1901  1.1952  1.2004  1.2055  1.2107  1.2160  1.2212
0.89x   1.2265  1.2319  1.2372  1.2426  1.2481  1.2536  1.2591  1.2646  1.2702  1.2759
0.90x   1.2816  1.2873  1.2930  1.2988  1.3047  1.3106  1.3165  1.3225  1.3285  1.3346
0.91x   1.3408  1.3469  1.3532  1.3595  1.3658  1.3722  1.3787  1.3852  1.3917  1.3984
0.92x   1.4051  1.4118  1.4187  1.4255  1.4325  1.4395  1.4466  1.4538  1.4611  1.4684
0.93x   1.4758  1.4833  1.4909  1.4985  1.5063  1.5141  1.5220  1.5301  1.5382  1.5464
0.94x   1.5548  1.5632  1.5718  1.5805  1.5893  1.5982  1.6072  1.6164  1.6258  1.6352
0.95x   1.6449  1.6546  1.6646  1.6747  1.6849  1.6954  1.7060  1.7169  1.7279  1.7392
0.96x   1.7507  1.7624  1.7744  1.7866  1.7991  1.8119  1.8250  1.8384  1.8522  1.8663
0.97x   1.8808  1.8957  1.9110  1.9268  1.9431  1.9600  1.9774  1.9954  2.0141  2.0335
0.98x   2.0537  2.0749  2.0969  2.1201  2.1444  2.1701  2.1973  2.2262  2.2571  2.2904
0.99x   2.3263  2.3656  2.4089  2.4573  2.5121  2.5758  2.6521  2.7478  2.8782  3.0902

Z ~ N(0, 1): Quantiles z_p = Φ⁻¹(p) of the standard normal distribution.
