
Chapter 36

LARGE SAMPLE ESTIMATION AND HYPOTHESIS


TESTING*

WHITNEY K. NEWEY

Massachusetts Institute of Technology

DANIEL MCFADDEN

University of California, Berkeley

Contents

Abstract
1. Introduction
2. Consistency
2.1. The basic consistency theorem
2.2. Identification
2.2.1. The maximum likelihood estimator
2.2.2. Nonlinear least squares
2.2.3. Generalized method of moments
2.2.4. Classical minimum distance
2.3. Uniform convergence and continuity
2.4. Consistency of maximum likelihood
2.5. Consistency of GMM
2.6. Consistency without compactness
2.7. Stochastic equicontinuity and uniform convergence
2.8. Least absolute deviations examples
2.8.1. Maximum score
2.8.2. Censored least absolute deviations
3. Asymptotic normality
3.1. The basic results
3.2. Asymptotic normality for MLE
3.3. Asymptotic normality for GMM

*We are grateful to the NSF for financial support and to Y. Ait-Sahalia, J. Porter, J. Powell, J. Robins,
P. Ruud, and T. Stoker for helpful comments.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

Abstract

Asymptotic distribution theory is the primary method used to examine the properties of econometric estimators and tests. We present conditions for obtaining consistency and asymptotic normality of a very general class of estimators (extremum estimators). Consistent asymptotic variance estimators are given to enable approximation of the asymptotic distribution. Asymptotic efficiency is another desirable property that is then considered. Throughout the chapter, the general results are also specialized to common econometric estimators (e.g. MLE and GMM), and in specific examples we work through the conditions for the various results in detail. The results are also extended to two-step estimators (with finite-dimensional parameter estimation in the first step), estimators derived from nonsmooth objective functions, and semiparametric two-step estimators (with nonparametric estimation of an infinite-dimensional parameter in the first step). Finally, the trinity of test statistics is considered within the quite general setting of GMM estimation, and numerous examples are given.

1. Introduction

Large sample distribution theory is the cornerstone of statistical inference for


econometric models. The limiting distribution of a statistic gives approximate
distributional results that are often straightforward to derive, even in complicated
econometric models. These distributions are useful for approximate inference, in-
cluding constructing approximate confidence intervals and test statistics. Also, the
location and dispersion of the limiting distribution provide criteria for choosing
between different estimators. Of course, asymptotic results are sensitive to the
accuracy of the large sample approximation, but the approximation has been found
to be quite good in many cases and asymptotic distribution results are an important
starting point for further improvements, such as the bootstrap. Also, exact distribu-
tion theory is often difficult to derive in econometric models, and may not apply to
models with unspecified distributions, which are important in econometrics. Because
asymptotic theory is so useful for econometric models, it is important to have
general results with conditions that can be interpreted and applied to particular
estimators as easily as possible. The purpose of this chapter is the presentation of
such results.
Consistency and asymptotic normality are the two fundamental large sample
properties of estimators considered in this chapter. A consistent estimator $\hat{\theta}$ is one
that converges in probability to the true value $\theta_0$, i.e. $\hat{\theta} \overset{p}{\longrightarrow} \theta_0$ as the sample size $n$
goes to infinity, for all possible true values.1 This is a mild property, only requiring

1. This property is sometimes referred to as weak consistency, with strong consistency holding when $\hat{\theta}$
converges almost surely to the true value. Throughout the chapter we focus on weak consistency,
although we also show how strong consistency can be proven.

that the estimator is close to the truth when the number of observations is nearly
infinite. Thus, an estimator that is not even consistent is usually considered in-
adequate. Also, consistency is useful because it means that the asymptotic distribu-
tion of an estimator is determined by its limiting behavior near the true parameter.
An asymptotically normal estimator $\hat{\theta}$ is one for which there is an increasing function
$v(n)$ such that the distribution function of $v(n)(\hat{\theta} - \theta_0)$ converges to the Gaussian
distribution function with mean zero and variance $V$, i.e. $v(n)(\hat{\theta} - \theta_0) \overset{d}{\longrightarrow} N(0, V)$.
The variance $V$ of the limiting distribution is referred to as the asymptotic variance
of $\hat{\theta}$. The estimator is $\sqrt{n}$-consistent if $v(n) = \sqrt{n}$. This chapter focuses on the
$\sqrt{n}$-consistent case, so that unless otherwise noted, asymptotic normality will be
taken to include $\sqrt{n}$-consistency.
Asymptotic normality and a consistent estimator of the asymptotic variance can
be used to construct approximate confidence intervals. In particular, for an estimator $\hat{V}$ of $V$ and for $z_{\alpha/2}$ satisfying $\text{Prob}[N(0, 1) > z_{\alpha/2}] = \alpha/2$, an asymptotic $1 - \alpha$
confidence interval is

$\mathcal{C}_{1-\alpha} = [\hat{\theta} - z_{\alpha/2}(\hat{V}/n)^{1/2},\ \hat{\theta} + z_{\alpha/2}(\hat{V}/n)^{1/2}]$.

If $\hat{V}$ is a consistent estimator of $V$ and $V > 0$, then asymptotic normality of $\hat{\theta}$ will
imply that $\text{Prob}(\theta_0 \in \mathcal{C}_{1-\alpha}) \to 1 - \alpha$ as $n \to \infty$.2 Here asymptotic theory is important
for econometric practice, where consistent standard errors can be used for approxi-
mate confidence interval construction. Thus, it is useful to know that estimators are
asymptotically normal and to know how to form consistent standard errors in
applications. In addition, the magnitude of asymptotic variances for different esti-
mators helps choose between estimators in practice. If one estimator has a smaller
asymptotic variance, then an asymptotic confidence interval, as above, will be
shorter for that estimator in large samples, suggesting preference for its use in
applications. A prime example is generalized least squares with estimated distur-
bance variance matrix, which has smaller asymptotic variance than ordinary least
squares, and is often used in practice.
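
For instance, the interval above is straightforward to compute. The following minimal Python sketch (our own illustration, with hypothetical names theta_hat and V_hat for the estimate and the asymptotic variance estimate) is one way to do it:

```python
import numpy as np
from scipy import stats

def asymptotic_ci(theta_hat, V_hat, n, alpha=0.05):
    """1 - alpha interval based on theta_hat being approximately N(theta0, V/n)."""
    z = stats.norm.ppf(1 - alpha / 2)     # z_{alpha/2}
    se = np.sqrt(V_hat / n)               # approximate standard error
    return theta_hat - z * se, theta_hat + z * se

# Example: n = 500 observations, estimate 1.3, asymptotic variance estimate 4.0
print(asymptotic_ci(1.3, 4.0, 500))       # roughly (1.125, 1.475)
```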
Many estimators share a common structure that is useful in showing consistency
and asymptotic normality, and in deriving the asymptotic variance. The benefit of
using this structure is that it distills the asymptotic theory to a few essential
ingredients. The cost is that applying general results to particular estimators often
requires thought and calculation. In our opinion, the benefits outweigh the costs,
and so in these notes we focus on general structures, illustrating their application
with examples.
One general structure, or framework, is the class of estimators that maximize
some objective function that depends on data and sample size, referred to as
extremum estimators. An estimator $\hat{\theta}$ is an extremum estimator if there is an

2. The proof of this result is an exercise in convergence in distribution and the Slutzky theorem, which
states that $Y_n \overset{d}{\longrightarrow} Y_0$ and $Z_n \overset{p}{\longrightarrow} c$ implies $Z_n Y_n \overset{d}{\longrightarrow} cY_0$.

objective function $\hat{Q}_n(\theta)$ such that

$\hat{\theta}$ maximizes $\hat{Q}_n(\theta)$ subject to $\theta \in \Theta$, (1.1)

where $\Theta$ is the set of possible parameter values. In the notation, dependence of $\hat{\theta}$
on $n$ and of $\hat{\theta}$ and $\hat{Q}_n(\theta)$ on the data is suppressed for convenience. This estimator
is the maximizer of some objective function that depends on the data, hence the
term extremum estimator.3 R.A. Fisher (1921, 1925), Wald (1949), Huber (1967),
Jennrich (1969), and Malinvaud (1970) developed consistency and asymptotic nor-
mality results for various special cases of extremum estimators, and Amemiya (1973,
1985) formulated the general class of estimators and gave some useful results.
A prime example of an extremum estimator is the maximum likelihood estimator (MLE).
Let the data $(z_1, \ldots, z_n)$ be i.i.d. with p.d.f. $f(z|\theta_0)$ equal to some member of a family
of p.d.f.s $f(z|\theta)$. Throughout, we will take the p.d.f. $f(z|\theta)$ to mean a probability
function where $z$ is discrete, and to possibly be conditioned on part of the observation $z$.4 The MLE satisfies eq. (1.1) with

$\hat{Q}_n(\theta) = n^{-1}\sum_{i=1}^n \ln f(z_i|\theta)$. (1.2)

Here $\hat{Q}_n(\theta)$ is the normalized log-likelihood. Of course, the monotonic transformation of taking the log of the likelihood and normalizing by $n$ will not typically affect
the estimator, but it is a convenient normalization in the theory. Asymptotic theory
for the MLE was outlined by R.A. Fisher (1921, 1925), and Wald's (1949) consistency
theorem is the prototype result for extremum estimators. Also, Huber (1967) gave
weak conditions for consistency and asymptotic normality of the MLE and other
extremum estimators that maximize a sample average.5
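
To make eq. (1.2) concrete, here is a minimal sketch of an MLE computed as an extremum estimator, with a Gaussian family standing in for $f(z|\theta)$ (all names are ours, and the log-sigma parametrization is just one convenient choice):

```python
import numpy as np
from scipy.optimize import minimize

def neg_avg_loglik(theta, z):
    """Negative of Q_n(theta) in eq. (1.2) for a N(mu, sigma^2) model."""
    mu, log_sigma = theta                 # sigma = exp(log_sigma) > 0
    sigma = np.exp(log_sigma)
    return -np.mean(-0.5 * np.log(2 * np.pi) - np.log(sigma)
                    - 0.5 * ((z - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
z = rng.normal(loc=2.0, scale=1.5, size=1000)
res = minimize(neg_avg_loglik, x0=np.array([0.0, 0.0]), args=(z,))
print(res.x[0], np.exp(res.x[1]))         # close to (2.0, 1.5)
```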
A second example is nonlinear least squares (NLS), where for data $z_i = (y_i, x_i)$
with $E[y|x] = h(x, \theta_0)$, the estimator solves eq. (1.1) with

$\hat{Q}_n(\theta) = -n^{-1}\sum_{i=1}^n [y_i - h(x_i, \theta)]^2$. (1.3)

Here maximizing $\hat{Q}_n(\theta)$ is the same as minimizing the sum of squared residuals. The
asymptotic normality theorem of Jennrich (1969) is the prototype for many modern
results on asymptotic normality of extremum estimators.

3. Extremum rather than maximum appears here because minimizers are also special cases, with
objective function equal to the negative of the minimand.
4. More precisely, $f(z|\theta)$ is the density (Radon-Nikodym derivative) of the probability measure for $z$
with respect to some measure that may assign measure 1 to some singletons, allowing for discrete
variables, and for $z = (y, x)$ may be the product of some measure for $y$ with the marginal distribution of
$x$, allowing $f(z|\theta)$ to be a conditional density given $x$.
5. Estimators that maximize a sample average, i.e. where $\hat{Q}_n(\theta) = n^{-1}\sum_{i=1}^n q(z_i, \theta)$, are often referred to
as m-estimators, where the m means maximum-likelihood-like.

A third example is the generalized method of moments (GMM). Suppose that
there is a moment function vector $g(z, \theta)$ such that the population moments satisfy
$E[g(z, \theta_0)] = 0$. A GMM estimator is one that minimizes a squared Euclidean
distance of sample moments from their population counterpart of zero. Let $\hat{W}$ be
a positive semi-definite matrix, so that $(m'\hat{W}m)^{1/2}$ is a measure of the distance of $m$
from zero. A GMM estimator is one that solves eq. (1.1) with

$\hat{Q}_n(\theta) = -\left[n^{-1}\sum_{i=1}^n g(z_i, \theta)\right]'\hat{W}\left[n^{-1}\sum_{i=1}^n g(z_i, \theta)\right]$. (1.4)

This class includes linear instrumental variables estimators, where $g(z, \theta) = x\cdot(y - Y'\theta)$, $x$ is a vector of instrumental variables, $y$ is a left-hand-side dependent variable,
and $Y$ are right-hand-side variables. In this case the population moment condition
$E[g(z, \theta_0)] = 0$ is the same as the product of instrumental variables $x$ and the
disturbance $y - Y'\theta_0$ having mean zero. By varying $\hat{W}$ one can construct a variety
of instrumental variables estimators, including two-stage least squares for $\hat{W} =
(n^{-1}\sum_{i=1}^n x_i x_i')^{-1}$. The GMM class also includes nonlinear instrumental variables
estimators, where $g(z, \theta) = x\cdot\rho(z, \theta)$ for a residual $\rho(z, \theta)$ satisfying $E[x\cdot\rho(z, \theta_0)] = 0$.
Nonlinear instrumental variable estimators were developed and analyzed by Sargan
(1959) and Amemiya (1974). Also, the GMM class was formulated and general
results on asymptotic properties given in Burguete et al. (1982) and Hansen (1982).
The GMM class is general enough to also include MLE and NLS when those
estimators are viewed as solutions to their first-order conditions. In this case the
derivatives of $\ln f(z|\theta)$ or $-[y - h(x, \theta)]^2$ become the moment functions, and there
are exactly as many moment functions as parameters. Thinking of GMM as including MLE, NLS, and many other estimators is quite useful for analyzing their
asymptotic distribution, but not for showing consistency, as further discussed below.
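
In code, eq. (1.4) is just a quadratic form in averaged moments. A minimal sketch for the linear instrumental variables case (function names are ours; numpy assumed):

```python
import numpy as np

def gmm_objective(theta, y, Y, x, W):
    """Q_n(theta) = -g_bar' W g_bar for moments g(z, theta) = x(y - Y'theta)."""
    g_bar = x.T @ (y - Y @ theta) / len(y)    # n^{-1} sum_i g(z_i, theta)
    return -(g_bar @ W @ g_bar)

def tsls_weight(x):
    """Two-stage least squares weight W_hat = (n^{-1} sum_i x_i x_i')^{-1}."""
    return np.linalg.inv(x.T @ x / x.shape[0])
```

With `tsls_weight`, maximizing `gmm_objective` over $\theta$ reproduces two-stage least squares; other choices of the weight matrix give other instrumental variables estimators.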
A fourth example is classical minimum distance estimation (CMD). Suppose that
there is a vector of estimators $\hat{\pi} \overset{p}{\longrightarrow} \pi_0$ and a vector of functions $h(\theta)$ with $\pi_0 = h(\theta_0)$.
The idea is that $\pi$ consists of reduced form parameters, $\theta$ consists of structural
parameters, and $h(\theta)$ gives the mapping from structure to reduced form. An estimator of $\theta$ can be constructed by solving eq. (1.1) with

$\hat{Q}_n(\theta) = -[\hat{\pi} - h(\theta)]'\hat{W}[\hat{\pi} - h(\theta)]$, (1.5)

where $\hat{W}$ is a positive semi-definite matrix. This class of estimators includes classical
minimum chi-square methods for discrete data, as well as estimators for simultaneous
equations models in Rothenberg (1973) and panel data in Chamberlain (1982). Its
asymptotic properties were developed by Chiang (1956) and Ferguson (1958).
A different framework that is sometimes useful is minimum distance estimation,

6. The $1/n$ normalization in $\hat{W}$ does not affect the estimator, but, by the law of large numbers, will imply
that $\hat{W}$ converges in probability to a constant matrix, a condition imposed below.

a class of estimators that solve eq. (1.1) for $\hat{Q}_n(\theta) = -\hat{g}_n(\theta)'\hat{W}\hat{g}_n(\theta)$, where $\hat{g}_n(\theta)$ is a
vector of the data and parameters such that $\hat{g}_n(\theta_0) \overset{p}{\longrightarrow} 0$ and $\hat{W}$ is positive semi-definite. Both GMM and CMD are special cases of minimum distance, with $\hat{g}_n(\theta) =
n^{-1}\sum_{i=1}^n g(z_i, \theta)$ for GMM and $\hat{g}_n(\theta) = \hat{\pi} - h(\theta)$ for CMD. This framework is useful
for analyzing asymptotic normality of GMM and CMD, because (once) differentiability of $\hat{g}_n(\theta)$ is a sufficient smoothness condition, while twice differentiability is
often assumed for the objective function of an extremum estimator [see, e.g. Amemiya
(1985)]. Indeed, as discussed in Section 3, asymptotic normality of an extremum
estimator with a twice differentiable objective function $\hat{Q}_n(\theta)$ is actually a special
case of asymptotic normality of a minimum distance estimator, with $\hat{g}_n(\theta) = \nabla_\theta\hat{Q}_n(\theta)$
and $\hat{W}$ equal to an identity matrix, where $\nabla_\theta$ denotes the partial derivative. The idea
here is that when analyzing asymptotic normality, an extremum estimator can be
viewed as a solution to the first-order conditions $\nabla_\theta\hat{Q}_n(\theta) = 0$, and in this form is a
minimum distance estimator.
For consistency, it can be a bad idea to treat an extremum estimator as a solution
to first-order conditions rather than a global maximum of an objective function,
because the first-order condition can have multiple roots even when the objective
function has a unique maximum. Thus, the first-order conditions may not identify
the parameters, even when there is a unique maximum to the objective function.
Also, it is often easier to specify primitive conditions for a unique maximum than
for a unique root of the first-order conditions. A classic example is the MLE for the
Cauchy location-scale model, where $z$ is a scalar, $\mu$ is a location parameter, $\sigma$ a scale
parameter, and $f(z|\theta) = C\sigma^{-1}\{1 + [(z - \mu)/\sigma]^2\}^{-1}$ for a constant $C$. It is well known
that, even in large samples, there are many roots to the first-order conditions for
the location parameter $\mu$, although there is a global maximum to the likelihood
function; see Example 1.1 below. Econometric examples tend to be somewhat less
extreme, but can still have multiple roots. An example is the censored least absolute
deviations estimator of Powell (1984). This estimator solves eq. (1.1) for $\hat{Q}_n(\theta) =
-n^{-1}\sum_{i=1}^n |y_i - \max\{0, x_i'\theta\}|$, where $y_i = \max\{0, x_i'\theta_0 + \varepsilon_i\}$ and $\varepsilon_i$ has conditional
median zero. A global maximum of this function over any compact set containing
the true parameter will be consistent, under certain conditions, but the gradient has
extraneous roots at any point where $x_i'\theta < 0$ for all $i$ (which can occur, e.g., if $x_i'\theta$ is
bounded).
The importance for consistency of an extremum estimator being a global maximum
has practical implications. Many iterative maximization procedures (e.g. Newton-Raphson) may converge only to a local maximum, but consistency results only apply
to the global maximum. Thus, it is often important to search for a global maximum.
One approach to this problem is to try different starting values for iterative procedures, and pick the estimator that maximizes the objective from among the converged values. As long as the extremum estimator is consistent and the true parameter
is an element of the interior of the parameter set $\Theta$, an extremum estimator will be
7. For GMM, the law of large numbers implies $\hat{g}_n(\theta_0) \overset{p}{\longrightarrow} 0$.



a root of the first-order conditions asymptotically, and hence will be included among
the local maxima. Also, this procedure can avoid extraneous boundary maxima, e.g.
those that can occur in maximum likelihood estimation of mixture models.
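
A minimal sketch of this multiple-starting-values strategy (the helper function and the toy objective below are our own illustrations):

```python
import numpy as np
from scipy.optimize import minimize

def multistart_max(Q_n, starts):
    """Maximize Q_n from several starting values; keep the best converged run."""
    best = None
    for x0 in starts:
        res = minimize(lambda th: -Q_n(th), x0)    # maximize via minimizing -Q_n
        if res.success and (best is None or -res.fun > best[1]):
            best = (res.x, -res.fun)
    return best                                    # (argmax, maximized value)

Q_n = lambda th: -(th[0] ** 2 - 1) ** 2 + 0.5 * th[0]   # two local maxima
starts = [np.array([s]) for s in (-2.0, 0.5, 2.0)]
theta_hat, value = multistart_max(Q_n, starts)
print(theta_hat, value)              # picks the global maximum near th = 1.06
```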
Figure 1 shows a schematic, illustrating the relationships between the various
types of estimators introduced so far. The name or mnemonic for each type of
estimator (e.g. MLE for maximum likelihood) is given, along with the objective function
being maximized, except for GMM and CMD where the form of $\hat{g}_n(\theta)$ is given. The
solid arrows indicate inclusion in a class of estimators. For example, MLE is
included in the class of extremum estimators and GMM is a minimum distance
estimator. The broken arrows indicate inclusion in the class when the estimator is
viewed as a solution to first-order conditions. In particular, the first-order conditions
for an extremum estimator are $\nabla_\theta\hat{Q}_n(\theta) = 0$, making it a minimum distance estimator
with $\hat{g}_n(\theta) = \nabla_\theta\hat{Q}_n(\theta)$ and $\hat{W} = I$. Similarly, the first-order conditions for MLE make
it a GMM estimator with $g(z, \theta) = \nabla_\theta\ln f(z|\theta)$, and those for NLS a GMM estimator
with $g(z, \theta) = -2[y - h(x, \theta)]\nabla_\theta h(x, \theta)$. As discussed above, these broken arrows are
useful for analyzing the asymptotic distribution, but not for consistency. Also, as
further discussed in Section 7, the broken arrows are not very useful when the
objective function $\hat{Q}_n(\theta)$ is not smooth.
The broad outline of the chapter is to treat consistency, asymptotic normality,
consistent asymptotic variance estimation, and asymptotic efficiency in that order.
The general results will be organized hierarchically across sections, with the asymp-
totic normality results assuming consistency and the asymptotic efficiency results
assuming asymptotic normality. In each section, some illustrative, self-contained
examples will be given. Two-step estimators will be discussed in a separate section,
partly as an illustration of how the general frameworks discussed here can be applied
and partly because of their intrinsic importance in econometric applications. Two
later sections deal with more advanced topics. Section 7 considers asymptotic
normality when the objective function $\hat{Q}_n(\theta)$ is not smooth. Section 8 develops some
asymptotic theory when $\hat{\theta}$ depends on a nonparametric estimator (e.g. a kernel
regression, see Chapter 39).
This chapter is designed to provide an introduction to asymptotic theory for
nonlinear models, as well as a guide to recent developments. For this purpose,

[Figure 1. Schematic of estimator classes: Extremum estimators (objective $\hat{Q}_n(\theta)$) and Minimum Distance estimators (objective $-\hat{g}_n(\theta)'\hat{W}\hat{g}_n(\theta)$); NLS ($-\sum_{i=1}^n\{y_i - h(x_i, \theta)\}^2/n$) and MLE ($\sum_{i=1}^n\ln f(z_i|\theta)/n$) as special cases of extremum estimators; GMM ($\hat{g}_n(\theta) = \sum_{i=1}^n g(z_i, \theta)/n$) and CMD ($\hat{g}_n(\theta) = \hat{\pi} - h(\theta)$) as special cases of minimum distance. Solid arrows denote inclusion in a class; broken arrows denote inclusion when the estimator is viewed as a solution to first-order conditions.]

Sections 2-6 have been organized in such a way that the more basic material is
collected in the first part of each section. In particular, Sections 2.1-2.5, 3.1-3.4,
4.1-4.3, 5.1, and 5.2 might be used as text for part of a second-year graduate
econometrics course, possibly also including some examples from the other parts
of this chapter.
The results for extremum and minimum distance estimators are general enough
to cover data that is a stationary stochastic process, but the regularity conditions
for GMM, MLE, and the more specific examples are restricted to i.i.d. data.
Modeling data as i.i.d. is satisfactory in many cross-section and panel data appli-
cations. Chapter 37 gives results for dependent observations.
This chapter assumes some familiarity with elementary concepts from analysis
(e.g. compact sets, continuous functions, etc.) and with probability theory. More
detailed familiarity with convergence concepts, laws of large numbers, and central
limit theorems is assumed, e.g. as in Chapter 3 of Amemiya (1985), although some
particularly important or potentially unfamiliar results will be cited in footnotes.
The most technical explanations, including measurability concerns, will be reserved
to footnotes.
Three basic examples will be used to illustrate the general results of this chapter.

Example 1.1 (Cauchy location-scale)

In this example $z$ is a scalar random variable, $\theta = (\mu, \sigma)$ is a two-dimensional vector,
and $z$ is continuously distributed with p.d.f. $f(z|\theta_0)$, where $f(z|\theta) = C\sigma^{-1}\{1 +
[(z - \mu)/\sigma]^2\}^{-1}$ and $C$ is a constant. In this example $\mu$ is a location parameter and
$\sigma$ a scale parameter. This example is interesting because the MLE will be consistent,
in spite of the first-order conditions having many roots and the nonexistence of
moments of $z$ (e.g. so the sample mean is not a consistent estimator of $\mu_0$).

Example 1.2 (Probit)

Probit is an MLE example where $z = (y, x)$ for a binary variable $y$, $y \in \{0, 1\}$, and a
$q \times 1$ vector of regressors $x$, and the conditional probability of $y$ given $x$ is $f(z|\theta_0)$
for $f(z|\theta) = \Phi(x'\theta)^y[1 - \Phi(x'\theta)]^{1-y}$. Here $f(z|\theta_0)$ is a p.d.f. with respect to integration
that sums over the two different values of $y$ and integrates over the distribution of
$x$, i.e. where the integral of any function $a(y, x)$ is $\int a(y, x)\,dz = E[a(1, x)] + E[a(0, x)]$.
This example illustrates how regressors can be allowed for, and is a model that is
often applied.

Example 1.3 (Hansen-Singleton)

This is a GMM (nonlinear instrumental variables) example, where $g(z, \theta) = x\cdot\rho(z, \theta)$
for $\rho(z, \theta) = \beta w y^\gamma - 1$. The functional form here is from Hansen and Singleton
(1982), where $\beta$ is a rate of time preference, $\gamma$ a risk aversion parameter, $w$ an asset
return, $y$ a consumption ratio for adjacent time periods, and $x$ consists of variables

lead to the estimator being close to one of the maxima, which does not give
consistency (because one of the maxima will not be the true value of the parameter).
The condition that $Q_0(\theta)$ have a unique maximum at the true parameter is related to
identification.
The discussion so far only allows for a compact parameter set. In theory compactness requires that one know bounds on the true parameter value, although this
constraint is often ignored in practice. It is possible to drop this assumption if the
function $\hat{Q}_n(\theta)$ cannot rise too much as $\theta$ becomes unbounded, as further discussed
below.
Uniform convergence and continuity of the limiting function are also important.
Uniform convergence corresponds to the feature of the graph that $\hat{Q}_n(\theta)$ was in the
sleeve for all values of $\theta \in \Theta$. Conditions for uniform convergence are given below.
The rest of this section develops this descriptive discussion into precise results
on consistency of extremum estimators. Section 2.1 presents the basic consistency
theorem. Sections 2.2-2.5 give simple but general sufficient conditions for consistency,
including results for MLE and GMM. More advanced and/or technical material is
contained in Sections 2.6-2.8.

2.1. The basic consistency theorem

To state a theorem it is necessary to define precisely uniform convergence in
probability, as follows:

Uniform convergence in probability: $\hat{Q}_n(\theta)$ converges uniformly in probability to
$Q_0(\theta)$ means $\sup_{\theta \in \Theta}|\hat{Q}_n(\theta) - Q_0(\theta)| \overset{p}{\longrightarrow} 0$.

The following is the fundamental consistency result for extremum estimators, and
is similar to Lemma 3 of Amemiya (1973).

Theorem 2.1

If there is a function $Q_0(\theta)$ such that (i) $Q_0(\theta)$ is uniquely maximized at $\theta_0$; (ii) $\Theta$ is
compact; (iii) $Q_0(\theta)$ is continuous; (iv) $\hat{Q}_n(\theta)$ converges uniformly in probability to
$Q_0(\theta)$, then $\hat{\theta} \overset{p}{\longrightarrow} \theta_0$.

Proof

For any $\varepsilon > 0$ we have with probability approaching one (w.p.a.1): (a) $\hat{Q}_n(\hat{\theta}) > \hat{Q}_n(\theta_0) -
\varepsilon/3$ by eq. (1.1); (b) $Q_0(\hat{\theta}) > \hat{Q}_n(\hat{\theta}) - \varepsilon/3$ by (iv); (c) $\hat{Q}_n(\theta_0) > Q_0(\theta_0) - \varepsilon/3$ by (iv).8

8. The probability statements in this proof are only well defined if each of $\hat{\theta}$, $\hat{Q}_n(\hat{\theta})$, and $Q_0(\hat{\theta})$ is
measurable. The measurability issue can be bypassed by defining consistency and uniform convergence
in terms of outer measure. The outer measure of a (possibly nonmeasurable) event $\mathscr{E}$ is the infimum of
$E[Y]$ over all random variables $Y$ with $Y \geq 1(\mathscr{E})$, where $1(\mathscr{E})$ is the indicator function for the event $\mathscr{E}$.

Therefore, w.p.a.1,

$Q_0(\hat{\theta}) \overset{(b)}{>} \hat{Q}_n(\hat{\theta}) - \varepsilon/3 \overset{(a)}{>} \hat{Q}_n(\theta_0) - 2\varepsilon/3 \overset{(c)}{>} Q_0(\theta_0) - \varepsilon.$

Thus, for any $\varepsilon > 0$, $Q_0(\hat{\theta}) > Q_0(\theta_0) - \varepsilon$ w.p.a.1. Let $\mathscr{N}$ be any open subset of $\Theta$
containing $\theta_0$. By $\Theta \cap \mathscr{N}^c$ compact, (i), and (iii), $\sup_{\theta \in \Theta \cap \mathscr{N}^c} Q_0(\theta) = Q_0(\theta^*) < Q_0(\theta_0)$
for some $\theta^* \in \Theta \cap \mathscr{N}^c$. Thus, choosing $\varepsilon = Q_0(\theta_0) - \sup_{\theta \in \Theta \cap \mathscr{N}^c} Q_0(\theta)$, it follows that
w.p.a.1 $Q_0(\hat{\theta}) > \sup_{\theta \in \Theta \cap \mathscr{N}^c} Q_0(\theta)$, and hence $\hat{\theta} \in \mathscr{N}$. Q.E.D.

The conditions of this theorem are slightly stronger than necessary. It is not
necessary to assume that $\hat{\theta}$ actually maximizes the objective function. This assumption can be replaced by the hypothesis that $\hat{Q}_n(\hat{\theta}) \geq \sup_{\theta \in \Theta}\hat{Q}_n(\theta) + o_p(1)$. This replacement has no effect on the proof, in particular on part (a), so that the conclusion
remains true. These modifications are useful for analyzing some estimators in
econometrics, such as the maximum score estimator of Manski (1975) and the
simulated moment estimators of Pakes (1986) and McFadden (1989). These modifications are not given in the statement of the consistency result in order to keep that
result simple, but will be used later.
Some of the other conditions can also be weakened. Assumption (iii) can be
changed to upper semi-continuity of $Q_0(\theta)$,10 and (iv) to $\hat{Q}_n(\theta_0) \overset{p}{\longrightarrow} Q_0(\theta_0)$ and, for all
$\varepsilon > 0$, $\hat{Q}_n(\theta) < Q_0(\theta) + \varepsilon$ for all $\theta \in \Theta$ with probability approaching one. Under
these weaker conditions the conclusion still is satisfied, with exactly the same proof.
Theorem 2.1 is a weak consistency result, i.e. it shows $\hat{\theta} \overset{p}{\longrightarrow} \theta_0$. A corresponding
strong consistency result, i.e. $\hat{\theta} \overset{a.s.}{\longrightarrow} \theta_0$, can be obtained by assuming that
$\sup_{\theta \in \Theta}|\hat{Q}_n(\theta) - Q_0(\theta)| \overset{a.s.}{\longrightarrow} 0$ holds in place of uniform convergence in probability.
The proof is exactly the same as that above, except that a.s. for large enough $n$
replaces with probability approaching one. This and other results are stated here
for convergence in probability because it suffices for the asymptotic distribution
theory.
This result is quite general, applying to any topological space. Hence, it allows for
$\Theta$ to be infinite-dimensional, i.e. for $\theta$ to be a function, as would be of interest for
nonparametric estimation of (say) a density or regression function. However, the
compactness of the parameter space is difficult to check or implausible in many
cases where $\Theta$ is infinite-dimensional.
To use this result to show consistency of a particular estimator it must be possible
to check the conditions. For this purpose it is important to have primitive conditions,
where the word primitive here is used synonymously with the phrase easy to
interpret. The compactness condition is primitive but the others are not, so that it
is important to discuss more primitive conditions, as will be done in the following
subsections.

10. Upper semi-continuity means that for any $\theta \in \Theta$ and $\varepsilon > 0$ there is an open subset $\mathscr{V}$ of $\Theta$ containing
$\theta$ such that $Q_0(\theta') < Q_0(\theta) + \varepsilon$ for all $\theta' \in \mathscr{V}$.

Condition (i) is the identification condition discussed above, (ii) the boundedness
condition on the parameter set, and (iii) and (iv) the continuity and uniform convergence conditions. These can be loosely grouped into substantive and regularity
conditions. The identification condition (i) is substantive. There are well known
examples where this condition fails, e.g. linear instrumental variables estimation
with fewer instruments than parameters. Thus, it is particularly important to be
able to specify primitive hypotheses for $Q_0(\theta)$ to have a unique maximum. The
compactness condition (ii) is also substantive, with $\theta_0 \in \Theta$ requiring that bounds on
the parameters be known. However, in applications the compactness restriction is
often ignored. This practice is justified for estimators where compactness can be
dropped without affecting consistency. Some of these estimators are
discussed in Section 2.6.
Uniform convergence and continuity are the hypotheses that are often referred
to as the standard regularity conditions for consistency. They will typically be
satisfied when moments of certain functions exist and there is some continuity in
$\hat{Q}_n(\theta)$ or in the distribution of the data. Moment existence assumptions are needed
to use the law of large numbers to show convergence of $\hat{Q}_n(\theta)$ to its limit $Q_0(\theta)$.
Continuity of the limit $Q_0(\theta)$ is quite a weak condition. It can even be true when
$\hat{Q}_n(\theta)$ is not continuous, because continuity of the distribution of the data can
smooth out the discontinuities in the sample objective function. Primitive regularity conditions for uniform convergence and continuity are given in Section 2.3.
Also, Section 2.7 relates uniform convergence to stochastic equicontinuity, a property
that is necessary and sufficient for uniform convergence, and gives more sufficient
conditions for uniform convergence.
To formulate primitive conditions for consistency of an extremum estimator, it
is necessary to first find $Q_0(\theta)$. Usually it is straightforward to calculate $Q_0(\theta)$ as the
probability limit of $\hat{Q}_n(\theta)$ for any $\theta$, a necessary condition for (iv) to be satisfied. This
calculation can be accomplished by applying the law of large numbers, or hypotheses about convergence of certain components. For example, the law of large
numbers implies that for MLE the limit of $\hat{Q}_n(\theta)$ is $Q_0(\theta) = E[\ln f(z|\theta)]$ and for NLS
$Q_0(\theta) = -E[\{y - h(x, \theta)\}^2]$. Note the role played here by the normalization of the
log-likelihood and sum of squared residuals, that leads to the objective function
converging to a nonzero limit. Similar calculations give the limit for GMM and
CMD, as further discussed below. Once this limit has been found, the consistency
will follow from the conditions of Theorem 2.1.
One device that may allow for consistency under weaker conditions is to treat $\hat{\theta}$
as a maximum of $\hat{Q}_n(\theta) - \hat{Q}_n(\theta_0)$ rather than just $\hat{Q}_n(\theta)$. This is a magnitude normalization that sometimes makes it possible to weaken hypotheses on existence of
moments. In the censored least absolute deviations example, where $\hat{Q}_n(\theta) =
-n^{-1}\sum_{i=1}^n|y_i - \max\{0, x_i'\theta\}|$, an assumption on existence of the expectation of $y$ is
useful for applying a law of large numbers to show convergence of $\hat{Q}_n(\theta)$. In contrast,
$\hat{Q}_n(\theta) - \hat{Q}_n(\theta_0) = -n^{-1}\sum_{i=1}^n[|y_i - \max\{0, x_i'\theta\}| - |y_i - \max\{0, x_i'\theta_0\}|]$ is a bounded
function of $y_i$, so that no such assumption is needed.

2.2. Identification

The identification condition for consistency of an extremum estimator is that the


limit of the objective function has a unique maximum at the truth. This condition
is related to identification in the usual sense, which is that the distribution of the
data at the true parameter is different than that at any other possible parameter
value. To be precise, identification is a necessary condition for the limiting objective
function to have a unique maximum, but it is not in general sufficient. This section
focuses on identification conditions for MLE, NLS, GMM, and CMD, in order to
illustrate the kinds of results that are available.

2.2.1. The maximum likelihood estimator

An important feature of maximum likelihood is that identification is also sufficient


for a unique maximum. Let $Y_1 \neq Y_2$ for random variables mean $\text{Prob}(\{Y_1 \neq Y_2\}) > 0$.

Lemma 2.2 (Information inequality)

If $\theta_0$ is identified [$\theta \neq \theta_0$ and $\theta \in \Theta$ implies $f(z|\theta) \neq f(z|\theta_0)$] and $E[|\ln f(z|\theta)|] < \infty$
for all $\theta$ then $Q_0(\theta) = E[\ln f(z|\theta)]$ has a unique maximum at $\theta_0$.

Proof

By the strict version of Jensen's inequality, for any nonconstant, positive random variable $Y$, $-\ln(E[Y]) < E[-\ln(Y)]$.13 Then for $Y = f(z|\theta)/f(z|\theta_0)$ and
$\theta \neq \theta_0$, $Q_0(\theta_0) - Q_0(\theta) = E[-\ln\{f(z|\theta)/f(z|\theta_0)\}] > -\ln(E[f(z|\theta)/f(z|\theta_0)]) =
-\ln[\int f(z|\theta)\,dz] = 0$. Q.E.D.

The term information inequality refers to an interpretation of $Q_0(\theta)$ as an information measure. This result means that MLE has the very nice feature that uniqueness
of the maximum of the limiting objective function occurs under the very weakest
possible condition of identification of $\theta_0$.
Conditions for identification in particular models are specific to those models. It

11. If the set of maximands $\mathscr{M}$ of the objective function has more than one element, then this set does
not distinguish between the true parameter and other values. In this case further restrictions are needed
for identification. These restrictions are sometimes referred to as normalizations. Alternatively, one could
work with convergence in probability to the set $\mathscr{M}$, but imposing normalization restrictions is more
practical, and is needed for asymptotic normality.
12. If $\theta_0$ is not identified, then there will be some $\tilde{\theta} \neq \theta_0$ such that the distribution of the data is the
same when $\tilde{\theta}$ is the true parameter value as when $\theta_0$ is the true parameter value. Therefore, $Q_0(\theta)$ will
also be the limiting objective function when $\tilde{\theta}$ is the true parameter, and hence the requirement that the
limiting objective function be maximized at the true parameter implies that $Q_0(\theta)$ has at least two maxima, $\theta_0$ and $\tilde{\theta}$.
13. The strict version of Jensen's inequality states that if $a(y)$ is a strictly concave function [e.g.
$a(y) = \ln(y)$] and $Y$ is a nonconstant random variable, then $a(E[Y]) > E[a(Y)]$.

is often possible to specify them in a way that is easy to interpret (i.e. in a primitive
way), as in the Cauchy example.

Example 1.1 continued

It will follow from Lemma 2.2 that $E[\ln f(z|\theta)]$ has a unique maximum at the
true parameter. Existence of $E[|\ln f(z|\theta)|]$ for all $\theta$ follows from $|\ln f(z|\theta)| \leq C_1 +
\ln(1 + \sigma^{-2}|z - \mu|^2) \leq C_1 + \ln(C_2 + C_3|z|^2)$ for positive constants $C_1$, $C_2$, and $C_3$,
and existence of $E[\ln(C_2 + C_3|z|^2)]$. Identification follows from $f(z|\theta)$ being one-to-one in the quadratic function $\sigma\{1 + [(z - \mu)/\sigma]^2\}$, the fact that two distinct quadratic functions
intersect at no more than two points, and the fact that the probability of any two
points is zero, so that $\text{Prob}(\{z: f(z|\theta) \neq f(z|\theta_0)\}) = 1 > 0$. Thus, by the information
inequality, $E[\ln f(z|\theta)]$ has a unique maximum at $\theta_0$. This example illustrates that it
can be quite easy to show that the expected log-likelihood has a unique maximum,
even when the first-order conditions for the MLE do not have unique roots.
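
The many-roots phenomenon is easy to see numerically. The following sketch (our own illustration, with $\sigma$ fixed at 1) grid-evaluates the average Cauchy log-likelihood in $\mu$ for a small sample; it typically finds several local maxima but a single global one:

```python
import numpy as np

rng = np.random.default_rng(0)
z = 5.0 + rng.standard_cauchy(20)        # small sample, true mu = 5

def avg_loglik(mu, z, sigma=1.0):
    """Average Cauchy log-likelihood as a function of the location mu."""
    return np.mean(-np.log(np.pi * sigma) - np.log1p(((z - mu) / sigma) ** 2))

grid = np.linspace(-20.0, 20.0, 4001)
vals = np.array([avg_loglik(m, z) for m in grid])
is_peak = (vals[1:-1] > vals[:-2]) & (vals[1:-1] > vals[2:])
print("local maxima near:", grid[1:-1][is_peak])      # usually several points
print("global maximum near:", grid[np.argmax(vals)])  # close to 5
```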

Example 1.2 continued

Throughout the probit example, the identification and regularity conditions
will be combined in the assumption that the second-moment matrix $E[xx']$ exists
and is nonsingular. This assumption implies identification. To see why, note
that nonsingularity of $E[xx']$ implies that it is positive definite. Let $\theta \neq \theta_0$, so that
$E[\{x'(\theta - \theta_0)\}^2] = (\theta - \theta_0)'E[xx'](\theta - \theta_0) > 0$, implying that $x'(\theta - \theta_0) \neq 0$, and
hence $x'\theta \neq x'\theta_0$, where as before not equals means not equal on a set of positive probability. Both $\Phi(v)$ and $\Phi(-v)$ are strictly monotonic, so that $x'\theta \neq x'\theta_0$
implies both $\Phi(x'\theta) \neq \Phi(x'\theta_0)$ and $1 - \Phi(x'\theta) \neq 1 - \Phi(x'\theta_0)$, and hence that
$f(z|\theta) = \Phi(x'\theta)^y[1 - \Phi(x'\theta)]^{1-y} \neq f(z|\theta_0)$.
Existence of $E[xx']$ also implies that $E[|\ln f(z|\theta)|] < \infty$. It is well known that the
derivative $d\ln\Phi(v)/dv = \lambda(v) = \phi(v)/\Phi(v)$ [for $\phi(v) = \nabla_v\Phi(v)$] is convex and asymptotes to $-v$ as $v \to -\infty$, and to zero as $v \to \infty$. Therefore, a mean-value expansion
around $x'\theta = 0$ gives

$|\ln\Phi(x'\theta)| = |\ln\Phi(0) + \lambda(x'\bar{\theta})x'\theta| \leq |\ln\Phi(0)| + \lambda(x'\bar{\theta})|x'\theta|
\leq |\ln\Phi(0)| + C(1 + |x'\bar{\theta}|)|x'\theta| \leq |\ln\Phi(0)| + C(1 + \|x\|\,\|\theta\|)\|x\|\,\|\theta\|.$

Since $1 - \Phi(v) = \Phi(-v)$ and $y$ is bounded, $|\ln f(z|\theta)| \leq 2[|\ln\Phi(0)| + C(1 + \|x\| \times
\|\theta\|)\|x\|\,\|\theta\|]$, so existence of second moments of $x$ implies that $E[|\ln f(z|\theta)|]$ is
finite. This part of the probit example illustrates the detailed work that may be
needed to verify that moment existence assumptions like that of Lemma 2.2 are
satisfied.

2.2.2. Nonlinear least squares

The identification condition for NLS is that the mean square error $E[\{y - h(x, \theta)\}^2] =
-Q_0(\theta)$ have a unique minimum at $\theta_0$. As is easily shown, the mean square error

has a unique minimum at the conditional mean.14 Since $h(x, \theta_0) = E[y|x]$ is the
conditional mean, the identification condition for NLS is that $h(x, \theta) \neq h(x, \theta_0)$ if
$\theta \neq \theta_0$, i.e. that $h(x, \theta)$ is not the conditional mean when $\theta \neq \theta_0$. This is a natural
conditional mean identification condition for NLS.
In some cases identification will not be sufficient for conditional mean identification. Intuitively, only parameters that affect the first conditional moment of $y$ given
$x$ can be identified by NLS. For example, if $\theta$ includes conditional variance parameters, or parameters of other higher-order moments, then these parameters may
not be identified from the conditional mean.
As for identification, it is often easy to give primitive hypotheses for conditional
mean identification. For example, in the linear model $h(x, \theta) = x'\theta$, conditional mean
identification holds if $E[xx']$ is nonsingular, for then $\theta \neq \theta_0$ implies $x'\theta \neq x'\theta_0$, as
shown in the probit example. For another example, suppose $x$ is a positive scalar
and $h(x, \theta) = \alpha + \beta x^\gamma$. As long as both $\beta_0$ and $\gamma_0$ are nonzero, the regression curve
for a different value of $\theta$ intersects the true curve at most at three $x$ points. Thus,
for identification it is sufficient that $x$ have positive density over any interval, or
that $x$ have more than three points that have positive probability.

2.2.3. Generalized method of moments

For generalized method of moments the limit function $Q_0(\theta)$ is a little more complicated than for MLE or NLS, but is still easy to find. By the law of large numbers,
$\hat{g}_n(\theta) \overset{p}{\longrightarrow} g_0(\theta) = E[g(z, \theta)]$, so that if $\hat{W} \overset{p}{\longrightarrow} W$ for some positive semi-definite matrix
$W$, then by continuity of multiplication, $\hat{Q}_n(\theta) \overset{p}{\longrightarrow} Q_0(\theta) = -g_0(\theta)'Wg_0(\theta)$. This function has a maximum of zero at $\theta_0$, so $\theta_0$ will be identified if it is less than zero for
$\theta \neq \theta_0$.

Lemma 2.3 (GMM identification)

If $W$ is positive semi-definite and, for $g_0(\theta) = E[g(z, \theta)]$, $g_0(\theta_0) = 0$ and $Wg_0(\theta) \neq 0$
for $\theta \neq \theta_0$, then $Q_0(\theta) = -g_0(\theta)'Wg_0(\theta)$ has a unique maximum at $\theta_0$.

Proof

Let $R$ be such that $R'R = W$. If $\theta \neq \theta_0$, then $0 \neq Wg_0(\theta) = R'Rg_0(\theta)$ implies $Rg_0(\theta) \neq 0$,
and hence $Q_0(\theta) = -[Rg_0(\theta)]'[Rg_0(\theta)] < Q_0(\theta_0) = 0$ for $\theta \neq \theta_0$. Q.E.D.

The GMM identification condition is that if $\theta \neq \theta_0$ then $g_0(\theta)$ is not in the null space
of $W$, which for nonsingular $W$ reduces to $g_0(\theta)$ being nonzero if $\theta \neq \theta_0$. A necessary
order condition for GMM identification is that there be at least as many moment
14. For $m(x) = E[y|x]$ and $a(x)$ any function with finite variance, iterated expectations gives
$E[\{y - a(x)\}^2] = E[\{y - m(x)\}^2] + 2E[\{y - m(x)\}\{m(x) - a(x)\}] + E[\{m(x) - a(x)\}^2] = E[\{y - m(x)\}^2] +
E[\{m(x) - a(x)\}^2] \geq E[\{y - m(x)\}^2]$, with strict inequality if $a(x) \neq m(x)$.

functions as parameters. If there are fewer moments than parameters, then there
will typically be many solutions to $g_0(\theta) = 0$.
If the moment functions are linear, say $g(z, \theta) = g(z) + G(z)\theta$, then the necessary
and sufficient rank condition for GMM identification is that the rank of $WE[G(z)]$
is equal to the number of columns. For example, consider a linear instrumental
variables estimator, where $g(z, \theta) = x\cdot(y - Y'\theta)$ for a residual $y - Y'\theta$ and a vector
of instrumental variables $x$. The two-stage least squares estimator of $\theta$ is a GMM
estimator with $\hat{W} = (\sum_{i=1}^n x_i x_i'/n)^{-1}$. Suppose that $E[xx']$ exists and is nonsingular,
so that $\hat{W} \overset{p}{\longrightarrow} W = (E[xx'])^{-1}$ by the law of large numbers. Then the rank condition for
GMM identification is that $E[xY']$ has full column rank, the well known instrumental
variables identification condition. If $E[Y|x] = \pi'x$ then this condition reduces to
$\pi$ having full column rank, a version of the single equation identification condition
[see F.M. Fisher (1976), Theorem 2.7.1]. More generally, $E[xY'] = E[xE[Y'|x]]$,
so that GMM identification is the same as $x$ having full rank covariance with
$E[Y|x]$.
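
A sample-analog check of this rank condition is straightforward (a sketch with our own names):

```python
import numpy as np

def rank_condition(x, Y, tol=1e-8):
    """Check full column rank of the sample analog of E[x Y']."""
    M = x.T @ Y / x.shape[0]              # n^{-1} sum_i x_i Y_i'
    return np.linalg.matrix_rank(M, tol=tol) == Y.shape[1]

# Simulated example: two instruments, one right-hand-side variable
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
Y = (x @ np.array([1.0, -0.5]) + rng.normal(size=1000)).reshape(-1, 1)
print(rank_condition(x, Y))               # True: the instruments are relevant
```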
If $E[g(z, \theta)]$ is nonlinear in $\theta$, then specifying primitive conditions for identification
becomes quite difficult. Here conditions for identification are like conditions for
unique solutions of nonlinear equations (as in $E[g(z, \theta)] = 0$), which are known to be
difficult. This difficulty is another reason to avoid formulating $\hat{\theta}$ as the solution to
the first-order conditions when analyzing consistency, e.g. to avoid interpreting
MLE as a GMM estimator with $g(z, \theta) = \nabla_\theta\ln f(z|\theta)$. In some cases this difficulty is
unavoidable, as for instrumental variables estimators of nonlinear simultaneous
equations models.15
Local identification analysis may be useful when it is difficult to find primitive
conditions for (global) identification. If $g(z, \theta)$ is continuously differentiable and
$\nabla_\theta E[g(z, \theta)] = E[\nabla_\theta g(z, \theta)]$, then by Rothenberg (1971), a sufficient condition for a
unique solution of $WE[g(z, \theta)] = 0$ in a (small enough) neighborhood of $\theta_0$ is that
$WE[\nabla_\theta g(z, \theta_0)]$ have full column rank. This condition is also necessary for local
identification, and hence provides a necessary condition for global identification,
when $E[\nabla_\theta g(z, \theta)]$ has constant rank in a neighborhood of $\theta_0$ [i.e. in Rothenberg's
(1971) regular case]. For example, for nonlinear 2SLS, where $\rho(z, \theta)$ is a residual
and $g(z, \theta) = x\cdot\rho(z, \theta)$, the rank condition for local identification is that $E[x\cdot\nabla_\theta\rho(z, \theta_0)']$
has rank equal to its number of columns.
A practical solution to the problem of global GMM identification, one that has
often been adopted, is simply to assume identification. This practice is reasonable,
given the difficulty of formulating primitive conditions, but it is important to check
that it is not a vacuous assumption whenever possible, by showing identification in
some special cases. In simple models it may be possible to show identification under
particular forms for conditional distributions. The Hansen-Singleton model pro-
vides one example.

15. There are some useful results on identification of nonlinear simultaneous equations models in Brown
(1983) and Roehrig (1989), although global identification analysis of instrumental variables estimators
remains difficult.

Example 1.3 continued

Suppose that $\hat{W} = (n^{-1}\sum_{i=1}^n x_i x_i')^{-1}$, so that the GMM estimator is nonlinear two-stage least squares. By the law of large numbers, if $E[xx']$ exists and is nonsingular,
$\hat{W}$ will converge in probability to $W = (E[xx'])^{-1}$, which is nonsingular. Then the
GMM identification condition is that there is a unique solution to $E[x\rho(z, \theta)] = 0$
at $\theta = \theta_0$, where $\rho(z, \theta) = \beta wy^\gamma - 1$. Quite primitive conditions for identification
can be formulated in a special log-linear case. Suppose that $w = \exp[a(x) + u]$ and
$y = \exp[b(x) + v]$, where $(u, v)$ is independent of $x$, that $a(x) + \gamma_0 b(x)$ is constant, and
that $\eta(\theta_0) = 1$ for $\eta(\theta) = \exp[a(x) + \gamma_0 b(x)]\beta E[\exp(u + \gamma v)]$. Suppose also that the
first element of $x$ is a constant, so that the other elements can be assumed to have mean
zero (by demeaning if necessary, which is a nonsingular linear transformation,
and so does not affect the identification analysis). Let $\alpha(x, \gamma) = \exp[(\gamma - \gamma_0)b(x)]$.
Then $E[\rho(z, \theta)|x] = \alpha(x, \gamma)\eta(\theta) - 1$, which is zero for $\theta = \theta_0$, and hence $E[g(z, \theta_0)] = 0$.
For $\theta \neq \theta_0$, $E[g(z, \theta)] = \{E[\alpha(x, \gamma)]\eta(\theta) - 1, \text{Cov}[x, \alpha(x, \gamma)]'\eta(\theta)\}'$. This expression is
nonzero if $\text{Cov}[x, \alpha(x, \gamma)]$ is nonzero, because then the second term is nonzero if $\eta(\theta)$
is nonzero and the first term is nonzero if $\eta(\theta) = 0$. Furthermore, if $\text{Cov}[x, \alpha(x, \gamma)] = 0$
for some $\gamma$, then all of the elements of $E[g(z, \theta)]$ other than the first are zero for all $\beta$, and one can choose
$\beta > 0$ so the first element is zero. Thus, $\text{Cov}[x, \alpha(x, \gamma)] \neq 0$ for $\gamma \neq \gamma_0$ is a necessary
and sufficient condition for identification. In other words, the identification condition
is that for all $\gamma$ in the parameter set, some coefficient of a nonconstant variable
in the regression of $\alpha(x, \gamma)$ on $x$ is nonzero. This is a relatively primitive condition,
because we have some intuition about when regression coefficients are zero, although
it does depend on the form of $b(x)$ and the distribution of $x$ in a complicated way.
If $b(x)$ is a nonconstant, monotonic function of a linear combination of $x$, then
this covariance will be nonzero.16 Thus, in this example it is found that the assumption of GMM identification is not vacuous, that there are some nice special cases
where identification does hold.

2.2.4. Classical minimum distance

The analysis of CMD identification is very similar to that for GMM. If $\hat{\pi} \overset{p}{\longrightarrow} \pi_0$
and $\hat{W} \overset{p}{\longrightarrow} W$, $W$ positive semi-definite, then $\hat{Q}_n(\theta) = -[\hat{\pi} - h(\theta)]'\hat{W}[\hat{\pi} - h(\theta)] \overset{p}{\longrightarrow}
-[\pi_0 - h(\theta)]'W[\pi_0 - h(\theta)] = Q_0(\theta)$. The condition for $Q_0(\theta)$ to have a unique maximum (of zero) at $\theta_0$ is that $h(\theta_0) = \pi_0$ and $h(\theta) - h(\theta_0)$ is not in the null space of $W$
if $\theta \neq \theta_0$, which reduces to $h(\theta) \neq h(\theta_0)$ if $W$ is nonsingular. If $h(\theta)$ is linear in $\theta$ then
there is a readily interpretable rank condition for identification, but otherwise the
analysis of global identification is difficult. A rank condition for local identification
is that the rank of $W\nabla_\theta h(\theta_0)$ equals the number of components of $\theta$.

16. It is well known that $\text{Cov}[x, f(x)] \neq 0$ for any monotonic, nonconstant function $f(x)$ of a random
variable $x$.

2.3. Uniform convergence and continuity

Once conditions for identification have been found and compactness of the parameter
set has been assumed, the only other primitive conditions for consistency required
by Theorem 2.1 are those for uniform convergence in probability and continuity of
the limiting objective function. This subsection gives primitive hypotheses for these
conditions that, when combined with identification, lead to primitive conditions for
consistency of particular estimators.
For many estimators, results on uniform convergence of sample averages, known
as uniform laws of large numbers, can be used to specify primitive regularity conditions.
Examples include MLE, NLS, and GMM, each of which depends on sample
averages. The following uniform law of large numbers is useful for these estimators.
Let $a(z, \theta)$ be a matrix of functions of an observation $z$ and the parameter $\theta$, and for
a matrix $A = [a_{jk}]$, let $\|A\| = (\sum_{j,k} a_{jk}^2)^{1/2}$ be the Euclidean norm.

Lemma 2.4

If the data are i.i.d., $\Theta$ is compact, $a(z_i, \theta)$ is continuous at each $\theta \in \Theta$ with probability
one, and there is $d(z)$ with $\|a(z, \theta)\| \leq d(z)$ for all $\theta \in \Theta$ and $E[d(z)] < \infty$, then
$E[a(z, \theta)]$ is continuous and $\sup_{\theta \in \Theta}\|n^{-1}\sum_{i=1}^n a(z_i, \theta) - E[a(z, \theta)]\| \overset{p}{\longrightarrow} 0$.

The conditions of this result are similar to assumptions of Wald's (1949) consistency
proof, and it is implied by Lemma 1 of Tauchen (1985).
The conditions of this result are quite weak. In particular, they allow for $a(z, \theta)$
to not be continuous on all of $\Theta$ for given $z$.17 Consequently, this result is useful
even when the objective function is not continuous, as for Manski's (1975) maximum
score estimator and the simulation-based estimators of Pakes (1986) and McFadden
(1989). Also, this result can be extended to dependent data. The conclusion remains
true if the i.i.d. hypothesis is changed to strict stationarity and ergodicity of $z_i$.18
The two conditions imposed on $a(z, \theta)$ are a continuity condition and a moment
existence condition. These conditions are very primitive. The continuity condition
can often be verified by inspection. The moment existence hypothesis just requires
a data-dependent upper bound on $\|a(z, \theta)\|$ that has finite expectation. This condition
is sometimes referred to as a dominance condition, where $d(z)$ is the dominating
function. Because it only requires that certain moments exist, it is a regularity
condition rather than a substantive restriction.
It is often quite easy to see that the continuity condition is satisfied and to specify
moment hypotheses for the dominance condition, as in the examples.

17. The conditions of Lemma 2.4 are not sufficient for measurability of the supremum in the conclusion,
but are sufficient for convergence of the supremum in outer measure. Convergence in outer measure is
sufficient for consistency of the estimator in terms of outer measure, a result that is useful when the
objective function is not continuous, as previously noted.
18. Strict stationarity means that the distribution of $(z_i, z_{i+1}, \ldots, z_{i+m})$ does not depend on $i$ for any $m$,
and ergodicity implies that $n^{-1}\sum_{i=1}^n a(z_i) \overset{p}{\longrightarrow} E[a(z_i)]$ for (measurable) functions $a(z)$ with $E[|a(z)|] < \infty$.
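
As an informal numerical illustration of Lemma 2.4 (our own simulation, not from the chapter), take $a(z, \theta) = |z - \theta|$ with standard normal data and the compact set $\Theta = [-2, 2]$; here $d(z) = |z| + 2$ is a dominating function with finite expectation, and the supremum gap shrinks as $n$ grows:

```python
import numpy as np
from scipy.stats import norm

def expected_abs(theta):
    """E|Z - theta| for Z ~ N(0,1), in closed form."""
    return 2 * norm.pdf(theta) + theta * (2 * norm.cdf(theta) - 1)

rng = np.random.default_rng(0)
grid = np.linspace(-2.0, 2.0, 201)         # grid over the compact set Theta
for n in (100, 1000, 10000):
    z = rng.normal(size=n)
    sup_gap = max(abs(np.abs(z - t).mean() - expected_abs(t)) for t in grid)
    print(n, round(sup_gap, 4))            # the sup gap shrinks toward zero
```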

2.4. Consistency of maximum likelihood

The conditions for identification in Section 2.2 and the uniform convergence result
of Lemma 2.4 allow specification of primitive regularity conditions for particular
kinds of estimators. A consistency result for MLE can be formulated as follows:

Theorem 2.5

Suppose that $z_i$, $(i = 1, 2, \ldots)$, are i.i.d. with p.d.f. $f(z_i|\theta_0)$ and (i) if $\theta \neq \theta_0$ then
$f(z_i|\theta) \neq f(z_i|\theta_0)$; (ii) $\theta_0 \in \Theta$, which is compact; (iii) $\ln f(z_i|\theta)$ is continuous at each
$\theta \in \Theta$ with probability one; (iv) $E[\sup_{\theta \in \Theta}|\ln f(z|\theta)|] < \infty$. Then $\hat{\theta} \overset{p}{\longrightarrow} \theta_0$.

Proof

Proceed by verifying the conditions of Theorem 2.1. Condition 2.1(i) follows by 2.5(i)
and (iv) and Lemma 2.2. Condition 2.1(ii) holds by 2.5(ii). Conditions 2.1(iii) and (iv)
follow by Lemma 2.4. Q.E.D.

The conditions of this result are quite primitive and also quite weak. The conclusion
is consistency of the MLE. Thus, a particular MLE can be shown to be consistent
by checking the conditions of this result, which are identification, compactness,
continuity of the log-likelihood at particular points, and a dominance condition for
the log-likelihood. Often it is easy to specify conditions for identification, continuity
holds by inspection, and the dominance condition can be shown to hold with a little
algebra. The Cauchy location-scale model is an example.

Example 1.1 continued

To show consistency of the Cauchy MLE, one can proceed to verify the hypotheses
of Theorem 2.5. Condition (i) was shown in Section 2.2.1. Conditions (iii) and (iv)
were shown in Section 2.3. Then the conditions of Theorem 2.5 imply that when $\Theta$
is any compact set containing $\theta_0$, the Cauchy MLE is consistent.

A similar result can be stated for probit (i.e. Example 1.2). It is not given here because
it is possible to drop the compactness hypothesis of Theorem 2.5. The probit
log-likelihood turns out to be concave in parameters, leading to a simple consistency
result without a compact parameter space. This result is discussed in Section 2.6.
Theorem 2.5 remains true if the i.i.d. assumption is replaced with the condition
that $z_1, z_2, \ldots$ is stationary and ergodic with (marginal) p.d.f. of $z_i$ given by $f(z|\theta_0)$.
This relaxation of the i.i.d. assumption is possible because the limit function remains
unchanged (so the information inequality still applies) and, as noted in Section 2.3,
uniform convergence and continuity of the limit still hold.
A similar consistency result for NLS could be formulated by combining conditional mean identification, compactness of the parameter space, $h(x, \theta)$ being conti-

nuous at each H with probability one, and a dominance condition. Formulating


such a result is left as an exercise.

2.5. Consistency of GMM


A consistency result for GMM can be formulated as follows:

Theorem 2.6

Suppose that $z_i$, $(i = 1, 2, \ldots)$, are i.i.d., $\hat{W} \overset{p}{\longrightarrow} W$, and (i) $W$ is positive semi-definite
and $WE[g(z, \theta)] = 0$ only if $\theta = \theta_0$; (ii) $\theta_0 \in \Theta$, which is compact; (iii) $g(z, \theta)$ is continuous
at each $\theta \in \Theta$ with probability one; (iv) $E[\sup_{\theta \in \Theta}\|g(z, \theta)\|] < \infty$. Then $\hat{\theta} \overset{p}{\longrightarrow} \theta_0$.

Proof

Proceed by verifying the hypotheses of Theorem 2.1. Condition 2.1(i) follows
by 2.6(i) and Lemma 2.3. Condition 2.1(ii) holds by 2.6(ii). By Lemma 2.4
applied to $a(z, \theta) = g(z, \theta)$, for $\hat{g}_n(\theta) = n^{-1}\sum_{i=1}^n g(z_i, \theta)$ and $g_0(\theta) = E[g(z, \theta)]$, one has
$\sup_{\theta \in \Theta}\|\hat{g}_n(\theta) - g_0(\theta)\| \overset{p}{\longrightarrow} 0$ and $g_0(\theta)$ is continuous. Thus, 2.1(iii) holds by
$Q_0(\theta) = -g_0(\theta)'Wg_0(\theta)$ continuous. By $\Theta$ compact, $g_0(\theta)$ is bounded on $\Theta$, and by
the triangle and Cauchy-Schwartz inequalities,

$|\hat{Q}_n(\theta) - Q_0(\theta)| \leq \|\hat{g}_n(\theta) - g_0(\theta)\|^2\|\hat{W}\| + 2\|g_0(\theta)\|\,\|\hat{g}_n(\theta) - g_0(\theta)\|\,\|\hat{W}\|
+ \|g_0(\theta)\|^2\|\hat{W} - W\|,$

so that $\sup_{\theta \in \Theta}|\hat{Q}_n(\theta) - Q_0(\theta)| \overset{p}{\longrightarrow} 0$, and 2.1(iv) holds. Q.E.D.

The conditions of this result are quite weak, allowing for discontinuity in the
moment functions.19 Consequently, this result is general enough to cover the
simulated moment estimators of Pakes (1986) and McFadden (1989), or the interval
moment estimator of Newey (1988).
To use this result to show consistency of a GMM estimator, one proceeds to
check the conditions, as in the Hansen-Singleton example.

19. Measurability of the estimator becomes an issue in this case, although this can be finessed by
working with outer measure, as previously noted.

Example 1.3 continued

Assume that $E[xx']$ exists and is nonsingular, so that $\hat{W} \overset{p}{\longrightarrow} W = (E[xx'])^{-1}$. For hypothesis (i), simply
assume that $E[g(z, \theta)] = 0$ has a unique solution at $\theta_0$ among all $\theta \in \Theta$. Unfortunately,
as discussed in Section 2.2, it is difficult to give more primitive assumptions for this
identification condition. Also, assume that $\Theta$ is compact, so that (ii) holds. Then (iii)
holds by inspection, and as discussed in Section 2.3, (iv) holds as long as the moment
existence conditions given there are satisfied. Thus, under these assumptions, the
estimator will be consistent.
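
To see these conditions at work, here is a simulation sketch of the nonlinear two-stage least squares estimator in the log-linear special case of Section 2.2.3 (all names and data-generating constants are our own assumptions, not from the chapter):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 5000
beta0, gamma0 = 0.95, -2.0
x = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + instrument
b = 0.5 * x[:, 1]
su = sv = 0.25                                          # std devs of (u, v)
# choose a(x) so that a(x) + gamma0*b(x) is constant and eta(theta0) = 1
const = -np.log(beta0) - 0.5 * (su**2 + gamma0**2 * sv**2)
w = np.exp(const - gamma0 * b + su * rng.normal(size=n))
y = np.exp(b + sv * rng.normal(size=n))

W = np.linalg.inv(x.T @ x / n)                          # nonlinear 2SLS weight

def Q(theta):
    """GMM objective for g(z, theta) = x * (beta * w * y**gamma - 1)."""
    beta, gamma = theta
    g_bar = x.T @ (beta * w * y**gamma - 1) / n
    return g_bar @ W @ g_bar

res = minimize(Q, x0=np.array([1.0, -1.0]), method="Nelder-Mead")
print(res.x)            # close to (beta0, gamma0) = (0.95, -2.0)
```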

Theorem 2.6 remains true if the i.i.d. assumption is replaced with the condition
that $z_1, z_2, \ldots$ is stationary and ergodic. Also, a similar consistency result could be
formulated for CMD, by combining uniqueness of the solution to $\pi_0 = h(\theta)$ with
compactness of the parameter space and continuity of $h(\theta)$. Details are left as an
exercise.

2.6. Consistency without compactness

The compactness assumption is restrictive, because it implicitly requires that there


be known bounds on the true parameter value. It is useful in practice to be able to
drop this restriction, so that conditions for consistency without compactness are of
interest. One nice result is available when the objective function is concave. Intuitively,
concavity prevents the objective function from turning up as the parameter moves
far away from the truth. A precise result based on this intuition is the following one:

Theorem 2.7

If there is a function $Q_0(\theta)$ such that (i) $Q_0(\theta)$ is uniquely maximized at $\theta_0$; (ii) $\theta_0$ is
an element of the interior of a convex set $\Theta$ and $\hat{Q}_n(\theta)$ is concave; and (iii) $\hat{Q}_n(\theta) \overset{p}{\longrightarrow}
Q_0(\theta)$ for all $\theta \in \Theta$, then $\hat{\theta}$ exists with probability approaching one and $\hat{\theta} \overset{p}{\longrightarrow} \theta_0$.

Proof

Let $\mathscr{C}$ be a closed sphere of radius $2\varepsilon$ around $\theta_0$ that is contained in the interior of
$\Theta$ and let $\partial\mathscr{C}$ be its boundary. Concavity is preserved by pointwise limits, so that
$Q_0(\theta)$ is also concave. A concave function is continuous on the interior of its domain,
so that $Q_0(\theta)$ is continuous on $\mathscr{C}$. Also, by Theorem 10.8 of Rockafellar (1970),
pointwise convergence of concave functions on a dense subset of an open set implies
uniform convergence on any compact subset of the open set. It then follows as in
Andersen and Gill (1982) that $\hat{Q}_n(\theta)$ converges to $Q_0(\theta)$ in probability uniformly on
any compact subset of $\Theta$, and in particular on $\mathscr{C}$. Hence, by Theorem 2.1, the
maximand $\bar{\theta}_n$ of $\hat{Q}_n(\theta)$ on $\mathscr{C}$ is consistent for $\theta_0$. Then the event that $\bar{\theta}_n$ is within $\varepsilon$ of
$\theta_0$, so that $\hat{Q}_n(\bar{\theta}_n) \geq \max_{\theta \in \partial\mathscr{C}}\hat{Q}_n(\theta)$, occurs with probability approaching one. In this
event, for any $\theta$ outside $\mathscr{C}$ there is a linear convex combination $\lambda\bar{\theta}_n + (1 - \lambda)\theta$
that lies in $\partial\mathscr{C}$ (with $\lambda < 1$), so that $\hat{Q}_n(\bar{\theta}_n) \geq \hat{Q}_n[\lambda\bar{\theta}_n + (1 - \lambda)\theta]$. By concavity,
$\hat{Q}_n[\lambda\bar{\theta}_n + (1 - \lambda)\theta] \geq \lambda\hat{Q}_n(\bar{\theta}_n) + (1 - \lambda)\hat{Q}_n(\theta)$. Putting these inequalities together,
$(1 - \lambda)\hat{Q}_n(\bar{\theta}_n) \geq (1 - \lambda)\hat{Q}_n(\theta)$, implying $\bar{\theta}_n$ is the maximand over $\Theta$. Q.E.D.

This theorem is similar to Corollary II.2 of Andersen and Gill (1982) and Lemma
A of Newey and Powell (1987). In addition to allowing for noncompact $\Theta$, it only
requires pointwise convergence. This weaker hypothesis is possible because pointwise convergence of concave functions implies uniform convergence (see the proof).
This result also contains the additional conclusion that $\hat{\theta}$ exists with probability
approaching one, which is needed because of noncompactness of $\Theta$.
This theorem leads to simple conditions for consistency without compactness for
both MLE and GMM. For MLE, if in Theorem 2.5, (ii)-(iv) are replaced by $\Theta$
convex, $\ln f(z|\theta)$ concave in $\theta$ (with probability one), and $E[|\ln f(z|\theta)|] < \infty$ for all
$\theta$, then the law of large numbers and Theorem 2.7 give consistency. In other words,
with concavity the conditions of Lemma 2.2 are sufficient for consistency of the
MLE. Probit is an example.

Example 1.2 continued

It was shown in Section 2.2.1 that the conditions of Lemma 2.2 are satisfied. Thus,
to show consistency of the probit MLE it suffices to show concavity of the log-likelihood, which will be implied by concavity of $\ln\Phi(x'\theta)$ and $\ln\Phi(-x'\theta)$. Since $x'\theta$
is linear in $\theta$, it suffices to show concavity of $\ln\Phi(v)$ in $v$. This concavity follows
from the well known fact that $d\ln\Phi(v)/dv = \phi(v)/\Phi(v)$ is monotonic decreasing [as
well as the general Pratt (1981) result discussed below].

For GMM, if y(z, 0) is linear in 0 and I?f is positive semi-definite then the objective
function is concave, so if in Theorem 2.6, (ii)- are replaced by the requirement
that E[ /Ig(z, 0) 111< n3 for all tj~ 0, the conclusion of Theorem 2.7 will give consis-
tency of GMM. This linear moment function case includes linear instrumental
variables estimators, where compactness is well known to not be essential.
This result can easily be generalized to estimators with objective functions that
are concave after reparametrization. If conditions (i) and (iii) are satisfied and there
is a one-to-one mapping r(0) with continuous inverse such that &-(I.)] is
concave_ on^ r(O) and $0,) is an element of the interior of r( O), then the maximizing
value i of Q.[r - (J)] will be consistent for i, = s(d,) by Theorem 2.7 and invariance
of a maxima to one-to-one reparametrization, and i? = r- (I) will be consistent for
8, = z-~(&) by continuity of the inverse.
An important class of estimators with objective functions that are concave after
reparametrization are univariate continuous/discrete regression models with log-
concave densities, as discussed in Olsen (1978) and Pratt (1981). To describe this
class, first consider a continuous regression model y = x& + cOc, where E is indepen-
dent of x with p.d.f. g(s). In this case the (conditional on x) log-likelihood is
- In 0 + In sCa_ (y - xfi)] for (B, C)E 0 = @x(0, co). If In g(E) is concave, then this
Ch. 36: Large Sample Estimation and Hypothesis Testing 2135

log-likelihood need not be concave, but the likelihood In / + ln Y(YY- ~6) is concave
in the one-to-one reparametrization y = Q- and 6 = /~/a. Thus, the average log-
likelihood is also concave in these parameters, so that the above generalization of
Theorem 2.7 implies consistency of the MLE estimators of fi and r~ when the
maximization takes place over 0 = Rkx(O, a), if In g(c) is concave. There are many
log-concave densities, including those proportional to exp( - Ixl) for CI3 1 (including
the Gaussian), logistic, and the gamma and beta when the p.d.f. is bounded, so this
concavity property is shared by many models of interest.
The reparametrized log-likelihood is also concave when y is only partially
observed. As shown by Pratt (1981), concavity of lng(a) also implies concavity of
ln[G(u)- G(w)] in u and w, for the CDF G(u)=~~~(E)~E.~~ That is, the log-
probability of an interval will be concave in the endpoints. Consequently, the
log-likelihood for partial observability will be concave in the parameters when each
of the endpoints is a linear function of the parameters. Thus, the MLE will be
consistent without compactness in partially observed regression models with log-
concave densities, which includes probit, logit, Tobit, and ordered probit with
unknown censoring points.
There are many other estimators with concave objective functions, where some
version of Theorem 2.7 has been used to show consistency without compactness.
These include the estimators in Andersen and Gill (1982), Newey and Powell (1987),
and Honort (1992).
It is also possible to relax compactness with some nonconcave objective functions.
Indeed, the original Wald (1949) MLE consistency theorem allowed for noncom-
pactness, and Huber (1967) has given similar results for other estimators. The basic
idea is to bound the objective function above uniformly in parameters that are far
enough away from the truth. For example, consider the MLE. Suppose that there
is a compact set % such that E[supBtOnMc In f(z 1d)] < E[ln f(z) fl,)]. Then by the
law of large numbers, with probability approaching one, supBtOnXc&(0) d n-l x
c;= 1 suPoE@nfjc In f(zil@) < n-Cy= I In f(zl do), and the maximum must lie in %.
Once the maximum is known to be in a compact set with probability approaching
one, Theorem 2.1 applies to give consistency.
Unfortunately, the Wald idea does not work in regression models, which are quite
common in econometrics. The problem is that the likelihood depends on regression
parameters 8 through linear combinations of the form ~9, so that for given x
changing 8 along the null-space of x does not change the likelihood. Some results
that do allow for regressors are given in McDonald and Newey (1988), where it is
shown how compactness on 0 can be dropped when the objective takes the form
Q,(e) = n- xy= 1 a(Zi, X:O) an d a (z, u) goes to - co as u becomes unbounded. It would
be useful to have other results that apply to regression models with nonconcave
objective functions.

Pratt (1981) also showed that concavity of In g(c) is necessary as well as sufficient for ln[G(u) ~ G(w)]
to be concave over all v and w.
2136 W.K. Newey and D. McFadden

Compactness is essential for consistency of some extremum estimators. For


example, consider the MLE in a model where z is a mixture of normals, having
likelihood f(z 1Q)= pea-~+!$a-(z-p)] +(I -p)y~f$Cy~l(z-~)l for8=(p,a,6y),
some 0 < p < 1, and the standard normal p.d.f. d(c) = (271) 12e-E22. An interpreta-
tion of this model is that z is drawn from N(p, a2) with probability p and from N(cc, r2)
with probability (1 - p). The problem with noncompactness for the MLE in this
model is that for certain p (and u) values, the average log-likelihood becomes
unbounded as g (or y) goes to zero. Thus, for existence and consistency of the MLE
it is necessary to bound 0 (and y) away from zero. To be specific, suppose that p = Zi
for some i. Then f(z,lfI) = ~.a ~@(O)$(l -p)y-lf$cy~l(zi-cc)]+co as o+o,
and assuming that zj # zi for all j # i, cs occurs with probability one, f(zj/U)+
(1 -p)y-l~[y-l(zj-@]>O. Hence, Q,,(e)= n-Cy=r lnf(zilO) becomes un-
bounded as (T+O for p = zi. In spite of this fact, if the parameter set is assumed to
be compact, so that (Tand y are bounded away from zero, then Theorem 2.5 gives
consistency of the MLE. In particular, it is straightforward to show that (I is
identified, so that, by the information inequality, E[ln f(zl@] has a unique
maximum at Be. The problem here is that the convergence of the sample objective
function is not uniform over small values of fr.
This example is extreme, but there are interesting econometric examples that have
this feature. One of these is the disequilibrium model without observed regime of
Fair and Jaffee (1972), where y = min{xp, + G,,E,~6, + you}, E and u are standard
normal and independent of each other and of x and w, and the regressors include
constants. This model also has an unbounded average log-likelihood as 0 -+ 0 for
a certain values of /I, but the MLE over any compact set containing the truth will
be consistent under the conditions of Theorem 2.5.
Unfortunately, as a practical matter one may not be sure about lower bounds on
variances, and even if one were sure, extraneous maxima can appear at the lower
bounds in small samples. An approach to this problem is to search among local
maxima that satisfy the first-order conditions for the one that maximizes the
likelihood. This approach may work in the normal mixture and disequilibrium
models, but might not give a consistent estimator when the true value lies on the
boundary (and the first-order conditions are not satisfied on the boundary).

2.7. Stochastic equicontinuity and uniform convergence

Stochastic equicontinuity is important in recent developments in asymptotic distri-


bution theory, as described in the chapter by Andrews in this handbook. This
concept is also important for uniform convergence, as can be illustrated by the
nonstochastic case. Consider a sequence of continuous, nonstochastic functions
{Q,(0)},= 1. For nonrandom functions, equicontinuity means that the gap between
Q,(0) and Q,(6) can be made small uniformly in n by making g be close enough to
0, i.e. a sequence of functions is equicontinuous if they are continuous uniformly in
Ch. 36: Lurqr Sample Estimation and Hypothesis Testing 2137

n. More precisely, equicontinuity holds if for each 8, c > 0 there exists 6 > 0 with
1Q,(8) ~ Q,(e)1 < E for all Jj6 0 11< 6 and all 11.~~ It is well known that if Q,(0)
converges to Q,J0) pointwise, i.e. for all UE 0, and @is compact, then equicontinuity
is a necessary and sufficient condition for uniform convergence [e.g. see Rudin
(1976)]. The ideas behind it being a necessary and sufficient condition for uniform
convergence is that pointwise convergence is the same as uniform covergence on
any finite grid of points, and a finite grid of points can approximately cover a
compact set, so that uniform convergence means that the functions cannot vary too
much as 0 moves off the grid.
To apply the same ideas to uniform convergence in probability it is necessary to
define an in probability version of equicontinuity. The following version is for-
mulated in Newey (1991 a).

Stochastic_equicontinuity: For every c, n > 0 there exists a sequence of random


variables d, and a sample size no such that for n > n,, Prob( 1d^,1> E) < q and for
each 0 there is an open set JV containing 8 with

Here t_he function d^, acts like a random epsilon, bounding the effect of changing
0 on Q,(e). Consequently, similar reasoning to the nonstochastic case can be used
to show that stochastic equicontinuity is an essential condition for uniform conver-
gence, as stated in the following result:

Lemma 2.8

Suppose 0 is compact and Qo(B) is continuous. Then ~up~,~lQ,(~) - Qo(@ 30


if and only if Q,(0) L Qo(e) for all 9~ @and Q,(O) is stochastically equicontinuous.

The proof of this result is given in Newey (1991a). It is also possible to state an
almost sure convergence version of this result, although this does not seem to
produce the variety of conditions for uniform convergence that stochastic equi-
continuity does; see Andrews (1992).
One useful sufficient condition for uniform convergence that is motivated by the
form of the stochastic equicontinuity property is a global, in probability Lipschitz
condition, as in the hypotheses of the following result. Let O,(l) denote a sequence
of random variables that is bounded in probability.22

One can allow for discontinuity in the functions by allowing the difference to be less than I: only for
n > fi, where fi depends on E, but not on H. This modification is closer to the stochastic equicontinuity
condition given here, which does allow for discontinuity.
Y is bounded in probability if for every E > 0 there exists ii and q such that Prob(l Y,l > 1)< E for
n > ii.
2138 W.K. Newey and D. McFadden

Lemma 2.9

If 0 is compact, QO(0) is contmuous,_Q,,(0) %QO(0) for all 0~0, and there is


cr>O and B,=O,(l) such that for all 0, HE 0, 1o,(8) - Q^,(O)ld k,, I/g- 0 11
OL,then
su~~lto I Q,(@ - QdfO 5 0.

Prooj

By Lemma 2.8 it suffices to show stochastic equicontinuity. Pick E, ye> 0. By


B,n = o,(l) there is M such that Prob( IB,I > M) < r] for all n large enough. Let
A,, = BJM_and .-1/= [&:J e- 0 11<c/M}.ThenProb(I&>a)=Prob(Ifi,I>M) <y
and for all 0, ~E.~V, IQ,,(o) - Q,,(0)1 < 6,,Il& 8 lla < 2,. Q.E.D.

This result is useful in formulating the uniform law of large numbers given in
Wooldridges chapter in this volume. It is also useful when the objective function
Q,(e) is not a simple function of sample averages (i.e. where uniform laws of large
numbers do not apply). Further examples and discussion are given in Newey
(1991a).

2.8. Least ubsolute deviations examples

Estimators that minimize a sum of absolute deviations provide interesting examples.


The objective function that these estimators minimize is not differentiable, so that
weak regularity conditions are needed for verifying consistency and asymptotic
normality. Also, these estimators have certain robustness properties that make them
interesting in their own right. In linear models the least absolute deviations estimator
is known to be more asymptotically more efficient than least squares for thick-tailed
distributions. In the binary choice and censored regression models the least absolute
deviations estimator is consistent without any functional form assumptions on the
distribution of the disturbance. The linear model has been much discussed in the
statistics and economics literature [e.g. see Bloomfeld and Steiger (1983)], so it
seems more interesting to consider here other cases. To this end two examples are
given: maximum score, which applies to the binary choice model, and censored least
absolute deviations.

2.8.1. Maximum score

The maximum score estimator of Manski (I 975) is an interesting example because


it has a noncontinuous objective function, where the weak regularity conditions
of Lemma 2.4 are essential, and because it is a distribution-free estimator for binary
choice. Maximum score is used to estimate 8, in the model y = I(xB, + E > 0), where
l(.s&)denotes the indicator for the event .d (equal to one if d occurs and zero
Ch. 36: Lurye Sumple Estimation and Hypothesis Testing 2139

otherwise), and E is a disturbance term with a conditional median (given x) ofzero.


The estimator solves eq. (1.1) for

!A(@=-H-It i=l
lyi- l(x;H>o)/.

A scale normalization is necessary (as usual for binary choice), and a convenient
one here is to restrict all elements of 0 to satisfy //0 /I = 1.
To show consistency of the maximum score estimator, one can use conditions
for identification and Lemma 2.4 to directly verify all the hypotheses of Theorem
2.1. By the law of large numbers, Q,(e) will have probability limit Qe(0) =
- EC/y - l(xU > O)l]. To show that this limiting objective has a unique maximum
at fIO,one can use the well known result that for any random variable Y, the expected
absolute deviation E[ 1Y - a(x)I] is strictly minimized at any median of the condi-
tional distribution of Y given x. For a binary variable such as y, the median is unique
when Prob(y = 1 Ix) # +, equal to one when the conditional probability is more than
i and equal to zero when it is less than i. Assume that 0 is the unique conditional
median of E given x and that Prob(xB, = 0) = 0. Then Prob(y = 1 Ix) > ( < ) 3 if
and only if ~0, > ( < ) 0, so Prob(y = 1 Ix) = i occurs with probability zero, and
hence l(xt), > 0) is the unique median of y given x. Thus, it suffices to show that
l(xB > 0) # l(xB, > 0) if 0 # 19,. For this purpose, suppose that there are corre-
sponding partitions 8 = (or, fl;, and x = (x,, x;) such that x&S = 0 only if 6 = 0; also
assume that the conditional distribution of x1 given x2 is continuous with a p.d.f.
that is positive on R, and the coefficient O,, of x1 is nonzero. Under these conditions,
if 0 # 8, then l(xB > 0) # l(xB, > 0), the idea being that the continuous distribution
of x1 means that it is allowed that there is a region of x1 values where the sign of x8
is different. Also, under this condition, ~8, = 0 with zero probability, so y has a
unique conditional median of l(x8, > 0) that differs from i(x8 > 0) when 0 # fI,,, so
that QO(@ has a unique maximum at 0,.
For uniform convergence it is enough to assume that x0 is continuously distri-
buted for each 0. For example, if the coefficient of x1 is nonzero for all 0~0 then
this condition will hold. Then, l(xB > 0) will be continuous at each tI with probability
one, and by y and l(xB > 0) bounded, the dominance condition will be satisfied, so
the conclusion of Lemma 2.4 gives continuity of Qo(0) and uniform convergence of
Q,,(e) to Qe(@. The following result summarizes these conditions:

Theorem 2.10

If y = l(xB, + E > 0) and (i) the conditional distribution of E given x has a unique
median at I: = 0; (ii) there are corresponding partitions x = (x,, xi) and 8 = (e,, pZ)

13A median of the distribution of a random variable Y is the set of values m SUCKthat Prob( Y 2 m) > f
and Prob(y < m) 2 +.
2140 W.K. Nrwey and D. McFadden

such that Prob(x;G # 0) > 0 for 6 # 0 and the conditional distribution of xi given
x2 is continuous with support R; and (iii) ~8 is continuously distributed for all
0~0= (H:lIHIl = l}; then 850,.

2.8.2. Censored leust ubsolute deviations

Censored least absolute deviations is used to estimate B0 in the model y =


max{O, ~0, + F} where c has a unique conditional median at zero. It is obtained by
solvingeq.(l.l)forQ,(0)= -n-~~=i (lyi- max{O,x~~}~-~yi-max{O,xj~,}~)=
Q,(U) - Q,(0,). Consistency of 8 can be shown by using Lemma 2.4 to verify the
conditions of Theorem 2.1. The function Iyi - max (0, xi0) 1- Iyi - max {0, xi@,} I is
continuous in 8 by inspection, and by the triangle inequality its absolute value is
bounded above by Imax{O,x~H}I + Imax{O,xI8,}I d lIxJ( 118ll + IId,ll), so that if
E[ 11x II] < cc the dominance condition is satisfied. Then by the conclusion of
Lemma 2.4, Q,(0) converges uniformly in probability to QO(@= E[ ly - max{O,x8} I-
ly - max{O, ~8,) I]. Thus, for the normalized objective function, uniform conver-
gence does not require any moments of y to exist, as promised in Section 2.1.
Identification will follow from the fact that the conditional median minimizes the
expected absolute deviation. Suppose that P(xB, > 0) and P(x6 # Olx8, > 0) > 0
if 6 # 0. 24 By E having a uniqu e conditional median at zero, y has a unique
conditional median at max{O, xo,}. Therefore, to show identification it suffices to
show that max{O, xd} # max{O, xBO} if 8 # 0,. There are two cases to consider. In
case one, l(xU > 0) # 1(x@, > 0), implying max{O,xB,} # max{O,x@}. In case two,
1(x@> 0) = l(x0, > 0), so that max 10, x(9) - max 10, xBO}= l(xB, > O)x(H- 0,) # 0
by the identifying assumption. Thus, QO(0) has a unique maximum over all of R4 at
BO. Summarizing these conditions leads to the following result:

Theorem 2.11

If (i) y = max{O, ~8, + a}, the conditional distribution of E given x has a unique
median at E = 0; (ii) Prob(xB, > 0) > 0, Prob(xG # Olx0, > 0) > 0; (iii) E[li x 111< a;
and (iv) 0 is any compact set containing BO, then 8 3 8,.

As previously promised, this result shows that no assumption on the existence of


moments of y is needed for consistency of censored least absolute deviations. Also,
it shows that in spite of the first-order conditions being identically zero over all 0
where xi0 < 0 for all the observations, the global maximum of the least absolute
deviations estimator, over any compact set containing the true parameter, will be
consistent. It is not known whether the compactness restriction can be relaxed for
this estimator; the objective function is not concave, and it is not known whether
some other approach can be used to get rid of compactness.

241t suffices for the second condition that E[l(uU, > 0)x.x] is nonsingular.
Ch. 36: Large Sample Estimation and Hypothesis Testiny 2141

3. Asymptotic normality

Before giving precise conditions for asymptotic normality, it is helpful to sketch the
main ideas. The key idea is that in large samples estimators are approximately equal
to linear combinations of sample averages, so that the central limit theorem gives
asymptotic normality. This idea can be illustrated by describing the approximation
for the MLE. When the log-likelihood is differentiable and 8 is in the interior of the
parameter set 0, the first-order condition 0 = n x1= 1V, In f(zi I$) will be satisfied.
Assuming twice continuous differentiability of the log-likelihood, the mean-value
theorem applied to each element of the right-hand side of this first-order condition
gives

(3.1)

where t?is a mean value on the line joining i? and 19~and V,, denotes the Hessian
matrix of second derivatives. 5 Let J = E[V, In f(z (0,) (V, In f(z 1tl,)}] be the infor-
mation matrix and H = E[V,, In f(z 1O,)] the expected Hessian. Multiplying through
by Jn and solving for &(e^ - 6,) gives

p (Hessian Conv.) d (CLT) (3.2)


(Inverse Cont.)
I 1
H-1 NO. J)

By the well known zero-mean property of the score V,ln ,f(z/Q,) and the central
limit theorem, the second term will converge in distribution to N(0, .I). Also, since
eis between 6 and 8,, it will be consistent if 8 is, so that by a law of large numbers
that is uniform in 0 converging to 8, the Hessian term converges in probability to
H. Then the inverse Hessian converges in probability to H- by continuity of the
inverse at a nonsingular matrix. It then follows from the Slutzky theorem that
&(6- 0,) % N(0, Hm 1JH-).26 Furthermore, by the information matrix equality

25The mean-value theorem only applies to individual elements of the partial derivatives, so that 0
actually differs from element to element of the vector equation (3.1). Measurability of these mean values
holds because they minimize the absolute value of the remainder term, setting it equal to zero, and thus
are extremum estimators; see Jennrich (1969).
*The Slutzky theorem is Y, 5 Y, and Z, Ac*Z,Y,
-WY,.
2142 W,K. Newey und D. McFadden

H = -J, the asymptotic variance will have the usual inverse information matrix
form J-l.
This expansion shows that the maximum likelihood estimator is approximately
equal to a linear combination of the average score in large samples, so that asymptotic
normality follows by the central limit theorem applied to the score. This result is
the prototype for many other asymptotic normality results. It has several components,
including a first-order condition that is expanded around the truth, convergence of
an inverse Hessian, and a score that follows the central limit theorem. Each of these
components is important to the result. The first-order condition is a consequence
of the estimator being in the interior of the parameter space.27 If the estimator
remains on the boundary asymptotically, then it may not be asymptotically normal,
as further discussed below. Also, if the inverse Hessian does not converge to a
constant or the average score does not satisfy a central limit theorem, then the
estimator may not be asymptotically normal. An example like this is least squares
estimation of an autoregressive model with a unit root, as further discussed in
Chapter 2.
One condition that is not essential to asymptotic normality is the information
matrix equality. If the distribution is misspecified [i.e. is not f(zI fI,)] then the MLE
may still be consistent and asymptotically normal. For example, for certain expo-
nential family densities, such as the normal, conditional mean parameters will be
consistently estimated even though the likelihood is misspecified; e.g. see Gourieroux
et al. (1984). However, the distribution misspecification will result in a more compli-
cated form H- 'JH-' for the asymptotic variance. This more complicated form
must be allowed for to construct a consistent asymptotic variance estimator under
misspecification.
As described above, asymptotic normality results from convergence in probability
of the Hessian, convergence in distribution of the average score, and the Slutzky
theorem. There is another way to describe the asymptotic normality results that is
often used. Consider an estimator 6, and suppose that there is a function G(z) such
that

fi(e- 0,) = t $(zi)/$ + o,(l), EC$(Z)l = 0, ~%$(z)lc/(ZYl exists, (3.3)


i=l

where o,(l) denote: a random vector that converges in probability to zero. Asymp-
totic normality of 6then results from the central limit theorem applied to Cy= 1$(zi)/
,,h, with asymptotic variance given by the variance of I/I(Z).An estimator satisfying
this equation is referred to as asymptotically lineur. The function II/(z) is referred to
as the influence function, motivated by the fact that it gives the effect of a single

It is sufficient that the estimator be in the relative interior of 0, allowing for equality restrictions
to be imposed on 0, such as 0 = r(g) for smooth ~b) and the true ) being in an open ball. The first-order
condition does rule out inequality restrictions that are asymptotically binding.
Ch. 36: Lurge Sumplr Estimation and Hypothesis Testing 2143

observation on the estimator, up to the o,(l) remainder term. This description is


useful because all the information about the asymptotic variance is summarized in
the influence function. Also, the influence function is important in determining the
robustness properties of the estimator; e.g. see Huber (1964).
The MLE is an example of an asymptotically linear estimator, with influence
function $(z) = - H V, In ,f(z IO,). In this example the remainder term is, for the
mean value a, - [(n C;= 1V,,,,In f(zi 1g))- - H - In- li2Cr= ,V, In f(zil e,), which
converges in probability to zero because the inverse Hessian converges in probability
to H and the $I times the average score converges in distribution. Each of NLS
and GMM is also asymptotically linear, with influence functions that will be
described below. In general the CMD estimator need not be asymptotically linear,
because its asymptotic properties depend only on the reduced form estimator fi.
However, if the reduced form estimator 72is asymptotically linear the CMD will
also be.
The idea of approximating an estimator by a sample average and applying the
central limit theorem can be used to state rigorous asymptotic normality results for
extremum estimators. In Section 3.1 precise results are given for cases where the
objective function is sufficiently smooth, allowing a Taylor expansion like that of
eq. (3.1). Asymptotic normality for nonsmooth objective functions is discussed in
Section 7.

3.1. The husic results

For asymptotic normality, two basic results are useful, one for an extremum
estimator and one for a minimum distance estimator. The relationship between
these results will be discussed below. The first theorem is for an extremum estimator.

Theorem 3.1

Suppose that 8 satisfies eq. (l.l), @A O,, and (i) o,Einterior(O); (ii) o,(e) is twice
continuously differentiable in a neighborhood Jf of Be; (iii) &V,&,(0,,) % N(0, Z);
(iv) there is H(Q) that is continuous at 8, and supBEN IIV,,&(@ - H(d)11 30; (v)
H = H(H,) is nonsingular. Then J&(8 - 0,) % N(0, H l,?ZH- ).

Proqf

A sketch of a proof is given here, with full details described in Section 3.5. Condi-
tions (i)-(iii) imply that V,&(8) = 0 with probability approaching one. Expanding
around B0 and solving for ,,&(8 - 0,) = - I?(e)- $V,&(0,), where E?(B) = V,,&(0)
and f?is a mean value, located between Band 8,. By ep. Be and (iv), with probability
approaching - one, I/fi(q - H /I< /IE?(g) - H(g) II + )IH(g) - H II d supBEell fi(O) -
H(B) /I + /IH(0) - H/I 3 0. Then by continuity of matrix inversion, - f?(g)- l 3
-H-l. The conclusion then follows by the Slutzky theorem. Q.E.D.
2144 W.K. Newey and D. McFuddun

The asymptotic variance matrix in the conclusion of this result has a complicated
form, being equal to the product H -'EH- '.In the case of maximum likelihood
this form simplifies to J- , the inverse of the information matrix, because of the
information matrix equality. An analogous simplification occurs for some other
estimators, such as NLS where Var(ylx) is constant (i.e. under homoskedasticity).
As further discussed in Section 5, a simplified asymptotic variance matrix is a feature
of an efficient estimator in some class.
The true parameter being interior to the parameter set, condition (i), is essential
to asymptotic normality. If 0 imposes inequality restrictions on 0 that are asympto-
tically binding, then the estimator may not be asymptotically normal. For example,
consider estimation of the mean of a normal distribution that is constrained to be
nonnegative, i.e. f(z 1H) = (271~~)- exp [ - (z - ~)~/20~], 8 = (p, 02), and 0 = [0, co) x
(0, acj). It is straightforward to check that the MLE of ~1 is ii = Z,Z > 0, fi = 0
otherwise. If PO = 0, violating condition (ii), then Prob(P = 0) = i and Jnfi is N(O,o)
conditional on fi > 0. Therefore, for every n (and hence also asymptotically), the
distribution of &(fl- pO) is a mixture of a spike at zero with probability i and the
positive half normal distribution. Thus, the conclusion of Theorem 3.1 is not true.
This example illustrates that asymptotic normality can fail when the maximum
occurs on the boundary. The general theory for the boundary case is quite compli-
cated, and an account will not be given in this chapter.
Condition (ii), on twice differentiability of Q,(s), can be considerably weakened
without affecting the result. In particular, for GMM and CMD, asymptotic normality
can easily be shown when the moment functions only have first derivatives. With
considerably more work, it is possible to obtain asymptotic normality when Q,,(e)
is not even once differentiable, as discussed in Section 7.
Condition (iii) is analogous to asymptotic normality of the scores. It -11 often
follow from a central limit theorem for the sample averages that make up V,Q,(0,).
Condition (iv) is uniform convergence of the Hessian over a neighborhood of the
true parameter and continuity of the limiting function. This same type of condition
(on the objective function) is important for consistency of the estimator, and was
discussed in Section 2. Consequently, the results of Section 2 can be applied to give
primitive hypotheses for condition (iv). In particular, when the Hessian is a sample
average, or depends on sample averages, Lemma 2.4 can be applied. If the average
is continuous in the parameters, as will typically be implied by condition (iv), and
a dominance condition is satisfied, then the conclusion of Lemma 2.4 will give
uniform convergence. Using Lemma 2.4 in this way will be illustrated for MLE and
GMM.
Condition (v) can be interpreted as a strict local identification condition, because
H = V,,Q,(H,) (under regularity conditions that allow interchange of the limiting
and differentiation operations.) Thus, nonsingularity of H is the sufficient (second-
order) condition for there to be a unique local maximum at 0,. Furthermore, if
V,,QO(0) is regular, in the sense of Rothenberg (1971) that it has constant rank in
a neighborhood of 8,, then nonsingularity of H follows from Qa(0) having a unique
Ch. 36: Large Sample Estimation and ffypothesis Testing 2145

maximum at fIO.A local identification condition in these cases is that His nonsingular.
As stated above, asymptotic normality of GMM and CMD can be shown under
once differentiability, rather than twice differentiability. The following asymptotic
normality result for general minimum distance estimators is useful for this purpose.

Theorem 3.2

Suppose that H^satisfies eq. (1.1) for Q,(0) = - 4,(0)ii/g,,(e) where ii/ 3 W, W is
positive semi-definite, @Lo,, and (i) .Q,Einterior(O); (ii) g,(e) is continuously
differentiable in a neighborhood JV of 8,; (iii) $9,(8,) 5 N(O,n); (iv) there is G(8)
that is continuous at 0, and supBE y /(V&,,(e) - G(U) II A 0; (v) for G = G(e,), G WC
is nonsingular. Then $(8- 0,) bI[O,(GWG)-GWf2WG(GWG)-1.

The argument is similar to the proof of Theorem 3.1. By (i) and (ii), with probability
approaching one the first-order conditions G(@t@@,($ = 0 are satisfied, for G(0) =
V&,,(0). Expanding d,(8) around B0 and solving gives Jn(e^- e,,) = - [G(@ x
I?%@)] - 1G^(@I&$,(&,), w h ere t?is a mean value. By (iv) and similar reasoning as
for Theorem 3.1, G(8) A G and G(g) A G. Then by(v), - [G(@@G(@]-16(e),%~
- (GWG)- 'G'W, so the conclusion follows by (iii) and the Slutzky theorem.
Q.E.D.

When W = Q - , the asymptotic variance of a minimum distance estimator simplifies


to (GQ - G)) . As is discussed in Section 5, the value W = L2 _ corresponds to an
efficient weighting matrix, so as for the MLE the simpler asymptotic variance matrix
is associated with an efficient estimator.
Conditions (i)-(v) of Theorem 3.2 are analogous to the corresponding conditions
of Theorem 3.1, and most of the discussion given there also applies in the minimum
distance case. In particular, the differentiability condition for g,(e) can be weakened,
as discussed in Section 7.
For analyzing asymptotic normality, extremum estimators can be thought of as
a special case of minimum distance estimators, with V&,(e) = d,(0) and t?f = I = W.
The_ first-order conditions for extremum estimators imply that o,(tI)@g,(fI) =
V,Q,(0)V,Q,(@ has a minimum (of zero) at 0 = 8. Then the G and n of Theorem 3.2
are the H and Z of Theorem 3.1, respectively, and the asymptotic variance of the
extremum estimator is that of the minimum distance estimator, with (GWG)- x
GWf2WG(GWG)p1 =(HH)-HLH(HH)m = H-ZHpl. Thus, minimum dis-
tance estimation provides a general framework for analyzing asymptotic normality,
although, as previously discussed, it is better to work directly with the maximum,
rather than the first-order conditions, when analyzing consistency.28

18This generality suggests that Theorem 3.1 could be formulated as a special case of Theorem 3.2.
The results are not organLed in this way because it seems easier to apply Theorem 3.1 directly to
particular extremum estimators.
2146 W.K. Newey und D. McFadden

3.2. Asymptotic normality jbr MLE

The conditions for asymptotic normality of an extremum estimator can be specialized


to give a result for MLE.

Theorem 3.3
Suppose that zl,. . . , z, are i.i.d., the hypotheses of Theorem 2.5 are satisfied and (i)
d,Einterior(O); (ii) f(zl0) is twice continuously differentiable and f(zl0) > 0 in a
neighborhood ,X of 8,; (iii) {suP~~,~- 11V,f(zl B) //dz < co, jsupe._, IIV,,f(zl@ I)dz < m;;
(iv) J = ECVBln f(z I 4,) PO In f(z I 6Jil exists and is nonsingular; (v) E[suP~~_,~ 11 VBHx
lnf(z~8)~l]<co.Then~(8-8,)~N(O,J~).

Proof

The proof proceeds by verifying the hypotheses of Theorem 3.1. By Theorem 2.5,
o^A do. Important intermediate results are that the score s(z) = V, lnJ(zI U,) has
mean zero and the information matrix equality .I = - E[V,,Inf(zI0,)]. These
results follow by differentiating the identity jf(zlB)dz twice, and interchanging the
order of differentiation and integration, as allowed by (iii) and Lemma 3.6 in Section
3.5. Then conditions 3.1(i), (ii) hold by 3.3(i), (ii). Also, 3.l(iii) holds, with Z = J,
by E[s(z)] = 0, existence of J, and the LindberggLevy central limit theorem. To
show 3.l(iv) with H = -J, let 0 be a compact set contained in JY and contain-
ing fIOin its interior, so that the hypotheses of Lemma 2.4 are satisfied for a(z, 0) =
V,, In ,f(zl 0) by (ii) and (v). Condition 3.1 (v) then follows by nonsingularity of .I. Now
Jn(H^-0,) %N(O, H-JHP)=N(O,JP1)follows by theconclusionofTheorem 3.1
andH= -J. Q.E.D.

The hypotheses of Theorem 2.5 are only used to make sure that @-% O,, so that
they can be replaced by any other conditions that imply consistency. For example,
the conditions that 8, is identified, In f(z / 19)is concave in 6, and E[ IIn f(z 10)I] < x
for all 8 can be used as replacements for Theorem 2.5, because Theorem 2.7 then
gives 8At10. More generally, the MLE will be asymptotically normal if it is
consistent and the other conditions (i)-(v) of Theorem 3.3 are satisfied.
It is straightforward to derive a corresponding result for nonlinear least squares,
by using Lemma 2.4, the law of large numbers, and the Lindberg-Levy central limit
theorem to provide primitive conditions for Theorem 3.1. The statement of a
theorem is left as an exercise for the interested reader. The resulting asymptotic
variance for NLS will be H-ZH -I, for E[ylx] = h(x, U,), h&x, 0) = V,h(x, 0), H =
- E[h,(x, O,)h,(x, O,)] and Z = E[ {y - h(x, O,)}h,(x, Q,)h,(x, O,)]. The variance
matrix simplifies to a2H - when E[ {y - h(x, BO)}2 Ix] is a constant 02, a well known
efficiency condition for NLS.
Ch. 36: Larye Sump/e Estimation and Hypothesis Testing 2147

As previously stated, MLE and NLS will be asymptotically linear, with the MLE
influence function given by J- VOIn j(zI 0,). The NLS influence function will have
a similar form,

It/(z)= { EChk ~oP,(.?Qdl} - l h&x,Q,) [y - 4x, U,)], (3.4)

as can be shown by expanding the first-order conditions for NLS.


The previous examples provide useful illustrations of how the regularity condi-
tions can be verified.

Example 1.1 continued

In the Cauchy location and scale case, f(z18) = G- y[o- (z - p)] for Y(E)=
l/[rc( 1 + E)]. To show asymptotic normality of the MLE, the conditions of Theorem
3.3 can be verified. The hypotheses of Theorem 2.5 were shown in Section 2. For
the parameter set previously specified for this example, condition (i) requires that
p0 and (me are interior points of the allowed intervals. Condition (ii) holds by
inspection. It is straightforward to verify the dominance conditions for (iii) and (v).
For example, (v) follows by noting that V,,lnf(z10) is bounded, uniformly in
bounded p and 0, and 0 bounded away from zero. To show condition (iv), consider
cc=(~(~,c(J # 0. Note that a,(1 + z2)[tiV01nf(z~8,)] = cr,2z + ~~(1 + z) + c(,2z2=
~1~+ (2c(,)z + (3u,)z2 is a polynomial and hence is nonzero on an interval. Therefore,
E[{cxV,ln~f(z~0,,)}2] = c(J M> 0. Since this conclusion is true for any CI# 0, J must
be nonsingular.

Example 1.2 continued

Existence and nonsingularity of E[xx] are sufficient for asymptotic normality of


the probit MLE. Consistency of 8 was shown in Section 2.6, so that only conditions
(i)-(v) of Theorem 3.3 are needed (as noted following Theorem 3.3). Condition (i)
holds because 0 = Rq is an open set. Condition (ii) holds by inspection of f(z 10) =
y@(xO) + (1 - y)@( - x(9). For condition (iii), it is well known that 4(u) and 4(u)
are uniformly bounded, implying V&z /0) = (1 - 2y)4(xH)x and V,,f(z 10)= (1 - 2y) x
~,(x@xx are bounded by C( 1 + I/x 11 2, for some constant C. Also, integration over
dz is the sum over y and the expectation over x {i.e. ja(y, x)dz = E[a(O, x) + a( 1, x)] },
so that i( 1 + 11 x I/2)dz = 2 + 2E[ //x 111< GC. For (iv), it can be shown that J =
E[i.(x0&( - xd,)xx], for j(u) = ~(U)/@(U). Existence of J follows by E.(u)i.(- ~1)
bounded, and nonsingularity by %(u)A(- u) bounded away from zero on any open
interval.29 Condition (v) follows from V,, In ,f(z IQ,,)= [&.(xB,)y + &,( - xtI,)( 1 - y)]xx

291t can be shown that Z(u)i.( - a) is bounded using lH8pitals rule. Also, for any Ir>O, J 2 E[l(lxH,I <
fi)i(xfI,)n( -xtI,)xx] 2 CE[ l(lxO,I < C)x.x] in the positive semi-definite sense, the last term is positive
definite for large enough V by nonsingularity of E[xx].
2148 W.K. Newey and D. McFuddm

and boundedness of I_,(u). This example illustrates how conditions on existence


of moments may be useful regularity conditions for consistency and asymptotic
normality of an MLE, and how detailed work may be needed to check the
conditions.

3.3. Asymptotic normulity for GMM

The conditions on asymptotic normality of minimum distance estimators can be


specialized to give a result for GMM.

Theorem 3.4

Suppose that the hypotheses ofTheorem 2.6 are satisfied, r;i/ A W, and (i) 0,Einterior
of 0; (ii) g(z,O) is continuously differentiable in a neighborhood _t of 0,, with
probability approaching one; (iii) E[g(z, fl,)] = 0 and E[ I/g(z, 0,) I/1 is finite;
(iv) E[su~,,~ Ij V&z, 0) 111< co;(v) GWG is nonsingular for G = E[V,g(z, fl,)]. Then
for 0 = ECg(z, @,Jg(z, Hdl,$(@ - 0,) ~N[O,(GWG)GWBWG(GWG)~].

Proof

The proof will be sketched, although a complete proof like that of Theorem 3.1
given in Section 3.5 could be given. By (i), (ii), and (iii), the first-order condition
2G,,(@%~,(8) = 0 is satisfied with probability approaching one, for G,(e) = V&,,(0).
Expanding J,,(g) around fI,, multiplying through by $, and solving gives

(3.5)

where 0 is the mean value. By (iv), G,,(8) LG and G,(g) 3 G, so that by (v),
[G,(~))~~,(8)]-~,(~))ii/ ~(GWG)~GW. The conclusion then follows by the
Slutzky theorem. Q.E.D.

The complicated asymptotic variance formula simplifies to (GR G)- when W =


R- . As shown in Hansen (1982) and further discussed in Section 5, this value for
W is optimal in the sense that it minimizes the asymptotic variance matrix of the
GMM estimator.
The hypotheses of Theorem 2.6 are only used to make sure that I!?L BO, so that
they can be replaced by any other conditions that imply consistency. For example,
the conditions that 8, is identified, g(z, 0) is linear in 8, and E[ /Ig(z, II) 111< cc for all
8 can be used as replacements for Theorem 2.6, because Theorem 2.7 then gives
830,. More generally, a GMM estimator will be asymptotically normal if it is
consistent and the other conditions (i))(v) of Theorem 3.4 are satisfied.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2149

It is straightforward to derive a corresponding result for classical minimum


distance, under the conditions that 6 is consistent, &[72 - h(e,)] L N(0, fl) for
some R, h(8) is continuously differentiable in a neighborhood of Be, and GWG is
nonsingular for G = V&(0,). The statement of a theorem is left as an exercise for the
interested reader. The resulting asymptotic variance for CMD will have the same
form as given in the conclusion of Theorem 3.4.
By expanding the GMM first-order conditions, as in eq. (3.5), it is straightforward
to show that GMM is asymptotically linear with influence function

$(z) = - (G WC) - G Wg(z, 0,). (3.6)

In general CMD need not be asymptotically linear, but will be if the reduced form
estimator 72 is asymptotically linear. Expanding the first-order conditions for 6
around the truth gives $(e^- 0,) = - (GWG)-66&(72 - x0), where G = V&(8),
G = V,@(8), and @is the mean value. Then &(fi - rra) converging in distribution
and(~~G)-~ii/~(GWG)-GW. implies that &(8- 0,) = - (GWG)-G x
W&(72 - TC,J+ o,(l). Therefore, if 72is asymptotically linear with influence function
ll/(z), the CMD estimator will also be asymptotically linear with influence function

t&z) = - (GWG)- GW$(z). (3.7)

The Hansen-Singleton example provides a useful illustration of how the conditions


of Theorem 3.4 can be verified.

Example 1.3 continued

It was shown in Section 2 that sufficient conditions for consistency are that
E[x(BwyY - l)] = 0 have a unique solution at 0eE 0 = [Be, /3,]x[yl, y,], and that
E[llx(l]<co and E[IJxll J~l(lyI~+Iyl~~)]<co.Toobtainasymptoticnorrnality,
impose the additional conditions that B,&nterior(O), ye < 0, E[ 11x II1 < co,
E[ 11 x II 1w Izyzyo] < co, and E[x(wyYo, w*ln(y)yYo)] has rank 2. Then condition (i) of
Theorem 3.4 is satisfied by assumption. Condition (ii) is also satisfied, with Veg(z, 0) =
x(wyy, w-ln(y)yY). Condition (iii) is satisfied by the additional, second-moment re-
strictions, and by the GMM identification hypothesis.
To check condition (iv), note that IIn(y) I is bounded above by C( 1y 1p-E+ 1y I) for
any E > 0 and constant C big enough. Let N be a neighborhood of B,, such that
ye + E < y < yU- E for all &_N. Then SUP~,~~ liV,g(z,e)iI ~CllxlllwlCl +ln(y)] x
~~~~l~l~~~lI~III~l~~+l~l~~+l~l~~~~~~~l~l~~lI~III~l~l~l~+l~l~~~, so that
condition (iv) follows by the previously assumed moment condition. Finally, condi-
tion (v) holds by the previous rank condition and W = (E[xx])- nonsingular.
Thus, under the assumptions imposed above, the nonlinear two-stage least squares
estimator will be consistent and asymptotically normal, with asymptotic variance
as given in the conclusion of Theorem 3.4.
2150 W.K. Nrrvey and D. McFudden

3.4. One-step theorems

A result that is useful, particularly for efficient estimation, pertains to the properties
of estimators that are obtained from a single iteration of a numerical maximization
procedure, such as NewtonRaphson. If the starting point is an estimator that is
asymptotically normal, then the estimator from applying one iteration will have the
same asymptotic variance as the maximum of an objective function. This result is
particularly helpful when simple initial estimators can be constructed, but an
efficient estimator is more complicated, because it means that a single iteration will
yield an efficient estimator.
To describe a one-step extremum estimator, let ?? be an initial estimator and l?
be an estimator of H = plim[V,,&(B,)]. Consider the estimator

8= e- I7 - lV,&(O). (3.8)

If l? = V,,&(@ then eq. (3.8) describes one Newton-Raphson iteration. More


generally it might be described as a modified NewtonRaphson step with some
other value of fi used in place of the Hessian. The useful property of this estimator
is that it will have the same asymptotic variance as the maximizer of o,(Q), if
&(& 0,) is bounded in probability. Consequently, if the extremum estimator is
efficient in some class, so will be the one-step estimator, while the one-step estimator
is computationally more convenient than the extremum estimator.30
An important example is the MLE. In this case the Hessian limit is the negative
of the information matrix, so that fi = -J is an estimated Hessian. The corre-
sponding iteration is

e= @+ J-n- f V,lnf(zi)8). (3.9)


i=l

For the Hessian estimator of the information matrix 7 = - n x1= 1V,, In f(zi Ig),
eq. (3.9) is one NewtonRaphson iteration. One could also use one of the other
information matrix estimators discussed in Section 4. This is a general form of the
famous linearized maximum likelihood estimator. It will have the same asymptotic
variance as MLE, and hence inherit the asymptotic efficiency of the MLE.
For minimum distance estimators it is convenient to use a version that does not
involve second derivatives of the moments. For c = V,d,(@, the matrix - 2Gl?G
is an estimator of the Hessian of the objective function - ~,,(O)l?~,(0) at the true
parameter value, because the terms that involve the second derivatives of Q,(e) are
asymptotically negligible.31 Plugging a = - 2Gl?fi/G into eq. (3.8) gives a one-step

An alternative one-step estimator can be obtained by maximizing over the step size, rather than
setting it equal to one, as t? = fI + xd^for d^= - H P,,&(@ and z= argmax,Q,(O + 22). This estimator
will also have the same asymptotic variance as the solution to eq. (l.l), as shown by Newey (1987).
31These terms are all multiplied bv one or more elements of iJO,), which all converge to zero.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2151

minimum distance estimator,

e=e- (Cfr;i/G)-G~~gn(H). (3.10)

Alternatively, one could replace G by any consistent estimator of plim[V&,(8,)].


This estimator will have the same asymptotic variance as a minimum distance
estimator with weighting matrix I?. In particular, if I%is a consistent estimator of
fl- , an efficient choice of weighting matrix, then e has the same asymptotic
variance as the minimum distance estimator with an efficient weighting matrix.
An example is provided by GMM estimation. Let G = n- x1= 1V&z,, g) and let
fi be an estimator of R = E[y(z, fI,)g(z, 0,)], such as fi = n- r C;= 1 g(zi, 8)g(z, g).
Then the one-step estimator of eq. (3.10) is

--
H1=8-(GrQ-lC)-l~ffl-~ t g(zi,iJ)/n. (3.1 1)
i=l

This is a one-step GMM estimator with efficient choice of weighting matrix.


The results showing that the one-step estimators have the same asymptotic
variances as the maximizing values are quite similar for both extremum and mini-
mum distance estimators, so it is convenient to group them together in the following
result:

Theorem 3.5

Suppose that h(s- 0,) is bounded in probability. If I!? satisfies eq. (3.8), the
conditions of Theorem 3.1 are satisfied, and either I? = V,,,&(@ or Z? 3 H, then
$(Q- 0,) L N(0, H- ZH- ). If esatisfies eq. (3.10), the conditions of Theorem 3.2
are satisfied, and either G= V&J@ or G L G, then J&(8- (3,) % N[O, (GWG)- l x
GWl2WG(GWG)-1.

Proof

Using eq. (3.8) and expanding V,&(@ around 8, gives:

where 4 is the mean value. By 1-l -% H-l and the Slutzky theorem, the second
term -converges_ .in distribution to N(0, H- ZH- ). By condition (iv) of Theorem
3.1, HpV,,Q,(@+H- H = I, so that the first term is a product of a term that
converges in probability to zero with a term that is bounded in probability, so that
the first term converges in probability to zero, giving the conclusion. The result for
minimum distance follows by a similar argument applied to the expansion of eq. (3.10)
2152 W.K. Newey and D. McFadden

given by &(e- (3,) = [Z - (c~~)-c~V,g,(e)]~(e- 6,) - (Gk%-G@


J%(&). Q.E.D.

This result can be specialized to MLE or GMM by imposing the conditions of


Theorem 3.3 or 3.4, but for brevity this specialization is not given here.
The proof of this result could be modified to give the slightly stronger conclusion
that &(e - 6) 3 0,-a condition that is referred to as asymptotic equivalence of
the estimators Band 0. Rothenberg(l984) showed that for MLE, if a second iteration
is undertaken, i.e. f? in eq. (3.8) solves the same equation for some other initial
estimator, then n(e - 6) -% 0. Thus, a second iteration makes the estimator asympto-
tically closer to the extremum estimator. This result has been extended to multiple
iterations and other types of estimators in Robinson (1988a).

3.5. Technicalities

A complete proof of Theorem 3.1

Without loss of generality, assume that Af is a convex, open set contained in 0.


Let i be the indicator function for the event that &eJlr. Note that $11-*0, implies
i 3 1. By condition (ii) and the first-order conditio_ns fo_ra maximum, i*V&,,(@ = 0.
Also, b_y a mean-value expansion theorem, 0 = 1 *V,Q,(e,), % 1 *VQQ^,,(e,)Je- 0,),
where tIj is a random variable equal to the mean value when 1 = 1 and equal to fIO
otherwise. Then c&0,. Let H denote the matrix with jth row Vi&(gj);. By
condition (iv), H L H. Let 7 be the indicator for film and H nonsingular. Then
by condition (v), i -% 1, and 0 = i.V,&(&,) + i*H(e- III,), so that $(e- 0,) =
--
1H &V&,(6,) + (1 - i)J%(e - 0,). Then since ifi - 3 H- by condition (v),

&V,&(&J 5 N(O, z) b y condition (iii), and (1 - i)&(i!? - 0,) 3 0 by i -% 1, the


conclusion follows by the Slutzky theorem and the fact that if Y, -% Ye and 2, -
Y, 5 0 then Z, % Y,. Q.E.D.

The proof that the score has zero mean and of the information matrix equality. By the
proof of Theorem 3.3 it suffices to show that J f (zl B)dz is twice differentiable and
that the order of differentiation and integration can be interchanged. The following
well known lemma, e.g. as found in Bartle (1966, Corollary 5.9), is useful for showing
that the order of differentiation and integration can be interchanged.

Lemma 3.6

If a(z, 13) is continuously differentiable on an open set ./lr of 8,, a.s. dz, and
Jsu~,,~ 11V,a(z, 19)1)dz < co, then ia(z, f3)dz is continuously differentiable and
V,ja(z, B)dz = j[V,a(z, fI)]dz for f3~Jlr.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2153

Proof
Continuity of l [V&z, 0)] dz on X follows by continuity of V&z, ~9)in 0 and the
dominated convergence theorem. Also, for all eclose enough to 8, the line jo@ing 8
and 0 will lie in Jlr, so a mean-value expansion gives a(z, g) = a(z, 0) + V&z, @(fJ- 0) +
r(z, g),, where, for the mean value f?(z), I(Z, $) = {V&z, g(z)] - V&z, 0)}(8- 0). AS
&+ 0, )(r(z,@ 1)/j e- 8 1)< (1V&z, g(z)] - V&z, (3)II+0 by continuity of V&z, 0).
Also, i@, 0) i / ii 8 - 6 iI G 2 sUPeE~ /IV&z, 0) ii , so by the dominated convergence
theorem, jlr(z, @(dz/~~8-0(~-+0. Therefore,lja(z, 8)dz-Sa(z, @dz- {j[Ve4z, e)]dz} x
(~-e)(=IS~(Z,B)dzldSIr(z,8)Idz=0(1le-eli). Q.E.D.

The needed result that f f(zI0)dz is twice differentiable and that f (zlf3) can be
differentiated under the integral then follows by Lemma 3.6 and conditions (ii) and
(iii) of Theorem 3.3.

4. Consistent asymptotic variance estimation

A consistent estimator of the asymptotic variance is important for construction of


asymptotic confidence intervals, as discussed in the introduction. The basic idea for
constructing variance estimators is to substitute, or plug-in, estimators of the
various components in the formulae for the asymptotic variance. For both extremum
and minimum distance estimators, derivatives of sample functions can be used to
estimate the Hessian or Jacobian terms in the asymptotic variance, when the
derivatives exist. Even when derivatives do not exist, numerical approximations can
be used to estimate Hessian or Jacobian terms, as discussed in Section 7. The more
difficult term is the one that results from asymptotic normality of ,,&V,f&(e,) or
&n&(0,). The form of this term depends on the nature of the estimator and whether
there is dependence in the data. In this chapter, estimation of this more difficult
term will only be discussed under i.i.d. data, with Wooldridges chapter in this
volume giving results for dependent observations.
To better describe variance estimation it is helpful to consider separately extremum
and minimum distance estimators. The asymptotic variance of an extremum estima-
tor is H- 'ZH- ',where H is the probability limit of Vee&(BO) and Z is the
asymptotic variance of ^A$&7,&e,). Thus, an estimator of the asymptotic variance
can be formed as fi- 'ZH- ',where fi is an estimator of H and 2 is an estimator
of Z. An estimator of H can be constructed in a general way, by substituting 8 for
8, in the Hessian of the objective function, i.e. l? = V,,&(8). It is more difficult to
find a general estimator of .Z, because it depends on the nature of the extremum
estimator and the properties of the data.
In some cases, including MLE and NLS, an estimator of Z can be formed in a
straightforward way from sample second moments. For example, for MLE the
central limit theorem implies that ;I: = E[V, In f (z I/3,-J{V, In f (z IO,,)}], so that an
2154 W.K. Newey and D. McFadden

estimator can be formed by substituting moments for expectations and estimators


for true parameter, i.e. 2 = II- x1= 1Ve In f(zil 8) {V, In f(zii g)}!. More generally, an
analogous estimator can be constructed whenever the objective function is a sample
average, Q,@) = n Cr= 1q(z,,fl), e.g.where q(z,0) = - [y - h(x, O)]' for NLS. In this
case $V,Q,(tI,) = n - I2 C;= r Veq(zi, N,), so the central limit theorem will imply that
Z = E[V,q(z, BO){V,q(z, 8,)}].32 This second-moment matrix can be estimated
as

2 = n-l .f v,q(z,,8){v,q(z,,~)}', Q,(d)= Iv1 i q(z,fl). (4.1)


i=l i=l

In cases where the asymptotic variance simplifies it will be possible to simplify


the variance estimator in a corresponding way. For example the MLE asymptotic
variance is the inverse of the information matrix, which can be estimated by J^- ,
for an estimator J^ of the information matrix. Of course, this also means that there
are several ways to construct a variance estimator. For the MLE, jcan be estimated
from the Hessian, the sample second moment of the score, or even the general
formula &,??I?. Asymptotic distribution theory is silent about the choice
between these estimators, when the models are correctly specified (i.e. the assumptions
that lead to simplification are true), because any consistent estimator will lead to
asymptotically correct confidence intervals. Thus, the choice between them has to
be based on other considerations, such as computational ease or more refined
asymptotic accuracy and length of the confidence intervals. These considerations
are inherently specific to the estimator, although many results seem to suggest it is
better to avoid estimating higher-order moments in the formation of variance
estimators. If the model is not correctly specified, then the simplifications may not
be valid, so that one should use the general form fi- Tfi?- , as pointed out by
Huber (1967) and White (1982a). This case is particularly interesting when 8 is
consistent even though the model is misspecified, as for some MLE estimators with
exponential family likelihoods; see Gourieroux et al. (1984).
For minimum distance estimation it is straightforward to estimate the Jacobian
term G in the asymptotic variance (GWG))GW~RG(GWG)-, as G = V&,(u^).
Also, by assumption W will be a consistent estimator of W. A general method of
forming B is more difficult because the form of fl depends on the nature of the
estimator.
For GMM an estimator of R can be formed from sample second moments. By
the central limit theorem, the asymptotic variance of Jng,(fl,,) = n- I2 C;= 1g(zi, 0,)
is R = E[g(z, e,)g(z, O,)]. Thus, an estimator can be formed by substituting sample

32The derivative V,q(z,O,,) can often be shown to have mean zero, as needed for the central limit
theorem, by a direct argument. Alternatively, a zero mean will follow from the first-order condition for
maximization of Q,,(O) = E[q(z,O)]at 0,.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2155

moments for the expectation and an estimator of 8 for the true value, as

i=l

As discussed in Section 3, extremum estimators can be considered as special cases


of minimum distance estimators for analyzing asymptotic normality. More speci-
fically, an extremum estimator with o,(O) = n- x1= ,q(z,, 0) will be a GMM estima-
tor with g(z, 0) = V,q(z, 0). Consequently, the estimator in eq. (4.1) is actually a
special case of the one in eq. (4.2).
For minimum distance estimators, where Q,(d) = r? - h(O), the asymptotic variance
R of $g,(O,) is just the asymptotic variance of R. Thus, to form h one simply uses
a consistent estimator of the asymptotic variance of 72.If r? is itself an extremum or
GMM estimator, its asymptotic variance can be estimated in the way described
above.
When the asymptotic variance matrix simplifies there will be a corresponding simplification for an estimator. In particular, if $W = \Omega^{-1}$ then the asymptotic variance is $(G'\Omega^{-1}G)^{-1}$, so that a corresponding estimator is $(\hat G'\hat\Omega^{-1}\hat G)^{-1}$. Alternatively, if $\hat W$ is a consistent estimator of $\Omega^{-1}$, a variance estimator is $(\hat G'\hat W\hat G)^{-1}$. In addition, it may also be possible to estimate $\Omega$ in alternative ways. For example, for linear instrumental variables, where $g(z,\theta) = x(y - Y'\theta)$, the estimator in eq. (4.2) is $n^{-1}\sum_{i=1}^n x_ix_i'(y_i - Y_i'\hat\theta)^2$, which is consistent even if $\varepsilon_i = y_i - Y_i'\theta_0$ is heteroskedastic. An alternative estimator that would be consistent under homoskedasticity (i.e. if $E[\varepsilon^2|x]$ is constant) is $\hat\sigma^2\sum_{i=1}^n x_ix_i'/n$ for $\hat\sigma^2 = n^{-1}\sum_{i=1}^n(y_i - Y_i'\hat\theta)^2$.
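The two estimators of $\Omega$ just described can be written in a few lines; the following hedged numpy sketch (function name and interface are ours) computes both for the linear instrumental variables case:

```python
import numpy as np

def iv_moment_variances(x, Y, y, theta_hat):
    """Omega estimates for g(z, theta) = x (y - Y'theta): the robust
    estimator of eq. (4.2) and the simpler one valid under
    homoskedasticity."""
    n = x.shape[0]
    eps = y - Y @ theta_hat
    omega_robust = (x * (eps ** 2)[:, None]).T @ x / n
    omega_homosk = (eps @ eps / n) * (x.T @ x / n)
    return omega_robust, omega_homosk
```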
For minimum distance estimators, the choice between different consistent variance estimators can be based on considerations such as those discussed for extremum estimators, when the model is correctly specified. When the model is not correctly specified and there are more elements in $\hat g_n(\theta)$ than in $\theta$, the formula $(G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}$ is no longer the correct asymptotic variance matrix, the reason being that other terms enter the asymptotic variance because $\hat g_n(\hat\theta)$ need not converge to zero. It is possible to show that $\hat\theta$ is asymptotically normal when centered at its limit, by treating it as an extremum estimator, but the formula is very complicated [e.g. see Maasoumi and Phillips (1982)]. This formula is not used often in econometrics, because it is so complicated and because, in most models where $\hat g_n(\theta)$ has more elements than $\theta$, the estimator will not be consistent under misspecification.

4.1. The basic results

It is easy to state a consistency result for asymptotic variance estimation if $\hat\Sigma$ or $\hat\Omega$ is assumed to be consistent. A result for extremum estimators is:

Theorem 4.1

If the hypotheses of Theorem 3.1 are satisfied, $\hat H = \nabla_{\theta\theta}\hat Q_n(\hat\theta)$, and $\hat\Sigma \overset{p}{\to} \Sigma$, then $\hat H^{-1}\hat\Sigma\hat H^{-1} \overset{p}{\to} H^{-1}\Sigma H^{-1}$.

Proof

By asymptotic normality, $\hat\theta \overset{p}{\to} \theta_0$. By condition (iv) of Theorem 3.1, with probability approaching one, $\|\hat H - H\| \le \|\hat H - H(\hat\theta)\| + \|H(\hat\theta) - H\| \le \sup_{\theta\in\mathcal N}\|\nabla_{\theta\theta}\hat Q_n(\theta) - H(\theta)\| + \|H(\hat\theta) - H\| \overset{p}{\to} 0$, so that $\hat H \overset{p}{\to} H$. The conclusion then follows by condition (v) of Theorem 3.1 and continuity of matrix inversion and multiplication. Q.E.D.

A corresponding result for minimum distance estimators is:

Theorem 4.2

If the hypotheses of Theorem 3.2 are satisfied, $\hat G = \nabla_\theta\hat g_n(\hat\theta)$, and $\hat\Omega \overset{p}{\to} \Omega$, then $(\hat G'\hat W\hat G)^{-1}\hat G'\hat W\hat\Omega\hat W\hat G(\hat G'\hat W\hat G)^{-1} \overset{p}{\to} (G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}$.

Proof

It follows similarly to the proof of Theorem 4.1 that condition (iv) of Theorem 3.2 implies $\hat G \overset{p}{\to} G$, while $\hat W \overset{p}{\to} W$ and $\hat\Omega \overset{p}{\to} \Omega$ hold by hypothesis. The conclusion then follows from condition (v) of Theorem 3.2 and continuity of matrix inversion and multiplication. Q.E.D.

As discussed above, the asymptotic variance for MLE, NLS, and GMM can be
estimated using sample second moments, with true parameters replaced by estima-
tors. This type of estimator will be consistent by the law of large numbers, as long
as the use of estimators in place of true parameters does not affect the limit. The
following result is useful in this respect.

Lemma 4.3

If $z_i$ is i.i.d., $a(z,\theta)$ is continuous at $\theta_0$ with probability one, and there is a neighborhood $\mathcal N$ of $\theta_0$ such that $E[\sup_{\theta\in\mathcal N}\|a(z,\theta)\|] < \infty$, then for any $\hat\theta \overset{p}{\to} \theta_0$, $n^{-1}\sum_{i=1}^n a(z_i,\hat\theta) \overset{p}{\to} E[a(z,\theta_0)]$.

Proof

By consistency of $\hat\theta$ there is $\delta_n \to 0$ such that $\|\hat\theta - \theta_0\| \le \delta_n$ with probability approaching one. Let $\Delta_n(z) = \sup_{\|\theta-\theta_0\|\le\delta_n}\|a(z,\theta) - a(z,\theta_0)\|$. By continuity of $a(z,\theta)$ at $\theta_0$, $\Delta_n(z) \to 0$ with probability one, while by the dominance condition, for $n$ large enough, $\Delta_n(z) \le 2\sup_{\theta\in\mathcal N}\|a(z,\theta)\|$. Then by the dominated convergence theorem, $E[\Delta_n(z)] \to 0$, so by the Markov inequality, $P(|n^{-1}\sum_{i=1}^n\Delta_n(z_i)| > \varepsilon) \le E[\Delta_n(z)]/\varepsilon \to 0$ for all $\varepsilon > 0$, giving $n^{-1}\sum_{i=1}^n\Delta_n(z_i) \overset{p}{\to} 0$. By Khintchine's law of large numbers, $n^{-1}\sum_{i=1}^n a(z_i,\theta_0) \overset{p}{\to} E[a(z,\theta_0)]$. Also, with probability approaching one, $\|n^{-1}\sum_{i=1}^n a(z_i,\hat\theta) - n^{-1}\sum_{i=1}^n a(z_i,\theta_0)\| \le n^{-1}\sum_{i=1}^n\|a(z_i,\hat\theta) - a(z_i,\theta_0)\| \le n^{-1}\sum_{i=1}^n\Delta_n(z_i) \overset{p}{\to} 0$, so the conclusion follows by the triangle inequality. Q.E.D.

The conditions of this result are even weaker than those of Lemma 2.4, because the conclusion is simply uniform convergence at the true parameter. In particular, the function is only required to be continuous at the true parameter. This weak type of condition is not very important for the cases considered so far, e.g. for GMM where the moment functions have been assumed to be differentiable, but it is very useful for the results of Section 7, where some discontinuity of the moments is allowed. For example, for the censored LAD estimator the asymptotic variance depends on indicator functions for positivity of $x'\beta_0$, and Lemma 4.3 can be used to show consistency of asymptotic variance estimators that depend on such indicator functions.

4.2. Variance estimation for MLE

The asymptotic variance of the maximum likelihood estimator is $J^{-1}$, the inverse of the Fisher information matrix. It can be consistently estimated from $\hat J^{-1}$, where $\hat J$ is a consistent estimator of the information matrix. There are several ways to estimate the information matrix. To describe these ways, let $s(z,\theta) = \nabla_\theta\ln f(z|\theta)$ denote the score. Then by the information matrix equality, $J = E[s(z,\theta_0)s(z,\theta_0)'] = -E[\nabla_\theta s(z,\theta_0)] = J(\theta_0)$, where $J(\theta) = -\int[\nabla_\theta s(z,\theta)]f(z|\theta)\,dz$. That is, $J$ is the expectation of the outer product of the score and also the expectation of the negative of the derivative of the score, i.e. of the Hessian of the log-likelihood. This form suggests that $J$ might be estimated by the method of moments, replacing expectations by sample averages and unknown parameter values by estimates. This yields two estimators,

$$\hat J_1 = n^{-1}\sum_{i=1}^n s(z_i,\hat\theta)s(z_i,\hat\theta)', \qquad \hat J_2 = -n^{-1}\sum_{i=1}^n \nabla_{\theta\theta}\ln f(z_i|\hat\theta).$$

The second estimator is just the negative of the Hessian, and so will be consistent
under the conditions of Theorem 3.3. Lemma 4.3 can be used to formulate conditions
for consistency of the first estimator.
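Both estimators can be computed directly from a per-observation log-likelihood. The sketch below does so with numerical derivatives purely for illustration (analytic scores and Hessians would normally be preferred); all names are hypothetical:

```python
import numpy as np

def info_estimates(loglik_i, theta_hat, data, h=1e-4):
    """J1 (outer product of scores) and J2 (minus the average Hessian),
    via central differences of loglik_i(z, theta)."""
    k, n = theta_hat.size, len(data)
    eye = np.eye(k)
    def score(z):
        return np.array([(loglik_i(z, theta_hat + h * eye[j])
                          - loglik_i(z, theta_hat - h * eye[j])) / (2 * h)
                         for j in range(k)])
    S = np.array([score(z) for z in data])
    J1 = S.T @ S / n
    def avg_ll(t):
        return sum(loglik_i(z, t) for z in data) / n
    H = np.empty((k, k))
    for a in range(k):
        for b in range(k):
            H[a, b] = (avg_ll(theta_hat + h * (eye[a] + eye[b]))
                       - avg_ll(theta_hat + h * (eye[a] - eye[b]))
                       - avg_ll(theta_hat - h * (eye[a] - eye[b]))
                       + avg_ll(theta_hat - h * (eye[a] + eye[b]))) / (4 * h * h)
    J2 = -H
    return J1, J2
```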
A third estimator could be obtained by substituting $\hat\theta$ in the integrated function $J(\theta)$. This estimator is often not feasible in econometrics, because $f(z|\theta)$ is a conditional likelihood, e.g. conditioned on regressors, and so the integration in $J(\theta)$ involves the unknown marginal distribution. An alternative estimator that is feasible is the sample average of the conditional information matrix. To describe this estimator, suppose that $z = (y,x)$ and that $f(z|\theta) = f(y|x,\theta)$ is the conditional density of $y$ given $x$. Let $J(x,\theta) = E[s(z,\theta)s(z,\theta)'|x,\theta] = \int s(z,\theta)s(z,\theta)'f(y|x,\theta)\,dy$ be the conditional information matrix, so that $J = E[J(x,\theta_0)]$ by the law of iterated expectations. The third estimator of the information matrix is then

$$\hat J_3 = n^{-1}\sum_{i=1}^n J(x_i,\hat\theta). \tag{4.4}$$

Lemma 4.3 can be used to develop conditions for consistency of this estimator. In particular, it will often be the case that $a(z,\theta) = J(x,\theta)$ is continuous in $\theta$, because the integration in $J(x,\theta)$ tends to smooth out any discontinuities. Consistency will then follow from a dominance condition for $J(x,\theta)$. The following result gives conditions for consistency of all three of these estimators:

Theorem 4.4

Suppose that the hypotheses of Theorem 3.3 are satisfied. Then $\hat J_2^{-1} \overset{p}{\to} J^{-1}$. Also, if there is a neighborhood $\mathcal N$ of $\theta_0$ such that $E[\sup_{\theta\in\mathcal N}\|s(z,\theta)\|^2] < \infty$ then $\hat J_1^{-1} \overset{p}{\to} J^{-1}$. Also, if $J(x,\theta)$ is continuous at $\theta_0$ with probability one and $E[\sup_{\theta\in\mathcal N}\|J(x,\theta)\|] < \infty$ then $\hat J_3^{-1} \overset{p}{\to} J^{-1}$.

Proof

It follows as in the proof of Theorem 4.1 that $\hat J_2^{-1} \overset{p}{\to} J^{-1}$. Also, $s(z,\theta)$ is continuously differentiable in a neighborhood of $\theta_0$, so $a(z,\theta) = s(z,\theta)s(z,\theta)'$ is continuous there and consistency of $\hat J_1^{-1}$ follows from Lemma 4.3. Also, consistency of $\hat J_3^{-1}$ follows by Lemma 4.3 with $a(z,\theta) = J(x,\theta)$. Q.E.D.

The regularity conditions for consistency of each of these estimators are quite weak, and so typically they all will be consistent when the likelihood is twice differentiable. Since only consistency is required for asymptotically correct confidence intervals for $\theta$, the asymptotic theory for $\hat\theta$ provides no guide as to which of these one should use. However, there are some known properties of these estimators that are useful in deciding which to use. First, $\hat J_1$ is easier to compute than $\hat J_2$, which is easier to compute than $\hat J_3$. Because it is easiest to compute, $\hat J_1$ has seen much use in maximum likelihood estimation and inference, as in Berndt et al. (1974). In at least some cases they seem to rank the opposite way in terms of how closely the asymptotic theory approximates the true distribution of the resulting confidence intervals; e.g. see Davidson and MacKinnon (1984). Since the estimators are ranked differently according to different criteria, none of them seems always preferred to the others.
One property shared by all inverse information matrix estimators for the MLE variance is that they may not be consistent if the distribution is misspecified, as pointed out by Huber (1967) and White (1982a). If $f(z|\theta_0)$ is not the true p.d.f. then the information matrix equality will generally not hold. An alternative estimator that will be consistent is the general extremum estimator formula $\hat J_2^{-1}\hat J_1\hat J_2^{-1}$. Sufficient regularity conditions for its consistency are that $\hat\theta \overset{p}{\to} \theta_0$, $\ln f(z|\theta)$ satisfy parts (ii) and (iv) of Theorem 3.3, $E[\sup_{\theta\in\mathcal N}\|\nabla_\theta\ln f(z|\theta)\|^2]$ be finite for a neighborhood $\mathcal N$ of $\theta_0$, and $E[\nabla_{\theta\theta}\ln f(z|\theta_0)]$ be nonsingular.

Example 1.1 continued

It would be straightforward to give the formulae $\hat J_1$ and $\hat J_2$ using the derivatives derived earlier. In this example there are no conditioning variables $x$, so that $\hat J_3$ would simply be the information formula evaluated at $\hat\theta$. Alternatively, since it is known that the information matrix is diagonal, one could replace $\hat J_1$ and $\hat J_2$ with the same matrices, except that before the inversion the off-diagonal elements are set equal to zero. For example, the matrix corresponding to $\hat J_1$ would produce a variance estimator for $\hat\mu$ of $n\hat\sigma^2\big/\sum_{i=1}^n\ell_\varepsilon(\hat\varepsilon_i)^2$, for $\hat\varepsilon_i = \hat\sigma^{-1}(z_i - \hat\mu)$. Consistency of all of these estimators will follow by Theorem 4.4.

Sometimes extra conditions are needed for consistency of $\hat J_1^{-1}$ or $\hat J_3^{-1}$, as illustrated by the probit example.

Example 1.2 continued

For probit, the three information matrix estimators discussed above are, for $\lambda(\varepsilon) = \phi(\varepsilon)/\Phi(\varepsilon)$,

$$\hat J_3 = n^{-1}\sum_{i=1}^n x_ix_i'\lambda(x_i'\hat\theta)\lambda(-x_i'\hat\theta),$$
$$\hat J_2 = \hat J_3 - n^{-1}\sum_{i=1}^n x_ix_i'\big[\mathrm d\{\Phi(-v)^{-1}\lambda(v)\}/\mathrm dv\big]\big|_{v=x_i'\hat\theta}[y_i - \Phi(x_i'\hat\theta)],$$
$$\hat J_1 = n^{-1}\sum_{i=1}^n x_ix_i'\Phi(-x_i'\hat\theta)^{-2}\lambda(x_i'\hat\theta)^2\{y_i - \Phi(x_i'\hat\theta)\}^2.$$

Both $\hat J_2^{-1} \overset{p}{\to} J^{-1}$ and $\hat J_3^{-1} \overset{p}{\to} J^{-1}$ will follow from consistency of $\hat\theta$, $E[\|x\|^2]$ finite, and $J$ nonsingular. However, consistency of $\hat J_1^{-1}$ seems to require that $E[\|x\|^4]$ be finite, because, using $\phi(v)/\{\Phi(v)\Phi(-v)\} \le C(1+|v|)$, the score satisfies $\|\nabla_\theta\ln f(z|\theta)\|^2 \le \{\Phi(x'\theta)\Phi(-x'\theta)\}^{-2}\phi(x'\theta)^2\|x\|^2 \le [C(1+\|x\|\,\|\theta\|)]^2\|x\|^2 \le C(1+\|x\|^4)$ for $\theta$ in a bounded neighborhood of $\theta_0$.
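A sketch of the three probit estimators follows, computing $\hat J_2$ directly as minus the average Hessian so that it agrees with the display above; scipy's normal cdf/pdf are used and the function name is ours:

```python
import numpy as np
from scipy.stats import norm

def probit_info_estimates(x, y, theta_hat):
    """J1, J2, J3 for probit (a sketch of the formulas above)."""
    v = x @ theta_hat
    phi, Phi, Phim = norm.pdf(v), norm.cdf(v), norm.cdf(-v)
    lam, lam_m = phi / Phi, phi / Phim            # lambda(v), lambda(-v)
    n = x.shape[0]
    J3 = (x * (lam * lam_m)[:, None]).T @ x / n   # conditional information
    J1 = (x * ((lam / Phim) ** 2 * (y - Phi) ** 2)[:, None]).T @ x / n
    mu = phi / (Phi * Phim)                       # score weight mu(v)
    mu_prime = -v * mu + mu ** 2 * (Phi - Phim)   # d mu / dv
    # minus the average Hessian: x x' [mu*phi - mu'(y - Phi)]
    J2 = (x * (mu * phi - mu_prime * (y - Phi))[:, None]).T @ x / n
    return J1, J2, J3
```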

The variance of nonlinear least squares has some special features that can be used to simplify its calculation. By the conditional mean assumption that $E[y|x] = h(x,\theta_0)$, the Hessian term in the asymptotic variance is

$$H = 2\{E[h_\theta(x,\theta_0)h_\theta(x,\theta_0)'] - E[h_{\theta\theta}(x,\theta_0)\{y - h(x,\theta_0)\}]\} = 2E[h_\theta(x,\theta_0)h_\theta(x,\theta_0)'],$$

where $h_\theta$ denotes the gradient, $h_{\theta\theta}$ the Hessian of $h(x,\theta)$, and the second equality follows by the law of iterated expectations. Therefore, $H$ can be estimated by $\hat H = 2n^{-1}\sum_{i=1}^n h_\theta(x_i,\hat\theta)h_\theta(x_i,\hat\theta)'$, which is convenient because it only depends on first derivatives, rather than first and second derivatives. Under homoskedasticity the matrix $\Sigma$ also simplifies, to $4\sigma^2E[h_\theta(x,\theta_0)h_\theta(x,\theta_0)']$ for $\sigma^2 = E[\{y - h(x,\theta_0)\}^2]$, which can be estimated by $2\hat\sigma^2\hat H$ for $\hat\sigma^2 = n^{-1}\sum_{i=1}^n\{y_i - h(x_i,\hat\theta)\}^2$. Combining this estimator of $\Sigma$ with the one for $H$ gives an asymptotic variance estimator of the form $\hat V = \hat H^{-1}\hat\Sigma\hat H^{-1} = 2\hat\sigma^2\hat H^{-1}$. Consistency of this estimator can be shown by applying the conditions of Lemma 4.3 to both $a(z,\theta) = \{y - h(x,\theta)\}^2$ and $a(z,\theta) = h_\theta(x,\theta)h_\theta(x,\theta)'$, which is left as an exercise.
If there is heteroskedasticity then the variance of $y$ does not factor out of $\Sigma$, so that one must use the estimator $\hat\Sigma = 4n^{-1}\sum_{i=1}^n h_\theta(x_i,\hat\theta)h_\theta(x_i,\hat\theta)'\{y_i - h(x_i,\hat\theta)\}^2$. Also, if the conditional expectation is misspecified, then second derivatives of the regression function do not disappear from the Hessian (except in the linear case), so that one must use the estimator $\hat H = 2n^{-1}\sum_{i=1}^n[h_\theta(x_i,\hat\theta)h_\theta(x_i,\hat\theta)' - h_{\theta\theta}(x_i,\hat\theta)\{y_i - h(x_i,\hat\theta)\}]$. A variance estimator for NLS that is consistent in spite of heteroskedasticity or misspecification is $\hat H^{-1}\hat\Sigma\hat H^{-1}$, as discussed in White (1982b). One could formulate consistency conditions for this estimator by applying Lemma 4.3. The details are left as an exercise.
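A compact sketch of these NLS variance estimators, given the gradient matrix and residuals (names are illustrative, not the chapter's):

```python
import numpy as np

def nls_variances(h_grad, resid):
    """Homoskedastic form 2 sigma^2 H^{-1} and robust sandwich
    H^{-1} Sigma H^{-1}, each divided by n to approximate Var(theta_hat)."""
    n = h_grad.shape[0]
    H = 2 * h_grad.T @ h_grad / n
    H_inv = np.linalg.inv(H)
    sigma2 = resid @ resid / n
    V_homosk = 2 * sigma2 * H_inv / n
    Sigma = 4 * (h_grad * (resid ** 2)[:, None]).T @ h_grad / n
    V_robust = H_inv @ Sigma @ H_inv / n
    return V_homosk, V_robust
```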

4.3. Asymptotic variance estimation for GMM

The asymptotic variance of a GMM estimator is $(G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}$, which can be estimated by substituting estimators for each of $G$, $W$ and $\Omega$. As previously discussed, estimators of $G$ and $W$ are readily available, and are given by $\hat G = n^{-1}\sum_{i=1}^n\nabla_\theta g(z_i,\hat\theta)$ and $\hat W$, where $\hat W$ is the original weighting matrix. To estimate $\Omega = E[g(z,\theta_0)g(z,\theta_0)']$, one can replace the population moment by a sample average and the true parameter by an estimator, to form $\hat\Omega = n^{-1}\sum_{i=1}^n g(z_i,\hat\theta)g(z_i,\hat\theta)'$, as in eq. (4.2). The estimator of the asymptotic variance is then given by $\hat V = (\hat G'\hat W\hat G)^{-1}\hat G'\hat W\hat\Omega\hat W\hat G(\hat G'\hat W\hat G)^{-1}$.

Consistency of $\hat\Omega$ will follow from Lemma 4.3 with $a(z,\theta) = g(z,\theta)g(z,\theta)'$, so that consistency of $\hat V$ will hold under the conditions of Theorem 4.2, as applied to GMM. A result that summarizes these conditions is the following one:

Theorem 4.5

If the hypotheses of Theorem 3.4 are satisfied, $g(z,\theta)$ is continuous at $\theta_0$ with probability one, and for a neighborhood $\mathcal N$ of $\theta_0$, $E[\sup_{\theta\in\mathcal N}\|g(z,\theta)\|^2] < \infty$, then $\hat V = (\hat G'\hat W\hat G)^{-1}\hat G'\hat W\hat\Omega\hat W\hat G(\hat G'\hat W\hat G)^{-1} \overset{p}{\to} (G'WG)^{-1}G'W\Omega WG(G'WG)^{-1}$.

Proof

By Lemma 4.3 applied to $a(z,\theta) = g(z,\theta)g(z,\theta)'$, $\hat\Omega \overset{p}{\to} \Omega$. Also, the proof of Theorem 3.4 shows that the hypotheses of Theorem 3.2 are satisfied, so the conclusion follows by Theorem 4.2. Q.E.D.

If $\hat W$ is a consistent estimator of $\Omega^{-1}$, i.e. the probability limit $W$ of $\hat W$ is equal to $\Omega^{-1}$, then a simpler estimator of the asymptotic variance can be formed as $\hat V = (\hat G'\hat W\hat G)^{-1}$. Alternatively, one could form $\hat\Omega$ as in eq. (4.2) and use $\hat V = (\hat G'\hat\Omega^{-1}\hat G)^{-1}$. Little seems to be known about the relative merits of these two procedures in small samples, i.e. which (if either) of the initial $\hat W$ or the final $\hat\Omega^{-1}$ gives more accurate or shorter confidence intervals.
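The following minimal helper, with invented names, assembles either variance estimate from $(\hat G, \hat W, \hat\Omega)$:

```python
import numpy as np

def gmm_variance(G, W, Omega, n, efficient=False):
    """(G'WG)^{-1} G'W Omega W G (G'WG)^{-1} / n, or simply
    (G'WG)^{-1} / n when W consistently estimates Omega^{-1}."""
    A = np.linalg.inv(G.T @ W @ G)
    if efficient:
        return A / n
    return A @ G.T @ W @ Omega @ W @ G @ A / n
```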
The asymptotic variance estimator $\hat V$ is very general, in that it does not require that the second moment matrix $\Omega = E[g(z,\theta_0)g(z,\theta_0)']$ be restricted in any way. Consequently, consistency of $\hat V$ does not require substantive distributional restrictions other than $E[g(z,\theta_0)] = 0$.³³ For example, in the context of least squares estimation, where $g(z,\theta) = x(y - x'\theta)$, $\hat W = I$, and $\hat G = -\sum_{i=1}^n x_ix_i'/n$, this GMM variance estimator is $\hat V = (\sum_{i=1}^n x_ix_i'/n)^{-1}[n^{-1}\sum_{i=1}^n x_ix_i'(y_i - x_i'\hat\theta)^2](\sum_{i=1}^n x_ix_i'/n)^{-1}$, the Eicker (1967) and White (1980) heteroskedasticity consistent variance estimator. Furthermore, the GMM variance estimator includes many heteroskedasticity-robust IV variance estimators, as discussed in Hansen (1982).

When there is more information about the model than just the moment restrictions, it may improve the asymptotic confidence interval approximation to try to use this information in estimation of the asymptotic variance. An example is least squares, where the usual estimator under homoskedasticity is $n(\sum_{i=1}^n x_ix_i')^{-1}\sum_{i=1}^n(y_i - x_i'\hat\theta)^2/(n - K)$, where $K$ is the dimension of $x$. It is well known that under homoskedasticity this estimator gives more accurate confidence intervals than the heteroskedasticity consistent one, e.g. leading to exact confidence intervals from the t-distribution under normality.

Example 1.3 continued

The nonlinear two-stage least squares estimator for the Hansen–Singleton example is a GMM estimator with $g(z,\theta) = x\{\beta wy^\gamma - 1\}$ and $\hat W = (\sum_{i=1}^n x_ix_i'/n)^{-1}$, so that an asymptotic variance estimator can be formed by applying the general GMM formula to this case. Here an estimator of the variance of the moment functions can be formed as described above, with $\hat\Omega = n^{-1}\sum_{i=1}^n x_ix_i'\{\hat\beta w_iy_i^{\hat\gamma} - 1\}^2$. The Jacobian estimator is $\hat G = n^{-1}\sum_{i=1}^n x_i(w_iy_i^{\hat\gamma},\ \hat\beta w_i\ln(y_i)y_i^{\hat\gamma})$. The corresponding asymptotic variance estimator then comes from the general GMM formula $(\hat G'\hat W\hat G)^{-1}\hat G'\hat W\hat\Omega\hat W\hat G(\hat G'\hat W\hat G)^{-1}$. Consistency of this estimator will follow under the conditions of Theorem 4.5. It was previously shown that all of these conditions are satisfied except the additional moment assumption stated in Theorem 4.5. For this assumption, it suffices that the upper and lower limits on $\gamma$, say $\gamma_u$ and $\gamma_l$, satisfy $E[\|x\|^2w^2(|y|^{2\gamma_l} + |y|^{2\gamma_u})] < \infty$. This condition requires that slightly more moments exist than the previous conditions that were imposed.
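A sketch of the ingredients $\hat G$ and $\hat\Omega$ for this example, transcribing the displayed formulas (data arrays x, w, y and estimates beta, gamma; the function name is ours):

```python
import numpy as np

def hansen_singleton_pieces(x, w, y, beta, gamma):
    """G_hat and Omega_hat for g(z, theta) = x (beta w y^gamma - 1)."""
    n = x.shape[0]
    rho = beta * w * y ** gamma - 1.0
    Omega = (x * (rho ** 2)[:, None]).T @ x / n
    # d rho / d(beta, gamma) = (w y^gamma, beta w log(y) y^gamma)
    drho = np.column_stack([w * y ** gamma,
                            beta * w * np.log(y) * y ** gamma])
    G = x.T @ drho / n
    return G, Omega
```

With the weighting matrix $\hat W$ = np.linalg.inv(x.T @ x / n), these pieces plug directly into the general GMM variance formula above.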

³³If this restriction is not satisfied, then a GMM estimator may still be asymptotically normal, but the asymptotic variance is much more complicated; see Maasoumi and Phillips (1982) for the instrumental variables case.

5. Asymptotic efficiency

Asymptotically normal estimators can be compared on the basis of their asymptotic variances, with one being asymptotically efficient relative to another if it has at least as small an asymptotic variance for all possible true parameter values. Asymptotic efficiency is desirable because an efficient estimator will be closer to the true parameter value in large samples; if $\hat\theta$ is asymptotically efficient relative to $\tilde\theta$ then for all constants $K$, $\text{Prob}(|\hat\theta - \theta_0| \le K/\sqrt n) \ge \text{Prob}(|\tilde\theta - \theta_0| \le K/\sqrt n)$ for all $n$ large enough. Efficiency is important in practice, because it results in smaller asymptotic confidence intervals, as discussed in the introduction.
This section discusses general results on asymptotic efficiency within a class of
estimators, and application of these results to important estimation environments,
both old and new. In focusing on efficiency within a class of estimators, we follow
much of the econometrics and statistics literature.³⁴ Also, this efficiency framework
allows one to derive results on efficiency within classes of limited information
estimators (such as single equation estimators in a simultaneous system), which are
of interest because they are relatively insensitive to misspecification and easier to
compute. An alternative approach to efficiency analysis, that also allows for limited
information estimators, is through semiparametric efficiency bounds, e.g. see Newey
(1990). The approach taken here, focusing on classes of estimators, is simpler and
more directly linked to the rest of this chapter.
Two of the most important and famous efficiency results are efficiency of maximum
likelihood and the form of an optimal weighting matrix for minimum distance
estimation. Other useful results are efficiency of heteroskedasticity-corrected genera-
lized least squares in the class of weighted least squares estimators and two-stage
least squares as an efficient instrumental variables estimator. All of these results
share a common structure that is useful in understanding them and deriving new
ones. To motivate this structure, and focus attention on the most important results,
we first consider separately maximum likelihood and minimum distance estimation.

5.1. Efficiency of maximum likelihood estimation

Efficiency of maximum likelihood is a central proposition of statistics that dates from


the work of R.A. Fisher (1921). Although maximum likelihood is not efficient in the
class of all asymptotically normal estimators, because of superefficient estimators,
it is efficient in quite general classes of estimators.³⁵ One such general class is the

³⁴In particular, one of the precise results on efficiency of MLE is the Hajek–LeCam representation theory, which shows efficiency in a class of regular estimators. See, e.g. Newey (1990) for a discussion of regularity.
³⁵The word "superefficient" refers to a certain type of estimator, attributed to Hodges, that is used to show that there does not exist an efficient estimator in the class of all asymptotically normal estimators. Suppose $\hat\theta$ is asymptotically normal, and for some number $\alpha$ and $0 < p < \frac12$, suppose that $\hat\theta$ has positive asymptotic variance when the true parameter is $\alpha$. Let $\tilde\theta = \hat\theta$ if $n^p|\hat\theta - \alpha| > 1$ and $\tilde\theta = \alpha$ if $n^p|\hat\theta - \alpha| \le 1$. Then $\tilde\theta$ is superefficient relative to $\hat\theta$, having the same asymptotic variance when the true parameter is not $\alpha$ but having a smaller asymptotic variance, of zero, when the true parameter is $\alpha$.

class of GMM estimators, which includes method of moments, least squares, instrumental variables, and other estimators. Because this class includes so many estimators of interest, efficiency in this class is a useful way of thinking about MLE efficiency.

Asymptotic efficiency of MLE among GMM estimators is shown by comparing asymptotic variances. The asymptotic variance of the MLE is $(E[ss'])^{-1}$, where $s = \nabla_\theta\ln f(z|\theta_0)$ is the score, with the $z$ and $\theta$ arguments suppressed for notational convenience. The asymptotic variance of a GMM estimator can be written as $(E[m_\theta])^{-1}E[mm'](E[m_\theta]')^{-1}$, where $m_\theta = (E[\nabla_\theta g(z,\theta_0)])'W\nabla_\theta g(z,\theta_0)$ and $m = (E[\nabla_\theta g(z,\theta_0)])'Wg(z,\theta_0)$. At this point the relationship between the GMM and MLE variances is not clear. It turns out that a relationship can be derived from an interpretation of $E[m_\theta]$ as the covariance of $m$ with the score. To obtain this interpretation, consider the GMM moment condition $\int g(z,\theta)f(z|\theta)\,dz = 0$. This condition is typically an identity over the parameter space that is necessary for consistency of a GMM estimator. If it did not hold at a parameter value, then the GMM estimator may not converge to the parameter at that point, and hence would not be consistent.³⁶ Differentiating this identity, assuming differentiation under the integral is allowed, gives

$$0 = \nabla_\theta\left[\int g(z,\theta)f(z|\theta)\,dz\right]\Big|_{\theta=\theta_0} = \int[\nabla_\theta g(z,\theta_0)]f(z|\theta_0)\,dz + \int g(z,\theta_0)[\nabla_\theta f(z|\theta_0)]'\,dz$$
$$= E[\nabla_\theta g(z,\theta_0)] + E[g(z,\theta_0)\{\nabla_\theta\ln f(z|\theta_0)\}'], \tag{5.1}$$

where the last equality follows by multiplying and dividing $\nabla_\theta f(z|\theta_0)$ by $f(z|\theta_0)$. This is the generalized information matrix equality, including the information matrix equality as a special case, where $g(z,\theta) = \nabla_\theta\ln f(z|\theta)$.³⁷ It implies that $E[m_\theta] + E[ms'] = 0$, i.e. that $E[m_\theta] = -E[ms']$. Then the difference of the GMM and MLE asymptotic variances can be written as

$$(E[m_\theta])^{-1}E[mm'](E[m_\theta]')^{-1} - (E[ss'])^{-1}$$
$$= (E[ms'])^{-1}E[mm'](E[sm'])^{-1} - (E[ss'])^{-1}$$
$$= (E[ms'])^{-1}\{E[mm'] - E[ms'](E[ss'])^{-1}E[sm']\}(E[sm'])^{-1}$$
$$= (E[ms'])^{-1}E[UU'](E[sm'])^{-1}, \qquad U = m - E[ms'](E[ss'])^{-1}s. \tag{5.2}$$

³⁶Recall that consistency means that the estimator converges in probability to the true parameter for all possible true parameter values.
³⁷A similar equality, used to derive the Cramer–Rao bound for the variance of unbiased estimators, is obtained by differentiating the identity $\theta = \int\hat\theta\,dF_\theta$, where $F_\theta$ is the distribution of the data when $\theta$ is the true parameter value.

Since $E[UU']$ is positive semi-definite, the difference of the respective variance matrices is also positive semi-definite, and hence the MLE is asymptotically efficient in the class of GMM estimators.

To give a precise result it is necessary to specify regularity conditions for the generalized information matrix equality of eq. (5.1). Conditions can be formulated by imposing smoothness on the square root of the likelihood, $f(z|\theta)^{1/2}$, similar to the regularity conditions for MLE efficiency of LeCam (1956) and Hajek (1970). A precise result on efficiency of MLE in the class of GMM estimators can then be stated as:

Theorem 5.1

If the conditions of Theorem 3.4 are satisfied, $f(z|\theta)^{1/2}$ is continuously differentiable at $\theta_0$, $J$ is nonsingular, and for all $\theta$ in a neighborhood $\mathcal N$ of $\theta_0$, $\int\sup_{\tilde\theta\in\mathcal N}\|g(z,\tilde\theta)\|^2 f(z|\theta)\,dz$ and $\int\sup_{\tilde\theta\in\mathcal N}\|\nabla_\theta f(z|\tilde\theta)^{1/2}\|^2\,dz$ are bounded and $\int g(z,\theta)f(z|\theta)\,dz = 0$, then $(G'WG)^{-1}G'W\Omega WG(G'WG)^{-1} - J^{-1}$ is positive semi-definite.

The proof is postponed until Section 5.6. This result states that $J^{-1}$ is a lower bound on the asymptotic variance of a GMM estimator. Asymptotic efficiency of MLE among GMM estimators then follows from Theorem 3.4, because the MLE will have $J^{-1}$ for its asymptotic variance.³⁸

5.2. Optimal minimum distance estimation

The asymptotic variance of a minimum distance estimator depends on the limit $W$ of the weighting matrix $\hat W$. When $W = \Omega^{-1}$, the asymptotic variance of a minimum distance estimator is $(G'\Omega^{-1}G)^{-1}$. It turns out that this estimator is efficient in the class of minimum distance estimators. To show this result, let $Z$ be any random vector such that $\Omega = E[ZZ']$, and let $m = G'WZ$ and $\tilde m = G'\Omega^{-1}Z$. Then by $G'WG = E[m\tilde m']$ and $G'\Omega^{-1}G = E[\tilde m\tilde m']$,

$$(G'WG)^{-1}G'W\Omega WG(G'WG)^{-1} - (G'\Omega^{-1}G)^{-1} = (G'WG)^{-1}E[UU'](G'WG)^{-1},$$
$$U = m - E[m\tilde m'](E[\tilde m\tilde m'])^{-1}\tilde m. \tag{5.3}$$

Since $E[UU']$ is positive semi-definite, the difference of the asymptotic variances is positive semi-definite. This proves the following result:

³⁸It is possible to show this result under the weaker condition that $f(z|\theta)^{1/2}$ is mean-square differentiable, which allows for $f(z|\theta)^{1/2}$ to not be continuously differentiable. This condition is further discussed in Section 5.6.

Theorem 5.2
If $\Omega$ is nonsingular, a minimum distance estimator with $W = \operatorname{plim}(\hat W) = \Omega^{-1}$ is asymptotically efficient in the class of minimum distance estimators.

This type of result is familiar from efficiency theory for CMD and GMM estimation. For example, in minimum chi-square estimation, where $\hat g_n(\theta) = \hat\pi - h(\theta)$, the efficient weighting matrix $W$ is the inverse of the asymptotic variance of $\hat\pi$, a result given by Chiang (1956) and Ferguson (1958). For GMM, where $\hat g_n(\theta) = \sum_{i=1}^n g(z_i,\theta)/n$, the efficient weighting matrix is the inverse of the variance of $g(z_i,\theta_0)$, a result derived by Hansen (1982). Each of these results is a special case of Theorem 5.2.
Construction of an efficient minimum distance estimator is quite simple, because the weighting matrix affects the asymptotic distribution only through its probability limit. All that is required is a consistent estimator $\hat\Omega$, for then $\hat W = \hat\Omega^{-1}$ will converge in probability to $\Omega^{-1}$. Since an estimator of $\Omega$ is needed for asymptotic variance estimation, very little additional effort is required to form an efficient weighting matrix. An efficient minimum distance estimator can then be constructed by minimizing $\hat g_n(\theta)'\hat\Omega^{-1}\hat g_n(\theta)$. Alternatively, the one-step estimator $\bar\theta = \hat\theta - (\hat G'\hat\Omega^{-1}\hat G)^{-1}\hat G'\hat\Omega^{-1}\hat g_n(\hat\theta)$ will also be efficient, because it is asymptotically equivalent to the fully iterated minimum distance estimator.
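A minimal sketch of the one-step efficient update just displayed, with estimated inputs $\hat g_n(\hat\theta)$, $\hat G$, and $\hat\Omega$ (function name invented for illustration):

```python
import numpy as np

def efficient_md_one_step(theta_tilde, g_hat, G_hat, Omega_hat):
    """theta_tilde - (G'Omega^{-1}G)^{-1} G'Omega^{-1} g(theta_tilde)."""
    O_inv = np.linalg.inv(Omega_hat)
    A = np.linalg.inv(G_hat.T @ O_inv @ G_hat)
    return theta_tilde - A @ G_hat.T @ O_inv @ g_hat
```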
The condition that $W = \Omega^{-1}$ is sufficient but not necessary for efficiency. A necessary and sufficient condition can be obtained by further examination of eq. (5.3). A minimum distance estimator will be efficient if and only if the random vector $U$ is zero. This vector is the residual from a population regression of $m$ on $\tilde m$, and so will be zero if and only if $m$ is a linear combination of $\tilde m$, i.e. there is a constant matrix $C$ such that $G'WZ = CG'\Omega^{-1}Z$. Since $Z$ has a nonsingular variance matrix, this condition is the same as

$$G'W = CG'\Omega^{-1}. \tag{5.4}$$

This is the necessary and sufficient condition for efficiency of a minimum distance
estimator.

5.3. A general efficiency framework

The maximum likelihood and minimum distance efficiency results have a similar structure, as can be seen by comparing eqs. (5.2) and (5.3). This structure can be exploited to construct an efficiency framework that includes these and other important results, and is useful for finding efficient estimators. To describe this framework one needs notation for the asymptotic variance associated with an estimator. To this end, let $\tau$ denote an index for the asymptotic variance of an estimator in some class, where $\tau$ is an element of some abstract set. A completely general form for $\tau$ would be the sequence of functions of the data that is the sequence of estimators. However, since $\tau$ is only needed to index the asymptotic variance, a simpler specification will often suffice. For example, in the class of minimum distance estimators with given $\hat g_n(\theta)$, the asymptotic variance depends only on $W = \operatorname{plim}(\hat W)$, so that it suffices to specify that $\tau = W$.

The framework considered here is one where there is a random vector $Z$ such that for each $\tau$ (corresponding to an estimator), there is $D(\tau)$ and $m(Z,\tau)$ with the asymptotic variance $V(\tau)$ satisfying

$$V(\tau) = D(\tau)^{-1}E[m(Z,\tau)m(Z,\tau)']D(\tau)^{-1\prime}. \tag{5.5}$$

Note that the random vector $Z$ is held fixed as $\tau$ varies. The function $m(Z,\tau)$ can often be interpreted as a score or moment function, and the matrix $D(\tau)$ as a Jacobian matrix for the parameters. For example, the asymptotic variances of the class of GMM estimators satisfy this formula, with $\tau$ being $[g(z,\theta), W]$, $Z = z$ being a single observation, $m(Z,\tau) = G'Wg(z,\theta_0)$, and $D(\tau) = G'WG$. Another example is minimum distance estimators, where $Z$ is any random vector with mean zero and variance $\Omega$, $\tau = W$, $m(Z,\tau) = G'WZ$, and $D(\tau) = G'WG$.

In this framework, there is an interesting and useful characterization of an efficient estimator.

Theorem 5.3

If $\bar\tau$ satisfies $D(\tau) = E[m(Z,\tau)m(Z,\bar\tau)']$ for all $\tau$ then any estimator with variance $V(\bar\tau)$ is efficient. Furthermore, suppose that for any $\tau_1$, $\tau_2$, and constant square matrices $C_1$, $C_2$ such that $C_1D(\tau_1) + C_2D(\tau_2)$ is nonsingular, there is $\tau_3$ with (i) (linearity of the moment function set) $m(Z,\tau_3) = C_1m(Z,\tau_1) + C_2m(Z,\tau_2)$; (ii) (linearity of $D$) $D(\tau_3) = C_1D(\tau_1) + C_2D(\tau_2)$. If there is an efficient estimator with $E[m(Z,\tau)m(Z,\tau)']$ nonsingular then there is an efficient estimator with index $\bar\tau$ such that $D(\tau) = E[m(Z,\tau)m(Z,\bar\tau)']$ for all $\tau$.

Proof

If $\tau$ and $\bar\tau$ satisfy $D(\tau) = E[m(Z,\tau)m(Z,\bar\tau)']$ then the difference of the respective asymptotic variances satisfies, for $m = m(Z,\tau)$ and $\bar m = m(Z,\bar\tau)$,

$$V(\tau) - V(\bar\tau) = (E[m\bar m'])^{-1}E[mm'](E[\bar m m'])^{-1} - (E[\bar m\bar m'])^{-1} = (E[m\bar m'])^{-1}E[UU'](E[\bar m m'])^{-1},$$
$$U = m - E[m\bar m'](E[\bar m\bar m'])^{-1}\bar m, \tag{5.6}$$

so the first conclusion follows by $E[UU']$ positive semi-definite. To show the second conclusion, let $\psi(Z,\tau) = D(\tau)^{-1}m(Z,\tau)$, so that $V(\tau) = E[\psi(Z,\tau)\psi(Z,\tau)']$. Consider any constant matrix $B$, and for $\tau_1$ and $\tau_2$ let $C_1 = BD(\tau_1)^{-1}$ and $C_2 = (I - B)D(\tau_2)^{-1}$; note that $C_1D(\tau_1) + C_2D(\tau_2) = I$ is nonsingular, so by (i) and (ii) there is $\tau_3$ such that $B\psi(Z,\tau_1) + (I - B)\psi(Z,\tau_2) = C_1m(Z,\tau_1) + C_2m(Z,\tau_2) = m(Z,\tau_3) = [C_1D(\tau_1) + C_2D(\tau_2)]^{-1}m(Z,\tau_3) = D(\tau_3)^{-1}m(Z,\tau_3) = \psi(Z,\tau_3)$. Thus, the set $\{\psi(Z,\tau)\}$ is affine, in the sense that $B\psi(Z,\tau_1) + (I - B)\psi(Z,\tau_2)$ is in this set for any $\tau_1$, $\tau_2$ and constant matrix $B$. Let $\psi(Z,\bar\tau)$ correspond to an efficient estimator. Suppose that there is $\tau$ with $E[(\psi - \bar\psi)\bar\psi'] \ne 0$ for $\psi = \psi(Z,\tau)$ and $\bar\psi = \psi(Z,\bar\tau)$. Then $\psi - \bar\psi \ne 0$, so there exists a constant matrix $F$ such that $e = F(\psi - \bar\psi)$ has nonsingular variance and $E[e\bar\psi'] \ne 0$. Let $B = -E[\bar\psi e'](E[ee'])^{-1}F$ and $u = \bar\psi + B(\psi - \bar\psi) = (I - B)\bar\psi + B\psi$. By the affine property of $\{\psi(Z,\tau)\}$ there is $\tilde\tau$ such that $V(\tilde\tau) = E[uu'] = E[\bar\psi\bar\psi'] - E[\bar\psi e'](E[ee'])^{-1}E[e\bar\psi'] = V(\bar\tau) - E[\bar\psi e'](E[ee'])^{-1}E[e\bar\psi']$, which is smaller than $V(\bar\tau)$ in the positive semi-definite sense. This conclusion contradicts the assumed efficiency of $\bar\tau$, so that the assumption that $E[(\psi - \bar\psi)\bar\psi'] \ne 0$ contradicts efficiency. Thus, it follows that $E[(\psi - \bar\psi)\bar\psi'] = 0$ for all $\tau$, i.e. that for all $\tau$,

$$D(\tau)^{-1}E[m(Z,\tau)m(Z,\bar\tau)']D(\bar\tau)^{-1\prime} = D(\bar\tau)^{-1}E[m(Z,\bar\tau)m(Z,\bar\tau)']D(\bar\tau)^{-1\prime}. \tag{5.7}$$

By the assumed nonsingularity of $E[m(Z,\tau)m(Z,\tau)']$, this equation can be solved for $D(\tau)$ to give $D(\tau) = E[m(Z,\tau)m(Z,\bar\tau)'](E[m(Z,\bar\tau)m(Z,\bar\tau)'])^{-1}D(\bar\tau)$. Since $C = D(\bar\tau)(E[m(Z,\bar\tau)m(Z,\bar\tau)'])^{-1}$ is a nonsingular matrix it follows by (i) and (ii) that there exists $\hat\tau$ with $m(Z,\hat\tau) = Cm(Z,\bar\tau)$. Furthermore, by linearity of $D$ it follows that $V(\hat\tau) = V(\bar\tau)$, so that the estimator corresponding to $\hat\tau$ is efficient. The second conclusion then follows from $D(\tau) = E[m(Z,\tau)m(Z,\hat\tau)']$ for all $\tau$. Q.E.D.

This result states that

$$D(\tau) = E[m(Z,\tau)m(Z,\bar\tau)'], \quad\text{for all }\tau, \tag{5.8}$$

is sufficient for $\bar\tau$ to correspond to an efficient estimator, and is necessary for some efficient estimator if the set of moment functions is linear and the Jacobian is a linear function of the scores. This equality is a generalization of the information matrix equality. Hansen (1985a) formulated and used this condition to derive efficient instrumental variables estimators, and gave more primitive hypotheses for conditions (i) and (ii) of Theorem 5.3. Also, the framework here is a modified version of that of Bates and White (1992) for general classes of estimators. The sufficiency part of Theorem 5.3 appears in both of these papers. The necessity part of Theorem 5.3 appears to be new, but is closely related to R.A. Fisher's (1925) necessary condition for an efficient statistic, as further discussed below.
One interpretation of eq. (5.8) is that the asymptotic covariance between an efficient estimator and any other estimator is the variance of the efficient estimator. This characterization of an efficient estimator was discussed in R.A. Fisher (1925), and is useful in constructing Hausman (1978) specification tests. It is derived by assuming that the asymptotic covariance between two estimators in the class takes the form $D(\tau_1)^{-1}E[m(Z,\tau_1)m(Z,\tau_2)']D(\tau_2)^{-1\prime}$, as can usually be verified by stacking the two estimators and deriving their joint asymptotic variance (and hence asymptotic covariance). For example, consider two different GMM estimators $\hat\theta_1$ and $\hat\theta_2$, with two different moment functions $g_1(z,\theta)$ and $g_2(z,\theta)$, each with as many components as $\theta$ for simplicity. The vector $\hat\gamma = (\hat\theta_1',\hat\theta_2')'$ can be considered a joint GMM estimator with moment vector $g(z,\gamma) = [g_1(z,\theta_1)', g_2(z,\theta_2)']'$. The Jacobian matrix of the stacked moment vector will be block diagonal, and hence so will its inverse, so that the asymptotic covariance between $\hat\theta_1$ and $\hat\theta_2$ will be $\{E[\nabla_\theta g_1(z,\theta_0)]\}^{-1}E[g_1(z,\theta_0)g_2(z,\theta_0)']\{E[\nabla_\theta g_2(z,\theta_0)]\}^{-1\prime}$. This is exactly of the form $D(\tau_1)^{-1}E[m(Z,\tau_1)m(Z,\tau_2)']D(\tau_2)^{-1\prime}$, where $Z = z$, $m(Z,\tau_1) = g_1(z,\theta_0)$, etc. When the covariance takes this form, the covariance between any estimator and one satisfying eq. (5.8) will be $D(\tau)^{-1}E[m(Z,\tau)m(Z,\bar\tau)']D(\bar\tau)^{-1\prime} = D(\tau)^{-1}D(\tau)D(\bar\tau)^{-1\prime} = D(\bar\tau)^{-1}E[m(Z,\bar\tau)m(Z,\bar\tau)']D(\bar\tau)^{-1\prime} = V(\bar\tau)$, the variance of the efficient estimator. R.A. Fisher (1925) showed that this covariance condition is sufficient for efficiency, and that it is also necessary if the class of statistics is linear, in a certain sense. The role of conditions (i) and (ii) is to guarantee that R.A. Fisher's (1925) linearity condition is satisfied.
Another interpretation of eq. (5.8) is that the variance of any estimator in the class can be written as the sum of the efficient variance and the variance of a noise term. Let $U(Z) = D(\tau)^{-1}m(Z,\tau) - D(\bar\tau)^{-1}m(Z,\bar\tau)$, and note that $U(Z)$ is orthogonal to $D(\bar\tau)^{-1}m(Z,\bar\tau)$ by eq. (5.8). Thus, $V(\tau) = V(\bar\tau) + E[U(Z)U(Z)']$. This interpretation is a second-moment version of the Hajek and LeCam efficiency results.

5.4. Solving for the smallest asymptotic variance

The characterization of an efficient estimator given in Theorem 5.3 is very useful for finding efficient estimators. Equation (5.8) can often be used to solve for $\bar\tau$, by following two steps: (1) specify the class of estimators so that conditions (i) and (ii) of Theorem 5.3 are satisfied, i.e. so the set of moment functions is linear and the Jacobian $D$ is linear in the moment functions; (2) look for $\bar\tau$ such that $D(\tau) = E[m(Z,\tau)m(Z,\bar\tau)']$. The importance of step (1) is that the linearity conditions guarantee that a solution to eq. (5.8) exists when there is an efficient estimator [with the variance of $m(Z,\tau)$ nonsingular], so that the effort of solving eq. (5.8) will not be in vain. Although for some classes of estimators the linearity conditions are not met, it often seems to be possible to enlarge the class of estimators so that the linearity conditions are met without affecting the efficient estimator. An example is weighted least squares estimation, as further discussed below.
Using eq. (5.8) to solve for an efficient estimator can be illustrated with several examples, both old and new. Consider first minimum distance estimators. The asymptotic variance has the form given in eq. (5.5) for the score $G'WZ$ and the Jacobian term $G'WG$. The equation for the efficient $\bar W$ is then $0 = G'WG - G'W\Omega\bar W'G = G'W(I - \Omega\bar W')G$, which holds if $\Omega\bar W' = I$, i.e. $\bar W = \Omega^{-1}$. Thus, in this example one can solve directly for the optimal weight matrix.
Another example is provided by the problem of deriving the efficient instruments for a nonlinear instrumental variables estimator. Let $\rho(z,\theta)$ denote an $s\times 1$ residual vector, and suppose that there is a vector of variables $x$ such that a conditional moment restriction,

$$E[\rho(z,\theta_0)|x] = 0, \tag{5.9}$$

is satisfied. Here $\rho(z,\theta)$ can be thought of as a vector of residuals and $x$ as a vector of instrumental variables. A simple example is a nonlinear regression model $y = f(x,\theta_0) + \varepsilon$, $E[\varepsilon|x] = 0$, where the residual $\rho(z,\theta) = y - f(x,\theta)$ will satisfy the conditional moment restriction in eq. (5.9) by $\varepsilon$ having conditional mean zero. Another familiar example is a single equation of a simultaneous equations system, where $\rho(z,\theta) = y - Y'\theta$ and $Y$ are the right-hand-side endogenous variables.
An important class of estimators are instrumental variable, or GMM, estimators based on eq. (5.9). This conditional moment restriction implies the unconditional moment restriction that $E[A(x)\rho(z,\theta_0)] = 0$ for any $q\times s$ matrix of functions $A(x)$. Thus, a GMM estimator can be based on the moment functions $g(z,\theta) = A(x)\rho(z,\theta)$. Noting that $\nabla_\theta g(z,\theta) = A(x)\nabla_\theta\rho(z,\theta)$, it follows by Theorem 3.4 that the asymptotic variance of such a GMM estimator will be

$$V(\tau) = \{E[A(x)\nabla_\theta\rho(z,\theta_0)]\}^{-1}E[A(x)\rho(z,\theta_0)\rho(z,\theta_0)'A(x)']\{E[A(x)\nabla_\theta\rho(z,\theta_0)]\}^{-1\prime}, \tag{5.10}$$

where no weighting matrix is present because $g(z,\theta) = A(x)\rho(z,\theta)$ has the same number of components as $\theta$. This asymptotic variance satisfies eq. (5.5), where $\tau = A(\cdot)$ indexes the asymptotic variance. By choosing $\rho(z,\theta)$ and $A(x)$ in certain ways, this class of asymptotic variances can be set up to include all weighted least squares estimators, all single equation instrumental variables estimators, or all system instrumental variables estimators. In particular, cases with more instrumental variables than parameters can be included by specifying $A(x)$ to be a linear combination of all the instrumental variables, with linear combination coefficients given by the probability limit of corresponding sample values. For example, suppose the residual is a scalar $\rho(z,\theta) = y - Y'\theta$, and consider the 2SLS estimator with instrumental variables $x$. Its asymptotic variance has the form given in eq. (5.10) for $A(x) = E[Yx'](E[xx'])^{-1}x$. In this example, the probability limit of the linear combination coefficients is $E[Yx'](E[xx'])^{-1}$. For system instrumental variables estimators these coefficients could also depend on the residual variance, e.g. allowing for 3SLS.
The asymptotic variance in eq. (5.10) satisfies eq. (5.5) for $Z = z$, $D(\tau) = E[A(x)\nabla_\theta\rho(z,\theta_0)]$, and $m(Z,\tau) = A(x)\rho(z,\theta_0)$. Furthermore, both $m(Z,\tau)$ and $D(\tau)$ are linear in $A(x)$, so that conditions (i) and (ii) should be satisfied if the set of functions $\{A(x)\}$ is linear. To be specific, consider the class of all $A(x)$ such that $E[A(x)\nabla_\theta\rho(z,\theta_0)]$ and $E[\|A(x)\|^2\|\rho(z,\theta_0)\|^2]$ exist. Then conditions (i) and (ii) are satisfied with $\tau_3 = A_3(\cdot) = C_1A_1(\cdot) + C_2A_2(\cdot)$.³⁹ Thus, by Theorem 5.3, if an efficient choice of instruments exists there will be one that solves eq. (5.8). To find such a solution, let $G(x) = E[\nabla_\theta\rho(z,\theta_0)|x]$ and $\Omega(x) = E[\rho(z,\theta_0)\rho(z,\theta_0)'|x]$, so that by iterated expectations eq. (5.8) is $0 = E[A(x)\{G(x) - \Omega(x)\bar A(x)'\}]$. This equation will be satisfied if $G(x) - \Omega(x)\bar A(x)' = 0$, i.e. if

$$\bar A(x) = G(x)'\Omega(x)^{-1}. \tag{5.11}$$

Consequently, this function minimizes the asymptotic variance. Also, the asymptotic variance is invariant to nonsingular linear transformations, so that $A(x) = CG(x)'\Omega(x)^{-1}$ will also minimize the asymptotic variance for any nonsingular constant matrix $C$.
This efficient instrument formula includes many important efficiency results as special cases. For example, for nonlinear weighted least squares it shows that the optimal weight is the inverse of the conditional variance of the residual: For $\hat Q_n(\theta) = -n^{-1}\sum_{i=1}^n w(x_i)[y_i - h(x_i,\theta)]^2$, the conclusion of Theorem 3.1 will give an asymptotic variance in eq. (5.10) with $A(x) = w(x)h_\theta(x,\theta_0)$, and the efficient estimator has $\bar A(x) = \{E[\varepsilon^2|x]\}^{-1}h_\theta(x,\theta_0)$, corresponding to weighting by the inverse of the conditional variance. This example also illustrates how efficiency in a class that does not satisfy assumptions (i) and (ii) of Theorem 5.3 (i.e. the linearity conditions) can be shown by enlarging the class: the set of scores (or moments) for weighted least squares estimators is not linear in the sense of assumption (i), but by also including variances for instrumental variable estimators, based on the moment conditions $g(z,\theta) = A(x)[y - h(x,\theta)]$, one obtains a class that includes weighted least squares, satisfies linearity, and has an efficient member given by a weighted least squares estimator. Of course, in a simple example like this one it is not necessary to check linearity, but in using eq. (5.8) to derive new efficiency results, it is a good idea to set up the class of estimators so that the linearity hypothesis is satisfied, and hence some solution to eq. (5.8) exists (when there is an efficient estimator).
Another example of optimal instrumental variables is the well known result on efficiency of 2SLS in the class of instrumental variables estimators with possibly nonlinear instruments: If $\rho(z,\theta) = y - Y'\theta$, $E[Y|x] = \Pi x$, and $\sigma^2 = E[\rho(z,\theta_0)^2|x]$ is constant, then $G(x) = -\Pi x$ and $\Omega(x) = \sigma^2$, and the 2SLS instruments are $E[Yx'](E[xx'])^{-1}x = \Pi x = -\sigma^2\bar A(x)$, a nonsingular linear combination of $\bar A(x)$. As noted above, for efficiency it suffices that the instruments are a nonsingular linear combination of $\bar A(x)$, implying efficiency of 2SLS.
This general form $\bar A(x)$ for the optimal instruments has been previously derived in Chamberlain (1987), but here it serves to illustrate how eq. (5.8) can be used to

³⁹Existence of the asymptotic variance matrix corresponding to $\tau_3$ follows by the triangle and Cauchy–Schwartz inequalities.

derive the form of an optimal estimator. In this example, an optimal choice of


estimator follows immediately from the form of eq. (5.8) and there is no need to
guess what form the optimal instruments might take.

5.5. Feasible efficient estimation

In general, an efficient estimator can depend on nuisance parameters or functions. For example, in minimum distance estimation the efficient weighting matrix is a nuisance parameter that is unknown. Often there is a nuisance function, i.e. an infinite-dimensional nuisance parameter, such as the optimal instruments discussed in Section 5.4. The true value of these nuisance parameters is generally unknown, so that it is not feasible to use the true value to construct an efficient estimator. One feasible approach to efficient estimation is to use estimates in place of true nuisance parameters, i.e. to "plug in" consistent nuisance parameter estimates, in the construction of the estimator. For example, an approach to a feasible, optimal weighted least squares estimator is to maximize $-n^{-1}\sum_{i=1}^n\hat w(x_i)[y_i - h(x_i,\theta)]^2$, where $\hat w(x)$ is an estimator of $1/E[\varepsilon^2|x]$.
This approach will give an efficient estimator if the estimation of the nuisance parameters does not affect the asymptotic variance of $\hat\theta$. It has already been shown, in Section 5.2, that this approach works for minimum distance estimation, where it suffices for efficiency that the weight matrix converge in probability to $\Omega^{-1}$. More generally, a result developed in Section 6, on two-step estimators, suggests that estimation of the nuisance parameters should not affect efficiency. One can think of the plug-in approach to efficient estimation as a two-step estimator, where the first step is estimating the nuisance parameter or function, and the second is construction of $\hat\theta$. According to a principle developed in the next section, the first-step estimation has no effect on the second-step estimator if consistency of the first-step estimator does not affect consistency of the second. This principle generally applies to efficient estimators, where nuisance parameter estimates that converge to wrong values do not affect consistency of the estimator of parameters of interest. For example, consistency of the weighted least squares estimator is not affected by the form of the weights (as long as they satisfy certain regularity conditions). Thus, results on two-step estimation suggest that the plug-in approach should usually yield an efficient estimator.
The plug-in approach is often easy to implement when there are a finite number of nuisance parameters or when one is willing to assume that the nuisance function can be parametrized by a finite number of parameters. Finding a consistent estimator of the true nuisance parameters to be used in the estimator is often straightforward. A well known example is the efficient linear combination matrix $\Pi = E[Yx'](E[xx'])^{-1}$ for an instrumental variables estimator, which is consistently estimated by the 2SLS coefficients $\hat\Pi = \sum_{i=1}^n Y_ix_i'(\sum_{i=1}^n x_ix_i')^{-1}$. Another example is the optimal weight for nonlinear least squares. If the conditional variance is parametrized as $\sigma^2(x,\gamma)$, then the true $\gamma$ can be consistently estimated from the nonlinear least squares regression of $\hat\varepsilon_i^2$ on $\sigma^2(x_i,\gamma)$, where $\hat\varepsilon_i = y_i - h(x_i,\tilde\theta)$ $(i = 1,\dots,n)$ are the residuals from a preliminary consistent estimator $\tilde\theta$.
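As a concrete illustration of the first of these examples, here is a minimal numpy sketch (ours, not the chapter's) of 2SLS as a plug-in estimator:

```python
import numpy as np

def two_sls(x, Y, y):
    """Pi_hat = (sum Y x')(sum x x')^{-1} in the first step, then
    Pi_hat x_i as instruments in the second step."""
    Pi_hat = np.linalg.solve(x.T @ x, x.T @ Y).T
    Yhat = x @ Pi_hat.T                      # fitted instruments Pi_hat x_i
    theta_hat = np.linalg.solve(Yhat.T @ Y, Yhat.T @ y)
    return theta_hat, Pi_hat
```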
Of course, regularity conditions are useful for showing that estimation of the nuisance parameters does not affect the asymptotic variance of the estimator. To give a precise statement it is helpful to be more specific about the nature of the estimator. A quite general type of plug-in estimator is a GMM estimator that depends on preliminary estimates of some parameters. Let $g(z,\theta,\gamma)$ denote a $q\times 1$ vector of functions of the parameters of interest and nuisance parameters $\gamma$, and let $\hat\gamma$ be a first-step estimator. Consider an estimator $\hat\theta$ that, with probability approaching one, solves

$$n^{-1}\sum_{i=1}^n g(z_i,\hat\theta,\hat\gamma) = 0. \tag{5.12}$$

This class is quite general, because eq. (5.12) can often be interpreted as the first-order conditions for an estimator. For example, it includes weighted least squares estimators with an estimated weight $w(x,\hat\gamma)$, for which eq. (5.12) is the first-order condition with $g(z,\theta,\gamma) = w(x,\gamma)h_\theta(x,\theta)[y - h(x,\theta)]$. One type of estimator not included is CMD, but the main result of interest there is the efficient choice of weighting matrix, as already discussed in Section 5.2.

Suppose also that $\hat\gamma$ is a GMM estimator, satisfying $n^{-1}\sum_{i=1}^n m(z_i,\hat\gamma) = 0$. If this equation is stacked with eq. (5.12), the pair $(\hat\theta,\hat\gamma)$ becomes a joint GMM estimator, so that regularity conditions for asymptotic efficiency can be obtained from the assumptions for Theorem 3.4. This result, and its application to more general types of two-step estimators, is described in Section 6. In particular, Theorem 6.1 can be applied to show that $\hat\theta$ from eq. (5.12) is efficient. If the hypotheses of that result are satisfied and $G_\gamma = E[\nabla_\gamma g(z,\theta_0,\gamma_0)] = 0$ then $\hat\theta$ will be asymptotically normal with asymptotic variance the same as if $\hat\gamma = \gamma_0$. As further discussed in Section 6, the condition $G_\gamma = 0$ is related to the requirement that consistency of $\hat\gamma$ not affect consistency of $\hat\theta$. As noted above, this condition is a useful one for determining whether the estimation of the nuisance parameters affects the asymptotic variance of the feasible estimator $\hat\theta$.
To show how to analyze particular feasible estimators, it is useful to give an
example.

Linear regression with linear heteroskedasticity: Consider a linear model where $E[y|x] = x'\beta_0$ and $\sigma^2(x) = \text{Var}(y|x) = w'\alpha_0$ for some $w = w(x)$ that is a function of $x$. As noted above, the efficient estimator among those that solve $n^{-1}\sum_{i=1}^n A(x_i)[y_i - x_i'\beta] = 0$ has $A(x) = \bar A(x) = (w'\alpha_0)^{-1}x$. A feasible efficient estimator can be constructed by using a squared residual regression to form an estimator $\hat\alpha$ of $\alpha_0$, and plugging this estimator into the first-order conditions. More precisely, let $\tilde\beta$ be the least squares estimator from a regression of $y$ on $x$ and $\hat\alpha$ the least squares estimator from a regression of $(y - x'\tilde\beta)^2$ on $w$. Suppose that $w'\alpha_0$ is bounded below and let $\tau(u)$ be a positive function that is continuously differentiable with bounded derivative and $\tau(u) = u$ for $u$ greater than the lower bound on $w'\alpha_0$.⁴⁰ Consider $\hat\beta$ obtained from solving $\sum_{i=1}^n\tau(w_i'\hat\alpha)^{-1}x_i(y_i - x_i'\beta) = 0$. This estimator is a two-step GMM estimator like that given above with

$$\gamma = (\alpha',\beta')', \qquad m(z,\gamma) = [(y - x'\beta)x', \{(y - x'\beta)^2 - w'\alpha\}w']',$$
$$g(z,\theta,\gamma) = \tau(w'\alpha)^{-1}x(y - x'\theta).$$

It is straightforward to verify that the vector of moment functions $[m(z,\gamma)', g(z,\theta,\gamma)']'$ satisfies the conditions of Theorem 6.1 if $w$ is bounded, $x$ and $y$ have finite fourth moments, and $E[xx']$ and $E[ww']$ are nonsingular. Furthermore, $E[\nabla_\gamma g(z,\theta_0,\gamma_0)] = -E[\tau'(w'\alpha_0)\tau(w'\alpha_0)^{-2}(y - x'\theta_0)xw'] = 0$, so that this feasible estimator will be efficient.
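The following sketch implements this two-step estimator, using a hard floor in place of the smooth trimming function $\tau$ (so it illustrates the computations, not the exact regularity conditions); var_floor is an assumed lower bound on $w'\alpha_0$:

```python
import numpy as np

def feasible_efficient_ls(x, w, y, var_floor):
    """OLS, then a regression of squared residuals on w, then
    weighted least squares with trimmed variance estimates."""
    beta_tilde = np.linalg.lstsq(x, y, rcond=None)[0]
    e2 = (y - x @ beta_tilde) ** 2
    alpha_hat = np.linalg.lstsq(w, e2, rcond=None)[0]
    v = np.maximum(w @ alpha_hat, var_floor)   # stands in for tau(w'alpha_hat)
    xw = x / v[:, None]
    # solves sum_i tau(w_i'alpha_hat)^{-1} x_i (y_i - x_i'beta) = 0
    return np.linalg.solve(xw.T @ x, xw.T @ y)
```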

In many cases the efficiency of a plug-in estimator may be adversely affected if the parametrization of the nuisance functions is incorrect. For example, if in a linear model heteroskedasticity is specified as exponential, but the true conditional variance takes another form, then the weighted least squares estimator based on an exponential variance function will not be efficient. Consistency will generally not be affected, and there will be only a little loss in efficiency if the parametrization is approximately correct, but there could be big efficiency losses if the parametrized functional form is far from the true one. This potential problem with efficiency suggests that one might want to use nonparametric nuisance function estimators, which do not impose any restrictions on functional form. For the same reasons discussed above, one would expect that estimation of the nuisance function does not affect the limiting distribution, so that the resulting feasible estimators would be efficient. Examples of this type of approach are Stone (1975), Bickel (1982), and Carroll (1982). These estimators are quite complicated, so an account is not given here, except to say that similar estimators are discussed in Section 8.

5.6. Technicalities

It is possible to show the generalized information matrix equality in eq. (5.1) under a condition that allows $f(z|\theta)^{1/2}$ to not be continuously differentiable and $g(z,\theta)$ to not be continuous. For the root-density, this condition is mean-square differentiability at $\theta_0$ with respect to integration over $z$, meaning that there is $\delta(z)$ with $\int\|\delta(z)\|^2\,dz < \infty$ such that $\int[f(z|\theta)^{1/2} - f(z|\theta_0)^{1/2} - \delta(z)'(\theta - \theta_0)]^2\,dz = o(\|\theta - \theta_0\|^2)$,

⁴⁰The $\tau(u)$ function is a trimming device similar to those used in the semiparametric estimation literature. This specification requires knowing a lower bound on the conditional variance. It is also possible to allow $\tau(u)$ to approach the identity for all $u > 0$ as the sample size grows, but this would complicate the analysis.

as $\theta\to\theta_0$. As shown in Bickel et al. (1992) it will suffice for this condition that $f(z|\theta)^{1/2}$ is continuously differentiable in $\theta$ (for almost all $z$) and that $J(\theta) = \int\nabla_\theta\ln f(z|\theta)\{\nabla_\theta\ln f(z|\theta)\}'f(z|\theta)\,dz$ is nonsingular and continuous in $\theta$. Here $\delta(z)$ is the derivative of $f(z|\theta)^{1/2}$ at $\theta_0$, so by $\nabla_\theta f(z|\theta)^{1/2} = \frac12 f(z|\theta)^{1/2}\nabla_\theta\ln f(z|\theta)$, the expression for the information matrix in terms of $\delta(z)$ is $J = 4\int\delta(z)\delta(z)'\,dz$. A precise result on efficiency of MLE in the class of GMM estimators can then be stated as:

Lemma 5.4

If (i) $f(z|\theta)^{1/2}$ is mean-square differentiable at $\theta_0$ with derivative $\delta(z)$; (ii) $E[g(z,\theta)]$ is differentiable at $\theta_0$ with derivative $G$; (iii) $g(z,\theta)$ is continuous at $\theta_0$ with probability one; (iv) there is a neighborhood $\mathcal N$ of $\theta_0$ and a function $d(z)$ such that $\|g(z,\theta)\| \le d(z)$ and $\int d(z)^2 f(z|\theta)\,dz$ is bounded for $\theta\in\mathcal N$; then $\int g(z,\theta)f(z|\theta)\,dz$ is differentiable at $\theta_0$ with derivative $G + 2\int g(z,\theta_0)\delta(z)'f(z|\theta_0)^{1/2}\,dz$.

Proof

The proof is similar to that of Lemma 7.2 of Ibragimov and Has'minskii (1981). Let $r(\theta) = f(z|\theta)^{1/2}$, $g(\theta) = g(z,\theta)$, $\delta = \delta(z)$, and $\Delta(\theta) = r(\theta) - r(\theta_0) - \delta'(\theta - \theta_0)$, suppressing the $z$ argument for notational convenience. Also, let $m(\theta,\tilde\theta) = \int g(\theta)r(\tilde\theta)^2\,dz$ and $M = \int g(\theta_0)\delta' r(\theta_0)\,dz$. By (ii), $m(\theta,\theta_0) - m(\theta_0,\theta_0) - G(\theta - \theta_0) = o(\|\theta - \theta_0\|)$. Also, by the triangle inequality, $\|m(\theta,\theta) - m(\theta_0,\theta_0) - (G + 2M)(\theta - \theta_0)\| \le \|m(\theta,\theta_0) - m(\theta_0,\theta_0) - G(\theta - \theta_0)\| + \|m(\theta,\theta) - m(\theta,\theta_0) - 2M(\theta - \theta_0)\|$, so that to show the conclusion it suffices to show $\|m(\theta,\theta) - m(\theta,\theta_0) - 2M(\theta - \theta_0)\| = o(\|\theta - \theta_0\|)$. To show this, note by the triangle inequality,

$$\|m(\theta,\theta) - m(\theta,\theta_0) - 2M(\theta - \theta_0)\| = \left\|\int g(\theta)[r(\theta)^2 - r(\theta_0)^2]\,dz - 2M(\theta - \theta_0)\right\|$$
$$\le \left\|\int g(\theta)[r(\theta) + r(\theta_0)]\Delta(\theta)\,dz\right\| + \left\|\int[g(\theta) - g(\theta_0)]r(\theta_0)\delta'\,dz\right\|\|\theta - \theta_0\|$$
$$+ \left\|\int[g(\theta)r(\theta) - g(\theta_0)r(\theta_0)]\delta'\,dz\right\|\|\theta - \theta_0\| = R_1 + R_2\|\theta - \theta_0\| + R_3\|\theta - \theta_0\|.$$

Therefore, it suffices to show that $R_1 = o(\|\theta - \theta_0\|)$, $R_2\to 0$, and $R_3\to 0$ as $\theta\to\theta_0$. By (iv) and the triangle and Cauchy–Schwartz inequalities,

$$R_1 \le \left\{\left[\int\|g(\theta)\|^2 r(\theta)^2\,dz\right]^{1/2} + \left[\int\|g(\theta)\|^2 r(\theta_0)^2\,dz\right]^{1/2}\right\}\left[\int\Delta(\theta)^2\,dz\right]^{1/2} = o(\|\theta - \theta_0\|),$$

where the last equality follows because the first factor is bounded by (iv) and $\int\Delta(\theta)^2\,dz = o(\|\theta - \theta_0\|^2)$ by (i). Also, by (iii) and (iv) and the dominated convergence theorem, $E[\|g(\theta) - g(\theta_0)\|^2]\to 0$, so by the Cauchy–Schwartz inequality, $R_2 \le (E[\|g(\theta) - g(\theta_0)\|^2])^{1/2}(\int\|\delta\|^2\,dz)^{1/2}\to 0$. Also, by the triangle inequality, $R_3 \le R_2 + \int\|g(\theta)\|\,|r(\theta) - r(\theta_0)|\,\|\delta\|\,dz$, while for $K > 0$,

$$\int\|g(\theta)\|\,|r(\theta) - r(\theta_0)|\,\|\delta\|\,dz \le \int d(z)|r(\theta) - r(\theta_0)|\,\|\delta\|\,dz$$
$$\le \int_{d(z) > K} d(z)|r(\theta) - r(\theta_0)|\,\|\delta\|\,dz + K\int|r(\theta) - r(\theta_0)|\,\|\delta\|\,dz$$
$$\le \left[\int d(z)^2|r(\theta) - r(\theta_0)|^2\,dz\right]^{1/2}\left[\int_{d(z) > K}\|\delta\|^2\,dz\right]^{1/2} + K\left[\int|r(\theta) - r(\theta_0)|^2\,dz\right]^{1/2}\left[\int\|\delta\|^2\,dz\right]^{1/2}.$$

By (iv), $\int d(z)^2|r(\theta) - r(\theta_0)|^2\,dz \le 2\int d(z)^2 r(\theta)^2\,dz + 2\int d(z)^2 r(\theta_0)^2\,dz$ is bounded. Also, by the dominated convergence theorem, $\int_{d(z) > K}\|\delta\|^2\,dz\to 0$ as $K\to\infty$, and by (i), $\int|r(\theta) - r(\theta_0)|^2\,dz\to 0$, so that the last term converges to zero for any $K$. Consider $\varepsilon > 0$ and choose $K$ so that the first term is less than $\frac12\varepsilon$. Then the last term is less than $\frac12\varepsilon$ for $\theta$ close enough to $\theta_0$, implying that $\int\|g(\theta)\|\,|r(\theta) - r(\theta_0)|\,\|\delta\|\,dz < \varepsilon$ for $\theta$ close enough to $\theta_0$. The conclusion then follows by the triangle inequality. Q.E.D.

Proof of Theorem 5.1

By condition (iv) of Theorem 3.4 and Lemma 3.5, $g(z,\theta)$ is continuous on a neighborhood of $\theta_0$ and $E[g(z,\theta)]$ is differentiable at $\theta_0$ with derivative $G = E[\nabla_\theta g(z,\theta_0)]$. Also, $f(z|\theta)^{1/2}$ is mean-square differentiable by the dominance condition in Theorem 5.1, as can be shown by the usual mean-value expansion argument. Also, by the conditions of Theorem 5.1, the derivative is equal to $\delta(z) = \frac12\mathbf 1[f(z|\theta_0) > 0]f(z|\theta_0)^{-1/2}\nabla_\theta f(z|\theta_0)$ on a set of full measure, so that the derivative in the conclusion of Lemma 5.4 is $G + E[g(z,\theta_0)\{\nabla_\theta\ln f(z|\theta_0)\}']$. Also, $\|g(z,\theta)\| \le d(z) = \sup_{\tilde\theta\in\mathcal N}\|g(z,\tilde\theta)\|$ has $\int d(z)^2 f(z|\theta)\,dz$ bounded, so that the conclusion of Lemma 5.4 holds. Then for $U = g(z,\theta_0) + GJ^{-1}\nabla_\theta\ln f(z|\theta_0)$,

$$(G'WG)^{-1}G'W\Omega WG(G'WG)^{-1} - J^{-1} = (G'WG)^{-1}G'W E[UU'] WG(G'WG)^{-1},$$

so the conclusion follows by $E[UU']$ positive semi-definite. Q.E.D.

6. Two-step estimators

A two-step estimator is one that depends on some preliminary, first-step estimator


of a parameter vector. They provide a useful illustration of how the previous results

can be applied, even to complicated estimators. In particular, it is shown in this


section that two-step estimators can be fit into the GMM framework. Two-step
estimators are also of interest in their own right. As discussed in Section 5, feasible
efficient estimators often are two-step estimators, with the first step being the
estimation of nuisance parameters that affect efficiency. Also, they provide a simpler
alternative to complicated joint estimators. Examples of two-step estimators in
econometrics are the Heckman (1976) sample selection estimator and the Barro
(1977) estimator for linear models that depend on expectations and/or corresponding
residuals. Their properties have been analyzed by Newey (1984) and Pagan (1984, 1986),
among others.
An important question for two-step estimators is whether the estimation of the
first step affects the asymptotic variance of the second, and if so, what effect does
the first step have. Ignoring the first step can lead to inconsistent standard error
estimates, and hence confidence intervals that are not even asymptotically valid.
This section develops a simple condition for whether the first step affects the second,
which is that an effect is present if and only if consistency of the first-step estimator
affects consistency of the second-step estimator. This condition is useful because
one can often see by inspection whether first-step inconsistency leads to the second-
step inconsistency. This section also describes conditions for ignoring the first
step to lead to either an underestimate or an overestimate of the standard
errors.
When the variance of the second step is affected by the estimation in the first step,
asymptotically valid standard errors for the second step require a correction for the
first-step estimation. This section derives consistent standard error estimators by
applying the general GMM formula. The results are illustrated by a sample selection
model.
The efficiency results of Section 5 can also be applied, to characterize efficient
members of some class of two-step estimators. For brevity these results are given
in Newey (1993) rather than here.

6.1. Two-step estimators as joint GMM estimators

The class of GMM estimators is sufficiently general to include two-step estimators


where moment functions from the first step and the second step can be stacked
to form a vector of moment conditions. Theorem 3.4 can then be applied to specify
regularity conditions for asymptotic normality, and the conclusion of Theorem 3.4
will provide the asymptotic variance, which can then be analyzed to derive the
results described above. Previous results can also be used to show consistency,
which is an assumption for the asymptotic normality results, but to focus attention
on the most interesting features of two-step estimators, consistency will just be
assumed in this section.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2117

A general type of estimator 8 that has as special cases most examples of interest
is one that, with probability approaching one, solves an equation

n- i$l dzi, 8, y*)= O, (6.1)

where g(z,B,y) is a vector of functions with the same dimension as 0 and y*is a
first-step estimator. This equation is exactly the same as eq. (5.12), but here the
purpose is analyzing the asymptotic distribution of gin general rather than specifying
regularity conditions for $ to have no effect. The estimator can be treated as part
of a joint GMM estimator if y^also satisfies a moment condition of the form, with
probability approaching one,

n-l i m(z,,y)=O, (6.2)


i=l

where m(z,y) is a vector with the same dimension as y. If g(z, 0,~) and m(z,r) are
stacked to form J(z, 8, y) = [m(z, O),g(z, 8, y)], then eqs. (6.1) and (6.2) are simply
the two components of the joint moment equation n-i C;= 1 g(zi, 8,y*)
= 0.Thus, the
two-step estimator from eq. (6.1) can be viewed as a GMM estimator.
An interesting example of a two-step estimator that fits into this framework is
Heckmans (1976) sample selection estimator.

Sample selection example: In this example the first step +$is a probit estimator with
regressors x. The second step is least squares regression in the subsample where the
probit-dependent variable is one, i.e. in the selected sample, with regressors given
by w and i(xy^) for n(o) = ~(U)/@(U). Let d be the probit-dependent variable, that is
equal to either zero or one. This estimator is useful when y is only observed if d = 1,
e.g. where y is wages and d is labor force participation. The idea is that joint
normality of the regression y = w/& + u and the probit equation leads to E[yl w,
d = 1, x] = wp,, + cc,A(xy,), where a, is nonzero if the probit- and regression-depen-
dent variables are not independent. Thus, %(xcr,) can be thought of as an additional
regressor that corrects for the endogenous subsample.
This two-step estimator will satisfy eqs. (6.1) and (6.2) for

Y(4 8,Y)= d
[
A(&1 CY-w'B-~wr)l~
m(z,y) = Il(xy)a=-( -xy)x[d- @(xy)], (6.3)

where 8 = (/Y, a). Then eq. (6.1) becomes the first-order condition for least squares
on the selected sample and eq. (6.2) the first-order condition for probit.
2178 W.K. Newley and D. McFadden

Regularity conditions for asymptotic normality can be formulated by applying the


asymptotic normality result for GMM, i.e. Theorem 3.4, to the stacked vector of
moment conditions. Also, the conclusion of Theorem 3.4 and partitioned inversion
can then be used to calculate the asymptotic variance of 8, as in the following result.
Let

G, = W,dZ> Q,>ro)l, G, = ECV,g(z,&>YO)I> Y(Z)= dz, &, Yoh


M = ECV,mk ~o)l, I,@) = - M m(z, y,,). (6.4)

Theorem 6.1

Ifeqs. (6.1) and (6.2) are satisfied with probability approaching one, 8% 8,, y*3 ye,
and g(z, 8, y) satisfies conditions (i)-(v) of Theorem 3.4, then 8 and 9 are asymptoti-
cally normal and $(& 0,) 4 N(0, V) where I/ = G; EC {g(z) + G,$(z)}(g(z) +
G,WJ1G,.
Proof

By eqs. (6.1) and (6.2), with probability approaching one (8, y*)is a GMM estimator
with moment function g(z,_B, y) = [m(z,y),g(z, e,y)] and I? equal to an identity
matrix. By (~?1@6= G-, the asymptotic variance of the estimator is
(W- Z(IE[#(z, do, y&(z, 8,, y,)]zz;(~~zz1)- l = CT-E[ij(z, 8,, y&(z, o,, yJ]G- l.
Also, the expected Jacobian matrix and its inverse are given by

(6.5)

Noting that the first row of G- is G; [I, - GYM - 1 and that [I, - G,M- 1 x
g(z, BO,yO) = g(z) + G&(z), the asymptotic variance of 8, which is the upper left block
of the joint variance matrix, follows by partitioned matrix multiplication. Q.E.D.

An alternative approach to deriving the asymptotic distribution of two-step esti-


mators is to work directly from eq. (6. l), expanding in 6to solve for &(e^ - 6,) and
then expanding the result around the true yO. To describe this approach, first note
that 9 is an asymptotically linear estimator with influence function $(z) =
- M- m(zi, ye), where fi(y* - yO) = Cr= 1 $(zi)/$ + op(1). Then expanding the
left-hand side of eq. (6.1) around B0 and solving gives:

Jj2(8-8,)= - a-1 t Vog(z. @) n


-l iFl eO,YV&
[ i=l
1) ?

1
Stzi,

=-[& t i=l
V,g(z,,8,y^)
1-l
Ch. 36: Large Sample Estimation and Hypothesis Testing 2179

x
ii$l g(zi)l& + [,-l i,
i=l
vyCl(zi, eO, V,]\;;(9
- YOJ]
= -
i=l
+ up,
GB1 t {g(zi)+ Gyti(zJ}lJn

where (? and 7 are mean values and the third equality follows by convergence of
y^and the mean values and the conclusion of Lemma 2.4. The conclusion then follows
by applying the central limit theorem to the term following the last equality.
One advantage of this approach is that it only uses the influence function
representation &($ - ye) = x1= 1 tj(z,)/& + o,(l) for 9, and not the GMM formula
in eq. (6.2). This generalization is useful when y*is not a GMM estimator. The GMM
approach has been adopted here because it leads to straightforward primitive
conditions, while an influence representation for y*is not a very primitive condition.
Also the GMM approach can be generalized to allow y*to be a two-step, or even
multistep, estimator by stacking moment conditions for estimators that affect 3 with
the moment conditions for 0 and y.

6.2. The efect ofjrst-step estimation on second-step standard errors

One important feature of two-step estimators is that ignoring the first step in
calculating standard errors can lead to inconsistent standard errors for the second
step. The asymptotic variance for the estimator solving eq. (6.1) with y*= yO, i.e. the
asymptotic variance ignoring the presence of y*in the first stage, is G; E[g(z)g(z)]G; l.
In general, this matrix differs from the asymptotic variance given in the conclusion
of Theorem 6.1, because it does not account for the presence of the first-step
estimators.
Ignoring the first step will be valid if G, = 0. Also, if G, # 0, then ignoring the first
step will generally be invalid, leading to an incorrect asymptotic variance formula,
because nonzero G, means that, except for unusual cases, E[g(z)g(z)] will not equal
E[ (g(z) + G&(z)} {g(z) + G&(z)}]. Thus, the condition for estimation of the first
step to have no effect on the second-step asymptotic variance is G, = 0.
A nonzero G, can be interpreted as meaning that inconsistency in the first-step
estimator leads to inconsistency in the second-step estimator. This interpretation is
useful, because it gives a comparatively simple criterion for determining if first-stage
estimation has to be accounted for. To derive this interpretation, consider the
solution 8(y) to E[g(z, B(y), y)] = 0. Because 8 satisfies the sample version of this
condition, B(y) should be the probability limit of the second-step estimator when J?
converges to y (under appropriate regularity conditions, such as those of Section 2).
Assuming differentiation inside the expectation is allowed, the implicit function
theorem gives

V$(y,) = - G; Gy. (6.7)


2180 W.K. Newey and D. McFadden

By nonsingularity of G,, the necessary and sufficient condition for G, = 0 is that


V,H(yJ = 0. Since H(y,) = H,, the condition that V,B(y,J = 0 is a local, first-order
condition that inconsistency in y*does not affect consistency of 8. The following
result adds regularity conditions for this first-order condition to be interpreted as
a consistency condition.

Theorem 6.2

Suppose that the conditions of Theorem 6.1 are satisfied and g(z, 0, y) satisfies the
conditions of Lemma 2.4 for the parameter vector (H,y). If &A 8, even when
j-y # yO, then G, = 0. Also suppose that E[V,g(z, 8,, y)] has constant rank on a
neighborhood of yO. If for any neighborhood of y0 there is y in that neighborhood
such that 8 does not converge in probability to H, when $ L y, then G, # 0.

Proof

By Lemma 2.4, 8 3 8, and y*3 y imply that Cy= r g(zi, 8, y^)/n -% E[g(z, 8,, y)].
The sample moment conditions (6.1) thus imply E[g(z, BO,y)] = 0. Differentiat-
ing this identity with respect to y at y = y0 gives G, = 0.41 To show the second
conclusion, let H(y) denote the limit of e when 9 L y. By the previous argument,
E[g(z, 8(y), y)] = 0. Also, by the implicit function theorem 0(y) is continuous at yO,
with @ye) = BO.By the conditions of Theorem 6.1, G&8, y) = E[V,g(z, 0, y)] is contin-
uous in a neighborhood of B0 and yO, and so will be nonsingular on a small enough
neighborhood by G, nonsingular. Consider a small enough convex neighborhood
where this nonsingularity condition holds and E[V,g(z, 8,, y)] has constant rank. A
mean-value expansion gives E[g(z, 8,, ?)I.= E[g(z, B(y), y)] + G,(& y)[e, - 0(y)] ~0.
Another expansion then gives E[g(z, Be, y)] = E[V,g(z, O,, -$](y - y,,) # 0, implying
E[V,g(z, do, v)] # 0, and hence G, # 0 (by the derivative having constant rank).
Q.E.D.

This results states that, under certain regularity conditions, the first-step estimator
affects second-step standard errors, i.e. G, # 0, if and only if inconsistency in the
first step leads to inconsistency in the second step. The sample selection estimator
provides an example of how this criterion can be applied.

Sample selection continued: The second-step estimator is a regression where some


of the regressors depend on y. In general, including the wrong regressors leads to
inconsistency, so that, by Theorem 6.2, the second-step standard errors will be
affected by the first step. One special case where the estimator will still be consistent
is if q, = 0, because including a regressor that does not belong does not affect
consistency. Thus, by Theorem 6.2, no adjustment is needed (i.e. G, = 0) if c(~ = 0.
This result is useful for constructing tests of whether these regressors belong, because

41Differentiation inside the expectation is allowed by Lemma 3.6.


Ch. 36: Large Sample Estimation and Hypothesis Testing 2181

it means that under the null hypothesis the test that ignores the first stage will have
asymptotically correct size. These results can be confirmed by calculating

where n,(o) = di(v)/dv. By inspection this matrix is generally nonzero, but is zero if
a, = 0.

This criterion can also be applied to subsets of the second-step coefficients. Let S
denote a selection matrix such that SA is a matrix of rows of A, so that Se is a
subvector of the second-step coefficients. Then the asymptotic variance of Se is
SC, E[ {g(z) + G&(z)} {g(z) + G,$(z)}]G; S, while the asymptotic variance that
ignores the first step is SC; E[g(z)g(z)]G; 1S. The general condition for equality
of these two matrices is

0= - SC,' G, = SV,B(y,) = V,[SB(y,)], (6.8)

where the second equality follows by eq. (6.7). This is a first-order version of the
statement that asymptotic variance of Skis affected by the first-step estimator if and
only if consistency of the first step affects consistency of the second. This condition
could be made precise by modifying Theorem 6.2, but for simplicity this modification
is not given here.

Sample selection continued: As is well known, if the correct and incorrect regressors
are independent of the other regressors then including the wrong regressor only
affects consistency of the coefficient of the constant. Thus, the second-step standard
errors of the coefficients of nonconstant variables in w will not be affected by the
first-step estimation if w and x are independent.

One can also derive conditions for the correct asymptotic variance to be larger or
smaller than the one that ignores the first step. A condition for the correct asymptotic
variance to be larger, given in Newey (1984), is that the first- and second-step
moment conditions are uncorrelated, i.e.

Gdz, &, xJm(z,YJI = 0. (6.9)

In this case E[g(z)$(z)] = 0, so the correct variance is G; E[g(z)g(z)]G; I +


G; ~,WW$WlG;~, 2which is larger, in the positive semi-definite sense, than
the one G; E[g(z)g(z)]G; that ignores first-step estimation.
2182 W.K. Newey and D. McFadden

Sump/e selection continued: In this example, E[y - wfiO - cr,i(xy,)l w, d = 1, x] = 0,


which implies (6.9). Thus, the standard error formula that ignores the first-step
estimation will understate the asymptotic standard error.

A condition for the correct asymptotic variance to be smaller than the one that
ignores the first step, given by Pierce (1982), is that

m(z) = m(z, yO) = V, ln f(z I Q,, yd (6.10)

In this case, the identities Sm(z, ~)f(zl O,, y) dz = 0 and lg(z, 0,, y)f(z Id,, y) dz = 0 can
be differentiated to obtain the generalized information matrix equalities M =
- E[s(z)s(z)] and G, = - E[g(z)s(z)]. It then follows that G, = - E[g(z)m(z)] =
- J%d4wl I~c$wwl> - l> so that the correct asymptotic variance is
G; 1~Cg(4g(4lG; - G; 'ECgWWl { ~C+WWl> - -f%WsMlG; . This
variance is smaller, in the positive semi-definite sense, than the one that ignores the
first step.
Equation (6.10) is a useful condition, because it implies that conservative asymp-
totic confidence intervals can be constructed by ignoring the first stage. Unfortunately,
the cases where it is satisfied are somewhat rare. A necessary condition for eq. (6.10)
is that the information matrix for Q and y be block diagonal, because eq. (6.10)
implies that the asymptotic variance of y*is {E[m(z)m(z)]} - , which is only obtainable
when the information matrix is block diagonal. Consequently, if g(z, 8, y) were the
score for 8, then G, = 0 by the information matrix equality, and hence estimation
of 9 would have no effect on the second-stage variance. Thus, eq. (6.10) only leads
to a lowering of the variance when g(z, 8, y) is not the score, i.e. 8 is not an efficient
estimator.
One case where eq. (6.10) holds is if there is a factorization of the likelihood
f(z ItI, y) = fl(z IB)f,(z Iy) and y^is the MLE of y. In particular, if fi (z 10)is a conditional
likelihood and f,(zl y) = fi(x 17) a marginal likelihood of variables x, i.e. x are
ancillary to 8, then eq. (6.8) is satisfied when y*is an efficient estimator of yO.

6.3. Consistent asymptotic variance estimation for two-step estimators

The interpretation of a two-step estimator as a joint GMM estimator can be used


to construct a consistent estimator of the asymptotic variance when G, # 0, by
applying the general GMM formula. The Jacobian terms can be estimated by
sample Jacobians, i.e. as

60~n-l t v,g(ziy8,9),
Gy= 6 t V,g(Z,,BJ),ii = n-l i V,m(z,,y*).
i=l i=l i=l

The second-moment matrix can be estimated by a sample second-moment matrix


2183
Ch. 36: Larye Sample Estimation and Hypothesis Testing

di = y(zi, 8, y*)and Ai = m(z,, f), of the form fi= n- x1= ,(& &i)(& &I). An estimator
of the joint asymptotic variance of 8 and 7 is then given by

An estimator of the asymptotic variance of the second step 8 can be extracted from
the upper left block of this matrix. A convenient expression, corresponding to that
in Theorem 6.1, can be obtained by letting $i = - & l&z,, so that the upper left
block of ? is

(6.11)

If the moment functions are uncorrelated as in eq. (6.9) so that the first-step
estimation increases the second-step variance, then for ?? = n- Cy= 1JitJ:y an asymp-
totic variance estimator for 8 is

(6.12)

This estimator is quite convenient, because most of its pieces can be recovered from
standard output of computer programs. The first of the two terms being summed
is a variance estimate that ignores the first step, as often provided by computer
output (possibly in a different form than here). An estimated variance FYis also often
provided by standard output from the first step. In many cases 6; can also be
recovered from the first step. Thus, often the only part of this variance estimator
requiring application-specific calculation is eY. This simplification is only possible
under eq. (6.9). If the first- and second-step moment conditions are correlated then
one will need the individual observations Gi, in order to properly account for the
covariance between the first- and second-step moments.
A consistency result for these asymptotic variance estimators can be obtained by
applying the results of Section 4 to these joint moment conditions. It will suffice to
assume that the joint moment vector g(z, 0, y) = [m(z, y), y(z, 0, r)] satisfies the
conditions of Theorem 4.5. Because it is such a direct application of previous results
a formal statement is not given here.
In some cases it may be possible to simplify PO by using restrictions on the form
of Jacobians and variance matrices that are implied by a model. The use of such
restrictions in the general formula can be illustrated by deriving a consistent
asymptotic variance estimator for the example.
2184 W.K. Newey and D. McFadden

Sumple selection example continued: Let Wi = di[wI, /z(xIyo)] and %i = di[wI, i(.$j)].
Note that by the residual having conditional mean zero given w, d = 1, and x, it is
the case that G, = - E[diWiWJ and G, = - a,E[di~,,(xlyo)WiX11, where terms in-
volving second derivatives have dropped out by the residual having conditional
mean zero. Estimates of these matrices are given by ee = - x1= 1ki/iA~/~ and
G, = -oily= II.,(x~j)ii/,x~/n. Applying eq. (6.12) to this case, for ii = yi - W#, 3i),
then gives

(6.13)

where pY is a probit estimator of the asymp_totic_variance of &(y - yO), e.g. as


provided by a canned computer program, and 17~ G; Gy is the matrix of coefficients
from a multivariate regression of c?%,(x~y*)xi
on Wi. This estimator is the sum of the
White (1980) variance matrix for least squares and a correction term for the first-
stage estimation.42 It will be a consistent estimator of the asymptotic variance of
JII@ - do).43

7. Asymptotic normality with nonsmooth objective functions

The previous asymptotic normality results for MLE and GMM require that the
log-likelihood be twice differentiable and that the moment functions be once differenti-
able. There are many examples of estimators where these functions are not that
smooth. These include Koenker and Bassett (1978), Powells (1984, 1986) censored
least absolute deviations and symmetrically trimmed estimators, Newey and Powells
(1987) asymmetric least squares estimator, and the simulated moment estimators, of
Pakes (1986) and McFadden (1989). Therefore, it is important to have asymptotic
normality results that allow for nonsmooth objective functions.
Asymptotic normality results for nonsmooth functions were developed by Daniels
(1961), Huber (1967), Pollard (1985), and Pakes and Pollard (1989). The basic insight
of these papers is that smoothness of the objective function can be replaced by
smoothness of the limit if certain remainder terms are small. This insight is useful
because the limiting objective functions are often expectations that are smoother
than their sample counterparts.

4*Contrary to a statement given in Amemiya (1985), the correction term is needed here.
43The normalization by the total sample size means that one can obtain asymptotic confidence
intervals as described in Section 1, with the n given there equal to the total sample size. This procedure
is equivalent to ignoring the n divisor in Section 1and dropping the n from the probit asymptotic variance
estimator (as is usually done in canned programs) and from the lead term in eq. (6.13).
Ch. 36: Large Sample Estimation and Hypothesis Testing 2185

To illustrate how this approach works it is useful to give a heuristic description.


The basic idea is the approximation

&@)- e^,&)r &e - &J + Qo(4 - Qo(4J


E&e - (3,) + (0 - O,)H(B - 8,)/2,
(7.1)

where 6, is a derivative, or approximate derivative, of Q,,(e) at B,,, H = V,,Q,(B,),


and the second approximate equality uses the first-order condition V,QO(e,) = 0 in
a second-order expansion of QO(0). This is an approximation of Q,(e) by a quadratic
function. Assuming that the approximation error is of the right order, the maximum
of the approximation should be close to the true maximum, and the maximum of
the approxi_mation is 8 = B0 - H- fi,,. This random yariable will be asymptotically
normal if D, is, so that asymptotic normality of 0 will follow from asymptotic
normality of its approximate value 8.

7.1. The basic results

In order to make the previous argument precise the approximation error in eq. (7.1)
has to be small enough. Indeed, the reason that eq. (7.1) is used, rather than some
other expansion, is because it leads to approximation errors of just the right size.
Suppose for discussion Purposes that 6,, = V&(6,), where the derivative exists with
probability one. Then Q,(e) - Q,(e,) - 6;(0 - 0,) goes to zero faster than 118- doI/
does, by the definition of a derivative. Similarly, QO(e) - QO(O,) goes to zero faster
than ((8 - 0, (([since V,Q,(B,) = 01. Also, assuming that J%@,,(e) - Qo(@] is boun-
ded in probability for each 8, as would typically be the case when Q,(e) is made
up of sample averages, and noting that $0, bounded in probability follows by
asymptotic normality, it follows that the remainder term,

k(e) = JtrcOm - O,w - 6,te - 0,) - mv3 - ade,w Ii 8 - e. II, (7.2)

is bounded in probability for each 0. Then, the combination of these two properties
suggests that l?,(e) goes to zero as the sample size grows and 8 goes to BO,a stochastic
equicontinuity property. If so, then the remainder term in eq. (7.1) will be of order
oP( I/0 - 8, I//& + II8 - 8, /I*). The next result shows that a slightly weaker condition
is sufficient for the approximation in eq. (7.1) to lead to asymptotic normality of 8.

Theorem 7.1

Suppose that Q.(8) 2 supti&(@ - o&r- ), 8 A 8,, and (i) QO(0) is maximized on
@ at 8,; (ii) 8, is an interior point of 0, (iii) Qe(0) is twice differentiable at 8,
2186 W.K. Newey and D. McFadden

with nonsingular second derivative H; (iv) &fi 5 N(O,Q; (v) for any 6, +O,
s~p~~~-~,,,,,~~R,(e)/[l + JnllO - ~,I111 LO. Then &(e- &J ~N(O,H-~H-).

The proof of this result is given in Section 7.4. This result is essentially a version of
Theorem 2 of Pollard (1985) that applies to any objective function rather than just
a sample average, with an analogous method of proof. The key remainder condition
is assumption (v), which is referred to by Pollard as stochastic diflerentiability. It is
slightly weaker than k,(O) converging to zero, because of the presence of the
denominator term (1 + & /I8 - 8, II)- , which is similar to a term Huber (1967)
used. In several cases the presence of this denominator term is quite useful, because it
leads to a weaker condition on the remainder without affecting the conclusion.
Although assumption (v) is quite complicated, primitive conditions for it are avail-
able, as further discussed below.
The other conditions are more straightforward._Consistency can be shown using
Theorem 2.1, or the generalization that allows for 8 to be an approximate maximum,
as suggested in the text following Theorem 2.1. Assumptions (ii) and (iii) are quite
primitive, although verifying assumption (iii) may require substantial detailed work.
Assumption (iv) will follow from a central limit theorem in the usual case where
6, is equal to a sample average.
There are several examples of GMM estimators in econometrics where the moments
are not continuous in the parameters, including the simulated moment estimators
of Pakes (1986) and McFadden (1989). For these estimators it is useful to have more
specific conditions than those given in Theorem 7.1. One way such conditions can
be formulated is in an asymptotic normality result for minimum distance estimators
where g,(e) is allowed to be discontinuous. The following is such a result.

Theorem 7.2
Suppose that $,,(@I?o.(@ < info,0Q.(8)i@&(8) + o,(n-), 8-% 8,, and I? L W, W is
positive semi-definite, where there is go(e) such that (i) gO(O,) = 0; (ii) g,,(d) is differen-
tiable at B0 with derivative G such that GWG is nonsingular; (iii) 8, is an interior
point of 0; (iv) +g,(e,) L NO, z3; (v) for any 6, + 0, supllO- OolI
$6,& II8,u4 -
$,(e,)-g&III/[1 +fiIIe-e,II] LO. Then ,/(k@wV[O,(GWG)-GR
WZWG (GWG)-1.

The proof is given in Section 7.4. For the case where Q,(e) has the same number of
elements as 8, this result is similar to Hubers (1967), and in the general case is like
Pakes and Pollards (1989), although the method of proof is different than either of
these papers. The conditions of this result are similar to those for Theorem 7.1. The
function go(e) should be thought of as the limit of d,(e), as in Section 3. Most of the
conditions are straightforward to interpret, except for assumption (v). This assump-
tion is a stochastic equicontinuity assumption analogous to the condition (v)
of Theorem 7.1. Stochastic equicontinuity is the appropriate term here because
when go(e) is the pointwise limit of $,,(e), i.e. d,(e) Ago(B) for all 0, then for all
Ch. 36: Laryr Sample Estimation and Hypothesis Testing 2187

8 # 8,, & 11Q,(O) - &,(8,) - go(H) II/[ 1 + Ji )I0 - B. II] AO. Thus, condition (v) can
be thought of as an additional requirement that this convergence be uniform over
any shrinking neighborhood of BO.As discussed in Section 2, stochastic equicontinuity
is an essential condition for uniform convergence.
Theorem 7.2 is a special case of Theorem 7.1, in the sense that the proof proceeds
by showing that the conditions of Theorem 7.1 are satisfied. Thus, in the nonsmooth
case, asymptotic normality for minimum distance is a special case of asymptotic
normality for an extremum estimator, in contrast to the results of Section 3. This
relationship is the natural one when the conditions are sufficiently weak, because a
minimum distance estimator is a special case of a general extremum estimator.
For some extremum estimators where V,&,(0) exists with probability one it
is possible to, use Theorem 7.2 to show asymptotic normality, by setting i,,(e)
equal to V,Q,(@. An example is censored least absolute deviations, where
V,&(0) = n - l C;= 1xil(xj8 > 0)[ 1 - 2.l(y < xe)]. However, when this is done there
is an additional condition that has to be checked, namely that )/V,Q,(0) )/* d
inf,, 8 11V,&(e) II2 + o,(n- ), for which it suffices to show that J&V&,(@ L 0. This
is an asymptotic first-order condition for nonsmooth objective functions that
generally has to be verified by direct calculations. Theorem 7.1 does not take this
assumption to be one of its hypotheses, so that the task of checking the asymptotic
first-order condition can be bypassed by working directly with the extremum
estimator as in Theorem 7.1. In terms of the literature, this means that Hubers
(1967) asymptotic first-order condition can be bypassed by working directly with
the extremum formulation of the estimator, as in Pollard (1985). The cost of doing
this is that the remainder in condition (v) of Theorem 7.1 tends to be more compli-
cated than the remainder in condition (v) of Theorem 7.2, making that regularity
condition more difficult to check.
The most complicated regularity condition in Theorems 7.1 and 7.2 is assumption
(v). This condition is difficult to check in the form given, but there are more primitive
conditions available. In particular, for Q,(0) = n Cy= 1 q(z,, 8), where the objective
function is a sample average, Pollard (1985) has given primitive conditions for
stochastic differentiability. Also, for GMM where J,(0) = C;= i g(z, 0)/n and go(B) =
E[g(z, 0)], primitive conditions for stochastic equicontinuity are given in Andrews
(1994) chapter of this handbook. Andrews (1994) actually gives conditions for a
stronger result, that s~p,,~_~,,, da./% )Id,(0) - .&(0,) - go(e) 1)L 0, i.e. for (v) of
Theorem 7.2 without the denominator term. The conditions described in Pollard
(1985) and Andrews (1994) allow for very weak conditions on g(z, 0), e.g. it can even
be discontinuous in 8. Because there is a wide variety of such conditions, we do not
attempt to describe them here, but instead refer the reader to Pollard (1985) and
Andrews (1994).
There is a primitive condition for stochastic equicontinuity that is not covered in
these other papers, that allows for g(z, 8) to be Lipschitz at 0, and differentiable with
probability one, rather than continuously differentiable. This condition is simple
but has a number of applications, as we discuss next.
2188 W.K. Newey and D. McFadden

7.2. Stochastic equicontinuity for Lipschitz moment,functions

The following result gives a primitive condition for the stochastic equicontinuity
hypothesis of Theorem 7.2 for GMM, where Q,(e) = nP Cy= 1g(Zi, 0) and go(O)=
ECg(z,
@I.

Theorem 7.3

Suppose that E[g(z, O,)] = 0 and there are d(z) and E > 0 such that with probability one,
r(z, d) = IIdz, Q)- & 0,) - W(fl - 44 II/IIQ- 0, I/+ 0 as Q+ oo,~C~W,,,-,~,,
Ccx
r(z,B)] < a, and n- Cr= 1d(zi) LE[d(z)]. Then assumptions (ii) and (v) of
Theorem 7.2 are satisfied for G = E[d (z)].

Proof

For any E > 0, let r(z,E) = sup, o-00, BEIIr(z, 0) 11.With probability one r(z, E) + 0 as
E+ 0, so by the dominated convergence theorem, E[r(z, E)] + 0 as E+ 0. Then for
0 + 0, and s = IIQ- 4, II, IIad@- sd4) - (30 - 0,) I/= IIEC&, 0)- g(z,0,) - 44 x
(0 - O,)]11
d E[r(z, E)] II0 - 0, /I+O, giving assumption (iii). For assumption (v), note
that for all (5with /I8 - 0, I/ < 6,, by the definition of r(z, E) and the Markov inequality,
Jn II4,(@- &(Ho)- go(@I//Cl + fi II0 - 0, II1 d Jn CIICY=1{d(zi) - EC&)1 } x
(0- f&)/nII + {C1=Ir(zi, Wn + ECr(z, S.)l > II0 - 00IIl/(1 + Jn II0 - 00II1d IICy=1
j A(zJ - J%A(z)lj/n II + ~,@Cr(z,%)I) JS 0. Q.E.D.

The condition on r(z, Cl) in this result was formulated by Hansen et al. (1992). The
requirement that r(z, 0) --f 0 as 8 + B0 means that, with probability one, g(z, 19)is
differentiable with derivative A(z) at BO.The dominance condition further restricts
this remainder to be well behaved uniformly near the true parameter. This uniformity
property requires that g(z, e) be Lipschitz at B0 with an integrable Lipschitz constant.44
A useful aspect of this result is that the hypotheses only require that Cr= 1A(zi) 3
E[A(z)], and place no other restriction on the dependence of the observations. This
result will be quite useful in the time series context, as it is used in Hansen et al.
(1992). Another useful feature is that the conclusion includes differentiability of go(e)
at B,, a bonus resulting from the dominance condition on the remainder.
The conditions of Theorem 7.3 are strictly weaker than the requirement of Section
3 that g(z, 0) be continuously differentiable in a neighborhood of B0 with derivative
that is dominated by an integrable function, as can be shown in a straightforward
way. An example of a function that satisfies Theorem 7.3, but not the stronger
continuous differentiability condition, is the moment conditions corresponding to
Hubers (1964) robust location estimator.

44For44 = SUPI~~~~,,~
< &tiz, 01, the triangle and Cauchy-Schwarz inequalities imply 1)~(z,o)- g(~,0,) I/<
Ill&) II + &)I II0 - 6, Il.
Ch. 36: Largr Sample Estimution and Hypothesis Testing 2189

Huhers robust locution estimator: The first-order conditions for this estimator are
n~~~~,p(yi~~)=Oforp(c)=-l(cd-l)+l(-l~~~l)~+1(~31).Thisesti-
mator will be consistent for B0 where y is symmetrically distributed around 8,. The
motivation for this estimator is that its first-order condition is a bounded, continuous
function of the data, giving it a certain robustness property; see Huber (1964). This
estimator is a GMM estimator with g(z, 0) = p(y - 0). The function p(c) is differen-
tiable everywhere except at - 1 or 1, with derivative P,(E) = l( - 1 < E < 1). Let
d(z)= -p,(y-U,).ThenforE=y-H,and6=H,-U,

r(z, 0) = Ig(z, Q)- dz, RJ)- d (z)(fl~ 4J l/l Q- b I


= IP(E+ 4 - PM - PEWI/ IdI
=~[-1(E+6<-1)+1(&~-1)]+[1(E+~>1)-1(E~1)]

+[l(-l<E+6<1)-1(-1<E<l)](E+S)1/1~1.

ForO<6< 1,

r(z,~)=~1(-1-~<~d-1)+1(1-6d~<1)+[1(-1-~6~~-1)

-l(l -6<e< l)](E+fi)I/lfil

~1(-6~E+1~0)(~+~E+1~)//~~+1(-~6~-1<0)(/&-1~+6)/~6~

62[1(-6<E+ 1 <O)+ l(-66E- 1<0)]<2.

Applying an analogous argument for negative - 1 d 6 < 0 gives r(z,O) <


2[l(lc-lId/6~)+l(le+lI~~6~)]d4. Therefore, if Prob(&=l)=O and
Prob(& = - 1) = 0 then r(z, 0) + 0 with probability one as 0 -+ 0, (i.e. as 6 -+ 0). Also,
r(z, fl) < 4. Thus, the conditions of Theorem 7.3 are satisfied.

Other examples of estimators that satisfy these conditions are the asymmetric least
squares estimator of Newey and Powell (1987) and the symmetrically trimmed
estimators for censored Tobit models of Powell (1986) and Honori: (1992). All of
these examples are interesting, and illustrate the usefulness of Theorem 7.3.

7.3. Asymptotic variance estimation

Just as in the smooth case the asymptotic variance of extremum and minimum
distance estimators contain derivative and variance terms. In the smooth case the
derivative terms were easy to estimate, using derivatives of the objective functions.
In the nonsmooth case these estimates are no longer available, so alternatives must
be found. One alternative is numerical derivatives.
For the general extremum estimator of Theorem 7.1, the matrix H can be
2190 W.K. Newey and D. McFadden

estimated by a second-order numerical derivative of the objective function. Let e,


denote the ith unit vector, E, a small positive constant that depends on the sample
size, and fi the matrix with i, jth element

fiij = [Q(o^+ eis, + ejs,) - Q(@- eis, + ejs,) - Q(@+ eie, - eje,)

+ Q(B- eis, - ejsn)]/4$.

Under certain conditions on E,,, the hypotheses of Theorem 7.1 will suffice for
consistency of G for the H in the asymptotic variance of Theorem 7.1. For a
minimum distance estimator a numerical derivative estimator G of G hasjth column

Gj = [i(B + ejc,) - d(@


- eje,)]/2s,.

This estimator will be consistent under the conditions of Theorem 7.2. The following
result shows consistency:

Theorem 7.4

Suppose that E, + 0 and E,,& + co. If the conditions of Theorem 7.1 are satisfied
then fi AH. Also, if the conditions of Theorem 7.2 are satisfied then G 5 G.

This result is proved in Section 7.4. Similar results have been given by McFadden
(1989), Newey (1990), and Pakes and Pollard (1989).
A practical problem for both of these estimators is the degree of difference (i.e.
the magnitude of s,) used to form the numerical derivatives. Our specification of the
same E, for each component is only good if 6 has been scaled so that its components
have similar magnitude. Alternatively, different E, could be used for different compo-
nents, according to their scale. Choosing the size of&,, is a difficult problem, although
analogies with the choice of bandwidth for nonparametric regression, as discussed
in the chapter by Hardle and Linton (1994), might be useful. One possibility is to
graph some component as a function of E, and then choose E, small, but not in a
region where the function is very choppy. Also, it might be possible to estimate
variance and bias terms, and choose E, to balance them, although this is beyond the
scope of this chapter.
In specific cases it may be possible to construct estimators that do not involve
numerical differentiation. For example, in the smooth case we know that a numerical
derivative can be replaced by analytical derivatives. A similar replacement is often
possible under the conditions of Theorem 7.3. In many cases where Theorem 7.3
applies, g(z,@ will often be differentiable with probability one with a derivative
V,g(z, 0) that is continuou_s in 8 with probabil$y one and dominated by an integrable
function. Consistency of G = n- Cy= 1V,g(z, 0) will then follow from Lemma 4.3. For
example, it is straightforward to show that this reasoning applies to the Huber loca-
tionestimator,withV,g(z,O)=-1(-1~y-~<l)and~=~~~,1(-l~yi-~~l)/n.
Ch. 36: Lurye Sample Estimation and Hypothesis Testing 2191

Estimation of the other terms in the asymptotic variance of 8 can usually be


carried out in the way described in Section 4. For example, for GMM the moment
function g(z,fI) will typically be continuous in 8 with probability one and be
dominated by a square integrable function, so that Lemma 4.3 will imp_ly the
consistency of fi = Cr= 1 g(zi, 6)g(zi, @/II. Also, extremum estimators where Q,(0) =
nP Cr= lq(z, U), q(z, 0) will usually be differentiable almost everywhere, and Lemma
4.3 will yield consistency of the variance estimator given in eq. (4.1).

7.4. Technicalities

Because they are long and somewhat complicated, the proofs of Theorems 7.1,7.2,
and 7.4 are given here rather than previously.

Proof of Theorem 7.1

Let Q(e) = Q,(e) and Q(0) = Qo(@. First it will be proven that $118 - 8, /I = O,(l),
i.e. that 8is &-consistent. By Q(0) having a local maximum at 8,, its first derivative
is zero at O,, and hence Q(0) = Q(0,) + (0 - 8,)H(fI - (!I,)/2 + o( /I 0 - 8, II2). Also, H
is negative definite by fI,, a maximum and nonsingularity of H, so that there is C > 0
and a small enough neighborhood of B0 with (t9 - 8,)H(B - 8,)/2 + o( 110- 8,II 2, <
- C 110- 0, I/2. Therefore, by 8 A 8,, with probability approaching one (w.p.a.l),
Q(8)< Q(d,) - C II& 8, II2. Choose U, so that 8~ U, w.p.a.1, so that by (v) $I&Q),I <
(1 +~IIo^-~oll)~pu).

0 d &6) - &I,) + o&n- ) = Q(8) - Q&J + 6(& 0,) + 116 @,,I1i?(6) + o&n-)

d -C/18-8,112+ llfil( l&e,11 + Ile-e,11(1 +Jni18-8,/1)o,(n-l2)+Op(lZ-)

d -CC+o,(l)]1~8-e,112+0,(n-2)11~-~o~I +o,(n?).

Since C + ~~(1) is bounded away from zero w.p.a.1, it follows that /I6- 8, (I2 <
O&n- 12)II8 - 0, II + o&n- ), and hence, completing the square, that [ iI&- 8, II +
0plnm2)]2 < O&K). Taking the square root of both sides, it follows that
I lld- Boll + O,(n- 12)ld O,(n- 12), so by the triangle inequality, 11G-S,, 11<I Il8-0,l~ +
0,(n-12)1 + 1- 0,(n-12)l 6 0,(n-12).
Next, let e= H,, - H- 6, and note that by construction it is $-consistent. Then
by &-consistency of 8, twice differentiability of Q(0), and (v) it follows that

2[Q(8) - Q(0,) + Q&J] = (s - &JH(o^ - Q,) + 26(8 - 0,) + o&n- )

=(8-Q,)H(&-&)-2(8-8,)H&8,)+0&n-).

Similarly, 2[!j(& - f&O,) + Q(&)] = (8 - O,,)H@ - e,,) + 26(8 - 0,) + ~,(n- l) =


2192 W.K. Nwry und D. McFadden

-(H- H,)H(& H,) + o,(nP1). Then since 0 is contained within 0 w.p.a.1,


2[&8)- Q^(e,) + Q(e,)] -2[&& ~ dc!IO) + Q(fI,)] > o&n- ), so by the last equation
and the corresponding equation for 0,

o,(nP) < (e- H,)H(8- 0,) - 2(H1- 8,)H(8- e,) + (S- e,)%@- 0,)
=(H^-@H(H^-G)& -CIIo^-el[.

Therefore, ((Jr1(6 - 0,) ~ ( - H - Jr&) jl = fi Ij8 - 811% 0, so the conclusion

follows by - H-&6 -+d N(0, H flH _ ) and the Slutzky theorem. Q.E.D.

Proof of Theorem 7.2

Let g(0) = d,(0) and y(8) = go(B). The proof proceeds by verifying the hypotheses of
Theorem 7.1, for Q(0) = - g(H)Wy(0)/2, Q(0) = - 4(tI)f?4(@/2 + d^(@, and d^(B)
equal to a certain function specified below. By (i) and (ii), Q(0) = - [G(e - 19,) +
4 II6 - 41 II)IWG(@- 41)+ o(II 0 - JoII)I/2 = Q(6.J+ (0- AJW~- 4JP +
o(/IB- 0, )/2),for H = - G WC and Q(0,) = 0, so that Q(0) is twice differentiable at
do. Also, by W positive semi-definite and G WC nonsingular, H is negative definite,
implying that there is a neighborhood of 8, on which Q(0) has a unique maximum
(of zero) at 6= BO. Thus, hypotheses (i)-(ii) of Theorem 7.1 are satisfied. By the
Slutzky theorem, 6 = - G@$g(0,) % N(0,0) for B = G'WZ WC, so that hypo-
thesis (v) of Theorem 7.1 is satisfied. It therefore remains to check the initial supposi-
tion and hypothesis (v) of Theorem 7.1.
Let e(0) = [Q(H) - .&fI,) - g(H)]/[ 1 + $11 H - 8, I/1. Then

Let 4?(e) = - 4(@@&@)/2 + &(6)$6(6)/2 + ~(8,)%6((8). For any 6, + 0, by


(v), ~~P,,e-so,,~a,IQ1(~)-~-~(~)~~(~)/~~l~~,(~)s~P,,,_,,,,,{llE*(~)I/ ll4Qll+
O&l - 12))= o&n- ), so that by (9, o(6) 3 SU~,,~_~,,,G,,&I) - o&n-). Thus,
the initial supposition of Theorem 7.1 is satisfied. To check hypothesis (v), note that
by E*(e,)= 0, for i(0) as defined above,

&I k(e)lC1
+& II1Id t pj(e),
III9- UC3

r^I(4 = JG& II0 - 8, II + II0 - &I II2)lE*(mq~)I/[ IIH - 8 I/(1 ;t J )I8 - 8, II)],

f2(4 = hICs(@ + G(Q- 4@iW,)llII IIQ- 0, II(1 + & 110- fl, II)],
Ch. 36: Large Sample Estimation and Hypothesis Testing 2193

f3(@ = n I Cd@ + 44mm) l/(1 + 4 II 0 - 0, II),

tde) = JliIg(e)f~~(e)I/lle-eoii,

Then for 6,-O and U = (0: j(8-0,ll <<s,,}, sup,F,(e)6Cn.sup,l(~((B)(1211 till =0,(l),
sup,p,(e)d~sup,{0(lie - e,ii)ii @II iw,)ii
=WW(II~ - 4mw = 0,(u
~~P,ww ~~uPu~li~(e)l~iJTIlie-eOll)+JtIil~(eo)~l~~~ ~ll~~~,JIS~w~~l~
{SUpa~(l~e-eolo+~,(l))o,(l)=o,(l), ~~~,r*,(~~~~~~,i(li~(~~l~il~e-~oli~ll~~~
~~h&llwl~ = 0,(l), and sw,(e) 6 ~~P,(lk7(eWile - eol12)il@ - w = 0,(l).
Q.E.D.

Proof of Theorem 7.4

Let a be a constant vector. By the conclusion of Theorem 7.1, II8 + a&,- do I( = O,,(E,).
Then by hypothesis (v),

Io(e^
+ &,,a)- &4J - Q@ + q) + Q&J I
d /I6 + aEn- e. I/ [I JW + w)l + IIfi II IIe+ a&,- 0, II1
~qJ(~,){u +Jw+ &,a- doII)qdll&) + O&,l&)) = O&,Z)~

Also, by twice differentiability of Q(0) at 8,,

IE,~[Q(~+
&,a) - Q(O,)] -aHa/2(

= IE; 2[(6 + &,a- e,)viq6 + &,a- e,)/2 + o( I/8 + &,a - 8,1))] - aHa/21
G jq1(f9- O,)Ha( + lE;2(& e,yH@- e,)l + 0,(l) = 0,(i).

It then follows by the triangle inequality that

fiij 3 [2(ei + ej)H(ei + ej) - (ei - ej)H(ei - ej) - (ej - eJH(ej - eJ]/8

= 2[eiHe, + e;Hej - ejHe, - eJHej]/8 + eiHej

= eiHej = Hi,,

giving the first conclusion. For the second_conclusion, it follows from hypothesis (v)
of Theorem 7.2, similarly to the proof for H, that /IJ(fl + c,a) - Q(O,)- g(t?+ &,a)(I <
(1 +,& I/8 + &,,a- 0, I()0&n,- li2) = O&E; ), and by differentiability of g(0) at B0 that
/Ig(d + q,a)/c,, - Ga I/ d /I G(B - f&)/c, 11+ O(E; l I/8 + Ena- do )I) = op( 1). The second
conclusion then follows by the triangle inequality. Q.E.D.
2194 W.K. Nrwev and D. McFadden

8. Semiparametric two-step estimators

Two-step estimators where the first step is a function rather than a finite-dimensional
parameter, referred to here as semiparametric two-step estimators, are of interest
in a number of econometric applications. 45 As noted in Section 5, they are useful
for constructing feasible efficient estimators when there is a nuisance function
present. Also, they provide estimators for certain econometric parameters of interest
without restricting functional form, such as consumer surplus in an example discus-
sed below. An interesting property of these estimators is that they can be Jn-
consistent, even though the convergence rate for the first-step functions is slower
than fi. This section discusses how and when this property holds, and gives
regularity conditions for asymptotic normality of the second-step estimator. The
regularity conditions here are somewhat more technical than those of previous
sections, as required by the infinite-dimensional first step.
The type of estimator to be considered here will be one that solves

n- l t g(zi, 8, 9) = 0,
i=l

where f can include infinite-dimensional functions and g(z, 0, y) is some function of


a data observation z, the parameters of interest 0, and a function y. This estimator
is exactly like that considered in Section 6, except for the conceptual difference that
y is allowed to denote a function rather than a finite-dimensional vector. Here,
g(z,U,y) is a vector valued function of a function. Such things are usually referred
to as functionals.
Examples are useful for illustrating how semiparametric two-step estimators can
be fit into this framework.

V-estimators: Consider a simultaneous equations model where the residual p(z, d)


is independent of the instrumental variables x. Let u(x,p) be a vector of func-
tions of the instrumental variables and the residual p. Independence implies that
EC~{x,pk4,)]1 = ECSajx,p(Z,e,)}dF,(P)Iwhere F,(z) is the distribution of a
single observation. For example, if a(x,p) is multiplicatively separable, then this
restriction is that the expectation of the product is the product of the expectations.
This restriction can be exploited by replacing expectations with sample averages and
dF(Z) with an estimator, and then solving the corresponding equation, as in

(8.2)

where m(z,, z2, 0) = a[~,, p(z,, Q)] - a[~,, p(z,, O)]. This estimator has the form given

45This terminology may not be completely consistent with Powells chapter of this handbook.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2195

in eq. (8.1), where y is the CDF of a single observation, y(z, 0, y) = s m(z, F,O)dy(F),
and $ is the empirical distribution with y(f) = Cr=, l(zi < 5)/n. It is referred to as a
V-estimator because double averages like that in eq. (8.2) are often referred to as
V-statistics [Serfling (1980)]. V-statistics are related to U-statistics, which have been
considered in recent econometric literature [e.g. Powell et al. (1989) and Robinson
(1988b)] and are further discussed below.
The general class of V-estimators were considered in Newey (1989). If a(x,p) is
multiplicatively separable in x and p then these estimators just set a vector of sample
covariances equal to zero. It turns out though, that the optimal u(x, p) may not be
multiplicatively separable, e.g. it can include Jacobian terms, making the generaliza-
tion in eq. (8.2) of some interest. Also, Honor6 and Powell (1992) have recently
suggested estimators that are similar to those in equation (8.2), and given conditions
that allow for lack of smoothness of m(z,, z2, H) in fl.

Nonpurumetric approximate consumer surplus estimation: Suppose that the demand


function as a function of price is given by h,(x) = E[qlx], where 4 is quantity
demanded and x is price. The approximate consumer surplus for a price change
from a to h is Ii,h,(x)dx. A nonparametric estimator can be constructed by replacing
the true condttronal expectation by a nonparametric estimator. One such is a kernel
estimator of the form i(x) = EYE 1q&&x - xi)/C;= 1K,(x - xi), where K,(u) =
a-K(o/a), r is the dimension of x, K(u) is a function such that JK(u)du = 1, and 0
is a bandwidth term that is chosen by the econometrician. This estimator is a
weighted average of qi, with the weight for the ith observation given by K,(x - xi)/
cjn= 1K,(x - xj). The bandwidth 0 controls the amount of local weighting and hence
the variance and bias of this estimator. As 0 goes down, more weight will tend to
be given to observations with Xi close to x, lowering bias_, but raising variance by
giving more weight to fewer observations. Alternatively, h(x) can be interpreted as
a ratio estimator, with a denominator j(x) = n-l Cr= 1K,(x - xi) that is an esti-
mator of the density of x. These kernel estimators are further discussed in Hardle
and Linton (1994).
A kernel estimator of h,(x) can be used to construct a consumer surplus estimator
of the form t?=isi(x)dx. This estimator takes the form given in eq. (8.1), for
y = (r,, y2) where yr(x) is a density for x, yz(x) is the product of a density for x and
a conditional expectation of y given x, g(z, 8, y) = Jf:[Y2(x)IY1(x)]dx - 6, y,(x) =
n- Cy= 1 K,(x - xi) and f2(x) = K x1= 1q,KJx - xi). This particular specification,
where y consists separately of the numerator and denominator of i(x), is convenient
in the analysis to follow.

In both ofthese examples there is some flexibility in the formulation of the estimator
as a solution to eq. (8.1). For V-estimators, one could integrate over the first
argument in a[~,, p(z,, 0)] rather than the second. In the consumer surplus example,
one could set y = h rather than equal to the separate numerator and denominator
terms. This flexibility is useful, because it allows the estimator to be set up in a way
2196 W.K. Newey and D. McFadden

that is most convenient for verifying the regularity conditions for asymptotic
normality.
This section will focus on conditions for asymptotic normality, taking consis-
tency as given, similarly to Section 6. Consistency can often be shown by applying
Theorem 2.1 directly, e.g. with uniform convergence resulting from application of
Lemma 2.4. Also, when y(z, 0, y) is linear in 0, as in the consumer surplus example,
then consistency is not needed for the asymptotic normality arguments.

8.1. Asymptotic normality and consistent variance estimation

To motivate the precise results to be given, it is helpful to consider an expansion


for 8. Expanding eq. (8.1) and solving for $(@- 0,) gives

Ji(8- 0,) = - n-l YCz, Y) = g(Z, 003 y),

(8.3)

where e is the mean value. The usual (uniform) convergence arguments, when
combined with consistency of e and y*, suggest that 6 x1= 1V,g(zi, e, 9) 3
E[V,g(z,tI,,y,)] = G,. Thus, the behavior of the Jacobian term in eq. (8.3) is not
conceptually difficult, only technically difficult because of the presence of non-
parametric estimates. The score term C;= r g(zi, y)/& is much more interesting and
difficult. Showing asymptotic normality requires accounting for the presence of the
infinite-dimensional term 7. Section 6 shows how to do this for the finite-
dimensional case, by expanding around the true value and using an influence
function representation for 9. The infinite-dimensional case requires a significant
generalization. One such is given in the next result, from Newey (1992a). Let 11 y /I
denote a norm, such as ~up~+~~ /Iy(x) I/.

Theorem 8.1

Suppose that EC&, YJI = 0, ECIIdz, 14 II21 < co, and there is 6(z) with E[G(z)] = 0,
E[ II6(z)l12] < co, and (i) (linearization) there is a function G(z,y - yO) that is
linear in y - ye such that for all y with IIy - y. II small enough, IIg(z, y) - g(z, ye) -
G(z, y - yO)II d b(z) IIy - y. II2, and E[b(z)]& 11
y*- y. II2 3 0; (ii) (stochastic equicon-
tinuity) C;= 1 [G(zi, y^- yO) - j G(z, $ --?/O)dFo]/fi 3 0; (iii) (mean-square differen-
tiability) there is 6(z) and a measure F such that EC&z)] = 0, E[ II6(z) II2] < co and
for all IIy - y. I/ small enough, JG(z, y*- Ye)dF, = !6(z)dF; (iv) for the empirical
distribution F [F(z) = n- C;= 1 l(z, d z)], &[j&z)dF - S6(z)dF] -%O. Then
Cy= 1 Stzi7 7)/J n3 N(0, f2), where R = Var [g(zi, yO) + 6(zi)].
Ch. 36: Large Sumple Estimation und Hypothesis Testing 2197

Proof

It follows by the triangle inequality that Cr= 1[g(zi> y*)- g(Zi, ~0) - S(Zi)]/Ji JL 0,
and by the central limit theorem that Cy= 1[g(Zi, yo) + 6(Zi)]/& A N(O, 0).
Q.E.D.

This result is just a decomposition of the remainder term Cr= I g(Zi>


p)/,,h- EYEI
Cdzi~
~0)+ &zJll& As will be illustrated
for the examples, it provides a useful
outline of how asymptotic normality of a semiparametric two-step estimator can
be shown. In addition, the assumptions of this result are useful for understanding
how Cr= 1dzi, Y)lJ n can have a limiting distribution, even though y*is not fi-
consistent.
Assumption(i) requires that the remainder term from a linearization be small. The
remainder term in this condition is analogous to g(z, y) - g(z, yO)- [V,g(z, y,J](y - yO)
from parametric, two-step estimators. Here the functional G(z, y - ye) takes the
place of [V,g(z, yO)](y - yO). The condition on this remainder requires either that it
be zero, where b(z) = 0, or that the convergence rate of 9 be faster than npii4, in
terms of the norm /Iy /I. Often such a convergence rate will require that the under-
lying nonparametric function satisfy certain smoothness restrictions, as further
discussed in Section 8.3.
Assumption (ii) is analogous to the requirement for parametric two-step estima-
tors that {n C;= 1V,g(zi, y,,) - E[V,g(z, y,-J] I(? - ye) converge to zero. It is referred
to as a stochastic equicontinuity condition for similar reasons as condition (v) of
Theorem 7.2. Andrews (1990) has recently given quite general sufficient conditions
for condition (ii). Alternatively, it may be possible to show by direct calculation that
condition (ii) holds, under weaker conditions than those given in Andrews (1990).
For example, in the V-estimator example, condition (ii) is a well known projection
result for V-statistics (or U-statistics), as further discussed in Section 8.2. For kernel
estimators, condition (ii) will follow from combining a V-statistic projection and a
condition that the bias goes to zero, as further discussed in Section 8.3.
Both conditions (i) and (ii) involve second-order terms. Thus, both of these
conditions are regularity conditions, meaning that they should be satisfied if g(z, y)
is sufficiently smooth and y*sufficiently well behaved. The terms in (iii) and (iv) are
first-order terms. These conditions are the ones that allow I;= 1 g(z, y^)/Ji to be
asymptotically normal, even though y^may converge at a slower rate. The key
condition is (iii), which imposes a representation of JG(z, y*- ye)dF, as an integral
with respect to an estimated measure. The interpretation of this representation is
that [G(z, jj - y,JdF, can be viewed as an average over some estimated distribution.
As discussed in Newey (1992a), this condition is essentially equivalent to finiteness
of the semiparametric variance bound for estimation of J G(z, y - yo)dF,. It is
referred to as mean-square differentiability because the representation as an
integral lG(z)dF(z, y) means that if dF(z, y) 12 has a mean-square derivative then
2198 W.K. Newey and D. McFadden

{G(z)dF(z, y) will be differentiable in y, as shown in Ibragimov and Hasminskii


(1981). This is an essential condition for a finite semiparametric variance bound, as
discussed in Van der Vaart (1991), which in turn is a necessary condition for
Jn-consistency of j G(z, y*- yo)dF,. If jG(z,$ - yo)dF, cannot be viewed as an
average over an estimated distribution, then it will not be &-consistent. Thus,
condition (iii) is the key one to obtaining &-consistency.
Condition (iv) requires that the difference between the estimator F and the
empirical distribution be small, in the sense of difference of integrals. This condition
embodies a requirement that p be nonparametric, because otherwise it could not
be close to the empirical measure. For kernel estimators it will turn out that part
(iv) is a pure bias condition, requiring that a bias term goes to zero faster than l/fi.
For other estimators this condition may not impose such a severe bias requirement,
as for the series estimators discussed in Newey (1992a).
An implication of conditions (iii) and (iv) is that Jil@z)d(F - F,) = JG(z)d&.
(6 -F,) converges in distribution to a normal random vector, a key result. An
alternative way to obtain this result is to show that fi(@ - F,) is a stochastic
process that converges in distribution in a metric for which !6(z)d(*) is continuous,
and then apply the continuous mapping theorem. 46 This approach is followed in
Ait-Sahalia (1993).
One piece of knowledge that is useful in verifying the conditions of Theorem 8.1
is the form of 6(z). As discussed in Newey (1992a), a straightforward derivative
calculation is often useful for finding 6(z). Let v denote the parameters of some
general distribution where q0 is equal to the truth, and let y(q) denote the true value
of y when 7 are the true parameters. The calculation is to find 6(z) such that
V,jg[z, y(q)] dF, = E[S(z)Sh], where the derivative is taken at the true distribution.
The reason that this reproduces the 6(z) of Theorem 8.1 is that condition (i) will
imply that V,Jg[z, y(q)]dF, = V,lG[z, y(q) - yo]dF, [under the regularity condition
that /Iy(q) - y 11is a differentiable function of y], so (iii) implies that V,sg[z, Y(q)]dF, =
V,jG(z)dF(q) = E[S(z)Sk]. This calculation is like the Gateaux derivative calculation
discussed in Huber (1981), except that it allows for the distributions to be continuous
in some variables. With 6(z) in hand, one can then proceed to check the conditions
of Theorem 8.1. This calculation is even useful when some result other than Theorem
8.1 is used to show asymptotic normality, because it leads to the form of the
remainder term Cr= 1 [g(zi, $) - g(zi, yO) - 6(zi)]/fi that should be small to get
asymptotic normality.
Theorem 8.1 can be combined with conditions for convergence of the Jacobian
to obtain conditions for asymptotic normality-of 4, as in the following result.

46The continuous mapping theorem states that if Y(n) AZ and h(y) is continuous on the support of
Z then hLY(n)] Ah(Z).
Ch. 36: Large Sample Estimation and Hypothesis Testing 2199

Theorem 8.2

If 8% O,, the conditions of Theorem 8.1 are satisfied, and (i) there are a norm
(1y 11,E > 0, and a neighborhood .Af of 8, such that for IIy - y. II small enough,
sup& Ir IIv&Y(z,@,Y) - v&dzi, @,YO)II d Nz) IIY - YO IHEand E[b(z)] 11y*- y0 113 0; (ii)
V,g(z,,fI, yO) satisfies the conditions of Lemma 4.3; (iii) G, is nonsingular; then
Jr@ - 0,) % N(0, c; RG, I).

Pr#Of

It suffices to show that IZ- Cy=, V&z,, 6! 9) 3 G,, because then the conclusion will
follow from the conclusion of Theorem 8.1, eq. (8.3), and arguments like those of
Section 3. Condition (i) implies that [x1= 1b(zi)/n] I/y^- y0 11% 0 by the Markov
inequality, SO n~'Cr=,(/V,g(Zi,8,y^)-V,g(zi,8,y,)(l ~[n-lC1=Ih(Zi)]Ilp-y,ll"~O.
Also, by the conclusion of Lemma 4.3, II- xy= 1Veg(zi, & yO)A G,. The conclusion
then follows by the triangle inequality. Q.E.D.

This result provides one set of sufficient conditions for convergence of the Jacobian
term. They are specified so as to be similar to those of Theorem 8.1, involving a
norm for y. In particular cases it may be useful to employ some other method for
showing Jacobian convergence, as will be illustrated in Section 8.2. A similar
comment applies to the consistency condition. Consistency can be shown by impos-
ing conditions like (i) and (ii) to give uniform convergence of an objective function,
but this result will not cover all cases. In some cases it may be better to work directly
with Theorem 2.1 to show consistency.
The asymptotic variance of a semiparametric two-step estimator is Gi flG; .
As usual, a consistent estimator can be formed by plugging in estimators of the
different pieces. An estimator of the Jacobian term can be formed in a straight-
forward way, as

CB= n l i v,g(zi, f7,g.


i=l

Consistency of GBfor G, will follow under the same conditions as used for asymptotic
normality of I!?,because of the need to show consistency of the Jacobian matrix in
the Taylor expansion. The more difficult term to estimate is the score variance 0.
One way to estimate this term is to form an estimator g(z) of the function 6(z) that
appears in the asymptotic variance, and then construct

= Iz~ ~
i=l
{g(Zi, e,B) + I} (g(Z, e,y*)+ I}'. (8.5)

An estimator of the asymptotic variance can then be formed as G; fiG; .


2200 WK. Newey and D. McFadden

It is difficult at this level of generality to give primitive conditions for consistency


of a variance estimator, because these will depend on the nature of 6(z). One useful
intermediate result is the following one.

Lemma 8.3

If the conditjons of Theorem 8.1 are sati_sfied, xy= 1 11g(Zi, 6, f) - g(Zi, 8,, yo) 11/n 5 0,
and C;= 1 11 6(zi) - d(zi) II2/n L 0, then fi L 0.

Proof
Let zii = g(zi, 8, 7) + J(zi) and ui = g(zi, 8,, yO) + 6(zJ, so that fl= E[uiuIl and 8 =
~1~ 1ti,ti:/n. By the assumptions and the triangle inequality, x1= 1 IIfii - UCII2/n 11,~).
Also, by the LLN, x1= 1uiui/n -S E[UiUi]. Also, IIx1= 1riirii/n - Cy= 1u&/n II d
CyzI ~Iliiri~-UiU~ll/n <Cy=l llrii-iI12/n+2C~=1 /IiI//I~i-iIllndC~=~IItii-uil12/
n + z(Cy, 1IIuiII2/nP2E1= 1IIhi - ui II/n)II2 3 0, because convergence of the diag-
onal elements of Cl= 1uiuI/n implies that Cy= 1 /IUi I/2/n is bounded in probability.
Q.E.D.

Powell et al. (1989) use an analogous intermediate result to show consistency of


their variance estimator. More primitive conditions are not given because it is
difficult to specify them in a way that would cover all examples of interest.
These results provide a useful way of organizing and understanding asymptotic
normality of semiparametric two-step estimators. In the analysis to follow, their
usefulness will be illustrated by considering V-estimators and estimators where the
first step is a kernel estimator. These results are also useful in showing asymptotic
normality when the first step is a series regression estimator, i.e. an estimator
obtained from least squares regression of some dependent variable on approximat-
ing functions. The series estimator case is considered in Newey (1992a).

8.2. V-estimators

A V-estimator, as in eq. (8.2), is useful as an illustration of the results. As previously


noted, this estimator has g(z, y) = J m (z, Z,8,)dy(?), and y*is the empirical distribution
with p(5) = x1= 1 l(zi d 5)/n. For this estimator, condition (i) of Theorem 8.1 is
automatically satisfied, with b(z) = 0 because g(z, y) is linear in y. Condition (ii) needs
to be verified. To see what this condition means, let m(z,, z2) = m(z,, z2, fl,), ml(z) =
Sm(z, ?)dF,(Z), m2(z) = Im(z, z)dF,(Z), and p = j~m(z,~)dF,(z)dF&). Then

i$l CG(zi, $ - ~0) - J G(z, Y*- ~dd~J/~n

=&{n~li~l[n~
Ii I[
ml(Zi) -
j=l
m(ZhZj) - n-l i;. m2(zi) - p
i=l II
Ch. 36: Lurge Sample Estimation and Hypothesis Testing 2201

It will follow from U- and V-statistic theory that this remainder term is small.
A U-statistic has the form fi = np (n - I)- xi< ja(zi, zj), where u(z,, z2) = a(~,, zJ.
A V-statistic has the form p = II- C;= IC;= I m(z,, zj). A V-statistic is equal to a
U-statistic plus an asymptotically negligible term, as in p = n-CF, 1 Pn(Zi, Zi) +

[(n - 1)/n] 6, where a(~,, zj) = m(zi, zj) + m(zj, zi). The lead term, n-Cr, 1m(z,, Zi) is a
negligible own observations term, that converges in probability to zero at the rate
l/n as long as E[m(z,, zi)] is finite.
The condition that the remainder term in eq. (8.6) have probability limit zero is
known as the projection theorem for U- or V-statistics. For a U-statistic, a(z) =
fu(z, Z)dF,(Z), and E[c?(z)] = 0, the projection theorem states if the data are i.i.d. and
u(z,,z,) has finite second moments, then &[C? - np Cr= 1a( LO, where
n-CrZl a-( z J 1s re ferred to as the projection of the U-statistic on the basic obser-
vations; see Serfling (1980). The V-statistic projection theorem states that the
remainder in eq. (8.6) converges in probability to zero. The V-statistic projection
theorem is implied by the U-statistic projection theorem, as can be shown in the
following way. Let a(~,, z2) = m(z,, z2) + m(z,, zl) - 2~~ so

n-t i, [Wl(zi~zj)-~]="~zi~l [m(Zi,Zi)-~]+[(I1-1)/11]~.


i=l j=l

The first term following the equality should be negligible. The second term following
the equality is a multiple of the U-statistic, where the multiplying constant converges
to 1. Furthermore, a(z) = ml(z) + m2(z) - 2~ in this case, so the projection of the
U-statistic on the basic observations is n-l x1= 1[ml(zi) + m*(Zi) - 2~1. The U-
statistic projection theorem then implies that the remainder in eq. (8.6) is small.
Thus, it will follow from eq. (8.6) and the U-statistic projection theorem that
condition (ii) of Theorem 8.1 is satisfied.
The previous discussion indicates that, for V-estimators, assumption (ii) follows
from the V-statistic projection theorem. This projection result will also be important
for assumption (ii) for kernel estimators, although in that case the V-statistic varies
with the sample size. For this reason it is helpful to allow for m(z,,z,) to depend on
n when stating a precise result. Let m,,(z) = Jm,,(z, ,?)dF,(?), mn2(z) = Im,(z, z)dF,(Z),
and Y,,= O,(r,) mean that 11Y, II/r, is bounded in probability for the Euclidean norm
II* II.

Lemma 8.4

z,,z*, . . are i.i.d. then n-C;= 1Cjn= 1m,(zi,zj) - n-l C= 1[m,l(zi) + m,,(zi)] + p =
zl, ZJ III/n + (ECIIm,(z,,z2)II21)2/fl).
2202 W.K. Newey und D. McFadden

The proof is technical, and so is postponed until Section 8.4. A consequence of this
result is that condition (ii), the stochastic equicontinuity hypothesis, will be satisfied
for U-estimators as long as E[ 11 m(zl, zl, Q,) II1 and EC IIm(z,,z2,0,) III are finite.
Lemma 8.4 actually gives a stronger result, that the convergence rate of the remain-
der is l/Jr~, but this result will not be used until Section 8.3.
With condition (ii) (finally) out of the way, one can consider conditions (iii) and (iv)
for V-estimators. Assuming that p = 0, note that j G(z, y*- Ye)dF, ;j [ jm(z, z,e,) x
dF,(z)] dF(1) = j G(z)dF(z) for 6(z) = m2(z) = j m(Z, z, B,)dF,(z) and F(z) equal to the
empirical distribution. Thus, in this example conditions (iii) and (iv) are automatically
satisfied because of the form of the estimator, giving all the assumptions of Theorem
8.1, with g(z, 8,, yO) + 6(z) = ml(z) + m2(z). An asymptotic normality result for V-
estimators can then be stated by specifying conditions for uniform convergence of
the Jacobian. The following condition is useful in this respect, and is also useful for
showing the uniform convergence assumption of Theorem 2.1 and V-estimators.

Lemma 8.5

If 21, z*, . . . are i.i.d., a(z,, z2, O), is continuous at each (3~ 0 with probability one,
4zI,zI, 0)III< Q and EC~UP,,, II4z,, z2,0)II1 < ~0,thenEC4zI,z2,@I
ECsupo,ell
is continuous in 0~ 0, and supBE 8 /I nm2 x1= 1 x7= 1a(z, zj, d) - ELZ(Z,, z2, O)] (I J+ 0.

The proof is postponed until Section 8.4.


This result can be used to formulate conditions for asymptotic normality by
adding a condition for convergence of the Jacobian.

Theorem 8.6

Suppose that zr, z2,. . are i.i.d., 63 C+,,(i) E[m(z,, z2, @,)I = 0, E[ II m(z,, zl, 8,) II] <
co, E[ /Im(z,, z2, /!I,) /I2] < 03, (ii) m(z,, zl, 19)and m(z,, z2, 19)are continuously differen-
tiable on a neighborhood of (I0 with probability one, and there is a neighborhood
N of 8,, such that ECsup,,,- IIVom(z,, zl, 6)II1 < ~0andEC~UP,~.,~ IIVom(zI, z2, 4 II1 <
co, (iii) GB = E [V,m(z,, z2, (!I,)] is nonsingular. Then &(e-- 0,) L N(0, G, !CCIGg)
for 0 = Var { j [m(z, Z,0,) + m(Z, z, 0,)] dF,(z)}.

Proof

It follows by Lemma 8.4, assumption (i), and the preceding discussion that conditions
(i)-(iv) of Theorem 8.1 are satisfied for g(z, ye) + 6(z) = 1 [m(z, Z, 0,) + m(Z, z, tI,)]dF,(Z),
so it follows by the conclusion of Theorem 8.1 that &cC~=, CT= 1m(z,, zj, 0,) 3
N(O,C?). Therefore, it suffices to show that n-Cr, 1cjn= 1V,m(z,, zj, I$ L G, for any
t?L 8,. This condition follows by Lemma 8.5 and the triangle inequality. Q.E.D.

To use this result to make inferences about 8 it is useful to have an asymptotic


variance estimator. Let GH= nT 2 C;= 1x7=, V,m(z,, zj, 8) be a Jacobian estimator.
Ch. 36: Larye Sample Estimation and Hypothesis Testiny 2203

This estimator wll be consistent for G, under the conditions of Theorem 8.6. An
estimator of g(z, O,, yO) + 6(z) can be constructed by replacing BOby 6 and F, by E
in the expression given in R, to form

Iii = n- l j$l[rn(Zi, zj, 6)+ rn(Zj, zi, 8)],

The following result is useful for showing consistency of this estimator. Let
m,(z,, z2, 0) depend on n and m,,(z) be as defined above.

Lemma 8.7

Lf~ll~-~,I/=0,(1) then n~1~~~~~~n~~j~~m,(~~,~j,~)-m,l(~i)~~2=0,{n-1 x

ECsuP,,.,_ II%(Z,, 21,Q)II2 + sup,,,, IIVcPn(zl~z2,~) II2 + II%(Z,, z2,hJ) III ).


This result is proved in Section 8.4. Consistency of the variance estimator can now
be shown, using Lemma 8.7.

Theorem 8.8

If the conditions of Theorem 8.6 are satisfied, E[su~,,,~ /Im(z,, zl, 0) I/2] < cc and
E[su~,,,~ 1)Vem,(zl, z2, 0) I/1 < cc then G; dG; 3 Cc RG; .

Proof

It follows by Lemmas 8.7 and 8.3 that b Aa, and it follows as in the proof of
Theorem 8.6 that 6, L Gil, so the conclusion follows by continuity of matrix
multiplication. Q.E.D.

8.3. First-step kernel estimation

There are many examples of semiparametric two-step estimators that depend on


kernel density or conditional expectations estimators. These include the estimators
of Powell et al. (1989) and Robinson (1988b). Also, the nonparametric approximate
consumer surplus estimator introduced earlier is of this form. For these estimators
it is possible to formulate primitive assumptions for asymptotic normality, based
on the conditions of Section 8.1.
Suppose that y denotes a vector of functions of variables x, where x is an r x 1
subvector of the data observation z. Let y denote another subvector of the data.
The first-step estimator to be considered here will be the function of x with

y*(x)= n- i yiKb(x - xi). (8.7)


2204 W.K. Newey and D. McFadden

This is a kernel estimator of fO(x)E[ylx], where Jo(x) is the marginal density of x.


A kernel estimator of the density of x will be a component of y(x) where the
corresponding component of y is identically equal to 1. The nonparametric con-
sumer surplus estimator depends on 9 of this form, where yi = (1, qi).
Unlike V-estimators, two-step estimators that depend on the 9 of eq. (8.6) will
often be nonlinear in y*.Consequently, the linearization condition (i) of Theorem 8.1
will be important for these estimators. For example, the nonparametric consumer
surplus estimator depends on a ratio, with g(z,y) = ~~[Y2(x)/y1(x)]dx - BO. In this
example the linearization G(z, y - y_) is obtained by expanding the ratio inside the
integral. By ii/g - a/b = bmCl - b- (6 - b)] [ii - a - (u/b)(g - b)], the lineariza-
tion of Z/g around a/b is b- [E - (I - (u/b)(g - b)]. Therefore, the linear functional
of assumption (i) is

WY) =
sbfoW
a
C- h&4,llyWx.

Ify,,(x) = f,Jx) is bounded away from zero, y2,,(x) is bounded, and yi(x) is uniformly
(8.8)

close to yie(x) on [a, b], then the remainder term will satisfy

Ig(z,Y)- dz, 14 - G(z,Y - ~0)I

d bl~&)I - f,,(x)-Cl + lMx)llClr~(4 - f&)lz + IyAx) - ~zo(4121d~


sP
d c SUP,,,qb]
11
dx) - h,tx) 11
. (8.9)

Therefore assumption (i) of Theorem 8.1 will be satisfied if & supXt,a,bl IIy*(x) -

Yob) II .
2ao47

One feature of the consumer surplus example that is shared by other cases where
conditional expectations are present is that the density in the denominator must be
bounded away from zero in order for the remainder to be well behaved. This
condition requires that the density only effects the estimator through its values on
a bounded set, a fixed trimming condition, where the word trimming refers to
limiting the effect of the density. In some examples, such as the consumer surplus
one, this fixed trimming condition arises naturally, because the estimator only
depends on x over a range of values. In other cases it may be necessary to guarantee
that this condition holds by adding a weight function, as in the weighted average
derivative example below. It may be possible to avoid this assumption, using results
like those of Robinson (1988b), where the amount of trimming is allowed to decrease
with sample size, but for simplicity this generalization is not considered here.

471n this case p,(x) will be uniformly close to y,Jx), and so will be bounded away from zero with
probability approaching one if yIo(x) is bounded away from zero, on [a, h].
Ch. 36: Large Sample Estimation and Hypothesis Testing 2205

In general, to check the linearization condition (i) of Theorem 8.1 it is necessary


to specify a norm for the function y. A norm that is quite convenient and applies to
many examples is a supremum norm on a function and its derivatives. This norm
does not give quite as sharp results as an integral norm, but it applies to many more
examples, and one does not lose very much in working with a supremum norm
rather than an integral norm.48
Let ajy(x)/axj denote any vector consisting of all distinct jth-order partial deriva-
tives of all elements of y(x). Also, let 3 denote a set that is contained in the support
of x, and for some nonnegative integer d let

This type of norm is often referred to as a Sobolev norm.


With this norm the nj4 convergence rate of Theorem 8.1 will hold if the kernel
estimator g(x) and its derivatives converge uniformly on CCat a sufficiently fast rate.
To make sure that the rate is attainable it is useful to impose some conditions on
the kernel, the true function y,(x), the data vector y, and the bandwidth. The first
assumption gives some useful conditions for the kernel.

Assumption 8.1

K(u) is differentiable of order d, the derivatives of order d are bounded, K(u) is zero
outside a bounded set, jX(u)du = 1, there is a positive integer m such that for all
j<m,SK(u)[~~=,u]du=O.

The existence of the dth derivative of the kernel means that IIf 11will be well defined.
The requirement that K(u) is zero outside a bounded set could probably be relaxed,
but is maintained here for simplicity. The other two conditions are important for
controlling the bias of the estimator. They can be explained by considering an
expansion of the bias of y(x). For simplicity, suppose that x is a scalar, and note
EC?(x)] = SE[ylZ],f,(l)K,(x - I)d,? = ~y,+)K,(x - I)dZ. Making the change of
variables u = (x - .%)/a and expanding around CJ= 0 gives

E[-f(x)] = y,,(x - ou)K(u)du


s

= 2 Ojajy,(xyaxjK(u)ujdu + Cm ayotx + ouyaxvqu)u~du


O<j<m s s

= ye(x)
+ 0m amyotx
+ i7u)/axv(u)umdu, (8.10)
s
48With an integral norm, the Inn term in the results below could be dropped. The other terms
dominate this one, so that this change would not result in much improvement.
2206 WK. Newey and D. McFadden

where 6 is an intermediate value, assuming that derivatives up to order m of ye(x)


exist. The role of jK(u)du = 1 is to make the coefficient of y,(x) equal to 1, in the
expansion. The role of the zero moment condition {K(u)ujdu = 0, (j < m), is to
make all of the lower-order powers of cr disappear, so that the difference between
E[y*(x)] and yO(x) is of order grn. Thus, the larger m is, with a corresponding number
of derivatives of y,(x), the faster will be the convergence rate of E[y*(x)] to y&x).
Kernels with this moment property will have to be negative when j 3 2. They are
often referred to as higher-order or bias-reducing kernels. Such higher-order
kernels are used to obtain the r2/4 convergence rate for y*and are also important
for assumption (iv) of Theorem 8.1.
In order to guarantee that bias-reducing kernels have the desired effect, the
function being estimated must be sufficiently smooth. The following condition
imposes such smoothness.

Assumption 8.2

There is a version of yO(x) that is continuously differentiable to order d with bounded


derivatives on an open set containing .F.

This assumption, when combined with Assumption 8.1 and the expansion given
above produce the following result on the bias of the kernel estimator 9. Let E[y*]
denote E[y^(x)] as a function of x.

Lemma 8.9

If Assumptions 8.1 and 8.2 are satisfied then 11E[$] - y /I = O(C).

This result is a standard one on kernel estimators, as described in Hardle and Linton
(1994), so its proof is omitted.
To obtain a uniform convergence rate for f is also helpful to impose the following
condition.

Assumption 8.3

There is p 3 4 such that E[ 11


y II] < co and E[ lly Ilplx]fO(x) is bounded.

Assumptions 8.1-8.3 can be combined to obtain the following result:

Lemma 8.10

If Assumptions 8.1-8.3 are satisfied and cr = a(n) such that o(n)+0 and n1 -(2ip),(rr)/
In n -+ cc then IIy*- y,, /I = O,[(ln n)l/* (w~+*~))* + ~1.

This result is proved in Newey (1992b). Its proof is quite long and technical, and so
is omitted. It follows from this result that Jn (Iy*- y0 (I* 3 0, as required for as-
Ch. 36: Large Sample Estimation and Hypothesis Testing 2207

sumption (i) of Theorem 8.1, if ,1-2Pa(n)/ln n+ 03, &no2 --f 0, and Ja In n/


(nd+ 2d)+ 0. Th ese conditions will be satisfied for a range of bandwidth sequences
o(n), if m and p are big enough, i.e. if the kernel is of high-enough order, the true
function y(x) is smooth enough, and there are enough moments of y. However, large
values of m will be required if r is large.
For kernel estimators it turns out that assumption (ii) of Theorem 8.1 will follow
from combining a V-statistic projection with a small bias condition. Suppose that
G(z, v) is linear in y, and let 7 = I?[?]. Then G(z, 9 - yO) = G(z, 9 - 7) + G(z, 7 - yO).
Let m,(zi, Zj) = G[zi, .YjK,( * - xj)], m&) = Jm,(Z Z)@cd4 = SG[z, .YjK,( * - Xj)] dF,(z),
and assume that m,,(z) = 1 m,(z, 2) dF,(Z) = G(z, 17)as should follow by the linearity
of G(z, y). Then

G(z, y*- 7) dF,(z)


s

(8.11)

where the second equality follows by linearity of G(z, y). The convergence in prob-
ability of this term to zero will follow by the V-statistic projection result of Lemma
8.4. The other term, &(x1= 1G(zi, 7 -,yJ/n - 1 G(z, 7 - y) dF,(z)}, will converge in
probability to zero if E[ )(G(z, 7 - yO)I/2] + 0, by Chebyshevs inequality, which
should happen in great generality by y -+ y0 as 0 40, as described precisely in the
proof of Theorem 8.11 below. Thus, a V-statistic projection result when combined
with a small bias condition that E[ 1)G(z, v - yO)/I1 goes to zero, gives condition (ii)
of Theorem 8.1.
For kernel estimators, a simple condition for the mean-square differentiability
assumption (iii) of Theorem 8.1 is that there is a conformable matrix v(x) of functions
of x such that

jG(z, ?;)dF, = ~.(xh(x)


dx, (8.12)

for some v(x).This condition says j G(z, y) dF, can be represented as an integral, i.e.
as an average over values of x. It leads to a simple form for 6(z). As previously
discussed, in general 6(z) can be calculated by differentiating f G[z, y(q)] dF, with
respect to the parameters q of a distribution of z, and finding 6(z) such that
V,J G[z, y(q)] dFO = E[S(z)SJ for the score S, and all sufficiently regular parametri-
zations. Let I$[.] denote the expectation with respect to the distribution at this
2208 W.K. Newey and D. McFadden

parametrization. Here, the law of iterated expectations implies that

s
so differentiating
s
GCz,~h)l dF, = V(X)Y(X>
v)dx =
svW,CyIx1.0x Iv) dx = E,Cv(x)yl,

gives V,j G[z, y(v)] dF, = V,E,[v(x)y] = E[v(x)ySJ = E[G(z)SJ,


for

J(z)= V(X)Y- ECv(x)~l. (8.13)

For example, for the consumer surplus estimator, by eq. (8.8), one has v(x) =
l(U~X~b)f~(x)-lC-h~(x),ll and y = (l,q), so that 6(z) = l(a 6 x d b)f,(x)- x
c4- Mx)l.
With a candidate for 6(z) in hand, it is easier to find the integral representation
for assumption (iii) of Theorem 8.1. Partition z as z = (x, w), where w are the
components of z other than x. By a change of variables, 1 K,(x - xi) dx = j K(u) du = 1,
so that

s G(z,y^-y,)dF,= v(x)y,(x)dx=n-

f
i,
i=l s
V(X)yiK,(X - Xi) dx

- E[v(x)y] = n 1 izl J 6(X,WJKg(X- xi) dx = Jd(z)d,


(8.14)

where the integral of a function a(z) over d$ is equal to np x1= I sa(x, wi)K,(x - xi)dx.
The integral here will be the expectation over a distribution when K(u) 2 0, but when
K(u) can be negative, as for higher-order kernels, then the integral cannot be interpreted
as an expectation.
The final condition of Theorem 8.1, i.e. assumption (iv), will follow under straight-
forward conditions. To verify assumption (iv) of Theorem 8.1, it is useful to note that
the integral in eq. (8.14) is close to the empirical measure, the main difference being
that the empirical distribution of x has been replaced by a smoothed version with
density n- x1= 1K,(x - xi) [for K(u) 3 01. Consequently, the difference between the
two integrals can be interpreted as a smoothing bias term, with

b(z)diSd(z)dF=K $r [ Sv(x)K,(X-xi)dX-V(Xi)]Yi. (8.15)

By Chebyshevs inequality, sufficient conditions for Jn times this term to converge


in probability to zero are that JnE[y,{ jv(x)K,(x - xi)dx - V(Xi)}] -0 and that
CIIY~II*IISV(X)K,(X-~~)~X-VV(X~)II~I~O.A s s.h own below, the bias-reducing kernel
and smoothness parts of Assumptions 8.1-8.3 are useful in showing that the first
Ch. 36: Large Sample Estimation and Hypothesis Testing 2209

condition holds, while continuity of v(x) at most points of v(x) is useful for showing
the second. In particular, one can show that the remainder term in eq. (8.15) is small,
even when v(x) is discontinuous, as is important in the consumer surplus example.
Putting together the various arguments described above leads to a result on
asymptotic normality of the score x1= r g(zi, y)/&.

Theorem 8.11

Suppose that Assumptions 8.1-8.3 are satisfied, E[g(z, yO)] = 0, E[ I/g(z, y,,) )(1 < a,
X is a compact set, cr = o(n) with na2+4d/(ln n)2 -+ cc and na2m + 0, and there is
a vector of functionals G(z, y) that is linear in y such that (i) for ll y - y. I/ small
enough, IIdz, y)- dz, yo)- W, Y- yo) II 6 W IIY-y. II2,ECWI < ~0;(3 IIW, Y) II d
c(z) 1)y 1)and E[c(z)] < co; (iii) there is v(x) with 1 G(z, y) dF,(z) = lv(x)y(x) dx for
all /ly 11< co; (iv) v(x) is continuous almost everywhere, 111v(x) 11dx < co, and there
is E > 0 such that E[sup,,,,, GE(Iv(x + u) II41 < co. Then for 6(z) = v(x)y - E[v(x)y],

Cr= 1S(zi, Y^)l& 5 N(0, Var [g(z, yo) + 6(Z)]}.

Proof

The proof proceeds by verifying the conditions of Theorem 8.1. To show assump-
tion (i) it suffices to show fi /Iy*- y. 11 2 30 which follows by the rate conditions
on 0 and Lemma 8.10. To show assumption iii), note that by K(u) having bounded
derivatives of order d and bounded support, (IG[z, yK,(. - x)] 11d o-c(z) IIy )I. It
then follows by Lemma 8.4 that the remainder term of eq. (8.11) is O,,(n- a- x
{E[c(z,)/( y, II] + (E[c(z~)~ IIy, 112])12})= o,(l) by n-a-+O. Also, the rate condi-
tions imply 0 --f 0, so that E[ I(G(z, 7 - yo) )I2] d E[c(z)~] 117- y. 1)2 + 0, so that the
other remainder term for assumption (ii) also goes to zero, as discussed following
eq. (8.11). Assumption (iii) was verified in the text, with dF as described there. To
show assumption (iv), note that

v(x)K,(x - xi) dx - v(xi) yi


I Ill
v(x)K(u)y,(x - au) dudx -

Cyo(x - 0~)- yo(x)PW du

< & II v(x) II 1 Ca jll


[y,(x - au) - y,,(x)lK(u) du 11dx < v(x)IIdx, (8.16)
s iI1
2210 W.K. Newey and D. McFadden

for some constant C. Therefore, //J&[ { Sv(x)K,(x-xi) dx - V(Xi)}yi] 11d CJjza+O.


Also, by almost everywhere continuity of v(x), v(x + au) + v(x) for almost all x and U.
Also, on the bounded support of K(u), for small enough 0, v(x + W) d SU~~~~~~ S ,v(x + o),
so by the dominated convergence theorem, j v(x + au)K(u) du + j v(x)K(u) du = v(x)
for almost all x. Another application of the dominated convergence theorem, using
boundedness of K(u) gives E[ 11 j v(x)K,(x - xi) dx - v(xi) 114]-0, so by the Cauchy-
Schwartz inequality, E[ 11yi I/2 11j v(x)K,(x - xi) dx - v(xi) II2] + 0. Condition (iv)
then follows from the Chebyshev inequality, since the mean and variance of
II- l C;= 1[I v(x)K,(x - xi) dx - v(x,)]y, go to zero. Q.E.D.

The assumptions of Theorem 8.11 can be combined with conditions for convergence
of the Jacobian to obtain an asymptotic normality result with a first-step kernel
estimator. As before, let R = Var [g(z, y,,) + 6(z)].

Theorem 8.12

Suppose that e -% 00~ interior(O), the assumptions of Theorem 8.11 are satisfied,
E(g(z, yO)] = 0 and E[ 11g(z, ye) 11
2] < co, for 11
y - y. II small enough, g(z, 0, y) is contin-
uously differentiable in 0 on a neighborhood _# of O,, there are b(z), s > 0 with
EC&)1< ~0, IIV~s(z,~,y)-V,g(z,~,,y,)/I d&)Cl/Q-4ll+ IIY-Y~II~~~ and
E[V,g(z, Oo,yo)] exists and is nonsingular. Then $(& 0,) 3 N(0, G; L2G; I).

Proof

It follows similarly to the proof of Theorem 8.2 that 6; 3 G; , so the conclusion


follows from Theorem 8.11 similarly to the proof of Theorem 8.2. Q.E.D.

As previously discussed, the asymptotic variance can be estimatedby G,,86, I,


whereG,=n- C;= 1 Vsg(zi, e,y*) and 8= n- x1= lliiti; for ai = g(zi, 0, $) + 6(zi). The
main question here is how to construct an estimator of 6(z). Typically, the form of
6(z) will be known from assumption (iii) of Theorem 8.11, with 6(z) = 6(z, 8,, yo) for
some known function 6(z, 0, y). An estimator of 6(z) can then be formed by substituting
8 and $3for 8, and y. to form

8(z)= 6(z, 6,jq. (8.17)

The following result gives regularity conditions for consistency of the corresponding
asymptotic variance estimator.

Theorem 8.13

Suppose that the assumptions of Theorem 8.12 are satisfied and there are b(z), s > 0,
such that E[~(z)~] < cc and for /Iy - y. /I small enough, IIg(z, 19,y)-g(z, do, y)II d h(z) x
CIIQ-Q,ll+ /I~-~~ll~l and 11~~~,~,~~-~6(~,~~,~~~ll
dWCII~-~oI/+ /IY-Y~II~.
Then 6; 86; l L G; RG; .
Ch. 36: Large Sample Estimation and Hypothesis Testing 2211

Proof

It suffices to show that the assumptions of Theorem 8.3 are satisfied. By the
conditions of Theorem 8.12, I/t? - 0, I/ 3 0 and /I9 - y0 I/ 5 0, so with probability
approaching one,

because n- x1= 1 b(zi) 2 is bounded in probability by the Markov inequality. It


follows similarly that Cr= 1 11 8(zi) - 6(Zi) II/ n 30, so the conclusion follows by
Theorem 8.3. Q.E.D.

In some cases 6(z, 0, y) may be complex and difficult to calculate, making it hard to
form the estimator 6(z, e,?). There is an alternative estimator, recently developed in
Newey (1992b), that does not have these problems. It uses only the form of g(z, 6,~)
and the kernel to calculate the estimator. For a scalar [ the estimator is given by

i(zi)=v, n-l
[
j$l C71zj38,y* + i,K,(yxil}]~
i=O
(8.18)

This estimator can be thought of as the influence of the ith observation through the
kernel estimator. It can be calculated by either analytical or numerical differentiation.
Consistency of the corresponding asymptotic variance estimator is shown in Newey
(1992b).
It is helpful to consider some examples to illustrate how these results for first-step
kernel estimates can be used.

Nonparametric consumer surplus continued: To show asymptotic normality, one can


first check the conditions of Theorem 8.11. This estimator has g(z, yO) = Jib,(x) dx -
8, = 0, so the first two conditions are automatically satisfied. Let X = [a, b], which
is a compact set, and suppose that Assumptions 8.1-8.3 are satisfied with m = 2,
d = 0, and p = 4, so that the norm IIy 1) is just a supremum norm, involving no
derivatives. Note that m = 2 only requires that JuK(u)du = 0, which is satisfied by
many kernels. This condition also requires that fO(x) and fO(x)E[q Ix] have versions
that are twice continuously differentiable on an open set containing [a, b], and that
q have a fourth moment. Suppose that no2/(ln n)+ CC and no4 +O, giving the
bandwidth conditions of Theorem 8.11, with r = 1 (here x is a scalar) and d = 0.
Suppose that f,,(x) is bounded away from zero on [a, b]. Then, as previously shown
in eq. (8.9), assumption (i) is satisfied, with b(z) equal to a constant and G(z, y) =
!,bfo(x)- C- Mx), lldx) dx. Assumption (ii) holds by inspection by fO(x)- and
h,(x) bounded. As previously noted, assumption (iii) holds with v(x) = l(a < x < b) x
fO(x)- [ - h,(x), 11. This function is continuous except at the points x = a and x = b,
2212 W.K. Newey and D. McFadden

and is bounded, so that assumption (iv) is satisfied. Then by the conclusion of


Theorem 8.11 it follows that

i;(x) - 0, LV(O, E[l(a ,< x d 4f,(x)~{q - hI(x))21)> (8.19)


>

an asymptotic normality result for a nonparametric consumer surplus estimator.


To estimate the asymptotic variance, note that in this example, 6(z) = l(a d x d b) x
f&)- [I4- Mx)l = &z,h) for h(z,Y)= 1(a d x d b)y,(~)~[q - y1(x)-1y2(x)]. Then
for 6(z) = 6(z, y^),an asymptotic variance estimator will be

= f 8(Zi)2/n
= n-l i$l l(U <Xi < b)f(Xi)p2[qi- &(Xi)]2. (8.20)
i=l

By the density bounded away from zero on 3 = [a, b], for /Iy - y. /I small enough
that yr (x) is also bounded away from zero on .oll,16(zi, y) - 6(zi, yO)1d C( 1 + qi) 11y - y0 1)
for some constant C, so that the conditions of Theorem 8.13 are satisfied, implying
consistency of d.

Weighted average derivative estimation: There are many examples of models where
there is a dependent variable with E[qlx] = T(X /3,Jfor a parameter vector /IO, as
discussed in Powells chapter of this handbook. When the conditional expectation
satisfies this index restriction, then V,E[ql.x] = s,(x~,,)~~, where r,(v) = dr(v)/dv.
Consequently, for any bounded function w(x), E[w(x)V,E[q(x]] = E[w(x)r,(x/3,)]&,,
i.e. the weighted average derivative E[w(x)V,E[qlx]] is equal to a scale multiple of
the coefficients /I,,. Consequently, an estimate of /I0 that is consistent up to scale can
be formed as

B=n-' t W(Xi)V,L(Xi),
C(X)= i qiK,(X-Xi)/i K,(X-Xi). (8.21)
i=l i=l i=l

This is a weighted average derivative estimator.


This estimator takes the form given above where yIO(x) = f,Jx), yIO(x) = fO(x) x
ECq Ixl, and

Yk 0, v) = %47,cY2(4lY,(~)l - 8. (8.22)

The weight w(x) is useful as a fixed trimming device, that will allow the application
of Theorem 8.11 even though there is a denominator term in g(z, 0, y). For this
purpose, let 3 be a compact set, and suppose that w(x) is zero outside % and
bounded. Also impose the condition that fe(x) = yIO(x) is bounded away from zero
on I%^.Suppose that Assumptions 8.1-8.3 are satisfied, n~?+~/(ln ~)~+co and &+O.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2213

These conditions will require that m > r + 2, so that the kernel must be of the
higher-order type, and yO(x) must be differentiable of higher order than the dimension
of the regressors plus 2. Then it is straightforward to verify that assumption (i) of
Theorem 8.11 is satisfied where the norm (/y )I includes the first derivative, i.e. where
d = 1, with a linear term given by

G(z,Y)= w(x)Cdx)~(x)+ V,r(xMx)l,


%b4 = .I-&)- l c- &Ax) + kl(xb(x), - SWI, Md = .foW l c- Mx), II>
(8.23)

where an x subscript denotes a vector of partial derivatives, and s(x) = fO,.(x)/fO(x)


is the score for the density of x. This result follows from expanding the ratio
V,[y,(x)/y,(x)] at each given point for x, using arguments similar to those in the
previous example. Assumption (ii) also holds by inspection, by fO(x) bounded away
from zero.
To obtain assumption (iii) in this example, an additional step is required. In
particular, the derivatives V,y(x) have to be transformed to the function values y(x)
in order to obtain the representation in assumption (iii). The way this is done is by
integration by parts, as in

HwW,~Wd41=
s w(x)fo(x)b,(x~CV,~(x)l
dx

=-s V,Cw(x)fo(x)~o(x)l~O
dx>

v,Cww-ow,(x)l = w,(x)C- Mx), II+ w(x)c- 4&d, 01

It then follows that 1 G(z, y) dF, = j v(x)y(x) dx, for

44 = - w,(x)C- w4 11- w(x)II- &&4,01 + wb)a,(x)


= - {WAX) c- Mx), 11= 04
+ w(x)s(x)> c- h_l(x),11,
t(x) = - w,(x) -- w(x)s(x). (8.24)

By the assumption that fO(x) is bounded away from zero on .!Zand that 9 is compact,
the function a(~)[ - h,(x), l] is bounded, continuous, and zero outside a compact set,
so that condition (iv) of Theorem 8.11 is satisfied. Noting that 6(z) = C(x)[q - h,(x)],
the conclusion of Theorem 8.11 then gives

w(xi)V,&xi) - 80
1L W,Var{w(x)V,h,(x) + QX)[q - &(x)]}).
(8.25)
2214 W.K. Newey and D. McFadden

The asymptotic variance of this estimator can be estimated as

&!=n- i, I,?$, pi = W(Xi)V,~(Xi) - H^+ ~(Xi)[qi - I], (8.26)


i=l

where z(x) = - w,(x) - w(x)fJx)/f^(x) for T(X) = n- C;= 1 K(x - xi). Consistency
of this asymptotic variance estimator will follow analogously to the consumer
surplus example.
One cautionary note due to Stoker (1991) is that the kernel weighted average
derivative estimators tend to have large small sample biases. Stoker (1991) suggests
a corrected estimate of - [n-l Cy= 1e^(x,)x~]- 8, and shows that this correction
tends to reduce bias 8 and does not affect the asymptotic variance. Newey et al.
(1992) suggest an alternative estimator o^+ n- C;= 1 &xi) [qi - &)I, and show that
this also tends to have smaller bias than 6. Newey et al. (1992) also show how to
extend this correction to any two-step semiparametric estimator with a first-step
kernel.

8.4. Technicalities

Proof of Lemma 8.4

Let mij = m(zi, zj), 61,.= m,(z,), and fi., = m2(zi). Note that E[ /Im, 1 - p 111d E[ 11 m, I 111
+(E[ I~m,,~~2])12 and (E[I~m,, -p(12])12 <2(E[ /Im,2~~2])12 by the triangle in-
equality. Thus, by replacing m(z,, z2) with m(z,, z2) - p it can be assumed that p = 0.
Note that IICijmij/n2 - Ci(fii. + fi.,)/n II = 11 Cij(mij - 61,. - Kj)/n2 II < I/xi+ j(mij -
rii,. - ti.j)/n2 II + IICi(mii - 6,. -6.,)/n I/= Tl + T2. Note E[ TJ <(EC I/ml 1 /I + 2 x
ECllm12 111)/n. Also, for i #j, k #P let vijk/ = E[(mij - tii. - rKj)(m,( - fi,. - ti./)].
By i.i.d. observations, if neither k nor 8 is equal to i or j, then vijk/ = 0. Also for e not
equal to i orj, viji/ = E[(mij - tii.)(mi/ - tip)] = E[E[(mij - &.)(m,( - ti,.)Izi,zj]] =
E[(mij - fii.)(E[mit Izi, zj] - tip)] = 0 = vijj/. Similarly, vijk/ = 0 if k equals neither
i nor j. Thus,

CT:1 = C C Vijk//n4 = 1 (vijij + rijji)/n4


i#jk#/ i#j

= 2(n2 - n)E[ IIm12 - ti,.-ti., 112]/n4= E[ l/ml2 - 6,. - Kz., lj2]0(np2),

and Tl =O,({E[IIml2-~l.-~.2/~2]}112n2-1)=Op({E[~~ml2~~2]}1~2n~). The con-


clusion then follows by the triangle inequality. Q.E.D.

Proof of Lemma 8.5

Continuity of a(z,, z2, l3) follows by the dominated convergence theorem. Without
changing notation let a(z,, z2, 0) = a@,, z2, 0) - E[a(z,, z2, Q]. This function satisfies
the same dominance conditions as a(z,, z2, e), so it henceforth suffices to assume that
Ch. 36: Large Sample Estimation and Hypothesis Testing 2215

E[a(z,,z,, e)] = 0 for all 0. Let O(0) = n-(n - 1)-i Ci+ja(Zi,Zj,B)t and note that
spOe@ IIn-z Ci,jatzi, zj9@ - tic@ II A 0. Then by well known results on U-statistics
as in Serfling (1980), for each 0, 6(e) -%O. It therefore suffices to show stochastic
e_quicontinuity of 8. The rest of:he proof proceeds as in the proof of Lemma 2.4, with
di,(e, 6) = suplia- ~iiG,J/Ia(zi,zj, 0)- a(zi,zj, 0)II replacing 4(& :I, Ci+j replaci_ng Cy=1T
and the U-statistic convergence result n- (n - 1)-i xi+ jAij(d, 6) -% E[A12(B, S)]
replacing the law of large numbers. Q.E.D

Proof of Lemma 8.7

Let fiij = m,(zi, zj, g),,mij = m,(zi, zj, Q,), and ml, = m,i(zJ. By the triangle inequality,
we have

n-l t I/II-l t diij-mlil12<Cn-2 izI II&ii II2


i=l j=l

+cn- t
i=l
IIn-
j~i(Jnij-mij)l12+c~-1i~l II(n-I)- j~i(Wlij-mli)l12

+Cnp2 t JJmli))2=R1+R2+R3+R4.
i=l

for some positive constant C. Let b(zi) = SUP~~.~ IIm,(zi, zi, 0) II and b(z, zj) =
sup~~,~ IIVom(zi, zjr e) 11.With probability approaching one, R, <_CK2 Cl= 1b(z,) =
O,(n-E[b(z,f]}. Also, R2~Cn-~~=lI~n-~jzib(zi,zj)~~2~~~-~e,~12~Cn-2 X

Cifjb(zi,zj)211e-e~l12=0,{n-E[b( zl, z,)]}. Also, by the Chebyshev and Cauchy-


Schwartz inequalities, E[R,] d CE[ IIml2 II]/n and E[R,] < CE[ (Im,, Il]/n. The
conclusion then follows by the Markov and triangle inequalities. Q.E.D.

9. Hypothesis testing with GMM estimators

This section outlines the large sample theory of hypothesis testing for GMM
estimators. The trinity of Wald, Lagrange multiplier, and likelihood ratio test
statistics from maximum likelihood estimation extend virtually unchanged to this
more general setting. Our treatment provides a unified framework that specializes
to both classical maximum likelihood methods and traditional linear models esti-
mated on the basis of orthogonality restrictions.
Suppose data z are generated by a process that is parametrized by a k x 1 vector
8. Let /(z, 0) denote the log-likelihood of z, and let 8, denote the true value of 0 in
the population. Suppose there is an m x 1 vector of functions of z and 0, denoted
g(z, f3),that have zero expectation in the population if and only if 8 equals 0,:

g(e) = ~~5 1(z,~ f l) = g(z, 0) ee(veo)dz = 0, if 8 = 8,.


s
2216 W.K. Newey and D. McFadden

Then, Ey(z, H) are moments, and the analogy principle suggests that an estimator of
8, can be obtained by solving for 8 that makes the sample analogs of the population
moments small. Identification normally requires that m 3 k. If the inequality is strict,
and the moments are not degenerate, then there are overidentifying moments that
can be used to improve estimation efficiency and/or test the internal consistency of
the model.
In this set-up, there are several alternative interpretations of z. It may be the case
that z is a complete description of the data and P(z,Q) is the full information
likelihood. Alternatively, some components of observations may be margined out,
and P(z, 0) may be a marginal limited information likelihood. Examples are the
likelihood for one equation in a simultaneous equations system, or the likelihood
for continuous observations that are classified into discrete categories. Also, there
may be exogenous variables (covariates), and the full or limited information
likelihood above may be written conditioning on the values of these covariates.
From the standpoint of statistical analysis, variables that are conditioned out
behave like constants. Then, it does not matter for the discussion of hypothesis
testing that follows which interpretation above applies, except that when regularity
conditions are stated it should be understood that they hold almost surely with
respect to the distribution of covariates.
Several special cases of this general set-up occur frequently in applications. First,
if Qz,~) is a full or limited information likelihood function, and g(z,8) = V,L(z,@
is the score vector, then we obtain maximum likelihood estimation.49 Second, if
z = (y, x, w) and g(z, 0) = w(y - x0) asserts orthogonality in the population between
instruments w and regression disturbances E = y - x0,, then GMM specializes to
2SLS, or in the case that w = x, to OLS. These linear regression set-ups generalize
immediately to nonlinear regression orthogonality conditions based on the form
Y(Z,0) = WCY - h(x, @I.
Suppose an i.i.d. sample zi, . . . , z, is obtained from the data generation process.
A GMM estimator of 0, is the vector 6,, that minimizes the generalized distance of
the sample moments from zero, where this generalized distance is defined by the
quadratic form

with l,(0) = (l/n)C:, i g(z,, (3)and 0, an m x m positive definite symmetric matrix that
defines a distance metric. Define the covariance matrix of the moments, fl =
mxm
Eg(z, O,)g(z, 0,). Efficient
weighting of a given set of m moments requires that 0,
converge to Ras n + m.50 Also, define the Jacobian matrix mfik = EVOg(z, O,), and

@If the sample score has multiple roots, we assume that a root is selected that achieves a global
maximum of the likelihood function.
50This weighting is efficient in that it minimizes the asymptotic covariance matrix in the class of all
estimators obtained by setting to zero k linear combinations of the m moment conditions. Obviously, if
there are exactly k moments, then the weighting is irrelevant. It is often useful to obtain initial consistent
asymptotically normal GMM estimators employing an inefficient weighting that reduces computation,
and then apply the one-step theorem to get efficient estimators.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2217

let G, denote an array that approaches G as n -+ co. The arrays 0, and G, may be
functions of (preliminary) estimates g,, of 8,. When it is necessary to make this
dependence explicit, write Q,,(g,,) and G,(g,,).
Theorems 2.6, 3.4, and 4.5 for consistency, asymptotic normality, and asymptotic
covariance matrix estimation, guarantee that the unconstrained GMM estimator
& = argmw,~Q,,(@ IS consistent and asymptotically normal, with &(8 - 0,) L
N(0, B- ); where B = GR- G. Further, from Theorem 4.5, the asymptotic covariance
matrix can be estimated using

G, = t, clVedz,,&, J+G,
f

where 8,, is any &-consistent estimator of 0, [i.e., &(8,, - 0,) is stochastically


bounded]. A practical procedure for estimation is to first estimate 0 using the
GMM criterion with an arbitrary L2,, such as J2,, = 1. This produces an initial
$-consistent estimator I!?~,.
Then use the formulae above to estimate the asympto-
tically efficient R,, and use the GMM criterion with this distance metric to obtain
the final estimator gn. Equation (5.1) establishes that r- - Ey(z,B,)V,G(z,0,) s
EV,g(z, 0,) = G. It will sometimes be convenient to estimate G by

In the maximum likelihood case g = V,d, one has a= r= G, and the asymptotic
covariance matrix of the unconstrained estimator simplifies to OR .

9.1. The null hypothesis and the constrained GMM estimator

Suppose there is an r-dimensional null hypothesis on the data generation process,

H,: r; 1(Q,) = 0.

We will consider alternatives to the null of the form

H 1 : a(@,)# 0,

or asymptotically local alternatives of the form

H,,: a(&) = SJ& # 0.


2218 W.K. Newey and D. McFadden

Assume that F& z V,a(&,) has rank r. The null hypothesis may be linear or nonlinear.

A particularly simple case is He. 6 = do, or a(@ = 0 - do, so the parameter vector 8
is completely specified under the null. More generally, there will be k - r parameters
to be estimated when one imposes the null. One can define a constrained GMM
estimator by optimizing the GMM criterion subject to the null hypothesis:

g,, = argmaxtiEOQn(@, subject to a(0) = 0.

Define a Lagrangian for t?: _Y;p,(6,y)= Q,(0) - , z ,(6yr; 1. In this expression, y is

the vector of undetermined Lagrangian multipliers; these will be nonzero when the

0=
constraints are binding. The first-order conditions for solution of this problem are

[I [
0
&VOQA@J -vo@J,/% II
- 46) l-

A first result establishes that g,, is consistent under the null or local alternatives:

Theorem 9.1

Suppose the hypotheses of Theorem 2.6. Suppose ~(0s) = S/,/Z, including the
null when 6 = 0, with a continuously differentiable and A of rank r. Then ez 60.

Proof

Let f3,, minimize [E&(e)]fl- [&J,(@] subject to a(0) = S/J%. Continuity of this
objective function and the uniqueness of its minimum imply eon + 8,. Then Q,(8,) 6
Q,(e,,) -% 0, implying Q,(gJ LO. But Q, converges uniformly to [E~,(@]~- x
[&j,(6)], so the argument of Theorem 2.6 implies t?,,3 0,. Q.E.D.

The consistency of g implies

V,Q,(e,) A - GR - Eg(z, 6,) = 0,

V,a(e,) a A * A?, = - V,Q,(e,) + oP 5 0,

and since A is of full rank, 7, LO. A central limit theorem implies

(9.1)

A Taylors expansion of the sample moments about 8. gives

&W) = &,@o) + G,,/&J - &,I, (9.2)


Ch. 36: Large Sample Estimation and Hypothesis Testing 2219

with G, evaluated at points between 8 and 8,. Substituting this expression for the
final term in the unconstrained first-order condition 0 = &V,Q,(g,J = - Gbf2; x
g,,(@,,)and using the consistency of e^, and uniform convergence of G,(0) yields

0 = - GR - 12ulln+ S&e, - 0,) + oP

=+(e, - 0,) = B-GC li2@ n + o P (9.3)

Similarly, substituting &&(t?,J = $&,(0,) + G,&(t? - 0,) = - GQn-12@, +


G$(en - &,) + op, and J&z(&) = J&(0,) + A&(e, - 0,) + op = 6 + Afi(g - 0,) +
op in the first-order conditions for an yields

(9.4)

From the formula for partitioned inverses,

1
~-l/ZMBI/ B-,q(AB-l,q-l
[, ;I-l=[ (ABmA)-AB-l -(&-A)-
(9.5)

where M = I - B- 12A(AB- A)- AK l is a k x k idempotent matrix of rank


k - r. Applying this to eq. (9.4) yields

(9.6)

Then, the asymptotic distribution of &(t?,, - 0,) under a local alternative, or the
null with 6 =O, is N[ - B-A(AB-A)-6,B-12MB-2].
Writing out M = I-B- 2A(AB- A)- AB- 12 yields

JJt(B,-8o)=B-1GR-1I2U11,-~-1A(AB-1~)-1~~-1~R-1/2~,

-B-A(AB-A)-6 + op. (9.7)

The first terms on the right-hand side of eq. (9.7) and the right-hand side of eq. (9.3)
are identical, to order op. Then, they can be combined to conclude that

&(e, - e,) = B -l A (AB-A)-AB-GR-2~n+B-A(AB-A)-16+op,


(9.8)

so that &(6,, . asymptotically


- g,,) IS normal with mean B- A(AB- A)-6 and
2220 W.K. Newey and D. McFadden

Table 1

Asymptotic
Statistic Formula covariance matrix

B~GQ~%,+o, B- EC
Jn(& - %)
Bm l/2MBm l/2
&$-So) -B~A(AB~A)~6+B~2MB~ZG~~Z~i,+o,
J;l(e,-e,) B-A(AB-A)-6+B-A(AB~A)~AB~GR-z~~C,+op BmA(ABmA)mABm

&r. (AB-A)~6+(AB~A)~AB-GR~2~~11,+o, (AB-A)-

$4&) S+AB-GR-*42 +o AB-A

,,h,Q,(e,) A(AB~A)-B+A:(ABA)-AB-LCR-~.+o, A(AB-A)-A

covariance matrix B- /(I - M)BPi2 z BP A(AB-lA)-ABP . Note that the


asymptotic covariance matrices satisfy acov(& - e,) = acov 8n - acov S,, or the uari-
ante of the difference equals the difference of the variances. This proposition is familiar
in a maximum likelihood context where the variance in the deviation between an
efficient estimator and any other estimator equals the difference of the variances.
We see here that it also applies to relatively efficient GMM estimators that use
available moments and constraints optimally.
The results above and some of their implications are summarized in Table 1. Each
statistic is distributed asymptotically as a linear transformation of a common
standard normal random vector %. Recall that B = GR- G is a positive definite
kxkmatrix,andletC=B~-acov8,.RecallthatM=Z-B~2A(AB~A)~x
AK I2 is a k x k idempotent matrix of rank k - r.

9.2. The test statistics

The test statistics for the null hypothesis fall into three major classes, sometimes
called the trinity. Wald statistics are based on deviations of the unconstrained
estimates from values consistent with the null. Lagrange multiplier (LM) or score
statistics are based on deviations of the constrained estimates from values solving
the unconstrained problem. Distance metric statistics are based on differences in the
GMM criterion between the unconstrained and constrained estimators. In the case
of maximum likelihood estimation, the distance metric statistic is asymptotically
equivalent to the likelihood ratio statistic. There are several variants for Wald
statistics in the case of the general nonlinear hypothesis; these reduce to the same
expression in the simple case where the parameter vector is completely determined
under the null. The same is true for the LM statistic. There are often significant
computational advantages to using one member or variant of the trinity rather than
another. On the other hand, they are all asymptotically equivalent. Thus, at least to
first-order asymptotic approximation, there is no statistical reason to choose be-
Ch. 36: Large Sample Estimation and Hypothesis Testing 2221

Figure 3. GMM tests

tween them. This pattern of first-order asymptotic equivalence for GMM estimates
is exactly the same as for maximum likelihood estimates.
Figure 3 illustrates the relationship between distance metric (DM), Wald (W), and
score (LM) tests. In the case of maximum likelihood estimation, the distance metric
criterion is replaced by the likelihood ratio.
The arguments 0, and f?,,are the unconstrained GMM estimator and the GMM
estimator subject to the null hypothesis, respectively. The GMM criterion func-
tion is plotted, along with quadratic approximations to this function through the
respective arguments 6, and &. The Wald statistic (W) can be interpreted as
twice the difference in the criterion function at the two estimates, using a quad-
ratic approximation to the criterion function at 6,. The Lagrange multiplier (LM)
statistic can be interpreted as twice the difference in the criterion function of the
two estimates, using a quadratic approximation at 15%.
The distance metric (DM)
statistic is twice the difference in the distance metric between the unconstrained
and constrained estimators.
We develop the test statistics initially for the general nonlinear hypothesis ~(0,) =
0; the various statistics we consider are given in Table 2. In this table, recall that
acov 87,= B and acov g,, = B- MB- l . In the following section, we consider the
important special cases, including maximum likelihood and nonlinear least squares.
2222 W.K. Newey and D. McFadden

Table 2

Test statistics

Wald statistics
WI na(e.Y[AB-A]-a(B,)

W, n(&- f?J{acov(JJ - acov(G)}-(6 - 13)


=n(e-B,)B-A(AB-A)-AB-(~-~)

W3 t1(8 - GJ acov(JJ (6 - f7J

Lagrange multiplier statistics


LM,, rq$4B~ A&
LM,, nV,Q,(B,){A(AB-A)-A}-V,Q,(B,)
= V,Q,(B,)B~A(AB~A)~AB~V,Q,(B,)
LM,, nv,Q.@JB- V,Q.(e.)
Distance metric statistic
DM, - 2n[Q.(e., - Q,(&)l

In particular, when the hypothesis is that a subset of the parameters are constants,
there are some simplifications of the statistics, and some versions are indistin-
guishable.
The following theorem gives the large sample distributions of these statistics:

Theorem 9.2

Suppose the conditions of Theorems 2.6,3.4, and 4.5 are satisfied, and a(8) is contin-
uously differentiable with A of rank r. The test statistics in Table 2 are asymptotically
equivalent under the null or under local alternatives. Under the null, the statistics
converge in distribution to a chi-square with r degrees of freedom. Under a local
alternative a(&,) = S/J& the statistics converge in distribution to a noncentral
chi-square with r degrees of freedom and a noncentrality parameter 6(AB- A)- 6.

Proof

All of the test statistics are constructed from the expressions in Table 1. If 4 is an
expression from the table with asymptotic covariance matrix R = acov q and asymp-
totic mean RA under local alternatives to the null, then the statistic will be of the
form qR+q, where R+ is any symmetric matrix that satisfies RR+R = R. The matrix
R+ will be the ordinary inverse R- if R is nonsingular, and may be the Moore-
Penrose generalized inverse R - if R is singular. Section 9.8 defines generalized
inverses, and Lemma 9.7 in that section shows that if q is a normal random vector
with covariance matrix R of rank r and mean R1, then qR+q is distributed noncentral
chi-square with r degrees of freedom and noncentrality parameter AR;i under local
alternatives to the null.
Ch. 36: Large Sample Estimation and Hypothesis Testing 2223

Consider W,,. Under the local alternative ~(0,) = S/&, row five of Table 1 gives
q=6+AB-Gfi-2ul;! normal with mean S and a nonsingular covariance matrix
R = AB-A. Let A = R-6. Then Lemma 9.7 implies the result with noncentrality
parameter iR/1= 6R 6 = 6(AB- A)- 6.
Consider W,,. The generalized inverse R of R = acov 8,, - acov t?,,can be written
as:

The first identity substitutes the covariance formula from row 2 of Table 1. The
second and third equalities follow from Section 9.8, Lemma 9.5, (5) and (4),
respectively. One can check that A = R-B- A(AB- A)- 6 satisfies RI = B- A x
(AB-A)-8, so that ;IRA = d(AB-A)-6.
The statistic W,, is obtained by noting that for R = BPA(AB-A)-AB-, the
matrix R+ = B satisfies RR+R = R and /z = RfB-A(AB-A)-6 satisfies RL =
B- A(AB- A)- 6.
Similar arguments establish the properties of the LM statistics. In particular,
the second form of the statistic LM,, follows from previous argument that
A(AB-A)-A and B-A(AB-A)-AB- are generalized inverses, and the
statistic LM,, is obtained by noting that R = A(AB-A)-A has RR+R = R when
R+ =B-.
To demonstrate the asymptotic equivalence of DM, to the earlier statistics, make
a Taylors expansion of the sample moments for i?n about &, J&,(f?,,) = J$,(&) +
G,,&(g,, - 8) + oP, and substitute this in the expression for DM, to obtain

with the last equality holding since G$?; $nj,(&) = 0. Q.E.D.

The Wald statistic W,, asks how close are the unconstrained estimators to
satisfying the constraints; i.e., how close to zero is a(B,)? This variety of the test is
particularly useful when the unconstrained estimator is available and the matrix A
is easy to compute. For example, when the null is that a subvector of parameters
equal constants, then A is a selection matrix that picks out the corresponding rows
and columns of B- , and this test reduces to a quadratic form with the deviations
of the estimators from their hypothesized values in the wings, and the inverse of
their asymptotic covariance matrix in the center. In the special case H,: 8 = 8, one
has A = I.
2224 W.K. Newey and D. McFadden

The Wald test W,, is useful if both the unconstrained and constrained estimators
are available. Its first version requires only the readily available asymptotic
covariance matrices of the two estimators, but for r < k requires calculation of a
generalized inverse. Algorithms for this are available, but are often not as
numerically stable as classical inversion algorithms because near-zero and exact-
zero characteristic roots are treated very differently. The second version involves
only ordinary inverses, and is potentially quite useful for computation in
applications.
The Wald statistic W,, treats the constrained estimators us ifthey were constants
with a zero asymptotic covariance matrix. This statistic is particularly simple to
compute when the unconstrained and constrained estimators are available, as no
matrix differences or generalized inverses are involved, and the matrix A need not
be computed. The statistic W,, is in general larger than W,, in finite samples, since
the center of the second quadratic form is (acov6J and the center of the first
quadratic form is (acov e?, - acov I!?~)-, while the tails are the same. Nevertheless,
the two statistics are asymptotically equivalent.
The approach of Lagrange multiplier or score tests is to calculate the constrained estimator $\tilde\theta_n$, and then to base a statistic on the discrepancy from zero, at this argument, of a condition that would be zero if the constraint were not binding. The statistic $LM_{1n}$ asks how close the Lagrangian multipliers $\gamma_n$, measuring the degree to which the hypothesized constraints are binding, are to zero. This statistic is easy to compute if the constrained estimation problem is actually solved by Lagrangian methods, and the multipliers are obtained as part of the calculation. The statistic $LM_{2n}$ asks how close to zero is the gradient of the distance criterion, evaluated at the constrained estimator. This statistic is useful when the constrained estimator is available and it is easy to compute the gradient of the distance criterion, say using the algorithm that seeks the minimum distance estimates. The second version of the statistic avoids computation of a generalized inverse.
The statistic $LM_{3n}$ bears the same relationship to $LM_{2n}$ that $W_{3n}$ bears to $W_{2n}$. This flavor of the test statistic is particularly convenient to calculate, as it can be obtained by auxiliary regressions starting from the constrained estimator $\tilde\theta_n$:

Theorem 9.3

$LM_{3n}$ can be calculated by a 2SLS regression:

(a) Regress $\nabla_\theta\ell(z_t, \tilde\theta_n)'$ on $g(z_t, \tilde\theta_n)$, and retrieve fitted values $\widehat{\nabla_\theta\ell}(z_t, \tilde\theta_n)$.
(b) Regress 1 on $\widehat{\nabla_\theta\ell}(z_t, \tilde\theta_n)$, and retrieve fitted values $\hat y_t$. Then $LM_{3n} = \sum_{t=1}^n \hat y_t^2$.

For MLE, $g = \nabla_\theta\ell$, and this procedure reduces to OLS.

Proof

Let $y$ be an $n$-vector of 1's, $X$ an $n \times k$ array whose rows are $\nabla_\theta\ell(z_t, \tilde\theta_n)$, and $Z$ an $n \times m$ array whose rows are $g(z_t, \tilde\theta_n)'$. The first regression yields $\hat X = Z(Z'Z)^{-1}Z'X$, and the second regression yields $\hat y = \hat X(\hat X'\hat X)^{-1}\hat X'y$. Then, $(1/n)Z'Z = \hat\Omega_n$, $(1/n)Z'X = \hat\Gamma_n$, $(1/n)Z'y = \hat g_n(\tilde\theta_n)$, and

$$\hat y'\hat y = y'\hat X(\hat X'\hat X)^{-1}\hat X'y = y'Z(Z'Z)^{-1}Z'X[X'Z(Z'Z)^{-1}Z'X]^{-1}X'Z(Z'Z)^{-1}Z'y.$$

Note that $\nabla_\theta Q_n(\tilde\theta_n) = -\hat G_n'\hat\Omega_n^{-1}\hat g_n(\tilde\theta_n) = -\hat\Gamma_n'\hat\Omega_n^{-1}\hat g_n(\tilde\theta_n)$. Substituting terms, $\hat y'\hat y = LM_{3n}$. Q.E.D.
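A minimal numerical sketch of the 2SLS recipe in Theorem 9.3, assuming the arrays of the proof have already been formed (names are hypothetical; numpy's least squares routine stands in for the auxiliary regressions):

```python
# Hedged sketch of the 2SLS auxiliary-regression form of LM_3n.
# D: n x k array whose rows are the theta-gradients of l at the constrained
# estimator; Z: n x m array whose rows are the moments g at that estimator.
import numpy as np

def lm_via_2sls(D, Z):
    n = D.shape[0]
    # (a) regress each column of D on Z; keep fitted values Z (Z'Z)^- Z' D
    D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
    # (b) regress the constant 1 on the fitted gradients;
    #     LM is the sum of squared fitted values of that regression
    y_hat = D_hat @ np.linalg.lstsq(D_hat, np.ones(n), rcond=None)[0]
    return y_hat @ y_hat
```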

Another form of the auxiliary regression for computing $LM_{3n}$ arises in the case of nonlinear instrumental variable regression. Consider the model $y_t = k(x_t, \theta_0) + \varepsilon_t$ with $E(\varepsilon_t|w_t) = 0$ and $E(\varepsilon_t^2|w_t) = \sigma^2$, where $w_t$ is a vector of instruments. Define $z_t = (y_t, x_t, w_t)$ and $g(z_t, \theta) = w_t[y_t - k(x_t, \theta)]$. Then $Eg(z, \theta_0) = 0$ and $Eg(z, \theta_0)g(z, \theta_0)' = \sigma^2 Ew_tw_t'$. The GMM criterion $Q_n(\theta)$ for this model is

$$Q_n(\theta) = -\frac{1}{2}\Big[\frac{1}{n}\sum_{t=1}^n \{y_t - k(x_t, \theta)\}w_t\Big]'\Big[\frac{\hat\sigma^2}{n}\sum_{t=1}^n w_tw_t'\Big]^{-1}\Big[\frac{1}{n}\sum_{t=1}^n \{y_t - k(x_t, \theta)\}w_t\Big];$$

the scalar $\hat\sigma^2$ does not affect the optimization of this function. Consider the hypothesis $a(\theta_0) = 0$, and let $\tilde\theta_n$ be the GMM estimator obtained subject to this hypothesis. One can compute $LM_{3n}$ by the following method:

(a) Regress $\nabla_\theta k(x_t, \tilde\theta_n)$ on $w_t$, and retrieve the fitted values $\widehat{\nabla_\theta k}_t$.
(b) Regress the residual $u_t = y_t - k(x_t, \tilde\theta_n)$ on $\widehat{\nabla_\theta k}_t$, and retrieve the fitted values $\hat u_t$.

Then $LM_{3n} = n\sum_{t=1}^n \hat u_t^2/\sum_{t=1}^n u_t^2 = nR^2$, with $R^2$ the uncentered multiple correlation coefficient. Note that this is not in general the same as the standard $R^2$ produced by OLS, since the denominator of that definition is the sum of squared deviations of the dependent variable about its mean. When the dependent variable has mean zero (e.g. if the nonlinear regression has an additive intercept term), the centered and uncentered definitions coincide.
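The two-step recipe above translates directly into two least squares fits. A hedged sketch with hypothetical names (the gradient array, instrument matrix, and residuals at the constrained estimator are assumed to be supplied by the user):

```python
# Sketch of the nR^2 form of the LM statistic for nonlinear IV.
# grad_k: n x k rows of the theta-gradient of k(x_t, theta) at the constrained
# estimator; W: n x m instrument matrix; u: residuals y_t - k(x_t, theta~).
import numpy as np

def lm_nr2(grad_k, W, u):
    n = len(u)
    fitted_grad = W @ np.linalg.lstsq(W, grad_k, rcond=None)[0]             # step (a)
    u_hat = fitted_grad @ np.linalg.lstsq(fitted_grad, u, rcond=None)[0]    # step (b)
    return n * (u_hat @ u_hat) / (u @ u)    # n times the *uncentered* R^2
```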
The approach of the distance metric test is to base a statistic on the discrepancy between the value of the distance metric, evaluated at the constrained estimate, and the minimum attained by the unconstrained estimate. This test is particularly convenient when both the unconstrained and constrained estimators can be computed, and the estimation algorithm returns the goodness-of-fit statistics. In the case of linear or nonlinear least squares, this is the familiar test statistic based on the sums of squared residuals from the constrained and unconstrained regressions.
The tests based on GMM estimation with an optimal weight matrix can be extended to any extremum estimator. Consider such an estimator $\hat\theta$, satisfying eq. (1.1). Also, let $\tilde\theta$ be a restricted estimator, maximizing $Q_n(\theta)$ subject to $a(\theta) = 0$. Suppose that the equality $H = -\Sigma$ is satisfied, for the Hessian matrix $H$ and the asymptotic variance $\Sigma$ [of $\sqrt{n}\,\nabla_\theta Q_n(\theta_0)$] from Theorem 3.1. This property is a generalization of the information matrix equality to any extremum estimator. For GMM estimation with an optimal weight matrix, this equality is satisfied if the objective function is normalized by 1/2, i.e. $Q_n(\theta) = -\frac{1}{2}\hat g_n(\theta)'\hat\Omega_n^{-1}\hat g_n(\theta)$. Let $\hat\Sigma$ denote an estimator of $\Sigma$ based on $\hat\theta$ and $\tilde\Sigma$ an estimator based on $\tilde\theta$. Consider the following test statistics:

$$W = n\,a(\hat\theta)'[\hat A\hat\Sigma^{-1}\hat A']^{-1}a(\hat\theta), \qquad \hat A = \nabla_\theta a(\hat\theta),$$

$$LM = n\,\nabla_\theta Q_n(\tilde\theta)'\tilde\Sigma^{-1}\nabla_\theta Q_n(\tilde\theta),$$

$$DM = 2n[Q_n(\hat\theta) - Q_n(\tilde\theta)].$$

The statistic $W$ is analogous to the first Wald statistic in Table 2 and the statistic $LM$ to the third LM statistic in Table 2. We could also give analogs of the other statistics in Table 2, but for brevity we leave these extensions to the reader. Under the conditions of Theorems 2.1, 3.1, and 4.1, the equality $H = -\Sigma$, and the same conditions on $a(\theta)$ previously given, these three test statistics will all have an asymptotic chi-squared distribution, with degrees of freedom equal to the number of components of $a(\theta)$.
As we have discussed, optimal GMM estimation provides one example of these statistics. The MLE also provides an example, as does optimal CMD estimation. Nonlinear least squares also fits this framework, if homoskedasticity holds and the objective function is normalized in the right way. Suppose that $\mathrm{Var}(y|x) = \sigma^2$, a constant. Consider the objective function $Q_n(\theta) = -(2\hat\sigma^2)^{-1}n^{-1}\sum_{i=1}^n [y_i - h(x_i, \theta)]^2$, where $\hat\sigma^2$ is an estimator of $\sigma^2$. Then it is straightforward to check that, because of the normalization of dividing by $2\hat\sigma^2$, the condition $H = -\Sigma$ is satisfied. In this example, the DM test statistic will have a familiar squared residual form.
There are many examples of estimators where $H = -\Sigma$ is not satisfied. In these cases, the Wald statistic can still be used, but $\hat\Sigma^{-1}$ must be replaced by a consistent estimator of the asymptotic variance of $\hat\theta$. There is another version of the LM statistic that will be asymptotically equivalent to the Wald statistic in this case, but for brevity we do not describe it here. Furthermore, the DM statistic will not have a chi-squared distribution. These results are further discussed for quasi-maximum likelihood estimation by White (1982a), and for the general extremum estimator case by Gourieroux et al. (1983).

9.3. One-step versions of the trinity

Calculation of Wald or Lagrange multiplier test statistics in finite samples requires estimation of $G$, $\Omega$, and/or $A$. Any convenient consistent estimates of these arrays will do, and will preserve the asymptotic equivalence of the tests under the null and local alternatives. In particular, one can evaluate the terms entering the definitions of these arrays at $\hat\theta_n$, $\tilde\theta_n$, or any other consistent estimator of $\theta_0$. In sample analogs that converge to these arrays by the law of large numbers, one can freely substitute sample and population terms that leave the probability limits unchanged. For example, if $z_t = (y_t, x_t)$ and $\bar\theta_n$ is any consistent estimator of $\theta_0$, then $\Omega$ can be estimated by (1) an analytic expression for $Eg(z, \theta)g(z, \theta)'$, evaluated at $\bar\theta_n$, (2) a sample average $(1/n)\sum_{t=1}^n g(z_t, \bar\theta_n)g(z_t, \bar\theta_n)'$, or (3) a sample average of conditional expectations $(1/n)\sum_{t=1}^n E[g(y, x_t, \theta)g(y, x_t, \theta)'|x_t]$, evaluated at $\bar\theta_n$. These first-order efficiency equivalences do not hold in finite samples, or even to higher orders of $n^{-1/2}$. Thus, there may be clear choices between these when higher orders of approximation are taken into account.
The next result is an application of the one-step theorem in Section 3.4, and shows how one can start from any initial $\sqrt{n}$-consistent estimator of $\theta_0$ and in one iteration obtain versions of the trinity that are asymptotically equivalent to the versions obtained when the exact estimators $\hat\theta_n$ and $\tilde\theta_n$ are used. Further, the required iterations can usually be cast as regressions, so their computation is relatively elementary. Consider the GMM criterion $Q_n(\theta)$. Suppose $\bar\theta_n$ is any consistent estimator of $\theta_0$ such that $\sqrt{n}(\bar\theta_n - \theta_0)$ is stochastically bounded. Let $\hat\theta_n$ be the unconstrained maximizer of $Q_n$ and $\tilde\theta_n$ the maximizer of $Q_n$ subject to the constraint $a(\theta) = 0$. Suppose the null hypothesis, or a local alternative, $a(\theta_0) = \delta/\sqrt{n}$, is true. The unconstrained one-step estimator from eq. (3.11), $\hat\theta_{1n} = \bar\theta_n - (\hat G_n'\hat\Omega_n^{-1}\hat G_n)^{-1}\hat G_n'\hat\Omega_n^{-1}\hat g_n(\bar\theta_n)$, satisfies $\sqrt{n}(\hat\theta_{1n} - \hat\theta_n) \overset{p}{\to} 0$. Similarly, define one-step constrained estimators from the Lagrangian first-order conditions:

$$\begin{bmatrix} \tilde\theta_{1n} \\ \tilde\gamma_{1n} \end{bmatrix} = \begin{bmatrix} \bar\theta_n \\ 0 \end{bmatrix} - \begin{bmatrix} -\hat B & \hat A' \\ \hat A & 0 \end{bmatrix}^{-1}\begin{bmatrix} \nabla_\theta Q_n(\bar\theta_n) \\ a(\bar\theta_n) \end{bmatrix}.$$

Note in this definition that $\gamma = 0$ is a trivial initially consistent estimator of the Lagrangian multipliers under the null or local alternatives, and that the arrays $\hat B$ and $\hat A$ can be estimated at $\bar\theta_n$. The one-step theorem again applies, yielding $\sqrt{n}(\tilde\theta_{1n} - \tilde\theta_n) \overset{p}{\to} 0$ and $\sqrt{n}(\tilde\gamma_{1n} - \gamma_n) \overset{p}{\to} 0$. Then, these one-step equivalents can be substituted in any of the test statistics of the trinity without changing their asymptotic distribution.
A regression procedure for calculating the one-step expressions is often useful for computation. The adjustment to $\bar\theta_n$ yielding the one-step unconstrained estimator is obtained by a two-stage least squares regression of the constant one on $\nabla_\theta\ell(z_t, \bar\theta_n)$, with $g(z_t, \bar\theta_n)$ as instruments; i.e.

(a) Regress each component of $\nabla_\theta\ell(z_t, \bar\theta_n)'$ on $g(z_t, \bar\theta_n)$ in the sample $t = 1, \ldots, n$, and retrieve fitted values $\widehat{\nabla_\theta\ell}(z_t, \bar\theta_n)$.
(b) Regress 1 on $\widehat{\nabla_\theta\ell}(z_t, \bar\theta_n)$, and adjust $\bar\theta_n$ by the amounts of the fitted coefficients.

Step (a) yields $\widehat{\nabla_\theta\ell}(z_t, \bar\theta_n) = g(z_t, \bar\theta_n)'\hat\Omega_n^{-1}\hat\Gamma_n$, and step (b) yields coefficients
$$[\hat\Gamma_n'\hat\Omega_n^{-1}\hat\Gamma_n]^{-1}\hat\Gamma_n'\hat\Omega_n^{-1}\hat g_n(\bar\theta_n).$$

This is the adjustment indicated by the one-step theorem.
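A hedged sketch of these two auxiliary regressions (hypothetical names; the arrays are assumed to be evaluated at an initial root-n-consistent estimator, and the sign of the adjustment follows eq. (3.11)):

```python
# One-step unconstrained GMM update via the two auxiliary regressions above.
# D: n x k rows of the theta-gradient of l at theta_bar; Z: n x m rows of g.
import numpy as np

def one_step_update(theta_bar, D, Z):
    n = Z.shape[0]
    # step (a): fitted gradients Z (Z'Z)^- Z' D
    D_hat = Z @ np.linalg.lstsq(Z, D, rcond=None)[0]
    # step (b): regress 1 on the fitted gradients; the coefficients equal
    # (Gamma' Omega^-1 Gamma)^-1 Gamma' Omega^-1 g_bar
    coef = np.linalg.lstsq(D_hat, np.ones(n), rcond=None)[0]
    return theta_bar - coef    # adjust theta_bar by the fitted coefficients
```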


Computation of one-step constrained estimators is conveniently done using the formulae

$$\tilde\theta_{1n} = \hat\theta_{1n} - \hat B^{-1}\hat A'(\hat A\hat B^{-1}\hat A')^{-1}[a(\bar\theta_n) + \hat A(\hat\theta_{1n} - \bar\theta_n)],$$

$$\tilde\gamma_{1n} = -(\hat A\hat B^{-1}\hat A')^{-1}a(\hat\theta_{1n}) = -(\hat A\hat B^{-1}\hat A')^{-1}[a(\bar\theta_n) + \hat A(\hat\theta_{1n} - \bar\theta_n)],$$

with $\hat A$ and $\hat B$ evaluated at $\bar\theta_n$. To derive these formulae from the first-order conditions for the Lagrangian problem, replace $\nabla_\theta Q_n(\theta)$ by the expression $-(\hat\Gamma_n'\hat\Omega_n^{-1}\hat\Gamma_n)(\theta - \hat\theta_{1n})$ from the one-step definition of the unconstrained estimator, replace $a(\theta)$ by $a(\bar\theta_n) + \hat A(\theta - \bar\theta_n)$, and use the formula for a partitioned inverse.

9.4. Special cases

Maximum likelihood. We have noted that maximum likelihood estimation can be treated as GMM estimation with moments equal to the score, $g = \nabla_\theta\ell$. The statistics in Table 2 remain the same, with the simplification that $B = \Omega$ ($= G = \Gamma$). The likelihood ratio statistic $2n[L_n(\hat\theta_n) - L_n(\tilde\theta_n)]$, where $L_n(\theta) = (1/n)\sum_{t=1}^n \ell(z_t, \theta)$, is shown by a Taylor's expansion about $\tilde\theta_n$ to be asymptotically equivalent to the Wald statistic $W_{3n}$, and hence to all the statistics in Table 2.
Suppose one sets up an estimation problem in terms of a maximum likelihood criterion, but that one does not in fact have the true likelihood function. Suppose that, in spite of this misspecification, optimization of the selected criterion yields consistent estimates. One place this commonly arises is when panel data observations are serially correlated, but one writes down the marginal likelihoods of the observations ignoring serial correlation. These are sometimes called pseudo-likelihood criteria. The resulting estimators can be interpreted as GMM estimators, so that hypotheses can be tested using the statistics in Table 2. Note however that now $G \neq \Omega$, so that $B = G'\Omega^{-1}G$ must be estimated in full, and one cannot do tests using a likelihood ratio of the pseudo-likelihood function.
Least squares. Consider the nonlinear regression model $y = h(x, \theta) + \varepsilon$, and suppose $E(y|x) = h(x, \theta_0)$ and $E[\{y - h(x, \theta_0)\}^2|x] = \sigma^2$. Minimizing the least squares criterion $Q_n(\theta) = \sum_{t=1}^n [y_t - h(z_t, \theta)]^2$ is asymptotically equivalent to GMM estimation with $g(z, \theta) = [y - h(x, \theta)]\nabla_\theta h(x, \theta)$ and a distance metric $\Omega_n = (\sigma^2/n)\sum_{t=1}^n [\nabla_\theta h(x_t, \theta_0)][\nabla_\theta h(x_t, \theta_0)]'$. For this problem, $B = \Omega = G$. If $h(z_t, \theta) = z_t'\theta$ is linear, one has $g(z_t, \theta) = u_t(\theta)z_t$, where $u_t(\theta) = y_t - z_t'\theta$ is the regression residual, and $\Omega_n = (\sigma^2/n)\sum_{t=1}^n z_tz_t'$.

Instrumental variables. Consider the regression model $y_t = h(z_t, \theta_0) + \varepsilon_t$, where $\varepsilon_t$ may be correlated with $\nabla_\theta h(z_t, \theta_0)$. Suppose there are instruments $w_t$ such that $E(\varepsilon_t|w_t) = 0$. For this problem, one has the moment conditions $g(y_t, z_t, w_t, \theta) = [y_t - h(z_t, \theta)]f(w_t)$ satisfying $Eg(y_t, z_t, w_t, \theta_0) = 0$ for any vector of functions $f(w)$ of the instruments, so the GMM criterion becomes

$$Q_n(\theta) = -\frac{1}{2}\Big[\frac{1}{n}\sum_{t=1}^n \{y_t - h(z_t, \theta)\}f(w_t)\Big]'\Omega_n^{-1}\Big[\frac{1}{n}\sum_{t=1}^n \{y_t - h(z_t, \theta)\}f(w_t)\Big],$$

with $\Omega_n = (\sigma^2/n)\sum_{t=1}^n f(w_t)f(w_t)'$. Suppose that it were feasible to construct the conditional expectation of the gradient of the regression function conditioned on $w_t$, $q_t = E[\nabla_\theta h(z_t, \theta_0)|w_t]$. This is the optimal vector of functions of the instruments, in the sense that the GMM estimator based on $f(w_t) = q_t$ will yield estimators with an asymptotic covariance matrix that is smaller in the positive definite sense than that for any other distinct vector of functions of $w$. A feasible GMM estimator with good efficiency properties may then be obtained by first obtaining a preliminary $\sqrt{n}$-consistent estimator $\bar\theta_n$, employing a simple practical distance metric; second, regressing $\nabla_\theta h(z_t, \bar\theta_n)$ on a flexible family of functions of $w_t$, such as low-order polynomials in $w_t$; and, third, using fitted values from this regression as the vector of functions $f(w_t)$ in a final GMM estimation. Note that only one Newton-Raphson step is needed in the last stage. Simplifications of this problem result when $h(z, \theta) = z'\theta$ is linear in $\theta$; in this case, the feasible procedure above is simply 2SLS, and no iteration is needed.
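The three-step feasible procedure just described can be sketched as follows. All names are hypothetical: h and grad_h are user-supplied callables for the regression function and its Jacobian, w is a matrix of low-order polynomial functions of the instruments, and the Newton-Raphson step solves the exactly identified moment condition f'(y − h) = 0:

```python
# Hedged sketch of feasible GMM with approximated optimal instruments.
import numpy as np

def feasible_optimal_iv(y, x, w, h, grad_h, theta_bar):
    """h(x, theta): n-vector of regression values; grad_h(x, theta): n x k
    Jacobian; w: n x m matrix of flexible functions of the instruments."""
    # Step 2: regress the gradient at theta_bar on the functions of w;
    # the fitted values approximate q_t = E[grad h | w_t]
    G = grad_h(x, theta_bar)
    f = w @ np.linalg.lstsq(w, G, rcond=None)[0]
    # Step 3: one Newton-Raphson step on the moments f'(y - h(theta)) = 0,
    # whose Jacobian in theta is -f'G
    u = y - h(x, theta_bar)
    step = np.linalg.solve(f.T @ G, f.T @ u)
    return theta_bar + step
```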
Simple hypotheses. An important practical case of the general nonlinear hypothesis $a(\theta_0) = 0$ is that a subset of the parameters are zero. (A hypothesis that parameters equal constants other than zero can be reduced to this case by reparametrization.) Assume $\theta = (\alpha', \beta')'$ and $H_0$: $\beta = 0$. The first-order conditions for the solution of this problem are $0 = \sqrt{n}\,\nabla_\alpha Q_n(\tilde\theta_n)$, $0 = \sqrt{n}\,\nabla_\beta Q_n(\tilde\theta_n) + \sqrt{n}\,\gamma_n$, and $0 = \tilde\beta_n$, implying $\gamma_n = -\nabla_\beta Q_n(\tilde\theta_n)$ and $A = [0\ \ I]$. Let $C = B^{-1}$ be the asymptotic covariance matrix of $\sqrt{n}(\hat\theta_n - \theta_0)$, and $AB^{-1}A' = C_{\beta\beta}$, the submatrix of $C$ for $\beta$. Taylor's expansions of the first-order conditions about $\theta_0$ imply $\sqrt{n}(\tilde\alpha_n - \alpha_0) = -B_{\alpha\alpha}^{-1}B_{\alpha\beta}\sqrt{n}\,\hat\beta_n + o_p$ and $\sqrt{n}\,\gamma_n = [B_{\beta\beta} - B_{\beta\alpha}B_{\alpha\alpha}^{-1}B_{\alpha\beta}]\sqrt{n}\,\hat\beta_n + o_p = C_{\beta\beta}^{-1}\sqrt{n}\,\hat\beta_n + o_p$. Then the Wald statistics are quadratic forms in these quantities, the leading case being

$$W_{1n} = n\,\hat\beta_n'C_{\beta\beta}^{-1}\hat\beta_n.$$
One can check the asymptotic equivalence of these statistics by substituting the expression for $\sqrt{n}(\tilde\alpha_n - \alpha_0)$. The LM statistic, in any version, becomes $LM_n = n\,\nabla_\beta Q_n(\tilde\theta_n)'C_{\beta\beta}\nabla_\beta Q_n(\tilde\theta_n)$. Recall that $B$, and hence $C$, can be evaluated at any consistent estimator of $\theta_0$. In particular, the constrained estimator is consistent under the null or under local alternatives. The LM testing procedure for this case is then to (a) compute the constrained estimator $\tilde\alpha_n$ subject to the condition $\beta = 0$, (b) calculate the gradient and Hessian of $Q_n$ with respect to the full parameter vector, evaluated at $\tilde\alpha_n$ and $\beta = 0$, and (c) form the quadratic form above for $LM_n$ from the $\beta$ part of the gradient and the $\beta$ submatrix of the inverse of the Hessian. Note that this does not require any iteration of the GMM criterion with respect to the full parameter vector.
It is also possible to carry out the calculation of the $LM_n$ test statistic using auxiliary regressions. This could be done using the auxiliary regression technique introduced earlier for the calculation of $LM_{3n}$ in the case of any nonlinear hypothesis, but a variant is available for this case that reduces the size of the regressions required (a sketch of the computation follows the justification below). The steps are as follows:

(a) Regress $\nabla_\alpha\ell(z_t, \tilde\theta_n)'$ and $\nabla_\beta\ell(z_t, \tilde\theta_n)'$ on $g(z_t, \tilde\theta_n)$, and retrieve the fitted values $\widehat{\nabla_\alpha\ell}(z_t, \tilde\theta_n)$ and $\widehat{\nabla_\beta\ell}(z_t, \tilde\theta_n)$.
(b) Regress $\widehat{\nabla_\beta\ell}(z_t, \tilde\theta_n)$ on $\widehat{\nabla_\alpha\ell}(z_t, \tilde\theta_n)$, and retrieve the residual $u(z_t, \tilde\theta_n)$.
(c) Regress the constant 1 on the residual $u(z_t, \tilde\theta_n)$, and calculate the sum of squares of the fitted values of 1. This quantity is $LM_n$.
To justify this method, start from the gradient of the GMM criterion,

$$0 = \nabla_\alpha Q_n(\tilde\alpha_n, 0) = -\hat\Gamma_{n\alpha}'\hat\Omega_n^{-1}\hat g_n(\tilde\alpha_n, 0), \qquad \nabla_\beta Q_n(\tilde\alpha_n, 0) = -\hat\Gamma_{n\beta}'\hat\Omega_n^{-1}\hat g_n(\tilde\alpha_n, 0),$$

where $\hat\Gamma_n$ is partitioned into its $\alpha$ and $\beta$ submatrices. From the formula for partitioned inverses, one has for $C = B^{-1}$ the expression

$$C_{\beta\beta} = [\Gamma_\beta'\Omega^{-1}\Gamma_\beta - \Gamma_\beta'\Omega^{-1}\Gamma_\alpha(\Gamma_\alpha'\Omega^{-1}\Gamma_\alpha)^{-1}\Gamma_\alpha'\Omega^{-1}\Gamma_\beta]^{-1}.$$

The fitted values from step (a) satisfy

$$\widehat{\nabla_\alpha\ell}(z_t, \tilde\theta_n) = g(z_t, \tilde\theta_n)'\hat\Omega_n^{-1}\hat\Gamma_{n\alpha} \quad\text{and}\quad \widehat{\nabla_\beta\ell}(z_t, \tilde\theta_n) = g(z_t, \tilde\theta_n)'\hat\Omega_n^{-1}\hat\Gamma_{n\beta}.$$

Then the residuals from step (b) satisfy

$$u(z_t, \tilde\theta_n) = g(z_t, \tilde\theta_n)'\hat\Omega_n^{-1}\hat\Gamma_{n\beta} - g(z_t, \tilde\theta_n)'\hat\Omega_n^{-1}\hat\Gamma_{n\alpha}(\hat\Gamma_{n\alpha}'\hat\Omega_n^{-1}\hat\Gamma_{n\alpha})^{-1}\hat\Gamma_{n\alpha}'\hat\Omega_n^{-1}\hat\Gamma_{n\beta}.$$

Then

$$\frac{1}{n}\sum_{t=1}^n u(z_t, \tilde\theta_n)' = -\nabla_\beta Q_n(\tilde\theta_n)' + \nabla_\alpha Q_n(\tilde\theta_n)'(\hat\Gamma_{n\alpha}'\hat\Omega_n^{-1}\hat\Gamma_{n\alpha})^{-1}\hat\Gamma_{n\alpha}'\hat\Omega_n^{-1}\hat\Gamma_{n\beta} = -\nabla_\beta Q_n(\tilde\theta_n)',$$

using $\nabla_\alpha Q_n(\tilde\theta_n) = 0$, while $(1/n)\sum_{t=1}^n u(z_t, \tilde\theta_n)'u(z_t, \tilde\theta_n)$ estimates $C_{\beta\beta}^{-1}$. Then, the step (c) regression yields $LM_n$. In the case of maximum likelihood estimation, step (a) is redundant and can be omitted.
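The sketch promised above, with hypothetical array names (Da and Db hold the alpha- and beta-gradients of l, and Z the moments g, all at the constrained estimates):

```python
# Three-regression computation of LM_n for H0: beta = 0.
import numpy as np

def lm_simple_hypothesis(Da, Db, Z):
    n = Z.shape[0]
    # (a) project both gradient blocks on the moments
    Da_hat = Z @ np.linalg.lstsq(Z, Da, rcond=None)[0]
    Db_hat = Z @ np.linalg.lstsq(Z, Db, rcond=None)[0]
    # (b) residual of the beta block after removing the alpha block
    u = Db_hat - Da_hat @ np.linalg.lstsq(Da_hat, Db_hat, rcond=None)[0]
    # (c) regress 1 on the residual; LM is the sum of squared fitted values
    y_hat = u @ np.linalg.lstsq(u, np.ones(n), rcond=None)[0]
    return y_hat @ y_hat
```

For MLE, the moments are the scores themselves, so step (a) can be skipped by passing the raw gradient blocks directly.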

9.5. Tests for overidentifying restrictions

Consider the GMM estimator based on moments $g(z_t, \theta)$, where $g$ is $m \times 1$, $\theta$ is $k \times 1$, and $m > k$, so there are overidentifying moments. The criterion

$$Q_n(\theta) = -\tfrac{1}{2}\hat g_n(\theta)'\hat\Omega_n^{-1}\hat g_n(\theta),$$

evaluated at its maximizing argument $\hat\theta_n$, has the property that $-2n\hat Q_n \equiv -2nQ_n(\hat\theta_n) \overset{d}{\to} \chi^2_{m-k}$ under the null hypothesis that $Eg(z, \theta_0) = 0$. This statistic then provides a specification test for the overidentifying moments in $g$. It can also be used as an indicator of convergence in the numerical search for $\hat\theta_n$.
To demonstrate this result, recall from eqs. (9.1) and (9.2) that $\Omega^{-1/2}\sqrt{n}\,\hat g_n(\theta_0) \equiv \mathscr{U}_n \overset{d}{\to} N(0, I_m)$ and $\sqrt{n}(\hat\theta_n - \theta_0) = -B^{-1}G'\Omega^{-1/2}\mathscr{U}_n + o_p$. Then, a Taylor's expansion yields

$$\sqrt{n}\,\hat g_n(\hat\theta_n) = \Omega_n^{1/2}\mathscr{U}_n - G_n(G_n'\Omega_n^{-1}G_n)^{-1}G_n'\Omega_n^{-1/2}\mathscr{U}_n + o_p = \Omega_n^{1/2}R_n\mathscr{U}_n + o_p,$$

where $R_n = I - \Omega_n^{-1/2}G_n(G_n'\Omega_n^{-1}G_n)^{-1}G_n'\Omega_n^{-1/2}$ is idempotent of rank $m - k$. Then

$$-2nQ_n(\hat\theta_n) = \mathscr{U}_n'R_n\mathscr{U}_n + o_p \overset{d}{\to} \chi^2_{m-k}.$$

Suppose that instead of estimating $\theta$ using the full list of moments, one uses a linear combination $Lg(z, \theta)$, where $L$ is $r \times m$ with $k \leq r < m$. In particular, $L$ may select a subset of the moments. Let $\bar\theta_n$ denote the GMM estimator obtained from these moment combinations, and assume the identification conditions are satisfied, so $\bar\theta_n$ is $\sqrt{n}$-consistent. Then the statistic $S = n\,\hat g_n(\bar\theta_n)'\hat\Omega_n^{-1/2}\hat R_n\hat\Omega_n^{-1/2}\hat g_n(\bar\theta_n) \overset{d}{\to} \chi^2_{m-k}$ under $H_0$, and this statistic is asymptotically equivalent to the statistic $-2nQ_n(\hat\theta_n)$. This result holds for any $\sqrt{n}$-consistent estimator $\bar\theta_n$ of $\theta_0$, not necessarily the optimal GMM estimator for the moments $Lg(z, \theta)$, or even an initially consistent estimator based on only these moments. The distance metric in the center of the quadratic form $S$ does not depend on $L$, so that the formula for the statistic is invariant with respect to the choice of the initially consistent estimator. This implies in particular that the test statistics $S$ for overidentifying restrictions, starting from different subsets of the moment conditions, are all asymptotically equivalent. However, the presence of the idempotent matrix $\hat R_n$ in the center of the quadratic form $S$ is critical to its statistical properties. Only the GMM distance metric criterion using all moments, evaluated at $\hat\theta_n$, is asymptotically equivalent to $S$. Substitution of another $\sqrt{n}$-consistent estimator $\bar\theta_n$ in place of $\hat\theta_n$ yields an asymptotically equivalent version of $S$, but $-2nQ_n(\bar\theta_n)$ is not asymptotically chi-square distributed.
These results are a simple corollary of the one-step theorem. Starting from $\bar\theta_n$, the one-step estimator of $\theta$ satisfies $\sqrt{n}(\hat\theta_{1n} - \bar\theta_n) = -(\hat G_n'\hat\Omega_n^{-1}\hat G_n)^{-1}\hat G_n'\hat\Omega_n^{-1}\sqrt{n}\,\hat g_n(\bar\theta_n)$. Then, one has $\sqrt{n}\,\hat g_n(\hat\theta_{1n}) = \sqrt{n}\,\hat g_n(\bar\theta_n) + \hat G_n\sqrt{n}(\hat\theta_{1n} - \bar\theta_n) + o_p = \hat\Omega_n^{1/2}\hat R_n\hat\Omega_n^{-1/2}\sqrt{n}\,\hat g_n(\bar\theta_n) + o_p$. Substituting this expression in the formula for $-2nQ_n(\hat\theta_{1n})$ yields the statistic S.
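A minimal sketch of the overidentification statistic, assuming the moments are evaluated at the optimal GMM estimator that uses all of them (per the caveat above, plugging in an estimator based on only a subset of the moments would instead require the projected statistic S). Names are hypothetical, and the simple second-moment estimate of Omega is one choice among several consistent ones:

```python
# -2n Q_n(theta_hat), asymptotically chi-squared(m - k) under correct specification.
import numpy as np

def overid_statistic(moments):
    """moments: n x m array whose rows are g(z_t, theta_hat)."""
    n = moments.shape[0]
    gbar = moments.mean(axis=0)
    omega = moments.T @ moments / n            # estimate of Omega
    return n * gbar @ np.linalg.solve(omega, gbar)
```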
The test for overidentifying restrictions can be recast as an LM test by artificially embedding the original model in a richer model. Partition the moments

$$g(z, \theta) = \begin{bmatrix} g_1(z, \theta) \\ g_2(z, \theta) \end{bmatrix},$$

where $g_1$ is $k \times 1$ with $G_1 = E\nabla_\theta g_1(z, \theta_0)$ of rank $k$, and $g_2$ is $(m - k) \times 1$ with $G_2 = E\nabla_\theta g_2(z, \theta_0)$. Embed this in the model

$$g(z, \theta, \psi) = \begin{bmatrix} g_1(z, \theta) \\ g_2(z, \theta) - \psi \end{bmatrix},$$

where $\psi$ is an $(m - k)$-vector of additional parameters. The first-order condition for GMM estimation of this expanded model is

$$0 = \begin{bmatrix} \hat G_1' & \hat G_2' \\ 0 & -I_{m-k} \end{bmatrix}\hat\Omega_n^{-1}\hat g_n(\theta, \psi).$$

The second block of conditions is satisfied by $\bar\psi_n = \hat g_{2n}(\bar\theta_n)$, no matter what $\bar\theta_n$ is, so $\bar\theta_n$ is determined by $0 = \hat G_1'\hat\Omega_{11,n}^{-1}\hat g_{1n}(\bar\theta_n)$. This is simply the estimator obtained from the first block of moments, and it coincides with the earlier definition of $\bar\theta_n$. Thus, unconstrained estimation of the expanded model coincides with restricted estimation of the original model.
Next consider GMM estimation of the expanded model subject to $H_0$: $\psi = 0$. This constrained estimation obviously coincides with GMM estimation using all moments in the original model, and yields $\hat\theta_n$. Thus, constrained estimation of the expanded model coincides with unrestricted estimation of the original model.
The distance metric test statistic for the constraint $\psi = 0$ in the expanded model is $DM_n = -2n[\bar Q_n(\hat\theta_n, 0) - \bar Q_n(\bar\theta_n, \bar\psi_n)] = -2nQ_n(\hat\theta_n)$, where $\bar Q$ denotes the criterion as a function of the expanded parameter list. One has $\bar Q_n(\hat\theta_n, 0) = Q_n(\hat\theta_n)$ from the coincidence of the constrained expanded model estimator and the unrestricted original model estimator, and one has $\bar Q_n(\bar\theta_n, \bar\psi_n) = 0$ since the number of moments equals the number of parameters. Then, the test statistic $-2nQ_n(\hat\theta_n)$ for overidentifying restrictions is identical to a distance metric test in the expanded model, and hence asymptotically equivalent to any of the trinity of tests for $H_0$: $\psi = 0$ in the expanded model.
We give four examples of econometric problems that can be formulated as tests for overidentifying restrictions:

Example 9.1

If $y = x'\beta + \varepsilon$ with $E(\varepsilon|x) = 0$, $E(\varepsilon^2|x) = \sigma^2$, then the moments

$$g(z, \beta, \sigma^2) = \begin{bmatrix} x(y - x'\beta) \\ (y - x'\beta)^2 - \sigma^2 \end{bmatrix}$$

can be used to estimate $\beta$ and $\sigma^2$. If $\varepsilon$ is normal, then these GMM estimators are MLE. Normality can be tested via the additional moments that give skewness and kurtosis,

$$g_2(z, \beta, \sigma) = \begin{bmatrix} \{(y - x'\beta)/\sigma\}^3 \\ \{(y - x'\beta)/\sigma\}^4 - 3 \end{bmatrix}.$$
Example 9.2

In the linear model $y = x'\beta + \varepsilon$ with $E(\varepsilon|x) = 0$ and $E(\varepsilon_t\varepsilon_s|x) = 0$ for $t \neq s$, but with possible heteroskedasticity of unknown form, one gets the OLS estimate $b$ of $\beta$ and $V(b) = s^2(X'X)^{-1}$ under the null hypothesis of homoskedasticity. A test for homoskedasticity can be based on the population moments $0 = E\,\mathrm{vecu}[xx'(\varepsilon^2 - \sigma^2)]$, where vecu means the vector formed from the upper triangle of the array. The sample value of this moment vector is

$$\mathrm{vecu}\Big[\frac{1}{n}\sum_{t=1}^n x_tx_t'\{(y_t - x_t'b)^2 - s^2\}\Big],$$

the difference between the White robust estimator and the standard OLS estimator of $\mathrm{vecu}[X'\Omega X]$.

Example 9.3

If $\ell(z, \theta)$ is the log-likelihood of an observation, and $\hat\theta_n$ is the MLE, then an additional moment condition that should hold if the model is specified correctly is the information matrix equality

$$0 = E\nabla_{\theta\theta}\ell(z, \theta_0) + E\nabla_\theta\ell(z, \theta_0)\nabla_\theta\ell(z, \theta_0)'.$$

The sample analog is White's information matrix test, which then can be interpreted as a GMM test for overidentifying restrictions.

Example 9.4

In the nonlinear model $y = h(x, \theta) + \varepsilon$ with $E(\varepsilon|x) = 0$, and $\bar\theta_n$ a GMM estimator based on moments $w(x)[y - h(x, \theta)]$, where $w(x)$ is some vector of functions of $x$, suppose one is interested in testing the stronger assumption that $\varepsilon$ is independent of $x$. A necessary and sufficient condition for independence is $E[w(x) - Ew(x)]\,f[y - h(x, \theta_0)] = 0$ for every function $f$ and vector of functions $w$ for which the moments exist. A specification test can be based on a selection of such moments.

9.6. Specification tests in linear models⁵

⁵Paul Ruud contributed substantially to this section.

GMM tests for overidentifying restrictions have particularly convenient forms in linear models; see Newey and West (1988) and Hansen and Singleton (1982). Three standard specification tests will be shown to have this interpretation. We first summarize a few properties of projections that will be used in the following discussion. Let $P_X = X(X'X)^-X'$ denote the projection matrix from $R^n$ onto the linear subspace $\mathscr{X}$ spanned by an $n \times p$ array $X$. (We use a Moore-Penrose generalized inverse in the definition of $P_X$ to handle the possibility that $X$ is less than full rank; see Section 9.8.) Let $Q_X = I - P_X$ denote the projection matrix onto the linear subspace orthogonal to $\mathscr{X}$. Note that $P_X$ and $Q_X$ are idempotent. If $\mathscr{X}$ is the subspace generated by an array $X$ and $\mathscr{W}$ is the subspace generated by an array $W = [X\ Z]$ that contains $X$, then $P_XP_W = P_WP_X = P_X$; i.e. a projection onto a subspace is left invariant by a further projection onto a larger subspace, and a two-stage projection onto a large subspace followed by a projection onto a smaller one is the same as projecting directly onto the smaller one. The subspace of $\mathscr{W}$ that is orthogonal to $\mathscr{X}$ is generated by $Q_XW$; i.e., it is the set of linear combinations of the residuals, orthogonal to $\mathscr{X}$, obtained by regressing $W$ on $X$. Any $y$ in $R^n$ has a unique decomposition $y = P_Xy + Q_XP_Wy + Q_Wy$ into the sum of projections onto $\mathscr{X}$, the subspace of $\mathscr{W}$ orthogonal to $\mathscr{X}$, and the subspace orthogonal to $\mathscr{W}$. The projection $Q_XP_W$ can be rewritten $Q_XP_W = P_W - P_X = P_WQ_X = Q_XP_WQ_X$, or, since $Q_XW = Q_X[X\ Z] = [0\ \ Q_XZ]$, $Q_XP_W = P_{Q_XW} = P_{Q_XZ} = Q_XZ(Z'Q_XZ)^-Z'Q_X$. This implies that $Q_XP_W$ is idempotent, since $(Q_XP_W)(Q_XP_W) = Q_X(P_WQ_X)P_W = Q_X(Q_XP_W)P_W = Q_XP_W$.
Omitted variables test: Consider the regression model $y = X\beta + \varepsilon$, where $y$ is $n \times 1$, $X$ is $n \times k$, $E(\varepsilon|X) = 0$, and $E(\varepsilon\varepsilon'|X) = \sigma^2I$. Suppose one has the hypothesis $H_0$: $\beta_1 = 0$, where $\beta_1$ is a $p \times 1$ subvector of $\beta$; write $X = [X_1\ X_2]$ conformably. Define $u = y - Xb$ to be the residual associated with an estimator $b$ of $\beta$. The GMM criterion is then $-2nQ_n = u'X(X'X)^-X'u/\sigma^2$. The projection matrix $P_X = X(X'X)^-X'$ that appears in the center of this criterion can obviously be decomposed as $P_X = P_{X_2} + (P_X - P_{X_2})$. Under $H_0$, $u = y - X_2b_2$, and $X'u$ can be interpreted as $k = p + q$ overidentifying moments for the $q$ parameters $\beta_2$. Then, the GMM test statistic for overidentifying restrictions is the minimum value $-2n\hat Q_n$, in $b_2$, of $u'P_Xu/\sigma^2$. But $P_Xu = P_{X_2}u + (P_X - P_{X_2})y$ and $\min_{b_2} u'P_{X_2}u = 0$ (at the OLS estimator under $H_0$, which makes $u$ orthogonal to $X_2$). Then $-2n\hat Q_n = y'(P_X - P_{X_2})y/\sigma^2$. The unknown variance $\sigma^2$ in this formula can be replaced by any consistent estimator $s^2$, in particular the estimated variance of the disturbance from either the restricted or the unrestricted regression, without altering the asymptotic distribution, which is $\chi^2_p$ under the null hypothesis.
The statistic $-2n\hat Q_n$ has three alternative interpretations. First,

$$-2n\hat Q_n = y'P_Xy/\sigma^2 - y'P_{X_2}y/\sigma^2 = \frac{SSR_0 - SSR_u}{\sigma^2},$$

which is the difference between the sum of squared residuals from the restricted regression under $H_0$ and the sum of squared residuals from the unrestricted regression, normalized by $\sigma^2$. This is a large sample version of the usual finite sample F-test for $H_0$. Second, note that the fitted value of the dependent variable from the restricted regression is $\hat y_0 = P_{X_2}y$, and from the unrestricted regression is $\hat y_u = P_Xy$, so that

$$-2n\hat Q_n = (\hat y_u'\hat y_u - \hat y_0'\hat y_0)/\sigma^2 = (\hat y_0 - \hat y_u)'(\hat y_0 - \hat y_u)/\sigma^2 = \|\hat y_0 - \hat y_u\|^2/\sigma^2.$$

Then, the statistic is calculated from the distance between the fitted values of the dependent variable with and without $H_0$ imposed. Note that this computation requires no covariance matrix calculations. Third, let $b_0$ denote the GMM estimator restricted by $H_0$ and $b_u$ denote the unrestricted GMM estimator. Then, $b_0$ consists of the OLS estimator for $\beta_2$ and the hypothesized value 0 for $\beta_1$, while $b_u$ is the OLS estimator for the full parameter vector. Note that $\hat y_0 = Xb_0$ and $\hat y_u = Xb_u$, so that $\hat y_0 - \hat y_u = X(b_0 - b_u)$. Then

$$-2n\hat Q_n = (b_0 - b_u)'(X'X/\sigma^2)(b_0 - b_u) = (b_0 - b_u)'V(b_u)^{-1}(b_0 - b_u).$$

This is the Wald statistic $W_{3n}$. From the equivalent form $W_{1n}$ of the Wald statistic, this can also be written as a quadratic form $-2n\hat Q_n = b_{u1}'V(b_{u1})^{-1}b_{u1}$, where $b_{u1}$ is the subvector of unrestricted estimates for the parameters that are zero under the null hypothesis.
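The first interpretation is the easiest to compute. A hedged sketch with simulated, hypothetical data (any consistent variance estimate may be used in the denominator, as noted above):

```python
# Omitted-variables overidentification statistic as a normalized SSR difference.
import numpy as np

def ssr(y, X):
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r

rng = np.random.default_rng(2)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)
ssr_u = ssr(y, X)              # unrestricted regression
ssr_0 = ssr(y, X[:, :2])       # restricted: last two coefficients set to zero
sigma2 = ssr_u / n             # one consistent choice of variance estimate
stat = (ssr_0 - ssr_u) / sigma2    # approx. chi-squared(2) under H0
print(stat)
```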
The Hausman exogeneity test: Consider the regression $y = X_1\beta_1 + X_2\beta_2 + X_3\beta_3 + \varepsilon$, and the null hypothesis that $X_1$ is exogenous, where $X_2$ is known to be exogenous and $X_3$ is known to be endogenous. Suppose $N$ is an array of instruments, including $X_2$, that are sufficient to identify the coefficients when the hypothesis is false. Let $W = [N\ X_1]$ be the full set of instruments available when the null hypothesis is true. Then the best instruments under the null hypothesis are $\hat X_0 = P_WX = [X_1\ X_2\ P_WX_3]$, and the best instruments under the alternative are $\hat X_u = P_NX = [P_NX_1\ X_2\ P_NX_3]$. The test statistic for overidentifying restrictions is $-2n\hat Q_n = y'(P_{\hat X_0} - P_{\hat X_u})y/\sigma^2$, as in the previous case. This can be written $-2n\hat Q_n = (SSR_u - SSR_0)/\sigma^2$, with the numerator the difference between the sum of squared residuals from an OLS regression of $y$ on $\hat X_u$ and that from an OLS regression of $y$ on $\hat X_0$. Also, $-2n\hat Q_n = \|\hat y_0 - \hat y_u\|^2/\sigma^2$, the distance between the fitted values of $y$ from a regression on $\hat X_0$ and a regression on $\hat X_u$. Finally,

$$-2n\hat Q_n = (b_{2SLS,0} - b_{2SLS,u})'[V(b_{2SLS,u}) - V(b_{2SLS,0})]^-(b_{2SLS,0} - b_{2SLS,u}),$$

an extension of the Hausman-Taylor exogeneity test to the problem where some variables are suspect and others are known to be exogenous. Newey and West (1988) show that the matrix in the center of this quadratic form has rank equal to the rank of $X_1$, and that the test statistic can be written equivalently as a quadratic form in the subvector of differences of the 2SLS estimates for the $X_1$ coefficients, with the ordinary inverse of the corresponding submatrix of differences of variances in the center of the quadratic form.
Testing for overidentifying restrictions in a structural system: Consider an equation $y = X\beta + \varepsilon$ from a system of simultaneous equations, and let $W$ denote the array of instruments (exogenous and predetermined variables) in the system. Let $\hat X = P_WX$ denote the fitted values of $X$ obtained from OLS estimation of the reduced form. The equation is overidentified if the number of instruments in $W$ exceeds the number of right-hand-side variables in $X$. The GMM test statistic for overidentification is the minimum in $\beta$ of

$$-2nQ_n(\beta) = u'P_Wu/\sigma^2 = u'P_{\hat X}u/\sigma^2 + u'(P_W - P_{\hat X})u/\sigma^2,$$

where $u = y - X\beta$. As before, $-2n\hat Q_n = y'(P_W - P_{\hat X})y/\sigma^2$. Under $H_0$, this statistic is asymptotically chi-squared distributed with degrees of freedom equal to the difference in the ranks of $W$ and $\hat X$. This statistic can be interpreted as the difference between the sum of squared residuals from the 2SLS regression of $y$ on $X$ and the sum of squared residuals from the reduced form regression of $y$ on $W$, normalized by $\sigma^2$. A computationally convenient equivalent form is $-2n\hat Q_n = \|\hat y_W - \hat y_{2SLS}\|^2/\sigma^2$, the sum of squares of the difference between the reduced form fitted values and the 2SLS fitted values of $y$, normalized by $\sigma^2$. Finally, $-2n\hat Q_n = y'Q_{\hat X}P_WQ_{\hat X}y/\sigma^2$, which equals $nR^2$ when $\sigma^2$ is estimated by $u'u/n$, where $R^2$ is the (uncentered) multiple correlation coefficient from regressing the 2SLS residuals on all the instruments; this result follows from the equivalent formulae for the projection onto the subspace of $\mathscr{W}$ orthogonal to the subspace spanned by $\hat X$. This test statistic does not have a version that can be written as a quadratic form with the wings containing a difference of coefficient estimates from the 2SLS and reduced form regressions.
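The nR² form is a one-liner once the 2SLS residuals are in hand. A hedged sketch (hypothetical names; the 2SLS coefficient is computed by regressing y on the first-stage fitted values, which gives the same estimator as the usual formula):

```python
# Overidentification test in a linear structural equation: n times the
# uncentered R^2 from regressing 2SLS residuals on the full instrument set W.
import numpy as np

def tsls(y, X, W):
    Xhat = W @ np.linalg.lstsq(W, X, rcond=None)[0]        # first stage
    return np.linalg.lstsq(Xhat, y, rcond=None)[0]         # second stage

def overid_nr2(y, X, W):
    u = y - X @ tsls(y, X, W)                              # 2SLS residuals
    u_hat = W @ np.linalg.lstsq(W, u, rcond=None)[0]       # project on instruments
    return len(y) * (u_hat @ u_hat) / (u @ u)              # n * uncentered R^2
```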

9.7. Specification testing in multinomial models

As applications of GMM testing, we consider hypotheses arising in the context of the analysis of discrete response data. The first example is a test for omitted variables in multinomial data, which extends to various tests of functional specification by introduction of appropriate omitted variables. The second example tests for the presence of random effects in discrete panel data.
Example 9.5

Suppose $J$ multinomial outcomes are indexed $C = \{1, \ldots, J\}$. Define $z = (d_1, \ldots, d_J, x)$, where $d_j$ is one if outcome $j$ is observed, and zero otherwise. The $x$ are exogenous variables. The log-likelihood of an observation is

$$\ell(z, \theta) = \sum_{i\in C} d_i \log P_C(i, x, \theta),$$

where $P_C(i, x, \theta)$ is the probability that $i$ is observed from $C$, given $x$. Suppose $\theta = (\alpha, \beta)$, and the null hypothesis is $H_0$: $\beta = 0$. We derive an LM test starting from the maximum likelihood estimates of $\alpha$ under the constraint $\beta = 0$. Define

$$u_i = [d_i - P_C(i, x, \tilde\theta_n)]P_C(i, x, \tilde\theta_n)^{-1/2}, \qquad q_i = P_C(i, x, \tilde\theta_n)^{1/2}\nabla_\theta \log P_C(i, x, \tilde\theta_n).$$

Then, in a sample $t = 1, \ldots, n$, one has $(1/n)\sum_{t=1}^n \nabla_\theta\ell(z_t, \tilde\theta_n) = (1/n)\sum_{t=1}^n\sum_{i\in C} q_{it}u_{it}$. Also, $(1/n)\sum_{t=1}^n\sum_{i\in C} q_{it}q_{it}' \overset{p}{\to} \Omega$ since

$$\Omega = -E\nabla_{\theta\theta}\ell = -E\nabla_\theta\Big\{\sum_{i\in C}[d_i - P_C(i, x, \theta_0)]\nabla_\theta\log P_C(i, x, \theta_0)\Big\} = E\sum_{i\in C} P_C(i, x, \theta_0)[\nabla_\theta\log P_C(i, x, \theta_0)][\nabla_\theta\log P_C(i, x, \theta_0)]'.$$

Then,

$$LM_n = \Big[\sum_{t=1}^n\sum_{i\in C} q_{it}u_{it}\Big]'\Big[\sum_{t=1}^n\sum_{i\in C} q_{it}q_{it}'\Big]^{-1}\Big[\sum_{t=1}^n\sum_{i\in C} q_{it}u_{it}\Big].$$

This statistic can be computed from the sum of squares of the fitted values of $u_{it}$ from an auxiliary regression over $i$ and $t$ of $u_{it}$ on $q_{it}$. If $R^2$ is the multiple correlation coefficient from this regression, and $\bar u$ is the sample mean of the $u_{it}$, then $LM_n = n(J - 1)R^2 + (1 - R^2)\bar u^2$.
McFadden (1987) shows for the multinomial logit model that the Hausman and McFadden (1984) test for the independence from irrelevant alternatives property of this model can be calculated as an omitted variable test of the form above, where the omitted variables are interactions of the original variables and dummy variables for subsets of $C$ where nonindependence is suspected. Similarly, Lagrange multiplier tests of the logit model against nested logit alternatives can be cast as omitted variable tests where the omitted variables are interactions of dummy variables for suspect subsets $A$ of $C$ and variables of the form $\log[P_C(i, x, \tilde\theta_n)/\sum_{j\in A} P_C(j, x, \tilde\theta_n)]$.

Example 9.6

We develop a Lagrange multiplier test for unobserved heterogeneity in discrete panel data. A case is observed to be either in state $d_t = +1$ or $d_t = -1$ in periods $t = 1, \ldots, T$. A probability model for these observations that allows unobserved heterogeneity is

$$P(d_1, \ldots, d_T|x_1, \ldots, x_T, \beta_1, \ldots, \beta_T, \delta) = \int \prod_{t=1}^T F\big(d_t(x_t\beta_t + \sqrt{\delta}\,v)\big)h(v)\,dv,$$

where $x_1, \ldots, x_T$ are exogenous, $\beta_1, \ldots, \beta_T$ and $\delta$ are parameters, $F$ is a cumulative distribution function for a density that is symmetric about zero, and $v$ is an unobserved case effect heterogeneity. The density $h(v)$ is normalized so that $Ev = 0$ and $Ev^2 = 1$.
When $\delta = 0$, this model reduces to a series of independent Bernoulli trials,

$$P(d_1, \ldots, d_T|x_1, \ldots, x_T, \beta_1, \ldots, \beta_T, 0) = \prod_{t=1}^T F(d_tx_t\beta_t),$$

and is easily estimated. For example, $F$ normal yields binary probits, and $F$ logistic yields binary logits. A Lagrange multiplier test for $\delta = 0$ will detect the presence of unobserved heterogeneity across cases. Assume a sample of $n$ cases, drawn randomly from the population. The LM test statistic is

$$LM = \frac{\Big[\sum_{i}\nabla_\delta\ell_i/\sqrt{n}\Big]^2}{\sum_{i}(\nabla_\delta\ell_i)^2/n - \Big[\sum_{i}\nabla_\delta\ell_i\nabla_\beta\ell_i'/n\Big]\Big[\sum_{i}\nabla_\beta\ell_i\nabla_\beta\ell_i'/n\Big]^{-1}\Big[\sum_{i}\nabla_\beta\ell_i\nabla_\delta\ell_i/n\Big]},$$

where $\ell_i$ is the log-likelihood of case $i$, $\nabla_\beta\ell = (\nabla_{\beta_1}\ell, \ldots, \nabla_{\beta_T}\ell)'$, and all the derivatives are evaluated at $\delta = 0$ and the Bernoulli model estimates of $\beta$. The $\beta$ derivatives are straightforward,

$$\nabla_{\beta_t}\ell = d_tx_tf(d_tx_t\beta_t)/F(d_tx_t\beta_t),$$

where $f$ is the density of $F$. The $\delta$ derivative is more delicate, requiring use of l'Hopital's rule:

$$\nabla_\delta\ell = \frac{1}{2}\Bigg\{\sum_{t=1}^T \bigg[\frac{f'(d_tx_t\beta_t)}{F(d_tx_t\beta_t)} - \frac{f(d_tx_t\beta_t)^2}{F(d_tx_t\beta_t)^2}\bigg] + \bigg[\sum_{t=1}^T \frac{d_tf(d_tx_t\beta_t)}{F(d_tx_t\beta_t)}\bigg]^2\Bigg\}.$$

The reason for introducing $\delta$ in the form above, so that $\sqrt{\delta}$ appeared in the probability, was to get a statistic in which $\sum_i\nabla_\delta\ell_i$ was not identically zero. The alternative would have been to develop the test statistic in terms of the first non-identically-zero higher derivative; see Lee and Chesher (1986).
The LM statistic can be calculated by regressing the constant 1 on $\nabla_\delta\ell$ and $\nabla_{\beta_1}\ell, \ldots, \nabla_{\beta_T}\ell$, where all these derivatives are evaluated at $\delta = 0$ and the Bernoulli model estimates, and then forming the sum of squares of the fitted values. Note that the LM statistic is independent of the shape of the heterogeneity distribution $h(v)$, and is thus a robust test against heterogeneity of any form.
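Since the statistic is just the sum of squared fitted values from regressing 1 on the case-level scores, it can be computed generically. A hedged sketch (the name S is hypothetical; its rows stack the delta- and beta-derivatives of the case log-likelihoods, evaluated at delta = 0 and the Bernoulli estimates):

```python
# Outer-product ("regress 1 on the scores") form of the LM statistic.
import numpy as np

def lm_opg(S):
    """S: n x (1 + T) array whose rows are (grad_delta l_i, grad_beta l_i')."""
    n = S.shape[0]
    fitted = S @ np.linalg.lstsq(S, np.ones(n), rcond=None)[0]
    return fitted @ fitted    # sum of squared fitted values
```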

9.8. Technicalities

Some test statistics are conveniently defined using generalized inverses. This section gives a constructive definition of a generalized inverse and lists some of its properties. A $k \times m$ matrix $A^-$ is a Moore-Penrose generalized inverse of an $m \times k$ matrix $A$ if it has three properties:

(i) $AA^-A = A$,
(ii) $A^-AA^- = A^-$,
(iii) $AA^-$ and $A^-A$ are symmetric.

There are other generalized inverse definitions that have some, but not all, of these properties; in particular, $A^+$ will denote any matrix that satisfies (i).
First, a method for constructing a generalized inverse is described, and then some of the implications of the definition are developed. The construction is called the singular value decomposition (SVD) of a matrix, and is of independent interest as a tool for finding the eigenvalues and eigenvectors of a symmetric matrix, and for calculation of inverses of moment matrices of data with high multicollinearity; see Press et al. (1986) for computational algorithms and programs.

Lemma 9.4

Every real $m \times k$ matrix $A$ of rank $r$ can be decomposed into a product

$$A = UDV',$$

with $U$ of dimension $m \times r$, $D$ of dimension $r \times r$, and $V'$ of dimension $r \times k$, where $D$ is a diagonal matrix with positive nonincreasing elements down the diagonal, and $U$ and $V$ are column-orthonormal; i.e. $U'U = I_r = V'V$.

Proof

The $m \times m$ matrix $AA'$ is symmetric and positive semi-definite. Then, there exists an $m \times m$ orthonormal matrix $W$, partitioned $W = [W_1\ W_2]$ with $W_1$ of dimension $m \times r$, such that $W_1'(AA')W_1 = G$ is diagonal with positive, nonincreasing diagonal elements, and $W_2'(AA')W_2 = 0$, implying $A'W_2 = 0$. Define $D$ from $G$ by replacing the diagonal elements of $G$ by their positive square roots. Note that $W'W = I = WW' = W_1W_1' + W_2W_2'$. Define $U = W_1$ and $V' = D^{-1}U'A$. Then, $U'U = I_r$ and $V'V = D^{-1}U'AA'UD^{-1} = D^{-1}GD^{-1} = I_r$. Further, $A = (I_m - W_2W_2')A = UU'A = UDV'$. This establishes the decomposition. Q.E.D.

Note that if $A$ is symmetric, then $U$ is the array of eigenvectors of $A$ corresponding to the nonzero roots, so that $AU = UD_0$, with $D_0$ the $r \times r$ diagonal matrix with the nonzero eigenvalues in descending magnitude down the diagonal. In this case, $V = A'UD^{-1} = UD_0D^{-1}$. Since the elements of $D_0$ and $D$ are identical except possibly for sign, the columns of $U$ and $V$ are either equal (for positive roots) or reversed in sign (for negative roots).
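The SVD construction is available in standard numerical libraries. A minimal sketch using numpy (the tolerance rule for the numerical rank is a hypothetical choice), which also checks the first two Penrose properties:

```python
# Moore-Penrose inverse via the SVD construction of Lemmas 9.4-9.5:
# A = U D V' with r positive singular values gives A^- = V D^{-1} U'.
import numpy as np

def mp_inverse(A, tol=1e-10):
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(d > tol * d[0]))                 # numerical rank
    return Vt[:r].T @ np.diag(1.0 / d[:r]) @ U[:, :r].T

A = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 0.0]])  # rank 1
Am = mp_inverse(A)
print(np.allclose(A @ Am @ A, A))    # property (i)
print(np.allclose(Am @ A @ Am, Am))  # property (ii)
```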

Lemma 9.5

The Moore-Penrose generalized inverse of an $m \times k$ matrix $A$ is the matrix $A^- = VD^{-1}U'$, with $V$ of dimension $k \times r$, $D^{-1}$ of dimension $r \times r$, and $U'$ of dimension $r \times m$. Let $A^+$ denote any matrix, including $A^-$, that satisfies $AA^+A = A$. These matrices satisfy:

(1) $A^+ = A^{-1}$ if $A$ is square and nonsingular.
(2) The system of equations $Ax = y$ has a solution if and only if $y = AA^+y$, and the linear subspace of all solutions is the set of vectors $x = A^+y + [I - A^+A]z$ for all $z \in R^k$.
(3) $AA^+$ and $A^+A$ are idempotent.
(4) If $A$ is idempotent, then $A = A^-$.
(5) If $A = BCD$ with $B$ and $D$ nonsingular, then $A^- = D^{-1}C^-B^{-1}$, and any matrix $A^+ = D^{-1}C^+B^{-1}$ satisfies $AA^+A = A$.

Proof

Elementary; see Pringle and Rayner (1971).

Lemma 9.6

If $A$ is square, symmetric, and positive semi-definite of rank $r$, then

(1) There exist $Q$ positive definite and $R$ idempotent of rank $r$ such that $A = QRQ$ and $A^- = Q^{-1}RQ^{-1}$.
(2) There exists a column-orthonormal $U$ such that $U'AU = D$ is nonsingular diagonal and $A^- = U(U'AU)^{-1}U'$.
(3) $A$ has a symmetric square root $B = A^{1/2}$, and $A^- = B^-B^-$.

Proof

Let $W = [U\ W_2]$ be an orthogonal matrix diagonalizing $A$. Then, $U'AU = D$, a diagonal matrix of positive eigenvalues, and $AW_2 = 0$. Define

$$Q = W\begin{bmatrix} D^{1/2} & 0 \\ 0 & I \end{bmatrix}W', \qquad R = W\begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix}W', \qquad B = UD^{1/2}U'.$$

Q.E.D.

Lemma 9.7

If $y \sim N(A\lambda, A)$, with $A$ of rank $r$, and $A^+$ is any symmetric matrix satisfying $AA^+A = A$, then $y'A^+y$ is noncentral chi-square distributed with $r$ degrees of freedom and noncentrality parameter $\lambda'A\lambda$.

Proof

Let $W = [U\ W_2]$ be an orthonormal matrix that diagonalizes $A$, as in the proof of Lemma 9.6, with $U'AU = D$, a positive diagonal $r \times r$ matrix, and $W_2'AW_2 = 0$, implying $AW_2 = 0$. Then, $z_1 = D^{-1/2}U'y$ is distributed $N(D^{-1/2}U'A\lambda, I_r)$, while $z_2 = W_2'y$ has mean $W_2'A\lambda = 0$ and variance $W_2'AW_2 = 0$, so that $z_2 = 0$, implying $W'y = [D^{1/2}z_1'\ \ 0']'$. It is standard that $z_1'z_1$ has a noncentral chi-square distribution with $r$ degrees of freedom and noncentrality parameter $\lambda'A'UD^{-1}U'A\lambda = \lambda'A\lambda$. The condition $A = AA^+A$ implies $U'AU = U'AWW'A^+WW'AU$, or

$$D = [D\ \ 0]W'A^+W[D\ \ 0]' = D(U'A^+U)D.$$

Hence, $U'A^+U = D^{-1}$. Then

$$y'A^+y = y'WW'A^+WW'y = [z_1'D^{1/2}\ \ 0](W'A^+W)[D^{1/2}z_1'\ \ 0]' = z_1'D^{1/2}(U'A^+U)D^{1/2}z_1 = z_1'z_1. \quad \text{Q.E.D.}$$

References

Ait-Sahalia, Y. (1993) Asymptotic Theory for Functionals of Kernel Estimators, MIT Ph.D. thesis.
Amemiya, T. (1973) Regression Analysis When the Dependent Variable is Truncated Normal, Econometrica, 41, 997-1016.
Amemiya, T. (1974) The Nonlinear Two-Stage Least-Squares Estimator, Journal of Econometrics, 2, 105-110.
Amemiya, T. (1985) Advanced Econometrics, Cambridge, MA: Harvard University Press.
Andersen, P.K. and R.D. Gill (1982) Cox's Regression Model for Counting Processes: A Large Sample Study, The Annals of Statistics, 10, 1100-1120.
Andrews, D.W.K. (1990) Asymptotics for Semiparametric Econometric Models: I. Estimation and Testing, Cowles Foundation Discussion Paper No. 908R.
Andrews, D.W.K. (1992) Generic Uniform Convergence, Econometric Theory, 8, 241-257.
Andrews, D.W.K. (1994) Empirical Process Methods in Econometrics, in: R. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland.
Barro, R.J. (1977) Unanticipated Money Growth and Unemployment in the United States, American Economic Review, 67, 101-115.
Bartle, R.G. (1966) The Elements of Integration, New York: John Wiley and Sons.
Bates, C.E. and H. White (1992) Determination of Estimators with Minimum Asymptotic Covariance Matrices, preprint, University of California, San Diego.
Berndt, E.R., B.H. Hall, R.E. Hall and J.A. Hausman (1974) Estimation and Inference in Nonlinear Structural Models, Annals of Economic and Social Measurement, 3, 653-666.
Bickel, P. (1982) On Adaptive Estimation, Annals of Statistics, 10, 647-671.
Bickel, P., C.A.J. Klaassen, Y. Ritov and J.A. Wellner (1992) Efficient and Adaptive Inference in Semiparametric Models, forthcoming monograph, Baltimore, MD: Johns Hopkins University Press.
Billingsley, P. (1968) Convergence of Probability Measures, New York: Wiley.
Bloomfield, P. and W.L. Steiger (1983) Least Absolute Deviations: Theory, Applications, and Algorithms, Boston: Birkhauser.
Brown, B.W. (1983) The Identification Problem in Systems Nonlinear in the Variables, Econometrica, 51, 175-196.
Burguete, J., A.R. Gallant and G. Souza (1982) On the Unification of the Asymptotic Theory of Nonlinear Econometric Models, Econometric Reviews, 1, 151-190.
Carroll, R.J. (1982) Adapting for Heteroskedasticity in Linear Models, Annals of Statistics, 10, 1224-1233.
Chamberlain, G. (1982) Multivariate Regression Models for Panel Data, Journal of Econometrics, 18, 5-46.
Chamberlain, G. (1987) Asymptotic Efficiency in Estimation with Conditional Moment Restrictions, Journal of Econometrics, 34, 305-334.
Chesher, A. (1984) Testing for Neglected Heterogeneity, Econometrica, 52, 865-872.
Chiang, C.L. (1956) On Regular Best Asymptotically Normal Estimates, Annals of Mathematical Statistics, 27, 336-351.
Daniels, H.E. (1961) The Asymptotic Efficiency of a Maximum Likelihood Estimator, in: Fourth Berkeley Symposium on Mathematical Statistics and Probability, pp. 151-163, Berkeley: University of California Press.
Davidson, R. and J. MacKinnon (1984) Convenient Tests for Probit and Logit Models, Journal of Econometrics, 25, 241-262.
Eichenbaum, M.S., L.P. Hansen and K.J. Singleton (1988) A Time Series Analysis of Representative Agent Models of Consumption and Leisure Choice Under Uncertainty, Quarterly Journal of Economics, 103, 51-78.
Eicker, F. (1967) Limit Theorems for Regressions with Unequal and Dependent Errors, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press.
Fair, R.C. and D.M. Jaffee (1972) Methods of Estimation for Markets in Disequilibrium, Econometrica, 40, 497-514.
Ferguson, T.S. (1958) A Method of Generating Best Asymptotically Normal Estimates with Application to the Estimation of Bacterial Densities, Annals of Mathematical Statistics, 29, 1046-1062.
Fisher, F.M. (1976) The Identification Problem in Econometrics, New York: Krieger.
Fisher, R.A. (1921) On the Mathematical Foundations of Theoretical Statistics, Philosophical Transactions, A, 222, 309-368.
Fisher, R.A. (1925) Theory of Statistical Estimation, Proceedings of the Cambridge Philosophical Society, 22, 700-725.
Gourieroux, C., A. Monfort and A. Trognon (1983) Testing Nested or Nonnested Hypotheses, Journal of Econometrics, 21, 83-115.
Gourieroux, C., A. Monfort and A. Trognon (1984) Pseudo Maximum Likelihood Methods: Theory, Econometrica, 52, 681-700.
Hajek, J. (1970) A Characterization of Limiting Distributions of Regular Estimates, Z. Wahrscheinlichkeitstheorie verw. Geb., 14, 323-330.
Hansen, L.P. (1982) Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, 1029-1054.
Hansen, L.P. (1985a) A Method for Calculating Bounds on the Asymptotic Covariance Matrices of Generalized Method of Moments Estimators, Journal of Econometrics, 30, 203-238.
Hansen, L.P. (1985b) Notes on Two Step GMM Estimators, Discussion, December meetings of the Econometric Society.
Hansen, L.P. and K.J. Singleton (1982) Generalized Instrumental Variable Estimation of Nonlinear Rational Expectations Models, Econometrica, 50, 1269-1286.
Hansen, L.P., J. Heaton and R. Jagannathan (1992) Econometric Evaluation of Intertemporal Asset Pricing Models Using Volatility Bounds, mimeo, University of Chicago.
Härdle, W. (1990) Applied Nonparametric Regression, Cambridge: Cambridge University Press.
Härdle, W. and O. Linton (1994) Nonparametric Regression, in: R. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. 4, Amsterdam: North-Holland.
Hausman, J.A. (1978) Specification Tests in Econometrics, Econometrica, 46, 1251-1271.
Hausman, J.A. and D. McFadden (1984) Specification Tests for the Multinomial Logit Model, Econometrica, 52, 1219-1240.
Heckman, J.J. (1976) The Common Structure of Statistical Models of Truncation, Sample Selection, and Limited Dependent Variables and a Simple Estimator for Such Models, Annals of Economic and Social Measurement, 5, 475-492.
Honoré, B.E. (1992) Trimmed LAD and Least Squares Estimation of Truncated and Censored Models with Fixed Effects, Econometrica, 60, 533-565.
Honoré, B.E. and J.L. Powell (1992) Pairwise Difference Estimators of Linear, Censored, and Truncated Regression Models, mimeo, Northwestern University.
Huber, P.J. (1964) Robust Estimation of a Location Parameter, Annals of Mathematical Statistics, 35, 73-101.
Huber, P. (1967) The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press.
Huber, P. (1981) Robust Statistics, New York: Wiley.
Ibragimov, I.A. and R.Z. Hasminskii (1981) Statistical Estimation: Asymptotic Theory, New York: Springer-Verlag.
Jennrich, R.I. (1969) Asymptotic Properties of Nonlinear Least Squares Estimators, Annals of Mathematical Statistics, 40, 633-643.
Koenker, R. and G. Bassett (1978) Regression Quantiles, Econometrica, 46, 33-50.
LeCam, L. (1956) On the Asymptotic Theory of Estimation and Testing Hypotheses, in: L.M. LeCam and J. Neyman, eds., Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 129-156, Berkeley: University of California Press.
Lee, L.F. and A. Chesher (1986) Specification Testing when the Score Statistics are Identically Zero, Journal of Econometrics, 31, 121-149.
Maasoumi, E. and P.C.B. Phillips (1982) On the Behavior of Inconsistent Instrumental Variables Estimators, Journal of Econometrics, 19, 183-201.
Malinvaud, E. (1970) The Consistency of Nonlinear Regressions, Annals of Mathematical Statistics, 41, 956-969.
Manski, C. (1975) Maximum Score Estimation of the Stochastic Utility Model of Choice, Journal of Econometrics, 3, 205-228.
McDonald, J.B. and W.K. Newey (1988) Partially Adaptive Estimation of Regression Models Via the Generalized T Distribution, Econometric Theory, 4, 428-457.
McFadden, D. (1987) Regression-Based Specification Tests for the Multinomial Logit Model, Journal of Econometrics, 34, 63-82.
McFadden, D. (1989) A Method of Simulated Moments for Estimation of Multinomial Discrete Response Models Without Numerical Integration, Econometrica, 57, 995-1026.
McFadden, D. (1990) An Introduction to Asymptotic Theory: Lecture Notes for 14.381, mimeo, MIT.
Newey, W.K. (1984) A Method of Moments Interpretation of Sequential Estimators, Economics Letters, 14, 201-206.
Newey, W.K. (1985) Generalized Method of Moments Specification Testing, Journal of Econometrics, 29, 229-256.
Newey, W.K. (1987) Asymptotic Properties of a One-Step Estimator Obtained from an Optimal Step Size, Econometric Theory, 3, 305.
Newey, W.K. (1988) Interval Moment Estimation of the Truncated Regression Model, mimeo, Department of Economics, MIT.
Newey, W.K. (1989) Locally Efficient, Residual-Based Estimation of Nonlinear Simultaneous Equations Models, mimeo, Department of Economics, Princeton University.
Newey, W.K. (1990) Semiparametric Efficiency Bounds, Journal of Applied Econometrics, 5, 99-135.
Newey, W.K. (1991a) Uniform Convergence in Probability and Stochastic Equicontinuity, Econometrica, 59, 1161-1167.
Newey, W.K. (1991b) Efficient Estimation of Tobit Models Under Conditional Symmetry, in: W. Barnett, J. Powell and G. Tauchen, eds., Semiparametric and Nonparametric Methods in Statistics and Econometrics, Cambridge: Cambridge University Press.
Newey, W.K. (1992a) The Asymptotic Variance of Semiparametric Estimators, MIT Working Paper.
Newey, W.K. (1992b) Partial Means, Kernel Estimation, and a General Asymptotic Variance Estimator, mimeo, MIT.
Newey, W.K. (1993) Efficient Two-Step Instrumental Variables Estimation, mimeo, MIT.
Newey, W.K. and J.L. Powell (1987) Asymmetric Least Squares Estimation and Testing, Econometrica, 55, 819-847.
Newey, W.K. and K. West (1988) Hypothesis Testing with Efficient Method of Moments Estimation, International Economic Review, 28, 777-787.
Newey, W.K., F. Hsieh and J. Robins (1992) Bias Corrected Semiparametric Estimation, mimeo, MIT.
Olsen, R.J. (1978) Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model, Econometrica, 46, 1211-1216.
Pagan, A.R. (1984) Econometric Issues in the Analysis of Regressions with Generated Regressors, International Economic Review, 25, 221-247.
Pagan, A.R. (1986) Two Stage and Related Estimators and Their Applications, Review of Economic Studies, 53, 517-538.
Pakes, A. (1986) Patents as Options: Some Estimates of the Value of Holding European Patent Stocks, Econometrica, 54, 755-785.
Pakes, A. and D. Pollard (1989) Simulation and the Asymptotics of Optimization Estimators, Econometrica, 57, 1027-1057.
Pierce, D.A. (1982) The Asymptotic Effect of Substituting Estimators for Parameters in Certain Types of Statistics, Annals of Statistics, 10, 475-478.
Pollard, D. (1985) New Ways to Prove Central Limit Theorems, Econometric Theory, 1, 295-314.
Pollard, D. (1989) Empirical Processes: Theory and Applications, CBMS/NSF Regional Conference Series Lecture Notes.
Powell, J.L. (1984) Least Absolute Deviations Estimation for the Censored Regression Model, Journal of Econometrics, 25, 303-325.
Powell, J.L. (1986) Symmetrically Trimmed Least Squares Estimation for Tobit Models, Econometrica, 54, 1435-1460.
Powell, J.L., J.H. Stock and T.M. Stoker (1989) Semiparametric Estimation of Index Coefficients, Econometrica, 57, 1403-1430.
Pratt, J.W. (1981) Concavity of the Log Likelihood, Journal of the American Statistical Association, 76, 103-106.
Press, W.H., B.P. Flannery, S.A. Teukolsky and W.T. Vetterling (1986) Numerical Recipes, Cambridge: Cambridge University Press.
Pringle, R. and A. Rayner (1971) Generalized Inverse Matrices, London: Griffin.
Robins, J. (1991) Estimation with Missing Data, preprint, Epidemiology Department, Harvard School of Public Health.
Robinson, P.M. (1988a) The Stochastic Difference Between Econometric Statistics, Econometrica, 56, 531-548.
Robinson, P. (1988b) Root-N-Consistent Semiparametric Regression, Econometrica, 56, 931-954.
Rockafellar, T. (1970) Convex Analysis, Princeton: Princeton University Press.
Roehrig, C.S. (1989) Conditions for Identification in Nonparametric and Parametric Models, Econometrica, 56, 433-447.
Rothenberg, T.J. (1971) Identification in Parametric Models, Econometrica, 39, 577-592.
Rothenberg, T.J. (1973) Efficient Estimation with a priori Information, Cowles Foundation Monograph 23, New Haven: Yale University Press.
Rothenberg, T.J. (1984) Approximating the Distributions of Econometric Estimators and Test Statistics, Ch. 15 in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. 2, Amsterdam: North-Holland.
Rudin, W. (1976) Principles of Mathematical Analysis, New York: McGraw-Hill.
Sargan, J.D. (1959) The Estimation of Relationships with Autocorrelated Residuals by the Use of Instrumental Variables, Journal of the Royal Statistical Society Series B, 21, 91-105.
Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics, New York: Wiley.
Stoker, T. (1991) Smoothing Bias in the Measurement of Marginal Effects, MIT Sloan School Working Paper, WP3377-91-ESA.
Stone, C. (1975) Adaptive Maximum Likelihood Estimators of a Location Parameter, Annals of Statistics, 3, 267-284.
Tauchen, G.E. (1985) Diagnostic Testing and Evaluation of Maximum Likelihood Models, Journal of Econometrics, 30, 415-443.
Van der Vaart, A. (1991) On Differentiable Functionals, Annals of Statistics, 19, 178-204.
Wald, A. (1949) Note on the Consistency of the Maximum Likelihood Estimate, Annals of Mathematical Statistics, 20, 595-601.
White, H. (1980) A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity, Econometrica, 48, 817-838.
White, H. (1982a) Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50, 1-25.
White, H. (1982b) Consequences and Detection of Misspecified Linear Regression Models, Journal of the American Statistical Association, 76, 419-433.
Chapter 37

EMPIRICAL PROCESS METHODS IN ECONOMETRICS

DONALD W.K. ANDREWS

Cowles Foundation, Yale University

Contents

Abstract 2248
1. Introduction 2248
2. Weak convergence and stochastic equicontinuity 2249
3. Applications 2253
3.1. Review of applications 2253
3.2. Parametric M-estimators based on non-differentiable criterion functions 2255
3.3. Tests when a nuisance parameter is present only under the alternative 2259
3.4. Semiparametric estimation 2263
4. Stochastic equicontinuity via symmetrization 2267
4.1. Primitive conditions for stochastic equicontinuity 2267
4.2. Examples 2273
5. Stochastic equicontinuity via bracketing 2276
6. Conclusion 2283
Appendix 2284
References 2292

This paper is a substantial revision of the first part of the paper Andrews (1989). I thank D.
McFadden for comments and suggestions concerning this revision. I gratefully acknowledge research
support from the Alfred P. Sloan Foundation and the National Science Foundation through a Research
Fellowship and grant nos. SES-8618617, SES-8821021, and SES-9121914 respectively.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

Abstract

This paper provides an introduction to the use of empirical process methods in econometrics. These methods can be used to establish the large sample properties
econometrics. These methods can be used to establish the large sample properties
of econometric estimators and test statistics. In the first part of the paper, key
terminology and results are introduced and discussed heuristically. Applications in
the econometrics literature are briefly reviewed. A select set of three classes of
applications is discussed in more detail.
The second part of the paper shows how one can verify a key property called
stochastic equicontinuity. The paper takes several stochastic equicontinuity results
from the probability literature, which rely on entropy conditions of one sort or
another, and provides primitive sufficient conditions under which the entropy
conditions hold. This yields stochastic equicontinuity results that are readily applic-
able in a variety of contexts. Examples are provided.

1. Introduction

This paper discusses the use of empirical process methods in econometrics. It begins
by defining, and discussing heuristically, empirical processes, weak convergence,
and stochastic equicontinuity. The paper then provides a brief review of the use of
empirical process methods in the econometrics literature. Their use is primarily in
the establishment of the asymptotic distributions of various estimators and test
statistics.
Next, the paper discusses three classes of applications of empirical process methods
in more detail. The first is the establishment of asymptotic normality of parametric
M-estimators that are based on non-differentiable criterion functions. This includes
least absolute deviations and method of simulated moments estimators, among
others. The second is the establishment of asymptotic normality of semiparametric
estimators that depend on preliminary nonparametric estimators. This includes
weighted least squares estimators of partially linear regression models and semi-
parametric generalized method of moments estimators of parameters defined by
conditional moment restrictions, among others. The third is the establishment of
the asymptotic null distributions of several test statistics that apply in the non-
standard testing scenario in which a nuisance parameter appears under the alter-
native hypothesis, but not under the null. Examples of such testing problems include
tests of variable relevance in certain nonlinear models, such as models with Box-
Cox transformed variables, and tests of cross-sectional constancy in regression
models.
As shown in the first part of the paper, the verification of stochastic equiconti-
nuity in a given application is the key step in utilizing empirical process results. The
second part of the paper provides methods for verifying stochastic equicontinuity.
Numerous results are available in the probability literature concerning sufficient
conditions for stochastic equicontinuity (references are given below). Most of these
results rely on some sort of entropy condition. For application to specific estimation
and testing problems, such entropy conditions are not sufficiently primitive. The
second part of the paper provides an array of primitive conditions under which
such entropy conditions hold, and hence, under which stochastic equicontinuity
obtains. The primitive conditions considered here include: differentiability condi-
tions, Lipschitz conditions, $L^p$ continuity conditions, Vapnik-Cervonenkis condi-
tions, and combinations thereof. Applications discussed in the first part of the
paper are employed to exemplify the use of these primitive conditions.
The empirical process results discussed here apply only to random variables (rvs)
that are independent or m-dependent (i.e. independent beyond lags of length m).
There is a growing literature on empirical processes with more general forms of
temporal dependence. See Andrews (1993) for a review of this literature.
The remainder of this paper is organized as follows: Section 2 defines and
discusses empirical processes, weak convergence, and stochastic equicontinuity.
Section 3 gives a brief review of the use of empirical process methods in the econo-
metrics literature and discusses three classes of applications in more detail. Sections
4 and 5 provide stochastic equicontinuity results of the paper. Section 6 provides a
brief conclusion. An Appendix contains proofs of results stated in Sections 4 and 5.

2. Weak convergence and stochastic equicontinuity

We begin by introducing some notation. Let $\{W_{Tt}: t \le T, T \ge 1\}$ be a triangular
array of $\mathcal{W}$-valued rvs defined on a probability space $(\Omega, \mathcal{A}, P)$, where $\mathcal{W}$ is a (Borel
measurable) subset of $R^k$. For notational simplicity, we abbreviate $W_{Tt}$ by $W_t$ below.
Let $\mathcal{T}$ be a pseudometric space with pseudometric $\rho$.* Let

$\mathcal{M} = \{m(\cdot, \tau): \tau \in \mathcal{T}\}$   (2.1)

be a class of $R^s$-valued functions defined on $\mathcal{W}$ and indexed by $\tau \in \mathcal{T}$. Define an
empirical process $\nu_T(\cdot)$ by

$\nu_T(\tau) = \frac{1}{\sqrt{T}} \sum_T [m(W_t, \tau) - Em(W_t, \tau)]$ for $\tau \in \mathcal{T}$,   (2.2)

where $\sum_T$ abbreviates $\sum_{t=1}^T$. The empirical process $\nu_T(\cdot)$ is a particular type of
stochastic process. If $\mathcal{T} = [0,1]$, then $\nu_T(\cdot)$ is a stochastic process on $[0,1]$. For
parametric applications of empirical process theory, $\mathcal{T}$ is usually a subset of $R^p$.
For semiparametric and nonparametric applications, $\mathcal{T}$ is often a class of func-
tions. In some other applications, such as chi-square diagnostic test applications,
$\mathcal{T}$ is a class of subsets of $R^k$.

*That is, $\mathcal{T}$ is a metric space except that $\rho(\tau_1, \tau_2) = 0$ does not necessarily imply that $\tau_1 = \tau_2$. For
example, the class of square integrable functions on $[0,1]$ with $\rho(\tau_1, \tau_2) = [\int_0^1 (\tau_1(w) - \tau_2(w))^2 dw]^{1/2}$ is
a pseudometric space, but not a metric space. The reason is that if $\tau_1(w)$ equals $\tau_2(w)$ for all $w$ except
one point, say, then $\rho(\tau_1, \tau_2) = 0$, but $\tau_1 \ne \tau_2$. In order to handle sets $\mathcal{T}$ that are function spaces
of the above type, we allow $\mathcal{T}$ to be a pseudometric space rather than a (more restrictive) metric space.
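To fix ideas, here is a minimal Python sketch (ours, not from the chapter) of one realization of $\nu_T(\cdot)$ for the indicator class $m(w, \tau) = 1(w \le \tau)$ with iid Uniform(0,1) data, in which case $\nu_T(\tau) = \sqrt{T}(F_T(\tau) - \tau)$ is the classical empirical process:

```python
# A minimal sketch (not from the chapter): one realization of nu_T(.) for the
# indicator class m(w, tau) = 1(w <= tau) with iid Uniform(0,1) data.
# Here E m(W_t, tau) = tau, so nu_T(tau) = sqrt(T) * (F_T(tau) - tau).
import numpy as np

rng = np.random.default_rng(0)
T = 1000
W = rng.uniform(size=T)                      # iid draws W_1, ..., W_T
taus = np.linspace(0.0, 1.0, 201)            # grid on the index set [0, 1]
nu = ((W[:, None] <= taus[None, :]).sum(axis=0) - T * taus) / np.sqrt(T)
print(float(nu.min()), float(nu.max()))      # extremes of this sample path
```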
We now define weak convergence of the sequence of empirical processes
$\{\nu_T(\cdot): T \ge 1\}$ to some stochastic process $\nu(\cdot)$ indexed by elements $\tau$ of $\mathcal{T}$. ($\nu(\cdot)$ may
or may not be defined on the same probability space $(\Omega, \mathcal{A}, P)$ as $\nu_T(\cdot)$ $\forall T \ge 1$.)
Let $\Rightarrow$ denote weak convergence of stochastic processes, as defined below. Let $\stackrel{d}{\rightarrow}$
denote convergence in distribution of some sequence of rvs. Let $\|\cdot\|$ denote the
Euclidean norm. All limits below are taken as $T \to \infty$.

Definition of weak convergence

$\nu_T(\cdot) \Rightarrow \nu(\cdot)$ if $E^* f(\nu_T(\cdot)) \to E f(\nu(\cdot))$ $\forall f \in \mathcal{U}(B(\mathcal{T}))$,

where $B(\mathcal{T})$ is the class of bounded $R^s$-valued functions on $\mathcal{T}$ (which includes all
realizations of $\nu_T(\cdot)$ and $\nu(\cdot)$ by assumption), $d$ is the uniform metric on $B(\mathcal{T})$ (i.e.
$d(b_1, b_2) = \sup_{\tau \in \mathcal{T}} \|b_1(\tau) - b_2(\tau)\|$), and $\mathcal{U}(B(\mathcal{T}))$ is the class of all bounded uni-
formly continuous (with respect to the metric $d$) real functions on $B(\mathcal{T})$.

In the definition, $E^*$ denotes outer expectation. Correspondingly, $P^*$ denotes outer
probability below. (It is used because it is desirable not to require $\nu_T(\cdot)$ to be a
measurable random element of the metric space $(B(\mathcal{T}), d)$ with its Borel $\sigma$-field, since
measurability in this context can be too restrictive. For example, if $(B(\mathcal{T}), d)$ is the
space of functions $D[0,1]$ with the uniform metric, then the standard empirical
distribution function is not measurable with respect to its Borel $\sigma$-field. The limit
stochastic process $\nu(\cdot)$, on the other hand, is sufficiently well-behaved in applications
that it is assumed to be measurable in the definition.)
The above definition is due to Hoffmann-Jørgensen. It is widely used in the recent
probability literature, e.g. see Pollard (1990, Section 9).
Weak convergence is a useful concept for econometrics, because it can be used
to establish the asymptotic distributions of estimators and test statistics. Section 3
below illustrates how.
For now, we consider sufficient conditions for weak convergence. In many appli-
cations of interest, the limit process $\nu(\cdot)$ is (uniformly $\rho$) continuous in $\tau$ with
probability one. In such cases, a property of the sequence of empirical processes
$\{\nu_T(\cdot): T \ge 1\}$, called stochastic equicontinuity, is a key member of a set of sufficient
conditions for weak convergence. It also is implied by weak convergence (if the limit
process $\nu(\cdot)$ is as above).

Definition of stochastic equicontinuity

$\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous if $\forall \varepsilon > 0$ and $\eta > 0$, $\exists \delta > 0$ such that

$\overline{\lim}_{T \to \infty} P^*\left( \sup_{\tau_1, \tau_2 \in \mathcal{T}: \rho(\tau_1, \tau_2) < \delta} \|\nu_T(\tau_1) - \nu_T(\tau_2)\| > \eta \right) < \varepsilon.$   (2.3)

Basically, a sequence of empirical processes $\{\nu_T(\cdot): T \ge 1\}$ is stochastically equi-
continuous if $\nu_T(\cdot)$ is continuous in $\tau$ uniformly over $\mathcal{T}$ at least with high probability
and for $T$ large. Thus, stochastic equicontinuity is a probabilistic and asymptotic
generalization of the uniform continuity of a function.
The concept of stochastic equicontinuity is quite old and appears in the literature
under various guises. For example, it appears in Theorem 8.2 of Billingsley (1968,
p. 55), which is attributed to Prohorov (1956), for the case of $\mathcal{T} = [0,1]$. Moreover,
a non-asymptotic analogue of stochastic equicontinuity arises in the even older
literature on the existence of stochastic processes with continuous sample paths.
The concept of stochastic equicontinuity is important for two reasons. First, as
mentioned above, stochastic equicontinuity is a key member of a set of sufficient
conditions for weak convergence. These conditions are specified immediately below.
Second, in many applications it is not necessary to establish a full functional limit
(i.e. weak convergence) result to obtain the desired result - it suffices to establish
just stochastic equicontinuity. Examples of this are given in Section 3 below.
Sufficient conditions for weak convergence are given in the following widely used
result. A proof of the result can be found in Pollard (1990, Section 10) (but the basic
result has been around for some time). Recall that a pseudometric space is said to
be totally bounded if it can be covered by a finite number of $\varepsilon$-balls $\forall \varepsilon > 0$. (For
example, a subset of Euclidean space is totally bounded if and only if it is bounded.)

Proposition

If (i) $(\mathcal{T}, \rho)$ is a totally bounded pseudometric space, (ii) finite dimensional (fidi)
convergence holds: $\forall$ finite subsets $(\tau_1, \ldots, \tau_J)$ of $\mathcal{T}$, $(\nu_T(\tau_1), \ldots, \nu_T(\tau_J))$ converges in
distribution, and (iii) $\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous, then there
exists a (Borel-measurable with respect to $d$) $B(\mathcal{T})$-valued stochastic process $\nu(\cdot)$,
whose sample paths are uniformly $\rho$ continuous with probability one, such that
$\nu_T(\cdot) \Rightarrow \nu(\cdot)$.
Conversely, if $\nu_T(\cdot) \Rightarrow \nu(\cdot)$ for $\nu(\cdot)$ with the properties above and (i) holds, then
(ii) and (iii) hold.

Condition (ii) of the proposition typically is verified by applying a multivariate
central limit theorem (CLT) (or a univariate CLT coupled with the Cramér-Wold
device, see Billingsley (1968)). There are numerous CLTs in the literature that cover
different configurations of non-identical distributions and temporal dependence.

Condition (i) of the proposition is straightforward to verify if $\mathcal{T}$ is a subset of
Euclidean space and is typically a by-product of the verification of stochastic
equicontinuity in other cases. In consequence, the verification of stochastic equi-
continuity is the key step in verifying weak convergence (and, as mentioned above,
is often the desired end in its own right). For these reasons, we provide further
discussion of the stochastic equicontinuity condition here and we provide methods
for verifying it in several sections below.
Two equivalent definitions of stochastic equicontinuity are the following:
(i) $\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous if for every sequence of constants
$\{\delta_T\}$ that converges to zero, we have $\sup_{\rho(\tau_1, \tau_2) \le \delta_T} |\nu_T(\tau_1) - \nu_T(\tau_2)| \stackrel{p}{\rightarrow} 0$, where
$\stackrel{p}{\rightarrow}$ denotes convergence in probability, and (ii) $\{\nu_T(\cdot): T \ge 1\}$ is stochastically
equicontinuous if for all sequences of random elements $\{\hat{\tau}_{1T}\}$ and $\{\hat{\tau}_{2T}\}$ that
satisfy $\rho(\hat{\tau}_{1T}, \hat{\tau}_{2T}) \stackrel{p}{\rightarrow} 0$, we have $\nu_T(\hat{\tau}_{1T}) - \nu_T(\hat{\tau}_{2T}) \stackrel{p}{\rightarrow} 0$. The latter characterization
of stochastic equicontinuity reflects its use in the semiparametric examples below.
Allowing $\{\hat{\tau}_{1T}\}$ and $\{\hat{\tau}_{2T}\}$ to be random in the latter characterization is crucial. If
only fixed sequences were considered, then the property would be substantially
weaker (it would not deliver the result that $\nu_T(\hat{\tau}_{1T}) - \nu_T(\hat{\tau}_{2T}) \stackrel{p}{\rightarrow} 0$) and its
proof would be substantially simpler (the property would follow directly from
Chebyshev's inequality).
To demonstrate the plausibility of the stochastic equicontinuity property, suppose
$\mathcal{M}$ contains only linear functions, i.e. $\mathcal{M} = \{g: g(w) = w'\tau$ for some $\tau \in \mathcal{T} \subset R^k\}$ and $\rho$ is
the Euclidean metric. In this simple linear case,

$\overline{\lim}_{T \to \infty} P\left( \sup_{\rho(\tau_1, \tau_2) < \delta} |\nu_T(\tau_1) - \nu_T(\tau_2)| > \eta \right) \le \overline{\lim}_{T \to \infty} P\left( \left\| \frac{1}{\sqrt{T}} \sum_T (W_t - EW_t) \right\| \delta > \eta \right) < \varepsilon,$   (2.4)

where the first inequality holds by the Cauchy-Schwarz inequality and the second
inequality holds for $\delta$ sufficiently small provided $(1/\sqrt{T}) \sum_T (W_t - EW_t) = O_p(1)$.
Thus, $\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous in this case if the rvs
$\{W_t - EW_t: t \le T, T \ge 1\}$ satisfy an ordinary CLT.
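A small Monte Carlo illustration (ours) of the bound in (2.4): for the linear class, $\nu_T(\tau_1) - \nu_T(\tau_2) = S_T'(\tau_1 - \tau_2)$ with $S_T = (1/\sqrt{T})\sum_T(W_t - EW_t)$, so the modulus of continuity is at most $\|S_T\|\delta$ and shrinks proportionally with $\delta$:

```python
# Monte Carlo sketch (ours) of (2.4) for the linear class: the modulus of
# continuity is bounded by ||S_T|| * delta (Cauchy-Schwarz), where
# S_T = (1/sqrt(T)) sum_t (W_t - E W_t) = O_p(1).
import numpy as np

rng = np.random.default_rng(1)
k, T, reps = 3, 500, 200
for delta in (0.1, 0.01, 0.001):
    bounds = []
    for _ in range(reps):
        W = rng.exponential(size=(T, k))             # E W_t = 1 coordinate-wise
        S_T = (W - 1.0).sum(axis=0) / np.sqrt(T)     # CLT-normalized sum
        bounds.append(np.linalg.norm(S_T) * delta)   # Cauchy-Schwarz bound
    print(delta, np.mean(bounds))                    # roughly O_p(1) * delta
```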
For classes of nonlinear functions, the stochastic equicontinuity property is sub-
stantially more difficult to verify than for linear functions. Indeed, it is not difficult
to demonstrate that it does not hold for all classes of functions $\mathcal{M}$. Some restrictions
on $\mathcal{M}$ are necessary: $\mathcal{M}$ cannot be too complex/large.
To see this, suppose $\{W_t: t \le T, T \ge 1\}$ are iid with distribution $P_1$ that is abso-
lutely continuous with respect to Lebesgue measure and $\mathcal{M}$ is the class of indicator
functions of all Borel sets in $\mathcal{W}$. Let $\tau$ denote a Borel set in $\mathcal{W}$ and let $\mathcal{T}$ denote
the collection of all such sets. Then, $m(w, \tau) = 1(w \in \tau)$. Take $\rho(\tau_1, \tau_2) = (\int (m(w, \tau_1) -
m(w, \tau_2))^2 dP_1(w))^{1/2}$. For any two sets $\tau_1, \tau_2$ in $\mathcal{T}$ that have finite numbers of
elements, $\nu_T(\tau_j) = (1/\sqrt{T}) \sum_T 1(W_t \in \tau_j)$ and $\rho(\tau_1, \tau_2) = 0$, since $P_1(W_t \in \tau_j) = 0$ for $j =
1, 2$. Given any $T \ge 1$ and any realization $\omega \in \Omega$, there exist finite sets $\tau_{1T\omega}$ and
$\tau_{2T\omega}$ in $\mathcal{T}$ such that $W_t(\omega) \in \tau_{1T\omega}$ and $W_t(\omega) \notin \tau_{2T\omega}$ $\forall t \le T$, where $W_t(\omega)$ denotes the
value of $W_t$ when $\omega$ is realized. This yields $\nu_T(\tau_{1T\omega}) = \sqrt{T}$, $\nu_T(\tau_{2T\omega}) = 0$, and
$\sup_{\rho(\tau_1, \tau_2) < \delta} |\nu_T(\tau_1) - \nu_T(\tau_2)| \ge \sqrt{T}$. In consequence, $\{\nu_T(\cdot): T \ge 1\}$ is not stochasti-
cally equicontinuous. The class of functions $\mathcal{M}$ is too large.
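The failure can be seen numerically (a sketch, under the stated iid setup): taking $\tau_1$ to be the finite set of observed sample points and $\tau_2$ the empty set gives $\rho(\tau_1, \tau_2) = 0$ but $\nu_T(\tau_1) - \nu_T(\tau_2) = \sqrt{T}$:

```python
# Sketch of the counterexample (ours): tau_1 = {observed sample points} is a
# Lebesgue-null set, so rho(tau_1, tau_2) = 0 against tau_2 = empty set, yet
# the gap nu_T(tau_1) - nu_T(tau_2) = sqrt(T) never shrinks.
import numpy as np

rng = np.random.default_rng(2)
for T in (10, 100, 1000):
    W = rng.normal(size=T)                 # absolutely continuous distribution
    tau1 = set(W.tolist())                 # finite set containing every W_t
    nu1 = sum(w in tau1 for w in W) / np.sqrt(T)   # = sqrt(T); the E-term is 0
    nu2 = 0.0                              # indicator of the empty set
    print(T, nu1 - nu2, np.sqrt(T))
```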
In Sections 4 and 5 below, we discuss various entropy conditions that restrict
the complexity/size of the class of functions $\mathcal{M}$ sufficiently that stochastic equi-
continuity holds. Before doing so, however, we illustrate how weak convergence
and stochastic equicontinuity results can be fruitfully employed in various econo-
metric applications.

3. Applications

3.1. Review of applications

In this subsection, we briefly describe a number of applications of empirical process


theory that appear in the econometrics literature. There are numerous others that
appear in the statistics literature, see Shorack and Wellner (1986) and Wellner
(1992) for some references.
The applications and use of empirical process methods in econometrics are fairly
diverse. Some applications use a full weak convergence result; others just use a
stochastic equicontinuity result. Most applications use empirical process theory for
normalized sums of rvs, but some use the corresponding theory for U-processes,
see Kim and Pollard (1990) and Sherman (1992). The applications include estimation
problems and testing problems. Here we categorize the applications not by the type
of empirical process method used, but by area of application. We consider estimation
first, then testing.
Empirical process methods are useful in obtaining the asymptotic normality of
parametric optimization estimators when the criterion function that defines the
estimator is not differentiable. Estimators that fit into this category include robust
M-estimators (see Huber (1973)), regression quantiles (see Koenker and Bassett
(1978)), censored regression quantiles (see Powell (1984, 1986a)), trimmed LAD
estimators (see Honore (1992)), and method of simulated moments estimators (see
McFadden (1989) and Pakes and Pollard (1989)). Huber (1967) gave some asymp-
totic normality results for a class of M-estimators of the above sort using empirical
process-like methods. His results have been utilized by numerous econometricians,
e.g. see Powell (1984). Empirical process methods were utilized explicitly in several
subsequent papers that treat parametric estimation with non-differentiable criterion

functions, see Pollard (1984, 1985), McFadden (1989), Pakes and Pollard (1989) and
Andrews (1988a). Also, see Newey and McFadden (1994) in this handbook. In
Section 3.2 below, we illustrate one way in which empirical process methods can be
exploited for problems of this sort.
Empirical process methods also have been utilized in the semiparametric econo-
metrics literature. They have been used to establish the asymptotic normality (and,
in a few cases, other limiting distributions) of various estimators. References include
Horowitz (1988, 1992), Kim and Pollard (1990), Andrews (1994a), Newey (1989),
White and Stinchcombe (1991), Olley and Pakes (1991), Pakes and Olley (1991),
Ait-Sahalia (1992a, b), Sherman (1993,1994) and Cavanagh and Sherman (1992).
Kim and Pollard (1990) establish the asymptotic (non-normal) distribution of
Manski's (1975) maximum score estimator for binary choice models using empirical
process theory for U-statistics. Horowitz (1992) establishes the asymptotic normal
distribution of a smoothed version of the maximum score estimator. Andrews
(1994a), Newey (1989), Pakes and Olley (1991) and Ait-Sahalia (1992b) all use
empirical process theory to establish the asymptotic normality of classes of semi-
parametric estimators that employ nonparametric estimators in their definition.
Andrews (1994a), Newey (1989) and Pakes and Olley (1991) use stochastic equi-
continuity results, whereas Ait-Sahalia (1992b) utilizes a full weak convergence result.
Sherman (1993, 1994) and Cavanagh and Sherman (1992) establish asymptotic
normality of a number of semiparametric estimators using empirical process theory
of U-statistics. Section 3.4 below gives a heuristic description of one way in which
empirical process methods can be used for semiparametric estimation problems.
A third area of application of empirical process methods to estimation problems
is that of nonparametrics. Gallant (1989) and Gallant and Souza (1991) use these
methods to establish the asymptotic normality of certain seminonparametric (i.e.
nonparametric series) estimators. In their proof, empirical process methods are used
to establish that a law of large numbers holds uniformly over a class of functions
that expands with the sample size. Andrews (1994b) uses empirical process methods
to show that nonparametric kernel density and regression estimators are consistent
when the dependent variable or the regressor variables are residuals from some
preliminary estimation procedure (as often occurs in semiparametric applications).
Empirical process methods also have been utilized very effectively in justifying
the use of bootstrap confidence intervals. References include Gine and Zinn (1990),
Arcones and Gine (1992) and Hahn (1995).
Next, we consider testing problems. Empirical process methods have been used
in the literature to obtain the asymptotic null (and local alternative) distributions
of a wide variety of test statistics. These include test statistics for chi-square
diagnostic tests (see Andrews (1988b, c)), consistent model specification tests (see
Bierens (1990), Yatchew (1992), Hansen (1992a), De Jong (1992) and Stinchcombe
and White (1993)), tests of nonlinear restrictions in semiparametric models (see
Andrews (1988a)), tests of specification of semiparametric models (see Whang and
Andrews (1993) and White and Hong (1992)), tests of stochastic dominance (see
Klecan et al. (1990)), and tests of hypotheses for which a nuisance parameter appears
only under the alternative (see Davies (1977, 1987), Bera and Higgins (1992), Hansen
(1991, 1992b), Andrews and Ploberger (1994) and Stinchcombe and White (1993)).
For tests of the latter sort, Section 3.3 below describes how empirical process
methods are utilized.
Last, we note that stochastic equicontinuity can be used to obtain uniform laws
of large numbers that can be employed in proofs of consistency of extremum
estimators. For example, see Pollard (1984, Chapter 2), Newey (1991) and Andrews
(1992).

3.2. Parametric M-estimators based on non-differentiable criterion functions

Here we give a heuristic description of one way in which empirical process theory
can be used to establish the asymptotic normality of parametric M-estimators (or
GMM estimators) that are based on criterion functions that are not differentiable
with respect to the unknown parameter. This treatment follows that of Andrews
(1988a) most closely (in which a formal statement of assumptions and results can
be found). Other references are given in Section 3.1 above.
Suppose $\hat{\tau}$ is a consistent estimator of a parameter $\tau_0 \in R^p$ that satisfies a set of
$p$ first order conditions

$\bar{m}_T(\hat{\tau}) = 0$   (3.1)

at least with probability that goes to one as $T \to \infty$, where

$\bar{m}_T(\tau) = \frac{1}{T} \sum_T m(W_t, \tau).$   (3.2)

Here, $W_t$ is an observed vector of random variables and $m(\cdot, \cdot)$ is a known $R^p$-valued
function. Examples are given below.
If $m(W_t, \tau)$ is differentiable in $\tau$, one can establish the asymptotic normality of
$\hat{\tau}$ by expanding $\sqrt{T}\bar{m}_T(\hat{\tau})$ about $\tau_0$ using element by element mean value expansions.
This is the standard way of establishing asymptotic normality of $\hat{\tau}$ (or, more
precisely, of $\sqrt{T}(\hat{\tau} - \tau_0)$). In a variety of applications, however, the function $m(W_t, \tau)$
is not differentiable in $\tau$, or not even continuous, due to the appearance of a sign
function, an indicator function or a kinked function, etc. Examples are listed above
and below. In such cases, one can still establish asymptotic normality of $\hat{\tau}$ provided
$Em(W_t, \tau)$ is differentiable in $\tau$. Since the expectation operator is a smoothing
operator, $Em(W_t, \tau)$ is often differentiable in $\tau$ even though $m(W_t, \tau)$ is not.
One method is as follows. Let

$\bar{m}_T^*(\tau) = \frac{1}{T} \sum_T Em(W_t, \tau).$   (3.3)

To establish asymptotic normality of $\hat{\tau}$, one can replace (element by element) mean
value expansions of $\sqrt{T}\bar{m}_T(\hat{\tau})$ about $\tau_0$ by corresponding mean value expansions of
$\sqrt{T}\bar{m}_T^*(\tau_0)$ about $\hat{\tau}$ and then use empirical process methods to establish the limit
distribution of the expansion. In particular, such mean value expansions yield

$0 = \sqrt{T}\bar{m}_T^*(\tau_0) = \sqrt{T}\bar{m}_T^*(\hat{\tau}) - \partial[\bar{m}_T^*(\bar{\tau})]/\partial\tau' \sqrt{T}(\hat{\tau} - \tau_0),$   (3.4)

where the first equality holds by the population orthogonality conditions (by
assumption) and $\bar{\tau}$ lies on the line segment joining $\hat{\tau}$ and $\tau_0$ (and takes different
values in each row of $\partial[\bar{m}_T^*(\bar{\tau})]/\partial\tau'$). Under suitable assumptions on $\{m(W_t, \tau):
t \le T, T \ge 1\}$, one obtains

$\partial[\bar{m}_T^*(\bar{\tau})]/\partial\tau' \stackrel{p}{\rightarrow} M$ whenever $\bar{\tau} \stackrel{p}{\rightarrow} \tau_0$.

(For example, if the rvs $W_t$ are identically distributed, it suffices to have
$\partial[Em(W_t, \tau)]/\partial\tau'$ continuous in $\tau$ at $\tau_0$.) Thus, provided $M$ is nonsingular, one has

$\sqrt{T}(\hat{\tau} - \tau_0) = (M^{-1} + o_p(1))\sqrt{T}\bar{m}_T^*(\hat{\tau}).$   (3.5)

(Here, $o_p(1)$ denotes a term that converges in probability to zero as $T \to \infty$.)


Now, the asymptotic distribution of $\sqrt{T}(\hat{\tau} - \tau_0)$ is obtained by using empirical
process methods to determine the asymptotic distribution of $\sqrt{T}\bar{m}_T^*(\hat{\tau})$. We write

$-\sqrt{T}\bar{m}_T^*(\hat{\tau}) = [\sqrt{T}\bar{m}_T(\hat{\tau}) - \sqrt{T}\bar{m}_T^*(\hat{\tau})] - \sqrt{T}\bar{m}_T(\hat{\tau})$
$= (\nu_T(\hat{\tau}) - \nu_T(\tau_0)) + \nu_T(\tau_0) - \sqrt{T}\bar{m}_T(\hat{\tau}).$   (3.6)

The third term on the right hand side (rhs) of (3.6) is $o_p(1)$ by (3.1). The second
term on the rhs of (3.6) is asymptotically normal by an ordinary CLT under suit-
able moment and temporal dependence assumptions, since $\nu_T(\tau_0)$ is a normalized
sum of mean zero rvs. That is, we have

$\nu_T(\tau_0) \stackrel{d}{\rightarrow} N(0, S),$   (3.7)

where $S = \lim_{T \to \infty} \mathrm{var}[(1/\sqrt{T})\sum_T m(W_t, \tau_0)]$. (For example, if the rvs $W_t$ are inde-
pendent and identically distributed (iid), it suffices to have $S = Em(W_t, \tau_0)m(W_t, \tau_0)'$
well-defined.)
Next, the first term on the rhs of (3.6) is $o_p(1)$ provided $\{\nu_T(\cdot): T \ge 1\}$ is
stochastically equicontinuous and $\hat{\tau} \stackrel{p}{\rightarrow} \tau_0$. This follows because given any $\eta > 0$
and $\varepsilon > 0$, there exists a $\delta > 0$ such that

$\overline{\lim}_{T \to \infty} P(|\nu_T(\hat{\tau}) - \nu_T(\tau_0)| > \eta)$
$\le \overline{\lim}_{T \to \infty} P(|\nu_T(\hat{\tau}) - \nu_T(\tau_0)| > \eta, \rho(\hat{\tau}, \tau_0) \le \delta) + \overline{\lim}_{T \to \infty} P(\rho(\hat{\tau}, \tau_0) > \delta)$
$\le \overline{\lim}_{T \to \infty} P\left( \sup_{\tau \in \mathcal{T}: \rho(\tau, \tau_0) \le \delta} |\nu_T(\tau) - \nu_T(\tau_0)| > \eta \right)$
$< \varepsilon,$   (3.8)

where the second inequality uses $\hat{\tau} \stackrel{p}{\rightarrow} \tau_0$ and the third uses stochastic equicontinuity.
Combining (3.5)-(3.8) yields the desired result that

$\sqrt{T}(\hat{\tau} - \tau_0) \stackrel{d}{\rightarrow} N(0, M^{-1}S(M^{-1})')$ as $T \to \infty$.   (3.9)

It remains to show how one can verify the stochastic equicontinuity of $\{\nu_T(\cdot): T \ge 1\}$.
This is done in Sections 4 and 5 below. Before doing so, we consider several
examples.
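A minimal simulation check of (3.9) (ours, not the chapter's) for the simplest non-differentiable case, the sample median, where $m(w, \tau) = \mathrm{sgn}(w - \tau)$, $|M| = 2f(\tau_0)$ and $S = 1$, so the asymptotic variance is $1/(4f(\tau_0)^2) = \pi/2$ for standard normal data:

```python
# Simulation sketch (ours) of (3.9) for the sample median, which solves (3.1)
# with m(w, tau) = sgn(w - tau); here |M| = 2 f(tau_0) and S = 1, giving
# sqrt(T)(tau_hat - tau_0) -> N(0, pi/2) for standard normal data.
import numpy as np

rng = np.random.default_rng(3)
T, reps = 400, 5000
z = np.sqrt(T) * np.array([np.median(rng.normal(size=T)) for _ in range(reps)])
print(z.var(), np.pi / 2)   # simulated vs. theoretical asymptotic variance
```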

Example 1

M-estimators for standard, censored and truncated linear regression models. In
the models considered here, $\{(Y_t, X_t): t \le T\}$ are observed rvs and $\{(Y_t^*, X_t^*): t \le T\}$
are latent rvs. The models are defined by

$Y_t^* = X_t^{*\prime}\tau_0 + U_t, \quad t = 1, \ldots, T,$

linear regression (LR): $(Y_t, X_t) = (Y_t^*, X_t^*),$

censored regression (CR): $(Y_t, X_t) = (Y_t^* 1(Y_t^* \ge 0), X_t^*),$

truncated regression (TR): $(Y_t, X_t) = (Y_t^* 1(Y_t^* \ge 0), X_t^* 1(Y_t^* \ge 0)).$   (3.10)

Depending upon the context, the errors $\{U_t\}$ may satisfy any one of a number of
assumptions such as constant conditional mean or quantile for all $t$ or symmetry
about zero for all $t$. We need not be specific for present purposes.
We consider M-estimators $\hat{\tau}$ of $\tau_0$ that satisfy the equations

$0 = \sum_T \psi_1(Y_t - X_t'\hat{\tau})\psi_2(W_t, \hat{\tau})X_t$   (3.11)

with probability $\to 1$ as $T \to \infty$, where $W_t = (Y_t, X_t')'$. Such estimators fit the general
framework of (3.1)-(3.2) with

$m(w, \tau) = \psi_1(y - x'\tau)\psi_2(w, \tau)x$, where $w = (y, x')'$.   (3.12)


Examples of such M-estimators in the literature include the following:

(a) LR model: Let $\psi_1(z) = \mathrm{sgn}(z)$ and $\psi_2 = 1$ to obtain the least absolute
deviations (LAD) estimator. Let $\psi_1(z) = q - 1(y - x'\tau < 0)$ and $\psi_2 = 1$ to obtain
Koenker and Bassett's (1978) regression quantile estimator for quantile $q \in (0, 1)$.
Let $\psi_1(z) = (z \wedge c) \vee (-c)$ (where $\wedge$ and $\vee$ are the min and max operators
respectively) and $\psi_2 = 1$ to obtain Huber's (1973) M-estimator with truncation at
$\pm c$. Let $\psi_1(z) = |q - 1(y - x'\tau < 0)|$ and $\psi_2(w, \tau) = y - x'\tau$ to obtain Newey and
Powell's (1987) asymmetric LS estimator.
(b) CR model: Let $\psi_1(z) = q - 1(y - x'\tau < 0)$ and $\psi_2(w, \tau) = 1(x'\tau > 0)$ to obtain
Powell's (1984, 1986a) censored regression quantile estimator for quantile $q \in (0, 1)$.
Let $\psi_1 = 1$ and $\psi_2(w, \tau) = 1(x'\tau > 0)[(y - x'\tau) \wedge x'\tau]$ to obtain Powell's (1986b)
symmetrically trimmed LS estimator.
(c) TR model: Let $\psi_1 = 1$ and $\psi_2(w, \tau) = 1(y < 2x'\tau)(y - x'\tau)$ to obtain Powell's
(1986b) symmetrically trimmed LS estimator.
(Note that for the Huber M-estimator of the LR model one would usually
simultaneously estimate a scale parameter for the errors $U_t$. For brevity, we omit
this above.)
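For concreteness, here are a few of the $\psi_1/\psi_2$ choices above written as moment functions $m(w, \tau)$ for a single observation (a sketch; the names and signatures are ours, not the chapter's):

```python
# A few of the psi_1 / psi_2 choices from Example 1 as per-observation moment
# functions m(w, tau) (a sketch; function names are ours).
import numpy as np

def m_lad(y, x, tau):                # (a) LAD: psi_1 = sgn, psi_2 = 1
    return np.sign(y - x @ tau) * x

def m_rq(y, x, tau, q):              # (a) Koenker-Bassett regression quantile
    return (q - float(y - x @ tau < 0)) * x

def m_huber(y, x, tau, c):           # (a) Huber: psi_1(z) = (z ^ c) v (-c)
    return np.clip(y - x @ tau, -c, c) * x

def m_crq(y, x, tau, q):             # (b) Powell censored regression quantile
    return (q - float(y - x @ tau < 0)) * float(x @ tau > 0) * x
```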

Example 2

Method of simulated moments (MSM) estimator for multinomial probit. The
model and estimator considered here are as in McFadden (1989) and Pakes and
Pollard (1989). We consider a discrete response model with $r$ possible responses.
Let $D_t$ be an observed response vector that takes values in $\{e_i: i = 1, \ldots, r\}$, where
$e_i = (0, \ldots, 0, 1, 0, \ldots, 0)'$ is the $i$th elementary $r$-vector. Let $Z_{ti}$ denote an observed
$b$-vector of covariates, one for each possible response $i = 1, \ldots, r$. Let $Z_t =
(Z_{t1}', Z_{t2}', \ldots, Z_{tr}')'$. The model is defined such that

$D_t = e_i$ if $(Z_{ti} - Z_{tl})'(\beta(\tau_0) + A(\tau_0)U_t) \ge 0$ $\forall l = 1, \ldots, r,$   (3.13)

where $U_t \sim N(0, I_b)$ is an unobserved normal rv, and $\beta(\cdot)$ and $A(\cdot)$ are known $R^b$-
and $R^{b \times b}$-valued functions of an unknown parameter $\tau_0 \in \mathcal{T} \subset R^p$.
McFadden's MSM estimator of $\tau_0$ is constructed using $s$ independent simulated
$N(0, I_b)$ rvs $(Y_{t1}, \ldots, Y_{ts})$ and a matrix of instruments $g(Z_t, \tau)$, where $g(\cdot, \cdot)$ is a
known matrix-valued function. The MSM estimator is an example of the estimator
of (3.1)-(3.2) with $W_t = (D_t', Z_t', Y_{t1}', \ldots, Y_{ts}')'$ and

$m(w, \tau) = g(z, \tau)\left( d - \frac{1}{s}\sum_{j=1}^s H[z(\beta(\tau) + A(\tau)y_j)] \right),$   (3.14)

where $w = (d', z', y_1', \ldots, y_s')'$. Here, $H[\cdot]$ is a $\{0, 1\}^r$-valued function whose $i$th element
is of the form

$\prod_{l=1}^r 1[(z_i - z_l)'(\beta(\tau) + A(\tau)y_j) \ge 0].$   (3.15)
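A hedged Python sketch of (3.14)-(3.15) (ours): the response probability is replaced by a frequency over $s$ simulated draws; here `beta`, `A` and `g` stand in for the model's known functions, `z` is the $(r, b)$ matrix whose $i$th row is $z_i'$, and `d` is the observed elementary $r$-vector:

```python
# Hedged sketch of the MSM moment (3.14)-(3.15); beta, A and g are
# placeholders for the model's known functions.
import numpy as np

def H(z, tau, u, beta, A):
    # i-th element: 1{(z_i - z_l)'(beta(tau) + A(tau) u) >= 0 for all l}, (3.15)
    v = z @ (beta(tau) + A(tau) @ u)             # r latent index values
    return (v[:, None] >= v[None, :]).all(axis=1).astype(float)

def m(d, z, ys, tau, beta, A, g):
    # ys: (s, b) array of independent N(0, I_b) simulation draws
    freq = np.mean([H(z, tau, u, beta, A) for u in ys], axis=0)
    return g(z, tau) @ (d - freq)                # g(z, tau): instrument matrix
```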

3.3. Tests when a nuisance parameter is present only under the alternative

In this section we consider a class of testing problems for which empirical process
limit theory can be usefully exploited. The testing problems considered are ones
for which a nuisance parameter is present under the alternative hypothesis, but
not under the null hypothesis. Such testing problems are non-standard. In
consequence, the usual asymptotic distributional and optimality properties of
likelihood ratio (LR), Lagrange multiplier (LM), and Wald (W) tests do not apply.
Consider a parametric model with parameters $\theta$ and $\tau$, where $\theta \in \Theta \subset R^s$, $\tau \in \mathcal{T} \subset R^v$.
Let $\theta = (\beta', \delta')'$, where $\beta \in R^p$ and $\delta \in R^q$, and $s = p + q$. The null and alternative
hypotheses of interest are

$H_0: \beta = 0$ and $H_1: \beta \ne 0.$   (3.16)

Under the null hypothesis, the distribution of the data does not depend on the
parameter $\tau$ by assumption. Under the alternative hypothesis, it does. Two
examples are the following.

Example 3

This example is a test for variable relevance. We want to test whether a regressor
variable/vector $Z_t$ belongs in a nonlinear regression model. The model is

$Y_t = g(X_t, \delta_1) + \beta h(Z_t, \tau) + U_t, \quad U_t \sim N(0, \delta_2), \quad t = 1, \ldots, T.$   (3.17)

The functions $g$ and $h$ are assumed known. The parameters $(\beta, \delta_1, \delta_2, \tau)$ are
unknown. The regressors $(X_t, Z_t)$ and/or the errors $U_t$ are presumed to exhibit
some sort of asymptotically weak temporal dependence. As an example, the term
$h(Z_t, \tau)$ might be of the Box-Cox form $(Z_t^\tau - 1)/\tau$. Under the null hypothesis
$H_0: \beta = 0$, $Z_t$ does not enter the regression function and the parameter $\tau$ is not
present.

Example 4

This example is a test of cross-sectional constancy in a nonlinear regression model.
A parameter $\tau$ ($\in R$) partitions the sample space of some observed variable
$Z_t$ ($\in R$) into two regions. In one region the regression parameter is $\delta_1$ ($\in R^q$) and in
the other region it is $\delta_1 + \beta$. A test of cross-sectional constancy of the regression
parameters corresponds to a test of the null hypothesis $H_0: \beta = 0$. The parameter
$\tau$ is present only under the alternative.
To be concrete, the model is

$Y_t = \begin{cases} g(X_t, \delta_1) + U_t & \text{for } h(Z_t, \tau) > 0 \\ g(X_t, \delta_1 + \beta) + U_t & \text{for } h(Z_t, \tau) \le 0 \end{cases}$ for $t = 1, \ldots, T,$   (3.18)

where the errors $U_t \sim$ iid $N(0, \delta_2)$, the regressors $X_t$ and the rv $Z_t$ are m-dependent
and identically distributed, and $g(\cdot, \cdot)$ and $h(\cdot, \cdot)$ are known real functions. For
example, $h(Z_t, \tau)$ could equal $Z_t - \tau$, where the real rv $Z_t$ is an element of $X_t$, an
element of $X_{t-d}$ for some integer $d \ge 1$, or $Y_{t-d}$ for some integer $d \ge 1$. The model
could be generalized to allow for more regions than two.

Problems of the sort considered above were first treated in a general way by
Davies (1977, 1987). Davies proposed using the LR test. Let $LR(\tau)$ denote the LR
test statistic (i.e. minus two times the log likelihood ratio) when $\tau$ is specified under
the alternative. For given $\tau$, $LR(\tau)$ has standard asymptotic properties (under
standard regularity conditions). In particular, it converges in distribution under
the null to a random variable $X(\tau)$ that has a $\chi_p^2$ distribution. When $\tau$ is not given,
but is allowed to take any value in $\mathcal{T}$, the LR statistic is

$\sup_{\tau \in \mathcal{T}} LR(\tau).$   (3.19)

This statistic has power against a much wider variety of alternatives than the
statistic $LR(\tau)$ for some fixed value of $\tau$.
To mount a test based on $\sup_{\tau \in \mathcal{T}} LR(\tau)$, one needs to determine its asymptotic
null distribution. This can be achieved by establishing that the stochastic process
$LR(\tau)$, viewed as a random function indexed by $\tau$, converges weakly to a stochastic
process $X(\tau)$. Then, it is easy to show that the asymptotic null distribution of
$\sup_{\tau \in \mathcal{T}} LR(\tau)$ is that of the supremum of the chi-square process $X(\tau)$. The methods
discussed below can be used to provide a rigorous justification of this type of
argument.
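A hedged numerical sketch (ours, not the chapter's) of how such critical values can be obtained: once weak convergence holds, one can simulate the supremum of the limit chi-square process; we assume $p = 1$ and, purely for illustration, a Gaussian limit process with covariance kernel $\mathrm{corr}(\nu(\tau_1), \nu(\tau_2)) = \exp(-|\tau_1 - \tau_2|)$:

```python
# Sketch (ours): simulate critical values for sup LR from the limit chi-square
# process X(tau) = nu(tau)^2, assuming p = 1 and an illustrative covariance.
import numpy as np

rng = np.random.default_rng(4)
taus = np.linspace(0.0, 5.0, 50)                     # grid approximating T
C = np.exp(-np.abs(taus[:, None] - taus[None, :]))   # assumed covariance kernel
L = np.linalg.cholesky(C + 1e-10 * np.eye(taus.size))
sups = [np.max((L @ rng.normal(size=taus.size)) ** 2) for _ in range(20000)]
print(np.quantile(sups, 0.95))   # simulated 5% critical value for sup X(tau)
```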
Hansen (1991) extended Davies' results to non-likelihood testing scenarios,
considered LM versions of the test, and pointed out a variety of applications of
such tests in econometrics.
A drawback of the sup LR test statistic is that it does not possess standard
asymptotic optimality properties. Andrews and Ploberger (1994) derived a class
of tests that do. They considered a weighted average power criterion that is
similar to that considered by Wald (1943). Optimal tests turn out to be average
exponential tests:

$\text{Exp-}LR = (1 + c)^{-p/2} \int \exp\left( \frac{1}{2} \frac{c}{1 + c} LR(\tau) \right) dJ(\tau),$   (3.20)

where $J(\cdot)$ is a specified weight function over $\tau \in \mathcal{T}$ and $c$ is a scalar parameter
that indexes whether one is directing power against close or distant alternatives
(i.e. against $\beta$ small or $\beta$ large). Let Exp-LM and Exp-W denote the test statistics
defined as in (3.20) but with $LR(\tau)$ replaced by $LM(\tau)$ and $W(\tau)$, respectively,
where the latter are defined analogously to $LR(\tau)$. The three statistics Exp-LR,

Exp-LM, and Exp-W each have asymptotic optimality properties. Using empirical
process results, each can be shown to have an asymptotic null distribution that
is a function of the stochastic process $X(\tau)$ discussed above.
First, we introduce some notation. Let $l_T(\theta, \tau)$ denote a criterion function that
is used to estimate the parameters $\theta$ and $\tau$. The leading case is when $l_T(\theta, \tau)$ is the
log likelihood function for the sample of size $T$. Let $Dl_T(\theta, \tau)$ denote the $s$-vector of
partial derivatives of $l_T(\theta, \tau)$ with respect to $\theta$. Let $\theta_0$ denote the true value of $\theta$
under the null hypothesis $H_0$, i.e. $\theta_0 = (0', \delta_0')'$. (Note that $Dl_T(\theta_0, \tau)$ depends on $\tau$
in general even though $l_T(\theta_0, \tau)$ does not.)
By some manipulations (e.g. see Andrews and Ploberger (1994)), one can show
that the test statistics $\sup_{\tau \in \mathcal{T}} LR(\tau)$, Exp-LR, Exp-LM, and Exp-W equal a conti-
nuous real function of the normalized score process $\{Dl_T(\theta_0, \tau)/\sqrt{T}: \tau \in \mathcal{T}\}$ plus an
$o_p(1)$ term under $H_0$. In view of the continuous mapping theorem (e.g. see Pollard
(1984, Chapter III.2)), the asymptotic null distributions of these statistics are given
by the same functions of the limit process as $T \to \infty$ of $\{Dl_T(\theta_0, \tau)/\sqrt{T}: \tau \in \mathcal{T}\}$.
More specifically, let

$\nu_T(\tau) = \frac{1}{\sqrt{T}} Dl_T(\theta_0, \tau).$   (3.21)

(Note that $EDl_T(\theta_0, \tau) = 0$ under $H_0$, since these are the population first order
conditions for the estimator.) Then, for some continuous function $g$ of $\nu_T(\cdot)$, we
have

$\sup_{\tau \in \mathcal{T}} LR(\tau) = g(\nu_T(\cdot)) + o_p(1)$ under $H_0$.   (3.22)

(Here, continuity is defined with respect to the uniform metric $d$ on the space of
bounded $R^s$-valued functions on $\mathcal{T}$, i.e. $B(\mathcal{T})$.) If $\nu_T(\cdot) \Rightarrow \nu(\cdot)$, then

$\sup_{\tau \in \mathcal{T}} LR(\tau) \stackrel{d}{\rightarrow} g(\nu(\cdot))$ under $H_0$,   (3.23)

which is the desired result. The distribution of $g(\nu(\cdot))$ yields asymptotic critical
values for the test statistic $\sup_{\tau \in \mathcal{T}} LR(\tau)$. The results are analogous for Exp-LR,
Exp-LM, and Exp-W.
In conclusion, if one can establish the weak convergence result $\nu_T(\cdot) \Rightarrow \nu(\cdot)$ as
$T \to \infty$, then one can obtain the asymptotic distribution of the test statistics of
interest. As discussed in Section 2, the key condition for weak convergence is
stochastic equicontinuity. The verification of stochastic equicontinuity for Examples
3 and 4 is discussed in Sections 4 and 5 below. Here, we specify the form of $\nu_T(\tau)$
in these examples.

Example 3 (continued)

In this example, $l_T(\theta, \tau)$ is the log likelihood function under the assumption of iid
normal errors:

$l_T(\theta, \tau) = -\frac{T}{2}\ln 2\pi\delta_2 - \frac{1}{2\delta_2}\sum_T [Y_t - g(X_t, \delta_1) - \beta h(Z_t, \tau)]^2$

and

$\nu_T(\tau) = \frac{1}{\sqrt{T}} Dl_T(\theta_0, \tau) = \frac{1}{\sqrt{T}} \sum_T \frac{1}{\delta_{20}} \begin{pmatrix} U_t h(Z_t, \tau) \\ U_t \, \partial g(X_t, \delta_{10})/\partial \delta_1 \\ (U_t^2 - \delta_{20})/(2\delta_{20}) \end{pmatrix}.$   (3.24)

Since $\tau$ only appears in the first term, it suffices to show that $\{(1/\sqrt{T})\sum_T U_t h(Z_t, \cdot):
T \ge 1\}$ is stochastically equicontinuous.

Example 4 (continued)

In this cross-sectional constancy example, $l_T(\theta, \tau)$ is the log likelihood function under
the assumption of iid normal innovations:

$l_T(\theta, \tau) = -\frac{T}{2}\ln 2\pi\delta_2 - \frac{1}{2\delta_2}\sum_T [Y_t - g(X_t, \delta_1)1(h(Z_t, \tau) > 0) - g(X_t, \delta_1 + \beta)1(h(Z_t, \tau) \le 0)]^2$

and

$\frac{1}{\sqrt{T}} Dl_T(\theta_0, \tau) = \frac{1}{\sqrt{T}} \sum_T \frac{1}{\delta_{20}} \begin{pmatrix} U_t [\partial g(X_t, \delta_{10})/\partial \delta_1] 1(h(Z_t, \tau) \le 0) \\ U_t \, \partial g(X_t, \delta_{10})/\partial \delta_1 \\ (U_t^2 - \delta_{20})/(2\delta_{20}) \end{pmatrix}.$   (3.25)

Since $\tau$ only appears in the first term, it suffices to show that $\{(1/\sqrt{T})\sum_T U_t \times
[\partial g(X_t, \delta_{10})/\partial \delta_1] 1(h(Z_t, \cdot) \le 0): T \ge 1\}$ is stochastically equicontinuous.
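An illustration (ours) of this score process: assuming $h(z, \tau) = z - \tau$ and a linear $g$, so that $\partial g(X_t, \delta_{10})/\partial \delta_1 = X_t$, one realization of the process on a grid of $\tau$ values (up to the $1/\delta_{20}$ scale) is:

```python
# Sketch (ours) of the Example 4 score process under H_0 with h(z, tau) = z - tau
# and linear g: nu_T(tau) = (1/sqrt(T)) sum_t U_t X_t 1(Z_t - tau <= 0).
import numpy as np

rng = np.random.default_rng(5)
T = 500
X = rng.normal(size=T)            # scalar regressor, for simplicity
Z = rng.normal(size=T)            # threshold variable
U = rng.normal(size=T)            # errors under H_0: beta = 0
taus = np.linspace(-2.0, 2.0, 101)
score = np.array([(U * X * (Z <= t)).sum() / np.sqrt(T) for t in taus])
print(float(score.min()), float(score.max()))   # one sample path of nu_T(.)
```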

3.4. Semiparametric estimation

We now consider the application of stochastic equicontinuity results to semipara-
metric estimation problems. The approach that is discussed below is given in more
detail in Andrews (1994a). Other approaches are referenced in Section 3.1 above.
Consider a two-stage estimator $\hat{\theta}$ of a finite dimensional parameter $\theta_0 \in \Theta \subset R^p$.
In the first stage, an infinite dimensional parameter estimator $\hat{\tau}$ is computed, such
as a nonparametric regression or density estimator or its derivative. In the second
stage, the estimator $\hat{\theta}$ of $\theta_0$ is obtained from a set of estimating equations that
depend on the preliminary estimator $\hat{\tau}$. Many semiparametric estimators in the
literature can be defined in this way.
By linearizing the estimating equations, one can show that the asymptotic
distribution of $\sqrt{T}(\hat{\theta} - \theta_0)$ depends on an empirical process $\nu_T(\tau)$, evaluated at the
preliminary estimator $\hat{\tau}$. That is, it depends on $\nu_T(\hat{\tau})$. To obtain the asymptotic
distribution of $\hat{\theta}$, then, one needs to obtain that of $\nu_T(\hat{\tau})$. If $\hat{\tau}$ converges in prob-
ability to some $\tau_0$ (under a suitable pseudometric) and $\nu_T(\cdot)$ is stochastically
equicontinuous, then one can show that $\nu_T(\hat{\tau}) - \nu_T(\tau_0) \stackrel{p}{\rightarrow} 0$ and the asymptotic
behavior of $\sqrt{T}(\hat{\theta} - \theta_0)$ depends on that of $\nu_T(\tau_0)$, which is obtained straightfor-
wardly from an ordinary CLT. Thus, one can effectively utilize empirical process
stochastic equicontinuity results in establishing the asymptotic distributions of
semiparametric estimators.
We now provide some more details of the argument sketched above. Let the data
consist of $\{W_t: t \le T\}$. Consider a system of $p$ estimating equations

$\bar{m}_T(\theta, \hat{\tau}) = \frac{1}{T}\sum_T m_t(\theta, \hat{\tau}),$   (3.26)

where $m_t(\theta, \tau) = m(W_t, \theta, \tau)$ and $m(\cdot, \cdot, \cdot)$ is an $R^p$-valued known function. Suppose
the estimator $\hat{\theta}$ solves the equations

$\sqrt{T}\bar{m}_T(\hat{\theta}, \hat{\tau}) = 0$   (3.27)

(at least with probability that goes to one as $T \to \infty$). These equations might be
the first order conditions from some minimization problem.
We suppose consistency of $\hat{\theta}$ has already been established, i.e. $\hat{\theta} \stackrel{p}{\rightarrow} \theta_0$ (see
Andrews (1994a) for sufficient conditions). We wish to determine the asymptotic
distribution of $\hat{\theta}$. When $m(W_t, \theta, \tau)$ is a smooth function of $\theta$, the following approach
can be used. Element by element mean value expansions stacked yield

$o_p(1) = \sqrt{T}\bar{m}_T(\hat{\theta}, \hat{\tau}) = \sqrt{T}\bar{m}_T(\theta_0, \hat{\tau}) + \partial[\bar{m}_T(\theta^*, \hat{\tau})]/\partial\theta' \sqrt{T}(\hat{\theta} - \theta_0),$   (3.28)

where $\theta^*$ lies between $\hat{\theta}$ and $\theta_0$ (and $\theta^*$ may differ from row to row in
$\partial[\bar{m}_T(\theta^*, \hat{\tau})]/\partial\theta'$). Under suitable conditions,

$\partial[\bar{m}_T(\theta^*, \hat{\tau})]/\partial\theta' \stackrel{p}{\rightarrow} M.$   (3.29)

Thus,

$\sqrt{T}(\hat{\theta} - \theta_0) = -(M^{-1} + o_p(1))\sqrt{T}\bar{m}_T(\theta_0, \hat{\tau})$
$= -(M^{-1} + o_p(1))[\sqrt{T}(\bar{m}_T(\theta_0, \hat{\tau}) - \bar{m}_T^*(\theta_0, \hat{\tau})) + \sqrt{T}\bar{m}_T^*(\theta_0, \hat{\tau})],$   (3.30)

where $\bar{m}_T^*(\theta, \tau) = (1/T)\sum_T Em(W_t, \theta, \tau)$.
Again under suitable conditions, either

$\sqrt{T}\bar{m}_T^*(\theta_0, \hat{\tau}) \stackrel{p}{\rightarrow} 0 \quad \text{or} \quad \sqrt{T}\bar{m}_T^*(\theta_0, \hat{\tau}) \stackrel{d}{\rightarrow} N(0, A)$   (3.31)

for some covariance matrix $A$, see Andrews (1994a).


Let

$\nu_T(\tau) = \sqrt{T}(\bar{m}_T(\theta_0, \tau) - \bar{m}_T^*(\theta_0, \tau)).$   (3.32)

Note that $\nu_T(\cdot)$ is a stochastic process indexed by an infinite dimensional parameter
in this case. This differs from the other examples in this section for which $\tau$ is finite
dimensional.
Under standard conditions, one can establish that

$\nu_T(\tau_0) \stackrel{d}{\rightarrow} N(0, S)$   (3.33)

for some covariance matrix $S$, by applying an ordinary CLT. If, in addition, one
can show that

$\nu_T(\hat{\tau}) - \nu_T(\tau_0) \stackrel{p}{\rightarrow} 0,$   (3.34)

then we obtain

$\sqrt{T}(\hat{\theta} - \theta_0) = -(M^{-1} + o_p(1))[\nu_T(\hat{\tau}) + \sqrt{T}\bar{m}_T^*(\theta_0, \hat{\tau})]$
$= -M^{-1}[\nu_T(\tau_0) + \sqrt{T}\bar{m}_T^*(\theta_0, \hat{\tau})] + o_p(1)$
$\stackrel{d}{\rightarrow} N(0, M^{-1}(S + A)(M^{-1})'),$   (3.35)

which is the desired result.



To prove (3.34), we can use the stochastic equicontinuity property. Suppose

(i) $\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous for some choice of $\mathcal{T}$
and pseudometric $\rho$ on $\mathcal{T}$,
(ii) $P(\hat{\tau} \in \mathcal{T}) \to 1$, and
(iii) $\rho(\hat{\tau}, \tau_0) \stackrel{p}{\rightarrow} 0;$   (3.36)

then (3.34) holds (as shown below).
Note that there exist tradeoffs between conditions (i), (ii), and (iii) of (3.36) in
terms of the difficulty of verification and the strength of the regularity conditions
needed. For example, a larger set $\mathcal{T}$ makes it more difficult to verify (i), but easier
to verify (ii). A stronger pseudometric $\rho$ makes it easier to verify (i), but more
difficult to verify (iii).
Since the sufficiency of (3.36) for (3.34) is the key to the approach considered
here, we provide a proof of this simple result. We have: $\forall \varepsilon > 0$, $\forall \eta > 0$, $\exists \delta > 0$ such
that

$\overline{\lim}_{T \to \infty} P(|\nu_T(\hat{\tau}) - \nu_T(\tau_0)| > \eta)$
$\le \overline{\lim}_{T \to \infty} P(|\nu_T(\hat{\tau}) - \nu_T(\tau_0)| > \eta, \hat{\tau} \in \mathcal{T}, \rho(\hat{\tau}, \tau_0) \le \delta)$
$+ \overline{\lim}_{T \to \infty} P(\hat{\tau} \notin \mathcal{T} \text{ or } \rho(\hat{\tau}, \tau_0) > \delta)$
$\le \overline{\lim}_{T \to \infty} P\left( \sup_{\tau \in \mathcal{T}: \rho(\tau, \tau_0) \le \delta} |\nu_T(\tau) - \nu_T(\tau_0)| > \eta \right)$
$< \varepsilon,$   (3.37)

where the term on the third line of (3.37) is zero by (ii) and (iii) and the last
inequality holds by (i). Since $\varepsilon > 0$ is arbitrary, (3.34) follows.
To conclude, one can establish the $\sqrt{T}$-consistency and asymptotic normality
of the semiparametric estimator $\hat{\theta}$ if one can establish, among other things, that
$\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous. Next, we consider the application
of this approach to two examples and illustrate the form of $\nu_T(\cdot)$ in these examples.
In Sections 4 and 5, we discuss the verification of stochastic equicontinuity when
$\mathcal{M} = \{m(\cdot, \tau): \tau \in \mathcal{T}\}$ is an infinite dimensional class of functions.

Example 5

This example considers a weighted least squares (WLS) estimator of the partially
linear regression (PLR) model. The PLR model is given by

$Y_t = X_t'\theta_0 + g(Z_t) + U_t$ and $E(U_t | X_t, Z_t) = 0$ a.s.   (3.38)

for $t = 1, \ldots, T$, where the real function $g(\cdot)$ is unknown, $W_t = (Y_t, X_t', Z_t')'$ is iid or
m-dependent and identically distributed, $Y_t, U_t \in R$, $X_t, \theta_0 \in R^p$ and $Z_t \in R^{k_a}$. This
model is also discussed by Härdle and Linton (1994) in this handbook. The WLS
estimator is defined for the case where the conditional variance of $U_t$ given $(X_t, Z_t)$
depends only on $Z_t$. This estimator is a weighted version of Robinson's (1988)
semiparametric LS estimator. The PLR model with heteroskedasticity of the above
form can be generated by a sample selection model with nonparametric selection
equation (e.g. see Andrews (1994a)). Let $\tau_{10}(Z_t) = E(Y_t | Z_t)$, $\tau_{20}(Z_t) = E(X_t | Z_t)$,
$\tau_{30}(Z_t) = E(U_t^2 | Z_t)$ and $\tau_0 = (\tau_{10}, \tau_{20}', \tau_{30})'$. Let $\hat{\tau}_j(\cdot)$ be an estimator of $\tau_{j0}(\cdot)$ for
$j = 1, 2, 3$. The semiparametric WLS estimator of the PLR model is given by

$\hat{\theta} = \left[ \sum_T \xi(W_t)(X_t - \hat{\tau}_2(Z_t))(X_t - \hat{\tau}_2(Z_t))'/\hat{\tau}_3(Z_t) \right]^{-1}
\times \sum_T \xi(W_t)(X_t - \hat{\tau}_2(Z_t))(Y_t - \hat{\tau}_1(Z_t))/\hat{\tau}_3(Z_t),$   (3.39)

where $\xi(W_t) = 1(Z_t \in \mathcal{Z}^*)$ is a trimming function and $\mathcal{Z}^*$ is a bounded subset of
$R^{k_a}$. This estimator is of the form (3.26)-(3.27) with

$m(W_t, \theta, \hat{\tau}) = \xi(W_t)[Y_t - \hat{\tau}_1(Z_t) - (X_t - \hat{\tau}_2(Z_t))'\theta][X_t - \hat{\tau}_2(Z_t)]/\hat{\tau}_3(Z_t).$   (3.40)

To establish the asymptotic normality of $\hat{\theta}$ using the approach above, one needs
to establish stochastic equicontinuity for the empirical process $\nu_T(\cdot)$ when the class
of functions $\mathcal{M}$ is given by

$\mathcal{M} = \{m(\cdot, \theta_0, \tau): \tau \in \mathcal{T}\},$ where

$m(w, \theta_0, \tau) = \xi(w)[y - \tau_1(z) - (x - \tau_2(z))'\theta_0][x - \tau_2(z)]/\tau_3(z),$   (3.41)

$w = (y, x', z')'$, $\tau = (\tau_1, \tau_2', \tau_3)'$ and $\mathcal{T}$ is as defined below. Here, the elements $\tau \in \mathcal{T}$
are possible realizations of the vector nonparametric estimator $\hat{\tau}$. By definition,
$\mathcal{Z} \subset R^{k_a}$ is the domain of $\tau_j(z)$ for $j = 1, 2, 3$ and $\mathcal{Z}$ includes the support of $Z_t$ $\forall t \ge 1$.
By assumption, the trimming set $\mathcal{Z}^* \subset \mathcal{Z}$. If $\mathcal{Z}^* = \mathcal{Z}$, then no trimming occurs
and $\xi(w)$ is redundant. If $\mathcal{Z}^*$ is a proper subset of $\mathcal{Z}$, then trimming occurs and
the WLS estimator $\hat{\theta}$ is based on only nontrimmed observations.
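A hedged end-to-end sketch of this estimator (ours, not the chapter's): data generated from a PLR design, Nadaraya-Watson first-stage estimates of $\tau_1$, $\tau_2$, $\tau_3$ (with $\tau_3$ estimated from squared preliminary residuals, one common choice), trimming to a bounded set, and the WLS formula (3.39); the bandwidth, trimming set and variance floor are illustrative assumptions:

```python
# Hedged sketch (ours) of the two-stage WLS estimator for the PLR model.
import numpy as np

rng = np.random.default_rng(6)
T = 1000
Z = rng.uniform(-1.0, 1.0, size=T)
X = np.sin(np.pi * Z) + rng.normal(size=T)
U = np.sqrt(0.5 + 0.5 * Z ** 2) * rng.normal(size=T)   # var(U | Z) depends on Z
Y = 2.0 * X + np.cos(np.pi * Z) + U                    # theta_0 = 2, g = cos(pi z)

def nw(target, h=0.15):
    # Nadaraya-Watson estimate of E(target | Z), evaluated at the sample points
    K = np.exp(-0.5 * ((Z[:, None] - Z[None, :]) / h) ** 2)
    return (K * target[None, :]).sum(axis=1) / K.sum(axis=1)

t1, t2 = nw(Y), nw(X)                           # tau_1 = E(Y|Z), tau_2 = E(X|Z)
xd, yd = X - t2, Y - t1
theta_tilde = (xd @ yd) / (xd @ xd)             # preliminary unweighted fit
t3 = np.maximum(nw((yd - xd * theta_tilde) ** 2), 0.05)  # tau_3 = E(U^2|Z), floored
xi = np.abs(Z) <= 0.9                           # trimming indicator 1(Z in Z*)
theta_hat = (xi * xd * yd / t3).sum() / (xi * xd ** 2 / t3).sum()   # (3.39)
print(theta_tilde, theta_hat)                   # both should be close to 2.0
```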

Example 6

This example considers generalized method of moments (GMM) estimators of
parameters defined by conditional moment restrictions (CMR).
In this example, $\theta_0$ is the unique parameter vector that solves the equations

$E(\psi(Z_t, \theta_0) | X_t) = 0$ a.s. $\forall t \ge 1$   (3.42)

for some specified $R^v$-valued function $\psi(\cdot, \cdot)$, where $X_t \in R^{k_a}$. Examples of this model
in econometrics are quite numerous, see Chamberlain (1987) and Newey (1990).
Let $\Omega_0(X_t) = E(\psi(Z_t, \theta_0)\psi(Z_t, \theta_0)' | X_t)$, $\Delta_0(X_t) = E[\partial\psi(Z_t, \theta_0)/\partial\theta' | X_t]$ and
$\tau_0(X_t) = \Delta_0(X_t)'\Omega_0^{-1}(X_t)$. By assumption, $\Omega_0(\cdot)$, $\Delta_0(\cdot)$, and $\tau_0(\cdot)$ do not depend
on $t$. Let $\hat{\Omega}(\cdot)$ and $\hat{\Delta}(\cdot)$ be nonparametric estimators of $\Omega_0(\cdot)$ and $\Delta_0(\cdot)$. Let
$\hat{\tau}(\cdot) = \hat{\Delta}(\cdot)'\hat{\Omega}^{-1}(\cdot)$. Let $W_t = (Z_t', X_t')'$.
A GMM estimator $\hat{\theta}$ of $\theta_0$ minimizes

$\left[ \frac{1}{T}\sum_T \hat{\tau}(X_t)\psi(Z_t, \theta) \right]' \hat{\mathcal{V}} \left[ \frac{1}{T}\sum_T \hat{\tau}(X_t)\psi(Z_t, \theta) \right]$ over $\theta \in \Theta \subset R^p$,   (3.43)

where $\hat{\mathcal{V}}$ is a data-dependent weight matrix. To obtain the asymptotic distribution
of this estimator using the approach above, we need to establish a stochastic
equicontinuity result for the empirical process $\nu_T(\cdot)$ when the class of functions $\mathcal{M}$
is given by

$\mathcal{M} = \{m(\cdot, \theta_0, \tau): \tau \in \mathcal{T}\},$ where

$m(w, \theta_0, \tau) = \tau(x)\psi(z, \theta_0) = \Delta(x)'\Omega^{-1}(x)\psi(z, \theta_0),$   (3.44)

$w = (z', x')'$ and $\mathcal{T}$ is defined below.

4. Stochastic equicontinuity via symmetrization

4.1. Primitive conditions for stochastic equicontinuity

In this section we provide primitive conditions for stochastic equicontinuity. These
conditions are applied to some of the examples of Section 3 in Section 4.2 below.
We utilize an empirical process result of Pollard (1990) altered to encompass
m-dependent rather than independent rvs and reduced in generality somewhat to
achieve a simplification of the conditions. This result depends on a condition,
which we refer to as Pollard's entropy condition, that is based on how well the
functions in $\mathcal{M}$ can be approximated by a finite number of functions, where the
distance between functions is measured by the largest $L^2(Q)$ distance over all
distributions $Q$ that have finite support. The main purpose of this section is to
establish primitive conditions under which the entropy condition holds. Following
this, a number of examples are provided to illustrate the ease of verification of
the entropy condition.
First, we note that stochastic equicontinuity of a vector-valued empirical process
(i.e. s > 1) follows from the stochastic equicontinuity of each element of the empirical
process. In consequence, we focus attention on real-valued empirical processes
(s = 1).

The pseudometric $\rho$ on $\mathcal{T}$ is defined in this section by

$\rho(\tau_1, \tau_2) = \overline{\lim}_{N \to \infty} \left( \frac{1}{N}\sum_{t=1}^N E(m(W_t, \tau_1) - m(W_t, \tau_2))^2 \right)^{1/2}.$³   (4.1)

Let $Q$ denote a probability measure on $\mathcal{W}$. For a real function $f$ on $\mathcal{W}$, let
$Qf^2 = \int_{\mathcal{W}} f^2(w)\,dQ(w)$. Let $\mathcal{F}$ be a class of functions in $L^2(Q)$. The $L^2(Q)$ cover
numbers of $\mathcal{F}$ are defined as follows:

Definition

For any $\varepsilon > 0$, the cover number $N_2(\varepsilon, Q, \mathcal{F})$ is the smallest value of $n$ for which
there exist functions $f_1, \ldots, f_n$ in $\mathcal{F}$ such that $\min_{j \le n}(Q(f - f_j)^2)^{1/2} \le \varepsilon$ $\forall f \in \mathcal{F}$.
$N_2(\varepsilon, Q, \mathcal{F}) = \infty$ if no such $n$ exists.
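To fix ideas, here is a brute-force computation (ours, not the chapter's) of a greedy upper bound on $N_2(\varepsilon, Q, \mathcal{F})$ for the type I class $\mathcal{F} = \{1(w \le \tau): \tau \in R\}$ under a $Q$ with finite support, using the finitely many distinct functions that $\mathcal{F}$ induces on the support points:

```python
# Greedy upper bound (ours) on the L2(Q) cover number N_2(eps, Q, F) for the
# indicator class F = {1(w <= tau)} under a finitely supported Q.
import numpy as np

def cover_number_bound(support, qprob, eps):
    pts = np.sort(support)
    funcs = [(pts <= t).astype(float) for t in np.append(pts, pts[-1] + 1.0)]
    dist = lambda f, g: np.sqrt(qprob @ (f - g) ** 2)   # L2(Q) distance
    centers = []
    for f in funcs:                 # greedy eps-net: every f ends up within
        if all(dist(f, c) > eps for c in centers):      # eps of some center
            centers.append(f)
    return len(centers)

support = np.linspace(0.0, 1.0, 50)
qprob = np.full(50, 1.0 / 50.0)
for eps in (0.5, 0.25, 0.1, 0.05):
    print(eps, cover_number_bound(support, qprob, eps))
```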

The log of $N_2(\varepsilon, Q, \mathcal{F})$ is referred to as the $L^2(Q)$ $\varepsilon$-entropy of $\mathcal{F}$. Let $\mathcal{Q}$ denote the
class of all probability measures $Q$ on $\mathcal{W}$ that concentrate on a finite set. The
following entropy/cover number condition was introduced in Pollard (1982).

Definition

A class $\mathcal{F}$ of real functions defined on $\mathcal{W}$ satisfies Pollard's entropy condition if

$\int_0^1 \sup_{Q \in \mathcal{Q}} [\log N_2(\varepsilon(QF^2)^{1/2}, Q, \mathcal{F})]^{1/2}\,d\varepsilon < \infty,$   (4.2)

where $F$ is some envelope function for $\mathcal{F}$, i.e. $F$ is a real function on $\mathcal{W}$ for which
$|f(\cdot)| \le F(\cdot)$ $\forall f \in \mathcal{F}$.
As $\varepsilon \downarrow 0$, the cover number $N_2(\varepsilon(QF^2)^{1/2}, Q, \mathcal{F})$ increases. Pollard's entropy
condition requires that it cannot increase too quickly as $\varepsilon \downarrow 0$. This restricts the
complexity/size of $\mathcal{F}$ and does so in a way that is sufficient for stochastic equi-
continuity given suitable moment and temporal dependence assumptions. In
particular, the following three assumptions are sufficient for stochastic equi-
continuity.

Assumption A

$\mathcal{M}$ satisfies Pollard's entropy condition with some envelope $\bar{M}$.

Assumption B

$\overline{\lim}_{T \to \infty}(1/T)\sum_T E\bar{M}^{2+\delta}(W_t) < \infty$ for some $\delta > 0$, where $\bar{M}$ is as in Assumption A.

³The pseudometric $\rho(\cdot, \cdot)$ is defined here using a dummy variable $N$ (rather than $T$) to avoid confusion
when we consider objects such as $\text{plim}_{T \to \infty}\rho(\hat{\tau}, \tau_0)$. Note that $\rho(\cdot, \cdot)$ is taken to be independent of the
sample size $T$.

Assumption C

$\{W_t: t \le T, T \ge 1\}$ is an m-dependent triangular array of rvs.

Theorem 1 (Pollard)

Under Assumptions A-C, $\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous with $\rho$
given by (4.1).

Comments

(1) Theorem 1 is proved using a symmetrization argument. In particular, one
obtains a maximal inequality for $\nu_T(\tau)$ by showing that $\sup_{\tau \in \mathcal{T}}|\nu_T(\tau)|$ is less variable
than $\sup_{\tau \in \mathcal{T}}|(1/\sqrt{T})\sum_T \sigma_t m(W_t, \tau)|$, where $\{\sigma_t: t \le T\}$ are iid rvs that are indepen-
dent of $\{W_t: t \le T\}$ and have Rademacher distribution (i.e. $\sigma_t$ equals $+1$ or $-1$,
each with probability $\frac{1}{2}$). Conditional on $\{W_t\}$ one performs a chaining argument
that relies on Hoeffding's inequality for tail probabilities of sums of bounded, mean
zero, independent rvs. The bound in this case is small when the average sum of
squares of the bounds on the individual rvs is small. In the present case, the latter
is just $(1/T)\sum_T m^2(W_t, \tau)$. The maximal inequality ultimately is applied to the
empirical measure constructed from differences of the form $m(W_t, \tau_1) - m(W_t, \tau_2)$
rather than to just $m(W_t, \tau)$. In consequence, the measure of distance between
$m(\cdot, \tau_1)$ and $m(\cdot, \tau_2)$ that makes the bound effective is an $L^2(P_T)$ pseudometric,
where $P_T$ denotes the empirical distribution of $\{W_t: t \le T\}$. This pseudometric is
random and depends on $T$, but is conveniently dominated by the largest $L^2(Q)$
pseudometric over all distributions $Q$ with finite support. This explains the
appearance of the latter in the definition of Pollard's entropy condition. To see
why Pollard's entropy condition takes the precise form given above, one has to
inspect the details of the chaining argument. The interested reader can do so, see
Pollard (1990, Section 3).
(2) When Assumptions A-C hold, $\mathcal{T}$ is totally bounded under the pseudometric
$\rho$ provided $\rho$ is equivalent to the pseudometric $\rho^*$ defined by $\rho^*(\tau_1, \tau_2) =
\underline{\lim}_{N \to \infty}[(1/N)\sum_{t=1}^N E(m(W_t, \tau_1) - m(W_t, \tau_2))^2]^{1/2}$. By equivalent, we mean that
$\rho^*(\tau_1, \tau_2) \ge C\rho(\tau_1, \tau_2)$ $\forall \tau_1, \tau_2 \in \mathcal{T}$ for some $C > 0$. ($\rho^*(\tau_1, \tau_2) \le \rho(\tau_1, \tau_2)$ holds auto-
matically.) Of course, $\rho$ equals $\rho^*$ if the rvs $W_t$ are identically distributed. The
proof of total boundedness is analogous to that given in the proof of Theorem 10.7
in Pollard (1990).

Combinatorial arguments have been used to establish that certain classes of
functions, often referred to as Vapnik-Cervonenkis (VC) classes of one sort or
another, satisfy Pollard's entropy condition, see Pollard (1984, Chapter 2; 1990,
Section 4) and Dudley (1987). Here we consider the most important of these VC
classes for applications (type I classes below) and we show that several other classes
of functions satisfy Pollard's entropy condition. These include Lipschitz functions
indexed by finite dimensional parameters (type II classes) and infinite dimensional
classes of smooth functions (type III classes). The latter are important for appli-
cations to semiparametric and nonparametric problems because they cover
realizations of nonparametric estimators (under suitable assumptions).
Having established that Pollard's entropy condition holds for several useful
classes of functions, we proceed below to show that functions from these classes
can be mixed and matched, e.g. by addition, multiplication and division, to
obtain new classes that satisfy Pollard's entropy condition. In consequence, one
can routinely build up fairly complicated classes of functions that satisfy Pollard's
entropy condition. In particular, one can build up classes of functions that are
suitable for use in the examples above.
The first class of functions we consider is applicable in the non-differentiable
M-estimator Examples 1 and 2 (see Section 3.2 above).

Definition

A class $\mathcal{F}$ of real functions on $\mathcal{W}$ is called a type I class if it is of the
form (a) $\mathcal{F} = \{f: f(w) = w'\xi$ $\forall w \in \mathcal{W}$ for some $\xi \in \Xi \subset R^k\}$ or (b) $\mathcal{F} = \{f: f(w) =
h(w'\xi)$ $\forall w \in \mathcal{W}$ for some $\xi \in \Xi \subset R^k$, $h \in V_\kappa\}$, where $V_\kappa$ is some set of functions from
$R$ to $R$ each with total variation less than or equal to $\kappa < \infty$.

Common choices for $h$ in (b) include the indicator function, the sign function,
and Huber $\psi$-functions, among others.
For the more knowledgeable reader (concerning empirical processes), we note
that it is sometimes useful to extend the definition of type I classes of functions
to include various classes of functions called VC classes. By definition, such classes
include (i) classes of indicator functions of VC sets, (ii) VC major classes of uniformly
bounded functions, (iii) VC hull classes, (iv) VC subgraph classes, and (v) VC
subgraph hull classes, where each of these classes is as defined in Dudley (1987)
(but without the restriction that $f \ge 0$ $\forall f \in \mathcal{F}$). For brevity and simplicity, we do
not discuss all of these classes here.
The second class of functions we consider contains functions that are indexed
by a finite dimensional parameter and are Lipschitz with respect to that parameter:

Definition

A class $\mathcal{F}$ of real functions on $\mathcal{W}$ is called a type II class if each function $f$ in $\mathcal{F}$
satisfies: $f(\cdot) = f(\cdot, \tau)$ for some $\tau \in \mathcal{T}$, where $\mathcal{T}$ is some bounded subset of Euclidean
space and $f(\cdot, \tau)$ is Lipschitz in $\tau$, i.e.,

$|f(\cdot, \tau_1) - f(\cdot, \tau_2)| \le B(\cdot)\|\tau_1 - \tau_2\|$ $\forall \tau_1, \tau_2 \in \mathcal{T}$   (4.3)

for some function $B(\cdot): \mathcal{W} \to R$.



The third class of functions we consider is an infinite dimensional class of
functions that is useful for semiparametric and nonparametric applications such
as Examples 5 and 6. This class is more complicated to define than type I and
II classes. The reader may wish to skip this definition on first reading and move
ahead to Theorem 2.
The third class of functions contains functions that depend on $w = (w_a', w_b')'$ only
through a subvector $w_a$ that has dimension $k_a \le k$. The functions are smooth on
a restricted subset of $\mathcal{W}$ and are equal to a constant elsewhere. Define $\mathcal{W}_a =
\{w_a \in R^{k_a}: \exists w_b$ s.t. $(w_a', w_b')' \in \mathcal{W}\}$. For $w, h \in R^k$, we write $w = (w_a', w_b')'$ and $h = (h_a', h_b')'$.

Dejinition

A class Y of real functions on W is called a type III class if

(i) each f in 9 depends on w only through a subvector w, of dimension k, < k,


(ii) for some real number q > k,/2, some constant C < co, and some set W,*, which
is a subset of Wa and is a connected compact subset of Rka, each f EF satisfies
the smoothness condition: V WEW and w + hE-W^,

f(w + h)= .r,,


$ B,(h,, w,) + Nh,, w,) and R(h,, w,) d C/I h, llq, (4.4)

where B,(h,, w,) is homogeneous of degree v in h, and (4, C, WJ) do not depend
on f, w, or h,
(iii) for some constant K and all f~9, f(w) = K V WEW such that w,,EW~ - WT.

Typically the expansion of $f(w + h)$ in (4.4) is a Taylor expansion of order $[q]$
and the function $B_v(h_a, w_a)$ is the $v$th differential of $f$ at $w$, i.e.

$B_v(h_a, w_a) = \sum_v \frac{1}{v_1! \cdots v_{k_a}!} \frac{\partial^v f(w_a)}{\partial w_{a1}^{v_1} \cdots \partial w_{ak_a}^{v_{k_a}}} h_{a1}^{v_1} \cdots h_{ak_a}^{v_{k_a}},$

where $\sum_v$ denotes the sum over all ordered $k_a$-tuples $(v_1, \ldots, v_{k_a})$ of nonnegative
integers such that $v_1 + \cdots + v_{k_a} = v$, $w_a = (w_{a1}, \ldots, w_{ak_a})'$ and $h_a = (h_{a1}, \ldots, h_{ak_a})'$.
Sufficient conditions for condition (ii) above are: (a) for some real number
$q > k_a/2$, $f \in \mathcal{F}$ has partial derivatives of order $[q]$ on $\mathcal{W}^* = \{w \in \mathcal{W}: w_a \in \mathcal{W}_a^*\}$;
(b) the $[q]$th order partial derivatives of $f$ satisfy a Lipschitz condition with
exponent $q - [q]$ and some Lipschitz constant $C^*$ that does not depend on
$f$ $\forall f \in \mathcal{F}$; and (c) $\mathcal{W}_a^*$ is a convex compact set.
The envelope of a type III class $\mathcal{F}$ can be taken to be a constant function, since
the functions in $\mathcal{F}$ are uniformly bounded in absolute value over $w \in \mathcal{W}$ and $f \in \mathcal{F}$.
Type III classes can be extended to allow $\mathcal{W}_a^*$ to be a finite union of connected
compact subsets of $R^{k_a}$. In this case, (4.4) only needs to hold $\forall w \in \mathcal{W}$ and $w + h \in \mathcal{W}$
such that $w_a$ and $w_a + h_a$ are in the same connected set in $\mathcal{W}_a^*$.

In applications, type III classes of functions typically are classes of realizations
of nonparametric function estimates. Since these realizations usually depend on
only a subvector $W_{a1}$ of $W_a = (W_{a1}', W_{a2}')'$, it is advantageous to define type III
classes to contain functions that may depend on only part of $w_a$. By mixing and
matching functions of type III with functions of types I and II (see below), classes
of functions are obtained that depend on all of $w$.
In applications where the subvector $W_{a1}$ of $W_a$ is a bounded rv, one may have
$\mathcal{W}_a^* = \mathcal{W}_a$. In applications where $W_{a1}$ is an unbounded rv, $\mathcal{W}_a^*$ must be a proper
subset of $\mathcal{W}_a$ for $\mathcal{F}$ to be a type III class. A common case where the latter arises
in the examples of Andrews (1994a) is when $W_{a1}$ is an unbounded rv, all the
observations are used to estimate a nonparametric function $\tau_0(w_a)$ for $w_a \in \mathcal{W}_a$, and
the semiparametric estimator only uses observations $W_t$ such that $W_{a1t}$ is in a
bounded set $\mathcal{W}_a^*$. In this case, one sets the nonparametric estimator of $\tau_0(w_a)$ equal
to zero outside $\mathcal{W}_a^*$ and the realizations of this trimmed estimator form a type III
class if they satisfy the smoothness condition (ii) for $w_a \in \mathcal{W}_a^*$.

Theorem 2

If $\mathcal{F}$ is a class of functions of type I, II, or III, then Pollard's entropy condition
(4.2) (i.e. Assumption A) holds with envelope $F(\cdot)$ given by $1 \vee \sup_{f \in \mathcal{F}}|f(\cdot)|$,
$1 \vee \sup_{f \in \mathcal{F}}|f(\cdot)| \vee B(\cdot)$, or $1 \vee \sup_{f \in \mathcal{F}}|f(\cdot)|$, respectively, where $\vee$ is the maximum
operator.

Comment

For type I classes, the result of Theorem 2 follows from results in the literature
such as Pollard (1984, Chapter II) and Dudley (1987) (see the Appendix for details).
For type II classes, Theorem 2 is established directly. It is similar to Lemma 2.13
of Pakes and Pollard (1989). For type III classes, Theorem 2 is established using
uniform metric entropy results of Kolmogorov and Tihomirov (1961).

We now show how one can mix and match functions of types I, II, and III to
obtain a wide variety of classes that satisfy Pollard's entropy condition (Assumption
A). Let $\mathcal{G}$ and $\mathcal{G}^*$ be classes of $r \times s$ matrix-valued functions defined on $\mathcal{W}$ with
scalar envelopes $G$ and $G^*$, respectively (i.e. $G: \mathcal{W} \to R$ and $|g_{ij}(\cdot)| \le G(\cdot)$ $\forall i =
1, \ldots, r$, $\forall j = 1, \ldots, s$, $\forall g \in \mathcal{G}$). Let $g$ and $g^*$ denote generic elements of $\mathcal{G}$ and $\mathcal{G}^*$.
Let $\mathcal{H}$ be defined as $\mathcal{G}$ is, but with $s \times u$ matrix-valued functions. Let $h$ denote a generic
element of $\mathcal{H}$. We say that a class of matrix-valued functions $\mathcal{G}$, $\mathcal{G}^*$, or $\mathcal{H}$ satisfies
Pollard's entropy condition or is of type I, II, or III if that is the case element by
element for each of the $rs$ or $su$ elements of its functions.
Let $\mathcal{G} \oplus \mathcal{G}^* = \{g + g^*\}$ $(= \{g + g^*: g \in \mathcal{G}, g^* \in \mathcal{G}^*\})$, $\mathcal{G}\mathcal{H} = \{gh\}$, $\mathcal{G} \vee \mathcal{G}^* = \{g \vee g^*\}$,
$\mathcal{G} \wedge \mathcal{G}^* = \{g \wedge g^*\}$ and $|\mathcal{G}| = \{|g|\}$, where $\vee$, $\wedge$, and $|\cdot|$ denote the element by
element maximum, minimum, and absolute value operators respectively. If $r = s$
and $g(w)$ is non-singular $\forall w \in \mathcal{W}$ and $\forall g \in \mathcal{G}$, let $\mathcal{G}^{-1} = \{g^{-1}\}$. Let $\lambda_{\min}(\cdot)$ denote
the smallest eigenvalue of a matrix.

Theorem 3

If $\mathcal{G}$, $\mathcal{G}^*$, and $\mathcal{H}$ satisfy Pollard's entropy condition with envelopes $G$, $G^*$, and $H$,
respectively, then so do each of the following classes (with envelopes given in
parentheses): $\mathcal{G} \cup \mathcal{G}^*$ $(G \vee G^*)$, $\mathcal{G} \oplus \mathcal{G}^*$ $(G + G^*)$, $\mathcal{G}\mathcal{H}$ $((G \vee 1)(H \vee 1))$, $\mathcal{G} \vee \mathcal{G}^*$
$(G \vee G^*)$, $\mathcal{G} \wedge \mathcal{G}^*$ $(G \vee G^*)$, and $|\mathcal{G}|$ $(G)$. If in addition $r = s$ and $\mathcal{G}^{-1}$ has a finite
envelope $\bar{G}$, then $\mathcal{G}^{-1}$ also satisfies Pollard's entropy condition (with envelope
$(G \vee 1)(\bar{G} \vee 1)^2$).

Comments

(1) The stability properties of Pollard's entropy condition given in Theorem 3 are
quite similar to stability properties of packing numbers considered in Pollard
(1990).
(2) If $r = s$ and $\inf_{g \in \mathcal{G}} \inf_{w \in \mathcal{W}} \lambda_{\min}(g(w)) > 0$, then $\mathcal{G}^{-1}$ has an envelope that is
uniformly bounded by a finite constant.

4.2. Examples

We now show how Theorems 1-3 can be applied in the examples of Section 3 to
obtain stochastic equicontinuity of $\nu_T(\cdot)$.

Example 1 (continued)

By Theorems 1-3, the following conditions are sufficient for stochastic equicontinuity
of $\nu_T(\cdot)$ in this example.

(i) $\{(Y_t, X_t): t \ge 1\}$ is an m-dependent sequence of rvs.
(ii) $\overline{\lim}_{T \to \infty}(1/T)\sum_T E\|X_t\|^{2+\delta} < \infty$ for some $\delta > 0$.
(iii) $\{\psi_2(\cdot, \tau): \tau \in \mathcal{T}\}$ satisfies Pollard's entropy condition with envelope
$\sup_{\tau \in \mathcal{T}}|\psi_2(\cdot, \tau)|$ and $\overline{\lim}_{T \to \infty}(1/T)\sum_T E[(\|X_t\|^{2+\delta} + 1)\sup_{\tau \in \mathcal{T}}|\psi_2(W_t, \tau)|^{2+\delta}] < \infty$ for
some $\delta > 0$.
(iv) $\psi_1(\cdot)$ is a function of bounded variation.   (4.5)

Sufficiency of conditions (i)-(iv) for stochastic equicontinuity of $\nu_T(\cdot)$ is established
as follows. The sets $\{g: g(w) = \psi_1(y - x'\tau)$ for some $\tau \in \mathcal{T}\}$ and $\{h: h(w) = x\}$ are type
I classes with envelopes $C_1$ and $\|x\|$, respectively, for some constant $C_1 < \infty$, and
hence satisfy Pollard's entropy condition by Theorem 2. This result, condition (iii),
and the $\mathcal{G}\mathcal{H}$ result of Theorem 3 show that $\mathcal{M}$ satisfies Pollard's entropy condition
with envelope $(\|x\| \vee 1)(\sup_{\tau \in \mathcal{T}}|\psi_2(w, \tau)| \vee 1)$. Stochastic equicontinuity now
follows from Theorem 1, since Assumption B is implied by conditions (ii) and (iii).

For the particular M-estimators considered in Example 1 above, condition (iv) is always satisfied and condition (iii) is automatically satisfied given (ii) whenever $\psi_2 \equiv 1$ or $\psi_2(w,\tau) = 1(x'\tau > 0)$. When $\psi_2(w,\tau) = y - x'\tau$, $\psi_2(w,\tau) = 1(x'\tau > 0)[(y - x'\tau) \wedge x'\tau]$, or $\psi_2(w,\tau) = 1(y < 2x'\tau)(y - x'\tau)$, condition (iii) is satisfied provided $\mathcal{T}$ is bounded and

\[
\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^T\bigl[E|U_t|^{2+\delta} + E\|X_t\|^{4+\delta} + E\|U_tX_t\|^{2+\delta}\bigr] < \infty \quad\text{for some } \delta > 0.
\]

This follows from Theorem 3, since $\{1(x'\tau > 0): \tau\in\mathcal{T}\}$, $\{y - x'\tau: \tau\in\mathcal{T}\}$, $\{x'\tau: \tau\in\mathcal{T}\}$ and $\{1(y < 2x'\tau): \tau\in\mathcal{T}\}$ are type I classes with envelopes $1$, $|u| + \|x\|\sup_{\tau\in\mathcal{T}}\|\tau - \tau_0\|$, $\|x\|\sup_{\tau\in\mathcal{T}}\|\tau\|$ and $1$, respectively, where $u = y - x'\tau_0$.
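The content of stochastic equicontinuity can also be checked by simulation (the following sketch is an addition; the simplified choice $m(w,\tau) = 1(w \le \tau)$ stands in for the $\psi$ functions above, and $|\tau_1 - \tau_2|$ proxies the pseudometric): the maximal increment of $\nu_T(\cdot)$ over pairs of close parameter values shrinks as the balls shrink, roughly uniformly in $T$.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
taus = np.linspace(-2, 2, 201)
for T in [200, 2000]:
    w = rng.normal(size=T)
    # nu_T(tau) = sqrt(T) * (empirical cdf - true cdf) at each grid point tau
    nu = np.sqrt(T) * ((w[:, None] <= taus).mean(axis=0) - norm.cdf(taus))
    for delta in [0.5, 0.1, 0.02]:
        k = max(1, int(delta / (taus[1] - taus[0])))   # grid steps within delta
        inc = max(np.abs(nu[i + 1:i + k + 1] - nu[i]).max()
                  for i in range(len(taus) - 1))
        print(T, delta, round(inc, 3))   # sup increment over close pairs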

Example 2 (continued)

In the method of simulated moments example, the following conditions are sufficient for stochastic equicontinuity of $\nu_T(\cdot)$.

(i) $\{(U_t, Z_t, y_{t1}, \dots, y_{tS}): t \ge 1\}$ is an $m$-dependent sequence of rvs.
(ii) $\{g(\cdot,\tau): \tau\in\mathcal{T}\}$ is a type II class of functions with Lipschitz function $B(\cdot)$ that satisfies $EB^{2+\delta}(Z_t) + E\sup_{\tau\in\mathcal{T}}\|g(Z_t,\tau)\|^{2+\delta} < \infty$ for some $\delta > 0$. (4.6)

Note that condition (ii) holds if $g(w,\tau)$ is differentiable in $\tau$ $\forall w\in\mathcal{W}$, $\forall\tau\in\mathcal{T}$, $\mathcal{T}$ is open, and

\[
E\sup_{\tau\in\mathcal{T}}\Bigl\|\frac{\partial}{\partial\tau'}g(Z_t,\tau)\Bigr\|^{2+\delta} < \infty \quad\text{for some } \delta > 0.
\]

Sufficiency is established as follows. Classes of functions of the form $\{1((z_i - z_1)'(\beta(\tau) + \Delta(\tau)y_j) > 0): \tau\in\mathcal{T} \subset R^p\}$ are type I classes with envelopes equal to 1 (by including products $z_iy_j$ and $z_1y_j$ as additional elements of $w$) and hence satisfy Pollard's entropy condition by Theorem 2. $\{g(\cdot,\tau): \tau\in\mathcal{T}\}$ also satisfies Pollard's entropy condition with envelope $1 \vee \sup_{\tau\in\mathcal{T}}\|g(\cdot,\tau)\| \vee B(\cdot)$ by condition (ii) and Theorem 2. The $\mathcal{G}\mathcal{H}$ result of Theorem 3 now implies that $\mathcal{M}$ satisfies Pollard's entropy condition with envelope $1 \vee \sup_{\tau\in\mathcal{T}}\|g(\cdot,\tau)\| \vee B(\cdot)$. Stochastic equicontinuity now follows by Theorem 1.

Example 5 (continued)

By applying Theorems 1-3, we find the following conditions are sufficient for stochastic equicontinuity of $\nu_T(\cdot)$ in the WLS/PLR example. With some abuse of notation, let $\tau_j(w)$ denote a function on $\mathcal{W}$ that depends on $w$ only through the $k_a$-subvector $z$ and equals $\tau_j(z)$ above for $j = 1, 2, 3$. The sufficient conditions are:

(i) $\{(Y_t, X_t, Z_t): t \ge 1\}$ is an $m$-dependent identically distributed sequence of rvs.
(ii) $E\|Y_t - X_t'\theta_0\|^{2+\delta} + E\|X_t\|^{2+\delta} + E\|(Y_t - X_t'\theta_0)X_t\|^{2+\delta} < \infty$ for some $\delta > 0$.
(iii) $\mathcal{T} = \{\tau: \tau = (\tau_1, \tau_2', \tau_3)', \tau_j\in\mathcal{T}_j$ for $j = 1, 2, 3\}$. $\mathcal{T}_j$ is a type III class of $R^{p_j}$-valued functions on $\mathcal{W} \subset R^k$ that depend on $w = (y, x', z')'$ only through the $k_a$-vector $z$ for $j = 1, 2, 3$, where $p_1 = 1$, $p_2 = p$ and $p_3 = 1$, and

\[
\mathcal{T}_3 \subset \Bigl\{\tau_3: \inf_{w\in\mathcal{W}}|\tau_3(w)| \ge \varepsilon\Bigr\} \quad\text{for some } \varepsilon > 0. \quad (4.7)
\]

The set $\mathcal{W}_a^*$ in the definition of the type III class $\mathcal{T}_j$ equals $\mathcal{Z}^*$ in this example for $j = 1, 2, 3$. Since $\mathcal{Z}^*$ is bounded by condition (iii), conditions (i)-(iii) can be satisfied without trimming only if the rvs $\{Z_t: t \ge 1\}$ are bounded.
Sufficiency of conditions (i)-(iii) for stochastic equicontinuity is established as follows. Let $h_1(w) = y - x'\theta_0$ and $h_2(w) = x$. By Theorem 2, $\{c\}$, $\{h_1\}$, $\{h_2\}$ and $\mathcal{T}_j$ satisfy Pollard's entropy condition with envelopes $1$, $|h_1|$, $|h_2|$ and $C_j$, respectively, for some constants $C_j\in[1,\infty)$, for $j = 1, 2, 3$. By the $\mathcal{G}^{-1}$ result of Theorem 3, so does $\{1/\tau_3: \tau_3\in\mathcal{T}_3\}$ with envelope $C_3/\varepsilon^2$. By the $\mathcal{G}\mathcal{H}$ and $\mathcal{G}\oplus\mathcal{G}^*$ results of Theorem 3 applied several times, $\mathcal{M}$ satisfies Pollard's entropy condition with envelope $(|h_1| \vee 1)C_4 + (|h_2| \vee 1)C_5 + (|h_1| \vee 1)(|h_2| \vee 1)C_6$ for some finite constants $C_4$, $C_5$, and $C_6$. Hence, Theorem 1 yields the stochastic equicontinuity of $\nu_T(\cdot)$, since (ii) suffices for Assumption B.
Next, we consider the conditions $P(\hat\tau\in\mathcal{T}) \to 1$ and $\hat\tau \xrightarrow{p} \tau_0$ of (3.36). Suppose
(i) $\hat\tau_j(z)$ is a nonparametric estimator of $\tau_{j0}(z)$ that is trimmed outside $\mathcal{Z}^*$ to equal zero for $j = 1, 2$ and one for $j = 3$,
(ii) $\mathcal{Z}^*$ is a finite union of convex compact subsets of $R^{k_a}$,
(iii) $\hat\tau_j(z)$ and its partial derivatives of order $\le [q] + 1$ are uniformly consistent over $z\in\mathcal{Z}^*$ for $\tau_{j0}(z)$ and its corresponding partial derivatives, for $j = 1, 2, 3$, for some $q > k_a/2$, and
(iv) the partial derivatives of order $[q] + 1$ of $\tau_{j0}(z)$ are uniformly bounded over $z\in\mathcal{Z}^*$ and $\inf_{z\in\mathcal{Z}^*}|\tau_{30}(z)| > 0$.
Then, the realizations of $\hat\tau_j(z)$, viewed as functions of $w$, lie in a type III class of functions with probability $\to 1$ for $j = 1, 2, 3$ and $t \le T$, uniformly over $\mathcal{T}$ (where $\tau_{j0}(z)$ is defined for $z\in\mathcal{Z} - \mathcal{Z}^*$ to equal zero for $j = 1, 2$ and one for $j = 3$). Hence, the above conditions plus (i) and (ii) of (4.7) imply that conditions (i)-(iii) of (3.36) hold. If $\hat\tau_j(z)$ is a kernel regression estimator for $j = 1, 2, 3$, then sufficient conditions for the above uniform consistency properties are given in Andrews (1994b).

5. Stochastic equicontinuity via bracketing

This section provides an alternative set of sufficient conditions for stochastic


equicontinuity to those considered in Section 4. We utilize a bracketing result of
Ossiander (1987) for iid rvs altered to encompass m-dependent rather than inde-
pendent rvs and extended as in Pollard (1989) to allow for non-identically distri-
buted rvs. This result depends on a condition, that we refer to as Ossiander's entropy condition, that is based on how well the functions in $\mathcal{M}$ can be approximated by a finite number of functions that bracket each of the functions in $\mathcal{M}$. The bracketing error is measured by the largest $L^2(P_t)$ distance over all distributions $P_t$ of $W_t$, for $t \le T$, $T \ge 1$. The main purpose of this section is to give primitive conditions under which Ossiander's entropy condition holds.
The results given here are particularly useful in three contexts. The first context is when $\tau$ is finite dimensional and $m(W_t,\tau)$ is a non-smooth function of some nonlinear function of $\tau$ and $W_t$. For example, the $m(W_t,\tau)$ function for the LAD estimator of a nonlinear regression model is of this form. In this case, it is difficult to verify Pollard's entropy condition, so Theorems 1-3 are difficult to apply. The second context concerns semiparametric and nonparametric applications in which the parameter $\tau$ is infinite dimensional and is a bounded smooth function with an unbounded domain. Realizations of smooth nonparametric estimators are sometimes of this form. Theorem 2 above does not apply in this case. The third context concerns semiparametric and nonparametric applications in which $\tau$ is infinite dimensional, is a bounded smooth function on one set out of a countable collection of sets and is constant outside this set. For example, realizations of trimmed nonparametric estimators with variable trimming sets are sometimes of this form.
The pseudometric $\rho$ on $\mathcal{T}$ that is used in this section is defined by

\[
\rho(\tau_1,\tau_2) = \sup_{t\le T, T\ge 1}\bigl(E\|m(W_t,\tau_1) - m(W_t,\tau_2)\|^2\bigr)^{1/2}. \quad (5.1)
\]

We adopt the following notational convention: For any real function $f$ on $\mathcal{W}$, $(E|f(W_t)|^p)^{1/p} = \sup_{w\in\mathcal{W}}|f(w)|$ if $p = \infty$.
An entropy condition analogous to Pollards is defined using the following
bracketing cover numbers.

Definition

For any $\varepsilon > 0$ and $p\in[2,\infty]$, the $L^p$ bracketing cover number $N_p^B(\varepsilon, P, \mathcal{F})$ is the smallest value of $n$ for which there exist real functions $a_1,\dots,a_n$ and $b_1,\dots,b_n$ on $\mathcal{W}$ such that for each $f\in\mathcal{F}$ one has $|f - a_j| \le b_j$ for some $j \le n$ and $\max_{j\le n}\sup_{t\le T, T\ge 1}(Eb_j^p(W_t))^{1/p} \le \varepsilon$, where $\{W_t: t \le T, T \ge 1\}$ has distribution determined by $P$.
The log of $N_p^B(\varepsilon, P, \mathcal{F})$ is referred to as the $L^p$ bracketing $\varepsilon$-entropy of $\mathcal{F}$. The following entropy condition was introduced by Ossiander (1987) (for the case $p = 2$).

Definition

A class $\mathcal{F}$ of real functions on $\mathcal{W}$ satisfies Ossiander's $L^p$ entropy condition for some $p\in[2,\infty]$ if

\[
\int_0^1\bigl(\log N_p^B(\varepsilon, P, \mathcal{F})\bigr)^{1/2}\,d\varepsilon < \infty. \quad (5.2)
\]

As with Pollard's entropy condition, Ossiander's entropy condition restricts the complexity/size of $\mathcal{F}$ by restricting the rate of increase of the cover numbers as $\varepsilon \downarrow 0$.
Often our interest in Ossiander's $L^p$ entropy condition is limited to the case where $p = 2$, as in Ossiander (1987) and Pollard (1989). To show that Ossiander's $L^p$ entropy condition holds for $p = 2$ for a class of products of functions $\mathcal{G}\mathcal{H}$, however, we need to consider the case $p > 2$. The latter situation arises quite frequently in applications of interest.
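For intuition (an added aside), note that any polynomial rate of growth of the bracketing cover numbers satisfies (5.2), while too-fast exponential growth does not:

\[
N_p^B(\varepsilon, P, \mathcal{F}) \le C\varepsilon^{-d} \;\Longrightarrow\; \int_0^1\bigl(\log C + d\log(1/\varepsilon)\bigr)^{1/2}\,d\varepsilon < \infty,
\]

whereas $N_p^B(\varepsilon, P, \mathcal{F}) = \exp(\varepsilon^{-\gamma})$ makes the integrand $\varepsilon^{-\gamma/2}$, which is integrable on $(0,1]$ if and only if $\gamma < 2$. This dichotomy is the source of requirements such as $q > k_a/2$ below.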

Assumption D

$\mathcal{M}$ satisfies Ossiander's $L^p$ entropy condition with $p = 2$ and has envelope $\bar M$.

Theorem 4

Under Assumptions B-D (with $\bar M$ in Assumption B given by Assumption D rather than Assumption A), $\{\nu_T(\cdot): T \ge 1\}$ is stochastically equicontinuous with $\rho$ given by (5.1) and $\mathcal{T}$ is totally bounded under $\rho$.

Comments

1. The proof of this theorem follows easily from Theorem 2 of Pollard (1989) (as shown in the Appendix). Pollard's result is based on methods introduced by Ossiander (1987). Ossiander's result, in turn, is an extension of work by Dudley (1978).
2. As in Section 4, one establishes stochastic equicontinuity here via maximal inequalities. With the bracketing approach, however, one applies a chaining argument directly to the empirical measure rather than to a symmetrized version of it. The chaining argument relies on the Bernstein inequality for the tail probabilities of a sum of mean zero, independent rvs. The upper bound in Bernstein's inequality is small when the $L^2(P_t)$ norms of the underlying rvs are small, where $P_t$ denotes the distribution of the $t$th underlying rv. The bound ultimately is applied with the underlying rvs given by the centered difference between an arbitrary function in $\mathcal{M}$ and one of the functions from a finite set of approximating functions, each evaluated at $W_t$. In consequence, these functions need to be close in an $L^2(P_t)$ sense for all $t \le T$ for the bound to be effective, where $P_t$ denotes the distribution of $W_t$. This explains the appearance of the supremum $L^2(P_t)$ norm as the measure of approximation error in Ossiander's $L^2$ entropy condition.
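For reference (an added aside), one standard form of the Bernstein inequality invoked in such chaining arguments is: if $\xi_1,\dots,\xi_T$ are independent mean zero rvs with $|\xi_t| \le M$ a.s. and $\sum_{t=1}^T E\xi_t^2 \le V$, then

\[
P\Bigl(\Bigl|\sum_{t=1}^T \xi_t\Bigr| > x\Bigr) \le 2\exp\Bigl(-\frac{x^2/2}{V + Mx/3}\Bigr) \quad \forall x > 0,
\]

so the bound is governed by the $L^2$ norms of the summands when $x$ is moderate.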

We now provide primitive conditions under which Ossiander's entropy condition is satisfied. The method is analogous to that used for Pollard's entropy condition.
First, we show that several useful classes of functions satisfy the condition. Then,
we show how functions from these classes can be mixed and matched to obtain
more general classes that satisfy the condition.

Definition

A class $\mathcal{F}$ of real functions on $\mathcal{W}$ is called a type IV class under $P$ with index $p\in[2,\infty]$ if each function $f$ in $\mathcal{F}$ satisfies $f(\cdot) = f(\cdot,\tau)$ for some $\tau\in\mathcal{T}$, where $\mathcal{T}$ is some bounded subset of Euclidean space, and

\[
\sup_{t\le T, T\ge 1}\Bigl(E\sup_{\tau^*\in\mathcal{T}: \|\tau^*-\tau\|\le\delta}|f(W_t,\tau^*) - f(W_t,\tau)|^p\Bigr)^{1/p} \le C\delta^\psi \quad (5.3)
\]

$\forall\tau\in\mathcal{T}$ and $\forall\delta > 0$ in a neighborhood of 0, for some finite positive constants $C$ and $\psi$, where $\{W_t: t \le T, T \ge 1\}$ has distribution determined by $P$.$^4$
Condition (5.3) is an $L^p$ continuity condition that weakens the Lipschitz condition (4.3) of type II classes (provided $\sup_{t\le T, T\ge 1}(EB^p(W_t))^{1/p} < \infty$). The $L^p$ continuity condition allows for discontinuous functions such as sign and indicator functions. For example, for the LAD estimator of a nonlinear regression model one takes $f(W_t,\tau) = \mathrm{sgn}(Y_t - g(X_t,\tau))\,\partial[g(X_t,\tau)]/\partial\tau_j$ for different elements $\tau_j$ of $\tau$. Under appropriate conditions on $(Y_t, X_t)$ and on the regression function $g(\cdot,\cdot)$, the resultant class of functions can be shown to be of type IV under $P$ with index $p$.

Example 3 (continued)

In this test of variable relevance example, $\mathcal{M}$ is a type IV class with $p = 2$ under the following condition:

\[
\sup_{t\ge 1} E\,U_t^2\sup_{\tau^*\in\mathcal{T}: \|\tau^*-\tau\|\le\delta}|h(Z_t,\tau^*) - h(Z_t,\tau)|^2 \le C\delta^\psi \quad (5.4)
\]

for all $\tau\in\mathcal{T}$, for all $\delta > 0$, and for some finite positive constants $C$ and $\psi$. Condition (5.4) is easy to verify if $h(Z_t,\tau)$ is differentiable in $\tau$. By a mean value expansion, (5.4) holds if $\sup_{t\ge 1}E\,U_t^2\sup_{\tau\in\mathcal{T}}\|\partial[h(Z_t,\tau)]/\partial\tau\|^2 < \infty$ and $\mathcal{T}$ is bounded. On the other hand, condition (5.4) can be verified even if $h(Z_t,\tau)$ is discontinuous in $\tau$. For example, suppose $h(Z_t,\tau) = 1(h^*(Z_t,\tau) \le 0)$ for some real differentiable function $h^*(Z_t,\tau)$. In this case, it can be shown that condition (5.4) holds if $\sup_{t\ge 1}E|U_t|^{2+\delta} < \infty$ for some $\delta > 0$, $\sup_{t\ge 1}\sup_{\tau\in\mathcal{T}}\|\partial[h^*(Z_t,\tau)]/\partial\tau\| \le C_1 < \infty$ a.s. for some constant $C_1$, and $h^*(Z_t,\tau)$ has a (Lebesgue) density that is bounded above uniformly over $\tau\in\mathcal{T}$.
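To spell out one such verification (an added aside, under the simplifying assumptions that $U_t$ and $Z_t$ are independent and $Z_t$ is scalar with a density bounded by $\bar f < \infty$): for $h(Z_t,\tau) = 1(Z_t \le \tau)$,

\[
E\,U_t^2\sup_{\tau^*: |\tau^*-\tau|\le\delta}|1(Z_t \le \tau^*) - 1(Z_t \le \tau)|^2 = EU_t^2\cdot P(\tau-\delta < Z_t \le \tau+\delta) \le 2\bar f\,\bigl(\sup_{t\ge 1}EU_t^2\bigr)\,\delta,
\]

so a condition of the form (5.4) holds with $\psi = 1$. Without independence, Hölder's inequality and the $2+\delta$ moment on $U_t$ deliver the same conclusion with the smaller exponent $\psi = \delta/(2+\delta)$.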

$^4$If need be, the bound in (5.3) can be replaced by $C|\log\delta|^{-\lambda}$ for arbitrary constants $C\in(0,\infty)$ and $\lambda > 1$ and Theorem 5 still goes through.

Example 4 (continued)
$\mathcal{M}$ is a type IV class with $p = 2$ in this cross-sectional constancy example under the same conditions as in Example 3 with $U_t$ of Example 3 replaced by $U_t\,\partial[g(X_t,\theta_0)]/\partial\theta$, and with $h(Z_t,\tau)$ taken to be of the non-differentiable form $1(h^*(Z_t,\tau) \le 0)$ discussed above.
Note that the conditions placed on a type IV class of functions are weaker in several respects than those placed on the functions in Huber's (1967, Lemma 3, p. 227) stochastic equicontinuity result. (Huber's conditions N-2, N-3(i), and N-3(ii) are not used here, nor is his independence assumption on $\{W_t\}$.) Huber's result has been used extensively in the literature on M-estimators.
Next we consider an analogue of type III classes that allows for uniformly bounded functions that are smooth on an unbounded domain. (Recall that the functions of type III are smooth only on a bounded domain and equal a constant elsewhere.) The class considered here can be applied to the WLS/PLR Example 5 or the GMM/CMR Example 6. Define $\mathcal{W}_a$ as in Section 4 and let $w = (w_a', w_b')'$, $h = (h_a', h_b')'$, and $W_t = (W_{at}', W_{bt}')'$.

Dejinition

A class 9 of real functions on w is called a type I/ class under P with index


PER 001,if
(i) each fin F depends on w only through a subvector w, of dimension k, d k,
(ii) wb is such that w0 n {w,ER~=: I/w, I/ < r} is a connected compact set V r > 0,
(iii) for some real number q > k,/2 and some finite constants C,, . . . , Clql, C,, each
f EF satisfies the smoothness condition V w~-llr and w + hew,

f (w+ h)= vro


y!B,(k,, w,) + W,, w,),

R(h,>w,) G C, IIh, I?, and IB,(h,, w,)l 6 C, IIh, II for v = 0,. . , Cd, (5.5)
where B,(h,, w,) is homogeneous of degree v in h, and (q, C,, . . . , C,) do not
depend on f,w,or h,
(iv) suPtg T,Ta 1 E I/ W,, Iii < co for some [ > pqkJ(2q - k,) under P.

In condition (iv) above, the condition $\zeta > \infty$, which arises when $p = \infty$, is taken to hold if $\zeta = \infty$. Condition (ii) above holds, for example, if $\mathcal{W}_a = R^{k_a}$.
As with type III classes, the expansion of $f(w+h)$ in (5.5) is typically a Taylor expansion and $B_v(h_a, w_a)$ is usually the $v$th differential of $f$ at $w$. In this case, the third condition of (5.5) holds if the partial derivatives of $f$ of order $\le [q]$ are uniformly bounded.
Sufficient conditions for condition (iii) above are: (a) for some real number $q > k_a/2$, each $f\in\mathcal{F}$ has partial derivatives of order $[q]$ on $\mathcal{W}$ that are bounded uniformly over $w\in\mathcal{W}$ and $f\in\mathcal{F}$, (b) the $[q]$th order partial derivatives of $f$ satisfy a Lipschitz condition with exponent $q - [q]$ and some Lipschitz constant $C$ that does not depend on $f$, and (c) $\mathcal{W}_a$ is a convex set.
The envelope of a type V class $\mathcal{F}$ can be taken to be a constant function, since the functions in $\mathcal{F}$ are uniformly bounded over $w\in\mathcal{W}$ and $f\in\mathcal{F}$.
Type V classes can be extended to allow $\mathcal{W}_a$ to be such that $\mathcal{W}_a \cap \{w_a\in R^{k_a}: \|w_a\| \le r\}$ is a finite union of connected sets $\forall r > 0$. In this case, (5.5) only needs to hold $\forall w\in\mathcal{W}$ and $w + h\in\mathcal{W}$ such that $w_a$ and $w_a + h_a$ are in the same connected set in $\mathcal{W}_a \cap \{w_a: \|w_a\| \le r\}$ for some $r > 0$.
In applications, the functions in type V classes usually are the realizations of
nonparametric function estimates. For example, nonparametric kernel density
estimates for bounded and unbounded rvs satisfy the uniform smoothness
conditions of type V classes under suitable assumptions. In addition, kernel
regression estimates for bounded and unbounded regressor variables satisfy the
uniform smoothness conditions if they are trimmed to equal a constant outside a
suitable bounded set and then smoothed (e.g. by convolution with another kernel).
The bounded set in this case may depend on T.
In some cases one may wish to consider nonparametric estimates that are
trimmed (i.e. set equal to a constant outside some set), but not subsequently
smoothed. Realizations of such estimates do not comprise a type V class because
the trimming procedure creates a discontinuity. The following class of functions
is designed for this scenario. It can be used with the WLS/PLR Example 5 and the GMM/CMR Example 6. The trimming sets are restricted to come from a countably infinite number of sets $\{\mathcal{W}_{aj}: j \ge 1\}$. (This can be restrictive in practice.)

Definition

A class $\mathcal{F}$ of real functions on $\mathcal{W}$ is called a type VI class under $P$ with index $p\in[2,\infty]$, if
(i) each $f$ in $\mathcal{F}$ depends on $w$ only through a subvector $w_a$ of $w$ of dimension $k_a \le k$,
(ii) for some real number $q > k_a/2$, some sequence $\{\mathcal{W}_{aj}: j \ge 1\}$ of connected compact subsets of $R^{k_a}$ that lie in $\mathcal{W}_a$, some sequence $\{K_j: j \ge 1\}$ of constants that satisfy $\sup_{j\ge 1}|K_j| < \infty$, and some finite constants $C_1,\dots,C_{[q]}, C_q$, each $f\in\mathcal{F}$ satisfies the smoothness condition: for some integer $J$,
(a) $f(w) = K_J$ $\forall w\in\mathcal{W}$ for which $w_a\notin\mathcal{W}_{aJ}$ and
(b) $\forall w\in\mathcal{W}$ and $w + h\in\mathcal{W}$ for which $w_a\in\mathcal{W}_{aJ}$ and $w_a + h_a\in\mathcal{W}_{aJ}$,

\[
f(w+h) = \sum_{v=0}^{[q]}\frac{1}{v!}B_v(h_a, w_a) + R(h_a, w_a),
\]
\[
|R(h_a, w_a)| \le C_q\|h_a\|^q, \;\text{and}\; |B_v(h_a, w_a)| \le C_v\|h_a\|^v \;\text{for } v = 0,\dots,[q], \quad (5.6)
\]

where $B_v(h_a, w_a)$ is homogeneous of degree $v$ in $h_a$ and $(q, \{\mathcal{W}_{aj}: j \ge 1\}, C_1,\dots,C_q)$ do not depend on $f$, $w$, or $h$,
(iii) $\sup_{t\le T, T\ge 1}E\|W_{at}\|^\zeta < \infty$ for some $\zeta > pqk_a/(2q - k_a)$ under $P$,
(iv) $n(r) \le K_1\exp(K_2r^\xi)$ for some $\xi < 2\zeta/p$ and some finite constants $K_1, K_2$, where $n(r)$ is the number of sets $\mathcal{W}_{aj}$ in the sequence $\{\mathcal{W}_{aj}: j \ge 1\}$ that do not include $\{w_a\in\mathcal{W}_a: \|w_a\| \le r\}$.

Conditions (i)-(iii) in the definition of a type VI class are quite similar to conditions used above to define type III and type V classes. The difference is that with a type VI class, the set on which the functions are smooth is not a single set, but may vary from one function to the next among a countably infinite number of sets. Condition (iv) restricts the number of $\mathcal{W}_{aj}$ sets that may be of a given radius or less. Sufficient conditions for condition (iv) are the following. Suppose $\mathcal{W}_{aj} \supset \{w_a\in\mathcal{W}_a: \|w_a\| \le \eta(j)\}$ for all $j$ sufficiently large, where $\eta(\cdot)$ is a nondecreasing real function on the positive integers that diverges to infinity as $j\to\infty$. For example, $\{\mathcal{W}_{aj}: j \ge 1\}$ could contain spheres, ellipses, and/or rectangles whose radii are large for large $j$. If

\[
\eta(j) \ge D^*(\log j)^{1/\xi} \quad (5.7)
\]

for some positive finite constant $D^*$, then condition (iv) holds. Thus, the radii of the sets $\{\mathcal{W}_{aj}: j \ge 1\}$ are only required to increase logarithmically for condition (iv). This condition is not too restrictive, given that the number of trimming sets $\{\mathcal{W}_{aj}\}$ is countable. More restrictive is the latter condition that the number of trimming sets $\{\mathcal{W}_{aj}\}$ is countable.
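To see why (an added aside): if $\mathcal{W}_{aj} \supset \{w_a: \|w_a\| \le \eta(j)\}$ for all $j \ge j_0$ and $\eta(j) \ge D^*(\log j)^{1/\xi}$, then

\[
n(r) \le j_0 + \#\{j: D^*(\log j)^{1/\xi} < r\} \le j_0 + \exp\bigl((r/D^*)^\xi\bigr),
\]

which is of the form $K_1\exp(K_2r^\xi)$ required by condition (iv).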
As with type III and type V classes, the envelope of a type VI class of functions
can be taken to be a constant function.
The trimmed kernel regression estimators discussed in Andrews (1994b) provide examples of nonparametric function estimates for which type VI classes are applicable. For suitable trimming sets $\{\mathcal{W}_{aj}: j \ge 1\}$ and suitable smoothness conditions on the true regression function, one can specify a type VI class that contains all of the realizations of such kernel estimators in a set whose probability $\to 1$.
The following result establishes Ossiander's $L^p$ entropy condition for classes of type II-VI.

Theorem 5

Let $p\in[2,\infty]$. If $\mathcal{F}$ is a class of functions of type II with $\sup_{t\le T, T\ge 1}(EB^p(W_t))^{1/p} < \infty$, of type III, or of type IV, V, or VI under $P$ with index $p$, then Ossiander's $L^p$ entropy condition (5.2) holds (with envelope $F(\cdot)$ given by $\sup_{f\in\mathcal{F}}|f(\cdot)|$).

Comments

(1) To obtain Assumption D for any of the classes of functions considered above, one only needs to consider $p = 2$ in Theorem 5. To obtain Assumption D for a class of the form $\mathcal{G}\mathcal{H}$, where $\mathcal{G}$ and $\mathcal{H}$ are classes of types II, III, IV, V or VI, however, one needs to apply Theorem 5 to $\mathcal{G}$ and $\mathcal{H}$ for values of $p$ greater than 2; see Theorem 6 below.
(2) Theorem 5 covers classes containing a finite number of functions, because such functions are of type IV under any distribution $P$ and for any index $p\in[2,\infty]$. In particular, this is true for classes containing a single function. This observation is useful when establishing Ossiander's $L^p$ entropy condition for classes of functions that can be obtained by mixing and matching functions from several classes; see below.
We now show how one can mix and match functions of types II-VI. Let $\mathcal{G}$, $\mathcal{G}^*$, $\mathcal{H}$, $\mathcal{G}\oplus\mathcal{G}^*$, etc., be as defined in Section 4. We say that a class of matrix-valued functions $\mathcal{G}$, $\mathcal{G}^*$, or $\mathcal{H}$ satisfies Ossiander's $L^p$ entropy condition or is of type II, III, IV, V or VI if it does so, or if it is, element by element for each of the $rs$ or $su$ elements of its functions. We adopt the convention that $\lambda\mu/(\lambda+\mu) = \mu\in(0,\infty]$ if $\lambda = \infty$ and vice versa.

Theorem 6
(a) If $\mathcal{G}$ and $\mathcal{G}^*$ satisfy Ossiander's $L^p$ entropy condition for some $p\in[2,\infty]$, with envelopes $G$ and $G^*$, respectively, then so do each of the following classes (with envelopes given in parentheses): $\mathcal{G}\cup\mathcal{G}^*$ ($G \vee G^*$), $\mathcal{G}\oplus\mathcal{G}^*$ ($G + G^*$), $\mathcal{G}\vee\mathcal{G}^*$ ($G \vee G^*$), $\mathcal{G}\wedge\mathcal{G}^*$ ($G \vee G^*$), and $|\mathcal{G}|$ ($G$). If in addition $r = s$ and $\inf_{g\in\mathcal{G}}\inf_{w\in\mathcal{W}}\lambda_{\min}(g(w)) = \lambda_0$ for some $\lambda_0 > 0$, then $\mathcal{G}^{-1}$ also satisfies Ossiander's $L^p$ entropy condition (with envelope $r/\lambda_0$).
(b) The class $\mathcal{G}\mathcal{H}$ satisfies Ossiander's $L^p$ entropy condition with $p$ equal to $\alpha\in[2,\infty]$ and envelope $sGH$, if (i) $\mathcal{G}$ and $\mathcal{H}$ satisfy Ossiander's $L^p$ entropy condition with $p$ equal to $\lambda\in(\alpha,\infty]$ and $p$ equal to $\mu\in(\alpha,\infty]$, respectively, (ii) $\lambda\mu/(\lambda+\mu) \ge \alpha$, and (iii) the envelopes $G$ and $H$ of $\mathcal{G}$ and $\mathcal{H}$ satisfy $\sup_{t\le T, T\ge 1}EG^{\alpha\mu/(\mu-\alpha)}(W_t) < \infty$ and $\sup_{t\le T, T\ge 1}EH^{\alpha\lambda/(\lambda-\alpha)}(W_t) < \infty$.

Example 6 (continued)
Theorems 4-6 can be used to verify stochastic equicontinuity of $\nu_T(\cdot)$ and total boundedness of $\mathcal{T}$ in the GMM/CMR example. With some abuse of notation, let $\Lambda(w)$ and $\Omega(w)$ denote functions on $\mathcal{W}$ whose values depend on $w$ only through the $k_a$-vector $x$ and equal $\Lambda(x)$ and $\Omega(x)$ respectively. Similarly, let $\psi(w,\theta_0)$ denote the function on $\mathcal{W}$ that depends on $w$ only through $z$ and equals $\psi(z,\theta_0)$. The following conditions are sufficient.

(i) $\{(Z_t, X_t): t \ge 1\}$ is an $m$-dependent sequence of rvs.
(ii) $\sup_{t\ge 1}E\|\psi(Z_t,\theta_0)\|^6 < \infty$.
(iii) $\mathcal{T} = \{\tau: \tau = \Lambda\Omega^{-1}$ for some $\Lambda\in\mathcal{D}$ and $\Omega\in\mathcal{A}\}$, where $\mathcal{D}$ and $\mathcal{A}$ are type V or type VI classes of functions on $\mathcal{W} \subset R^k$ with index $p = 6$ whose functions depend on $w$ only through the $k_a$-vector $x$, and $\mathcal{A} \subset \{\Omega: \inf_{w\in\mathcal{W}}\lambda_{\min}(\Omega(w)) \ge \varepsilon\}$ for some $\varepsilon > 0$. (5.8)

Note that condition (iii) of (5.8) includes a moment condition on $X_t$: $\sup_{t\ge 1}E\|X_t\|^\zeta < \infty$ for some $\zeta > 6qk_a/(2q - k_a)$.
Sufficiency of conditions (i)-(iii) for stochastic equicontinuity and total boundedness is established as follows. By Theorem 5, $\{\psi(\cdot,\theta_0)\}$, $\mathcal{D}$ and $\mathcal{A}$ satisfy Ossiander's $L^p$ entropy condition with $p = 6$ and with envelopes $|\psi(\cdot,\theta_0)|$, $C_1$ and $C_2$, respectively, for some finite constants $C_1, C_2$. By the $\mathcal{G}^{-1}$ result of Theorem 6, so does $\mathcal{A}^{-1}$ with some constant envelope $C_3 < \infty$. By the $\mathcal{G}\mathcal{H}$ result of Theorem 6 applied with $\alpha = 3$ and $\lambda = \mu = 6$, $\mathcal{D}\mathcal{A}^{-1}$ satisfies Ossiander's $L^p$ entropy condition with $p = 3$ and some constant envelope $C_4 < \infty$. By this result, condition (ii), and the $\mathcal{G}\mathcal{H}$ result of Theorem 6 applied with $\alpha = 2$, $\lambda = 3$, $\mu = 6$, $\mathcal{G} = \mathcal{D}\mathcal{A}^{-1}$, and $\mathcal{H} = \{\psi(\cdot,\theta_0)\}$, $\mathcal{M}$ satisfies Ossiander's $L^p$ entropy condition with $p = 2$ and envelope $C_5|\psi(\cdot,\theta_0)|$ for some constant $C_5 < \infty$. Theorem 4 now yields stochastic equicontinuity, since condition (ii) is sufficient for Assumption B.
Condition (iii) above covers the case where the domain of the nonparametric functions is unbounded and the nonparametric estimators $\hat\Lambda$ and $\hat\Omega$ are not trimmed to equal zero outside a single fixed bounded set, as is required when the symmetrization results of Section 4 are applied. As discussed above, nonparametric kernel regression estimators that are trimmed and smoothed or trimmed on variable sets provide examples where condition (iii) holds under suitable assumptions for realizations of the estimators that lie in a set whose probability $\to 1$. For example, Andrews (1994b) provides uniform consistency on expanding sets and $L^q$ consistency results for such estimators, as are required to establish that $P(\hat\tau\in\mathcal{T}) \to 1$ and $\hat\tau \xrightarrow{p} \tau_0$ (the first and second parts of (3.36)) when stochastic equicontinuity is established using conditions (i)-(iii) above.

6. Conclusion

This paper illustrates how empirical process methods can be utilized to find the
asymptotic distributions of econometric estimators and test statistics. The concepts
of empirical processes, weak convergence, and stochastic equicontinuity are
introduced. Primitive sufficient conditions for the key stochastic equicontinuity
property are outlined. Applications of empirical process methods in the econo-
metrics literature are reviewed briefly. More detailed discussion is given for three
classes of applications: M-estimators based on non-differentiable criterion func-
tions; tests of hypotheses for which a nuisance parameter is present only under
the alternative hypothesis; and semiparametric estimators that utilize preliminary
nonparametric estimators.

Appendix

Proof of Theorem 1
Write $\nu_T(\cdot)$ as the sum of $m$ empirical processes $\{\nu_{Tj}(\cdot): T \ge 1\}$ for $j = 1,\dots,m$, where $\nu_{Tj}(\cdot)$ is based on the independent summands $\{m(W_t,\cdot): t = j + sm, s = 1, 2, \dots\}$. By standard inequalities it suffices to prove the stochastic equicontinuity of $\{\nu_{Tj}(\cdot): T \ge 1\}$ for each $j$.
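As a small concrete illustration of this first step (an addition to the proof), the index set $\{1,\dots,T\}$ is partitioned into its $m$ residue classes modulo $m$, within each of which the summands are spaced far enough apart to behave as independent under $m$-dependence as defined in the text:

T, m = 20, 3
blocks = [[t for t in range(1, T + 1) if t % m == j % m] for j in range(1, m + 1)]
print(blocks)   # block j collects t = j, j + m, j + 2m, ...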
The latter can be proved using Pollard's (1990) proof of stochastic equicontinuity for his functional CLT (Theorem 10.7). We take his functions $f_{nt}(\omega,\tau)$ to be of the form $m(W_t,\tau)/\sqrt{T}$. We alter his pseudometric from $\lim_{N\to\infty}[(1/N)\sum_{t=1}^N E\|m(W_t,\tau_1) - m(W_t,\tau_2)\|^2]^{1/2}$ to that given in (3.1). Pollard's proof of stochastic equicontinuity relies on conditions (i) and (iii)-(v) of his Theorem 10.7. Condition (ii) of Theorem 10.7 is used only for obtaining convergence of the finite dimensional distributions, which we do not need, and for ensuring that his pseudometric is well-defined. Our pseudometric does not rely on this condition. Inspection of Pollard's proof shows that any pseudometric can be used for his stochastic equicontinuity result (although not for his total boundedness result) provided his condition (v) holds. Thus, it suffices to verify his conditions (i) and (iii)-(v).
Condition (i) requires that the functions $\{m(W_t,\tau)/\sqrt{T}: t \le T, T \ge 1\}$ are manageable. This holds under Assumption A because Pollard's packing numbers satisfy

\[
\sup_{\omega} D\bigl(\varepsilon|\alpha\odot F_{n\omega}|, \alpha\odot f_{n\omega}\bigr) \le \sup_{Q\in\mathcal{Q}} N_2(\varepsilon/2, Q, \mathcal{M}). \quad (A.1)
\]

Conditions (iii) and (iv) are implied by Assumption B. Condition (v) holds automatically given our choice of pseudometric. Q.E.D.

Proof of Theorem 2
Type I classes of form (a) satisfy Pollard's entropy condition by Lemmas II.28 and II.36(ii) of Pollard (1984, pp. 30 and 34). Type I classes of form (b) satisfy Pollard's entropy condition because (i) they are contained in VC hull classes by the proof of Proposition 4.4 of Dudley (1987) and the fact that $\{f: f(w) = w'\zeta\;\forall w\in\mathcal{W}, \zeta\in R^k\}$ is a VC major class, see Pollard (1984, Lemma II.18, p. 20), (ii) VC hull classes are contained in VC subgraph hull classes, and (iii) VC subgraph hull classes satisfy Pollard's entropy condition by Corollary 5.8 of Dudley (1987).
For classes of type II, consider the functions $f(\cdot,\tau_1),\dots,f(\cdot,\tau_n)$, where $\tau_1,\dots,\tau_n$ are points at the centers of disjoint cubes of diameter $\varepsilon(QF^2)^{1/2}/(QB^2)^{1/2}$ whose union covers $\mathcal{T}$ ($\subset R^s$ for some $s \ge 1$). Since

\[
\min_{j\le n}(Q(f(\cdot,\tau) - f(\cdot,\tau_j))^2)^{1/2} \le \min_{j\le n}(QB^2)^{1/2}\|\tau - \tau_j\| \le \varepsilon(QF^2)^{1/2}, \quad (A.2)
\]

$N_2(\varepsilon(QF^2)^{1/2}, Q, \mathcal{F})$ is $\le$ the number of cubes above. By choice of the envelope $F(\cdot) = 1 \vee \sup_{f\in\mathcal{F}}|f(\cdot)| \vee B(\cdot)$, $\varepsilon(QF^2)^{1/2}/(QB^2)^{1/2} \ge \varepsilon$, so the number of cubes is $\le C\varepsilon^{-s}$ for some $C > 0$ and all $Q\in\mathcal{Q}$. Thus, Pollard's entropy condition holds with envelope $F(\cdot)$.
For classes of type III, Pollard's entropy condition holds because

\[
\sup_{Q\in\mathcal{Q}} N_2(\varepsilon(QF^2)^{1/2}, Q, \mathcal{F}) \le C\exp(\varepsilon^{-k_a/q}) \quad \forall\varepsilon\in(0,1] \quad (A.3)
\]

for some $C < \infty$ by Kolmogorov and Tihomirov (1961, Theorem XIII, p. 308). Since $q > k_a/2$ by assumption, Pollard's entropy condition holds. Q.E.D.

Proof of Theorem 3

For $\mathcal{G}\cup\mathcal{G}^*$, we have

\[
N_2(\varepsilon, Q, \mathcal{G}\cup\mathcal{G}^*) \le N_2(\varepsilon, Q, \mathcal{G}) + N_2(\varepsilon, Q, \mathcal{G}^*), \;\text{and so,}
\]
\[
N_2(\varepsilon(Q(G\vee G^*)^2)^{1/2}, Q, \mathcal{G}\cup\mathcal{G}^*) \le N_2(\varepsilon(QG^2)^{1/2}, Q, \mathcal{G}) + N_2(\varepsilon(QG^{*2})^{1/2}, Q, \mathcal{G}^*), \quad (A.4)
\]

where the second inequality uses the facts that $N_2(\varepsilon, Q, \mathcal{F})$ is nonincreasing in $\varepsilon$, $Q(G\vee G^*)^2 \ge QG^2$, and $Q(G\vee G^*)^2 \ge QG^{*2}$. Pollard's entropy condition follows from the second inequality of (A.4).
For $\mathcal{G}\oplus\mathcal{G}^*$, it suffices to suppose that $r = s = 1$. As above, Pollard's entropy condition follows from the inequalities

\[
N_2(\varepsilon, Q, \mathcal{G}\oplus\mathcal{G}^*) \le N_2(\varepsilon/2, Q, \mathcal{G})N_2(\varepsilon/2, Q, \mathcal{G}^*),
\]
\[
Q(G + G^*)^2 \ge QG^2 \;\text{and}\; Q(G + G^*)^2 \ge QG^{*2}, \quad (A.5)
\]

where the first inequality holds because $\min_{j\le n, k\le n^*}(\int(g + g^* - g_j - g_k^*)^2dQ)^{1/2} \le \min_{j\le n}(\int(g - g_j)^2dQ)^{1/2} + \min_{k\le n^*}(\int(g^* - g_k^*)^2dQ)^{1/2}$.
For $\mathcal{G}\mathcal{H}$, each element of $gh$ is a finite union of products of scalar functions, and so, using the result for $\mathcal{G}\oplus\mathcal{G}^*$, it suffices to suppose that $r = s = u = 1$. For notational simplicity, assume $G = G\vee 1$ and $H = H\vee 1$. Let $Q_G(\cdot) = Q(\cdot G^2)/QG^2$ and $Q_H(\cdot) = Q(\cdot H^2)/QH^2$. Note that $Q_G, Q_H\in\mathcal{Q}$. Let $n = N_2(\tfrac{1}{2}\varepsilon(Q_HG^2)^{1/2}, Q_H, \mathcal{G})$ and $n^* = N_2(\tfrac{1}{2}\varepsilon(Q_GH^2)^{1/2}, Q_G, \mathcal{H})$. Let $g_1,\dots,g_n$ and $h_1,\dots,h_{n^*}$ denote approximating functions in $\mathcal{G}$ and $\mathcal{H}$, respectively, that correspond to the cover numbers $n$ and $n^*$. We use $g_jh_k$ to approximate $gh$ for $g\in\mathcal{G}$ and $h\in\mathcal{H}$:

\[
\min_{j\le n, k\le n^*}\Bigl(\int(gh - g_jh_k)^2dQ\Bigr)^{1/2} \le \min_{j\le n}\Bigl(QH^2\int(g - g_j)^2d[Q_H]\Bigr)^{1/2} + \min_{k\le n^*}\Bigl(QG^2\int(h - h_k)^2d[Q_G]\Bigr)^{1/2} \le \varepsilon(QG^2H^2)^{1/2}. \quad (A.6)
\]

Thus, we get

\[
N_2(\varepsilon(QG^2H^2)^{1/2}, Q, \mathcal{G}\mathcal{H}) \le N_2(\tfrac{1}{2}\varepsilon(Q_HG^2)^{1/2}, Q_H, \mathcal{G})N_2(\tfrac{1}{2}\varepsilon(Q_GH^2)^{1/2}, Q_G, \mathcal{H}) \;\text{and}
\]
\[
\sup_{Q\in\mathcal{Q}} N_2(\varepsilon(QG^2H^2)^{1/2}, Q, \mathcal{G}\mathcal{H}) \le \sup_{Q_H\in\mathcal{Q}} N_2(\tfrac{1}{2}\varepsilon(Q_HG^2)^{1/2}, Q_H, \mathcal{G})\sup_{Q_G\in\mathcal{Q}} N_2(\tfrac{1}{2}\varepsilon(Q_GH^2)^{1/2}, Q_G, \mathcal{H})
= \sup_{Q\in\mathcal{Q}} N_2(\tfrac{1}{2}\varepsilon(QG^2)^{1/2}, Q, \mathcal{G})\sup_{Q\in\mathcal{Q}} N_2(\tfrac{1}{2}\varepsilon(QH^2)^{1/2}, Q, \mathcal{H}). \quad (A.7)
\]

Pollard's entropy condition follows from the latter inequality.


For $\mathcal{G}\vee\mathcal{G}^*$, it suffices to suppose $r = s = 1$. Pollard's entropy condition follows from the inequalities

\[
N_2(\varepsilon, Q, \mathcal{G}\vee\mathcal{G}^*) \le N_2(\varepsilon/2, Q, \mathcal{G})N_2(\varepsilon/2, Q, \mathcal{G}^*),
\]
\[
Q(G\vee G^*)^2 \ge QG^2 \;\text{and}\; Q(G\vee G^*)^2 \ge QG^{*2}, \quad (A.8)
\]

where the first inequality uses $|g\vee g^* - g_j\vee g_k^*| \le |g - g_j| + |g^* - g_k^*|$. The proof for $\mathcal{G}\wedge\mathcal{G}^*$ is analogous (with the envelope still given by $G\vee G^*$ rather than $G\wedge G^*$). The result for $|\mathcal{G}|$ follows because $||g| - |g_j|| \le |g - g_j|$.
Lastly, consider $\mathcal{G}^{-1}$. For $g\in\mathcal{G}$, let $g^l$ denote the $l$th element of $g$, where $l = 1,\dots,L$ and $L = r^2$. Let $\mathcal{G}_l = \{g^l: g\in\mathcal{G}\}$ and $n_l = N_2(\varepsilon/2, Q, \mathcal{G}_l)$ for some $Q\in\mathcal{Q}$. We claim that given any $\varepsilon > 0$ and $Q\in\mathcal{Q}$, there exist functions $g_1,\dots,g_n$ in $\mathcal{G}$ with $n \le \prod_{l=1}^L n_l$ such that for all $g\in\mathcal{G}$

\[
\min_{j\le n}\max_{l\le L}(Q(g^l - g_j^l)^2)^{1/2} \le \varepsilon. \quad (A.9)
\]

To see this, note that by the assumption that $\mathcal{G}$ satisfies Pollard's entropy condition, for each $l$ there exist real functions $g_{l1},\dots,g_{ln_l}$ in $\mathcal{G}_l$ such that for all $g\in\mathcal{G}$, $\min_{j\le n_l}(Q(g^l - g_{lj})^2)^{1/2} \le \varepsilon/2$. Form the set $\mathcal{G}^+$ of all $R^L$-valued functions whose $l$th element is $g_{lj}$ for some $j = 1,\dots,n_l$, for $l = 1,\dots,L$. The number of such functions is $n^+ = \prod_{l=1}^L n_l$. The functions in $\mathcal{G}^+$ are not necessarily in $\mathcal{G}$. For each function $g^+$ in $\mathcal{G}^+$ consider the $L^2(Q)$ $\varepsilon/2$-ball in $\mathcal{G}$ centered at $g^+$. Take one function from each non-empty ball and let $g_1,\dots,g_n$ denote the chosen functions. These functions satisfy the claim above.

If $\mathcal{G}$ satisfies Pollard's entropy condition with envelope $G$, it also does so with envelope $G\vee 1$. For notational simplicity, suppose $G = G\vee 1$. Given $Q\in\mathcal{Q}$, let $\bar Q(\cdot) = Q(\cdot\bar G^4)/Q\bar G^4$ ($\in\mathcal{Q}$), where $\bar G$ is the envelope of $\mathcal{G}^{-1}$. Take $\varepsilon$ and $Q$ in the claim above to equal $\varepsilon(\bar QG^4)^{1/2}/r^4$ and $\bar Q$ respectively. Then, there exist functions $g_1,\dots,g_n$ in $\mathcal{G}$ such that

\[
\min_{j\le n}\max_{l\le L}(\bar Q(g^l - g_j^l)^2)^{1/2} \le \varepsilon(\bar QG^4)^{1/2}/r^4 \;\text{and}\; n \le \prod_{l=1}^L N_2(\tfrac{1}{2}\varepsilon(\bar QG^4)^{1/2}/r^4, \bar Q, \mathcal{G}_l).
\]

Let $1_r = (1,\dots,1)'$ ($\in R^r$) and let $|\cdot|$ denote the matrix of absolute values of the matrix $\cdot$. For arbitrary unit vectors $b, c\in R^r$, we have

\[
\min_{j\le n}\bar Q(b'g^{-1}c - b'g_j^{-1}c)^2 = \min_{j\le n}\bar Q(b'g^{-1}(g_j - g)g_j^{-1}c)^2 \le \min_{j\le n} r^4\bar Q\Bigl(\bar G^2\,1_r'|g_j - g|1_r\Bigr)^2
\]
\[
= \min_{j\le n} r^4\bar Q\Bigl(\bar G^4\sum_{l=1}^L\sum_{m=1}^L|g^l - g_j^l||g^m - g_j^m|\Bigr) \le r^8Q\bar G^4\min_{j\le n}\max_{l\le L}\bar Q(g^l - g_j^l)^2 \le r^8Q\bar G^4\,\varepsilon^2\bar QG^4/r^8 = \varepsilon^2QG^4\bar G^4. \quad (A.10)
\]

Thus, $N_2(\varepsilon(QG^4\bar G^4)^{1/2}, Q, \mathcal{G}^{-1}) \le n \le \prod_{l=1}^L N_2(\tfrac{1}{2}\varepsilon(\bar QG^4)^{1/2}/r^4, \bar Q, \mathcal{G}_l)$ and

\[
\sup_{Q\in\mathcal{Q}} N_2(\varepsilon(QG^4\bar G^4)^{1/2}, Q, \mathcal{G}^{-1}) \le \sup_{\bar Q\in\mathcal{Q}}\prod_{l=1}^L N_2(\tfrac{1}{2}\varepsilon(\bar QG^4)^{1/2}/r^4, \bar Q, \mathcal{G}_l)
= \sup_{Q\in\mathcal{Q}}\prod_{l=1}^L N_2(\tfrac{1}{2}\varepsilon(QG^4)^{1/2}/r^4, Q, \mathcal{G}_l) \le \sup_{Q\in\mathcal{Q}}\prod_{l=1}^L N_2(\tfrac{1}{2}\varepsilon(QG^2)^{1/2}/r^4, Q, \mathcal{G}_l). \quad (A.11)
\]

The integral over $\varepsilon\in[0,1]$ of the square root of the logarithm of the right-hand side (rhs) of (A.11) is finite since $\mathcal{G}$ satisfies Pollard's entropy condition with envelope $G = G\vee 1$. Thus, $\mathcal{G}^{-1}$ satisfies Pollard's entropy condition with envelope $(G\vee 1)\bar G^2$. Q.E.D.

Proof of Theorem 4

Total boundedness of $\mathcal{T}$ under $\rho$ follows straightforwardly from $N_2^B(\varepsilon, P, \mathcal{M}) < \infty$ $\forall\varepsilon > 0$. For stochastic equicontinuity of $\{\nu_T(\cdot): T \ge 1\}$, by the same argument as in the proof of Theorem 1, it suffices to prove the result when $\{W_t: t \le T\}$ are independent rvs. By Markov's inequality and Theorem 2 of Pollard (1989), we have

\[
\lim_{T\to\infty} P^*\Bigl(\sup_{\rho(\tau_1,\tau_2)<\delta}|\nu_T(\tau_1) - \nu_T(\tau_2)| > \eta\Bigr) \le \lim_{T\to\infty} E^*\sup_{\rho(\tau_1,\tau_2)<\delta}|\nu_T(\tau_1) - \nu_T(\tau_2)|/\eta
\]
\[
\le C\Bigl[T^{1/2}\sup_{t\le T, T\ge 1}E\,\bar M(W_t)1(\bar M(W_t) > T^{1/2}\lambda_\delta) + \int_0^\delta\bigl(\log N_2^B(\varepsilon, P, \mathcal{M})\bigr)^{1/2}d\varepsilon\Bigr]\Big/\eta \quad (A.12)
\]

for some constant $C < \infty$, where $\lambda_\delta > 0$ is a constant that does not depend on $T$.
The second term on the right-hand side of (A.12) can be made arbitrarily small by choice of $\delta$ using Assumption D. The first term is less than or equal to

\[
\sup_{t\le T, T\ge 1}E\,\bar M^2(W_t)1(\bar M(W_t) > T^{1/2}\lambda_\delta)/\lambda_\delta, \quad (A.13)
\]

which can be made arbitrarily small using Assumption B. Stochastic equicontinuity follows. Q.E.D.

Proof of Theorem 5

It suffices to prove the result for classes of types III-VI, because a type II class with $\sup_{t\le T, T\ge 1}(EB^p(W_t))^{1/p} < \infty$ is a type IV class under $P$ with index $p$.
First, we consider classes of type III. For given $\varepsilon > 0$, define the functions $a_j, b_j$, $j = 1,\dots,n$, of the definition of $L^p$ bracketing cover numbers as follows: (a) $\forall w\in\mathcal{W}$ such that $w_a\in\mathcal{W}_a - \mathcal{W}_a^*$, let $a_j(w) = K$ and $b_j(w) = 0$ $\forall j$ and (b) $\forall w\in\mathcal{W}$ such that $w_a\in\mathcal{W}_a^*$, let $\{a_j(w): j = 1,\dots,n_\varepsilon\}$ be the functions constructed by Kolmogorov and Tihomirov (1961, pp. 312-314) in their proof of Theorem XIV and let $b_j(w) = \varepsilon$ $\forall j$. These functions satisfy the conditions for $L^p$ bracketing cover numbers for all $p\in[2,\infty]$. Hence, $N_p^B(\varepsilon, P, \mathcal{F}) \le n_\varepsilon$ $\forall\varepsilon\in(0,1]$, $\forall p\in[2,\infty]$. The number $n_\varepsilon$ of such functions is $\le C\exp(\varepsilon^{-k_a/q})$ $\forall\varepsilon\in(0,1]$ for some $C < \infty$ by Kolmogorov and Tihomirov (1961, Theorem XIV). Since $q > k_a/2$ by assumption, Ossiander's entropy condition holds for all $p\in[2,\infty]$.
For a type IV class with index $p$, consider disjoint cubes in $\mathcal{T}$ of diameter $\delta = (\varepsilon/C)^{1/\psi}$. The number $N(\varepsilon)$ of such cubes satisfies $N(\varepsilon) \le C^*\varepsilon^{-d/\psi}$ for some $C^* < \infty$, where $d$ is the dimension of $\mathcal{T}$. Let $\tau_j$ be some element of the $j$th cube in $\mathcal{T}$. Let $a_j(\cdot) = f(\cdot,\tau_j)$ and $b_j = \sup_{\tau^*: \|\tau^*-\tau_j\|<\delta}|f(\cdot,\tau^*) - a_j(\cdot)|$. By (5.3), $\sup_{t\le T, T\ge 1}[Eb_j^p(W_t)]^{1/p} \le C\delta^\psi = \varepsilon$. Thus, $N_p^B(\varepsilon, P, \mathcal{F}) \le N(\varepsilon)$. Since $\int_0^1(\log N(\varepsilon))^{1/2}d\varepsilon < \infty$, Ossiander's $L^p$ entropy condition holds.
For a type V class with index $p$, let $\mathcal{W}_r = \mathcal{W} \cap \{w\in R^k: \|w_a\| \le r\}$, let $\mathcal{F}_r$ denote the class of functions $\mathcal{F}$ restricted to $\mathcal{W}_r$, and let $N_\infty(\varepsilon, \mathcal{W}_r, \mathcal{F}_r)$ be the minimal number $n$ of real functions $f_1,\dots,f_n$ on $\mathcal{W}_r$ such that $\min_{j\le n}\sup_{w\in\mathcal{W}_r}|f(w) - f_j(w)| \le \varepsilon$ for each $f\in\mathcal{F}_r$. We claim that

\[
N_p^B(\varepsilon, P, \mathcal{F}) \le N_\infty(\varepsilon/2, \mathcal{W}_{r(\varepsilon)}, \mathcal{F}_{r(\varepsilon)}), \quad (A.14)
\]

where $r(\varepsilon) = C\varepsilon^{-p/\zeta}$ for some constant $C < \infty$ when $p < \infty$ and $r(\varepsilon) = \sup\{\|w_a\|: w\in\mathcal{W}\}$ ($< \infty$) when $p = \infty$.
Using the proof of Theorem XIV of Kolmogorov and Tihomirov (1961, pp. 312-314), it can be seen that

\[
\log N_\infty(\varepsilon, \mathcal{W}_{r(\varepsilon)}, \mathcal{F}_{r(\varepsilon)}) \le D\,r(\varepsilon)^{k_a}\varepsilon^{-k_a/q} \le D^*\varepsilon^{-k_a(p/\zeta + 1/q)} \quad (A.15)
\]

for some constants $D, D^* < \infty$, where the second inequality holds only when $p < \infty$. When $p < \infty$, (A.14) and (A.15) combine to yield Ossiander's $L^p$ entropy condition for $\mathcal{F}$ if $k_a(p/\zeta + 1/q)/2 < 1$, or equivalently, if $\zeta > pqk_a/(2q - k_a)$ and $q > k_a/2$, as is assumed. When $p = \infty$, (A.14) and the first inequality of (A.15) combine to yield Ossiander's $L^p$ entropy condition for $\mathcal{F}$ provided $q > k_a/2$, as is assumed.
It remains to show (A.14). For $p = \infty$, (A.14) follows immediately from the definition of $N_p^B(\cdot)$ and $N_\infty(\cdot)$, since $\mathcal{W}_{r(\varepsilon)} = \mathcal{W}$ and $\mathcal{F}_{r(\varepsilon)} = \mathcal{F}$ when $p = \infty$. Next, suppose $p < \infty$. For $n = N_\infty(\varepsilon/2, \mathcal{W}_r, \mathcal{F}_r)$, define real functions $a_j, b_j$, $j = 1,\dots,n$ on $\mathcal{W}$ as follows: On $\mathcal{W}_r$, take $\{a_j(\cdot): j = 1,\dots,n\}$ to be the functions constructed by Kolmogorov and Tihomirov (1961, pp. 312-314) in their proof of Theorem XIV and let $b_j(\cdot) = \varepsilon/2$ for $j = 1,\dots,n$. On $\mathcal{W} - \mathcal{W}_r$, take $a_j(\cdot) = 0$ and take $b_j(\cdot) = \bar F$ for $j = 1,\dots,n$, where $\bar F$ is a constant for which $\sup_{w\in\mathcal{W}}|f(w)| \le \bar F$ $\forall f\in\mathcal{F}$. Then, for each $f\in\mathcal{F}$, $\min_{j\le n}|f - a_j| \le b_j$ and

\[
\sup_{t\le T, T\ge 1}Eb_j^p(W_t) \le (\varepsilon/2)^p + \bar F^p\,r^{-\zeta}\sup_{t\le T, T\ge 1}E\|W_{at}\|^\zeta = (\varepsilon/2)^p + C^{*p}r^{-\zeta}, \quad (A.16)
\]

where $C^*$ is defined implicitly. If we let $r = r(\varepsilon) = (2^pC^{*p}/(2^p - 1))^{1/\zeta}\varepsilon^{-p/\zeta}$, then $\sup_{t\le T, T\ge 1}Eb_j^p(W_t) \le \varepsilon^p$ and (A.14) holds.
Last, we consider type VI classes of functions. First, suppose $p < \infty$. We derive an upper bound on $N_p^B(\varepsilon, P, \mathcal{F})$ for arbitrary $\varepsilon > 0$. Let $r_\varepsilon = C\varepsilon^{-p/\zeta}$ for some $C < \infty$ and let $\bar F$ be a constant for which $\sup_{w\in\mathcal{W}}|f(w)| \le \bar F$ $\forall f\in\mathcal{F}$. Let $J$ be the index of a set $\mathcal{W}_{aJ}$ that does not include $\{w_a\in\mathcal{W}_a: \|w_a\| \le r_\varepsilon\}$. For functions $f\in\mathcal{F}$ whose corresponding integer of part (ii) (of the definition of type VI classes) is $J$, take the centering and $\varepsilon$-bracketing functions $\{(a_l, b_l): l = 1,\dots,n_{\varepsilon J}\}$ (of the definition of $L^p$ bracketing cover numbers) as follows: (a) $\forall w\in\mathcal{W}$ such that $\|w_a\| > r_\varepsilon$, let $a_l(w) = 0$ and $b_l(w) = \bar F$, (b) $\forall w\in\mathcal{W}$ such that $\|w_a\| \le r_\varepsilon$ and $w_a\notin\mathcal{W}_{aJ}$, let $a_l(w) = K_J$ and $b_l(w) = 0$, and (c) $\forall w\in\mathcal{W}$ such that $\|w_a\| \le r_\varepsilon$ and $w_a\in\mathcal{W}_{aJ}$, let $\{a_l(w): l = 1,\dots,n_{\varepsilon J}\}$ be the functions constructed by Kolmogorov and Tihomirov (1961) in the proof of their Theorem XIV and let $b_l(w) = \varepsilon/2$ $\forall l$. The number $n_{\varepsilon J}$ of such functions is $\le D_1\exp[D_2r_\varepsilon^{k_a}\varepsilon^{-k_a/q}]$ by Theorem XIV of Kolmogorov and Tihomirov (1961), since $\{w: \|w_a\| \le r_\varepsilon, w_a\in\mathcal{W}_{aJ}\} \subset \{w: \|w_a\| \le r_\varepsilon\}$.
Next, for all functions $f\in\mathcal{F}$ whose corresponding integer $J$ of part (ii) is such that $\mathcal{W}_{aJ}$ contains $\{w_a\in\mathcal{W}_a: \|w_a\| \le r_\varepsilon\}$, take the centering and $\varepsilon$-bracketing functions $\{(a_l, b_l): l = 1,\dots,n_\varepsilon\}$ as follows. (a) $\forall w\in\mathcal{W}$ such that $\|w_a\| > r_\varepsilon$, let $a_l(w) = 0$ and $b_l(w) = \bar F$ $\forall l$ and (b) $\forall w\in\mathcal{W}$ such that $\|w_a\| \le r_\varepsilon$, let $\{a_l(w): l = 1,\dots,n_\varepsilon\}$ be the functions constructed by Kolmogorov and Tihomirov (1961) in the proof of their Theorem XIV and let $b_l(w) = \varepsilon/2$ $\forall l$. The number of such functions also is $\le D_1\exp[D_2r_\varepsilon^{k_a}\varepsilon^{-k_a/q}]$.
Now, the number of indices $J$ for which $\mathcal{W}_{aJ}$ does not include $\{w_a\in\mathcal{W}_a: \|w_a\| \le r_\varepsilon\}$ is $n(r_\varepsilon)$. Hence, the total number of centering/$\varepsilon$-bracketing functions considered above is $\le (n(r_\varepsilon) + 1)D_1\exp[D_2r_\varepsilon^{k_a}\varepsilon^{-k_a/q}]$. Also note that $\sup_{t\le T, T\ge 1}(Eb_l^p(W_t))^{1/p} \le \varepsilon$ for all of the functions $b_l$ introduced above by the same calculations as in (A.16) provided $C$ (of the definition of $r_\varepsilon$) is defined suitably. Hence,

\[
N_p^B(\varepsilon, P, \mathcal{F}) \le (n(r_\varepsilon) + 1)D_1\exp[D_2r_\varepsilon^{k_a}\varepsilon^{-k_a/q}]
\le (K_1\exp[K_2C^\xi\varepsilon^{-p\xi/\zeta}] + 1)D_1\exp[D_2C^{k_a}\varepsilon^{-k_a(p/\zeta + 1/q)}]. \quad (A.17)
\]

With this bound, Ossiander's $L^p$ entropy condition holds provided $p\xi/(2\zeta) < 1$ and $k_a(p/\zeta + 1/q)/2 < 1$, or equivalently, $\xi < 2\zeta/p$, $q > k_a/2$ and $\zeta > pqk_a/(2q - k_a)$, as is assumed.
For the case where $p = \infty$, take $r(\varepsilon) = \sup\{\|w_a\|: w\in\mathcal{W}\} < \infty$ $\forall\varepsilon > 0$ in the argument above. Then, Ossiander's $L^\infty$ entropy condition holds provided $q > k_a/2$, as is assumed. Q.E.D.

Proof of Theorem 6

For $\mathcal{G}\cup\mathcal{G}^*$, the result is obvious. For $\mathcal{G}\oplus\mathcal{G}^*$, it suffices to suppose that $r = s = 1$. Let $(g, a_j, b_j)$ and $(g^*, a_k^*, b_k^*)$ for $g\in\mathcal{G}$ and $g^*\in\mathcal{G}^*$ be defined analogously to $(f, a_j, b_j)$ given in the definition of the $L^p$ bracketing cover numbers. We have

\[
(E(b_j + b_k^*)^p)^{1/p} \le (Eb_j^p)^{1/p} + (Eb_k^{*p})^{1/p} \le 2\varepsilon, \;\text{and so,}
\]
\[
N_p^B(2\varepsilon, P, \mathcal{G}\oplus\mathcal{G}^*) \le N_p^B(\varepsilon, P, \mathcal{G})N_p^B(\varepsilon, P, \mathcal{G}^*). \quad (A.18)
\]

The result follows.
For $\mathcal{G}\vee\mathcal{G}^*$, it also suffices to suppose that $r = s = 1$. We have

\[
|g\vee g^* - a_j\vee a_k^*| \le |g - a_j| + |g^* - a_k^*| \le b_j + b_k^*, \;\text{and so,}
\]
\[
N_p^B(2\varepsilon, P, \mathcal{G}\vee\mathcal{G}^*) \le N_p^B(\varepsilon, P, \mathcal{G})N_p^B(\varepsilon, P, \mathcal{G}^*). \quad (A.19)
\]

The result for $\mathcal{G}\wedge\mathcal{G}^*$ is analogous.
For $|\mathcal{G}|$, the result follows from the inequality $||g| - |a_j|| \le |g - a_j|$.
Next consider $\mathcal{G}^{-1}$. For $g\in\mathcal{G}$, let $g^l$ denote the $l$th element of $g$ for $l = 1,\dots,L$, where $L = r^2$. By the same argument as used to prove the claim in the proof of the $\mathcal{G}^{-1}$ result of Theorem 3, there exist $r\times r$ matrix functions $a_1,\dots,a_n$ and $b_1,\dots,b_n$ such that (i) $a_j\in\mathcal{G}$ for all $j \le n$, (ii) for all $g\in\mathcal{G}$, $|g^l - a_j^l| \le b_j^l$ for all $l = 1,\dots,L$ for some $j \le n$, (iii) $[E(b_j^l)^p]^{1/p} \le \varepsilon$ $\forall l$, $\forall j$, and (iv) $n \le \prod_{l=1}^L N_p^B(\varepsilon/2, P, \mathcal{G}_l)$.
By an eigenvector/eigenvalue decomposition, we get $|g^{-1}| \le 1_r1_r'/\lambda_0$ element by element and $|a_j^{-1}| \le 1_r1_r'/\lambda_0$. Thus, for arbitrary unit vectors $b, c\in R^r$, we have: For any $g\in\mathcal{G}$ there exist $a_j$ and $b_j$ for which

\[
|b'g^{-1}c - b'a_j^{-1}c| \le |b'||g^{-1}||a_j - g||a_j^{-1}||c| \le (r^4/\lambda_0^2)1_r'|b_j|1_r, \;\text{and}
\]
\[
\bigl(E[(r^4/\lambda_0^2)1_r'|b_j|1_r]^p\bigr)^{1/p} \le (r^6/\lambda_0^2)\varepsilon. \quad (A.20)
\]

Thus, $N_p^B(r^6\varepsilon/\lambda_0^2, P, \mathcal{G}^{-1}) \le n \le \prod_{l=1}^L N_p^B(\varepsilon/2, P, \mathcal{G}_l)$ and the result follows.
To prove part (b) of Theorem 6 concerning $\mathcal{G}\mathcal{H}$, note that each element of $gh$ (for $g\in\mathcal{G}$ and $h\in\mathcal{H}$) is a finite union of products of scalar functions, and so, using the result for $\mathcal{G}\oplus\mathcal{G}^*$ it suffices to suppose that $r = s = u = 1$. Let $(g, a_j, b_j)$ and $(h, a_k^*, b_k^*)$ be defined analogously to $(f, a_j, b_j)$ given in the definition of the $L^p$ bracketing cover numbers, with $p = \lambda$ and $p = \mu$ respectively. We have

\[
|gh - a_ja_k^*| \le |g||h - a_k^*| + |(a_k^* - h) + h||g - a_j| \le Gb_k^* + Hb_j + b_jb_k^* \quad (A.21)
\]

and

\[
(E(Gb_k^* + Hb_j + b_jb_k^*)^\alpha)^{1/\alpha} \le (E(Gb_k^*)^\alpha)^{1/\alpha} + (E(Hb_j)^\alpha)^{1/\alpha} + (E(b_jb_k^*)^\alpha)^{1/\alpha}
\]
\[
\le (EG^{\alpha\mu/(\mu-\alpha)})^{(\mu-\alpha)/(\alpha\mu)}(Eb_k^{*\mu})^{1/\mu} + (EH^{\alpha\lambda/(\lambda-\alpha)})^{(\lambda-\alpha)/(\alpha\lambda)}(Eb_j^\lambda)^{1/\lambda} + (Eb_j^\lambda)^{1/\lambda}(Eb_k^{*\mu})^{1/\mu}
\]
\[
\le \sup_{t\le T, T\ge 1}\bigl((EG^{\alpha\mu/(\mu-\alpha)})^{(\mu-\alpha)/(\alpha\mu)} + (EH^{\alpha\lambda/(\lambda-\alpha)})^{(\lambda-\alpha)/(\alpha\lambda)}\bigr)\varepsilon + \varepsilon^2 \le C^*\varepsilon \quad (A.22)
\]

for $\varepsilon\in(0,1]$, where $C^*$ is defined implicitly and the dependence of each of the functions $G$, $b_k^*$, etc. on $W_t$ is suppressed for notational simplicity. The second and third inequalities hold by Hölder's inequality and the fact that $\lambda\mu/(\lambda+\mu) \ge \alpha$ implies that $\alpha\lambda/(\lambda-\alpha) \le \mu$ and $\alpha\mu/(\mu-\alpha) \le \lambda$. Equations (A.21) and (A.22) imply that

\[
N_\alpha^B(C^*\varepsilon, P, \mathcal{G}\mathcal{H}) \le N_\lambda^B(\varepsilon, P, \mathcal{G})N_\mu^B(\varepsilon, P, \mathcal{H}) \quad (A.23)
\]

and the desired result follows. Note that using the notational conventions stated in the text, (A.21)-(A.23) hold whether or not $\alpha = \infty$, $\lambda = \infty$ or $\mu = \infty$. Q.E.D.

References

Ait-Sahalia, Y. (1992a) Nonparametric Pricing of Interest Rate Derivative Securities, Department of


Economics, MIT, unpublished manuscript.
Ait-Sahalia, Y. (1992b) The Delta and Bootstrap Methods for Nonparametric Kernel Functionals,
Department of Economics, MIT, unpublished manuscript.
Andrews, D.W.K. (1988a) Asymptotics for Semiparametric Econometric Models: I. Estimation and
Testing, Cowles Foundation Discussion Paper No. 908R, Yale University.
Andrews, D.W.K. (1988b) Chi-square Diagnostic Tests for Econometric Models: Introduction and Applications, Journal of Econometrics, 37, 135-156.
Andrews, D.W.K. (1988c) Chi-square Diagnostic Tests for Econometric Models: Theory, Econometrica, 56, 1419-1453.
Andrews, D.W.K. (1989) Asymptotics for Semiparametric Econometric Models: II. Stochastic Equi-
continuity and Nonparametric Kernel Estimation, Cowles Foundation Discussion Paper No. 909R,
Yale University.
Andrews, D.W.K. (1992) Generic Uniform Convergence, Econometric Theory, 8, 241-257.
Andrews, D.W.K. (1993) An Introduction to Econometric Applications of Empirical Process Theory for Dependent Random Variables, Econometric Reviews, 12, 183-216.
Andrews, D.W.K. (1994a) Asymptotics for Semiparametric Econometric Models Via Stochastic
Equicontinuity, Econometrica, 62, forthcoming.
Andrews, D.W.K. (1994b) Nonparametric Kernel Estimation for Semiparametric Models, Econometric
Theory, 10, forthcoming.
Andrews, D.W.K. and W. Ploberger (1994) Optimal Tests When a Nuisance Parameter Is Present
Only under the Alternative, Econometrica, 62, forthcoming.
Arcones, M. and E. Gine (1992) On the Bootstrap of M-estimators and Other Statistical Functionals, in: R. LePage and L. Billard, eds., Exploring the Limits of the Bootstrap. New York: Wiley.
Bera, A. K. and M. L. Higgins (1992) A Test for Conditional Heteroskedasticity in Time Series Models,
Journal of Time Series Analysis, 13, 501-519.
Bierens, H. (1990) A Consistent Conditional Moment Test of Functional Form, Econometrica, 58,
1443-1458.
Billingsley, P. (1968) Convergence of Probability Measures. New York: Wiley.
Cavanagh, C. and R.P. Sherman (1992) Rank Estimators for Monotone Index Models, Bellcore
Economics Discussion Paper No. 84, Bellcore, Morristown, NJ.
Chamberlain, G. (1987) Asymptotic Efficiency in Estimation with Conditional Moment Restrictions,
Journal of Econometrics, 34, 305-324.
Davies, R.B. (1977) Hypothesis Testing When a Nuisance Parameter Is Present Only under the
Alternative, Biometrika, 64, 247-254.
Davies, R.B. (1987) Hypothesis Testing When a Nuisance Parameter Is Present Only under the
Alternative, Biometrika, 74, 33-43.
De Jong, R.M. (1992) The Bierens Test under Data Dependence, Department of Econometrics, Free
University, Amsterdam, unpublished manuscript.
Dudley, R.M. (1978) Central Limit Theorems for Empirical Measures, Annals of Probability, 6,
899-929.
Dudley, R.M. (1987) Universal Donsker Classes and Metric Entropy, Annals of Probability, 15, 1306-1326.
Gallant, A.R. (1989) On Asymptotic Normality When the Number of Regressors Increases and the Minimum Eigenvalue of X'X/n Decreases, Institute of Statistics Mimeograph Series No. 1955, North Carolina State University, Raleigh, NC.
Gallant, A.R. and G. Souza (1991) On the Asymptotic Normality of Fourier Flexible Form Estimates,
Journal of Econometrics, 50, 329-353.
Ch. 37: Empirical Process Methods in Econometrics 2293

Gine, E. and J. Zinn (1990) Bootstrapping General Empirical Measures, Annals of Probability, 18, 851-869.
Hahn, J. (1995) Bootstrapping Quantile Regression Estimators, Econometric Theory, 11, forthcoming.
Hansen, B.E. (1991) Inference When a Nuisance Parameter Is Not Identified under the Null
Hypothesis, Working Paper No. 296, Rochester Center for Economic Research, University of
Rochester.
Hansen, B.E. (1992a) Testing the Conditional Mean Specification in Parametric Regression Using the
Empirical Score Process, Department of Economics, University of Rochester, unpublished
manuscript.
Hansen, B.E. (1992b) The Likelihood Test under Non-standard Conditions: Testing the Markov Trend Model of GNP, Journal of Applied Econometrics, 7, S61-S82.
Härdle, W. and O. Linton (1994) Applied Nonparametric Methods, in: Handbook of Econometrics, Volume 4. Amsterdam: North-Holland.
Honore, B. (1992) Trimmed LAD and Least Squares Estimation of Truncated and Censored Regression
Models with Fixed Effects, Econometrica, 60, 533-565.
Horowitz, J.L. (1988) Semiparametric M-estimation of Censored Linear Regression Models, Advances in Econometrics, 7, 45-83.
Horowitz, J. L. (1992) A Smoothed Maximum Score Estimator for the Binary Response Model,
Econometrica, 60, 505-531.
Horowitz, J.L. and G.R. Neumann (1992) A Generalized Moments Specification Test of the
Proportional Hazards Model, Journal of the American Statistical Association, 87, 234-240.
Huber, P.J. (1967) The Behaviour of Maximum Likelihood Estimates under Nonstandard Conditions, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 221-233. Berkeley: University of California.
Huber, P.J. (1973) Robust Regression: Asymptotics, Conjectures and Monte Carlo, Annals of Statistics, 1, 799-821.
Kim, J. and D. Pollard (1990) Cube Root Asymptotics, Annals of Statistics, 18, 191-219.
Klecan, L., R. McFadden, and D. McFadden (1990) A Robust Test for Stochastic Dominance,
Department of Economics, MIT, unpublished manuscript.
Koenker, R. and G. Bassett (1978) Regression Quantiles, Econometrica, 46, 33-50.
Kolmogorov, A.N. and V.M. Tihomirov (1961) ε-entropy and ε-capacity of Sets in Functional Spaces, American Mathematical Society Translations, Ser. 2, 17, 277-364.
Manski, C.F. (1975) Maximum Score Estimation of the Stochastic Utility Model of Choice, Journal
of Econometrics, 3, 205-228.
McFadden, D. (1989) A Method of Simulated Moments for Estimation of Discrete Response Models
without Numerical Integration, Econometrica, 57, 995-1026.
Newey, W.K. (1989) The Asymptotic Variance of Semiparametric Estimators, Department of
Economics, Princeton University, unpublished manuscript.
Newey, W. K. (1990) Efficient Instrumental Variables Estimation of Nonlinear Models, Econometrica,
58, 809-837.
Newey, W.K. (1991) Uniform Convergence in Probability and Stochastic Equicontinuity, Econometrica, 59, 1161-1167.
Newey, W.K. and D. McFadden (1994) Estimation in Large Samples, in: Handbook of Econometrics,
Volume 4. Amsterdam: North-Holland.
Newey, W.K. and J.L. Powell (1987) Asymmetric Least Squares Estimation and Testing, Econometrica,
55, 819-847.
Olley, S. and A. Pakes (1991) The Dynamics of Productivity in the Telecommunications Equipment
Industry, Department of Economics, Yale University, unpublished manuscript.
Ossiander, M. (1987) A Central Limit Theorem under Metric Entropy with Bracketing, Annals of
Probability, 15, 897-919.
Pakes, A. and S. Olley (1991) A Limit Theorem for a Smooth Class of Semiparametric Estimators, Department of Economics, Yale University, unpublished manuscript.
Pakes, A. and D. Pollard (1989) Simulation and the Asymptotics of Optimization Estimators,
Econometrica, 57, 1027-1057.
Pollard, D. (1982) A Central Limit Theorem for Empirical Processes, Journal of the Australian
Mathematical Society (Series A), 33, 235-248.
Pollard, D. (1984) Convergence of Stochastic Processes. New York: Springer-Verlag.
2294 D. W.K. Andrews

Pollard, D. (1985) New Ways to Prove Central Limit Theorems, Econometric Theory, 1, 295-314.
Pollard, D. (1989) A Maximal Inequality for Sums of Independent Processes under a Bracketing
Condition, Department of Statistics, Yale University, unpublished manuscript.
Pollard, D. (1990) Empirical Processes: Theory and Applications. CBMS Conference Series in Probability
and Statistics, Vol. 2. Hayward, CA: Institute of Mathematical Statistics.
Powell, J.L. (1984) Least Absolute Deviations Estimation for the Censored Regression Model, Journal of Econometrics, 25, 303-325.
Powell, J.L. (1986a) Censored Regression Quantiles, Journal of Econometrics, 32, 143-155.
Powell, J.L. (1986b) Symmetrically Trimmed Least Squares Estimators for Tobit Models, Econo-
metrica, 54, 1435-1460.
Prohorov, Yu.V. (1956) Convergence of Random Processes and Limit Theorems in Probability
Theory, Theory of Probability and Its Applications, 1, 157-214.
Robinson, P.M. (1988) Root-N-Consistent Semiparametric Regression, Econometrica, 56, 931-954.
Sherman, R.P. (1992) Maximal Inequalities for Degenerate U-processes with Applications to
Optimization Estimators, unpublished manuscript, Bell Communications Research, Morristown,
NJ.
Sherman, R.P. (1993) The Limiting Distribution of the Maximum Rank Correlation Estimator,
Econometrica, 61, 123-137.
Sherman, R.P. (1994) U-processes in the Analysis of a Generalized Semiparametric Regression
Estimator, Econometric Theory, 10, forthcoming.
Shorack, G.R. and J.A. Wellner (1986) Empirical Processes with Applications to Statistics. New York:
Wiley.
Stinchcombe, M.B. and H. White (1993) Consistent Specification Testing with Unidentified Nuisance
Parameters Using Duality and Banach Space Limit Theory, Department of Economics, University
of California, San Diego, unpublished manuscript.
Wald, A. (1943) Tests of Statistical Hypotheses Concerning Several Parameters When the Number
of Observations Is Large, Transactions of the American Mathematical Society, 54, 426-482.
Wellner, J.A. (1992) Empirical Processes in Action: A Review, International Statistical Review, 60,
247-269.
Whang, Y.-J. and D.W.K. Andrews (1993) Tests of Model Specification for Parametric and Semiparametric Models, Journal of Econometrics, 57, 277-318.
White, H. and Y. Hong (1992) M-testing Using Finite and Infinite Dimensional Parameter Estimators, Department of Economics, University of California, San Diego, unpublished manuscript.
White, H. and M. Stinchcombe (1991) Adaptive Efficient Weighted Least Squares with Dependent
Observations, in Directions in Robust Statistics and Diagnostics, Part II, ed. by W. Stahel and S.
Weisberg. Berlin: Springer.
Yatchew, A. (1992) Nonparametric Regression Tests Based on Least Squares, Econometric Theory,
8,435-451.
Chapter 38

APPLIED NONPARAMETRIC METHODS

WOLFGANG HÄRDLE*

Humboldt-Universität Berlin

OLIVER LINTON

Oxford University

Contents

Abstract 2297
1. Nonparametric estimation in econometrics 2297
2. Density estimation 2300
2.1. Kernels as windows 2300
2.2. Kernels and ill-posed problems 2301
2.3. Properties of kernels 2302
2.4. Properties of the kernel density estimator 2303
2.5. Estimation of multivariate densities, their derivatives and bias reduction 2304
2.6. Fast implementation of density estimation 2306
3. Regression estimation 2308
3.1. Kernel estimators 2308
3.2. k-Nearest neighbor estimators 2310
3.2.1. Ordinary k-NN estimators 2310
3.2.2. Symmetrized k-NN estimators 2311
3.3. Local polynomial estimators 2311
3.4. Spline estimators 2312
3.5. Series estimators 2313
3.6. Kernels, k-NN, splines, and series 2314

*This work was prepared while the first author was visiting CentER, KUB Tilburg, The Netherlands. It was financed, in part, by contract No. 26 of the programme "Pôle d'attraction interuniversitaire" of the Belgian government.
†We would like to thank Don Andrews, Roger Koenker, Jens Perch Nielsen, Tom Rothenberg and Richard Spady for helpful comments. Without the careful typewriting of Mariette Huysentruit and the skillful programming of Marlene Müller this work would not have been possible.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

3.7. Confidence intervals 2315


3.8. Regression derivatives and quantiles 2318
4. Optimality and bandwidth choice 2319
4.1. Optimality 2319
4.2. Choice of smoothing parameter 2321
4.2.1. Plug-in 2322
4.2.2. Crossvalidation 2322
4.2.3. Other data driven selectors 2323
5. Application to time series 2325
5.1. Autoregression 2326
5.2. Correlated errors 2327
6. Applications to semiparametric estimation 2328
6.1. The partially linear model 2329
6.2. Heteroskedastic nonlinear regression 2330
6.3. Single index models 2331
7. Conclusions 2334
References 2334

Abstract

We review different approaches to nonparametric density and regression estimation.


Kernel estimators are motivated from local averaging and solving ill-posed
problems. Kernel estimators are compared to k-NN estimators, orthogonal series
and splines. Pointwise and uniform confidence bands are described, and the choice
of smoothing parameter is discussed. Finally, the method is applied to nonparametric
prediction of time series and to semiparametric estimation.

1. Nonparametric estimation in econometrics

Although economic theory generally provides only loose restrictions on the


distribution of observable quantities, much econometric work is based on tightly
specified parametric models and likelihood based methods of inference. Under
regularity conditions, maximum likelihood estimators consistently estimate the
unknown parameters of the likelihood function. Furthermore, they are asymptoti-
cally normal (at convergence rate the square root of the sample size) with a limiting
variance matrix that is minimal according to the Cramer-Rao theory. Hypothesis
tests constructed from the likelihood ratio, Wald or Lagrange multiplier principle
have therefore maximum local asymptotic power. However, when the parametric
model is not true, these estimators may not be fully efficient, and in many cases - for
example in regression when the functional form is misspecified - may not even be
consistent. The costs of imposing the strong restrictions required for parametric
estimation and testing can be considerable. Furthermore, as McFadden says in his
1985 presidential address to the Econometric Society, the parametric approach

interposes an untidy veil between econometric analysis and the propositions of economic
theory, which are mostly abstract without specific dimensional or functional restrictions.

Therefore, much effort has gone into developing procedures that can be used in the
absence of strong a priori restrictions. This survey examines nonparametric
smoothing methods which do not impose parametric restrictions on functional
form. We put emphasis on econometric applications and implementations on
currently available computer technology.
There are many examples of density estimation in econometrics. Income distri-
butions - see Hildenbrand and Hildenbrand (1986) - are of interest with regard to
welfare analysis, while the density of stock returns has long been of interest to
financial economists following Mandelbrot (1963) and Fama (1965). Figure 1 shows
a density estimate of the stock return data of Pagan and Schwert (1990) in comparison
with a normal density. We include a bandwidth factor in the scale parameter to
correct for the finite sample bias of the kernel method.

[Figure 1 about here: density estimate of stock returns; horizontal axis "Returns", from -0.15 to 0.15.]

Figure 1. Density estimator of stock returns of the Pagan and Schwert data compared with a mean zero normal density (thin line) with standard deviation $(\hat\sigma^2 + h^2)^{1/2}$, $\hat\sigma = 0.035$ and $h = 0.009$, both evaluated at a grid of 100 equispaced points. Sample size was 1104. The bandwidth $h$ was determined by the XploRe macro denauto according to Silverman's rule of thumb method.
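The following sketch (an addition to the text; simulated heavy-tailed data stand in for the Pagan and Schwert returns, and the constants are illustrative) reproduces the construction behind Figure 1: a Gaussian-kernel density estimate with a rule-of-thumb bandwidth, compared with a normal density whose scale includes the bandwidth correction mentioned above.

import numpy as np

rng = np.random.default_rng(3)
r = rng.standard_t(df=4, size=1104) * 0.03     # stand-in for the returns data
n, sigma = r.size, r.std()
h = 1.06 * sigma * n ** (-1 / 5)               # Silverman's rule of thumb

grid = np.linspace(-0.15, 0.15, 100)
fhat = np.exp(-0.5 * ((grid[:, None] - r[None, :]) / h) ** 2).sum(axis=1) \
       / (n * h * np.sqrt(2 * np.pi))          # Gaussian kernel density estimate
s2 = sigma ** 2 + h ** 2                       # bandwidth factor in the scale
fnorm = np.exp(-0.5 * grid ** 2 / s2) / np.sqrt(2 * np.pi * s2)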

Regression smoothing methods are used frequently in demand analysis - see for
example Deaton (1991), Banks et al. (1993) and Hausman and Newey (1992).
Figure 2 shows a nonparametric kernel regression estimate of the statistical Engel
curve for food expenditure and total income. For comparison the (parametric) Leser
curve is also included.
There are four main uses for nonparametric smoothing procedures. Firstly, they
can be employed as a convenient and succinct means of displaying the features of
a dataset and hence to aid practical parametric model building. Secondly, they can
be used for diagnostic checking of an estimated parametric model. Thirdly, one may
want to conduct inference under only the very weak restrictions imposed in fully
nonparametric structures. Finally, nonparametric estimators are frequently required
in the construction of estimators of Euclidean-valued quantities in semiparametric
models.
By using smoothing methods one can broaden the class of structures under which the chosen procedure gives valid inference. Unfortunately, this robustness is not free. Centered nonparametric estimators converge at rate $\sqrt{nh}$, where $h \to 0$ is a smoothing parameter, which is slower than the $\sqrt{n}$ rate for parametric estimators in correctly specified models. It is also sometimes suggested that the asymptotic

[Figure 2. A kernel regression smoother applied to food expenditure as a function of total income. Data from the Family Expenditure Survey (1968–1983), year 1973. Quartic kernel, bandwidth h = 0.2. The data have been normalized by mean income. Standard deviation of net income is 0.544. The kernel estimate has been computed using the XploRe macro regest.]

distributions themselves can be poor approximations in small samples. However, this problem is also found in parametric situations. The difference is quantitative rather than qualitative: typically, centered nonparametric estimators behave similarly to parametric ones in which n has been replaced by nh. The closeness of the
approximation is investigated further in Hall (1992).
Smoothing techniques have a long history starting at least in 1857 when the
Saxonian economist Engel found the law named after him. He analyzed Belgian
data on household expenditure, using what we would now call the regressogram.
Whittaker (1923) used a graduation method for regression curve estimation which
one would now call spline smoothing. Nadaraya (1964) and Watson (1964) provided
an extension for general random design based on kernel methods. In time series,
Daniell (1946) introduced the smoothed periodogram for consistent estimation of
the spectral density. Fix and Hodges (1951) extended this for the estimation of a
probability density. Rosenblatt (1956) proved asymptotic consistency of the kernel
density estimator.
These methods have developed considerably in the last ten years, and are now
frequently used by applied econometricians - see the recent survey by Deaton
(1993). The massive increase in computing power as well as the increased availability
of large cross-sectional and high-frequency financial time-series datasets are partly
responsible for the popularity of these methods. They are typically simple to
implement in software like GAUSS or XploRe (1993).
We base our survey of these methods around kernels. All the techniques we review
for nonparametric regression are linear in the data, and thus can be viewed as kernel
estimators with a certain equivalent weighting function. Since smoothing parameter
selection methods and confidence intervals have been mostly studied for kernels,
we feel obliged to concentrate on these methods as the basic unit of account in nonparametric smoothing.

2. Density estimation

It is simplest to describe the nonparametric approach in the setting of density estimation, so we begin with that. Suppose we are given iid real-valued observations $\{X_i\}_{i=1}^n$ with density f. Sometimes – for the crossvalidation algorithm described in Section 4 and for semiparametric estimation – it is required to estimate f at each sample point, while on other occasions it is sufficient to estimate at a grid of points $x_1, \ldots, x_M$ for M fixed. We shall for the most part restrict our attention to the latter situation, and in particular concentrate on estimation at a single point x.
Below we give two approaches to estimating f(x).

2.1. Kernels as windows

If f is smooth in a small neighborhood [x − h, x + h] of x, we can justify the following approximation,

$$2h\, f(x) \approx \int_{x-h}^{x+h} f(u)\,du = P(X \in [x - h, x + h]), \qquad (1)$$

by the mean value theorem. The right-hand side of (1) can be approximated by
counting the number of $X_i$'s in this small interval of length 2h, and then dividing by n. This is a histogram estimator with bincenter x and binwidth 2h. Let $K(u) = \frac12 I(|u| \le 1)$, where $I(\cdot)$ is the indicator function taking the value 1 when the event is true and zero otherwise. Then the histogram estimator can be written as

$$\hat f_h(x) = n^{-1} \sum_{i=1}^n K_h(x - X_i), \qquad (2)$$

where $K_h(\cdot) = h^{-1} K(\cdot/h)$. This is also a kernel density estimator of f(x) with kernel $K(u) = \frac12 I(|u| \le 1)$ and bandwidth h.
The step function kernel weights each observation inside the window equally, even though observations closer to x should possess better information than more distant ones. In addition, the step function estimator is discontinuous in x, which is unattractive given the smoothness assumption on f. Both objectives can be satisfied by choosing a smoother window function K as kernel, i.e. one for which $K(u) \to 0$ as $|u| \to 1$. One example is the so-called quartic kernel

$$K(u) = \tfrac{15}{16}\,(1 - u^2)^2\, I(|u| \le 1). \qquad (3)$$
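To fix ideas, here is a minimal sketch (in Python with NumPy; the function and variable names are our own illustration, not taken from GAUSS or XploRe) of the estimator (2) with the quartic kernel (3), evaluated at a grid of points; the simulated "returns" merely stand in for data such as that of Figure 1.

```python
import numpy as np

def quartic(u):
    # Quartic kernel K(u) = (15/16)(1 - u^2)^2 I(|u| <= 1), Equation (3)
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)

def kde(grid, data, h):
    # Equation (2): f_hat(x) = n^{-1} sum_i K_h(x - X_i), with K_h(.) = K(./h)/h
    u = (grid[:, None] - data[None, :]) / h
    return quartic(u).mean(axis=1) / h

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.035, size=1104)   # stand-in for the return data
grid = np.linspace(-0.15, 0.15, 100)
f_hat = kde(grid, returns, h=0.009)
```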


In the next section we give an alternative motivation for kernel estimators. The
less technically able reader may skip this section.

2.2. Kernels and ill-posed problems

An alternative approach to the estimation of f is to find the best smooth approximation to the empirical distribution function and to take its derivative.
The distribution function F is related to f by

$$Af(x) \equiv \int_{-\infty}^{\infty} I(u \le x)\, f(u)\,du = F(x), \qquad (4)$$

which is called a Fredholm equation with integral operator $Af(x) = \int_{-\infty}^{x} f(u)\,du$. Recovering the density from the distribution function is the same as finding the inverse of the operator A. In practice, we must replace the distribution function by the empirical distribution function (edf) $F_n(x) = n^{-1} \sum_{i=1}^n I(X_i \le x)$, which converges to F at rate $\sqrt{n}$. However, this is a step function and cannot be differentiated to obtain an approximation to f(x). Put another way, the Fredholm problem is ill-posed since for a sequence $F_n$ tending to F, the solutions $\hat f_n$ (satisfying $A\hat f_n = F_n$) do not necessarily converge to f: the inverse operator in (4) is not continuous, see Vapnik (1982, p. 22).
Solutions to ill-posed problems can be obtained using the Tikhonov (1963) regularization method. Let $\Omega(\hat f)$ be a lower semicontinuous functional called the stabilizer. The idea of the regularization method is to find indirectly a solution to $Af = F$ by use of the stabilizer. Note that the solution of $Af = F$ minimizes (with respect to $\hat f$)

$$\int_{-\infty}^{\infty} \left[ \int I(x \ge u)\, \hat f(u)\,du - F(x) \right]^2 dx.$$

The stabilizer $\Omega(\hat f) = \|\hat f\|^2 = \int \hat f^2(u)\,du$ is now added to this equation with a Lagrange parameter $\lambda$,

$$R_\lambda(\hat f, F) = \int_{-\infty}^{\infty} \left[ \int I(x \ge u)\, \hat f(u)\,du - F(x) \right]^2 dx + \lambda \int \hat f^2(u)\,du. \qquad (5)$$

Since we do not know F(x), we replace it by the edf $F_n(x)$ and obtain the problem of minimizing the functional $R_\lambda(\hat f, F_n)$ with respect to $\hat f$.
A necessary condition for a solution $\hat f$ is

$$\int \left[ \int I(x \ge s)\, \hat f(s)\,ds - F_n(x) \right] I(x \ge u)\,dx + \lambda \hat f(u) = 0.$$

Applying the Fourier transform for generalized functions and noting that the
Fourier transform of $I(u \ge 0)$ is $(i/\omega) + \pi\delta(\omega)$ (with $\delta(\cdot)$ the delta function), we obtain an equation for $\Gamma$, the Fourier transform of $\hat f$. Solving this equation for $\Gamma$ and then applying the inverse Fourier transform, we obtain

$$\hat f(x) = n^{-1} \sum_{i=1}^n \frac{1}{2\sqrt{\lambda}} \exp\left( -\frac{|x - X_i|}{\sqrt{\lambda}} \right).$$

Thus we obtain a kernel estimator with kernel $K(u) = \frac12 \exp(-|u|)$ and bandwidth $h = \sqrt{\lambda}$. More details are given in Vapnik (1982, p. 302).

2.3. Properties of kernels

In the first two sections we derived different approaches to kernel smoothing. Here we would like to collect and summarize some properties of kernels. A kernel is a piecewise continuous function, symmetric around zero, integrating to one:

$$K(u) = K(-u); \qquad \int K(u)\,du = 1. \qquad (6)$$

It need not have bounded support, although many commonly used kernels live on [−1, 1]. In most applications K is a positive probability density function; however, for theoretical reasons it is sometimes useful to consider kernels that take on negative values. For any integer j, let

$$\mu_j(K) = \int u^j K(u)\,du; \qquad \nu_j(K) = \int K^j(u)\,du.$$

The order p of a kernel is defined as the first nonzero moment,

$$\mu_j(K) = 0, \quad j = 1, \ldots, p - 1; \qquad \mu_p(K) \ne 0. \qquad (7)$$

We mostly restrict our attention to positive kernels, which can be at most of order 2. An example of a higher order kernel (of order 4) is

$$K(u) = \tfrac{15}{32}\,(7u^4 - 10u^2 + 3)\, I(|u| \le 1).$$

A list of common kernel functions is given in Table 1. We shall comment later on the values in the third column.

Table 1
Common kernel functions.

Kernel         K(u)                                      T(K)/T(K_epa)
Epanechnikov   $\frac{3}{4}(1 - u^2)\, I(|u| \le 1)$     1
Quartic        $\frac{15}{16}(1 - u^2)^2\, I(|u| \le 1)$ 1.005
Triangular     $(1 - |u|)\, I(|u| \le 1)$                1.011
Gauss          $(2\pi)^{-1/2} \exp(-u^2/2)$              1.041
Uniform        $\frac12\, I(|u| \le 1)$                  1.060

2.4. Properties of the kernel density estimator

The kernel estimator is a sum of iid random variables, and therefore

$$E[\hat f_h(x)] = \int K_h(x - z)\, f(z)\,dz = K_h * f(x), \qquad (8)$$

where $*$ denotes convolution, assuming the integral exists. When f is $N(0, \sigma^2)$ and K is standard normal, $E[\hat f_h(x)]$ is therefore the normal density with standard deviation $\sqrt{\sigma^2 + h^2}$ evaluated at x, see Silverman (1986, p. 37). This explains our modification to the normal density in Figure 1.
modification to the normal density in Figure 1.
More generally, it is necessary to approximate $E[\hat f_h(x)]$ by a Taylor series expansion. Firstly, we change variables,

$$E[\hat f_h(x)] = \int K(u)\, f(x - uh)\,du. \qquad (9)$$

Then expanding f(x − uh) about f(x) gives

$$E[\hat f_h(x)] = f(x) + \tfrac12 h^2 \mu_2(K)\, f''(x) + o(h^2), \qquad (10)$$

provided $f''(x)$ is continuous in a neighborhood of x. Therefore, the bias of $\hat f_h(x)$ is $O(h^2)$ as $h \to 0$.
By similar calculation,

$$\mathrm{Var}[\hat f_h(x)] \approx \frac{1}{nh}\, \nu_2(K)\, f(x), \qquad (11)$$

see Silverman (1986, p. 38). Therefore, provided $h \to 0$ and $nh \to \infty$, $\hat f_h(x) \stackrel{p}{\to} f(x)$. Further asymptotic properties of the kernel density estimator are given in Prakasa Rao (1983).
The statistical properties of $\hat f_h(x)$ depend closely on the bandwidth h: the bias increases and the variance decreases with h. We investigate how the estimator itself depends on the bandwidth using the income data of Figure 2. Figure 3a shows a kernel density estimate for the income data with bandwidth h = 0.2, computed using the quartic kernel in Equation 3 and evaluated at a grid of 100 equispaced points. There is a clear bimodal structure for this implementation. A larger bandwidth h = 0.4 creates a unimodal structure as shown in Figure 3b, while a smaller h = 0.05 results in Figure 3c where, in addition to the bimodal feature, there is considerable small scale variation in the density.
It is therefore important to have some method of choosing h. This problem has been heavily researched – see Jones et al. (1992) for a collection of recent results and discussion. We take up the issue of automatic bandwidth selection in greater detail for the regression case in Section 4.2. We mention here one method that is frequently used in practice – Silverman's rule of thumb. Let $\hat\sigma^2$ be the sample variance of the data. Silverman (1986) proposed choosing the bandwidth to be

$$\hat h = 1.06\, \hat\sigma\, n^{-1/5}.$$

This rule is optimal (according to the IMSE – see Section 4 below) for the normal density, and is not far from optimal for most symmetric, unimodal densities. This procedure was used to select h in Figure 1.
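The rule of thumb is a one-liner in any matrix language; the following sketch (Python/NumPy, our own naming) computes it, assuming the data are held in a one-dimensional array.

```python
import numpy as np

def silverman_bandwidth(data):
    # Rule of thumb h = 1.06 * sigma_hat * n^(-1/5), Silverman (1986)
    return 1.06 * data.std(ddof=1) * data.size ** (-0.2)

# usage: h = silverman_bandwidth(returns)
```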

2.5. Estimation of multivariate densities, their derivatives and bias reduction

A multivariate (d-dimensional) density function f can be estimated by the kernel estimator

$$\hat f_H(x) = n^{-1} \sum_{i=1}^n k_H(x - X_i), \qquad (12)$$

where $k_H(\cdot) = \{\det(H)\}^{-1} k(H^{-1} \cdot)$, where $k(\cdot)$ is a d-dimensional kernel function, while H is a d by d bandwidth matrix. A convenient choice in practice is to take $H = h S^{1/2}$, where S is the sample covariance matrix and h is a scalar bandwidth sequence, and to give k a product structure, i.e. let $k(u) = \prod_{j=1}^d K(u_j)$, where $u = (u_1, \ldots, u_d)^T$ and $K(\cdot)$ is a univariate kernel function.

Partial derivatives of f can be estimated by the appropriate partial derivatives of $\hat f_H(x)$ (providing $k(\cdot)$ has the same number of nonzero continuous derivatives). For any d-vector $r = (r_1, \ldots, r_d)$ and any function $g(\cdot)$ define

$$g^{(r)}(x) = \frac{\partial^{|r|} g(x)}{\partial x_1^{r_1} \cdots \partial x_d^{r_d}},$$

where $|r| = \sum_{j=1}^d r_j$; then $\hat f_H^{(r)}(x)$ estimates $f^{(r)}(x)$.



Figure 3. Kernel density estimates of net income distribution: (a) h = 0.2, (b) h = 0.4, (c) h = 0.05. Family
Expenditure Survey (1968-1983). XploRe macro denest. Year 1973.

The properties of multivariate derivative estimators are described in Prakasa Rao (1983, p. 237). In fact, when a bandwidth $H = hA$ is used, where h is scalar and A is any fixed positive definite d by d matrix, then $\mathrm{Var}[\hat f_H^{(r)}(x)] = O(n^{-1} h^{-(2|r| + d)})$, while the bias is $O(h^2)$. For a given bandwidth h, the variance increases with the number of derivatives being estimated and with the dimensionality of X. The latter effect is well known as the curse of dimensionality.
It is possible to improve the order of magnitude of the bias by using a pth order kernel, where p > 2. In this case, the Taylor series expansion argument shows that $E[\hat f_h(x)] - f(x) = O(h^p)$, where p is an even integer. Unfortunately, with this method there is the possibility of a negative density estimate, since K must be negative somewhere. Abramson (1982) and Jones et al. (1993) define bias reduction techniques that ensure a positive estimate. Jones and Foster (1993) review a number of other bias reduction methods.
The merits of bias reduction methods are based on asymptotic approximations.
Marron and Wand (1992) derive exact expressions for the first two moments of higher
order kernel estimators in a general class of mixture densities and find that unless
very large samples are used, these estimators may not perform as well as the
asymptotic approximations suggest. Unless otherwise stated, we restrict our
attention to second order kernel estimators.

2.6. Fast implementation of density estimation

Fast evaluation of Equation 2 is especially important for optimization of the smoothing parameter. This topic will be treated in Section 4.2. If the kernel density estimator has to be computed at each observation point for k different bandwidths, the number of calculations is $O(n^2 k)$ for kernels with bounded support. For the family expenditure dataset of Figure 1 with about 7000 observations this would take too long for the type of interactive data analysis we envisage. To resolve this problem we introduce the idea of discretization. The method is to map the raw data onto an equally spaced grid of smaller cardinality. All subsequent calculations are performed on this data summary, which results in considerable computational savings.
Let $H_l(x; \Delta)$, $l = 0, 1, \ldots, M - 1$, be the lth histogram estimator of f(x) with origin $l\Delta/M$ and small binwidth $\Delta$. The sensitivity of histograms with respect to the choice of origin is well known, see, e.g. Härdle (1991, Figure 1.16). However, if histograms with different origins are repeatedly averaged, the result becomes independent of the histograms' origins. Let $\hat f_{M,\Delta}(x) = (1/M) \sum_{l=0}^{M-1} H_l(x; \Delta)$ be the averaged histogram estimator. Then

$$\hat f_{M,\Delta}(x) = \frac{1}{n\Delta} \sum_{j \in \mathbb{Z}} I(x \in B_j) \sum_{i=-M}^{M} n_{j+i}\, w_i, \qquad (13)$$

where $B_j = [b_j - \frac{h}{2}, b_j + \frac{h}{2})$ with $h = \Delta/M$ and $b_j = jh$, while $n_j = \sum_{i=1}^n I(X_i \in B_j)$ and $w_i = (M - |i|)/M$. At the bincenters,

$$\hat f_{M,\Delta}(b_j) = \frac{1}{n\Delta} \sum_{i=-M}^{M} n_{j+i}\, w_i.$$

Note that $\{w_i\}_{i=-M}^{M}$ is, in fact, a discrete approximation to the (rescaled) triangular kernel $K(u) = (1 - |u|)\, I(|u| \le 1)$. More generally, weights $w_i$ can be used that represent the discretization of any kernel K. When K is supported on [−1, 1], $w_i$ is the rescaled evaluation of K at the points $i/M$ ($i = -M, \ldots, M$). If a kernel with non-compact support is used, such as the Gaussian for example, it is necessary to truncate the kernel function. Figure 4 shows the weights chosen from the quartic kernel with M = 5.
Since Equation 13 is essentially a convolution of the discrete kernel weights $w_i$ with the bincounts $n_j$, modern statistical languages such as GAUSS or XploRe that supply a convolution command are very convenient for computation of Equation 13. Binning the data takes exactly n operations. If C denotes the number of nonempty bins, then evaluation of the binned estimator at the nonempty bins requires O(MC) operations. In total we have a computational cost of $O(n + k M_{\max} C)$ operations for evaluating the binned estimator at k bandwidths, where $M_{\max} = \max\{M_j;\ j = 1, \ldots, k\}$. This is a big improvement.

[Figure 4. The quartic kernel $qua(u) = \frac{15}{16}(1 - u^2)^2 I(|u| \le 1)$. Discretizing the kernel (without rescaling) leads to $w_i = qua(i/M)$, $i = -M, \ldots, M$. Here M = 5 was chosen. The weights are represented by the thick step function.]

The discretization technique also works for estimating derivatives and multivariate densities, see Härdle and Scott (1992) and Turlach (1992). This method is basically a time domain version of the Fast Fourier Transform computational approach advocated in Silverman (1986), see also Jones (1989).
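A minimal sketch of the binning-and-convolution idea behind (13) follows (Python/NumPy; our own naming, and the bin midpoints are one simple convention among several). The convolution step is exactly the "convolution command" mentioned above.

```python
import numpy as np

def binned_kde(data, h, M):
    # Bin into narrow bins of width h; effective bandwidth Delta = M * h
    idx = np.floor(data / h).astype(int)
    lo = idx.min()
    counts = np.bincount(idx - lo)                     # bincounts n_j
    w = (M - np.abs(np.arange(-M, M + 1))) / M         # w_i = (M - |i|)/M
    # convolve bincounts with the discrete weights, Equation (13);
    # assumes there are more bins than 2M + 1
    f_hat = np.convolve(counts, w, mode="same") / (data.size * M * h)
    centers = (np.arange(counts.size) + lo + 0.5) * h  # bin midpoints
    return centers, f_hat

rng = np.random.default_rng(0)
centers, f_hat = binned_kde(rng.normal(size=7000), h=0.1, M=5)
```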

3. Regression estimation

The most common method for studying the relationship between two variables X and Y is to estimate the conditional expectation function $m(x) = E(Y | X = x)$. Suppose that

$$Y_i = m(X_i) + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (14)$$

where $\varepsilon_i$ is an independent random error satisfying $E(\varepsilon_i | X_i = x) = 0$, and


$\mathrm{Var}(\varepsilon_i | X_i = x) = \sigma^2(x)$. In this section we restrict our attention to independent sampling, but some extensions to the dependent sampling case are given in Section 5. The methods we consider are appropriate for both random design, where the $(X_i, Y_i)$ are iid, and fixed design, where the $X_i$ are fixed in repeated samples. In the random design case, X is an ancillary statistic, and standard statistical practice – see Cox and Hinkley (1974) – is to make inferences conditional on the sample $\{X_i\}_{i=1}^n$. However, many papers in the literature prove theoretical properties unconditionally, and we shall, for ease of exposition, present results in this form. We quote most results only for the case where X is scalar, although where appropriate we describe the extension to multivariate data.
In some cases, it is convenient to restrict attention to the equispaced design sequence $X_i = i/n$, $i = 1, \ldots, n$. Although this is unsuitable for most econometric applications, there are situations where it is of interest; specifically, time itself is conveniently described in this way. Also, the relative ranks of any variable (within a given sample) are naturally equispaced – see Anand et al. (1993).

The estimators of m(x) we describe are all of the form $\sum_{i=1}^n W_{ni}(x)\, Y_i$ for some weighting sequence $\{W_{ni}(x)\}_{i=1}^n$, but arise from different motivations and possess different statistical properties.

3.1. Kernel estimators

Given the technique of kernel density estimation, a natural way to estimate $m(\cdot)$ is first to compute an estimate of the joint density $f(x, y)$ of (X, Y) and then to integrate it according to the formula

$$m(x) = \frac{\int y\, f(x, y)\,dy}{\int f(x, y)\,dy}. \qquad (15)$$

The kernel density estimate $\hat f_h(x, y)$ of $f(x, y)$ is

$$\hat f_h(x, y) = n^{-1} \sum_{i=1}^n K_h(x - X_i)\, K_h(y - Y_i),$$

and by Equation 6

$$\int \hat f_h(x, y)\,dy = n^{-1} \sum_{i=1}^n K_h(x - X_i); \qquad \int y\, \hat f_h(x, y)\,dy = n^{-1} \sum_{i=1}^n K_h(x - X_i)\, Y_i.$$

Plugging these into the numerator and denominator of Equation 15 we obtain the Nadaraya–Watson kernel estimate

$$\hat m_h(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\, Y_i}{\sum_{i=1}^n K_h(x - X_i)}. \qquad (16)$$
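The estimator (16) is a ratio of two kernel sums and is direct to implement. A minimal sketch (Python/NumPy, our own naming; the quartic kernel of (3) is assumed, though any kernel would do):

```python
import numpy as np

def quartic(u):
    return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)

def nadaraya_watson(grid, X, Y, h):
    # Equation (16): ratio of kernel-weighted sums; the factor 1/h cancels
    K = quartic((grid[:, None] - X[None, :]) / h)
    den = K.sum(axis=1)
    return (K @ Y) / np.where(den > 0.0, den, np.nan)
```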

The bandwidth h determines the degree of smoothness of $\hat m_h$. This can be immediately seen by considering the limits for h tending to zero or to infinity, respectively. Indeed, at an observation $X_i$, $\hat m_h(X_i) \to Y_i$ as $h \to 0$, while at an arbitrary point x, $\hat m_h(x) \to \bar Y$ as $h \to \infty$. These two limit considerations make it clear that the smoothing parameter h, in relation to the sample size n, should not converge to zero too rapidly nor too slowly. Conditions for consistency of $\hat m_h$ are given in the following theorem, proved in Schuster (1972):

Theorem 1
Let $K(\cdot)$ satisfy $\int |K(u)|\,du < \infty$ and $\lim_{|u| \to \infty} u K(u) = 0$. Suppose also that m(x), f(x), and $\sigma^2(x)$ are continuous at x, and f(x) > 0. Then, provided $h = h(n) \to 0$ and $nh \to \infty$ as $n \to \infty$, we have

$$\hat m_h(x) \stackrel{p}{\longrightarrow} m(x).$$

The kernel estimator is asymptotically normal, as was first shown in Schuster (1972).

Theorem 2
Suppose in addition to the conditions of Theorem 1 that $\int |K(u)|^{2+\eta}\,du < \infty$ for some $\eta > 0$. Suppose also that m(x) and f(x) are twice continuously differentiable at x and that $E(|Y|^{2+\eta} | x)$ exists and is continuous at x. Finally, suppose that $\lim n h^5 < \infty$. Then

$$\sqrt{nh}\, [\hat m_h(x) - m(x) - h^2 B_{NW}(x)] \Rightarrow N(0, V_{NW}(x)),$$

where

$$B_{NW}(x) = \tfrac12 \mu_2(K) \left[ m''(x) + 2 m'(x) \frac{f'(x)}{f(x)} \right]; \qquad V_{NW}(x) = \nu_2(K)\, \frac{\sigma^2(x)}{f(x)}.$$

The Nadaraya–Watson estimator has an obvious generalization to d-dimensional explanatory variables and pth order kernels. In this case, assuming a common bandwidth h is used, the (asymptotic) bias is $O(h^p)$, when p is an even integer, while the (asymptotic) variance is $O(n^{-1} h^{-d})$.

3.2. k-Nearest neighbor estimators

3.2.1. Ordinary k-NN estimators

The kernel estimate was defined as a weighted average of the response variables in a fixed neighborhood of x. The k-nearest neighbor (k-NN) estimate is defined as a weighted average of the response variables in a varying neighborhood. This neighborhood is defined through those X-variables which are among the k nearest neighbors of x.

Let $J_x = \{i: X_i\ \text{is one of the } k\text{-NN to } x\}$ be the set of indices of the k nearest neighbors of x. The k-NN estimate is the average of the $Y_i$'s with index in $J_x$,

$$\hat m_k(x) = k^{-1} \sum_{i \in J_x} Y_i. \qquad (17)$$

Connections to kernel smoothing can be made by considering Equation 17 as a kernel smoother with uniform kernel $K(u) = \frac12 I(|u| \le 1)$ and variable bandwidth R = R(k), the distance between x and its furthest k-NN:

$$\hat m_R(x) = \frac{\sum_{i=1}^n K_R(x - X_i)\, Y_i}{\sum_{i=1}^n K_R(x - X_i)}. \qquad (18)$$

Note that in Equation 18, for this specific kernel, the denominator (divided by n) is equal to $k/(2nR)$, the k-NN density estimate of f(x). The formula in Equation 18 provides sensible estimators for arbitrary kernels. The bias and variance of this more general k-NN estimator is given in a theorem by Mack (1981).

Theorem 3

Let the conditions of Theorem 2 hold, except that $k \to \infty$, $k/n \to 0$ and $\lim k^5/n^4 < \infty$ as $n \to \infty$. Then

$$\sqrt{k}\, [\hat m_k(x) - m(x) - (k/n)^2 B_{kNN}(x)] \Rightarrow N(0, V_{kNN}(x)),$$

where

$$B_{kNN}(x) = \frac{\mu_2(K)}{8 f^2(x)} \left[ m''(x) + 2 m'(x) \frac{f'(x)}{f(x)} \right]; \qquad V_{kNN}(x) = 2 \sigma^2(x)\, \nu_2(K).$$

In contrast to kernel smoothing, the variance of the k-NN regression smoother does not depend on f, the density of X. This makes sense since the k-NN estimator always averages over exactly k observations, independently of the distribution of the X-variables. The bias constant $B_{kNN}(x)$ is also different from the one for kernel estimators given in Theorem 2. An approximate identity between k-NN and kernel smoothers can be obtained by setting

$$k = 2 n h f(x), \qquad (19)$$

or equivalently $h = k/[2 n f(x)]$. For this choice of k or h respectively, the asymptotic mean squared error formulas of Theorem 2 and Theorem 3 are identical.
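A sketch of the ordinary k-NN estimate (17) is immediate (Python/NumPy, our own naming; for scalar X, with ties broken arbitrarily by the sort):

```python
import numpy as np

def knn_regression(grid, X, Y, k):
    # Equation (17): average Y over the k nearest neighbors of each grid point
    d = np.abs(grid[:, None] - X[None, :])
    nearest = np.argsort(d, axis=1)[:, :k]
    return Y[nearest].mean(axis=1)
```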

3.2.2. Symmetrized k-NN estimators

A computationally useful modification of $\hat m_k$ is to restrict the k nearest neighbors always to symmetric neighborhoods, i.e., one takes k/2 neighbors to the left and k/2 neighbors to the right. In this case, weight-updating formulas can be given, see Härdle (1990, Section 3.2). The bias formulas are slightly different, see Härdle and Carroll (1990), but Equation 19 remains true.

3.3. Local polynomial estimators

The Nadaraya–Watson estimator can be regarded as the solution of the minimization problem

$$\hat m_h(x) = \arg\min_\theta \sum_{i=1}^n K_h(x - X_i)\, (Y_i - \theta)^2. \qquad (20)$$

This motivates the local polynomial class of estimators. Let $\hat\theta_0, \ldots, \hat\theta_p$ minimize

$$\sum_{i=1}^n K_h(x - X_i) \left[ Y_i - \theta_0 - \theta_1 (X_i - x) - \cdots - \frac{\theta_p}{p!} (X_i - x)^p \right]^2. \qquad (21)$$

Then $\hat\theta_0$ serves as an estimator of m(x), while $\hat\theta_j$ estimates the jth derivative of m. Clearly, $\hat\theta_0$ is linear in Y. A variation on these estimators called LOWESS was first considered in Cleveland (1979), who employed a nearest neighbor window. Fan (1992) establishes an asymptotic approximation for the case where p = 1, which he calls the local linear estimator $\hat m_{LL}(x)$.

Theorem 4

Let the conditions of Theorem 2 hold. Then

$$\sqrt{nh}\, [\hat m_{LL}(x) - m(x) - h^2 B_{LL}(x)] \Rightarrow N(0, V_{NW}(x)),$$

where

$$B_{LL}(x) = \tfrac12 \mu_2(K)\, m''(x).$$

The local linear estimator is unbiased when m is linear, while the Nadaraya–Watson estimator may be biased, depending on the marginal density of the design. We note here that fitting higher order polynomials can result in bias reduction, see Fan and Gijbels (1992) and Ruppert and Wand (1992) – who also extend the analysis to multidimensional explanatory variables.
The principle underlying the local polynomial estimator can be generalized in a number of ways. Tibshirani (1984) introduced the local likelihood procedure in which an arbitrary parametric regression function $g(x; \theta)$ substitutes for the polynomial in Equation 21. Fan, Heckman and Wand (1992) developed a theory for a nonparametric estimator in a GLIM (limited dependent variable) model in which, for example, a probit likelihood function replaces the polynomial in Equation 21. An advantage of this procedure is that low bias results when the parametric model is true (Linton and Nielsen 1993).
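The weighted least squares problem (21) can be solved pointwise with a small linear system. A sketch for p = 1 (Python/NumPy, our own naming) follows; $\hat\theta_0$ is returned as the estimate of m(x), and $\hat\theta_1$ would estimate m'(x).

```python
import numpy as np

def local_linear(grid, X, Y, h):
    # Equation (21) with p = 1: weighted least squares at each point x
    def quartic(u):
        return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)
    out = np.empty(grid.size)
    for j, x in enumerate(grid):
        w = quartic((X - x) / h)                 # kernel weights K_h(x - X_i)
        Z = np.column_stack([np.ones_like(X), X - x])
        A = Z.T @ (Z * w[:, None])               # Z'WZ
        b = Z.T @ (w * Y)                        # Z'WY
        out[j] = np.linalg.lstsq(A, b, rcond=None)[0][0]   # theta_0
    return out
```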

3.4. Spline estimators

For any estimate $\hat m$ of m, the residual sum of squares (RSS) is defined as $\sum_{i=1}^n [Y_i - \hat m(X_i)]^2$, which is a widely used criterion, in other contexts, for generating estimators of regression functions. However, the RSS is minimized by $\hat m$ interpolating the data, assuming no ties in the X's. To avoid this problem it is necessary to add a stabilizer. Most work is based on the stabilizer $\Omega(\hat m) = \int [\hat m''(u)]^2\,du$, although see Ansley et al. (1993) and Koenker et al. (1993) for alternatives. The cubic spline estimator $\hat m_\lambda$ is the (unique) minimizer of

$$R_\lambda(\hat m) = \sum_{i=1}^n [Y_i - \hat m(X_i)]^2 + \lambda \int [\hat m''(u)]^2\,du. \qquad (22)$$

The spline $\hat m_\lambda$ has the following properties. It is a cubic polynomial between two successive X-values; at the observation points, $\hat m_\lambda(\cdot)$ and its first two derivatives are continuous; at the boundary of the observation interval the spline is linear. This characterization of the solution to Equation 22 allows the integral term on the right hand side to be replaced by a quadratic form, see Eubank (1988) and Wahba (1990), and computation of the estimator proceeds by standard, although computationally intensive, matrix techniques.

The smoothing parameter λ controls the degree of smoothness of the estimator $\hat m_\lambda$. As $\lambda \to 0$, $\hat m_\lambda$ interpolates the observations, while if $\lambda \to \infty$, $\hat m_\lambda$ tends to a least squares regression line. Although $\hat m_\lambda$ is linear in the Y data, see Härdle (1990, pp. 58–59), its dependency on the design and on the smoothing parameter is rather complicated. This has resulted in rather less treatment of the statistical properties of these estimators, except in rather simple settings, although see Wahba (1990) – in fact, the extension to multivariate design is not straightforward. However, splines are asymptotically equivalent to kernel smoothers, as Silverman (1984) showed. The equivalent kernel is

$$K(u) = \frac12 \exp\left( -\frac{|u|}{\sqrt{2}} \right) \sin\left( \frac{|u|}{\sqrt{2}} + \frac{\pi}{4} \right), \qquad (23)$$

which is of fourth order, since its first three moments are zero, while the equivalent bandwidth $h = h(\lambda; X_i)$ is

$$h(\lambda; X_i) = \lambda^{1/4} n^{-1/4} f(X_i)^{-1/4}. \qquad (24)$$

One advantage of spline estimators over kernels is that global inequality and equality constraints can be imposed more conveniently. For example, it may be desirable to restrict the smooth to pass through a particular point – see Jones (1985). Silverman (1985) discusses a Bayesian interpretation of the spline procedure. However, from Section 2.2 we conclude that this interpretation can also be given to kernel estimators.

3.5. Series estimators

Series estimators have received considerable attention in the econometrics literature, following Elbadawi et al. (1983). This theory is very much tied to the structure of Hilbert space. Suppose that m has an expansion for all x:

$$m(x) = \sum_{j=0}^{\infty} \beta_j \varphi_j(x), \qquad (25)$$

in terms of the orthogonal basis functions $\{\varphi_j\}_{j=0}^\infty$ and their coefficients $\{\beta_j\}_{j=0}^\infty$. Suitable basis systems include the Legendre polynomials described in Härdle (1990) and the Fourier series used in Gallant and Souza (1991).

A simple method of estimating m(x) involves firstly selecting a basis system and a truncation sequence t(n), where t(n) is an integer less than n, and then regressing $Y_i$ on $\varphi_{ti} = (\varphi_0(X_i), \ldots, \varphi_t(X_i))^T$. Let $\{\hat\beta_j\}_{j=0}^{t(n)}$ be the least squares parameter estimates; then

$$\hat m_{t(n)}(x) = \sum_{j=0}^{t(n)} \hat\beta_j \varphi_j(x) = \sum_{i=1}^n W_{ni}(x)\, Y_i, \qquad (26)$$

where $W_n(x) = (W_{n1}, \ldots, W_{nn})^T$, with

$$W_{ni}(x) = \varphi_t(x)^T \left( \sum_{i=1}^n \varphi_{ti} \varphi_{ti}^T \right)^{-1} \varphi_{ti}, \qquad (27)$$

where $\varphi_t(x) = (\varphi_0(x), \ldots, \varphi_t(x))^T$.


These estimators are typically very easy to compute. In addition, the extension to additive structures and semiparametric models is convenient, see Andrews and Whang (1990) and Andrews (1991). Finally, provided t(n) grows at a sufficiently fast rate, the optimal (given the smoothness of m) rate of convergence can be established – see Stone (1982) – while fixed window kernels achieve at best a rate of convergence (of MSE) of $n^{-4/5}$. However, the same effect can be achieved by using a kernel estimator where the order of the kernel changes with n in such a way as to produce bias reduction of the desired degree, see Müller (1987). In any case, the evidence of Marron and Wand (1992) cautions against the application of bias reduction techniques unless quite large sample sizes are available. Finally, a major disadvantage of the series method is that there is relatively little theory about how to select the basis system and the smoothing parameter t(n).
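Computationally, (26) is just a least squares regression on the basis functions. The following sketch (Python/NumPy, our own naming) uses plain polynomial powers in place of the orthogonal Legendre system mentioned in the text; the fitted values are identical, since the two bases span the same space.

```python
import numpy as np

def series_regression(grid, X, Y, t):
    # Equation (26): regress Y on the first t+1 basis functions
    Phi = np.vander(X, t + 1, increasing=True)     # Phi[i, j] = X_i ** j
    beta = np.linalg.lstsq(Phi, Y, rcond=None)[0]  # least squares coefficients
    return np.vander(grid, t + 1, increasing=True) @ beta
```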

3.6. Kernels, k-NN, splines and series

Splines and series are both global methods in the sense that they try to
approximate the whole curve at once, while kernel and nearest neighbor methods
work separately on each estimation point. Nevertheless, when X is uniformly
distributed, kernels and nearest neighbor estimators of m(x) are identical, while
spline estimators are roughly equivalent to a kernel estimator of order 4. Only when
the design is not equispaced, do substantial differences appear.

We apply kernel, k-NN, orthogonal series (we used the Legendre system of
orthogonal polynomials), and splines to the car data set (Table 7, pp 352-355 in
Chambers et al. (1983)).
In each plot, we give a scatterplot of the data x = price in dollars of car (in 1979)
versus y = miles per US gallon of that car, and one of the nonparametric estimators.
The sample size is n = 74 observations. In Figure 5a we have plotted together with
the raw data a kernel smoother #r,, for which a quartic kernel was used with h = 2000.
Very similar to this is the spline smoother shown in Figure 5b (2 = 109). In this
example, the Xs are not too far from uniform. The effective local bandwidth for
the spline smoother from Equation 24 is a function of f-14 only, which does not
vary that much. Ofcourse at the right end with the isolated observation at x = 15906
and y = 21 (Cadillac Seville) both kernel and splines must have difficulties. Both
work essentially with a window of fixed width. The series estimator (Figure 5d) with
t = 8 is quite close to the spline estimator.
In contrast to these regression estimators stands the k-NN smoother (k = 11) in
Figure 5c. We used the symmetrized k-NN estimator for this plot. By formula (19)
the dependence of k on f is much stronger than for the spline. At the right end of
the price scale no local effect from the outlier described above is visible. By contrast
in the main body of the data, where the density is high, this k-NN smoother tends to be wiggly.

3.7. Confidence intervals

The asymptotic distribution results contained in Theorems 2–4 can be used to calculate pointwise confidence intervals for the estimators described above. In practice, it is usual to ignore the bias term, since this is rather complicated, depending on higher derivatives of the regression function and perhaps on the derivatives of the density of X. This approach can be justified when a bandwidth is chosen that makes the bias relatively small.

In this section we restrict our attention to the Nadaraya–Watson regression estimator. In this case, we suppose that $h\, n^{1/5} \to 0$, which ensures that the bias term does not appear in the limiting distribution. Let

$$CLO(x) = \hat m_h(x) - c_\alpha \hat s, \qquad CUP(x) = \hat m_h(x) + c_\alpha \hat s,$$

where $\Phi(c_\alpha) = 1 - \alpha/2$ with $\Phi(\cdot)$ the standard normal distribution function, while $\hat s^2$ is a consistent estimate of the asymptotic variance of $\hat m_h(x)$. Suitable estimators include

(1) $\hat s_1^2 = (nh)^{-1}\, \nu_2(K)\, \hat\sigma_h^2(x) / \hat f_h(x);$

(2) $\hat s_2^2 = \hat\sigma_h^2(x) \sum_{i=1}^n W_{ni}^2(x);$
[Figure 5(a–d). Scatterplot of car price (x) and miles per gallon (y) with four different smooth approximations (n = 74; h = 2000, k = 11, λ = 109, t = 8): (a) kernel estimate, (b) spline estimate, (c) k-NN estimate, (d) orthogonal series estimate. Standard deviation of car price is 2918.]

(3) $\hat s_3^2 = \sum_{i=1}^n W_{ni}^2(x)\, \hat\varepsilon_i^2,$

where $\hat f_h(x)$ is defined in Equation 2, $\hat\varepsilon_i = Y_i - \hat m_h(X_i)$ are the nonparametric residuals, and $\hat\sigma_h^2(x) = \sum_{i=1}^n W_{ni}(x)\, \hat\varepsilon_i^2$ is a nonparametric estimator of $\sigma^2(x)$ – see Robinson (1987) and Hildenbrand and Kneip (1992) for a discussion of alternative conditional variance estimators and their application. With the above definitions,

$$P\{m(x) \in [CLO(x), CUP(x)]\} \to 1 - \alpha. \qquad (28)$$

These confidence intervals are frequently employed in econometric applications, see for example Bierens and Pott-Buter (1990), Banks et al. (1993) and Gozalo (1989). This approach is relevant if the behavior of the regression function at a single point is under consideration. Usually, however, its behavior over an interval is under study. In this case, pointwise confidence intervals do not take account of the joint nature of the implicit null hypothesis.
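The pointwise bands are easy to compute once the weights $W_{ni}(x)$ are available. A sketch using variance estimator (3) follows (Python/NumPy, our own naming; $c_\alpha = 1.96$ is assumed, corresponding to α = 0.05, and the bias term is ignored as discussed above).

```python
import numpy as np

def pointwise_band(grid, X, Y, h, c_alpha=1.96):
    # Bands m_hat(x) +/- c_alpha * s with s^2 = sum_i W_ni(x)^2 e_i^2
    def quartic(u):
        return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)
    Kg = quartic((grid[:, None] - X[None, :]) / h)
    W = Kg / Kg.sum(axis=1, keepdims=True)         # NW weights W_ni(x)
    m_hat = W @ Y
    Kx = quartic((X[:, None] - X[None, :]) / h)
    resid = Y - (Kx / Kx.sum(axis=1, keepdims=True)) @ Y   # e_i
    s = np.sqrt((W ** 2) @ resid ** 2)
    return m_hat - c_alpha * s, m_hat + c_alpha * s
```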
We now consider uniform confidence bands for the function m over some compact subset $\mathcal{X}$ of the support of X. Without loss of generality we take $\mathcal{X} = [0, 1]$. We require functions $CLO^*(x)$ and $CUP^*(x)$ such that

$$P\{m(x) \in [CLO^*(x), CUP^*(x)]\ \ \forall x \in \mathcal{X}\} \to 1 - \alpha. \qquad (29)$$

Bands of the form $CLO^*(x) = \hat m_h(x) - c_\alpha^* \hat s$ and $CUP^*(x) = \hat m_h(x) + c_\alpha^* \hat s$, where the critical value $c_\alpha^*$ grows like $\sqrt{2 \log n}$ and is determined through the extreme-value relation $\exp[-2\exp(-c_\alpha^*)] = 1 - \alpha$, satisfy (29) under the conditions given in Härdle (1990, Theorem 4.3.1). See also Prakasa Rao (1983, Theorem 2.1.17) for a treatment of the same problem for density estimators.

In Figure 6 we show the uniform confidence bands for the income data of Figure 2. Hall (1993) advocates using the bootstrap to construct uniform confidence bands. He argues that the error in (29) is $O(1/\log n)$, which can be improved to $O((\log n)^3 / nh)$ by the judicious use of this resampling method in the random design case. See also Hall (1992) and Härdle (1990) for further applications of the bootstrap in nonparametric statistics.

[Figure 6. Uniform confidence bands for the income data. Food versus net income. Calculated using XploRe macro reguncb.]

3.8. Regression derivatives and quantiles

There are a number of other functionals of the conditional distribution that are of interest for applications. The first derivative of the regression function measures the strength of the relationship between Y and X, while second derivatives can quantify the concavity or convexity of the regression function. Let $\hat m(x)$ be any estimator of m(x) that has at least r non-zero derivatives at x. Then $m^{(r)}(x)$ can be estimated by the rth derivative of $\hat m(x)$, denoted $\hat m^{(r)}(x)$. Müller (1988) describes kernel estimators of $m^{(r)}(x)$ based on the convolution method of Gasser and Müller (1984); their method gives simpler bias expressions than the Nadaraya–Watson estimator. An alternative technique is to fit a local polynomial (of order r) estimator, and take the coefficient on the rth term in (21), see Ruppert and Wand (1992). In each case, the resulting estimator is linear in $Y_i$, with bias of order $h^2$ and variance of order $n^{-1} h^{-(2r+1)}$.

Quantiles can also be useful. The median is an alternative – and robust – measure of location, while other quantiles can help to describe the spread of the conditional distribution. Let $f_{Y|X=x}(y)$ denote the conditional density of Y given X = x, and let $c_\alpha(x)$ be the αth conditional quantile, i.e.

$$\alpha = \int_{-\infty}^{c_\alpha(x)} f_{Y|X=x}(y)\,dy, \qquad (30)$$

where for simplicity we assume this is unique. There are several methods for estimating $c_\alpha(x)$.
estimating c,(x).
Firstly, let $Z_j = [W_{nj}(x), Y_j]$, where the $W_{nj}(x)$ are kernel or nearest neighbor weights. We first sort $\{Z_j\}_{j=1}^n$ on the variable $Y_j$, and find the largest index J such that

$$\sum_{j=1}^{J} W_{nj}(x) \le \alpha.$$

Then let

$$\hat c_\alpha(x) = Y_J. \qquad (31)$$

Stute (1986) shows that $\hat c_\alpha(x)$ consistently estimates $c_\alpha(x)$, with the same convergence rates as in ordinary nonparametric regression, see also Bhattacharya and Gangopadhyay (1990). When K is the uniform kernel and $\alpha = \frac12$, this procedure corresponds to the running median discussed in Härdle (1990, pp. 69–71). A smoother estimator is obtained by also smoothing in the y direction. Provided K has at least r non-zero derivatives, the rth derivative of $c_\alpha(x)$ can be estimated by the rth derivative of $\hat c_\alpha(x)$. See Anand et al. (1993) and Robb et al. (1992) for applications.
An alternative method of estimating conditional quantiles is through minimizing
an appropriate loss function. This idea originated in Koenker and Bassett (1978).
In particular,

$$\tilde c_\alpha(x) = \arg\min_\theta \sum_{i=1}^n K_h(x - X_i)\, \rho_\alpha(Y_i - \theta), \qquad (32)$$

where $\rho_\alpha(y) = |y| + (2\alpha - 1)y$, consistently estimates $c_\alpha(x)$. Computation of the estimator can be carried out by linear programming techniques. Chaudhuri (1991) provides asymptotic theory for this estimator in a general multidimensional context and for estimators of the derivatives of $c_\alpha(x)$.

In neither (31) nor (32) is the estimator linear in $Y_i$, although the asymptotic distributions of the estimators are determined by a linear approximation to them, i.e. the estimators are asymptotically normal.

4. Optimality and bandwidth choice

4.1. Optimality

Let Q(h) be a performance criterion. We say that a bandwidth sequence $h^*$ is asymptotically optimal if

$$\frac{Q(h^*)}{\inf_{h \in H_n} Q(h)} \to 1 \qquad (33)$$

as $n \to \infty$, where $H_n$ is the range of permissible bandwidths. There are a number of alternative optimality criteria in use. Firstly, we may be interested in the quadratic loss of the estimator at a single point x, which is measured by the mean squared error, $MSE\{\hat m_h(x)\}$. Secondly, we may be only concerned with a global measure of performance. In this case, we may consider the integrated mean squared error, $IMSE = \int MSE[\hat m_h(x)]\, \pi(x) f(x)\,dx$ for some weighting function $\pi(\cdot)$. An alternative is the in-sample version of this, the averaged squared error

$$d_A(h) = n^{-1} \sum_{j=1}^n [\hat m_h(X_j) - m(X_j)]^2\, \pi(X_j). \qquad (34)$$

The purpose of $\pi(\cdot)$ may be to downweight observations in the tails of X's distribution, and thereby to eliminate boundary effects – see Müller (1988) for a discussion. When $h = O(n^{-1/5})$, the squared bias and the variance of the kernel smoother have the same magnitude; this is the optimal order of magnitude for h with respect to all three criteria, and the corresponding performance measures are all $O(n^{-4/5})$ in this case.
Now let $h = \gamma n^{-1/5}$, where γ is a constant. The optimal constant balances the contributions to MSE from the squared bias and the variance respectively. From Theorem 2 we obtain an approximate mean squared error expansion,

$$MSE[\hat m_h(x)] \approx \frac{1}{nh}\, V(x) + h^4 B^2(x), \qquad (35)$$

and the bandwidth minimizing Equation 35 is

$$h_0(x) = \left[ \frac{V(x)}{4 B^2(x)} \right]^{1/5} n^{-1/5}. \qquad (36)$$

Similarly, the optimal bandwidth with respect to IMSE is the same as in (36) with $V = \int V(x)\, \pi(x) f(x)\,dx$ and $B^2 = \int B^2(x)\, \pi(x) f(x)\,dx$ replacing V(x) and B(x). Unfortunately, in either case the optimal bandwidth depends on the unknown regression function and design density. We discuss in Section 4.2 below how one can obtain empirical versions of (36).

The optimal local bandwidths can vary considerably with x, a point which is best illustrated for density estimation. Suppose that the density is standard normal and a standard normal kernel is used. In this case, as $x \to \infty$, $h_0(x) \to \infty$: when data is sparse a wider window is called for. Also at $x = \pm 1$, $h_0(x) = \infty$, which reflects the fact that $f'' = 0$ at these points. Elsewhere, substantially less smoothing is called for: at $\pm 2.236$, $h_0(x) = 0.884\, n^{-1/5}$ (which is the minimum value of $h_0(x)$). The optimal global bandwidth is $1.06\, n^{-1/5}$.

Although allowing the bandwidth to vary with x dominates the strategy of choosing a single bandwidth throughout, in practice this requires considerably more computation, and is rarely used in applications.

By substituting $h_0$ in (35), we find that the optimal MSE and IMSE depend on

Table 2
Kernel exchange rate.

To \ From      Uniform   Triangle   Epanechnikov   Quartic   Gaussian
Uniform         1.000      0.715        0.786        0.663     1.740
Triangle        1.398      1.000        1.099        0.927     2.432
Epanechnikov    1.272      0.910        1.000        0.844     2.214
Quartic         1.507      1.078        1.185        1.000     2.623
Gaussian        0.575      0.411        0.452        0.381     1.000

K only through

$$T(K) = \nu_2^{4/5}(K)\, \mu_2^{2/5}(K). \qquad (37)$$

This functional can be minimized with respect to K using the calculus of variations, although it is necessary to first adopt a scale standardization of K – for details, see Gasser et al. (1985). A kernel is said to be optimal if it minimizes (37). The optimal kernel of order 2 is the Epanechnikov kernel given in Table 1. The third column of this table shows the loss in efficiency of other kernels in relation to this optimal one. Over a wide class of kernel estimators, the loss in efficiency is not that drastic; more important is the choice of h than the choice of K.

Any kernel can be rescaled as $K^*(\cdot) = s^{-1} K(\cdot/s)$, which of course changes the value of the kernel constants and hence $h_0$. In particular,

$$\nu_2(K^*) = s^{-1} \nu_2(K); \qquad \mu_2(K^*) = s^2 \mu_2(K).$$

We can uncouple the scaling effect by using for each kernel K that $K^*$ with scale

$$s^* = \left[ \frac{\nu_2(K)}{\mu_2^2(K)} \right]^{1/5},$$

for which $\mu_2^2(K^*) = \nu_2(K^*)$. Now suppose we wish to compare two smooths with kernels $K_j$ and bandwidths $h_j$ respectively. This can be done by transforming both to their canonical scale, see Marron and Nolan (1989), and then comparing the rescaled bandwidths. In Table 2 we give the exchange rate between various commonly used kernels. For example, the bandwidth of 0.2 used with a quartic kernel in Figure 2 translates into a bandwidth of 0.133 for a uniform kernel and 0.076 for a Gaussian kernel.
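The exchange rates of Table 2 can be reproduced from the kernel constants alone. A sketch (Python/NumPy, our own naming; the $(\mu_2, \nu_2)$ pairs are computed from the kernels of Table 1):

```python
import numpy as np

# (mu_2, nu_2) = (int u^2 K(u) du, int K(u)^2 du) for the kernels of Table 1
KERNELS = {
    "uniform":      (1 / 3, 1 / 2),
    "triangle":     (1 / 6, 2 / 3),
    "epanechnikov": (1 / 5, 3 / 5),
    "quartic":      (1 / 7, 5 / 7),
    "gaussian":     (1.0, 1 / (2 * np.sqrt(np.pi))),
}

def canonical_scale(name):
    mu2, nu2 = KERNELS[name]
    return (nu2 / mu2 ** 2) ** 0.2               # s* = [nu_2 / mu_2^2]^{1/5}

def exchange(h, from_kernel, to_kernel):
    # translate a bandwidth between kernels via their canonical scales
    return h * canonical_scale(to_kernel) / canonical_scale(from_kernel)

print(round(exchange(0.2, "quartic", "gaussian"), 3))   # 0.076, as in the text
print(round(exchange(0.2, "quartic", "uniform"), 3))    # 0.133
```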

4.2. Choice of smoothing parameter

For each nonparametric regression method, one has to choose how much to smooth
for the given dataset. In Section 3 we saw that k-NN, series, and spline estimation
are asymptotically equivalent to the kernel method, so we describe here only the
selection of bandwidth h for kernel regression smoothing.

4.2.1. Plug-in

The asymptotic approximation given in (36) can be used to determine an optimal local bandwidth. We can calculate an estimated optimal bandwidth $\hat h_{PI}$ in which consistent estimators $\hat m''_{h^*}(x)$, $\hat\sigma^2_{h^*}(x)$, $\hat f_{h^*}(x)$ and $\hat f'_{h^*}(x)$ replace the unknown functions. We then use $\hat m_{\hat h_{PI}}(x)$ to estimate m(x). Likewise, if a globally optimal bandwidth is required, one must substitute estimators of the appropriate average functionals. This procedure is generally fast and simple to implement. Its properties are examined in Härdle et al. (1992a).

However, this method fails to provide pointwise optimal bandwidths when m(x) possesses less than two continuous derivatives. Finally, a major disadvantage of this procedure is that a preliminary bandwidth $h^*$ must be chosen for estimation of $m''(x)$ and the other quantities.

4.2.2. Crossvalidation

Crossvalidation is a convenient method of global bandwidth choice for many problems, and relies on the well established principle of out-of-sample predictive validation.

Suppose that optimality with respect to $d_A(h)$ is the aim. We must first replace $d_A(h)$ by a computable approximation to it. A naive estimate would be to just replace the unknown values $m(X_j)$ by the observations $Y_j$:

$$p(h) = n^{-1} \sum_{j=1}^n [\hat m_h(X_j) - Y_j]^2\, \pi(X_j).$$

This is called the resubstitution estimate. However, this quantity makes use of each observation twice – the response variable $Y_j$ is used in $\hat m_h(X_j)$ to predict itself. Therefore, p(h) can be made arbitrarily small by taking $h \to 0$ (when there are no tied X observations). This fact can be expressed via asymptotic expressions for the moments of p. Conditional on $X_1, \ldots, X_n$, we have

$$E[p(h)] = E[d_A(h)] + \frac{1}{n} \sum_{i=1}^n \sigma^2(X_i)\, \pi(X_i) - \frac{2}{n} \sum_{i=1}^n W_{ni}(X_i)\, \sigma^2(X_i)\, \pi(X_i), \qquad (38)$$

and the third term is of the same order of magnitude as $E[d_A(h)]$, but with a negative sign. Therefore, $d_A$ is wrongly underestimated and the selected bandwidth will be downward biased.

The simplest way to avoid this problem is to remove the jth observation:

$$\hat m_{h,j}(X_j) = \frac{\sum_{i \ne j} K_h(X_j - X_i)\, Y_i}{\sum_{i \ne j} K_h(X_j - X_i)}. \qquad (39)$$

This leave-one-out estimate is used to form the so-called crossvalidation function

$$CV(h) = n^{-1} \sum_{j=1}^n [Y_j - \hat m_{h,j}(X_j)]^2\, \pi(X_j), \qquad (40)$$

which is to be minimized with respect to h. For technical reasons, the minimum must be taken only over a restricted set of bandwidths such as $H_n = [n^{-(1/5)-\zeta}, n^{-(1/5)+\zeta}]$, for some $\zeta > 0$.

Theorem 5

Assume that the conditions given in Härdle (1990, Theorem 5.1.1) hold. Then the bandwidth selection rule "Choose $\hat h$ to minimize CV(h)" is asymptotically optimal with respect to $d_A(h)$ and IMSE.

Proof

See Härdle and Marron (1985).

The conditions include the restriction that f > 0 on the compact support of π, moment conditions on ε, and a Lipschitz condition on K. However, unlike the plug-in procedure, m and f need not be differentiable (a Lipschitz condition is required, however).
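Theorem 5 licenses a simple grid search over CV(h). A sketch of (39)–(40) follows (Python/NumPy, our own naming; π ≡ 1 and the quartic kernel are assumed).

```python
import numpy as np

def cv_score(h, X, Y):
    # CV(h) of (40) with pi = 1, using the leave-one-out smoother (39)
    def quartic(u):
        return np.where(np.abs(u) <= 1, 15.0 / 16.0 * (1.0 - u ** 2) ** 2, 0.0)
    K = quartic((X[:, None] - X[None, :]) / h)
    np.fill_diagonal(K, 0.0)                    # drop the jth observation
    den = K.sum(axis=1)
    m_loo = (K @ Y) / np.where(den > 0.0, den, np.nan)
    return np.nanmean((Y - m_loo) ** 2)

def choose_bandwidth(X, Y, candidates):
    # minimize CV over a restricted grid of bandwidths
    scores = [cv_score(h, X, Y) for h in candidates]
    return candidates[int(np.nanargmin(scores))]
```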

4.2.3. Other data driven selectors

There are a number of different automatic bandwidth selectors that produce asymptotically optimal kernel smoothers. They are based on various ways of correcting the downward bias of the resubstitution estimate of $d_A(h)$. The function p(h) is multiplied by a correction factor that in a sense penalizes h's which are too small. The general form of this selector is

$$G(h) = n^{-1} \sum_{j=1}^n [Y_j - \hat m_h(X_j)]^2\, \Xi(W_{nj}(X_j))\, \pi(X_j),$$

where $\Xi$ is the correction function with first-order Taylor expansion

$$\Xi(u) = 1 + 2u + O(u^2), \qquad (41)$$

as $u \to 0$. Some well known examples are:

(i) Generalized crossvalidation (Craven and Wahba 1979; Li 1985), $\Xi_{GCV}(u) = (1 - u)^{-2}$;

(ii) Akaike's information criterion (Akaike 1970), $\Xi_{AIC}(u) = \exp(2u)$;

(iii) Finite prediction error (Akaike 1974), $\Xi_{FPE}(u) = (1 + u)/(1 - u)$;

(iv) Shibata's (1981) model selector, $\Xi_S(u) = 1 + 2u$;

(v) Rice's (1984) bandwidth selector, $\Xi_R(u) = (1 - 2u)^{-1}$.

Härdle et al. (1988) show that the general criterion G(h) works in producing asymptotically optimal bandwidth selection, although they present their results for the equispaced design case only.
The method of crossvalidation was applied to the car data set to find the optimal
smoothing parameter h. A plot of the crossvalidation function is given in Figure 7.

[Figure 7. The crossvalidation function CV(h) for the car data. Quartic kernel. Computation made with XploRe macro regcvl.]

The computation is for the quartic kernel using the WARPing method, see Härdle and Scott (1992). The minimum $\hat h = \arg\min CV(h)$ is at 1922, which shows that in Figure 5a we used slightly too large a bandwidth.

Härdle et al. (1988) investigate how far the crossvalidation optimal $\hat h$ is from the true optimum $\hat h_0$ (that minimizes $d_A(h)$). They show that for each optimization method,

$$n^{1/10} \left( \frac{\hat h - \hat h_0}{\hat h_0} \right) \Rightarrow N(0, \sigma^2), \qquad (42)$$

$$n\, [d_A(\hat h) - d_A(\hat h_0)] \Rightarrow C_1 \chi_1^2, \qquad (43)$$

where $\sigma^2$ and $C_1$ are both positive. The above methods are all asymptotically equivalent at this higher order of approximation. Another interesting result is that the estimated $\hat h$ and optimum $\hat h_0$ are actually negatively correlated! Hall and Johnstone (1992) show how to correct for this effect in density estimation and in regression with uniform X's. It is still an open question how to improve this for the general regression setting we are considering here.

There has been considerable research into finding improved methods of bandwidth selection that give faster rates of convergence in (42). Most of this work is in density estimation – see the recent review of Jones et al. (1992) for references. In this case, various $\sqrt{n}$ consistent bandwidth selectors have been suggested. The finite sample properties of these procedures are not well established, although Park and Turlach (1992) contains some preliminary simulation evidence. Härdle et al. (1992a) construct a $\sqrt{n}$ consistent bandwidth selector for regression based on a bias reduction technique.

5. Application to time series

In the theoretical development described up to this point, we have restricted our attention to independent sampling. However, smoothing methods can also be applied to dependent data. Considerable resources are devoted to providing forecasts of macroeconomic entities such as GNP, unemployment and inflation, while the benefits of predicting asset prices are obvious. In many cases linear models have been the basis of econometric prediction, while more recently nonlinear models such as ARCH have become popular. Nonparametric methods can also be applied in this context, and provide a model free basis for predicting future outcomes. We focus on the issue of functional form, rather than that of correlation structure – this latter issue is treated, from a nonparametric point of view, in Brillinger (1980), see also Phillips (1991) and Robinson (1991).

Suppose that we observe the vector time series $\{Z_i\}_{i=1}^n$, where $Z_i = (Y_i, X_i)$, and $X_i$ is strictly exogenous in the sense of Engle et al. (1983). It is convenient to assume

that the process is stationary and mixing, as defined in Gallant and White (1988), which includes most linear processes, for example, although extensions to certain types of nonstationarity can also be permitted. We consider two distinct problems. Firstly, we want to predict $Y_i$ from its own past, which we call autoregression. Secondly, we want to predict $Y_i$ from $X_i$. This problem we call regression with correlated errors.

5.1. Autoregression

For convenience we restrict our attention to the problem of predicting the scalar $Y_{i+k}$ given $Y_i$ for some k > 0. The best predictor is provided by the autoregression function

$$M_k(y) = E(Y_{i+k} | Y_i = y). \qquad (44)$$

More generally, one may wish to estimate the conditional variance of $Y_{i+k}$ from lagged values,

$$V_k(y) = \mathrm{Var}(Y_{i+k} | Y_i = y).$$

One can also estimate the predictive density $f_{Y_{i+k}|Y_i}$. These quantities can be estimated using any of the smoothing methods described in this chapter. See Robinson (1983) and Bierens (1987) for some theoretical results including convergence rates and asymptotic distributions.
gence rates and asymptotic distributions.
Diebold and Nason (1990), Meese and Rose (1991) and Mizrach (1992) estimate $M_k(\cdot)$ for use in predicting asset prices over short horizons. In each case, a locally weighted regression estimator was employed with a nearest neighbor type window, while the bandwidth was chosen subjectively (except in Mizrach (1992), where crossvalidation was used). Not surprisingly, their published results concluded that there was little gain in predictive accuracy over a simple random walk. Pagan and Hong (1991), Pagan and Schwert (1990) and Pagan and Ullah (1988) estimate $V_k(\cdot)$ in order to evaluate the risk premium of asset returns. They used a variety of nonparametric methods including Fourier series and kernels. Their focus was on estimation rather than prediction, and their procedures relied on some parametric estimation. See also Whistler (1988) and Gallant et al. (1991).
A scientific basis can also be found for choosing bandwidth in this sampling scheme. Härdle and Vieu (1991) showed that crossvalidation also works in the autoregression problem – choosing $\hat h = \arg\min CV(h)$ gives asymptotically optimal estimates.

To illustrate this result we simulated an autoregressive process $Y_i = M(Y_{i-1}) + \varepsilon_i$ with

$$M(y) = y \exp(-y^2), \qquad (45)$$


[Figure 8. The true autoregression function M(y) = y exp(−y²) for the simulated example (thin line) and the kernel smoother (thick line).]

where the innovations $\varepsilon_i$ were uniformly distributed over the interval (−1/2, 1/2). Such a process is α-mixing with geometrically decreasing α(n), as shown by Doukhan and Ghindes (1980) and Györfi et al. (1990, Section III.4.4). The sample size investigated was n = 100. The quartic kernel function in (3) was used. The minimum of CV(h) was $\hat h = 0.43$, while the minimum of $d_A(h)$ was at h = 0.52. The curve of $d_A(h)$ was very flat for this example, since there was very little bias present. In Figure 8 we compare the estimated curve with the autoregression function and find good agreement.
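The simulation is easy to reproduce in outline (Python/NumPy, our own naming; the choose_bandwidth helper from the crossvalidation sketch above is assumed to be in scope).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
Y = np.zeros(n)
for i in range(1, n):
    # Y_i = M(Y_{i-1}) + eps_i with M(y) = y * exp(-y^2), Equation (45)
    Y[i] = Y[i - 1] * np.exp(-Y[i - 1] ** 2) + rng.uniform(-0.5, 0.5)

lagged, current = Y[:-1], Y[1:]          # regress Y_i on Y_{i-1}
# any regression smoother above applies to (lagged, current); for example,
# h_cv = choose_bandwidth(lagged, current, np.linspace(0.2, 1.0, 17))
```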

5.2. Correlated errors

We now consider the regression model

$$Y_i = m(X_i) + \varepsilon_i,$$

where $X_i$ is fixed in repeated samples and the errors $\varepsilon_i$ satisfy $E(\varepsilon_i | X_i) = 0$, but are autocorrelated. The kernel estimator $\hat m_h(x)$ of m(x) is consistent under quite general conditions. In fact, its bias is the same as when the $\varepsilon_i$ are independent. However, the variance is generally affected by the dependency structure. Suppose that the error process is MA(1), i.e.

$$\varepsilon_i = u_i + \theta u_{i-1},$$

where the $u_i$ are iid with zero mean and variance $\sigma^2$. In this case,

$$\mathrm{Var}[\hat m_h(x)] = \sigma^2 \left[ (1 + \theta^2) \sum_{i=1}^n W_{ni}^2 + 2\theta \sum_{i=1}^{n-1} W_{ni} W_{ni+1} \right],$$

which is $O(n^{-1} h^{-1})$, but differs from Theorem 2. If the explanatory variable were time itself (i.e. $X_i = i/n$, $i = 1, \ldots, n$), then a further approximation is possible:

$$\mathrm{Var}[\hat m_h(x)] \approx \frac{\sigma^2}{nh}\, (1 + \theta^2 + 2\theta)\, \nu_2(K).$$

Hart and Wehrly (1986) develop MSE approximations in a regression model in which the error correlation is a general function $\rho(\cdot)$ of the time between observations.

Unfortunately, crossvalidation fails in this case. Suppose that the errors are AR(1) with autoregression parameter close to one. The effect on the crossvalidation technique described in Section 4 must be drastic. The error process stays a long time on one side of the mean curve. Therefore, the bandwidth selection procedure gives undersmoothed estimates, since it interprets the little bumps of the error process as part of the regression curve. An example is given in Härdle (1990, Figures 7.6 and 7.7).

The effect of correlation on the crossvalidation criterion may be mitigated by leaving out more than just one observation. For the MA(1) process, leaving out the 3 contiguous (in time) observations works. This leave-out-some technique is also sometimes appealing in an independent setting. See the discussion in Härdle et al. (1988) and Hart and Vieu (1990). It may also be possible to correct for this effect by whitening the residuals in (40), although this has yet to be shown.

6. Applications to semiparametric estimation

Semiparametric models offer a compromise between parametric modeling and the nonparametric approaches we have discussed. When data are high dimensional or if it is necessary to account for both functional form and correlation of a general nature, fully nonparametric methods may not perform well. In this case, semiparametric models may be preferred.

By a semiparametric model we mean that the density of the observable data, conditional on any ancillary information, is completely specified by a finite dimensional parameter θ and an unknown function $G(\cdot)$. The exhaustive monograph of Bickel et al. (1992) develops a comprehensive theory of inference for a large number of semiparametric models, although mostly within iid sampling. There are a number of reviews for econometricians including Robinson (1988b), Newey (1990) and Powell (this volume).
In many cases, θ is of primary interest. Andrews (1989) provides asymptotic theory for a general procedure designed to estimate θ when a preliminary estimate $\hat G$ of G is available. The method involves substituting $\hat G$ for G in an estimating equation derived, perhaps, from a likelihood function. Typically, the dependence of the estimated parameters $\hat\theta$ on the nonparametric estimators disappears asymptotically, and

$$\sqrt{n}(\hat\theta - \theta) \Rightarrow N(0, \Omega_\theta), \qquad (47)$$

where $\Omega_\theta > 0$.

Nevertheless, the small sample properties of $\hat\theta$ can depend quite closely on the way in which this preliminary step is carried out – see the Monte Carlo evidence contained in Engle and Gardiner (1976), Hsieh and Manski (1987), Stock (1989) and Delgado (1992). Some recent work has investigated analytically the small sample properties of semiparametric estimators. Carroll and Härdle (1989), Cavanagh (1989), Härdle et al. (1992b), Linton (1991, 1992, 1993) and Powell and Stoker (1991) develop asymptotic expansions of the form

$$MSE(\hat\theta) \approx n^{-1} \Omega_\theta + q_1(h; n) + q_2(h; n), \qquad (48)$$

where the correction terms $q_1$ and $q_2$ vanish as n increases, under restrictions on h(n). These expansions yield a formula for the optimal bandwidth similar to (36). An important finding is that different amounts of smoothing are required for $\hat\theta$ and for $\hat G$; in particular, it is often optimal to undersmooth $\hat G$ (by an order of magnitude) when the properties of $\hat\theta$ are at stake.

The MSE expansions can be used to define a plug-in method of bandwidth choice for $\hat\theta$ that is based on second order optimality considerations.

6.1. The partially linear model

Consider

Y_i = β^T X_i + φ(Z_i) + ε_i;  X_i = g(Z_i) + η_i,  i = 1,2,…,n, (49)

where φ(·) and g(·) are of unknown functional form, while E(ε_i|Z_i) = E(η_i|Z_i) = 0. If
an inappropriate parametric model is fit to φ(·), the resulting MLE of β may be

inconsistent. This necessitates using nonparametric methods that allow a more
general functional form, when it is needed. Engle et al. (1986) use this model to
estimate the effects of temperature on electricity demand, while Stock (1991) models
the effect of the proximity of toxic waste on house prices. In both cases, the effect
is highly nonlinear, and the large number of covariates makes a fully nonparametric
analysis infeasible. See also Olley and Pakes (1991). This specification also arises
from various sample selection models. See Ahn and Powell (1990) and Newey et al.
(1990).
Notice that

Y_i − E(Y_i|Z_i) = β^T[X_i − E(X_i|Z_i)] + ε_i.

Robinson (1988a) constructed a semiparametric estimator of β by replacing g(Z_i) =
E(X_i|Z_i) and m(Z_i) = E(Y_i|Z_i) by nonparametric kernel estimators ĝ_h(Z_i) and m̂_h(Z_i)
and then letting

β̂ = [Σ_{i=1}^n {X_i − ĝ_h(Z_i)}{X_i − ĝ_h(Z_i)}^T]^{−1} Σ_{i=1}^n {X_i − ĝ_h(Z_i)}[Y_i − m̂_h(Z_i)].

In fact, Robinson modified this estimator by trimming out observations for which
the marginal density of Z was small. Robinson's estimator satisfies (47), provided
the dimension of Z is not too high relative to the order of the kernel being used
(provided m and g are sufficiently smooth).
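
The following is a minimal sketch of this two-step estimator under simplifying assumptions (scalar Z, Gaussian kernel, no trimming); the function names are ours.

# Robinson-type two-step estimator for Y = beta'X + phi(Z) + eps.
import numpy as np

def kreg(W, Z, h):
    """Nadaraya-Watson regression of W on Z, evaluated at the sample points."""
    wts = np.exp(-0.5 * ((Z[:, None] - Z[None, :]) / h)**2)
    denom = wts.sum(axis=1)
    return (wts @ W) / (denom[:, None] if W.ndim > 1 else denom)

def robinson(Y, X, Z, h):
    Yres = Y - kreg(Y, Z, h)                      # Y_i - mhat_h(Z_i)
    Xres = X - kreg(X, Z, h)                      # X_i - ghat_h(Z_i)
    return np.linalg.solve(Xres.T @ Xres, Xres.T @ Yres)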
Linton (1992) establishes that the optimal bandwidth for β̂ is O(n^{-2/9}) when Z
is scalar, and the resulting correction to the (asymptotic) MSE of the standardized
estimator is O(n^{-7/9}).

6.2. Heteroskedastic nonlinear regression

Consider the following nonlinear regression model:

Y_i = τ(X_i; β) + ε_i,  i = 1,2,…,n, (50)

where τ(·; β) is known, while E(ε_i|X_i) = 0 and Var(ε_i|X_i) = σ²(X_i), where σ²(·) is of
unknown functional form. Efficient estimation of β can be carried out using the
pseudo-likelihood principle. Assuming that the ε_i are iid and normally distributed,
the sample log-likelihood function is proportional to

ℒ[β; σ²(·)] = Σ_{i=1}^n [Y_i − τ(X_i; β)]² σ(X_i)^{−2}, (51)

where σ²(·) is known. In the semiparametric situation we replace σ²(X_i) by a
nonparametric estimator σ̂²(X_i), and then let β̂ minimize ℒ[β; σ̂²(·)].

Carroll (1982) and Robinson (1987) examine the situation where τ(X; β) = β^T X,
in which case

β̂ = [Σ_{i=1}^n X_i X_i^T σ̂²(X_i)^{−1}]^{−1} Σ_{i=1}^n X_i Y_i σ̂²(X_i)^{−1}. (52)

They establish (under iid sampling) that β̂ is asymptotically equivalent to the
infeasible GLS estimator based on (51). Remarkably, Robinson allows X to have
unbounded support, yet did not need to trim out contributions from its tails: he
used nearest neighbor estimators of σ²(·) that always average over the same number
of observations. Extensions of this model to the multivariate nonlinear τ(·; β) case
are considered in Delgado (1992), while Hidalgo (1992) allows both heteroskedasticity
and serial correlation of unknown form. Applications include Melenberg and van
Soest (1991), Altug and Miller (1992) and Whistler (1988).
Carroll and Härdle (1989), Cavanagh (1989) and Linton (1993) develop second
order theory for these estimators. In this case, the optimal bandwidth is O(n^{-1/5})
when X is scalar, making the correction to the (asymptotic) MSE O(n^{-4/5}).
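
A minimal sketch of the feasible weighted least squares estimator (52) for the linear case (our names; for brevity a Gaussian kernel smooth of squared OLS residuals replaces Robinson's nearest neighbor variance estimator).

# Feasible WLS in the spirit of (52): OLS residuals, then a kernel
# smooth of the squared residuals as sigma2hat(X_i), then WLS.
import numpy as np

def fgls(Y, X, h):
    b_ols = np.linalg.lstsq(X, Y, rcond=None)[0]
    e2 = (Y - X @ b_ols)**2                       # squared OLS residuals
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    wts = np.exp(-0.5 * (D / h)**2)               # Gaussian kernel weights
    s2 = (wts @ e2) / wts.sum(axis=1)             # sigma2hat at sample points
    return np.linalg.solve(X.T @ (X / s2[:, None]), X.T @ (Y / s2))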

6.3. Single index models

When the conditional distribution of a scalar variable Y, given the d-dimensional
predictor variable X, depends on X only through the index β^T X, we say that this is
a single index model.
One example is the single index regression model in which E[Y|X = x] = m(x) =
g(x^T β), but no other restrictions are imposed. Define the vector of average
derivatives

δ = E[m′(X)] = E[g′(X^T β)]β, (53)

and note that δ determines β up to scale – as shown by Stoker (1986). Let f(x) denote
the density of X and ℓ be the vector of its negative (partial) log-derivatives,
ℓ = −∂ log f/∂x = −f′/f (ℓ is also called the score vector). Under the assumptions
on f given in Powell et al. (1989), we can write

δ = E[m′(X)] = E[ℓ(X)Y], (54)

and we estimate δ by δ̂ = n^{−1} Σ_{i=1}^n ℓ̂_H(X_i)Y_i, where ℓ̂_H(x) = −f̂′_H(x)/f̂_H(x) is an estimator
of ℓ(x) based on a kernel density smoother with bandwidth matrix H. Furthermore,
g(·) is estimated by a kernel estimator ĝ_h(·) for which {δ̂^T X_i, Y_i}_{i=1}^n serve as the
regression data.
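
A minimal sketch of the average derivative estimator for scalar X (our names; the trimming of sparse regions and the leave-one-out refinements that the theory requires are omitted).

# Average derivative estimator: deltahat = n^{-1} sum lhat(X_i) Y_i,
# with lhat = -fhat'/fhat from a Gaussian kernel density estimate.
import numpy as np

def ade(Y, X, h):
    K  = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    Kp = lambda u: -u * K(u)                      # derivative of the kernel
    U = (X[:, None] - X[None, :]) / h
    f  = K(U).mean(axis=1) / h                    # fhat(X_i)
    fp = Kp(U).mean(axis=1) / h**2                # fhat'(X_i)
    return np.mean(-(fp / f) * Y)                 # score-weighted average of Y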
Härdle and Stoker (1989) show that

√n(δ̂ − δ) ⇒ N(0, Σ_δ),

where Σ_δ = Var{ℓ(X)[Y − m(X)] + m′(X)}, while ĝ_h converges at rate √(nh) – i.e. like
a one dimensional function. Stoker (1991) proposed alternative estimators for δ
based on first estimating the partial derivatives m′(x) and then averaging over the
observations. A Monte Carlo comparison of these methods is presented in Stoker
and Villas-Boas (1992). Härdle et al. (1992b) develop a second order theory for δ̂:
in the scalar case, the optimal bandwidth h is O(n^{-2/7}) and the resulting correction
to the MSE is O(n^{-8/7}).
Another example is the binary choice model

Y_i = I(β^T X_i + u_i ≥ 0), (55)

where (X_i, u_i) are iid. There are many treatments of this specification following the
seminal paper of Manski (1975) – in which a slightly more general specification was
considered. We assume also that u is independent of X with unknown distribution
function F(·), in which case Pr[Y_i = 1|X_i] = F(β^T X_i) = E(Y_i|β^T X_i), i.e. F(·) is a
regression function. In fact, (55) is a special case of the single index regression model.
Applications include Das (1990), Horowitz (1991), and Melenberg and van Soest (1991).
Klein and Spady (1993) use the profile likelihood principle (see also Ichimura and
Lee (1991)) to obtain (semiparametric) efficient estimates of β. When F is known,
the sample log-likelihood function is

ℒ{F(β)} = Σ_{i=1}^n {Y_i ln[F(β^T X_i)] + (1 − Y_i) ln[1 − F(β^T X_i)]}. (56)

For given β, let F̂_h(β^T X) be the nonparametric regression estimator of E(Y|β^T X). A
feasible estimator β̂ of β is obtained as the maximizer of

ℒ{F̂_h(β)} = Σ_{i=1}^n {Y_i ln[F̂_h(β^T X_i)] + (1 − Y_i) ln[1 − F̂_h(β^T X_i)]}. (57)

This can be found using standard numerical optimization techniques; a sketch is
given below. The average derivative estimator can be used to provide initial consistent
estimators of β, although it is not in general efficient; see Cosslett (1987). Note that
to establish √n-consistency, it is necessary to employ bias reduction techniques such
as higher order kernels as well as to trim out contributions from sparse regions. Note
also that β̂ is not as efficient as the MLE obtained from (56).
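
A minimal sketch of the feasible criterion (57) (our names; the leave-one-out kernel regression and the clipping of F̂ away from 0 and 1 are simplifications, and a scale normalization on β, e.g. fixing one coefficient, is needed for identification).

# Profile-likelihood criterion in the spirit of (57): for candidate beta,
# Fhat(beta'X_i) is a leave-one-out kernel regression of Y on beta'X.
import numpy as np
from scipy.optimize import minimize

def neg_profile_loglik(beta, Y, X, h):
    v = X @ beta
    wts = np.exp(-0.5 * ((v[:, None] - v[None, :]) / h)**2)
    np.fill_diagonal(wts, 0.0)                    # leave-one-out
    F = np.clip((wts @ Y) / wts.sum(axis=1), 1e-3, 1 - 1e-3)
    return -np.sum(Y * np.log(F) + (1 - Y) * np.log(1 - F))

# beta_hat = minimize(neg_profile_loglik, beta0, args=(Y, X, h)).x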
We examined the performance of the average derivative estimator on a simulated
dataset, where

Pr(Y = 1|X = x) = Λ(β^T x) + 0.6φ(β^T x),

β = (1, 1)^T,

Figure 9. For the simulated dataset, δ̂^T X_i versus Y_i and two estimates of g(β^T X_i) are shown. The thick
line shows the Nadaraya–Watson estimator with a bandwidth h = 0.3, while for the thin line h = 0.1 was
chosen.

while Λ and φ are the standard logit and normal density functions respectively. A
sample of size n = 200 was generated, and the bivariate density function was
estimated using a Nadaraya–Watson estimator with bandwidth matrix H =
diag(0.99, 0.78). This example is taken from Härdle and Turlach (1992). The
estimation of δ and its asymptotic covariance matrix Σ̂_δ was done with the XploRe
macro adefit. For this example δ̂ = (0.135, 0.135)^T.
Figure 9 shows the estimated regression function ĝ_h(δ̂^T X_i).


These results allow us to test some hypotheses formally using a Wald statistic
(see Stoker (1992), pp. 53–54). In particular, to test the restriction Rδ = r₀, the Wald
statistic

W = n(Rδ̂ − r₀)^T (RΣ̂_δ R^T)^{−1} (Rδ̂ − r₀)

is compared to a χ²(rank R) critical value. Table 3 gives some examples of this
technique; a small computational sketch follows the table.
2334 W. Hiirdle and 0. Linton

Table 3
Wald statistics for some restrictions on δ.

Restriction           Value W    d.f.    P[χ²(d.f.) > W]
δ₁ = δ₂ = 0           25.25      2       0
δ₁ = δ₂ = 0.135       0.365      2       0.83
δ₁ = δ₂               0.027      1       0.869
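
A minimal computational sketch of this Wald test (our names; SciPy is assumed for the χ² tail probability).

# Wald statistic W = n (R deltahat - r0)' (R Sigmahat R')^{-1} (R deltahat - r0),
# referred to a chi-squared(rank R) distribution.
import numpy as np
from scipy.stats import chi2

def wald_test(delta_hat, Sigma_hat, R, r0, n):
    diff = R @ delta_hat - r0
    W = n * diff @ np.linalg.solve(R @ Sigma_hat @ R.T, diff)
    df = np.linalg.matrix_rank(R)
    return W, 1.0 - chi2.cdf(W, df)               # statistic and p-value

# e.g. testing delta_1 = delta_2: R = np.array([[1.0, -1.0]]), r0 = np.zeros(1)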

7. Conclusions

The nonparametric methods we have examined are especially useful when the
variable over which the smoothing takes place is one dimensional. In this case, the
relationship can be plotted and evaluated, while the estimators converge at rate
√(nh).
For higher dimensions these methods are less attractive due to the slower rate of
convergence and the lack of simple but comprehensive graphs. In these cases, there
are a number of restricted structures that can be employed, including the nonpara-
metric additive models of Hastie and Tibshirani (1990), or semiparametric models
like the partially linear and index models examined in Section 6.

References

Abramson, I. (1982) On bandwidth variation in kernel estimates – a square root law, Annals of
Statistics, 10, 1217–1223.
Ahn, H. and J.L. Powell (1990) Estimation of Censored Selection Models with a Nonparametric
Selection Mechanism, Unpublished Manuscript, University of Wisconsin.
Akaike, H. (1970) Statistical predictor information, Annals of the Institute of Statistical Mathematics,
22, 203–217.
Akaike, H. (1974) A new look at the statistical model identification, IEEE Transactions on Automatic
Control, AC-19, 716–723.
Altug, S. and R.A. Miller (1992) Human capital, aggregate shocks and panel data estimation,
Unpublished manuscript, University of Minnesota.
Anand, S., C.J. Harris and 0. Linton (1993) On the concept of ultrapoverty, Harvard Center for
Population Studies Working paper, 93-02.
Andrews, D.W.K. (1989) Semiparametric Econometric Models: I Estimation, Cowles Foundation
Discussion paper 908.
Andrews, D.W.K. (1991) Asymptotic Normality of Series Estimators for Nonparametric and Semi-
parametric Regression Models, Econometrica, 59, 307-346.
Andrews, D.W.K. and Y.-J. Whang (1990) Additive and Interactive Regression Models: Circumvention
of the Curse of Dimensionality, Econometric Theory, 6, 466-479.
Ansley, C.F., R. Kohn and C. Wong (1993) Nonparametric spline regression with prior information,
Biometrika, 80, 75–88.
Banks, J., R. Blundell and A. Lewbel (1993) Quadratic Engel curves, welfare measurement and
consumer demand, Institute for Fiscal Studies, 92-14.
Bhattacharya, P.K. and A.K. Gangopadhyay (1990) Kernel and Nearest-Neighbor Estimation of a
Conditional Quantile, Annals of Statistics, 18, 1400–1415.
Bickel, P.J., C.A.J. Klaassen, Y. Ritov and J.A. Wellner (1992) Efficient and Adaptive Inference in
Semiparametric Models. Johns Hopkins University Press: Baltimore.

Bierens, H.J. (1987) Kernel Estimators of Regression Functions, in Advances in Econometrics: Fifth
World Congress, Vol 1, ed. by T.F. Bewley. Cambridge University Press.
Bierens, H.J. and H.A. Pott-Buter (1990) Specification of household Engel curves by nonparametric
regression, Econometric Reviews, 9, 123-184.
Brillinger, D.R. (1980) Time Series: Data Analysis and Theory. Holden-Day.
Carroll, R.J. (1982) Adapting for Heteroscedasticity in Linear Models, Annals of Statistics, 10,
1224–1233.
Carroll, R.J. and W. Härdle (1989) Second Order Effects in Semiparametric Weighted Least Squares
Regression, Statistics, 20, 179–186.
Cavanagh, C.L. (1989) The cost of adapting for heteroskedasticity in linear models, Unpublished
manuscript, Harvard University.
Chambers, J.M., W.S. Cleveland, B. Kleiner and P.A. Tukey (1983) Graphical Methods for Data Analysis.
Duxbury Press.
Chaudhuri, P. (1991) Global nonparametric estimation of conditional quantile functions and their
derivatives, Journal of Multivariate Analysis, 39, 246-269.
Cleveland, W.S. (1979) Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of
the American Statistical Association, 74, 829-836.
Cosslett, S.R. (1987) Efficiency bounds for Distribution-free estimators of the Binary Choice and the
Censored Regression model, Econometrica, 55, 559-587.
Cox, D.R. and D.V. Hinkley (1974) Theoretical Statistics. Chapman and Hall.
Craven, P. and Wahba, G. (1979) Smoothing noisy data with spline functions, Numer. Math., 31,
377-403.
Daniell, P.J. (1946) Discussion of paper by M.S. Bartlett, Journal of the Royal Statistical Society
Supplement, 8, 27.
Das, S. (1990) A Semiparametric Structural Analysis of the Idling of Cement Kilns, Journal of
Econometrics, 50, 235–256.
Deaton, A.S. (1991) Rice-prices and income distribution in Thailand: a nonparametric analysis,
Economic Journal, 99, 1–37.
Deaton, A.S. (1993) Data and econometric tools for development economics, The Handbook of
Development Economics, Volume III, Eds J. Behrman and T.N. Srinavasan.
Delgado, M. (1992) Semiparametric Generalised Least Squares in the Multivariate Nonlinear
Regression Model, Econometric Theory, 8, 203–222.
Diebold, F. and J. Nason (1990) Nonparametric exchange rate prediction?, Journal of International
Economics, 28, 315–332.
Doukhan, P. and M. Ghindès (1980) Estimation dans le processus X_n = f(X_{n−1}) + ε_n, Comptes
Rendus de l'Académie des Sciences de Paris, 297, Série A, 61–64.
Elbadawi, I., A.R. Gallant and G. Souza (1983) An elasticity can be estimated consistently without a
priori knowledge of functional form, Econometrica, 51, 1731–1751.
Engle, R.F. and R. Gardiner (1976) Some Finite Sample Properties of Spectral Estimators of a Linear
Regression, Econometrica, 44, 149-165.
Engle, R.F., D.F. Hendry and J.F. Richard (1983) Exogeneity, Econometrica, 51, 277-304.
Engle, R.F., C.W.J. Granger, J. Rice and A. Weiss (1986) Semiparametric Estimates of the Relationship
Between Weather and Electricity Sales, Journal of the American Statistical Association, 81, 310-320.
Eubank, R.L. (1988) Smoothing Splines and Nonparametric Regression. Marcel Dekker.
Fama, E.F. (1965) The behavior of stock prices, Journal of Business, 38, 34–105.
Family Expenditure Survey, Annual Base Tapes (1968–1983). Department of Employment, Statistics
Division, Her Majesty's Stationery Office, London, 1968–1983.
Fan, J. (1992) Design-Adaptive Nonparametric Regression, Journal of the American Statistical
Association, 87, 998–1004.
Fan, J. and I. Gijbels (1992) Spatial and Design Adaptation: Variable order approximation in function
estimation, Institute of Statistics Mimeo Series, no. 2080, University of North Carolina at Chapel Hill.
Fan, J., N.E. Heckman and M.P. Wand (1992) Local Polynomial Kernel Regression for Generalized
Linear Models and Quasi-Likelihood Functions, University of British Columbia Working paper
92-028.
Fix, E. and J.L. Hodges (1951) Discriminatory analysis, nonparametric estimation: consistency
properties, Report No 4, Project no 21-49-004, USAF School of Aviation Medicine, Randolph Field,
Texas.

Gallant, A.R. and G. Souza (1991) On the asymptotic normality of Fourier flexible form estimates,
Journal of Econometrics, 50, 329–353.
Gallant, A.R. and H. White (1988) A Unified Theory of Estimation and Inference for Nonlinear Dynamic
Models. Blackwell: Oxford.
Gallant, A.R., D.A. Hsieh and G.E. Tauchen (1991) On Fitting a Recalcitrant Series: The Pound/Dollar
Exchange Rate, 1974–1983, in Nonparametric and Semiparametric Methods in Econometrics and
Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press.
Gasser, T. and H.G. Müller (1984) Estimating regression functions and their derivatives by the kernel
method, Scandinavian Journal of Statistics, 11, 171–185.
Gasser, T., H.G. Müller and V. Mammitzsch (1985) Kernels for nonparametric curve estimation,
Journal of the Royal Statistical Society Series B, 47, 238–252.
Gozalo, P.L. (1989) Nonparametric analysis of Engel curves: estimation and testing of demographic
effects, Brown University, Department of Economics Working paper 92215.
Györfi, L., W. Härdle, P. Sarda and P. Vieu (1990) Nonparametric Curve Estimation from Time Series.
Lecture Notes in Statistics, 60. Springer-Verlag: Heidelberg, New York.
Hall, P. (1992) The Bootstrap and Edgeworth Expansion. Springer-Verlag: New York.
Hall, P. (1993) On Edgeworth Expansion and Bootstrap Confidence Bands in Nonparametric Curve
Estimation, Journal of the Royal Statistical Society Series B, 55, 291-304.
Hall, P. and I. Johnstone (1992) Empirical functional and efficient smoothing parameter selection,
Journal of the Royal Statistical Society Series B, 54, 475–530.
Härdle, W. (1990) Applied Nonparametric Regression. Econometric Society Monographs 19, Cambridge
University Press.
Härdle, W. (1991) Smoothing Techniques with Implementation. Springer-Verlag: Heidelberg, New York,
Berlin.
Härdle, W. and R.J. Carroll (1990) Biased cross-validation for a kernel regression estimator and its
derivatives, Österreichische Zeitschrift für Statistik und Informatik, 20, 53–64.
Härdle, W. and M. Jerison (1991) Cross Section Engel Curves over Time, Recherches Economiques de
Louvain, 57, 391–431.
Härdle, W. and J.S. Marron (1985) Optimal bandwidth selection in nonparametric regression
function estimation, Annals of Statistics, 13, 1465–1481.
Härdle, W. and M. Müller (1993) Nichtparametrische Glättungsmethoden in der alltäglichen statistischen
Praxis, Allgemeines Statistisches Archiv, 77, 9–31.
Härdle, W. and D.W. Scott (1992) Smoothing in Low and High Dimensions by Weighted Averaging
Using Rounded Points, Computational Statistics, 1, 97–128.
Härdle, W. and T.M. Stoker (1989) Investigating Smooth Multiple Regression by the Method of
Average Derivatives, Journal of the American Statistical Association, 84, 986–995.
Härdle, W. and B.A. Turlach (1992) Nonparametric Approaches to Generalized Linear Models, in:
Fahrmeir, L., Francis, B., Gilchrist, R., Tutz, G. (Eds.) Advances in GLIM and Statistical Modelling,
Lecture Notes in Statistics, 78. Springer-Verlag: New York.
Härdle, W. and P. Vieu (1991) Kernel regression smoothing of time series, Journal of Time Series
Analysis, 13, 209–232.
Härdle, W., P. Hall and J.S. Marron (1988) How far are automatically chosen regression smoothing
parameters from their optimum?, Journal of the American Statistical Association, 83, 86–99.
Härdle, W., P. Hall and J.S. Marron (1992a) Regression smoothing parameters that are not far from
their optimum, Journal of the American Statistical Association, 87, 227–233.
Härdle, W., J. Hart, J.S. Marron and A.B. Tsybakov (1992b) Bandwidth Choice for Average Derivative
Estimation, Journal of the American Statistical Association, 87, 218–226.
Härdle, W., P. Hall and H. Ichimura (1993) Optimal Smoothing in Single Index Models, Annals of
Statistics, 21, to appear.
Hart, J. and P. Vieu (1990) Data-driven bandwidth choice for density estimation based on dependent
data, Annals of Statistics, 18, 873–890.
Hart, J.D. and T.E. Wehrly (1986) Kernel regression estimation using repeated measurements data,
Journal of the American Statistical Association, 81, 1080–1088.
Hastie, T.J. and R.J. Tibshirani (1990) Generalized Additive Models. Chapman and Hall.
Hausman, J.A. and W.K. Newey (1992) Nonparametric estimation of exact consumer surplus and
deadweight loss, MIT, Department of Economics Working paper 93-2, Massachusetts.

Hidalgo, J. (1992) Adaptive Estimation in Time Series Models with Heteroscedasticity of Unknown
Form, Econometric Theory, 8, 161-187.
Hildenbrand, K. and W. Hildenbrand (1986) On the mean income effect: a data analysis of the U.K.
family expenditure survey, in Contributions to Mathematical Economics, ed W. Hildenbrand and A.
Mas-Colell. North-Holland: Amsterdam.
Hildenbrand, W. and A. Kneip (1992) Family expenditure data, heteroscedasticity and the law of
demand, Universität Bonn Discussion paper A-390.
Horowitz, J.L. (1991) Semiparametric estimation of a work-trip mode choice model, University of
Iowa Department of Economics Working paper 91-12.
Hsieh, D.A. and C.F. Manski (1987) Monte Carlo Evidence on Adaptive Maximum Likelihood
Estimation of a Regression, Annals of Statistics, 15, 541–551.
Hussey, R. (1992) Nonparametric evidence on asymmetry in business cycles using aggregate employ-
ment time series, Journal of Econometrics, 51, 217-231.
Ichimura, H. and L.F. Lee (1991) Semiparametric Least Squares Estimation of Multiple Index Models:
Single Equation Estimation, in Nonparametric and Semiparametric Methods in Econometrics and
Statistics. Eds Barnett, Powell, and Tauchen. Cambridge University Press.
Jones, M.C. (1985) Discussion of the paper by B.W. Silverman, Journal of the Royal Statistical Society
Series B, 47, 25–26.
Jones, M.C. (1989) Discretized and interpolated Kernel Density Estimates, Journal of the American
Statistical Association, 84, 733–741.
Jones, M.C. and P.J. Foster (1993) Generalized jackknifing and higher order kernels, Forthcoming in
Journal of Nonparametric Statistics.
Jones, M.C., J.S. Marron and S.J. Sheather (1992) Progress in data-based selection for Kernel Density
estimation, Australian Graduate School of Management Working paper no 92-014.
Jones, M.C., O. Linton and J.P. Nielsen (1993) A multiplicative bias reduction method, Preprint,
Nuffield College, Oxford.
Klein, R.W. and R.H. Spady (1993) An Efficient Semiparametric Estimator for Binary Choice Models,
Econometrica, 61, 387–421.
Koenker, R. and G. Bassett (1978) Regression quantiles, Econometrica, 46, 33–50.
Koenker, R., P. Ng and S. Portnoy (1993) Quantile Smoothing Splines, Forthcoming in Biometrika.
Lewbel, A. (1991) The Rank of Demand Systems: Theory and Nonparametric Estimation, Econometrica,
59, 711–730.
Li, K.-C. (1985) From Stein's unbiased risk estimates to the method of generalized cross-validation,
Annals of Statistics, 13, 1352–1377.
Linton, O.B. (1991) Edgeworth Approximation in Semiparametric Regression Models, PhD thesis,
Department of Economics, UC Berkeley.
Linton, O.B. (1992) Second Order Approximation in the Partially Linear Model, Cowles Foundation
Discussion Paper no 1065.
Linton, O.B. (1993) Second Order Approximation in a linear regression with heteroskedasticity of
unknown form, Nuffield College Discussion paper no 75.
Linton, O.B. and J.P. Nielsen (1993) A Multiplicative Bias Reduction Method for Nonparametric
Regression, Forthcoming in Statistics and Probability Letters.
McFadden, D. (1985) Specification of econometric models, Econometric Society, Presidential Address.
Mack, Y.P. (1981) Local properties of k-NN regression estimates, SIAM J. Alg. Disc. Meth., 2,
311–323.
Mandelbrot, B. (1963) The variation of certain speculative prices, Journal of Business, 36, 394–419.
Manski, C.F. (1975) Maximum Score Estimation of the Stochastic Utility Model of Choice, Journal
of Econometrics, 3, 205–228.
Marron, J.S. and D. Nolan (1989) Canonical kernels for density estimation, Statistics and Probability
Letters, 7, 191–195.
Marron, J.S. and M.P. Wand (1992) Exact Mean Integrated Squared Error, Annals of Statistics, 20,
712–736.
Meese, R.A. and A.K. Rose (1991) An empirical assessment of nonlinearities in models of exchange rate
determination, Review of Economic Studies, 80, 603-619.
Melenberg, B. and A. van Soest (1991) Parametric and semi-parametric modelling of vacation
expenditures, CentER for Economic Research, Discussion paper no 9144, Tilburg, Holland.

Mizrach, B. (1992) Multivariate nearest-neighbor forecasts of EMS exchange rates, Journal of Applied
Econometrics, 7, 151–163.
Müller, H.G. (1987) On the asymptotic mean square error of L₁ kernel estimates of Cᵣ functions,
Journal of Approximation Theory, 51, 193–201.
Müller, H.G. (1988) Nonparametric Regression Analysis of Longitudinal Data. Lecture Notes in Statistics,
Vol. 46. Springer-Verlag: Heidelberg/New York.
Nadaraya, E.A. (1964) On estimating regression, Theory of Probability and its Applications, 10,
186–190.
Newey, W.K. (1990) Semiparametric Efficiency Bounds, Journal of Applied Econometrics, 5, 99–135.
Newey, W.K., J.L. Powell and J.R. Walker (1990) Semiparametric Estimation of Selection Models:
Some Empirical Results, American Economic Review Papers and Proceedings, 80, 324-328.
Olley, G.S. and A. Pakes (1991) The Dynamics of Productivity in the Telecommunications Equip-
ment Industry, Unpublished manuscript, Yale University.
Pagan, A.R. and Y.S. Hong (1991) Nonparametric Estimation and the Risk Premium, in Nonparametric
and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen.
Cambridge University Press.
Pagan, A.R. and W. Schwert (1990) Alternative models for conditional stock volatility, Journal of
Econometrics, 45, 267-290.
Pagan, A.R. and A. Ullah (1988) The econometric analysis of models with risk terms, Journal of
Applied Econometrics, 3, 87-105.
Park, B.U. and B.A. Turlach (1992) Practical performance of several data-driven bandwidth selectors
(with discussion), Computational Statistics, 7, 251–271.
Phillips, P.C.B. (1991) Spectral Regression for Cointegrated Time Series in Nonparametric and
Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and Tauchen. Cambridge
University Press.
Powell, J.L. and T.M. Stoker (1991) Optimal Bandwidth Choice for Density-Weighted Averages,
Unpublished manuscript, Princeton University.
Powell, J.L., J.H. Stock and T.M. Stoker (1989) Semiparametric Estimation of Index Coefficients,
Econometrica, 57, 1403–1430.
Prakasa Rao, B.L.S. (1983) Nonparametric Functional Estimation. Academic Press.
Rice, J.A. (1984) Bandwidth choice for nonparametric regression, Annals of Statistics, 12, 1215-30.
Robb, A.L., L. Magee and J.B. Burbidge (1992) Kernel smoothed consumption-age quantiles,
Canadian Journal of Economics, 25, 669-680.
Robinson, P.M. (1983) Nonparametric Estimators for Time Series, Journal of Time Series Analysis, 4,
185-208.
Robinson, P.M. (1987) Asymptotically Efficient Estimation in the Presence of Heteroscedasticity of
Unknown Form, Econometrica, 55, 875–891.
Robinson, P.M. (1988a) Root-N-Consistent Semiparametric Regression, Econometrica, 56, 931–954.
Robinson, P.M. (1988b) Semiparametric Econometrics: A Survey, Journal of Applied Econometrics, 3,
35–51.
Robinson, P.M. (1991) Automatic Frequency Domain Inference on Semiparametric and Nonparametric
Models, Econometrica, 59, 1329–1364.
Rosenblatt, M. (1956) Remarks on some nonparametric estimates of a density function, Annals of
Mathematical Statistics, 27, 642-669.
Ruppert, D. and M.P. Wand (1992) Multivariate Locally Weighted Least Squares Regression, Rice
University, Technical Report no. 9224.
Schuster, E.F. (1972) Joint asymptotic distribution of the estimated regression function at a finite
number of distinct points, Annals of Mathematical Statistics, 43, 84–88.
Sentana, E. and S. Wadhwani (1991) Semi-parametric Estimation and the Predictability of Stock
Returns: Some Lessons from Japan, Review of Economic Studies, 58, 547–563.
Shibata, R. (1981) An optimal selection of regression variables, Biometrika, 68, 45–54.
Silverman, B.W. (1984) Spline smoothing: the equivalent variable kernel method, Annals of Statistics,
12, 898–916.
Silverman, B.W. (1985) Some aspects of the Spline Smoothing approach to Non-parametric Regression
Curve Fitting, Journal of the Royal Statistical Society Series B, 47, 1–52.
Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall: London.
Stock, J.H. (1989) Nonparametric Policy Analysis, Journal of the American Statistical Association, 84,
567–576.

Stock, J.H. (1991) Nonparametric Policy Analysis: An Application to Estimating Hazardous Waste
Cleanup Benefits, in Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds
Barnett, Powell, and Tauchen. Cambridge University Press.
Stoker, T.M. (1986) Consistent Estimation of Scaled Coefficients, Econometrica, 54, 1461-1481.
Stoker, T.M. (1991) Equivalence of direct, indirect, and slope estimators of average derivatives, in
Nonparametric and Semiparametric Methods in Econometrics and Statistics. Eds Barnett, Powell, and
Tauchen. Cambridge University Press.
Stoker, T.M. (1992) Lectures on Semiparametric Econometrics. CORE Lecture Series. Universite
Catholique de Louvain, Belgium.
Stoker, T.M. and J.M. Villas-Boas (1992) Monte Carlo Simulation of Average Derivative Estimators,
Unpublished manuscript, MIT: Massachusetts.
Stone, C.J. (1982) Optimal global rates of convergence for nonparametric regression, Annals of Statistics,
10, 1040–1053.
Strauss, J. and D. Thomas (1990) The shape of the calorie-expenditure curve, Unpublished manuscript,
Rand Corporation, Santa Monica.
Stute, W. (1986) Conditional Empirical Processes, Annals of Statistics, 14, 638–647.
Tibshirani, R. (1984) Local likelihood estimation. PhD Thesis, Stanford University, California.
Tikhonov, A.N. (1963) Regularization of incorrectly posed problems, Soviet Math., 4, 1624–1627.
Turlach, B.A. (1992) On discretization methods for average derivative estimation, CORE Discussion
Paper no. 9232, Université Catholique de Louvain, Louvain-la-Neuve, Belgium.
Vapnik, V. (1982) Estimation of Dependencies Based on Empirical Data. Springer-Verlag: Heidelberg,
New York, Berlin.
Wahba, G. (1990) Spline Models for Observational Data. CBMS-NSF Regional Conference Series in
Applied Mathematics, no. 59.
Watson, G.S. (1964) Smooth regression analysis, Sankhya Series A, 26, 359-372.
Whistler, D. (1988) Semiparametric ARCH Estimation of Intra-Daily Exchange Rate Volatility,
Unpublished manuscript, London School of Economics.
Whittaker, E.T. (1923) On a new method of graduation, Proc. Edinburgh Math. Soc., 41, 63–75.
XploRe (1993) An interactive statistical computing environment. Available from XploRe Systems,
Institut für Statistik und Ökonometrie, Wirtschaftswissenschaftliche Fakultät, Humboldt-Universität
zu Berlin, D-10178 Berlin, Germany.
Chapter 39

METHODOLOGY AND THEORY FOR THE BOOTSTRAP

PETER HALL

Abstract

A brief account is given of the methodology and theory for the bootstrap.
Methodology is developed in the context of the equation approach, which allows
attention to be focussed on specific criteria for excellence, such as coverage error of
a confidence interval or expected value of a bias-corrected estimator. This approach
utilizes a definition of the bootstrap in which the key component is replacing a true
distribution function by its empirical estimator. Our theory is Edgeworth expansion
based, and is aimed specifically at elucidating properties of different methods for
constructing bootstrap confidence intervals in a variety of settings. The reader
interested in more detail than can be provided here is referred to the recent
monograph of Hall (1992).

1. Introduction

A broad interpretation of bootstrap methods argues that they are defined by
replacing an unknown distribution function, F, by its empirical estimator, F̂, in a
functional form for an unknown quantity of interest. From this standpoint, the
individual who first suggested that a population mean,

μ = ∫ x dF(x),

could be estimated by the sample mean,

X̄ = ∫ x dF̂(x),

was using the bootstrap. We tend to favour this definition, although we appreciate
that there are alternative views.
Perhaps the most common alternative is to confer the name bootstrap on
procedures that use Monte Carlo methods to effect a numerical approximation.
While we see that this does have its merits, we would argue against it on two
grounds. First, it is sometimes convenient to draw a distinction between the
essentially statistical argument that leads to the substitution or plug-in method
described in the previous paragraph, and the essentially numerical argument that
employs a Monte Carlo approximation to calculate a functional of F̂. There do
exist statistical procedures which marry the numerical simulation and statistical
estimation into one operation, where the simulation is regarded as primarily a
statistical feature. Monte Carlo testing is one such procedure; see for example

Barnard (1963), Hope (1968) and Marriott (1979). Our definition of the bootstrap
would not regard Monte Carlo testing as a bootstrap procedure. That may be
seen as either an advantage or a disadvantage, depending on one's view.
A second objection that one may have to defining the bootstrap strictly in terms
of whether or not Monte Carlo methods are employed, is that the method of
numerical computation becomes intrinsic to the definition. To cite an extreme case,
one would not usually think of using Monte Carlo methods to compute a sample
mean or variance, but nevertheless those quantities might reasonably be regarded
as bootstrap estimators of the population mean and variance, respectively. In a less
obvious instance, estimators of bootstrap distribution functions, which would
usually be candidates for approximation by Monte Carlo methods, may sometimes
be computed most effectively by exact, non-Monte Carlo methods. See for example
Fisher and Hall (1991). In other settings, saddlepoint methods provide excellent
alternatives to simulation; see Davison and Hinkley (1988) and Reid (1988). Does
a technique stop being a bootstrap method as soon as non-Monte Carlo methods
are employed? To argue that it does seems unnecessarily pedantic, but to deny that
it does would cause some problems for a bootstrap definition based on the notion
of simulation.
The name bootstrap was introduced by Efron (1979), and it is appropriate here
to emphasize the fundamental contributions that he made. As Efron was careful to
point out, bootstrap methods (in the sense of replacing F by F̂) had been around
for many years before his seminal paper. But he was perhaps the first to perceive
the enormous breadth of this class of methods. He saw too that the power of modern
computing machinery could be harnessed to allow functionals of F̂ to be computed
in very diverse circumstances. The combination of these two observations is
extremely powerful, and its ultimate effect on Statistics will be revolutionary.
Necessarily, these two observations go together; the vast range of applications of
bootstrap methods would not be possible without a facility for extremely rapid
simulation. However, that fact does not imply that bootstrap methods are restricted
to situations where simulation is employed for calculation.
Statistical scientists who thought along lines similar to Efron include Hartigan
(1969, 1971), who used resampled sub-samples to construct point and interval
estimators, and who stressed connections with Mahalanobis interpenetrating
samples and the jackknife of Quenouille (1949, 1956) and Tukey (1958); and Simon
(1969, Chapters 23-25), who described a variety of Monte Carlo methods.
Let us accept, for the sake of argument, that bootstrap methods are defined by
the "replace F by F̂" rule, described above. Two challenges immediately emerge in
response to this definition. First, we must determine how to focus this concept,
so as to make the bootstrap responsive to statistical demands. That is, how do we
decide which functionals of F should be estimated? This requires a principle that
enables us to implement bootstrap methods in a range of circumstances. The second
challenge is that of calculating the values of those functionals in a practical setting.
The latter problem may be solved partly by providing simulation methods or related

devices, such as saddlepoint arguments, for numerical approximation. Space


limitations mean that a thorough account of these techniques is beyond the scope
of this chapter. However, a detailed account of efficient methods of bootstrap
simulation may be found in Appendix II of Hall (1992). A key part of the answer to
the first question is the development of theory describing the relative performance
of different forms of the bootstrap, and that issue will be addressed at some length
here.
Our answer to the first question is provided in Section 2, where we describe an
equation approach to focussing attention on specific statistical questions. This
technique was discussed in more detail by Hall and Martin (1988), Martin (1989)
and Hall (1992, Chapter 1). It leads naturally to bootstrap iteration, which is
discussed in Section 3. Section 4 presents theory that enables comparisons to be
made of different bootstrap approaches to inference about distributions. The reader
is referred to Hinkley (1988) and DiCiccio and Romano (1988) for excellent reviews
of bootstrap methods.
Our discussion is necessarily kept brief and is essentially an abbreviated form of
an account that may be found in Hall (1992). In undertaking that abbreviation we
have omitted discussion of a variety of different approaches to the bootstrap. In
particular, we do not discuss various forms of bias correction, not because we do
not recommend it but because space does not permit an adequate survey. We readily
concede that the restricted account of bootstrap methods and theory presented here
is in need of a degree of bias correction itself!
We do not address in any detail the bootstrap for dependent data, but pause here
to outline the main issues. There are two main approaches to implementing the
bootstrap in dependent settings. The first is to model the dependent process as one
that is driven by independent and identically distributed disturbances - examples
include autoregressions and moving averages. We describe briefly here a technique
which may be used when no parametric assumptions are made about the
distribution of the disturbances. First estimate the parameters of the model, and
calculate the residuals (i.e. the estimated values of the independent disturbances).
Then run the process over and over again, by Monte Carlo simulation, with
parameter values set equal to their estimated values and with the bootstrapped
independent disturbances obtained by resampling randomly, with replacement,
from the set of residuals. Each resampled process should be of the same length as
the original one, and bootstrap inference may be conducted by averaging over the
independent Monte Carlo replications. Bose (1988) addresses the efficacy of this
procedure in the context of autoregressive models, and derives results that may be
viewed as analogues (in the case of autoregressive processes) of some of those
discussed later in this chapter for independent data.
If the distribution of disturbances is assumed known then, rather than estimate
residuals and resample with replacement from those, the parameters of the assumed
distribution may be estimated. The bootstrap disturbances may now be derived by
resampling from the hypothesized distribution, with parameters estimated.

The major other way of bootstrapping dependent processes is to divide the data
sequence into blocks, and resample the blocks rather than individual data values.
This approach has application in spatial as well as linear or time series contexts,
and indeed was apparently first suggested for spatial data; see Hall (1985). Blocking
methods may involve either non-overlapping blocks, as in the technique treated by
Carlstein (1986), or overlapping blocks, as proposed by Künsch (1989). (Both
methods were considered for spatial data by Hall (1985).) In sheer asymptotic terms
Künsch's method has advantages over Carlstein's, but those advantages are not
always apparent in practice. This matter has been addressed by Hall and Horowitz
(1993) in the context of estimating bias or variance, and there the matter of optimal
block width has been treated. The issue of distribution estimation using blocking
methods has been discussed by Götze and Künsch (1990), Lahiri (1991, 1992) and
Davison and Hall (1993).

2. A formal definition of the bootstrap principle

Much of statistical inference involves describing the relationship between a sample
and the population from which the sample was drawn. Formally, given a functional
f_t from a class {f_t : t ∈ 𝒯}, we wish to determine that value t₀ of t that solves an
equation such as

E{f_{t₀}(F₀, F₁) | F₀} = 0, (2.1)

where F = F₀ denotes the population distribution function and F̂ = F₁ is the
distribution function of the sample. An explicit definition of F₁ will be given
shortly. Conditioning on F₀ in (2.1) serves to stress that the expectation is taken
with respect to the distribution F₀. We call (2.1) the population equation because we
need properties of the population if we are to solve this equation exactly.
For example, let θ₀ = θ(F₀) denote a true parameter value, such as the rth power
of a mean,

θ₀ = (∫ x dF₀(x))^r.

Let θ̂ = θ(F₁) be our bootstrap estimator of θ₀, such as the rth power of a sample
mean,

θ̂ = (∫ x dF₁(x))^r = X̄^r,

where F̂ = F₁ is the empirical distribution function of the sample from which X̄ is
computed. Correcting θ̂ additively for bias is equivalent to finding that value t₀ that

solves (2.1) when

f_t(F₀, F₁) = θ(F₁) − θ(F₀) + t. (2.2)

Our bias-corrected estimator would be θ̂ + t₀. On the other hand, to construct a
symmetric, 95% confidence interval for θ₀ we would solve (2.1) when

f_t(F₀, F₁) = I{θ(F₁) − t ≤ θ(F₀) ≤ θ(F₁) + t} − 0.95, (2.3)

where the indicator function I(ℰ) is defined to equal 1 if event ℰ holds and 0
otherwise. The confidence interval is (θ̂ − t₀, θ̂ + t₀), where θ̂ = θ(F₁).
To obtain an approximate solution of the population equation (2.1) we argue as
follows. Let F₂ denote the distribution function of a sample drawn from F₁
(conditional on F₁). Replace the pair (F₀, F₁) in (2.1) by (F₁, F₂), thereby transforming
(2.1) to

E{f_t(F₁, F₂) | F₁} = 0. (2.4)

We call this the sample equation because we know (or can find out) everything about
it once we know the sample distribution function F₁. In particular, its solution t̂₀
is a function of the sample values.
We call t̂₀ and E{f_t(F₁, F₂)|F₁} the bootstrap estimators of t₀ and E{f_t(F₀,
F₁)|F₀}, respectively. They are obtained by replacing F₀ by F₁ in formulae for t₀
and E{f_t(F₀, F₁)|F₀}. In the bias correction problem, where f_t is given by (2.2), the
bootstrap version of our bias-corrected estimator is θ̂ + t̂₀. In the confidence interval
problem where (2.3) describes f_t, our bootstrap confidence interval is (θ̂ − t̂₀, θ̂ + t̂₀).
The latter is commonly called a (symmetric) percentile-method confidence interval
for θ₀.
The bootstrap principle might be described in terms of this approach to
estimation of a population equation.
It is appropriate now to give detailed definitions of F₁ and F₂. There are two
approaches, suitable for nonparametric and parametric problems respectively. In
both, inference is based on a sample 𝒳 of n random (independent and identically
distributed) observations of the population. In the nonparametric case, F₁ is simply
the empirical distribution function of 𝒳; that is, the distribution function of the
distribution that assigns mass n⁻¹ to each point in 𝒳. The associated empirical
probability measure assigns to a region ℬ a value equal to the proportion of the
sample that lies within ℬ. Similarly, F₂ is the empirical distribution function of a
sample drawn at random from the population with distribution function F₁; that
is, the empiric of a sample 𝒳* drawn randomly, with replacement, from 𝒳. If we
denote the population by 𝒳₀ then we have a nest of sampling operations: 𝒳 is drawn
at random from 𝒳₀ and 𝒳* is drawn at random from 𝒳.

In the parametric case, F₀ is assumed completely known up to a finite vector λ₀
of unknown parameters. To indicate this dependence we write F₀ = F_(λ₀), an element
of a class {F_(λ), λ ∈ Λ} of possible distributions. Let λ̂ be an estimator of λ₀, computed
from 𝒳, often (but not necessarily) the maximum likelihood estimator. It will be a
function of sample values, so we may write it as λ̂(𝒳). Then F₁ = F_(λ̂), the distribution
function obtained on replacing true parameter values by their sample estimates.
Let 𝒳* denote the sample drawn at random from the distribution with distribution
function F_(λ̂) (not simply drawn from 𝒳 with replacement), and let λ̂* = λ̂(𝒳*)
denote the version of λ̂ computed for 𝒳* instead of 𝒳. Then F₂ = F_(λ̂*).
It is appropriate now to discuss two examples that illustrate the bootstrap
principle.

Example 2.1. Bias reduction

Here the function f_t is given by (2.2), and the sample equation (2.4) assumes the form

E{θ(F₂) − θ(F₁) + t | F₁} = 0,

whose solution is

t = t̂₀ = θ(F₁) − E{θ(F₂) | F₁}.

The bootstrap bias-reduced estimator is thus

θ̂₁ = θ̂ + t̂₀ = θ(F₁) + t̂₀ = 2θ(F₁) − E{θ(F₂) | F₁}. (2.5)

Note that our basic estimator θ̂ = θ(F₁) is also a bootstrap estimator since it is
obtained by substituting F₁ for F₀ in the functional formula θ₀ = θ(F₀).
The expectation E{θ(F₂)|F₁} may always be computed (or approximated) by
Monte Carlo simulation, as follows. Conditional on F₁, draw B resamples
{𝒳*_b, 1 ≤ b ≤ B} independently from the distribution with distribution function F₁.
In the nonparametric case, where F₁ is the empirical distribution function of the
sample 𝒳, let F₂ᵦ denote the empirical distribution function of 𝒳*_b. In the parametric
case, let λ̂*_b = λ̂(𝒳*_b) be that estimator of λ₀ computed from resample 𝒳*_b, and put
F₂ᵦ = F_(λ̂*_b). Define θ̂*_b = θ(F₂ᵦ) and θ̂ = θ(F₁). Then in both parametric and non-
parametric circumstances,

B⁻¹ Σ_{b=1}^B θ̂*_b

converges to E{θ(F₂)|F₁} = E(θ̂*_1|𝒳) (with probability one, conditional on 𝒳)
as B → ∞.
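
A minimal sketch of this Monte Carlo bias correction for the nonparametric case, with θ the rth power of a mean (our names):

# Bootstrap bias reduction (2.5): thetahat_1 = 2 theta(F_1) - B^{-1} sum_b theta(F_2b).
import numpy as np

def bias_corrected(x, r=2, B=999, seed=0):
    rng = np.random.default_rng(seed)
    theta = lambda s: s.mean()**r
    t_star = np.array([theta(rng.choice(x, size=len(x), replace=True))
                       for _ in range(B)])        # theta(F_2b), b = 1,...,B
    return 2 * theta(x) - t_star.mean()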

Example 2.2. Confidence interval

A symmetric confidence interval for θ₀ = θ(F₀) may be constructed by applying the
resampling principle using the function f_t given by (2.3). The sample equation then
assumes the form

P{θ(F₂) − t ≤ θ(F₁) ≤ θ(F₂) + t | F₁} − 0.95 = 0. (2.6)

In a nonparametric context θ(F₂), conditional on F₁, has a discrete distribution
and so it would seldom be possible to solve (2.6) exactly. However, any error in the
solution of (2.6) will usually be very small, since the size of even the largest atom of
the distribution of θ(F₂) decreases exponentially quickly with increasing n. The
largest atom is of size only 3.6 × 10⁻⁴ when n = 10. We could remove this minor
difficulty by smoothing the distribution function F₁. In parametric cases, (2.6) may
usually be solved exactly for t.
The interval (θ̂ − t̂₀, θ̂ + t̂₀) is a bootstrap confidence interval for θ₀ = θ(F₀),
usually called a (two-sided, symmetric) percentile interval since t̂₀ is a percentile of
the distribution of |θ(F₂) − θ(F₁)| conditional on F₁. Other nominal 95% percentile
intervals include the two-sided, equal-tailed interval (θ̂ − t̂₀₁, θ̂ + t̂₀₂) and the
one-sided interval (−∞, θ̂ + t̂₀₃), where t̂₀₁, t̂₀₂, and t̂₀₃ solve

P{θ(F₁) ≤ θ(F₂) − t | F₁} − 0.025 = 0,

P{θ(F₁) ≤ θ(F₂) + t | F₁} − 0.975 = 0,

and

P{θ(F₁) ≤ θ(F₂) + t | F₁} − 0.95 = 0,

respectively. The former interval is called equal-tailed because it attempts to place
equal probability in each tail:

P(θ₀ ≤ θ̂ − t̂₀₁) ≈ P(θ₀ > θ̂ + t̂₀₂) ≈ 0.025.

The ideal form of this interval, obtained by solving the population equation rather
than the sample equation, does place equal probability in each tail.
Still other 95% percentile intervals are Î₁ = (θ̂ − t̂₀₂, θ̂ + t̂₀₁) and Î₂ = (−∞,
θ̂ + t̂₀₄), where t̂₀₄ is the solution of

P{θ(F₁) ≤ θ(F₂) − t | F₁} − 0.05 = 0.

These do not fit naturally into a systematic development of bootstrap methods by
frequentist arguments, and we find them a little contrived. They are sometimes
motivated as follows. Define θ̂* = θ(F₂), Ĥ(x) = P(θ̂* ≤ x | 𝒳) and

Ĥ⁻¹(α) = inf{x : Ĥ(x) ≥ α}.

Then

Î₁ = [Ĥ⁻¹(0.025), Ĥ⁻¹(0.975)] and Î₂ = [−∞, Ĥ⁻¹(0.95)].

All these intervals cover θ₀ with probability approximately 0.95, which might be
called the nominal coverage. Coverage error is defined to be true coverage minus
nominal coverage; it generally converges to zero as sample size increases.
We now treat in more detail the construction of two-sided, symmetric percentile
intervals in parametric problems. There, provided the distribution functions F_(λ) are
continuous, equation (2.6) may be solved exactly. We focus attention on the cases
where θ₀ = θ(F₀) is a population mean and the population is normal or exponential.
Our main aim is to bring out the virtues of pivoting, which usually amounts to
rescaling so that the distribution of a statistic depends less on unknown parameters.
If the population is Normal N(μ, σ²) and we use the maximum likelihood
estimator λ̂ = (X̄, σ̂²) to estimate λ₀ = (μ, σ²), then the sample equation (2.6) may be
rewritten as

P(|n^{-1/2}σ̂N| ≤ t | F₁) = 0.95, (2.7)

where N is Normal N(0, 1) independent of F₁. Therefore the solution of (2.6) is
t = t̂₀ = x_{0.95} n^{-1/2} σ̂, where x_α is defined by

P(|N| ≤ x_α) = α.

The bootstrap confidence interval is therefore

(X̄ − n^{-1/2} x_{0.95} σ̂, X̄ + n^{-1/2} x_{0.95} σ̂),

with coverage error

P(X̄ − n^{-1/2} x_{0.95} σ̂ ≤ μ ≤ X̄ + n^{-1/2} x_{0.95} σ̂)
  = P{|n^{1/2}(X̄ − μ)/σ̂| ≤ x_{0.95}} − 0.95. (2.8)

Of course n^{1/2}(X̄ − μ)/σ̂ does not have a Normal distribution, but a rescaled Student's
t distribution with n − 1 degrees of freedom. Therefore the coverage error is
essentially that which results from approximating to Student's t distribution by a
Normal distribution, and so is O(n^{-1}). (See Kendall and Stuart (1977, p. 404).) That
is disappointing, particularly as classical methods lead so easily to an interval with
precisely known coverage in this important special case.

To appreciate why the percentile interval has this inadequate performance, let us
go back to our parametric example involving the Normal distribution. The root
cause of the problem there is that σ̂, and not σ, appears on the right-hand side in
(2.8). This happens because the sample equation (2.6), equivalent here to (2.7),
depends on σ̂. Put another way, the population equation (2.1), equivalent to

P{|θ(F₁) − θ(F₀)| ≤ t} = 0.95,

depends on σ, the population standard deviation. This occurs because the distribution
of |θ(F₁) − θ(F₀)| depends on the unknown σ. We should try to eliminate, or at least
minimize, this dependence.
A function T of both the data and an unknown parameter is said to be (exactly)
pivotal if it has the same distribution for all values of the unknowns. It is
asymptotically pivotal if, for sequences of known constants {a_n} and {b_n}, a_nT + b_n
has a proper nondegenerate limiting distribution not depending on unknowns. We
may convert θ(F₁) − θ(F₀) into a pivotal statistic by correcting for scale, changing
it to T = {θ(F₁) − θ(F₀)}/τ(F₁), where τ̂ = τ(F₁) is an appropriate scale estimator. In our
example about the mean there are usually many different choices for τ̂, e.g. the
sample standard deviation {n⁻¹Σ(X_i − X̄)²}^{1/2}, the square root of the unbiased
variance estimate, Gini's mean difference and the interquartile range. In more
complex problems, a jackknife standard deviation estimator is usually an option.
Note that exactly the same confidence interval will be obtained if τ̂ is replaced by
cτ̂, for any given c ≠ 0, and so it is inessential that τ̂ be consistent for the asymptotic
standard deviation of θ(F₁). What is important is pivotalness – exact pivotalness if
we are to obtain a confidence interval with zero coverage error, asymptotic
pivotalness if exact pivotalness is unattainable. If we change to a pivotal statistic
then the function f_t alters from the form given in (2.3) to

f_t(F₀, F₁) = I{θ(F₁) − tτ(F₁) ≤ θ(F₀) ≤ θ(F₁) + tτ(F₁)} − 0.95. (2.9)

In the case of our parametric Normal model, any reasonable scale estimator τ̂ will
give exact pivotalness. We shall take τ̂ = σ̂, where σ̂² = σ²(F₁) = n⁻¹Σ(X_i − X̄)²
denotes the sample variance. Then f_t becomes

f_t(F₀, F₁) = I{θ(F₁) − tσ(F₁) ≤ θ(F₀) ≤ θ(F₁) + tσ(F₁)} − 0.95.

Using this functional in place of that at (2.3), but otherwise arguing exactly as before,
equation (2.7) changes to

P{(n − 1)^{-1/2}|T_{n−1}| ≤ t | F₁} = 0.95, (2.10)

where T_{n−1} has Student's t distribution with n − 1 degrees of freedom and is
stochastically independent of F₁. (Therefore the conditioning on F₁ in (2.10) is
irrelevant.) Thus, the solution of the sample equation is t̂₀ = (n − 1)^{-1/2}w_{0.95}, where
w_α = w_α(n) is given by P(|T_{n−1}| ≤ w_α) = α. The bootstrap confidence interval is
(X̄ − t̂₀σ̂, X̄ + t̂₀σ̂), with perfect coverage accuracy,

P{X̄ − (n − 1)^{-1/2}w_{0.95}σ̂ ≤ μ ≤ X̄ + (n − 1)^{-1/2}w_{0.95}σ̂} = 0.95.

(Of course, the latter statement applies only to the parametric bootstrap under the
assumption of a Normal model.)
Such confidence intervals are usually called percentile-t intervals since t̂₀ is a
percentile of the Student's t-like statistic |θ(F₁) − θ(F₀)|/τ(F₁).
Perfect coverage accuracy of percentile-t intervals usually holds only in parametric
problems where the underlying statistic is exactly pivotal. More generally, if
symmetric percentile-t intervals are constructed in parametric and nonparametric
problems by solving the sample equation when f_t is defined by (2.9), where τ(F₁) is
chosen so that T = {θ(F₁) − θ(F₀)}/τ(F₁) is asymptotically pivotal, then coverage
error will usually be O(n^{-2}) rather than the O(n^{-1}) associated with ordinary
percentile intervals.
We conclude this example with remarks on the computation of critical points,
such as t̂₀, by uniform Monte Carlo simulation. Further details, including an
account of efficient Monte Carlo simulation, are given in Section 5.
Assume we wish to compute the solution û_α of the equation

P[{θ(F₂) − θ(F₁)}/τ(F₂) ≤ û_α | F₁] = α, (2.11)

or, to be more precise, the value

û_α = inf{x : P[{θ(F₂) − θ(F₁)}/τ(F₂) ≤ x | F₁] ≥ α}.

Choose integers B ≥ 1 and 1 ≤ ν ≤ B such that ν/(B + 1) = α. For example, if α = 0.95
then we could take (ν, B) = (95, 99) or (950, 999). Conditional on F₁, draw B resamples
{𝒳*_b, 1 ≤ b ≤ B} independently from the distribution with distribution function F₁.
In the nonparametric case, write F₂ᵦ for the empirical distribution function of 𝒳*_b.
In the parametric case, where the population distribution function is F_(λ₀) and λ₀
is a vector of unknown parameters, let λ̂ and λ̂*_b denote the estimates of λ₀ computed
from the sample 𝒳 and the resample 𝒳*_b, respectively, and put F₂ᵦ = F_(λ̂*_b). For both
parametric and nonparametric cases, define

T*_b = {θ(F₂ᵦ) − θ(F₁)}/τ(F₂ᵦ),

and write T* for a generic T*_b. In this notation, equation (2.11) is equivalent to
P(T* ≤ û_α | 𝒳) = α. Let û_{α,B} denote the νth largest value of the T*_b. Then û_{α,B} → û_α with
probability one, conditional on 𝒳, as B → ∞. The value û_{α,B} is a Monte Carlo
approximation to û_α.
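
A minimal sketch of this computation for the nonparametric, symmetric percentile-t case, with θ the mean and τ the resample standard deviation (our names):

# Monte Carlo computation of the symmetric percentile-t critical point.
import numpy as np

def percentile_t_interval(x, B=999, alpha=0.95, seed=0):
    rng = np.random.default_rng(seed)
    n, xbar, s = len(x), x.mean(), x.std()
    Tstar = np.empty(B)
    for b in range(B):
        xs = rng.choice(x, size=n, replace=True)  # resample from F_1
        Tstar[b] = abs(np.sqrt(n) * (xs.mean() - xbar) / xs.std())
    nu = int(alpha * (B + 1))                     # nu/(B+1) = alpha, e.g. 950/1000
    u = np.sort(Tstar)[nu - 1]                    # the nu-th ordered |T*_b|
    return xbar - u * s / np.sqrt(n), xbar + u * s / np.sqrt(n)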

3. Iterating the principle

Recall that in Section 2, we suggested that statistical inference often involves
describing a relationship between the sample and the population. We argued that
this leads to a bootstrap principle, which may be enunciated in terms of finding an
empirical solution to a population equation, (2.1). The empirical solution is obtained
by solving a sample version, (2.4), of the population equation. The notation
employed in those equations includes taking F₀, F₁ and F₂ to denote the true
population distribution function, the empirical distribution function, and the
resample version of the empiric, respectively. The solution of the population
equation is a functional of F₀, say T(F₀), and the solution of the sample equation
is the corresponding functional of the empiric, T(F₁). The population equation may
then be represented as

E{f_{T(F₀)}(F₀, F₁) | F₀} = 0,

with approximate solution

E{f_{T(F₁)}(F₀, F₁) | F₀} ≈ 0. (3.1)

The solution of the sample equation represents an approximation to the solution
of the population equation. In many instances we would like to improve on this
approximation – for example, to further reduce bias in a bias correction problem,
or to improve coverage accuracy in a confidence interval problem. Therefore we
introduce a correction term t to the functional T, so that T(·) becomes U(·, t) with
U(·, 0) ≡ T(·). The adjustment may be multiplicative, for example, U(·, t) ≡ (1 + t)T(·).
Or it may be an additive correction, as in U(·, t) ≡ T(·) + t. Or t might adjust some
particular feature of T, as in the level-error correction for confidence intervals, which
we shall discuss shortly. In all cases, the functional U(·, t) should be smooth in t.
Our aim is to choose t so as to improve on the approximation (3.1).
Ideally, we would like to solve the equation

E{f_{U(F₁,t)}(F₀, F₁) | F₀} = 0 (3.2)

for t. If we write g_t(F, G) = f_{U(G,t)}(F, G), we see that (3.2) is equivalent to

E{g_t(F₀, F₁) | F₀} = 0,

which is of the same form as the population equation (2.1). Therefore we obtain an
approximation by passing to the sample equation,
Ch. 39: Methodology and TheoryJbr the Bootstrap 2353

or equivalently,

This has solution &,oz= T,(F,), say, giving us a new approximate equation of the
same form as the first approximation (3.1), and being the result of iterating that
earlier approximation,

Our hope is that the approximation here is better than that in (3.1) so that in a
sense U(F,, T,(F,)] is a better estimate than T(F,) of the solution t, to equation
(2.1). Of course, this does not mean that U[F,, 7,(F,)] is closer to t, than T(F,),
only that the left-hand side of (3.4) is closer to zero than the left-hand side of (3.1).
If we revise notation and call U[F_1, T_1(F_1)] the "new" T(F_1), we may run through
the argument again, obtaining a third approximate solution of (2.1). In principle,
these iterations may be repeated as often as desired.
We have given two explicit methods, multiplicative and additive, for modifying
our original estimate t̂_0 = T(F_1) of the solution of (2.1) so as to obtain the adjustable
form U(F_1, t). Those modifications may be used in a wide range of circumstances.
In the special case of confidence intervals, an alternative approach is to modify the
nominal coverage probability of the confidence interval. To explain the argument
we shall concentrate on the special case of symmetric percentile-method intervals
discussed in Example 2.1. Corrections for other types of intervals may be introduced
in like manner.
An α-level symmetric percentile-method interval for θ_0 = θ(F_0) is given by
[θ(F_1) - t̂_0, θ(F_1) + t̂_0], where t̂_0 is chosen to solve the sample equation

P{θ(F_2) - t ≤ θ(F_1) ≤ θ(F_2) + t | F_1} - α = 0.

(In our earlier examples, α = 0.95.) This t̂_0 is an estimator of the solution t_0 = T(F_0)
of the population equation

P{θ(F_1) - t ≤ θ(F_0) ≤ θ(F_1) + t | F_0} - α = 0,

that is, of

P(|θ̂ - θ_0| ≤ t | F_0) = α,

where θ̂ = θ(F_1). Therefore t_0 is just the α-level quantile, x_α, of the distribution of
|θ̂ - θ_0|.

Write x_α as x(F_0)_α, the quantile when F_0 is the true distribution function. Then
t̂_0 = T(F_1) is just x(F_1)_α, and we might take U(·, t) to be

U(·, t) = x(·)_{α+t}.

This is an alternative to multiplicative and additive corrections, which in the present
problem are

U(·, t) ≡ (1 + t)x(·)_α   and   U(·, t) ≡ x(·)_α + t,

respectively. In general, each will give slightly different numerical results although,
as we shall prove shortly, each provides the same order of correction.
Concise definitions of F_j are different in parametric and nonparametric cases. In
the former we work within a class {F_λ, λ ∈ Λ} of distributions that are completely
specified up to an unknown vector λ of parameters. The true distribution is
F_0 = F_{λ_0}; we estimate λ_0 by λ̂ = λ(X), where X = X_1 is an n-sample drawn from
F_0, and we take F_1 to be F_{λ̂}. To define F_j, let λ̂_j = λ(X_j) denote the estimator λ̂
computed for an n-sample X_j drawn from F_{j-1}, and put F_j = F_{λ̂_j}. The nonparametric
case is conceptually simpler. There, F_j is the empirical distribution of an n-sample
drawn randomly from F_{j-1}, with replacement.
To explain how high-index F_j's enter into computation of bootstrap iterations,
we shall discuss calculation of the solution to equation (3.3). That requires
calculation of U(F_2, t), defined for example by

U(F_2, t) = (1 + t)T(F_2).

And for this we must compute T(F_2). Now, t̂_0 = T(F_1) is the solution (in t) of the
sample equation

E{f_t(F_1, F_2) | F_1} = 0,

and so T(F_2) is the solution (in t) of the resample equation

E{f_t(F_2, F_3) | F_2} = 0.

Thus, to find the second bootstrap iterate, the solution of (3.3), we must construct
F_1, F_2, and F_3. Calculation of F_2 by simulation typically involves order B
sampling operations (B resamples drawn from the original sample), whereas
calculation of F_3 by simulation involves order B² sampling operations (B
resamples drawn from each of B resamples) if the same number of operations is used
at each level. Thus, i bootstrap iterations could require order B^i computations, and
so complexity would increase rapidly with the number of iterations.

In regular cases, expansions of the error in formulae such as (3.1) are usually
power series in n^{-1/2} or n^{-1}, often resulting from Edgeworth expansions of the type
that we shall discuss in Section 4. Each bootstrap iteration reduces the order of
magnitude of error by a factor of at least n^{-1/2}. However, in many problems with
an element of symmetry, such as two-sided confidence intervals, expansions of error
are power series in n^{-1} rather than n^{-1/2}, and each bootstrap iteration reduces error
by a factor of n^{-1}, not just n^{-1/2}.

Example 3.1. Bias reduction

In this situation, each bootstrap iteration reduces the order of magnitude of bias
by the factor n^{-1}. (See Hall 1992, Section 1.5, for further details.) To investigate
further the effect of bootstrap iteration on bias, observe that, in the case of bias
reduction by an additive correction,

f_t(F_0, F_1) = θ(F_1) - θ(F_0) + t.

Therefore the sample equation,

E{f_t(F_1, F_2) | F_1} = 0,

has solution t = T(F_1) = θ(F_1) - E{θ(F_2) | F_1}, and so the once-iterated estimate is

θ̂_1 = θ̂ + T(F_1) = θ(F_1) + T(F_1) = 2θ(F_1) - E{θ(F_2) | F_1}.

See also (2.5). On iteration of this formula we obtain the following formula for a
general bootstrap estimator.

Theorem 3.1

If θ̂_j denotes the jth iterate of θ̂, and if the adjustment at each iteration is additive,
then

θ̂_j = Σ_{i=1}^{j+1} (-1)^{i+1} \binom{j+1}{i} E{θ(F_i) | F_1},   j ≥ 1.
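To illustrate Theorem 3.1 in its simplest case (j = 1), the following sketch computes
the once-iterated additive correction θ̂_1 = 2θ(F_1) - E{θ(F_2) | F_1}, approximating the
conditional expectation by uniform resampling. Taking θ to be the maximum likelihood
variance, whose bias is exactly -σ²/n, is our own illustrative assumption.

```python
# A sketch of the once-iterated additive bias correction
# theta_1 = 2 theta(F_1) - E{theta(F_2) | F_1}, approximating the conditional
# expectation by B uniform resamples.
import numpy as np

rng = np.random.default_rng(1)

def theta(x):
    # theta(F) = variance; the divisor-n plug-in version has bias -sigma^2/n.
    return x.var()

def iterated_estimate(x, B=2000):
    n = len(x)
    t1 = theta(x)                                               # theta(F_1)
    t2 = np.mean([theta(rng.choice(x, n, replace=True)) for _ in range(B)])
    return 2.0 * t1 - t2                # 2 theta(F_1) - E{theta(F_2) | F_1}

x = rng.normal(size=30)
print(theta(x), iterated_estimate(x))  # bias reduced from O(n^{-1}) to O(n^{-2})
```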

Example 3.2. Confidence interval

Here, each iteration generally reduces the order of coverage error by the factor n^{-1}
in the case of two-sided intervals, and by n^{-1/2} for one-sided intervals. To appreciate
the effect of iteration in more detail, let us consider the case of parametric, percentile
confidence intervals for a mean, assuming a Normal N(μ, σ²) population, discussed
in Example 2.2. Let N denote a Normal N(0, 1) random variable. Estimate the
parameter λ_0 = (μ, σ²) by the maximum likelihood estimator

λ̂ = (X̄, σ̂²) = (μ(F_1), σ²(F_1)),

where X̄ = n^{-1}Σ_i X_i and σ̂² = n^{-1}Σ_i (X_i - X̄)² are the sample mean and sample
variance, respectively. The functional f_t is, in the case of a symmetric two-sided 95%
percentile confidence interval,

f_t(F_0, F_1) = I{θ(F_1) - t ≤ θ(F_0) ≤ θ(F_1) + t} - 0.95,

and the sample equation (2.4) has solution t = T(F_1) = n^{-1/2} x_{0.95} σ(F_1), where x_{0.95}
is given by P(|N| ≤ x_{0.95}) = 0.95. This gives the percentile interval

(X̄ - n^{-1/2} x_{0.95} σ̂,  X̄ + n^{-1/2} x_{0.95} σ̂),

derived in Example 2.2. For the sake of definiteness we shall make the coverage
correction in the form

U(F_1, t) = n^{-1/2}(x_{0.95} + t) σ(F_1),

although we would draw the same conclusion with other forms of correction. Thus,

f_{U(F_1,t)}(F_0, F_1) = I{n^{1/2}|θ(F_1) - θ(F_0)|/σ(F_1) ≤ x_{0.95} + t} - 0.95,

so that the sample equation (3.3) becomes

P{n^{1/2}|θ(F_2) - θ(F_1)|/σ(F_2) ≤ x_{0.95} + t | F_1} - 0.95 = 0.   (3.5)

Observe that

W = n^{1/2}{θ(F_2) - θ(F_1)}/σ(F_2)
  = n^{-1/2} Σ_{i=1}^n (X_i* - X̄) {n^{-1} Σ_{i=1}^n (X_i* - X̄*)²}^{-1/2},

where, conditional on X, the X_1*, …, X_n* are independent and identically distributed
N(X̄, σ̂²) random variables and X̄* = n^{-1}Σ_i X_i*. Therefore, conditional on X, and
also unconditionally, W is distributed as {n/(n - 1)}^{1/2} T_{n-1}, where T_{n-1} has
Student's t distribution with n - 1 degrees of freedom. Therefore the solution t̂_0 of
equation (3.5) is t̂_0 = {n/(n - 1)}^{1/2} w_{0.95} - x_{0.95}, where w_α = w_α(n) is defined by

P(|T_{n-1}| ≤ w_α) = α.

The resulting bootstrap confidence interval is

[θ(F_1) - n^{-1/2}σ(F_1)(x_{0.95} + t̂_0),  θ(F_1) + n^{-1/2}σ(F_1)(x_{0.95} + t̂_0)]
  = [X̄ - (n - 1)^{-1/2} w_{0.95} σ̂,  X̄ + (n - 1)^{-1/2} w_{0.95} σ̂].

This is identical to the percentile-t (not the percentile) confidence interval derived
in Example 2.2 and has perfect coverage accuracy.
The methodology of bootstrap iteration was introduced by Efron (1983), Hall
(1986), Beran (1987) and Loh (1987).
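The exactness claim is easy to check numerically. The sketch below (ours, with scipy
supplying the Normal and Student's t quantiles) verifies that the corrected half-width
n^{-1/2}σ̂(x_{0.95} + t̂_0) coincides with the Student's t half-width; the simulated
sample is an illustrative assumption.

```python
# A numerical check that the iterated interval coincides with the Student's t
# (percentile-t) interval; sample values are illustrative.
import numpy as np
from scipy import stats

n = 15
x95 = stats.norm.ppf(0.975)              # P(|N| <= x95) = 0.95
w95 = stats.t.ppf(0.975, df=n - 1)       # P(|T_{n-1}| <= w95) = 0.95
t0 = np.sqrt(n / (n - 1)) * w95 - x95    # solution of (3.5)

rng = np.random.default_rng(2)
x = rng.normal(loc=1.0, scale=2.0, size=n)
sig = x.std()                            # ML estimate (divisor n)

half_iterated = sig * (x95 + t0) / np.sqrt(n)
half_student = sig * w95 / np.sqrt(n - 1)
print(np.isclose(half_iterated, half_student))   # True
```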

4. Asymptotic theory

4.1. Summary

We begin by describing circumstances where Edgeworth expansions, in the usual
rather than the bootstrap sense, may be generated under rigorous regularity
conditions; see Section 4.2. Major contributors to this theory include Chibishov
(1972, 1973a, 1973b), Sargan (1975, 1976) and Bhattacharya and Ghosh (1978). Our
account is based on the latter paper. Following that, in Section 4.3, we discuss
bootstrap versions of those expansions and then describe the conclusions that may
be drawn from those results. Our first conclusions, about the efficacy of pivotal
methods, are given towards the end of Section 4.3. Sections 4.4, 4.5, 4.6 and 4.7
describe respectively a variety of different confidence intervals, properties of
bootstrap estimates of critical points, properties of coverage error and the special
case of regression. The last case is of particular interest because, in the context of
intervals for slope parameters, it admits bootstrap methods with unusually good
coverage accuracy.
The main conclusions drawn in this section relate to the virtues of pivoting. That
subject was touched on in Section 2 but there we lacked the technical devices
necessary to provide a broad description of the relative performances of pivotal and
non-pivotal methods. The Edgeworth expansion techniques introduced in Section 4.2
fill this gap. In particular, they enable us to show that pivotal methods generally
yield greater accuracy in the estimation of critical points (Section 4.5) and smaller
asymptotic order of coverage error of one-sided confidence intervals (Section 4.6).
Nevertheless, it should be borne in mind that these results are asymptotic in
character and that, while they provide a valuable guide, they do not tell the whole
story. For example, the performance of pivotal methods with small samples depends
in large part on the relative accuracy of the variance estimator and can be very poor
in cases where an accurate variance estimator is not available. Examples which
feature poor accuracy include interval estimation for the correlation coefficient and
for a ratio of means when the denominator mean is close to zero.

Theory for the bootstrap, along the lines of that described here, was developed
by Bickel and Freedman (1980), Singh (1981), Beran (1982, 1987), Babu and Singh
(1983, 1984, 1985), Hall (1986, 1988a, 1988b), Efron (1987), Liu and Singh (1987)
and Robinson (1987). Further work on the bootstrap in regression models is
described by Bickel and Freedman (1981, 1983), Freedman (1981), Freedman and
Peters (1984) and Peters and Freedman (1984a, 1984b).

4.2. Edgeworth and Cornish-Fisher expansions

We begin by describing a general model that allows Edgeworth and Cornish-Fisher
expansions to be established rigorously. Let Φ, φ denote respectively the Standard
Normal distribution and density functions. Let X, X_1, X_2, … be independent and
identically distributed random column d-vectors with mean μ, and put X̄ = n^{-1}Σ_i X_i.
Let A: R^d → R be a smooth function satisfying A(μ) = 0. We have in mind a function
such as A(x) = {g(x) - g(μ)}/h(μ)^{1/2}, where θ_0 = g(μ) is the (scalar) parameter estimated
by θ̂ = g(X̄) and σ² = h(μ) is the asymptotic variance of n^{1/2}θ̂; or A(x) = {g(x) -
g(μ)}/h(x)^{1/2}, where σ̂² = h(X̄) is an estimator of h(μ). (Thus, we assume h is a known
function.)
This "smooth function model" allows us to study problems where θ_0 is a mean,
or a variance, or a ratio of means or variances, or a difference of means or variances,
or a correlation coefficient, etc. For example, if {W_1, …, W_n} were a random sample
from a univariate population with mean m and variance β², and if we wished to
estimate θ_0 = m, then we would take d = 2, X = (X^{(1)}, X^{(2)})ᵀ = (W, W²)ᵀ, μ = E(X),

g(x^{(1)}, x^{(2)}) = x^{(1)},   h(x^{(1)}, x^{(2)}) = x^{(2)} - (x^{(1)})².

This would ensure that g(μ) = m, g(X̄) = W̄ (the sample mean), h(μ) = β², and

h(X̄) = n^{-1} Σ_{i=1}^n X_i^{(2)} - (n^{-1} Σ_{i=1}^n X_i^{(1)})² = n^{-1} Σ_{i=1}^n (W_i - W̄)² = β̂²

(the sample variance). If instead our target were θ_0 = β², then we would take d = 4,
X = (W, W², W³, W⁴)ᵀ, μ = E(X),

g(x^{(1)}, …, x^{(4)}) = x^{(2)} - (x^{(1)})²,

h(x^{(1)}, …, x^{(4)}) = x^{(4)} - 4x^{(1)}x^{(3)} + 6(x^{(1)})²x^{(2)} - 3(x^{(1)})⁴ - [x^{(2)} - (x^{(1)})²]².

In this case,

g(μ) = β²,   g(X̄) = β̂²,

h(μ) = E(W - m)⁴ - β⁴,

h(X̄) = n^{-1} Σ_{i=1}^n (W_i - W̄)⁴ - β̂⁴.

(Note that E(W - m)⁴ - β⁴ equals the asymptotic variance of n^{1/2}β̂².) The cases
where θ_0 is a correlation coefficient (a function of five means), or a variance ratio (a
function of four means), among others, may be treated similarly.
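A short sketch (ours) makes the d = 2 case concrete: with X_i = (W_i, W_i²)ᵀ, the
displayed functions g and h recover the sample mean and the divisor-n sample
variance from X̄.

```python
# The d = 2 instance of the smooth function model: X = (W, W^2)^T, with g and h
# as displayed above.  Data and names are illustrative.
import numpy as np

def g(x):                 # g(x^(1), x^(2)) = x^(1)
    return x[0]

def h(x):                 # h(x^(1), x^(2)) = x^(2) - (x^(1))^2
    return x[1] - x[0] ** 2

rng = np.random.default_rng(3)
w = rng.gamma(2.0, size=200)
Xbar = np.array([w.mean(), (w ** 2).mean()])   # mean of the vectors X_i

print(np.isclose(g(Xbar), w.mean()))   # g(Xbar) = sample mean
print(np.isclose(h(Xbar), w.var()))    # h(Xbar) = sample variance (divisor n)
```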
The following result may be established under the model described above. We
first present a little notation. Put μ = E(X), and let

μ_{i_1…i_j} = E{(X - μ)^{(i_1)} ⋯ (X - μ)^{(i_j)}},   j ≥ 1,

a_{i_1…i_j} = (∂^j/∂x^{(i_1)} ⋯ ∂x^{(i_j)}) A(x) |_{x=μ},

and

σ² = Σ_i Σ_j a_i a_j μ_{ij},   A_1 = (1/2) Σ_i Σ_j a_{ij} μ_{ij},
A_2 = Σ_i Σ_j Σ_k a_i a_j a_k μ_{ijk} + 3 Σ_i Σ_j Σ_k Σ_l a_i a_j a_{kl} μ_{ik} μ_{jl}.

Note that σ² equals the asymptotic variance of n^{1/2}A(X̄).

Theorem 4.1

Assume that the function A has j + 2 continuous derivatives in a neighbourhood of
μ = E(X), that A(μ) = 0, that E(‖X‖^{j+2}) < ∞, and that the characteristic function
χ of X satisfies

lim sup_{‖t‖→∞} |χ(t)| < 1.   (4.1)

Suppose σ > 0. Then for j ≥ 1,

P{n^{1/2}A(X̄)/σ ≤ x} = Φ(x) + n^{-1/2}p_1(x)φ(x) + ⋯ + n^{-j/2}p_j(x)φ(x) + o(n^{-j/2})   (4.2)

uniformly in x, where p_j is a polynomial of degree at most 3j - 1, odd for even j and
even for odd j, with coefficients depending on moments of X up to order j + 2. In
particular,

p_1(x) = -{A_1 σ^{-1} + (1/6)A_2 σ^{-3}(x² - 1)}.
See Bhattacharya and Ghosh (1978) for a proof.


Condition (4.1) is a multivariate form of Cramér's continuity condition. It is
satisfied if the distribution of X is nonsingular (i.e. has a nondegenerate absolutely
continuous component) or if X = (W, W², …, W^d)ᵀ, where W is a random variable
with a nonsingular distribution.
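For the special case of a sample mean, the one-term expansion can be checked by
simulation: there p_1(x) = -(γ/6)(x² - 1), with γ the population skewness. This
explicit form is a standard specialization stated here as an assumption rather than
taken from the text.

```python
# Simulation check of the one-term expansion Phi(x) + n^{-1/2} p1(x) phi(x) for
# a standardized sample mean, where p1(x) = -(gamma/6)(x^2 - 1); gamma(2) data.
import numpy as np
from scipy import stats

n, reps, x = 20, 200_000, 1.5
shape = 2.0
skew = 2.0 / np.sqrt(shape)                    # skewness of gamma(shape)
mean, sd = shape, np.sqrt(shape)

rng = np.random.default_rng(4)
s = np.sqrt(n) * (rng.gamma(shape, size=(reps, n)).mean(axis=1) - mean) / sd

edgeworth = stats.norm.cdf(x) - skew / (6 * np.sqrt(n)) * (x**2 - 1) * stats.norm.pdf(x)
print((s <= x).mean(), edgeworth, stats.norm.cdf(x))   # Edgeworth beats Phi(x)
```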
Two versions of (4.2) are given by

P{n^{1/2}(θ̂ - θ_0)/σ ≤ x} = Φ(x) + n^{-1/2}p_1(x)φ(x) + ⋯ + n^{-j/2}p_j(x)φ(x) + o(n^{-j/2})   (4.3)

and

P{n^{1/2}(θ̂ - θ_0)/σ̂ ≤ x} = Φ(x) + n^{-1/2}q_1(x)φ(x) + ⋯ + n^{-j/2}q_j(x)φ(x) + o(n^{-j/2}),   (4.4)

being Edgeworth expansions for non-Studentized and Studentized statistics, respec-
tively. Here, p_j and q_j are polynomials of degree at most 3j - 1 and are odd or even
functions according to whether j is even or odd. They are usually distinct.
The Edgeworth expansion in Theorem 4.1 is readily inverted so as to yield a
Cornish-Fisher expansion of the critical point of a distribution. To appreciate how,
first define w_α = w_α(n), the α-level quantile of the distribution of S_n = n^{1/2}A(X̄)/σ, by

w_α = inf{x : P(S_n ≤ x) ≥ α}.

Let z_α be the α-level Standard Normal quantile, given by Φ(z_α) = α. We may write

w_α = z_α + n^{-1/2}p_{11}(z_α) + n^{-1}p_{21}(z_α) + ⋯ + n^{-j/2}p_{j1}(z_α) + ⋯

and

z_α = w_α + n^{-1/2}p_{12}(w_α) + n^{-1}p_{22}(w_α) + ⋯ + n^{-j/2}p_{j2}(w_α) + ⋯,

where the functions p_{j1} and p_{j2} are polynomials. These expansions are to be
interpreted as asymptotic series, and in that sense are available uniformly in
ε ≤ α ≤ 1 - ε for any 0 < ε < 1/2.

The polynomials p_{j1} and p_{j2} are of degree at most j + 1, odd for even j and even
for odd j, and depend on cumulants only up to order j + 2. They are completely
determined by the p_i's in (4.2). In particular, it follows that p_{j1} is determined by
p_1, …, p_j. To derive formulae for p_{11} and p_{21}, note that

α = Φ(z_α) + {n^{-1/2}p_{11}(z_α) + n^{-1}p_{21}(z_α)}φ(z_α) - (1/2)n^{-1}p_{11}(z_α)² z_α φ(z_α)
  + n^{-1/2}[p_1(z_α)φ(z_α) + n^{-1/2}p_{11}(z_α){p_1'(z_α) - z_α p_1(z_α)}φ(z_α)]
  + n^{-1}p_2(z_α)φ(z_α) + O(n^{-3/2})

  = α + n^{-1/2}{p_{11}(z_α) + p_1(z_α)}φ(z_α) + n^{-1}[p_{21}(z_α) - (1/2)z_α p_{11}(z_α)²
  + p_{11}(z_α){p_1'(z_α) - z_α p_1(z_α)} + p_2(z_α)]φ(z_α) + O(n^{-3/2}).

From this we may conclude that

p_{11}(x) = -p_1(x)

and

p_{21}(x) = p_1(x)p_1'(x) - (1/2)x p_1(x)² - p_2(x).

Formulae for the other polynomials p_{j1}, and for the p_{j2}'s, may be derived similarly;
however, they will not be needed in our work.
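The two formulae can be verified symbolically. The following sympy sketch (our
construction) takes generic parity-respecting polynomials p_1(x) = a + bx² and
p_2(x) = cx + dx³ and confirms that, with p_{11} = -p_1 and
p_{21} = p_1p_1' - (1/2)xp_1² - p_2, the terms of orders n^{-1/2} and n^{-1} vanish
from the expansion of α.

```python
# Symbolic verification of p11 = -p1 and p21 = p1*p1' - x*p1^2/2 - p2, using
# generic even/odd polynomials; eps stands for n^{-1/2}.
import sympy as sp

z, eps, a, b, c, d, x = sp.symbols('z epsilon a b c d x')
p1 = a + b * x**2                       # even polynomial
p2 = c * x + d * x**3                   # odd polynomial
phi = sp.exp(-x**2 / 2) / sp.sqrt(2 * sp.pi)

p11 = -p1
p21 = p1 * sp.diff(p1, x) - x * p1**2 / 2 - p2
w = z + eps * p11.subs(x, z) + eps**2 * p21.subs(x, z)   # Cornish-Fisher form

Phi = sp.Function('Phi')                # abstract; only Phi' = phi is used
expr = Phi(w) + eps * (p1 * phi).subs(x, w) + eps**2 * (p2 * phi).subs(x, w)
ser = sp.expand(expr.series(eps, 0, 3).removeO().doit())
phi_z = phi.subs(x, z)
ser = ser.subs(sp.Derivative(Phi(z), (z, 2)), sp.diff(phi_z, z))
ser = ser.subs(sp.Derivative(Phi(z), z), phi_z)
print(sp.simplify(ser.coeff(eps, 1)))   # 0: the n^{-1/2} term vanishes
print(sp.simplify(ser.coeff(eps, 2)))   # 0: the n^{-1} term vanishes
```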
Cornish-Fisher expansions under explicit regularity conditions may be deduced
from results such as Theorem 4.1. For example, the following inversions of (4.3) and
(4.4) are valid uniformly in ε ≤ α ≤ 1 - ε, under the conditions of that theorem:

u_α = z_α + n^{-1/2}p_{11}(z_α) + n^{-1}p_{21}(z_α) + ⋯ + n^{-j/2}p_{j1}(z_α) + o(n^{-j/2})   (4.5)

and

v_α = z_α + n^{-1/2}q_{11}(z_α) + n^{-1}q_{21}(z_α) + ⋯ + n^{-j/2}q_{j1}(z_α) + o(n^{-j/2}).   (4.6)

Here z_α, u_α, v_α are the solutions of the equations Φ(z_α) = α,

P{n^{1/2}(θ̂ - θ_0)/σ ≤ u_α} = α,

P{n^{1/2}(θ̂ - θ_0)/σ̂ ≤ v_α} = α,

respectively; p_{11} and p_{21} are given by the formulae displayed in the previous
paragraph, with p_1 and p_2 defined as in (4.3); and q_{11} and q_{21} are given by the
analogous formulae, with q_1 and q_2 from (4.4) replacing p_1 and p_2.

4.3. Edgeworth and Cornish-Fisher expansions of bootstrap distributions

We are now in a position to describe Edgeworth expansions of bootstrap distri-
butions. We shall emphasize the role played by pivotal methods, introduced in
Section 2. Recall that a statistic is (asymptotically) pivotal if its limiting distribution
does not depend on unknown quantities. In several respects the bootstrap does a
better job of estimating the distribution of a pivotal statistic than it does for a
nonpivotal statistic. The advantages of pivoting can be explained very easily by
means of Edgeworth expansion, as follows. If a pivotal statistic T is asymptotically
Normally distributed, then in regular cases we may expand its distribution function
as

G(x) = P(T ≤ x) = Φ(x) + n^{-1/2}q(x)φ(x) + O(n^{-1}),   (4.7)

where q is an even quadratic polynomial. See (4.2), for example. We might take
T = n^{1/2}(θ̂ - θ_0)/σ̂, where θ̂ is an estimator of an unknown parameter θ_0, and σ̂² is
an estimator of the asymptotic variance σ² of n^{1/2}θ̂. The bootstrap estimator of G
admits an analogous expansion,

Ĝ(x) = P(T* ≤ x | X) = Φ(x) + n^{-1/2}q̂(x)φ(x) + O_p(n^{-1}),   (4.8)

where T* is the bootstrap version of T, computed from a resample X* instead of
the sample X, and the polynomial q̂ is obtained from q on replacing unknowns,
such as skewness, by bootstrap estimates. (The notation O_p(n^{-1}) denotes a
random variable that is of order n^{-1} in probability.) The distribution of T* conditional
on X is called the bootstrap distribution of T*.
The estimators in the coefficients of q̂ are typically distant O_p(n^{-1/2}) from their
respective values in q, and so q̂ - q = O_p(n^{-1/2}). Therefore, subtracting (4.7) and
(4.8), we conclude that

P(T* ≤ x | X) - P(T ≤ x) = O_p(n^{-1}).

That is, the bootstrap approximation to G is in error by only n^{-1}. This is a
substantial improvement on the Normal approximation, G ≈ Φ, which by (4.7) is
in error by n^{-1/2}.
On the other hand, were we to use the bootstrap to approximate the distribution
of a nonpivotal statistic U, such as U = n^{1/2}(θ̂ - θ_0), we would typically commit an
error of size n^{-1/2} rather than n^{-1}. To appreciate why, observe that the analogues
of (4.7) and (4.8) in this case are

H(x) = P(U ≤ x) = Φ(x/σ) + n^{-1/2}p(x/σ)φ(x/σ) + O(n^{-1})

and

Ĥ(x) = P(U* ≤ x | X) = Φ(x/σ̂) + n^{-1/2}p̂(x/σ̂)φ(x/σ̂) + O_p(n^{-1}),

respectively, where p is a polynomial, p̂ is obtained from p on replacing unknowns
by their bootstrap estimators, σ² equals the asymptotic variance of U, σ̂² is the
bootstrap estimator of σ², and U* is the bootstrap version of U. Again, p̂ - p =
O_p(n^{-1/2}), and also σ̂ - σ = O_p(n^{-1/2}), whence

Ĥ(x) - H(x) = Φ(x/σ̂) - Φ(x/σ) + O_p(n^{-1}).   (4.9)

Now, the difference between σ̂ and σ is usually of precise order n^{-1/2}. Indeed,
n^{1/2}(σ̂ - σ) typically has a limiting Normal N(0, ζ²) distribution, for some ζ > 0.
Thus, Φ(x/σ̂) - Φ(x/σ) is generally of size n^{-1/2}, not n^{-1}. Hence by (4.9), the
bootstrap approximation to H is in error by terms of size n^{-1/2}, not n^{-1}. This
relatively poor performance is due to the presence of σ in the limiting distribution
function Φ(x/σ), i.e. to the fact that U is not pivotal.
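The contrast can be seen in a small simulation. The sketch below is qualitative
only: the error orders are asymptotic, the "true" distribution functions are
themselves approximated by Monte Carlo, and all names and parameter values are
our assumptions.

```python
# Qualitative sketch: bootstrap estimation error for the pivotal statistic
# T = sqrt(n)(mean - mu)/sd versus the nonpivotal U = sqrt(n)(mean - mu),
# exponential(1) population.  'True' values of H and G are simulated.
import numpy as np

rng = np.random.default_rng(5)
n, B, x = 20, 1000, 0.5

def u_t(sample, centre):
    u = np.sqrt(n) * (sample.mean() - centre)
    return u, u / sample.std()

sims = np.array([u_t(rng.exponential(size=n), 1.0) for _ in range(50_000)])
H_true = (sims[:, 0] <= x).mean()          # H(x) = P(U <= x)
G_true = (sims[:, 1] <= x).mean()          # G(x) = P(T <= x)

err_H, err_G = [], []
for _ in range(100):                       # average over independent samples
    s = rng.exponential(size=n)
    boot = np.array([u_t(rng.choice(s, n, replace=True), s.mean())
                     for _ in range(B)])
    err_H.append((boot[:, 0] <= x).mean() - H_true)
    err_G.append((boot[:, 1] <= x).mean() - G_true)
rms = lambda e: np.sqrt(np.mean(np.square(e)))
print(rms(err_H), rms(err_G))   # nonpivotal error typically exceeds pivotal
```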
Expansions such as (4.8) may be developed under the smooth function model,
and analogues of Theorem 4.1 are available in the bootstrap case. For example, let
us return to the notation introduced just prior to that theorem, and introduce
additionally the definitions X̄* = n^{-1}Σ_i X_i*, θ̂* = g(X̄*) and σ̂*² = h(X̄*), where
X* = {X_1*, …, X_n*} denotes a resample drawn randomly, with replacement, from
X = {X_1, …, X_n}. Then under the same conditions as in Theorem 4.1, except that the
moment condition should be strengthened a little, we have the following analogues
of (4.3) and (4.4) respectively:

P{n^{1/2}(θ̂* - θ̂)/σ̂ ≤ x | X} = Φ(x) + n^{-1/2}p̂_1(x)φ(x) + ⋯
  + n^{-j/2}p̂_j(x)φ(x) + o_p(n^{-j/2})   (4.10)

and

P{n^{1/2}(θ̂* - θ̂)/σ̂* ≤ x | X} = Φ(x) + n^{-1/2}q̂_1(x)φ(x) + ⋯
  + n^{-j/2}q̂_j(x)φ(x) + o_p(n^{-j/2}).   (4.11)

Bootstrap Edgeworth expansions may be inverted in much the same way as
ordinary Edgeworth expansions, to obtain bootstrap Cornish-Fisher expansions.
For example, the quantiles arising from inversion of the bootstrap expansions (4.10)
and (4.11) are

û_α = z_α + n^{-1/2}p̂_{11}(z_α) + n^{-1}p̂_{21}(z_α) + ⋯ + n^{-j/2}p̂_{j1}(z_α) + ⋯   (4.12)

and

v̂_α = z_α + n^{-1/2}q̂_{11}(z_α) + n^{-1}q̂_{21}(z_α) + ⋯ + n^{-j/2}q̂_{j1}(z_α) + ⋯.   (4.13)

Here, p̂_{j1} and q̂_{j1} differ from p_{j1} and q_{j1}, appearing in (4.5) and (4.6), only in that F_0
is replaced by F_1; that is, population moments are replaced by sample moments.
Of course, Cornish-Fisher expansions are to be interpreted as asymptotic series
and apply uniformly in values of α bounded away from zero and one. For example,

sup_{ε<α<1-ε} n^{j/2} |û_α - {z_α + n^{-1/2}p̂_{11}(z_α) + ⋯ + n^{-j/2}p̂_{j1}(z_α)}| → 0

almost surely (and hence also in probability) as n → ∞, for each 0 < ε < 1/2.
A key assumption underlying these results is smoothness of the sampling
distribution. For example, under the smooth function model introduced in
Section 4.2, the sampling distribution would typically be required to satisfy
Cramér's condition.

4.4. Different versions of bootstrap confidence intervals

We begin with a little notation. Let F_0 denote the population distribution function,
θ(·) a functional of distribution functions, and θ_0 = θ(F_0) a true parameter value,
such as the rth power of a population mean, θ_0 = {∫x dF_0(x)}^r. Write F_1 for the
distribution function of a sample X drawn from the population. Interpretations
of F_1 differ for parametric and nonparametric cases; see Section 2. The bootstrap
estimator of θ_0 is θ̂ = θ(F_1). Define F_2 to be the distribution function of a resample
X* drawn from the population with distribution function F_1. Again, F_2 is different
in parametric and nonparametric settings.
A theoretical α-level percentile confidence interval for θ_0,

I_1 = (-∞, θ̂ + t_0),

is obtained by solving the population equation (2.1) for t = t_0, using the function

f_t(F_0, F_1) = I{θ(F_0) ≤ θ(F_1) + t} - α.

Thus, t_0 is defined by

P(θ_0 ≤ θ̂ + t_0) = α.

The bootstrap version of this interval is Î_1 = (-∞, θ̂ + t̂_0), where t = t̂_0 is the
solution of the sample equation (2.4). Equivalently, t̂_0 is given by

P{θ(F_1) ≤ θ(F_2) + t̂_0 | F_1} = α.

We call Î_1 a bootstrap percentile confidence interval, or simply a percentile interval.

To construct a percentile-t confidence interval for θ_0, define σ²(F_0) to be the
asymptotic variance of n^{1/2}θ̂, and put σ̂² = σ²(F_1). A theoretical α-level percentile-t
confidence interval is J_1 = (-∞, θ̂ + t_0 σ̂), where on the present occasion t_0 is given
by

P(θ_0 ≤ θ̂ + t_0 σ̂) = α.

This is equivalent to solving the population equation (2.1) with

f_t(F_0, F_1) = I{θ(F_0) ≤ θ(F_1) + t σ(F_1)} - α.

The bootstrap interval is obtained by solving the corresponding sample equation,
and is Ĵ_1 = (-∞, θ̂ + t̂_0 σ̂), where t̂_0 is now defined by

P{θ(F_1) ≤ θ(F_2) + t̂_0 σ(F_2) | F_1} = α.

To simplify notation in future sections we shall often denote θ(F_2) and σ(F_2) by θ̂*
and σ̂*, respectively.
Exposition will be clearer if we represent t_0 and t̂_0 in terms of quantiles. Thus,
we define u_α, v_α, û_α, and v̂_α by

P[n^{1/2}{θ(F_1) - θ(F_0)}/σ(F_0) ≤ u_α] = P[n^{1/2}{θ(F_1) - θ(F_0)}/σ(F_1) ≤ v_α] = α   (4.14)

and

P[n^{1/2}{θ(F_2) - θ(F_1)}/σ(F_1) ≤ û_α | F_1] = P[n^{1/2}{θ(F_2) - θ(F_1)}/σ(F_2) ≤ v̂_α | F_1] = α.   (4.15)

Write σ = σ(F_0) and σ̂ = σ(F_1). Then definitions of I_1, J_1, Î_1, and Ĵ_1 equivalent to
those given earlier are

I_1 = (-∞, θ̂ - n^{-1/2}σu_{1-α}),    J_1 = (-∞, θ̂ - n^{-1/2}σ̂v_{1-α}),
Î_1 = (-∞, θ̂ - n^{-1/2}σ̂û_{1-α}),   Ĵ_1 = (-∞, θ̂ - n^{-1/2}σ̂v̂_{1-α}).

All are confidence intervals for θ_0, with coverage probabilities approximately equal
to α.
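In the nonparametric case the bootstrap versions Î_1 and Ĵ_1 are straightforward to
approximate by simulation. The sketch below is our construction, with θ a mean and
σ(F) a standard deviation.

```python
# Sketch of the one-sided percentile (I1-hat) and percentile-t (J1-hat)
# intervals defined above, with theta = mean, sigma(F) = standard deviation.
import numpy as np

rng = np.random.default_rng(6)

def one_sided(x, alpha=0.95, B=1999):
    n = len(x)
    th, sig = x.mean(), x.std()
    u_star, v_star = np.empty(B), np.empty(B)
    for b in range(B):
        xb = rng.choice(x, n, replace=True)
        num = np.sqrt(n) * (xb.mean() - th)   # n^{1/2}{theta(F_2) - theta(F_1)}
        u_star[b] = num / sig                 # scaled by sigma(F_1): u-hat
        v_star[b] = num / xb.std()            # scaled by sigma(F_2): v-hat
    u1a = np.quantile(u_star, 1 - alpha)
    v1a = np.quantile(v_star, 1 - alpha)
    I1 = (-np.inf, th - sig * u1a / np.sqrt(n))   # percentile interval
    J1 = (-np.inf, th - sig * v1a / np.sqrt(n))   # percentile-t interval
    return I1, J1

x = rng.lognormal(size=40)
print(one_sided(x))
```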
In the nonparametric case the statistic θ(F_2), conditional on F_1, has a discrete
distribution. This means that equations (4.14) and (4.15) will usually not have exact
solutions, although as we point out in Section 1.3 and Appendix I, the errors due
to discreteness are exponentially small functions of n. The reader concerned by the
problem of discreteness might like to define û_α and v̂_α by

û_α = inf{u : P[n^{1/2}{θ(F_2) - θ(F_1)}/σ(F_1) ≤ u | F_1] ≥ α}

and

v̂_α = inf{v : P[n^{1/2}{θ(F_2) - θ(F_1)}/σ(F_2) ≤ v | F_1] ≥ α}.
Two-sided, equal-tailed confidence intervals are constructed by forming the
intersection of two one-sided intervals. Two-sided analogues of I_1 and J_1 are

I_2 = (θ̂ - n^{-1/2}σu_{(1+α)/2}, θ̂ - n^{-1/2}σu_{(1-α)/2})

and

J_2 = (θ̂ - n^{-1/2}σ̂v_{(1+α)/2}, θ̂ - n^{-1/2}σ̂v_{(1-α)/2}),

respectively, with bootstrap versions

Î_2 = (θ̂ - n^{-1/2}σ̂û_{(1+α)/2}, θ̂ - n^{-1/2}σ̂û_{(1-α)/2})

and

Ĵ_2 = (θ̂ - n^{-1/2}σ̂v̂_{(1+α)/2}, θ̂ - n^{-1/2}σ̂v̂_{(1-α)/2}).

The intervals I_2 and J_2 have equal probability in each tail; for example,

P(θ_0 ≤ θ̂ - n^{-1/2}σu_{(1+α)/2}) = P(θ_0 > θ̂ - n^{-1/2}σu_{(1-α)/2}) = (1 - α)/2.

Intervals Î_2 and Ĵ_2 have approximately the same level of probability in each tail,
and are called two-sided, equal-tailed confidence intervals. Two-sided symmetric
intervals were discussed in Section 2.
All the intervals defined above have at least asymptotic coverage α, in the sense
that if ℐ is any one of the intervals,

P(θ_0 ∈ ℐ) → α

as n → ∞. As before, we call α the nominal coverage of the confidence interval ℐ.
The coverage error of ℐ is the difference between true coverage and nominal
coverage,

coverage error = P(θ_0 ∈ ℐ) - α.

4.5. Order of correctness of bootstrap approximations to critical points

The α-level quantiles of the distributions of S = n^{1/2}(θ̂ - θ_0)/σ and T = n^{1/2}(θ̂ - θ_0)/σ̂
are u_α and v_α, respectively, with bootstrap estimates û_α and v̂_α. Subtracting
expansions (4.5) and (4.6) from (4.12) and (4.13) we deduce that

û_α - u_α = n^{-1/2}{p̂_{11}(z_α) - p_{11}(z_α)} + n^{-1}{p̂_{21}(z_α) - p_{21}(z_α)} + ⋯   (4.16)

and

v̂_α - v_α = n^{-1/2}{q̂_{11}(z_α) - q_{11}(z_α)} + n^{-1}{q̂_{21}(z_α) - q_{21}(z_α)} + ⋯.

Now, the polynomial p̂_{j1} is obtained from p_{j1} on replacing population moments by
sample moments, and the latter are distant O_p(n^{-1/2}) from their population counter-
parts. Therefore p̂_{j1} is distant O_p(n^{-1/2}) from p_{j1}. Thus, by (4.16),

û_α - u_α = O_p(n^{-1/2}·n^{-1/2} + n^{-1}) = O_p(n^{-1}),

and similarly v̂_α - v_α = O_p(n^{-1}).


This establishes one of the important properties of bootstrap, or sample, critical
points: the bootstrap estimates of U, and u, are in error by only order n-. In
comparison, the traditional Normal approximation argues that u, and v, are both
close to z, and is in error by n-l; for example,

z, - u, = z, - {za+ n -12p11(z,)+ . ..} = -n-2pll(Z,)+O(n-).

Approximation by Students t distribution hardly improves on the Normal approx-


imation, since the g-level quantile t, of Students t distribution with n - v degrees
of freedom (for any fixed v) is distant order n- , not order n- 12, away from z,. Thus,
the bootstrap has definite advantages over traditional methods employed to approx-
imate critical points.
This property of the bootstrap will only benefit us if we use bootstrap critical
points in the right way. To appreciate the importance of this remark, go back to
the definitions of the confidence intervals I_1, J_1, Î_1, and Ĵ_1 given in Section 4.4. Since
v̂_{1-α} = v_{1-α} + O_p(n^{-1}), the upper endpoint of the interval Ĵ_1 = (-∞, θ̂ - n^{-1/2}σ̂v̂_{1-α})
differs from the upper endpoint of J_1 = (-∞, θ̂ - n^{-1/2}σ̂v_{1-α}) by only O_p(n^{-3/2}).
We say that Ĵ_1 is second-order correct for J_1, and that θ̂ - n^{-1/2}σ̂v̂_{1-α} is second-
order correct for θ̂ - n^{-1/2}σ̂v_{1-α}, since the latter two quantities are in agreement up to
and including terms of order (n^{-1/2})² = n^{-1}. In contrast, Î_1 = (-∞, θ̂ - n^{-1/2}σ̂û_{1-α})
is generally only first-order correct for I_1 = (-∞, θ̂ - n^{-1/2}σu_{1-α}), since the upper
endpoints agree only in terms of order n^{-1/2}, not n^{-1}:

(θ̂ - n^{-1/2}σ̂û_{1-α}) - (θ̂ - n^{-1/2}σu_{1-α}) = n^{-1/2}(σu_{1-α} - σ̂û_{1-α})

  = n^{-1/2}u_{1-α}(σ - σ̂) + O_p(n^{-3/2}),

and (usually) n^{1/2}(σ̂ - σ) is asymptotically Normally distributed with zero mean and
nonzero variance. Likewise, Î_1 is usually only first-order correct for J_1, since terms
of order n^{-1/2} in Cornish-Fisher expansions of û_{1-α} and v_{1-α} generally do not
agree:

(θ̂ - n^{-1/2}σ̂û_{1-α}) - (θ̂ - n^{-1/2}σ̂v_{1-α}) = n^{-1/2}σ̂(v_{1-α} - û_{1-α})

  = n^{-1}σ̂{p_1(z_α) - q_1(z_α)} + O_p(n^{-3/2}).

However, there do exist circumstances where p_1 and q_1 are identical, in which case
it follows from the formula above that Î_1 is second-order correct for J_1. Estimation of
slope in regression provides an example and will be discussed in Section 4.7.
As we shall show in Section 4.6, these correctness properties of critical points
have important consequences for coverage accuracy. The second-order correct
confidence interval Ĵ_1 has coverage error of order n^{-1}, whereas the first-order
correct interval Î_1 has coverage error of size n^{-1/2}.
As noted in Section 4.4, the intervals Î_1, Ĵ_1 represent bootstrap versions of I_1, J_1,
respectively. Recall from Example 2.2 of Section 2 that percentile intervals such as I_1
are based on the nonpivotal statistic θ̂ - θ_0, and that it is this nonpivotalness that
causes the asymptotic standard deviation σ to appear in the definition of I_1. That
is why Î_1 is not second-order correct for I_1, and so our problems may be traced
back to the issue of pivotalness raised in Section 2. The percentile-t interval Ĵ_1 is based
on the (asymptotically) pivotal statistic (θ̂ - θ_0)/σ̂, hence its virtuous properties.
If the asymptotic variance σ² should be known then we may use it to standardize,
and construct confidence intervals based on (θ̂ - θ_0)/σ (which is asymptotically
pivotal), instead of on either θ̂ - θ_0 or (θ̂ - θ_0)/σ̂. Application of the principle
enunciated in Section 2 now produces the interval

Ĩ_1 = (-∞, θ̂ - n^{-1/2}σû_{1-α}),

which is second-order correct for I_1 and has coverage error of order n^{-1}.


We should warn the reader not to read too much into the notion of correctness
order for confidence intervals. While it is true that second-order correctness is a
desirable property, and that intervals that fail to exhibit it do not correct even for
the elementary skewness errors in Normal approximations, it does not follow that
we should seek third- or fourth-order correct confidence intervals. Indeed, such
intervals are usually unattainable as the next paragraph will show. Techniques such
as bootstrap iteration, which reduce the order of coverage error, do not accomplish
this goal by achieving high orders of correction but rather by adjusting the error
of size n- 32 inherent to almost all sample-based critical points.
Recall from (4.6) that

v1 _L1= z1 -a +.n -12qll(Z1_a)+II-~q21(Z1_a)+ ...



Coefficients of the polynomial q_{11} are usually unknown quantities. In view of results
such as the Cramér-Rao lower bound (e.g. Cox and Hinkley 1974, pp. 254ff), the
coefficients cannot be estimated with an accuracy better than order n^{-1/2}. This
means that v_{1-α} cannot be estimated with an accuracy better than order n^{-1}, and
that the upper endpoint of the confidence interval J_1 = (-∞, θ̂ - n^{-1/2}σ̂v_{1-α})
cannot be estimated with an accuracy better than order n^{-3/2}. Therefore, except in
unusual circumstances, any practical confidence interval Ĵ_1 that tries to emulate J_1
will have an endpoint differing in a term of order n^{-3/2} from that of J_1, and so will
not be third-order correct. Exceptional circumstances are those where we have
enough parametric information about the population to know the coefficients of
q_{11}. For example, in the case of estimating a mean, q_{11} vanishes if the underlying
population is symmetric. If we know that the population is symmetric, we may
construct confidence intervals that are better than second-order correct. For
example, we may resample in a way that ensures that the bootstrap distribution is
symmetric, by sampling with replacement from the collection {±(X_1 - X̄), …,
±(X_n - X̄)} rather than {X_1 - X̄, …, X_n - X̄}. But in most problems, both para-
metric and nonparametric, second-order correctness is the best we can hope to
achieve.

4.6. Coverage error of confidence intervals

In this section we show how to apply the Edgeworth and Cornish-Fisher expansion
formulae developed in Sections 4.2 and 4.3 to derive expressions for the coverage
accuracy of bootstrap confidence intervals. It is convenient to focus attention
initially on the case of one-sided intervals and to progress from there to the
two-sided case.
A general one-sided confidence interval for θ_0 may be expressed as 𝒴_1 = (-∞,
θ̂ + t̂), where t̂ is determined from the data. In most circumstances, if 𝒴_1 has nominal
coverage α then t̂ admits the representation

t̂ = n^{-1/2}σ̂(z_α + ĉ_α),   (4.17)

where ĉ_α is a random variable converging to zero as n → ∞. For example, this
would typically be the case if T = n^{1/2}(θ̂ - θ_0)/σ̂ had an asymptotic Standard Normal
distribution. However, should the value σ² of asymptotic variance be known, we
would most likely use an interval in which t̂ had the form

t̂ = n^{-1/2}σ(z_α + ĉ_α).

Intervals Î_1, J_1, and Ĵ_1 (defined in Section 4.4) are of the former type, with ĉ_α in
(4.17) assuming the respective values -û_{1-α} - z_α, -v_{1-α} - z_α, -v̂_{1-α} - z_α.
So also are the Normal approximation interval (-∞, θ̂ + n^{-1/2}σ̂z_α) and the Student's
t approximation interval (-∞, θ̂ + n^{-1/2}σ̂t_α), where t_α is the α-level quantile of
Student's t distribution with n - 1 degrees of freedom. Interval Ĩ_1 is of the latter
type. The main purpose of the correction term ĉ_α is to adjust for skewness. To a
lesser extent it corrects for higher-order departures from Normality.
Suppose that t̂ is of the form (4.17). Then the coverage probability is

α_{1,n} = P(θ_0 ∈ 𝒴_1) = P{θ_0 ≤ θ̂ + n^{-1/2}σ̂(z_α + ĉ_α)}

  = 1 - P{n^{1/2}(θ̂ - θ_0)σ̂^{-1} + ĉ_α < -z_α}.   (4.18)

We wish to develop an expansion of this probability. For that purpose it is necessary
to have an Edgeworth expansion of the distribution function of

n^{1/2}(θ̂ - θ_0)σ̂^{-1} + ĉ_α,

or at least a good approximation to it. In some circumstances, ĉ_α is easy to work
with directly; for example, ĉ_α = 0 in the case of the Normal approximation interval.
But for bootstrap intervals, ĉ_α is defined only implicitly as the solution of an
equation, and that makes it rather difficult to handle. So we first approximate it by
a Cornish-Fisher expansion.
Suppose that

ĉ_α = n^{-1/2}ŝ_1(z_α) + n^{-1}ŝ_2(z_α) + O_p(n^{-3/2}),   (4.19)

where s_1 and s_2 are polynomials with coefficients equal to polynomials in
population moments, and ŝ_j is obtained from s_j on replacing population moments
by sample moments. Then ŝ_j = s_j + O_p(n^{-1/2}) and

P{n^{1/2}(θ̂ - θ_0)σ̂^{-1} + ĉ_α ≤ x}
  = P[n^{1/2}(θ̂ - θ_0)σ̂^{-1} + n^{-1/2}{ŝ_1(z_α) - s_1(z_α)}
      ≤ x - Σ_{j=1}^{2} n^{-j/2}s_j(z_α)] + O(n^{-3/2}).   (4.20)

Here we have used the delta method.

Here we have used the delta method.


Therefore, to evaluate the coverage probability al,n at (4.18) up to a remainder
of order ne3j2, we need only derive an Edgeworth expansion of the distribution
function of

S, = n(& 8,)8- + n-A,, (4.21)

where A,, = ni2{s*l(z,) - s,(z,)}. That is usually simpler than finding an Edgeworth
expansion for n l*@- 8,)K l + c^,.

Put T_n = n^{1/2}(θ̂ - θ_0)/σ̂ and Δ_α = n^{1/2}{ŝ_1(z_α) - s_1(z_α)}, and let a_α denote the real
number such that

E(T_n Δ_α) = E[n^{1/2}(θ̂ - θ_0)σ̂^{-1} n^{1/2}{ŝ_1(z_α) - s_1(z_α)}]

  = a_α + O(n^{-1}).   (4.22)

If s_1 is an even polynomial of degree 2, which would typically be the case, then
a_α = π(z_α), where π is an even polynomial of degree 2 with coefficients not depending
on α. Then it may be shown that

P(S_α ≤ x) = P(T_n ≤ x) - n^{-1}a_α x φ(x) + O(n^{-3/2}).   (4.23)

It is now a simple matter to obtain expansions of coverage probability for our
general one-sided confidence interval 𝒴_1. Taking x = -z_α in (4.20), and noting (4.18)
and (4.23), we conclude that if ĉ_α is given by (4.19) then the confidence interval

𝒴_1 = (-∞, θ̂ + n^{-1/2}σ̂(z_α + ĉ_α))

has coverage probability

α_{1,n} = P[n^{1/2}(θ̂ - θ_0)σ̂^{-1} > -z_α - Σ_{j=1}^{2} n^{-j/2}s_j(z_α)]
  - n^{-1}a_α z_α φ(z_α) + O(n^{-3/2}).

Assuming the usual Edgeworth expansion for a Studentized statistic, i.e.

P{n^{1/2}(θ̂ - θ_0)/σ̂ ≤ x} = Φ(x) + n^{-1/2}q_1(x)φ(x) + n^{-1}q_2(x)φ(x) + O(n^{-3/2}),

putting x = -z_α - Σ_{j=1,2} n^{-j/2}s_j(z_α), and Taylor-expanding, we finally obtain

α_{1,n} = α + n^{-1/2}{s_1(z_α) - q_1(z_α)}φ(z_α) + n^{-1}[q_2(z_α) + s_2(z_α) - (1/2)z_α s_1(z_α)²
  + s_1(z_α){z_α q_1(z_α) - q_1'(z_α)} - a_α z_α]φ(z_α) + O(n^{-3/2}).   (4.24)

(Remember that q_j is an odd/even function for even/odd j, respectively.)
The function s_1 is generally an even polynomial and s_2 an odd polynomial. This
follows from the Cornish-Fisher expansion (4.19) of ĉ_α. Since s_1 is an even function,
then by the definition of a_α at (4.22), this quantity equals an even polynomial in z_α.
Therefore the coefficient of n^{-1}φ(z_α) in (4.24) is an odd polynomial in z_α. The
coefficient of n^{-1/2}φ(z_α) is clearly an even polynomial.

There is no difficulty developing the expansion (4.24) to an arbitrary number of
terms, obtaining a series in powers of n^{-1/2} where the coefficient of n^{-j/2}φ(z_α) equals
an odd or even polynomial depending on whether j is even or odd. The following
proposition summarizes that result.

Proposition 4.1

Consider the confidence interval

𝒴_1 = 𝒴_1(α) = (-∞, θ̂ + n^{-1/2}σ̂(z_α + ĉ_α)),

where

ĉ_α = n^{-1/2}ŝ_1(z_α) + n^{-1}ŝ_2(z_α) + ⋯,

the ŝ_j's are obtained from polynomials s_j on replacing population moments
by sample moments, and odd/even indexed s_j's are even/odd polynomials, respec-
tively. Suppose

P{n^{1/2}(θ̂ - θ_0)/σ̂ ≤ x} = Φ(x) + n^{-1/2}q_1(x)φ(x) + n^{-1}q_2(x)φ(x) + ⋯,

where odd/even indexed polynomials q_j are even/odd functions, respectively. Then

P{θ_0 ∈ 𝒴_1(α)} = α + n^{-1/2}r_1(z_α)φ(z_α) + n^{-1}r_2(z_α)φ(z_α) + ⋯,   (4.25)

where odd/even indexed polynomials r_j are even/odd functions, respectively. In
particular,

r_1 = s_1 - q_1

and

r_2(z_α) = q_2(z_α) + s_2(z_α) - (1/2)z_α s_1(z_α)² + s_1(z_α){z_α q_1(z_α) - q_1'(z_α)} - a_α z_α,

where a_α is defined at (4.22).

The coverage expansion (4.25) should of course be interpreted as an asymptotic
series. It is often not available uniformly in α, but does hold uniformly in
ε ≤ α ≤ 1 - ε for any 0 < ε < 1/2. However, if ĉ_α is monotone increasing in α then (4.25)
will typically be available uniformly in 0 < α < 1.
An immediate consequence of (4.25) is that a necessary and sufficient condition for
our confidence interval 𝒴_1 to have coverage error of order n^{-1} for all values of α is
that it be second-order correct relative to the interval J_1, defined in Section 4.4. To
appreciate why, go back to the definition (4.19) of ĉ_α, which implies that

𝒴_1 = (-∞, θ̂ + n^{-1/2}σ̂(z_α + ĉ_α))

  = (-∞, θ̂ + n^{-1/2}σ̂{z_α + n^{-1/2}ŝ_1(z_α)} + O_p(n^{-3/2})).

Since q_{11} = -q_1 (see Section 4.2), then

J_1 = (-∞, θ̂ - n^{-1/2}σ̂v_{1-α})
  = (-∞, θ̂ - n^{-1/2}σ̂[z_{1-α} + n^{-1/2}q_{11}(z_{1-α})] + O_p(n^{-3/2}))

  = (-∞, θ̂ + n^{-1/2}σ̂[z_α + n^{-1/2}q_1(z_α)] + O_p(n^{-3/2})).

The upper endpoint of this interval agrees with that of 𝒴_1 in terms of order n^{-1},
for all α, if and only if s_1 = q_1; that is, if and only if the term of order n^{-1/2} vanishes
from (4.25). Therefore the second-order correct interval Ĵ_1 has coverage error of
order n^{-1}, but the interval Î_1, which is only first-order correct, has coverage error
of size n^{-1/2} except in special circumstances.
So far we have worked only with confidence intervals of the form (-∞, θ̂ + t̂)
where t̂ = n^{-1/2}σ̂(z_α + ĉ_α) and ĉ_α is given by (4.19). Should the value σ² of asymptotic
variance be known, then we would most likely construct confidence intervals using
t̂ = n^{-1/2}σ(z_α + ĉ_α), again for ĉ_α given by (4.19). This case may be treated by
reworking the arguments above. We should change the symbol q to p at each
appearance, because we are now working with the Edgeworth expansion (4.3) rather
than (4.4). With this alteration, formula (4.24) for coverage probability continues to
apply:

P{θ_0 ∈ (-∞, θ̂ + n^{-1/2}σ(z_α + ĉ_α))} = α + n^{-1/2}{s_1(z_α) - p_1(z_α)}φ(z_α)

  + n^{-1}[p_2(z_α) + s_2(z_α) - (1/2)z_α s_1(z_α)²

  + s_1(z_α){z_α p_1(z_α) - p_1'(z_α)} - a_α z_α]
  × φ(z_α) + O(n^{-3/2}).

(Our definition of a_α at (4.22) is unaffected if σ̂^{-1} is replaced by σ^{-1}, since
σ̂^{-1} = σ^{-1} + O_p(n^{-1/2}).) Likewise, the analogue of Proposition 4.1 is valid; it is
necessary only to replace σ̂ by σ in the definition of 𝒴_1 and q_j by p_j at all
appearances of the former. Therefore our conclusions in the case where σ is known
are similar to those when σ is unknown: a necessary and sufficient condition for the
confidence interval (-∞, θ̂ + n^{-1/2}σ(z_α + ĉ_α)) to have coverage error of order n^{-1}
for all values of α is that it be second-order correct relative to I_1.
Similarly it may be proved that if 𝒴_1 is jth-order correct relative to a one-sided
confidence interval 𝒴_1', meaning that the upper endpoints agree in terms of size
n^{-j/2} or larger, then 𝒴_1 and 𝒴_1' have the same coverage probability up to but not
necessarily including terms of order n^{-j/2}. The converse of this result is false for
j ≥ 3. Indeed, there are many important examples of confidence intervals whose
coverage errors differ by O(n^{-3/2}) but which are not third-order correct relative to
one another.
Coverage properties of two-sided confidence intervals are rather different from
those in the one-sided case. For two-sided intervals, parity properties of polynomials
in expansions such as (4.25) cause terms of order n^{-1/2} to cancel completely from
expansions of coverage error. Therefore coverage error is always of order n^{-1} or
smaller, even for the most basic Normal approximation method. In the case of
symmetric two-sided intervals constructed using the percentile-t bootstrap, coverage
error is of order n^{-2}. The remainder of the present section will treat two-sided
equal-tailed intervals.
We begin by recalling our definition of the general one-sided interval 𝒴_1 = 𝒴_1(α)
whose nominal coverage is α:

𝒴_1(α) = (-∞, θ̂ + n^{-1/2}σ̂(z_α + ĉ_α)).

The equal-tailed interval based on this scheme and having nominal coverage α is

𝒴_2(α) = 𝒴_1((1 + α)/2) \ 𝒴_1((1 - α)/2)

  = (θ̂ + n^{-1/2}σ̂(z_{(1-α)/2} + ĉ_{(1-α)/2}), θ̂ + n^{-1/2}σ̂(z_{(1+α)/2} + ĉ_{(1+α)/2})).   (4.26)

(Here 𝒴\ℋ denotes the intersection of set 𝒴 with the complement of set ℋ.) Apply
Proposition 4.1 with z = z_{(1+α)/2} = -z_{(1-α)/2}, noting particularly that r_j is an odd
or even function according as j is even or odd, to obtain an expansion of the
coverage probability of 𝒴_2(α):

α_{2,n} = P{θ_0 ∈ 𝒴_2(α)}

  = Φ(z) + n^{-1/2}r_1(z)φ(z) + n^{-1}r_2(z)φ(z) + ⋯

  - {Φ(-z) + n^{-1/2}r_1(-z)φ(-z) + n^{-1}r_2(-z)φ(-z) + ⋯}

  = α + 2n^{-1}r_2(z)φ(z) + 2n^{-2}r_4(z)φ(z) + ⋯

  = α + 2n^{-1}[q_2(z) + s_2(z) - (1/2)z s_1(z)² + s_1(z){z q_1(z) - q_1'(z)} - a_{(1+α)/2} z]

  × φ(z) + O(n^{-2}).   (4.27)

The property of second-order correctness, which as we have seen is equivalent to
s_1 = q_1, has relatively little effect on the coverage probability in (4.27). This contrasts
with the case of one-sided confidence intervals.

For percentile confidence intervals,

s_1 = -p_{11} = p_1   (4.28)

and

s_2(x) = p_{21}(x) = p_1(x)p_1'(x) - (1/2)x p_1(x)² - p_2(x),   (4.29)

while for percentile-t intervals,

s_1 = -q_{11} = q_1   (4.30)

and

s_2(x) = q_{21}(x) = q_1(x)q_1'(x) - (1/2)x q_1(x)² - q_2(x).   (4.31)

There is no significant simplification of (4.27) when (4.28) and (4.29) are used to
express s_1 and s_2. However, in the percentile-t case we see from (4.27), (4.30) and
(4.31) that

α_{2,n} = α - 2n^{-1} a_{(1+α)/2} z_{(1+α)/2} φ(z_{(1+α)/2}) + O(n^{-2}),

which represents a substantial simplification.


When the asymptotic variance σ² is known, our formula for equal-tailed,
two-sided, α-level confidence intervals should be changed from that in (4.26) to

(θ̂ + n^{-1/2}σ(z_{(1-α)/2} + ĉ_{(1-α)/2}), θ̂ + n^{-1/2}σ(z_{(1+α)/2} + ĉ_{(1+α)/2})),

for a suitable random function ĉ_α. If ĉ_α is given by (4.19) then the coverage probability
of this interval is given by (4.27), except that q should be changed to p at each
appearance in that formula. The value of a_{(1+α)/2} is unchanged.

4.7. Simple linear regression

In previous sections we drew attention to important properties of the bootstrap in
a wide range of statistical problems. We stressed the importance of pivoting. For
example, the coverage error of a one-sided percentile-t confidence interval is of size
n^{-1}, but the coverage error of an uncorrected one-sided percentile interval is of size
n^{-1/2}.
The good performance of a percentile-t interval is available in problems where
the variance of a parameter estimate may be estimated accurately. Many regression
problems are of this type. Thus, we might expect the endearing properties of
percentile-t to go over without change to the regression case. In a sense, this is true;
one-sided percentile-t confidence intervals for regression mean, intercept or slope
all have coverage error at most O(n^{-1}), whereas their percentile method counterparts
generally only have coverage error of size n^{-1/2}. However, this generalization
conceals several very important differences in the case of slope estimation. One-
sided percentile-t confidence intervals for slope have coverage error O(n^{-3/2}), not
O(n^{-1}); and the error is only O(n^{-2}) in the case of two-sided intervals.
These exceptional properties apply only to estimates of slope, not to estimates of
intercept or means. However, slope parameters are particularly important in the
study of regression, and our interpretation of "slope" is quite general. For example,
in the polynomial regression model

Y_i = c + x_i d_1 + ⋯ + x_i^m d_m + ε_i,   1 ≤ i ≤ n,   (4.32)

we regard each d_j as a slope parameter. A one-sided percentile-t confidence interval
for d_j has coverage error O(n^{-3/2}), although a one-sided percentile-t interval for c
or for

E(Y | x = x_0) = c + x_0 d_1 + ⋯ + x_0^m d_m

has coverage error of size n^{-1}.


The reason that slope parameters have this distinctive property is that the design
points xi confer a significant amount of extra symmetry. Note that we may rewrite
the model (4.32) as

Yi = C + (Xi - 5l)dl + .. + (X - 5,)d, + i l,<idn,

where<j=nmCxiandc=c+tldl+... + &,,d,. The extra symmetry arises from


the fact that

(4.j3)
i=l

For example, (4.33) implies that random variables Ci(xi - tj)$ and CE: are
uncorrelated for each triple (j,k, I) of nonnegative integers, and this symmetry
property is enough to establish the claimed performance of percentile-t confidence
intervals.
We shall deal only with the simple linear regression model. Multivariate problems
are similar in many respects, and the reader is referred to Chapter 4 of Hall (1992)
for a more general account in that context.

The simple linear model is

Y_i = c + x_i d + ε_i,   1 ≤ i ≤ n,

where c, d, x_i, Y_i, ε_i are scalars; c and d are unknown constants representing intercept
and slope, respectively; the ε_i's are independent and identically distributed random
variables with zero mean and variance σ²; and the x_i's are fixed design points. Put

ε̂_i = Y_i - Ȳ - (x_i - x̄)d̂,   x̄ = n^{-1}Σ_i x_i,

σ̂² = n^{-1}Σ_i ε̂_i²,   σ_x² = n^{-1} Σ_{i=1}^n (x_i - x̄)².

Then σ̂² estimates σ², and in this notation,

d̂ = σ_x^{-2} n^{-1} Σ_{i=1}^n (x_i - x̄)(Y_i - Ȳ),   ĉ = Ȳ - x̄d̂,

are the usual least-squares estimates of c and d. Since d̂ has variance n^{-1}σ_x^{-2}σ², the
statistic n^{1/2}(d̂ - d)σ_x/σ̂ is (asymptotically) pivotal. Define

Y_i* = ĉ + x_i d̂ + ε_i*,   1 ≤ i ≤ n,

where the ε_i*'s are generated by resampling randomly from the residuals ε̂_i.
Furthermore, ĉ*, d̂*, and σ̂* have the same formulae as ĉ, d̂, and σ̂, except that Y_i is
replaced by Y_i* at each appearance of the former.
Quantiles u_α and v_α of the distributions of

n^{1/2}(d̂ - d)σ_x/σ   and   n^{1/2}(d̂ - d)σ_x/σ̂

may be defined by

P{n^{1/2}(d̂ - d)σ_x/σ ≤ u_α} = P{n^{1/2}(d̂ - d)σ_x/σ̂ ≤ v_α} = α,

and their bootstrap estimates û_α and v̂_α by

P{n^{1/2}(d̂* - d̂)σ_x/σ̂ ≤ û_α | X} = P{n^{1/2}(d̂* - d̂)σ_x/σ̂* ≤ v̂_α | X} = α,

where X denotes the sample of pairs {(x_1, Y_1), …, (x_n, Y_n)}. In this notation,
one-sided percentile and percentile-t bootstrap confidence intervals for d are given
by

Î_1 = (-∞, d̂ - n^{-1/2}σ_x^{-1}σ̂û_{1-α}),
Ĵ_1 = (-∞, d̂ - n^{-1/2}σ_x^{-1}σ̂v̂_{1-α}),

respectively; compare Section 4.4. Each of these confidence intervals has nominal
coverage α. The percentile-t interval Ĵ_1 is the bootstrap version of an "ideal" interval

J_1 = (-∞, d̂ - n^{-1/2}σ_x^{-1}σ̂v_{1-α}).

Of course, each of the intervals has a two-sided counterpart.
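A sketch of the residual-resampling percentile-t interval Ĵ_1 for the slope follows
(our construction; the simulated design and error distribution are illustrative
assumptions).

```python
# Sketch of the residual-bootstrap percentile-t interval J1-hat for the slope d.
import numpy as np

rng = np.random.default_rng(7)

def slope_percentile_t(x, y, alpha=0.95, B=1999):
    n = len(x)
    xc = x - x.mean()
    sx = np.sqrt(np.mean(xc**2))                      # sigma_x
    d = np.mean(xc * (y - y.mean())) / sx**2          # least-squares slope
    c = y.mean() - x.mean() * d                       # intercept
    resid = y - c - x * d                             # residuals (mean zero)
    sig = np.sqrt(np.mean(resid**2))                  # sigma-hat
    t_star = np.empty(B)
    for b in range(B):
        ystar = c + x * d + rng.choice(resid, n, replace=True)
        dstar = np.mean(xc * (ystar - ystar.mean())) / sx**2
        rstar = (ystar - ystar.mean()) - xc * dstar   # resample residuals
        t_star[b] = np.sqrt(n) * (dstar - d) * sx / np.sqrt(np.mean(rstar**2))
    v1a = np.quantile(t_star, 1 - alpha)              # v-hat_{1-alpha}
    return (-np.inf, d - sig * v1a / (np.sqrt(n) * sx))

x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=30)
print(slope_percentile_t(x, y))
```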
Recall our definition that a one-sided confidence interval is second-order correct relative
to another if the (finite) endpoints of the intervals agree up to and including terms
of order (n^{-1/2})² = n^{-1}; see Section 4.5. It comes as no surprise to find that Ĵ_1 is
second-order correct for J_1, given what was learned in Section 4.5 about bootstrap
confidence intervals in more conventional problems. However, on the present
occasion Î_1 is also second-order correct for J_1, and that property is quite unusual.
It arises because Edgeworth expansions of the distributions of n^{1/2}(d̂ - d)σ_x/σ and
n^{1/2}(d̂ - d)σ_x/σ̂ contain identical terms of size n^{-1/2}; that is, Studentizing has no
effect on the first term in the expansion. This is a consequence of the extra symmetry
conferred by the presence of the design points x_i, as we shall show in the next
paragraph. The reason why second-order correctness follows from identical
formulae for the n^{-1/2} terms in expansions was made clear in Section 4.5.
Assume that σ_x² is bounded away from zero as n → ∞, and that max_{1≤i≤n}|x_i - x̄|
is bounded as n → ∞. (In refined versions of the proof below, this boundedness
condition may be replaced by a moment condition on the design points x_i, such as
sup_n n^{-1}Σ_i (x_i - x̄)⁴ < ∞.) Put ε̄ = n^{-1}Σ_i ε_i, and observe that

σ̂² = n^{-1} Σ_{i=1}^n ε̂_i² = n^{-1} Σ_{i=1}^n {ε_i - ε̄ - (x_i - x̄)(d̂ - d)}²

  = σ² + n^{-1} Σ_{i=1}^n (ε_i² - σ²) + O_p(n^{-1}).

Therefore, defining S = n^{1/2}(d̂ - d)σ_x/σ, T = n^{1/2}(d̂ - d)σ_x/σ̂, and

Δ = (1/2) n^{-1} σ^{-2} Σ_{i=1}^n (ε_i² - σ²),

we have

T = S(1 - Δ) + O_p(n^{-1}) = S + O_p(n^{-1/2}).   (4.34)

By making use of the fact that Σ_i (x_i - x̄) = 0 (this is where the extra symmetry
conferred by the design comes in) and of the representation

S = n^{-1/2} σ_x^{-1} σ^{-1} Σ_{i=1}^n (x_i - x̄) ε_i,

we may easily prove that E{S^j(1 - Δ)^j} - E(S^j) = O(n^{-1}) for j = 1, 2, 3. Therefore, the
first three cumulants of S and S(1 - Δ) agree up to and including terms of order
n^{-1/2}. Higher-order cumulants are of size n^{-1} or smaller. It follows that Edgeworth
expansions of the distributions of S and S(1 - Δ) differ only in terms of order n^{-1}.
In view of (4.34), the same is true for S and T:

P(S ≤ w) = P(T ≤ w) + O(n^{-1}).

(This step uses the delta method.) Therefore, Studentizing has no effect on the first
term in the expansion, as had to be proved.

References

Babu, G.J. and Singh, K. (1983) "Inference on Means Using the Bootstrap", Annals of Statistics, 11, 999-1003.
Babu, G.J. and Singh, K. (1984) "On One Term Correction by Efron's Bootstrap", Sankhya, Series A 46, 219-232.
Babu, G.J. and Singh, K. (1985) "Edgeworth Expansions for Sampling without Replacement from Finite Populations", Journal of Multivariate Analysis, 17, 261-278.
Barnard, G.A. (1963) "Contribution to Discussion", Journal of the Royal Statistical Society, Series B, 25, 294.
Beran, R. (1982) "Estimated Sampling Distributions: The Bootstrap and Competitors", Annals of Statistics, 10, 212-225.
Beran, R. (1987) "Prepivoting to Reduce Level Error of Confidence Sets", Biometrika, 74, 457-468.
Bhattacharya, R.N. and Ghosh, J.K. (1978) "On the Validity of the Formal Edgeworth Expansion", Annals of Statistics, 6, 434-451.
Bickel, P.J. and Freedman, D.A. (1980) "On Edgeworth Expansions and the Bootstrap". Unpublished manuscript.
Bickel, P.J. and Freedman, D.A. (1981) "Some Asymptotic Theory for the Bootstrap", Annals of Statistics, 9, 1196-1217.
Bickel, P.J. and Freedman, D.A. (1983) "Bootstrapping Regression Models with Many Parameters", in: P.J. Bickel, K.A. Doksum, and J.L. Hodges, Jr., eds., A Festschrift for Erich L. Lehmann. Belmont: Wadsworth, 28-48.
Bose, A. (1988) "Edgeworth Correction by Bootstrap in Autoregressions", Annals of Statistics, 16, 1709-1722.
Carlstein, E. (1986) "The Use of Subseries Methods for Estimating the Variance of a General Statistic from a Stationary Time Series", Annals of Statistics, 14, 1171-1179.
Chibishov, D.M. (1972) "An Asymptotic Expansion for the Distribution of a Statistic Admitting an Asymptotic Expansion", Theory of Probability and its Applications, 17, 620-630.
Chibishov, D.M. (1973a) "An Asymptotic Expansion for a Class of Estimators Containing Maximum Likelihood Estimators", Theory of Probability and its Applications, 18, 295-303.
Chibishov, D.M. (1973b) "An Asymptotic Expansion for the Distribution of Sums of a Special Form with an Application to Minimum-Contrast Estimates", Theory of Probability and its Applications, 18, 649-661.
Cox, D.R. and Hinkley, D.V. (1974) Theoretical Statistics. London: Chapman and Hall.

Davison, A.C. and Hall, P. (1993) "On Studentizing and Blocking Methods for Implementing the Bootstrap with Dependent Data", Australian Journal of Statistics, 35, 215-224.
Davison, A.C. and Hinkley, D.V. (1988) "Saddlepoint Approximations in Resampling Methods", Biometrika, 75, 417-431.
DiCiccio, T.J. and Romano, J.P. (1988) "A Review of Bootstrap Confidence Intervals (With Discussion)", Journal of the Royal Statistical Society, Series B 50, 338-354.
Efron, B. (1979) "Bootstrap Methods: Another Look at the Jackknife", Annals of Statistics, 7, 1-26.
Efron, B. (1983) "Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation", Journal of the American Statistical Association, 78, 316-331.
Efron, B. (1987) "Better Bootstrap Confidence Intervals (With Discussion)", Journal of the American Statistical Association, 82, 171-200.
Fisher, N.I. and Hall, P. (1991) "Bootstrap Algorithms for Small Samples", Journal of Statistical Planning and Inference, 27, 157-169.
Freedman, D.A. (1981) "Bootstrapping Regression Models", Annals of Statistics, 9, 1218-1228.
Freedman, D.A. and Peters, S.C. (1984) "Bootstrapping a Regression Equation: Some Empirical Results", Journal of the American Statistical Association, 79, 97-106.
Götze, F. and Künsch, H.R. (1990) "Blockwise Bootstrap for Dependent Observations: Higher Order Approximations for Studentized Statistics (Abstract)", Bulletin of the Institute of Mathematical Statistics, 19, 443.
Hall, P. (1985) "Resampling a Coverage Process", Stochastic Processes and their Applications, 19, 259-269.
Hall, P. (1986) "On the Bootstrap and Confidence Intervals", Annals of Statistics, 14, 1431-1452.
Hall, P. (1988a) "Theoretical Comparison of Bootstrap Confidence Intervals (With Discussion)", Annals of Statistics, 16, 927-985.
Hall, P. (1988b) "Unusual Properties of Bootstrap Confidence Intervals in Regression Problems", Probability Theory and Related Fields, 81, 247-273.
Hall, P. (1992) The Bootstrap and Edgeworth Expansion. New York: Springer.
Hall, P. and Horowitz, J.L. (1993) "Corrections and Blocking Rules for the Bootstrap with Dependent Data". Research Report no. CMA-SR11-93, Centre for Mathematics and its Applications, Australian National University.
Hall, P. and Martin, M.A. (1988) "On Bootstrap Resampling and Iteration", Biometrika, 75, 661-671.
Hartigan, J.A. (1969) "Using Subsample Values as Typical Values", Journal of the American Statistical Association, 64, 1303-1317.
Hartigan, J.A. (1971) "Error Analysis by Replaced Samples", Journal of the Royal Statistical Society, Series B 33, 98-110.
Hinkley, D.V. (1988) "Bootstrap Methods (With Discussion)", Journal of the Royal Statistical Society, Series B 50, 321-337.
Hope, A.C.A. (1968) "A Simplified Monte Carlo Significance Test Procedure", Journal of the Royal Statistical Society, Series B 30, 582-598.
Kendall, M.G. and Stuart, A. (1977) The Advanced Theory of Statistics, Vol. 1, 4th three-volume ed. London: Griffin.
Künsch, H.R. (1989) "The Jackknife and the Bootstrap for General Stationary Observations", Annals of Statistics, 17, 1217-1241.
Lahiri, S.N. (1991) "Second Order Optimality of Stationary Bootstrap", Statistics and Probability Letters, 11, 335-341.
Lahiri, S.N. (1992) "Edgeworth Correction by Moving Block Bootstrap for Stationary and Nonstationary Data", in: R. LePage and L. Billard, eds., Exploring the Limits of the Bootstrap. New York: Wiley, 183-214.
Liu, R.Y. and Singh, K. (1987) "On a Partial Correction by the Bootstrap", Annals of Statistics, 15, 1713-1718.
Loh, W.-Y. (1987) "Calibrating Confidence Coefficients", Journal of the American Statistical Association, 82, 155-162.
Marriott, F.H.C. (1979) "Barnard's Monte Carlo Tests: How Many Simulations?", Applied Statistics, 28, 75-77.
Martin, M.A. (1989) "On the Bootstrap and Confidence Intervals". Unpublished PhD thesis, Australian National University.
Peters, S.C. and Freedman, D.A. (1984a) "Bootstrapping an Econometric Model: Some Empirical Results", Journal of Business and Economic Statistics, 2, 150-158.

Peters, S.C. and Freedman, D.A. (1984b) "Some Notes on the Bootstrap in Regression Problems", Journal of Business and Economic Statistics, 2, 406-409.
Quenouille, M.H. (1949) "Approximate Tests of Correlation in Time-Series", Journal of the Royal Statistical Society, Series B 11, 68-84.
Quenouille, M.H. (1956) "Notes on Bias in Estimation", Biometrika, 43, 353-360.
Reid, N. (1988) "Saddlepoint Methods and Statistical Inference (With Discussion)", Statistical Science, 3, 213-238.
Robinson, J. (1987) "Nonparametric Confidence Intervals in Regression: The Bootstrap and Randomization Methods", in: M.L. Puri, J.P. Vilaplana, and W. Wertz, eds., New Perspectives in Theoretical and Applied Statistics. New York: Wiley, 243-256.
Sargan, J.D. (1975) "Gram-Charlier Approximations Applied to t Ratios of k-Class Estimators", Econometrica, 43, 327-346.
Sargan, J.D. (1976) "Econometric Estimators and the Edgeworth Approximation", Econometrica, 44, 421-448.
Simon, J.L. (1969) Basic Research Methods in Social Science. New York: Random House.
Singh, K. (1981) "On the Asymptotic Accuracy of Efron's Bootstrap", Annals of Statistics, 9, 1187-1195.
Tukey, J.W. (1958) "Bias and Confidence in Not-Quite Large Samples (Abstract)", Annals of Mathematical Statistics, 29, 614.
Chapter 40

CLASSICAL ESTIMATION METHODS FOR LDV
MODELS USING SIMULATION

VASSILIS A. HAJIVASSILIOU

Yale University

PAUL A. RUUD

University of California

Contents

1. Introduction 2384
2. Limited dependent variable models 2386
2.1. The latent normal regression model 2386
2.2. Censoring 2386
2.3. Truncation 2392
2.4. Mixtures 2394
2.5. Time series models 2395
2.6. Score functions 2397
2.7. The computational intractability of LDV models 2399
3. Simulation methods 2400
3.1. Overview 2400
3.2. Censored simulation 2402
3.3. Truncated simulation 2406
4. Simulation and estimation of LDV models 2408
4.1. Overview 2408
4.2. Simulation of the log-likelihood function 2412
4.3. Simulation of moment functions 2421
4.4. Simulation of the score function 2428
4.5. Bias corrections 2435
5. Conclusion 2437
6. Acknowledgements 2438
References 2438

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

1. Introduction

This chapter discusses classical estimation methods for limited dependent variable
(LDV) models that employ Monte Carlo simulation techniques to overcome compu-
tational problems in such models. These difficulties take the form of high-dimen-
sional integrals that need to be calculated repeatedly. In the past, investigators were
forced to restrict attention to special classes of LDV models that are computationally
manageable. The simulation estimation methods we discuss here make it possible to
estimate LDV models that are computationally intractable using classical estima-
tion methods.
We first review the ways in which LDV models arise, describing the differences
and similarities in censored and truncated data generating processes. Censoring and
truncation give rise to the troublesome multivariate integrals. Following the LDV
models, we describe various simulation methods for evaluating such integrals.
Naturally, censoring and truncation play roles in simulation as well. Finally, estima-
tion methods that rely on simulation are described. We review three general ap-
proaches that combine estimation of LDV models and simulation: simulation of
the log-likelihood function (MSL), simulation of moment functions (MSM), and
simulation of the score (MSS). The MSS is a combination of ideas from MSL and
MSM, treating the efficient score of the log-likelihood function as a moment
function.
One of the most familiar LDV models is the binomial probit model, which
specifies that the probability that a binomial random variable y equals one, conditional on the regression vector x, is Φ(x′β), where Φ(·) is the univariate standard normal cumulative distribution function (c.d.f.). Although this integral has no analytical expression, Φ has accurate, rapid, numerical approximations. These help
make maximum likelihood estimation of the binomial probit model straightforward
and most econometric software packages provide such estimation as a feature.
However, a simple and common extension of the binomial probit model renders
the resulting model too difficult for maximum likelihood computation. Introducing
correlation among the observations generally produces a likelihood function con-
taining integrals that cannot be well approximated and rapidly computed.
An example places the binomial probit model in the context of panel data in
which a cross-section of N experimental units (individuals or households) is observed
repeatedly, say in T consecutive time periods. Denote the binomial outcome for the
nth experimental unit in the tth time period by y_nt ∈ {0, 1}. In panel data sets, econometricians commonly expect correlation among the y_nt for the same n across different t, reflecting the presence of unobservable determinants of y_nt that evolve slowly for each experimental unit through time. In order to model such correlation parsimoniously, econometricians have adapted familiar models with correlation to the probit model. One can describe each y_nt as the transformation of a latent,

normally distributed y*_nt:

$$y_{nt} = 1\{y_{nt}^* > 0\}.$$

Then, one can assign the latent y*_nt a nonscalar covariance matrix appropriate to continuously distributed panel data. For example, stacking the y*_nt first by time period and then by experimental unit, a common specification of the covariance matrix is the variance components plus first-order autoregression model

$$\Omega = \sigma_\eta^2 \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \vdots & & & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & \rho & 1 \end{bmatrix} + \sigma_\nu^2 J_T, \qquad (1)$$

where J_T is the T × T matrix of ones (the error components behind this structure are spelled out in equation (13) below).

Now consider the impact of such nonscalar covariance matrices on the likelihood for the observed y_nt. Although the marginal probabilities that y_nt is zero or one are unchanged, the likelihood function consists of the joint probabilities that the dependent series {y_n1, y_n2, …, y_nT} are the observed sequences of zeros and ones. These joint probabilities are multivariate normal integrals over T dimensions and there are 2^T possible integrals.
The practical significance of the increased dimensionality of the integrals is that
traditional numerical methods generally cannot compute the integrals with sufficient
speed and precision to make the computation of the maximum likelihood estimator
workable. In this chapter, we review a collection of alternative, feasible, methods
based on the ideas of estimation with simulation suggested by McFadden (1989)
and Pakes and Pollard (1989).
In Section 2, we describe LDV models and illustrate the computational difficulties
classical estimation methods encounter. Section 3 summarizes basic simulation
methods, covering censored and truncated sampling methods. Estimation of LDV
models and simulation are combined in Section 4 where three general approaches
are reviewed: simulation of the log-likelihood function, simulation of moment

¹ A partial list of studies in numerical analysis of such integrals is Clark (1961), Daganzo (1980), Davis and Rabinowitz (1984), Dutt (1973), Dutt (1976), Fishman (1973), Hammersley and Handscomb (1964), Horowitz et al. (1981), Moran (1984), Owen (1956), Rubinstein (1981), Stroud (1971) and Thisted (1988).

functions, and simulation of the efficient score. We provide computational examples


throughout to illustrate the various methods and their properties. We conclude this
chapter with a summary of the main approaches presented.

2. Limited dependent variable models

2.1. The latent normal regression model

Consider the problem of maximum likelihood estimation given the N observations on the vector of random variables y drawn from a population with cumulative distribution function (c.d.f.) F(θ; Y) = Pr{y ≤ Y}. Let the corresponding density function with respect to Lebesgue measure be f(θ; y). The density f is a parametric function and the parameter vector θ is unknown, finite-dimensional, and θ ∈ Θ, where Θ is a compact subset of ℝ^K. Estimation of θ by maximum likelihood (ML) involves the maximization of the log-likelihood function l_N(θ) = Σ_{n=1}^N ln f(θ; y_n) over Θ. Often, finding the root of a system of normal equations ∇_θ l_N(θ) = 0 is equivalent. In the limited dependent variable models that we consider in this chapter, F will be a mixture of discrete and continuous distributions, so that f may consist of nonzero probabilities for discrete values of y and continuous probability densities for intervals of y. These functions are generally difficult to compute because they involve multivariate integrals that do not have closed forms, accurate approximations or rapid numerical solutions. As a result, estimation of θ by classical methods is effectively infeasible.
In this section, we review the various forms of likelihood functions that arise in
LDV models. In the second subsection, we discuss models generated as partially
observed or censored latent dependent variables. The third subsection describes
truncated latent dependent variables. In this case, one views observations in a latent
data set as missing entirely from an observed data set. Within these broad categories,
we review discrete, mixed discrete/continuous, and mixture likelihood functions.
Following our discussion of likelihood functions, Section 2.6 treats the structure
of the score function for LDV models and the last subsection gives a concrete
illustration of the intractability of classical estimation methods for the general LDV
model.

2.2. Censoring

In general, and particularly in LDV models, one can represent the data generating
process for y as an incomplete data or partial observability process in which the
observed data vector y is an indirect observation on a latent vector y*. In such cases,
y* cannot be recovered from the censored random variable y.

Definition 1. Censored random variables

Let Y* be a random variable from a population with c.d.f. F(Y*) and support A. Let B be the support of the random variable Y = τ(Y*), where τ: A → B is not invertible. Then Y is a censored random variable.

In LDV models, τ is often called the observation rule; and though it may not be monotonic, τ is generally piece-wise continuous. An important characteristic of censored sampling is that no observations are missing. Observations on y* are merely abbreviated or summarized, hence the descriptive term censored. Let A ⊂ ℝ^M and B ⊂ ℝ^J.
The latent c.d.f. F(θ; y*) for y* is related to the observed c.d.f. for y by the integral equation

$$F(\theta; Y) = \int_{\{y^*:\ \tau(y^*) \le Y\}} dF(\theta; y^*). \qquad (2)$$

In the LDV models that we consider, F(θ; y*) is the multivariate normal c.d.f. given by F(θ; y*) = ∫ φ(y* − μ, Ω) dy*, where Ω is a positive definite matrix, and

$$\phi(y^* - \mu, \Omega) = \{\det[2\pi\Omega]\}^{-1/2} \exp\!\big[-\tfrac{1}{2}(y^* - \mu)'\,\Omega^{-1}(y^* - \mu)\big]. \qquad (3)$$

We will refer to this multivariate normal distribution as the N(μ, Ω) distribution. The mean vector is often parameterized as a linear function of observed conditioning variables X: μ(β) = Xβ, where β is a vector of K_β slope coefficients. The covariance matrix is usually a function of a vector of K_σ variance parameters σ.

The p.d.f. for y is the function that integrates to F(θ; Y). In this chapter, integration refers to the Lebesgue-Stieltjes integral and the p.d.f. is a generalized derivative of the c.d.f.² This means that the p.d.f. has discrete and continuous components. Everywhere in the support of Y where F is differentiable, the p.d.f. can be obtained by ordinary differentiation:

$$f(\theta; Y) = \frac{\partial^J F(\theta; Y)}{\partial Y_1 \cdots \partial Y_J}. \qquad (4)$$

A simple illustration of such p.d.f.s is given below in Example 2. In the LDV models we consider, F generally has a small number of discontinuities in some dimensions of Y so that F is not differentiable everywhere. At a point of discontinuity Y^d, we can obtain the generalized p.d.f. by partitioning Y into the elements in which F is differentiable, {Y₁, …, Y_J′} say, and the remaining elements {Y_J′+1, …, Y_J} in which the discontinuity occurs. The p.d.f. then has the form

$$f(\theta; Y) = \frac{\partial^{J'}}{\partial Y_1 \cdots \partial Y_{J'}}\big[F(\theta; Y) - F(\theta; Y - 0)\big] = f(\theta; Y_1, \ldots, Y_{J'})\, \Pr\{Y_j = Y_j^d,\ j > J' \mid \theta;\ Y_1, \ldots, Y_{J'}\}, \qquad (5)$$

where the discrete jump F(θ; Y) − F(θ; Y − 0) reflects the nontrivial probability of the event {Y_j = Y_j^d; j > J′}.³ Examples 1 and 2 illustrate such probabilities.

² Such densities are formally known as Radon-Nikodym p.d.f.s with respect to Lebesgue measure.
It is these probabilities, the discrete components of the p.d.f., that pose computational obstacles to classical estimation. One must carry out multivariate integration and differentiation in (2)-(5) to obtain the likelihood for the observed data - see the
following example for a clear illustration of this problem. Because accurate numeri-
cal approximations are unavailable, this integration is often handled by such general
purpose numerical methods as quadrature. But the speed and accuracy of quadrature
are inadequate to make the computation of the MLE practical except in special
cases.

Example 1. Multinomial probit

The multinomial probit model is a leading illustration of the computational difficulties of classical estimation methods for LDV models, which require the repeated evaluation of (2)-(5). This model is based on the work of Thurstone (1927) and was first analyzed by Bock and Jones (1968). For a multinomial model with J = M possible outcomes, the latent y* is N(μ, Ω) where μ is a J × 1 vector of means and Ω is a J × J symmetric positive definite covariance matrix. The observed y is often represented as a vector of indicator functions for the maximal element of y*: τ(y*) = [1{y_j* = max_i y_i*}; j = 1, …, J]. Therefore, the sampling space B of y is the set of orthonormal elementary unit vectors, whose elements are all zero except for a unique element that equals one:

B = {(1, 0, 0, …, 0), (0, 1, 0, 0, …, 0), …, (0, 0, …, 0, 1)}.

The probability function for y can be written as an integral over J − 1 dimensions after noting that the event {y_j = 1, y_i = 0, i ≠ j} is equivalent to {y_j* − y_i* ≥ 0, i = 1, …, J}. By creating the first-difference vector z_j = [y_j* − y_i*; i = 1, …, J, i ≠ j] = Δ_j y* and denoting its mean and covariance by μ_j = Δ_j μ and Ω_j = Δ_j Ω Δ_j′ respectively, F(θ; y) and f(θ; y) are both functions of multivariate normal negative orthant integrals of the general form

$$\Phi(\mu, \Omega) = \int_{-\infty}^{0} \cdots \int_{-\infty}^{0} \phi(x + \mu, \Omega)\, dx.$$

³ The height of the discontinuity is denoted by F(θ; Y) − F(θ; Y − 0) = lim_{ε↓0} [F(θ; Y) − F(θ; Y − ε)].

We obtain

$$F(\theta; Y) = \sum_{j=1}^{J} 1\{Y_j \ge 1\}\, \Phi(-\mu_j, \Omega_j)$$

and

$$f(\theta; Y) = \begin{cases} \prod_{j=1}^{J} \Phi(-\mu_j, \Omega_j)^{Y_j} & \text{if } Y \in B, \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

When J = 2, this reduces to the familiar binomial probit likelihood mentioned in the introduction:

$$f(\theta; Y) = \Phi(\mu_2 - \mu_1, 1)^{Y_2}\, \Phi(\mu_1 - \mu_2, 1)^{Y_1} = \Phi(-\tilde{\mu}, 1)^{1-Y}\, \Phi(\tilde{\mu}, 1)^{Y}, \qquad (7)$$

where μ̃ = μ₁ − μ₂ and Y = Y₁.


If J > 5, then the likelihood function (6) is difficult to compute using conventional expansions without special restrictions on the covariance matrix or without adopting other distributions that imply closed-form expressions. Examples of the former approach are the factor-analytic structures for Ω analyzed in Heckman (1981), Bolduc (1991) and Bolduc and Kaci (1991), and the diagonal Ω discussed in Hausman and Wise (1978), p. 310. An example of the latter is the i.i.d. extreme-value distribution which, as McFadden (1973) shows, yields the analytically tractable multinomial logit model. See also Lerman and Manski (1981), p. 224, McFadden (1981) and McFadden (1986) for further discussions on this issue.

Example 2. Tobit

The tobit or censored regression model⁴ is a simple example of a mixed distribution with discrete and continuous components. This model has a univariate latent structure like probit: y* ~ N(μ, σ²). The observation rule is also similar: τ(y*) = 1{y* > 0}·y*, which leads to the sample space B = {y ∈ ℝ | y ≥ 0} and c.d.f.

$$F(\theta; Y) = \begin{cases} 0 & \text{if } Y < 0, \\ \displaystyle\int_{\{y^* \le Y\}} \phi(y^* - \mu, \sigma^2)\, dy^* = \Phi(Y - \mu, \sigma^2) & \text{if } Y \ge 0. \end{cases}$$

The p.d.f. is mixed, containing discrete and continuous terms:

$$f(\theta; Y) = \begin{cases} 0 & \text{if } Y < 0, \\ \Phi(-\mu, \sigma^2) & \text{if } Y = 0, \\ \phi(Y - \mu, \sigma^2) & \text{if } Y > 0. \end{cases} \qquad (8)$$

The discrete jump in F at Y = 0 corresponds to the nonzero probability of {Y = 0}, just as in binomial probit. F is differentiable for Y > 0 so that the p.d.f. is obtained by differentiation. Just as in the extension of binomial to multinomial probit, multivariate tobit models present multivariate integrals that are difficult to compute.

⁴ Tobin (1958).
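Since the univariate tobit p.d.f. (8) involves only one-dimensional normal terms, its log-likelihood is cheap to evaluate; the following sketch (our own hypothetical illustration, with μ = Xβ) codes it directly:

```python
import numpy as np
from scipy.stats import norm

def tobit_loglik(beta, sigma, y, X):
    """Log-likelihood of the censored regression (tobit) model, eq. (8):
    ln Phi(-mu/sigma) for observations with Y = 0,
    ln phi(Y - mu; sigma^2) for observations with Y > 0."""
    mu = X @ beta
    ll = np.where(y > 0,
                  norm.logpdf(y, loc=mu, scale=sigma),   # continuous part
                  norm.logcdf(-mu / sigma))               # discrete mass at zero
    return ll.sum()
```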

Example 3. Nonrandom sample selection

The nonrandom sample selection model provides a final example of partial observability which generalizes the tobit model.⁵ In the simplest version, the latent y* consists of two elements drawn from a bivariate normal distribution where

$$\Omega(\sigma) = \begin{bmatrix} 1 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}.$$

The observation rule is

$$y_1 = 1\{y_1^* > 0\}, \qquad y_2 = y_1 \cdot y_2^*,$$

so that the first element of y is a binomial variable and the second element is an observation on y₂* when y₁ = 1; otherwise, there is no observation of y₂* because y₂ is identically zero. That is, the sampling space of y is the union of two disjoint sets: B = {(0, 0)} ∪ {(1, y₂): y₂ ∈ ℝ}. Thus, two cases capture the nonzero regions of the

⁵ See Gronau (1974), Heckman (1974), Heckman (1979), Lewis (1974), Lee (1978) and Lee (1979).

c.d.f. of y. First of all, the c.d.f. is constant on B₁ = [0, 1) × [0, ∞):

$$F(\theta; Y) = \int_{\{y_1^* < 0\}} \phi(y^* - \mu, \Omega)\, dy^* = \Phi(-\mu_1, 1), \qquad Y \in B_1,$$

because y₂* is unrestricted (and unobserved) in this case. Once Y₁ reaches 1, the entire sampling space for y₁ has been covered and the c.d.f. on B₂ = [1, ∞) × ℝ is increased according to

$$F(\theta; Y) = 1\{Y_2 \ge 0\}\,\Phi(-\mu_1, 1) + \int_{\{y_1^* \ge 0,\ y_2^* \le Y_2\}} \phi(y^* - \mu, \Omega)\, dy^*, \qquad Y \in B_2.$$

The p.d.f. will therefore be

$$f(\theta; Y) = \begin{cases} \Phi(-\mu_1, 1) & \text{if } Y_1 = 0, \\ \Phi\big(\mu_1 + \sigma_{12}(Y_2 - \mu_2)/\sigma_2^2,\ 1 - \sigma_{12}^2/\sigma_2^2\big)\,\phi(Y_2 - \mu_2, \sigma_2^2) & \text{if } Y_1 = 1. \end{cases}$$

The sample selection process is often more complicated, with several causes of sample selection. In such cases, the latent y₋M* is a vector with each element associated with a different cause of partial observation. The latent y_M* is observed only if all the elements of y₋M* (suppose there are J = M − 1) are positive, so that the observation rule is

$$\tau(y^*) = \Big[\,1\{y_{-M}^* \ge 0\},\ \ y_M^* \prod_{j=1}^{M-1} 1\{y_j^* \ge 0\}\,\Big],$$

where 1{y₋M* ≥ 0} is an (M − 1) × 1 vector of indicator variables. The sampling space is

$$B = \Big\{Y \in \mathbb{R}^M \,\Big|\, Y_M \in \mathbb{R},\ \prod_{j=1}^{M-1} Y_j = 1,\ Y_j \in \{0, 1\},\ j < M\Big\} \cup \Big\{Y \in \mathbb{R}^M \,\Big|\, Y_M = 0,\ \prod_{j=1}^{M-1} Y_j = 0,\ Y_j \in \{0, 1\},\ j < M\Big\},$$

and the likelihood function contains multivariate integrals over the M − 1 dimensions of y₋M*.

Other types of nonrandom sample selection lead to general discrete/continuous


models and models of switching regressions with known sample separation. Such

models are discussed extensively in Dubin and McFadden (1984), Hanemann (1984), Lee (1978), Maddala (1983) and Amemiya (1984).

2.3. Truncation

When it is represented as a partial observation, a limited dependent variable is a


censored latent variable. Another mechanism for generating limited dependent
variables is truncation, which refers to dropping observations so that their realiza-
tion goes unrecorded.

Definition 2. Truncated random variables

Let F(Y) be the c.d.f. of y* and let D be a proper subset of the support of F and Dᶜ its complement such that Pr{y* ∈ Dᶜ} > 0. The function

$$G(Y) = \begin{cases} F(Y)/\Pr\{y^* \in D\} & \text{if } Y \in D, \\ 0 & \text{if } Y \notin D, \end{cases}$$

is the c.d.f. of a truncated y*.

One can generate a sample of truncated random variables with the c.d.f. G by drawing a random sample of y* and removing the realizations that are not members of D. This is typically the way truncation arises in practice. To draw a single realization of the truncated random variable, one can draw y*'s until a realization falls into D. The term truncation derives from the visual effect dropping the set Dᶜ has on the original distribution when Dᶜ is a tail region: the tail of the p.d.f. is cut off or truncated.
To incorporate truncation, we expand the observation rule to

$$y = \begin{cases} \tau(y^*) & \text{if } y^* \in D, \\ \text{unobserved} & \text{otherwise,} \end{cases} \qquad (9)$$

where D is an acceptance region. This situation differs from that of the nonrandom sample selection model in which an observation is still partially observed: at least, every realization is recorded. In the presence of truncation, the observed likelihood requires normalization relative to the latent likelihood:

$$F(\theta; Y) = \frac{\int_{\{y^* \in D:\ \tau(y^*) \le Y\}} dF(\theta; y^*)}{\int_D dF(\theta; y^*)}. \qquad (10)$$

The normalization by a probability in the denominator makes the c.d.f. proper,


with an upper bound of one.

Example 4. Truncated normal regression

If y* ~ N(μ, σ²) and y is an observation of y* when y* > 0, the model is a truncated normal regression. Setting D = {y ∈ ℝ | y > 0} makes B = D so that the c.d.f. and p.d.f. of y are

$$F(\theta; Y) = \begin{cases} 0 & \text{if } Y \le 0, \\ \dfrac{\int_0^Y \phi(y^* - \mu, \sigma^2)\, dy^*}{\int_0^\infty \phi(y^* - \mu, \sigma^2)\, dy^*} = \dfrac{\Phi(Y - \mu, \sigma^2) - \Phi(-\mu, \sigma^2)}{1 - \Phi(-\mu, \sigma^2)} & \text{if } Y > 0, \end{cases}$$

$$f(\theta; Y) = \begin{cases} 0 & \text{if } Y \le 0, \\ \dfrac{\phi(Y - \mu, \sigma^2)}{1 - \Phi(-\mu, \sigma^2)} & \text{if } Y > 0. \end{cases}$$

As in the tobit model, a normal integral appears in the likelihood function. However,
this integral enters in a nonlinear fashion, in the denominator of a ratio. Clearly,
multivariate forms of truncation lead to multivariate integrals in the denominator.

To accommodate both censored and truncated models, in the remainder of this chapter we will often denote the general log-likelihood function for LDV models with a two-part function:

$$\ln f(\theta; Y) = \ln f_1(\theta; Y) - \ln f_2(\theta; Y), \qquad (11)$$

where f₂ represents the normalizing probability Pr{y* ∈ D} = ∫_D dF(θ; y*). In models with only censoring, f₂ ≡ 1. But in general, both f₁ and f₂ will require numerical approximation. Note that in this general form, the log-likelihood function can be viewed as the difference between two log-likelihood functions for models with censoring. For example, the log-likelihood of the truncated regression in Example 4 is the difference between the log-likelihoods of the tobit regression in Example 2 and the binomial probit model mentioned in the introduction and Example 1 (see equations (7) and (8)):⁶

$$1\{Y > 0\} \ln\!\left[\frac{\phi(Y - \mu, \sigma^2)}{1 - \Phi(-\mu, \sigma^2)}\right] = \big(1\{Y > 0\} \ln[\phi(Y - \mu, \sigma^2)] + 1\{Y = 0\} \ln \Phi(-\mu, \sigma^2)\big) - \big(1\{Y > 0\} \ln[1 - \Phi(-\mu, \sigma^2)] + 1\{Y = 0\} \ln \Phi(-\mu, \sigma^2)\big).$$

⁶ Note that scale information about y* is available in the censored and truncated normal regression models which is not in the case of binary response, so that σ² is now identifiable. Hence, the normalization σ² = 1 is not necessary, as it is in the binary probit model where only the discrete information 1{Y > 0} is available.
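The decomposition can be checked numerically; a small sketch (our own, with arbitrary illustrative values) computes the truncated-regression contribution both directly and as the tobit-minus-probit difference of equation (11):

```python
import numpy as np
from scipy.stats import norm

mu, sigma, y = 0.4, 1.3, 2.1        # a single uncensored observation, y > 0

# Direct truncated-normal log-density: ln[phi(y - mu; sigma^2) / (1 - Phi(-mu/sigma))]
direct = norm.logpdf(y, mu, sigma) - np.log(1 - norm.cdf(-mu / sigma))

# Difference form (11): tobit contribution minus binomial probit contribution
tobit  = norm.logpdf(y, mu, sigma)            # y > 0 branch of eq. (8)
probit = np.log(1 - norm.cdf(-mu / sigma))    # ln Pr{y* > 0}
print(np.isclose(direct, tobit - probit))     # True
```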

2.4. Mixtures

LDV models have come to include a family of models that do not necessarily have
limited dependent variables. This family, containing densities called mixtures, shares
an analytical trait with the LDV models that we have already reviewed: the p.d.f.
generally contains discrete probability terms.

Definition 3. Mixtures

Let F(θ; Y) be the c.d.f. of y* depending on a parameter θ and H(θ) another c.d.f. Then the c.d.f.

$$G(Y) = \int F(\theta; Y)\, dH(\theta)$$

is a mixture.

Possible ways in which mixtures arise in econometric models are unobservable heterogeneity in the underlying data generating process (see, for example, Heckman (1981)) and short-side rationing rules (Quandt (1972), Goldfeld and Quandt (1975), Laroque and Salanié (1989)). Laroque and Salanié (1989) discuss simulation estimation methods for the analysis of this type of model.

Example 5. Mixture

A cousin of the nonrandom sample selection model is the mixture model generated by an underlying trivariate normal distribution, where

$$\Omega = \begin{bmatrix} 1 & \sigma_{12} & \sigma_{13} \\ \sigma_{12} & \sigma_2^2 & \sigma_{23} \\ \sigma_{13} & \sigma_{23} & \sigma_3^2 \end{bmatrix}.$$

The observation rule maps a three-dimensional vector into a scalar; the rule can be written as

$$y = 1\{y_1^* \ge 0\}\, y_2^* + 1\{y_1^* < 0\}\, y_3^*.$$

An indicator function determines whether y₂* or y₃* is observed. An important difference with sample selection is that the indicator itself is not observed. Thus, y is a mixture of y₂*'s and y₃*'s. As a result, such mixtures have qualitatively distinct c.d.f.s, compared to the other LDV models we have discussed. In the present case,

$$F(\theta; Y) = \int_{\{y_1^* \ge 0,\ y_2^* \le Y\} \cup \{y_1^* < 0,\ y_3^* \le Y\}} \phi(y^* - \mu, \Omega)\, dy^* = \int_{\{y_1^* \ge 0,\ y_2^* \le Y\}} \phi(y^* - \mu, \Omega)\, dy^* + \int_{\{y_1^* < 0,\ y_3^* \le Y\}} \phi(y^* - \mu, \Omega)\, dy^*,$$

and

$$f(\theta; Y) = \Phi\big(\mu_{1|2}(Y),\ \Omega_{11|2}\big)\,\phi(Y - \mu_2, \sigma_2^2) + \Phi\big(-\mu_{1|3}(Y),\ \Omega_{11|3}\big)\,\phi(Y - \mu_3, \sigma_3^2),$$

where, for j = {2, 3},

$$\mu_{1|j} \equiv E(y_1^* \mid y_j^* = Y) = \mu_1 + \sigma_{1j}(Y - \mu_j)/\sigma_j^2, \qquad \Omega_{11|j} \equiv V(y_1^* \mid y_j^* = Y) = 1 - \sigma_{1j}^2/\sigma_j^2$$

are conditional moments. The p.d.f. particularly demonstrates the weighted nature of the distribution: the marginal distributions of y₂* and y₃* are mixed together by probability weights.

2.5. Time series models

LDV models are not typically applied to time series data sets but short time series
have played an important role in the analysis of panel or longitudinal data sets.
Such time series are another source of high-dimensional integrals in likelihood
functions. Here we expand our introductory example.

Example 6. Multiperiod binary probit model

A random sample of N economic agents is followed over time, with agent n being
observed for T periods. The latent variable y*_nt = μ_nt + ε_nt measures the net benefit to the agent characterizing an action in period t. Typically, μ_nt is a linear index

function of a k × 1 vector of exogenous explanatory variables x_nt, i.e., μ_nt = x′_nt β. The agent chooses one of two actions in each period, denoted by y_nt ∈ {0, 1}, depending upon the value of y*_nt:

$$\tau(y^*) = \begin{cases} y_{nt} = 1 & \text{if } y_{nt}^* > 0, \\ y_{nt} = 0 & \text{if } y_{nt}^* \le 0, \end{cases} \qquad t = 1, \ldots, T. \qquad (12)$$

Hence, the sample space for τ(y*) is B = ×_{t=1}^T {0, 1}, i.e., all possible (2^T) sequences of length T, with 0 and 1 as the possible realizations in each period.

Let the distribution of y_n* = (y*_n1, …, y*_nT) be the multivariate normal given in equation (3). Then, for individual n the LDV vector {y_nt}, t = 1, …, T, has the discrete p.d.f.

$$f(\beta, \Omega; Y) = \Phi(S\mu,\ S\Omega S), \qquad \text{where } \mu = X\beta \text{ and } S = \mathrm{diag}\{2Y - 1\}.$$

This is a special case of the multinomial probit model of Example 1, with J = 2^T alternatives and a typically highly restricted Ω, reflecting the assumed serial correlation in the {ε_nt}_{t=1}^T sequence.

By way of illustration, let us consider the specific covariance structure, found very useful in applied work:⁷,⁸

$$\varepsilon_{nt} = \nu_n + \eta_{nt}, \qquad \eta_{nt} = \rho\, \eta_{n,t-1} + \zeta_{nt}, \qquad |\rho| < 1, \qquad (13)$$

with ν and η independent. This implies that

$$\Omega = \sigma_\eta^2 \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{T-1} \\ \rho & 1 & \rho & \cdots & \rho^{T-2} \\ \vdots & & & \ddots & \vdots \\ \rho^{T-1} & \rho^{T-2} & \cdots & \rho & 1 \end{bmatrix} + \sigma_\nu^2 J_T.$$

The variance parameters σ_ν² and σ_η² cannot both be identified, so the normalization σ_ν² + σ_η² = 1 is used.

The probability of the observed sequence of choices of individual n is

$$\Pr\{y_n;\ \theta, X_n\} = \int_{b_0(y_n)}^{b_1(y_n)} \phi(y_n^* - \mu_n, \Omega_n)\, dy_n^*,$$

with θ = (β, σ_ν², ρ) and integration limits determined by the observed choices:

$$b_{0,nt} = \begin{cases} 0 & \text{if } y_{nt} = 1, \\ -\infty & \text{if } y_{nt} = 0, \end{cases} \qquad b_{1,nt} = \begin{cases} \infty & \text{if } y_{nt} = 1, \\ 0 & \text{if } y_{nt} = 0. \end{cases}$$

⁷ See Hajivassiliou and McFadden (1990), Börsch-Supan et al. (1992) and Hajivassiliou (1993a).
⁸ This is the structure assumed in the introductory example; see equation (1) above.

Note that the likelihood of this example is another member of the family of censored models. Time series models like this do not present a new analytical problem. Indeed, such time series models are more tractable for estimation because classical methods do provide consistent, though statistically inefficient, estimators (see Poirier and Ruud (1988), Hajivassiliou (1986) and Avery et al. (1983)).⁹ Keane (1993) discusses extensively special issues in the estimation by simulation of panel data models and Mühleisen (1991) compares the performance of alternative simulation estimators for such models. Studies of dynamic discrete behavior using simulation techniques are Berkovec and Stern (1991), Bloemen and Kapteyn (1991), Hajivassiliou and Ioannides (1991), Hotz and Miller (1989), Hotz and Sanders (1991), Hotz et al. (1991), Pakes (1992) and Rust (1992).

In this chapter, we do not analyze the estimation by simulation of long time series models. We refer the reader to Lee and Ingram (1991), Duffie and Singleton (1993), Laroque and Salanié (1990) and Gourieroux and Monfort (1990) for results on this topic.

⁹ Panel data sets in which each agent is observed for the same number of time periods T are called balanced, while sets with T_n ≠ T for some n = 1, …, N are known as unbalanced. As long as the determination of T_n is not endogenous to the economic model at hand, balanced and unbalanced sets can be analyzed using the same techniques. There exists, however, the interesting case in which T_n is determined endogenously through an economic decision, which leads to a multiperiod sample selection problem. See Hausman and Wise (1979) for a discussion of this case.

2.6. Score functions

For models with censoring, the score for θ can be written in two ways which we will use to motivate two approaches to approximation of the score by simulation:

$$\nabla_\theta \ln f(\theta; y) = \frac{\nabla_\theta f(\theta; y)}{f(\theta; y)} \qquad (14)$$

$$= E[\nabla_\theta \ln f(\theta; y^*) \mid y], \qquad (15)$$

where ∇_θ is an operator that represents partial differentiation with respect to the elements of θ. The ratio (14) is simply the derivative of the log-likelihood and simulation can be applied to the numerator and denominator separately. The second expression (15), the conditional expectation of the score of the latent log-likelihood, can be simulated as a single expectation if ∇_θ ln f(θ; y*) is tractable. Ruud (1986), van Praag and Hop (1987), Hajivassiliou and McFadden (1990) and Hajivassiliou (1992) have noted alternative ways of writing score functions for the purpose of estimation by simulation.

Here is the derivation of (15). Let F(θ; y* | y) denote the conditional c.d.f. of y* given that τ(y*) = y.¹⁰ We let

$$E[t(y^*) \mid y] = \int t(y^*)\, dF(\theta; y^* \mid y) \qquad (16)$$

denote the expectation of a random variable t(y*) with respect to the conditional c.d.f. F(θ; y* | y) of y* given τ(y*) = y. Then

$$\frac{\nabla_\theta f(\theta; y)}{f(\theta; y)} = \frac{1}{f(\theta; y)} \int_{\{y^*:\ \tau(y^*) = y\}} \nabla_\theta\, dF(\theta; y^*) = \int_{\{y^*:\ \tau(y^*) = y\}} \big[\nabla_\theta \ln f(\theta; y^*)\big]\, \frac{f(\theta; y^*)}{f(\theta; y)}\, dy^* = E[\nabla_\theta \ln f(\theta; y^*) \mid \tau(y^*) = y],$$

since f(θ; y*)/f(θ; y) = f(θ; y* | τ(y*) = y) is the p.d.f. of the truncated distribution {y* | τ(y*) = y}.

This formula for the score leads to the following general equations for normal LDV models when y* has the multivariate normal p.d.f. given in (3):

$$\nabla_\mu \ln f(\theta; y) = \Omega^{-1}\big[E(y^* \mid y) - \mu\big],$$
$$\nabla_\Omega \ln f(\theta; y) = \tfrac{1}{2}\,\Omega^{-1}\big\{V(y^* \mid y) + [E(y^* \mid \tau(y^*) = y) - \mu][E(y^* \mid \tau(y^*) = y) - \mu]' - \Omega\big\}\,\Omega^{-1}, \qquad (17)$$

using the standard derivatives for the log-likelihood of a multivariate normal:

$$\nabla_\mu \ln \phi(y^* - \mu, \Omega) = \Omega^{-1}(y^* - \mu), \qquad \nabla_\Omega \ln \phi(y^* - \mu, \Omega) = \tfrac{1}{2}\,\Omega^{-1}\big[(y^* - \mu)(y^* - \mu)' - \Omega\big]\,\Omega^{-1}. \qquad (18)$$

¹⁰ Formally,

$$F(\theta; Y^* \mid \tau(y^*) = y) \equiv \lim_{\varepsilon \downarrow 0} \frac{\Pr\{y^* \le Y^*,\ y - \varepsilon < \tau(y^*) \le y\}}{\Pr\{y - \varepsilon < \tau(y^*) \le y\}}.$$

According to (17), the score of a normal LDV model depends only on the first two moments of a truncated multivariate normal random variable z generated by the truncation rule

$$z = \begin{cases} y^* & \text{if } \tau(y^*) = y, \\ \text{unobserved} & \text{otherwise.} \end{cases} \qquad (19)$$

The functional form of these moments depends on the specification of the LDV function τ.

For LDV models with truncation, there are no changes to (14)-(16). The only change that (9) requires for (19) is the restriction to the acceptance region D. That is, the score depends on only the first two moments of a truncated multivariate normal random variable z generated by the truncation rule

$$z = \begin{cases} y^* & \text{if } \tau(y^*) = y \text{ and } y^* \in D, \\ \text{unobserved} & \text{otherwise.} \end{cases}$$

As a result, there is a basic change to (17). Because the log-likelihood function of truncated models is the difference between two log-likelihood functions for censored models (see equation (11)), the score is expressed as the difference in the scores for such models:

$$\nabla_\theta \ln f(\theta; y) = \nabla_\theta \ln f_1(\theta; y) - \nabla_\theta \ln f_2(\theta; y) = E[\nabla_\theta \ln f(\theta; y^*) \mid \tau(y^*) = y] - E[\nabla_\theta \ln f(\theta; y^*) \mid y^* \in D],$$

and (17) becomes

$$\nabla_\mu \ln f(\theta; y) = \Omega^{-1}\big[E(y^* \mid \tau(y^*) = y) - E(y^* \mid y^* \in D)\big],$$
$$\nabla_\Omega \ln f(\theta; y) = \tfrac{1}{2}\,\Omega^{-1}\big\{E[(y^* - \mu)(y^* - \mu)' \mid \tau(y^*) = y] - E[(y^* - \mu)(y^* - \mu)' \mid y^* \in D]\big\}\,\Omega^{-1}.$$

2.7. The computational intractability of LDV models

The likelihood contribution f(θ; y_n) and the score ∇_θ ln f(θ; y_n) are functions of at most M-dimensional integrals over the region D(y) = {y* | τ(y*) = y} in the domain of the M × 1 latent vector y*. The fundamental source of the computational intractability of classical estimation methods for the general LDV model is the repeated evaluation of such integrals. To illustrate, consider a multinomial probit model with M = 16 alternatives, with K = 20 exogenous variables that vary by alternative. A random sample of N = 1000 observations is available. Suppose the M × M variance-covariance matrix Ω of the unobserved random utilities has (15 × 16/2) − 1 = 119 free elements (after imposing identification restrictions). Then, the number of parameters to be estimated is p = 139. Suppose the analyst uses an iterative Newton-Raphson type of numerical procedure, employing numerical approximations to the first derivatives based on two-sided first differences and that 20 iterations are required to achieve convergence, which is a realistic number.¹¹ Each iteration requires at least 2p evaluations of the likelihood function for approximating the first derivatives. We thus expect that finding the ML estimator will require about 20 × 2p function evaluations. Since the sample consists of N = 1000 individuals, we will have to calculate N × 20 × 2p contributions to the likelihood function, each of which, in general, will be 16-dimensional integrals. Let s be the time in seconds a given computer requires to approximate a 16-dimensional integral by numerical quadrature methods. Our hypothetical ML estimation will thus require about N × 20 × 2p × s seconds. On a typical modern supercomputer (say a Cray 1) one could expect s ≈ 2. Hence, using such a supercomputer, our problem would take about 1000 × 20 × 278 × 2/3600 hours, which is about 4 months of Cray 1 CPU! It is crucial to stress that such numerical quadrature methods offer only poor approximations to integrals of such dimension.¹² The maximum likelihood estimates resulting from 4 months of Cray 1 CPU would be utterly unreliable. The need for alternative estimation methods for these problems is apparent.
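The back-of-the-envelope arithmetic above can be reproduced in a few lines (our own sketch of the calculation in the text):

```python
N, K = 1000, 20
p = K + (15 * 16) // 2 - 1          # 20 slope + 119 covariance parameters = 139
iterations, s = 20, 2               # 20 iterations, ~2 seconds per 16-dim quadrature
evaluations = N * iterations * 2 * p
print(evaluations * s / 3600, "hours")   # about 3089 hours, roughly 4 months of CPU
```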

3. Simulation methods

3.1. Overview

Two general approaches to exploiting simulation in parametric estimation are to


approximate the likelihood function and to approximate such moment functions
as the score. The likelihood function can be simulated by Monte Carlo techniques
over the latent marginal distribution f(θ; y*) in equation (4) for the mixture case, equation (5) for the discrete/continuous case and equation (10) for the truncated case. Alternatively, the score can be approximated either by integrating both numerator and denominator in equation (14) or by integrating over the latent conditional p.d.f. f(θ; y* | y) as in equation (15). Thus, simulation techniques focus on the simulation from these two distributions, f(θ; y*) and f(θ; y* | y). The censoring and truncation discussed above for LDV models also appear in simulations and we consider
methods for effecting each type of observation rule below. As we will show in Section
4, some simulation estimation methods use censored simulation for the estimation
of the main types of LDV models discussed in Section 2 (censored, truncated and

¹¹ See Quandt (1986) for a discussion of issues in numerical optimization methods.
¹² Clark (1961) proposed another numerical approximation method for such integrals - see also Daganzo et al. (1977) and Daganzo (1980). The Horowitz et al. (1981) study finds serious shortcomings in the numerical accuracy of the Clark method in typical problems with high J and unrestricted Ω.

mixture models), whereas other estimation methods use truncated simulation for
the estimation of these models.
Simulation of standard normal random variables is an old and well-studied
problem. Relatively fast algorithms are widely available for generating such random
variables on a computer. Thus, consider the simulation of the latent data generating
process. We can always write

$$y^* = \mu + \Gamma \eta, \qquad (20)$$

where η is a vector of M independent standard normal random variables and Γ is a matrix square root of Ω, so that Ω = ΓΓ′. It is convenient to set Γ to the (lower triangular) Cholesky factor. Clearly, the latent data generating process can be simulated rapidly with simulations of η for any values of μ and Ω. Such simulations can be used in turn to simulate the likelihood and log-likelihood functions and their derivatives with respect to the parameters.
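A minimal sketch of (20) (our own illustration): simulating draws of y* from N(μ, Ω) through the Cholesky factor Γ:

```python
import numpy as np

def draw_latent(mu, Omega, R, rng):
    """Simulate R draws of y* = mu + Gamma*eta, eq. (20), with Gamma the
    lower-triangular Cholesky factor of Omega and eta ~ N(0, I_M)."""
    Gamma = np.linalg.cholesky(Omega)
    eta = rng.standard_normal((R, len(mu)))
    return mu + eta @ Gamma.T

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
Omega = np.array([[1.0, 0.5], [0.5, 2.0]])
ystar = draw_latent(mu, Omega, 1000, rng)   # 1000 x 2 array of simulated latents
```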
As in all of the examples given above, the observation rules common in LDV
models imply regions of integration that are rectangles: that is, for some matrix A and vectors b₀ and b₁, possibly with some infinite elements,

$$D = \{y^* \in \mathbb{R}^M \mid b_0 \le A y^* \le b_1\}, \qquad (21)$$

where rank(A) ≤ M. These are the problems that we will consider. Since Ay* is also normally distributed, it will often be convenient to simulate Ay* instead of y*. In that case, we simply transform the mean vector and covariance matrix to Aμ and AΩA′, respectively. Without any loss of generality in this section, we set A = I_M, the M × M identity matrix. We denote D = {z ∈ ℝ^M | b₀ < z < b₁}.
Such regions as (21) have two important analytical properties. First of all, rectangular regions have constant boundaries with respect to the variable of integration, simplifying integration. Secondly, the differentiation in (4) and (5) can be carried out analytically to obtain likelihood functions composed of multivariate normal p.d.f.s of the form (3) and multivariate normal c.d.f.s of the form

$$\Pr\{D; \mu, \Omega\} = \int 1\{y^* \in D\}\,\phi(y^* - \mu, \Omega)\, dy^*. \qquad (22)$$

Thus, simulation of the likelihood can be restricted to terms in Pr{D; μ, Ω}. Simulation of the score in (14) involves only the additional terms

$$\nabla_\mu \Pr\{D; \mu, \Omega\} = \Omega^{-1} \int 1\{y^* \in D\}(y^* - \mu)\,\phi(y^* - \mu, \Omega)\, dy^*,$$
$$\nabla_\Omega \Pr\{D; \mu, \Omega\} = \tfrac{1}{2}\,\Omega^{-1} \left[\int 1\{y^* \in D\}\big[(y^* - \mu)(y^* - \mu)' - \Omega\big]\,\phi(y^* - \mu, \Omega)\, dy^*\right] \Omega^{-1}. \qquad (23)$$

Normalized by Pr{D; μ, Ω}, these equations transform to

$$\frac{\nabla_\mu \Pr\{D; \mu, \Omega\}}{\Pr\{D; \mu, \Omega\}} = \Omega^{-1}\big[E(y^* \mid y^* \in D) - \mu\big], \qquad \frac{\nabla_\Omega \Pr\{D; \mu, \Omega\}}{\Pr\{D; \mu, \Omega\}} = \tfrac{1}{2}\,\Omega^{-1}\big\{E[(y^* - \mu)(y^* - \mu)' \mid y^* \in D] - \Omega\big\}\,\Omega^{-1}, \qquad (24)$$

which are terms in (17).


In the remainder of this section, we will discuss the simulation of (22)-(24). For
this purpose we denote

yTi= [yi;m= l,...,M;m#i],

p-i S E(Y*i)j
n_i,~i ~ V(YTi),

R_i,i = cov(y*,,y*)

and the conditional moments

p_i,i(y*) s E(Y*~~YF) and R-,.-iii E t(YiIYF).

These conditional moments have the well-known formulas

The conditional mean and variance of y* given y? i, denoted nil _ i and nii, _ i, are
defined analogously. We also define

y*,,=[y$m= l,...,i- 11

and use a similar notation for the marginal and conditional moments of this
subvector of random variables. For example, the conditional mean ofy: given y*, i is

3.2. Censored simulation

We begin by focusing on the integrals in (22) and (23) accumulated in the vector

$$E[h(y^*, D)] \equiv E\left[1\{y^* \in D\} \begin{pmatrix} 1 \\ y^* \\ \mathrm{vec}(y^* y^{*\prime}) \end{pmatrix}\right]. \qquad (25)$$

The elements of h are censored random variables. We consider two basic methods of simulation: direct censoring of the multivariate normal random variable and importance sampling.

3.2.1. Multivariate normal simulation

A direct method for simulating Pr{D; μ, Ω} and its derivatives is to make repeated Monte Carlo draws for η, use (20) to calculate y* for each η, and then form an empirical analogue of (25) by working only with the realizations that fall in set D. Let {η₁, …, η_R} be R simulated draws from the N(0, I_M) distribution and ỹ_r = μ + Γη_r (r = 1, …, R), so that

$$\bar{h} = \frac{1}{R} \sum_{r=1}^{R} h(\tilde{y}_r, D)$$

is an unbiased simulation of (25). As R gets larger, the sampling variance of the first element of h̄, P(1 − P)/R, approaches zero and h̄ converges strongly to E[h(y*, D)]. The simulation of Pr{D; μ, Ω} is simply the observed frequency with which the simulations of y* fall into D. Its derivatives with respect to μ and Ω are functions of the average simulation of 1{y* ∈ D}y* and 1{y* ∈ D}y*y*′. We will call this the crude Monte Carlo (CMC) simulator. Lerman and Manski (1981) conducted the first extensive application of Monte Carlo integration as a numerical technique to the estimation of LDV models using the CMC simulator.
The CMC is quick to compute and ideal for computers with a vectorization facility.¹³ However, the CMC also has at least two major drawbacks. First, it is not continuous in parameters. The simulator jumps at parameter values where a ỹ_r is on the boundary of D. For example, consider parameter values (μ₀, Γ₀) chosen so that the mth element of the rth simulation equals its lower bound in D:

$$\mu_{0m} + \Gamma_{0m}\,\eta_r = b_{0m},$$

where Γ₀ₘ is the mth row of Γ₀. Decreasing the parameter μ_m from μ₀ₘ will cause the indicator 1{ỹ_r ∈ D} to jump from 1 to 0, and this will result in discrete jumps in the elements of h(ỹ_r, D) and h̄. Such discontinuities make computation of estimators and asymptotic distribution theory awkward.¹⁴ Second, the number of computations required by the CMC rises inversely with Pr{D; μ, Ω}, which makes it intractable when this probability is small. It should be noted that in principle the accuracy of the CMC can be improved by use of so-called simulation-variance-reduction techniques, as, for example, the use of control and antithetic variates. See Hendry (1984) for definitions.

¹³ Such a mechanism allows simultaneous operation on adjacent elements of a vector using multiple processors. See Hajivassiliou (1993b) which shows that the CMC exhibits the greatest speed gains from vectorization among 13 alternative simulation methods.
¹⁴ See Quandt (1986) for a discussion of iterative parameter search algorithms and their requirements for differentiability of the function to be optimized.
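The CMC frequency simulator is equally short to code; the sketch below (our own, for a rectangular D = {b₀ < y* < b₁}) returns the observed frequency described above:

```python
import numpy as np

def cmc_probability(mu, Omega, b0, b1, R, rng):
    """Crude Monte Carlo simulator: the observed frequency with which
    simulations of y* = mu + Gamma*eta fall into D = {b0 < y* < b1}."""
    Gamma = np.linalg.cholesky(Omega)
    y = mu + rng.standard_normal((R, len(mu))) @ Gamma.T
    inside = np.all((y > b0) & (y < b1), axis=1)
    return inside.mean()        # unbiased for Pr{D}; variance P(1 - P)/R
```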

3.2.2. Importance sampling

Importance sampling is another general method for reducing the sampling variance of integrals computed by Monte Carlo integration over (censoring) intervals. The CMC involves sampling y* from the φ(y* − μ, Ω) p.d.f. and evaluating the function h(y*, D). A simple generalization of this procedure rewrites E[h] in terms of another sampling distribution γ:

$$E[h] = \int h(y^*, D)\,\phi(y^* - \mu, \Omega)\, dy^* = \int \left[h(\tilde{y}, D)\,\frac{\phi(\tilde{y} - \mu, \Omega)}{\gamma(\tilde{y}; \mu, \Omega, \delta)}\right] \gamma(\tilde{y}; \mu, \Omega, \delta)\, d\tilde{y},$$

where δ is a vector of parameters characterizing the design of the importance sampler γ(·). Note that for h(·) = 1{ỹ ∈ D}, this expression corresponds to Pr{D; μ, Ω}. By drawing a random variable ỹ from the importance p.d.f. γ and evaluating the weighted indicator function h(ỹ)w(ỹ), where

$$w(\tilde{y}) = \frac{\phi(\tilde{y} - \mu, \Omega)}{\gamma(\tilde{y}; \mu, \Omega, \delta)},$$

one obtains an alternative unbiased simulation of Pr{D; μ, Ω}. The first advantage offered by importance sampling is the ability to substitute sampling from γ for sampling from φ. In some cases, γ may be sampled more quickly, or, in a more general setting, sampling from φ may be impractical.

In addition, if γ also has an analytical integral over a truncated sampling region C such that D ⊆ C, then this analytical integral can be exploited as an approximation to Pr{D; μ, Ω} as follows:

$$\Pr\{D; \mu, \Omega\} = \Pr\{\tilde{y} \in C\} \int_C 1\{\tilde{y} \in D\}\, w(\tilde{y})\, \frac{\gamma(\tilde{y}; \mu, \Omega, \delta)}{\Pr\{\tilde{y} \in C\}}\, d\tilde{y}.$$

By drawing from the truncated p.d.f. γ(ỹ; μ, Ω, δ)/Pr{ỹ ∈ C}, fewer simulations are wasted on outcomes of zero and, in effect, Pr{ỹ ∈ C}w(ỹ) approximates Pr{D; μ, Ω}. When γ is a good approximation to φ, so that the ratio of densities w = φ/γ is relatively constant, the sampling variance of the importance-sampling simulator is small. As noted above, the sampling variance of the CMC for a single simulation is P(1 − P), while the sampling variance of the importance sampler is

$$V\big(1\{\tilde{y} \in D\}\, w(\tilde{y})\big) = P_D\big[V(w(\tilde{y}) \mid \tilde{y} \in D) + (1 - P_D)\, E(w(\tilde{y}) \mid \tilde{y} \in D)^2\big],$$

where P_C = Pr{ỹ ∈ C} and P_D = Pr{ỹ ∈ D}. In the extreme case that γ = φ, V(w(ỹ) | ỹ ∈ D) = 0 and E(1{ỹ ∈ D}w(ỹ)) = P_D. Therefore, good approximations to φ afford improvements over the CMC. Geweke (1989) introduces importance sampling in Monte Carlo integration in the context of Bayesian estimation.¹⁵

Definition 4. GHK importance-sampling simulator

The GHK importance p.d.f. is the recursively truncated multivariate normal p.d.f.

$$\gamma(\tilde{y}; \mu, \Omega, \delta) = \prod_{m=1}^{M} \frac{\phi\big(\tilde{y}_m - \mu_{m|<m}(\tilde{y}_{<m}),\ \Omega_{mm|<m}\big)}{\Phi(c_{1m}) - \Phi(c_{0m})} \qquad (26)$$

for ỹ ∈ D, where σ_{m|<m} = √Ω_{mm|<m} and

$$c_{im} \equiv \big[b_{im} - \mu_{m|<m}(\tilde{y}_{<m})\big]\big/\sigma_{m|<m}, \qquad i = 0, 1.$$

By construction, the support of this p.d.f. is D. Conditional on ỹ_{<m}, ỹ_m is univariate truncated normal on D, with conditional mean determined by ỹ_{<m}. Draws from γ can be made recursively according to the formula

$$\tilde{y}_m = \mu_{m|<m}(\tilde{y}_{<m}) + \sigma_{m|<m}\,\Phi^{-1}\big[\omega_m\,\Phi(c_{1m}) + (1 - \omega_m)\,\Phi(c_{0m})\big], \qquad (27)$$

where the ω_m are independently distributed uniform random variables.¹⁶ The GHK simulator is the product

$$h_{\mathrm{GHK}}(\tilde{y}) = \prod_{m=1}^{M} \big[\Phi(c_{1m}) - \Phi(c_{0m})\big] \cdot \begin{pmatrix} 1 \\ \tilde{y} \\ \mathrm{vec}(\tilde{y}\tilde{y}') \end{pmatrix}. \qquad (28)$$

It is an unbiased simulator of E(h).

The GHK simulator was developed by Geweke (1992), Hajivassiliou and McFadden (1990) and Keane (1990). Experience suggests that the sampling variance of h_GHK(ỹ) is very small so that it approximates E(h) well in practice. This approximant has the properties of lying in the unit interval, summing to one over all the disjoint rectangular regions surrounding and including D, and being a continuous function of ω, μ, Ω, b₀, and b₁. These properties are discussed in Börsch-Supan and Hajivassiliou (1993). Moreover, Hajivassiliou et al. (1992) found conclusive evidence for the superior root-mean-squared-error performance of the GHK method in an extensive Monte Carlo study comparing the GHK to 12 other simulators for normal rectangle probabilities Pr{D; μ, Ω}.

¹⁵ Other investigations of the use of Monte Carlo integration in Bayesian analysis are, inter alia, Bauwens (1984), Kloek and van Dijk (1978), van Dijk (1987) and West (1990).
¹⁶ This method is described extensively in Devroye (1986) and is a simple application of the cumulative probability integral transform result - see Feller (1971). Computationally more efficient methods for generating univariate truncated normal variates exist - for example Geweke (1992). The advantage of the method presented in the preceding equation, however, is that it is continuous in μ, Ω, and ω_m, which, as already mentioned, is a desirable property of simulators for asymptotic theory and for iterative parameter search. The method of constructing γ in this example can also be extended to a bivariate version using a bivariate normal c.d.f. and standardizing adjacent pairs of elements.
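The following sketch implements the GHK probability simulator of Definition 4 for a rectangle D = {b₀ < y* < b₁} (our own illustration, not the authors' code; the conditional moments μ_{m|<m} and σ_{m|<m} are read off the rows of the Cholesky factor):

```python
import numpy as np
from scipy.stats import norm

def ghk_probability(mu, Omega, b0, b1, R, rng):
    """GHK simulator of Pr{b0 < y* < b1}, y* ~ N(mu, Omega).
    Each replication draws y*_m conditional on the earlier elements from the
    truncated univariate normal of eq. (27) and accumulates the product of
    one-dimensional truncation probabilities, the first element of eq. (28)."""
    M = len(mu)
    L = np.linalg.cholesky(Omega)            # y* = mu + L eta
    total = 0.0
    for _ in range(R):
        e = np.zeros(M)                      # accepted standard-normal components
        weight = 1.0
        for m in range(M):
            cond = mu[m] + L[m, :m] @ e[:m]  # conditional mean given earlier draws
            lo = norm.cdf((b0[m] - cond) / L[m, m])
            hi = norm.cdf((b1[m] - cond) / L[m, m])
            weight *= hi - lo                # Phi(c_1m) - Phi(c_0m)
            u = rng.uniform()
            e[m] = norm.ppf(lo + u * (hi - lo))   # inverse-c.d.f. truncated draw
        total += weight
    return total / R

# Negative orthant probability of an equicorrelated trivariate normal:
rng = np.random.default_rng(0)
mu = np.zeros(3)
Omega = np.eye(3) + 0.5 * (np.ones((3, 3)) - np.eye(3))
print(ghk_probability(mu, Omega, -np.inf * np.ones(3), np.zeros(3), 500, rng))
```

Note that the simulated probability is smooth in μ and Ω for fixed uniforms, which is the continuity property emphasized above.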

3.3. Truncated simulation

We now turn to the expectations in (24). These are ratios of the integrals in (25) and cannot be simulated without bias using the censored simulation methods above. Even ignoring the bias, one must ensure that the denominator of the ratio is not zero. For example, the CMC and some importance-sampling simulators can yield outcomes of zero for probabilities and thus violate this requirement.¹⁷ In this subsection, we describe two general procedures which draw directly from the truncated distributions associated with the expectations in equation (24).

3.3.1. Acceptance/rejection methods

Acceptance/rejection (A/R) methods provide a mechanism for drawing from a conditional density when practical exact transformations from uniform or standard normal variates are not available. The following result is standard; see Devroye (1986), Fishman (1973) or Rubinstein (1981) for proofs.

Proposition 1

Suppose φ(y*) is a J-dimensional density, and one wishes to sample from the conditional density φ(y* | D) = φ(y*)/∫_D φ(y*) dy*. Suppose γ(ỹ) is a density with a support A from which it is practical to sample, with the property that

$$\sup_{\tilde{y} \in D} \frac{\phi(\tilde{y})}{\gamma(\tilde{y})} \le \alpha < +\infty,$$

where D ⊆ A. Draw ỹ from γ and ω from a uniform density on [0, 1], repeat this process until a pair satisfying ỹ ∈ D and φ(ỹ) ≥ α·ω·γ(ỹ) is observed, and accept the associated ỹ. Then, the accepted points have density φ(· | D).

¹⁷ It should be noted that one of the attractive properties of the GHK simulator is that it generates simulated probability values that are bounded away from 0 and 1, unlike many other importance-sampling simulators. See Börsch-Supan and Hajivassiliou (1993) for details.

The choice of a suitable comparison density γ(·) is important because it determines the expected yield of the acceptance/rejection scheme. The main attractive feature of A/R is that the accepted draws have the correct truncated distribution. The practical shortcoming, though, is that the number of operations necessary until a specific number of draws are accepted may be very large.
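As a concrete instance of Proposition 1, the sketch below (our own) takes γ equal to the untruncated density φ itself, so that α = 1 and the acceptance step reduces to keeping the draws that land in D, the scheme already described in Section 2.3:

```python
import numpy as np

def truncated_mvn_ar(mu, Omega, b0, b1, n_draws, rng):
    """Acceptance/rejection sampling from N(mu, Omega) truncated to
    D = {b0 < y < b1}. With comparison density gamma = phi (alpha = 1),
    the condition phi(y) >= alpha*omega*gamma(y) always holds, so we
    simply keep the draws that fall in D (Proposition 1)."""
    Gamma = np.linalg.cholesky(Omega)
    out = []
    while len(out) < n_draws:
        y = mu + Gamma @ rng.standard_normal(len(mu))
        if np.all((y > b0) & (y < b1)):
            out.append(y)
    return np.array(out)
```

With this choice the expected number of trials per accepted draw is 1/Pr{D}, which illustrates the shortcoming just noted when Pr{D} is small; a better-matched γ, such as the GHK density, raises the yield.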
The A/R scheme also provides an unbiased simulator of 1/Pr{D; μ, Ω} if ∫_D γ(ỹ) dỹ = T(D) is practical to compute. The conditional probability of acceptance, given {ỹ ∈ D}, is Pr{D}/α, so that the marginal probability of acceptance is T(D)·Pr{D}/α. The distribution of the number of trials to get an acceptance is the geometric distribution and its expectation is α/[T(D)·Pr{D}]. Therefore, if t is the number of draws made until ỹ is accepted, t·T(D)/α is an unbiased simulator of 1/Pr{D}.

Example 7

The recursively truncated normal p.d.f. in Definition 4 works well in practice as the comparison distribution. A bound on the density ratio is given by

$$\alpha = \prod_{m=1}^{M} \big\{\Phi\big(b_{1m} - \mu_{m|<m}(b_{1,<m}),\ \Omega_{mm|<m}\big) - \Phi\big(b_{0m} - \mu_{m|<m}(b_{0,<m}),\ \Omega_{mm|<m}\big)\big\},$$

where the conditional moments are conditioned on ỹ_{<m} equal to the boundaries. Since A = D, T(D) = 1.

3.3.2. Gibbs resampling

Gibbs resampling is another way to draw from truncated distributions. An infinite number of calculations are required to generate a finite number of draws with distribution approaching the true one. But convergence to the true distribution is geometric in the number of resamplings, hence the performance of this simulator in practice is generally very satisfactory. In addition, this simulator is continuous and differentiable in the parameters μ and Ω. The Gibbs simulator is based on a Markov chain that utilizes computable univariate truncated normal densities to construct transitions, and has the desired truncated multivariate normal as its limiting distribution.¹⁸ This simulator is defined by the following Markovian updating scheme.

Proposition 2

Consider the multivariate normal distribution N(μ, Ω) truncated on D, which is assumed to be finite. Define a recursive procedure with steps j = 1, …, J in rounds g = 1, …, G. Let {y*^(jg)} be a sequence on D such that on the jth step of the gth round, the jth element of y*^(jg) is computed from y*^(j−1,g) by

$$y_j^{*(jg)} = \mu_{j|-j}\big(y^{*(j-1,g)}\big) + \sigma_{j|-j}\,\Phi^{-1}\big[\omega_{jg}\,\Phi\big(c_{j|-j}^1\big) + (1 - \omega_{jg})\,\Phi\big(c_{j|-j}^0\big)\big],$$

where

$$c_{j|-j}^i \equiv \big[b_{ij} - \mu_{j|-j}\big(y^{*(j-1,g)}\big)\big]\big/\sigma_{j|-j}, \qquad i = 0, 1,$$

and the ω_{jg} are independent uniform [0, 1] variates and σ_{j|-j} = √Ω_{jj|-j}. Repeat this process for G Gibbs resampling rounds. Then the random draws obtained by this simulator have a distribution that converges in L₁ norm at a geometric rate to the true truncated distribution {y* | y* ∈ D} as the number of Gibbs resampling rounds G grows to infinity.

This result is proved in Hajivassiliou and McFadden (1990). It relies on stochastic relaxation techniques as discussed in Geman and Geman (1984). See also Tierney (1992) for other theoretical results on the Gibbs resampling scheme. We present below Monte Carlo experiments with simulation estimators based on this truncated simulation scheme.

¹⁸ This simulator can be generalized in principle to non-normal distributions, provided the corresponding univariate distributions are easy to sample.

4. Simulation and estimation of LDV models

4.1. Overview

In this section, we bring together the parametric estimation of the LDV models
described in Section 2 with the simulation methods in Section 3. Our focus is the
consistent estimation of the parameters of the model; we defer the discussion of
limiting distributions to a later section. Our exposition follows the general historical
trend of thought in this area. We begin with the application of simulation to
approximating the log-likelihood function. Next, we consider the simulation of
moment functions. Because of the simulation biases that naturally arise in the
log-likelihood approach, the unbiased simulation of moment functions and the
method of moments are an alternative approach. Finally, we discuss simulation of
the score function. Solving the normal equations of ML estimation is a special case
of the method of moments and simulating the score function offers the potential for
efficient estimation.
One can organize a description of the methods along the following lines. Figure 1
gives a diagrammatic presentation of a useful taxonomy.

[Figure 1. Taxonomy of simulation estimators]

¹⁹ The usefulness of Gibbs resampling for Bayesian estimation has been recognized by Geweke (1992), Chib (1993), and by McCulloch and Rossi (1993).

In this figure, the various

estimation methods are represented as elliptical sets and the properties of the
associated simulation methods are represented as rectangular sets. Five families of
estimation methods are pictured. All of the methods fall into the class of generalized
method of simulated moments (GMSM). This is the simulated counterpart to the
generalized method of moments (GMM) (see Newey and McFadden (1994)). Within
the GMSM fall the method of simulated scores (MSS), the simulated EM (SEM), the
method of simulated moments (MSM), and maximum simulated likelihood (MSL).
In parallel with the types of LDV models, the simulation methods are divided
between censored and truncated sampling. The simulation methods are further
separated into those that simulate the efficient score of the LDV models with and
without bias.
The MSM is a simulated counterpart to the method of moments (MOM). As the
figure shows, the MSM is restricted to simulation methods that generate unbiased
simulations using censored simulation methods. The MSL estimation method also
rests on censored simulation but, as we will explain, the critical object (the log-
likelihood function) is simulated with bias. The SEM algorithm is an extension of
the EM algorithm using unbiased simulations from truncated distributions; it falls,
therefore, in the upper half of the figure. Of these methods, only the MSS has
versions that use both classes of simulation methods, censored and truncated, that
we have described above.
Throughout this section, we will assume that we are working with models for
which the maximum likelihood estimator is well-behaved. In particular, we suppose

that the usual regularity conditions are met, ensuring that the ML estimator is the
most efficient CUAN estimator. We will illustrate the methods using the rank
ordered probit model. This LDV model is a natural candidate for most approaches
to estimation with simulation and the exact MLE performs well in small samples.

Example 8. Rank ordered probit

The rank ordered probit model is a generalization of the multinomial probit model
described in Example 1. Instead of observing only the most preferred (or highest
ranked) alternative, each observation records the rank order of the alternatives
from most preferred to least preferred. The rank ordering yields considerably more
information about the underlying preference parameters than the simpler, highest-
ranked-alternative response. Hence, consumer survey designers often prefer to ask
for complete rankings.
We can express the observation rule of rank ordered data algebraically as

$$Y_{ij} = \tau_{ij}(y^*) = 1\{y_{(i)}^* = y_j^*\}, \qquad i, j = 1, \ldots, J,$$

where the {y*_(i)} correspond to the order statistics of y*,

$$y_{(1)}^* < y_{(2)}^* < \cdots < y_{(J)}^*,$$

so that the first row of y contains indicators of the smallest element of y* and so on until the last row indicates the largest element of y*. The sample space of y consists of the J! = J × (J − 1) × ⋯ × 2 different J × J matrices containing zeros and ones such that only a single entry equals one in each row and column:

$$B = \Big\{Y \in \mathbb{R}^{J \times J} \,\Big|\, Y_{ij} \in \{0, 1\},\ \sum_i Y_{ij} = \sum_j Y_{ij} = 1\Big\}.$$
Thus, even moderate numbers of alternatives correspond to discrete sampling


spaces with many outcomes.
The c.d.f. of y is not particularly informative; it is simpler to derive the probability of each possible outcome directly. The rank ordering y corresponds to values of y* in a set satisfying J − 1 inequalities:

$$D(y) = \{y^* \in \mathbb{R}^J \mid y_{1\cdot}\, y^* \le y_{2\cdot}\, y^* \le \cdots \le y_{J\cdot}\, y^*\},$$

where y_{j·} is the row vector [y_{j1}, …, y_{jJ}]. Such additional inequalities as y_{1·}y* ≤ y_{3·}y* are redundant. As in the multinomial choice model, it is convenient to transform the latent y* into a vector of J − 1 differences:

$$z_y \equiv \Delta_y y^* = [y_{i\cdot}\, y^* - y_{i+1,\cdot}\, y^*;\ i = 1, \ldots, J - 1],$$

where

$$\Delta_y \equiv [y_{ij} - y_{i+1,j};\ i = 1, \ldots, J - 1;\ j = 1, \ldots, J]$$

is a (J − 1) × J differencing matrix. According to this definition, D(y) = {y* | z_y ≤ 0}. The transformed random vector z_y is also multivariate normal and for all Y ∈ B,

$$f(\theta; Y) = \Pr\{y = Y;\ \mu, \Omega\} = \Phi(-\Delta_Y \mu,\ \Delta_Y \Omega \Delta_Y'). \qquad (29)$$

One probability term in this p.d.f. is equivalent in computational complexity to the normal orthant integrals of the choice probabilities in Example 1.
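Since (29) is a negative-orthant probability of the differenced vector z_Y, a GHK routine such as the sketch in Section 3.2 applies directly; the following hypothetical illustration builds Δ_Y from a ranking and passes the transformed moments to such a simulator:

```python
import numpy as np

def rank_ordered_probability(mu, Omega, ranking, ghk, R, rng):
    """Pr{y = Y; mu, Omega} of eq. (29) for a ranking listed from worst to
    best alternative: build the (J-1) x J differencing matrix Delta_Y and
    evaluate the orthant probability Pr{Delta_Y y* <= 0} by simulation."""
    J = len(mu)
    Delta = np.zeros((J - 1, J))
    for i in range(J - 1):
        Delta[i, ranking[i]] = 1.0           # row i encodes y_(i) - y_(i+1) <= 0
        Delta[i, ranking[i + 1]] = -1.0
    m, V = Delta @ mu, Delta @ Omega @ Delta.T
    lower = -np.inf * np.ones(J - 1)
    return ghk(m, V, lower, np.zeros(J - 1), R, rng)

# e.g., with the ghk_probability sketch from Section 3.2:
# rng = np.random.default_rng(0)
# p = rank_ordered_probability(mu, Omega, [2, 0, 3, 1], ghk_probability, 500, rng)
```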

We will use the various simulation and estimation methods to estimate this rank
ordered probit model in Monte Carlo experiments. Because a natural standard of
comparison is the MLE, we present first a Monte Carlo experiment for the MLE
in a workable case.

Example 9

When J = 4, the MLE is computable using standard approximation methods. In our basic Monte Carlo experiment the population parameters will be

$$\mu = \begin{bmatrix} -1 \\ -1/3 \\ 1/3 \\ 1 \end{bmatrix} \qquad \text{and} \qquad \Omega = \begin{bmatrix} 1 & 1/2 & 0 & 0 \\ 1/2 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1/2 \\ 0 & 0 & 1/2 & 1 \end{bmatrix}.$$

These values yield a reasonable amount of variation in y and they induce significant
inconsistency in the popular rank ordered logit estimator (Beggs et al. (1981)) when it is applied to the data. The block diagonal Ω contains covariances among the latent y* that are zero in the latent logit model. The μ and Ω parameters are not all identifiable and so we normalize by reducing the parameterization to Δ_Y μ and Δ_Y Ω Δ_Y′ for Y = I₄, the 4 × 4 identity matrix. The first variance in Δ_Y Ω Δ_Y′ is also scaled to 1. In order to restrict Δ_Y Ω Δ_Y′ to be positive semi-definite, this covariance matrix is also parameterized in terms of its Cholesky square root. Putting the mean parameters first, then stacking the nonzero elements of the Cholesky parameters, the identifiable population parameter vector is θ₀ = [−0.6667, −0.6667, −0.6667, −0.5000, 1.3230, 0.0000, −0.3780, 0.9258].

The basic Monte Carlo experiment will be a random draw from the distribution of each estimator for N = 100 observations on y. There will be 500 replications of each estimator. Results of the experiment for the MLE are in Table 1. The MLE has a small bias relative to its sampling variance and the sampling variance is small enough to make hypothesis tests for equal means or zero covariances quite powerful.

Table 1
Sample statistics for rank ordered probit MLE

Parameter   Population value   Mean      Standard deviation   Lower quartile   Median      Upper quartile
θ₁          −0.6667            −0.6864   0.1317               −0.7702          −0.6807     −0.5921
θ₂          −0.6667            −0.6910   0.2351               −0.8354          −0.6629     −0.5231
θ₃          −0.6667            −0.7063   0.2263               −0.8276          −0.6648     −0.5374
θ₄          −0.5000            −0.5135   0.2265               −0.6402          −0.5016     −0.3645
θ₅          1.3230             1.3536    0.3002               1.130            1.317       1.519
θ₆          0.0000             −0.0127   0.1797               −0.1241          −0.008616   0.09545
θ₇          −0.3780            −0.4081   0.1909               −0.5158          −0.3891     −0.2765
θ₈          0.9258             0.9385    0.2461               0.7513           0.9140      1.074

It appears that the bias in the MLE is largely caused by asymmetry in the sampling
distribution: The medians are closer to the population values than the means.
Overall, the asymptotic approximation to the distribution of the MLE is good. The
inverse information matrix predicts the standard deviations in the fourth column
of Table 1 to be 0.1296, 0.1927, 0.1703, 0.2005, 0.2248, 0.1543, 0.1514, 0.1987.
Therefore, the actual sampling distribution has more variation than the asymptotic
approximation.
For the simulation estimators, we will also conduct Monte Carlo experiments for
a model with J = 6 alternatives. In that case, the MLE is not easily computed. We
will use the population values

μ = (-1, -3/5, -1/5, 1/5, 3/5, 1)′  and  Ω =

[ 1    1/2  0    0    0    0
  1/2  1    0    0    0    0
  0    0    5/4  3/4  1/2  1/2
  0    0    3/4  5/4  1/2  1/2
  0    0    1/2  1/2  5/4  3/4
  0    0    1/2  1/2  3/4  5/4 ],

which correspond to θ_0 = [-0.4000, -0.4000, -0.4000, -0.4000, -0.4000, -0.5000, 1.414, 0.000, -0.3536, 0.9354, 0.000, -0.1768, -0.6013, 1.052, 0.000, 0.000, 0.000, -0.4752, 0.8799] when normalizing on Y = I_6.

4.2. Simulation of the log-likelihood function

One of the earliest applications of simulation to estimation was the general compu-
tation of multivariate integrals in such likelihoods as that of the multinomial probit
by Monte Carlo integration. Crude Monte Carlo simulation can approximate the
probabilities of the multinomial probit to any desired degree of accuracy, so that
the corresponding maximum simulated likelihood (MSL) estimator can approximate the ML estimator.

Definition 5. Maximum simulated likelihood

Let the log-likelihood function for the unknown parameter vector θ given the sample of observations (y_n, n = 1, …, N) be

l_N(θ) = Σ_{n=1}^N ln f(θ; y_n),

and let f̃(θ; y, ω) be an unbiased simulator, so that f(θ; y) = E_ω[f̃(θ; y, ω)|y], where ω is a simulated vector of R random variates. The maximum simulated likelihood estimator is

θ̂_MSL ≡ arg max_θ l̃_N(θ),

where

l̃_N(θ) = Σ_{n=1}^N ln f̃(θ; y_n, ω_n)

for some given simulation sequence {ω_n}.

It is important to note that the MSL estimator is conditional on the sequence of simulations {ω_n}. For both computational stability and asymptotic distribution theory, it is important that the simulations do not change with the parameter values. See McFadden (1989) and Pakes and Pollard (1989) for an explanation of this point.
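Schematically, the recipe looks like the following Python sketch (our own illustration; f_tilde stands in for any unbiased probability simulator such as GHK, and scipy's optimizer stands in for whatever routine is actually used). The draws are generated once, outside the objective, so that l̃_N(θ) is a deterministic function of θ during optimization.

import numpy as np
from scipy.optimize import minimize

def msl_estimate(y, f_tilde, theta0, R=1, J=4, seed=0):
    # f_tilde(theta, y_n, omega_n) -> unbiased simulator of f(theta; y_n),
    # where omega_n is an (R, J-1) array of uniform variates (cf. Example 10).
    N = len(y)
    rng = np.random.default_rng(seed)
    omega = rng.uniform(size=(N, R, J - 1))   # frozen across all theta

    def neg_sim_loglik(theta):
        return -sum(np.log(f_tilde(theta, y[n], omega[n])) for n in range(N))

    return minimize(neg_sim_loglik, theta0, method="Nelder-Mead").x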

Example 10

Börsch-Supan and Hajivassiliou (1993) proposed MSL estimation of the multinomial probit model of Example 1 using the GHK simulator for the choice probabilities. In this example, we make similar calculations for the rank ordered probit model of Example 9. Instead of the normal probability function in (29), we used the probability simulator in the first element of h_GHK in (28) to compute the simulated log-likelihood function l̃_N(θ).²⁰ For the simulations of the probability of each observation, we drew a vector of J - 1 = 3 independently distributed uniform random variables for each ω_n. For each replication of θ̂_MSL, we drew a new data set

²⁰The order of integration affects this simulator, but we do not attempt to describe our particular orderings. They were chosen purely on the basis of a convenient algorithm for finding the limits of integration.

Table 2
Sample statistics for rank ordered probit MSLE using GHK (J = 4, R = 1).

Parameter  Population value   Mean      Standard deviation  Lower quartile  Median     Upper quartile
θ1         -0.6667           -0.7230    0.1424             -0.8219         -0.7198    -0.6253
θ2         -0.6667           -0.6077    0.2162             -0.7342         -0.5934    -0.4640
θ3         -0.6667           -0.9555    0.2520             -1.087          -0.9256    -0.7860
θ4         -0.5000           -0.6387    0.1430             -0.7305         -0.6415    -0.5379
θ5          1.3230            1.2595    0.1741              1.134           1.237      1.353
θ6          0.0000            0.0131    0.1717             -0.09063        -0.01013    0.1285
θ7         -0.3780           -0.6715    0.2088             -0.7883         -0.6639    -0.5292
θ8          0.9258            1.3282    0.2211              1.185           1.301      1.448

{(y_n, ω_n); n = 1, …, N} before maximizing l̃_N(θ) over θ. Each f̃(θ; y_n, ω_n) consisted of a single simulation of f(θ; y_n) (R = 1).
The results of this Monte Carlo experiment for J = 4 are in Table 2. In contrast
with the MLE, this MSLE exhibits much larger bias. The median is virtually
identical to the mean. The sampling variances are also larger, particularly for the
covariance parameters. Nevertheless, this MSLE gives a rough approximation to
the population parameters.
The results of this Monte Carlo experiment for J = 6 are in Table 3. For brevity,
only the mean parameters are listed. Once again, substantial biases appear in the
sample of estimators. Given our experience with J = 4, it seems likely that these
biases are largely due to simulation. We will confirm this below as we apply other
methods to this case.

Note that unbiased simulation of the likelihood function is neither necessary nor
sufficient for consistent MSL estimation. Because the estimator is a nonlinear
function (through optimization) of the simulator, the MSL estimator will generally
be a biased simulation of the MLE even when the criterion function of estimation

Table 3
Sample statistics for rank ordered probit MSLE using GHK (J = 6, R = 1).

Population value   Mean      Standard deviation  Lower quartile  Median     Upper quartile
-0.4000           -0.4585    0.1504             -0.5565         -0.4561    -0.3664
-0.4000           -0.2489    0.2059             -0.3898         -0.2460    -0.0940
-0.4000           -0.5054    0.1710             -0.6056         -0.4957    -0.3891
-0.4000           -0.4589    0.2013             -0.5779         -0.4551    -0.3216
-0.4000           -0.6108    0.1882             -0.6934         -0.6016    -0.5042

is simulated without bias, because

E[l̃(θ)] = l(θ)  ⇏  E[arg max_θ l̃(θ)] = arg max_θ l(θ).
Note also that while unbiased simulation of the likelihood function is often straight-
forward, unbiased simulation of the log-likelihood is generally infeasible. The
logarithmic transformation of the intractable function introduces a nonlinearity
that cannot be overcome simply. However, to obtain an estimator with the same
probability limit as the MLE, a sufficient characteristic of a simulator for the
log-likelihood is that its sample average converge to the same limit as the sample
average log-likelihood. Only by reducing the error of a simulator for the log-
likelihood function to zero at a sufficiently rapid rate with sample size can one expect
to obtain a consistent estimator. Such results rest on a general proposition that
underlies the consistency of many extremum estimators (see Newey and McFadden
(1994), Theorem 2.1):

Lemma 1

Let

(1) θ ∈ Θ, a compact subset of ℝ^K,
(2) Q_0(θ), Q_N(θ) be continuous in θ,
(3) θ_0 = arg max_{θ∈Θ} Q_0(θ) be unique,
(4) θ̂_N = arg max_{θ∈Θ} Q_N(θ), and
(5) Q_N(θ) → Q_0(θ) in probability uniformly in θ ∈ Θ as N → ∞.

Then θ̂_N → θ_0 in probability.

We will assume from now on that the log-likelihood function is sufficiently regular to exploit this lemma. In particular, we suppose that the y_n are i.i.d., that θ is identifiable, that f(θ; y) is continuous at each θ in a compact parameter space Θ, and that E[sup_{θ∈Θ} |ln f(θ; y)|] < ∞. We refer the reader to Newey and McFadden (1994, Theorem 2.5) for further discussion of these conditions and their roles.
For LDV models with censoring, the generic likelihood simulator f̃(θ; y_n, ω_n) is the average of R replications of one of the simulation methods described above:

f̃(θ; y_n, ω_n) = (1/R) Σ_{r=1}^R f̂(θ; y_n, ω_nr).

If the model includes truncation, then the likelihood simulation typically involves a ratio of such averages, because a normalizing probability appears in the denominator, although unbiased simulation of the ratio is possible (see Section 3.3). In any case, the simulation error will generally be O_p(1/R). Thus, a common approach to approximating the log-likelihood function with sufficient accuracy is increasing the number of replications per observation R with the sample size N. This statistical approach is in contrast to a strictly numerical approach of setting R high enough to achieve a specified numerical accuracy independent of sample size.
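The bias that makes this necessary is just Jensen's inequality: E[ln f̃] < ln E[f̃] = ln f whenever f̃ has positive variance, with a gap of order V(f̃)/(2f²) that shrinks like 1/R as draws are averaged. A toy check of that logic (our own illustration; the uniform noise is an arbitrary unbiased simulator bounded away from zero):

import numpy as np

rng = np.random.default_rng(0)
p = 0.2                                  # the true probability f(theta; y)
for R in (1, 5, 25, 125):
    # unbiased simulator p + noise, averaged over R draws per evaluation
    p_tilde = p + rng.uniform(-0.1, 0.1, size=(100_000, R)).mean(axis=1)
    print(R, np.log(p_tilde).mean() - np.log(p))   # bias, roughly -0.042/R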

Example 11

For illustration, let us increase the replications in the previous examples from R = 1 simulation per observation to 5. The summary statistics are listed in Tables 4 and 5. In both cases, J = 4 and J = 6, the biases are significantly reduced. See Börsch-Supan and Hajivassiliou (1993) for a more extensive Monte Carlo study of the relationship between R and bias in the multinomial probit model.

In the rank ordered probit model and similar discrete LDV models, all that is necessary for estimator consistency is that R → ∞ as N → ∞. No relative rates are required provided that the likelihood is sufficiently regular. Nor must the simulations ω satisfy any restrictions on dependence across observations. The following proposition, taken from Lee (1992), establishes this situation.

Proposition 3

Let f(θ; y) be uniformly bounded away from zero for all θ ∈ Θ, a compact set, and all y ∈ B, the sample space of y. Assume that the set of regularity conditions in the

Table 4
Sample statistics for rank ordered probit MSLE using GHK (J = 4, R = 5).

Population value   Mean      Standard deviation  Lower quartile  Median     Upper quartile
-0.6667           -0.6795    0.1366             -0.7726         -0.6774    -0.5840
-0.6667           -0.6528    0.2267             -0.7913         -0.6268    -0.5029
-0.6667           -0.8327    0.2299             -0.9686         -0.8085    -0.6768
-0.5000           -0.5771    0.2159             -0.7076         -0.5641    -0.4412
 1.3230            1.3582    0.2459              1.1863          1.3184     1.5036
 0.0000           -0.0121    0.2089             -0.1380         -0.01570    0.1275
-0.3780           -0.5034    0.2016             -0.6256         -0.4875    -0.3753
 0.9258            1.1334    0.2454              0.9505          1.1142     1.2814

Table 5
Sample statistics for rank ordered probit MSLE using GHK (J = 6, R = 5).

Population value   Mean      Standard deviation  Lower quartile  Median     Upper quartile
-0.4000           -0.4088    0.1256             -0.4893         -0.4053    -0.3227
-0.4000           -0.3059    0.1776             -0.4200         -0.2966    -0.1846
-0.4000           -0.4554    0.1387             -0.5373         -0.4553    -0.3615
-0.4000           -0.4288    0.1661             -0.5369         -0.4219    -0.3142
-0.4000           -0.5046    0.1773             -0.6211         -0.4976    -0.3872

paragraph after Lemma 1 hold. Let {ω_nr} be an i.i.d. sequence over the index r. The MSL estimator θ̂_MSL = arg max_θ (1/N) Σ_{n=1}^N ln f̃(θ; y_n, ω_n) is consistent if R → ∞ as N → ∞.

Proof

By a uniform law of large numbers and the lower bound on f,

sup_{θ∈Θ} |f̃(θ; y, ω) - f(θ; y)| → 0 as R → ∞,

so that

sup_{θ∈Θ} |ln f̃(θ; y, ω) - ln f(θ; y)| → 0 as R → ∞.

Since our regularity assumptions in the paragraph after Lemma 1 guarantee that

sup_{θ∈Θ} |l_N(θ)/N - E[ln f(θ; y)]| → 0 in probability as N → ∞,

then l̃_N(θ)/N also converges uniformly to E[ln f(θ; y)] and consistency follows by Lemma 1. Q.E.D.

Thus, the property of estimator consistency makes modest demands on the simulations of the likelihood function. Strictly speaking, one could employ a common sequence of simulations {ω_r} for all simulated likelihoods, which grows at an arbitrarily slow rate with sample size. The differences between simulation designs appear only in the limiting normal distributions of the estimators. It is especially important to note that consistency does not confine such differences to sampling variances. Both the expectations and the variances of the approximate limiting distribution can be affected by the simulation design.
Note that Proposition 3 does not apply to models with elements of y which are
continuously distributed and unbounded. Additional work is needed in this area.
See Hajivassiliou and McFadden (1990) for the special conditions needed for an
example of a multiperiod (panel) autocorrelated tobit model.
From the standpoint of asymptotic distribution theory, the simplest use of
simulation makes independent simulations for the contribution of each observation
to the likelihood function. If elements of the sequence (w,,} are independent across
the observation index n, as well as the replication index I, then we preserve the
independence of the f(@ y,, o,,) and its derivatives across n, permitting the applica-
tion of familiar laws of large numbers and central limit theorems. When f is
2418 V.A. Hajimssiliou and P.A. Ruud

differentiable in θ, we can make a familiar linear approximation for θ̂_MSL:

0 = (1/√N) ∇_θ l̃_N(θ_0) + [(1/N) ∇²_θ l̃_N(θ̄)] √N(θ̂_MSL - θ_0), (30)

where the elements of θ̄ lie on the line segment between θ̂_MSL and θ_0. The consistency of θ̂_MSL implies the consistency of θ̄, which in turn implies that

(1/N) ∇²_θ l̃_N(θ̄) - E[∇²_θ ln f̃(θ_0; y, ω)] = o_p(1), (31)

using the argument that supports Proposition 3. The leading term is a sum of N i.i.d. terms,

(1/√N) ∇_θ l̃_N(θ_0) = (1/√N) Σ_{n=1}^N ∇_θ ln f̃(θ_0; y_n, ω_n),

to which we would like to apply a central limit theorem. But we are prevented from this by the fact that the expectation of these terms is not zero. Consider the simple factorization, obtained by adding and subtracting terms,

(1/√N) ∇_θ l̃_N(θ_0) = (1/√N) ∇_θ l_N(θ_0) + A_N + B_N, (32)

where

A_N = (1/√N) Σ_{n=1}^N {∇_θ ln f̃(θ_0; y_n, ω_n) - E_ω[∇_θ ln f̃(θ_0; y_n, ω_n)|y_n]},
B_N = (1/√N) Σ_{n=1}^N {E_ω[∇_θ ln f̃(θ_0; y_n, ω_n)|y_n] - ∇_θ ln f(θ_0; y_n)}. (33)

A_N is a sum of i.i.d. terms with zero expectation and can be viewed as the source of pure simulation noise in θ̂_MSL. B_N is the potential source of simulation bias. The next result can be used to show that R/√N → ∞ is a sufficient rate of increase to avoid such bias.

Proposition 4

Let μ̃(θ; y, ω) be an unbiased simulator for μ(θ; y) such that V(μ̃ - μ|y) = O(R⁻¹). Let s(θ; y, μ) be a moment function such that E[s(θ_0; y, μ)] = 0. Consider the simulator s̃(θ; y) = s(θ; y, μ̃) and let R/√N → ∞. If s is Lipschitz in μ ∈ S uniformly in θ, then the simulation bias B_N converges in probability to zero.

Proof

If s is Lipschitz in μ uniformly in θ, then

s̃ - s = [∇_μ s(θ; y, μ)](μ̃ - μ) + [∇_μ s(θ; y, μ*) - ∇_μ s(θ; y, μ)](μ̃ - μ),

where μ* is on the line segment joining μ̃ and μ. According to the hypothesis of unbiasedness,

E_ω(s̃ - s) = E_ω{[∇_μ s(θ; y, μ*) - ∇_μ s(θ; y, μ)](μ̃ - μ)},

so that

‖E_ω(s̃ - s)‖ ≤ M* E_ω(μ̃ - μ)² = O(R⁻¹)

for some finite M* according to the Lipschitz hypothesis. Therefore, B_N = O_p(√N/R) and the result follows. Q.E.D.

In the multinomial and rank ordered probit cases, the Lipschitz requirement is generally met by the regularity conditions that bound the discrete probabilities and the smoothness of the probability simulator f̃: μ = (f, ∇_θ f), μ̃ = (f̃, ∇_θ f̃), and s = (∇_θ f)/f. We are not aware of any slower rates for R that avoid bias in the limiting distribution of θ̂_MSL.

Proposition 5

Let f be bounded uniformly away from zero and Lipschitz in θ on a compact space Θ. Let f̃(θ; y, ω) be an unbiased differentiable simulator for f(θ; y), also bounded uniformly away from zero and Lipschitz in θ on Θ, such that V(f̃ - f) = O(R⁻¹). Let R/√N → ∞. Then the simulation components

A_N + B_N = (1/√N) Σ_{n=1}^N {∇_θ ln f̃(θ; y_n, ω_n) - ∇_θ ln f(θ; y_n)} → 0 in probability,

and θ̂_MSL is asymptotically efficient.



Proof

The difference between simulated and exact scores can be written

∇_θ ln f̃ - ∇_θ ln f = (∇_θ f̃)/f̃ - (∇_θ f)/f = (∇_θ f̃ - ∇_θ f)/f̃ - [(f̃ - f)/f̃](∇_θ f/f).

By the Chebyshev inequality,

Pr{ |(1/√N) Σ_{n=1}^N [∇_θ ln f̃(θ; y_n, ω_n) - ∇_θ ln f(θ; y_n)]| > ε } ≤ ε⁻² V[∇_θ ln f̃ - ∇_θ ln f] = O(√N/R)

for each component of the gradient. The result follows from this order and equations (30)-(33). Q.E.D.

Propositions 4 and 5 demonstrate that bias is the fundamental hurdle that MSL must overcome. The logarithmic transformation of the likelihood function forces one to increase R with the sample size to obtain a consistent estimator. Given enough simulations to overcome bias, there are enough simulations to make the asymptotic contribution of simulation to the limiting distribution of θ̂_MSL negligible.
There is a simulation design that uses the same total number (N × R) of simulations of ω as the independent design, but applies every simulation of ω to every observation of y. That is, the simulated log-likelihood function is generated according to the double sum

l̃_N(θ) = Σ_{n=1}^N ln [ (1/(NR)) Σ_{m=1}^{NR} f̂(θ; y_n, ω_m) ].

The motivation for this approach is to take advantage of all N × R simulations that must be drawn when R independent simulations are made for each observation. Lee (1992) finds that efficiency requires only that R → ∞ as N → ∞ with this design. This approach appears to gain efficiency without any additional computational cost. However, one simulates each contribution to the likelihood N × R times rather than merely R times, substantially increasing the cost of evaluating the average simulated log-likelihood function. The computational savings gained by pooling simulations in this manner are generally overcome by the added computational cost of calculating O(N²) likelihood simulations instead of O(N), especially when N is large.
We close our discussion of simulated likelihood functions by noting that the
method of simulated pseudo-maximum likelihood (SPML) of Laroque and Salanie
(1989) is another early simulation estimation approach for LDV models. This
method, originally developed for the mixture models of Section 2.4 in the case
of the analysis of markets in disequilibrium, uses simulation to overcome the high-
dimensional integration difficulties that arise in calculating the moments of such
models.

Definition 6. Simulated pseudo-maximum likelihood

Let the observation rule y = τ(y*) yield a mixture model with the first two moments g_1(x_n, θ) = E(y|x_n, θ) and g_2(x_n, θ) = V(y - Ey|x_n, θ). Consider simulating functions g̃_j(x_n, θ, ω, R), j = 1, 2, based on auxiliary simulation sequences {ω}, such that g̃_j(x_n, θ, ω, R) converge almost surely to g_j(x_n, θ) as R → ∞, j = 1, 2. The simulated pseudo-maximum likelihood estimator θ̂_SPML is defined by:

θ̂_SPML ≡ arg min_θ Σ_{n=1}^N ψ(y_n; g̃_1(x_n, θ, ω, R), g̃_2(x_n, θ, ω, R)),

where ψ(·) = ½[(y_n - g̃_1(·))²/g̃_2(·) + ln g̃_2(·)] corresponds to the log-likelihood contribution assuming y_n ~ N(g̃_1(·), g̃_2(·)).

Laroque and Salanie (1989) prove that for x_n ∈ X ⊂ ℝ^K, θ ∈ Θ compact, and g_j(·) sufficiently continuous on X × Θ, θ̂_SPML → θ̂_PML as R → ∞.²¹ It should be noted that for particular choices of a pseudo-likelihood function ψ(·), the SPML estimator can be shown to be consistent for a finite number of simulations R, because it then satisfies the basic linearity property of the MSM approach. Such a choice could be ψ(·) = (y_n - g̃_1(·))², which corresponds to the assumption that y_n ~ N(g_1(·), 1).
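As a rough sketch (ours, not the authors'; simulate_y is a placeholder for the mixture model's data generating process), the SPML recipe under the normal pseudo-likelihood can be coded directly, with both conditional moments replaced by Monte Carlo averages over a frozen simulation sequence:

import numpy as np
from scipy.optimize import minimize

def spml_estimate(y, x, simulate_y, theta0, R=50, seed=0):
    # simulate_y(x_n, theta, u) -> one simulated outcome given a frozen draw u
    N = len(y)
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=(N, R))              # frozen across all theta

    def criterion(theta):                     # sum of psi(.) contributions
        total = 0.0
        for n in range(N):
            sims = np.array([simulate_y(x[n], theta, u[n, r]) for r in range(R)])
            g1, g2 = sims.mean(), sims.var() + 1e-8   # simulated g_1, g_2
            total += 0.5 * ((y[n] - g1) ** 2 / g2 + np.log(g2))
        return total

    return minimize(criterion, theta0, method="Nelder-Mead").x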

4.3. Simulation of moment functions

The simulation of the log-likelihood is an appealing approach to applying simulation to estimation, but this approach must overcome the inherent simulation bias that forces one to increase R with the sample size. Instead of simulating the
that forces one to increase R with the sample size. Instead of simulating the
log-likelihood function, one can simulate moment functions. When they are linear
in the simulations, moment functions can be simulated easily without bias. The
direct consequence is that the simulation bias in the limiting distribution of an
estimator is also zero, making the need to increase the number of simulations per
observation with sample size unnecessary. This was a key insight of McFadden
(1989) and Pakes and Pollard (1989).

²¹Pseudo-maximum likelihood estimation methods, which are special types of the classical minimum distance (CMD) approach, are developed in Gourieroux et al. (1984a) and Gourieroux et al. (1984b). See Newey and McFadden (1994) for a discussion of CMD and the closely related generalized method of moments (GMM).

Method of moments (MOM) estimators have a simple structure. Such estimators


are generally constructed from residuals that are differences between observed
random variables y and their conditional expectations. These expectations are
known functions of the conditioning variables x and the unknown parameter vector
θ to be estimated. Let E(y|x, θ) = μ(θ; x). Moment equations are built up by multiplying the residuals by various weights or instrumental variable functions and specifying the estimator as the parameter values which equate the sample average of these products with zero. The MOM estimator θ̂_MOM is defined by

(1/N) Σ_{n=1}^N w(x_n, θ̂_MOM)[y_n - μ(θ̂_MOM; x_n)] = 0. (35)

The consistency of such estimators rests on the uniform convergence of the sample averages to their population counterparts for any value of θ as the sample size approaches infinity. When the unique root of the population equations is θ_0, the population value of θ, the root of the sample equations converges to θ_0. The limiting distribution of θ̂_MOM is derived from the linear expansion

0 = (1/√N) Σ_{n=1}^N w_n(θ_0) e_n(θ_0) + [(1/N) Σ_{n=1}^N (w_n(θ̄) ∇_θ e_n(θ̄) + e_n(θ̄) ∇_θ w_n(θ̄))] √N(θ̂_MOM - θ_0),

where we have denoted the residual by e_n(θ) = y_n - E(y_n|x_n, θ) and θ̄ lies between θ̂_MOM and θ_0. Because E[e_n(θ_0)] = 0, the leading term will generally converge to a limiting normal random variable with zero expectation, implying no asymptotic bias in θ̂_MOM:

(1/√N) Σ_{n=1}^N w_n(θ_0) e_n(θ_0) →d N(0, Σ),

where

Σ = lim_{N→∞} (1/N) Σ_{n=1}^N E[w_n(θ_0) e_n(θ_0) e_n(θ_0)′ w_n(θ_0)′].

One of the matrices in the second term converges to zero:

(1/N) Σ_{n=1}^N e_n(θ̄) ∇_θ w_n(θ̄) →p 0.

This fact is often exploited by replacing the weights w in (35) with consistent estimates that do not change the limiting distribution of θ̂_MOM. Thus, under regularity conditions,

√N(θ̂_MOM - θ_0) →d N(0, H⁻¹ Σ H′⁻¹),

where

(1/N) Σ_{n=1}^N w_n(θ̄) ∇_θ e_n(θ̄) →p H.

Simulation has an affinity with the MOM. Substituting an unbiased, finite-variance simulator for the conditional expectation μ(θ; x_n) does not alter the essential convergence properties of these sample moment equations. We therefore consider the class of estimators generated by the method of simulated moments (MSM).

Definition 7. Method of simulated moments

Let μ̃(θ; x, ω) = (1/R) Σ_{r=1}^R μ̂(θ; x, ω_r) be an unbiased simulator, so that μ(θ; x) = E[μ̃(θ; x, ω)|x], where ω is a simulated random variable. The method of simulated moments estimator is

θ̂_MSM ≡ arg min_θ ‖s̃_N(θ)‖,

where

s̃_N(θ) = (1/N) Σ_{n=1}^N w_n(θ)[y_n - μ̃(θ; x_n, ω_n)] (36)

for some sequence {ω_n}.
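In the same schematic Python style as before (our illustration; mu_hat and w are placeholders for the model's unbiased simulator and instrument functions), the draws are again frozen before optimization:

import numpy as np
from scipy.optimize import minimize

def msm_estimate(y, x, w, mu_hat, theta0, R=10, seed=0):
    # mu_hat(theta, x_n, u) -> one unbiased draw of mu(theta; x_n);
    # w(theta, x_n)         -> instrument vector, length len(theta0).
    N = len(y)
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=(N, R))               # frozen simulation sequence

    def s_tilde(theta):                        # simulated moments, cf. (36)
        s = np.zeros(len(theta0))
        for n in range(N):
            mu_t = np.mean([mu_hat(theta, x[n], u[n, r]) for r in range(R)])
            s += w(theta, x[n]) * (y[n] - mu_t)
        return s / N

    return minimize(lambda th: np.sum(s_tilde(th) ** 2), theta0,
                    method="Nelder-Mead").x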

Defining the MSM estimator as a minimizer rather than the root of the simulated moments equation s̃_N(θ) = 0 is an important part of making the MSM operational.
Newey and McFadden (1994), Sections 1 and 2.2.3, discuss the general difficulties
that MOM poses for the construction of consistent estimators. Whereas the structure
of ML provides a direct link between parameter identification and estimator consis-
tency, MOM does not. It is often difficult to guarantee that a system of nonlinear
equations has a unique solution. MSM inherits these difficulties. Also, the addition
of simulation in MSM may introduce problems that were not present in the
original MOM formulation. For example, simulated moment equations may not
exhibit solutions at all in small samples, leading one to question the reliability of
asymptotic approximations. This property may be the greatest practical drawback
of this method of estimation using simulations, although it does not greatly affect
the asymptotic distribution theory extended from the MOM case.

Table 6
Sample statistics for rank ordered probit CMD (J = 4).

Parameter  Population value   Mean      Standard deviation  Lower quartile  Median     Upper quartile
θ1         -0.6667           -0.6906    0.1481             -0.7192         -0.6918    -0.5948
θ2         -0.6667           -0.7887    0.3714             -0.9496         -0.7109    -0.5431
θ3         -0.6667           -0.6953    0.2223             -0.8347         -0.6594    -0.5366
θ4         -0.5000           -0.6683    0.4271             -0.8962         -0.5688    -0.3679
θ5          1.3230            1.4143    0.4384              1.118           1.337      1.633
θ6          0.0000            0.1764    0.5053             -0.1957          0.08331    0.5563
θ7         -0.3780           -0.3077    0.2765             -0.4703         -0.3207    -0.1747
θ8          0.9258            0.7714    0.3356              0.5955          0.7980     0.9834

Example 12

To construct an MSM estimator for the rank ordered probit model, we construct a set of moment equations corresponding to the elements of y:

(1/N) Σ_{n=1}^N y_ijn - Pr{y_ij = 1; μ, Ω} = 0,  i, j = 1, …, J - 1.

Not all J² elements of y are needed because these elements have a singular distribution. As the sampling space of y makes clear, we can focus our attention on the first J - 1 rows and columns of y.
Because we obtain more moment equations than parameters, we combine the
moments of y according to the method of classical minimum distance (CMD) using
the inverse of the sample covariance of the elements of y as the normalizing matrix.
Note, however, that one could use more moments to increase the efficiency of the
estimator. For example, the cross-products y_ij y_kl (i ≠ k, j ≠ l) contain additional sample information about the population parameters.
The CMD estimation results are described in Table 6 for J = 4 ranked alter-
natives. This classical estimator is much less efficient than the MLE. In addition, it
exhibits large bias and skewness in the sampling distribution.
The summary statistics for the MSM version of the CMD estimator are listed in
Table 7. There was R = 1 simulation of the GHK probability simulator for each
observation and each probability. As expected, the sampling variance is larger for
the MSM estimator than for the CMD estimator. In addition, the bias and skewness
in the CMD estimator for the mean parameters seem to be aggravated by the
simulation in the MSM estimator.
We do not present analogous results for J = 6 alternatives because the MSM
estimator is not practical in this case. With 720 elements in the sampling space, the

Table 7
Sample statistics for rank ordered probit CMD, MSM version (J = 4, R = 1).

Parameter  Population value   Mean      Standard deviation  Lower quartile  Median     Upper quartile
θ1         -0.6667           -0.6916    0.1905             -0.7915         -0.6809    -0.5798
θ2         -0.6667           -0.9790    0.9576             -1.099          -0.7619    -0.5654
θ3         -0.6667           -0.8561    0.6394             -1.008          -0.6900    -0.4813
θ4         -0.5000           -0.7083    0.6392             -0.8918         -0.5559    -0.3327
θ5          1.3230            1.4733    0.8402              1.086           1.323      1.662
θ6          0.0000            0.0780    0.6268             -0.3091          0.03749    0.4616
θ7         -0.3780           -0.3828    0.5423             -0.5758         -0.3099    -0.1110
θ8          0.9258            0.8560    0.6857              0.5023          0.7341     0.9745

amount of simulation becomes prohibitive. This illustrates another important drawback in this method: the MSM works best for sample spaces with a small number of elements.

The analogies between MSM and MOM are direct and, as a result, the asymptotic analysis is generally simpler than for MSL. The first difference with MSL appears in the requirements on the simulation design for estimator consistency. Whereas MSL requires that R → ∞ regardless of whether simulations are independent across observations, MSM yields consistent estimators with fixed R provided that the simulations vary enough to make a law of large numbers work. Because the simulated moments are linear in the simulations, one has the option of applying the law of large numbers to large numbers of observations alone, or in combination with large numbers of simulations.

Proposition 6

Let μ̂(θ; x, ω) be an unbiased, finite-variance simulator for μ(θ; x) and let either

(1) {ω_nr; n = 1, …, N, r = 1, …, R} be i.i.d. random variables for fixed R, or
(2) {ω_r; r = 1, …, N} be an i.i.d. sequence for R = N and let ω_nr = ω_r, n = 1, …, N.

Then θ̂_MSM →p θ_0 under the regularity conditions

(1) s_N(θ) ≡ (1/N) Σ_{n=1}^N w_n(θ)[y_n - μ(θ; x_n)] is continuous in θ,
(2) s_N(θ) →p s_∞(θ) = plim (1/N) Σ_{n=1}^N w_n(θ)[μ(θ_0; x_n) - μ(θ; x_n)] uniformly in θ ∈ Θ, a compact parameter space,
(3) s_∞(θ) is continuous in θ and s_∞(θ) equals zero only at θ_0.

Proof

The average difference between the classical moment functions and their simulated counterparts is

s̃_N(θ) - s_N(θ) = (1/N) Σ_{n=1}^N w_n(θ)[μ(θ; x_n) - μ̃(θ; x_n, ω_n)] (37)

= (1/(NR)) Σ_{n=1}^N Σ_{r=1}^R w_n(θ)[μ(θ; x_n) - μ̂(θ; x_n, ω_nr)], (38)

where s_N(θ) ≡ (1/N) Σ_{n=1}^N w_n(θ)[y_n - μ(θ; x_n)]. Under design 1, the {μ̃_n - μ_n} are an i.n.i.d. sequence so that a uniform law of large numbers applied to (37) implies s̃_N(θ) - s_N(θ) →p 0 as N → ∞. Under design 2, s̃_N(θ) - s_N(θ) is written in (38) as a U-statistic, and a uniform law of large numbers for U-statistics (Lee (1992)) implies s̃_N(θ) - s_N(θ) →p 0 as N → ∞. Therefore, in either case, by continuity, ‖s̃_N(θ) - s_N(θ)‖ →p 0 uniformly in θ and Lemma 1 implies the result. Q.E.D.

The opportunity to fix R for all sample sizes offers significant computational
savings that are a key motivation for interest in the MSM. As we shall see below,
the benefits of the dependent design are generally modest. Thus, while the theoretical
applicability of U-statistics to MSM is interesting in itself, we will not consider it
further in this section. We continue with the analogy between the MOM and the
MSM. Note first of all that an analogous linear expansion for θ̂_MSM exists:

0 = (1/√N) Σ_{n=1}^N w_n(θ_0) ẽ_n(θ_0) + [(1/N) Σ_{n=1}^N (w_n(θ̄) ∇_θ ẽ_n(θ̄) + ẽ_n(θ̄) ∇_θ w_n(θ̄))] √N(θ̂_MSM - θ_0),

where we have denoted the simulated residual by ẽ_n(θ) = y_n - μ̃(θ; x_n) and θ̄ lies between θ̂_MSM and θ_0. Because E[ẽ_n(θ_0)] = 0, the leading term will generally converge to a limiting normal random variable with zero expectation, implying no asymptotic bias in θ̂_MSM:

(1/√N) Σ_{n=1}^N w_n(θ_0) ẽ_n(θ_0) →d N(0, Σ_MSM),

where

lim_{N→∞} (1/N) Σ_{n=1}^N E[w_n(θ_0) ẽ_n(θ_0) ẽ_n(θ_0)′ w_n(θ_0)′] = Σ_MSM.

Also, as before,

(1/N) Σ_{n=1}^N ẽ_n(θ̄) ∇_θ w_n(θ̄) →p 0,

so that under regularity conditions,

√N(θ̂_MSM - θ_0) →d N(0, H⁻¹ Σ_MSM H′⁻¹),

where

(1/N) Σ_{n=1}^N w_n(θ̄) ∇_θ ẽ_n(θ̄) →p H.

²²See Lee (1992).

The equivalence of the H matrices also rests on the unbiased simulation of μ. If μ(θ; x) = E[μ̃(θ; x, ω)|x], then ∇_θ μ(θ; x) = ∇_θ E[μ̃(θ; x, ω)|x] = E[∇_θ μ̃(θ; x, ω)|x] for the smooth simulators described in Section 3.
While the first moment of the MSM estimator does not depend on R, the limiting covariance matrix, and hence relative efficiency, does. Simulation noise introduces a generic difference between the covariance matrices of θ̂_MOM and θ̂_MSM. Intuition suggests, and theory confirms, that the larger R is, the more efficient the MSM estimator will be as the simulation noise is diminished. The extra variation in θ̂_MSM is contained in the object (37). This term is generated conditional on the realizations of y and is, by definition, distributed independently of the classical moment function. Inflating the simulation noise by √N and evaluating it at θ_0, we can apply a central limit theorem to it to obtain the following result.

Proposition 7

Σ_MSM = Σ_MOM + (1/R) Σ_ω, where

Σ_ω = lim_{N→∞} (1/N) Σ_{n=1}^N E{w_n(θ_0) V[μ̂(θ_0; x_n, ω)|x_n] w_n(θ_0)′}.

If it were not for the simulation noise, the MSM estimator would be as efficient as its MOM counterpart. McFadden (1989) noted that in the special case where μ̂ is obtained by averaging simulations of the data generating process itself, Σ_ω = Σ_MOM and Σ_MSM = (1 + 1/R) Σ_MOM. In this case, the inefficiency of simulation is easy to measure and one observes that 10 replications are sufficient to reduce the inefficiency to 10% compared to classical MOM.
The proposition suggests that full efficiency would be obtained if we simply
increased R without bound as N grows. That intuition is formalized in the next
proposition, which is analogous to Proposition 5 (see McFadden and Ruud
(1992)).

Proposition 8

If R = O(N^α), α > 0, then √N(θ̂_MSM - θ̂_MOM) →p 0.



For any given residual and instrumental variables, there generally exist optimal weights among MOM estimators, and the same holds for MSM as well. In what is essentially an asymptotic counterpart to the Gauss-Markov theorem, if H = Σ_MSM then the MSM estimator is optimal (Hansen (1982)). To construct an MSM estimator that satisfies this restriction, one normalizes the simulated residual by its variance and makes the instrumental variables the partial derivatives of the conditional expectation of the simulated moment with respect to the unknown parameters:

w_n(θ) = [∇_θ μ(θ; x_n)]′ V[ẽ_n(θ)|x_n]⁻¹.
One can approximate these functions using simulations that are independent of the moment simulations with R fixed, but efficiency will require increasing R with sample size. If μ̂ is differentiable in θ, then independent simulations of ∇_θ μ̂ are unbiased simulators of the instruments. Otherwise, discrete numerical derivatives can be employed. The covariance matrix can be estimated using the sample variance of μ̂ and the simulated variance of y. Inefficiency in simulated instruments constructed in this way has two sources: the simulation noise and the bias in the inverse of an estimated variance. Both sources disappear asymptotically if R approaches infinity with N. While it is critical that the simulations of w be independent of the simulations of μ̃, there is no obvious advantage to simulating the individual components of w independently. In some cases, for example simulating a ratio, it appears that independent simulation may be inferior.²³

4.4. Simulation of the score function

Interest in the efficiency of estimators naturally leads to attempts to construct an


efficient MSM estimator. The obvious way to do this is to simulate the score
function as a set of simulated moment equations. Within the LDV framework,
however, unbiased simulation of the score with a finite number of operations is not
possible with simple censored simulators. The efficient weights are nonlinear func-
tions of the objects that require simulation. Nevertheless, it may be possible with
the aid of simulation to construct good approximations that offer improvements in
efficiency over simpler MSM estimators.
There is an alternative approach based on truncated simulation. We showed in
Section 2 that every score function can be expressed as the expectation of the score
of a latent data generating process taken conditional on the observed data. In the
particular case of normal LDV models, this conditional expectation is taken over
a truncated multivariate normal distribution and the latent score is the score of an
untruncated multivariate normal distribution. Simulations from the truncated nor-

²³A Taylor series expansion suggests that positive correlation between the numerator and denominator of a ratio can yield a smaller variance than independent simulation.

mal distribution can replace the expectation operator to obtain unbiased simulators of the score function.
In order to include both the censored and truncated approaches to simulating
the score function, we define the method of simulated scores as follows.24

Definition 8. Method of simulated scores

Let the log-likelihood function for the unknown parameter vector θ given the sample of observations (y_n, n = 1, …, N) be l_N(θ) = Σ_{n=1}^N ln f(θ; y_n). Let s̃(θ; y_n, ω_n) = (1/R) Σ_{r=1}^R ŝ(θ; y_n, ω_nr) be an asymptotically (in R) unbiased simulator of the score function s(θ; y) = ∇_θ ln f(θ; y), where ω is a simulated random variable. The method of simulated scores estimator is θ̂_MSS ≡ arg min_θ ‖s̃_N(θ)‖, where s̃_N(θ) ≡ (1/N) Σ_{n=1}^N s̃(θ; y_n, ω_n) for some sequence {ω_n}.

Our definition includes all MSL estimators as MSS estimators, because they
implicitly simulate the score with a bias that disappears asymptotically with the
number of replications R. But there are also MSS estimators without simulation
bias for fixed R. These estimators rely on simulation from the truncated conditional
distribution of the latent y* given y. We turn to such estimators first.

4.4.1. Truncated simulation of the score

The truncated simulation methods described in Section 3.3 provide unbiased simulators of the LDV score (17), which is composed of elements of the form (24). Such simulation would be ideal, because R can be held fixed, thus leading to fast estimation procedures. The problem is that these truncated simulation methods pose new problems for the MSS estimators that use them.

The first truncated simulation scheme, discussed in Section 3.3.1 above, is the A/R method. This provides simulations that are discontinuous in the parameters, a property shared with the CMC. A/R simulation delivers the first element in a simulated sequence that falls into a region which depends on the parameters under estimation. As a result, changes in the parameter values cause discrete changes in which element in the sequence is accepted. An example of this phenomenon is to suppose that one is drawing a sequence of normal random variables {η_r} ~ N(0, I_J) in order to obtain truncated multivariate normal random variables for rank ordered probit estimation. Given the observation y, one seeks a simulation from D(y), as defined in Example 8. Let the simulation of y* be ỹ_r(μ_1, Γ_1) ≡ μ_1 + Γ_1 η_r at the parameter values (μ_1, Γ_1). At neighboring parameter values where two elements of the vector ỹ_r(μ, Γ) are equal, the A/R simulation is at the point of jumping from the value ỹ_r(μ, Γ) to another point in the sequence {ỹ_r(μ, Γ)}. See Hajivassiliou and McFadden (1990) and McFadden and Ruud (1992) for treatments of the special

²⁴The term was coined by Hajivassiliou and McFadden (1990).



asymptotic distribution theory for such simulation estimators. Briefly described, this distribution theory requires a degree of smoothness in the estimator with respect to the parameters that permits such discontinuities but allows familiar linear approximations in the limit. See Ruud (1991) for an illustrative application.
The second truncated simulation scheme we discussed above was the Gibbs resampling simulation method; see Section 3.3.2. This method is continuous in the parameters provided that one uses a continuous univariate truncated normal simulation scheme. But this simulation method also has a drawback: strictly applied, each simulation requires an infinite number of resampling rounds. In practice, Gibbs resampling is truncated and applied as an approximation. The limited Monte Carlo evidence that we have seen suggests that such approximation is reliable.
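A sketch of one such sampler (ours; it works in the differenced space, where D(y) becomes the negative orthant {z ≤ 0} and every full conditional is a univariate truncated normal drawn by the inverse-c.d.f. method):

import numpy as np
from scipy.stats import norm

def gibbs_negative_orthant(mean, cov, sweeps=10, seed=0):
    # Gibbs sampler for z ~ N(mean, cov) truncated to {z <= 0};
    # 'sweeps' plays the role of the resampling rounds discussed above.
    rng = np.random.default_rng(seed)
    k, P = len(mean), np.linalg.inv(cov)       # P = precision matrix
    z = np.full(k, -1.0)                       # any point inside {z <= 0}
    for _ in range(sweeps):
        for i in range(k):
            o = np.arange(k) != i              # conditioning coordinates
            s = 1.0 / np.sqrt(P[i, i])
            m = mean[i] - (P[i, o] @ (z[o] - mean[o])) / P[i, i]
            hi = max(norm.cdf(-m / s), 1e-10)  # mass below the bound at 0
            z[i] = m + s * norm.ppf(rng.uniform(1e-12, hi))
    return z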
Simulation of the efficient score fits naturally with the EM algorithm for computing
the MLE derived by Dempster et al. (1977). The EM algorithm includes a step in
which one computes an expectation with respect to the truncated distribution of y*
conditional on y. Ruud (1991) suggested that a simulated EM (SEM) algorithm
could be based on simulation of the required expectation.25 This substitution
provides a computational algorithm for solving the simulated score of MSS esti-
mators.

Definition 9. EM algorithm

The EM algorithm is an iterative process for computing the MLE of a censored data model. On the ith iteration, the EM algorithm solves

θ^{i+1} = arg max_θ Q(θ, θ^i; y), (39)

where the function Q is

Q(θ¹, θ⁰; y) ≡ E_{θ⁰}[ln f(θ¹; y*)|y], (40)

where E_{θ⁰}[·|y] indicates an expectation measured with respect to F(θ⁰; y*|y).

If Q is continuous in both θ arguments, then (39) is a contraction mapping that converges to a root of the normal equations; as Ruud (1991) points out,

θ¹ = θ⁰ = θ  ⇒  ∇_{θ¹} Q(θ, θ; y) = ∇_θ ln f(θ; y), (41)

so that the first-order conditions for an iteration of (39) and the normal equations for ML are intimately related.
Unlike the log-likelihood function, this Q can be simulated without bias for LDV models because the latent likelihood f(θ; y*) is tractable and Q is linear in ln f(θ; y*)

²⁵van Praag et al. (1989) and van Praag et al. (1991) also investigated this approach and applied it in a study of the Dutch labor market.

(see equation (40)). According to (41), unbiased simulation of Q implies a means for unbiased simulation of the score. Although it is not guaranteed, an unbiased simulator of Q usually yields a contraction mapping to a stationary point.

For LDV models based on a latent multivariate normal distribution, the iteration in (39) is quite simple to compute, given Q or a simulation of Q. If f(θ; y*) = φ(y* - μ; Ω), then

μ = (1/N) Σ_{n=1}^N E_{θ⁰}[y_n*|y_n]  and  Ω = (1/N) Σ_{n=1}^N E_{θ⁰}[(y_n* - μ)(y_n* - μ)′|y_n], (42)

which are analogous to the equations for the MLE using the latent data. This algorithm is often quite slow, however, in a neighborhood of the stationary point of (39). Any normalizations necessary for identification of θ can be imposed at convergence. See Ruud (1991) for a discussion of these points.
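One SEM iteration then reduces to averaging truncated draws, as in this sketch (ours; draw_ystar is a placeholder for a truncated sampler such as the Gibbs sketch above, mapped back to the y* scale, and the deviations in the Ω update are taken around the current μ for simplicity):

import numpy as np

def sem_iteration(y, mu, Omega, draw_ystar, R=5, seed=0):
    # One simulated-EM update of (mu, Omega), cf. equation (42);
    # draw_ystar(y_n, mu, Omega, rng) -> one draw of y_n* given ranking y_n.
    rng = np.random.default_rng(seed)
    N, J = len(y), len(mu)
    Ey = np.zeros((N, J))
    Eyy = np.zeros((J, J))
    for n in range(N):
        draws = np.array([draw_ystar(y[n], mu, Omega, rng) for _ in range(R)])
        Ey[n] = draws.mean(axis=0)             # simulated E[y_n* | y_n]
        dev = draws - mu
        Eyy += dev.T @ dev / R                 # simulated E[(y*-mu)(y*-mu)'|y_n]
    return Ey.mean(axis=0), Eyy / N            # updated mu and Omega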

Example 13. SEM estimation

In this example, we apply the SEM procedure to the rank ordered probit model of our previous examples. We simulated an (approximately) unbiased simulator Q̃ of Q by drawing simulations of y_n* from its truncated normal distribution conditional on y_n, using the Gibbs resampling method truncated to 10 rounds. The support of this truncated distribution is specified as D(y) in Example 8. The simulated estimators were computed according to (42), after replacing the expectations with the averages of independent simulations.

The usual Monte Carlo results for 500 experiments with J = 6 ranked alternatives are reported in Table 8 for data sets containing 100 observations and R = 5 simulations per observation. These statistics are comparable to those in Table 5 for the MSL estimator of the same model with the same number of simulation replications. The biases for the true parameter values appear to be appreciably smaller in the SEM estimator, while the sampling variances are larger. We cannot judge either estimator as an approximation to the MLE, because the latter is prohibitively difficult to compute.

Table 8
Sample statistics for rank ordered probit SEM using Gibbs simulation (J = 6, R = 5).

Parameter  Population value   Mean      Standard deviation  Lower quartile  Median     Upper quartile
θ1         -0.4000           -0.3827    0.1558             -0.4907         -0.3848    -0.2757
θ2         -0.4000           -0.4570    0.3271             -0.5992         -0.4089    -0.2455
θ3         -0.4000           -0.4237    0.2262             -0.5351         -0.3756    -0.2766
θ4         -0.4000           -0.4268    0.2710             -0.5319         -0.3891    -0.2580
θ5         -0.4000           -0.4300    0.2622             -0.5535         -0.3794    -0.2521

Although truncated simulation is generally more costly, the SEM estimator remains a promising general approach to combining simulation with relatively efficient estimation. It is the only method that combines unbiased simulation of the score with optimization of an objective function, and the latter property appears to offer substantial computational advantages.

4.4.2. Censored simulation of ratios

The censored simulation methods in Section 3.2 can also be applied to approximating the efficient score. These simulation methods tend to be much faster computationally than the truncated simulation methods, but censored simulations introduce simulation bias in much the same way as in the MSL. Censored simulation can be applied to discrete LDV models by noting that the score function of an LDV model with observation rule y = τ(y*) can generally be written in the ratio form:

s(θ; y) = ∇_θ f(θ; y)/f(θ; y) = [∫_{{y*|τ(y*)=y}} ∇_θ dF(θ; y*)] / [∫_{{y*|τ(y*)=y}} dF(θ; y*)] = E[∇_θ ln f(θ; y*)|τ(y*) = y],

where F(θ; y*|y) is the conditional c.d.f. of y* given τ(y*) = y. See Section 2.6 for
more details. Van Praag and Hop (1987), McFadden (1989) and Hajivassiliou and McFadden (1990) note that this form of the score function offers the potential of estimation by simulation.²⁶ An MSS estimator can be constructed by simulating separately the numerator and denominator of the score expressions:

s̃_N(θ) = (1/N) Σ_{n=1}^N d̃(θ; y_n, ω_1n)/p̃(θ; y_n, ω_2n), (43)

where d̃(θ; y_n, ω_1n) = (1/R_1) Σ_{r=1}^{R_1} d̂(θ; y_n, ω_1nr) is an unbiased simulator of the derivative function ∇_θ f(θ; y) and p̃(θ; y_n, ω_2n) = (1/R_2) Σ_{r=1}^{R_2} p̂(θ; y_n, ω_2nr) is an unbiased simulator of the probability expression f(θ; y_n). Hajivassiliou and McFadden (1990) prove that when the approximation of the scores in ratio form is carried out using the GHK simulator, the resulting MSS estimator is consistent and asymptotically normal when N → ∞ and R_2/√N → ∞. The number of simulations for the numerator expression, R_1, affects the efficiency of the resulting MSS estimator. Because the unbiased simulator p̃(θ; y, ω_2) of f(θ; y) does not yield an unbiased simulator of

26See Hajivassiliou (1993~) for a survey of the development of simulation estimarion methods for LDV
models.

the reciprocal 1/f(θ; y) in the simulator 1/p̃(θ; y, ω_2), R_2 must increase with sample size to obtain a consistent estimator. This is analogous to simulation in MSL. In fact, this simulation scheme is equivalent to MSL when ω_1 = ω_2 and d̂ = ∇_θ p̂.
McFadden and Ruud (1992) note that MSM techniques can also be used generally to remove the simulation bias in such MSS estimators. In discrete LDV models, where y has a sampling space B that is countable and finite, we can always write y as a vector of dummy variables for each of the possible outcomes so that

E_θ(y_i) = Pr{y_i = 1; θ} = f(θ; Y) if Y_i = 1, Y_j = 0, j ≠ i.

Thus,

E_θ[∇_θ f(θ; y)/f(θ; y)] = 0 = Σ_{Y∈B} f(θ; Y)·[∇_θ f(θ; Y)/f(θ; Y)],

and the score can be written

∇_θ ln f(θ; y) = Σ_{Y∈B} [1{y = Y} - f(θ; Y)]·∇_θ f(θ; Y)/f(θ; Y). (44)

Provided that the residual 1{y = Y} - f(θ; Y) and the instrumental variables ∇_θ f(θ; Y)/f(θ; Y) are simulated independently, equation (44) provides a moment function for the MSM. In this form, the instrumental variables ratio can be simulated with bias as in (43), because the residual term is independently distributed and possesses a marginal expectation equal to zero at the population parameter value. For example, we can alter (43) to

s̃_N(θ) = (1/N) Σ_{n=1}^N Σ_{Y∈B} [1{y_n = Y} - p̃(θ; Y, ω_2n)]·d̃(θ; Y, ω_1n)/p̃(θ; Y, ω_1n), (45)

where ω_1 and ω_2 are independent pseudo-random variables. While such bias does not introduce inconsistency into the MSM estimator, the simulation bias does introduce inefficiency because the moment function is not an unbiased simulator of the score function. This general approach underlies the estimation method for multinomial probit originally proposed by McFadden (1989).
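A sketch of the moment function in (45) for a small outcome space (ours; p_hat and d_hat stand in for unbiased one-draw simulators of f(θ; Y) and ∇_θ f(θ; Y), and the two draw arrays must be generated independently):

import numpy as np

def msm_score_moment(theta, y, B, p_hat, d_hat, omega1, omega2):
    # omega1[n][j], omega2[n][j]: independent draw sequences for outcome j
    # of observation n; the residual uses omega2, the instruments omega1.
    s = 0.0
    for n in range(len(y)):
        for j, Y in enumerate(B):
            p2 = np.mean([p_hat(theta, Y, w) for w in omega2[n][j]])
            p1 = np.mean([p_hat(theta, Y, w) for w in omega1[n][j]])
            d1 = np.mean([d_hat(theta, Y, w) for w in omega1[n][j]], axis=0)
            s = s + (float(y[n] == Y) - p2) * d1 / p1
    return s / len(y)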

4.4.3. MSM versus MSS

MSM and MSS are natural competitors in estimation with simulation because each has a comparative advantage. MSM uses censored simulations that are cheap to compute, but it cannot simulate the score without bias within a finite number of calculations. MSS uses truncated simulations that are expensive to compute (and introduce jumps in the objective function with A/R simulations), but simulates the score (virtually) without bias. McFadden and Ruud (1992) make a general comparison of the asymptotic covariance matrices that suggests when one method is preferable to the other.
Consider the special MSS case in which the simulations ỹ*(θ; Y, ω) are drawn from the latent conditional distribution and the exact latent score ∇_θ l* is available, so that

s̃(θ) = (1/R) Σ_{r=1}^R ∇_θ l*[θ; ỹ_r*(θ; Y, ω)].

Then Σ_ω, the contribution of simulation to the covariance matrix of the estimator, has a useful interpretation:

Σ_ω = Σ* - Σ,

where Σ* = E_θ{∇_θ l*(θ; Y*)[∇_θ l*(θ; Y*)]′} is the information matrix of the latent log-likelihood and Σ is that of the observed log-likelihood. The simulation noise is proportional to the information loss due to partial observability.
In the simplest applications of censored simulation to the MSM, the simulations are independent of sample outcomes and their contribution to the moment function is additively separable from the contribution of the data. Thus we can write s̃_N(θ) = ḡ(θ; Y, ω_2) - g̃(θ; ω_1, ω_2) (see (45)). In that case, Σ_ω simplifies to V{√N[g̃(θ_0; ω_1, ω_2)]}. In general, the simulation process makes R independent replications of the simulations {ω_1r; r = 1, …, R}, so that

g̃(θ; ω_1, ω_2) = (1/R) Σ_{r=1}^R g̃_r(θ; ω_1r, ω_2)

and Σ_ω = R⁻¹ V{√N[g̃_1(θ_0; ω_1, ω_2)]}. In an important special case of censored simulation, the simulation process makes R independent replications of the modeled data generating process, {ỹ(θ; ω_1r); r = 1, …, R}, so that

g̃(θ; ω_1, ω_2) = (1/R) Σ_{r=1}^R g[θ; ỹ(θ; ω_1r), ω_2]

and Σ_ω = R⁻¹ V[g(θ_0; Y, ω)] = Σ_g/R. Then the MSM covariance matrix equals (1 + 1/R) times the classical MOM covariance matrix without simulation, G⁻¹Σ_g(G⁻¹)′.
Now let us specialize to simulation of the score. For simplicity, suppose that the simulated moment functions are unbiased simulations of the score: E[s̃_N(θ)|Y] = ∇_θ l(θ; Y). Of course in most cases, the MSM estimator will have a simulation bias for the score. The variance contributed by simulation to the MSM estimator is

lim_{N→∞} V[s̃_N(θ_0) - ∇_θ l(θ_0; Y)] = Σ̃_ω + Σ_e,

where Σ̃_ω = Σ_ω/R and Σ_e holds additional variation attributable to the simulation of the score. If the MSS and MSM estimators use the same number of simulation replications, we can make a simple comparison of the relative efficiency of the two methods. The difference between the asymptotic covariance matrices is

R⁻¹ Σ⁻¹[Σ_e + (R + 1)Σ_ω - (Σ* - Σ)]Σ⁻¹.

This expression gives guidance about the conditions under which censored simula-
tion is likely to dominate truncated. It is already obvious that if Z, is high, so that
censored simulation is inefficient due to a poor approximation of the score, then
truncated simulation is likely to dominate. On the other hand, if Z:, is low, because
partial observability causes a large loss in information, then estimation with censored
simulation is likely to dominate truncated.
Thus, we might expect that the censored simulation method will dominate the
truncated one for the multinomial probit model, particularly if Z, = 0. That,
however, is a special case in which a more efficient truncated simulation estimator
can be constructed from the censored simulation estimator. Because E[E(Q)I Y] =
VfJ(R Y),

ma y,02) - m 01, w2)l = VfJ(RVI


-E[g(e;ol,o,)] = E{g[e; ~(e;o),d]} =o t/e.
The bias correction is obviously unnecessary and only increases the variance of the
MSM estimator. But an MSM estimator based on g(& Y, o) is a truncated simulation
MSM estimator; only simulation for the particular Y observed is required. We
conclude that the censored method can outperform the truncated method only by
choosing E,[e(B)] # V,1(8; Y) in such a way that the loss in efficiency in Z, is offset
by low Zc, and low Z,.27

4.5. Bias corrections

In this section, we interpret estimation with simulation as a general method for re-
moving bias from approximate parametric moment functions, following McFadden

²⁷The actual difference in asymptotic covariance matrices is more complicated than the formula above, however, because the G matrices of the MOM, MSM and MSS estimators are not identical.

and Ruud (1992). The approximation of the efficient score is the leading problem in
estimation with simulation. In a comparison of the MSM and MSS approximations,
we have just described a simple trade-off. On the one hand, the simulated term in
the residual of (45) that replaces the expectation in (44) is clearly redundant when
the instrumental variables are ∇_θ f(θ; Y)/f(θ; Y). The expectation of the simulated
terms multiplied by the instruments is identically zero for all parameter values so
that the simulation merely adds noise to the score and the resulting estimator. On
the other hand, the simulated residual is clearly necessary when the instruments are
not ideal. Without the simulation, the moment equation is invalid and the resultant
estimators are inconsistent.
This trade-off motivates a general structure of simulated moments estimators.
We can interpret the extra simulation term as a bias correction to an approxima-
tion of the score. For example, one can view the substitution of non-ideal weights
into the original score function as an approximation to the score, chosen for its
computational feasibility. Because the approximation introduces bias, the bias is
removed by simulating the (generally) unknown expectation of the approximate
score. Suppose the moment restrictions have a general form

E[s(θ_0; y, x)|x] = 0.

When the moment function s is computationally burdensome, an approximation g(θ; y, x, ω) becomes a feasible alternative. The additional argument ω represents an ancillary statistic containing the coefficients of the approximation. In general, such approximation will introduce inefficiency and bias into MOM estimators constructed from g. Simulation of g over the distribution of y produces an approximate bias correction ĝ(θ; x, ω, υ), where υ represents the simulated component. Thus, we consider estimators θ̂ that satisfy

g(θ̂; y, x, ω) - ĝ(θ̂; x, ω, υ) = 0. (47)

MSM estimators have this general form; and feasible MSS estimators generally do, too.

4.5.1. A score test for estimator bias

The appeal of simulation estimators without bias correction is substantial. Although the simulation of moments or scores overcomes a substantial computational difficulty in the estimation of LDV models, there may remain practical difficulties in
solving the simulated moment functions for the estimators. Whereas maximum
likelihood possesses a powerful relationship between the normal equations and the
likelihood function, moment equations generally do not satisfy such integrability
conditions. As a result, there is not even a guarantee that a root of the estimating
Ch. 40: Classicul Esfimation Methodsfor LDV Models Using Simulution 2437

equations exists. Bias correction can introduce a significant amount of simulation


noise to estimators. For these reasons, the approximation of the log-likelihood
function itself through simulation still offers an important opportunity to construct
feasible and relatively efficient estimators.
MSS, and particularly MSL, estimators can be used without bias correction if
the bias is negligible relative to the sampling error of the estimator and the magni-
tude of the true parameter. A simple score test for significant bias can be developed
and implemented easily.
Conditional on the MSS estimator, the expectation of the simulated bias in the approximate score should be zero. Conditionally, the elements of the bias correction are i.n.i.d. random variables to which a central limit theorem can be applied. In addition, the White-Eicker estimator of the covariance matrix of the bias elements is consistent, so that the usual Wald statistic, measuring the statistical significance of the bias term, can be computed (see Engle (1984)). As an alternative to testing the significance of this statistic, the bias correction term can be used to compute a local approximate confidence region for the biases in the moment function or the estimated parameters. This has the advantage of providing a way to assess whether the biases are important for the purposes of inference.

5. Conclusion

In this chapter, we have described the use of simulation methods to overcome the
difficulties in computing the likelihood and moment functions of LDV models.
These functions contain multivariate integrals that cannot be easily approximated
by series expansions. However, unbiased simulators of these integrals can be com-
puted easily.
We began by reviewing the ways in which LDV models arise, describing the
differences and similarities in censored and truncated data generating processes.
Censoring and truncation give rise to the troublesome multivariate integrals. Fol-
lowing the LDV models, we described various simulation methods for evaluating
such integrals. Naturally, censoring and truncation play roles in simulation as well.
Finally, estimation methods that rely on simulation were described in the final
section of this chapter. We organized these methods into three broad groups: MSL,
MSM, and MSS. These are not mutually exclusive groups. But each group has a
different motivation: MSL focuses on the log-likelihood function, the MSM on
moment functions, and the MSS on the score function. The MSS is a combination
of ideas from MSL and MSM, treating the efficient score of the log-likelihood
function as a moment function.
Software for implementing these methods is not yet widely available. But as such tools spread, and as improvements in the simulators themselves are developed, simulation methods will surely become a familiar tool in the applied econometrician's workshop.

6. Acknowledgements

We would like to thank John Geweke and Daniel McFadden for very helpful com-
ments. John Wald provided expert research assistance. We are grateful to the
National Science Foundation for partial financial support, under grants SES-
929411913 (Hajivassiliou) and SES-9122283 (Ruud).

References

Amemiya, T. (1984) Tobit Models: A Survey, Journal of Econometrics, 24, 3-61.


Avery, R., Hansen, L. and Hotz, V. (1983) Multiperiod Probit Models and Orthogonality Condition Estimation, International Economic Review, 24, 21-35.
Bauwens, L. (1984) Bayesian Full Information Analysis of Simultaneous Equation Models using Integration by Monte Carlo. Berlin: Springer.
Beggs, S., Cardell, S. and Hausman, J. (1981) Assessing the Potential Demand for Electric Cars, Journal of Econometrics, 17, 1-20.
Berkovec, J. and Stern, S. (1991) Job Exit Behavior of Older Men, Econometrica, 59, 189-210.
Bloemen, H. and Kapteyn, A. (1991) The Joint Estimation of a Non-linear Labour Supply Function and
a Wage Equation Using Simulated Response Probabilities. Tilburg University, mimeo.
Bock, R.D. and Jones, L.V. (1968) The Measurement and Prediction of Judgement and Choice. San
Francisco: Holden-Day.
Bolduc, D. (1992) Generalized Autoregressive Errors in the Multinomial Probit Model, Transportation Research B - Methodological, 26B(2), 155-170.
Bolduc, D. and Kaci, M. (1991) Multinomial Probit Models with Factor-Based Autoregressive Errors: A Computationally Efficient Estimation Approach. Université Laval, mimeo.
Börsch-Supan, A. and Hajivassiliou, V. (1993) Smooth Unbiased Multivariate Probability Simulators for Maximum Likelihood Estimation of Limited Dependent Variable Models, Journal of Econometrics, 58(3), 347-368.
Börsch-Supan, A., Hajivassiliou, V., Kotlikoff, L. and Morris, J. (1992) Health, Children and Elderly Living Arrangements: A Multi-Period Multinomial Probit Model with Unobserved Heterogeneity and Autocorrelated Errors, pp. 79-108, in: D. Wise, ed., Topics in the Economics of Aging. Chicago: University of Chicago Press.
Chib, S. (1993) Bayes Regression with Autoregressive Errors: A Gibbs Sampling Approach, Journal of
Econometrics, 58(3), 275-294.
Clark, C. (1961) The Greatest of a Finite Set of Random Variables, Operations Research, 9, 145-162.
Daganzo, C. (1980) Multinomial Probit. New York: Academic Press.
Daganzo, C., Bouthelier, F. and Sheffi, Y. (1977) Multinomial Probit and Qualitative Choice: A Computationally Efficient Algorithm, Transportation Science, 11, 338-358.
Davis, P. and Rabinowitz, P. (1984) Methods of Numerical Integration. New York: Academic Press.
Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977) Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-38.
Devroye, L. (1986) Non-Uniform Random Variate Generation. New York: Springer.
Dubin, J. and McFadden, D. (1984) An Econometric Analysis of Residential Electric Appliance
Holdings and Consumption, Econometrica, 52(2), 345-362.
Duffie, D. and Singleton, K. (1993) Simulated Moments Estimation of Markov Models of Asset Prices,
Econometrica, 61(4), 929-952.
Dutt, J. (1973) A Representation of Multivariate Normal Probability Integrals by Integral Transforms,
Biometrika, 60, 637-645.
Dutt, J. (1976) Numerical Aspects of Multivariate Normal Probabilities in Econometric Models,
Annals of Economic and Social Measurement, 5, 547-562.
Engle, R. (1984) Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics, pp. 776-826,
in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North-
Holland.
Feller, W. (1971) An Introduction to Probability Theory and its Applications. New York: Wiley,
Fishman, G. (1973) Concepts and Methods of Digital Simulation. New York: Wiley.
Ch. 40: Classical Estimufion Methodsfor LD V Models Using Simulation 2439

Geman, S. and Geman, D. (1984) Stochastic Relaxation, Gibbs Distributions and the Bayesian Restora-
tion of Images,IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721~741.
&w&e, J. (1989) Bayesian Inference in Econometric Models Using Monte Carlo Integration, Econo-
metrica, 57, 1317-1340.
Geweke, J. (1992)Efficient Simulation from the Multivariate Normal and Student-t Distributions Subject
to Linear Constraints. Computing Science and Statistics: Proceedings qf the Twenty-Third Symposium,
571-578.
Goldfeld, S. and Quandt, R. (1975) Estimation in a Disequilibrium Model and the Value of Information,
Journal ofEconometrics, 3(3), 325-348.
Gourieroux, C. and Monfort, A. (1990) Simulation Based Inference in Models with Heterogeneity. INSEE,
mimeo.
Gourieroux, C., Monfort, A., Renault, E. and Trognon, A. (1984a) Pseudo Maximum Likelihood
Methods: Theory, Econometrica, 52, 681-700.
Gourieroux, C., Monfort, A., Renault, E. and Trognon, A. (1984b) Pseudo Maximum Likelihood
Methods: Applications to Poisson Models, Econometrica, 52, 701-720.
Gronau, R. (1974) The Effect of Children on the Housewifes Value of Time, Journal of Political
Economy, 81, 168-199.
Hajivassiliou, V. (1986) Serial Correlation in Limited Dependent Variable Models: Theoretical and Monte
Carlo Results. Cowles Foundation Discussion Paper No. 803.
Hajivassiliou, V. (1992) The Method ofSimulated Scores: A Presentation and Comparative Evaluation.
Cowles Foundation Discussion Paper, Yale University.
Hajivassiliou, V. (1993a) Estimation by Simulation of the External Debt Repayment Problems. Cowles
Foundation Discussion Paper, Yale University. Published in the Journal of Applied Econometrics,
9(2) (1994) 109-132.
Hajivassiliou, V. (1993b) Simulating Normal Rectangle Probabilities and Their Derivatives: The effects
of Vectorization. International Journal ofSupercomputer Applications, 7(3), 231-253.
Hajivassiliou, V. (1993~) Simulation Estimation Methods for Limited Dependent Variable Models.
pp. 519-543, in: G.S. Maddala, C.R. Rao and H.D. Vinod, eds., Handbook ofstatistics (Econometrics),
Vol. 11. Amsterdam: North-Holland.
Hajivassiliou, V. and Ioannides, Y. (1991) Switching Regressions Models of the Euler Equation: Consump-
tion Labor Supply and Liquidity Constraints. Cowles Foundation for Research in Economics, Yale
University, mimeo.
Hajivassiliou, V. and McFadden, D. (1990). The Method ofsimulated Scores, with Application to Models
of External Debt Crises. Cowles Foundation Discussion Paper No. 967.
Hajivassiliou, V., McFadden, D. and Ruud, P. (1992) Simulation of Multivariate Normal Orthant
Probabilities: Methods and Programs, Journal of Econometrics, forthcoming.
Hammersley, J. and Handscomb, D. (1964) Monte Carlo Methods. London: Methuen.
Hanemann, M. (1984) Discrete/Continuous Models of Consumer Demand, Econometrica, 52(3), 541-
562.
Hansen, L.P. (1982) Large Sample Properties of Generalized Method of Moments Estimators Econo-
metrica, 50, 1029-1054.
Hausman, J. and Wise, D. (1978) A Conditional Probit Model for Qualitative Choice: Discrete
Decisions Recognizing Interdependence and Heterogeneous Preferences, Econometrica, 46,
403-426.
Hausman, J. and Wise, D. (1979) Attrition Bias in Experimental and Panel Data: The Gary Negative
Income Maintenance Experiment, Econometrica, 47(2), 445-473.
Heckman, J. (1974) Shadow Prices, Market Wages, and Labor Supply, Econometrica, 42, 679-694.
Heckman, J. (1979) Sample Selection Bias as a Specification Error, Econometrica, 47, 153-161.
Heckman, J. (1981) Dynamic Discrete Models. pp. 179-195,_in C. Manski and D. McFadden, eds.,
Structural Analysis ofDiscrete Data with Econometric Applications. Cambridge: MIT Press.
Hendry, D. (1984) Monte Carlo Experimentation in Econometrics, pp. 937-976 in: Z. Griliches and
M. Intriligator, eds., Handbook ofEconometrics, Vol. 2. Amsterdam: North-Holland.
Horowitz, J., Sparmonn, J. and Daganzo, C. (1981) An Investigation of the Accuracy of the Clark
Approximation for the Multinomial Probit Model, Transportation Science, 16, 382-401.
Hotz, V.J. and Miller, R. (1989) Condirional Choice Probabilities and the Estimation of Dynamic Program-
ming Models. GSIA Working Paper 88-89-10.
Hotz, V.J. and Sanders, S. (1991) The Estimation ofDynamic Discrete Choice Models by the Method of
Simulated Moments. NORC, University of Chicago.
2440 V.A. Hujiuussiliou and P.A. Ruud

Hotz, V.J., Miller, R., Sanders, S. and Smith, J. (1991) A Simulation Estimatorfin Dynamic Discrete Choice
Models. NORC, University of Chicago, mimeo.
Keane, M. (1990) A Computationully &ficient Practical Simulation Estimator ,Jor Panel Data with
Applications to Estimating Temporal Dependence in Employment and Wqes. University of Minnesota,
mimeo.
Keane, M. (1993) Simulation Estimation Methods for Panel Data Limited Dependent Variable Models,
in: G.S. Maddala, C.R. Rao and H.D. Vinod, eds., Handbook of Statistics (Econometrics), Vol. 11.
Amsterdam: North-Holland.
Kloek, T. and van Dijk, H. (1978) Bayesian Estimates of Equation System Parameters: An Application
of Integration by Monte Carlo, Econometrica, 46, l-20.
Laroque, G. and Salanie, B. (1989) Estimation of Multi-Market Disequilibrium Fix-Price Models: An
Application of Pseudo Maximum Likelihood Methods, Econometrica, 57(4), 83 t-860.
Laroque, G. and Salanie, B. (1990) The Properties of Simulated Pseudo-Maximum Likelihood Methods:
The Case ofthe Canonical Disequilibrium Model. Working Paper No. 9005, CREST-Departement de
la Recherche, INSEE.
Lee, B.-S. and Ingram, B. (1991) Simulation Estimation of Time-Series Models, Journal of Econo-
metrics, 47, 197-205.
Lee, L.-F. (1978) Unionism and Wage Rates: A Simultaneous Equation Model with Qualitative and
Limited Denendent Variables. international Economic Review, 19,415-433.
Lee, L.-F. (1979) Identification and Estimation in Binary Choice Models with Limited (Censored)
Dependent Variables, Econometrica, 47, 977-996.
Lee, L.-F. (1992) On the Efficiency of Methods of Simulated Moments and Maximum Simulated
Likelihood Estimation of Discrete Response Models, Econometric Theory, 8(4), 518-552.
Lerman, S. and Manski, C. (1981) On the Use of Simulated Frequencies to Aproximate Choice Prob-
abilities, pp. 305-319, in: C. Manski and D. McFadden, eds., Structural Analysis ofDiscrete Data with
Econometric Applications. Cambridge: MIT Press.
Lewis, H.G. (1974) Comments on Selectivity Biases in Wage Comparisons, Journal of Political
Economy, 82(6), 114551155.
Maddala, G.S. (1983) Limited Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge
University Press.
McCulloch, R. and Rossi, P.E. (1993) An Exact Likelihood Analysis of the Multinomial Probit Model.
Working Paper 91-102, Graduate School of Business, University of Chicago.
McFadden, D. (1973) Conditional Logit Analysis of Qualitative Choice Behavior, pp. 105-142, in:
P. Zarembka, ed., Frontiers in Econometrics. New York: Academic Press.
McFadden, D. (1981) Econometric Models of Probabilistic Choice, pp. 1988272, in: C. Manski and
D. McFadden, eds., Structural Analysis ofDiscreteData with Econometric Applications. Cambridge:
MIT Press.
McFadden, D. (1986) Econometric Analysis of Qualitative Response Models, pp. 1395-1457, in:
Z. Griliches and M. Intriligator, eds., Handbook ofEconometrics, Vol. 2, Amsterdam: North-Holland.
McFadden, D. (1989) A Method of Simulated Moments for Estimation of Discrete Response Models
without Numerical Integration, Econometrica, 57, 99551026.
McFadden, D. and Ruud, P. (1992) Estimation by Simulation. University of California at Berkeley,
working paper.
Moran, P. (1984) The Monte Carlo Evaluation of Orthant Probabilities for Multivariate Normal
Distributions, Australian Journal of Statistics, 26, 39-44.
Miihleisen, M. (1991) On the Use of Simulated Estimators for Panel Models with Limited-Dependent
Variables. University of Munich, mimeo.
Newey, W.K. and McFadden, D.L. (1994) Estimation in Large Samples, in: R. Engle and D. McFadden,
eds., Handbook ofEconometrics, Vol. 4. Amsterdam: North-Holland.
Owen, D. (1956) Tables for Computing Bivariate Normal Probabilities, Annals of Mathematical
Statistics, 27, 1075-1090.
Pakes, A. (1992) Estimation of Dynamic Structural Models: Problems and Prospects Part II: Mixed
ContinuoussDiscrete Controls and Market Interactions. Yale University, mimeo.
Pakes, A. and Pollard, D. (1989) Simulation and the Asymptotics of Optimization Estimators, Econo-
metrica, 57, 1027-1057.
Poirier, D. and Ruud, P.A. (1988) Probit with Dependent Observations, Review of Economic Studies,
55,5933614.
Ch. 40: CIassicaf Estimation Methods for LDV Models Using Simulation 2441

Quandt, R. (1972) A New Approach to Estimating Switching Regressions, Journal of the American
Statistical Association, 67, 306-310.
Quandt, R. (1986) Computational Problems in Econometrics, pp. 1395-1457, in: Z. Griliches and
M. Intriligator, eds., Handbook ofEconometrics, Vol. 1. Amsterdam: North-Holland.
Rubinstein, R. (1981) Simulation and the Monte Carlo Method. New York: Wiley.
Rust, J. (1992) Estimation of Dynamic Structural Models: Problems and Prospects Part II: Discrete
Decision Processes. SSRI Working Paper #9106, University of Wisconsin at Madison.
Ruud, P. (1986) On the Method ofsimulated Moments for the Estimation of Limited Dependent Variable
Models. University of California at Berkeley, mimeo.
Ruud, P. (1991) Extensions of Estimation Methods Using the EM Algorithm, Journal ofEconometrics,
49,305-341.
Stroud, A. (1971) Approximate Calculation ofMultiple Integrals. New York: Prentice-Hall.
Thisted, R. (1988) Elements ofStatistical Computing. New York: Chapman and Hall.
Thurstone, L. (1927) A Law of Comparative Judgement, Psychological Review, 34,273-286.
Tierny, L. (1992). Markov Chainsfor Exploring Posterior Distributions. University of Minnesota, working
paper.
Tobin, J. (1958) Estimation of Relationships for Limited Department Variables, Econometrica, 26,
24-36.
van Dijk, H.K. (1987) Some Advances in Bayesian Estimation Methods Using Monte Carlo Integration,
pp. 205-261, in: T.B. Fomby and G.F. Rhodes, eds., Aduances in Econometrics, Vol. 6, Greenwich, CT:
JAI Press.
van Praag, B.M.S. and Hop, J.P. (1987) Estimation of Continuous Models on the Basis of Set-Valued
Observations. Erasmus University Working Paper, presented at the ESEM Copenhagen.
van Praag, B.M.S., Hop, J.P. and Eggink, E. (1989) A Symmetric Approach to the Labor Market by
Means of the Simulated Moments Method with an Application to Married Females. Erasmus University
Working Paper, presented at the EEA Augsburg.
van Praag, B.M.S., Hop, J.P. and Eggink, E. (1991) A Symmetric Approach to the Labor Market by
Means of the Simulated EM-Algorithm with an Application to Married Females. Erasmus University
Working Paper, presented at the ESEM Cambridge.
West, M. (1990) Bnyesian Computations: Monte-Carlo Density Estimation. Duke University, Discussion
Paper 90-AlO.
Chapter 41

ESTIMATION OF SEMIPARAMETRIC
MODELS*

JAMES L. POWELL

Princeton University

Contents

Abstract 2444
1. Introduction 2444
1.1. Overview 2444
1.2. Definition of semiparametric 2449
1.3. Stochastic restrictions and structural models 2452
1.4. Objectives and techniques of asymptotic theory 2460
2. Stochastic restrictions 2465
2.1. Conditional mean restriction 2466
2.2. Conditional quantile restrictions 2469
2.3. Conditional symmetry restrictions 2474
2.4. Independence restrictions 2476
2.5. Exclusion and index restrictions 2482
3. Structural models 2487
3.1. Discrete response models 2487
3.2. Transformation models 2492
3.3. Censored and truncated regression models 2500
3.4. Selection models 2506
3.5. Nonlinear panel data models 2511
4. Summary and conclusions 2513
References 2514

*This work was supported by NSF Grants 91-96185 and 92-10101 to Princeton University. I am
grateful to Hyungtaik Ahn, Moshe Buchinsky, Gary Chamberlain, Songnian Chen, Gregory Chow,
Angus Deaton, Bo Honoré, Joel Horowitz, Oliver Linton, Robin Lumsdaine, Chuck Manski, Rosa
Matzkin, Dan McFadden, Whitney Newey, Paul Ruud, and Tom Stoker for their helpful suggestions,
which were generally adopted except when they were mutually contradictory or required a lot of extra
work.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

Abstract

A semiparametric model for observational data combines a parametric form for


some component of the data generating process (usually the behavioral relation
between the dependent and explanatory variables) with weak nonparametric restric-
tions on the remainder of the model (usually the distribution of the unobservable
errors). This chapter surveys some of the recent literature on semiparametric
methods, emphasizing microeconometric applications using limited dependent
variable models. An introductory section defines semiparametric models more
precisely and reviews the techniques used to derive the large-sample properties of
the corresponding estimation methods. The next section describes a number of
weak restrictions on error distributions - conditional mean, conditional quantile,
conditional symmetry, independence, and index restrictions - and shows how they
can be used to derive identifying restrictions on the distributions of observables.
This general discussion is followed by a survey of a number of specific estimators
proposed for particular econometric models, and the chapter concludes with a
brief account of applications of these methods in practice.

1. Introduction

1.1. Overview

Semiparametric modelling is, as its name suggests, a hybrid of the parametric and
nonparametric approaches to construction, fitting, and validation of statistical
models. To place semiparametric methods in context, it is useful to review the way
these other approaches are used to address a generic microeconometric problem -
namely, determination of the relationship of a dependent variable (or variables) y
to a set of conditioning variables x given a random sample {z_i = (y_i, x_i), i = 1, ..., N}
of observations on y and x. This would be considered a micro-econometric
problem because the observations are mutually independent and the dimension
of the conditioning variables x is finite and fixed. In a macro-econometric
application using time series data, the analysis must also account for possible serial
dependence in the observations, which is usually straightforward, and a growing
or infinite number of conditioning variables, e.g. past values of the dependent
variable y, which may be more difficult to accommodate. Even for microecono-
metric analyses of cross-sectional data, distributional heterogeneity and dependence
due to clustering and stratification must often be considered; still, while the random
sampling assumption may not be typical, it is a useful simplification, and adaptation
of statistical methods to non-random sampling is usually straightforward.
In the classical parametric approach to this problem, it is typically assumed
that the dependent variable is functionally dependent on the conditioning variables
Ch. 41: Estimation of Semiparametric Models 2445

(regressors) and unobservable errors according to a fixed structural relation


of the form

y = g(x, α₀, ε),    (1.1)

where the structural function g(·) is known but the finite-dimensional parameter
vector α₀ ∈ ℝᵖ and the error term ε are unobserved. The form of g(·) is chosen to
give a class of simple and interpretable data generating mechanisms which embody
the relevant restrictions imposed by the characteristics of the data (e.g. g(·) is
dichotomous if y is binary) and/or economic theory (monotonicity, homotheticity,
etc.). The error terms ε are introduced to account for the lack of perfect fit of (1.1)
for any fixed value of α₀, and are variously interpreted as expectational or
optimization errors, measurement errors, unobserved differences in tastes or
technology, or other omitted or unobserved conditioning variables; their inter-
pretation influences the way they are incorporated into the structural function
g(·).
To prevent (1.1) from holding tautologically for any value of α₀, the stochastic
behavior of the error terms must be restricted. The parametric approach takes the
error distribution to belong to a finite-dimensional family of distributions,

Pr{ε ≤ λ|x} = ∫_{−∞}^{λ} f_ε(u|x, η₀) dμ_ε,    (1.2)

where f_ε(·) is a known density (with respect to the dominating measure μ_ε) except
for an unknown, finite-dimensional nuisance parameter η₀.
structural model (1.1) and the conditional error distribution (1.2), the conditional
distribution of y given x can be derived,

Pr{y ≤ λ|x} = ∫ 1{u ≤ λ} f_{y|x}(u|x, α₀, η₀) dμ_{y|x},

for some parametric conditional density f_{y|x}(·). Of course, it is usually possible to


posit this conditional distribution of y given x directly, without recourse to
unobservable error terms, but the adequacy of an assumed functional form is
generally assessed with reference to an implicit structural model. In any case, with
this conditional density, the unknown parameters α₀ and η₀ can be estimated by
maximizing the average conditional log-likelihood

L_N(α, η) ≡ (1/N) Σᵢ₌₁ᴺ ln f_{y|x}(yᵢ|xᵢ, α, η).

This fully parametric modelling strategy has a number of well-known optimality


properties. If the specifications of the structural equation (1.1) and error distribution
(1.2) are correct (and other mild regularity conditions hold), the maximum likeli-
hood estimators of α₀ and η₀ will converge to the true parameters at the rate of
the inverse square root of the sample size (root-N-consistent) and will be
asymptotically normally distributed, with an asymptotic covariance matrix which
is no larger than that of any other regular root-N-consistent estimator. Moreover,
the parameter estimates yield a precise estimator of the conditional distribution
of the dependent variable given the regressors, which might be used to predict y
for values of x which fall outside the observed support of the regressors. The
drawback to parametric modelling is the requirement that both the structural
model and the error distribution are correctly specified. Correct specification may
be particularly difficult for the error distribution, which represents the unpredict-
able component of the relation of y to x. Unfortunately, if g(x, α, ε) is fundamentally
nonlinear in ε - that is, it is noninvertible in ε or has a Jacobian that depends on
the unknown parameters α - then misspecification of the functional form of the
error distribution f_ε(ε|x, η) generally yields inconsistency of the MLE and inconsistent
estimates of the conditional distribution of y given x.
At the other extreme, a fully nonparametric approach to modelling the relation
between y and x would define any such relation as a characteristic of the joint
distribution of y and x, which would be the primitive object of interest. A causal
or predictive relation from the regressors to the dependent variable would be given
as a particular functional of the conditional distribution of y given x,

g(x) = T(F_{y|x}),    (1.3)

where F_{y,x} is the joint and F_{y|x} is the conditional distribution. Usually the functional
T(·) is a location measure, in which case the relation between y and x has a rep-
resentation analogous to (1.1) and (1.2), but with unknown functional forms for
f(·) and g(·). For example, if g(x) is the mean regression function (T(F_{y|x}) = E[y|x]),
then y can be written as

y = g(x) + ε,

with ε defined to have conditional density f_{ε|x}, assumed to satisfy only the normali-
zation E[ε|x] = 0. In this approach the interpretation of the error term ε is different
than for the parametric approach; its stochastic properties derive from its definition
in terms of the functional g(·) rather than a prior behavioral assumption.
Estimation of the function g(·) is straightforward once a suitable estimator F̂_{y|x}
of the conditional distribution of y given x is obtained; if the functional T(·) in
(1.3) is well-behaved (i.e. continuous over the space of possible F_{y|x}), a natural
estimator is

ĝ(x) = T(F̂_{y|x}).
Thus the problem of estimating the relationship g(.) reduces to the problem of
estimating the conditional distribution function, which generally requires some
smoothing across adjacent observations of the regressors x when some components

are continuously distributed (see, e.g. Prakasa Rao (1983), Silverman (1986), Bierens
(1987), Härdle (1991)). In some cases, the functional T(·) might be a well-defined
functional of the empirical c.d.f. of the data (for example, g(x) might be the best
linear projection of y on x, which depends only on the covariance matrix of the
data); in these cases smoothing of the empirical c.d.f. will not be required. An
alternative estimation strategy would approximate g(x) and the conditional distri-
bution of ε in (1.6) by a sequence of parametric models, with the number of param-
eters expanding as the sample size increases; this approach, termed the "method
of sieves" by Grenander (1981), is closely related to the "seminonparametric"
modelling approach of Gallant (1981, 1987), Elbadawi et al. (1983) and Gallant
and Nychka (1987).
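To fix ideas, the simplest such smoothing estimator of a mean regression function is a kernel-weighted local average. The sketch below (in Python; the data generating process, the Gaussian kernel, and the bandwidth are illustrative assumptions, not constructions from the text) evaluates it on a grid:

```python
# Sketch: estimating g(x) = E[y|x] by kernel smoothing of the data (a
# Nadaraya-Watson local average).  Design, kernel, and bandwidth h are
# illustrative choices.
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-2, 2, n)
y = np.sin(2 * x) + 0.3 * rng.normal(size=n)

def g_hat(x0, h=0.2):
    # Gaussian-kernel weighted average of y for observations near x0
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return (w * y).sum() / w.sum()

grid = np.linspace(-1.5, 1.5, 7)
print(np.round([g_hat(x0) for x0 in grid], 2))
print(np.round(np.sin(2 * grid), 2))   # true regression function on the grid
```

The local averaging is what produces the slower-than-parametric convergence rates discussed below: only observations with x near the evaluation point carry appreciable weight.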
The advantages and disadvantages of the nonparametric approach are the
opposite of those for parametric modelling. Nonparametric modelling typically
imposes few restrictions on the form of the joint distribution of the data (like
smoothness or monotonicity), so there is little room for misspecification, and
consistency of an estimator of g(x) is established under much more general
conditions than for parametric modelling. On the other hand, the precision of
estimators which impose only nonparametric restrictions is often poor. When
estimation of g(x) requires smoothing of the empirical c.d.f. of the data, the
convergence rate of the estimator is usually slower than the parametric rate (square
root of the sample size), due to the bias caused by the smoothing (see the chapter
by Härdle and Linton in this volume). And, although some prior economic
restrictions like homotheticity and monotonicity can be incorporated into the
nonparametric approach (as described in the chapter by Matzkin in this volume),
the definition of the relation is statistical, not economic. Extrapolation of the
relationship outside the observed support of the regressors is not generally possible
with a nonparametric model, which is analogous to a reduced form in the classical
terminology of simultaneous equations modelling.
The semiparametric approach, the subject of this chapter, distinguishes between
the parameters of interest, which are finite-dimensional, and infinite-dimensional
nuisance parameters, which are treated nonparametrically. (When the param-
eter of interest is infinite-dimensional, like the baseline hazard in a proportional
hazards model, the nonparametric methods described in the Härdle and Linton
chapter are more appropriate.) In a typical semiparametric model, the parameters of
interest, α₀, appear only in a structural equation analogous to (1.1), while the
conditional error distribution is treated as a nuisance parameter, subject to certain
prior restrictions. More generally, unknown nuisance functions may also appear
in the structural equation. Semiparametric analogues to equations (1.1) and (1.2)
are

y = g(x, α₀, τ₀(·), ε),    (1.4)

Pr{ε ≤ λ|x} = ∫ 1{u ≤ λ} f₀(u|x) dμ_ε,    (1.5)
2448 J.L. Powell

where, as before, α₀ is unknown but known to lie in a finite-dimensional Euclidean
subspace, and where the unknown nuisance parameter is

η₀ = (τ₀(·), f₀(·)).

As with the parametric approach, prior economic reasoning motivates the form of
the structural function g(·), while only general
regularity and identification restrictions are imposed on the nuisance parameters
η₀, as in the nonparametric approach.
As a hybrid of the parametric and nonparametric approaches, semiparametric
modelling shares the advantages and disadvantages of each. Because it allows a
more general specification of the nuisance parameters, estimators of the parameters
of interest for semiparametric models are consistent under a broader range of
conditions than for parametric models, and these estimators are usually more
precise (converging to the true values at the square root of the sample size) than
their nonparametric counterparts. On the other hand, estimators for semiparametric
models are generally less efficient than maximum likelihood estimators for a
correctly-specified parametric model, and are still sensitive to misspecification of
the structural function or other parametric components of the model.
This chapter will survey the econometric literature on semiparametric estimation,
with emphasis on a particular class of models, nonlinear latent variable models,
which have been the focus of most of the attention in this literature. The remainder
of Section 1 more precisely defines the semiparametric categorization, briefly
lists the structural functions and error distributions to be considered and reviews
the techniques for obtaining large-sample approximations to the distributions of
various types of estimators for semiparametric models. The next section discusses
how each of the semiparametric restrictions on the behavior of the error terms
can be used to construct estimators for certain classes of structural functions.
Section 3 then surveys existing results in the econometric literature for several
groups of latent variable models, with a variety of error restrictions for each group
of structural models. A concluding section summarizes this literature and suggests
topics for further work.
The coverage of the large literature on semiparametric estimation in this chapter
will necessarily be incomplete; fortunately, other general references on the subject
are available. A forthcoming monograph by Bickel et al. (1993) discusses much of
the work on semiparametrics in the statistical literature, with special attention to
construction of efficient estimators; a monograph by Manski (1988b) discusses the
analogous econometric literature. Other surveys of the econometric literature
include those by Robinson (1988a) and Stoker (1992), the latter giving an extensive
treatment of estimation based upon index restrictions, as described in Section 2.5
below. Newey (1990a) surveys the econometric literature on semiparametric
efficiency bounds, which is not covered extensively in this chapter. Finally, given
the close connection between the semiparametric approach and parametric and
say, to different methods and degrees of smoothing of the empirical c.d.f.), while
estimation of a semiparametric model would require an additional choice of the
particular functional T* upon which to base the estimates.
On a related point, while it is common to refer to "semiparametric estimation"
and "semiparametric estimators", this is somewhat misleading terminology. Some
authors use the term "semiparametric estimator" to denote a statistic which in-
volves a preliminary "plug-in" estimator of a nonparametric component (see, for
example, Andrews' chapter in this volume); this leads to some semantic ambiguities,
since the parameters of many semiparametric models can be estimated by "para-
metric" estimators and vice versa. Thus, though certain estimators would be hard
to interpret in a parametric or nonparametric context, in general the term "semi-
parametric", like "parametric" or "nonparametric", will be used in this chapter to
refer to classes of structural models and stochastic restrictions, and not to a
particular statistic. In many cases, the same estimator can be viewed as parametric,
nonparametric or semiparametric, depending on the assumptions of the model.
For example, for the classical linear model

y = x'β₀ + ε,

the least squares estimator of the unknown coefficients β₀,

β̂ = [Σᵢ₌₁ᴺ xᵢxᵢ']⁻¹ Σᵢ₌₁ᴺ xᵢyᵢ,

would be considered a parametric estimator when the error terms are assumed
to be Gaussian with zero mean and distributed independently of the regressors x.
With these assumptions β̂ is the maximum likelihood estimator of β₀, and thus
is asymptotically efficient relative to all regular estimators of β₀. Alternatively, the
least squares estimator arises in the context of a linear prediction problem, where
the error term ε has a density which is assumed to satisfy the unconditional moment
restriction

E[ε·x] = 0.

This restriction yields a unique representation for β₀ in terms of the joint distribu-
tion of the data,

β₀ = {E[x·x']}⁻¹ E[x·y],

so estimation of β₀ in this context would be considered a nonparametric problem
by the criteria given above. Though other, less precise estimators of the moments
E[x·x'] and E[x·y] (say, based only on a subset of the observations) might be
used to define alternative estimators, the classical least squares estimator β̂ is, al-

most by default, an efficient estimator of β₀ in this model (as Levit (1975) makes
precise). Finally, the least squares estimator β̂ can be viewed as a special case of
the broader class of weighted least squares estimators of β₀ when the error terms
ε are assumed to have conditional mean zero,

E[εᵢ|xᵢ] = 0 a.s.

The model defined by this restriction would be considered semiparametric, since
β₀ is overidentified; while the least squares estimator β̂ is √N-consistent and
asymptotically normal for this model (assuming the relevant second moments are
finite), it is inefficient in general, with an efficient estimator being based on the rep-
resentation

β₀ = {E[σ⁻²(x)·x·x']}⁻¹ E[σ⁻²(x)·x·y]

of the parameters of interest, where σ²(x) ≡ Var(εᵢ|xᵢ) (as discussed in Section 2.1
below). The least squares statistic β̂ is a semiparametric estimator in this context,
due to the restrictions imposed on the model, not on the form of the estimator.
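A small Monte Carlo makes the efficiency comparison concrete. In the Python sketch below, the design, the t-distributed errors, and the skedastic function sigma2() are all illustrative assumptions, and sigma2() is treated as known, so the weighted estimator is an infeasible benchmark:

```python
# Sketch: under E[eps|x] = 0 with heteroskedastic errors, ordinary least
# squares is consistent but inefficient relative to weighting by 1/sigma2(x).
import numpy as np

rng = np.random.default_rng(0)
beta0 = np.array([1.0, 2.0])

def sigma2(x):
    return 0.5 + 2.0 * x[:, 1] ** 2        # assumed Var(eps|x)

def simulate(n=500):
    x = np.column_stack([np.ones(n), rng.normal(size=n)])
    eps = np.sqrt(sigma2(x)) * rng.standard_t(df=5, size=n)  # E[eps|x] = 0
    return x, x @ beta0 + eps

def ols(x, y):
    return np.linalg.solve(x.T @ x, x.T @ y)

def wls(x, y):
    xw = x / sigma2(x)[:, None]            # weight rows by 1/sigma2(x)
    return np.linalg.solve(xw.T @ x, xw.T @ y)

slopes = np.array([[ols(*d)[1], wls(*d)[1]]
                   for d in (simulate() for _ in range(200))])
print("slope s.d.:  OLS %.3f   WLS %.3f" % tuple(slopes.std(axis=0)))
```

A feasible version would replace sigma2() with a nonparametric estimate, as in the efficient estimators discussed in Section 2.1 below.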
Two categories of estimators which are related to semiparametric estimators,
but logically distinct, are robust and adaptive estimators. The term robustness
is used informally to denote statistical procedures which are well-behaved for slight
misspecifications of the model. More formally, a robust estimator & - T(p,,,) can
be defined as one for which T(F) is a continuous functional at the true model
(e.g. Manski (1988b)), or whose asymptotic distribution is continuous at the
truth (quantitative robustness, as defined by Huber (1981)). Other notions of
robustness involve sensitivity of particular estimators to changes in a small frac-
tion of the observations. While semiparametric estimators are designed to be
well-behaved under weak conditions on the error distribution and other nuisance
parameters (which are assumed to be correct), robust estimators are designed to
be relatively efficient for correctly-specified models but also relatively insensitive
to slight model misspecification. As noted in Section 1.4 below, robustness of
an estimator is related to the boundedness (and continuity) of its influence function,
defined in Section 1.4 below; whether a particular semiparametric model admits
a robust estimator depends upon the particular restrictions imposed. For example,
for conditional mean restrictions described in Section 2.1 below, the influence
functions for semiparametric estimators will be linear (and thus unbounded)
functions of the error terms, so robust estimation is infeasible under this restriction.
On the other hand, the influence function for estimators under conditional quantile
restrictions depends upon the sign of the error terms, so quantile estimators are
generally robust (at least with respect to outlying errors) as well as semipara-
metric.
Adaptive estimators are efficient estimators of certain semiparametric models
for which the best attainable efficiency for estimation of the parameters of interest

does not depend upon prior knowledge of a parametric form for the nuisance
parameters. That is, adaptive estimators are consistent under the semiparametric
restrictions but as efficient (asymptotically) as a maximum likelihood estimator
when the (infinite-dimensional) nuisance parameter is known to lie in a finite-
dimensional parametric family. Adaptive estimation is possible only if the semi-
parametric information bound for attainable efficiency for the parameters of
interest is equal to the analogous Cramer-Rao bound for any feasible parametric
specification of the nuisance parameter. Adaptive estimators, which are described
in more detail by Bickel et al. (1993) and Manski (1988b), involve explicit estimation
of (nonparametric) nuisance parameters, as do efficient estimators for semipara-
metric models more generally.

1.3. Stochastic restrictions and structural models

As discussed above, a semiparametric model for the relationship between y and


x will be determined by the parametric form of the structural function g(.) of (1.4)
and the restrictions imposed on the error distribution and any other infinite-
dimensional component of the model. The following sections of this chapter group
semiparametric models by the restrictions imposed on the error distribution,
describing estimation under these restrictions for a number of different structural
models. A brief description of the restrictions to be considered, followed by a
discussion of the structural models, is given in this section.
A semiparametric restriction on ε which is quite familiar in econometric theory
and practice is a (constant) conditional mean restriction, where it is assumed that

E[ε|x] = μ₀    (1.6)

for some unknown constant μ₀, which is usually normalized to zero to ensure
identification of an intercept term. (Here and throughout, all conditional expec-
tations are assumed to hold for a set of regressors x with probability one.) This
restriction is the basis for much of the large-sample theory for least squares and
method-of-moments estimation, and estimators derived for assumed Gaussian
distributions of ε (or, more generally, for error distributions in an exponential
family) are often well-behaved under this weaker restriction.
A restriction which is less familiar but gaining increasing attention in econometric
practice is a (constant) conditional quantile restriction, under which a scalar error
term ε is assumed to satisfy

Pr{ε ≤ q₀|x} = π    (1.7)

for some fixed proportion π ∈ (0, 1) and constant q₀ = q₀(π); a conditional median
restriction is the (leading) special case with π = 1/2. Rewriting the conditional

an assumption that

Pr{ε ≤ u|x} = Pr{ε ≤ u|v(x)}    (1.11)

for some index function v(x) with dim{v(x)} < dim{x}; a weak or mean index
restriction asserts a similar property only for the conditional expectation:

E[ε|x] = E[ε|v(x)].    (1.12)

For different structural models, the index function v(x) might be assumed to be a
known function of x, or known up to a finite number of unknown parameters
(e.g. v(x) = x'β₀), or an unknown function of known dimensionality (in which case
some extra restriction(s) will be needed to identify the index). As a special case,
the function v(x) may be trivial, which yields the independence or conditional
mean restrictions as special cases; more generally, v(x) might be a known subvector
x₁ of the regressors x, in which case (1.11) and (1.12) are strong and weak forms
of an exclusion restriction, otherwise known as conditional independence and
conditional mean independence of ε and x given x₁, respectively. When the index func-
tion is unknown, it is often assumed to be linear in the regressors, with coeffi-
cients that are related to unknown parameters of interest in the structural
model.
The following diagram summarizes the hierarchy of the stochastic restrictions
to be discussed in the following sections of this chapter, with declining level of
generality from top to bottom:

[Diagram: hierarchy of stochastic restrictions, with generality declining from top
to bottom: nonparametric at the top; conditional mean and conditional median
(location) restrictions next; then conditional symmetry and independence; and
parametric at the bottom.]

Turning now to a description of some structural models treated in the semi-


parametric literature, an important class of parametric forms for the structural

functions is the class of linear latent variable models, in which the dependent variable
y is assumed to be generated as some transformation

y = t(y*; λ₀, τ₀(·))    (1.13)

of some unobservable variable y*, which itself has a linear regression representation

y* = x'β₀ + ε.    (1.14)

Here the regression coefficients β₀ and the finite-dimensional parameters λ₀ of the
transformation function are the parameters of interest, while the error distribution
and any nonparametric component τ₀(·) of the transformation make up the non-
parametric component of the model. In general y and y* may be vector-valued,
and restrictions on the coefficient matrix β₀ may be imposed to ensure identification
of the remaining parameters. This class of models, which includes the classical
linear model as a special case, might be broadened to permit a nonlinear (but
parametric) regression function for the latent variable y*, as long as the additivity
of the error terms in (1.14) is maintained.
One category of latent variable models, parametric transformation models, takes
the transformation function t(y*; λ₀) to have no nonparametric nuisance com-
ponent τ₀(·) and to be invertible in y* for all possible values of λ₀. A well-known
example of a parametric transformation model is the Box-Cox regression model
(Box and Cox (1964)), which has y = t⁻¹(x'β₀ + ε; λ₀) for

t(y; λ) = [(y^λ − 1)/λ]·1{λ ≠ 0} + ln(y)·1{λ = 0}.

This transformation, which includes linear and log-linear (in y) regression models
as special cases, requires the support of the latent variable y* to be bounded from
below (by −1/λ₀) for noninteger values of λ₀, but has been extended by Bickel
and Doksum (1981) to unbounded y*. Since the error term ε can be expressed as
a known function of the observable variables and unknown parameters for these
models, a stochastic restriction on ε (like a conditional mean restriction, defined
below) translates directly into a restriction on y, x, β₀, and λ₀, which can be used
to construct estimators.
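Because t(·; λ) is invertible, the implied error can be computed directly from the observables at trial parameter values. The Python sketch below (the design, parameter values, and error scale are illustrative assumptions) checks sample analogues of conditional-mean-type moment conditions at the true parameters:

```python
# Sketch: the Box-Cox transformation and the residual it implies.
import numpy as np

def t(y, lam):
    # t(y; lambda) = (y**lam - 1)/lam for lam != 0, log(y) for lam = 0
    return np.log(y) if lam == 0 else (y ** lam - 1.0) / lam

rng = np.random.default_rng(4)
n, beta0, lam0 = 500, 1.5, 0.5
x = rng.uniform(0.5, 2.0, n)
ystar = x * beta0 + rng.normal(scale=0.1, size=n)   # latent x'beta0 + eps
y = (lam0 * ystar + 1.0) ** (1.0 / lam0)            # y = t^{-1}(ystar; lam0)

# since eps = t(y; lam0) - x*beta0, a conditional mean restriction on eps
# yields sample moment conditions such as E[eps] = 0 and E[x*eps] = 0:
eps_hat = t(y, lam0) - x * beta0
print(eps_hat.mean(), (x * eps_hat).mean())          # both approximately zero
```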
Another category, limited dependent variable models, includes latent variable
models in which the transformation function t(y*) does not depend upon
unknown parameters but is noninvertible, mapping intervals of possible
y* values into single values of y. Scalar versions of these models have received
much of the attention in the econometric literature on semiparametric estimation,
owing to their relative simplicity and the fact that parametric methods generally
yield inconsistent estimators for β₀ when the functional form of the error distri-
bution is misspecified. The simplest nontrivial transformation in this category is

an indicator for positivity of the latent variable y*, which yields the binary response
model

y = 1{x'β₀ + ε > 0},    (1.15)

which is commonly used in econometric applications to model dichotomous choice
problems. For this model, in which the parameters can be identified at most up
to a scale normalization on β₀ or ε, the only point of variation of the function
t(y*) occurs at y* = 0, which makes identification of β₀ particularly difficult. A
model which shares much of the structure of the binary response model is the
ordered response model, in which the latent variable y* is only known to fall in one of
J + 1 ordered intervals {(−∞, c₁], (c₁, c₂], ..., (c_J, ∞)}; that is,

y = Σⱼ₌₁ᴶ 1{x'β₀ + ε > cⱼ}.    (1.16)

Here the thresholds {cⱼ} are assumed unknown (apart from a normalization like
c₁ = 0), and must be estimated along with β₀. The grouped dependent variable model
is a variation with known values of {cⱼ}, where the values of y might correspond
to prespecified income intervals.
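The latent variable structure of (1.15) and (1.16) is straightforward to simulate; in the Python sketch below, the logistic error law and the threshold values are illustrative assumptions that the semiparametric analyst would not know:

```python
# Sketch: binary (1.15) and ordered (1.16) responses from one latent index.
import numpy as np

rng = np.random.default_rng(5)
n = 1000
beta0 = np.array([0.5, 1.0])
x = np.column_stack([np.ones(n), rng.normal(size=n)])
ystar = x @ beta0 + rng.logistic(size=n)             # latent y* = x'beta0 + eps

y_bin = (ystar > 0).astype(int)                       # (1.15)
c = np.array([0.0, 1.5, 3.0])                         # c_1 < c_2 < c_3, c_1 = 0
y_ord = (ystar[:, None] > c[None, :]).sum(axis=1)     # (1.16): y in {0,...,J}
print(np.bincount(y_bin), np.bincount(y_ord))
```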
A structural function for which the transformation function is more informative
about β₀ is the censored regression model, also known in econometrics as the
censored Tobit model (after Tobin (1958)). Here the observable dependent variable
is assumed to be subject to a nonnegativity constraint, so that

y = max{0, x'β₀ + ε};    (1.17)

this structural function is often used as a model of individual demand or supply
for some good when a fraction of individuals do not participate in that market.
A variation on this model, the accelerated failure time model with fixed censoring,
can be used as a model for duration data when some durations are incomplete.
Here

y = min{x₁'β₀ + ε, x₂},    (1.18)

where y is the logarithm of the observable duration time (e.g. an unemployment
spell), and x₂ is the logarithm of the duration of the experiment (following which
the time to completion for any ongoing spells is unobserved); the "fixed" qualifier
denotes models in which both x₁ and x₂ are observable (and may be functionally
related).
These univariate limited dependent variable models have multivariate analogues
which have also been considered in the semiparametric literature. One multi-
variate generalization of the binary response model is the multinomial response

model, for which the dependent variable is a J-dimensional vector of indicators,
y = vec{y_j, j = 1, ..., J}, with

y_j = 1{y_j* ≥ y_k* for k ≠ j}    (1.19)

and with each latent variable y_j* generated by a linear model

y_j* = x'β_j0 + ε_j,    β₀ ≡ (β_10', ..., β_J0')'.    (1.20)

That is, y_j = 1 if and only if its latent variable y_j* is the largest across alternatives.
Another bivariate model which combines the binary response and censored reg-
ression models is the censored sample selection model, which has one binary res-
ponse variable y₁ and one quantitative dependent variable y₂ which is observed
only when y₁ = 1:

y₁ = 1{x₁'β₁₀ + ε₁ > 0}    (1.21)

and

y₂ = y₁·(x₂'β₂₀ + ε₂).    (1.22)

This model includes the censored regression model as a special case, with β₁₀ =
β₂₀ ≡ β₀ and ε₁ = ε₂ ≡ ε. A closely related model is the disequilibrium regression
model with observed regime, for which only the smaller of two latent variables is
observed, and it is known which variable is observed:

y₁ = 1{x₁'β₁₀ + ε₁ < x₂'β₂₀ + ε₂}    (1.23)

and

y₂ = y₁·(x₁'β₁₀ + ε₁) + (1 − y₁)·(x₂'β₂₀ + ε₂).    (1.24)

A special case of this model, the randomly censored regression model, imposes the
restriction β₂₀ = 0, and is a variant of the duration model (1.18) in which the
observable censoring threshold x₂ is replaced by a random threshold ε₂ which is
unobserved for completed spells.
A class of limited dependent variable models which does not neatly fit into the
foregoing latent variable framework is the class of truncated dependent variable
models, which includes the truncated regression and truncated sample selection
models. In these models, an observable dependent variable y is constructed from
latent variables drawn from a particular subset of their support. For the truncated
regression model, the dependent variable y has the distribution of y* = x'β₀ + ε

conditional on y* > 0:

y = x'β₀ + u,    (1.25)

with

Pr{u ≤ λ|x} = Pr{ε ≤ λ|x, ε > −x'β₀}.    (1.26)

For the truncated selection model, the dependent variable y is generated in the
same way as y₂ in (1.24), conditionally on y₁ = 1. Truncated models are variants
of censored models for which no information on the conditioning variables x is
available when the latent variable y* cannot be observed. Since truncated samples
can be constructed from their censored counterparts by deleting censored obser-
vations, identification and estimation of the parameters of interest are more challeng-
ing for truncated data.
An important class of multivariate latent dependent variable models arises in
the analysis of panel data, where the dimensionality of the dependent variable y
is proportional to the number of time periods each individual is observed. For
concreteness, consider the special case in which a scalar dependent variable is
observed for two time periods, with subscripts on y and x denoting time period;
then a latent variable analogue of the standard linear fixed effects model for
panel data has

y₁ = t(γ + x₁'β₀ + ε₁; τ₀),
y₂ = t(γ + x₂'β₀ + ε₂; τ₀),    (1.27)

where t(·) is any of the transformation functions discussed above and γ is an
unobservable error term which is constant across time periods (unlike the time-
specific errors ε₁ and ε₂) but may depend in an arbitrary way on the regressors
x₁ and x₂. Consistent estimation of the parameters of interest β₀ for such models
is a very challenging problem; while time-differencing or deviation from cell
means eliminates the fixed effect for linear models, these techniques are not
applicable to nonlinear models, except in certain special cases (as discussed by
Chamberlain (1984)). Even when the joint distribution of the error terms ε₁ and
ε₂ is known parametrically, maximum likelihood estimators for β₀, τ₀ and the
distributional parameters will be inconsistent in general if the unknown values of
γ are treated as individual-specific intercept terms (as noted by Heckman and
MaCurdy (1980)), so semiparametric methods will be useful even when the distri-
bution of the fixed effects is the only nuisance parameter of the model.
The structural functions considered so far have been assumed known up to a
finite-dimensional parameter. This is not the case for the generalized regression

model, which has

y = τ₀(x'β₀ + ε),    (1.28)

for some transformation function τ₀(·) which is of unknown parametric form, but
which is restricted either to be monotonic (as assumed by Han (1987a)), or smooth
(or both). Formally, this model includes the univariate limited dependent variable
and parametric transformation models as special cases; however, it is generally
easier to identify and estimate the parameters of interest when the form of the
transformation function t(·) is (parametrically) known.
Another model which at first glance has a nonparametric component in the
structural component is the partially linear or semilinear regression model proposed
by Engle et al. (1986), who labelled it the "semiparametric regression" model; esti-
mation of this model was also considered by Robinson (1988). Here the regression
function is a nonparametric function of a subset x₁ of the regressors, and a linear
function of the rest:

y = x₂'β₀ + λ₀(x₁) + ε,    (1.29)

where λ₀(·) is unknown but smooth. By defining a new error term ε* = λ₀(x₁) + ε,
a constant conditional mean assumption on the original error term ε translates
into a mean exclusion restriction on the error terms in an otherwise-standard
linear model.
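This observation suggests the partialling-out strategy associated with Robinson (1988): smooth both y and x₂ on x₁ by nonparametric regression, then apply least squares to the residuals. A minimal kernel-based Python sketch follows (the bandwidth and the design are illustrative assumptions; refinements such as leave-one-out smoothing and trimming are omitted):

```python
# Sketch: Robinson-type estimation of the partially linear model (1.29),
# y = x2'beta0 + lambda0(x1) + eps.
import numpy as np

rng = np.random.default_rng(9)
n, beta0 = 600, 1.0
x1 = rng.uniform(-1, 1, n)
x2 = x1 ** 2 + 0.5 * rng.normal(size=n)              # x2 correlated with x1
y = x2 * beta0 + np.sin(3 * x1) + 0.3 * rng.normal(size=n)

def smooth_on_x1(v, h=0.1):
    # kernel regression of v on x1, evaluated at each sample point
    k = np.exp(-0.5 * ((x1[:, None] - x1[None, :]) / h) ** 2)
    return (k * v[None, :]).sum(axis=1) / k.sum(axis=1)

ey = y - smooth_on_x1(y)                              # strip out lambda0(x1)
ex2 = x2 - smooth_on_x1(x2)
beta_hat = (ex2 @ ey) / (ex2 @ ex2)
print(beta_hat)                                       # close to beta0 = 1
```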
Yet another class of models with a nonparametric component are generated
regressor models, in which the regressors x appear in the structural equation for
y indirectly, through the conditional mean of some other observable variable w
given x:

y = h(E[w|x], α₀, ε) ≡ g(x, α₀, δ₀(·), ε),    (1.30)

with δ₀(x) ≡ E[w|x]. These models arise when modelling individual behavior under
uncertainty, when actions depend upon predictions (here, conditional expectations)
of unobserved outcomes, as in the large literature on rational expectations.
Formally, the nonparametric component in the structural function can be absorbed
into an unobservable error term satisfying a conditional mean restriction; that is,
defining η ≡ w − E[w|x] (so that E[η|x] ≡ 0), the model (1.30) with nonpara-
metrically-generated regressors can be rewritten as y = h(w − η, α₀, ε), with a
conditional mean restriction on the extra error term η. In practice, this alternative
representation is difficult to manipulate unless h(·) is linear, and estimators are
more easily constructed using the original formulation (1.30).
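A two-step Python sketch of this construction with a linear h(·) follows; all functional forms, names, and the bandwidth are illustrative assumptions. The first step estimates δ₀(x) = E[w|x] by kernel regression, and the second step uses the fitted values as the regressor:

```python
# Sketch: generated-regressor model (1.30) with y = alpha0 * E[w|x] + eps.
import numpy as np

rng = np.random.default_rng(6)
n, alpha0 = 800, 2.0
x = rng.uniform(-1, 1, n)
w = np.exp(x) + 0.3 * rng.normal(size=n)             # so E[w|x] = exp(x)
y = alpha0 * np.exp(x) + 0.2 * rng.normal(size=n)

def delta_hat(x0, h=0.15):
    # first step: kernel estimate of E[w|x = x0]
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)
    return (k * w).sum() / k.sum()

d = np.array([delta_hat(x0) for x0 in x])            # generated regressor
alpha_est = (d @ y) / (d @ d)                        # second-step least squares
print(alpha_est)                                     # close to alpha0 = 2
```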
Although the models described above have received much of the attention in
the econometric literature on semiparametrics, they by no means exhaust the set
of models with parametric and nonparametric components which are used in

econometric applications. One group of semiparametric models, not considered


here, includes the proportional hazards model proposed and analyzed by Cox (1972,
1975) for duration data, and duration models more generally; these are discussed
by Lancaster (1990) among many others. Another class of semiparametric models
not considered here comprises "choice-based" or "response-based" sampling models;
these are similar to truncated sampling models, in that the observations are drawn
from sub-populations with restricted ranges of the dependent variable, eliminating
the ancillarity of the regressors x. These models are discussed by Manski and
McFadden (1981) and, more recently, by Imbens (1992).

1.4. Objectives and techniques of asymptotic theory

Because of the generality of the restrictions imposed on the error terms for semi-
parametric models, it is very difficult to obtain finite-sample results for the
distribution of estimators except for special cases. Therefore, analysis of semi-
parametric models is based on large-sample theory, using classical limit theorems
to approximate the sampling distribution of estimators. The goals and methods
to derive this asymptotic distribution theory, briefly described here, are discussed
in much more detail in the chapter by Newey and McFadden in this volume.
As mentioned earlier, the first step in the statistical analysis of a semiparametric
model is to demonstrate identification of the parameters α₀ of interest; though
logically distinct, identification is often the first step in construction of an estimator
of α₀. To identify α₀, at least one functional T(·) must be found that yields T(F₀) = α₀,
where F₀ is the true joint distribution function of z = (y, x) (as in (1.3) above). This
functional may be implicit: for example, α₀ may be shown to uniquely solve some
functional equation T(F₀, α₀) = 0 (e.g. E[m(y, x, α₀)] = 0, for some m(·)). Given
the functional T(·) and a random sample {zᵢ = (yᵢ, xᵢ), i = 1, ..., N} of observations
on the data vector z, a natural estimator of α₀ is

α̂ = T(F̂),    (1.31)

where F̂ is a suitable estimator of the joint distribution function F₀. Consistency


of α̂ (i.e. α̂ → α₀ in probability as N → ∞) is often demonstrated by invoking a law
of large numbers after approximating the estimator as a sample average:

α̂ = (1/N) Σᵢ₌₁ᴺ φ_N(yᵢ, xᵢ) + o_p(1),    (1.32)

where E[φ_N(y, x)] → α₀. In other settings, consistency is demonstrated by showing
that the estimator maximizes a random function which converges uniformly and
almost surely to a limiting function with a unique maximum at the true value α₀.
As noted below, establishing (1.32) can be difficult if construction of α̂ involves

explicit nonparametric estimators (through smoothing of the empirical distribution


function).
Once consistency of the estimator is established, the next step is to determine
its rate of convergence, i.e. the steepest function h(N) such that h(N)(α̂ − α₀) = O_p(1).
For regular parametric models, h(N) = √N, so this is a maximal rate under weaker
semiparametric restrictions. If the estimator α̂ has h(N) = √N (in which case it is
said to be root-N-consistent), then it is usually possible to find conditions under
which the estimator has an asymptotically linear representation:

α̂ = α₀ + (1/N) Σᵢ₌₁ᴺ ψ(yᵢ, xᵢ) + o_p(1/√N),    (1.33)

where the influence function ψ(·) has E[ψ(y, x)] = 0 and finite second moments.
The Lindeberg-Levy central limit theorem then yields asymptotic normality of the
estimator,

√N(α̂ − α₀) →d N(0, V₀),    (1.34)

where V₀ = E{ψ(y, x)[ψ(y, x)]'}. With a consistent estimator of V₀ (formed as the
sample covariance matrix of some consistent estimator ψ̂(yᵢ, xᵢ) of the influence
function), confidence regions and test statistics can be constructed with coverage/
rejection probabilities which are approximately correct in large samples.
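As a concrete instance, the Python sketch below carries out this program for the sample median, whose influence function is ψ(y) = (1/2 − 1{y ≤ m₀})/f(m₀); estimating ψ requires a nonparametric estimate of the density at the median, and the bandwidth rule and the t(4) data generating process here are illustrative assumptions:

```python
# Sketch: a confidence interval built from estimated influence functions,
# per (1.33)-(1.34), for the sample median.
import numpy as np

rng = np.random.default_rng(7)
n = 2000
y = rng.standard_t(df=4, size=n)

m_hat = np.median(y)
h = 1.06 * y.std() * n ** (-0.2)                       # rule-of-thumb bandwidth
f_hat = np.exp(-0.5 * ((y - m_hat) / h) ** 2).mean() / (h * np.sqrt(2 * np.pi))

psi_hat = (0.5 - (y <= m_hat)) / f_hat                 # estimated psi(y_i)
se = np.sqrt((psi_hat ** 2).mean() / n)                # sqrt(V_hat / n)
print(m_hat, "+/-", 1.96 * se)                         # approx 95% interval
```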
For semiparametric models, as defined above, there will be other functionals
T*(F) which can be used to construct estimators of the parameters of interest.
The asymptotic efficiency of a particular estimator α̂ can be established by showing
that its asymptotic covariance matrix V₀ in (1.34) is equal to the semiparametric
analogue to the Cramer-Rao bound for estimation of α₀. This semiparametric
efficiency bound is obtained as the smallest of all efficiency bounds for parametric
models which satisfy the semiparametric restrictions. The representation α₀ =
T*(F₀) which yields an efficient estimator generally depends on some component
δ₀(·) of the unknown, infinite-dimensional nuisance parameter η₀(·), i.e. T*(·) =
T*(·, δ₀), so construction of an efficient estimator requires explicit nonparametric
estimation of some characteristics of the nuisance parameter.
Demonstration of (root-N) consistency and asymptotic normality of an estimator
depends on the complexity of the asymptotic linearity representation (1.33), which
in turn depends on the complexity of the estimator. In the simplest case, where
the estimator can be written in a closed form as a smooth function of sample
averages,

α̂ = a((1/N) Σᵢ₌₁ᴺ m(yᵢ, xᵢ)),    (1.35)

the so-called "delta method" yields an influence function ψ of the form

ψ(y, x) = [∂a(μ₀)/∂μ'][m(y, x) − μ₀],    (1.36)

where μ₀ ≡ E[m(y, x)].
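A minimal numerical sketch of (1.35)-(1.36) follows; the choices of a(·) and m(·) below (the variance written as a smooth function of the first two moments) are illustrative assumptions, and the asymptotic variance is estimated from sample second moments of the estimated influence functions:

```python
# Sketch: the delta method for alpha_hat = a(mean of m(z)).
import numpy as np

rng = np.random.default_rng(2)
n = 1000
z = rng.exponential(scale=2.0, size=n)

m = np.column_stack([z, z ** 2])          # m(z): first two moments
mu_hat = m.mean(axis=0)

def a(mu):
    # parameter of interest: the variance, a smooth function of moments
    return mu[1] - mu[0] ** 2

grad = np.array([-2 * mu_hat[0], 1.0])    # da/dmu at mu_hat
psi = (m - mu_hat) @ grad                 # estimated influence functions
alpha_hat = a(mu_hat)
se = psi.std(ddof=1) / np.sqrt(n)         # sqrt(V0_hat / n)
print(alpha_hat, "+/-", 1.96 * se)        # approx 95% confidence interval
```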
Unfortunately, except for the classical linear model with a
conditional mean restriction, estimators for semiparametric models are not of this
simple form. Some estimators for models with weak index or exclusion restrictions
on the errors can be written in closed form as functions of bivariate U-statistics,

α̂ = a{[N(N − 1)/2]⁻¹ Σᵢ₌₁ᴺ⁻¹ Σⱼ₌ᵢ₊₁ᴺ p_N(zᵢ, zⱼ)},    (1.37)

with kernel function p_N that has p_N(zᵢ, zⱼ) = p_N(zⱼ, zᵢ) for zᵢ = (yᵢ, xᵢ); under
conditions given by Powell et al. (1989), the representation (1.33) for such an
estimator has influence function ψ of the same form as in (1.36), where now

m(y, x) = lim_{N→∞} E[p_N(zᵢ, zⱼ)|zᵢ = (y, x)],    μ₀ = E[m(y, x)].    (1.38)

A consistent estimator of the asymptotic covariance matrix of α̂ of (1.37) is the
sample second moment matrix of

ψ̂(yᵢ, xᵢ) ≡ [∂a(μ̂)/∂μ']·2[(N − 1)⁻¹ Σⱼ≠ᵢ p_N(zᵢ, zⱼ) − μ̂],    (1.39)

where μ̂ denotes the sample U-statistic inside the braces in (1.37).
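The Python sketch below illustrates this construction for a simple symmetric kernel, Gini's mean difference p(zᵢ, zⱼ) = |zᵢ − zⱼ| with a(·) the identity; both the kernel and the data generating process are illustrative assumptions:

```python
# Sketch: a bivariate U-statistic with a variance estimate built from the
# approximate influence functions as in (1.39).
import numpy as np

rng = np.random.default_rng(3)
n = 300
z = rng.normal(size=n)

pair = np.abs(z[:, None] - z[None, :])           # p(z_i, z_j) for all pairs
alpha_hat = pair[np.triu_indices(n, k=1)].mean() # the U-statistic (1.37)

# psi_i = 2 [ (N-1)^{-1} sum_{j != i} p(z_i, z_j) - alpha_hat ]
row_mean = pair.sum(axis=1) / (n - 1)            # diagonal terms are zero here
psi = 2.0 * (row_mean - alpha_hat)
se = np.sqrt((psi ** 2).mean() / n)
print(alpha_hat, "+/-", 1.96 * se)               # E|z_i - z_j| = 2/sqrt(pi)
```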

In most cases, the estimator α̂ will not have a closed-form expression like in
(1.35) or (1.37), but instead will be defined implicitly as a minimizer of some sample
criterion function or a solution of estimating equations. Some (generally inefficient)
estimators based on conditional location or symmetry restrictions are M-
estimators, defined as minimizers of an empirical process

α̂ = argmin_{α∈Θ} (1/N) Σᵢ₌₁ᴺ ρ(yᵢ, xᵢ, α) ≡ argmin_{α∈Θ} S_N(α)    (1.40)

and/or solutions of estimating equations

0 = (1/N) Σᵢ₌₁ᴺ m(yᵢ, xᵢ, α̂) ≡ m_N(α̂),    (1.41)

for some functions ρ(·) and m(·), with dim{m(·)} = dim(α). When ρ(y, x, α) (or
m(y, x, α)) is a uniformly continuous function in the parameters over the entire
parameter space Θ (with probability one), a standard uniform law of large numbers
can be used to ensure that normalized versions of these criteria converge to their

where the kernel p_N(·) has the same symmetry property as stated for (1.37) above;
such estimators arise for models with independence or index restrictions on the
error terms. Results by Nolan and Pollard (1987, 1988), Sherman (1993) and Honoré
and Powell (1991) can be used to establish the consistency and asymptotic normality
of this estimator, which will have an influence function of the form (1.42) when

m(y, x, α) = lim_{N→∞} ∂E[p_N(zᵢ, zⱼ, α)|yᵢ = y, xᵢ = x]/∂α.    (1.47)

A more difficult class of estimators to analyze are those termed semiparametric


M-estimators by Horowitz (1988a), for which the estimating equations in (1.41)
also depend upon an estimator of a nonparametric component de(.); that is, ai solves

o=~.~
m(yi,xi,6i,~())=mN(6i,6^())
I 1
(1.48)

for some nonparametric estimator $\hat\delta$ of $\delta_0$. This condition might arise as a first-order
condition for minimization of an empirical loss function that depends on $\hat\delta$,

$\hat\alpha = \underset{\alpha\in\Theta}{\arg\min}\ \frac{1}{N}\sum_{i=1}^N \rho(y_i, x_i, \alpha, \hat\delta(\cdot))$,   (1.49)

as considered by Andrews (1990a, b). As noted above, an efficient estimator for any
semiparametric model is generally of this form, and estimators for models with
independence or index restrictions are often in this class. To derive the influence
function for an estimator satisfying (1.48), a functional mean-value expansion of
$\hat m_N(\hat\alpha, \hat\delta)$ around $\hat\delta = \delta_0$ can be used to determine the effect on $\hat\alpha$ of estimation of
$\delta_0$. Formally, condition (1.48) yields

$0 = \hat m_N(\hat\alpha, \hat\delta(\cdot)) = \hat m_N(\hat\alpha, \delta_0(\cdot)) + L_N(\hat\delta(\cdot) - \delta_0(\cdot)) + o_p(1/\sqrt{N})$   (1.50)

for some linear functional $L_N$; then, with an influence function representation of
this second term,

$L_N(\hat\delta(\cdot) - \delta_0(\cdot)) = \frac{1}{N}\sum_{i=1}^N \xi(y_i, x_i) + o_p(1/\sqrt{N})$   (1.51)

(with $E[\xi(y, x)] = 0$), the form of the influence function for a semiparametric M-estimator is

$\psi(y, x) = -\,[\partial E(m(y, x, \alpha, \delta_0))/\partial\alpha'\,|_{\alpha=\alpha_0}]^{-1}\,[m(y, x, \alpha_0, \delta_0) + \xi(y, x)]$.   (1.52)



To illustrate, suppose $\delta_0$ is finite-dimensional, $\delta_0 \in \Delta$; then the linear functional in
(1.50) would be a matrix product,

$L_N(\hat\delta(\cdot) - \delta_0(\cdot)) = L_N \cdot (\hat\delta - \delta_0) = [\partial E(m(y, x, \alpha, \delta))/\partial\delta'\,|_{\delta=\delta_0,\,\alpha=\alpha_0}]\,(\hat\delta - \delta_0)$,   (1.53)

and the additional component $\xi$ of the influence function in (1.52) would be the
product of the matrix $L_N$ with the influence function of the preliminary estimator
$\hat\delta$. When $\delta_0$ is infinite-dimensional, calculation of the linear functional $L_N$ and the
associated influence function $\xi$ depends on the nature of the nuisance parameter
$\delta_0$ and how it enters the moment function $m(y, x, \alpha, \delta)$. One important case has $\delta_0$
equal to the conditional expectation of some function $s(y, x)$ of the data given
some other function $v(x)$ of the regressors, with $m(\cdot)$ a function only of the fitted
values of this expectation; that is,

$\delta_0 = \delta_0(v(x)) = E[s(y, x)\,|\,v(x)]$   (1.54)

and

$m(y, x, \alpha, \delta(\cdot)) = m(y, x, \alpha, \delta(v(x)))$,   (1.55)

with $\partial m/\partial\delta$ well-defined. For instance, this is the structure of efficient estimators
for conditional location restrictions. For this case, Newey (1991) has shown that
the adjustment term $\xi(y, x)$ to the influence function of a semiparametric M-estimator $\hat\alpha$ is of the form

$\xi(y, x) = E[\partial m(y, x, \alpha_0, \delta)/\partial\delta'\,|_{\delta=\delta_0}\;|\;v(x)]\,[s(y, x) - \delta_0(v(x))]$.   (1.56)

In some cases the leading matrix in this expression is identically zero, so the
asymptotic distribution of the semiparametric M-estimator is the same as if $\delta_0(\cdot)$
were known; Andrews (1990a, b) considered this and other settings for which the
adjustment term $\xi$ is identically zero, giving regularity conditions for validity of
the expansion (1.50) in such cases. General formulae for the influence functions of
more complicated semiparametric M-estimators are derived by Newey (1991)
and are summarized in the chapters by Andrews and by Newey and McFadden in this
volume.

2. Stochastic restrictions

This section discusses how various combinations of structural equations and


stochastic restrictions on the unobservable errors imply restrictions on the joint
distribution of the observable data, and presents general estimation methods for
the parameters of interest which exploit these restrictions on observables. The
classification scheme here is the same as introduced in the monograph by Manski

(1988b) (and also in Manski's chapter in this volume), although the discussion
here puts more emphasis on estimation techniques and properties. Readers who
are familiar with this material or who are interested in a particular structural form
may wish to skip ahead to Section 3 (which reviews the literature for particular
models), referring back to this section when necessary.

2.1. Conditional mean restriction

As discussed in Section 1.3 above, the class of constant conditional location
restrictions for the error distribution asserts constancy of

$\nu_0 = \underset{b}{\arg\min}\ E[r(\varepsilon - b)\,|\,x]$,   (2.1)

for some function $r(\cdot)$ which is nonincreasing for negative arguments and nondecreasing
for positive arguments; this implies a moment condition $E[q(\varepsilon - \nu_0)\,|\,x] = 0$, for $q(u) = \partial r(u)/\partial u$. When the loss function of (2.1) is taken to be quadratic,
$r(u) = u^2$, the corresponding conditional location restriction imposes constancy of
the conditional mean of the error terms,

$E[\varepsilon\,|\,x] = \mu_0$   (2.2)

for some $\mu_0$. By appropriate definition of the dependent variable(s) $y$ and exogenous
variables $x$, this restriction may be applied to models with endogenous regressors
(that is, some components of $x$ may be excluded from the restriction (2.2)).
This restriction is useful for identification of the parameters of interest for
structural functions $g(x, \alpha, \varepsilon)$ that are invertible in the error terms $\varepsilon$; that is,

$y = g(x, \alpha_0, \varepsilon) \;\Leftrightarrow\; \varepsilon = e(y, x, \alpha_0)$

for some function $e(\cdot)$, so that the mean restriction (2.2) can be rewritten

$E[e(y, x, \alpha_0)\,|\,x] = \mu_0 \equiv 0$,   (2.3)

where the latter equality imposes the normalization $\mu_0 \equiv 0$ (i.e., the mean $\mu_0$ is
appended to the vector $\alpha_0$ of parameters of interest).
Conditional mean restrictions are useful for some models that are not completely
specified, that is, for models in which some components of the structural function
$g(\cdot)$ are unknown or unspecified. In many cases it is more natural to specify the
function $e(\cdot)$ characterizing a subset of the error terms than the structural function
$g(\cdot)$ for the dependent variable; for example, the parameters of interest may be
coefficients of a single equation from a simultaneous equations system, and it is

often possible to specify the function $e(\cdot)$ without specifying the remaining equations
of the model. However, conditional mean restrictions generally are insufficient to
identify the parameters of interest in noninvertible limited dependent variable
models, as Manski (1988a) illustrates for the binary response model.
The conditional moment condition (2.3) immediately yields an unconditional
moment equation of the form

$0 = E[d(x)\,e(y, x, \alpha_0)]$,   (2.4)

where $d(x)$ is some conformable matrix with at least as many rows as the dimension
of $\alpha_0$. For a given function $d(\cdot)$, the sample analogue of the right-hand side of (2.4)
can be used to construct a method-of-moments or generalized method-of-moments
estimator, as described in Section 1.4; the columns of the matrix $d(x)$ are
instrumental variables for the corresponding rows of the error vector $\varepsilon$. More
generally, the function $d(\cdot)$ may depend on the parameters of interest, $\alpha_0$, and a
(possibly) infinite-dimensional nuisance parameter $\delta_0(\cdot)$, so a semiparametric
M-estimator for $\alpha_0$ may be defined to solve

$0 = \frac{1}{N}\sum_{i=1}^N d(x_i, \hat\alpha, \hat\delta(\cdot))\,e(y_i, x_i, \hat\alpha)$,   (2.5)

where $\dim(d(\cdot)) = \dim(\alpha) \times \dim(\varepsilon)$ and $\hat\delta = \hat\delta(\cdot)$ is a consistent estimator of the
nuisance function $\delta_0(\cdot)$. For example, these sample moment equations arise as the
first-order conditions for the GMM minimization given in (1.43), where the moment
functions take the form $m(y, x, \alpha) = c(x)\,e(y, x, \alpha)$, for a matrix $c(x)$ of fixed functions
of $x$ with number of rows greater than or equal to the number of components of $\alpha$.
Then, assuming differentiability of $e(\cdot)$, the GMM estimator solves (2.5) with

$d(x, \hat\alpha, \hat\delta) = \Bigl[\frac{1}{N}\sum_{i=1}^N [\partial e(y_i, x_i, \hat\alpha)/\partial\alpha]'\,[c(x_i)]'\Bigr]\,A_N\,c(x)$,   (2.6)

where $A_N$ is the weight matrix given in (1.43).


Since the function $d(\cdot)$ depends on the data only through the conditioning
variable $x$, it is simple to derive the form of the asymptotic distribution for the
estimator $\hat\alpha$ which solves (2.5) using the results stated in Section 1.4:

$\sqrt{N}(\hat\alpha - \alpha_0) \overset{d}{\to} N(0,\ M_0^{-1}V_0(M_0')^{-1})$,   (2.7)

where

$M_0 = \partial E[d(x, \alpha_0, \delta_0)\,e(y, x, \alpha)]/\partial\alpha'\,\big|_{\alpha=\alpha_0}$

and

$V_0 = E[d(x, \alpha_0, \delta_0)\,e(y, x, \alpha_0)\,e(y, x, \alpha_0)'\,d(x, \alpha_0, \delta_0)'] = E[d(x, \alpha_0, \delta_0)\,\Sigma(x)\,d(x, \alpha_0, \delta_0)']$.

In this expression, $\Sigma(x)$ is the conditional covariance matrix of the error terms,

$\Sigma(x) \equiv E[e(y, x, \alpha_0)\,e(y, x, \alpha_0)'\,|\,x] = E[\varepsilon\varepsilon'\,|\,x]$.

Also, the expectation and differentiation in the definition of $M_0$ can often be interchanged,
but the order given above is often well-defined even if $d(\cdot)$ or $e(\cdot)$ is not
smooth in $\alpha$.
A simple extension of the Gauss-Markov argument can be used to show that
an efficient choice of instrumental variable matrix $d^*(x)$ is of the form

$d^*(x) = d^*(x, \alpha_0, \delta_0) = \{\partial E[e(y, x, \alpha)\,|\,x]/\partial\alpha'\,\big|_{\alpha=\alpha_0}\}'\,[\Sigma(x)]^{-1}$;   (2.8)

the resulting efficient estimator $\hat\alpha^*$ will have

$\sqrt{N}(\hat\alpha^* - \alpha_0) \overset{d}{\to} N(0, V^*)$, with $V^* = \{E[d^*(x)\,\Sigma(x)\,[d^*(x)]']\}^{-1}$,   (2.9)

under suitable regularity conditions. Chamberlain (1987) showed that $V^*$ is the
semiparametric efficiency bound for any regular estimator of $\alpha_0$ when only the
conditional moment restriction (2.3) is imposed. Of course, the optimal matrix
$d^*(x)$ of instrumental variables depends upon the conditional distribution of $y$
given $x$, an infinite-dimensional nuisance parameter, so direct substitution of $d^*(x)$
in (2.5) is not feasible. Construction of a feasible efficient estimator for $\alpha_0$ generally
uses nonparametric regression and a preliminary inefficient GMM estimator of
$\alpha_0$ to construct estimates of the components of $d^*(x)$, the conditional mean of
$\partial e(y, x, \alpha_0)/\partial\alpha$ and the conditional covariance matrix of $e(y, x, \alpha_0)$. This is the
approach taken by Carroll (1982), Robinson (1987), Newey (1990b), Linton (1992)
and Delgado (1992), among others. Alternatively, a nearly efficient sequence of
estimators can be generated as a sequence of GMM estimators with moment
functions of the form $m(y, x, \alpha) = c(x)\,e(y, x, \alpha)$, where the number of rows of $c(x)$
(i.e. the number of instrumental variables) increases slowly as the sample size
increases; Newey (1988a) shows that if linear combinations of $c(x)$ can be used to
approximate $d^*(x)$ to an arbitrarily high degree as the size of $c(x)$ increases, then
the asymptotic variance of the corresponding sequence of GMM estimators equals
$V^*$.

For the linear model

$y = x'\beta_0 + \varepsilon$

with scalar dependent variable $y$, the form of the optimal instrumental variable
matrix $d^*(x)$ simplifies to the vector

$d^*(x) = [\sigma^2(x)]^{-1}x$,

where $\sigma^2(x)$ is the conditional variance of the error term $\varepsilon$. As noted in Section 1.2
above, an efficient estimator for $\beta_0$ would be a weighted least squares estimator,
with weights proportional to a nonparametric estimator of $[\sigma^2(x)]^{-1}$, as considered
by Robinson (1987).
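The feasible version of this weighted least squares construction can be sketched as follows; this is an illustration on simulated data (the Gaussian kernel, the bandwidth, and the data design are assumptions of the example), not a prescription from the text.

```python
# Sketch: feasible WLS with a kernel estimate of sigma^2(x) from OLS residuals.
import numpy as np

rng = np.random.default_rng(2)
N = 1000
x = rng.uniform(0.5, 2.0, N)
beta0 = 2.0
eps = rng.normal(size=N) * x                  # heteroskedastic: sigma^2(x) = x^2
y = beta0 * x + eps

X = np.column_stack([np.ones(N), x])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # first stage: unweighted OLS
u2 = (y - X @ b_ols) ** 2                     # squared residuals

h = 0.2                                       # assumed bandwidth
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
sigma2_hat = (K @ u2) / K.sum(axis=1)         # kernel estimate of sigma^2(x_i)

w = 1.0 / sigma2_hat                          # weights proportional to 1/sigma^2(x)
Xw = X * w[:, None]
b_wls = np.linalg.solve(X.T @ Xw, Xw.T @ y)   # weighted least squares
print("OLS:", b_ols, " feasible WLS:", b_wls)
```

The second-stage estimator reweights observations with small conditional variance more heavily, which is the sample counterpart of using $d^*(x) = [\sigma^2(x)]^{-1}x$ as instruments.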

2.2. Conditional quantile restrictions

In its most general form, the conditional $\pi$th quantile of a scalar error term $\varepsilon$ is
defined to be any function $\eta(x; \pi)$ for which the conditional distribution of $\varepsilon$ has
at least probability $\pi$ to the left and probability $1 - \pi$ to the right of $\eta(x; \pi)$:

$\Pr\{\varepsilon \le \eta(x; \pi)\,|\,x\} \ge \pi \quad\text{and}\quad \Pr\{\varepsilon \ge \eta(x; \pi)\,|\,x\} \ge 1 - \pi$.   (2.10)

A conditional quantile restriction is the assumption that, for some $\pi \in (0, 1)$, this
conditional quantile is independent of $x$,

$\eta(x; \pi) = \eta_0(\pi) \equiv \eta_0$, a.s.   (2.11)

Usually the conditional distribution of $\varepsilon$ is further restricted to have no point mass
at its conditional quantile ($\Pr\{\varepsilon = \eta_0\} = 0$), which with (2.10) implies the conditional
moment restriction

$E[\pi - 1\{\varepsilon \le \eta_0\}\,|\,x] = 0 = E[\pi - 1\{\varepsilon \le 0\}\,|\,x]$,   (2.12)

where again the normalization $\eta_0 \equiv 0$ is imposed (absorbing $\eta_0$ as a component
of $\alpha_0$). To ensure uniqueness of the solution $\eta_0 = 0$ to this moment condition, the
conditional error distribution is usually assumed to be absolutely continuous with
positive density in some neighborhood of zero. Although it is possible in
principle to treat the proportion $\pi$ as an unknown parameter, it is generally
assumed that $\pi$ is known in advance; most attention is paid to the special case
$\pi = \frac{1}{2}$ (i.e. a conditional median restriction), which is implied by the stronger
assumptions of either independence of the errors and regressors or conditional
symmetry of the errors about a constant.
2470 J.L. Powell

A conditional quantile restriction can be used to identify parameters of interest
in models in which the dependent variable $y$ and the error term $\varepsilon$ are both scalar,
and the structural function $g(\cdot)$ of (1.4) is nondecreasing in $\varepsilon$ for all possible $\alpha$
and almost all $x$:

$u_1 \le u_2 \;\Rightarrow\; g(x, \alpha, u_1) \le g(x, \alpha, u_2)$ a.s. $[x]$.   (2.13)

(Of course, nonincreasing structural functions can be accommodated with a sign
change on the dependent variable $y$.) This monotonicity and the quantile restriction
(2.11) imply that the conditional $\pi$th quantile of $y$ given $x$ is $g(x, \alpha_0, 0)$; since

$\varepsilon \le 0 \text{ or } \varepsilon \ge 0 \;\Rightarrow\; y = g(x, \alpha_0, \varepsilon) \le g(x, \alpha_0, 0) \text{ or } y \ge g(x, \alpha_0, 0)$,

it follows that

$\Pr\{y \le g(x, \alpha_0, 0)\,|\,x\} \ge \Pr\{\varepsilon \le 0\,|\,x\} \ge \pi$ and

$\Pr\{y \ge g(x, \alpha_0, 0)\,|\,x\} \ge \Pr\{\varepsilon \ge 0\,|\,x\} \ge 1 - \pi$.   (2.14)

Unlike a conditional mean restriction, a conditional quantile restriction is useful
for identification of $\alpha_0$ even when the structural function $g(x, \alpha, \varepsilon)$ is not invertible
in $\varepsilon$. Moreover, the equivariance of quantiles to monotonic transformations means
that, when it is convenient, a transformation $l(y)$ might be analyzed instead of the
original dependent variable $y$, since the conditional quantile of $l(y)$ is $l(g(x, \alpha_0, 0))$
if $l(\cdot)$ is nondecreasing. (Note, though, that application of a noninvertible transformation
may well make the parameters $\alpha_0$ more difficult to identify.)
The main drawback with the use of quantile restrictions to identify $\alpha_0$ is that
the approach is apparently restricted to models with a scalar error term $\varepsilon$, because
of their lack of additivity (i.e. quantiles of convolutions are not generally the sums
of the corresponding quantiles) as well as the ambiguity of a monotonicity restriction
on the structural function in a multivariate setting. Estimators based upon
quantile restrictions have been proposed for the linear regression, parametric
transformation, binary response, ordered response and censored regression models,
as described in Section 3 below.
For values of $x$ for which $g(x, \alpha_0, \varepsilon)$ is strictly increasing and differentiable at
$\varepsilon = 0$, the moment restriction given in (2.12) and monotonicity restriction (2.13)
can be combined to obtain a conditional moment restriction for the observable
data and unknown parameter $\alpha_0$. Let

$b(x, \alpha) \equiv 1\{g(x, \alpha, \varepsilon) \text{ is strictly increasing and differentiable in } \varepsilon \text{ at } \varepsilon = 0\}$;   (2.15)

then (2.12) immediately implies

$E\{b(x, \alpha_0)\,[\pi - 1\{y \le g(x, \alpha_0, 0)\}]\,|\,x\} \equiv E[m(y, x, \alpha_0)\,|\,x] = 0$.   (2.16)



In principle, this conditional moment condition might be used directly to define
a method-of-moments estimator for $\alpha_0$; however, there are two drawbacks to this
approach. First, the moment function $m(\cdot)$ defined above is necessarily a discontinuous
function of the unknown parameters, complicating the asymptotic
theory. More importantly, this moment condition is substantially weaker than the
derived quantile restriction (2.14), since observations for which $g(x, \alpha_0, u)$ is not
strictly increasing at $u = 0$ may still be useful in identifying the unknown parameters.
As an extreme example, the binary response model has $b(x, \alpha_0) = 0$ with probability
one under standard conditions, yet (2.14) can be sufficient to identify the parameters
of interest even in this case (as discussed below).
An alternative approach to estimation of $\alpha_0$ can be based on a characterization
of the $\pi$th conditional quantile as the solution to a particular expected loss
minimization problem. Define

$Q(b, x; \pi) \equiv E[\rho_\pi(y - b) - \rho_\pi(y)\,|\,x]$,   (2.17)

where

$\rho_\pi(u) = u[\pi - 1(u < 0)]$;

since $|\rho_\pi(u - b) - \rho_\pi(u)| \le |b|$, this minimand is well-defined irrespective of the
existence of moments of the data. It is straightforward to show that $Q(b, x; \pi)$ is
minimized at $b^* = g(x, \alpha_0, 0)$ when (2.14) holds (more generally, $Q(b, x; \pi)$ will be
minimized at any conditional $\pi$th quantile of $y$ given $x$, as noted by Ferguson
(1967)). Therefore, the true parameter vector $\alpha_0$ will minimize

$Q(\alpha; w(\cdot), \pi) \equiv E[w(x)\,Q(g(x, \alpha, 0), x; \pi)] = E\{w(x)[\rho_\pi(y - g(x, \alpha, 0)) - \rho_\pi(y)]\}$   (2.18)

over the parameter space, where $w(x)$ is any scalar, nonnegative function of $x$
which has $E[w(x)\,|g(x, \alpha, 0)|] < \infty$. For a particular structural function $g(\cdot)$, then,
the unknown parameters will be identified if conditions on the error distribution,
regressors, and weight function $w(x)$ are imposed which ensure the uniqueness
of the minimizer of $Q(\alpha; w(\cdot), \pi)$ in (2.18). Sufficient conditions are uniqueness of
the $\pi$th conditional quantile $\eta_0 = 0$ of the error distribution and $\Pr\{w(x) > 0,\ g(x, \alpha, 0) \ne g(x, \alpha_0, 0)\} > 0$ whenever $\alpha \ne \alpha_0$.
Given a sample $\{(y_i, x_i),\ i = 1, \ldots, N\}$ of observations on $y$ and $x$, the sample
analogue of the minimand in (2.18) is

$Q_N(\alpha; w(\cdot), \pi) = \frac{1}{N}\sum_{i=1}^N w(x_i)\,\rho_\pi(y_i - g(x_i, \alpha, 0))$,   (2.19)

where an additive constant which does not affect the minimization problem has
been deleted. In general, the weight function $w(x)$ may be allowed to depend upon
nuisance parameters, $w(x) \equiv w(x, \delta_0)$, so a feasible weighted quantile estimator of
$\alpha_0$ might be defined to minimize $Q_N(\alpha; \hat w(\cdot), \pi)$, with $\hat w(x) = w(x, \hat\delta)$ for some
preliminary estimator $\hat\delta$ of $\delta_0$. In the special case of a conditional median restriction
($\pi = \frac{1}{2}$), minimization of $Q_N$ is equivalent to minimization of a weighted sum of
absolute deviations criterion

$S_N(\alpha) = \frac{1}{N}\sum_{i=1}^N \hat w(x_i)\,|y_i - g(x_i, \alpha, 0)|$,   (2.20)

which, with $w(x) \equiv 1$, is the usual starting point for estimation of the particular
models considered in the literature cited below. When the structural function $g(\cdot)$
is of the latent variable form ($g(x, \alpha, \varepsilon) = t(x'\beta + \varepsilon, \tau)$), the estimator $\hat\alpha$ which
minimizes $Q_N(\alpha; \hat w(\cdot), \pi)$ will typically solve an approximate first-order condition,

$\frac{1}{N}\sum_{i=1}^N \hat w(x_i)\,[\pi - 1(y_i \le g(x_i, \hat\alpha, 0))]\,b(x_i, \hat\alpha)\,\frac{\partial g(x_i, \hat\alpha, 0)}{\partial\alpha} \cong 0$,   (2.21)

where $b(x, \alpha)$ is defined in (2.15) and $\partial g(\cdot)/\partial\alpha$ denotes the vector of left derivatives.
(The equality is only approximate due to the nondifferentiability of $\rho_\pi(u)$ at zero
and possible nondifferentiability of $g(\cdot)$ at $\hat\alpha$; the symbol $\cong$ in (2.21) means the
left-hand side converges in probability to zero at an appropriate rate.) These
equations are of the form

$0 \cong \frac{1}{N}\sum_{i=1}^N m(y_i, x_i, \hat\alpha)\,d(x_i, \hat\alpha, \hat\delta)$,

where the moment function $m(\cdot)$ is defined in (2.16) and

$d(x, \hat\alpha, \hat\delta) \equiv w(x, \hat\delta)\,b(x, \hat\alpha)\,\partial g(x, \hat\alpha, 0)/\partial\alpha$.
Thus the quantile minimization problem yields an analogue to the unconditional
moment restriction $E[m(y, x, \alpha_0)\,d(x, \alpha_0, \delta_0)] = 0$, which follows from (2.16).
As outlined in Section 1.4 above, under certain regularity conditions (given by
Powell (1991)) the quantile estimator $\hat\alpha$ will be asymptotically normal,

$\sqrt{N}(\hat\alpha - \alpha_0) \overset{d}{\to} N(0,\ M_0^{-1}V_0(M_0')^{-1})$,   (2.22)

where now

$M_0 = E\Bigl[f(0|x)\,w(x, \delta_0)\,b(x, \alpha_0)\,\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha}\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha'}\Bigr]$
and

$V_0 = E\Bigl[\pi(1 - \pi)\,w^2(x, \delta_0)\,b(x, \alpha_0)\,\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha}\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha'}\Bigr]$

for $f(0|x)$ the conditional density of the residual $y - g(x, \alpha_0, 0)$ at zero (which
appears from the differentiation of the expectation of the indicator function in
(2.21)). The regularity conditions include invertibility of the matrix $M_0$, which is
identically zero for the binary and ordered response models; as shown by Kim and
Pollard (1990), the rate of convergence of the estimator $\hat\alpha$ is slower than $\sqrt{N}$ for
these models.
When (2.22) holds, an efficient choice of weight function $w(x)$ for this problem is

$w^*(x) \equiv f(0|x)$,   (2.23)

for which the corresponding estimator $\hat\alpha^*$ has

$\sqrt{N}(\hat\alpha^* - \alpha_0) \overset{d}{\to} N(0, V^*)$,   (2.24)

with

$V^* = \pi(1 - \pi)\Bigl\{E\Bigl[f^2(0|x)\,b(x, \alpha_0)\,\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha}\frac{\partial g(x, \alpha_0, 0)}{\partial\alpha'}\Bigr]\Bigr\}^{-1}$.

The matrix V* was shown by Newey and Powell (1990) to be the semiparametric
efficiency bound for the linear and censored regression models with a conditional
quantile restriction, and this is likely to be the case for a more general class of
structural models.
For the linear regression model $g(x, \alpha_0, \varepsilon) \equiv x'\beta_0 + \varepsilon$, estimation of the true coefficients
$\beta_0$ using a least absolute deviations criterion dates from Laplace (1793); the
extension to other quantile restrictions was proposed by Koenker and Bassett
(1978). In this case $b(x, \alpha) = 1$ and $\partial g(x, \alpha, \varepsilon)/\partial\alpha = x$, which simplifies the asymptotic
variance formulae. In the special case in which the conditional density of $\varepsilon = y - x'\beta_0$
at zero is constant, $f(0|x) = f_0$, the asymptotic covariance matrix of the quantile
estimator $\hat\beta$ further simplifies to

$V^* = \pi(1 - \pi)\,[f_0]^{-2}\,\{E[xx']\}^{-1}$.

(Of course, imposition of the additional restriction of a constant conditional density
at zero may affect the semiparametric information bound for estimation of $\beta_0$.) The
monograph by Bloomfield and Steiger (1983) gives a detailed discussion of the least
absolute deviations approach.
2.3. Conditional symmetry restrictions

for some $h(\cdot)$ and all possible $x$, $\alpha$ and $\varepsilon$. Then the random function $h(y, x, \alpha) = h(g(x, \alpha_0, \varepsilon), x, \alpha)$ will also be symmetrically distributed about zero when $\alpha = \alpha_0$,
implying the conditional moment restriction

$E[h(y, x, \alpha_0)\,|\,x] = E[h(g(x, \alpha_0, \varepsilon), x, \alpha_0)\,|\,x] = 0$.   (2.27)

As with the previous restrictions, the conditional moment restriction can be used
to generate an unconditional moment equation of the form $E[d(x)\,h(y, x, \alpha_0)] = 0$,
with $d(x)$ a conformable matrix of instruments with a number of rows equal to the
number of components of $\alpha_0$. In general, the function $d(x)$ can be a function of $\alpha$
and nuisance parameters $\delta$ (possibly infinite-dimensional), so a semiparametric
M-estimator $\hat\alpha$ of $\alpha_0$ can be constructed to solve the sample moment equations

$0 = \frac{1}{N}\sum_{i=1}^N d(x_i, \hat\alpha, \hat\delta)\,h(y_i, x_i, \hat\alpha)$,   (2.28)

for $\hat\delta$ an estimator of some nuisance parameters $\delta_0$.


For structural functions g(x, M,E) which are invertible in the error terms, it is
straightforward to find a transformation satisfying condition (2.26). Since E= e( y, x, ~1)
is an odd function of E, h(.) can be chosen as this inverse function e(.). Even for
noninvertible structural functions, it is still sometimes possible to find a trimming
function h( .) which counteracts the asymmetry induced in the conditional distribution
of y by the nonlinear transformation g(.). Examples discussed below include the
censored and truncated regression models and a particular selectivity bias model.
As with the quantile estimators described in a preceding section, the moment
condition (2.27) is sometimes insufficient to identify the parameters $\alpha_0$, since the
trimming transformation $h(\cdot)$ may be identically zero when evaluated at certain
values of $\alpha$ in the parameter space. For example, the symmetrically censored least
squares estimator proposed by Powell (1986b) for the censored regression model
satisfies condition (2.27) with a function $h(\cdot)$ which is nonzero only when the fitted
regression function $x'\beta$ exceeds the censoring point (zero), so that the sample
moment equation (2.28) will be trivially satisfied if $\beta$ is chosen so that $x'\beta$ is
nonpositive for all observations. In this case, the estimator $\hat\beta$ was defined not only
as a solution to a sample moment condition of the form (2.28), but in terms of a
particular minimization problem $\hat\beta = \arg\min_\beta S_N(\beta)$ which yields (2.28) as a first-order
condition. The limiting minimand was shown to have a unique minimizer at
$\beta_0$, even though the limiting first-order conditions have multiple solutions; thus,
this further restriction on the acceptable solutions to the first-order condition was
enough to ensure consistency of the estimator $\hat\beta$ for $\beta_0$. Construction of an analogous
minimization problem might be necessary to fully exploit the symmetry restriction
for other structural models.
Once consistency of a particular estimator $\hat\alpha$ satisfying (2.28) is established, the
asymptotic distribution theory immediately follows from the GMM formulae presented
in Section 2.1 above. For a particular choice of $h(\cdot)$, the form of the sample
moment condition (2.28) is the same as condition (2.5) of Section 2.1 above, replacing
the inverse transformation $e(\cdot)$ with the more general $h(\cdot)$ here; thus, the form
of the asymptotically normal distribution of $\hat\alpha$ satisfying (2.28) is given by (2.7) of
Section 2.1, again replacing $e(\cdot)$ with $h(\cdot)$.
Of course, the choice of the symmetrizing transformation $h(\cdot)$ is not unique: given
any $h(\cdot)$ satisfying (2.26), another transformation $h^*(y, x, \alpha) = l(h(y, x, \alpha), x, \alpha)$ will
also satisfy (2.26) if $l(u, x, \alpha)$ is an odd function of $u$ for all $x$ and $\alpha$. This multiplicity
of possible symmetrizing transformations complicates the derivation of the semiparametric
efficiency bounds for estimation of $\alpha_0$ under the symmetry restriction,
which are typically derived on a case-by-case basis. For example, Newey (1991)
derived the semiparametric efficiency bounds for the censored and truncated regression
models under the conditional symmetry restriction (2.25), and indicated how
efficient estimators for these models might be constructed.
For the linear regression model $g(x, \alpha_0, \varepsilon) \equiv x'\beta_0 + \varepsilon$, the efficient symmetrizing
transformation $h(y, x, \beta)$ is the derivative of the log-density of $\varepsilon$ given $x$, evaluated
at the residual $y - x'\beta$, with optimal instruments equal to the regressors $x$:

$h^*(y, x, \beta) = \partial \ln f_{\varepsilon|x}(y - x'\beta\,|\,x)/\partial\varepsilon, \qquad d^*(x, \beta, \delta) = x$.

Here an efficient estimator might be constructed using a nonparametric estimator
of the conditional density of $\varepsilon$ given $x$, itself based on residuals $\hat\varepsilon = y - x'\hat\beta$ from a
preliminary fit of the model. Alternatively, as proposed by Cragg (1983) and Newey
(1988a), an efficient estimator might be constructed as a sequence of GMM estimators,
based on a growing number of transformation functions $h(\cdot)$ and instrument sets
$d(\cdot)$, which are chosen to ensure that the sequence of GMM influence functions can
approximate the influence function for the optimal estimator arbitrarily well. In
either case, the efficient estimator would be adaptive for the linear model, since
it would be asymptotically equivalent to the maximum likelihood estimator with
known error density.
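A rough illustrative sketch of the first of these routes for a scalar regressor follows: the error score $\partial \ln f_\varepsilon(u)/\partial u$ is estimated by kernel methods from preliminary residuals, and one Newton-type step is taken on the estimating equation with instruments $d^*(x) = x$. All tuning constants and the data design are assumptions, and the trimming and sample-splitting used in formal treatments of adaptive estimation are omitted.

```python
# Sketch: one-step "adaptive" update for a linear model with unknown,
# symmetric error density (kernel score estimate from OLS residuals).
import numpy as np

rng = np.random.default_rng(4)
N = 2000
x = rng.normal(size=N)
beta0 = 1.0
eps = rng.laplace(scale=1.0, size=N)           # non-Gaussian, symmetric errors
y = x * beta0 + eps

b_pre = np.sum(x * y) / np.sum(x * x)          # preliminary (inefficient) OLS slope
e = y - x * b_pre                              # residuals for the score estimate
h = 1.06 * e.std() * N ** (-1 / 5)             # rule-of-thumb bandwidth (assumed)

def score(u):
    """Kernel estimate of d log f(u)/du built from the residuals e."""
    d = (u[:, None] - e[None, :]) / h
    K = np.exp(-0.5 * d ** 2)
    return (-d / h * K).mean(axis=1) / K.mean(axis=1)

# One Newton-type step on the moment 0 = N^{-1} sum_i x_i * score(y_i - x_i b);
# by the information equality, the moment's slope in b is estimated by
# E[x^2] * E[score^2] when the errors are independent of x.
s = score(e)
g = np.mean(x * s)
info = np.mean(x ** 2) * np.mean(s ** 2)
b_adapt = b_pre - g / info
print(f"preliminary OLS: {b_pre:.3f}, one-step adaptive: {b_adapt:.3f}")
```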

2.4. Independence restrictions

Perhaps the most commonly imposed semiparametric restriction is the assumption
of independence of the error terms and the regressors,

$\Pr\{\varepsilon_i \le \lambda\,|\,x_i\} = \Pr\{\varepsilon_i \le \lambda\}$ for all real $\lambda$, w.p.1.   (2.29)

Like conditional symmetry restrictions, this condition implies constancy of the
conditional mean and median (as well as the conditional mode), so estimators which
are consistent under these weaker restrictions are equally applicable here. In fact,
for models which are invertible in the errors ($\varepsilon \equiv e(y, x, \alpha_0)$ for some $e(\cdot)$), a large
class of GMM estimators is available, based upon the general moment condition

$E\{d(x)\,[l(e(y, x, \alpha_0)) - \nu_0]\} = 0$   (2.30)

for any conformable functions $d(\cdot)$ and $l(\cdot)$ for which the moment in (2.30) is
well-defined, with $\nu_0 = E[l(\varepsilon)]$. (MaCurdy (1982) and Newey (1988a) discuss how to
exploit these restrictions to obtain more efficient estimators of linear regression
coefficients.) Independence restrictions are also stronger than the index and exclusion
restrictions to be discussed in the next section, so estimation approaches based upon
those restrictions will be relevant here.
In addition to estimation approaches based on these weaker implied stochastic
restrictions, certain approaches specific to independence restrictions have been
proposed. One strategy to estimate the unknown parameters involves maximization
of a feasible version of the log-likelihood function, in which the unknown distribution
function of the errors is replaced by a (preliminary or concomitant) nonparametric
estimator. For some structural functions (in particular, discrete response
models), the conditional likelihood function for the observable data depends only
on the cumulative distribution function $F_\varepsilon(\cdot)$ of the error terms, and not its derivative
(density). Since cumulative distribution functions are bounded and satisfy certain
monotonicity restrictions, the set of possible c.d.f.s will be compact with respect to
an appropriately chosen topology, so in such cases an estimator of the parameters
of interest $\alpha_0$ can be defined by maximization of the log-likelihood simultaneously
over the finite-dimensional parameter $\alpha$ and the infinite-dimensional nuisance parameter
$F_\varepsilon(\cdot)$. That is, if $f(y\,|\,x, \alpha, F_\varepsilon(\cdot))$ is the conditional density of $y$ given $x$ and the
unknown parameters $\alpha_0$ and $F_\varepsilon$ (with respect to a fixed measure $\mu_y$), a nonparametric
maximum likelihood (NPML) estimator for the parameters can be defined as

$(\hat\alpha, \hat F) = \underset{\alpha\in\Theta,\,F\in\mathcal{F}}{\arg\max}\ \frac{1}{N}\sum_{i=1}^N \ln f(y_i\,|\,x_i, \alpha, F(\cdot))$,   (2.31)

where $\mathcal{F}$ is the space of admissible c.d.f.s. Such estimators were proposed by, e.g.
Cosslett (1983) for the binary response model and Heckman and Singer (1984) for
a duration model with unobserved heterogeneity. Consistency of $\hat\alpha$ can be established
by verification of the Kiefer and Wolfowitz (1956) conditions for consistency of
NPML estimation; however, an asymptotic distribution theory for such estimators
has not yet been developed, so the form of the influence function for $\hat\alpha$ (if it exists)
has not yet been rigorously established.
When the likelihood function of the dependent variable $y$ depends, at least for
some observations, on the density function $f_\varepsilon(e) = dF_\varepsilon(e)/de$ of the error terms, the
joint maximization problem given in (2.31) can be ill-posed: spurious maxima (at
infinity) can be obtained by sending the (unbounded) density estimator $\hat f_\varepsilon$ to infinity
at particular points (depending on $\alpha$ and the data). In such cases, nonparametric
density estimation techniques are sometimes used to obtain a preliminary estimator

and identically distributed random variables are symmetrically distributed about
zero. For a particular structural model $y = g(x, \alpha, \varepsilon)$, the first step in the construction
of a pairwise difference estimator is to find some transformation $e(z_i, z_j, \alpha) \equiv e_{ij}(\alpha)$
of pairs of observations $(z_i, z_j) \equiv ((y_i, x_i), (y_j, x_j))$ and the parameter vector so that,
conditional on the regressors $x_i$ and $x_j$, the transformations $e_{ij}(\alpha_0)$ and $e_{ji}(\alpha_0)$ are
identically distributed, i.e.

$\mathcal{L}(e_{ij}(\alpha_0)\,|\,x_i, x_j) = \mathcal{L}(e_{ji}(\alpha_0)\,|\,x_i, x_j)$ a.s.,   (2.35)

where $\mathcal{L}(\cdot|\cdot)$ denotes the conditional sampling distribution of the random variable.
In order for the parameter $\alpha_0$ to be identified using this transformation, it must also
be true that $\mathcal{L}(e_{ij}(\alpha_1)\,|\,x_i, x_j) \ne \mathcal{L}(e_{ji}(\alpha_1)\,|\,x_i, x_j)$ with positive probability if $\alpha_1 \ne \alpha_0$,
which implies that observations $i$ and $j$ cannot enter symmetrically in the function
$e(z_i, z_j, \alpha)$. Since $\varepsilon_i$ and $\varepsilon_j$ are assumed to be mutually independent given $x_i$ and $x_j$,
$e_{ij}(\alpha)$ and $e_{ji}(\alpha)$ will be conditionally independent given $x_i$ and $x_j$; thus, if (2.35) is
satisfied, then the difference $e_{ij}(\alpha) - e_{ji}(\alpha)$ will be symmetrically distributed about
zero, conditionally on $x_i$ and $x_j$, when evaluated at $\alpha = \alpha_0$. Given an odd function
$\xi(\cdot)$ (which, in general, might depend on $x_i$ and $x_j$), the conditional symmetry of
$e_{ij}(\alpha) - e_{ji}(\alpha)$ implies the conditional moment restriction

$E[\xi(e_{ij}(\alpha_0) - e_{ji}(\alpha_0))\,|\,x_i, x_j] = 0$ a.s.,   (2.36)

provided this expectation exists, and $\alpha_0$ will be identified using this restriction if it
fails to hold when $\alpha \ne \alpha_0$. When $\xi(\cdot)$ is taken to be the identity mapping $\xi(d) = d$,
the restriction that $e_{ij}(\alpha_0)$ and $e_{ji}(\alpha_0)$ have identical conditional distributions can be
weakened to the restriction that they have identical conditional means,

$E[e_{ij}(\alpha_0)\,|\,x_i, x_j] = E[e_{ji}(\alpha_0)\,|\,x_i, x_j]$ a.s.,   (2.37)

which may not require independence of the errors $\varepsilon_i$ and regressors $x_i$, depending
on the form of the transformation $e(\cdot)$.
Given an appropriate (integrable) vector $\ell(x_i, x_j, \alpha)$ of functions of the regressors
and parameter vector, this yields the unconditional moment restrictions

$0 = E[\xi(e_{ij}(\alpha_0) - e_{ji}(\alpha_0))\,\ell(x_i, x_j, \alpha_0)]$,   (2.38)

which can be used as a basis for estimation. If $\ell(\cdot)$ is chosen to have the same
dimension as $\alpha$, a method-of-moments estimator $\hat\alpha$ of $\alpha_0$ can be defined as the
solution to the sample analogue of this population moment condition, namely,

$0 \cong \binom{N}{2}^{-1}\sum_{i<j} \xi(e_{ij}(\hat\alpha) - e_{ji}(\hat\alpha))\,\ell(x_i, x_j, \hat\alpha)$   (2.39)


(which may only approximately hold if $\xi(e_{ij}(\alpha) - e_{ji}(\alpha))$ is discontinuous in $\alpha$). For
many models (e.g. those depending on a latent variable $y^* \equiv g(x_i, \alpha) + \varepsilon_i$), it is
possible to construct some minimization problem which has this sample moment
condition as a first-order condition, i.e. for some function $s(z_i, z_j, \alpha)$ with

$\frac{\partial s(z_i, z_j, \alpha)}{\partial\alpha} = \xi(e_{ij}(\alpha) - e_{ji}(\alpha))\,\ell(x_i, x_j, \alpha)$,

the estimator $\hat\alpha$ might alternatively be defined as

$\hat\alpha = \underset{\alpha\in\Theta}{\arg\min}\ \binom{N}{2}^{-1}\sum_{i<j} s(z_i, z_j, \alpha)$.   (2.40)

A simple example of a model which is amenable to the pairwise differencing
approach is the linear model, $y_i = x_i'\beta_0 + \varepsilon_i$, where $\varepsilon_i$ and $x_i$ are assumed to be
independent. For this case, one transformation function which satisfies the requirements
above is

$e(y_i, x_i, x_j, \alpha) \equiv y_i - x_i'\beta$,

which does not depend on $x_j$. Choosing $\ell(x_i, x_j, \alpha) = x_i - x_j$, a pairwise difference
estimator of $\beta_0$ can be defined to solve

$\binom{N}{2}^{-1}\sum_{i<j} \xi((y_i - y_j) - (x_i - x_j)'\beta)\,(x_i - x_j) \cong 0$,

or, if $\Xi(\cdot)$ is the antiderivative of $\xi(\cdot)$, to minimize

$S_N(\beta) = \binom{N}{2}^{-1}\sum_{i<j} \Xi((y_i - y_j) - (x_i - x_j)'\beta)$.

When $\xi(d) = d$, the estimator $\hat\beta$ is algebraically equal to the slope coefficient estimators
of a classical least squares regression of $y_i$ on $x_i$ and a constant (unless some
normalization on the location of the distribution of $\varepsilon_i$ is imposed, a constant term
is not identified by the independence restriction). When $\xi(d) = \mathrm{sgn}(d)$, $\hat\beta$ is a rank
regression estimator which sets the sample covariance of the regressors $x_i$ with the
ranks of the residuals $y_i - x_i'\hat\beta$ equal (approximately) to zero (Jurečková (1971),
Jaeckel (1972)). The same general approach has been used to construct estimators
for discrete response models and censored and truncated regression models.
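The first claim can be checked numerically; the following sketch (simulated data, illustrative only) forms the pairwise differences and verifies that least squares on the differences reproduces the OLS slope coefficients.

```python
# Numerical check: with xi(d) = d, the pairwise-difference estimator equals
# the OLS slope coefficients from a regression of y on x and a constant.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
N = 200
x = rng.normal(size=(N, 2))
beta0 = np.array([1.0, 2.0])
y = 0.5 + x @ beta0 + rng.logistic(size=N)   # the intercept is not identified

idx = np.array(list(combinations(range(N), 2)))   # the N(N-1)/2 distinct pairs
dy = y[idx[:, 0]] - y[idx[:, 1]]
dx = x[idx[:, 0]] - x[idx[:, 1]]
b_pair = np.linalg.lstsq(dx, dy, rcond=None)[0]   # least squares on differences

X = np.column_stack([np.ones(N), x])              # OLS with a constant, for comparison
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print("pairwise:", b_pair, " OLS slopes:", b_ols[1:])
```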
In all of these cases, the pairwise difference estimator $\hat\alpha$ is defined as a minimizer
of a second-order U-statistic of the form

$S_N(\alpha) = \binom{N}{2}^{-1}\sum_{i<j} p(z_i, z_j, \alpha)$

(with $z_i \equiv (y_i, x_i)$), and will solve an approximate first-order condition

$0 \cong \binom{N}{2}^{-1}\sum_{i<j} q(z_i, z_j, \hat\alpha) = o_p(N^{-1/2})$,

where $q(\cdot) = \partial p(\cdot)/\partial\alpha$ when this derivative is well-defined. As described in Section 1.4
above, the asymptotic normal distribution of the estimator $\hat\alpha$ can be derived from
the asymptotically linear representation

$\hat\alpha = \alpha_0 - \frac{2}{N}\sum_{i=1}^N H_0^{-1}\,r(z_i, \alpha_0) + o_p(N^{-1/2})$,   (2.41)

where $r(z_i, \alpha) \equiv E[q(z_i, z_j, \alpha)\,|\,z_i]$ and $H_0 \equiv \partial E[q(z_i, z_j, \alpha)]/\partial\alpha'\,\big|_{\alpha=\alpha_0}$.

The pairwise comparison approach is also useful for construction of estimators
for certain nonlinear panel data models. In this setting functions of pairs of observations
are constructed, not across individuals, but over time for each individual. In the
simplest case, where only two observations across time are available for each
individual, a moment condition analogous to (2.36) is

$E[\xi(e_{12,i}(\alpha_0) - e_{21,i}(\alpha_0))\,|\,x_{i1}, x_{i2}] = 0$ a.s.,   (2.42)

where now $e_{12,i}(\alpha) \equiv e(z_{i1}, z_{i2}, \alpha)$ for the same types of transformation functions $e(\cdot)$
described above, and where the second subscripts on the random variables denote
the respective time periods. To obtain the restriction (2.42), it is not necessary for
the error terms $\varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2})$ to be independent of the regressors $x_i = (x_{i1}, x_{i2})$ across
individuals $i$; it suffices that the components $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are mutually independent
and identically distributed across time, given the regressors $x_i$. The pairwise differencing
approach, when it is applicable to panel data, has the added advantage that it
automatically adjusts for the presence of individual-specific fixed effects, since
$\varepsilon_{i1} + \gamma_i$ and $\varepsilon_{i2} + \gamma_i$ will be identically distributed if $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are. A familiar example
is the estimation of the coefficients $\beta_0$ in the linear fixed-effects model

$y_{it} = x_{it}'\beta_0 + \gamma_i + \varepsilon_{it}, \qquad t = 1, 2,$

where setting the transformation $e_{12,i}(\alpha) = y_{i1} - x_{i1}'\beta$ and $\xi(u) = u$ in (2.42) results
in the moment condition

$E[(y_{i1} - y_{i2}) - (x_{i1} - x_{i2})'\beta_0\,|\,x_{i1}, x_{i2}] = 0$,

which is the basis for the traditional least squares fixed effects estimator. As described
in Section 3.5 below, this idea has been exploited to construct estimators for panel
data versions of the binary response and censored and truncated regression models
which are semiparametric with respect to both the error distribution and the
distribution of the fixed effects.
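A minimal simulated illustration of this point follows (the data design is an assumption): within-individual differencing removes the fixed effects $\gamma_i$, while a pooled regression that ignores them is biased when the regressors are correlated with $\gamma_i$.

```python
# Sketch: two-period panel; first differences remove the fixed effects.
import numpy as np

rng = np.random.default_rng(6)
n = 500
gamma = rng.normal(size=n)                       # individual fixed effects
x1, x2 = rng.normal(size=(2, n)) + gamma         # regressors correlated with gamma
beta0 = 1.0
y1 = x1 * beta0 + gamma + rng.normal(size=n)
y2 = x2 * beta0 + gamma + rng.normal(size=n)

dy, dx = y1 - y2, x1 - x2                        # differencing eliminates gamma_i
b_fd = np.sum(dx * dy) / np.sum(dx * dx)         # least squares on differences
b_pooled = np.sum(x1 * y1 + x2 * y2) / np.sum(x1**2 + x2**2)  # ignores gamma: biased
print(f"first-difference: {b_fd:.3f}, pooled (biased): {b_pooled:.3f}")
```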

2.5. Exclusion and index restrictions

Construction of estimators based on index restrictions can proceed by a variety of
different approaches, depending upon whether the index function $v(x)$ is completely
known or depends upon (finite- or infinite-dimensional) unknown parameters, and
whether the index sufficiency condition is of the weak (affecting only the conditional
mean or median) or strong (applying to the entire error distribution) form.
Estimators of the parameters of interest under mean index restrictions exploit
modified forms of the moment conditions implied by the stronger constant conditional
mean restrictions, just as estimators under distributional index restrictions
use modifications of estimation strategies for independence restrictions.
Perhaps the simplest version of the restrictions to analyze are mean exclusion
restrictions, for which the index function is a subset of the regressors (i.e. $v(x) \equiv x_1$,
where $x = (x_1', x_2')'$), so that the restriction is

$E[\varepsilon\,|\,x] = E[\varepsilon\,|\,x_1]$ a.s.   (2.43)

As for conditional mean restrictions, this condition can be used to identify the
parameters of interest, $\alpha_0$, for structural functions $y = g(x, \alpha_0, \varepsilon)$ which are invertible
in the error terms ($\varepsilon = e(y, x, \alpha_0)$), so that the exclusion restriction (2.43) can be
rewritten as

$E[e(y, x, \alpha_0)\,|\,x] - E[e(y, x, \alpha_0)\,|\,x_1] = 0$.   (2.44)

By iterated expectations, this implies an unconditional moment restriction which
is analogous to condition (2.4) of Section 2.1, namely,

$0 = E[\tilde d(x)\,e(y, x, \alpha_0)]$,   (2.45)

where now

$\tilde d(x) \equiv d(x) - E[d(x)\,|\,x_1]\,\{E[A(x)\,|\,x_1]\}^{-1}A(x)$   (2.46)

for any conformable matrix $d(x)$ and square matrix $A(x)$ of functions of the regressors
for which the relevant expectations and inverses exist. (Note that, by construction,
$E[\tilde d(x)\,|\,x_1] = 0$ almost surely.) Alternatively, estimation might be based on the

condition

$0 = E[\tilde d(x)\,\tilde e(y, x, \alpha_0)]$,   (2.47)

where, analogously to (2.46),

$\tilde e(y, x, \alpha) \equiv e(y, x, \alpha) - A(x)\,\{E[A(x)\,|\,x_1]\}^{-1}E[e(y, x, \alpha)\,|\,x_1]$.
Given a particular nonparametric method for estimation of conditional means
given $x_1$ (denoted $\hat E[\cdot\,|\,x_1]$), a semiparametric M-estimator $\hat\alpha$ of the structural
coefficients $\alpha_0$ can be defined as the solution to a sample analogue of (2.45),

$0 = \frac{1}{N}\sum_{i=1}^N \{d(x_i, \hat\alpha, \hat\delta) - \hat E[d(x_i, \hat\alpha, \hat\delta)\,|\,x_{i1}]\,(\hat E[A(x_i)\,|\,x_{i1}])^{-1}A(x_i)\}\,e(y_i, x_i, \hat\alpha)$,   (2.48)

where the instrumental variable matrix $d(x)$ is permitted to depend upon $\alpha$ and a
preliminary nuisance parameter estimator $\hat\delta$, as in Section 2.1. Formally, the asymptotic
distribution of this estimator is given by the same expression (2.7) for estimation
with conditional mean restrictions, replacing $d$ with $\tilde d$ throughout. However, rigorous
verification of the consistency and asymptotic normality of $\hat\alpha$ is technically difficult,
and the estimating equation (2.48) must often be modified to trim (i.e. delete)
observations where the nonparametric regression estimator $\hat E[\cdot]$ is imprecise. A
bound on the attainable efficiency of estimators of $\alpha_0$ under condition (2.44) was
derived by Chamberlain (1992), who showed that an optimal instrumental variable
matrix $\tilde d^*(x)$ of the form (2.46) is related to the corresponding optimal instrument
matrix $d^*(x)$ for the constant conditional moment restrictions of Section 2.1 by the
formula

$\tilde d^*(x) = d^*(x) - E[d^*(x)\,|\,x_1]\,[E\{[\Sigma(x)]^{-1}\,|\,x_1\}]^{-1}\,[\Sigma(x)]^{-1}$,   (2.49)

where $d^*(x)$ is defined in (2.8) above and $\Sigma(x)$ is the conditional covariance matrix
of the errors $\varepsilon$ given the regressors $x$. This formula directly generalizes to the case
in which the subvector $x_1$ is replaced by a more general (but known) index function
$v(x)$.
For a linear model $y = x_2'\beta_0 + \varepsilon$, the mean exclusion restriction (2.43) yields the
semilinear model considered by Robinson (1988):

$y = x_2'\beta_0 + \theta(x_1) + \eta$,

where $\theta(x_1) \equiv E[\varepsilon\,|\,x_1]$ and $E[\eta\,|\,x] = E[\varepsilon - \theta(x_1)\,|\,x] = 0$. Defining $e(y, x, \alpha) \equiv y - x_2'\beta$, $d(x) \equiv x_2$, and $A \equiv I$, the moment condition (2.47) becomes

$0 = E[(x_2 - E[x_2\,|\,x_1])\,\{(y - x_2'\beta_0) - E[y - x_2'\beta_0\,|\,x_1]\}]$,

which can be solved for $\beta_0$:

$\beta_0 = \{E[(x_2 - E[x_2\,|\,x_1])(x_2 - E[x_2\,|\,x_1])']\}^{-1}\,E[(x_2 - E[x_2\,|\,x_1])(y - E[y\,|\,x_1])]$.

Robinson (1988) proposed an estimator of $\beta_0$ constructed from a sample analogue
to (2.47), using kernel regression to nonparametrically estimate the conditional
expectations and trimming observations where a nonparametric estimator of the
density of $x_1$ (assumed continuously distributed) is close to zero, and gave conditions
under which the resulting estimator was root-N-consistent and asymptotically normal.
Linton (1992) constructs higher-order approximations to the distribution of
this estimator.
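A compact illustrative sketch of this two-step construction follows: the conditional means are estimated by kernel regression, and the residualized variables are regressed on each other. The bandwidth and data design are assumptions, and the density-based trimming of the formal theory is omitted for brevity.

```python
# Sketch: two-step estimator for the semilinear model y = x2*beta0 + theta(x1) + eta.
import numpy as np

rng = np.random.default_rng(7)
N = 1000
x1 = rng.uniform(-1, 1, N)                    # argument of the nuisance function
x2 = rng.normal(size=N) + x1                  # parametric regressor, related to x1
theta = np.cos(np.pi * x1)                    # unknown function theta(x1)
beta0 = 2.0
y = x2 * beta0 + theta + rng.normal(scale=0.3, size=N)

def kreg(target, s, h=0.1):
    """Nadaraya-Watson estimate of E[target | x1 = s_i] at the sample points."""
    K = np.exp(-0.5 * ((s[:, None] - s[None, :]) / h) ** 2)
    return (K @ target) / K.sum(axis=1)

ey = y - kreg(y, x1)                          # y - E[y|x1]
ex = x2 - kreg(x2, x1)                        # x2 - E[x2|x1]
beta_hat = np.sum(ex * ey) / np.sum(ex * ex)  # sample analogue of the solved moment
print(f"beta_hat = {beta_hat:.3f} (truth {beta0})")
```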
Strengthening the mean exclusion restriction to a distributional exclusion condition
widens the class of moment restrictions which can be exploited when the
structural function is invertible in the errors. Imposing

$\Pr\{\varepsilon \le u\,|\,x\} = \Pr\{\varepsilon \le u\,|\,x_1\}$   (2.50)

for all possible values of $u$ yields the general moment conditions

$0 = E[\tilde d(x)\,l(e(y, x, \alpha_0))]$   (2.51)

for any square-integrable function $l(\varepsilon)$ of the errors, which includes (2.45) as a special
case. As with independence restrictions, precision of estimators of $\alpha_0$ can be improved
by judicious choice of the transformation $l(\cdot)$.
Even for noninvertible structural functions, the pairwise comparison approach
considered for independence restrictions can be modified to be applicable for distributional
exclusion (or known index) restrictions. For any pair of observations $z_i$ and $z_j$ which
have the same value of the index function $v(x_i) = v(x_j)$, the corresponding error terms
$\varepsilon_i$ and $\varepsilon_j$ will be independently and identically distributed, given the regressors $x_i$
and $x_j$, under the distributional index restriction

$\Pr\{\varepsilon \le u\,|\,x\} = \Pr\{\varepsilon \le u\,|\,v(x)\}$.   (2.52)

Given the pairwise transformation function $e(z_i, z_j, \alpha) \equiv e_{ij}(\alpha)$ described in the previous
section, an analogue to restriction (2.35) holds under this additional restriction of
equality of index functions:

$\mathcal{L}(e_{ij}(\alpha_0)\,|\,x_i, x_j) = \mathcal{L}(e_{ji}(\alpha_0)\,|\,x_i, x_j)$ a.s. if $v(x_i) = v(x_j)$.   (2.53)

As for independence restrictions, (2.53) implies the weaker conditional mean restriction

$E[e_{ij}(\alpha_0)\,|\,x_i, x_j] = E[e_{ji}(\alpha_0)\,|\,x_i, x_j]$ a.s. if $v(x_i) = v(x_j)$,   (2.54)


which is relevant for invertible structural functions (with $e_{ij}(\alpha)$ equated with the
inverse function $e(y_i, x_i, \alpha)$ in this case).
These restrictions suggest estimation of $\alpha_0$ by modifying the estimating equation
(2.39) or the minimization problem (2.40) of the preceding subsection to exclude
pairs of observations for which $v(x_i) \ne v(x_j)$. However, in general $v(x_i) - v(x_j)$ may
be continuously distributed around zero, so direct imposition of this restriction
would exclude all pairs of observations. Still, if the sampling distributions $\mathcal{L}(e_{ij}(\alpha_0)\,|\,x_i, x_j, v(x_i) - v(x_j) = c)$ or conditional expectations $E[e_{ij}(\alpha_0)\,|\,x_i, x_j, v(x_i) - v(x_j) = c]$
are smooth functions of $c$ at $c = 0$, the restrictions (2.53) or (2.54) will approximately
hold if $v(x_i) - v(x_j)$ is close to zero. Then appropriate modifications of the estimating
equations (2.39) and minimization problem (2.40) are

$0 \cong \binom{N}{2}^{-1}\sum_{i<j} \xi(e_{ij}(\hat\alpha) - e_{ji}(\hat\alpha))\,\ell(x_i, x_j, \hat\alpha)\,w_N(v(x_i) - v(x_j))$   (2.55)

and

$\hat\alpha = \underset{\alpha\in\Theta}{\arg\min}\ \binom{N}{2}^{-1}\sum_{i<j} s(z_i, z_j, \alpha)\,w_N(v(x_i) - v(x_j))$,   (2.56)

for some weighting function $w_N(\cdot)$ which tends to zero as the magnitude of its
argument increases and, at a faster rate, as the sample size $N$ increases (so that,
ultimately, only observations with $v(x_i) - v(x_j)$ very close to zero are included in the
summations).
Returning to the semilinear regression model $y = x_2'\beta_0 + \theta(x_1) + \eta$, $E[\eta\,|\,x] = 0$,
the same transformation as used in the previous subsection can be used to construct
a pairwise difference, provided the nonparametric components $\theta(x_{i1})$ and $\theta(x_{j1})$ are
equal for the two observations; that is, if $e(y_i, x_i, x_j, \alpha) = e_{ij}(\alpha) = y_i - x_{i2}'\beta$ and $v(x_i) = x_{i1}$, then

$E[e_{ij}(\beta_0) - e_{ji}(\beta_0)\,|\,x_i, x_j] = \theta(x_{i1}) - \theta(x_{j1}) = 0$

if $v(x_i) = v(x_j)$. Provided $\theta(x_{i1})$ is a smooth (continuous and differentiable) function,
relation (2.36) will hold approximately if $x_{i1} \cong x_{j1}$. Defining the weight function
$w_N(\cdot)$ to be a traditional kernel weight,

$w_N(\Delta) = k(h_N^{-1}\Delta), \quad k(0) > 0, \quad k(\Delta) \to 0 \text{ as } \|\Delta\| \to \infty, \quad h_N \to 0 \text{ as } N \to \infty$,   (2.57)

and, taking $\ell(x_i, x_j, \alpha) = x_{i2} - x_{j2}$ and $\xi(d) = d$, a pairwise difference estimator of $\beta_0$
using either (2.55) or (2.56) reduces to a weighted least squares regression of the
distinct differences $(y_i - y_j)$ in dependent variables on the differences $(x_{i2} - x_{j2})$ in
regressors, using $k(h_N^{-1}(x_{i1} - x_{j1}))$ as weights (as proposed by Powell (1987)).
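This kernel-weighted pairwise-difference least squares estimator can be sketched as follows on simulated data; the kernel, bandwidth, and data design are assumptions of the illustration.

```python
# Sketch: weighted pairwise-difference estimator (2.55)-(2.57) with xi(d) = d
# for the semilinear model, downweighting pairs whose x1 values differ.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(8)
N = 400
x1 = rng.uniform(-1, 1, N)
x2 = rng.normal(size=N) + x1
beta0 = 2.0
y = x2 * beta0 + np.cos(np.pi * x1) + rng.normal(scale=0.3, size=N)

idx = np.array(list(combinations(range(N), 2)))
i, j = idx[:, 0], idx[:, 1]
dy, dx, d1 = y[i] - y[j], x2[i] - x2[j], x1[i] - x1[j]

hN = 0.1
w = np.exp(-0.5 * (d1 / hN) ** 2)             # kernel weight: large when x1's match
beta_hat = np.sum(w * dx * dy) / np.sum(w * dx * dx)
print(f"beta_hat = {beta_hat:.3f} (truth {beta0})")
```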
Consistency of the resulting estimator $\hat\beta$ requires only the weak exclusion restriction
(2.43); when the strong exclusion restriction (2.53) is imposed, other choices of
odd function $\xi(d)$ besides the identity function are permissible in (2.55). Thus, an
estimator of $\beta_0$ using $\xi(d) = \mathrm{sgn}(d)$ might solve

$\binom{N}{2}^{-1}\sum_{i<j} \mathrm{sgn}((y_i - y_j) - (x_{i2} - x_{j2})'\beta)\,(x_{i2} - x_{j2})\,k((x_{i1} - x_{j1})/h_N) \cong 0$.   (2.58)

This is the first-order condition of a smoothed version of the minimization
problem defining the rank regression estimator,

$\hat\beta = \underset{\beta}{\arg\min}\ \binom{N}{2}^{-1}\sum_{i<j} |(y_i - y_j) - (x_{i2} - x_{j2})'\beta|\,k((x_{i1} - x_{j1})/h_N)$,   (2.59)
which is a robust alternative to estimators proposed by Robinson (1988b) and


Powell (1987) for the semilinear model. Although the asymptotic theory for such
estimators has yet to be developed, it is likely that reasonable conditions can be
found to ensure their root-N-consistency and asymptotic normality.
So far, the discussion has been limited to models with known index functions $v(x)$.
When the index function depends upon unknown parameters $\delta_0$ which are functionally
unrelated to the parameters of interest $\alpha_0$, and when preliminary consistent estimators
$\hat\delta$ of $\delta_0$ are available, the estimators described above are easily adapted to use an
estimated index function $\hat v(x) = v(x, \hat\delta)$. The asymptotic distribution theory for the
resulting estimator must properly account for the variability of the preliminary
estimator $\hat\delta$. When $\delta_0$ is related to $\alpha_0$, and that relation is exploited in the construction
of an estimator of $\alpha_0$, the foregoing estimation theory requires more substantial
modification, both conceptually and technically.
A leading special case occurs when the index governing the conditional error
distribution appears in the same form in the structural function for the dependent
variable $y$. For example, suppose the structural function has a linear latent variable
form,

$y = g(x, \alpha_0, \varepsilon) = t(x'\beta_0 + \varepsilon)$,   (2.60)

and the index $v(x)$ is the latent linear regression function $x'\beta_0$,

$\Pr\{\varepsilon \le u\,|\,x\} = \Pr\{\varepsilon \le u\,|\,x'\beta_0\}$.   (2.61)

This particular index restriction on the unobservable error terms immediately
implies the same index restriction for the observable dependent variable,

$\Pr\{y \le u\,|\,x\} = \Pr\{y \le u\,|\,x'\beta_0\}$,   (2.62)

which can be used to generate moment restrictions for estimation of $\beta_0$. For
example, (2.62) implies the weaker restriction

$E[y\,|\,x] = G(x'\beta_0)$   (2.63)

on the conditional mean of the dependent variable, where $G(\cdot)$ is some unknown
nuisance function. (Clearly $\beta_0$ is at most identified up to a location and scale
normalization without stronger restrictions on the form of $G(\cdot)$.) Defining $\tilde e(y, x, \beta) \equiv y - E[y\,|\,x'\beta]$, condition (2.63) implies that

$E[d(x)\,\tilde e(y, x, \beta_0)] = 0$   (2.64)

for any conformable, square-integrable $d(x)$. Thus, with a nonparametric estimator
$\hat E[y\,|\,x'b]$ of the conditional expectation function $E[y\,|\,x'b]$, a semiparametric M-estimator
of $\beta_0$ can be constructed as a sample analogue to (2.64). Alternatively, a
weighted pairwise difference approach might be used: assuming $G(\cdot)$ is continuous,
the difference in the conditional means of the dependent variables for observations
$i$ and $j$ satisfies

$E[y_i - y_j\,|\,x_i, x_j] = G(x_i'\beta_0) - G(x_j'\beta_0) \cong 0 \quad\text{if}\quad x_i'\beta_0 \cong x_j'\beta_0$.   (2.65)

So by estimating $E[y_i - y_j\,|\,x_i, x_j]$ nonparametrically and determining when it is
near zero, the corresponding pair of observations will have $(x_i - x_j)'\beta_0 \cong 0$,
which is useful in determining $\beta_0$. When $G(\cdot)$ is known to be monotonic (which
follows, for example, if the transformation $t(\cdot)$ of the latent variable in (2.60) is
monotonic and $\varepsilon$ is assumed to be independent of $x$), a variation on the pairwise
comparison approach could exploit the resulting inequality $E[y_i - y_j\,|\,x_i, x_j] = G(x_i'\beta_0) - G(x_j'\beta_0) > 0$ only if $x_i'\beta_0 > x_j'\beta_0$.
Various estimators based upon these conditions have been proposed for the
monotone regression model, as discussed in Section 3.2 below. More complicated
examples involve multiple indices, with some indices depending upon parameters
of interest and others depending upon unrelated nuisance parameters, as for some
of the proposed estimators for selectivity bias models. The methods of estimation
of the structural parameters $\alpha_0$ vary across the particular models but generally involve
nonparametric estimation of regression or density functions involving the index $v(x)$.

3. Structural models

3.1. Discrete response models

The parameters of the binary response model

$y = 1\{x'\beta_0 + \varepsilon > 0\}$   (3.1)

are traditionally estimated by maximization of the average log-likelihood function

$\mathcal{L}_N(\beta; F) = \frac{1}{N}\sum_{i=1}^N \{y_i \ln F(x_i'\beta) + (1 - y_i)\ln[1 - F(x_i'\beta)]\}$,   (3.2)

where the error term $\varepsilon$ is assumed to be distributed independently of $x$ with known
distribution function $F(\cdot)$ (typically standard normal or logistic). Estimators for
semiparametric versions of the binary response model usually involve maximization
of a modified form of this log-likelihood, one which does not presuppose knowledge
of the distribution of the errors. For the more general multinomial response model,
in which $J$ indicator variables $\{y_j,\ j = 1, \ldots, J\}$ are generated as

$y_j = 1\{x'\beta_j^0 + \varepsilon_j > x'\beta_k^0 + \varepsilon_k \text{ for all } k \ne j\}$,   (3.3)

the average log-likelihood has the analogous form

$\mathcal{L}_N(\beta_1, \ldots, \beta_J; F) = \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^J y_{ij}\ln[F_j(x_i'\beta_1, \ldots, x_i'\beta_J)]$,   (3.4)

where $F_j(\cdot)$ is the conditional probability that $y_j = 1$ given the regressors $x$. This
form easily specializes to the ordered response or grouped dependent variable models,
replacing $F_j(\cdot)$ with $F(x'\beta_0 - c_{j-1}) - F(x'\beta_0 - c_j)$, where the $\{c_j\}$ are the (known or
unknown) group boundaries.
The earliest example of a semiparametric approach for estimation of a limited
dependent variable model in econometrics is the maximum score estimation method
proposed by Manski (1975). For the binary response model, Manski suggested that
$\beta_0$ be estimated by maximizing the number of correct predictions of $y$ by the sign
of the latent regression function $x'\beta$; that is, $\hat\beta$ was defined to maximize the predictive
score function

$S_N(\beta) = \frac{1}{N}\sum_{i=1}^N \{y_i\,1(x_i'\beta > 0) + (1 - y_i)\,1(x_i'\beta \le 0)\}$   (3.5)

over a suitable parameter space $\Theta$ (e.g. the unit sphere). The error terms $\varepsilon$ were
restricted to have conditional median zero to ensure consistency of the estimator.
A later interpretation of the estimator (Manski (1985)) characterized the maximum
score estimator $\hat\beta$ as a least absolute deviations estimator, since the estimator solved
the minimization problem

$\hat\beta = \underset{\beta\in\Theta}{\arg\min}\ \frac{1}{N}\sum_{i=1}^N |y_i - 1\{x_i'\beta > 0\}|$.   (3.6)



This led to the extension of the maximum score idea to more general quantile
estimation of $\beta_0$, under the assumption that the corresponding conditional quantile
of the error terms was constant (Manski (1985)). The maximum score approach was
also applied to the multinomial response model by Manski (1975); in this case, the
score criterion becomes

$S_N(\beta_1, \ldots, \beta_J) = \frac{1}{N}\sum_{i=1}^N\sum_{j=1}^J y_{ij}\,1\{x_i'\beta_j \ge x_i'\beta_k \text{ for all } k \ne j\}$,   (3.7)

and its consistency was established under the stronger condition of mutual independence
of the alternative-specific errors $\{\varepsilon_j\}$. M. Lee (1992) used conditional median
restrictions to define a least absolute deviations estimator of the parameters of the
ordered response model along the same lines.
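For the binary case, the score criterion (3.5) can be maximized directly when the dimension is low; the sketch below is an illustration only (two regressors, the normalization $\|\beta\| = 1$, a grid search over the unit circle, and heteroskedastic median-zero errors are all assumptions of the example).

```python
# Sketch: maximum score for binary response by grid search over the unit circle.
import numpy as np

rng = np.random.default_rng(9)
N = 1000
x = rng.normal(size=(N, 2))
beta0 = np.array([0.6, 0.8])                   # ||beta0|| = 1
eps = rng.logistic(size=N) * (1 + 0.5 * np.abs(x[:, 0]))  # heteroskedastic, median 0
y = (x @ beta0 + eps > 0).astype(float)

angles = np.linspace(0, 2 * np.pi, 2000, endpoint=False)
best_score, beta_hat = -np.inf, None
for a in angles:
    b = np.array([np.cos(a), np.sin(a)])
    score = np.mean(y * (x @ b > 0) + (1 - y) * (x @ b <= 0))  # criterion (3.5)
    if score > best_score:
        best_score, beta_hat = score, b
print("beta_hat:", beta_hat, "score:", best_score)
```

The discreteness of the criterion, visible in the flat regions between sign changes of $x_i'\beta$, is what underlies the slow rate of convergence discussed next.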
Although consistency of the maximum score estimator for binary response was
rigorously established by Manski (1985) and Amemiya (1985), its asymptotic distribution
cannot be established by the methods described in Section 2.2 above, because
of lack of continuity of the median regression function $1\{x'\beta_0 > 0\}$ of the dependent
variable $y$. More importantly, because this median regression function is flat except
at its discontinuity points, the estimator is not root-N-consistent under standard
regularity conditions on the errors and regressors. Kim and Pollard (1990) found
that the rate of convergence of the maximum score estimator to $\beta_0$ under such
conditions is $N^{1/3}$, with a nonstandard asymptotic distribution (involving the
distribution of the maximum value of a particular Gaussian process with quadratic
drift). This result was confirmed for finite samples by the simulation study of Manski
and Thompson (1986).
Chamberlain (1986) showed that this slow rate of convergence of the maximum
score estimator was not particular to the estimation method, but a general consequence
of estimation of the binary response model with a conditional median
restriction. Chamberlain showed that the semiparametric version of the information
matrix for this model is identically zero, so that no regular root-N-consistent
estimator of $\beta_0$ exists in this case. An extension by Zheng (1992) derived the same
result, a zero semiparametric information matrix, even if the conditional median
restriction is strengthened to an assumption of conditional symmetry of the error
distribution. Still, consistency of the maximum score estimator $\hat\beta$ illustrates the fact
that the parameters $\beta_0$ of the binary response model are identified under conditional
quantile or symmetry assumptions on the error terms, which is not the case if the
errors are restricted only to have constant conditional mean.
If additional smoothness restrictions on the distribution of the errors and regressors
are imposed, the maximum score (quantile) approach can be modified to
obtain estimators which converge to the true parameters at a faster rate than $N^{1/3}$.
Nawata (1992) proposed an estimator which, in essence, estimates $\beta_0$ by maximizing
the fit of an estimator of the conditional median function $1\{x'\beta_0 > 0\}$ of the binary
variable to a nonparametric estimator of the conditional median of $y$ given $x$. In a
first stage, the observations are grouped by a partition of the space of regressors,
and the median value of the dependent variable $y$ is calculated for each of these
regressor bins. These group medians, along with the average value of the regression
vector in each group, are treated as raw data in a second-stage fit of the binary
response model using the likelihood function (3.2) with a standard normal cumulative
and a correction for heteroskedasticity induced by the grouping scheme. Nawata
(1992) gives conditions under which the rate of convergence of the resulting estimator
is $N^{2/5}$, and indicates how the estimator and regularity conditions can be modified
to achieve a rate of convergence arbitrarily close to $N^{1/2}$. Horowitz (1992) used a
different approach, but similar strengthening of the regularity conditions, to obtain
a median estimator for binary response with a faster convergence rate. Horowitz
modifies the score function of (3.5) by replacing the conditional median function
$1\{x'\beta > 0\}$ by a smoothed version, so that an estimator of $\beta_0$ is defined as a
maximizer of the criterion

$S_N^*(\beta) = \frac{1}{N}\sum_{i=1}^N \{y_i\,K(x_i'\beta/h_N) + (1 - y_i)[1 - K(x_i'\beta/h_N)]\}$,   (3.8)

where $K(\cdot)$ is a smooth function in $[0, 1]$ with $K(u) \to 0$ or $1$ as $u \to -\infty$ or $\infty$, and
$h_N$ is a sequence of bandwidths which tends to zero as the sample size increases (so
that $K(x'\beta/h_N)$ approaches the binary median function $1\{x'\beta > 0\}$ as $N \to \infty$). With particular
conditions on the function $K(\cdot)$ and the smoothness of the regressor distribution
and of the conditional density of the errors at the median, Horowitz
(1992) shows how the rate of convergence of the maximizer of $S_N^*(\beta)$ over $\Theta$ can be
made at least $N^{2/5}$ and arbitrarily close to $N^{1/2}$; moreover, asymptotic normality
of the resulting estimator is shown (and consistent estimators of asymptotic bias
and covariance terms are provided), so that normal sampling theory can be used to
construct confidence regions and hypothesis tests in large samples.
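A sketch of the smoothed criterion (3.8) on simulated data follows; the choice of $K$ as the logistic c.d.f., the bandwidth sequence, and the angle parametrization of the normalization $\|\beta\| = 1$ are assumptions of the illustration, not prescriptions from the text.

```python
# Sketch: smoothed maximum score; the indicator in (3.5) is replaced by a
# smooth c.d.f.-like kernel K, making the criterion amenable to standard optimizers.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit               # smooth K: logistic c.d.f.

rng = np.random.default_rng(10)
N = 1000
x = rng.normal(size=(N, 2))
beta0 = np.array([0.6, 0.8])
y = (x @ beta0 + rng.logistic(size=N) > 0).astype(float)

hN = 0.3 * N ** (-1 / 5)                      # assumed bandwidth sequence

def neg_SN(angle):
    b = np.array([np.cos(angle[0]), np.sin(angle[0])])   # enforce ||b|| = 1
    K = expit(x @ b / hN)
    return -np.mean(y * K + (1 - y) * (1 - K))           # minus criterion (3.8)

ang = minimize(neg_SN, x0=np.array([0.0]), method="Nelder-Mead").x
beta_hat = np.array([np.cos(ang[0]), np.sin(ang[0])])
print("beta_hat:", beta_hat)
```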
When the error terms in the binary response model are assumed to satisfy the
stronger assumption of independence of the errors and regressors, Cosslett (1987)
showed that the semiparametric information matrix for estimation of $\beta_0$ in (3.1)
(once a suitable normalization is imposed) is generally nonsingular, a necessary
condition for existence of a regular root-N-consistent estimator. Its form is analogous
to the parametric information matrix when the distribution function $F(\cdot)$ of
the errors is known, except that the regressors $x$ are replaced by deviations from
their conditional means given the latent regression function $x'\beta_0$; that is, the best
attainable asymptotic covariance matrix for a regular estimator of $\beta_0$ when $\varepsilon$ is
independent of $x$ with unknown distribution function $F(\cdot)$ is

$V^* = \Bigl\{E\Bigl[\frac{[f(x'\beta_0)]^2}{F(x'\beta_0)[1 - F(x'\beta_0)]}\,[\tilde x - E(\tilde x\,|\,x'\beta_0)][\tilde x - E(\tilde x\,|\,x'\beta_0)]'\Bigr]\Bigr\}^{-1}$,   (3.9)

where $f(u) = dF(u)/du$ and $\tilde x$ is the subvector of regressors $x$ which eliminates the
last component (whose coefficient is assumed normalized to unity to pin down the
scale of $\beta_0$). Existence of the inverse in (3.9) implies that a constant term is excluded
from the regression vector, and the corresponding intercept term is absorbed into
the definition of the error cumulative $F(\cdot)$.
For the binary response model under an index restriction, Cosslett (1983) proposed
a nonparametric maximum likelihood estimator (NPMLE) of $\beta_0$ through maximization
of the average log-likelihood function $\mathcal{L}_N(\beta; F)$ simultaneously over $\beta \in \Theta$
and $F \in \mathcal{F}$, where $\mathcal{F}$ is the space of possible cumulative distributions (monotonic
functions on $[0, 1]$). Computationally, given a particular trial value $b$ of $\beta$, an
estimator of $F$ is obtained by monotonic regression of the indicator $y$ on $x'b$, using
the pool adjacent violators algorithm of isotonic regression; this estimator $\hat F$ of $F$ is
then substituted into the likelihood function, and the concentrated criterion $\mathcal{L}_N(b; \hat F)$
is maximized over $b \in \Theta = \{\beta: \|\beta\| = 1\}$. Cosslett (1983) establishes consistency of
the resulting estimators of $\beta_0$ and $F(\cdot)$ through verification of the Kiefer-Wolfowitz
(1956) conditions for the consistency of NPMLE, constructing a topology which
ensures compactness of the parameter space $\mathcal{F}$ of possible nuisance functions $F(\cdot)$.
As noted in Section 2.4 above, an asymptotic distribution for NPMLE has not yet
been established.
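The concentration step just described can be sketched as follows; in this illustration a grid search over the unit circle stands in for a formal optimization, and the fitted c.d.f. is clipped away from 0 and 1 to keep the log-likelihood finite (both are assumptions of the example). The isotonic regression routine used implements the pool adjacent violators algorithm.

```python
# Sketch: NPMLE-style profile likelihood for binary response; for each trial b,
# F is fit by monotonic (isotonic) regression of y on the index x'b.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(11)
N = 500
x = rng.normal(size=(N, 2))
beta0 = np.array([0.6, 0.8])                  # normalized: ||beta0|| = 1
y = (x @ beta0 + rng.logistic(size=N) > 0).astype(float)

def profile_loglik(b):
    idx = x @ b
    F_hat = IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(idx, y)
    F_hat = np.clip(F_hat, 1e-4, 1 - 1e-4)    # keep the log-likelihood finite
    return np.mean(y * np.log(F_hat) + (1 - y) * np.log(1 - F_hat))

angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
betas = np.column_stack([np.cos(angles), np.sin(angles)])
beta_hat = betas[np.argmax([profile_loglik(b) for b in betas])]
print("beta_hat:", beta_hat)
```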
Instead of the monotonic regression estimator F̂(·) of F(·) implicit in the construc-
tion of the NPMLE, the same estimation approach can be based upon other
nonparametric estimators of the error cumulative. The resulting profile likelihood
estimator of β₀, maximizing ℒ_N(b; F̂) of (3.2) using a kernel regression estimator F̂,
was considered by Severini and Wong (1987a) (for a single parameter) and Klein
and Spady (1993). Because kernel regression does not impose monotonicity of the
function estimator, this profile likelihood estimator is valid under a weaker index
restriction on the error distribution Pr{ε ≤ u|x} = Pr{ε ≤ u|x'β₀}, which implies
that E[y|x] = F(x'β₀) for some (not necessarily monotone) function F(·). Theoreti-
cally, the form of the profile likelihood ℒ_N(b; F̂) is modified by Klein and Spady
(1993) to trim observations with imprecise estimators of F(·) in order to show
root-N-consistency and asymptotic normality of the resulting estimator β̂. Klein
and Spady show that this estimator is asymptotically efficient under the assumption
of independence of the errors and regressors, since its asymptotic covariance matrix
equals the best attainable value V* of (3.9) under this restriction.
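A minimal sketch of a profile likelihood of the Klein and Spady (1993) type is given below, assuming a Gaussian kernel, a leave-one-out kernel regression of y on the index, and a crude clipping of the fitted probabilities in place of the trimming described above; all names and tuning choices are illustrative assumptions:

    import numpy as np

    def profile_loglik(b, y, x, h, eps=1e-6):
        # Leave-one-out kernel regression of y on x'b estimates
        # F(x'b) = E[y | x'b]; the estimate is plugged into the binary
        # log-likelihood, which is then maximized over b.
        v = x @ b
        k = np.exp(-0.5 * ((v[:, None] - v[None, :]) / h) ** 2)
        np.fill_diagonal(k, 0.0)                  # leave own observation out
        F = (k @ y) / np.maximum(k.sum(axis=1), eps)
        F = np.clip(F, eps, 1 - eps)              # crude stand-in for trimming
        return np.mean(y * np.log(F) + (1 - y) * np.log(1 - F))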
Other estimators of the parameters of the binary response model have been
proposed which do not exploit the particular structure of the binary response model,
but instead are based upon general properties of transformation models. If indepen-
dence of the errors and regressors is assumed, the monotonicity of the structural
function (3.1) in ε can be used to define a pairwise comparison estimator of β₀.
Imposition of a weaker index restriction Pr{ε ≤ u|x} = Pr{ε ≤ u|x'β₀} implies that

E[y|x] = G(x'β₀)   (3.10)

for some unknown function G(.), so any estimator which is based on this restriction

is applicable to the binary response model. A number of estimators proposed for


this more general setup are discussed in the following section on transformation
models.
Estimation of the multinomial response model (3.3) under independence and
index restrictions can be based on natural extensions of the methods for the binary
response model. In addition to the maximum score estimator defined by minimizing
(3.7), Thompson (1989a, b) considered identification and estimation of the parameters
in (3.3) assuming independence of the errors and regressors; Thompson showed how
consistent estimators of (β₁, . . . , β_J) could be constructed using a least squares criterion
even if only a single element y_j of the vector of choice indicators (y₁, . . . , y_J) is
observed. L. Lee (1991) extended profile likelihood estimation to the multinomial
response model, and obtained a similar efficiency result to Klein and Spady's (1993)
result for binary response under index restrictions on the error terms. And, as for
the binary response model, various pairwise comparison or index restriction esti-
mators for multiple index models are applicable to the multinomial response model;
these estimators are reviewed in the next section.

3.2. Transformation models

In Section 1.3 above, two general classes of transformation models were distinguished.
Parametric transformation models, in which the relation between the latent and
observed dependent variables is invertible and of known parametric form, are
traditionally estimated assuming the errors are independent of the regressors with
density function f(.;r) of known parametric form. In this setting, the average
conditional log-likelihood function for the dependent variable

y = t(x& + E; &JO& = t - l (Y; &) - xPo= 4x x, PO,2,) (3.11)

is

ThdP,A ? f) = k.z
1
I
(InCf(e(Yi, xi, B,4; r)l - ln CladYi, Xi, B,2yay I]),

(3.12)

which is maximized over 8 = (B, ;1,r) to obtain estimators of the parameters /IO and
2, of interest.
Given both the monotonicity of the transformation t(.) in the latent variable and
the explicit representation function e(.) for the errors in terms of the observable
variables and unknown parameters, these models are amenable to estimation under
most of the semiparametric restrictions on the error distribution discussed in
Section 2. For example, Amemiya and Powell (1981) considered nonlinear two-
stage least squares (method-of-moments) estimation of β₀ and λ₀ for the Box-Cox
transformation under a conditional mean restriction on the errors ε given the
regressors x, and showed how this estimator could greatly outperform (in a mean-
squared-error sense) a misspecified Gaussian ML estimator over some ranges of the
transformation parameter λ₀. Carroll and Ruppert (1984) and Powell (1991) discuss
least absolute deviations and quantile estimators of the Box-Cox regression model,
imposing independence or constant quantile restrictions on the errors. Han (1987b)
also assumes independence of the errors and regressors, and constructs a pairwise
difference estimator of the transformation parameter λ₀ and the slope coefficients
β₀, which involves maximization of a fourth-order U-statistic; this approach is a
natural generalization of the maximum rank correlation estimation method de-
scribed below. Newey (1989c) constructs efficient method-of-moments estimators for
the Box-Cox regression model under conditional mean, symmetry, and independence
restrictions on the error terms. Though not yet considered in the econometric
literature, it would be straightforward to extend the general estimation strategies
described in Section 2.5 above to estimate the parameters of interest in a semilinear
variant of the Box-Cox regression model.
When the form of the transformation function t(·) in (3.11) is not parametrically
specified (i.e. the transformation itself is an infinite-dimensional nuisance parameter),
estimation of β₀ becomes more problematic, since some of the semiparametric
restrictions on the errors no longer suffice to identify β₀ (which is, at most, uniquely
determined up to a scale normalization). For instance, since a special case is the
binary response model, it is clear from the discussion of the previous section that a
conditional mean restriction on ε is insufficient to identify the parameters of interest.
Conversely, any dependent variable generated from an unknown (nonconstant and
monotonic) transformation can be further transformed to a binary response model,
so that identification of the parameters of a binary response model generally implies
identification of the parameters of an analogous transformation model.
Under the assumption of independence of the errors and regressors, Han (1987a)
proposed a pairwise comparison estimator, termed the maximum rank correlation
estimator, for the model (3.11) with t(·) unknown but nondecreasing. Han actually
considered a generalization of (3.11), the generalized regression model, with structural
function

y = t[s(x'\beta_0, \varepsilon)],   (3.13)

with t[·] a monotone (but possibly noninvertible) function and s(·) smooth and
invertible in both of its arguments; with continuity and unbounded support of the
error distribution, this construction ensures that the support of y will not depend
upon the unknown parameters β₀. Though the discussion below focusses on the
special case s(x'β, ε) = x'β + ε, the same arguments apply to this, more general,
setup.
For model (3.11), with t(·) unknown and ε and x assumed independent, Han
proposed estimation of β₀ by maximization of
R_N(b) = [N(N-1)]^{-1} \sum_{i \neq j} \left[ 1(y_i > y_j) \, 1(x_i'b > x_j'b) + 1(y_i < y_j) \, 1(x_i'b < x_j'b) \right]   (3.14)

over a suitably-restricted parameter space Θ (e.g. normalizing one of the components
of β₀ to unity). Maximization of (3.14) is equivalent to minimization of a least
absolute deviations criterion for the sign of y_i − y_j minus its "median", the sign of
x_i'β − x_j'β, for those observations with nonzero values of y_i − y_j:

\hat{\beta} \equiv \operatorname*{argmax}_{b \in \Theta} R_N(b) = \operatorname*{argmin}_{b \in \Theta} \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} 1(y_i \neq y_j) \, | 1(y_i > y_j) - 1(x_i'b > x_j'b) |.   (3.15)

In terms of the pairwise difference estimators of Section 2.4, defining

e_{ij}(\beta) \equiv 1(y_i \neq y_j) \, \mathrm{sgn}[\, 1(y_i > y_j) - 1(x_i'\beta > x_j'\beta) \,],

identification of β₀ using the maximum rank correlation criterion is related to the
conditional symmetry of

e_{ij}(\beta_0) - e_{ji}(\beta_0) = 2 \cdot 1(y_i \neq y_j) \, \mathrm{sgn}[\, 1((x_i - x_j)'\beta_0 > \varepsilon_j - \varepsilon_i) - 1((x_i - x_j)'\beta_0 > 0) \,]
about zero given x_i and x_j. The maximum rank correlation estimator defined in
(3.15) does not solve a sample moment condition like (2.39) of Section 2.4 (though
such estimators could easily be constructed), because the derivative of R_N(β) is zero
wherever it is well-defined; still, the estimator β̂ is motivated by the same general
pairwise comparison approach described in Section 2.4.
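For concreteness, a direct O(N²) evaluation of the rank correlation criterion R_N(b) in (3.14) can be written as follows (the function name is illustrative; ties in y_i contribute zero, as in the definition):

    import numpy as np

    def rank_correlation(b, y, x):
        # R_N(b) of (3.14): proportion of ordered pairs (i, j) for which the
        # rankings of y and of the index x'b agree.
        v = x @ b
        dy = y[:, None] - y[None, :]
        dv = v[:, None] - v[None, :]
        agree = ((dy > 0) & (dv > 0)) | ((dy < 0) & (dv < 0))
        n = len(y)
        return agree.sum() / (n * (n - 1))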
Han (1987a) gave regularity conditions under which β̂ is consistent for β₀; these
included continuity of the error distribution and compact support for the regressors.
Under similar conditions Sherman (1993) demonstrated the root-N-consistency and
asymptotic normality of the maximum rank correlation estimator; writing the estimator as the
minimizer of a second-order U-process,

\hat{\beta} = \operatorname*{argmax}_{b \in \Theta} \binom{N}{2}^{-1} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} p(z_i, z_j, b),   (3.16)

Sherman showed that the asymptotic distribution of β̂ is the same as that for an
M-estimator based on N/2 observations which maximizes the sample average of
the conditional expectation τ(z_i, β) = E[p(z_i, z_j, β)|z_i] over the parameter space Θ.
Greene (1981, 1983) derives similar results for classical least squares estimates
in the special case of a censored dependent variable. Brillinger (1983) shows consis-
tency of classical least squares estimates for the general transformation model when
the regressors are jointly normally distributed, which implies that the conditional
mean of the regressors x given the index x'β₀ has the linear form

E[x \mid x'\beta_0] = \mu_0 + \nu_0 (x'\beta_0)   (3.20)

for some μ₀ and ν₀. Ruud (1983) noted that condition (3.20) (with a full-rank
condition on the distribution of the regressors) was sufficient for consistency (up to
scale) of a misspecified maximum likelihood estimator of β₀ in a binary response
model with independence of the errors and regressors; this result was extended by
Ruud (1986) to include all misspecified maximum likelihood estimators for latent
variable models when (3.11), (3.20) and independence of the errors and regressors
are assumed. Li and Duan (1989) have recently noted this result, emphasizing the
importance of convexity of the assumed likelihood function (which ensures unique-
ness of the minimizer κβ₀ of the limiting objective function). As Ruud points out,
all of these results use the fact that the least squares or misspecified ML estimators
α̂ and γ̂ of the intercept term and slope coefficients satisfy a sample moment condi-
tion of the form

0 = \frac{1}{N} \sum_{i=1}^{N} r(y_i, \hat{\alpha} + x_i'\hat{\gamma}) \begin{bmatrix} 1 \\ x_i \end{bmatrix}   (3.21)

for some "quasi-residual" function r(·). Letting r̄(x'β₀, α + x'γ) = E[r(y, α + x'γ)|x]
and imposing condition (3.20), the value γ* = κβ₀ will solve the corresponding
population moment condition if κ and the intercept α are chosen to satisfy the two
conditions

0 = E[\bar{r}(x'\beta_0, \alpha + \kappa(x'\beta_0))] = E[\bar{r}(x'\beta_0, \alpha + \kappa(x'\beta_0)) \, (x'\beta_0)],

since the population analogue of condition (3.21) then becomes

0 = E\left\{ \bar{r}(x'\beta_0, \alpha + \kappa(x'\beta_0)) \begin{bmatrix} 1 \\ x \end{bmatrix} \right\} = E\left\{ \bar{r}(x'\beta_0, \alpha + \kappa(x'\beta_0)) \begin{bmatrix} 1 \\ \mu_0 + \nu_0 (x'\beta_0) \end{bmatrix} \right\}
under the restriction (3.20). (An analogous argument works for condition (3.19),
replacing x'β₀ with x'γ* where appropriate; in this case, the index restriction 𝓛(y|x) =
𝓛(y|x'β₀) is not necessary, though this condition may not be as easily verified as
(3.20).) Conditions (3.19) and (3.20) are strong restrictions which seem unlikely to
hold for observational data, but the consistency results may be useful in experimental
design settings (where the distribution of the regressors can be chosen to satisfy
(3.20)), and the results suggest that the inconsistency of traditional maximum
likelihood estimators may be small when the index restriction holds and (3.19) or
(3.20) is approximately satisfied.
If the regressors are assumed to be jointly continuously distributed with known
density function f_x(x), modifications of least squares estimators can yield consistent
estimators of β₀ (up to scale) even if neither (3.19) nor (3.20) holds. Ruud (1986)
proposed estimation of β₀ by weighted least squares,

\hat{\gamma} = \left[ \sum_{i=1}^{N} \frac{\phi(x_i)}{f_x(x_i)} (x_i - \bar{x})(x_i - \bar{x})' \right]^{-1} \sum_{i=1}^{N} \frac{\phi(x_i)}{f_x(x_i)} (x_i - \bar{x})(y_i - \bar{y}),   (3.22)

where φ(x) is any density function for a random vector satisfying condition (3.20)
(for example, a multivariate normal density function) and

\bar{x} = \left[ \sum_{i=1}^{N} \frac{\phi(x_i)}{f_x(x_i)} \right]^{-1} \sum_{i=1}^{N} \frac{\phi(x_i)}{f_x(x_i)} \, x_i,   (3.23)

with an analogous definition for ȳ. This reweighting ensures that the probability
limit for the weighted least squares estimator in (3.22) is the same as the probability
limit for an unweighted least squares estimator with regressors having marginal
density 4(x); since this density is assumed to satisfy (3.20), the resulting estimator
will be consistent for β₀ (up to scale) by the results cited above.
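A minimal sketch of this reweighting, assuming the regressor density f_x is known and taking φ to be a spherical standard normal density (one density which satisfies (3.20)), is given below; the function name and inputs are illustrative:

    import numpy as np
    from scipy.stats import norm

    def density_weighted_ls(y, x, fx):
        # fx: values f_x(x_i) of the (known) regressor density.
        phi = np.prod(norm.pdf(x), axis=1)   # phi(x_i): spherical normal density
        w = phi / fx                         # reweighting ratio in (3.22)
        xbar = (w[:, None] * x).sum(axis=0) / w.sum()   # weighted mean, as in (3.23)
        ybar = (w * y).sum() / w.sum()
        xc = x - xbar
        A = (w[:, None] * xc).T @ xc
        c = (w[:, None] * xc).T @ (y - ybar)
        return np.linalg.solve(A, c)         # consistent for beta_0 up to scale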
A different approach to the use of a known regressor density was taken by Stoker
(1986), who used the mean index restriction E[y|x] = E[y|x'β₀] = G(x'β₀) implied
by the transformation model with a strong index restriction on the errors. If the
nuisance function G(·) is assumed to be smooth, an average of the derivative of
E[y|x] with respect to the regressors x will be proportional to β₀:

E[\partial E[y|x]/\partial x] = E[\partial G(x'\beta_0)/\partial (x'\beta_0)] \, \beta_0 \equiv \kappa^* \beta_0.   (3.24)

Furthermore, if the regressor density f_x(x) declines smoothly to zero on the boundary
of its support (which is most plausible when the support is unbounded), an integration-
by-parts argument yields

\kappa^* \beta_0 = -E[\, y \, \partial \ln f_x(x)/\partial x \,],   (3.25)

which implies that β₀ can be consistently estimated (up to scale) by the sample
average of y_i times the derivative of the log-density of the regressors, ∂ ln[f_x(x_i)]/∂x.
Also, using the facts that

E[\partial \ln f_x(x)/\partial x] = 0, \qquad E[(\partial \ln f_x(x)/\partial x) \, x'] = -I,   (3.26)

Stoker proposed an alternative estimator of κ*β₀ as the slope coefficients of an
instrumental variables fit of y_i on x_i, using the log-density derivatives ∂ ln[f_x(x_i)]/∂x
and a constant as instruments. This estimator, as well as Ruud's density-weighted
least squares estimator, is easily generalized to include models which have regressor
density f_x(x; τ₀) of known parametric form, by substitution of a preliminary esti-
mator τ̂ for the unknown distribution parameters and accounting for the variability
of this preliminary estimator in the asymptotic covariance matrix formulae, using
formula (1.53) in Section 1.4 above.
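For a known (or parametrically estimated) regressor density, the sample analogue of (3.25) is immediate; a minimal sketch follows, in which the matrix of log-density derivatives is treated as a supplied input (the function name is illustrative):

    import numpy as np

    def average_derivative(y, dlogf):
        # dlogf: N x k matrix with rows d ln f_x(x_i)/dx, computed from the
        # known (or estimated parametric) regressor density.
        # Sample analogue of (3.25): consistent for kappa* beta_0.
        return -np.mean(y[:, None] * dlogf, axis=0)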
When the regressors are continuously distributed with density function f_x(x) of
unknown form, nonparametric (kernel) estimators of this density function (and its
derivatives) can be substituted into the formulae for the foregoing estimators.
Although the nonparametrically-estimated components necessarily converge at a
rate slower than N^{1/2}, the corresponding density-weighted LS and average derivative
estimators will be root-N-consistent under appropriate conditions, because they
involve averages of these nonparametric components across the data. Newey and
Ruud (1991) give conditions which ensure that the density-weighted LS estimator
(defined in (3.22) and (3.23)) is root-N-consistent and asymptotically normal when
f_x(x) is replaced by a kernel estimator f̂_x(x). These conditions include the requirement
that the reweighting density φ(x) is nonzero only inside a compact set which has
f_x(x) bounded above zero, to guarantee that the reciprocal of the corresponding
nonparametric estimator f̂_x(x) is well-behaved. Härdle and Stoker (1989) and Stoker
(1991) considered substitution of the derivative of a kernel estimator of the log-
density, ∂ ln[f̂_x(x)]/∂x, into a sample analogue of condition (3.26) (which deletes
observations for which the estimated density f̂_x(x_i) is small), and gave conditions for root-N-
consistency and asymptotic normality of the resulting estimator.
A density-weighted variant on the average derivative estimator was proposed
by Powell et al. (1989), using the fact that

\delta_0 \equiv E[\, f_x(x) \, \partial E[y|x]/\partial x \,] = -2 E[\, y \, \partial f_x(x)/\partial x \,],   (3.27)

where the last equality follows from a similar integration-by-parts argument as
used to derive (3.25). The resulting estimator δ̂ of δ₀ = κ⁺β₀,

\hat{\delta} = -\frac{2}{N} \sum_{i=1}^{N} y_i \, \partial \hat{f}_{x,-i}(x_i)/\partial x,   (3.28)

with f̂_{x,-i}(·) a "leave-one-out" kernel estimator of the regressor density, was shown
to have lth component of the form

\hat{\delta}_l = \binom{N}{2}^{-1} \sum_{i<j} \omega_N(x_i - x_j) \, \frac{y_i - y_j}{x_{il} - x_{jl}},   (3.29)

with weights ω_N(x_i − x_j) which tend to zero as ||x_i − x_j|| increases, and, for fixed

separately from the index x'β₀ in that formula, is replaced by the deviation of the
regressors from their conditional mean given the index, x − E[x|x'β₀]. Newey and
Stoker (1993) derived the semiparametric efficiency bound for estimation of β₀ (up
to a scale normalization on one coefficient) under condition (3.32), which has a
similar form to the semiparametric efficiency bound for estimation under exclusion
restrictions given by Chamberlain (1992), as described in Section 2.5 above.

3.3. Censored and truncated regression models

A general notation for censored regression models which covers fixed and random
censoring takes the dependent variable y and an observable indicator variable d to
be generated as

y = \min\{x'\beta_0 + \varepsilon, \, u\}, \qquad d = 1\{y < u\}.   (3.34)

This notation covers the censored Tobit model with the dependent variable censored
below at zero (with u = 0 and a sign change on the dependent and explanatory
variables) and the accelerated failure time model (y equals log failure time) with
either fixed (u always observable) or random censoring times. Given a parametric
density f(ε; τ₀) for the error terms (assumed independent of x), estimation of the
resulting parametric model can be based upon maximization of the likelihood
function

\mathcal{L}_N(\beta, \tau; F) = \frac{1}{N} \sum_{i=1}^{N} \left\{ d_i \ln[f(y_i - x_i'\beta; \tau)] + (1 - d_i) \ln[1 - F(u_i - x_i'\beta; \tau)] \right\}   (3.35)

over possible values for β₀ and τ₀, where F(·) is the c.d.f. of ε (i.e. the antiderivative
of the density f(·)). This likelihood is actually the conditional likelihood of y_i, d_i
given the regressors {x_i} and the censoring points {u_i} for all observations (assuming
u_i is independent of y_i and x_i), but since it only involves the censoring point u_i for
those observations which are censored, maximization of the likelihood in (3.35) is
equally feasible for fixed or random censoring. For truncated data (i.e. sampling
conditional on d = 1), the likelihood function becomes

\mathcal{L}_N(\beta, \tau; F) = \frac{1}{N} \sum_{i=1}^{N} \ln\left[ f(y_i - x_i'\beta; \tau) / F(u_i - x_i'\beta; \tau) \right];   (3.36)

here the truncation points must be known for all observations.
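In the Gaussian special case, (3.35) can be coded directly; the following minimal sketch treats the parameterization θ = (β, ln σ), which is an illustrative choice, as are the function and argument names:

    import numpy as np
    from scipy.stats import norm

    def censored_loglik(theta, y, d, x, u):
        # Gaussian special case of (3.35); d_i = 1 if uncensored (y_i < u_i).
        beta, sigma = theta[:-1], np.exp(theta[-1])
        uncens = d * (norm.logpdf((y - x @ beta) / sigma) - np.log(sigma))
        cens = (1 - d) * norm.logsf((u - x @ beta) / sigma)
        return np.mean(uncens + cens)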


When the error density is Gaussian (or in a more general exponential family), the
first-order conditions for the maximum likelihood estimator of β₀ with censored
data can be interpreted, in terms of the EM algorithm (Dempster et al. (1977)), as
a solution to

0 = \sum_{i=1}^{N} x_i \, [\hat{y}_i(\hat{\beta}, \hat{\tau}) - x_i'\hat{\beta}],   (3.37)

where

\hat{y}_i(\beta_0, \tau_0) = d_i y_i + (1 - d_i) E[x_i'\beta_0 + \varepsilon_i \mid d_i = 0, x_i, u_i]
\qquad = d_i y_i + (1 - d_i) \left\{ x_i'\beta_0 + [1 - F(u_i - x_i'\beta_0; \tau_0)]^{-1} \int_{u_i - x_i'\beta_0}^{\infty} \varepsilon \, f(\varepsilon; \tau_0) \, d\varepsilon \right\},   (3.38)

with a similar expression for the nuisance parameter estimator τ̂. Related formulae
for the conditional mean of y given x and u,

E[y \mid x, u] = [1 - F(u - x'\beta_0)] \, u + \int_{-\infty}^{u - x'\beta_0} [x'\beta_0 + \varepsilon] \, f(\varepsilon; \tau_0) \, d\varepsilon,   (3.39)

or for the conditional mean of y given x and u and with d = 1,

E[y \mid x, u, d = 1] = [F(u - x'\beta_0)]^{-1} \int_{-\infty}^{u - x'\beta_0} [x'\beta_0 + \varepsilon] \, f(\varepsilon; \tau_0) \, d\varepsilon,   (3.40)
can be used to define nonlinear least squares estimators for censored data (using
(3.39)) or for truncated data (using (3.40)) in a fully parametric model.
As discussed in Section 2.1 above, the parameters of interest β₀ for the censored
regression model (3.34) will not in general be identified if the error terms are assumed
only to satisfy a constant conditional mean restriction, because the structural
function is not invertible in the error terms. However, the monotonicity of the
censoring transformation in ε for fixed x and u implies that the constant conditional
quantile restrictions discussed in Section 2.2 will be useful in identifying and consis-
tently estimating β₀. For fixed censoring (at zero), Powell (1984) proposed a least
absolute deviations estimator for β₀ under the assumption that the error terms
had conditional median zero; in the notation of model (3.34), this estimator β̂ would
be defined as

\hat{\beta} = \operatorname*{argmin}_{b \in \Theta} \frac{1}{N} \sum_{i=1}^{N} | y_i - \min\{x_i'b, u_i\} |,   (3.41)

where Θ is the (compact) parameter space. Since the conditional median of y given
x and u depends on the censoring value u for all observations (even if y is uncensored),
the estimator is not directly applicable to random censoring models. Demonstration
of the root-N-consistency and asymptotic normality of this estimator follows the
steps outlined in Section 2.3. The asymptotic covariance matrix of √N(β̂ − β₀) for
this model will be H₀^{-1} V₀ H₀^{-1}, with

H_0 = 2 E[\, f(0|x) \, 1\{x'\beta_0 < u\} \, x x' \,] \quad \text{and} \quad V_0 = E[\, 1\{x'\beta_0 < u\} \, x x' \,],

where f(0|x) is the conditional density of the error term ε at its median, zero.
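The criterion in (3.41) is simple to evaluate; since it is piecewise linear in b, a derivative-free optimizer (or iterative linear programming) is typically used. A minimal sketch, with illustrative names:

    import numpy as np
    from scipy.optimize import minimize

    def clad_objective(b, y, x, u):
        # Sample criterion of (3.41): average |y_i - min{x_i'b, u_i}|.
        return np.mean(np.abs(y - np.minimum(x @ b, u)))

    # e.g. bhat = minimize(clad_objective, b0, args=(y, x, u),
    #                      method="Nelder-Mead").x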
This approach was extended to the model with a general constant quantile
restriction by Powell (1986a), which derived analogous conditions for consistency
and asymptotic normality. Under the stronger restriction that the error terms are
independent of the regressors, this paper showed how more efficient estimators of
the slope coefficients in β₀ could be obtained by combining coefficients estimated
at different quantiles, and how the assumption of independent errors could be tested
by testing convergence of the differences in quantile slope estimators to zero, as
proposed by Koenker and Bassett (1982) for the linear model. Nawata (1990)
proposed a two-step estimator for β₀ which calculates a nonparametric estimator
of the conditional median of y given x in the first step, by grouping the regressors
into cells and computing the within-cell medians of the dependent variable. The
second step treats these cell medians ȳ_j and the corresponding cell averages of the
regressors x̄_j as raw data in a Gaussian version of the likelihood function (3.35) and
weights these quasi-observations by the cell frequencies (which would be optimal if
the conditional density of the errors at the median were constant). Nawata gives
conditions for the consistency of this estimator, and shows how its asymptotic
distribution approaches the distribution of the censored least absolute deviations
estimator (defined in (3.41)) as the regressor cells become small. And, as mentioned
in Section 3.2, Newey and Powell (1990) showed that an efficient estimator of β₀,
under a quantile restriction on the errors, is a weighted quantile estimator with
weights proportional to f(0|x), the conditional density of the errors at their condi-
tional quantile, and proposed a feasible one-step version of this estimator which is
asymptotically efficient.
When the censoring value u is observed only for censored observations, with u
distributed independently of (x, ε), Ying et al. (1991) propose a quantile estimator
for β₀ under the restriction Pr{ε ≤ 0|x} = π ∈ (0, 1), using the implied relation

\Pr\{y > x'\beta_0 \mid x\} = \Pr\{x'\beta_0 < u \text{ and } \varepsilon > 0 \mid x\}
\qquad = \Pr\{x'\beta_0 < u \mid x\} \cdot \Pr\{\varepsilon > 0 \mid x\}
\qquad = H(x'\beta_0)(1 - \pi),   (3.42)

where H(c) ≡ Pr{u > c} is the survivor function of the random variable u. The
unknown function H(·) can be consistently estimated using the Kaplan and Meier
(1958) product-limit estimator for the distribution function with censored data. The
resulting consistent estimator Ĥ(·) uses only the dependent variables {y_i} and the
censoring indicators {d_i}. Ying et al. (1991) define a quantile estimator β̂ as a
solution to estimating equations of the form

0 = \frac{1}{N} \sum_{i=1}^{N} \left\{ [\hat{H}(x_i'\beta)]^{-1} \, 1\{y_i > x_i'\beta\} - (1 - \pi) \right\} x_i,   (3.43)

based on the conditional moment restriction (3.42), and give conditions for the
root-N-consistency and asymptotic normality of this estimator. Since H(x'β₀) =
1{x'β₀ < u₀} when the censoring points u_i are constant at some value u₀
with probability one, these equations are not well-defined for fixed censoring (say,
at zero) except in the special case Pr{x'β₀ < u₀} = 1. A modification of the sample
moment conditions defined in (3.43),

0 = \frac{1}{N} \sum_{i=1}^{N} \left[ 1\{y_i > x_i'\beta\} - \hat{H}(x_i'\beta)(1 - \pi) \right] x_i,   (3.44)

would allow a constant censoring value, and when π = 1/2 would reduce to the
subgradient condition for the minimization problem (3.41) in this case. Unfortunately,
this condition may have a continuum of inconsistent roots, if β can be chosen so
that x_i'β > u_i for all observations. It is not immediately clear whether an antiderivative
of the right-hand side of (3.44) would yield a minimand which could be used to
consistently estimate β₀ under random censoring, as it does (yielding (3.41) for π = 1/2)
for fixed censoring.
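A minimal sketch of the product-limit estimator of the censoring survivor function H(·), ignoring ties for simplicity (names illustrative): since censored observations (d_i = 0) reveal u_i = y_i, they play the role of "failures" for the censoring variable u, while uncensored observations only bound u_i from below.

    import numpy as np

    def km_survivor(y, d):
        # Product-limit estimator of H(c) = Pr{u > c}: censored observations
        # (d = 0) are failures of the censoring variable; uncensored ones
        # (d = 1) imply only that u_i > y_i.
        order = np.argsort(y)
        y_s, d_s = y[order], d[order]
        at_risk = len(y) - np.arange(len(y))
        surv = np.cumprod(1.0 - (d_s == 0) / at_risk)
        return y_s, surv   # H_hat(c): value of surv at the largest y_s <= c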
Because the conditional median (and other quantiles) of the dependent variable
y depend explicitly on the error distribution when the dependent variable is trun-
cated, quantile restrictions are not helpful in identifying β₀ for truncated samples.
With a stronger restriction of conditional symmetry of the errors about a constant
(zero), the symmetric trimming idea mentioned in Section 2.3 can be used to
construct consistent estimators for both censored and truncated samples. Powell
(1986b) proposed a symmetrically truncated least squares estimator of β₀ for a
truncated sample. The estimator exploited the moment condition

E[\, 1\{y > 2x'\beta_0 - u\}(y - x'\beta_0) \mid x, y < u \,] = E[\, 1\{\varepsilon > x'\beta_0 - u\}\varepsilon \mid x, \varepsilon < u - x'\beta_0 \,] = 0,   (3.45)

which holds for the truncated model under conditional symmetry given x and u.
The resulting estimator is defined to minimize

R_N(\beta) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \min\{ x_i'\beta, \, (y_i + u_i)/2 \} \right)^2,   (3.46)

which yields a sample analogue to (3.45) as an approximate first-order condition.


2504 J.L. Powell

Similarly, a symmetrically censored least squares estimator for the censored regres-
sion model (3.34) will solve a sample moment condition based upon the condition

E[maxjy,2xB, - u$ - xpOlx] = E[ max(min(i-:, u - xpo), xBO ~ U} 1x1 = 0.


(3.47)

The root-N-consistency and asymptotic normality of these estimators were established
by Powell (1986b). In addition to conditional symmetry and a full-rank condition
on the matrix V₀ = E[1{x'β₀ < u} x x'], a unimodality condition on the error distri-
bution was imposed in the truncated case.
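The moment condition (3.47) suggests a simple fixed-point iteration for the symmetrically censored least squares estimator; the sketch below is only one of several possible computational schemes, and its convergence is not guaranteed in general:

    import numpy as np

    def scls(y, x, u, n_iter=200):
        # Iterates beta = (sum_i x_i x_i')^{-1} sum_i x_i max{y_i, 2 x_i'beta - u_i},
        # a sample analogue of the moment condition (3.47).
        beta = np.linalg.lstsq(x, y, rcond=None)[0]      # least squares start
        xtx_inv = np.linalg.inv(x.T @ x)
        for _ in range(n_iter):
            ystar = np.maximum(y, 2 * (x @ beta) - u)
            beta = xtx_inv @ (x.T @ ystar)
        return beta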
A variant on the symmetric trimming approach was proposed by M. Lee (1993a, b) which, for a fixed scalar w > 0,
constructed estimators for truncated and censored samples based on the moment
conditions

E[\, 1\{u - x'\beta_0 > w\} \, 1\{|y - x'\beta_0| \le w\} \, (y - x'\beta_0) \mid x \,] = 0   (3.48)

and

E[\, 1\{u - x'\beta_0 > w\} \min\{|y - x'\beta_0|, w\} \, \mathrm{sgn}\{y - x'\beta_0\} \mid x \,] = E[\, 1\{u - x'\beta_0 > w\} \min\{|\varepsilon|, w\} \, \mathrm{sgn}\{\varepsilon\} \mid x \,] = 0,   (3.49)

respectively. Newey (1989a) derives the semiparametric efficiency bounds for estimation
of β₀ under conditional symmetry with censored and truncated samples, noting that
the symmetrically truncated least squares estimator attains that efficiency bound in
the special case where the unknown error distribution is, in fact, Gaussian (the
analogous result does not hold, though, for the symmetrically censored estimator).
As described at the end of Section 2.2, conditional mode restrictions can be used
to identify β₀ for truncated data, and an estimator proposed by M. Lee (1992)
exploits this restriction. This estimator solves a sample analogue to the characteriza-
tion of β₀ as the solution to the minimization problem

\beta_0 = \operatorname*{argmin}_{b} \, \Pr\{ \, |y - \min\{u, x'b\}| > w \, \},

as long as the modal interval of length 2w for the untruncated error distribution is
assumed to be centered at zero. M. Lee (1992) showed the N^{1/3}-consistency of this
estimator and considered its robustness properties.
Most of the literature on semiparametric estimation for censored and truncated
regression in both statistics and econometrics has been based upon independence
restrictions. Early estimators of β₀ for random censoring models relaxed the
assumed parametric form of the error distribution (but maintained independence
of the errors and regressors).

Pairwise difference estimators for the censored and truncated regression models
have also been constructed by Honoré and Powell (1991). For model (3.34) with
fixed censoring, and using the notation of Section 2.4, these estimators were based
upon the transformation

e_{ij}(\beta) \equiv e(z_i, z_j, \beta) = \min\{ y_i - x_i'\beta, \; u_j - x_j'\beta \},   (3.54)

which satisfies

e_{ij}(\beta_0) = \min\{ \min\{\varepsilon_i, u_i - x_i'\beta_0\}, \; u_j - x_j'\beta_0 \} = \min\{ \varepsilon_i, \; u_i - x_i'\beta_0, \; u_j - x_j'\beta_0 \},

so that e_{ij}(β₀) and e_{ji}(β₀) are clearly independently and identically distributed given
x_i and x_j. Again choosing ℓ(x_i, x_j, β) = x_i − x_j, the pairwise difference estimator for
the censored regression model was given as a solution to the sample moment
condition (2.39) of Section 2.4 above. These estimating equations were shown to
have a unique solution, since they correspond to first-order conditions for a convex
truncated regression model, in which yi and xi are observed only if yi is positive;
that is, ify, = xi/&, + vi, where ui has the conditional distribution ofei given si > - x&,
then 6p(Ui 1xi) = di(q 1Xi, q > - xi&). Again assuming the untruncated errors si are
i.i.d. and independent of the regressors xi, a pairwise difference estimator of &, was
defined using the transformation

e(zi, Zj, p) E (Yi - Xib) l(Yi - Xi/? > - Xip) l(Yj - X;fi > - X:/3). (3.55)

When evaluated at the true value /IO, the difference

eij(fio) - eji(fio) = (Vi- Uj) l(Ui > - X;p)l(Oj > - Xifl) (3.56)

is symmetrically distributed around zero given xi and xj. As for the censored case,
the estimator B for this model was defined using &xi, xj, 0) = (xi - xj) and (2.39)
through (2.40) above. When the function c(d) = sgn(d), the solution to (2.39) for this
model was proposed by Bhattacharya et al. (1983) as an estimator of /?,, for this
model under the assumption that xi is a scalar. The general theory derived for
minimizers of mth-order U-statistics (discussed in Section 1.3) was applied to show
root-N-consistency and to obtain the large-sample distributions of the pairwise
difference estimators for the censored and truncated regression models.
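A direct sketch of the censored pairwise-difference moment condition, using (3.54) and the odd function sgn(·) (the construction of the associated convex minimand is omitted; names are illustrative):

    import numpy as np

    def pairwise_censored_moment(b, y, x, u):
        # e_ij(b) = min{y_i - x_i'b, u_j - x_j'b} as in (3.54); the estimator
        # makes this pairwise moment sum (approximately) zero.
        r = y - x @ b                                  # y_i - x_i'b
        c = u - x @ b                                  # u_i - x_i'b
        e = np.minimum(r[:, None], c[None, :])         # e_ij(b)
        s = np.sign(e - e.T)                           # sgn(e_ij - e_ji)
        dx = x[:, None, :] - x[None, :, :]             # x_i - x_j
        n = len(y)
        return (s[:, :, None] * dx).sum(axis=(0, 1)) / (n * (n - 1))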

3.4. Selection models

Rewriting the censored selection model of (1.21) and (1.22) as

d = 1\{x_1'\delta_0 + \eta > 0\},
y = d \, [x_2'\beta_0 + \varepsilon]   (3.57)
(for y₁ ≡ d, y₂ ≡ y, and with δ₀ and β₀ denoting the corresponding coefficient vectors), a fully parametric model would specify the
functional form of the joint density f(ε, η; τ₀) of the error terms. Estimators of the
parameters can then be obtained by maximization of the average log-likelihood function

\mathcal{L}_N(\beta, \delta, \tau) = \frac{1}{N} \sum_{i=1}^{N} \left\{ d_i \ln\left[ \int_{-x_{1i}'\delta}^{\infty} f(y_i - x_{2i}'\beta, \eta; \tau) \, d\eta \right] + (1 - d_i) \ln\left[ \int_{-\infty}^{\infty} \int_{-\infty}^{-x_{1i}'\delta} f(\varepsilon, \eta; \tau) \, d\eta \, d\varepsilon \right] \right\}   (3.58)

over β, δ, and τ in the parameter space. An alternative estimation method, proposed
by Heckman (1976), can be based upon the conditional mean of y given x and with
d = 1:

E[y \mid x, d = 1] = x_2'\beta_0 + \left[ \int_{-\infty}^{\infty} \int_{-x_1'\delta_0}^{\infty} f(\varepsilon, \eta; \tau_0) \, d\eta \, d\varepsilon \right]^{-1} \int_{-\infty}^{\infty} \int_{-x_1'\delta_0}^{\infty} \varepsilon \, f(\varepsilon, \eta; \tau_0) \, d\eta \, d\varepsilon \equiv x_2'\beta_0 + \lambda(x_1'\delta_0; \tau_0).   (3.59)

When the selection correction function λ(x₁'δ₀; τ) is linear in the distributional
parameters τ (as is the case for bivariate Gaussian densities), a two-step estimator
of β₀ can be constructed using linear least squares, after inserting a consistent
first-step estimator δ̂ of δ₀ (using the indicator d and regressors x₁ in the binary
log-likelihood of (3.2)) into the selection correction function. Alternatively, a non-
linear least squares estimator of the parameters can be constructed using (3.59),
which is also applicable for truncated data (i.e. for y and x being observed conditional
on d = 1).
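In the bivariate Gaussian case the correction λ(·) is proportional to the inverse Mills ratio, and the two-step estimator takes a particularly simple form; a minimal sketch (the probit step is coded directly, and all names are illustrative):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def heckman_two_step(d, y, x1, x2):
        # Step 1: probit of d on x1 gives a consistent estimator of delta_0.
        def nll(delta):
            v = x1 @ delta
            return -np.sum(d * norm.logcdf(v) + (1 - d) * norm.logcdf(-v))
        delta = minimize(nll, np.zeros(x1.shape[1]), method="BFGS").x
        # Step 2: in the Gaussian case lambda(.) is proportional to the
        # inverse Mills ratio, so OLS of y on (x2, mills) over d = 1
        # recovers beta_0 (the Mills coefficient absorbs the covariance).
        v = x1 @ delta
        mills = norm.pdf(v) / norm.cdf(v)
        sel = d == 1
        Z = np.column_stack([x2[sel], mills[sel]])
        coef = np.linalg.lstsq(Z, y[sel], rcond=None)[0]
        return coef[:-1], delta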
To date, semiparametric modelling of the selection model (3.57) has imposed
independence or index restrictions on the error terms (ε, η). Chamberlain (1986a)
derived the semiparametric efficiency bound for estimation of β₀ and δ₀ in (3.57)
when the errors are independent of the regressors with unknown error density. The
form of the efficiency bound is a simple modification of the parametric efficiency
bound for this problem when the error density is known, with the regression vectors
x₁ and x₂ being replaced by their deviations from their conditional means, given
the selection index, x₁ − E[x₁|x₁'δ₀] and x₂ − E[x₂|x₁'δ₀], except for terms which
involve the index x₁'δ₀. Chamberlain notes that, in general, nonsingularity of the
semiparametric information matrix will require an exclusion restriction on x₂ (i.e.
some component of x₁ with nonzero coefficient in δ₀ is excluded from x₂), as well
as a normalization restriction on δ₀. The efficiency bound, which was derived
imposing independence of the errors and regressors, apparently holds more generally
when the joint distribution of the errors in (3.57), given the regressors, depends only
upon the index x₁'δ₀ appearing in the selection equation.

Under this index restriction, the conditional mean of y given d = 1 and x will
have the same form as in (3.59), but with a selection correction function of unknown
form. More generally, conditional on d = 1, the dependent variable y has the linear
representation y = x₂'β₀ + ε, where ε satisfies the distributional index restriction

𝓛(ε | d = 1, x) = 𝓛(ε | d = 1, x₁'δ₀) a.s.,   (3.60)

so that other estimation methods for distributional index restrictions (discussed in
Section 2.5) are applicable here. So far, though, the econometric literature has
exploited only the weaker mean index restriction

E(ε | d = 1, x) = E(ε | d = 1, x₁'δ₀).   (3.61)

A semiparametric analogue of Heckman's two-step estimator was constructed by
Cosslett (1991), assuming independence of the errors and regressors. In the first step
of this approach, a consistent estimator of the selectivity parameter δ₀ is obtained
using Cosslett's (1983) NPMLE for the binary response model, described in Section
3.1 above. In this first step, the concomitant estimator F̂(·) of the marginal c.d.f.
of the selection error η is a step function, constant on a finite number J of intervals
{I_j ≡ (c_{j−1}, c_j), j = 1, . . . , J} with c₀ = −∞ and c_J = ∞. The second-step estimator of
β₀ approximates the selection correction function λ(·) by a piecewise-constant
function on those intervals. That is, writing

y = x_2'\beta_0 + \sum_{j=1}^{J} \lambda_j \, 1\{x_1'\delta_0 \in I_j\} + e,   (3.62)

the estimator β̂ is constructed from a linear least squares regression of y on x₂ and
the J indicator variables {1(x₁'δ̂ ∈ I_j)}. Cosslett (1991) showed consistency of the
resulting estimator, using the fact that the number of intervals, J, increases slowly
to infinity as the sample size increases, so that the piecewise-constant function can
approximate the true selection function λ(·) to an arbitrary degree. An important
identifying assumption was the requirement that some component of the regression
vector x₁ for the selection equation was excluded from the regressors x₂ in the
equation for y, as discussed by Chamberlain (1986a).
Although independence of the errors and regressors was imposed by Cosslett
(1991), this was primarily used to ensure consistency of the NPML estimator of the
selection coefficient vector δ₀. The same approach to approximation of the selection
correction function will work under an index restriction on the errors, provided the
first-step estimator of δ₀ only requires this index restriction. In a parametric context,
L. Lee (1982) proposed estimation of β₀ using a flexible parametrization of the
selection correction function λ(·) in (3.59). For the semiparametric model Newey
(1988) proposed a similar two-step estimator, which in the second step used a series
approximation to the selection correction function to obtain the approximate model

y \cong x_2'\beta_0 + \sum_{j=1}^{J} \lambda_j \, p_j(x_1'\delta_0) + e,   (3.63)

which was estimated (substituting a preliminary estimator δ̂ for δ₀) by least squares
to obtain an estimator of β₀. Here the functions {p_j(·)} were a series of functions
whose linear combination could be used to approximate (in a mean-squared-error
sense) the function λ(·) arbitrarily well as J → ∞. Newey (1988) gave conditions
(including a particular rate of growth of the number J of series components) under
which the estimator β̂ of β₀ was root-N-consistent and asymptotically normal, and
also discussed how efficient estimators of the parameters could be constructed.
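A minimal sketch of the second step in (3.63), using ordinary polynomials in the estimated index as the series {p_j(·)} (one of many admissible bases) and data on the selected subsample only; names are illustrative:

    import numpy as np

    def series_selection_step(y, x2, v_hat, J):
        # Least squares of y on x2, a constant, and powers of the estimated
        # index v_hat = x1'delta_hat; the polynomial part approximates the
        # unknown selection correction lambda(.) in (3.63).
        P = np.column_stack([v_hat ** j for j in range(1, J + 1)])
        Z = np.column_stack([x2, np.ones(len(y)), P])
        coef = np.linalg.lstsq(Z, y, rcond=None)[0]
        return coef[:x2.shape[1]]      # the estimate of beta_0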
As discussed in Section 2.5, weighted versions of the pairwise-difference estima-
tion approach can be used under the index restriction of (3.61). Assuming a pre-
liminary, root-N-consistent estimator δ̂ of δ₀ is available, Powell (1987) considers
a pairwise-difference estimator of the form (2.55) when t(d) = d, e_ij(β) = y_i − x_{2i}'β
and ℓ(x_i, x_j, β) = x_{2i} − x_{2j}, yielding the explicit estimator

\hat{\beta} = \left[ \sum_{i<j} w_N((x_{1i} - x_{1j})'\hat{\delta}) \, (x_{2i} - x_{2j})(x_{2i} - x_{2j})' \right]^{-1} \sum_{i<j} w_N((x_{1i} - x_{1j})'\hat{\delta}) \, (x_{2i} - x_{2j})(y_i - y_j).   (3.64)

Conditions were given in Powell (1987) on the data generating process, the weighting
functions w_N(·), and the preliminary estimator δ̂ which ensured the root-N-consistency
and asymptotic normality of β̂. The dependence of this asymptotic distribution on
the large-sample behavior of δ̂ was explicitly derived, along with a consistent
estimator of the asymptotic covariance matrix. The approach was also extended to
permit endogeneity of some components of x₂ using an instrumental variables
version of the estimator. L. Lee (1991) considers system identification of semipara-
metric selection models with endogenous regressors and proposes efficient estimators
of the unknown parameters under an independence assumption on the errors.
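The explicit estimator (3.64) can be evaluated directly; a minimal sketch with Gaussian kernel weights on the difference of estimated indices (the kernel and bandwidth are illustrative choices):

    import numpy as np

    def weighted_pairwise_difference(y, x2, v_hat, h):
        # v_hat: estimated selection indices x1'delta_hat; the weights w_N are
        # a Gaussian kernel in the index differences, as in (3.64).
        w = np.exp(-0.5 * ((v_hat[:, None] - v_hat[None, :]) / h) ** 2)
        np.fill_diagonal(w, 0.0)
        dx = x2[:, None, :] - x2[None, :, :]
        dy = y[:, None] - y[None, :]
        A = np.einsum('ij,ijk,ijl->kl', w, dx, dx)
        c = np.einsum('ij,ijk,ij->k', w, dx, dy)
        return np.linalg.solve(A, c)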
When the errors in (3.57) are assumed independent of the regressors, and the
support of the selection error η is the entire real line, the assumption of a known
parametric form x₁'δ₀ of the regression function in the selection equation can be
relaxed. In this case, the dependent variable y given d = 1 has the linear representation
y = x₂'β₀ + ε, where the error term ε satisfies the distributional index restriction

𝓛(ε | d = 1, x) = 𝓛(ε | d = 1, p(x₁)) a.s.,   (3.65)

where now the single index p(x₁) is the propensity score (Rosenbaum and Rubin
(1983)), defined as

p(x_1) \equiv \Pr\{d = 1 \mid x_1\} = E[d \mid x_1].   (3.66)

Given a nonparametric estimator p̂(x_{1i}) of the conditional mean p(x_{1i}) of the selection
indicator, it is straightforward to modify the estimation methods above to accommo-
date this new index restriction, by replacing the estimated linear index x₁'δ̂ by the
nonparametric index p̂(x_{1i}) throughout. Choi (1990) proposed a series estimator of
β₀ based on (3.63) with this substitution, while Ahn and Powell (1993) modified the
weighted pairwise-difference estimator in (3.64) along these lines. Both papers used
a nonparametric kernel estimator to construct p̂(x_{1i}), and both gave conditions on
the model, the first-step nonparametric estimator and the degree of smoothing in
the second step which guaranteed root-N-consistency and asymptotic normality of
the resulting estimators of β₀. The influence functions for these estimators depend
upon the conditional variability of the errors ε and the deviations of the selection
indicator from its conditional mean, d − p(x₁). Newey and Powell (1993) calculate
the semiparametric efficiency bounds for β₀ under the distributional index restric-
tion (3.65) and its mean index analogue, while Newey and Powell (1991) discuss
construction of semiparametric M-estimators which will attain these efficiency
bounds.
For the truncated selection model (sampling from (3.57) conditional on d = 1),
identification and estimation of the unknown parameters is much more difficult.
Ichimura and Lee (1991) consider a semiparametric version of a nonlinear least
squares estimator using the form of the truncated conditional mean function

E[y \mid x, d = 1] = x_2'\beta_0 + \lambda(x_1'\delta_0)   (3.67)

from (3.59) with λ(·) unknown, following the definition of Ichimura's (1992) estimator
in (3.33) above. Besides giving conditions for identification of the parameters and
root-N-consistency of their estimators, Ichimura and Lee (1991) consider a genera-
lization of this model in which the nonparametric component depends upon several
linear indices. If the linear index restriction (3.61) is replaced by the nonparametric
index restriction (3.65), identification and consistent estimation of β₀ requires the
functional independence of x₁ and x₂, in which case the estimator proposed by
Robinson (1988), discussed in Section 2.5 above, will be applicable. Chamberlain
(1992) derives the efficiency bound for estimation of the parameters of the truncated
regression model under the index restriction (3.65).
Just as eliminating the information provided by the selection variable d makes
identification and estimation of β₀ harder, a strengthening of the information in the
selection variable makes estimation easier, and permits identification using other
semiparametric restrictions on the errors. Honoré et al. (1992) consider a model in
which the binary selection variable d is replaced by a censored dependent variable
y₁, so that the model becomes

y_1 = \max\{0, \, x_1'\delta_0 + \eta\},
y_2 = 1\{y_1 > 0\} \, [x_2'\beta_0 + \varepsilon].   (3.68)

This model is called the "Type 3 Tobit" model by Amemiya (1985). Assuming
conditional symmetry of the errors (ε, η) about zero given x (as defined in Section
2.3), the authors note that δ₀ can be consistently estimated using the quantile or
symmetric trimming estimators for censored regression models discussed in Section
3.3, and, furthermore, by symmetrically trimming the dependent variable y₂ using
the trimming function

h(y_1, y_2, x_1, x_2, \delta, \beta) \equiv 1\{0 < y_1 < 2x_1'\delta\}(y_2 - x_2'\beta),   (3.69)

the function h(·) satisfies the conditional moment restriction

E[h(y_1, y_2, x_1, x_2, \delta_0, \beta_0) \mid x] = E[\, 1\{-x_1'\delta_0 < \eta < x_1'\delta_0\} \, \varepsilon \mid x \,] = 0   (3.70)

because of the joint conditional symmetry of the errors. By constructing a sample
analogue of (3.70) (possibly based on other odd functions of y₂ − x₂'β) and inserting
the preliminary estimator δ̂, Honoré et al. (1992) show the resulting estimator β̂ to
be root-N-consistent and asymptotically normal under relatively weak conditions
on the model. Thus, with the additional information on the latent variable x₁'δ₀ + η
provided by the censored variable y₁, it is possible to consistently estimate β₀
without obtaining explicit nonparametric estimators of infinite-dimensional nuisance
functions.

3.5. Nonlinear panel data models

For panel data versions of the latent variable models considered above, with

y_s = t(\eta + x_s'\beta_0 + \varepsilon_s), \qquad s = 1, \ldots, T,   (3.71)

derivation of log-likelihood functions like the ones above is straightforward if the
individual-specific intercept η is assumed independent of x (or its dependence is
parametrically specified) with a distribution of known parametric form. The condi-
tional density of y ≡ (y₁, . . . , y_T) given x for each individual can be obtained from
the joint density of the convolution v ≡ (η + ε₁, . . . , η + ε_T), which, for special (e.g.
Gaussian) choices of error distribution is of simple form. Maximum likelihood
estimators of β₀ for these nonlinear "random effects" models have the usual optimality
properties, but their consistency depends on proper specification of both the error

terms&-(sr,..., sT) and the random effect r]. When the individual-specific intercepts
are treated as unknown parameters (fixed effects), the corresponding log-likelihoods
for the parameters PO and the vector of intercept terms (r],, . . . , ?i,. . . , I]~) are even
simpler to derive, being of the same general forms as given above when the errors
E, are assumed to be i.i.d. across individuals and time. However, because the vector
of unknown intercept terms increases with the sample size, maximum likelihood
estimators of these fixed effects will be inconsistent unless the number of time periods
T also increases to infinity; moreover, the inconsistency of the fixed effect estimators
leads to inconsistency of the estimators of the parameters of interest, a,, as a
consequence of the notorious incidental parameters problem (Neyman and Scott
(1948)).
For some special parametric discrete response models, consistent estimators of
β₀ with fixed effects can be obtained by maximizing a "conditional" likelihood
function, which conditions on a fixed sum of the discrete dependent variable across
time for each individual. In the special case T = 2, this is the same as maximizing
the conditional likelihood given that y₁ ≠ y₂, so that the estimation method is the
analogue of estimation using pairwise differences (over time) for linear panel data
models. Models for which a version of pairwise differencing can be used to eliminate
the fixed effect in panel data include the binary logit model (Andersen (1970)), the
Poisson regression model (Hausman et al. (1984)) and certain duration models
(Chamberlain (1984)); however, these results require a particular (exponential) struc-
ture to the likelihood which does not hold in general.
For the binary, censored, and truncated regression models with fixed effects,
estimators have been proposed under the assumption that the time-specific errors
{ε_s} are identically distributed across time periods s given the regressors x. Manski
(1987) shows that, with T = 2 time periods, the conditional median of the differ-
ence y₂ − y₁ of the binary variables y_s = 1{η + x_s'β₀ + ε_s > 0}, given that y₁ ≠ y₂, is
1{(x₂ − x₁)'β₀ > 0}, so that a consistent estimator for β₀ will be

\hat{\beta} = \operatorname*{argmin}_{b \in \Theta} \frac{1}{N} \sum_{i=1}^{N} 1\{y_{i2} \neq y_{i1}\} \, | (y_{i2} - y_{i1}) - 1\{(x_{i2} - x_{i1})'b > 0\} |,   (3.72)

which will be consistent under conditions on (x_{i2} − x_{i1}), etc., similar to those for
consistency of the maximum score estimator. Honoré (1992) considered pairwise-
difference estimators for censored and truncated regression models with fixed effects
using the approach described in Section 3.3. Specifically, using the transformations
given in (3.54) and (3.55) for the censored and truncated cases, respectively, estimators
of the parameter vector β₀ in both cases were defined as solutions to minimization
problems which generate a first-order condition of the form

0 \cong \frac{1}{N} \sum_{i=1}^{N} [\, e(z_{i2}, z_{i1}, \beta) - e(z_{i1}, z_{i2}, \beta) \,] (x_{i2} - x_{i1}).   (3.73)

As discussed at the end of Section 2.4, the expectation of the right-hand side of (3.73)
will be zero when evaluated at β₀, even in the presence of a fixed effect. As for
Manski's binary panel data estimator, this estimation approach can be generalized
to allow for more than T = 2 time periods.
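A direct sketch of the criterion in (3.72) for T = 2 is given below; the function and argument names are illustrative:

    import numpy as np

    def panel_score_objective(b, y1, y2, x1, x2):
        # LAD criterion of (3.72): fit the change y2 - y1 in the binary
        # outcome, over switchers, by the sign of the differenced index.
        dy = y2 - y1
        dx = (x2 - x1) @ b
        sw = dy != 0
        return np.mean(np.abs(dy[sw] - (dx[sw] > 0)))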

4. Summary and conclusions

As the previous section indicates, the theoretical analysis of the properties of
estimators under various semiparametric restrictions is quite extensive, at least for
the latent variable models considered above. The following table gives a general
summary of the state of the econometric literature on estimation of several semi-
parametric models.

Model            Mean  Median  Mode  Index  Symmetry  Independence

Linear            3     3      0+    3      3
Transformed       3     3      0+    3      3
Censored          0     3      0+    3      3
Truncated         0     0      0     3      3
Binary            0     1      3     1      3
Monotone          0     1      2     1      2
Semilinear        3     2      3     2      3
Selection         0     ?      3     2      3
Binary panel      0     ?      1     ?      ?         1
Censored panel    0     ?      1     ?      ?         2

Key: 0 = not identified (0+ = identified only up to scale); 1 = parameter identified/consistent estimator; 2 = root-N-consistent, asymptotically normal estimator; 3 = efficient estimator.

Of course, this table should be viewed with caution, as some of its entries are
ambiguous (for instance, the entry under "symmetry" for the "selection" row refers
to the Type 3 Tobit model with a censored regression model as the selection
equation, while the other columns presume a binary selection equation). Nevertheless,
the table should be suggestive of areas where more research is needed.
The literature on the empirical application of semiparametric methods (apart
from estimation of invertible models under conditional mean restrictions) is much
less extensive. When applied to relatively small data sets (roughly 100 observations
per parameter), the potential bias from misspecification of the parametric model
has proven to be less important than the additional imprecision induced when
parametric restrictions are relaxed. For example, Horowitz and Neumann (1987)
and McFadden and Han (1987) estimate the parameters of an employment duration
model, imposing independence and quantile restrictions, but for these data even
maximum likelihood estimates are imprecise (in terms of their asymptotic standard
errors). A similar outcome was obtained by Newey et al. (1990), which reanalyzed
data on married women's labor supply originally studied (in a parametric context)
by Mroz (1987). For these data, estimates based upon semiparametric restrictions
were fairly comparable to their parametric counterparts, with differences in the
estimates having large standard errors. On the other hand, for larger data sets (with
relatively few parameters), the bias due to distributional misspecification is more
likely to be evident. Chamberlain (1990) and Buchinsky (1991b) apply quantile
methods to estimate the returns to education for a large, right-censored data set,
and find these estimates to be quite precise. Other empirical papers which use
semiparametric methods, with mixed success, include those by Deaton and Irish
(1984), Newey (1987), Das (1991), Horowitz (1993), Bult (1992a, b), Horowitz and
Markatou (1993), Deaton and Ng (1993) and Melenberg and van Soest (1993).
Besides the possible imprecision due to weakening of semiparametric restrictions,
an obstacle to routine use of some of the estimators described in Section 3 is their
dependence upon a choice of type and degree of smoothing imposed for estimators
which depend explicitly upon nonparametric components of the model. Though
this question has been widely studied in the literature on nonparametrics, the results
are different when the nonparametric component is a nuisance parameter. Some
early results on the proper degree of smoothing are available for some special cases
of estimators for censored regression (Hall and Horowitz (1990)) or upon index
restrictions (Hall and Marron (1987), Powell and Stoker (1991), Härdle et al. (1992)),
but more theoretical results are needed to narrow the choice of possible estimators
which depend upon nonparametrically-estimated components.

References

Ahn, H. and C.F. Manski (1993) "Distribution Theory for the Analysis of Binary Choice Under Uncertainty with Nonparametric Estimation of Expectations", Journal of Econometrics, forthcoming.
Ahn, H. and J.L. Powell (1993) "Semiparametric Estimation of Censored Selection Models with a Nonparametric Selection Mechanism", Journal of Econometrics, forthcoming.
Amemiya, T. (1974) "The Nonlinear Two-Stage Least-Squares Estimator", Journal of Econometrics, 2, 105-110.
Amemiya, T. (1977) "The Maximum Likelihood and Nonlinear Three-Stage Least Squares Estimator in the General Nonlinear Simultaneous Equations Model", Econometrica, 45, 955-968.
Amemiya, T. (1982) "Two Stage Least Absolute Deviations Estimators", Econometrica, 50, 689-711.
Amemiya, T. (1985) Advanced Econometrics. Cambridge, Mass.: Harvard University Press.
Amemiya, T. and J.L. Powell (1981) "A Comparison of the Box-Cox Maximum Likelihood Estimator and the Non-Linear Two-Stage Least Squares Estimator", Journal of Econometrics, 17, 351-381.
Andersen, E.B. (1970) "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal of the Royal Statistical Society, Series B, 32, 283-301.
Andrews, D.W.K. (1987) "Consistency in Nonlinear Econometric Models: A Generic Uniform Law of Large Numbers", Econometrica, 55, 1465-1471.
Andrews, D.W.K. (1990a) "Asymptotics for Semiparametric Econometric Models, I: Estimation and Testing", Cowles Foundation, Yale University, Discussion Paper No. 908R.
Andrews, D.W.K. (1990b) "Asymptotics for Semiparametric Econometric Models, II: Stochastic Equicontinuity and Nonparametric Kernel Estimation", Cowles Foundation, Yale University, Discussion Paper No. 909R.
Andrews, D.W.K. (1991) "Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Regression Models", Econometrica, 59, 307-345.

Arabmazar, A. and P. Schmidt (1981) "Further Evidence on the Robustness of the Tobit Estimator to Heteroscedasticity", Journal of Econometrics, 17, 253-258.
Arabmazar, A. and P. Schmidt (1982) "An Investigation of the Robustness of the Tobit Estimator to Non-Normality", Econometrica, 50, 1055-1063.
Bassett, G.S. and R. Koenker (1978) "Asymptotic Theory of Least Absolute Error Regression", Journal of the American Statistical Association, 73, 667-677.
Begun, J., W. Hall, W. Huang and J. Wellner (1983) "Information and Asymptotic Efficiency in Parametric-Nonparametric Models", Annals of Statistics, 11, 432-452.
Bhattacharya, P.K., H. Chernoff and S.S. Yang (1983) "Nonparametric Estimation of the Slope of a Truncated Regression", Annals of Statistics, 11, 505-514.
Bickel, P.J. (1982) "On Adaptive Estimation", Annals of Statistics, 10, 647-671.
Bickel, P.J. and K.A. Doksum (1981) "An Analysis of Transformations Revisited", Journal of the American Statistical Association, 76, 296-311.
Bickel, P.J., C.A.J. Klaassen, Y. Ritov and J.A. Wellner (1993) Efficient and Adaptive Inference in Semiparametric Models. Baltimore: Johns Hopkins University Press, forthcoming.
Bierens, H.J. (1987) "Kernel Estimators of Regression Functions", in: T.F. Bewley, ed., Advances in Econometrics, Fifth World Congress, Vol. 1. Cambridge: Cambridge University Press.
Bloomfield, P. and W.L. Steiger (1983) Least Absolute Deviations: Theory, Applications, and Algorithms. Boston: Birkhäuser.
Box, G.E.P. and D.R. Cox (1964) "An Analysis of Transformations", Journal of the Royal Statistical Society, Series B, 26, 211-252.
Brillinger, D.R. (1983) "A Generalized Linear Model with 'Gaussian' Regressor Variables", in: P.J. Bickel, K.A. Doksum and J.L. Hodges, eds., A Festschrift for Erich L. Lehmann. Belmont, CA: Wadsworth International Group.
Buchinsky, M. (1991a) "A Monte Carlo Study of the Asymptotic Covariance Estimators for Quantile Regression Coefficients", manuscript, Harvard University, January.
Buchinsky, M. (1991b) "Changes in the U.S. Wage Structure 1963-1987: Applications of Quantile Regression", manuscript, University of Chicago.
Buchinsky, M. (1993) "How Did Women's Return to Education Evolve in the U.S.? Exploration by Quantile Regression Analysis with Nonparametric Correction for Sample Selection Bias", manuscript, Yale University.
Buckley, J. and I. James (1979) "Linear Regression with Censored Data", Biometrika, 66, 429-436.
Bult, J.R. (1992a) "Target Selection for Direct Marketing: Semiparametric versus Parametric Discrete Choice Models", Faculty of Economics, University of Groningen, Research Memorandum No. 468.
Bult, J.R. (1992b) "Semiparametric versus Parametric Classification Models: An Application to Direct Marketing", manuscript, University of Groningen.
Burguete, J., R. Gallant and G. Souza (1982) "On Unification of the Asymptotic Theory of Nonlinear Econometric Models", Econometric Reviews, 1, 151-190.
Carroll, R.J. (1982) "Adapting for Heteroskedasticity in Linear Models", Annals of Statistics, 10, 1224-1233.
Carroll, R.J. and D. Ruppert (1982) "Robust Estimation in Heteroskedastic Linear Models", Annals of Statistics, 10, 429-443.
Carroll, R.J. and D. Ruppert (1984) "Power Transformations When Fitting Theoretical Models to Data", Journal of the American Statistical Association, 79, 321-328.
Cavanagh, C. and R. Sherman (1991) "Rank Estimators for Monotonic Regression Models", manuscript, Bellcore.
Chamberlain, G. (1984) "Panel Data", in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North-Holland.
Chamberlain, G. (1986a) "Asymptotic Efficiency in Semiparametric Models with Censoring", Journal of Econometrics, 32, 189-218.
Chamberlain, G. (1986b) "Notes on Semiparametric Regression", manuscript, Department of Economics, University of Wisconsin-Madison.
Chamberlain, G. (1987) "Asymptotic Efficiency in Estimation with Conditional Moment Restrictions", Journal of Econometrics, 34, 305-334.
Chamberlain, G. (1990) "Quantile Regression, Censoring, and the Structure of Wages", manuscript, Harvard University.
Chamberlain, G. (1992) "Efficiency Bounds for Semiparametric Regression", Econometrica, 60, 567-596.

Choi, K. (1990) The Semiparametric Estimation of the Sample Selection Model Using Series Expansion
and the Propensity Score, manuscript, University of Chicago.
Chung, C.-F. and A.S. Goldberger (1984) Proportional Projections in Limited Dependent Variable
Models, Econometrica, 52, 531-534. _
Cosslett. S.R. (1981) Maximum Likelihood Estimation for Choice-Based Samples, Econometrica, 49,
1289-1316. ~
Cosslett, S.R. (1983) Distribution-Free Maximum Likelihood Estimator of the Binary Choice Model,
Econometrica, 51, 7655782.
Cosslett, S.R. (1987) Efficiency Bounds for Distribution-Free Estimators of the Binary Choice and the
Censored Regression Models, Economctrica, 55, 559-587.
Cosslett, S.R. (1991) Distribution-Free Estimator of a Regression Model with Sample Selectivity, in:
W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in
Econometrics and Statistics. Cambridge: Cambridge University Press.
Cox, D.R. (1972) Regression Models and Life Tables, Journal ofthe Royal Statistical Society, Series B,
34, 187-220.
Cox, D.R. (1975) Partial Likelihood, Biometrika, 62, 269-276.
Cragg, J.G. (1983) More Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form,
Econometrica, 51, 751-764.
Das, S. (1991) A Semiparametric Structural Analysis of the Idling of Cement Kilns, Journal of
Econometrics, 50, 235-256.
Deaton, A. and M. Irish (1984) Statistical Models for Zero Expenditures in Household Budgets,
Journal of Public Economics, 23, 59-80.
Deaton, A. and S. Ng (1993) Parametric and Non-parametric Approaches to Price and Tax Reform,
manuscript, Princeton University.
Delgado, M.A. (1992) Semiparametric Generalized Least Squares in the Multivariate Nonlinear Regres-
sion Model, Econometric Theory, 8,203-222.
Dempster, A.P., N.M. Laird and D.B. Rubin (1977) Maximum Likelihood from Incomplete Data via
the E-M Algorithm, Journal ofthe Royal Statistical Society, Series B, l-38.
Duncan, G.M. (1986) A Semiparametric Censored Regression Estimator, Journal ofEconometrics, 32,
5-34.
Elbadawi, I., A.R. Gallant and G. Souza (1983) An Elasticity Can be Estimated Consistently Without
A Priori Knowledge of its Functional Form, Econometrica, 51, 1731-1751.
Engle, R.F., C.W.J. Granger, J. Rice and A. Weiss (1986) Semiparametric Estimates of the Relation
Between Weather and Electricity Sales, Journal of the American Statistical Association, 81, 310-
320.
Ferguson, T.S. (1967) Mathematical Statistics: A Decision Theoretic Approach. New York: Academic
Press.
Fernandez, L. (1986) Nonparametric Maximum Likelihood Estimation of Censored Regression Models,
Journal of Econometrics, 32, 35-57.
Friedman, J.H. and W. Stuetzle (1981) Projection Pursuit Regression, Journal of the American Statistical
Association, 76, 817-823.
Gallant, A.R. (1980) Explicit Estimators of Parametric Functions in Nonlinear Regression, Journal of
the American Statistical Association, 75, 182-193.
Gallant, A.R. (1981) On the Bias in Flexible Functional Forms and an Essentially Unbiased Form, The
Fourier Flexible Form, Journal of Econometrics, 15, 211-245.
Gallant, A.R. (1987) Identification and Consistency in Nonparametric Regression, in: T.F. Bewley, ed.,
Advances in Econometrics, Fifth World Congress. Cambridge: Cambridge University Press.
Gallant, A.R. and D.W. Nychka (1987) Semi-nonparametric Maximum Likelihood Estimation, Econo-
metrica, 55, 363-390.
Goldberger, A.S. (1983) Abnormal Selection Bias, in: S. Karlin, T. Amemiya and L. Goodman, eds.,
Studies in Econometrics, Time Series, and Multivariate Statistics, New York: Academic Press.
Greene, W.H. (1981) On the Asymptotic Bias of the Ordinary Least Squares Estimator of the Tobit
Model, Econometrica, 49, 505-514.
Greene, W.H. (1983) Estimation of Limited Dependent Variable Models by Ordinary Least Squares
and the Method of Moments, Journal of Econometrics, 21, 195-212.
Grenander, U. (1981) Abstract Inference. New York: Wiley.
Hall, P. and J.L. Horowitz (1990) Bandwidth Selection in Semiparametric Estimation of Censored
Linear Regression Models, Econometric Theory, 6, 123-150.
Hall, P. and J.S. Marron (1987) Estimation of Integrated Squared Density Derivatives, Statistics and
Probability Letters, 6, 109-115.
Han, A.K. (1987a) Non-Parametric Analysis of a Generalized Regression Model: The Maximum Rank
Correlation Estimator, Journal of Econometrics, 35, 303-316.
Han, A.K. (1987b) A Non-Parametric Analysis of Transformations, Journal of Econometrics, 35,
191-209.
Hansen, L.P. (1982) Large Sample Properties of Generalized Method of Moment Estimators, Econo-
metrica, 50, 1029-1054.
Härdle, W. (1991) Applied Nonparametric Regression. Cambridge: Cambridge University Press.
Härdle, W. and T.M. Stoker (1989) Investigating Smooth Multiple Regression by the Method of
Average Derivatives, Journal of the American Statistical Association, forthcoming.
Härdle, W., J. Hart, J.S. Marron and A.B. Tsybakov (1992) Bandwidth Choice for Average Derivative
Estimation, Journal of the American Statistical Association, 87, 227-233.
Hausman, J., B.H. Hall and Z. Griliches (1984) Econometric Models for Count Data with an Application
to the Patents-R&D Relationship, Econometrica, 52, 909-938.
Heckman, J.J. (1976) The Common Structure of Statistical Models of Truncation, Sample Selection
and Limited Dependent Variables and a Simple Estimator for Such Models, Annals of Economic and
Social Measurement, 5, 475-492.
Heckman, J.J. and T.E. MaCurdy (1980) A Life-Cycle Model of Female Labor Supply, Review of
Economic Studies, 47, 47-74.
Heckman, J.J. and B. Singer (1984) A Method for Minimizing the Impact of Distributional Assumptions
in Econometric Models for Duration Data, Econometrica, 52, 271-320.
Heckman, N.E. (1986) Spline Smoothing in a Partly Linear Model, Journal of the Royal Statistical
Society, Series B, 48, 244-248.
Hoeffding, W. (1948) A Class of Statistics with Asymptotically Normal Distribution, Annals of
Mathematical Statistics, 19, 293-325.
Honoré, B.E. (1986) Estimation of Proportional Hazards Models in the Presence of Unobserved
Heterogeneity, manuscript, University of Chicago, November.
Honoré, B.E. (1992) Trimmed LAD and Least Squares Estimation of Truncated and Censored Regres-
sion Models with Fixed Effects, Econometrica, 60, 533-565.
Honoré, B.E. and J.L. Powell (1991) Pairwise Difference Estimators of Linear, Censored, and Truncated
Regression Models, manuscript, Department of Economics, Princeton University, November.
Honoré, B.E., E. Kyriazidou and C. Udry (1992) Estimation of Type 3 Tobit Models Using Symmetric
Trimming and Pairwise Comparisons, manuscript, Department of Economics, Northwestern Univer-
sity.
Horowitz, J.L. (1986) A Distribution-Free Least Squares Estimator for Censored Linear Regression
Models, Journal of Econometrics, 32, 59-84.
Horowitz, J.L. (1988a) Semiparametric M-Estimation of Censored Linear Regression Models, Advances
in Econometrics, 7, 45-83.
Horowitz, J.L. (1988b) The Asymptotic Efficiency of Semiparametric Estimators for Censored Linear
Regression Models, Empirical Economics, 13, 123-140.
Horowitz, J.L. (1992) A Smoothed Maximum Score Estimator for the Binary Response Model,
Econometrica, 60, 505-531.
Horowitz, J.L. (1993) Semiparametric Estimation of a Work Trip Mode Choice Model, Journal of
Econometrics, forthcoming.
Horowitz, J.L. and M. Markatou (1993) Semiparametric Estimation of Regression Models for Panel
Data, Department of Economics, University of Iowa, Working Paper No. 93-14.
Horowitz, J.L. and G. Neumann (1987) Semiparametric Estimation of Employment Duration Models,
with discussion, Econometric Reviews, 6, 5-40.
Hsieh, D. and C. Manski (1987) Monte-Carlo Evidence on Adaptive Maximum Likelihood Estimation
of a Regression, Annals of Statistics, 15, 541-551.
Huber, P.J. (1967) The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions,
Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley,
University of California Press, 4, 221-233.
Huber, P.J. (1981) Robust Statistics. New York: Wiley.
Huber, P.J. (1984) Projection Pursuit, with discussion, Annals of Statistics, 13, 435-525.
Hurd, M. (1979) Estimation in Truncated Samples When There is Heteroskedasticity, Journal of
Econometrics, 11, 247-258.
Ichimura, H. (1992) Semiparametric Least Squares Estimation of Single Index Models, Journal of
Econometrics, forthcoming.
Ichimura, H. and L.-F. Lee (1991) Semiparametric Least Squares Estimation of Multiple Index Models:
Single Equation Estimation, in W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and
Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Imbens, G.W. (1992) An Efficient Method of Moments Estimator for Discrete Choice Models with
Choice-Based Sampling, Econometrica, 60, 1187-1214.
Jaeckel, L.A. (1972) Estimating Regression Coefficients by Minimizing the Dispersion of the Residuals,
Annals of Mathematical Statistics, 43, 1449-1458.
Jurečková, J. (1971) Nonparametric Estimate of Regression Coefficients, Annals of Mathematical
Statistics, 42, 1328-1338.
Kaplan, E.L. and P. Meier (1958) Nonparametric Estimation from Incomplete Data, Journal of the
American Statistical Association, 53, 457-481.
Kiefer, J. and J. Wolfowitz (1956) Consistency of the Maximum Likelihood Estimator in the Presence
of Infinitely Many Incidental Parameters, Annals of Mathematical Statistics, 27, 887-906.
Kim, J. and D. Pollard (1990) Cube Root Asymptotics, Annals of Statistics, 18, 191-219.
Klein, R.W. and R.H. Spady (1993) An Efficient Semiparametric Estimator for Discrete Choice Models,
Econometrica, 61, 387-421.
Koenker, R. and G.S. Bassett Jr. (1978) Regression Quantiles, Econometrica, 46, 33-50.
Koenker, R. and G.S. Bassett Jr. (1982) Robust Tests for Heteroscedasticity Based on Regression
Quantiles, Econometrica, 50, 43-61.
Koul, H., V. Susarla and J. Van Ryzin (1981) Regression Analysis with Randomly Right Censored
Data, Annals of Statistics, 9, 1276-1288.
Lancaster, T. (1990) The Econometric Analysis of Transition Data. Cambridge: Cambridge University
Press.
Laplace, P.S. (1793) Sur Quelques Points du Système du Monde, Mémoires de l'Académie Royale des
Sciences de Paris, Année 1789, 1-87.
Lee, L.F. (1982) Some Approaches to the Correction of Selectivity Bias, Review of Economic Studies,
49, 355-372.
Lee, L.F. (1991) Semiparametric Instrumental Variables Estimation of Simultaneous Equation Sample
Selection Models, manuscript, Department of Economics, University of Minnesota.
Lee, L.F. (1992) Semiparametric Nonlinear Least-Squares Estimation of Truncated Regression Models,
Econometric Theory, 8, 52-94.
Lee, M.J. (1989) Mode Regression, Journal of Econometrics, 42, 337-349.
Lee, M.J. (1992) Median Regression for Ordered Discrete Response, Journal of Econometrics, 51, 59-77.
Lee, M.J. (1993a) Winsorized Mean Estimator for Censored Regression Model, Econometric Theory,
forthcoming.
Lee, M.J. (1993b) Quadratic Mode Regression, Journal of Econometrics, forthcoming.
Levit, B.Y. (1975) On the Efficiency of a Class of Nonparametric Estimates, Theory of Probability and
Its Applications, 20, 723-740.
Li, K.C. and N. Duan (1989) Regression Analysis Under Link Violation, Annals of Statistics, 17,
1009-1052.
Linton, O.B. (1991) Second Order Approximation in Semiparametric Regression Models, manuscript,
Nuffield College, Oxford University.
Linton, O.B. (1992) Second Order Approximation in a Linear Regression with Heteroskedasticity of
Unknown Form, manuscript, Nuffield College, Oxford University.
MaCurdy, T.E. (1982) Using Information on the Moments of the Disturbance to Increase the Efficiency
of Estimation, Stanford University, manuscript.
Manski, C.F. (1975) Maximum Score Estimation of the Stochastic Utility Model of Choice, Journal of
Econometrics, 3, 205-228.
Manski, C. (1983) Closest Empirical Distribution Estimation, Econometrica, 51, 305-319.
Manski, C. (1984) Adaptive Estimation of Nonlinear Regression Models, Econometric Reviews, 3,
145-194.
Manski, C.F. (1985) Semiparametric Analysis of Discrete Response, Asymptotic Properties of the
Maximum Score Estimator, Journal of Econometrics, 27, 313-333.
Manski, C.F. (1987) Semiparametric Analysis of Random Effects Linear Models from Binary Panel
Data, Econometrica, 55, 357-362.
Manski, C.F. (1988a) Identification of Binary Response Models, Journal of the American Statistical
Association, 83, 729-738.
Manski, C.F. (1988b) Analog Estimation Methods in Econometrics. New York: Chapman and Hall.
Manski, C.F. and S. Lerman (1977) The Estimation of Choice Probabilities from Choice-Based Samples,
Econometrica, 45, 1977-1988.
Manski, C.F. and D.F. McFadden (1981) Alternative Estimators and Sample Designs for Discrete
Choice Analysis, in: C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with
Econometric Applications. Cambridge: MIT Press.
Manski, C.F. and T.S. Thompson (1986) Operational Characteristics of Maximum Score Estimation,
Journal of Econometrics, 32, 85-108.
McFadden, D.F. (1985) Specification of Econometric Models, Presidential Address, Fifth World
Congress of the Econometric Society.
McFadden, D.F. and A. Han (1987) Comment on Joel Horowitz and George Neumann Semiparametric
Estimation of Employment Duration Models, Econometric Reviews, 6, 257-270.
Melenberg, B. and A. Van Soest (1993) Semi-parametric Estimation of the Sample Selection Model,
manuscript, Department of Econometrics, Tilburg University.
Meyer, B. (1987) Semiparametric Estimation of Duration Models, Ph.D. dissertation, Department of
Economics, MIT.
Moon, C.-G. (1989) A Monte Carlo Comparison of Semiparametric Tobit Estimators, Journal of
Applied Econometrics, 4, 361-382.
Mroz, T.A. (1987) The Sensitivity of an Empirical Model of Married Women's Hours of Work to
Economic and Statistical Assumptions, Econometrica, 55, 765-799.
Nawata, K. (1990) Robust Estimation Based on Grouped-Adjusted Data in Censored Regression
Models, Journal of Econometrics, 43, 337-362.
Nawata, K. (1992) Semiparametric Estimation of Binary Choice Models Based on Medians of Grouped
Data, University of Tokyo, manuscript.
Newey, W.K. (1984) Nearly Efficient Moment Restriction Estimation of Regression Models with
Nonnormal Disturbances, Princeton University, Econometric Research Program Memo. No. 315.
Newey, W.K. (1985) Semiparametric Estimation of Limited Dependent Variable Models with Endo-
genous Explanatory Variables, Annales de l'INSEE, 59/60, 219-236.
Newey, W.K. (1987a) Efficient Estimation of Models with Conditional Moment Restrictions, Princeton
University, manuscript.
Newey, W.K. (1987b) Interval Moment Estimation of the Truncated Regression Model, manuscript,
Department of Economics, Princeton University, June.
Newey, W.K. (1987c) Specification Tests for Distributional Assumptions in the Tobit Model, Journal
of Econometrics, 34, 125-145.
Newey, W.K. (1988a) Adaptive Estimation of Regression Models Via Moment Restrictions, Journal
of Econometrics, 38, 301-339.
Newey, W.K. (1988b) Efficient Estimation of Semiparametric Models Via Moment Restrictions,
Princeton University, manuscript.
Newey, W.K. (1988c) Two-Step Series Estimation of Sample Selection Models, Princeton University,
manuscript.
Newey, W.K. (1989a) Efficient Estimation of Tobit Models Under Symmetry, in: W.A. Barnett, J.L.
Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics.
Cambridge: Cambridge University Press.
Newey, W.K. (1989b) Efficiency in Univariate Limited Dependent Variable Models Under Conditional
Moment Restrictions, Princeton University, manuscript.
Newey, W.K. (1989c) Efficient Instrumental Variables Estimation of Nonlinear Models, mimeo,
Princeton University.
Newey, W.K. (1989d) Uniform Convergence in Probability and Uniform Stochastic Equicontinuity,
mimeo, Department of Economics, Princeton University.
Newey, W.K. (1990a) Semiparametric Efficiency Bounds, Journal of Applied Econometrics, 5, 99-135.
Newey, W.K. (1990b) Efficient Instrumental Variables Estimation of Nonlinear Models, Econometrica,
58, 809-837.
Newey, W.K. (1991) The Asymptotic Variance of Semiparametric Estimators, Working Paper No.
583, Department of Economics, MIT, revised July.
Newey, W.K. and J.L. Powell (1990) Efficient Estimation of Linear and Type I Censored Regression
Models Under Conditional Quantile Restrictions, Econometric Theory, 6, 295-317.
Newey, W.K. and J.L. Powell (1991) Two-Step Estimation, Optimal Moment Conditions, and Sample
Selection Models, manuscript, Department of Economics, MIT, October.
Newey, W.K. and J.L. Powell (1993) Efficiency Bounds for Some Semiparametric Selection Models,
Journal of Econometrics, forthcoming.
Newey, W.K. and P. Ruud (1991) Density Weighted Least Squares Estimation, manuscript, Depart-
ment of Economics, MIT.
Newey, W.K. and T. Stoker (1989) Efficiency Properties of Average Derivative Estimators, manuscript,
Sloan School of Management, MIT.
Newey, W.K. and T.M. Stoker (1993) Efficiency of Weighted Average Derivative Estimators and Index
Models, Econometrica, 61, 1199-1223.
Newey, W.K., J.L. Powell and J.M. Walker (1990) Semiparametric Estimation of Selection Models:
Some Empirical Results, American Economic Review Papers and Proceedings, 80, 324-328.
Neyman, J. and E.L. Scott (1948) Consistent Estimates Based on Partially Consistent Observations,
Econometrica, 16, 1-32.
Nolan, D. and D. Pollard (1987) U-Processes, Rates of Convergence, Annals of Statistics, 15, 780-
799.
Nolan, D. and D. Pollard (1988) Functional Central Limit Theorems for U-Processes, Annals of
Probability, 16, 1291-1298.
Oakes, D. (1981) Survival Times: Aspects of Partial Likelihood, International Statistical Review, 49,
235-264.
Oberhofer, W. (1982) The Consistency of Nonlinear Regression Minimizing the L1 Norm, Annals of
Statistics, 10, 316-319.
Pakes, A. and D. Pollard (1989) Simulation and the Asymptotics of Optimization Estimators, Econo-
metrica, 57, 1027-1058.
Pollard, D. (1985) New Ways to Prove Central Limit Theorems, Econometric Theory, 1, 295-314.
Powell, J.L. (1983) The Asymptotic Normality of Two-Stage Least Absolute Deviations Estimators,
Econometrica, 51, 1569-1575.
Powell, J.L. (1984) Least Absolute Deviations Estimation for the Censored Regression Model, Journal
of Econometrics, 25, 303-325.
Powell, J.L. (1986a) Censored Regression Quantiles, Journal of Econometrics, 32, 143-155.
Powell, J.L. (1986b) Symmetrically Trimmed Least Squares Estimation of Tobit Models, Econometrica,
54, 1435-1460.
Powell, J.L. (1987) Semiparametric Estimation of Bivariate Latent Variable Models, Social Systems
Research Institute, University of Wisconsin-Madison, Working Paper No. 8704.
Powell, J.L. (1991) Estimation of Monotonic Regression Models Under Quantile Restrictions, in: W.A.
Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics
and Statistics, Cambridge: Cambridge University Press.
Powell, J.L. and T.M. Stoker (1991) Optimal Bandwidth Choice for Density-Weighted Averages,
manuscript, Department of Economics, Princeton University, December.
Powell, J.L., J.H. Stock and T.M. Stoker (1989) Semiparametric Estimation of Weighted Average
Derivatives, Econometrica, 57, 1403-1436.
Prakasa Rao, B.L.S. (1983) Nonparametric Functional Estimation. New York: Academic Press.
Rice, J. (1986) Convergence Rates for Partially Splined Estimates, Statistics and Probability Letters, 4,
203-208.
Rilstone, P. (1989) Semiparametric Estimation of Missing Data Models, mimeo, Department of
Economics, Laval University.
Ritov, Y. (1990) Estimation in a Linear Regression Model with Censored Data, Annals of Statistics,
18, 303-328.
Robinson, P. (1987) Asymptotically Efficient Estimation in the Presence of Heteroskedasticity of
Unknown Form, Econometrica, 55, 875-891.
Robinson, P. (1988a) Semiparametric Econometrics, A Survey, Journal of Applied Econometrics, 3,
35-51.
Robinson, P. (1988b) Root-N-Consistent Semiparametric Regression, Econometrica, 56, 931-954.
Rosenbaum, P.R. and D.B. Rubin (1983) The Central Role of the Propensity Score in Observational
Studies for Causal Effects, Biometrika, 70, 41-55.
Ruud, P. (1983) Sufficient Conditions for Consistency of Maximum Likelihood Estimation Despite
Misspecification of Distribution, Econometrica, 51, 225-228.
Ruud, P. (1986) Consistent Estimation of Limited Dependent Variable Models Despite Misspecification
of Distribution, Journal of Econometrics, 32, 157-187.
Schick, A. (1986) On Asymptotically Efficient Estimation in Semiparametric Models, Annals of
Statistics, 14, 1139-1151.
Serfling, R.J. (1980) Approximation Theorems of Mathematical Statistics, New York: Wiley.
Severini, T.A. and W.H. Wong (1987a) Profile Likelihood and Semiparametric Models, manuscript,
University of Chicago.
Severini, T.A. and W.H. Wong (1987b) Convergence Rates of Maximum Likelihood and Related
Estimates in General Parameter Spaces, Technical Report No. 207, Department of Statistics, University
of Chicago, Chicago, IL.
Sherman, R.P. (1990a) The Limiting Distribution of the Maximum Rank Correlation Estimator,
manuscript, Bell Communications Research.
Sherman, R.P. (1990b) Maximal Inequalities for Degenerate U-Processes with Applications to Optimi-
zation Estimators, manuscript, Bell Communications Research.
Sherman, R.P. (1993) The Limiting Distribution of the Maximum Rank Correlation Estimator,
Econometrica, 61, 123-137.
Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
Stein, C. (1956) Efficient Nonparametric Testing and Estimation, Proceedings of the Third Berkeley
Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley, University of California Press.
Stock, J.H. (1989) Nonparametric Policy Analysis, Journal of the American Statistical Association, 84,
567-575.
Stoker, T.M. (1986) Consistent Estimation of Scaled Coefficients, Econometrica, 54, 1461-1481.
Stoker, T.M. (1991) Equivalence of Direct, Indirect, and Slope Estimators of Average Derivatives, in:
W.A. Barnett, J.L. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in
Econometrics and Statistics. Cambridge: Cambridge University Press.
Stoker, T.M. (1992) Lectures on Semiparametric Econometrics. Louvain-La-Neuve, Belgium: CORE
Lecture Series.
Thompson, T.S. (1989a) Identification of Semiparametric Discrete Choice Models, manuscript, Depart-
ment of Economics, University of Minnesota.
Thompson, T.S. (1989b) Least Squares Estimation of Semiparametric Discrete Choice Models, manu-
script, Department of Economics, University of Minnesota.
Tobin, J. (1958) Estimation of Relationships for Limited Dependent Variables, Econometrica, 26,
24-36.
Wahba, G. (1984) Partial Spline Models for the Semiparametric Estimation of Functions of Several
Variables, in Statistical Analysis of Time Series. Tokyo, Institute of Statistical Mathematics.
White, H. (1982) Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50, l-26.
Ying, Z., S.H. Jung and L.J. Wei (1991) Survival Analysis with Median Regression Models, manuscript,
Department of Statistics, University of Illinois.
Zheng, Z. (1992) Efficiency Bounds for the Binary Choice and Sample Selection Models under Symmetry,
in Topics in Nonparametric and Semiparametric Analysis, Ph.D. dissertation, Princeton University.
Chapter 42

RESTRICTIONS OF ECONOMIC THEORY IN NONPARAMETRIC METHODS*

ROSA L. MATZKIN

Northwestern University

Contents

Abstract 2524
1. Introduction 2524
2. Identification of nonparametric models using economic restrictions 2528
2.1. Definition of nonparametric identification 2528
2.2. Identification of limited dependent variable models 2530
2.3. Identification of functions generating regression functions 2535
2.4. Identification of simultaneous equations models 2536
3. Nonparametric estimation using economic restrictions 2537
3.1. Estimators that depend on the shape of the estimated function 2538
3.2. Estimation using seminonparametric methods 2544
3.3. Estimation using weighted average methods 2546
4. Nonparametric tests using economic restrictions 2548
4.1. Nonstatistical tests 2548
4.2. Statistical tests 2551
5. Conclusions 2554
References 2554

*The support of the NSF through Grants SES-8900291 and SES-9122294 is gratefully acknowledged.
I am grateful to an editor, Daniel McFadden, and two referees, Charles Manski and James Powell, for
their comments and suggestions. I also wish to thank Don Andrews, Richard Briesch, James Heckman,
Bo Honoré, Vrinda Kadiyali, Ekaterini Kyriazidou, Whitney Newey and participants in seminars at the
University of Chicago, the University of Pennsylvania, Seoul University, Yonsei University and the
conference on Current Trends in Economics, Cephalonia, Greece, for their comments. This chapter was
partially written while the author was visiting MIT and the University of Chicago, whose warm
hospitality is gratefully appreciated.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved
Abstract

This chapter describes several nonparametric estimation and testing methods for
econometric models. Instead of using parametric assumptions on the functions and
distributions in an economic model, the methods use the restrictions that can be
derived from the model. Examples of such restrictions are the concavity and
monotonicity of functions, equilibrium conditions, and exclusion restrictions.
The chapter shows, first, how economic restrictions can guarantee the identifica-
tion of nonparametric functions in several structural models. It then describes how
shape restrictions can be used to estimate nonparametric functions using popular
methods for nonparametric estimation. Finally, the chapter describes how to test
nonparametrically the hypothesis that an economic model is correct and the
hypothesis that a nonparametric function satisfies some specified shape properties.

1. Introduction

Increasingly, it appears that restrictions implied by economic theory provide
extremely useful tools for developing nonparametric estimation and testing methods.
Unlike parametric methods, in which the functions and distributions in a model are
specified up to a finite dimensional vector, in nonparametric methods the functions
and distributions are left parametrically unspecified. The nonparametric functions
may be required to satisfy some properties, but these properties do not restrict them
to be within a parametric class.
Several econometric models, formerly requiring very restrictive parametric
assumptions, can now be estimated with minimal parametric assumptions, by
making use of the restrictions that economic theory implies on the functions of
those models. Similarly, tests of economic models that have previously been
performed using parametric structures, and hence were conditional on the para-
metric assumptions made, can now be performed using fewer parametric assump-
tions by using economic restrictions. This chapter describes some of the existing
results on the development of nonparametric methods using the restrictions of
economic theory.
Studying restrictions on the relationship between economic variables is one of
the most important objectives of economic theory. Without this study, one would
not be able to determine, for example, whether an increase in income will produce
an increase in consumption or whether a proportional increase in prices will
produce a similar proportional increase in profits. Examples of economic restrictions
that are used in nonparametric methods are the concavity, continuity and
monotonicity of functions, equilibrium conditions, and the implications of optimi-
zation on solution functions.
The usefulness of the restrictions of economic theory on parametric models is
by now well understood. Some restrictions can be used, for example, to decrease
the variance of parameter estimators, by requiring that the estimated values satisfy
the conditions that economic theory implies on the values of the parameters. Some
can be used to derive tests of economic models by testing whether the unrestricted
parameter estimates satisfy the conditions implied by the economic restrictions. And
some can be used to improve the quality of an extrapolation beyond the support
of the data.
In nonparametric models, economic restrictions can be used, as in parametric
models, to reduce the variance of estimators, to falsify theories, and to extrapolate
beyond the support of the data. But, in addition, some economic restrictions can
be used to guarantee the identification of some nonparametric models and the
consistency of some nonparametric estimators.
Suppose, for example, that we are interested in estimating the cost function a
typical, perfectly competitive firm faces when it undertakes a particular project, such
as the development of a new product. Suppose that the only available data are
independent observations on the price vector faced by the firm for the inputs
required to perform the project, and whether or not the firm decides to undertake
the project. Suppose that the revenue of the project for the typical firm is distributed
independently of the vector of input prices faced by that firm. The firm knows the
revenue it can get from the project, and it undertakes the project if its revenue
exceeds its cost. Then, using the convexity, monotonicity and homogeneity of degree
one¹ properties that economic theory implies on the cost function, one can identify
and estimate both the cost function of the typical firm and the distribution of
revenues, without imposing parametric assumptions on either of these functions
(Matzkin (1992)). This result requires, for normalization purposes, that the cost is
known at one particular vector of input prices.
Let us see how nonparametric estimators for the cost function and the distribution
of the revenue in the model described above can be obtained. Let (xl,. ,x) denote
the observed vectors of input prices faced by N randomly sampled firms possessing
the same cost function. These could be, for example, firms with the same R&D
technologies. Let y equal 0 if the ith sampled firm undertakes the project and
equal 1 otherwise (i = 1, . . . , N). Let us denote by k*(x) the cost of undertaking the
project when x is the vector of input prices and let us denote by E the revenue
associated with the project. Note that E > 0. The cumulative distribution function
of E will be denoted by F*. We assume that F* is strictly increasing over the non-
negative real numbers and the support of the probability distribution of x is IX,.
(Since we are assuming that E is independent of x, F* does not depend on x.)
According to the model, the probability that y= 1 given x is Pr(s ,< k*(x)) =
F*(k*(x)). The homogeneity of degree one of k* implies that k*(O) = 0. A necessary
normalization is imposed by requiring that k*(x*) = c(, where both x* and CYare
known; cr~lw.

¹A function h: X → R, where X ⊂ R^K is convex, is convex if for all x, y ∈ X and all λ ∈ [0, 1],
h(λx + (1 − λ)y) ≤ λh(x) + (1 − λ)h(y); h is homogeneous of degree one if for all x ∈ X and all λ > 0,
h(λx) = λh(x).
Nonparametric estimators for h* and F* can be obtained as follows. First, one
estimates the values that h* attains at each of the observed points x^1, ..., x^N and
one estimates the values that F* attains at h*(x^1), ..., h*(x^N). Second, one interpolates
between these values to obtain functions ĥ and F̂ that estimate, respectively, h* and
F*. The nonparametric functions ĥ and F̂ satisfy the properties that h* and F* are
known to possess. In our model, these properties are that h*(x*) = α, h* is convex,
homogeneous of degree one and monotone increasing, and F* is monotone
increasing and its values lie in the interval [0, 1].
The estimator for the finite dimensional vector {h*(x^1), ..., h*(x^N); F*(h*(x^1)), ...,
F*(h*(x^N))} is obtained by solving the following constrained log-likelihood
maximization problem:

maximize over {F^i}, {h^i}, {T^i}:   Σ_{i=1}^{N} [ y^i log(F^i) + (1 − y^i) log(1 − F^i) ]   (1)

subject to

F^i ≤ F^j if h^i ≤ h^j,   i, j = 1, ..., N,   (2)
0 ≤ F^i ≤ 1,   i = 1, ..., N,   (3)
h^i = T^i · x^i,   i = 0, ..., N + 1,   (4)
h^j ≥ T^i · x^j,   i, j = 0, ..., N + 1,   (5)
T^i ≥ 0,   i = 0, ..., N + 1.   (6)
In this problem, h^i is the value of a cost function h at x^i, T^i is the subgradient² of h
at x^i, and F^i is the value of a cumulative distribution at h^i (i = 1, ..., N); x^0 = 0,
x^{N+1} = x*, h^0 = 0, and h^{N+1} = α. The constraints (2)-(3) on F^1, ..., F^N
characterize the behavior that any distribution function must satisfy at any given
points h^1, ..., h^N in its domain. As we will see in Subsection 3.1, the constraints
(4)-(6) on the values h^0, ..., h^{N+1} and vectors T^0, ..., T^{N+1} characterize the
behavior that the values and subgradients of any convex, homogeneous of degree one,
and monotone function must satisfy at the points x^0, ..., x^{N+1}.

²If h: X → R is a convex function on a convex set X ⊂ R^K and x ∈ X, any vector T ∈ R^K such that
for all y ∈ X, h(y) ≥ h(x) + T·(y − x), is called a subgradient of h at x. If h is differentiable at x, the
gradient of h at x is the unique subgradient of h at x.
Matzkin (1993b) provides an algorithm to find a solution to the constrained
optimization problem above. The algorithm is based on a search over randomly
drawn points (h, T) = (h^0, ..., h^{N+1}; T^0, ..., T^{N+1}) that satisfy (4)-(6) and over
convex combinations of these points. Given any point (h, T) satisfying (4)-(6), the
optimal values of F^1, ..., F^N and the optimal value of the objective function given
(h, T) are calculated using the algorithm developed by Ayer et al. (1955). (See also
Cosslett (1983).) This algorithm divides the observations into groups, and assigns to
each F^i in a group the value equal to the proportion of observations within the group
with y^i = 1. The groups are obtained by first ordering the observations according to
the values of the h^i's. A group ends at observation i in the jth place and a new group
starts at observation k in the (j + 1)th place iff y^i = 0 and y^k = 1. If the values of the
F^i's corresponding to two adjacent groups are not in increasing order, the two
groups are merged. This merging process is repeated until the values of the F^i's are
in increasing order. To randomly generate points (h, T), several methods can be used,
but the most critical one proceeds by drawing N + 2 homogeneous and monotone
linear functions and then letting (h, T) be the vector of values and subgradients of
the function that is the maximum of those N + 2 linear functions. The coefficients
of the N + 2 linear functions are drawn so that one of the functions attains the value
α at x* and the other functions attain a value smaller than α at x*.
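To make the grouping-and-merging step concrete, here is a minimal Python sketch of the Ayer et al. (1955) pool-adjacent-violators computation; the function name, the array representation, and the treatment of ties are our own illustrative choices, not code from the chapter.

```python
import numpy as np

def pava_cdf_values(h, y):
    # Given candidate cost values h[i] and binary outcomes y[i], return the
    # maximum likelihood values F[i], monotone in h: order the observations
    # by h, form groups, and merge adjacent groups whose proportions of
    # y = 1 are not in increasing order.
    order = np.argsort(h)
    y_sorted = np.asarray(y, dtype=float)[order]
    sums, counts = [], []
    for yi in y_sorted:
        s, c = yi, 1.0
        # merge while the previous group's proportion exceeds the current one
        while sums and sums[-1] * c > s * counts[-1]:
            s += sums.pop()
            c += counts.pop()
        sums.append(s)
        counts.append(c)
    F_sorted = np.concatenate(
        [np.full(int(c), s / c) for s, c in zip(sums, counts)])
    F = np.empty(len(y_sorted))
    F[order] = F_sorted
    return F
```

Each returned F[i] is the proportion of y = 1 observations in its (merged) group, so constraints (2) and (3) hold by construction, and the log-likelihood objective can then be evaluated at these values for each candidate (h, T).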
To interpolate between a solution (ĥ^1, ..., ĥ^N; T̂^0, ..., T̂^{N+1}; F̂^1, ..., F̂^N), one can
use different interpolation methods. One possible method proceeds by interpolating
linearly between F̂^1, ..., F̂^N to obtain a function F̂ and using the following inter-
polation for ĥ:

ĥ(x) = max{T̂^i · x | i = 0, ..., N + 1}.
Figure 1 presents some value sets of this nonparametric estimator ĥ when x ∈ R^2_+.
For contrast, Figure 2 presents some value sets for a parametric estimator for h*
that is specified to be linear in a parameter β and x.
At this stage, several questions about the nonparametric estimator described
above may be in the reader's mind. For example, how do we know whether these
estimators are consistent? More fundamentally, how can the functions h* and F*
be identified when no parametric specification is imposed on them? And, if they are
identified, is the estimation method described above the only one that can be used
to estimate the nonparametric model? These and several other related questions
will be answered for the model described above and for other popular models.
In Section 2 we will see first what it means for a nonparametric function to be
identified. We will also see how restrictions of economic theory can be used to
identify nonparametric functions in three popular types of models.

Figure 1

Figure 2

In Section 3, we will consider various methods for estimating nonparametric
functions and we will see how properties such as concavity, monotonicity, and
homogeneity of degree one can be incorporated into those estimation methods.
Besides estimation methods like the one described above, we will also consider
seminonparametric methods and weighted average methods.
In Section 4, we will describe some nonparametric tests that use restrictions of
economic theory. We will be concerned with both nonstatistical as well as statistical
tests. The nonstatistical tests assume that the data is observed without error and
the variables in the models are nonrandom. Samuelson's Weak Axiom of Revealed
Preference is an example of such a nonparametric test.
Section 5 presents a short summary of the main conclusions of the chapter.

2. Identification of nonparametric models using economic restrictions

2.1. Definition of nonparametric identification

Formally, an econometric model is specified by a vector of functionally dependent
and independent observable variables, a vector of functionally dependent and
independent unobservable variables, a set of known functional relationships among
the variables, and a set of restrictions on the unknown functions and distributions.
In the example that we have been considering, the observable and unobservable
independent variables are, respectively, x ∈ R^K_+ and ε ∈ R_+. A binary variable, y,
that takes the value zero if the firm undertakes the project and takes the value 1 otherwise
is the observable dependent variable. The profit of the firm if it undertakes the
project is the unobservable dependent variable, y*. The known functional relation-
ships among these variables are that y* = ε − h*(x) and that y = 0 when y* > 0 and
y = 1 otherwise. The restrictions on the functions and distributions are that h* is
continuous, convex, homogeneous of degree one, monotone increasing and attains
the value α at x*; the joint distribution, G, of (x, ε) has as its support the set R^{K+1}_+,
and it is such that ε and x are independently distributed.
The restrictions imposed on the unknown functions and distributions in an
econometric model define the set of functions and distributions to which these
belong. For example, in the econometric model described above, h* belongs to the
set of continuous, convex, homogeneous of degree one, monotone increasing
functions that attain the value α at x*, and G belongs to the set of distributions of
(x, ε) that have support R^{K+1}_+ and satisfy the restriction that x and ε are
independently distributed.
One of the main objectives of specifying an econometric model is to uncover the
hidden functions and distributions that drive the behavior of the observable
variables in the model. The identification analysis of a model studies what functions,
or features of functions, can be recovered from the joint distribution of the observ-
able variables in the model.
Knowing the hidden functions, or some features of the hidden functions, in a
model is necessary, for example, to study properties of these functions or to predict
the behavior of other variables that are also driven by these functions. In the model
considered in the introduction, for example, one can use knowledge about the cost
function of a typical firm to infer properties of the production function of the firm
or to calculate the cost of the firm under a nonperfectly competitive situation.
Let M denote a set of vectors of functions such that each function and distribution
in an econometric model corresponds to a coordinate of the vectors in M. Suppose
that the vector, m*, whose coordinates are the true functions and distribution in the
model belongs to M. We say that we can identify within M the functions and distri-
butions in the model, from the joint distribution of the observable variables, if no
other vector m in M can generate the same joint distribution of the observable
variables. We next define this notion formally.
Let m* denote the vector of the unknown functions and distributions in an
econometric model. Let M denote the set to which m* is known to belong. For each
m ∈ M let P(m) denote the joint distribution of the observable variables in the model
when m* is substituted by m. Then, the vector of functions m* is identified within M
if for any vector m ∈ M such that m ≠ m*, P(m) ≠ P(m*).
One may consider studying the recoverability of some feature, C(m*), of m*, such
as the sign of some coordinate of m*, or one may consider the recoverability of some
subvector, m_1*, of m*, where m* = (m_1*, m_2*). A feature is identified if a different
value of the feature generates a different probability distribution of the observable
variables. A subvector is identified if, given any possible remaining unknown
functions, any subvector that is different can not generate the same joint distribution
of the observable variables.
Formally, the feature C(m*) of m* is identified within the set {C(m) | m ∈ M} if
for all m ∈ M such that C(m) ≠ C(m*), P(m) ≠ P(m*). The subvector m_1* is identified
within M_1, where M = M_1 × M_2, m_1* ∈ M_1, and m_2* ∈ M_2, if for all m_1 ∈ M_1 such
that m_1 ≠ m_1*, it follows that for all m_2, m_2′ ∈ M_2, P(m_1*, m_2′) ≠ P(m_1, m_2).
When the restrictions of an econometric model specify all functions and distri-
butions up to the value of a finite dimensional vector, the model is said to be
parametric. When some of the functions or distributions are left parametrically un-
specified, the model is said to be semiparametric. The model is nonparametric if
none of the functions and distributions are specified parametrically. For example,
in a nonparametric model, a certain distribution may be required to possess zero
mean and finite variance, while in a parametric model the same distribution may
be required to be a Normal distribution.
Analyzing the identification of a nonparametric econometric model is useful for
several reasons. To establish whether a consistent estimator can be developed for
a specific nonparametric function in the model, it is essential to determine first
whether the nonparametric function can be identified from the population behavior
of observable variables. To single out the recoverability properties that are solely
due to a particular parametric specification being imposed on a model, one has to
analyze first what can be recovered without imposing that parametric specification.
To determine what sets of parametric or nonparametric restrictions can be used to
identify a model, it is important to analyze the identification of the model first
without, or with as few as possible, restrictions.
Imposing restrictions on a model, whether they are parametric or nonparametric,
is typically not desirable unless those restrictions are justified. While some amount
of unjustified restrictions is typically unavoidable, imposing the restrictions that
economic theory implies on some models is not only desirable but also, as we will
see, very useful.
Consider again the model of the firm that considers whether to undertake a
project. Let us see how the properties of the cost function allow us to identify the
cost function of the firm and the distribution of the revenue from the conditional
distribution of the binary variable y given the vector of input prices x. To simplify
our argument, let us assume that F* is continuous. Recall that F* is assumed to be
strictly increasing and the support of the probability measure of x is R^K_+. Let g(x)
denote Pr(y = 1 | x). Then, g(x) = F*(h*(x)) is a continuous function whose values
on R^K_+ can be identified from the joint distribution of (x, y). To see that F* can be
recovered from g, note that since h*(x*) = α and h* is a homogeneous of degree one
function, for any t ∈ R_+, F*(t) = F*((t/α) α) = F*((t/α) h*(x*)) = F*(h*((t/α) x*)) =
g((t/α) x*). Next, to see that h* can be recovered from g and F*, we note that for
any x ∈ R^K_+, h*(x) = (F*)^{-1}(g(x)). So, we can recover both h* and F* from the
observable function g. Any other pair (h, F) satisfying the same properties as (h*, F*)
but with h ≠ h* or F ≠ F* will generate a different continuous function g. So, (h*, F*)
is identified.
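The recovery argument is easy to check numerically. In the sketch below the cost function, the revenue distribution, and all numerical values are hypothetical choices made only to illustrate the identities F*(t) = g((t/α)x*) and h*(x) = (F*)^{-1}(g(x)); they are not part of the chapter's analysis.

```python
import numpy as np

# Hypothetical primitives: h*(x) = x1 + 2*x2 is continuous, convex, monotone
# increasing and homogeneous of degree one; eps ~ Exp(1), so F*(t) = 1 - exp(-t).
h_star = lambda x: x[0] + 2.0 * x[1]
F_star = lambda t: 1.0 - np.exp(-t)
g = lambda x: F_star(h_star(x))          # observable: g(x) = Pr(y = 1 | x)

x_star = np.array([1.0, 1.0])
alpha = h_star(x_star)                   # normalization: h*(x*) = alpha

# F* is recovered from g along the ray through x*:
t = 0.7
assert np.isclose(g((t / alpha) * x_star), F_star(t))

# h* is recovered as (F*)^{-1}(g(x)); for Exp(1), (F*)^{-1}(u) = -log(1 - u):
x = np.array([0.3, 1.2])
assert np.isclose(-np.log(1.0 - g(x)), h_star(x))
```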
In the next subsections, we will see how economic restrictions can be used to
identify other models.

2.2. Identification of limited dependent variable models

Limited dependent variable (LDV) models have been extensively used to analyze
microeconomic data such as labor force participation, school choice, and purchase
of commodities.
A typical LDV model can be described by a pair of functional relationships,

y = G(y*)

and

y* = D(h*(x), ε),

where y is an observable dependent vector, which is a transformation, G, of an
unobservable dependent vector, y*. The vector y* is a transformation, D, of the
value that a function, h*, attains at a vector of observable variables, x, and the value
of an unobservable vector, ε.
In most popular examples, the function D is additively separable into the value
of h* and ε. The model of the firm that we have been considering satisfies this
restriction. Popular cases of G are the binary threshold crossing model

y = 1 if y* ≥ 0 and y = 0 otherwise,

and the tobit model

y = y* if y* ≥ 0 and y = 0 otherwise.

2.2.1. Generalized regression models

Typically, the function h* is the object of most interest in LDV models, since it
aggregates the influence of the vector of observable explanatory variables, x. It is
therefore of interest to ask what can be learned about h* when G and D are unknown
and the distribution of ε is also unknown. An answer to this question has been
provided by Matzkin (1994) for the case in which y, y*, h*(x), and ε are real valued,
ε is distributed independently of x, and G∘D is nondecreasing and nonconstant.
Roughly, the result is that h* is identified up to a strictly increasing transformation.
Formally, we can state the following result (see Matzkin (1990b, 1991c, 1994)).

Theorem. Identification of h* in generalized regression models

Suppose that
(i) G∘D: R^2 → R is monotone increasing and nonconstant,
(ii) h*: X → R, where X ⊂ R^K, belongs to a set W of functions h: X → R that are
continuous and strictly increasing in the Kth coordinate of x,
(iii) ε ∈ R is distributed independently of x,
(iv) the conditional probability of the Kth coordinate of x has a Lebesgue density
that is everywhere positive, conditional on the other coordinates of x,
(v) for any x, x′ in X such that h*(x) < h*(x′) there exists t ∈ R such that
Pr[G∘D(h*(x), ε) ≤ t] > Pr[G∘D(h*(x′), ε) ≤ t], where the probability is taken
with respect to the probability measure of ε, and
(vi) the support of the marginal distribution of x includes X.
Then, h* is identified within W if and only if no two functions in W are strictly
increasing transformations of each other.

Assumptions (i) and (iii) guarantee that increasing values of h*(x) generate non-
increasing values of the probability of y given x. Assumption (v) slightly strengthens
this, guaranteeing that variations in the value of h* are translated into variations
in the values of the conditional distribution of y given x. Assumption (ii) implies
that whenever two functions are not strictly increasing transformations of each
other, we can find two neighborhoods at which each function attains different values
from the other function. Assumptions (iv) and (vi) guarantee that those neighbor-
hoods have positive probability.
Note the generality of the result. One may be considering a very complicated
model determining the way by which an observable vector x influences the value
of an observable variable y. If the influence of x can be aggregated by the value of
a function h*, the unobservable random variable E in the model is distributed
independently of x, and both h* and E influence y in a nondecreasing way, then
one can identify the aggregator function h* up to a strictly increasing transfor-
mation.
The identification of a more general model, where ε is not necessarily independent
of x, h* is a vector of functions, and G∘D is not necessarily monotone increasing on
its domain has not yet been studied.
For the result of the above theorem to have any practicality, one needs to find
sets of functions that are such that no two functions are strictly increasing trans-
formations of each other. When the functions are linear in a finite dimensional
parameter, say h(x) = β·x, one can guarantee this by requiring, for example, that
||β|| = 1 or β_K = 1, where β = (β_1, ..., β_K). When the functions are nonparametric,
one can use the restrictions of economic theory.
The set of homogeneous of degree one functions that attain a given value, α, at a
given point, x*, for example, is such that no two functions are strictly increasing
transformations of each other. To see this, suppose that h and h′ are in this set and
for some strictly increasing function f, h′ = f∘h; then since h(λx*) = h′(λx*) for each
λ ≥ 0, it follows that f(t) = f(α(t/α)) = f(h((t/α) x*)) = h′((t/α) x*) = t. So, f is the
identity function. It follows that h = h′.
Matzkin (1990b, 1993a) shows that the set of least-concave³ functions that attain
common values at two points in their domain is also a set such that no two functions
in the set are strictly increasing transformations of each other. The sets of additively
separable functions described in Matzkin (1992, 1993a) also satisfy this requirement.
Other sets of restrictions that could also be used remain to be studied.

³A function v: X → R, where X is a convex subset of R^K, is least-concave if it is concave and if any
concave function, u, that can be written as a strictly increasing transformation, f, of v can also be written
as a concave transformation, g, of v. For example, v(x_1, x_2) = (x_1·x_2)^{1/2} is least-concave, but
u(x_1, x_2) = log(x_1) + log(x_2) is not.
Summarizing, we have shown that restrictions of economic theory can be used
to identify the aggregator function h* in LDV models where the functions D and G
are unknown. In the next subsections we will see how much more can be recovered
in some particular models where the functions D and G are known.

2.2.2. Binary threshold crossing models

A particular case of a generalized regression model where G and D are known is
the binary threshold crossing model. This model is widely used not only in
economics but in other sciences, such as biology, physics, and medicine, as well. The
books by Cox (1970), Finney (1971) and Maddala (1983), among others, describe
several empirical applications of these models. The semi- and nonparametric
identification and estimation of these models has been studied, among others, by
Cosslett (1983), Han (1987), Horowitz (1992), Hotz and Miller (1989), Ichimura
(1993), Klein and Spady (1993), Manski (1975, 1985, 1988), Matzkin (1990b, 1990c,
1992), Powell et al. (1989), Stoker (1986) and Thompson (1989).
The following theorem has been shown in Matzkin (1994):

Theorem. Identification of (h*, F*) in a binary choice model

Suppose that
(i) y* = h*(x) + ε; y = 1 if y* ≥ 0, y = 0 otherwise,
(ii) h*: X → R, where X ⊂ R^K, belongs to a set W of functions h: X → R that are
continuous and strictly increasing in the Kth coordinate of x,
(iii) ε is distributed independently of x,
(iv) the conditional probability of the Kth coordinate of x has a Lebesgue density
that is everywhere positive, conditional on the other coordinates of x,
(v) F*, the cumulative distribution function (cdf) of ε, is strictly increasing, and
(vi) the support of the marginal distribution of x is included in X.
Let Γ denote the set of monotone increasing functions on R with values in the
interval [0, 1]. Then, (h*, F*) is identified within (W × Γ) if and only if W is a set of
functions such that no two functions in W are strictly increasing transformations
of each other.

Assumptions (ii)-(iv) and (vi) are the same as in the previous theorem and they
play the same role here as they did there. Assumptions (i) and (v) guarantee that
assumptions (i) and (v) in the previous theorem are satisfied. They also guarantee
that the cdf F* is identified when h* is identified.
Note that the set of functions W within which h* is identified satisfies the same
properties as the set in the previous theorem. So, one can use sets of homogeneous
of degree one functions, least-concave functions, and additively separable functions
to guarantee the identification of h* and F* in binary threshold crossing models.
2.2.3. Discrete choice models

Discrete choice models have been extensively used in economics since the pioneering
work of McFadden (1974, 1981). The choice among modes of transportation, the
choice among occupations, and the choice among appliances have, for example,
been studied using these models. See, for example, Maddala (1983), for an extensive
list of empirical applications of these models.
In discrete choice models, a typical agent chooses one alternative from a set
A = {1, ..., J} of alternatives. The agent possesses an observable vector, s ∈ S, of
socioeconomic characteristics. Each alternative j in A is characterized by a vector
of observable attributes z_j ∈ Z, which may be different for each agent. For each
alternative j ∈ A, the agent's preferences for alternative j are represented by the value
of a random function U defined by U(j) = V*(j, s, z_j) + ε_j, where ε_j is an unobservable
random term. The agent is assumed to choose the alternative that maximizes his
utility; i.e., he is assumed to choose alternative j iff

V*(j, s, z_j) + ε_j > V*(k, s, z_k) + ε_k,   for k = 1, ..., J; k ≠ j.

(We are assuming that the probability of a tie is zero.)
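To illustrate the choice rule, the following sketch simulates choice frequencies for a hypothetical V*; the Gumbel draws for the ε_j's are purely an example, since the identification analysis below does not rest on any such parametric choice.

```python
import numpy as np

def choice_frequencies(V, s, z, n_draws=100_000, seed=0):
    # V: hypothetical systematic utility V(j, s, z_j); the agent chooses the
    # alternative j maximizing V(j, s, z[j]) + eps_j over j = 0, ..., J - 1.
    rng = np.random.default_rng(seed)
    v = np.array([V(j, s, zj) for j, zj in enumerate(z)])
    eps = rng.gumbel(size=(n_draws, len(z)))   # illustrative error draws
    picks = np.argmax(v + eps, axis=1)
    return np.bincount(picks, minlength=len(z)) / n_draws

# Example with three alternatives, scalar attributes and characteristic:
freqs = choice_frequencies(lambda j, s, zj: -0.5 * zj + 0.1 * s * j,
                           s=2.0, z=[1.0, 2.0, 3.0])
```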


The identification of these models concerns the unknown function V* and the
distribution of the unobservable random vector ε = (ε_1, ..., ε_J). The observable
variables are the chosen alternatives, the vector s of socioeconomic characteristics,
and the vector z = (z_1, ..., z_J) of attributes of the alternatives. The papers by Strauss
(1979), Yellott (1977) and those mentioned in the previous subsection concern the
nonparametric and semiparametric identification of discrete choice models.
A result in Matzkin (1993a) concerns the identification of V* when the distri-
bution of the vector of unobservable variables (ε_1, ..., ε_J) is allowed to depend on
the vector of observable variables (s, z_1, ..., z_J). Letting (ε_1, ..., ε_J) depend on (s, z)
is important because there is evidence that the estimators for discrete choice models
may be very sensitive to heteroskedasticity of ε (Hausman and Wise (1978)). The
identification result is obtained using the assumptions that (i) the V*(j, ·) functions
are continuous and the same for all j, i.e. there exists v* such that for all j,
V*(j, s, z_j) = v*(s, z_j), and (ii), conditional on (s, z_1, ..., z_J), the ε_j's are i.i.d.⁴
Matzkin (1993a) shows that a sufficient condition for v*: S × Z → R to be identified
within a set of continuous functions W is that for any two functions v, v′ in W there
exists a vector s such that v(s, ·) is not a strictly increasing transformation of v′(s, ·).
So, for example, when the functions v: S × Z → R in W are such that for each s,
v(s, ·) is homogeneous of degree one, continuous, convex and attains a value α at
some given vector z*, one can identify the function v*.

⁴Manski (1975, 1985) used this conditional independence assumption to analyze the identification of
semiparametric discrete choice models.
A second result in Matzkin (1993a) extends techniques developed by Yellott (1977)
and Strauss (1979). The result is obtained under the assumption that the distribution
of ε is independent of the vector (s, z). It is shown that using shape restrictions on
the distribution of ε and on the function V*, one can recover the distribution of the
vector (ε_2 − ε_1, ..., ε_J − ε_1) and the V*(j, ·) functions over some subset of their
domain. The restrictions on V* involve knowing its values at some points and
requiring that V* attains low enough values over some sections of its domain. For
example, Matzkin (1993a) shows that when V* is a monotone increasing and
concave function whose values are known at some points, V* can be identified over
some subset of its domain.
The nonparametric identification of discrete choice models under other non-
parametric assumptions on the distribution of the ε's remains to be studied.

2.3. Identification of functions generating regression functions

Several models in economics are specified by the functional relation

y = f*(x) + ε,   (7)

where x and ε are, respectively, vectors of observable and unobservable functionally
independent variables, and y is the observable vector of dependent variables.
Under some weak assumptions, the function f*: X → R can be recovered from
the joint distribution of (x, y) without need of specifying any parametric structure
for f*. To see this, suppose that E(ε|x) = 0 a.s.; then E(y|x) = f*(x) a.s. Hence, if
f* is continuous and the support of the marginal distribution of x includes the
domain of f*, we can recover f*. A similar result can be obtained making other
assumptions on the conditional distribution of ε, such as Median(ε|x) = 0 a.s.
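For instance, a weighted average (kernel) estimator of the conditional mean, of the kind discussed in Section 3.3, consistently estimates f* under standard conditions. The following is a minimal sketch; the Gaussian kernel and the fixed bandwidth are illustrative choices, not prescriptions of the chapter.

```python
import numpy as np

def conditional_mean(x0, X, Y, bandwidth):
    # Nadaraya-Watson estimate of f*(x0) = E[y | x = x0]: a weighted average
    # of the Y[i]'s, with weights shrinking in the distance from x0 to X[i].
    X = np.atleast_2d(X)
    w = np.exp(-0.5 * np.sum(((X - np.asarray(x0)) / bandwidth) ** 2, axis=1))
    return float(np.sum(w * np.asarray(Y)) / np.sum(w))
```

Consistency requires the bandwidth to shrink with the sample size, slowly enough that the effective number of observations near x0 still grows.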
In most cases, however, the object of interest is not a conditional mean (or a
conditional median) function f *, but some deeper function, such as a utility
function generating the distribution of demand for commodities by a consumer, or
a production function generating the distribution of profits of a particular firm. In
these cases, one could still recover these deeper functions, as long as they influence
f *. This requires using results of economic theory about the properties that f *
needs to satisfy.
For example, suppose that in the model (7) with E(ε|x) = 0, x is a vector (p, I) of
prices of K commodities and income of a consumer, and the function f* denotes
for each (p, I) the vector of commodities that maximizes the consumer's utility
function U* over the budget set {z ≥ 0 | p·z ≤ I}; ε denotes a measurement error.
Then, imposing theoretical restrictions on f* we can guarantee that the preferences
represented by U* can be recovered from f*. Moreover, since f* can be recovered
from the joint distribution of (y, p, I), it follows that U* can also be recovered from
this distribution. Hence, U* is identified. The required theoretical restrictions on
f* have been developed by Mas-Colell (1977).
Theorem. Recoverability of utility functions from demand functions (Mas-Colell
(1977))

Let W denote a set of monotone increasing, continuous, concave and strictly quasi-
concave functions such that no two functions in W are strictly increasing transfor-
mations of each other. For any U ∈ W, let f(p, I; U) denote the demand function
generated by U, where p ∈ ℝ^K_{++} denotes a vector of prices and I ∈ ℝ_{++} denotes a
consumer's income. Then, for any U, U′ in W such that U ≠ U′, one has that
f(·, ·; U) ≠ f(·, ·; U′).
This result states that different utility functions generate different demand
functions when the set of all possible values of the vector (p, I) is ℝ^{K+1}_{++}. The
assumption that the utility functions in the set W are concave is the critical
assumption guaranteeing that the same demand function cannot be generated from
two different utility functions in the set W.
Mas-Colell (1978) shows that, under certain regularity conditions, one can
construct the preferences represented by U* by taking the limit, with respect to an
appropriate distance function, of a sequence of preferences. The sequence is
constructed by letting {pⁱ, Iⁱ} be a sequence that becomes dense in ℝ^{K+1}_{++}. For
each N, a utility function V_N is constructed using Afriat's (1967a) construction:

$$V_N(z) = \min\{\, V^i + \lambda^i p^i \cdot (z - z^i) \mid i = 1, \ldots, N \,\},$$

where zⁱ = f*(pⁱ, Iⁱ) and the Vⁱ's and λⁱ's are any numbers satisfying the inequalities

$$V^i \le V^j + \lambda^j p^j \cdot (z^i - z^j), \qquad i, j = 1, \ldots, N,$$
$$\lambda^i > 0, \qquad i = 1, \ldots, N.$$

The preference relation represented by U* is the limit of the sequence of preference
relations represented by the functions V_N as N goes to ∞.
Summarizing, we have shown that using Mas-Colell's (1977) result about the
recoverability of utility functions from demand functions, we can identify a utility
function from the distribution of its demand.
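To make the construction concrete: the inequalities above form a small linear program, and V_N is a pointwise minimum, so both steps are directly computable. The following is a minimal sketch (our own illustration, not code from the sources cited); it normalizes λⁱ ≥ 1, which is harmless because the inequalities are homogeneous of degree one in (V, λ), and all function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def afriat_numbers(p, z):
    """Find V^i and lambda^i satisfying V^i <= V^j + lambda^j p^j.(z^i - z^j)."""
    N = len(z)
    A_ub, b_ub = [], []                       # variables: (V_1..V_N, lam_1..lam_N)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue
            row = np.zeros(2 * N)
            row[i], row[j] = 1.0, -1.0        # V_i - V_j
            row[N + j] = -(p[j] @ (z[i] - z[j]))
            A_ub.append(row)
            b_ub.append(0.0)
    bounds = [(None, None)] * N + [(1.0, None)] * N   # lambda_i >= 1 (normalization)
    res = linprog(np.ones(2 * N), A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    if not res.success:
        raise ValueError("no solution: the data violate the Afriat inequalities")
    return res.x[:N], res.x[N:]

def V_N(x, p, z, V, lam):
    """Evaluate V_N(x) = min_i { V^i + lambda^i p^i.(x - z^i) }."""
    return min(V[i] + lam[i] * (p[i] @ (x - z[i])) for i in range(len(z)))
```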
Following a procedure similar to the one described above, one could obtain non-
parametric identification results for other models of economic theory. Brown and
Matzkin (1991) followed this path to show that the preferences of heterogeneous
consumers in a pure exchange economy can be identified from the conditional
distribution of equilibrium prices given the endowments of the consumers.

2.4. Identification of simultaneous equations models

Restrictions of economic theory can also be used to identify the structural equations
of a system of nonparametric simultaneous equations. In particular, when the
functions in the system of equations are continuously differentiable, this could be
done by determining what type of restrictions guarantee that a given matrix is of
full rank. This matrix is presented in Roehrig (1988).
Following Roehrig, let us describe a system of structural equations by

$$r^*(x, y) - u = 0,$$

where x ∈ ℝ^K, y, u ∈ ℝ^G, and r*: ℝ^K × ℝ^G → ℝ^G; y denotes a vector of observable
endogenous variables, x denotes a vector of observable exogenous variables, and
u denotes a vector of unobservable exogenous variables. Let φ* denote the joint
distribution of (x, u).
Suppose that (i) for all (x, y), ∂r*/∂y is of full rank, (ii) there exists a function π such
that y = π(x, u), and (iii) φ* is such that u is distributed independently of x. Let (r, φ)
be another pair satisfying these same conditions. Then, under certain assumptions
on the support of the probability measures, Roehrig (1988) shows that a necessary
and sufficient condition guaranteeing that P(r*, φ*) = P(r, φ) is that for all i = 1, …, G
and all (x, y) the rank of the matrix

is less than G + 1. In the above expression, rᵢ denotes the ith coordinate function of
r and P(r, φ) is the joint distribution of the observable vectors (x, y) when (r*, φ*)
is substituted with (r, φ).
Consider, for example, a simple system of a demand and a supply function
described by

$$q = d(I, p, w) + \varepsilon_d,$$
$$p = s(w, q, I) + \varepsilon_s,$$

where q denotes quantity, p denotes price, I denotes the income of the consumers
and w denotes input price. Then, using the restrictions of economic theory that
∂d/∂w = 0, ∂s/∂I = 0, ∂d/∂I ≠ 0 and ∂s/∂w ≠ 0, one can show that both the demand
function and the supply function are identified up to additive constants.
Kadiyali (1993) provides a more complicated example where Roehrig's (1988)
conditions are used to determine when the cost and demand functions of the firms
in a duopolistic market are nonparametrically identified. I am not aware of any
other work that has used these conditions to identify a nonparametric model.

3. Nonparametric estimation using economic restrictions

Once it has been established that a function can be identified nonparametrically,
one can proceed to develop nonparametric estimators for that function. Several
methods exist for nonparametrically estimating a given function. In the following
methods exist for nonparametrically estimating a given function. In the following
subsections we will describe some of these methods. In particular, we will be
concerned with the use of these methods to estimate nonparametric functions
subject to restrictions of economic theory. We will be concerned only with
independent observations.
Imposing restrictions of economic theory on an estimator of a function may be
necessary to guarantee the identification of the function being estimated, as in the
models described in the previous section. They may also be used to reduce the
variance of the estimators. Or, they may be imposed to guarantee that the results
are meaningful, such as guaranteeing that an estimated demand function is down-
wards sloping. Moreover, for some nonparametric estimators, imposing shape
restrictions is critical for the feasibility of their use. It is to these estimators that we
turn next.

3.1. Estimators that depend on the shape of the estimated function

When a function that one wants to estimate satisfies certain shape properties, such
as monotonicity and concavity, one can use those properties to estimate the function
nonparametrically. The main practical tool for obtaining these estimators is the
possibility of using the shape properties of the nonparametric function to charac-
terize the set of values that it can attain at any finite number of points in its domain.
The estimation method proceeds by, first, estimating the values (and possibly the
gradients or subgradients) of the nonparametric function at a finite number of points
of its domain, and second, interpolating among the obtained values. The estimators
in the first step are subject to the restrictions implied by the shape properties of the
function. The interpolated function in the second step satisfies those same shape
properties.
The estimator presented in the introduction was obtained using this method. In
that case, the constraints on the vector (h⁰, …, h^{N+1}; T⁰, …, T^{N+1}) of values and
subgradients of a convex, homogeneous of degree one, and monotone function were

$$h^i = T^i \cdot x^i, \qquad i = 0, \ldots, N+1, \qquad (4)$$

$$h^i \ge T^j \cdot x^i, \qquad i, j = 0, \ldots, N+1, \qquad (5)$$

$$T^i \ge 0, \qquad i = 0, \ldots, N+1. \qquad (6)$$

The constraints on the vector (F¹, …, F^N) of values of a cdf were

$$F^i \le F^j \quad \text{if } h^i \le h^j, \qquad i, j = 1, \ldots, N, \qquad (2)$$

$$0 \le F^i \le 1, \qquad i = 1, \ldots, N. \qquad (3)$$

The necessity of the first set of constraints follows by definition. A function h: X → ℝ,
where X is an open and convex set in ℝ^K, is convex if and only if for all x ∈ X there
exists T(x) ∈ ℝ^K such that for all y ∈ X, h(y) ≥ h(x) + T(x)·(y − x). Let h be a convex
function and T(x) a subgradient of h at x; h is homogeneous of degree one if and
only if h(x) = T(x)·x, and h is monotone increasing if and only if T(x) ≥ 0. Letting
x = xⁱ, y = xʲ, h(x) = hⁱ, h(y) = hʲ and T(x) = Tⁱ, one gets the above constraints.
Conversely, to see that if the vector (h⁰, …, h^{N+1}; T⁰, …, T^{N+1}) satisfies the above
constraints with h⁰ = 0 and h^{N+1} = α, then its coordinates must correspond to the
values and subgradients at x⁰, …, x^{N+1} of some convex, monotone and homo-
geneous of degree one function, we note that the function h(x) = max{Tⁱ·x | i =
0, …, N+1} is one such function. (See Matzkin (1992) for a more detailed
discussion of these arguments.)
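The interpolating function in the converse argument is just a pointwise maximum, so it is trivial to compute. A tiny sketch (our own; the function name is illustrative): given subgradient vectors T⁰, …, T^{N+1} satisfying (4)-(6), h(x) = maxᵢ Tⁱ·x is convex, monotone, homogeneous of degree one, and attains hⁱ = Tⁱ·xⁱ at each xⁱ.

```python
import numpy as np

def convex_interpolant(T):
    # rows of T are the subgradients T^0, ..., T^{N+1}
    T = np.asarray(T)
    return lambda x: float(np.max(T @ np.asarray(x)))
```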
The estimators for (h*, F*) obtained by interpolating the results of the optimization
in (1)-(6) are consistent. This can be proved by noting that they are maximum likeli-
hood estimators and using results about the consistency of not-necessarily para-
metric maximum likelihood estimators, such as Wald (1949) and Kiefer and
Wolfowitz (1956). To see that (ĥ, F̂) is a maximum likelihood estimator, let the set
of nonparametric estimators for (h*, F*) be the set of functions that solve the
problem

$$\max_{(h, F)} L_N(h, F) = \sum_{i=1}^{N} \{\, y^i \log[F(h(x^i))] + (1 - y^i)\log[1 - F(h(x^i))] \,\}$$

$$\text{subject to } (h, F) \in H \times \Gamma, \qquad (8)$$

where H is the set of convex, monotone increasing, and homogeneous of degree one
functions that attain the value α at x* and Γ is the set of monotone increasing
functions on ℝ whose values lie in the interval [0, 1]. Notice that the value of L_N(h, F)
depends on h and F only through the values that these functions attain at a finite
number of points. As seen above, the behavior of these values is completely charac-
terized by the restrictions (2)-(6) in the problem in the introduction. Hence, the set
of solutions of the optimization problem (8) coincides with the set of solutions
obtained by interpolating the solutions of the optimization problem described by
(1)-(6). So, the estimators we have been considering are maximum likelihood
estimators.
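For concreteness, here is a sketch of the finite-dimensional version of problem (8), with the variables and constraints (2)-(6) written out for a generic nonlinear programming routine. It is our own illustration, not Matzkin's code, and it simplifies in one respect: the ranking used in the monotonicity constraints on F is fixed at an initial guess h0 rather than determined jointly with the other unknowns.

```python
import numpy as np
from scipy.optimize import minimize

def fit_binary_threshold(x, y, x_star, alpha, h0):
    N, K = x.shape
    X = np.vstack([x, x_star])          # append x*, whose h-value is pinned at alpha
    nT = (N + 1) * K

    def split(v):
        T = v[:nT].reshape(N + 1, K)    # candidate subgradients
        h = np.sum(T * X, axis=1)       # constraint (4): h^i = T^i . x^i
        return T, h, v[nT:]             # v[nT:] holds F(h(x^1)), ..., F(h(x^N))

    def nll(v):                         # negative of the log-likelihood in (8)
        _, _, F = split(v)
        return -np.sum(y * np.log(F) + (1 - y) * np.log(1 - F))

    order = np.argsort(h0)              # fixed ranking for constraint (2)
    cons = [
        # (5): h^i >= T^j . x^i for all i, j
        {'type': 'ineq',
         'fun': lambda v: (split(v)[1][:, None] - X @ split(v)[0].T).ravel()},
        # normalization: h(x*) = alpha
        {'type': 'eq', 'fun': lambda v: split(v)[1][-1] - alpha},
        # (2): F nondecreasing along the fixed ranking
        {'type': 'ineq', 'fun': lambda v: np.diff(split(v)[2][order])},
    ]
    bounds = [(0, None)] * nT + [(1e-6, 1 - 1e-6)] * N   # (6) and (3)
    v0 = np.concatenate([np.full(nT, 0.1), np.full(N, 0.5)])
    res = minimize(nll, v0, method='SLSQP', bounds=bounds, constraints=cons)
    T, h, F = split(res.x)
    return h[:N], T, F
```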
We are not aware of any existing results about the asymptotic distribution of
these nonparametric maximum likelihood estimators.
The principles that have been exemplified in this subsection can be generalized
to estimate other nonparametric models, using possibly other types of extremum
estimators, and subject to different sets of restrictions on the estimated functions.
The next subsection presents general results that can be used in those cases.

3.1.1. General types of shape restrictions

Generally speaking, one can interpret the theory behind estimators of the sort
described in the previous subsection as an immediate extension of the theory behind
parametric M-estimators. When a function is estimated parametrically using a
maximization procedure, the function is specified up to the value of some finite-
dimensional parameter vector θ ∈ ℝ^L, and an estimator for the parameter is obtained
by maximizing a criterion function over a subset of ℝ^L. When the nonparametric
shape restricted method is used, the function is specified up to some shape
restrictions and an estimator is obtained by maximizing a criterion function over the
set of functions satisfying the specified shape restrictions.
The consistency of these nonparametric shape restricted estimators can be proved
by extending the usual arguments to apply to subsets of functions instead of subsets
of finite dimensional vectors. For example, the following result, which is discussed
at length in the chapter by Newey and McFadden in this volume, can typically be
used:

Theorem

Let m* be a function, or a vector of functions, that belongs to a set of functions M.
Let L_N: M → ℝ denote a criterion function that depends on the data. Let m̂_N be an
estimator for m*, defined by m̂_N ∈ argmax{L_N(m) | m ∈ M}. Assume that the following
conditions are satisfied:
(i) The function L_N converges a.s. uniformly over M to a nonrandom continuous
function L: M → ℝ.
(ii) The function m* uniquely maximizes L over the set M.
(iii) The set M is compact with respect to a metric d.

Then, any sequence of estimators {m̂_N} converges a.s. to m* with respect to the
metric d. That is, with probability one, lim_{N→∞} d(m̂_N, m*) = 0.

See the Newey and McFadden chapter for a description of the role played by
each of the assumptions, as well as a list of alternative assumptions.
The most substantive assumptions are (ii) and (iii). Depending on the definition
of L_N, the identification of m* typically implies that assumption (ii) is satisfied. The
satisfaction of assumption (iii) depends on the definitions of the set M and of the
metric d, which measures the convergence of the estimator to the true function.
Compactness is more difficult to satisfy for sets of functions than for sets of
finite-dimensional parameter vectors. One often faces a trade-off between the
strength of the convergence result and the strength of the restrictions on M, in the
sense that the stronger the metric d, the stronger the convergence result, but
the more restricted the set M must be. For example, the set of convex, monotone
increasing, and homogeneous of degree one functions that attain the value α at x*
and have a common open domain is compact with respect to the L¹ norm. If, in
addition, the functions in this set possess uniformly bounded subgradients, then the
set is compact with respect to the supremum norm on any compact subset of their
joint domain.
Two properties of the estimation method allow one to transform the problem of
finding functions that maximize L_N over M into a finite-dimensional optimization
problem. First, it is necessary that the function L_N depends on any m ∈ M only
through the values that m attains at a finite number of points. And second, it is
necessary that the values that any function m ∈ M may attain at those finite number
of points can be characterized by a finite set of inequality constraints. When these
conditions are satisfied, one can use standard routines to solve the finite-dimensional
optimization problem that arises when estimating functions using this method. The
second requirement is not trivially satisfied. For example, there is no known finite
set of necessary and sufficient conditions on the values of a function at a finite
number of points guaranteeing that the function is differentiable and α-Lipschitzian⁵
(α > 0). In the example given in Section 3.1, the concavity of the functions was critical
in guaranteeing that we can characterize the behavior of the functions at a finite
number of points.
While the results discussed in this section can be applied to a wide variety of
models and shape restrictions, some types of models and shape restrictions have
received particular attention. We next survey some of the literature concerning
estimation subject to monotonicity and concavity restrictions.

3.1.2. Estimation of monotone functions

A large body of literature concerns the use of monotone restrictions to estimate
nonparametric functions. Most of this work is summarized in an excellent book by
Robertson et al. (1988), which updates results surveyed in a previous book by
Barlow et al. (1972). (See also Prakasa Rao (1983).) The book by Robertson et al.
describes results about the computation of the estimators, their consistency, rates
of convergence, and asymptotic distributions. Subsection 9.2 in that book is of
particular interest. In that subsection the authors survey existing results about
monotone restricted estimators for the function f* in the model

$$y = f^*(x) + \varepsilon,$$

where E(ε|x) = 0 a.s. or Median(ε|x) = 0 a.s. Key papers are Brunk (1970), where the
consistency and asymptotic distribution of the monotone restricted least squares
estimator for f* is studied when E(ε|x) = 0 and x ∈ [0, 1]; and Hanson et al. (1973),
where consistency is proved when x ∈ [0, 1] × [0, 1]. Earlier, Asher et al. (1955) had
proved some weak convergence results. Recently, Wang (1992) derived the rate of
convergence of the monotone restricted estimator for f* when E(ε|x) = 0 a.s. and
x ∈ [0, 1] × [0, 1]. The asymptotic distribution of the least squares estimator for this
latter case is not yet known.
Of course, the general methods described in the previous subsection apply in
particular to monotone functions. So, one can use those results to determine the
consistency of monotone restricted estimators in a variety of models that may or
may not fall into the categories of models that are usually studied. (See, for example,
Cosslett (1983) and Matzkin (1990a).)

⁵A function h: X → ℝ, where X ⊂ ℝ^K, is α-Lipschitzian (α > 0) if ∀x, y ∈ X, |h(x) − h(y)| ≤ α‖x − y‖.

gradients of the concave function (Matzkin (1986, 1991a), Balls (1987)). The
constraints in (9) become

$$f^i \le f^j + T^j \cdot (x^i - x^j), \qquad i, j = 1, \ldots, N,$$

and the minimization is over the values {fⁱ} and the vectors {Tⁱ}. To add a mono-
tonicity restriction, one includes the constraints

$$T^i \ge 0, \qquad i = 1, \ldots, N.$$

To bound the subgradients by a vector B, or to bound the values of the function
by the values of a function b, one uses, respectively, the constraints

$$-B \le T^i \le B, \qquad i = 1, \ldots, N,$$

and

$$-b(x^i) \le f^i \le b(x^i), \qquad i = 1, \ldots, N.$$
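Written out, the resulting least squares problem is a quadratic program. A minimal sketch using the cvxpy modeling package (our choice of tool; any QP solver would do, and the function name is illustrative):

```python
import numpy as np
import cvxpy as cp

def concave_restricted_ls(x, y, monotone=False, B=None):
    """Minimize sum_i (y_i - f_i)^2 over values f_i and subgradients T_i,
    subject to f_i <= f_j + T_j.(x_i - x_j) for all i, j (concavity)."""
    N, K = x.shape
    f = cp.Variable(N)
    T = cp.Variable((N, K))
    cons = [f[i] <= f[j] + T[j] @ (x[i] - x[j])
            for i in range(N) for j in range(N) if i != j]
    if monotone:
        cons.append(T >= 0)              # monotonicity: T^i >= 0
    if B is not None:
        cons += [T <= B, T >= -B]        # bound the subgradients by B
    prob = cp.Problem(cp.Minimize(cp.sum_squares(y - f)), cons)
    prob.solve()
    return f.value, T.value
```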

Algorithms for the resulting constrained optimization problem were developed
by Dykstra (1983) and Goldman and Ruud (1992) for the least squares estimator,
and Matzkin (1993b) for general types of objective functions. The algorithms by
Dykstra and by Goldman and Ruud are extensions of the method proposed by
Hildreth (1954). This algorithm proceeds by solving the problem

$$\min_{\lambda \ge 0} \; \| y - A'\lambda \|^2,$$

where A is a matrix whose rows are all vectors β ∈ ℝ^N with βᵢ = 1 (some i), β_k ≤ 0
(all k ≠ i), and β′X = 0. The rows of the N × K matrix X are the observed points xⁱ,
the first coordinates of which are ones. This is the dual of the problem of finding the
vector z* that minimizes the sum of squared errors subject to concavity constraints:

$$\min_{Az \le 0} \; \| y - z \|^2.$$

The solution to this problem is ẑ = y − A′λ̂, where λ̂ is the solution to the dual
problem. While the dual problem is minimized over more variables, the constraints
are much simpler than those of the primal problem. The algorithm minimizes the
objective function over one coordinate of λ at a time, repeating the procedure till
convergence.
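A sketch (our own) of that coordinate-wise dual iteration, under the assumption that the constraint matrix A has already been built; the names are illustrative:

```python
import numpy as np

def hildreth_dual(y, A, sweeps=1000):
    """Minimize ||y - A'lambda||^2 over lambda >= 0, one coordinate at a
    time; returns the primal fit z_hat = y - A'lambda_hat."""
    m = A.shape[0]
    lam = np.zeros(m)
    r = y.astype(float).copy()          # residual r = y - A'lambda
    sq = np.sum(A * A, axis=1)          # ||a_j||^2 for each row a_j
    for _ in range(sweeps):
        for j in range(m):
            new = max(0.0, lam[j] + A[j] @ r / sq[j])
            r -= (new - lam[j]) * A[j]  # keep the residual in sync
            lam[j] = new
    return r, lam                       # (z_hat, lambda_hat)
```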
The consistency of the concavity restricted least squares estimator of a multivariate
nonparametric concave function can be proved using the consistency result
presented in Section 3.1.1. Suppose, for example, that in the model

$$y = f^*(x) + \varepsilon,$$

x ∈ X, where X is an open and convex subset of ℝ^K, f*: X → ℝ^q, and the unobserved
vector ε ∈ ℝ^q is distributed independently of x with mean 0 and variance Σ. Let B ∈ ℝ^K
and b: X → ℝ^q. Assume that f* belongs to the set, H, of concave functions f: X → ℝ^q
whose subgradients are uniformly bounded by B and whose values satisfy, for all
x ∈ X, |f(x)| ≤ b(x). Then, H is a compact set, in the sup norm, of equicontinuous functions.
So, following the same arguments as in, e.g., Epstein and Yatchew (1985) and
Gallant (1987), one can show that the function L_N: H → ℝ defined by

$$L_N(f) = \frac{1}{N} \sum_{i=1}^{N} (y^i - f(x^i))'(y^i - f(x^i))$$

converges a.s. uniformly to the continuous function L: H → ℝ defined by

$$L(f) = \operatorname{tr}(\Sigma) + \int (f(x) - f^*(x))'(f(x) - f^*(x)) \, d\mu(x),$$

where μ is the probability measure of x. Since the functions in H are continuous, L
is uniquely minimized at f*. Hence, by the theorem of Subsection 3.1.1 it follows
that the least squares estimator is a strongly consistent estimator for f*.
For an LAD (least absolute deviations) nonparametric concavity restricted
estimator, Balls (1987) proposed proving consistency by showing that the distance
between the concavity restricted estimator and the true function is smaller than the
distance between an unrestricted consistent nonparametric splines estimator (see
Section 3.2) and the true function. Matzkin (1986) showed consistency of a non-
parametric concavity restricted maximum likelihood estimator using a variation
of Wald's (1949) theorem, which uses compactness of the set H. No asymptotic
distribution results are known for these estimators.

3.2. Estimation using seminonparametric methods

Seminonparametric methods proceed by approximating any function of interest
with a parametric approximation. The larger the number of observations available
to estimate the function, the larger the number of parameters used in the approxi-
mating function and the better the approximation. The parametric approximations
are chosen so that as the number of observations increases, the sequence of parametric
approximations converges to the true function, for appropriate values of the
parameters.
A popular example of such a class of parametric approximations is the set of

functions defined by the Fourier flexible form (FFF) expansion

$$g_N(x, \theta) = b \cdot x + x'Cx + \sum_{|k|_* \le T} u_k e^{i k \cdot x}, \qquad x \in \mathbb{R}^K,$$

where i = √−1, b ∈ ℝ^K, C is a K × K matrix, u_k = v_k + i w_k for some real numbers
v_k and w_k, k = (k₁, …, k_K) is a vector with integer coordinates, and |k|_* = Σᵢ₌₁ᴷ |kᵢ|.
(See Gallant (1981).)
To guarantee that the above sum is real valued, it is imposed that w₀ = 0, v_k = v₋ₖ
and w_k = −w₋ₖ. Moreover, the values of each coordinate of x need to be modified
to fall into the [0, 2π] interval. The coordinates of the parameter vector θ are the
v_k's, the w_k's and the coefficients of the linear and quadratic terms. Important
advantages of this expression are that it is linear in the parameters and its partial
derivatives are easily calculated. As T → ∞, the FFF and its partial derivatives up
to order m − 1 approximate in an L_p norm any m times differentiable function and
its m − 1 derivatives.
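As an illustration, the FFF is cheap to evaluate once the coefficients are collected. The sketch below is our own, not Gallant's code; it assumes x has been rescaled into [0, 2π]^K and that the u_k are supplied as a dictionary keyed by integer multi-index tuples with the symmetry u₋ₖ = conj(u_k) already imposed, so the sum is real valued.

```python
import numpy as np
from itertools import product

def fourier_flexible_form(x, b, C, u, T):
    """g_N(x, theta) = b.x + x'Cx + sum_{|k|* <= T} u_k exp(i k.x)."""
    val = b @ x + x @ C @ x
    for k in product(range(-T, T + 1), repeat=len(x)):
        if sum(abs(kj) for kj in k) <= T and k in u:
            # each conjugate pair contributes a real amount, so taking the
            # real part term by term gives the same total
            val += (u[k] * np.exp(1j * np.dot(k, x))).real
    return float(val)
```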
Imposing restrictions on the values of the parameters of the approximation, one
can guarantee that the resulting estimator satisfies a desired shape property. Gallant
and Golub (1984), for example, impose quasi-convexity in the FFF estimator by
calculating the estimator for θ as the solution to a constrained minimization
problem

$$\min_{\theta} \; s_N(\theta) \quad \text{subject to} \quad r(\theta) \ge 0,$$

where s_N(·) is a data-dependent function, such as a weighted sum of squared errors,
r(θ) = min_x v(x, θ), and v(x, θ) = min_z { z′D²g_N(x, θ)z | z′Dg_N(x, θ) = 0, z′z = 1 }. Dg_N and
D²g_N denote, respectively, the gradient and Hessian of g_N with respect to x. Gallant
and Golub (1984) have developed an algorithm to solve this problem.
Gallant (1981, 1982) developed restrictions guaranteeing that the Fourier flexible
form approximation satisfies homotheticity, linear homogeneity or separability.
The consistency of seminonparametric estimators can typically be shown by
appealing to the following theorem, which is presented and discussed in Gallant
(1987) and Gallant and Nychka (1987, Theorem 0).

Theorem

Suppose that m* belongs to a set of functions M. Let L_N: M → ℝ denote a criterion
function that depends on the data. Let {M_N} denote an infinite sequence of subsets
of M such that … M_N ⊂ M_{N+1} ⊂ M_{N+2} ⊂ …. Let m̂_N be an estimator for m*, defined
by m̂_N ∈ argmax{L_N(m) | m ∈ M_N}. Assume that the following conditions are satisfied:
(i) The function L_N converges a.s. uniformly over M to a nonrandom continuous
function L: M → ℝ.
(ii) The function m* uniquely maximizes L over the set M.
(iii) The set M is compact with respect to a metric d.
(iv) There exists a sequence of functions {g_N} ⊂ M such that g_N ∈ M_N for all
N = 1, 2, … and d(g_N, m*) → 0.

Then, the sequence of estimators {m̂_N} converges a.s. to m* with respect to the metric
d. That is, with probability one, lim_{N→∞} d(m̂_N, m*) = 0.

This result is very similar to the theorem in Subsection 3.1.1. Indeed, Assumptions
(i)-(iii) play the same role here as they played in that theorem. Assumption (iv) is
necessary to substitute for the fact that the maximization of L_N for each N is not
over the whole space M but only over a subset, M_N, of M. This assumption is satisfied
when the M_N sets become dense in M as N → ∞. (See Gallant (1987) for more
discussion about this result.)
Asymptotic normality results for Fourier flexible forms and other seminonpara-
metric estimators have been developed, among others, by Andrews (1991), Eastwood
(1991), Eastwood and Gallant (1991) and Gallant and Souza (1991). None of these
considers the case where the estimators are restricted to be concave.
The M_N sets are typically defined by using results that allow one to characterize
any arbitrary function as the limit of an infinite sum of parametric functions. The
Fourier flexible form described above is one example of this. Each set M_N is defined
as the set of functions obtained as the sum of the first T(N) terms in the expansion,
where T(N) is increasing in N and such that T(N) → ∞ as N → ∞.
Some other types of expansions that have been used to define parametric
approximations are Hermite forms (Gallant and Nychka (1987)), power series
(Bergstrom (1985)), splines (Wahba (1990)), and Müntz-Szatz type series (Barnett
and Yue (1988a, 1988b) and Barnett et al. (1991)).
Splines are smooth functions that are piecewise polynomials. Kimeldorf and
Wahba (1971), Utreras (1984, 1985), Villalobos and Wahba (1987) and Wong (1984)
studied the imposition of monotonicity and convexity restrictions on splines
estimators. Yatchew and Bos (1992) proposed using splines to estimate a consumer
demand function subject to the implications of economic theory on demand
functions.
Barnett et al. (1991) impose concavity in a Müntz-Szatz type series by requiring
that each term in the expansion satisfies concavity. This method for imposing
concavity restrictions in series estimators was proposed by McFadden (1985).

3.3. Estimation using weighted average methods

A weighted average estimator, f̃, for the function f* in the model

$$y = f^*(x) + \varepsilon,$$
where

$$\tilde f(x) = \min_{x' \ge x} \bar f(x'), \qquad \bar f(x) = \max_{x'' \le x} \hat f_K(x''),$$

and where f̂_K is a kernel estimator for f*. The consistency of this estimator follows
from the consistency of the kernel estimator. No asymptotic distribution for it is
known.
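The display above is badly scanned, so the following sketch should be read as one standard variant of such a monotonized kernel estimator (an envelope smoothing of a Nadaraya-Watson fit on a grid), not as the exact estimator of the source; all names are our own.

```python
import numpy as np

def kernel_fit(grid, x, y, h):
    """Nadaraya-Watson estimate on an increasing grid (Gaussian kernel)."""
    w = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / h) ** 2)
    return (w @ y) / w.sum(axis=1)

def monotonize(f_hat):
    """upper(x) = max_{u<=x} f(u) and lower(x) = min_{v>=x} f(v) are both
    nondecreasing, so their average is a nondecreasing estimate."""
    upper = np.maximum.accumulate(f_hat)
    lower = np.minimum.accumulate(f_hat[::-1])[::-1]
    return 0.5 * (upper + lower)
```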
A kernel estimator was also used in Matzkin (1991d) to obtain a smooth inter-
polation of a concavity restricted nonparametric maximum likelihood estimator
and in Matzkin and Newey (1992) to estimate a homogeneous function in a binary
threshold crossing model. The Matzkin and Newey estimator possesses a known
asymptotic distribution.

4. Nonparametric tests using economic restrictions

The testing of economic hypotheses in parametric models suffers from drawbacks
similar to those of the estimation of parametric models; the conclusions depend on
the parametric specifications used. Suppose, for example, that one is interested in
the parametric specifications used. Suppose, for example, that one is interested in
testing whether some given consumer demand data provide support for the classical
model of utility maximization. The parametric approach would proceed by: first,
specifying parametric structures for the demand functions; second, using the
demand data to estimate the parameters; and then testing whether the estimated
demand functions satisfy the integrability conditions. But, if the integrability
conditions are not satisfied by the parametrically estimated demand functions it is
not clear whether this is evidence against the utility maximization model or against
the particular parametric structures chosen. In contrast, a nonparametric test of the
utility maximization model would use demand functions estimated nonparamet-
rically. In this case, rejection of the integrability conditions provides stronger
evidence against the utility maximization model.

4.1. Nonstatistical tests

A large body of literature dating back to the work of Samuelson (1938) and
Houthakker (1950) on Revealed Preference has developed nonparametric tests for
the hypothesis that data is consistent with a particular choice model, such as the
choice made by a consumer or a firm. Most of these tests are nonstatistical. The
data is assumed to be observed without error and the models contain no
unobservable random terms. (One exception is the Axiom of Revealed Stochastic
Rationality (McFadden and Richter (1970, 1990)), where conditions are given
characterizing discrete choice probabilities generated by a random utility function.)
In the nonstatistical tests, an hypothesis is rejected if at least one in a set of
nonparametric restrictions is violated; the hypothesis is accepted otherwise. The
nonparametric restrictions used to test the hypotheses are typically expressed in
one of two different ways. Either they establish that a solution must exist for a
certain finite system of inequalities whose coefficients are determined by the data;
or they establish that certain algebraic conditions must be satisfied by the data. For
example, the Strong Axiom of Revealed Preference is one of the algebraic conditions
that is used in these tests.
To provide an example of such results, we state below Afriat's (1967a) Theorem,
which is fundamental in this literature. Afriat's Theorem can be used to test the
consistency of demand data with the hypothesis that observed commodity bundles
are the maximizers of a common utility function over the budget sets determined
by observed prices of the commodities and incomes of a consumer. If the data
correspond to different individuals, the conditions of the theorem can be used to
test the existence of a utility function that is common to all of them.

Afriat's Theorem (1967a)

Let {xⁱ, pⁱ, Iⁱ}ᵢ₌₁ᴺ denote a set of N observations on commodity bundles xⁱ, prices pⁱ,
and incomes Iⁱ such that ∀i, pⁱ·xⁱ = Iⁱ. Then, the following conditions are equivalent.

(i) There exists a nonsatiated function V: ℝ^K → ℝ such that for all i = 1, …, N and
all y ∈ ℝ^K, [pⁱ·y ≤ Iⁱ] ⇒ [V(y) ≤ V(xⁱ)].
(ii) The data {xⁱ, pⁱ, Iⁱ}ᵢ₌₁ᴺ satisfy Cyclical Consistency; i.e., for all sequences
{i, j, k, …, l, t} in {1, …, N},

$$[\,p^j \cdot x^i \le I^j,\; p^k \cdot x^j \le I^k,\; \ldots,\; p^t \cdot x^l \le I^t\,] \Rightarrow [\,I^i \le p^i \cdot x^t\,].$$

(iii) There exist numbers λⁱ > 0 and Vⁱ (i = 1, …, N) satisfying

$$V^i \le V^j + \lambda^j p^j \cdot (x^i - x^j), \qquad i, j = 1, \ldots, N.$$

(iv) There exists a monotone increasing, concave and continuous function V: ℝ^K → ℝ
such that for all i = 1, …, N and all y ∈ ℝ^K, [pⁱ·y ≤ Iⁱ] ⇒ [V(y) ≤ V(xⁱ)].

This result states that the data could have been generated by the maximization
of a common nonsatiated utility function (condition (i)) if and only if the data satisfy
the set of algebraic conditions stated in condition (ii). In Figure 3, two observations
that do not satisfy Cyclical Consistency are graphed. In these observations,
p¹·x² < I¹ = p¹·x¹ and p²·x¹ < I² = p²·x².
The theorem also states that a condition equivalent to Cyclical Consistency is
that one can find numbers λⁱ > 0 and Vⁱ (i = 1, …, N) satisfying the linear inequali-
ties in (iii). For example, no such numbers can be found for the observations in
Figure 3; since when p¹·(x² − x¹) < 0, p²·(x¹ − x²) < 0, and λ¹, λ² > 0, the inequalities
in (iii) imply that V¹ − V² < 0 and V² − V¹ < 0.
Figure 3

Finally, the equivalence between conditions (i) and (iv) implies that if one can find
a nonsatiated function that is maximized at the observed xⁱ's then one can also find
a monotone increasing, concave, and continuous function that is maximized at the
observed xⁱ's.
Varian (1982) stated an alternative algebraic condition to Cyclical Consistency
and developed algorithms to test the conditions of the above theorem.
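Such algorithmic tests are short to implement. The sketch below (our own illustration, in the spirit of Varian's procedures) checks the chain condition by taking the transitive closure of the direct "affordable at" relation and looking for a strict reversal.

```python
import numpy as np

def cyclically_consistent(p, x, I, tol=1e-12):
    N = len(I)
    D = (p @ x.T) <= I[:, None] + tol      # D[i, j]: p^i . x^j <= I^i
    C = D.copy()
    for k in range(N):                     # Warshall transitive closure
        C |= np.outer(C[:, k], C[k, :])
    strict = (p @ x.T) < I[:, None] - tol  # strictly affordable
    # violation: x^i revealed preferred to x^j through some chain, yet x^i
    # is strictly affordable in situation j
    return not np.any(C & strict.T)
```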
Along similar lines to the above theorem, a large literature deals with non-
parametric tests for the hypothesis that a given set of demand data has been generated
from the maximization of a utility function that satisfies certain shape restrictions.
For example, Afriat (1967b, 1972a, 1973, 1981), Diewert (1973), Diewert and Parkan
(1985), and Varian (1983a) provided tests for the consistency of demand data with
additively separable, weakly separable and homothetic utility functions. Matzkin
and Richter (1991) provided a test for the strict concavity and strict monotonicity
of the utility function; and Chiappori and Rochet (1987) developed a test for the
consistency of demand data with a strictly concave and infinitely differentiable
utility function. To provide an example of one such set of conditions, the algebraic
conditions developed by Chiappori and Rochet are that (i) for all sequences
{i, j, k, …, l, t} in {1, …, N},

$$[\,p^j \cdot x^i \le I^j,\; p^k \cdot x^j \le I^k,\; \ldots,\; p^t \cdot x^l \le I^t\,] \Rightarrow [\,I^i < p^i \cdot x^t\,], \ \text{and}$$

(ii) for all i, j, [xⁱ = xʲ] ⇒ [pⁱ = αpʲ for some α > 0].
Yatchew (1985) provided nonparametric restrictions for demand data generated
by utility maximization subject to budget sets that are the union of linear sets.
Matzkin (1991b) developed restrictions for demand data generated subject to choice
sets that possess monotone and convex complement and for choices that are each
supported by a unique hyperplane.
Nonstatistical nonparametric tests for the hypothesis of cost minimization and
profit maximization have also been developed. See, for example, Afriat (1972b),
Diewert and Parkan (1979), Hanoch and Rothschild (1978), Richter (1985) and
Varian (1984). Suppose, for example, that {yⁱ, pⁱ} are a set of observations on a vector
of inputs and outputs, yⁱ, and a vector of the corresponding prices, pⁱ. Then, one of the
results in the above papers is that {yⁱ, pⁱ} is consistent with profit maximization
iff for all i, j = 1, …, N, pⁱ·yⁱ ≥ pⁱ·yʲ (Hanoch and Rothschild (1978)).
Some of the above mentioned tests have been used in empirical applications. See,
for example, Landsburg (1981), McDonald and Manser (1984) and Manser and
McDonald (1988).
Nonparametric restrictions have also been developed to test efficiency in produc-
tion. These tests, typically appearing under the heading of Data Envelope Analysis,
use data on the input and output vectors of different facilities (decision making units
or DMUs) that are assumed to possess the same technology. Then, making assump-
tions about the technology, such as constant returns to scale, they determine the
set of vectors of inputs and outputs that are efficient. A DMU is not efficient if its
vector of input and output quantities is not in the efficiency set. See the paper by
Seiford and Thrall (1990) for a survey of this literature.
Recently, nonparametric restrictions characterizing data generated by models
other than the single agent optimization problem have been developed. Chiappori
(1988) developed a test for the Pareto optimality of the consumption allocation
within a household using data on aggregate household consumption and labor
supply of each household member. Brown and Matzkin (1993) developed a test for
the general equilibrium model, using data on market prices, aggregate endowments,
consumers' incomes, and consumers' shares of profits. Nonparametric restrictions
characterizing data consistent with other equilibrium models, such as various
imperfect competition models, have not yet been developed. Varian (1983b)
developed a test for the model of investors' behavior.
Some papers have developed statistical tests using the nonstatistical restrictions
of some of the tests mentioned above (Varian (1985, 1988), Epstein and Yatchew
(1985), Yatchew and Bos (1992) and Brown and Matzkin (1992), among others). As
we will see in the next subsection, the test developed by Yatchew and Bos (1992), in
particular, can be used with several of the above restrictions to obtain statistical
nonparametric tests for economic models.

4.2. Statistical tests

Using nonparametric methods similar to those used to estimate nonparametric
functions, it is possible to develop tests for the hypothesis that a nonparametric
regression function satisfies a specified set of nonparametric shape restrictions.
Yatchew and Bos (1992) and Gallant (1982), for example, present such tests.
The consistent test by Yatchew and Bos is based on a comparison of the restricted
and unrestricted weighted sums of squared errors. More specifically, suppose that the
model is specified by y = f*(x) + ε, where y ∈ ℝ^q, x ∈ ℝ^K, ε ∈ ℝ^q, x and ε are independent,
E(ε) = 0, and Cov(ε) = Σ. The null hypothesis is that f* ∈ F ⊂ F̄, while the alternative
hypothesis is that f* ∈ F̄\F. The Sobolev⁶ norms of the functions in the sets F and F̄
are uniformly bounded. The test proceeds as follows. First, divide the sample into
two independent samples of the same size, T. Compute the estimators ŝ²_F and ŝ²_{F̄}
using, respectively, the first and second samples, where

$$\hat s^2_F = \min_{f \in F} \frac{1}{T} \sum_{i=1}^{T} [y^i - f(x^i)]' \Sigma^{-1} [y^i - f(x^i)]$$

and

$$\hat s^2_{\bar F} = \min_{f \in \bar F} \frac{1}{T} \sum_{i=1}^{T} [y^i - f(x^i)]' \Sigma^{-1} [y^i - f(x^i)].$$

To transform these minimization problems into finite-dimensional problems,
Yatchew and Bos (1992) use a method similar to the one described in Section 3.1.
They show that, under the null hypothesis, the asymptotic distribution of
t_T = T^{1/2}[ŝ²_F − ŝ²_{F̄}] is N(0, 2υ), where υ = Var(ε′Σ⁻¹ε). So, one can use standard
statistical tables to determine whether the difference of the sums of squared errors is
significantly different from zero. (This test builds on the work of Epstein and
Yatchew (1985), Varian (1985) and Yatchew (1992).)
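Given the two subsample minima and an estimate of υ, the test itself is one line; a sketch with illustrative names:

```python
import numpy as np
from scipy.stats import norm

def yatchew_bos_test(s2_restricted, s2_unrestricted, T, v_hat):
    # under the null, t is approximately N(0, 2*v); reject for large t
    t = np.sqrt(T) * (s2_restricted - s2_unrestricted)
    p_value = 1.0 - norm.cdf(t / np.sqrt(2.0 * v_hat))  # one-sided
    return t, p_value
```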
The Yatchew and Bos (1992) test can be used in conjunction with the nonstatistical
nonparametric tests described in the previous subsection. Suppose for example that
yⁱ denotes a vector of commodities purchased by a consumer and xⁱ denotes the
vector of prices pⁱ and income Iⁱ faced by the consumer when he or she purchased
yⁱ. Assume that the observations are independent and for each i, yⁱ = f*(xⁱ) + εⁱ,
where ε satisfies the assumptions made above. Then, as is described in Yatchew
and Bos (1992), we can use their method to test whether the data is consistent with
the utility maximization hypothesis. In particular, Afriat's inequalities (in condition
(iii) in Afriat's Theorem) can be used to calculate ŝ²_F by minimizing the value of

$$\frac{1}{T} \sum_{i=1}^{T} [y^i - f^i]' \Sigma^{-1} [y^i - f^i]$$

with respect to Vⁱ, λⁱ, and fⁱ (i = 1, …, T) subject to (i) the Afriat inequali-
ties: Vⁱ ≤ Vʲ + λʲ pʲ·(fⁱ − fʲ) (i, j = 1, …, T), (ii) the budget constraints: pⁱ·fⁱ = Iⁱ
(i = 1, …, T), and (iii) inequalities that guarantee that the Sobolev norm of the
function f is within specified bounds.

⁶The Sobolev norm is defined on a set of m times continuously differentiable functions C^m by

where a = (a₁, …, a_K) is a vector of integers; D^a f(x) is the value resulting from differentiating f at x, a₁
times with respect to x₁, a₂ times with respect to x₂, …, a_K times with respect to x_K; and |a| = max_k |a_k|.
Gallant (1982) presents a seminonparametric method for testing whether a
regression function satisfies some shape restrictions, such as linear homogeneity,
separability and homotheticity. The method proceeds by testing whether the
parametric approximation used to estimate a nonparametric function satisfies the
hypothesized restrictions.
Following Gallant (1982), suppose that we are interested in testing the linear
homogeneity of a cost function, c(p, u), where p = (p₁, …, p_K) is a vector of input
prices and u is the output. Let

$$g(l, v) = \ln c\bigl(\exp(l_1)/a_1, \ldots, \exp(l_K)/a_K, \exp(v)/a_{K+1}\bigr),$$

where lᵢ = ln pᵢ + ln aᵢ and v = ln u + ln a_{K+1}. (The aᵢ's are location parameters that
are determined from the data.) Then, linear homogeneity of the cost function c in
prices is equivalent to requiring that for all t, g(l + t·ι, v) = t + g(l, v), where ι denotes
a vector of ones. The approximation g_N of g, given by

$$g_N(x, \theta) = b \cdot x + x'Cx + \sum_{|k|_* \le T} u_k e^{i k \cdot x}, \qquad x \in \mathbb{R}^{K+1},$$

satisfies these restrictions, for C = Σ_k c_k kk′, if

$$\sum_{j=1}^{K} b_j = 1 \quad \text{and if} \quad u_k = 0 \ \text{and} \ c_k = 0 \ \text{when} \ \sum_{j=1}^{K} k_j \ne 0.$$

Linear homogeneity is then tested by determining whether these restrictions are
satisfied. Gallant (1982) shows that by increasing the degree of approximation (i.e.
the number of parameters) at a particular specified rate as the number of
observations increases, one can construct tests that are asymptotically free of
specification bias. That is, for any given level of significance, α, one can construct a
test statistic t_N and a critical value c_N such that if the true nonparametric function
satisfies the null hypothesis then lim_{N→∞} Pr(t_N > c_N) = α.
Several other methods have been developed to test restrictions of economic
theory on nonparametric functions. Stoker (1989), for example, presents nonpara-
metric tests for additive constraints on the first and second derivatives of a
conditional mean function f*(x). These tests are based on weighted-average
derivatives estimators (Stoker (1986), Powell et al. (1989), Hardle and Stoker (1989)).
Linear homogeneity and symmetry constraints are examples of properties of f*
that can be tested using this method. (See also Lewbel (1991).) Also using average
derivatives, Hardle et al. (1992) tested the positive definiteness of the matrix of
aggregate income effects.
Hausman and Newey (1992) developed a test for the symmetry and negative slope
of the Hicksian (compensated) demand. The test is derived from a nonparametric
estimator for a consumer surplus. Since symmetry of the Hicksian demand implies
that the consumer surplus is independent of the price path used to calculate it,
estimates obtained using different paths should converge to the same limit. A
minimum chi-square test is then developed using this idea.
We should also mention in this section the extensive existing literature that
deals with tests for the monotonicity of nonparametric functions in a wide variety
of statistical models. For a survey of such literature, we refer the reader to the
previously mentioned books of Barlow et al. (1972) and Robertson et al. (1988). (See
also Prakasa Rao (1983).)

5. Conclusion

We have discussed the use of restrictions implied by economic theory in the
econometric analysis of nonparametric models. We described advancements that
have been made on the theories of identification, estimation, and testing of non-
parametric models due to the use of restrictions of economic theory.
First, we showed how restrictions implied by economic theory, such as shape and
exclusion restrictions, can be used to identify functions in economic models. We
demonstrated this in generalized regression models, binary threshold models,
discrete choice models, models of consumer demand and in systems of simultaneous
equations.
Various ways of incorporating economic shape restrictions into nonparametric
estimators were discussed. Special attention was given to estimators whose
feasibility depends critically on the imposition of shape restrictions. We described
technical results that can be used to develop new shape restricted nonparametric
estimators in a wide range of models. We also described seminonparametric and
weighted average estimators and showed how one can impose restrictions of
economic theory on estimators obtained by these two methods.
Finally, we have discussed some nonstatistical and statistical nonparametric tests.
The nonstatistical tests are extensions of the basic ideas underlying the theory of
Revealed Preference. The statistical tests are developed using nonparametric
estimation methods.

References

Afriat, S. (1967a) "The Construction of a Utility Function from Demand Data", International Economic Review, 8, 66-77.
Afriat, S. (1967b) "The Construction of Separable Utility Functions from Expenditure Data", mimeo, Purdue University.
Afriat, S. (1972a) "The Theory of International Comparisons of Real Income and Prices", in: J.D. Daly, ed., International Comparisons of Prices and Output. New York: National Bureau of Economic Research.
Afriat, S. (1972b) "Efficiency Estimates of Production Functions", International Economic Review, 13, 568-598.
Afriat, S. (1973) "On a System of Inequalities on Demand Analysis", International Economic Review, 14, 460-472.
Afriat, S. (1981) "On the Constructability of Consistent Price Indices between Several Periods Simultaneously", in: A. Deaton, ed., Essays in Applied Demand Analysis. Cambridge: Cambridge University Press.
Andrews, D.W.K. (1991) "Asymptotic Normality of Series Estimators for Nonparametric and Semiparametric Regression Models", Econometrica, 59(2), 307-345.
Asher, M., H.D. Brunk, G.M. Ewing, W.T. Reid and E. Silverman (1955) "An Empirical Distribution Function for Sampling with Incomplete Information", Annals of Mathematical Statistics, 26, 641-647.
Balls, K.G. (1987) "Inequality Constrained Nonparametric Estimation", Ph.D. Dissertation, Carnegie Mellon University.
Barlow, R.E., D.J. Bartholomew, J.M. Bremner and H.D. Brunk (1972) Statistical Inference under Order Restrictions. New York: John Wiley.
Barnett, W.A. and P. Yue (1988a) "The Asymptotically Ideal Model (AIM)", working paper.
Barnett, W.A. and P. Yue (1988b) "Semiparametric Estimation of the Asymptotically Ideal Model: The AIM Demand System", in: G. Rhodes and T. Fomby, eds., Nonparametric and Robust Inference, Advances in Econometrics, Vol. 7. Greenwich, Connecticut: JAI Press, 229-252.
Barnett, W.A., J. Geweke and P. Yue (1991) "Seminonparametric Bayesian Estimation of the Asymptotically Ideal Model: The AIM Consumer Demand System", in: W. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Bergstrom, A.R. (1985) "The Estimation of Nonparametric Functions in a Hilbert Space", Econometric Theory, 1, 1-26.
Brown, D.J. and R.L. Matzkin (1991) "Recoverability and Estimation of the Demand and Utility Functions of Traders when Demands are Unobservable", mimeo, Cowles Foundation, Yale University.
Brown, D.J. and R.L. Matzkin (1992) "A Nonparametric Test for the Perfectly Competitive Model", mimeo, Northwestern University.
Brown, D.J. and R.L. Matzkin (1993) "Walrasian Comparative Statics", Technical Report No. 57, Stanford Institute for Theoretical Economics.
Brunk, H.D. (1970) "Estimation of Isotonic Regression", in: M.L. Puri, ed., Nonparametric Techniques in Statistical Inference. Cambridge: Cambridge University Press, 177-197.
Chiappori, P.A. (1988) "Rational Household Labor Supply", Econometrica, 56(1), 63-89.
Chiappori, P. and J. Rochet (1987) "Revealed Preference and Differential Demand", Econometrica, 55, 687-691.
Cosslett, S.R. (1983) "Distribution-free Maximum Likelihood Estimator of the Binary Choice Model", Econometrica, 51(3), 765-782.
Cox, D.R. (1970) The Analysis of Binary Data. Methuen & Co Ltd.
Diewert, E. (1973) "Afriat and Revealed Preference Theory", Review of Economic Studies, 40, 419-426.
Diewert, E.W. and C. Parkan (1979) "Linear Programming Tests for Regularity Conditions for Production Functions", University of British Columbia.
Diewert, E.W. and C. Parkan (1985) "Tests for the Consistency of Consumer Data", Journal of Econometrics, 30, 127-147.
Dykstra, R.L. (1983) "An Algorithm for Restricted Least Squares Regression", Journal of the American Statistical Association, 78, 837-842.
Eastwood, B.J. (1991) "Asymptotic Normality and Consistency of Semi-nonparametric Regression Estimators Using an Upward F Test Truncation Rule", Journal of Econometrics.
Eastwood, B.J. and A.R. Gallant (1991) "Adaptive Truncation Rules for Seminonparametric Estimators that Achieve Asymptotic Normality", Econometric Theory, 7, 307-340.
Epstein, L.G. and D.J. Yatchew (1985) "Non-Parametric Hypothesis Testing Procedures and Applications to Demand Analysis", Journal of Econometrics, 30, 150-169.
Finney, D.J. (1971) Probit Analysis. Cambridge University Press.
Friedman, J. and R. Tibshirani (1984) "The Monotone Smoothing of Scatter Plots", Technometrics, 26, 243-250.
Gallant, A.R. (1981) "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form", Journal of Econometrics, 15, 211-245.
Gallant, A.R. (1982) "Unbiased Determination of Production Technologies", Journal of Econometrics, 20, 285-323.
Gallant, A.R. (1987) "Identification and Consistency in Seminonparametric Regression", in: T.F. Bewley, ed., Advances in Econometrics, Fifth World Congress, Volume 1. Cambridge University Press.
Gallant, A.R. and G.H. Golub (1984) "Imposing Curvature Restrictions on Flexible Functional Forms", Journal of Econometrics, 26, 295-321.
Gallant, A.R. and D.W. Nychka (1987) "Seminonparametric Maximum Likelihood Estimation", Econometrica, 55, 363-390.
Gallant, A.R. and G. Souza (1991) "On the Asymptotic Normality of Fourier Flexible Form Estimates", Journal of Econometrics, 50, 329-353.
Goldman, S.M. and P.A. Ruud (1992) "Nonparametric Multivariate Regression Subject to Monotonicity and Convexity Constraints", mimeo, University of California, Berkeley.
Han, A.K. (1987) "Nonparametric Analysis of a Generalized Regression Model: The Maximum Rank Correlation Estimator", Journal of Econometrics, 35, 303-316.
Hanoch, G. and M. Rothschild (1978) "Testing the Assumptions of Production Theory: A Nonparametric Approach", Journal of Political Economy, 80, 256-275.
Hanson, D.L. and G. Pledger (1976) "Consistency in Concave Regression", Annals of Statistics, 4, 1038-1050.
Hanson, D.L., G. Pledger and F.T. Wright (1973) "On Consistency in Monotonic Regression", Annals of Statistics, 1, 401-421.
Hardle, W. and T.M. Stoker (1989) "Investigating Smooth Multiple Regression by the Method of Average Derivatives", Journal of the American Statistical Association, 84, 986-995.
Hardle, W., W. Hildenbrand and M. Jerison (1992) "Empirical Evidence on the Law of Demand", Econometrica, 59, 1525-1550.
Hausman, J. and W. Newey (1992) "Nonparametric Estimation of Exact Consumer Surplus and Deadweight Loss", mimeo, Department of Economics, M.I.T.
Hausman, J.A. and D.A. Wise (1978) "A Conditional Probit Model of Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences", Econometrica, 46, 403-426.
Hildreth, C. (1954) "Point Estimates of Ordinates of Concave Functions", Journal of the American Statistical Association, 49, 598-619.
Horowitz, J.L. (1992) "A Smoothed Maximum Score Estimator for the Binary Choice Model", Econometrica, 60, 505-531.
Hotz, V.J. and R.A. Miller (1989) "Conditional Choice Probabilities and the Estimation of Dynamic Discrete Choice Models", mimeo, University of Chicago and Carnegie Mellon University.
Houthakker, H.S. (1950) "Revealed Preference and the Utility Function", Economica, 17, 159-174.
Ichimura, H. (1993) "Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single Index Models", Journal of Econometrics, 58, 71-120.
Kadiyali, V. (1993) Ph.D. Thesis, Department of Economics, Northwestern University.
Kiefer, J. and J. Wolfowitz (1956) "Consistency of the Maximum Likelihood Estimator in the Presence of Infinitely Many Incidental Parameters", Annals of Mathematical Statistics, 27, 887-906.
Kimeldorf, G.S. and G. Wahba (1971) "Some Results on Tchebycheffian Spline Functions", Journal of Mathematical Analysis and Applications, 33, 82-95.
Klein, R.W. and R.H. Spady (1993) "An Efficient Semiparametric Estimator for Discrete Choice Models", Econometrica, 61, 387-422.
Landsburg, S.E. (1981) "Taste Change in the United Kingdom, 1900-1955", Journal of Political Economy, 89, 92-104.
Lewbel, A. (1991) "Applied Consistent Tests of Nonparametric Regression and Density Restriction", mimeo, Brandeis University.
Maddala, G.S. (1983) Limited-Dependent and Qualitative Variables in Econometrics. Cambridge: Cambridge University Press.
Mammen, E. (1991a) "Estimating a Smooth Monotone Function", Annals of Statistics, 19, 724-740.
Mammen, E. (1991b) "Nonparametric Regression under Qualitative Smoothness Assumptions", Annals of Statistics, 19, 741-759.
Manser, M.E. and R.J. McDonald (1988) "An Analysis of Substitution Bias in Measuring Inflation, 1959-85", Econometrica, 56(4), 909-930.
Manski, C. (1975) "Maximum Score Estimation of the Stochastic Utility Model of Choice", Journal of Econometrics, 3, 205-228.
Manski, C. (1985) "Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator", Journal of Econometrics, 27, 313-334.
Manski, C. (1988) "Identification of Binary Response Models", Journal of the American Statistical Association, 83, 729-738.
Mas-Colell, A. (1977) "On the Recoverability of Consumers' Preferences from Market Demand Behavior", Econometrica, 45(6), 1409-1430.
Mas-Colell, A. (1978) "On Revealed Preference Analysis", Review of Economic Studies, 45, 121-131.
Matzkin, R.L. (1986) "Mathematical and Statistical Inferences from Demand Data", Ph.D. Dissertation, University of Minnesota.
Matzkin, R.L. (1990a) "Estimation of Multinomial Models Using Weak Monotonicity Assumptions", Cowles Foundation Discussion Paper No. 957, Yale University.
Matzkin, R.L. (1990b) "Least-concavity and the Distribution-free Estimation of Nonparametric Concave Functions", Cowles Foundation Discussion Paper No. 958, Yale University.
Matzkin, R.L. (1990c) "Fully Nonparametric Estimation of Some Qualitative Dependent Variable Models Using the Method of Kernels", mimeo, Cowles Foundation, Yale University.
Matzkin, R.L. (1991a) "Semiparametric Estimation of Monotone and Concave Utility Functions for Polychotomous Choice Models", Econometrica, 59, 1315-1327.
Matzkin, R.L. (1991b) "Axioms of Revealed Preference for Nonlinear Choice Sets", Econometrica, 59, 1779-1786.
Matzkin, R.L. (1991c) "A Nonparametric Maximum Rank Correlation Estimator", in: W. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge: Cambridge University Press.
Matzkin, R.L. (1991d) "Using Kernel Methods to Smooth Concavity Restricted Estimators", mimeo, Cowles Foundation, Yale University.
Matzkin, R.L. (1992) "Nonparametric and Distribution-Free Estimation of the Binary Choice and the Threshold-Crossing Models", Econometrica, 60, 239-270.
Matzkin, R.L. (1993a) "Nonparametric Identification and Estimation of Polychotomous Choice Models", Journal of Econometrics, 58, 137-168.
Matzkin, R.L. (1993b) "Computation and Operational Properties of Nonparametric Concavity Restricted Estimators", Northwestern University.
Matzkin, R.L. (1994) "Identification in Nonparametric LDV Models", mimeo, Northwestern University.
Matzkin, R.L. and W. Newey (1992) "Kernel Estimation of a Structural Nonparametric Limited Dependent Variable Model", mimeo.
Matzkin, R.L. and M.K. Richter (1991) "Testing Strictly Concave Rationality", Journal of Economic Theory, 53, 287-303.
McDonald, R.J. and M.E. Manser (1984) "The Effect of Commodity Aggregation on Tests of Consumer Behavior", mimeo.
McFadden, D. (1974) "Conditional Logit Analysis of Qualitative Choice Behavior", in: P. Zarembka, ed., Frontiers of Econometrics. New York: Academic Press, 105-142.
McFadden, D. (1981) "Econometric Models of Probabilistic Choice", in: C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. The MIT Press, 198-272.
McFadden, D. (1985) "Specification of Econometric Models", Presidential Address, Fifth World Congress, mimeo, M.I.T.
McFadden, D. and M.K. Richter (1970) "Stochastic Rationality and Revealed Stochastic Preference", Department of Economics, University of California, Berkeley, mimeo.
McFadden, D. and M.K. Richter (1990) "Stochastic Rationality and Revealed Stochastic Preference", in: J.S. Chipman, D. McFadden and M.K. Richter, eds., Preference, Uncertainty, and Optimality: Essays in Honor of Leonid Hurwicz. Boulder: Westview Press, 161-186.
Mukarjee, H. (1988) "Monotone Nonparametric Regression", The Annals of Statistics, 16, 741-750.
Mukarjee, H. and S. Stern (1994) "Feasible Nonparametric Estimation of Multiargument Monotone Functions", Journal of the American Statistical Association, 89(425), 77-80.
Nadaraja, E. (1964) "On Regression Estimators", Theory of Probability and its Applications, 9, 157-159.
Nemirovskii, A.S., B.T. Polyak and A.B. Tsybakov (1983) "Rates of Convergence of Nonparametric Estimates of Maximum Likelihood Type", Problemy Peredachi Informatsii, 21, 258-272.
Powell, J.L., J.H. Stock and T.M. Stoker (1989) "Semiparametric Estimation of Index Coefficients", Econometrica, 57, 1403-1430.
Prakasa Rao, B.L.S. (1983) Nonparametric Functional Estimation. Academic Press.
Richter, M.K. (1985) "Theory of Profit", mimeo, Department of Economics, University of Minnesota.
Robertson, T., F.T. Wright and R.L. Dykstra (1988) Order Restricted Statistical Inference. John Wiley and Sons.
Roehrig, C.S. (1988) "Conditions for Identification in Nonparametric and Parametric Models", Econometrica, 56(2), 433-447.
Royall, R.M. (1966) "A Class of Nonparametric Estimators of a Smooth Regression Function", Ph.D. Thesis, Stanford University.
Samuelson, P.A. (1938) "A Note on the Pure Theory of Consumer Behavior", Economica, 5, 61-71.
Seiford, L.M. and R.M. Thrall (1990) "Recent Developments in DEA", Journal of Econometrics, 46, 7-38.
Stoker, T.M. (1986) "Consistent Estimation of Scaled Coefficients", Econometrica, 54, 1461-1481.
Stoker, T.M. (1989) "Tests of Additive Derivative Constraints", Review of Economic Studies, 56, 535-552.
Strauss, D. (1979) "Some Results on Random Utility Models", Journal of Mathematical Psychology, 20, 35-52.
Thompson, T.S. (1989) "Identification of Semiparametric Discrete Choice Models", Discussion Paper No. 249, Center for Economic Research, University of Minnesota.
Utreras, F. (1984) "Positive Thin Plate Splines", CAT Rep. No. 68, Dept. of Math., Texas A & M Univ.
Utreras, F. (1985) "Smoothing Noisy Data Under Monotonicity Constraints: Existence, Characterization, and Convergence Rates", Numerische Mathematik, 47, 611-625.
Varian, H. (1982) "The Nonparametric Approach to Demand Analysis", Econometrica, 50(4), 945-974.
Varian, H. (1983a) "Nonparametric Tests of Consumer Behavior", Review of Economic Studies, 50, 99-110.
Varian, H. (1983b) "Nonparametric Tests of Models of Investor Behavior", Journal of Financial and Quantitative Analysis, 18, 269-278.
Varian, H. (1984) "The Nonparametric Approach to Production Analysis", Econometrica, 52, 579-597.
Varian, H. (1985) "Non-Parametric Analysis of Optimizing Behavior with Measurement Error", Journal of Econometrics, 30.
Varian, H. (1988) "Goodness-of-Fit in Demand Analysis", CREST Working Paper No. 89-11, University of Michigan.
Villalobos, M. and G. Wahba (1987) "Inequality Constrained Multivariate Smoothing Splines with Application to the Estimation of Posterior Probabilities", Journal of the American Statistical Association, 82, 239-248.
Wahba, G. (1990) Spline Models for Observational Data. CBMS-NSF Regional Conference Series in Applied Mathematics, No. 59, Society for Industrial and Applied Mathematics.
Wald, A. (1949) "Note on the Consistency of the Maximum Likelihood Estimator", Annals of Mathematical Statistics, 20, 595-601.
Wang, Y. (1992) "Nonparametric Estimation Subject to Shape Restrictions", Ph.D. Dissertation, Department of Statistics, University of California, Berkeley.
Watson, G.S. (1964) "Smooth Regression Analysis", Sankhya Series A, 26, 359-372.
Wong, W.H. (1984) "On Constrained Multivariate Splines and Their Approximations", Numerische Mathematik, 43, 141-152.
Wright, F.T. (1982) "Monotone Regression Estimates for Grouped Observations", Annals of Statistics, 10, 278-286.
Yatchew, A.J. (1985) "A Note on Nonparametric Tests of Consumer Behavior", Economics Letters, 18, 45-48.
Yatchew, A.J. (1992) "Nonparametric Regression Tests Based on Least Squares", Econometric Theory, 8, 435-451.
Yatchew, A. and L. Bos (1992) "Nonparametric Tests of Demand Theory", mimeo, Department of Economics, University of Toronto.
Yellott, J.I. (1977) "The Relationship Between Luce's Choice Axiom, Thurstone's Theory of Comparative Judgement, and the Double Exponential Distribution", Journal of Mathematical Psychology, 15, 109-144.
Chapter 43

ANALOG ESTIMATION OF ECONOMETRIC MODELS*

CHARLES F. MANSKI

University of Wisconsin-Madison

Contents

Abstract 2560
1. Introduction 2560
2. Preliminaries 2561
2.1. The analogy principle 2561
2.2. Moment problems 2563
2.3. Econometric models 2565
3. Method-of-moments estimation of separable models 2566
3.1. Mean independence 2567
3.2. Median independence 2568
3.3. Conditional symmetry 2569
3.4. Variance independence 2570
3.5. Statistical independence 2570
3.6. A historical note 2571
4. Method-of-moments estimation of response models 2571
4.1. Likelihood models 2572
4.2. Invertible models 2574
4.3. Mean independent linear models 2574
4.4. Quantile independent monotone models 2575
5. Estimation of general separable and response models 2577
5.1. Closest-empirical-distribution estimation of separable models 2577
5.2. Minimum-distance estimation of response models 2580
6. Conclusion 2581
References 2581

*I am grateful for the comments of Rosa Matzkin and Jim Powell.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

Abstract

Suppose that one wants to estimate a parameter characterizing some feature of a specified population. One has some prior information about the population and a
random sample of observations. A widely applicable approach is to estimate the
parameter by a sample analog; that is, by a statistic having the same properties in
the sample as the parameter does in the population. If there is no such statistic, then
one may choose an estimate that, in some well-defined sense, makes the known
properties of the population hold as closely as possible in the sample. These are
analog estimation methods. This chapter surveys some uses of analog methods to
estimate two classes of econometric models, the separable and the response models.

1. Introduction

Suppose that one wants to estimate a parameter characterizing some feature of a specified population. One has some prior information about the population and
a random sample of observations. A widely applicable approach is to estimate the
parameter by a sample analog; that is, by a statistic having the same properties
in the sample as the parameter does in the population. If there is no such statistic,
then one may choose an estimate that, in some well-defined sense, makes the
known properties of the population hold as closely as possible in the sample. These
are analog estimation methods.
Familiar examples include use of the sample average to estimate the population
mean and sample quantiles to estimate population quantiles. The classical method
of moments (Pearson (1894)) is an analog approach, as is minimum chi-square
estimation (Neyman (1949)). Maximum likelihood, least squares and least absolute
deviations estimation are analog methods.
This chapter surveys some uses of analog methods to estimate econometric
models. Section 2 presents the necessary preliminaries, defining the analogy principle,
moment problems and the method of moments, and two classes of models, the
separable and the response models. Sections 3 and 4 describe the variety of
separable and response models that imply moment problems and may be estimated
by the method of moments. Section 5 discusses two more general analog estimation
approaches: closest empirical distribution estimation of separable models and
minimum distance estimation of response models. Section 6 gives conclusions. The
reader wishing a more thorough treatment of much of the material in this chapter
should see Manski (1988).
The analogy principle is used here to estimate population parameters. Other
chapters of this handbook exploit related ideas for other purposes. The chapter
by Hall describes bootstrap methods, which apply the analogy principle to approximate the distribution of sample statistics. The chapter by Hajivassiliou and Ruud
describes simulation methods, which use the analogy between an observed sample
and a pseudo-sample from the same population, drawn at postulated parameter
values.

2. Preliminaries

2.1. The analogy principle

Assume that a probability distribution P on a sample space Z characterizes a population. One observes a sample of N independent realizations of a random
variable z distributed P. One knows that P is a member of some family Π of probability distributions on Z. One also knows that a parameter b in a parameter
space B solves an equation

T(P, b) = 0, (1)

where T(·, ·) is a given function mapping Π × B into some vector space Y. The problem is to combine the sample data with the knowledge that b∈B, P∈Π and
T(P, b) = 0 so as to estimate b.
Many econometric models imply that a parameter solves an extremum problem
rather than an equation. We can use (1) to express extremum problems by saying
that b solves

b - argmin_{c∈B} W(P, c) = 0.  (2)

Here W(·, ·) is a given function mapping Π × B into the real line.


Let P_N be the empirical distribution of the sample of N draws from P. That is, P_N is the multinomial probability distribution that places probability 1/N on each of the N observations of z. The group of theorems collectively referred to as the laws of large numbers show that P_N converges to P in various senses as N → ∞.
This suggests that to estimate b one might substitute the function T(P_N, ·) for T(P, ·) and use

B_N = {c∈B: T(P_N, c) = 0}.  (3)

This defines the analog estimate when P_N is a feasible value for P; that is, when P_N∈Π. In these cases T(P_N, ·) is well-defined and has at least one zero in B, so B_N is the (possibly set-valued) analog estimate of b.
Equation (3) does not explain how to proceed when P_N∉Π. We have so far defined T(·, ·) only on the space Π × B of feasible population distributions and parameter values. The function T(P_N, ·) is as yet undefined for P_N∉Π.
2562 C.F. Manski

Let Φ denote the space of all multinomial distributions on Z. To define T(P_N, ·) for every sample size and all sample realizations, it suffices to extend T(·, ·) from Π × B to the domain (Π ∪ Φ) × B. Two approaches have proved useful in practice.

Mapping P_N into Π. One approach is to map P_N into Π. Select a function π(·): Π ∪ Φ → Π which maps every member of Π into itself. Now replace the equation T(P, b) = 0 with

T[π(P), b] = 0.  (4)

This substitution leaves the estimation problem unchanged as T[π(Q), ·] = T(Q, ·) for all Q∈Π. Moreover, π(P_N)∈Π; so T[π(P_N), ·] is defined and has a zero in B. The analogy principle applied to (4) yields the estimate

B_Nπ = {c∈B: T[π(P_N), c] = 0}.  (5)

When P_N∈Π, this estimate is the same as the one defined in equation (3). When P_N∉Π, the estimate (5) depends on the selected function π(·); hence we write B_Nπ rather than B_N.
A prominent example of this approach is kernel estimation of Lebesgue density functions. Let Π be the space of distributions having Lebesgue densities. The empirical distribution P_N is multinomial and so is not in Π. But P_N can be smoothed so as to yield a distribution that is in Π. In particular, the convolution of P_N with any element of Π is itself an element of Π. The density of the convolution is a kernel density estimate. See Manski (1988), Chapter 2.
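To make the smoothing concrete, here is a minimal sketch (mine, not the chapter's) of a kernel density estimate computed as the density of the convolution of P_N with a normal distribution; the Gaussian kernel and the bandwidth value are illustrative assumptions:

```python
import numpy as np

def kernel_density(sample, grid, bandwidth):
    # Convolve the empirical distribution P_N (mass 1/N at each z_i)
    # with a N(0, bandwidth^2) distribution: each observation contributes
    # a Gaussian bump of mass 1/N, and the density of the convolution
    # is the average of the bumps.
    z = (grid[:, None] - sample[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return bumps.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
sample = rng.normal(size=200)          # N = 200 draws from P
grid = np.linspace(-4.0, 4.0, 81)
density = kernel_density(sample, grid, bandwidth=0.4)
```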

Direct extension. Sometimes there is a natural direct way to extend the domain of T(·, ·), so T(P_N, ·) is well-defined. Whenever T(P_N, ·) has a zero in B, equation (3) gives the analog estimate. If P_N is not in Π, it may be that T(P_N, c) ≠ 0 for all c∈B. Then the analogy principle suggests selection of an estimate that makes T(P_N, ·) as close as possible to zero in some sense.
To put this idea into practice, select an origin-preserving function r(·) which maps values of T(·, ·) into the non-negative real half line. That is, let r(·): Y → [0, ∞), with T = 0 ⇔ r(T) = 0. Now replace the equation T(P, b) = 0 with the extremum problem

min_{c∈B} r[T(P, c)].  (6)

This substitution leaves the estimation problem unchanged as T(Q, c) = 0 ⇔ r[T(Q, c)] = 0 for (Q, c)∈Π × B. To estimate b, solve the sample analog of (6). Provided only that r[T(P_N, ·)] attains its minimum on B, the analog estimate is

B_Nr = argmin_{c∈B} r[T(P_N, c)].  (7)

If P_N∈Π, this estimate is the same as the one defined in (3). If P_N∉Π but T(P_N, ·) has a zero in B, the estimate remains as in (3). If T(P_N, ·) is everywhere non-zero, the estimate depends on the selected function r(·); hence we write B_Nr rather than B_N.
Section 2.2 describes an extraordinarily useful application of this approach, the method of moments.

2.2. Moment problems

Much of present-day econometrics is concerned with estimation of a parameter b solving an equation of the form

∫ g(z, b) dP = 0  (8)

or an extremum problem of the form

min_{c∈B} ∫ h(z, c) dP.  (9)

In (8), g(·, ·) is a given function mapping Z × B into a real vector space. In (9), h(·, ·) is a given function mapping Z × B into the real line. Numerous prominent examples
of (8) and (9) will be given in Sections 3 and 4 respectively.
When P_N∈Π, application of the analogy principle to (8) and (9) yields the
estimates

B_N = {c∈B: ∫ g(z, c) dP_N = 0} = {c∈B: (1/N) Σ_{i=1}^N g(z_i, c) = 0},  (10)

B_N = argmin_{c∈B} ∫ h(z, c) dP_N = argmin_{c∈B} (1/N) Σ_{i=1}^N h(z_i, c),  (11)

where (z_i, i = 1, ..., N) are the sample observations of z. When P_N∉Π, one might either map P_N into Π or extend the domain of T(·, ·) directly. The latter approach is simplest; the sample analogs of the expectations ∫ g(z, ·) dP and ∫ h(z, ·) dP are the sample averages ∫ g(z, ·) dP_N and ∫ h(z, ·) dP_N. So (10) and (11) remain analog estimates of the parameters solving (8) and (9).
It remains only to consider the possibility that the estimates may not exist. In applications, ∫ h(z, ·) dP_N generally has a minimum. On the other hand, ∫ g(z, ·) dP_N often has no zero. In that case, one may select an origin-preserving transformation r(·) and replace (8) with the problem of minimizing r[∫ g(z, ·) dP], as was done in (6).
2564 C.F. Manski

Minimizing the sample analog yields

B_Nr = argmin_{c∈B} r[∫ g(z, c) dP_N].  (12)

Estimation problems relating b to P by (8) or (9) are called moment problems. Estimates of the forms (10), (11), and (12) are method-of-moments estimates. Use of the term "moment" rather than the equally descriptive "expectation", "mean", or "integral" honors the early work of K. Pearson on the method of moments.
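The following sketch illustrates (12) under assumed specifics: an overidentified location model with g(z, c) = (z - c, sgn(z - c)), so the mean and median are asserted to coincide, and r taken to be the squared Euclidean norm:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def r_of_sample_moments(c, z):
    # Sample analog of E[g(z, c)] with g(z, c) = (z - c, sgn(z - c)),
    # mapped into [0, infinity) by the squared Euclidean norm r.
    g_bar = np.array([(z - c).mean(), np.sign(z - c).mean()])
    return g_bar @ g_bar

rng = np.random.default_rng(1)
z = rng.standard_t(df=5, size=500)     # symmetric population: mean = median

fit = minimize_scalar(r_of_sample_moments, args=(z,),
                      bounds=(-5.0, 5.0), method="bounded")
b_hat = fit.x                          # method-of-moments estimate of b
```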

Consistency of method-of-moments estimates. Clearly, consistent estimation of b


requires that the asserted moment problem has a unique solution; that is, b must
be identified. If no solution exists, the estimation problem has been misspecified and
b is not defined. If there are multiple solutions, sample data cannot possibly
distinguish between them. There is no general approach for determining the number
of solutions to equation systems of the form (8) or to extremum problems of the
form (9). One must proceed more or less case-by-case.
Given identification, method-of-moments estimates are consistent if the estima-
tion problem is sufficiently regular. Rigorous treatments appear in such econo-
metrics texts as Amemiya (1985), Gallant (1987) and Manski (1988). I provide here
an heuristic explanation focussing on (12); case (11) involves no additional consi-
derations.
We are concerned with the behavior of the function r[∫ g(z, ·) dP_N] as N → ∞. The strong law of large numbers implies that for all c∈B, ∫ g(z, c) dP_N → ∫ g(z, c) dP as N → ∞, almost surely. The convergence is uniform on B if the parameter space is sufficiently small, the function g(·, ·) sufficiently smooth, and the distribution P sufficiently well-behaved. (For example, it suffices for B to be a compact finite-dimensional set, for |g(z, ·)| to be bounded by an integrable function D(z), and for g(z, ·) to be continuous on B. See Manski (1988), Chapter 7.) If the convergence is uniform and r(·) is smooth, then as N → ∞ the minima on B of r[∫ g(z, ·) dP_N] tend to occur increasingly near the minima of r[∫ g(z, ·) dP]. The unique minimum of r[∫ g(z, ·) dP] occurs at b. So the estimate B_Nr converges to b.
Uniform convergence on B of ∫ g(z, ·) dP_N to ∫ g(z, ·) dP is close to a necessary condition for consistency of method-of-moments estimates. If this condition is seriously violated, ∫ g(z, ·) dP_N is not a good sample analog to ∫ g(z, ·) dP and the estimation approach does not work. Beginning in the 1930s with the Glivenko-Cantelli Theorem, statisticians and econometricians have steadily broadened the range of specifications of B, g(·, ·) and P for which uniform laws of large numbers have been shown to hold (e.g. Pollard (1984) and Andrews (1987)). Nevertheless, uniformity does break down in situations that are far from pathological. Perhaps the most important practical concern is the size of the parameter space. Given a specification for g(·, ·) and for P, uniformity becomes a more demanding property as B becomes larger.

Sampling distributions. The exact sampling distributions of method-of-moments estimates are generally complicated. Hence the practice is to invoke local asymptotic approximations. If the parameter space is finite-dimensional and the estimation problem is sufficiently regular, a method-of-moments estimate B_Nr converges at rate 1/√N and √N(B_Nr - b) has a limiting normal distribution centered at zero. Alternative estimates of a given parameter may have limiting distributions with different variances. This fact suggests use of the variance of the limiting distribution as a criterion for measuring precision.
Comparison of the precision of alternative estimators has long engaged the
attention of econometric theorists. An estimate is termed asymptotically efficient if the variance of the limiting normal distribution of √N(B_Nr - b) is the smallest possible
given the available prior information. Hansen (1982) and Chamberlain (1987)
provide the central findings on the efficiency of method-of-moments estimates. For
an exposition, see Manski (1988), Chapters 8 and 9.

Non-random sampling. In discussing moment problems and estimation problems


more generally, I have assumed that the data are a random sample. It is important
to understand that random sampling, albeit a useful simplifying idea, is not essential
to the success of analog estimation. The essential requirement is that the sampling
process be such that relevant features of the empirical distribution converge to
corresponding population features.
For example, consider stationary time series problems. Here the data are observations at N dates from a single realization of a stationary stochastic process whose marginal distribution is P. So we do not have a random sample from P. Nevertheless, dependent sampling versions of the laws of large numbers show that P_N converges to P in various senses as N → ∞.

2.3. Econometric models

We have been discussing an estimation problem relating a parameter b to a probability distribution P generating realizations of an observable random variable z. Econometric models typically relate a parameter b to realizations of the observable z and of an unobservable random variable, say u. Analog estimation methods may be used to estimate b if one can transform the econometric model into a representation relating b to P and to nuisance parameters.
Formally, suppose that a probability distribution P_zu on a space Z × U characterizes a population. A random sample of N realizations of a random variable (z, u) distributed P_zu is drawn and one observes the realizations of z but not of u. One knows that P_zu is a member of some family Π_zu of probability distributions on Z × U. One also knows that a parameter b in a parameter space B solves an equation

f(z, u, b) = 0,  (13)



where f(·, ·, ·) maps Z × U × B into some vector space. Equation (13) is to be interpreted as saying that almost every realization (ζ, η) of (z, u) satisfies the equation f(ζ, η, b) = 0.
Equation (13) typically has no content in the absence of information on the probability distribution P_zu generating (z, u). A meaningful model combines (13) with some distributional knowledge. The practice has been to impose restrictions on the probability distribution of u conditional on some function of z, say x = x(z) taking values in a space X. Let P_u|x denote this conditional distribution. Then a model is defined by equation (13) and by a restriction on the conditional distributions (P_u|ξ, ξ∈X).
Essentially all econometric research has specified f to have one of two forms. A separable model makes the unobserved variable u additively separable, so that

f(z, u, b) = u_0(z, b) - u,  (14)

where u_0(·, ·) maps Z × B into U. A response model defines z = (y, x), Z = Y × X, and makes f have the form

f(y, x, u, b) = y - y_0(x, u, b),  (15)

where y_0(·, ·, ·) maps X × U × B into Y. Functional forms (14) and (15) are not
mutually exclusive. Some models can be written both ways.
The next two sections survey the many separable and response models implying that b and a nuisance parameter together solve a moment problem. (The nuisance parameter characterizes unrestricted features of P_u|x.) These models may be estimated by the method of moments if the parameter space is not too large.

3. Method-of-moments estimation of separable models

Separable models suppose that realizations of (z, u) are related to the parameter b through an equation

u_0(z, b) = u.  (16)

In the absence of information restricting the distribution of the unobserved u, this equation simply defines u and conveys no information about b. In the presence of various distributional restrictions, (16) implies that b and a nuisance parameter solve a type of moment equation known as an orthogonality condition, defined here.

Orthogonality conditions. Let x = x(z) take values in a real vector space X. Let Γ denote a space in which a nuisance parameter γ lives. Let e(·, ·) be a function mapping U × Γ into a real vector space. Let e′(·, ·) denote the transpose of the column vector e(·, ·). The random vectors x and e(u, γ) are orthogonal if

∫ x e′(u, γ) dP_zu = 0.  (17)

Equation (17) relates the observed random variable x to the unobserved random variable u. Suppose that (16) holds. Then we can replace u in (17) with u_0(z, b), yielding

∫ x e′[u_0(z, b), γ] dP = 0.  (18)

This orthogonality condition is a moment equation relating the parameters (b, γ) to the distribution P of the observable z.
the distribution P of the observable z.
It is not easy to motivate orthogonality conditions directly, but we can readily
show that these conditions are implied by various more transparent distributional
restrictions. The remainder of this section describes the leading cases.

3.1. Mean independence

The classical econometric literature on instrumental variables estimation is concerned with separable models in which x and u are known to be uncorrelated. Let γ be the mean of u. Zero covariance is the orthogonality condition

∫ x(u - γ) dP_zu = ∫ x[u_0(z, b) - γ] dP = 0.  (19)

Most authors incorporate the nuisance parameter γ into the specification of u_0(·, ·) by giving that function a free intercept. This done, u is declared to have mean zero and equation (19) is rewritten as

∫ x u_0(z, b) dP = 0.  (20)

To facilitate discussion of a variety of distributional restrictions, I shall keep γ explicit.
Zero covariance is sometimes asserted directly, to express a belief that the random variables x and u are unrelated. It is preferable to think of zero covariance as following from a stronger form of unrelatedness. This is the mean-independence condition

∫ u dP_u|ξ = γ, ξ∈X.  (21)



Mean independence implies zero covariance but it is difficult to motivate zero covariance in the absence of mean independence. To see why, rewrite (19) as the iterated expectation

∫ x(u - γ) dP_zu = ∫ x [∫ (u - γ) dP_u|x] dP_x = 0.  (22)

This shows that mean independence implies zero covariance. It also shows that x and u are uncorrelated if positive and negative realizations of x ∫ (u - γ) dP_u|x balance when weighted by the distribution of x. But one rarely has information about P_zu, certainly not information that would make one confident in (22) in the absence of (21). Hence, an assertion of zero covariance suggests a belief that x and u are unrelated in the sense of mean independence.
Mean independence implies orthogonality conditions beyond (19). Let v(·) be any function mapping X into a real vector space. It follows from (16) and (21) that

∫ v(x)[u_0(z, b) - γ] dP = ∫ v(x) [∫ (u - γ) dP_u|x] dP_x = 0,  (23)

provided only that the integral in (23) exists. So the random variables v(x) and u_0(z, b) are uncorrelated. In other words, all functions of x are instrumental variables.
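A numerical sketch of this point, with every concrete choice (the linear form u_0(z, b) = y - bx, the instruments v(x) = (1, x, x²), the grid) an assumption of the illustration rather than part of the text:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
u = rng.normal(size=1000)            # mean independent of x by construction
y = 2.0 * x + u                      # separable model: u0(z, b) = y - b*x
V = np.column_stack([np.ones_like(x), x, x ** 2])   # instruments v(x)

def sample_moments(c):
    # Sample analog of the orthogonality condition (23).
    return V.T @ (y - c * x) / len(y)

# More instruments than parameters: drive the squared norm of the
# sample moments toward zero over a grid of candidate values.
grid = np.linspace(0.0, 4.0, 4001)
b_hat = grid[np.argmin([m @ m for m in (sample_moments(c) for c in grid)])]
```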

3.2. Median independence

The assertion that u is mean independent of x expresses a belief that u has the same central tendency conditional on each realization of x. Median independence offers another way to express this belief. Median independence alone does not imply an orthogonality condition, but it does when the conditional distributions P_u|ξ, ξ∈X are componentwise continuous.
Let U be the real line; the vector case introduces no new considerations as we shall deal with u componentwise. For each ξ in X, let m_ξ be the median of u conditional on the event [x = ξ]. Let γ be the unconditional median of u. We say that u is median independent of x if

m_ξ = γ, ξ∈X.  (24)

It can be shown (see Manski (1988), Chapter 4) that if P_u|ξ, ξ∈X are continuous probability distributions, their medians solve the conditional moment equations

∫ sgn(u - m_ξ) dP_u|ξ = 0, ξ∈X.  (25)

So median independence and continuity together imply that

∫ sgn(u - γ) dP_u|ξ = 0, ξ∈X.  (26)

It follows from (16) and (26) that

∫ v(x) sgn[u_0(z, b) - γ] dP = ∫ v(x) [∫ sgn(u - γ) dP_u|x] dP_x = 0  (27)

for all v(·) such that the integral in (27) exists. Thus, all functions of x are orthogonal to sgn[u_0(z, b) - γ].
The median of a probability distribution is its 0.5-quantile. The above derivation can be generalized to obtain orthogonality conditions implied by the assumption that, for given α∈(0, 1), the α-quantile of u does not vary with x.
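A sketch of the resulting sign orthogonality conditions, under an assumed linear model y = a + cx + u with median-zero u and instruments v(x) = (1, x); the crude grid search is only for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
y = 1.0 + 0.7 * x + rng.standard_cauchy(size=2000)  # heavy-tailed, median-zero u

def sign_moments(a, c):
    # Sample analog of E[v(x) sgn(y - a - c*x)] with v(x) = (1, x).
    s = np.sign(y - a - c * x)
    return np.array([s.mean(), (x * s).mean()])

# Drive the sign moments toward zero over a coarse grid of (a, c).
grid = np.linspace(-1.0, 2.0, 121)
a_hat, c_hat = min(((a, c) for a in grid for c in grid),
                   key=lambda p: sign_moments(*p) @ sign_moments(*p))
```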

3.3. Conditional symmetry

Mean and median independence both express a belief that the central tendency of u does not vary with x. Yet they are different assertions. This fact may cause the applied researcher some discomfort. One often feels at ease saying that the central tendency of u does not vary with x. But only occasionally can one pinpoint the mathematical sense in which the term "central tendency" should be interpreted.
The need for care in defining central tendency disappears if the conditional distributions P_u|ξ, ξ∈X are componentwise symmetric with common center of symmetry. Let U be the real line again and assume that for all realizations of x, the conditional distribution of u is symmetric around some point γ. That is,

P_{u-γ}|ξ = P_{γ-u}|ξ, ξ∈X.  (28)

Let h(·) be any odd function mapping the real line into a real vector space; that is, h(η) = -h(-η) for η in R. Conditional symmetry implies

∫ h(u - γ) dP_u|ξ = 0, ξ∈X.  (29)

Equations (16) and (29) imply that (b, γ) solves

∫ v(x) h[u_0(z, b) - γ] dP = ∫ v(x) [∫ h(u - γ) dP_u|x] dP_x = 0  (30)

for all v(·) and h(·) such that the integral in (30) exists. So all functions of x are orthogonal to all odd functions of u - γ. The functions h(u - γ) = u - γ and h(u - γ) = sgn(u - γ) are odd. Thus, the orthogonality conditions (23) and (27) that follow from mean and median independence are satisfied given conditional symmetry.

3.4. Variance independence

One may believe that u not only has the same central tendency for each realization of x but also the same spread. The usual econometric practice has been to express spread by variance. In the presence of mean independence, variance independence (homoskedasticity) is the additional condition

∫ (u - γ_1)(u - γ_1)′ dP_u|ξ = γ_2, ξ∈X.  (31)

Here γ_1 is the common mean of the distributions P_u|ξ, ξ∈X and γ_2 is the common variance matrix.
Let v(·) be any real function on X. It follows from (16) and (31) that (b, γ_1, γ_2) solves the orthogonality condition

∫ v(x){[u_0(z, b) - γ_1][u_0(z, b) - γ_1]′ - γ_2} dP = 0.  (32)

The assertion of variance independence imposes no restrictions on the variance matrix γ_2. In some applications, information about γ_2 is available. For example, it may be known that the components of u are uncorrelated with one another, so γ_2 is a diagonal matrix. Such information may be expressed by appropriately restricting the space of possible values of γ_2.

3.5. Statistical independence

It is often assumed that u has the same distribution for each realization of x. That is,

P_u|ξ = P_u, ξ∈X.  (33)

This statistical independence assumption implies mean, median and variance independence. In fact, it implies that all functions of x are uncorrelated with all functions of u.
Let s(·) map U into a real vector space. Let γ be the unconditional mean of s(u). It follows from (33) that

∫ s(u) dP_u|ξ = γ, ξ∈X.  (34)

It follows from (16) and (34) that (b, γ) solves

∫ v(x)[s{u_0(z, b)} - γ] dP = 0  (35)

for all v(·) and s(·) such that the integral in (35) exists.

3.6. A historical note

The analogy principle can be applied to the orthogonality conditions derived in the preceding sections to yield method-of-moments estimates of (b, γ). These estimators are easy to understand and to apply. Nevertheless, they have taken considerable time to evolve.
Wright (1928) and Reiersol (1941, 1945) developed the zero covariance condition (19) in the case where U is the real line, X and B are both K-dimensional real space, and u_0(·, ·) is linear in b. In this case, the sample analog of the orthogonality condition always has a solution.
For some time, the literature offered no clear prescription for estimation when the vector x is longer than b; that is, when there are more instruments than unknowns. The sample analog of the zero covariance condition then usually has no solution. The idea of selecting an estimate that makes the sample condition hold as closely as possible took hold in the 1950s, particularly following the work of Sargan (1958).
It was not until the 1970s that the estimation methods developed for linear models were extended to models nonlinear in b. See Amemiya (1974). And it was not until the late 1970s that systematic attention was paid to distributional restrictions other than mean independence. The work of Koenker and Bassett (1978) did much to awaken interest in models assuming median independence. The idea that orthogonality conditions should be thought of as special cases of moment equations took hold in the 1980s. See Burguete et al. (1982), Hansen (1982) and Manski (1983).

4. Method-of-moments estimation of response models

Response models assert that an observable random variable y is a measurable function of a random pair (x, u), with x observable and u not. The mapping from (x, u) to y is known to be a member of a family of functions indexed by a parameter b in a parameter space B. Thus,

y = y_0(x, u, b).  (36)

The random variable y is referred to variously as the dependent, endogenous, explained or response variable. The pair (x, u) are termed independent, exogenous, explanatory or stimulus variables. The function y_0(·, ·, ·) mapping (x, u, b) into y is sometimes called the response function.
Equation (36) is meaningful only when accompanied by information on the distribution of u. The usual practice is to restrict the conditional distribution P_u|x in some way. Many response models imply that b and a nuisance parameter solve a moment problem. I describe here the moment problems implied by likelihood models (Section 4.1), invertible models (Section 4.2), mean independent linear models (Section 4.3) and quantile independent monotone models (Section 4.4).

4.1. Likelihood models

The form of the response model (36) implies that the conditional distribution P_y|x is determined by x, b, and P_u|x. Suppose that P_u|x is known to be a member of a family of distributions r(x, γ), γ∈Γ, where Γ is a parameter space and r(·, ·) is a known function mapping X × Γ into probability distributions on U. Then P_y|x is a function of x, of the parameter of interest b and of the nuisance parameter γ. To be precise, let ξ∈X, c∈B, A ⊂ Y and define

U(ξ, c, A) = {u∈U s.t. y_0(ξ, u, c)∈A}.  (37)

By (36),

P_y(A|ξ) = r(ξ, γ)[U(ξ, b, A)], for all A ⊂ Y.  (38)

Suppose that there exists a measure ν on Y that dominates (in the measure theoretic sense) all of the feasible values of P_y|ξ. That is, for (ξ, c, δ)∈X × B × Γ and A ⊂ Y, ν(A) = 0 ⟹ r(ξ, δ)[U(ξ, c, A)] = 0. Then the Radon-Nikodym Theorem shows that P_y|x has a density with respect to ν, say φ(·; x, b, γ), with the density function known up to the value of (b, γ). Jensen's inequality can be used to show that, for each ξ∈X, (b, γ) solves the extremum problem

max_{(c,δ)∈B×Γ} ∫ log φ(y; ξ, c, δ) dP_y|ξ.  (39)

See, for example, Manski (1988), Chapter 5. Because (39) holds for all values of x,

it follows that (b, γ) solves the unconditional extremum problem

max_{(c,δ)∈B×Γ} ∫ log φ(y; x, c, δ) dP,  (40)

whose sample analog is the maximum likelihood estimator

max_{(c,δ)∈B×Γ} (1/N) Σ_{i=1}^N log φ(y_i; x_i, c, δ).  (41)
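As a sketch of (40)-(41) in a familiar special case (a normal linear model, assumed here purely for illustration), the sample analog can be maximized numerically:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
x = rng.normal(size=500)
y = 1.5 * x + rng.normal(scale=0.8, size=500)

def neg_avg_loglik(theta):
    # Minus the sample analog (41), with phi the N(x*c, d^2) density.
    c, log_d = theta
    d = np.exp(log_d)                  # keep the scale parameter positive
    resid = y - c * x
    return 0.5 * np.log(2.0 * np.pi) + log_d + (resid ** 2).mean() / (2.0 * d ** 2)

fit = minimize(neg_avg_loglik, x0=[0.0, 0.0])
c_hat, d_hat = fit.x[0], np.exp(fit.x[1])
```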

The dominance condition. The above shows that maximum likelihood estimation is well-defined whenever there exists a measure ν dominating the feasible values of P_y|x. I give a pair of familiar examples in which the condition is satisfied.
Discrete response models are ones in which the space Y has at most a countable number of points. The dominance condition is satisfied trivially by letting ν be a counting measure on Y. Hence, all discrete response models are likelihood models.
Models with additive errors are ones in which Y is a finite dimensional real space, U = Y, and equation (36) has the form

y = g(x, b) + u,  (42)

g(·, ·) being a given function mapping X × B into Y. It follows that

P_y|x = P_{g(x,b)+u}|x.  (43)

Suppose that the distributions P_u|x are known to be dominated by Lebesgue measure. Then the shifted distributions P_{g(x,b)+u}|x are similarly dominated. So the distributions P_y|x are dominated by Lebesgue measure.

The nuisance parameter. It may appear from the above that maximum likelihood can be used to estimate almost any response model, but this conclusion is too sanguine. To apply the method, one usually must estimate the parameter of interest b and the nuisance parameter γ jointly. (There are special cases in which the problem decomposes but these are not the rule.) The estimation task is typically feasible if one has substantial prior information; for example, the classical theory of maximum likelihood estimation supposes that the parameter space B × Γ is finite dimensional. But the maximum likelihood estimate may not be consistent if the parameter space is too large. And the computational problem of maximizing the likelihood function may become intractable before the method breaks down statistically.
For example, maximum likelihood estimation is unappealing when one knows only that u is mean or median independent of x. In these cases, the space Γ indexing the possible values of P_u|x is rather large. In fact, the dominance condition typically fails to hold, so that the maximum likelihood estimate is not even well-defined.
That maximum likelihood may break down in the presence of weak distributional information does not imply that estimation is impossible. The remainder of this section shows that some response models can be estimated using other method-of-moments approaches.

4.2. Invertible models

Suppose that y and u are one to one. That is, for each (ξ, c) in X × B, let y_0(ξ, ·, c) be invertible as a mapping from U into Y. Let y_0^{-1}(ξ, ·, c) denote the inverse function mapping Y into U. Then an alternative representation of equation (36) is

y_0^{-1}(x, y, b) = u.  (44)

This is a separable model, so all of the approaches described in Section 3 can be applied.
The additive error model (42) is obviously invertible. Also invertible are the linear simultaneous equations models prominent in the econometrics literature. In simultaneous equations analysis, equation (44) is referred to as the structural form and equation (36) as the reduced form of the model.

4.3. Mean independent linear models

Certain response functions combine well with specific distributional restrictions. Linear response functions pair nicely with mean independent unobservables.
Let Y and U be J-dimensional and K-dimensional real space. Let equation (36) have the linear-in-u form

y = g_1(x, b) + g_2(x, b)u.  (45)

Here g_1(·, ·) maps X × B into R^J. The function g_2(·, ·) maps X × B into R^{JK} and is written as a J × K matrix. Note that the response function in (45) is not invertible unless J = K and the matrices g_2(ξ, c), (ξ, c)∈X × B are non-singular.
Let it be known that u is mean independent of x. Let γ denote the mean of u. Equation (45) implies that the mean regression of y on x is

E(y|x) = g_1(x, b) + g_2(x, b)γ.  (46)

Mean regressions solve a variety of moment problems. Rewrite (46) as

∫ [y - g_1(ξ, b) - g_2(ξ, b)γ] dP_y|ξ = 0, ξ∈X.  (47)



Let v(·) be any function on X. Because (47) holds for all values of x, it follows that (b, γ) solves the orthogonality condition

∫ v(x)[y - g_1(x, b) - g_2(x, b)γ] dP = 0.  (48)

Another approach uses the well-known fact that the mean regression of y on x is the best predictor of y conditional on x, in the sense of minimizing expected square loss. That is, for each ξ∈X, (b, γ) solves the extremum problem

min_{(c,δ)∈B×Γ} ∫ [y - g_1(ξ, c) - g_2(ξ, c)δ]′[y - g_1(ξ, c) - g_2(ξ, c)δ] dP_y|ξ.  (49)

It follows that, for any function w(·) mapping X into (0, ∞), (b, γ) solves the unconditional extremum problem

min_{(c,δ)∈B×Γ} ∫ w(x)[y - g_1(x, c) - g_2(x, c)δ]′[y - g_1(x, c) - g_2(x, c)δ] dP,  (50)

whose sample analog is a weighted least squares estimator of (b, γ), with weights w(x).
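A sketch of this weighted least squares sample analog under assumed forms: scalar y, g_1(x, b) = bx, g_2(x, b) = 1 (so the intercept estimates γ), and an illustrative weight function w(x) = 1/(1 + x²). With these choices the minimizer has the usual closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1000)
y = 2.0 * x + 0.5 + rng.normal(size=1000)   # here gamma = E[u] = 0.5

w = 1.0 / (1.0 + x ** 2)                    # illustrative positive weights w(x)
X = np.column_stack([x, np.ones_like(x)])   # columns: slope b, intercept gamma

# Weighted least squares normal equations: (X'WX) theta = X'Wy.
WX = X * w[:, None]
b_hat, gamma_hat = np.linalg.solve(X.T @ WX, WX.T @ y)
```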

4.4. Quantile independent monotone models

Whereas mean independence meshes well with linear response functions, quantile independence combines nicely with real valued response functions that are monotone in a scalar u. Let Y and U be the real line. Let it be known that, for given α∈(0, 1), the α-quantile of u does not vary with x. Let γ denote the α-quantile of u. For each ξ∈X, let y_0(ξ, u, c) be non-decreasing as a function of u and continuous at γ as a function of u. Then it can be shown that y_0(x, γ, b) is the α-quantile regression of y on x (see Manski (1988), Chapter 6).
The α-quantile regression of y on x is a best predictor of y conditional on x, in the sense of minimizing the expected value of the asymmetric absolute loss function giving weights (1 - α) and α to overpredictions and underpredictions. That is, for each ξ∈X, (b, γ) solves the extremum problem

min_{(c,δ)∈B×Γ} ∫ {(1 - α)1[y < y_0(ξ, δ, c)]|y - y_0(ξ, δ, c)| + α1[y > y_0(ξ, δ, c)]|y - y_0(ξ, δ, c)|} dP_y|ξ.  (51)

It follows that, for any function w(·) mapping X into (0, ∞), (b, γ) solves the unconditional extremum problem

min_{(c,δ)∈B×Γ} ∫ w(x){(1 - α)1[y < y_0(x, δ, c)]|y - y_0(x, δ, c)| + α1[y > y_0(x, δ, c)]|y - y_0(x, δ, c)|} dP,  (52)

whose sample analog is a weighted asymmetric least absolute deviations estimator of (b, γ), with weights w(x).
Two applications follow. For simplicity, I confine attention to the median independence case.

Censored response. Let Y = [0, ∞) and X = B = R^K. Powell (1984, 1986) studied estimation of the censored linear model asserting that y = 0 if xb + u ≤ 0 and y = xb + u otherwise. That is,

y = max(0, xb + u).  (53)

For each ξ in X, the function max(0, ξb + u) is non-decreasing and continuous in u. Hence, the median of P_y|x is max(0, xb + γ). Applying (52), (b, γ) solves

min_{(c,δ)∈B×Γ} ∫ |y - max(0, xc + δ)| dP,  (54)

whose sample analog is the censored least absolute deviations estimator

min_{(c,δ)∈B×Γ} (1/N) Σ_{i=1}^N |y_i - max(0, x_i c + δ)|.  (55)
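A sketch of the criterion (55) on simulated data; the scalar regressor, the Laplace errors and the derivative-free optimizer (the criterion is non-smooth) are choices of the illustration, not of the text:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
x = rng.normal(size=1000)
y = np.maximum(0.0, 1.5 * x + 0.5 + rng.laplace(size=1000))  # median-zero errors

def clad(theta):
    # Sample analog (55): mean absolute deviation from max(0, x*c + d).
    c, d = theta
    return np.abs(y - np.maximum(0.0, x * c + d)).mean()

fit = minimize(clad, x0=[1.0, 0.0], method="Nelder-Mead")
c_hat, d_hat = fit.x
```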

Binary response. Let Y = {0, 1} and X = B = R^K. Manski (1975, 1985) studied estimation of the binary response model asserting that y = 0 if xb + u ≤ 0 and y = 1 otherwise. That is,

y = 1[xb + u > 0].  (56)

For each ξ in X, the response function 1[ξb + u > 0] is non-decreasing in u. This function is continuous at γ if ξb ≠ -γ, but discontinuous at γ if ξb = -γ. Nevertheless, it can be shown that for all ξ, the median of P_y|ξ is 1[ξb + γ > 0]. Applying (52), (b, γ) solves

min_{(c,δ)∈B×Γ} ∫ |y - 1[xc + δ > 0]| dP,  (57)

whose sample analog is the maximum score estimator

min_{(c,δ)∈B×Γ} (1/N) Σ_{i=1}^N |y_i - 1[x_i c + δ > 0]|.  (58)
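A sketch of the criterion (58) for a scalar regressor. Since the scale of b is not identified in (56), the slope is normalized to one and only the intercept is searched; the criterion is a step function, so a crude grid is used:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=1000)
y = (x + 0.3 + rng.logistic(size=1000) > 0).astype(float)  # median-zero errors

def score_loss(d):
    # Sample analog (58) with the slope normalized to 1: the fraction
    # of observations misclassified by the predictor 1[x + d > 0].
    return np.abs(y - (x + d > 0.0)).mean()

grid = np.linspace(-2.0, 2.0, 801)
d_hat = grid[np.argmin([score_loss(d) for d in grid])]
```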

5. Estimation of general separable and response models

The method-of-moments approaches described in the preceding sections make it


possible to estimate a wide variety of econometric models and so are enormously
useful. These approaches cannot, however, handle all models of potential interest.
Not all models imply moment problems and those models that do imply moment
problems can be estimated by the method of moments only if the estimation
problem is sufficiently regular.
This section describes more general analog approaches to the estimation of
separable and response models. Section 5.1 presents closest-empirical-distribution
estimation of separable models, introduced in Manski (1983). Section 5.2 presents
minimum-distance estimation of response models, based on the work of Wolfowitz
(1953, 1957) and others.

5.1. Closest-empirical-distribution estimation of separable models

Recall that a separable model has the form u_0(z, b) = u. Hence

P_zu = P_{z,u_0(z,b)}.  (59)

Thus, the joint distribution of observables and unobservables is a function of the observable distribution P and of the parameter b. To make the dependence of P_zu on (P, b) explicit, let Q be any probability distribution on Z and let z(Q) denote a random variable distributed Q. For c∈B, let ψ(Q, c) denote the distribution of [z(Q), u_0{z(Q), c}]. Then (59) implies that

P_zu = ψ(P, b).  (60)

Suppose one knows that P_zu∈Π_zu, where Π_zu is some family of distributions on Z × U. By (60),

ψ(P, b)∈Π_zu.  (61)

This translates the information on P_zu into a condition relating the parameter b to the observable distribution P. We may now apply the analogy principle to (61).

The analog estimate for b is

B_N = {c∈B: ψ(P_N, c)∈Π_zu},  (62)

unless this set is empty. In that case, the analogy principle suggests selection of an estimate that makes the distribution ψ(P_N, ·) as close as possible to Π_zu, in some sense.
To do this, we may select a function r(·, Π_zu) that maps each probability distribution ψ on Z × U into [0, ∞) and that satisfies the condition r(ψ, Π_zu) = 0 ⇔ ψ∈Π_zu. Thus, r(·, Π_zu) distinguishes distributions that are in Π_zu from ones that are not. Condition (61) is equivalent to saying that b solves the minimization problem

min_{c∈B} r[ψ(P, c), Π_zu].  (63)

The analogy principle applied to (63) yields the closest-empirical-distribution (CED) estimator

B_Nr = argmin_{c∈B} r[ψ(P_N, c), Π_zu].  (64)

In words, (64) selects an estimate that brings the distribution of [z(P_N), u_0{z(P_N), ·}] as close as possible to Π_zu.

Examples. Method-of-moments estimates of parameters solving orthogonality conditions are CED estimates. The set Π_zu is the family of distributions satisfying (17). This information is translated by (18) into an orthogonality condition relating b to P. The method of moments then selects an estimate that makes the distribution of [z(P_N), u_0{z(P_N), ·}] satisfy the orthogonality condition as closely as possible.
As a second example, suppose it is known that u is statistically independent of some function of z, say x = x(z). This information can be expressed through the statement that, for all (s, t)∈X × U,

P_zu(x ≤ s, u ≤ t) = P(x ≤ s)P(u ≤ t).  (65)

This translates into the following restriction relating b to P:

P[x ≤ s, u_0(z, b) ≤ t] = P(x ≤ s)P[u_0(z, b) ≤ t].  (66)

If it exists, the analog estimate is

B_N = {c∈B: P_N[x ≤ s, u_0(z, c) ≤ t] = P_N(x ≤ s)P_N[u_0(z, c) ≤ t], ∀(s, t)}.  (67)



But this estimate typically does not exist, so we need to make (65) hold as closely as possible, in some sense.
One among many a priori reasonable approaches expresses the prior information through the statement that b minimizes the integrated square distance between P[x ≤ s, u_0(z, ·) ≤ t] and P(x ≤ s)P[u_0(z, ·) ≤ t]. That is, b solves the minimization problem

min_{c∈B} ∫ {P[x ≤ s, u_0(z, c) ≤ t] - P(x ≤ s)P[u_0(z, c) ≤ t]}² ds dt,  (68)

whose sample analog is the estimator

min_{c∈B} ∫ {P_N[x ≤ s, u_0(z, c) ≤ t] - P_N(x ≤ s)P_N[u_0(z, c) ≤ t]}² ds dt.  (69)

Computation of this estimator is difficult as one must integrate over all values of (s, t). A computationally simpler (but notationally more complicated) estimator results if one uses mean square distance rather than integrated square distance. Let P′ = P, define z′ to be a random variable distributed P′ and independent of z, and replace (68) by

min_{c∈B} ∫ {P[x ≤ x′, u_0(z, c) ≤ u_0(z′, c)] - P(x ≤ x′)P[u_0(z, c) ≤ u_0(z′, c)]}² dP′.  (70)

The sample analog of (70) is

min_{c∈B} (1/N) Σ_{i=1}^N {P_N[x ≤ x_i, u_0(z, c) ≤ u_0(z_i, c)] - P_N(x ≤ x_i)P_N[u_0(z, c) ≤ u_0(z_i, c)]}².  (71)
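A sketch of the criterion (71) under an assumed model: z = (x, w) with u_0(z, c) = w - cx and u statistically independent of x. The empirical c.d.f.s are evaluated at the sample points by pairwise comparisons:

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=300)
w = 1.2 * x + rng.normal(size=300)    # u = w - 1.2*x is independent of x

def ced_criterion(c):
    # Mean square distance (71) between the joint empirical c.d.f. of
    # (x, u0(z, c)) and the product of its marginals, at the sample points.
    u = w - c * x
    le_x = x[None, :] <= x[:, None]   # entry [i, j]: x_j <= x_i
    le_u = u[None, :] <= u[:, None]   # entry [i, j]: u_j <= u_i
    joint = (le_x & le_u).mean(axis=1)
    product = le_x.mean(axis=1) * le_u.mean(axis=1)
    return ((joint - product) ** 2).mean()

grid = np.linspace(0.0, 2.5, 251)
c_hat = grid[np.argmin([ced_criterion(c) for c in grid])]
```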

Consistency of CED estimates. Manski (1983) shows that closest-empirical-distribution estimates are consistent if b is identified, B is compact and finite dimensional, r[ψ(P, ·), Π_zu] is continuous on B, and

sup_{c∈B} |r[ψ(P_N, c), Π_zu] - r[ψ(P, c), Π_zu]| → 0 as N → ∞,  (72)

almost surely. (Condition (72) is an abstract uniform law of large numbers.) This consistency theorem has been applied to prove that the estimator (71) is consistent given regularity conditions (see Manski (1983), Corollary to Theorem 2). The asymptotic sampling behavior of general CED estimates has not been studied.
2580 C.F. Manski

5.2. Minimum-distance estimation of response models

Recall that the response model (36) implies that P_y|x is a function of (x, b, P_u|x). When we introduced likelihood models in Section 4.1, we assumed that P_u|x is a member of a family of distributions r(x, γ), γ∈Γ, so P_y|x is a function of (x, b, γ). We also assumed that P_y|x is dominated by a known measure ν. Here we maintain the former assumption but drop the latter one.
By assumption,

P_y|x = h(x, b, γ),  (73)

where h(·, ·, ·) is a known function mapping X × B × Γ into probability distributions on Y. Let ρ(·, ·) be a function that maps pairs of probability distributions on Y into [0, ∞) and that satisfies the condition Q = Q′ ⇔ ρ(Q, Q′) = 0. Equation (73) implies that (b, γ) solves the collection of conditional minimization problems

min_{(c,δ)∈B×Γ} ρ[(P_y|ξ), h(ξ, c, δ)], ξ∈X.  (74)

It follows that, for any function w(·) mapping X into (0, ∞), (b, γ) solves the unconditional minimization problem

min_{(c,δ)∈B×Γ} ∫ w(x) ρ[(P_y|x), h(x, c, δ)] dP,  (75)

whose sample analog is the minimum-distance estimator

min_{(c,δ)∈B×Γ} ∫ w(x) ρ[(P_Ny|x), h(x, c, δ)] dP_N = min_{(c,δ)∈B×Γ} (1/N) Σ_{i=1}^N w(x_i) ρ[(P_Ny|x_i), h(x_i, c, δ)].  (76)

Wolfowitz (1953, 1957) investigated the case with no conditioning variable x and with ρ specified to be a metric on the space of distributions on Y. In that setting, (75) selects an estimate that minimizes the distance, in the sense of ρ, between the theorized distribution of y and its empirical distribution. Sahler (1970) extended the approach by letting ρ be any smooth function that maps pairs of probability distributions on Y into [0, ∞) and that satisfies the condition Q = Q′ ⇔ ρ(Q, Q′) = 0.
An early minimum-distance estimator with conditioning variables is the minimum chi-square method (Neyman, 1949). Here x is multinomial with support X = (1, ..., J), y is Bernoulli conditional on x, and ρ is Euclidean distance. So the estimator is

min_{(c,δ)∈B×Γ} (1/N) Σ_{j=1}^J N_j w(j)[P_N(y = 1|x = j) - h(j, c, δ)]².  (77)

Here N_j is the number of observations at which x_i = j and P_N(y = 1|x = j) is the sample frequency of the event [y = 1] conditional on the event [x = j].
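A sketch of (77) on simulated data; the logistic form of h(j, c, δ) and the unit weights w(j) = 1 are assumptions of the illustration:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)
J = 5
xs = rng.integers(1, J + 1, size=2000)                 # multinomial x on {1,...,J}
p_true = 1.0 / (1.0 + np.exp(-(0.8 * xs - 2.0)))
ys = (rng.random(2000) < p_true).astype(float)         # Bernoulli y given x

cells = np.arange(1, J + 1)
N_j = np.array([(xs == j).sum() for j in cells])
freq = np.array([ys[xs == j].mean() for j in cells])   # P_N(y = 1 | x = j)

def chi_square(theta):
    # Sample criterion (77) with w(j) = 1 and a logistic h(j, c, d).
    c, d = theta
    h = 1.0 / (1.0 + np.exp(-(c * cells + d)))
    return ((N_j / N_j.sum()) * (freq - h) ** 2).sum()

fit = minimize(chi_square, x0=[0.0, 0.0], method="Nelder-Mead")
c_hat, d_hat = fit.x
```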
Econometricians use the term minimum-distance to refer to estimators that minimize the distance between specified features of the theorized distribution of y and the sample analogs of these features. Thus, in econometric usage, t(Q) = t(Q′) ⇔ ρ(Q, Q′) = 0 for some functional t(·). Usually, ρ has measured Euclidean distance between theoretical and sample moments (e.g. Goldberger and Joreskog, 1972). In a recent application, Chamberlain (1994) measures distance between theoretical and sample medians of y conditional on x.

6. Conclusion

This chapter has surveyed the application of analog methods to estimate econometric
models. The analogy principle is more than just a useful tool for generating
estimators. It offers a valuable framework for teaching and for research.
The analogy principle is an effective device for teaching estimation. In analog
estimation, one begins by asking what one knows about the population. One then
treats the sample as if it were the population. Finally, one selects an estimate that
makes the known properties of the population hold as closely as possible in the
sample. What could be more intuitive?
The analogy principle disciplines econometric research by focussing attention
on estimation problems rather than on methods. Much of the literature proposes
some new method and then looks for problems to which it can be applied. This
approach has been productive, but it seems more sensible to first specify an
estimation problem and then seek to develop applicable estimation methods. The
analogy principle forces this mode of thought. One can define an analog estimator
only after one has stated the estimation problem of interest.

References

Amemiya, T. (1974) The Non-Linear Two Stage Least Squares Estimator, Journal of Econometrics, 2, 105-110.
Amemiya, T. (1985) Advanced Econometrics, Cambridge: Harvard University Press.
Andrews, D. (1987) Consistency in Nonlinear Econometric Models: A Generic Uniform Law of Large Numbers, Econometrica, 55, 1465-1471.
Burguete, J., Gallant, R., and Souza, G. (1982) On Unification of the Asymptotic Theory of Nonlinear Econometric Models, Econometric Reviews, 1, 151-190.
Chamberlain, G. (1987) Asymptotic Efficiency in Estimation With Conditional Moment Restrictions, Journal of Econometrics, 34, 305-334.
Chamberlain, G. (1994) Quantile Regression, Censoring, and the Structure of Wages, in: C. Sims, ed., Advances in Econometrics: Sixth World Congress, New York: Cambridge University Press, forthcoming.
Gallant, R. (1987) Nonlinear Statistical Models, New York: Wiley.
Goldberger, A. and Joreskog, K. (1972) Factor Analysis by Generalized Least Squares, Psychometrika, 37, 243-260.
Hansen, L. (1982) Large Sample Properties of Generalized Method of Moments Estimators, Econometrica, 50, 1029-1054.
Koenker, R. and Bassett, G. (1978) Regression Quantiles, Econometrica, 46, 33-50.
Manski, C. (1975) Maximum Score Estimation of the Stochastic Utility Model of Choice, Journal of Econometrics, 3, 205-228.
Manski, C. (1983) Closest Empirical Distribution Estimation, Econometrica, 51, 305-319.
Manski, C. (1985) Semiparametric Analysis of Discrete Response: Asymptotic Properties of the Maximum Score Estimator, Journal of Econometrics, 27, 303-333.
Manski, C. (1988) Analog Estimation Methods in Econometrics, London: Chapman and Hall.
Neyman, J. (1949) Contributions to the Theory of the χ² Test, in: Berkeley Symposium on Mathematical Statistics and Probability, Berkeley: University of California Press.
Pearson, K. (1894) Contributions to the Mathematical Theory of Evolution, Philosophical Transactions of the Royal Society of London, A185, 71-78.
Pollard, D. (1984) Convergence of Stochastic Processes, New York: Springer-Verlag.
Powell, J. (1984) Least Absolute Deviations Estimation for the Censored Regression Model, Journal of Econometrics, 25, 303-325.
Powell, J. (1986) Censored Regression Quantiles, Journal of Econometrics, 32, 143-155.
Reiersol, O. (1941) Confluence Analysis by Means of Lag Moments and Other Methods of Confluence Analysis, Econometrica, 9, 1-23.
Reiersol, O. (1945) Confluence Analysis by Means of Instrumental Sets of Variables, Arkiv for Matematik, Astronomi och Fysik, 32A, no. 4, 1-119.
Sahler, W. (1970) Estimation by Minimum Discrepancy Methods, Metrika, 16, 85-106.
Sargan, J. (1958) The Estimation of Economic Relationships Using Instrumental Variables, Econometrica, 26, 393-415.
Wolfowitz, J. (1953) Estimation by the Minimum Distance Method, Annals of the Institute of Statistical Mathematics, 5, 9-23.
Wolfowitz, J. (1957) The Minimum Distance Method, Annals of Mathematical Statistics, 28, 75-88.
Wright, S. (1928) Appendix B to Wright, P., The Tariff on Animal and Vegetable Oils, New York: Macmillan.
Chapter 44

TESTING NON-NESTED HYPOTHESES

C. GOURIEROUX

CREST-CEPREMAP

A. MONFORT

CREST-INSEE

Contents

1. Introduction 2585
2. Non-nested hypotheses 2587
2.1. Definitions 2587

2.2. Pseudo-true values 2589

2.3. Semi-parametric hypotheses 2590

2.4. Examples 2591

2.5. Symmetry of the problem 2596

3. Testing procedures 2597


3.1. Maximum likelihood estimator under misspecification 2597

3.2. The extended Wald test 2598

3.3. The extended score test 2600


3.4. The Cox procedure 2602
3.5. Application to the choice of regressors in linear models 2605
3.6. Applications to qualitative models 2608

4. Artificial nesting models 2610


4.1. Examples 2610
4.2. Local expansions of artificial nesting models 2614
4.3. A score test based on a modified Atkinson's compound model 2618
4.4. The partially modified Atkinson's compound model 2621
5. Comparison of testing procedures 2621
5.1. Asymptotic equivalence of test statistics 2622
5.2. Asymptotic comparisons of power functions 2622
5.3. Exact finite sample results 2624

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

1. Introduction

The comparison of different hypotheses, i.e. of competing models, is the basis of model specification. It may be performed along two main lines. The first one consists in associating with each model a loss function and in retaining the specification implying the smallest (estimated) loss. In practice, the loss function is defined either by updating some a priori knowledge on the models given the available observations (the Bayesian point of view), or by introducing some criterion taking into account the trade-off between the goodness of fit and the complexity of the model (for instance the usual adjusted R² or the Akaike information criterion). This approach, called model choice or model selection, has already been discussed in the handbook (see Leamer (1983)) and therefore it will not be treated in this chapter, except in Section 6. The second approach is hypothesis testing theory. For model selection we have to choose a decision rule explaining for which observations we prefer to retain each hypothesis. In the simplest case of two hypotheses H_0 and H_1, this is equivalent to the definition of the critical region giving the set of observations for which H_0 is rejected. However, the determination of the decision rule is not done on the same basis as model choice. The basis of hypothesis testing theory is to introduce the probability of errors: first error type (to reject H_0 when it is true), and second error type (to reject H_1 when it is true), then to choose a critical region for which the first error type probability is smaller than a given size (generally 5%) and the second error type probability is as small as possible. Hypothesis testing theory is usually advocated when one of the hypotheses H_0, called the null hypothesis, may be considered as a limit case of the second hypothesis H_1, called the alternative hypothesis. Broadly speaking, the model H_0 ∪ H_1 can be reduced to the submodel H_0 by imposing some restrictions on the parameters. H_0 is said to be nested in H_1. Some general testing procedures have been developed for this case; the main ones are the likelihood ratio tests, Wald tests, Lagrange multiplier (or score) tests and Hausman's tests (see Engle (1984) for a survey of these testing procedures).
In this chapter we are interested in the opposite case, where none of the hypotheses is a particular case of another one. These hypotheses may be entirely distinct (globally non-nested hypotheses) or may have an intersection (partially non-nested hypotheses). Most of the theoretical literature on non-nested hypotheses testing derives from papers by Cox (1961, 1962) and Atkinson (1970). The first author developed a general procedure, known as the Cox test, which generalizes the likelihood ratio procedure used in the case of nested hypotheses. The second author proposed to introduce a third model H, called an artificial nesting model, containing both H_0 and H_1 and to use the classical procedures for testing H_0 against H and H_1 against H. These two approaches to the problem are conceptually different, even if they provide similar results in a number of important applications,

especially in the case of linear models. The example of linear models, which leads to explicit and tractable computations, has been extensively studied during the seventies and the beginning of the eighties (see, e.g. Fisher and McAleer (1981), Godfrey and Pesaran (1983), Pesaran (1974, 1982a, 1982b)), and different nesting models have been proposed for specific problems. The generalization of Wald and score testing procedures to non-nested hypotheses has been proposed by Cox (1961), Gourieroux et al. (1983) and Mizon and Richard (1986) and their links with the Cox test have been studied. In parallel, Davidson and MacKinnon (1981, 1983, 1984) considered some local approximations of artificial nesting models in a neighbourhood of the hypothesis H_0 and derived the associated tests. Then the power of these different testing procedures has been compared, either in finite samples (McAleer (1981), Fisher and McAleer (1981), Dufour (1989)) or asymptotically (Davidson and MacKinnon (1987), Gourieroux (1983), Pesaran (1984, 1987), Szroeter (1989)). The generalization of the Wald test leads to a procedure with interesting interpretations in terms of predictions (Gourieroux et al. (1983), Mizon (1984), Mizon and Richard (1986)). This interpretation has been used as the basis of a modelling strategy. The so-called encompassing principle has been developed in a number of papers by Mizon and Richard (1981, 1986), Hendry (1988), Hendry and Mizon (1990), Hendry and Richard (1990) and associated tests have been introduced in Gourieroux and Monfort (1992).
In Section 2, we carefully define the notions of non-nested hypotheses; we
distinguish, especially, partially and globally non-nested hypotheses. For this
purpose it is necessary to introduce a suitable metric measuring the closeness of
the hypotheses; this leads to two concepts of pseudo-true value, one marginal
with respect to the explanatory variables and the other conditional on these
variables. Different classical examples of non-nested models are also described in
this section.
Section 3 treats the extension of the usual testing procedures: Wald, score, likeli-
hood ratio tests. We obtain different forms of the test statistics depending on the
kind of pseudo-true value which is used. The application to the choice of regressors
in the linear model is particularly emphasized.
The artificial nesting models are presented in Section 4, first for some specific
problems in which the nesting models may have interesting interpretations, then
in the general case. The expansion of these artificial nesting models in a neighbour-
hood of one of the hypotheses leads to a linearization of the problem and to simple
specification tests, called the J-test and the P-test (Davidson and MacKinnon
(1981)).
These tests are compared in Section 5 through their asymptotic power. This
analysis is complicated by the fact that some points of the composite null hypothesis
are not the limits of points of the alternative hypothesis. A local power analysis
may only be conducted along a sequence of points leading to a distribution which
is common to both hypotheses; otherwise it is necessary to develop some other
asymptotic comparisons of tests for fixed alternatives (Bahadur (1960), Geweke
(1981), Gourieroux (1983)).
In the last section, we discuss the use of non-nested testing procedures for model
building. We introduce the notion of encompassing and explain how it may be
used as a modelling strategy. Then we derive Wald tests of the encompassing
hypothesis which are modified versions of the Wald test for non-nested hypotheses,
taking into account the fact that the true distribution does not necessarily belong
to one of the competing models.
Since our aim is to present the main ideas and potential applications, we will
in general omit the proofs and the less significant assumptions. In the whole
chapter, we consider T observations of some endogenous and exogenous variables,
denoted by y_t, x_t, t = 1, ..., T, respectively. We assume that:

the pairs (y_t, x_t), t = 1, ..., T, are independently and identically distributed with
an unknown probability density function f_{0x}(x) f_{0y}(y/x) with respect to a
product measure denoted by ν_x(dx) ⊗ ν_y(dy).

These assumptions are made in order to simplify the presentation, but they may
be weakened in order to allow for some correlations between the pairs (y_t, x_t), for
instance when there are lagged endogenous regressors (see, e.g. Wooldridge (1990),
Domowitz and White (1982), Gourieroux and Monfort (1992)), or to consider
the case of deterministic regressors (White and Domowitz (1984), Wooldridge
(1990)).

2. Non-nested hypotheses

The hypotheses may concern either the whole conditional distribution of y_t given
x_t, or simply some conditional moments, such as the conditional expectation. We
successively consider these two situations.

2.1. Definitions

When the hypotheses concern the whole conditional distribution and have a para-
metric form, they may be written as

H_g = {g(y_t/x_t; α), α ∈ A ⊂ R^K},

H_h = {h(y_t/x_t; β), β ∈ B ⊂ R^L}. (2.1)

They are respectively parameterized by the parameters α and β, which may have
different sizes and different interpretations. The first hypothesis H_g (for instance)
is valid if the true conditional distribution f_{0y}(y_t/x_t) can be written as g(y_t/x_t; α_0)
for some α_0 ∈ A. In such a case α_0 is the true value of the parameter.
To measure the closeness between the two hypotheses H_g and H_h, we have to
introduce a proximity measure between probability density functions. One such
measure is the Kullback-Leibler information criterion (KLIC). The KLIC of two
conditional distributions g(y/x; α) and h(y/x; β) may be defined either conditionally
on the exogenous variables by

I_gh(α, β; x) = ∫ log [g(y/x; α)/h(y/x; β)] g(y/x; α) ν_y(dy), (2.2)

or unconditionally by

I_gh(α, β) = ∫∫ log [g(y/x; α)/h(y/x; β)] g(y/x; α) f_{0x}(x) ν_x(dx) ν_y(dy). (2.3)

The conditional version may be computed as soon as the two hypotheses have
been defined, whereas the unconditional version I_gh(α, β) depends on the unknown
marginal distribution of the exogenous variables, and therefore is also unknown.
However, it may be consistently estimated by

Î_gh(α, β) = (1/T) Σ_{t=1}^T I_gh(α, β; x_t). (2.4)
The KLIC takes non-negative values and is equal to zero if and only if the two
p.d.f.s appearing in the definition are the same:

I_gh(α, β; x) = 0 ⟺ g(y/x; α) = h(y/x; β).

It is not a distance in the mathematical sense since, for instance, it does not satisfy
the symmetry condition:

I_gh(α, β; x) ≠ I_hg(β, α; x).

For the definition of nested and non-nested hypotheses, we follow Pesaran (1987).
We first define the proximity between a distribution of H_g and the whole hypothesis
H_h by

I_g(α; x) = inf_{β∈B} I_gh(α, β; x), (2.5)

and similarly, the proximity between a distribution of H_h and H_g by

I_h(β; x) = inf_{α∈A} I_hg(β, α; x). (2.6)

This finite sample pseudo-true value b_T(α) depends on T and on the observations
x_1, ..., x_T, and therefore it is a random variable. For this reason, it is sometimes
called the conditional pseudo-true value. It converges to b(α) when T tends to
infinity.
An important simplification occurs when we consider models without explana-
tory variables, or equivalently models in which the only x variables are constant
variables. In this case I_gh(α, β; x) = I_gh(α, β) = Î_gh(α, β) and the asymptotic and finite
sample pseudo-true values coincide. However models without explanatory variables
are not frequent in econometrics and therefore we will have to keep in mind the
distinction between b(α) and b_T(α).
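To make the distinction concrete, the following minimal numerical sketch (not from the original text; the two conditional models and all parameter values are hypothetical) computes a finite sample pseudo-true value b_T(α) by minimizing the empirical KLIC over the parameters of the second model.

```python
# A minimal sketch (hypothetical models): computing b_T(alpha) for
#   H_g : y | x ~ N(alpha * x, 1)
#   H_h : y | x ~ N(b0 + b1 * x^2, 1),
# for which the conditional KLIC reduces to half a squared distance between
# the two regression functions (compare Section 2.3 below).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(size=200)           # observed exogenous variables x_1, ..., x_T
alpha = 1.5                        # a fixed parameter value of H_g

def klic_hat(b):
    # empirical conditional KLIC: (1/T) sum_t I_gh(alpha, b; x_t)
    b0, b1 = b
    return 0.5 * np.mean((alpha * x - b0 - b1 * x**2) ** 2)

res = minimize(klic_hat, x0=np.zeros(2))
print("b_T(alpha) =", res.x)       # depends on the observed x's, hence random
```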

2.3. Semi-parametric hypotheses

The competing models may also have been defined through some conditional
moments, for instance through conditional means. The approach described in
subsections 2.1 and 2.2 can easily be adapted to this semi-parametric framework.
We basically have to define in a suitable way a measure of proximity between
these conditional moments. As an illustration let us consider the case of conditional
means. The two hypotheses H_g and H_h are defined by

H_g = {E(y_t/x_t) = m(x_t; α), α ∈ A ⊂ R^K},

H_h = {E(y_t/x_t) = μ(x_t; β), β ∈ B ⊂ R^L}. (2.8)

The proximity between two means may be measured by the usual average squared
Euclidean distance

D_mμ(α, β) = ∫ ||m(x; α) − μ(x; β)||² f_{0x}(x) ν_x(dx), (2.9)

or by its empirical counterpart

D̂_mμ(α, β) = (1/T) Σ_{t=1}^T ||m(x_t; α) − μ(x_t; β)||². (2.10)

The pseudo-true values associated with these semi-parametric hypotheses and with
this distance D_mμ are then defined by

D_mμ(α, b(α)) = inf_{β∈B} D_mμ(α, β) for the asymptotic case,

D̂_mμ(α, b_T(α)) = inf_{β∈B} D̂_mμ(α, β) for the finite sample case. (2.11)
In fact this semi-parametric approach is strongly linked with the previous one.
Let us assume for a moment that the conditional distribution of y_t given x_t is
normal with unit variance. The hypotheses H_g and H_h would be defined by

H_g = {g(y_t/x_t; α) = N[m(x_t; α), I], α ∈ A},

H_h = {h(y_t/x_t; β) = N[μ(x_t; β), I], β ∈ B}.

Moreover the associated KLIC would be

I_gh(α, β; x) = ∫ log [g(y/x; α)/h(y/x; β)] g(y/x; α) ν_y(dy) = ½ ||m(x; α) − μ(x; β)||².

We deduce that I_gh(α, β) = ∫ I_gh(α, β; x) f_{0x}(x) ν_x(dx) = ½ D_mμ(α, β). The pseudo-true
values defined in the semi-parametric framework with a Euclidean distance coincide
with the pseudo-true values computed as if the conditional distribution was a
Gaussian distribution.
It would have been possible to measure the closeness of the non-nested hypo-
theses H_g and H_h by using a KLIC based on another artificial conditional
distribution. The artificial distributions which are suitable for this purpose are the
members of linear exponential families, including the normal, the Poisson, the
gamma, etc. distributions (see Gourieroux et al. (1984)).

2.4. Examples

2.4.1. Choice of regressors in a linear model

We consider three sets of linearly independent regressors x_0, x_1, x_2, with respective
sizes K_0, K_1, K_2, and x = (x_0, x_1, x_2). The two models of interest are

H_g = {E(y_t/x_t) = x_{0t}a_0 + x_{1t}a_1, (a_0', a_1')' ∈ R^{K_0+K_1}},

H_h = {E(y_t/x_t) = x_{0t}b_0 + x_{2t}b_1, (b_0', b_1')' ∈ R^{K_0+K_2}}. (2.12)

To choose between H_g and H_h is equivalent to choosing the regressors which have
to be eliminated: either x_1 (H_h is accepted), or x_2 (H_g is accepted). As soon as K_1
or K_2 is non-zero the two semi-parametric hypotheses are non-nested. They are
only partially non-nested as soon as K_0 ≠ 0, their intersection being

H_g ∩ H_h = {E(y_t/x_t) = x_{0t}a_0, a_0 ∈ R^{K_0}}.

The finite sample pseudo-true value b_T(a), corresponding to a = (a_0', a_1')', is the
solution of the minimization problem

min_{b_0, b_1} Σ_{t=1}^T [x_{0t}a_0 + x_{1t}a_1 − (x_{0t}b_0 + x_{2t}b_1)]².

It is a least squares problem. Let us introduce the matrices X_0, X_1, X_2 giving the
observations on the different regressors. They have respective sizes T × K_0,
T × K_1, T × K_2. The solution of the problem is

b_T(a) = [(X_0, X_2)'(X_0, X_2)]^{-1} (X_0, X_2)'(X_0 a_0 + X_1 a_1). (2.13)

When T is large the empirical cross-products converge to the corresponding second
order moments of the marginal distribution of the x variables. We get

b(a) = [E_x (x_0, x_2)'(x_0, x_2)]^{-1} E_x [(x_0, x_2)'(x_0, x_1)] (a_0', a_1')', (2.14)

where E_x is the expectation with respect to the true unknown marginal distribution
of the explanatory variables.
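Since b_T(a) is just a least squares projection, it can be computed directly; the following sketch (simulated regressors and arbitrary parameter values, purely illustrative) mirrors the minimization problem above.

```python
# A small sketch (assumed design, not from the text): the finite sample
# pseudo-true value b_T(a) of Section 2.4.1 is the least squares projection of
# the H_g regression function X_0 a_0 + X_1 a_1 on the H_h regressors (X_0, X_2).
import numpy as np

rng = np.random.default_rng(1)
T, K0, K1, K2 = 500, 1, 2, 2
X0 = np.ones((T, K0))                      # common regressors
X1 = rng.normal(size=(T, K1))
X2 = rng.normal(size=(T, K2))
a = rng.normal(size=K0 + K1)               # (a_0', a_1')' under H_g

Xg = np.hstack([X0, X1])                   # regressors of H_g
Xh = np.hstack([X0, X2])                   # regressors of H_h
m = Xg @ a                                 # H_g regression function

# b_T(a) solves min_b sum_t (x_0t a_0 + x_1t a_1 - x_0t b_0 - x_2t b_1)^2
bT, *_ = np.linalg.lstsq(Xh, m, rcond=None)
print("b_T(a) =", bT)
```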
It must be emphasized that the definitions of the two hypotheses concern the
same conditional expectations. It might have seemed natural to define the hypo-
theses by

H_g = {E(y_t/x_{0t}, x_{1t}) = x_{0t}a_0 + x_{1t}a_1, (a_0', a_1')' ∈ R^{K_0+K_1}},

H_h = {E(y_t/x_{0t}, x_{2t}) = x_{0t}b_0 + x_{2t}b_1, (b_0', b_1')' ∈ R^{K_0+K_2}}.

However it is easily seen that the hypotheses defined in this way may be compatible:
for instance, if the vector of all variables (y_t, x_{0t}, x_{1t}, x_{2t}) is Gaussian, H_g and H_h
are both satisfied.
2.4.2. Linear versus logarithmic formulations

When specifying a regression model we often have to choose between a linear and
a logarithmic formulation. For instance, for demand functions giving the consump-
tion as a function of income, we have to compare a formulation with constant pro-
pensity to consume (linear formulation) and a formulation with constant elasticity
(logarithmic formulation). The two hypotheses written in terms of conditional
distributions are

H_g = {log y_t = x_tα + u_t, where u_t/x_t, z_t ~ N(0, σ²), α ∈ A, σ² > 0},

H_h = {y_t = z_tβ + v_t, where v_t/x_t, z_t ~ N(0, η²), β ∈ B, η² > 0}. (2.15)

For the same reasons as in the previous example the hypotheses are made on a
conditional distribution given the same set of regressors x and z. The hypotheses
differ in several respects. First, the distributions have different supports: the
positive real line for H_g and the whole real line for H_h; clearly the question of
choosing between H_g and H_h is only meaningful if the observed values of y are
positive. Second, the two models differ not only by the form of the regression
functions

E(y_t/x_t, z_t) = exp(x_tα + σ²/2), for H_g,

E(y_t/x_t, z_t) = z_tβ, for H_h,

but also by their variance properties, since the data are supposed to be hetero-
skedastic under H_g and homoskedastic under H_h.
The finite sample pseudo-true values may be easily determined (see Gourieroux
et al. (1983)). They are given by

b_T(α, σ²) = (Σ_{t=1}^T z_t'z_t)^{-1} Σ_{t=1}^T z_t' exp(x_tα) exp(σ²/2),

η_T²(α, σ²) = (1/T) Σ_{t=1}^T exp(2x_tα) exp(2σ²) − (1/T) b_T(α, σ²)' (Σ_{t=1}^T z_t'z_t) b_T(α, σ²).

2.4.3. Polytomous logit versus sequential logit formulations

Another classical example (Hausman and McFadden (1980)) is found in discrete
choice models. Let us consider the case of three alternatives, i = 1, 2, 3, and the
conditional distribution of the indicator function of the retained alternative given
some attributes. The y variable is a qualitative variable and the model is completely
defined by the conditional selection probabilities. Two formulations are often
examined. Under the independence from irrelevant alternatives hypothesis
(I.I.A.), the selection probabilities have a polytomous logit form

H_g = {g(1/x_t; α) = (1 + exp x_{1t}α_1 + exp x_{2t}α_2)^{-1},
g(2/x_t; α) = exp(x_{1t}α_1)(1 + exp x_{1t}α_1 + exp x_{2t}α_2)^{-1},
g(3/x_t; α) = exp(x_{2t}α_2)(1 + exp x_{1t}α_1 + exp x_{2t}α_2)^{-1}, α ∈ A}, (2.16)

where g(i/x_t; α) denotes the probability of choosing alternative i given the x
variables. Such a model describes choices deduced from a unique utility maximi-
zation among the set of alternatives {1, 2, 3}.
However other models describe the idea of sequential choices: a first choice
between 1 and {2, 3}, then, if the second subset is retained, a choice between 2 and
3. We get the so-called sequential logit model

H_h = {h(1/x_t; β) = (1 + exp x_{3t}β_1)^{-1},
h(2/x_t; β) = exp(x_{3t}β_1)(1 + exp x_{3t}β_1)^{-1}(1 + exp x_{4t}β_2)^{-1},
h(3/x_t; β) = exp(x_{3t}β_1) exp(x_{4t}β_2)(1 + exp x_{3t}β_1)^{-1}(1 + exp x_{4t}β_2)^{-1}, β ∈ B}. (2.17)

As before, the distributions are conditional on the set of all regressors x_{1t}, x_{2t}, x_{3t},
x_{4t}.
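The following short sketch (hypothetical scalar attributes and parameters) implements the two probability systems (2.16)-(2.17) as reconstructed above; both vectors of probabilities sum to one, but they are generated by different functional forms.

```python
# A sketch of the two competing probability systems; x1..x4 are scalar
# attributes and the parameters are scalars (all illustrative).
import numpy as np

def polytomous_logit(x1, x2, a1, a2):
    # I.I.A. form: alternative 1 is the base alternative
    e1, e2 = np.exp(x1 * a1), np.exp(x2 * a2)
    denom = 1.0 + e1 + e2
    return np.array([1.0, e1, e2]) / denom           # P(1), P(2), P(3)

def sequential_logit(x3, x4, b1, b2):
    # first choice: 1 versus {2,3}; then 2 versus 3
    p_23 = np.exp(x3 * b1) / (1.0 + np.exp(x3 * b1))
    p_3_given_23 = np.exp(x4 * b2) / (1.0 + np.exp(x4 * b2))
    return np.array([1.0 - p_23,
                     p_23 * (1.0 - p_3_given_23),
                     p_23 * p_3_given_23])

print(polytomous_logit(0.5, -0.2, 1.0, 0.8))
print(sequential_logit(0.5, -0.2, 1.0, 0.8))         # both sum to one
```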

2.4.4. Choices between macromodels

Other examples exist in macromodelling: choice between New Keynesian and
Classical models (Dadkhah and Valbuena (1985), Nakhaeizadeh (1988)), between
models defined through Euler conditions corresponding to different optimizations
(Ghysels and Hall (1990)) and between equilibrium and disequilibrium models.
This last example is particularly interesting, since the usual equilibrium and dis-
equilibrium models are generally defined by different kinds of conditional distribu-
tions. Let us introduce the demand and supply functions, depending on a price p
and on exogenous variables x, z. The latent model is

D_t = a p_t + x_tα + u_t,

S_t = b p_t + z_tβ + v_t,

where for instance (u_t, v_t) is, conditionally on x, z, normally distributed with a zero
mean and a variance-covariance matrix denoted by Ω.
In the equilibrium model the exchanged quantity and the equilibrium price are
defined by

Q_t = a p_t + x_tα + u_t,

Q_t = b p_t + z_tβ + v_t,

and this model leads to a parameterized form for the conditional distribution of
Q_t, p_t given x_t, z_t.
In the disequilibrium framework, the prices p_t are assumed to be exogenous and
the exchanged quantity is defined as the minimum of supply and demand: Q_t =
min(D_t, S_t). The disequilibrium model leads to a parameterized form of the condi-
tional distribution of Q_t given p_t, x_t, z_t. Since the two models do not correspond
to the same kind of endogenous and exogenous variables, they can only be
compared by increasing or decreasing the information. A first possibility is to
ignore the information on the distribution of prices in the equilibrium model and
to compare the form of f(Q_t/p_t, x_t, z_t) in the equilibrium and disequilibrium models.
A second solution is to complete the disequilibrium model by adding a price
equation explaining how the price depends on x and z; with this enlarged model
we may determine the conditional distribution f(Q_t, p_t/x_t, z_t) and compare it with
the specification associated with the equilibrium model.

2.4.5. Separate families in time series analysis

The last examples that we present deal with time series. These examples are
generally linked with the determination of the autoregressive or the moving average
orders, i.e. the so-called identification problem (see, e.g. Walker (1967)).

2.4.5.1. Choice between autoregressive and moving average representations. Let
us consider a centered time series (y_t, t ∈ N), whose dynamics is of interest. It is
natural to first test if this series satisfies the white noise properties and, if this
hypothesis is rejected, to propose a more complicated specification. Two simple
specifications may be introduced, an autoregressive representation of order 1,

y_t = ρ y_{t-1} + u_t, u_t ~ I.I.N(0, σ²),

and a moving average representation of order 1,

y_t = ε_t − θ ε_{t-1}, ε_t ~ I.I.N(0, η²).

These two hypotheses are partially non-nested and their intersection corresponds
to the white noise processes. The comparison between the two previous models
E_{α_0} denotes the conditional expectation with respect to g(y/x; α_0) and E_x the
expectation with respect to f_{0x}(x).

Then if V^- and V*^- are generalized inverses of V and V*, if V̂^- and V̂*^- are
consistent estimators of these generalized inverses and if d and d* are the ranks of
V and V*, we deduce from Proposition 3.2.1 that the Wald statistics

ξ_T^1 = T[β̂_T − b(α̂_T)]' V̂^- [β̂_T − b(α̂_T)], ξ_T^2 = T[β̂_T − b_T(α̂_T)]' V̂*^- [β̂_T − b_T(α̂_T)]

are asymptotically distributed under H_g as chi-squares with d and d* degrees of
freedom respectively (see Cox (1961) for ξ_T^1, Gourieroux et al. (1983) for ξ_T^2). It is
easily seen that d − d* is nonnegative.

Definition 3.2.2

The Wald tests of H_g against H_h consist in accepting H_g if ξ_T^1 ≤ χ²_{1-ε}(d) [resp. ξ_T^2 ≤
χ²_{1-ε}(d*)], and in rejecting it otherwise (ε is the asymptotic level of the test).

Remark 3.2.3

It is easily seen that the previous testing procedures coincide with the usual Wald
test in the special case of nested hypotheses. Indeed let us assume that

h(y_t/x_t; β) = k(y_t/x_t; γ, α), where β = (γ', α')',

and that the null hypothesis is defined by constraining γ to be zero:

g(y_t/x_t; α) = k(y_t/x_t; 0, α).

The pseudo-true values are

b(α) = b_T(α) = (0', α')',

and the two previous test statistics ξ_T^1, ξ_T^2 are equal. Moreover we have
evaluated at the estimated pseudo-true value. Depending on the computability of
the asymptotic pseudo-true value, we may consider either

s̃_T^(1) = Σ_{t=1}^T ∂ log h(y_t/x_t; b(α̂_T))/∂β, (3.4)

or

s̃_T^(2) = Σ_{t=1}^T ∂ log h(y_t/x_t; b_T(α̂_T))/∂β. (3.5)

In the special case of nested hypotheses s̃_T^(1) and s̃_T^(2) are equal and coincide with the
usual score evaluated under the null.
The asymptotic distributional properties of the estimated scores have been
derived in Gourieroux et al. (1983); they are summarized in the following proposition
(for the i.i.d. case).

Proposition 3.3.1

Under the null hypothesis H_g, the random vectors (1/√T)s̃_T^(1) and (1/√T)s̃_T^(2) are
asymptotically normal, with zero mean and asymptotic variance-covariance matri-
ces respectively given by

W = C_hh − C_hg C_gg^{-1} C_gh,
W* = C*_hh − C*_hg C*_gg^{-1} C*_gh.

The matrices W and W* have the same ranks d and d* as the matrices V and V*
introduced for the Wald procedures.

Proposition 3.3.2

If Ŵ^- and Ŵ*^- are consistent estimators (under the null) of generalized inverses of
W and W*, the score statistics are defined by

ξ_T^{s1} = (1/T) s̃_T^(1)' Ŵ^- s̃_T^(1), ξ_T^{s2} = (1/T) s̃_T^(2)' Ŵ*^- s̃_T^(2).

They are asymptotically distributed under H_g as chi-squares with d and d* degrees
of freedom respectively. The associated critical regions with asymptotic level ε
are

{ξ_T^{s1} > χ²_{1-ε}(d)}

and

{ξ_T^{s2} > χ²_{1-ε}(d*)}.

3.4. The Cox procedure

In the two previous subsections we described natural extensions of the Wald and
score test procedures. We now consider the third kind of procedure: the likelihood
ratio test. The extension was first introduced by Cox (1961, 1962) and studied
in detail in a number of other papers (see, e.g. Jackson (1968), Pesaran (1974),
Pesaran and Deaton (1978)).
The idea is to consider, in a first step, the usual likelihood ratio (LR) test statistic,
and to study its asymptotic distributional properties. Since under the null hypo-
thesis the LR statistic divided by T tends to a non-zero limit, this limit is, in a second
step, estimated and used to correct the usual procedure. More precisely let us
consider the maximum log-likelihood functions evaluated under H_g and H_h res-
pectively:

L_T^g(α̂_T) = max_α Σ_{t=1}^T log g(y_t/x_t; α) = Σ_{t=1}^T log g(y_t/x_t; α̂_T), (3.6)

L_T^h(β̂_T) = max_β Σ_{t=1}^T log h(y_t/x_t; β) = Σ_{t=1}^T log h(y_t/x_t; β̂_T). (3.7)

The usual LR statistic is defined by

ξ_T^{LR} = 2 Σ_{t=1}^T [log h(y_t/x_t; β̂_T) − log g(y_t/x_t; α̂_T)]. (3.8)

where u_ε is the ε quantile of the standard normal. The asymptotic level of this test
is ε; moreover this test is consistent for the distributions of H_h which do not belong
to H_g. (Note that C_T is sometimes replaced by −C_T; in this case the critical region
must be changed accordingly.)

Proof

We essentially have to explain why the critical region is one-sided. For this purpose
we consider the asymptotic behaviour of C_T under the alternative H_h. We denote
by α(β_0) the asymptotic pseudo-true value of α associated with the true value, β_0,
of β. We get

plim C_T = E_x E_{β_0} log [h(y/x; β_0)/g(y/x; α(β_0))] + E_x E_{α(β_0)} log [g(y/x; α(β_0))/h(y/x; b(α(β_0)))].

This limit is strictly positive (except if the hypotheses are partially non-nested
and if the true p.d.f. belongs to both hypotheses) and under the alternative the Cox
statistic tends to +∞, which explains the one-sided form of the test.

Remark 3.4.3

In practical situations the marginal distribution of the exogenous variables is
generally unknown and it may be interesting to consider the modified version of
the Cox statistic

C̃_T = (1/T)[L_T^h(β̂_T) − L_T^g(α̂_T)] − (1/T) Σ_{t=1}^T E_{α̂_T} {log h(y/x_t; b_T(α̂_T)) − log g(y/x_t; α̂_T) / x_t}.

However such a modification will necessitate the determination of the asymptotic
variance of C̃_T, which is different from the asymptotic variance of C_T.
The explicit form of the Cox statistic or of its modified version has been derived
and interpreted in several econometric models (Pesaran (1974), Pesaran and Deaton
(1978), Fisher and McAleer (1979), White (1982b)).
Remark 3.4.4

The use of the Cox statistic is only valid under the regularity condition ω² ≠ 0, i.e.
if the difference log g(Y/X; α_0) − log h(Y/X; b(α_0)) does not belong to the vector
space generated by the components of

∂ log g(Y/X; α_0)/∂α.

When H_g is nested in H_h, we get (see Remark 3.2.3) h(Y/X; b(α_0)) = k(Y/X; 0, α_0) =
g(Y/X; α_0) and ω² = 0. Therefore the Cox procedure does not apply in the simple
case of nested hypotheses. It does not apply either if the two non-nested hypotheses
are orthogonal. The second case is especially clear for linear models (see
Section 3.5).

Remark 3.4.5

Different modifications of the Cox statistic have been introduced in the literature.
The most popular one is Atkinson's variation defined by

C_T^A = (1/T){L_T^h[b_T(α̂_T)] − L_T^g(α̂_T)} − Ĥ(α̂_T),

where Ĥ(α̂_T) denotes the same centering term as in C_T. It is such that

C_T^A = C_T + (1/T){L_T^h[b_T(α̂_T)] − L_T^h(β̂_T)} ≤ C_T,

since β̂_T gives the maximum of L_T^h, and it is asymptotically equivalent to C_T under
the null.

3.5. Application to the choice of regressors in linear models

3.5.1. The estimated pseudo-true values

Let us consider two Gaussian linear models

H_g = {Y = X_1γ + u, u ~ N[0, σ²I_T]},

H_h = {Y = X_2δ + v, v ~ N[0, τ²I_T]}. (3.10)

The matrices X_1 and X_2 have respective sizes (T, K_1), (T, K_2) and their ranks are K_1
and K_2. The parameters are

α = (γ', σ²)' for H_g,

β = (δ', τ²)' for H_h.

Let us denote by P_j = X_j(X_j'X_j)^{-1}X_j', j = 1, 2, the orthogonal projector on the
column vectors of X_j and M_j = I − P_j. We get

α̂_T = ((X_1'X_1)^{-1}X_1'Y, (1/T)||M_1Y||²)',
β̂_T = ((X_2'X_2)^{-1}X_2'Y, (1/T)||M_2Y||²)', (3.11)

and

b_T(α̂_T) = ((X_2'X_2)^{-1}X_2'X_1γ̂_T, (1/T)||M_1Y||² + (1/T)||M_2X_1γ̂_T||²)'
= ((X_2'X_2)^{-1}X_2'P_1Y, (1/T)||M_1Y||² + (1/T)||M_2P_1Y||²)'. (3.12)

3.5.2. An interpretation of the extended Wald statistic

The difference between the two estimators of the pseudo-true value b(α_0) is

√T[β̂_T − b_T(α̂_T)] = √T ((X_2'X_2)^{-1}X_2'(Y − X_1γ̂_T), (1/T)||M_2Y||² − (1/T)||M_1Y||² − (1/T)||M_2P_1Y||²)'.

It may be proved (see, e.g. Gourieroux and Monfort (1989)) that the second
subvector is asymptotically equivalent to a linear combination of the first subvector
under the hypothesis H_g. Therefore the Wald statistic measures the distance between
zero and

(1/√T) X_2'M_1Y.

This quantity is an inner product between the residuals of H_g and the exogenous
variables of H_h. Therefore the procedure is asymptotically equivalent to a score
test of the hypothesis {δ = 0} in the artificial nesting model

Y = X_1γ + X̃_2δ + w,

including all the explanatory variables X̃_2 of H_h not spanned by those of H_g, i.e. to
the usual F test.
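The following sketch (simulated data, illustrative design) computes the inner product X_2'M_1Y and the usual F-test in the artificial nesting regression, to which the extended Wald procedure is asymptotically equivalent.

```python
# A sketch (assumed design): score-type inner product and the equivalent F-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
T = 200
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])
Y = X1 @ np.array([1.0, 0.5]) + rng.normal(size=T)    # H_g is true here

M1 = np.eye(T) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
print("X2' M1 Y / sqrt(T) =", X2.T @ M1 @ Y / np.sqrt(T))

# F-test of delta = 0 in Y = X1 gamma + X2tilde delta + w, where X2tilde keeps
# the columns of X2 not spanned by X1 (here: the second column).
X2t = X2[:, 1:]
Xfull = np.hstack([X1, X2t])
rss0 = Y @ M1 @ Y
Mf = np.eye(T) - Xfull @ np.linalg.solve(Xfull.T @ Xfull, Xfull.T)
rss1 = Y @ Mf @ Y
q, dof = X2t.shape[1], T - Xfull.shape[1]
F = ((rss0 - rss1) / q) / (rss1 / dof)
print("F =", F, "p-value =", 1 - stats.f.cdf(F, q, dof))
```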

3.5.3. The Cox statistic

Let us consider the modified version of the Cox statistic (see Remark 3.4.3). We get
(Pesaran (1974))

L_T^h(β) = −(T/2) log 2π − (T/2) log τ² − (1/(2τ²))||Y − X_2δ||²,

L_T^g(α) = −(T/2) log 2π − (T/2) log σ² − (1/(2σ²))||Y − X_1γ||²,

E_α L_T^h(β) = −(T/2) log 2π − (T/2) log τ² − (1/(2τ²))[Tσ² + ||X_1γ − X_2δ||²],

E_α L_T^g(α) = −(T/2) log 2π − (T/2) log σ² − T/2.

We deduce that

C̃_T = ½ log (τ̃_T²/τ̂_T²),

where τ̂_T² = (1/T)||M_2Y||² is the maximum likelihood estimator of τ² under the
alternative and τ̃_T² is the estimated pseudo-true value

τ̃_T² = (1/T)[||M_1Y||² + ||M_2P_1Y||²].

Therefore, the Cox statistic is directly given as a simple function of two estimated
variances. This kind of result is valid for any Gaussian regression model, including
nonlinear regressions or multivariate models (after replacing the scalar variances by
the determinants of the residual variance-covariance matrices). τ̃_T² may be seen as
the residual variance of H_h expected under H_g, whereas τ̂_T² is the actual estimate of
τ² under H_h. A positive Cox statistic indicates that the actual performance is better
than expected. A significantly positive statistic leads to rejection of H_g because H_h
is performing too well for H_g to be regarded as true (McAleer (1986)).
It may be noted that when the two hypotheses are orthogonal, the Cox statistic
C̃_T still has meaning even if the Cox procedure cannot be directly used. More
precisely we get, in this special case, τ̃_T² = (1/T)||Y||² and

C̃_T = ½ log [||Y||²/||M_2Y||²].

This statistic is a simple function of the variable T||P_2Y||²/||M_2Y||², which is
asymptotically proportional to a chi-square under H_g. Contrary to the general case
described in Proposition 3.4.2, the Cox statistic no longer follows a normal
distribution.
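A minimal sketch of the variance-ratio form of the modified Cox statistic, on simulated data (the design and parameter values are illustrative); note that, for testing, the statistic must still be standardized by a consistent estimate of its asymptotic variance before being compared to normal quantiles.

```python
# A sketch: the modified Cox statistic of Section 3.5.3 as a variance ratio.
import numpy as np

rng = np.random.default_rng(4)
T = 300
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])
Y = X1 @ np.array([1.0, -0.7]) + rng.normal(size=T)   # data generated under H_g

def proj(X):
    return X @ np.linalg.solve(X.T @ X, X.T)

P1, P2 = proj(X1), proj(X2)
M1, M2 = np.eye(T) - P1, np.eye(T) - P2

tau2_hat = (M2 @ Y) @ (M2 @ Y) / T          # ML estimate of tau^2 under H_h
tau2_tilde = ((M1 @ Y) @ (M1 @ Y) + (M2 @ P1 @ Y) @ (M2 @ P1 @ Y)) / T
C_T = 0.5 * np.log(tau2_tilde / tau2_hat)
print("modified Cox statistic:", C_T)       # near 0 when H_h performs as H_g predicts
```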

3.6. Applications to qualitative models

Let us consider a framework including, as a special case, the polytomous logit versus
sequential logit example of Section 2.4.3.
We assume that the endogenous variable associated with the choice among K
alternatives is defined by y_tk = 1 if k is chosen for the tth observation and y_tk = 0
otherwise. The two competing models are

H_g: P(y_tk = 1) = g_k(x_t, α),

H_h: P(y_tk = 1) = h_k(x_t, β).
We obtain, for instance,

C_hh = E_x [(∂h'/∂β) diag(h)^{-1} (diag g − g g') diag(h)^{-1} (∂h/∂β')],

where the matrices are evaluated at α_0 and b(α_0), and g (resp. h) is the vector
whose components are g_k (resp. h_k).
Similar expressions are obtained for C_gg and C_gh. Consistent estimators are
obtained by replacing E_x by the empirical mean, α_0 by α̂_T, and b(α_0) by β̂_T or b_T(α̂_T).

4. Artificial nesting models

The basic idea of artificial nesting methods is to introduce a third hypothesis in
which both H_g and H_h are nested and are characterized by some equality constraints.

4.1. Examples

We describe some general nesting procedures and some specific ones.

4.1.1. Quandt's procedure (Quandt (1974))

The idea is to introduce the compound model made of mixtures of the distributions
of H_g and H_h. This model is

M = {(1 − λ)g(y_t/x_t; α) + λh(y_t/x_t; β), λ ∈ [0, 1], α ∈ A, β ∈ B}. (4.1)

The basic hypotheses are defined by the constraints

H_g = {λ = 0} and H_h = {λ = 1}.

The procedure consists in testing {λ = 0} against {λ > 0} and {λ = 1} against {λ < 1},
i.e. in applying a one-sided t-ratio test to the parameter λ or λ − 1. It is possible
that neither hypothesis is satisfied. In such a case H_g and H_h will be
asymptotically rejected by the testing procedure and it will be possible to get an
estimate λ̂_T of the parameter λ significantly different from 0 or 1. In some
applications such a value of λ̂_T may be interpreted.
Let us consider the choice between polytomous and sequential logit models (see
Section 2.4.3). The nesting model may be seen as describing the average choice of
a group of individuals, some, a proportion 1 − λ, selecting the alternatives according
to the polytomous logit form and the others, a proportion λ, in a sequential way.
λ̂_T may give an idea of the decomposition of the population into these two
subgroups.

Remark 4.1.1

Even if such a compound model is attractive, it has the usual drawback of mixtures.
Under the null hypothesis H_g = {λ = 0}, the parameter β is not identifiable.
Therefore the asymptotic properties of the M.L. estimators of α, β, λ, obtained by
maximizing

max_{α,β,λ} Σ_{t=1}^T log [(1 − λ)g(y_t/x_t; α) + λh(y_t/x_t; β)],

and the properties of the t-ratio test are unknown except in special cases. This
difficulty is at the origin of several papers whose aim is either to compute directly
the distribution of some test statistics under specific formulations of the hypotheses
(Pesaran (1981, 1982a)) or to introduce some change of parameters or some
identifiability constraints (Fisher and McAleer (1981)).

4.1.2. Atkinson's procedure (Atkinson (1970))

Atkinson proposed a similar procedure in which the compound model is derived
by considering exponential combinations. The model is

M = {K_t(λ, α, β) g(y_t/x_t; α)^{1-λ} h(y_t/x_t; β)^λ, λ ∈ [0, 1], α ∈ A, β ∈ B}, (4.2)

where

K_t(λ, α, β) = [∫ g(u/x_t; α)^{1-λ} h(u/x_t; β)^λ ν_y(du)]^{-1}.

This nesting model has two drawbacks. First, as for Quandt's procedure, β is
unidentifiable under H_g (and α under H_h) and this creates many difficulties in
performing the tests (see, e.g. Pesaran (1981)). Second, the model is only meaningful
if the function g(y/x_t; α)^{1-λ} h(y/x_t; β)^λ is integrable for λ ∈ [0, 1], a condition which is
not always satisfied.
However the identifiability problem may be solved in particular cases.
However the identifiability problem may be solved in particular cases.
Example 4.1.2

Let us consider two linear hypotheses

H_g = {y_t = x_{1t}γ + u_t, u_t ~ I.I.N(0, 1), t = 1, ..., T},

H_h = {y_t = x_{2t}δ + v_t, v_t ~ I.I.N(0, 1), t = 1, ..., T}.

The distributions of the nesting model have p.d.f.s ℓ(y_t/x_t; γ, δ, λ) which are proportional
to

exp{−½(1 − λ)(y_t − x_{1t}γ)² − ½λ(y_t − x_{2t}δ)²},

or to

exp{−½[y_t − (x_{1t}(1 − λ)γ + x_{2t}λδ)]²}.

Therefore the distributions of the nesting model are such that y_t is conditionally
normal with a unit variance and a conditional mean obtained by simultaneously
introducing the variables of the two initial models. If for instance x_1 and x_2 have
no common component, we may introduce the new parameters

γ* = (1 − λ)γ, δ* = λδ.

Whereas γ, δ, λ are not identifiable in the nesting model, the transformed parameters
γ*, δ* are. The same is true for the hypotheses H_g and H_h, which are characterized
by the constraints H_g = {δ* = 0}, H_h = {γ* = 0}, which only depend on δ* and γ*.
This kind of result may directly be extended to multivariate Gaussian models
(Pesaran (1982a)).

4.1.3. Mixtures of regressions

We have seen that for regression models Atkinson's procedure is equivalent to
considering the regression which is the convex combination of the two initial
regressions. This kind of nesting procedure may be directly applied to nonlinear
regressions, even in a semi-parametric framework. With the two nonlinear regression
models

H_g = {E(y_t/x_t) = m(x_t; α), α ∈ A},

H_h = {E(y_t/x_t) = μ(x_t; β), β ∈ B},

we can associate the nesting nonlinear regression model

M = {E(y_t/x_t) = (1 − λ)m(x_t; α) + λμ(x_t; β), λ ∈ [0, 1], α ∈ A, β ∈ B}. (4.3)
4.1.4. Box-Cox transformation (Box and Cox (1964))

The Box-Cox transformation of a positive variable y is defined by

y^(λ) = (y^λ − 1)/λ, λ ≠ 0,

y^(λ) = log y, λ = 0. (4.4)

This transformation reduces to the logarithmic function for λ = 0 and to a linear
function for λ = 1, and is often used for nesting linear and logarithmic formulations

H_g = {log y_t = x_tα + u_t, u_t ~ I.I.N(0, σ²)},

H_h = {y_t = x_tβ + v_t, v_t ~ I.I.N(0, η²)}.

As soon as the regressions contain a constant term we may consider the nesting
model

y_t^(λ) = x_tγ + ω_t, ω_t ~ I.I.N(0, τ²), λ ∈ [0, 1].

The two initial hypotheses are characterized by H_g = {λ = 0} and H_h = {λ = 1},
respectively, and may be tested by the usual t-ratio test (see Zarembka (1974), Chang
(1977)) or by applying a t-ratio test after a Taylor expansion of the models around
a pre-assigned value (λ = 0) or (λ = 1) (Andrews (1971), Godfrey and Wickens
(1981)).
In the nesting model it is assumed that the error term ω is normally distributed
with mean 0 and variance τ². Strictly speaking this assumption is untenable since
y_t^(λ) cannot take values less than −1/λ. This difficulty can be circumvented by
truncating ω_t in some fashion (Poirier and Ruud (1979)), but as noted by Davidson
and MacKinnon (1985a) it seems reasonable to ignore the problem, which would
occur with small probabilities.
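The transformation (4.4) is straightforward to code; the following lines are a direct transcription.

```python
# The Box-Cox transformation (4.4) as code.
import numpy as np

def box_cox(y, lam):
    """Box-Cox transform of a positive array y."""
    y = np.asarray(y, dtype=float)
    if lam == 0:
        return np.log(y)
    return (y**lam - 1.0) / lam

y = np.array([0.5, 1.0, 2.0, 4.0])
print(box_cox(y, 0.0))             # logarithmic formulation (lambda = 0)
print(box_cox(y, 1.0))             # linear formulation up to a shift (lambda = 1)
print(box_cox(y, 0.5) > -1 / 0.5)  # the transform is bounded below by -1/lambda
```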

4.1.5. A comprehensive model for autoregressive and moving average representations

In some applications a natural embedding model appears. Let us for instance
consider the two hypotheses of an autoregressive and a moving average representation
of order one:

H_g = {y_t = ρy_{t-1} + ε_t, ε white noise},

H_h = {y_t = η_t − θη_{t-1}, η white noise}.

A comprehensive model is the ARMA(1,1) representation

M = {y_t − φ_1 y_{t-1} = u_t − θ_1 u_{t-1}, u white noise}.

As noted in Section 2.4.5, the initial models are regression models with constrained
coefficients

E(y_t/ỹ_{t-1}) = ρy_{t-1} for H_g,

E(y_t/ỹ_{t-1}) = −θy_{t-1} − θ²y_{t-2} − ... − θ^k y_{t-k} − ... for H_h,

where ỹ_{t-1} = {y_{t-1}, y_{t-2}, ...}, and the comprehensive model is not obtained by
taking linear combinations of the two previous regression functions since we get

E(y_t/ỹ_{t-1}) = (φ_1 − θ_1)y_{t-1} + ... + (φ_1 − θ_1)θ_1^{k-1} y_{t-k} + ... under M.
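The constrained pattern of the AR(∞) coefficients under M is easy to inspect numerically; the following sketch tabulates π_k = (φ_1 − θ_1)θ_1^(k−1) for a few parameter values (the function name is ours).

```python
# A numeric illustration: AR(infinity) coefficients implied by the
# comprehensive ARMA(1,1) model, pi_k = (phi1 - theta1) * theta1**(k-1).
import numpy as np

def implied_ar_coefficients(phi1, theta1, k_max=8):
    k = np.arange(1, k_max + 1)
    return (phi1 - theta1) * theta1 ** (k - 1)

print(implied_ar_coefficients(0.8, 0.0))  # pure AR(1): only pi_1 = phi1 is non-zero
print(implied_ar_coefficients(0.0, 0.5))  # pure MA(1): pi_k = -theta**k
```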

4.2. Local expansions of artificial nesting models

In a series of papers Davidson and MacKinnon proposed a simple method for solving
the identifiability problem appearing in mixtures of regressions. We first discuss the
idea in the case of linear regressions and then we describe the general results
concerning the so-called J- and P-tests.

4.2.1. Linear regressions

Let us consider two linear regressions with different sets of regressors

H_g = {y_t = x_{1t}γ + u_t, u_t I.I.D., E(u_t/x_t) = 0, V(u_t/x_t) = σ², t = 1, ..., T},
H_h = {y_t = x_{2t}δ + v_t, v_t I.I.D., E(v_t/x_t) = 0, V(v_t/x_t) = τ², t = 1, ..., T},

and the associated mixtures

M = {y_t = (1 − λ)x_{1t}γ + λx_{2t}δ + ω_t, ω_t I.I.D., E(ω_t/x_t) = 0, V(ω_t/x_t) = η²,

t = 1, ..., T}. (4.6)

To circumvent the unidentifiability of the parameter δ under the null H_g, Davidson
and MacKinnon (1981) proposed an approach which is different from the simple

change of parameters given in Example 4.1.2, and which can be extended to more
general problems (see Section 4.4). The idea is to replace the nuisance parameter 6
by its estimator under H, instead of using its estimator under M. Therefore the
nesting model M is replaced by a pseudo-nesting model

M* = {y, = Xlty* + n*x,,& + o:, t= 1,. . .) T}, (4.7)

where 8, is the O.L.S. estimator of 6 based on H,.


Clearly in this formulation the second kind of regressors X,& depends on
Yl,...? y, through the estimator &.; they are partly endogenous and the error terms
are correlated. However Davidson and McKinnon (1981) proposed to omit these
difficulties and to study directly the asymptotic properties of the t-ratio statistic
associated with A*, computed as if & was non-random and the w:s were error
terms satisfying the usual conditions. The O.L.S. estimator of A* in the regression
of y, on xIt,xJT is

1; = {&x;M,x,&} - l6;x;M, Y, (4.8)

and the associated t-ratio statistic is given by

T_λ = [(1/√T) δ̂_T'X_2'M_1Y] / {η̂_T [(1/T) δ̂_T'X_2'M_1X_2δ̂_T]^{1/2}}, (4.9)

where η̂_T² is the usual estimator of the variance:

η̂_T² = (1/T) Σ_{t=1}^T (y_t − x_{1t}γ̂*_T − λ̂*_T x_{2t}δ̂_T)².

Under the null hypothesis H_g, δ̂_T tends to the pseudo-true value

d(γ_0) = [E(x_2'x_2)]^{-1} E(x_2'x_1) γ_0,

η̂_T² tends to σ_0², the true value of the variance, and the denominator of the t-ratio
statistic converges to

σ_0 {d(γ_0)'[E(x_2'x_2) − E(x_2'x_1)(E(x_1'x_1))^{-1}E(x_1'x_2)] d(γ_0)}^{1/2}.

The numerator of T_λ is equal to

(1/√T) δ̂_T'X_2'M_1Y = (1/√T) δ̂_T'X_2'M_1u (under H_g).

It is asymptotically normal under H_g, with a zero mean and a variance equal to
the square of the limit of the denominator:

σ_0² d(γ_0)'[E(x_2'x_2) − E(x_2'x_1)(E(x_1'x_1))^{-1}E(x_1'x_2)] d(γ_0).

Two cases have to be distinguished:

(i) the denominator is asymptotically different from zero, making the t-statistic
asymptotically standard normal,
(ii) the denominator tends to zero, making the limit of the T_λ-statistic undetermined.

The denominator tends to zero if either d(γ_0) = 0, i.e. if x_1 and x_2 are orthogonal
regressors, or if E(x_2'x_2) − E(x_2'x_1)(E(x_1'x_1))^{-1}E(x_1'x_2) = 0, i.e. if the regressors x_2 are
linear combinations of the regressors x_1.

Proposition 4.2.1

If the regressors x_1 and x_2 are non-orthogonal (E(x_1'x_2) ≠ 0) and if H_h is not nested
in H_g, the t-ratio statistic T_λ has asymptotically a standard normal distribution
under the null. The Davidson-MacKinnon test consists in rejecting H_g if |T_λ| > u_{1-ε/2},
where ε is the asymptotic level and u_{1-ε/2} the 1 − ε/2 quantile of the standard normal
distribution.

In the previous proposition we gave a two-sided version of the test; however, as for
the Cox test, it can be seen that the one-sided test whose critical region is {T_λ > u_{1-ε}}
has an asymptotic level equal to ε and is consistent (except if x_2δ_0 is a linear
combination of the components of x_1).
The previous test has been called the J-test by Davidson and MacKinnon. It is
also worth noting that there exists an exact version of this test, called the JA-test (A
indicating the Atkinson variant of this test), which is the usual t-test of λ̃ = 0 in the
regression

y = X_1γ + λ̃P_2P_1y + ω̃, (4.10)

where

P_j = I − M_j = X_j(X_j'X_j)^{-1}X_j', j = 1, 2.

The difference between this pseudo-nesting model and model M* given in (4.7)
is the replacement of X_2δ̂_T = P_2y by P_2P_1y. Since the right hand side of (4.10)
depends on y only through P_1y, the t-statistic on λ̃ has the t distribution with
T − K_1 − 1 degrees of freedom, where K_1 is the number of columns in X_1 (see
Milliken and Graybill (1970)).
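The following compact sketch (simulated data, illustrative design) implements the linear J-test as described in (4.7)-(4.9): regress y on X_1 and the fitted values P_2y, and read off the t-ratio on the latter.

```python
# A compact J-test sketch for the linear case (assumed design).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
T = 250
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])
X2 = np.column_stack([np.ones(T), rng.normal(size=T)])
y = X1 @ np.array([1.0, 0.8]) + rng.normal(size=T)     # H_g is true here

fitted_h = X2 @ np.linalg.lstsq(X2, y, rcond=None)[0]  # X2 delta_hat = P2 y
Z = np.column_stack([X1, fitted_h])                    # pseudo-nesting regressors
coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ coef
s2 = resid @ resid / (T - Z.shape[1])
cov = s2 * np.linalg.inv(Z.T @ Z)
t_lambda = coef[-1] / np.sqrt(cov[-1, -1])
print("J-test t-ratio:", t_lambda,
      "one-sided p-value:", 1 - stats.norm.cdf(t_lambda))
```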
4.2.2. Nonlinear regressions: the J-test

The previous approach may be directly extended to nonlinear regressions

H_g = {y_t = m(x_t; γ) + u_t, u_t I.I.D., E(u_t/x_t) = 0, V(u_t/x_t) = σ², t = 1, ..., T},

H_h = {y_t = μ(x_t; δ) + v_t, v_t I.I.D., E(v_t/x_t) = 0, V(v_t/x_t) = τ², t = 1, ..., T},

and to the set of mixtures

M = {y_t = (1 − λ)m(x_t; γ) + λμ(x_t; δ) + ω_t, ω_t I.I.D., E(ω_t/x_t) = 0, V(ω_t/x_t) = η²,

t = 1, ..., T}.

This model is replaced by the pseudo-nesting model

M* = {y_t = (1 − λ*)m(x_t; γ*) + λ*μ(x_t; δ̂_T) + ω*_t, t = 1, ..., T}, (4.11)

where δ̂_T is the nonlinear least squares estimator of δ under H_h. Then we can
compute the nonlinear least squares estimators of λ*, γ* under M* as if δ̂_T was a
constant, and the t-ratio statistic T_λ associated with λ*. The following
proposition is the analogue of Proposition 4.2.1 and, as in that proposition, the
two-sided test can be replaced by a one-sided test.

Proposition 4.2.2

Under regularity conditions, the t-ratio T_λ has, under H_g, an asymptotic standard
normal distribution. The so-called J-test consists in rejecting H_g if |T_λ| > u_{1-ε/2}.

4.2.3. Nonlinear regressions: the P-test

The previous J-test necessitates the determination of nonlinear least squares
estimators of some parameters. It is possible to develop a procedure which only
uses linear least squares. For this purpose we may consider an expansion of the
regression model M* in a neighbourhood of the null hypothesis. More precisely,
since the true value γ_0 is unknown, we introduce the expansion around γ̂_T, the least
squares estimator of γ under H_g. We have

(1 − λ*)m(x_t; γ*) + λ*μ(x_t; δ̂_T) + ω*_t

= (1 − λ*) [m(x_t; γ̂_T) + (∂m(x_t; γ̂_T)/∂γ')(γ* − γ̂_T)] + λ*μ(x_t; δ̂_T) + ω*_t

= m(x_t; γ̂_T) + λ*[μ(x_t; δ̂_T) − m(x_t; γ̂_T)] + (∂m(x_t; γ̂_T)/∂γ')c + ω**_t,

where c is an unknown parameter. Therefore an expansion of model M* is

M** = {y_t = m(x_t; γ̂_T) + λ*[μ(x_t; δ̂_T) − m(x_t; γ̂_T)]

+ (∂m(x_t; γ̂_T)/∂γ')c + ω**_t, t = 1, ..., T}. (4.12)

Let us now consider the asymptotic properties of the t-ratio statistic T*_λ for λ* in
M**, computed as if γ̂_T, δ̂_T were deterministic and the error terms ω**_t had the usual
properties of white noise. Let

m̂ = [m(x_t; γ̂_T)], μ̂ = [μ(x_t; δ̂_T)],

let D̂ be the matrix of derivatives [(∂m/∂γ')(x_t; γ̂_T)] and M̂_D the orthogonal projector on
the space orthogonal to the space spanned by the columns of D̂. The t-ratio statistic
is

T*_λ = (μ̂ − m̂)'M̂_D(y − m̂) / {η̂ [(μ̂ − m̂)'M̂_D(μ̂ − m̂)]^{1/2}},

where η̂² is the residual variance. It has the same asymptotic distribution under H_g
as the statistic

(μ − m)'M_D(y − m) / {η_0 [(μ − m)'M_D(μ − m)]^{1/2}},

where m = [m(x_t; γ_0)], μ = [μ(x_t; d(γ_0))], D = [(∂m/∂γ')(x_t; γ_0)].
Using the same arguments as for the linear case, we get the following proposition
(given with the two-sided version of the test).

Proposition 4.2.3

Under the condition plim (1/T)(μ − m)'M_D(μ − m) ≠ 0, the statistic T*_λ has asympto-
tically a standard normal distribution under the null H_g. The P-test consists in
rejecting H_g if |T*_λ| > u_{1-ε/2}.
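The following sketch implements the P-test artificial regression (4.12) for two hypothetical regression functions m(x; γ) = exp(γx) and μ(x; δ) = δ_0 + δ_1 x; everything in it, functions and parameter values alike, is illustrative.

```python
# A sketch of the P-test artificial regression (hypothetical models).
import numpy as np
from scipy import stats
from scipy.optimize import least_squares

rng = np.random.default_rng(6)
T = 300
x = rng.uniform(-1, 1, T)
y = np.exp(0.5 * x) + 0.3 * rng.normal(size=T)          # H_g is true here

# step 1: least squares under each hypothesis
gam = least_squares(lambda g: y - np.exp(g * x), x0=[0.0]).x[0]
Zh = np.column_stack([np.ones(T), x])
delta = np.linalg.lstsq(Zh, y, rcond=None)[0]

# step 2: linear artificial regression of y - m on (mu - m) and dm/dgamma
m = np.exp(gam * x)
mu = Zh @ delta
dm = x * np.exp(gam * x)                                # derivative in gamma
W = np.column_stack([mu - m, dm])
coef, *_ = np.linalg.lstsq(W, y - m, rcond=None)
resid = (y - m) - W @ coef
s2 = resid @ resid / (T - W.shape[1])
t_lambda = coef[0] / np.sqrt(s2 * np.linalg.inv(W.T @ W)[0, 0])
print("P-test t-ratio:", t_lambda,
      "p:", 2 * (1 - stats.norm.cdf(abs(t_lambda))))
```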

4.3. A score test based on a modified Atkinson's compound model

We have seen that the artificial nesting model obtained by introducing the convex
combinations of two nonlinear regressions was a special case of Atkinson's
compound model. Moreover the modification of this artificial nesting model
considered by Davidson and MacKinnon was mainly intended to solve the
identification problem of the nuisance parameter under the null λ = 0. Therefore, it is
natural to look for a potential extension of the Davidson-MacKinnon procedure to
Atkinson's compound model. More precisely let us consider this compound
model, i.e. the exponential combination (4.2).
We may introduce the modified compound model in which the unknown parameters
α and β are both replaced by the maximum likelihood estimators α̂_T and β̂_T
computed under H_g and H_h respectively:

ℓ̃_T(y_t/x_t; λ) = g(y_t/x_t; α̂_T)^{1-λ} h(y_t/x_t; β̂_T)^λ / ∫ g(u/x_t; α̂_T)^{1-λ} h(u/x_t; β̂_T)^λ ν_y(du).

By analogy with the Davidson-MacKinnon approach, we define the score statistic
for testing the hypothesis {λ = 0}, computed as if α̂_T and β̂_T were deterministic. This
statistic is

ξ̃_T^s = [(1/√T) Σ_{t=1}^T ∂ log ℓ̃_T(y_t/x_t; 0)/∂λ] [(1/T) Σ_{t=1}^T (∂ log ℓ̃_T(y_t/x_t; 0)/∂λ)²]^{-1/2}. (4.13)

What are the properties of this statistic? Let us first consider the numerator. It is
easily seen that

∂ log ℓ̃_T(y_t/x_t; 0)/∂λ = log h(y_t/x_t; β̂_T) − log g(y_t/x_t; α̂_T)
− ∫ [log h(u/x_t; β̂_T) − log g(u/x_t; α̂_T)] g(u/x_t; α̂_T) ν_y(du),

and that the numerator is equal to

(1/√T) {L_T^h(β̂_T) − L_T^g(α̂_T) − Σ_{t=1}^T ∫ [log h(u/x_t; β̂_T) − log g(u/x_t; α̂_T)] g(u/x_t; α̂_T) ν_y(du)}, (4.14)

where L_T^h(β̂_T) and L_T^g(α̂_T) are the maxima of the log-likelihood function under H_h and H_g
respectively. Therefore the numerator has an expression which is analogous to that
of the Cox statistic, except that the estimated pseudo-true value b(α̂_T) has been
directly replaced by β̂_T (such a replacement was initially proposed by Atkinson
(1970)). It is easy to check that the numerator is asymptotically equivalent under
the null to the numerator of the Cox statistic; in particular it has the same asymptotic
variance given by

W = V_{α_0}(log h − log g) − Cov_{α_0}(log h − log g, ∂ log g/∂α)
× [V_{α_0}(∂ log g/∂α)]^{-1} Cov_{α_0}(∂ log g/∂α, log h − log g), (4.15)

where log h, log g and ∂ log g/∂α are evaluated at (b(α_0), α_0) and where V_{α_0}, Cov_{α_0} are the
variance and covariance under the null.
The denominator of the score statistic computed as if α̂_T, β̂_T were deterministic
tends to the square root of

V_{α_0}(log h − log g),

which is always larger than the true asymptotic variance W.


Therefore this first extension of the Davidson-MacKinnon approach to Atkinson's
compound model does not produce a test with the right asymptotic size.
More precisely we get the following result.

Proposition 4.3.1

Let us consider the score statistic ξ̃_T^s computed as if α̂_T and β̂_T were deterministic,
and also the corrected score statistic

ξ_T^s = [(1/√T) Σ_{t=1}^T ∂ log ℓ̃_T(y_t/x_t; 0)/∂λ] Ŵ^{-1/2},

where Ŵ is a consistent estimator of W under the null.

(i) ξ_T^s is asymptotically equivalent to the Cox statistic,
(ii) ξ̃_T^s gives a procedure with an incorrect asymptotic level,
(iii) the critical region based on ξ̃_T^s is conservative, and the null hypothesis is
accepted too often.

It is worth noting that ξ̃_T^s is easier to evaluate than ξ_T^s. If the use of ξ̃_T^s leads to the
rejection of the null hypothesis, we can conclude (without computing ξ_T^s) that the
null hypothesis must be rejected.
4.4. The partially modified Atkinson's compound model

In fact, in their approach Davidson and MacKinnon only replace the parameter of
the alternative model by its maximum likelihood estimator. The same idea may be
followed with Atkinson's compound model. We now consider the partially
modified compound model in which only β is replaced by the maximum likelihood
estimator β̂_T computed under H_h:

ℓ*_T(y_t/x_t; λ, α) = g(y_t/x_t; α)^{1-λ} h(y_t/x_t; β̂_T)^λ / ∫ g(u/x_t; α)^{1-λ} h(u/x_t; β̂_T)^λ ν_y(du). (4.16)

We then define the score statistic ξ*_T^s for testing {λ = 0} computed as if β̂_T was
deterministic, where the derivatives

∂ log ℓ*_T/∂λ and ∂ log ℓ*_T/∂α

are evaluated at the constrained estimators λ = 0 and α = α̂_T.
With this correction, we get the following proposition.

Proposition 4.4.1

The score statistic ξ*_T^s for λ = 0 deduced from the partially modified compound
model is asymptotically equivalent to the Cox statistic under the null.

5. Comparison of testing procedures

In the comparisons of testing procedures we have to distinguish between asymptotic
results (such as the equivalence of two procedures under the null hypothesis, or a
comparison of their asymptotic relative efficiencies) and finite sample results (such
as an exact comparison or an evaluation of the difference between the small sample
significance level and the nominal level). The asymptotic results are generally
derived from a theoretical point of view, while the small sample results are obtained
from Monte Carlo studies, except for simple models such as Gaussian linear models.

5.1. Asymptotic equivalence of test statistics

We first study if the test statistics considered above are asymptotically equivalent
under the null, i.e. if they differ at most by a term which is negligible in probability,
o_p(1). If this is the case it is known that they have the same asymptotic distribution.
We have seen before that the form of these asymptotic distributions may be either
chi-square (with several degrees of freedom) or univariate normal distributions.
Therefore some previously introduced statistics, such as the Wald statistic and the
Cox statistic, cannot be equivalent and the equivalence may only be derived for
particular pairs of statistics.

Proposition 5.1.1

(i) The extended Wald and score statistics are asymptotically equivalent under the
null hypothesis.
(ii) The Cox statistic and the score statistic based on the partially modified
Atkinson's compound model (and the J- and P-tests in the case of regression
models) are asymptotically equivalent under the null.
(iii) The two Wald statistics ξ_T^1 and ξ_T^2 are generally not equivalent and they are
not equivalent to the Cox statistic (or to the J- and P-tests in the case of
regression models).

The first part of the proposition has been proved in Gourieroux et al. (1983); the
second part is a consequence of Proposition 4.4.1 and has been established by
Davidson and MacKinnon (1981) for the J- and P-tests. The third part is a
consequence of the different asymptotic distributions of the statistics, χ²(d), χ²(d*)
and N[0, 1] respectively.

5.2. Asymptotic comparisons of power functions

Let us consider a consistent testing procedure for testing H_g against H_h. This
procedure is generally defined by a critical region of the form {ξ_T > c_ε}, where ξ_T
is the test statistic and c_ε a critical value chosen to get the right asymptotic level:

lim_{T→∞} P_α(ξ_T > c_ε) = ε, ∀α ∈ A.
The power function of this test gives the probability of the critical region under the
alternative hypothesis H_h. It is a function of ε, T and the parameter β:

p(ε, T, β) = P_β(ξ_T > c_ε), (5.1)

where P_β is the probability with respect to the p.d.f. corresponding to β in H_h. Since
the tests previously defined are consistent, this power function tends to one when
T tends to infinity:

lim_{T→∞} p(ε, T, β) = 1, ∀ε > 0, ∀β.

Therefore if (ξ_{1T}, c_1), (ξ_{2T}, c_2) are two testing procedures associated with critical
regions {ξ_{1T} > c_{1ε}}, {ξ_{2T} > c_{2ε}}, they cannot be compared by examining the asymptotic
value of the power function at fixed arguments ε, β, since the limit equals one for
both procedures. But it is possible to study this power function for either a varying
β or a varying ε.
The first approach of a varying β was introduced by Pitman (1948). The sequence
β_T of alternative hypotheses has to be chosen in such a way that the associated
distributions h(y/x; β_T) tend to a distribution g(y/x; α_0) of the null hypothesis, at a
rate ensuring that the limits

lim_{T→∞} p_j(ε, T, β_T) = lim_{T→∞} P_{β_T}(ξ_{jT} > c_{jε}) = a_j(ε) (say), j = 1, 2,

exist and are different from zero and one. Then the first procedure is said to be
asymptotically more powerful than the second if a_1(ε) ≥ a_2(ε), ∀ε. The sequence
h(y/x; β_T) is called a sequence of local alternatives. For non-nested hypotheses this
approach of local alternatives can only be used for some specific problems. Indeed
it needs partially non-nested hypotheses in order to be able to build the local
alternatives and even in such a case the local alternatives can only be defined in
some special directions corresponding to the intersection of the two hypotheses.
Such an approach has been followed by several authors. Pesaran (1982b) considered
the case of two non-nested linear regression models, and proved that the local power
of the one degree of freedom tests (Cox test, J- and P-tests) is not exceeded by that
of the standard F-test, when the number of explanatory variables is smaller under
the null than under the alternative. Dastoor and McAleer (1982) extended the
previous results to the case of multiple alternatives and demonstrated that Pesaran's
result depends crucially on the type of local alternatives specified. In general, it is
not possible to rank the tests in terms of asymptotic local power. Ericsson (1983)
considered Cox-type statistics in the context of instrumental variable estimation
and Gill (1983b) considered the general case of parametric hypotheses.
In the second approach, due to Bahadur (1960) (see also Geweke (1981)), the
alternative β is fixed and the first type error ε_T tends to zero in such a way that the
limits

lim_{T→∞} p_j(ε_T, T, β) = lim_{T→∞} P_β(ξ_{jT} > c_{jε_T}) = b_j(β) (say), j = 1, 2,

exist and are different from zero and one. The first test is said to be asymptotically
more powerful than the second if b_1(β) ≥ b_2(β), ∀β. This approach seems more
suitable for non-nested hypotheses but may be difficult to apply.
The problem of fixed alternatives has been analyzed in some particular cases by
McAleer et al. (1982). Epps et al. (1982) introduced a testing procedure based on
empirical moment generating functions and tried to maximize the power for fixed
alternatives. Gourieroux (1983) considered testing procedures based on estimated
residuals and looked for the optimal form of such test statistics using Bahadur's
criterion. The results are valid for the choice between non-linear regression models
and show that the Wald test, the score test and the J- and P-tests satisfy the
optimality condition, while the Cox procedure does not. Pesaran (1984) gives a
general survey of both approaches by local and fixed alternatives.

5.3. Exact finite sample results

For some specific examples it is possible to determine the exact forms of the test
statistics and of their distributions under the null and alternative hypotheses (or at
least an expansion of this exact distribution).
The earlier papers were interested in discriminating between some families with
common invariant sufficient statistics, in the i.i.d. framework. Dumonceaux et al.
(1973) proved that the likelihood ratio does not depend on nuisance parameters for
discriminating between normal and Cauchy distributions and between normal and
exponential distributions. Dumonceaux and Antle (1973) gave the table of critical
values for a test based on likelihood ratio statistics for discriminating between a
log-normal and a Weibull distribution (see also Pereira (1977a)); some other
authors look for accurate approximations of the finite sample distribution when
its explicit expression is not available. For instance Jackson (1968) considered the
Cox statistic for choosing between a log-normal model

g(y; α) = [1/(yα_2√(2π))] exp{−(log y − α_1)²/(2α_2²)}
and an exponential model

h(y; β) = (1/β) exp(−y/β)

(see also Pereira (1977a), Epps et al. (1982) for a study of this problem originally
considered by Cox).
When some explanatory variables are introduced into the models, exact results
have been essentially derived for the choice between two Gaussian linear models, or
for the comparison of two test statistics with closed forms. These results provide
some inequalities between test statistics. Fisher and McAleer (1981) consider this
problem for Gaussian non-linear regressions; Dastoor (1983a) establishes an in-
equality between two versions of the Cox statistic: when the opposites of the statistics
given above are retained, the Atkinson version of the Cox statistic is smaller than
the Cox statistic itself, and therefore is less likely to reject the null hypothesis.
Determination of exact distributions of Fisher type (Dastoor and McAleer (1985))
and the comparison of exact power functions (Dastoor and McAleer (1985)) are also
dealt with. A summary of the expressions of the main statistics used in the regression
case and of the size and power functions of the associated tests in finite samples is
given in Table 2.4 of Fisher and Whistler (1982), for instance.

5.4. Monte Carlo studies

Monte Carlo studies have been performed for more complicated examples. Dyer
(1973, 1974) compared testing procedures which are invariant with respect to
location and scale parameters in the i.i.d. case. Pesaran (1982b) and Godfrey and
Pesaran (1983) considered the choice between two regression models by the Cox
statistic or by modified versions of this statistic. They analyzed the effect of the
difference between the number of regressors in the two hypotheses, of the
non-normality of the error term and of the presence of lagged endogenous variables
in the regressions. Davidson and MacKinnon (1982) compared various versions of
the Cox test with the F-test, the J-test and the JA-test in the linear case.
All these results are partial since they concern specific problems and specific
values of the parameters, but they give some ideas on the behaviour of the
procedures for small T. The main observations are the following ones.
(i) The finite sample size of Cox type tests can be much greater than the nominal
level. These tests reveal a tendency to reject the true model too frequently. This
effect is also important for the J-test. However it seems possible to incorporate
in the test statistics both mean and variance adjustments in order to avoid such
an effect (Godfrey and Pesaran (1983)). The simulations by the authors show
that these corrections are partially successful. For instance the size of the
adjusted J-test is smaller than for the unadjusted J-test, but is still higher than
the nominal significance level.
(ii) The comparison of power functions is difficult to interpret since the usual
procedures do not have the same finite sample size.
(iii) The results are often very sensitive to the relative number of regressors in the
two hypotheses and depend significantly on whether this number is
smaller or larger in the null hypothesis than in the alternative one. For instance
the power of the J-test is poor in the second case.
(iv) The JA-test lacks power in several situations: when the number of regressors
in the null hypothesis is less than in the alternative or when the true distribution
does not belong to either the null or the alternative.
(v) The finite sample sizes are not badly distorted when the errors have been
assumed to be normal and follow another distribution (log-normal, chi-square,
etc.). Similarly the ordering of the power functions does not seem to be
significantly modified.
(vi) When the sample size is reasonably large and the variance of the error terms
reasonably small, all the tests perform in a satisfactory manner.

6. Encompassing

6.1. The encompassing principle

In the non-nested hypothesis testing procedures that we have described in the
previous sections, we assume that the true conditional distribution belongs to one
of the hypotheses. This assumption can be considered as a strong one and, therefore,
it is interesting to see if it is possible to avoid it. This kind of idea led
to a tentative definition of the notion of encompassing (Mizon and Richard (1986),
Hendry and Richard (1990)):

One model encompasses another if the former can account for, or explain, the
results obtained by the latter.

This notion can be used in a modelling strategy, in which we want to propose
more and more suitable models. These models not only have to take into account
some new interesting phenomena, but they also have to be able to explain previous
results derived with the previous models.
Theoretically when two or more competing models are considered, it is possible
to define a general model in which they are all nested, and to assume that the true
distribution belongs to this general model. This is the idea of artificial nesting and
historically the first definition of encompassing (see Pesaran and Deaton (1978),
Mizon and Richard (1981) or Hendry and Richard (1982)). However in practice this
general nesting model will contain many parameters and will require an amount of
information often larger than that contained in the available data. In fact, there is
room for more parsimonious strategies of encompassing, in which we do not have
to nest the models at each step in a more general model, nor to assume that a model
contains the true distribution (contrary to the axiom of correct specification
formalized by Leamer (1978)). For developing such a modelling strategy, we
have
(i) to precisely define the notion of encompassing,
(ii) to modify the test procedures in order to take into account the fact that H_g may
encompass H_h, even if neither H_g nor H_h is true.
Since the notion of encompassing is linked with model choice, we introduce
different notations in the rest of this section. The two competing models are denoted
by

M_1 = {g_1(y_t/x_t; α_1), α_1 ∈ A_1} and M_2 = {g_2(y_t/x_t; α_2), α_2 ∈ A_2}, (6.1)

instead of H_g and H_h. The true conditional distribution of y given x is f_{0y}, and is not
assumed a priori to belong to M_1 or M_2.

6.1.1. Pseudo-true values and binding functions

As previously mentioned, the pseudo-true values $\alpha_{10}^*$ and $\alpha_{20}^*$ of the parameters are defined by

$$\alpha_{10}^* = \arg\min_{\alpha_1} E_{X_0}\, E_0 \log \frac{f_{0y}(y|x)}{g_1(y|x;\alpha_1)} = \arg\max_{\alpha_1} E_{X_0}\, E_0 \log g_1(y|x;\alpha_1), \qquad (6.1)$$

$$\alpha_{20}^* = \arg\min_{\alpha_2} E_{X_0}\, E_0 \log \frac{f_{0y}(y|x)}{g_2(y|x;\alpha_2)} = \arg\max_{\alpha_2} E_{X_0}\, E_0 \log g_2(y|x;\alpha_2), \qquad (6.2)$$

where $E_{X_0}$ is the expectation with respect to the marginal distribution $f_{0x}$ of $x$, and
$E_0$ is the conditional expectation with respect to the true p.d.f. $f_{0y}(y|x)$.
The proximity between $f_{0y}$ and the models $M_1$ or $M_2$ is

$$I[f_{0y}, M_j] = E_{X_0}\, E_0 \log \frac{f_{0y}(y|x)}{g_j(y|x;\alpha_{j0}^*)}, \qquad j = 1,2. \qquad (6.3)$$

In the same spirit we can, for any $\alpha_1 \in A_1$, define the value of $\alpha_2$, denoted by $b_{21}(\alpha_1)$,
providing the p.d.f. of $M_2$ which is the closest to $g_1(y|x;\alpha_1)$:

$$b_{21}(\alpha_1) = \arg\max_{\alpha_2} E_{X_0}\, E_{\alpha_1} \log g_2(y|x;\alpha_2), \qquad (6.4)$$

and similarly $b_{12}(\alpha_2) = \arg\max_{\alpha_1} E_{X_0}\, E_{\alpha_2} \log g_1(y|x;\alpha_1)$, where $E_{\alpha_j}$ denotes the conditional expectation taken with respect to $g_j(y|x;\alpha_j)$.

The functions $b_{21}$ and $b_{12}$ are called binding functions. They only involve the
models $M_1$ and $M_2$, and not the true distribution.
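To fix ideas, the following minimal sketch computes a binding function numerically for a hypothetical model pair that does not appear in this chapter: $M_1$ the exponential family with scale $\alpha_1$ and $M_2$ the Gaussian family, with no conditioning variables. Since $b_{21}(\alpha_1)$ maximizes the expected log-likelihood of $M_2$ under $g_1(\cdot;\alpha_1)$, it can be approximated by fitting $M_2$ by maximum likelihood to a large simulated sample drawn from $g_1(\cdot;\alpha_1)$.

```python
# Minimal Monte Carlo sketch of a binding function b_21 (hypothetical model pair,
# not from the chapter): M1 = {Exponential(scale alpha1)}, M2 = {N(mu, sigma^2)}.
# b_21(alpha1) = argmax_{alpha2} E_{alpha1} log g2(y; alpha2), which for the
# Gaussian family is solved by the mean and variance of y under g1(.; alpha1).
import numpy as np

rng = np.random.default_rng(0)

def binding_b21(alpha1, n=1_000_000):
    y = rng.exponential(scale=alpha1, size=n)   # draws from g1(.; alpha1)
    return y.mean(), y.var()                    # Gaussian pseudo-true (mu, sigma^2)

print(binding_b21(2.0))   # close to (2.0, 4.0): E(y) = alpha1, Var(y) = alpha1**2
```

For the Gaussian family the maximizers are just the mean and variance under $g_1$, so the simulated answer can be checked against the closed form $b_{21}(\alpha_1) = (\alpha_1, \alpha_1^2)$.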

6.1.2. Encompassing (Mizon and Richard (1986))

The distribution of $M_1$ (resp. $M_2$) associated with $\alpha_{10}^*$ (resp. $\alpha_{20}^*$) can be seen as the
best representative of $M_1$ (resp. $M_2$), and it seems natural to formalize the notion of
encompassing by saying that $M_1$ encompasses $M_2$ if, acting as if the best
representative of $M_1$ were the true distribution, we find that the closest distribution
of $M_2$ is the best representative of $M_2$. This means that $\alpha_{20}^* = b_{21}(\alpha_{10}^*)$. Note that
this property depends not only on $M_1$ and $M_2$ but also on the true p.d.f. $f_0$ of
$(y, x)$.

Definition 6.1.1

(i) $f_0$ is such that $M_1$ encompasses $M_2$ if and only if $\alpha_{20}^* = b_{21}(\alpha_{10}^*)$. This condition
is denoted

$f_0$ s.t. $M_1\,\mathcal{E}\,M_2$.

(ii) $f_0$ is such that there is mutual encompassing if and only if we have simultaneously
$f_0$ s.t. $M_1\,\mathcal{E}\,M_2$ and $f_0$ s.t. $M_2\,\mathcal{E}\,M_1$.

6.2. The encompassing tests

6.2.1. The encompassing hypothesis

We want to define testing procedures of the null hypothesis

$$H_0 = \{f_0 \text{ s.t. } M_1\,\mathcal{E}\,M_2\} = \{\alpha_{20}^* = b_{21}(\alpha_{10}^*)\}. \qquad (6.5)$$

This null hypothesis constrains the unknown p.d.f. $f_0$, and the tests have to be
considered without assuming a priori that $f_{0y}$ is in $M_1$ or in $M_2$. It is also clear that
$H_0$ is true if $f_{0y}$ belongs to $M_1$.
It is natural to consider the test procedures previously introduced for testing
non-nested hypotheses and to examine if they can be used for testing the
encompassing hypothesis $H_0$.

6.2.2. Cox likelihood ratio statistic

The Cox approach is based on the following statistic:

$$S_{1T} = \frac{1}{T}\sum_{t=1}^T \left\{\log g_1(y_t|x_t;\hat\alpha_{1T}) - \log g_2(y_t|x_t;\hat\alpha_{2T})\right\} - \frac{1}{T}\sum_{t=1}^T E_{\hat\alpha_{1T}}\left\{\log g_1(y_t|x_t;\hat\alpha_{1T}) - \log g_2\big(y_t|x_t;b_{21}(\hat\alpha_{1T})\big)\right\}.$$

Under the null hypothesis $H_0$, this statistic tends to

$$\operatorname{plim}\, S_{1T} = E_{X_0}\, E_0\left[\log g_1(y|x;\alpha_{10}^*) - \log g_2(y|x;\alpha_{20}^*)\right] - E_{X_0}\, E_{\alpha_{10}^*}\left[\log g_1(y|x;\alpha_{10}^*) - \log g_2\big(y|x;b_{21}(\alpha_{10}^*)\big)\right].$$

This limit is equal to zero if the true conditional p.d.f. belongs to $M_1$. However, it
is generally different from zero for the p.d.f.s of $H_0$ whose conditional distributions
do not belong to $M_1$. This shows that the Cox approach is not appropriate for
testing the encompassing hypothesis $H_0$. Obviously the same conclusion holds for
the J- and P-tests, which are equivalent to the Cox test.
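As an illustration of how the statistic is centered, the sketch below evaluates $S_{1T}$ by Monte Carlo for the hypothetical exponential/Gaussian pair introduced above: the second term estimates the expectation under $g_1(\cdot;\hat\alpha_{1T})$ by simulation. Here the data are actually generated inside $M_1$, so $S_{1T}$ should be close to zero.

```python
# Monte Carlo sketch (hypothetical models, as above) of the Cox statistic S_1T.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
T = 3000
y = rng.exponential(scale=2.0, size=T)             # data generated inside M1

a1 = y.mean()                                      # MLE under M1 (exponential)
mu, s2 = y.mean(), y.var()                         # MLE under M2 (Gaussian)

def llr(v, a1, mu, s2):                            # log g1 - log g2 at a point
    return (-np.log(a1) - v / a1) - norm.logpdf(v, mu, np.sqrt(s2))

ysim = rng.exponential(scale=a1, size=200_000)     # draws from g1(.; a1_hat)
b21_mu, b21_s2 = ysim.mean(), ysim.var()           # b21(a1_hat) by simulation
S1T = llr(y, a1, mu, s2).mean() - llr(ysim, a1, b21_mu, b21_s2).mean()
print(S1T)                                         # ~ 0 here because f0 is in M1
```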

6.2.3. The Wald encompassing test (WET)

The difference between the two estimators of the pseudo-true value $\alpha_{20}^*$, i.e.
$\hat\alpha_{2T} - b_{21}(\hat\alpha_{1T})$, tends to zero under the encompassing hypothesis. Under $H_0$,
$\sqrt{T}[\hat\alpha_{2T} - b_{21}(\hat\alpha_{1T})]$ is asymptotically normal, but its asymptotic variance-covariance
matrix is different from the one given in Proposition 3.2.1, which has been
computed under the strict sub-hypothesis $M_1$ of the encompassing hypothesis $H_0$.

Proposition 6.2.1 (see Gourieroux and Monfort (1992))

Under $H_0$, $\sqrt{T}[\hat\alpha_{2T} - b_{21}(\hat\alpha_{1T})]$ converges in distribution to $N[0, \Omega_0]$, with

$$\Omega_0 = K_2^{-1}\left[C_{22} - C_{21}C_{11}^{-1}C_{12}\right]K_2^{-1} + K_2^{-1}\left[C_{21}C_{11}^{-1} - K_{21}K_{11}^{-1}\right]C_{11}\left[C_{11}^{-1}C_{12} - K_{11}^{-1}K_{12}\right]K_2^{-1},$$
6.2.4. The score encompassing test (SET)

The test is based on a score statistic $\hat\lambda_T$ for which

$$\operatorname{plim}\, \hat\lambda_T = 0.$$

Therefore the score statistic can be used as the basis for an encompassing test. The
statistic $\sqrt{T}\,\hat\lambda_T$ is asymptotically equivalent to $\sqrt{T}\,\hat K_2[\hat\alpha_{2T} - b_{21}(\hat\alpha_{1T})]$, under the
encompassing hypothesis.

Proposition 6.2.2

The SET is based on the statistic $\xi_T = T\,\hat\lambda_T'\,\hat K_2^{-1}\hat\Omega_0^{-}\hat K_2^{-1}\hat\lambda_T$, and the critical region
is $\xi_T > \chi^2_{1-\varepsilon}(d)$, where $d$ is the rank of $\Omega_0$.

As for the WET, the asymptotic variance-covariance matrix is re-evaluated, in
comparison with that of the extended score test of Proposition 3.3.1, in order to
take into account the fact that the true p.d.f. may satisfy the encompassing
hypothesis without belonging to $M_1$.

6.2.5. The generalized encompassing test (GET)

The previous Wald and score encompassing tests may be difficult to implement for
various reasons and, in particular, because the variance-covariance matrices
appearing in the test statistics are, in general, not invertible. This implies that a
generalized inverse must be used and that the rank must be estimated. Therefore it
is worth looking for simpler tests, even if the price to pay is the enlargement of the
implicit null hypothesis $H_0 = \{\alpha_{20}^* = b_{21}(\alpha_{10}^*)\}$. This null hypothesis has an intersection
with $M_2$ which is equal to the so-called reflecting set $R_{21} = \{\alpha_2: \alpha_2 = b_{21}[b_{12}(\alpha_2)]\}$.
The tests that are proposed below have an implicit null hypothesis whose
intersection with $M_2$ is equal to the image $M_{21}$ of $M_1$ by $b_{21}$. This implies that,
when $b_{21}$ is injective, these tests are effective only if $p_2$, the size of $\alpha_2$, is greater than
$p_1$, the size of $\alpha_1$.

Proposition 6.2.3

Under $H_0$, and if the rank of $\partial b_{21}/\partial\alpha_1'$ is $p_1$, the statistic

$$\xi_T = T \min_{\alpha_1 \in A_1}\,[\hat\alpha_{2T} - b_{21}(\alpha_1)]'\,\hat\Sigma_T^{-1}\,[\hat\alpha_{2T} - b_{21}(\alpha_1)],$$

where $\hat\Sigma_T$ is a consistent estimator of $\Sigma = K_2^{-1}C_{22}K_2^{-1}$, is asymptotically distributed
as $\chi^2(p_2 - p_1)$. The test consists in rejecting $H_0$ if $\xi_T > \chi^2_{1-\varepsilon}(p_2 - p_1)$, where $\varepsilon$ is the
asymptotic level of the test. This test is called the generalized encompassing test
(GET).
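The following sketch (our illustration, again for the hypothetical exponential/Gaussian pair) computes the WET contrast $\hat\alpha_{2T} - b_{21}(\hat\alpha_{1T})$ of Section 6.2.3; building the full statistic would additionally require an estimate of $\Omega_0$ (or of $\Sigma$ for the GET).

```python
# Hedged sketch (not the chapter's algorithm) of the WET contrast
# alpha2_hat - b21(alpha1_hat) for the exponential/Gaussian pair above.
# Under H0 (M1 encompasses M2) the contrast should be near zero.
import numpy as np

rng = np.random.default_rng(1)
T = 5000
y = rng.exponential(scale=2.0, size=T)       # data generated by M1

alpha1_hat = y.mean()                        # MLE of the exponential scale
alpha2_hat = np.array([y.mean(), y.var()])   # Gaussian MLE (mu, sigma^2)

# Binding function evaluated at alpha1_hat (closed form for this pair).
b21 = np.array([alpha1_hat, alpha1_hat**2])

contrast = alpha2_hat - b21
print(contrast)   # both components -> 0 as T grows; sqrt(T) scaling gives the WET
```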

Davidson, R. and J.G. MacKinnon (1985a) Testing Linear and Loglinear Regressions against Box-Cox
Alternatives, Canadian Journal of Economics, XVIII, 499-517.
Davidson, R. and J.G. MacKinnon (1985b) Heteroskedasticity-Robust Tests in Regression Directions,
Annales de l'INSEE, 59/60, 183-218.
Davidson, R. and J.G. MacKinnon (1987) Implicit Alternatives and the Local Power of Test Statistics,
Econometrica, 55, 1305-1329.
Deaton, A.S. (1982) Model Selection Procedures, or, Does the Consumption Function Exist? in:
G.C. Chow and P. Corsi, eds., Evaluating the Reliability of Macroeconomic Models, New York: Wiley,
43-65.
Domowitz, I. and H. White (1982) Misspecified Models with Dependent Observations, Journal of
Econometrics, 20, 35-58.
Dufour, J.M. (1989) Nonlinear Hypotheses, Inequality Restrictions and Non-nested Hypotheses: Exact
Simultaneous Tests in Linear Regressions, Econometrica, 57, 335-356.
Dumonceaux, R. and C.E. Antle (1973) Discrimination Between the Log-Normal and the Weibull
Distributions, Technometrics, 15, 923-926.
Dumonceaux, R., C.E. Antle and G. Haas (1973) Likelihood Ratio Test for Discrimination Between
Two Models with Unknown Location and Scale Parameters, Technometrics, 15, 19-27.
Dyer, A.R. (1973) Discrimination Procedures for Separate Families of Hypotheses, Journal of the
American Statistical Association, 68, 970-974.
Dyer, A.R. (1974) Hypothesis Testing Procedures for Separate Families of Hypotheses, Journal of the
American Statistical Association, 69, 140-145.
Efron, B. (1983) Comparing Non-nested Linear Models, Technical Report 84, Stanford University.
Engle, R.F. (1984) Wald, Likelihood Ratio and Lagrange Multiplier Tests in Econometrics, in:
Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2, North-Holland: Amsterdam,
776-826.
Epps, T.W., K.J. Singleton and L.B. Pulley (1982) A Test of Separate Families of Distributions Based
on the Empirical Moment Generating Function, Biometrika, 69, 391-399.
Ericsson, N.R. (1982) Testing Non-nested Hypotheses in Systems of Linear Dynamic Economic
Relationships, Ph.D. dissertation, London School of Economics.
Ericsson, N.R. (1983) Asymptotic Properties of Instrumental Variables Statistics for Testing Non-nested
Hypotheses, Review of Economic Studies, 50, 287-304.
Ericsson, N.R. (1991) Monte Carlo Methodology and the Finite Sample Properties of Instrumental
Variables Statistics for Testing Nested and Non-nested Hypotheses, Econometrica, 59, 1249-1278.
Ericsson, N.R. and D.F. Hendry (1989) Encompassing and Rational Expectations: How Sequential
Corroboration Can Imply Refutation, International Finance Discussion Paper 354, Board of
Governors of the Federal Reserve System.
Fisher, G.R. (1983) Tests for Two Separate Regressions, Journal of Econometrics, 21, 117-132.
Fisher, G.R. and M. McAleer (1979) On the Interpretation of the Cox Test in Econometrics, Economics
Letters, 4, 145-150.
Fisher, G.R. and M. McAleer (1981) Alternative Procedures and Associated Tests of Significance for
Non-nested Hypotheses, Journal of Econometrics, 16, 103-119.
Fisher, G.R. and D. Whistler (1982) Tests for Two Separate Regressions, Institut National de la
Statistique et des Etudes Economiques (INSEE), discussion paper 8210.
Geweke, J. (1981) The Approximate Slopes of Econometric Tests, Econometrica, 49, 1427-1442.
Ghysels, E. and A. Hall (1990) Testing Non-nested Euler Conditions with Quadrature Based Method
of Approximation, Journal of Econometrics, 46, 273-308.
Gill, L. (1983a) Some Non-nested Tests in an Exponential Family of Distributions, University of
Manchester, discussion paper 129.
Gill, L. (1983b) Local Power Comparisons for Tests of Non-nested Hypotheses, University of
Manchester, discussion paper.
Godfrey, L.G. (1983) Testing Non-nested Models After Estimation by Instrumental Variables or Least
Squares, Econometrica, 51, 355-365.
Godfrey, L.G. (1984) On the Use of Misspecification Checks and Tests of Non-nested Hypotheses in
Empirical Econometrics, Economic Journal, 94, 69-81.
Godfrey, L.G. and M.H. Pesaran (1983) Tests of Non-nested Regression Models: Small Sample
Adjustments and Monte Carlo Evidence, Journal of Econometrics, 21, 133-154.
with autocorrelated disturbances: an application to models of U.S. unemployment, Communications
in Statistics, Series A, 19, 3619-3644.
McFadden, D.L. (1984) Econometric Analysis of Qualitative Response Models, in: Z. Griliches and
M.D. Intriligator, eds., Handbook of Econometrics, Vol. 2, North-Holland: Amsterdam, 1395-1458.
McFadden, D.L. (1989) A Method of Simulated Moments for Estimation of Discrete Response
Models without Numerical Integration, Econometrica, 57, 995-1026.
MacKinnon, J.G. (1983) Model Specification Tests Against Non-nested Alternatives, Econometric
Reviews, 2, 85-110.
MacKinnon, J.G., H. White and R. Davidson (1983) Tests for Model Specification in the Presence of
Alternative Hypotheses: Some Further Results, Journal of Econometrics, 21, 53-70.
Milliken, G.A. and F.A. Graybill (1970) Extensions of the General Linear Hypothesis Model, Journal
of the American Statistical Association, 65, 797-807.
Mizon, G.E. (1984) The Encompassing Approach in Econometrics, in: D.F. Hendry and K.F. Wallis,
eds., Econometrics and Quantitative Economics, Oxford: Basil Blackwell.
Mizon, G.E. and J.F. Richard (1981) The Structure of Some Non-nested Hypothesis Tests, Southampton
University, mimeo.
Mizon, G.E. and J.F. Richard (1986) The Encompassing Principle and its Application to Testing
Non-nested Hypotheses, Econometrica, 54, 657-678.
Nakhaeizadeh, G. (1988) Non-nested New Classical and Keynesian Models: A Comparative Study in
the Case of the Federal Republic of Germany, Karlsruhe University, discussion paper.
Pakes, A. and D. Pollard (1989) Simulation and the Asymptotics of Optimization Estimators,
Econometrica, 57, 1027-1058.
Pereira, B. de B. (1977a) A Note on the Consistency and on the Finite Sample Comparisons of Some
Tests of Separate Families of Hypotheses, Biometrika, 64, 109-113.
Pereira, B. de B. (1977b) Discriminating Among Separate Models: A Bibliography, International
Statistical Review, 45, 163-172.
Pesaran, M.H. (1974) On the General Problem of Model Selection, Review of Economic Studies, 41,
153-171.
Pesaran, M.H. (1981) Pitfalls of Testing Non-nested Hypotheses by the Lagrange Multiplier Method,
Journal of Econometrics, 17, 323-331.
Pesaran, M.H. (1982a) On the Comprehensive Method of Testing Non-nested Regression Models,
Journal of Econometrics, 18, 263-274.
Pesaran, M.H. (1982b) Comparison of Local Power of Alternative Tests of Non-nested Regression
Models, Econometrica, 50, 1287-1305.
Pesaran, M.H. (1984) Asymptotic Power Comparisons of Tests of Separate Parametric Families by
Bahadur's Approach, Biometrika, 71, 245-252.
Pesaran, M.H. (1987) Global and Partial Non-nested Hypotheses and Asymptotic Local Power,
Econometric Theory, 3, 69-97.
Pesaran, M.H. and A.S. Deaton (1978) Testing Non-nested Nonlinear Regression Models, Econometrica,
46, 677-694.
Pesaran, H. and B. Pesaran (1989) Simulation Approach to the Problem of Computing Cox's Statistic
for Testing Non-nested Models, paper presented at the European Meeting of the
Econometric Society.
Pitman, E.J.G. (1948) Non-Parametric Statistical Inference, University of North Carolina, Institute of
Statistics, Mimeographed Lecture Notes.
Poirier, D.J. and P.A. Ruud (1979) A Simple Lagrange Multiplier Test for Lognormal Regression,
Economics Letters, 4, 251-255.
Quandt, R.E. (1974) A Comparison of Methods for Testing Non-nested Hypotheses, Review of
Economics and Statistics, 56, 92-99.
Ramsey, J.B. (1974) Classical Model Selection Through Specification Error Tests, in: P. Zarembka,
ed., Frontiers of Econometrics, Academic Press: New York, 13-47.
Rossi, P.E. (1985) Comparison of Alternative Functional Forms in Production, Journal of Econometrics,
30, 345-361.
Sargan, J.D. (1964) Wages and Prices in the United Kingdom: A Study in Econometric Methodology,
in: P.E. Hart, G. Mills and J.K. Whitaker, eds., Econometric Analysis for National Economic Planning,
London: Butterworths, 25-63.

Sawyer, K.R. (1980) The Theory of Econometric Model Selection, Ph.D. dissertation, Australian
National University.
Sawyer, K.R. (1983) Testing Separate Families of Hypotheses: An Information Criterion, Journal of
the Royal Statistical Society, Series B, 45, 89-99.
Smith, M.A. and G.S. Maddala (1983) Multiple Model Testing for Non-nested Heteroscedastic
Censored Regression Models, Journal of Econometrics, 21, 71-81.
Smith, R.J. (1992) Non-nested Tests for Competing Models Estimated by Generalized Method of
Moments, Econometrica, 60, 973-980.
Szroeter, J. (1989) Efficient Tests of Non-nested Hypotheses, University College, London.
Vuong, Q.H. (1989) Likelihood Ratio Tests for Model Selection and Non-nested Hypotheses,
Econometrica, 57, 307-333.
Walker, A.M. (1967) Some Tests of Separate Families of Hypotheses in Time Series Analysis,
Biometrika, 54, 39-68.
White, H. (1982a) Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50, 1-26.
White, H. (1982b) Regularity Conditions for Coxs Test of Non-nested Hypotheses, Journal of
Econometrics, 19, 301-318.
White, H. and I. Domowitz (1984) Nonlinear Regression with Dependent Observations, Econometrica,
52, 143-162.
Wooldridge, J.M. (1990) An Encompassing Approach to Conditional Mean Tests with Applications to
Testing Non-nested Hypotheses, Journal of Econometrics, 45, 331-350.
Zabel, J.E. (1992) A Comparison of Non-nested Tests for Misspecified Models Using the Method of
Approximate Slopes, Journal of Econometrics, forthcoming.
Zarembka, P. (1974) Transformation of Variables in Econometrics, in: P. Zarembka, ed., Frontiers in
Econometrics, New York: Academic Press.

7.4. Estimating the asymptotic variance 2697

7.5. Asymptotic efficiency 2699

7.6. Testing 2700

Part III. The globally nonstationary, weakly dependent case 2701

8. General results 2701


8.1. Introduction 2701
8.2. Asymptotic normality of an abstract estimator 2702
9. Asymptotic normality of M-estimators 2706
9.1. Asymptotic normality 2706
9.2. Estimating the asymptotic variance 2710

Part IV. The nonergodic case 2710

10. General results 2710


10.1. Introduction 2710
10.2. Abstract limiting distribution result 2711
11. Some results for linear models 2713
12. Applications to nonlinear models 2723
Appendix 2725
References 2733

Abstract

This chapter provides an overview of asymptotic results available for parametric
estimators in dynamic models. Three cases are treated: stationary (or essentially
stationary) weakly dependent data, weakly dependent data containing deterministic
trends, and nonergodic data (or data with stochastic trends). Estimation of
asymptotic covariance matrices and computation of the major test statistics are
covered. Examples include multivariate least squares estimation of a dynamic
conditional mean, quasi-maximum likelihood estimation of a jointly parameterized
conditional mean and conditional variance, and generalized method of moments
estimation of orthogonality conditions. Some results for linear models with
integrated variables are provided, as are some abstract limiting distribution results
for nonlinear models with trending data.

Part I. Introduction and overview

1. Introduction

This chapter discusses estimation and inference in time series contexts. For the
most part, estimation techniques that are suitable for cross section applications -
see Newey and McFadden (this Handbook) - are either directly applicable or
applicable after slight modification to time series problems. Just a few examples
include least squares, maximum likelihood and method of moments estimation.
Complications in the analysis arise due to the dependence and possible trends in
time series data.
Part II of this chapter covers estimation and inference for the essentially
stationary, weakly dependent case. This material comprises the bulk of the chapter
and is also the case covered in most of the econometrics literature. The work of
Bierens (1981, 1982), Domowitz and White (1982), White (1984), Bates and White
(1985), Gallant (1987), Gallant and White (1988) and Pötscher and Prucha (1991a, b)
contains various catalogues of assumptions that can be used in various estimation
settings. Part II synthesizes and extends some of these results, but our emphasis is
somewhat different from the earlier work. While we state some formal results with
regularity conditions, our focus is on the assumptions that impact on how one
performs inference. These assumptions often involve conditional moments and are
therefore straightforward to interpret.
The approach in Part II of this chapter is most similar to the book by White
(1993). White analyzes quasi-maximum likelihood estimation for heterogeneous (but
essentially stationary), weakly dependent processes under possible model misspeci-
fication. His results are very general and technically sophisticated. Here, by

restricting ourselves to models where the primary feature of interest is correctly
specified, and by focusing on weak rather than strong consistency of the estimators,
we obtain results with simple regularity conditions that are nevertheless applicable
in a variety of contexts. We also make some further simplifying assumptions, such
as assuming that moment matrices settle down to some limit. The hope is that,
after seeing a stripped-down analysis that de-emphasizes the role of regularity
conditions, the reader can then tackle the more advanced treatments referenced
above. At the same time the inference procedures offered here are fairly general.
When the data are trending the standard uniform law of large numbers approach
cannot be used to establish consistency and to find the limiting distribution of
optimization estimators. Nevertheless, if the data are weakly dependent one
generally expects the estimation techniques useful in the essentially stationary case
to still have good properties in the trending, weakly dependent case. Part III of
this chapter draws on the work of Crowder (1976), Heijmans and Magnus (1986),
Wooldridge (1986) and others to establish the consistency and asymptotic normality
(when properly scaled) of a general class of optimization estimators for globally
nonstationary processes that are weakly dependent. These results can be applied
to M-estimation and method of moments estimation. An important consequence
of Part III is that the common practice of performing inference in trending, weakly
dependent contexts exactly as if the stochastic process is essentially stationary and
weakly dependent is justified quite generally.
The last part of this chapter, Part IV, covers limiting distribution results when
the process (or at least the score of the objective function) is not weakly dependent.
The case when at least some elements of the underlying stochastic process are
integrated of order one, or I(1), is of particular interest and has received a lot of
attention recently [just a few references include Phillips (1986, 1987, 1988), Phillips
and Durlauf (1986), Park and Phillips (1988, 1989), Phillips and Hansen (1990)
and Sims et al. (1990)]. Most of the specific work on nonergodic processes has been
in the context of linear models. Some abstract results are available for nonlinear
models, for example Basawa and Scott (1983), Domowitz (1985), Wooldridge (1986),
Jeganathan (1988) and Domowitz and Muus (1988). In Part IV we present a
modification of a result in Wooldridge (1986) that applies immediately to linear
models; we also give an example of a nonlinear application with nonergodic
data.
A few remaining features of this chapter are worth drawing attention to at this
point. First, we do not discuss the interpretation of the parameters in dynamic
models beyond assuming that they index a conditional distribution, conditional
expectation or conditional variance. Models that are expressed in terms of under-
lying innovations are most easily handled by expressing them in terms of a condi-
tional expectation or conditional distribution in observable variables.
Second, although the subsequent results apply to linear models, most of the
conditions are explicitly set out for nonlinear models (the exception is Section 11).
Unsurprisingly, these are more restrictive than the conditions needed to analyze

linear models. While it is possible to relax the assumptions for application to linear
models, we do not do that here.
Finally, a warning about the notation. We have tried to limit conflicts, but some
are unavoidable. It is best to view the notation as being local in nature: the
same symbol can be used to represent different quantities in different sections.
Hopefully this will not cause confusion.

2. Examples of stochastic processes

The classification of the results in Parts II, III, and IV relies heavily on the notions
of essential stationarity and weak dependence. It would take us too far afield to
define and analyze the many kinds of dependence concepts (such as various mixing
and near epoch dependence conditions) that have been used recently in the time
series econometrics literature for generally heterogeneous processes. For the
purposes of estimation and inference, what is most important are the implications
of these concepts for limiting distribution theory, and this section provides an
informal discussion primarily from this perspective. As we see below, this turns
out to be imperfect; nevertheless, it strips a complicated literature down to its
essentials for the asymptotic analysis of estimators in time series settings. Formal
definitions of the types of stochastic processes discussed below are provided in
Rosenblatt (1978), Hall and Heyde (1980), Gallant and White (1988), Andrews
(1988) and Pötscher and Prucha (1991a, b).
Let $\{x_t: t = 1,2,\ldots\}$ be a scalar stochastic process defined on the positive integers
[for definitions of a stochastic process and other basic time series concepts, see
Brillinger (1981)]. For our purposes the idea that $\{x_t\}$ is essentially stationary is
best captured by the (minimal) assumption that $E(x_t^2)$ is uniformly bounded. An
immediate implication of essential stationarity is that the variance of the partial
sum,

$$\sigma_T^2 = \operatorname{Var}\left(\sum_{t=1}^T x_t\right), \qquad (2.1)$$

is well-defined. If, in addition,

$$\sigma_T^2 = O(T), \qquad (2.2)$$

$$\sigma_T^{-2} = O(T^{-1}), \qquad (2.3)$$

and

$$\sigma_T^{-1}\sum_{t=1}^T \big(x_t - E(x_t)\big) \xrightarrow{d} \operatorname{Normal}(0, 1), \qquad (2.4)$$

then we say that $\{x_t\}$ is weakly dependent. Condition (2.2) implies that the variance
of the partial sum is bounded above by a multiple of $T$; it rules out highly dependent
processes with positive autocorrelations that do not die out to zero sufficiently
quickly. Condition (2.3) implies that the variance of the partial sum is bounded
below by a (positive) multiple of $T$; among other things it rules out processes with
strong negative serial correlation. Condition (2.4) states that $\{x_t\}$ satisfies the
central limit theorem (CLT).
These definitions of essential stationarity and weak dependence are not without
their glitches. First, there are many strictly stationary sequences that have an
infinite second moment; by the above convention, such processes are not essentially
stationary. Actually, defining essential stationarity to rule out these cases serves
a purpose because stationary processes with an infinite second moment do not
satisfy the CLT. Because we do not deal with such applications in this chapter it
seems easiest to exclude them in the definition of essential stationarity.
Second, there are processes exhibiting very little temporal dependence that nevertheless
violate (2.3). The leading example is $x_t \equiv e_t - e_{t-1}$, where $\{e_t: t = 0,1,2,\ldots\}$
is an i.i.d. sequence with finite second moment. Then $\sigma_T^2/T \to 0$ even though $x_t$ and
$x_{t+j}$ are independent for $j \geq 2$. The problem again is that such a sequence does
not satisfy the CLT, so we rule it out in the definition of weak dependence.
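A quick simulation confirms the arithmetic behind this example: the partial sum telescopes to $e_T - e_0$, so its variance is $2\sigma^2$ for every $T$ and $\sigma_T^2/T \to 0$. (The sketch below is ours, not the chapter's.)

```python
# Check that x_t = e_t - e_{t-1} violates (2.3): Var(sum) = 2*sigma^2 for all T.
import numpy as np

rng = np.random.default_rng(0)
for T in (10, 100, 1000):
    reps = 20000
    e = rng.standard_normal((reps, T + 1))
    partial_sum = (e[:, 1:] - e[:, :-1]).sum(axis=1)   # telescopes to e_T - e_0
    print(T, partial_sum.var() / T)                    # ~ 2/T, shrinking to 0
```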
One might argue that, as long as we are assuming essential stationarity - which
we sometimes refer to as local nonstationarity or bounded heterogeneity - we might
as well simplify things further and restrict attention to the strictly stationary case.
This argument has some merit because the inference procedures are identical
whether we assume strict stationarity or allow for bounded heterogeneity, at least
for correctly specified models. Nevertheless, it is important to know that asymptotic
results are available for the heterogeneous case so we can handle processes with
deterministic seasonality, structural breaks, and other forms of temporal hetero-
geneity. In addition, as will be seen in Section 4, even if the underlying stochastic
process is strictly stationary the sequence of objective functions defining the
estimation problem need not be.
As we will see in Section 4.4, it is the gradient or score of the objective function
evaluated at the true parameters that should satisfy the CLT in applications
with essentially stationary, weakly dependent data. For many problems this follows
from the essential stationarity and weak dependence of the underlying process,
along with some additional moment conditions. Still, it is not always true that the
score is weakly dependent if the underlying stochastic process is, especially when
the objective function depends on a growing number of lags of the data. A simple
example is nonlinear least squares (NLS) estimation of an MA(l) model (moving
average of order one) when a white noise series has been overdifferenced: even
though the first difference of a white noise series is stationary and weakly dependent,
the score of the NLS objective function evaluated at the true parameter is not.
See, for example, Quah and Wooldridge (1988). The other case is also possible:
the underlying stochastic process could have an infinite second moment, or be
strongly dependent, but the score of the objective function could be essentially
stationary and weakly dependent.
The point of the previous paragraph is that the terms essentially stationary
and weak dependence really apply to a particular function of the underlying
process, namely the score of the objective function. For many applications this
distinction is irrelevant. But, when in doubt, it is the score that should be studied
for weak dependence properties.
There has been much work on establishing primitive conditions under which
stochastic processes satisfy the CLT. Most of the early work on limiting distribution
theory - which focused on maximum likelihood estimation - relied heavily on the
central limit theorem for martingale difference sequences. This is because the score
of the conditional log-likelihood is a martingale difference sequence under correct
dynamic specification (more on this in Section 5). Roussas (1972) analyzed the MLE
for strictly stationary, ergodic data and employed the CLT for strictly stationary
martingale differences. McLeish (1974) proved CLTs for martingale difference
sequences that are not strictly stationary; see also Hall and Heyde (1980). These
results allowed for substantial heterogeneity in the underlying stochastic process,
and they were used in the work of Bhat (1974), Basawa et al. (1976) and Crowder
(1976).
Recent work in the econometrics literature covers a broader class of estimators
and allows for dynamic misspecification. For many problems with misspecified
dynamics the martingale CLT cannot be applied. Thus, the econometric work on
limit theory for estimation with essentially stationary, weakly dependent processes
has relied on various mixing conditions available in the theoretical time series
literature. Under certain moment and mixing assumptions the process $\{x_t\}$ satisfies
the central limit theorem. In the strictly stationary case, Rosenblatt (1956) proves
a CLT for $\alpha$-mixing (strong mixing) sequences and Billingsley (1968) proves results
for $\phi$-mixing (uniform mixing) sequences and functions of $\phi$-mixing sequences; see
also Rosenblatt (1978) and Hall and Heyde (1980). McLeish (1975) extended
Billingsley's results to allow for bounded heterogeneity. Wooldridge and White
(1989) and Davidson (1992) have proven CLTs for near epoch dependent (NED)
functions of underlying mixing sequences. Among other things this allows for
infinite moving averages in an underlying mixing sequence.
It might be helpful at this point to give an example of a weakly dependent
process. Let $\{e_t: t \in \mathbb{Z}\}$ be an independent, identically distributed (i.i.d.) sequence
with $\sigma_e^2 = E(e_t^2) < \infty$ and $E(e_t) = 0$. Let $\{\phi_j: j = 0,1,2,\ldots\}$ be a sequence of real constants
such that

$$\sum_{j=0}^\infty |\phi_j| < \infty. \qquad (2.5)$$

Then we can define a process $\{x_t: t = 1,2,\ldots\}$ by

$$x_t = \sum_{j=0}^\infty \phi_j e_{t-j}, \qquad t = 1,2,\ldots \qquad (2.6)$$
(in the sense that $\sum_{j=0}^\infty \phi_j e_{t-j}$ exists almost surely). Provided that $\sum_{j=0}^\infty \phi_j \neq 0$, (2.3)
holds and it follows by Anderson (1971, Theorem 7.7.8) that

$$\sigma_T^{-1}\sum_{t=1}^T x_t \xrightarrow{d} \operatorname{Normal}(0, 1), \qquad (2.7)$$

where

$$\sigma_T^2 = \operatorname{Var}\left(\sum_{t=1}^T x_t\right). \qquad (2.8)$$

[See Hall and Heyde (1980, Corollary 5.2) for a weaker set of conditions.] The
summability condition on $\{\phi_j: j = 0,1,2,\ldots\}$ ensures (2.2); it allows for much more
dependence than a stable autoregressive moving average (ARMA) process with
i.i.d. innovations, but (2.5) does imply that $\phi_j \to 0$ as $j \to \infty$ at a sufficiently fast rate.
We can easily allow for bounded heterogeneity by changing the assumption
about the underlying sequence $\{e_t\}$. Now assume that the $e_t$ are independent, non-identically
distributed, or i.n.i.d., with $E(|e_t|^{2+\delta})$ bounded for some $\delta > 0$. Then
Fuller (1976, Theorem 6.3.4) implies that (2.4) holds. This covers heterogeneous
ARMA models with independent innovations. Also, if we allow $x_t$ to have a
time-varying mean $\mu_t$ then $\{x_t - \mu_t\}$ satisfies the CLT.
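A small simulation sketch of the CLT (2.7) for the linear process (2.6), with the illustrative (not from the chapter) choice $\phi_j = 0.5^j$ truncated at $J$ lags so the sum is computable:

```python
# Simulate x_t = sum_{j<=J} phi_j e_{t-j} and check approximate normality of the
# studentized partial sums, as in (2.7). Truncation at J lags is an approximation.
import numpy as np

rng = np.random.default_rng(0)
T, J, reps = 500, 50, 5000
phi = 0.5 ** np.arange(J + 1)

e = rng.standard_normal((reps, T + J))
x = np.zeros((reps, T))
for j in range(J + 1):
    x += phi[j] * e[:, J - j : J - j + T]   # aligns e_{t-j} with x_t

s = x.sum(axis=1)                           # partial sums over t = 1,...,T
z = s / s.std()                             # studentize by the simulated sigma_T
print(np.mean(np.abs(z) > 1.96))            # ~ 0.05 if (2.7) holds approximately
```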
When we relax the requirement that $E(x_t^2)$ is uniformly bounded we arrive at
the notion of a globally nonstationary process. Even though such processes are
growing or shrinking over time, it is entirely possible for them to satisfy the CLT
(2.4). [Condition (2.2) no longer holds, but this is not a problem.] As a simple
example of a globally nonstationary but weakly dependent process, define

$$x_t = t u_t, \qquad t = 1,2,\ldots \qquad (2.9)$$

where $\{u_t: t = 1,2,\ldots\}$ is a weakly dependent series with $E(u_t^2)$ uniformly bounded
and $E(u_t) = 0$, $t = 1,2,\ldots$ (for example, $\{u_t\}$ could be i.i.d.). Note that $E(x_t^2) = O(t^2)$
and $\sigma_T^2 = O(T^3)$. Nevertheless, under general conditions, (2.4) holds. [See, for
example, Wooldridge and White (1989) and Davidson (1992).]
There are several examples of processes, including ones that are strictly stationary
and ergodic, that are not weakly dependent. Robinson (1991b) calls such processes
strongly dependent. A general class of strongly dependent processes is given by
(2.6) where the coefficients $\{\phi_j\}$ are only square summable:

$$\sum_{j=0}^\infty \phi_j^2 < \infty. \qquad (2.10)$$

Even though such a process is covariance stationary, without further restrictions
on $\{\phi_j\}$ the variance of the partial sum can be of order larger than $T$, so (2.2) does
not hold. Examples are the long memory or fractionally integrated processes with
degree of integration between zero and one half; see, for example, Brockwell and
Davis (1991). Little is known about the asymptotic distribution of estimators from
general nonlinear problems when the underlying sequence is strongly dependent
[for some recent results for a simple model, see Sowell (1988)]. The results in Part
II or Part IV may be applicable, but this remains an important topic for future
research.
The term nonergodic is reserved for those processes that exhibit such strong
dependence that they do not satisfy the law of large numbers. A popular example of
a nonergodic process is

$$x_t = x_{t-1} + e_t, \qquad t = 1,2,\ldots, \qquad (2.11)$$

where $\{e_t: t = 1,2,\ldots\}$ is an i.i.d. sequence and $x_0$ is a given random variable. For
illustration, assume that $E(e_t^2) < \infty$ and $E(e_t) = 0$. Even under these assumptions
the first moment of $x_t$ need not exist. If we add $E(x_0^2) < \infty$ and $x_0$ uncorrelated
with all $e_t$, it is easy to see that $\operatorname{Var}(x_t) = O(t)$. Also, $E(x_t) = E(x_0)$ for all $t$, so the
mean of $x_t$ is constant over time when it exists. Still, the process $\{x_t\}$ does not
return to its mean with any regularity (it is nonergodic), and the sample average
$\bar{x}_T$ will not converge in probability or in any other meaningful sense to $E(x_t)$.
The work of Phillips (1986, 1987) has sparked a recent interest in asymptotic
theory with general integrated processes, of which (2.11) is a special case. A general
integrated of order one, or I(1), process can be written as $x_t = \alpha + x_{t-1} + u_t$, where
$\{u_t\}$ is an essentially stationary, weakly dependent zero mean process with

$$\lim_{T\to\infty} \operatorname{Var}\left(T^{-1/2}\sum_{t=1}^T u_t\right) > 0. \qquad (2.12)$$

[Condition (2.12) ensures that the process $\{\Delta x_t \equiv x_t - x_{t-1}\}$ has not been over-differenced.]
When $\alpha \neq 0$ the process is said to be I(1) with drift, otherwise it is
I(1) without drift.
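The nonergodicity of the random walk (2.11) is easy to see in a simulation: across replications, the variance of the sample average $\bar{x}_T$ grows with $T$ (roughly like $T/3$ for i.i.d. standard normal increments with $x_0 = 0$, an illustrative choice) instead of shrinking to zero.

```python
# Illustration (ours, not the chapter's) of nonergodicity of the random walk (2.11).
import numpy as np

rng = np.random.default_rng(0)
for T in (100, 1000, 10000):
    reps = 2000
    x = rng.standard_normal((reps, T)).cumsum(axis=1)  # x_t = x_{t-1} + e_t, x_0 = 0
    xbar = x.mean(axis=1)                              # sample average per replication
    print(T, xbar.var())                               # grows ~ T/3, no convergence
```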
Before turning to Part II we should emphasize that the partitioning of the results
into Parts II, III, and IV is determined by the limiting distribution theory. In
particular, a separate consistency result is given in Part II that does not require
the process or any function of it to be weakly dependent. In the strictly stationary
case only ergodicity of the underlying process and a moment condition are needed
for consistency, so that it applies to strongly dependent processes. However, the
asymptotic normality results rely on weak dependence of the score. Parts III and
IV do not contain separate consistency results; consistency is proven along with
the limiting distribution result.

3. Types of estimation techniques

The approach to estimation in this chapter is what Goldberger (1968) and Manski
(1988) have termed the analogy principle. To apply the analogy principle one must
know the population problem that the parameters of interest solve in order to con-
struct a sample counterpart. This is where basic probability theory, and especially
properties of conditional expectations and conditional distributions, play a key
role. Often population parameters can be shown to solve a minimization or maxi-
mization problem, which then leads to the class of optimization estimators dis-
cussed in this chapter. To show how the analogy principle is applied, we consider
the example of nonlinear least squares estimation.
Suppose that $\{(x_t, y_t): t = 1,2,\ldots\}$ is a stochastic process, where $y_t$ is a scalar and
$x_t \in \mathcal{X}_t$ is a vector whose dimension may depend on $t$. Allowing the number of conditioning
variables $x_t$ to grow with $t$ allows for cases such as $x_t = (y_{t-1}, y_{t-2}, \ldots, y_1)$
or $x_t = (z_t, y_{t-1}, z_{t-1}, \ldots, y_1, z_1)$, where $z_t$ is a $1 \times J$ vector of conditioning variables,
as well as for static regression models with $x_t = z_t$. Suppose that $E|y_t| < \infty$ for all
$t$ and that interest lies in the conditional expectation $E(y_t|x_t)$. A parametric model
of this conditional expectation is $\{m_t(x_t, \theta): x_t \in \mathcal{X}_t,\ \theta \in \Theta \subset \mathbb{R}^P\}$ ($\Theta$ is the parameter
set). The model is correctly specified if for some $\theta_0 \in \Theta$

$$E(y_t|x_t) = m_t(x_t, \theta_0), \qquad t = 1,2,\ldots. \qquad (3.1)$$

To see how to estimate $\theta_0$, we rely on a well-known fact from probability theory.
Namely, if $E(y_t^2) < \infty$, then $\mu_t(x_t) \equiv E(y_t|x_t)$ is the best mean square error predictor
of $y_t$. In other words, for any other function $g_t(x_t)$ such that $E[g_t(x_t)^2] < \infty$,

$$E[(y_t - \mu_t(x_t))^2] \leq E[(y_t - g_t(x_t))^2]. \qquad (3.2)$$

It follows that if the parametric model $m_t(x_t, \theta)$ is correctly specified then

$$E[(y_t - m_t(x_t, \theta_0))^2] \leq E[(y_t - m_t(x_t, \theta))^2] \qquad (3.3)$$

for all $\theta \in \Theta$. This suggests estimating $\theta_0$ by solving the sample problem

$$\min_{\theta\in\Theta}\ T^{-1}\sum_{t=1}^T (y_t - m_t(x_t, \theta))^2, \qquad (3.4)$$

which leads to the nonlinear least squares estimator.
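A minimal sketch of the analogy principle for NLS, with a hypothetical exponential regression function (the model and parameter values are ours, not the chapter's): the population minimization (3.3) is replaced by its sample counterpart (3.4).

```python
# NLS via the analogy principle: minimize the sample analog of the population MSE.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T = 1000
z = rng.standard_normal(T)
theta0 = np.array([1.0, 0.5])
y = np.exp(theta0[0] + theta0[1] * z) + rng.standard_normal(T)  # E(y|z)=exp(a+b*z)

def ssr(theta):
    # sample analog of E[(y_t - m_t(x_t, theta))^2], as in (3.4)
    return np.mean((y - np.exp(theta[0] + theta[1] * z)) ** 2)

theta_hat = minimize(ssr, x0=np.zeros(2), method="Nelder-Mead").x
print(theta_hat)   # close to theta0 in large samples
```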


Closely tied to the analogy principle is the concept of Fisher consistency. An
estimation procedure is Fisher consistent if the parameters of interest solve the
population analog of the estimation problem. Inequality (3.3) shows that least
squares is Fisher consistent for estimating the parameters of a conditional mean.
Using the Kullback-Leibler information inequality, we show in Section 5 that

maximum likelihood is Fisher consistent for estimating the parameters of a
correctly specified conditional density, regardless of what the conditioning variables
$x_t$ are.
Many other estimation procedures, including multivariate weighted nonlinear
least squares, least absolute deviations, quasi-maximum likelihood, and generalized
method of moments are all Fisher consistent for certain features of conditional or
unconditional distributions. We cover several of these examples in Sections 5, 6
and 7.

Part II. The essentially stationary, weakly dependent case

4. Asymptotic properties of M-estimators

4.1. Introduction

In this section we study the consistency and asymptotic normality of a class of
estimators known as M-estimators (which stands for maximum likelihood-like
estimators), a term introduced by Huber (1967) in the context of i.i.d. observations.
The class of M-estimators includes the maximum likelihood estimator, the quasi-
maximum likelihood estimator, multivariate nonlinear least squares and many
other estimators used by econometricians. The terminology adopted here is not
universal. Pötscher and Prucha (1991a, b) refer to a more general class of optimization
estimators as M-estimators. Burguete et al. (1982) and Pötscher and Prucha
(1991a, b) call the estimators studied in this section least mean distance estimators.
The parameter space $\Theta$ is a subset of $\mathbb{R}^P$ and $\theta$ denotes a generic $P \times 1$ vector
contained in $\Theta$. We have a sequence of random variables $\{w_t: t = 1,2,\ldots\}$. Denote
the range of $w_t$ by $\mathcal{W}_t$, where $\mathcal{W}_t$ is a subset of a finite dimensional Euclidean
space whose dimension may depend on $t$.
The objective function for M-estimation is a sample average:

$$T^{-1}\sum_{t=1}^T q_t(w_t, \theta), \qquad (4.1)$$

where $q_t: \mathcal{W}_t \times \Theta \to \mathbb{R}$. There are a few different situations that warrant special
attention. The first is when $\{w_t: t = 1,2,\ldots\}$ is a sequence of strictly stationary
(hereafter, simply stationary) $M \times 1$ random vectors - whereby $\mathcal{W}_t$ can be taken
to be a subset $\mathcal{W}$ of $\mathbb{R}^M$ for all $t = 1,2,\ldots$ - and there exists a time-invariant
function $q: \mathcal{W} \times \Theta \to \mathbb{R}$ such that $q_t(w_t, \theta) = q(w_t, \theta)$. Then each summand in (4.1)
depends on $t$ only through the observation $w_t$. An important consequence of this
setup is that $\{q(w_t, \theta)\}$ is stationary for each $\theta \in \Theta$, and this facilitates application
of laws of large numbers and central limit theorems.

Another case of interest, which requires notably more technical work, is when
the dimension of $w_t$ grows with $t$. This can happen when one is interested in getting
the dynamics of a model for a conditional mean or a conditional distribution correctly
specified. For example, suppose that for a scalar sequence $\{y_t\}$ and a vector sequence
$\{z_t \in \mathbb{R}^K\}$ one is interested in $E(y_t|z_t, y_{t-1}, z_{t-1}, \ldots, y_1, z_1)$. Let $m_t(x_t, \theta)$ be a model
for this conditional expectation, where $x_t = (z_t, y_{t-1}, z_{t-1}, \ldots, y_1, z_1)$. If $E(y_t|x_t)$
depends on all past lags of $y$ and $z$ then the model $m_t(x_t, \theta)$ should reflect this.
Thus, letting $w_t = (y_t, z_t, y_{t-1}, \ldots, y_1, z_1)$, nonlinear least squares estimation of $\theta_0$
such that $E(y_t|x_t) = m_t(x_t, \theta_0)$ would take $q_t(w_t, \theta) = (y_t - m_t(x_t, \theta))^2$. Note that even
if $\{(y_t, z_t)\}$ is stationary, $\{m_t(x_t, \theta)\}$ is not if $E(y_t|z_t, y_{t-1}, z_{t-1}, \ldots)$ depends on all
past lags - such as in finite order moving average models. If $E(y_t|z_t, y_{t-1}, z_{t-1}, \ldots)$
depends on a finite number of lags of $y$ and $z$ - such as in finite order autoregressive
models - then we are essentially in the stationary case described above.
Heterogeneity in $\{q_t(w_t, \theta)\}$ can also arise when interest lies in a model relating an
observable sequence to an unobservable sequence. For example, let $\{e_t: t = 0,1,2,\ldots\}$
be an i.i.d. sequence with $E(e_t^2) < \infty$ and $E(e_t) = 0$, and consider an MA(1) model
for observable $y_t$:

$$y_t = e_t + \theta_0 e_{t-1}, \qquad t = 1,2,\ldots \qquad (4.2)$$

where $|\theta_0| < 1$. One can study estimation of $\theta_0$ in the previous framework by finding
the regression function $E(y_t|y_{t-1},\ldots,y_1)$ (a tractable calculation in this simple
example). In practice one often sees a different approach used: set the time zero
residual equal to zero and then build up the residual function recursively. This
leads to the pseudo-regression function

$$m_t(x_t, \theta) \equiv m_t(y_{t-1}, y_{t-2}, \ldots, y_1, \theta) = -\sum_{j=1}^{t-1}(-\theta)^j y_{t-j}. \qquad (4.3)$$

This is not a true regression function because $m_t(x_t, \theta_0) \neq E(y_t|x_t) = E(y_t|y_{t-1},\ldots,y_1)$.
Nevertheless, because of the invertibility assumption $|\theta_0| < 1$, $E|E(y_t|y_{t-1},\ldots,y_1) -
m_t(y_{t-1},\ldots,y_1,\theta_0)| \to 0$ as $t \to \infty$ at the rate $|\theta_0|^t$. This is enough to consistently
estimate $\theta_0$ by nonlinear least squares. Once again the technical complications
arise because, even though the observable data are stationary, the sequence of
summands in the objective function is not.
Yet a different approach that avoids both approximation arguments and complicated
expectations calculations is to ensure that (4.3) is a true regression function
by changing the assumption about how $y_t$ is generated. If we assume that $e_0 \equiv 0$,
then $E(y_t|y_{t-1},\ldots,y_1)$ is given by (4.3) with $\theta = \theta_0$. Now $\{y_t: t = 1,2,\ldots\}$ as well as
$\{q_t(w_t, \theta)\}$ is a heterogeneous sequence.
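A sketch of the recursive approach just described (the sample size and true value of $\theta_0$ are illustrative choices of ours): the residuals behind the pseudo-regression function (4.3) are built recursively from the startup value $e_0 = 0$, and the average squared residual is minimized.

```python
# NLS for the MA(1) model (4.2) using recursively built residuals (e_0 set to 0).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
T, theta0 = 2000, 0.4
e = rng.standard_normal(T + 1)
y = e[1:] + theta0 * e[:-1]                  # y_t = e_t + theta0 * e_{t-1}

def ssr(theta):
    resid = np.empty(T)
    prev = 0.0                               # startup residual e_0 := 0
    for t in range(T):
        resid[t] = y[t] - theta * prev       # e_t(theta) = y_t - theta*e_{t-1}(theta)
        prev = resid[t]
    return np.mean(resid ** 2)

print(minimize_scalar(ssr, bounds=(-0.99, 0.99), method="bounded").x)  # ~ theta0
```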
Of course heterogeneity in $\{q_t(w_t, \theta)\}$ also arises with fixed dimensional $w_t$ if
$\{w_t\}$ constitutes a heterogeneously distributed sequence, as in Domowitz and White
(1982), White and Domowitz (1984) and Bates and White (1985).

4.2. Consistency

It is convenient to begin with two definitions.

Definition 4.1

A sequence of random variables $\{z_t: t = 1,2,\ldots\}$ satisfies the weak law of large
numbers (WLLN) if
(i) $E[|z_t|] < \infty$, $t = 1,2,\ldots$;
(ii) $\lim_{T\to\infty} T^{-1}\sum_{t=1}^T E(z_t)$ exists;
(iii) $T^{-1}\sum_{t=1}^T \{z_t - E(z_t)\} \xrightarrow{p} 0$.
Condition (ii) of this definition is not needed for what follows, but it entails little
loss of generality and simplifies the statement of conditions.

Definition 4.2

Let $\Theta \subset \mathbb{R}^P$, let $\{w_t: t = 1,2,\ldots\}$ be a sequence of random vectors with $w_t \in \mathcal{W}_t$,
$t = 1,2,\ldots$, and let $\{q_t: \mathcal{W}_t \times \Theta \to \mathbb{R}, t = 1,2,\ldots\}$ be a sequence of real-valued
functions. Assume that
(i) $\Theta$ is compact;
(ii) $q_t$ satisfies the standard measurability and continuity conditions on $\mathcal{W}_t \times \Theta$,
$t = 1,2,\ldots$ (see Definition A.2 in the Appendix);
(iii) $E[|q_t(w_t, \theta)|] < \infty$ for all $\theta \in \Theta$, $t = 1,2,\ldots$;
(iv) $\lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[q_t(w_t, \theta)]$ exists for all $\theta \in \Theta$;
(v) $\max_{\theta\in\Theta} \left|T^{-1}\sum_{t=1}^T \{q_t(w_t, \theta) - E[q_t(w_t, \theta)]\}\right| \xrightarrow{p} 0$.
Then $\{q_t(w_t, \theta)\}$ is said to satisfy the uniform weak law of large numbers (UWLLN)
on $\Theta$.

When applied to vector or matrix functions this definition applies element by
element. Condition (iv) is an inconsequential but convenient simplification.
When $\{w_t \in \mathcal{W} \subset \mathbb{R}^M\}$ is stationary and ergodic, conditions sufficient for the
UWLLN to hold are relatively straightforward. The following result is due to
Ranga Rao (1962); see also Hansen (1982).

Theorem 4.1. UWLLN for the stationary ergodic case

Let $\Theta \subset \mathbb{R}^P$, let $\{w_t \in \mathcal{W}: t = 1,2,\ldots\}$ be a sequence of stationary and ergodic $M \times 1$
random vectors and let $q: \mathcal{W} \times \Theta \to \mathbb{R}$ be a real-valued function. Assume that
(i) $\Theta$ is compact;
(ii) $q$ satisfies the standard measurability and continuity requirements on
$\mathcal{W} \times \Theta$;
(iii) for some function $b: \mathcal{W} \to \mathbb{R}^+$ with $E[b(w_t)] < \infty$, $|q(w, \theta)| \leq b(w)$ for all
$\theta \in \Theta$.
Then $\{q(w_t, \theta)\}$ satisfies the UWLLN on $\Theta$.

The proof of Theorem 4.1 - which is very similar to the i.i.d. case (see Newey and
McFadden (Lemma 2.4)) - is driven by the fact that if $\{w_t\}$ is stationary and ergodic
then so is any time-invariant function of it. Note carefully that we have only
assumed ergodicity here; as discussed in Section 2 this allows for fairly strong
forms of dependence.
If we relax the stationarity assumption then the conditions for the UWLLN are
notably more complicated, especially since we are allowing the dimension of w,
to grow. Here we follow Andrews (1987) and impose some smoothness on the
objective function. Then, as in Newey (1991a), a pointwise WLLN can be turned
into a UWLLN. The following result is a corollary of Newey (1991a, Corollary 3.1).
A proof that does not rely on the notion of stochastic equicontinuity - see Andrews
(this Handbook) and Newey and McFadden (Section 2.8) - is given in the Appendix.

Theorem 4.2. UWLLN for the heterogeneous case

Let $\Theta$, $\{w_t: t = 1,2,\ldots\}$, and $\{q_t: \mathcal{W}_t \times \Theta \to \mathbb{R}: t = 1,2,\ldots\}$ be as in Definition 4.2.
Assume that
(i) $\Theta$ is compact;
(ii) $q_t$ satisfies the standard measurability and continuity requirements on
$\mathcal{W}_t \times \Theta$, $t = 1,2,\ldots$;
(iii) for each $\theta \in \Theta$, $\{q_t(w_t, \theta): t = 1,2,\ldots\}$ satisfies the WLLN;
(iv) there exists a function $c_t(w_t) \geq 0$ such that
(a) for all $\theta_1, \theta_2 \in \Theta$, $|q_t(w_t, \theta_1) - q_t(w_t, \theta_2)| \leq c_t(w_t)\|\theta_1 - \theta_2\|$;
(b) $\{c_t(w_t)\}$ satisfies the WLLN.
Then $\{q_t(w_t, \theta)\}$ satisfies the UWLLN on $\Theta$. (For proof see Appendix.)

If $q_t(w_t, \cdot)$ is continuously differentiable on an open, convex set $\mathcal{C}$ containing $\Theta$
then the natural choice for $c_t(w_t)$ is

$$c_t(w_t) = \sup_{\theta\in\mathcal{C}} \|\nabla_\theta q_t(w_t, \theta)\|, \qquad (4.4)$$

provided it satisfies the WLLN. To see why this choice satisfies (iv)(a), simply use
a mean value expansion of $q_t(w_t, \theta_1)$ about $\theta_2$.
Because most time series applications involve smooth objective functions, the
difficulty in applying Theorem 4.2 usually lies in verifying that $\{q_t(w_t, \theta)\}$ and
$\{c_t(w_t)\}$ satisfy the WLLN for any $\theta \in \Theta$. These WLLN requirements restrict the
dependence, analogous to the ergodicity assumption for the stationary case. If the
dimension of $w_t$ is fixed, and $\{w_t\}$ is an $\alpha$- or $\phi$-mixing process with mixing coefficients
declining at an appropriate rate, verification of (iii) and (iv)(b) is straightforward
because $q_t(w_t, \theta)$ inherits its mixing properties from $\{w_t\}$. See, for example,
Domowitz and White (1982), White (1984), White and Domowitz (1984) and
Pötscher and Prucha (1989, 1991a). McLeish (1975) introduced strong laws of
large numbers that can be applied when $q_t(w_t, \theta)$ depends on an increasing number
of lags of a mixing process. See Gallant and White (1988), Hansen (1991a), Pötscher
and Prucha (1991a) and White (1993) for further discussion of strong laws. Because
our focus is on weak consistency, the general WLLNs of Andrews (1988) are
especially relevant here; they can be used to verify (iii) and (iv)(b) under satisfyingly
weak assumptions, including conditions that allow for strongly dependent heterogeneous
processes (although when applied to the stationary case, the conditions
are always more restrictive than ergodicity).
Before stating a formal consistency result for M-estimators, it is useful to allow
for the presence of some estimated nuisance parameters. Let $\hat\gamma_T$ denote an $R \times 1$
vector estimator such that $\operatorname{plim} \hat\gamma_T = \gamma^*$ for some $\gamma^* \in \Gamma \subset \mathbb{R}^R$. The (two-step) M-estimator
$\hat\theta_T$ solves

$$\min_{\theta\in\Theta}\ T^{-1}\sum_{t=1}^T q_t(w_t, \theta; \hat\gamma_T), \qquad (4.5)$$

where $q_t$ is now defined on $\mathcal{W}_t \times \Theta \times \Gamma$.

Theorem 4.3. Weak consistency of M-estimators

Let $\Theta \subset \mathbb{R}^P$, $\Gamma \subset \mathbb{R}^R$, $\{w_t \in \mathcal{W}_t: t = 1,2,\ldots\}$ be a sequence of random vectors, and
let $\{q_t: \mathcal{W}_t \times \Theta \times \Gamma \to \mathbb{R}: t = 1,2,\ldots\}$ be the sequence of objective functions. Assume
that
M.1: (i) $\Theta$ and $\Gamma$ are compact;
(ii) $\hat\gamma_T \xrightarrow{p} \gamma^* \in \Gamma$;
(iii) $q_t$ satisfies the standard measurability and continuity requirements on
$\mathcal{W}_t \times \Theta \times \Gamma$, $t = 1,2,\ldots$.
M.2: $\{q_t(w_t, \theta; \gamma): t = 1,2,\ldots\}$ satisfies the UWLLN on $\Theta \times \Gamma$;
M.3: $\theta_0$ is the unique minimizer of

$$\bar q(\theta; \gamma^*) \equiv \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[q_t(w_t, \theta; \gamma^*)] \quad \text{on } \Theta.$$

Then a random vector $\hat\theta_T$ exists that solves (4.5) and $\hat\theta_T \xrightarrow{p} \theta_0$. (For proof see
Appendix.)

Under the assumptions of Theorem 4.3 it turns out that the limit function $\bar q(\theta; \gamma^*)$
is necessarily continuous on $\Theta$, so that it achieves its minimum on $\Theta$ by compactness.
In the stationary case without nuisance parameters, $\bar q(\theta) = E[q(w_t, \theta)]$ for all
$t$, so it suffices to concentrate on a single observation when verifying the identification
Assumption M.3. Even in the heterogeneous case, in applications $\theta_0$ minimizes
$E[q_t(w_t, \theta)]$ over $\Theta$ for each $t$ (more on this in Sections 5 and 6). Verifying that
$\theta_0$ is the unique minimizer of $\bar q$ in either the stationary or heterogeneous case often
requires knowing something about the distribution of the conditioning variables, and
so identification is often taken on faith unless there are reasons to believe it might
fail. Newey and McFadden (Section 2.2) give three examples of how to verify
identification in examples with identically distributed data.
There are situations - such as the stationary MA(1) example when the startup
value is set to zero - where $\theta_0$ will not minimize $E[q_t(w_t, \theta)]$ on $\Theta$ for any $t$. Nevertheless,
especially in cases where the error of approximation dies off at an exponential
rate, it can usually be verified that $\theta_0$ minimizes $\lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[q_t(w_t, \theta)]$. This
is true for the invertible MA(1) example. Bierens (1981, 1982) provides several
illustrations.
In some cases with nuisance parameters the identification condition holds only
for a particular value of $\gamma$, say $\gamma = \gamma_0 = \gamma^*$, where $\gamma_0$ also indexes some feature of
the distribution of $w_t$; this is generally the case for two-step maximum likelihood
procedures under correct specification of the conditional density. In other cases
the identification condition holds for any $\gamma \in \Gamma$, and so the preliminary estimator
$\hat\gamma_T$ could come from a misspecified estimation problem. For example, as we will
see later, $\hat\gamma_T$ could be parameter estimates in a misspecified conditional variance
function in the context of weighted nonlinear least squares. The misspecification
of the variance function does not prevent consistent estimation of the conditional
mean parameters $\theta_0$, and so we expect the identification condition for the conditional
mean parameters to hold for an arbitrary element of $\Gamma$.
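To illustrate, here is a two-step sketch (our construction, not an example from the chapter): a deliberately crude exponential variance model supplies the first-step nuisance estimate $\hat\gamma_T$, and the second step minimizes the weighted NLS objective. Even though the variance model is misspecified, the second step remains consistent for the conditional mean parameters.

```python
# Two-step weighted NLS with a (misspecified) first-step variance model.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
T = 2000
z = rng.standard_normal(T)
theta0 = np.array([1.0, -0.5])
y = theta0[0] + theta0[1] * z + np.sqrt(0.5 + z**2) * rng.standard_normal(T)

# Step 1: crude variance model v(z; gamma) = exp(g0 + g1*z), fit by regressing
# log squared OLS residuals on (1, z); this model is wrong by construction.
X = np.column_stack([np.ones(T), z])
u2 = (y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2
gamma_hat = np.linalg.lstsq(X, np.log(u2 + 1e-12), rcond=None)[0]
w = np.exp(-(X @ gamma_hat))                 # estimated inverse variances

# Step 2: minimize T^{-1} sum_t w_t (y_t - m(x_t, theta))^2, as in (4.5).
obj = lambda th: np.mean(w * (y - (th[0] + th[1] * z)) ** 2)
print(minimize(obj, np.zeros(2), method="BFGS").x)   # ~ theta0
```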
Theorem 4.3 restricts attention to objective functions continuous in 0 (and y),
and this rules out certain estimators that have been suggested primarily in the
cross section econometrics literature. A leading example is Manski's (1975)
maximum score estimator. It turns out that Theorem 4.3 can be extended without
much difficulty to cover certain discontinuous objective functions (although
measurability of the estimator becomes an issue). Wooldridge and White (1985)
present a result that applies to the maximum score and related estimators for
mixing processes. Newey and McFadden (Section 2.7.1) provide a careful discussion
of the relevant issues for i.i.d. observations; these same issues are relevant for
identically distributed dependent observations.

4.3. Asymptotic normality

We first define what it means for an essentially stationary, weakly dependent vector
process to satisfy the central limit theorem.

Definition 4.3

Let $\{s_t: t = 1,2,\ldots\}$ be a $P \times 1$ random vector sequence. Then $\{s_t\}$ satisfies the
central limit theorem (CLT) if
(i) $E(s_t's_t) < \infty$, $t = 1,2,\ldots$;
(ii) $T^{-1/2}\sum_{t=1}^T E(s_t) \to 0$ as $T \to \infty$;
(iii) $T^{-1/2}\sum_{t=1}^T s_t \xrightarrow{d} \operatorname{Normal}(0, B)$, where $B = \lim_{T\to\infty}\operatorname{Var}\left(T^{-1/2}\sum_{t=1}^T s_t\right)$. (4.6)

Condition (i) ensures that $\operatorname{Var}(s_t)$ exists for all $t$. In the cases we focus on, $E(s_t) = 0$
for all $t$, so (ii) is trivially satisfied. Still, there are cases where only the weaker
condition (ii) holds, so we allow for it in the definition. Implicit in (iii) is that the
limit in (4.6) exists. We could relax this assumption, as in Gallant and White (1988),
Pötscher and Prucha (1991b) and White (1993), but this does not affect estimation
or inference procedures.
We refer the reader to Pötscher and Prucha (1991b) for a current list of central
limit theorems available in the stationary and heterogeneous cases under weak
dependence; see also the references in Section 2.

Theorem 4.4. Asymptotic normality of M-estimators

Let $\Theta$, $\Gamma$, $\{w_t: t = 1,2,\ldots\}$, and $\{q_t: \mathcal{W}_t \times \Theta \times \Gamma \to \mathbb{R}: t = 1,2,\ldots\}$ be as in Theorem
4.3. In addition to M.1-M.3, assume
M.4: (i) $\theta_0$ is interior to $\Theta$;
(ii) $\gamma^*$ is interior to $\Gamma$;
(iii) $\sqrt{T}(\hat\gamma_T - \gamma^*) = O_p(1)$;
(iv) for each $\gamma \in \Gamma$, $q_t$ satisfies the standard measurability and second order
differentiability conditions on $\mathcal{W}_t \times \Theta$ (see Definition A.4 in the
Appendix).
Define the $P \times 1$ score vector $s_t(\theta; \gamma) \equiv s_t(w_t, \theta; \gamma) \equiv \nabla_\theta q_t(w_t, \theta; \gamma)$ and the
$P \times P$ Hessian matrix $h_t(\theta; \gamma) \equiv \nabla_\theta s_t(\theta; \gamma) = \nabla_\theta^2 q_t(w_t, \theta; \gamma)$.
(v) For each $\theta \in \Theta$, $s_t(\theta; \cdot)$ is continuously differentiable on $\operatorname{int}(\Gamma)$.
M.5: (i) $\{h_t(\theta; \gamma): t = 1,2,\ldots\}$ satisfies the UWLLN on $\Theta \times \Gamma$;
(ii) $A_0 \equiv \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[h_t(\theta_0; \gamma^*)]$ is positive definite;
(iii) $\{\nabla_\gamma s_t(\theta; \gamma): t = 1,2,\ldots\}$ satisfies the UWLLN on $\Theta \times \Gamma$.
M.6: $\{s_t(\theta_0; \gamma^*): t = 1,2,\ldots\}$ satisfies the CLT with positive definite asymptotic
variance

$$B_0 \equiv \lim_{T\to\infty} \operatorname{Var}\left(T^{-1/2}\sum_{t=1}^T s_t(\theta_0; \gamma^*)\right). \qquad (4.7)$$

M.7: $E[\nabla_\gamma s_t(\theta_0; \gamma^*)] = 0$, $t = 1,2,\ldots$.

Then $\sqrt{T}(\hat\theta_T - \theta_0) \xrightarrow{d} \operatorname{Normal}(0, A_0^{-1}B_0A_0^{-1})$, so that

$$\operatorname{Avar}\,\sqrt{T}(\hat\theta_T - \theta_0) = A_0^{-1}B_0A_0^{-1}. \qquad (4.8)$$

(For proof see Appendix.)

This general result can be used to establish the asymptotic normality of the M-estimator
in both the stationary and heterogeneous cases. The UWLLN assumptions
M.5(i) and M.5(iii) differ depending on the nature of $\{w_t\}$ and $\{q_t(w_t, \theta)\}$.
As mentioned in Section 2, the CLT is applied to the score of the objective
function evaluated at $(\theta_0, \gamma^*)$, $s_t^o \equiv s_t(\theta_0; \gamma^*)$. Due to the scaling of the partial sum
by $T^{-1/2}$, Theorem 4.4 is restricted to cases where $\operatorname{Var}(s_t^o)$ is bounded; this rules
out most examples with trending data since, generally, trending data implies a
trending score. Assumption M.6 also implies that $\operatorname{Var}(\sum_{t=1}^T s_t^o)$ grows linearly with
$T$, and this restricts $\{s_t^o\}$ to be a weakly dependent process. As discussed in Section
2, this is not necessarily the same as the underlying process $\{w_t\}$ being weakly
dependent. Often the weak dependence of the score is established by exploiting
weak dependence properties of $\{w_t\}$, but this is not always the case. For example,
for a broad class of estimation problems $\{s_t^o\}$ is a martingale difference sequence
(MDS) - see Sections 5 and 6 - in which case it satisfies a CLT under some
additional assumptions. While the properties of $\{w_t\}$ play a role in verifying these
additional assumptions, it is sometimes possible to establish M.6 without imposing
mixing or related conditions on $\{w_t\}$.
Another consequence of M.6 is

$$T^{-1/2}\sum_{t=1}^T E[s_t(\theta_0; \gamma^*)] \to 0 \quad \text{as } T \to \infty. \qquad (4.9)$$

In most cases, including but not restricted to stationarity, the stronger condition

$$E[s_t(\theta_0; \gamma^*)] = 0, \qquad t = 1,2,\ldots \qquad (4.10)$$

holds. The invertible MA(1) example when the startup values are set to zero is an
example of where (4.9) holds but (4.10) does not. For the rest of this discussion
we assume that (4.10) holds, but it should be kept in mind that the weaker
assumption (4.9) is sufficient.
Often it is possible to establish (4.10) directly from the structure of $q_t(\theta)$. Still,
it is useful to know that it holds under the following additional assumptions. First,
suppose that $\theta_0$ minimizes $E[q_t(\theta; \gamma^*)]$ on $\Theta$ for all $t$. Then, because $\theta_0 \in \operatorname{int}(\Theta)$, it
follows that if $E[q_t(\theta; \gamma^*)]$ is differentiable, then $\theta_0$ satisfies

$$\nabla_\theta E[q_t(\theta; \gamma^*)]\big|_{\theta=\theta_0} = 0, \qquad t = 1,2,\ldots. \qquad (4.11)$$

Second, if the derivative and expectations operator can be interchanged (which is
the case quite generally), then (4.11) implies $E[\nabla_\theta q_t(\theta; \gamma^*)]\big|_{\theta=\theta_0} = 0$, which is simply
(4.10).
Notice that $\theta_0$ is assumed to be in the interior of its parameter space. Technically,
this is needed for a mean value expansion of the score about $\theta_0$. It is also easy to
devise examples where $\sqrt{T}(\hat\theta_T - \theta_0)$ has a nonnormal limiting distribution because
$\theta_0$ is on the boundary of $\Theta$; see, for example, Newey and McFadden (Section 3.1).
Another limitation of Theorem 4.4 is that the objective function is assumed to
be twice continuously differentiable in 8 with positive definite expected Hessian at
(I,. For most time series applications with essentially stationary, weakly dependent
data, M.4(iv) holds. Nevertheless, it rules out some procedures that have recently
become more popular in the econometric analysis of time series data. The leading
example is least absolute deviations (LAD) estimation of a smooth conditional
median function. For simplicity, suppose we have stationary data {(x,, y,)}. The
objective function for the LAD estimator for observation t is q(w,, 0) = ly, - m(x,, 0)l,
where m(x,, 0) is the hypothesized conditional median function. Under the assump-
tion that Med(y,Ix,) = m(x,, 0,) for some fl,~ 0, it can be shown [for example, Manski
(1988), White (1993), Newey and McFadden (Section 2.7.1)] that 6, minimizes
E[q(w,, H)] on 0, so that LAD is Fisher-consistent for the parameters of a condi-
tional median. Theorem 4.3 applies for weak consistency because the LAD objective
function is continuous in 8.
The problem with applying Theorem 4.4 for asymptotic normality is that $q(w_t,\theta)$ is not twice continuously differentiable on $\operatorname{int}(\Theta)$. Nevertheless, under reasonable assumptions it is still possible to obtain a first order representation of the form

$$\sqrt{T}(\hat\theta_T - \theta_0) = -A_0^{-1} T^{-1/2}\sum_{t=1}^T s_t(\theta_0) + o_p(1), \qquad (4.12)$$

where $s_t(\theta)$ plays the role of the score and $A_0$ is the derivative of the expected score $E[s_t(\theta)]$ evaluated at $\theta_0$. The key insight in obtaining representations such as (4.12) for nonsmooth problems is that often $E[s_t(\theta)]$ is smooth in $\theta$ even though $s_t(\theta)$ is not; this works with dependent data just as with independent data. We refer the reader to the general treatment in Newey and McFadden (this Handbook, Section 7). Also, Wooldridge (1986) shows how the results of Huber (1967) extend to the case of mixing observations. Bloomfield and Steiger (1983), Wooldridge (1986) and Weiss (1991) study the asymptotic properties of LAD for dependent observations.
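To make (4.12) concrete for scalar LAD, here is a sketch; the assumptions, in particular that $u_t = y_t - m(x_t,\theta_0)$ has a conditional density $f_t(\cdot|x_t)$ continuous and positive at zero, are ours rather than part of the chapter's formal development. The almost-everywhere derivative of $|y_t - m(x_t,\theta)|$ is

$$s_t(\theta) = -\operatorname{sgn}(y_t - m(x_t,\theta))\nabla_\theta m(x_t,\theta)',$$

which is discontinuous in $\theta$, yet

$$E[s_t(\theta)|x_t] = [2F_t(m(x_t,\theta) - m(x_t,\theta_0)|x_t) - 1]\nabla_\theta m(x_t,\theta)'$$

is smooth whenever the conditional cdf $F_t(\cdot|x_t)$ is, and differentiating at $\theta_0$ (where $F_t = 1/2$) gives $A_0 = \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[2f_t(0|x_t)\nabla_\theta m(x_t,\theta_0)'\nabla_\theta m(x_t,\theta_0)]$.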
For regular problems with essentially stationary, weakly dependent data, Assumptions M.1-M.6 can be viewed as regularity conditions. On the other hand, Assumption M.7 plays a key role in how one goes about conducting inference about $\theta_0$. Namely, M.7 guarantees that the asymptotic distribution of $\sqrt{T}(\hat\gamma_T - \gamma^*)$ does not affect that of $\sqrt{T}(\hat\theta_T - \theta_0)$. (Of course, if there are no nuisance parameters, M.7 is automatically satisfied.) In particular,

$$\operatorname{Avar}(\hat\theta_T) = A_0^{-1}B_0A_0^{-1}/T \qquad (4.13)$$

is the same as if $\gamma^*$ were known rather than estimated. See Newey and McFadden (Section 6) for an insightful analysis of two-step estimation procedures with finite dimensional $\Gamma$.
Theorem 4.4 also assumes that $\Gamma$ is finite dimensional, and so it cannot be applied to semiparametric estimation problems. For semiparametric problems $\hat\theta_T$
is still a $P \times 1$ vector but $\Gamma$ is an infinite dimensional function space. For a certain class of semiparametric problems, including adaptive estimation [see Pötscher and Prucha (1986), Robinson (1987), Andrews (1989), White and Stinchcombe (1991) and Steigerwald (1992)], the limiting distribution of $\sqrt{T}(\hat\theta_T - \theta_0)$ is still given by (4.8) provided $\hat\gamma_T$ is consistent at a fast enough rate in a suitable norm (often the required rate is $T^{1/4}$ and a suitable norm is an $L_2$ norm). The conditions in Theorem 4.4, notably those concerning the gradients with respect to $\gamma$, must be modified to cover semiparametric problems. The notion of stochastic equicontinuity can replace differentiability assumptions. Andrews (this Handbook) gives a general discussion of empirical process methods; Newey (1991b) and Newey and McFadden (Section 8) show how to perform asymptotic analysis with certain kinds of infinite dimensional nuisance parameters and i.i.d. observations.

4.4. Adjustment for nuisance parameters

There are many problems for which M.7 fails, for example two-step estimators for factor autoregressive conditional heteroskedasticity (ARCH) models [see Lin (1992)]. In such cases one must adjust the asymptotic variance of $\hat\theta_T$ to account for estimation of $\gamma^*$. First define the $P \times R$ matrix

$$F_0 \equiv \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[\nabla_\gamma s_t(\theta_0;\gamma^*)] \qquad (4.14)$$

(this exists by M.5(iii)). In place of M.7 we now assume that $\hat\gamma_T$ has the first order (or influence function) representation

$$\sqrt{T}(\hat\gamma_T - \gamma^*) = T^{-1/2}\sum_{t=1}^T r_t(\gamma^*) + o_p(1),$$

where $r_t(\gamma)$ is an $R \times 1$ vector with $E[r_t(\gamma^*)] = 0$. The vector $r_t(\gamma)$ in general depends on unknown parameters other than $\gamma$, but these are suppressed for simplicity. Thus, $\hat\gamma_T$ could itself be an M-estimator or, as we will see in Section 7, a generalized method of moments estimator. The mean value expansion underlying the proof of Theorem 4.4 (under M.1-M.6) now gives

$$\sqrt{T}(\hat\theta_T - \theta_0) = -A_0^{-1}T^{-1/2}\sum_{t=1}^T \{s_t(\theta_0;\gamma^*) + F_0 r_t(\gamma^*)\} + o_p(1).$$

If the $P \times 1$ process $\{u_t(\theta_0;\gamma^*) \equiv s_t(\theta_0;\gamma^*) + F_0 r_t(\gamma^*)\}$ satisfies the CLT with asymptotic variance

$$D_0 \equiv \lim_{T\to\infty} \operatorname{Var}\left( T^{-1/2}\sum_{t=1}^T u_t(\theta_0;\gamma^*) \right), \qquad (4.15)$$

then

$$\operatorname{Avar} \sqrt{T}(\hat\theta_T - \theta_0) = A_0^{-1}D_0A_0^{-1}. \qquad (4.16)$$

Note that the form of $D_0$ is similar in structure to $B_0$. In the next section, we discuss how to estimate such matrices under different assumptions.

4.5. Estimating the asymptotic variance

We first consider estimating the asymptotic variance of $\sqrt{T}(\hat\theta_T - \theta_0)$ under Assumptions M.1-M.7, so that the asymptotic variance of $\sqrt{T}(\hat\gamma_T - \gamma^*)$ does not affect $\operatorname{Avar} \sqrt{T}(\hat\theta_T - \theta_0)$ (see (4.8)). In constructing valid estimators, we make use of a third matrix,

$$J_0 \equiv \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[s_t(\theta_0;\gamma^*)s_t(\theta_0;\gamma^*)'], \qquad (4.17)$$

which is essentially the average variance of the scores $\{s_t(\theta_0;\gamma^*)\}$.


Estimation of $A_0$ is no different from cross section analysis with independent observations. A consistent estimator of $A_0$ that is always available under the conditions of Theorem 4.4 is the average of the Hessians evaluated at the estimates,

$$\hat A_T = T^{-1}\sum_{t=1}^T h_t(\hat\theta_T;\hat\gamma_T). \qquad (4.18)$$

As a practical matter, an analytical formula for $h_t(\theta;\gamma)$ is not needed to implement this formula; numerical second derivatives can be used to approximate (4.18). Still, this estimator is often more difficult to compute than necessary, and it is not guaranteed to be even positive semi-definite for a particular sample (although it is positive definite with probability approaching one). Sometimes more structure is available that allows a different estimator. Suppose we can partition $w_t$ as $w_t = (x_t, y_t)$, where $x_t$ is a random vector with possibly growing dimension and $y_t$ is $G \times 1$ (interest lies in estimating some aspect of the conditional distribution of $y_t$ given $x_t$). Define

$$a_t(x_t,\theta_0;\gamma^*) \equiv E[h_t(w_t,\theta_0;\gamma^*)|x_t]. \qquad (4.19)$$

By the law of iterated expectations

$$E[a_t(x_t,\theta_0;\gamma^*)] = E[h_t(w_t,\theta_0;\gamma^*)], \qquad (4.20)$$

so that $A_0 = \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[a_t(\theta_0;\gamma^*)]$. Appendix Lemma A.1 and the UWLLN imply that generally

$$T^{-1}\sum_{t=1}^T a_t(\hat\theta_T;\hat\gamma_T) \xrightarrow{p} A_0. \qquad (4.21)$$

This estimator is useful in the context of nonlinear least squares, maximum likelihood, and several quasi-maximum likelihood applications. In some leading cases, including the examples in Section 6, $a_t(x_t,\theta_0;\gamma^*)$ depends only on the first derivatives of the conditional mean and variance functions.
Estimation of $B_0$ is generally more difficult due to the time series dependence in the data. But in some important cases it is no more difficult than estimating $A_0$. In the simplest case, $B_0 = A_0$, so a separate estimator of $B_0$ is not needed. Two additional assumptions imply that $B_0 = A_0$. The first is that the score (evaluated at $(\theta_0;\gamma^*)$) is serially uncorrelated.

Assumption M.8

For $t = 1,2,\ldots$, $E[s_t(\theta_0;\gamma^*)s_{t+j}(\theta_0;\gamma^*)'] = 0$, $j \geq 1$.

(In cases where $E[s_t(\theta_0;\gamma^*)]$ is not identically zero, but (4.9) holds, M.8 might only be approximately true as $t \to \infty$; the following conclusions still hold provided the approximation error dies out fast enough with $t$.) The score is always serially uncorrelated if $s_t^0$ is independent of $s_{t+j}^0$, $j = 1,2,\ldots$, which is usually assumed in cross section applications but rarely in time series settings. Of course independence is not nearly necessary for M.8 to hold. When time series models are dynamically complete (in ways to be defined precisely in later sections), $s_t^0 \equiv s_t(\theta_0;\gamma^*)$ often satisfies the stronger requirement of being a martingale difference sequence with respect to the information sets $\Phi_t \equiv (w_t,\ldots,w_1)$.
Even outside the context of maximum likelihood estimation (MLE), in many
practical cases of interest an extension of the information matrix equality holds
for each observation t. This is stated as follows.

Assumption M.9

$$E[s_t(\theta_0;\gamma^*)s_t(\theta_0;\gamma^*)'] = E[h_t(\theta_0;\gamma^*)], \quad t = 1,2,\ldots.$$

In other words, the variance of the score for observation $t$ is equal to the expected value of the Hessian for observation $t$. (Note that the latter quantity is at least positive semi-definite because we are analyzing a minimization problem.) Assumption M.9 immediately implies that $A_0 = J_0$. In addition to MLE, we will encounter other cases where M.9 is satisfied, among these multivariate weighted nonlinear least squares when the conditional mean and conditional variance are both correctly specified. In virtually every case that M.9 holds it makes sense to replace $\gamma^*$ by $\gamma_0$, to indicate that $\gamma$ indexes some feature of the distribution of $w_t$ that is correctly specified.
It is important to see that conditions M.8 and M.9 are logically distinct and must be examined separately. In Section 6 we cover cases where the score $s_t(\theta_0;\gamma^*)$ is serially uncorrelated (so that M.8 holds) but M.9 does not hold; in Sections 5 and 6 we give examples where M.9 holds but M.8 does not.
The usefulness of imposing both M.8 and M.9 is seen in a simple lemma.

Lemma 4.1

Under Assumptions M.1-M.9, $B_0 = J_0 = A_0$, and therefore

$$\operatorname{Avar}(\hat\theta_T) = A_0^{-1}/T. \qquad (4.22)$$

This shows that the asymptotic variance of $\hat\theta_T$ can be estimated as $\hat A_T^{-1}/T$ under M.1-M.9, where $\hat A_T$ is the consistent estimator of the expected Hessian given by either (4.18) or (4.21). Alternatively, we can use an estimator of $J_0$ to obtain $\widehat{\operatorname{Avar}}(\hat\theta_T)$. A consistent estimator of $J_0$ under M.1-M.6 (and the regularity condition that $\{s_t(\theta;\gamma)s_t(\theta;\gamma)'\}$ satisfies the UWLLN) is

$$\hat J_T = T^{-1}\sum_{t=1}^T \hat s_t\hat s_t', \qquad (4.23)$$

where $\hat s_t \equiv s_t(\hat\theta_T;\hat\gamma_T)$. Under Assumption M.9, $\hat J_T$ is also a consistent estimator of $A_0$. Therefore, under M.1-M.9, $\operatorname{Avar}(\hat\theta_T)$ can be estimated by $\widehat{\operatorname{Avar}}(\hat\theta_T) = \hat J_T^{-1}/T$. This outer product of the score estimator is usually attributed to Berndt, Hall, Hall and Hausman (1974) (BHHH) in the context of MLE.
Practically speaking, Assumption M.9 without M.8 does not afford much simplification for estimating asymptotic variances because of the potential serial correlation in the score. On the other hand, M.8, even in the absence of M.9, means that covariance matrices can be estimated by sample averages. This follows from the following obvious lemma.

Lemma 4.2

Under M.1-M.8, $B_0 = J_0$, and so $\operatorname{Avar}(\hat\theta_T) = A_0^{-1}J_0A_0^{-1}/T$.

Thus, to estimate $\operatorname{Avar}(\hat\theta_T)$ under M.1-M.8, take $\hat A_T$ to be one of the estimators (4.18) or (4.21) and take $\hat J_T$ as in (4.23). Then a consistent estimator of $\operatorname{Avar}(\hat\theta_T)$ is

$$\widehat{\operatorname{Avar}}(\hat\theta_T) = \hat A_T^{-1}\hat J_T\hat A_T^{-1}/T. \qquad (4.24)$$

This estimator was suggested by White (1982) in the context of maximum likelihood estimation of misspecified models with i.i.d. data. The formula $\operatorname{Avar} \sqrt{T}(\hat\theta_T - \theta_0) = A_0^{-1}J_0A_0^{-1}$ appeared in Huber (1967) in his analysis of M-estimators for i.i.d. data. The estimator (4.24) has since been suggested by Domowitz and White (1982), Hsieh (1983), White and Domowitz (1984), White (1993) and others in a variety of contexts when the score is serially uncorrelated.
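As a computational illustration (a minimal sketch, not from the chapter; the array layout is an assumption), (4.18), (4.23) and (4.24) fit together as follows:

```python
# Minimal sketch of the robust ("sandwich") variance estimator (4.24).
# `scores` is a T x P array whose row t is s_t(theta_hat; gamma_hat)';
# `hessians` is a T x P x P array whose slice t is h_t(theta_hat; gamma_hat).
import numpy as np

def sandwich_avar(scores, hessians):
    T = scores.shape[0]
    A_hat = hessians.mean(axis=0)          # Hessian estimator (4.18) of A_0
    J_hat = scores.T @ scores / T          # outer product estimator (4.23) of J_0
    A_inv = np.linalg.inv(A_hat)
    return A_inv @ J_hat @ A_inv / T       # estimated Avar(theta_hat), as in (4.24)
```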
When we relax M.8 as well as M.9, the general form of $B_0$ in (4.7) must be estimated. Again let $s_t^0 \equiv s_t(\theta_0;\gamma^*)$ and first suppose it is known that

$$E(s_t^0 s_{t+j}^{0\prime}) = 0, \quad j > L, \qquad (4.25)$$

where $L$ is a known integer. A sensible estimator of $B_0$ is

$$\hat B_T = \hat\Lambda_0 + \sum_{j=1}^L (\hat\Lambda_j + \hat\Lambda_j'), \qquad (4.26)$$

where

$$\hat\Lambda_j = T^{-1}\sum_{t=1}^{T-j} \hat s_t\hat s_{t+j}', \quad j = 0,1,\ldots,L. \qquad (4.27)$$

(Sometimes $T^{-1}$ in (4.27) is replaced by $(T-P)^{-1}$ as a degrees-of-freedom adjustment.) Hansen and Hodrick (1980) proposed this estimator in the context of linear rational expectations models. Hansen (1982) and Hansen and Singleton (1982) proposed it in the generalized method of moments (GMM) framework; see also White (1984). A simple application of the uniform weak law of large numbers to $\{s_t(\theta;\gamma)s_{t+j}(\theta;\gamma)'\}$ and Lemma A.1 shows that

$$\operatorname{plim} \hat\Lambda_j = \lim_{T\to\infty} T^{-1}\sum_{t=1}^{T-j} E(s_t^0 s_{t+j}^{0\prime}), \qquad (4.28)$$

and so $\hat B_T$ is consistent for $B_0$ under general conditions. When (4.26) applies, the choice of $L$ typically depends on the frequency of the data. A potential drawback of (4.26) is that it need not be positive definite. Remedies for this are discussed below.
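In code, (4.26)-(4.27) amount to summing sample autocovariances of the estimated scores (a minimal sketch; the T x P array layout is an assumption):

```python
# Sketch of the truncated estimator (4.26)-(4.27). `scores` is a T x P array
# of estimated scores s_hat_t; L is the known lag beyond which (4.25) holds.
import numpy as np

def hat_Lambda(scores, j):
    # Sample autocovariance (4.27): T^{-1} sum_{t=1}^{T-j} s_hat_t s_hat_{t+j}'
    T = scores.shape[0]
    return scores[:T - j].T @ scores[j:] / T

def B_hat_truncated(scores, L):
    B = hat_Lambda(scores, 0)
    for j in range(1, L + 1):
        Lam_j = hat_Lambda(scores, j)
        B += Lam_j + Lam_j.T               # add Lambda_hat_j + Lambda_hat_j'
    return B
```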
Several econometricians have recently looked at the problem of estimating $B_0$ consistently in the general case where the autocovariances of $\{s_t^0\}$ are known only to die out at a polynomial rate as $j$ gets large. For motivational purposes, assume that $\{s_t^0\}$ is covariance stationary; the estimators that follow are valid in the heterogeneous case under mixing or near epoch dependence conditions, as in White (1984), Newey and West (1987), Gallant and White (1988), Hansen (1992a), Andrews (1991), Pötscher and Prucha (1991b) and Andrews and Monahan (1992). Under covariance stationarity of the score, $B_0 = \Lambda_0 + \sum_{j=1}^\infty (\Lambda_j + \Lambda_j')$, where $\Lambda_j \equiv E(s_t^0 s_{t+j}^{0\prime})$. If we truncate the infinite sum at $L_T$, but let $L_T$ tend to infinity with $T$, then

$$\Lambda_0 + \sum_{j=1}^{L_T} (\Lambda_j + \Lambda_j') \to B_0. \qquad (4.29)$$

This suggests estimating $B_0$ exactly as when $s_t^0$ is known to have zero autocovariances after a certain point, with the technical distinction that the truncation lag $L_T$ should grow with $T$ to ensure consistency for general autocovariance structures. But $L_T$ cannot grow too quickly or else too many autocovariances are estimated for a given sample size.
Rather than using (4.26), as in White (1984), in some cases it is useful to weight the autocovariances to ensure a positive semi-definite (p.s.d.) matrix. Thus, consider the estimator

$$\hat B_T = \hat\Lambda_0 + \sum_{j=1}^{L_T} \omega(j,T)(\hat\Lambda_j + \hat\Lambda_j'), \qquad (4.30)$$

where the weights $\omega(j,T)$ are chosen to ensure that $\hat B_T$ is p.s.d. Note that if (4.25) holds and $L_T = L$, then $\omega(j,T) \to 1$ as $T \to \infty$, $j = 1,2,\ldots,L$, ensures that $\hat B_T$ is consistent for $B_0$. Possibilities for $\omega(j,T)$ are abundant in the time series literature on spectral density estimation because, in the covariance stationary case, $B_0$ is proportional to the spectral density matrix of the process $\{s_t^0\}$ evaluated at frequency zero; in econometrics, $B_0$ is often called the long run variance of $\{s_t^0\}$ [for example, Phillips (1988)]. For a list and discussion of weights see Anderson (1971, Chapter 9), Bloomfield (1976, Chapter 8), Gallant (1987, Chapter 7) and Andrews (1991). The Bartlett weights were suggested by Newey and West (1987) and are also studied in Gallant (1987) and Gallant and White (1988). They are given by

$$\omega(j,T) = 1 - \frac{j}{L_T + 1}, \quad j = 1,2,\ldots,L_T, \qquad (4.31)$$
$$\omega(j,T) = 0, \quad j > L_T.$$

[The Bartlett weights are not among the most popular weights in the spectral density estimation literature; see, for example, Bloomfield (1976, p. 164).] Note that $\omega(j,T) \to 1$ as $T \to \infty$ for each $j$ provided $L_T$ grows with $T$. Newey and West (1987) demonstrate that with this choice of weights, $\hat B_T$ is p.s.d.
For applications, Gallant (1987) recommends the Parzen weights, given by

$$\omega(j,T) = 1 - 6(j/L_T)^2 + 6(j/L_T)^3, \quad j \leq L_T/2,$$
$$\omega(j,T) = 2(1 - j/L_T)^3, \quad L_T/2 < j \leq L_T. \qquad (4.32)$$

Andrews (1991) allows for weighting schemes that leave all of the covariances in the estimator for each sample size (that is, $L_T \equiv T - 1$), but of course those for large $j$ are downweighted. Several other choices of $\omega(j,T)$ are studied by Andrews (1991).
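The Bartlett (4.31) and Parzen (4.32) weights, and the resulting weighted estimator (4.30), can be coded directly (a sketch in the same conventions as the previous snippet):

```python
# Sketch of (4.30) with Bartlett (4.31) or Parzen (4.32) weights.
import numpy as np

def hat_Lambda(scores, j):
    # Sample autocovariance (4.27)
    T = scores.shape[0]
    return scores[:T - j].T @ scores[j:] / T

def bartlett(j, L_T):
    return 1.0 - j / (L_T + 1.0) if j <= L_T else 0.0

def parzen(j, L_T):
    x = j / L_T
    if x <= 0.5:
        return 1.0 - 6.0 * x**2 + 6.0 * x**3
    return 2.0 * (1.0 - x)**3 if x <= 1.0 else 0.0

def B_hat_weighted(scores, L_T, weight=bartlett):
    B = hat_Lambda(scores, 0)
    for j in range(1, L_T + 1):
        Lam_j = hat_Lambda(scores, j)
        B += weight(j, L_T) * (Lam_j + Lam_j.T)
    return B
```

With `weight(j, L_T)` identically one and `L_T` fixed at a known `L`, this reduces to the truncated estimator (4.26).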
There are a variety of regularity conditions under which (4.30) is a consistent estimator of $B_0$. For nonlinear models these usually include the assumption that $\omega(j,T)$ is uniformly bounded (which is satisfied by the Bartlett and Parzen weights), that $\omega(j,T) \to 1$ as $T \to \infty$ for each $j$, and that $L_T$ tends to infinity at a slower rate than $T$. Pötscher and Prucha (1991b) provide a careful discussion and extensions of sufficient conditions based on work of White (1984), Newey and West (1987), Gallant (1987), Gallant and White (1988) and Andrews (1991). Hansen (1992a) gives results that relax the moment conditions. The details underlying consistency are rather intricate, and so they will not be given here. Generally, the assumptions include smoothness of $s_t$ as a function on $\Theta \times \Gamma$ and weak dependence of $s_t^0$ (such as mixing and near epoch dependence conditions).
In applying serial-correlation-robust estimators the choice of the lag length $L_T$ is crucial, but consistency results only yield rates at which $L_T$ should grow with $T$. For problems with near epoch dependent scores, the available proofs allow $L_T = o(T^{1/3})$. This was also shown by Quah (1990) to produce a consistent estimator in a related context. Andrews (1991) contains consistency results under cumulant and mixing assumptions that allow $L_T = o(T)$; see also Keener et al. (1991) for OLS with fixed regressors. Andrews (1991) also uses an asymptotic mean squared error criterion to derive optimal growth rates of $L_T$ for a general class of weighting schemes. The optimal growth rate for the Bartlett weights is $T^{1/3}$, and for the Parzen weights it is $T^{1/5}$. The optimal weighting scheme in the class Andrews considers is the quadratic spectral (which, unlike the Bartlett and Parzen weights, keeps all autocovariances in the estimator).
Andrews (1991, Table 1) gives the optimal lag length for a variety of weights $\omega(j,T)$ in the context of linear regression with AR(1) errors. This table provides useful guidance but is necessarily limited in scope; such calculations require knowing something about the degree of dependence in the underlying stochastic process. In practice, deterministic rules for choosing $L_T$ necessarily have a component of arbitrariness. Andrews (1991) and Andrews and Monahan (1992) discuss data dependent or automatic ways of selecting $L_T$. Except for the computational requirements, these are attractive alternatives to deterministic rules. They are likely to become more popular as they are incorporated into econometrics software packages.
There are other alternatives for estimating $B_0$. If $\{s_t^0\}$ were a stationary, finite order vector autoregression, then $B_0$ is simply the long run variance of a finite order VAR; it can be consistently estimated by first estimating a VAR for $\hat s_t$. This leads to the multivariate version of Berk's (1974) autoregressive spectral density estimator. The autoregressive spectral density estimator might also work well when $s_t^0$ is not a finite order VAR, provided the lag $L_T$ in the VAR is allowed to depend on $T$ (again, the rate $L_T = o(T^{1/3})$ is sufficient under certain conditions); see Berk (1974) for the scalar case. Another possibility is suggested by Andrews and Monahan (1992): prefilter $\hat s_t$ through a VAR, and then apply an estimator such as (4.30) to the residuals. Andrews and Monahan show that this can lead to better finite sample properties.
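A sketch of the prefiltering idea (the VAR(1) choice and function names here are assumptions of ours, not the chapter's): fit a VAR(1) to the scores by least squares, apply a kernel estimator such as (4.30) to the residuals, then "recolor":

```python
# Sketch of VAR(1) prewhitening for the long run variance of the scores.
import numpy as np

def prewhitened_B_hat(scores, L_T, kernel_B_hat):
    # Least squares VAR(1): s_t = Phi s_{t-1} + e_t
    S_lag, S_cur = scores[:-1], scores[1:]
    coef, *_ = np.linalg.lstsq(S_lag, S_cur, rcond=None)
    Phi = coef.T
    resid = S_cur - S_lag @ coef
    S_e = kernel_B_hat(resid, L_T)                   # e.g. B_hat_weighted above
    recolor = np.linalg.inv(np.eye(Phi.shape[0]) - Phi)
    return recolor @ S_e @ recolor.T                 # long run variance of the scores
```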
When M.7 is violated the asymptotic variance estimator of $\hat\theta_T$ must account for the asymptotic variance of $\hat\gamma_T$. This requires that we estimate the matrix $D_0$ in (4.15) rather than $B_0$. First we need to estimate $F_0$. This is typically straightforward. Under M.1-M.6 the $P \times R$ matrix

$$\hat F_T = T^{-1}\sum_{t=1}^T \nabla_\gamma s_t(\hat\theta_T;\hat\gamma_T) \qquad (4.33)$$

is consistent for $F_0$. In applications with conditioning variables, such as nonlinear least squares and quasi-maximum likelihood estimation (QMLE) (see Section 6), a simpler estimator is often obtained by computing $\zeta_t(x_t,\theta_0;\gamma^*) \equiv E[\nabla_\gamma s_t(\theta_0;\gamma^*)|x_t]$ and then replacing $\nabla_\gamma s_t(\hat\theta_T;\hat\gamma_T)$ in (4.33) with $\zeta_t(x_t,\hat\theta_T;\hat\gamma_T)$.
Next, define $\hat u_t = \hat s_t + \hat F_T\hat r_t$, where $\hat r_t$ replaces any unknown parameters in $r_t(\gamma^*)$ with consistent estimates. For example, often $r_t(\gamma^*) = K_*^{-1}e_t(\gamma^*)$, where $K_*$ is an $R \times R$ unknown positive definite matrix. Given a consistent estimator $\hat K_T$ of $K_*$ (which is typically very similar to estimating $A_0$), $\hat r_t = \hat K_T^{-1}e_t(\hat\gamma_T)$. This covers the case when $\hat\gamma_T$ is an M-estimator, and similar estimators are available in generalized method of moments contexts. Without further assumptions, estimation of $D_0$ requires application of one of the serial-correlation-robust estimators, such as (4.30), to $\{\hat u_t\}$. When $\{u_t(\theta_0;\gamma^*)\}$ is serially uncorrelated, as is typically the case when $\hat\theta_T$ and $\hat\gamma_T$ are obtained from problems that have completely specified dynamics (more on this in Sections 5 and 6), $D_0$ can be estimated as $\hat D_T = T^{-1}\sum_{t=1}^T \hat u_t\hat u_t'$.

4.6. Hypothesis testing

Consider testing $Q$ nonlinear restrictions

$$H_0: c(\theta_0) = 0, \qquad (4.34)$$

where $c(\theta)$ is a $Q \times 1$ vector function of the $P \times 1$ vector $\theta$. We assume that $Q \leq P$, $c(\cdot)$ is continuously differentiable on the interior of $\Theta$, and $\theta_0$ is in the interior of $\Theta$ under $H_0$. Define $C(\theta) \equiv \nabla_\theta c(\theta)$ to be the $Q \times P$ gradient of $c(\theta)$ and assume that rank $C(\theta_0) = Q$ (this ensures that the $Q$ restrictions are nonredundant). The Wald statistic is based on the limiting distribution of $\sqrt{T}c(\hat\theta_T)$ under $H_0$. A standard mean value expansion gives, under $H_0$,

$$\sqrt{T}c(\hat\theta_T) = C_0\sqrt{T}(\hat\theta_T - \theta_0) + o_p(1), \qquad (4.35)$$

where $C_0 \equiv C(\theta_0)$. Therefore, the null limiting distribution of $\sqrt{T}c(\hat\theta_T)$ is

$$\sqrt{T}c(\hat\theta_T) \xrightarrow{d} \text{Normal}(0, C_0V_0C_0'), \qquad (4.36)$$

where $V_0 \equiv \operatorname{Avar} \sqrt{T}(\hat\theta_T - \theta_0)$. Given a consistent estimator of $V_0$, say $\hat V_T$, and the consistent estimator of $C_0$, $\hat C_T = C(\hat\theta_T)$, the Wald statistic for testing $H_0$ against $H_1: c(\theta_0) \neq 0$ is

$$W_T = \sqrt{T}c(\hat\theta_T)'[\hat C_T\hat V_T\hat C_T']^{-1}\sqrt{T}c(\hat\theta_T). \qquad (4.37)$$

Under $H_0$, $W_T \xrightarrow{d} \chi^2_Q$. As discussed in Section 4.5, the choice of $\hat V_T$ depends on what is maintained under $H_0$. Assuming that M.7 holds, $\hat V_T = \hat A_T^{-1}$ or $\hat J_T^{-1}$ under M.8 and M.9, $\hat V_T = \hat A_T^{-1}\hat J_T\hat A_T^{-1}$ under M.8 only, and $\hat V_T = \hat A_T^{-1}\hat B_T\hat A_T^{-1}$ if neither M.8 nor M.9 is assumed. When nuisance parameters are present and M.7 does not hold, $\hat V_T$ should account for the asymptotic variance of $\sqrt{T}(\hat\gamma_T - \gamma^*)$.
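Computationally the Wald statistic is a small amount of linear algebra (a sketch with hypothetical inputs: `c_val` is $c(\hat\theta_T)$, `C_hat` is $C(\hat\theta_T)$, and `V_hat` is an estimate of $\operatorname{Avar}\sqrt{T}(\hat\theta_T - \theta_0)$):

```python
# Sketch of the Wald statistic (4.37); compare with a chi^2_Q critical value.
import numpy as np

def wald_stat(c_val, C_hat, V_hat, T):
    middle = C_hat @ V_hat @ C_hat.T        # estimated Avar of sqrt(T) c(theta_hat)
    return T * c_val @ np.linalg.solve(middle, c_val)
```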
Often, it is convenient to use a likelihood ratio (LR)-type statistic when testing hypotheses. Because M-estimators solve a minimization problem, under certain assumptions the difference in the objective function evaluated at the constrained and unconstrained estimators can be used as a formal test statistic. While it is possible to derive the limiting distribution of this statistic under Assumptions M.1-M.6, here we only cover the case where the statistic has a limiting chi-square distribution; thus, we impose M.7, M.8, and M.9. White (1993) allows the model to be entirely misspecified, but the quasi-LR statistic has little practical value for classical testing in such cases.
Let $\hat\theta_T$ denote the unconstrained M-estimator of $\theta_0$. The constrained estimator, $\tilde\theta_T$, solves

$$\min_{\theta\in\Theta} \sum_{t=1}^T q_t(w_t,\theta;\hat\gamma_T) \quad \text{subject to} \quad c(\theta) = 0. \qquad (4.38)$$

To derive the limiting distribution of the quasi-likelihood ratio statistic, we first assume that $\theta_0 \in \operatorname{int}(\Theta)$ under $H_0$. In addition, to ensure that $\sqrt{T}(\tilde\theta_T - \theta_0)$ has a proper first order representation we assume that a continuously differentiable mapping $d: \mathbb{R}^{P-Q} \to \mathbb{R}^P$ exists such that $\theta_0 = d(\alpha_0)$ under $H_0$, where $\alpha_0$ is a $(P-Q) \times 1$ vector in the interior of its parameter space $A$ under $H_0$. Further, the rank of the $P \times (P-Q)$ gradient $\nabla_\alpha d(\alpha_0)$ is $P-Q$ under $H_0$. The estimator of $\alpha_0$, $\tilde\alpha_T$, solves

$$\min_{\alpha\in A} \sum_{t=1}^T q_t(w_t,d(\alpha);\hat\gamma_T), \qquad (4.39)$$

and the constrained estimator $\tilde\theta_T$ is simply $\tilde\theta_T = d(\tilde\alpha_T)$. By definition,

$$\sum_{t=1}^T q_t(w_t,\tilde\theta_T;\hat\gamma_T) \geq \sum_{t=1}^T q_t(w_t,\hat\theta_T;\hat\gamma_T). \qquad (4.40)$$

The difference in (4.40) has a convenient limiting distribution that is free of nuisance parameters under Assumptions M.1-M.9.

Lemma 4.3

Under M.1-M.9, and the assumptions under $H_0$ that $\theta_0 = d(\alpha_0)$ for $\alpha_0$ a $(P-Q) \times 1$ vector, and $d(\alpha)$ a continuously differentiable function on $\operatorname{int}(A)$, $\alpha_0 \in \operatorname{int}(A)$, and rank $\nabla_\alpha d(\alpha_0) = P-Q$,

$$QLR_T \equiv 2\left[\sum_{t=1}^T q_t(w_t,\tilde\theta_T;\hat\gamma_T) - \sum_{t=1}^T q_t(w_t,\hat\theta_T;\hat\gamma_T)\right] \qquad (4.41)$$

converges in distribution to $\chi^2_Q$ under $H_0$.

A word of caution: in applying the QLR statistic, scaling factors that can appear in $q_t(w_t,\theta;\gamma^*)$ must be chosen so that M.9 is satisfied. As we will see, that is no problem in the context of MLE since $q_t$ is simply the negative of the conditional log-likelihood. In Section 6 we show what M.9 entails in the context of the weighted NLS and QMLE approaches.
The final test we cover is Rao's (1948) score test, known more commonly in econometrics as the Lagrange multiplier (LM) test. Engle (1984), Godfrey (1988) and MacKinnon (1992) contain discussions of the LM statistic and its use in econometrics. Calculation of the LM statistic requires estimation only under the null, so it is well-suited for model specification testing. As with the quasi-LR statistic, we assume that $\theta_0 = d(\alpha_0)$ under $H_0$, where $\alpha_0 \in \operatorname{int}(A)$ and $d(\alpha)$ satisfies the assumptions in Lemma 4.3.
The simplest method for deriving the LM test is to use Rao's score principle extended to the M-estimator case. The LM statistic is based on the limiting distribution of

$$T^{-1/2}\sum_{t=1}^T s_t(\tilde\theta_T;\tilde\gamma_T) \qquad (4.42)$$

under $H_0$, where $\tilde\gamma_T$ now denotes an estimator of $\gamma^*$ used in obtaining $\tilde\theta_T$. As usual, we assume that $\sqrt{T}(\tilde\gamma_T - \gamma^*) = O_p(1)$. We explicitly derive the LM statistic under the assumption that M.7 is true under $H_0$, as this holds in many applications.
Assume initially that $\theta_0$ is in the interior of $\Theta$ under $H_0$; we discuss how this can be relaxed below. A standard mean value expansion yields

$$T^{-1/2}\sum_{t=1}^T s_t(\tilde\theta_T;\tilde\gamma_T) = T^{-1/2}\sum_{t=1}^T s_t(\theta_0;\gamma^*) + A_0\sqrt{T}(\tilde\theta_T - \theta_0) + o_p(1)$$

under M.7 and $H_0$. But $0 = \sqrt{T}c(\tilde\theta_T) = \sqrt{T}c(\theta_0) + \bar C_T\sqrt{T}(\tilde\theta_T - \theta_0)$, where $\bar C_T$ is the $Q \times P$ matrix $C(\theta)$ with rows evaluated at mean values between $\tilde\theta_T$ and $\theta_0$. Under $H_0$, $c(\theta_0) = 0$, and $\operatorname{plim} \bar C_T = C(\theta_0) = C_0$. Further, $\sqrt{T}(\tilde\theta_T - \theta_0) = O_p(1)$ (because $\sqrt{T}(\tilde\alpha_T - \alpha_0) = O_p(1)$ under the above assumptions). Therefore, under $H_0$, $C_0\sqrt{T}(\tilde\theta_T - \theta_0) = o_p(1)$. It follows that

$$C_0A_0^{-1}T^{-1/2}\sum_{t=1}^T s_t(\tilde\theta_T;\tilde\gamma_T) = C_0A_0^{-1}T^{-1/2}\sum_{t=1}^T s_t(\theta_0;\gamma^*) + o_p(1) \qquad (4.43)$$

under $H_0$. Without imposing M.8 or M.9 under $H_0$, we generally have

$$C_0A_0^{-1}T^{-1/2}\sum_{t=1}^T s_t(\theta_0;\gamma^*) \xrightarrow{d} \text{Normal}(0, C_0A_0^{-1}B_0A_0^{-1}C_0').$$

Under our assumptions, $C_0A_0^{-1}B_0A_0^{-1}C_0'$ has full rank $Q$. Therefore, under $H_0$,

$$C_0A_0^{-1}T^{-1/2}\sum_{t=1}^T \tilde s_t \xrightarrow{d} \text{Normal}(0, C_0A_0^{-1}B_0A_0^{-1}C_0'),$$

where $\tilde s_t \equiv s_t(\tilde\theta_T;\tilde\gamma_T)$. The score or LM statistic is given by

$$LM_T = \left(T^{-1/2}\sum_{t=1}^T \tilde s_t\right)'\tilde A_T^{-1}\tilde C_T'[\tilde C_T\tilde A_T^{-1}\tilde B_T\tilde A_T^{-1}\tilde C_T']^{-1}\tilde C_T\tilde A_T^{-1}\left(T^{-1/2}\sum_{t=1}^T \tilde s_t\right), \qquad (4.44)$$

where all quantities are evaluated at $(\tilde\theta_T,\tilde\gamma_T)$ or just $\tilde\theta_T$. Under $H_0$, $LM_T \xrightarrow{d} \chi^2_Q$. This LM statistic is robust in the sense that neither M.8 nor M.9 is maintained under $H_0$.
If the score is serially uncorrelated under $H_0$ (that is, M.8 holds under $H_0$), then $\tilde B_T$ can be replaced in (4.44) by the outer product estimator

$$\tilde J_T = T^{-1}\sum_{t=1}^T \tilde s_t\tilde s_t'; \qquad (4.45)$$
otherwise $\tilde B_T$ should be a serial-correlation-robust estimator applied to $\{\tilde s_t\}$.


For both the Wald and QLR statistics, we assumed that $\theta_0 \in \operatorname{int}(\Theta)$ under $H_0$; this is crucial for the statistics to have limiting chi-square distributions. We will not consider the Wald or QLR statistics when $\theta_0$ is on the boundary of $\Theta$ under $H_0$; see Wolak (1991) for some recent results. The general derivation of the LM statistic also assumed that $\theta_0 \in \operatorname{int}(\Theta)$ under $H_0$. Nevertheless, for certain applications of the LM test we can drop the requirement that $\theta_0$ is in the interior of $\Theta$ under $H_0$. A leading case is when $\theta$ can be partitioned as $\theta = (\theta_1', \theta_2')'$, where $\theta_1$ is $(P-Q) \times 1$ and $\theta_2$ is $Q \times 1$. The null hypothesis is $H_0: \theta_{20} = 0$, so that $c(\theta) = \theta_2$. It is easily seen that the mean value expansion used to derive the LM statistic is valid provided $\alpha_0 = \theta_{10}$ is in the interior of its parameter space under $H_0$; $\theta_0 = (\theta_{10}', 0')'$ can be on the boundary of $\Theta$. This is useful especially when testing hypotheses about variances; see Bollerslev, Engle, and Nelson (this Handbook).
Recall that under Assumptions M.8 and M.9, $A_0 = B_0 = J_0$, and the LM statistic simplifies considerably. For example, $\tilde A_T$ and $\tilde B_T$ can both be replaced with $\tilde J_T$. Some matrix algebra shows that the LM statistic becomes

$$LM_T = \left(\sum_{t=1}^T \tilde s_t\right)'\left(\sum_{t=1}^T \tilde s_t\tilde s_t'\right)^{-1}\left(\sum_{t=1}^T \tilde s_t\right), \qquad (4.46)$$

which is just $TR_u^2$ from the linear regression

$$1 \text{ on } \tilde s_t', \quad t = 1,2,\ldots,T, \qquad (4.47)$$

where $R_u^2$ is the uncentered R-squared from the regression (recall that $\tilde s_t'$ is a $1 \times P$ vector). Engle (1984) derives this statistic in the context of MLE. Because the dependent variable in (4.47) is unity, $LM_T$ can also be computed as $T - SSR$, where $SSR$ is the sum of squared residuals from (4.47). The statistic (4.46) is called the outer product LM statistic because it uses the estimator $\tilde J_T$ for the estimator of the variance of the score.
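The regression computation in (4.47) is easy to code (a sketch; `scores_tilde` is a T x P array of restricted scores):

```python
# Sketch of the outer product LM statistic via the artificial regression (4.47):
# regress 1 on the restricted scores and use LM = T * R_u^2 = T - SSR.
import numpy as np

def outer_product_LM(scores_tilde):
    T = scores_tilde.shape[0]
    ones = np.ones(T)
    beta, *_ = np.linalg.lstsq(scores_tilde, ones, rcond=None)
    ssr = np.sum((ones - scores_tilde @ beta) ** 2)
    return T - ssr                          # asymptotically chi^2_Q under M.7-M.9
```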
The outer product LM statistic requires that the two Assumptions M.8 and M.9 hold in addition to M.7; it is not generally asymptotically $\chi^2_Q$ distributed if any of these assumptions fail. Even when M.7, M.8, and M.9 hold under $H_0$, the outer product LM statistic might not be the best choice for applications. There is evidence in the maximum likelihood context that it can have severe finite sample biases. If an estimate of the Hessian or expected Hessian $\tilde A_T$ is available, the Hessian LM statistic

$$LM_T = \left(T^{-1/2}\sum_{t=1}^T \tilde s_t\right)'\tilde A_T^{-1}\left(T^{-1/2}\sum_{t=1}^T \tilde s_t\right) \qquad (4.48)$$

can have better finite sample properties. See, for example, Orme (1990), Chesher and Spady (1991), Davidson and MacKinnon (1984, 1991) and Bollerslev and
Wooldridge (1992). When $A_0$ is estimated as

$$\tilde A_T = T^{-1}\sum_{t=1}^T a_t(x_t,\tilde\theta_T;\tilde\gamma_T) \qquad (4.49)$$

and the model has a residual structure, then (4.48) (or a scalar multiple of it) has a simple regression-based form. This is also true of the robust statistic (4.44) when $\tilde A_T$ is given by (4.49).
It is important to remember that the outer product and Hessian forms are usually invalid if either M.8 or M.9 is violated. If there is any doubt about M.8 or M.9, the robust form of the statistic should be used.
Two comments before ending this section. First, to keep the notation and assumptions simple, and to focus on the computation of valid test statistics under various assumptions, we have only derived the limiting distribution of the classical test statistics under the null hypothesis. Very general analyses under local alternatives are available in Gallant (1987), Gallant and White (1988) and White (1993). Second, the limiting distributions of the test statistics under $H_0$ have been derived under the assumption that all elements of $\theta_0$ are identified. This is violated for certain econometric problems, and usually the standard test statistics no longer have limiting chi-square distributions. See Hansen (1991b) for recent work on this problem.

5. Maximum likelihood estimation

We now apply the results on M-estimation to maximum likelihood estimation (MLE). For the purposes of discussion, suppose initially that $\{w_t: t = 1,2,\ldots\}$ is a sequence of strictly stationary $M \times 1$ random vectors. The classical approach to MLE entails specifying a parametric model for the joint distribution of $W_T \equiv (w_1, w_2,\ldots,w_T)$ for any $T$. A less restrictive setup allows one to partition $w_t$ into a $G \times 1$ vector of endogenous variables $y_t$ and a $K \times 1$ vector of exogenous variables $z_t$. Then (conditional) MLE requires specification of the joint distribution of $Y_T \equiv (y_1,\ldots,y_T)$ conditional on $Z_T \equiv (z_1, z_2,\ldots,z_T)$. The latter setup allows one to investigate how $z_t$ influences various features of the conditional distribution of $y_t$ given $z_t$, and it is familiar to economists in both cross section and time series settings. Still, for a few reasons this approach is too limiting for modern econometric practice. First, as this approach is usually implemented, a restrictive type of exogeneity of the process $\{z_t: t = 1,2,\ldots\}$ is assumed. In the time series econometrics literature this is known as the strict exogeneity assumption. When interest lies in the conditional distribution, strict exogeneity of $\{z_t\}$ can be stated as

$$D(y_t|y_1,\ldots,y_{t-1},z_1,z_2,\ldots) = D(y_t|y_1,\ldots,y_{t-1},z_1,\ldots,z_t); \qquad (5.1)$$

in other words, $y_t$ is independent of $z_{t+1}, z_{t+2},\ldots$ conditional on $y_1,\ldots,y_{t-1},z_1,\ldots,z_t$. [This is Chamberlain's (1982) modification of Sims's (1972) definition of strict exogeneity.] Chamberlain (1982) shows that (5.1) is equivalent to the Granger (1969) noncausality condition

$$D(z_{t+1}|z_t,\ldots,z_1,y_t,\ldots,y_1) = D(z_{t+1}|z_t,\ldots,z_1). \qquad (5.2)$$

Often, the $z_t$ are policy or other control variables that can be influenced by economic agents, in which case (5.2) can easily be false.
Second, specifying the joint distribution of $(y_1,\ldots,y_T)$ (conditional on $(z_1, z_2,\ldots,z_T)$) assumes that the researcher knows the entire dynamic structure of $(y_1,\ldots,y_T)$. In certain cases the dynamic structure is not even of interest. For example, one might be interested in the contemporaneous relationship between $y_t$ and $z_t$; in terms of conditional distributions, this entails specifying a model for $D(y_t|z_t)$. It is now recognized [for example, Robinson (1982), Levine (1983) and White (1993)] that this can be done quite generally without assuming that $\{z_t\}$ is strictly exogenous and without specifying how $y_t$ depends on past $y$ or $z$. Several examples of this kind of specification can be found in the literature, including the well-known static linear model, static logit and probit [Gourieroux et al. (1985), Poirier and Ruud (1988)], static Tobit [Robinson (1982)], static models for count data and static transformation models [Seaks and Layson (1983)].
Even when the complete dynamics are of interest one does not always directly specify the joint distribution of $(y_1,\ldots,y_T)$. Often it is more natural to specify the conditional distributions $D(y_t|y_{t-1},\ldots,y_1)$ or $D(y_t|z_t,y_{t-1},z_{t-1},\ldots,y_1,z_1)$ or $D(y_t|y_{t-1},z_{t-1},\ldots,y_1,z_1)$. Static models, dynamic models, and most other cases of interest can be cast in the following framework: interest lies in the distribution of $y_t$ given a set of conditioning variables $x_t$, where $x_t$ is some subset of $(z_t,y_{t-1},z_{t-1},\ldots,y_1,z_1)$. The $z_t$ are not necessarily strictly exogenous and the dynamics are not necessarily completely specified. Thus, we follow White (1993) in taking the notion of maximum likelihood estimation to include cases where only the distributions $D(y_t|x_t)$ are specified for some conditioning variables $x_t$. The fact that MLE is consistent for the parameters of $D(y_t|x_t)$ for essentially any conditioning variables $x_t$ has many useful applications.
In what follows it is easiest to view $x_t$ as a subset of $(z_t,y_{t-1},z_{t-1},\ldots,y_1,z_1)$, where the $G \times 1$ vector $y_t$ and the $K \times 1$ vector $z_t$ are contemporaneously observed, but the following results hold for other definitions of $x_t$ (for example if $x_t$ contains leads of $z$ as well as current and lagged $z$). Also note that $x_t$ can be null for all $t$, in which case interest lies in estimating the parameters in the unconditional distributions $D(y_t)$. From the M-estimation results we know that, for verifying regularity conditions, the easiest case to handle is when the dimension of $x_t$ is fixed and $\{(x_t,y_t): t = 1,2,\ldots\}$ is a strictly stationary sequence. This rules out cases where the distribution of $y_t$ given the observable past actually depends on all past observables, as with moving average models and generalizations of ARCH models [Bollerslev (1986), Nelson (1991)]. The following treatment allows for the number of elements in $x_t$ to grow with $t$, without doing the sometimes difficult task of verifying the regularity conditions. Generally, one must ensure that the conditional log-likelihood (defined below) and the score are not too dependent on the past.
Let $p_t^0(\cdot|x_t)$ denote the conditional density of $y_t$ given $x_t$ with respect to a measure $\nu_t(\mathrm{d}y)$. (In most applications, $\nu_t$ would not depend on $t$.) We will say nothing further about the measure $\nu_t(\mathrm{d}y)$ except that it can be chosen to allow $y_t$ to be discrete, continuous, or some combination of the two. The interested reader is referred to Billingsley (1986) and White (1993). Let the support of $y_t$ be $\mathcal{Y}_t \subset \mathbb{R}^G$ and denote the range of $x_t$ by $\mathcal{X}_t$; the dimension of $\mathcal{X}_t$ may depend on $t$.
A crucial result from basic probability theory for studying the properties of maximum likelihood estimation is the (conditional) Kullback-Leibler information inequality: for any nonnegative function $f_t(\cdot|x_t)$ such that

$$\int_{\mathcal{Y}_t} f_t(y|x_t)\nu_t(\mathrm{d}y) \leq 1, \quad \text{all } x_t \in \mathcal{X}_t, \qquad (5.3)$$

it can be shown that

$$\int_{\mathcal{Y}_t} \log[p_t^0(y|x_t)/f_t(y|x_t)]\,p_t^0(y|x_t)\,\nu_t(\mathrm{d}y) \geq 0, \quad \text{all } x_t \in \mathcal{X}_t \qquad (5.4)$$

[see, for example, Manski (1988), White (1993) and Newey and McFadden (Lemma 2.2)]. Now suppose that one specifies a parametric model for $p_t^0(\cdot|x_t)$,

$$\{f_t(\cdot|x_t;\theta): \theta \in \Theta, \Theta \subset \mathbb{R}^P\}, \qquad (5.5)$$

which satisfies (5.3) for all $\theta \in \Theta$. Model (5.5) is said to be correctly specified if, for some $\theta_0 \in \Theta$,

$$f_t(\cdot|x_t;\theta_0) = p_t^0(\cdot|x_t), \quad \text{all } x_t \in \mathcal{X}_t, \quad t = 1,2,\ldots. \qquad (5.6)$$

It follows from (5.4) that if the model is correctly specified then for all $t = 1,2,\ldots$,

$$E[\ell_t(y_t,x_t,\theta_0)|x_t] \geq E[\ell_t(y_t,x_t,\theta)|x_t], \quad \text{all } \theta \in \Theta, \qquad (5.7)$$

where

$$\ell_t(y_t,x_t,\theta) \equiv \log f_t(y_t|x_t;\theta) \qquad (5.8)$$

is the conditional log-likelihood for observation $t$. From (5.7) and the law of iterated expectations, $\theta_0$ solves, for each $t$,

$$\max_{\theta\in\Theta} E[\ell_t(y_t,x_t,\theta)]. \qquad (5.9)$$

The fact that the vector of interest, $\theta_0$, solves (5.9) shows that maximum likelihood estimation is Fisher consistent. Importantly, this holds for any conditioning variables $x_t$; there is no presumption that the dynamics are completely specified or that some sort of strict exogeneity holds.
The consistency of the maximum likelihood estimator now follows in a straightforward manner from the consistency of M-estimators. In the notation of Section 4, $q_t(w_t,\theta) = -\ell_t(y_t,x_t,\theta)$. The maximum likelihood estimator $\hat\theta_T$ solves

$$\min_{\theta\in\Theta} \sum_{t=1}^T -\ell_t(y_t,x_t,\theta). \qquad (5.10)$$

Theorem 5.1. Weak consistency of MLE

Let $\{(x_t,y_t): t = 1,2,\ldots\}$ be a sequence of random vectors with $x_t \in \mathcal{X}_t$, $y_t \in \mathcal{Y}_t \subset \mathbb{R}^G$. Let $\Theta \subset \mathbb{R}^P$ be the parameter set and denote the parametric model of the conditional density as in (5.5). We assume that this is a density with respect to the measure $\nu_t(\mathrm{d}y)$ for all $x_t$ and $\theta$:

$$\int_{\mathcal{Y}_t} f_t(y|x_t;\theta)\nu_t(\mathrm{d}y) = 1, \quad \text{all } x_t \in \mathcal{X}_t, \; \theta \in \Theta. \qquad (5.11)$$

Assume further that

MLE.1: (i) $\Theta$ is compact.
(ii) $\ell_t$ satisfies the standard measurability and continuity requirements on $\mathcal{Y}_t \times \mathcal{X}_t \times \Theta$.
MLE.2: $\{\ell_t(y_t,x_t,\theta): t = 1,2,\ldots\}$ satisfies the UWLLN on $\Theta$.
MLE.3: (i) For some $\theta_0 \in \Theta$,

$$p_t^0(\cdot|x_t) = f_t(\cdot|x_t;\theta_0), \quad x_t \in \mathcal{X}_t, \; t = 1,2,\ldots;$$

(ii) $\theta_0$ is the unique solution to

$$\max_{\theta\in\Theta} \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[\ell_t(y_t,x_t,\theta)]. \qquad (5.12)$$

Then there exists a solution to (5.10), the MLE $\hat\theta_T$, and $\operatorname{plim} \hat\theta_T = \theta_0$.

From the discussion above. MLE.3(i) ensures that 8, solves (5.12), but it does
not guarantee uniqueness; this depends on the distribution of x,, so at this level
we simply assume uniqueness in MLE.3(ii).
To derive the asymptotic normality of the MLE, we assume that I,(.) is twice
continuously differentiable on the interior of 0 and B,Eint(O). Define the score
and Hessian for observation t by

s,(O)= s,(w,, 0) = - V&r(W,, ey,


/l,(8) = h,(w,, 0) = V&(U) = - v;&v,, O),

where W, = (xi, y:). From the general M-estimator result we would like to show
that the score evaluated at 0, has expectation zero:

~Cs,(~,)l= 0. (5.13)

By the law of iterated expectations, (5.13) certainly holds if

(5.14)

Now, for any 0~69.

where E,(. IX,) denotes expectation with respect to the density fJ.(x,; 0). If inte-
gration and differentiation can be interchanged on int(O) in the sense that

(5.16)

for all x,EX,, QEint(O), then, by (5.1 l), the right hand side of (5.16) is necessarily zero.
But V,l,(y,x,, @j;(y(x,; 0) = V,,f,(ylx,; (I), which by (5.15) implies that Es[s,(B)lx,] =
0, BEint(O). Substituting U, for 0 now yields (5.14).
Given (5.13), the central limit theorem generally implies that T-1~2~~~,s,(0,)
is asymptotically distributed as Normal(0, B,), where

(5.17)

Define $A_0 = A(\theta_0)$ to be the limit of the expected value of the Hessian evaluated at the true parameter:

$$A_0 \equiv \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[h_t(\theta_0)]. \qquad (5.18)$$

Provided that $\theta_0$ is identified, $A_0$ will be positive definite; in fact, under an additional regularity condition it can be shown that $E[h_t(\theta_0)]$ is equal to the variance of $s_t(\theta_0)$, so $A_0$ is at least positive semi-definite. The additional condition needed is that, for all $x_t \in \mathcal{X}_t$, $\theta \in \operatorname{int}(\Theta)$,

$$\nabla_\theta \int_{\mathcal{Y}_t} s_t(y,x_t,\theta)f_t(y|x_t;\theta)\nu_t(\mathrm{d}y) = \int_{\mathcal{Y}_t} \nabla_\theta[s_t(y,x_t,\theta)f_t(y|x_t;\theta)]\nu_t(\mathrm{d}y). \qquad (5.19)$$

This is simply another assumption about interchanging an integral and a derivative, which is valid in regular cases. Taking the derivative of the identity

$$\int_{\mathcal{Y}_t} s_t(y,x_t,\theta)f_t(y|x_t;\theta)\nu_t(\mathrm{d}y) = 0$$

and using (5.19) yields, for all $\theta \in \operatorname{int}(\Theta)$,

$$E_\theta[h_t(\theta)|x_t] = \operatorname{Var}_\theta[s_t(\theta)|x_t]. \qquad (5.20)$$

Equation (5.20) is called the conditional information matrix equality for observation $t$. This equality has important consequences for estimating the asymptotic variance of the MLE. One point worth emphasizing is that (5.20) holds for any set of conditioning variables $x_t$, and we have not assumed that $f_t(\cdot|x_t,\theta_0)$ is also the density of $y_t$ conditional on $x_t$ and the observable past $(y_{t-1},x_{t-1},\ldots,y_1,x_1)$. Evaluating (5.20) at $\theta_0$ and taking expectations yields

$$E[h_t(\theta_0)] = \operatorname{Var}[s_t(\theta_0)], \qquad (5.21)$$

which is best thought of as the unconditional information matrix equality for each observation $t$. This shows that, under standard regularity conditions, the M-estimation Assumption M.9 holds for MLE.

Theorem 5.2. (Asymptotic normality of MLE)

Let the conditions of Theorem 5.1 hold. In addition, assume

MLE.4: (i) $\theta_0 \in \operatorname{int}(\Theta)$;
(ii) $\ell_t$ satisfies the standard measurability and second order differentiability conditions on $\mathcal{Y}_t \times \mathcal{X}_t \times \Theta$;
(iii) the interchanges of derivative and integral in (5.16) and (5.19) hold for all $\theta \in \operatorname{int}(\Theta)$.
MLE.5: (i) $A_0$ defined by (5.18) is positive definite;
(ii) $\{\nabla^2_\theta \ell_t(y_t,x_t,\theta)\}$ satisfies the UWLLN.
MLE.6: $\{s_t(\theta_0): t = 1,2,\ldots\}$ satisfies the CLT with asymptotic variance $B_0$ given by (5.17).
Then

$$\sqrt{T}(\hat\theta_T - \theta_0) \xrightarrow{d} \text{Normal}(0, A_0^{-1}B_0A_0^{-1}). \qquad (5.22)$$

While Assumption MLE.4(iii) is often satisfied in applied time series analysis, it does rule out some cases where the (conditional) support of $y_t$ depends on the parameters $\theta_0$. White (1993) presents a slightly more general result where the support $\mathcal{Y}_t$ can depend on $\theta$ provided it does so in a smooth enough manner such that (5.16) and (5.19) hold.
It is possible to confuse (5.21) with the more traditional information matrix equality. To see the difference, define $J_0$ as in Section 4.5 but without the nuisance parameters:

$$J_0 \equiv \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[s_t(\theta_0)s_t(\theta_0)']. \qquad (5.23)$$

Then, by (5.18) and (5.21), it is trivially true that $A_0 = J_0$. But the traditional information matrix equality says that (for all $T$)

$$T^{-1}\sum_{t=1}^T E[h_t(\theta_0)] = \operatorname{Var}\left(T^{-1/2}\sum_{t=1}^T s_t(\theta_0)\right). \qquad (5.24)$$

It is easily seen that (5.24) is not implied by the conditions of Theorem 5.2 because these conditions do not imply that the score is serially uncorrelated. In other words, for MLE as we have defined it, $B_0$ and $A_0$ need not be equal. If $\{s_t(\theta_0)\}$ is serially uncorrelated then the traditional information matrix equality holds, $A_0 = B_0 = J_0$, and the asymptotic variance of the MLE simplifies to its well-known form. We state this as a lemma.

Lemma 5.1

Let the assumptions of Theorem 5.2 hold, and assume, in addition,

MLE.7: For $t = 1,2,\ldots$, $E[s_t(\theta_0)s_{t+j}(\theta_0)'] = 0$, $j \geq 1$.

Then $A_0 = B_0 = J_0$ and $\sqrt{T}(\hat\theta_T - \theta_0) \xrightarrow{d} \text{Normal}(0, A_0^{-1})$.
Assumption MLE.7 always holds when the observations are independently distributed, which is why it never appears for MLE with independently distributed data. With dependent observations, if MLE.7 holds it is usually because the model captures all of the dynamics in the following sense.

Definition 5.1

The model is said to be dynamically complete in distribution if

$$D(y_t|x_t,\Phi_{t-1}) = D(y_t|x_t), \qquad (5.25)$$

where $\Phi_{t-1} \equiv (y_{t-1},x_{t-1},\ldots,y_1,x_1)$ is the information observed at time $t-1$. Often we simply say the density is dynamically complete when (5.25) holds.

Note that this definition allows $x_t$ and $\Phi_{t-1}$ to overlap, as happens if $x_t$ contains lags of $y_t$ or lags of some other variables $z_t$. For example, if $x_t = (z_t,y_{t-1},z_{t-1})$, then $\Phi_{t-1} = (y_{t-1},z_{t-1},\ldots,y_1,z_1)$.

Lemma 5.2

If the model is dynamically complete in distribution then the score $\{s_t(\theta_0): t = 1,2,\ldots\}$ is a martingale difference sequence with respect to the information sets $\{\Phi_t: t = 1,2,\ldots\}$. Consequently, MLE.7 holds.

This lemma is easily proven: (5.14) and (5.25) imply that

$$E[s_t(\theta_0)|x_t,\Phi_{t-1}] = 0,$$

so that $E[s_t(\theta_0)|\Phi_{t-1}] = 0$ by iterated expectations. Because $s_t(\theta_0)$ is a function of $\Phi_t$, $\{s_t(\theta_0)\}$ is a martingale difference sequence with respect to $\{\Phi_t\}$. It is a simple consequence of the law of iterated expectations that a martingale difference sequence is serially uncorrelated.
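Spelling out that last step: for $j \geq 1$, $s_t(\theta_0)$ is a function of $\Phi_t \subset \Phi_{t+j-1}$, so

$$E[s_{t+j}(\theta_0)s_t(\theta_0)'] = E\{E[s_{t+j}(\theta_0)|\Phi_{t+j-1}]\,s_t(\theta_0)'\} = 0,$$

which is exactly MLE.7.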
Dynamic completeness (which hinges crucially on the choice of the conditioning variables $x_t$) has a simple interpretation: if interest lies in explaining $y_t$ in terms of past $y$ and possibly current and past values of some other sequence, say $\{z_t\}$, then enough lags of $y$ and $z$ have been included in the conditioning variables $x_t$ to capture the entire dependence on the past. Often, but not always, it is assumed that a fixed number of lags of observables is sufficient for modelling dynamics, and (5.25) makes precise the notion that enough lags have been included. For example, suppose that the conditioning variables are chosen as $x_t = (y_{t-1},z_{t-1})$. If this specification of $x_t$ is dynamically complete then

$$D(y_t|y_{t-1},z_{t-1},\ldots,y_1,z_1) = D(y_t|y_{t-1},z_{t-1}),$$

so that there are only first order dynamics. As another example suppose that $x_t = z_t$. Then dynamic completeness requires the fairly strong assumption

$$D(y_t|z_t,y_{t-1},z_{t-1},\ldots,y_1,z_1) = D(y_t|z_t). \qquad (5.26)$$

In other words, the static relationship is also the dynamic relationship. That this is rarely true is perhaps what prompts Hendry et al. (1984, p. 1043) to state that static models "... rarely provide useful approximations to time-series data processes." But one should keep in mind that $D(y_t|z_t)$ might be of economic interest even if (5.26) is false.
Because we are allowing the dimension of $x_t$ to grow with $t$ we can always choose $x_t \equiv (z_t,y_{t-1},z_{t-1},\ldots,y_1,z_1)$ or $x_t \equiv (y_{t-1},z_{t-1},\ldots,y_1,z_1)$ or $x_t \equiv (y_{t-1},y_{t-2},\ldots,y_1)$ to ensure dynamic completeness, assuming of course that a correctly specified parametric model of the conditional density can be found.
Most of the earlier work on maximum likelihood estimation with dependent processes assumes dynamic completeness, in which case a martingale difference central limit theorem can be applied to $\{s_t(\theta_0): t = 1,2,\ldots\}$. See, for example, Roussas (1972), Basawa et al. (1976), Crowder (1976), Hall and Heyde (1980) and Heijmans and Magnus (1986), among others. The popular prediction error decomposition method of building up the likelihood function from conditional distributions [for example, Hendry and Richard (1983), Hendry et al. (1984) and Harvey (1990, Section 3.5)] is a special case of specifying a dynamically complete model: for each $t$, the density $f_t(\cdot|x_t;\theta_0)$ represents the density of $y_t$ (or of the prediction errors) given all past observable information on $y$ and perhaps on current and past values of $z$.
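For concreteness, in the purely dynamic case (no $z$; a sketch, with the convention that the $t = 1$ term conditions on nothing) the prediction error decomposition simply factors the joint density into one-step-ahead conditionals, so the sample log-likelihood is a sum of dynamically complete terms:

$$\log f(y_1,\ldots,y_T;\theta) = \sum_{t=1}^T \log f_t(y_t|y_{t-1},\ldots,y_1;\theta) = \sum_{t=1}^T \ell_t(\theta).$$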
As usual, estimating $\operatorname{Avar} \sqrt{T}(\hat\theta_T - \theta_0)$ requires estimating $A_0$ and $B_0$. First consider estimation of the matrix $A_0$. There are at least three possible estimators of $A_0$ in the MLE context, each of which is valid whether or not the model is dynamically complete. The first estimator, based on the Hessian, was encountered in the general M-estimation setting in Section 4, but it is now useful to have a separate notation for it. Define

$$\hat H_T \equiv T^{-1}\sum_{t=1}^T h_t(\hat\theta_T), \qquad (5.27)$$

where we recall that $h_t$ generally depends on $y_t$ and $x_t$. Under the conditions of Theorem 5.2, $\hat H_T \xrightarrow{p} A_0$. This estimator is generally thought to have good finite sample properties, but it does require second derivatives of the conditional log-likelihood function and it is not guaranteed to be even positive semi-definite in sample (although it is positive definite with probability approaching one).
A second estimator is the BHHH [Berndt et al. (1974)] estimator, which is based on the information matrix equality (5.21). If we add to the conditions of Theorem 5.2 that $\{s_t(\theta)s_t(\theta)': t = 1,2,\ldots\}$ satisfies the UWLLN then

$$\hat J_T \equiv T^{-1}\sum_{t=1}^T s_t(\hat\theta_T)s_t(\hat\theta_T)' \xrightarrow{p} A_0. \qquad (5.28)$$

This estimator has the advantage of requiring only the first derivatives of the conditional log-likelihood function, and it is guaranteed to be at least positive semi-definite. However, it is also known to have poor finite sample properties in some situations. [See MacKinnon and White (1985) for linear models, Davidson and MacKinnon (1984) for logit and probit, and Bollerslev and Wooldridge (1992) for QMLE.]
A third possible estimator was given in the M-estimation section. Let $a_t(x_t,\theta) \equiv E[h_t(w_t,\theta)|x_t] = -E[\nabla^2_\theta \ell_t(\theta)|x_t]$. Then

$$\hat A_T \equiv T^{-1}\sum_{t=1}^T a_t(x_t,\hat\theta_T) \qquad (5.29)$$

is a consistent estimator of $A_0$ provided $\{a_t(x_t,\theta)\}$ satisfies the UWLLN. If the conditional expectation needed to obtain $\hat A_T$ is in closed form (as it is in some leading cases to be discussed in Section 6), $\hat A_T$ has some attractive features. First, it often depends only on first derivatives of a conditional mean and/or conditional variance function. Second, $\hat A_T$ is guaranteed to be at least positive semi-definite because of the conditional information matrix equality (5.20). Third, $\hat A_T$ has been found to have significantly better finite sample properties than $\hat J_T$ in situations where $a_t(x_t,\theta)$ can be obtained analytically.
If the model is dynamically complete or MLE.7 holds for some other reason there is nothing further to do. The asymptotic variance of $\hat\theta_T$ is estimated as one of the three matrices $\hat H_T^{-1}/T$, $\hat J_T^{-1}/T$, and $\hat A_T^{-1}/T$.
Things are more complicated if MLE.7 does not hold because $B_0$ depends on the autocovariances of $\{s_t(\theta_0)\}$. A serial-correlation-robust estimator using $\{\hat s_t\}$ should be used, for example that given in equation (4.30). With $\hat B_T$ consistent for $B_0$ under MLE.1-MLE.6, consistent estimators of $\operatorname{Avar} \sqrt{T}(\hat\theta_T - \theta_0)$ are given by

$$\hat H_T^{-1}\hat B_T\hat H_T^{-1}, \quad \hat J_T^{-1}\hat B_T\hat J_T^{-1}, \quad \hat A_T^{-1}\hat B_T\hat A_T^{-1}. \qquad (5.30)$$

We will not explicitly cover the case with nuisance parameters that affect the asymptotic distribution of $\sqrt{T}(\hat\theta_T - \theta_0)$, but these are easily reasoned from the general M-estimator results.
The three tests covered in the M-estimation section apply directly to the MLE context. The Wald statistic for testing $H_0: c(\theta_0) = 0$ is given in equation (4.37). If the model is dynamically complete under $H_0$, $\hat V_T$ can be based on (5.27), (5.28) or (5.29). If the application is to time series with incomplete dynamics then $\hat V_T$ can be taken to be one of the estimators in (5.30).
Define the log-likelihood function for the entire sample by $\mathcal{L}_T(\theta) \equiv \sum_{t=1}^T \ell_t(\theta)$. Let $\hat\theta_T$ be the unrestricted estimator, and let $\tilde\theta_T$ be the estimator with the $Q$ nonredundant constraints imposed. Then, provided MLE.7 holds and the additional assumptions in Lemma 4.3 hold, the likelihood ratio statistic

$$LR_T = 2[\mathcal{L}_T(\hat\theta_T) - \mathcal{L}_T(\tilde\theta_T)] \qquad (5.31)$$

is distributed asymptotically as $\chi^2_Q$ under $H_0$; this follows immediately from Lemma 4.3 as M.8 and M.9 are satisfied. Recall that there is no known correction to this statistic that has an asymptotic distribution free of nuisance parameters when MLE.7 is violated, so the LR statistic should be used only when the dynamics have been completely specified under $H_0$.
The LM test follows exactly as in the general M-estimator case. If we assume dynamic completeness under the null hypothesis then the three possible versions of $LM_T$ are (4.46), (4.48), and (4.48) with the conditional expectation estimator (5.29) in place of the Hessian-based $\tilde A_T$, where all quantities are evaluated at the restricted estimator $\tilde\theta_T$. Under $H_0$ and dynamic completeness, $LM_T \xrightarrow{d} \chi^2_Q$. In the case outlined in Section 4.6, $\theta_0$ may be on the boundary of $\Theta$ under $H_0$. If MLE.7 is not maintained under $H_0$ then the robust LM statistic given in (4.44) should be computed.
It is possible in this framework to cover a broader class of tests that includes tests against nonnested alternatives, encompassing tests, Hausman tests and various information matrix tests. White (1987, 1993) gives a very general treatment for possibly misspecified dynamic models.

6. Quasi-maximum likelihood estimation (QMLE)

In this section we cover estimation of the first two conditional moments of $y_t$ given $x_t$. Section 6.1 covers the case where the conditional mean is of interest. We consider multivariate weighted nonlinear least squares estimation of the parameters of $E(y_t|x_t)$, which covers many of the models used in applied time series analysis (including vector ARMA models). These methods can be applied whenever $E(y_t|x_t)$ can be obtained in parametric form, regardless of the underlying structure.
Section 6.2 covers the case when the conditional mean and conditional variance are jointly estimated. The multivariate normal distribution is known to have robustness properties for estimating the parameters indexing the conditional mean and conditional variance. These results are intended primarily for the case when the mean cannot be separated out from the variance; if $E(y_t|x_t)$ can be estimated without specifying $\operatorname{Var}(y_t|x_t)$ then, at least for robustness reasons, the methods of Section 6.1 are preferred for estimating the conditional mean parameters. [See Pagan and Sabau (1987) for an illustration of the inconsistency of the normal MLE for the conditional mean parameters in a univariate linear model when the variance is misspecified.] Of course if one is confident about the specification of the first two moments the methods of Section 6.2 might be preferred on efficiency grounds over those in Section 6.1, but this depends on auxiliary assumptions.

6.1. Conditional mean estimation

We first consider estimating a correctly specified model of a conditional mean using multivariate weighted nonlinear least squares (MWNLS). These results build on the work of Hannan (1971), Robinson (1972), Klimko and Nelson (1978), White and Domowitz (1984), Gallant (1987) and others for multivariate nonlinear regression. We allow for a nonconstant weighting matrix in order to include procedures asymptotically equivalent to QMLE in the linear exponential family [see Gourieroux et al. (1984) and White (1993)].
As in Section 5 let $y_t$ be a $G \times 1$ vector of variables to be explained by the conditioning variables $x_t$, which, as always, may contain lagged dependent variables and other conditioning variables. Let $\{m_t(x_t,\theta): x_t \in \mathcal{X}_t, \theta \in \Theta \subset \mathbb{R}^P\}$ be a model for $E(y_t|x_t)$. We will assume that the model is correctly specified in the following sense:

Assumption WNLS.1

For some $\theta_0 \in \Theta \subset \mathbb{R}^P$,

$$E(y_t|x_t) = m_t(x_t,\theta_0), \quad t = 1,2,\ldots. \qquad (6.1)$$

The weighted nonlinear least squares (WNLS) estimator solves

$$\min_{\theta\in\Theta} \sum_{t=1}^T \tfrac{1}{2}[y_t - m_t(x_t,\theta)]'\{W_t(x_t,\hat\gamma_T)\}^{-1}[y_t - m_t(x_t,\theta)], \qquad (6.2)$$

where $W_t(x_t,\gamma)$ is a $G \times G$ symmetric, positive definite (with probability one) matrix that can depend on the conditioning variables $x_t$ and an $R \times 1$ vector of parameters $\gamma$. When $G = 1$ and $W_t(x_t,\gamma) = \gamma = \sigma^2$ the problem reduces to univariate nonlinear least squares. We have put the $\tfrac{1}{2}$ in the objective function to simplify the gradient formulas and to make the quasi-LR statistic derived in Section 4.6 directly applicable.
The motivation for using MWNLS to estimate the parameters of $E(y_t|x_t)$ is that $W_t(x_t,\hat\gamma_T)$ is thought to be a consistent estimator of $\operatorname{Var}(y_t|x_t)$. For example, if $W_t(x_t,\hat\gamma_T) = \hat\Omega_T$ (so that $\operatorname{Var}(y_t|x_t) = \Omega_0$ is the nominal variance assumption), then we obtain the nonlinear seemingly unrelated regressions (SUR) estimator
[Gallant (1987), Robinson (1972)]. Most of the time $\hat\Omega_T$ would be obtained as

$$\hat\Omega_T = T^{-1}\sum_{t=1}^T (y_t - m_t(\theta_T^+))(y_t - m_t(\theta_T^+))' \equiv T^{-1}\sum_{t=1}^T u_t^+u_t^{+\prime}, \qquad (6.3)$$

where $\theta_T^+$ is the MNLS estimator of $\theta_0$ and $\{u_t^+\}$ are the $G \times 1$ MNLS residuals. In general, $\hat\gamma_T$ can be any estimator that is $\sqrt{T}$-consistent for its plim; namely, for some $\gamma^* \in \Gamma \subset \mathbb{R}^R$, $\sqrt{T}(\hat\gamma_T - \gamma^*) = O_p(1)$. (In most applications $\hat\gamma_T$ is either a preliminary $\sqrt{T}$-consistent estimator of $\theta_0$, such as the NLS estimator, or it comes from a regression procedure using $u_t^+u_t^{+\prime}$ as the dependent variable and functions of $x_t$ as the independent variables, or some combination of these.) Although we will discuss the case where the variance function is correctly specified, we are also interested in performing inference under variance misspecification.
Under WNLS.1, define the $G \times 1$ errors as

$$u_t \equiv y_t - m_t(x_t,\theta_0), \quad t = 1,2,\ldots. \qquad (6.4)$$

Keep in mind that, under WNLS.1, we can only conclude that $E(u_t|x_t) = 0$ and $\operatorname{Var}(u_t|x_t) = \operatorname{Var}(y_t|x_t)$; the $u_t$ are not necessarily serially uncorrelated or independent of $x_t$. This leads to the tautological model

$$y_t = m_t(x_t,\theta_0) + u_t, \qquad (6.5)$$
$$E(u_t|x_t) = 0. \qquad (6.6)$$

Define the MWNLS objective function for observation $t$ as

$$q_t(w_t,\theta;\gamma) = \tfrac{1}{2}[y_t - m_t(x_t,\theta)]'\{W_t(x_t,\gamma)\}^{-1}[y_t - m_t(x_t,\theta)]. \qquad (6.7)$$

By replacing $y_t$ in (6.7) with the right hand side of (6.5) and using a little algebra along with (6.6), we can write

$$E[q_t(w_t,\theta;\gamma)|x_t] = \tfrac{1}{2}\operatorname{tr}[\{W_t(x_t,\gamma)\}^{-1}\Omega_t^0(x_t)] + \tfrac{1}{2}[m_t(x_t,\theta_0) - m_t(x_t,\theta)]'\{W_t(x_t,\gamma)\}^{-1}[m_t(x_t,\theta_0) - m_t(x_t,\theta)], \qquad (6.8)$$

where $\operatorname{Var}(y_t|x_t) = \Omega_t^0(x_t)$ is the actual conditional variance of $y_t$ given $x_t$. Because the first term in (6.8) does not depend on $\theta$ (and the second is a nonnegative quadratic form that vanishes at $\theta = \theta_0$), it is clear that, for any $\gamma \in \Gamma$, $E[q_t(w_t,\theta;\gamma)|x_t] \geq E[q_t(w_t,\theta_0;\gamma)|x_t]$, all $\theta \in \Theta$. By iterated expectations

$$E[q_t(w_t,\theta;\gamma)] \geq E[q_t(w_t,\theta_0;\gamma)], \quad \theta \in \Theta. \qquad (6.9)$$

In particular, this inequality holds for $\operatorname{plim} \hat\gamma_T = \gamma^*$, which establishes Fisher consistency of the multivariate weighted NLS estimator under WNLS.1. (This is one of those cases alluded to in Section 4 where Fisher consistency holds for any value of the nuisance parameter.)
We do not write down formal consistency and asymptotic normality results, as these follow from the results on M-estimation. Rather, we focus on the key assumptions that influence how inference is done on $\theta_0$. Because of the Fisher consistency result we have essentially shown that, under WNLS.1 and regularity conditions, MWNLS consistently estimates $\theta_0$.
Next define the score for observation $t$ as

$$s_t(\theta;\gamma) = \nabla_\theta q_t(\theta;\gamma)' = -\nabla_\theta m_t(x_t,\theta)'\{W_t(x_t,\gamma)\}^{-1}[y_t - m_t(x_t,\theta)] = -\nabla_\theta m_t(\theta)'\{W_t(\gamma)\}^{-1}u_t(\theta). \qquad (6.10)$$

Since $u_t = u_t(\theta_0)$, it is easily seen under WNLS.1 that $E[s_t(\theta_0;\gamma)|x_t] = 0$ for all $\gamma \in \Gamma$. Also under WNLS.1 (and smoothness conditions on $W_t(x_t,\gamma)$)

$$E[\nabla_\gamma s_t(\theta_0,\gamma)|x_t] = 0, \quad \text{all } \gamma \in \Gamma, \qquad (6.11)$$

which implies the convenient M-estimation assumption M.7. Thus, the asymptotic variance of $\hat\theta_T$ does not depend on that of $\hat\gamma_T$ provided that $\sqrt{T}(\hat\gamma_T - \gamma^*) = O_p(1)$. Under WNLS.1 and the M-estimation regularity conditions, $\sqrt{T}(\hat\theta_T - \theta_0)$ is asymptotically normally distributed with

$$\operatorname{Avar} \sqrt{T}(\hat\theta_T - \theta_0) = A_0^{-1}B_0A_0^{-1}, \qquad (6.12)$$

where we can write

$$A_0 = \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[\nabla_\theta m_t(\theta_0)'\{W_t(\gamma^*)\}^{-1}\nabla_\theta m_t(\theta_0)] \qquad (6.13)$$

and $B_0$ is given by (4.7). (This expression for $A_0$ can be derived by noting that the term in $\nabla^2_\theta q_t(\theta;\gamma)$ depending on $\nabla^2_\theta m_t(x_t,\theta)$ and $u_t(\theta)$ has a zero conditional mean at $\theta = \theta_0$.) Under WNLS.1 and regularity conditions, a consistent estimator of $A_0$ is the positive semi-definite matrix

$$\hat A_T = T^{-1}\sum_{t=1}^T \nabla_\theta m_t(\hat\theta_T)'\{W_t(\hat\gamma_T)\}^{-1}\nabla_\theta m_t(\hat\theta_T). \qquad (6.14)$$

As usual, estimation of B, must generally account for possible serial correlation


in s,(8,;y*); from (6.10) we see that the serial correlation in the score depends on
the serial correlation in {Us}. Straightforward application of the law of iterated
expectations shows that Assumption M.8 is implied by the following assumption.
Assumption WNLS.2

For all $t\ge1$, $E(u_tu_{t+j}'|x_t,x_{t+j}) = 0$, $j\ge1$.

Definition 6.1

The model specification is said to be dynamically complete in mean if

$$ E(y_t|x_t,\Phi_{t-1}) = E(y_t|x_t), \tag{6.15} $$

where $\Phi_{t-1} = (y_{t-1},x_{t-1},\ldots,y_1,x_1)$.

We have the following simple lemma relating Definition 6.1 and Assumption WNLS.2.

Lemma 6.1

If (6.15) holds then $\{u_t\colon t = 1,2,\ldots\}$ is a martingale difference sequence with respect to $\{\Phi_t\}$, and so WNLS.2 holds.

In fact, we can say more. If $\{u_t\}$ is a martingale difference sequence with respect to $\{\Phi_t\}$ then so is $\{s_t(\theta_o;\gamma)\}$ for any $\gamma\in\Gamma$. Thus, for MWNLS only the conditional mean has to be dynamically complete for the score evaluated at $\theta_o$ and any value of $\gamma$ to be a martingale difference sequence (MDS). For MLE we derived the MDS property of the score under the assumption that the model for the conditional density of $y_t$ given $(x_t,\Phi_{t-1})$ was correct in its entirety.

Theorem 6.1

Under WNLS.1 and WNLS.2,

$$ \mathrm{Avar}\,\sqrt{T}(\hat\theta_T - \theta_o) = A_o^{-1}J_oA_o^{-1}, \tag{6.16} $$

where $A_o$ is given by (6.13) and $J_o$ is given by (4.17).

A consistent estimator of $J_o$ is the usual outer product estimator $\hat J_T$ given in (4.23). Under WNLS.1 and WNLS.2 an appropriate estimator of $\mathrm{Avar}(\hat\theta_T)$ is $\widehat{\mathrm{Avar}}(\hat\theta_T) = \hat A_T^{-1}\hat J_T\hat A_T^{-1}/T$.

Further simplifications arise if we have properly modelled $\mathrm{Var}(y_t|x_t)$, as given by the following assumption.

Assumption WNLS.3

For some $\gamma_o\in\Gamma$,

$$ \mathrm{Var}(y_t|x_t) = W_t(x_t,\gamma_o), \qquad t = 1,2,\ldots. $$

When WNLS.3 holds we also assume that $\hat\gamma_T$ is $\sqrt{T}$-consistent for $\gamma_o$.

Theorem 6.2

Under WNLS.1–WNLS.3, $\mathrm{Avar}\,\sqrt{T}(\hat\theta_T - \theta_o) = A_o^{-1}$.

Under WNLS.1–WNLS.3, a simple estimator of $\mathrm{Avar}(\hat\theta_T)$ is

$$ \widehat{\mathrm{Avar}}(\hat\theta_T) = \hat A_T^{-1}/T = \Bigl[\sum_{t=1}^T \nabla_\theta m_t(\hat\theta_T)'\{W_t(x_t,\hat\gamma_T)\}^{-1}\nabla_\theta m_t(\hat\theta_T)\Bigr]^{-1}. \tag{6.17} $$

When $y_t$ is a scalar and $W_t(x_t,\gamma) = \gamma = \sigma^2$, $\hat\theta_T$ is the nonlinear least squares (NLS) estimator, and the usual estimator of its asymptotic variance is

$$ \widehat{\mathrm{Avar}}(\hat\theta_T) = \hat\sigma_T^2\Bigl(\sum_{t=1}^T \nabla_\theta m_t(\hat\theta_T)'\nabla_\theta m_t(\hat\theta_T)\Bigr)^{-1}, \tag{6.18} $$

where $\hat\sigma_T^2$ is the usual estimator of $\mathrm{Var}(y_t|x_t) = \mathrm{Var}(u_t|x_t) = \sigma_o^2$ based on the sum of squared residuals from NLS estimation. For emphasis, sufficient conditions for (6.18) to be valid are

$$ E(y_t|x_t,\Phi_{t-1}) = E(y_t|x_t) = m_t(x_t,\theta_o) \tag{6.19} $$

and

$$ \mathrm{Var}(y_t|x_t) = \sigma_o^2. \tag{6.20} $$

Interestingly, it is not required that $\mathrm{Var}(y_t|x_t,\Phi_{t-1}) = \mathrm{Var}(y_t|x_t)$, so that the variance need not be dynamically complete. For example, if $x_t$ does not contain lagged dependent variables then $\mathrm{Var}(y_t|x_t,\Phi_{t-1})$ can follow an ARCH process provided (6.20) is true; (6.18) is still a valid estimator of $\mathrm{Avar}(\hat\theta_T)$ even though there are neglected dynamics in the second moment of $u_t$.
For the general case where only WNLS.1 is assumed to hold, we need to estimate $B_o$ as in equation (4.30) with the score given by (6.10). As always, $\mathrm{Avar}(\hat\theta_T)$ is estimated by $\hat A_T^{-1}\hat B_T\hat A_T^{-1}/T$.

Testing is easily carried out as in previous sections. The Wald statistic is obtained from the results in Section 4.6, with the estimated variance matrix used depending on whether WNLS.2, WNLS.2 and WNLS.3, or neither of these assumptions is maintained.
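As a computational sketch of the sandwich $\hat A_T^{-1}\hat B_T\hat A_T^{-1}/T$ under WNLS.1 alone, the routine below uses a Bartlett-kernel (Newey–West type) long-run variance as one possible stand-in for the Section 4.5 estimators; all inputs are assumed to come from the fitted model.

```python
import numpy as np

def robust_avar(grads, Winvs, u, L):
    """Sandwich Avar(theta_hat) = A^{-1} B A^{-1} / T.
    grads: (T, G, P) gradients of m_t at theta_hat;
    Winvs: (T, G, G) inverses of W_t(gamma_hat);
    u:     (T, G) residuals;  L: lag truncation for B_hat."""
    T, G, P = grads.shape
    # A_hat as in (6.14)
    A = sum(grads[t].T @ Winvs[t] @ grads[t] for t in range(T)) / T
    # scores s_t = -grad m_t' W_t^{-1} u_t, as in (6.10)
    s = np.array([-grads[t].T @ Winvs[t] @ u[t] for t in range(T)])
    # Bartlett-weighted long-run variance of the score
    B = s.T @ s / T
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1.0)
        Gam = s[j:].T @ s[:-j] / T
        B += w * (Gam + Gam.T)
    Ainv = np.linalg.inv(A)
    return Ainv @ B @ Ainv / T
```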
The quasi-LR statistic, obtained as

$$ \mathrm{QLR}_T = 2\Bigl[\sum_{t=1}^T q_t(\tilde\theta_T;\hat\gamma_T) - \sum_{t=1}^T q_t(\hat\theta_T;\hat\gamma_T)\Bigr], \tag{6.21} $$
where $\tilde\theta_T$ is the restricted estimator and $\hat\theta_T$ is the unrestricted estimator, has an asymptotic $\chi_Q^2$ distribution under WNLS.1, WNLS.2 and WNLS.3. To ensure that M.9 holds (see Lemma 4.3), we must ensure that the objective function is properly computed. For example, when $y_t$ is a scalar and the variance is given by $\mathrm{Var}(y_t|x_t) = \sigma_o^2v_t(x_t,\delta_o)$ for some function $v_t(x_t,\delta)$, the function $q_t$ is given by

$$ q_t(w_t,\theta;\gamma) = \frac{[y_t - m_t(x_t,\theta)]^2}{2\sigma^2v_t(x_t,\delta)}. \tag{6.22} $$

Typically, once $\hat\delta_T$ and $\hat\sigma_T^2$ have been calculated, $\hat\theta_T$ is computed as

$$ \hat\theta_T = \arg\min_{\theta\in\Theta}\ \sum_{t=1}^T [y_t - m_t(x_t,\theta)]^2/\hat v_t, \tag{6.23} $$

where $\hat u_t = y_t - m_t(x_t,\hat\theta_T)$ and $\hat v_t = v_t(x_t,\hat\delta_T)$. Once the restricted estimator $\tilde\theta_T$ has been obtained, the QLR statistic can be computed as

$$ \mathrm{QLR}_T = \frac{\mathrm{SSR}_r - \mathrm{SSR}_{ur}}{\hat\sigma_T^2}, \tag{6.24} $$

where $\mathrm{SSR}_r$ is the sum of squares of the restricted weighted residuals $\{\tilde v_t^{-1/2}\tilde u_t\}$ and $\mathrm{SSR}_{ur}$ is the sum of squares of the unrestricted weighted residuals $\{\hat v_t^{-1/2}\hat u_t\}$. For NLS, (6.24) is $QT/(T-P)$ times the F-statistic covered in Gallant (1987, p. 56) for nonlinear least squares with fixed regressors and i.i.d. errors. This analysis shows that this F-statistic is valid under the weaker assumptions WNLS.1, WNLS.2 and WNLS.3, which allow for models with lagged dependent variables and heteroskedastic martingale difference sequences (provided the heteroskedasticity is properly modelled).
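A minimal sketch of the QLR computation in (6.23)–(6.24), assuming restricted and unrestricted weighted NLS fits are already in hand; the normalization $\hat\sigma_T^2 = \mathrm{SSR}_{ur}/T$ used below is the choice consistent with the $QT/(T-P)$ relation to the F-statistic noted above.

```python
import numpy as np

def qlr_stat(u_r, v_r, u_ur, v_ur):
    """Quasi-LR statistic (6.24): (SSR_r - SSR_ur) / sigma2_hat,
    with weighted residuals v^{-1/2} u and sigma2_hat = SSR_ur / T.
    u_r, v_r: restricted residuals and variance factors;
    u_ur, v_ur: unrestricted counterparts.
    Asymptotically chi^2_Q under WNLS.1-WNLS.3."""
    T = len(u_ur)
    ssr_r = np.sum(u_r**2 / v_r)
    ssr_ur = np.sum(u_ur**2 / v_ur)
    return (ssr_r - ssr_ur) / (ssr_ur / T)
```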
To compute the LM statistic for testing conditional mean hypotheses, only the first derivative of the conditional mean function evaluated at the restricted estimates is needed. The setup is the same as in Section 4.6; there are $Q$ possibly implicit restrictions on the $P\times1$ vector $\theta_o$. Let $\tilde\gamma_T$ be the nuisance parameter estimator used in computing the restricted estimator $\tilde\theta_T$. In using the LM statistics from Section 4.6 it is best to use as the estimate of $A_o$ the $P\times P$ matrix

$$ \tilde A_T = T^{-1}\sum_{t=1}^T \nabla_\theta\tilde m_t'\tilde W_t^{-1}\nabla_\theta\tilde m_t, \tag{6.25} $$

where $\nabla_\theta\tilde m_t = \nabla_\theta m_t(x_t,\tilde\theta_T)$ and $\tilde W_t \equiv W_t(x_t,\tilde\gamma_T)$. Under WNLS.1 only, $\tilde B_T$ should be a serial-correlation-robust estimator applied to $\{\tilde s_t\}$ and the robust LM statistic (4.44) should be used. This would be the case, for example, in testing a static or distributed lag relationship without also assuming that it is the dynamically complete conditional mean.

When WNLS.1 and WNLS.2 hold under $H_0$, which is often the case (tests of dynamic misspecification take the null to be complete dynamic specification), $\tilde B_T$ can be replaced in (4.44) by the outer product estimator $\tilde J_T$ in (4.45), where $\tilde s_t = -\nabla_\theta\tilde m_t'\tilde W_t^{-1}\tilde u_t$. The resulting statistic can be given a regression interpretation as in Wooldridge (1991b).
If we impose WNLS.1, WNLS.2 and WNLS.3 under the null then things are even easier. The outer product LM statistic in (4.46) is asymptotically valid, but this is not the best choice; the Hessian form (4.48) with $\tilde A_T$ in (6.25) is no more difficult to compute and has better finite sample properties. A statistic that is proportional to (4.48) (where the constant of proportionality tends in probability to unity under $H_0$) can be computed from the r-squared of a regression. Run the (stacked) OLS regression

$$ \tilde W_t^{-1/2}\tilde u_t \quad\text{on}\quad \tilde W_t^{-1/2}\nabla_\theta\tilde m_t, \qquad t = 1,2,\ldots,T. \tag{6.26} $$

Under WNLS.1, WNLS.2 and WNLS.3, $TG\cdot R^2$ is asymptotically $\chi_Q^2$ ($G$ is the dimension of $y_t$).

When $y_t$ is a scalar, $\mathrm{Var}(y_t|x_t)$ is typically specified as $\mathrm{Var}(y_t|x_t) = \sigma_o^2v_t(x_t,\delta_o)$. Then the LM statistic can be computed as $TR^2$ from

$$ \tilde v_t^{-1/2}\tilde u_t \quad\text{on}\quad \tilde v_t^{-1/2}\nabla_\theta\tilde m_t, \qquad t = 1,2,\ldots,T, \tag{6.27} $$

where $\tilde v_t = v_t(x_t,\tilde\delta_T)$. For NLS, $\tilde v_t = 1$.


As an example of how to set up a test in the current framework, consider testing for AR(1) serial correlation after NLS estimation. The null mean function is $f_t(g_t,\beta)$ for $g_t$ some subset of $(z_t,y_{t-1},z_{t-1},\ldots,y_1,z_1)$ (so that $g_t$ can contain lags of $y_t$), and the unrestricted mean function is $m_t(x_t,\theta) = f_t(g_t,\beta) + \rho[y_{t-1} - f_{t-1}(g_{t-1},\beta)]$, where $x_t = (g_t,y_{t-1},g_{t-1})$. Under $H_0\colon \rho_o = 0$, we obtain the NLS estimator of $\beta_o$, $\tilde\beta_T$, and the NLS residuals $\tilde u_t = y_t - f_t(g_t,\tilde\beta_T)$. Now $\tilde\theta_T = (\tilde\beta_T',0)'$ and $\nabla_\theta\tilde m_t = (\nabla_\beta\tilde f_t,\tilde u_{t-1})$. Because this is always a test of dynamic completeness, WNLS.2 holds under $H_0$. If we also impose (6.20) (WNLS.3 in this context), then an asymptotically $\chi_1^2$ statistic is $(T-1)R^2$ from the regression

$$ \tilde u_t \quad\text{on}\quad \nabla_\beta\tilde f_t,\ \tilde u_{t-1}, \qquad t = 2,3,\ldots,T. \tag{6.28} $$

Note that because $x_t$ contains at least $y_{t-1}$, (6.20) rules out ARCH or other forms of dynamic heteroskedasticity under $H_0$.

For applications of LM statistics and more on their robust forms, see Engle (1984), Godfrey (1988), Wooldridge (1991a) and MacKinnon (1992).
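A minimal sketch of the regression test (6.28); the NLS residuals and the gradient of the (application-specific) null mean function $f$ are assumed given, and the uncentered r-squared is used, one common convention for LM regressions.

```python
import numpy as np

def ar1_lm_test(u_tilde, grad_f):
    """LM test for AR(1) serial correlation after NLS, eq. (6.28):
    regress u_t on (grad_beta f_t, u_{t-1}) for t = 2,...,T and use
    (T-1) * R^2, asymptotically chi^2_1 under H0: rho = 0.
    u_tilde: (T,) NLS residuals; grad_f: (T, K) gradients of f."""
    Xr = np.column_stack([grad_f[1:], u_tilde[:-1]])
    yr = u_tilde[1:]
    beta, *_ = np.linalg.lstsq(Xr, yr, rcond=None)
    resid = yr - Xr @ beta
    r2 = 1.0 - resid @ resid / (yr @ yr)   # uncentered R^2
    return len(yr) * r2                    # (T-1) * R^2
```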
6.2. QMLE for the mean and variance

We now consider joint estimation of $E(y_t|x_t)$ and $\mathrm{Var}(y_t|x_t)$. As in the previous section, $y_t$ is $G\times1$ and $x_t\in\mathcal{X}_t$ is the set of conditioning variables that may contain lags of $y_t$. The conditional mean and variance functions are jointly parameterized by a $P\times1$ vector $\theta$:

$$ \{m_t(x_t,\theta)\colon x_t\in\mathcal{X}_t,\ \theta\in\Theta\subset\mathbb{R}^P\}, $$
$$ \{\Omega_t(x_t,\theta)\colon x_t\in\mathcal{X}_t,\ \theta\in\Theta\subset\mathbb{R}^P\}. $$

The subsequent analysis is carried out under the hypothesis that the first two conditional moments are correctly specified.

Assumption QMLE.1

For some $\theta_o\in\Theta$,

$$ E(y_t|x_t) = m_t(x_t,\theta_o), $$
$$ \mathrm{Var}(y_t|x_t) = \Omega_t(x_t,\theta_o), \qquad t = 1,2,\ldots. $$

We estimate $\theta_o$ using quasi-MLE under the nominal assumption that $y_t$ given $x_t$ is normally distributed. For observation $t$, the quasi-conditional log-likelihood apart from a constant is

$$ l_t(\theta) = -\tfrac{1}{2}\log|\Omega_t(x_t,\theta)| - \tfrac{1}{2}[y_t - m_t(x_t,\theta)]'\{\Omega_t(x_t,\theta)\}^{-1}[y_t - m_t(x_t,\theta)], $$

where $|\Omega_t(x_t,\theta)|$ denotes the determinant of $\Omega_t(x_t,\theta)$. Letting $u_t(\theta) \equiv y_t - m_t(x_t,\theta)$ denote the $G\times1$ residual function and suppressing the dependence of $\Omega_t(x_t,\theta)$ on $x_t$ yields the more concise expression

$$ l_t(\theta) = -\tfrac{1}{2}\log|\Omega_t(\theta)| - \tfrac{1}{2}u_t(\theta)'\{\Omega_t(\theta)\}^{-1}u_t(\theta). \tag{6.29} $$

The QMLE $\hat\theta_T$ is obtained by maximizing the normal quasi-log-likelihood function $\mathcal{L}_T(\theta) = \sum_{t=1}^T l_t(\theta)$. Under QMLE.1 it can be shown that $E[l_t(\theta_o)|x_t] \ge E[l_t(\theta)|x_t]$ for all $\theta\in\Theta$ and for all $x_t\in\mathcal{X}_t$. This result has been established in special cases by Weiss (1986) and was shown to hold generally by Bollerslev and Wooldridge (1992). Its important consequence is that $\theta_o$ solves

$$ \max_{\theta\in\Theta} E[\mathcal{L}_T(\theta)] $$

under QMLE.1, so that QMLE based on the normal conditional log-likelihood is Fisher consistent. The weak consistency of the QMLE $\hat\theta_T$ for $\theta_o$ under QMLE.1 and regularity conditions follows from the usual analogy principle argument. As usual, the uniform weak law of large numbers applied to $\{l_t(y_t,x_t,\theta)\colon t = 1,2,\ldots\}$ underlies such a result, which requires moment and continuity assumptions on $m_t(x_t,\cdot)$ and $\Omega_t(x_t,\cdot)$ and the assumption that $l_t(\theta)$ is not too dependent on the past.
If $m_t(x_t,\cdot)$ and $\Omega_t(x_t,\cdot)$ are differentiable on $\Theta$ for all relevant $x_t$, and if $\Omega_t(x_t,\theta)$ is nonsingular with probability one for all $\theta\in\Theta$, then differentiation of (6.29) yields the $P\times1$ score function $s_t(\theta)$:

$$ s_t(\theta) \equiv -\nabla_\theta l_t(\theta)' = -\Bigl\{\nabla_\theta m_t(\theta)'\Omega_t^{-1}(\theta)u_t(\theta) + \tfrac{1}{2}\nabla_\theta\Omega_t(\theta)'\bigl[\Omega_t^{-1}(\theta)\otimes\Omega_t^{-1}(\theta)\bigr]\mathrm{vec}\bigl[u_t(\theta)u_t(\theta)' - \Omega_t(\theta)\bigr]\Bigr\}, \tag{6.30} $$

where $\nabla_\theta m_t(\theta)$ is the $G\times P$ derivative of $m_t$ and $\nabla_\theta\Omega_t(\theta)$ is the $G^2\times P$ derivative of $\Omega_t(\theta)$. By definition, under QMLE.1, $E(u_t|x_t) = 0$ and $E(u_tu_t'|x_t) = \Omega_t(x_t,\theta_o)$. It follows that, under correct specification of the first two conditional moments of $y_t$ given $x_t$, $E[s_t(\theta_o)|x_t] = 0$. This is an alternative statement of Fisher consistency of QMLE.
To estimate $\mathrm{Avar}\,\sqrt{T}(\hat\theta_T - \theta_o)$, we need to estimate $A_o$ and $B_o$. Here is another case where the simplest and best behaved estimator of $A_o$ depends only on first derivatives, in this case of the mean and variance functions. Let $a_t(x_t,\theta_o) = E[h_t(\theta_o)|x_t]$. A straightforward but tedious calculation shows that, under QMLE.1,

$$ a_t(x_t,\theta_o) = \nabla_\theta m_t(\theta_o)'\Omega_t^{-1}(\theta_o)\nabla_\theta m_t(\theta_o) + \tfrac{1}{2}\nabla_\theta\Omega_t(\theta_o)'\bigl[\Omega_t^{-1}(\theta_o)\otimes\Omega_t^{-1}(\theta_o)\bigr]\nabla_\theta\Omega_t(\theta_o). \tag{6.31} $$

(As expected, this matrix is positive semi-definite, something that is useful for programming Gauss–Newton iterations for obtaining the estimates.) A consistent estimator of $A_o$ is $\hat A_T = T^{-1}\sum_{t=1}^T a_t(\hat\theta_T)$. Under QMLE.1 only, we need a serial-correlation-robust estimator for $B_o$, and this is obtained by applying one of the Section 4.5 estimators to $\hat s_t \equiv s_t(\hat\theta_T)$.

To state a condition under which the score is serially uncorrelated, define the $(G + G^2)\times1$ vector $r_t \equiv \bigl[u_t',\ \{\mathrm{vec}[u_tu_t' - \Omega_t(\theta_o)]\}'\bigr]'$, where $u_t = u_t(\theta_o)$. Under QMLE.1, $E(r_t|x_t) = 0$. We now add the assumption that $\{r_t\}$ is appropriately serially uncorrelated.

Assumption QMLE.2

For all $t\ge1$, $E(r_tr_{t+j}'|x_t,x_{t+j}) = 0$, $j\ge1$.

It is easily seen that Assumption QMLE.2 implies that the score is (conditionally) serially uncorrelated. Therefore, under QMLE.1 and QMLE.2, $\mathrm{Avar}\,\sqrt{T}(\hat\theta_T - \theta_o) = A_o^{-1}J_oA_o^{-1}$, where $J_o$ is as in (5.23), and a consistent estimator of $J_o$ is given in
(5.28) with the score as in (6.30). Usually, if QMLE.2 is to hold, one has in mind a stronger assumption.

Definition 6.2

The model is dynamically complete in mean and variance if

$$ E(y_t|x_t,\Phi_{t-1}) = E(y_t|x_t), $$
$$ \mathrm{Var}(y_t|x_t,\Phi_{t-1}) = \mathrm{Var}(y_t|x_t), \qquad t = 1,2,\ldots. $$

Lemma 6.2

If the model is dynamically complete in mean and variance then $\{r_t\colon t = 1,2,\ldots\}$ is a martingale difference sequence with respect to the information sets $\{\Phi_t\}$, and therefore QMLE.2 holds.

From (6.30), $\{s_t(\theta_o)\}$ is a martingale difference sequence (MDS) with respect to $\Phi_t = (y_t,x_t,\ldots,y_1,x_1)$ if $\{r_t\}$ is. Thus, the score of the normal quasi-likelihood is an MDS if the first two conditional moments are dynamically complete; nothing else about the normal distribution need apply to the distribution of $y_t$ given $x_t$.

If $y_t$ given $x_t$ is normally distributed then the conditional information matrix equality

$$ \mathrm{Var}[s_t(\theta_o)|x_t] = a_t(x_t,\theta_o) \tag{6.32} $$

holds. When combined with QMLE.2 this further simplifies the estimation of $\mathrm{Avar}(\hat\theta_T)$. While normality is sufficient for (6.32), it also holds under a weaker assumption.

Assumption QMLE.3

(i) $E[\mathrm{vec}(u_tu_t')u_t'|x_t] = 0$;
(ii) $E\bigl[\{\mathrm{vec}(u_tu_t') - \mathrm{vec}\,\Omega_t(\theta_o)\}\{\mathrm{vec}(u_tu_t') - \mathrm{vec}\,\Omega_t(\theta_o)\}'|x_t\bigr] = 2N_G[\Omega_t(\theta_o)\otimes\Omega_t(\theta_o)]$,

where $N_G = D_G(D_G'D_G)^{-1}D_G'$ and $D_G$ is the $G^2\times G(G+1)/2$ duplication matrix; see Magnus and Neudecker (1986).

In the scalar case, QMLE.3(i) is the symmetry condition $E(u_t^3|x_t) = 0$ and QMLE.3(ii) is the familiar fourth moment condition $E[\{u_t^2 - \sigma_t^2(\theta_o)\}^2|x_t] = 2\sigma_t^4(\theta_o)$. Assumption QMLE.3 is the multivariate version of these assumptions, and it could hold for distributions other than the multivariate normal. For more on the matrices $D_G$ and $N_G$, and their relevance for the multivariate normal distribution, see Magnus and Neudecker (1986).
Under QMLE.1 and QMLE.3, (6.32) holds. Therefore, under QMLE.1, QMLE.2 and QMLE.3, $\mathrm{Avar}\,\sqrt{T}(\hat\theta_T - \theta_o) = A_o^{-1}$, and $\mathrm{Avar}(\hat\theta_T)$ can be estimated as $\hat A_T^{-1}/T$ or $\hat J_T^{-1}/T$. The estimator based on $\hat A_T$ tends to have better finite sample properties than that based on the outer product of the score estimator $\hat J_T$.
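In the scalar case ($G = 1$), (6.31) reduces to $a_t = \nabla_\theta m_t'\nabla_\theta m_t/\omega_t + \tfrac{1}{2}\nabla_\theta\omega_t'\nabla_\theta\omega_t/\omega_t^2$, and the estimator $\hat A_T^{-1}/T$ can be sketched as follows, with gradients at $\hat\theta_T$ assumed supplied by the application.

```python
import numpy as np

def qmle_avar(grad_m, grad_w, omega):
    """Avar(theta_hat) = A_hat^{-1} / T under QMLE.1-QMLE.3, with
    A_hat = T^{-1} sum_t a_t(theta_hat) and a_t the scalar-case (6.31).
    grad_m, grad_w: (T, P) gradients of mean and variance functions
    at theta_hat; omega: (T,) fitted conditional variances."""
    T = len(omega)
    A = (grad_m.T @ (grad_m / omega[:, None])
         + 0.5 * grad_w.T @ (grad_w / omega[:, None]**2)) / T
    return np.linalg.inv(A) / T
```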
Testing hypotheses about $\theta_o$ poses no new problems. The Wald statistic is as before for M-estimation, with an appropriate estimator for $B_o$. Recall that under QMLE.1–QMLE.3, $B_o = A_o$, but not otherwise. The quasi-LR statistic is valid under assumptions QMLE.1, QMLE.2 and QMLE.3; this follows directly from Lemma 4.3.

The general formula for the LM statistic is again given by equation (4.44). If, in addition to QMLE.1, we impose QMLE.2 and QMLE.3 under the null, the statistic simplifies to (4.46) or, preferably, (4.48). Under QMLE.1 and QMLE.2 the statistic is (4.44) with $\tilde J_T$ in place of $\tilde B_T$.
A natural application of these results is to univariate and multivariate ARCH, GARCH, and ARCH-in-mean models; see Bollerslev, Engle and Nelson (this Handbook). But the normal QMLE can be applied in many other contexts. Whenever the mean and variance are twice continuously differentiable in $\theta$ then, subject to the modest regularity conditions in Section 4, the QMLE will produce $\sqrt{T}$-asymptotically normal estimates of the conditional mean and conditional variance parameters. The actual distribution of $y_t$ given $x_t$ can be very different from normal; for example, when $y_t$ is a scalar, it could have a discrete or truncated distribution. This is useful for problems where maximum likelihood is computationally difficult or violates standard regularity conditions (such as (5.16)). For example, QMLE can be applied to certain switching regression models with unknown regime by obtaining $E(y_t|x_t)$ and $\mathrm{Var}(y_t|x_t)$ (where $x_t$ may or may not contain lagged dependent variables) in terms of the structural parameters and using these in the normal quasi-log-likelihood; the MLE for these models, based on a mixture of normals, is known to be computationally difficult [Quandt and Ramsey (1978)]. The QMLE can also be applied to certain frontier production function models [Schmidt (1976)], where the log-likelihood is discontinuous in some of the parameters and so standard inference procedures do not necessarily apply. The discontinuity in the true log-likelihood comes about because the support of the conditional distribution of $y_t$ depends on unknown parameters. The QMLE might be useful because the moments $E(y_t|x_t)$ and $\mathrm{Var}(y_t|x_t)$ depend on the parameters in a very smooth fashion.

7. Generalized method of moments estimation

7.1. Introduction

This section studies generalized method of moments (GMM) estimation for dependent processes. We rely heavily on the work of Hansen (1982), Bates and White (1985, 1993), and Newey and McFadden (this Handbook). As with the rest of the chapter, we focus on weak consistency, and we do not attempt to find the weakest set of regularity conditions. The limiting distribution results in Section 7.3 apply to the essentially stationary, weakly dependent case.

Many applications of GMM fit into the category of estimation under conditional moment restrictions. We will approach method of moments estimation from this standpoint. By specifying the conditioning set to be empty, the unconditional moments framework is a special case. The related classical minimum distance estimators are not covered here; see Newey and McFadden for a treatment that can be applied to time series contexts by using the results of this chapter.

Let $\{(w_t,x_t)\colon t = 1,2,\ldots\}$ be a stochastic process, where $w_t\in\mathcal{W}_t$ and $x_t\in\mathcal{X}_t$, and the dimensions of both $w_t$ and $x_t$ may grow with $t$. Assume that there is an $N\times1$ vector function $r_t\colon \mathcal{W}_t\times\Theta\to\mathbb{R}^N$ defined on $\mathcal{W}_t$ and the parameter space $\Theta\subset\mathbb{R}^P$. Interest lies in estimating the $P\times1$ vector $\theta_o\in\Theta$ such that

$$ E[r_t(w_t,\theta_o)|x_t] = 0, \qquad t = 1,2,\ldots. \tag{7.1} $$

Equation (7.1) constitutes a set of conditional moment restrictions, studied in cross section settings by Chamberlain (1987), and by Hansen (1982), Hansen and Singleton (1982), Robinson (1991a), Newey (1990) and Rilstone (1991) in dependent settings.
Conditional moment restrictions are straightforward to generate. First consider the setup of Section 6.1, where $y_t$ is $G\times1$ and the model $\{m_t(x_t,\theta)\colon x_t\in\mathcal{X}_t,\ \theta\in\Theta\}$ is correctly specified for $E(y_t|x_t)$. Then (7.1) holds by setting $r_t(w_t,\theta) = y_t - m_t(x_t,\theta)$.

Next suppose $m_t(x_t,\theta)$ and $\Omega_t(x_t,\theta)$ are correctly specified models for $E(y_t|x_t)$ and $\mathrm{Var}(y_t|x_t)$, as in Section 6.2. Defining $u_t(\theta) = y_t - m_t(x_t,\theta)$, a set of conditional moment restrictions is obtained by setting

$$ r_t(w_t,\theta) = \bigl[u_t(\theta)',\ \mathrm{vech}[u_t(\theta)u_t(\theta)' - \Omega_t(x_t,\theta)]'\bigr]', $$

a $\{G + [G(G+1)/2]\}\times1$ vector, where $\mathrm{vech}[\cdot]$ denotes the vectorization of the lower triangle of a matrix. Under correct specification of the conditional mean and conditional variance, (7.1) holds.

In many situations, including a variety of rational expectations models, economic theory implies an Euler equation of the form

$$ E[r_t(w_t,\theta_o)|w_{t-1},w_{t-2},\ldots,w_1] = 0, \tag{7.2} $$

in which case (7.1) follows by the law of iterated expectations whenever $x_t$ is a subset of $(w_{t-1},w_{t-2},\ldots,w_1)$.
Conditional moment restrictions are also natural for analyzing nonlinear simultaneous equations models. A general dynamic structural model can be expressed as

$$ q_t(y_t,x_t,\theta_o) = u_t, \tag{7.3} $$

$$ E(u_t|x_t) = 0, \tag{7.4} $$

where $q_t(\cdot)$ is $N\times1$ and $x_t$ contains predetermined variables. Thus, in a GMM setting we identify $r_t(w_t,\theta_o)$ with the structural errors $u_t$. As we see below, a whole class of GMM estimators is consistent and asymptotically normal under the assumption (7.4) that the structural errors have a conditional mean of zero given the predetermined variables; the errors need not be independent of $x_t$ or even conditionally homoskedastic. They can also be serially correlated provided (7.4) holds.
Condition (7.1) implies that the unconditional expectation of $r_t(w_t,\theta_o)$ is zero, but it implies much more: any matrix function of $x_t$, say the $N\times M$ matrix $G_t(x_t)$, is uncorrelated with $r_t(\theta_o) \equiv r_t(w_t,\theta_o)$. More precisely, if $E[|r_{th}(w_t,\theta_o)|] < \infty$, $h = 1,\ldots,N$, and $E[|G_{thj}(x_t)r_{th}(w_t,\theta_o)|] < \infty$, $h = 1,\ldots,N$, $j = 1,\ldots,M$, then

$$ E[G_t(x_t)'r_t(w_t,\theta_o)] = 0. \tag{7.5} $$

Equation (7.5) is the basis for estimating $\theta_o$. Actually, to handle cases that arise in practice, we will consider a more general framework. Let the instrumental variable function depend on some nuisance parameters (which might include $\theta$) as well as $x_t$: $G_t(x_t,\gamma)$, where $\gamma\in\Gamma\subset\mathbb{R}^H$. Under (7.1) and finite moment assumptions,

$$ E[G_t(x_t,\gamma)'r_t(w_t,\theta_o)] = 0 \quad\text{for all } \gamma\in\Gamma. \tag{7.6} $$

As before, we assume that we have an estimator $\hat\gamma_T$ that satisfies

$$ \sqrt{T}(\hat\gamma_T - \gamma^*) = O_p(1), \tag{7.7} $$

where $\gamma^*$ could contain $\theta_o$ but need not have an interpretation otherwise. Then, because (7.6) holds with $\gamma = \gamma^*$, Lemma A.1 generally implies that

$$ T^{-1}\sum_{t=1}^T G_t(\hat\gamma_T)'r_t(\theta_o) \xrightarrow{p} 0, \tag{7.8} $$

where $G_t(\gamma) \equiv G_t(x_t,\gamma)$ and $r_t(\theta) \equiv r_t(w_t,\theta)$. The analogy principle suggests estimating $\theta_o$ by solving the $M$ nonlinear equations

$$ T^{-1}\sum_{t=1}^T G_t(\hat\gamma_T)'r_t(\theta) = 0, \tag{7.9} $$

where, to identify $\theta_o$, we need $M \ge P$. But (7.9) does not always have a solution, especially when $M > P$. Instead, we choose an estimator of $\theta_o$ that makes the left hand side of (7.9) as close as possible to zero, where closeness is measured as a quadratic form in a positive definite matrix. The weighting matrix $\Lambda_T$ is an $M\times M$ positive semi-definite random matrix such that $\Lambda_T \xrightarrow{p} \Lambda^*$, where $\Lambda^*$ is an $M\times M$ nonrandom positive definite matrix. (This implies that $\Lambda_T$ is positive definite with probability approaching one (w.p.a.1).) As a shorthand, denote the instruments by $\hat G_t \equiv G_t(x_t,\hat\gamma_T)$. A generalized method of moments (GMM) estimator of $\theta_o$ is the solution $\hat\theta_T$ to

$$ \min_{\theta\in\Theta}\ \Bigl[T^{-1}\sum_{t=1}^T \hat G_t'r_t(\theta)\Bigr]'\Lambda_T\Bigl[T^{-1}\sum_{t=1}^T \hat G_t'r_t(\theta)\Bigr]. \tag{7.10} $$
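For computational concreteness, (7.10) can be coded directly. The sketch below is illustrative only: `r` and `Gs` are placeholders for the application's residual function and instrument matrices, and the optimizer choice is incidental.

```python
import numpy as np
from scipy.optimize import minimize

def gmm_min(r, Gs, Lam, theta0):
    """GMM estimator from (7.10): minimize g_T(theta)' Lam g_T(theta),
    with g_T(theta) = T^{-1} sum_t G_t' r_t(theta).
    r:   function (t, theta) -> (N,) residual vector r_t(theta)
    Gs:  (T, N, M) array of instrument matrices G_t
    Lam: (M, M) positive definite weighting matrix Lambda_T"""
    T = Gs.shape[0]
    def gbar(theta):
        return sum(Gs[t].T @ r(t, theta) for t in range(T)) / T
    obj = lambda theta: gbar(theta) @ Lam @ gbar(theta)
    return minimize(obj, theta0, method="BFGS").x
```

With $M = P$ and a nonsingular weighting matrix the minimizer solves (7.9) exactly; with $M > P$ the quadratic form trades off the $M$ moment conditions.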

7.2. Consistency

Consistency of the GMM estimator does not follow from Theorem 4.3 because the GMM estimator is not an M-estimator as we have defined it here. Nevertheless, consistency follows under the uniform weak law of large numbers and the identification assumption that $\theta_o$ is the unique solution to

$$ \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[G_t(x_t,\gamma^*)'r_t(w_t,\theta)] = 0. \tag{7.11} $$

Usually in practice one argues that $E[G_t(x_t,\gamma^*)'r_t(w_t,\theta)] \ne 0$ for $\theta \ne \theta_o$.

Theorem 7.1 (Consistency of GMM)

Assume that

GMM.1: (i) $\Theta$ and $\Gamma$ are compact sets; (ii) $\hat\gamma_T \xrightarrow{p} \gamma^*$; (iii) $\Lambda_T \xrightarrow{p} \Lambda^*$, a positive definite $M\times M$ matrix.

GMM.2: (i) $G_t$ satisfies the standard measurability and continuity conditions on $\mathcal{X}_t\times\Gamma$; (ii) $r_t$ satisfies the standard measurability and continuity conditions on $\mathcal{W}_t\times\Theta$.

GMM.3: $\{G_t(\gamma)'r_t(\theta)\}$ satisfies the UWLLN on $\Theta\times\Gamma$.

GMM.4: (i) For some $\theta_o\in\Theta$, $E[r_t(w_t,\theta_o)|x_t] = 0$, $t = 1,2,\ldots$; (ii) $\theta_o$ is the unique solution to (7.11).

Then the GMM estimator $\hat\theta_T$ exists and is weakly consistent for $\theta_o$.

This result follows by verifying the conditions of Newey and McFadden (Theorem 2.1). It is similar in spirit to the strong consistency results of Hansen (1982), Bates and White (1985), Gallant (1987), Gallant and White (1988) and Pötscher and Prucha (1991a).

7.3. Asymptotic normality

To establish the asymptotic normality of the GMM estimator we apply Theorem 3.2 in Newey and McFadden; here, only the main ingredients are sketched. In defining the score of the objective function, it is notationally convenient to divide the gradient of (7.10) by 2, and we do so in what follows. Straightforward differentiation with respect to $\theta$ gives the score as the $P\times1$ vector

$$ S_T(\theta;\hat\gamma_T,\Lambda_T) = \Bigl[T^{-1}\sum_{t=1}^T \nabla_\theta r_t(\theta)'\hat G_t\Bigr]\Lambda_T\Bigl[T^{-1}\sum_{t=1}^T \hat G_t'r_t(\theta)\Bigr], \tag{7.12} $$

where $\nabla_\theta r_t(\theta)$ is the $N\times P$ gradient of $r_t(\theta)$. Assuming that $\theta_o\in\mathrm{int}(\Theta)$, weak consistency of $\hat\theta_T$ implies that $S_T(\hat\theta_T;\hat\gamma_T,\Lambda_T) = 0$ w.p.a.1. Further,

$$ \sqrt{T}\,S_T(\theta_o;\hat\gamma_T,\Lambda_T) = \Bigl[T^{-1}\sum_{t=1}^T \nabla_\theta r_t(\theta_o)'\hat G_t\Bigr]\Lambda_T\Bigl[T^{-1/2}\sum_{t=1}^T \hat G_t'r_t(\theta_o)\Bigr]. \tag{7.13} $$

Using a standard mean value expansion, the assumption $E[r_t(\theta_o)|x_t] = 0$, and Lemma A.1, it is easily shown that

$$ T^{-1/2}\sum_{t=1}^T \hat G_t'r_t(\theta_o) = T^{-1/2}\sum_{t=1}^T G_t(\gamma^*)'r_t(\theta_o) + o_p(1). \tag{7.14} $$

Define the $M\times1$ vector $e_t(\theta;\gamma) \equiv G_t(\gamma)'r_t(\theta)$, so that $E[e_t(\theta_o;\gamma^*)|x_t] = 0$. Under moment and weak dependence assumptions, $T^{-1/2}\sum_{t=1}^T e_t(\theta_o;\gamma^*)$ satisfies the central limit theorem, and so it is $O_p(1)$. Next define the $M\times P$ matrix

$$ R_o = \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[G_t(\gamma^*)'\nabla_\theta r_t(\theta_o)]. \tag{7.15} $$

Then by the UWLLN and Lemma A.1,

$$ T^{-1}\sum_{t=1}^T \nabla_\theta r_t(\theta_o)'\hat G_t \xrightarrow{p} R_o'. \tag{7.16} $$

Because $\Lambda_T - \Lambda^* = o_p(1)$, it follows from (7.14) and (7.16) that

$$ \sqrt{T}\,S_T(\theta_o;\hat\gamma_T,\Lambda_T) = R_o'\Lambda^*\,T^{-1/2}\sum_{t=1}^T e_t(\theta_o;\gamma^*) + o_p(1). \tag{7.17} $$
Equation (7.17) has the important implication that the limiting distribution of $\sqrt{T}(\hat\gamma_T - \gamma^*)$ does not affect that of $\sqrt{T}(\hat\theta_T - \theta_o)$; only (7.7) is needed.

From (7.17) it is natural to define the score for observation $t$ as

$$ s_t(\theta_o) \equiv R_o'\Lambda^*e_t(\theta_o;\gamma^*) = R_o'\Lambda^*G_t(\gamma^*)'r_t(\theta_o). \tag{7.18} $$

(For simplicity we suppress all parameters in the score except for $\theta_o$.) It now follows from Newey and McFadden (Theorem 3.2) that

$$ \sqrt{T}(\hat\theta_T - \theta_o) \xrightarrow{d} \mathrm{Normal}(0, A_o^{-1}B_oA_o^{-1}), \tag{7.19} $$

where

$$ A_o \equiv R_o'\Lambda^*R_o. \tag{7.20} $$

The matrix $B_o$ can be expressed as

$$ B_o = R_o'\Lambda^*D_o\Lambda^*R_o, \tag{7.21} $$

where

$$ D_o \equiv \lim_{T\to\infty} \mathrm{Var}\Bigl[T^{-1/2}\sum_{t=1}^T e_t(\theta_o;\gamma^*)\Bigr]. \tag{7.22} $$

We summarize this argument with a theorem.

Theorem 7.2 (Asymptotic normality of GMM)

Assume that the conditions of Theorem 7.1 hold. In addition, assume that

GMM.5: (i) $\sqrt{T}(\hat\gamma_T - \gamma^*) = O_p(1)$; (ii) $\theta_o\in\mathrm{int}(\Theta)$, $\gamma^*\in\mathrm{int}(\Gamma)$.

GMM.6: (i) $G_t$ satisfies the standard measurability and first order differentiability conditions on $\mathcal{X}_t\times\Gamma$; (ii) $r_t$ satisfies the standard measurability and first order differentiability conditions on $\mathcal{W}_t\times\Theta$.

GMM.7: (i) $\{G_t(x_t,\gamma)'\nabla_\theta r_t(w_t,\theta)\}$ and $\{[r_t(w_t,\theta)'\otimes I_H]\nabla_\gamma G_t(x_t,\gamma)\}$ satisfy the UWLLN on $\Theta\times\Gamma$; (ii) $\mathrm{rank}(R_o) = P$.

GMM.8: $\{s_t(\theta_o)\colon t = 1,2,\ldots\}$ satisfies the central limit theorem with positive definite asymptotic variance $B_o$ given by (7.21).

Then (7.19) holds.

This result is similar to the asymptotic normality results in Hansen (1982) and Gallant (1987), except that we leave the UWLLN and CLT as high level assumptions. If necessary, the assumptions on the ranks of the matrices $\Lambda^*$, $R_o$, and $B_o$ can be somewhat relaxed; see Newey and McFadden (Theorem 3.2).

7.4. Estimating the asymptotic variance

A consistent estimator of A0 under the conditions of Theorem 7.2 is

AT= &i,kT (7.23)

where

&. = T i: G,~j,.)'V,r,(Q (7.24)


r=1

Given a consistent estimator BT of D,, a consistent estimator of B, is

(7.25)

From (7.22) it is seen that estimation of D, generally entails a serial-correlation-


robust estimator unless e; = e,(e,; y*) is serially uncorrelated. As usual, one simply
applies one of the serial-correlation-robust covariance matrix estimators from
Section 4.5 to (6, E e,(&r; y*,)}. See also Hansen (1982), Gallant (1987), Newey and
West (1987), Gallant and White (1988), Andrews (1991) and Piitscher and Prucha
(1991 b). As a computational matter, if M is much larger than P it is easier to
estimate B, directly by applying a serial-correlation estimator to the P x 1 vector
process $ = I&&-d,.
The form of the asymptotic variance simplifies, and we obtain the efficient
estimate: given the choice of instruments G&y*), by choosing the weighting
matrix AT appropriately. In particular, let AT be a consistent estimator of 0; ,
so that A* = 0; . Then from (7.21) and (7.22), A, = B, = RbD;R,, and so
Avar(t?=) = (Rb 0; R,) IT.

Lemma 7.1

Assume that GMM.1–GMM.8 hold, and in addition

GMM.9: $\Lambda^* = D_o^{-1}$.

Then

$$ \mathrm{Avar}\,\sqrt{T}(\hat\theta_T - \theta_o) = (R_o'D_o^{-1}R_o)^{-1}. \tag{7.26} $$

Choosing $\Lambda_T$ to be consistent for $D_o^{-1}$ usually requires an initial $\sqrt{T}$-consistent estimator of $\theta_o$. The resulting estimator is often called the minimum chi-square estimator (Newey and McFadden call it the optimal minimum distance estimator). See Hansen (1982), Bates and White (1993) and Newey and McFadden (Theorem 5.2) for proofs that it is the efficient estimator given the choice of instruments under conditions analogous to GMM.1–GMM.9.
If $\{r_t(\theta_o)\}$ is serially uncorrelated in an appropriate sense, then estimation of $D_o$, and therefore computation of the minimum chi-square estimator, is simplified.

Lemma 7.2

Assume that GMM.1–GMM.9 hold, and in addition

GMM.10: For $t\ge1$, $E[r_t(\theta_o)r_{t+j}(\theta_o)'|x_t,x_{t+j}] = 0$, $j\ge1$.

Then

$$ D_o = \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[e_t(\theta_o;\gamma^*)e_t(\theta_o;\gamma^*)'] \tag{7.27} $$

and (7.26) holds.

Hansen (1982) showed earlier how GMM.10 leads to the form (7.27) for $D_o$. Under GMM.1–GMM.10, a simple consistent estimator of $D_o$ is

$$ \hat D_T = T^{-1}\sum_{t=1}^T \hat e_t\hat e_t', \tag{7.28} $$

which makes $\widehat{\mathrm{Avar}}(\hat\theta_T) = (\hat R_T'\hat D_T^{-1}\hat R_T)^{-1}/T$ especially straightforward to obtain. To obtain the minimum chi-square estimator under GMM.1–GMM.10, one would obtain $\Lambda_T$ as the inverse of (7.28), but $\hat e_t$ would be computed using the initial estimator.
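The two-step recipe just described can be sketched as follows, under GMM.10 so that the no-lag estimator (7.28) suffices; the inner routine repeats the quadratic-form minimization of (7.10), and all model inputs are placeholders.

```python
import numpy as np
from numpy.linalg import inv
from scipy.optimize import minimize

def gmm_min(r, Gs, Lam, th0):
    # minimize the quadratic form (7.10) for a given weighting matrix
    T = Gs.shape[0]
    g = lambda th: sum(Gs[t].T @ r(t, th) for t in range(T)) / T
    return minimize(lambda th: g(th) @ Lam @ g(th), th0, method="BFGS").x

def two_step_gmm(r, Gs, theta0):
    """Minimum chi-square GMM under GMM.10 (serially uncorrelated e_t):
    step 1 uses an identity weighting matrix; step 2 sets
    Lambda_T = D_hat^{-1} with D_hat from (7.28) at the first-step
    estimate, which also delivers Avar = (R'D^{-1}R)^{-1}/T."""
    T, N, M = Gs.shape
    theta1 = gmm_min(r, Gs, np.eye(M), theta0)                 # initial estimator
    e = np.array([Gs[t].T @ r(t, theta1) for t in range(T)])   # e_t, (T, M)
    D_hat = e.T @ e / T                                        # eq. (7.28)
    return gmm_min(r, Gs, inv(D_hat), theta1)
```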
A sufficient condition for GMM.10 is

$$ E[r_t(\theta_o)|x_t,r_{t-1},x_{t-1},\ldots,r_1,x_1] = 0, \tag{7.29} $$

as would be the case under (7.2) when $x_t$ is a subset of $(w_{t-1},\ldots,w_1)$. Thus, GMM.10 is satisfied for estimating the parameters of a dynamically complete mean, for joint estimation of dynamically complete means and variances, for many rational expectations applications, and for dynamic simultaneous equations models (SEMs) when the errors $u_t$ are martingale difference sequences with respect to $\Phi_t$. Using the usual iterated expectations arguments, $\{s_t(\theta_o;\gamma^*)\}$ is seen to be a martingale difference sequence under (7.29), and so a martingale CLT can be applied in verifying GMM.8.

Often the model (7.3)–(7.4) is estimated under the assumptions GMM.1–GMM.10 and the conditional homoskedasticity assumption

$$ E(u_tu_t'|x_t) = \Omega_o, \tag{7.30} $$

where $\Omega_o$ is an $N\times N$ positive definite matrix. If we extend this assumption to the general case, new estimators for the asymptotic variance are available.

Lemma 7.3

Assume that GMM.1–GMM.10 hold, and in addition

GMM.11: For a positive definite $N\times N$ matrix $\Omega_o$, $E[r_t(\theta_o)r_t(\theta_o)'|x_t] = \Omega_o$.

Then

$$ D_o = \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E[G_t(x_t,\gamma^*)'\Omega_oG_t(x_t,\gamma^*)]. $$

Lemma 7.3 follows because, under GMM.11, $E(e_t^*e_t^{*\prime}|x_t) = G_t(x_t,\gamma^*)'\Omega_oG_t(x_t,\gamma^*)$. Now $D_o$ can be consistently estimated by

$$ \hat D_T = T^{-1}\sum_{t=1}^T \hat G_t'\hat\Omega_T\hat G_t, \tag{7.31} $$

where $\hat\Omega_T$ is a consistent estimator of $\Omega_o$. In virtually all applications, obtaining $\hat\Omega_T$ requires an initial estimation stage. If $\theta_T^+$ is an initial consistent estimator of $\theta_o$, then $\hat\Omega_T = T^{-1}\sum_{t=1}^T r_t(\theta_T^+)r_t(\theta_T^+)'$ is generally consistent for $\Omega_o$. In the simultaneous equations setting, $\theta_o$ can first be estimated by nonlinear two-stage least squares (N2SLS). Let $Z_t = Z_t(x_t)$ be an $N\times M$ set of instruments. Then N2SLS estimation is obtained by setting $r_t(\theta) = u_t(\theta)$, $\hat G_t = Z_t$, and $\Lambda_T = [T^{-1}\sum_{t=1}^T Z_t'Z_t]^{-1}$ in (7.10). Letting $\hat u_t^+ = q_t(y_t,x_t,\theta_T^+)$ denote the N2SLS residuals, a consistent estimator of $\Omega_o$ is $\hat\Omega_T = T^{-1}\sum_{t=1}^T \hat u_t^+\hat u_t^{+\prime}$. This can then be used to obtain the nonlinear three-stage least squares (N3SLS) estimator with $\hat G_t \equiv Z_t$ and $\Lambda_T = \hat D_T^{-1}$, where $\hat D_T$ is given by (7.31).
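The N2SLS/N3SLS sequence can be sketched in the same framework; `q` is the structural residual function $q_t(y_t,x_t,\theta)$ and `Z` the instrument matrices, both placeholders for the application's model.

```python
import numpy as np
from numpy.linalg import inv
from scipy.optimize import minimize

def gmm_min(r, Gs, Lam, th0):   # quadratic-form GMM, as in (7.10)
    T = Gs.shape[0]
    g = lambda th: sum(Gs[t].T @ r(t, th) for t in range(T)) / T
    return minimize(lambda th: g(th) @ Lam @ g(th), th0, method="BFGS").x

def n2sls_n3sls(q, Z, theta0):
    """N2SLS: G_t = Z_t, Lambda_T = [T^{-1} sum Z_t'Z_t]^{-1}.
    N3SLS: reweight with D_hat^{-1}, D_hat from (7.31) using the
    residual covariance Omega_hat from the N2SLS stage."""
    T = Z.shape[0]
    Lam1 = inv(sum(Z[t].T @ Z[t] for t in range(T)) / T)
    th_2sls = gmm_min(q, Z, Lam1, theta0)
    u = np.array([q(t, th_2sls) for t in range(T)])
    Omega_hat = u.T @ u / T
    D_hat = sum(Z[t].T @ Omega_hat @ Z[t] for t in range(T)) / T
    return th_2sls, gmm_min(q, Z, inv(D_hat), th_2sls)
```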

7.5. Asymptotic efficiency

We have already seen that, given the choice of instruments $G_t(x_t,\gamma^*)$, the optimal GMM estimator is the minimum chi-square estimator. There is little reason not to compute the minimum chi-square estimator in applications, since the weighting matrix needed in its computation is needed to estimate the asymptotic variance of any GMM estimator.

A more difficult issue is the optimal choice of instruments. Under (7.1), any function of $x_t$ that satisfies regularity conditions can be used as instruments, which leaves a lot of possibilities. Fortunately, under assumptions GMM.1–GMM.10, the optimal instruments can be easily characterized. For given residual function $r_t(w_t,\theta)$, the optimal choice of instruments under GMM.1–GMM.10 is

$$ G_t^o(x_t) = \{\Omega_t^o(x_t)\}^{-1}Q_t^o(x_t), \tag{7.32} $$
where $\Omega_t^o(x_t) = \mathrm{Var}[r_t(\theta_o)|x_t]$ and $Q_t^o(x_t) = E[\nabla_\theta r_t(w_t,\theta_o)|x_t]$. See, for example, Hansen (1982) and Bates and White (1993). The proof in Bates and White (1993) is easily modified to cover the current setup. The asymptotic variance of the optimal GMM estimator under GMM.1–GMM.10 is

$$ A_o^{-1} = R_o^{-1} = D_o^{-1} = \Bigl\{\lim_{T\to\infty} T^{-1}\sum_{t=1}^T E\bigl[Q_t^o(x_t)'\{\Omega_t^o(x_t)\}^{-1}Q_t^o(x_t)\bigr]\Bigr\}^{-1}. \tag{7.33} $$

As an application, consider the class of MWNLS estimators studied in Section 6.1. Under WNLS.1, WNLS.2 and WNLS.3, (7.33) reduces to the variance of the WNLS estimator because $Q_t^o(x_t) = -\nabla_\theta m_t(x_t,\theta_o)$ (see Theorem 6.2). Thus, if we start by assuming only WNLS.1 and WNLS.2 then the optimal WNLS estimator is obtained by choosing $W_t(x_t,\gamma_o) = \mathrm{Var}(y_t|x_t)$.

The optimal choice (7.32) also has implications for estimating simultaneous equations models. Under GMM.1–GMM.11, the efficient estimator uses instruments $\Omega_o^{-1}E[\nabla_\theta q_t(\theta_o)|x_t]$. In models linear in the endogenous variables, $E[\nabla_\theta q_t(\theta_o)|x_t]$ is in closed form and can be estimated; this leads to Sargan's (1958) generalized instrumental variables estimator. For models linear in all variables this leads to the 3SLS estimator. In models nonlinear in $y_t$ the conditional expectation $E[\nabla_\theta q_t(\theta_o)|x_t]$ is rarely of closed form and so an efficient GMM estimator cannot be obtained using standard parametric analysis.

Nevertheless, recent work has shown that, quite generally, the conditional expectations $Q_t^o(x_t)$ and $\Omega_t^o(x_t)$ need not be of a known parametric form in order to achieve the asymptotic variance of the efficient GMM estimator under GMM.1–GMM.10. Under strict stationarity (in particular, when $x_t$ has fixed dimension) and weak dependence, both of these quantities can typically be nonparametrically estimated at fast enough rates to obtain the asymptotically efficient GMM estimator; see Robinson (1991a), Newey (1990) and Rilstone (1991). Newey (1990) suggests a series estimation approach. For example, in the simultaneous equations model under GMM.1–GMM.11, $Q_t^o(x_t)$ can be estimated by regressing the elements of $\nabla_\theta q_t(\theta_T^+)$ on functions of $x_t$ that have good approximation properties (here $\theta_T^+$ is a preliminary estimator). As a practical matter, the nonparametric approach to estimating $Q_t^o(x_t)$ and $\Omega_t^o(x_t)$ has limitations, especially when the dimension of $x_t$ is large.

When the dimensions of $w_t$ and $x_t$ are growing, little is known about the efficiency bounds for the GMM estimator. Some work is available in the linear case; see Hansen (1985) and Hansen et al. (1988).

7.6. Testing

Given a consistent estimator P, of Avar J?(H^, - do), the Wald test of H,: ~(0,) = 0
is just as in (4.37).
A quasi-likelihood ratio statistic can be computed for testing hypotheses of the
form covered in Section 4.6 when minimum ch_i-square estimation is used. Suppose
there are Q restrictions to be tested. Let eT denote the minimum chi-square
estimator with the restrictions imposed and let 4, be the unrestricted estimator.
(Typically, the initial estimator of A * = 0; would come from an unconstrained
estimation procedure with a simple weighting matrix.) Let Qr(0) denote the
objective function in (7.10). Then, under H,,

This limiting result holds under the conditions of Lemma 7.1, with the constraints
satisfying the conditions of Lemma 4.3. It is established using a second order
Taylors expansion and the fact that A, = B,. See Gallant (1987, Chapter 7,
Theorem 15) for a careful proof.
There is also a score test that can be applied to GMM estimation; see Gallant
(1987, Chapter 7, Theorem 16).

Part III: The globally nonstationary, weakly dependent case

8. General results

8.1. Introduction

We now consider the asymptotic properties of estimators when the data are not essentially stationary but are still weakly dependent. This covers the well-known case of series with deterministic trends that exhibit essentially stationary behavior around their means. Because such series generally satisfy the CLT when properly standardized, we expect estimators obtained from problems using globally nonstationary, weakly dependent data to be asymptotically normally distributed. The analysis in this part verifies this expectation.

Some comments on the limitations of the following results are in order. First, the CLTs underlying the analysis do not allow for series with exponential trends (polynomial trends of any finite order are allowed). Thus, often a transformation (such as taking the natural log) is needed to put one's problem into this framework. Second, because the UWLLN approach cannot be used for general trending processes, we prove consistency and asymptotic normality at the same time. This means that, for consistency results, we assume twice-continuous differentiability of the objective function.

Because general results for the globally nonstationary case are not readily available in the literature, in Section 8.2 we give a general result that can be used in a variety of contexts, including M-estimation and GMM estimation. This result is a simplification of a theorem in Wooldridge (1986), which builds on the work of Weiss (1971, 1973), Crowder (1976) and Heijmans and Magnus (1986). In Section 9 we show how the general result can be applied to M-estimation.

The bottom line of the subsequent analysis is that, practically speaking, there is no difference, in performing inference, between the globally nonstationary and the essentially stationary cases, provided the stochastic processes are weakly dependent and reasonable regularity conditions hold. Recently, Andrews and McDermott (1993) reached the same conclusion using a triangular array approach to asymptotic analysis with deterministic trends.

8.2. Asymptotic normality of an abstract estimator

We begin with an objective function $Q_T(w,\theta)$ and assume that $Q_T(w,\cdot)$ is twice continuously differentiable on $\mathrm{int}(\Theta)$ (in this section, we do not require $\Theta$ to be compact). Rather than assume that a minimizer of $Q_T(w,\theta)$ on $\Theta$ exists, we work from the first order condition. Define the $P\times1$ score $S_T(\theta) \equiv \nabla_\theta Q_T(\theta)'$ and the $P\times P$ Hessian $H_T(\theta) \equiv \nabla_\theta^2Q_T(\theta) \equiv \nabla_\theta S_T(\theta)$. We assume that the $P\times1$ parameter vector $\theta_o$, which we are trying to estimate, is in the interior of $\Theta$. (Incidentally, in contrast to the essentially stationary case, the score and Hessian do not incorporate a scaling based on the sample size. Below we explicitly introduce a matrix scaling.) To sketch the issues that arise in the globally nonstationary case, first assume that we have an estimator $\hat\theta_T$ such that

$$ S_T(\hat\theta_T) = 0 \quad\text{w.p.a.1} \tag{8.1} $$

(below we prove that such an estimator exists). A mean value expansion about $\theta_o$ yields, with probability approaching one,

$$ 0 = S_T(\theta_o) + \ddot H_T(\hat\theta_T - \theta_o), \tag{8.2} $$

where $\ddot H_T$ is the Hessian with rows evaluated at mean values between $\hat\theta_T$ and $\theta_o$. Assume for the moment that, for some nonstochastic, $P\times P$ positive definite diagonal matrix $D_T$,

$$ D_T^{-1/2}S_T(\theta_o) \xrightarrow{d} \mathrm{Normal}(0,B_o), \tag{8.3} $$

where $B_o$ is a $P\times P$ positive definite nonstochastic matrix. Next, use (8.2) to write (w.p.a.1)

$$ 0 = D_T^{-1/2}S_T(\theta_o) + (D_T^{-1/2}\ddot H_TD_T^{-1/2})[D_T^{1/2}(\hat\theta_T - \theta_o)]. \tag{8.4} $$

Now, if

$$ D_T^{-1/2}[\ddot H_T - H_T(\theta_o)]D_T^{-1/2} = o_p(1), \tag{8.5} $$

which is reasonable since the mean values should be converging to $\theta_o$, then

$$ D_T^{1/2}(\hat\theta_T - \theta_o) = -[D_T^{-1/2}H_T(\theta_o)D_T^{-1/2}]^{-1}[D_T^{-1/2}S_T(\theta_o)] + o_p(1). \tag{8.6} $$

Under weak dependence we can generally apply a law of large numbers so that

$$ D_T^{-1/2}H_T(\theta_o)D_T^{-1/2} \xrightarrow{p} A_o, \tag{8.7} $$

where $A_o$ is a $P\times P$ nonrandom, positive definite matrix. If so,

$$ D_T^{1/2}(\hat\theta_T - \theta_o) = -A_o^{-1}[D_T^{-1/2}S_T(\theta_o)] + o_p(1), \tag{8.8} $$

and then asymptotic normality of $D_T^{1/2}(\hat\theta_T - \theta_o)$ follows in the usual way from the asymptotic normality of the score.

For nonlinear estimation with trending data, Condition (8.5) is typically the most difficult to verify. One approach is to take (8.5) as a reasonable regularity condition, as is essentially done in Crowder (1976) for MLE with dependent observations. Still, it is often useful to have more basic conditions that imply this convergence of the Hessian, particularly since we have not yet shown that $\hat\theta_T$ is consistent for $\theta_o$. We use an extension of Weiss (1971) suggested by Wooldridge (1986, Chapter 3, Proposition 4.3), which implies (8.5) and at the same time guarantees that $D_T^{1/2}(\hat\theta_T - \theta_o) = O_p(1)$. We can then derive the asymptotic normality of $D_T^{1/2}(\hat\theta_T - \theta_o)$ from (8.3) and (8.8).

The idea is to impose a type of uniform convergence of the Hessian normalized by something tending to infinity at a rate slower than $D_T$. The key differences between this approach and the type of uniform convergence used in the essentially stationary case are that now (i) we allow each element of the Hessian to be standardized by a different function of the sample size and (ii) the subset of the parameter space over which the Hessian must converge uniformly shrinks to $\theta_o$ as the sample size tends to infinity.

Formally, the condition we impose on the Hessian is

$$ \max_{\theta\in\mathcal{N}_T}\bigl\|C_T^{-1/2}[H_T(\theta_o) - H_T(\theta)]C_T^{-1/2}\bigr\| = o_p(1), \tag{8.9} $$

where $\{C_T\}$ is a sequence of nonstochastic positive definite diagonal matrices such that $C_TD_T^{-1} = o(1)$, and

$$ \mathcal{N}_T \equiv \{\theta\in\Theta\colon \|C_T^{1/2}(\theta - \theta_o)\| \le 1\}. \tag{8.10} $$

We have the following theorem.

Theorem 8.1

Let $\{Q_T\colon \mathcal{W}\times\Theta\to\mathbb{R},\ T = 1,2,\ldots\}$ be a sequence of objective functions defined on the data space $\mathcal{W}$ and the parameter space $\Theta\subset\mathbb{R}^P$. Assume that
(i) $\theta_o\in\mathrm{int}(\Theta)$;
(ii) $Q_T$ satisfies the standard measurability and second order differentiability conditions on $\mathcal{W}\times\Theta$, $T = 1,2,\ldots$;

and that there are sequences of nonstochastic positive definite diagonal $P\times P$ matrices $\{C_T\colon T = 1,2,\ldots\}$ and $\{D_T\colon T = 1,2,\ldots\}$ such that

(iii) (a) $C_TD_T^{-1}\to0$ as $T\to\infty$; (b) (8.9) holds with $\mathcal{N}_T$ defined by (8.10);
(iv) (a) $D_T^{-1/2}H_T(\theta_o)D_T^{-1/2} \xrightarrow{p} A_o$, where $A_o$ is a $P\times P$ nonrandom, positive definite matrix; (b) $D_T^{-1/2}S_T(\theta_o) \xrightarrow{d} \mathrm{Normal}(0,B_o)$, where $B_o$ is a $P\times P$ nonrandom, positive definite matrix.

Then there exists a sequence of estimators $\{\hat\theta_T\colon T = 1,2,\ldots\}$ such that (8.1) and (8.6) hold. Further, $D_T^{1/2}(\hat\theta_T - \theta_o) \xrightarrow{d} \mathrm{Normal}(0, A_o^{-1}B_oA_o^{-1})$, and therefore

$$ \hat\theta_T \overset{a}{\sim} \mathrm{Normal}\bigl(\theta_o,\ D_T^{-1/2}A_o^{-1}B_oA_o^{-1}D_T^{-1/2}\bigr). \tag{8.11} $$

(For proof see Appendix.)

Condition (iv)(b) serves to identify $\theta_o$ as the parameter vector such that $E[S_T(\theta_o)] = 0$; (iv)(a) then ensures that $\theta_o$ is the unique value of $\theta$ that solves the asymptotic first order condition.

Given (8.11), it is natural to define

$$ \mathrm{Avar}(\hat\theta_T) = D_T^{-1/2}A_o^{-1}B_oA_o^{-1}D_T^{-1/2}, \tag{8.12} $$

which shrinks to zero as $T\to\infty$, as we would expect. Formula (8.12) reduces to the expression for the asymptotic variance we derived in the essentially stationary case when $D_T^{-1/2} = T^{-1/2}I_P$, where $I_P$ is the $P\times P$ identity matrix.

As can be seen in (8.12), the norming matrix $D_T$, and therefore the kinds of trends in the underlying data, clearly affects $\mathrm{Avar}(\hat\theta_T)$. It is natural to conclude that the form of $D_T$ affects how one estimates the asymptotic variance of $\hat\theta_T$ and, therefore, how one performs inference about $\theta_o$. Fortunately, this is not the case. In practice, consistent estimators $\hat A_T$ of $A_o$ and $\hat B_T$ of $B_o$ incorporate the norming $D_T^{-1/2}$ in such a way that

$$ D_T^{-1/2}\hat A_T^{-1}\hat B_T\hat A_T^{-1}D_T^{-1/2} \tag{8.13} $$

does not depend on $D_T$. For example, under the conditions of Theorem 8.1,

$$ \hat A_T = D_T^{-1/2}H_T(\hat\theta_T)D_T^{-1/2} \tag{8.14} $$

is a consistent estimator of $A_o$. For illustration, suppose that $B_o = A_o$ which, as we saw in the essentially stationary case, occurs under classical assumptions. Then
$\mathrm{Avar}(\hat\theta_T) = D_T^{-1/2}A_o^{-1}D_T^{-1/2}$ and

$$ \widehat{\mathrm{Avar}}(\hat\theta_T) = D_T^{-1/2}[D_T^{-1/2}H_T(\hat\theta_T)D_T^{-1/2}]^{-1}D_T^{-1/2} = [H_T(\hat\theta_T)]^{-1}. \tag{8.15} $$

Equation (8.15) shows that we estimate the asymptotic variance of $\hat\theta_T$ by the inverse of the estimated Hessian, exactly as would occur in the essentially stationary case when $B_o = A_o$. The researcher need not even know at what rate each parameter estimator is converging to its limiting normal distribution.
A similar observation holds even when $B_o \ne A_o$, because consistent estimators of $B_o$ typically have the form $\hat B_T = D_T^{-1/2}\hat M_TD_T^{-1/2}$, where $\hat M_T$ does not depend on scaling by a function of the sample size. Then

$$ \widehat{\mathrm{Avar}}(\hat\theta_T) = \hat H_T^{-1}\hat M_T\hat H_T^{-1}, \qquad \hat H_T \equiv H_T(\hat\theta_T), \tag{8.16} $$

does not depend on a particular scaling. We see this explicitly in Section 9 on M-estimation.

One apparently restrictive feature of Theorem 8.1 is that $A_o$ and $B_o$ are assumed to be positive definite. There are several problems with multiple trending processes for which this is not true. As a simple example, consider the linear model $y_t = \alpha_o + \gamma_ot + \beta_oz_t + u_t$, where all quantities are scalars, $z_t = a_o + a_1t + v_t$, $a_1 \ne 0$, and $\{(u_t,v_t)\}$ is a strictly stationary and weakly dependent process with $E(v_tu_t) = 0$. Let $x_t = (1,t,z_t)$, and assume we estimate this model by OLS. Then $H_T(\theta) = H_T = X'X$, and it can be shown that the norming matrix that makes $D_T^{-1/2}X'XD_T^{-1/2}$ converge in probability is $D_T = \mathrm{diag}\{T,T^3,T^3\}$. In particular,

$$ D_T^{-1/2}X'XD_T^{-1/2} \xrightarrow{p} \begin{pmatrix} 1 & 1/2 & a_1/2 \\ 1/2 & 1/3 & a_1/3 \\ a_1/2 & a_1/3 & a_1^2/3 \end{pmatrix}, $$

which has rank two instead of three.


Because of examples like this, we should discuss how Theorem 8.1 can handle
these cases. Usually it is possible to find a P x P nonsingular matrix M, and new
scaling matrix, say G, (that depends on M), such that

G, 2(MH,(0,)M)G, 12 -% A,M (8.17)

and

C, 12MS,(0,) A Normal(0, BF), (8.18)

where Af and By are positive definite. Then, by an argument entirely analogous


to the above (multiply (8.2) through by M),
cy [M- ye,- %J] = - [G, 12MH,(%,)MG, l*] - '[G, 12Ms,(%,)] + op( 1)
= - {A~}-[G,*MS,(%,)] + o,(l) (8.19)

~N0rmal(O,(z4~}-B~{A~}~~). (8.20)

Thus, inference can be carried out on 6, = Mm- BO using Theorem 8.1.


Fortunately, in order to perform inference about %,, one need not actually find
a linear combination of i?, - 0, that, when properly scaled, has a nondegenerate
limiting distribution (that is, one need not actually find the matrix M). It is enough
to know that one exists, and this is virtually always the case in identified models.
To see why the choice of M plays no role in performing inference about %0,suppose
that interest lies in testing a linear hypothesis about BO:

H,: R%, = r. (8.21)

This is stated equivalently in terms of 6, = M-BO as

He: PS, = r, (8.22)

where P = RM. Suppose for simplicity that A0 = B,, and A_: = Br, a case that arises
under-some assumptions covered earlier. Then, with 6, = M- eT, A&$($,) =
Mp lH;l(M-l and the Wald statistic for testing (8.22) is

W,=(P~T-r)f[PM-fi~(M-P~(P6^,-r)

= (R8, - r)[Rfi; R] -(Rf?, - r), (8.23)

which is the usual Wald statistic if we had started with gT and its estimated
asymptotic variance fi; . Although the reasoning is more complicated, similar
arguments can be made for testing nonlinear hypotheses of the form H,: c(%,) = 0.
In general, t-statistics, Wald statistics, quasi-LR statistics and LM statistics can
be computed in the usual ways without worrying about the rate of convergence
of each estimator. Of course, the rates of convergence might be of interest for other
reasons, such as to facilitate efficiency comparisons.

9. Asymptotic normality of M-estimators

9.1. Asymptotic normality

We now show how Theorem 8.1 can be applied when the optimand is

$$ Q_T(w,\theta) = \sum_{t=1}^T q_t(w_t,\theta). \tag{9.1} $$

In applying Theorem 8.1 to M-estimation, we need to be able to verify Conditions (iii) and (iv)(b) of Theorem 8.1. First consider Assumption (iv)(b), which requires that the score evaluated at $\theta_o$ satisfies the CLT:

$$ D_T^{-1/2}\sum_{t=1}^T s_t(\theta_o) \xrightarrow{d} \mathrm{Normal}(0,B_o), \tag{9.2} $$

where $s_t(\theta) \equiv \nabla_\theta q_t(w_t,\theta)'$ and $B_o$ is positive definite. If (9.2) is to hold at all, $D_T$ is usually easy to find. This is because $D_T^{-1/2}$ must satisfy

$$ D_T^{-1/2}\Bigl\{\mathrm{Var}\Bigl[\sum_{t=1}^T s_t(\theta_o)\Bigr]\Bigr\}D_T^{-1/2} = O(1); \tag{9.3} $$

in particular, the matrix on the left hand side of (9.3) must be bounded and at the same time uniformly positive definite (this latter requirement prevents us from choosing $D_T$ too large). Once $D_T$ has been found, a CLT for trending processes can be applied [see McLeish (1974) for martingale difference sequences, Wooldridge and White (1989) and Davidson (1992) for near epoch dependent functions of mixing processes]. In some cases a functional CLT (see Part IV) can be used to establish (9.2).
As in the essentially stationary case, $\{s_t(\theta_o)\colon t = 1,2,\ldots\}$ can be serially correlated, although under the complete dynamic specification assumptions of Sections 5 and 6 $\{s_t(\theta_o)\}$ is a martingale difference sequence with respect to $\Phi_t = \{w_1,\ldots,w_t\}$. Thus, there are many examples for which $\{s_t(\theta_o)\}$ is serially uncorrelated, although $\mathrm{Var}[s_t(\theta_o)]$ will not be bounded.

After the CLT has been argued to hold, we need to establish the key condition (8.9). Let $h_{tij}(\theta)$ denote the $(i,j)$th element of $h_t(\theta)$. Especially if $q_t(\cdot)$ is thrice continuously differentiable on $\mathrm{int}(\Theta)$, we can often establish the inequality

$$ \max_{\theta\in\mathcal{N}_T}|h_{tij}(\theta) - h_{tij}(\theta_o)| \le b_{Ttij}\,\|\theta - \theta_o\|, \tag{9.4} $$

where $b_{Ttij}$ is a positive random variable. Note that $\|\theta - \theta_o\| \le (c_{T(1)})^{-1/2}$ for all $\theta\in\mathcal{N}_T$, where $c_{T(1)}$ is the smallest element of $C_T$; therefore, if we can control the rate at which $b_{Ttij}$ grows, (8.9) is easily shown. The proof of the following lemma is straightforward.

Lemma 9.1

Assume that for all $i,j = 1,2,\ldots,P$,

(i) inequality (9.4) holds;
(ii) $\sum_{t=1}^T b_{Ttij} = o_p\bigl(c_{T(1)}^{1/2}(c_{T(i)}c_{T(j)})^{1/2}\bigr)$.

Then Condition (iii) of Theorem 8.1 holds for M-estimation.
To see that the conditions of this lemma are reasonable, consider what they entail in the strictly stationary case. Suppose that $q(w_t,\theta)$ is thrice continuously differentiable on $\mathrm{int}(\Theta)$ with the third derivative dominated by a function with finite moment (see, for example, Condition (iii) in Theorem 4.1). Then we can take $b_{Ttij} = b_{tij}$ to be a bound on the absolute value of the third derivatives of $q_t(\theta)$ on $\mathrm{int}(\Theta)$. Because $\{b_{tij}\}$ is stationary and ergodic with finite moment,

$$ T^{-1}\sum_{t=1}^T b_{tij} \xrightarrow{p} E(b_{tij}) \tag{9.5} $$

by the WLLN. Because $D_T = TI_P$ we can take $C_T = T^aI_P$ for any $a < 1$ and have (iii)(a) of Theorem 8.1 satisfied. For condition (ii) of Lemma 9.1 to hold we require $\sum_{t=1}^T b_{tij} = o_p(T^{3a/2})$; by (9.5) this holds for any $a > 2/3$.

Of course, Lemma 9.1 is useful only if it can be applied to nonlinear models with trending data. This is the case, at least for some applications, but finding $b_{Ttij}$ and verifying (ii) can be tedious. For illustrative purposes, we give a simple nonlinear least squares example.

Consider the model

$$ y_t = \alpha_o + f(z_t,\gamma_o) + \beta_ot + u_t, \tag{9.6} $$

where $\{(z_t,u_t)\colon t = 1,2,\ldots\}$ is a strictly stationary, weakly dependent sequence such that

$$ E(u_t|z_t) = 0. \tag{9.7} $$

Here, $\alpha_o$ and $\beta_o$ are scalars and $\gamma_o$ is a $J\times1$ vector. We assume that $f(z_t,\cdot)$ is thrice continuously differentiable on the open set $\Gamma$, with $\gamma_o\in\mathrm{int}(\Gamma)$. Note that if $\beta_o \ne 0$ then $\{y_t\}$ contains a linear time trend (extending this example to the case where $y_t$ has a polynomial time trend of any finite order is straightforward). Define the regression function

$$ m_t(x_t,\theta) \equiv \alpha + f(z_t,\gamma) + \beta t. \tag{9.8} $$

The score for observation $t$ for nonlinear least squares estimation is

$$ s_t(\theta) = -\nabla_\theta m_t(\theta)'u_t(\theta), \tag{9.9} $$

where, for model (9.8), $\nabla_\theta m_t(\theta) = (1,\nabla_\gamma f(z_t,\gamma),t)$. Letting $s_t \equiv s_t(\theta_o)$, it is straightforward to show that
$$ D_T^{-1/2}\Bigl\{\mathrm{Var}\Bigl[\sum_{t=1}^T s_t\Bigr]\Bigr\}D_T^{-1/2} = O(1), $$

where $D_T \equiv \mathrm{diag}\{T, TI_J, T^3\}$. (It helps to remember that $\sum_{t=1}^T t = O(T^2)$ and $\sum_{t=1}^T t^2 = O(T^3)$, and to first assume that $\{u_t\}$ is serially uncorrelated.) As we now show, it suffices to set $C_T \equiv \mathrm{diag}\{T^{a_1}, T^{a_1}I_J, T^{a_2}\}$ for some $a_1 < 1$ and $a_2 < 3$.

For nonlinear least squares estimation, there are always two pieces of the Hessian that need to be analyzed for convergence. The first is $\nabla_\theta m_t(\theta)'\nabla_\theta m_t(\theta)$ and the second is $\nabla_\theta^2m_t(\theta)u_t(\theta)$, where $\nabla_\theta^2m_t(\theta)$ is the $P\times P$ second derivative of $m_t(\theta)$. What must be shown is that

$$ \max_{\theta\in\mathcal{N}_T}\,(c_{T(i)}c_{T(j)})^{-1/2}\Bigl|\sum_{t=1}^T\bigl[\nabla_\theta m_{ti}(\theta)\nabla_\theta m_{tj}(\theta) - \nabla_\theta m_{ti}(\theta_o)\nabla_\theta m_{tj}(\theta_o)\bigr]\Bigr| = o_p(1) $$

and

$$ \max_{\theta\in\mathcal{N}_T}\,(c_{T(i)}c_{T(j)})^{-1/2}\Bigl|\sum_{t=1}^T\bigl[\nabla_\theta^2m_{tij}(\theta)u_t(\theta) - \nabla_\theta^2m_{tij}(\theta_o)u_t(\theta_o)\bigr]\Bigr| = o_p(1) $$

for all $i,j = 1,\ldots,P$. By looking at the form of $m_t(\theta)$ and recalling the discussion of the stationary case, the only terms that present a new challenge are the cross products $\nabla_\gamma f_{ti}(\gamma)t$, $i = 1,\ldots,J$, and the terms $\nabla_\gamma^2f_{tih}(\gamma)u_t(\theta)$, $i,h = 1,\ldots,J$. Let us look at the cross product term $\nabla_\gamma f_{ti}(\gamma)t$. Because this is differentiable, and because $c_{T(1)} = T^{a_1}$, to verify the conditions of Lemma 9.1 it suffices to show that $\sum_{t=1}^T g_{ti}t = o_p(T^{a_1 + a_2/2})$, where $g_{ti}$ is a bound on the partial derivatives of $\nabla_\gamma f_{ti}(\gamma)$ for all $\gamma\in\Gamma$. But $\{g_{ti}\}$ is a stationary sequence with finite moment, so it suffices to show that $\sum_{t=1}^T[g_{ti} - E(g_{ti})]t = o_p(T^{a_1 + a_2/2})$ and $\sum_{t=1}^T t = o_p(T^{a_1 + a_2/2})$. By the WLLN [see Andrews (1988)], $\sum_{t=1}^T[g_{ti} - E(g_{ti})]t = O_p(T^{3/2})$, and $\sum_{t=1}^T t = O(T^2)$. Thus, we must choose $(a_1 + a_2/2) > 2$; but under the restrictions $a_1 < 1$ and $a_2 < 3$, $(a_1 + a_2/2)$ can be made arbitrarily close to $5/2$.

The sum containing the terms $\nabla_\gamma^2f_{tij}(\gamma)u_t(\theta)$ can be handled in a similar manner by writing $u_t(\theta) = u_t + [f(z_t,\gamma_o) - f(z_t,\gamma)] + (\beta_o - \beta)t$ and using the fact that the third derivative of $f_t(\gamma)$ is bounded. Thus, we can conclude that the NLS estimator of $(\alpha_o,\gamma_o,\beta_o)$ is consistent and asymptotically normal under fairly weak assumptions. Note that we have allowed for dynamic misspecification and conditional heteroskedasticity in $\{u_t\}$.
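A small simulation makes the practical message concrete: the NLS variance estimate built from the unscaled Hessian (here via the Gauss–Newton approximation, as in (6.18)) is computed without ever forming $D_T = \mathrm{diag}\{T, T, T^3\}$. The specific $f(z,\gamma) = \exp(\gamma z)$ is a hypothetical choice made only for illustration.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(2)
T = 400
t = np.arange(1, T + 1)
z = rng.normal(size=T)
# model (9.6): y_t = alpha + f(z_t, gamma) + beta*t + u_t
alpha, gamma, beta = 1.0, 0.5, 0.05
y = alpha + np.exp(gamma * z) + beta * t + rng.normal(size=T)

resid = lambda th: y - (th[0] + np.exp(th[1] * z) + th[2] * t)
fit = least_squares(resid, x0=np.array([0.0, 0.1, 0.0]))
u_hat = resid(fit.x)
sigma2 = u_hat @ u_hat / T
# usual NLS covariance sigma2 * (sum grad m grad m')^{-1}; the Jacobian
# of resid is -grad m, so fit.jac' fit.jac is the required sum, and the
# differing convergence rates of (alpha, gamma, beta) never appear.
avar = sigma2 * np.linalg.inv(fit.jac.T @ fit.jac)
print(fit.x, np.sqrt(np.diag(avar)))
```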
Another example is the nonlinear trend model $y_t = \alpha_ot^{\beta_o} + u_t$, where $E(u_t^2)$ is uniformly bounded, $E(u_t) = 0$, and $\{u_t\}$ is an essentially stationary process. It turns out that we must assume that $\beta_o > -\tfrac12$; otherwise, there is not enough information in the sample to consistently estimate the parameters. The apparent simplicity of this two parameter model for $E(y_t)$ is misleading. To carefully verify Condition (iii) of Theorem 8.1 for the NLS estimator is very tedious. We will not go through the details here, but Lemma 9.1 can be used. Wooldridge (1986, Chapter 3, Corollary 6.3) gives the details.
9.2. Estimating the asymptotic variance

Estimation of $A_o$ and $B_o$ follows in much the same way as in the essentially stationary case. Define $\hat H_T \equiv \sum_{t=1}^T h_t(\hat\theta_T)$. Then, under the conditions of Theorem 8.1, $D_T^{-1/2}\hat H_TD_T^{-1/2} \xrightarrow{p} A_o$. The estimator based on $a_t(x_t,\hat\theta_T)$ (Section 4.5) is also generally consistent when it is available.

If the score is serially uncorrelated, and $\{s_t(\theta)s_t(\theta)'\}$ satisfies Condition (iii) in Theorem 8.1 (in addition to $\{h_t(\theta)\}$), then $D_T^{-1/2}\hat J_TD_T^{-1/2} \xrightarrow{p} B_o$, where

$$ \hat J_T = \sum_{t=1}^T s_t(\hat\theta_T)s_t(\hat\theta_T)'. \tag{9.10} $$

As mentioned earlier, the absence of serial correlation in the score follows in the same circumstances as in the essentially stationary case. The asymptotic variance of $\hat\theta_T$ is estimated as

$$ \widehat{\mathrm{Avar}}(\hat\theta_T) = \hat H_T^{-1}\hat J_T\hat H_T^{-1}, \tag{9.11} $$

and this is one of the matrices used for obtaining asymptotic standard errors and for forming Wald statistics derived for the essentially stationary case with serially uncorrelated score. Equation (9.11) again shows that, for practical purposes, the scaling matrix $D_T$ has disappeared from the analysis. The absence of a scaling matrix in $\widehat{\mathrm{Avar}}(\hat\theta_T)$ is entirely consistent with observed econometric practice; one rarely sees consideration of scaling matrices appearing in applied work, and this is justifiably so.

To complete the analysis of the globally nonstationary case, we should have methods of estimating $B_o$ when $\{s_t(\theta_o)\}$ is serially correlated. Conceptually, this causes no problems. The Hansen–Hodrick estimator and its weighted versions can be shown to be generally consistent for $B_o$ when the autocovariances of the score are known to be zero after a certain point. We conjecture that the general serial-correlation-robust estimators of the score covered in Section 4.5, when properly standardized, remain consistent under reasonably weak conditions. Unfortunately, we know of no formal results along these lines. This is an important topic for future research.

Part IV: The nonergodic case

10. General results

10.1. Introduction

In this final part we turn to inference in models with nonergodic processes. Actually, the primary distinctions between this part and the previous ones are that the score, when properly standardized, is not weakly dependent (so it does not have a limiting normal distribution), and the Hessian, when properly standardized, does not necessarily converge in probability to a nonstochastic matrix. Instead, the standardized Hessian and score converge jointly in distribution, typically to a function of multivariate Brownian motion. The functional CLT (FCLT) or invariance principle plays a prominent role for determining the limiting distributions.

Because of the nonstandard limit behavior of the score and Hessian, the properly standardized estimator and related test statistics do not necessarily have the usual normal and chi-square limiting distributions. The limiting random variable can depend on unknown parameters, which makes asymptotic theory difficult to apply in practice.

Recently there has been much work on estimation of linear models with nonergodic processes. A short list of references includes Phillips (1987, 1988, 1991), Stock (1987), Phillips and Durlauf (1986), Park and Phillips (1988, 1989), Phillips and Hansen (1990) and Sims et al. (1990). Fortunately, some of this research has focused on finding statistics with either standard limiting distributions or at least limiting distributions that are free of nuisance parameters, and which can therefore be used for inference. Some of these results are given in Section 11; Watson (this Handbook) gives a more extensive treatment.

In Section 10.2 we state a general theorem for nonlinear models that is a straightforward extension of Theorem 8.1. Section 11 covers some applications to linear models. Section 12 sketches how the general theorem can be applied to a particular nonlinear regression model.

All of our examples are for processes that are integrated of order one. It is possible to apply them to explosive processes as in Domowitz and Muus (1988) and Jeganathan (1988). An interesting open question is whether these results, or the results in Part II, can be applied to estimation with strongly dependent data.

10.2. Abstract limiting distribution result

We first analyze a setup very similar to that in Section 8.2. The score s,(O) and
Hessian Hr.(O) of the objective function QT(0) are defined there.

Theorem 10.1

Let Assumptions (i)-(iii) of Theorem 8.1 hold. Replace Assumption (iv) with
(iv) (D; 2HT(0,)O; li2, 0; 12S,(0,)) -%(,d,, Yp,), where &, is positive definite
with probability one.

Then there exists a sequence of estimators {&+ T = 1,2,. .} such that (8.1) and
(8.6) hold. Further,

D;"(B,- 0,)%c4,'.4Po. (10.1)


The proof of Theorem 10.1 is identical to that of Theorem 8.1 up to establishing
the first order representation (8.6). Then, (10.1) follows by the continuous conver-
gence theorem and Assumption (iv).
For linear models Condition (iii) (see Theorem 8.1) is trivially satisfied because
the Hessian does not depend on $\theta$, so the difficulty lies in establishing (iv). Of
course, one hardly needs a result like Theorem 10.1 to analyze linear models since
the estimators are given in closed form. In Section 11 we show how the functional
CLT can be used to establish (iv) and the distribution of $-\mathcal{A}_0^{-1}\mathcal{S}_0$ for a class of
linear models. As stated in the introduction, at this point we have no guarantee
that the distribution of $-\mathcal{A}_0^{-1}\mathcal{S}_0$ can be used for inference, as it may depend in an
intractable way on nuisance parameters.
In addition to having to establish (iv), for nonlinear models we also have to
verify (iii). As we saw in Sections 8 and 9, this is nontrivial for models with trending
but weakly dependent data. It is even more difficult when we allow for nonergodic
processes. For future applications of Theorem 10.1, more research is needed to
see how (iii) can be verified. In Section 12 we show how the FCLT can be used
to verify (iii) for a particular nonlinear model.
Theorem 10.1 assumes that $\mathcal{A}_0$ is positive definite with probability one. As with
Theorem 8.1, one might have to employ linear transformations of $(\hat\theta_T - \theta_0)$ to ensure
that this holds in particular applications. But, unlike the trend-stationary case,
different linear combinations may have fundamentally different limiting distri-
butions, and so some care is required for inference about the parameters of
interest. More on this in Section 11 and in Watson (this Handbook).
The notion of locally asymptotically mixed normal (LAMN) families [for example,
LeCam (1986), Jeganathan (1980, 1988) and Phillips (1989)] has played an impor-
tant role in studying efficiency issues in nonergodic models. It turns out to be
closely related to the possibility of finding asymptotically chi-square test statistics.
The LAMN condition was originally applied to log-likelihood functions but, as
shown in Phillips (1989), it can be applied to more general criterion functions. We
do not consider the LAMN condition or its extensions [Phillips (1989, 1991)] here,
as it involves a substantial technical burden, and for the purposes of establishing
limiting distributions the full LAMN machinery is not needed. Nevertheless, it is
informative to see what LAMN entails in Theorem 10.1.
For purposes of inference, the important consequence of LAMN is that
Assumption (iv) of Theorem 10.1 is satisfied with

$\mathcal{S}_0 = \mathcal{A}_0^{1/2}\mathcal{Z}_0,$   (10.2)

where $\mathcal{Z}_0 \sim \text{Normal}(0, I_P)$ and is independent of $\mathcal{A}_0$. Thus, the LAMN condition
restricts how the random quantities $\mathcal{A}_0$ and $\mathcal{S}_0$ can be related to each other. We
can easily illustrate why this condition is important for inference purposes. Even
though $D_T^{1/2}(\hat\theta_T - \theta_0)$ does not have a limiting normal distribution under (10.2)
and the conditions of Theorem 10.1, under LAMN it follows immediately that

$[D_T^{-1/2}H_T(\theta_0)D_T^{-1/2}]^{1/2}[D_T^{1/2}(\hat\theta_T - \theta_0)] \xrightarrow{d} -\mathcal{A}_0^{-1/2}\mathcal{S}_0 \sim \text{Normal}(0, I_P).$   (10.3)

Thus, when normed by a random matrix, $D_T^{1/2}(\hat\theta_T - \theta_0)$ has a limiting multivariate
standard normal distribution. From (10.3),

$(\hat\theta_T - \theta_0)' H_T(\hat\theta_T)(\hat\theta_T - \theta_0) \xrightarrow{d} \chi_P^2,$   (10.4)

and so a quadratic form in the estimator has a standard limiting distribution.
Generally, when the LAMN condition holds, there exist quadratic forms that have
limiting chi-square distributions that can be used for inference about $\theta_0$.

11. Some results for linear models

We begin with the linear model

$y_t = \eta_0 + x_t\beta_0 + u_t, \qquad t = 1, 2, \ldots,$   (11.1)

where $u_t$ is an I(0), zero-mean process, and the $1 \times K$ vector $x_t$ is an I(1) process:

$x_t = x_{t-1} + v_t,$   (11.2)

where $\{v_t\}$ is an I(0), zero-mean process, and there are no cointegrating relationships
among the $x_t$ ($x_0$ is an arbitrary random vector). These assumptions imply that
$y_t$ and $x_t$ are cointegrated [see Engle and Granger (1987), Watson (this Handbook)]
and, given the normalization that the coefficient on $y_t$ is unity, there is only one
cointegrating vector. (Technically, we allow for $\beta_0 = 0$, in which case $y_t$ is I(0).)
Due to the work of Stock (1987), Phillips and Durlauf (1986), Phillips (1988), Park
and Phillips (1988) and others, it is now known that the limiting distribution of the
OLS estimator of $\beta_0$ is nonnormal. To derive the limiting distribution requires the
notion of a functional CLT.

Definition 11.1

Let $\{w_t\colon t = 1, 2, \ldots\}$ be an $M \times 1$ strictly stationary, weakly dependent process such
that
(i) $E(w_t'w_t) < \infty$,
(ii) $E(w_t) = 0$,
(iii) $\Omega \equiv \lim_{T\to\infty} \mathrm{Var}(T^{-1/2}\sum_{t=1}^T w_t)$ exists.
Then $\{w_t\}$ is said to satisfy the functional central limit theorem (FCLT) (or invariance
principle) if the stochastic process $\{B_T\colon T = 1, 2, \ldots\}$, defined by

$B_T(r) = T^{-1/2}\sum_{t=1}^{[Tr]} w_t, \qquad 0 \le r \le 1,$

converges in distribution to $\mathcal{BM}(\Omega)$, an $M$-dimensional Brownian motion with
covariance matrix $\Omega$. Here, $[Tr]$ denotes the integer part of $Tr$.
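As a concrete illustration of Definition 11.1, the following Python sketch (the simulation design is ours, not the chapter's) constructs the partial-sum process $B_T(r)$ from a weakly dependent AR(1) sequence; by the FCLT, its path approximates a Brownian motion whose variance is the long run variance $\Omega$.

import numpy as np

# Build B_T(r) of Definition 11.1 for an illustrative scalar AR(1) process.
# All parameter choices here are hypothetical, for illustration only.
rng = np.random.default_rng(0)
T = 1000
e = rng.standard_normal(T)
w = np.empty(T)
w[0] = e[0]
for t in range(1, T):               # weakly dependent: w_t = 0.5*w_{t-1} + e_t
    w[t] = 0.5 * w[t - 1] + e[t]

# B_T evaluated at r = k/T: partial sums scaled by T^{-1/2}
B = np.concatenate(([0.0], np.cumsum(w))) / np.sqrt(T)
print(B[T // 2], B[T])              # B_T(1/2) and B_T(1)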

The notion of the process $B_T$ converging in distribution to $\mathcal{BM}(\Omega)$ is defined in
terms of weak convergence of their probability distributions. Weak convergence is
the extension to general metric spaces of the usual notion of convergence in
distribution over finite-dimensional Euclidean spaces. The reader is referred to
Billingsley (1968, Chapters 1-5) for definitions and background material.
The use of the FCLT to obtain limiting distribution results for estimators was
pioneered by Phillips (1986, 1987) for a univariate autoregression with a unit root.
The multivariate FCLT was first used in econometric applications by Phillips and
Durlauf (1986). The FCLT is known to hold under conditions analogous to the
CLT; see, for example, Billingsley (1968), McLeish (1977), Herndorff (1984), Phillips
and Durlauf (1986) and Wooldridge and White (1988). Although we have assumed
strict stationarity of $\{w_t\}$ for simplicity, this is not necessary; certain forms of
bounded heterogeneity are allowed, as in the references. In what follows, we simply
assume that the FCLT holds.
The following lemma is very useful for analyzing least squares estimators with
I(1) processes. Parts (i)-(iv) were proven by Park and Phillips (1988).

Lemma 11.1

Let $\{w_t \equiv (u_t', v_t')'\}$ be an $M \times 1$ strictly stationary, weakly dependent stochastic
process with finite second moments and zero mean. Here, $u_t$ is $M_1 \times 1$ and $v_t$ is
$M_2 \times 1$. Define

$\Omega \equiv \lim_{T\to\infty}\mathrm{Var}\Big(T^{-1/2}\sum_{t=1}^T w_t\Big) = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{pmatrix}$   (11.3)

and

$B_T(r) \equiv T^{-1/2}\sum_{t=1}^{[Tr]} w_t = (B_{T1}(r)', B_{T2}(r)')'.$   (11.4)

Also define

$\Sigma_{21} \equiv E(v_t u_t'), \qquad \Lambda_{21} \equiv \sum_{s=1}^{\infty} E(v_{t-s} u_t'),$   (11.5)

and

$\Delta_{21} \equiv \Sigma_{21} + \Lambda_{21}.$   (11.6)

Assume that $\Omega_{11}$ and $\Omega_{22}$ are positive definite. Define $x_t$ as in (11.2) (note that $x_t$
is a column vector for the purposes of stating the lemma). Let $B$ denote a Brownian
motion with covariance matrix $\Omega$, and partition $B$ as $B = (B_1', B_2')'$. Thus, $B_1 \sim
\mathcal{BM}(\Omega_{11})$ and $B_2 \sim \mathcal{BM}(\Omega_{22})$, and these processes are independent if and only if
$\Omega_{21} = 0$. Then, under additional regularity conditions, the following hold jointly
as well as separately:

(i) $T^{-3/2}\sum_{t=1}^T x_t \xrightarrow{d} \int_0^1 B_2(r)\,\mathrm{d}r$;

(ii) $T^{-5/2}\sum_{t=1}^T t x_t \xrightarrow{d} \int_0^1 r B_2(r)\,\mathrm{d}r$;

(iii) $T^{-2}\sum_{t=1}^T x_t x_t' \xrightarrow{d} \int_0^1 B_2(r)B_2(r)'\,\mathrm{d}r$;

(iv) $T^{-3/2}\sum_{t=1}^T t u_t \xrightarrow{d} \int_0^1 r\,\mathrm{d}B_1(r)$;

(v) $T^{-1}\sum_{t=1}^T x_t u_t' \xrightarrow{d} \int_0^1 B_2(r)\,\mathrm{d}B_1(r)' + \Delta_{21}$.

As shown in Park and Phillips (1988), parts (i)-(iv) are an immediate consequence
of the convergence in distribution of $B_T$ to $\mathcal{BM}(\Omega)$ and the continuous convergence
theorem [Billingsley (1968, Theorem 5.1)]. For example, part (i) follows because
$x_t = T^{1/2}B_{T2}(t/T)$, and so

$T^{-3/2}\sum_{t=1}^T x_t = T^{-1}\sum_{t=1}^T B_{T2}(t/T) = \int_0^1 B_{T2}(r)\,\mathrm{d}r \xrightarrow{d} \int_0^1 B_2(r)\,\mathrm{d}r,$

where the second equality follows because $B_{T2}(\cdot)$ is stepwise continuous and the
final convergence result follows because $\int_0^1 B_2(r)\,\mathrm{d}r$ is a continuous function of $B$.
For part (ii), $t x_t = T^{1/2} t B_{T2}(t/T)$; then

$T^{-5/2}\sum_{t=1}^T t x_t = T^{-1}\sum_{t=1}^T (t/T) B_{T2}(t/T) = \int_0^1 r B_{T2}(r)\,\mathrm{d}r \xrightarrow{d} \int_0^1 r B_2(r)\,\mathrm{d}r.$

Parts (iii) and (iv) are handled similarly, where part (iv) uses the fact that $u_t =
T^{1/2}[B_{T1}(t/T) - B_{T1}((t-1)/T)]$. Part (v) is more difficult to verify, and does not
follow simply from the convergence of $B_T$ to $\mathcal{BM}(\Omega)$. The same kind of proof as for
parts (i)-(iv) initially appears to work, but the final convergence in distribution
does not follow from the continuous convergence theorem because the right hand
side in part (v) is not a continuous function of $\mathcal{BM}(\Omega)$. Nevertheless, Hansen
(1992b) shows that (v) holds under fairly general conditions.
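A quick Monte Carlo sketch (ours, with arbitrary design choices) of the $T^{-2}$ scaling in part (iii): for a driftless scalar random walk, $T^{-2}\sum_t x_t^2$ has a nondegenerate limit, so its simulated mean and spread stabilize as $T$ grows rather than exploding or vanishing.

import numpy as np

# Check that T^{-2} * sum_t x_t^2 is O_p(1) for a random walk x_t,
# consistent with Lemma 11.1(iii); the design below is illustrative.
rng = np.random.default_rng(1)

def scaled_moment(T, reps=2000):
    v = rng.standard_normal((reps, T))      # i.i.d. innovations v_t
    x = np.cumsum(v, axis=1)                # x_t = x_{t-1} + v_t
    return (x ** 2).sum(axis=1) / T ** 2

for T in (100, 400, 1600):
    s = scaled_moment(T)
    print(T, round(s.mean(), 3), round(s.std(), 3))   # roughly stable in T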
For analyzing the model (11.1) with $x_t$ $1 \times K$ we simply replace $v_t$ with $v_t'$, $x_t$
with $x_t'$, $B$ with $B'$, and so on.
The assumption that $\Omega_{22}$ is positive definite implies that the elements of $x_t$ are
not cointegrated, a restriction that should be borne in mind in what follows. With
this lemma in hand, we can establish much of the needed asymptotic distribution
theory for linear regressions with I(1) processes. Conclusions (ii) and (iv) are needed
to allow for trend-stationary processes and I(1) processes with drift.
To analyze (11.1), we only need the conclusions of this lemma when $u_t$ is a scalar.
Thus, in Lemma 11.1, $\Sigma_{21}$, $\Lambda_{21}$, and $\Delta_{21}$ are $K \times 1$ vectors, and the long run
variance of $\{v_t\}$ is

$\Omega_{22} = E(v_t'v_t) + \sum_{j=1}^{\infty}\{E(v_{t-j}'v_t) + E(v_t'v_{t-j})\},$   (11.7)

the long run covariance between $\{v_t\}$ and $\{u_t\}$ is

$\Omega_{21} = E(v_t'u_t) + \sum_{j=1}^{\infty}\{E(v_{t-j}'u_t) + E(v_{t+j}'u_t)\},$   (11.8)

and the long run variance of $\{u_t\}$ is

$\omega_{11} = E(u_t^2) + 2\sum_{j=1}^{\infty}E(u_t u_{t-j}).$   (11.9)

Define the $1 \times (K+1)$ stochastic process $B_T(r) = (B_{T1}(r), B_{T2}(r))$ and the limit $B \equiv
(B_1, B_2)$ as the transpose of that in Lemma 11.1.
Let $\hat\alpha_T$, $\hat\beta_T$ denote the OLS estimators from the regression

$y_t$ on $1, x_t, \qquad t = 1, \ldots, T.$   (11.10)

To obtain the limiting distribution of $\hat\beta_T$, write

$\hat\beta_T = \beta_0 + \Big[\sum_{t=1}^T (x_t - \bar x)'(x_t - \bar x)\Big]^{-1}\sum_{t=1}^T (x_t - \bar x)'u_t$   (11.11)

or

$T(\hat\beta_T - \beta_0) = \Big[T^{-2}\sum_{t=1}^T (x_t - \bar x)'(x_t - \bar x)\Big]^{-1} T^{-1}\sum_{t=1}^T (x_t - \bar x)'u_t,$   (11.12)

assuming that $[T^{-2}\sum_{t=1}^T (x_t - \bar x)'(x_t - \bar x)]^{-1}$ exists w.p.a.1. Using Lemma 11.1,
Park and Phillips (1988) have derived the limiting distribution of $T(\hat\beta_T - \beta_0)$:

$T(\hat\beta_T - \beta_0) \xrightarrow{d} \Big[\int_0^1 \bar B_2(r)'\bar B_2(r)\,\mathrm{d}r\Big]^{-1}\Big[\int_0^1 \bar B_2(r)'\,\mathrm{d}B_1(r) + \Delta_{21}\Big],$   (11.13)

where $\bar B_2$ denotes the demeaned process $B_2$: for each $0 \le r \le 1$,

$\bar B_2(r) = B_2(r) - \int_0^1 B_2(s)\,\mathrm{d}s.$   (11.14)

Incidentally, because $\Omega_{22}$ is positive definite, $\int_0^1 \bar B_2(r)'\bar B_2(r)\,\mathrm{d}r$ is nonsingular with
probability one, so that the distribution of the right hand side of (11.13) is
well-defined and $\hat\beta_T$ exists w.p.a.1.
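The following Monte Carlo sketch (design ours) illustrates (11.13) in the scalar case $K = 1$: the OLS slope converges at rate $T$, and when the nuisance parameters $\Omega_{21}$ and $\Delta_{21}$ are nonzero the limiting distribution of $T(\hat\beta_T - \beta_0)$ is shifted and nonnormal.

import numpy as np

# Simulate y_t = eta0 + x_t*beta0 + u_t with u_t correlated with the
# innovations of the I(1) regressor, so Delta_21 != 0; values are illustrative.
rng = np.random.default_rng(2)

def t_beta_dev(T, reps=3000, beta0=1.0):
    out = np.empty(reps)
    for r in range(reps):
        e = rng.standard_normal((T, 2))
        v = e[:, 0]                     # innovations of x_t
        u = 0.6 * v + e[:, 1]           # endogenous error: E(v_t u_t) = 0.6
        x = np.cumsum(v)                # x_t = x_{t-1} + v_t
        y = 1.0 + beta0 * x + u
        xd = x - x.mean()
        bhat = (xd @ y) / (xd @ xd)     # OLS slope (intercept partialled out)
        out[r] = T * (bhat - beta0)
    return out

d = t_beta_dev(200)
print(round(d.mean(), 3), np.quantile(d, [0.05, 0.5, 0.95]).round(3))
# location is shifted away from zero, reflecting the Delta_21 term in (11.13)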
Generally, the distribution in (11.13) depends, in an intractable way, on the
nuisance parameters $\Omega_{21}$ and $\Delta_{21}$. But there is one case where it can be applied
immediately, and that is when the regressors are strictly exogenous in the sense that

$E(\Delta x_s' u_t) = 0, \qquad \text{all } t \text{ and } s.$   (11.15)

Assumption (11.15) implies that $\Delta_{21} = \Omega_{21} = 0$, so that $B_1$ and $B_2$ are independent
Brownian motions. Letting $\omega_1^2 \equiv \omega_{11}$, Park and Phillips (1988) argue that, when
$\Omega_{21} = 0$, the distribution of $\int_0^1 \bar B_2(r)'\,\mathrm{d}B_1(r)$, conditional on $B_2$, is normal:

$\int_0^1 \bar B_2(r)'\,\mathrm{d}B_1(r) \,\Big|\, B_2 \sim \text{Normal}\Big(0,\; \omega_1^2\int_0^1 \bar B_2(r)'\bar B_2(r)\,\mathrm{d}r\Big).$   (11.16)
A useful heuristic device (for which I am grateful to Bruce Hansen) helps us to
see where (11.16) comes from. Take the definition of a stochastic integral to be

$\int_0^1 \bar B_2(r)'\,\mathrm{d}B_1(r) = \operatorname*{plim}_{T\to\infty} \sum_{t=1}^T \bar B_2(t/T)'[B_1(t/T) - B_1((t-1)/T)] = \operatorname*{plim}_{T\to\infty} \sum_{t=1}^T \bar B_2(t/T)'\varepsilon_{Tt},$

where the $\varepsilon_{Tt}$ are i.i.d. Normal$(0, \omega_1^2 T^{-1})$ by definition of a Brownian motion, and
are independent of the process $\bar B_2(\cdot)$. Therefore, conditional on $\bar B_2(\cdot)$,

$\sum_{t=1}^T \bar B_2(t/T)'\varepsilon_{Tt} \sim \text{Normal}\Big(0,\; \omega_1^2 T^{-1}\sum_{t=1}^T \bar B_2(t/T)'\bar B_2(t/T)\Big).$

Taking the limit of both sides yields (11.16).
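The heuristic is easy to mimic numerically. In the sketch below (ours), one path of the demeaned process is held fixed while the increments of $B_1$ are redrawn many times; the variance of the resulting Riemann sums matches $\omega_1^2 T^{-1}\sum_t \bar B_2(t/T)^2$, as the conditional normality argument predicts (here $\omega_1^2 = 1$).

import numpy as np

# Hold one B2_bar path fixed; redraw the i.i.d. Normal(0, 1/T) increments of
# B1 and compare the variance of the Riemann sums to the formula in (11.16).
rng = np.random.default_rng(5)
T = 1000
B2 = np.cumsum(rng.standard_normal(T)) / np.sqrt(T)   # one Brownian path
B2_bar = B2 - B2.mean()                               # demeaned, held fixed

eps = rng.standard_normal((5000, T)) / np.sqrt(T)     # increments of B1
sums = eps @ B2_bar                                   # 5000 Riemann sums
print(round(sums.var(), 4), round(B2_bar @ B2_bar / T, 4))   # nearly equal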



Given (11.16), we have

$\Big[\int_0^1 \bar B_2(r)'\bar B_2(r)\,\mathrm{d}r\Big]^{-1/2}\int_0^1 \bar B_2(r)'\,\mathrm{d}B_1(r) \sim \text{Normal}(0, \omega_1^2 I_K).$   (11.17)

But $(\bar X'\bar X/T^2)^{-1/2} \xrightarrow{d} [\int_0^1 \bar B_2(r)'\bar B_2(r)\,\mathrm{d}r]^{-1/2}$, so that

$(\bar X'\bar X/T^2)^{-1/2}\, T(\hat\beta_T - \beta_0) \xrightarrow{d} \text{Normal}(0, \omega_1^2 I_K),$   (11.18)

where $\bar X$ denotes the $T \times K$ matrix with $t$th row $x_t - \bar x$. In practice, this means
that $\hat\beta_T$ can be treated as being asymptotically normal. Loosely,

$\hat\beta_T \stackrel{a}{\sim} \text{Normal}(\beta_0,\; \omega_1^2(\bar X'\bar X)^{-1}).$   (11.19)

Except for the presence of $\omega_1^2$ in place of $\sigma_1^2 \equiv E(u_t^2)$, (11.19) is identical to the
usual approximation for the slope coefficients in regressions with essentially
stationary, weakly dependent processes and appropriately homoskedastic and
serially uncorrelated errors. Note that this is a problem for which the LAMN
condition mentioned in Section 10.2 is satisfied.
Asymptotically valid t-statistics and Wald or F-statistics can be obtained by
replacing the usual estimate of $\sigma_1^2$, $\hat\sigma_1^2$ (the square of the standard error of the
regression), with a consistent estimator of $\omega_1^2$. A consistent estimator $\hat\omega_1^2$ is obtained
by applying the estimators in Section 4.5 to the OLS residuals $\{\hat u_t\}$ (for example,
(4.30)). Asymptotically valid t-statistics are obtained by multiplying the usual
t-statistics by the ratio $\hat\sigma_1/\hat\omega_1$; asymptotically valid Wald or F-statistics are obtained
by multiplying the usual Wald or F-statistic by the ratio $\hat\sigma_1^2/\hat\omega_1^2$.
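The Section 4.5 estimators referred to above are kernel-type long run variance estimators; the sketch below (ours) uses one common member of that class, the Bartlett (Newey-West) form, with an arbitrary bandwidth, and applies the rescaling just described to a usual t-statistic.

import numpy as np

def bartlett_lrv(u, L):
    # omega2_hat = gamma_0 + 2 * sum_{j=1}^{L} (1 - j/(L+1)) * gamma_j
    T = len(u)
    u = u - u.mean()
    omega2 = u @ u / T
    for j in range(1, L + 1):
        gamma_j = u[j:] @ u[:-j] / T
        omega2 += 2.0 * (1.0 - j / (L + 1.0)) * gamma_j
    return omega2

def adjust_tstat(t_usual, resid, L=8):
    # multiply the usual t-statistic by sigma_hat/omega_hat, as in the text;
    # the bandwidth L = 8 is an arbitrary illustrative choice
    sigma2 = resid @ resid / len(resid)
    omega2 = bartlett_lrv(resid, L)
    return t_usual * np.sqrt(sigma2 / omega2)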
Other than the strict exogeneity case, there is at least one other practical
application of (11.13), which is testing for a unit root, as in Dickey and Fuller
(1979) and Phillips (1987). Although the t-statistic does not have a limiting standard
normal distribution, its limiting distribution is either free of nuisance parameters
(as in the Dickey-Fuller setup) or a simple transformation of it is free of nuisance
parameters (as in Phillips); see Stock (this Handbook).
We now extend the model by allowing for I(0) regressors in addition to the I(1)
regressors $x_t$. Let $z_t$ be a $1 \times J$ vector I(0) process (these can be any I(0) variables,
including leads and lags of $\Delta x_t$ and lags of $\Delta y_t$ if $y_t$ is I(1)). The model is

$y_t = \alpha_0 + x_t\beta_0 + z_t\gamma_0 + e_t,$   (11.20)

where $e_t$ is an I(0) zero-mean process, and we assume that

$E(z_t'e_t) = 0;$   (11.21)

this condition allows us to identify $\gamma_0$, the vector of coefficients on the I(0)
variables. From the results for model (11.1), we know that $\beta_0$ can be consistently
estimated by ignoring the I(0) process $z_t$ and obtaining $\hat\beta_T$ from the regression
(11.10). This is easily seen by writing (11.20) as

$y_t = \eta_0 + x_t\beta_0 + u_t,$   (11.22)

where $u_t \equiv e_t + z_t\gamma_0 - E(z_t\gamma_0)$ and $\eta_0 \equiv \alpha_0 + E(z_t\gamma_0)$. Then $u_t$ is I(0) with zero mean,
and so the limiting distribution of $T(\hat\beta_T - \beta_0)$ is given in equation (11.13). As is
now fairly well known [for example, Phillips and Durlauf (1986), Park and Phillips
(1988)], omitting I(0) regressors does not affect our ability to consistently estimate
$\beta_0$ when $x_t$ is I(1) and has no cointegrating relationships among its elements.
For a variety of reasons we need to know what happens when $z_t$ is included in
the regression. Let $\tilde\alpha_T$, $\tilde\beta_T$, and $\tilde\gamma_T$ denote the OLS estimators from the regression

$y_t$ on $1, x_t, z_t, \qquad t = 1, \ldots, T.$   (11.23)

The following lemma is useful for finding the asymptotic distribution of properly
standardized $\tilde\beta_T$ and $\tilde\gamma_T$. Its proof uses Lemma 11.1 and is given in Wooldridge
(1991c, Lemma 5.1).

Lemma 11.2

Let $\{x_t\}$, $\{z_t\}$, and $\{e_t\}$ satisfy (11.2) and (11.21). Let $\{\ddot x_t\colon t = 1, \ldots, T\}$ denote the
demeaned $x_t$ and let $\{\ddot z_t\colon t = 1, \ldots, T\}$ denote the demeaned $z_t$. Let $\tilde x_t$ denote the
$1 \times K$ residuals from the regression

$\ddot x_t$ on $1, \ddot z_t, \qquad t = 1, \ldots, T,$   (11.24)

and let $\tilde z_t$ denote the $1 \times J$ residuals from the regression

$z_t$ on $1, \ddot x_t, \qquad t = 1, \ldots, T.$   (11.25)

Then the following asymptotic equivalences hold:

(i) $T^{-2}\sum_{t=1}^T \tilde x_t'\tilde x_t = T^{-2}\sum_{t=1}^T \ddot x_t'\ddot x_t + o_p(1)$;

(ii) $T^{-1}\sum_{t=1}^T \tilde x_t'e_t = T^{-1}\sum_{t=1}^T \ddot x_t'e_t + o_p(1)$;

(iii) $T^{-1}\sum_{t=1}^T \tilde z_t'\tilde z_t = T^{-1}\sum_{t=1}^T \ddot z_t'\ddot z_t + o_p(1)$;

(iv) $T^{-1/2}\sum_{t=1}^T \tilde z_t'e_t = T^{-1/2}\sum_{t=1}^T \ddot z_t'e_t + o_p(1)$.

Note that the $\ddot x_t$ are the residuals from the regression $x_t$ on 1. Thus, Lemma 11.2
says that, for certain purposes, these can replace the residuals from the regression
$x_t$ on $1, z_t$. A similar statement holds for $\ddot z_t$ and $\tilde z_t$.
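Lemma 11.2(i) is easy to see numerically. In the sketch below (scalar case, design ours), the residuals from regressing the demeaned I(1) regressor on the demeaned I(0) regressor are interchangeable with the demeaned regressor itself once both are scaled by $T^{-2}$.

import numpy as np

# Compare T^{-2} sum x_tilde_t^2 with T^{-2} sum x_ddot_t^2 (K = J = 1).
rng = np.random.default_rng(3)
T = 5000
x = np.cumsum(rng.standard_normal(T))      # I(1) regressor
z = rng.standard_normal(T)                 # I(0) regressor
xd = x - x.mean()                          # demeaned x_t
zd = z - z.mean()                          # demeaned z_t
x_tilde = xd - (zd @ xd / (zd @ zd)) * zd  # residual from regressing xd on zd

print((x_tilde @ x_tilde) / T**2, (xd @ xd) / T**2)   # nearly identical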
Combined with standard results from least squares mechanics, Lemma 11.2
yields straightforward asymptotic representations for $\tilde\beta_T$ and $\tilde\gamma_T$. Write

$T(\tilde\beta_T - \beta_0) = \Big[T^{-2}\sum_{t=1}^T \tilde x_t'\tilde x_t\Big]^{-1} T^{-1}\sum_{t=1}^T \tilde x_t'e_t.$   (11.26)

Now, by Lemma 11.2,

$T(\tilde\beta_T - \beta_0) = \Big[T^{-2}\sum_{t=1}^T \ddot x_t'\ddot x_t\Big]^{-1} T^{-1}\sum_{t=1}^T \ddot x_t'e_t + o_p(1).$   (11.27)

The first term on the right hand side of (11.27) is exactly of the form in (11.12)
except that $e_t$ replaces $u_t$. Thus, its limiting distribution can be obtained directly
from Lemma 11.1. Let $\omega_e^2$, $\Sigma_{21}^e$, and $\Delta_{21}^e$ be defined as in (11.5), (11.6) and
(11.9) with $e_t$ replacing $u_t$. Then, from Lemma 11.1,

$T(\tilde\beta_T - \beta_0) \xrightarrow{d} \Big[\int_0^1 \bar B_2(r)'\bar B_2(r)\,\mathrm{d}r\Big]^{-1}\Big[\int_0^1 \bar B_2(r)'\,\mathrm{d}B_1^e(r) + \Delta_{21}^e\Big],$   (11.28)

where $B_1^e$ is a $\mathcal{BM}(\omega_e^2)$ process. This is as in (11.13), except that it is the covariogram
of $\{(v_t, e_t)\colon t = 1, 2, \ldots\}$ which shows up in the asymptotic distribution. Thus,
including I(0) regressors when estimating $\beta_0$ changes the implicit errors in the
relationship. The form of the limiting distribution is unaltered, but the asymptotic
distributions of $T(\hat\beta_T - \beta_0)$ and $T(\tilde\beta_T - \beta_0)$ are not the same.
From (11.27) and the earlier discussion it follows that if the $x_t$ are strictly
exogenous in (11.20), that is,

$E(\Delta x_s'e_t) = 0, \qquad \text{all } t \text{ and } s,$   (11.29)

then we can treat $\tilde\beta_T$ as approximately normal. As before,

$\tilde\beta_T \stackrel{a}{\sim} \text{Normal}(\beta_0,\; \omega_e^2(\tilde X'\tilde X)^{-1}),$   (11.30)

where $\tilde X$ is the $T \times K$ matrix with $t$th row given by the $1 \times K$ vector residual $\tilde x_t$.
The important difference between (11.30) and (11.19) is that $\omega_e^2$ replaces $\omega_1^2$. Using
Lemma 11.2(i) we could replace $\tilde X$ with $\ddot X$, but this would be unnatural since $\tilde\beta_T$
is obtained from (11.23). Note that the validity of (11.30) as a heuristic does not
require strict exogeneity of the I(0) variables $z_t$; only (11.21) and (11.29) are assumed.
Under strict exogeneity the simple adjustments to t- and F-statistics discussed
for model (11.1) apply here as well, except that $\omega_e^2$ is estimated using the residuals
$\hat e_t$ from regression (11.23).
Next, consider the coefficient estimates on the I(0) variables. As shown by Phillips
(1988), Park and Phillips (1989), Sims et al. (1990) and others, the asymptotics for
$\tilde\gamma_T$ are standard (regardless of whether or not there is any kind of strict exogeneity).
This result has been derived under a variety of assumptions and in a number of ways.
Given Lemma 11.2, it is most easily established by writing $\tilde\gamma_T$ in partial regression
form. By Lemma 11.2 and standard results such as $T^{-1/2}\sum_{t=1}^T \ddot z_t'e_t = O_p(1)$ (by the
CLT), we have

$\sqrt{T}(\tilde\gamma_T - \gamma_0) = \Big[T^{-1}\sum_{t=1}^T \ddot z_t'\ddot z_t\Big]^{-1} T^{-1/2}\sum_{t=1}^T \ddot z_t'e_t + o_p(1).$   (11.31)

Thus, the I(1) regressors have disappeared entirely from the first order asymptotic
representation for $\tilde\gamma_T$. Under standard assumptions for strictly stationary processes,
the right hand side of (11.31) is asymptotically normally distributed. However,
unless

$\{e_t\}$ and $\{z_t'e_t\}$ are serially uncorrelated   (11.32)

and

$E(e_t^2 z_t'z_t) = \sigma_e^2 E(z_t'z_t)$ (homoskedasticity),   (11.33)

the usual OLS covariance matrix estimator and test statistics will generally be
invalid. This is as in regression with essentially stationary data. Given the OLS
residuals $\{\hat e_t\colon t = 1, 2, \ldots\}$ from the regression (11.23), standard serial-correlation-
robust covariance matrix estimators can be applied to $\{\tilde z_t'\hat e_t\}$, say $\hat B_T$. The asymptotic
variance of $\sqrt{T}(\tilde\gamma_T - \gamma_0)$ is estimated as

$(\tilde Z'\tilde Z/T)^{-1}\hat B_T(\tilde Z'\tilde Z/T)^{-1}.$   (11.34)

If (11.32) and (11.33) both hold, as in Sims et al. (1990), then the usual OLS
variance matrix estimator for $\tilde\gamma_T$ is valid. Therefore, standard t- and F-statistics
for testing hypotheses about $\gamma_0$ are valid under no serial correlation and homo-
skedasticity assumptions. Note that these have nothing to do with the I(1), non-
cointegrated regressors $x_t$.
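A minimal implementation of (11.34) might look as follows (ours; the Bartlett weights and bandwidth are illustrative choices for the serial-correlation-robust middle matrix $\hat B_T$).

import numpy as np

def robust_avar_gamma(z_tilde, e_hat, L=8):
    # z_tilde: T x J residuals from regressing z_t on (1, x_t);
    # e_hat:   OLS residuals from regression (11.23).
    T, J = z_tilde.shape
    s = z_tilde * e_hat[:, None]                 # scores z_tilde_t' * e_hat_t
    B = s.T @ s / T
    for j in range(1, L + 1):
        G = s[j:].T @ s[:-j] / T
        B += (1.0 - j / (L + 1.0)) * (G + G.T)   # Bartlett-weighted lags
    SZZ_inv = np.linalg.inv(z_tilde.T @ z_tilde / T)
    return SZZ_inv @ B @ SZZ_inv     # estimates Avar of sqrt(T)(gamma_tilde - gamma_0)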
This limiting distribution theory can be applied to the augmented Dickey-Fuller
regression under the null of a unit root. The model is

Wooldridge (1991c)]. Also important is that the leads and lags estimator of $\beta_{0j}$
(with or without $z_t$ included in the leads and lags regression) is generally an
inconsistent estimator of $\beta_{0j}$ under (11.35). If $x_{tj}$ is not cointegrated with $x_{t(j)}$ then
the leads and lags estimator of $\beta_{0j}$ produces asymptotically normal t-statistics for
$\beta_{0j}$, just as before.
The discussion in the preceding paragraph shows that the cointegrating proper-
ties of $x_t$ need to be known before currently available methods can be used for
inference about $\beta_0$. For further discussion and examples, see Phillips (1988), Park
and Phillips (1989), Sims et al. (1990), Wooldridge (1991c) and Watson (this
Handbook).
The preceding results have extensions to I(1) processes with drift, integrated
processes of higher order, and multivariate regression and instrumental variables
techniques. See Phillips (1988), Park and Phillips (1988, 1989), Sims et al. (1990)
and Phillips and Hansen (1990).

12. Applications to nonlinear models

In this section we sketch how Theorem 10.1 can be applied to a nonlinear model.
We wish to estimate the model

$y_t = \alpha_0 + f(z_t, \gamma_0) + x_t\beta_0 + u_t,$   (12.1)

$E(u_t \mid z_t) = 0,$   (12.2)

by nonlinear least squares, where $\{(z_t, u_t)\colon t = 1, 2, \ldots\}$ is strictly stationary and
$\{x_t\colon t = 1, 2, \ldots\}$ is a $1 \times K$ I(1) process without drift, as in (11.2), with no cointegrating
relationships among the elements of $x_t$. $\gamma_0$ is an $M \times 1$ vector. We assume that the
gradient of $f(z_t, \gamma)$, $\nabla_\gamma f(z_t, \gamma_0)$, contains no constant elements, so that both its
variance and long run variance are positive definite.
Letting $m_t(\theta) = \alpha + f(z_t, \gamma) + x_t\beta$, we have $\nabla_\theta m_t(\theta) = [1, \nabla_\gamma f_t(\gamma), x_t]$. The score of
the NLS objective function for observation $t$ is

$s_t(\theta) = -\nabla_\theta m_t(\theta)'u_t(\theta).$   (12.3)

When we evaluate this at $\theta_0$ we get $s_t(\theta_0) = -[u_t,\ \nabla_\gamma f_t(\gamma_0)u_t,\ x_t u_t]'$. From the CLT
and (12.2), $T^{-1/2}\sum_{t=1}^T \nabla_\gamma f_t(\gamma_0)'u_t$ has a limiting normal distribution. From Lemma
11.1, $T^{-1}\sum_{t=1}^T x_t'u_t$ converges in distribution to a functional of Brownian motion.
Given this, it is clear that the scaling matrix must be

$D_T = \begin{pmatrix} T & 0 & 0 \\ 0 & T I_M & 0 \\ 0 & 0 & T^2 I_K \end{pmatrix},$   (12.4)

in which case

$D_T^{-1/2}\sum_{t=1}^T s_t(\theta_0)$   (12.5)

converges in distribution to a nondegenerate random vector.
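To make the scaling concrete, the sketch below (our example, with $f(z_t, \gamma) = \exp(z_t\gamma)$, $M = K = 1$, and arbitrary parameter values) builds $D_T^{-1/2}$ from (12.4) and evaluates the standardized score (12.5) on simulated data; each standardized coordinate is $O_p(1)$.

import numpy as np

# NLS score at theta_0 for y_t = alpha0 + exp(z_t*gamma0) + x_t*beta0 + u_t.
rng = np.random.default_rng(4)
T = 2000
alpha0, gamma0, beta0 = 1.0, 0.5, 2.0
z = rng.standard_normal(T)
x = np.cumsum(rng.standard_normal(T))       # I(1) regressor, no drift
u = rng.standard_normal(T)

# grad m_t(theta0) = [1, z_t*exp(z_t*gamma0), x_t]; s_t = -grad_m_t' * u_t
grad_m = np.column_stack([np.ones(T), z * np.exp(z * gamma0), x])
score_sum = -(grad_m.T @ u)                 # sum_t s_t(theta_0)

D_inv_half = np.diag([T ** -0.5, T ** -0.5, 1.0 / T])   # D_T^{-1/2} of (12.4)
print(D_inv_half @ score_sum)               # each coordinate is O_p(1)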


Suppose, for now, that Condition (iii) of Theorem 10.1 holds, and let $\hat\theta_T$ denote
the NLS estimator of $\theta_0$. Then (8.6) becomes

$D_T^{1/2}(\hat\theta_T - \theta_0) = -\Big[D_T^{-1/2}\sum_{t=1}^T h_t(\theta_0)\,D_T^{-1/2}\Big]^{-1} D_T^{-1/2}\sum_{t=1}^T s_t(\theta_0) + o_p(1),$   (12.6)

where $h_t(\theta_0) = \nabla_\theta m_t(\theta_0)'\nabla_\theta m_t(\theta_0) - \nabla_\theta^2 m_t(\theta_0)u_t$. Because $E(u_t \mid z_t) = 0$, $E[\nabla_\gamma^2 f(z_t, \gamma_0)u_t] = 0$, and so

$D_T^{-1/2}\sum_{t=1}^T h_t(\theta_0)\,D_T^{-1/2} = D_T^{-1/2}\sum_{t=1}^T \nabla_\theta m_t(\theta_0)'\nabla_\theta m_t(\theta_0)\,D_T^{-1/2} + o_p(1)$   (12.7)

by the WLLN applied to $\{\nabla_\theta^2 m_t(\theta_0)u_t\}$. Therefore, substituting $s_t(\theta_0) = -\nabla_\theta m_t(\theta_0)'u_t$,
we can write

$D_T^{1/2}(\hat\theta_T - \theta_0) = \Big[D_T^{-1/2}\sum_{t=1}^T \nabla_\theta m_t(\theta_0)'\nabla_\theta m_t(\theta_0)\,D_T^{-1/2}\Big]^{-1}\Big[D_T^{-1/2}\sum_{t=1}^T \nabla_\theta m_t(\theta_0)'u_t\Big] + o_p(1).$   (12.8)

This puts us back in the linear model case covered in Section 11. Using partitioned
inverse and Lemma 11.2, $T(\hat\beta_T - \beta_0)$ has the same representation as in (11.27).
Similarly, $\sqrt{T}(\hat\gamma_T - \gamma_0)$ has a representation as in (11.31), but with $z_t$ replaced by
$\nabla_\gamma f(z_t, \gamma_0)$.
The main point of this example is that, once we have a linearized representation
as in (12.8), the asymptotic analysis is almost identical to the linear case, provided
the joint limiting distribution of

$D_T^{-1/2}\sum_{t=1}^T \nabla_\theta m_t(\theta_0)'\nabla_\theta m_t(\theta_0)\,D_T^{-1/2} \quad \text{and} \quad D_T^{-1/2}\sum_{t=1}^T \nabla_\theta m_t(\theta_0)'u_t$

can be found. The structure of the regression function in (12.1) ensures that this
is the case. In general, finding the limiting distribution can be much more difficult,
particularly if the nonergodic variables $x_t$ appear nonlinearly.
We have yet to do the hard part of the analysis, and that is to verify Condition (iii)
of Theorem 10.1. It turns out that Lemma 9.1 can also be applied in this case. We
sketch how this can be done. Define

$C_T \equiv \mathrm{diag}(T^{a_1},\; T^{a_1} I_M,\; T^{a_2} I_K),$   (12.9)

where $a_1 < 1$ and $a_2 < 2$. Note that the minimum diagonal element of this matrix
is $T^{a_1}$. Assume that $f(z_t, \gamma)$ is thrice continuously differentiable, where each
derivative is dominated by an integrable function. Then, in the notation of
Lemma 9.1, the functions $b_{Ttij} = b_{tij}$, where $1 \le i \le M + 1$ and $M + 1 < j \le M +
K + 1$, can be taken to be of the form

$b_{tij} \equiv g_{tij}|x_{tj}| + g_{tij},$

where $g_{tij}$ is a stationary function of $z_t$ that dominates $f(z_t, \gamma)$ and its first three
derivatives. (The terms for other combinations of $i$ and $j$ are easy to handle.) We
assume that $E(g_{tij}) < \infty$ for all $i$ and $j$.
An application of the FCLT, as in Lemma 11.1, implies that

f. Stij IX,j I = 0p(T3), (12.10)


t=1

so that

f. htij = O,( T3). (12.11)


r=1

Now JGzgl) =Tal@. Thus, for condition (ii) of Lemma 9.1 to be satisfied,
we must have (a, + aJ2) > $, a condition easily satisfied because a, + a,/2 can be
made arbitrarily close to 2 under the restrictions stated above. Thus, the conditions
of Theorem 10.1 hold under general conditions, and therefore representation (12.8)
is valid.
An important topic for future research is to examine how the conditions of
Theorem 10.1, or a result with similar scope, can be verified for more complicated
nonlinear models. It seems likely that the functional CLT will play an important
role.

Appendix

1. Notation

The transpose of a $K \times M$ matrix $A$ is denoted by $A'$.
$\|a\|$ denotes the Euclidean norm of the $P \times 1$ vector $a$.
$\|A\| \equiv [\mathrm{tr}(A'A)]^{1/2}$ denotes the Euclidean matrix norm of the matrix $A$.
For a continuously differentiable function $q(\theta)$, where $\theta$ is a $P \times 1$ vector, the
gradient of $q$ is denoted by the $1 \times P$ vector $\nabla_\theta q(\theta)$.
For a $K \times M$ differentiable matrix $A(\theta)$, where $\theta$ is a $P \times 1$ vector, we denote the
gradient of $A$ by $\nabla_\theta A(\theta) \equiv \partial\,\mathrm{vec}\,A(\theta)/\partial\theta'$, which is a $KM \times P$ matrix.
The second derivative of a matrix, denoted $\nabla_\theta^2 A(\theta)$, is defined as

$\nabla_\theta^2 A(\theta) \equiv \nabla_\theta[\nabla_\theta A(\theta)].$

For random vectors $y$ and $x$, $\mathcal{D}(y \mid x)$ denotes the conditional distribution of $y$
given $x$, $E(y \mid x)$ denotes the conditional expectation, and $\mathrm{Var}(y \mid x)$ denotes the
conditional variance.

2. Definitions and proofs

Definition A.1

Let $(\mathcal{W}, \mathcal{F}, P)$ be a probability space, and let $\{O_T\colon T = 1, 2, \ldots\}$ be a sequence of
events defined on this space. Then $\{O_T\}$ occurs with probability approaching one
(w.p.a.1) if

$P(O_T) \to 1$ as $T \to \infty$.

Definition A.2

A random function $r\colon \mathcal{W} \times \Theta \to \mathbb{R}$ satisfies the standard measurability and continuity
conditions on $\mathcal{W} \times \Theta$ if
(i) for each $\theta \in \Theta$, $r(\cdot, \theta)$ is measurable;
(ii) for each $w \in \mathcal{W}$, $r(w, \cdot)$ is continuous on $\Theta$.

Definition A.3

Let $\Theta$ be a compact (closed and bounded) subset of $\mathbb{R}^P$ and let $\{Q_T\colon \mathcal{W} \times \Theta \to \mathbb{R}\}$
be a sequence of functions satisfying the standard measurability and continuity
conditions on $\mathcal{W} \times \Theta$. Let $Q\colon \Theta \to \mathbb{R}$ be a nonstochastic continuous function on
$\Theta$. Then $Q_T(w, \theta)$ converges in probability to $Q(\theta)$ uniformly on $\Theta$ if and only if

$\max_{\theta\in\Theta}|Q_T(w, \theta) - Q(\theta)| \xrightarrow{p} 0 \quad \text{as } T \to \infty.$   (a.1)

When (a.1) holds we often write $Q_T \xrightarrow{p} Q$ uniformly on $\Theta$.

Theorem A.1

Let $\Theta$ be a subset of $\mathbb{R}^P$ and let $\{Q_T\colon \mathcal{W} \times \Theta \to \mathbb{R}\colon T = 1, 2, \ldots\}$ be a sequence of
real-valued functions. Assume that
(i) $\Theta$ is compact;
(ii) $\{Q_T\}$ satisfies the standard measurability and continuity conditions on
$\mathcal{W} \times \Theta$.
Then a (measurable) estimator $\hat\theta_T\colon \mathcal{W} \to \Theta$ exists such that

$Q_T(w, \hat\theta_T(w)) = \min_{\theta\in\Theta} Q_T(w, \theta) \quad \text{for all } w \in \mathcal{W}.$

In addition, assume that
(iii) $Q_T \xrightarrow{p} Q$ uniformly on $\Theta$, where $Q$ is a nonstochastic, continuous, real-
valued function on $\Theta$;
(iv) $\theta_0$ is the unique minimizer of $Q$ on $\Theta$.
Then $\hat\theta_T \xrightarrow{p} \theta_0$.

Proof

This follows from White (1993, Theorem 3.4) or Newey and McFadden (Theorem
2.1).
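A toy numerical sketch of Theorem A.1 (example ours): minimizing the sample criterion over a compact grid produces an estimator that settles on the unique minimizer of the limiting criterion as $T$ grows.

import numpy as np

# Q_T(theta) = T^{-1} sum_t (w_t - theta)^2 over compact Theta = [-2, 2];
# its limit Q(theta) = E(w_t - theta)^2 is uniquely minimized at theta0.
rng = np.random.default_rng(6)
theta0 = 0.7
Theta = np.linspace(-2.0, 2.0, 401)

for T in (50, 500, 5000):
    w = theta0 + rng.standard_normal(T)
    QT = ((w[:, None] - Theta[None, :]) ** 2).mean(axis=0)
    print(T, Theta[QT.argmin()])            # approaches theta0 = 0.7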

Lemma A.1

Let $G_T\colon \mathcal{W} \times \Theta \to \mathbb{R}$ and $G\colon \Theta \to \mathbb{R}$ be functions satisfying the standard measur-
ability and continuity conditions on the compact set $\Theta$. Suppose that $G_T \xrightarrow{p} G$
uniformly on $\Theta$ and $\hat\theta_T \xrightarrow{p} \theta_0$. Then $G_T(\hat\theta_T) \xrightarrow{p} G(\theta_0)$.

Proof

Follows from White (1993, Theorem 3.7).

Definition A.4

Let $\Theta$ be a subset of $\mathbb{R}^P$ with nonempty interior. A random function $r\colon \mathcal{W} \times \Theta \to
\mathbb{R}^K$ satisfies the standard measurability and first order (second order) differentiability
conditions on $\mathcal{W} \times \Theta$ if
(i) for each $\theta \in \Theta$, $r(\cdot, \theta)$ is measurable;
(ii) for each $w \in \mathcal{W}$, $r(w, \cdot)$ is once (twice) continuously differentiable on $\mathrm{int}(\Theta)$.

For the abstract optimization problem of Theorem A.1, define the score of the
objective function as the $P \times 1$ vector

$S_T(\theta) = S_T(w, \theta) \equiv \nabla_\theta Q_T(w, \theta)' = \bigg(\frac{\partial Q_T(w,\theta)}{\partial\theta_1}, \ldots, \frac{\partial Q_T(w,\theta)}{\partial\theta_P}\bigg)'.$

The Hessian of $Q_T$ is defined to be the $P \times P$ symmetric matrix

$H_T(\theta) = H_T(w, \theta) \equiv \nabla_\theta^2 Q_T(w, \theta);$

in particular, the $(i, j)$th element of $H_T(w, \theta)$ is $\partial^2 Q_T(w, \theta)/\partial\theta_i\partial\theta_j$. $H_T(\hat\theta)$ denotes the
$P \times P$ symmetric random matrix evaluated at $\hat\theta$.

Theorem A.2

Let the conditions of Theorem A.1 be satisfied. In addition, assume that
(v) $\theta_0$ is in the interior of $\Theta$;
(vi) $\{Q_T\}$ satisfies the standard measurability and differentiability conditions;
(vii) $H_T \xrightarrow{p} A$ uniformly on $\Theta$, where $A\colon \Theta \to \mathbb{R}^{P\times P}$ is a nonrandom continuous
matrix function, and $A_0 \equiv A(\theta_0)$ is nonsingular;
(viii) $\sqrt{T}\,S_T(\theta_0) \xrightarrow{d} \text{Normal}(0, B_0)$, where $B_0$ is a positive definite matrix.
Then

$\sqrt{T}(\hat\theta_T - \theta_0) \xrightarrow{d} \text{Normal}(0, A_0^{-1}B_0A_0^{-1}).$   (a.2)

Proof

This essentially follows from Theorem 3.1 in Newey and McFadden. A separate
proof is available on request from the author.

Proof of Theorem 4.2

Define

$Q_T(w, \theta) \equiv T^{-1}\sum_{t=1}^T q_t(w_t, \theta)$

and $\bar Q_T(\theta) \equiv E[Q_T(w, \theta)]$. Then we must show that, for each $\varepsilon > 0$,

$P\Big[\max_{\theta\in\Theta}|Q_T(w, \theta) - \bar Q_T(\theta)| > \varepsilon\Big] \to 0 \quad \text{as } T \to \infty.$   (a.3)

Let $\delta > 0$ be a number to be set later. Because $\Theta$ is compact, there exists a finite
covering of $\Theta$, say $\mathcal{S}_\delta(\theta_j)$, $j = 1, 2, \ldots, K(\delta)$, where $\mathcal{S}_\delta(\theta_j)$ is the sphere of radius $\delta$
about $\theta_j$. Set $\mathcal{S}_j \equiv \mathcal{S}_\delta(\theta_j)$, $K \equiv K(\delta)$, and $Q_T(\theta) \equiv Q_T(w, \theta)$. Because $\Theta \subset \bigcup_{j=1}^K \mathcal{S}_j$,
it follows that

$P\Big[\max_{\theta\in\Theta}|Q_T(\theta) - \bar Q_T(\theta)| > \varepsilon\Big] \le P\Big[\max_{1\le j\le K}\max_{\theta\in\mathcal{S}_j}|Q_T(\theta) - \bar Q_T(\theta)| > \varepsilon\Big] \le \sum_{j=1}^K P\Big[\max_{\theta\in\mathcal{S}_j}|Q_T(\theta) - \bar Q_T(\theta)| > \varepsilon\Big].$   (a.4)

We will bound each probability in the summand. For $\theta \in \mathcal{S}_j$, by the triangle
inequality,

$|Q_T(\theta) - \bar Q_T(\theta)| \le T^{-1}\sum_{t=1}^T |q_t(\theta) - q_t(\theta_j)| + \Big|T^{-1}\sum_{t=1}^T q_t(\theta_j) - \bar q_t(\theta_j)\Big| + T^{-1}\sum_{t=1}^T |\bar q_t(\theta) - \bar q_t(\theta_j)|,$

where $\bar q_t(\theta) \equiv E[q_t(\theta)]$. By Condition (iv)(a), for $\theta \in \mathcal{S}_j$,

$|q_t(\theta) - q_t(\theta_j)| \le c_t(w_t)\|\theta - \theta_j\| < \delta c_t$

and

$|\bar q_t(\theta) - \bar q_t(\theta_j)| \le \bar c_t\|\theta - \theta_j\| < \delta\bar c_t,$

where $\bar c_t \equiv E(c_t)$. Thus, we have

$\max_{\theta\in\mathcal{S}_j}|Q_T(\theta) - \bar Q_T(\theta)| \le \delta\Big[T^{-1}\sum_{t=1}^T c_t + T^{-1}\sum_{t=1}^T \bar c_t\Big] + \Big|T^{-1}\sum_{t=1}^T q_t(\theta_j) - \bar q_t(\theta_j)\Big|$

$\le 2\delta T^{-1}\sum_{t=1}^T \bar c_t + \delta\Big|T^{-1}\sum_{t=1}^T (c_t - \bar c_t)\Big| + \Big|T^{-1}\sum_{t=1}^T [q_t(\theta_j) - \bar q_t(\theta_j)]\Big|$

$\le 2\delta\bar c + \delta\Big|T^{-1}\sum_{t=1}^T (c_t - \bar c_t)\Big| + \Big|T^{-1}\sum_{t=1}^T [q_t(\theta_j) - \bar q_t(\theta_j)]\Big|,$

where $T^{-1}\sum_{t=1}^T \bar c_t \le \bar c < \infty$ by (iv)(b). It follows that

$P\Big[\max_{\theta\in\mathcal{S}_j}|Q_T(\theta) - \bar Q_T(\theta)| > \varepsilon\Big] \le P\Big[\delta\Big|T^{-1}\sum_{t=1}^T (c_t - \bar c_t)\Big| + \Big|T^{-1}\sum_{t=1}^T [q_t(\theta_j) - \bar q_t(\theta_j)]\Big| > \varepsilon - 2\delta\bar c\Big].$

Now choose $\delta < 1$ such that $2\delta\bar c < \varepsilon/2$ (this affects $K$, but not $\bar c$). Then

$P\Big[\max_{\theta\in\mathcal{S}_j}|Q_T(\theta) - \bar Q_T(\theta)| > \varepsilon\Big] \le P\Big[\Big|T^{-1}\sum_{t=1}^T (c_t - \bar c_t)\Big| + \Big|T^{-1}\sum_{t=1}^T [q_t(\theta_j) - \bar q_t(\theta_j)]\Big| > \varepsilon/2\Big].$

Next, choose $T_0$ so that

$P\Big[\Big|T^{-1}\sum_{t=1}^T (c_t - \bar c_t)\Big| + \Big|T^{-1}\sum_{t=1}^T [q_t(\theta_j) - \bar q_t(\theta_j)]\Big| > \varepsilon/2\Big] \le \varepsilon/K$

for all $T \ge T_0$ and all $j = 1, 2, \ldots, K$; this is possible by Assumptions (iii) and
(iv)(b) of Theorem 4.2 (and because $K = K(\delta)$ is finite). From (a.4) it follows that,
for $T \ge T_0$,

$P\Big[\max_{\theta\in\Theta}|Q_T(\theta) - \bar Q_T(\theta)| > \varepsilon\Big] \le \varepsilon,$

which establishes the result.

Proof of Theorem 4.3

We verify the conditions of Theorem A.1. Define

$Q_T(w, \theta) \equiv T^{-1}\sum_{t=1}^T q_t(w_t, \theta; \hat\gamma_T).$

By Assumptions M.1 and M.2 of Theorem 4.3, it follows from White (1993, Theorem
3.7) that $Q_T(w, \theta)$ converges in probability to $Q(\theta) \equiv \bar q(\theta; \gamma^*)$ uniformly on $\Theta$. The
result now follows from Assumption M.3 and Theorem A.1.

Proof of Theorem 4.4

This is a simple application of Theorem A.2. A mean value expansion gives (w.p.a.1)

$0 = T^{-1/2}\sum_{t=1}^T s_t(\theta_0; \hat\gamma_T) + \Big[T^{-1}\sum_{t=1}^T \ddot h_t\Big]\sqrt{T}(\hat\theta_T - \theta_0),$

where $\ddot h_t$ is $h_t(\theta; \hat\gamma_T)$ evaluated at mean values between $\hat\theta_T$ and $\theta_0$. A second
mean value expansion gives

$T^{-1/2}\sum_{t=1}^T s_t(\theta_0; \hat\gamma_T) = T^{-1/2}\sum_{t=1}^T s_t(\theta_0; \gamma^*) + \Big[T^{-1}\sum_{t=1}^T \nabla_\gamma \ddot s_t\Big]\sqrt{T}(\hat\gamma_T - \gamma^*),$

where $\nabla_\gamma \ddot s_t$ is $\nabla_\gamma s_t(\theta_0; \gamma)$ evaluated at mean values between $\hat\gamma_T$ and $\gamma^*$. By
Assumptions M.4(iii), M.5(iii) and M.7 of Theorem 4.4,

$T^{-1}\sum_{t=1}^T \nabla_\gamma \ddot s_t = T^{-1}\sum_{t=1}^T E[\nabla_\gamma s_t(\theta_0; \gamma^*)] + o_p(1) = o_p(1).$

Therefore,

$0 = T^{-1/2}\sum_{t=1}^T s_t(\theta_0; \gamma^*) + \Big[T^{-1}\sum_{t=1}^T \ddot h_t\Big]\sqrt{T}(\hat\theta_T - \theta_0) + o_p(1).$

By M.5(i) and M.5(ii), because $\hat\theta_T \xrightarrow{p} \theta_0$, $T^{-1}\sum_{t=1}^T \ddot h_t \xrightarrow{p} A_0$ and so $T^{-1}\sum_{t=1}^T \ddot h_t$ is
nonsingular w.p.a.1. Thus, w.p.a.1 we can write

$\sqrt{T}(\hat\theta_T - \theta_0) = -\Big[T^{-1}\sum_{t=1}^T \ddot h_t\Big]^{-1} T^{-1/2}\sum_{t=1}^T s_t(\theta_0; \gamma^*) + o_p(1) = O_p(1),$

where we have used the fact that $T^{-1/2}\sum_{t=1}^T s_t(\theta_0; \gamma^*) = O_p(1)$ (by M.6). This proves
the result.

Proof of Lemma 4.3

The proof is standard and follows from a second order Taylor expansion. See,
for example, Amemiya (1985, Section 4.5) and White (1993, Theorem 8.10).

Proof of Theorem 8.1

From a second order Taylor expansion,

$Q_T(\theta) = Q_T(\theta_0) + S_T^{0\prime}(\theta - \theta_0) + \tfrac{1}{2}(\theta - \theta_0)'H_T^0(\theta - \theta_0) + \tfrac{1}{2}R_T(\theta; \theta_0),$

where $R_T(\theta; \theta_0) \equiv (\theta - \theta_0)'[\ddot H_T(\theta; \theta_0) - H_T^0](\theta - \theta_0)$, $S_T^0 \equiv S_T(\theta_0)$, $H_T^0 \equiv H_T(\theta_0)$, and
$\ddot H_T(\theta; \theta_0)$ denotes $H_T(\theta)$ evaluated at mean values between $\theta$ and $\theta_0$. Define a
random vector by $\bar\theta_T \equiv \theta_0 - (H_T^0)^{-1}S_T^0$ (w.p.a.1). After a little algebra we have

$Q_T(\theta) - Q_T(\bar\theta_T) = \tfrac{1}{2}(\theta - \bar\theta_T)'H_T^0(\theta - \bar\theta_T) + \tfrac{1}{2}R_T(\theta; \theta_0) - \tfrac{1}{2}R_T(\bar\theta_T; \theta_0).$   (a.5)

Also, write $R_T(\theta; \theta_0) = \alpha_T(\theta)'\Lambda_T(\theta; \theta_0)\alpha_T(\theta)$, where $\alpha_T(\theta) \equiv C_T^{1/2}(\theta - \theta_0)$ and
$\Lambda_T(\theta; \theta_0) \equiv C_T^{-1/2}[\ddot H_T(\theta; \theta_0) - H_T^0]C_T^{-1/2}$. Now by assumption (iii)(b) of Theorem 8.1,
$\|\Lambda_T(\theta; \theta_0)\| \xrightarrow{p} 0$ uniformly over the set $\{\theta\colon \|\alpha_T(\theta)\| \le \varepsilon\}$ for any $\varepsilon \le 1$. It follows
that for $\varepsilon \le 1$,

$|R_T(\theta; \theta_0)| \le \delta_T\|C_T^{1/2}(\theta - \theta_0)\|^2$   (a.6)

for all $\theta$ such that $\|C_T^{1/2}(\theta - \theta_0)\| \le \varepsilon$, where $\delta_T \xrightarrow{p} 0$.

Next, define

$\mathcal{N}_T(\varepsilon) \equiv \{\theta\colon \|D_T^{1/2}(\theta - \bar\theta_T)\| \le \varepsilon\}.$

By Assumptions (iv)(a) and (iv)(b), $D_T^{1/2}(\bar\theta_T - \theta_0) = O_p(1)$. By (iii)(a), $C_T^{1/2}D_T^{-1/2} \to 0$;
therefore, there exists a sequence $\{\varepsilon_T > 0\}$ with $\varepsilon_T \to 0$ such that

$P[\|C_T^{1/2}(\bar\theta_T - \theta_0)\| \le \varepsilon_T] \to 1 \quad \text{as } T \to \infty.$   (a.7)

Also because $C_T^{1/2}D_T^{-1/2} \to 0$,

$\mathcal{N}_T(\varepsilon_T) \subset \{\theta\colon \|C_T^{1/2}(\theta - \bar\theta_T)\| \le \varepsilon_T\}$ w.p.a.1.   (a.8)

By the triangle inequality,

$\|C_T^{1/2}(\theta - \theta_0)\| \le \|C_T^{1/2}(\theta - \bar\theta_T)\| + \|C_T^{1/2}(\bar\theta_T - \theta_0)\|.$   (a.9)

By (a.7) and (a.8), if $\|C_T^{1/2}(\theta - \bar\theta_T)\| \le \varepsilon_T$ then $\|C_T^{1/2}(\theta - \theta_0)\| \le 2\varepsilon_T$ w.p.a.1. Thus,
from (a.8) it follows that

$\mathcal{N}_T(\varepsilon_T) \subset \{\theta\colon \|C_T^{1/2}(\theta - \theta_0)\| \le 2\varepsilon_T\}$ w.p.a.1.   (a.10)

Now (a.6) and (a.10) imply that

$\sup_{\theta\in\mathcal{N}_T(\varepsilon_T)}|R_T(\theta; \theta_0)| \le 4\delta_T\varepsilon_T^2$ w.p.a.1,   (a.11)

and (a.6) and (a.7) imply that

$|R_T(\bar\theta_T; \theta_0)| \le \delta_T\varepsilon_T^2$ w.p.a.1.   (a.12)

Letting $\partial\mathcal{N}_T(\varepsilon_T)$ denote the boundary of $\mathcal{N}_T(\varepsilon_T)$, (a.5), (a.11) and (a.12) imply that
w.p.a.1,

$\min_{\theta\in\partial\mathcal{N}_T(\varepsilon_T)} Q_T(\theta) - Q_T(\bar\theta_T) \ge \tfrac{1}{2}\lambda_{T,\min}\varepsilon_T^2 - \tfrac{5}{2}\delta_T\varepsilon_T^2 = \{\tfrac{1}{2}\lambda_{T,\min} - \tfrac{5}{2}\delta_T\}\varepsilon_T^2,$

where $\lambda_{T,\min}$ is the smallest eigenvalue of $D_T^{-1/2}H_T^0D_T^{-1/2}$. By Assumption (iv)(a)
of Theorem 8.1, $\lambda_{T,\min} \ge \lambda > 0$ w.p.a.1. Because $\delta_T \xrightarrow{p} 0$ and $\varepsilon_T > 0$ for all $T$, $Q_T$
cannot achieve its minimum on the boundary of $\mathcal{N}_T(\varepsilon_T)$ w.p.a.1; therefore, it
achieves its minimum on the interior of $\mathcal{N}_T(\varepsilon_T)$. Let $\hat\theta_T$ denote this estimator.
Then $S_T(\hat\theta_T) = 0$ w.p.a.1 and $\|D_T^{1/2}(\hat\theta_T - \bar\theta_T)\| \le \varepsilon_T$, so that

$D_T^{1/2}(\hat\theta_T - \theta_0) = O_p(1).$   (a.13)

Now we are almost done. Use a mean value expansion of the score to write (w.p.a.1)

$D_T^{-1/2}S_T(\hat\theta_T) = D_T^{-1/2}S_T^0 + D_T^{-1/2}H_T^0D_T^{-1/2}D_T^{1/2}(\hat\theta_T - \theta_0) + D_T^{-1/2}(\ddot H_T - H_T^0)D_T^{-1/2}D_T^{1/2}(\hat\theta_T - \theta_0),$   (a.14)

where $\ddot H_T$ is evaluated at mean values. Letting $\ddot\theta_T$ denote a generic mean value, it
is easily shown that $D_T^{1/2}(\ddot\theta_T - \theta_0) = O_p(1)$. But this implies that $\ddot\theta_T \in \mathcal{N}_T$ w.p.a.1
because $C_T^{1/2}D_T^{-1/2} \to 0$. Thus, by (iii)(b) and (a.13), the last term in (a.14) is $o_p(1)$.
We have now established that w.p.a.1,

$0 = D_T^{-1/2}S_T^0 + D_T^{-1/2}H_T^0D_T^{-1/2}D_T^{1/2}(\hat\theta_T - \theta_0) + o_p(1);$

by (iv)(a) we can write

$D_T^{1/2}(\hat\theta_T - \theta_0) = -[D_T^{-1/2}H_T^0D_T^{-1/2}]^{-1}D_T^{-1/2}S_T^0 + o_p(1).$   (a.15)

Along with (iv)(b), this completes the proof.

References

Amemiya, T. (1985) Advanced Econometrics. Cambridge: Harvard University Press.


Anderson, T.W. (1971) The Statistical Analysis of Time Series. New York: Wiley.
Andrews, D.W.K. (1987) Consistency in Nonlinear Econometric Models: A Generic Uniform Law of
Large Numbers, Econometrica, 55, 1465-1472.
Andrews, D.W.K. (1988) Laws of Large Numbers for Dependent Non-Identically Distributed Random
Variables, Econometric Theory, 4, 458-467.
Andrews, D.W.K. (1989) Asymptotics for Semiparametric Econometric Models I: Estimation, Cowles
Foundation for Research in Economics Working Paper No. 908.
Andrews, D.W.K. (1991) Heteroskedasticity and Autocorrelation Consistent Covariance Matrix
Estimation, Econometrica, 59, 817-858.
Andrews, D.W.K. and J. McDermott (1993) Nonlinear Econometric Models with Deterministically
Trending Variables, Cowles Foundation for Economic Research Working Paper No. 1053.
Andrews, D.W.K. and J.C. Monohan (1992) An Improved Heteroskedasticity and Autocorrelation
Consistent Covariance Matrix Estimator, Econometrica, 60, 953-966.
Basawa, I.V. and D.J. Scott (1983) Asymptotic Optimal Inference for Nonergodic Models. New York:
Springer-Verlag.
Basawa, I.V., P.D. Feigin and C.C. Heyde (1976) Asymptotic Properties of Maximum Likelihood
Estimators for Stochastic Processes, Sankhya, Series A, 38, 259-270.
Bates, C.E. and H. White (1985) A Unified Theory of Consistent Estimation for Parametric Models,
Econometric Theory, 1, 151-175.
Bates, C.E. and H. White (1993) Determination of Estimators with Minimum Asymptotic Variance,
Econometric Theory, 9, 633-648.
Berk, K.N. (1974) Consistent Autoregressive Spectral Estimates, Annals of Statistics, 2, 489-502.
Berndt, E.R., B.H. Hall, R.E. Hall and J.A. Hausman (1974) Estimation and Inference in Nonlinear
Structural Models, Annals of Economic and Social Measurement, 4, 653-665.
Bhat, B.R. (1974) On the Method of Maximum Likelihood for Dependent Observations, Journal of
the Royal Statistical Society, Series B, 36, 48-53.
Bierens, H.J. (1981) Robust Methods and Asymptotic Theory in Nonlinear Econometrics. New York:
Springer-Verlag.
Bierens, H.J. (1982) A Uniform Weak Law of Large Numbers Under φ-mixing with Application to
Nonlinear Least Squares Estimation, Statistica Neerlandica, 36, 81-86.
Billingsley, P. (1968) Convergence of Probability Measures. New York: Wiley.
Billingsley, P. (1986) Probability and Measure. Second edition. New York: Wiley.
Bloomfield, P. (1976) Fourier Analysis of Time Series: An Introduction. New York: Wiley.
Bloomfield, P. and W.L. Steiger (1983) Least Absolute Deviations. Boston: Birkhauser.
Bollerslev, T. (1986) Generalized Autoregressive Conditional Heteroscedasticity, Journal of Econo-
metrics, 31, 307-328.
Bollerslev, T. and J.M. Wooldridge (1992) Quasi-Maximum Likelihood Estimation and Inference in
Dynamic Models with Time-Varying Covariances, Econometric Reviews, 11, 143-172.
Brillinger, D.R. (1981) Time Series: Data Analysis and Theory. New York: Holden-Day.
Brockwell, P.J. and R.A. Davis (1991) Time Series: Theory and Methods. New York: Springer-Verlag.
Burguete, J.F., A.R. Gallant and G. Souza (1982) On the Unification of the Asymptotic Theory of
Nonlinear Econometric Models, Econometric Reviews, 1, 151-190.
Chamberlain, G. (1982) The General Equivalence of Granger and Sims Causality, Econometrica, 50,
569-581.
Chamberlain, G. (1987) Asymptotic Efficiency in Estimation with Conditional Moment Restrictions,
Journal of Econometrics, 34, 305-334.
Chesher, A. and R. Spady (1991) Asymptotic Expansions of the Information Test Statistic,
Econometrica, 59, 787-815.
Crowder, M.J. (1976) Maximum Likelihood Estimation with Dependent Observations, Journal of
the Royal Statistical Society, Series B, 38, 45-53.
Davidson, J. (1992) A Central Limit Theorem for Globally Nonstationary Near-Epoch Dependent
Functions of Mixing Processes, Econometric Theory, 8, 313-329.
Davidson, R. and J.G. MacKinnon (1984) Convenient Specification Tests for Logit and Probit Models,
Journal of Econometrics, 25, 241-262.
Davidson, R. and J.G. MacKinnon (1991) A New Form of the Information Matrix Test, Econometrica,
60, 145-158.
Dickey, D.A. and W.A. Fuller (1979) Distribution of the Estimators for Autoregressive Time Series
with a Unit Root, Journal of the American Statistical Association, 74, 427-431.
Domowitz, I. (1985) New Directions in Nonlinear Estimation with Dependent Observations, Canadian
Journal of Economics, 19, 1-27.
Domowitz, I. and L.T. Muus (1988) Asymptotic Inference for Nonergodic Models with Econometric
Applications, in: W.A. Barnett, E.R. Berndt and H. White, eds., Proceedings of the Third International
Symposium in Economic Theory and Econometrics. New York: Cambridge University Press.
Domowitz, I. and H. White (1982) Maximum Likelihood Estimation of Misspecified Models, Journal
of Econometrics, 20, 35-58.
Engle, R.F. (1984) Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics, in: Z.
Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. II. Amsterdam: North-Holland,
775-826.
Engle, R.F. and C.W.J. Granger (1987) Cointegration and Error Correction: Representation,
Estimation and Testing, Econometrica, 55, 251-276.
Fuller, W. (1976) Introduction to Statistical Time Series. New York: Wiley.
Gallant, A.R. (1987) Nonlinear Statistical Models. New York: Wiley.
Gallant, A.R. and H. White (1988) A Unified Approach to Estimation and Inference in Nonlinear Dynamic
Models. Oxford: Basil Blackwell.
Godfrey, L.C. (1988) Misspecijication Tests in Econometrics: The LM Principle and Other Approaches.
New York: Cambridge University Press.
Goldberger, A. (1968) Topics in Reyression Analysis. New York: Macmillan.
Gourieroux, C., A. Monfort and A. Trognon (1984) Pseudo-Maximum Likelihood Methods: Theory,
Econometrica, 52, 681-700.
Gourieroux, C., A. Monfort and A. Trognon (1985) A General Approach to Serial Correlation,
Econometric Theory, 1, 315-340.
Granger, C.W.J. (1969) Investigating Causal Relations by Econometric Models and Cross-Spectral
Methods, Econometrica, 37, 424-438.
Hall, P. and C.C. Heyde (1980) Martingale Limit Theory and Its Application. New York: Academic Press.
Hannan, E.J. (1971) Non-Linear Time Series Regression, Journal of Applied Probability, 8, 767-780.
Hansen, B.E. (1991a) Strong Laws for Dependent Heterogeneous Processes, Econometric Theory, 7,
213-221.
Hansen, B.E. (1991b) Inference When a Nuisance Parameter is Not Identified Under the Null
Hypothesis, Rochester Center for Economic Research, Working Paper no. 296.
Hansen, B.E. (1992a) Consistent Covariance Matrix Estimation for Dependent Heterogeneous
Processes, Econometrica, 60, 967-972.
Hansen, B.E. (1992b) Convergence to Stochastic Integrals for Dependent Heterogeneous Processes,
Econometric Theory, 8, 489-500.
Hansen, L.P. (1982) Large Sample Properties of Generalized Method of Moments Estimators,
Econometrica, 50, 1029-1054.
Hansen, L.P. (1985) A Method for Calculating Bounds on the Asymptotic Covariance Matrices of
Generalized Method of Moments Estimators, Journal of Econometrics, 30, 203-238.
Hansen, L.P. and R.J. Hodrick (1980) Forward Exchange Rates as Optimal Predictors of Future Spot
Rates: An Econometric Analysis, Journal of Political Economy, 88, 829-853.
Hansen, L.P. and K.J. Singleton (1982) Generalized Instrumental Variables Estimation of Nonlinear
Rational Expectations Models, Econometrica, 50, 1269-1286.
Hansen, L.P., J.C. Heaton and M. Ogaki (1988) Efficiency Bounds Implied by Multiperiod Conditional
Moment Restrictions, Journal of the American Statistical Association, 83, 863-871.
Harvey, A.C. (1990) The Econometric Analysis of Time Series. Cambridge: MIT Press.
Heijmans, R.D.H. and J.R. Magnus (1986) On the First-order Efficiency and Asymptotic Normality of
Maximum Likelihood Estimators Obtained from Dependent Observations, Statistica Neerlandica,
40.
Hendry, D.F. and J.-F. Richard (1983) The Econometric Analysis of Economic Time Series, Inter-
national Statistical Review, 51, 111-163.
Hendry, D.F., A.R. Pagan and J.D. Sargan (1984) Dynamic Specification, in: Z. Griliches and M.D.
Intriligator, eds., Handbook of Econometrics, Vol. II. Amsterdam: North-Holland, 1023-1100.
Herndorff, N. (1984) An Invariance Principle for Weakly Dependent Sequences of Random Variables,
Annals of Probability, 12, 141-153.
Hsieh, D.A. (1983) A Heteroskedasticity-Consistent Covariance Matrix Estimator for Time Series
Regressions, Journal of Econometrics, 22, 281-290.
Huber, P.J. (1967) The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions,
Proceedings of the Fifth Berkeley Symposium in Mathematical Statistics and Probability. Berkeley:
University of California Press.
Jeganathan, P. (1980) An Extension of a Result of L. LeCam Concerning Asymptotic Normality,
Sankhya, Series A, 43, 23-36.
Jeganathan, P. (1988) Some Aspects of Asymptotic Theory with Applications to Time Series Models,
University of Michigan Department of Statistics, Technical Report No. 166.
Johansen, S. (1988) Statistical Analysis of Cointegrating Vectors, Journal of Economic Dynamics and
Control, 12, 231-254.
Keener, R.W., J. Kmenta and N.C. Weber (1991) Estimation of the Covariance Matrix of the Least
Squares Regression Coefficients when the Disturbance Covariance Matrix is of Unknown Form,
Econometric Theory, 7, 22-45.
Klimko, L.A. and P.T. Nelson (1978) Conditional Least Squares Estimation for Stochastic Processes,
Annals of Statistics, 6, 629-642.
LeCam, L. (1986) Asymptotic Methods in Statistical Decision Theory. New York: Springer-Verlag.
Levine, D. (1983) A Remark on Serial Correlation in Maximum Likelihood, Journal of Econometrics,
23, 337-342.
Lin, W.-L. (1992) Alternative Estimators for Factor GARCH Models: A Monte Carlo Comparison,
Journal of Applied Econometrics, 7, 259-279.
MacKinnon, J.G. (1992) Model Specification Tests and Artificial Regressions, Journal of Economic
Literature, 30, 102-146.
MacKinnon, J.G. and H. White (1985) Some Heteroskedasticity Consistent Covariance Matrix
Estimators with Improved Finite Sample Properties, Journal of Econometrics, 29, 305-325.
Magnus, J.R. and H. Neudecker (1986) Symmetry, 0-1 Matrices and Jacobians: A Review, Econo-
metric Theory, 2, 157-190.
Manski, C. (1975) Maximum Score Estimation of the Stochastic Utility Model of Choice, Journal
of Econometrics, 3, 205-225.
Manski, C.F. (1988) Analog Estimation Methods in Econometrics. New York: Chapman and Hall.
McLeish, D.L. (1974) Dependent Central Limit Theorems and Invariance Principles, Annals of
Probability, 2, 81-85.
McLeish, D.L. (1975) A Maximal Inequality and Dependent Strong Laws, Annals of Probability, 3,
826-836.
McLeish, D.L. (1977) On the Invariance Principle for Nonstationary Mixingales, Annals of Prob-
ability, 5, 616-621.
Nelson, D.B. (1991) Conditional Heteroskedasticity in Asset Returns: A New Approach, Econometrica,
59, 347-370.
Newey, W.K. (1990) Efficient Instrumental Variables Estimation of Nonlinear Econometric Models,
Econometrica, 58, 809-837.
Newey, W.K. (1991a) Uniform Convergence in Probability and Stochastic Equicontinuity, Econo-
metrica, 59, 1161-1167.
Newey, W.K. (1991b) Consistency and Asymptotic Normality of Nonparametric Projection Estimators,
mimeo, MIT Department of Economics.
Newey, W.K. and K.D. West (1987) A Simple Positive Semi-Definite Heteroskedasticity and Auto-
correlation Consistent Covariance Matrix, Econometrica, 55, 703-708.
Orme, C. (1990) The Small-Sample Performance of the Information Matrix Test, Journal of Econo-
metrics, 46, 309-331.
Pagan, A.R. and H. Sabau (1987) On the Inconsistency of the MLE in Certain Heteroskedastic
Regression Models, mimeo, University of Rochester, Department of Economics.
Park, J.Y. and P.C.B. Phillips (1988) Statistical Inference in Regressions with Integrated Processes:
Part 1, Econometric Theory, 4, 468-497.
Park, J.Y. and P.C.B. Phillips (1989) Statistical Inference in Regressions with Integrated Processes:
Part 2, Econometric Theory, 5, 95-131.
Phillips, P.C.B. (1986) Understanding Spurious Regressions in Econometrics, Journal of Econometrics,
33, 311-340.
Phillips, P.C.B. (1987) Time Series Regression with a Unit Root, Econometrica, 55, 277-301.
Phillips, P.C.B. (1988) Multiple Regression with Integrated Time Series, Contemporary Mathematics,
80, 79-105.
Phillips, P.C.B. (1989) Partially Identified Econometric Models, Econometric Theory, 5, 181-240.
Phillips, P.C.B. (1991) Optimal Inference in Cointegrated Systems, Econometrica, 59, 283-306.
Phillips, P.C.B. and S.N. Durlauf (1986) Multiple Time Series Regression with Integrated Processes,
Review of Economic Studies, 53, 473-496.
Phillips, P.C.B. and B.E. Hansen (1990) Statistical Inference in Instrumental Variables Regression
with I(1) Processes, Review of Economic Studies, 57, 99-125.
Phillips, P.C.B. and M. Loretan (1991) Estimating Long-Run Economic Equilibria, Review of
Economic Studies, 58, 407-436.
Poirier, D.J. and P.A. Ruud (1988) Probit with Dependent Observations, Review of Economic Studies,
55, 593-614.
Pötscher, B.M. and I.R. Prucha (1986) A Class of Partially Adaptive One-Step Estimators for the Non-
linear Regression Model with Dependent Observations, Journal of Econometrics, 32, 219-251.
Pötscher, B.M. and I.R. Prucha (1989) A Uniform Law of Large Numbers for Dependent and
Heterogeneous Data Processes, Econometrica, 57, 675-684.
Pötscher, B.M. and I.R. Prucha (1991a) Basic Structure of the Asymptotic Theory in Dynamic Non-
linear Econometric Models, Part I: Consistency and Approximation Concepts, Econometric Reviews,
10, 125-216.
Pötscher, B.M. and I.R. Prucha (1991b) Basic Structure of the Asymptotic Theory in Dynamic Non-
linear Econometric Models, Part II: Asymptotic Normality, Econometric Reviews, 10, 253-325.
Quah, D. (1990) An Improved Rate for Non-Negative Definite Consistent Covariance Matrix
Estimation with Heterogeneous Dependent Data, Economics Letters, 33, 133-140.
Quah, D. and J.M. Wooldridge (1988) A Common Error in the Treatment of Trending Time Series,
MIT Department of Economics, Working Paper No. 483.
Quandt, R.E. and J.B. Ramsey (1978) Estimating Mixtures of Normal Distributions and Switching
Regressions, Journal of the American Statistical Association, 73, 730-738.
Ranga Rao, R. (1962) Relations Between Weak and Uniform Convergence of Measures with
Applications, Annals of Mathematical Statistics, 33, 659-680.
Rao, C.R. (1948) Large Sample Tests of Statistical Hypotheses Concerning Several Parameters with
Applications to Problems of Estimation, Proceedings of the Cambridge Philosophical Society, 44,
50-57.
Rilstone, P. (1991) Efficient Instrumental Variables Estimation of Nonlinear Dependent Processes,
mimeo, Universite Laval.
Robinson, P.M. (1972) Nonlinear Regression for Multiple Time Series, Journal of Applied Probability,
9, 758-768.
Robinson, P.M. (1982) On the Asymptotic Properties of Estimators of Models Containing Limited
Dependent Variables, Econometrica, 50, 27-42.
Robinson, P.M. (1987) Asymptotically Efficient Estimation in the Presence of Heteroskedasticity of
Unknown Form, Econometrica, 55, 875-891.
Robinson, P.M. (1991a) Best Nonlinear Three-Stage Least Squares Estimation of Certain Econometric
Models, Econometrica, 59, 755-786.
Robinson, P.M. (1991b) Testing for Strong Serial Correlation and Dynamic Conditional Hetero-
skedasticity in Multiple Regression, Journal of Econometrics, 47, 67-84.
Rosenblatt, M. (1956) A Central Limit Theorem and a Strong Mixing Condition, Proceedings of the
National Academy of Sciences USA, 42, 43-47.
Rosenblatt, M. (1978) Dependence and Asymptotic Independence for Random Processes, in: M.
Rosenblatt, ed., Studies in Probability Theory. Washington, DC: Mathematical Association of
America.
Roussas, G.G. (1972) Contiguity of Probability Measures. Cambridge: Cambridge University Press.
Saikkonen, P. (1991) Asymptotically Efficient Estimation of Cointegration Regressions, Econometric
Theory, 7, 1-21.
Sargan, J.D. (1958) The Estimation of Economic Relationships Using Instrumental Variables,
Econometrica, 26, 393-415.
Schmidt, P. (1976) On the Statistical Estimation of Parametric Frontier Production Functions, Review
of Economics and Statistics, 58, 238-239.
Seaks, T.G. and S.K. Layson (1983) Box-Cox Estimation with Standard Econometric Problems,
Review of Economics and Statistics, 65, 857-859.
Sims, C.A. (1972) Money, Income and Causality, American Economic Review, 62, 540-552.
Sims, C.A., J.H. Stock and M.W. Watson (1990) Inference in Linear Time Series Models with Some
Unit Roots, Econometrica, 58, 113-144.
Sowell, F. (1988) Maximum Likelihood Estimation of Fractionally Integrated Time Series, GSIA,
Carnegie Mellon University, Working Paper.
Steigerwald, D. (1992) Adaptive Estimation in Time Series Models, Journal of Econometrics, 54,
251-275.
Stock, J.H. (1987) Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors,
Econometrica, 55, 1035-1056.
Stock J.H. and M.W. Watson (1993) A Simple MLE of Cointegrating Vectors in Higher Order
Integrated Systems, Econometrica, 61, 783-820.
Weiss, A.A. (1986) Asymptotic Theory for ARCH Models: Estimation and Testing, Econometric
Theory, 2, 107-131.
Weiss, A.A. (1991) Estimating Nonlinear Dynamic Models Using Least Absolute Error Estimation,
Econometric Theory, 7, 46-68.
Weiss, L. (1971) Asymptotic Properties of Maximum Likelihood Estimators in some Nonstandard
Cases I, Journal of the American Statistical Association, 66, 345-350.
Weiss, L. (1973) Asymptotic Properties of Maximum Likelihood Estimators in some Nonstandard
Cases II, Journal of the American Statistical Association, 68, 428-430.
White, H. (1982) Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50, 1-25.
White, H. (1984) Asymptotic Theory for Econometricians. Orlando: Academic Press.
White, H. (1987) Specification Testing in Dynamic Models, in: T. Bewley, ed., Advances in
Econometrics - Fifth World Congress, Vol. I, 1-58. New York: Cambridge University Press.
White, H. (1993) Estimation, Inference, and Specification Analysis. New York: Cambridge University
Press.
White, H. and I. Domowitz (1984) Nonlinear Regression with Dependent Observations, Econometrica,
52, 143-162.
White, H. and M. Stinchcombe (1991) Adaptive Efficient Weighted Least Squares Estimation with
Dependent Observations, mimeo, UCSD Department of Economics.
Wolak, F.A. (1991) The Local Nature of Hypothesis Tests Involving Inequality Constraints in Non-
linear Models, Econometrica, 59, 981-996.
Wooldridge, J.M. (1986) Asymptotic Properties of Econometric Estimators, UCSD Department of
Economics, Ph.D. Dissertation.
Wooldridge, J.M. (1991a) On the Application of Robust, Regression-Based Diagnostics to Models of
Conditional Means and Conditional Variances, Journal of Econometrics, 47, 5-46.
Wooldridge, J.M. (1991b) Specification Testing and Quasi-Maximum Likelihood Estimation, Journal
of Econometrics, 48, 24-55.
Wooldridge, J.M. (1991c) Notes on Regression with Difference-Stationary Data, mimeo, Michigan
State University Department of Economics.
Wooldridge, J.M. and H. White (1985) Consistency of Optimization Estimators, UCSD Department
of Economics, Discussion Paper 85-29.
Wooldridge, J.M. and H. White (1988) Some Invariance Principles and Central Limit Theorems for
Dependent Heterogeneous Processes, Econometric Theory, 4, 210-230.
Wooldridge, J.M. and H. White (1989) Central Limit Theorems for Dependent, Heterogeneous Processes
with Trending Moments, mimeo, MIT Department of Economics.
Chapter 46

UNIT ROOTS, STRUCTURAL BREAKS AND TRENDS

JAMES H. STOCK*

Harvard University

Contents

Abstract 2740
1. Introduction 2740
2. Models and preliminary asymptotic theory 2744
2.1. Basic concepts and notation 2745
2.2. The functional central limit theorem and related tools 2748
2.3. Examples and preliminary results 2751
2.4. Generalizations and additional references 2756
3. Unit autoregressive roots 2757
3.1. Point estimation 2758
3.2. Hypothesis tests 2763
3.3. Interval estimation 2785
4. Unit moving average roots 2788
4.1. Point estimation 2790
4.2. Hypothesis tests 2792
5. Structural breaks and broken trends 2805
5.1. Breaks in coefficients in time series regression 2807
5.2. Trend breaks and tests for autoregressive unit roots 2817
6. Tests of the I(1) and I(0) hypotheses: links and practical limitations 2821
6.1. Parallels between the I(0) and I(1) testing problems 2821
6.2. Decision-theoretic classification schemes 2822
6.3. Practical and theoretical limitations in the ability to distinguish I(0) and
I(1) processes 2825
References 2831

*The author thanks Robert Amano, Donald Andrews, Jushan Bai, Ngai Hang Chan, In Choi, David
Dickey, Frank Diebold, Robert Engle, Neil Ericsson, Alastair Hall, James Hamilton, Andrew Harvey,
Sastry Pantula, Pierre Perron, Peter Phillips, Thomas Rothenberg, Pentti Saikkonen, Peter Schmidt,
Neil Shephard and Mark Watson for helpful discussions and/or comments on a draft of this chapter.
Graham Elliott provided outstanding research assistance. This research was supported in part by the
National Science Foundation (Grants SES-89-10601 and SES-91-22463).

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

Abstract

This chapter reviews inference about large autoregressive or moving average roots
in univariate time series, and structural change in multivariate time series
regression. The problem of unit roots is cast more broadly as determining the
order of integration of a series; estimation, inference, and confidence intervals are
discussed. The discussion of structural change focuses on tests for parameter
stability. Much emphasis is on asymptotic distributions in these nonstandard
settings, and one theme is the general applicability of functional central limit theory.
The quality of the asymptotic approximations to finite-sample distributions and
implications for empirical work are critically reviewed.

1. Introduction

The past decade has seen a surge of interest in the theoretical and empirical analysis
of long-run economic activity and its relation to short-run fluctuations. Early
versions of new classical theories of the business cycle (real business cycle models)
predicted that many real economic variables would exhibit considerable persistence,
more precisely, would contain a unit root in their autoregressive (AR) representations.
Halls (1978) fundamental work on the consumption function showed that, under
a simple version of the permanent income hypothesis, future changes in consumption
are unpredictable, so consumption follows a random walk or, more generally, a
martingale. The efficient markets theory of asset pricing recapitulated by Fama
(1970) had the same prediction: if future excess returns were predictable, they would
be bid away so that the price (or log price) would follow a martingale. The
predictions of these theories often extended to multivariate relations. For example,
if labor income has a unit root, then a simple version of the intertemporal permanent
income hypothesis implies that consumption will also have a unit root and
moreover that income minus consumption (savings) will not have a unit root, so
that consumption and income are, in Engle and Granger's (1987) terminology,
cointegrated [Campbell (1987)]. Similarly, versions of real business cycle models
predict that aggregate consumption, income and investment will be cointegrated.
The empirical evidence on persistence in economic time series was also being
refined during the 1980s. The observation that economic time series have high
persistence is hardly new. Orcutt (1948) found a high degree of serial correlation in
the annual time series data which Tinbergen (1939) used to estimate his econometric
model of the U.S. economy. By plotting autocorrelograms and adjusting for their
downward bias when the true autocorrelation is large, Orcutt concluded that many of
these series, including changes in aggregate output, investment and consumption,
were well characterized as being generated by the first-order autoregression,
$\Delta y_t = 0.3\Delta y_{t-1} + \varepsilon_t$ [Orcutt (1948, eq. 50)], where $\Delta y_t = y_t - y_{t-1}$; that is, they
contained an autoregressive unit root. During the 1960s and 1970s, conventional
time series practice was to model most economic aggregates in first differences, a
practice based on simple diagnostic devices rather than formal statistical tests. In
their seminal article, Nelson and Plosser (1982) replaced this informal approach
with Dickey and Fuller's (1979) formal tests for a unit root, and found that they
could not reject the hypothesis of a unit autoregressive root in 13 of 14 U.S. variables
using long annual economic time series, in some cases spanning a century. Similarly,
Meese and Singleton (1982) applied Dickey-Fuller tests and found that they could
not reject the null of a single unit root in various exchange rates. Davidson et al.
(1978) found that an error-correction model, later recognized as a cointegrating
model, provided stable forecasts of consumption in the U.K. As Campbell and
Mankiw (1987a, 1987b) and Cochrane (1988) pointed out, the presence of a unit
root in output implies that shocks to output have great persistence through base
drift, which can even exceed the magnitude of the original shock if there is positive
feedback in the form of positive autocorrelation.
This body of theoretical and empirical evidence drew on and spurred develop-
ments in the econometric theory of inference concerning long-run properties of
economic time series. This chapter surveys the theoretical econometrics literature
on long-run inference in univariate time series. With the exception of Section 5.1
on stability tests, the focus here is strictly univariate; for multivariate extensions see
the chapter by Watson in this Handbook. Throughout, we write the observed series
$y_t$ as the sum of a deterministic trend $d_t$ and a stochastic term $u_t$,

$$y_t = d_t + u_t, \quad t = 1, 2, \ldots, T. \quad (1.1)$$

The trend in general depends on unknown parameters, for example, in the leading
case of a linear time trend, $d_t = \beta_0 + \beta_1 t$, where $\beta_0$ and $\beta_1$ are unknown. Unless
explicitly stated otherwise, it is assumed that the form of the trend is correctly
specified. If $u_t$ has a unit autoregressive root, then $u_t$ is integrated of order one (is
I(1)) in the sense of Box and Jenkins (1976). If $\Delta u_t$ has a unit moving average (MA)
root, then $u_t$ is integrated of order zero (is I(0)). In the treatment here, the focus is
on these largest roots and the parameters describing the deterministic term are
treated as nuisance parameters. The two types of unit roots (AR and MA) introduce
obvious ambiguity in the phrase "unit root", so this chapter emphasizes instead the
I(0) and I(1) terminology. Precise definitions are given in Section 2.
The specific aim of this chapter is to outline the econometric theory of four areas
of inference in time series analysis: unit autoregressive roots and inference for I(1)
and nearly I(1) series; unit moving average roots and testing for a series being I(0);
inference on $d_t$ and, in particular, testing for a unit autoregressive root when $d_t$
might have breaks, for example, be piecewise linear; and tests for parameter
instability and structural breaks in regression models. Although the analysis of
structural breaks stems from a different literature than unit roots, the mathematics
and indeed some test statistics in these two areas are closely related, and this survey
emphasizes such links.
There have been four main areas of application of the techniques for inference
about long-run dependence discussed in this chapter. The first and perhaps the most
straightforward is data description. Does real GNP contain an autoregressive unit
root? What is a 95% confidence interval for the largest root? If output has a unit
root, then it has a permanent component, in the sense that it can be decomposed
into a stochastic trend (a martingale component) plus an I(0) series. What does this
permanent component, trend output, look like, and how can it be estimated? This
question has led to estimating and testing an unobserved components model. For
empirical applications of the unobserved components model see Harvey (1985),
Watson (1986), Clark (1987, 1989), Quah (1992) and Harvey and Jaeger (1993); for
a technical discussion, see Harvey (1989); for reviews see Stock and Watson (1988a)
and Harvey and Shephard (1992). A natural question is whether there is in fact a
permanent component. As will be seen in Section 4, this leads to testing for a unit
moving average root or, more generally, testing the null hypothesis that the series
is I(0) against the I(1) alternative.
A second important application is medium- and long-term forecasting. Suppose
one is interested in making projections of a series over a horizon that represents a
substantial fraction of the sample at hand. Such long-term forecasts will be
dominated by modeling decisions about the deterministic and stochastic trends.
Several of the techniques for inference studied in this chapter - for example, tests
for unit AR or MA roots and the construction of median-unbiased estimates of
autoregressive coefficients - have applications to long-run forecasting and the
estimation of forecast error bands.
A third application, perhaps the most common in practice, is to guide subsequent
multivariate modeling or inference involving the variable in question. For example,
suppose that primary interest is in the coefficients on $y_t$ in a regression in which $y_t$
appears as a regressor. Inference in this regression in general depends on the order
of integration of y, and on its deterministic component [see West (1988a), Park
and Phillips (1988), Sims et al. (1990), and the chapter by Watson in this Handbook].
As another example, if multiple series are I(1) then the next step might be to test for
and model cointegration. Alternatively, suppose that the objective is to decompose
multiple time series into permanent and transitory components, say to study
short-run dynamic effects of permanent shocks [Blanchard and Quah (1989), King
et al. (1991)]. In each of these cases, how best to proceed hinges on knowing whether
the individual series are I(0) or I(1). Although these multivariate applications are
beyond the scope of this chapter, inference about univariate AR and MA roots plays
a key initial step in these multivariate applications.
Fourth, information on the degree of persistence in a time series and, in particular,
on its order of integration can help to guide the construction or testing of economic
theories. Indeed, a leading interpretation of Nelson and Plosser's (1982) findings
was that the prevalence of I(1) series in their long annual data set provided support
for real theories of the business cycle. Alternatively, knowledge of the order of
integration of certain variables can be used to suggest more precise statements (and
to guide inference) about certain economic theories, for example, the possibility of
a vertical long-run Phillips curve or the neutrality of money [Fisher and Seater
(1993), King and Watson (1992)].
In addition to these empirical applications, technical aspects of the econometric
theory of unit roots, trend breaks and structural breaks are related to several other
problems in econometric theory, such as inference in cointegrated systems. The
theory developed here provides an introduction to the more involved multivariate
problems.
Several good reviews of this literature are already available and an effort has been
made here to complement them. Phillips (1988) surveys the theoretical literature on
univariate and multivariate autoregressive unit root distributions, and a less
technical introduction to these topics is given in Phillips (1992a). Diebold and
Nerlove (1990) provide a broad review of the econometric literature on measures
and models of persistence. Campbell and Perron (1991) provide an overview of the
literature on unit autoregressive roots, as well as on cointegration, with an eye
towards advising applied researchers. Banerjee et al. (1992a) provide a thorough
introduction to testing and estimation in the presence of unit autoregressive roots
and multivariate modeling of integrated time series, with special attention to
empirical applications.
The main approach to inference about long-term properties of time series which
is excluded from this survey is fractional integration. In this alternative to the
I(0)/I(1) framework, it is supposed that a series is integrated of order $d$, where $d$ need
not be an integer. The econometric theory of inference in fractionally integrated
models has seen ongoing important work over the past two decades. This literature
is large and the theory is involved, and doing it justice would require a lengthier
treatment than possible here. The R/S statistic of Mandelbrot and Van Ness (1968),
originally developed to detect fractional integration, is discussed briefly in Section 3.2
in the context of tests for an AR unit root. Otherwise, the reader is referred to recent
contributions in this area. Two excellent surveys are Beran (1992) and, at a more
rigorous level, Robinson (1993). Important contributions to the theory of inference
with fractional integration include Geweke and Porter-Hudak (1983), Fox and
Taqqu (1986), Dahlhaus (1989) and Sowell (1990, 1992). Recent empirical work in
econometrics includes Lo (1991) (R/S analysis of stock prices), Diebold and
Rudebusch (1989, 1991a) and Diebold et al. (1991) (estimation of the fractional
differencing parameter for economic data).
The chapter is organized as follows. Section 2 describes the I(0) and I(1) models
and reviews some tools for asymptotic analysis. Section 3 examines inference about
the largest autoregressive root when this root equals or is close to one. Section 4
studies inference about unit or near-unit moving average roots. Two related topics
are covered in Section 5: tests for parameter stability and structural breaks when
the break date is unknown, and tests for AR unit roots when there are broken
trends. Section 6 concludes by drawing links between the I(0) and I(1) testing
problems and by suggesting some conclusions concerning these techniques that, it is
hoped, will be useful in empirical practice. Most of the formal analysis in this chapter
is based on asymptotic distribution theory. The treatment of the theory here is
self-contained. Readers primarily interested in empirical applications can omit
Sections 2.4, 3.2.3 and 4.2.3 with little loss of continuity. Readers primarily interested
in tests for parameter stability and structural breaks in time series regression can
restrict their attention to Sections 2 and 5.1.

2. Models and preliminary asymptotic theory

This section provides an introduction to the basic models and limit theory which
will be used to develop and to characterize the statistical procedures studied in
the remainder of this chapter. Section 2.1 introduces basic notation used throughout
the chapter, and provides formulations of the I(0) and I(1) hypotheses. This section
also introduces a useful tool, Beveridge and Nelson's (1981) decomposition of an
I(1) process into I(0) and I(1) components. This leads naturally to a second
expression for the I(0) and I(1) hypotheses in a "components" representation.
Section 2.2 summarizes the limit theory which will be used to derive the asymp-
totic properties of the various test procedures. A variety of techniques have been
and continue to be used in the literature to characterize limiting distributions in the
unit MA and AR roots problems. However, the most general and the simplest to
apply is the approach based on the functional central limit theorem (FCLT, also
called the invariance principle or Donsker's theorem) and the continuous mapping
theorem (CMT), and that is the approach used in this chapter. [There are a number
of excellent texts on the FCLT. The classic text is Billingsley (1968). A more modern
treatment, on which this chapter draws, is Hall and Heyde (1980). Ethier and Kurtz
(1986) provide more advanced material and applications. Also, see the chapter by
Wooldridge in this Handbook.] The version of the FCLT used in this chapter, which
applies to the sequence of partial sums of martingale difference sequences, is due to
Brown (1971). The main advantage of this approach is that, armed with the FCLT
and the CMT, otherwise daunting asymptotic problems are reduced to a series of
relatively simple calculations. White (1958) was the first to suggest using the FCLT
to analyze unit root distributions. Other early applications of the FCLT, with
i.i.d. or martingale difference sequence errors, to statistics involving I(1) processes
include Bobkoski (1983) and Solo (1984). Phillips' (1987a) influential paper
demonstrated the power of this approach by deriving the distribution of the AR(1)
estimator and t-statistic in the misspecified case that the process has additional
[non-AR(l)] dependence. These were paralleled by important developments in the
asymptotics of multivariate unit root models; see the chapter by Watson in this
Handbook for a review.

The aim of this chapter is to provide a treatment at a level suitable for graduate
students and applied econometricians. To enhance accessibility, we make two main
compromises in generality and rigor. The first is to restrict attention to time series
which can be written as linear processes with martingale difference errors, subject
to some moment restrictions. This class is rich enough to capture the key
complications in the theory and practice of inference concerning unit roots and
trend breaks, namely the presence of possibly infinitely many nuisance parameters
describing the short-run dependence of an I(0) disturbance. However, most of the
results hold under some forms of nonstationarity. References to treatments which
handle such nonstationarity are given in Section 2.4. The second technical
compromise concerns details of proofs of continuity of functionals needed to apply
the continuous mapping theorem; these details are typically conceptually straight-
forward but tedious and notationally cumbersome, and references are given to
complete treatments when subtleties are involved.

2.1. Basic concepts and notation

Throughout this chapter, $v_t$ denotes a purely stochastic I(0) process and $\varepsilon_t$ denotes
a serially uncorrelated stochastic process, specifically a martingale difference
sequence. The term "I(0)" is vague, so defining a process to be I(0) requires additional
technical assumptions. The formulation which shall be used throughout this chapter
is that $v_t$ is a linear process with martingale difference sequence errors. That is, the
I(0) process $v_t$ has the (possibly infinite) moving average representation,

$$v_t = c(L)\varepsilon_t, \quad t = 0, \pm 1, \pm 2, \ldots, \quad (2.1)$$

where $c(L) = \sum_{j=0}^{\infty} c_j L^j$ is a one-sided moving average polynomial in the lag operator
$L$ which in general has infinite order. The errors are assumed to obey

$$E(\varepsilon_t \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) = 0,$$
$$T^{-1}\sum_{t=1}^{T} E(\varepsilon_t^2 \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) \to E\varepsilon_t^2 = \sigma_\varepsilon^2 \ \text{a.s. as } T \to \infty,$$
$$E(\varepsilon_t^4 \mid \varepsilon_{t-1}, \varepsilon_{t-2}, \ldots) < K \ \text{a.s. for all } t. \quad (2.2)$$
That is, $\varepsilon_t$ can exhibit conditional heteroskedasticity but this conditional hetero-
skedasticity must be stationary in the sense that fourth moments exist and that $\varepsilon_t$
is unconditionally homoskedastic. Because $\varepsilon_t$ is unconditionally homoskedastic,
under (2.1) and (2.2) $v_t$ is covariance stationary.¹ This simplifies the discussion of

¹A time series $y_t$ is strictly stationary if the distribution of $(y_{k+1}, \ldots, y_{k+m})$ does not depend on $k$. The
series is covariance stationary (or second-order stationary) if $Ey_t$ and $Ey_t y_{t-j}$, $j = 0, \pm 1, \ldots$, exist and are
independent of $t$.
functions of second moments of $v_t$, such as its spectrum, $s_v(\omega)$, or autocovariances,
$\gamma_v(j)$, $j = 0, \pm 1, \pm 2, \ldots$. The representation (2.1) is similar to the Wold representation
for a covariance stationary series, although the Wold representation only implies
that the errors are serially uncorrelated, not martingale difference sequences.
Central to the idea that a process is I(0) is that the dependence between distant
observations is limited. In the context of (2.1), this amounts to making specific
assumptions on $c(L)$. The assumption which will be maintained throughout is that
$c(L)$ has no unit roots and that it is one-summable [e.g. Brillinger (1981, ch. 2.7)]:

$$c(1) \neq 0 \quad \text{and} \quad \sum_{j=0}^{\infty} j|c_j| < \infty, \quad (2.3)$$

where $c(1) = \sum_{j=0}^{\infty} c_j$. The conditions (2.3) can alternatively be written as restrictions
on the spectrum of $v_t$, $s_v(\omega)$. Because $s_v(\omega) = (\sigma_\varepsilon^2/2\pi)\,|\sum_{j=0}^{\infty} c_j e^{i\omega j}|^2$ (where $i = \sqrt{-1}$),
$s_v(0) = \sigma_\varepsilon^2 c(1)^2/2\pi$, so $c(1) \neq 0$ implies that the spectral density of $v_t$ at frequency zero
is nonzero. Similarly, the one-summability condition implies that $ds_v(\omega)/d\omega$ is finite
at $\omega = 0$. Thus, these conditions on $c(L)$ restrict the long-term behavior of $v_t$. Unless
explicitly stated otherwise, throughout this chapter it is assumed that $v_t$ satisfies
(2.1)-(2.3).
The definition of general orders of integration rests on this definition of I(0): a
process is said to be I(d), $d \geq 1$, if its $d$th difference, $\Delta^d u_t$, is I(0). Thus $u_t$ is I(1) if
$\Delta u_t = v_t$, where $v_t$ satisfies (2.1)-(2.3). In levels, $u_t = \sum_{s=1}^{t} v_s + u_0$, so that the
specification of the levels process of $u_t$ must also include an assumption about the
initial condition. Unless explicitly stated otherwise, it is assumed that, if $u_t$ is I(1),
then the initial condition satisfies $Eu_0^2 < \infty$.
A leading example of processes which satisfy (2.3) are finite-order ARMA models
as popularized by Box and Jenkins (1976). If $v_t$ has an ARMA($p, q$) representation,
then it can be written

$$\rho(L)v_t = \phi(L)\varepsilon_t, \quad (2.4)$$

where $\rho(L)$ and $\phi(L)$, respectively, have finite orders $p$ and $q$. If the roots of $\rho(L)$ and
$\phi(L)$ lie outside the unit circle, then the ARMA process is stationary and invertible
and $v_t$ is integrated of order zero. If $v_t$ satisfies (2.4) and is stationary and invertible,
then $v_t = c(L)\varepsilon_t$ where $c(L) = \rho(L)^{-1}\phi(L)$, and it is readily verified that (2.3) is satisfied,
since $\phi(1) \neq 0$ and eventually $c_j$ decays exponentially.
ARMA models provide a simple framework for nesting the I(0) and I(1)
hypotheses, and are the origin of the "unit root" terminology. Suppose $u_t$ in (1.1)
satisfies

$$(1 - \alpha L)u_t = (1 - \theta L)v_t, \quad (2.5)$$

where $v_t$ is I(0). If $|\alpha| < 1$, $u_t$ is stationary. If $|\theta| < 1$, then $(1 - \theta L)$ is said to be
invertible. If $\alpha = 1$ and $|\theta| < 1$, then $u_t$ is integrated of order one; that is, $u_t - u_0$ can
be expressed as the partial sum - loosely, the integration - of a stationary process.
If $\alpha = 1$ and $\theta = 1$, then $u_t = v_t + (u_0 - v_0)$ and $u_t$ is stationary, or integrated of order
zero. [If $\theta = 1$ and $|\alpha| < 1$, then $u_t$ is integrated of order $-1$, but we will not consider
this case since then $v_t$ in (2.5) can be replaced by its accumulation $\sum_{s=1}^{t} v_s$, which in
turn is I(0).]
This framework provides an instructive interpretation of the I(1) and I(0) models
in terms of the properties of long-run forecasts of the series. As Harvey (1985, 1989)
has emphasized, an intuitively appealing definition of the trend component of a
series is that its long-run forecast is its trend. If $u_t$ is I(1), then its long-run forecast
follows a martingale, while if $u_t$ is I(0), its long-run forecast tends to its unconditional
mean (here zero). In this sense, if $u_t$ is I(1) then it and $y_t$ can be said to have a
stochastic trend.
This correspondence between the order of integration of a series and whether it
has a stochastic trend is formally provided by Beveridge and Nelson's (1981)
decomposition of $u_t$ into I(1) and I(0) components. Suppose that $\Delta u_t = v_t$. The
Beveridge-Nelson (1981) decomposition rests on writing $c(L)$ as $c(L) = c(1) +
[c(L) - c(1)] = c(1) + c^*(L)\Delta$, where $\Delta = 1 - L$ and $c_j^* = -\sum_{i=j+1}^{\infty} c_i$ (this identity is
readily verified by writing out $c(L) - c(1) = \Delta c^*(L)$ and collecting terms). Thus $v_t$
can be written $v_t = c(1)\varepsilon_t + c^*(L)\Delta\varepsilon_t$. Then, because $u_t = \sum_{s=1}^{t} v_s + u_0$, we get the
Beveridge-Nelson decomposition

$$u_t = c(1)\sum_{s=1}^{t}\varepsilon_s + c^*(L)\varepsilon_t + \tilde{u}_0, \quad (2.6)$$

where $\tilde{u}_0 = u_0 - c^*(L)\varepsilon_0$. It is readily verified that, under (2.1)-(2.3), $c^*(L)\varepsilon_t$ is
covariance stationary. This follows from the one-summability of $c(L)$, which implies
that $c^*(L)$ is summable. [Specifically,
$\sum_{j=0}^{\infty}|c_j^*| = \sum_{j=0}^{\infty}|\sum_{i=j+1}^{\infty} c_i| \leq \sum_{j=0}^{\infty}\sum_{i=j+1}^{\infty}|c_i| = \sum_{j=0}^{\infty} j|c_j|$,
which is finite by (2.3).] Thus,

$$E(c^*(L)\varepsilon_t)^2 = \sum_{j=0}^{\infty}(c_j^*)^2\sigma_\varepsilon^2 \leq \Big(\sum_{j=0}^{\infty}|c_j^*|\Big)^2\sigma_\varepsilon^2,$$

which is finite by (2.2) and (2.3).


The Beveridge-Nelson decomposition (2.6) therefore represents $u_t$ as the sum of
a constant times a martingale, a covariance stationary disturbance and an initial
condition $\tilde{u}_0$. If $u_0$ is fixed or drawn from a distribution on the real line, then $\tilde{u}_0$ can
be neglected and often is set to zero in statements of the Beveridge-Nelson
decomposition. The martingale term can be interpreted as the long-run forecast of
$u_t$: because $c^*(L)$ is summable, the long-term forecast $u_{t+k|t}$, for $k$ very large, is
$c(1)\sum_{s=1}^{t}\varepsilon_s$. Thus an I(1) series can be thought of as containing a stochastic trend.
Equally, if $u_t$ is I(0), then $\mathrm{plim}_{k\to\infty} u_{t+k|t} = 0$, so that $u_t$ does not have a stochastic
trend.
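
The decomposition is easy to verify numerically. The following minimal sketch (our illustration, not part of the chapter) checks the identity (2.6) exactly for a simulated MA(1) process; the variable names and parameter values are arbitrary.

import numpy as np

# Beveridge-Nelson decomposition of u_t when v_t = (1 + theta*L) eps_t:
# c(1) = 1 + theta and c*(L) = -theta (a constant), so (2.6) reads
# u_t = (1 + theta)*sum_{s<=t} eps_s - theta*eps_t + u0_tilde.
rng = np.random.default_rng(0)
T, theta = 500, 0.5
eps = rng.standard_normal(T + 1)      # eps_0, eps_1, ..., eps_T
v = eps[1:] + theta * eps[:-1]        # v_t = eps_t + theta*eps_{t-1}, t = 1..T
u = np.cumsum(v)                      # u_t = sum_{s=1}^t v_s, so u_0 = 0

trend = (1 + theta) * np.cumsum(eps[1:])   # martingale (stochastic trend) part
stationary = -theta * eps[1:]              # c*(L) eps_t
u0_tilde = theta * eps[0]                  # u_0 - c*(L) eps_0, with u_0 = 0

# the identity (2.6) holds exactly, observation by observation
assert np.allclose(u, trend + stationary + u0_tilde)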

2.2. The functional central limit theorem and related tools

If $v_t$ is stationary, or more generally has sufficiently many moments and limited
dependence on past observations, then averages such as $T^{-1}\sum_{t=1}^{T} v_t^2$ will be
consistent for their expectation, and scaled sums like $T^{-1/2}\sum_{t=1}^{T} v_t$ will obey a
central limit theorem; see the chapter by Wooldridge in this Handbook for
a general treatment. By the nature of the problems being studied, however,
conventional limit theory does not apply to many of the statistics covered in this
chapter. For example, the null distribution of a test for a unit autoregressive root
is derived for $u_t$ being I(1). However, this violates the assumptions upon which
conventional asymptotic tools, such as the weak law of large numbers (WLLN), are
based. For example, if $u_t$ is I(1), then the sample mean $\bar{u}$ is $O_p(T^{1/2})$ and $T^{-1/2}\bar{u}$ has
a limiting normal distribution, in sharp contrast to the I(0) case in which $\bar{u}$ is
consistent.²
The approach to this and related problems used in this chapter is based on the
functional central limit theorem. The FCLT is a generalization of the conventional
CLT to function-valued random variables, in the case at hand, the function
constructed from the sequence of partial sums of a stationary process. Before
discussing the FCLT, we introduce extensions to function spaces of the standard
notions of consistency, convergence in distribution, and the continuous mapping
theorem. Let $C[0,1]$ be the space of bounded continuous functions on the unit
interval with the sup-norm metric, $d(f,g) = \sup_{s\in[0,1]}|f(s) - g(s)|$, where $f, g \in C[0,1]$.

Consistency. A random element $\xi_T \in C[0,1]$ converges in probability to $f$ (that is,
$\xi_T \xrightarrow{p} f$) if $\Pr[d(\xi_T, f) > \delta] \to 0$ for all $\delta > 0$.

Convergence in distribution. Let $\{\xi_T, T \geq 1\}$ be a sequence of random elements of
$C[0,1]$ with induced probability measures $\{\pi_T\}$. Then $\pi_T$ converges weakly to $\pi$,
or equivalently $\xi_T \Rightarrow \xi$ where $\xi$ has the probability measure $\pi$, if and only if
$\int f\,d\pi_T \to \int f\,d\pi$ for all bounded continuous $f: C[0,1] \to \mathbb{R}$. The notations $\xi_T \Rightarrow \xi$ and
$\xi_T(\cdot) \Rightarrow \xi(\cdot)$, where $\cdot$ denotes the argument of the functions $\xi_T$ and $\xi$, are used
interchangeably in this chapter.

²Suppose $\Delta u_t = \varepsilon_t$ and $u_0 = 0$. Clearly conventional assumptions in the WLLN, such as $u_t$ having
a bounded second moment, do not hold. Rather, $T^{-1/2}\bar{u} = T^{-3/2}\sum_{t=1}^{T} u_t = T^{-3/2}\sum_{t=1}^{T}\sum_{s=1}^{t}\varepsilon_s =
T^{-1/2}\sum_{s=1}^{T}(1 - (s-1)/T)\varepsilon_s$, so a central limit theorem for weighted sums implies that $T^{-1/2}\bar{u} \Rightarrow
N(0, \sigma_\varepsilon^2/3)$.

The continuous mapping theorem (CMT). If $h$ is a continuous functional mapping
$C[0,1]$ to some metric space and $\xi_T \Rightarrow \xi$, then $h(\xi_T) \Rightarrow h(\xi)$.

The FCLT generalizes the usual CLT to random functions $\xi_T \in C[0,1]$. Let $[\cdot]$
denote the greatest lesser integer function. Let $\xi_T(\lambda)$ be the function constructed by
linearly interpolating between the partial sums of $\varepsilon_t$ at the points $\lambda = (0, 1/T,
2/T, \ldots, 1)$, that is,

$$\xi_T(\lambda) = (\sigma_\varepsilon^2 T)^{-1/2}\Big[\sum_{t=1}^{[T\lambda]}\varepsilon_t + (T\lambda - [T\lambda])\varepsilon_{[T\lambda]+1}\Big],$$

so that $\xi_T$ is a piecewise-linear random element of $C[0,1]$. The CLT for vector-
valued processes ensures that, if $\lambda_1, \ldots, \lambda_k$ are fixed constants between zero and one
and condition (2.2) holds, then $[\xi_T(\lambda_1), \xi_T(\lambda_2), \ldots, \xi_T(\lambda_k)]$ converges in distribution
jointly to a $k$-dimensional normal random variable. The FCLT extends this result
to hold not just for finitely many fixed values of $\lambda$, but rather for $\xi_T$ treated as a
function of $\lambda$. The following FCLT is a special case of Brown's (1971) FCLT [see
Hall and Heyde (1980), Theorem 4.1 and discussion].

Theorem 1 (Functional central limit theorem for a martingale)

Suppose that $\varepsilon_t$ is a martingale difference sequence which satisfies (2.2). Then $\xi_T \Rightarrow W$,
where $W$ is a standard Brownian motion on the unit interval.

An FCLT for processes which satisfy (2.1)-(2.3) can be obtained by verifying that
condition (5.24) in Hall and Heyde's (1980) Theorem 5.5 is satisfied if $c(L)$ is
one-summable. [One-summability is used because of its prior use in unit root
asymptotics [Stock (1987)], although it can be replaced by a weaker summability
condition; see Solo (1989) and Phillips and Solo (1992).] However,
Hall and Heyde's theorem is more general than needed here and for completeness
an FCLT is explicitly derived from Theorem 1 for processes satisfying (2.1)-(2.3).
The argument here relies on inequalities in Hall and Heyde (1980) and follows
Phillips and Solo (1992), except that the somewhat stronger conditions used here
simplify the argument. See Phillips and Solo (1992) for an extensive discussion,
based on the Beveridge-Nelson decomposition, of conditions under which the
FCLT holds for linear processes.
To show that Theorem 1 and the Beveridge-Nelson decomposition can be used
to yield directly an FCLT for partial sums of I(0) processes which satisfy conditions
(2.1)-(2.3), let

$$\zeta_{vT}(\lambda) = (\sigma_\varepsilon^2 T)^{-1/2}\Big[\sum_{t=1}^{[T\lambda]} v_t + (T\lambda - [T\lambda])v_{[T\lambda]+1}\Big].$$

According to the Beveridge-Nelson decomposition (2.6), this scaled partial sum for
fixed $\lambda$ is $c(1)T^{-1/2}\sum_{t=1}^{[T\lambda]}\varepsilon_t$ plus a term which is $T^{-1/2}$ times an I(0) variable.
Because $\xi_T \Rightarrow W$, this suggests that $\zeta_{vT} \Rightarrow c(1)W$.
To show this formally, the argument that $\zeta_{vT} - c(1)\xi_T \xrightarrow{p} 0$ must be made
uniformly in $\lambda$, that is, that $\Pr[\sup_\lambda|\zeta_{vT}(\lambda) - c(1)\xi_T(\lambda)| > \delta] \to 0$ for all $\delta > 0$. Now,

$$|\zeta_{vT}(\lambda) - c(1)\xi_T(\lambda)| = (\sigma_\varepsilon^2 T)^{-1/2}\Big|\sum_{t=1}^{[T\lambda]} v_t + (T\lambda - [T\lambda])v_{[T\lambda]+1} - c(1)\sum_{t=1}^{[T\lambda]}\varepsilon_t - (T\lambda - [T\lambda])c(1)\varepsilon_{[T\lambda]+1}\Big|$$

$$= (\sigma_\varepsilon^2 T)^{-1/2}\Big|c(1)\sum_{t=1}^{[T\lambda]}\varepsilon_t + c^*(L)\varepsilon_{[T\lambda]} - c^*(L)\varepsilon_0 + (T\lambda - [T\lambda])\big(c(1)\varepsilon_{[T\lambda]+1} + c^*(L)\Delta\varepsilon_{[T\lambda]+1}\big) - c(1)\sum_{t=1}^{[T\lambda]}\varepsilon_t - (T\lambda - [T\lambda])c(1)\varepsilon_{[T\lambda]+1}\Big|$$

$$= (\sigma_\varepsilon^2 T)^{-1/2}\big|c^*(L)\varepsilon_{[T\lambda]} - c^*(L)\varepsilon_0 + (T\lambda - [T\lambda])c^*(L)\Delta\varepsilon_{[T\lambda]+1}\big|$$

$$\leq (\sigma_\varepsilon^2 T)^{-1/2}\big\{|c^*(L)\varepsilon_{[T\lambda]}| + |c^*(L)\varepsilon_{[T\lambda]+1}| + |c^*(L)\varepsilon_0|\big\}$$

$$\leq 2\sigma_\varepsilon^{-1}\max_{t=1,\ldots,T}|T^{-1/2}c^*(L)\varepsilon_t| + \sigma_\varepsilon^{-1}T^{-1/2}|c^*(L)\varepsilon_0|, \quad (2.7)$$

where the second equality uses the Beveridge-Nelson decomposition. The term
$T^{-1/2}|c^*(L)\varepsilon_0|$ in the final line of (2.7) does not depend on $\lambda$ and is asymptotically
negligible, so we drop it and have

$$\Pr[\sup_\lambda|\zeta_{vT}(\lambda) - c(1)\xi_T(\lambda)| > \delta] \leq \Pr\big[2\sigma_\varepsilon^{-1}\max_t|T^{-1/2}c^*(L)\varepsilon_t| > \delta\big] \leq \Big(\frac{\delta\sigma_\varepsilon}{2}\Big)^{-3} E\max_t|T^{-1/2}c^*(L)\varepsilon_t|^3 \leq \Big(\frac{\delta\sigma_\varepsilon}{2}\Big)^{-3} T^{-1/2}\Big(\sum_{j=0}^{\infty}|c_j^*|\Big)^3\max_t E|\varepsilon_t|^3, \quad (2.8)$$

where the second inequality is Markov's inequality (with $E\max_t|\cdot|^3$ bounded by $\sum_t E|\cdot|^3$) and
the final inequality follows from Minkowski's inequality. Because $\max_t E|\varepsilon_t|^3 < \infty$
[by (2.2)] and $\sum_{j=0}^{\infty}|c_j^*| < \infty$ [by the argument following (2.6)], $\Pr[\sup_\lambda|\zeta_{vT}(\lambda) -
c(1)\xi_T(\lambda)| > \delta] \to 0$ for all $\delta > 0$, so $\zeta_{vT} - c(1)\xi_T \xrightarrow{p} 0$. Combining this asymptotic
equivalence with Theorem 1, we have the general result that, if $v_t$ satisfies (2.1)-(2.3),
then $\zeta_{vT} \Rightarrow c(1)W$.

The continuity correction involved in constructing $\xi_T$ and $\zeta_{vT}$ is cumbersome and
is asymptotically negligible in the sup-norm sense [this can be shown formally using
the method of (2.7) and (2.8)]. We shall therefore drop this correction henceforth
and write the result $\zeta_{vT} \Rightarrow c(1)W$ as the FCLT for general I(0) processes,

$$T^{-1/2}\sum_{s=1}^{[T\cdot]} v_s \Rightarrow \sigma_\varepsilon c(1)W(\cdot) = \omega W(\cdot), \quad (2.9)$$

where $\omega = \sigma_\varepsilon c(1)$.³


Suppose $u_t$ is an I(1) process with $\Delta u_t = v_t$ and, as is assumed throughout,
$Eu_0^2 < \infty$. Then the levels process of $u_t$ obeys the FCLT (2.9): $T^{-1/2}u_{[T\cdot]} =
T^{-1/2}\sum_{s=1}^{[T\cdot]} v_s + T^{-1/2}u_0 \Rightarrow \omega W(\cdot)$, where $T^{-1/2}u_0 \xrightarrow{p} 0$ by Chebyshev's inequality.
A special case of this is when $u_0$ is fixed and finite.
The result (2.9) provides a concrete link between the assumptions (2.2) and (2.3)
used to characterize an I(1) process, the Beveridge-Nelson decomposition (2.6)
and the limit theory which will be used to analyze statistics based on I(1) processes.
Under (2.2) and (2.3), the partial sum process is dominated by a stochastic trend, as
in (2.6). In the limit, after scaling by $T^{-1/2}$, this behaves like $\omega$ times a Brownian
motion, where $\omega^2 = 2\pi s_v(0)$ is the zero-frequency power, or "long-run variance", of $v_t$.
Thus the limiting behavior of $u_t$, where $\Delta u_t = v_t$, is the same (up to a scale factor)
for a wide range of $c(L)$ which satisfy (2.3). It is in this sense that we think of processes
which satisfy (2.1)-(2.3) as being I(0).
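
As a concrete illustration of (2.9) (our own sketch, not from the chapter), the following Monte Carlo checks that the scaled partial sum of an MA(1) process has the dispersion of $\omega W(1)$, with $\omega = \sigma_\varepsilon c(1)$; the parameter choices are arbitrary.

import numpy as np

# Monte Carlo check of the FCLT (2.9): T^{-1/2} sum_{s<=T} v_s => omega*W(1),
# with v_t = eps_t + theta*eps_{t-1}, so omega = sigma_eps*c(1) = 1 + theta.
rng = np.random.default_rng(0)
T, theta, nrep = 1000, 0.5, 20000
eps = rng.standard_normal((nrep, T + 1))
v = eps[:, 1:] + theta * eps[:, :-1]
stat = v.sum(axis=1) / np.sqrt(T)          # T^{-1/2} * partial sum at lambda = 1

omega = 1.0 + theta                        # sigma_eps = 1 and c(1) = 1 + theta
print(stat.std(), "should be close to", omega)   # since W(1) ~ N(0, 1)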

2.3. Examples and preliminary results

The FCLT and the CMT provide a powerful set of tools for the analysis of statistics
involving I(1) processes. The examples in this section will be of use later but are
also of independent interest.

Example 1. Sample moments of I(1) processes

A problem mentioned in Section 2.2 was the surprising behavior of the sample mean
of an I(1) process. The limiting properties of this and higher moments are readily
characterized using the tools of Section 2.2. Let $u_t$ be I(1), so that $\Delta u_t = v_t$, and let
$u_0 = 0$. Then,

$$T^{-1/2}\bar{u} = T^{-3/2}\sum_{t=1}^{T} u_t = \int_0^1 (T^{-1/2}u_{[T\lambda]})\,d\lambda + T^{-3/2}u_T, \quad (2.10)$$

³Formally, the process on the left-hand side of (2.9) is an element of $D[0,1]$, the space of functions on
$[0,1]$ that are right-continuous and have left-hand limits. However, the discontinuous partial sum
process is asymptotically equivalent to $\zeta_{vT} \in C[0,1]$, for which Theorem 1 applies. See Billingsley (1968,
ch. 3) or Ethier and Kurtz (1986) for a treatment of convergence on $D[0,1]$.

where the final equality follows by definition of the integral. The final expression in
(2.10) can be written $T^{-1/2}\bar{u} = h_1(T^{-1/2}u_{[T\cdot]}) + T^{-1}h_2(T^{-1/2}u_{[T\cdot]})$, where $h_1$ and $h_2$
are functions from $C[0,1] \to \mathbb{R}$, namely $h_1(f) = \int_0^1 f(\lambda)\,d\lambda$ and $h_2(f) = f(1)$. Both
functions are readily seen to be continuous with respect to the sup-norm, so by (2.9)
and the continuous mapping theorem,

$$h_1(T^{-1/2}u_{[T\cdot]}) \Rightarrow h_1(\omega W) = \omega\int_0^1 W(\lambda)\,d\lambda,$$

so $T^{-1}h_2(T^{-1/2}u_{[T\cdot]}) \xrightarrow{p} 0$ and $T^{-1/2}\bar{u} \Rightarrow \omega\int_0^1 W(\lambda)\,d\lambda$, which has a normal distribu-
tion (cf. footnote 2).
This approach can be just as easily applied to higher moments, say the $k$th
moment:

$$T^{-(1+k/2)}\sum_{t=1}^{T} u_t^k = \int_0^1 (T^{-1/2}u_{[T\lambda]})^k\,d\lambda + o_p(1) \Rightarrow \omega^k\int_0^1 W(\lambda)^k\,d\lambda, \quad (2.11)$$

where the convergence follows from the FCLT and the CMT.
The final expression in (2.11) uses a notational convention which will be used
commonly in this chapter: the limits on integrals over the unit interval will be
omitted, so, for example, $\int W^k$ denotes $\int_0^1 (W(\lambda))^k\,d\lambda$. Similarly, the stochastic (Itô)
integral $\int_0^1 W(\lambda)\,dG(\lambda)$ is written $\int W\,dG$ for two continuous-time stochastic
processes $W$ and $G$.
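
A small simulation (ours, not from the chapter) illustrates the limit just derived: for a driftless Gaussian random walk ($\omega = 1$), $T^{-1/2}\bar{u}$ should behave like $\int W \sim N(0, 1/3)$.

import numpy as np

# T^{-1/2} * sample mean of a driftless random walk => omega * int_0^1 W,
# which is N(0, omega^2/3); here omega = 1.
rng = np.random.default_rng(1)
T, nrep = 1000, 20000
u = np.cumsum(rng.standard_normal((nrep, T)), axis=1)   # random walks
stat = u.mean(axis=1) / np.sqrt(T)                      # T^{-1/2} * ubar
print(stat.var(), "should be close to", 1 / 3)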

Example 2. Detrended I(1) processes

Because of the presence of the deterministic term $d_t$ in (1.1), many statistics of interest
involve detrending. It is therefore useful to have limiting representations of the
detrended series. The most common form of detrending is by an ordinary least
squares (OLS) regression of $y_t$ on polynomials in time, the leading cases being
demeaning and linear detrending. Let

$$y_t^\mu = y_t - T^{-1}\sum_{s=1}^{T} y_s, \quad (2.12a)$$

$$y_t^\tau = y_t - \hat\beta_0 - \hat\beta_1 t, \quad (2.12b)$$

where $(\hat\beta_0, \hat\beta_1)$ are the OLS estimators of the parameters in the regression of $y_t$ onto
$(1, t)$. If $d_t = \beta_0$, then (2.12a) applies, while if $d_t = \beta_0 + \beta_1 t$, then (2.12b) applies.

As in the previous example, suppose that $u_t$ is I(1), so that (2.9) applies. Now
$y_t^\mu = u_t - T^{-1}\sum_{s=1}^{T} u_s$, so $T^{-1/2}y_{[T\cdot]}^\mu = T^{-1/2}u_{[T\cdot]} - T^{-1/2}\bar{u}$. The CMT and (2.10)
thus imply that $T^{-1/2}y_{[T\cdot]}^\mu \Rightarrow \omega\{W(\cdot) - \int W\}$, so that the demeaned I(1) process
converges to a demeaned Brownian motion. Similar arguments, with a bit more
algebra, apply to the detrended case. Summarizing these two results, if $u_t$ is a general
I(1) process so that $\Delta u_t = v_t$ where $v_t$ satisfies (2.1)-(2.3) and $Eu_0^2 < \infty$, then we have⁴

$$T^{-1/2}y_{[T\cdot]}^\mu \Rightarrow \omega W^\mu(\cdot), \quad \text{where } W^\mu(\lambda) = W(\lambda) - \int W, \quad (2.13a)$$

$$T^{-1/2}y_{[T\cdot]}^\tau \Rightarrow \omega W^\tau(\cdot), \quad \text{where } W^\tau(\lambda) = W(\lambda) - (4 - 6\lambda)\int W - (12\lambda - 6)\int sW(s)\,ds.$$
$$(2.13b)$$

Perron (1991c) provides expressions extending (2.13) to the residuals of an I(1)
process which has been detrended by an OLS regression onto a $p$th order
polynomial in time.
These results can be used to study the behavior of an I(1) process which has been
spuriously detrended, that is, regressed against $(1, t)$ when in fact $y_t$ is purely
stochastic. Because the sample $R^2$ is $1 - \{\sum_{t=1}^{T}(y_t^\tau)^2\}/\{\sum_{t=1}^{T}(y_t^\mu)^2\}$, the results (2.13) and
the CMT show that $R^2$ has the limit $R^2 \Rightarrow 1 - \{\int(W^\tau)^2/\int(W^\mu)^2\}$, which is positive
with probability one; that is, the regression $R^2$ is, asymptotically, a positive random
variable. It follows that the standard t-test for significance of $\beta_1$ rejects with
probability one asymptotically even though the true coefficient on time is zero
[Durlauf and Phillips (1988)].⁵ Next, consider the autocorrelogram of the detrended
process, $\hat\rho_\tau(\lambda) = \hat\gamma_\tau([T\lambda])/\hat\gamma_\tau(0) = h_3(T^{-1/2}y_{[T\cdot]}^\tau)$, say, where $\hat\gamma_\tau(j) = (T - |j|)^{-1} \times
\sum_{t=|j|+1}^{T} y_t^\tau y_{t-|j|}^\tau$. Because $h_3$ is a continuous mapping from $D[0,1]$ to $D[0,1]$, the
FCLT and CMT imply that $\hat\rho_\tau \Rightarrow \rho^*$, where $\rho^*(\lambda) = (1-\lambda)^{-1}\int_\lambda^1 W^\tau(s)W^\tau(s-\lambda)\,ds/
\int W^{\tau 2}$, $0 \leq \lambda \leq 1$. Thus, in particular, the first $k$ sample autocorrelations converge in
probability to one for $k$ fixed, although the limiting autocorrelogram eventually
declines towards zero. Nelson and Kang (1981) show, using other techniques, that
the autocorrelogram dips below zero, suggesting periodicity which spuriously arises
from the detrending of the I(1) process. The results here indicate that this is an
asymptotic phenomenon, when the lag of the autocorrelation is interpreted as a
fraction of the sample size.
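
The spurious-detrending phenomenon is easy to reproduce; the sketch below (our illustration, with arbitrary sample sizes) regresses simulated random walks on $(1, t)$ and shows the rejection rate of the nominal 5 percent t-test on the trend coefficient approaching one as $T$ grows.

import numpy as np

# Spurious detrending: regress a driftless random walk on (1, t) and record
# the absolute t-statistic on the time trend; rejections do not settle at 5%.
rng = np.random.default_rng(2)
for T in (100, 400, 1600):
    reject = 0
    for _ in range(2000):
        y = np.cumsum(rng.standard_normal(T))
        X = np.column_stack([np.ones(T), np.arange(1, T + 1)])
        b, res = np.linalg.lstsq(X, y, rcond=None)[:2]
        s2 = res[0] / (T - 2)                       # OLS error variance
        se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
        reject += abs(b[1] / se) > 1.96
    print(T, reject / 2000)                         # rejection rate -> 1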

Example 3. Cumulated detrended I(0) processes

Section 4 considers testing for a unit moving average root when it is maintained
that there is a deterministic trend. A statistic which arises in this context is the
⁴Derivations of $W^\tau$ are given in the proof of Theorem 5.1 of Stock and Watson (1988b) and in Park
and Phillips (1988); the result can also be derived from Theorem 2.1 of Durlauf and Phillips (1988). As
Park and Phillips (1988) demonstrate, $W^\tau$ can be thought of as detrended Brownian motion, the residual
of the projection of $W$ onto $(1, s)$.
⁵Phillips (1986) gives similar results for regressions with two independent random walks.

cumulation of an I(0) process which has been detrended by OLS. The asymptotics
of this process are also readily analyzed using the FCLT and the CMT.
For this example, suppose that $u_t$ is a general I(0) process and $u_t = v_t$, where $v_t$
satisfies (2.1)-(2.3). Consider the demeaned case, and define the statistic

$$Y_T^\mu(\lambda) = T^{-1/2}\sum_{s=1}^{[T\lambda]} y_s^\mu,$$

so

$$Y_T^\mu(\lambda) = T^{-1/2}\sum_{s=1}^{[T\lambda]}(u_s - \bar{u}) = T^{-1/2}\sum_{s=1}^{[T\lambda]} u_s - \frac{[T\lambda]}{T}\,T^{-1/2}\sum_{t=1}^{T} u_t.$$

Then (2.9) and the CMT yield the limit $Y_T^\mu \Rightarrow \omega B^\mu$, where $B^\mu(\lambda) = W(\lambda) - \lambda W(1)$.
The process $B^\mu$ is a standard Brownian bridge on the unit interval, so called because
it is a Brownian motion that is tied down to be zero at 0 and 1. Similarly, define
$Y_T^\tau(\lambda) = T^{-1/2}\sum_{s=1}^{[T\lambda]} y_s^\tau$; then $Y_T^\tau \Rightarrow \omega B^\tau$, where $B^\tau$ is a second-level Brownian
bridge on the unit interval, given by $B^\tau(\lambda) = W(\lambda) - \lambda W(1) + 6\lambda(1-\lambda)\{\tfrac12 W(1) - \int W\}$
[MacNeill (1978)]. Collecting these results, we have

$$T^{-1/2}\sum_{s=1}^{[T\cdot]} y_s^\mu \Rightarrow \omega B^\mu(\cdot), \quad B^\mu(\lambda) = W(\lambda) - \lambda W(1), \quad (2.14a)$$

$$T^{-1/2}\sum_{s=1}^{[T\cdot]} y_s^\tau \Rightarrow \omega B^\tau(\cdot), \quad B^\tau(\lambda) = W(\lambda) - \lambda W(1) + 6\lambda(1-\lambda)\{\tfrac12 W(1) - \int W\}.$$
$$(2.14b)$$

MacNeill (1978) extended these results to $k$th order polynomial detrending. Let
$Y_T^{(k)}(\lambda) = T^{-1/2}\sum_{s=1}^{[T\lambda]} y_s^{(k)}$ be the process of cumulated $k$th order detrended data,
where $y_t^{(k)}$ is the residual from the OLS regression of $y_t$ onto $(1, t, \ldots, t^k)$. Then
$Y_T^{(k)} \Rightarrow \omega B^{(k)}$, where $B^{(k)}$ is a $k$th level generalized Brownian bridge, expressions
for which are given by MacNeill (1978, eq. 8).
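
A brief simulation (ours, not from the chapter) illustrates (2.14a): the variance of the cumulated demeaned process at a fixed $\lambda$ should approach $\mathrm{Var}\,B^\mu(\lambda) = \lambda(1-\lambda)$ for i.i.d. data.

import numpy as np

# Cumulated demeaned I(0) data: T^{-1/2} sum_{s<=[T*lam]} (v_s - vbar)
# => B^mu(lam), a Brownian bridge, with Var B^mu(lam) = lam*(1 - lam).
rng = np.random.default_rng(3)
T, nrep, lam = 1000, 20000, 0.3
v = rng.standard_normal((nrep, T))
dev = v - v.mean(axis=1, keepdims=True)             # demeaned series
Y = dev[:, : int(T * lam)].sum(axis=1) / np.sqrt(T)
print(Y.var(), "should be close to", lam * (1 - lam))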

Example 4. Processes with an autoregressive root local to unity

One of the issues considered in Section 3 is the asymptotic properties of statistics
when the process is nearly I(1), in the sense that the largest root of the process is
local to unity. The starting point in these calculations is characterizing the large-
sample behavior of the series itself when the root is close to one. Let $u_t$ obey

$$u_t = \alpha u_{t-1} + v_t, \quad \text{where } \alpha = 1 + c/T \text{ and } Eu_0^2 < \infty. \quad (2.15)$$

This is the "local-to-unity" model considered (under various assumptions on the
disturbances $v_t$) by Bobkoski (1983), Cavanagh (1985), Chan and Wei (1987) and
Phillips (1987b). The treatment here follows Bobkoski (1983). In particular, we use
the method of proof of Bobkoski's (1983) Lemma 3.4 to generalize his local-to-unity
representations from his case of i.i.d. disturbances to general I(0) disturbances which
satisfy (2.1)-(2.3). As we shall see, this extension is a straightforward application of
the FCLT (2.9) and the CMT.
Use recursive substitution in (2.15) to write $u_t$ as

$$T^{-1/2}u_t = T^{-1/2}\sum_{s=1}^{t}\alpha^{t-s}v_s + T^{-1/2}\alpha^t u_0$$
$$= T^{-1/2}\sum_{s=1}^{t-1}(\alpha^{t-s} - 1)v_s + T^{-1/2}\sum_{s=1}^{t} v_s + T^{-1/2}\alpha^t u_0$$
$$= (\alpha - 1)\sum_{s=1}^{t-1}\alpha^{t-s-1}\Big(T^{-1/2}\sum_{r=1}^{s} v_r\Big) + T^{-1/2}\sum_{s=1}^{t} v_s + T^{-1/2}\alpha^t u_0$$
$$= \kappa_c(\sigma_\varepsilon\zeta_{vT})(t/T) + o_p(1). \quad (2.16)$$

The third equality in (2.16) obtains as an identity by noting that $\alpha^{t-s} - 1 =
(\alpha - 1)\sum_{j=0}^{t-s-1}\alpha^j$ and rearranging summations. The final equality obtains by noting
that $(1 + c/T)^{[T\lambda]} = \exp(c\lambda) + o(1)$ uniformly in $\lambda$, $0 \leq \lambda \leq 1$, and by defining
$\kappa_c(f)(\lambda) = c\int_0^\lambda e^{c(\lambda-s)}f(s)\,ds + f(\lambda)$, where $\zeta_{vT}$ is defined in Section 2.2. The $o_p(1)$
term in the final expression arises from the assumption $Eu_0^2 < \infty$, so $T^{-1/2}u_0 = o_p(1)$,
and from the approximation $(1 + c/T)^{[T\lambda]} \cong \exp(c\lambda)$.
As in the previous examples, $\kappa_c$ is a continuous functional, in this case from
$C[0,1]$ to $C[0,1]$. Using the FCLT (2.9) and the CMT we have

$$T^{-1/2}u_{[T\cdot]} \Rightarrow \kappa_c(\omega W)(\cdot) = \omega W_c(\cdot), \quad (2.17)$$

where $W_c(\lambda) = c\int_0^\lambda e^{c(\lambda-s)}W(s)\,ds + W(\lambda)$. The stochastic process $W_c$ is the solution
to the stochastic differential equation $dW_c(\lambda) = cW_c(\lambda)\,d\lambda + dW(\lambda)$ with $W_c(0) = 0$.
Thus, for $\alpha$ local-to-unity, $T^{-1/2}u_{[T\cdot]}$ converges to $\omega$ times a diffusion, or Ornstein-
Uhlenbeck, process.
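
The local-to-unity limit can be checked numerically; the sketch below (ours, with an arbitrary choice $c = -5$) compares the variance of $T^{-1/2}u_T$ with $\mathrm{Var}\,W_c(1) = (e^{2c} - 1)/2c$.

import numpy as np

# Local-to-unity AR(1): u_t = (1 + c/T) u_{t-1} + eps_t. Then T^{-1/2} u_T
# => W_c(1), an Ornstein-Uhlenbeck variate with variance (exp(2c) - 1)/(2c).
rng = np.random.default_rng(4)
T, c, nrep = 1000, -5.0, 20000
alpha = 1 + c / T
u = np.zeros(nrep)
for t in range(T):                       # iterate the AR(1) recursion
    u = alpha * u + rng.standard_normal(nrep)
stat = u / np.sqrt(T)
print(stat.var(), "should be close to", (np.exp(2 * c) - 1) / (2 * c))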

A remark on the interpretation of limiting functionals of Brownian motion. The
calculations of this section ended when the random variable of interest was shown
to have a limiting representation as a functional of Brownian motion or, in the local-
to-unity case of Example 4, a diffusion process. These representations show that the
limiting distribution exists; they indicate when a limiting distribution is nonstandard;
and, importantly, they show when and how nuisance parameters describing the
short-run dependence of $v_t$ enter the limiting distribution. Because $W$ is a Gaussian
process, the results occasionally yield simply-evaluated distributions. For example,
$W(1)$, $\int W$, and $\int sW$ have normal distributions. However, in most cases the limiting
distributions are nonstandard.
2756 J.H. Stock

This leads to the practical question of how to compute limiting distribution


functions, once one has in hand the limiting representation of the process as a
functional of Brownian motion. The simplest approach, both conceptually and in
terms of computer programming, is to evaluate the functional by Monte Carlo
simulation using discretized realizations of the underlying Brownian motions. This
is equivalent to generating pseudo-data j, from a Gaussian random walk with PO = 0
and with unit innovation variance and replacing W by its discretized realization.
For example, W(1) would be replaced by (T-j,) and W(.) would be replaced
by T- 12{j$T.l - T-lx,= ,$,}. For T sufficiently large, the FCLT ensures that the
limiting distribution of these pseudo-random variates converges to those of the
functionals of Brownian motion. The main disadvantage of this approach is that
high numerical accuracy requires many Monte Carlo repetitions. For this reason,
considerable effort has been devoted to alternative methods for evaluating some of
these limiting distributions. Because these techniques are specialized, they will not
be discussed in detail, although selected references are given in Section 2.4.
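
A minimal version of this simulation strategy (our sketch, following the description above) evaluates the quantiles of the functional $\tfrac12(W(1)^2 - 1)/\int W^2$, which arises in Section 3; the number of repetitions is arbitrary.

import numpy as np

# Evaluate quantiles of (W(1)^2 - 1)/(2 * int_0^1 W^2) by replacing W with
# the discretized Gaussian random walk described in the text.
rng = np.random.default_rng(5)
T, nrep = 1000, 20000
draws = np.empty(nrep)
for i in range(nrep):
    y = np.cumsum(rng.standard_normal(T))    # random walk, y_0 = 0
    W1 = y[-1] / np.sqrt(T)                  # replaces W(1)
    intW2 = (y ** 2).sum() / T ** 2          # replaces int_0^1 W(lam)^2 dlam
    draws[i] = 0.5 * (W1 ** 2 - 1) / intW2
print(np.percentile(draws, [5, 95]))         # roughly -8.1 and 1.28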

2.4. Generalizations and additional references

The model (2.1)-(2.3) provides a concise characterization of I(0) processes with
possibly infinitely many nuisance parameters describing the short-run dependence,
but this simplicity comes at the cost of assuming away various types of nonstation-
arity and heteroskedasticity which might be present in empirical applications. The
key result used in Section 2.3 and in the sections to follow is the FCLT, which
obtains under weaker conditions than stated here. The condition (2.2) is weakened
in Brown's (1971) FCLT, which uses the Lindeberg condition and admits un-
conditional heteroskedasticity which is asymptotically negligible, in the sense that
$T^{-1}\sum_{t=1}^{T} E\varepsilon_t^2 \to \sigma_\varepsilon^2$. The result (2.9) for linear processes can be obtained under
Brown's (1971) weaker conditions by modifying the argument in Section 2.2; see
Hall and Heyde (1980, Chapter 4) and Phillips and Solo (1992).
An alternative approach is to use mixing conditions, which permit an explicit
tradeoff between the number of moments and the degree of temporal dependence
in $v_t$ and which admit certain nonstationarities (which are asymptotically negligible
in the sense above). This approach was introduced to the unit roots literature by
Phillips (1987a), who used Herrndorf's (1984) mixing-condition FCLT, and much
of the recent unit roots literature uses these conditions. Phillips (1987b) derives the
local-to-unity result (2.17) using Herrndorf's (1984) mixing-condition FCLT.
An elegant approach to defining I(0) is simply to make the high-level assumption
that $v_t$ is I(0) if its partial sum process converges weakly to a constant times a
standard Brownian motion. Thus (2.9) is taken as the assumption rather than the
implication of (2.1)-(2.3). With additional conditions assuring convergence of
sample moments, such as sample autocovariances, this high-level assumption
provides a general definition of I(0), which automatically incorporates $v_t$ which
satisfy Herrndorf's (1984) FCLTs. The gain in elegance of this approach comes at
the cost of concreteness. However, the results in this chapter that rely solely on
the FCLT and CMT typically can be interpreted as holding under this alternative
definition.
The FCLT approach is not the only route to asymptotic results in this literature.
The approach used by Fuller (1976), Dickey and Fuller (1979) and Sargan and
Bhargava (1983a) was to consider the limiting behavior of quadratic forms such as
$\sum_{t=1}^{T} u_t^2$ expressed as $\eta' A_T\eta$, where $\eta$ is a $T \times 1$ standard normal variate; thus the
limiting behavior is characterized by the limiting eigenvalues of $A_T$. See Chan (1988)
and Saikkonen and Luukkonen (1993b) for discussions of computational issues
involved with this approach.
There is a growing literature on numerical evaluation of these asymptotic
distributions. In some cases, it is possible to obtain explicit expressions for moment
generating functions or characteristic functions which can be integrated numerically;
see White (1958, 1959), Evans and Savin (1981a), Perron (1989b, 1991a), Nabeya
and Tanaka (1990a, 1990b) and Tanaka (1990a). Finally, under normality, exact
finite-sample distributions can be computed using the Imhof method; see, for
example, Evans and Savin (1981b, 1984).

3. Unit autoregressive roots

This section examines inference concerning $\alpha$ in the model

$$y_t = d_t + u_t, \quad u_t = \alpha u_{t-1} + v_t, \quad t = 1, 2, \ldots, T, \quad (3.1)$$

where $\alpha$ is either close to or equal to one and $v_t$ is I(0) with spectral density at
frequency zero of $\omega^2/2\pi$. Unless explicitly stated otherwise, it is assumed that $u_0$
might be random, with $Eu_0^2 < \infty$, and that $v_t$ is a linear process satisfying (2.1)-(2.3).
The trend term $d_t$ will be specified as known up to a finite-dimensional parameter
vector $\beta$. The leading cases for the deterministic component are (i) no deterministic
term ($d_t = 0$); (ii) a constant ($d_t = \beta_0$); and (iii) a linear time trend ($d_t = \beta_0 + \beta_1 t$).
Extensions to higher-order polynomial trends or trends satisfying more general
conditions are typically straightforward and are discussed only briefly. Another
possibility is a piecewise-linear (or "broken") trend [Rappaport and Reichlin (1989),
Perron (1989a, 1990b)], a topic taken up in Section 5.
Most of the procedures for inference on $\alpha$ treat the unknown parameters in the
trend term $d_t$ as nuisance parameters, so that many of the statistics can be
represented generally in terms of detrended data. Throughout, $y_t^d$ denotes a general
detrended process with unspecified detrending. For specific types of detrending, we
adopt Dickey and Fuller's (1979) notation: $y_t^\mu$ denotes demeaned data and $y_t^\tau$ denotes
linearly detrended data when the detrending is by OLS as in (2.12).

The focus of this section is almost exclusively on the case in which there is at
most a single real unit root. This rules out higher orders of integration (two or more
real autoregressive unit roots) and seasonal unit roots (complex roots on the unit
circle). These topics have been omitted because of space limitations. However, the
techniques used here extend to these other cases. References on estimation and
testing with seasonal unit roots include Hasza and Fuller (1981), Dickey et al. (1984),
Chan and Wei (1988), Ghysels (1990), Jeganathan (1991), Ghysels and Perron
(1993), Hylleberg et al. (1990), Diebold (1993) and Beaulieu and Miron (1993). See
Banerjee et al. (1992a) for an overview.
In the area of testing when there might be two or more unit roots, an important
practical lesson from the theoretical literature is that a "downward" testing
procedure (starting with the greatest plausible number of unit roots) is consistent,
while an "upward" testing procedure (starting with a test for a single unit root) is
not. This was shown for F-type tests in the no-deterministic case by Pantula (1989).
Based on simulation evidence, Dickey and Pantula (1987) recommend a downward-
testing, sequential t-test procedure. Pantula (1989) proves that the distribution of
the relevant t-statistic under each null has the standard Dickey-Fuller (1979)
distribution. Also, Hasza and Fuller (1979) provide distribution theory for testing
two versus zero unit roots in an autoregression.

3.1. Point estimation

The four main qualitative differences between regressions with I(1) and I(0)
regressors are that, in contrast to the case of I(0) regressors, inference on certain
linear combinations of regression coefficients is nonstandard, with: (i) estimators
which are consistent at rate $T$ rather than at the usual rate $\sqrt{T}$; (ii) limiting
distributions of estimators and test statistics which are often nonstandard and have
nonzero means; (iii) estimators which are consistent even if the regression misspecifies
the short-run dynamics, although, in this case, the limiting distributions change;
and (iv) limiting distributions which depend on both the true and estimated trend
specifications.
These differences between I(0) and I(1) regressors can be seen by examining the OLS
estimator of $\alpha$ in (3.1). First, consider the no-deterministic case, so $\hat\alpha =
\sum_{t=2}^{T} y_t y_{t-1}/\sum_{t=2}^{T} y_{t-1}^2$. When $|\alpha| < 1$ and $v_t = \varepsilon_t$, conventional $\sqrt{T}$ asymptotics apply and $\hat\alpha$ has
a normal limiting distribution,

$$\text{if } |\alpha| < 1, \quad T^{1/2}(\hat\alpha - \alpha) \xrightarrow{d} N(0, 1 - \alpha^2), \quad (3.2)$$

which was derived by Mann and Wald (1943) under the assumptions that $\varepsilon_t$ is i.i.d.
and all the moments of $\varepsilon_t$ exist.
In contrast, suppose that the true value of $\alpha$ is 1, and let $v_t$ follow the general
linear process (2.1)-(2.3). Then the OLS estimator can be rewritten,

$$T(\hat\alpha - 1) = \frac{T^{-1}\sum_{t=2}^{T}\Delta y_t y_{t-1}}{T^{-2}\sum_{t=2}^{T} y_{t-1}^2} = \frac{\tfrac12\Big\{T^{-1}(y_T^2 - y_1^2) - T^{-1}\sum_{t=2}^{T}(\Delta y_t)^2\Big\}}{T^{-2}\sum_{t=2}^{T} y_{t-1}^2}, \quad (3.3)$$

where the second line uses the identity $y_T^2 - y_1^2 = 2\sum_{t=2}^{T}\Delta y_t y_{t-1} + \sum_{t=2}^{T}(\Delta y_t)^2$.
Although the conditions for Mann and Wald's result do not apply here because of
the unit autoregressive root, an asymptotic result for $T(\hat\alpha - 1)$ nonetheless can be
obtained using the FCLT (2.9) and the CMT. Because $Eu_0^2 < \infty$, $T^{-1/2}y_1 =
T^{-1/2}(u_0 + v_1) \xrightarrow{p} 0$. Thus, because $y_t = u_0 + \sum_{s=1}^{t} v_s$, by (2.9) and the CMT, we have
$T^{-1/2}y_T \Rightarrow \omega W(1)$ and $T^{-2}\sum_{t=2}^{T} y_{t-1}^2 \Rightarrow \omega^2\int W^2$. Also, $T^{-1}\sum_{t=2}^{T}(\Delta y_t)^2 = \hat\gamma_{\Delta y}(0) \xrightarrow{p}
\gamma_{\Delta y}(0) = \gamma_v(0)$. Thus,

$$\text{if } \alpha = 1, \quad T(\hat\alpha - 1) \Rightarrow \frac{\tfrac12\{W(1)^2 - \kappa\}}{\int W^2}, \quad \text{where } \kappa = \frac{\gamma_v(0)}{\omega^2}. \quad (3.4a)$$

This expression was first obtained by White (1958) in the AR(1) model with $\kappa = 1$
(although his result was in error by a factor of $\sqrt{2}$) and by Phillips (1987a) for
general $\kappa$.
An alternative expression for this limiting result obtains by using the continuous-
time analogue of the identity used to obtain the second line of (3.3), namely
$\int W\,dW = \tfrac12(W(1)^2 - 1)$ [Arnold (1973, p. 76)]; thus,

$$\text{if } \alpha = 1, \quad T(\hat\alpha - 1) \Rightarrow \frac{\int W\,dW - \tfrac12(\kappa - 1)}{\int W^2}. \quad (3.4b)$$

This result can also be obtained from the first line in (3.3) by applying Theorem 2.4
of Chan and Wei (1988).
The results (3.2) and (3.4) demonstrate the first three of the four main differences
between I(0) and I(1) asymptotics. First, the OLS estimator of $\alpha$ is superconsistent,
converging at rate $T$ rather than $\sqrt{T}$. While initially surprising, this has an intuitive
interpretation: if the true value of $\alpha$ is less than one, then in expectation the mean
squared error $E(y_t - \alpha y_{t-1})^2$ is minimized at the true value of $\alpha$ but remains finite
for other values of $\alpha$. In contrast, if $\alpha$ is truly 1, then $E(\Delta y_t)^2$ is finite but, for any
fixed value of $\alpha \neq 1$, $(1 - \alpha L)y_t = \Delta y_t + (1 - \alpha)y_{t-1}$ has an integrated component;
thus the OLS objective function $T^{-1}\sum_{t=2}^{T}(y_t - \alpha y_{t-1})^2$ is finite, asymptotically, for
$\alpha = 1$ but tends to infinity for fixed $\alpha \neq 1$. An alternative intuitive interpretation of
this result is that the variance of the usual OLS estimator depends on the sampling
variability of the regressors, here $\sum_{t=2}^{T} y_{t-1}^2$; but, because $y_t$ is I(1), this sum is $O_p(T^2)$
rather than the conventional rate $O_p(T)$.
Second, the limiting distribution in (3.4) is nonstandard. While the marginal
distribution of $W(1)^2$ is $\chi_1^2$, the distribution of the ratio in (3.4a) does not have a
simple form. This distribution has been extensively studied. In the leading case that
$v_t$ is serially uncorrelated, then $\omega^2 = \gamma_v(0)$ so that $\kappa = 1$ and (3.4a) becomes
$\tfrac12(W(1)^2 - 1)/\int W^2$ [and (3.4b) becomes $\int W\,dW/\int W^2$]. This distribution was tabulated
by Dickey (1976) and reproduced in Fuller (1976, Table 8.5.1). The distribution is
skewed, with asymptotic lower and upper 5 percent quantiles of $-8.1$ and $1.28$.
Third, $\hat\alpha$ is consistent for $\alpha$ even though the regression of $y_t$ onto $y_{t-1}$ is
misspecified, in the sense that the error term $v_t$ is serially correlated and correlated
with (differences of) the regressor. This misspecification affects the limiting distribu-
tion in an intuitive way. Use the definition of $\kappa$ to write

$$\tfrac12(\kappa - 1) = \frac{\gamma_v(0) - \omega^2}{2\omega^2} = -\frac{\sum_{j=1}^{\infty}\gamma_v(j)}{\omega^2}.$$

Because $\omega^2$ can be thought of as the long-run variance of $v_t$, $\tfrac12(\kappa - 1)$ represents the
correlation between the error and the regressor, which enters as a shift in the
numerator of the limiting representation but does not introduce inconsistency. This
term can increase the bias of $\hat\alpha$ in finite samples. Although this bias decreases at the
rate $T^{-1}$, when $v_t$ is negatively serially correlated, so that $\tfrac12(\kappa - 1)$ is positive, in
sample sizes often encountered in practice this bias can be large. For example, if $v_t$
follows the MA(1) process $v_t = (1 - \theta L)\varepsilon_t$, then $\tfrac12(\kappa - 1) = \theta/(1 - \theta)^2$, so for $\theta = 0.8$,
$\tfrac12(\kappa - 1) = 20$.
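
The size of this effect is easily seen by simulation; the sketch below (ours, not from the chapter) draws $\Delta y_t = (1 - 0.8L)\varepsilon_t$, for which $\tfrac12(\kappa - 1) = 20$, and shows that $T(\hat\alpha - 1)$ is then centered far below zero.

import numpy as np

# Bias from serial correlation: Delta y_t = eps_t - theta*eps_{t-1} with
# theta = 0.8 gives (kappa - 1)/2 = theta/(1 - theta)^2 = 20, which shifts
# the numerator of the limit of T*(alpha_hat - 1) down by 20.
rng = np.random.default_rng(6)
T, theta, nrep = 500, 0.8, 5000
stats = []
for _ in range(nrep):
    eps = rng.standard_normal(T + 1)
    v = eps[1:] - theta * eps[:-1]           # MA(1) first differences
    y = np.cumsum(v)                         # the I(1) level series
    num = (np.diff(y) * y[:-1]).sum()        # sum of Dy_t * y_{t-1}
    den = (y[:-1] ** 2).sum()                # sum of y_{t-1}^2
    stats.append(T * num / den)              # T * (alpha_hat - 1)
print(np.mean(stats))   # large and negative: the -(kappa - 1)/2 shift at work
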
To examine the fourth general feature of regressions with I(1) variables, the
dependence of limiting distributions on trend specifications, consider the case that
$d_t = \beta_0 + \beta_1 t$. Substitute this into (3.1), transform both sides of (3.1) by $(1 - \alpha L)$, and
thus write

$$y_t = \delta_0 + \delta_1 t + \alpha y_{t-1} + v_t, \quad t = 1, 2, \ldots, T, \quad (3.5)$$

where $\delta_0 = (1 - \alpha)\beta_0 + \alpha\beta_1$ and $\delta_1 = (1 - \alpha)\beta_1$. If both $\beta_0$ and $\beta_1$ are unrestricted,
(3.5) suggests estimating $\alpha$ by the regression of $y_t$ onto $(1, t, y_{t-1})$; if $\beta_1$ is restricted
a priori to be zero, then $\alpha$ can be estimated by regressing $y_t$ onto $(1, y_{t-1})$. Consider,
for the moment, the latter case in which $d_t = \beta_0$ where $\beta_0$ is unknown. Then, by the
algebra of least squares, the OLS estimator of $\alpha$, $\hat\alpha^\mu$, can be written (after centering
and scaling) as

$$T(\hat\alpha^\mu - 1) = \frac{T^{-1}\sum_{t=2}^{T}\Delta y_t y_{t-1}^\mu}{T^{-2}\sum_{t=2}^{T}(y_{t-1}^\mu)^2} = \frac{\tfrac12\Big\{T^{-1}\big((y_T^\mu)^2 - (y_1^\mu)^2\big) - T^{-1}\sum_{t=2}^{T}(\Delta y_t)^2\Big\}}{T^{-2}\sum_{t=2}^{T}(y_{t-1}^\mu)^2}, \quad (3.6)$$

where $y_{t-1}^\mu = y_{t-1} - (T - 1)^{-1}\sum_{t=2}^{T} y_{t-1}$.


The method for obtaining a limiting representation of $T(\hat\alpha^\mu - 1)$ under the
hypothesis that $\alpha = 1$ is analogous to that used for $T(\hat\alpha - 1)$, namely, to use the FCLT
to obtain a limiting representation for $T^{-1/2}y_{t-1}^\mu$ and then to apply the continuous
mapping theorem. Expression (2.13a) provides the needed limiting result for the
demeaned levels process of the data; applying this to (3.6) yields

$$T(\hat\alpha^\mu - 1) \Rightarrow \tfrac12\{W^\mu(1)^2 - W^\mu(0)^2 - \kappa\}\Big/\int W^{\mu 2} = \Big\{\int W^\mu\,dW - \tfrac12(\kappa - 1)\Big\}\Big/\int W^{\mu 2}, \quad (3.7)$$

where the second representation is obtained using $W^\mu(0) = -\int W$.
The detrended case can be handled the same way. Let $\hat\alpha^\tau$ denote the estimator of
$\alpha$ obtained from estimating (3.5) including both the constant and time as regressors.
Then $T(\hat\alpha^\tau - 1)$ can be written in the form (3.6), with $y_t^\tau$ replacing $y_t^\mu$. The application
of (2.13b) to this modification of (3.6) yields the limiting representation

$$T(\hat\alpha^\tau - 1) \Rightarrow \tfrac12\{W^\tau(1)^2 - W^\tau(0)^2 - \kappa\}\Big/\int W^{\tau 2} = \Big\{\int W^\tau\,dW - \tfrac12(\kappa - 1)\Big\}\Big/\int W^{\tau 2}. \quad (3.8)$$

Because the distributions of $W$, $W^\mu$ and $W^\tau$ differ, so do the distributions in (3.4),
(3.7) and (3.8). When $v_t = \varepsilon_t$ so that $\kappa = 1$, the distribution of $T(\hat\alpha^\mu - 1)$ is skewed and
sharply shifted to the left, with asymptotic lower and upper 5 percent quantiles of
$-14.1$ and $-0.13$. With linear detrending, the skewness is even more pronounced,
with 5 percent quantiles of $-21.8$ and $-2.66$. This imparts a substantial bias to the
estimates of $\alpha$: for example, with $T = 100$ and $v_t = \varepsilon_t$, the mean of $\hat\alpha^\tau$, based on the
asymptotic approximation (3.8), is 0.898.
Another feature of regression with I(1) regressors is that, when the regression
contains both I(1) and I(0) regressors, estimators of coefficients (and their associated
test statistics) on the I(1) and I(0) regressors, in a suitably transformed regression,
are asymptotically independent. This is illustrated here in the AR(p) model with a
unit root as analyzed by Fuller (1976). General treatments of regressions with
integrated regressors in multiple time series models are given by Chan and Wei
(1988), Park and Phillips (1988) and Sims et al. (1990). When $v_t$ has nontrivial
short-run dynamics so that $\omega^2 \neq \gamma_v(0)$, an alternative approach to estimating $\alpha$ is
to approximate the dynamics of $v_t$ by a $p$th order autoregression, $a(L)v_t = e_t$. In the
time-trend case, this leads to the OLS estimator, $\hat\alpha$, from the regression

$$\Delta y_t = \delta_0 + \delta_1 t + (\alpha - 1)y_{t-1} + \sum_{j=1}^{p} a_j\Delta y_{t-j} + e_t, \quad t = 1, 2, \ldots, T. \quad (3.9)$$

If $v_t$ in fact follows an AR(p), then $e_t = \varepsilon_t$ and (3.9) is correctly specified. To simplify
the calculation, consider the special case of no deterministic terms, conditional
homoskedasticity and $p = 1$, so that $y_t$ is regressed on $(y_{t-1}, \Delta y_{t-1})$. Define
$\Upsilon_T = \mathrm{diag}(T^{1/2}, T)$, let $a = (a_1, \alpha - 1)'$, let $z_{t-1} = (\Delta y_{t-1}, y_{t-1})'$, and let $\hat{a}$ be the
OLS estimator of $a$. Then

$$\Upsilon_T(\hat{a} - a) = \Big(\Upsilon_T^{-1}\sum_{t=2}^{T} z_{t-1}z_{t-1}'\Upsilon_T^{-1}\Big)^{-1}\Big(\Upsilon_T^{-1}\sum_{t=2}^{T} z_{t-1}\varepsilon_t\Big).$$

A direct application of the FCLT and the CMT shows that $\Upsilon_T^{-1}\sum_{t=2}^{T} z_{t-1}z_{t-1}'\Upsilon_T^{-1} \Rightarrow
\mathrm{diag}(\gamma_{\Delta y}(0), \omega^2\int W^2)$, where $\omega = \sigma_\varepsilon/(1 - a(1))$ (in the special case $p = 1$, $a(1) = a_1$).
Similarly,

$$\Upsilon_T^{-1}\sum_{t=2}^{T} z_{t-1}\varepsilon_t = \Big(T^{-1/2}\sum_{t=2}^{T}\Delta y_{t-1}\varepsilon_t,\ T^{-1}\sum_{t=2}^{T} y_{t-1}\varepsilon_t\Big)'.$$

Because $\Delta y_{t-1}\varepsilon_t$ is a martingale difference sequence which satisfies the conditions
of Theorem 1, $T^{-1/2}\sum_{t=2}^{T}\Delta y_{t-1}\varepsilon_t \Rightarrow \eta^*$, where $\eta^* \sim N(0, \sigma_\varepsilon^2 E(\Delta y_t)^2)$. Direct application of Chan
and Wei's (1988) Theorem 2.4 (or, alternatively, an algebraic rearrangement of the
type leading to (3.4b)) implies that the second term has the limit $\sigma_\varepsilon\omega\int W\,dW$. From
Theorem 2.2 of Chan and Wei (1988), this convergence is joint and moreover
$(W, \eta^*)$ are independent. Upon completing the calculation, one obtains the result

$$\{T^{1/2}(\hat{a}_1 - a_1),\ T(\hat\alpha - 1)\} \Rightarrow \Big\{N(0, 1 - a_1^2),\ (\sigma_\varepsilon/\omega)\int W\,dW\Big/\int W^2\Big\}, \quad (3.10)$$

where asymptotically the two terms are independent. The joint asymptotic distri-
bution of $\{T^{1/2}(\hat{a}_1 - a_1), T(\hat\alpha - 1)\}$ was originally obtained by Fuller (1976) using
different techniques. The result extends to values of $p > 1$ and to more general time
trends. For example, in the AR(p) model with a constant and a linear time trend,
$T(\hat\alpha - 1) \Rightarrow (\sigma_\varepsilon/\omega)\int W^\tau\,dW\Big/\int(W^\tau)^2$.
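
For reference, a compact implementation of the regression (3.9) is sketched below (ours; the helper name adf_regression is hypothetical, and the trend and lag handling follow one of several reasonable conventions).

import numpy as np

# OLS estimation of the augmented regression (3.9) with a constant, a time
# trend and p lagged differences; returns T*(alpha_hat - 1) and its t-statistic.
def adf_regression(y, p):
    dy = np.diff(y)
    T = len(dy)
    X = np.column_stack(
        [np.ones(T - p), np.arange(p + 1, T + 1), y[p:-1]]      # 1, t, y_{t-1}
        + [dy[p - j:T - j] for j in range(1, p + 1)]            # Dy_{t-j}
    )
    yy = dy[p:]
    b = np.linalg.solve(X.T @ X, X.T @ yy)
    e = yy - X @ b
    s2 = (e @ e) / (len(yy) - X.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[2, 2])
    return len(y) * b[2], b[2] / se       # T*(alpha_hat - 1), t-statistic

rng = np.random.default_rng(7)
y = np.cumsum(rng.standard_normal(500))   # simulated data under the I(1) null
print(adf_regression(y, p=4))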

Ordinary least squares estimation is not, of course, the only way to estimate $\alpha$.
Interestingly, the asymptotic distribution is sensitive to seemingly minor changes
in the estimator. Consider, for example, Dickey et al.'s (1984) symmetric least
squares estimator in the no-deterministic case,

$$\hat\alpha_s = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=2}^{T-1} y_t^2 + \tfrac12(y_1^2 + y_T^2)}. \quad (3.11a)$$

Straightforward algebra and an application of the FCLT reveals that

$$T(\hat\alpha_s - 1) = \frac{-\tfrac12 T^{-1}\sum_{t=2}^{T}(\Delta y_t)^2}{T^{-2}\Big\{\sum_{t=2}^{T-1} y_t^2 + \tfrac12(y_1^2 + y_T^2)\Big\}} \Rightarrow -\frac{\tfrac12\kappa}{\int W^2}, \quad (3.11b)$$

so that $T(\hat\alpha_s - 1)$ is negative with probability one.


If point estimates are of direct interest, then the bias in the usual OLS estimator
can be a problem. For example, if one's object is forecasting, then the use of a biased
estimator of $\alpha$ will result in median-biased conditional forecasts of the stochastic
component.⁶ This has led to the development of median-unbiased estimators of $\alpha$.
This problem is closely related to the construction of confidence intervals for $\alpha$ and
is taken up in Section 3.3.

3.2. Hypothesis tests

3.2.1. Test of $\alpha = 1$ in the Gaussian AR(1) model

The greatest amount of research effort concerning autoregressive unit roots, both
empirical and theoretical, has been devoted to testing for a unit root. Because of
the large number of tests available, a useful starting point is the no-deterministic
i.i.d. Gaussian AR(1) model,

y_t = αy_{t−1} + ε_t,   ε_t i.i.d. N(0, σ²),   t = 1, 2, …, T,   (3.12)

⁶When |α| < 1 and α is fixed, α̂ is also biased towards zero. In the Gaussian AR(1) model with d_t = 0,
Hurwicz (1950) derives the approximation Eα̂ = {(T² − 2T + 3)/(T² − 1)}α for α close to zero. When α
is close to one, Hurwicz's approximation breaks down but the distribution becomes well-approximated
using the local-to-unity approximations discussed in the next section, and the downward bias remains.
Approximate biases in the stationary constant model are given by Marriott and Pope (1954). For an
application of these bias expressions, see Rudebusch (1993). Also see Magnus and Pesaran (1991) and
Stine and Shaman (1989).

where y₀ = 0. Suppose further that σ² is known, in which case we can set σ² = 1.
Because there is only one unknown parameter, α, the Neyman–Pearson lemma can
be used to construct the most powerful test of the null hypothesis α = α₀ vs.
the point alternative α = ᾱ. The likelihood function is proportional to L(α) =
k_y exp(−½(α − α̂)² Σ_{t=2}^{T} y²_{t−1}), where k_y does not depend on α and α̂ =
Σ_{t=2}^{T} y_t y_{t−1}/Σ_{t=2}^{T} y²_{t−1} is the OLS estimator. The Neyman–Pearson
test of α = 1 vs. α = ᾱ rejects if L(ᾱ)/L(1) is sufficiently large; after some
manipulation, this yields a critical region of the form

[T(ᾱ − 1)]²T^{−2} Σ_{t=2}^{T} y²_{t−1} − 2T(ᾱ − 1)T^{−1} Σ_{t=2}^{T} y_{t−1}Δy_t ≤ k,   (3.13)

where k is a constant.
The implication of (3.13) is that the most powerful test of α = 1 vs. α = ᾱ is a linear
combination of two statistics, with weights that depend on ᾱ. It follows that, even
in this simplified problem, there is no uniformly most powerful (UMP) test of α = 1
vs. |α| < 1. This difficulty is present even asymptotically: suppose that the alternative
of interest is local to one in the sense (2.15), so that ᾱ = 1 + c̄/T, where c̄ is a fixed
constant. Then, T(ᾱ − 1) = c̄. Under the null,

T^{−2} Σ_{t=2}^{T} y²_{t−1} ⇒ ∫₀¹W²,   2T^{−1} Σ_{t=2}^{T} y_{t−1}Δy_t ⇒ W(1)² − 1,

so both terms in (3.13) are O_p(1). Thus, there is no single candidate test which
dominates on theoretical grounds, either in finite samples [Anderson (1948), Dufour
and King (1991)] or asymptotically.
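The following small simulation sketch (illustrative code) checks that the two components of (3.13) are indeed O_p(1) under the null, with means near the limits E∫W² = 1/2 and E[W(1)² − 1] = 0:

    import numpy as np

    rng = np.random.default_rng(2)
    T, nrep = 500, 2000
    q1 = np.empty(nrep)     # T^{-2} sum y_{t-1}^2       => int_0^1 W^2
    q2 = np.empty(nrep)     # 2T^{-1} sum y_{t-1} dy_t   => W(1)^2 - 1
    for r in range(nrep):
        y = np.cumsum(rng.standard_normal(T))
        q1[r] = np.sum(y[:-1] ** 2) / T**2
        q2[r] = 2.0 * np.sum(y[:-1] * np.diff(y)) / T
    print(q1.mean(), q2.mean())    # approximately 0.5 and 0.0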
From the perspective of empirical work, the model (3.12) is overly restrictive
because of the absence of deterministic components and because the errors are
assumed to be i.i.d. The primary objective of the large literature on tests for unit
autoregressive roots has, therefore, been to propose tests that have three characteris-
tics: first, the test is asymptotically similar under the general I(1) null, in the sense
that the null distribution depends on neither the parameters of the trend process
(assuming the trend has been correctly specified) nor the nuisance parameters
describing the short-run dynamics of u_t; second, it has good power in large samples;
and third, it exhibits small size distortions and good power over a range of
empirically plausible models and sample sizes. The next three subsections, respec-
tively, summarize the properties of various unit root tests in terms of these three
characteristics.

⁷This draws on Rothenberg (1990). Manipulation of (3.13) shows that the Dickey–Fuller (1979) ρ̂ test,
which rejects if T(α̂ − 1) < k (where k < 0 for conventional significance levels), is efficient against c̄ = 2k,
although this does not extend to the demeaned or detrended cases. We thank Thomas Rothenberg and
Pentti Saikkonen for pointing this out.

3.2.2. Tests of the general I(1) null

This subsection describes the basic ideas used to generalize tests from the AR(1)
model to the general I(1) null by examining four sets of tests in detail. Some other
tests of the general I(1) null are then briefly mentioned.
If u_t follows an AR(p) and d_t = β₀ + β₁t, then the regression (3.9) serves as a basis
for two tests proposed by Dickey and Fuller (1979): a test based on the t-statistic
testing α = 1, τ̂, and a test based on ρ̂ = (ω̂_AR/σ̂_ε)T(α̂ − 1), where ω̂²_AR is the
autoregressive spectral density estimator (the AR estimator of ω²),

ω̂²_AR = σ̂²_ε / (1 − Σ_{j=1}^{p} â_j)²,   (3.14)

where (α̂, â₁, …, â_p) are the OLS estimators and σ̂²_ε is the OLS residual variance
from (3.9), modified, respectively, to omit t or (1, t) as regressors in the d_t = β₀ or
d_t = 0 cases. In the time-trend case, under the null hypothesis α = 1 and the
maintained AR(p) hypothesis, the limiting representations of these statistics are

ρ̂ ⇒ ∫₀¹W^τ dW / ∫₀¹(W^τ)²,   τ̂ ⇒ ∫₀¹W^τ dW / [∫₀¹(W^τ)²]^{1/2},   (3.15)

neither of which depends on nuisance parameters. Thus these statistics form the
basis for an asymptotically similar test of the unit root hypothesis in the AR(p)/time-
trend model. Their distributions have come to be known as the Dickey–Fuller
(1979) ρ̂ and τ̂ distributions and are tabulated in Fuller (1976), Tables 8.5.1
and 8.5.2, respectively. In the constant-only case (d_t = β₀), the only modification is
that t is dropped as a regressor from (3.9) and W^μ replaces W^τ in (3.15). In the no-
deterministic case, the intercept is also dropped from (3.9) and W replaces W^τ.
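A minimal sketch of the two Dickey–Fuller statistics computed from (3.9) in the time-trend case (illustrative code; the lag order and all names are ours):

    import numpy as np

    def dickey_fuller(y, p):
        T = len(y)
        dy = np.diff(y)
        X = [np.ones(T - 1 - p), np.arange(p + 1, T), y[p:-1]]
        X += [dy[p - j: T - 1 - j] for j in range(1, p + 1)]  # lagged Delta y
        X = np.column_stack(X)
        yy = dy[p:]
        b, *_ = np.linalg.lstsq(X, yy, rcond=None)
        e = yy - X @ b
        s2 = e @ e / (len(yy) - X.shape[1])
        V = s2 * np.linalg.inv(X.T @ X)
        tau = b[2] / np.sqrt(V[2, 2])            # t-statistic on y_{t-1}
        rho = T * b[2] / (1.0 - b[3:].sum())     # = (omega_AR/sigma) T(alpha-1)
        return rho, tau

    rng = np.random.default_rng(3)
    y = np.cumsum(rng.standard_normal(200))
    print(dickey_fuller(y, p=4))   # compare with the Fuller (1976) tables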
In an important extension of Fuller (1976) and Dickey and Fuller (1979), Said and
Dickey (1984) used Berk's (1974) results for AR(∞) I(0) autoregressions to analyze
the case that u_t follows a general ARMA(p, q) process with unknown p, q. In this
case, the true autoregressive order is infinite, so the regression (3.9) is misspecified.
If, however, the autoregressive order p_T increases with the sample size (specifically,
p_T → ∞, p_T³/T → 0), then Said and Dickey (1984) showed that the results (3.15)
continue to hold. Thus the Dickey–Fuller/Said–Dickey tests have a nonparametric
interpretation, in the sense that they are valid under the general I(1) null with
weak conditions on the dynamics of u_t.⁸

⁸Berk's (1974) conditions on a(L) (in the notation of (2.1)) are less restrictive than Said and Dickey's
(1984) assumption that v_t obeys an ARMA(p, q). For a related discussion and extension to the
multivariate case, see Lewis and Reinsel (1985) and especially Saikkonen (1991).

Alternative tests were proposed by Phillips (1987a) and Phillips and Perron
(1988). They recognized that if κ in (3.4) were consistently estimable, then T(α̂ − 1)
from the misspecified AR(1) model could be adjusted so that it would be asymptotically
similar. This reasoning led Phillips (1987a) to propose the corrected
statistics

Z_α = T(α̂ − 1) + ½(κ̂ − 1)ω̂² / (T^{−2} Σ_{t=2}^{T} y²_{t−1}) ⇒ ∫₀¹W dW / ∫₀¹W²,   (3.16a)

Z_t = κ̂^{1/2}τ̂ + ½(κ̂ − 1)ω̂ / (T^{−2} Σ_{t=2}^{T} y²_{t−1})^{1/2} ⇒ ∫₀¹W dW / [∫₀¹W²]^{1/2},   (3.16b)

where κ̂ and ω̂² are consistent estimators of κ and ω², and where τ̂ is the t-statistic
testing for a unit root in the OLS regression of y_t onto y_{t−1}. Phillips and Perron
(1988) extended these statistics to the constant and time-trend cases by replacing
the regression of y_t onto y_{t−1} with a regression of y_t onto (1, y_{t−1}) in the constant
case or, in the linear-trend case, onto (1, t, y_{t−1}). The limiting distributions for these
two cases are as in (3.16), with W^μ or W^τ, respectively, replacing W.
Because ½(κ − 1) = (γ_u(0) − ω²)/2ω², the estimation of the correction entails the
estimation of ω². Phillips (1987a) and Phillips and Perron (1988) suggested
estimating ω² using a sum-of-covariances (SC) spectral estimator (the SC estimator
of ω²),

ω̂²_SC = Σ_{m=−l_T}^{l_T} k(m/l_T) γ̂_v̂(m),   (3.17)

where γ̂_v̂(m) = (T − |m| − 1)^{−1} Σ_{t=|m|+2}^{T} (v̂_t − v̄)(v̂_{t−|m|} − v̄), k(·) is a kernel weighting
function and v̂_t is the residual from the regression of y_t onto y_{t−1}, (1, y_{t−1}) or (1, t, y_{t−1})
in the no-deterministic, constant or linear-trend cases, respectively. The appropriate
choice of kernel ensures that ω̂²_SC > 0 [Newey and West (1987), Andrews (1991)];
Phillips (1987a) and Phillips and Perron (1988) suggested using Bartlett (linearly
declining) weights. If l_T increases to infinity at a suitable rate [e.g. l_T⁴/T → 0 from
Phillips (1987a, Theorem 4.2); see Andrews (1991) for optimal rates], then ω̂²_SC →_p ω²
as required. Like the Said–Dickey tests, the Phillips/Phillips–Perron tests thus
provide a way to test the general (nonparametric) I(1) null.
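A minimal sketch of the Z_α correction (3.16a) in the no-deterministic case, using Bartlett weights for ω̂²_SC as in (3.17) (illustrative code; the fixed truncation lag is ours rather than the automatic rule discussed in Section 3.2.4):

    import numpy as np

    def z_alpha(y, lT):
        T = len(y)
        alpha_hat = (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])
        v = y[1:] - alpha_hat * y[:-1]            # residuals v_hat_t
        gamma0 = v @ v / len(v)
        omega2 = gamma0
        for m in range(1, lT + 1):                # Bartlett (linearly declining)
            w = 1.0 - m / (lT + 1.0)
            omega2 += 2.0 * w * (v[m:] @ v[:-m]) / len(v)
        denom = np.sum(y[:-1] ** 2) / T**2
        return T * (alpha_hat - 1.0) - 0.5 * (omega2 - gamma0) / denom

    rng = np.random.default_rng(4)
    e = rng.standard_normal(301)
    u = e[1:] - 0.5 * e[:-1]       # MA(1) short-run dynamics, omega^2 != gamma_u(0)
    y = np.cumsum(u)               # general I(1) null
    print(z_alpha(y, lT=8))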
A third test for the general I(1) null can be obtained by generalizing a statistic
derived by Sargan and Bhargava (1983a) in the no-trend and constant cases and
extended to the time-trend case by Bhargava (1986). They used Anderson's (1948)
approximation to the inverse of the covariance matrix to derive the locally most
powerful invariant (LMPI) test [although the test is not LMPI if the true inverse
is used, as pointed out by Nabeya and Tanaka (1990b)]. The Sargan–Bhargava
statistics, R̂_T, R̂_T^μ and R̂_T^τ, are

R̂_T = T^{−2} Σ_{t=1}^{T} y_t² / [T^{−1} Σ_{t=2}^{T} (Δy_t)²],   (3.18a)

R̂_T^μ = T^{−2} Σ_{t=1}^{T} (y_t^μ)² / [T^{−1} Σ_{t=2}^{T} (Δy_t)²],   (3.18b)

R̂_T^τ = T^{−2} Σ_{t=1}^{T} (y_t^B)² / [T^{−1} Σ_{t=2}^{T} (Δy_t)²],   (3.18c)

where y_t^μ = y_t − T^{−1} Σ_{t=1}^{T} y_t and y_t^B = y_t − β̃₀ − β̃₁t, where β̃₀ = T^{−1} Σ_{t=1}^{T} y_t −
[(T + 1)/2(T − 1)](y_T − y₁) and β̃₁ = (y_T − y₁)/(T − 1). Note that β̃₁ is the maximum
likelihood estimator (MLE) of β₁ under the null α = 1 when u_t is i.i.d. normal. Also,
the statistic R̂_T is asymptotically equivalent to minus one-half times the inverse of
the symmetric least squares estimator T(α̂_s − 1) in (3.11).
These statistics have seen little use in empirical work because of their derivation
in the first-order case and because, in their form (3.18), the tests are not similar under
the general I(1) null. They are, however, readily extended to the general I(1) null.
While of independent interest, this extension is worth explaining here because it
demonstrates a simple way that a large class of tests of the unit root model can be
extended to the general I(1) case, namely, by replacing y_t by y_t/ω̂. A direct
application of the FCLT and the CMT yields R̂_T ⇒ κ^{−1}∫W², R̂_T^μ ⇒ κ^{−1}∫(W^μ)² and
R̂_T^τ ⇒ κ^{−1}∫(W^B)², where W^B(λ) = W(λ) − (λ − ½)W(1) − ∫W. Thus, if κ̂ is consistent,
modified Sargan–Bhargava (MSB) statistics are obtained as R̂*_T = κ̂R̂_T, R̂*_T^μ = κ̂R̂_T^μ
and R̂*_T^τ = κ̂R̂_T^τ, which are similar under the general I(1) null.
These statistics can be summarized using a compact functional notation. Note
that the demeaned (say) MSB statistic with κ̂ = γ̂_{Δy}(0)/ω̂² can be written as
MSB_T^μ = ω̂^{−2}T^{−1} Σ_{t=1}^{T} Y_T^μ(t/T)², where Y_T^μ(λ) = T^{−1/2}y^μ_{[Tλ]}. This suggests the notation

MSB = ∫₀¹ f(λ)² dλ,   (3.19)

where f(λ) = ω̂^{−1}Y_T(λ), ω̂^{−1}Y_T^μ(λ) and ω̂^{−1}Y_T^B(λ), respectively, in the three cases,
where Y_T(λ) = T^{−1/2}y_{[Tλ]} and Y_T^B(λ) = T^{−1/2}y^B_{[Tλ]}.
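A minimal sketch of the demeaned MSB statistic in this form, with ω̂² computed from the AR spectral estimator (3.14) (illustrative code):

    import numpy as np

    def msb_demeaned(y, p):
        T = len(y)
        ymu = y - y.mean()
        dy = np.diff(y)
        # ADF regression with constant, no trend, for the AR spectral estimator
        X = np.column_stack([np.ones(T - 1 - p), y[p:-1]] +
                            [dy[p - j: T - 1 - j] for j in range(1, p + 1)])
        b, *_ = np.linalg.lstsq(X, dy[p:], rcond=None)
        e = dy[p:] - X @ b
        s2 = e @ e / len(e)
        omega2 = s2 / (1.0 - b[2:].sum()) ** 2     # sigma^2/(1 - a_hat(1))^2
        return np.sum(ymu ** 2) / (T**2 * omega2)  # omega^{-2} T^{-2} sum (y_t^mu)^2

    rng = np.random.default_rng(5)
    y = np.cumsum(rng.standard_normal(200)) + 3.0  # unit root plus a mean shift
    print(msb_demeaned(y, p=4))    # small values reject the I(1) null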
The approach used to extend the SB statistic to the general I(1) null can be
applied to other statistics as well. The Neyman–Pearson test regions (3.13) have the
same drawback as the SB critical regions, that is, they depend on ω under the general
I(1) null. Consider the no-deterministic case under which (3.13) was derived. Then,
2T^{−1} Σ_{t=2}^{T} y_{t−1}Δy_t = (T^{−1/2}y_T)² − γ_{Δy}(0) + o_p(1), so the critical regions in (3.13) are
asymptotically equivalent to critical regions based on (T(ᾱ − 1))²T^{−2} Σ_{t=2}^{T} y²_{t−1} −
T(ᾱ − 1)(T^{−1/2}y_T)², which has the limiting representation ω²{c̄²∫W² − c̄W(1)²}.
While this depends on ω, if ω̂² →_p ω² under the null and local alternatives, then
an asymptotically equivalent test can be performed using the statistic P_T =
ω̂^{−2}{c̄²T^{−2} Σ_{t=2}^{T} y²_{t−1} − c̄T^{−1}y_T²}. Because P_T ⇒ c̄²∫W² − c̄W(1)², this test is
asymptotically similar and is, moreover, asymptotically equivalent to the Neyman–Pearson
test in the case that u_t is i.i.d. N(0, σ²).
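A minimal sketch of P_T in the no-deterministic case (illustrative code; ω̂² is taken as given, e.g. from the AR estimator sketched above, and the value of c̄ anticipates the choice discussed below):

    import numpy as np

    def p_T(y, cbar, omega2_hat):
        T = len(y)
        term1 = cbar**2 * np.sum(y[:-1] ** 2) / T**2
        term2 = cbar * y[-1] ** 2 / T
        return (term1 - term2) / omega2_hat

    rng = np.random.default_rng(6)
    y = np.cumsum(rng.standard_normal(200))
    print(p_T(y, cbar=-7.0, omega2_hat=1.0))   # => cbar^2 int W^2 - cbar W(1)^2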
When deterministic terms are present, it is desirable to modify the P_T statistic.
A feature of the tests discussed so far is that they are invariant to the values of the
parameters describing the deterministic terms (β₀ and β₁ in the linear-trend case),
that is, a change in the value of β does not induce a change in the value of the test
statistic. This feature is desirable, particularly in the case of β₀. For example, when
a test is performed on a series in logarithms, then a change in the units of
measurement of the series (from thousands to millions of dollars, say) will appear,
after taking logarithms, as an additive shift in β₀. It is natural to require a test for
unit roots to be unaffected by the units of measurement, which translates here into
requiring that the test be invariant to β₀.
This line of reasoning led Dufour and King (1991), drawing on King (1980), to
develop most powerful invariant (MPI) finite-sample tests of α = α₀ vs. α = α₁, where
α₀ is a general value, not necessarily one, in the Gaussian AR(1) model with
Eu₀² < ∞. The finite-sample distribution of these tests hinges on the Gaussian AR(1)
assumption and they are not similar under the general I(1) null. These were extended
by Elliott et al. (1992) to the general case using the same device as was used to extend
the Neyman–Pearson tests to the statistic P_T. The resultant statistics, P_T^μ and P_T^τ, are
asymptotically MPI against the alternative c = c̄. These statistics have forms similar
to P_T, but are constructed using demeaned and detrended series, where the trend
coefficients are estimated by generalized least squares (GLS) under a local
alternative (c = c̄), rather than under the null. This local GLS detrending results
in intercept estimators which are asymptotically negligible, so that P_T^μ ⇒ c̄²∫W² −
c̄W(1)². This suggests examining other unit root test statistics using local detrending.
One such statistic, proposed by Elliott et al. (1992), is their Dickey–Fuller GLS
(DF-GLS) statistic, in which the local GLS-demeaned or local GLS-detrended
series is used to compute the t-statistic in the regression (3.9), where the intercept
and time trend are suppressed.⁹ The construction of the P_T^μ, P_T^τ, DF-GLS^μ and

⁹The DF-GLS^τ statistic is computed in two steps. Let z_t = (1, t)′. (1) β₀ and β₁ are estimated by GLS
under the assumption that the process is an AR(1) with coefficient ᾱ = 1 + c̄/T and u₀ = 0. That is, β₀
and β₁ are estimated by regressing [y₁, (1 − ᾱL)y₂, …, (1 − ᾱL)y_T] onto [z₁, (1 − ᾱL)z₂, …, (1 − ᾱL)z_T];
call the resulting estimator β̂_GLS. The detrended series ỹ_t = y_t − z′_t β̂_GLS is then computed. (2) The
Dickey–Fuller regression (3.9) is run using ỹ_t without the intercept and time trend; the t-statistic on
ỹ_{t−1} is the DF-GLS^τ statistic. The DF-GLS^μ statistic is computed similarly except that the regressor
t is omitted in the first step. The DF-GLS^μ statistic has the no-deterministic Dickey–Fuller τ̂ distribution
and the distribution of DF-GLS^τ is tabulated in Elliott et al. (1992).

DF-GLS^τ tests requires the user to choose c̄. Drawing upon the arguments in King
(1988), a case can be made for choosing c̄ so that the test achieves the power envelope
against stationary alternatives (is asymptotically MPI) at 50 percent power. This
turns out to be achieved by setting c̄ = −7 in the demeaned case and c̄ = −13.5 in
the detrended case.
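A minimal sketch of the two-step DF-GLS^τ computation described in footnote 9 (illustrative code; the lag order and seed are ours):

    import numpy as np

    def df_gls_tau(y, p, cbar=-13.5):
        T = len(y)
        abar = 1.0 + cbar / T
        z = np.column_stack([np.ones(T), np.arange(1, T + 1)])   # z_t = (1, t)'
        yq = np.concatenate([[y[0]], y[1:] - abar * y[:-1]])     # quasi-differences
        zq = np.vstack([z[0], z[1:] - abar * z[:-1]])
        beta, *_ = np.linalg.lstsq(zq, yq, rcond=None)
        yd = y - z @ beta                                        # local GLS detrended
        dy = np.diff(yd)
        # Dickey-Fuller regression (3.9) with intercept and trend suppressed
        X = np.column_stack([yd[p:-1]] +
                            [dy[p - j: T - 1 - j] for j in range(1, p + 1)])
        b, *_ = np.linalg.lstsq(X, dy[p:], rcond=None)
        e = dy[p:] - X @ b
        s2 = e @ e / (len(e) - X.shape[1])
        V = s2 * np.linalg.inv(X.T @ X)
        return b[0] / np.sqrt(V[0, 0])            # t-statistic on detrended y_{t-1}

    rng = np.random.default_rng(7)
    y = np.cumsum(rng.standard_normal(200)) + 0.1 * np.arange(200)
    print(df_gls_tau(y, p=4))   # compare with the tables in Elliott et al. (1992)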
Another statistic of independent interest is the so-called rescaled range (R/S)
statistic, which was proposed and originally analyzed by Mandelbrot and Van Ness
(1968), Mandelbrot (1975) and Mandelbrot and Taqqu (1979). The statistic is

R/S = T^{−1/2}(max_{t=1,…,T} y_t − min_{t=1,…,T} y_t) / [T^{−1} Σ_{t=2}^{T} (Δy_t)²]^{1/2}.   (3.20)

Although the R/S statistic was originally proposed as a method for measuring the
differencing parameter in a fractionally integrated (fractionally differenced) model,
the R/S test also has power against stationary roots in the autoregressive model. In
functional notation, the statistic is sup_λ f(λ) − inf_λ f(λ), which is a continuous
functional from C[0, 1] to ℝ. As Lo (1991) pointed out, this statistic is not similar
under the general I(1) null, but if evaluated using f(λ) = T^{−1/2}y_{[Tλ]}/ω̂ it is
asymptotically similar (note that this statistic needs no explicit demeaning in the
d_t = β₀ case). Thus, the asymptotic representation of this modified R/S statistic
under the general I(1) null is sup_λ W(λ) − inf_λ W(λ).
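A minimal sketch of this modified R/S statistic (illustrative code; ω̂² is taken as given):

    import numpy as np

    def modified_rs(y, omega2_hat):
        T = len(y)
        f = y / np.sqrt(T * omega2_hat)        # f(lambda) evaluated at lambda = t/T
        return f.max() - f.min()               # => sup W - inf W under the I(1) null

    rng = np.random.default_rng(8)
    y = np.cumsum(rng.standard_normal(400))
    print(modified_rs(y, omega2_hat=1.0))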
Although a large number of unit root tests have been proposed, many fall in the
same family, in the sense that they have the same functional representation. It will be
shown in the next section that if two tests have the same functional representation
then they have the same local asymptotic power functions. However, as will be seen
in Section 3.2.4, tests which are asymptotically equivalent under the null and local
alternatives can perform quite differently in finite samples.

3.2.3. Consistency and local asymptotic power

Consistency. A simple argument proves the consistency of unit root tests which
can be written in functional notation such as (3.19). Suppose that a test has the
representation g(f), where g: C[0, 1] → ℝ is continuous, and f is T^{−1/2}y_{[T·]}/ω̂,
T^{−1/2}y^μ_{[T·]}/ω̂ or T^{−1/2}y^τ_{[T·]}/ω̂ in the no-deterministic, demeaned or detrended cases,
respectively. If g(0) falls in the rejection region, then consistency follows immediately,
provided that the process being evaluated is consistent for zero under all fixed

alternatives. As a concrete example, let d_t = β₀ and consider the demeaned MSB
statistic (3.19) with f = ω̂^{−1}Y_T^μ. The test rejects for small values of the statistic, so
consistency against the general I(0) alternative follows if ω̂^{−1}Y_T^μ →_p 0. Now if u_t is
I(0), then Pr[sup_λ |Y_T^μ(λ)| > δ] → 0 for all δ > 0 [the proof is along the lines of (2.8)].
It follows that ω̂^{−1}Y_T^μ →_p 0 if ω̂² →_p k > 0 for some constant k under the I(0)
alternative. Thus, with this additional assumption, MSB = ∫(ω̂^{−1}Y_T^μ(λ))² dλ +
o_p(1) →_p 0 under the fixed I(0) alternative, and the test is consistent.
The assumption that ω̂² →_p k > 0 for some constant k under the I(0) alternative
is valid for certain variants of both the SC and AR spectral estimators. For the AR
spectral estimator, this was shown by Stock (1988, Lemma 1). For the SC spectral
estimator, test consistency is an implication of Phillips and Ouliaris' (1990, Theorem
5.1) more general result for tests for cointegration. These results, combined with
some additional algebra, demonstrate the consistency of the MSB, Z_α, Z_t, P_T and
R/S statistics.¹⁰

Local asymptotic power. Power comparisons are a standard way to choose among
competing tests. Because finite-sample distribution theory in nearly I(1) models is
prohibitively complicated, research has focused on asymptotic approximations to
power functions. For consistent tests, this requires computing power against
alternatives which are local to (in a decreasing neighborhood of) unity. Applications
of asymptotic expansions commonly used in √T-asymptotic problems, in particular
Edgeworth expansions and saddlepoint approximations, provided poor
distributional approximations for α near unity [Phillips (1978); also, see Satchell
(1984)]. This led to the exploration of the alternative nesting, α_T = 1 + c/T;
important early work developing this approach includes Bobkoski (1983), Cavanagh
(1985), Phillips (1987a, 1987b), Chan and Wei (1987) and Chan (1988, 1989).
The treatment here follows Bobkoski (1983) as generalized in Example 4 of
Section 2.3. The key observation is that, under the local-to-unity alternative (2.15),
the process T^{−1/2}u_{[T·]} ⇒ ωW_c(·), where W_c is a diffusion process on the unit interval
satisfying dW_c(λ) = cW_c(λ)dλ + dW(λ) with W_c(0) = 0. In addition, both the SC and
AR spectral estimators have the property that ω̂² →_p ω² under the local alternative.¹¹
These results directly yield local-to-unity representations of those test statistics with
functional representations such as (3.19).

¹⁰Not all plausible estimators of ω² will satisfy this condition. For example, consider the SC estimator
constructed using not the quasi-difference y^d_t − α̂y^d_{t−1}, as in (3.17), but the first difference Δy^d_t. These two
estimators are asymptotically equivalent under the null, but under the alternative y_t is overdifferenced.
Thus, the spectrum of Δy^d_t at frequency zero is zero under the I(0) alternative, the SC estimator of the
spectrum does not satisfy the positivity condition, and tests constructed using the first-differenced SC
estimator are not in general consistent. [Precise statements of this result are given by Stock and Watson
(1988b) in the MA(q) case with l_T fixed and by Phillips and Ouliaris (1990, Theorem 5.2) in the general
case.] This problem of overdifferencing by imposing α = 1 when nuisance parameters are estimated also
results in the inconsistency of Solo's (1984) Lagrange multiplier (LM) test for a unit AR root in an
ARMA(p, q) model, as demonstrated by Saikkonen (1993).
¹¹See Phillips (1987b) for the SC estimator and Stock (1988) for the AR estimator.

As a concrete example, again consider the MSB statistic. Under the local-to-unity
alternative, Y_T^μ ⇒ ω(W_c − ∫W_c) = ωW_c^μ. Thus, test statistics of the form g(ω̂^{−1}Y_T^μ)
have the local-to-unity representation g(W_c^μ). An important implication is that the
local asymptotic power of these tests does not depend on the nuisance parameter
ω, simplifying their comparison.
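These representations also make the power calculations easy to mimic numerically: as described in footnote 13 below, W_c can be replaced by a discretized Gaussian AR with coefficient 1 + c/T. A minimal sketch for the demeaned MSB functional (illustrative code, with fewer replications than used for the figures):

    import numpy as np

    rng = np.random.default_rng(9)
    T, nrep = 500, 5000

    def msb_mu_limit(c):
        # draws of int (W_c^mu)^2 via a discrete local-to-unity process
        out = np.empty(nrep)
        for r in range(nrep):
            e = rng.standard_normal(T)
            w = np.zeros(T)
            for t in range(1, T):
                w[t] = (1.0 + c / T) * w[t - 1] + e[t]
            wmu = w / np.sqrt(T)
            wmu = wmu - wmu.mean()             # demeaning: W_c - int W_c
            out[r] = np.mean(wmu ** 2)         # approximates int (W_c^mu)^2
        return out

    crit = np.quantile(msb_mu_limit(0.0), 0.05)    # 5 percent critical value
    power = np.mean(msb_mu_limit(-10.0) < crit)    # rejection rate at c = -10
    print(crit, power)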
Phillips (1987b, Theorem 2) showed that this framework bridges the gap between
the conventional Gaussian I(0) asymptotics and the nonstandard I(1) asymptotics.
Specifically, as c → −∞ the (suitably normalized) local-to-unity approximations for
T(α̂ − 1) and the associated t-statistic approach their I(0) Gaussian limits and, as
c → +∞, these distributions, respectively, tend to Cauchy and normal, in accordance
with the asymptotic results of White (1958, 1959) and Anderson (1959) for the
Gaussian AR(1) model with |α| > 1.
Another application of this approach is to derive the asymptotic Gaussian power
envelope for unit root tests, that is, the envelope of the power functions of the family
of most powerful unit root tests. Because there is no UMP test (or, in the time-trend
case, no uniformly most powerful invariant test), this envelope provides a concrete
way to judge the absolute asymptotic performance of various unit root tests. In the
no-deterministic case, this envelope is readily derived using the local-to-unity limit
(2.17) and the Neyman–Pearson critical regions (3.13). Assume that (i) the process
is a Gaussian AR(1) so that ω² = σ_ε²; (ii) Eu₀² < ∞; (iii) the alternative against which
the test is most powerful is local-to-unity, so that c̄ = T(ᾱ − 1) is fixed; and (iv) the
true process is local-to-unity with c = T(α − 1). Then, the probability of rejecting
α = 1 against the one-sided alternative ᾱ < 1 is, asymptotically,

Pr[(T(ᾱ − 1))²T^{−2} Σ_{t=2}^{T} y²_{t−1} − 2T(ᾱ − 1)T^{−1} Σ_{t=2}^{T} y_{t−1}Δy_t ≤ k]
    → Pr[c̄²∫₀¹W_c² − c̄W_c(1)² ≤ k̄],   (3.21)

where k and k̄ are constants which do not depend on c. When c = c̄, the second
expression in (3.21) provides the envelope of the power functions of the most
powerful (Neyman–Pearson) tests and, thus, provides an asymptotic performance
bound on all unit root tests in the Gaussian model.
This result is extended in several ways in Elliott et al. (1992). The bound (3.21) is
shown to hold if d_t is unknown but changes slowly, in the sense that d_t satisfies

T^{−1} Σ_{t=1}^{T} (Δd_t)² → 0 and T^{−1/2} max_{t=1,…,T} |d_t| → 0.   (3.22)

If d_t = β₀ + β₁t, then the bound (3.21) cannot be achieved uniformly in β₁, but a
similar bound can be derived among the class of all invariant tests, and this is
achieved by the P_T^τ statistic. Although the bound (3.21) was motivated here for u_t i.i.d.
achieved by the Pi statistic. Although the bound (3.21) was motivated here for u, i.i.d.

N(0, σ²), this bound applies under the more general condition that u_t obeys a
Gaussian AR(p).
We now turn to numerical results for asymptotic power, computed using the
local-to-unity asymptotic representations of various classes of statistics. In addition
to the tests discussed so far, we include expressions for the Park (1990) J(p, q)
variable-addition test [also, see Park and Choi (1988)] and, in the no-deterministic
case, the modified asymptotically LMPI test [invariant under change of scale, this
test rejects for small values of (T^{−1/2}y_T/ω̂)² and is obtained by letting c̄ → 0 in (3.13)
and rearranging]. Let W_c^d denote a general OLS-detrended W_c process, that is,
W_c^d = W_c in the no-deterministic case, W_c^d(λ) = W_c^μ(λ) = W_c(λ) − ∫W_c in the demeaned
case and W_c^d(λ) = W_c^τ(λ) = W_c(λ) − (4 − 6λ)∫W_c − (12λ − 6)∫sW_c(s)ds in the detrended
case [cf. (2.13)]. Let W_c^B denote the asymptotic limit of the Bhargava-detrended
process T^{−1/2}y^B_{[T·]} used to construct the Bhargava statistic (3.18c) in the detrended
case; specifically, W_c^B(λ) = W_c(λ) − (λ − ½)W_c(1) − ∫W_c. Finally, let V_c(λ) = W_c(λ) −
λ{c̄*W_c(1) + 3(1 − c̄*)∫rW_c(r)dr}, where c̄* = (1 − c̄)/(1 − c̄ + c̄²/3), denote the limit of
the detrended process obtained from local GLS detrending, which is used to
construct the P_T^τ and DF-GLS^τ statistics. The local-to-unity representations for
various classes of unit root test statistics are given by the following expressions:

ρ̂-class:  [∫(W_c^d)²]^{−1} ½{W_c^d(1)² − W_c^d(0)² − 1};   (3.23a)

τ̂-class:  [∫(W_c^d)²]^{−1/2} ½{W_c^d(1)² − W_c^d(0)² − 1};   (3.23b)

SB-class:  ∫(W_c^d)²  (no-deterministic, demeaned cases);   (3.23c)

SB-class:  ∫(W_c^B)²  (detrended case);   (3.23d)

R/S:  sup_{λ∈(0,1)} W_c^d(λ) − inf_{λ∈(0,1)} W_c^d(λ);   (3.23e)

J(0, 1):  ∫(W_c^μ)² / ∫(W_c^τ)² − 1  (demeaned case);   (3.23f)

J(1, 2):  ∫(W_c^τ)² / ∫(W_c^{τ,2})² − 1  (detrended case);   (3.23g)

LMPI:  W_c(1)²  (no-deterministic case only);   (3.23h)

P_T:  c̄²∫W_c² − c̄W_c(1)²  (no-deterministic, demeaned cases);   (3.23i)

P_T^τ:  c̄²∫V_c² − (c̄ − 1)V_c(1)²  (detrended case);   (3.23j)

DF-GLS^μ:  [∫W_c²]^{−1/2} ½{W_c(1)² − W_c(0)² − 1}  (demeaned case);   (3.23k)

DF-GLS^τ:  [∫V_c²]^{−1/2} ½{V_c(1)² − V_c(0)² − 1}  (detrended case),   (3.23l)

where, in (3.23g), W_c^{τ,2} denotes the W_c process detrended by OLS on a quadratic polynomial in λ.

The ρ̂ class includes the Dickey–Fuller (1979) ρ̂ tests and the Phillips (1987a)/Phillips–
Perron (1988) Z_α tests. The τ̂ class includes the Dickey–Fuller (1979) t-tests and the
Phillips (1987a)/Phillips–Perron (1988) Z_t tests. The SB class includes the Schmidt–
Phillips (1992) test. Most of these representations can be obtained by directly
applying the previous results. For those statistics with functional representations
already given and where the statistic is evaluated using OLS detrending [the
SB-class (demeaned case) and R/S statistics], the results obtain as a direct
application of the continuous mapping theorem. In the cases involving detrending
other than OLS in the time-trend case (the SB-class and DF-GLS statistics), an
additional calculation must be made to obtain the limit of the detrended processes.
The other expressions follow by direct calculation.¹²
Asymptotic power functions for leading classes of unit root tests (5 percent level)
are plotted in Figures 1, 2 and 3 in the no-deterministic, constant and trend cases,
respectively.¹³ The upper line in these figures is the Gaussian power envelope. In
the d_t = 0 case, the power functions for the τ̂, ρ̂ and SB tests are all very close to
the power envelope, so this comparison provides little basis for choosing among
them. Also plotted in Figure 1 is the power function of the LMPI test. Although
this test has good power against c quite close to zero, its power quickly falls away
from the envelope and is quite poor for distant alternatives.

¹²References for these results include: for the ρ̂, τ̂ statistics, Phillips (1987b) (Z_α statistic) and Stock
(1991) (Dickey–Fuller AR(p) statistics); for the SB-class statistics, Schmidt and Phillips (1992) and Stock
(1988); for the P_T and DF-GLS statistics, Elliott et al. (1992).
¹³The asymptotic power functions were computed using the functional representations in (3.23)
evaluated with discrete Gaussian random walks (T = 500) replacing the Brownian motions, with 20,000
Monte Carlo replications. Nabeya and Tanaka (1990b) tabulate the power functions for tests including
the SB and τ̂ tests, although they do not provide the power envelope. Because they derive and integrate
the characteristic function for these statistics in the local-to-unity case, their results presumably have
higher numerical accuracy than those reported here. Standard errors of rejection rates in Figures 1-3
are at most 0.004. Some curves in these figures originally appeared in Elliott et al. (1992).

In the empirically more relevant cases of a constant or constant and trend, the
asymptotic power functions of the various tests differ sharply. First, consider the
case d_t = β₀. Perhaps the most commonly used test in practice is the Dickey–Fuller/
Said–Dickey t-test, τ̂^μ; however, its power is well below not just the power envelope
but the power of the ρ̂^μ (equivalently, Z_α^μ) test.
[Figure 1. Asymptotic power functions of unit root tests and the Gaussian power envelope, no-deterministic case (horizontal axis: −c); includes the LMPI test.]

[Figure 2. Asymptotic power functions of unit root tests and the Gaussian power envelope, demeaned case (horizontal axis: −c).]

[Figure 3. Asymptotic power functions of unit root tests and the Gaussian power envelope, detrended case (horizontal axis: −c).]

The SB-class statistics have

asymptotic power slightly above the ρ̂^μ statistics, particularly for power between
0.3 and 0.8, but remains well below the envelope. In contrast, the asymptotic local
power function of the P_T^μ test, which is, by construction, tangent to the power
envelope at 50 percent power, is effectively on the power envelope for all values of
c. Similarly, the DF-GLS^μ power function is effectively on the power envelope.
Pitman efficiency provides a useful way to assess the importance of these power
differences. Pitman's proposal was to consider the behavior of two tests of the same
hypothesis against a sequence of local alternatives, against which at least one of the
tests had nondegenerate power. The Pitman efficiency [or asymptotic relative
efficiency (ARE)] is the ratio of the sample sizes giving, asymptotically, the same
power for that sequence. In conventional √T-normal asymptotics, often the ARE
can be computed as a ratio of the variances entering the denominators of the two
Studentized test statistics. Although this approach is inapplicable here, the ARE
can be calculated using the asymptotic power functions. Suppose that two tests
achieve power β against local alternatives c₁(β) and c₂(β); then the ARE of the first
test relative to the second test is c₁(β)/c₂(β) [Nyblom and Makelainen (1983)]. Using
this device, the ARE of the P_T^μ test, relative to the optimal test, at power of 50 percent
is 1.0, by construction, and the ARE of the DF-GLS^μ test is effectively 1. In contrast,
the AREs of the SB-, ρ̂- and τ̂-class tests, relative to the P_T^μ test, are 1.40, 1.53 and
1.91. That is, to achieve 50 percent power against a local alternative using the
Dickey–Fuller t-statistic asymptotically requires 90 percent more observations than
are needed using the asymptotically efficient P_T^μ test or the nearly efficient DF-GLS^μ
test.

The results in the detrended case are qualitatively similar but, quantitatively, the
power differences are less. The τ̂-class statistics have low power relative to the
envelope and to the SB- and ρ̂-class tests. The SB-class tests have power slightly
above the ρ̂-class tests and all power functions are dominated by the P_T^τ test. Some
of the other tests, in particular the R/S test, have power that is competitive with the
MSB- and ρ̂-class tests. At 50 percent power, the Pitman efficiency of the τ̂ tests is
1.39 and of the ρ̂ tests is 1.25. Interestingly, the power function of P_T^τ actually lies
above the power function of τ̂^μ, even though P_T^τ involves the additional estimation
of the linear-trend coefficient β₁. Comparing the results across the figures highlights
a common theme in this literature: including additional trend terms reduces the
power of the unit root tests if the trends are unnecessary.¹⁴
So far, the sampling frequency has been fixed at one observation per period. A
natural question is whether power can be increased by sampling more frequently,
for example, by moving from annual to quarterly data, while keeping the span of
the data fixed. A simple argument, however, shows that it is the span of the data
which matters for power, not the frequency of observation. To be concrete, consider
the demeaned case and suppose that the true value of α is 1 + c₁/T, based on T
annual observations, where c₁ is fixed. Suppose that the MSB statistic is used with
sufficiently many lags for ω̂² to be consistent. With the annual data, the test statistic
has the limiting representation ∫(W^μ_{c₁})². The quarterly test statistic has the limiting
representation ∫(W^μ_{4c₄})², where c₄ is the local-to-unity parameter at the quarterly
frequency and the factor of 4 arises because there are four times as many quarterly
as annual observations. Because α = 1 + c₁/T at the annual level, at the quarterly
level this root is α^{1/4} ≅ 1 + c₁/4T = 1 + c₄/T, so c₄ = c₁/4. Thus, ∫(W^μ_{4c₄})² = ∫(W^μ_{c₁})²
and the quarterly and annual statistics have the same limiting representations and,
hence, the same rejection probabilities. Although there are four times as many
observations, the quarterly root is four times closer to one than the annual root,
and these two effects cancel asymptotically. For theoretical results, see Perron
(1991b); for Monte Carlo results, see Shiller and Perron (1985). More frequent
observations, however, might improve estimation of the short-run dynamics, and
this, apparently, led Choi (1993) to find higher finite-sample power at higher
frequencies in a Monte Carlo study.
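A small simulation sketch of the span-versus-frequency point (illustrative code): an "annual" sample of T observations with root 1 + c/T and a "quarterly" sample of 4T observations with root (1 + c/T)^{1/4} produce essentially the same distribution of the demeaned MSB numerator.

    import numpy as np

    rng = np.random.default_rng(10)

    def msb_mu(T, alpha, nrep=4000):
        out = np.empty(nrep)
        for r in range(nrep):
            u = np.zeros(T)
            e = rng.standard_normal(T)
            for t in range(1, T):
                u[t] = alpha * u[t - 1] + e[t]
            um = u - u.mean()
            out[r] = np.sum(um ** 2) / T**2    # omega^2 = 1 here
        return out

    c, T = -10.0, 100
    annual = msb_mu(T, 1.0 + c / T)
    quarterly = msb_mu(4 * T, (1.0 + c / T) ** 0.25)
    print(np.median(annual), np.median(quarterly))   # nearly equal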

The case of u₀ drawn from its unconditional distribution. The preceding analysis
makes various assumptions about u₀: to derive the finite-sample Neyman–Pearson
tests, that u₀ = 0 (equivalently, u₀ is fixed and known) and, for the asymptotics, that
¹⁴Asymptotic power was computed for the Dickey–Fuller (1981) and Perron (1990a) F-tests, but this
is not plotted in the figures. These statistics test the joint restriction that α = 1 and that δ₁ = 0 in (3.5)
or (3.9). Unlike the other tests considered here, these F-tests are not invariant to the trend parameter
under local and fixed alternatives. The power of the two F-tests depends on β₁ under the alternative, so
for drifts sufficiently large their power functions can, in theory, exceed the power envelope for invariant
tests. If β₁ = 0 or is small, the F-tests have very low asymptotic power, well below the τ̂-class tests.
Perron's (1990a) calculations indicate, however, that for β₁ sufficiently large, the F-tests can have high
(size-adjusted) power.

T^{−1/2}u₀ →_p 0, as specified after (2.9). Under the null, the tests considered are
invariant to β₀ and thus to u₀. Although this finite-u₀ case has received the vast
majority of the attention in the literature, some work addresses the alternate model
that u₀ is drawn from its unconditional distribution or is large relative to the sample
size. In finite samples, this modification is readily handled and leads to different
tests [see Dufour and King (1991)]. The maximum likelihood estimator is different
from that when u₀ is fixed, being the solution to a cubic equation [Koopmans (1942);
for the regression case, Beach and MacKinnon (1978)].
As pointed out by Evans and Savin (1981b, 1984) and further studied by
Nankervis and Savin (1988), Perron (1991a), Nabeya and Sorensen (1992), Schmidt
and Phillips (1992) and DeJong et al. (1992a), the power of unit root tests depends on
the assumption about u₀. Analytically, this dependence arises automatically if the
asymptotic approximation relies on increasingly finely observed data, in Phillips'
(1987a) terminology, "continuous record" asymptotics [see Perron (1991a, 1992),
Sorensen (1992) and Nabeya and Sorensen (1992)]. Alternatively, equivalent
expressions can be obtained with the local-to-unity asymptotics used here if
T^{−1/2}u₀ = O_p(1) [in the stationary AR(1) case, a natural device is to let T^{−1/2}u₀ be
distributed N(0, σ_ε²/[T(1 − α²)]) → N(0, −σ_ε²/2c), where c < 0, so that an additional
term appears in (2.17)]. Elliott (1993a) derives the asymptotic power envelope under
the unconditional case and shows that tests which are efficient in the unconditional
case are not efficient in the conditional case in either the demeaned or detrended
cases. The quantitative effect on the most commonly used unit root tests of drawing
u₀ from its unconditional distribution is investigated in the Monte Carlo analysis
of the next subsection.

3.2.4. Finite-sample size and power

There is a large body of Monte Carlo evidence on the performance of tests for a
unit AR root. The most influential Monte Carlo study in this literature is Schwert
(1989), which found large size distortions in tests which are asymptotically similar
under the general I(1) null, especially the Phillips–Perron (1988) Z_α and Z_t statistics.
A partial list of additional papers which report simulation evidence includes Dickey
and Fuller (1979), Said and Dickey (1985), Perron (1988, 1989c, 1990a), Diebold and
Rudebusch (1991b), Pantula and Hall (1991), Schmidt and Phillips (1992), Elliott
et al. (1992), Pantula et al. (1992), Hall (1992a), DeJong et al. (1992b), Ng and Perron
(1993a, 1993b) and Bierens (1993).
Taken together, these experiments suggest four general findings. First, all the
asymptotically valid tests exhibit finite-sample size distortions for models which are
in a sense close to I(0) models. However, the extent of the distortion varies widely
across tests and depends on the details of the construction of the spectral estimator
ω̂². Second, the estimation of nuisance parameters describing the short-run
dynamics reduces test power, in some cases dramatically. Third, these two observa-
tions lead to the use of data-dependent truncation or AR lag lengths in the

estimation of ω², and the resulting tests show considerable improvements in size


and power. Fourth, the presence of nonnormality or conditional heteroskedasticity
in the errors results in size distortions, but these are much smaller than the dis-
tortions arising from the short-run dynamics.
We quantify these findings using a Monte Carlo study with eight designs (data
generating processes or DGPs) which reflect some leading cases studied in the
literature. In each, y_t = u_t, where u_t = αu_{t−1} + v_t. Five values of α were considered:
1.0, 0.95, 0.9, 0.8 and 0.7. All results are for T = 100. The DGPs are

Gaussian MA(1): v_t = ε_t − θε_{t−1},   θ = 0.8, 0.5, 0, −0.5, −0.8,   (3.24a)

Gaussian MA(1), u₀ unconditional: v_t = ε_t − θε_{t−1}, u₀ ~ N(0, γ_u(0)),
θ = 0.5, 0, −0.5,   (3.24b)

where in each case ε_t ~ i.i.d. N(0, 1). The Gaussian MA(1) DGP (3.24a) has received
the most attention in the literature and was the focus of Schwert's (1989) study. The
unconditional variant is identical under the null, but under the alternative u₀ is
drawn from its unconditional distribution N(0, γ_u(0)), where γ_u(0) = (1 + θ² − 2αθ)/
(1 − α²). This affects power but the size is the same as for (3.24a). The unconditional
model is of particular interest because the power functions in Section 3.2.3 were for
the so-called conditional (u₀ fixed) case.
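A minimal sketch of these DGPs (illustrative code):

    import numpy as np

    def dgp_ma1(T, alpha, theta, unconditional=False, rng=None):
        rng = rng or np.random.default_rng()
        e = rng.standard_normal(T + 1)
        v = e[1:] - theta * e[:-1]               # v_t = eps_t - theta*eps_{t-1}
        if unconditional and abs(alpha) < 1:     # u_0 ~ N(0, gamma_u(0))
            g0 = (1 + theta**2 - 2 * alpha * theta) / (1 - alpha**2)
            u0 = np.sqrt(g0) * rng.standard_normal()
        else:
            u0 = 0.0
        u = np.empty(T)
        u[0] = alpha * u0 + v[0]
        for t in range(1, T):
            u[t] = alpha * u[t - 1] + v[t]
        return u                                 # y_t = u_t

    y = dgp_ma1(T=100, alpha=0.9, theta=-0.5, unconditional=True,
                rng=np.random.default_rng(11))
    print(y[:5])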
The tests considered are: the Dickey–Fuller ρ̂ statistic T(α̂ − 1)/(1 − Σ_{j=1}^{p} â_j)
[where (α̂, â₁, …, â_p) are ordinary least squares estimators (OLSEs) from (3.9)];
the Phillips (1987a)/Phillips–Perron (1988) Z_α statistic (3.16a); the Dickey–Fuller τ̂
statistic computed from the AR(p + 1) regression (3.9); the MSB statistic (3.19) computed
using ω̂²_AR; the Schmidt–Phillips (1992) statistic, which is essentially (3.19) computed
using ω̂²_SC; and the DF-GLS statistic of Elliott et al. (1992).
Various procedures for selecting the truncation parameter l_T in ω̂²_SC and the
autoregressive order p_T in ω̂²_AR are considered. Theoretical and simulation evidence
suggest using data-based rules for selecting l_T. Phillips and Perron (1988) and
DeJong et al. (1992b) use the Parzen kernel, so this kernel is adopted here.¹⁶ The
truncation parameter l_T was chosen using Andrews' (1991) optimal procedure for
this kernel as given in his equations (5.2) and (5.4). The AR estimator lag length p_T
in (3.14) (with a constant but no time trend in the regression, in both the demeaned
¹⁵The results here are drawn from the extensive tabulations of 20 tests in 13 data generating processes
(DGPs) in Elliott (1993b). Other tests examined include: the Dickey–Fuller (1981) and Perron (1990b)
F-tests; the Phillips–Perron Z_t test; the modified R/S statistic; Hall's (1989) instrumental variable
statistic; Stock's (1988) MZ_α statistic; and the Park J(p, p + 3) tests for p = 1, 2. In brief, each of these
tests had drawbacks - distorted size, low power or both - which, in our view, makes them less attractive
than the tests examined here, so, to conserve space, these results are omitted.
¹⁶The Parzen kernel is given by: k(x) = 1 − 6x² + 6|x|³, 0 ≤ |x| ≤ ½; k(x) = 2(1 − |x|)³, ½ ≤ |x| ≤ 1;
and k(x) = 0, |x| > 1.

Table 2
Size and size-adjusted power of selected tests of the I(1) null: Monte Carlo results
(5 percent level tests, detrended case, T = 100, y_t = u_t, u_t = αu_{t−1} + v_t, v_t = ε_t − θε_{t−1}).

                                                     MA(1), θ =                  Unconditional: MA(1), θ =
Statistic             α     Asymptotic    −0.8   −0.5    0.0    0.5    0.8      −0.5    0.0    0.5
                            power
DF-τ̂    AR(4)       1.00      0.05        0.03   0.05   0.05   0.06   0.37      0.05   0.05   0.06
                     0.95      0.09        0.07   0.07   0.07   0.08   0.09      0.07   0.08   0.08
                     0.90      0.19        0.10   0.12   0.13   0.16   0.18      0.12   0.14   0.17
                     0.80      0.61        0.24   0.28   0.32   0.43   0.49      0.28   0.32   0.44
                     0.70      0.94        0.40   0.45   0.53   0.71   0.78      0.45   0.52   0.72

DF-τ̂    AR(BIC)     1.00      0.05        0.10   0.07   0.05   0.09   0.58      0.07   0.05   0.09
                     0.95      0.09        0.09   0.08   0.08   0.09   0.08      0.08   0.08   0.09
                     0.90      0.19        0.16   0.14   0.15   0.18   0.17      0.15   0.15   0.18
                     0.80      0.61        0.36   0.36   0.39   0.51   0.50      0.36   0.39   0.52
                     0.70      0.94        0.57   0.58   0.64   0.81   0.80      0.58   0.64   0.81

DF-τ̂    AR(LR)      1.00      0.05        0.09   0.11   0.08   0.22   0.65      0.11   0.08   0.22
                     0.95      0.09        0.08   0.09   0.09   0.10   0.09      0.09   0.09   0.09
                     0.90      0.19        0.14   0.16   0.17   0.22   0.17      0.16   0.18   0.20
                     0.80      0.61        0.29   0.39   0.46   0.56   0.42      0.39   0.47   0.56
                     0.70      0.94        0.42   0.58   0.74   0.76   0.57      0.58   0.74   0.77

DF-ρ̂    AR(BIC)     1.00      0.05        0.21   0.16   0.13   0.21   0.81      0.16   0.13   0.21
                     0.95      0.10        0.09   0.09   0.10   0.10   0.08      0.09   0.09   0.10
                     0.90      0.23        0.18   0.18   0.20   0.22   0.17      0.18   0.19   0.20
                     0.80      0.70        0.42   0.45   0.49   0.58   0.47      0.43   0.48   0.57
                     0.70      0.97        0.62   0.67   0.74   0.87   0.73      0.66   0.73   0.86

Z_α     SC(auto)     1.00      0.05        0.00   0.01   0.05   0.65   1.00      0.01   0.05   0.65
                     0.95      0.10        0.09   0.09   0.11   0.10   0.09      0.09   0.10   0.09
                     0.90      0.23        0.19   0.20   0.25   0.25   0.16      0.20   0.23   0.21
                     0.80      0.70        0.56   0.62   0.74   0.73   0.44      0.62   0.73   0.70
                     0.70      0.97        0.89   0.92   0.98   0.97   0.72      0.92   0.98   0.97

MSB     AR(BIC)      1.00      0.05        0.23   0.17   0.13   0.12   0.49      0.17   0.13   0.12
                     0.95      0.10        0.10   0.09   0.10   0.10   0.08      0.09   0.09   0.09
                     0.90      0.25        0.19   0.20   0.21   0.21   0.16      0.18   0.19   0.19
                     0.80      0.73        0.42   0.45   0.48   0.50   0.42      0.41   0.44   0.46
                     0.70      0.97        0.60   0.65   0.69   0.74   0.69      0.61   0.64   0.69

MSB     SC(auto)     1.00      0.05        0.00   0.01   0.03   0.46   0.99      0.01   0.03   0.46
                     0.95      0.10        0.10   0.10   0.11   0.11   0.11      0.09   0.10   0.10
                     0.90      0.25        0.24   0.24   0.28   0.27   0.22      0.21   0.24   0.23
                     0.80      0.73        0.63   0.66   0.75   0.74   0.42      0.61   0.70   0.65
                     0.70      0.97        0.89   0.91   0.97   0.94   0.42      0.86   0.93   0.89

DF-GLS  AR(BIC)      1.00      0.05        0.11   0.08   0.07   0.11   0.58      0.08   0.07   0.11
                     0.95      0.10        0.11   0.10   0.10   0.11   0.12      0.09   0.09   0.09
                     0.90      0.27        0.23   0.23   0.24   0.28   0.27      0.19   0.19   0.21
                     0.80      0.81        0.53   0.57   0.61   0.72   0.70      0.46   0.49   0.54
                     0.70      0.99        0.75   0.80   0.84   0.94   0.91      0.67   0.71   0.76

AR(BIC) indicates that the AR spectral estimator based on (3.9) with the time trend suppressed was
used. See the notes to Table 1.

and the detrended cases) was selected using the Schwarz (1978) Bayesian information
criterion (BIC), with a minimum lag of 3 and a maximum of 8. For comparison
purposes a sequential likelihood ratio (LR) downward-testing procedure with 10
percent critical values, as suggested by Ng and Perron (1993b), was also applied to
the Dickey–Fuller t-statistic.¹⁷
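A minimal sketch of the BIC lag choice over p = 3, …, 8, computed on a common estimation sample so that the criteria are comparable (illustrative code):

    import numpy as np

    def bic_lag(y, pmin=3, pmax=8):
        T = len(y)
        dy = np.diff(y)
        best_p, best_bic = pmin, np.inf
        for p in range(pmin, pmax + 1):
            X = np.column_stack([np.ones(T - 1 - pmax), y[pmax:-1]] +
                                [dy[pmax - j: T - 1 - j] for j in range(1, p + 1)])
            yy = dy[pmax:]
            b, *_ = np.linalg.lstsq(X, yy, rcond=None)
            e = yy - X @ b
            n = len(yy)
            bic = n * np.log(e @ e / n) + X.shape[1] * np.log(n)
            if bic < best_bic:
                best_p, best_bic = p, bic
        return best_p

    rng = np.random.default_rng(12)
    e = rng.standard_normal(102)
    y = np.cumsum(e[2:] - 0.5 * e[1:-1])         # T = 100 with MA(1) errors
    print(bic_lag(y))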
The results for tests of asymptotic level 5 percent are summarized in Table 1 for
the demeaned case and in Table 2 for the detrended case. For each statistic, the first
column provides the asymptotic approximation to the size (which is always 5
percent) and to the local-to-unity power. The remaining entries for α = 1 are the
empirical size, that is, the Monte Carlo rejection rate based on asymptotic critical
values. The entries for |α| < 1 are the size-adjusted power, that is, the Monte Carlo
rejection rates when the actual 5 percent critical value computed for that model
with α = 1 is used to compute the rejections. Of course, in practice the model and
this correct critical value are unknown, so the size-adjusted powers do not reflect
the empirical rejections based on the asymptotic critical values. However, it is the
size-adjusted powers, not the empirical rejection rates, which permit examining the
quality of the local-to-unity asymptotic approximations reported in the first
column.
These results illustrate common features of other simulations. Test performance,
both size and power, varies greatly across the statistics, the models generating the
data and the methods used to estimate the long-run variance. The most commonly
used test in practice is the Dickey–Fuller τ̂ statistic. Looking across designs, this
statistic has size closer to its level than any other statistic considered here, with size
in the range 5-10 percent in both the demeaned and detrended cases with θ ≤ 0.5,
for both the AR(4) and BIC choices of lag length. However, as the asymptotic
comparisons of the previous subsection suggest, this ability to control size in a
variety of models comes at a high cost in power. For example, consider the case
θ = −0.5. In the demeaned case with α = 0.9, the DF-τ̂ test has power of 0.22 (BIC
case) while the DF-GLS test has power of 0.59. In the detrended case, as the
asymptotic results suggest, the power loss from using the DF-τ̂ statistic is less.
Again in the θ = −0.5, α = 0.9 case, the powers of the DF-τ̂ and the DF-GLS
statistics are 0.14 and 0.23. Typically, the ρ̂- and SB-class tests also have better
size-adjusted power than the DF-τ̂ statistics.
Three lag length selection procedures are compared for the DF ρ̂ and τ̂ statistics,
and the choice has important effects on both size and power. In the θ = 0 case, for
example, using 4 lags results in substantial power declines against distant alternatives,
relative to either data-dependent procedure. DeJong et al. (1992b) show that increasing
p typically results in a modest decrease in power but a substantial
reduction in size distortions. The results here favor the BIC over the LR selector,
¹⁷Alternative strategies, both data-based and not, were also studied, but they, typically, did not
perform as well as the procedures reported here and thus are not reported here to save space. In general,
among SC estimators, the Andrews (1991) procedure studied here performed substantially better (in
terms of size distortions and size-adjusted power) than non-data-based procedures with l_T = k(T/100)^{1/4}
with k = 4 or 12. See Elliott (1993b).

a finding congruent with Hall's (1992b) proof that the asymptotic null distribution
of the DF statistic is the same using the BIC as if the true order were known
(assuming the maximum possible lag is known and fixed). However, Ng and Perron
(1993b) provide evidence supporting the sequential LR procedure. In any event,
currently available information suggests using one of these two lag selection
procedures.
Although the size distortions are slight for the cases with positive serial
correlation in v_t, the introduction of moderate negative serial correlation results
in very large size distortions for several of the statistics. This is the key finding of
Schwert's (1989) influential Monte Carlo study and is one of the main lessons for
practitioners of this experiment. For several statistics, these size distortions are
extreme. For example, for the Gaussian MA(1) process with θ = 0.5, which
corresponds to a first autocorrelation of v_t of −0.4, the detrended Phillips–Perron
Z_α statistic has a rejection rate of 65 percent. These large size distortions are
partially but not exclusively associated with the use of the SC spectral estimator.
For example, the sizes of the MSB/AR(BIC) test and the Schmidt and Phillips (1992)
version of this test implemented with the Parzen kernel, the MSB/SC(auto)
statistic, are respectively 9 percent and 38 percent in the θ = 0.5 case. Similarly,
the Z_α test can be modified using an AR estimator to reduce the distortions
substantially, although they remain well above the distortions of the DF-τ̂ or
DF-GLS statistics. Ng and Perron (1993a) give theoretical reasons for the
improvement of the AR over SC estimators. Part of the problem is that the SC
estimators are computed using the estimated quasidifference of y_t^μ or y_t^τ, where the
quasidifference is based on α̂, which in turn is badly biased in the very cases where
the correction factor is most important [see the discussion following (3.4b)].¹⁸
Looking across the statistics, the asymptotic power rankings provide a good
guide to finite-sample size-adjusted power rankings, although the finite-sample
power typically falls short of the asymptotic power. As predicted by the asymptotic
analysis, the differences in size-adjusted powers are dramatic. For example, in the
demeaned θ = 0 case with α = 0.9, the Dickey–Fuller t-statistic (BIC case) has
power of 22 percent, Z_α has power of 44 percent, and DF-GLS has power of 60
percent.
There is some tradeoff between power and size. The DF-τ̂ statistic exhibits
the smallest deviation from nominal size, but it has low power. Other tests, such
as the Z_α and MSB/SC(auto) statistics, have high size-adjusted power but very
large size distortions. The DF-GLS statistic appears to represent a compromise,
in the sense that its power is high - based on results in Elliott et al. (1992), typically
as high as the asymptotic point-optimal test P_T - but its size distortions are low,

¹⁸Consistent with the asymptotic theory, introducing generalized autoregressive conditional heteroskedasticity
[GARCH, Bollerslev (1986)] has only a small effect on the empirical size or power of any
of the statistics. Elliott (1993b) reports simulations with MA(1)-GARCH(1, 1) errors and coefficients
which add to 0.9. For example, for the DF-GLS statistic, demeaned case, θ = 0 or −0.5, size and power
(T = 100) differ at most by 0.03 from those in Table 1 for α = 1 to 0.7.

although not as low as the DF-τ̂ statistic. In the demeaned results, DF-GLS has
sizes of 0.07-0.11, compared to the DF-τ̂ (BIC) which has sizes 0.06-0.08 (except
in the extreme, θ = 0.8, case). In the detrended case, the DF-GLS has sizes of
0.07-0.11, while DF-τ̂ has sizes in the range 0.05-0.10.
Drawing the initial value from its unconditional distribution changes the
rankings of size-adjusted power; in particular the size-adjusted power of DF-GLS
drops, particularly for distant alternatives. However, the DF-GLS power remains
above the DF-τ̂ (BIC) power in both demeaned and detrended cases, and, of
course, the large size distortions of the other tests are not mitigated in this DGP.
Recent Monte Carlo evidence by Pantula et al. (1992) suggests that, in the correctly
specified demeaned AR(l) model, better power can be achieved against the
unconditional alternative by a test based on a weighted symmetric least squares
estimator. However, the unconditional case has been less completely studied than
the conditional case and it seems premature to draw conclusions about which
tests perform best in this setting.

3.2.5. Effects of misspecifying the deterministic trend

The discussion so far has assumed that the order of the trend has been correctly
specified. If the trend is misspecified, however, then the estimators of α and the
tests of α = 1 can be inconsistent [Perron and Phillips (1987), West (1987, 1988a)].¹⁹
This argument can be made simply in the case where the true trend is
d_t = β₀ + β₁t with β₁ a nonzero constant, but the econometrician incorrectly uses
the constant-only model. Because y_t contains a linear time trend, asymptotically
the OLS objective function will be minimized by first-differencing y_t, whether or
not u_t is I(1), and a straightforward calculation shows that α̂ →_p 1. It follows that
ρ̂ and τ̂ tests will not be consistent. This inconsistency is transparent if one works
with the functional representation of the tests: T^{−1/2}Y_T^μ ⇒ h^μ, where h^μ(λ) = β₁(λ − ½).
In finite samples, the importance of this omitted variable effect depends on the
magnitude of the incorrectly omitted time-trend slope relative to ω [West (1987)
provides Monte Carlo evidence on this effect]. This problem extends to other types
of trends as well, and in particular to misspecification of a piecewise-linear trend
(a "broken" trend) as a single linear trend [see Perron (1989a, 1990b), Rappoport
and Reichlin (1989) and the discussion in Section 5.2 of this chapter].
The analogy to the usual regression problem of omitted variable bias is useful
here: if the trend is underspecified, unit root tests (and root estimators) are
inconsistent, while if the trend is overspecified, power is reduced, even asymptotically.
This contrasts with the case of mean-zero I(0) regressors, in which the reduction
in power, resulting from unnecessarily including polynomials in time, vanishes
asymptotically. The difference is that while I(0) regressors are asymptotically

¹⁹See West (1988a), Park and Phillips (1988) and Sims et al. (1990) for extensions of this result to
multiple time series models.

uncorrelated with the included time polynomials, I(1) regressors are asymptotically
correlated (with a random correlation coefficient). This asymptotic collinearity
reduces the power of the unit root tests when a time trend is included. A procedure of
sequential testing of the order of the trend specification prior to inference on α will
result in pretest bias arising from the possibility of making a type I error in the tests
for the trend order. This problem is further complicated by the dependence of the
distributions of the trend coefficients and test statistics on the order of integration
of the stochastic component.²⁰

3.2.6. Summary and implications for empirical practice

If one is interested in testing the null hypothesis that a series is I(1) - as opposed
to testing the null that the series is I(0) or to using a consistent decision- or
information-theoretic procedure to select between the I(0) and I(1) hypotheses - then
the presumption must be that there is a reason that the researcher wishes to control
type I error with respect to the I(1) null. If so, then a key criterion in the selection
of a unit root test for practical purposes is that the finite-sample size be
approximately the level of the test.
Taking this criterion as primary, we can see from Tables 1 and 2 that only a
few of the proposed tests effectively control size for a range of nuisance parameters.
In the demeaned case, only the Dickey–Fuller τ̂^μ and DF-GLS^μ tests have
sizes of 12 percent or under (excluding the extreme θ = 0.8 case). However, the τ̂^μ
statistic has much lower size-adjusted power than the DF-GLS^μ statistic. Moreover,
asymptotically, the DF-GLS^μ statistic can be thought of as approximately UMP
since its power function nearly lies on the Neyman–Pearson power envelope in
Figure 2, even though, strictly, no UMP test exists. When u₀ is drawn from its
unconditional distribution, the power of the DF-GLS^μ statistic exceeds that of τ̂^μ
except against distant alternatives. These results suggest that, of the tests studied
here, the DF-GLS^μ statistic is to be preferred in the d_t = β₀ case.
In the detrended case, only τ̂^τ and DF-GLS^τ have sizes less than 12 percent
(excepting θ = 0.8). The size-adjusted power of the DF-GLS^τ (BIC) test exceeds
that of the τ̂^τ (BIC) test in all cases except u₀ unconditional, θ = 0.5 and α = 0.7.
Because the difference in size distortions between the τ̂^τ and DF-GLS^τ tests is
minimal, this suggests that again the DF-GLS^τ test is preferred in the detrended
case.
In both the demeaned and the detrended cases, an important implication of the
Monte Carlo results here and in the literature is that the choice of lag length or
truncation parameter can strongly influence test performance. The LR and BIC

²⁰In theory, this can be addressed by casting the trend order/integration order decision as a model
selection problem and using Bayesian model selection techniques, an approach investigated by Phillips
and Ploberger (1992). See the discussion in Section 6 of this chapter.

rules have the twin advantages of relieving the researcher from making an arbitrary
decision about lag length and of providing reasonable tradeoffs between controlling
size with longer lags and gaining size-adjusted power with shorter lags.
One could reasonably object to the emphasis on controlling size in drawing
these conclusions. In many applications, particularly when the unit root test is
used as a pretest, it is not clear that controlling type I error is as important as
achieving desirable statistical properties in the subsequent analysis. This suggests
adopting alternative strategies: perhaps testing the I(0) null or implementing a
consistent classification scheme. These strategies are respectively taken up in
Sections 4 and 6.

3.3. Interval estimation

Confidence intervals are a mainstay of empirical econometrics and provide more
information than point estimates or hypothesis tests alone. For example, it is more
informative to estimate a range of persistence measures for a given series than
simply to report whether or not the persistence is consistent with there being a
unit root in its autoregressive representation [see, for example, Cochrane (1988)
and Durlauf (1989)]. This suggests constructing classical confidence intervals for
the largest autoregressive root α, for the sum of the coefficients in the autoregressive
approximation to u_t, or for the cumulative impulse response function. Alternatively,
if one is interested in forecasting, then it might be desirable to use a median-unbiased
estimator of α, so that forecasts (in the first-order model) would be median-unbiased.
Because a median-unbiased estimator of α corresponds to a 0 percent equal-tailed
confidence interval [e.g. Lehmann (1959, p. 174)], this again suggests considering
the construction of classical confidence intervals for α. Moreover, a confidence
interval for α would facilitate computing forecast prediction intervals which take
into account the sampling uncertainty inherent in estimates of α.
The construction of classical confidence intervals for α, however, involves
technical and computational complications, and only recently has this been the subject
of active research. Because of the nonstandard limiting distribution at α = 1, it is
evident that the usual approach of constructing confidence intervals, as, say, α̂ ± 1.96
times the standard error of α̂, has neither a finite-sample nor an asymptotic
justification. This approach does not produce confidence intervals with the correct
coverage probabilities, even asymptotically, when α is large. To see this, suppose
that α is estimated in the regression (3.9) and that the true value of α is one. Then
the usual asymptotic 95 percent confidence interval will contain the true value
of α when the absolute value of the t-ratio testing α = 1, constructed using α̂, is
less than 1.96. When α = 1, however, this t-ratio has the Dickey-Fuller τ̂^τ
distribution, for which Pr[|τ̂^τ| > 1.96] ≅ 0.61. That is, the purported 95 percent
confidence interval actually has an asymptotic coverage rate of only 39 percent!
It is, therefore, useful to return to first principles to develop a theory of classical
interval estimation for α. A 95 percent confidence set for α, S(y_1, …, y_T), is a
set-valued function of the data with the property that Pr[α ∈ S] = 0.95 for all values
of α and for all values of the nuisance parameters. In general, a confidence set can
be constructed by inverting the acceptance region of a test statistic that
has a distribution which depends on α but not on any nuisance parameters. Were
there a UMP test of α = α_0 available for all null values α_0, then this test could be
inverted to obtain a uniformly most accurate confidence set for α. However, as
was shown in Section 3.2.1, no such UMP test exists, even in the special case of
no nuisance parameters, so uniformly most accurate (or uniformly most accurate
invariant or invariant-unbiased) confidence sets cannot be constructed by inverting
such tests. Thus, as in the testing problem, even in the finite-sample Gaussian
AR(1) model the choice of which test to invert is somewhat arbitrary.
As the discussion of Section 3 revealed, a variety of statistics for testing α = α_0
are available for the construction of confidence intervals. Dufour (1990) and Kiviet
and Phillips (1992) proposed techniques for constructing exact confidence regions
in Gaussian AR(1) regression with exogenous regressors, and Andrews (1993a)
develops confidence sets for the Gaussian AR(1) model in the no-deterministic,
demeaned and detrended cases with no additional regressors. Dufour's (1990)
confidence interval is based on inverting the Durbin-Watson statistic, Kiviet and
Phillips (1992) inverted the t-statistic from an augmented OLS regression, and
Andrews (1993a) inverted α̂^μ − α (in the detrended case, α̂^τ − α). In practice, the
inversion of these test statistics is readily performed using a graph of the confidence
belt for the respective statistics, which plots the critical values of the test statistic as a
function of the true parameter. Inverting this graph yields those parameters which
cannot be rejected for a given realization of the test statistic, providing the desired
confidence interval [see Kendall and Stuart (1967, Chapter 20)].²¹
In practice one rarely, if ever, knows a priori that the true autoregressive order
is one and that the errors are Gaussian, so a natural question is how to construct
confidence intervals for α in the more general model (3.1). If treated in finite
samples, even if one maintains the Gaussianity assumption, this problem is quite
difficult because of the additional nuisance parameters describing the short-run
dependence. However, as first pointed out by Cavanagh (1985), the local-to-unity
asymptotics of Section 3.2.3 can be used to construct asymptotically valid confidence
intervals for α when α is close to one.

²¹Dufour studied linear regression with Gaussian AR(1) disturbances, of which the constant and
constant/time-trend regression problems considered here are special cases. Both Dufour (1990) and
Andrews (1993a) computed the exact distributions of these statistics using the Imhof method. In earlier
work, Ahtola and Tiao (1984) proposed a method for constructing confidence intervals in the Gaussian
AR(1) model with no intercept. Ahtola and Tiao's approach can be interpreted as inverting the score
test of the null ρ = ρ_0, with two important approximations. First, they use a normal-F approximation
to the distribution of the score test, which seems to work well over their tabulated range of α. Second
(and more importantly), their proposed procedure for inverting the confidence belt requires the belt to
be linear and parallel, which is not the case over a suitably large range of α, even at the scale of the
local-to-unity model α = 1 + c/T.

If the local-to-unity asymptotic representation of the statistic in the general I(1)
case has a distribution which depends only on c [a condition satisfied by any
statistic with the limiting representations in (3.23)], then this test can be inverted
to construct confidence intervals for c and, thus, for α. In the finite-sample case,
α cannot be determined from the data with certainty, and similarly in the asymptotic
case c cannot be known with certainty even though α is consistently estimated.
However, the nesting α = 1 + c/T provides confidence intervals that shorten at the
rate T^{-1} rather than the usual T^{-1/2}.
To be concrete, the Dickey-Fuller t-statistic from the pth order autoregression
(3.9) (interpreted in the Said-Dickey sense of p increasing with the sample size)
has the local-to-unity distribution (3.23b), which depends only on c and, so, can
be used to test the hypothesis c = c_0 against the two-sided alternative for any finite
value of c_0. The critical values for this test depend on c_0. The plot of these values
constitutes an asymptotic confidence belt for the local-to-unity parameter c, based
on the Dickey-Fuller t-statistic. Inverting the test based on this belt provides an
asymptotic local-to-unity confidence interval for c. Asymptotic confidence belts
based on the Dickey-Fuller t-statistic in (3.9) and, alternatively, the modified
Sargan-Bhargava (MSB) statistic are provided by Stock (1991) in both the
demeaned and the detrended cases. Stock's (1991) Monte Carlo evidence suggests
that the finite-sample coverage rates of the interval based on the Dickey-Fuller
t-statistic are close to their asymptotic confidence levels in the presence of MA(1)
disturbances, but the finite-sample coverage rates of the MSB-based statistics
exhibit substantial distortions relative to their asymptotic confidence levels.
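To fix ideas, the following Python sketch illustrates the belt-inversion logic just described for the demeaned case, using the simplest AR(1) Dickey-Fuller regression with no lag augmentation; the grid, sample size, replication count and the "observed" statistic are illustrative assumptions, and the sketch simulates its own belt rather than using the tabulated belts in Stock (1991).

import numpy as np

def df_t_demeaned(y):
    # Dickey-Fuller t-ratio for alpha = 1 from the regression of
    # Delta y_t on demeaned y_{t-1}, with no lag augmentation.
    y = y - y.mean()
    dy, ylag = np.diff(y), y[:-1]
    ahat = (ylag @ dy) / (ylag @ ylag)
    resid = dy - ahat * ylag
    s2 = (resid @ resid) / (len(dy) - 1)
    return ahat / np.sqrt(s2 / (ylag @ ylag))

def confidence_belt(c_grid, T=200, nrep=2000, seed=0):
    # 2.5 and 97.5 percent quantiles of the statistic for each true c,
    # simulated under alpha = 1 + c/T: an equal-tailed acceptance region.
    rng = np.random.default_rng(seed)
    belt = {}
    for c in c_grid:
        alpha, stats = 1.0 + c / T, []
        for _ in range(nrep):
            e = rng.standard_normal(T)
            y = np.empty(T)
            y[0] = e[0]
            for t in range(1, T):
                y[t] = alpha * y[t - 1] + e[t]
            stats.append(df_t_demeaned(y))
        belt[c] = np.percentile(stats, [2.5, 97.5])
    return belt

# Inverting the belt: collect the values of c whose acceptance region
# contains the observed statistic, then translate c into alpha = 1 + c/T.
T, c_grid = 200, np.arange(-30, 6, 2)
belt = confidence_belt(c_grid, T=T)
t_obs = -1.90  # hypothetical observed Dickey-Fuller t-statistic
c_set = [c for c in c_grid if belt[c][0] <= t_obs <= belt[c][1]]
print("c interval:", (min(c_set), max(c_set)),
      "alpha interval:", (1 + min(c_set) / T, 1 + max(c_set) / T))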
Both the finite-sample AR(1) and asymptotic AR(p) confidence intervals yield, as
special cases, median-unbiased estimators of α. The OLS estimates of α are biased
downwards, and both the finite-sample and asymptotic approaches typically
produce median-unbiased estimates of α larger than the OLS point estimates.
While this approach produces confidence intervals and median-unbiased esti-
mators of α, the researcher might not be interested in the largest root per se but
rather in some function of this root, such as the sum of the AR coefficients in the
autoregressive representation. To this end, Rudebusch (1992) proposed a numerical
technique based on simulation to construct median-unbiased estimators of each
of the p + 1 autoregressive parameters; his algorithm searches for those autoregres-
sive parameters for which the median of the OLS estimator of each AR parameter
equals the observed OLS estimate of that parameter. Andrews and Chen (1992) propose a
similar algorithm, except that their emphasis is the sum of the autoregressive
parameters rather than the individual autoregressive parameters themselves, and
their calculations are done using the asymptotic local-to-unity approximations.
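The simulation-based median-unbiased idea is easily illustrated in the first-order case. The following sketch searches a grid of values of α for the one whose Monte Carlo median OLS estimate matches the observed estimate; the grid, sample size and Gaussian design are illustrative assumptions, not Rudebusch's (1992) exact algorithm (which treats all p + 1 parameters).

import numpy as np

def ols_ar1(y):
    # OLS estimate of alpha from the regression of y_t on (1, y_{t-1}).
    x = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    return np.linalg.lstsq(x, y[1:], rcond=None)[0][1]

def median_of_ols(alpha, T, nrep, rng):
    # Monte Carlo median of the OLS estimator when the true value is alpha.
    draws = []
    for _ in range(nrep):
        e = rng.standard_normal(T)
        y = np.empty(T)
        y[0] = e[0]
        for t in range(1, T):
            y[t] = alpha * y[t - 1] + e[t]
        draws.append(ols_ar1(y))
    return np.median(draws)

def median_unbiased_ar1(alpha_ols, T, nrep=1000, seed=0):
    # Pick the grid value whose median OLS estimate is closest to alpha_ols.
    rng = np.random.default_rng(seed)
    grid = np.arange(0.70, 1.005, 0.01)
    medians = np.array([median_of_ols(a, T, nrep, rng) for a in grid])
    return grid[np.argmin(np.abs(medians - alpha_ols))]

# Because OLS is median-biased downwards near the unit root, the
# median-unbiased estimate typically exceeds the OLS point estimate.
print(median_unbiased_ar1(alpha_ols=0.95, T=100))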
A completely different approach to interval estimation, which has been the subject
of considerable recent research and controversy, is the construction of Bayesian
regions of highest posterior probability and an associated set of Bayesian tests of
the unit AR root hypothesis. (References are given in Section 6.2.) Although
these procedures examine the same substantive issue, they are not competitors of
the classical methods in the sense that, when interpreted from a frequentist
perspective, many of the proposed Bayesian intervals have coverage rates that
differ from the stated confidence levels, even in large samples. A simple example
of this occurs in the Gaussian AR(1) model with a constant and a time trend when
there are flat priors on the coefficients. Then, in large samples, the Bayesian 95
percent coverage region is constructed as those values of α which are within 1.96
standard errors (conventional OLS formula) of the point estimate [Zellner (1987)].
As pointed out earlier in this subsection, if α = 1 this interval will contain the
true value of α only 39 percent of the time in the detrended case. Of course, these
Bayesian regions have well-defined interpretations in terms of thought-experiments
in which α is itself random and have optimality properties, given the priors.
However, given the lack of congruence between the classical and the Bayesian
intervals in this problem, and the sensitivity of the results to the choice of priors
[see Phillips (1991a) and his discussants], applied researchers should be careful in
interpreting these results.

4. Unit moving average roots

This section examines inference in two related models, the moving average model
and the unobserved components model. The moving average model is

y_t = d_t + u_t,  Δu_t = (1 − θL)v_t,  (4.1)

where v_t is, in general, an I(0) process satisfying (2.1)-(2.3). If θ = 1, u_t = v_t + (u_0 − v_0),
so that, with the initial condition that u_0 = v_0, u_t = v_t is a purely stochastic I(0)
process. If |θ| < 1, then (1 − θL)^{-1} yields a convergent series and (1 − θL) is
invertible, so u_t is I(1). The convention in the literature is to refer to |θ| = 1 as the
noninvertible case.
The unobserved components (UC) model considered here can be written

y_t = d_t + u_t,  u_t = μ_t + ζ_t,  μ_t = μ_{t−1} + ν_t,  t = 1, 2, …, T,  (4.2)

where ζ_t and ν_t are I(0) with variances σ_ζ² and σ_ν², and where d_t is a trend term as
in (1.1). If ζ_t and ν_t have a nondegenerate joint distribution, then, in general, the
I(1) component μ_t and the I(0) component ζ_t cannot be extracted from the observed
series without error, even with known additional parametric structure; hence the
"unobserved components" terminology.
It should be observed at the outset that the unit MA root/UC models
are a mirror image of the unit AR root model, in the sense that the unit AR root
model parameterizes the I(1) model as a point (α = 1) and the I(0) model as a
continuum (|α| < 1), while in the unit MA root/UC models the reverse is true. In
the latter two models, the I(0) case is parameterized as a point (θ = 1 in the unit MA
root model, σ_ν² = 0 in the UC model), while the I(1) case is parameterized as a
continuum (|θ| < 1 in the MA model, σ_ν² > 0 in the UC model). This suggests that,
at least qualitatively, some of the general lessons from the AR problem will carry
over to the MA/UC problems. In particular, because the points θ = 1 and σ_ν² = 0
represent discontinuities in the long-run behavior of the process, it is perhaps not
surprising that, as in the special case of a unit AR root, the first-order asymptotic
distributions of estimators of θ and σ_ν² exhibit discontinuities at these points. In
addition, just as the unit AR root model lends itself to constructing tests of the
general I(1) null, the unit MA root model lends itself to constructing tests of the
general I(0) null.
Although the MA model (4.1) and UC model (4.2) appear rather different, they
are closely related. To see this, consider only the stochastic components of the
models. In general, for suitable choices of initial conditions, all MA models (4.1)
have UC representations (4.2) and all UC models (4.2) have MA representations
of the form (4.1). To show the first of these statements, we need only write Δu_t =
(1 − θL)v_t = (1 − θ)v_t + θΔv_t; then, cumulating Δu_t with the initial condition u_0 = v_0
yields

u_t = (1 − θ) Σ_{s=0}^t v_s + θv_t.  (4.3)

By construction, (1 − θ)Σ_{s=0}^t v_s is I(1) and θv_t is I(0). Therefore, the MA model (4.1)
has the UC representation (4.3) with ν_t = (1 − θ)v_t and ζ_t = θv_t. If θ = 1, then the
I(1) term in (4.3) vanishes and u_t is I(0). To argue that all UC models have MA
representations of the form (4.1), it is enough to consider the two cases, σ_ν² = 0
and σ_ν² > 0. If σ_ν² = 0 and μ_0 = 0, then u_t = ζ_t, which is I(0), so (4.1) obtains with
v_t = u_t and θ = 1. If σ_ν² > 0, then u_t is I(1), so Δu_t = ν_t + Δζ_t is I(0), and it follows that
Δu_t has a Wold decomposition and hence an MA representation of the form (4.1)
where |θ| < 1.
A leading special case of the UC model, which is helpful in developing intuition
and which will be studied below, is when (ζ_t, ν_t) are serially uncorrelated and are
mutually uncorrelated. Then Δu_t in the UC model has MA(1) autocovariances:
Δu_t = Δζ_t + ν_t, so that γ_{Δu}(0) = 2σ_ζ² + σ_ν², γ_{Δu}(1) = −σ_ζ² and γ_{Δu}(j) = 0, |j| > 1. Thus
Δu_t has the MA(1) representation (4.1), Δu_t = (1 − θL)e_t, where e_t is serially
uncorrelated, θ solves θ + θ^{-1} − 2 = σ_ν²/σ_ζ² and σ_e² = σ_ζ²/θ. Because EΔζ_tν_t = 0
by assumption, this UC model is incapable of producing positive autocorrelations
of Δu_t, so, while all uncorrelated UC models have an MA(1) representation, the
converse is not true. [The MA(1) model will, however, have a correlated UC
representation and a UC representation with independent permanent and transitory
components which themselves have complicated short-run dynamics; see Quah
(1992).]
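The mapping from the UC parameters to the MA(1) parameters can be made concrete: given q = σ_ν²/σ_ζ², θ is the root of θ² − (2 + q)θ + 1 = 0 inside the unit circle, and σ_e² = σ_ζ²/θ. The following Python fragment (illustrative only) computes the mapping and verifies that the implied MA(1) autocovariances match those of Δu_t:

import numpy as np

def uc_to_ma1(q, sigma_zeta2=1.0):
    # Invertible MA(1) parameters implied by an uncorrelated UC model:
    # theta solves theta + 1/theta - 2 = q; take the root below one.
    theta = ((2 + q) - np.sqrt((2 + q) ** 2 - 4)) / 2
    sigma_e2 = sigma_zeta2 / theta
    return theta, sigma_e2

for q in [0.0, 0.1, 1.0]:
    theta, s2 = uc_to_ma1(q)
    # gamma(0) and gamma(1) of Delta u_t should equal 2 + q and -1
    # (in units of sigma_zeta^2).
    print(q, theta, (1 + theta ** 2) * s2, -theta * s2)

Note that q = 0 returns θ = 1, the noninvertible boundary, reproducing the point/continuum asymmetry emphasized above.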
The UC model can equivalently be thought of as having I(1) and I(0) components,
or as being a regression equation with deterministic regressor(s) d_t (which in general
has unknown parameter(s) β), an I(0) error and a constant which is time-varying
and follows an I(1) process. Thus the problem of testing for a unit moving average
root and the problem of testing for time variation in the intercept in an otherwise standard
time series regression model are closely related.

4.1. Point estimation

When the MA process is noninvertible, or nearly noninvertible, estimators of θ
fail to have the standard Gaussian limiting distributions. The task of characterizing
the limiting properties of estimators of θ when θ is one, or nearly one, is difficult,
and the theory is less complete than in the case of nearly-unit autoregressive roots.
Most of the literature has focused on the Gaussian MA(1) model with d_t = 0 and
v_t = ε_t, and this model is adopted in this subsection, except as explicitly noted. One
complication is that the limiting distribution depends on the specific maximand and
the treatment of initial conditions. Because the objective here is pedagogical rather
than to present a comprehensive review, our discussion of point estimation focuses
on two specific estimators of θ, the unconditional and conditional (on ε_1 = 0) MLE.
Suppose that the data have been transformed so that x_t = Δy_t, t = 2, …, T. Then
θ can be estimated by maximizing the Gaussian likelihood for (x_2, …, x_T). The exact
form of the likelihood depends on the treatment of the initial condition. If x_2 is
treated as being drawn from its stationary distribution, so that x_2 = ε_2 − θε_1, then
X = (x_2, …, x_T)′ has covariance matrix σ_ε²Ω_u(θ), where Ω_{u,ii} = 1 + θ² and Ω_{u,i,i±1} = −θ.
This is the unconditional case, and the Gaussian likelihood is

Λ(θ, σ_ε²) = −½T ln 2πσ_ε² − ½ ln det(Ω_u) − X′Ω_u^{-1}X/2σ_ε²,  (4.4)

where det(Ω_u) denotes the determinant of Ω_u. Estimation proceeds by maximization
of Λ in (4.4). Numerical issues associated with this maximization are discussed at
the end of the subsection.
The conditional case sets ε_1 = 0, so that x_2 = ε_2. The conditional likelihood is
given by (4.4) with Ω_u replaced by Ω_c, where Ω_c = Ω_u except that Ω_{c,11} = 1. A
principal advantage of maximizing the conditional Gaussian likelihood is that the
determinant of the covariance matrix does not depend on θ, so maximization can
proceed by minimizing the quadratic form X′Ω_c^{-1}X. Because ε_t = x_t + θε_{t−1}, with
the additional assumption that ε_1 = 0, the residuals e_t(θ) can be constructed
recursively as (1 − θL)e_t(θ) = x_t, and estimation reduces to the nonlinear least
squares problem of minimizing Σ_{t=2}^T e_t(θ)².
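A minimal sketch of this conditional estimator, assuming the Gaussian MA(1) model for Δy_t and using a grid search in place of a formal optimizer (both illustrative choices), is:

import numpy as np

def ssr(theta, x):
    # Sum of squared recursive residuals from (1 - theta L) e_t(theta) = x_t,
    # i.e. e_t = x_t + theta * e_{t-1}, with e_1 = 0 imposed.
    e, total = 0.0, 0.0
    for xt in x:
        e = xt + theta * e
        total += e * e
    return total

def conditional_mle_ma1(y):
    # Conditional MLE of theta in Delta y_t = (1 - theta L) eps_t.
    x = np.diff(y)  # x_t = Delta y_t, t = 2, ..., T
    grid = np.linspace(-0.999, 0.999, 1999)
    ssrs = np.array([ssr(th, x) for th in grid])
    theta_hat = grid[np.argmin(ssrs)]
    sigma2_hat = ssrs.min() / len(x)  # Gaussian MLE of sigma_eps^2
    return theta_hat, sigma2_hat

# Example: simulate Delta y_t = eps_t - 0.5 eps_{t-1} and re-estimate theta.
rng = np.random.default_rng(0)
eps = rng.standard_normal(501)
y = np.cumsum(eps[1:] - 0.5 * eps[:-1])
print(conditional_mle_ma1(y))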
If |θ| < 1, so that the process is invertible, then standard √T asymptotic theory
applies. More generally, if an ARMA(p, q) process is stationary and invertible and
has no common roots, then the Gaussian maximum likelihood estimator of the
ARMA parameters is √T-consistent and has a normal asymptotic distribution;
see Brockwell and Davis (1987, Chapter 10.8). In the MA(1) model, √T(θ̂_MLE − θ)
is asymptotically distributed N(0, 1 − θ²). This provides a simple way to construct
tests of whether θ equals some particular value. Alternatively, confidence intervals
for θ can be constructed as θ̂ ± 1.96[(1 − θ̂²)/T]^{1/2} (for a 95 percent two-sided confidence
interval).
These simple results fail to hold in the noninvertible case. This is readily seen by
noting that the asymptotic normal approximation to the distribution of the MLE
is degenerate when θ = 1. The most dramatic and initially surprising feature of this
failure is the pileup phenomenon. In a series of Monte Carlo experiments,
investigators found that the unconditional MLE took on the value of exactly one
with positive probability when the true value of θ was near one, a surprising finding
at the time since θ can take on any value in a continuum. Shephard (1992) and Davis
and Dunsmuir (1992) attribute the initial discovery of the pileup effect to unpublished
work by Kang (1975); early published simulation studies documenting this pheno-
menon include Ansley and Newbold (1980), Cooper and Thompson (1977), Davidson
(1979, 1981), Dent and Min (1978) and Harvey (1981, pp. 136-69); also see Plosser
and Schwert (1977), Dunsmuir (1981) and Cryer and Ledolter (1981).
The intuition concerning the source of the pileup effect is straightforward, and
concerns the lack of identification of (θ, σ_ε²) in the unconditional model. Note that
Ω_u(θ) = θ²Ω_u(θ^{-1}); upon substituting this expression into the unconditional likeli-
hood (4.4), one obtains Λ(θ, σ_ε²) = Λ(θ^{-1}, θ²σ_ε²) and Λ̄(θ) = Λ̄(θ^{-1}), where Λ̄ denotes
the likelihood concentrated to be an argument only of θ. Because Λ̄ is symmetric
in ln θ for θ > 0, it follows immediately that Λ̄ will have a local maximum at θ = 1 if
∂²Λ̄/∂θ²|_{θ=1} ≤ 0, so the probability of a local maximum at θ = 1 is Pr[∂²Λ̄/∂θ²|_{θ=1} ≤ 0].
Sargan and Bhargava (1983b, Corollary 1) [also see Pesaran (1983) and Anderson
and Takemura (1986, Theorem 4.1)] provide expressions for this limiting probability
in the noninvertible case, which can be calculated by interpolation of Table 1 in
Anderson and Darling (1952) and is 0.657.
These results were extended to the case of higher-order MA and ARMA models
by Anderson and Takemura (1986), where the estimation is by Gaussian MLE when
the order of the ARMA process is correctly specified. Tanaka (1990b) considered
a different problem, in which u_t is a linear process which is I(0) but otherwise is only
weakly restricted, but θ is estimated by using the misspecified Gaussian MA(1)
likelihood. Tanaka (1990b) found that, despite the misspecification of the model
order, the unconditional MLE continues to exhibit the pileup effect, in the sense
that the probability of a local (but not necessarily global) maximum at θ̂ = 1 is
nonzero if the true value of θ is one. Also see Tanaka and Satchell (1989) and
Pötscher (1991).
Because of the close link between the UC and MA models, not surprisingly the
pileup phenomenon occurs in those models as well. In this model, if the signal-to-
noise ratio σ_ν²/σ_ζ² is zero or is in a T^{-2} neighborhood of zero, then σ_ν² is estimated
to be precisely zero with finite probability. However, the value of this point
probability depends on the precise choice of maximand (e.g. maximum marginal
likelihood or maximum profile likelihood) and the treatment of the initial condition
μ_0 (as fixed or alternatively as random with a variance which tends to infinity, or
equivalently as being drawn from a diffuse prior). Various versions of this problem
have been studied by Nabeya and Tanaka (1988), Shephard and Harvey (1990)
and Shephard (1992, 1993).
Research on the limiting distribution of estimators of θ when θ is close to one is
incomplete. Davis and Dunsmuir (1992) derive asymptotic distributions of the local
maximizer closest to one of the unconditional likelihood, when the true value is in
a 1/T neighborhood of θ = 1. Their numerical results indicate that their distributions
provide good approximations, even for θ as small as 0.6 with T = 50. Their approach
is to obtain representations of the first and second derivatives of the likelihoods as
stochastic processes in T(1 − θ). They do not (explicitly) use the FCLT, and working
through the details here would go beyond the scope of this chapter.

A remark on computation. The main technical complication that arises in the
estimation of stationary and invertible ARMA(p, q) models is the numerical
evaluation of the likelihood when q ≥ 1. If the sample size is small, then Ω_u can
be computed and inverted directly. In sample sizes typically encountered in econo-
metric applications, however, the direct computation of Ω_u^{-1} is time-consuming and
can introduce numerical errors. One elegant and general solution is to use the
Kalman filter, which is a general device for computing the Gaussian log likelihood
ℒ(y_1, …, y_T), via the factorization ℒ(y_1, …, y_T) = ℒ(y_1) + Σ_{t=2}^T ℒ(y_t | y_{t−1}, …, y_1),
when the model can be represented in state space form (as can general ARMA models).
The Kalman filter operates by recursively computing the conditional mean and
variance of y_t, which, in turn, specify the conditional likelihood ℒ(y_t | y_{t−1}, …, y_1).
The Gaussian MLE is then computed by finding the parameter vector that
maximizes the likelihood. The chapter by Hamilton in this Handbook describes the
particulars of the standard Kalman filter and provides a state space representation
for ARMA models which can be used to compute their Gaussian MLE. The model
(4.2) is a special case of unobserved components time series models, which in general
can be written in state space form so that they, too, can be estimated using the
Kalman filter; see Harvey (1989) and Harvey and Shephard (1992) for discussions
and related examples.
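As a concrete illustration of the prediction-error decomposition, the following sketch evaluates the Gaussian MA(1) log likelihood with a two-dimensional state α_t = (ε_t, ε_{t−1})′ and measurement x_t = (1, −θ)α_t; initializing the filter at the stationary state distribution corresponds to the unconditional treatment of the initial condition. The state space representation chosen here is one of several equivalent ones and is an illustrative assumption, not the representation in the Hamilton chapter.

import numpy as np

def ma1_loglik_kalman(x, theta, sigma2):
    Tmat = np.array([[0.0, 0.0], [1.0, 0.0]])  # state transition
    Rvec = np.array([1.0, 0.0])                # shock loading
    Z = np.array([1.0, -theta])                # measurement vector
    a = np.zeros(2)                            # E[alpha_1]
    P = sigma2 * np.eye(2)                     # stationary Var[alpha_1]
    loglik = 0.0
    for xt in x:
        v = xt - Z @ a                         # prediction error
        F = Z @ P @ Z                          # prediction-error variance
        loglik += -0.5 * (np.log(2 * np.pi * F) + v * v / F)
        K = Tmat @ P @ Z / F                   # Kalman gain
        a = Tmat @ a + K * v
        P = (Tmat @ P @ Tmat.T
             + sigma2 * np.outer(Rvec, Rvec) - F * np.outer(K, K))
    return loglik

# Evaluate the likelihood of simulated MA(1) data at the true parameters;
# maximizing over (theta, sigma2) would deliver the unconditional MLE.
rng = np.random.default_rng(1)
eps = rng.standard_normal(201)
x = eps[1:] - 0.5 * eps[:-1]
print(ma1_loglik_kalman(x, theta=0.5, sigma2=1.0))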
The literature on the estimation of stationary and invertible ARMA models is
vast and it will not be covered further in this chapter. See Brockwell and Davis
(1987, Chapter 8) for a discussion and references. For additional discussion of the
Kalman filter with applications and a bibliography, see the chapter by Hamilton in
this Handbook.

4.2. Hypothesis tests

4.2.1. Tests of θ = 1 in the conditional Gaussian MA(1) model

As Sargan and Bhargava (1983b) pointed out, the pileup phenomenon means that
likelihood ratio tests cannot be used for hypothesis testing at conventional
significance levels, at least using the unconditional Gaussian MLE. Given this
difficulty, it is perhaps not surprising that research into testing the null of a unit
MA root has been limited and has largely focused on the MA(1) conditional
Gaussian model. This model, therefore, provides a natural starting point for our
discussion of tests of the general I(0) null.
The conditional Gaussian MA(1) model with a general pth order polynomial time
trend is

y_t = z_t′β + u_t,  u_1 = ε_1,  Δu_t = ε_t − θε_{t−1},  t > 1,  (4.5)

where ε_t is i.i.d. N(0, σ_ε²) and z_t = (1, t, …, t^p)′. Let z = (z_1, …, z_T)′, and similarly define
the T × 1 vectors y and u. Then y is distributed N(zβ, Σ_u), where Σ_{u,11} = σ_ε² and the
remaining elements can be calculated directly from the moving average representa-
tion Δu_t = (1 − θL)ε_t.
The problem of testing values of θ is invariant to transformations of the form
y_t → ay_t + z_t′b, β → aβ + b and σ_ε² → a²σ_ε². It is therefore reasonable to restrict
attention to the family of tests which are invariant to this transformation, and
among that family to find the most powerful tests of θ = 1 against the fixed
alternative θ = θ̄. An implication of the general results of King (1980, 1988) is that
the MPI test of θ = 1 vs. θ = θ̄ rejects for small values of the statistic

ũ′Σ_u(θ̄)^{-1}ũ/û′û,  (4.6)

where û = (û_1, û_2, …, û_T)′, where {û_t} are the residuals from the OLS regression of
y_t onto z_t, ũ are the GLS residuals from the estimation of (4.5) under the alternative,
and Eūū′ = Σ_u(θ̄) is the covariance matrix of ū = (u_1, …, u_T)′ under the alternative
θ̄. In the MA(1) model, the GLS transformation can be written explicitly, and GLS
simplifies to the OLS regression of Y_t(θ̄) onto Z_t(θ̄), where Y_1(θ̄) = y_1 and
(1 − θ̄L)Y_t(θ̄) = Δy_t, t > 1, and similarly for Z_t(θ̄).
As in the case of MPI tests for an autoregressive unit root discussed in Section
3.2.1, the dependence of the MPI test statistic (4.6) on the alternative θ̄ cannot be
eliminated, so there does not exist a UMPI test of θ = 1 vs. |θ| < 1. This has led
researchers to propose alternative tests. A natural approach is to consider tests
which have maximal power for local alternatives, that is, to consider the locally
most powerful invariant test. In the special case that d_t is zero, Saikkonen and
Luukkonen (1993a) show that the LMPI test has the form T ȳ²/σ̂_ε², where
ȳ = T^{-1}Σ_{t=1}^T y_t and σ̂_ε² = T^{-1}Σ_{t=1}^T y_t² (which is the MLE of σ_ε² under the null). In
the case d_t = β_0, Saikkonen and Luukkonen (1993a) use results in King and Hillier
(1985) to derive the locally most powerful invariant unbiased test, which is based
on the statistic

L^μ = T^{-2} Σ_{t=1}^T (Σ_{s=1}^t y_s^μ)²/(σ̂_ε^μ)² = ∫_0^1 [Y_T^μ(λ)/σ̂_ε^μ]² dλ,  (4.7)

where Y_T^μ(λ) = T^{-1/2}Σ_{s=1}^{[Tλ]} y_s^μ and (σ̂_ε^μ)² = T^{-1}Σ_{t=1}^T (y_t^μ)², where y_t^μ = y_t − ȳ [also see
Tanaka (1990b)]. Note that (σ̂_ε^μ)² is the (conditional) MLE of σ_ε² under the null
hypothesis. Because the statistic was derived for arbitrarily small deviations from
the null, the parameter θ does not need to be estimated to construct L^μ.²²
A natural generalization of (4.7) to linear time trends is to replace the demeaned
process y_t^μ by the detrended process y_t^τ:

L^τ = T^{-2} Σ_{t=1}^T (Σ_{s=1}^t y_s^τ)²/(σ̂_ε^τ)² = ∫_0^1 [Y_T^τ(λ)/σ̂_ε^τ]² dλ,  (4.8)

where Y_T^τ(λ) = T^{-1/2}Σ_{s=1}^{[Tλ]} y_s^τ and (σ̂_ε^τ)² = T^{-1}Σ_{t=1}^T (y_t^τ)², where y_t^τ = y_t − z_t′β̂ and β̂
is the OLS estimator from the regression of y_t onto (1, t).
The asymptotic null distributions of L^μ and L^τ are readily obtained using the
FCLT and CMT, under the maintained assumption that the order of the estimated
deterministic trend is at least as great as the order of the true trend. First consider
L^μ. As the second expression in (4.7) reveals, L^μ can be written as a continuous
functional of Y_T^μ/σ̂_ε^μ. To obtain limiting representations for L^μ it therefore suffices to
obtain limiting results for the stochastic process Y_T^μ/σ̂_ε^μ. The limit of the numerator
of this process was derived in Section 2.3 (Example 3) and is given in (2.14a) for u_t
being a general I(0) process; because it is assumed in this subsection that u_t = ε_t
under the null, this result applies here with ω = σ_ε. In addition, the maintained
assumption that the trend is correctly specified ensures that (σ̂_ε^μ)² →_p σ_ε². It follows
that Y_T^μ/σ̂_ε^μ ⇒ B^μ and that L^μ ⇒ ∫(B^μ)², where B^μ(λ) = W(λ) − λW(1) is a standard
Brownian bridge. An identical argument applies to the linearly detrended case and
yields the limit L^τ ⇒ ∫(B^τ)², where B^τ is the second-level Brownian bridge in (2.14b).
In the leading case that d_t is a constant, L^μ has the asymptotic distribution of the
Cramér-von Mises statistic derived by Anderson and Darling (1952). Nyblom and
Mäkeläinen (1983, Table 1) provide critical values of the finite-sample distribution
of L^μ, computed using the Imhof method for Gaussian errors. Kwiatkowski et al.
(1992, Table 1) provide a table of critical values of ∫(B^μ)² and ∫(B^τ)² which agrees
closely with earlier computations, e.g. MacNeill (1978, Table 2). Although the
motivation for the L-statistic comes from considering the Gaussian MA(1) model,
it is evident from the preceding derivation that the same asymptotic distribution
obtains for MA(1) models with errors which satisfy weaker assumptions, such
as being martingale difference sequences which satisfy (2.2).
The L-statistics (4.7) and (4.8) have intuitive interpretations. To be concrete,
consider L^μ. Under the null hypothesis, y_t − β_0 is serially uncorrelated and the
partial sum process of the demeaned data, Σ_{s=1}^t y_s^μ, is I(1). The statistic L^μ thus can
be seen to test the null hypothesis that y_t is I(0) by testing its implication that the
process of accumulated (demeaned) y_t's is I(1). Rejection occurs if L^μ is large, so the
statistic tests the null that the accumulation of y_t is I(1) against the alternative that it is
I(2). Comparison of (4.7) to the expression (3.18b) for the Sargan-Bhargava statistic
testing the null of a unit autoregressive root in the d_t = β_0 case shows that the two
statistics are closely related: the Sargan-Bhargava statistic rejects the I(1) null
against the I(0) alternative when the sum of squared y_t's is small, while the LMPI
statistic L^μ rejects the I(0) null against the I(1) alternative when the sum of squared
accumulated y_t's is large.

²²Nabeya and Tanaka (1990b) showed that the statistic (4.7) is also locally MPI unbiased for the
unconditional Gaussian MA(1) model with known d_t.
Because of the similarities between the UC and MA models, not surprisingly the
L-statistics can alternatively be derived as tests of σ_ν² = 0 in the UC model. In this
formulation, the tests have the interpretation that they are testing the null that the
regression intercept in (4.2) is constant, versus the alternative that it is time-varying,
evolving as a martingale. To be concrete, suppose that y_t is generated by (4.2) with
(ζ_t, ν_t) i.i.d. N(0, σ_ζ² diag(1, q)), μ_0 = 0, and set q = σ_ν²/σ_ζ². Then (y_1, …, y_T) is distributed
N(zβ, σ_ζ²Ω_T(q)), where Ω_T(q) = I + qΩ*, where Ω*_{ij} = min(i, j). Again, the results
of King (1980) imply that the most powerful invariant test of q = 0 against q = q̄ > 0
is a ratio of quadratic forms similar to (4.6) but involving Ω_T(q̄). The resulting
statistic depends on q̄, so no uniformly MPI test exists.
Because there is no UMPI test, it is reasonable to examine the locally MPI test
in the UC model. In the case d_t = β_0, Nyblom and Mäkeläinen (1983) derived the
LMPI test statistic and showed it to be L^μ. Nyblom (1986) extended this analysis
to the case d_t = β_0 + β_1t and showed the LMPI statistic to be L^τ. Nabeya and
Tanaka (1988) extended these results to the general Gaussian regression problem
in which coefficients on some of the variables follow a random walk while others
are constant. The special case of Nabeya and Tanaka's (1988) results of interest
here is when d_t = z_t′β, where z_t = (1, t, …, t^p)′ and the intercept term, μ_t in (4.2),
follows a random walk. Then Nabeya and Tanaka's (1988) LM test statistic
simplifies to (4.7), except that y_t^d, the residual from the OLS regression of y_t onto the
time polynomials z_t, replaces y_t^μ, and Y_T^d(λ) = T^{-1/2}Σ_{s=1}^{[Tλ]} y_s^d replaces Y_T^μ.²³
Despite the differences in the derivations in the MA and UC cases, the fact that
the same local test statistics arise has a simple explanation. As argued above, Δu_t
generated by the UC model has an MA(1) representation with parameters (θ, σ_e²)
which solve q = θ + θ^{-1} − 2 and θσ_e² = σ_ζ². Thus, the distribution of (y_1, …, y_T) can
be written as N(zβ, σ_e²Σ_{UC}(θ)), where z = (z_1, …, z_T)′, Σ_{UC,11} = (1 + q)σ_ζ²/σ_e² = 1 − θ +
θ², and where the remaining elements of Σ_{UC}(θ) equal those of Σ_C(θ), the covariance
matrix in the conditional MA model. Thus the UC and conditional MA models are
the same except for their treatment of the initial value y_1. But Σ_{UC,11}(1) = Σ_{C,11}(1),
so, when θ = 1 (equivalently, when q = 0), the two models are identical.
A third interpretation of the L-tests arises from recognizing that the UC model
is a special case of time-varying parameter models, so that the tests can be viewed
as a test for a time-varying intercept. This interpretation was emphasized by Nabeya
and Tanaka (1988) and by Nyblom (1989). We return to this link in Section 5.

²³This simplification obtains from Nabeya and Tanaka's (1988) equation (2.5) by noting that the tth
element of their M̄y is y_t^d, by recognizing that, in our application, their D̄X is the T × T identity matrix,
and by carrying out the summation explicitly. See Kwiatkowski et al. (1992).
Local optimality is not the only testing principle which can be fruitfully exploited
here, and other tests of the hypothesis q = 0 in the UC model have been proposed.
LaMotte and McWhorter (1978) proposed a family of exact tests for random walk
coefficients, which contains the i.i.d. UC model (4.2) with d_t = β_0 and d_t = β_0 + β_1t
as special cases, under the translation group y → y + zb, β → β + b. Powers of the
LaMotte and McWhorter tests are tabulated by Nyblom and Mäkeläinen (1983)
(constant case) and by Nyblom (1986) (time-trend case). Franzini and Harvey (1983)
considered tests in the Gaussian UC model with the maintained hypothesis of
nonzero drift in μ_t, which is equivalent to (4.2) with (ζ_t, ν_t) i.i.d. Gaussian and
d_t = β_0 + β_1t. Franzini and Harvey (1983) suggested using a point-optimal test, that
is, choosing an MPI test of the form (4.6), where their recommendation corresponds to
q̄ ≅ 0.75 for T = 20. Shively (1988) also examined point-optimal tests in the UC
model with an intercept and suggested using the MPI tests tangent to the power
envelope at, alternatively, powers of 50 percent and 80 percent, respectively corre-
sponding to q̄ = 0.023 and 0.079 for T = 51.

4.2.2. Tests of the general I(0) null

Because economic theory rarely suggests that an error term is i.i.d. Gaussian, the
Gaussian MA(1) and i.i.d. UC models analyzed in the previous subsection are too
special to be of interest in most empirical applications. While the asymptotic null
distributions of the L^μ and L^τ statistics obtain under weaker conditions than
Gaussianity, such as u_t = ε_t where ε_t satisfies (2.2), these statistics are not asymptoti-
cally similar under the general I(0) null in which u_t is weakly dependent and satisfies
(2.1)-(2.3). A task of potential practical importance, therefore, is to relax this
assumption and to develop tests which are valid under the more general assumption
that u_t is I(0).
The two main techniques which have been used to develop tests of the general
I(0) null parallel those used to extend autoregressive unit root tests from the AR(1)
model to the general I(1) model. The first, motivated by analogy to the way that
Phillips and Perron (1988) handled the nuisance parameter ω in their unit root tests,
is to replace the estimator of the variance of u_t in statistics such as L^μ and L^τ with
an estimator of the spectral density of u_t at frequency zero; this produces modified
L^μ and L^τ statistics.²⁴ The second, used by Saikkonen and Luukkonen (1993a,
1993b), is to transform the series using an estimated ARMA process for u_t.
The device of Section 3, in which unit root tests were represented as functionals of
the levels process of the data, can be applied in this problem to provide a general
treatment of those tests of the I(0) null which involve an explicit correction using
an estimated spectral density. The main modification is that, in the I(0) case, the
tests are represented as functionals of the accumulated levels process rather than
the levels process itself. This general treatment produces, as special cases, the
extended L^μ and L^τ statistics and the variable addition test statistics, G(p, q),
proposed by Park and Choi (1988).

²⁴This approach was proposed by Park and Choi (1988) to generalize their variable addition tests,
discussed in the subsequent paragraphs, to the general I(0) null. It was used by Tanaka (1990b) to extend
the L^μ statistic to the general I(0) null. [Tanaka's (1990b) expression (7) is asymptotically equivalent to
(4.7).] Kwiatkowski et al. (1992) used this approach to extend the L^τ statistic to the general I(0) null.
Park and Choi's (1988) G(p, q) statistic arises from supposing that the true trend
is (at most) a pth order polynomial. The detrending regression is then intentionally
overspecified, including polynomials of order q where q > p. If u_t is I(0), then the
OLS estimators of the coefficients on these additional q − p trends are consistent
for zero. If, however, u_t is I(1), then the regression of y_t on the full set of
trends introduces spurious detrending, as discussed in Example 2 of Section 2.3,
and the LR test will reject the null hypothesis that the true coefficients on
(t^{p+1}, …, t^q) are zero. These two observations suggest considering the modified LR
statistic, G(p, q) = T(σ̂_p² − σ̂_q²)/σ̂_q², where σ̂_p² and σ̂_q², respectively, are the mean
squared residuals from the regression of y_t onto (1, t, …, t^p) and of y_t onto (1, t, …, t^q).
In functional notation, the L- and G(p, q)-tests have the representations

L:  L = g_L(f),  g_L(f) = ∫_0^1 f²,  (4.9a)

G(p, q):  G(p, q) = g_G(f),  g_G(f) = Σ_{j=p+1}^q (∫_0^1 h_j df)²,  (4.9b)

where f denotes the random function being evaluated and h_j is the jth Legendre
polynomial on the unit interval.
As the representations (4.9) make clear, to study the limiting behavior of the
statistics it suffices to study the behavior of the function being evaluated and then
to apply the CMT. Under the general I(0) null, T^{-1/2}Σ_{s=1}^{[Tλ]}u_s ⇒ ωW, so that the
general detrended process Y_T^d (defined in Section 2.3, Example 3) has the limit
Y_T^d/σ̂_ε ⇒ (ω/σ_ε)B^{(p)}, which depends on the nuisance parameter ω²/σ_ε². This suggests
modifying these statistics by evaluating the functionals using V_T, where

V_T(λ) = ω̂_SC^{-1} T^{-1/2} Σ_{s=1}^{[Tλ]} y_s^d.  (4.10)

If ω̂²_SC is consistent for ω² then V_T ⇒ B^{(p)} and the asymptotic distributions of the
statistics (4.9) will not depend on any nuisance parameters under the general I(0)
null. The SC estimator ω̂²_SC is used for this purpose by Park and Choi (1988), Tanaka
(1990b), Kwiatkowski et al. (1992) and Stock (1992). The rate conditions l_T → ∞ and
l_T = o(T^{1/2}) are sufficient to ensure consistent estimation of ω under the null and,
as is discussed in the next subsection, test consistency under a (fixed) alternative.
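A minimal sketch of the modified L^μ statistic in the demeaned case makes the construction concrete: the only change relative to (4.7) is that (σ̂_ε^μ)² is replaced by an SC estimator of ω². The Bartlett kernel and the lag rule used below are illustrative choices, not those of any one of the cited studies.

import numpy as np

def omega2_sc(u, l):
    # Bartlett-kernel (SC) estimator of the spectral density of u_t at
    # frequency zero (times 2*pi), using l sample autocovariances.
    T = len(u)
    w2 = u @ u / T
    for m in range(1, l + 1):
        gamma = u[m:] @ u[:-m] / T
        w2 += 2.0 * (1.0 - m / (l + 1)) * gamma
    return w2

def modified_L_mu(y, l=None):
    # Modified LMPI statistic for the general I(0) null, demeaned case.
    T = len(y)
    u = y - y.mean()                                # y_t^mu
    if l is None:
        l = int(np.floor(4 * (T / 100) ** 0.25))    # illustrative lag rule
    S = np.cumsum(u)                                # accumulated process
    return (S @ S) / (T ** 2 * omega2_sc(u, l))

# Under the I(0) null the statistic converges to the integral of a squared
# Brownian bridge; large values reject I(0) in favor of I(1).
rng = np.random.default_rng(2)
print(modified_L_mu(rng.standard_normal(200)))             # I(0) data
print(modified_L_mu(np.cumsum(rng.standard_normal(200))))  # I(1) data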

A second approach to extending the MA(1) tests to the general I(0) null, developed
by Saikkonen and Luukkonen (1993a, 1993b), involves modifying the test statistic
by, in effect, filtering the data. Saikkonen and Luukkonen (1993a) consider the
Gaussian LMPI unbiased (LMPIU) test under an ARMA(p, q) model for the I(0)
errors u = (u_1, …, u_T)′. Let the covariance matrix of u be Σ_u, and assume that Σ_u were
known. With Σ_u known, under the null hypothesis β_0 can be estimated by GLS,
yielding the estimator β̂_0; let û_t denote the residuals y_t − β̂_0. The transformed GLS
residuals are then given by ẽ = Σ_u^{-1/2}û. The Gaussian LMPIU test in this model is
a ratio of quadratic forms in ẽ analogous to (4.6), where the covariance matrix in
the numerator is evaluated under the θ = 1 null. In practice, the parameters of the
short-run ARMA process used to construct Σ_u must be estimated; see Saikkonen
and Luukkonen (1993a) for the details. Saikkonen and Luukkonen (1993b) apply
this approach to extend the finite-sample point-optimal invariant tests of the form
(4.6) to general I(0) errors in the d_t = β_0 case and to derive the asymptotic
distribution of these tests under the null and local alternatives.²⁵

4.2.3. Consistency and local asymptotic power

Consistency. The statistics with the functional representations (4.9) reject for large
values of the statistic. It follows that the tests based on the modified L^μ and L^τ
statistics and on the G(p, q) statistics are consistent if V_T →_p ∞ under the I(1)
alternative. Consider, first, the case d_t = 0, so that the numerator of V_T in (4.10) is
T^{-1/2}Σ_{s=1}^{[Tλ]}u_s. Under the I(1) alternative this cumulation is I(2) and T^{-3/2}Σ_{s=1}^{[Tλ]}u_s ⇒
ω∫_0^λ W(s) ds. Similarly, if l_T → ∞ but l_T = o(T), then ω̂²_SC has the limit
ω̂²_SC/[TΣ_{m=−l_T}^{l_T} k(m/l_T)] ⇒ ω²∫W².²⁶ Combining these two results, we have

N_T^{-1/2} V_T ⇒ V*,  where V*(λ) = ∫_0^λ W(s) ds / {∫_0^1 W(s)² ds}^{1/2},  (4.11)

and N_T = T/Σ_{m=−l_T}^{l_T} k(m/l_T). Because the kernel k is bounded and l_T = o(T),
N_T → ∞ as T → ∞, so under the fixed I(1) alternative V_T →_p ∞. Thus, tests which are
continuous functionals of V_T and which reject I(0) in favor of I(1) for large
realizations of V_T will be consistent against the fixed I(1) alternative.

²⁵Bierens and Guo (1993) used a different approach to develop a test of the general I(0) null against
the I(1) alternative, in which the distribution under the null is made free of nuisance parameters not by
explicit filtering or estimation of the spectral density at frequency zero, but rather by using a weighted
average of statistics in which the weights are sensitive to whether the I(0) or I(1) hypothesis is true.
²⁶Suppose that d_t = 0 and that the SC estimator is constructed using a fixed number l of autocovariances
of y_t, rather than letting l_T → ∞; this would be appropriate were the MA order of u_t finite and known a
priori. If y_t is I(1), T^{-2}Σ_{t=i+1}^T y_t y_{t−i} − T^{-2}Σ_{t=1}^T y_t² →_p 0, i = 1, …, l, and moreover T^{-2}Σ_{t=1}^T y_t² ⇒ ω²∫W².
It follows by direct calculation that ω̂²_SC/[TΣ_{m=−l}^l k(m/l)] ⇒ ω²∫W². The proof for the general SC
estimator entails extending this result from fixed l to a sequence of l_T increasing sufficiently slowly. For
details see Phillips (1991b, Appendix) for the d_t = 0 case; Kwiatkowski et al. (1992) for OLS detrending
with a constant or a linear time trend; Perron (1991a) for general polynomial trends estimated by OLS;
and Stock (1992) under general trend conditions including an estimated broken trend.

This is readily extended to general trends. For example, in the case d_t = β_0,
N_T^{-1/2}V_T^μ(·) ⇒ V*^μ, where V*^μ(λ) = ∫_0^λ W^μ(s) ds/{∫_0^1 (W^μ)²}^{1/2}. In the detrended case,
N_T^{-1/2}V_T^τ(·) ⇒ V*^τ, where V*^τ(λ) = ∫_0^λ W^τ(s) ds/{∫_0^1 (W^τ)²}^{1/2}, where W^τ is the OLS-
detrended Brownian motion in (2.13b). This, in turn, implies that, under the fixed
I(1) alternative, V_T^μ and V_T^τ →_p ∞, and consistency of the test statistics in (4.9) follows
directly.

Local asymptotic power. We examine local asymptotic power using a local version
of the UC model (4.2):

y_t = d_t + u_t,  u_t = u_{0t} + H_T u_{1t},  (4.12)

where u_{0t} and u_{1t} are respectively I(0) and I(1) in the sense that (u_{0t}, Δu_{1t}) satisfy
(2.1)-(2.3), and where H_T = h/T, where h is a constant. Because H_T → 0, the I(1)
component of y_t in (4.12) vanishes asymptotically, so that (4.12) provides a model
in which y_t is a local-to-I(0) process.
For concreteness and to make the link between the UC and MA models precise,
we will work with the special case of (4.12) in which u_{0t} and Δu_{1t} are mutually and
serially uncorrelated and have the same variance. Then the local-to-I(0) UC model
(4.12) has the MA(1) representation, Δu_t = ε_t − θ_Tε_{t−1}, where θ_T = 1 − h/T + o(T^{-1})
and where ε_t is serially uncorrelated. Thus, the local-to-I(0) parameterization
H_T = h/T is asymptotically equivalent to a local-to-unit MA root with the nesting
θ_T = 1 − h/T.²⁷
To analyze the local power properties of the tests in (4.9), we obtain a limiting
representation of V_T under the local-to-I(0) model. First consider the case d_t = 0. The
behavior of the numerator follows from the FCLT and CMT. Define the indepen-
dent Brownian motions W_0 and W_1, respectively, as the limits T^{-1/2}Σ_{s=1}^{[T·]}u_{0s} ⇒
ω_0W_0(·) and T^{-1/2}u_{1,[T·]} ⇒ ω_1W_1(·). By assumption, ω_0 = ω_1, so T^{-1/2}Σ_{s=1}^{[Tλ]}u_s =
T^{-1/2}Σ_{s=1}^{[Tλ]}u_{0s} + hT^{-3/2}Σ_{s=1}^{[Tλ]}u_{1s} ⇒ ω_0U_h(λ), where U_h(λ) = W_0(λ) + h∫_0^λ W_1(s) ds. It
can additionally be shown that, in the local-to-I(0) model (4.12), the SC estimator has
the limit ω̂²_SC →_p ω_0² [Elliott and Stock (1994, Theorem 2)]. Thus V_T ⇒ U_h, from
which it follows that the statistics in (4.9) have the local asymptotic representation
g(U_h) for their respective g functionals.

²⁷The rate T for this local nesting is consistent with the asymptotic results in the unit MA root and
UC test literatures, which in general find that this nesting is an appropriate one for studying rates of
convergence of the MA estimators and/or the local asymptotic power of tests. In the MA unit root
literature, see Sargan and Bhargava (1983b), Anderson and Takemura (1986), Tanaka and Satchell (1989),
Tanaka (1990b) and Saikkonen and Luukkonen (1993b); in the UC literature, see Nyblom and Mäkeläinen
(1983), Nyblom (1986, 1989) and Nabeya and Tanaka (1988).

Table 3
Power of MA unit root tests
[5 percent level tests, demeaned case (d_t = β_0)].

h      PE      L^μ     POI(0.5)   G(0,1)   G(0,2)

A. T= 50
1 0.064 0.064 0.061 0.058 0.059
2 0.103 0.101 0.095 0.087 0.089
5 0.304 0.299 0.298 0.279 0.268
10 0.631 0.570 0.633 0.583 0.596
15 0.823 0.717 0.815 0.745 0.776
20 0.909 0.803 0.898 0.823 0.864
30 0.976 0.878 0.960 0.890 0.932
40 0.992 0.914 0.979 0.912 0.954
B. T= 100
1 0.064 0.064 0.066 0.065 0.060
2 0.106 0.107 0.101 0.103 0.090
5 0.332 0.319 0.321 0.311 0.289
10 0.659 0.605 0.664 0.623 0.629
15 0.845 0.765 0.841 0.779 0.807
20 0.931 0.852 0.919 0.857 0.890
30 0.985 0.923 0.974 0.924 0.952
40 0.996 0.958 0.991 0.944 0.974
C. T=200
1 0.062 0.063 0.064 0.062 0.063
2 0.102 0.104 0.099 0.097 0.095
5 0.314 0.309 0.316 0.305 0.299
10 0.669 0.605 0.667 0.621 0.640
15 0.851 0.758 0.841 0.779 0.811
20 0.937 0.847 0.922 0.854 0.894
30 0.988 0.934 0.980 0.924 0.956
40 0.998 0.965 0.995 0.950 0.976
D. T= 1000
1 0.061 0.062 0.057 0.055 0.057
2 0.099 0.102 0.087 0.086 0.081
5 0.329 0.321 0.310 0.296 0.283
10 0.661 0.613 0.663 0.624 0.632
15 0.853 0.717 0.843 0.789 0.811
20 0.944 0.866 0.929 0.871 0.900
30 0.992 0.948 0.985 0.937 0.963
40 0.999 0.978 0.996 0.962 0.981

Data were generated according to the unobserved components
model, y_t = u_t, where u_t = u_{0t} + H_Tu_{1t}, (u_{0t}, Δu_{1t}) are i.i.d. N(0, 1)
and H_T = h/T. PE denotes the power envelope. The remaining tests
are based on the indicated statistics as defined in the text. Based on
5000 Monte Carlo repetitions.
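An entry of Table 3 can be approximated by direct simulation of the local UC model; the following sketch computes the power of the L^μ statistic, here with the i.i.d.-case variance estimator rather than the SC estimator, and takes the 5 percent asymptotic critical value 0.463 for ∫(B^μ)² [Kwiatkowski et al. (1992, Table 1)] as given. The replication count and seed are illustrative.

import numpy as np

def L_mu(y):
    # L^mu of (4.7) with the i.i.d. variance estimator.
    u = y - y.mean()
    S = np.cumsum(u)
    return (S @ S) / (len(y) ** 2 * (u @ u / len(y)))

def power(h, T=100, nrep=5000, crit=0.463, seed=3):
    # Rejection rate under y_t = u_{0t} + (h/T) u_{1t} with i.i.d. N(0,1)
    # components, matching the design in the notes to Table 3.
    rng = np.random.default_rng(seed)
    rej = 0
    for _ in range(nrep):
        u0 = rng.standard_normal(T)
        u1 = np.cumsum(rng.standard_normal(T))
        rej += L_mu(u0 + (h / T) * u1) > crit
    return rej / nrep

for h in (0, 5, 10, 20):
    print(h, power(h))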

The power functions for the statistics in (4.9), along with the power envelope, are
summarized in Table 3 (for the case d_t = β_0) and Table 4 (for the case d_t = β_0 + β_1t). The
power functions were computed by Monte Carlo simulation for various values of
T, so technically all the power functions are finite-sample, although the simulations
suggest that the T = 1000 power is effectively the asymptotic local power.²⁸

Table 4
Power of MA unit root tests
[5 percent level tests, detrended case (d_t = β_0 + β_1t)].

h      PE      L^τ     POI(0.5)   G(1,2)   G(1,3)

A. T= 50
1 0.053 0.052 0.052 0.055 0.055
2 0.062 0.062 0.061 0.062 0.062
5 0.139 0.132 0.131 0.117 0.118
10 0.349 0.322 0.348 0.259 0.299
15 0.585 0.513 0.578 0.379 0.477
20 0.744 0.659 0.746 0.469 0.603
30 0.905 0.810 0.899 0.569 0.733
40 0.958 0.880 0.954 0.618 0.795
B. T= 100
1 0.055 0.055 0.053 0.052 0.051
2 0.064 0.063 0.063 0.060 0.060
5 0.127 0.136 0.130 0.121 0.113
10 0.359 0.349 0.359 0.277 0.298
15 0.610 0.560 0.609 0.414 0.491
20 0.775 0.704 0.777 0.507 0.632
30 0.939 0.864 0.928 0.629 0.777
40 0.984 0.928 0.975 0.689 0.842
C. T=200
1 0.055 0.054 0.049 0.053 0.050
2 0.063 0.064 0.056 0.062 0.062
5 0.136 0.136 0.122 0.111 0.125
10 0.369 0.357 0.362 0.269 0.323
15 0.613 0.569 0.610 0.415 0.517
20 0.785 0.718 0.776 0.521 0.655
30 0.948 0.880 0.935 0.640 0.792
40 0.989 0.950 0.980 0.708 0.867
D. T= 1000
1 0.051 0.052 0.054 0.052 0.052
2 0.060 0.059 0.063 0.062 0.060
5 0.131 0.127 0.123 0.120 0.116
10 0.370 0.353 0.366 0.323 0.335
15 0.629 0.576 0.624 0.521 0.554
20 0.815 0.743 0.805 0.671 0.712
30 0.963 0.905 0.953 0.825 0.868
40 0.994 0.969 0.990 0.893 0.942

See the notes to Table 3.

In addition, the power of the point-optimal tests which are tangent to the power
envelope at power of 50 percent are reported. In the demeaned case, this test was
suggested by Shively (1988), and the test is the MPI test against the local alternative
h = 7.74. In the detrended case, calculations suggest that the power envelope attains
50 percent at approximately h = 13, so the MPI test against the local alternative
h = 13 is reported. This test, the POI(0.5) statistic in Table 4, is almost the same test
as was proposed by Franzini and Harvey (1983): if the local-to-I(0) asymptotics are used
to interpret their recommendations (which were based on a Monte Carlo experiment
with T = 20), then the Franzini-Harvey statistic is the point-optimal invariant test
which is point-optimal against the local alternative h ≅ 17. (Interpreted thus, the
Franzini-Harvey statistic is asymptotically MPI at a power of approximately 70
percent.) These tables summarize the power findings of Nyblom and Mäkeläinen
(1983), Nyblom (1986), Shively (1988), Tanaka (1990b) and Saikkonen and Luukkonen
(1993a, 1993b).

²⁸Tabulations of exact power functions and the finite-sample power envelope under the Gaussian
model appear in several places in the literature. Those tabulations are based on the Imhof algorithm.
When results in the literature are directly comparable to those in Tables 3 and 4, they agree to within
two decimals. For results in the demeaned UC model, see Nyblom and Mäkeläinen (1983) and Shively
(1988); for tabulations in the detrended UC model, see Nyblom (1986). Tanaka (1990b) tabulates both
finite-sample and limiting powers of the L^μ statistic, where the latter is computed by inverting numerically
its limiting characteristic function [Tanaka (1990b, Theorem 2)]. Tanaka's limiting power for L^μ agrees
with the T = 1000 powers in Table 3 to within the Monte Carlo error.
Five main conclusions emerge from these tables. First, the convergence of the
finite-sample powers to the asymptotic limits appears to be relatively fast, in the
sense that the T = 100 powers and T = 1000 powers typically differ by less than 0.02.
Second, as was the case with tests for a unit autoregressive root, the powers deteriorate
as the order of detrending increases from demeaning to linear detrending, particularly
for alternatives of h near zero. For example, the L^μ statistic has a limiting power of
0.61 against h = 10, while the corresponding power for the L^τ statistic is 0.35. Third,
the point-optimal tests perform better than the LMPIU test against all but the
closest alternatives. Fourth, although the Park-Choi G(p, p + 1) and G(p, p + 2)
tests are strictly below the power envelope, they nonetheless perform rather well
and in particular have power curves only slightly below the L^μ and L^τ statistics.
Fifth, it is important to emphasize that all these differences are rather modest in
comparison to the large differences in powers found among the various tests for a
unit AR root. For example, the Pitman efficiency of the L^μ statistic relative to the
MPI test at power = 50 percent is approximately 1.1, indicating a loss of the
equivalent of only 10 percent of the sample if the L^μ statistic is used in this case
rather than the MPI test.

4.2.4. Finite-sample size and power

A small Monte Carlo experiment was performed to examine the finite-sample size
and power of tests of the I(0) null. Unlike for tests of the I(1) null, as of this writing
there have been few Monte Carlo investigations of tests of the general I(0) null;
exceptions include Amano and van Norden (1992) and Kwiatkowski et al. (1992). The
simulation here summarizes the results of these two studies for the L^τ statistic by
using a similar design (autoregressive errors) and extends them to include the
Park-Choi G(p, p + 2) statistics and to examine the effect of kernel choice on test
performance.
In the d_t = β_0 case, the experiment considers the modified L^μ and G(0,2) statistics
(based on V_T^μ); in the d_t = β_0 + β_1t case, the statistics are the modified L^τ and G(1,3)
statistics (based on V_T^τ). The spectral density was estimated using two SC spectral
estimators with a truncated automatic bandwidth selector. The automatic bandwidth
is l_T = min[l̂_T, 12(T/100)^{1/4}], where l̂_T is Andrews' (1991) automatic selector based
on an estimated AR(1) model. The two kernels are the Parzen kernel and the QS
kernel, the latter being Andrews' (1991) optimal kernel, and the appropriate selector
for each kernel is used. [The automatic bandwidth selector is truncated because,
unless l̂_T is bounded in the I(1) case, it does not satisfy the o(T^{1/2}) rate condition
needed for consistency as described in Section 4.2.3.]
The pseudo-data were generated so that u_t followed the AR(1),

y_t = u_t,  Δu_t = (1 − θL)v_t,

where v_t = ρv_{t−1} + ε_t,  ε_t i.i.d. N(0, 1),  (4.13)

where u_0 = 0 and v_0 is drawn from its unconditional distribution. When |ρ| < 1 and
θ = 1, y_t is I(0) and the experiment examines the size of the test. When |ρ| < 1 and
|θ| < 1, y_t is I(1). When ρ = 0, this is the MA(1) model and corresponds to the
local-to-unity model (4.12) with (u_{0t}, Δu_{1t}) mutually and serially uncorrelated with
the same variance, in which case θ = 1 − h/T + o(T^{-1}).
Empirical size (in italics) and size-adjusted power are presented for T = 100 in
Table 5 (the demeaned case) and Table 6 (the detrended case). Size-adjusted power
in a (ρ, θ) design, |θ| < 1, is computed using the 5 percent empirical quantile for
(ρ, θ = 1) for each value of ρ.
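The size-adjustment step is mechanical and is worth stating explicitly: for each value of ρ, the 5 percent empirical critical value is computed from the draws generated under the null (θ = 1) and then applied to the draws generated under the alternative. A sketch with placeholder draws (the statistics and shapes are illustrative):

import numpy as np

def size_adjusted_power(stats_null, stats_alt, level=0.05):
    # Rejection rate under the alternative, evaluated at the null's
    # empirical quantile; these tests reject for large values.
    crit = np.quantile(stats_null, 1.0 - level)
    return np.mean(stats_alt > crit)

rng = np.random.default_rng(4)
stats_null = rng.chisquare(1, 5000)        # placeholder null draws
stats_alt = 2.0 * rng.chisquare(1, 5000)   # placeholder alternative draws
print(size_adjusted_power(stats_null, stats_alt))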
These results suggest three conclusions. First, the choice of spectral estimator
matters for size, less so for size-adjusted power. For example, if the Parzen kernel
is used, the size deteriorates substantially when the serial correlation is large
(ρ = 0.9). [If the Bartlett kernel is used, as suggested by Tanaka (1990b) and
Kwiatkowski et al. (1992), similar size distortions arise (results not shown in these
tables).] In contrast, the size is much better controlled using the QS kernel. This is
true for both of the statistics examined, in both the demeaned and detrended cases.
On the other hand, the size-adjusted powers for both statistics in both cases are
comparable for the two spectral estimators. Interestingly, for distant alternatives
the size-adjusted power declines in the ρ = 0 case for the demeaned statistics, and
the decline is more pronounced for the QS statistics.
Second, a comparison of the results in Tables 3 and 4 with those in Tables 5 and
6, respectively, reveals that when ρ = 0 the finite-sample size-adjusted power is fairly
close to the power predicted by the local-to-I(0) asymptotics of Section 4.2.3, at least
for close and moderately close alternatives. At least in the ρ = 0 case, the use of the
SC estimator seems to have little impact on either size or power. However, size-
adjusted power deteriorates sharply as the autoregressive nuisance parameter
increases towards one. Interestingly, detrending makes little difference in terms of
size. This is noteworthy, given the large impact of detrending in the I(1) test
situations.

Table 5
Size and size-adjusted power of selected tests of the I(0) null: Monte Carlo results
[5 percent level tests, demeaned case (d_t = β_0), T = 100]
[Data generating process: (1 − ρL)Δy_t = (1 − θL)ε_t, ε_t i.i.d. N(0, 1)].

Test                        Asymptotic        ρ =
statistic          θ        power      0.0    0.9    0.75   0.5    −0.5

L^μ                1.00     0.05       0.05   0.26   0.10   0.06   0.04
P(auto)            0.95     0.32       0.29   0.26   0.25   0.25   0.29
                   0.90     0.61       0.55   0.43   0.46   0.47   0.56
                   0.80     0.87       0.69   0.53   0.58   0.60   0.79
                   0.70     0.95       0.68   0.55   0.61   0.64   0.87

L^μ                1.00     0.05       0.05   0.11   0.05   0.06   0.04
QS(auto)           0.95     0.32       0.30   0.24   0.21   0.24   0.29
                   0.90     0.61       0.57   0.39   0.36   0.39   0.56
                   0.80     0.87       0.72   0.46   0.47   0.43   0.80
                   0.70     0.95       0.67   0.49   0.49   0.44   0.88

G(0,2)             1.00     0.05       0.05   0.29   0.10   0.05   0.03
P(auto)            0.95     0.28       0.28   0.26   0.24   0.23   0.26
                   0.90     0.63       0.58   0.48   0.49   0.49   0.57
                   0.80     0.90       0.76   0.62   0.66   0.67   0.81
                   0.70     0.96       0.76   0.65   0.70   0.72   0.88

G(0,2)             1.00     0.05       0.05   0.07   0.04   0.06   0.04
QS(auto)           0.95     0.28       0.29   0.19   0.16   0.21   0.26
                   0.90     0.63       0.60   0.37   0.33   0.34   0.57
                   0.80     0.90       0.78   0.49   0.47   0.37   0.81
                   0.70     0.96       0.72   0.52   0.52   0.38   0.88

For each statistic, the first row of entries are the empirical rejection rates under the null, that is, the
empirical size of the test, based on the asymptotic critical value. The remaining entries are the
size-adjusted power for the model given in the column heading. The column "Asymptotic power" gives
the T = 1000 rejection rate for that statistic from Table 3, using θ_T = 1 − h/T. The entry below the name
of each statistic indicates the spectral density estimator used. P(auto) and QS(auto) refer to the SC
estimator, computed respectively using the Parzen and QS kernels, each with lag length chosen by the
respective truncated automatic selector in Andrews (1991). Based on 5000 Monte Carlo repetitions.

Third, the differences in size-adjusted power across test statistics are modest.
Because of its better size performance, we restrict the discussion to the results for
the QS kernel. In the demeaned case, G(0,2) has somewhat better size-adjusted
power than the modified L̂ statistic for distant alternatives when u_t is positively
correlated; for θ near one, the modified L̂ statistic is more powerful. In the
detrended case, G(1,3) and the modified L̂ statistic have essentially the same size-adjusted
powers.

4.2.5. Summary and implications for empirical practice

The literature on tests of the general I(0) null against the I(1) alternative is still young.
Subject to this caveat, the results here suggest several observations. The asymptotic
power analysis of Section 4.2.3 suggests that there is little room for improvement
on the performance of the currently proposed tests, at least in terms of local
asymptotic power.
Table 6
Size and size-adjusted power of selected tests of the I(0) null: Monte Carlo results
[5 percent level tests, detrended case (d_t = β_0 + β_1 t), T = 100]
[Data generating process: (1 − ρL)Δy_t = (1 − θL)ε_t, ε_t i.i.d. N(0,1)]

Test                     Asymptotic                    ρ =
statistic        θ       power        0.0     0.9     0.75    0.5     −0.5

L̂               1.00    0.05         0.05    0.29    0.11    0.06    0.04
P(auto)          0.95    0.13         0.13    0.12    0.12    0.12    0.12
                 0.90    0.35         0.34    0.23    0.25    0.25    0.32
                 0.80    0.74         0.62    0.36    0.41    0.43    0.65
                 0.70    0.91         0.64    0.40    0.47    0.50    0.81

L̂               1.00    0.05         0.05    0.10    0.05    0.06    0.04
QS(auto)         0.95    0.13         0.13    0.10    0.09    0.12    0.13
                 0.90    0.35         0.35    0.19    0.16    0.22    0.33
                 0.80    0.74         0.65    0.28    0.25    0.25    0.67
                 0.70    0.91         0.68    0.30    0.28    0.23    0.83

G(1,3)           1.00    0.05         0.04    0.30    0.12    0.07    0.04
P(auto)          0.95    0.12         0.12    0.12    0.11    0.11    0.11
                 0.90    0.34         0.31    0.25    0.24    0.23    0.28
                 0.80    0.71         0.58    0.40    0.43    0.43    0.59
                 0.70    0.87         0.62    0.46    0.49    0.51    0.73

G(1,3)           1.00    0.05         0.04    0.13    0.07    0.07    0.04
QS(auto)         0.95    0.12         0.12    0.10    0.08    0.10    0.11
                 0.90    0.34         0.32    0.18    0.15    0.20    0.29
                 0.80    0.71         0.61    0.28    0.24    0.25    0.60
                 0.70    0.87         0.64    0.32    0.28    0.24    0.74

See the notes to Table 5.

The various tests have asymptotic relative efficiencies fairly close
to one, and the point-optimal tests (the Shively and Franzini-Harvey tests),
interpreted in the local-to-I(0) asymptotic framework, have power functions that are
close to the power envelope for a large range of local alternatives.
The Monte Carlo results suggest, however, that there remains room for improve-
ment in the finite-sample performance of these tests. With the Parzen kernel, the
tests exhibit large size distortions; with the QS kernel, the size distortions are
reduced but the finite-sample power can be well below its asymptotic limit. For
autoregressive parameters not exceeding 0.75, both the G(p, p + 2) and L̂ statistics,
evaluated using the QS(auto) kernel, have Monte Carlo sizes near their asymptotic
levels and have comparable power.

5. Structural breaks and broken trends

This section examines two topics: structural breaks and parameter instability in
time series regression; and tests for a unit root when there are kinks or jumps in the
deterministic trend (the broken-trend model). At first glance these problems seem
quite different. However, there are close mathematical and conceptual links which
this section aims to emphasize. Mathematically, a multidimensional version of the
FCLT plus CMT approach of Section 2 is readily applied to provide asymptotic
representations for a variety of tests of parameter stability. [An early and
sophisticated application of the FCLT to the change-point problem can be found
in MacNeill (1974).] Conceptually, the unobserved components model with a small
independent random walk component is in fact a special case of the more general
time-varying-parameter model. Also, these topics recently have become intertwined
in empirical investigations into unit roots when one maintains the possibility that
the deterministic component has a single break, for example is a piecewise-linear
time trend.
Section 5.1 addresses testing for and, briefly, estimation of parameter instability
in time series regression with I(0) regressors, including the case when there are lagged
dependent I(0) variables and, in particular, stationary autoregressions. The main
empirical application of these tests is as regression diagnostics and, as an example
in Section 5.1.4, the tests are used to assess the stability of the link between various
monetary aggregates and output in the U.S. from 1960 to 1992. The literature on
parameter instability and structural breaks is vast, and the treatment here provides
an introduction to the main applications in econometric time series regression from
a classical perspective. The distribution theory for the tests is nonstandard. Here,
the alternatives of interest have parameters which are unidentified under the null
hypothesis; for example, in the case of a one-time change in a coefficient, under the
null of no break the magnitude of the change is zero and the break date is
unidentified. Davies (1977) showed that, if parameters are unidentified under the
null, standard χ² inference does not obtain, and many of the results in Section 5.1
can be seen as special cases of this more general problem. For further references on
parameter instability and breaks, the reader is referred to the reviews and
bibliographies in Hackl and Westlund (1989), Krishnaiah and Miao (1988), Krämer
and Sonnberger (1986) and, for Bayesian work in this area, Zacks (1983) and Barry
and Hartigan (1993).
Section 5.2 turns to inference about the largest root in univariate autoregression
under the maintained hypothesis that there might be one-time breaks or jumps in
the deterministic component. In innovative papers, Perron (1989a, 1990b) and
Rappoport and Reichlin (1989) independently suggested that the broken-trend
model provides a useful description of a wide variety of economic time series. Perron
(1989a) argued, inter alia, that U.S. postwar real GNP is best modeled as being I(0)
around a piecewise-linear trend with a break in 1973, and Rappoport and Reichlin
(1989) argued that U.S. real GNP from 1909-1970 [the Nelson-Plosser (1982) data]
was stationary around a broken trend with a break in 1940. These results seem to
suggest that the long-term properties of output are determined not by unit-root
dynamics, but rather by rare events with lasting implications for mean long-term
growth, such as World War II and the subsequent shift to more activist governmental
economic policy, or the oil shock and productivity slowdown of the mid-1970s.
Whether this view is upheld statistically is a topic of ongoing debate in which the
tests of Section 5.2 play a central role.

5.1. Breaks in coefficients in time series regression

5.1.1. Tests for a single break date

Suppose y_t obeys the time series regression model

y_t = β_t′X_{t−1} + ε_t,    (5.1)

where under the null hypothesis β_t = β for all t. Throughout Section 5.1, unless
explicitly stated otherwise, it is maintained that ε_t is a martingale difference sequence
with respect to the σ-fields generated by {ε_{t−1}, X_{t−1}, ε_{t−2}, X_{t−2}, ...}, where X_t is a
k × 1 vector of regressors, which are here assumed to be constant and/or I(0) with
EX_tX_t′ = Σ_X and, possibly, a nonzero mean. For convenience, further assume that
ε_t is conditionally (on lagged ε_t and X_t) homoskedastic. Also, assume that
T^{−1} Σ_{s=1}^{[Tλ]} X_sX_s′ →_p λΣ_X uniformly in λ for λ ∈ [0,1]. Note, in particular, that X_{t−1} can
include lagged dependent variables as long as they are I(0) under the null.
The alternative hypothesis of a single break in some or all of the coefficients is

β_t = β, t ≤ r,  and  β_t = β + γ, t > r,    (5.2)

where r, k + 1 < r < T, is the break date (or change point) and γ ≠ 0.
When the potential break date is known, a natural test for a change in β is the
Chow (1960) test, which can be implemented in asymptotically equivalent Wald,
Lagrange multiplier (LM), and LR forms. In the Wald form, the test for a break at
a fraction r/T through the sample is

F_T(r/T) = [SSR_{1,T} − (SSR_{1,r} + SSR_{r+1,T})] / [(SSR_{1,r} + SSR_{r+1,T})/(T − 2k)],    (5.3)

where SSR_{1,r} is the sum of squared residuals from the estimation of (5.1) on
observations 1, ..., r, etc. For fixed r/T, F_T(r/T) has an asymptotic χ²_k distribution
under the null. When the break date is unknown, the situation is more complicated.
One approach might be to estimate the break date, then compute (5.3) for that break.
However, because the change point is selected by virtue of an apparent break at
that point, the null distribution of the resulting test is not the same as if the break
date were chosen without regard to the data. The means of determining r/T must
be further specified before the distribution of the resulting test can be obtained.
A natural solution, proposed by Quandt (1960) for time series regression and
extended by Davies (1977) to general models with parameters unidentified under
the null, is to base inference on the LR statistic, which is the maximal F_T statistic
over a range of break dates r_0, ..., r_1. This yields the Quandt likelihood ratio (QLR)
statistic,²⁹

QLR = max_{r = r_0, ..., r_1} F_T(r/T).    (5.4)

Intuition suggests that this statistic will have power against a change in β even
though the break date is unknown. The null asymptotic distribution of the QLR
statistic remained unknown for many years. The FCLT and CMT, however, provide
ready tools for obtaining this limit. The argument is sketched here; for details, see
Kim and Siegmund (1989) and, for a quite general treatment of sup tests in
nonlinear models, Andrews (1993b).
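Computationally, the statistic amounts to a loop over trimmed break dates; a minimal sketch under the definitions in (5.3) and (5.4) (Python; the 15 percent symmetric trimming is an illustrative choice matching the empirical application of Section 5.1.4):

```python
import numpy as np

def ssr(y, X):
    # Sum of squared OLS residuals from regressing y on X
    e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return e @ e

def qlr_statistic(y, X, trim=0.15):
    # QLR = max over trimmed break dates of the Chow F-statistic (5.3)
    T, k = X.shape
    ssr_full = ssr(y, X)
    r0 = max(int(trim * T), k + 1)          # each subsample must be estimable
    r1 = min(int((1 - trim) * T), T - k - 1)
    best = -np.inf
    for r in range(r0, r1):
        s = ssr(y[:r], X[:r]) + ssr(y[r:], X[r:])
        best = max(best, (ssr_full - s) / (s / (T - 2 * k)))
    return best
```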
To obtain the limiting null distribution of the QLR statistic, let F̃_T(r/T) = SSR_{1,T} −
(SSR_{1,r} + SSR_{r+1,T}) and use (5.1) to write

F̃_T(λ) = −v_T(1)′V_T(1)^{−1}v_T(1) + v_T(λ)′V_T(λ)^{−1}v_T(λ)
         + [v_T(1) − v_T(λ)]′[V_T(1) − V_T(λ)]^{−1}[v_T(1) − v_T(λ)],    (5.5)

where v_T(λ) = T^{−1/2} Σ_{t=1}^{[Tλ]} X_{t−1}ε_t and V_T(λ) = T^{−1} Σ_{t=1}^{[Tλ]} X_{t−1}X_{t−1}′. Because ε_t is a
martingale difference sequence, X_{t−1}ε_t is a martingale difference sequence. Addi-
tionally, assume throughout Section 5.1 that X_{t−1} has sufficiently limited dependence
and enough moments for X_{t−1}ε_t to satisfy a multivariate martingale difference
sequence FCLT, so v_T(·) ⇒ σ_ε Σ_X^{1/2} W_k(·), where W_k is a k-dimensional standard
Brownian motion. Also, recall that by assumption V_T(λ) →_p λΣ_X uniformly in λ. By
applying these two limits to the second expression in (5.5), one obtains

F̃_T(·) ⇒ σ_ε² F*(·),    (5.6)
Ch. 46: Unit Roots. Structural Breaks and Trends 2809

where

F*(λ) = −W_k(1)′W_k(1) + W_k(λ)′W_k(λ)/λ + [W_k(1) − W_k(λ)]′[W_k(1) − W_k(λ)]/(1 − λ)
      = B_k(λ)′B_k(λ)/[λ(1 − λ)],

where B_k(λ) = W_k(λ) − λW_k(1), so that B_k is a k-dimensional Brownian bridge.


Because F̃_T ⇒ σ_ε²F* and SSR_{1,T}/(T − k) →_p σ_ε² under the null, (SSR_{1,r} + SSR_{r+1,T})/
(T − 2k) →_p σ_ε² uniformly in r. Thus F_T ⇒ F*. It follows from the continuous
mapping theorem that the QLR statistic has the limiting representation

QLR ⇒ sup_{λ ∈ [λ_0, λ_1]} F*(λ),    (5.7)

where λ_i = lim_{T→∞} r_i/T, i = 0, 1. For fixed λ, F*(λ) has a χ²_k distribution.


Andrews (1993b, Table I) reports asymptotic critical values of the functional in
(5.7), computed by Monte Carlo simulation for a range of trimming parameters and
k = 1, ..., 20. The critical values are much larger than the conventional fixed-break
χ²_k critical values. For example, consider 5 percent critical values with truncation
fractions (λ_0, λ_1) = (0.15, 0.85): for k = 1, the QLR critical value is 8.85, while the χ²_1
value is 3.84; for k = 10, the QLR critical value is 27.03, while the χ²_10 critical value
is 18.3. In practice the researcher must choose the trimming parameters r_0 and r_1.
In some applications the approximate break date might be known and used to
choose r_0 and r_1. Also, with nonnormal errors and small r_0 the fixed-r distribution
of the F_T(r/T) statistic can be far from χ²_k, so one way to control size is to choose r_0
sufficiently large, say r_0/T = 0.15 and r_1/T = 0.85.
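These critical values can be reproduced, up to simulation error, by discretizing the limiting functional in (5.7); a minimal sketch in which the grid size and number of replications are illustrative choices, not those used by Andrews:

```python
import numpy as np

def qlr_asymptotic_cv(k, lam0=0.15, lam1=0.85, n=1000, nrep=20000, level=0.95, seed=0):
    # Simulate sup over [lam0, lam1] of B_k(l)'B_k(l) / (l(1 - l)), per (5.7)
    rng = np.random.default_rng(seed)
    grid = np.arange(1, n + 1) / n
    mask = (grid >= lam0) & (grid <= lam1)
    lam = grid[mask]
    sups = np.empty(nrep)
    for i in range(nrep):
        dW = rng.standard_normal((n, k)) / np.sqrt(n)
        W = dW.cumsum(axis=0)                 # W_k evaluated on the grid
        B = W - grid[:, None] * W[-1]         # Brownian bridge: B_k(l) = W_k(l) - l W_k(1)
        F = (B[mask] ** 2).sum(axis=1) / (lam * (1 - lam))
        sups[i] = F.max()
    return np.quantile(sups, level)
```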
The error process has been assumed to be serially uncorrelated. If it is serially
correlated but uncorrelated with the regressors, then the distribution of the
change-point test differs. In the case of a known break date, this problem is well
studied and the Wald test statistic should be computed using an autocorrelation-
consistent estimator of the covariance matrix; for recent work and discussion of the
literature, see Andrews (1991) and Andrews and Monahan (1992). For the extension
to break tests with unknown break dates, see Tang and MacNeill (1993).

²⁹ Functionals of F_T(λ) other than the supremum are possible. Examples include the average of F_T,
perhaps over a restricted range, as studied by Andrews and Ploberger (1992) and Hansen (1990) [see
Chernoff and Zacks (1964) and Gardner (1969) for historical precedents]. Andrews and Ploberger (1992)
consider tests which maximize weighted average local asymptotic power, averaged over the unidentified
nuisance parameters (here, the break date). The resulting family of optimal tests are weighted averages
of exponentials, with the simple weighted average as the limit for nearby alternatives. The Andrews-
Ploberger (1992) tests are reviewed in the chapter by Andrews in this Handbook.
The derivation of (5.6) assumes that T^{−1/2} Σ_{s=1}^{[Tλ]} X_{s−1}ε_s obeys an FCLT and that
T^{−1} Σ_{s=1}^{[Tλ]} X_sX_s′ →_p λΣ_X uniformly in λ. These assumptions hold if X_t contains a
constant and/or I(0) regressors, but not if X_t is I(1). A sufficient condition for (5.6)
not to hold is that the standard Chow test for fixed r/T does not have an asymptotic
χ² distribution, since F*(λ) has a χ² distribution for any fixed λ. This will occur, in
general, for I(1) regressors (although there are exceptions in cointegrating relations;
see Watson's chapter in this Handbook) and, in these cases, the derivations must
be modified; see Banerjee et al. (1992b), Chu and White (1992) and Hansen (1992)
for examples.
In principle this approach can be extended to more than one break. A practical
difficulty is that the computational demands increase exponentially in the number
of breaks (all values of the two-break F-statistic need to be computed for break
dates (r,s) over the range of r and s), which makes evaluating the limiting
distributions currently difficult for more than two or three break dates. More
importantly, positing multiple exogenous breaks raises the modeling question of
whether the breaks are better thought of as stochastic or as the result of a continuous
process. Indeed, this line of reasoning leads to a formulation in which the parameters
change stochastically in each period by random amounts, which is the time-varying
parameter model discussed in Section 5.1.3.
A related problem is the construction of confidence intervals for the break date.
A natural estimator of the break date is the Gaussian MLE λ̂, which is the value of
λ ∈ (λ_0, λ_1) which maximizes the LR test statistic (5.3). The literature on inference
about the break date is large and beyond the scope of this chapter, and we make
only two observations. First, λ̂ is consistent for λ when the break magnitude is
indexed to the sample size (γ = γ_T) and γ_T → 0, T^{1/2}γ_T → ∞ [Picard (1985), Yao
(1987), Bai (1992)], although r̂ itself is not consistent. Second, it is possible to
construct asymptotic confidence intervals for λ, but this is not as straightforward
as inverting the LR statistic using the QLR critical values, because the null for the
LR statistic is no break, while the maintained hypothesis for the construction of a
confidence interval is that a break exists. Picard's (1985) results can be used to
construct confidence intervals for the break date by inverting a Wald-type statistic,
an approach extended to time series regression with dependent errors by Bai (1992).
Alternatively, finite-sample intervals can be constructed with sufficiently strong
conditions on ε_t and strong conditions on X_t; see Siegmund (1988) and Kim and
Siegmund (1989) for results and discussion.

5.1.2. Recursive coefficient estimates and recursive residuals

Another approach to the detection of breaks is to examine the sequence of regression
coefficients estimated with increasingly large data sets, that is, to examine β̂(λ), the
OLS estimator of β computed using observations 1, ..., [Tλ]. These tests typically
have been proposed without reference to a specific alternative, although the most
commonly studied alternative is a single structural break. Related is Brown et al.'s
(1975) CUSUM statistic, which rejects when the time series model systematically
over- or under-forecasts y_t; more precisely, when the cumulative one-step-ahead
forecast errors, computed recursively, are either too positive or too negative. The
recursive coefficients and Brown et al.'s (1975) recursive residuals and CUSUM
statistic are, respectively, given by

β̂(λ) = [Σ_{t=1}^{[Tλ]} X_{t−1}X_{t−1}′]^{−1}[Σ_{t=1}^{[Tλ]} X_{t−1}y_t],    (5.8)

w_t = [y_t − β̂((t − 1)/T)′X_{t−1}]/f_t,    (5.9)

CUSUM(λ) = T^{−1/2} Σ_{s=k+1}^{[Tλ]} w_s / σ̂_w,    (5.10)

where σ̂_w = {T^{−1} Σ_{t=k+1}^{T} (w_t − w̄)²}^{1/2} and f_t = [1 + X_{t−1}′(Σ_{s=1}^{t−1} X_{s−1}X_{s−1}′)^{−1}X_{t−1}]^{1/2}
(this comes from noting that the variance of the one-step-ahead forecast error is
σ_ε²f_t²). The CUSUM test rejects for large values of sup_{0 ≤ λ ≤ 1} |CUSUM(λ)/(1 + 2λ)|.
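A minimal sketch of the recursive-residual and CUSUM computations in (5.8)-(5.10) (Python; rows of X are the regressors X_{t−1} aligned with y_t, and the recursive estimates are recomputed at each date rather than updated, which is slow but transparent):

```python
import numpy as np

def cusum_path(y, X):
    # Recursive residuals w_t (5.9) and the standardized CUSUM process (5.10)
    T, k = X.shape
    w = np.zeros(T)
    for t in range(k + 1, T):                   # need enough observations to fit k coefficients
        beta = np.linalg.lstsq(X[:t], y[:t], rcond=None)[0]
        M = np.linalg.inv(X[:t].T @ X[:t])
        f = np.sqrt(1.0 + X[t] @ M @ X[t])      # one-step forecast-error factor f_t
        w[t] = (y[t] - X[t] @ beta) / f         # recursive residual
    sigma_w = np.std(w[k + 1:])                 # sigma_hat_w of (5.10)
    return w[k + 1:].cumsum() / (np.sqrt(T) * sigma_w)

def cusum_test_stat(y, X):
    # sup over lambda of |CUSUM(lambda) / (1 + 2 lambda)|
    c = cusum_path(y, X)
    lam = np.arange(X.shape[1] + 1, len(y)) / len(y)
    return np.abs(c / (1 + 2 * lam)).max()
```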
Because the recursive coefficients are evaluated at each point r, the distribution
of the recursive coefficients differs from the usual distribution of the OLS estimator.
The asymptotics readily obtain using the tools of Section 2. Under the null
hypothesis β_t = β, the arguments leading to (5.6), applied here, yield

T^{1/2}(β̂(·) − β) = V_T(·)^{−1}v_T(·) ⇒ β*(·),   β*(λ) = σ_ε Σ_X^{−1/2} W_k(λ)/λ,    (5.11)

[Ploberger et al. (1989), Lemma A.1]. For fixed λ, β*(λ) has the usual OLS asymptotic
distribution. An immediate implication of (5.11) is that conventional 95 percent
confidence intervals, plotted as bands around the path of recursive coefficient
estimates, are inappropriate, since those bands fail to account for simultaneous
inference on the full plot of recursive coefficients.
Combined with the CMT, (5.11) can be used to construct a formal test for parameter
constancy based on recursive coefficients. An example is Ploberger et al.'s (1989)
fluctuations test [also see Sen (1980)], which rejects for large changes in the
recursive coefficients, specifically when β̂(λ) − β̂(1) is large. From (5.11), note that
T^{1/2}(β̂(λ) − β̂(1)) ⇒ σ_ε Σ_X^{−1/2}[W_k(λ)/λ − W_k(1)] uniformly in λ. Because the full-sample
OLS estimator σ̂_ε² is consistent under the null, B^{(T)}(λ) ≡ σ̂_ε^{−1}(T^{−1} Σ_{t=1}^{T} X_{t−1}X_{t−1}′)^{1/2} ×
T^{1/2}[β̂(λ) − β̂(1)] ⇒ W_k(λ)/λ − W_k(1), uniformly in λ. This leads to Ploberger et al.'s
(1989) fluctuations test and its limiting representation under the null of parameter
constancy,

sup_{λ ∈ [0,1]} max_{1 ≤ i ≤ k} |λB_i^{(T)}(λ)| ⇒ sup_{λ ∈ [0,1]} max_{1 ≤ i ≤ k} |B_{ki}(λ)|,    (5.12)

where B_i^{(T)} is the ith element of B^{(T)} and B_{ki} is the ith element of the k-dimensional
Brownian bridge B_k.
The null distribution of the CUSUM test is also obtained by FCLT and CMT
arguments. If X_{t−1} is strictly exogenous and ε_t is i.i.d. N(0, σ_ε²), then w_t is i.i.d. N(0, σ_ε²),
so the FCLT and CMT imply

CUSUM(·) ⇒ W(·),    (5.13)

where W is a one-dimensional Brownian motion. The same limit obtains with general
I(0) regressors and a constant, but the calculation is complicated and is omitted
here; for the details, see Krämer et al. (1988), who prove (5.13) for time series
regressions possibly including lagged dependent variables and for general i.i.d.
errors (their i.i.d. assumption can be relaxed to the martingale difference assumption
used here). Critical values for sup_λ |CUSUM(λ)/(1 + 2λ)| are obtained from results
in Brownian motion theory; see Brown et al. (1975).
An important feature of the CUSUM statistic is that, as shown by Krämer et al.
(1988), it has local asymptotic power only in the direction of the mean regressors:
coefficient breaks of order T^{−1/2} on mean-zero stationary regressors will not be
detected. This has an intuitive explanation. The cumulation of a mean-zero
regressor will remain mean-zero (and will obey an FCLT) whether or not its true
coefficient changes, while the nonzero mean of the cumulation of the constant
implies that breaks in the intercept will result in systematically biased forecast
errors.³⁰ This is both a limitation and an advantage, for rejection suggests a
particular alternative (instability in the intercept or in the direction of the mean
regressors).
Several variants of the CUSUM statistic have been proposed. Ploberger and
Krämer's (1992a) version, in which full-sample OLS residuals ε̂_t replace the recursive
residuals w_t, is attractive because of its computational simplicity. Again, the
distribution is obtained using the FCLT and CMT. Their test statistic and its

³⁰ Consider the simplest case, in which y_t = ε_t under the null, while under the local alternative y_t =
T^{−1/2}γ′X_{t−1}1(t > r) + ε_t. Since β is known to equal zero, under the null (with γ = 0 imposed) the cumulated
residuals process is just T^{−1/2} Σ_{s=1}^{[Tλ]} y_s. Under the local alternative, T^{−1/2} Σ_{s=1}^{[Tλ]} y_s = T^{−1/2} Σ_{s=1}^{[Tλ]} ε_s +
γ′T^{−1} Σ_{s=r+1}^{[Tλ]} X_{s−1} ⇒ σ_ε W(λ) + γ′ max(0, λ − δ)EX_t, where δ = lim r/T. If EX_t is zero, the distribution is the same under the
local alternative and the null; the test only has power in the direction of the mean vector EX_t. Estimation
of β, as is of course done in practice, does not affect this conclusion qualitatively because the alternative
is local. Also see Ploberger and Krämer (1990).
limiting null representation are

max_{k ∈ [1,T]} |T^{−1/2} Σ_{t=1}^{k} ε̂_t| / σ̂_ε ⇒ sup_{λ ∈ [0,1]} |B_1(λ)|,    (5.14)

where B_1 is the one-dimensional Brownian bridge and the limit obtains using the
FCLT and CMT. Other variants include Brown et al.'s (1975) CUSUM-of-squares
test based on w_t², and McCabe and Harrison's (1980) CUSUM-of-squares test based
on OLS residuals. See Ploberger and Krämer (1990) for a discussion of the low
asymptotic power of the CUSUM-of-squares test. See Deshayes and Picard (1986)
and the bibliography by Hackl and Westlund (1989) for additional references.
If the regressors are I(1), the distribution theory for rolling and recursive tests
changes, although it still can be obtained using the FCLT and CMT as it was
throughout this chapter. See Banerjee et al. (1992b) for rolling and recursive tests
with a single I(1) regressor, Chu and White (1992) for fluctuations tests in models
with stochastic and deterministic trends, and Hansen (1992) for Chow-type (e.g.
QLR) and LM-type [e.g. Nyblom's (1989) statistic] tests with multiple I(1)
regressors in cointegrating equations. Also, the distribution of the CUSUM statistic
changes if stochastically or deterministically trending regressors are included; see
MacNeill (1978) and Ploberger and Krämer (1992b).

5.1.3. Tests against the time-varying-parameter model

A flexible extension of the standard regression model is to suppose that the
regression coefficients evolve over time, specifically

y_t = β_t′X_{t−1} + ε_t,   β_t = β_{t−1} + ν_t,   Eν_tν_t′ = τ²G,    (5.15)

where ε_t and ν_t are uncorrelated and ν_t is serially uncorrelated. The formulation
(5.15), of course, nests the standard linear regression model when τ² = 0. By setting
ν_t = γ, t = r + 1 and ν_t = 0, t ≠ r + 1, (5.15) nests the single-break model (5.2). The
alternative of specific interest here, however, is when ν_t is i.i.d. N(0, τ²G) (where G is
assumed to be known), so that the coefficient β_t follows a multivariate stochastic
trend and thus evolves smoothly but randomly over the sample period. When
combined with the additional assumption that ε_t is i.i.d. N(0, σ_ε²), this is referred to
as the time-varying-parameter (TVP) model [see Cooley and Prescott (1976),
Sarris (1973) and the reviews by Chow (1984) and Nicholls and Pagan (1985)].
Maximum likelihood estimation of the TVP model is a direct application of the
Kalman filter (β_t is the unobserved state vector and y_t = β_t′X_{t−1} + ε_t is the
measurement equation), and the estimation of β_t and its standard error under
the alternative is well understood; see the chapter in this Handbook by Hamilton.
We therefore focus on the problem of testing the null that τ² = 0.
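For intuition, the data generating process (5.15) is simple to simulate; a minimal sketch (Python; taking G = I_k and i.i.d. standard normal regressors are illustrative assumptions, not part of the model above):

```python
import numpy as np

def simulate_tvp(T, beta0, tau, sigma=1.0, seed=0):
    # Simulate (5.15): y_t = beta_t'X_{t-1} + eps_t, beta_t = beta_{t-1} + nu_t
    rng = np.random.default_rng(seed)
    beta0 = np.asarray(beta0, dtype=float)
    k = beta0.size
    X = rng.standard_normal((T, k))            # stand-in I(0) regressors (assumption)
    nu = tau * rng.standard_normal((T, k))     # nu_t i.i.d. N(0, tau^2 I), i.e. G = I_k
    beta = beta0 + nu.cumsum(axis=0)           # random walk coefficient paths
    y = (X * beta).sum(axis=1) + sigma * rng.standard_normal(T)
    return y, X
```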
The TVP model (5.15) nests, as a special case, the MA(1) model considered in
Section 4. Setting X_t = 1 yields the unobserved components model (4.2), y_t = β_0 + u_t,
where u_t = (β_t − β_0) + ε_t = Σ_{s=1}^{t} ν_s + ε_t. Thus the testing problem in the general TVP
model can be seen as an extension of the unit MA root testing problem. Starting with
Nyblom and Mäkeläinen (1983), several authors have studied the properties of
locally most powerful tests of τ² = 0 against τ² > 0 in (5.15), or in models where only
some of the coefficients are assumed to evolve over time (that is, where G has reduced
rank); see, for example, King and Hillier (1985), King (1988), Nyblom (1989), Nabeya
and Tanaka (1988), Leybourne and McCabe (1989), Hansen (1990), Jandhyala and
MacNeill (1992) and Andrews and Ploberger (1992). [Also see Watson and Engle
(1985), who consider tests against β_t following a stationary AR(1).] The treatment
here follows Nyblom and Mäkeläinen (1983) and builds on the discussion in
Section 4 of tests of the UC model.
To derive the LMPI test of τ² = 0 versus τ² > 0, suppose that X_{t−1} is strictly
exogenous (although the asymptotics hold more generally). Under the TVP model,
(5.15) can be rewritten as y_t = β_0′X_{t−1} + {(Σ_{s=1}^{t} ν_s)′X_{t−1} + ε_t}, where the term in
curly brackets is an unobserved error. In standard matrix notation [Y denotes
(y_1, ..., y_T)′, X denotes (X_0, ..., X_{T−1})′, etc.], the conditional distribution of Y, given
X, is

Y ~ N(Xβ_0, σ_ε²[I_T + (τ²/σ_ε²)V_T]),   where V_T = Ω ⊙ (XGX′),    (5.16)

where Ω_{ij} = min(i, j) and ⊙ denotes the Hadamard (elementwise) product. The
testing problem is invariant to scale/translation shifts of the form y → ay + Xb, so
the most powerful invariant test against an alternative τ² will be a ratio of quadratic
forms involving I_T + (τ²/σ_ε²)V_T. However, this depends on the alternative, so no
uniformly most powerful test exists. One solution is to consider the LMPI test, which
rejects for large values of ê′V_Tê/ê′ê, where {ê_t} are the full-sample OLS residuals.
Straightforward algebra shows that T^{−2}ê′V_Tê = T^{−1} Σ_{s=1}^{T} S_T(s/T)′G S_T(s/T), where
S_T(λ) = T^{−1/2} Σ_{t=[Tλ]+1}^{T} ê_tX_{t−1}, which provides a simpler form for the test: reject if
T^{−1} Σ_{s=1}^{T} S_T(s/T)′G S_T(s/T) is large. Because this test and its limiting distribution
depend on G, Nyblom (1989) suggested the simplification G = (T^{−1} Σ_{t=1}^{T} X_{t−1}X_{t−1}′)^{−1}.
Accordingly, the test rejects for large values of

L = T^{−1} Σ_{s=1}^{T} S_T(s/T)′ (T^{−1} Σ_{t=1}^{T} X_{t−1}X_{t−1}′)^{−1} S_T(s/T) / σ̂_ε².    (5.17)

Conditional on {X_t}, the TVP model induces a heteroskedastic random walk into
the error term, which is detected by L using the cumulated products of the OLS
residuals and the regressors.
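A minimal sketch of the computation in (5.17) (Python; rows of X are again X_{t−1}; forward partial sums are used, which by the OLS normal equations equal the backward sums in (5.17) up to sign, leaving the quadratic form unchanged):

```python
import numpy as np

def nyblom_l(y, X):
    # Nyblom's L statistic (5.17): cumulated-score test of tau^2 = 0
    T, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta                           # full-sample OLS residuals
    sigma2 = e @ e / T
    scores = X * e[:, None]                    # rows e_t X_{t-1}
    S = scores.cumsum(axis=0) / np.sqrt(T)     # partial sums (sign immaterial here)
    Vinv = np.linalg.inv(X.T @ X / T)
    quad = np.einsum('ti,ij,tj->t', S, Vinv, S)
    return quad.mean() / sigma2
```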
Nyblom (1989) derived the statistic (5.17) by applying local arguments to a
likelihood for generally nonlinear, nonnormal models, and his general statistic
simplifies to (5.17) in the Gaussian linear regression model. If X_t = 1, (5.17) reduces
to the LMPI test (4.7) of the i.i.d. null against the random walk alternative, or
equivalently the test of the null of a unit MA root.³¹ Henceforth, we refer to (5.17)
as the Nyblom statistic.
The asymptotics of the Nyblom statistic follow from the FCLT and the CMT.
As usual, ε_t need not be i.i.d. normal and X_t need not be strictly exogenous; rather,
the weaker conditions following (5.1) are sufficient for the asymptotics. Under those
weaker conditions, ε_tX_{t−1} is a martingale difference sequence and, by the FCLT
and CMT, S_T(·) ⇒ σ_ε Σ_X^{1/2} B_k(·), where B_k is a k-dimensional standard Brownian
bridge. Because T^{−1} Σ_{t=1}^{T} X_{t−1}X_{t−1}′ →_p Σ_X and σ̂_ε² →_p σ_ε², under the null hypothesis,

L ⇒ ∫_0^1 B_k(λ)′B_k(λ) dλ.    (5.18)

The literature contains Monte Carlo results on the finite-sample power of the
tests in Sections 5.1.1-5.1.3 against various alternatives. The range of alternatives
considered is broad, and some preliminary conclusions have emerged. Many
of the tests overreject in moderately large samples (T = 100) when asymptotic critical
values are used. This is exacerbated if errors are nonnormal and, especially, if
autoregressions have large autoregressive parameters [QLR and related tests; see
Diebold and Chen (1992)]. In their Monte Carlo study of the QLR test, exponentially
averaged F-tests, the CUSUM test and several other tests against alternatives of
one-time breaks and random walk coefficients, Andrews et al. (1992) found that in
general the weighted exponential tests performed well and often the QLR and
Nyblom tests performed nearly as well. For additional results, see Garbade (1977)
and the references in Hackl and Westlund (1989, 1991).

5.1.4. Empirical application: stability of the money-output relation

At least since the work of Friedman and Meiselman (1963), one of the long-standing
empirical problems in macroeconomics has been whether money has a strong and
stable link to aggregate output; for a discussion and recent references, see Friedman
and Kuttner (1992). Since their introduction to this literature by Sims (1972, 1980),
Granger-causality tests and vector autoregressions have provided the workhorse
machinery for quantifying the strength and direction of these relations in nonstruc-
tural time series models (see the chapter by Watson in this Handbook for a
discussion of vector autoregressions). But for such empirical models to be useful
guides for monetary policy they must be stable, and the tests of this section can play
a useful role in assessing their stability. Of particular importance is whether one of
the several monetary aggregates is arguably most stably related to output.

³¹ To show this, rewrite S_T(s/T) using the identity that the mean OLS residual is zero.

Table 7
Tests for structural breaks and time-varying parameters in the money-output relation
(Dependent variable: nominal GDP growth)
(Estimation period: quarterly, 1960:2 to 1992:2)

                                 F-tests on
                                 coefficients on:            P-K
     M      r       R̄²          M          r        QLR      CUSUM    Nyblom L

1    Base   -       0.153       3.85**     -        40.23*** 1.43**   2.83**
2    Base   R-90    0.178       1.43       2.50*    54.15*** 1.36**   2.80
3    M1     -       0.140       3.17**     -        32.57*** 1.22*    1.90
4    M1     R-90    0.179       1.50       2.87**   51.02*** 1.35**   2.85
5    M2     -       0.221       7.60***    -        20.82    0.73     1.31
6    M2     R-90    0.332       8.40***    3.19**   25.44    1.05     1.53

All regressions include 3 lags each of the nominal GDP growth rate, GDP inflation and the growth
rate of the monetary aggregate. The M column specifies the monetary aggregate. The r column indicates
whether the 90-day U.S. Treasury bill rate is included in the regression. If the interest rate is included,
it is included in differences (3 lags) and one lag of an error-correction term from a long-run money
demand equation is also included. The R̄² is the usual OLS adjusted R². The F-tests are Wald tests of
the hypothesis that the coefficients on the indicated variable are zero; the restriction that the error-
correction term (when present) has a zero coefficient is included in the Wald test on the monetary
aggregate. QLR is the Quandt (1960) likelihood ratio statistic (5.4) with symmetric 15 percent trimming;
P-K CUSUM is the Ploberger-Krämer (1992a) CUSUM statistic (5.14); and the final column reports
the Nyblom (1989) L statistic (5.17). Break test critical values were taken from published tables and/or
were computed by Monte Carlo simulation of the limiting functionals of Brownian motion, as described
in Section 2.3. Tests are significant at the *10 percent, **5 percent and ***1 percent levels.

Table 7 presents regression summary statistics and three tests for parameter
stability in typical money-output regressions for three monetary aggregates (the
monetary base, M1, and M2) over the period 1960:2-1992:2. The results are taken
from Feldstein and Stock (1994), to which the reader is referred for additional detail.
Based on preliminary unit root analysis, log GDP, the log GDP deflator, log money
and the 90-day U.S. Treasury bill rate are specified as having a single unit root, so
that the GDP growth rate, GDP inflation, the money growth rate and the first
difference of the interest rate are used in the regressions. Drawing on the cointegra-
tion evidence in Hoffman and Rasche's (1991) and Stock and Watson's (1993) studies
of long-run money demand, in the models including the interest rate we model log
money, log output and the interest rate as being cointegrated, so that the equations
include an error-correction term, the cointegrating residual. The long-run cointe-
grating equation was estimated by imposing a unit long-run income elasticity and
estimating the interest semi-elasticity using the Saikkonen (1991)/Phillips-Loretan
(1991)/Stock-Watson (1993) dynamic OLS efficient estimator.³² The main

³² All data were taken from the Citibase data base. The hypothesis of two unit roots was rejected at the
5 percent level for each series (demeaned case) using DF-GLS tests with AR(BIC), 1 ≤ p ≤ 8, except that
a unit root in inflation is rejected at the 10 percent but not the 5 percent level. For each series, DF-GLS
failed to reject a single unit root at the 10 percent level (detrending for each variable except interest rates,
for which the demeaned statistics were used), except for the interest rate, which rejected at the 10 percent
but not the 5 percent level. The 95 percent asymptotic confidence intervals, computed as in Stock (1991) by
inverting the Dickey-Fuller t-statistic (the demeaned version for interest rates) as described in Section 3.3, for the largest
autoregressive roots are: log M1, (0.821, 1.026); log M2, (0.998, 1.039); log base, (0.603, 0.882); 90-day
T-bill rate, (0.838, 1.015); log GDP, (0.950, 1.037); GDP inflation, (0.876, 1.032). The results are robust
to using the AR(BIC) selector with 3 ≤ p ≤ 8, as in the Monte Carlo simulations, except that the M2
confidence interval rises to (1.011, 1.040). For consistency, all monetary aggregates are specified in
growth rates [but see Christiano and Ljungqvist (1988) and Stock and Watson (1989)]. These results
leave room to argue that inflation should be entered in changes, but for comparability with other
specifications in the literature inflation itself is used. There is some ambiguity about the treatment of
interest rates, but to be consistent with recent investigations of long-run money demand they are treated
here as I(1). The evidence on cointegration involves statistics not covered in this chapter and the reader
is instead referred to Hoffman and Rasche (1991) and Stock and Watson (1993).
conclusions are insensitive to empirically plausible changes in the unit root
specifications of interest rates and money; in particular, see Konishi et al. (1993) for
F-statistics in specifications where the interest rate is assumed stationary.
The Granger-causality test results indicate that including the interest rate makes
base money and M1 insignificant, although M2 remains significant (this is partly
due to the error-correction term). The QLR test rejects the null hypothesis of
parameter stability at the 1 percent level in all specifications including base money
or M1; the L statistic rejects in the base-only specification; and the Ploberger-Krämer
(1992a) CUSUM based on OLS residuals rejects in the base and M1 specifications.
The hypothesis of stability is thus strongly rejected for the base-output and
M1-output relations. The evidence against stability is much weaker for the
M2-output relation: none of the stability tests reject at the 10 percent level. Once
changes in velocity are controlled for by including the error-correction term in
regression 6, both M2 and the interest rate enter significantly and there is no
evidence of instability. As with any empirical investigation, some caveats are
necessary: these results are based on only a few specifications, and stability in this
sample is no guarantee of stability in the future. Still, these results suggest that, of
the base, M1, and M2, only M2 had a stable reduced-form relationship with output
over this period.

5.2. Trend breaks and tests for autoregressive unit roots

5.2.1. The trend-break model and effects of misspecifying the trend

Rappoport and Reichlin (1989) and Perron (1989a, 1990b) argued that a plausible
model for many economic variables is stationarity around a time trend with a break,
and that autoregressive unit root tests based on linear detrending as discussed in
Section 3 have low power against this alternative. Two such broken-trend

specifications are

(Shift in mean)    d_t = β_0 + β_1 1(t > r),    (5.19)

(Shift in trend)   d_t = β_0 + β_1 t + β_2(t − r)1(t > r),    (5.20)

where 1(t > r) is a dummy variable which equals one for t > r and zero otherwise.³³
For conciseness, attention here is restricted to the trend-shift model (5.20), a model
suggested by Perron (1989a) and Rappoport and Reichlin (1989) for real GNP.
It was emphasized in Section 3.2.5 that if the trend is misspecified, then unit root
tests can be misleading. This conclusion applies here as well. Suppose that (5.20) is
correct and r/T → δ, δ fixed, 0 < δ < 1, but that statistics are computed by linear
detrending. Then the power of the unit root test tends to zero against fixed
alternatives. The intuition is simple: if a linear time trend is fitted to an I(0) process
around a piecewise-linear trend, then the residuals will be I(0) around a mean-zero
V-shaped trend. These residuals have variances growing large (with T) at the start
and end of the sample, and standard tests will classify the residuals as having a unit
root.³⁴ In the mean-shift case, Dickey-Fuller unit root tests are consistent but have
low power if the mean shift is large [Perron (1989a); for Monte Carlo evidence,
Hendry and Neale (1991)]. See Campbell and Perron (1991) for further discussion.
This troubling effect of trend misspecification raises the question of how to test for
AR unit roots in the presence of a possibly broken trend.

5.2.2. Unit root tests with broken trends

If the break date is known a priori, as Perron (1989a, 1990b) and Rappoport and
Reichlin (1989) assumed, then detrending can be done by correctly specified OLS,
and the asymptotic distribution theory is obtained using a straightforward
extension of Sections 2 and 3. However, as Christiano (1992) and, subsequently,
Banerjee et al. (1992b), Perron and Vogelsang (1992) and Zivot and Andrews (1992)
pointed out, the assumption that the break date is data-independent is hardly
credible in macroeconomic applications. For example, in Perron's applications to

³³ Under the null hypothesis of a unit root, the mean-shift model is equivalent to assuming that there
is a single additive outlier in v_t at time r + 1, since, under the null hypothesis, (3.1) and (5.19) imply
Δy_t = v_t + β_1 1(t = r + 1). A third trend model is Perron's (1989a) model C, with both a mean and a
trend shift.
³⁴ To show this, consider the AR(1) case, so that γ_v(0) = ω², and the Dickey-Fuller root test, T(ρ̂ − 1).
If the trend is, in fact, given by (5.20), then the detrended process is y_t^d = u_t + d̃_t, where d̃_t = (β_0 − β̂_0) +
(β_1 − β̂_1)t + β_2(t − r)1(t > r). For r/T → δ, δ fixed, if u_t is I(0), then straightforward but tedious calculations
show that the scaled detrended process has a deterministic limit, T^{−1}y_{[Tλ]}^d →_p β̃_0 + β̃_1λ + β_2(λ − δ)1(λ > δ),
uniformly in λ, where β̃_0 and β̃_1 are nonrandom functions of β_0, β_1, β_2 and δ. It follows that T(ρ̂ − 1) ⇒
g(δ), where g is nonrandom. An explicit expression for g(δ) is given in Perron (1989a, Theorem 1(b)).
Perron (1989a) shows that g(δ) is in the acceptance region of the detrended DF root test, so, asymptotically,
the null is incorrectly accepted with probability one.

GNP, the break dates were chosen to be in the Great Depression and the 1973 oil
price shock, both of which are widely recognized as having important and lasting
effects on economic activity. Thus the problem becomes testing for a unit root when
the break dates are unknown and determined from the data.
Two issues arise here: devising tests which control for the nuisance parameters,
in particular the unknown break date, and, among such tests, finding the most
powerful. To date research has focused on the first of these topics. There is little
work which addresses this problem starting from the theory of optimal tests, and
this is not pursued here.³⁵
The procedures in the literature for handling the unknown date of the trend break
are based on a modified Dickey-Fuller test. To simplify the argument, consider the
AR(1) case, so that ω² = γ_v(0). Then, as suggested by Christiano (1992), Banerjee
et al. (1992b) and Zivot and Andrews (1992), one could test for a unit root by
examining the minimum of the sequence of Dickey-Fuller t-statistics, constructed
by first detrending the series by OLS using (5.20) for r over the range r_0, ..., r_1:

t̂_min^{DF} = min_{δ ∈ [δ_0, δ_1]} τ̂^d(δ),    (5.21)

where

τ̂^d(δ) = T^{−1} Σ_{t=2}^{T} Δy_t^d(δ)y_{t−1}^d(δ) / [(σ̂^d)²(δ) T^{−2} Σ_{t=2}^{T} (y_{t−1}^d(δ))²]^{1/2},

where y_t^d(δ) = y_t − z_t(δ)′β̂(δ), β̂(δ) = [Σ_{t=1}^{T} z_t(δ)z_t(δ)′]^{−1}[Σ_{t=1}^{T} z_t(δ)y_t], z_t(δ) =
[1, t, (t − [Tδ])1(t > [Tδ])]′, and (σ̂^d)²(δ) is the sample variance of the residual from
the regression of y_t^d(δ) onto y_{t−1}^d(δ).
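A minimal sketch of the computation in (5.21) for this AR(1) case (Python; the 15 percent symmetric trimming is an illustrative choice, and no lag augmentation is included):

```python
import numpy as np

def min_df_statistic(y, trim=0.15):
    # Minimum over break dates of the broken-trend DF t-statistic, per (5.21)
    T = len(y)
    t_min = np.inf
    trend = np.arange(1.0, T + 1)
    for r in range(int(trim * T), int((1 - trim) * T)):
        z = np.column_stack([np.ones(T), trend,
                             np.where(trend > r, trend - r, 0.0)])  # z_t(delta)
        yd = y - z @ np.linalg.lstsq(z, y, rcond=None)[0]           # detrended series
        ylag, dy = yd[:-1], np.diff(yd)
        coef = (ylag @ dy) / (ylag @ ylag)      # coefficient of Dy on lagged level
        s2 = np.mean((dy - coef * ylag) ** 2)   # residual variance (sigma^d)^2
        t_min = min(t_min, coef * np.sqrt(ylag @ ylag / s2))
    return t_min
```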
Just as the null distribution of the QLR statistic differs from the distribution of
the fixed-date F-statistics, the null distribution of t̂_min^{DF} differs from the distribution
of τ̂^d for a fixed break point. The approach to obtaining the null distribution is
similar, namely to obtain a limiting representation for the sequence of statistics τ̂^d(δ),
uniformly in δ. Relative to the QLR statistic, this entails an additional complication,
because under the null the broken-trend detrended process will be I(1). This leads
to limit results for elements of D[0,1] × D[0,1]. While no new tools are needed for
these calculations, they are tedious and notationally cumbersome, and the reader is
referred to the articles by Banerjee et al. (1992b) and Zivot and Andrews
(1992) for different derivations of the same limiting representation. Not surprisingly,
the critical values of the minimal DF statistic are well below the critical values of

³⁵ Elliott et al. (1992) show that the asymptotic Gaussian power envelope in the mean-shift model
(5.19) with β_1 fixed equals the no-detrending power envelope plotted in Figure 1.
the usual linearly detrended statistic; for example, with symmetric 15 percent
trimming, the one-sided 10 percent asymptotic critical value is approximately −4.13
[Banerjee et al. (1992b, Table 2)], compared with −3.12 in the linearly detrended
case.

5.2.3. Finite-sample size and power

There are fewer Monte Carlo studies of the broken-trend and broken-mean unit
root statistics than of the linearly detrended case, perhaps in part because the
additional minimization dramatically increases the computational demands. None-
theless, the results of Hendry and Neale (1991), Perron and Vogelsang (1992), Zivot
and Andrews (1992) and Banerjee et al. (1992b) provide insights into the performance
of the tests. The finite-sample distributions are sensitive to the procedures used to
determine the lag length in the augmented DF regression, and the null distributions
depend on the nuisance parameters even though the tests are asymptotically similar.
Typically, the asymptotic critical values are too small, that is, the sizes of the tests
exceed their nominal level. The extent of the distortion depends on the actual values
of the nuisance parameters. Zivot and Andrews (1992) examined size distortions by
Monte Carlo study of ARIMA models estimated using the Nelson-Plosser (1982)
U.S. data set; for the mean-shift model (5.19), the finite-sample 10 percent critical
values were found to fall in the range -4.85 to - 5.05, while the corresponding
asymptotic value is - 4.58; for each series, tests of asymptotic level 2.5 percent
rejected between 5 percent and 10 percent of the time. Perron and Vogelsang (1992)
found larger rejection rates under the null when there is more negative serial
correlation than present in the Zivot-Andrews simulations.
The Monte Carlo evidence confirms the view that the finite-sample power of the
unit root tests is reduced by trend- or mean-shift detrending, in the sense that if the
true trend is linear then introducing the additional break-point reduces power. The
extent of this power reduction, however, depends on the nuisance parameters and,
in any event, if the broken-trend specification is correct then broken-trend detrending
is necessary. The more relevant comparison is across different procedures which
entail broken-trend detrending, but only limited results are available [see Perron
and Vogelsang (1992) for some conclusions comparing four Dickey-Fuller-type
tests with different lag length selection procedures].

5.2.4. Conclusions and practical implications

Although the research on trend-break unit root tests is incomplete, it is possible to


draw some initial conclusions. On a practical level, the size distortions found in the
demeaned and linearly detrended cases in Section 3 appear, if anything, to be more
severe in the broken-trend case, and the power of the tests also deteriorates. One
can speculate that this reflects a dwindling division between the I(1) model and other
competing representations; were the trend shifts I(0) and occurring every period,
then the extension of (5.20) would deliver an I(2) model for y_t.
A useful way to summarize the broken-trends literature is to return to our original
four motivating objectives for analyzing unit roots. As a matter of data description,
Perron's (1989a, 1990b) and Rappoport and Reichlin's (1989) analyses demonstrate
that the broken-trend models deliver very different interpretations from conven-
tional unit root models, emphasizing the importance of a few irregularly occurring
events in determining the long-run path of aggregate variables; this warrants
continued research in this area. The practical implications concerning the remaining
three objectives remain largely unexplored. From a forecasting perspective, if the
single-break model is taken as a metaphor for multiple irregular breaks, then one
must be skeptical that out-of-sample forecasts will be particularly reliable, since
another break could occur. Equally importantly, for this reason treating the break
as a one-time nonrandom event presumably leads to understating the uncertainty
of multistep forecasts. Little is currently known about the practical effect of
misspecifying trend breaks in subsequent multivariate modeling, although the
asymptotic theory of inference in vector autoregressions (VARs) with unit roots and
cointegration analysis discussed in Watson's chapter in this Handbook must be
modified if there are broken trends. Finally, the link between these trend-break
models and economic theory is undeveloped. In any event, the statistical difficulties
with inference in this area do not make one optimistic that trend-break models
will parse economic theories, however capable they are of producing suggestive
stylized facts.

6. Tests of the I(1) and I(0) hypotheses: links and practical limitations

Sections 3, 4, and 5.2 focused on inference in the I(1) and I(0) models. When inference
is needed about the order of integration of a series, sometimes there is no compelling
a priori reason to think that one or the other of these models is the best starting point;
rather, the models might best be treated symmetrically. In this light, this section
addresses three topics. Section 6.1 examines some formal links between the I(1) and
I(0) models. Section 6.2 summarizes some recent work taking a different approach
to these issues, in which the determination of whether a series is I(0) or I(1) is recast
as a classification problem, so that the tools of Bayesian analysis and statistical
decision theory can be applied. Section 6.3 then raises several practical difficulties
which arise in the interpretation of both these Bayesian classification schemes and
classical unit-root hypothesis tests, in light of the size distortions coupled with low
power of the tests studied in the Monte Carlo experiments of Sections 3 and 4.

6.1. Parallels between the I(0) and I(1) testing problems

The historical development of tests of the I(0) and I(1) hypotheses treated the issues
as conceptually and technically quite different. To a large extent, these differences
are artificial, arising from their ARMA parameterizations. Since an integrated I(0)
process is I(1), a test of the I(0) null against the I(1) alternative is, up to the handling
of initial conditions, equivalent to a test of the I(1) null against the I(2) alternative.
In this sense, the tests of the previous sections can both be seen as tests of the I(1)
null, on the one hand, against I(0) and, on the other hand, against I(2). What is
interesting is that this reinterpretation is valid not just on a heuristic level but also
on a technical level.
To make this precise, consider the case d_t = β_0, v_t = ε_t. The LMPIU test of the unit
MA root in (4.1) rejects for large values of the Nyblom-Mäkeläinen (1983) statistic

L^μ = T^{−2} Σ_{t=1}^{T} (Σ_{s=1}^{t} y_s^μ)² / T^{−1} Σ_{t=1}^{T} (y_t^μ)²,    (6.1)

where y_t^μ = y_t − ȳ. If instead the null hypothesis is that u_t is a Gaussian random
walk and the alternative is that u_t is an AR(1) with |α| < 1, then one could test
this hypothesis by rejecting for small values of the demeaned Sargan-Bhargava
statistic

R̂^μ = T^{−2} Σ_{t=1}^{T} (y_t^μ)² / T^{−1} Σ_{t=2}^{T} (Δy_t^μ)².    (6.2)

The L^μ statistic rejects if the mean square of the I(1) process, the cumulation of y_t^μ,
is large, while R̂^μ rejects if the mean square of the I(1) process, y_t^μ, is small. Both
tests can be seen as tests of the I(1) null but, respectively, against the I(2) and I(0)
alternatives.
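The computational parallel between the two statistics, as written in (6.1) and (6.2), is immediate; a minimal sketch (Python):

```python
import numpy as np

def l_mu(y):
    # L^mu of (6.1): mean square of the cumulated demeaned series, normalized
    ymu = y - y.mean()
    cum = ymu.cumsum()
    return (cum ** 2).mean() / (len(y) * (ymu ** 2).mean())

def r_mu(y):
    # Demeaned Sargan-Bhargava statistic of (6.2)
    ymu = y - y.mean()
    return (ymu ** 2).mean() / (len(y) * (np.diff(ymu) ** 2).mean())
```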

6.2. Decision-theoretic classification schemes

A standard argument for using conventional hypothesis tests is that the researcher
has a particular reason for wishing to control the Type I error rate. While this might
be appropriate in some of the applications listed in Section 1, in others, such as
forecasting, the ultimate objective of the empirical analysis is different and classical
hypothesis tests are not necessarily the best tools to achieve those objectives. In such
cases, the researcher might rather be interested in having a procedure which will
deliver consistent inference, in the sense that the probability of correctly classifying
a process as I( 1) or I(0) asymptotically tends to one; that is, the probabilities of both
Type I and Type II errors tend to zero.
In theory, this can be achieved by using a sequence of critical values which tend
to −∞ as an appropriate function of the sample size. To be concrete, suppose that
the researcher computed the Dickey-Fuller t-statistic t̂, and evaluated it using critical
values b_T. If the null is true, then Pr[t̂ < b_T | I(1)] → 0 for any sequence b_T → −∞,
so that the probability of correctly concluding that the process is I(1) tends to one.
Similarly, it is plausible that for a suitable choice of b_T, if the process is truly I(0),
then Pr[t̂ < b_T | I(0)] → 1 and the Type II error rate tends to zero. For such a choice
of b_T, this would be a consistent classification scheme. Because the Dickey-Fuller
t-statistic tends to −∞ at the rate T^{1/2} under a fixed alternative, one candidate for
b_T is b_T = −k_0 − k_1 ln T for some positive constants (k_0, k_1). Thus the rule is

Classify y_t as I(0) if t̂ < −k_0 − k_1 ln T,    (6.3)

and, otherwise, classify y_t as I(1). The problem with this scheme is that, in practice,
the researcher is left to choose k_0 and k_1. Because the sample size is, of course, fixed
in an actual data set, the conceptual device of choosing this sequence is artificial
and the researcher is left with little practical guidance.
One solution is to frame this as a classification or decision-theoretic problem and
to apply Bayesian techniques. In this context, an observed series is classified as I(0)
or I(1) based on the posterior odds ratio Π_T, which we write heuristically as

Π_T = (π_1/π_0)B_T,   where B_T = Pr[(y_1, ..., y_T) | I(1)] / Pr[(y_1, ..., y_T) | I(0)],    (6.4)

where π_1 and π_0 are prior weights that the series is I(1) and I(0) and where B_T is the
Bayes ratio. If Π_T > 1, then the posterior odds favor the I(1) model and the series
is classified as I(1).
Although (6.4) appears simple, numerous subtleties are involved in its evaluation,
and addressing these subtleties has spawned a large literature on Bayesian
approaches to autoregressive unit roots; see in particular Sims (1988), Schotman
and van Dijk (1990), Sims and Uhlig (1991), DeJong and Whiteman (1991a, 1991b),
Diebold (1990), Sowell (1991) and the papers by Phillips (1991a) and his discussants
in the special issue of the Journal of Applied Econometrics (October-December,
1991). In most cases, implementations of (6.4) have worked within specifications
which require placing explicit priors over key continuous parameters, such as the
largest autoregressive root. The proposed priors differ considerably and can imply
substantial differences in empirical inferences [see the review by Uhlig (1992)].
Because of this dependence on priors, and given space limitations, no attempt will
be made here to summarize this literature. Instead, we briefly discuss two recent
approaches, by Phillips and Ploberger (1991) and Stock (1992), which provide
simple ways to evaluate the posterior odds ratio (6.4) and which avoid explicit
integration over priors on continuous parameters. These procedures require only
that the researcher place priors π_0 and π_1 = 1 − π_0 on the respective point
hypotheses I(0) and I(1).
Phillips and Ploberger (1991) derive their procedure from a consideration of the
likelihood ratio statistic in the AR(1) model, and obtain the rule

Classify y_t as I(0) if t̂² > ln[(π_1/π_0)²(1 + Σ_{t=1}^{T} y_{t−1}²/σ̂_ε²)],    (6.5)

and, otherwise, classify y_t as I(1), where t̂ is the Dickey-Fuller t-statistic.³⁶ The
expression (6.5) bears considerable similarity to (6.3): a unit AR root is rejected
based on the Dickey-Fuller t-statistic, with a critical value that depends on the
sample size. The difference here is that the critical value is data-dependent: if y_t is
I(1), the critical value will be 2 ln T + O_p(1), while if y_t is I(0), it will be ln T + O_p(1).
As Phillips and Ploberger (1992) point out, this procedure can be viewed as an
extension to the I(1) case of the BIC model selection procedure, where the issue is
whether to include or to exclude y_{t−1} as a regressor in the DF regression (3.9). The
procedure is also closely related to the predictive least squares principle; see Wei
(1992).
Another approach is to evaluate the Bayes factor in (6.4) directly, using a reduced-
dimensional statistic rather than the full data set. Suppose that φ_T is a statistic which
is informative about the order of integration, such as a unit root test statistic; then
the expression for the Bayes factor in (6.4) could be replaced with

B*_T = f(φ_T | I(1)) / f(φ_T | I(0)).    (6.6)
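In practice the two conditional densities in (6.6) are rarely available in closed form but can be approximated from Monte Carlo draws of the statistic under each model; a minimal sketch (Python; the kernel-density approximation is an illustrative device, not necessarily the numerical method used in this literature):

```python
import numpy as np
from scipy.stats import gaussian_kde

def bayes_factor(phi_obs, draws_i1, draws_i0):
    # B*_T of (6.6): ratio of the sampling densities of phi_T, each approximated
    # by a kernel density estimate from draws of phi_T under the I(1) and I(0) models
    f1 = gaussian_kde(np.asarray(draws_i1))
    f0 = gaussian_kde(np.asarray(draws_i0))
    return f1(phi_obs)[0] / f0(phi_obs)[0]
```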
The approach in Stock (1992) is to construct a family of statistics which have limiting
distributions which, on the one hand, do not depend on nuisance parameters under
either the I(1) or I(0) hypothesis but which, on the other hand, diverge, depending on
which hypothesis is true. The results in the previous sections can, in fact, be used
to construct such statistics. Consider the process V_T defined in (4.10), and consider
the no-deterministic case. If y_t is a general I(0) process, then V_T ⇒ W. On the other
hand, if y_t is a general I(1) process, then N_T^{−1}V_T ⇒ V*, where V* is defined in (4.11)
and N_T is the accompanying normalizing sequence. In either case the limiting representation of V_T does not
depend on any nuisance parameters. To make this concrete, consider the statistic
φ_T = ln L̂ = ln{T^{−1} Σ_{t=1}^{T} V_{t−1}²}. Then, for u_t a general I(0) process, from (2.9) (in the
I(0) case) and (4.11) (in the I(1) case), φ_T has the limiting representations

if I(0):   φ_T ⇒ ln(∫_0^1 W(λ)² dλ),    (6.7a)

³⁶ Phillips and Ploberger's (1991) formula has been modified for an estimated variance as in Phillips
(1992b).
if I(1):   φ_T − ln N_T² ⇒ ln(∫_0^1 V*(λ)² dλ).    (6.7b)

The limiting distributions under the I(0) and I(1) models can be computed numeri-
cally from (6.7a) and (6.7b), respectively, which in turn permits the numerical
evaluation of the Bayes factor (6.6) based on this statistic.
It must be stressed that, although consistent decision-theoretic procedures such
as these have both theoretical and intuitive appeal, they have properties which
empirical researchers might find undesirable. One is that these procedures will
consistently classify local-to-I(1) processes as I(1) rather than I(0), and local-to-I(0)
processes as I(0) rather than as I(1). That is, if y_t is local-to-I(1) with local parameter
α = 1 + c/T, then, as the sample size increases, this process will be classified as I(1)
with probability increasing to one, even though along the sequence it is always an
procedures can have large misclassification rates in finite samples (loosely, their size
can be quite large), care must be taken in interpreting the results.
Initial empirical applications [Phillips (1992b)] and Monte Carlo simulations
[Elliott and Stock (1994), Stock (1992)] suggest that, for some applications such as
forecasting and pretesting, these approaches are promising. To date, however, the
investigation of the sampling properties of these and alternative procedures, and
in particular the effect of their use in second-stage procedures, is incomplete. It
would be premature to make concrete recommendations for empirical practice.

6.3. Practical and theoretical limitations in the ability to distinguish I(0) and I(1) processes

6.3.1. Theory

The evidence on tests of the I(1) null yields two troubling conclusions. On the one
hand, the tests have relatively low power against I(0) alternatives that might be of
interest; for example, with 100 observations in the detrended case, the local-to-unity
asymptotics indicate that the 5 percent one-sided MPI test has a power of 0.27
against \alpha = 0.9 and that the Dickey-Fuller t-test has a power of only 0.19. On the
other hand, processes which are I(1) but which have moderate negative autocorrelation
in first differences are incorrectly rejected with high probability, that is, the unit AR
root tests exhibit substantial size distortions, although the extent of these distortions
varies widely across test statistics. The same general conclusions were found for
tests of the general I(0) null: the power against interesting alternatives can be low
and, depending on the choice of spectral estimator, the rejection rate for null values
that have substantial positive autocorrelation can be well above the asymptotic
level.
A natural question is how one should interpret these finite-sample size distortions.
In this regard, it is useful to develop some results concerning the source of these
size distortions and whether they will persist in large samples. Section 4 examined
the behavior of tests of the I(0) null in the event that y_t was I(1) but local-to-I(0) in
the sense (4.12), and found that the I(0) tests with functional representations had
nondegenerate asymptotic power functions against these alternatives. It is natural
to wonder, then, what is the behavior of tests of the I(1) null, if the true process is
I(1) but is local to I(0)?
As a starting point, consider the d_t = 0 case and the sequence of models

u_t = b_T \zeta_t + \sum_{s=1}^{t} \eta_s,    (\zeta_t, \eta_t)' i.i.d. N(0, \sigma^2 I).    (6.8)

This is just the local-to-I(0) model (4.12), rescaled by multiplication by b_T, where
b_T = bT, with the two innovation sequences Gaussian and mutually and serially uncorrelated.
If in fact b = 0, then u_t is a Gaussian random walk, so one might consider
using the Sargan-Bhargava statistic \hat{R}_T. A direct calculation indicates that, for
b > 0, under this nesting T\hat{R}_T converges in distribution. It follows that Pr[\hat{R}_T < k] = Pr[T\hat{R}_T < Tk] \to 1
for any constant k, so that the rejection probability of the test tends to one. The
implication is that (unmodified) Sargan-Bhargava tests of sequences which are local
to I(0) in the sense (6.8) will incorrectly reject with asymptotic probability 1. The
implication of this result is perhaps clearer when u_t is cast in its ARIMA form,
\Delta u_t = (1 - \theta_T L)\varepsilon_t. For finite T, |\theta_T| < 1, but the limiting result indicates that the
rejection probability approaches one and so can be quite large for finite T.
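A small simulation illustrates the calculation; the model is (6.8) with b_T = bT as reconstructed above, and the particular Sargan-Bhargava-type statistic and difference-based variance estimator are illustrative simplifications. The statistic collapses at rate 1/T, so its rejection probability against any fixed critical value tends to one.

```python
import numpy as np

def simulate_u(T, b, rng):
    """u_t = b_T zeta_t + sum_{s<=t} eta_s with b_T = b*T, as in (6.8)."""
    zeta, eta = rng.standard_normal(T), rng.standard_normal(T)
    return b * T * zeta + np.cumsum(eta)

def R_T(u):
    """Sargan-Bhargava-type statistic: small values reject the I(1) null."""
    T = len(u)
    sigma2 = np.var(np.diff(u))            # crude variance estimate from differences
    return u @ u / (T ** 2 * sigma2)

rng = np.random.default_rng(2)
for T in (100, 400, 1600):
    draws = np.array([R_T(simulate_u(T, b=0.1, rng=rng)) for _ in range(500)])
    print(T, f"median R_T = {np.median(draws):.4f}, median T*R_T = {np.median(T * draws):.2f}")
```

The output shows the median of R_T shrinking roughly like 1/T while T*R_T stabilizes, which is the behavior driving Pr[R_T < k] toward one.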
A similar set of calculations can be made for tests of the I(0) null. Here, the relevant
sequence of null models to consider are those models against which the AR unit root
tests have nondegenerate local asymptotic power, namely the local-to-unity models
studied in Section 3.2.3. Again let d_t = 0 and suppose that the I(0) null is tested using
the L^\mu statistic. A straightforward calculation shows that T^{-1} L^\mu converges weakly to a
functional of W_c^\mu, where W_c^\mu is the demeaned local-to-unity diffusion process defined in Section 3.2.3.
It follows that Pr[L^\mu > k] = Pr[T^{-1} L^\mu > T^{-1} k] \to 1 for any constant k, so that the
rejection probability of the test tends to one. For these processes, which are local
to I(1), the L^\mu test rejects with probability approaching one even though, for fixed
T, u_t has the AR(1) representation (1 - \rho_T L)u_t = \varepsilon_t with |\rho_T| < 1.
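The companion calculation can be illustrated the same way: for a local-to-unity AR(1), a KPSS/L-type statistic built from demeaned partial sums diverges at rate T. The exact form of the statistic and the naive variance estimator used here are illustrative assumptions.

```python
import numpy as np

def L_stat(u):
    """KPSS/L-type statistic from demeaned partial sums; large values reject I(0)."""
    T = len(u)
    S = np.cumsum(u - u.mean())
    return (S @ S) / (T ** 2 * np.var(u))  # naive variance in place of a spectral estimator

def local_to_unity(T, c, rng):
    """AR(1) with rho_T = 1 + c/T: stationary for each T when c < 0, local to I(1)."""
    u, rho = np.zeros(T), 1.0 + c / T
    for t in range(1, T):
        u[t] = rho * u[t - 1] + rng.standard_normal()
    return u

rng = np.random.default_rng(3)
for T in (100, 400, 1600):
    draws = np.array([L_stat(local_to_unity(T, c=-10.0, rng=rng)) for _ in range(300)])
    print(T, f"median L = {np.median(draws):.2f}, median L/T = {np.median(draws) / T:.4f}")
```

Here L grows with T while L/T stabilizes, so rejection against any fixed critical value becomes certain even though each finite-T process is stationary.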
These results elucidate the Monte Carlo findings in Sections 3 and 4. In the AR
case, the implication is that there are I(1) models which are local to I(0) for which
the I(1) null will be incorrectly rejected with high probability. In the MA case, there
are I(0) models which are local to I(1) for which the I(0) null will be incorrectly
rejected with high probability. Thus the high false rejection rates (the size distor-
tions) found in the Monte Carlo analysis can be expected to persist asymptotically,
but the range of models suffering size distortions will decrease.
The foregoing analysis is limited both because of the tests considered and because
it does not address the question of the size of the neighborhoods of these incorrect
rejections; it was shown only that the neighborhoods are at least O(T^{-1}). In the
case of I(1) tests, Pantula (1991) has provided results on the sizes of these neighbor-
hoods for several tests of the unit AR root hypothesis in the MA(1) model. He found
that these neighborhoods vanish at different rates for different tests, with the slowest
rate being the Phillips-Perron (1988) statistic. This finding explains the particularly
large size distortions of this statistic with negative MA roots, even with very large
samples [e.g. T= 500; Pantula (1991, Table 2)]. In related work, Perron (1991d)
and Nabeya and Perron (1991) provide approximations to the distribution of the
OLS root estimator with sequences of negative MA and negative AR roots ap-
proaching their respective boundaries.
Because tests for the general I(0) null have only recently been developed, as of
this writing there have been few empirical analyses in which both I(0) and I(1) tests
are used [exceptions include Fisher and Park (1991) and Ogaki (1992)]. The
foregoing theoretical results suggest, however, that there will be a range of models
for which the I(1) test will reject with high probability and the I(0) test will not,
although the process is I(1); for which the I(0) test will reject and the I(1) test will
not, although the process is I(0); and for which both tests will reject and the process
is I(1). It also seems plausible that there are models that are truly I(0) but for which
both tests reject with high probability, but this has not been investigated formally.
There is currently little evidence on the volume of these regions of contradictory
results, although Amano and van Norden's (1992) Monte Carlo evidence suggests
that they may well be large in moderate sample sizes.
In summary, tests of the general I(0) null and tests of the general I(1) null are
neither similar nor unbiased. Asymptotically, the tests have size equal to their stated
level for fixed null models; but problems arise when we consider sequences of null
and alternative models for which the I(0) and I(1) models become increasingly close.
On the one hand, there are null models which will be rejected with arbitrarily high
probability; on the other hand, there are alternative models against which the tests
will have power approaching the nominal level. Although these regions diminish
asymptotically, in finite samples this implies that there is a range of I(0) and I(1)
models amongst which the unit MA and AR root tests are unable to distinguish.

6.3.2. Practical implications

The asymptotic inability to distinguish certain I(0) and I(1) models raises the
question of how these tests are to be interpreted, and this has generated great
controversy in the applied literature on the practical value of unit root tests. Some
of the earliest criticisms of Nelson and Plosser's (1982) stylized fact that many
economic time series contain a unit root came from Bayesian analyses [Sims (1988),
DeJong and Whiteman (1991a, 1991b); see the references in Section 6.2 following
equation (6.4)], although the discussion here follows the debate from a classical
perspective in Blough (1992), Christiano and Eichenbaum (1990), Cochrane (1991a),
Rudebusch (1992, 1993) and Stock (1990). In particular, the reader is referred to
Campbell and Perron's (1991) thoughtful discussion and the comments on it by
Cochrane (1991b). There has, however, been little systematic research on the practical
implications of this problem, so one's view of the importance of this lack of
unbiasedness remains largely a matter of judgment. Because of the prominence of
this issue it nonetheless seems appropriate to organize the ways in which such
judgment can be exercised. This discussion focuses exclusively on AR unit root tests,
although several of the remarks have parallels to MA unit root tests.
It is useful to return to the reasons, listed in the introduction, why one might wish
to perform inference concerning orders of integration: as data description; for fore-
casting; as pretesting for subsequent specification analysis or testing; or for testing
or distinguishing among economic theories. Although this discussion proceeds in
general terms, it must be emphasized that the size and power problems vary greatly
across test statistics, so that the difficulties discussed here are worse for some tests
than others.

Data description. The size distortions and low power of even the best-performing
tests imply that the literal interpretation of unit AR root tests as similar and unbiased
tests of the I(1) null against the I(0) alternative is inappropriate. However, the Monte
Carlo evidence provides considerable guidance in the interpretation of unit root test
results. For some tests, such as the Dickey-Fuller t-statistic and the DF-GLS
statistic, the size is well controlled over a wide range of null models so rejection can
be associated rather closely with the absence of a unit root. In contrast, the severe
size distortions of the Phillips-Perron tests [or other tests with the SC spectral
estimator, such as the Schmidt-Phillips (1992) MSB statistic] in the presence of
moderate negative MA roots and their low empirical rejection rates in the stationary
case with moderate positive MA or second AR roots indicate that rejection by this
statistic is only secondarily associated with the presence or absence of a unit root,
and instead is indicative of the extent of positive serial correlation in the process.
Interpretation of results based on extant versions of these statistics using SC
estimators is, thus, problematic. In any event, confidence intervals for measures of
long-run persistence are arguably more informative than unit root tests themselves;
constructing these confidence intervals entails testing a range of possible values of
\alpha, not just the unit root hypothesis.
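A minimal sketch of such an interval by test inversion, in the spirit of Stock (1991), is given below. For simplicity it uses ±1.96 normal critical values, which are appropriate only for |\alpha_0| < 1; the intervals in Stock (1991) instead use local-to-unity critical values that remain valid near \alpha_0 = 1. The grid and the AR(1) design are illustrative assumptions.

```python
import numpy as np

def alpha_interval(y, grid=np.linspace(0.80, 1.05, 251), crit=1.96):
    """Collect all alpha_0 in the grid not rejected by a t-test of alpha = alpha_0."""
    ynow, ylag = y[1:], y[:-1]
    ahat = ylag @ ynow / (ylag @ ylag)               # OLS estimate of alpha
    e = ynow - ahat * ylag
    se = np.sqrt(e @ e / (len(e) - 1) / (ylag @ ylag))
    keep = [a0 for a0 in grid if abs((ahat - a0) / se) < crit]
    return (min(keep), max(keep)) if keep else None

rng = np.random.default_rng(4)
y = np.zeros(200)
for t in range(1, 200):                              # AR(1) with alpha = 0.95
    y[t] = 0.95 * y[t - 1] + rng.standard_normal()
print(alpha_interval(y))                             # an interval that should cover 0.95
```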
An important caveat is that the unit root tests and thus confidence intervals
require that the trend order be correctly specified; depending on the type of misspeci-
fication, the tests might otherwise be inconsistent. We agree with Campbell and
Perron's (1991) emphasis on the importance of properly specifying the trend order
before proceeding with the classical tests, and this is an area in which one should
bring economic theory to bear to the maximum extent possible. For example, a
priori reasoning might suggest using a constant or shift-in-mean specification in
modeling interest rates, rather than including a linear time trend. We speculate that
while one could develop a consistent downward-testing procedure, starting with the
highest possible trend order and letting the test level decline with the sample size,
such an approach would have high misclassification rates in moderate samples (size
distortions and low power). The Bayesian approach in Phillips and Ploberger (1992)
to joint selection of the trend and order of integration is theoretically appealing for
fixed models, but the finite-sample performance of this approach has not yet been
fully investigated.

Forecasting. Campbell and Perron (1991) and Cochrane (1991b) examined the
effect of unit root pretests on forecasting performance. In their Monte Carlo experi-
ment, data were generated by an ARMA(l, 1) and forecasts were made using an
autoregression. Their most striking finding was that, in models with a unit AR root
and large negative correlation in first differences, the out-of-sample forecast error
was substantially lower with the unit root pretest than if the true differences
specification was used. This finding appeared both at short and long horizons (1
and 20 periods with a sample of size 100). In cases with less severe negative
correlation or with a stationary process, little was lost by pretesting relative to using
the true model. Because economic forecasting is largely done using multivariate
models, these initial results do not bear directly on most professional forecasting
activities. Still, they suggest that for forecasting the size distortions might be an
advantage, not a problem. A promising alternative to pretesting is to forecast using
median-unbiased estimates of \alpha, as discussed in Section 3.3. To date, however, there
has been no thorough examination of whether this delivers finite-sample improve-
ments in forecasts and forecast standard errors.
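The flavor of the pretest experiment can be conveyed with a small Monte Carlo along the lines described above; the AR(1) forecasting rule, the -2.86 pretest critical value, and the DGP settings are illustrative simplifications of the cited studies, not a replication.

```python
import numpy as np

def arma11(T, theta, rng):
    """Unit-root process whose first differences are MA(1) with parameter theta."""
    e = rng.standard_normal(T + 1)
    return np.cumsum(e[1:] + theta * e[:-1])

def forecast_error(y, mode):
    ytr, ynext = y[:-1], y[-1]                       # hold out the last observation
    dy, ylag = np.diff(ytr), ytr[:-1]
    b = ylag @ dy / (ylag @ ylag)                    # AR(1): alpha-hat = 1 + b
    t = b / np.sqrt(np.var(dy - b * ylag) / (ylag @ ylag))
    impose_unit = (mode == "diff") or (mode == "pretest" and t > -2.86)
    ahat = 1.0 if impose_unit else 1.0 + b
    return ynext - ahat * ytr[-1]                    # one-step-ahead forecast error

rng = np.random.default_rng(5)
for mode in ("levels", "diff", "pretest"):
    errs = [forecast_error(arma11(101, theta=-0.7, rng=rng), mode) for _ in range(1000)]
    print(mode, f"RMSE = {np.sqrt(np.mean(np.square(errs))):.3f}")
```

With a unit root and strongly negative MA correlation in differences, the pretest tends to select the levels specification, which is the mechanism behind the finding that pretesting can beat imposing the true differences specification.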

Pretests for second-stage inference. Perhaps the most common use of unit root
tests is as pretests for second-stage inference: as a preliminary stage for developing
a forecasting model, for formulating a cointegrated system, or for determining
subsequent distribution theory. In the final of these applications, the existing distri-
bution theory for inference in linear time series regressions conditions upon the
number and location of unit roots in the system, in the sense that the orders of
integration and cointegration are assumed known. In empirical work, these orders
are typically unknown, so one way to proceed is to pretest for integration or
cointegration and then to condition on the results of these pretests in performing
second-stage inference. In practice, this can mean using a unit root pretest to decide
whether a variable should enter a second-stage regression in levels or differences, as
was done in the empirical application in Section 5.1.4. Alternatively, if the relation-
ship of interest involves the level of the variable in a second-stage regression, a unit
root pretest could be used to ascertain whether standard or nonstandard distribution
theory should be used to compute second-stage tests.
There has, however, been little research on the implications of this use of unit
root tests. Some evidence addressing this is provided by Elliott and Stock (1994)
who consider a bivariate problem in which there is uncertainty about whether the
regressor has a unit root. In their Monte Carlo simulation, they find that unit root
pretests can induce substantial size distortions in the second-stage test. If the
innovations of the regressor and the second-stage regression error are correlated,
the first-stage Dickey-Fuller t-statistic and the second-stage t-statistic will be
dependent so the size of the second stage in this two-stage procedure cannot be
controlled effectively, even asymptotically. Although this problem is important
when this error correlation is high, in applications with more modest correlations
the problem is less severe.
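A stylized version of this two-stage experiment is sketched below; the DGP, the -2.86 pretest critical value, and the levels-versus-differences rule are illustrative assumptions rather than Elliott and Stock's design. The point is that the pretest and second-stage statistics are dependent, so the overall rejection rate of the nominal 5% second-stage test can deviate from 5% when the innovation correlation delta is large.

```python
import numpy as np

def one_rep(T, delta, rng):
    v = rng.standard_normal(T)
    u = delta * v + np.sqrt(1 - delta ** 2) * rng.standard_normal(T)
    x = np.cumsum(v)                      # the regressor is exactly I(1)
    y = u                                 # second-stage relation with beta = 0
    # first stage: Dickey-Fuller pretest on the regressor
    dx, xlag = np.diff(x), x[:-1]
    b = xlag @ dx / (xlag @ xlag)
    tdf = b / np.sqrt(np.var(dx - b * xlag) / (xlag @ xlag))
    # second stage: y_t on x_{t-1} (levels) if the unit root is rejected,
    # otherwise y_t on the lagged difference of x
    reg = x[1:-1] if tdf < -2.86 else np.diff(x)[:-1]
    yy = y[2:]
    bhat = reg @ yy / (reg @ reg)
    t2 = bhat / np.sqrt(np.var(yy - bhat * reg) / (reg @ reg))
    return abs(t2) > 1.96                 # nominal 5% test of beta = 0

rng = np.random.default_rng(6)
for delta in (0.0, 0.5, 0.9):
    rej = np.mean([one_rep(200, delta, rng) for _ in range(2000)])
    print(f"delta = {delta}: overall rejection rate = {rej:.3f}")
```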

Formulating and testing economic theories. This is arguably the application most
damaged by the problems of poor size and low power. In special cases - the
martingale theories of consumption and stock prices being the leading examples -
simple theories predict that the series is not only I(1) but is a martingale. In this
case, the null models are circumscribed and the problems of size distortions do
not arise. However, the initial appeal of unit root tests to economists was that
they seemed to provide a way to distinguish between broad classes of models: on
the one hand, dynamic stochastic equilibrium models (real business cycle models)
in which fluctuations were optimal adjustments to supply shocks; on the other hand,
Keynesian models in which fluctuations arose in large part from demand distur-
bances. Indeed, this was the original context in which they were interpreted in
Nelson and Plossers (1982) seminal paper.
Unfortunately, there are two problems, either of which alone is fatal to such an
interpretation. The first is a matter of economic theory: as argued by Christiano
and Eichenbaum (1990), stochastic equilibrium models need not generate unit
roots in observed output, and as argued by West (1988b), Keynesian models can
generate autoregressive roots that are very close to unity. Thus a rejection by an
ideal unit root test (that is, one with no size distortions) need not invalidate a real
business cycle model and a failure to reject should not be interpreted as evidence
against Keynesian models. The second is the lack of unbiasedness outlined above:
even if the match between classes of macroeconomic theories and whether macro-
economic series are I(1) were exact, the size distortions and low power would mean
that the outcomes of unit root tests would not discriminate among theories. In this
light the idea, however appealing, that a univariate unit root test could distinguish
which class of models best describes the macroeconomy seems in retrospect overly
ambitious.
This said, inference about the order of integration of a time series can usefully
guide the specification and empirical analysis of relations of theoretical interest in
economics. For example, King and Watson (1992) and Fisher and Seater (1993) use
these techniques to provide evidence on which versions of money neutrality (long-
run neutrality, superneutrality) can be investigated empirically. They show that
long-run neutrality can be tested without specifying a complete model of short-run
dynamics, as long as money and income are I(1). Similarly, investigations into
whether there are unit roots in exchange rates have proven central to inferences
about such matters as long-run purchasing power parity and the behavior of
exchange rates in the presence of target zones [see Johnson (1993) and Svensson
(1992) for reviews]. Finally, quantitative conclusions about the persistence in uni-
variate series have proven to be a key starting point for modeling the long-run
properties of multiple time series and cointegration analysis, an area which has seen
an explosion of exciting empirical and theoretical research and is the topic of the
next chapter in this Handbook.

References

Ahtola, J. and G.C. Tiao (1984) Parameter Inference for a Nearly Nonstationary First Order Auto-
regressive Model, Biometrika, 71, 263-272.
Amano, R.A. and S. van Norden (1992) Unit Root Tests and the Burden of Proof, manuscript, Bank
of Canada.
Anderson, T.W. (1948) On the Theory of Testing Serial Correlation, Skandinavisk Aktuarietidskrift, 31, 88-116.
Anderson, T.W. (1959) On Asymptotic Distributions of Estimates of Parameters of Stochastic Difference
Equations, Annals of Mathematical Statistics, 30, 676-687.
Anderson, T.W. and D. Darling (1952) Asymptotic Theory of Certain Goodness of Fit Criteria Based
on Stochastic Processes, Annals of Mathematical Statistics, 23, 193-212.
Anderson, T.W. and A. Takemura (1986) Why Do Noninvertible Moving Averages Occur?, Journal
of Time Series Analysis, 7, 235-254.
Andrews, D.W.K. (1991) Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Esti-
mation, Econometrica, 59, 817-858.
Andrews, D.W.K. (1993a) Exactly Median-Unbiased Estimation of First Order Autoregressive/Unit
Root Model, Econometrica, 61, 139-166.
Andrews, D.W.K. (1993b) Tests for Parameter Instability and Structural Change with Unknown
Change Point, Econometrica, 61, 821-856.
Andrews, D.W.K. and H.-Y. Chen (1992) Approximately Median-Unbiased Estimation of Autoregres-
sive Models with Applications to U.S. Macroeconomic and Financial Time Series, Cowles Foundation
Discussion Paper no. 1026, Yale University.
Andrews, D.W.K. and J.C. Monahan (1992) An Improved Heteroskedasticity and Autocorrelation
Consistent Covariance Matrix Estimator, Econometrica, 60, 953-966.
Andrews, D.W.K. and W. Ploberger (1992) Optimal Tests When a Nuisance Parameter is Identified
Only Under the Alternative, Cowles Foundation Discussion Paper no. 1015, Yale University.
Andrews, D.W.K., I. Lee and W. Ploberger (1992) Optimal Changepoint Tests for Normal Linear
Regression, Cowles Foundation Discussion Paper no. 1016, Yale University.
Ansley, C.F. and P. Newbold (1980) Finite Sample Properties of Estimators for Autoregressive Moving
Average Models, Journal of Econometrics, 13, 159-183.
Arnold, L. (1973) Stochastic Differential Equations: Theory and Applications. New York: Wiley.
Bai, J. (1992) Econometric Estimation of Structural Change, unpublished Ph.D. Dissertation, Depart-
ment of Economics, University of California, Berkeley.
Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1992a) Co-integration, Error Correction, and the Econometric Analysis of Non-Stationary Data. Oxford: Oxford University Press.
Banerjee, A., R.L. Lumsdaine and J.H. Stock (1992b) Recursive and Sequential Tests of the Unit Root
and Trend Break Hypotheses: Theory and International Evidence, Journal of Business and Economic
Statistics, 10, 271-288.
Barry, D. and J.A. Hartigan (1993) A Bayesian Analysis for Change Point Problems, Journal of the
American Statistical Association, 88, 309-319.
Beach, C.M. and J.G. MacKinnon (1978) A Maximum Likelihood Procedure for Regression with
Autocorrelated Errors, Econometrica, 46, 51-58.
Beaulieu, J.J. and J.A. Miron (1993) Seasonal Unit Roots in Aggregate U.S. Data, Journal of Econometrics, 55, 305-328.
Beran, J. (1992) Statistical Methods for Data with Long-Range Dependence (with discussion), Statistical Science, 7, 404-427.
Berk, K.N. (1974) Consistent Autoregressive Spectral Estimates, Annals of Statistics, 2, 489-502.
Beveridge, S. and C.R. Nelson (1981) A New Approach to Decomposition of Economic Time Series
into Permanent and Transitory Components with Particular Attention to Measurement of the
Business Cycle, Journal of Monetary Economics, 7, 151-174.
Bhargava, A. (1986) On the Theory of Testing for Unit Roots in Observed Time Series, Review of
Economic Studies, 53, 369-384.
Bierens, H.J. (1993) Higher-Order Sample Autocorrelations and the Unit Root Hypothesis, Journal
of Econometrics, 57, 137-160.
Bierens, H.J. and S. Guo (1993) Testing Stationarity and Trend Stationarity Against the Unit Root
Hypothesis, Econometric Reviews, 12, 1-32.
Billingsley, P. (1968) Convergence of Probability Measures. New York: Wiley.
Blanchard, O.J. and D. Quah (1989) The Dynamic Effects of Aggregate Demand and Supply Disturbances,
American Economic Review, 79, 655-673.
Blough, S.R. (1992) The Relationship Between Power and Level for Generic Unit Root Tests in Finite
Samples, Journal of Applied Econometrics, 7, 295-308.
Bobkoski, M.J. (1983) Hypothesis Testing in Nonstationary Time Series, unpublished Ph.D. thesis,
Department of Statistics, University of Wisconsin.
Bollerslev, T. (1986) Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econo-
metrics, 31, 307-327.
Box, G.E.P. and G.M. Jenkins (1976) Time Series Analysis: Forecasting and Control. Revised Edition.
San Francisco: Holden-Day.
Brillinger, D.R. (1981) Time Series: Data Analysis and Theory. San Francisco: Holden-Day.
Brockwell, P.J. and R.A. Davis (1987) Time Series: Theory and Methods. New York: Springer-Verlag.
Brown, B.M. (1971) Martingale Central Limit Theorems, Annals of Mathematical Statistics, 42, 59-66.
Brown, R.L., J. Durbin and J.M. Evans (1975) Techniques for Testing the Constancy of Regression
Relationships over Time with Comments, Journal of the Royal Statistical Society, Series B, 37,
149-192.
Campbell, J.Y. (1987) Does Saving Anticipate Declining Labor Income? An Alternative Test of the
Permanent Income Hypothesis, Econometrica, 55, 1249-1274.
Campbell, J.Y. and N.G. Mankiw (1987a) Are Output Fluctuations Transitory?, Quarterly Journal of Economics, 102, 857-880.
Campbell, J.Y. and N.G. Mankiw (1987b) Permanent and Transitory Components in Macroeconomic Fluctuations, The American Economic Review, 77, 111-117.
Campbell, J.Y. and P. Perron (1991) Pitfalls and Opportunities: What Macroeconomists Should Know
about Unit Roots, NBER Macroeconomics Annual, 141-200.
Cavanagh, C. (1985) Roots Local To Unity, manuscript, Department of Economics, Harvard University.
Chan, N.H. (1988) On The Parameter Inference for Nearly Nonstationary Time Series, Journal of the American Statistical Association, 83, 857-862.
Chan, N.H. (1989) Asymptotic Inference for Unstable Autoregressive Time Series With Drifts, Journal of Statistical Planning and Inference, 23, 301-312.
Chan, N.H. and C.Z. Wei (1987) Asymptotic Inference For Nearly Nonstationary AR(1) Processes, Annals of Statistics, 15, 1050-1063.
Chan, N.H. and C.Z. Wei (1988) Limiting Distributions of Least Squares Estimates of Unstable Autoregressive Processes, Annals of Statistics, 16, 367-401.
Chernoff, H. and S. Zacks (1964) Estimating the Current Mean of a Normal Distribution Which Is Subject to Changes in Time, Annals of Mathematical Statistics, 35, 999-1028.
Choi, I. (1993) Effects of Data Aggregation on the Power of Tests for a Unit Root: A Simulation Study, Economics Letters, forthcoming.
Chow, G.C. (1960) Tests of Equality Between Sets of Coefficients in Two Linear Regressions, Econometrica, 28, 591-605.
Chow, G.C. (1984) Random and Changing Coefficient Models, in: Z. Griliches and M. Intriligator, eds., Handbook of Econometrics, Vol. 2. Amsterdam: North-Holland.
Christiano, L.J. (1992) Searching for a Break in GNP, Journal of Business and Economic Statistics, 10,
237-250.
Christiano, L.J. and M. Eichenbaum (1990) Unit Roots in GNP: Do We Know and Do We Care?
Carnegie-Rochester Conference Series on Public Policy, 32, 7-62.
Christiano, L.J. and L. Ljungqvist (1988) Money Does Granger-Cause Output in the Bivariate Money-
Output Relation, Journal of Monetary Economics, 22, 217-236.
Chu, C.-S. and H. White (1992) A Direct Test for Changing Trend, Journal of Business and Economic
Statistics, 10, 289-299.
Clark, P.K. (1987) The Cyclical Component of U.S. Economic Activity, The Quarterly Journal of
Economics, 102, 798-814.
Clark, P.K. (1989) Trend Reversion in Real Output and Unemployment, Journal of Econometrics, 40,
15-32.
Cochrane, J.H. (1988) How Big is the Random Walk in GNP?, Journal of Political Economy, 96,
893-920.
Cochrane, J.H. (1991a) A Critique of the Application of Unit Root Tests, Journal of Economic Dynamics and Control, 15, 275-284.
Cochrane, J.H. (1991b) Comment on Campbell and Perron, NBER Macroeconomics Annual 1991, 5,
201-210.
Cooley, T.F. and E.C. Prescott (1976) Estimation in the Presence of Stochastic Parameter Variation, Econometrica, 44, 167-184.
Cooper, D.M. and R. Thompson (1977) A Note on the Estimation of the Parameters of the Autoregressive-
Moving Average Process, Biometrika, 64, 625-628.
Cryer, J.D. and J. Ledolter (1981) Small-Sample Properties of the Maximum Likelihood Estimator in
the First-Order Moving Average Model, Biometrika, 68, 691-694.
Dahlhaus, R. (1989) Efficient Parameter Estimation for Self-Similar Processes, Annals of Statistics, 17,
1749-1766.
Davidson, J.E.H. (1979) Small Sample Properties of Estimators of the Moving Average Process, in:
E.G. Charatsis, ed., Proceedings of the Econometric Society Meeting in 1979: Selected Econometric
Papers in Memory of Stephan Valananis. North-Holland: Amsterdam.
Davidson, J.E.H. (1981) Problems With the Estimation of Moving Average Processes, Journal of
Econometrics, 16, 295-310.
Davidson, J.E.H., D.F. Hendry, F. Srba and S. Yeo (1978) Econometric Modelling of the Aggregate
Time-Series Relationship Between Consumers' Expenditure and Income in the United Kingdom, Economic Journal, 88, 661-692.
Davies, R.B. (1977) Hypothesis Testing When a Nuisance Parameter is Present only Under the
Alternative, Biometrika, 64, 247-254.
Davis, R.A. and W.T.M. Dunsmuir (1992) Inference for MA(1) Processes with a Root On or Near the
Unit Circle, Discussion Paper no. 36, Bond University School of Business.
DeJong, D.N. and C.H. Whiteman (1991a) Reconsidering Trends and Random Walks in Macroeco-
nomic Time Series, Journal of Monetary Economics, 28, 221-254.
DeJong, D.N. and C.H. Whiteman (1991b) The Temporal Stability of Dividends and Stock Prices:
Evidence from the Likelihood Function, American Economic Review, 81, 600-617.
DeJong, D.N., J.C. Nankervis, N.E. Savin and C.H. Whiteman (1992a) Integration versus Trend-
Stationarity in Macroeconomic Time Series, Econometrica, 60, 423-434.
DeJong, D.N., J.C. Nankervis, N.E. Savin and C.H. Whiteman (1992b) The Power Problems of Unit
Root Tests for Time Series with Autoregressive Errors, Journal of Econometrics, 53, 323-343.
Dent, W. and A.-S. Min (1978) A Monte Carlo Study of Autoregressive Integrated Moving Average
Processes, Journal of Econometrics, 7, 23-55.
Deshayes, J. and D. Picard (1986) Off-line Statistical Analysis of Change-Point Models Using Non
Parametric and Likelihood Methods, in: M. Basseville and A. Benveniste, eds., Detection of Abrupt
Changes in Signals and Dynamical Systems. Springer Verlag Lecture Notes in Control and Information
Sciences, no. 77, 103-168.
Dickey, D.A. (1976) Estimation and Hypothesis Testing in Nonstationary Time Series, Ph.D. dis-
sertation, Iowa State University.
Dickey, D.A. and W.A. Fuller (1979) Distribution of the Estimators for Autoregressive Time Series with
a Unit Root, Journal of the American Statistical Association, 74, 427-431.
Dickey, D.A. and W.A. Fuller (1981) Likelihood Ratio Tests for Autoregressive Time Series with a
Unit Root, Econometrica, 49, 1057-1072.
Dickey, D.A. and S.C. Pantula (1987) Determining the Order of Differencing in Autoregressive Processes,
Journal of Business and Economic Statistics, 5, 455-462.
Dickey, D.A., D.P. Hasza, and W.A. Fuller (1984) Testing for Unit Roots in Seasonal Time Series,
Journal of the American Statistical Association, 79, 355-367.
Diebold, F.X. (1990) On Unit Root Econometrics: Discussion of Geweke, Sims, and DeJong and
Whiteman, in: Proceedings of the American Statistical Association Business and Economics Statistics
Session. Washington, D.C.: American Statistical Association.
Diebold, F.X. (1993) The Effect of Seasonal Adjustment Filters on Tests for a Unit Root: Discussion,
Journal of Econometrics, 55, 99-103.
Diebold, F.X. and C. Chen (1992) Testing Structural Stability with Endogenous Break Point: A Size
Comparison of Analytic and Bootstrap Procedures, manuscript, Department of Economics, University
of Pennsylvania.
Diebold, F.X. and M. Nerlove (1990) Unit Roots in Economic Time Series: A Selective Survey, in: T.B.
Fomby and G.F. Rhodes, eds., Advances in Econometrics: Co-Integration, Spurious Regressions, and
Unit Roots. Greenwich, CT: JAI Press.
Diebold, F.X. and G.D. Rudebusch (1989) Long Memory and Persistence in Aggregate Output,
Journal of Monetary Economics, 24, 189-209.
Diebold, F.X. and G.D. Rudebusch (1991a) Is Consumption Too Smooth? Long Memory and the
Deaton Paradox, Review of Economics and Statistics, 71, 1-9.
Diebold, F.X. and G.D. Rudebusch (1991b) On the Power of Dickey-Fuller Tests Against Fractional
Alternatives, Economics Letters, 35, 155-160.
Diebold, F.X., S. Husted and M. Rush (1991) Real Exchange Rates Under the Gold Standard, Journal
of Political Economy, 99, 1252-1271.
Dufour, J.-M. (1990) Exact Tests and Confidence Sets in Linear Regressions with Autocorrelated
Errors, Econometrica, 58, 475-494.
Dufour, J.-M. and M.L. King (1991) Optimal Invariant Tests for the Autocorrelation Coefficient in
Linear Regressions with Stationary or Nonstationary AR(1) Errors, Journal of Econometrics, 47,
115-143.
Dunsmuir, W.T.M. (1981) Estimation for Stationary Time Series When Data Are Irregularly Spaced
or Missing, in: D.F. Findley, ed., Applied Time Series II, Academic Press: New York.
Durlauf, S.N. (1989) Output Persistence, Economic Structure and the Choice of Stabilization Policy, Brookings Papers on Economic Activity, 2, 69-136.
Durlauf, S.N. and P.C.B. Phillips (1988) Trends Versus Random Walks in Time Series Analysis, Econometrica, 56, 1333-1354.
Elliott, G. (1993a) Efficient Tests for a Unit Root When the Initial Observation Is Drawn From Its
Unconditional Distribution, manuscript, Department of Economics, Harvard University.
Elliott, G. (1993b) Monte Carlo Performance of Twenty Unit Root Tests, manuscript, Department of
Economics, Harvard University.
Elliott, G. and J.H. Stock (1994) Inference in Time Series Regression When the Order of Integration of
a Regressor is Unknown, Econometric Theory, forthcoming.
Elliott, G., T.J. Rothenberg and J.H. Stock (1992) Efficient Tests for an Autoregressive Unit Root, NBER
Technical Working Paper no. 130.
Engle, R.F. and C.W.J. Granger (1987) Co-integration and Error Correction: Representation, Estimation
and Testing, Econometrica, 55, 251-276.
Ethier, S.N. and T.G. Kurtz (1986) Markov Processes: Characterization and Convergence. New York:
Wiley.
Evans, G.B.A. and N.E. Savin (1981a) The Calculation of the Limiting Distribution of the Least Squares Estimator of the Parameter in a Random Walk Model, Annals of Statistics, 9, 1114-1118.
Evans, G.B.A. and N.E. Savin (1981b) Testing for Unit Roots: I, Econometrica, 49, 753-779.
Evans, G.B.A. and N.E. Savin (1984) Testing for Unit Roots: II, Econometrica, 52, 1241-1260.
Fama, E.F. (1970) Efficient Capital Markets: A Review of Theory and Empirical Work, Journal of Finance, 25, 383-417.
Feldstein, M. and J.H. Stock (1994) The Use of a Monetary Aggregate to Target Nominal GDP, in:
N.G. Mankiw, ed., Monetary Policy. Chicago: University of Chicago Press.
Fisher, E.O. and J.Y. Park (1991) Testing Purchasing Power Parity Under the Null Hypothesis of
Cointegration, Economic Journal, 101, 1476-1484.
Fisher, M.E. and J.J. Seater (1993) Long-Run Neutrality and Superneutrality in an ARIMA Framework,
American Economic Review, 83, 402-415.
Fox, R. and M.S. Taqqu (1986) Large-Sample Properties of Parameter Estimates for Strongly Dependent Stationary Gaussian Time Series, Annals of Statistics, 14, 517-532.
Franzini, L. and A.C. Harvey (1983) Testing for deterministic trend and seasonal components in time
series models, Biometrika, 70, 673-682.
Friedman, B.M. and K.N. Kuttner (1992) Money, Income, Prices and Interest Rates, American
Economic Review, 82, 472-492.
Friedman, M. and D. Meiselman (1963) The Relative Stability of the Investment Multiplier and
Monetary Velocity in the United States, 1897-1958, in: Stabilization Policies. Commission on Money
and Credit, Englewood Cliffs, NJ: Prentice-Hall, 165-268.
Fuller, W.A. (1976) Introduction to Statistical Time Series. New York: Wiley.
Garbade, K. (1977) Two Methods of Examining the Stability of Regression Coefficients, Journal of the American Statistical Association, 72, 54-63.
Gardner, L.A. (1969) On Detecting Changes in the Mean of Normal Variates, Annals of Mathematical
Statistics, 40, 116-126.
Geweke, J. and S. Porter-Hudak (1983) The Estimation and Application of Long Memory Time Series
Models, Journal of Time Series Analysis, 4, 221-238.
Ghysels, E. (1990) Unit-Root Tests and the Statistical Pitfalls of Seasonal Adjustment: The Case of U.S.
Postwar Real Gross National Product, Journal of Business and Economic Statistics, 8, 145-152.
Ghysels, E. and P. Perron (1993) The Effect of Seasonal Adjustment Filters on Tests for a Unit Root,
Journal of Econometrics, 55, 57-98.
Hackl, P. and A. Westlund (1989) Statistical Analysis of Structural Change: An Annotated Bibliography, Empirical Economics, 14, 167-192.
Hackl, P. and A.H. Westlund (1991) eds., Economic Structural Change: Analysis and Forecasting. Berlin: Springer-Verlag.
Hall, A. (1989) Testing for a Unit Root in the Presence of Moving Average Errors, Biometrika, 76,
49-56.
Hall, A. (1992a) Testing for a Unit Root in Time Series Using Instrumental Variable Estimation with
Pretest Data-Based Model Selection, Journal of Econometrics, 54, 223-250.
Hall, A. (1992b) Unit Root Tests After Data Based Model Selection, Proceedings of the 1992 Summer
Meetings of the American Statistical Association: Business and Economics Section.
Hall, P. and CC. Heyde (1980) Martingale Limit Theory and its Applications. New York: Academic
Press.
Hall, R.E. (1978) Stochastic Implications of the Life Cycle-Permanent Income Hypothesis: Theory and
Evidence, Journal of Political Economy, 86, 971-987.
Hansen, B.E. (1990) Lagrange Multiplier Tests for Parameter Instability in Non-Linear Models, manu-
script, Economics Department, University of Rochester.
Hansen, B.E. (1992) Tests for Parameter Instability in Regressions with I(1) Processes, Journal of
Business and Economic Statistics, 10, 321-336.
Harvey, A.C. (1981) Time Series Models. Oxford: Philip Allan.
Harvey, A.C. (1985) Trends and Cycles in Macroeconomic Time Series, Journal of Business and
Economic Statistics, 3, 216-227.
Harvey, A.C. (1989) Forecasting, Structural Models and the Kalman Filter. Cambridge, U.K.: Cambridge
University Press.
Harvey, A.C. and A. Jaeger (1993) Detrending, Stylized Facts and the Business Cycle, Journal of Applied
Econometrics, 8, 231-248.
Harvey, A.C. and N.G. Shephard (1992) Structural Time Series Models, in: G.S. Maddala, CR. Rao
and H.D. Vinod, eds., Handbook of Statistics, Vol. 11. Amsterdam: Elsevier.
Hasza, D.P. and W.A. Fuller (1979) Estimation for Autoregressive Processes with Unit Roots, Annals
of Statistics, 7, 1106-1120.
Hasza, D.P. and W.A. Fuller (1981) Testing for Nonstationary Parameter Specifications in Seasonal
Time Series Models, Annals of Statistics, 10, 1209-1216.
Hendry, D.F. and A.J. Neale (1991) A Monte Carlo Study of the Effects of Structural Breaks on Tests
for Unit Roots, in: P. Hackl and A.H. Westlund, eds., Economic Structural Change: Analysis and Forecasting. Berlin: Springer-Verlag, 95-119.
Herrndorf, N.A. (1984) A Functional Central Limit Theorem for Weakly Dependent Sequences of
Random Variables, Annals of Probability, 12, 141-153.
Hoffman, D. and R.H. Rasche (1991) Long-Run Income and Interest Elasticities of Money Demand in
the United States, Review of Economics and Statistics, 73, 665-674.
Hurwicz, L. (1950) Least-Squares Bias in Time Series, in: T. Koopmans, ed., Statistical Inference in
Dynamic Economic Models. New York: Wiley.
Hylleberg, S., R.F. Engle, C.W.J. Granger and S. Yoo (1990) Seasonal Integration and Cointegration,
Journal of Econometrics, 44, 215-238.
Jandhyala, V.K. and LB. MacNeill (1992) On Testing for the Constancy of Regression Coefficients
Under Random Walk and Change-Point Alternatives, Econometric Theory, 8, 501-517.
Jeganathan, P. (1991) On the Asymptotic Behavior of Least-Squares Estimators in AR Time Series
with Roots Near the Unit Circle, Econometric Theory, 7, 269-306.
Johnson, D. (1993) Unit Roots, Cointegration and Purchasing Power Parity: Canada and the United States, 1870-1991, in: B. O'Reilly and J. Murray, eds., The Exchange Rate and the Economy.
Ottawa: Bank of Canada.
Kang, K.M. (1975) A Comparison of Estimators of Moving Average Processes, manuscript, Australian
Bureau of Statistics.
Kendall, M.G. and A. Stuart (1967) The Advanced Theory of Statistics, Vol. 2, Second edition. London: Charles Griffin.
Kim, H.-J. and D. Siegmund (1989) The Likelihood Ratio Test for a Change-Point in Simple Linear
Regression, Biometrika, 76, 409-423.
King, M.L. (1980) Robust Tests for Spherical Symmetry and their Application to Least Squares
Regression, Annals of Statistics, 8, 1265-1271.
King, M.L. (1988) Towards a Theory of Point Optimal Testing, Econometric Reviews, 6, 169-218.
King, M.L. and G.H. Hillier (1985) Locally Best Invariant Tests of the Error Covariance Matrix of the
Linear Regression Model, Journal of the Royal Statistical Society, Series B, 47, 98-102.
King, R.G. and M.W. Watson (1992) Testing Long-Run Neutrality, NBER Working Paper no. 4156.
King, R.G., C.I. Plosser, J.H. Stock and M.W. Watson (1991) Stochastic Trends and Economic
Fluctuations, American Economic Review, 81, 819-840.
Kiviet, J.F. and G.D.A. Phillips (1992) Exact Similar Tests for Unit Roots and Cointegration, Oxford
Bulletin of Economics and Statistics, 54, 349-368.
Konishi, T., V.A. Ramey and C.W.J. Granger (1993) Stochastic Trends and Short-Run Relationships
Between Financial Variables and Real Activity, NBER Working Paper no. 4275.
Koopmans, T. (1942) Serial Correlation and Quadratic Forms in Normal Variables, Annals of Mathematical Statistics, 13, 14-33.
Krämer, W. and H. Sonnberger (1986) The Linear Regression Model Under Test. Heidelberg: Physica-Verlag.
Krämer, W., W. Ploberger and R. Alt (1988) Testing for Structural Change in Dynamic Models, Econometrica, 56, 1355-1369.
Krishnaiah, P.R. and B.Q. Miao (1988) Review about Estimation of Change Points, in: P.R. Krishnaiah and C.R. Rao, eds., Handbook of Statistics, Vol. 7. New York: Elsevier.
Kwiatkowski, D., P.C.B. Phillips, P. Schmidt and Y. Shin (1992) Testing the Null Hypothesis of
Stationarity Against the Alternatives of a Unit Root: How Sure Are We that Economic Time Series
Have a Unit Root?, Journal of Econometrics, 54, 159-178.
LaMotte, L.R. and A. McWhorter (1978) An Exact Test for the Presence of Random Walk Coefficients in a Linear Regression Model, Journal of the American Statistical Association, 73, 816-820.
Lehmann, E.L. (1959) Testing Statistical Hypotheses. New York: John Wiley and Sons.
Lewis, R. and G.C. Reinsel (1985) Prediction of Multivariate Time Series by Autoregressive Model Fitting, Journal of Multivariate Analysis, 16, 393-411.
Leybourne, S.J. and B.P.M. McCabe (1989) On the Distribution of Some Test Statistics for Coefficient Constancy, Biometrika, 76, 169-177.
Lo, A.W. (1991) Long-Term Memory in Stock Market Prices, Econometrica, 59, 1279-1314.
MacNeill, I.B. (1974) Tests for Change of Parameter at Unknown Times and Distributions of Some
Related Functionals on Brownian Motion, Annals of Statistics, 2, 950-962.
MacNeill, I.B. (1978) Properties of Sequences of Partial Sums of Polynomial Regression Residuals with
Applications to Tests for Change of Regression at Unknown Times, Annals of Statistics, 6, 422-433.
Magnus, J.R. and B. Pesaran (1991) The Bias of Forecasts from a First-Order Autoregression,
Econometric Theory, 7, 222-235.
Mandelbrot, B.B. (1975) Limit Theorems on the Self-Normalized Range for Weakly and Strongly Dependent Processes, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 31, 271-285.
Mandelbrot, B.B. and M.S. Taqqu (1979) Robust R/S Analysis of Long Run Serial Correlation, Invited
paper, 42nd Session of the International Institute of Statistics.
Mandelbrot, B.B. and J.W. Van Ness (1968) Fractional Brownian Motions, Fractional Noise and
Applications, SIAM Review, 10, 422-437.
Mann, H.B. and A. Wald (1943) On the Statistical Treatment of Linear Stochastic Difference Equations,
Econometrica, 11, 173-220.
Marriott, F.H.C. and J.A. Pope (1954) Bias in the Estimation of Autocorrelations, Biometrika, 41, 393-403.
McCabe, B.P.M. and M.J. Harrison (1980) Testing the Constancy of Regression Relationships over
Time Using Least Squares Residuals, Applied Statistics, 29, 142-148.
Meese, R.A. and K.J. Singleton (1982) On Unit Roots and the Empirical Modeling of Exchange Rates,
Journal of Finance, 37, 1029-1035.
Nabeya, S. and P. Perron (1991) Local Asymptotic Distributions Related to the AR(1) Model with
Dependent Errors, Econometric Research Program Memorandum No. 362, Princeton University.
Nabeya, S. and B.E. Sorensen (1992) Asymptotic Distributions of the Least Squares Estimators and Test
Statistics in the Near Unit Root Model with Non-Zero Initial Value and Local Drift and Trend,
manuscript, Department of Economics, Brown University.
Nabeya, S. and K. Tanaka (1988) Asymptotic Theory of a Test for the Constancy of Regression
Coefficients Against the Random Walk Alternative, Annals of Statistics, 16, 218-235.
Nabeya, S. and K. Tanaka (1990a) A General Approach to the Limiting Distribution for Estimators
in Time Series Regression with Nonstable Autoregressive Errors, Econometrica, 58, 145-163.
Nabeya, S. and K. Tanaka (1990b) Limiting Powers of Unit Root Tests in Time Series Regression, Journal of Econometrics, 46, 247-271.
Nankervis, J.C. and N.E. Savin (1988) The Exact Moments of the Least-Squares Estimator for the
Autoregressive Model: Corrections and Extensions, Journal of Econometrics, 37, 381-388.
Nelson, C.R. and H. Kang (1981) Spurious Periodicity in Inappropriately Detrended Time Series,
Econometrica, 49, 741-751.
Nelson, C.R. and C.I. Plosser (1982) Trends and Random Walks in Macro-economic Time Series: Some
Evidence and Implications, Journal of Monetary Economics, 10, 139-162.
Newey, W. and K.D. West (1987) A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix, Econometrica, 55, 703-708.
Ng, S. and P. Perron (1993a) Useful Modifications to Some Unit Root Tests with Dependent Errors
and their Local Asymptotic Properties, manuscript, C.R.D.E., University of Montreal, Quebec.
Ng, S. and P. Perron (1993b) Unit Root Tests in ARMA Models With Data-Dependent Methods for the
Selection of the Truncation Lag, manuscript, C.R.D.E., University of Montreal, Quebec.
Nicholls, S. and A. Pagan (1985) Varying Coefficient Regression, in: Handbook of Statistics, 413-449.
Nyblom, J. (1986) Testing for Deterministic Linear Trend in Time Series, Journal of the American
Statistical Association, 81, 545-549.
Nyblom, J. (1989) Testing for the Constancy of Parameters Over Time, Journal of the American
Statistical Association, 84, 223-230.
Nyblom, J. and T. Mäkeläinen (1983) Comparisons of Tests for the Presence of Random Walk Coefficients in a Simple Linear Model, Journal of the American Statistical Association, 78, 856-864.
Ogaki, M. (1992) Engel's Law and Cointegration, Journal of Political Economy, 100, 1027-1046.
Orcutt, G.H. (1948) A Study of the Autoregressive Nature of the Time Series Used for Tinbergen's Model of the Economic System of the United States, 1919-1932, Journal of the Royal Statistical Society B, 10, 1-45.
Pantula, S.G. (1989) Testing for Unit Roots in Time Series Data, Econometric Theory, 5, 256-271.
Pantula, S.G. (1991) Asymptotic Distributions of the Unit-Root Tests when the Process is Nearly
Stationary, Journal of Business and Economic Statistics, 9, 63-71.
Pantula, S.G. and A. Hall (1991) Testing for Unit Roots in Autoregressive Moving Average Models:
An Instrumental Variable Approach, Journal of Econometrics, 48, 325-353.
Pantula, S.G., G. Gonzalez-Farias and W.A. Fuller (1992) A Comparison of Unit Root Test Criteria,
manuscript, North Carolina State University.
Park, J. (1990) Testing for Unit Roots and Cointegration by Variable Addition, in: T.B. Fomby and
G.F. Rhodes, eds., Advances in Econometrics: Co-Integration, Spurious Regressions and Unit Roots.
Greenwich, CT: JAI Press.
Park, J. and B. Choi (1988) A New Approach to Testing for a Unit Root, Working Paper no. 88-23,
Center for Analytical Economics, Cornell University.
Park, J.Y. and P.C.B. Phillips (1988) Statistical Inference in Regressions with Integrated Processes: Part
I, Econometric Theory, 4, 468-497.
Perron, P. (1988) Trends and Random Walks in Macroeconomic Time Series: Further Evidence from
a New Approach, Journal of Economic Dynamics and Control, 12, 297-332.
Perron, P. (1989a) The Great Crash, the Oil Price Shock and the Unit Root Hypothesis, Econometrica,
57, 1361-1401.
Perron, P. (1989b) The Calculation of the Limiting Distribution of the Least-Squares Estimator in a
Near-Integrated Model, Econometric Theory, 5, 241-255.
Perron, P. (1989c) Testing for a Random Walk: A Simulation Experiment of Power When the Sampling
Interval is Varied, in: B. Raj, ed., Advances in Econometrics and Modelling. Kluwer Academic Publishers:
Boston.
Perron, P. (1990a) Tests of Joint Hypothesis for Time Series Regression with a Unit Root, in: T.B.
Fomby and G.F. Rhodes, eds., Advances in Econometrics: Co-Integration, Spurious Regressions and
Unit Roots. Greenwich, CT: JAI Press.
Perron, P. (1990b) Testing for a Unit Root in a Time Series with a Changing Mean, Journal of Business
and Economic Statistics, 8, 153-162.
Perron, P. (1991a) A Continuous Time Approximation to the Stationary First-Order Autoregressive
Model, Econometric Theory, 7, 236-252.
Perron, P. (1991b) Test Consistency with Varying Sampling Frequency, Econometric Theory, 7,
341-368.
Perron, P. (1991c) A Test for Changes in a Polynomial Trend Function for a Dynamic Time Series,
manuscript, Princeton University.
Perron, P. (1991d) The Adequacy of Asymptotic Approximations in the Near-Integrated Autoregressive
Model with Dependent Errors, manuscript, Princeton University.
Perron, P. (1992) A Continuous Time Approximation to the Unstable First Order Autoregressive
Process: The Case Without an Intercept, Econometrica, 59, 211-236.
Perron, P. and P.C.B. Phillips (1987) Does GNP Have a Unit Root? A Reevaluation, Economics Letters,
23, 139-145.
Perron, P. and T.S. Vogelsang (1992) Nonstationarity and Level Shifts With an Application to Purchasing
Power Parity, Journal of Business and Economic Statistics, 10, 301-320.
Pesaran, M.H. (1983) A Note on the Maximum Likelihood Estimation of Regression Models with First-Order Moving Average Errors with Roots on the Unit Circle, Australian Journal of Statistics, 25, 442-448.
Phillips, P.C.B. (1978) Edgeworth and Saddlepoint Approximations in the First-Order Noncircular
Autoregression, Biometrika, 65, 91-98.
Phillips, P.C.B. (1986) Understanding Spurious Regression in Econometrics, Journal of Econometrics,
33, 311-340.
Phillips, P.C.B. (1987a) Time Series Regression with a Unit Root, Econometrica, 55, 277-302.
Phillips, P.C.B. (1987b) Towards a Unified Asymptotic Theory for Autoregression, Biometrika, 74,
535-547.
Phillips, P.C.B. (1988) Multiple Regression with Integrated Time Series, Contemporary Mathematics,
80, 79-105.
Phillips, P.C.B. (1991a) To Criticize the Critics: An Objective Bayesian Analysis of Stochastic Trends,
Journal of Applied Econometrics, 6, 333-364.
Phillips, P.C.B. (1991b) Spectral Regression for Cointegrated Time Series, in: W. Barnett, J. Powell
and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics.
Cambridge University Press: Cambridge, UK.
Phillips, P.C.B. (1992a) Unit Roots, in: P. Newman, M. Milgate and J. Eatwell, eds., The New
Palgrave Dictionary of Money and Finance. London: Macmillan, 726-730.
Phillips, P.C.B. (1992b) Bayes Methods for Trending Multiple Time Series with An Empirical
Application to the U.S. Economy, manuscript, Cowles Foundation, Yale University.
Phillips, P.C.B. and M. Loretan (1991) Estimating Long Run Economic Equilibria, Review of Economic
Studies, 58, 407-436.
Phillips, P.C.B. and S. Ouliaris (1990) Asymptotic Properties of Residual Based Tests for Cointegration,
Econometrica, 58, 165-193.
Phillips, P.C.B. and P. Perron (1988) Testing for a Unit Root in Time Series Regression, Biometrika, 75, 335-346.
Phillips, P.C.B. and W. Ploberger (1991) Time Series Modeling with a Bayesian Frame of Reference: I.
Concepts and Illustrations, manuscript, Cowles Foundation, Yale University.
Phillips, P.C.B. and W. Ploberger (1992) Posterior Odds Testing for a Unit Root with Data-Based Model
Selection, manuscript, Cowles Foundation for Economic Research, Yale University.
Phillips, P.C.B. and V. Solo (1992) Asymptotics for Linear Processes, Annals of Statistics, 20, 971-1001.
Picard, D. (1985) Testing and Estimating Change-Points in Time Series, Advances in Applied
Probability, 17, 841-867.
Ploberger, W. and W. Kramer (1990) The Local Power of the CUSUM and CUSUM of Squares Tests,
Econometric Theory, 6, 335-347.
Ploberger, W. and W. Krämer (1992a) The CUSUM Test with OLS Residuals, Econometrica, 60,
271-286.
Ploberger, W. and W. Kramer (1992b) A Trend-Resistant Test for Structural Change Based on
OLS-Residuals, manuscript, Department of Statistics, University of Dortmund.
Ploberger, W., W. Kramer and K. Kontrus (1989) A New Test for Structural Stability in the Linear
Regression Model, Journal of Econometrics, 40, 307-318.
Plosser, C.I. and G.W. Schwert (1977) Estimation of a Noninvertible Moving-Average Process: The
Case of Overdifferencing, Journal of Econometrics, 6, 199-224.
Pötscher, B.M. (1991) Noninvertibility and Pseudo-Maximum Likelihood Estimation of Misspecified
ARMA Models, Econometric Theory, 7, 435-449.
Quah, D. (1992) The Relative Importance of Permanent and Transitory Components: Identification
and some Theoretical Bounds, Econometrica, 60, 107-118.
Quandt, R.E. (1960) Tests of the Hypothesis that a Linear Regression System Obeys Two Separate
Regimes, Journal of the American Statistical Association, 55, 324-330.
Rappoport, P. and L. Reichlin (1989) Segmented Trends and Non-Stationary Time Series, Economic
Journal, 99, 168-177.
Robinson, P.M. (1993) Nonparametric Function Estimation for Long-Memory Time Series, in: C.
Sims, ed., Advances in Econometrics, Sixth World Congress of the Econometric Society. Cambridge:
Cambridge University Press.
Rothenberg, T.J. (1990) Inference in Autoregressive Models with Near-Unit Root, seminar notes,
Department of Economics, University of California, Berkeley.
Rudebusch, G. (1992) Trends and Random Walks in Macroeconomic Time Series: A Reexamination,
International Economic Review, 33, 661-680.
Rudebusch, G.D. (1993) The Uncertain Unit Root in Real GNP, American Economic Review, 83,
264-272.
Said, S.E. and D.A. Dickey (1984) Testing for Unit Roots in Autoregressive-Moving Average Models
of Unknown Order, Biometrika, 71, 599-608.
Said, S.E. and D.A. Dickey (1985) Hypothesis Testing in ARIMA(p, 1, q) Models, Journal of the
American Statistical Association, 80, 369-374.
Saikkonen, P. (1991) Asymptotically Efficient Estimation of Cointegrating Regressions, Econometric
Theory, 7, 1-21.
Saikkonen, P. (1993) A Note on a Lagrange Multiplier Test for Testing an Autoregressive Unit Root,
Econometric Theory, 9, 343-362.
Saikkonen, P. and R. Luukkonen (1993a) Testing for Moving Average Unit Root in Autoregressive
Integrated Moving Average Models, Journal of the American Statistical Association, 88, 596-601.
Saikkonen, P. and R. Luukkonen (1993b) Point Optimal Tests for Testing the Order of Differencing in
ARIMA Models, Econometric Theory, 9, 343-362.
Sargan, J.D. and A. Bhargava (1983a) Testing Residuals from Least Squares Regression for Being
Generated by the Gaussian Random Walk, Econometrica, 51, 153-174.
Sargan, J.D. and A. Bhargava (1983b) Maximum Likelihood Estimation of Regression Models with
First Order Moving Average Errors when the Root Lies on the Unit Circle, Econometrica, 51,
799-820.
Sarris, A.H. (1973) A Bayesian Approach to Estimation of Time Varying Regression Coefficients,
Annals of Economic and Social Measurement, 2, 501-523.
Satchell, S.E. (1984) Approximation to the Finite Sample Distribution for Nonstable First Order
Stochastic Difference Equations, Econometrica, 52, 1271-1289.
Schmidt, P. and P.C.B. Phillips (1992) LM Tests for a Unit Root in the Presence of Deterministic Trends, Oxford Bulletin of Economics and Statistics, 54, 257-288.
Schotman, P. and H.K. van Dijk (1990) Posterior Analysis of Possibly Integrated Time Series with an
Application to Real GNP, manuscript, Erasmus University Econometric Institute, Netherlands.
Schwarz, G. (1978) Estimating the Dimension of a Model, Annals of Statistics, 6, 461-464.
Schwert, G.W. (1989) Tests for Unit Roots: A Monte Carlo Investigation, Journal of Business and Economic Statistics, 7, 147-159.
Sen, P.K. (1980) Asymptotic Theory of Some Tests for a Possible Change in the Regression Slope
Occurring at an Unknown Time Point, Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 52, 203-218.
Shephard, N.G. (1992) Distribution of the ML Estimator of an MA(1) and a Local Level Model,
manuscript, Nuffield College, Oxford.
Shephard, N.G. (1993) Maximum Likelihood Estimation of Regression Models with Stochastic Trend
Components, Journal of the American Statistical Association, 88, 590-595.
Shephard, N.G. and A.C. Harvey (1990) On the Probability of Estimating a Deterministic Component
in the Local Level Model, Journal of Time Series Analysis, 11, 339-347.
Shiller, R.J. and P. Perron (1985) Testing the Random Walk Hypothesis: Power versus Frequency of
Observation, Economics Letters, 18, 381-386.
Shively, T.S. (1988) An Exact Test for a Stochastic Coefficient in a Time Series Regression Model,
Journal of Time Series Analysis, 9, 81-88.
Siegmund, D. (1988) Confidence Sets in Change-point Problems, International Statistical Review, 56,
31-47.
Sims, C.A. (1972) Money, Income and Causality, American Economic Review, 62, 540-552.
Sims, C.A. (1980) Macroeconomics and Reality, Econometrica, 48, 1-48.
Sims, C.A. (1988) Bayesian Skepticism on Unit Root Econometrics, Journal of Economic Dynamics and
Control, 12, 463-474.
Sims, C.A. and H. Uhlig (1991) Understanding Unit Rooters: A Helicopter Tour, Econometrica, 59,
1591-1600.
Sims, C.A., J.H. Stock, and M.W. Watson (1990) Inference in Linear Time Series Models with Some
Unit Roots, Econometrica, 58, 113-144.
Solo, V. (1984) The Order of Differencing in ARIMA Models, Journal of the American Statistical
Association, 79, 916-921.
Solo, V. (1989) A Simple Derivation of the Granger Representation Theorem, manuscript, Macquarie
University, Sydney, Australia.
Sorensen, B. (1992) Continuous Record Asymptotics in Systems of Stochastic Differential Equations,
Econometric Theory, 8, 28-51.
Sowell, F. (1990) Fractional Unit Root Distribution, Econometrica, 58, 495-506.
Sowell, F. (1991) On DeJong and Whiteman's Bayesian Inference for the Unit Root Model, Journal
of Monetary Economics, 28, 255-264.
Sowell, F.B. (1992) Maximum Likelihood Estimation of Stationary Univariate Fractionally-
Integrated Time-Series Models, Journal of Econometrics, 53, 165-188.
Stine, R.A. and P. Shaman (1989) A Fixed Point Characterization for Bias of Autoregressive Estimates,
Annals of Statistics, 17, 1275-1284.
Stock, J.H. (1987) Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors,
Econometrica, 55, 1035-1056.
Stock, J.H. (1988) A Class of Tests for Integration and Cointegration, manuscript, Kennedy School of
Government, Harvard University.
Stock, J.H. (1990) Unit Roots in GNP: Do We Know and Do We Care? A Comment, Carnegie-Rochester
Conference Series on Public Policy, 26, 63-82.
Stock, J.H. (1991) Confidence Intervals for the Largest Autoregressive Root in U.S. Economic Time
Series, Journal of Monetary Economics, 28, 435-460.
Stock, J.H. (1992) Deciding Between I(0) and I(1), NBER Technical Working Paper no. 121; Journal
of Econometrics, forthcoming.
Stock, J.H. and M.W. Watson (1988a) Variable Trends in Economic Time Series, Journal of Economic
Perspectives, 2, 147-174.
Stock, J.H. and M.W. Watson (1988b) Testing for Common Trends, Journal of the American Statistical
Association, 83, 1097-1107.
Stock, J.H. and M.W. Watson (1989) Interpreting the Evidence on Money-Income Causality, Journal of
Econometrics, 40, 161-182.
Stock, J.H. and M.W. Watson (1993) A Simple Estimator of Cointegrating Vectors in Higher Order
Integrated Systems, Econometrica, 61, 783-820.
Svensson, L.E.O. (1992) An Interpretation of Recent Research on Exchange Rate Target Zones, Journal
of Economic Perspectives, 6, 119-144.
Tanaka, K. (1990a) The Fredholm Approach to Asymptotic Inference in Nonstationary and Non-
invertible Time Series Models, Econometric Theory, 6, 411-432.
Tanaka, K. (1990b) Testing for a Moving Average Unit Root, Econometric Theory, 6, 433-444.
Tanaka, K. and S.E. Satchell (1989) Asymptotic Properties of the Maximum Likelihood and
Non-Linear Least-Squares Estimators for Noninvertible Moving Average Models, Econometric
Theory, 5, 333-353.
Tang, S.M. and I.B. MacNeill (1993) The Effect of Serial Correlation on Tests for Parameter Change
at Unknown Time, Annals of Statistics, 21, 552-575.
Tinbergen, J. (1939) Statistical Testing of Business-Cycle Theories. Vol. II: Business Cycles in the United
States of America, 1919-1932. League of Nations: Geneva.
Uhlig, H. (1992) What Macroeconomists Should Know About Unit Roots As Well: The Bayesian
Perspective, manuscript, Department of Economics, Princeton University.
Watson, M.W. (1986) Univariate Detrending Methods with Stochastic Trends, Journal of Monetary
Economics, 18, 49-75.
Watson, M.W. and R.F. Engle (1985) Testing for Regression Coefficient Stability with a Stationary
AR(l) Alternative, Review of Economics and Statistics, LXVII, 341-345.
Wei, C.Z. (1992) On Predictive Least Squares Principles, Annals of Statistics, 20, 1-42.
West, K.D. (1987) A Note on the Power of Least Squares Tests for a Unit Root, Economics Letters,
24, 1397-1418.
West, K.D. (1988a) Asymptotic Normality when Regressors have a Unit Root, Econometrica, 56,
1397-1418.
West, K.D. (1988b) On the Interpretation of Near Random Walk Behavior in GNP, American
Economic Review, 78, 202-208.
White, J.S. (1958) The Limiting Distribution of the Serial Correlation Coefficient in the Explosive Case,
Annals of Mathematical Statistics, 29, 1188-1197.
White, J.S. (1959) The Limiting Distribution of the Serial Correlation Coefficient in the Explosive Case
II, Annals of Mathematical Statistics, 30, 831-834.
Yao, Y.-C. (1987) Approximating the Distribution of the Maximum Likelihood Estimate of the
Change-Point in a Sequence of Independent Random Variables, Annals of Statistics, 15, 1321-1328.
Zacks, S. (1983) Survey of Classical and Bayesian Approaches to the Change-Point Problem: Fixed
Sample and Sequential Procedures of Testing and Estimation, in: M.H. Rizvi et al., eds., Recent
Advances in Statistics. New York: Academic Press, 245-269.
Zellner, A. (1987) Bayesian Inference, in: J. Eatwell, M. Milgate and P. Newman, eds., The New Palgrave:
A Dictionary of Economics, Macmillan Press: London, 208-219.
Zivot, E. and D.W.K. Andrews (1992) Further Evidence on the Great Crash, the Oil Price Shock, and
the Unit Root Hypothesis, Journal of Business and Economic Statistics, 10, 251-270.
Chapter 47

VECTOR AUTOREGRESSIONS AND


COINTEGRATION*

MARK W. WATSON

Northwestern University and Federal Reserve Bank of Chicago

Contents

Abstract 2844
1. Introduction 2844
2. Inference in VARs with integrated regressors 2848
2.1. Introductory comments 2848
2.2. An example 2848
2.3. A useful lemma 2850
2.4. Continuing with the example 2852
2.5. A general framework 2854
2.6. Applications 2860
2.7. Implications for econometric practice 2866
3. Cointegrated systems 2870
3.1. Introductory comments 2870
3.2. Representations for the I(1) cointegrated model 2870
3.3. Testing for cointegration in I(1) systems 2876
3.4. Estimating cointegrating vectors 2887
3.5. The role of constants and trends 2894
4. Structural vector autoregressions 2898
4.1. Introductory comments 2898
4.2. The structural moving average model, impulse response functions and
variance decompositions 2899
4.3. The structural VAR representation 2900
4.4. Identification of the structural VAR 2902
4.5. Estimating structural VAR models 2906
References 2910

*The paper has benefited from comments by Edwin Denson, Rob Engle, Neil Ericsson, Michael
Horvath, Soren Johansen, Peter Phillips, Greg Reinsel, James Stock and students at Northwestern
University and Studienzentrum Gerzensee. Support was provided by the National Science Foundation
through grants SES-89910601 and SES-91-22463.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

Abstract

This paper surveys three topics: vector autoregressive (VAR) models with integrated
regressors, cointegration, and structural VAR modeling. The paper begins by
developing methods to study potential unit root problems in multivariate models,
and then presents a simple set of rules designed to help applied researchers conduct
inference in VARs. A large number of examples are studied, including tests for
Granger causality, tests for VAR lag length, spurious regressions and OLS estimators
of cointegrating vectors. The survey of cointegration begins with four alternative
representations of cointegrated systems: the vector error correction model (VECM),
and the moving average, common trends and triangular representations. A variety
of tests for cointegration and efficient estimators for cointegrating vectors are
developed and compared. Finally, structural VAR modeling is surveyed, with an
emphasis on interpretation, econometric identification and construction of efficient
estimators. Each section of this survey is largely self-contained. Inference in VARs
with integrated regressors is covered in Section 2, cointegration is surveyed in
Section 3, and structural VAR modeling is the subject of Section 4.

1. Introduction

Multivariate time series methods are widely used by empirical economists, and
econometricians have focused a great deal of attention on refining and extending
these techniques so that they are well suited for answering economic questions.
This paper surveys two of the most important recent developments in this area:
vector autoregressions and cointegration.
Vector autoregressions (VARs) were introduced into empirical economics by
Sims (1980), who demonstrated that VARs provide a flexible and tractable frame-
work for analyzing economic time series. Cointegration was introduced in a series
of papers by Granger (1983), Granger and Weiss (1983) and Engle and Granger
(1987). These papers developed a very useful probability structure for analyzing
both long-run and short-run economic relations.
Empirical researchers immediately began experimenting with these new models,
and econometricians began studying the unique problems that they raise for econo-
metric identification, estimation and statistical inference. Identification problems
had to be confronted immediately in VARs. Since these models don't dichotomize
variables into endogenous and exogenous, the exclusion restrictions used to
identify traditional simultaneous equations models make little sense. Alternative
sets of restrictions, typically involving the covariance matrix of the errors, have
been used instead. Problems in statistical inference immediately confronted
researchers using cointegrated models. At the heart of cointegrated models are
integrated variables, and statistics constructed from integrated variables often
behave in nonstandard ways. Unit root problems are present and a large research
effort has attempted to understand and deal with these problems.
This paper is a survey of some of the developments in VARs and cointegration
that have occurred since the early 1980s. Because of space and time constraints,
certain topics have been omitted. For example, there is no discussion of forecasting
or data analysis; the paper focuses entirely on structural inference. Empirical

proposition is testable without a complete specification of the structural model.


The basic idea is that when money and output are integrated, the historical data
contain permanent shocks. Long-run neutrality can be investigated by examining
the relationship between the permanent changes in money and output. This raises
two important econometric questions. First, how can the permanent changes in
the variables be extracted from the historical time series? Second, the neutrality
proposition involves exogenous components of changes in money; can these
components be econometrically identified? The first question is addressed in Section
3, where, among other topics, trend extraction in integrated processes is discussed.
The second question concerns structural identification and is discussed in Section 4.
One important restriction of economic theory is that certain Great Ratios are
stable. In the eight-variable system, five of these restrictions are noteworthy. The
first four are suggested by the standard neoclassical growth model. In response
to exogenous growth in productivity and population, the neoclassical growth
model predicts that output, consumption and investment will grow in a balanced
way. That is, even though $y_t$, $c_t$, and $i_t$ increase permanently in response to increases
in productivity and population, there are no permanent shifts in $c_t - y_t$ and $i_t - y_t$.
The model also predicts that the marginal product of capital will be stable in the
long run, suggesting that similar long-run stability will be present in ex-post real
interest rates, $r_t - \Delta p_t$. Absent long-run frictions in competitive labor markets,
real wages equal the marginal product of labor. Thus, when the production function
is Cobb-Douglas (so that marginal and average products are proportional),
$(w_t - p_t) - (y_t - n_t)$ is stable in the long run. Finally, many macroeconomic models
of money [e.g. Lucas (1988)] imply a stable long-run relation between real balances
$(m_t - p_t)$, output $(y_t)$ and nominal interest rates $(r_t)$, such as $m_t - p_t = \beta_y y_t + \beta_r r_t$; that
is, these models imply a stable long-run money demand equation.
Kosobud and Klein (1961) contains one of the first systematic investigations of
these stability propositions. They tested whether the deterministic growth rates in
the series were consistent with the propositions. However, in models with stochastic
growth, the stability propositions also restrict the stochastic trends in the variables.
These restrictions can be described succinctly. Let $x_t$ denote the $8 \times 1$ vector
$(y_t, c_t, i_t, n_t, w_t, m_t, p_t, r_t)'$. Assume that the forcing processes of the system (productivity,
population, outside money, etc.) are such that the elements of $x_t$ are potentially
I(1). The five stability propositions imply that $z_t = \alpha' x_t$ is I(0), where

$$\alpha = \begin{pmatrix} 1 & 1 & -1 & -\beta_y & 0 \\ -1 & 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & -1 & -1 & 0 \\ 0 & 0 & 0 & -\beta_r & 1 \end{pmatrix}.$$
Ch. 47: Vector Autoregressions and Cointegration 2847

The first two columns of $\alpha$ are the balanced growth restrictions, the third column
is the restriction linking the real wage to average labor productivity, the fourth
column is the stable long-run money demand restriction, and the last column restricts
nominal interest rates to be I(0). If money and prices are I(1), $\Delta p_t$ is I(0), so that
stationary real rates imply stationary nominal rates.
These restrictions raise two econometric questions. First, how should the stability
hypotheses be tested? This is answered in Section 3.3, which discusses tests for
cointegration. Second, how should the coefficients $\beta_y$ and $\beta_r$ be estimated from the
data, and how should inference about their values be carried out? This is the subject
of Section 3.4, which considers the problem of estimating cointegrating vectors.
In addition to these narrow questions, there are two broad and arguably more
important questions about the business cycle behavior of the system. First, how
do the variables respond dynamically to exogenous shocks? Do prices respond
sluggishly to exogenous changes in money? Does output respond at all? And if
so, for how long? Second, what are the important sources of fluctuations in the
variables? Are business cycles largely the result of supply shocks, like shocks to
productivity? Or do aggregate demand shocks, associated with monetary and fiscal
policy, play the dominant role in the business cycle?
If the exogenous shocks of econometric interest (supply shocks, monetary
shocks, etc.) can be related to one-step-ahead forecast errors, then VAR models
can be used to answer these questions. The VAR, together with a function relating
the one-step-ahead forecast errors to exogenous structural shocks, is called a
"structural" VAR. The first question, concerning the dynamic response of the variables
to exogenous shocks, is answered by the moving average representation of the
structural VAR model and its associated impulse response functions. The second
question, concerning the important sources of economic fluctuations, is answered
by the structural VAR's variance decompositions. Section 4 shows how the impulse
responses and variance decompositions can be computed from the VAR. Their
responses and variance decompositions can be computed from the VAR. Their
calculation and interpretation are straightforward. The more interesting econometric
questions involve issues of identification and efficient estimation in structural VAR
models. The bulk of Section 4 is devoted to these topics.
Before proceeding to the body of the survey, three organizational comments are
useful. First, the sections of this survey are largely self-contained. This means that
the reader interested in structural VARs can skip Sections 2 and 3 and proceed
directly to Section 4. The only exception to this is that certain results on inference
in cointegrated systems, discussed in Section 3, rely on asymptotic results from
Section 2. If the reader is willing to take these results on faith, Section 3 can be
read without the benefit of Section 2. The second comment is that Sections 2 and

Since nominal rates are I(0) from the last column of $\alpha$, the long-run interest semielasticity of money
demand, $\beta_r$, need not appear in the fourth column of $\alpha$.
The values of $\beta_y$ and $\beta_r$ are important to macroeconomists because they determine (i) the relation-
ship between the average growth rates of money, output and prices and (ii) the steady-state amount of
seignorage associated with any given level of money growth.

3 are written at a somewhat higher level than Section 4. Sections 2 and 3 are based
on lecture notes developed for a second year graduate econometrics course and
assumes that students have completed a traditional first year econometrics
sequence. Section 4, on structural VARs, is based on lecture notes from a first
year graduate course in macroeconomics and assumes only that students have a
basic understanding of econometrics at the level of simultaneous equations. Finally,
this survey focuses only on the classical statistical analysis of I(1) and I(0) systems.
Many of the results presented here have been extended to higher order integrated
systems, and these extensions will be mentioned where appropriate.

2. Inference in VARs with integrated regressors

2.1. Introductory comments

Time series regressions that include integrated variables can behave very differently
than standard regression models. The simplest example of this is the AR(1) regression:
$y_t = \rho y_{t-1} + \varepsilon_t$, where $\rho = 1$ and $\varepsilon_t$ is independent and identically distributed with
mean zero and variance $\sigma^2$, i.i.d.$(0, \sigma^2)$. As Stock shows in his chapter of the
Handbook, $\hat\rho$, the ordinary least squares (OLS) estimator of $\rho$, has a non-normal
asymptotic distribution, is asymptotically biased, and yet is super consistent,
converging to its true value at rate $T$.
Estimated coefficients in VARs with integrated components can also behave
differently than estimators in covariance stationary VARs. In particular, some of
the estimated coefficients behave like $\hat\rho$, with non-normal asymptotic distributions,
while other estimated coefficients behave in the standard way, with asymptotic
normal large sample distributions. This has profound consequences for carrying
out statistical inference, since in some instances, the usual test statistics will not
have asymptotic x2 distributions, while in other circumstances they will. For
example, Granger causality test statistics will often have nonstandard asymptotic
distributions, so that conducting inference using critical values from the x2 table
is incorrect. On the other hand, test statistics for lag length in the VAR will usually
be distributed x2 in large samples. This section investigates these subtleties, with
the objective of developing a set of simple guidelines that can be used for conducting
inference in VARs with integrated components. We do this by studying a model
composed of I(0) and I(1) variables. Although results are available for higher order
integrated systems [see Park and Phillips (1988, 1989), Sims et al. (1990) and Tsay
and Tiao (1990)], limiting attention to I(1) processes greatly simplifies the notation
with little loss of insight.

2.2. An example

Many of the complications in statistical inference that arise in VARs with unit

roots can be analyzed in a simple univariate AR(2) model:³

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \eta_t. \qquad (2.1)$$

Assume that $\phi_1 + \phi_2 = 1$ and $|\phi_2| < 1$, so that the process contains one unit root. To
keep things simple, assume that $\eta_t$ is i.i.d.$(0,1)$ and normally distributed [n.i.i.d.$(0,1)$].
Let $x_t = (y_{t-1}\; y_{t-2})'$ and $\phi = (\phi_1\; \phi_2)'$, so that the OLS estimator is $\hat\phi = (\sum x_t x_t')^{-1}(\sum x_t y_t)$ and $(\hat\phi - \phi) = (\sum x_t x_t')^{-1}(\sum x_t \eta_t)$. (Unless noted otherwise, $\sum$ will denote
$\sum_{t=1}^{T}$ throughout this paper.)
In the covariance stationary model, the large sample distribution of $\hat\phi$ is deduced
by writing $T^{1/2}(\hat\phi - \phi) = (T^{-1}\sum x_t x_t')^{-1}(T^{-1/2}\sum x_t \eta_t)$, and then using a law of large
numbers to show that $T^{-1}\sum x_t x_t' \xrightarrow{p} E(x_t x_t') = V$, and a central limit theorem to
show that $T^{-1/2}\sum x_t \eta_t \Rightarrow N(0, V)$. These results, together with Slutsky's theorem,
imply that $T^{1/2}(\hat\phi - \phi) \Rightarrow N(0, V^{-1})$.
When the process contains a unit root, this argument fails. The most obvious
reason is that, when $\rho = 1$, $E(x_t x_t')$ is not constant, but rather grows with $t$. Because
of this, $T^{-1}\sum x_t x_t'$ and $T^{-1}\sum x_t \eta_t$ no longer converge: convergence requires that
$\sum x_t x_t'$ be divided by $T^2$ instead of $T$, and that $\sum x_t \eta_t$ be divided by $T$ instead of
$T^{1/2}$. Moreover, even with these new scale factors, $T^{-2}\sum x_t x_t'$ converges to a
random matrix rather than a constant, and $T^{-1}\sum x_t \eta_t$ converges to a non-normal
random vector.
However, even this argument is too simple, since the standard approach can be
applied to a specific linear combination of the regressors. To see this, rearrange
the regressors in (2.1) so that

$$y_t = \gamma_1 \Delta y_{t-1} + \gamma_2 y_{t-1} + \eta_t, \qquad (2.2)$$

where $\gamma_1 = -\phi_2$ and $\gamma_2 = \phi_1 + \phi_2$. Regression (2.2) is equivalent to regression (2.1)
in the sense that the OLS estimates of $\phi_1$ and $\phi_2$ are linear transformations of
the OLS estimators of $\gamma_1$ and $\gamma_2$. In terms of the transformed regressors,

$$\begin{pmatrix} \hat\gamma_1 - \gamma_1 \\ \hat\gamma_2 - \gamma_2 \end{pmatrix} = \begin{pmatrix} \sum \Delta y_{t-1}^2 & \sum \Delta y_{t-1} y_{t-1} \\ \sum \Delta y_{t-1} y_{t-1} & \sum y_{t-1}^2 \end{pmatrix}^{-1} \begin{pmatrix} \sum \Delta y_{t-1} \eta_t \\ \sum y_{t-1} \eta_t \end{pmatrix}, \qquad (2.3)$$

and the asymptotic behavior of $\hat\gamma_1$ and $\hat\gamma_2$ (and hence $\hat\phi$) can be analyzed by studying
the large sample behavior of the cross products $\sum \Delta y_{t-1}^2$, $\sum \Delta y_{t-1} y_{t-1}$, $\sum y_{t-1}^2$,
$\sum \Delta y_{t-1} \eta_t$ and $\sum y_{t-1} \eta_t$.

To begin, consider the terms $\sum \Delta y_{t-1}^2$ and $\sum \Delta y_{t-1} \eta_t$. Since $\phi_1 + \phi_2 = \gamma_2 = 1$,

$$\Delta y_t = -\phi_2 \Delta y_{t-1} + \eta_t. \qquad (2.4)$$

³Many of the insights developed by analyzing this example are discussed in Fuller (1976) and Sims
(1978).

Since $|\phi_2| < 1$, $\Delta y_t$ (and hence $\Delta y_{t-1}$) is covariance stationary with mean zero.
Thus, standard asymptotic arguments imply that $T^{-1}\sum \Delta y_{t-1}^2 \xrightarrow{p} \sigma^2_{\Delta y}$ and
$T^{-1/2}\sum \Delta y_{t-1}\eta_t \Rightarrow N(0, \sigma^2_{\Delta y})$. This means that the first regressor in (2.2) behaves
in the usual way. Unit root complications arise only because of the second
regressor, $y_{t-1}$. To analyze the behavior of this regressor, solve (2.4) backwards
for the level of $y_t$:

$$y_t = (1+\phi_2)^{-1}\xi_t + y_0 + s_t, \qquad (2.5)$$

where $\xi_t = \sum_{s=1}^{t}\eta_s$ and $s_t = -(1+\phi_2)^{-1}\sum_{i=0}^{\infty}(-\phi_2)^{i+1}\eta_{t-i}$, and $\eta_i = 0$ for $i \le 0$
has been assumed for simplicity. Equation (2.5) is the Beveridge-Nelson (1981)
decomposition of $y_t$. It decomposes $y_t$ into the sum of a martingale or "stochastic
trend" $[(1+\phi_2)^{-1}\xi_t]$, a constant $(y_0)$ and an I(0) component $(s_t)$. The martingale
component has a variance that grows with $t$, and (as is shown below) it is this
component that leads to the nonstandard behavior of the cross products $\sum y_{t-1}^2$,
$\sum y_{t-1}\Delta y_{t-1}$ and $\sum y_{t-1}\eta_t$.
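As an illustration of (2.5), the following minimal simulation sketch (assuming, for concreteness, $\phi_2 = 0.5$ and $T = 1000$; both choices are arbitrary) generates the unit-root AR(2) and splits $y_t$ into its martingale and I(0) components.

import numpy as np

rng = np.random.default_rng(0)
T, phi2 = 1000, 0.5                    # illustrative values
phi1 = 1.0 - phi2                      # imposes phi1 + phi2 = 1 (one unit root)
eta = rng.standard_normal(T)

# Simulate y_t with y_0 = y_{-1} = 0, i.e. eta_s = 0 for s <= 0 as in the text.
y = np.zeros(T)
for t in range(T):
    y_lag1 = y[t - 1] if t >= 1 else 0.0
    y_lag2 = y[t - 2] if t >= 2 else 0.0
    y[t] = phi1 * y_lag1 + phi2 * y_lag2 + eta[t]

# Beveridge-Nelson split: martingale part (1+phi2)^{-1} xi_t plus I(0) part s_t.
xi = np.cumsum(eta)                    # xi_t = sum of eta_s up to t
trend = xi / (1.0 + phi2)              # stochastic trend component
stat = y - trend                       # y_0 + s_t: behaves like an I(0) series

print("variance of trend increments:", np.var(np.diff(trend)))
print("variance of the I(0) component:", np.var(stat))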
Other types of trending regressors also arise naturally in time series models and
their presence affects the sampling distribution of coefficient estimators. For
example, suppose that the AR(2) model includes a constant, so that

$$y_t = \mu + \gamma_1\Delta y_{t-1} + \gamma_2 y_{t-1} + \eta_t. \qquad (2.6)$$

This constant introduces two additional complications. First, a column of 1's must
be added to the list of regressors. Second, solving for the level of $y_t$ as above:

$$y_t = (1+\phi_2)^{-1}\mu t + (1+\phi_2)^{-1}\xi_t + y_0 + s_t. \qquad (2.7)$$

The key difference between (2.5) and (2.7) is that now $y_t$ contains the linear trend
$(1+\phi_2)^{-1}\mu t$. This means that terms involving $y_{t-1}$ now contain cross products
that involve linear time trends. Estimators of the coefficients in equation (2.6) can
be studied systematically by investigating the behavior of cross products of (i) zero
mean stationary components (like $\eta_t$ and $\Delta y_{t-1}$), (ii) constant terms, (iii) martingales
and (iv) time trends. We digress to present a useful lemma that shows the limiting
behavior of these cross products. This lemma is the key to deriving the asymptotic
distribution for coefficient estimators and test statistics for linear regressions
involving I(0) and I(1) variables, for tests for cointegration and for estimators of
cointegrating vectors. While the AR(2) example involves a scalar process, most of
the models considered in this survey are multivariate, and so the lemma is stated
for vector processes.

2.3. A useful lemma

Three key results are used in the lemma. The first is the functional central limit
theorem. Letting $\eta_t$ denote an $n \times 1$ martingale difference sequence, this theorem
expresses the limiting behavior of the sequence of partial sums $\xi_t = \sum_{s=1}^{t}\eta_s$,
$t = 1,\dots,T$, in terms of the behavior of an $n \times 1$ standardized Wiener or Brownian
motion process $B(s)$ for $0 \le s \le 1$.⁴ That is, the limiting behavior of the discrete
time random walk $\xi_t$ is expressed in terms of the continuous time random walk
$B(s)$. The result implies, for example, that $T^{-1/2}\xi_{[sT]} \Rightarrow B(s) \sim N(0, sI_n)$, for $0 < s \le 1$,
where $[sT]$ denotes the largest integer less than or equal to $sT$. The second result used
in the lemma is the continuous mapping theorem. Loosely, this theorem says that
the limit of a continuous function is equal to the function evaluated at the limit of
its arguments. The nonstochastic version of this theorem implies that $T^{-2}\sum_{t=1}^{T} t =
T^{-1}\sum_{t=1}^{T}(t/T) \to \int_0^1 s\,ds = \tfrac{1}{2}$. The stochastic version implies that $T^{-3/2}\sum_{t=1}^{T}\xi_t =
T^{-1}\sum_{t=1}^{T}(T^{-1/2}\xi_t) \Rightarrow \int_0^1 B(s)\,ds$. The final result is the convergence of $T^{-1}\sum \xi_{t-1}\eta_t'$
to the stochastic integral $\int_0^1 B(s)\,dB(s)'$, which is one of the moments directly under
study. These key results are discussed in Wooldridge's chapter of the Handbook.
For our purposes they are important because they lead to the following lemma.

Lemma 2.3

Let $\eta_t$ be an $n \times 1$ vector of random variables with $E(\eta_t|\eta_{t-1},\dots,\eta_1) = 0$,
$E(\eta_t\eta_t'|\eta_{t-1},\dots,\eta_1) = I_n$, and bounded fourth moments. Let $F(L) = \sum_{i=0}^{\infty}F_iL^i$ and
$G(L) = \sum_{i=0}^{\infty}G_iL^i$ denote two matrix polynomials in the lag operator with
$\sum_{i=0}^{\infty} i|F_i| < \infty$ and $\sum_{i=0}^{\infty} i|G_i| < \infty$. Let $\xi_t = \sum_{s=1}^{t}\eta_s$, and let $B(s)$ denote an $n \times 1$
dimensional Brownian motion process. Then the following converge jointly:

(a) $T^{-3/2}\sum [F(L)\xi_t] \Rightarrow F(1)\int B(s)\,ds$,
(b) $T^{-1}\sum \xi_t\eta_{t+1}' \Rightarrow \int B(s)\,dB(s)'$,
(c) $T^{-1}\sum \xi_t[F(L)\eta_t]' \Rightarrow F(1)' + \left[\int B(s)\,dB(s)'\right]F(1)'$,
(d) $T^{-1}\sum [F(L)\eta_t][G(L)\eta_t]' \xrightarrow{p} \sum_{i=0}^{\infty} F_iG_i'$,
(e) $T^{-3/2}\sum t[F(L)\eta_{t+1}]' \Rightarrow \left[\int s\,dB(s)'\right]F(1)'$,
(f) $T^{-3/2}\sum \xi_t \Rightarrow \int B(s)\,ds$,
(g) $T^{-2}\sum \xi_t\xi_t' \Rightarrow \int B(s)B(s)'\,ds$,
(h) $T^{-5/2}\sum t\xi_t \Rightarrow \int sB(s)\,ds$,

where, to simplify notation, $\int_0^1$ is denoted by $\int$. The lemma follows from results in
Chan and Wei (1988) together with standard versions of the law of large numbers
and the central limit theorem for martingale difference sequences [see White
(1984)]. Many versions of this lemma (often under assumptions slightly different
from those stated here) have appeared in the literature. For example, univariate
versions can be found in Phillips (1986, 1987a), Phillips and Perron (1988) and
Solo (1984), while multivariate versions (in most cases covering higher order
integrated processes) can be found in Park and Phillips (1988, 1989), Phillips and
Durlauf (1986), Phillips and Solo (1992), Sims et al. (1990) and Tsay and Tiao
(1990).

⁴Throughout this paper $B(s)$ will denote a multivariate standard Brownian motion process, i.e., an
$n \times 1$ process with independent increments $B(r) - B(s)$ that are distributed $N(0, (r-s)I_n)$ for $r > s$.
The specific regressions that are studied below fall into two categories: (i) regressions
that include a constant and a martingale as regressors or, (ii) regressions
that include a constant, a time trend and a martingale as regressors. In either case,
the coefficient on the martingale is the parameter of interest. The estimated value
of this coefficient can be calculated by including a constant or a constant and time
trend in the regression, or, alternatively, by first demeaning or detrending the data.
It is convenient to introduce some notation for the demeaned and detrended
martingales and their limiting Brownian motion representations. Thus let $\xi_t^{\mu} = \xi_t -
T^{-1}\sum_{t=1}^{T}\xi_t$ denote the demeaned martingale, and let $\xi_t^{\tau} = \xi_t - \hat\mu_1 - \hat\mu_2 t$ denote the
detrended martingale, where $\hat\mu_1$ and $\hat\mu_2$ are the OLS estimators obtained from the
regression of $\xi_t$ onto $(1\; t)$. Then, from the lemma, a straightforward calculation yields

$$T^{-1/2}\xi^{\mu}_{[sT]} \Rightarrow B(s) - \int_0^1 B(r)\,dr \equiv B^{\mu}(s)$$

and

$$T^{-1/2}\xi^{\tau}_{[sT]} \Rightarrow B(s) - \int_0^1 a_1(r)B(r)\,dr - s\int_0^1 a_2(r)B(r)\,dr \equiv B^{\tau}(s),$$

where $a_1(r) = 4 - 6r$ and $a_2(r) = -6 + 12r$.
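The demeaned and detrended processes are easy to approximate by simulation. The sketch below (sample size and seed are illustrative only) demeans and detrends a simulated random walk; scaled by $T^{-1/2}$, the results approximate draws of $B^{\mu}(s)$ and $B^{\tau}(s)$ on the grid $s = t/T$.

import numpy as np

rng = np.random.default_rng(2)
T = 1000                                          # illustrative sample size
xi = np.cumsum(rng.standard_normal(T))            # scalar martingale (random walk)

xi_mu = xi - xi.mean()                            # demeaned martingale xi_t^mu

tt = np.arange(1, T + 1)
X = np.column_stack([np.ones(T), tt])             # regressors (1, t)
coef = np.linalg.lstsq(X, xi, rcond=None)[0]      # OLS of xi_t on (1, t)
xi_tau = xi - X @ coef                            # detrended martingale xi_t^tau

# Scaled by T^{-1/2}, xi^mu and xi^tau evaluated at t = [sT] approximate
# draws of B^mu(s) and B^tau(s).
b_mu = xi_mu / np.sqrt(T)
b_tau = xi_tau / np.sqrt(T)
print(b_mu[:3], b_tau[:3])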

2.4. Continuing with the example

We are now in a position to complete the analysis of the AR(2) example. Consider
a scaled version of (2.3),

$$\begin{bmatrix} T^{1/2}(\hat\gamma_1 - \gamma_1)\\ T(\hat\gamma_2 - \gamma_2) \end{bmatrix} = \begin{bmatrix} T^{-1}\sum \Delta y_{t-1}^2 & T^{-3/2}\sum \Delta y_{t-1}y_{t-1}\\ T^{-3/2}\sum \Delta y_{t-1}y_{t-1} & T^{-2}\sum y_{t-1}^2 \end{bmatrix}^{-1} \begin{bmatrix} T^{-1/2}\sum \Delta y_{t-1}\eta_t\\ T^{-1}\sum y_{t-1}\eta_t \end{bmatrix}.$$

From (2.5) and result (g) of the lemma, $T^{-2}\sum y_{t-1}^2 \Rightarrow (1+\phi_2)^{-2}\int B(s)^2\,ds$, and from
(b), $T^{-1}\sum y_{t-1}\eta_t \Rightarrow (1+\phi_2)^{-1}\int B(s)\,dB(s)$. Finally, noting from (2.4) that $\Delta y_t =
(1+\phi_2 L)^{-1}\eta_t$, (c) implies that $T^{-3/2}\sum \Delta y_{t-1}y_{t-1} \xrightarrow{p} 0$. This result is particularly
important because it implies that the limiting scaled $X'X$ matrix for the regression
is block diagonal. Thus,

$$T^{1/2}(\hat\gamma_1 - \gamma_1) \Rightarrow N(0, \sigma_{\Delta y}^{-2})$$

and

$$T(\hat\gamma_2 - \gamma_2) \Rightarrow (1+\phi_2)\left[\int B(s)^2\,ds\right]^{-1}\left[\int B(s)\,dB(s)\right].$$
Two features of these results are important. First, $\hat\gamma_1$ and $\hat\gamma_2$ converge at different
rates. These rates are determined by the variability of their respective regressors:
$\gamma_1$ is the coefficient on a regressor with bounded variance, while $\gamma_2$ is the coefficient
on a regressor with a variance that increases at rate $t$. The second important
feature is that $\hat\gamma_1$ has an asymptotic normal distribution, while the asymptotic
distribution of $\hat\gamma_2$ is non-normal. Unit root complications will affect statistical
inference about $\gamma_2$ but not $\gamma_1$.
Now consider the estimated regression coefficients $\hat\phi_1$ and $\hat\phi_2$ in the untransformed
regression. Since $\hat\phi_2 = -\hat\gamma_1$, $T^{1/2}(\hat\phi_2 - \phi_2) \Rightarrow N(0, \sigma_{\Delta y}^{-2})$. Furthermore, since $\hat\phi_1 =
\hat\gamma_1 + \hat\gamma_2$, $T^{1/2}(\hat\phi_1 - \phi_1) = T^{1/2}(\hat\gamma_1 - \gamma_1) + T^{1/2}(\hat\gamma_2 - \gamma_2) = T^{1/2}(\hat\gamma_1 - \gamma_1) + o_p(1)$. That
is, even though $\hat\phi_1$ depends on both $\hat\gamma_1$ and $\hat\gamma_2$, the super consistency of $\hat\gamma_2$ implies
that its sampling error can be ignored in large samples. Thus, $T^{1/2}(\hat\phi_1 - \phi_1) \Rightarrow
N(0, \sigma_{\Delta y}^{-2})$, so that both $\hat\phi_1$ and $\hat\phi_2$ converge at rate $T^{1/2}$ and have asymptotic
normal distributions. Their joint distribution is more complicated. Since $\hat\phi_1 + \hat\phi_2 =
\hat\gamma_2$, $T^{1/2}(\hat\phi_1 - \phi_1) + T^{1/2}(\hat\phi_2 - \phi_2) = T^{1/2}(\hat\gamma_2 - \gamma_2) \xrightarrow{p} 0$, and the joint asymptotic
distribution of $T^{1/2}(\hat\phi_1 - \phi_1)$ and $T^{1/2}(\hat\phi_2 - \phi_2)$ is singular. The linear combination
$\hat\phi_1 + \hat\phi_2$ converges at rate $T$ to a non-normal distribution: $T[(\hat\phi_1 + \hat\phi_2) -
(\phi_1 + \phi_2)] = T(\hat\gamma_2 - \gamma_2) \Rightarrow (1+\phi_2)[\int B(s)^2\,ds]^{-1}[\int B(s)\,dB(s)]$.

There are two important practical consequences of these results. First, inference
about $\phi_1$ or about $\phi_2$ can be conducted in the usual way. Second, inference about
the sum of coefficients $\phi_1 + \phi_2$ must be carried out using nonstandard asymptotic
distributions. Under the null hypothesis, the t-statistic for testing the null $H_0$: $\phi_1 = c$
converges to a standard normal random variable, while the t-statistic for testing
the null hypothesis $H_0$: $\phi_1 + \phi_2 = 1$ converges to $[\int B(s)^2\,ds]^{-1/2}[\int B(s)\,dB(s)]$,
which is the distribution of the Dickey-Fuller $\hat\tau$ statistic (see Stock's chapter of
the Handbook).
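A small Monte Carlo experiment along these lines illustrates the two limiting distributions. The sketch below (with arbitrary $\phi_2 = 0.5$, $T = 500$ and 2000 replications) computes both t-statistics from the transformed regression (2.2); the first behaves like a standard normal, the second like the left-shifted Dickey-Fuller distribution.

import numpy as np

def ar2_tstats(T, phi2, rng):
    # Simulate the unit-root AR(2) and return the two t-statistics.
    phi1 = 1.0 - phi2
    eta = rng.standard_normal(T + 2)
    y = np.zeros(T + 2)
    for t in range(2, T + 2):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eta[t]
    Z = np.column_stack([y[1:-1] - y[:-2], y[1:-1]])  # (dy_{t-1}, y_{t-1})
    yy = y[2:]
    g = np.linalg.lstsq(Z, yy, rcond=None)[0]         # (gamma1_hat, gamma2_hat)
    e = yy - Z @ g
    s2 = e @ e / (T - 2)
    V = s2 * np.linalg.inv(Z.T @ Z)
    t_phi2 = (-g[0] - phi2) / np.sqrt(V[0, 0])        # phi2_hat = -gamma1_hat
    t_sum = (g[1] - 1.0) / np.sqrt(V[1, 1])           # gamma2_hat = phi1+phi2
    return t_phi2, t_sum

rng = np.random.default_rng(1)
draws = np.array([ar2_tstats(500, 0.5, rng) for _ in range(2000)])
print("t(phi2): mean %5.2f, sd %4.2f (close to N(0,1))" % (draws[:, 0].mean(), draws[:, 0].std()))
print("t(sum):  mean %5.2f, sd %4.2f (Dickey-Fuller, shifted left)" % (draws[:, 1].mean(), draws[:, 1].std()))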
As we will see, many of the results developed for the AR(2) model carry over to more
general settings. First, estimates of linear combinations of regression coefficients
converge at different rates. Estimators that correspond to coefficients on stationary
regressors, or that can be written as coefficients on stationary regressors in a transformed
regression ($\gamma_1$ in this example), converge at rate $T^{1/2}$ and have the usual
asymptotic normal distribution. Estimators that correspond to coefficients on I(1)
regressors, and that cannot be written as coefficients on I(0) regressors in a transformed
regression ($\gamma_2$ in this example), converge at rate $T$ and have a nonstandard
asymptotic distribution. The asymptotic distribution of test statistics is also
affected by these results. Wald statistics for restrictions on coefficients corresponding
to I(0) regressors have the usual asymptotic normal or $\chi^2$ distributions. In
general, Wald statistics for restrictions on coefficients that cannot be written as
coefficients on I(0) regressors have nonstandard limiting distributions. We now
demonstrate these results for the general VAR model with I(1) variables.

2.5. A general framework

Consider the VAR model

$$Y_t = \alpha + \sum_{i=1}^{p}\Phi_i Y_{t-i} + \varepsilon_t, \qquad (2.8)$$

where $Y_t$ is an $n \times 1$ vector and $\varepsilon_t$ is a martingale difference sequence with constant
conditional variance $\Sigma_\varepsilon$ (abbreviated mds($\Sigma_\varepsilon$)) and finite fourth moments. Assume
that the determinant of the autoregressive polynomial $|I - \Phi_1 z - \Phi_2 z^2 - \cdots - \Phi_p z^p|$
has all of its roots outside the unit circle or at $z = 1$, and continue to maintain
the simplifying assumption that all elements of $Y_t$ are individually I(0) or I(1).⁵
For simplicity, assume that there are no cross equation restrictions, so that the
efficient linear estimators correspond to the equation-by-equation OLS estimators.
We now study the distribution of these estimators and commonly used test
statistics.⁶

2.5.1. Distribution of estimated regression coefficients

To begin, write the $i$th equation of the model as

$$y_{i,t} = X_t'\beta + \varepsilon_{i,t}, \qquad (2.9)$$

where $y_{i,t}$ is the $i$th element of $Y_t$, $X_t = (1\; Y_{t-1}'\; Y_{t-2}' \cdots Y_{t-p}')'$ is the $(np+1) \times 1$ vector
of regressors, $\beta$ is the corresponding vector of regression coefficients, and $\varepsilon_{i,t}$ is the
$i$th element of $\varepsilon_t$. (For notational convenience the dependence of $\beta$ on $i$ has been
suppressed.) The OLS estimator of $\beta$ is $\hat\beta = (\sum X_tX_t')^{-1}(\sum X_t y_{i,t})$, so that $\hat\beta - \beta =
(\sum X_tX_t')^{-1}(\sum X_t\varepsilon_{i,t})$.
As in the univariate AR(2) model, the analysis of the asymptotic behavior of $\hat\beta$ is facilitated by

⁵Higher order integrated processes can also be studied using the techniques discussed here; see Park
and Phillips (1988) and Sims et al. (1990). Seasonal unit roots (corresponding to zeroes elsewhere on
the unit circle) can also be studied using a modification of these procedures. See Tsay and Tiao
(1990) for a careful analysis of this case.
⁶The analysis in this section is based on a large body of work on estimation and inference in multi-
variate time series models with unit roots. A partial list of relevant references includes Chan and Wei
(1988), Park and Phillips (1988, 1989), Phillips (1988), Phillips and Durlauf (1986), Sims et al. (1990),
Stock (1987), Tsay and Tiao (1990), and West (1988). Additional references are provided in the body
of the text.

transforming the regressors in a way that isolates the various stochastic and deterministic
trends. In particular, the regressors are transformed as $Z_t = DX_t$, where
$D$ is nonsingular and $Z_t = (z_{1,t}'\; z_{2,t}\; z_{3,t}'\; z_{4,t})'$, where the $z_{i,t}$ will be referred to as
canonical regressors. These regressors are related to the deterministic and
stochastic trends given in Lemma 2.3 by the transformation

$$\begin{pmatrix} z_{1,t}\\ z_{2,t}\\ z_{3,t}\\ z_{4,t} \end{pmatrix} = \begin{pmatrix} F_{11}(L) & 0 & 0 & 0\\ 0 & F_{22} & 0 & 0\\ F_{31}(L) & F_{32} & F_{33} & 0\\ 0 & 0 & 0 & F_{44} \end{pmatrix}\begin{pmatrix} \eta_{t-1}\\ 1\\ \xi_{t-1}\\ t \end{pmatrix}$$

or

$$Z_t = F(L)v_{t-1},$$

where $v_t = (\eta_t'\; 1\; \xi_t'\; t)'$. The advantage of this transformation is that it isolates the
terms of different orders of probability. For example, $z_{1,t}$ is a zero mean I(0)
regressor, $z_{2,t}$ is a constant, the asymptotic behavior of the regressor $z_{3,t}$ is
dominated by the martingale component $F_{33}\xi_{t-1}$, and $z_{4,t}$ is dominated by the
time trend $F_{44}t$. The canonical regressors $z_{2,t}$ and $z_{4,t}$ are scalars, while $z_{1,t}$ and
$z_{3,t}$ are vectors. In the AR(2) example analyzed above, $z_{1,t} = \Delta y_{t-1} = (1+\phi_2 L)^{-1}\eta_{t-1}$,
so that $F_{11}(L) = (1+\phi_2 L)^{-1}$; $z_{2,t}$ is absent, since the model did not contain a
constant; $z_{3,t} = y_{t-1} = (1+\phi_2)^{-1}\xi_{t-1} + y_0 + s_{t-1}$, so that $F_{33} = (1+\phi_2)^{-1}$, $F_{32} = y_0$
and $F_{31}(L) = \phi_2(1+\phi_2)^{-1}(1+\phi_2 L)^{-1}$; and $z_{4,t}$ is absent since $y_t$ contains no deterministic
drift.
Sims et al. (1990) provide a general procedure for transforming regressors from
an integrated VAR into canonical form. They show that $Z_t$ can always be formed
so that the diagonal blocks $F_{ii}$, $i > 2$, have full row rank, although some blocks
may be absent. They also show that $F_{12} = 0$, as shown above, whenever the VAR
includes a constant. The details of their construction need not concern us since,
in practice, there is no need to construct the canonical regressors. The transformation
from the $X_t$ to the $Z_t$ regressors is merely an analytic device. It is useful for
two reasons. First, $X_t'\beta = X_t'D'(D')^{-1}\beta = Z_t'\gamma$, with $\gamma = (D')^{-1}\beta$. Thus the OLS estimators
of the original and transformed models are related by $D'\hat\gamma = \hat\beta$. Second, the asymptotic
properties of $\hat\gamma$ are easy to analyze because of the special structure of the
regressors. Together these imply that we can study the asymptotic properties of
$\hat\beta$ by first studying the asymptotic properties of $\hat\gamma$ and then transforming these
coefficients into the $\hat\beta$'s.
The transformation from $X_t$ to $Z_t$ is not unique. All that is required is some
transformation that yields a lower triangular $F(L)$ matrix. Thus, in the AR(2)
example we set $z_{1,t} = \Delta y_{t-1}$ and $z_{3,t} = y_{t-1}$, but an alternative transformation
would have set $z_{1,t} = \Delta y_{t-1}$ and $z_{3,t} = y_{t-2}$. Since we always transform results for
the canonical regressors $Z_t$ back into results for the natural regressors $X_t$, this
non-uniqueness is of no consequence.
We now derive the asymptotic properties of $\hat\gamma$ constructed from the regression
$y_{i,t} = Z_t'\gamma + \varepsilon_{i,t}$. Writing $\varepsilon_t = \Sigma_\varepsilon^{1/2}\eta_t$, where $\eta_t$ is the standardized $n \times 1$ martingale
difference sequence from Lemma 2.3, then $\varepsilon_{i,t} = \eta_t'\omega$, where $\omega'$ is the $i$th row
of $\Sigma_\varepsilon^{1/2}$, and $\hat\gamma - \gamma = (\sum Z_tZ_t')^{-1}(\sum Z_t\eta_t'\omega)$. Lemma 2.3 can be used to deduce the
asymptotic behavior of $\sum Z_tZ_t'$ and $\sum Z_t\eta_t'\omega$. Some care must be taken, however,
since the $z_{i,t}$ elements of $Z_t$ are growing at different rates. Assume that $z_{1,t}$
contains $k_1$ elements, $z_{3,t}$ contains $k_3$ elements, and partition $\gamma$ conformably with
$Z_t$ as $\gamma = (\gamma_1'\; \gamma_2\; \gamma_3'\; \gamma_4)'$, where $\gamma_j$ are the regression coefficients corresponding to $z_{j,t}$.
Let

$$\Upsilon_T = \begin{pmatrix} T^{1/2}I_{k_1} & 0 & 0 & 0\\ 0 & T^{1/2} & 0 & 0\\ 0 & 0 & TI_{k_3} & 0\\ 0 & 0 & 0 & T^{3/2} \end{pmatrix}$$

and consider $\Upsilon_T(\hat\gamma - \gamma) = (\Upsilon_T^{-1}\sum Z_tZ_t'\Upsilon_T^{-1})^{-1}(\Upsilon_T^{-1}\sum Z_t\eta_t'\omega)$. The matrix $\Upsilon_T$
multiplies the various blocks of $(\hat\gamma - \gamma)$, $\sum Z_tZ_t'$, and $\sum Z_t\eta_t'\omega$ by the scaling factors
appropriate from the lemma. The first block of coefficients, $\gamma_1$, are coefficients on
zero mean stationary components and are scaled up by the usual factor of $T^{1/2}$;
the same scaling factor is appropriate for $\gamma_2$, the constant term; the parameters
making up $\gamma_3$ are coefficients on regressors that are dominated by martingales,
and these need to be scaled by $T$; finally, $\gamma_4$ is a coefficient on a regressor that is
dominated by a time trend and is scaled by $T^{3/2}$.
Applying the lemma, we have $\Upsilon_T^{-1}\sum Z_tZ_t'\Upsilon_T^{-1} \Rightarrow V$, where, partitioning $V$ conformably
with $Z_t$:

$$T^{-1}\sum z_{1,t}z_{1,t}' \xrightarrow{p} \sum\nolimits_j F_{11,j}F_{11,j}' = V_{11},$$

$$T^{-1}\sum (z_{2,t})^2 \to F_{22}^2 = V_{22},$$

$$T^{-2}\sum z_{3,t}z_{3,t}' \Rightarrow F_{33}\left[\int B(s)B(s)'\,ds\right]F_{33}' = V_{33},$$

$$T^{-3}\sum (z_{4,t})^2 \to \frac{F_{44}^2}{3} = V_{44},$$

the appropriately scaled cross products of $z_{1,t}$ with the remaining regressors converge
in probability to zero, so that $V_{1j} = V_{j1}' = 0$ for $j = 2, 3, 4$,

$$T^{-3/2}\sum z_{2,t}z_{3,t}' \Rightarrow F_{22}\left[\int B(s)'\,ds\right]F_{33}' = V_{23} = V_{32}',$$

$$T^{-2}\sum z_{2,t}z_{4,t} \to \frac{F_{22}F_{44}}{2} = V_{24} = V_{42},$$

$$T^{-5/2}\sum z_{3,t}z_{4,t} \Rightarrow F_{33}\left[\int sB(s)\,ds\right]F_{44} = V_{34} = V_{43}',$$

where the notation reflects the fact that $F_{22}$ and $F_{44}$ are scalars. The limiting value
of this scaled moment matrix shares two important characteristics with its analogue
in the univariate AR(2) model. First, $V$ is block diagonal with $V_{1j} = 0$ for $j \neq 1$.
(Recall that in the AR(2) model $T^{-3/2}\sum\Delta y_{t-1}y_{t-1} \xrightarrow{p} 0$.) Second, many of the
blocks of $V$ contain random variables. (In the AR(2) model $T^{-2}\sum y_{t-1}^2$ converged
to a random variable.)
Now, applying the lemma to $\Upsilon_T^{-1}\sum Z_t\eta_t'\omega$ yields $\Upsilon_T^{-1}\sum Z_t\eta_t'\omega \Rightarrow A$, where,
partitioning $A$ conformably with $Z_t$:

$$T^{-1/2}\sum z_{1,t}\eta_t'\omega \Rightarrow N[0, (\omega'\omega)V_{11}] = A_1,$$

$$T^{-1/2}\sum z_{2,t}\eta_t'\omega \Rightarrow F_{22}\int dB(s)'\omega = A_2,$$

$$T^{-1}\sum z_{3,t}\eta_t'\omega \Rightarrow F_{33}\int B(s)\,dB(s)'\omega = A_3,$$

$$T^{-3/2}\sum z_{4,t}\eta_t'\omega \Rightarrow F_{44}\int s\,dB(s)'\omega = A_4.$$
Putting the results together, $\Upsilon_T(\hat\gamma - \gamma) \Rightarrow V^{-1}A$, and three important results follow.
First, the individual coefficients converge to their values at different rates: $\hat\gamma_1$ and
$\hat\gamma_2$ converge to their values at rate $T^{1/2}$, while all of the other coefficients converge
more quickly. Second, the block diagonality of $V$ implies that $T^{1/2}(\hat\gamma_1 - \gamma_1) \Rightarrow
N(0, \sigma_\varepsilon^2 V_{11}^{-1})$, where $\sigma_\varepsilon^2 = \omega'\omega = \mathrm{var}(\varepsilon_{i,t})$. Moreover, $A_1$ is independent of $A_j$ for $j > 1$
[Chan and Wei (1988, Theorem 2.2)], so that $T^{1/2}(\hat\gamma_1 - \gamma_1)$ is asymptotically
independent of the other estimated coefficients. Third, all of the other coefficients
will have non-normal limiting distributions, in general. This follows because $V_{j3} \neq 0$
for $j > 1$, and $A_3$ is non-normal. A notable exception to this general result is when
the canonical regressors do not contain any stochastic trends, so that $z_{3,t}$ is absent
from the model. In this case $V$ is a constant and $A$ is normally distributed, so that
the estimated coefficients have a joint asymptotic normal distribution. The leading
example of this is polynomial regression, when the set of regressors contains
covariance stationary regressors and polynomials in time. Another important
example is given by West (1988), who considers the scalar unit root AR(1) model
with drift.
The asymptotic distribution of the coefficients $\hat\beta$ that correspond to the natural
regressors $X_t$ can now be deduced. It is useful to begin with a special case of the
general model,

$$y_{i,t} = x_{1,t}\beta_1 + x_{2,t}'\beta_2 + x_{3,t}'\beta_3 + \varepsilon_{i,t}, \qquad (2.10)$$

⁷$A_1$, $A_2$ and $A_4$ are jointly normally distributed since $\int s\,dB(s)'\omega$ is a normally distributed random
variable with mean 0 and variance $(\omega'\omega)\int s^2\,ds$.

where $x_{1,t} = 1$ for all $t$, $x_{2,t}$ is an $h \times 1$ vector of zero mean I(0) variables and $x_{3,t}$
contains the other regressors. It is particularly easy to transform this model into
canonical form. First, since $x_{1,t} = 1$, we can set $z_{2,t} = x_{1,t}$; thus, in terms of the
transformed regression, $\beta_1 = \gamma_2$. Second, since the elements of $x_{2,t}$ are zero mean
I(0) variables, we can set the first $h$ elements of $z_{1,t}$ equal to $x_{2,t}$; thus $\beta_2$ is equal
to the first $h$ elements of $\gamma_1$. The remaining elements of $Z_t$ are linear combinations
of the regressors that need not concern us here. In this example, since $\beta_2$ is a subset
of the elements of $\gamma_1$, $T^{1/2}(\hat\beta_2 - \beta_2)$ is asymptotically normal and independent
of the coefficients corresponding to trend and unit root regressors. This result is
very useful because it provides a constructive sufficient condition for estimated
coefficients to have an asymptotic normal limiting distribution: whenever the block
of coefficients can be written as coefficients on zero mean I(0) regressors in a model
that includes a constant term they will have a joint asymptotic normal distribution.
Now consider the general model. Recall that $\hat\beta = D'\hat\gamma$. Let $d_j$ denote the $j$th
column of $D$, and partition this conformably with $\gamma$, so that $d_j = (d_{1j}'\; d_{2j}\; d_{3j}'\; d_{4j})'$,
where $d_{ij}$ and $\gamma_i$ are the same dimension. Then the $j$th element of $\hat\beta$ is $\hat\beta_j = \sum_i d_{ij}'\hat\gamma_i$.
Since the components of $\hat\gamma$ converge at different rates, $\hat\beta_j$ will converge at the
slowest rate of the $\hat\gamma_i$ included in the sum. Thus, when $d_{1j} \neq 0$, $\hat\beta_j$ will converge at
rate $T^{1/2}$, the rate of convergence of $\hat\gamma_1$.

2.5.2. Distribution of Wald test statistics

Consider Wald test statistics for linear hypotheses of the form $R\beta = r$, where $R$ is
a $q \times k$ matrix with full row rank,

$$W = \frac{(R\hat\beta - r)'\left[R\left(\sum X_tX_t'\right)^{-1}R'\right]^{-1}(R\hat\beta - r)}{\hat\sigma_\varepsilon^2}.$$

(Recall that $\hat\beta$ corresponds to the coefficients in the $i$th equation, so that $W$ tests
within-equation restrictions.) Letting $Q = RD'$, an equivalent way of writing the
Wald statistic is in terms of the canonical regressors $Z_t$ and their estimated
coefficients $\hat\gamma$,

$$W = \frac{(Q\hat\gamma - r)'\left[Q\left(\sum Z_tZ_t'\right)^{-1}Q'\right]^{-1}(Q\hat\gamma - r)}{\hat\sigma_\varepsilon^2}.$$

Care must be taken when analyzing the large sample behavior of $W$ because the
individual coefficients in $\hat\gamma$ converge at different rates. To isolate the different components,
it is useful to assume (without loss of generality) that $Q$ is upper triangular.⁸

⁸This assumption is made without loss of generality since the constraint $Q\gamma = r$ (and the resulting
Wald statistic) is equivalent to $CQ\gamma = Cr$, for nonsingular $C$. For any matrix $Q$, $C$ can be chosen so that
$CQ$ is upper triangular.

Now, partition $Q$ conformably with $\hat\gamma$ and the canonical regressors making up $Z_t$,
so that $Q = [q_{ij}]$, where $q_{ij}$ is a $q_i \times k_j$ matrix representing $q_i$ constraints on the $k_j$
elements in $\gamma_j$. These blocks are chosen so that $q_{ii}$ has full row rank and $q_{ij} = 0$
for $j < i$. Since the set of constraints $Q\gamma = r$ may not involve $\gamma_i$, the blocks $q_{ij}$ might
be absent for some $i$. Thus, for example, when the hypothesis concerns only $\gamma_3$,
then $Q$ is written as $Q = [q_{31}\; q_{32}\; q_{33}\; q_{34}]$, where $q_{31} = 0$, $q_{32} = 0$ and $q_{33}$ has full
row rank. Partition $r = (r_1'\; r_2'\; r_3'\; r_4')'$ conformably with $Q$, where again some of the
$r_i$ may be absent.

Now consider the first $q_1$ elements of $Q\hat\gamma$, $q_{11}\hat\gamma_1 + q_{12}\hat\gamma_2 + q_{13}\hat\gamma_3 + q_{14}\hat\gamma_4$. Since
$\hat\gamma_j$, for $j > 2$, converges more quickly than $\hat\gamma_1$ and $\hat\gamma_2$, the sampling error in this
vector will be dominated asymptotically by the sampling error in $q_{11}\hat\gamma_1 + q_{12}\hat\gamma_2$.
Similarly, the sampling error in the next group of $q_2$ elements of $Q\hat\gamma$ is dominated
by $q_{22}\hat\gamma_2$, in the next $q_3$ by $q_{33}\hat\gamma_3$, etc. Thus, the appropriate scaling matrix for
$Q\hat\gamma - r$ is

$$\Psi_T = \begin{pmatrix} T^{1/2}I_{q_1} & 0 & 0 & 0\\ 0 & T^{1/2}I_{q_2} & 0 & 0\\ 0 & 0 & TI_{q_3} & 0\\ 0 & 0 & 0 & T^{3/2}I_{q_4} \end{pmatrix}.$$

Now, write the Wald statistic as

$$W = \frac{[\Psi_T(Q\hat\gamma - r)]'\left[\Psi_T Q\left(\sum Z_tZ_t'\right)^{-1}Q'\Psi_T\right]^{-1}[\Psi_T(Q\hat\gamma - r)]}{\hat\sigma_\varepsilon^2}.$$
But, under the null,

$$T^{1/2}(q_{11}\hat\gamma_1 + q_{12}\hat\gamma_2 + q_{13}\hat\gamma_3 + q_{14}\hat\gamma_4 - r_1) = T^{1/2}(q_{11}\hat\gamma_1 + q_{12}\hat\gamma_2 - r_1) + o_p(1), \text{ and}$$

$$T^{(j-1)/2}(q_{jj}\hat\gamma_j + \cdots + q_{j4}\hat\gamma_4 - r_j) = T^{(j-1)/2}(q_{jj}\hat\gamma_j - r_j) + o_p(1), \quad \text{for } j > 1.$$

Thus, if we let

$$\tilde{Q} = \begin{pmatrix} q_{11} & q_{12} & 0 & 0\\ 0 & q_{22} & 0 & 0\\ 0 & 0 & q_{33} & 0\\ 0 & 0 & 0 & q_{44} \end{pmatrix},$$

then

$$\Psi_T(Q\hat\gamma - r) = \tilde{Q}\Upsilon_T(\hat\gamma - \gamma) + o_p(1)$$

under the null.⁹ Similarly, it is straightforward to show that

$$\Psi_T Q\left(\sum Z_tZ_t'\right)^{-1}Q'\Psi_T = \tilde{Q}\left[\Upsilon_T^{-1}\left(\sum Z_tZ_t'\right)\Upsilon_T^{-1}\right]^{-1}\tilde{Q}' + o_p(1).$$

Finally, since $\Upsilon_T(\hat\gamma - \gamma) \Rightarrow V^{-1}A$ and $\Upsilon_T^{-1}\sum Z_tZ_t'\Upsilon_T^{-1} \Rightarrow V$, then $W \Rightarrow (\tilde{Q}V^{-1}A)' \times
(\tilde{Q}V^{-1}\tilde{Q}')^{-1}(\tilde{Q}V^{-1}A)/\sigma_\varepsilon^2$.
The limiting distribution of $W$ is particularly simple when the blocks $q_{ii}$ are absent
for $i \geq 2$. In this case, all of the hypotheses of interest concern linear combinations of
coefficients on zero mean I(0) regressors, together with the other regression coefficients.
When $q_{12} = 0$, so that the constant term is unrestricted, we have

$$W = \hat\sigma_\varepsilon^{-2}[q_{11}\hat\gamma_1 - r_1]'\left[q_{11}\left(\sum z_{1,t}z_{1,t}'\right)^{-1}q_{11}'\right]^{-1}[q_{11}\hat\gamma_1 - r_1] + o_p(1),$$

so that $W \Rightarrow \chi^2_{q_1}$. When the constraints involve other linear combinations of the
regression coefficients, the asymptotic $\chi^2$ distribution of the test statistic
will not generally obtain.
This analysis has only considered tests of restrictions on coefficients from the
same equation. Results for cross equation restrictions are contained in Sims et al.
(1990). The same general results carry over to cross equation restrictions. Namely,
restrictions that involve subsets of coefficients that can be written as coefficients
on zero mean stationary regressors, in regressions that include constant terms, can
be tested using standard asymptotic distribution theory. Otherwise, in general, the
statistics will have nonstandard limiting distributions.

2.6. Applications

2.6.1. Testing lag length restrictions

Consider the VAR($p+s$) model,

$$Y_t = \alpha + \sum_{i=1}^{p+s}\Phi_i Y_{t-i} + \varepsilon_t,$$

and the null hypothesis $H_0$: $\Phi_{p+1} = \Phi_{p+2} = \cdots = \Phi_{p+s} = 0$, which says that the true
model is a VAR($p$). When $p \geq 1$, the usual Wald (and LR and LM) test statistic
for $H_0$ has an asymptotic $\chi^2$ distribution under the null. This can be demonstrated
by rewriting the regression so that the restrictions in $H_0$ concern coefficients on
zero mean stationary regressors. Assume that $\Delta Y_t$ is I(0) with mean $\mu$.
⁹$q_{12}$ is the only off-diagonal element appearing in $\tilde{Q}$. It appears because $\hat\gamma_1$ and $\hat\gamma_2$ both converge
at rate $T^{1/2}$.

Then rewrite the model as

$$Y_t = \tilde\alpha + AY_{t-1} + \sum_{i=1}^{p+s-1}\Theta_i(\Delta Y_{t-i} - \mu) + \varepsilon_t,$$

where $A = \sum_{i=1}^{p+s}\Phi_i$, $\Theta_i = -\sum_{j=i+1}^{p+s}\Phi_j$ and $\tilde\alpha = \alpha + \sum_{i=1}^{p+s-1}\Theta_i\mu$. The restrictions
$\Phi_{p+1} = \Phi_{p+2} = \cdots = \Phi_{p+s} = 0$ in the original model are equivalent to $\Theta_p =
\Theta_{p+1} = \cdots = \Theta_{p+s-1} = 0$ in the transformed model. Since these are coefficients on zero
mean I(0) regressors in regression equations that contain a constant term, the test
statistics will have the usual large sample $\chi^2$ distribution.
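The following Monte Carlo sketch illustrates this result in the simplest univariate case (an AR(2) with a unit root, fit as an AR(3) with a constant; all settings are illustrative): the Wald statistic for the extra, truly zero lag behaves like a $\chi^2_1$ variate.

import numpy as np

def wald_extra_lag(T, rng):
    # True model: AR(2) with a unit root; fit AR(3) with constant, test phi3 = 0.
    phi1, phi2 = 1.5, -0.5                     # phi1 + phi2 = 1 (unit root)
    eta = rng.standard_normal(T + 3)
    y = np.zeros(T + 3)
    for t in range(2, T + 3):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + eta[t]
    X = np.column_stack([np.ones(T), y[2:-1], y[1:-2], y[:-3]])
    yy = y[3:]
    b = np.linalg.lstsq(X, yy, rcond=None)[0]
    e = yy - X @ b
    s2 = e @ e / (T - 4)
    V = s2 * np.linalg.inv(X.T @ X)
    return b[3] ** 2 / V[3, 3]                 # Wald statistic for phi3 = 0

rng = np.random.default_rng(4)
w = np.array([wald_extra_lag(400, rng) for _ in range(2000)])
print("rejection rate at the chi2(1) 5%% critical value 3.84: %.3f" % (w > 3.84).mean())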

2.6.2. Testing for Granger causality

Consider the bivariate VAR model

$$y_{1,t} = \alpha_1 + \sum_{i=1}^{p}\phi_{11,i}\,y_{1,t-i} + \sum_{i=1}^{p}\phi_{12,i}\,y_{2,t-i} + \varepsilon_{1,t},$$

$$y_{2,t} = \alpha_2 + \sum_{i=1}^{p}\phi_{21,i}\,y_{1,t-i} + \sum_{i=1}^{p}\phi_{22,i}\,y_{2,t-i} + \varepsilon_{2,t}.$$

The restriction that $y_{2,t}$ does not Granger-cause $y_{1,t}$ corresponds to the null hypothesis
$H_0$: $\phi_{12,1} = \phi_{12,2} = \cdots = \phi_{12,p} = 0$. When $(y_{1,t}, y_{2,t})$ are covariance stationary,
the resulting Wald, LR or LM test statistic for this hypothesis will have a large
sample $\chi^2_p$ distribution. When $(y_{1,t}, y_{2,t})$ are integrated, the distribution of the test
statistic depends on the location of unit roots in the system. For example, suppose
that $y_{1,t}$ is I(1), but that $y_{2,t}$ is I(0). Then, by writing the model in terms of deviations
of $y_{2,t}$ from its mean, the restrictions involve only coefficients on zero mean I(0)
regressors. Consequently, the test statistic has a limiting $\chi^2_p$ distribution.
When $y_{2,t}$ is I(1), then the distribution of the statistic will be asymptotically $\chi^2$
when $y_{1,t}$ and $y_{2,t}$ are cointegrated. When $y_{1,t}$ and $y_{2,t}$ are not cointegrated, the
Granger-causality test statistic will not be asymptotically $\chi^2$, in general.¹⁰ Again,
the first result is easily demonstrated by writing the model so the coefficients of
interest appear as coefficients on zero mean stationary regressors. In particular,
when $y_{1,t}$ and $y_{2,t}$ are cointegrated, there is an I(0) linear combination of the
variables, say $w_t = y_{2,t} - \lambda y_{1,t}$, and the model can be rewritten as

$$y_{1,t} = \tilde\alpha_1 + \sum_{i=1}^{p}\tilde\phi_{11,i}\,y_{1,t-i} + \sum_{i=1}^{p}\phi_{12,i}(w_{t-i} - \mu_w) + \varepsilon_{1,t},$$

where $\mu_w$ is the mean of $w_t$, $\tilde\alpha_1 = \alpha_1 + \sum_{i=1}^{p}\phi_{12,i}\,\mu_w$ and $\tilde\phi_{11,i} = \phi_{11,i} + \phi_{12,i}\lambda$,
$i = 1,\dots,p$. In the transformed regression, the Granger-causality restriction corresponds
to the restriction that the terms $w_{t-i} - \mu_w$ do not enter the regression. But
these are zero mean I(0) regressors in a regression that includes a constant, so
that the resulting test statistics will have a limiting $\chi^2_p$ distribution. When $y_{1,t}$ and
$y_{2,t}$ are not cointegrated, the regression cannot be transformed in this way, and
the resulting test statistic will not, in general, have a limiting $\chi^2$ distribution.
The Mankiw-Shapiro (1985)/Stock-West (1988) results concerning Hall's test
of the life-cycle/permanent income model can now be explained quite simply.
Mankiw and Shapiro considered tests of Hall's model based on the regression of
$\Delta c_t$ (the change in the logarithm of consumption) onto $y_{t-1}$ (the lagged value of the logarithm
of income). Since $y_{t-1}$ is (arguably) integrated, its regression coefficient and t-
statistic will have a nonstandard limiting distribution. Stock and West, following
Hall's (1978) original regressions, considered regressions of $c_t$ onto $c_{t-1}$ and $y_{t-1}$.
Since, according to the life-cycle/permanent income model, $c_{t-1}$ and $y_{t-1}$ are
cointegrated, the coefficient on $y_{t-1}$ will be asymptotically normal and its t-statistic
will have a limiting standard normal distribution. However, when $y_{t-1}$ is replaced
in the regression with $m_{t-1}$ (the lagged value of the logarithm of money), the
statistic will not be asymptotically normal, since $c_{t-1}$ and $m_{t-1}$ are not cointegrated.
A more detailed discussion of this example is contained in Stock and West (1988).

2.6.3. Spurious regressions

In a very influential paper in the 1970s, Granger and Newbold (1974) presented
Monte Carlo evidence reminding economists of Yule's (1926) spurious correlation
results. Specifically, Granger and Newbold showed that a large $R^2$ and a large
t-statistic were not unusual when one random walk was regressed on another,
statistically independent, random walk. Their results warned researchers that
standard measures of fit can be very misleading in spurious regressions. Phillips
(1986) showed how these results could be interpreted quite simply using the framework
outlined above, and his analysis is summarized here.
Let $y_{1,t}$ and $y_{2,t}$ be two independent random walks

$$y_{1,t} = y_{1,t-1} + \varepsilon_{1,t},$$

$$y_{2,t} = y_{2,t-1} + \varepsilon_{2,t},$$

where $\varepsilon_t = (\varepsilon_{1,t}\; \varepsilon_{2,t})'$ is an mds($\Sigma_\varepsilon$) with finite fourth moments, and $\{\varepsilon_{1,t}\}_{t=1}^{\infty}$ and
$\{\varepsilon_{2,t}\}_{t=1}^{\infty}$ are mutually independent. For simplicity, set $y_{1,0} = y_{2,0} = 0$. Consider
the linear regression of $y_{2,t}$ onto $y_{1,t}$,

$$y_{2,t} = \beta y_{1,t} + u_t, \qquad (2.11)$$

where $u_t$ is the regression error. Since $y_{1,t}$ and $y_{2,t}$ are statistically independent,
$\beta = 0$ and $u_t = y_{2,t}$.
¹⁰A detailed discussion of Granger-causality tests in integrated systems is contained in Sims et al.
(1990) and Toda and Phillips (1993a, b).
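The phenomenon is easy to reproduce. The sketch below (one draw; seed and sample size arbitrary) regresses one simulated random walk on another, independent one, as in (2.11), and typically produces a large $R^2$ and t-statistic.

import numpy as np

rng = np.random.default_rng(5)
T = 200
y1 = np.cumsum(rng.standard_normal(T))
y2 = np.cumsum(rng.standard_normal(T))   # independent of y1

beta_hat = (y1 @ y2) / (y1 @ y1)         # OLS without a constant, as in (2.11)
u = y2 - beta_hat * y1
s2 = u @ u / (T - 1)
t_stat = beta_hat / np.sqrt(s2 / (y1 @ y1))
r2 = 1.0 - (u @ u) / (y2 @ y2)
print("beta = %.2f, t = %.1f, R2 = %.2f" % (beta_hat, t_stat, r2))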

$$y_{2,t} = \beta y_{1,t} + u_{2,t}, \qquad (2.16)$$

where $u_t = (u_{1,t}\; u_{2,t})' = D\varepsilon_t$, where $\varepsilon_t$ is an mds($I_2$) with finite fourth moments. Like
the spurious regression model, both $y_{1,t}$ and $y_{2,t}$ are individually I(1): $y_{1,t}$ is a
random walk, while $\Delta y_{2,t}$ follows a univariate ARMA(1,1) process. Unlike the
spurious regression model, one linear combination of the variables, $y_{2,t} - \beta y_{1,t} = u_{2,t}$,
is I(0), and so the variables are cointegrated.
Stock (1987) derives the asymptotic distribution of the OLS estimator of cointegrating
vectors. In this example, the limiting distribution is quite simple. Write

$$\hat\beta - \beta = \left[\sum (y_{1,t})^2\right]^{-1}\left[\sum y_{1,t}u_{2,t}\right], \qquad (2.17)$$

and let $d_{ij}$ denote the $ij$th element of $D$, and $D_i = (d_{i1}\; d_{i2})$ denote the $i$th row of $D$.
Then the limiting behavior of the denominator of $\hat\beta - \beta$ follows directly from the
lemma:

$$T^{-2}\sum (y_{1,t})^2 = D_1\left[T^{-2}\sum \xi_t\xi_t'\right]D_1' \Rightarrow D_1\left[\int B(s)B(s)'\,ds\right]D_1', \qquad (2.18)$$

where $\xi_t$ is the bivariate random walk, with $\Delta\xi_t = \varepsilon_t$, and $B(s)$ is a $2 \times 1$ Brownian
motion process. The numerator is only slightly more difficult:

$$T^{-1}\sum y_{1,t}u_{2,t} = T^{-1}\sum y_{1,t-1}u_{2,t} + T^{-1}\sum \Delta y_{1,t}u_{2,t}
= D_1\left[T^{-1}\sum \xi_{t-1}\varepsilon_t'\right]D_2' + D_1\left[T^{-1}\sum \varepsilon_t\varepsilon_t'\right]D_2'
\Rightarrow D_1\left[\int B(s)\,dB(s)'\right]D_2' + D_1D_2'. \qquad (2.19)$$

Putting these two results together,

$$T(\hat\beta - \beta) \Rightarrow \left\{D_1\left[\int B(s)\,dB(s)'\right]D_2' + D_1D_2'\right\}\left\{D_1\left[\int B(s)B(s)'\,ds\right]D_1'\right\}^{-1}. \qquad (2.20)$$
There are three interesting features of the limiting representation (2.20). First,
$\hat\beta$ is super consistent, converging to its true value at rate $T$. Second, while super
consistent, $\hat\beta$ is asymptotically biased, in the sense that the mean of the asymptotic
distribution is not centered at zero. The constant term $D_1D_2' = d_{11}d_{21} + d_{12}d_{22}$
that appears in the numerator of (2.20) is primarily responsible for this bias. To
see the source of this bias, notice that the regressor $y_{1,t}$ is correlated with the error
term $u_{2,t}$. In standard situations, this simultaneous equations bias is reflected in
large samples as an inconsistency in $\hat\beta$. With cointegration, the regressor is I(1)
and the error term is I(0), so no inconsistency results; the simultaneous equations
bias shows up as bias in the asymptotic distribution of $\hat\beta$. In realistic examples
this bias can be quite large. For example, Stock (1988) calculates the asymptotic
bias that would obtain in the OLS estimator of the marginal propensity to
consume, obtained from a regression of consumption onto income using annual
observations with a process for $u_t$ similar to that found in U.S. data. He finds that
the bias is still $-0.10$ even when 53 years of data are used.¹¹ Thus, even though
the OLS estimators are super consistent, they can be quite poor.
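A simulation sketch makes the asymptotic bias visible (the matrix $D$ below is an arbitrary illustrative choice with $d_{12}, d_{21} \neq 0$): the Monte Carlo mean of $T(\hat\beta - \beta)$ is clearly nonzero even though $\hat\beta$ converges at rate $T$.

import numpy as np

rng = np.random.default_rng(6)
D = np.array([[1.0, 0.5],
              [0.5, 1.0]])                    # illustrative, correlated errors
beta, T, nrep = 1.0, 200, 2000

bias_draws = np.empty(nrep)
for r in range(nrep):
    u = rng.standard_normal((T, 2)) @ D.T     # u_t = D eps_t
    y1 = np.cumsum(u[:, 0])                   # random-walk regressor
    y2 = beta * y1 + u[:, 1]                  # cointegrated with y1
    bhat = (y1 @ y2) / (y1 @ y1)              # OLS estimate of beta
    bias_draws[r] = T * (bhat - beta)
print("mean of T*(beta_hat - beta): %.2f (nonzero: asymptotic bias)" % bias_draws.mean())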
The third feature of the asymptotic distribution in (2.20) involves the special
case in which $d_{12} = d_{21} = 0$, so that $u_{1,t}$ and $u_{2,t}$ are statistically independent. In
this case the OLS estimator corresponds to the Gaussian maximum likelihood
estimator (MLE). When $d_{12} = d_{21} = 0$, (2.20) simplifies to

$$T(\hat\beta - \beta) \Rightarrow \left(\frac{d_{22}}{d_{11}}\right)\frac{\int B_1(s)\,dB_2(s)}{\int B_1(s)^2\,ds}, \qquad (2.21)$$

where $B(s)$ is partitioned as $B(s) = [B_1(s)\; B_2(s)]'$. This result is derived in Phillips
and Park (1988), where the distribution is given a particularly simple and useful
interpretation. To develop the interpretation, suppose for the moment that $u_{2,t} =
d_{22}\varepsilon_{2,t}$ was n.i.i.d. (In large samples the normality assumption is not important; it
is made here to derive simple and exact small sample results.) Now, consider the
distribution of $\hat\beta$ conditional on the regressors $\{y_{1,t}\}_{t=1}^{T}$. Since $u_{2,t}$ is n.i.i.d., the
restriction $d_{12} = d_{21} = 0$ implies that $u_{2,t}$ is independent of $\{y_{1,t}\}_{t=1}^{T}$. This means
that $\hat\beta - \beta \mid \{y_{1,t}\}_{t=1}^{T} \sim N(0, d_{22}^2[\sum (y_{1,t})^2]^{-1})$, so that the unconditional distribution of
$\hat\beta - \beta$ is normal with mean zero and random covariance matrix $d_{22}^2[\sum (y_{1,t})^2]^{-1}$.
In large samples, $T^{-2}\sum (y_{1,t})^2 \Rightarrow d_{11}^2\int B_1(s)^2\,ds$, so that $T(\hat\beta - \beta)$ converges to a
normal random variable with a mean of zero and random covariance matrix
$(d_{22}/d_{11})^2[\int B_1(s)^2\,ds]^{-1}$. Thus, $T(\hat\beta - \beta)$ has an asymptotic distribution that is a
random mixture of normals. Since the normal distributions in the mixture have a
mean of zero, the asymptotic distribution is distributed symmetrically about zero,
and thus $\hat\beta$ is asymptotically median unbiased.
The distribution is useful, not so much for what it implies about the distribution
of $\hat\beta$, but for what it implies about the t-statistic for $\beta$. When $d_{12}$ or $d_{21}$ are not
equal to zero, the t-statistic for testing the null $\beta = \beta_0$ has a nonstandard limiting
distribution, analogous to the distribution of the Dickey-Fuller t-statistic for
testing the null of a unit AR coefficient in a univariate regression. However, when
$d_{12} = d_{21} = 0$, the t-statistic has a limiting standard normal distribution. To see
why this is true, again consider the situation in which $u_{2,t}$ is n.i.i.d. When $d_{12} = d_{21} = 0$,
the distribution of the t-statistic for testing $\beta = \beta_0$ conditional on $\{y_{1,t}\}_{t=1}^{T}$ has an
exact Student's t distribution with $T - 1$ degrees of freedom. Since this distribution
does not depend on $\{y_{1,t}\}_{t=1}^{T}$, this is the unconditional distribution as well. This
means that in large samples, the t-statistic has a standard normal distribution. As
we will see in the next section, the Phillips and Park (1988) result carries over to
a much more general setting.

¹¹Stock (1988, Table 4). These results are for durable plus nondurable consumption. When nondurable
consumption is used, Stock estimates the bias to be $-0.15$.
In the example developed here, u_t = Dε_t is serially uncorrelated. This simplifies
the analysis, but all of the results hold more generally. For example, Stock (1987)
assumes that u_t = D(L)ε_t, where D(L) = Σ_{i=0}^∞ D_i L^i, |D(1)| ≠ 0 and Σ_{i=0}^∞ i|D_i| < ∞.
In this case,

(2.22)

where D_j(1) is the jth row of D(1) and D_{j,i} is the jth row of D_i. Under the additional
assumptions that d_{12}(1) = d_{21}(1) = 0 and Σ_{i=0}^∞ D_{1,i}D'_{2,i} = 0, T(β̂ - β) is distributed
as a mixed normal (asymptotically) and the t-statistic for testing β = β_0 has an
asymptotic normal distribution [see Phillips and Park (1988) and Phillips (1991a)].

2.7. Implications for econometric practice

The asymptotic results presented above are important because they determine the
appropriate critical values for tests of coefficient restrictions in VAR models.
The results lead to three lessons that are useful for applied practice.

(1) Coefficients that can be written as coefficients on zero mean I(0) regressors in
regressions that include a constant term are asymptotically normal. Test statistics
for restrictions on these coefficients have the usual asymptotic χ² distributions.
For example, in the model

y_t = γ_1 z_{1,t} + γ_2 + γ_3 z_{3,t} + γ_4 t + ε_t,   (2.23)

where z_{1,t} is a mean zero I(0) scalar regressor and z_{3,t} is a scalar martingale
regressor, this result implies that the Wald statistic for testing H_0: γ_1 = c is asymp-
totically χ².

(2) Linear combinations of coefficients that include coefficients on zero mean I(0)
regressors together with coefficients on stochastic or deterministic trends will have
asymptotic normal distributions. Wald statistics for testing restrictions on these
linear combinations will have large sample χ² distributions. Thus in (2.23), Wald
statistics for testing H_0: R_1γ_1 + R_3γ_3 + R_4γ_4 = r will have an asymptotic χ²
distribution if R_1 ≠ 0.

(3) Coefficients that cannot be written as coefficients on zero mean I(0) regressors
(e.g. constants, time trends, and martingales) will, in general, have nonstandard
asymptotic distributions. Test statistics that involve restrictions on these coefficients
that are not a function of coefficients on zero mean I(0) regressors will, in general,
have nonstandard asymptotic distributions. Thus in (2.23), the Wald statistic for
testing H_0: R(γ_2 γ_3 γ_4)' = r has a non-χ² asymptotic distribution, as do test statistics
for composite hypotheses of the form H_0: R(γ_2 γ_3 γ_4)' = r and γ_1 = c.

When test statistics have a nonstandard distribution, critical values can be deter-
mined by Monte Carlo methods, by simulating approximations to the various
functionals of B(s) appearing in Lemma 2.3. As an example, consider using Monte
Carlo methods to calculate the asymptotic distribution of the sum of coefficients φ_1 +
φ_2 = γ_2 in the univariate AR(2) regression model (2.1). Section 2.4 showed that
T(γ̂_2 - γ_2) ⇒ (1 + φ_2)[∫B(s)²ds]⁻¹[∫B(s)dB(s)], where B(s) is a scalar Brownian
motion process. If x_t is generated as a univariate Gaussian random walk, then
one draw of the random variable [∫B(s)²ds]⁻¹[∫B(s)dB(s)] is well approximated
by (T⁻²Σx_t²)⁻¹(T⁻¹Σx_tΔx_{t+1}) with T large. (A value of T = 500 provides an
adequate approximation for most purposes.) The distribution of T(γ̂_2 - γ_2) can
then be approximated by taking repeated draws of (T⁻²Σx_t²)⁻¹(T⁻¹Σx_tΔx_{t+1})
multiplied by (1 + φ̂_2). An example of this approach in a more complicated multi-
variate model is provided in Stock and Watson (1988).
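To make the simulation concrete, the following minimal Python sketch (the variable names and the illustrative value of φ̂_2 are not from the text) generates repeated draws of (T⁻²Σx_t²)⁻¹(T⁻¹Σx_tΔx_{t+1}) from Gaussian random walks with T = 500, scales them by (1 + φ̂_2), and reads off simulated percentiles:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_draws = 500, 10_000
phi2_hat = 0.3                        # illustrative value of the estimated phi_2

draws = np.empty(n_draws)
for i in range(n_draws):
    e = rng.standard_normal(T)
    x = np.cumsum(e)                  # Gaussian random walk x_t
    num = np.sum(x[:-1] * e[1:]) / T  # T^{-1} sum x_t dx_{t+1}, approximates int B dB
    den = np.sum(x[:-1] ** 2) / T**2  # T^{-2} sum x_t^2, approximates int B(s)^2 ds
    draws[i] = (1 + phi2_hat) * num / den

print(np.percentile(draws, [2.5, 5.0, 95.0, 97.5]))  # simulated critical values
```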
Application of these rules in practice requires that the researcher know about
the presence and location of unit roots in the VAR. For example, in determining
the asymptotic distribution of Granger-causality test statistics, the researcher has
to know whether the candidate causal variable is integrated and, if it is integrated,
whether it is cointegrated with any other variable in the regression. If it is cointe-
grated with the other regressors, then the test statistic has a χ² asymptotic distri-
bution. Otherwise the test statistic is asymptotically non-χ², in general. In practice
such prior information is often unavailable, and an important question is what is
to be done in this case.¹²
The general problem can be described as follows. Let W denote the Wald test
statistic for a hypothesis of interest. Then the asymptotic distribution of the Wald
statistic when a unit root is present, say F(W|U), is not equal to the distribution
of the statistic when no unit root is present, say F(W|N). Let c_U and c_N denote

¹²Toda and Phillips (1993a, b) discuss testing for Granger causality in a situation in which the
researcher knows the number of unit roots in the model but doesn't know the cointegrating vectors.
They develop a sequence of asymptotic χ² tests for the problem. When the number of unit roots in
the system is unknown, they suggest pretesting for the number of unit roots. While this will lead to
sensible results in many empirical problems, examples such as the one presented at the end of this
section show that large pretest biases are possible.

the unit root and no unit root critical values for a test with size c(. That is, cu
and cN satisfy: P( W > cu( U) = P( W > cN( N) = a under the null. The problem is
that cu # cN, and the researcher does not know whether U or N is the correct
specification.
In one sense, this is not an unusual situation. Usually, the distribution of statistics
depends on characteristics of the probability distribution of the data that are un-
known to the researcher, even under the null hypothesis. Typically, there is
uncertainty over certain nuisance parameters that affect the distribution of the
statistic of interest. Yet, typically the distribution depends on the nuisance para-
meters in a continuous fashion, in the sense that critical values are continuous
functions of the nuisance parameters. This means that asymptotically valid inference
can be carried out by replacing the unknown parameters with consistent estimates.
This is not possible in the present situation. While it is possible to represent
the uncertainty in the distribution of test statistics as a function of nuisance para-
meters that can be consistently estimated, the critical values are not continuous
functions of these parameters. Small changes in the nuisance parameters, associated
with sampling error in estimates, may lead to large changes in critical values.
Thus, inference cannot be carried out by replacing unknown nuisance parameters
with consistent estimates. Alternative procedures are required.¹³
Development of these alternative procedures is currently an active area of
research, and it is too early to speculate on which procedures will prove to be the
most useful. It is possible to mention a few possibilities and highlight the key issues.
The simplest procedure is to carry out conservative inference, that is, to use the
largest of the unit root and no unit root critical values, rejecting the null when
W > max(c_U, c_N). By construction, the size of the test is less than or equal to α.
Whenever W > max(c_U, c_N), so that the null is rejected using either distribution,
or W < min(c_U, c_N), so that the null is not rejected using either distribution, one
need not proceed further. However, a problem remains when min(c_U, c_N) < W <
max(c_U, c_N). In this case, an intuitively appealing procedure is to look at the data
to see which hypothesis, unit root or no unit root, seems more plausible.
This approach is widely used in applications. Formally, it can be described as
follows. Let y denote a statistic helpful in classifying the stochastic process as a
unit root or no unit root process. (For example, y might denote a Dickey-Fuller
t-statistic or one of the test statistics for cointegration discussed in the next
section.) The procedure is then to define a region for y, say R_U, and when y ∈ R_U,
the critical value c_U is used; otherwise the critical value c_N is used. (For example,
the unit root critical value might be used if the Dickey-Fuller t-statistic was
greater than -2, and the no unit root critical value used when the DF statistic

¹³Alternatively, using local-to-unity asymptotics, the critical values can be represented as
continuous functions of the local-to-unity parameter, but this parameter cannot be consistently
estimated from the data. See Bobkoski (1983), Cavanagh (1985), Chan and Wei (1987), Chan (1988),
Phillips (1987b) and Stock (1991).
was less than -2.) In this case, the probability of type 1 error is

P(Type 1 error) = P(W > c_U | y ∈ R_U)P(y ∈ R_U) + P(W > c_N | y ∉ R_U)P(y ∉ R_U).

The procedure will work well, in the sense of having the correct size and a power
close to the power that would obtain when the correct unit root or no unit root
specification were known, if two conditions are met. First, P(y ∈ R_U) should be
near 1 when the unit root specification is true, and P(y ∉ R_U) should be near 1
when the unit root specification is false. Second, P(W > c_U | y ∈ R_U) and
P(W > c_N | y ∉ R_U) should be near P(W > c_U|U) and P(W > c_N|N), respectively.
Unfortunately, in practice neither of these conditions may be true. The first requires
statistics that perfectly discriminate between the unit root and non-unit root
hypotheses. While significant progress has been made in developing powerful
inference procedures [e.g. Dickey and Fuller (1979), Elliot et al. (1992), Phillips
and Ploberger (1991), Stock (1992)], a high probability of classification errors is
unavoidable in moderate sample sizes.
In addition, the second condition may not be satisfied. An example presented
in Elliot and Stock (1992) makes this point quite forcefully. [Also see Cavanagh
and Stock (1985).] They consider the problem of testing whether the price-dividend
ratio helps to predict future changes in stock prices.¹⁴ A stylized version of the
model is

p_t - d_t = φ(p_{t-1} - d_{t-1}) + ε_{1,t},   (2.24)

Δp_t = β(p_{t-1} - d_{t-1}) + ε_{2,t},   (2.25)

where p_t and d_t are the logs of prices and dividends, respectively, and (ε_{1,t} ε_{2,t})' is
an mds(Σ_ε). The hypothesis of interest is H_0: β = 0. Under the null, and when |φ| < 1,
the t-statistic for this null will have an asymptotic standard normal distribution;
when φ = 1, the t-statistic will have a unit root distribution. (The
particular form of the distribution could be deduced using Lemma 2.3, and critical
values could be constructed using numerical methods.) The pretest procedure
involves carrying out a test of φ = 1 in (2.24), and using the unit root critical value
for the t-statistic for β = 0 in (2.25) when φ = 1 is not rejected. If φ = 1 is rejected,
the critical value from the standard normal distribution is used.
Elliot and Stock show that the properties of this procedure depend critically
on the correlation between ε_{1,t} and ε_{2,t}. To see why, consider an extreme example.
In the data, dividends are much smoother than prices, so that most of the variance
in the price-dividend ratio comes from movements in prices and not from dividends.
Thus, ε_{1,t} and ε_{2,t} are likely to be highly correlated. In the extreme case, when

¹⁴Hodrick (1992) contains an overview of the empirical literature on the predictability of stock
prices using variables like the price-dividend ratio. Also see Fama and French (1988) and Campbell
(1990).
they are perfectly correlated, (β̂ - β) is proportional to (φ̂ - φ), and the t-statistic
for testing β = 0 is exactly equal to the t-statistic for testing φ = 1. In this case
F(W|y) is degenerate and does not depend on the null hypothesis. All of the
information in the data about the hypothesis β = 0 is contained in the pretest.
While this example is extreme, it does point out the potential danger of relying
on unit root pretests to choose critical values for subsequent tests.

3. Cointegrated systems

3.1. Introductory comments

An important special case of the model analyzed in Section 2 is the cointegrated
VAR. This model provides a framework for studying the long-run economic
relations discussed in the introduction. There are three important econometric
questions that arise in the analysis of cointegrated systems. First, how can the
common stochastic trends present in cointegrated systems be extracted from the
data? Second, how can the hypothesis of cointegration be tested? And finally, how
should unknown parameters in cointegrating vectors be estimated, and how should
inference about their values be conducted? These questions are answered in this
section.
We begin, in Section 3.2, by studying different representations for cointegrated
systems. In addition to highlighting important characteristics of cointegrated
systems, this section provides an answer to the first question by presenting a
general trend extraction procedure for cointegrated systems. Section 3.3 discusses
the problem of testing for the order of cointegration, and Section 3.4 discusses the
problem of estimation and inference for unknown parameters in cointegrating
vectors. To keep the notation simple, the analysis in Sections 3.2-3.4 abstracts
from deterministic components (constants and trends) in the data. The complications
in estimation and testing that arise when the model contains constants and trends
are the subject of Section 3.5. Only I(1) systems are considered here. Using Engle
and Granger's (1987) terminology, the section discusses only CI(1,1) systems; that
is, systems in which linear combinations of I(1) and I(0) variables are I(0). Extensions
for CI(d, b) systems with d and b different from 1 are presented in Johansen (1988b,
1992c), Granger and Lee (1990) and Stock and Watson (1993).

3.2. Representations for the I(1) cointegrated model

Consider the VAR

x_t = Σ_{i=1}^p Π_i x_{t-i} + ε_t,   (3.1)
where x_t is an n x 1 vector composed of I(0) and I(1) variables, and ε_t is an mds(Σ_ε).
Since each of the variables in the system is I(0) or I(1), the determinantal poly-
nomial |Π(z)| contains at most n unit roots, where Π(z) = I - Σ_{i=1}^p Π_i z^i. When
there are fewer than n unit roots, the variables are cointegrated, in the sense
that certain linear combinations of the x_t's are I(0). In this subsection we derive
four useful representations for cointegrated VARs: (1) the vector error correction
VAR model, (2) the moving average representation of the first differences of the
data, (3) the common trends representation of the levels of the data, and (4) the
triangular representation of the cointegrated model.
All of these representations are readily derived using a particular Smith-McMillan
factorization of the autoregressive polynomial Π(L). The specific factorization used
here was originally developed by Yoo (1987) and was subsequently used to derive
alternative representations of cointegrated systems by Engle and Yoo (1991). Some
of the discussion presented here parallels the discussion in this latter reference.
Yoo's factorization of Π(z) isolates the unit roots in the system in a particularly
convenient fashion. Suppose that the polynomial Π(z) has all of its roots on or
outside the unit circle; then the polynomial can be factored as Π(z) = U(z)M(z)V(z),
where U(z) and V(z) are n x n matrix polynomials with all of their roots outside
the unit circle, and M(z) is an n x n diagonal matrix polynomial with roots on or
outside the unit circle. In the case of the I(1) cointegrated VAR, M(L) can be
written as

M(L) = [ Δ_k  0  ]
       [ 0    I_r ],

where Δ_k = (1 - L)I_k, and k + r = n. This factorization is useful because it isolates
all of the VAR's nonstationarities in the upper block of M(L).
We now derive alternative representations for the cointegrated system.

3.2.1. The vector error correction VAR model (VECM)

To derive the VECM, subtract x_{t-1} from both sides of (3.1) and rearrange the
equation as

Δx_t = Πx_{t-1} + Σ_{i=1}^{p-1} Φ_i Δx_{t-i} + ε_t,   (3.2)

where Π = -I_n + Σ_{i=1}^p Π_i = -Π(1), and Φ_i = -Σ_{j=i+1}^p Π_j, i = 1, ..., p - 1. Since
Π(1) = U(1)M(1)V(1), and M(1) has rank r, Π = -Π(1) also has rank r. Let α
denote an n x r matrix whose columns form a basis for the row space of Π, so
that every row of Π can be written as a linear combination of the rows of α'.
Thus, we can write Π = δα', where δ is an n x r matrix with full column rank.
Equation (3.2) then becomes

Δx_t = δα'x_{t-1} + Σ_{i=1}^{p-1} Φ_i Δx_{t-i} + ε_t   (3.3)

or

Δx_t = δw_{t-1} + Σ_{i=1}^{p-1} Φ_i Δx_{t-i} + ε_t,   (3.4)

where w_t = α'x_t. Solving (3.4) for w_{t-1} shows that w_{t-1} = (δ'δ)⁻¹δ'[Δx_t -
Σ_{i=1}^{p-1} Φ_i Δx_{t-i} - ε_t], so that w_t is I(0). Thus, the linear combinations of the poten-
tially I(1) elements of x_t formed by the columns of α are I(0), and the columns of
α are cointegrating vectors.
The VECM imposes k < n unit roots in the VAR by including first differences
of all of the variables and r = n - k linear combinations of levels of the variables.
The levels of x_t are introduced in a special way, as w_t = α'x_t, so that all of the
variables in the regression are I(0). Equations of this form appeared in Sargan
(1964), and the term "error correction model" was introduced in Davidson et al.
(1978).¹⁵ As explained there and in Hendry and von Ungern-Sternberg (1981),
α'x_t = 0 can be interpreted as the equilibrium of the dynamical system, w_t as the
vector of "equilibrium errors", and equation (3.4) describes the self correcting
mechanism of the system.
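As an illustration of how (3.4) maps into a regression, a minimal Python sketch follows (the function name is hypothetical; α is assumed known, as when the error correction terms are dictated by economic theory):

```python
import numpy as np

def vecm_ols(x, alpha, p=2):
    """OLS estimates of the VECM (3.4): dx_t = delta w_{t-1} + sum_i Phi_i dx_{t-i} + e_t,
    where w_t = alpha' x_t and the n x r cointegrating matrix alpha is known.
    x is a T x n array; returns the n x (r + n(p-1)) matrix [delta, Phi_1, ...]."""
    dx = np.diff(x, axis=0)
    w = x @ alpha                                         # T x r equilibrium errors w_t
    Z = np.column_stack(
        [w[p - 1: len(dx)]] +                             # w_{t-1}
        [dx[p - 1 - i: len(dx) - i] for i in range(1, p)] # dx_{t-1}, ..., dx_{t-p+1}
    )
    Y = dx[p - 1:]
    B = np.linalg.lstsq(Z, Y, rcond=None)[0]              # stacked [delta'; Phi_1'; ...]
    return B.T
```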

3.2.2. The moving average representation

To derive the moving average representation for Δx_t, let

M̄(L) = [ I_k  0  ]
       [ 0    Δ_r ],

where Δ_r = (1 - L)I_r, so that M̄(L)M(L) = (1 - L)I_n. Then

M̄(L)M(L)V(L)x_t = M̄(L)U(L)⁻¹ε_t,

so that

V(L)Δx_t = M̄(L)U(L)⁻¹ε_t,

¹⁵As Phillips and Loretan (1991) point out in their survey, continuous time formulations of error
correction models were used extensively by A.W. Phillips in the 1950s. I thank Peter Phillips for
drawing this work to my attention.
and

Δx_t = C(L)ε_t,   (3.5)

where C(L) = V(L)⁻¹M̄(L)U(L)⁻¹.


There are two special characteristics of the moving average representation. First,
C(1) = V(1)⁻¹M̄(1)U(1)⁻¹ has rank k and is singular when k < n. This implies that
the spectral density matrix of Δx_t evaluated at frequency zero, (2π)⁻¹C(1)Σ_εC(1)', is
singular in a cointegrated system. Second, there is a close relationship between
C(1) and the matrix of cointegrating vectors α. In particular, α'C(1) = 0.¹⁶ Since
w_t = α'x_t is I(0), Δw_t = α'Δx_t is I(-1), so that its spectrum at frequency zero,
(2π)⁻¹α'C(1)Σ_εC(1)'α, vanishes.
The equivalence of vector error correction models and cointegrated variables
with moving average representations of the form (3.5) is provided in Granger (1983)
and forms the basis of the "Granger Representation Theorem" [see Engle and
Granger (1987)].

3.2.3. The common trends representation

The common trends representation follows directly from (3.5). Adding and sub-
tracting C(1)ε_t from the right hand side of (3.5) yields

Δx_t = C(1)ε_t + [C(L) - C(1)]ε_t.   (3.6)

Solving backwards for the level of x_t,

x_t = C(1)ξ_t + C*(L)ε_t + x_0,   (3.7)

where ξ_t = Σ_{s=1}^t ε_s and C*(L) = (1 - L)⁻¹[C(L) - C(1)] = Σ_{i=0}^∞ C*_i L^i, where C*_i =
-Σ_{j=i+1}^∞ C_j and ε_i = 0 for i ≤ 0 is assumed. Equation (3.7) is the multivariate
Beveridge-Nelson (1981) decomposition of x_t; it decomposes x_t into its permanent
component, C(1)ξ_t + x_0, and its transitory component, C*(L)ε_t.¹⁷ Since C(1)
has rank k, we can find a nonsingular matrix G, such that C(1)G = [A 0_{n x r}],
where A is an n x k matrix with full column rank.¹⁸ Thus C(1)ξ_t = C(1)GG⁻¹ξ_t,

¹⁶To derive this result, note from (3.2) and (3.3) that Π = -Π(1) = -U(1)M(1)V(1) = δα'. Since
M(1) has zeroes everywhere, except the lower diagonal block which is I_r, α must be a nonsingular
transformation of the last r rows of V(1). This implies that the first k columns of α'V(1)⁻¹ contain only
zeroes, so that α'V(1)⁻¹M̄(1)U(1)⁻¹ = α'C(1) = 0.
¹⁷The last component can be viewed as transitory because it has a finite spectrum at frequency zero.
Since U(z) and V(z) are finite order with roots outside the unit circle, the C_i coefficients decline
exponentially for large i, and thus Σ_i i|C_i| is finite. Thus the C*_i matrices are absolutely summable,
and C*(1)Σ_εC*(1)' is finite.
¹⁸The matrix G is not unique. One way to construct G is from the eigenvectors of C(1). The first k
columns of G are the eigenvectors corresponding to the nonzero eigenvalues of C(1) and the remaining
eigenvectors are the last n - k columns of G.
so that

x_t = Aτ_t + C*(L)ε_t + x_0,   (3.8)

where τ_t denotes the first k components of G⁻¹ξ_t.
Equation (3.8) is the common trends representation of the cointegrated system. It
decomposes the n x 1 vector x_t into k permanent components τ_t and n transitory
components C*(L)ε_t. These permanent components often have natural interpre-
tations. For example, in the eight variable (y, c, i, n, w, m, p, r) system introduced in
Section 1, five cointegrating vectors were suggested. In an eight variable system
with five cointegrating vectors there are three common trends. In the (y, c, i, n, w, m, p, r)
system these trends can be interpreted as population growth, technological
progress and trend growth in money.
The common trends representation (3.8) is used in King et al. (1991) as a device
to extract the single common trend in a three variable system consisting of y, c
and i. The derivation of (3.8) shows exactly how to do this: (i) estimate the VECM
(3.3) imposing the cointegration restrictions; (ii) invert the VECM to find the
moving average representation (3.5); (iii) find the matrix G introduced below
equation (3.7); and, finally, (iv) construct τ_t recursively from τ_t = τ_{t-1} + e_t, where
e_t is the first element of G⁻¹ε̂_t (the first k elements when k > 1), and where ε̂_t denotes
the vector of residuals from the VECM. Other interesting applications of trend
extraction in cointegrated systems are contained in Cochrane and Sbordone (1988)
and Cochrane (1994).
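A stylized Python sketch of steps (iii) and (iv) follows (hypothetical names; the VECM residuals and an estimate of C(1) are taken as given, and the eigenvectors of C(1) are assumed to be real, as in the construction described in the footnote to equation (3.7)):

```python
import numpy as np

def common_trends(eps_hat, C1, k):
    """Steps (iii)-(iv) of the trend extraction: build G from the eigenvectors of
    C(1), cumulate the VECM residuals into xi_t, and return the first k elements
    of G^{-1} xi_t as the common trends tau_t.  eps_hat is T x n, C1 is n x n."""
    eigval, eigvec = np.linalg.eig(C1)
    order = np.argsort(-np.abs(eigval))        # nonzero eigenvalues (rank k) first
    G = np.real(eigvec[:, order])              # assumes eigenvectors can be taken real
    xi = np.cumsum(eps_hat, axis=0)            # xi_t: cumulated residuals
    return np.linalg.solve(G, xi.T).T[:, :k]   # first k elements of G^{-1} xi_t
```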

3.2.4. The triangular representation

The triangular representation also represents x_t in terms of a set of k non-cointegrated
I(1) variables. Rather than construct these stochastic trends as the latent variables
τ_t in the common trends representation, a subset of the x_t variables are used. In
particular, the triangular representation takes the form

Δx_{1,t} = u_{1,t},   (3.9)

x_{2,t} - βx_{1,t} = u_{2,t},   (3.10)

where x_t = (x'_{1,t} x'_{2,t})', x_{1,t} is k x 1 and x_{2,t} is r x 1. The transitory components are
u_t = (u'_{1,t} u'_{2,t})' = D(L)ε_t, where (as we show below) D(1) has full rank. In this re-
presentation, the first k elements of x_t are the common trends, and x_{2,t} - βx_{1,t} are
the I(0) linear combinations of the data.
To derive this representation from the VAR (3.2), use Π(L) = U(L)M(L)V(L) to
write

U(L)M(L)V(L)x_t = ε_t,   (3.11)



so that

M(L)V(L)x_t = U(L)⁻¹ε_t.   (3.12)

Now, partition V(L) as

V(L) = [ v_{11}(L)  v_{12}(L) ]
       [ v_{21}(L)  v_{22}(L) ],

where v_{11}(L) is k x k, v_{12}(L) is k x r, v_{21}(L) is r x k and v_{22}(L) is r x r. Assume
that the data have been ordered so that v_{22}(L) has all of its roots outside the unit
circle. (Since V(L) has all of its roots outside the unit circle, this assumption is
made with no loss of generality.) Now, let

R(L) = [ I_k    0  ]
       [ β(L)  I_r ],

where β(L) = -v_{22}(L)⁻¹v_{21}(L). Then

M(L)V(L)R(L)R(L)⁻¹x_t = U(L)⁻¹ε_t   (3.13)

or, rearranging and simplifying,

(3.14)

where β*(L) = (1 - L)⁻¹[β(L) - β(1)] and β = β(1). Letting G(L) denote the matrix
polynomial on the left hand side of (3.14), the triangular representation is obtained
by multiplying equation (3.14) by G(L)⁻¹. Thus, in equations (3.9) and (3.10), u_t =
D(L)ε_t, with D(L) = G(L)⁻¹U(L)⁻¹.
When derived from the VAR (3.2), D(L) is seen to have a special structure that
was inherited from the assumption that the data were generated by a finite order
VAR. But of course, there is nothing inherently special or natural about the finite
order VAR; it is just one flexible parameterization for the x_t process. When the
triangular representation is used, an alternative approach is to parameterize the
matrix polynomial D(L) directly.
An early empirical study using this formulation is contained in Campbell and
Shiller (1987). They estimate a bivariate model of the term structure that includes
long term and short term interest rates. Both interest rates are assumed to be I(1),
but the "spread" or difference between the variables is assumed to be I(0). Thus,
in terms of (3.9)-(3.10), x_{1,t} is the short term interest rate, x_{2,t} is the long rate and
β = 1. In their empirical work, Campbell and Shiller modeled the process u_t in
(3.10) as a finite order VAR.

to the constant and the deterministic time trend terms. Hypothesis testing
when deterministic components are present is discussed in Section 3.5.
There are many tests for cointegration: some are based on likelihood methods,
using a Gaussian likelihood and the VECM representation for the model, while
others are based on more ad hoc methods. Section 3.3.1 presents likelihood based
(Wald and likelihood ratio) tests for cointegration constructed from the VECM.
The non-likelihood-based methods of Engle and Granger (1987) and Stock and
Watson (1988) are the subject of Section 3.3.2, and the various tests are compared
in Section 3.3.3.

3.3.1. Likelihood based tests for cointegration

In Section 3.2.1 the general VECM was written as²⁰

Δx_t = δα'x_{t-1} + Σ_{i=1}^{p-1} Φ_i Δx_{t-i} + ε_t.   (3.3)

To develop the restrictions on the parameters in (3.3) implicit in the null hypothesis,
first partition the matrix of cointegrating vectors as α = [α_0 α_a], where α_0 is an n x r_0
matrix whose columns are the cointegrating vectors present under the null and α_a
is the n x r_a matrix of additional cointegrating vectors present under the alternative.
Partition δ conformably as δ = [δ_0 δ_a], let Γ = (Φ_1 Φ_2 ... Φ_{p-1}) and let z_t =
(Δx'_{t-1} Δx'_{t-2} ... Δx'_{t-p+1})'. The VECM can then be written as

Δx_t = δ_0 α'_0 x_{t-1} + δ_a α'_a x_{t-1} + Γz_t + ε_t,   (3.15)

where, under the null hypothesis, the term δ_a α'_a x_{t-1} is absent. This suggests writing
the null and alternative hypotheses as H_0: δ_a = 0 vs. H_a: δ_a ≠ 0.²¹ Written in this
way, the null is seen as a linear restriction on the regression coefficients in (3.15).
An important complication is that the regressor α'_a x_{t-1} depends on parameters in
α_a that are potentially unknown. Moreover, when δ_a = 0, α'_a x_{t-1} does not enter
the regression, and so the data provide no information about any unknown param-
eters in α_a. This means that these parameters are econometrically identified only
under the alternative hypothesis, and this complicates the testing problem in ways
discussed by Davies (1977, 1987), and (in the cointegration context) by Engle and
Granger (1987).
In many applications, this may not be a problem of practical consequence, since
the coefficients in α are determined by the economic theory under consideration.
For example, in the (y, c, i, w, n, r, m, p) system, candidate error correction terms

²⁰Much of the discussion in this section is based on material in Horvath and Watson (1993).
²¹Formally, the restriction rank(δ_a α'_a) = r_a should be added as a qualifier to H_a. Since this constraint
is satisfied almost surely by unconstrained estimators of (3.15), it can safely be ignored when constructing
likelihood ratio test statistics.

with no unknown parameters are y - c, y - i, (w - p) - (y - n) and r. Only one
error correction term, m - p - β_y y - β_r r, contains potentially unknown param-
eters. Yet, when testing for cointegration, a researcher may not want to impose
specific values of potential cointegrating vectors, particularly during the preliminary
data analytic stages of the empirical investigation. For example, in their investigation
of long-run purchasing power parity, Johansen and Juselius (1992) suggest a two-
step testing procedure. In the first step cointegration is tested without imposing
any information about the cointegrating vector. If the null hypothesis of no cointe-
gration is rejected, a second stage test is conducted to see if the cointegrating
vector takes on the value predicted by economic theory. The advantage of this
two-step approach is that it can uncover cointegrating relations not predicted by
the specific economic theory under study. The disadvantage is that the first stage
test for cointegration will have low power relative to a test that imposes the correct
cointegrating vector.
It is useful to have testing procedures that can be used when cointegrating
vectors are known and when they are unknown. With these two possibilities in
mind, we write r = r_k + r_u, where r_k denotes the number of cointegrating vectors
with known coefficients, and r_u denotes the number of cointegrating vectors with
unknown coefficients. Similarly, write r_0 = r_{0k} + r_{0u} and r_a = r_{ak} + r_{au}, where the
subscripts k and u denote known and unknown respectively. Of course, the
r_{ak} subset of known cointegrating vectors are present only under the alternative,
and α'_{ak}x_t is I(1) under the null.
Likelihood ratio tests for cointegration with unknown cointegrating vectors (i.e.
H_0: r = r_{0u} vs. H_a: r = r_{0u} + r_{au}) are developed in Johansen (1988a), and these tests
are modified to incorporate known cointegrating vectors (nonzero values of r_{0k}
and r_{ak}) in Horvath and Watson (1993). The test statistics and their asymptotic
null distributions are developed below.
For expositional purposes it is convenient to consider three special cases. In the
first, r_a = r_{ak}, so that all of the additional cointegrating vectors present under the
alternative are assumed to be known. In the second, r_a = r_{au}, so that they are all
unknown. The third case allows nonzero values of both r_{ak} and r_{au}. To keep the
notation simple, the tests are derived for the r_0 = 0 null. In one sense, this is without
loss of generality, since the LR statistic for H_0: r = r_0 vs. H_a: r = r_0 + r_a can always be
calculated as the difference between the LR statistics for [H_0: r = 0 vs. H_a: r = r_0 + r_a]
and [H_0: r = 0 vs. H_a: r = r_0]. However, the asymptotic null distribution of the test
statistic does depend on r_{0k} and r_{0u}, and this will be discussed at the end of this
section.

Testing H_0: r = 0 vs. H_a: r = r_{ak}.  When r_0 = 0, equation (3.15) simplifies to

Δx_t = δ_a(α'_a x_{t-1}) + Γz_t + ε_t.   (3.16)

Since α'_a x_{t-1} is known, (3.16) is a multivariate linear regression, so that the LR, Wald

and LM statistics have their standard regression form. Letting X = [x_1 x_2 ... x_T]',
X_{-1} = [x_0 x_1 ... x_{T-1}]', ΔX = X - X_{-1}, Z = [z_1 z_2 ... z_T]', ε = [ε_1 ε_2 ... ε_T]' and M_Z =
I - Z(Z'Z)⁻¹Z', the OLS estimator of δ_a is δ̂_a = (ΔX'M_Z X_{-1}α_a)(α'_a X'_{-1}M_Z X_{-1}α_a)⁻¹,
which is the Gaussian MLE. The corresponding Wald test statistic for H_0 vs. H_a is

W = [vec(δ̂_a)]'[(α'_a X'_{-1}M_Z X_{-1}α_a)⁻¹ ⊗ Σ̂_ε]⁻¹[vec(δ̂_a)]

  = [vec(ΔX'M_Z X_{-1}α_a)]'[(α'_a X'_{-1}M_Z X_{-1}α_a)⁻¹ ⊗ Σ̂_ε⁻¹]

    x [vec(ΔX'M_Z X_{-1}α_a)],   (3.17)

where Σ̂_ε is the usual estimator of Σ_ε (Σ̂_ε = T⁻¹ε̂'ε̂, where ε̂ is the matrix of
OLS residuals from (3.16)), vec is the operator that stacks the columns of a
matrix, and the second line uses the result that vec(ABC) = (C' ⊗ A)vec(B) for
conformable matrices A, B and C. The corresponding LR and LM statistics are
asymptotically equivalent to W under the null and local alternatives.
The asymptotic null distribution of W is derived in Horvath and Watson (1993),
where it is shown that

W ⇒ Trace{[∫B_1(s)dB(s)']'[∫B_1(s)B_1(s)'ds]⁻¹[∫B_1(s)dB(s)']},   (3.18)

where B(s) is an n x 1 Wiener process partitioned into r_a and n - r_a components
B_1(s) and B_2(s), respectively. A proof of this result will not be offered here, but the
form of the limiting distribution can be understood by considering a special case
with Γ = 0 (so that there are no lags of Δx_t in the regression), Σ_ε = I_n and α_a = [I_{r_a} 0]'.
In this case, x_t is a random walk with n.i.i.d.(0, I_n) innovations, and (3.16) is the
regression of Δx_t onto the first r_a elements of x_{t-1}, say x_{1,t-1}. Using the true value
of α_a, the Wald statistic in (3.17) simplifies to

W = [vec(ΣΔx_t x'_{1,t-1})]'[(Σx_{1,t-1}x'_{1,t-1})⁻¹ ⊗ I_n][vec(ΣΔx_t x'_{1,t-1})]
  = Trace[(ΣΔx_t x'_{1,t-1})(Σx_{1,t-1}x'_{1,t-1})⁻¹(Σx_{1,t-1}Δx'_t)]
  = Trace[(T⁻¹ΣΔx_t x'_{1,t-1})(T⁻²Σx_{1,t-1}x'_{1,t-1})⁻¹(T⁻¹Σx_{1,t-1}Δx'_t)]

  ⇒ Trace{[∫B_1(s)dB(s)']'[∫B_1(s)B_1(s)'ds]⁻¹[∫B_1(s)dB(s)']},

where the second line uses the result that for square matrices, Trace(AB) =
Trace(BA), and for conformable matrices, Trace(ABCD) = [vec(D')]'(C' ⊗ A)vec(B)
[Magnus and Neudecker (1988, page 30)], and the last line follows from Lemma
2.3. This verifies (3.18) for the example.
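The special case is straightforward to simulate; a minimal Python sketch (assuming the design of the example: an n-variate Gaussian random walk, Γ = 0, Σ_ε = I_n and α_a = [I_{r_a} 0]') computes W in the form above and reads off a simulated critical value for (3.18):

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, ra, n_rep = 500, 3, 1, 2000     # n-variate random walk, ra candidate vectors

w_stats = np.empty(n_rep)
for j in range(n_rep):
    e = rng.standard_normal((T, n))
    x = np.cumsum(e, axis=0)
    x1 = x[:-1, :ra]                  # x_{1,t-1}: first ra elements of x_{t-1}
    dx = e[1:]                        # under the null, dx_t = eps_t
    A = dx.T @ x1                     # sum_t dx_t x_{1,t-1}'
    Q = x1.T @ x1                     # sum_t x_{1,t-1} x_{1,t-1}'
    w_stats[j] = np.trace(A @ np.linalg.solve(Q, A.T))

print(np.percentile(w_stats, 95))     # simulated 5 percent critical value of (3.18)
```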
Testing H_0: r = 0 vs. H_a: r = r_{au}.  When α_a is unknown, the Wald test in (3.17)
cannot be calculated because the regressor α'_a x_{t-1} depends on unknown param-
eters. However, the LR statistic can be calculated, and useful formulae for the
LR statistic are developed in Anderson (1951) (for the reduced rank regression
model) and Johansen (1988a) (for the VECM). In the context of the VECM (3.3),
Johansen (1988a) shows that the LR statistic can be written as

LR = -T Σ_{i=1}^{r_{au}} ln(1 - γ̂_i),   (3.19)

where the γ̂_i are the ordered squared canonical correlations between Δx_t and x_{t-1},
after controlling for Δx_{t-1}, ..., Δx_{t-p+1}. These canonical correlations can be calcu-
lated as the eigenvalues of T⁻¹S, where S = Σ̂_ε^{-1/2}(ΔX'M_Z X_{-1})(X'_{-1}M_Z X_{-1})⁻¹ x
(X'_{-1}M_Z ΔX)(Σ̂_ε^{-1/2})', and where Σ̂_ε = T⁻¹(ΔX'M_Z ΔX) is the estimated covariance
matrix of ε_t, computed under the null [see Anderson (1984, Chapter 12) or Brillinger
(1980, Chapter 10)]. Letting λ_i(S) denote the eigenvalues of S ordered as
λ_1(S) ≥ λ_2(S) ≥ ... ≥ λ_n(S), then γ̂_i from (3.19) is γ̂_i = T⁻¹λ_i(S). Since the elements
of S are O_p(1) from Lemma 2.3, a Taylor series expansion of ln(1 - γ̂_i) shows that
the LR statistic can be written as

LR = Σ_{i=1}^{r_{au}} λ_i(S) + o_p(1).   (3.20)

Equation (3.20) shows why the LR statistic is sometimes called the "maximal eigen-
value" statistic when r_{au} = 1 and the "trace statistic" when r_{au} = n [Johansen and
Juselius (1990)].²²
One way to motivate the formula for the LR statistic given in (3.20) is by mani-
pulating the Wald statistic in (3.17).²³ To see the relationship between LR and W
in this case, let L(δ_a, α_a) denote the log likelihood written as a function of δ_a and
α_a, and let δ̂_a(α_a) denote the MLE of δ_a for fixed α_a. When Σ_ε is known, the
well known relation between the Wald and LR statistics in the linear regression
model [Engle (1984)] implies that the Wald statistic can be written as

W(α_a) = 2[L(δ̂_a(α_a), α_a) - L(0, α_a)]
       = 2[L(δ̂_a(α_a), α_a) - L(0, 0)],   (3.21)

where the last line follows since α_a does not enter the likelihood when δ_a = 0, and
where W(α_a) is written to show the dependence of W on α_a. From (3.21), with Σ_ε

²²In standard jargon, when r_0 ≠ 0, the trace statistic corresponds to the test for the alternative
r_{au} = n - r_{0u}.
²³See Hansen (1990b) for a general discussion of the relationship between Wald, LR and LM tests
in the presence of unidentified parameters.

known,

sup_{α_a} W(α_a) = sup_{α_a} 2[L(δ̂_a(α_a), α_a) - L(0, 0)]
              = 2[L(δ̂_a, α̂_a) - L(0, 0)]
              = LR,   (3.22)

where the sup is taken over all n x r_a matrices α_a. When Σ_ε is unknown, this
equivalence is asymptotic, i.e. sup_{α_a} W(α_a) = LR + o_p(1).
To calculate sup_{α_a} W(α_a), rewrite (3.17) as

W(α_a) = [vec(ΔX'M_Z X_{-1}α_a)]'[(α'_a X'_{-1}M_Z X_{-1}α_a)⁻¹ ⊗ Σ_ε⁻¹]

         x [vec(ΔX'M_Z X_{-1}α_a)]

       = Tr[Σ_ε^{-1/2}(ΔX'M_Z X_{-1}α_a)(α'_a X'_{-1}M_Z X_{-1}α_a)⁻¹

         x (α'_a X'_{-1}M_Z ΔX)(Σ_ε^{-1/2})']

       = Tr[Σ_ε^{-1/2}(ΔX'M_Z X_{-1})DD'(X'_{-1}M_Z ΔX)(Σ_ε^{-1/2})'], where

         D = α_a(α'_a X'_{-1}M_Z X_{-1}α_a)^{-1/2}

       = Tr[D'(X'_{-1}M_Z ΔX)Σ_ε⁻¹(ΔX'M_Z X_{-1})D]

       = Tr[F'C'CF],   (3.23)

where F = (X'_{-1}M_Z X_{-1})^{1/2}α_a(α'_a X'_{-1}M_Z X_{-1}α_a)^{-1/2} and C = (X'_{-1}M_Z X_{-1})^{-1/2} x
(X'_{-1}M_Z ΔX)Σ_ε^{-1/2}. Since F'F = I_{r_{au}},

sup_{α_a} W(α_a) = sup_{F'F=I} Tr[F'(C'C)F] = Σ_{i=1}^{r_{au}} λ_i(C'C) = Σ_{i=1}^{r_{au}} λ_i(CC')

              = LR + o_p(1),   (3.24)

where λ_i denote the ordered eigenvalues of (C'C), and the final two equalities
follow from the standard principal components argument [for example, see Theil
(1971, page 46)] and λ_i(C'C) = λ_i(CC'). Equation (3.24) shows that the likelihood
ratio statistic can then be calculated (up to an o_p(1) term) as the sum of the largest
r_{au} eigenvalues of

C'C = Σ_ε^{-1/2}(ΔX'M_Z X_{-1})(X'_{-1}M_Z X_{-1})⁻¹(X'_{-1}M_Z ΔX)(Σ_ε^{-1/2})'.

To see the relationship between the formulae for the LR statistics in (3.24) and
(3.20), notice that C'C in (3.24) and S in (3.20) differ only in the estimator of Σ_ε; C'C
uses an estimator constructed from residuals calculated under the alternative, while
S uses an estimator constructed from residuals calculated under the null.
In general settings, it is not possible to derive a simple representation for the
asymptotic distribution of the likelihood ratio statistic when some parameters
are present only under the alternative. However, the special structure of the VECM
makes such a simple representation possible. Johansen (1988a) shows that the LR
statistic has the limiting asymptotic null distribution given by

LR ⇒ Σ_{i=1}^{r_{au}} λ_i(H),   (3.25)

where H = [∫B(s)dB(s)']'[∫B(s)B(s)'ds]⁻¹[∫B(s)dB(s)'], and B(s) is an n x 1 Wiener
process. To understand Johansen's result, again consider the special case with
Γ = 0 and Σ_ε = I_n. In this case, C'C becomes

C'C = (ΔX'X_{-1})(X'_{-1}X_{-1})⁻¹(X'_{-1}ΔX)
    = [ΣΔx_t x'_{t-1}][Σx_{t-1}x'_{t-1}]⁻¹[Σx_{t-1}Δx'_t]
    = [T⁻¹ΣΔx_t x'_{t-1}][T⁻²Σx_{t-1}x'_{t-1}]⁻¹[T⁻¹Σx_{t-1}Δx'_t]

    ⇒ [∫B(s)dB(s)']'[∫B(s)B(s)'ds]⁻¹[∫B(s)dB(s)']   (3.26)

from Lemma 2.3. This verifies (3.25) for the example.
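For reference, a compact Python sketch follows (hypothetical function name) that computes the γ̂_i as the eigenvalues described below (3.19) and returns the trace statistic, that is, (3.19) with r_{au} = n:

```python
import numpy as np

def johansen_eigs(x, p=2):
    """Squared canonical correlations between dx_t and x_{t-1}, controlling for
    dx_{t-1}, ..., dx_{t-p+1}, and the implied trace statistic -T sum ln(1 - g_i).
    x is a T x n array of levels."""
    dx = np.diff(x, axis=0)
    Y = dx[p - 1:]                                   # dx_t
    X1 = x[p - 1:-1]                                 # x_{t-1}
    if p > 1:
        Z = np.hstack([dx[p - 1 - i: -i] for i in range(1, p)])
        M = np.eye(len(Y)) - Z @ np.linalg.pinv(Z)   # residual maker for the lags
        R0, R1 = M @ Y, M @ X1
    else:
        R0, R1 = Y, X1
    S00, S11, S01 = R0.T @ R0, R1.T @ R1, R0.T @ R1
    ev = np.linalg.eigvals(np.linalg.solve(S11, S01.T) @ np.linalg.solve(S00, S01))
    g = np.sort(np.real(ev))[::-1]                   # ordered gamma_i
    return g, -len(Y) * np.sum(np.log(1 - g))        # eigenvalues, trace statistic
```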

Testing H_0: r = 0 vs. H_a: r_a = r_{ak} + r_{au}.  The model is now

Δx_t = δ_{ak}(α'_{ak}x_{t-1}) + δ_{au}(α'_{au}x_{t-1}) + Γz_t + ε_t,   (3.27)

where α_a has been partitioned so that α_{ak} contains the r_{ak} known cointegrating
vectors, α_{au} contains the r_{au} unknown cointegrating vectors, and δ_a has been
partitioned conformably as δ_a = (δ_{ak} δ_{au}). As above, the LR statistic can be approxi-
mated up to an o_p(1) term by maximizing the Wald statistic over the unknown
parameters in α_{au}. Let M_{ZK} = M_Z - M_Z X_{-1}α_{ak}(α'_{ak}X'_{-1}M_Z X_{-1}α_{ak})⁻¹α'_{ak}X'_{-1}M_Z
denote the matrix that partials both Z and X_{-1}α_{ak} out of the regression (3.27).
The Wald statistic (as a function of α_{ak} and α_{au}) can then be written as²⁴

W(α_{ak}, α_{au}) = [vec(ΔX'M_Z X_{-1}α_{ak})]'[(α'_{ak}X'_{-1}M_Z X_{-1}α_{ak})⁻¹ ⊗ Σ_ε⁻¹]

                 x [vec(ΔX'M_Z X_{-1}α_{ak})]

               + [vec(ΔX'M_{ZK}X_{-1}α_{au})]'[(α'_{au}X'_{-1}M_{ZK}X_{-1}α_{au})⁻¹ ⊗ Σ_ε⁻¹]

                 x [vec(ΔX'M_{ZK}X_{-1}α_{au})].   (3.28)

²⁴The first term in (3.28) is the Wald statistic for testing δ_{ak} = 0 imposing the constraint that δ_{au} = 0.
The second term is the Wald statistic for testing δ_{au} = 0 with α'_{ak}X_{-1} and Z partialled out of the
regression. This form of the Wald statistic can be deduced from the partitioned inverse formula.
As pointed out above, when the null hypothesis is H_0: r = r_{0k} + r_{0u}, the LR test
statistic can be calculated as the difference between the LR statistics for [H_0: r = 0
vs. H_a: r = r_0 + r_a] and [H_0: r = 0 vs. H_a: r = r_0]. So, for example, when testing
H_0: r = r_{0u} vs. H_a: r = r_{0u} + r_{au}, the LR statistic is

LR = -T Σ_{i=r_{0u}+1}^{r_{0u}+r_{au}} ln(1 - γ̂_i) = Σ_{i=r_{0u}+1}^{r_{0u}+r_{au}} λ_i(S) + o_p(1),   (3.31)

where the γ̂_i are the canonical correlations defined below equation (3.19) [see Anderson
(1951) and Johansen (1988a)]. Critical values for the case r_{0k} = r_{ak} = 0 and n - r_{0u} ≤ 5
are given in Johansen (1988a) for the trace statistic (so that the alternative is r_{au} =
n - r_{0u}); these are extended for n - r_{0u} ≤ 11 in Osterwald-Lenum (1992), who also
tabulates asymptotic critical values for the maximal eigenvalue statistic (so that
r_{0k} = r_{ak} = 0 and r_{au} = 1). Finally, asymptotic critical values for all combinations
of r_{0k}, r_{0u}, r_{ak} and r_{au} with n - r_{0u} ≤ 9 are tabulated in Horvath and Watson (1992).

3.3.2. Non-likelihood-based approaches

In addition to the likelihood based tests discussed in the last section, standard
univariate unit root tests and their multivariate generalizations have also been
used as tests for cointegration. To see why these tests are useful, consider the
hypotheses H_0: r = 0 vs. H_a: r = 1, and suppose that α is known under the alter-
native. Since the data are not cointegrated under the null, w_t = α'x_t is I(1), while
under the alternative it is I(0). Thus, cointegration can be tested by applying a
standard unit root test to the univariate series w_t. To be useful in more general
cointegrated models, standard unit root tests have been modified in two ways.
First, modifications have been proposed so that the tests can be applied when α
is unknown. Second, multivariate unit root tests have been developed for the general
testing problem H_0: r = r_0 vs. H_a: r = r_0 + r_a. We discuss these two modifications
in turn.
Engle and Granger (1987) develop a test for the hypotheses H_0: r = 0 vs. H_a: r = 1
when α is unknown. They suggest using OLS to estimate the single cointegrating
vector and applying a standard unit root test (they suggest an augmented Dickey-
Fuller t-test) to the OLS residuals, ŵ_t = α̂'x_t. Under the alternative, α̂ is a consistent
estimator of α, so that ŵ_t will behave like w_t. However, under the null, α̂ is obtained
from a spurious regression (see Section 2.6.3), and the residuals from a spurious
regression (ŵ_t) behave differently than non-stochastic linear combinations of I(1)
variables (w_t). This affects the null distribution of unit root statistics calculated
using ŵ_t. For example, the Dickey-Fuller t-statistic constructed using ŵ_t has a
different null distribution than the statistic calculated using w_t, so that the usual
critical values given in Fuller (1976) cannot be used for the Engle-Granger test. The
correct asymptotic null distribution of the statistic is derived in Phillips and Ouliaris
(1990), and is tabulated in Engle and Yoo (1987) and MacKinnon (1991). Hansen
(1990a) proposes a modification of the Engle-Granger test that is based on an
iterated Cochrane-Orcutt estimator which eliminates the spurious regression
problem and results in test statistics with standard Dickey-Fuller asymptotic
distributions under the null.
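A minimal Python sketch of the Engle-Granger two-step procedure follows (hypothetical names; a constant and a fixed lag length are included purely for illustration). The resulting statistic must be compared with the Phillips-Ouliaris/MacKinnon critical values, not the standard Dickey-Fuller tables:

```python
import numpy as np

def engle_granger_adf(y, x, n_lags=4):
    """Step 1: OLS of y_t on x_t (and a constant) gives residuals w_hat.
    Step 2: ADF t-statistic from dw_t = rho w_{t-1} + sum_j c_j dw_{t-j} + e_t."""
    X = np.column_stack([x, np.ones(len(y))])
    w = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]     # cointegrating residuals
    dw = np.diff(w)
    Z = np.column_stack([w[n_lags:-1]] +
                        [dw[n_lags - j: -j] for j in range(1, n_lags + 1)])
    Y = dw[n_lags:]
    b = np.linalg.lstsq(Z, Y, rcond=None)[0]
    e = Y - Z @ b
    s2 = e @ e / (len(Y) - Z.shape[1])
    se = np.sqrt(s2 * np.linalg.inv(Z.T @ Z)[0, 0])
    return b[0] / se                                     # ADF t-statistic on rho
```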
Stock and Watson (1988), building on work by Fountis and Dickey (1986),
propose a multivariate unit root test. Their procedure is most easily described by
considering the VAR(1) model, x_t = Φx_{t-1} + ε_t, together with the hypotheses H_0:
r = 0 vs. H_a: r = r_a. Under the null the data are not cointegrated, so that Φ = I_n.
Under the alternative there are r_a covariance stationary linear combinations of
the data, so that Φ has r_a eigenvalues that are less than one in modulus. The Stock-
Watson test is based on the ordered eigenvalues of Φ̂, the OLS estimator of Φ.
Writing these eigenvalues as |λ̂_1| ≥ |λ̂_2| ≥ ..., the test is based on λ̂_{n-r_a+1}, the r_a-th
smallest eigenvalue. Under the null, λ_{n-r_a+1} = 1, while under the alternative,
|λ_{n-r_a+1}| < 1. The asymptotic null distributions of T(Φ̂ - I) and T(|λ̂| - 1) are
derived in Stock and Watson (1988), and critical values for T(|λ̂_{n-r_a+1}| - 1) are
tabulated. This paper also develops the required modifications for testing in a
general VAR(p) model with r_0 ≠ 0.
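In the VAR(1) case the statistic is simple to compute; a minimal Python sketch (hypothetical name) is:

```python
import numpy as np

def stock_watson_stat(x, ra):
    """Stock-Watson (1988) statistic, VAR(1) case: OLS of x_t on x_{t-1} gives
    Phi_hat; return T(|lambda|_{n-ra+1} - 1), the scaled (ra-th smallest)
    eigenvalue modulus, to be compared with the tabulated critical values."""
    X1, X0 = x[:-1], x[1:]
    Phi = np.linalg.lstsq(X1, X0, rcond=None)[0].T   # x_t' = x_{t-1}' Phi'
    mods = np.sort(np.abs(np.linalg.eigvals(Phi)))[::-1]
    return len(X0) * (mods[-ra] - 1)
```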

3.3.3. Comparison of the tests

The tests discussed above differ from one another in two important respects. First,
some of the tests are constructed using the true value of the cointegrating vectors
under the alternative, while others estimate the cointegrating vectors. Second, the
likelihood based tests focus their attention on δ in (3.3), while the non-likelihood-
based tests focus on the serial correlation properties of certain linear combinations
of the data. Of course, knowledge of the cointegrating vectors, if available, will
increase the power of the tests. The relative power of tests that focus on δ and
tests that focus on the serial correlation properties of w_t = α'x_t is less clear.
Some insight can be obtained by considering a special case of the VECM (3.3),

Δx_t = δ_a(α'_a x_{t-1}) + ε_t.   (3.32)

Suppose that α_a is known and that the competing hypotheses are H_0: r = 0 vs. H_a:
r = 1. Multiplying both sides of (3.32) by α'_a yields

Δw_t = θw_{t-1} + e_t,   (3.33)

where w_t = α'_a x_t, θ = α'_a δ_a, and e_t = α'_a ε_t. Unit root tests constructed from w_t test
the hypotheses H_0: θ = α'_a δ_a = 0 vs. H_a: θ = α'_a δ_a < 0, while the VECM-based LR
and Wald statistics test H_0: δ_a = 0 vs. H_a: δ_a ≠ 0. Thus, unit root tests constructed
from w_t focus on departures from the δ_a = 0 null in the direction of the cointegrating
vector α_a. In contrast, the VECM likelihood based tests are invariant to transfor-
mations of the form Pα'_a x_{t-1} when α_a is known and Px_{t-1} when α_a is unknown,
where P is an arbitrary nonsingular matrix. Thus, the likelihood based tests aren't
focused in a single direction like the univariate unit root test. This suggests that
tests based on w_t should perform relatively well for departures in the direction of
α_a, but relatively poorly in other directions. As an extreme case, when α'_a δ_a = 0, the
elements of x_t are I(2) and w_t is I(1). [The system is CI(2,1) in Engle and Granger's
(1987) notation.] The elements are still cointegrated, at least in the sense that a
particular linear combination of the variables is less persistent than the individual
elements of x_t, and this form of cointegration can be detected by a nonzero value
of δ_a in equation (3.32) even though θ = 0 in (3.33).²⁶
A systematic comparison of the power properties of the various tests will not
be carried out here, but one simple Monte Carlo experiment, taken from a set of
experiments in Horvath and Watson (1993), highlights the power tradeoffs.
Consider a bivariate model of the form given in (3.32) with ε_t ~ n.i.i.d.(0, I_2), α_a =
(1 -1)' and δ_a = (δ_{a1} δ_{a2})'. This design implies that θ = δ_{a1} - δ_{a2} in (3.33), so that
the unit root tests should perform reasonably well when |δ_{a1} - δ_{a2}| is large and
reasonably poorly when |δ_{a1} - δ_{a2}| is small. Changes in δ_a have two effects on the
power of the VECM likelihood based tests. In the classical multivariate regression
model, the power of the likelihood based tests increases with ζ = δ_{a1}² + δ_{a2}². However,
in the VECM, changes in δ_{a1} and δ_{a2} also affect the serial correlation properties
of the regressor, w_{t-1} = α'_a x_{t-1}, as well as ζ. Indeed, for this design, w_t ~ AR(1) with

Table 1
Comparing power of tests for cointegration.
Size for 5 percent asymptotic critical values and power for tests carried out
at 5 percent level.ᵃ'ᵇ

Test                  δ_a = (0, 0)   (0.05, 0.055)   (-0.05, 0.055)   (-0.105, 0)
DF (α known)               5.0            6.5             81.5            81.9
EG-DF (α unknown)          4.1            2.9             31.9            32.5
Wald (α known)             4.1           95.0             54.2            91.5
LR (α unknown)             4.4           86.1             20.8            60.7

ᵃDesign is Δx_t = δ_a(α'_a x_{t-1}) + ε_t with α_a = (1 -1)', where ε_t = (ε_{1,t} ε_{2,t})' ~
n.i.i.d.(0, I_2) and t = 1, ..., 100.
ᵇThese results are based on 10,000 replications. The first column shows
rejection frequencies using asymptotic critical values. The other columns show
rejection frequencies using 5 percent critical values calculated from the
experiment in column 1.

²⁶This example was pointed out to me by T. Rothenberg.
AR coefficient 1 + θ = 1 + δ_{a1} - δ_{a2} [see equation (3.33)]. Increases in this serial
correlation lead to increases in the variability of the regressor and increases in the
power of the test.
Table 1 shows size and power for four different values of δ_a when T = 100 in
this bivariate system. Four tests are considered: (1) the Dickey-Fuller (DF) t-test
using the true value of α; (2) the Engle-Granger test (EG-DF, the Dickey-Fuller
t-test using a value of α estimated by OLS); (3) the Wald statistic for H_0: δ_a = 0
using the true value of α; and (4) the LR statistic for H_0: δ_a = 0 for unknown α.
The table contains several noteworthy results. First, for this simple design, the
size of the tests is close to the size predicted by asymptotic theory. Second, as
expected, the DF and EG-DF tests perform quite poorly when |δ_{a1} - δ_{a2}| is small.
Third, increasing the serial correlation in w_t = α'_a x_t, while holding δ_{a1}² + δ_{a2}² constant,
increases the power of the likelihood based tests. [This can be seen by comparing
the δ_a = (0.05, 0.055) and δ_a = (-0.05, 0.055) columns.] Fourth, increasing δ_{a1}² + δ_{a2}²,
while holding the serial correlation in w_t constant, increases the power of the
likelihood based tests. [This can be seen by comparing the δ_a = (-0.05, 0.055)
and δ_a = (-0.105, 0.00) columns.] Fifth, when the DF and EG-DF tests are focused
in the correct direction, their power exceeds that of the likelihood based tests. [This
can be seen from the δ_a = (-0.05, 0.055) column.] Finally, there is a gain in power
from incorporating the true value of the cointegrating vector. (This can be seen
by comparing the DF test to the EG-DF test and the Wald test to the LR test.)
A more thorough comparison of the tests is contained in Horvath and Watson
(1993).
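The data generating process underlying Table 1 is easy to reproduce; a minimal Python sketch (hypothetical function name) of one draw from the design is given below. Feeding such samples to the statistics sketched earlier reproduces experiments of this type:

```python
import numpy as np

def simulate_design(delta_a, T=100, rng=None):
    """One sample from the bivariate Table 1 design: dx_t = delta_a (x_{1,t-1} -
    x_{2,t-1}) + eps_t with alpha_a = (1, -1)' and eps_t ~ n.i.i.d.(0, I_2)."""
    rng = rng or np.random.default_rng()
    delta_a = np.asarray(delta_a, dtype=float)
    x = np.zeros((T + 1, 2))
    for t in range(1, T + 1):
        w = x[t - 1, 0] - x[t - 1, 1]              # equilibrium error w_{t-1}
        x[t] = x[t - 1] + delta_a * w + rng.standard_normal(2)
    return x[1:]
```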

3.4. Estimating cointegrating vectors

3.4.1. Gaussian maximum likelihood estimation (MLE) based on the triangular representation

In Section 3.2.4 the triangular representation of the cointegrated system was written
as

Δx_{1,t} = u_{1,t},   (3.9)

x_{2,t} - βx_{1,t} = u_{2,t},   (3.10)

where u_t = D(L)ε_t. In this section we discuss the MLE of β under the
assumption that ε_t ~ n.i.i.d.(0, I). The n.i.i.d. assumption is used only to motivate the
Gaussian MLE; the asymptotic distributions of the estimators that are derived below
follow under the weaker distributional assumptions listed in Lemma 2.3. In Section
2.6.4 we considered the OLS estimator of β in a bivariate model, and paid particular
attention to the distribution of the estimator when D(L) = D with d_{12} = d_{21} = 0.
In this case, x_{1,t} is weakly exogenous for β and the MLE corresponds
to the OLS estimator. Recall (see Section 2.6.4) that when d_{12} = d_{21} = 0, the OLS
estimator of β has an asymptotic distribution that can be represented as a variance
mixture of normals, and that the t-statistic for β has an asymptotic null distribution
that is standard normal. This means that tests concerning the value of β and
confidence intervals for β can be constructed in the usual way; complications from
the unit roots in the system can be ignored. These results carry over immediately
to the vector case, where x_{1,t} is k x 1 and x_{2,t} is r x 1, when u_t is serially uncorrelated
and u_{1,t} is independent of u_{2,t}. Somewhat surprisingly, they also carry over to the
MLE of β in the general model with u_t = D(L)ε_t, so that the errors are both serially
and cross correlated.
Intuition for this result can be developed by considering the static model with
u_t = Dε_t, where D is not necessarily block diagonal. Since u_{1,t} and u_{2,t} are correlated,
the MLE of β corresponds to the seemingly unrelated regression (SUR) estimator
from (3.9)-(3.10). But, since there are no unknown regression coefficients in (3.9),
the SUR estimator can be calculated by OLS in the regression

x_{2,t} = βx_{1,t} + γu_{1,t} + e_{2,t},   (3.34)

where γ is the coefficient from the regression of u_{2,t} onto u_{1,t}, and e_{2,t} = u_{2,t} -
E[u_{2,t}|u_{1,t}] is the residual from this regression. By construction, e_{2,t} is independent
of {x_{1,t}}_{t=1}^T for all t. Moreover, since γ is a coefficient on a zero mean stationary
regressor and β is a coefficient on a martingale, the limiting scaled X'X matrix
from the regression is block diagonal (Section 2.5.1). Thus from Lemma 2.3,

T(β̂ - β) = (T⁻¹Σ e_{2,t}x'_{1,t})(T⁻²Σ x_{1,t}x'_{1,t})⁻¹ + o_p(1)

         ⇒ [Σ_{e2}^{1/2} ∫ dB_2(s)B_1(s)' Σ_{u1}^{1/2}][Σ_{u1}^{1/2} (∫ B_1(s)B_1(s)'ds) Σ_{u1}^{1/2}]⁻¹,   (3.35)

where Σ_{u1} = var(u_{1,t}), Σ_{e2} = var(e_{2,t}) and B(s) is an n x 1 Brownian motion process,
partitioned as B(s) = [B_1(s)' B_2(s)']', where B_1(s) is k x 1 and B_2(s) is r x 1. Except
for the change in scale factors and dimensions, equation (3.35) has the same form
as (2.21), the asymptotic distribution of β̂ in the case d_{12} = d_{21} = 0. Thus, the
asymptotic distribution of β̂ can be represented as a variance mixture of normals.
Moreover, the same conditioning argument used when d_{12} = d_{21} = 0 implies that
the asymptotic distributions of Wald test statistics concerning β have their usual
large sample χ² form. Thus, inference about β can be carried out using
standard procedures and standard distributions.
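A minimal Python sketch of this OLS calculation (hypothetical name; r = 1 for clarity) is:

```python
import numpy as np

def triangular_ols(x1, x2):
    """OLS in the augmented regression (3.34): regress x_{2,t} on x_{1,t} and
    u_{1,t} = dx_{1,t}; the coefficients on x_{1,t} estimate beta.  x1 is T x k,
    x2 is a length-T vector (r = 1)."""
    dx1 = np.diff(x1, axis=0)                     # u_{1,t} for t = 2, ..., T
    Z = np.column_stack([x1[1:], dx1])
    b = np.linalg.lstsq(Z, x2[1:], rcond=None)[0]
    return b[: x1.shape[1]]                       # beta_hat
```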
Now suppose that u_t = D(L)ε_t. The dynamic analogue of (3.34) is

x_{2,t} = βx_{1,t} + γ(L)Δx_{1,t} + e_{2,t},   (3.36)

where γ(L)Δx_{1,t} = E[u_{2,t}|{u_{1,s}}] = E[u_{2,t}|{Δx_{1,s}}] is the projection of u_{2,t} onto the
entire sequence of u_{1,s}, and e_{2,t} = u_{2,t} -
E[u_{2,t}|{u_{1,s}}]. Letting D_1(L) denote the first k rows of D(L) and D_2(L) denote

the remaining r rows, then from classical projection formulae [e.g. Whittle (1983,
Chapter 5)], y(L) = D,(L)D,(L)[D,(L)D,(L-)]-.27 Equation (3.36) differs from
(3.34) in two ways. First, there is potential serial correlation in the error term of
(3.36), and second, y(L) in (3.36) is a two-sided polynomial. These differences
complicate the estimation process.
To focus on the first complication, assume that y(L) = 0. In this case, (3.36)
is a regression model with a serially correlated error, so that (asymptotically) the
MLE of p is just the feasible GLS estimator in (3.36). But, as shown in Phillips
and Park (1988), the GLS correction has no effect on the asymptotic distribution
of the estimator: the OLS estimator and GLS estimators of p in (3.17) are asymp-
totically equivalent.28 Since the regression error e2,( and the regre:sors (x~,~}~~ 1
are independent, by analogy with the serially uncorrelated case, T(/I - fi) will have
an asymptotic distribution that can be represented as a variance mixture of normals.
Indeed, the distribution will be exactly of the form (3.35), where now Z,, and Ze2
represent long-run covariance matrices.29
Using conditioning arguments like those used in Section 2.6.4, it is straightforward
to show that the Wald test statistics constructed from the GLS estimators of /I
have large sample x2 distributions. However, since the errors in (3.36) are serially
correlated, the usual estimator of the covariance matrix for the OLS estimators of
p is inappropriate, and a serial correlation robust covariance matrix is required.30
Wald test statistics constructed from OLS estimators of fi together with serial
correlation robust estimators of covariance matrices will be asymptotically x2 and

27This is the formula for the projection onto the infinite sample, i.e. y(L)Ax,! = ECU:) {Ax, ),a= ,I.
In general, y(L) is two-sided and of infinite order, so that this is an approximation to E[u:J {Ax,}f, r].
The effect of this approximation error on estimators of B is discussed below.
sThis can be demonstrated as follows. When y(L) = 0, e,,, = D&)c,,, and x1,, = D,~(L)E~,~ where
E,.~ and cl,, are the first k and last r elements of E,, and D, t(L) and Dzz(L) are the appropriate dtagonal
blocks of D(L). Let C(L) = [D22(L)]-1 and assume that the matrix coefficients in C(L), D,,(L) and
D,,(L) are I-summable. Letting 6 = vet(@), the GLS estimator and OLS estimators satisfy

where q, = [xl,,@I,], and defining the operator L so that z,L = Lz, = s,_r,ri, = [x1,,@ C(L)]. Using
the Lemma 2.3

T&s - 4 = IT-2&,,~;,, OI,I-CT-C(x;,,OI,)D,,(L)&,.,l


= IT-z&.$,,, O~,I-CT-~(X;,~~~~I~)~~.,)I + o,(t),
T(6^,,, -6)=[T-Z~{C(~)~,,,}{x;,,C(L)}O~,I-CT~~(~~,,OC(~)I)~~.,1
= IT~~x,.,x;,, OC(t~C(t)l-CT-~(x~.~~C(~~~~~,)l+o,(t)~
Since C(l)- = Dz2(l), T(go,, - 6) = T(6^,,, - 6) + o,(l), so that T($o,,, - Jo,,) LO.
The long-run covariance matrix for an n x 1 covariance stationary vector y, with absolutely
summable autocovariances is x,E m Cov(y,, y,_& which is 2n times the spectral density matrix for y,
at frequency zero.
See Wooldridges chapter of the Handbook for a thorough discussion of robust covariance matrix
estimators.
will be asymptotically equivalent to the statistics calculated using the GLS
estimators of β [Phillips and Park (1988)]. In summary, the serial correlation in
(3.36) poses no major obstacles.
The two-sided polynomial γ(L) poses more of a problem, and three different
solutions have been developed. In the first approach, γ(L) is approximated by a finite
order (two-sided) polynomial.³¹ In this case, equation (3.36) can be estimated by
GLS, yielding what Stock and Watson (1993) call the "Dynamic GLS" estimator
of β. Alternatively, utilizing the Phillips and Park (1988) result, an asymptotically
equivalent "Dynamic OLS" estimator can be constructed by applying OLS to (3.36).
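As a concrete illustration, a minimal Dynamic OLS sketch follows (hypothetical name; k = r = 1, with a constant and symmetric leads and lags chosen purely for illustration; inference would additionally require a serial correlation robust covariance matrix, which is not computed here):

```python
import numpy as np

def dynamic_ols(x1, x2, n_leads=2, n_lags=2):
    """Dynamic OLS: regress x_{2,t} on a constant, x_{1,t}, and leads and lags of
    dx_{1,t}, approximating the two-sided gamma(L).  x1, x2 are length-T vectors."""
    dx1 = np.diff(x1)                           # dx1[s-1] = x1[s] - x1[s-1]
    T = len(x1)
    t0, t1 = n_lags + 1, T - 1 - n_leads        # usable range of t
    cols = [np.ones(t1 - t0 + 1), x1[t0: t1 + 1]]
    for j in range(-n_lags, n_leads + 1):
        cols.append(dx1[t0 + j - 1: t1 + j])    # dx_{1,t+j}
    Z = np.column_stack(cols)
    b = np.linalg.lstsq(Z, x2[t0: t1 + 1], rcond=None)[0]
    return b[1]                                 # beta_hat, the coefficient on x_{1,t}
```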
To motivate the second approach, assume for a moment that y(L) were known.
The OLS estimator of p would then be formed by regressing x~,~ - y(L)Ax,,, onto
x1,,. But T-C[y(L)Axi,,]xi,, = r-~[y(l)Ax,,,]x,,, + B + o,,(l), where B =
lim f+m E(y,x,,,), where Y, = [Y(L) - YU)IAX~,,. (This can be verified using (c) and
(d) of Lemma 2.3.) Thus, an asymptotically equivalent estimator can be constructed
by regressing x~,~ - y(l)Axi,, onto x1,( and correcting for the bias term B. Parks
(1992) Canonical Cointegrating Regression estimator and Phillips and Hansens
(1990) Fully Modified estimator use this approach, where in both cases y(l) and
B are replaced by consistent estimators.
The final approach is motivated by the observation that the low frequency
movements in the data asymptotically dominate the estimator of β. Phillips (1991b)
demonstrates that an efficient band spectrum regression, concentrating on frequency
zero, can be used to calculate an estimator asymptotically equivalent to the MLE
in (3.36).³²
All of these suggestions lead to asymptotically equivalent estimators. The
estimators have asymptotic representations of the form (3.35) (where the relevant Σ
matrices now represent long-run covariance matrices), and thus their distributions can be
represented as variance mixtures of normals. Wald test statistics computed using
the estimators (and serial correlation robust covariance matrices) have the usual large sample
χ² distributions under the null.
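For reference, the following is a minimal Python sketch of one serial correlation robust (long-run) covariance estimator of the kind used in these Wald statistics, a Bartlett-weighted sum of sample autocovariances in the sense of footnote 29; the function name and the choice of bandwidth are illustrative assumptions.

    import numpy as np

    def long_run_cov(u, bandwidth):
        # Bartlett-weighted estimate of sum_k Cov(u_t, u_{t-k}) for a T x n array u.
        u = u - u.mean(axis=0)
        T = u.shape[0]
        omega = u.T @ u / T                       # k = 0 term
        for k in range(1, bandwidth + 1):
            gamma_k = u[k:].T @ u[:-k] / T        # k-th sample autocovariance
            omega += (1 - k / (bandwidth + 1)) * (gamma_k + gamma_k.T)
        return omega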

3.4.2. Gaussian maximum likelihood estimation based on the VECM

Most of the empirical work with cointegrated systems has utilized parameterizations
based on the finite order VECM representation shown in equation (3.3). Exact
MLEs calculated from the finite order VECM representation of the model are
different from the exact MLEs calculated from the triangular representations that
were developed in the last section. The reason is that the VECM imposes constraints
on the coefficients in γ(L) and the serial correlation properties of e_{2,t} in (3.36).
³¹This suggestion can be found in papers by Hansen (1988), Phillips (1991a), Phillips and Loretan
(1991), Saikkonen (1991) and Stock and Watson (1993). Saikkonen (1991) contains a careful discussion
of the approximation error that arises when γ(L) is approximated by a finite order polynomial. Using
results of Berk (1974) and Said and Dickey (1984) he shows that a consistent estimator of γ(1) (which,
as we show below, is required for an asymptotically efficient estimator of β) obtains if the order of the
estimated polynomial γ(L) increases at rate T^δ for 0 < δ < 1/3.
³²See Hannan (1970) and Engle (1976) for a general discussion of band spectrum regression.

These restrictions were not exploited in the estimators discussed in Section 3.4.1.
While these restrictions are asymptotically uninformative about β, they affect the
estimator in small samples.
Gaussian MLEs of β constructed from the finite order VECM (3.2) are analyzed
in Johansen (1988a, 1991) and Ahn and Reinsel (1990) using the reduced rank
regression framework originally studied by Anderson (1951). Both papers discuss
computational approaches for the MLEs and, more importantly, derive
the asymptotic distribution of the Gaussian MLE. There are two minor differences
between the Johansen (1988a, 1991) and Ahn and Reinsel (1990) approaches. First,
different normalizations are employed. Since Π = δα' = δFF^{-1}α' for any nonsingular
r × r matrix F, the parameters in δ and α are not econometrically identified without
further restrictions. Ahn and Reinsel (1990) use the same identifying restriction
imposed in the triangular model, i.e., α' = [-β I_r]; Johansen (1991) uses the
normalization α̂'Rα̂ = I_r, where R is the sample moment matrix of residuals from
the regression of x_{t-1} onto Δx_{t-i}, i = 1, ..., p - 1. Both sets of restrictions are
normalizations in the sense that they just identify the model, and lead to identical
values of the maximized likelihood. Partitioning Johansen's MLE as α̂' = (α̂_1' α̂_2'),
where α̂_1 is k × r and α̂_2 is r × r, implies that the MLE of β using Ahn and Reinsel's
normalization is β̂ = -(α̂_1α̂_2^{-1})'.
The approaches also differ in the computational algorithm used to maximize
the likelihood function. Johansen (1988a), following Anderson (1951), suggests an
algorithm based on partial canonical correlation analysis between Δx_t and x_{t-1}
given Δx_{t-i}, i = 1, ..., p - 1. This framework is useful because likelihood ratio tests
for cointegration are computed as a byproduct (see equation (3.19)). Ahn and Reinsel
(1990) suggest an algorithm based on iterative least squares calculations. Modern
computers quickly find the MLEs for typical economic systems using either
algorithm.
Some key results derived in both Johansen (1988a) and Ahn and Reinsel (1990)
are transparent from the latter's regression formulae. As in Section 3.3, write the
VECM as

Δx_t = δα'x_{t-1} + Γz_t + ε_t
     = δ[x_{2,t-1} - βx_{1,t-1}] + Γz_t + ε_t,                          (3.37)

where z_t includes the relevant lags of Δx_t and the second line imposes the Ahn-
Reinsel normalization of α. Let w_{t-1} = x_{2,t-1} - βx_{1,t-1} denote the error correction
term, and let θ = [vec(Γ)' vec(δ)' vec(β)']' denote the vector of unknown parameters.
Using the well known relations between the vec operator and Kronecker products,
vec(Γz_t) = (z_t' ⊗ I_n)vec(Γ), vec(δw_{t-1}) = (w_{t-1}' ⊗ I_n)vec(δ) and vec(δβx_{1,t-1}) =
(x_{1,t-1}' ⊗ δ)vec(β). Using these expressions, and defining Q_t = [(z_t' ⊗ I_n)  (w_{t-1}' ⊗ I_n)
(x_{1,t-1}' ⊗ δ)], then the Gauss-Newton iterations for estimating θ are

θ̂^{i+1} = θ̂^i + [Σ Q_t'Σ̂_ε^{-1}Q_t]^{-1}[Σ Q_t'Σ̂_ε^{-1}ε̂_t],          (3.38)



where θ̂^i denotes the estimator of θ at the i-th iteration, Σ̂_ε = T^{-1} Σ ε̂_tε̂_t', and Q_t
and ε̂_t are evaluated at θ̂^i.³³ Thus, the Gauss-Newton regression corresponds to
the GLS regression of ε̂_t onto (z_t' ⊗ I_n), (w_{t-1}' ⊗ I_n) and (x_{1,t-1}' ⊗ δ). Since z_t
and w_t are I(0) with zero mean and x_{1,t} is I(1), the analysis in Section 2 suggests
that the limiting regression "X'X" matrix will be block diagonal, and the MLEs
of δ and Γ will be asymptotically independent of the MLE of β. Johansen (1988a)
and Ahn and Reinsel (1990) show that this is indeed the case. In addition they
demonstrate that the MLE of β has a limiting distribution of the same form as
shown in equation (3.35) above, so that T(β̂ - β) can be represented as a variance
mixture of normals. Finally, paralleling the result for MLEs from the triangular
representation, Johansen (1988a) and Ahn and Reinsel (1990) demonstrate that

[Σ Q_t'Σ̂_ε^{-1}Q_t]^{-1/2}(θ̂ - θ) →d N(0, I),

so that hypothesis tests and confidence intervals for all of the parameters in the
VECM can be constructed using the normal and χ² distributions.
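To make the iteration concrete, the following Python fragment sketches one step of (3.38) under the Ahn-Reinsel normalization. The array shapes, the names, and the parameter ordering [vec(Γ); vec(δ); vec(β)] are assumptions made for the illustration; the update is written as a generic Gauss-Newton step, which coincides with (3.38) up to the sign convention on Q_t.

    import numpy as np

    def gauss_newton_step(dx, x1l, x2l, z, Gamma, delta, beta):
        # One iteration for dx_t = delta*(x2_{t-1} - beta*x1_{t-1}) + Gamma*z_t + eps_t.
        # dx: T x n, x1l: T x k (x_{1,t-1}), x2l: T x r (x_{2,t-1}), z: T x m.
        T, n = dx.shape
        w = x2l - x1l @ beta.T                   # T x r error correction terms
        eps = dx - w @ delta.T - z @ Gamma.T     # T x n residuals
        omega = np.linalg.inv(eps.T @ eps / T)   # inverse of Sigma_eps-hat
        dim = Gamma.size + delta.size + beta.size
        A = np.zeros((dim, dim))
        b = np.zeros(dim)
        for t in range(T):
            # Jacobian of eps_t w.r.t. [vec(Gamma); vec(delta); vec(beta)] (column-major vec)
            J = np.hstack([-np.kron(z[t:t + 1], np.eye(n)),
                           -np.kron(w[t:t + 1], np.eye(n)),
                           np.kron(x1l[t:t + 1], delta)])
            A += J.T @ omega @ J
            b += J.T @ omega @ eps[t]
        return -np.linalg.solve(A, b)            # increment to the stacked parameter vector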

3.4.3. Comparison and efficiency of the estimators

The estimated cointegrating vectors constructed from the VECM (3.3) or the
triangular representation (3.9)-(3.10) differ only in the way that the I(0) dynamics
of the system are parameterized. The VECM models these dynamics using a VAR
involving the first differences Δx_{1,t}, Δx_{2,t} and the error correction terms, x_{2,t} - βx_{1,t};
the triangular representation uses only Δx_{1,t} and the error correction terms. Section
3.4.1 showed that the exact parameterization of the I(0) dynamics (γ(L) and the
serial correlation of the error term in (3.36)) mattered little for the asymptotic
behavior of the estimator from the triangular representation. In particular,
estimators of β that ignore residual serial correlation, replace γ(L) with γ(1)
and adjust for bias are asymptotically equivalent to the exact MLE in (3.36).
Saikkonen (1991) shows that this asymptotic equivalence extends to Gaussian
MLEs constructed from the VECM. Estimators of β constructed from (3.36) with
appropriate nonparametric estimators of γ(1) are asymptotically equivalent to
Gaussian MLEs constructed from the VECM (3.3). Similarly, test statistics for
H₀: R[vec(β)] = r constructed from estimators based on the triangular representa-
tion and the VECM are also asymptotically equivalent.

³³Consistent initial conditions for the iterations are easily constructed from the OLS estimators of
the parameters in the VAR (3.2). Let Π̂ denote the OLS estimator of Π, partitioned as Π̂ = [Π̂_1 Π̂_2],
where Π̂_1 is n × (n - r) and Π̂_2 is n × r; further partition Π̂_1 = [Π̂_{11}' Π̂_{21}']' and Π̂_2 = [Π̂_{12}' Π̂_{22}']',
where Π̂_{11} is (n - r) × (n - r), Π̂_{21} is r × (n - r), Π̂_{12} is (n - r) × r and Π̂_{22} is r × r. Then Π̂_2 serves as an
initial consistent estimator of δ, and -(Π̂_{22})^{-1}Π̂_{21} serves as an estimator of β. Ahn and Reinsel (1990)
and Saikkonen (1992) develop efficient two-step estimators of β constructed from Π̂, and Engle and
Yoo (1991) develop an efficient three-step estimator of all the parameters in the model using iterations
similar to those in (3.38).

Since estimators of cointegrating vectors do not have asymptotic normal distri-
butions, the standard analysis of asymptotic efficiency, based on comparing
estimators' asymptotic covariance matrices, cannot be used. Phillips (1991a) and
Saikkonen (1991) discuss efficiency of cointegrating vectors using generalizations
of the standard efficiency definitions.³⁴ Loosely speaking, these generalizations
compare two estimators in terms of the relative probability that the estimators
are contained in certain convex regions that are symmetric about the true value
of the parameter vector. Phillips (1991a) shows that when u_t in the triangular
representation (3.9)-(3.10) is generated by a Gaussian ARMA process, then the
MLE is asymptotically efficient. Saikkonen (1991) considers estimators whose
asymptotic distributions can be represented by a certain class of functionals of
Brownian motion. This class contains the OLS and nonlinear least squares
estimators analyzed in Stock (1987), the instrumental variable estimators analyzed
in Phillips and Hansen (1990), all of the estimators discussed in Sections 3.4.1 and
3.4.2, and virtually every other estimator that has been suggested. Saikkonen shows
that the Gaussian MLE (or any of the estimators that are asymptotically equivalent
to the Gaussian MLE) are asymptotically efficient members of this class.
Several studies have used Monte Carlo methods to examine the small sample
behavior of the various estimators of cointegrating vectors. A partial list of the
Monte Carlo studies is Ahn and Reinsel (1990), Banerjee et al. (1986), Gonzalo
(1989), Park and Ogaki (1991), Phillips and Hansen (1990), Phillips and Loretan
(1991) and Stock and Watson (1993). A survey of these studies suggests three
general conclusions. First, the static OLS estimator can be very badly biased even
when the sample size is reasonably large. This finding is consistent with the bias
in the asymptotic distribution of the OLS estimator (see equation (2.22)) that was
noted by Stock (1987).
The second general conclusion concerns the small sample behavior of the Gaussian
MLE based on the finite order VECM. The Monte Carlo studies discovered that,
when the sample size is small, the estimator has a very large mean squared error,
caused by a few extreme outliers. Gaussian MLEs based on the triangular represen-
tation do not share this characteristic. Some insight into this phenomenon is
provided in Phillips (1991c), which derives the exact (small sample) distribution of
the estimators in a model in which the variables follow independent Gaussian
random walks. The MLE constructed from the VECM is shown to have a Cauchy
distribution and so has no integer moments, while the estimator based on the tri-
angular representation has integer moments up to order T - n + r. While Phillips'
results concern a model in which the variables are not cointegrated, the analysis is useful
because it suggests that when the data are weakly cointegrated (as might be
the case in small samples) the estimated cointegrating vector will (approximately)
have these characteristics.
The third general conclusion concerns the approximate Gaussian MLEs based

³⁴See Basawa and Scott (1983) and Sweeting (1983).



on the triangular representation. The small sample properties of these estimators
and test statistics depend in an important way on the estimator used for the long-
run covariance matrix of the data (the spectrum at frequency zero), which is used to
construct an estimator of γ(1) and the long-run residual variance in (3.36). Experi-
ments in Park and Ogaki (1991), Stock and Watson (1993) and (in a different
context) Andrews and Monahan (1990) suggest that autoregressive estimators
or estimators that rely on autoregressive pre-whitening outperform estimators
based on simple weighted averages of sample covariances.
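As an illustration of the autoregressive approach favored by these experiments, the following sketch estimates a univariate long-run variance (2π times the spectral density at frequency zero) from a fitted AR(p); the function name and the choice of lag order are assumptions.

    import numpy as np

    def ar_long_run_variance(u, p):
        # Fit an AR(p) by OLS and return sigma2 / (1 - sum(phi))**2, the implied
        # long-run variance (2*pi times the spectral density at frequency zero).
        u = np.asarray(u) - np.mean(u)
        y = u[p:]
        X = np.column_stack([u[p - j:-j] for j in range(1, p + 1)])
        phi, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ phi
        sigma2 = resid @ resid / len(y)
        return sigma2 / (1.0 - phi.sum()) ** 2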

3.5. The role of constants and trends

3.5.1. The model of deterministic components

Thus far, deterministic components in the time series (constants and trends) have
been ignored. These components are important for three reasons. First, they
represent the average growth or nonzero level present in virtually all economic
time series; second, they affect the efficiency of estimated cointegrating vectors and
the power of tests for cointegration; finally, they affect the distribution of estimated
cointegrating vectors and cointegration test statistics. Accordingly, suppose that
the observed n × 1 time series y_t can be represented as

y_t = μ_0 + μ_1 t + x_t,                                              (3.39)

where x_t is generated by the VAR (3.1). In (3.39), μ_0 + μ_1 t represents the deter-
ministic component of y_t, and x_t represents the stochastic component. In this
section we discuss how the deterministic components affect the estimation and
testing procedures that we have already surveyed.35
There is a simple way to modify the procedures so that they can be applied
to y_t. The deterministic components can be eliminated by regressing y_t onto a
constant and time trend. Letting y_t^d denote the detrended series, the estimation
and testing procedures developed above can then be used by replacing x_t with y_t^d.
This changes the asymptotic distribution of the statistics in a simple way: since
the detrended values of y_t and x_t are identical, all statistics have the same limiting
representation with the Brownian motion process B(s) replaced by B^d(s), the
detrended Brownian motion introduced in Section 2.3.
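In code, the detrending step is a single OLS regression on a constant and a time trend; a minimal sketch (names assumed):

    import numpy as np

    def detrend(y):
        # Residuals from regressing each column of y on a constant and a linear trend.
        T = len(y)
        Z = np.column_stack([np.ones(T), np.arange(1, T + 1)])
        return y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]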
While this approach is simple, it is often statistically inefficient because it discards
all of the deterministic trend information in the data, and the relationship between
these trends is often the most useful information about cointegration. To see this,

³⁵We limit discussion to linear trends in y_t for reasons of brevity and because this is the most
important model for empirical applications. The results are readily extended to higher order trend
polynomials and other smooth trend functions.

let α denote a cointegrating vector and consider the stable linear combination

α'y_t = λ_0 + λ_1 t + w_t,                                            (3.40)

where λ_0 = α'μ_0, λ_1 = α'μ_1, and w_t = α'x_t. In most (if not all) applications, the
cointegrating vector will annihilate both the stochastic trend and deterministic
trend in α'y_t. That is, w_t will be I(0) and λ_1 = 0.³⁶ As shown below, this means
that one linear combination of the coefficients in the cointegrating vector can be
consistently estimated at rate T^{3/2}. In contrast, when detrended data are used, the
cointegrating vectors are consistently estimated at rate T. Thus, the data's deter-
ministic trends are the dominant source of information about the cointegrating
vector and detrending the data throws this information away.
The remainder of this section discusses estimation and testing procedures that
utilize the datas deterministic trends. Most of these procedures are simple modifi-
cations of the procedures that were developed above.

3.5.2. Estimating cointegrating vectors

We begin with a discussion of the MLE of cointegrating vectors based on the
triangular representation. Partitioning y_t into (n - r) × 1 and r × 1 components,
y_{1,t} and y_{2,t}, the triangular representation for y_t is

Δy_{1,t} = γ + u_{1,t},                                                (3.41)

y_{2,t} - βy_{1,t} = λ_0 + λ_1 t + u_{2,t}.                             (3.42)

This is identical to the triangular representation for x_t given in (3.9)-(3.10) except
for the constant and trend terms. The constant term in (3.41) represents the average
growth in y_{1,t}. In most situations λ_1 = 0 in (3.42), since the cointegrating vector
annihilates the deterministic trend in the variables. In this case, λ_0 denotes the
mean of the error correction terms, which is unrestricted in most economic
applications.
Assuming that λ_1 = 0 and λ_0 and γ are unrestricted, efficient estimation of β in
(3.42) proceeds as in Section 3.3.1. The only difference is that the equations now
include a constant term. As in Section 3.3.1, Wald, LR or LM test statistics for
testing H₀: R[vec(β)] = r will have limiting χ² distributions, and confidence inter-
vals for the elements of β can be constructed in the usual way. The only result
from Section 3.3.1 that needs to be modified is the asymptotic distribution of β̂.
This estimator is calculated from the regression of y_{2,t} onto y_{1,t}, leads and lags of
Δy_{1,t} and a constant term. When the y_{1,t} data contain a trend [γ ≠ 0 in (3.41)],
³⁶Ogaki and Park (1990) define these two restrictions as stochastic and deterministic cointegration.
Stochastic cointegration means that w_t is I(0), while deterministic cointegration means that λ_1 = 0.

one of the canonical regressors is a time trend (z_{2,t} from Section 2.5.1), and the
estimated coefficient on the time trend converges at rate T^{3/2}. This means that
one linear combination of the estimated coefficients in the cointegrating vector
converges to its true value very quickly; when the model did not contain a linear
trend the estimator converged at rate T.
The results for MLEs based on the finite order VECM representation are
analogous to those from the triangular representation. The VECM representation
for y_t is derived directly from (3.2) and (3.39),

Δy_t = μ_1 + δ(α'x_{t-1}) + Σ_{i=1}^{p-1} Φ_i Δx_{t-i} + ε_t
     = μ_1* + δ(α'y_{t-1} - λ_0 - λ_1 t) + Σ_{i=1}^{p-1} Φ_i Δy_{t-i} + ε_t,   (3.43)

where μ_1* = (I - Σ_{i=1}^{p-1} Φ_i)μ_1, λ_0 = α'μ_0 and λ_1 = α'μ_1. Again, in most applications
λ_1 = 0, and the VECM is

Δy_t = θ + δ(α'y_{t-1}) + Σ_{i=1}^{p-1} Φ_i Δy_{t-i} + ε_t,            (3.44)

where θ = μ_1* - δλ_0. When the only restriction on μ_1 is α'μ_1 = 0, the constant term
θ is unconstrained, and (3.44) has the same form as (3.2) except that a constant
term has been added. Thus, the Gaussian MLE from (3.44) is constructed exactly
as in Section 3.4.2 with the addition of a constant term in all equations. The distri-
bution of test statistics is unaffected, but for the reasons discussed above, the
asymptotic distribution of the cointegrating vector changes because of the presence
of the deterministic trend.
In some situations the data are not trending in a deterministic fashion, so that
μ_1 = 0. (For example, this is arguably the case when y_t is a vector of U.S. interest
rates.) When μ_1 = 0, then μ_1* = 0 in (3.43), and this imposes a constraint on θ in
(3.44). To impose this constraint, the model can be written as

Δy_t = δ(α'y_{t-1} - λ_0) + Σ_{i=1}^{p-1} Φ_i Δy_{t-i} + ε_t           (3.45)

and estimated using a modification of the Gauss-Newton iterations in (3.38) or
a modification of Johansen's canonical correlation approach [see Johansen and
Juselius (1990)].

3.5.3. Testing for cointegration

Deterministic trends have important effects on tests for cointegration. As discussed
in Johansen and Juselius (1990) and Johansen (1991, 1992a), it is useful to consider
two separate effects. First, as in (3.43)-(3.44), nonzero values of μ_0 and μ_1 affect
the form of the VECM, and this, in turn, affects the form of the cointegration test
statistic. Second, the deterministic components affect the properties of the regressors,
and this, in turn, affects the distribution of cointegration test statistics. In the most
general form of the test considered in Section 3.3.1, α was partitioned into known
and unknown cointegrating vectors under both the null and alternative; that is, α
was written as α = (α_{0k} α_{0u} α_{ak} α_{au}). When nonzero values of μ_0 and μ_1 are allowed,
the precise form of the statistic and resulting asymptotic null distribution depends
on which of these cointegrating vectors annihilate the trend or constant [see
Horvath and Watson (1993)]. Rather than catalogue all of the possible cases, the
major statistical issues will be discussed in the context of two examples. The reader
is referred to Johansen and Juselius (1990), Johansen (1992a) and Horvath and
Watson (1993) for a more systematic treatment.
In the first example, suppose that r = 0 under the null, that α is known under
the alternative, that μ_0 and μ_1 are nonzero, but that λ_1 = α'μ_1 = 0 is known. To be
concrete, suppose that the data are aggregate time series on the logarithms of
income, consumption and investment for the United States. The balanced growth
hypothesis suggests two possible cointegrating relations with cointegrating vectors
(1, -1, 0) and (1, 0, -1). The series exhibit deterministic growth, so that μ_1 ≠ 0,
and the sample growth rates are approximately equal, so that λ_1 = 0 is reasonable.
In this example, (3.44) is the correct specification of the VECM with θ unrestricted
under both the null and alternative and δ = 0 under the null. Comparing (3.44)
and the specification with no deterministic components given in (3.3), the only
difference is that x_t in (3.3) becomes y_t in (3.44) and the constant term θ is added.
Thus, the Wald test for H₀: δ = 0 is constructed as in (3.17) with y_t replacing x_t
and Z augmented by a column of 1's. Since α'μ_1 = 0, the regressor is α'y_{t-1} =
α'x_{t-1} + α'μ_0, but since a constant is included in the regression, all of the variables
are deviated from their sample means. Since the demeaned values of α'y_{t-1} and
α'x_{t-1} are the same, the asymptotic null distribution of the Wald statistic for
testing H₀: δ = 0 in (3.44) is given by (3.18) with B^μ(s), the demeaned Wiener process
defined below Lemma 2.3, replacing B(s).
The second example is the same as the first, except that now α is unknown.
Equation (3.44) is still the correct VECM with θ unrestricted under the null and
alternative. The LR test statistic is calculated as in (3.19), again with y_t replacing
x_t and Z augmented by a vector of 1's. Now, however, the distribution of the test
statistic changes in an important way. Since the regressor y_{t-1} contains a nonzero
trend, it behaves like a combination of time trend and martingale components.
When the n × 1 vector y_{t-1} is transformed into the canonical regressors of Section 2,
this yields one regressor dominated by a time trend and n - 1 regressors dominated

by martingales. As shown in Johansen and Juselius (1990), the distribution of the
resulting LR statistic has a null distribution given by (3.25) where now

H = [∫F(s)dB(s)]'[∫F(s)F(s)'ds]^{-1}[∫F(s)dB(s)],

where F(s) is an n × 1 vector, with first n - 1 elements given by the first n - 1
elements of B^μ(s) and the last element given by the demeaned time trend, s - 1/2.
(The components are demeaned because of the constant term in the regression.)
Johansen and Juselius (1990) also derive the asymptotic null distribution of the
LR test for cointegration with unknown cointegrating vectors when μ_1 = 0, so that
(3.45) is the appropriate specification of the VECM. Tables of critical values are
presented in Johansen and Juselius (1990) for n - r_{0u} ≤ 5 for the various deterministic
trend models, and these are extended in Osterwald-Lenum (1992) for n - r_{0u} ≤ 11.
Horvath and Watson (1992) extend the tables to include nonzero values of r_{0k}
and r_{ak}.
The appropriate treatment of deterministic components in cointegration and
unit root tests is still unsettled, and remains an active area of research. For example,
Elliot et al. (1992) report that large gains in power for univariate unit root tests
can be achieved by modifying standard Dickey-Fuller tests by an alternative
method of detrending the data. They propose detrending the data using GLS
estimators of μ_0 and μ_1 from (3.39), together with specific assumptions about initial
conditions for the x_t process. Analogous procedures for likelihood based tests for
cointegration can also be constructed. Johansen (1992b) develops a sequential
testing procedure for cointegration in which the trend properties of the data and
potential error correction terms are unknown.

4. Structural vector autoregressions

4.1. Introductory comments

Following the work of Sims (1980), vector autoregressions have been extensively
used by economists for data description, forecasting and structural inference.
Canova (1991) surveys VARs as a tool for data description and forecasting; this
survey focuses on structural inference. We begin the discussion in Section 4.2 by
introducing the structural moving average model, and show that this model
provides answers to the impulse and propagation questions often asked by
macroeconomists. The relationship between the structural moving average model
and structural VAR is the subject of Section 4.3. That section discusses the condi-
tions under which the structural moving average polynomial can be inverted, so
that the structural shocks can be recovered from a VAR. When this is possible, a
structural VAR obtains. Section 4.4 shows that the structural VAR can be interpreted

as a dynamic simultaneous equations model, and discusses econometric identifi-


cation of the model's parameters. Finally, Section 4.5 discusses issues of estimation
and statistical inference.

4.2. The structural moving average model, impulse response


functions and variance decompositions

In this section we study the model

y_t = C(L)ε_t,                                                        (4.1)

where y_t is an n_y × 1 vector of economic variables and ε_t is an n_ε × 1 vector of
shocks. For now we allow n_y ≠ n_ε. Equation (4.1) is called the structural moving
average model, since the elements of ε_t are given a structural economic interpre-
tation. For example, one element of ε_t might be interpreted as an exogenous shock
to labor productivity, another as an exogenous shock to labor supply, another as
an exogenous change in the quantity of money, and so forth. In the jargon developed
for the analysis of dynamic simultaneous equations models, (4.1) is the "final form"
of an economic model, in which the endogenous variables y_t are expressed as a
distributed lag of the exogenous variables, given here by the elements of ε_t. It will
be assumed that the endogenous variables y_t are observed, but that the exogenous
variables ε_t are not directly observed. Rather, the elements of ε_t are indirectly
observed through their effect on the elements of y_t. This assumption is made without
loss of generality, since any observed exogenous variables can always be added
to the y_t vector.
In Section 1, a typical macroeconomic system was introduced and two broad
questions were posed. The first question asked how the system's endogenous
variables responded dynamically to exogenous shocks. The second question asked
which shocks were the primary causes of variability in the endogenous variables.
Both of these questions are readily answered using the structural moving average
model.
First, the dynamic effects of the elements of ε_t on the elements of y_t are determined
by the elements of the matrix lag polynomial C(L). Letting C(L) = C_0 + C_1L +
C_2L² + ..., where C_k is an n_y × n_ε matrix with typical element c_{ij,k}, then

c_{ij,k} = ∂y_{i,t}/∂ε_{j,t-k} = ∂y_{i,t+k}/∂ε_{j,t},                  (4.2)

where y_{i,t} is the i-th element of y_t, ε_{j,t} is the j-th element of ε_t, and the last equality
follows from the time invariance of (4.1). Viewed as a function of k, c_{ij,k} is called
the impulse response function of ε_{j,t} for y_{i,t}. It shows how y_{i,t+k} changes in response
to a one unit impulse in ε_{j,t}. In the classic econometric literature on distributed
lag models, the impulse responses are called dynamic multipliers.
To answer the second question concerning the relative importance of the shocks,
the probability structure of the model must be specified and the question must be
refined. In most applications the probability structure is specified by assuming
that the shocks are i.i.d.(0, Σ_ε), so that any serial correlation in the exogenous
variables is captured in the lag polynomial C(L). The assumption of zero mean is
inconsequential, since deterministic components such as constants and trends can
always be added to (4.1). Viewed in this way, ε_t represents innovations or unanti-
cipated shifts in the exogenous variables. The question concerning the relative
importance of the shocks can be made more precise by casting it in terms of the
h-step-ahead forecast errors of y_t. Let y_{t|t-h} = E(y_t | {ε_s}_{s=-∞}^{t-h}) denote the h-step-
ahead forecast of y_t made at time t - h, and let a_{t|t-h} = y_t - y_{t|t-h} = Σ_{k=0}^{h-1} C_k ε_{t-k}
denote the resulting forecast error. For small values of h, a_{t|t-h} can be interpreted as
short-run movements in y_t, while for large values of h, a_{t|t-h} can be interpreted
as long-run movements. In the limit as h → ∞, a_{t|t-h} = y_t. The importance of a
specific shock can then be represented as the fraction of the variance in a_{t|t-h} that
is explained by that shock; it can be calculated for short-run and long-run
movements in y_t by varying h. When the shocks are mutually correlated there is
no unique way to do this, since their covariance must somehow be distributed.
However, when the shocks are uncorrelated the calculation is straightforward.
Assume Σ_ε is diagonal with diagonal elements σ_j²; then the variance of the i-th
element of a_{t|t-h} is

var(a_{i,t|t-h}) = Σ_{j=1}^{n_ε} σ_j² Σ_{k=0}^{h-1} c_{ij,k}²,

so that

R²_{ij,h} = σ_j² Σ_{k=0}^{h-1} c_{ij,k}² / var(a_{i,t|t-h})            (4.3)

shows the fraction of the h-step-ahead forecast error variance in y_{i,t} attributed to
ε_{j,t}. The set of n_ε values of R²_{ij,h} are called the variance decomposition of y_{i,t} at
horizon h.
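The calculations behind (4.2) and (4.3) are mechanical once the structural coefficients are available. The following Python sketch (assumed names; it presumes A_0, ..., A_p and the diagonal of Σ_ε are given) obtains the C_k matrices by inverting A(L) recursively and then forms the variance decomposition.

    import numpy as np

    def structural_irfs(A, horizon):
        # A = [A0, A1, ..., Ap] from A0 y_t = A1 y_{t-1} + ... + Ap y_{t-p} + eps_t.
        # Returns [C_0, ..., C_horizon] of the structural MA form y_t = C(L) eps_t.
        A0_inv = np.linalg.inv(A[0])
        Phi = [A0_inv @ Ai for Ai in A[1:]]          # reduced form AR matrices
        n, p = A[0].shape[0], len(A) - 1
        Psi = [np.eye(n)]                            # reduced form MA matrices
        for k in range(1, horizon + 1):
            Psi.append(sum(Phi[i] @ Psi[k - 1 - i] for i in range(min(k, p))))
        return [Pk @ A0_inv for Pk in Psi]           # C_k = Psi_k A0^{-1}

    def variance_decomposition(C, sigma2, h):
        # Element (i, j): share of the h-step forecast error variance of variable i
        # attributed to shock j, as in (4.3); sigma2 = diagonal of Sigma_eps.
        contrib = sum(Ck ** 2 for Ck in C[:h]) * np.asarray(sigma2)
        return contrib / contrib.sum(axis=1, keepdims=True)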

4.3. The structural VAR representation

The structural VAR representation of (4.1) is obtained by inverting C(L) to yield

A(L)y_t = ε_t,                                                        (4.4)


where A(L) = A_0 - Σ_{k=1}^{∞} A_kL^k is a one-sided matrix lag polynomial. In (4.4), the
exogenous shocks ε_t are written as a distributed lag of current and lagged values
of y_t. The structural VAR representation is useful for two reasons. First, when the
model parameters are known, it can be used to construct the unobserved exogenous
shocks as a function of current and lagged values of the observed variables y_t.
Second, it provides a convenient framework for estimating the model parameters:
with A(L) approximated by a finite order polynomial, equation (4.4) is a dynamic
simultaneous equations model, and standard simultaneous equations methods can
be used to estimate the parameters.
It is not always possible to invert C(L) and move from the structural moving
average representation (4.1) to the VAR representation (4.4). One useful way to
discuss the invertibility problem [see Granger and Anderson (1978)] is in terms
of estimates of ε_t constructed from (4.4) using truncated versions of A(L). Since a
semi-infinite realization of the y_t process, {y_s}_{s=-∞}^{t}, is never available, estimates
of ε_t must be constructed from (4.4) using {y_s}_{s=1}^{t}. Consider the estimator
ε̂_t = A_0y_t - Σ_{k=1}^{t-1} A_ky_{t-k} constructed from the truncated realization. If ε̂_t converges to ε_t in
mean square as t → ∞, then the structural moving average process (4.1) is said to be
invertible. Thus, when the process is invertible, the structural errors can be recovered
as a one-sided moving average of the observed data, at least in large samples.
This definition makes it clear that the structural moving average process cannot
be inverted if n_y < n_ε. Even in the static model y_t = C_0ε_t, a necessary condition for
obtaining a unique solution for ε_t in terms of y_t is that n_y ≥ n_ε. This requirement
has a very important implication for structural analysis using VAR models: in
general, small scale VARs can only be used for structural analysis when the
endogenous variables can be explained by a small number of structural shocks.
Thus, a bivariate VAR of macroeconomic variables is not useful for structural
analysis if there are more than two important macroeconomic shocks affecting the
variables.³⁷ In what follows we assume that n_y = n_ε. This rules out the simple cause
of noninvertibility just discussed; it also assumes that any dynamic identities
relating the elements of y_t when n_y > n_ε have been solved out of the model.
With n_y = n_ε = n, C(L) is square and the general requirement for invertibility is
that the determinantal polynomial |C(z)| has all of its roots outside the unit circle.
Roots on the unit circle pose no special problems; they are evidence of over-
differencing and can be handled by appropriately transforming the variables (e.g.
accumulating the necessary linear combinations of the elements of y_t). In any event,
unit roots can be detected, at least in large samples, by appropriate statistical tests.
Roots of |C(z)| that are inside the unit circle pose a much more difficult problem,
since models with roots inside the unit circle have the same second moment pro-
perties as models with roots outside the unit circle. The simplest example of this
³⁷Blanchard and Quah (1989) and Faust and Leeper (1993) discuss special circumstances in which some
structural analysis is possible when n_y < n_ε. For example, suppose that y_t is a scalar and the n_ε elements
of ε_t affect y_t only through the scalar index e_t = Dε_t, where D is a 1 × n_ε vector. In this case the impulse
response functions can be recovered up to scale.

is the univariate MA(1) model y_t = (1 - cL)ε_t, where ε_t is i.i.d.(0, σ_ε²). The same first
and second moments of y_t obtain for the model y_t = (1 - c̃L)ε̃_t, where c̃ = c^{-1} and
ε̃_t is i.i.d.(0, σ̃_ε²) with σ̃_ε² = c²σ_ε². Thus the first two moments of y_t cannot be used
to discriminate between these two different models. This is important because it
can lead to large specification errors in structural VAR models that cannot be
detected from the data. For example, suppose that the true structural model is
y_t = (1 - cL)ε_t with |c| > 1, so that the model is not invertible. A researcher using
the invertible model would not recover the true structural shocks, but rather ε̃_t =
(1 - c̃L)^{-1}y_t = (1 - c̃L)^{-1}(1 - cL)ε_t = ε_t - (c - c̃)Σ_{i=0}^{∞} c̃^i ε_{t-1-i}. A general discussion
of this subject is contained in Hannan (1970) and Rozanov (1967). Implications
of these results for the interpretation of structural VARs are discussed in Hansen and
Sargent (1991) and Lippi and Reichlin (1993). For related discussion see Quah (1986).
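A quick numerical check of this observational equivalence, with assumed parameter values:

    import numpy as np

    c, sig2 = 2.0, 1.0                       # noninvertible model: root of (1 - c z) inside unit circle
    c_flip, sig2_flip = 1.0 / c, c**2 * sig2

    def ma1_autocov(theta, s2):
        # Lag-0 and lag-1 autocovariances of y_t = (1 - theta L) e_t with Var(e_t) = s2.
        return (1 + theta**2) * s2, -theta * s2

    print(ma1_autocov(c, sig2))              # (5.0, -2.0)
    print(ma1_autocov(c_flip, sig2_flip))    # (5.0, -2.0): identical second moments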
Hansen and Sargent (1991) provide a specific economic model in which non-
invertible structural moving average processes arise. In the model, one set of
economic variables, say x_t, is generated by an invertible moving average process.
Another set of economic variables, say y_t, are expectational variables, formed as
discounted sums of expected future x_t's. Hansen and Sargent then consider what
would happen if only the y_t data were available to the econometrician. They show
that the implied moving average process of y_t, written in terms of the structural
shocks driving x_t, is not invertible.³⁸ The Hansen-Sargent example provides an
important and constructive lesson for researchers using structural VARs: it is
important to include variables that are directly related to the exogenous shocks
under consideration (x_t in the example above). If the only variables used in the
model are indirect indicators with important expectational elements (y_t in the
example above), severe misspecification may result.

4.4. Identification of the structural VAR

Assuming that the lag polynomial A(L) in (4.4) is of order p, the structural
VAR can be written as

A_0y_t = A_1y_{t-1} + A_2y_{t-2} + ... + A_py_{t-p} + ε_t.             (4.5)
³⁸A simple version of their example is as follows: suppose that y_t and x_t are two scalar time series,
with x_t generated by the MA(1) process x_t = ε_t - θε_{t-1}. Suppose that y_t is related to x_t by the expec-
tational equation

y_t = E_t Σ_{i=0}^{∞} β^i x_{t+i}
    = x_t + βE_t x_{t+1}
    = (1 - βθ)ε_t - θε_{t-1} = C(L)ε_t,

where the second and third lines follow from the MA(1) process for x_t. It is readily verified that the
root of C(z) is (1 - βθ)/θ, which may be less than 1 even when the root of (1 - θz) is greater than 1.
(For example, if θ = β = 0.8, the root of (1 - θz) is 1.25 and the root of C(z) is 0.45.)

Since A_0 is not restricted to be diagonal, equation (4.5) is a dynamic simultaneous
equations model. It differs from standard representations of the simultaneous
equations model [see Hausman (1983)] because observable exogenous variables
are not included in the equations. However, since exogenous and predetermined
variables (lagged values of y_t) are treated identically for purposes of identifi-
cation and estimation, equation (4.5) can be analyzed using techniques developed
for simultaneous equations.
The reduced form of (4.5) is

y_t = Φ_1y_{t-1} + Φ_2y_{t-2} + ... + Φ_py_{t-p} + e_t,                (4.6)

where Φ_i = A_0^{-1}A_i for i = 1, ..., p, and e_t = A_0^{-1}ε_t. A natural first question concerns
the identifiability of the structural parameters in (4.5), and this is the subject taken
up in this section.
The well known order condition for identification is readily deduced. Since
y_t is n × 1, there are pn² elements in (Φ_1, Φ_2, ..., Φ_p) and n(n + 1)/2 elements in
Σ_e = A_0^{-1}Σ_ε(A_0^{-1})', the covariance matrix of the reduced form disturbances. When
the structural shocks are n.i.i.d.(0, Σ_ε), these [pn² + n(n + 1)/2] parameters com-
pletely characterize the probability distribution of the data. In the structural model
(4.5) there are (p + 1)n² elements in (A_0, A_1, ..., A_p) and n(n + 1)/2 elements in Σ_ε.
Thus, there are n² more parameters in the structural model than are needed to
characterize the likelihood function, so that n² restrictions are required for identifi-
cation. As usual, setting the diagonal elements of A_0 equal to 1 gives the first n
restrictions. This leaves n(n - 1) restrictions that must be deduced from economic
considerations.
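To make the counting concrete: in a bivariate model (n = 2) with p lags, the reduced form has pn² = 4p slope coefficients plus n(n + 1)/2 = 3 distinct elements of Σ_e, while the structural model has (p + 1)n² = 4p + 4 coefficients in (A_0, A_1, ..., A_p) plus 3 elements of Σ_ε; the surplus is n² = 4, of which the unit diagonal of A_0 supplies n = 2, leaving the n(n - 1) = 2 restrictions that must come from economics.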
The identifying restrictions must be dictated by the economic model under
consideration. It makes little sense to discuss the restrictions without reference to
a specific economic system. Here, some general remarks on identification are made
in the context of a simple bivariate model explaining output and money; a more
detailed discussion of identification in structural VAR models is presented in
Giannini (1991). Let the first element of y_t, say y_{1,t}, denote the rate of growth of real
output, and the second element of y_t, say y_{2,t}, denote the rate of growth of money.³⁹
Writing the typical element of A_k as a_{ij,k}, equation (4.5) becomes

y_{1,t} = -a_{12,0}y_{2,t} + Σ_{i=1}^{p} a_{11,i}y_{1,t-i} + Σ_{i=1}^{p} a_{12,i}y_{2,t-i} + ε_{1,t},   (4.7a)

y_{2,t} = -a_{21,0}y_{1,t} + Σ_{i=1}^{p} a_{21,i}y_{1,t-i} + Σ_{i=1}^{p} a_{22,i}y_{2,t-i} + ε_{2,t}.   (4.7b)

Equation (4.7a) is interpreted as an output or aggregate supply equation, with

³⁹Much of the discussion concerning this example draws from King and Watson (1993).

ε_{1,t} interpreted as an aggregate supply or productivity shock. Equation (4.7b) is
interpreted as a money supply reaction function showing how the money supply
responds to contemporaneous output, lagged variables, and a money supply shock
ε_{2,t}. Identification requires n(n - 1) = 2 restrictions on the parameters of (4.7).
In the standard analysis of simultaneous equation models, identification is
achieved by imposing zero restrictions on the coefficients for the predetermined
variables. For example, the order condition is satisfied if y_{1,t-1} enters (4.7a) but
not (4.7b), and y_{1,t-2} enters (4.7b) but not (4.7a); this imposes the two constraints
a_{21,1} = a_{11,2} = 0. In this case, y_{1,t-1} shifts the output equation but not the money
equation, while y_{1,t-2} shifts the money equation but not the output equation. Of
course, this is a very odd restriction in the context of the output-money model,
since the lags in the equations capture expectational effects, technological and
institutional inertia arising from production lags and sticky prices, information lags, etc.
There is little basis for identifying the model with the restriction a_{21,1} = a_{11,2} = 0.
Indeed there is little basis for identifying the model with any zero restrictions on
lag coefficients. Sims (1980) persuasively makes this argument in a more general
context, and this has led structural VAR modelers to avoid imposing zero restrictions
on lag coefficients. Instead, structural VARs have been identified using restrictions
on the covariance matrix of structural shocks Σ_ε, the matrix of contemporaneous
coefficients A_0, and the matrix of long-run multipliers A(1)^{-1}.
Restrictions on Σ_ε have generally taken the form that Σ_ε is diagonal, so that
the structural shocks are assumed to be uncorrelated. In the example above, this
means that the underlying productivity shocks and money supply shocks are
uncorrelated, so that any contemporaneous cross equation impacts arise through
nonzero values of a_{12,0} and a_{21,0}. Some researchers have found this a natural
assumption to make, since it follows from a modeling strategy in which unobserved
structural shocks are viewed as distinct phenomena which give rise to comovement
in observed variables only through the specific economic interactions studied in
the model. The restriction that Σ_ε is diagonal imposes n(n - 1)/2 restrictions on the
model, leaving only n(n - 1)/2 additional necessary restrictions.⁴⁰
These additional restrictions can come from a priori knowledge about the A_0
matrix in (4.5). In the bivariate output-money model in (4.7), if Σ_ε is diagonal, then
only n(n - 1)/2 = 1 restriction on A_0 is required for identification. Thus, a priori
knowledge of a_{12,0} or a_{21,0} will serve to identify the model. For example, if it was
assumed that money shocks affect output only with a lag, so that ∂y_{1,t}/∂ε_{2,t} = 0,
then a_{12,0} = 0, and this restriction identifies the model. The generalization of this
restriction in the n-variable model produces the Wold causal chain [see Wold
(1954) and Malinvaud (1980, pp. 605-608)], in which ∂y_{i,t}/∂ε_{j,t} = 0 for i < j. This
restriction leads to a recursive model with A_0 lower triangular, yielding the required
n(n - 1)/2 identifying restrictions. This restriction was used in Sims (1980), and has

⁴⁰Other restrictions on the covariance matrix are possible, but will not be discussed here. A more
general discussion of identification with covariance restrictions can be found in Hausman and Taylor
(1983), Fisher (1966), Rothenberg (1971) and the references cited there.

become the default identifying restriction implemented automatically in com-
mercial econometric software. Like any identifying restriction, it should never be
used automatically. In the context of the output-money example, it is appropriate
under the maintained assumption that exogenous money supply shocks, and the
resulting change in interest rates, have no contemporaneous effect on output. This
may be a reasonable assumption for data sampled at high frequencies, but loses
its appeal as the sampling interval increases.⁴¹,⁴²
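Computationally, the recursive identification reduces to a Cholesky factorization of the reduced form covariance matrix. A minimal Python sketch (assumed names) follows; it returns A_0 with a unit diagonal and the implied diagonal of Σ_ε.

    import numpy as np

    def recursive_A0(e):
        # e: T x n reduced form residuals. With Sigma_e = A0^{-1} Sigma_eps A0^{-1'},
        # A0 lower triangular with unit diagonal and Sigma_eps diagonal, the Cholesky
        # factor P of Sigma_e delivers A0^{-1} = P D^{-1} and Sigma_eps = D^2,
        # where D holds the diagonal of P.
        Sigma_e = e.T @ e / e.shape[0]
        P = np.linalg.cholesky(Sigma_e)              # lower triangular, Sigma_e = P P'
        D = np.diag(np.diag(P))
        A0 = np.linalg.inv(P @ np.linalg.inv(D))     # unit diagonal, lower triangular
        sigma_eps = np.diag(P) ** 2                  # diagonal elements of Sigma_eps
        return A0, sigma_eps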
Other restrictions on A_0 can also be used to identify the model. Blanchard and
Watson (1986), Bernanke (1986) and Sims (1986) present empirical models that
are identified by zero restrictions on A_0 that don't yield a lower triangular matrix.
Keating (1990) uses a related set of restrictions. Of course, nonzero equality
restrictions can also be used; see Blanchard and Watson (1986) and King and
Watson (1993) for examples.
An alternative set of identifying restrictions relies on long-run relationships. In
the context of structural VARs these restrictions were used in papers by Blanchard
and Quah (1989) and King et al. (1991).⁴³ These papers relied on restrictions on
A(1) = A_0 - Σ_{i=1}^{p} A_i for identification. Since C(1) = A(1)^{-1}, these can alternatively
be viewed as restrictions on the sum of impulse responses. To motivate these
restrictions, consider the output-money example.⁴⁴ Let x_{1,t} denote the logarithm
of the level of output and x_{2,t} denote the logarithm of the level of money, so that
y_{1,t} = Δx_{1,t} and y_{2,t} = Δx_{2,t}. Then from (4.1),
∂x_{i,t+k}/∂ε_{j,t} = Σ_{m=0}^{k} ∂y_{i,t+m}/∂ε_{j,t} = Σ_{m=0}^{k} c_{ij,m}   (4.8)

for i, j = 1, 2, so that

lim_{k→∞} ∂x_{i,t+k}/∂ε_{j,t} = Σ_{k=0}^{∞} c_{ij,k},                   (4.9)

which is the ij-th element of C(1). Now, suppose that money is neutral in the long
run, in the sense that shocks to money have no permanent effect on the level of
output. This means that lim_{k→∞} ∂x_{1,t+k}/∂ε_{2,t} = 0, so that C(1) is a lower triangular

⁴¹The appropriateness of the Wold causal chain was vigorously debated in the formative years of
simultaneous equations. See Malinvaud (1980, pp. 55-58) and the references cited there.
⁴²Applied researchers sometimes estimate a variety of recursive models in the belief (or hope) that
the set of recursive models somehow brackets the truth. There is no basis for this. Statements like
"the ordering of the Wold causal chain didn't matter for the results" say little about the robustness
of the results to different identifying restrictions.
⁴³For other early applications of this approach, see Shapiro and Watson (1988) and Gali (1992).
⁴⁴The empirical model analyzed in Blanchard and Quah (1989) has the same structure as the output-
money example, with the unemployment rate used in place of money growth.

matrix. Since A(1) = C(1)^{-1}, this means that A(1) is also lower triangular, and this
yields the single extra identifying restriction that is required to identify the bivariate
model. The analogous restriction in the general n-variable VAR is the long-run
Wold causal chain, in which ε_{i,t} has no long-run effect on y_{j,t} for j < i. This restriction
implies that A(1) is lower triangular, yielding the necessary n(n - 1)/2 identifying
restrictions.⁴⁵
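A minimal Python sketch of the long-run identification follows (assumed names; the shocks are normalized here to unit variance rather than through the unit diagonal of A_0): the lower triangular C(1) is the Cholesky factor of the long-run covariance matrix of y_t, and the impact matrix follows from C_0 = Φ(1)C(1).

    import numpy as np

    def long_run_identification(Phi, Sigma_e):
        # Phi: list of reduced form AR matrices Phi_1, ..., Phi_p; Sigma_e: Cov(e_t).
        n = Sigma_e.shape[0]
        Phi1 = np.eye(n) - sum(Phi)                  # Phi(1)
        Phi1_inv = np.linalg.inv(Phi1)
        S = Phi1_inv @ Sigma_e @ Phi1_inv.T          # long-run covariance, C(1) C(1)'
        C1 = np.linalg.cholesky(S)                   # lower triangular C(1)
        C0 = Phi1 @ C1                               # impact matrix, C0 = A0^{-1}
        return C0, C1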

4.5. Estimating structural VAR models

This section discusses methods for estimating the parameters of the structural VAR
(4.5). The discussion is centered around generalized method of moments (GMM)
estimators. The relationship between these estimators and FIML estimators
constructed from a Gaussian likelihood is discussed below. The simplest version
of the GMM estimator is indirect least squares, which follows from the relationship
between the reduced form parameters in (4.6) and the structural parameters in (4.5):

A_0^{-1}A_i = Φ_i,   i = 1, ..., p,                                    (4.10)

A_0^{-1}Σ_ε(A_0^{-1})' = Σ_e.                                          (4.11)

Indirect least squares estimators are formed by replacing the reduced form param-
eters in (4.10) and (4.11) with their OLS estimators and solving the resulting
equations for the structural parameters. Assuming that the model is exactly
identified, a solution will necessarily exist. Given estimators Φ̂_i and Â_0, equation
(4.10) yields Â_i = Â_0Φ̂_i. The quadratic equation (4.11) is more difficult to solve. In
general, iterative techniques are required, but simpler methods are presented below
for specific models.
To derive the large sample distribution of the estimators, and to solve the in-
direct least squares equations when there are overidentifying restrictions, it is
convenient to cast the problem in the standard GMM framework [see Hansen
(1982)]. Hausman et al. (1987) show how this framework can be used to construct
efficient estimators for the simultaneous equations model with covariance restric-
tions on the error terms, thus providing a general procedure for forming efficient
estimators in the structural VAR model.
Some additional notation is useful. Let z_t = (y_{t-1}', y_{t-2}', ..., y_{t-p}')' denote the
vector of predetermined variables in the model, and let θ denote the vector of un-
known parameters in A_0, A_1, ..., A_p and Σ_ε. The population moment conditions
that implicitly define the structural parameters are

E(ε_tz_t') = 0,                                                       (4.12)

⁴⁵Of course, restrictions on A_0 and A(1) can be used in concert to identify the model. See Gali
(1992) for an empirical example.

E(ε_tε_t') = Σ_ε,                                                     (4.13)

where ε_t and Σ_ε are functions of the unknown θ. GMM estimators are formed by
choosing θ so that (4.12) and (4.13) are satisfied, or satisfied as closely as possible,
with sample moments used in place of the population moments.
The key ideas underlying the GMM estimator in the structural VAR model can
be developed using the bivariate output-money example in (4.7). This avoids the
cumbersome notation associated with the n-equation model and arbitrary covari-
ance restrictions. [See Hausman et al. (1987) for discussion of the general case.]
Assume that the model is identified by linear restrictions on the coefficients of
A_0, A_1, ..., A_p and the restriction that E(ε_{1,t}ε_{2,t}) = 0. Let w_{1,t} denote the variables
appearing on the right hand side of (4.7a) after the restrictions on the structural
coefficients have been solved out, and let δ_1 denote the corresponding coefficients.
Thus, if a_{12,0} = 0 is the only coefficient restriction in (4.7a), then only lags of y_t
appear in the equation and w_{1,t} = (y_{t-1}', y_{t-2}', ..., y_{t-p}')'. If the long-run neutrality
assumption Σ_{i=0}^{p} a_{12,i} = 0 is imposed in (4.7a), then w_{1,t} = (y_{1,t-1}, y_{1,t-2}, ..., y_{1,t-p},
Δy_{2,t}, Δy_{2,t-1}, ..., Δy_{2,t-p+1}).⁴⁶ Defining w_{2,t} and δ_2 analogously for equation
(4.7b), the model can be written as

y_{1,t} = w_{1,t}'δ_1 + ε_{1,t},                                        (4.14a)

y_{2,t} = w_{2,t}'δ_2 + ε_{2,t},                                        (4.14b)

and the GMM moment equations are:

E(z_tε_{1,t}) = 0,                                                    (4.15a)

E(z_tε_{2,t}) = 0,                                                    (4.15b)

E(ε_{1,t}ε_{2,t}) = 0,                                                 (4.15c)

E(ε_{i,t}² - σ_i²) = 0,   i = 1, 2.                                    (4.15d)

The sample analogues of (4.15a)-(4.15c) determine the estimators δ̂_1 and δ̂_2, while
(4.15d) determines σ̂_1² and σ̂_2² as sample averages of sums of squared residuals.
Since the estimators of σ_1² and σ_2² are standard, we focus on (4.15a)-(4.15c) and
the resulting estimators of δ_1 and δ_2. Let u_t = (ε_{1,t}z_t', ε_{2,t}z_t', ε_{1,t}ε_{2,t})' and ū = T^{-1} Σ u_t
denote the sample values of the second moments in (4.15a)-(4.15c). Then the GMM
estimators, δ̂_1 and δ̂_2, are the values of δ_1 and δ_2 that minimize

⁴⁶If a_{12}(L) = Σ_{i=0}^{p} a_{12,i}L^i and a_{12}(1) = 0, then a_{12}(L)y_{2,t} = a*_{12}(L)(1 - L)y_{2,t} = a*_{12}(L)Δy_{2,t}, where
a*_{12}(L) = Σ_{i=0}^{p-1} a*_{12,i}L^i with a*_{12,i} = -Σ_{j=i+1}^{p} a_{12,j}. The discussion that follows assumes the linear
restrictions on the structural coefficients are homogeneous (or zero). As usual, the only change required
for nonhomogeneous (or nonzero) linear restrictions is a redefinition of the dependent variable.

[T ū]' Σ̂^{-1} [T ū],                                                 (4.16)

where Σ̂ is a consistent estimator of E(u_tu_t').⁴⁷ These estimators have a simple
GLS or instrumental variable interpretation. To see this, let Z = (z_1 z_2 ... z_T)' denote
the T × 2p matrix of instruments; let W_1 = (w_{1,1} w_{1,2} ... w_{1,T})' and W_2 = (w_{2,1} w_{2,2} ...
w_{2,T})' denote the T × k_1 and T × k_2 matrices of right hand side variables; finally,
let Y_1, Y_2, ε_1 and ε_2 denote the T × 1 vectors composed of y_{1,t}, y_{2,t}, ε_{1,t} and ε_{2,t},
respectively. Multiplying equations (4.14a) and (4.14b) by z_t and summing yields

Z'Y_1 = (Z'W_1)δ_1 + Z'ε_1,                                            (4.17a)

Z'Y_2 = (Z'W_2)δ_2 + Z'ε_2.                                            (4.17b)

Now, letting ε̃_i = Y_i - W_iδ̃_i for some δ̃_i,

ε̃_1'ε̃_2 + ε̃_2'W_1δ̃_1 + ε̃_1'W_2δ̃_2 = (ε̃_2'W_1)δ_1 + (ε̃_1'W_2)δ_2 + ε_1'ε_2 + quadratic terms.
                                                                      (4.17c)

Stacking equations (4.17a)-(4.17c) and ignoring the quadratic terms in (4.17c) yields

Q = P_1δ_1 + P_2δ_2 + V,                                              (4.18)

where Q = [(Z'Y_1)|(Z'Y_2)|(ε̃_1'ε̃_2 + ε̃_2'W_1δ̃_1 + ε̃_1'W_2δ̃_2)], P_1 = [(Z'W_1)|0_{2p×k_1}|(ε̃_2'W_1)],
P_2 = [0_{2p×k_2}|(Z'W_2)|(ε̃_1'W_2)], and V = [(Z'ε_1)|(Z'ε_2)|(ε_1'ε_2)], and where | denotes
vertical concatenation (stacking). By inspection V = Tū from (4.16). Thus when
Q, P_1 and P_2 are evaluated at δ̃_1 = δ̂_1 and δ̃_2 = δ̂_2, the GMM estimators coincide
with the GLS estimators from (4.18). This means that the GMM estimators can
be formed by iterative GLS estimation of equation (4.18), updating δ̃_1 and δ̃_2 at
each iteration and using T^{-1} Σ û_tû_t' as the GLS covariance matrix.
Hausman et al. (1987) show that the resulting GMM estimators of δ_1, δ_2, σ_1²
and σ_2² are jointly asymptotically normally distributed when the vectors (z_t' ε_t')' are
independently distributed and standard regularity conditions hold. These results
are readily extended to the structural VAR when the roots of Φ(z) are outside the
unit circle, so that the data are covariance stationary. Expressions for the asymptotic
variance of the GMM estimators are given in their paper. When some of the vari-
ables in the model are integrated, the asymptotic distribution of the estimators
changes in a way like that discussed in Section 2. This issue does not seem to have
been studied explicitly in the structural VAR model, although such an analysis
would seem to be reasonably straightforward.⁴⁸

⁴⁷When elements of u_t and u_τ are correlated for t ≠ τ, Σ̂ is replaced by a consistent estimator of the
limiting value of the variance of T^{1/2}ū.
⁴⁸Instrumental variable estimators constructed from possibly integrated regressors and instruments
are discussed in Phillips and Hansen (1990).

The paper by Hausman et al. (1987) also discusses the relationship between efficient
GMM estimators and the FIML estimator constructed under the assumption that
the errors are normally distributed. It shows that the FIML estimator can be
written as the solution to (4.16), using a specific estimator of Σ appropriate under
the normality assumption. In particular, FIML uses a block diagonal estimator of
Σ, since E[(ε_{1,t}ε_{2,t})(ε_{1,t}z_t')] = E[(ε_{1,t}ε_{2,t})(ε_{2,t}z_t')] = 0 when the errors are normally
distributed. When the errors are not normally distributed, this estimator of Σ
may be inconsistent, leading to a loss of efficiency in the FIML estimator relative
to the efficient GMM estimator.
Estimation is simplified when there are no overidentifying restrictions. In this
case, iteration is not required, and the GMM estimators can be constructed as
instrumental variable (IV) estimators. When the model is just identified, only one
restriction is imposed on the coefficients in equation (4.7). This implies that one of
the vectors δ_1 or δ_2 is 2p × 1, while the other is (2p + 1) × 1, and (4.17) is a set of
4p + 1 linear equations in 4p + 1 unknowns. Suppose, without loss of generality,
that δ_1 is 2p × 1. Then δ̂_1 is determined from (4.17a) as δ̂_1 = (Z'W_1)^{-1}(Z'Y_1), which
is the usual IV estimator of equation (4.14a) using z_t as instruments. Using this
value for δ̃_1 in (4.17c), and noting that Y_2 = W_2δ_2 + ε_2, equation (4.17c) becomes

ε̂_1'Y_2 = (ε̂_1'W_2)δ_2 + ε̂_1'ε_2,                                    (4.19)

where ε̂_1 = Y_1 - W_1δ̂_1 is the residual from the first equation. The GMM estimator
of δ_2 is formed by solving (4.17b) and (4.19) for δ̂_2. This can be recognized as the
IV estimator of equation (4.14b) using z_t and the residual from (4.14a) as instru-
ments. The residual is a valid instrument because of the covariance restriction
(4.15c).⁴⁹
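In the just identified case the two-step calculation is a few lines of linear algebra; a minimal Python sketch with the data matrices defined as in the text (names are assumptions):

    import numpy as np

    def two_step_iv(Y1, Y2, W1, W2, Z):
        # Step 1: IV estimate of (4.14a), delta1-hat = (Z'W1)^{-1}(Z'Y1), with W1 T x 2p.
        d1 = np.linalg.solve(Z.T @ W1, Z.T @ Y1)
        e1 = Y1 - W1 @ d1                        # first equation residual
        # Step 2: IV estimate of (4.14b) using [Z, e1] as instruments; e1 is a valid
        # instrument by the covariance restriction (4.15c).
        Z2 = np.column_stack([Z, e1])
        d2 = np.linalg.solve(Z2.T @ W2, Z2.T @ Y2)
        return d1, d2, e1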
In many structural VAR exercises, the impulse response functions and variance
decompositions defined in Section 4.2 are of more interest than the parameters of
the structural VAR. Since C(L) = A(L)^{-1}, the moving average parameters/impulse
responses and the variance decompositions are differentiable functions of the
structural VAR parameters. The continuous mapping theorem directly yields the
asymptotic distribution of these parameters from the distribution of the structural
VAR parameters. Formulae for the resulting covariance matrix can be determined
by delta method calculations. Convenient formulae for these covariance matrices
can be found in Lutkepohl (1990), Mittnik and Zadrozny (1993) and Hamilton
(1994).

⁴⁹While this instrumental variables scheme provides a simple way to compute the GMM estimator
using standard computer software, the covariance matrix of the estimators constructed using the usual
formula will not be correct. Using ε̂_{1,t} as an instrument introduces "generated regressor" complications
familiar from Pagan (1984). Corrections for the standard formula are provided in King and Watson
(1993). An alternative approach is to carry out one GMM iteration using the IV estimates as starting
values. The point estimates will remain unchanged, but standard GMM software will compute a
consistent estimator of the correct covariance matrix. The usefulness of residuals as instruments is
discussed in more detail in Hausman (1983), Hausman and Taylor (1983) and Hausman et al. (1987).

Many applied researchers have instead relied on Monte Carlo methods for
estimating standard errors of estimated impulse responses and variance decom-
positions. Runkle (1987) reports on experiments comparing the small sample
accuracy of the estimators. He concludes that the delta method provides reasonably
accurate estimates of the standard errors for the impulse responses, and the resulting
confidence intervals have approximately the correct coverage. On the other hand,
delta method confidence intervals for the variance decompositions are often
unsatisfactory. This undoubtedly reflects the [0, l] bounded support of the variance
decompositions and the unbounded support of the delta method normal approxi-
mation.

References

Ahn, S.K. and G.C. Reinsel (1990) Estimation for Partially Nonstationary Multivariate Autoregressive Models,
Journal of the American Statistical Association, 85, 813-823.
Anderson, T.W. (1951) Estimating Linear Restrictions on Regression Coefficients for Multivariate
Normal Distributions, Annals of Mathematical Statistics, 22, 327-351.
Anderson, T.W. (1984) An Introduction to Multivariate Statistical Analysis, 2nd Edition. Wiley:
New York.
Andrews, D.W.K. and J.C. Monahan (1990) An Improved Heteroskedasticity and Autocorrelation
Consistent Covariance Matrix Estimator, Cowles Foundation Discussion Paper No. 942, Yale
University.
Banerjee, A., J.J. Dolado, D.F. Hendry and G.W. Smith (1986) Exploring Equilibrium Relationships
in Econometrics through Static Models: Some Monte Carlo Evidence, Oxford Bulletin of Economics
and Statistics, 48(3), 253-270.
Banerjee, A., J. Dolado, J.W. Galbraith and D.F. Hendry (1993) Co-Integration, Error-Correction, and
the Econometric Analysis of Non-Stationary Data. Oxford University Press: Oxford.
Basawa, I.V. and D.J. Scott (1983) Asymptotic Optimal Inference for Nonergodic Models. Springer
Verlag: New York.
Berk, K.N. (1974) Consistent Autoregressive Spectral Estimates, Annals of Statistics, 2, 489-502.
Bernanke, B. (1986) Alternative Explanations of the Money-Income Correlation, Carnegie-Rochester
Conference Series on Public Policy. Amsterdam: North-Holland Publishing Company.
Beveridge, S. and C.R. Nelson (1981) A New Approach to Decomposition of Time Series in Permanent
and Transitory Components with Particular Attention to Measurement of the Business Cvcle.
Journal af Monetary konomics, 7, 15 l-74.
Blanchard, O.J. and D. Quah (1989) The Dynamic Effects of Aggregate Demand and Supply
Disturbances, American Economic Review, 79, 655-73.
Blanchard, O.J. and M.W. Watson (1986) Are Business Cycles All Alike?, in: R. Gordon, ed., The
American Business Cycle: Continuity and Change. Chicago: University of Chicago Press, 123-179.
Bobkosky, M.J. (1983) Hypothesis Testing in Nonstationary Time Series, Ph.D. thesis, Department of
Statistics, University of Wisconsin.
Brillinger, D.R. (1980) Time Series, Data Analysis and Theory. Expanded Edition, Holden-Day: San
Francisco.
Campbell, J.Y. (1990) Measuring the Persistence of Expected Returns, American Economic Review,
80(2), 43-47.
Campbell, J.Y. and P. Perron (1991) Pitfalls and Opportunities: What Macroeconomists Should Know
about Unit Roots, NBER Macroeconomics Annual. MIT Press: Cambridge, Mass.
Campbell, J.Y. and R.J. Shiller (1987) Cointegration and Tests of Present Value Models, Journal of
Political Economy, 95, 1062-1088. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run
Economic Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Canova, Fabio (1991) Vector Autoregressive Models: Specification Estimation and Testing, manuscript,
Brown University.

Cavanagh, C.L. (1985) Roots Local to Unity, manuscript, Department of Economics, Harvard University.
Cavanagh, C.L. and J.H. Stock (1985) Inference in Econometric Models with Nearly Nonstationary
Regressors, manuscript, Kennedy School of Government, Harvard University.
Chan, N.H. (1988) On Parameter Inference for Nearly Nonstationary Time Series, Journal of the
American Statistical Association, 83, 857-62.
Chan, N.H. and C.Z. Wei (1987) Asymptotic Inference for Nearly Nonstationary AR(1) Processes,
The Annals of Statistics, 15, 1050-63.
Chan, N.H. and C.Z. Wei (1988) Limiting Distributions of Least Squares Estimates of Unstable Auto-
regressive Processes, The Annals of Statistics, 16(1), 367-401.
Cochrane, J.H. (1994) Permanent and Transitory Components of GNP and Stock Prices, Quarterly
Journal of Economics, 109, 241-266.
Cochrane, J.H. and A.M. Sbordone (1988) Multivariate Estimates of the Permanent Components of
GNP and Stock Prices, Journal of Economic Dynamics and Control, 12, 255-296.
Davidson, J.E.H., D.F. Hendry, F. Srba and S. Yeo (1978) Econometric Modelling of the Aggregate Time-
Series Relationship Between Consumers' Expenditure and Income in the United Kingdom, Economic
Journal, 88, 661-692.
Davies, R.B. (1977) Hypothesis Testing When a Parameter is Present Only Under the Alternative,
Biometrika, 64, 247-54.
Davies, R.B. (1987) Hypothesis Testing When a Parameter is Present Only Under the Alternative,
Biometrika, 74, 33-43.
Dickey, D.A. and W.A. Fuller (1979) Distribution of the Estimators for Autoregressive Time Series
with a Unit Root, Journal of the American Statistical Association, 74, 427-31.
Elliot, G. and J.H. Stock (1992) Inference in Time Series Regressions when there is Uncertainty about
Whether a Regressor Contains a Unit Root, manuscript, Harvard University.
Elliot, G., T.J. Rothenberg and J.H. Stock (1992) Efficient Tests of an Autoregressive Unit Root, NBER
Technical Working Paper 130.
Engle, R.F. (1976) Band Spectrum Regression, International Economic Review, 15, 1-11.
Engle, R.F. (1984) Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics, in:
Z. Griliches and M. Intriligator, eds., Handbook of Econometrics. North-Holland: New York, Vol. 2,
775-826.
Engle, R.F. and C.W.J. Granger (1987) Cointegration and Error Correction: Representation,
Estimation, and Testing, Econometrica, 55, 251-276. Reprinted in: R.F. Engle and C.W.J. Granger,
eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York,
1991.
Engle, R.F. and B.S. Yoo (1987) Forecasting and Testing in Cointegrated Systems, Journal of
Econometrics, 35, 143-59. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic
Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Engle, R.F. and B.S. Yoo (1991) Cointegrated Economic Time Series: An Overview with New Results,
in R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in Cointegration.
Oxford University Press: New York.
Engle, R.F., D.F. Hendry, and J.F. Richard (1983) Exogeneity, Econometrica, 51(2), 277-304.
Fama, E.F. and K.R. French (1988) Permanent and Transitory Components of Stock Prices, Journal
of Political Economy, 96, 246-73.
Faust, J. and E.M. Leeper (1993) Do Long-Run Identifying Restrictions Identify Anything?, manuscript,
Board of Governors of the Federal Reserve System.
Fisher, F. (1966) The Identification Problem in Econometrics. New York: McGraw-Hill.
Fisher, M.E. and J.J. Seater (1993) Long-Run Neutrality and Superneutrality in an ARIMA Frame-
work, American Economic Review, 83(3), 402-415.
Fountis, N.G. and D.A. Dickey (1986) Testing for a Unit Root Nonstationarity in Multivariate Time
Series, manuscript, North Carolina State University.
Fuller, W.A. (1976) Introduction to Statistical Time Series. New York: Wiley.
Gali, J. (1992) How Well does the IS-LM Model Fit Postwar U.S. Data?, Quarterly Journal of
Economics, 107, 709-738.
Geweke, J. (1986) The Superneutrality of Money in the United States: An Interpretation of the
Evidence, Econometrica, 54, 1-21.
Gianini, C. (1991) Topics in Structural VAR Econometrics, manuscript, Department of Economics,
Universita Degli Studi Di Ancona.

Gonzalo, J. (1989) Comparison of Five Alternative Methods of Estimating Long Run Equilibrium
Relationships, manuscript, UCSD.
Granger, C.W.J. (1969) Investigating Causal Relations by Econometric Models and Cross-Spectral
Methods, Econometrica, 37, 424-38.
Granger, C.W.J. (1983) Co-Integrated Variables and Error-Correcting Models, UCSD Discussion Paper
83-13.
Granger, C.W.J. and A.P. Andersen (1978) An Introduction to Bilinear Time Series Models. Vandenhoeck
& Ruprecht: Gottingen.
Granger, C.W.J. and T.-H. Lee (1990) Multicointegration, Advances in Econometrics, 8, 71-84.
Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic Relations: Readings in
Cointegration. Oxford University Press: New York, 1991.
Granger, C.W.J. and P. Newbold (1974) Spurious Regressions in Econometrics, Journal of Econo-
metrics, 2, 111-20.
Granger, C.W.J. and P. Newbold (1976) Forecasting Economic Time Series. Academic Press: New York.
Granger, C.W.J. and A.A. Weiss (1983) Time Series Analysis of Error-Correcting Models, in: Studies
in Econometrics, Time Series and Multivariate Statistics. Academic Press: New York, 255-78.
Hall, R.E. (1978) Stochastic Implications of the Life Cycle - Permanent Income Hypothesis: Theory
and Evidence, Journal of Political Economy, 86(6), 971-87.
Hamilton, J.D. (1994) Time Series Analysis. Princeton University Press: Princeton, NJ.
Hannan, E.J. (1970) Multiple Time Series. Wiley: New York.
Hansen, B.E. (1988) Robust Inference in General Models of Cointegration, manuscript, Yale University.
Hansen, B.E. (1990a) A Powerful, Simple Test for Cointegration Using Cochrane-Orcutt, Working
Paper No. 230, Rochester Center for Economic Research.
Hansen, B.E. (1990b) Inference When a Nuisance Parameter is Not Identified Under the Null
Hypothesis, manuscript, University of Rochester.
Hansen, B.E. and P.C.B. Phillips (1990) Estimation and Inference in Models of Cointegration: A
Simulation Study, Advances in Econometrics, 8, 225-248.
Hansen, L.P. (1982) Large Sample Properties of Generalized Method of Moments Estimators,
Econometrica, 50, 1029-54.
Hansen, L.P. and T.J. Sargent (1991) Two Problems in Interpreting Vector Autoregressions, in:
L. Hansen and T. Sargent, eds., Rational Expectations Econometrics. Westview: Boulder.
Hausman, J.A. (1983) Specification and Estimation of Simultaneous Equation Models, in: Z. Griliches
and M. Intriligator, eds., Handbook of Econometrics. North-Holland: New York, Vol. 1, 391-448.
Hausman, J.A. and W.E. Taylor (1983) Identification in Linear Simultaneous Equations Models with
Covariance Restrictions: An Instrumental Variables Interpretation, Econometrica, 51(5), 1527-50.
Hausman, J.A., W.K. Newey and W.E. Taylor (1987) Efficient Estimation and Identification of
Simultaneous Equation Models with Covariance Restrictions, Econometrica, 55(4), 849-874.
Hendry, D.F. and T. von Ungern-Sternberg (1981) Liquidity and Inflation Effects on Consumers'
Expenditure, in: A.S. Deaton, ed., Essays in the Theory and Measurement of Consumers' Behavior.
Cambridge University Press: Cambridge.
Hodrick, R.J. (1992) Dividend Yields and Expected Stock Returns: Alternative Procedures for Inference
and Measurement, The Review of Financial Studies, 5(3), 357-86.
Horvath, M. and M. Watson (1992) Critical Values for Likelihood Based Tests for Cointegration When
Some Cointegrating Vectors May be Known, manuscript, Northwestern University.
Horvath, M. and M.W. Watson (1993) Testing for Cointegration When Some of the Cointegrating
Vectors are Known, manuscript, Northwestern University.
Johansen, S. (1988a) Statistical Analysis of Cointegrating Vectors, Journal of Economic Dynamics
and Control, 12, 231-54. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic
Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Johansen, S. (1988b) The Mathematical Structure of Error Correction Models, in: N.U. Prabhu, ed.,
Contemporary Mathematics, vol. 80: Structural Inference from Stochastic Processes. American
Mathematical Society: Providence, RI.
Johansen, S. (1991) Estimation and Hypothesis Testing of Cointegrating Vectors in Gaussian Vector
Autoregression Models, Econometrica, 59, 1551-1580.
Johansen, S. (1992a) The Role of the Constant Term in Cointegration Analysis of Nonstationary
Variables, Preprint No. 1, Institute of Mathematical Statistics, University of Copenhagen.
Johansen, S. (1992b) Determination of Cointegration Rank in the Presence of a Linear Trend, Oxford
Bulletin of Economics and Statistics, 54, 383-397.

Johansen, S. (1992c) A Representation of Vector Autoregressive Processes Integrated of Order 2,
Econometric Theory, 8(2), 188-202.
Johansen, S. and K. Juselius (1990) Maximum Likelihood Estimation and Inference on Cointegration -
with Applications to the Demand for Money, Oxford Bulletin of Economics and Statistics, 52(2),
169-210.
Johansen, S. and K. Juselius (1992) Testing Structural Hypotheses in a Multivariate Cointegration
Analysis of the PPP and the UIP for UK, Journal of Econometrics, 53, 211-44.
Keating, J. (1990) Identifying VAR Models Under Rational Expectations, Journal of Monetary
Economics, 25(3), 453-76.
King, R.G. and M.W. Watson (1993) Testing for Neutrality, manuscript, Northwestern University.
King, R.G., C.I. Plosser, J.H. Stock and M.W. Watson (1991) Stochastic Trends and Economic
Fluctuations, American Economic Review, 81, 819-840.
Kosobud, R. and L. Klein (1961) Some Econometrics of Growth: Great Ratios of Economics, Quarterly
Journal of Economics, 75, 173-98.
Lippi, M. and L. Reichlin (1993) The Dynamic Effects of Aggregate Demand and Supply Disturbances:
Comment, American Economic Review, 83(3), 644-52.
Lucas, R.E. (1972) Econometric Testing of the Natural Rate Hypothesis, in: O. Eckstein, ed., The
Econometrics of Price Determination. Washington, D.C.: Board of Governors of the Federal Reserve
System.
Lucas, R.E. (1988) Money Demand in the United States: A Quantitative Review, Carnegie-Rochester
Conference Series on Public Policy, 29, 137-68.
Lutkepohl, H. (1990) Asymptotic Distributions of Impulse Response Functions and Forecast Error
Variance Decompositions of Vector Autoregressive Models, Review of Economics and Statistics, 72,
116-25.
MacKinnon, J.G. (1991) Critical Values for Cointegration Tests, in: R.F. Engle and C.W.J. Granger,
eds., Long-Run Economic Relations: Readings in Cointegration. Oxford University Press: New York.
Magnus, J.R. and H. Neudecker (1988) Matrix Differential Calculus. Wiley: New York.
Malinvaud, E. (1980) Statistical Methods of Econometrics. Amsterdam: North-Holland.
Mankiw, N.G. and M.D. Shapiro (1985) Trends, Random Walks and the Permanent Income Hypothesis,
Journal of Monetary Economics, 16, 165-74.
Mittnik, S. and P.A. Zadrozny (1993) Asymptotic Distributions of Impulse Responses, Step
Responses and Variance Decompositions of Estimated Linear Dynamic Models, Econometrica,
61, 857-70.
Ogaki, M. and J.Y. Park (1990) A Cointegration Approach to Estimating Preference Parameters,
manuscript, University of Rochester.
Osterwald-Lenum, M. (1992) A Note with Quantiles of the Asymptotic Distribution of the Maximum
Likelihood Cointegration Rank Test Statistics, Oxford Bulletin of Economics and Statistics, 54,
461-71.
Pagan, A. (1984) Econometric Issues in the Analysis of Regressions with Generated Regressors,
International Economic Review, 25, 221-48.
Park, J.Y. (1992) Canonical Cointegrating Regression, Econometrica, 60(1), 119-144.
Park, J.Y. and M. Ogaki (1991) Inference in Cointegrated Models Using VAR Prewhitening to Estimate
Shortrun Dynamics, Rochester Center for Economic Research Working Paper No. 281.
Park, J.Y. and P.C.B. Phillips (1988) Statistical Inference in Regressions with Integrated Regressors
I, Econometric Theory, 4, 468-97.
Park, J.Y. and P.C.B. Phillips (1989) Statistical Inference in Regressions with Integrated Regressors:
Part 2, Econometric Theory, 5, 95-131.
Phillips, P.C.B. (1986) Understanding Spurious Regression in Econometrics, Journal of Econometrics,
33, 311-40.
Phillips, P.C.B. (1987a) Time Series Regression with a Unit Root, Econometrica, 55, 277-301.
Phillips, P.C.B. (1987b) Toward a Unified Asymptotic Theory for Autoregression, Biometrika, 74,
535-47.
Phillips, P.C.B. (1988) Multiple Regression with Integrated Regressors, Contemporary Mathematics,
80, 79-105.
Phillips, P.C.B. (1991a) Optimal Inference in Cointegrated Systems, Econometrica, 59(2), 283-306.
Phillips, P.C.B. (1991b) Spectral Regression for Cointegrated Time Series, in: W. Barnett, ed.,
Nonparametric and Semiparametric Methods in Economics and Statistics. Cambridge University Press:
Cambridge, 413-436.

Phillips, P.C.B. (1991c) The Tail Behavior of Maximum Likelihood Estimators of Cointegrating
Coefficients in Error Correction Models, manuscript, Yale University.
Phillips, P.C.B. (1991d) To Criticize the Critics: An Objective Bayesian Analysis of Stochastic Trends,
Journal of Applied Econometrics, 6, 333-64.
Phillips, P.C.B. and S.N. Durlauf (1986) Multiple Time Series Regression with Integrated Processes,
Review of Economic Studies, 53, 473-96.
Phillips, P.C.B. and B.E. Hansen (1990) Statistical Inference in Instrumental Variables Regression
with I(1) Processes, Review of Economic Studies, 57, 99-125.
Phillips, P.C.B. and M. Loretan (1991) Estimating Long Run Economic Equilibria, Review of Economic
Studies, 58, 407-36.
Phillips, P.C.B. and S. Ouliaris (1990) Asymptotic Properties of Residual Based Tests for Cointegration,
Econometrica, 58, 165-94.
Phillips, P.C.B. and J.Y. Park (1988) Asymptotic Equivalence of OLS and GLS in Regression with
Integrated Regressors, Journal of the American Statistical Association, 83, 111-115.
Phillips, P.C.B. and P. Perron (1988) Testing for a Unit Root in Time Series Regression, Biometrika,
75, 335-46.
Phillips, P.C.B. and W. Ploberger (1991) Time Series Modeling with a Bayesian Frame of Reference:
I. Concepts and Illustrations, manuscript, Yale University.
Phillips, P.C.B. and V. Solo (1992) Asymptotics for Linear Processes, Annals of Statistics, 20, 971-
1001.
Quah, D. (1986) Estimation and Hypothesis Testing with Restricted Spectral Density Matrices: An
Application to Uncovered Interest Parity, Chapter 4 of Essays in Dynamic Macroeconometrics,
Ph.D. Dissertation, Harvard University.
Rothenberg, T.J. (1971) Identification in Parametric Models, Econometrica, 39, 577-92.
Rozanov, Y.A. (1967) Stationary Random Processes. San Francisco: Holden Day.
Runkle, D. (1987) Vector Autoregressions and Reality, Journal of Business and Economic Statistics,
5(4), 437-42.
Said, S.E. and D.A. Dickey (1984) Testing for Unit Roots in Autoregressive-Moving Average Models
of Unknown Order, Biometrika, 71, 599-608.
Saikkonen, P. (1991) Asymptotically Efficient Estimation of Cointegrating Regressions, Econometric
Theory, 7(1), 1-21.
Saikkonen, P. (1992) Estimation and Testing of Cointegrated Systems by an Autoregressive Approxi-
mation, Econometric Theory, 8(1), 1-27.
Sargan, J.D. (1964) Wages and Prices in the United Kingdom: A Study in Econometric Methodology,
in: P.E. Hart, G. Mills and J.N. Whittaker, eds., Econometric Analysis for National Economic Planning.
London: Butterworths.
Sargent, T.J. (1971) A Note on the Accelerationist Controversy, Journal of Money, Credit and
Banking, 3, 50-60.
Shapiro, M. and M.W. Watson (1988) Sources of Business Cycle Fluctuations, NBER Macroeconomics
Annual, 3, 111-56.
Sims, C.A. (1972) Money, Income and Causality, American Economic Review, 62, 540-552.
Sims, C.A. (1978) Least Squares Estimation of Autoregressions with Some Unit Roots, University of
Minnesota, Discussion Paper No. 78-95.
Sims, C.A. (1980) Macroeconomics and Reality, Econometrica, 48, 1-48.
Sims, C.A. (1986) Are Forecasting Models Usable for Policy Analysis?, Quarterly Review, Federal
Reserve Bank of Minneapolis, Winter.
Sims, C.A. (1989) Models and Their Uses, American Journal of Agricultural Economics, 71, 489-494.
Sims, C.A., J.H. Stock and M.W. Watson (1990) Inference in Linear Time Series Models with Some
Unit Roots, Econometrica, 58(1), 113-44.
Solo, V. (1984) The Order of Differencing in ARIMA Models, Journal of the American Statistical
Association, 79, 916-21.
Stock, J.H. (1987) Asymptotic Properties of Least Squares Estimators of Cointegrating Vectors,
Econometrica, 55, 1035-56.
Stock, J.H. (1988) A Reexamination of Friedman's Consumption Puzzle, Journal of Business and
Economic Statistics, 6(4), 401-14.
Stock, J.H. (1991) Confidence Intervals of the Largest Autoregressive Root in U.S. Macroeconomic
Time Series, Journal of Monetary Economics, 28, 435-60.

Stock, J.H. (1992) Deciding Between I(0) and I(1), manuscript, Harvard University.
Stock, J.H. (1993) Forthcoming in: R.F. Engle and D. McFadden, eds., Handbook of Econometrics.
Vol. 4, North Holland: New York.
Stock, J.H. and M.W. Watson (1988a) Interpreting the Evidence on Money-Income Causality, Journal
of Econometrics, 40(1), 161-82.
Stock, J.H. and M.W. Watson (1988b) Testing for Common Trends, Journal of the American Statistical
Association, 83, 1097-1107. Reprinted in: R.F. Engle and C.W.J. Granger, eds., Long-Run Economic
Relations: Readings in Cointegration. Oxford University Press: New York, 1991.
Stock, J.H. and M.W. Watson (1993) A Simple Estimator of Cointegrating Vectors in Higher Order
Integrated Systems, Econometrica, 61, 783-820.
Stock, J.H. and K.D. West (1988) Integrated Regressors and Tests of the Permanent Income
Hypothesis, Journal of Monetary Economics, 21(1), 85-95.
Sweeting, T. (1983) On Estimator Efficiency in Stochastic Processes, Stochastic Processes and their
Applications, 15, 93-98.
Theil, H. (1971) Principles of Econometrics. Wiley: New York.
Toda, H.Y. and P.C.B. Phillips (1993a) Vector Autoregressions and Causality, Econometrica, 61(6),
1367-1393.
Toda, H.Y. and P.C.B. Phillips (1993b) Vector Autoregressions and Causality: A Theoretical Overview
and Simulation Study, Econometric Reviews, 12, 321-364.
Tsay, R.S. and G.C. Tiao (1990) Asymptotic Properties of Multivariate Nonstationary Processes with
Applications to Autoregressions, Annals of Statistics, 18, 220-50.
West, K.D. (1988) Asymptotic Normality when Regressors Have a Unit Root, Econometrica, 56,
1397-1418.
White, H. (1984) Asymptotic Theory for Econometricians. New York: Academic Press.
Whittle, P. (1983) Prediction and Regulation by Linear Least-Square Methods. Second Edition, Revised.
University of Minnesota Press: Minneapolis.
Wold, H. (1954) Causality and Econometrics, Econometrica, 22, 162-177.
Wooldridge, J. (1993) Forthcoming in: R.F. Engle and D. McFadden, eds., Handbook of Econometrics.
Vol. 4, North-Holland: New York.
Yoo, B.S. (1987) Co-Integrated Time Series Structure, Forecasting and Testing, Ph.D. Dissertation,
UCSD.
Yule, G.U. (1926) Why Do We Sometimes Get Nonsense-Correlations Between Time-Series?, Journal
of the Royal Statistical Society, 89, 1-64.
Chapter 48

ASPECTS OF MODELLING NONLINEAR TIME SERIES*

TIMO TERASVIRTA

Copenhagen Business School and Bank of Norway

DAG TJØSTHEIM

University of Bergen

CLIVE W.J. GRANGER

University of California

Contents

Abstract 2919
1. Introduction 2919
2. Types of nonlinear models 2921
2.1. Models from economic theory 2921
2.2. Models from time series theory 2922
2.3. Flexible statistical parametric models 2923
2.4. State-dependent, time-varying parameter and long-memory models 2924
2.5. Nonparametric models 2925
3. Testing linearity 2926
3.1. Tests against a specific alternative 2927
3.2. Tests without a specific alternative 2930
3.3. Constancy of conditional variance 2933
4. Specification of nonlinear models 2934
5. Estimation in nonlinear time series 2937
5.1. Estimation of parameters in parametric models 2937

*The work for this paper originated when TT and DT were visiting the University of California, San
Diego. They wish to thank the economics and mathematics departments, respectively, of UCSD for their
hospitality and John Rice and Murray Rosenblatt, in particular. The research of TT was also supported
by the University of Göteborg, Bank of Norway and a grant from the Yrjö Jahnsson Foundation. DT
acknowledges financial support from the Norwegian Council for Research and CWJG from NSF, Grant
SES 9023037.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

5.2. Estimation of nonparametric functions 2938


5.3. Estimation of restricted nonparametric and semiparametric models 2942
6. Evaluation of estimated models 2945
7. Example 2946
8. Conclusions 2952
References 2953

Abstract

This paper surveys some of the recent developments in nonlinear analysis of
economic time series. The emphasis lies on stochastic models. Various classes of
nonlinear models appearing in the economics and time series literature are
presented and discussed. Linearity testing and estimation of nonlinear models, both
parametric and nonparametric, are considered as well as post-estimation model
evaluation. Data-based nonlinear model building is illustrated with an empirical
example.

1. Introduction

It is common practice for economic theories to postulate nonlinear relationships
between economic variables, production functions being an example. If a theory
suggests a specific functional form, econometricians can propose estimation techni-
ques for the parameters, and asymptotic results about normality and consistency,
under given conditions, are known for these estimates, see, e.g. Judge et al. (1985),
White (1984) and Gallant (1987, Chapter 7). However, in many cases the theory
does not provide a single specification, or specifications are incomplete and may
not capture the major features of the actual data, such as trends, seasonality or the
dynamics. When this occurs, econometricians can try to propose more general speci-
fications and tests of them. There are clearly an immense number of possible
parametric nonlinear models and there are also many nonparametric techniques
for approximating them. Given the limited amount of data that is usually available
in economics it would not be appropriate to consider many alternative models or
to use many techniques. Because of the wide possibilities, the methods and models
available for analysing nonlinearities are usually very flexible so that they can
provide good approximations to many different generating mechanisms. A conse-
quence is that, with fairly small samples, the methods are inclined to over-fit, so
that if the true mechanism is linear, say, with residual variance σ², the fitted model
may appear to find nonlinearity and an estimated residual variance less than σ².
The estimated model will then be inclined to forecast badly in the post-sample
period. It is therefore necessary to have a specific research strategy for modelling
nonlinear relationships between time series. In this chapter the modelling process
concentrates on a particular situation, where there is a single dependent variable
y_t to be explained and x_t is a vector of exogenous variables. Let I_t be the informa-
tion set

I_t = {y_{t−j}, j = 1, 2,...; x_{t−j}, j = 0, 1,...}.     (1.1)

and denote all of the variables (and lags) used in I_t by w_t. The modelling process
will then attempt to find a satisfactory approximation for f(w_t) such that

E(y_t | I_t) = f(w_t).     (1.2)

If the error is

ε_t = y_t − f(w_t),

then in some cases a more parsimonious representation will specifically include
lagged ε's in f(·).
The strategy proposed is as follows.
(i) Test y_t for linearity, using the information I_t. As there are many possible forms
of nonlinearity it is likely that no one test will be powerful against them all,
so several tests may be needed.
(ii) If linearity is rejected, consider a small number of alternative parametric
models and/or nonparametric estimates. Linearity tests may give guidance
as to which kind of nonlinear models to consider.
(iii) These models should be estimated in-sample and compared out-of-sample.
The properties of the estimated models should be checked. If a single model
is required, the one that is best out-of-sample may be selected and re-
estimated over all available data.

The strategy is by no means guaranteed to be successful. For example, if the
nonlinearity is associated with a particular feature of the data, but if this feature
does not occur in the post-sample evaluation period, then the nonlinear model may
not perform any better than a linear model.
Section 2 of the chapter briefly considers some parametric models, Section 3
discusses tests of linearity, Section 4 reviews specification of nonlinear models,
Section 5 considers estimation and Section 6 evaluation of estimated models.
Section 7 contains an example and Section 8 concludes. This survey largely deals
with linearity in the conditional mean, which occurs if f(w_t) in (1.2) can be well
approximated by some linear combination φ′w_t of the components of w_t. It will
generally be assumed that w_t contains lagged values of y_t plus, possibly, present and
lagged values of x_t, including 1. This definition avoids the difficulty of deciding
whether or not processes having forms of heteroskedasticity that involve explanatory
or lagged variables, such as ARCH, are nonlinear. It is clear that some tests of
linearity will be confused by these types of heteroskedasticity. Recent surveys of
some of the topics considered here include Tong (1990) for univariate time series,
Delgado and Robinson (1992), Hardle (1990) and Tjøstheim (1994) for semi- and
nonparametric techniques, Brock and Potter (1993) for linearity testing and
Granger and Terasvirta (1993).
There has recently been a lot of interest, particularly by economic theorists, in

chaotic processes, which are deterministic series which have some of the linear
properties of familiar stochastic processes. A well known example is the tent map
y_t = 4y_{t−1}(1 − y_{t−1}), which, with a suitable starting value in (0, 1), generates a series
with all autocorrelations equal to zero and thus a flat spectrum, and so may be
called a white chaos, as a stochastic white noise also has these properties.
Economic theories can be constructed which produce such processes as discussed
in Chen and Day (1992). Econometricians, having a strong affiliation with stochastic
models, are unlikely to expect such models to be relevant in economics and, so far,
there is no evidence of actual economic data having been generated by a deterministic
mechanism. A difficulty is that there is no statistical test which has chaos as a null
hypothesis, so that non-rejection of the null could be claimed to be evidence in
favour of chaos. For a discussion and illustrations, see Liu et al. (1992). However,
a much-used linearity test has been proposed by Brock et al. (1987), based on chaos
theory, whose properties are discussed in Section 3.2.
The hope in using nonlinear models is that better explanations can be provided
of economic events and consequently better forecasts. If the economy were found
to be chaotic, and if the generating mechanism could be discovered using some
learning model, say, then forecasts would be effectively exact, without any error.

2. Types of nonlinear models

2.1. Models from economic theory

Theory can be used both to suggest possibly sensible nonlinear models and to take
into account some optimizing behaviour, with arbitrary assumed cost or utility
functions, to produce a model. An example is a relationship of the form

y_t = min(φ′w_t, θ′w_t) + ε_t,     (2.1)

so that y_t is the smaller of a pair of alternative linear combinations of the vector
of variables used to model y_t. This model arises from a disequilibrium analysis of
some simple markets, with the linear combinations representing supply and demand
curves; for more discussion see Quandt (1982) and Maddala (1986).
If we replace the min condition by another variable z_{t−d}, which may also be one
of the elements of w_t but not 1, we may have

y_t = φ′w_t + θ′w_t F(z_{t−d}) + ε_t,     (2.2)


where F(z_{t−d}) = 0, z_{t−d} ≤ c, and F(z_{t−d}) = 1, z_{t−d} > c. This is a switching regression
model with switching variable z_{t−d}, where d is the delay parameter; see Quandt
(1982). In univariate time series analysis (2.2) is called a two-regime threshold
autoregressive model; see, e.g. Tong (1990). Model (2.2) may be generalized by

assuming a continuum of regimes instead of only two. This can be done for instance
by defining

F(z_{t−d}) = [1 + exp{−γ(z_{t−d} − c)}]^{−1},  γ > 0,     (2.3)

in (2.2). Maddala (1977, p. 396) [see also Bacon and Watts (1971)] has already pro-
posed such a generalization which is here called a logistic smooth transition regres-
sion (LSTR) model. F may also have the form of a probability density rather than a
cumulative distribution function. In the univariate case this would correspond to
the exponential smooth transition autoregressive (ESTAR) model (Terasvirta, 1994)
or its well-known special case, the exponential autoregressive model (Haggan and
Ozaki, 1981). The transition variable may represent changing political or policy
regimes, high versus low inflation, upswings versus downswings of the business cycle
and so forth. These switching models or their smooth transition counterparts occur
frequently in theory which, for example, suggests changes in relationships when
there is idle production capacity versus otherwise or when unemployment is low
versus high. Aggregation considerations suggest that a smooth transition regression
model may often be more sensible than the abrupt change in (2.2).
Some theories lead to models that have also been suggested by time series
statisticians. An example is the bivariate nonlinear autoregressive model described
as a prey-predator model by Desai (1984) taking the form

Δy_{1t} = −a + b exp(y_{2t}),
Δy_{2t} = c − d exp(y_{1t}),

where y_{1t} is the logarithm of the share of wages in national income and y_{2t} is the
logarithm of the employment rate. Other examples can be found (Chen and Day,
1992). The fact that some models do arise from theory justifies their consideration
but it does not imply that they are necessarily superior to other models that
currently do not arise from economic theory.

2.2. Models from time series theory

The linear autoregressive, moving average and transfer function models have been
popular in the time series literature following the work by Box and Jenkins (1970)
and there are a variety of natural generalizations to nonlinear forms. If the
information set being considered is

I_t = {y_{t−j}, j = 1,...,q; x_{t−i}, i = 0,...,q},  q < ∞,

denote by ε_t the residual from y_t explained by I_t and let e_{kt} be the residual from x_{kt}
explained by I_t (excluding x_{kt} itself). The components of the models considered in
this section are nonlinear functions of components such as g(y_{t−j}), h(x_{k,t−i}), G(ε_{t−j}),

H(e_{k,t−i}) plus cross-products such as y_{t−j}x_{k,t−i}, y_{t−j}ε_{t−i}, x_{a,t−j}e_{b,t−i} or ε_{t−j}e_{k,t−i}. A
model would string together several such components, each with a parameter. For
a given specification, the model is linear in the parameters so they can be easily
estimated by OLS. The big questions are about the specification of the model; what
components, functions and lags to use. There are so many possible components and
combinations that the curse of dimensionality soon becomes apparent, so that
choices of specification have to be made. Several classes of models have been
considered. They include
(i) nonlinear autoregressive, involving only functions of the dependent variable.
Typically only simple mathematical functions have been considered (such as
sine or cosine, sign, modulus, integer powers, logarithm of modulus or ratios
of low order polynomials);
(ii) nonlinear transfer function models, using functions of the lagged dependent
variable and current and lagged explanatory variables, usually separately;
(iii) bilinear models, y_t = Σ_{j,k} β_{jk} y_{t−j} ε_{t−k} + similar terms involving products of a
component of x_t and a lagged residual of some kind. This can be thought
of as one equation of a multivariate bilinear system, as considered by
Stensholt and Tjøstheim (1987);
(iv) nonlinear moving averages, being sums of functions of lagged residuals ε_t, e_t;
(v) doubly stochastic models which contain the cross-products between lagged
y_t and current and lagged components of x_{kt} or a random parameter process
and are discussed in Tjøstheim (1986).
Most of the models are augmented by a linear autoregressive term. There has
been little consideration of mixtures of these models. Because of difficulty of analysis,
lags are often taken to be small. Specifying the lag structure in nonlinear models is
discussed in Section 4.
A number of results are available for some of these models, such as stability for
simple nonlinear autoregressive models (Lasota and Mackey, 1989), stationarity and
invertibility of bilinear models or the autocorrelation properties of certain bilinear
systems, but are often too complicated to be used in practice. To study stability or
invertibility of a specific model it is recommended that a long simulation be formed
and the properties of the resulting series be studied. There is not a lot of experience
with these models in a multivariate setting and little success in their use has been
reported. At present they cannot be recommended for use in preference to the
smooth transition regression model of the previous section or the more structured
models of the next section. A simple nonlinear autoregressive or bilinear model with
just a few terms may be worth considering from this group.

2.3. Flexible statistical parametric models

A number of important modelling procedures concentrate on models of the form

y_t = θ′w_t + Σ_{j=1}^{p} β_j φ_j(γ_j′w_t) + ε_t,     (2.4)

where w_t is a vector of past y_t values and past and present values of a vector of
explanatory variables x_t plus a constant. The first component of the model is linear
and the φ_j(x) are a set of specific functions in x, examples being:
(i) power series, φ_j(x) = x^j (x is generally not a lag of y);
(ii) trigonometric, φ_j(x) = sin x or cos x; (2.4) augmented by a quadratic term
w_t′Aw_t gives the flexible function forms discussed by Gallant (1981);
(iii) φ_j(x) = ψ(x) for all j, where ψ(x) is a squashing function such as a
probability density function or the logistic function ψ(x) = [1 + exp(−x)]^{−1}.
This is a neural network model, which has been used successfully in various
fields, especially as a learning model, see, e.g. White (1989) or Kuan and White
(1994);
(iv) if φ_j(x) is estimated nonparametrically, by a super-smoother, say, the
method is that of projection-pursuit, as briefly described in the next section.
The first three models are dense, in the sense that theorems exist suggesting that
any well-behaved function can be approximated arbitrarily well by a high enough
choice of p, the number of terms in the sum, for example Stinchcombe and White
(1989). In practice, the small sample sizes available in economics limit p to a small
number, say one or two, to keep the number of parameters to be estimated at a
reasonable level. In theory p should be chosen using some stopping criterion or
goodness-of-fit measure. In practice, a small, arbitrary value is usually chosen, or
some simple experimentation is undertaken. These models are sufficiently structured
to provide interesting and probably useful classes of nonlinear relationships in
practice. They are natural alternatives to nonparametric and semiparametric
models. A nonparametric model, as discussed in Section 2.5, produces an estimate
of a function at every point in the space of explanatory variables by using some
smoother, but not a specific parametric function. The distinction between parametric
and nonparametric estimators is not sharp, as methods using splines or neural nets
with an undetermined cut-off value indicate. This is the case, in particular, for the
restricted nonparametric models in Section 6.

2.4. State-dependent, time-varying parameter and long-memory models

Priestley (1988) has discussed a very general class of models for a system taking the
form

Y_t = μ(x_{t−1}) + Σ_{j=1}^{k} φ_j(x_{t−1}) Y_{t−j} + ε_t

(moving average terms can also be included) where Y_t is a k × 1 stochastic vector
and x_t is a state-variable consisting of x_t = (Y_t′, Y_{t−1}′,..., Y_{t−k+1}′)′ and which is
updated by a Markov system

x_{t+1} = h(x_t) + F(x_t)x_t + v_{t+1}.

Here the φ's and the components of the matrix F are general functions, which in
practice will be approximated by linear or low-order polynomials. Many of the
models discussed in Section 2.2 can be embedded in this form. It is clearly related
to the extended Kalman filter [see Anderson and Moore (1979)] and to time-varying
parametric ARMA models, where the parameters evolve according to some simple
AR model; see Granger and Newbold (1986, Chapter 10). For practical use various
approximations can be applied, but so far there is little actual use of these models
with multivariate economic series.
For most of the models considered in Section 2.2, the series are assumed to be
stationary, but this is not always a reasonable assumption in economics. In a linear
context many actual series are I(l), in that they need to be differenced in order to
become stationary, and some pairs of variables are cointegrated, in that they are
both I(1) but there exists a linear combination that is stationary. A start to
generalizing these concepts to nonlinear cases has been made by Granger and
Hallman (1991a,b). I( 1) is replaced by a long-memory concept and cointegration
by a possibly nonlinear attractor, so that yt, .x, are each long-memory but there is
a function g(x) such that y, - g(x,) is stationary. A nonparametric estimator for gp)
is proposed and an example provided.

2.5. Nonparametric models

Nonparametric modelling of time series does not require an explicit model but for
reference purposes it is assumed that there is the following model

y_t = f(y_{t−1}, x_{t−1}) + g(y_{t−1}, x_{t−1})ε_t,     (2.5)

where {y_t, x_t} are observed with {x_t} being exogenous, and where y_{t−1} = (y_{t−i_1},..., y_{t−i_p})
and x_{t−1} = (x_{t−j_1},..., x_{t−j_q}) are vectors of lagged variables, and {ε_t} is a sequence
of martingale differences with respect to the information set I_t = {y_{t−i}, i > 0; x_{t−i}, i > 0}.
The joint process {y_t, x_t} is assumed to be stationary and strongly mixing [cf.
Robinson (1983)]. The model formulation can be generalized to several variables
and the instantaneous transformation of exogenous variables. There has recently
been a surge of interest in nonparametric modelling; for references see, for instance,
Ullah (1989), Barnett et al. (1991) and Hardle (1990). The motivation is to approach
the data with as much flexibility as possible, not being restricted by the straitjacket
of a particular class of parametric models. However, more observations are needed
to obtain estimates of comparable variability. In econometric applications the two
primary quantities of interest are the conditional mean

M(y; x) = M(y_1,..., y_p; x_1,..., x_q)
        = E(y_t | y_{t−i_1} = y_1,..., y_{t−i_p} = y_p; x_{t−j_1} = x_1,..., x_{t−j_q} = x_q)     (2.6)

and the conditional variance

V(y; x) = var(y_t | y_{t−i_1} = y_1,..., y_{t−i_p} = y_p; x_{t−j_1} = x_1,..., x_{t−j_q} = x_q).     (2.7)

The conditional mean gives the optimal least squares predictor of y_t given lagged
values y_{t−i_1},..., y_{t−i_p}; x_{t−j_1},..., x_{t−j_q}. Derivatives of M(y; x) can also have economic
interpretations (Ullah, 1989) and can be estimated nonparametrically. The condi-
tional variance can be used to study volatility. For (2.5), M(y; x) = f(y, x) and
V(y; x) = σ²g²(y, x), where σ² = E(ε_t²). As pointed out in the introduction, this survey
mainly concentrates on M(y; x) while it is assumed that g(y; x) = 1.
A problem of nonparametric modelling in several dimensions is the curse of
dimensionality. As the number of lags and regressors increases, the number of
observations in a unit volume element of regressor space can become very small,
and it is difficult to obtain meaningful nonparametric estimates of (2.6) and (2.7).
Special methods have been designed to overcome this obstacle, and they will be
considered in Sections 4 and 5.3. Applying these methods often results in a model
which is an end product in that no further parametric modelling is necessary.
Another remedy to dimension difficulties is to apply semiparametric models.
These models usually assume linear and parametric dependence in some variables,
and nonparametric functional dependence in the rest. The estimation of such models
as well as restricted nonparametric ones will be considered in Section 5.3.

3. Testing linearity

When parametric nonlinear models are used for modelling economic relationships,
model specification is a crucial issue. Economic theory is often too vague to allow
complete specification of even a linear, let alone a nonlinear model. Usually at least
the specification of the lag structure has to be carried out using the available data.
As discussed in the introduction, the type of nonlinearity best suited for describing
the data may not be clear at the outset either. The first step of a specification strategy
for any type of nonlinear model should therefore consist of testing linearity. As
mentioned above, it may not be difficult at all to fit a nonlinear model to data from
a linear process, interpret the results and draw possibly erroneous conclusions. If the
time series are short that may sometimes be successfully done even in situations in
which the nonlinear model is not identified under the linearity hypothesis. There is
more statistical theory available for linear than nonlinear models and the parameter
estimation in the former models is generally simpler than in the latter. Finally,
multi-step forecasting with nonlinear models is more complicated than with linear
ones. Therefore the need for a nonlinear model should be considered before any
attempt at nonlinear modelling.

3.1. Tests against a specific alternative

Since the estimation of nonlinear models is generally more difficult than that of
linear models, it is natural to look for linearity tests which do not require estimation
of any nonlinear alternative. In cases where the model is not identified under the null
hypothesis of linearity, tests based on the estimation of the nonlinear alternative
would normally not even be available. The score or Lagrange multiplier principle
thus appears useful for the construction of linearity tests. In fact, many well-known
tests in the literature are Lagrange multiplier (LM) or LM type tests. Moreover,
some well-known tests, such as Tsay's (1986), which have been introduced as general
linearity tests without a specific nonlinear alternative in mind, can be interpreted
as LM tests against a particular nonlinear model. Other tests, not built upon the
LM principle, do exist and we shall mention some of them. Recent accounts of
linearity testing in nonlinear time series analysis include Brock and Potter (1993),
De Gooijer and Kumar (1992), Granger and Terasvirta (1993, Chapter 6) and Tong
(1990, Chapter 5). For small-sample comparisons of some of the tests, see Chan and
Tong (1986), Lee et al. (1993), Luukkonen et al. (1988a), Petruccelli (1990) and
Terasvirta et al. (1993).
Consider the following nonlinear model

y_t = φ′w_t + f(θ, w_t, v_t) + u_t,     (3.1)

where w_t = (1, y_{t−1},..., y_{t−p}, x_{t1},..., x_{tk})′, v_t = (u_{t−1},..., u_{t−s})′, u_t = g(θ, φ, σ, w_t, v_t)ε_t and
ε_t is a martingale difference process: E(ε_t | I_t) = 0, var(ε_t | I_t) = σ_ε², where I_t is as in (1.1).
It follows that E(u_t | I_t) = 0 and var(u_t | I_t) = σ_ε² g(θ, φ, σ, w_t, v_t). Assume that f is at
least twice continuously differentiable with respect to the parameters θ = (θ_1,..., θ_m)′.
Let f(0, w_t, v_t) ≡ 0, so that the linearity hypothesis becomes H_0: θ = 0.
Here we shall concentrate on the case g ≡ 1 so that u_t ≡ ε_t. To test the linearity
hypothesis write the conditional (pseudo) logarithmic likelihood function as

l_T = c − (T/2) log σ_ε² − (1/(2σ_ε²)) Σ_{t=1}^{T} [y_t − φ′w_t − f(θ, w_t, v_t)]².

The relevant block of the score vector scaled by 1/√T becomes

(1/√T) ∂l_T/∂θ = (1/(σ_ε²√T)) Σ_{t=1}^{T} [y_t − φ′w_t − f(θ, w_t, v_t)] h_t,  h_t = ∂f(θ, w_t, v_t)/∂θ.


This is the block that is nonzero under the null hypothesis. The information matrix
is block diagonal such that the diagonal element conforming to σ_ε² builds a separate
block. Thus the inverse of the block related to θ and evaluated at H_0 becomes

σ_ε² [Σ_{t=1}^{T} ĥ_t ĥ_t′ − (Σ_{t=1}^{T} ĥ_t w_t′)(Σ_{t=1}^{T} w_t w_t′)^{−1}(Σ_{t=1}^{T} w_t ĥ_t′)]^{−1},

where ĥ_t is h_t evaluated at H_0; see, e.g. Granger and Terasvirta (1993, Chapter 6).
Setting ũ = (ũ_1,..., ũ_T)′, the test statistic, in obvious notation, has the form

LM = σ̃^{−2} ũ′H(H′M_W H)^{−1}H′ũ,     (3.2)

where M_W = I − W(W′W)^{−1}W′ with W = (w_1,..., w_T)′ and H = (ĥ_1,..., ĥ_T)′,
σ̃² = (1/T)Σ_t ũ_t², and the vector ũ consists of residuals from (3.1) estimated consistently
under H_0 and g ≡ 1. Under a set of assumptions which are moment conditions for the
model [see White (1984, Theorem 4.25)], (3.2) has an asymptotic χ²(m) distribution
when H_0 holds. A practical way of carrying out the test is by ordinary least squares
as follows.
(i) Regress y_t on w_t, compute the residuals ũ_t and the sum of squared residuals
SSR_0.
(ii) Regress ũ_t on w_t and ĥ_t, compute the sum of squared residuals SSR_1.
(iii) Compute

F(m, T − n − m) = [(SSR_0 − SSR_1)/m] / [SSR_1/(T − n − m)]

with n = k + p + 1, which has an approximate F distribution under θ = 0.


The use of an F test instead of the χ² test given by the asymptotic theory is
recommended in small samples because of its good size and power properties;
see Harvey (1990, pp. 174-175).
As an example, assume w_t = (1, w̃_t′)′ with w̃_t = (y_{t−1},..., y_{t−p})′ and f = v_t′Θw̃_t =
(v_t ⊗ w̃_t)′vec(Θ), so that (3.1) is a univariate bilinear model. Then h_t = (v_t ⊗ w̃_t),
ĥ_t = (ṽ_t ⊗ w̃_t), and (3.2) is a linearity test against bilinearity as discussed in Weiss
(1986) and Saikkonen and Luukkonen (1988).
In a few cases f in (3.1) factors as follows:

f(θ, w_t) = (θ_1′w_t) f_1(θ_2, θ_3, w_t)     (3.3)

and f_1(0, θ_3, w_t) ≡ 0. Assume that θ_2 is a scalar whereas θ_3 may be a vector. This is
the case for many nonlinear models such as the smooth transition regression models
discussed in Section 2.1. Vector v_t is excluded for simplicity. The linearity hypothesis
can be expressed as H_{01}: θ_1 = 0. However, H_{02}: θ_2 = 0 is also a valid linearity
hypothesis. This is an indication of the fact that (3.1) with (3.3) is only identified
under the alternative θ_2 ≠ 0 but not under θ_2 = 0. If we choose H_{02} as our

starting-point, we may use the Taylor expansion

f_1(θ_2, θ_3, w_t) = f_1(0, θ_3, w_t) + [∂f_1(0, θ_3, w_t)/∂θ_2] θ_2 + R_2(θ_2, θ_3, w_t) θ_2².     (3.4)

Assume furthermore that the derivative has the form

∂f_1(0, θ_3, w_t)/∂θ_2 = β(θ_3)′k(w_t),     (3.5)

where β(θ_3) and k(w_t) are l × 1 vectors. Next replace f_1 in (3.3) by the first-order
Taylor approximation at θ_2 = 0:

f̃_1(θ, w_t) = β(θ_3)′k(w_t) θ_2.
Then (3.3) becomes

f̃(θ, w_t) = ψ_1′w_t + ψ_2′g(w_t),

where ψ_1 = ψ_1(θ_1, θ_2, θ_3) and ψ_2 = θ_2 ψ̃_2(θ_1, θ_3). Vector g(w_t) contains those elements
of k(w_t)w_t′ that are of higher order than one. From this it follows that the
approximation of (3.1) has the form

y_t = ψ_1′w_t + ψ_2′g(w_t) + u_t.     (3.6)

The test can be carried out as before. After estimating (3.1) under H_0, ũ_t is regressed
on w_t and g(w_t), and under H_{02}′: ψ_2 = 0 the test statistic has an asymptotic χ²(s)
distribution if ψ_2 is an s × 1 vector.
From (3.6) it is seen that the original null hypothesis H_{02} has been transformed
into H_{02}′: ψ_2 = 0. Approximating f_1 as in (3.4) and reparametrizing the model may
be seen as a way of removing the identification problem. However, it may also be
seen as a solution in the spirit of Davies (1977, 1987). Let ũ* be the residual vector
from the regression (3.6). Then

ũ*′ũ* = inf_{θ_2, θ_3} ũ(θ)′ũ(θ),

where the elements ũ_t(θ) = y_t − ψ_1′w_t − θ_2 ψ̃_2(θ_1, θ_3)′g(w_t) are the OLS residuals from
regressing y_t on w_t and ψ̃_2(θ_1, θ_3)′g(w_t) while keeping θ_2 and θ_3 fixed.
The test statistic is

F = sup_{θ_2, θ_3} F(θ_2, θ_3) = {[ũ′ũ − inf_{θ_2, θ_3} ũ(θ)′ũ(θ)]/s} / {inf_{θ_2, θ_3} ũ(θ)′ũ(θ)/(T − n − s)}.

The price of the neat asymptotic null distribution is that not all the information in
the original model has been used. The original null hypothesis involved only a
single parameter.
As an example, assume w_t = w̃_t = (y_{t−1},..., y_{t−p})′, let β(θ_3) = 1 and k(w_t) = y²_{t−1}.
This gives ψ_2 = θ_2 θ_1 and g(w_t) = (y³_{t−1}, y²_{t−1}y_{t−2},..., y²_{t−1}y_{t−p})′. The resulting test is
the linearity test against the univariate exponential autoregressive model in
Saikkonen and Luukkonen (1988). In that model, f_1 = 1 − exp[−θ_2(y_{t−1} − θ_3)²]
with θ_3 = 0, θ_2 > 0. Take another example where k(w_t) = w̃_t, ψ_2′g(w_t) = Σ_{i=1}^{p} Σ_{j=i}^{p}
φ_{ij} y_{t−i}y_{t−j} and H_{02}′: φ_{ij} = 0, i = 1,..., p; j = i,..., p. The test is the first of the three
linearity tests against smooth transition autoregression in Luukkonen et al. (1988b)
when the delay parameter d is unknown but it is assumed that 1 ≤ d ≤ p. The number
of degrees of freedom in the asymptotic null distribution equals p(p + 1)/2. If w_t also
contains other variables than lags of y_t, the test is a linearity test against smooth
transition regression; see Granger and Terasvirta (1993, Chapter 6). If the delay
parameter is known, k(w_t) = (1, y_{t−d})′, so that g(w_t) = (y_{t−1}y_{t−d},..., y_{t−p}y_{t−d})′
and the F test has p and T − n − p degrees of freedom.
In some cases the first-order Taylor series approximation is inadequate. For
instance, let θ_1 = (θ_{10}, 0,..., 0)′ in (3.3) so that the only nonlinearity is described by
f_1 multiplied by a constant. Assume furthermore that k(w_t) = w_t so that θ_1′w_t = θ_{10}
and β(θ_3)′k(w_t) = β(θ_3)′w_t. Then the LM type test has no power against the
alternative because the auxiliary regression (3.6) is of order one, i.e. ψ_2 = 0. In such
a situation, a third-order Taylor series approximation of f_1 is needed for construct-
ing a proper test; see Luukkonen et al. (1988b) for discussion.

3.2. Tests without a specific alternative

The above linearity tests are tests against a well-specified nonlinear alternative.
There exist other tests that are intended as general tests without a specific
alternative. We shall consider some of them. The first one is the Regression Error
Specification Test (RESET; Ramsey, 1969). Suppose we have a linear model

y_t = φ′w_t + u_t,     (3.7)

where w_t is as in (3.1) and whose parameters we estimate by OLS. Let û_t, t = 1,..., T,
be the estimated residuals and ŷ_t = y_t − û_t the fitted values. Construct an auxiliary
regression

û_t = ψ′w_t + Σ_{j=2}^{h} δ_j ŷ_t^j + u_t*.     (3.8)

The RESET is the F test of the hypothesis H_0: δ_j = 0, j = 2,..., h, in (3.8). If
w_t = (1, y_{t−1},..., y_{t−p})′ and h = 2, (3.8) yields the univariate linearity test of Keenan

(1985). In fact, RESET may also be interpreted as an LM test against a well-specified
alternative; see for instance Terasvirta (1990) or Granger and Terasvirta (1993,
Chapter 6).
Tsay (1986) suggested augmenting the univariate (3.7) by second-order terms so
that the auxiliary regression corresponding to (3.8) becomes

û_t = ψ′w_t + Σ_{i=1}^{p} Σ_{j=i}^{p} φ_{ij} y_{t−i}y_{t−j} + u_t*.     (3.9)

The linearity hypothesis to be tested is H_0: φ_{ij} = 0, ∀i, j. The generalization to
multivariate models is immediate. This test also has an LM type interpreta-
tion showing that the test has power against a larger variety of nonlinear models
than the RESET. This is seen by comparing (3.9) with (3.6), assuming ψ_2′g(w_t) =
Σ_{i=1}^{p} Σ_{j=i}^{p} φ_{ij} y_{t−i}y_{t−j} as discussed in the previous section. The advantage of RESET
lies in the small number of parameters in the null hypothesis. When w_t = (1, y_{t−1})′
[or w_t = (1, x_{t1})′], the two tests are identical.
A general linearity test can also be based on the neural network model (2.4), and
such a test is presented in Lee et al. (1993). In computing the test statistics,
γ_j, j = 1,..., p, in (2.4) are selected randomly from a distribution. Terasvirta et al.
(1993) showed that this can be avoided by deriving the test by applying the LM
principle. The auxiliary regression for the test becomes

û_t = ψ′w_t + Σ_{i=1}^{p} Σ_{j=i}^{p} φ_{ij} y_{t−i}y_{t−j} + Σ_{i=1}^{p} Σ_{j=i}^{p} Σ_{k=j}^{p} φ_{ijk} y_{t−i}y_{t−j}y_{t−k} + u_t*     (3.10)

and the linearity hypothesis is H_0: φ_{ij} = 0, φ_{ijk} = 0, ∀i, j, k. The simulation results in
Terasvirta et al. (1993) indicate that in small samples the test based on (3.10) often
has better power than the original neural network test.
There has been no mention yet about tests against piecewise linear or switching
regression or its univariate counterpart, threshold autoregression. The problem is
that f_1 in (3.3) is not a continuous function of the parameters if the switch-points or
thresholds are unknown. This makes the likelihood function irregular and the score
principle inapplicable. Ertel and Fowlkes (1976) suggested the use of cumulative
sums of recursive residuals for testing linearity. First, order the variables in
ascending (or descending) order according to the transition variable. Compute the
parameters recursively and consider the cumulative sum of the recursive residuals.
The test is analogous to the CUSUM test that Brown et al. (1975) suggested, in which
time is the transition variable and no lags of y_t are allowed in w_t. However, Kramer
et al. (1988) showed that the presence of lags of y_t in the model does not affect the
asymptotic null distribution of the CUSUM statistic. Even before that, Petruccelli
and Davies (1986) proposed the same test for the univariate (threshold autoregressive)
case; see also Petruccelli (1990). The CUSUM test may also be based on residuals

from OLS estimation using all the observations instead of recursive residuals.
Ploberger and Kramer (1992) recently discussed this possibility.
The CUSUM principle is not the only one available from the literature of
structural change. Quandt (1960) suggested generalizing the F test (Chow, 1960)
for testing parameter constancy in a linear model with known change-point by
applying F = sup_{t∈T} F(t), where T = {t: t_1 ≤ t ≤ T − t_1}. He noticed that the null
distribution of F was nonstandard. Andrews (1993) provided the asymptotic null
distribution for F and tables for critical values; see also Hansen (1990). If the observ-
ations are ordered according to a variable other than time, a linearity test against
switching regression is obtained. In the univariate case, Chan (1990) and Chan and
Tong (1990) applied Quandt's idea to testing linearity against threshold autoregres-
sion (TAR) with a single threshold; see also Tong (1990, Chapter 5). Chan (1991)
provided tables of percentage points of the null distribution of the test statistic. In
fact, this test can be regarded as one against a well-specified alternative: a
two-regime switching regression or threshold autoregressive model with a known
transition variable or delay parameter. For further discussion, see Granger and
Terasvirta (1993, Chapter 6).
Petruccelli (1990) compared the small-sample performance of the CUSUM, the
threshold autoregression test of Chan and Tong and the LM type test against
logistic STAR of Luukkonen et al. (1988b) when the true model was a single-
threshold TAR model. The results showed that the first two tests performed
reasonably well [for the CUSUM test a reverse CUSUM (Petruccelli, 1990) was
used]. However, they also demonstrated that the LM type test had quite comparable
power against this TAR model, which is a special case of the logistic STAR model.
As mentioned in the introduction, Brock et al. (1987) proposed a test (BDS test)
of independent, identically distributed observations based on the correlation
integral, a concept that arises in chaos theory. Let $Y_{t,n}$ be a segment of a time series
$Y_T = (y_1, \ldots, y_T)$: $Y_{t,n} = (y_t, y_{t-1}, \ldots, y_{t-n+1})$. Compare a pair of such vectors $Y_{t,n}$
and $Y_{s,n}$. They are said to be no more than $\varepsilon$ apart if

$$|y_{t-j} - y_{s-j}| \le \varepsilon, \qquad j = 0, 1, \ldots, n-1. \qquad (3.11)$$

The correlation integral is defined as

$$C_n(\varepsilon) = \lim_{T \to \infty} T^{-2} \, \#\{ \text{pairs } (t,s) \text{ with } 1 \le t, s \le T \text{ such that (3.11) holds} \}.$$

Brock et al. (1987) defined

$$S(n, \varepsilon) = C_n(\varepsilon) - [C_1(\varepsilon)]^n. \qquad (3.12)$$

Under the hypothesis that $\{y_t\}$ is an i.i.d. process, (3.12) has an asymptotic normal
distribution with zero mean and a variance given in Brock et al. (1987). Note that
(3.12) depends on n and $\varepsilon$, which the investigator has to choose, and that the size of
the test is very sensitive to these two parameters. A much more thorough discussion
of the BDS test and its properties is found in Brock and Potter (1993) or Scheinkman
(1990). It may be mentioned, however, that a rather long time series is needed to
obtain reasonable power. Lee et al. (1993) contains some small-sample evidence on
the behaviour of the BDS test, but it is not very conclusive; see Teräsvirta (1990).
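To make the construction concrete, the following sketch (numpy assumed; function names ours) computes the sample analogue of the correlation integral and the unstandardized quantity (3.12); a proper BDS test would further studentize (3.12) using the variance formula of Brock et al. (1987).

import numpy as np

def correlation_integral(y, n, eps):
    # Sample analogue of C_n(eps): the fraction of pairs of n-histories
    # within eps in the max norm (self-pairs included, as a simplification).
    T = len(y)
    emb = np.column_stack([y[n - 1 - j: T - j] for j in range(n)])
    m = emb.shape[0]
    dist = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
    return np.sum(dist <= eps) / (m * m)

def bds_raw(y, n, eps):
    # The quantity in (3.12): C_n(eps) - [C_1(eps)]^n, unstandardized.
    return correlation_integral(y, n, eps) - correlation_integral(y, 1, eps) ** n

rng = np.random.default_rng(1)
y = rng.normal(size=500)
print(bds_raw(y, n=2, eps=y.std()))   # close to zero for i.i.d. data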
Linearity of a single series may also be tested in the frequency domain. Let $\{y_t\}$
be stationary and have finite moments up to the sixth order. Then we can define
the bispectral density $f(\omega_i, \omega_j)$ of $y_t$ based on third moments and

$$b(\omega_i, \omega_j) = \frac{|f(\omega_i, \omega_j)|^2}{f(\omega_i) f(\omega_j) f(\omega_i + \omega_j)},$$

where $f(\omega_i)$ is the spectral density of $y_t$. Two hypotheses can be tested: (i) if
$f(\omega_i, \omega_j) \equiv 0$ then $y_t$ is linear and Gaussian; (ii) if $b(\omega_i, \omega_j) \equiv b_0 > 0$ then $y_t$ is linear
but not Gaussian, i.e. the parametrized linear model for $\{y_t\}$ has non-Gaussian
errors. Subba Rao and Gabr (1980) proposed tests for testing these two hypotheses.
Hinich (1982) derived somewhat different tests for the same purpose. For more
discussion, see, e.g. Priestley (1988) and Brockett et al. (1988). A disadvantage of
these tests seems to be their relatively low power in small samples. Besides, performing
the tests requires more computation than carrying out most of their time domain
counterparts.
It has been assumed, so far, that g = 1 in (3.1). If this assumption is not satisfied,
the size of the test may be affected. At least the BDS test and the tests based on
bispectral density are known to be sensitive to departures from that assumption. If
linearity of the conditional mean is tested against a well-specified alternative using
LM type tests, some possibilities of taking conditional heteroskedasticity into
account exist and will be briefly mentioned in the next section.

3.3. Constancy of conditional variance

The assumption g = 1 is also a testable hypothesis. However, because conditional
heteroskedasticity is discussed in Chapter 49 of this volume, testing g = 1 against
nonconstant conditional variance is not considered here. This concerns not only
testing linearity against ARCH but also testing it against random coefficient linear
regression; see, e.g. Nicholls and Pagan (1985) for further discussion on the latter
situation.
If $f \equiv 0$ and g = 1 are tested jointly, a typical LM or LM type test is the sum of two
separate LM (type) tests of $f \equiv 0$ and g = 1, respectively. This is the case because
under this joint null hypothesis the information matrix is block diagonal; see
Granger and Teräsvirta (1993, Chapter 6). Higgins and Bera (1989) derived a joint
LM test against bilinearity and ARCH. On the other hand, testing $f \equiv 0$ when $g \not\equiv 1$
is a more complicated affair than it is when g = 1. If g is parametrized, the null
model has to be estimated under conditional heteroskedasticity. Besides, it may no
longer be possible to carry out the test making use of a simple auxiliary regression;
see Granger and Teräsvirta (1993). If g is not parametrized but $g \not\equiv 1$ is suspected
to hold, then the tests described in Section 3.1, as well as the RESET and the Tsay
test, can be made robust against $g \not\equiv 1$. Davidson and MacKinnon (1985) and
Wooldridge (1990) described techniques for doing this. The present simulation
evidence is not yet sufficient to fully evaluate their performance in small samples.

4. Specification of nonlinear models

If linearity tests indicate the need for a nonlinear model and economic theory does
not suggest a completely specified model, then the structure of the model has to be
specified from the data. This problem also exists in nonparametric modelling as a
variable selection problem, because the lags needed to describe the dynamics of the
process are usually unknown; see Auestad and Tjøstheim (1991) and Tjøstheim and
Auestad (1994a, b). To specify univariate time series models, Haggan et al. (1984)
devised a specification technique based on recursive estimation of the parameters of a
linear autoregressive model. The parameters of the model were assumed to change
over time in a certain fashion. Choosing a model from a class of state-dependent
models, see Priestley (1988), was carried out by examining the graphs of recursive
estimates. Perhaps because the family of state-dependent models is large and, thus,
the possibilities are many, the technique is not easy to apply.
If the class of parametric models to choose from is more restricted, more concrete
specification methods may be developed. [For instance, Box and Jenkins (1970)
restricted their attention to linear ARMA models.] Tsay (1989) presented a technique
making use of linearity tests and visual inspection of some graphs to specify a model
from the class of threshold autoregressive models. It is easy to use and seems to work
well. Chen and Tsay (1993a) considered the specification of functional-coefficient
autoregressive models whereas Chen and Tsay (1993b) extended the discussion to
additive functional-coefficient regression models. The key element in that procedure
is the use of arranged local regressions in which the observations are ordered
according to a transition variable. Lewis and Stevens (1991a) applied multivariate
adaptive regression splines (MARS), see Friedman (1991), to specify adaptive spline
threshold autoregressive models. Teräsvirta (1994) discussed the specification of
smooth transition autoregressive models. This technique was generalized to smooth
transition regression models in Granger and Teräsvirta (1993, Chapter 7) and will
be considered next.
Consider the smooth transition regression (STR) model with p + k + 1 independent
variables

$$y_t = \varphi' w_t + (\theta' w_t) F(z_t) + u_t, \qquad (4.1)$$

where $E\{u_t \mid I_t\} = 0$, $\operatorname{var}\{u_t \mid I_t\} = \sigma^2$, $I_t = \{y_{t-j}, j > 0; x_{t-j,i}, i = 1, \ldots, k, j \ge 0\}$ as in
(1.1), $\varphi = (\varphi_0, \varphi_1, \ldots, \varphi_{p+k})'$, $\theta = (\theta_0, \theta_1, \ldots, \theta_{p+k})'$, $m = p + k + 1$ and $w_t = (1, y_{t-1}, \ldots,
y_{t-p}; x_{t1}, \ldots, x_{tk})'$. The alternatives for F are $F(z_t) = \{1 + \exp[-\gamma(z_t - c)]\}^{-1}$, $\gamma > 0$,
which gives the logistic STR model, and $F(z_t) = 1 - \exp[-\gamma(z_t - c)^2]$, $\gamma > 0$, corres-
ponding to the exponential STR model. The transition variable $z_t$ may be any
element of $w_t$ other than 1, or another variable not included in $w_t$.
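For concreteness, a small sketch of the two transition functions, together with data generated from an illustrative first-order logistic STAR model, may look as follows (Python with numpy assumed; all parameter values are arbitrary):

import numpy as np

def logistic_F(z, gamma, c):
    # Logistic STR transition function: monotone in z
    return 1.0 / (1.0 + np.exp(-gamma * (z - c)))

def exponential_F(z, gamma, c):
    # Exponential STR transition function: symmetric about c
    return 1.0 - np.exp(-gamma * (z - c) ** 2)

# Simulate y_t = (phi + theta*F(y_{t-1}))*y_{t-1} + u_t, an LSTAR(1) model
rng = np.random.default_rng(2)
T, phi, theta, gamma, c = 300, 0.5, -0.9, 10.0, 0.0
y = np.zeros(T)
for t in range(1, T):
    F = logistic_F(y[t - 1], gamma, c)
    y[t] = (phi + theta * F) * y[t - 1] + 0.1 * rng.normal()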
The data-based specification proceeds in three stages. First, specify a linear model
to serve as a base for testing linearity. This is done by using a suitable model selection
criterion. Second, test linearity against STR using the linear model as the null model.
If linearity is rejected, determine the transition variable from the data. Third, choose
between LSTR and ESTR models. Testing linearity against STR is not difficult. A
test with power against both LSTR and ESTR if the transition variable is assumed
known is obtained by proceeding as in Section 3.1. This leads to the auxiliary regression

$$\hat{u}_t = \beta_0' \tilde{w}_t + \beta_1' \tilde{w}_t z_{td} + \beta_2' \tilde{w}_t z_{td}^2 + \beta_3' \tilde{w}_t z_{td}^3 + v_t, \qquad (4.2)$$

where $z_{td}$ is the transition variable and $\hat{u}_t$ is the OLS residual from the linear
regression $y_t = \beta' w_t + u_t$. If $z_{td}$ is an element of $w_t$, $\tilde{w}_t$ has to be replaced by
$w_t$ in (4.2) except for the first right-hand-side term. The linearity hypothesis is
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$. Equation (4.2) is also used for selecting $z_{td}$.
The test is carried out for all candidates for $z_{td}$, and the one yielding the smallest
p-value is selected if that value is sufficiently small. If it is not, the model is taken to
be linear. This procedure is motivated as follows. Suppose there is a true STR model
with a transition variable $z_{td}$ that generated the data. Then the LM type test against
that alternative has optimal power properties. If an inappropriate transition
variable is selected for the test, the resulting test may still have power against the true
alternative, but the power is less than if the correct transition variable is used. Thus,
the strongest rejection of the null hypothesis suggests that the corresponding
transition variable should be selected. For more discussion of this procedure see
Teräsvirta (1994) and Granger and Teräsvirta (1993, Chapters 6 and 7). If linearity
is rejected and a transition variable selected, then the third step is to choose between
LSTR and ESTR models. This can be done by testing a set of nested null hypotheses
within (4.2) with an F-test: the hypotheses are $H^*_{03}: \beta_3 = 0$, $H^*_{02}: \beta_2 = 0 \mid \beta_3 = 0$ and
$H^*_{01}: \beta_1 = 0 \mid \beta_2 = \beta_3 = 0$. If the p-value of the test of $H^*_{02}$ is the smallest, choose the
ESTR model; otherwise choose the LSTR model.
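A compact sketch of the resulting F test in the univariate (STAR) case may look as follows, assuming numpy and scipy and treating the constant as exempt from the higher-order terms, in line with the remark after (4.2); the function name is ours.

import numpy as np
from scipy.stats import f as fdist

def str_linearity_test(y, p, d):
    # F-version of the LM type test based on (4.2), univariate case,
    # with transition variable y_{t-d}; returns (F statistic, p-value).
    T = len(y)
    lags = np.column_stack([y[p - i: T - i] for i in range(1, p + 1)])
    yy = y[p:]
    W = np.column_stack([np.ones(len(yy)), lags])   # (1, y_{t-1},...,y_{t-p})
    z = lags[:, d - 1]                              # transition variable
    beta, *_ = np.linalg.lstsq(W, yy, rcond=None)
    u = yy - W @ beta                               # OLS residuals
    ssr0 = u @ u
    aug = np.column_stack([W] + [lags * z[:, None] ** k for k in (1, 2, 3)])
    g, *_ = np.linalg.lstsq(aug, u, rcond=None)
    e = u - aug @ g
    ssr1 = e @ e
    q = 3 * p                                       # added regressors
    dof = len(yy) - aug.shape[1]
    F = ((ssr0 - ssr1) / q) / (ssr1 / dof)
    return F, 1.0 - fdist.cdf(F, q, dof)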
Specifying the lag structure of (4.1) could be done within (4.2) using an appropriate
model selection criterion but little is known about the success of such a procedure.
In the existing applications, a general-to-specific approach based on estimating
nonlinear STR (or STAR) models has mostly been used.
The model specification problem also arises in nonparametric time series
modelling. Taking model (2.5) as a starting-point, there is the question of which lags
$y_{t-i_1}, \ldots, y_{t-i_p}$; $x_{t-j_1}, \ldots, x_{t-j_q}$ should be included in the model. Furthermore, it
should be investigated whether the functions f and g are linear or nonlinear and
whether they are additive or not. Moreover, if interaction terms are included, how
should they be modelled and, more generally, can the nonparametric analysis
suggest functional forms, such as the smooth transition or threshold function, or an
ARCH type function for conditional variance?
These are problems of exploratory data analysis for nonlinear time series, and
relatively little nonparametric work has been done in the area. Various graphical
model indicators have been tried out in Tong (1990, Chapter 7), Haggan et al. (1984)
and Auestad and Tjøstheim (1990), however. Perhaps the most natural quantities
to look at are the lagged conditional means and variances of increasing order, i.e.

$$M_{y,k}(y) = E(y_t \mid y_{t-k} = y), \qquad M_{x,k}(x) = E(y_t \mid x_{t-k} = x),$$
$$V_{y,k}(y) = \operatorname{var}(y_t \mid y_{t-k} = y), \qquad V_{x,k}(x) = \operatorname{var}(y_t \mid x_{t-k} = x). \qquad (4.3)$$

In univariate modelling these quantities have been used extensively, albeit informally;
see Tong (1990, Chapter 7). They can give a rough idea of the type of nonlinearity
involved, but they fail to reveal things like the lag structure of an additive model.
A more precise and obvious alternative is to look at the functions M(y; x) and
V(y; x) defined in (2.6) and (2.7), but they cannot be graphically displayed for
p + q > 2, and the curse of dimensionality quickly becomes a severe problem.
Auestad and Tjøstheim (1991) and Tjøstheim and Auestad (1994a) introduced
projections as a compromise between M(y; x), V(y; x) and the indicators (4.3). To
define projections, consider the conditional mean function $M(y_{t-i_1}, \ldots, y_{t-i_k}, \ldots, y_{t-i_p};
x_{t-j_1}, \ldots, x_{t-j_q})$ with $y_{t-i_k}$ held fixed at y. The one-dimensional projector of order (p, q)
projecting on lag $i_k$ of $y_t$ is defined by

$$P_{y,k}(y) = E\{ M(y_{t-i_1}, \ldots, y, \ldots, y_{t-i_p}; x_{t-j_1}, \ldots, x_{t-j_q}) \}. \qquad (4.4)$$

The projector $P_{x,k}(x)$ is defined in the same way. For an additive model with
$M(y_1, \ldots, y_p; x_1, \ldots, x_q) = \sum_{i=1}^p \alpha_i(y_i) + \sum_{j=1}^q \beta_j(x_j)$ it is easily seen that if all p + q
lags are included in the projection operation, then

$$P_{y,k}(y) = \alpha_k(y) + \mu_k, \qquad P_{x,k}(x) = \beta_k(x) + \theta_k,$$

where $\mu_k = E(y_t) - E[\alpha_k(y_t)]$ and $\theta_k = E(y_t) - E[\beta_k(x_t)]$. Clearly the additive terms
$\alpha_k(y)$ and $\beta_k(x)$ cannot be recovered using $M_{y,k}$ and $M_{x,k}$ of (4.3).
Projectors can be defined similarly for the conditional variance and, in principle,
they reveal the structure of models having an additive conditional variance function.
Both types of projectors can be estimated by replacing theoretical expectations with
empirical averages and by introducing a weight function to screen off extreme data.
Properties and details are given in Auestad and Tjøstheim (1991) and Tjøstheim
and Auestad (1994a). Consistency is proved in Masry and Tjøstheim (1994).
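As a toy illustration of the projection idea only (the estimator proper replaces M by its kernel estimate and adds a screening weight), consider an additive conditional mean with known components; averaging over the empirical distribution of the other argument recovers the component of interest up to an additive constant. The code assumes numpy and all names are ours.

import numpy as np

# Additive toy conditional mean M(y1, y2) = a1(y1) + a2(y2)
a1 = lambda y: np.sin(y)
a2 = lambda x: 0.5 * x
M = lambda y1, y2: a1(y1) + a2(y2)

rng = np.random.default_rng(3)
other = rng.normal(size=1000)        # sample of the lag being averaged out
y_grid = np.linspace(-2.0, 2.0, 9)

# Projector P_{y,1}(y): average M(y, .) over the sample of the other lag
P1 = np.array([np.mean(M(y, other)) for y in y_grid])
print(P1 - a1(y_grid))               # roughly constant (= E[a2], here about 0)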
An important part of the model specification problem consists of singling out the
significant lags $i_1, \ldots, i_p$; $j_1, \ldots, j_q$ and the orders p and q for the conditional mean
(2.6) and conditional variance (2.7). Auestad and Tjøstheim (1990, 1991), Tjøstheim
and Auestad (1994b) and Cheng and Tong (1992) considered this problem; Granger
and Lin (1991) did the same from a somewhat different point of view. Auestad and
Tjøstheim adopted an approach analogous to the parametric final prediction error
(FPE) criterion of Akaike (1969). They treated it only in the univariate case, but it
is easily extended to the multivariate situation. Algorithms and formulae, including
the heterogeneous case $g \not\equiv 1$, are given in Tjøstheim and Auestad (1994b), to which
the reader is referred for details of derivation and examples with simulated and real
data. Cheng and Tong (1992) discussed a closely related approach based on cross
validation.
An alternative and less computer intensive method is outlined by Granger and
Lin (1991). They use the Kendall rank partial autocorrelation function and the
bivariate information measure

$$\int \log \left[ \frac{f(x, y)}{f_x(x) f_y(y)} \right] f(x, y) \, dx \, dy$$

for a pair of lags. Joe (1989) studied its properties in the i.i.d. case. Robinson (1991)
considered the random process case and tests of independence. Related tests of
independence and a power comparison with the BDS test are given in Skaug and
Tjøstheim (1993a, b, c). Specification of semiparametric time series models is discussed
in the next section together with estimation.

5. Estimation in nonlinear time series

5.1. Estimation of parameters in parametric models

For parametric nonlinear models, conditional nonlinear least squares is the most
common estimation technique. If the errors are normal and independent, this is
equivalent to conditional maximum likelihood. The theory derived for dynamic
nonlinear models (3.1) with g = 1 gives the conditions for consistency and asymptotic
normality of the estimators. For an account, see, e.g. Gallant (1987, Chapter 7).
Even more general conditions were recently laid out in Pötscher and Prucha
(1991a, b). These conditions may be difficult to verify in practice, so that the
asymptotic standard deviation estimates, confidence intervals and the like have to
be interpreted with care. For discussions of estimation algorithms, see, e.g. Quandt
(1983), Judge et al. (1985, Appendix B) and Bates and Watts (1988). The estimation
of parameters in (2.2) may not always be straightforward. Local minima may occur,
so that estimation with different starting-values is recommended. Estimation of γ
in the transition function (2.3) may create problems if the transition is rapid because
there may not be sufficiently many observations in the neighbourhood of the point
about which the transition takes place. The convergence of the estimate sequence
may therefore be slow and the standard deviation estimate of γ most often very
large. This problem is discussed, e.g. in Bates and Watts (1988, p. 87), Granger and
Teräsvirta (1993, Chapter 7), Seber and Wild (1989, pp. 480–481) and Teräsvirta
(1994). For simulation evidence and estimation using real economic data sets, see
also Chan and Tong (1986), Granger et al. (1993), Luukkonen (1990) and Teräsvirta
and Anderson (1992). Model (2.2) may even be a switching regression model in which
case γ is not finite and, in principle, cannot be estimated. In that case convergence
may still occur at some very large value, but obtaining a negative definite Hessian
probably turns out to be a problem. An available alternative then is to fix γ at some
sufficiently large but finite value and estimate the remaining parameters conditionally
on that value.
The estimation of parameters becomes more complicated if the model contains
lagged errors as the bilinear model does. Subba Rao and Gabr (1984) outlined a
procedure for the estimation of a bilinear model based on maximizing the
conditional likelihood. Quick preliminary estimates may be obtained using a long
autoregression to estimate the residuals and OLS for estimating the parameters
keeping the residuals fixed. This is possible because the bilinear model has a simple
structure in the sense that it is linear in the parameters if we regard the lagged
residuals as observed. Granger and Teräsvirta (1993, Chapter 7) suggested this
alternative.
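A minimal sketch of this two-step idea, for an illustrative bilinear specification $y_t = a y_{t-1} + b y_{t-1} \varepsilon_{t-1} + \varepsilon_t$ of our own choosing, might look as follows (numpy assumed):

import numpy as np

rng = np.random.default_rng(4)
T, a, b = 2000, 0.4, 0.3
e = rng.normal(size=T)
y = np.zeros(T)
for t in range(1, T):
    y[t] = a * y[t - 1] + b * y[t - 1] * e[t - 1] + e[t]

# Step 1: a long autoregression yields preliminary residuals
p = 10
X = np.column_stack([y[p - i: T - i] for i in range(1, p + 1)])
beta, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
ehat = np.concatenate([np.zeros(p), y[p:] - X @ beta])  # zero-padded start

# Step 2: OLS on the bilinear regressors, keeping the residuals fixed;
# the model is linear in (a, b) once the lagged residuals are treated as data
Z = np.column_stack([y[:-1], y[:-1] * ehat[:-1]])
coef, *_ = np.linalg.lstsq(Z, y[1:], rcond=None)
print(coef)   # rough preliminary estimates of (a, b)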
If the model is a switching regression or threshold autoregressive model,
nonlinear least squares is an inapplicable technique because of the irregularity of
the sum of squares or the likelihood function. The problem consists of the unknown
switch-points or thresholds for which unique point estimates are not available as
long as the number of observations is finite. Tsay (1989) suggested specifying
(approximate) switch-points from scatterplots of t-values in ordered (according
to the switching variable) recursive regressions. As long as the recursion stays in the
same regime, the t-value of a coefficient estimate converges to a fixed value. When
observations from another regime are added into the regression, the coefficient
estimates start changing and the t-values deviating. Tsay (1989) contains examples.
The estimation of parameters in regimes is carried out by ordinary least squares.
Chan (1993) showed (in the univariate case) that if the model is stationary and
ergodic, the parameter estimates, including those of the thresholds, are strongly
consistent; for a discussion see Tong (1990, Section 5.5.3).

5.2. Estimation of nonparametric functions

In nonparametric estimation the most common way of estimating the conditional
mean (2.6) and variance (2.7) is to apply the so-called kernel method. It is based on
a kernel function k(x) which, typically, is a real continuous, bounded, symmetric
function integrating to one. Usually it is required that $k(x) \ge 0$ for all x, but
sometimes it is advantageous to allow k(x) to take negative values, so that we may
have $\int x^2 k(x)\,dx = 0$. The kernel method is explained in much greater detail in
Chapter 38 of this volume.
The kernel acts as a smoothing device in the estimation procedure. For quantities
depending on several variables as in (2.6) and (2.7) a product kernel can be used.
Then the kernel estimates of M and V are

$CY,
fi k~,l(Y~-Y,_,~,~lk,,2(x,-
ti(y, y,,x, x(J=+l (5.1)
fcn k,,l(Yl-Ys-i,) fi kh,2(X,-Xs-i,)
)...) )...)

s r=l r=l

fI kta.2(Xr-Xs-i,)
~~Y:rfJlk~,l(Yr-Ys-J r=1
P(y,)...) yp,xl )...) x4)=-L ~_____
~ ~~ ~~
$1fi k,,l(Yr-Ys-i,) fI kh,2(Xr-xs-i,)*
s r=l r=l

- &Kc4)2, (5.2)

where k, i(x) = hipki(himlx), i = 1,2. Here k, and k, are the kernel functions
associated with the {y,} and (xt} processes, and h, and h, are the corresponding
bandwidths. The bandwidth controls the width of the kernel function and thus the
amount of smoothing involved. The bandwidth will depend on the total number
of observations T, so that h = h(T) + 0 as T-t m. It also depends on the dimensions
p and q, but this has been suppressed in the above notation. In the following, to
simplify notation, it is assumed that {y,}, {x, } are measured roughly on the same
scale, so that the same bandwidth and the same kernel function can be used
everywhere.
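A compact sketch of (5.1) and (5.2) in the univariate case, with a Gaussian product kernel and a common bandwidth, follows (numpy assumed; names ours):

import numpy as np

def nw_conditional(y, p, ygrid, h):
    # Nadaraya-Watson estimates of M = E(y_t | y_{t-1},...,y_{t-p}) and
    # V = var(y_t | .) as in (5.1)-(5.2), Gaussian product kernel
    T = len(y)
    lags = np.column_stack([y[p - i: T - i] for i in range(1, p + 1)])
    target = y[p:]
    def at(point):
        w = np.exp(-0.5 * ((lags - point) / h) ** 2).prod(axis=1)
        m = np.sum(w * target) / np.sum(w)
        v = np.sum(w * target ** 2) / np.sum(w) - m ** 2
        return m, v
    return np.array([at(pt) for pt in ygrid])

rng = np.random.default_rng(5)
y = rng.normal(size=500)
for t in range(1, 500):
    y[t] += 0.6 * np.tanh(y[t - 1])          # an illustrative nonlinear AR(1)
grid = [np.array([g]) for g in np.linspace(-2, 2, 5)]
print(nw_conditional(y, p=1, ygrid=grid, h=0.4))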
Under regularity conditions (Robinson, 1983) it can be proved that $\hat{M}(y, x)$ and
$\hat{V}(y, x)$ are asymptotically normal. More precisely,

$$(Th^{p+q})^{1/2} [\hat{M}(y, x) - M(y, x)] \to N\!\left(0, \; J^{p+q}\,\frac{V(y, x)}{p(y, x)}\right) \qquad (5.3)$$

and

$$(Th^{p+q})^{1/2} [\hat{V}(y, x) - V(y, x)] \to N\!\left(0, \; J^{p+q}\,\frac{s(y, x)}{p(y, x)}\right), \qquad (5.4)$$

where the convergence is in distribution, $J = \int k^2(u)\,du$, $p(y, x)$ denotes the stationary
joint density of the conditioning variables, and $s(y, x)$ is defined in Auestad and
Tjøstheim (1990).
Several points should be noted for (5.3) and (5.4). For parametric models we have
$\sqrt{T}$-consistency. For nonparametric models the rate is $(Th^{p+q})^{1/2}$, which is slower.
The presence of p(y, x) in the denominator of the asymptotic variance in (5.3) and (5.4)
means that the variance blows up close to the boundaries of the data set, and
extreme care must be used there in the interpretation of $\hat{M}(y, x)$ and $\hat{V}(y, x)$.
There are other aspects of practical significance that are not immediately
transparent from (5.3) and (5.4). They will be discussed next.

Confidence intervals. Asymptotic confidence intervals can in principle be computed
from (5.3) and (5.4) by replacing p(y, x), V(y, x) and s(y, x) by the corresponding
estimated quantities. An alternative is to form bootstrap confidence intervals.
Franke and Wendel (1990) discussed a simple example where the bootstrap performs
much better than asymptotic intervals. In the general case the bootstrap developed
by Künsch (1989) and Politis and Romano (1990) may be needed.

Bias. As seen from (5.3) and (5.4), $\hat{M}(y, x)$ and $\hat{V}(y, x)$ are asymptotically unbiased.
For a finite sample size the bias can be substantial. Thus, reasoning as in Auestad
and Tjøstheim (1990) yields the bias expression (5.5), where $I_2 = \int x^2 k(x)\,dx$. A
corresponding formula (Tjøstheim and Auestad, 1994a) holds for the conditional
variance. A Gaussian linear model will have a linear bias in the conditional mean,
but, in general, the bias can lead to a misspecified model. For example, a model with
a flat conditional variance (no conditional heteroskedasticity) may in fact appear to
have some form of heteroskedasticity due to bias from a rapidly varying M(y, x). An
example is given in Auestad and Tjøstheim (1990). Generally, $\hat{V}(y, x)$ is more affected
by bias and has more variability than $\hat{M}(y, x)$. This makes it harder to reveal the
structure of the conditional variance using purely nonparametric means; see, for
instance, the example of conditional stock volatility in Pagan and Schwert (1990).
Another problem is that misspecification of the conditional mean may mix up
conditional mean and variance effects. This is, of course, a problem in parametric
models as well.

Choosing the bandwidth. Comparing the variance and bias formulae (5.3)–(5.5), it
is seen that the classical problem of all smoothing operations is present. As h
increases, the variance decreases whereas the bias increases, and vice versa. How
should h be chosen for a given data set?
There are at least three approaches to this problem. The simplest solution is
to compute estimates for several values of h and to select one subjectively. A
second possibility is to use asymptotic theory. From (5.3)–(5.5) it is seen that if we
require that the variance and the squared bias should be asymptotically balanced,
then $(Th^{p+q})^{-1} \sim h^4$, or $h \sim T^{-1/(p+q+4)}$. An extension of this argument (Truong and
Stone, 1992) yields $h \sim T^{-1/(p+q+2R)}$, where R is a smoothness parameter. The
problem of choosing the proportionality factor still remains. A discussion of this
and related problems is given in Härdle (1990, Chapter 5), in Chapter 38 of this
volume and in Marron (1989). The third possibility, which is the most time consum-
ing but, possibly, the one most used in practice, is to use some form of cross vali-
dation. For details, see the above references. Simulation experiments showing
considerable variability for h selected by cross validation for one and the same
model have been reported.
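The cross validation idea is easily sketched for the kernel estimator above: compute the leave-one-out prediction error over a grid of bandwidths and pick the minimizer (numpy assumed; names ours).

import numpy as np

def cv_score(y, p, h):
    # Leave-one-out cross-validation score for the kernel estimate of
    # the conditional mean: each y_t is predicted from all other points
    T = len(y)
    lags = np.column_stack([y[p - i: T - i] for i in range(1, p + 1)])
    target = y[p:]
    n = len(target)
    err = 0.0
    for s in range(n):
        w = np.exp(-0.5 * ((lags - lags[s]) / h) ** 2).prod(axis=1)
        w[s] = 0.0                       # leave observation s out
        err += (target[s] - np.sum(w * target) / np.sum(w)) ** 2
    return err / n

rng = np.random.default_rng(6)
y = rng.normal(size=300)
for t in range(1, 300):
    y[t] += 0.6 * np.tanh(y[t - 1])
for h in (0.1, 0.2, 0.4, 0.8):
    print(h, cv_score(y, p=1, h=h))      # choose h with the smallest score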

Boundary effects. For a point (y, x) close to the boundary of the data set there will
be disproportionately more points on the inward side of (y, x). This asymmetry
implies that we are not able to integrate over the entire support of the kernel
function, so that we cannot exploit the fact that $\int x k(x)\,dx = 0$. This, in turn, means
that there is an additional bias of order h due to this boundary effect. For example,
for a linear regression model the estimated regression line would bend close to the
boundary. The phenomenon has primarily been examined theoretically in the fixed
regression design case (Rice, 1984; Müller, 1990).

Higher order kernels. Sometimes so-called higher order kernels have been suggested
for reducing bias. It is seen from (5.5) that if k is chosen such that $\int x^2 k(x)\,dx = 0$,
the bias will effectively be reduced to the next order term in the bias expansion
(typically of order $h^4$). However, practical experience in the finite sample case has
been mixed, and a higher order kernel does not work unless T is rather large.

Curse of dimensionality. This problem was mentioned in the introduction. It is a
well-known difficulty of multidimensional data analysis and a serious one in
nonparametric estimation. Although the bandwidth h typically increases somewhat
as the dimensions p and q increase, this is by no means enough to compensate for
the sparsity of points in a neighbourhood of a given point. There may still be some
useful information left in $\hat{M}(y, x)$ that can be used for specification purposes
(Tjøstheim and Auestad, 1994a, b) or as initial input to the iterative algorithms described
in the next section, but it is of little use as an accurate estimate of M(y, x).
In general one should try to avoid the curse of dimensionality by not looking at
too many regressors simultaneously, i.e. by considering (2.6) and (2.7) such that while
$i_p$ and $j_q$ may be large, p and q are not. This requires a method for singling out
significant lags nonparametrically, which was discussed in Section 4. Alternatively,
the problem may be handled by applying more restricted models, which will be
considered in the next section.

Other estimation methods. There are a number of alternative nonparametric
estimation methods. These are described in Härdle (1990, Chapter 3) and Hastie
and Tibshirani (1990, Chapter 2). The most commonly used are spline smoothing,
nearest neighbour estimation, orthogonal series expansion and the regressogram.
For all of these methods there is a smoothing parameter that must be chosen
analogously to the choice of bandwidth for the kernel smoother. The asymptotic
properties of the resulting estimators are roughly similar to those in kernel
estimation. The spline smoother (Silverman, 1984) can be rephrased asymptotically
as a kernel estimator with negative sidelobes. Diebolt (1990) applied the regressogram
to testing linearity. Yakowitz (1987) considered nearest neighbour methods in time
series. Further applications will be mentioned in the next section.

5.3. Estimation in restricted nonparametric and semiparametric models

As mentioned above, general nonparametric estimation with many variables leads
to increased variability and problems with the curse of dimensionality. To alleviate
these problems one can look at more restrictive models requiring particular forms
for f and g in (2.5) or one can consider semiparametric models. This section is
devoted to models of that kind.

Additive models. Virtually all restrictive models have some sort of additivity built
into them. In the simplest case (using consecutive lags)

$$y_t = \sum_{i=1}^{p} \alpha_i(y_{t-i}) + \sum_{j=1}^{q} \beta_j(x_{t-j}) + \varepsilon_t.$$

Regression versions of such models and generalizations with interaction terms are
analysed extensively in Hastie and Tibshirani (1990) and references therein. By
taking conditional expectations with respect to $y_{t-i}$ and $x_{t-j}$, simple identities are
obtained which can be used as a basis for an iterative algorithm for computing the
unknown functions $\alpha_i$ and $\beta_j$. The algorithm needs initial values of these functions.
One possibility is to use either projections or simply a linear model for this purpose.
Some examples and theoretical properties in the pure regression case are given by
Hastie and Tibshirani; see also Chen and Tsay (1993b). A minimal sketch of such a
backfitting iteration is given below.
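The sketch assumes numpy, uses a simple Gaussian kernel smoother, and illustrates the iterative idea only, not the precise algorithms of Hastie and Tibshirani (1990); all names are ours.

import numpy as np

def kernel_smooth(x, z, h=0.3):
    # Nadaraya-Watson smoother of z on x, evaluated at the sample points
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * z[None, :]).sum(axis=1) / w.sum(axis=1)

def backfit(y, X, smooth, n_iter=20):
    # Backfitting: each component f_j is re-estimated from the partial
    # residuals, holding the other additive components fixed
    n, d = X.shape
    fits = np.zeros((n, d))
    alpha = y.mean()
    for _ in range(n_iter):
        for j in range(d):
            others = [k for k in range(d) if k != j]
            partial = y - alpha - fits[:, others].sum(axis=1)
            fits[:, j] = smooth(X[:, j], partial)
            fits[:, j] -= fits[:, j].mean()    # centre for identifiability
    return alpha, fits

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=400)
alpha, fits = backfit(y, X, kernel_smooth)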
The ACE algorithm treats a situation in which the dependent variable may be
transformed as well, so that

$$\theta(y_t) = \sum_i \alpha_i(y_{t-i}) + \sum_j \beta_j(x_{t-j}) + \varepsilon_t.$$

The algorithm is perhaps best suited for a situation where $\alpha_i = 0$ for all i, so that
there is a clear distinction between the input and output variables. The method was
developed in Breiman and Friedman (1985). Some curious aspects of the ACE
algorithm are highlighted in Hastie and Tibshirani (1990, pp. 184–186). In view of
the above comments it is perhaps not surprising that, in a time series example,
Hallman (1990) obtained better results by using a version of backfitting (Tibshirani,
1988) than with the ACE algorithm.
Chen and Tsay (1993a) considered a univariate model allowing certain interactions.
Their functional-coefficient autoregressive (FCAR) model is given as

$$y_t = f_1(Y_{t-1}^*)\, y_{t-1} + \cdots + f_p(Y_{t-1}^*)\, y_{t-p} + \varepsilon_t,$$

where $Y_{t-1}^* = (y_{t-i_1}, \ldots, y_{t-i_k})'$ with $i_k \le p$. By ordering the observations according
to some variable, or a known combination of them, into arranged local regressions,
the authors proposed an iterative procedure for evaluating $f_1, \ldots, f_p$ and gave some
theoretical properties. The procedure simplifies dramatically if all the $f_j$ are
one-dimensional. The authors fitted an FCAR model of this type to the chicken pox
data of Sugihara and May (1990). The fitted model seemed to point at a threshold
autoregressive model. The forecasts from such a model, subsequently fitted to the
data, had an MSE at least 30% smaller than that of a seasonal ARMA model used
as a comparison for forecasting 4–11 months ahead.

Projection pursuit type models. These models can be written as

$$y_t = \sum_{j=1}^{r} \beta_j(\gamma_j' y_{t-1} + \eta_j' x_{t-1}) + \varepsilon_t,$$

where $\beta_j$, $j = 1, \ldots, r$, are unknown functions, $\gamma_j$ and $\eta_j$ are unknown vectors
determining the direction of the jth projector, and $y_{t-1}$, $x_{t-1}$ are as in (2.5). An
iterative procedure (Friedman and Stuetzle, 1981) exists for deriving optimal
projectors (projection pursuit step) and functions $\beta_j$. The curse of dimensionality is
avoided since the smoothing part of the algorithm exploits the fact that $\beta_j$ is a
function of one scalar variable. For time series data, experience with this method is
limited. A small simulation study that Granger and Teräsvirta (1992) conducted
gave marginal improvements compared to linear model fitting for the particular
nonlinear models they considered. Projection pursuit models are related to neural
network models, but for the latter the functions $\beta_j$ are assumed known and often
$\beta_j = \beta$, $j = 1, \ldots, r$, thus giving a parametric model class. The fitting of neural
network models is discussed in White (1989).

Regression trees, splines and MARS. Assume a model of the form $y_t = f(y_{t-1}, x_{t-1}) + \varepsilon_t$
and approximate f(y, x) in terms of simple basis functions $B_j(y, x)$, so that
$f_{\text{appr}}(y, x) = \sum_j c_j B_j(y, x)$. In the regression tree approach (Breiman et al., 1984)
$f_{\text{appr}}$ is built up recursively from indicator functions $B_j(y, x) = I\{(y, x) \in R_j\}$, and the
regions $R_j$ are partitioned in the next step of the algorithm according to a certain
pattern. As can be expected, there are problems in fitting simple smooth functions
like the linear model.
Friedman (1991) in his MARS (multivariate adaptive regression splines) metho-
dology has made at least two important new contributions. First, to overcome the
difficulty in fitting simple smooth functions, Friedman proposed not to automatically
eliminate the parent region Rj in the above recursive scheme for creating subregions.
In subsequent iteration both the parent region and its corresponding subregions
are eligible for further partitioning. This allows for much greater flexibility. The
second contribution is to replace step functions by products of linear left and right
truncated regression splines. The products make it possible to include interaction
terms. For a detailed discussion the reader is referred to Friedman (1991).
Lewis and Stevens (1991a) applied MARS to time series, both simulated and real
data. As for most of the techniques discussed in this section a number of input
parameters are needed. Lewis and Stevens recommended running the model for
several sets of parameters and then selecting a final model based on various specifica-
tion/fitting tests. They fitted a model to the sunspot data which has 3 one-way, 3
two-way and 7 three-way interaction terms. The MARS model produced better
overall forecasts of the sunspot activity than the models applied before. In Lewis
and Stevens (1991b) riverflow is fitted against temperature and precipitation, and
good results are obtained. There are as yet no applications to economic data.
The MARS technology appears very promising but must of course be tested more
extensively on real and simulated data sets. No asymptotic theory with confidence
intervals is available yet.

Stepwise series expansion of conditional densities. In a sense the conditional density
$p(y_t \mid y_{t-1}, x_{t-1})$ is the most natural quantity to look at in a joint modelling of $\{y_t, x_t\}$,
since predictive distributions as well as the conditional mean and variance can all
be derived from this quantity. Gallant and Tauchen (1989) used this fact as their
starting-point.
The conditional density is estimated, to avoid the curse of dimensionality, by
expanding it in Hermite polynomials. These are centred and scaled so that the
conditional mean M(y,x) and variance V(y,x) play a prominent role. As a first
approximation they are supposed to be linear Gaussian and of ARCH type,
respectively.
Gallant et al. (1992) looked at econometric applications, notably to stock market
data. In particular, they investigated the relationship between the volatility of stock
prices and volume. A main finding was that an asymmetry in the volatility of prices
when studied by itself more or less disappears when volume is included as an
additional conditional variable. Possible asymmetry in the conditional variance
function (univariate case) has recently been studied by a number of investigators
using both parametric and nonparametric methods; see Engle and Ng (1993) and
references therein.
Semiparametric models. Another way of trying to eliminate the difficulties in
evaluating high-dimensional conditional quantities is to assume nonlinear and
nonparametric dependence on some of the predictors, and parametric and usually
linear dependence on others. An illustrative example is given by Engle et al. (1986)
who modelled electricity sales using a number of predictor variables. It is natural
to assume the impact of temperature on electricity consumption to be nonlinear, as
both high and low temperatures lead to increased consumption, whereas a linear
relationship may be assumed for the other regressors. A similar situation arose in
Shumway et al. (1988) which is a study of mortality as a function of weather and
pollution variables in the Los Angeles region.
In the context of model (2.5) with a linear dependence on lags of $y_t$ and
nonlinearity with respect to the exogenous variable $\{x_t\}$, we have

$$y_t = \sum_{i=1}^{p} \varphi_i y_{t-i} + f(x_{t-1}) + \varepsilon_t,$$

where f is an unknown function. The modelling technique would depend somewhat
on the dimension of $x_{t-1}$. In the
case where the argument of f is scalar, it can be incorporated in the backfitting
algorithm of Hastie and Tibshirani (1990, p. 118). Under quite general assumptions
it is possible to obtain $\sqrt{T}$-consistency for the parametric part, as demonstrated by
Heckman (1986) and Robinson (1988). Powell et al. (1989) developed the theory
further and gave econometric applications.
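A sketch of the double-residual idea in the spirit of Robinson (1988), for an illustrative model with one autoregressive lag and a scalar exogenous variable, follows (numpy assumed; names and parameter values ours):

import numpy as np

def kernel_smooth(x, z, h=0.3):
    # Nadaraya-Watson smoother of z on x at the sample points
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return (w * z[None, :]).sum(axis=1) / w.sum(axis=1)

# Partially linear model y_t = phi*y_{t-1} + f(x_t) + e_t, f unknown
rng = np.random.default_rng(8)
T, phi = 600, 0.5
x = rng.uniform(-2, 2, size=T)
fx = np.sin(np.pi * x / 2)
y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + fx[t] + 0.2 * rng.normal()

ylag, yt, xt = y[:-1], y[1:], x[1:]
# Remove E(.|x) from both sides nonparametrically, then OLS on the residuals
ry = yt - kernel_smooth(xt, yt)
rl = ylag - kernel_smooth(xt, ylag)
phi_hat = np.sum(rl * ry) / np.sum(rl * rl)
print(phi_hat)                              # estimate of the parametric part
fhat = kernel_smooth(xt, yt - phi_hat * ylag)   # then recover f nonparametrically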

6. Evaluation of estimated models

After estimating a nonlinear time series model it is necessary to evaluate its
properties to see if the specified and estimated model may be regarded as an
adequate description of the relationship it was constructed to characterize. The
adequate description of the relationship it was constructed to characterize. The
residuals of the model can be subjected to various tests such as those against error
autocorrelation, ARCH and normality. At least in the parametric case linearity of
the time series was tested, and similar tests may now be performed on the residuals
to see if the model adequately characterizes the nonlinearity the tests previously
suggested. For instance, Eitrheim and Teräsvirta (1993) proposed testing the STAR
model against an alternative containing two additive STAR components and
derived an LM type test for this purpose. The test applies to STR models as well.
As to testing the null of no error autocorrelation, it should be noted that the
asymptotic distribution of the Ljung–Box test statistic based on estimated residuals
is not available, as the correct number of degrees of freedom is known only if the
estimated model is a linear ARMA model. For this reason, Eitrheim and Teräsvirta
(1993) also derived an LM test for testing the residuals of the STR model against
autocorrelation.
One should also study the long-term properties of the model, which generally can
only be done numerically by simulating the model without noise. A bilinear model
constitutes an exception as its long-term solution is the same as that of the
corresponding linear autoregressive model. The exogenous variables should be set
on a constant level, for instance, equal to their sample means. If the solution path
diverges, the model should be rejected and respecification attempted. Other
examples of a solution are a limit cycle or a unique stable singular point. Sometimes
several solutions may appear depending on the starting-values. See, e.g. Ozaki
(1985) for further discussion.
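Such a noise-free simulation is straightforward to code; the following sketch (numpy assumed; names ours) iterates the deterministic part of a model from several starting-values and inspects the tail of each path.

import numpy as np

def skeleton(step, y0, n=500):
    # Iterate the model without noise from a given start vector; the tail
    # of the path indicates the long-run solution (fixed point, limit
    # cycle, or divergence)
    path = list(y0)
    for _ in range(n):
        path.append(step(np.array(path[-len(y0):])))
    return np.array(path)

# Example with an illustrative stable nonlinear AR(1): y_t = 0.9*tanh(y_{t-1})
step = lambda lags: 0.9 * np.tanh(lags[-1])
for start in (-3.0, 0.5, 3.0):
    print(skeleton(step, [start])[-3:])   # all paths converge to the same point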
The out-of-sample prediction of the model is an important part of the evaluation
process. The precision of the forecasts should be compared to those from the
corresponding linear model. However, as mentioned in the introduction, the results
also depend on the data during the forecasting period. If there are no observations
in the range in which nonlinearity of the model makes an impact, then the forecasts
cannot be expected to be more accurate than those from a linear model. The check
is thus negative: if the forecasts from the nonlinear model are significantly less
accurate than those from the corresponding linear one, then the nonlinear specifica-
tion should be reconsidered.
A further check of the estimated model is to see whether it can reproduce a feature
of interest in the data. A fitted model is considered adequate only if it is capable of
doing that. The spectral density function of the time series may be such a feature.
The check is carried out by bootstrapping the estimated model (linear or nonlinear)
which is required to be parametric. Details and examples can be found in Tsay
(1992).

7. Example

As an example of the parametric specification, estimation, and evaluation cycle
discussed in Section 4 we shall consider modelling the seasonally unadjusted logarithmic
Austrian industrial output 1960(1) to 1986(4). This is one of the series analysed
in Luukkonen and Teräsvirta (1991) and Teräsvirta and Anderson (1992). However,
while those authors tested linearity of the series and rejected it, they did not report
any modelling results. Our aim is to see whether we can describe the four-quarter
differences ($y_t$) of the series, or annual growth rates, appearing in Figure 1, by a
STAR model. In order to do that we first have to specify a linear autoregressive
model for the series. Following Teräsvirta and Anderson we choose an AR(5) model
yielded by AIC. Having done this, the second step is to test linearity against STAR
using five lags and applying an F-test based on the auxiliary regression (4.2). The
results are found in Table 1, where it is seen that the smallest p-value of the tests for
$d = 1, \ldots, 5$ is obtained at d = 1. We take this value (= 0.010) to be sufficiently small
to reject linearity in favour of STAR. The next step is to choose between an exponen-
tial and a logistic STAR model assuming that d = 1. Table 1 shows that the p-values
of the F-tests of both $H^*_{01}$ and $H^*_{03}$ are smaller than that of the test of $H^*_{02}$ (see
Section 4), so that the decision rule discussed in Section 4 leads us to choose an
LSTAR model.
Figure 1. Four-quarter differences of the logarithmic index of Austrian industrial production,
1961(1)–1986(4).

Table 1
The p-values of the LM type linearity test against STAR based on (4.2) for delays
d = 1, ..., 5, and the p-values of the model specification tests to choose between
LSTAR and ESTAR for d = 1, for the four-quarter differences of the logarithmic
Austrian industrial production model, 1960(1)–1986(4). The linear base model is
AR(5).

Null hypothesis                  d = 1    d = 2    d = 3    d = 4    d = 5

H₀:  β₁ = β₂ = β₃ = 0            0.010    0.29     0.15     0.22     0.41
H₀₃*: β₃ = 0                     0.034
H₀₂*: β₂ = 0 | β₃ = 0            0.24
H₀₁*: β₁ = 0 | β₂ = β₃ = 0       0.039

After selecting the type of model, the next problem is that of specifying the lag
structure. An obvious way to start is to estimate the parameters of the full model
(2.2) with (2.3) as the transition function. However, here, as in many other similar
situations, this leads to convergence problems because some of the parameters are
redundant and their estimates highly correlated with those of other parameters.
To avoid this it is often advisable to fix γ, or even both γ and c, in (2.3) and estimate
φ and θ conditionally. This helps one to put restrictions on the elements of these
parameter vectors, and after finding a sensible set of parameters, the model can be
re-estimated without any restrictions on γ and c. It is of course possible and
sometimes even desirable to impose further restrictions on φ and θ even after this
stage. Note that, apart from the usual restrictions of the type $\varphi_j = 0$ and $\theta_j = 0$, the
exclusion restrictions $\varphi_j = -\theta_j$ are useful. While $\varphi_j = 0$ makes the parameter
$\varphi_j + \theta_j F = 0$ for F = 0, the latter does the same for F = 1.
The final estimated LSTAR model has the form

$$\begin{aligned}
y_t = {} & \underset{(0.18)}{0.76}\,y_{t-1} + \underset{(0.17)}{0.30}\,y_{t-2} - \underset{(0.16)}{0.37}\,y_{t-3} - \underset{(0.15)}{0.63}\,y_{t-4} + \underset{(0.11)}{0.55}\,y_{t-5} \\
& {}+ \bigl(\underset{(0.0081)}{0.087} - \underset{(0.18)}{0.76}\,y_{t-1} - \underset{(0.17)}{0.30}\,y_{t-2} + \underset{(0.16)}{0.37}\,y_{t-3} + \underset{(0.15)}{0.63}\,y_{t-4} - \underset{(0.11)}{0.55}\,y_{t-5}\bigr) \\
& {}\times \bigl[1 + \exp\{-\underset{(0.80)}{2.2} \times 24\,(y_{t-1} - \underset{(0.010)}{0.063})\}\bigr]^{-1} + \hat{u}_t, \qquad (7.1)
\end{aligned}$$

$$s = 0.0217, \quad s/s_L = 0.87, \quad F_{AR}(6,78) = 1.2\,(0.34), \quad F_{ARCH}(4,90) = 0.96\,(0.44),$$
$$sk = 0.054, \quad ek = 0.96, \quad LJB = 3.9\,(0.15).$$
The restrictions $\varphi_0 = 0$ and $\varphi_j = -\theta_j$, $j = 1, \ldots, 5$, were suggested by the data and
imposed during the lag specification stage. The figures below the parameter estimates
are the estimated standard deviations based on the Hessian; the ones in parentheses
following the values of the test statistics are p-values. Note that the exponent of the
transition function is standardized by dividing it by the sample standard deviation
of $y_t$ $[1/\hat{\sigma}(y) = 24]$. This is useful because γ originally is not scale-free, and
standardizing it makes it much easier to give it a suitable starting-value for
estimation. Furthermore, s is the estimated standard error of the residuals, $s_L$ is ditto
for the AR(5) model, $F_{AR}(q, n)$ is the F-test of no autocorrelation (Section 6) against
qth order autocorrelation, $F_{ARCH}(q, n)$ is the LM test against ARCH of order q, sk
is skewness, ek excess kurtosis, and LJB the Lomnicki–Jarque–Bera normality test.
The tests do not reveal any serious inadequacy of the model. There seems to be
some excess kurtosis in the residuals, but it amounts to one half of that in the
AR(5) model. Table 2 contains the results of the test of no remaining nonlinearity
(Section 6). They indicate that this null hypothesis cannot be rejected at conventional
significance levels. The numerical long-run solution paths converge to the same
point independent of the starting-values. This allows us to conclude that (7.1) has
a unique stable singular point (Section 6). Thus the model cannot be rejected on the
grounds of a diverging solution path. Note, however, that the value of the solution
(= 0.063) clearly exceeds the sample mean of the series, which equals 0.034.
The statistical analysis of the model, so far, thus does not reveal any serious model
inadequacy and we can proceed to interpreting the estimated model. The parameters
Table 2
The p-values of the test of no remaining nonlinearity in Eitrheim and
Teräsvirta (1993) performed on the residuals of LSTAR model (7.1)
for delays d = 1, ..., 5.

Test        d = 1    d = 2    d = 3    d = 4    d = 5

F(15,75)    0.17     0.70     0.22     0.31     0.30

most easily interpreted are γ and c. The former indicates how rapidly the
parameter vector $\varphi + \theta F$ changes from one extreme to the other with $y_{t-1}$.
The larger γ, the more rapid the change. The location parameter c tells where in the
range the change occurs, as F = 0.5 at $y_{t-1} = c$. This information is summed up by
graphing $\hat{F}$, and the graph appears in Figure 2. It is seen that in our example
$\hat{\varphi} + \hat{\theta}\hat{F}$ changes rather slowly with $y_{t-1}$. It may also be interesting to know how
$\hat{\varphi} + \hat{\theta}\hat{F}$ has varied over time. Figure 3 shows that low values of $\hat{F}$ have been much
more common than high ones.
It is of course impossible to interpret the individual parameter estimates $\hat{\varphi}_j$ or $\hat{\theta}_j$ in
(7.1). A study of the roots of characteristic polynomials offers a better way of

Figure 2. Transition function of model (7.1) for the four-quarter differences of the logarithmic index of
Austrian industrial production.
Figure 3. Values over time of the transition function of model (7.1) for the four-quarter differences of
the logarithmic index of Austrian industrial production.

interpreting (7.1), exactly as it does in the case of a linear autoregressive model. The
roots can be computed at various values of $\hat{F}$, of which zero and one are, perhaps,
the most interesting ones. Table 3 contains the roots for $\hat{F} = 0$ and 0.5. When $\hat{F} = 0$
the local dynamics of (7.1) are characterized by a strong cyclic component with a
period of about two years. When $\hat{F}$ increases this component grows weaker. For
$\hat{F} = 0.5$, the modulus of the corresponding pair of complex roots has decreased from
Table 3
The roots of the characteristic polynomial of (7.1) for
$\hat{F}$ = 0, 0.5.

F̂      Root            Modulus    Period

0      0.71 ± 0.63i    0.95       8.7
       −0.71 ± 0.55i   0.90       2.5
       0.75            0.75
0.5    0.51 ± 0.64i    0.82       7.0
       −0.63 ± 0.50i   0.81       2.5
       0.63            0.63

The characteristic polynomial is $C(z) = z^5 - \sum_{j=1}^{5} (\hat{\varphi}_j + \hat{\theta}_j \hat{F})\, z^{5-j}$.
0.95 to 0.82. At the same time, the intercept has increased from zero to 0.044. Indeed,
it is seen directly from (7.1) that when $\hat{F}$ approaches unity the (local) cyclical
variation disappears altogether. The local autoregression for $\hat{F} = 1$ is merely a white
noise process with mean 0.087. Thus, after entering a recession, industrial production
on average is bound to recover strongly after a few quarters because of the cyclical
component. On the other hand, according to (7.1) it is somewhat more difficult for
a recovery to change into a recession. During a recovery the cycle is less pronounced,
and the local linear approximation has a positive intercept as well. A sufficiently
large negative shock may be required to depress industrial output from high to
low growth rates.
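The local-dynamics computation behind Table 3 is easily reproduced: evaluate the roots of C(z) at chosen values of $\hat{F}$ from the point estimates in (7.1) (Python with numpy assumed).

import numpy as np

# Roots of C(z) = z^5 - sum_j (phi_j + theta_j*F) z^(5-j) at a given F,
# using the point estimates of (7.1) and the restriction theta_j = -phi_j
phi = np.array([0.76, 0.30, -0.37, -0.63, 0.55])
theta = -phi
for F in (0.0, 0.5, 1.0):
    coefs = np.concatenate([[1.0], -(phi + theta * F)])
    roots = np.roots(coefs)
    print(F, np.round(np.abs(roots), 2))   # the moduli shrink as F grows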
It is also illuminating to compare the residuals of (7.1) with those from the linear
model. They are shown in Figure 4. It is seen that the LSTAR model explains the
aftermath of the exceptionally large observation (0.14) in 1972(4) much better than
the AR(5) model. This is the case because the local dynamics of the LSTAR model
predict a drop to 0.086 in the next period, while the AR(5) model offers a much
slower return to lower growth. The residual sum of squares of (7.1) is 85 percent of
that of the AR model. After subtracting the residual for 1973(1), the same figure is
92 percent. This is still a fair improvement, but these figures do indicate that a single
observation may have quite a large influence on the results. Thus one should be

Figure 4. Residuals of the LSTAR model (7.1) (solid line) and the AR(5) model (broken line) for the
four-quarter differences of the logarithmic index of Austrian industrial production.
aware of the possibility that results from nonlinear modelling can be rather sensitive
to data errors or other outliers of similar kind.
It does not seem unusual that an LSTAR model is useful in modelling the
consequences of exceptional events. The modelling exercise with other industrial
production series in Terasvirta and Anderson (1992) showed that nonlinearity was
mainly needed to describe the response of the output to large negative shocks. In
the absence of such shocks, both the STAR and the AR models seemed to fit the data
equally well. The main difference here is that the most important contribution of
the LSTAR model (7.1) is to characterize the response of the system to a large
positive shock.
The specialized nature of (7.1) becomes obvious also when the model is used for
one-quarter-ahead forecasting. The observations 1987(1)–1988(4) were saved for
this purpose. The root mean square error (RMSE) of the eight forecasts equals 0.023,
which is about the size of the residual standard deviation of (7.1). However, the
RMSE of the forecasts from the AR(5) model only equals 0.013. The test of the
hypothesis that both models have the same mean square error of prediction, against
the alternative that the AR(5) has the lower mean square error of the two, has p-value
0.049. The reason for this outcome is that the prediction period does not contain
any nonlinearity of the kind appearing during the estimation period. The simple
AR model, thus, can forecast such a regular period better than the more involved
LSTAR one.
This example is univariate. Bacon and Watts (1971) contained probably the first
application of a bivariate STR model, but the data were not economic. An
application of bivariate STR models to economic data can be found in Granger
et al. (1993), but as a whole the number of applications so far is small.

8. Conclusions

This chapter is an attempt at an overview of various ways of modelling nonlinear
economic relationships. Since nonlinear time series models and methods are a very
large field, not all important developments have been covered. The emphasis has
been on model building, and the modelling cycle, comprising linearity testing, model
specification, parametric or nonparametric function estimation and model evaluation,
has been highlighted. The estimation of fully specified nonlinear theory models such
as disequilibrium models has not been included here. A majority of results concern
the estimation of the conditional mean of a process and, therefore, the conditional
variance has received less attention. This is, in part, because conditional hetero-
skedasticity is discussed in Chapter 49. Random coefficient models also belong
under that heading and have not been considered here. Furthermore, this presenta-
tion reflects the belief that economic phenomena are more naturally characterized
by stochastic rather than deterministic models, so that deterministic chaos and its
applications to economics have only been briefly mentioned in the discussion.
At present the number of applications of nonlinear time series models in
economics is still fairly limited. Many techniques discussed here are as yet relatively
untested. However, the situation may change rather rapidly, so that in a few years
the possibilities of evaluating the empirical success of present and new techniques
will be considerably better than now.

References

Akaike, H. (1969) Fitting autoregressions for predictions, Annals of the Institute of Statistical
Mathematics, 21, 243–247.
Anderson, B.D.O. and J.B. Moore (1979) Optimal filtering. Englewood Cliffs, NJ: Prentice-Hall.
Andrews, D.W.K. (1993) Tests for parameter instability and structural change with unknown change
point, Econometrica, 61, 821-856.
Auestad, B. and D. Tjøstheim (1990) Identification of nonlinear time series: First order characterization
and order determination, Biometrika, 77, 669–687.
Auestad, B. and D. Tjøstheim (1991) Functional identification in nonlinear time series, in: G. Roussas,
ed., Nonparametric functional estimation and related topics. Amsterdam: Kluwer Academic Publishers,
493–507.
Bacon, D.W. and D.G. Watts (1971) Estimating the transition between two intersecting straight lines,
Biometrika, 58, 525–534.
Barnett, W.A., J. Powell and G.E. Tauchen (1991) eds., Nonparametric and semi-parametric methods in
econometrics and statistics. Proceedings of the 5th International Symposium in Economic Theory and
Econometrics. Cambridge: Cambridge University Press.
Bates, D.M. and D.G. Watts (1988) Nonlinear regression analysis and its applications. New York: Wiley.
Box, G.E.P. and G.M. Jenkins (1970) Time series analysis, forecasting and control. San Francisco:
Holden-Day.
Breiman, L. and J.H. Friedman (1985) Estimating optimal transformations for multiple regression and
correlation, Journal of the American Statistical Association, 80, 580–619 (with discussion).
Breiman, L., J.H. Friedman, R. Olshen and C.J. Stone (1984) Classification and regression trees. Belmont,
CA: Wadsworth.
Brock, W.A. and S.M. Potter (1993) Nonlinear time series and macroeconometrics, in: G.S. Maddala,
C.R. Rao and H.R. Vinod, eds., Handbook of Statistics, Vol. 11. Amsterdam: North-Holland,
195–229.
Brock, W.A., W.D. Dechert and J.A. Scheinkman (1987) A test for independence based on the correlation
dimension. Working paper, University of Wisconsin-Madison, Social Systems Research Institute.
Brockett, P.L., M.J. Hinich and D. Patterson (1988) Bispectral-based tests for the detection of
Gaussianity and linearity in time series, Journal of the American Statistical Association, 83, 657–664.
Brown, R.L., J. Durbin and J.M. Evans (1975) Techniques for testing the constancy of regression
coefficients over time, Journal of the Royal Statistical Society B, 37, 149–192 (with discussion).
Chan, K.S. (1990) Testing for threshold autoregression, Annals of Statistics, 18, 1886-1894.
Chan, K.S. (1991) Percentage points of likelihood ratio tests for threshold autoregression, Journal of
the Royal Statistical Society B, 53, 691–696.
Chan, K.S. (1993) Consistency and limiting distribution of the least squares estimator of a threshold
autoregressive model, Annals of Statistics, 21, 520–533.
Chan, K.S. and H. Tong (1986) On estimating thresholds in autoregressive models, Journal of Time
Series Analysis, 7, 179–190.
Chan, K.S. and H. Tong (1990) On likelihood ratio tests for threshold autoregression, Journal of the
Royal Statistical Society B, 52, 469–476.
Chen, P. and R.H. Day (1992), eds., Non-linear dynamics and evolutionary economics. Cambridge, MA:
MIT Press.
Chen, R. and R.S. Tsay (1993a) Functional coefficient autoregressive models, Journal of the American
Statistical Association, 88, 298-308.
2954 T. Teriisvirta et al.

Chen, R. and R.S. Tsay (1993b) Nonlinear additive ARX models, Journal of the American Statistical
Association, 88, 955–961.
Cheng, B. and H. Tong (1992) On consistent nonparametric order determination and chaos, Journal of
the Royal Statistical Society B, 54, 427–449.
Chow, G.C. (1960) Testing for equality between sets of coefficients in two linear regressions,
Econometrica, 28, 591-605.
Davidson, R. and J.G. MacKinnon (1985) Heteroskedasticity-robust tests in regression directions,
Annales de l'INSEE, 59/60, 183–218.
Davies, R.B. (1977) Hypothesis testing when a nuisance parameter is present only under the alternative,
Biometrika, 64, 247–254.
Davies, R.B. (1987) Hypothesis testing when a nuisance parameter is present only under the alternative,
Biometrika, 74, 33–43.
De Gooijer, J.G. and K. Kumar (1992) Some recent developments in non-linear time series modelling,
testing and forecasting, International Journal of Forecasting, 8, 135-156.
Delgado, M.A. and P.M. Robinson (1992) Nonparametric and semiparametric methods for economic
research, Journal of Economic Surveys, 6, 201-249.
Desai, M. (1984) Econometric models of the share of wages in national income, U.K. 1855–1965, in:
R.M. Goodwin, M. Kruger and A. Vercelli, eds., Nonlinear models of fluctuating growth. Lecture Notes
in Economics and Mathematical Systems No. 228, New York: Springer Verlag.
Diebolt, J. (1990) Testing the functions defining a nonlinear autoregressive time series, Stochastic
Processes and their Applications, 36, 85-106.
Eitrheim, Ø. and T. Teräsvirta (1993) Testing the adequacy of smooth transition autoregressive models.
Bank of Norway, Research Department, Working Paper 1993/13.
Engle, R.F. and V. Ng (1993) Measuring and testing the impact of news on volatility, Journal of Finance,
48, 1749–1778.
Engle, R.F., C.W.J. Granger, J. Rice and A. Weiss (1986) Semiparametric estimates of the relation
between weather and electricity sales, Journal of the American Statistical Association, 81, 310-320.
Ertel, J.E. and E.B. Fowlkes (1976) Some algorithms for linear spline and piecewise multiple linear
regression, Journal of the American Statistical Association, 71, 640–648.
Franke, J. and M. Wendel (1990) A bootstrap approach for nonlinear autoregressions. Some
preliminary results, Preprint, to appear in: Proceedings of the International Conference on Bootstrap-
ping and Related Techniques, Trier, June 1990.
Friedman, J.H. (1991) Multivariate adaptive regression splines, Annals of Statistics, 19, l-141 (with
discussion).
Friedman, J.H. and W. Stuetzle (1981) Projection pursuit regression, Journal ofthe American Statistical
Association, 76, 817-823.
Gallant, A.R. (1981) On the bias in flexible functional forms and an essentially unbiased form: The
Fourier Flexible Form, Journal of Econometrics, 15, 211-245.
Gallant, A.R. (1987) Nonlinear statistical models. New York: Wiley.
Gallant, A.R. and G. Tauchen (1989) Seminonparametric estimation of conditionally constrained
heterogeneous processes: asset pricing applications, Econometrica, 57, 1091-l 120.
Gallant, A.R., P.E. Rossi and G. Tauchen (1992) Stock prices and volume, Review ofFinancial Studies,
5, 1999242.
Grange& C.W.J. and J.J. Hallman (1991a) Nonlinear transformations of integrated time series, Journal
of Time Series Analysis, 12, 207-224.
Granger, C.W.J. and J.J. Hallman (1991b) Long-memory processes with attractors, Oxford Bulletin of
Economics and Statistics, 53, 1 l-26.
Granger, C.W.J. and J.L. Lin (1991) Nonlinear correlation coefficients and identification of nonlinear
time series models. University of California, San Diego, Department of Economics, Discussion Paper.
Granger, C.W.J. and P. Newbold (1986). Forecasting economic time series. 2nd edition. Orlando, FL:
Academic Press.
Granger, C.W.J. and T. Terasvirta (1992) Experiments in modeling nonlinear relationships between
time series, in: M. Casdagli and S. Eubank, eds., Nonlinear modeling and foreasting. Proceedings of
the Workshop on Nonlinear Modeling and Forecasting Held September, 1990 in Santa Fe, New Mexico,
Redwood City, CA: Addison-Wesley, 189-197.
Granger, C.W.J. and T. Teräsvirta (1993) Modelling nonlinear economic relationships. Oxford: Oxford
University Press.
Granger, C.W.J., T. Teräsvirta and H.M. Anderson (1993) Modelling non-linearity over the business
cycle, in: J.H. Stock and M.W. Watson, eds., Business cycles, indicators and forecasting. Chicago:
University of Chicago Press, 311-325.
Haggan, V. and T. Ozaki (1981) Modelling non-linear random vibrations using an amplitude-dependent
autoregressive time series model, Biometrika, 68, 189-196.
Haggan, V., S.M. Heravi and M.B. Priestley (1984) A study of the application of state-dependent models
in nonlinear time series analysis, Journal of Time Series Analysis, 5, 69-102.
Hallman, J.J. (1990) Nonlinear integrated series, cointegration and application. PhD Thesis, University
of California, San Diego, Department of Economics.
Hansen, B.E. (1990) Lagrange multiplier tests for parameter instability in non-linear models. Paper
presented at the Sixth World Congress of the Econometric Society, Barcelona.
Härdle, W. (1990) Applied nonparametric regression. Oxford: Oxford University Press.
Harvey, A.C. (1990) Econometric analysis of time series, 2nd edition. Cambridge, MA: MIT Press.
Hastie, T.J. and R.J. Tibshirani (1990) Generalized additive models. London: Chapman and Hall.
Heckman, N. (1986) Spline smoothing in a partly linear model, Journal of the Royal Statistical Society B,
48, 244-248.
Higgins, M. and A.K. Bera (1989) A joint test for ARCH and bilinearity in the regression model,
Econometric Reviews, 8, 171-181.
Hinich, M.J. (1982) Testing for Gaussianity and linearity of a stationary time series, Journal of Time
Series Analysis, 3, 169-176.
Joe, H. (1989) Estimation of entropy and other functionals of a multivariate density, Annals of the
Institute of Statistical Mathematics, 41, 683-697.
Judge, G.G., W.E. Griffiths, R.C. Hill, H. Lütkepohl and T.-C. Lee (1985) The theory and practice of
econometrics, 2nd edition. New York: Wiley.
Keenan, D.M. (1985) A Tukey non-additivity type test for time series nonlinearity, Biometrika, 72,
39-44.
Krämer, W., W. Ploberger and R. Alt (1988) Testing for structural change in dynamic models,
Econometrica, 56, 1355-1369.
Kuan, C.-M. and H. White (1994) Artificial neural networks: An econometric perspective, Econometric
Reviews, 13, 1-143 (with discussion).
Künsch, H. (1989) The jackknife and the bootstrap for general stationary observations, Annals of
Statistics, 17, 1217-1241.
Lasota, A. and M.C. Mackey (1989) Stochastic perturbation of dynamical systems: The weak
convergence of measures, Journal of Mathematical Analysis and Applications, 138, 232-248.
Lee, T.-H., H. White and C.W.J. Granger (1993) Testing for neglected nonlinearity in time series models.
A comparison of neural network methods and alternative tests, Journal of Econometrics, 56, 269-290.
Lewis, P.A.W. and J.G. Stevens (1991a) Nonlinear modeling of time series using multivariate adaptive
regression splines (MARS), Journal of the American Statistical Association, 86, 864-877.
Lewis, P.A.W. and J.G. Stevens (1991b) Semi-multivariate nonlinear modeling of time series using
multivariate adaptive regression splines (MARS). Preprint, Naval Post-graduate School.
Liu, T., C.W.J. Granger and W. Heller (1992) Using the correlation exponent to decide whether an
economic series is chaotic, Journal of Applied Econometrics, 7, S25-S39.
Luukkonen, R. (1990) On linearity testing and model estimation in non-linear time series analysis. Helsinki:
Finnish Statistical Society.
Luukkonen, R. and T. Teräsvirta (1991) Testing linearity of economic time series against cyclical
asymmetry, Annales d'économie et de statistique, 20/21, 125-142.
Luukkonen, R., P. Saikkonen and T. Teräsvirta (1988a) Testing linearity in univariate time series,
Scandinavian Journal of Statistics, 15, 161-175.
Luukkonen, R., P. Saikkonen and T. Teräsvirta (1988b) Testing linearity against smooth transition
autoregression, Biometrika, 75, 491-499.
Maddala, G.S. (1977) Econometrics. New York: McGraw-Hill.
Maddala, G.S. (1986) Disequilibrium, self-selection and switching models, in: Z. Griliches and M.D.
Intriligator, eds., Handbook of econometrics. Vol. 3, Amsterdam: North-Holland, 1634-1688.
Marron, S. (1989) Automatic smoothing parameter selection: A survey, in: A. Ullah, ed., Semiparametric
and nonparametric econometrics. Heidelberg: Physica-Verlag, 65-86.
Masry, E. and D. Tjøstheim (1994) Nonparametric estimation and identification of ARCH and
ARX nonlinear time series. Strong convergence and asymptotic normality, Econometric Theory
(forthcoming).
Müller, H.G. (1990) Smooth optimum kernel estimators near endpoints. Preprint, University of
California, Davis.
Nicholls, D.F. and A.R. Pagan (1985) Varying coefficient regression, in: E.J. Hannan, P.R. Krishnaiah
and M.M. Rao, eds., Handbook of statistics. Vol. 5. Amsterdam: Elsevier, 413-449.
Ozaki, T. (1985) Non-linear time series models and dynamical systems, in: E.J. Hannan, P.R. Krishnaiah
and M.M. Rao, eds., Handbook of statistics. Vol. 5. Amsterdam: Elsevier, 25-83.
Pagan, A.R. and G.W. Schwert (1990) Alternative models for conditional stock volatility, Journal of
Econometrics, 45, 261-290.
Petruccelli, J.D. (1990) A comparison of tests for SETAR-type non-linearity in time series, Journal of
Forecasting, 9, 25-36.
Petruccelli, J.D. and N. Davies (1986) A portmanteau test for self-exciting threshold autoregressive-type
nonlinearity, Biometrika, 73, 687-694.
Ploberger, W. and W. Krämer (1992) The CUSUM-test with OLS residuals, Econometrica, 60,
271-285.
Politis, D.N. and J.P. Romano (1990) A nonparametric resampling procedure for multivariate confidence
regions in time series analysis. Technical Report, Department of Statistics, Stanford University.
Pötscher, B.M. and I.R. Prucha (1991a) Basic structure of the asymptotic theory in dynamic nonlinear
econometric models, Part I: Consistency and approximation concepts, Econometric Reviews, 10,
125-216.
Pötscher, B.M. and I.R. Prucha (1991b) Basic structure of the asymptotic theory in dynamic nonlinear
econometric models, Part II: Asymptotic normality, Econometric Reviews, 10, 253-325.
Powell, J.L., J.H. Stock and T.M. Stoker (1989) Semiparametric estimation of index coefficients,
Econometrica, 57, 1403-1430.
Priestley, M. (1988) Non-linear and non-stationary time series analysis. London and San Diego: Academic
Press.
Quandt, R. (1960) Tests of the hypothesis that a linear regression system obeys two separate regimes,
Journal of the American Statistical Association, 55, 324-330.
Quandt, R. (1982) Econometric disequilibrium models, Econometric Reviews, 1, 1-63.
Quandt, R. (1983) Computational problems and methods, in: Z. Griliches and M.D. Intriligator, eds.,
Handbook of econometrics. Vol. 1. Amsterdam: North-Holland, 699-746.
Ramsey, J.B. (1969) Tests for specification errors in classical linear least-squares regression analysis,
Journal of the Royal Statistical Society B, 31, 350-371.
Rice, J. (1984) Boundary modification for kernel regression, Communications in Statistics, Theory and
Methods, 13, 893-900.
Robinson, P.M. (1983) Non-parametric estimation for time series models, Journal of Time Series
Analysis, 4, 185-208.
Robinson, P.M. (1988) Root-N-consistent semiparametric regression, Econometrica, 56, 931-954.
Robinson, P.M. (1991) Consistent nonparametric entropy-based testing, Review of Economic Studies,
58, 437-453.
Saikkonen, P. and R. Luukkonen (1988) Lagrange multiplier tests for testing nonlinearities in time
series models, Scandinavian Journal of Statistics, 15, 55-68.
Scheinkman, J.A. (1990) Nonlinearities in economic dynamics, Economic Journal, 100, Supplement,
33-48.
Seber, G.A.F. and C.J. Wild (1989) Nonlinear regression. New York: Wiley.
Shumway, R.H., A.S. Azari and Y. Pawitan (1988) Modeling mortality fluctuations in Los Angeles as
functions of pollution and weather effects, Environmental Research, 45, 224-241.
Silverman, B.W. (1984) Spline smoothing: the equivalent variable kernel method, Annals of Statistics,
12, 898-916.
Skaug, H. and D. Tjøstheim (1993a) Nonparametric tests of serial independence, in: T. Subba Rao,
ed., The M.B. Priestley Birthday Volume. London: Chapman and Hall, 207-229.
Skaug, H. and D. Tjøstheim (1993b) A nonparametric test of serial independence based on the empirical
distribution function, Biometrika, 80, 591-602.
Skaug, H. and D. Tjøstheim (1993c) Measures of distance between densities with application to testing
for serial independence. Preprint, Department of Mathematics, University of Bergen.
Stensholt, B.K. and D. Tjøstheim (1987) Multiple bilinear time series models, Journal of Time Series
Analysis, 8, 221-233.
Stinchcombe, M. and H. White (1989) Universal approximations using feedforward networks with
non-sigmoid hidden layer activation functions, in: Proceedings of the International Joint Conference
on Neural Networks, Washington, D.C. San Diego: SOS Printing, I: 613-618.
Subba Rao, T. and M.M. Gabr (1980) A test for linearity of stationary time series, Journal of Time
Series Analysis, 1, 145-158.
Subba Rao, T. and M.M. Gabr (1984) An introduction to bispectral analysis and bilinear time series
models, in: Lecture Notes in Statistics, 24, New York: Springer.
Sugihara, G. and R.M. May (1990) Nonlinear forecasting as a way of distinguishing chaos from
measurement error in time series, Nature, 344, 734-741.
Teräsvirta, T. (1990) Power properties of linearity tests for time series. University of California, San
Diego, Department of Economics, Discussion Paper No. 90-15.
Teräsvirta, T. (1994) Specification, estimation and evaluation of smooth transition autoregressive
models, Journal of the American Statistical Association, 89, 208-218.
Teräsvirta, T. and H.M. Anderson (1992) Modelling nonlinearities in business cycles using smooth
transition autoregressive models, Journal of Applied Econometrics, 7, S119-S136.
Teräsvirta, T., C.-F. Lin and C.W.J. Granger (1993) Power of the neural network linearity test, Journal
of Time Series Analysis, 14, 209-220.
Tibshirani, R. (1988) Estimating optimal transformations for regression via additive and variance
stabilization, Journal of the American Statistical Association, 83, 559-568.
Tjøstheim, D. (1986) Some doubly stochastic time series models, Journal of Time Series Analysis, 7,
51-72.
Tjøstheim, D. (1994) Nonlinear time series: A selective review, Scandinavian Journal of Statistics
(forthcoming).
Tjøstheim, D. and B. Auestad (1994a) Nonparametric identification of nonlinear time series: Projec-
tions, Journal of the American Statistical Association, 89 (forthcoming).
Tjøstheim, D. and B. Auestad (1994b) Nonparametric identification of nonlinear time series: Selecting
significant lags, Journal of the American Statistical Association, 89 (forthcoming).
Tong, H. (1990) Non-linear time series. A dynamical system approach. Oxford: Oxford University Press.
Truong, Y.K. and C. Stone (1992) Nonparametric function estimation involving time series, Annals of
Statistics, 20, 77-97.
Tsay, R.S. (1986) Nonlinearity tests for time series, Biometrika, 73, 461-466.
Tsay, R.S. (1989) Testing and modeling threshold autoregressive processes, Journal of the American
Statistical Association, 84, 231-240.
Tsay, R.S. (1992) Model checking via parametric bootstraps in time series analysis, Applied Statistics,
41, 1-15.
Ullah, A. (1989), ed., Semiparametric and nonparametric econometrics. Heidelberg: Physica-Verlag.
Weiss, A. (1986) ARCH and bilinear time series models: Comparison and combination, Journal of
Business and Economic Statistics, 4, 59-70.
White, H. (1984) Asymptotic theory for econometricians. Orlando, FL: Academic Press.
White, H. (1989) Some asymptotic results for learning in single hidden-layer feedforward network
models, Journal of the American Statistical Association, 84, 1003-1013.
Wooldridge, J.M. (1990) A unified approach to robust, regression-based specification tests, Econometric
Theory, 6, 17-43.
Yakowitz, S. (1987) Nearest-neighbour methods for time series analysis, Journal of Time Series
Analysis, 8, 235-247.
Chapter 49

ARCH MODELS

TIM BOLLERSLEV

Northwestern University and N.B.E.R.

ROBERT F. ENGLE

University of California, San Diego and N.B.E.R.

DANIEL B. NELSON

University of Chicago and N.B.E.R.

Contents

Abstract 2961
1. Introduction 2961
1.1. Definitions 2961
1.2. Empirical regularities of asset returns 2963
1.3. Univariate parametric models 2967
1.4. ARCH in mean models 2972
1.5. Nonparametric and semiparametric methods 2972
2. Inference procedures 2974
2.1. Testing for ARCH 2974
2.2. Maximum likelihood methods 2977
2.3. Quasi-maximum likelihood methods 2983
2.4. Specification checks 2984

The authors would like to thank Torben G. Andersen, Patrick Billingsley, William A. Brock, Eric
Ghysels, Lars P. Hansen, Andrew Harvey, Blake LeBaron, and Theo Nijman for helpful comments.
Financial support from the National Science Foundation under grants SES-9022807 (Bollerslev), SES-
9122056 (Engle), and SES-9110131 and SES-9310683 (Nelson), and from the Center for Research in
Security Prices (Nelson), is gratefully acknowledged. Inquiries regarding the data for the stock market
empirical application should be addressed to Professor G. William Schwert, Graduate School of
Management, University of Rochester, Rochester, NY 14627, USA. The GAUSS™ code used in the
stock market empirical example is available from the Inter-University Consortium for Political and Social
Research (ICPSR), P.O. Box 1248, Ann Arbor, MI 48106, USA, telephone (313) 763-5010. Order
Class 5 under this article's name.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved
3. Stationary and ergodic properties 2989
3.1. Strict stationarity 2989
3.2. Persistence 2990
4. Continuous time methods 2992
4.1. ARCH models as approximations to diffusions 2994
4.2. Diffusions as approximations to ARCH models 2996
4.3. ARCH models as filters and forecasters 2997
5. Aggregation and forecasting 2999
5.1. Temporal aggregation 2999
5.2. Forecast error distributions 3001
6. Multivariate specifications 3002
6.1. Vector ARCH and diagonal ARCH 3003
6.2. Factor ARCH 3005
6.3. Constant conditional correlations 3007
6.4. Bivariate EGARCH 3008
6.5. Stationarity and co-persistence 3009
7. Model selection 3010
8. Alternative measures for volatility 3012
9. Empirical examples 3014
9.1. U.S. Dollar/Deutschmark exchange rates 3014
9.2. U.S. stock prices 3017
10. Conclusion 3030
References 3031
Abstract

This chapter evaluates the most important theoretical developments in ARCH type
modeling of time-varying conditional variances. The coverage includes the specifica-
tion of univariate parametric ARCH models, general inference procedures, condi-
tions for stationarity and ergodicity, continuous time methods, aggregation and
forecasting of ARCH models, multivariate conditional covariance formulations,
and the use of model selection criteria in an ARCH context. Additionally, the
chapter contains a discussion of the empirical regularities pertaining to the temporal
variation in financial market volatility. Motivated in part by recent results on
optimal filtering, a new conditional variance model for better characterizing stock
return volatility is also presented.

1. Introduction

Until a decade ago the focus of most macroeconometric and financial time series
modeling centered on the conditional first moments, with any temporal depen-
dencies in the higher order moments treated as a nuisance. The increased importance
played by risk and uncertainty considerations in modern economic theory, however,
has necessitated the development of new econometric time series techniques that
allow for the modeling of time varying variances and covariances. Given the
apparent lack of any structural dynamic economic theory explaining the variation
in higher order moments, particularly instrumental in this development has been
the autoregressive conditional heteroskedastic (ARCH) class of models introduced
by Engle (1982). Parallel to the success of standard linear time series models, arising
from the use of the conditional versus the unconditional mean, the key insight
offered by the ARCH model lies in the distinction between the conditional and the
unconditional second order moments. While the unconditional covariance matrix
for the variables of interest may be time invariant, the conditional variances and
covariances often depend non-trivially on the past states of the world. Understanding
the exact nature of this temporal dependence is crucially important for many issues
in macroeconomics and finance, such as irreversible investments, option pricing, the
term structure of interest rates, and general dynamic asset pricing relationships.
Also, from the perspective of econometric inference, the loss in asymptotic efficiency
from neglected heteroskedasticity may be arbitrarily large and, when evaluating
economic forecasts, a much more accurate estimate of the forecast error uncertainty
is generally available by conditioning on the current information set.

1.1. Definitions

Let {ε_t(θ)} denote a discrete time stochastic process with conditional mean and
variance functions parametrized by the finite dimensional vector θ ∈ Θ ⊆ R^m, where
θ₀ denotes the true value. For notational simplicity we shall initially assume that
ε_t(θ) is a scalar, with the obvious extensions to a multivariate framework treated in
Section 6. Also, let E_{t−1}(·) denote the mathematical expectation, conditional on the
past, of the process, along with any other information available at time t − 1.
The {ε_t(θ₀)} process is then defined to follow an ARCH model if the conditional
mean equals zero,

E_{t−1}(ε_t(θ₀)) = 0,   t = 1, 2, ...,   (1.1)

but the conditional variance,

σ_t²(θ₀) = Var_{t−1}(ε_t(θ₀)) = E_{t−1}(ε_t²(θ₀)),   t = 1, 2, ...,   (1.2)

depends non-trivially on the sigma-field generated by the past observations,
{ε_{t−1}(θ₀), ε_{t−2}(θ₀), ...}. When obvious from the context, the explicit dependence on
the parameters, θ, will be suppressed for notational convenience. Also, in the
multivariate case the corresponding time varying conditional covariance matrix will
be denoted by Ω_t.
In much of the subsequent discussion we shall focus directly on the {ε_t} process,
but the same ideas extend directly to the situation in which {ε_t} corresponds to the
innovations from some more elaborate econometric model. In particular, let {y_t(θ₀)}
denote the stochastic process of interest with conditional mean

μ_t(θ₀) ≡ E_{t−1}(y_t),   t = 1, 2, ....   (1.3)

Note, by the timing convention both μ_t(θ₀) and σ_t²(θ₀) are measurable with respect
to the time t − 1 information set. Define the {ε_t(θ₀)} process by

ε_t(θ₀) ≡ y_t − μ_t(θ₀),   t = 1, 2, ....   (1.4)

The conditional variance for {ε_t} then equals the conditional variance for the {y_t}
process. Since very few economic and financial time series have a constant conditional
mean of zero, most of the empirical applications of the ARCH methodology actually
fall within this framework.
Returning to the definitions in equations (1.1) and (1.2), it follows that the
standardized process,

z_t(θ₀) ≡ ε_t(θ₀)σ_t²(θ₀)^{−1/2},   t = 1, 2, ...,   (1.5)

will have conditional mean zero, and a time invariant conditional variance of unity.
This observation forms the basis for most of the inference procedures that underlie
the applications of ARCH type models.
If the conditional distribution for z_t is furthermore assumed to be time invariant
with a finite fourth moment, it follows by Jensen's inequality that

E(ε_t⁴) = E(z_t⁴)E(σ_t⁴) ≥ E(z_t⁴)E(σ_t²)² = E(z_t⁴)E(ε_t²)²,

where the equality holds true for a constant conditional variance only. Given a
normal distribution for the standardized innovations in equation (1.5), the uncondi-
tional distribution for ε_t is therefore leptokurtic.
The setup in equations (1.1) through (1.4) is extremely general and does not lend
itself directly to empirical implementation without first imposing further restrictions
on the temporal dependencies in the conditional mean and variance functions.
Below we shall discuss some of the most practical and popular such ARCH formula-
tions for the conditional variance. While the first empirical applications of the
ARCH class of models were concerned with modeling inflationary uncertainty, the
methodology has subsequently found especially wide use in capturing the temporal
dependencies in asset returns. For a recent survey of this extensive empirical
literature we refer to Bollerslev et al. (1992).

1.2. Empirical regularities of asset returns

Even in the univariate case, the array of functional forms permitted by equation (1.2)
is vast, and infinitely larger than can be accommodated by any parametric family
of ARCH models. Clearly, to have any hope of selecting an appropriate ARCH
model, we must have a good idea of what empirical regularities the model should
capture. Thus, a brief discussion of some of the important regularities for asset
returns volatility follows.

1.2.1. Thick tails

Asset returns tend to be leptokurtic. The documentation of this empirical regularity


by Mandelbrot (1963), Fama (1965) and others led to a large literature on modeling
stock returns as i.i.d. draws from thick-tailed distributions; see, e.g. Mandelbrot
(1963), Fama (1963,1965), Clark (1973) and Blattberg and Gonedes (1974).

1.2.2. Volatility clustering

As Mandelbrot (1963) wrote,

. . . large changes tend to be followed by large changes, of either sign, and small
changes tend to be followed by small changes . . .

This volatility clustering phenomenon is immediately apparent when asset returns


are plotted through time. To illustrate, Figure 1 plots the daily capital gains on the
Standard 90 composite stock index from 1928-1952 combined with Standard and
Poor's 500 index from 1953-1990.

[Figure 1. Daily Standard and Poor's capital gains, 1928-1990.]

The returns are expressed in percent, and are
continuously compounded. It is clear from visual inspection of the figure, and any
reasonable statistical test, that the returns are not i.i.d. through time. For example,
volatility was clearly higher during the 1930s than during the 1960s, as confirmed
by the estimation results reported in French et al. (1987).
A similar message is contained in Figure 2, which plots the daily percentage
Deutschmark/U.S. Dollar exchange rate appreciation. Distinct periods of exchange
market turbulence and tranquility are immediately evident. We shall return to a
formal analysis of both of these two time series in Section 9 below.
Volatility clustering and thick tailed returns are intimately related. As noted in
Section 1.1 above, if the unconditional kurtosis of ε_t is finite, E(ε_t⁴)/[E(ε_t²)]² ≥ E(z_t⁴),
where the last inequality is strict unless σ_t is constant. Excess kurtosis in ε_t can
therefore arise from randomness in σ_t, from excess kurtosis in the conditional
distribution of ε_t, i.e., in z_t, or from both.

1.2.3. Leverage effects

The so-called leverage effect, first noted by Black (1976), refers to the tendency
for changes in stock prices to be negatively correlated with changes in stock
volatility. Fixed costs such as financial and operating leverage provide a partial
explanation for this phenomenon. A firm with debt and equity outstanding typically
becomes more highly leveraged when the value of the firm falls.

[Figure 2. Daily U.S. Dollar/Deutschmark appreciation.]

This raises equity
returns volatility if the returns on the firm as a whole are constant. Black (1976),
however, argued that the response of stock volatility to the direction of returns is
too large to be explained by leverage alone. This conclusion is also supported by
the empirical work of Christie (1982) and Schwert (1989b).

1.2.4. Non-trading periods

Information that accumulates when financial markets are closed is reflected in prices
after the markets reopen. If, for example, information accumulates at a constant
rate over calendar time, then the variance of returns over the period from the Friday
close to the Monday close should be three times the variance from the Monday
close to the Tuesday close. Fama (1965) and French and Roll (1986) have found,
however, that information accumulates more slowly when the markets are closed
than when they are open. Variances are higher following weekends and holidays
than on other days, but not nearly by as much as would be expected if the news
arrival rate were constant. For instance, using data on daily returns across all NYSE
and AMEX stocks from 1963-1982, French and Roll (1986) find that volatility is
70 times higher per hour on average when the market is open than when it is closed.
Baillie and Bollerslev (1989) report qualitatively similar results for foreign exchange
rates.

1.2.5. Forecastable events

Not surprisingly, forecastable releases of important information are associated with


high ex ante volatility. For example, Cornell (1978) and Patell and Wolfson (1979,
1981) show that individual firms' stock returns volatility is high around earnings
announcements. Similarly, Harvey and Huang (1991,1992) find that fixed income
and foreign exchange volatility is higher during periods of heavy trading by central
banks or when macroeconomic news is being released.
There are also important predictable changes in volatility across the trading day.
For example, volatility is typically much higher at the open and close of stock and
foreign exchange trading than during the middle of the day. This pattern has been
documented by Harris (1986), Gerity and Mulherin (1992) and Baillie and Bollerslev
(1991) among others. The increase in volatility at the open at least partly reflects
information accumulated while the market was closed. The volatility surge at the
close is less easily interpreted.

1.2.6. Volatility and serial correlation

LeBaron (1992) finds a strong inverse relation between volatility and serial corre-
lation for U.S. stock indices. This finding appears remarkably robust to the choice
of sample period, market index, measurement interval and volatility measure. Kim
(1989) documents a similar relationship in foreign exchange rate data.

1.2.7. Co-movements in volatilities

Black (1976) observed that

. . . there is a lot of commonality in volatility changes across stocks: a 1% market


volatility change typically implies a 1% volatility change for each stock. Well,
perhaps the high volatility stocks are somewhat more sensitive to market volatility
changes than the low volatility stocks. In general it seems fair to say that when
stock volatilities change, they all tend to change in the same direction.

Diebold and Nerlove (1989) and Harvey et al. (1992) also argue for the existence
of a few common factors explaining exchange rate volatility movements. Engle et al.
(1990b) show that U.S. bond volatility changes are closely linked across maturities.
This commonality of volatility changes holds not only across assets within a market,
but also across different markets. For example, Schwert (1989a) finds that U.S. stock
and bond volatilities move together, while Engle and Susmel (1993) and Hamao
et al. (1990) discover close links between volatility changes across international
stock markets. The importance of international linkages has been further explored
by King et al. (1994), Engle et al. (1990a), and Lin et al. (1994).
That volatilities move together should be encouraging to model builders, since
it indicates that a few common factors may explain much of the temporal variation
in the conditional variances and covariances of asset returns. This forms the basis
for the factor ARCH models discussed in Section 6.2 below.
1.2.8. Macroeconomic variables and volatility

Since stock values are closely tied to the health of the economy, it is natural to
expect that measures of macroeconomic uncertainty such as the conditional variances
of industrial production, interest rates, money growth, etc. should help explain
changes in stock market volatility. Schwert (1989a, b) finds that although stock
volatility rises sharply during recessions and financial crises and drops during
expansions, the relation between macroeconomic uncertainty and stock volatility
is surprisingly weak. Glosten et al. (1993), on the other hand, uncover a strong
positive relationship between stock return volatility and interest rates.

1.3. Univariate parametric models

1.3.1. GARCH

Numerous parametric specifications for the time varying conditional variance have
been proposed in the literature. In the linear ARCH(q) model originally introduced
by Engle (1982), the conditional variance is postulated to be a linear function of the
past q squared innovations,

σ_t² = ω + Σ_{i=1,q} α_i ε²_{t−i} ≡ ω + α(L)ε²_{t−1},   (1.6)

where L denotes the lag or backshift operator, L y_t = y_{t−1}. Of course, for this model
to be well defined and the conditional variance to be positive almost surely, the
parameters must satisfy ω > 0 and α₁ ≥ 0, ..., α_q ≥ 0.
Defining v_t ≡ ε_t² − σ_t², the ARCH(q) model in (1.6) may be re-written as

ε_t² = ω + α(L)ε²_{t−1} + v_t.   (1.7)

Since E_{t−1}(v_t) = 0, the model corresponds directly to an AR(q) model for the squared
innovations, ε_t². The process is covariance stationary if and only if the sum of the
positive autoregressive parameters is less than one, in which case the unconditional
variance equals Var(ε_t) = σ² = ω/(1 − α₁ − ⋯ − α_q).
Even though the ε_t's are serially uncorrelated, they are clearly not independent
through time. In accordance with the stylized facts for asset returns discussed above,
there is a tendency for large (small) absolute values of the process to be followed by
other large (small) values of unpredictable sign. Also, as noted above, if the distri-
bution for the standardized innovations in equation (1.5) is assumed to be time
invariant, the unconditional distribution for ε_t will have fatter tails than the distribu-
tion for z_t. For instance, for the ARCH(1) model with conditionally normally
distributed errors, E(ε_t⁴)/E(ε_t²)² = 3(1 − α₁²)/(1 − 3α₁²) if 3α₁² < 1, and
E(ε_t⁴)/E(ε_t²)² = ∞ otherwise; both of which exceed the normal value of three.
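To make the volatility clustering and excess kurtosis concrete, the following short
Python sketch (not part of the original chapter; the parameter values are purely
illustrative) simulates a conditionally normal ARCH(1) process via equation (1.6)
and compares the sample kurtosis with the theoretical value 3(1 − α₁²)/(1 − 3α₁²).

import numpy as np

# ARCH(1) with Gaussian standardized innovations; omega and alpha are
# illustrative values satisfying 3*alpha**2 < 1, so the fourth moment exists.
rng = np.random.default_rng(0)
omega, alpha, T = 0.2, 0.5, 200_000

eps = np.empty(T)
eps[0] = np.sqrt(omega / (1 - alpha)) * rng.standard_normal()
for t in range(1, T):
    sigma2 = omega + alpha * eps[t - 1] ** 2      # conditional variance, eq. (1.6)
    eps[t] = np.sqrt(sigma2) * rng.standard_normal()

sample_kurt = np.mean(eps ** 4) / np.mean(eps ** 2) ** 2
theory_kurt = 3 * (1 - alpha ** 2) / (1 - 3 * alpha ** 2)
print(sample_kurt, theory_kurt)    # both well above the normal value of 3

A plot of the simulated series shows the characteristic clusters of large and small
absolute values, even though the ε_t's themselves are serially uncorrelated.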
Alternatively the ARCH(q) model may also be represented as a time varying
parameter MA(q) model for ε_t,

ε_t = [ω + α(L)ε²_{t−1}]^{1/2} ζ_t,   (1.8)

where {ζ_t} denotes a scalar i.i.d. stochastic process with mean zero and variance
one. Time varying parameter models have a long history in econometrics and
statistics. The appeal of the observationally equivalent formulation in equation (1.6)
stems from the explicit focus on the time varying conditional variance of the process.
For discussions of this interpretation of ARCH models, see, e.g., Tsay (1987), Bera
et al. (1993) and Bera and Lee (1993).
In empirical applications of ARCH(q) models a long lag length and a large
number of parameters are often called for. To circumvent this problem Bollerslev
(1986) proposed the generalized ARCH, or GARCH(p, q), model,

σ_t² = ω + Σ_{i=1,q} α_i ε²_{t−i} + Σ_{j=1,p} β_j σ²_{t−j} ≡ ω + α(L)ε²_{t−1} + β(L)σ²_{t−1}.   (1.9)

For the conditional variance in the GARCH(p, q) model to be well defined all the
coefficients in the corresponding infinite order linear ARCH model must be positive.
Provided that α(L) and β(L) have no common roots and that the roots of the
polynomial β(x) = 1 lie outside the unit circle, this positivity constraint is satisfied
if and only if all the coefficients in the infinite power series expansion for α(x)/(1 − β(x))
are non-negative. Necessary and sufficient conditions for this are given in Nelson
and Cao (1992). For the simple GARCH(1, 1) model almost sure positivity of σ_t²
requires that ω ≥ 0, α₁ ≥ 0 and β₁ ≥ 0.
Rearranging the GARCH(p, q) model as in equation (1.7), it follows that

ε_t² = ω + [α(L) + β(L)]ε²_{t−1} − β(L)v_{t−1} + v_t,   (1.10)

which defines an ARMA[max(p, q), p] model for ε_t². By standard arguments, the
model is covariance stationary if and only if all the roots of α(x) + β(x) = 1 lie outside
the unit circle; see Bollerslev (1986) for a formal proof. In many applications with
high frequency financial data the estimate for α(1) + β(1) turns out to be very close
to unity. This provides an empirical motivation for the so-called integrated GARCH
(p, q), or IGARCH(p, q), model introduced by Engle and Bollerslev (1986). In the
IGARCH class of models the autoregressive polynomial in equation (1.10) has a
unit root, and consequently a shock to the conditional variance is persistent in the
sense that it remains important for future forecasts of all horizons. Further discussion
of stationarity conditions and issues of persistence are contained in Section 3 below.
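A minimal sketch of the GARCH(1, 1) recursion in equation (1.9), with made-up
parameter values satisfying the positivity and covariance stationarity conditions,
can be used to verify the unconditional variance formula ω/(1 − α₁ − β₁) numerically:

import numpy as np

# GARCH(1,1) simulation; omega, alpha and beta are illustrative values,
# with alpha + beta < 1 so the process is covariance stationary.
rng = np.random.default_rng(1)
omega, alpha, beta, T = 0.05, 0.08, 0.90, 500_000

var_theory = omega / (1 - alpha - beta)     # unconditional variance
sigma2 = var_theory                         # start the recursion at its mean
eps2 = np.empty(T)
for t in range(T):
    e = np.sqrt(sigma2) * rng.standard_normal()
    eps2[t] = e ** 2
    sigma2 = omega + alpha * e ** 2 + beta * sigma2   # equation (1.9)

print(eps2.mean(), var_theory)   # the two numbers should be close

As α₁ + β₁ approaches one the sample variance estimate converges very slowly,
foreshadowing the IGARCH persistence discussed in Section 3.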
Just as an ARMA model often leads to a more parsimonious representation of
the temporal dependencies in the conditional mean than an AR model, the GARCH
(p, q) formulation in equation (1.9) provides a similar added flexibility over the linear
non-linear ARCH (NARCH) models:

σ_t^γ = ω + Σ_{i=1,q} α_i |ε_{t−i}|^γ + Σ_{j=1,p} β_j σ^γ_{t−j}.   (1.13)

If (1.13) is modified further by setting

σ_t^γ = ω + Σ_{i=1,q} α_i |ε_{t−i} − κ|^γ + Σ_{j=1,p} β_j σ^γ_{t−j}   (1.14)

for some non-zero κ, the innovations in σ_t² will depend on the size as well as the
sign of lagged residuals, thereby allowing for the leverage effect in stock return
volatility. The formulation in equation (1.14) with γ = 2 is also a special case of
Sentana's (1991) quadratic ARCH (QARCH) model, in which σ_t² is modeled as a
quadratic form in the lagged residuals. A simple version of this model termed
asymmetric ARCH, or AARCH, was also proposed by Engle (1990). In the first
order case the AARCH model becomes

σ_t² = ω + αε²_{t−1} + δε_{t−1} + βσ²_{t−1},   (1.15)

where a negative value of δ means that positive returns increase volatility less than
negative returns.
Another route for introducing asymmetric effects is to set

σ_t^γ = ω + Σ_{i=1,q} [α_i⁺ I(ε_{t−i} > 0)|ε_{t−i}|^γ + α_i⁻ I(ε_{t−i} ≤ 0)|ε_{t−i}|^γ] + Σ_{j=1,p} β_j σ^γ_{t−j},   (1.16)

where I(·) denotes the indicator function. For example the threshold ARCH
(TARCH) model of Zakoian (1990) corresponds to equation (1.16) with γ = 1.
Glosten, Jagannathan and Runkle (1993) estimate a version of equation (1.16) with
γ = 2. This so-called GJR model allows a quadratic response of volatility to news
with different coefficients for good and bad news, but maintains the assertion that
the minimum volatility will result when there is no news.
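The asymmetry in equation (1.16) is easy to see in one step of the γ = 2 (GJR)
recursion; the following fragment is only a sketch with hypothetical coefficient
values, not an estimate from any data set:

# One step of the GJR conditional variance (equation (1.16) with gamma = 2
# and p = q = 1); all coefficient values are purely illustrative.
def gjr_sigma2(eps_lag, sigma2_lag, omega=0.05, a_pos=0.03, a_neg=0.15, beta=0.85):
    news = (a_pos if eps_lag > 0 else a_neg) * eps_lag ** 2
    return omega + news + beta * sigma2_lag

print(gjr_sigma2(+1.0, 1.0))   # good news
print(gjr_sigma2(-1.0, 1.0))   # bad news of the same size raises volatility more

Note that the minimum of the implied news impact curve is at ε_{t−1} = 0, in line
with the assertion mentioned above.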
Two additional classes of models have recently been proposed. These models
have a somewhat different intellectual heritage but imply particular forms of con-
ditional heteroskedasticity. The first is the unobserved components structural ARCH
(STARCH) model of Harvey et al. (1992). These are state space models or factor
models in which the innovation is composed of several sources of error, where each
of the error sources has a heteroskedastic specification of the ARCH form. Since
the error components cannot be separately observed given the past observations,
the independent variables in the variance equations are not measurable with respect
¹In a comparison study for daily Japanese TOPIX data, Engle and Ng (1993) found that the EGARCH
and the GJR formulation were superior to the AARCH model (1.15), which simply shifted the intercept.
to the available information set, which complicates inference procedures.² Following
earlier work by Diebold and Nerlove (1989), Harvey et al. (1992) propose an
estimation strategy based on the Kalman filter.
To illustrate the issues, consider the factor structure

y_t = B f_t + ε_t,   (1.17)

where y_t is an n × 1 vector of asset returns, f_t is a scalar factor with time invariant
factor loadings, B, and ε_t is an n × 1 vector of idiosyncratic returns. If the factor
follows an ARCH(1) process,

σ²_{f,t} = ω + α f²_{t−1},   (1.18)

then new estimation problems arise since f_{t−1} is not observed, and σ²_{f,t} is not a
conditional variance. The Kalman filter gives both E_{t−1}(f_{t−1}) and V_{t−1}(f_{t−1}), so
the proposal by Harvey et al. (1992) is to let the conditional variance of the factor,
which is the state variable in the Kalman filter, be given by

σ²_{f,t} = ω + α[E_{t−1}(f_{t−1})² + V_{t−1}(f_{t−1})].
Another important class of models is the switching ARCH, or SWARCH, model


proposed independently by Cai (1994) and Hamilton and Susmel (1992). This class
of models postulates that there are several different ARCH models and that the
economy switches from one to another following a Markov chain. In this model
there can be an extremely high volatility process which is responsible for events
such as the stock market crash in October 1987. Since this could happen at any
time but with very low probability, the behavior of risk averse agents will take this
into account. The SWARCH model must again be estimated using Kalman filter
techniques.
The richness of the family of parametric ARCH models is both a blessing and a
curse. It certainly complicates the search for the true model, and leaves quite a
bit of arbitrariness in the model selection stage. On the other hand, the flexibility
of the ARCH class of models means that in the analysis of structural economic
models with time varying volatility, there is a good chance that an appropriate
parametric ARCH model can be formulated that will make the analysis tractable.
For example, Campbell and Hentschel (1992) seek to explain the drop in stock
prices associated with an increase in volatility within the context of an economic
model. In their model, exogenous rises in stock volatility increase discount rates,
lowering stock prices. Using an EGARCH model would have made their formal
analysis intractable, but based on a QARCH formulation the derivations are
straightforward.
²These models are sometimes also called stochastic volatility models; see Andersen (1992a) for a more
formal definition.
1.4. ARCH in mean models

Many theories in finance call for an explicit tradeoff between the expected returns
and the variance, or the covariance among the returns. For instance, in Merton's
(1973) intertemporal CAPM model, the expected excess return on the market
portfolio is linear in its conditional variance under the assumption of a representative
agent with log utility. In more general settings, the conditional covariance with an
appropriately defined benchmark portfolio often serves to price the assets. For
example, according to the traditional capital asset pricing model (CAPM) the excess
returns on all risky assets are proportional to the non-diversifiable risk as measured
by the covariances with the market portfolio. Of course, this implies that the expected
excess return on the market portfolio is simply proportional to its own conditional
variance as in the univariate Merton (1973) model.
The ARCH in mean, or ARCH-M, model introduced by Engle et al. (1987) was
designed to capture such relationships. In the ARCH-M model the conditional
mean is an explicit function of the conditional variance,

μ_t(θ) = g(σ_t²(θ), θ),   (1.19)

where the derivative of the g(., .) function with respect to the first element is non-zero.
The multivariate extension of the ARCH-M model, allowing for the explicit influence
of conditional covariance terms in the conditional mean equations, was first consi-
dered by Bollerslev et al. (1988) in the context of a multivariate CAPM model. The
exact formulation of such multivariate ARCH models is discussed further in Section 6
below.
The most commonly employed univariate specifications of the ARCH-M model
postulate a linear relationship in σ_t or σ_t²; e.g. g[σ_t²(θ), θ] = μ + δσ_t². For δ ≠ 0 the
risk premium will be time-varying, and could change sign if μ < 0 < δ. Note that
any time variation in σ_t² will result in serial correlation in the {y_t} process.³
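A quick way to see the induced serial correlation is to simulate an ARCH(1)-M
process with the linear-in-variance specification above; the sketch below uses
hypothetical parameter values and is not taken from the chapter:

import numpy as np

# ARCH(1)-in-mean: y_t = mu + delta*sigma2_t + eps_t, with ARCH(1) errors.
# Since sigma2_t is positively autocorrelated, so is y_t whenever delta != 0.
rng = np.random.default_rng(2)
mu, delta, omega, alpha, T = 0.0, 2.0, 0.2, 0.5, 200_000

y = np.empty(T)
eps, sigma2 = 0.0, omega / (1 - alpha)
for t in range(T):
    sigma2 = omega + alpha * eps ** 2
    eps = np.sqrt(sigma2) * rng.standard_normal()
    y[t] = mu + delta * sigma2 + eps

print(np.corrcoef(y[:-1], y[1:])[0, 1])   # clearly non-zero; zero if delta = 0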
Because of the explicit dependence of the conditional mean on the conditional
variance and/or covariance, several unique problems arise in the estimation and
testing of ARCH-M models. We shall return to a discussion of these issues in
Section 2.2 below.

1.5. Nonparametric and semiparametric methods

A natural response to the overwhelming variety of parametric univariate ARCH


models, is to consider and estimate nonparametric models. One of the first attempts
at this problem was by Pagan and Schwert (1990) who used a collection of standard

³The exact form of this serial dependence has been formally analyzed for some simple models in Hong
(1991).
nonparametric estimation methods, including kernels, Fourier series and least squares
regressions, to fit models for the relation between y_t² and past y_t's, and then compare
the fits with several parametric formulations. Effectively, these models estimate the
function f(·) in

y_t² = f(y_{t−1}, y_{t−2}, ..., y_{t−p}; θ) + η_t.   (1.20)

Several problems immediately arise in estimating f(·), however. Because of the
problems of high dimensionality, the parameter p must generally be chosen rather
small, so that only a little temporal smoothing can actually be achieved directly
from (1.20). Secondly, if only squares of the past y_t's are used, the asymmetric terms
may not be discovered. Thirdly, minimizing the distance between y_t² and f_t = f(y_{t−1},
y_{t−2}, ..., y_{t−p}; θ) is most effective if η_t is homoskedastic; however, in this case it is
highly heteroskedastic. In fact, if f_t were the precise conditional heteroskedasticity,
then y_t²/f_t and η_t/f_t would be homoskedastic. Thus, η_t has conditional variance
proportional to f_t², so that the heteroskedasticity is actually more severe than in y_t.
Not only does parameter estimation become inefficient, but the use of a simple R²
measure as a model selection criterion is inappropriate. An R² criterion penalizes
generalized least squares or maximum likelihood estimators, and corresponds to a
loss function which does not even penalize zero or negative predicted variances.
This issue will be discussed in more detail in Section 7. Indeed, the conclusion from
the empirical analysis for U.S. stock returns conducted in Pagan and Schwert (1990)
was that there was in-sample evidence that the nonparametric models could
outperform the GARCH and EGARCH models, but that out-of-sample the
performance deteriorated. When a proportional loss function was used the
superiority of the nonparametric models also disappeared in-sample.
Any nonparametric estimation method must be sensitive to the above mentioned
issues. Gourieroux and Monfort (1992) introduce a qualitative threshold ARCH,
or QTARCH, model, which has a conditional variance that is constant over various
multivariate observation intervals. For example, divide the space of y_t into J
intervals and let I_j(y_t) be 1 if y_t is in the jth interval. The QTARCH model is then
written as

y_t = Σ_{i=1,p} Σ_{j=1,J} m_{ij} I_j(y_{t−i}) + [Σ_{i=1,p} Σ_{j=1,J} b_{ij} I_j(y_{t−i})] u_t,   (1.21)

where u_t is taken to be i.i.d. The m_{ij} parameters govern the mean and the b_{ij}
parameters govern the variance of the {y_t} process. As the sample size grows, J can
be increased and the bins made smaller to approximate any process.
In their most successful application, Gourieroux and Monfort (1992) add a
GARCH term resulting in the G-QTARCH(1) model, with a conditional variance
given by

σ_t² = ω + β₀σ²_{t−1} + Σ_{j=1,J} φ_j I_j(y_{t−1}).   (1.22)

Interestingly, the estimates using four years of daily returns on the French stock
index (CAC) showed strong evidence of the leverage effect.
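The G-QTARCH(1) variance in equation (1.22) is a step function of the lagged
observation plus a GARCH term. A minimal sketch, with entirely hypothetical bin
edges and coefficients, reads:

import numpy as np

# G-QTARCH(1) update, equation (1.22): omega + beta0*sigma2_{t-1} plus a
# step that depends on which of the J bins y_{t-1} falls in.  The edges
# and coefficients here are made up for illustration only.
edges = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])   # J = 4 intervals
phi = np.array([0.30, 0.05, 0.02, 0.20])              # one coefficient per bin

def gqtarch_sigma2(y_lag, sigma2_lag, omega=0.05, beta0=0.85):
    j = np.searchsorted(edges, y_lag, side="right") - 1   # bin with I_j(y_lag) = 1
    return omega + beta0 * sigma2_lag + phi[j]

print(gqtarch_sigma2(-1.5, 1.0), gqtarch_sigma2(0.5, 1.0))

Larger steps for big negative lagged values, as in the φ vector above, mimic the
kind of leverage effect Gourieroux and Monfort report for the CAC data.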
In the same spirit, Engle and Ng (1993) propose and estimate a partially nonpara-
metric, or PNP, model, which uses linear splines to estimate the shape of the
response to the most recent news. The name of the model reflects the fact that the
long memory component is treated as parametric while the relationship between
the news and the volatility is treated nonparametrically.
The semi-nonparametric series expansion developed in a sequence of papers by
Gallant and Tauchen (1989) and Gallant et al. (1991,1992,1993) has also been
employed in characterizing the temporal dependencies in the second order moments
of asset returns. A formal description of this innovative nonparametric procedure
is beyond the scope of the present chapter, however.

2. Inference procedures

2.1. Testing for ARCH

2.1.1. Serial correlation and Lagrange multiplier tests

The original Lagrange multiplier (LM) test for ARCH proposed by Engle (1982) is
very simple to compute, and relatively easy to derive. Under the null hypothesis it
is assumed that the model is a standard dynamic regression model which can be
written as

y_t = x_t′β + ε_t,   (2.1)

where x_t is a set of weakly exogenous and lagged dependent variables and ε_t is a
Gaussian white noise process,

ε_t | I_{t−1} ~ N(0, σ²),   (2.2)

where I_{t−1} denotes the available information set. Because the null is so easily estimated,
the Lagrange multiplier test is a natural choice. The alternative hypothesis is that
the errors are ARCH(q), as in equation (1.6). A straightforward derivation of the
Lagrange multiplier test as in Engle (1984) leads to the TR² test statistic, where the
R² is computed from the regression of ε_t² on a constant and ε²_{t−1}, ..., ε²_{t−q}. Under
the null hypothesis that there is no ARCH, the test statistic is asymptotically
distributed as a chi-square distribution with q degrees of freedom.
The intuition behind this test is very clear. If the data are homoskedastic, then
the variance cannot be predicted and variations in ε_t² will be purely random.
However, if ARCH effects are present, large values of ε_t² will be predicted by large
values of the past squared residuals.
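The test is simple enough to spell out in a few lines. The following sketch (an
illustration, not a canned routine) computes the TR² statistic from the auxiliary
regression of the squared residuals on a constant and their own q lags:

import numpy as np
from scipy import stats

def arch_lm_test(resid, q):
    # Regress e_t^2 on a constant and e_{t-1}^2, ..., e_{t-q}^2; under the
    # null of no ARCH, T*R^2 is asymptotically chi-square with q df.
    e2 = resid ** 2
    y = e2[q:]
    X = np.column_stack([np.ones(len(y))] +
                        [e2[q - i:-i] for i in range(1, q + 1)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - np.sum((y - X @ b) ** 2) / np.sum((y - y.mean()) ** 2)
    tr2 = len(y) * r2
    return tr2, stats.chi2.sf(tr2, df=q)      # statistic and p-value

rng = np.random.default_rng(3)
print(arch_lm_test(rng.standard_normal(5000), q=4))   # i.i.d. data: no rejection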
While this is a simple and widely used statistic, there are several points which
should be made. First and most obvious, if the model in (2.1) is misspecified by
omission of a relevant regressor or failure to account for some non-linearity or serial
correlation, it is quite likely that the ARCH test will reject as these errors may induce
serial correlation in the squared errors. Thus, one cannot simply assume that ARCH
effects are necessarily present when the ARCH test rejects. Second, there are several
other asymptotically equivalent forms of the test, including the standard F-test from
the above regression. Another version of the test simply omits the constant but
subtracts the estimate of the unconditional variance, σ̂², from the dependent variable,
and then uses one half the explained sum of squares as a test statistic. It is also quite
common to use asymptotically equivalent portmanteau tests, such as the Ljung and
Box (1978) statistic, for ε_t².
As described above, the parameters of the ARCH(q) model must be positive.
Hence, the ARCH test could be formulated as a one tailed test. When q = 1 this is
simple to do, but for higher values of q, the procedures are not as clear. Demos and
Sentana (1991) have suggested a one sided ARCH test which is presumably more
powerful than the simple TR² test described above. Similarly, since we find that the
GARCH(1, 1) is often a superior model and is surely more parsimoniously para-
metrized, one would like a test which is more powerful for this alternative. The
Lagrange multiplier principle unfortunately does not deliver such a test because,
for models close to the null, α₁ and β₁ cannot be separately identified. In fact, the
LM test for GARCH(1, 1) is just the same as the LM test for ARCH(1); see Lee and
King (1993), which proposes a locally most powerful test for ARCH and GARCH.
Of course, Wald type tests for GARCH may also be computed. These too are
non-standard, however. The t-statistic on α₁ in the GARCH(1, 1) model will not
have a t-distribution under the null hypothesis, since there is then no time-varying
input and β₁ will be unidentified. Finally, likelihood ratio test statistics may be
examined, although again they have an uncertain distribution under the null.
Practical experience, however, suggests that the latter is a very powerful approach
to testing for GARCH effects. We shall return to a more detailed discussion of these
tests in Section 2.2.2 below.

2.1.2. BDS test for ARCH

The tests for ARCH discussed above are tests for volatility clustering rather than
general conditional heteroskedasticity, or general non-linear dependence. One widely
used test for general departures from i.i.d. observations is the BDS test introduced
by Brock, Dechert and Scheinkman (1987). We will consider only the univariate
version of the test; the multivariate extension is made in Baek and Brock (1992).
The BDS test has inspired quite a large literature and several applications have
appeared in the finance area; see, e.g. Scheinkman and LeBaron (1989), Hsieh (1991)
and Brock et al. (1991).
To set up the test, let {x_t}_{t=1,T} denote a scalar sequence which under the null
hypothesis is assumed to be i.i.d. through time. Define the m-histories of the x_t
process as the vectors (x₁, ..., x_m), (x₂, ..., x_{m+1}), (x₃, ..., x_{m+2}), ...,
(x_{T−m+1}, ..., x_T). Clearly, there are T − m + 1 such m-histories, and therefore
(T − m + 1)(T − m)/2 distinct pairs of m-histories. Next, define the correlation
integral as the fraction of the distinct pairs of m-histories lying within a distance
c in the sup norm; i.e.

C_{m,T}(c) = [(T − m + 1)(T − m)/2]⁻¹ Σ_{m ≤ s < t ≤ T} I(max_{j=0,m−1} |x_{t−j} − x_{s−j}| < c).   (2.3)

Under weak dependence conditions, C_{m,T}(c) converges almost surely to a limit
C_m(c). By the basic properties of order-statistics, C_m(c) = C₁(c)^m when {x_t} is
i.i.d. The BDS test is based on the difference [C_{m,T}(c) − C_{1,T}(c)^m]. Intuitively,
C_{m,T}(c) > C_{1,T}(c)^m means that when x_{t−j} and x_{s−j} are close for j = 1 to m − 1, i.e.
max_{j=1,m−1} |x_{t−j} − x_{s−j}| < c, then x_t and x_s are more likely than average to be
close, also. In other words, nearest-neighbor methods work in predicting the {x_t}
series, which is inconsistent with the i.i.d. assumption.⁴
Brock et al. (1987) show that for fixed m and c, T^{1/2}[C_{m,T}(c) − C_{1,T}(c)^m] is
asymptotically normal with mean zero and variance V(m, c) given by

V(m, c) = 4[K(c)^m + 2Σ_{j=1,m−1} K(c)^{m−j} C₁(c)^{2j} + (m − 1)²C₁(c)^{2m} − m²K(c)C₁(c)^{2m−2}],   (2.4)

where K(c) = E{[F(x_t + c) − F(x_t − c)]²}, and F(·) is the cumulative distribution
function of x_t. The BDS test is then computed as

T^{1/2}[C_{m,T}(c) − C_{1,T}(c)^m]/V̂(T, m, c)^{1/2},   (2.5)

where V̂(T, m, c) denotes a consistent estimator of V(m, c), details of which are given
by Brock et al. (1987, 1991). For fixed m ≥ 2 and c > 0, the BDS statistic in equation
(2.5) is asymptotically standard normal.
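The correlation integral itself is straightforward to compute directly; the sketch
below implements equation (2.3) with a plain O(T²) loop and contrasts C_{m,T}(c)
with C_{1,T}(c)^m for i.i.d. data (it omits the variance estimator needed for the full
studentized statistic in (2.5)):

import numpy as np

def correlation_integral(x, m, c):
    # C_{m,T}(c) of equation (2.3): the fraction of distinct pairs of
    # m-histories lying within sup-norm distance c of each other.
    T = len(x)
    H = np.column_stack([x[m - 1 - j:T - j] for j in range(m)])  # m-histories
    n = len(H)                                                   # T - m + 1 rows
    close = 0
    for s in range(n - 1):
        close += np.sum(np.max(np.abs(H[s + 1:] - H[s]), axis=1) < c)
    return close / (n * (n - 1) / 2)

rng = np.random.default_rng(4)
x = rng.standard_normal(1000)
c, m = 1.5 * x.std(), 2
print(correlation_integral(x, m, c), correlation_integral(x, 1, c) ** m)
# For i.i.d. data the two numbers are close; a marked gap suggests dependence.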
The BDS test has power against many, though not all, departures from i.i.d. In
particular, as documented by Brock et al. (1991) and Hsieh (1991), the power against
ARCH alternatives is close to Engle's (1982) test. For other conditionally hetero-
skedastic alternatives, the power of the BDS test may be superior. To illustrate,
consider the following example from Brock et al. (1991), where 0: is deterministically

⁴C_{m,T}(c) < C_{1,T}(c)^m indicates the reverse of nearest-neighbors predictability. It is important not to
push the nearest-neighbors analogy too far, however. For example, suppose {x_t} is an ARCH process
with a constant conditional mean of 0. In this case, the conditional mean of x_t is always 0, and the
nearest-neighbors analogy breaks down for minimum mean-squared-error forecasting of x_t. It still
holds for forecasting, say, the probability that x_t lies in some interval.
determined by the tent map,

σ²_{t+1} = 1 − 2|σ_t² − 0.5|,   (2.6)

with σ₁² ∈ (0, 1). The model is clearly heteroskedastic, but does not exhibit volatility
clustering, since the empirical serial correlations of {σ_t²} approach zero in large
samples for almost all values of σ₁².
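Iterating the tent map confirms this. One practical caveat, reflected in the sketch
below, is that pure double-precision iterates of the map collapse to exactly zero
after a few dozen steps (a floating-point artifact, since each application shifts the
binary expansion), so the guard that re-seeds the orbit is a numerical device, not
part of the model:

import numpy as np

rng = np.random.default_rng(5)
T = 100_000
s2 = np.empty(T)
s2[0] = rng.random()                         # starting value in (0, 1)
for t in range(T - 1):
    nxt = 1.0 - 2.0 * abs(s2[t] - 0.5)       # tent map, equation (2.6)
    s2[t + 1] = nxt if 0.0 < nxt < 1.0 else rng.random()   # fp collapse guard

print(np.corrcoef(s2[:-1], s2[1:])[0, 1])    # approximately zero: no clustering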
In order to actually implement the BDS test a choice has to be made regarding
the values of m and c. The Monte Carlo experiments of Brock et al. (1991) suggest
that c should be between 0.5 and 2 standard deviations of the data, and that T/m
should be greater than 200, with m no greater than 5. For the asymptotic distribution
to be a good approximation to the finite-sample behavior of the BDS test a sample
size of at least 500 observations is required.
Since the BDS test is a test for i.i.d., it requires some adaptation in testing for
ARCH errors in the presence of time-varying conditional means. One of the most
convenient properties of the BDS test is that, unlike many other diagnostic tests,
including the portmanteau statistic, its distribution is unchanged when applied to
residuals from a linear model. If, for example, the null hypothesis is a stationary,
invertible, ARMA model with i.i.d. errors and the alternative hypothesis is the same
ARMA model but with ARCH errors, the standard BDS test remains valid when
applied to the fitted residuals from the homoskedastic ARMA model. A similar
invariance property holds for residuals from a wide variety of non-linear regression
models, but as discussed in Section 2.4.2 below, this does not carry over to the
standardized residuals from a fitted ARCH model. Of course, the BDS test may
reject due to misspecification of the conditional mean rather than ARCH effects in
the errors. The same is true, however, of the simple TR² Lagrange multiplier test
for ARCH, which has power against a wide variety of non-linear alternatives.

2.2. Maximum likelihood methods

2.2.1. Estimation

The procedure most often used in estimating θ₀ in ARCH models involves the
maximization of a likelihood function constructed under the auxiliary assumption
of an i.i.d. distribution for the standardized innovations in equation (1.5). In parti-
cular, let f(z_t; η) denote the density function for z_t(θ) = ε_t(θ)/σ_t(θ), with mean zero,
variance one, and nuisance parameters η ∈ H ⊆ R^k. Also, let {y_T, y_{T−1}, ..., y₁} refer
to the sample realizations from an ARCH model as defined by equations (1.1)
through (1.4), and ψ ≡ (θ, η) the combined (m + k) × 1 parameter vector to be
estimated for the conditional mean, variance and density functions.
The log likelihood function for the tth observation is then given by

l_t(y_t; ψ) = ln{f[z_t(θ); η]} − 0.5 ln[σ_t²(θ)],   t = 1, 2, ....   (2.7)


The second term on the right hand side is a Jacobian that arises in the transformation
from the standardized innovations, z_t(θ), to the observables, y_t(θ).⁵ By a standard
prediction error decomposition argument, the log likelihood function for the full
sample equals the sum of the conditional log likelihoods in equation (2.7),⁶

L_T(y_T, y_{T−1}, ..., y₁; ψ) = Σ_{t=1,T} l_t(y_t; ψ).   (2.8)

The maximum likelihood estimator (MLE) for the true parameters ψ₀ ≡ (θ₀, η₀), say
ψ̂_T, is found by the maximization of equation (2.8).
Assuming the conditional density and the mean and variance functions to be
differentiable for all ψ ∈ Θ × H ≡ Ψ, ψ̂_T therefore solves

S_T(y_T, y_{T−1}, ..., y₁; ψ) ≡ Σ_{t=1,T} s_t(y_t; ψ) = 0,   (2.9)

where s_t(y_t; ψ) ≡ ∇_ψ l_t(y_t; ψ) is the score vector for the tth observation. In particular,
for the conditional mean and variance parameters,

∇_θ l_t(y_t; ψ) = f[z_t(θ); η]⁻¹ f′[z_t(θ); η]∇_θ z_t(θ) − 0.5σ_t²(θ)⁻¹∇_θσ_t²(θ),   (2.10)

where f′(z_t(θ); η) denotes the derivative of the density function with respect to the
first element, and

∇_θ z_t(θ) = −∇_θμ_t(θ)σ_t²(θ)^{−1/2} − 0.5ε_t(θ)σ_t²(θ)^{−3/2}∇_θσ_t²(θ).   (2.11)

In practice, the actual solution to the set of m + k non-linear equations in (2.9) will
have to proceed by numerical techniques. Engle (1982) and Bollerslev (1986) provide
a discussion of some of the alternative iterative procedures that have been successfully
employed in the estimation of ARCH models.
Of course, the actual implementation of the maximum likelihood procedure
requires an explicit assumption regarding the conditional density in equation (2.7).
By far the most commonly employed distribution in the literature is the normal,

f[z_t(θ)] = (2π)^{−1/2} exp[−0.5z_t(θ)²].   (2.12)

Since the normal distribution is uniquely determined by its first two moments, only
the conditional mean and variance parameters enter the likelihood function in

51n the multivariate context, l,(y,: I(/)= ln{f[er(H)~t(8)~2;~]} - 0.5 ln(].R,(B)I), where 1.1denotes the
determinant.
In most empirical applications the likelihood function is conditioned on a number of initial
observations and nuisance parameters in order to start up the recursions for the conditional mean
and variance functions. Subject to proper stationarity conditions this practice does not alter the
asymptotic distribution of the resulting MLE.
Ch. 49: ARCH Models 2919

equation (2.8); i.e. $ = 0. If the conditional mean and variance functions are both
differentiable for all 8~0, it follows that the score vector in equation (2.10) takes
the simple form,

s&; 0) = V,/~~(fI)c,(0)c~f(Q)- 12 + OSV,~;(t7)a:(fI- 2[~t(tI)2a;(8))1 - 11. (2.13)

From the discussion in Section 2.1 the ARCH model with conditionally normal
errors results in a leptokurtic unconditional distribution. However, the degree of
leptokurtosis induced by the time-varying conditional variance often does not
capture all of the leptokurtosis present in high frequency speculative prices. To
circumvent this problem Bollerslev (1987) suggested using a standardized t-distri-
bution with v > 2 degrees of freedom,

f[z,@);q] =[7r(r/ - 2)]-2I-[o.5(n + l)]r(o.s~)-[l +z,(e)(~-2)-]-+2,


(2.14)

where I-(.) denotes the gamma function. The r-distribution is symmetric around
zero, and converges to the normal distribution for 9 + co. However, for 4 < q < co
the conditional kurtosis equals 3(~ - 2)/(9 - 4), which exceeds the normal value of
three.
Several other conditional distributions have been employed in the literature to
fully capture the degree of tail fatness in speculative prices. The density function for
the generalized error distribution (GED) used in Nelson (1991) is given by

f[z,(@;?j] = P/-2-t 1+1i~)T(~-)-exp[-0.51z,(8)~-)], (2.15)

where

2 = [2(-2/)~(~-)~(3~-l)-l]l/2 (2.16)

For the tail-thickness parameter r] = 2 the density equals the standard normal
density in equation (2.10). For r] < 2 the distribution has thicker tails than the
normal, while q > 2 results in a distribution with thinner tails than the normal.
Both of these candidates for the conditional density impose the restriction of
symmetry. From an economic point of view the hypothesis of symmetry is of interest
since risk averse agents will induce correlation between shocks to the mean and
shocks to the variance as developed more fully by Campbell and Hentschel(l992).
Engle and Gonzalez-Rivera (1991) propose to estimate the conditional density
nonparametrically. The procedure they develop first estimates the parameters of
the model using the Gaussian likelihood. The density of the residuals standardized
by their estimated conditional standard deviations is then estimated using a linear
spline with smoothness priors. The estimated density is then taken to be the true
density and the new likelihood function is maximized. The use of the linear spline
2980 T. Bollersfeu et al.

simplifies the estimation in that the derivatives with respect to the conditional
density are easy to compute and store, which would not be the case for kernels or
many other methods. In a Monte Carlo study, this approach improved the efficiency
beyond the quasi MLE, particularly when the density was highly non-normal and
skewed.

2.2.2. Testing

The primary appeal of the maximum likelihood technique stems from the well-
known optimality conditions of the resulting estimators under ideal conditions.
Crowder (1976) gives one set of sufficient regularity conditions for the MLE in
models with dependent observations to be consistent and asymptotically normally
distributed. Verification of these regularity conditions has proven extremely difficult
for the general ARCH class of models, and a formal proof is only available for a
few special cases, including the GARCH (1,1) model in Lumsdaine (1992a) and Lee
and Hansen (1993). The common practice in empirical studies has been to proceed
under the assumption that the necessary regularity conditions are satisfied.
In particular, if the conditional density is correctly specified and the true parameter
vector IC/,Eint( Y), then a central limit theorem argument yields that

T1%b- $0) + NO, A, ), (2.17)

where + denotes convergence in distribution. Again, the technical difficulties in


verifying (2.17) are formidable. The asymptotic covariance matrix for the MLE is
equal to the inverse of the information matrix evaluated at the true parameter vector
* 03

Ao= - T-l c ECV,s,(y,; 11/o)l. (2.18)

The inverse of this matrix is less than the asymptotic covariance matrix for all other
estimators by a positive definite matrix. In practice, a consistent estimate for A, is
available by evaluating the corresponding sa_mple analogue at GT; i.e. replace
E[V,s,(y,; I++~)]in equation (2.18) with V&y,; I&=).Furthermore, as shown below,
the terms with second derivatives typically have expected value equal to zero and
therefore do not need to be calculated.
Under the assumption of a correctly specified conditional density, the information
matrix equality implies that A, = B,, where B, denotes the expected value of the

As discussed in Section 3 below, the condition that E(ln(a,zf +/II)] < 0 in Lunsdaine (1992a)
ensures that the GARCH(l,l) model is strictly stationary and ergodic. Note also, that by Jensens
inequality E(ln(cc,zf + PI)] <In E(a,z: + 8,) = In@, + j?,), so the parameter region covers the interest-
ing IGARCH(I,l) case in which a, + b, = 1.
Ch.49:ARCH Models 2981

outer product of the gradients evaluated at the true parameters,

Bo = T- l 1 ECs,(Yt;tio)s,(Y,; $o)'l. (2.19)


1=1,T

The outer product of the sample gradients evaluated at 6, therefore provides


an alternative covariance matrix estimator; that is, replace the summand in
equation (2.19) by the sample analogues s,(y,; Gr-)st(y,; $,). Since analytical deriva-
tives in ARCH models often involve very complicated recursive expressions, it is
common in empirical applications to make use of numerical derivatives to approxi-
mate their analytical counterparts. The estimator defined from equation (2.19) has
the computational advantage that only first order derivatives are needed, as numerical
second order derivatives are likely to be unstable.8
In many applications of ARCH models the parameter vector may be partitioned
as 8 = (Pi, PZ) where d1 and o2 operate a sequential cut on 0, x 0, = 0, such that
8i parametrizes the conditional mean and 8, parametrizes the conditional variance
function for y,. Thus, VQ~(~) = 0, and although V,,c~f(@ # 0 for all &@, it is
possible to show that, under fairly general symmetrical distributional assumptions
regarding z, and for particular functional forms of the ARCH conditional variance,
the information matrix for 0 = (Pi, &) becomes block diagonal. Engle (1982) gives
conditions and provides a formal proof for the linear ARCH(q) model in equation (1.6)
under the assumption of conditional normality. As a result, asymptotically efficient
estimates for 8,, may be calculated on the basis of a consistent estimate for Bol,
and vice versa. In particular, for the linear regression model with covariance
stationary ARCH disturbances, the regression coefficients may be consistently
estimated by OLS, and asymptotically efficient estimates for the ARCH parameters
in the conditional variance calculated on the basis of the OLS regression residuals.
The loss in asymptotic efficiency for the OLS coefficient estimates may be arbitrarily
large, however. Also, the conventional OLS standard errors are generally inappro-
priate, and should be modified to take account of the heteroskedasticity as in White
(1980). In particular, as noted by Milhoj (1985), Diebold (1987), Bollerslev (1988)
and Stambaugh (1993) when testing for serial correlation in the mean in the presence
of ARCH effects, the conventional Bartlett standard error for the estimated autocor-
relations, given by the inverse of the square root of the sample size, may severely
underestimate the true standard error.
There are several important cases in which block-diagonality does not hold. For
example, block-diagonality typically fails for functional forms, such as EGARCH,
in which 0: is an asymmetric function of lagged residuals. Another important
exception is the ARCH-M class of models discussed in Section 1.4. Consistent

In the Berndt, Hall, Hall and Hausman (1974) (BHHH) algorithm, often used in the maximization of
the likelihood function, the covariance matrix from the auxiliary OLS regression in the last iteration
provides an estimate of B,. In a small scale Monte Carlo experiment Bollerslev and Wooldridge (1992)
found that this estimator performed reasonably well under ideal conditions.
2982

estimation of the parameters in ARCH-M models generally requires that both the
conditional mean and variance functions be correctly specified and estimated simul-
taneously. A formal analysis of these issues is contained in Engle et al. (1987),
Pagan and Hong (1991) Pagan and Sabau (1987a, 1987b) and Pagan and Ullah
(1988).
Standard hypothesis testing procedures concerning the true parameter vector are
directly available from equation (2.17). To illustrate, let the null hypothesis of
interest be stated as T($,,) = 0, where I: 0 x H + R is differentiable on int( Y) and
1-c m + k. If +,,Eint( Y) and rank [V&$,)] = I, the Wald statistic takes the familiar
form

where C, denotes a consistent estimator of the covariance matrix for the parameter
estimates under the alternative. If the null hypothesis is true and the regularity
conditions are satisfied, the Wald statistic is asymptotically chi-square distributed
with (m + k) - 1 degrees of freedom.
Similarly, let $,, denote the MLE under the null hypothesis. The conventional
likelihood ratio (LR) statistic,

should then be the realization of a chi-square distribution with (m + k) - I degrees of


freedom if the null hypothesis is true and $,Eint( Y).
As discussed already in Section 2.1 above, when testing hypotheses about the
parameters in the conditional variance of estimated ARCH models, non-negativity
constraints must often be imposed, so that GO is on the boundary of the admissible
parameter space. As a result the two-sided critical value from the standard asymptotic
chi-square distribution will lead to a conservative test; recent discussions of general
issues related to testing inequality constraints are given in Gourieroux et al. (1982),
Kodde and Palm (1986) and Wolak (1991).
Another complication that often arises when testing in ARCH models, also
alluded to in Section 2.1 above, concerns the lack of identification of certain param-
eters under the null hypothesis. This in turn leads to a singularity of the information
matrix under the null and a breakdown of standard testipg procedures. For instance,
as previously noted in the GARCH(l, 1) model, /I1 and o are not jointly identified
under the null hypothesis that c(~ = 0. Similarly, in the ARCH-M model, ~~(0) =
p + &$ with p # 0, the parameter S is only identified if the conditional variance is
time-varying. Thus, a standard joint test for ARCH effects and 6 = 0 is not feasible.
Of course, such identification problems are not unique to the ARCH class of models,
and a general discussion is beyond the scope of the present chapter; for a more
detailed analysis along these lines we refer the reader to Davies (1977) Watson and
Engle (1985) and Andrews and Ploberger (1992, 1993).
Ch. 49: ARCH Models 2983

The finite sample evidence on the performance of ARCH MLE estimators and
test statistics is still fairly limited: examples include Engle et al. (1985) Bollerslev
and Wooldridge (1992), Lumsdaine (1992b) and Baillie et al. (1993). For the
GARCH(l, 1) model with conditional normal errors, the available Monte Carlo
evidence suggests that the estimate for cur + 6, is downward biased and skewed to
the right in small samples. This bias in oi, + fii comes from a downward bias in pi,
while oi, is upward biased. Consistent with the theoretical results in Lumsdaine
(1992a) there appears to be no discontinuity in the finite sample distribution of the
estimators at the IGARCH(l, 1) boundary; i.e. c1i + fii = 1. Reliable inference from
the LM, Wald and LR test statistics generally does require moderately large sample
sizes of at least two hundred or more observations, however.

2.3. Quasi-maximum likelihood methods

The assumption of conditional normality for the standardized innovations are


difficult to justify in many empirical applications. This has motivated the use of
alternative parametric distributional assumptions such as the densities in equation
(2.14) or (2.15). Alternatively, the MLE based on the normal density in equation (2.12)
may be given a quasi-maximum likelihood interpretation.
If the conditional mean and variance functions are correctly specified, the normal
quasi-score in equation (2.13) evaluated at the true parameters B0 will have the
martingale difference property,

E,(V,~L,(Bo)&,(e,)o,2(eg) + 0.5V,a:(B,)(r:(8,)~CE,(B0)2a:(e,)- - l]} = 0.


(2.20)

Since equation (2.20) holds for any value of the true parameters, the QMLE
obtained by maximizing the conditional normal likelihood function defined
by equations (2.7), (2.8) and (2.12), say gr,oMLE, is Fisher-consistent; that is,
ECS,(Yr,Y,-I,..., Y,; e)] = 0 for any 0~ 0. Under appropriate regularity conditions
this is sufficient to establish consistency and asymptotic normality of $r,oMLE.
Wooldridge (1994) provides a formal discussion. Furthermore, following Weiss
(1984, 1986) the asymptotic distribution for the QMLE takes the form

T12&oMLE - 0,) + N(0, A, r&4, ). (2.21)

Under appropriate, and difficult to verify, regularity conditions, the A, and B,


matrices are consistently estimated by the sample counterparts from equations (2.18)
and (2.19), respectively.
Provided that the first two conditional moments are correctly specified, it follows
from equation (2.13) that

E,[V,s,(y,; e,)] = - v,~,(e)v,~,(e)a:(e)- l - ~v,a:(e)v,a:(e),a:(e)-. (2.22)


2984 T. Bollersleu et al.

As pointed out by Bollerslev and Wooldridge (1992) a convenient estimate of the


information matrix, A,, involving only first derivatives is therefore available by
replacing the right hand side of equation (2.18) with the sample realizations from
equation (2.22).
The finite sample distribution of the QMLE and the Wald statistics based on the
robust covariance matrix estimator constructed from equations (2.18), (2.19) and
(2.22) has been investigated by Bollerslev and Wooldridge (1992). For symmetric
departures from conditional normality, the QMLE is generally close to the exact
MLE. However, as noted by Engle and Gonzales-Rivera (1991), for non-symmetric
conditional distributions both the asymptotic and the finite sample loss in effi-
ciency may be quite large, and semiparametric density estimation, as discussed in
Section 1.5, may be preferred.

2.4. Specijcation checks

2.4.1. Lagrange multiplier diagnostic tests

After a model is selected and estimated, it is generally desirable to test whether it


adequately represents the data. A useful array of tests can readily be constructed
from calculating Lagrange multiplier tests against particular parametric alternatives.
Since almost any moment condition can be formulated as the score against some
alternative, these tests may also be interpreted as conditional moment tests; see
Newey (1985) and Tauchen (1985). Whenever one computes a collection of test
statistics, the question of the appropriate size of the full procedure arises. It is
generally impossible to control precisely the size of a procedure when there are
many correlated test statistics and conventional econometric practice does not
require this. When these tests are viewed as diagnostic tests, they are simply aids in
the model building process and may well be part of a sequential testing procedure
anyway. In this section, we will show how to develop tests against a variety of
interesting alternatives to any particular model. We focus on the simplest and most
useful case.
Suppose we have estimated a parametric model with the assumption that each
observation is conditionally normal with mean zero and variance gf = of(O). Then
the score can be written as a special case of (2.13),

s&8) = V,ln0:(8)[&:(B)o:(8)- - 11. (2.23)

In order to conserve space, equation (2.23) may be written more compactly as

se,= %,U,, (2.24)

where x0, denotes the k x 1 vector of derivatives of the logarithm of the conditional
Ch. 49: ARCH Models 2985

variance equation with respect to the parameters 8, and u, = &:(&r:(G)- - 1 defines


the generalized residuals. From the first order conditions in equation (2.9), the MLE
for 8, gT, solves

1 A$*= 1 &ii, = 0. (2.25)


1=1,T *= l,T

Suppose that the additional set of r parameters, represented by the r x 1 vector


y, have been implicitly set to zero during estimation. We wish to test whether this
restriction is supported by the data. That is, the null hypothesis may be expressed
as y0 = 0, where 0: = a:(e, y). Also, suppose that the score with respect to y has the
same form as in equation (2.24),

sYf = x,,u,. (2.26)

Under fairly general regularity conditions, the scores themselves when evaluated
at the true parameter under the null hypothesis, 8,, will satisfy a martingale central
limit theorem. Therefore,

T1*S&) + N(0, V), (2.27)

where V = A, denotes the covariance matrix of the scores. The conventional form of
the Lagrange multiplier test, as in Breusch and Pagan (1979) or Engle (1984) is then
given by

(2.28)
f=l,T t=l,T

where tj = (Q, y), represent estimates evaluated under the null hypothesis, and ?
denotes a consistent estimate of I/. As discussed in Section 2.2, a convenient estimate
of the information matrix is given by the outer product of the scores,

iiT = T- c $,& (2.29)


t=l,T

so that the test statistic can be computed in terms of a regression. Specifically, let
the T x 1 vector of ones be denoted z, and the T x (k + r) matrix of scores evaluated
under the null hypothesis be denoted by 9 = {iV1, s*w2,.. . , iwT}. Then a simple form
of the LM test is obtained from

tlT = L?($$)-$5 = TR*, (2.30)

where the R* is the uncentered fraction of variance explained by the regression of


a vector of ones on all the scores. The test statistic in equation (2.30) is often referred
2986 T. Bollersleu et al.

to as the outer product of the gradient, or OPG, version of the test. It is very easy
to compute. In particular, using the BHHH estimation algorithm, the test statistic
is simply obtained by one step of the BHHH algorithm from the maximum achieved
under the null hypothesis.
Studies of this version of the LM test, such as MacKinnon and White (1985) and
Bollerslev and Wooldridge (1992), often find that it has size distortions and is not
very powerful as it does not utilize the structure of the problem under the null
hypothesis to obtain the best estimate of the information matrix. Of course the R2
in (2.30) will be overstated if the likelihood function has not been fully maximized
under the null so that (2.25) is not satisfied. One might recommend a first step
correction by BHHH to be certain that this is achieved.
An alternative estimate of I/ corresponding to equation (2.19) is available from
taking expectations of SS. In the simplified notation of this section,

E(SS) = c E(uf x,x;, = E(uf) 1 E(x,xJ, (2.3 1)


1=1,T f= l,T

where it is assumed that the conditional expectation E, _ ,(u:) is time invariant. Of


course, this will be true if the standardized innovations s,(B)o:(8)- I2 has a distri-
bution which does not depend upon time or past information, as typically assumed
in estimation. Consequently, an alternative consistent estimator of V is given by

?r = (T-iYa)(T-XX), (2.32)

where u = {ui,. .,u,}, X = {x1,. , xT}, and xi = {xkt, Xl*}. Since ZS= uX, the
Lagrange multiplier test based on the estimator in equation (2.32) may also be
computed from an auxiliary regression,
^ _ _
&r = tiX(XX)-X ^I u* = TR. (2.33)

Here the regression is of the percentage difference between the squared residuals
and the estimated conditional variance regressed on the gradient of the logarithm
of the conditional variance with respect to all the parameters including those set to
zero under the null hypothesis. This test statistic is similar to one step of a Gauss-
Newton iteration from an estimate under the null. It is called the Hessian estimate
by Bollerslev and Wooldridge (1992) because it can also be derived by setting com-
ponents of the Hessian equal to their expected value, assuming only that the first
two moments are correctly specified, as discussed in Section 2.3. This version of the
test has considerable intuitive appeal as it checks for remaining conditional hetero-
skedasticity in u, as a function of x,. It also performed better than the OPG test in
the simulations reported by Bollerslev and Wooldridge (1992). This is also the
version of the test used by Engle and Ng (1993) to compare various model specifi-
cations. As noted by Engle and Ng (1993), the likelihood must be fully maximized
Ch. 49: ARCH Models 2987

under the null if the test is to have the correct size. An approach to dealing with
this issue would be to first regress li, on & and then form the test on the basis
of the residuals from this regression. The RZ of this regression should be zero if the
likelihood is maximized, so this is merely a numerical procedure to purge the test
statistic of contributions from loose convergence criteria.
Both of these procedures develop the asymptotic distribution under the null
hypothesis that the model is correctly specified including the normality assumption.
Recently, Wooldridge (1990) and Bollerslev and Wooldridge (1992) have developed
robust LM tests which have the same limiting distribution under any null specifying
that the first two conditional moments are correct. This follows in the line of
conditional moment tests for GMM or QMLE as in Newey (1985) Tauchen (1985)
and White (1987,1994).
To derive these tests, consider the Taylor series expansions of the scores around
the true parameter values, s,(0) and s,(0,),

as
7%,(&J = Tl%),(&)+ Tl2 2 (& - fy,), (2.34)
ae

T*s,(e,) = T%,(&) + T* f$&e,


- e,),
where the derivatives of the scores are evaluated at 8,. The derivatives in equations
(2.34) and (2.35) are simply the H,, and H,, elements of the Fessian, respectively.
The distribution of the score with respect to y evaluated at 8, is readily obtained
from the left hand side of equation (2.34). In particular substituting in (2.35), and
using (2.26) to give the limiting distribution of the scores,

T%,(B,) + N(0, W), (2.36)

where

W = Vyy- H,,H,V,, - Vy/yeHo;lHey+ H,,H, VooH,Hor. (2.37)

Notice first, that if the scores are the derivatives of the true likelihood, then the
information matrix equality will hold, and therefore H = V asymptotically. In this
case we get the conventional LM test described in (2.28) and computed generally
either as (2.30) or (2.33). If the normality assumption underlying the likelihood is
false so that the estimates are viewed as quasi-maximum likelihood estimators, then
the expressions in equations (2.36) and (2.37) are needed.
AS pointed out by Wooldridge (1990), any score which has the additional property
that H,, converges in probability to zero can be tested simply as a limiting normal
with covariance matrix Vyu,or as a TR* type test from a regression of a vector of ones
2988 T. Elollersleu et al.

on &. By proper redefinition of the score, such a test can always be constructed. To
illustrate, suppose that syf = xylu,, s,, = x0+, and au,/atI = - xBr. Also define

s;t = (Xyr- x;$G, (2.38)

where

(2.39)

The statistic based on s,: in equation (2.38) then tests only the part of x,, which is
orthogonal to the scores used to estimate the model under the null hypothesis. This
strategy generalizes to more complicated settings as discussed by Bollerslev and
Wooldridge (1992).

2.4.2. BDS specijication tests

As discussed in Section 2.1.2, the asymptotic distribution of the BDS test is unaffected
by passing the data through a linear, e.g. ARMA, filter. Since an ARCH model
typically assumes that the standardized residuals z, = a,~,~ are i.i.d., it seems
reasonable to use the BDS test as a specification test by applying it to the fitted
standardized residuals from an ARCH model. Fortunately, the BDS test applied to
the standardized residuals has considerable power to detect misspecification in
ARCH models. Unfortunately, the asymptotic distribution of the test is strongly
affected by the fitting of the ARCH model. As documented by Brock et al. (1991)
and Hsieh (1991), BDS tests on the standardized residuals from fitted ARCH models
reject much too infrequently. In light of the filtering properties of misspecified
ARCH models, discussed in Section 4 below, this may not be too surprising.
The asymptotic distribution of the BDS test for ARCH residuals has not yet been
derived. One commonly employed procedure to get around this problem is to
simply simulate the critical values of the test statistic; i.e. in each replication generate
data by Monte Carlo methods from the specific ARCH model, then estimate the
ARCH model and compute the BDS test for the standardized residuals. This
approach is obviously very demanding computationally.
Brock and Potter (1992) suggest another possibility for the case in which the condi-
tional mean of the observed data is known. Applying the BDS test to the logarithm
of the squared known residuals, i.e. In($) = ln(zF) + ln(o:), separates ln($) into an
i.i.d. component, ln(z:), and a component which can be estimated by non-linear
regression methods. Under the null of a correctly specified ARCH model, ln(zf) =
In($) - ln(a:) is i.i.d. and, subject to the regularity conditions of Brock and Potter
(1992) or Brock et al. (1991), the asymptotic distribution of the BDS test is the same
whether applied to ln(z:) or to the fitted values In@,!) = In($) - ln(s:). While the-
assumption of a known conditional mean is obviously unrealistic in some applications,
Ch. 49: ARCH Models 2989

it may be a reasonable approximation for high frequency financial time series, where
the noise component tends to swamp the conditional mean component.

3. Stationary and ergodic properties

3.1. Strict stationarity

In evaluating the stationarity of ARCH models, it is convenient to recursively


substitute for the lagged E,Sand 0:s. For completeness, consider the multivariate
case where

Ef = R cr12Zr, 2, - i.i.d., E(Z,) = 0, X r, -q-&q = 1, x n (3.1)

and

n,=n(t,z,_,,z,_, ,... ). (3.2)

Using the ergodicity criterion from Corollary 1.4.2 in Krengel(1985), it follows that
strict stationarityof {st}t, _ 4),dois equivalent to the condition

fl,=wk,,z,-,,...), (3.3)

withR(.;,...) measurable, and

Trace(f2,Ri) < co a.s. (3.4)

Equation (3.3) eliminates direct dependence of {a,} on t, while (3.4) ensures that
random shocks to {a,} die out rapidly enough to keep (a,} from exploding
asymptotically.
In the univariate EGARCH(p, q) model, for example, equation (3.2) is obtained
by exponentiating both sides of the definition in equation (1.11). Since In($)
is written in ARMA(p,q) form, it is easy to see that if (1 + Cj=l,qcljxj) and
(1 -xi= i,J&x) have no common roots, equations (3.3) and (3.4) are equivalent to
all the roots of (1 - Ci=i,,jixi) lying outside the unit circle. Similarly, in the
bivariate EGARCH model defined in Section 6.4 below, ln(ai,,), In(&) and p,,,
all follow ARMA processes giving rise to ARMA stationarity conditions.
One sufficient condition for (3.4) is moment boundedness; i.e. clearly
E[Trace(f2,0~)P] finite for some p > 0 implies Trace(R$:) < CE a.s. For example,
Bollerslev (1986) shows that in the univariate GARCH(p, q) model defined by
equation (1.9) E(af) is finite and (et} is covariance stationary, when xi= i,,fii +
Cj= l,qaj < 1. This is a sufficient, but not a necessary condition for strict stationarity,
however. Because ARCH processes are thick tailed, the conditions for weak or
2990

covariance stationarity arc often more stringent than the conditions for strict
stationarity.
For instance, in the univariate GARCH(1, 1) model, (3.2) takes the form

[
c: = w 1 + c
k=l,m
n
i=l.k
(/I1 + cZ1z:-i)
1. (3.5)

Nelson (1990b) shows that when w > O,o: < cc a.s., and {E,, of} is strictly stationary
if and only if E[ln(fii + crizf)] < 0. An easy application of Jensens inequality shows
that this is a much weaker requirement than c1r + /I, < 1, the necessary and sufficient
condition for (.st} to be covariance stationary. For example, the simple ARCH(l)
model with z, N N(0, 1) and a1 = 3 and /?i = 0, is strictly but not weakly stationary.
To grasp the intuition behind this seemingly paradoxical result, consider the
terms in the summation in (3.5); i.e. ni= l,k(j31 + a,~:_~). Taking logarithms, it
follows directly that Ci= I,k ln(/Ii + u,z:_~) is a random walk with drift. If
E[ln(Pi + u,z:_~)] > 0, the drift is positive and the random walk diverges to co a.s.
as k + co. If, on the other hand, E[ln(/Ii + u,z:_~)] < 0, the drift is negative and the
random walk diverges to - cc a.s. as k + 00, in which case ni= l,k(/?l + u,z:_~) tends
to zero at an exponential rate in k a.s. as k -+ co. This, in turn, implies that the sum
in equation (3.5) converges a.s., establishing (3.4). Measurability in (3.3) follows
easily using Theorems 3.19 and 3.20 in Royden (1968).
This result for the univariate GARCH(l, 1) model generalizes fairly easily to other
closely related ARCH models. For example, in the multivariate diagonal GARCH( 1,l)
model, discussed in Section 6.1 below, the diagonal elements of 0, follow univariate
GARCH( 1,l) processes. If each of these processes is stationary, the CauchyySchwartz
inequality ensures that all of the elements in R, are bounded a.s. The case of the
constant conditional correlation multivariate GARCH(l, 1) model in Section 6.3 is
similar. The same method can also be used in a number of other univariate cases
as well. For instance, when p = q = 1, the stationarity condition for the model in
equation (1.16) is E[ln(cr:I(z, > O)Iz,Iy + cr;I(z, <O)lz,l)] < 0.
Establishing stationarity becomes much more difficult when we complicate the
models even slightly. The extension to the higher order univariate GARCH(p,q)
model has recently been carried out by Bougerol and Picard (1992) with methods
which may be more generally applicable. There exists a large mathematics literature
on conditions for stationarity and ergodicity for Markov chains; see, e.g. Numme-
lin and Tuominen (1982) and Tweedie (1983a). These conditions can sometimes be
verified for ARCH models, although much work remains establishing useful station-
arity criteria even for many commonly-used models.

3.2. Persistence

The notion of persistence of a shock to volatility within the ARCH class of models
is considerably more complicated than the corresponding concept of persistence in
Ch.49:ARCH Models 2991

the mean for linear models. This is because even strictly stationary ARCH models
frequently do not possess finite moments.
Suppose that {CT:}is strictly stationary and ergodic. Let F(o:) denote the uncondi-
tional cumulative distribution function (cdf) for c:, and let F,(a:) denote the cdf of
c: given information at time s < t. Then for any s, F,(a:) - F(a:) converges to 0 at
all continuity points as t + co; i.e. time s information drops out of the forecast
distribution as t + co. Therefore, one perfectly reasonable definition of persistence
would be to say that shocks fail to persist when {o,} is stationary and ergodic.
It is equally natural, however, to define persistence of shocks in terms of forecast
moments; i.e. to choose some q > 0 and to say that shocks to CJ: fail to persist if and
only if for every s, E,(afq) converges, as t + 00, to a finite limit independent of time
s information. Such a definition of persistence may be particularly appropriate when
an economic theory makes a forecast moment, as opposed to a forecast distribution,
the object of interest.
Unfortunately, whether or not shocks to {of} persist depends very much on
which definition is adopted. The conditional moment &(a:) may diverge to infinity
for some q, but converge to a well-behaved limit independent of initial conditions
for other q, even when the {o:} process is stationary and ergodic.
Consider, for example, the GARCH( 1,1) model, in which

The expectation of @: as of time s, is given by

(3.7)

It is easy to see that E,(o:) converges to the unconditional variance of w/(1 - c(r - pi)
ast+coifandonlyifa, +fli < l.IntheIGARCHmodelwitho>Oandcr, +pl = 1,
it follows that &(a:) -+ co a.s. as t -+ co. Nevertheless, as discussed in the previous
section, IGARCH models are strictly stationary and ergodic. In fact, as shown by
Nelson (1990b) in the IGARCH(l, 1) model E,(o:) converges to a finite limit
independent of time s information as t + cc whenever q < 1. This ambiguity of
persistence holds more generally. When the support of z, is unbounded it follows
from Nelson (1990b) that in any stationary and ergodic GARCH(l, 1) model, E,(azV)
diverges for all sufficiently large q, and converges for all sufficiently small q. For
many other ARCH models, moment convergence may be most easily established
with the methods used in Tweedie (1983b).
While the relevant criterion for persistence may be dictated by economic theory,
in practice tractability may also play an important role. For example, E,(a:), and
its multivariate extension discussed in Section 6.5 below, can often be evaluated
even when strict stationarity is difficult to establish, or when &(a:) for q # 1 is
intractable.
2992 T. Bollerslev et al.

Even so, in many applications, simple moment convergence criterion have not
been successfully developed. This includes quite simple cases, such as the univariate
GARCH(p, q) model when p > 1 or q > 1. The same is true for multivariate models,
in which co-persistence is an issue. In such cases, the choice of 4 = 1 may be
impossible to avoid. Nevertheless, it is important to recognize that apparent persis-
tence of shocks may be driven by thick-tailed distributions rather than by inherent
non-stationarity.

4. Continuous time methods

ARCH models are systems of non-linear stochastic difference equations. This makes
their probabilistic and statistical properties, such as stationarity, moment finiteness,
consistency and asymptotic normality of MLE, more difficult than is the case with
linear models. One way to simplify the analysis of ARCH models is to approximate
the stochastic difference equations with more tractable stochastic differential
equations. On the other hand, for certain purposes, notably in the computation of
point forecasts and maximum likelihood estimates, ARCH models are more conve-
nient than the stochastic differential equation models of time-varying volatility
common in the finance literature; see, e.g. Wiggins (1987), Hull and White (1987)
Gennotte and Marsh (1991), Heston (1991) and Andersen (1992a).
Suppose that the process {X,} is governed by the stochastic integral equation

where {Wl> is an N x 1 standard Brownian motion, and ,u(.) and 012(.) are
continuous functions from RN into RN and the space of N x N real matrices
respectively. The starting value, X,, may be fixed or random. Following Karatzas
and Shreve (1988) and Ethier and Kurtz (1986), if equation (4.1) has a unique
weak-sense solution, the distribution of the (X,} process is then completely deter-
mined by the following four characteristics:9
(i) the cumulative distribution function, F(x,), of the starting point X,;
(ii) the drift p(x);
(iii) the conditional covariance matrix 0(x) = Q(x)~[Q(x)~];~
(iv) the continuity, with probability one, of {X,} as a function of time.
Our interest here is either in approximating (4.1) by an ARCH model or visa
versa. To that end, consider a sequence of first-order Markov processes {,,X,}, whose

Formally, we consider {X,} and the approximating discrete time processes {,,X,} as random variables
in DR[O, co), the space of right continuous functions with finite left limits, equipped with the Skorohod
topology. D&O, cc) is a complete, separable metric space [see, e.g. Chapter 3 in Ethier and Kurtz
(1986)J
roQ(x)l/z ISa matrix square root of L?(x), though it need not be the symmetric square root since
we require only that ~(~)~[L?(x)~] = a(x), not f2(~)~*f2(x)~ = Q(x).
Ch. 49: ARCH Models 2993

sample paths are random step functions with jumps at times h, 2h, 3h,. . . . For each
h > 0, and each non-negative integer k, define the drift and covariance functions by
/&)---lECbX,+i - ,X,)/,X, = x], and Q,(x) 3 h- Cov[(J,Xk+, - ,,Xk)l,,XL=x],
respectively. Also, let F&,x,) denote the cumulative distribution function for ,,XO.
Since (i)-(iv) completely characterize the distribution of the {X,} process, it seems
intuitive that weak convergence of {,,Xt} to {X,} can be achieved by matching
these properties in the limit as hJ0. Stroock and Varadhan (1979) showed that this
is indeed the case.

Theorem 4.1. [Stroock and Varadhan (1979)]

Let the stochastic integral equation (4.1) have a unique weak-sense solution. Then
{,,Xr} converges weakly to {X,} for hJ0 if
(i) F,,(.) -+ F(.) as hJ0 at all continuity points of F(.),
(ii) p,,(x) -p(x) uniformly on every bounded x set as hJ0,
(iii) Q,(x) + n(x) uniformly on every bounded x set as h10,
(iv) for some 6 > 0, h-E[ Il,,Xk+l - hXk I/ + lhXk = x] + 0 uniformly on every
bounded x set as h10.

This result, along with various extensions, is fundamental in all of the continuous
record asymptotics discussed below.
Deriving the theory of continuous time approximation for ARCH models in its
full generality is well beyond the scope of this chapter. Instead, we shall simply
illustrate the use of these methods by explicit reference to a diffusion model frequently
applied in the options pricing literature; see e.g. Wiggins (1987). The model considers
an asset price, Y,, and its instantaneous returns volatility, ot. The continuous time
process for the joint evolution of (Y,, a,} with fixed starting values, (Y,, a,), is given
by

dY,=pY,dt+ Y,a,dW,,, (4.2)

and

d [ln($)] = - B[ln(a,2) - al dt + ICIcl Wz,t, (4.3)

where ,u, $,fi and c( denote the parameters of the process, and WI,, and W,., are
driftless Brownian motions independent of (Y,, ci) that satisfy

1
Ld:- 2.1
CdW,,, d W,,,] = ;
[
;
1dt. (4.4)

I1 We define the matrix norm, 11 I),by 11A ((= [Trace(AA)]. It is easy to see why (i)-(iii) match
(i)-(iii) in the limit as h JO. That (iv) leads to (iv) follows from HGlders inequality; see Theorem 2.2 in
Nelson (1990a) for a formal proof.
2994 77 Bollersleu et al.

Of course in practice, the price process is only observable at discrete time intervals.
However, the continuous time model in equations (4.2)-(4.4) provides a very conve-
nient framework for analyzing issues related to theoretical asset pricing, in general,
and option pricing, in particular. Also, by Itos lemma equation (4.2) may be
equivalently written as

dyt= p-2 dt+a,dW,,,,


( >
where y, = ln( Y,). For many purposes this is a more tractable differential equation.

4.1. ARCH models as approximations to diffusions

Suppose that an economic model specifies a diffusion model such as equation (4. l),
where some of the state variables, including Q(x,), are unobservable. Is it possible
to formulate an ARCH data generation process that is similar to the true process,
in the sense that the distribution of the sample paths generated by the ARCH model
and the diffusion model in equation (4.1) becomes close for increasingly finer
discretizations?
Specifically, consider tlie diffusion model given by equations (4.2)-(4.4). Strategies
for approximating diffusions such as this are well known. For example, Melino and
Turnbull (1990) use a standard Euler approximation in defining (y,, gt),12

p.9

= ln(af)- hP[ln(cf)- a] + h12$z2,t+h,


ln(af+h) (4.6)

for t = h, 2h, 3h,. . . . Here (yO, a,) is taken to be fixed, and (Zl,t, Z,,,) is assumed to be
i.i.d. bivariate normal with mean vector (0,O) and

(4.7)

Convergence of this set of stochastic difference equations to the diffusion in


equations (4.2)-(4.4) as h 10 may be verified using Theorem 4.1. In particular, (i)
holds trivially, since (y,,, CJ,,)are constants. To check conditions (ii) and (iii), note
that

h-E,
(P
-m)1
- /3[ln(a:) - LY]
(4.8)

See Pardoux and Talay (1985) for a general discussion of the Euler approximation technique.
Ch. 49: ARCH Models 2995

and

h-Var, (4.9)

which matches the drift and diffusion matrix of (4.2))(4.4). Condition (iv) is nearly
trivially satisfied, since Zr,, and Z2,, are normally distributed with arbitrary finite
moments. The final step of verifying that the limit diffusion has a unique weak-sense
solution is often the most difficult and least intuitive part of the proof for
convergence. Nelson (1990a) summarizes several sets of sufficient conditions,
however, and formally shows that the process defined by (4.5)-(4.7) satisfies these
conditions.
While conditionally heteroskedastic, the model defined by the stochastic difference
equations (4.5)-(4.7) is not an ARCH model. In particular, for p # 1 G: is not simply
a function of the discretely observed sample path of {yt} combined with a startup
value cri. More technically, while the conditional variance (y,,,, - y,)given the
a-algebra generated by {y,,of},,< r $ f e q uals ho:, it is not, in general, the conditional
variance of ( yt + h - y,) given the smaller a-algebra generated by { yr}O,h,Zh...htt,hland
ci. Unfortunately, this latter conditional variance is not available in closed form.13
To create an ARCH approximation to the diffusion in (4.2)-(4.4) simply replace
(4.6) by

ln(a:+,,)= lnbf) - MWf) - aI+ h2g(Z,,t+J, (4.10)

where g(.) is measurable with E[Ig(Zl,t+,,)12+6] < 00 for some 6 > 0, and

(4.11)

As an ARCH model, the discretization defined by (4.5), (4.10) and (4.11) inherits
the convenient properties usually associated with ARCH models, such as the easily
computed likelihoods and inference procedures discussed in Section 2 above. As
such, it is a far more tractable approximation to (4.2))(4.4) than the discretization
defined by equations (4.5)-(4.7).
To complete the formulation of the ARCH approximation, an explicit g(.)
function is needed. Since E((Z,,,I)=(2/~)2,E(Z1,t~Zl,t~)=0 and Var(lZl,ll)=
1 - (2/7t), one possible formulation would be

(4.12)

13Jacquier et al. (1994) have recently proposed a computationally tractable algorithm for computing
this conditional variance.
2996 T Bollersleo et al.

This corresponds directly to the EGARCH model in equation (1.11). Alternatively,

2 l/2
1-P
dZl*J = PtiZ,,, + $
( >
7 v:,, - 1) (4.13)

also satisfies equation (4.11).This latter specification turns out to be the asymptotically
optimal filter for h JO, as discussed in Nelson and Foster (199 1,1994) and Section 4.3
below.

4.2. Difusions as approximations to ARCH models

Now consider the question of how to best approximate a discrete time ARCH
model with a continuous time diffusion. This can yield important insights into the
workings of a particular ARCH model. For example, the stationary distribution
of 0, in the AR(l) version of the EGARCH model given by equaton (1.11) is
intractable. However, the sequence of EGARCH models defined by equations (4.5)
and (4.10)-(4.12) converges weakly to the diffusion process in (4.2)-(4.4). When
/I > 0, the stationary distribution of ln(a:) is N(cr,11//2/I).Nelson (1990a) shows
that this is also the limit of the stationary distribution of In(a:) in the sequence
of EGARCH models (4.5) and (4.10)-(4.12) as h JO. Similarly, the continuous limit
may result in convenient approximations for forecast moments of the (~,,a:)
process.
Different ARCH models will generally result in different limit diffusions. To
illustrate, suppose that the data are generated by a simple martingale model with
a GARCH(l, 1) error structure as in equation (1.9). In the present notation, the
process takes the form,

Ltt+h = Y, + %k+t, = Yt + &,+h (4.14)

and

a:+,
= wh + (1 - tlh - ah)a f
+ h1j2ae2
t+h, (4.15)

where given time t information, E,+~is N(0, a;), and (x,, a,,) is assumed to be fixed.
Note that using the notation for the GARCH(p,q) model in equation (1.9)
a, + fil = 1 - Bh, so for increasing sampling frequencies, i.e., as hJ0, the parameters
of the process approach the IGARCH(l, 1) boundary as discussed in Section 3.
Following Nelson (1990a)

(4.16)
Ch. 49: ARCH Models 2991

and

(4.17)

Thus, from Theorem 4.1 the limit diffusion is given by

dx,=a*dW,,, (4.18)

and

(4.19)

where WI,, and W,,, are independent standard Brownian motions.


The diffusion defined by equations (4.18) and (4.19) is quite different from the
EGARCH limit in equations (4.2)-(4.4). For example, if d/2a2 > - 1, the stationary
distribution of c: in (4.19) is an inverted gamma, so as h 10 and t + co, the normalized
increments h-12(y,+h - y,) are conditionally normally distributed but uncondi-
tionally Students t. In particular, in the IGARCH case corresponding to 0 = 0,
as hJ0 and t + co, h-iZ(y,+h - y,) approaches a Students t distribution with two
degrees of freedom. In the EGARCH case, however, h - I/( y, +,, - y,) is conditionally
normal but is unconditionally a normal-lognormal mixture. When 0: is stationary,
the GARCH formulation in (1.9) therefore gives rise to unconditionally thicker-
tailed residuals than the EGARCH model in equation (1.11).

4.3. ARCH models as jilters and forecasters

Suppose that discretely sampled observations are only available for a subset of
the state variables in (4.1), and that interest centers on estimating the unobservable
state variable(s), Q(x,). Doing this optimally via a non-linear Kalman filter is
computationally burdensome; see, e.g. Kitagawa (1987).14 Alternatively, the data
might be passed through a discrete time ARCH model, and the resulting conditional
variances from the ARCH model viewed as estimates for 0(x,). Nelson (1992)
shows that under fairly mild regularity conditions, a wide variety of misspecified
ARCH models consistently extract conditional variances from high frequency time
series. The regularity conditions require that the conditional distribution of the
observable series is not too thick tailed, and that the conditional covariance matrix
moves smoothly over time. Intuitively, the GARCH filter defined by equation (1.9)
An approximate linear Kalman filter for a discretized version of(4.1) has been employed by Harvey
et al. (1994). The exact non-linear filter for a discretized version of (4.1) has been developed by Jacquier
et al. (1994). Danielson and Richard (1993) and Shephard (1993) also calculate the exact likelihood by
computer intensive methods.
2998 T. Bollerslev et al.

estimates the conditional variance by averaging squared residuals over some time
window, resulting in a nonparametric estimate for the conditional variance at each
point in time. Many other ARCH models can be similarly interpreted.
While many different ARCH models may serve as consistent filters for the same
diffusion process, efficiency issues may also be relevant in the design of an ARCH
model. To illustrate, suppose that the Y, process is observable at time intervals of
length h, but that g: is not observed. Let 8: denote some initial estimate of the
conditional variance at time 0, with subsequent estimates generated by the recursion

ln(@+,) = ln(8f) + hK(8:) + hg[8f, h-2(Y,+h - Y,)l. (4.20)

The set of admissible g(.;) functions is restricted by the requirement that E,{g[af,
h-2(Yt+h - y,)]} be close to zero for small values of h.i5 Define the normalized
estimation error from this filter extraction as qt = h-14[ln(8:) - ln(of)].
Nelson and Foster (1994) derive a diffusion approximation for qt when the data
have been generated by the diffusion in equations (4.2)-(4.4) and the time interval
shrinks to zero. In particular, they show that qt is approximately normally distributed,
and that by choosing the g(., .) function to minimize the asymptotic variance of
q,, the drift term for ln(a:) in the ARCH model, K(.), does not appear in the
resulting minimized asymptotic variance for the measurement error. The effect is
second order in comparison to that of the g(., .) term, and creates only an asympto-
tically negligible bias in qt. However, for rc(r~f)s - fi[ln(a:) - a], the leading term
of this asymptotic bias also disappears. It is easy to verify that the conditions of
Theorem 4.1 are satisfied for the ARCH model defined by equation (4.20) with
~(a) = - j3[ln(a2) - a] and the variance minimizing g(., .). Thus, as a data generation
process this ARCH model converges weakly to the diffusion in (4.2)-(4.4). In the
diffusion limit the first _two conditional moments completely characterize the
process, and the optimal ARCH filter matches these moments.
The above result on the optimal choice of an ARCH filter may easily be extended
to other diffusions and more general data generating processes. For example,
suppose that the true data generation process is given by the stochastic difference
equation analogue of (4.2)-(4.4),

Yt+Jl=Yt+h
( 41
P-y +51,t, (4.21)

ln(af+,) = In($) - hp[ln(af) - a] + h1/252,f, (4.22)

where (rl,tgt- , t,,,) is i.i.d. and m d ependent oft, h and y,, with conditional density
f(lr .f, t2 .rJor) with mean (O,O), bounded 2 + 6 absolute moments, Var,({,,,) = a:,

Formally, the function must satisfy that h-/4E,{g[uf, h-12(y,+, - y,)] } + 0uniformly on bounded
(y,, uC) sets as hJ0.
Ch. 49: ARCH Models 2999

and Var,(t,,,) = II/. This model can be shown to converge weakly to (4.2)-(4.4) as
h10. The asymptotically optimal filter for the model given by equations (4.21) and
(4.22) has been derived in Nelson and Foster (1994). This optimal ARCH filter
when (4.21) and (4.22) are the data generation process is not necessarily the same
as the optimal filter for (4.2)-(4.4). The increments in a diffusion such as (4.2))(4.4)
are approximately conditionally normal over very short time intervals, whereas
the innovations (rl,l, c2,J in (4.21) and (4.22) may be non-normal. This affects the
properties of the ARCH filter. Consider estimating a variance based on i.i.d. draws
from some distribution with mean zero. If the distribution is normal, averaging
squared residuals is an asymptotically efficient method of estimating the variance.
Least squares, however, can be very inefficient if the distribution is thicker tailed
than the normal. This theory of robust scale estimation, discussed in Davidian
and Carroll (1987) and Huber (1977) carries over to the ARCH case. For example,
estimating 0: by squaring a distributed lag of absolute residuals, as proposed by
Taylor (1986) and Schwert (1989a, b), will be more efficient than estimating 0: with a
distributed lag of squared residuals if the conditional distribution of the innovations
is sufficiently thicker tailed than the normal.
One property of optimally designed ARCH filters concerns their resemblance to
the true data generating process. In particular, if the data were generated by the
asymptotically optimal ARCH filter, the functional form for the second conditional
moment of the state variables would be the same as in the true data generating
process. If the conditional first moments also match, the second order bias is
similarly eliminated. Nelson and Foster (1991) show that ARCH models which
match these first two conditional moments also have the desirable property that
the forecasts generated by the possibly misspecified ARCH model approach the
forecasts from the true model as hJ0. Thus, even when ARCH models are mis-
specified, they may consistently estimate the conditional variances. Unfortunately,
the behavior of ARCH filters with estimated as opposed to known parameters,
and the properties of the parameter estimates themselves, are not yet well understood.

5. Aggregation and forecasting

5. I. Temporal aggregation

The continuous record asymptotics discussed in the preceding section summarizes


the approximate relationships between continuous time stochastic differential
equations and discrete time ARCH models defined at increasingly higher sampling
frequencies. While the approximating stochastic differential equations may result
in more manageable theoretical considerations, the relationship between high
frequency ARCH stochastic difference equations and the implied stochastic process
for less frequently sampled, or temporally aggregated, data is often of direct
importance for empirical work. For instance, when deciding on the most appropriate
3ooo T. Bollersleu et al.

sampling interval for inference purposes more efficient parameter estimates for the
low frequency process may be available from the model estimates obtained with
high frequency data. Conversely, in some instances the high frequency process
may be of primary interest, while only low frequency data is available. The
non-linearities in ARCH models severely complicate a formal analysis of temporal
aggregation. In contrast to the linear ARIMA class of models for conditional
means, most parametric ARCH models are only closed under temporal aggregation
subject to specific qualifications.
Following Drost and Nijman (1993) we say that (E,} is a weak GARCH(p, q)
process if E, is serially uncorrelated with unconditional mean zero, and c:, as
defined in equation (1.9), corresponds to the best linear projection of E: on the
space spanned by {1, E,_ 1, E,_ 2,. . . , tzf_1, E:_*, . . . }. More specifically,

E(Ef- c$,= E[ (Ef- fJf)E,_i] = E[ (Ef - a;)&;_i] = 0 i= 1,2,... . (5.1)

This definition of a weak GARCH(p, q) model obviously encompasses the conven-


tional GARCH(p,q) model in which U: is equal to the conditional expectation of
E: based on the full information set at time t - 1 as a special case. Whereas the
conventional GARCH(p, q) class of models is not closed under temporal aggregation,
Drost and Nijman (1993) show that temporal aggregation of ARIMA models with
weak GARCH(p, q) errors lead to another ARIMA model with weak GARCH(p, q)
errors. The orders of this temporally aggregated model and the model parameters
depend on the original model characteristics.
To illustrate, suppose that {Ed)follows a weak GARCH(l, 1)model with parameters
0,~~ and B1. Let {Ed}denote the discrete time temporally aggregated process
defined at t, t + m, t + 2m,. . . . For a stock variable E?) is obtained by sampling E,
every mth period. For a flow variable elm)E E,+ E,_ 1 + . . . + E,_ m+ 1. In both cases, it
is possible to show that the temporally aggregated process, {E:}, is also weak
GARCH(l, 1) with parameters W(~)= w[l - (~1~+ /?J]/(l - a1 - pl) and u\ =
(~1~+ BJ - Pi), where D\ is a complicated function of the parameters for the
original process. Thus, a? + Birn)= (al + /?l)m, and conditional heteroskedasticity
disappears as the sampling frequency decreases, provided that CQ+ B1 < 1. Moreover,
for flow variables the conditional kurtosis of the standardized residuals, $)[ajm)]-,
converges to the normal value of three for less frequently sampled observations.
This convergence to asymptotic normality for decreasing sampling frequencies of
temporally aggregated covariance stationary GARCH(p, q) flow variables has
been shown previously by Diebold (1988), using a standard central limit theorem
type argument.
These results highlight the fact that the assumption of i.i.d. innovations invoked
in maximum likelihood estimation of GARCH models is necessarily specific to
the particular sampling frequency employed in the estimation. If ε_t σ_t^{-1} is assumed
i.i.d., the distribution of ε_t^{(m)}[σ_t^{(m)}]^{-1} will generally not be time invariant. Following
the discussion in Section 2.3, the estimation by maximum likelihood methods could
be given a quasi-maximum likelihood type interpretation, however. Issues pertaining
to the efficiency of the resulting estimators remain unresolved.
The extension of the aggregation results for the GARCH(p,q) model to other
parametric specifications is in principle straightforward. The cross sectional
aggregation of multivariate GARCH processes, which may be particularly relevant
in the formation of portfolios, has been addressed in Nijman and Sentana (1993).

5.2. Forecast error distributions

One of the primary objectives of econometric time series model building is often
the construction of out-of-sample predictions. In conventional econometric models
with time invariant innovation variances, the prediction error uncertainty is an
increasing function of the prediction horizon, and does not depend on the origin
of the forecast. In the presence of ARCH errors, however, the forecast accuracy
will depend non-trivially on the current information set. The proper construction
of forecast error intervals and post-sample structural stability tests, therefore, both
require the evaluation of future conditional error variances.16
A detailed analysis of the forecast moments for various GARCH models is
available in Engle and Bollerslev (1986) and Baillie and Bollerslev (1992). Although
both of these studies develop expressions for the second and higher moments of
the forecast error distributions, this is generally not enough for the proper
construction of confidence intervals, since the forecast error distributions will be
leptokurtic and time-varying.
A possible solution to this problem is suggested by Baillie and Bollerslev (1992),
who argue for the use of the Cornish-Fisher asymptotic expansion to take account
of the higher order dependencies in the construction of the prediction error intervals.
The implementation of this expansion requires the evaluation of higher order
conditional moments of ε_{t+s}, which can be quite complicated. Interestingly, in a
small scale Monte Carlo experiment, Baillie and Bollerslev (1992) find that under
the assumption of conditional normality for ε_t σ_t^{-1}, the ninety-five percent confidence
interval for multi-step predictions from the GARCH(1,1) model, constructed under
the erroneous assumption of conditional normality of ε_{t+s}[E_t(ε²_{t+s})]^{-1/2} for s > 1,
has a coverage probability quite close to ninety-five percent. The one percent
fractile is typically underestimated by falsely assuming conditional normality of
the multi-step leptokurtic prediction errors, however.
Most of the above mentioned results are specialized to the GARCH class of
models, although extensions to allow for asymmetric or leverage terms and multi-
variate formulations in principle would be straightforward. Analogous results on
forecasting ln(σ_t²) for EGARCH models are easily obtained. Closed form expressions
for the moments of the forecast error distribution for the EGARCH model are
not available, however.

16. Also, as discussed earlier, the forecasts of the future conditional variances are often of direct interest
in applications with financial data.
As discussed in Section 4.3, an alternative approximation to the forecast error
distribution may be based upon the diffusion limit of the ARCH model. If the
sampling frequency is high, so that the discrete time ARCH model is a close
approximation to the continuous time diffusion limit, the distributions of the
forecasts should be good approximations too; see Nelson and Foster (1991). In
particular, if the unconditional distribution of the diffusion limit can be derived,
this would provide an approximation to the distribution of the long horizon
forecasts from a strictly stationary model.
Of course, the characteristics of the prediction error distribution may also be
analyzed through the use of numerical methods. In particular, let f_t(ε_{t+s}) denote
the density function for ε_{t+s} conditional on information up through time t. Under
the assumption of a time invariant conditional density function for the standardized
innovations, f(ε_t σ_t^{-1}), the prediction error density for ε_{t+s} is then given by the
convolution

f_t(ε_{t+s}) = ∫⋯∫ f(ε_{t+s}σ_{t+s}^{-1}) f(ε_{t+s-1}σ_{t+s-1}^{-1}) ⋯ f(ε_{t+1}σ_{t+1}^{-1}) dε_{t+s-1} dε_{t+s-2} ⋯ dε_{t+1}.

Evaluation of this multi-step prediction error density may proceed directly by
numerical integration. This is illustrated within a Bayesian context by Geweke
(1989a, b), who shows how the use of importance sampling and antithetic variables
can be employed in accelerating the convergence of the Monte Carlo integration.
In accordance with the results in Baillie and Bollerslev (1992), Geweke (1989a)
finds that for conditional normally distributed one-step-ahead prediction errors,
the shorter the forecast horizon s, and the more tranquil the periods before the
origin of the forecast, the closer to normality is the prediction error distribution
for ε_{t+s}.
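As a minimal sketch of the simulation alternative (assuming a GARCH(1,1) model with illustrative parameter values and conditionally normal standardized innovations), the multi-step prediction error density can also be approximated by Monte Carlo rather than by evaluating the convolution directly:

import numpy as np

def simulate_prediction_errors(omega, alpha1, beta1, sigma2_next, s, n_draws=100000, seed=0):
    # Draw eps_{t+s} given the known conditional variance sigma2_next for period t+1.
    rng = np.random.default_rng(seed)
    sigma2 = np.full(n_draws, sigma2_next)
    for _ in range(s):
        z = rng.standard_normal(n_draws)                   # i.i.d. standardized innovations
        eps = np.sqrt(sigma2) * z
        sigma2 = omega + alpha1 * eps**2 + beta1 * sigma2  # update the conditional variance
    return eps                                             # draws from the density of eps_{t+s}

draws = simulate_prediction_errors(0.05, 0.08, 0.90, sigma2_next=1.0, s=20)
print(draws.std(), np.mean((draws / draws.std())**4))      # dispersion and kurtosis of eps_{t+s}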

6. Multivariate specifications

Financial market volatility moves together over time across assets and markets.
Recognizing this commonality through a multivariate modeling framework leads
to obvious gains in efficiency. Several interesting issues in the structural analysis
of asset pricing theories and the linkage of different financial markets also call for
an explicit multivariate ARCH approach in order to capture the temporal
dependencies in the conditional variances and covariances.
In keeping with the notation of the previous sections, the N x 1 vector stochastic
process {ε_t} is defined to follow a multivariate ARCH process if E_{t-1}(ε_t) = 0, but
the N × N conditional covariance matrix,

E_{t-1}(ε_t ε_t′) = Ω_t,   (6.1)

depends non-trivially on the past of the process. From a theoretical perspective,
inference in multivariate ARCH models poses no added conceptual difficulties in
comparison to the procedures for the univariate case outlined in Section 2 above.
To illustrate, consider the log likelihood function for {ε_T, ε_{T-1}, ..., ε_1} obtained
under the assumption of conditional multivariate normality,

L_T(ε_T, ε_{T-1}, ..., ε_1; ψ) = -0.5[TN ln(2π) + Σ_{t=1,T} (ln|Ω_t| + ε_t′Ω_t^{-1}ε_t)].   (6.2)
This function corresponds directly to the conditional likelihood function for the
univariate ARCH model defined by equations (2.7), (2.8) and (2.12), and maximum
likelihood, or quasi-maximum likelihood, procedures may proceed as discussed in
Section 2. Of course, the actual implementation of a multivariate ARCH model
necessarily requires some assumptions regarding the format of the temporal
dependencies in the conditional covariance matrix sequence, {Ω_t}.
Several key issues must be faced in choosing a parametrization for Ω_t. Firstly,
the sheer number of potential parameters in a general formulation is overwhelming.
All useful specifications must necessarily restrict the dimensionality of the parameter
space, and it is critical to determine whether they impose important untested
characteristics on the conditional variance process. A second consideration is
whether such restrictions impose the required positive semi-definiteness of the
conditional covariance matrix estimators. Thirdly, it is important to recognize
whether Granger causality in variance as in Granger et al. (1986) is allowed by
the chosen parametrization; that is, does the past information on one variable
predict the conditional variance of another. A fourth issue is whether the correlations
or regression coefficients are time-varying and, if so, do they have the same
persistence properties as the variances? A fifth issue worth considering is whether
there are linear combinations of the variables, or portfolios, with less persistence
than individual series, or assets. Closely related is the question of whether there exist
simple statistics which are sufficient to forecast the entire covariance matrix. Finally,
it is natural to ask whether there are multivariate asymmetric effects, and if so
how these may influence both the variances and covariances. Below we shall briefly
review some of the parametrizations that have been applied in the literature, and
comment on their appropriateness for answering each of the questions posed above.

6.1. Vector ARCH and diagonal ARCH

Let vech(·) denote the vector-half operator, which stacks the lower triangular
elements of an N × N matrix as an [N(N + 1)/2] × 1 vector. Since the conditional
covariance matrix is symmetric, vech(Ω_t) contains all the unique elements in Ω_t.
Following Kraft and Engle (1982) and Bollerslev et al. (1988), a natural multivariate
extension of the univariate GARCH(p, q) model defined in equation (1.9) is then
given by

vech(Ω_t) = W + Σ_{i=1,q} A_i vech(ε_{t-i}ε_{t-i}′) + Σ_{j=1,p} B_j vech(Ω_{t-j})
         = W + A(L) vech(ε_{t-1}ε_{t-1}′) + B(L) vech(Ω_{t-1}),   (6.3)

where W is an [N(N + 1)/2] × 1 vector, and the A_i and B_j matrices are of dimension
[N(N + 1)/2] × [N(N + 1)/2]. This general formulation is termed the vec representation
by Engle and Kroner (1993). It allows each of the elements in {Ω_t} to depend
on all of the most recent q past cross products of the ε_t's and all of the most recent p
lagged conditional variances and covariances, resulting in a total of [N(N + 1)/2]·
[1 + (p + q)N(N + 1)/2] parameters.17 Even for low dimensions of N and small values
of p and q the number of parameters is very large; e.g. for N = 5 and p = q = 1
the unrestricted version of (6.3) contains 465 parameters. This allows plenty of
flexibility to answer most, but not all, of the questions above. However, this
large number of parameters is clearly unmanageable, and conditions to ensure
that the conditional covariance matrices are positive definite a.s. for all t are
difficult to impose and verify; Engle and Kroner (1993) provide one set of sufficient
conditions discussed below.
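As a quick check on this count, the following minimal sketch reproduces the arithmetic for the unrestricted vec model:

def vec_garch_params(N, p, q):
    # W has N(N+1)/2 elements; each A_i and B_j is an [N(N+1)/2] x [N(N+1)/2] matrix.
    m = N * (N + 1) // 2
    return m * (1 + (p + q) * m)

print(vec_garch_params(5, 1, 1))  # 465, as quoted in the text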
In practice, some simplifying assumptions will therefore have to be imposed. In
the diagonal GARCH(p, q) model, originally suggested by Bollerslev et al. (1988),
the A_i and B_j matrices are all taken to be diagonal. Thus, the (i, j)th element in
{Ω_t} only depends on the corresponding past (i, j)th elements in {ε_tε_t′} and {Ω_t}. This
restriction reduces the number of parameters to [N(N + 1)/2](1 + p + q). These
restrictions are intuitively reasonable, and can be interpreted in terms of a filtering
estimate of each variance and covariance. However, this model clearly does not
allow for causality in variance, co-persistence in variance, as discussed in Section 6.5
below, or asymmetries.
Necessary and sufficient conditions on the parameters to ensure that the
conditional covariance matrices in the diagonal GARCH(p, q) model are positive
definite a.s. are most easily derived by expressing the model in terms of Hadamard
products. In particular, define the symmetric N × N matrices A_i*, B_j* and W* implicitly
by A_i = diag[vech(A_i*)], i = 1, ..., q, B_j = diag[vech(B_j*)], j = 1, ..., p, and W ≡ vech(W*).
The diagonal model may then be written as

Ω_t = W* + Σ_{i=1,q} A_i* ⊙ (ε_{t-i}ε_{t-i}′) + Σ_{j=1,p} B_j* ⊙ Ω_{t-j},   (6.4)

where ⊙ denotes the Hadamard product.18 It follows now by the algebra of

17. Note that even with this number of parameters, asymmetric terms are excluded by the focus on
squared residuals.
18. The Hadamard product of two N × N matrices A and B is defined by {A⊙B}_{ij} = {A}_{ij}{B}_{ij}; see,
e.g. Amemiya (1985).
Hadamard products, that Ω_t is positive definite a.s. for all t provided that W* is
positive definite, and the A_i* and B_j* matrices are positive semi-definite for all
i = 1, ..., q and j = 1, ..., p; see Attanasio (1991) and Marcus and Minc (1964) for
a formal proof. These conditions are easy to impose and verify through a Cholesky
decomposition for the parameter matrices in equation (6.4). Even simpler versions
of this model which let either A_i* or B_j* be rank one matrices, or even simply a
scalar times a matrix of ones, may be useful in some applications.
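The following minimal sketch (assumed parameter values) illustrates how positive definiteness is preserved in the Hadamard form (6.4): each parameter matrix is built from a Cholesky-style factor, and one step of the diagonal GARCH(1,1) recursion is computed:

import numpy as np

def psd_from_factor(C):
    return C @ C.T                                   # positive semi-definite by construction

N = 3
rng = np.random.default_rng(1)
W_star = psd_from_factor(rng.normal(size=(N, N))) + 0.1 * np.eye(N)  # positive definite
A_star = 0.05 * psd_from_factor(rng.normal(size=(N, N)))             # positive semi-definite
B_star = 0.90 * np.ones((N, N))                      # scalar times a matrix of ones: PSD, rank one

def diagonal_garch_step(Omega_prev, eps_prev):
    # Omega_t = W* + A* o (eps eps') + B* o Omega_{t-1}, with o the Hadamard product.
    return W_star + A_star * np.outer(eps_prev, eps_prev) + B_star * Omega_prev

Omega = diagonal_garch_step(W_star, rng.normal(size=N))
print(np.linalg.eigvalsh(Omega))                     # all eigenvalues strictly positive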
In the alternative representation of the multivariate GARCH(p, q) model, termed
by Engle and Kroner (1993) the Baba, Engle, Kraft and Kroner, or BEKK, representation,
the conditional covariance matrix is parametrized as

Ω_t = W + Σ_{k=1,K} Σ_{i=1,q} A_{ik}′(ε_{t-i}ε_{t-i}′)A_{ik} + Σ_{k=1,K} Σ_{j=1,p} B_{jk}′Ω_{t-j}B_{jk},   (6.5)

where W, A_{ik}, i = 1, ..., q, k = 1, ..., K, and B_{jk}, j = 1, ..., p, k = 1, ..., K are all N × N
matrices. This formulation has the advantage over the general specification in
equation (6.3) that Ω_t is guaranteed to be positive definite a.s. for all t. The model
in equation (6.5) still involves a total of [1 + (p + q)K]N² parameters. By taking
vech(Ω_t) we can express any model of the form (6.5) in terms of (6.3). Thus any
vec model in (6.3) whose parameters can be expressed as (6.5) must be positive
definite. However, in empirical applications, the structure of the A_{ik} and B_{jk} matrices
must be further simplified as this model is also overparametrized. A choice made
by McCurdy and Stengos (1992) is to set K = p = q = 1 and make A_1 and B_1
diagonal. This leads to the simple positive definite version of the diagonal vec model

Ω_t = W* + α₁α₁′ ⊙ (ε_{t-1}ε_{t-1}′) + β₁β₁′ ⊙ Ω_{t-1},   (6.6)

where A_1 = diag[α₁] and B_1 = diag[β₁].
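A minimal sketch of the BEKK(1,1) recursion with K = 1 and assumed parameter values follows; the quadratic forms keep Ω_t positive definite without any inequality constraints on the parameters:

import numpy as np

def bekk_step(W, A, B, eps_prev, Omega_prev):
    # Omega_t = W + A'(eps eps')A + B' Omega_{t-1} B.
    return W + A.T @ np.outer(eps_prev, eps_prev) @ A + B.T @ Omega_prev @ B

N = 2
W = np.array([[0.10, 0.02], [0.02, 0.08]])           # positive definite intercept
A = 0.25 * np.eye(N)
B = 0.95 * np.eye(N)
rng = np.random.default_rng(0)
Omega = W.copy()
for _ in range(5):
    eps = rng.multivariate_normal(np.zeros(N), Omega)
    Omega = bekk_step(W, A, B, eps, Omega)
print(np.linalg.eigvalsh(Omega))                     # positive definite at every step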

6.2. Factor ARCH

The factor ARCH model can be thought of as an alternative simple parametrization
of (6.5). Part of the appeal of this parametrization in applications with asset
returns stems from its derivation in terms of a factor type model. Specifically,
suppose that the N × 1 vector of returns y_t has a factor structure with K factors
given by the K × 1 vector ξ_t, and time invariant factor loadings given by the N × K
matrix B:

y_t = Bξ_t + ε_t.   (6.7)

Assume that the idiosyncratic shocks, ε_t, have constant conditional covariances
Ψ, and that the factors, ξ_t, have conditional covariance matrix Λ_t. Also, suppose
that ε_t and ξ_t are uncorrelated, or that they have constant correlations. The
conditional covariance matrix of y_t then equals

V_{t-1}(y_t) = Ω_t = Ψ + BΛ_tB′.   (6.8)

If Λ_t is diagonal with elements λ_{kt}, or if the off-diagonal elements are constant and
combined into Ψ, the model may therefore be written as

Ω_t = Ψ + Σ_{k=1,K} β_kβ_k′λ_{kt},   (6.9)

where β_k denotes the kth column in B. Thus, there are K statistics which determine
the full covariance matrix. Forecasts of the variances and covariances or of any
portfolio of assets, will be based only on the forecasts of these K statistics. This
model was first proposed in Engle (1987), and implemented empirically by Engle
et al. (1990b) and Ng et al. (1992) for treasury bills and stocks, respectively.
Diebold and Nerlove (1989) suggested a closely related latent factor model of the
same form as equation (6.9), in which the factor variances, λ_{kt}, were not functions
of the past information set.
An estimation approach based upon an approximate Kalman filter was used
by Diebold and Nerlove (1989). More recently King et al. (1994) have estimated
a similar latent factor model using theoretical developments in Harvey et al. (1994).
An immediate implication of (6.8) and (6.9) is that, if K < N, there are some
portfolios with constant variance. Indeed a useful way to determine K is to find
how many assets are required to form such portfolios. Engle and Kozicki (1993)
present this as an application of a test for common features. This test is applied
by Engle and Susmel (1993) to determine whether there is evidence that international
equity markets have common volatility components. Only for a limited number
of pairs of the countries analyzed can a one factor model not be rejected.
A second implication of the formulation in (6.8) is that there exist factor-representing
portfolios with portfolio weights that are orthogonal to all but one
set of factor loadings. In particular, consider the portfolio r_{kt} = φ_k′y_t, where φ_k′β_j
equals 1 if j = k and zero otherwise. The conditional variance of r_{kt} is then given by

V_{t-1}(r_{kt}) = ψ_k + λ_{kt},   (6.10)

where ψ_k ≡ φ_k′Ψφ_k. Thus, the portfolios r_{kt} have exactly the same time variation
as the factors, which is why they are called factor-representing portfolios.
In order to estimate this model, the dependence of the λ_{kt}'s upon the past
information set must also be parametrized. The simplest assumption is that there
is a set of factor-representing portfolios with univariate GARCH(1,1) representations.
Thus,

λ_{kt} = ω_k + α_k(φ_k′ε_{t-1})² + δ_kλ_{k,t-1},   (6.11)

and, therefore,

Ω_t = Ψ* + Σ_{k=1,K} α_kβ_kβ_k′(φ_k′ε_{t-1})² + Σ_{k=1,K} δ_kβ_kβ_k′(φ_k′Ω_{t-1}φ_k),   (6.12)

so that the factor ARCH model is a special case of the BEKK parametrization.
Clearly, more general factor ARCH models would allow the factor representing
portfolios to depend upon a broader information set than the simple univariate
assumption underlying (6.11).
Estimation of the factor ARCH model by full maximum likelihood together
with several variations has been considered by Lin (1992). However, it is often
convenient to assume that the factor-representing portfolios are known a priori.
For example, Engle et al. (1990b) assumed the existence of two such portfolios:
one an equally weighted treasury bill portfolio and one the Standard and Poor's
500 composite stock portfolio. A simple two step estimation procedure is then
available, by first estimating the univariate models for each of the factor-representing
portfolios. Taking the variance estimates from this first stage as given, the factor
loadings may then be consistently estimated up to a sign, by noticing that each of
the individual assets has a variance process which is linear in the factor variances,
where the coefficients equal the squares of the factor loadings. While this is surely
an inefficient estimator, it has the advantage that it allows estimation for arbitrarily
large matrices using simple univariate procedures.
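The second step of this procedure admits a very simple implementation. As a minimal sketch (hypothetical inputs; step one is taken as given in the form of fitted factor variance series), the squared loadings of an individual asset follow from a linear regression of its squared residuals on the factor variances:

import numpy as np

def squared_loadings(eps_asset, lambda_hat):
    # Regress eps_t^2 on a constant and the K fitted factor variances (T x K array);
    # the slope coefficients estimate the squared factor loadings.
    T, K = lambda_hat.shape
    X = np.column_stack([np.ones(T), lambda_hat])
    coef, *_ = np.linalg.lstsq(X, eps_asset**2, rcond=None)
    return coef[1:]

# The loadings themselves are recovered as +/- the square roots of the slopes,
# consistent with the identification up to a sign noted above.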

6.3. Constant conditional correlations

In the constant conditional correlations model of Bollerslev (1990), the time-varying
conditional covariances are parametrized to be proportional to the product of
the corresponding conditional standard deviations. This assumption greatly simplifies
the computational burden in estimation, and conditions for Ω_t to be positive
definite a.s. for all t are also easy to impose.
More explicitly, let D_t denote the N × N diagonal matrix with the conditional
variances along the diagonal; i.e. {D_t}_{ii} = {Ω_t}_{ii} and {D_t}_{ij} = 0 for i ≠ j, i, j = 1, ..., N.
Also, let Γ_t denote the matrix of conditional correlations; i.e. {Γ_t}_{ij} = {Ω_t}_{ij}[{Ω_t}_{ii}·
{Ω_t}_{jj}]^{-1/2}, i, j = 1, ..., N. The constant conditional correlation model then assumes
that Γ_t = Γ is time invariant, so that the temporal variation in {Ω_t} is determined
solely by the time-varying conditional variances,

Ω_t = D_t^{1/2} Γ D_t^{1/2}.   (6.13)
If the conditional variances along the diagonal in the D_t matrices are all positive,
and the conditional correlation matrix Γ is positive definite, the sequence of
conditional covariance matrices {Ω_t} is guaranteed to be positive definite a.s. for
all t. Furthermore, the inverse of Ω_t is simply given by Ω_t^{-1} = D_t^{-1/2}Γ^{-1}D_t^{-1/2}.
Thus, when calculating the likelihood function in equation (6.2) or some other
multivariate objective function involving Ω_t^{-1}, t = 1, ..., T, only one matrix
inversion is required for each evaluation. This is especially relevant from a
computational point of view when numerical derivatives are being used. Also, by
a standard multivariate SURE analogy, Γ may be concentrated out of the normal
likelihood function as the sample covariance matrix of the standardized residuals
D_t^{-1/2}ε_t, simplifying estimation even further.
Of course, the validity of the assumption of constant conditional correlations
remains an empirical question. However, this particular formulation has already
been successfully applied by a number of authors, including Baillie and Bollerslev
(1990), Bekaert and Hodrick (1993), Bollerslev (1990), Kroner and Sultan (1991),
Kroner and Claessens (1991) and Schwert and Seguin (1990).

6.4. Bivariate EGARCH

A bivariate version of the EGARCH model in equation (1.11) has been introduced
by Braun et al. (1992) in order to model any leverage effects, as discussed in
Section 1.2.3, in conditional betas. Specifically, let ε_{m,t} and ε_{p,t} denote the residuals
for a market index and a second portfolio or asset. The model is then given by

ε_{m,t} = σ_{m,t}z_{m,t}   (6.14)

and

ε_{p,t} = β_{p,t}ε_{m,t} + σ_{p,t}z_{p,t},   (6.15)

where {z_{m,t}, z_{p,t}} is assumed to be i.i.d. with mean (0, 0) and identity covariance
matrix. The conditional variance of the market index, σ²_{m,t}, is modeled by a
univariate EGARCH model,

ln(σ²_{m,t}) = ω_m + δ_m[ln(σ²_{m,t-1}) - ω_m] + θ_m z_{m,t-1} + γ_m(|z_{m,t-1}| - E|z_{m,t-1}|).   (6.16)

The conditional beta of ε_{p,t} with respect to ε_{m,t}, β_{p,t}, is modeled as

β_{p,t} = λ_0 + λ_1(β_{p,t-1} - λ_0) + ηz_{m,t-1}z_{p,t-1} + λ_2 z_{m,t-1} + λ_3 z_{p,t-1}.   (6.17)

The coefficients λ_2 and λ_3 allow for leverage effects in β_{p,t}. The non-market, or
idiosyncratic, variance of the second portfolio, σ²_{p,t}, is parametrized as a modified
univariate EGARCH model, to allow for both market and idiosyncratic news
effects.

Braun et al. (1992) find that this model provides a good description of the returns
for a number of industry and size-sorted portfolios.

19. A formal moment based test for the assumption of constant conditional correlations has been
developed by Bera and Roh (1991).

6.5. Stationarity and co-persistence

Stationarity and moment convergence criteria for various univariate specifications
were discussed in Section 3 above. Corresponding convergence criteria for multivariate
ARCH models are generally complex, and explicit results are only available
for a few special cases.
Specifically, consider the multivariate vec GARCH(1,1) model defined in equation
(6.3). Analogous to the expression for the univariate GARCH(1,1) model in equation
(3.10), the minimum mean square error forecast for vech(Ω_t) as of time s < t takes
the form

E_s(vech(Ω_t)) = [I + (A_1 + B_1) + ⋯ + (A_1 + B_1)^{t-s-1}]W + (A_1 + B_1)^{t-s}vech(Ω_s),   (6.19)

where (A_1 + B_1)⁰ is equal to the identity matrix by definition. Let VΛV^{-1} denote
the Jordan decomposition of the matrix A_1 + B_1, so that (A_1 + B_1)^{t-s} = VΛ^{t-s}V^{-1}.20
Thus, E_s(vech(Ω_t)) converges to the unconditional covariance matrix of the process,
(I - A_1 - B_1)^{-1}W, for t → ∞ a.s. if and only if the norm of the largest eigenvalue
of A_1 + B_1 is strictly less than one. Similarly, by expressing the vector GARCH(p, q)
model in companion first order form, it follows that the forecast moments converge,
and that the process is covariance stationary, if and only if the norm of the largest
root of the characteristic equation |I - A(x^{-1}) - B(x^{-1})| = 0 is strictly less than
one. A formal proof is given in Bollerslev and Engle (1993). This corresponds
directly to the condition for the univariate GARCH(p, q) model in equation (1.9),
where the persistence of a shock to the optimal forecast of the future conditional
variances is determined by the largest root of the characteristic polynomial
α(x^{-1}) + β(x^{-1}) = 1. The conditions for strict stationarity and ergodicity for the
multivariate GARCH(p, q) model have not yet been established.
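The eigenvalue condition is straightforward to verify numerically. A minimal sketch (assumed parameter values) for the vec GARCH(1,1) case:

import numpy as np

def vec_garch_stationary(W, A1, B1):
    # Covariance stationarity requires the largest eigenvalue norm of A1 + B1 below one;
    # the unconditional moments then solve vech = (I - A1 - B1)^{-1} W.
    rho = max(abs(np.linalg.eigvals(A1 + B1)))
    if rho >= 1.0:
        return rho, None
    vech_uncond = np.linalg.solve(np.eye(len(W)) - A1 - B1, W)
    return rho, vech_uncond

# Bivariate example: vech(Omega_t) has N(N+1)/2 = 3 elements.
print(vec_garch_stationary(np.array([0.05, 0.01, 0.04]), 0.10 * np.eye(3), 0.85 * np.eye(3)))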

20. If the eigenvalues for A_1 + B_1 are all distinct, Λ equals the diagonal matrix of eigenvalues, and V
the corresponding matrix of right eigenvectors. If some of the eigenvalues coincide, Λ takes the more
general Jordan canonical form; see Anderson (1971) for further discussion.
Results for other multivariate formulations are scarce, although in some instances
the appropriate conditions may be established by reference to the univariate results
in Section 3. For instance, for the constant conditional correlations model in
equation (6.13), the persistence of a shock to E,(Q), and conditions for the model
to be covariance stationary are simply determined by the properties of each of the
N univariate conditional variance processes; i.e., E,( {Q}ii) i = 1,. . , N. Similarly,
for the factor ARCH model in equation (6.9), stationarity of the model depends
directly on the properties of the univariate conditional variance processes for the
factor-representing porifolios; i.e. {Akf}k = 1,. . . , K.
The empirical estimates for univariate and multivariate ARCH models often
indicate a high degree of persistence in the forecast moments for the conditional
variances; i.e. E_s(σ_t²) or E_s({Ω_t}_{ii}), i = 1, ..., N, for t → ∞. At the same time, the
commonality in volatility movements suggests that this persistence may be common
across different series. More formally, Bollerslev and Engle (1993) define the
multivariate ARCH process to be co-persistent in variance if at least one element
in E_s(Ω_t) is non-convergent a.s. for increasing forecast horizons t - s, yet there
exists a non-trivial linear combination of the process, γ′ε_t, such that for every forecast
origin s, the forecasts of the corresponding future conditional variances, E_s(γ′Ω_tγ),
converge to a finite limit independent of time s information. Exact conditions for
this to occur within the context of the multivariate GARCH(p, q) model in equation
(6.3) are presented in Bollerslev and Engle (1993). These results parallel the
conditions for co-integration in the mean as developed by Engle and Granger
(1987). Of course, as discussed in Section 3 above, for non-linear models different
notions of convergence may give rise to different classifications in terms of the
persistence of shocks. The focus on forecast second moments corresponds directly
to the mean-variance trade-off relationship often stipulated by economic theory.
To further illustrate this notion of co-persistence, consider the K-factor
GARCH(p,q) model defined in equation (6.12). If some of the factor-representing
portfolios have persistent variance processes, then individual assets with non-zero
factor loadings on such factors will have persistence in variance, also. However,
there may be portfolios which have zero factor loadings on these factors. Such
portfolios will not have persistence in variance, and hence the assets are co-
persistent. This will generally be true if there are more assets than there are
persistent factors. From a portfolio selection point of view such portfolios might
be desirable as having only transitory fluctuations in variance. Engle and Lee
(1993) explicitly test for such an effect between large individual stocks and a market
index, but fail to find any evidence of co-persistence.

7. Model selection

Even in linear statistical models, the problem of selecting an appropriate model is
non-trivial, to say the least. The usual model selection difficulties are further complicated
in ARCH models by the uncountable infinity of functional forms allowed by
equation (1.2) and the choice of an appropriate loss function.
Standard model selection criteria such as the Akaike (1973) and the Schwartz
(1978) criterion have been widely used in the ARCH literature, though their statistical
properties in the ARCH context are unknown. This is particularly true when the
validity of the distributional assumptions underlying the likelihood is in doubt.
Most model selection problems focus on estimation of means and evaluate loss
functions for alternative models using either in-sample criteria, possibly corrected
for fitting by some form of cross-validation, or out-of-sample evaluation. The loss
function of choice is typically mean squared error.
When the same strategy is applied to variance estimation, the choice of mean
squared error is much less clear. A loss function such as

L₁ = Σ_{t=1,T} (ε_t² - σ_t²)²   (7.1)

will penalize conditional variance estimates which are different from the realized
squared residuals in a fully symmetrical fashion. However, this loss function does
not penalize the method for negative or zero variance estimates which are clearly
counterfactual. By this criterion, least squares regressions of squared residuals on
past information will have the smallest in-sample loss.
More natural alternatives may be the percentage squared errors,

L₂ = Σ_{t=1,T} (ε_t² - σ_t²)²σ_t^{-4},   (7.2)

the percentage absolute errors, or the loss function implicit in the Gaussian likelihood,

L₃ = Σ_{t=1,T} [ln(σ_t²) + ε_t²σ_t^{-2}].   (7.3)

A simple alternative which exaggerates the interest in predicting when residuals are
close to zero is21

L₄ = Σ_{t=1,T} [ln(ε_t²σ_t^{-2})]².   (7.4)
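A minimal sketch evaluating the four loss functions (7.1)-(7.4) for a given series of squared residuals and fitted conditional variances:

import numpy as np

def volatility_losses(eps2, sig2):
    return {
        "L1": np.sum((eps2 - sig2) ** 2),             # (7.1) squared errors
        "L2": np.sum((eps2 - sig2) ** 2 / sig2 ** 2), # (7.2) percentage squared errors
        "L3": np.sum(np.log(sig2) + eps2 / sig2),     # (7.3) Gaussian likelihood loss
        "L4": np.sum(np.log(eps2 / sig2) ** 2),       # (7.4) logarithmic loss
    }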

The most natural loss function, however, may be one based upon the goals of the
particular application. West et al. (1993) developed such a criterion from the portfolio
decisions of a risk averse investor. In an expected utility comparison based on the

21. Pagan and Schwert (1990) used the loss functions L₁ and L₄ to compare alternative parametric
and nonparametric estimators with in-sample and out-of-sample data sets. As discussed in Section
1.5, the L₁ in-sample comparisons favored the nonparametric models, whereas the out-of-sample tests
and the loss function L₄ in both cases favored the parametric models.
forecast of the return volatility, ARCH models turn out to fare very well. In a related
context, Engle et al. (1993) assumed that the objective was to price options, and
developed a loss function from the profitability of a particular trading strategy. They
again found that the ARCH variance forecasts were the most profitable.

8. Alternative measures for volatility

Several alternative procedures for measuring the temporal variation in second order
moments of time series data have been employed in the literature prior to the
development of the ARCH methodology. This is especially true in the analysis of
high frequency financial data, where volatility clustering has a long history as a
salient empirical regularity.
One commonly employed technique for characterizing the variation in conditional
second order moments of asset returns entails the formation of low frequency
sample variance estimates based on a time series of high frequency observations.
For instance, monthly sample variances are often calculated as the sum of the
squared daily returns within the month; examples include Merton (1980) and
Poterba and Summers (1986). Of course, if the conditional variances of the daily
returns differ within the month, the resulting monthly variance estimates will
generally be inefficient; see French et al. (1987) and Chou (1988). However, even if
the daily returns are uncorrelated and the variance does not change over the course
of the month, this procedure tends to produce both inefficient and biased monthly
estimates; see Foster and Nelson (1992).
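A minimal sketch of this estimator, including the optional serial correlation adjustment mentioned in footnote 22:

import numpy as np

def monthly_variance(daily_returns, adjust_serial_corr=False):
    # Monthly variance as the sum of squared daily returns within the month.
    r = np.asarray(daily_returns)
    v = np.sum(r ** 2)
    if adjust_serial_corr:
        v += 2.0 * np.sum(r[1:] * r[:-1])            # add twice the first order autocovariance
    return v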
A related estimator for the variability may be calculated from the inter-period
highs and lows. Data on high and low prices within a day is readily available for
many financial assets. Intuitively, the higher the variance, the higher the inter-period
range. Of course, the exact relationship between the high-low distribution and the
variance is necessarily dependent on the underlying distribution of the price process.
Using the theory of range statistics, Parkinson (1980) showed that a high-low
estimator for the variance of a continuous time random walk is more efficient than
the conventional sample variance based on the same number of end-of-interval
observations. Of course, the random walk model assumes that the variance remains
constant within the sample period. Formal extensions of this idea to models with
stochastic volatility are difficult; see also Wiggins (1991), who discusses many of the
practical problems, such as sensitivity to data recording errors, involved in applying
high-low estimators.
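A minimal sketch of the Parkinson (1980) range estimator for the variance of a continuous time random walk:

import numpy as np

def parkinson_variance(high, low):
    # Per-period variance estimate: E[(ln(H/L))^2] / (4 ln 2) for a driftless random walk.
    log_range = np.log(np.asarray(high) / np.asarray(low))
    return np.mean(log_range ** 2) / (4.0 * np.log(2.0))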
22. Since many high frequency asset prices exhibit low but significant first order serial correlation,
two times the first order autocovariance is often added to the daily variance in order to adjust for this
serial dependence.

Actively traded options currently exist for a wide variety of financial instruments.
A call option gives the holder the right to buy an underlying security at a pre-
specified price within a given time period. A put option gives the right to sell a
security at a pre-specified price. Assuming that the price of the underlying security
follows a continuous time random walk, Black and Scholes (1973) derived an
arbitrage based pricing formula for the price of a call option. Since the only
unknown quantity in this formula is the constant instantaneous variance of the
underlying asset price over the life of the option, the option pricing formula may be
inverted to infer the conditional variance, or volatility, implicit in the actual market
price of the option. This technique is widely used in practice. However, if the
conditional variance of the asset is changing through time, the exact arbitrage
argument underlying the Black-Scholes formula breaks down. This is consistent
with the evidence in Day and Lewis (1992) for stock index options, which indicates
that a simple GARCH(1,1) model estimated for the conditional variance of the
underlying index return provides statistically significant information in addition to
the implied volatility estimates from the Black-Scholes formula. Along these lines
Engle and Mustafa (1992) find that during normal market conditions the coefficients
in the implied GARCH(l, 1) model which minimize the pricing error for a risk
neutral stock option closely resemble the coefficients obtained using more conventional
maximum likelihood estimation methods.23 As mentioned in Section 4 above,
much recent research has been directed towards the development of theoretical
option pricing formulas in the presence of stochastic volatility; see, for instance,
Amin and Ng (1993), Heston (1991), Hull and White (1987), Melino and Turnbull
(1990), Scott (1987) and Wiggins (1987). While closed form solutions are only
available for a few special cases, it is generally true that the higher the variance of
the underlying security, the more valuable the option. Much further research is
needed to better understand the practical relevance and quality of the implied
volatility estimates from these new theoretical models, however.
Finance theory suggests a close relationship between the volume of trading and
the volatility; see Karpoff (1987) for a survey of some of the earlier contributions to
this literature. In particular, according to the mixtures of distributions hypothesis,
associated with Clark (1973) and Tauchen and Pitts (1983) the evolution of returns
and trading volume are both determined by the same latent mixing variable that
reflects the amount of new information that arrives at the market. If the news arrival
process is serially dependent, volatility and trading volume will be jointly serially
correlated. Time series data on trading volume should therefore be useful in inferring
the behavior of the second order moments of returns. This idea has been pursued
by a number of empirical studies, including Andersen (1992b), Gallant et al. (1992)
and Lamoureux and Lastrapes (1990). While the hypothesis that contemporaneous
trading volume is positively correlated with financial market volatility is supported

23. More specifically, Engle and Mustafa (1992) estimate the parameters for the implied GARCH(1,1)
model by minimizing the risk neutral option pricing error defined by the discounted value of the
maximum of zero and the simulated future price of the underlying asset from the GARCH(1,1) model
minus the exercise price of the option.
in the data, the result that a single latent variable jointly determines both has been
formally rejected by Lamoureux and Lastrapes (1994).
In a related context, modern market micro structure theories also suggest a close
relationship between the behavior of price volatility and the distribution of the
bid-ask spread through time. Only limited evidence is currently available on the
usefulness of such a relationship for the construction of variance estimates for the
returns; see, e.g. Bollerslev and Domowitz (1993), Bollerslev and Melvin (1994) and
Brock and Kleidon (1992).
The use of the cross sectional variance from survey data to estimate the variance
of the underlying time series has been advocated by a number of researchers. Zarnowitz
and Lambros (1987) discuss a number of these studies with macroeconomic variables.
Of course, the validity of the dispersion across forecasts as a proxy for the variance
will depend on the theoretical connection between the degree of heterogeneity and
uncertainty; see Pagan et al. (1983). Along these lines it is worth noting that Rich
et al. (1992) only find a weak correlation between the dispersion across the forecasts
for inflation and an ARCH based estimate for the conditional variance of inflation.
The availability of survey data is also likely to limit the practical relevance of this
approach in many applications.
In a related context, a number of authors have argued for the use of relative prices
or returns across different goods or assets as a way of quantifying inflationary
uncertainty or overall market volatility. Obviously, the validity of such cross
sectional based measures again hinges on very stringent conditions about the
structure of the market; see Pagan et al. (1983).
While all of the variance estimates discussed above may give some idea about
the temporal dependencies in second order moments, any subsequent model estimates
should be carefully interpreted. Analogously to the problems that arise in the use
of generated regressors in the mean, as discussed by Pagan (1984, 1986) and Murphy
and Topel (1985), the conventional standard errors for the coefficient estimates in
a second stage model that involves a proxy for the variance will have to be adjusted
to reflect the approximation error uncertainty. Also, if the conditional mean depends
non-trivially on the conditional variance, as in the ARCH-M model discussed
in Section 1.4, any two step procedure will generally result in inconsistent parameter
estimates; for further analysis along these lines we refer to Pagan and Ullah (1988).

9. Empirical examples

9.1. U.S. Dollar/Deutschmark exchange rates

As noted in Section 1.2, ARCH models have found particularly wide use in the
modeling of high frequency speculative prices. In this section we illustrate the
empirical quasi-maximum likelihood estimation of a simple GARCH(l, 1) model
for a time series of daily exchange rates. Our discussion will be brief. A more detailed
and thorough discussion of the empirical specification, estimation and diagnostic
testing of ARCH models is given in the next section, which analyzes the time series
characteristics of more than one hundred years of daily U.S. stock returns.
The present data set consists of daily observations on the U.S. Dollar/Deutschmark
exchange rate over the January 2, 1981 through July 9, 1992 period, for a total
of 3006 observations.24 A broad consensus has emerged that nominal exchange
rates over the free float period are best described as non-stationary, or I(1), type
processes; see, e.g. Baillie and Bollerslev (1989). We shall therefore concentrate on
modeling the nominal percentage returns; i.e. y_t = 100[ln(s_t) - ln(s_{t-1})], where s_t
denotes the spot Deutschmark/U.S. Dollar exchange rate at day t. This is the time
series plotted in Figure 2 in Section 1.2 above. As noted in that section, the daily
returns are clearly not homoskedastic, but are characterized by periods of tranquility
followed by periods of more turbulent exchange rate movements. At the same time,
there appears to be little or no own serial dependence in the levels of the returns.
These visual observations are also borne out by more formal tests for serial correla-
tion. For instance, the Ljung and Box (1978) portmanteau test for up to twentieth
order serial correlation in y_t equals 19.1, whereas the same test statistic for twentieth
order serial correlation in the squared returns, y_t², equals 151.9. Under the null of
i.i.d. returns, both test statistics should asymptotically be the realization of a chi-
square distribution with twenty degrees of freedom. Note that in the presence of
ARCH, the portmanteau test for serial correlation in y_t will tend to over-reject.
As discussed above, numerous parametric and nonparametric formulations have
been proposed for modeling the volatility clustering phenomenon. For the sake of
brevity, we shall here concentrate on the results for the particularly simple
MA(1)-GARCH(1,1) model,

y_t = μ₀ + θ₁ε_{t-1} + ε_t,
σ_t² = ω₀ + ω₁W_t - ω₁(α₁ + β₁)W_{t-1} + α₁ε²_{t-1} + β₁σ²_{t-1},   (9.1)

where W_t denotes a weekend dummy equal to one following a closure of the market.
The MA(1) term is included to take account of the weak serial dependence in the
mean. Following Baillie and Bollerslev (1989), the weekend dummy is entered in
the conditional variance to allow for an impulse effect.
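To fix ideas, the following minimal sketch (simplified relative to (9.1): no MA term or weekend dummy, and hypothetical data y) sets up the normal quasi log likelihood of a GARCH(1,1) model for numerical maximization:

import numpy as np
from scipy.optimize import minimize

def neg_quasi_loglik(params, y):
    omega, alpha, beta = params
    sig2 = np.empty(len(y))
    sig2[0] = y.var()                                # initialize at the sample variance
    for t in range(1, len(y)):
        sig2[t] = omega + alpha * y[t-1]**2 + beta * sig2[t-1]
    return 0.5 * np.sum(np.log(2 * np.pi) + np.log(sig2) + y**2 / sig2)

# y = ...  (demeaned returns series)
# res = minimize(neg_quasi_loglik, x0=[0.05, 0.10, 0.80], args=(y,),
#                bounds=[(1e-8, None), (0.0, 1.0), (0.0, 1.0)], method="L-BFGS-B")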
The quasi-maximum likelihood estimates (QMLE) for this model, obtained
by the numerical maximization of the normal likelihood function defined by
equations (2.7), (2.8) and (2.12), are contained in Table 1. The first column in the
table shows that the α₁ and β₁ coefficients are both highly significant at the
conventional five percent level. The sum of the estimated GARCH parameters also
indicates a fairly strong degree of persistence in the conditional variance process.25

24. The rates were calculated from the ECU cross rates obtained through Datastream.
25. Reparametrizing the conditional variance in terms of (α₁ + β₁) and α₁, the t-test statistic for the
null hypothesis that α₁ + β₁ = 1 equals 3.784, thus formally rejecting the IGARCH(1,1) model at
standard significance levels.
Table 1
Quasi-maximum likelihood estimates.a

              Jan. 2, 1981-    Jan. 2, 1981-    Oct. 7, 1986-
Coefficient   July 9, 1992     Oct. 6, 1986     July 9, 1992

μ₀             0.014           -0.017
              (0.018)          (0.016)

θ₁            -0.056           -0.058           -0.055
              (0.014)          (0.030)          (0.027)
                               [0.027]          [0.027]
                               {0.027}          {0.027}

ω₀             0.028            0.024            0.035
              (0.005)          (0.009)          (0.011)
              [0.011]
                                                {0.010}

ω₁             0.243            0.197            0.281
              (0.045)          (0.087)          (0.087)
              [0.031]          [0.062]
              {0.022}          {0.046}

α₁             0.068            0.076            0.063
              (0.009)          (0.022)          (0.017)

β₁             0.880            0.885            0.861
              (0.015)          (0.028)          (0.033)
              [0.012]          [0.031]
              {0.010}          {0.030}

a. Robust standard errors based on equation (2.21) are reported in parentheses, (.). Standard errors
calculated from the Hessian in equation (2.18) are reported in [.]. Standard errors based on the outer
product of the sample gradients in (2.19) are given in {.}.

Consistent with the stylized facts discussed in Section 1.2.4, the conditional variance
is also significantly higher following non-trading periods.
The second and third columns of Table 1 report the results with the same model
estimated for the first and second half of the sample respectively; i.e. January 2, 1981
through October 6, 1986 and October 7, 1986 through July 9, 1992. The parameter
estimates are remarkably similar across the two sub-periods.26
In summary, the simple model in equation (9.1) does a remarkably good job of
capturing the own temporal dependencies in the volatility of the exchange rate
series. For instance, the highly significant portmanteau test for serial correlation in

26. Even though the assumption of conditional normality is violated empirically, it is interesting to
note that the sum of the maximized normal quasi log likelihoods for the two sub-samples equals
-1727.750 - 1597.166 = -3324.916, compared to -3328.984 for the model estimated over the full
sample.
the squares of the raw series, y_t², drops to only 21.687 for the squared standardized
residuals, ε̂_t²σ̂_t^{-2}. We defer our discussion of other residual based diagnostics to the
empirical example in the next section.

While the GARCH(1,1) model is able to track the own temporal dependencies,
the assumption of conditionally normally distributed innovations is clearly violated
by the data. The sample skewness and kurtosis for ε̂_tσ̂_t^{-1} equal -0.071 and 4.892,
respectively. Under the null of i.i.d. normally distributed standardized residuals, the
sample skewness should be the realization of a normal distribution with a mean of
0 and a variance of 6/T^{1/2} = 0.109, while the sample kurtosis is asymptotically
normally distributed with a mean of 3 and a variance of 24/T^{1/2} = 0.438.
The standard errors for the quasi-maximum likelihood estimates reported in (.)
in Table 1 are based on the asymptotic covariance matrix estimator discussed in
Section 2.3. These estimates are robust to the presence of conditional excess kurtosis.
The standard errors reported in [.] and {.} are calculated from the Hessian and
the outer product of the gradients as in equations (2.18) and (2.19), respectively. For
some of the conditional variance parameters, the non-robust standard errors are
less than one half of their robust counterparts. This compares to the findings reported
in Bollerslev and Wooldridge (1992), and highlights the importance of appropriately
accounting for any conditional non-normality when conducting inference in ARCH
type models based on a normal quasi-likelihood function.

9.2. U.S. stock prices

We next turn to modeling heteroskedasticity in U.S. stock index returns data.
Drawing on the optimal filtering results of Nelson and Foster (1991, 1994), summarized
in Section 4, as guidance in model selection, new very rich parametrizations
are introduced.
From 1885 on, the Dow Jones corporation has published various stock indices
daily. In 1928, the Standard Statistics company began publishing daily a wider index
of 90 utility, industrial and railroad stocks. In 1953, the Standard 90 index was
replaced by an even broader index, the Standard and Poor's 500 composite. The
properties of these indices are considered in some detail in Schwert (1990).27 The
Dow data has one substantial chronological break, from July 30, 1914, through
December 11, 1914, when the financial markets were closed following the outbreak
of the First World War. The first data set we analyze is the Dow data from its
inception on February 16, 1885 until the market closure in 1914. The second data
set is the Dow data from the December 1914 market reopening until January 3,
1928. The third data set is the Standard 90 capital gains series beginning in January
4, 1928 and extending to the end of May 1952. The Standard 90 index data is

27. G. William Schwert kindly provided the data. Schwert's indices differ from ours after 1962, when
he uses the CRSP value weighted market index. We continue to use the S&P 500 through 1990.
available through the end of 1956, but we end at the earlier date because that is
when the New York Stock Exchange ended its Saturday trading session, which
presumably shifted volatility to other days of the week. The final data set is the S&P
500 index beginning in January 1953 and continuing through the end of 1990.

9.2.1. Model specification

Our basic capital gains series, r_t, is derived from the price index data, P_t, as

r_t ≡ 100 ln[P_t/P_{t-1}].   (9.2)

Thus, r_t corresponds to the continuously compounded capital gain on the index
measured in percent. Any ARCH formulation for r_t may be compactly written as

r_t = μ_t + ε_t   (9.3)

and

ε_t = z_tσ_t,   z_t ~ i.i.d., E[z_t] = 0, E[z_t²] = 1,   (9.4)

where μ_t and σ_t denote the conditional mean and the conditional standard
deviation, respectively.
In the estimation reported below we parametrized the functional form for the
conditional mean by equation (9.5), which is very close to the specification in LeBaron
(1992). The ρ₁ coefficient allows for first order autocorrelation. The u² term denotes
the sample mean of r_t², which is essentially equal to the unconditional sample
variance of r_t. As noted by LeBaron (1992), serial correlation seems to be a decreasing
function of the conditional variance, which may be captured by equation (9.5)
through ρ₂ > 0. The parameter μ₃ is an ARCH-M term.
We assume that the conditional distribution of ε_t given σ_t is generalized t; see,
e.g. McDonald and Newey (1988). The density for the generalized t-distribution
takes the form

f(ε_tσ_t^{-1}; η, ψ) = η / {2σ_tbψ^{1/η}B(1/η, ψ)[1 + |ε_t|^η/(ψb^ησ_t^η)]^{ψ+1/η}},   (9.6)

where B(1/η, ψ) ≡ Γ(1/η)Γ(ψ)/Γ(1/η + ψ) denotes the beta function, b ≡ [Γ(ψ)Γ(1/η)/
Γ(3/η)Γ(ψ - 2/η)]^{1/2}, and ψη > 2, η > 0 and ψ > 0. The scale factor b makes
Var(ε_tσ_t^{-1}) = 1.
One advantage of this specification is that it nests both the Student's t and the
GED distributions discussed in Section 2.2 above. In particular, the Student's t-distribution
sets η = 2 and ψ equal to one half times the degrees of freedom. The GED is obtained
for ψ = ∞. Nelson (1989, 1991) fit EGARCH models to U.S. stock index returns
assuming a GED conditional distribution, and found that there were many more
large standardized residuals z_t = ε_tσ_t^{-1} than would be expected if the returns were
actually conditionally GED with the estimated η. The GED has only one shape
parameter η, which is apparently insufficient to fit both the central part and the tails
of the conditional distribution. The generalized t-distribution has two shape parameters,
and may therefore be more successful in parametrically fitting the conditional
distribution.
The conditional variance function, σ_t², is parametrized using a variant of the
EGARCH formulation in equation (1.11),

ln(σ_t²) = ω_t + [(1 + α₁L + ⋯ + α_qL^q)/(1 - β₁L - ⋯ - β_pL^p)] g(z_{t-1}, σ²_{t-1}),   (9.7)

where the deterministic component is given by

ω_t = ω₀ + ln[1 + ω₁W_t + ω₂S_t + ω₃H_t].   (9.8)

As noted in Section 1.2, trading and non-trading periods contribute differently to
volatility. To also allow for differences between weekend and holiday non-trading
periods, W_t gives the number of weekend non-trading days between trading days t
and t - 1, while H_t denotes the number of holidays. Prior to May 1952, the NYSE
was open for a short trading session on Saturday. Since Saturday may have been a
slow news day and the Saturday trading session was short, we would expect low
average volatility on Saturdays. The S_t dummy variable equals one if trading
day t is a Saturday and zero otherwise.
Our specification of the news impact function, g(·,·), is a generalization of
EGARCH inspired by the optimal filtering results of Nelson and Foster (1994). In
the EGARCH model in equation (1.11), ln(σ²_{t+1}) is homoskedastic conditional on
σ_t², and the partial correlation between z_t and ln(σ²_{t+1}) is constant conditional on
σ_t². These assumptions may well be too restrictive, and the optimal filtering results
indicate the importance of correctly specifying these moments. Our specification of
g(z_t, σ_t²) therefore allows both moments to vary with the level of σ_t².
Several recent papers, including Engle and Ng (1993), have suggested that GARCH,
EGARCH and similar formulations may make σ_t² or ln(σ_t²) too sensitive to outliers.
The optimal filtering results discussed in Section 4 lead to the same conclusion
when ε_t is drawn from a conditionally heavy tailed distribution. The final form that
we assume for g(·,·) was also motivated by this observation:

g(z_t, σ_t²) = σ_t^{-2θ₀} θ₁z_t/(1 + θ₂|z_t|^p) + σ_t^{-2γ₀} γ₁[(|z_t| - E|z_t|)/(1 + γ₂|z_t|^p)].   (9.9)
The γ₀ and θ₀ parameters allow both the conditional variance of ln(σ²_{t+1}) and its
conditional correlation with z_t to vary with the level of σ_t². If θ₁ < 0, ln(σ²_{t+1}) and z_t
are negatively correlated: the leverage effect. The EGARCH model constrains
θ₀ = γ₀ = 0, so that the conditional correlation is constant, as is the conditional
variance of ln(σ_t²). The p, γ₂, and θ₂ parameters give the model flexibility in how
much weight to assign to the tail observations. For example, if γ₂ and θ₂ are both
positive, the model downweights large |z_t|'s. The second term on the right hand side
of equation (9.9) was motivated by the optimal filtering results in Nelson and Foster
(1994), designed to make the ARCH model serve as a robust filter.
The orders of the ARMA model for ln(σ_t²), p and q, remain to be determined.
Table 2 gives the maximized values of the log likelihoods from (2.7), (2.8) and (9.6)
for ARMA models of order up to ARMA(3,5). For three of the four data sets, the
information criterion of Schwartz (1978) selects an ARMA(2,1) model, the exception
being the Dow data for 1914-1928, for which an AR(1) is selected. For linear time
series models, the Schwartz criterion has been shown to consistently estimate the
order of an ARMA model. As noted in Section 7, it is not known whether this result
carries over to the ARCH class of models. However, guided by the results in Table 2,

Table 2
Log likelihood values for fitted models.

              Dow            Dow           Standard 90    S&P 500
Fitted model  1885-1914      1914-1928     1928-1952      1953-1990

White Noise   -10036.188     -4397.693     -11110.120     -10717.199
MA(1)         -9926.781      -4272.639     -10973.417     -10658.775
MA(2)         -9848.319      -4241.686     -10834.937     -10596.849
MA(3)         -9779.491      -4233.371     -10765.259     -10529.688
MA(4)         -9750.417      -4214.821     -10740.999     -10463.534
MA(5)         -9718.642      -4198.672     -10634.429     -10433.631
AR(1)         -9554.352      -4164.093^SC  -10275.294     -10091.450
ARMA(1,1)     -9553.891      -4164.081     -10269.771     -10076.775
ARMA(1,2)     -9553.590      -4160.671     -10265.464     -10071.040
ARMA(1,3)     -9552.148      -4159.413     -10253.027     -10070.587
ARMA(1,4)     -9543.855      -4158.836     -10250.446     -10064.695
ARMA(1,5)     -9540.485      -4158.179     -10242.833     -10060.336
AR(2)         -9553.939      -4164.086     -10271.732     -10083.442
ARMA(2,1)     -9529.904^SC   -4159.011^AC  -10237.527^SC  -10052.322^SC
ARMA(2,2)     -9529.642      -4158.428     -10235.724     -10049.237
ARMA(2,3)     -9526.865      -4157.731     -10234.556     -10049.129
ARMA(2,4)     -9525.683      -4157.569     -10234.429     -10047.962
ARMA(2,5)     -9525.560      -4155.071     -10230.418     -10046.343
AR(3)         -9553.787      -4159.227     -10270.685     -10075.441
ARMA(3,1)     -9529.410      -4158.608     -10237.462     -10049.833
ARMA(3,2)     -9526.089      -4158.230     -10228.701^AC  -10049.044
ARMA(3,3)     -9524.644^AC   -4157.730     -10228.263     -10042.710
ARMA(3,4)     -9524.497      -4156.823     -10227.982     -10042.284
ARMA(3,5)     -9523.375      -4154.906     -10227.958     -10040.547^AC

The AC and SC indicators denote the models selected by the information criteria of Akaike (1973)
and Schwartz (1978), respectively.
Table 3
Maximum likelihood estimates.a

              Dow           Dow           Standard 90    S&P 500
              1885-1914     1914-1928     1928-1952      1953-1990
Coefficient   ARMA(2,1)     AR(1)         ARMA(2,1)      ARMA(2,1)

- 0.6682 - 0.6228 - 1.2704 - 0.7899


(0.1251) (0.0703) (2.5894) (0.2628)
0.2013 0.3059 0.1011 0.1286
(0.0520) (0.0904) (0.0518) (0.0295)
-0.4416 -0.5557 -0.6534 *
(0.0270) (0.0328) (0.0211)
0.5099 0.3106 0.6609 0.1988
(0.1554) (0.1776) (0.1702) (0.1160)
3.6032 2.5316 4.0436 3.5437
(0.8019) (0.5840) (0.9362) (0.7557)
2.2198 2.4314 1.7809 2.1844
(0.1338) (0.2041) (0.1143) (0.1215)
0.0280 0.0642 0.0725 0.0259
(0.0112) (0.0222) (0.1139) (0.0113)
- 0.0885 - 0.0920 -0.0914 0.0717
(0.0270) (0.0418) (0.0243) (0.0260)
0.2206 0.3710 0.2990 0.2163
(0.0571) (0.0828) (0.0387) (0.0532)
0.0006 0.0316 0.0285 0.0050
(0.0209) (0.0442) (0.0102) (0.0213)
-0.1058 0.0232 - 0.0508 0.1117
(0.0905) (0.1824) (0.0687) (0.0908)
0.1122 0.0448 0.1356 0.0658
(0.0256) (0.0478) (0.0327) (0.0157)
0.0245 0.0356 0.0168 0.0312
(0.0178) (0.0316) (0.0236) (0.0080)
2.1663 3.2408 1.6881 2.2477
(0.3119) (1.5642) (0.3755) (0.3312)
- 0.6097 -0.5675 -0.1959 -0.1970
(0.0758) (0.1232) (0.0948) (0.1820)
-0.1509 -0.3925 -0.1177 -0.1857
(0.0258) (0.1403) (0.0271) (0.0287)
0.0361 0.3735 -0.0055 0.2286
(0.0828) (0.3787) (0.0844) (0.1241)
0.9942 0.9093 0.9994 0.9979
(0.0033) (0.0172) (0.0009) (0.0011)
0.8759 0.8303 0.8945
(0.0225) (0.0282) (0.0258)
-0.9658 -0.9511 - 0.9695
(0.0148) (0.0124) (0.0010)

Standard errors are reported in parentheses. The parameters indicated by a I were not estimated. The
ARcoefficientsaredecomposedas(1-A,L)(1-A,L)~(1-~,L-~,L2),where~A,~>,(A,~.
Table 4
Wald hypothesis tests.

                                          Dow         Dow         Standard 90  S&P 500
                                          1885-1914   1914-1928   1928-1952    1953-1990
Test                                      ARMA(2,1)   AR(1)       ARMA(2,1)    ARMA(2,1)

γ₂ = θ₂ = γ₀ = θ₀ = p - 1 = 0: χ²_5       97.3825     63.4545     10.1816      51.8152
                                          (0.0000)    (0.0000)    (0.0703)     (0.0000)
ω₁ = ω₃: χ²_1                             3.3867      0.0006      9.8593       0.3235
                                          (0.0657)    (0.9812)    (0.0017)     (0.5695)
θ₀ = γ₀ = 0: χ²_2                         67.4221     21.3146     4.4853       2.2024
                                          (0.0000)    (0.0000)    (0.1062)     (0.3325)
θ₀ = γ₀: χ²_1                             17.2288     7.4328      1.7718       1.7844
                                          (0.0000)    (0.0064)    (0.1832)     (0.1816)
η = p: χ²_1                               0.0247      0.2684      0.0554       0.0312
                                          (0.8751)    (0.6044)    (0.8139)     (0.8598)
γ₂ = b^{-p}θ₂: χ²_1                       14.0804     10.0329     14.1293      14.6436
                                          (0.0002)    (0.0015)    (0.0002)     (0.0001)
η = p, γ₂ = b^{-p}θ₂: χ²_2                18.4200     10.4813     22.5829      16.9047
                                          (0.0001)    (0.0053)    (0.0000)     (0.0002)

Table 5
Conditional moment specification tests.

                                          Dow         Dow         Standard 90  S&P 500
Orthogonality                             1885–1914   1914–1928   1928–1952    1953–1990
condition                                 ARMA(2,1)   AR(1)       ARMA(2,1)    ARMA(2,1)

(1) E_T[z_t] = 0                          -0.0147     -0.0243     -0.0275      -0.0110
                                          (0.0208)    (0.0319)    (0.0223)     (0.0202)
(2) E_T[z_t²] = 1                         0.0007      0.0007      0.0083       0.0183
                                          (0.0382)    (0.0613)    (0.0503)     (0.0469)
(3) E_T[z_t|z_t|] = 0                     -0.0823     -0.1122     -0.1072      -0.0658
                                          (0.0365)    (0.0564)    (0.0414)     (0.0410)
(4) E_T[S(z_t, ·)] = 0                    0.0007      0.0013      0.0036       0.0003
                                          (0.0046)    (0.0080)    (0.0051)     (0.0035)
(5) E_T[(z_t² − 1)(z_{t−1}² − 1)] = 0     -0.0050     -0.0507     -0.0105      0.1152
                                          (0.0714)    (0.0695)    (0.0698)     (0.0930)
(6) E_T[(z_t² − 1)(z_{t−2}² − 1)] = 0     -0.0047     0.0399      -0.0358      -0.0627
                                          (0.0471)    (0.0606)    (0.0815)     (0.0458)
(7) E_T[(z_t² − 1)(z_{t−3}² − 1)] = 0     0.0037      -0.0365     0.0373       -0.0171
                                          (0.0385)    (0.0521)    (0.0583)     (0.0611)
(8) E_T[(z_t² − 1)(z_{t−4}² − 1)] = 0     0.0950      -0.0658     -0.0018      -0.0312
                                          (0.0562)    (0.0403)    (0.0543)     (0.0426)
(9) E_T[(z_t² − 1)(z_{t−5}² − 1)] = 0     0.0165      0.0195      0.0710       0.0261
                                          (0.0548)    (0.0486)    (0.0565)     (0.0731)
(10) E_T[(z_t² − 1)(z_{t−6}² − 1)] = 0    -0.0039     0.0343      0.0046       -0.0557
                                          (0.0309)    (0.0602)    (0.0439)     (0.0392)

Table 5 (continued)

                                          Dow         Dow         Standard 90  S&P 500
Orthogonality                             1885–1914   1914–1928   1928–1952    1953–1990
condition                                 ARMA(2,1)   AR(1)       ARMA(2,1)    ARMA(2,1)

(11) E_T[(z_t² − 1)z_{t−1}] = 0           -0.0338     -0.0364     -0.0253      -0.0203
                                          (0.0290)    (0.0414)    (0.0367)     (0.0413)
(12) E_T[(z_t² − 1)z_{t−2}] = 0           0.0069      -0.0275     -0.0434      -0.0378
                                          (0.0251)    (0.0395)    (0.0315)     (0.0278)
(13) E_T[(z_t² − 1)z_{t−3}] = 0           0.0110      0.0290      0.0075       0.0292
                                          (0.0262)    (0.0352)    (0.0306)     (0.0357)
(14) E_T[(z_t² − 1)z_{t−4}] = 0           -0.0296     0.0530      -0.0103      -0.0137
                                          (0.0275)    (0.0340)    (0.0292)     (0.0238)
(15) E_T[(z_t² − 1)z_{t−5}] = 0           -0.0094     0.0567      0.0153       0.0064
                                          (0.0240)    (0.0342)    (0.0287)     (0.0238)
(16) E_T[(z_t² − 1)z_{t−6}] = 0           0.0281      0.0038      -0.0170      0.0417
                                          (0.0216)    (0.0350)    (0.0253)     (0.0326)
(17) E_T[z_t z_{t−1}] = 0                 0.0265      0.0127      0.0383       0.0188
                                          (0.0236)    (0.0346)    (0.0243)     (0.0226)
(18) E_T[z_t z_{t−2}] = 0                 0.0133      -0.0176     -0.0445      -0.0434
                                          (0.0157)    (0.0283)    (0.0174)     (0.0158)
(19) E_T[z_t z_{t−3}] = 0                 0.0406      0.0012      0.0019       0.0140
                                          (0.0158)    (0.0262)    (0.0175)     (0.0152)
(20) E_T[z_t z_{t−4}] = 0                 0.0580      0.0056      0.0211       0.0169
                                          (0.0161)    (0.0253)    (0.0172)     (0.0153)
(21) E_T[z_t z_{t−5}] = 0                 0.0516      0.0164      0.0250       0.0121
                                          (0.0163)    (0.0251)    (0.0174)     (0.0158)
(22) E_T[z_t z_{t−6}] = 0                 -0.0027     0.0081      -0.0040      -0.0211
                                          (0.0158)    (0.0261)    (0.0172)     (0.0150)
(1)–(16): χ²₁₆                            39.1111     45.1608     31.7033      25.1116
                                          (0.0010)    (0.0000)    (0.011)      (0.0679)
(1)–(22): χ²₂₂                            94.0156     52.1272     67.1231      63.6383
                                          (0.0000)    (0.0003)    (0.0000)     (0.0000)

Table 3 reports the maximum likelihood estimates (MLE) for the models selected by
the Schwarz criterion. Various Wald and conditional moment specification tests
are given in Tables 4 and 5.

9.2.2. Persistence of shocks to volatility

As in Nelson (1989, 1991), the ARMA(2,1) models selected for three of the four data
sets can be decomposed into the product of two AR(1) components, one of which
has very long-lived shocks, with an AR root very close to one, the other of which
exhibits short-lived shocks, with an AR root much further from one; i.e. (1 − β₁L − β₂L²) =
(1 − Δ₁L)(1 − Δ₂L), where |Δ₁| ≥ |Δ₂|.

[Figure 3. Conditional distribution of returns. Estimated densities of the standardized residuals z for each data set (recoverable panel titles: "Estimated Densities, Standard 90, 1928–1952", "Estimated Densities, Dow 1914–1928" and "Estimated Densities, S&P 500, 1953–1990"); each panel plots the nonparametric estimate (solid), the fitted parametric density (dashed) and the standard normal (short dashes) over z from −4 to 4.]



When the estimated AR roots are real, a
useful gauge of the persistence of shocks in an AR(1) model is the estimated half
life; that is, the value of n for which Δⁿ = ½. For the Dow 1885–1914, the Standard
90 and the S&P 500, the estimated half lives of the long-lived components are about
119 days, 4½ years and 329 days, respectively. The corresponding estimated half lives
of the short-lived components are only 5.2, 3.7 and 6.2 days, respectively.²⁸,²⁹
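
Since the half life n solves Δⁿ = ½, it can be computed directly as n = ln(½)/ln|Δ|. A minimal check, using the two Dow 1885–1914 AR roots reported in Table 3:

    import numpy as np

    def half_life(root):
        # Smallest n with |root|**n = 1/2 for an AR(1) component.
        return np.log(0.5) / np.log(abs(root))

    for delta in (0.9942, 0.8759):   # Delta_1 and Delta_2, Dow 1885-1914 (Table 3)
        print(f"root {delta:.4f}: half life of about {half_life(delta):.1f} days")

This reproduces the roughly 119-day and 5.2-day half lives quoted above.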

9.2.3. Conditional mean of returns

The estimated p_i terms strongly support the results of LeBaron (1992) of a negative
relationship between the conditional variance and the conditional serial correlation
in returns. In particular, p₂ is significantly positive in each data set, both statistically
and economically. For example, for the Standard 90 data, the fitted conditional first
order correlation in returns is 0.17 when σ_t² is at the 10th percentile of its fitted
sample values, but equals −0.07 when σ_t² is at the 90th percentile. The implied
variation in returns serial correlation is similar in the other data sets. The relatively
simple specification of μ(r_{t−1}, σ_t²) remains inadequate, however, as can be seen from
the conditional moment tests reported in Table 5. The 17th through 22nd conditions
test for serial correlation in the fitted z_t's at lags one through six. In each data set,
significant serial correlation is found at the higher lags.

9.2.4. Conditional distribution of returns

Figure 3 plots the fitted generalized t density of the z_t's against both a standard
normal and a nonparametric density estimate constructed from the fitted z_t's using
a Gaussian kernel with the bandwidth selection method of Silverman (1986, pp. 45–
48). The parametric and nonparametric densities appear quite close, with the
exception of the Dow 1914–1928 data, which exhibits strong negative skewness in
ẑ_t. Further aspects of the fitted conditional distribution are checked in the first three
conditional moment specification tests reported in Table 5. These three orthogonality
conditions test that the standardized residuals ẑ_t ≡ ε̂_t/σ̂_t have mean zero, unit
variance, and no skewness.³⁰ In the first three data sets the ẑ_t series exhibit statistically
significant, though not overwhelmingly so, negative skewness.
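
The first three conditions are simple sample moments of the fitted standardized residuals. A minimal sketch (with simulated z's standing in for fitted ones, and naive standard errors that, unlike the tests reported in Table 5, ignore the sampling error in the estimated parameters):

    import numpy as np

    def moment_test(g):
        # Sample mean of a moment condition g_t (zero under the null) and a
        # naive standard error from the sample variance of g_t.
        g = np.asarray(g)
        return g.mean(), g.std(ddof=1) / np.sqrt(g.size)

    rng = np.random.default_rng(0)
    z = rng.standard_normal(10_000)          # placeholder for the fitted z_t's

    for label, g in [("E[z] = 0", z),
                     ("E[z^2] = 1", z**2 - 1.0),
                     ("E[z|z|] = 0", z * np.abs(z))]:  # skewness check, see fn. 30
        m, se = moment_test(g)
        print(f"{label:12s} {m: .4f} ({se:.4f})")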

²⁸This is consistent with recent work by Ding et al. (1993), in which the empirical autocorrelations
of absolute returns from several financial data sets are found to exhibit rapid decay at short lags but
much slower decay at longer lags. This is also the motivation behind the permanent/transitory com-
ponents ARCH model introduced by Engle and Lee (1992, 1993), and the fractionally integrated ARCH
models recently proposed by Baillie et al. (1993).
²⁹Volatility in the Dow 1914–1928 data shows much less persistence. The half life associated with
the AR(1) model selected by the Schwarz (1978) criterion is only about 7.3 days. For the ARMA(2,1)
model selected by the AIC for this data set, the half lives associated with the two AR roots are only
24 and 3.3 days, respectively.
³⁰More precisely, the third orthogonality condition tests that E_T[z_t|z_t|] = 0 rather than E_T[z_t³] = 0.
We use this test because it requires only the existence of a fourth conditional moment for z_t rather
than a sixth conditional moment.

Table 6
Frequency of tail events.ᵃ

     Dow 1885–1914     Dow 1914–1928     Standard 90 1928–1952   S&P 500 1953–1990
     ARMA(2,1)         AR(1)             ARMA(2,1)               ARMA(2,1)
N    Expected  Actual  Expected  Actual  Expected  Actual        Expected  Actual

2    421.16    405     180.92    177     369.89    363           458.85    432
3    63.71     74      31.11     33      76.51     81            72.60     57
4    11.54     12      6.99      10      18.76     23            13.83     14
5    2.61      4       2.01      3       5.47      4             3.27      6
6    0.72      2       0.70      1       1.86      1             0.94      5
7    0.23      1       0.28      0       0.71      1             0.31      3
8    9.56×10⁻⁶ 0       0.13      0       0.30      1             0.12      2
9    3.89×10⁻  0       0.06      0       0.14      0             0.05      2
10   1.73×10⁻  0       0.03      0       0.07      0             0.02      2
11   8.25×10⁻  0       0.01      0       0.04      0             0.01      1

ᵃThe table reports the expected and the actual number of observations exceeding N conditional
standard deviations.

The original motivation for adopting the generalized t-distribution was that the
two shape parameters η and ψ would allow the model to fit both the tails and the
central part of the conditional distribution. Table 6 gives the expected and the actual
number of z_t's in each data set exceeding N standard deviations. In the S&P 500
data, the number of outliers is still too large. In the other data sets, the tail fit seems
adequate.
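
The expected counts in Table 6 are of the form T·2[1 − F(N)], where F is the cdf of the fitted conditional distribution. A minimal sketch, substituting a Student t for the generalized t (whose cdf is not available in scipy) and simulated data for the fitted z_t's:

    import numpy as np
    from scipy import stats

    def tail_counts(z, dist, N_values):
        # Expected vs. actual counts of |z_t| > N under the fitted cdf `dist`.
        T = len(z)
        for N in N_values:
            expected = T * 2.0 * (1.0 - dist.cdf(N))   # two-sided tail probability
            actual = int(np.sum(np.abs(z) > N))
            print(f"N = {N}: expected {expected:8.2f} actual {actual}")

    rng = np.random.default_rng(1)
    fitted = stats.t(df=6)                     # hypothetical stand-in density
    z = fitted.rvs(size=9000, random_state=rng)
    tail_counts(z, fitted, N_values=range(2, 8))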
As noted above, the generalized t-distribution nests both the Student's t (η = 2)
and the GED (ψ = ∞). Interestingly, in only two of the data sets does a t-test for the
null hypothesis that η = 2 reject at standard levels, and then only marginally. Thus,
the improved fit appears to come from the t component rather than the GED
component of the generalized t-distribution. In total, the generalized t-distribution
is a marked improvement over the GED, though perhaps not over the usual
Student's t-distribution. Nevertheless, the generalized t is not entirely adequate,
as it does not account for the fairly small skewness in the fitted z_t's, and also appears
to have insufficiently thick tails for the S&P 500 data.

9.2.5. News impact function

In line with the results for the EGARCH model reported in Nelson (1989, 1991), the
leverage effect term θ₁ in the g(·,·) function is significantly negative in each of the
data sets, while the magnitude effect term γ₁ is always positive, and significantly so
except in the Dow 1914–1928 data. There are important differences, however. The
EGARCH parameter restrictions that ρ = 1, γ₀ = γ₂ = θ₀ = θ₂ = 0 are decisively
rejected in three of the four data sets. The estimated g(z_t, σ_t²) functions are plotted
in Figure 4, from which the differences with the piecewise linear EGARCH g(z_t)
formulation are apparent.
[Figure 4. Estimated g(z, σ²) functions (recoverable panel titles: "g(z): Dow 1885–1914" and "g(z): Standard 90, 1928–1952"); each panel plots the estimated g over z at the median σ (solid), a low σ (dashed) and a high σ (short dashes).]

To better understand why the standard EGARCH model is rejected, consider
more closely the differences between the specification of the g(z_t, σ_t²) function in
equation (9.9) and the EGARCH formulation in equation (1.11). Firstly, the param-
eters γ₀ and θ₀ allow the conditional variance of ln(σ_t²) and the conditional correlation
between ln(σ_t²) and z_t to change as functions of σ_t². Secondly, the parameters ρ, γ₂,
and θ₂ give the model an added flexibility in how much weight to assign to large
versus small values of z_t.
As reported in Table 4, the EGARCH assumption that γ₀ = θ₀ = 0 is decisively
rejected in the Dow 1885–1914 and 1914–1928 data sets, but not for either the
Standard 90 or the S&P 500 data sets. For none of the four data sets is the estimated
value of γ₀ significantly different from 0 at conventional levels. The estimated value
of θ₀ is always negative, however, and very significantly so in the first two data sets,
indicating that the leverage effect is more important in periods of high volatility
than in periods of low volatility.
The intuition that the influence of large outliers should be limited by setting θ₂ > 0
and γ₂ > 0 receives mixed support from the data. The estimated values of γ₂ and
three of the estimated θ₂'s are positive, but only the estimate of γ₂ for the S&P 500
data is significantly positive at standard levels. We also note that if the data are
generated by a stochastic volatility model, as opposed to an ARCH model, with
conditionally generalized t-distributed errors, the asymptotically optimal ARCH
filter would set η = ρ and γ₂ = ψ⁻¹b⁻¹. The results in Table 4 indicate that the η = ρ
restriction is not rejected, but that γ₂ = ψ⁻¹b⁻¹ is not supported by the data. The
estimated values of γ₂ are too low relative to the asymptotically optimal filter for
the stochastic volatility model.
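
For comparison, the piecewise linear EGARCH news impact function of Nelson (1991) referred to in (1.11) has the form g(z) = θ₁z + γ₁(|z| − E|z|). A minimal sketch with hypothetical parameter values (E|z| = √(2/π) under normality; the fitted g(z, σ²) of (9.9) adds the σ²-dependent terms discussed above):

    import numpy as np

    def egarch_g(z, theta1=-0.1, gamma1=0.2):
        # Nelson (1991): g(z) = theta1*z + gamma1*(|z| - E|z|), with
        # E|z| = sqrt(2/pi) for standard normal z. Parameters are hypothetical.
        return theta1 * z + gamma1 * (np.abs(z) - np.sqrt(2.0 / np.pi))

    z = np.linspace(-5.0, 5.0, 11)
    print(np.round(egarch_g(z), 3))   # steeper for z < 0: the leverage effect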

10. Conclusion

This chapter has focused on a wide range of theoretical properties of ARCH models.
It has also presented some important new empirical results, but has not attempted
to survey the literature on applications, a recent survey of which can be found in
Bollerslev et al. (1992).³¹ Three of the most active lines of inquiry are prominently
surveyed here, however. The first concerns the general parametrizations of univariate
discrete time models of time-varying heteroskedasticity. From the original ARCH
model, the literature has focused upon GARCH, EGARCH, IGARCH, ARCH-M,
AGARCH, NGARCH, QARCH, QTARCH, STARCH, SWARCH and many other
formulations with particular distinctive properties. Not only has this literature been
surveyed here, but it has been expanded by the analysis of variations in the EGARCH
model. Second, we have explored the relations between the discrete time models
and the very popular continuous time diffusion processes that are widely used in
finance. Very useful approximation theorems have been developed, which hold with
increasing accuracy as the length of the sampling interval diminishes. The third
area of important investigation concerns the analysis of multivariate ARCH processes.
This problem is more complex than the specification of univariate models because
of the interest in simultaneously modeling a large number of variables, or assets,
without having to estimate an intractably large number of parameters. Several
multivariate formulations have been proposed, but no clear winners have yet
emerged, either from a theoretical or an empirical point of view.

³¹Other recent surveys of the ARCH methodology are given in Bera and Higgins (1993) and Nijman
and Palm (1993).

References

Akaike, H. (1973) Information Theory and an Extension of the Maximum Likelihood Principle, in: B.N. Petrov and F. Csáki, eds., Second International Symposium on Information Theory. Akadémiai Kiadó: Budapest.
Amemiya, T. (1985) Advanced Econometrics. Harvard University Press: Cambridge, MA.
Amin, K.I. and V.K. Ng (1993) Equilibrium Option Valuation with Systematic Stochastic Volatility, Journal of Finance, 48, 881–910.
Andersen, T.G. (1992a) Volatility, unpublished manuscript, J.L. Kellogg Graduate School of Management, Northwestern University.
Andersen, T.G. (1992b) Return Volatility and Trading Volume in Financial Markets: An Information Flow Interpretation of Stochastic Volatility, unpublished manuscript, J.L. Kellogg Graduate School of Management, Northwestern University.
Anderson, T.W. (1971) The Statistical Analysis of Time Series. John Wiley and Sons: New York, NY.
Andrews, D.W.K. and W. Ploberger (1992) Optimal Tests when a Nuisance Parameter Is Present only under the Alternative, unpublished manuscript, Department of Economics, Yale University.
Andrews, D.W.K. and W. Ploberger (1993) Admissibility of the Likelihood Ratio Test when a Nuisance Parameter Is Present only under the Alternative, unpublished manuscript, Department of Economics, Yale University.
Attanasio, O. (1991) Risk, Time-Varying Second Moments and Market Efficiency, Review of Economic Studies, 58, 479–494.
Baek, E.G. and W.A. Brock (1992) A Nonparametric Test for Independence of a Multivariate Time Series, Statistica Sinica, 2, 137–156.
Baillie, R.T. and T. Bollerslev (1989) The Message in Daily Exchange Rates: A Conditional Variance Tale, Journal of Business and Economic Statistics, 7, 297–305.
Baillie, R.T. and T. Bollerslev (1990) A Multivariate Generalized ARCH Approach to Modeling Risk Premia in Forward Foreign Exchange Rate Markets, Journal of International Money and Finance, 9, 309–324.
Baillie, R.T. and T. Bollerslev (1991) Intra Day and Inter Day Volatility in Foreign Exchange Rates, Review of Economic Studies, 58, 565–585.
Baillie, R.T. and T. Bollerslev (1992) Prediction in Dynamic Models with Time Dependent Conditional Variances, Journal of Econometrics, 52, 91–113.
Baillie, R.T., T. Bollerslev and H.O. Mikkelsen (1993) Fractionally Integrated Autoregressive Conditional Heteroskedasticity, unpublished manuscript, J.L. Kellogg Graduate School of Management, Northwestern University.
Bekaert, G. and R.J. Hodrick (1993) On Biases in the Measurement of Foreign Exchange Risk Premiums, Journal of International Money and Finance, 12, 115–138.
Bera, A.K. and M.L. Higgins (1993) ARCH Models: Properties, Estimation and Testing, Journal of Economic Surveys, 7, 305–366.
Bera, A.K. and S. Lee (1992) Information Matrix Test, Parameter Heterogeneity and ARCH: A Synthesis, Review of Economic Studies, 60, 229–240.
Bera, A.K. and J-S. Roh (1991) A Moment Test of the Constancy of the Correlation Coefficient in the Bivariate GARCH Model, unpublished manuscript, Department of Economics, University of Illinois, Urbana-Champaign.

Bera, A.K., M.L. Higgins and S. Lee (1993) Interaction Between Autocorrelation and Conditional Heteroskedasticity: A Random Coefficients Approach, Journal of Business and Economic Statistics, 10, 133–142.
Berndt, E.R., B.H. Hall, R.E. Hall and J.A. Hausman (1974) Estimation and Inference in Nonlinear Structural Models, Annals of Economic and Social Measurement, 4, 653–665.
Black, F. (1976) Studies of Stock Price Volatility Changes, Proceedings from the American Statistical Association, Business and Economic Statistics Section, 177–181.
Black, F. and M. Scholes (1973) The Pricing of Options and Corporate Liabilities, Journal of Political Economy, 81, 637–659.
Blattberg, R.C. and N.J. Gonedes (1974) A Comparison of the Stable and Student Distributions as Statistical Models for Stock Prices, Journal of Business, 47, 244–280.
Bollerslev, T. (1986) Generalized Autoregressive Conditional Heteroskedasticity, Journal of Econometrics, 31, 307–327.
Bollerslev, T. (1987) A Conditional Heteroskedastic Time Series Model for Speculative Prices and Rates of Return, Review of Economics and Statistics, 69, 542–547.
Bollerslev, T. (1988) On the Correlation Structure for the Generalized Autoregressive Conditional Heteroskedastic Process, Journal of Time Series Analysis, 9, 121–131.
Bollerslev, T. (1990) Modelling the Coherence in Short-Run Nominal Exchange Rates: A Multivariate Generalized ARCH Approach, Review of Economics and Statistics, 72, 498–505.
Bollerslev, T. and I. Domowitz (1993) Trading Patterns and the Behavior of Prices in the Interbank Foreign Exchange Market, Journal of Finance, 48, 1421–1443.
Bollerslev, T. and R.F. Engle (1993) Common Persistence in Conditional Variances, Econometrica, 61, 166–187.
Bollerslev, T. and M. Melvin (1994) Bid-Ask Spreads in the Foreign Exchange Market: An Empirical Analysis, Journal of International Economics, forthcoming.
Bollerslev, T. and J.M. Wooldridge (1992) Quasi-Maximum Likelihood Estimation and Inference in Dynamic Models with Time Varying Covariances, Econometric Reviews, 11, 143–172.
Bollerslev, T., R.F. Engle and J.M. Wooldridge (1988) A Capital Asset Pricing Model with Time Varying Covariances, Journal of Political Economy, 96, 116–131.
Bollerslev, T., R.Y. Chou and K.F. Kroner (1992) ARCH Modeling in Finance: A Review of the Theory and Empirical Evidence, Journal of Econometrics, 52, 5–59.
Bougerol, P. and N. Picard (1992) Stationarity of GARCH Processes and of Some Non-Negative Time Series, Journal of Econometrics, 52, 115–128.
Box, G.E.P. and G.M. Jenkins (1976) Time Series Analysis: Forecasting and Control, Second Edition. Holden Day: San Francisco, CA.
Braun, P.A., D.B. Nelson and A.M. Sunier (1992) Good News, Bad News, Volatility, and Betas, unpublished manuscript, Graduate School of Business, University of Chicago.
Breusch, T. and A.R. Pagan (1979) A Simple Test for Heteroskedasticity and Random Coefficient Variation, Econometrica, 47, 1287–1294.
Brock, W.A. and A. Kleidon (1992) Periodic Market Closure and Trading Volume: A Model of Intra Day Bids and Asks, Journal of Economic Dynamics and Control, 16, 451–489.
Brock, W.A. and S.M. Potter (1992) Nonlinear Time Series and Macroeconometrics, unpublished manuscript, Department of Economics, University of Wisconsin, Madison.
Brock, W.A., W.D. Dechert and J.A. Scheinkman (1987) A Test for Independence Based on the Correlation Dimension, unpublished manuscript, Department of Economics, University of Wisconsin, Madison.
Brock, W.A., D.A. Hsieh and B. LeBaron (1991) Nonlinear Dynamics, Chaos and Instability: Statistical Theory and Economic Evidence. MIT Press: Cambridge, MA.
Cai, J. (1994) A Markov Model of Unconditional Variance in ARCH, Journal of Business and Economic Statistics, forthcoming.
Campbell, J.Y. and L. Hentschel (1992) No News is Good News: An Asymmetric Model of Changing Volatility in Stock Returns, Journal of Financial Economics, 31, 281–318.
Chou, R.Y. (1988) Volatility Persistence and Stock Valuations: Some Empirical Evidence Using GARCH, Journal of Applied Econometrics, 3, 279–294.
Christie, A.A. (1982) The Stochastic Behavior of Common Stock Variances: Value, Leverage and Interest Rate Effects, Journal of Financial Economics, 10, 407–432.
Clark, P.K. (1973) A Subordinated Stochastic Process Model with Finite Variance for Speculative Prices, Econometrica, 41, 135–156.

Cornell, B. (1978) Using the Options Pricing Model to Measure the Uncertainty Producing Effect of Major Announcements, Financial Management, 7, 54–59.
Crowder, M.J. (1976) Maximum Likelihood Estimation with Dependent Observations, Journal of the Royal Statistical Society, 38, 45–53.
Danielson, J. and J.-F. Richard (1993) Accelerated Gaussian Importance Sampler with Application to Dynamic Latent Variable Models, Journal of Applied Econometrics, 8, S153–S173.
Davidian, M. and R.J. Carroll (1987) Variance Function Estimation, Journal of the American Statistical Association, 82, 1079–1091.
Davies, R.B. (1977) Hypothesis Testing when a Nuisance Parameter is Present only under the Alternative, Biometrika, 64, 247–254.
Day, T.E. and C.M. Lewis (1992) Stock Market Volatility and the Information Content of Stock Index Options, Journal of Econometrics, 52, 267–288.
Demos, A. and E. Sentana (1991) Testing for GARCH Effects: A One-Sided Approach, unpublished manuscript, London School of Economics.
Diebold, F.X. (1987) Testing for Serial Correlation in the Presence of ARCH, Proceedings from the American Statistical Association, Business and Economic Statistics Section, 323–328.
Diebold, F.X. (1988) Empirical Modeling of Exchange Rate Dynamics. Springer Verlag: New York, NY.
Diebold, F.X. and M. Nerlove (1989) The Dynamics of Exchange Rate Volatility: A Multivariate Latent Factor ARCH Model, Journal of Applied Econometrics, 4, 1–21.
Ding, Z., R.F. Engle and C.W.J. Granger (1993) Long Memory Properties of Stock Market Returns and a New Model, Journal of Empirical Finance, 1, 83–106.
Drost, F.C. and T.E. Nijman (1993) Temporal Aggregation of GARCH Processes, Econometrica, 61, 909–927.
Engle, R.F. (1982) Autoregressive Conditional Heteroskedasticity with Estimates of the Variance of U.K. Inflation, Econometrica, 50, 987–1008.
Engle, R.F. (1984) Wald, Likelihood Ratio, and Lagrange Multiplier Tests in Econometrics, in: Z. Griliches and M.D. Intriligator, eds., Handbook of Econometrics, Vol. II. North-Holland: Amsterdam.
Engle, R.F. (1987) Multivariate GARCH with Factor Structures – Cointegration in Variance, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. (1990) Discussion: Stock Market Volatility and the Crash of '87, Review of Financial Studies, 3, 103–106.
Engle, R.F. and T. Bollerslev (1986) Modelling the Persistence of Conditional Variances, Econometric Reviews, 5, 1–50, 81–87.
Engle, R.F. and G. Gonzalez-Rivera (1991) Semiparametric ARCH Models, Journal of Business and Economic Statistics, 9, 345–359.
Engle, R.F. and C.W.J. Granger (1987) Cointegration and Error Correction: Representation, Estimation and Testing, Econometrica, 55, 251–276.
Engle, R.F. and S. Kozicki (1993) Testing for Common Features, Journal of Business and Economic Statistics, 11, 369–379.
Engle, R.F. and K.F. Kroner (1993) Multivariate Simultaneous Generalized ARCH, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. and G.G.J. Lee (1992) A Permanent and Transitory Component Model of Stock Return Volatility, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. and G.G.J. Lee (1993) Long Run Volatility Forecasting for Individual Stocks in a One Factor Model, unpublished manuscript, Department of Economics, UCSD.
Engle, R.F. and C. Mustafa (1992) Implied ARCH Models from Options Prices, Journal of Econometrics, 52, 289–311.
Engle, R.F. and V.K. Ng (1993) Measuring and Testing the Impact of News on Volatility, Journal of Finance, 48, 1749–1778.
Engle, R.F. and R. Susmel (1993) Common Volatility in International Equity Markets, Journal of Business and Economic Statistics, 11, 167–176.
Engle, R.F., D.F. Hendry and D. Trumble (1985) Small Sample Properties of ARCH Estimators and Tests, Canadian Journal of Economics, 18, 66–93.
Engle, R.F., D.M. Lilien and R.P. Robins (1987) Estimating Time Varying Risk Premia in the Term Structure: The ARCH-M Model, Econometrica, 55, 391–407.
Engle, R.F., T. Ito and W-L. Lin (1990a) Meteor Showers or Heat Waves? Heteroskedastic Intra Daily Volatility in the Foreign Exchange Market, Econometrica, 58, 525–542.

Engle, R.F., V. Ng and M. Rothschild (1990b) Asset Pricing with a Factor ARCH Covariance Structure: Empirical Estimates for Treasury Bills, Journal of Econometrics, 45, 213–238.
Engle, R.F., C.-H. Hong, A. Kane and J. Noh (1993) Arbitrage Valuation of Variance Forecasts with Simulated Options, in: D.M. Chance and R.R. Trippi, eds., Advances in Futures and Options Research. JAI Press: Greenwich, CT.
Ethier, S.N. and T.G. Kurtz (1986) Markov Processes: Characterization and Convergence. John Wiley: New York, NY.
Fama, E.F. (1963) Mandelbrot and the Stable Paretian Distribution, Journal of Business, 36, 420–429.
Fama, E.F. (1965) The Behavior of Stock Market Prices, Journal of Business, 38, 34–105.
Foster, D.P. and D.B. Nelson (1992) Rolling Regressions, unpublished manuscript, Graduate School of Business, University of Chicago.
French, K.R. and R. Roll (1986) Stock Return Variances: The Arrival of Information and the Reaction of Traders, Journal of Financial Economics, 17, 5–26.
French, K.R., G.W. Schwert and R.F. Stambaugh (1987) Expected Stock Returns and Volatility, Journal of Financial Economics, 19, 3–30.
Gallant, A.R. and G. Tauchen (1989) Seminonparametric Estimation of Conditionally Constrained Heterogeneous Processes: Asset Pricing Applications, Econometrica, 57, 1091–1120.
Gallant, A.R., D.A. Hsieh and G. Tauchen (1991) On Fitting a Recalcitrant Series: The Pound/Dollar Exchange Rate 1974–83, in: W.A. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press: Cambridge.
Gallant, A.R., P.E. Rossi and G. Tauchen (1992) Stock Prices and Volume, Review of Financial Studies, 5, 199–242.
Gallant, A.R., P.E. Rossi and G. Tauchen (1993) Nonlinear Dynamic Structures, Econometrica, 61, 871–907.
Gennotte, G. and T.A. Marsh (1991) Variations in Economic Uncertainty and Risk Premiums on Capital Assets, unpublished manuscript, Department of Finance, University of California, Berkeley.
Gerity, M.S. and J.H. Mulherin (1992) Trading Halts and Market Activity: An Analysis of Volume at the Open and the Close, Journal of Finance, 47, 1765–1784.
Geweke, J. (1989a) Exact Predictive Densities in Linear Models with ARCH Disturbances, Journal of Econometrics, 44, 307–325.
Geweke, J. (1989b) Bayesian Inference in Econometric Models Using Monte Carlo Integration, Econometrica, 57, 1317–1339.
Glosten, L.R., R. Jagannathan and D. Runkle (1993) On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks, Journal of Finance, 48, 1779–1801.
Gourieroux, C. and A. Monfort (1992) Qualitative Threshold ARCH Models, Journal of Econometrics, 52, 159–199.
Gourieroux, C., A. Holly and A. Monfort (1982) Likelihood Ratio Test, Wald Test and Kuhn-Tucker Test in Linear Models with Inequality Constraints on Regression Parameters, Econometrica, 50, 63–80.
Granger, C.W.J., R.F. Engle and R.P. Robins (1986) Wholesale and Retail Prices: Bivariate Time-Series Modelling with Forecastable Error Variances, in: D. Belsley and E. Kuh, eds., Model Reliability. MIT Press: Cambridge, MA, pp. 1–17.
Hamao, Y., R.W. Masulis and V.K. Ng (1990) Correlations in Price Changes and Volatility Across International Stock Markets, Review of Financial Studies, 3, 281–307.
Hamilton, J.D. and R. Susmel (1992) Autoregressive Conditional Heteroskedasticity and Changes in Regime, unpublished manuscript, Department of Economics, UCSD.
Harris, L. (1986) A Transaction Data Study of Weekly and Intradaily Patterns in Stock Returns, Journal of Financial Economics, 16, 99–117.
Harvey, C.R. and R.D. Huang (1991) Volatility in the Foreign Currency Futures Market, Review of Financial Studies, 4, 543–569.
Harvey, C.R. and R.D. Huang (1992) Information Trading and Fixed Income Volatility, unpublished manuscript, Department of Finance, Duke University.
Harvey, A.C., E. Ruiz and E. Sentana (1992) Unobserved Component Time Series Models with ARCH Disturbances, Journal of Econometrics, 52, 129–158.
Harvey, A.C., E. Ruiz and N. Shephard (1994) Multivariate Stochastic Volatility Models, Review of Economic Studies, forthcoming.
Heston, S.L. (1991) A Closed Form Solution for Options with Stochastic Volatility, unpublished manuscript, Department of Finance, Yale University.

Higgins, M.L. and A.K. Bera (1992) A Class of Nonlinear ARCH Models, International Economic Review, 33, 137–158.
Hong, P.Y. (1991) The Autocorrelation Structure for the GARCH-M Process, Economics Letters, 37, 129–132.
Hsieh, D.A. (1991) Chaos and Nonlinear Dynamics: Applications to Financial Markets, Journal of Finance, 46, 1839–1878.
Huber, P.J. (1977) Robust Statistical Procedures. SIAM: Bristol, United Kingdom.
Hull, J. and A. White (1987) The Pricing of Options on Assets with Stochastic Volatilities, Journal of Finance, 42, 281–300.
Jacquier, E., N.G. Polson and P.E. Rossi (1994) Bayesian Analysis of Stochastic Volatility Models, Journal of Business and Economic Statistics, forthcoming.
Karatzas, I. and S.E. Shreve (1988) Brownian Motion and Stochastic Calculus. Springer-Verlag: New York, NY.
Karpoff, J.M. (1987) The Relation Between Price Changes and Trading Volume: A Survey, Journal of Financial and Quantitative Analysis, 22, 109–126.
Kim, C.M. (1989) Nonlinear Dependence of Exchange Rate Changes, unpublished Ph.D. dissertation, Graduate School of Business, University of Chicago.
King, M., E. Sentana and S. Wadhwani (1994) Volatility and Links Between National Stock Markets, Econometrica, forthcoming.
Kitagawa, G. (1987) Non-Gaussian State Space Modelling of Nonstationary Time Series, Journal of the American Statistical Association, 82, 1032–1063.
Kodde, D.A. and F.C. Palm (1986) Wald Criterion for Jointly Testing Equality and Inequality Restrictions, Econometrica, 54, 1243–1248.
Kraft, D.F. and R.F. Engle (1982) Autoregressive Conditional Heteroskedasticity in Multiple Time Series, unpublished manuscript, Department of Economics, UCSD.
Krengel, U. (1985) Ergodic Theorems. Walter de Gruyter: Berlin, Germany.
Kroner, K.F. and S. Claessens (1991) Optimal Dynamic Hedging Portfolios and the Currency Composition of External Debt, Journal of International Money and Finance, 10, 131–148.
Kroner, K.F. and J. Sultan (1991) Exchange Rate Volatility and Time Varying Hedge Ratios, in: S.G. Rhee and R.P. Chang, eds., Pacific-Basin Capital Markets Research, Vol. II. North-Holland: Amsterdam.
Lamoureux, C.G. and W.D. Lastrapes (1990) Heteroskedasticity in Stock Return Data: Volume versus GARCH Effects, Journal of Finance, 45, 221–229.
Lamoureux, C.G. and W.D. Lastrapes (1994) Endogenous Trading Volume and Momentum in Stock Return Volatility, Journal of Business and Economic Statistics, forthcoming.
LeBaron, B. (1992) Some Relations Between Volatility and Serial Correlation in Stock Market Returns, Journal of Business, 65, 199–220.
Lee, J.H.H. and M.L. King (1993) A Locally Most Mean Powerful Based Score Test for ARCH and GARCH Regression Disturbances, Journal of Business and Economic Statistics, 11, 17–27.
Lee, S.W. and B.E. Hansen (1993) Asymptotic Theory for the GARCH(1,1) Quasi-Maximum Likelihood Estimator, unpublished manuscript, Department of Economics, University of Rochester.
Lin, W.L. (1992) Alternative Estimators for Factor GARCH Models – A Monte Carlo Comparison, Journal of Applied Econometrics, 7, 259–279.
Lin, W.L., R.F. Engle and T. Ito (1994) Do Bulls and Bears Move Across Borders? International Transmission of Stock Returns and Volatility as the World Turns, Review of Financial Studies, forthcoming.
Ljung, G.M. and G.E.P. Box (1978) On a Measure of Lack of Fit in Time Series Models, Biometrika, 65, 297–303.
Lumsdaine, R.L. (1992a) Asymptotic Properties of the Quasi-Maximum Likelihood Estimator in GARCH(1,1) and IGARCH(1,1) Models, unpublished manuscript, Department of Economics, Princeton University.
Lumsdaine, R.L. (1992b) Finite Sample Properties of the Maximum Likelihood Estimator in GARCH(1,1) and IGARCH(1,1) Models: A Monte Carlo Investigation, unpublished manuscript, Department of Economics, Princeton University.
MacKinnon, J.G. and H. White (1985) Some Heteroskedasticity Consistent Covariance Matrix Estimators with Improved Finite Sample Properties, Journal of Econometrics, 29, 305–325.
Mandelbrot, B. (1963) The Variation of Certain Speculative Prices, Journal of Business, 36, 394–419.

Marcus, M. and H. Minc (1964) A Survey of Matrix Theory and Matrix Inequalities. Prindle, Weber and Schmidt: Boston, MA.
McCurdy, T.H. and T. Stengos (1992) A Comparison of Risk Premium Forecasts Implied by Parametric and Nonparametric Conditional Mean Estimators, Journal of Econometrics, 52, 225–244.
McDonald, J.B. and W.K. Newey (1988) Partially Adaptive Estimation of Regression Models via the Generalized t Distribution, Econometric Theory, 4, 428–457.
Melino, A. and S.M. Turnbull (1990) Pricing Foreign Currency Options with Stochastic Volatility, Journal of Econometrics, 45, 239–266.
Merton, R.C. (1973) An Intertemporal Capital Asset Pricing Model, Econometrica, 41, 867–887.
Merton, R.C. (1980) On Estimating the Expected Return on the Market, Journal of Financial Economics, 8, 323–361.
Milhøj, A. (1985) The Moment Structure of ARCH Processes, Scandinavian Journal of Statistics, 12, 281–292.
Murphy, K. and R. Topel (1985) Estimation and Inference in Two-Step Econometric Models, Journal of Business and Economic Statistics, 3, 370–379.
Nelson, D.B. (1989) Modeling Stock Market Volatility Changes, Proceedings from the American Statistical Association, Business and Economic Statistics Section, 93–98.
Nelson, D.B. (1990a) ARCH Models as Diffusion Approximations, Journal of Econometrics, 45, 7–38.
Nelson, D.B. (1990b) Stationarity and Persistence in the GARCH(1,1) Model, Econometric Theory, 6, 318–334.
Nelson, D.B. (1991) Conditional Heteroskedasticity in Asset Returns: A New Approach, Econometrica, 59, 347–370.
Nelson, D.B. (1992) Filtering and Forecasting with Misspecified ARCH Models I: Getting the Right Variance with the Wrong Model, Journal of Econometrics, 52, 61–90.
Nelson, D.B. and C.Q. Cao (1992) Inequality Constraints in the Univariate GARCH Model, Journal of Business and Economic Statistics, 10, 229–235.
Nelson, D.B. and D.P. Foster (1991) Filtering and Forecasting with Misspecified ARCH Models II: Making the Right Forecast with the Wrong Model, unpublished manuscript, Graduate School of Business, University of Chicago.
Nelson, D.B. and D.P. Foster (1994) Asymptotic Filtering Theory for Univariate ARCH Models, Econometrica, 62, 1–41.
Newey, W.K. (1985) Maximum Likelihood Specification Testing and Conditional Moment Tests, Econometrica, 53, 1047–1070.
Ng, V., R.F. Engle and M. Rothschild (1992) A Multi-Dynamic Factor Model for Stock Returns, Journal of Econometrics, 52, 245–265.
Nijman, T.E. and F.C. Palm (1993) GARCH Modelling of Volatility: An Introduction to Theory and Applications, in: A.J. de Zeeuw, ed., Advanced Lectures in Quantitative Economics. Academic Press: London.
Nijman, T.E. and E. Sentana (1993) Marginalization and Contemporaneous Aggregation in Multivariate GARCH Processes, unpublished manuscript, Center for Economic Research, Tilburg University.
Nummelin, E. and P. Tuominen (1982) Geometric Ergodicity of Harris Recurrent Markov Chains with Applications to Renewal Theory, Stochastic Processes and Their Applications, 12, 187–202.
Pagan, A.R. (1984) Econometric Issues in the Analysis of Regressions with Generated Regressors, International Economic Review, 25, 221–247.
Pagan, A.R. (1986) Two Stage and Related Estimators and their Applications, Review of Economic Studies, 53, 517–538.
Pagan, A.R. and Y.S. Hong (1991) Nonparametric Estimation and the Risk Premium, in: W.A. Barnett, J. Powell and G. Tauchen, eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. Cambridge University Press: Cambridge.
Pagan, A.R. and H.C.L. Sabau (1987a) On the Inconsistency of the MLE in Certain Heteroskedastic Regression Models, unpublished manuscript, University of Rochester.
Pagan, A.R. and H.C.L. Sabau (1987b) Consistency Tests for Heteroskedasticity and Risk Models, unpublished manuscript, Department of Economics, University of Rochester.
Pagan, A.R. and G.W. Schwert (1990) Alternative Models for Conditional Stock Volatility, Journal of Econometrics, 45, 267–290.
Pagan, A.R. and A. Ullah (1988) The Econometric Analysis of Models with Risk Terms, Journal of Applied Econometrics, 3, 87–105.

Pagan, A.R., A.D. Hall and P.K. Trivedi (1983) Assessing the Variability of Inflation, Review of Economic Studies, 50, 585–596.
Pardoux, E. and D. Talay (1985) Discretization and Simulation of Stochastic Differential Equations, Acta Applicandae Mathematicae, 3, 23–47.
Parkinson, M. (1980) The Extreme Value Method for Estimating the Variance of the Rate of Return, Journal of Business, 53, 61–65.
Patell, J.M. and M.A. Wolfson (1979) Anticipated Information Releases Reflected in Call Option Prices, Journal of Accounting and Economics, 1, 117–140.
Patell, J.M. and M.A. Wolfson (1981) The Ex-Ante and Ex-Post Price Effects of Quarterly Earnings Announcements Reflected in Option and Stock Prices, Journal of Accounting Research, 19, 434–458.
Poterba, J. and L. Summers (1986) The Persistence of Volatility and Stock Market Fluctuations, American Economic Review, 76, 1142–1151.
Rich, R.W., J.E. Raymond and J.S. Butler (1992) The Relationship between Forecast Dispersion and Forecast Uncertainty: Evidence from a Survey Data-ARCH Model, Journal of Applied Econometrics, 7, 131–148.
Royden, H.L. (1968) Real Analysis. Macmillan Publishing Co.: New York, NY.
Scheinkman, J. and B. LeBaron (1989) Nonlinear Dynamics and Stock Returns, Journal of Business, 62, 311–337.
Schwarz, G. (1978) Estimating the Dimension of a Model, Annals of Statistics, 6, 461–464.
Schwert, G.W. (1989a) Why Does Stock Market Volatility Change Over Time?, Journal of Finance, 44, 1115–1153.
Schwert, G.W. (1989b) Business Cycles, Financial Crises, and Stock Volatility, Carnegie-Rochester Conference Series on Public Policy, 39, 83–126.
Schwert, G.W. (1990) Indexes of U.S. Stock Prices from 1802 to 1987, Journal of Business, 63, 399–426.
Schwert, G.W. and P.J. Seguin (1990) Heteroskedasticity in Stock Returns, Journal of Finance, 45, 1129–1155.
Scott, L.O. (1987) Option Pricing when the Variance Changes Randomly: Theory, Estimation and an Application, Journal of Financial and Quantitative Analysis, 22, 419–438.
Sentana, E. (1991) Quadratic ARCH Models: A Potential Re-Interpretation of ARCH Models, unpublished manuscript, London School of Economics.
Shephard, N. (1993) Fitting Nonlinear Time Series Models with Applications to Stochastic Variance Models, Journal of Applied Econometrics, 8, S135–S152.
Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. Chapman and Hall: London, United Kingdom.
Stambaugh, R.F. (1993) Estimating Conditional Expectations When Volatility Fluctuates, unpublished manuscript, The Wharton School, University of Pennsylvania.
Stroock, D.W. and S.R.S. Varadhan (1979) Multidimensional Diffusion Processes. Springer-Verlag: Berlin, Germany.
Tauchen, G. (1985) Diagnostic Testing and Evaluation of Maximum Likelihood Models, Journal of Econometrics, 30, 415–443.
Tauchen, G. and M. Pitts (1983) The Price Variability–Volume Relationship on Speculative Markets, Econometrica, 51, 485–505.
Taylor, S. (1986) Modeling Financial Time Series. Wiley and Sons: New York, NY.
Tsay, R.S. (1987) Conditional Heteroskedastic Time Series Models, Journal of the American Statistical Association, 82, 590–604.
Tweedie, R.L. (1983a) Criteria for Rates of Convergence of Markov Chains, with Application to Queuing and Storage Theory, in: J.F.C. Kingman and G.E.H. Reuter, eds., Probability, Statistics, and Analysis, London Mathematical Society Lecture Note Series No. 79. Cambridge University Press: Cambridge.
Tweedie, R.L. (1983b) The Existence of Moments for Stationary Markov Chains, Journal of Applied Probability, 20, 191–196.
Watson, M.W. and R.F. Engle (1985) Testing for Regression Coefficient Stability with a Stationary AR(1) Alternative, Review of Economics and Statistics, 67, 341–346.
Weiss, A.A. (1984) ARMA Models with ARCH Errors, Journal of Time Series Analysis, 5, 129–143.
Weiss, A.A. (1986) Asymptotic Theory for ARCH Models: Estimation and Testing, Econometric Theory, 2, 107–131.
West, K.D., H.J. Edison and D. Cho (1993) A Utility Based Comparison of Some Models for Exchange Rate Volatility, Journal of International Economics, 35, 23–45.

White, H. (1980) A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity, Econometrica, 48, 817–838.
White, H. (1987) Specification Testing in Dynamic Models, in: T.F. Bewley, ed., Advances in Econometrics: Fifth World Congress, Vol. 1. Cambridge University Press: Cambridge.
White, H. (1994) Estimation, Inference and Specification Analysis, forthcoming.
Wiggins, J.B. (1987) Option Values under Stochastic Volatility: Theory and Empirical Estimates, Journal of Financial Economics, 19, 351–372.
Wiggins, J.B. (1991) Empirical Tests of the Bias and Efficiency of the Extreme-Value Variance Estimator for Common Stocks, Journal of Business, 64, 417–432.
Wolak, F.A. (1991) The Local Nature of Hypothesis Tests Involving Inequality Constraints in Nonlinear Models, Econometrica, 59, 981–995.
Wooldridge, J.M. (1990) A Unified Approach to Robust Regression Based Specification Tests, Econometric Theory, 6, 17–43.
Wooldridge, J.M. (1994) Estimation and Inference for Dependent Processes, in: R.F. Engle and D. McFadden, eds., Handbook of Econometrics, Vol. IV. North-Holland: Amsterdam, Chapter 45.
Zakoian, J.-M. (1990) Threshold Heteroskedastic Models, unpublished manuscript, CREST, INSEE.
Zarnowitz, V. and L.A. Lambros (1987) Consensus and Uncertainty in Economic Prediction, Journal of Political Economy, 95, 591–621.
Chapter 50

STATE-SPACE MODELS*

JAMES D. HAMILTON

University of California, San Diego

Contents

Abstract 3041
1. The state-space representation of a linear dynamic system 3041
2. The Kalman filter 3046
2.1. Overview of the Kalman filter 3047
2.2. Derivation of the Kalman filter 3048
2.3. Forecasting with the Kalman filter 3051
2.4. Smoothed inference 3051
2.5. Interpretation of the Kalman filter with non-normal disturbances 3052
2.6. Time-varying coefficient models 3053
2.7. Other extensions 3054
3. Statistical inference about unknown parameters using the
Kalman filter 3055
3.1. Maximum likelihood estimation 3055
3.2. Identification 3057
3.3. Asymptotic properties of maximum likelihood estimates 3058
3.4. Confidence intervals for smoothed estimates and forecasts 3060
3.5. Empirical application - an analysis of the real interest rate 3060
4. Discrete-valued state variables 3062
4.1. Linear state-space representation of the Markov-switching model 3063
4.2. Optimal filter when the state variable follows a Markov chain 3064
4.3. Extensions 3067
4.4. Forecasting 3068

*I am grateful to Gongpil Choi, Robert Engle and an anonymous referee for helpful comments, and
to the NSF for support under grant SES8920752. Data and software used in this chapter can be
obtained at no charge by writing James D. Hamilton, Department of Economics 0508, UCSD, La
Jolla, CA 92093-0508, USA. Alternatively, data and software can be obtained by writing ICPSR,
Institute for Social Research, P.O. Box 1248, Ann Arbor, MI 48106, USA.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden


© 1994 Elsevier Science B.V. All rights reserved

4.5. Smoothed probabilities 3069


4.6. Maximum likelihood estimation 3070
4.7. Asymptotic properties of maximum likelihood estimates 3071
4.8. Empirical application - another look at the real interest rate 3071
5. Non-normal and nonlinear state-space models 3073
5.1. Kitagawa's grid approximation for nonlinear, non-normal state-space models 3073
5.2. Extended Kalman filter 3076
5.3. Other approaches to nonlinear state-space models 3077
References 3077

Abstract

This chapter reviews the usefulness of the Kalman filter for parameter estimation
and inference about unobserved variables in linear dynamic systems. Applications
include exact maximum likelihood estimation of regressions with ARMA distur-
bances, time-varying parameters, missing observations, forming an inference about
the publics expectations about inflation, and specification of business cycle
dynamics. The chapter also reviews models of changes in regime and develops the
parallel between such models and linear state-space models. The chapter concludes
with a brief discussion of alternative approaches to nonlinear filtering.

1. The state-space representation of a linear dynamic system

Many dynamic models can usefully be written in what is known as a state-space
form. The value of writing a model in this form can be appreciated by considering
a first-order autoregression

y_{t+1} = φ y_t + ε_{t+1},   (1.1)

with ε_t ~ i.i.d. N(0, σ²). Future values of y for this process depend on (y_t, y_{t−1}, ...)
only through the current value y_t. This makes it extremely simple to analyze the
dynamics of the process, make forecasts or evaluate the likelihood function. For
example, equation (1.1) is easy to solve by recursive substitution,

y_{t+m} = φ^m y_t + φ^{m−1} ε_{t+1} + φ^{m−2} ε_{t+2} + ⋯
          + φ ε_{t+m−1} + ε_{t+m}   for m = 1, 2, ...,   (1.2)

from which the optimal m-period-ahead forecast is seen to be

E(y_{t+m} | y_t, y_{t−1}, ...) = φ^m y_t.   (1.3)

The process is stable if |φ| < 1.
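
A minimal numerical illustration of (1.3), with hypothetical φ and y_t:

    def ar1_forecast(y_t, phi, m):
        # Optimal m-period-ahead forecast of an AR(1): E(y_{t+m}|y_t) = phi**m * y_t.
        return phi**m * y_t

    phi, y_t = 0.8, 2.0
    print([round(ar1_forecast(y_t, phi, m), 3) for m in range(1, 6)])
    # [1.6, 1.28, 1.024, 0.819, 0.655]: geometric decay toward the mean of zero.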


The idea behind a state-space representation of a more complicated linear system
is to capture the dynamics of an observed (n × 1) vector y_t in terms of a possibly
unobserved (r × 1) vector ξ_t, known as the state vector for the system. The dynamics
of the state vector are taken to be a vector generalization of (1.1):

ξ_{t+1} = F ξ_t + v_{t+1}.   (1.4)



Here F denotes an (r × r) matrix and the (r × 1) vector v_t is taken to be i.i.d. N(0, Q).
Result (1.2) generalizes to

ξ_{t+m} = F^m ξ_t + F^{m−1} v_{t+1} + F^{m−2} v_{t+2} + ⋯
          + F v_{t+m−1} + v_{t+m}   for m = 1, 2, ...,   (1.5)

where F^m denotes the matrix F multiplied by itself m times. Hence

E(ξ_{t+m} | ξ_t, ξ_{t−1}, ...) = F^m ξ_t.

Future values of the state vector depend on (ξ_t, ξ_{t−1}, ...) only through the current
value ξ_t. The system is stable provided that the eigenvalues of F all lie inside the
unit circle.
The observed variables are presumed to be related to the state vector through
the observation equation of the system,

y_t = A′x_t + H′ξ_t + w_t.   (1.6)


Here y_t is an (n × 1) vector of variables that are observed at date t, H′ is an (n × r)
matrix of coefficients, and w_t is an (n × 1) vector that could be described as
measurement error; w_t is assumed to be i.i.d. N(0, R) and independent of ξ₁ and
v_τ for τ = 1, 2, ... . Equation (1.6) also includes x_t, a (k × 1) vector of observed
variables that are exogenous or predetermined and which enter (1.6) through the
(n × k) matrix of coefficients A′. There is a choice as to whether a variable is defined
to be in the state vector ξ_t or in the exogenous vector x_t, and there are advantages
if all dynamic variables are included in the state vector so that x_t is deterministic.
However, many of the results below are also valid for nondeterministic x_t, as long
as x_t contains no information about ξ_{t+m} or w_{t+m} for m = 0, 1, 2, ... beyond that
contained in y_{t−1}, y_{t−2}, ..., y₁. For example, x_t could include lagged values of y or
variables that are independent of ξ_τ and w_τ for all τ.
The state equation (1.4) and observation equation (1.6) constitute a linear
state-space representation for the dynamic behavior of y. The framework can be
further generalized to allow for time-varying coefficient matrices, non-normal
disturbances and nonlinear dynamics, as will be discussed later in this chapter.
For now, however, we just focus on a system characterized by (1.4) and (1.6).
Note that when x_t is deterministic, the state vector ξ_t summarizes everything in
the past that is relevant for determining future values of y:

E(y_{t+m} | ξ_t, ξ_{t−1}, ..., y_t, y_{t−1}, ...)
  = E[(A′x_{t+m} + H′ξ_{t+m} + w_{t+m}) | ξ_t, ξ_{t−1}, ..., y_t, y_{t−1}, ...]
  = A′x_{t+m} + H′·E(ξ_{t+m} | ξ_t, ξ_{t−1}, ..., y_t, y_{t−1}, ...)
  = A′x_{t+m} + H′F^m ξ_t.   (1.7)

As a simple example of a system that can be written in state-space form, consider
a pth-order autoregression

(y_{t+1} − μ) = φ₁(y_t − μ) + φ₂(y_{t−1} − μ) + ⋯ + φ_p(y_{t−p+1} − μ) + ε_{t+1},   (1.8)

ε_t ~ i.i.d. N(0, σ²).

Note that (1.8) can equivalently be written as

\begin{bmatrix} y_{t+1}-\mu \\ y_t-\mu \\ \vdots \\ y_{t-p+2}-\mu \end{bmatrix}
=
\begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} y_t-\mu \\ y_{t-1}-\mu \\ \vdots \\ y_{t-p+1}-\mu \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{t+1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.   (1.9)

The first row of (1.9) simply reproduces (1.8), and the other rows assert the identity
y_{t−j} − μ = y_{t−j} − μ for j = 0, 1, ..., p − 2. Equation (1.9) is of the form of (1.4) with
r = p and

ξ_t = (y_t − μ, y_{t−1} − μ, ..., y_{t−p+1} − μ)′,   (1.10)

v_{t+1} = (ε_{t+1}, 0, ..., 0)′,   (1.11)

F = \begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}.   (1.12)

The observation equation is

y_t = μ + H′ξ_t,   (1.13)

where H′ is the first row of the (p × p) identity matrix. The eigenvalues of F can
be shown to satisfy

λ^p − φ₁λ^{p−1} − φ₂λ^{p−2} − ⋯ − φ_{p−1}λ − φ_p = 0;   (1.14)

thus stability of a pth-order autoregression requires that any value λ satisfying
(1.14) lie inside the unit circle.
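
A minimal sketch of the companion matrix (1.12) for a hypothetical AR(2), checking stability through the roots of (1.14):

    import numpy as np

    def companion(phis):
        # F of equation (1.12) for an AR(p) with coefficients phi_1, ..., phi_p.
        p = len(phis)
        F = np.zeros((p, p))
        F[0, :] = phis
        F[1:, :-1] = np.eye(p - 1)
        return F

    F = companion([1.2, -0.35])              # hypothetical AR(2) coefficients
    lam = np.linalg.eigvals(F)               # the solutions of (1.14)
    print(lam, np.all(np.abs(lam) < 1))      # roots 0.7 and 0.5: stable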
Let us now ask what kind of dynamic system would be described if H′ in (1.13)
is replaced with a general (1 × p) vector,

y_t = μ + [1  θ₁  θ₂  ⋯  θ_{p−1}] ξ_t,   (1.15)

where the θ's represent arbitrary coefficients. Suppose that ξ_t continues to evolve
in the manner specified for the state vector of an AR(p) process. Letting ξ_{jt} denote
the jth element of ξ_t, this would mean

\begin{bmatrix} \xi_{1,t+1} \\ \xi_{2,t+1} \\ \vdots \\ \xi_{p,t+1} \end{bmatrix}
=
\begin{bmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} \xi_{1t} \\ \xi_{2t} \\ \vdots \\ \xi_{pt} \end{bmatrix}
+
\begin{bmatrix} \varepsilon_{t+1} \\ 0 \\ \vdots \\ 0 \end{bmatrix}.   (1.16)

The jth row of this system for j = 2, 3, ..., p states that ξ_{j,t+1} = ξ_{j−1,t}, implying

ξ_{jt} = L^{j−1} ξ_{1t}   for j = 1, 2, ..., p,   (1.17)

for L the lag operator. The first row of (1.16) thus implies that the first element
of ξ_t can be viewed as an AR(p) process driven by the innovations sequence {ε_t}:

(1 − φ₁L − φ₂L² − ⋯ − φ_pL^p) ξ_{1,t+1} = ε_{t+1}.   (1.18)

Equations (1.15) and (1.17) then imply

y_t = μ + (1 + θ₁L + θ₂L² + ⋯ + θ_{p−1}L^{p−1}) ξ_{1t}.   (1.19)

If we subtract μ from both sides of (1.19) and operate on both sides with
(1 − φ₁L − φ₂L² − ⋯ − φ_pL^p), the result is

(1 − φ₁L − φ₂L² − ⋯ − φ_pL^p)(y_t − μ)
  = (1 + θ₁L + θ₂L² + ⋯ + θ_{p−1}L^{p−1})(1 − φ₁L − φ₂L² − ⋯ − φ_pL^p) ξ_{1t}
  = (1 + θ₁L + θ₂L² + ⋯ + θ_{p−1}L^{p−1}) ε_t   (1.20)

by virtue of (1.18). Thus equations (1.15) and (1.16) constitute a state-space represen-
tation for an ARMA(p, p − 1) process.
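
A minimal sketch of this construction for a hypothetical ARMA(2,1), building F and H′ and simulating y_t from the state equation (1.16) and observation equation (1.15):

    import numpy as np

    def arma_state_space(phis, thetas):
        # F is the AR(p) companion matrix; H' = [1, theta_1, ..., theta_{p-1}].
        p = len(phis)
        assert len(thetas) == p - 1
        F = np.zeros((p, p))
        F[0, :] = phis
        F[1:, :-1] = np.eye(p - 1)
        H = np.r_[1.0, thetas]
        return F, H

    F, H = arma_state_space([0.5, 0.2], [0.3])   # hypothetical coefficients
    rng, xi, mu = np.random.default_rng(2), np.zeros(2), 0.0
    y = []
    for _ in range(5):
        xi = F @ xi + np.r_[rng.standard_normal(), 0.0]   # v_{t+1} = (eps, 0)'
        y.append(mu + H @ xi)
    print(np.round(y, 3))   # an ARMA(2,1) sample path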
The state-space framework can also be used in its own right as a parsimonious
time-series description of an observed vector of variables. The usefulness of forecasts
emerging from this approach has been demonstrated by Harvey and Todd (1983),
Aoki (1987), and Harvey (1989).

The state-space form is particularly convenient for thinking about sums of
stochastic processes or the consequences of measurement error. For example,
suppose we postulate the existence of an underlying true variable, ξ_t, that follows
an AR(1) process

ξ_{t+1} = φ ξ_t + v_{t+1},   (1.21)

with v_t white noise. Suppose that ξ_t is not observed directly. Instead, the econometri-
cian has available data y_t that differ from ξ_t by measurement error w_t:

y_t = ξ_t + w_t.   (1.22)

If the measurement error is white noise that is uncorrelated with v_t, then (1.21)
and (1.22) can immediately be viewed as the state equation and observation
equation of a state-space system, with r = n = 1. Fama and Gibbons (1982) used
just such a model to describe the ex ante real interest rate (the nominal interest
rate i_t minus the expected inflation rate π_t^e). The ex ante real rate is presumed to
follow an AR(1) process, but is unobserved by the econometrician because people's
expectation π_t^e is unobserved. The state vector for this application is then
ξ_t = i_t − π_t^e − μ, where μ is the average ex ante real interest rate. The observed ex
post real rate (y_t = i_t − π_t) differs from the ex ante real rate by the error people
make in forecasting inflation,

i_t − π_t = μ + (i_t − π_t^e − μ) + (π_t^e − π_t),

which is an observation equation of the form of (1.6) with H′ = 1 and w_t = (π_t^e − π_t).
If people do not make systematic errors in forecasting inflation, then w_t might
reasonably be assumed to be white noise.
In many economic models, the public's expectations of the future have important
consequences. These expectations are not observed directly, but if they are formed
rationally there are certain implications for the time-series behavior of observed
series. Thus the rational-expectations hypothesis lends itself quite naturally to a
state-space representation; sample applications include Wall (1980), Burmeister
and Wall (1982), Watson (1989), and Imrohoroglu (1993).
In another interesting econometric application of a state-space representation,
Stock and Watson (1991) postulated that the common dynamic behavior of an
(n × 1) vector of macroeconomic variables y_t could be explained in terms of an
unobserved scalar c_t, which is viewed as the state of the business cycle. In addition,
each series y_{it} is presumed to have an idiosyncratic component (denoted a_{it}) that
is unrelated to movements in y_{jt} for i ≠ j. If each of the component processes could
be described by an AR(1) process, then the [(n + 1) × 1] state vector would be

ξ_t = (c_t, a_{1t}, a_{2t}, ..., a_{nt})′   (1.23)



with state equation

\begin{bmatrix} c_{t+1} \\ a_{1,t+1} \\ a_{2,t+1} \\ \vdots \\ a_{n,t+1} \end{bmatrix}
=
\begin{bmatrix}
\phi_c & 0 & 0 & \cdots & 0 \\
0 & \phi_1 & 0 & \cdots & 0 \\
0 & 0 & \phi_2 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & \phi_n
\end{bmatrix}
\begin{bmatrix} c_t \\ a_{1t} \\ a_{2t} \\ \vdots \\ a_{nt} \end{bmatrix}
+
\begin{bmatrix} v_{c,t+1} \\ v_{1,t+1} \\ v_{2,t+1} \\ \vdots \\ v_{n,t+1} \end{bmatrix}   (1.24)

and observation equation

\begin{bmatrix} y_{1t} \\ y_{2t} \\ \vdots \\ y_{nt} \end{bmatrix}
=
\begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}
+
\begin{bmatrix}
\gamma_1 & 1 & 0 & \cdots & 0 \\
\gamma_2 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
\gamma_n & 0 & 0 & \cdots & 1
\end{bmatrix}
\begin{bmatrix} c_t \\ a_{1t} \\ a_{2t} \\ \vdots \\ a_{nt} \end{bmatrix}.   (1.25)

Thus γ_i is a parameter measuring the sensitivity of the ith series to the business
cycle. To allow for pth-order dynamics, Stock and Watson replaced c_t and a_{it} in
(1.23) with the (1 × p) vectors (c_t, c_{t−1}, ..., c_{t−p+1}) and (a_{it}, a_{i,t−1}, ..., a_{i,t−p+1}), so
that ξ_t is an [(n + 1)p × 1] vector. The scalars φ_i in (1.24) are then replaced by
(p × p) matrices F_i with the structure of (1.12), and blocks of zeros are added in
between the columns of H′ in the observation equation (1.25). A related theoretical
model was explored by Sargent (1989).
State-space models have seen many other applications in economics. For partial
surveys see Engle and Watson (1987), Harvey (1987), and Aoki (1987).

2. The Kalman filter

For convenience, the general form of a constant-parameter linear state-space model
is reproduced here as equations (2.1) and (2.2).

State equation:

ξ_{t+1} = F ξ_t + v_{t+1},   (2.1)

where ξ_t is (r × 1), F is (r × r), and E(v_{t+1} v′_{t+1}) = Q, an (r × r) matrix.

Observation equation:

y_t = A′x_t + H′ξ_t + w_t,   (2.2)

where y_t is (n × 1), A′ is (n × k), x_t is (k × 1), H′ is (n × r), w_t is (n × 1), and
E(w_t w′_t) = R, an (n × n) matrix.

Writing a model in state-space form means imposing certain values (such as
zero or one) on some of the elements of F, Q, A, H and R, and interpreting the
other elements as particular parameters of interest. Typically we will not know
the values of these other elements, but need to estimate them on the basis of
observation of {y₁, y₂, ..., y_T} and {x₁, x₂, ..., x_T}.

2.1. Overview of the Kalman filter

Before discussing estimation of parameters, it will be helpful first to assume that
the values of all of the elements of F, Q, A, H and R are known with certainty; the
question of estimation is postponed until Section 3. The filter named for the
contributions of Kalman (1960, 1963) can be described as an algorithm for
calculating an optimal forecast of the value of ξ_t on the basis of information
observed through date t − 1, assuming that the values of F, Q, A, H and R are all
known.
This optimal forecast is derived from a well-known result for normal variables
[see, for example, DeGroot (1970, p. 55)]. Let z₁ and z₂ denote (n₁ × 1) and (n₂ × 1)
vectors respectively that have a joint normal distribution:

\begin{bmatrix} z_1 \\ z_2 \end{bmatrix}
\sim N\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},
\begin{bmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{21} & \Omega_{22} \end{bmatrix} \right).

Then the distribution of z₂ conditional on z₁ is N(m, Σ) where

m = μ₂ + Ω₂₁Ω₁₁⁻¹(z₁ − μ₁),   (2.3)

Σ = Ω₂₂ − Ω₂₁Ω₁₁⁻¹Ω₁₂.   (2.4)

Thus the optimal forecast of z₂ conditional on having observed z₁ is given by

E(z₂ | z₁) = μ₂ + Ω₂₁Ω₁₁⁻¹(z₁ − μ₁),   (2.5)

with Σ characterizing the mean squared error of this forecast:

E{[z₂ − E(z₂|z₁)][z₂ − E(z₂|z₁)]′ | z₁} = Ω₂₂ − Ω₂₁Ω₁₁⁻¹Ω₁₂.   (2.6)


To apply this result, suppose that the initial value of the state vector (ξ₁) of a
state-space model is drawn from a normal distribution and that the disturbances
v_t and w_t are normal. Let the observed data obtained through date t − 1 be
summarized by the vector

𝒴_{t−1} ≡ (y′_{t−1}, y′_{t−2}, ..., y′₁, x′_{t−1}, x′_{t−2}, ..., x′₁)′.

Then the distribution of ξ_t conditional on 𝒴_{t−1} turns out to be normal for
t = 2, 3, ..., T. The mean of this conditional distribution is represented by the (r × 1)
vector ξ̂_{t|t−1} and the variance of this conditional distribution is represented by the
(r × r) matrix P_{t|t−1}. The Kalman filter is simply the result of applying (2.5) and
(2.6) to each observation in the sample in succession. The input for step t of the
iteration is the mean ξ̂_{t|t−1} and variance P_{t|t−1} that characterize the distribution
of ξ_t conditional on 𝒴_{t−1}. The output for step t is the mean ξ̂_{t+1|t} and variance
P_{t+1|t} of ξ_{t+1} conditional on 𝒴_t. Thus the output for step t is used as the input for
step t + 1.

2.2. Derivation of the Kalman filter

The iteration is started by assuming that the initial value of the state vector ξ_1 is
drawn from a normal distribution with mean denoted ξ̂_{1|0} and variance denoted
P_{1|0}. If the eigenvalues of F are all inside the unit circle, then the vector process
defined by (2.1) is stationary, and ξ̂_{1|0} would be the unconditional mean of this
process,

ξ̂_{1|0} = 0,   (2.7)

while P_{1|0} would be the unconditional variance¹

P_{1|0} = E(ξ_t ξ'_t).

This unconditional variance can be calculated from

vec(P_{1|0}) = [I_{r²} − (F ⊗ F)]^{−1}·vec(Q).   (2.8)

Here I_{r²} is the (r² × r²) identity matrix, ⊗ denotes the Kronecker product and

¹The unconditional variance of ξ_t can be found by postmultiplying (2.1) by its transpose and taking
expectations:

E(ξ_{t+1} ξ'_{t+1}) = E[(F ξ_t + v_{t+1})(ξ'_t F' + v'_{t+1})] = F·E(ξ_t ξ'_t)·F' + E(v_{t+1} v'_{t+1}).

If ξ_t is stationary, then E(ξ_{t+1} ξ'_{t+1}) = E(ξ_t ξ'_t) = P_{1|0}, and the above equation becomes

P_{1|0} = F P_{1|0} F' + Q.

Applying the vec operator to this equation and recalling [e.g. Magnus and Neudecker (1988, p. 30)]
that vec(ABC) = (C' ⊗ A)·vec(B) produces

vec(P_{1|0}) = (F ⊗ F)·vec(P_{1|0}) + vec(Q).



vec(P_{1|0}) is the (r² × 1) vector formed by stacking the columns of P_{1|0}, one on top
of the other, ordered from left to right.
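Equation (2.8) is straightforward to evaluate numerically. A minimal sketch (the function name and example matrices are ours) is:

```python
import numpy as np

def unconditional_state_variance(F, Q):
    """Solve vec(P) = [I - (F kron F)]^{-1} vec(Q), equation (2.8).

    Valid when all eigenvalues of F lie inside the unit circle.  With
    column-stacking vec, vec(F P F') = (F kron F) vec(P)."""
    r = F.shape[0]
    vecP = np.linalg.solve(np.eye(r * r) - np.kron(F, F), Q.flatten(order="F"))
    return vecP.reshape((r, r), order="F")

# Illustrative values (ours, not from the text)
F = np.array([[0.5, 0.1], [0.0, 0.3]])
Q = np.array([[1.0, 0.2], [0.2, 0.5]])
P10 = unconditional_state_variance(F, Q)
# Verify the discrete Lyapunov equation P = F P F' + Q from the footnote
assert np.allclose(P10, F @ P10 @ F.T + Q)
```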
For time-variant or nonstationary systems, ξ̂_{1|0} could represent a guess as to
the value of ξ_1 based on prior information, while P_{1|0} measures the uncertainty
associated with this guess; the greater our prior uncertainty, the larger the
diagonal elements of P_{1|0}. This prior cannot be based on the data, since it is
assumed in the derivations to follow that v_{t+1} and w_t are independent of ξ_1 for
t = 1, 2, ..., T. The algorithm described below can also be adapted for the case of
a completely diffuse prior (the limiting case when P_{1|0} becomes infinite), as described
by Ansley and Kohn (1985), Kohn and Ansley (1986) and De Jong (1988, 1989,
1991).
At this point we have described the values of ξ̂_{t|t−1} and P_{t|t−1} that characterize
the distribution of ξ_t conditional on 𝒴_{t−1} for t = 1. Since a similar set of calculations
will be used for each date t in the sample, it is helpful to describe the next step
using notation appropriate for an arbitrary date t. Thus let us assume that the
values of ξ̂_{t|t−1} and P_{t|t−1} have been calculated for some t, and undertake the task
of using these to evaluate ξ̂_{t+1|t} and P_{t+1|t}. If the distribution of ξ_t conditional on
𝒴_{t−1} is N(ξ̂_{t|t−1}, P_{t|t−1}), then under the assumptions about x_t, this is the same as
the distribution of ξ_t conditional on 𝒴_{t−1} and x_t. Since w_t is independent of x_t and
𝒴_{t−1}, the forecast of y_t conditional on 𝒴_{t−1} and x_t can be inferred immediately
from (2.2):

E(y_t | x_t, 𝒴_{t−1}) = A'x_t + H'ξ̂_{t|t−1}.   (2.9)

From (2.2) and (2.9) the forecast error can be written

y_t − E(y_t | x_t, 𝒴_{t−1}) = (A'x_t + H'ξ_t + w_t) − (A'x_t + H'ξ̂_{t|t−1})
                        = H'(ξ_t − ξ̂_{t|t−1}) + w_t.   (2.10)

Since ξ̂_{t|t−1} is a function of 𝒴_{t−1}, the term w_t is independent of both ξ_t and ξ̂_{t|t−1}.
Thus the conditional variance of (2.10) is

E{[y_t − E(y_t|x_t, 𝒴_{t−1})][y_t − E(y_t|x_t, 𝒴_{t−1})]' | x_t, 𝒴_{t−1}}
  = H'·E{[ξ_t − ξ̂_{t|t−1}][ξ_t − ξ̂_{t|t−1}]' | 𝒴_{t−1}}·H + E(w_t w'_t)
  = H'P_{t|t−1}H + R.

Similarly, the conditional covariance between (2.10) and the error in forecasting
the state vector is²

E{[ξ_t − ξ̂_{t|t−1}][y_t − E(y_t|x_t, 𝒴_{t−1})]' | x_t, 𝒴_{t−1}} = P_{t|t−1}H.

Thus the distribution of the vector (y'_t, ξ'_t)' conditional on x_t and 𝒴_{t−1} is

$$
\begin{bmatrix} y_t \\ \xi_t \end{bmatrix} \Big| \, x_t, \mathcal{Y}_{t-1} \sim N\!\left( \begin{bmatrix} A'x_t + H'\hat{\xi}_{t|t-1} \\ \hat{\xi}_{t|t-1} \end{bmatrix}, \begin{bmatrix} H'P_{t|t-1}H + R & H'P_{t|t-1} \\ P_{t|t-1}H & P_{t|t-1} \end{bmatrix} \right). \qquad (2.11)
$$

It then follows from (2.3) and (2.4) that ξ_t | 𝒴_t = ξ_t | x_t, y_t, 𝒴_{t−1} is distributed N(ξ̂_{t|t}, P_{t|t})
where

ξ̂_{t|t} = ξ̂_{t|t−1} + P_{t|t−1}H(H'P_{t|t−1}H + R)^{−1}(y_t − A'x_t − H'ξ̂_{t|t−1}),   (2.12)

P_{t|t} = P_{t|t−1} − P_{t|t−1}H(H'P_{t|t−1}H + R)^{−1}H'P_{t|t−1}.   (2.13)

²Meinhold and Singpurwalla (1983) gave a nice description of the Kalman filter from a Bayesian
perspective.

The final step is to calculate a forecast of ξ_{t+1} conditional on 𝒴_t. It is not hard
to see from (2.1) that ξ_{t+1} | 𝒴_t ~ N(ξ̂_{t+1|t}, P_{t+1|t}) where

ξ̂_{t+1|t} = F ξ̂_{t|t},   (2.14)

P_{t+1|t} = F P_{t|t} F' + Q.   (2.15)

Substituting (2.12) into (2.14) and (2.13) into (2.15), we have

ξ̂_{t+1|t} = F ξ̂_{t|t−1} + F P_{t|t−1}H(H'P_{t|t−1}H + R)^{−1}(y_t − A'x_t − H'ξ̂_{t|t−1}),   (2.16)

P_{t+1|t} = F P_{t|t−1}F' − F P_{t|t−1}H(H'P_{t|t−1}H + R)^{−1}H'P_{t|t−1}F' + Q.   (2.17)

To summarize, the Kalman filter is an algorithm for calculating the sequences
{ξ̂_{t+1|t}}_{t=1}^T and {P_{t+1|t}}_{t=1}^T, where ξ̂_{t+1|t} denotes the optimal forecast of ξ_{t+1} based
on observation of (y_t, y_{t−1}, ..., y_1, x_t, x_{t−1}, ..., x_1) and P_{t+1|t} denotes the mean
squared error of this forecast. The filter is implemented by iterating on (2.16) and
(2.17) for t = 1, 2, ..., T. If the eigenvalues of F are all inside the unit circle and
there is no prior information about the initial value of the state vector, this iteration
is started using equations (2.7) and (2.8).
Note that the sequence {P_{t+1|t}}_{t=1}^T is not a function of the data and can be
evaluated without calculating the forecasts {ξ̂_{t+1|t}}_{t=1}^T. Because P_{t+1|t} is not a
function of the data, the conditional expectation of the squared forecast error is
the same as its unconditional expectation,

E{[ξ_{t+1} − ξ̂_{t+1|t}][ξ_{t+1} − ξ̂_{t+1|t}]' | x_{t+1}, 𝒴_t} = E{[ξ_{t+1} − ξ̂_{t+1|t}][ξ_{t+1} − ξ̂_{t+1|t}]'} = P_{t+1|t}.

This equivalence is a consequence of having assumed normal distributions with
constant variances for v_t and w_t.
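A compact implementation of the recursion (2.12)-(2.17) may help fix ideas. The following Python sketch is ours, not from the original chapter, and the function and argument names are assumptions:

```python
import numpy as np

def kalman_filter(y, x, F, Q, A, H, R, xi10, P10):
    """Iterate on (2.12)-(2.17) for t = 1, ..., T.

    Model: xi_{t+1} = F xi_t + v_{t+1},  y_t = A'x_t + H'xi_t + w_t,
    with A (k x n) and H (r x n).  y is (T, n), x is (T, k); returns the
    one-step-ahead forecasts xi_{t|t-1} (T, r) and their MSEs (T, r, r)."""
    T, r = y.shape[0], F.shape[0]
    xi_pred = np.zeros((T, r))
    P_pred = np.zeros((T, r, r))
    xi, P = xi10, P10
    for t in range(T):
        xi_pred[t], P_pred[t] = xi, P
        err = y[t] - A.T @ x[t] - H.T @ xi     # forecast error from (2.9)
        S = H.T @ P @ H + R                    # its MSE, H'PH + R
        K = P @ H @ np.linalg.inv(S)           # gain, P_{t|t-1} H S^{-1}
        xi_filt = xi + K @ err                 # (2.12)
        P_filt = P - K @ H.T @ P               # (2.13)
        xi = F @ xi_filt                       # (2.14), equivalently (2.16)
        P = F @ P_filt @ F.T + Q               # (2.15), equivalently (2.17)
    return xi_pred, P_pred
```

In an application one would also store the filtered quantities ξ̂_{t|t} and P_{t|t}, which the smoothing recursions of Section 2.4 require.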

2.3. Forecasting with the Kalman filter

An m-period-ahead forecast of the state vector can be calculated from (1.5):

ξ̂_{t+m|t} = E(ξ_{t+m} | y_t, y_{t−1}, ..., y_1, x_t, x_{t−1}, ..., x_1) = F^m ξ̂_{t|t}.   (2.18)

The error of this forecast can be found by subtracting (2.18) from (1.5),

ξ_{t+m} − ξ̂_{t+m|t} = F^m(ξ_t − ξ̂_{t|t}) + F^{m−1}v_{t+1} + F^{m−2}v_{t+2} + ··· + F v_{t+m−1} + v_{t+m},

from which it follows that the mean squared error of the forecast (2.18) is

P_{t+m|t} = E[(ξ_{t+m} − ξ̂_{t+m|t})(ξ_{t+m} − ξ̂_{t+m|t})']
         = F^m P_{t|t}(F^m)' + F^{m−1}Q(F^{m−1})' + F^{m−2}Q(F^{m−2})' + ··· + FQF' + Q.   (2.19)

These results can also be used to describe m-period-ahead forecasts of the
observed vector y_{t+m}, provided that {x_t} is deterministic. Applying the law of
iterated expectations to (1.7) results in

ŷ_{t+m|t} = E(y_{t+m} | y_t, y_{t−1}, ..., y_1) = A'x_{t+m} + H'F^m ξ̂_{t|t}.   (2.20)

The error of this forecast is

y_{t+m} − ŷ_{t+m|t} = (A'x_{t+m} + H'ξ_{t+m} + w_{t+m}) − (A'x_{t+m} + H'F^m ξ̂_{t|t})
                  = H'(ξ_{t+m} − ξ̂_{t+m|t}) + w_{t+m},

with mean squared error

E[(y_{t+m} − ŷ_{t+m|t})(y_{t+m} − ŷ_{t+m|t})'] = H'P_{t+m|t}H + R.   (2.21)
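Equations (2.18)-(2.21) translate directly into code. The following sketch (names ours; it assumes {x_t} deterministic, as the text requires) computes the forecasts and their mean squared errors:

```python
import numpy as np
from numpy.linalg import matrix_power

def m_step_forecast(m, xi_filt, P_filt, F, Q, A, H, R, x_future):
    """m-period-ahead forecasts (2.18)-(2.21) from xi_{t|t} and P_{t|t}."""
    Fm = matrix_power(F, m)
    xi_fc = Fm @ xi_filt                      # (2.18)
    P_fc = Fm @ P_filt @ Fm.T                 # start of (2.19)
    for j in range(m):                        # add F^j Q (F^j)' for j = 0..m-1
        Fj = matrix_power(F, j)
        P_fc += Fj @ Q @ Fj.T
    y_fc = A.T @ x_future + H.T @ xi_fc       # (2.20)
    y_mse = H.T @ P_fc @ H + R                # (2.21)
    return y_fc, y_mse
```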

2.4. Smoothed inference

Up to this point we have been concerned with a forecast of the value of the state
vector at date t based on information available at date t − 1, denoted ξ̂_{t|t−1}, or
with an inference about the value of the state vector at date t based on currently
available information, denoted ξ̂_{t|t}. In some applications the value of the state vector
is of interest in its own right. In the example of Fama and Gibbons, the state
vector tells us about the public's expectations of inflation, while in the example of
Stock and Watson, it tells us about the overall condition of the economy. In such
cases it is desirable to use information through the end of the sample (date T) to
help improve the inference about the historical value that the state vector took
on at any particular date t in the middle of the sample. Such an inference is known
as a smoothed estimate, denoted ξ̂_{t|T} = E(ξ_t | 𝒴_T). The mean squared error of this
estimate is denoted P_{t|T} = E[(ξ_t − ξ̂_{t|T})(ξ_t − ξ̂_{t|T})'].
The smoothed estimates can be calculated as follows. First we run the data
through the Kalman filter, storing the sequences {P_{t|t}}_{t=1}^T and {P_{t|t−1}}_{t=1}^T as
calculated from (2.13) and (2.15) and storing the sequences {ξ̂_{t|t}}_{t=1}^T and {ξ̂_{t|t−1}}_{t=1}^T
as calculated from (2.12) and (2.14). The terminal value of {ξ̂_{t|t}}_{t=1}^T then gives the
smoothed estimate for the last date in the sample, ξ̂_{T|T}, and P_{T|T} is its mean squared
error.
The sequence of smoothed estimates {ξ̂_{t|T}}_{t=1}^T is then calculated in reverse order
by iterating on

ξ̂_{t|T} = ξ̂_{t|t} + J_t(ξ̂_{t+1|T} − ξ̂_{t+1|t})   (2.22)

for t = T − 1, T − 2, ..., 1, where J_t = P_{t|t}F'P_{t+1|t}^{−1}. The corresponding mean squared
errors are similarly found by iterating on

P_{t|T} = P_{t|t} + J_t(P_{t+1|T} − P_{t+1|t})J'_t   (2.23)

in reverse order for t = T − 1, T − 2, ..., 1; see for example Hamilton (1994, Section
13.6).
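A sketch of the backward pass (2.22)-(2.23) follows; it consumes the filtered and one-step-ahead sequences that the filter stores, indexed here from t = 0. The implementation and names are ours:

```python
import numpy as np

def kalman_smoother(xi_filt, P_filt, xi_pred, P_pred, F):
    """Backward recursions (2.22)-(2.23).

    xi_filt[t] = xi_{t|t}, P_filt[t] = P_{t|t} (from (2.12)-(2.13));
    xi_pred[t] = xi_{t|t-1}, P_pred[t] = P_{t|t-1} (from (2.14)-(2.15)),
    so xi_pred[t+1] holds xi_{t+1|t}."""
    T = xi_filt.shape[0]
    xi_sm, P_sm = xi_filt.copy(), P_filt.copy()
    for t in range(T - 2, -1, -1):
        J = P_filt[t] @ F.T @ np.linalg.inv(P_pred[t + 1])  # J_t = P_{t|t} F' P_{t+1|t}^{-1}
        xi_sm[t] = xi_filt[t] + J @ (xi_sm[t + 1] - xi_pred[t + 1])    # (2.22)
        P_sm[t] = P_filt[t] + J @ (P_sm[t + 1] - P_pred[t + 1]) @ J.T  # (2.23)
    return xi_sm, P_sm
```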

2.5. Interpretation of the Kalman filter with non-normal disturbances

In motivating the Kalman filter, the assumption was made that v_t and w_t were
normal. Under this assumption, ξ̂_{t|t−1} is the function of 𝒴_{t−1} that minimizes

E[(ξ_t − ξ̂_{t|t−1})(ξ_t − ξ̂_{t|t−1})'],   (2.24)

in the sense that any other forecast has a mean squared error matrix that differs
from that of ξ̂_{t|t−1} by a positive semidefinite matrix. This optimal forecast turned
out to be a constant plus a linear function of 𝒴_{t−1}. The minimum value achieved
for (2.24) was denoted P_{t|t−1}.
If v_t and w_t are not normal, one can pose a related problem of choosing ξ̂_{t|t−1}
to be a constant plus a linear function of 𝒴_{t−1} that minimizes (2.24). The solution
to this problem turns out to be given by the Kalman filter iteration (2.16) and its
unconditional mean squared error is still given by (2.17). Similarly, when the
disturbances are not normal, expression (2.20) can be interpreted as the linear
projection of y_{t+m} on 𝒴_t and a constant, with (2.21) its unconditional mean squared
error. Thus, while the Kalman filter forecasts need no longer be optimal for systems
that are not normal, no other forecast based on a linear function of 𝒴_t will have
a smaller mean squared error [see Anderson and Moore (1979, pp. 92-98) or
Hamilton (1994, Section 13.2)]. These results parallel the Gauss-Markov theorem
for ordinary least squares regression.

2.6. Time-varying coefficient models

The analysis above treated the coefficients of the matrices F, Q, A, H and R as
known constants. An interesting generalization obtains if these are known functions
of x_t:

ξ_{t+1} = F(x_t)ξ_t + v_{t+1},   (2.25)
E(v_{t+1}v'_{t+1} | x_t, 𝒴_{t−1}) = Q(x_t),

y_t = a(x_t) + [H(x_t)]'ξ_t + w_t,   (2.26)
E(w_t w'_t | x_t, 𝒴_{t−1}) = R(x_t).

Here F(·), Q(·), H(·) and R(·) denote matrix-valued functions of x_t and a(·) is an
(n × 1) vector-valued function of x_t. As before, we assume that, apart from the
possible conditional heteroskedasticity allowed in (2.25) and (2.26), x_t provides no
information about ξ_t or w_t for any t beyond that contained in 𝒴_{t−1}.
Even if v_t and w_t are normal, with x_t stochastic the unconditional distributions
of ξ_t and y_t are no longer normal. However, the system is conditionally normal
in the following sense.³ Suppose that the distribution of ξ_t conditional on 𝒴_{t−1} is
taken to be N(ξ̂_{t|t−1}, P_{t|t−1}). Then ξ_t conditional on x_t and 𝒴_{t−1} has the same
distribution. Moreover, conditional on x_t, all of the matrices can be treated as
deterministic. Hence the derivation of the Kalman filter goes through essentially
as before, with the recursions (2.16) and (2.17) replaced with

ξ̂_{t+1|t} = F(x_t)ξ̂_{t|t−1} + F(x_t)P_{t|t−1}H(x_t){[H(x_t)]'P_{t|t−1}H(x_t) + R(x_t)}^{−1}
          × {y_t − a(x_t) − [H(x_t)]'ξ̂_{t|t−1}},   (2.27)

P_{t+1|t} = F(x_t)P_{t|t−1}[F(x_t)]' − {F(x_t)P_{t|t−1}H(x_t){[H(x_t)]'P_{t|t−1}H(x_t) + R(x_t)}^{−1}
          × [H(x_t)]'P_{t|t−1}[F(x_t)]'} + Q(x_t).   (2.28)

³See Theorem 6.1 in Tjøstheim (1986) for further discussion.



It is worth noting three elements of the earlier discussion that change with
time-varying parameter matrices. First, the distribution calculated for the initial
state in (2.7) and (2.8) is only valid if F and Q are fixed matrices. Second,
m-period-ahead forecasts of y_{t+m} or ξ_{t+m} for m > 1 are no longer simple to
calculate when F, H or A vary stochastically; Doan et al. (1984) suggested approximating
E(y_{t+2} | y_t, y_{t−1}, ..., y_1) with E(y_{t+2} | ŷ_{t+1}, y_t, ..., y_1) evaluated at ŷ_{t+1} =
E(y_{t+1} | y_t, y_{t−1}, ..., y_1). Finally, if v_t and w_t are not normal, then the one-period-
ahead forecasts ξ̂_{t+1|t} and ŷ_{t+1|t} no longer have the interpretation as linear
projections, since (2.27) is nonlinear in x_t.
An important application of a state-space representation with data-dependent
parameter matrices is the time-varying coefficient regression model

y_t = x'_t β_t + w_t.   (2.29)

Here β_t is a vector of regression coefficients that is assumed to evolve over time
according to

β_{t+1} − β̄ = F(β_t − β̄) + v_{t+1}.   (2.30)

Assuming the eigenvalues of F are all inside the unit circle, β̄ has the interpretation
as the average or steady-state coefficient vector. Equation (2.30) will be recognized
as a state equation of the form of (2.1) with ξ_t = (β_t − β̄). Equation (2.29) can then
be written as

y_t = x'_t β̄ + x'_t ξ_t + w_t,   (2.31)

which is in the form of the observation equation (2.26) with a(x_t) = x'_t β̄ and
[H(x_t)]' = x'_t. Higher-order dynamics for β_t are easily incorporated by, instead,
defining ξ'_t = [(β_t − β̄)', (β_{t−1} − β̄)', ..., (β_{t−p+1} − β̄)'] as in Nicholls and Pagan
(1985, p. 437).
Excellent surveys of time-varying parameter regressions include Raj and Ullah
(1981), Chow (1984) and Nicholls and Pagan (1985). Applications to vector
autoregressions have been explored by Sims (1982) and Doan et al. (1984).

2.7. Other extensions

The derivations above assumed no correlation between v_t and w_t, though this is
straightforward to generalize; see, for example, Anderson and Moore (1979, p. 108).
Predetermined or exogenous variables can also be added to the state equation with
few adjustments.
The Kalman filter is a very convenient algorithm for handling missing
observations. If y_t is unobserved for some date t, one can simply skip the updating
equations (2.12) and (2.13) for that date and replace them with ξ̂_{t|t} = ξ̂_{t|t−1} and
P_{t|t} = P_{t|t−1}; see Jones (1980), Harvey and Pierse (1984) and Kohn and Ansley
(1986) for further discussion. Modifications of the Kalman filtering and smoothing
algorithms to allow for singular or infinite P_{1|0} are described in De Jong (1989, 1991).

3. Statistical inference about unknown parameters using the Kalman filter

3.1. Maximum likelihood estimation

The calculations described in Section 2 are implemented by computer, using the
known numerical values for the coefficients in the matrices F, Q, A, H and R.
When the values of the matrices are unknown we can proceed as follows. Collect
the unknown elements of these matrices in a vector θ. For example, to estimate
the ARMA(p, p − 1) process (1.15)-(1.16), θ = (φ_1, φ_2, ..., φ_p, θ_1, θ_2, ..., θ_{p−1}, μ, σ)'.
Make an arbitrary initial guess as to the value of θ, denoted θ^{(0)}, and calculate
the sequences {ξ̂_{t|t−1}(θ^{(0)})}_{t=1}^T and {P_{t|t−1}(θ^{(0)})}_{t=1}^T that result from this value in
(2.16) and (2.17). Recall from (2.11) that if the data were really generated from the
model (2.1)-(2.2) with this value of θ, then

y_t | x_t, 𝒴_{t−1}; θ^{(0)} ~ N(μ_t(θ^{(0)}), Σ_t(θ^{(0)})),   (3.1)

where

μ_t(θ^{(0)}) = [A(θ^{(0)})]'x_t + [H(θ^{(0)})]'ξ̂_{t|t−1}(θ^{(0)}),   (3.2)

Σ_t(θ^{(0)}) = [H(θ^{(0)})]'[P_{t|t−1}(θ^{(0)})][H(θ^{(0)})] + R(θ^{(0)}).   (3.3)

The value of the log likelihood is then

Σ_{t=1}^T log f(y_t | x_t, 𝒴_{t−1}; θ^{(0)}) = −(Tn/2)·log(2π) − ½ Σ_{t=1}^T log|Σ_t(θ^{(0)})|
    − ½ Σ_{t=1}^T [y_t − μ_t(θ^{(0)})]'[Σ_t(θ^{(0)})]^{−1}[y_t − μ_t(θ^{(0)})],   (3.4)

which reflects how likely it would have been to have observed the data if θ^{(0)} were
the true value for θ. We then make an alternative guess θ^{(1)} so as to try to achieve
a bigger value of (3.4), and proceed to maximize (3.4) with respect to θ by numerical
methods such as those described in Quandt (1983), Nash and Walker-Smith (1987)
or Hamilton (1994, Section 5.7).
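The combination of one filter pass per likelihood evaluation and a generic numerical optimizer is easy to sketch. The following Python outline is ours; the model-specific mapping build_matrices from θ into the state-space matrices is a hypothetical helper:

```python
import numpy as np
from scipy.optimize import minimize

def negative_log_likelihood(theta, y, x, build_matrices):
    """Gaussian log likelihood (3.4) evaluated by one pass of the Kalman filter.

    build_matrices(theta) -> (F, Q, A, H, R, xi10, P10) is model-specific."""
    F, Q, A, H, R, xi, P = build_matrices(theta)
    T, n = y.shape
    ll = -0.5 * T * n * np.log(2 * np.pi)
    for t in range(T):
        mu = A.T @ x[t] + H.T @ xi                  # (3.2)
        S = H.T @ P @ H + R                         # (3.3)
        err = y[t] - mu
        _, logdet = np.linalg.slogdet(S)
        ll -= 0.5 * (logdet + err @ np.linalg.solve(S, err))
        K = P @ H @ np.linalg.inv(S)
        xi = F @ (xi + K @ err)                     # (2.16)
        P = F @ (P - K @ H.T @ P) @ F.T + Q         # (2.17)
    return -ll

# result = minimize(negative_log_likelihood, theta0,
#                   args=(y, x, build_matrices), method="BFGS")
```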
Many numerical optimization techniques require the gradient vector, or the
derivative of (3.4) with respect to θ. The derivative with respect to the ith element
of θ could be calculated numerically by making a small change in the ith element
of θ, recomputing (3.4), and dividing the change in the log likelihood by the size
of the change in the parameter.

The vector of parameters to be estimated is θ = (β', θ_1, θ_2, σ)'. By making an
arbitrary guess⁴ at the value of θ, we can calculate the sequences {ξ̂_{t|t−1}(θ)}_{t=1}^T
and {P_{t|t−1}(θ)}_{t=1}^T in (2.16) and (2.17). The starting value for (2.16) is the
unconditional mean of ξ_1,

$$
\hat{\xi}_{1|0} = E\begin{bmatrix} \varepsilon_1 \\ \varepsilon_0 \\ \varepsilon_{-1} \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix},
$$

while (2.17) is started with the unconditional variance,

$$
P_{1|0} = E\left\{ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_0 \\ \varepsilon_{-1} \end{bmatrix} \begin{bmatrix} \varepsilon_1 & \varepsilon_0 & \varepsilon_{-1} \end{bmatrix} \right\} = \begin{bmatrix} \sigma^2 & 0 & 0 \\ 0 & \sigma^2 & 0 \\ 0 & 0 & \sigma^2 \end{bmatrix}.
$$
From these sequences, μ_t(θ) and Σ_t(θ) can be calculated from (3.2) and (3.3), and
(3.4) then provides

log f(y_T, y_{T−1}, ..., y_1 | x_T, x_{T−1}, ..., x_1; θ).   (3.7)

Note that this calculation gives the exact log likelihood, not an approximation,
and is valid regardless of whether θ_1 and θ_2 are associated with an invertible
MA(2) representation. The parameter estimates β̂, θ̂_1, θ̂_2 and σ̂ are the values that
make (3.7) as large as possible.

3.2. Identification

The maximum likelihood estimation procedure just described presupposes that
the model is identified; that is, it assumes that a change in any of the parameters
would imply a different probability distribution for {y_t}_{t=1}^T.
One approach to checking for identification is to rewrite the state-space model
in an alternative form that is better known to econometricians. For example, since
the state-space model (1.15)-(1.16) is just another way of writing an ARMA(p, p − 1)
process, the unknown parameters (φ_1, ..., φ_p, θ_1, ..., θ_{p−1}, μ, σ) can be consistently
estimated provided that the roots of (1 + θ_1 z + θ_2 z² + ··· + θ_{p−1} z^{p−1}) = 0 are
normalized to lie on or outside the unit circle, and are distinct from the roots of
(1 − φ_1 z − φ_2 z² − ··· − φ_p z^p) = 0 (assuming these to lie outside the unit circle as
well). An illustration of this general idea is provided in Hamilton (1985). As another

⁴Numerical algorithms are usually much better behaved if an intelligent initial guess for θ is used.
A good way to proceed in this instance is to use OLS estimates of (3.5) to calculate an initial guess
for β, and use the estimated variance s² and autocorrelations ρ̂_1 and ρ̂_2 of the OLS residuals to
construct initial guesses for θ_1, θ_2 and σ using the results in Box and Jenkins (1976, pp. 187 and 519).

example, the time-varying coefficient regression model (2.31) can be written

y_t = x'_t β̄ + u_t,   (3.8)

where

u_t = x'_t ξ_t + w_t.

If x_t is deterministic, equation (3.8) describes a generalized least squares regression
model in which the variance-covariance matrix of the residuals can be inferred
from the state equation describing ξ_t. Thus, assuming that eigenvalues of F are all
inside the unit circle, β̄ can be estimated consistently as long as (1/T)Σ_{t=1}^T x_t x'_t
converges to a nonsingular matrix; other parameters can be consistently estimated
if higher moments of x_t satisfy certain conditions [see Nicholls and Pagan (1985,
p. 431)].
The question of identification has also been extensively investigated in the
literature on linear systems; see Gevers and Wertz (1984) and Wall (1987) for a
survey of some of the approaches, and Burmeister et al. (1986) for an illustration
of how these results can be applied.

3.3. Asymptotic properties of maximum likelihood estimates

Under suitable conditions, the estimate θ̂ that maximizes (3.4) is consistent and
asymptotically normal. Typical conditions require θ to be identified, eigenvalues
of F to be inside the unit circle, the exogenous variable x_t to behave asymptotically
like a full rank linearly nondeterministic covariance-stationary process, and the
true value of θ to not fall on the boundary of the allowable parameter space; see
Caines (1988, Chapter 7) for a thorough discussion. Pagan (1980, Theorem 4) and
Ghosh (1989) demonstrated that for particular examples of state-space models

√T(θ̂_T − θ_0) →(L) N(0, [ℐ_{2D,T}]^{−1}),   (3.9)

where ℐ_{2D,T} is the information matrix for a sample of size T as calculated from
second derivatives of the log likelihood function:

ℐ_{2D,T} = −(1/T)·E[∂² log f(y_T, ..., y_1 | x_T, ..., x_1; θ)/∂θ ∂θ']|_{θ=θ_0}.   (3.10)

Engle and Watson (1981) showed that the row i, column j element of ℐ_{2D,T} is
given by

(1/T) Σ_{t=1}^T E{[∂μ_t/∂θ_i]'Σ_t^{−1}[∂μ_t/∂θ_j] + ½·tr[Σ_t^{−1}(∂Σ_t/∂θ_i)Σ_t^{−1}(∂Σ_t/∂θ_j)]}.   (3.11)

One option is to estimate (3.10) by (3.11) with the expectation operator dropped
from (3.11). Another common practice is to assume that the limit of ℐ_{2D,T} as
T → ∞ is the same as the plim of

ℐ̂_{2D} = −(1/T) Σ_{t=1}^T [∂² log f(y_t | x_t, 𝒴_{t−1}; θ)/∂θ ∂θ']|_{θ=θ̂},   (3.12)

which can be calculated analytically or numerically by differentiating (3.4). Reported
standard errors for θ̂ are then square roots of diagonal elements of (1/T)(ℐ̂_{2D})^{−1}.
It was noted above that the Kalman filter can be motivated by linear projection
arguments even without normal distributions. It is thus of interest to consider, as
in White (1982), what happens if we use as an estimate of θ the value that maximizes
(3.4), even though the true distribution is not normal. Under certain conditions
such quasi-maximum likelihood estimates give consistent and asymptotically
normal estimates of the true value of θ, with

√T(θ̂_T − θ_0) →(L) N(0, [ℐ_{2D}]^{−1}ℐ_{OP}[ℐ_{2D}]^{−1}),   (3.13)

where ℐ_{2D} is the plim of (3.12) when evaluated at the true value θ_0 and ℐ_{OP} is
the plim of (1/T)Σ_{t=1}^T [s_t(θ_0)][s_t(θ_0)]', where

s_t(θ_0) = [∂ log f(y_t | x_t, 𝒴_{t−1}; θ)/∂θ]|_{θ=θ_0}.

An important hypothesis test for which (3.9) clearly is not valid is testing the
constancy of regression coefficients [see Tanaka (1983) and Watson and Engle
(1985)]. One can think of the constant-coefficient model as being embedded as a
special case of (2.30) and (2.31) in which E(v_{t+1}v'_{t+1}) = 0 and β_1 = β̄. However,
such a specification violates two of the conditions for asymptotic normality
mentioned above. First, under the null hypothesis Q falls on the boundary of the
allowable parameter space. Second, the parameters of F are unidentified under the
null. Watson and Engle (1985) proposed an appropriate test based on the general
procedure of Davies (1977). The results in Davies have recently been extended by
Hansen (1993). Given the computational demands of these tests, Nicholls and
Pagan (1985, p. 429) recommended Lagrange multiplier tests for heteroskedasticity
based on OLS estimation of the constant-parameter model as a useful practical
approach. Other approaches are described in Nabeya and Tanaka (1988) and
Leybourne and McCabe (1989).

3.4. Confidence intervals for smoothed estimates and forecasts

Let ξ̂_{t|T}(θ_0) denote the optimal inference about ξ_t conditional on observation of
all data through date T, assuming that θ_0 is known. Thus, for t ≤ T, ξ̂_{t|T}(θ_0) is the
smoothed inference (2.22), while for t > T, ξ̂_{t|T}(θ_0) is the forecast (2.18). If θ_0 were
known with certainty, the mean squared error of this inference, denoted P_{t|T}(θ_0),
would be given by (2.23) for t ≤ T and (2.19) for t > T.
In the case where the true value of θ is unknown, this optimal inference is
approximated by ξ̂_{t|T}(θ̂) for θ̂ the maximum likelihood estimate. To describe the
consequences of this, it is convenient to adopt the Bayesian perspective that θ
is a random variable. Conditional on having observed all the data 𝒴_T, the posterior
distribution might be approximated by

θ | 𝒴_T ~ N(θ̂, (1/T)(ℐ̂)^{−1}).   (3.14)

The mean squared error of the inference based on θ̂ can then be approximated by

E_{θ|𝒴_T}{P_{t|T}(θ)} + E_{θ|𝒴_T}{[ξ̂_{t|T}(θ) − ξ̂_{t|T}(θ̂)][ξ̂_{t|T}(θ) − ξ̂_{t|T}(θ̂)]'},

where E_{θ|𝒴_T}(·) denotes the expectation of (·) with respect to the distribution in
(3.14). Thus the mean squared error of an inference based on estimated parameters
is the sum of two terms. The first term can be written as E_{θ|𝒴_T}{P_{t|T}(θ)}, and might
be described as filter uncertainty. A convenient way to calculate this would
be to generate, say, 10,000 Monte Carlo draws of θ from the distribution (3.14),
run through the Kalman filter iterations implied by each draw, and calculate
the average value of P_{t|T}(θ) across draws. The second term, which might be
described as parameter uncertainty, could be estimated from the outer product of
[ξ̂_{t|T}(θ_i) − ξ̂_{t|T}(θ̂)] with itself for the ith Monte Carlo draw, again averaging
across Monte Carlo realizations.
Similar corrections to (2.21) can be used to generate a mean squared error for
the forecast of y_{t+m} in (2.20).
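The Monte Carlo scheme just described can be sketched as follows. The code and its names are ours; smoothed_inference is a hypothetical helper that runs the filter and smoother for a given θ and returns the inference and its MSE for the date of interest:

```python
import numpy as np

def mse_with_parameter_uncertainty(theta_hat, V_hat, smoothed_inference,
                                   ndraws=10000):
    """Monte Carlo evaluation of the two MSE components of Section 3.4.

    theta_hat, V_hat: mean and variance of the posterior (3.14);
    smoothed_inference(theta) -> (xi_tT, P_tT) for the date of interest."""
    rng = np.random.default_rng(0)
    xi_hat, _ = smoothed_inference(theta_hat)
    filter_unc, param_unc = 0.0, 0.0
    for _ in range(ndraws):
        theta_i = rng.multivariate_normal(theta_hat, V_hat)  # draw from (3.14)
        xi_i, P_i = smoothed_inference(theta_i)
        filter_unc += P_i / ndraws                 # average of P_t|T(theta)
        d = (xi_i - xi_hat)[:, None]
        param_unc += (d @ d.T) / ndraws            # outer-product term
    return filter_unc + param_unc
```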

3.5. Empirical application - an analysis of the real interest rate

As an illustration of these methods, consider Fama and Gibbons's (1982) real
interest rate example discussed in equations (1.21) and (1.22). Let y_t = i_t − π_t denote
[Figure 1. Top panel: ex post real interest rate for the United States, quarterly from 1960:I to 1992:III,
quoted at an annual rate. Middle panel: filter uncertainty; solid line: P_{t|t}(θ̂); dashed line: P_{t|T}(θ̂).
Bottom panel: smoothed inferences ξ̂_{t|T}(θ̂) along with 95 percent confidence intervals.]

the observed ex post real interest rate, where i_t is the nominal interest rate on
3-month U.S. Treasury bills for the third month of quarter t (expressed at an
annual rate) and π_t is the inflation rate between the third month of quarter t and
the third month of quarter t + 1, measured as 400 times the change in the natural
logarithm of the consumer price index. Quarterly data for y_t are plotted in the
top panel of Figure 1 for t = 1960:I to 1992:III.
The maximum likelihood estimates for the parameters of this model are as
follows, with standard errors in parentheses:

ξ_t = 0.914 ξ_{t−1} + v_t,   σ̂_v = 0.977,
     (0.041)                (0.177)

y_t = 1.43 + ξ_t + w_t,   σ̂_w = 1.34.
     (0.93)               (0.14)

Here the state variable ξ_t = i_t − π^e_t − μ has the interpretation as the deviation of
the unobserved ex ante real interest rate from its population mean μ.
Even if the population parameter vector θ = (φ, σ_v, μ, σ_w)' were known with
certainty, the econometrician still would not know the value of the ex ante real
interest rate, since the market's expected inflation π^e_t is unobserved. However, the
econometrician can make an educated guess as to the value of ξ_t based on
observations of the ex post real rate through date t, treating the maximum
likelihood estimate θ̂ as if known with certainty. This guess is the magnitude ξ̂_{t|t}(θ̂),
and its mean squared error P_{t|t}(θ̂) is plotted as the solid line in the middle panel
of Figure 1. The mean squared error quickly asymptotes to a steady-state value,
which is a fixed constant owing to the stationarity of the process.
The middle panel of Figure 1 also plots the mean squared error for the smoothed
inference, P_{t|T}(θ̂). For observations in the middle of the sample this is essentially
the mean squared error (MSE) of a projection of ξ_t on the doubly-infinite sequence
of observations {..., y_{t−1}, y_t, y_{t+1}, ...}. The mean squared error for the smoothed
inference is slightly higher for observations near the beginning of the sample (for
which the smoothed inference is unable to exploit relevant data on y_0, y_{−1}, ...) and
near the end of the sample (for which knowledge of y_{T+1}, y_{T+2}, ... would be useful).
The bottom panel of Figure 1 plots the econometrician's best guess as to the
value of the ex ante real interest rate based on all of the data observed, μ̂ + ξ̂_{t|T}(θ̂).
Ninety-five percent confidence intervals for this inference that take account of both
the filter uncertainty P_{t|T}(θ̂) and parameter uncertainty due to the random nature
of θ̂ are also plotted. Negative ex ante real interest rates during the 1970s and
very high ex ante real interest rates during the early 1980s both appear to be
statistically significant. Hamilton (1985) obtained similar results from a more
complicated representation for the ex ante real interest rate.

4. Discrete-valued state variables

The time-varying coefficients model was advocated by Sims (1982) as a useful way
of dealing with changes occurring all the time in government policy and economic
institutions. Often, however, these changes take the form of dramatic, discrete
events, such as major wars, financial panics or significant changes in the policy
objectives of the central bank or taxing authority. It is thus of interest to consider
time-series models in which the coefficients change only occasionally as a result
of such changes in regime.
Consider an unobserved scalar s_t that can take on integer values 1, 2, ..., N
corresponding to N different possible regimes. We can then think of a time-varying
coefficient regression model of the form of (2.29),

y_t = x'_t β_{s_t} + w_t   (4.1)

for x_t a (k × 1) vector of predetermined or exogenous variables and w_t ~ i.i.d.
N(0, σ²). Thus in the regime represented by s_t = 1, the regression coefficients are
given by β_1; when s_t = 2, the coefficients are β_2; and so on. The variable s_t
summarizes the state of the system. The discrete analog to (2.1), the state
transition equation for a continuous-valued state variable, is a Markov chain in
which the probability distribution of s_{t+1} depends on past events only through
the value of s_t. If, as before, observations through date t are summarized by the
vector

𝒴_t ≡ (y'_t, y'_{t−1}, ..., y'_1, x'_t, x'_{t−1}, ..., x'_1)',

the assumption is that

Prob(s_{t+1} = j | s_t = i, s_{t−1} = i_1, s_{t−2} = i_2, ..., 𝒴_t) = Prob(s_{t+1} = j | s_t = i)
  ≡ p_{ij}.   (4.2)

When this probability does not depend on the previous state (p_{ij} = p_{lj} for all i, j
and l), the system (4.1)-(4.2) is the switching regression model of Quandt (1958);
with general transition probabilities it is the Markov-switching regression model
developed by Goldfeld and Quandt (1973) and Cosslett and Lee (1985). When x_t
includes lagged values of y, (4.1)-(4.2) describes the Markov-switching time-series
model of Hamilton (1989).

4.1. Linear state-space representation of the Markov-switching model

The parallel between (4.1)-(4.2) and (2.1)-(2.2) is instructive. Let F denote an
(N × N) matrix whose row i, column j element is given by p_{ji}:

$$
F = \begin{bmatrix}
p_{11} & p_{21} & \cdots & p_{N1} \\
p_{12} & p_{22} & \cdots & p_{N2} \\
\vdots & \vdots & & \vdots \\
p_{1N} & p_{2N} & \cdots & p_{NN}
\end{bmatrix}. \qquad (4.3)
$$



Let e_i denote the ith column of the (N × N) identity matrix and construct an
(N × 1) vector ξ_t that is equal to e_i when s_t = i. Then the expectation of ξ_{t+1} is an
(N × 1) vector whose ith element is the probability that s_{t+1} = i. In particular, the
expectation of ξ_{t+1} conditional on knowing that s_t = 1 is the first column of F.
More generally,

E(ξ_{t+1} | ξ_t, ξ_{t−1}, ..., ξ_1, 𝒴_t) = F ξ_t.   (4.4)

The Markov chain (4.2) thus implies the linear state equation

ξ_{t+1} = F ξ_t + v_{t+1},   (4.5)

where v_{t+1} is uncorrelated with past values of ξ, y or x.


The probability that s_{t+2} = j given s_t = i can be calculated from

Prob(s_{t+2} = j | s_t = i) = p_{i1}p_{1j} + p_{i2}p_{2j} + ··· + p_{iN}p_{Nj}
                         = p_{1j}p_{i1} + p_{2j}p_{i2} + ··· + p_{Nj}p_{iN},

which will be recognized as the row j, column i element of F². In general, the
probability that s_{t+m} = j given s_t = i is given by the row j, column i element of F^m,
and

E(ξ_{t+m} | ξ_t, ξ_{t−1}, ..., ξ_1, 𝒴_t) = F^m ξ_t.   (4.6)

Moreover, the regression equation (4.1) can be written

y_t = x'_t Β ξ_t + w_t,   (4.7)

where Β is a (k × N) matrix whose ith column is given by β_i. Equation (4.7) will
be recognized as an observation equation of the form of (2.26) with [H(x_t)]' = x'_t Β.
Thus the model (4.1)-(4.2) can be represented by the linear state-space model
(2.1) and (2.26). However, the disturbance in the state equation v_{t+1} can only take
on a set of N² possible discrete values, and is thus no longer normal, so that the
Kalman filter applied to this system does not generate optimal forecasts or
evaluation of the likelihood function.

4.2. Optimal filter when the state variable follows a Markov chain

The Kalman filter was described above as an iterative algorithm for calculating
the distribution of the state vector ξ_t conditional on 𝒴_{t−1}. When ξ_t is a continuous
normal variable, this distribution is summarized by its mean and variance. When
the state variable is the discrete scalar s_t, its conditional distribution is, instead,
summarized by

Prob(s_t = i | 𝒴_{t−1})   for i = 1, 2, ..., N.   (4.8)

Expression (4.8) describes a set of N numbers which sum to unity. Hamilton (1989)
presented an algorithm for calculating these numbers, which might be viewed as
a discrete version of the Kalman filter. This is an iterative algorithm whose input
at step t is the set of N numbers {Prob(s_t = i | 𝒴_{t−1})}_{i=1}^N and whose output is
{Prob(s_{t+1} = i | 𝒴_t)}_{i=1}^N. In motivating the Kalman filter, we initially assumed that
the values of F, Q, A, H and R were known with certainty, but then showed how
the filter could be used to evaluate the likelihood function and estimate these
parameters. Similarly, in describing the discrete analog, we will initially assume
that the values of β_1, β_2, ..., β_N, σ, and {p_{ij}}_{i,j=1}^N are known with certainty, but will
then see how the filter facilitates maximum likelihood estimation of these parameters.
A key difference is that, whereas the Kalman filter produces forecasts that are
linear in the data, the discrete-state algorithm, described below, is nonlinear.
If the Markov chain is stationary and ergodic, the iteration to evaluate (4.8)
can be started at date t = 1 with the unconditional probabilities. Let π_i denote the
unconditional probability that s_t = i and collect these in an (N × 1) vector π ≡
(π_1, π_2, ..., π_N)'. Noticing that π can be interpreted as E(ξ_t), this vector can be found
by taking expectations of (4.5):

π = Fπ.   (4.9)

Although this represents a system of N equations in N unknowns, it cannot be
solved for π; the matrix (I_N − F) is singular, since each of its columns sums to zero.
However, if the chain is stationary and ergodic, the system of (N + 1) equations
represented by (4.9) along with the equation

1'π = 1   (4.10)

can be solved uniquely for the ergodic probabilities (here 1 denotes an (N × 1)
vector, all of whose elements are unity). For N = 2, the solution is

π_1 = (1 − p_22)/[(1 − p_11) + (1 − p_22)],   (4.11)

π_2 = (1 − p_11)/[(1 − p_11) + (1 − p_22)].   (4.12)

A general solution for π can be calculated from the (N + 1)th column of the matrix
(A'A)^{−1}A', where

$$
\underset{(N+1)\times N}{A} = \begin{bmatrix} I_N - F \\ \mathbf{1}' \end{bmatrix}.
$$
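A short numerical check of this construction (names and the two-state example are ours) follows. The (N + 1)th column of (A'A)^{−1}A' is the least-squares solution of Aπ = (0, ..., 0, 1)':

```python
import numpy as np

def ergodic_probabilities(F):
    """Solve (4.9)-(4.10): pi = F pi and 1'pi = 1, via the matrix A in the text.

    F is the (N x N) matrix whose row i, column j element is p_{ji}."""
    N = F.shape[0]
    A = np.vstack([np.eye(N) - F, np.ones((1, N))])
    b = np.zeros(N + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)   # (A'A)^{-1} A' b
    return pi

# Two-state check against (4.11)-(4.12): p11 = 0.9, p22 = 0.75
F2 = np.array([[0.9, 0.25], [0.1, 0.75]])
print(ergodic_probabilities(F2))   # approximately [0.714, 0.286]
```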

The input for step t of the algorithm is {Prob(s_t = i | 𝒴_{t−1})}_{i=1}^N, whose ith entry,
under the assumption of predetermined or exogenous x_t, is the same as

Prob(s_t = i | x_t, 𝒴_{t−1}).   (4.13)

The assumption in (4.1) was that

f(y_t | s_t = i, x_t, 𝒴_{t−1}) = (2πσ²)^{−1/2}·exp[−(y_t − x'_t β_i)²/(2σ²)].   (4.14)

For given i, x_t, y_t, β_i and σ, the right-hand side of (4.14) is a number that can be
calculated. This number can be multiplied by (4.13) to produce the likelihood of
jointly observing s_t = i and y_t:

f(y_t, s_t = i | x_t, 𝒴_{t−1}) = f(y_t | s_t = i, x_t, 𝒴_{t−1})·Prob(s_t = i | x_t, 𝒴_{t−1}).   (4.15)

Expression (4.15) describes a set of N numbers (for i = 1, 2, ..., N) whose sum is
the density of y_t conditional on x_t and 𝒴_{t−1}:

f(y_t | x_t, 𝒴_{t−1}) = Σ_{i=1}^N f(y_t, s_t = i | x_t, 𝒴_{t−1}).   (4.16)

If each of the N numbers in (4.15) is divided by the magnitude in (4.16), the result
is the optimal inference about s_t based on observation of 𝒴_t = {y_t, x_t, 𝒴_{t−1}}:

Prob(s_t = i | 𝒴_t) = f(y_t, s_t = i | x_t, 𝒴_{t−1}) / f(y_t | x_t, 𝒴_{t−1}).   (4.17)

The output for step t can then be calculated from

Prob(s_{t+1} = j | 𝒴_t) = Σ_{i=1}^N Prob(s_{t+1} = j, s_t = i | 𝒴_t)
                      = Σ_{i=1}^N Prob(s_{t+1} = j | s_t = i, 𝒴_t)·Prob(s_t = i | 𝒴_t)
                      = Σ_{i=1}^N p_{ij}·Prob(s_t = i | 𝒴_t).   (4.18)

To summarize, let ξ̂_{t|t−1} denote an (N × 1) vector whose ith element represents
Prob(s_t = i | 𝒴_{t−1}) and let η_t denote an (N × 1) vector whose ith element is given by
(4.14). Then the sequence {ξ̂_{t+1|t}}_{t=1}^T can be found by iterating on

ξ̂_{t+1|t} = F·[(ξ̂_{t|t−1} ⊙ η_t) / 1'(ξ̂_{t|t−1} ⊙ η_t)],   (4.19)

where ⊙ denotes element-by-element multiplication and 1 represents an (N × 1)
vector of ones. The iteration is started with ξ̂_{1|0} = π, where π is given by the solution
to (4.9) and (4.10). The contemporaneous inference ξ̂_{t|t} is given by (ξ̂_{t|t−1} ⊙ η_t)/
[1'(ξ̂_{t|t−1} ⊙ η_t)].
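The full recursion (4.13)-(4.19), together with the log likelihood from (4.16), can be sketched compactly; the implementation and its names are ours:

```python
import numpy as np

def hamilton_filter(y, X, betas, sigma, F, pi):
    """Iterate on (4.19) and accumulate the log likelihood from (4.16).

    y: (T,); X: (T, k); betas: (N, k) with row i = beta_i'; sigma: scalar;
    F: (N, N) with row i, column j element p_{ji}; pi: ergodic probabilities."""
    T, N = y.shape[0], F.shape[0]
    xi = pi.copy()                         # xi_{1|0}
    filt = np.zeros((T, N))                # Prob(s_t = i | Y_t), (4.17)
    pred = np.zeros((T, N))                # Prob(s_{t+1} = i | Y_t), (4.18)
    loglik = 0.0
    for t in range(T):
        resid = y[t] - X[t] @ betas.T      # (N,) residuals under each regime
        eta = np.exp(-0.5 * (resid / sigma) ** 2) / np.sqrt(2 * np.pi * sigma**2)  # (4.14)
        joint = xi * eta                   # (4.15)
        fy = joint.sum()                   # (4.16)
        loglik += np.log(fy)
        filt[t] = joint / fy               # (4.17)
        xi = F @ filt[t]                   # (4.18)-(4.19)
        pred[t] = xi
    return filt, pred, loglik
```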

4.3. Extensions

The assumption that y_t depends only on the current value s_t of a first-order Markov
chain is not really restrictive. For example, the model estimated in Hamilton (1989)
was

y_t − μ_{s*_t} = φ_1(y_{t−1} − μ_{s*_{t−1}}) + φ_2(y_{t−2} − μ_{s*_{t−2}}) + ··· + φ_p(y_{t−p} − μ_{s*_{t−p}}) + ε_t,
   (4.20)

where s*_t can take on the values 1 or 0, and follows a Markov chain with
Prob(s*_{t+1} = j | s*_t = i) = p*_{ij}. This can be written in the form of (4.1)-(4.2) by letting
N = 2^{p+1} and defining

s_t = 1 if (s*_t = 1, s*_{t−1} = 1, ..., and s*_{t−p} = 1),
s_t = 2 if (s*_t = 0, s*_{t−1} = 1, ..., and s*_{t−p} = 1),
  ⋮                                                   (4.21)
s_t = N − 1 if (s*_t = 1, s*_{t−1} = 0, ..., and s*_{t−p} = 0),
s_t = N if (s*_t = 0, s*_{t−1} = 0, ..., and s*_{t−p} = 0).

For illustration, the matrix of transition probabilities when p = 2 is

-p:1 0 0 0 pT1 0 0 0
P:O 0 0 0 p& 0 0 0

0 p& 0 0 0 p& 0 0

0 PO*0 0 0 0 p& 0 0
F= (4.22)
(8 x 8) 0 0 P:l 0 0 0 P;l 0
0 0 p:o 0 0 0 Pro 0
0 0 0 p& 0 0 0 p&

0 0 0 p& 0 0 0 p&

There is also no difficulty in generalizing the above method to (n × 1) vector
processes y_t with changing coefficients or variances. Suppose that when the process
is in state s_t,

y_t | s_t, x_t, 𝒴_{t−1} ~ N(Π'_{s_t} x_t, Ω_{s_t}),   (4.23)

where Π'_1, for example, is an (n × k) matrix of regression coefficients appropriate
when s_t = 1. Then we simply replace (4.14) with

f(y_t | s_t = i, x_t, 𝒴_{t−1}) = (2π)^{−n/2}|Ω_i|^{−1/2}·exp[−½(y_t − Π'_i x_t)'Ω_i^{−1}(y_t − Π'_i x_t)],
   (4.24)

with other details of the recursion identical.


It is more difficult to incorporate changes in regime in a moving average process
such as y_t = ε_t + θ_{s*_t} ε_{t−1}. For such a process the distribution of y_t depends on the
complete history (y_{t−1}, y_{t−2}, ..., y_1, s*_t, s*_{t−1}, ..., s*_1), and N, in a representation such
as (4.21), grows with the sample size T. Lam (1990) successfully estimated a related
model by truncating the calculations for negligible probabilities. Approximations
to the optimal filter for a linear state-space model with changing coefficient matrices
have been proposed by Gordon and Smith (1990), Shumway and Stoffer (1991)
and Kim (1994).

4.4. Forecasting

Applying the law of iterated expectations to (4.6), the optimal forecast of ξ_{t+m}
based on data observed through date t is

E(ξ_{t+m} | 𝒴_t) = F^m ξ̂_{t|t},   (4.25)

where ξ̂_{t|t} is the optimal inference calculated by the filter.


As an example of using (4.25) to forecast yt, consider again the example in (4.20).
This can be written as

Y, = Ps; + z,, (4.26)

where z, = 4iz,-i + &z,_~ + ... + 4pzt-p + E,. If {SF} were observed, an m-period-
ahead .forecast of the first term in (4.26) turns out to be

E(~~~+_ls:)=~,+{~l+~(s:-n,)}(~,-~,), (4.27)
Ch. 50: State-Space Models 3069

where ;1= (- 1 + PT1 + p&J and rrI = (1 - p&,)/(1 - ~7~ + 1 - P&). If ~7~ and P&
are both greater than i, then 0 < ,I < 1 and there is a smooth decay toward the
steady-state probabilities. Similarly, the optimal forecast of z,+, based on its own
lagged values can be deduced from (1.9):

(4.28)

where e; denotes the first row of the (p x p) identity matrix and @ denotes the
(p x p) matrix on the right-hand side of (1.12). Recalling that z, = y, - psz is known
if y, and ST are known, we can substitute (4.27) and (4.28) into (4.26) to conclude

E(Yt+,l% r,) = PO + 1%+ JrnK- 1Wl -PO)


+ e;@Y(Y, -P,:) b-1 -P,:_l) ... &J+1 - Ps;_p+lu.
(4.29)

Since (4.29) is linear in (ST}, the forecast based solely on the observed variables
& can be found by applying the law of iterated expectations to (4.29):

E(y,+,IG) = cl0 + {x1 + AmCProb(s: = 1 I&) - rrJ}(~, - pO) + e;@_F(, (4.30)

where the ith element of the (p x 1) vector j, is given by

.Fit=Yt-i+l - p. Prob(s,*_i+ r =Ol&)-pr Prob($-i+r= 114).

The ease of forecasting makes this class of models very convenient for rational-
expectations analysis; for applications see Hamilton (1988), Cecchetti et al. (1990)
and Engel and Hamilton (1990).

4.5. Smoothed probabilities

We have assumed that the current value of s_t contains all the information in the
history of states through date t that is needed to describe the probability laws for
y and s:

Prob(s_{t+1} = j | s_t = i) = Prob(s_{t+1} = j | s_t = i, s_{t−1} = i_{t−1}, ..., s_1 = i_1).

Under these assumptions we have, as in Kitagawa (1987, p. 1033) and Kim (1994),
that

Prob(s_t = j, s_{t+1} = i | 𝒴_T) = Prob(s_{t+1} = i | 𝒴_T)·Prob(s_t = j | s_{t+1} = i, 𝒴_T)
  = Prob(s_{t+1} = i | 𝒴_T)·Prob(s_t = j | s_{t+1} = i, 𝒴_t)
  = Prob(s_{t+1} = i | 𝒴_T)·[Prob(s_t = j, s_{t+1} = i | 𝒴_t)/Prob(s_{t+1} = i | 𝒴_t)]
  = Prob(s_{t+1} = i | 𝒴_T)·[Prob(s_t = j | 𝒴_t)·Prob(s_{t+1} = i | s_t = j)/Prob(s_{t+1} = i | 𝒴_t)].
   (4.31)

Summing (4.31) over i = 1, ..., N and collecting the resulting equations for j = 1, ..., N
in a vector ξ̂_{t|T}, whose jth element is Prob(s_t = j | 𝒴_T), gives

ξ̂_{t|T} = ξ̂_{t|t} ⊙ {F'[ξ̂_{t+1|T} (÷) ξ̂_{t+1|t}]},   (4.32)

where (÷) denotes element-by-element division. The smoothed probabilities are
thus found by iterating on (4.32) backwards for t = T − 1, T − 2, ..., 1.
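The backward pass (4.32) is a one-line loop given the filtered and predicted probabilities stored by the filter sketched earlier (names ours):

```python
import numpy as np

def kim_smoother(filt, pred, F):
    """Backward recursion (4.32).

    filt[t] = Prob(s_t | Y_t) and pred[t] = Prob(s_{t+1} | Y_t), both (T, N),
    as returned by hamilton_filter; F has row i, column j element p_{ji}."""
    T = filt.shape[0]
    smoothed = filt.copy()
    for t in range(T - 2, -1, -1):
        smoothed[t] = filt[t] * (F.T @ (smoothed[t + 1] / pred[t]))
    return smoothed
```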

4.6. Maximum likelihood estimation

For given numerical values of the transition probabilities in F and the regression
parameters such as (Π_1, ..., Π_N, Ω_1, ..., Ω_N) in (4.24), the value of the log likelihood
function of the observed data is Σ_{t=1}^T log f(y_t | x_t, 𝒴_{t−1}) for f(y_t | x_t, 𝒴_{t−1}) given by
(4.16). This can be maximized numerically. Again, the EM algorithm is often an
efficient approach [see Baum et al. (1970), Kiefer (1980) and Hamilton (1990)]. For
the model given in (4.24), the EM algorithm is implemented by making an arbitrary
initial guess at the parameters and calculating the smoothed probabilities. OLS
regression of y_t·√Prob(s_t = 1 | 𝒴_T) on x_t·√Prob(s_t = 1 | 𝒴_T) gives a new estimate of
Π_1, and a new estimate of Ω_1 is provided by the sample variance matrix of these
OLS residuals. Smoothed probabilities for state 2 are used to estimate Π_2 and
Ω_2, and so on. New estimates for p_{ij} are inferred from

p̂_{ij} = [Σ_{t=2}^T Prob(s_t = j, s_{t−1} = i | 𝒴_T)] / [Σ_{t=2}^T Prob(s_{t−1} = i | 𝒴_T)],

with the probability of the initial state calculated from π̂_i = Prob(s_1 = i | 𝒴_T) rather
than (4.9)-(4.10). These new parameter values are then used to recalculate the
smoothed probabilities, and the procedure continues until convergence.

When the variance depends on the state as in (4.24), there is an essential
singularity in the likelihood function at |Ω_i| = 0. This can be safely ignored without
consequences; for further discussion, see Hamilton (1991).

4.7. Asymptotic properties of maximum likelihood estimates

It is typically assumed that the usual asymptotic distribution theory motivating
(3.9) holds for this class of models, though we are aware of no formal demonstration
of this apart from Kiefer's (1978) analysis of i.i.d. switching regressions. Hamilton
(1993) examined specification tests derived under the assumption that (3.9) holds.
Two cases in which (3.9) is clearly invalid should be mentioned. First, the
maximum likelihood estimate p̂_{ij} may well be at a boundary of the allowable
parameter space (zero or one), in which case the information matrix in (3.12) need
not even be positive definite. One approach in this case is to regard the value of
p_{ij} as fixed at zero or one and calculate the information matrix with respect to other
parameters.
Another case in which standard asymptotic distribution theory cannot be invoked
is a test for the number of states. The parameter p_{12} is unidentified under the
null hypothesis that the distribution under state one is the same as under state
two. A solution to this problem was provided by Hansen (1992). Testing the specification
with fewer states for evidence of omitted heteroskedasticity affords a simple
alternative.

4.8. Empirical application - another look at the real interest rate

We illustrate these methods with a simplified version of Garcia and Perron's (1993)
analysis of the real interest rate. Let y_t denote the ex post real interest rate data
described in Section 3.5. Garcia and Perron concluded that a similar data set was
well described by N = 3 different states. Maximum likelihood estimates for our
data are as follows, with standard errors in parentheses:⁵

y_t | s_t = 1 ~ N( 5.69, 3.72 ),
              (0.41) (1.11)

y_t | s_t = 2 ~ N( 1.58, 1.93 ),
              (0.16) (0.32)

y_t | s_t = 3 ~ N( −1.58, 2.83 ),
              (0.30)  (0.72)

⁵Garcia and Perron also included p = 2 autoregressive terms as in (4.20), which were omitted from
the analysis described here.

- 0.950 0 0.036
(0.044) (0.030)

0.050 0.990 0
F= (0.044) (0.010)

0 0.010 0.964
(0.010) (0.030).

The unrestricted maximum likelihood estimates for the transition probabilities


occur at the boundaries with fir3 = Fiji = flS2 = 0. These values were then imposed
a priori and derivatives were taken with respect to the remaining free parameters
8= (~~,P~,~~,u:,cJ$, ~~,p~~,p~~,p~J to calculate standard errors.

[Figure 2. Top panel: solid line, ex post real interest rate; dashed line, μ̂_i δ̂_{it}, where δ̂_{it} = 1 if
Prob(s_t = i | 𝒴_T; θ̂) > 0.5 and δ̂_{it} = 0 otherwise. Second panel: Prob(s_t = 1 | 𝒴_T; θ̂). Third panel:
Prob(s_t = 2 | 𝒴_T; θ̂). Fourth panel: Prob(s_t = 3 | 𝒴_T; θ̂).]

Regime 1 is characterized by average real interest rates in excess of 5 percent,
while regime 3 is characterized by negative real interest rates. Regime 2 represents
the more typical experience of an average real interest rate of 1.58 percent.
The bottom three panels of Figure 2 plot the smoothed probabilities Prob(s_t =
i | 𝒴_T; θ̂) for i = 1, 2 and 3, respectively. The high interest rate regime lasted from
1980:IV to 1986:II, while the negative real interest rate regime occurred during
1972:III to 1980:III.
Regime 1 only occurred once during the sample, and yet the asymptotic standard
errors reported above suggest that the transition probability p̂_11 has a standard
error of only 0.044. This is because there is in fact not just one observation useful
for estimating p_11, but rather 23 observations. It is exceedingly unlikely that one
could have flipped a fair coin once each quarter from 1980:IV through 1986:II
and have it come up heads each time; thus the possibility that p_11 might be as
low as 0.5 can easily be dismissed.
The means μ̂_1, μ̂_2 and μ̂_3 corresponding to the imputed regime for each date
are plotted along with the actual data for y_t in the top panel of Figure 2. Garcia
and Perron noted that the timing of the high real interest rate episode suggests
that fiscal policy may have been more important than monetary policy in producing
this unusual episode.

5. Non-normal and nonlinear state-space models

A variety of approximating techniques have been suggested for the case when the
disturbances v_t and w_t come from a general non-normal distribution or when the
state or observation equations are nonlinear. This section reviews two approaches.
The first approximates the optimal filter using a finite grid and the second is known
as the extended Kalman filter.

5.1. Kitagawa's grid approximation for nonlinear, non-normal state-space models

Kitagawa (1987) suggested the following general approach for nonlinear or
non-normal filtering. Although the approach in principle can be applied to vector
systems, the notation and computations are simplest when the observed variable
(y_t) and the state variable (ξ_t) are both scalars. Thus consider

ξ_{t+1} = φ(ξ_t) + v_{t+1},   (5.1)

y_t = h(ξ_t) + w_t.   (5.2)

The disturbances v_t and w_t are each i.i.d. and mutually independent and have
densities denoted q(v_t) and r(w_t), respectively. These densities need not be normal,
but they are assumed to be of a known form; for example, we may postulate that
v_t has a t distribution with ν degrees of freedom:

q(v_t) = c[1 + (v²_t/ν)]^{−(ν+1)/2},

where c is a normalizing constant. Similarly φ(·) and h(·) represent parametric
functions of some known form; for example, φ(·) might be the logistic function,
in which case (5.1) would be

ξ_{t+1} = 1/[1 + a·exp(−b ξ_t)] + v_{t+1}.   (5.3)

Step t of the Kalman filter accepted as input the distribution of ξ_t conditional
on 𝒴_{t−1} = (y_{t−1}, y_{t−2}, ..., y_1) and produced as output the distribution of ξ_{t+1}
conditional on 𝒴_t. Under the normality assumption the input distribution was
completely summarized by the mean ξ̂_{t|t−1} and variance P_{t|t−1}. More generally,
we can imagine a recursion whose input is the density f(ξ_t | 𝒴_{t−1}) and whose output
is f(ξ_{t+1} | 𝒴_t). These, in general, would be continuous functions, though they can
be summarized by their values at a finite grid of points, denoted ξ^{(0)}, ξ^{(1)}, ..., ξ^{(N)}.
Thus the input for Kitagawa's filter is the set of (N + 1) numbers

f(ξ_t | 𝒴_{t−1})|_{ξ_t = ξ^{(i)}},   i = 0, 1, ..., N,   (5.4)

and the output is (5.4) with t replaced by t + 1.


To derive the filter, first notice that under the assumed structure, ξ_t summarizes
everything about the past that matters for y_t:

f(y_t | ξ_t) = f(y_t | ξ_t, 𝒴_{t−1}) = r[y_t − h(ξ_t)].

Thus

f(y_t, ξ_t | 𝒴_{t−1}) = f(y_t | ξ_t)·f(ξ_t | 𝒴_{t−1}) = r[y_t − h(ξ_t)]·f(ξ_t | 𝒴_{t−1})   (5.5)

and

f(y_t | 𝒴_{t−1}) = ∫_{−∞}^{∞} f(y_t, ξ_t | 𝒴_{t−1}) dξ_t.   (5.6)

Given the observed y_t and the known form for r(·) and h(·), the joint density (5.5)
can be calculated for each ξ_t = ξ^{(i)}, i = 0, 1, ..., N, and these values can then be
used to approximate (5.6) by

f(y_t | 𝒴_{t−1}) ≈ Σ_{i=1}^N ½{f(y_t, ξ^{(i)} | 𝒴_{t−1}) + f(y_t, ξ^{(i−1)} | 𝒴_{t−1})}·{ξ^{(i)} − ξ^{(i−1)}}.   (5.7)

The updated density for ξ_t is obtained by dividing each of the N + 1 numbers in
(5.5) by the constant (5.7):

f(ξ_t | 𝒴_t) = f(ξ_t | y_t, 𝒴_{t−1}) = f(y_t, ξ_t | 𝒴_{t−1}) / f(y_t | 𝒴_{t−1}).   (5.8)

The joint conditional density of ξ_{t+1} and ξ_t is then

f(ξ_{t+1}, ξ_t | 𝒴_t) = f(ξ_{t+1} | ξ_t)·f(ξ_t | 𝒴_t) = q[ξ_{t+1} − φ(ξ_t)]·f(ξ_t | 𝒴_t).   (5.9)

For any pair of values ξ^{(i)} and ξ^{(j)}, equation (5.9) can be evaluated at ξ_t = ξ^{(i)} and
ξ_{t+1} = ξ^{(j)} from (5.8) and the form of q(·) and φ(·). The recursion is completed by

f(ξ_{t+1} | 𝒴_t)|_{ξ_{t+1} = ξ^{(j)}} = ∫_{−∞}^{∞} f(ξ_{t+1}, ξ_t | 𝒴_t)|_{ξ_{t+1} = ξ^{(j)}} dξ_t
  ≈ Σ_{i=1}^N ½{f(ξ_{t+1}, ξ_t | 𝒴_t)|_{ξ_{t+1}=ξ^{(j)}, ξ_t=ξ^{(i)}} + f(ξ_{t+1}, ξ_t | 𝒴_t)|_{ξ_{t+1}=ξ^{(j)}, ξ_t=ξ^{(i−1)}}}
      ·{ξ^{(i)} − ξ^{(i−1)}}.   (5.10)

An approximation to the log likelihood can be calculated from (5.6):

log f(y_T, y_{T−1}, ..., y_1) = Σ_{t=1}^T log f(y_t | 𝒴_{t−1}).   (5.11)

The maximum likelihood estimates of parameters such as a, b and ν are then the
values for which (5.11) is greatest.
Feyzioglu and Hassett (1991) provided an economic application of Kitagawa's
approach to a nonlinear, non-normal state-space model.
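One step of the grid recursion (5.4)-(5.10) can be sketched as follows; the implementation and names are ours, and q_dens, r_dens, phi and h are assumed to be vectorized Python functions:

```python
import numpy as np

def trapz(vals, grid, axis=-1):
    # trapezoidal rule, as in (5.7) and (5.10)
    d = np.diff(grid)
    v = np.moveaxis(vals, axis, -1)
    return 0.5 * ((v[..., 1:] + v[..., :-1]) * d).sum(axis=-1)

def kitagawa_filter_step(prior, grid, y, q_dens, r_dens, phi, h):
    """Take f(xi_t|Y_{t-1}) on the grid, expression (5.4), and return
    f(xi_{t+1}|Y_t) on the grid together with log f(y_t|Y_{t-1})."""
    joint = r_dens(y - h(grid)) * prior                 # (5.5) on the grid
    fy = trapz(joint, grid)                             # (5.6)-(5.7)
    posterior = joint / fy                              # (5.8)
    # (5.9)-(5.10): rows index xi_{t+1} = grid[j], columns index xi_t = grid[i]
    trans = q_dens(grid[:, None] - phi(grid)[None, :])
    new_prior = trapz(trans * posterior[None, :], grid, axis=1)
    return new_prior, np.log(fy)
```

Summing the returned log densities over t = 1, ..., T gives the approximate log likelihood (5.11).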

5.2. Extended Kalman filter

Consider next a multidimensional normal state-space model

ξ_{t+1} = φ(ξ_t) + v_{t+1},   (5.12)

y_t = a(x_t) + h(ξ_t) + w_t,   (5.13)

where φ: ℝ^r → ℝ^r, a: ℝ^k → ℝ^n and h: ℝ^r → ℝ^n, v_t ~ i.i.d. N(0, Q) and w_t ~ i.i.d.
N(0, R). Suppose φ(·) in (5.12) is replaced with a first-order Taylor approximation
around ξ_t = ξ̂_{t|t},

ξ_{t+1} ≅ φ_t + Φ_t(ξ_t − ξ̂_{t|t}) + v_{t+1},   (5.14)

where

φ_t ≡ φ(ξ̂_{t|t}),   Φ_t ≡ [∂φ(ξ_t)/∂ξ'_t]|_{ξ_t = ξ̂_{t|t}}.   (5.15)
(r × 1)          (r × r)

For example, suppose r = 1 and φ(·) is the logistic function as in (5.3). Then (5.14)
would be given by

ξ_{t+1} ≅ 1/[1 + a·exp(−b ξ̂_{t|t})] + {a b·exp(−b ξ̂_{t|t})/[1 + a·exp(−b ξ̂_{t|t})]²}·(ξ_t − ξ̂_{t|t}) + v_{t+1}.
   (5.16)

If the form of φ and any parameters it depends on [such as a and b in (5.3)] are
known, then the inference ξ̂_{t|t} can be constructed as a function of variables observed
at date t (𝒴_t) through a recursion to be described in a moment. Thus φ_t and Φ_t in
(5.14) are directly observed at date t.
Similarly the function h(·) in (5.13) can be linearized around ξ̂_{t|t−1} to produce

y_t ≅ a(x_t) + h_t + H'_t(ξ_t − ξ̂_{t|t−1}) + w_t,   (5.17)

where

h_t ≡ h(ξ̂_{t|t−1}),   H'_t ≡ [∂h(ξ_t)/∂ξ'_t]|_{ξ_t = ξ̂_{t|t−1}}.   (5.18)
(n × 1)           (n × r)

Again h_t and H_t are observed at date t − 1. The function a(·) in (5.13) need not be
linearized since x_t is observed directly.
The idea behind the extended Kalman filter is to treat (5.14) and (5.17) as if
they were the true model. These will be recognized as time-varying coefficient
versions of a linear state-space model, in which the observed predetermined variable
φ_t − Φ_t ξ̂_{t|t} has been added to the state equation. Retracing the logic behind the
Kalman filter for this application, the input for step t of the iteration is again the
forecast ξ̂_{t|t−1} and mean squared error P_{t|t−1}. Given these, the forecast of y_t is
found from (5.17):

E(y_t | x_t, 𝒴_{t−1}) = a(x_t) + h_t = a(x_t) + h(ξ̂_{t|t−1}).   (5.19)

The joint distribution of ξ_t and y_t conditional on x_t and 𝒴_{t−1} continues to be given
by (2.11), with (5.19) replacing the mean of y_t and H_t replacing H. The contemporaneous
inference (2.12) goes through with the same minor modification:

ξ̂_{t|t} = ξ̂_{t|t−1} + P_{t|t−1}H_t(H'_t P_{t|t−1}H_t + R)^{−1}[y_t − a(x_t) − h(ξ̂_{t|t−1})].   (5.20)

If (5.14) described the true model, then the optimal forecast of ξ_{t+1} on the basis
of 𝒴_t would be

E(ξ_{t+1} | 𝒴_t) = φ_t + Φ_t(ξ̂_{t|t} − ξ̂_{t|t}) = φ(ξ̂_{t|t}).

To summarize, step t of the extended Kalman filter uses ξ̂_{t|t−1} and P_{t|t−1} to
calculate H_t from (5.18) and ξ̂_{t|t} from (5.20). From these we can evaluate Φ_t in
(5.15). The output for step t is then

ξ̂_{t+1|t} = φ(ξ̂_{t|t}),   (5.21)

P_{t+1|t} = Φ_t P_{t|t−1}Φ'_t − {Φ_t P_{t|t−1}H_t(H'_t P_{t|t−1}H_t + R)^{−1}H'_t P_{t|t−1}Φ'_t} + Q.   (5.22)

The recursion is started with ξ̂_{1|0} and P_{1|0} representing the analyst's prior
information about the initial state.
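One step of the extended filter, (5.18)-(5.22), can be sketched as follows. The function names are ours; h_jac and phi_jac are assumed user-supplied Jacobians of h and φ:

```python
import numpy as np

def ekf_step(xi_pred, P_pred, y, x, a, h, h_jac, phi, phi_jac, Q, R):
    """One step of the extended Kalman filter, equations (5.18)-(5.22).

    h_jac and phi_jac return the (n x r) and (r x r) Jacobians dh/dxi'
    and dphi/dxi'."""
    Ht = h_jac(xi_pred).T                 # H_t (r x n), so H_t' = dh/dxi'
    err = y - a(x) - h(xi_pred)           # forecast error, from (5.19)
    S = Ht.T @ P_pred @ Ht + R
    K = P_pred @ Ht @ np.linalg.inv(S)
    xi_filt = xi_pred + K @ err           # (5.20)
    P_filt = P_pred - K @ Ht.T @ P_pred
    Phi = phi_jac(xi_filt)                # (5.15), evaluated at xi_{t|t}
    xi_next = phi(xi_filt)                # (5.21)
    P_next = Phi @ P_filt @ Phi.T + Q     # (5.22)
    return xi_next, P_next
```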

5.3. Other approaches to nonlinear state-space models

A number of other approaches to nonlinear state-space models have been explored
in the literature. See Anderson and Moore (1979, Chapter 8) and Priestly (1980,
1988) for partial surveys.

References

Anderson, B.D.O. and J.B. Moore (1979) Optimal Filtering. Englewood Cliffs, New Jersey: Prentice-Hall,
Inc.
Ansley, C.F. and R. Kohn (1985) Estimation, Filtering, and Smoothing in State Space Models with
Incompletely Specified Initial Conditions, Annals of Statistics, 13, 1286-1316.

Aoki, M. (1987) State Space Modeling of Time Series. New York: Springer Verlag.
Baum, L.E., T. Petrie, G. Soules and N. Weiss (1970) A Maximization Technique Occurring in the
Statistical Analysis of Probabilistic Functions of Markov Chains, Annals of Mathematical Statistics,
41, 164-171.
Box, G.E.P. and G.M. Jenkins (1976) Time Series Analysis: Forecasting and Control, Second edition.
San Francisco: Holden-Day.
Burmeister, E. and K.D. Wall (1982) Kalman Filtering Estimation of Unobserved Rational
Expectations with an Application to the German Hyperinflation, Journal of Econometrics, 20,
255-284.
Burmeister, E., K.D. Wall and J.D. Hamilton (1986) Estimation of Unobserved Expected Monthly
Inflation Using Kalman Filtering, Journal of Business and Economic Statistics, 4, 147-160.
Caines, P.E. (1988) Linear Stochastic Systems. New York: John Wiley and Sons, Inc.
Cecchetti, S.G., P.-S. Lam and N. Mark (1990) Mean Reversion in Equilibrium Asset Prices, American
Economic Review, 80, 398-418.
Chow, G.C. (1984) Random and Changing Coefficient Models, in: Z. Griliches and M.D. Intriligator,
eds., Handbook of Econometrics. Vol. 2. Amsterdam: North-Holland.
Cosslett, S.R. and L.-F. Lee (1985) Serial Correlation in Discrete Variable Models, Journal of
Econometrics, 27, 79-97.
Davies, R.B. (1977) Hypothesis Testing When a Nuisance Parameter is Present Only Under the
Alternative, Biometrika, 64, 247-254.
DeGroot, M.H. (1970) Optimal Statistical Decisions. New York: McGraw-Hill.
De Jong, P. (1988) The Likelihood for a State Space Model, Biometrika, 75, 165-169.
De Jong, P. (1989) Smoothing and Interpolation with the State-Space Model, Journal of the American
Statistical Association, 84, 1085-1088.
De Jong, P. (1991) The Diffuse Kalman Filter, Annals of Statistics, 19, 1073-1083.
Dempster, A.P., N.M. Laird and D.B. Rubin (1977) Maximum Likelihood from Incomplete Data via
the EM Algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-38.
Doan, T., R.B. Litterman and C.A. Sims (1984) Forecasting and Conditional Projection Using Realistic
Prior Distributions, Econometric Reviews, 3, 1-100.
Engel, C. and J.D. Hamilton (1990) Long Swings in the Dollar: Are They in the Data and Do Markets
Know It?, American Economic Review, 80, 689-713.
Engle, R.F. and M.W. Watson (1981) A One-Factor Multivariate Time Series Model of Metropolitan
Wage Rates, Journal of the American Statistical Association, 76, 774-781.
Engle, R.F. and M.W. Watson (1987) The Kalman Filter: Applications to Forecasting and Rational-
Expectations Models, in: T.F. Bewley, ed., Advances in Econometrics. Fifth World Congress,
Volume I. Cambridge, England: Cambridge University Press.
Fama, E.F. and M.R. Gibbons (1982) Inflation, Real Returns, and Capital Investment, Journal of
Monetary Economics, 9, 297-323.
Feyzioglu, T. and K. Hassett (1991), A Nonlinear Filtering Technique for Estimating the Timing and
Importance of Liquidity Constraints, Mimeographed, Georgetown University.
Garcia, R. and P. Perron (1993), An Analysis of the Real Interest Rate Under Regime Shifts,
Mimeographed, University of Montreal.
Gevers, M. and V. Wertz (1984) Uniquely Identifiable State-Space and ARMA Parameterizations for
Multivariable Linear Systems, Automatica, 20, 333-347.
Ghosh, D. (1989) Maximum Likelihood Estimation of the Dynamic Shock-Error Model, Journal of
Econometrics, 41, 121-143.
Goldfeld, S.M. and R.M. Quandt (1973) A Markov Model for Switching Regressions, Journal of
Econometrics, 1, 3-16.
Gordon, K. and A.F.M. Smith (1990) Modeling and Monitoring Biomedical Time Series, Journal of
the American Statistical Association, 85, 328-337.
Hamilton, J.D. (1985) Uncovering Financial Market Expectations of Inflation, Journal of Political
Economy, 93, 1224-1241.
Hamilton, J.D. (1986) A Standard Error for the Estimated State Vector of a State-Space Model,
Journal of Econometrics, 33, 387-397.
Hamilton, J.D. (1988) Rational-Expectations Econometric Analysis of Changes in Regime: An
Investigation of the Term Structure of Interest Rates, Journal of Economic Dynamics and Control,
12, 385-423.

Hamilton, J.D. (1989) A New Approach to the Economic Analysis of Nonstationary Time Series and
the Business Cycle, Econometrica, 57, 357-384.
Hamilton, J.D. (1990) Analysis of Time Series Subject to Changes in Regime, Journal of Econometrics,
45, 39-70.
Hamilton, J.D. (1991) A Quasi-Bayesian Approach to Estimating Parameters for Mixtures of Normal
Distributions, Journal of Business and Economic Statistics, 9, 27-39.
Hamilton, J.D. (1993) Specification Testing in Markov-Switching Time Series Models, Mimeographed,
University of California, San Diego.
Hamilton, J.D. (1994) Time Series Analysis. Princeton, N.J.: Princeton University Press.
Hannan, E.J. (1971) The Identification Problem for Multiple Equation Systems with Moving Average
Errors, Econometrica, 39, 751-765.
Hansen, B.E. (1992) The Likelihood Ratio Test Under Non-Standard Conditions: Testing the Markov
Trend Model of GNP, Journal of Applied Econometrics, 7, S61-S82.
Hansen, B.E. (1993) Inference When a Nuisance Parameter is Not Identified Under the Null Hypothesis,
Mimeographed, University of Rochester.
Harvey, A.C. (1987) Applications of the Kalman Filter in Econometrics, in: T.F. Bewley, ed.,
Advances in Econometrics, Fifth World Congress, Volume I. Cambridge, England: Cambridge
University Press.
Harvey, A.C. (1989) Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge,
England: Cambridge University Press.
Harvey, A.C. and G.D.A. Phillips (1979) The Maximum Likelihood Estimation of Regression Models
with Autoregressive-Moving Average Disturbances, Biometrika, 66, 49-58.
Harvey, A.C. and R.G. Pierse (1984) Estimating Missing Observations in Economic Time Series,
Journal of the American Statistical Association, 79, 125-131.
Harvey, A.C. and P.H.J. Todd (1983) Forecasting Economic Time Series with Structural and Box-
Jenkins Models: A Case Study, Journal of Business and Economic Statistics, 1, 2999307.
Imrohoroglu, S. (1993) Testing for Sunspots in the German Hyperinflation, Journal of Economic
Dynamics and Control, 17, 289-317.
Jones, R.H. (1980) Maximum Likelihood Fitting of ARMA Models to Time Series with Missing
Observations, Technometrics, 22, 3899395.
Kalman, R.E. (1960) A New Approach to Linear Filtering and Prediction Problems, Journal of Basic
Engineering, Transactions of the ASME, Series D, 82, 35545.
Kalman, R.E. (1963) New Methods in Wiener Filtering Theory, in: J.L. Bogdanoff and F. Kozin,
eds., Proceedings of the First Symposium of Engineering Applications of Random Function Theory and
Probability, pp. 270-388. New York: John Wiley & Sons, Inc.
Kiefer, N.M. (1978) Discrete Parameter Variation: Efficient Estimation of a Switching Regression
Model, Econometrica, 46, 4277434.
Kiefer, N.M. (1980) A Note on Switching Regressions and Logistic Discrimination, Econometrica,
48, 1065-1069.
Kim, C.-J. (1994) Dynamic Linear Models with Markov-Switching, Journal ofEconometrics, 60, l-22.
Kitagawa, G. (1987) Non-Gaussian State-Space Modeling of Nonstationary Time Series, Journal of
the American Statistical Association, 82, 1032-1041.
Kahn, R. and C.F. Ansley (1986) Estimation, Prediction, and Interpolation for ARIMA Models with
Missing Data, Journal of the American Statistical Association, 81, 751-761.
Lam, P.-S. (1990) The Hamilton Model with a General Autoregressive Component: Estimation and
Comparison with Other Models of Economic Time Series, Journal of Monetary Economics, 26,
409-432.
Leybourne, S.J. and B.P.M. McCabe (1989) On the Distribution of Some Test Statistics for Coefficient
Constancy, Biometrika, 76, 1699177.
Magnus, J.R. and H. Neudecker (1988) Matrix Differential Calculus with Applications in Statistics and
Econometrics. New York: John Wiley & Sons, Inc.
Meinhold, R.J. and N.D. Singpurwalla (1983) Understanding the Kalman Filter, American Statistician,
37, 123-127.
Naba, S. and K. Tanaka (1988) Asymptotic Theory for the Constance of Regression Coefficients
Against the Random Walk Alternative, Annals of Statistics, 16, 2188235.
Nash, J.C. and M. Walker-Smith (1987) Nonlinear Parameter Estimation: An Integrated System in
Basic. New York: Marcel Dekker.
3080 J.D. Hamilton

Nicholls, D.F. and A.R. Pagan (1985) Varying Coefficient Regression, in: E.J. Hannan, P.R. Krishnaiah
and M.M. Rao, eds., Handbook of Statistics.Vol. 5. Amsterdam: North-Holland.
Pagan, A. (1980) Some Identification and Estimation Results for Regression Models with Stochastically
Varying Coefficients, Journal of Econometrics, 13, 341-363.
Priestly, M.B. (1980) State-Dependent Models: A General Approach to Non-Linear Time Series
Analysis, Journal of Time Series Analysis, 1, 47-71.
Priestly, M.B. (1988) Current Developments in Time-Series Modelling, Journal of Econometrics, 37,
67-86.
Quandt, R.E. (1958) The Estimation of Parameters of Linear Regression System Obeying Two Separate
Regimes, /ournal of the American Statistical Association, 55, 873-880.
Quandt, R.E. (1983) Computational Problems and Methods, in: Z. Griliches and M.D. Intriligator,
eds., Handbook of Econometrics. Volume 1. Amsterdam: North-Holland.
Raj, B. and A. Ullah (1981) Econometrics: A Varying Coeficients Approach. London: Groom-Helm.
Sargent, T.J. (1989) Two Models of Measurements and the Investment Accelerator, Journal of Political
Economy, 97,251-287.
Shumway, R.H. and D.S. Stolfer (1982) An Approach to Time Series Smoothing and Forecasting
Using the EM Algorithm, Journal of Time Series Analysis, 3, 253-263.
Shumway, R.H. and D.S. Staffer (1991) Dynamic Linear Models with Switching, Journal of the
American Statistical Association, 86, 7633769.
Sims, C.A. (1982) Policy Analysis with Econometric Models, Brookings Papers on Economic Activity,
1, 1077152.
Stock, J.H. and M.W. Watson (1991) A Probability Model of the Coincident Economic Indicators,
in: K. Lahiri and G.H. Moore, eds., Leading Economic Indicators: New Approaches and Forecasting
Records, Cambridge, England: Cambridge University Press.
Tanaka, K. (1983) Non-Normality of the Lagrange Multiplier Statistic for Testing the Constancy of
Regression Coefficients, Econometrica, 51, 1577-1582.
Tjostheim, D. (1986) Some Doubly St&hastic Time Series Models, Journal of Time Series Analysis,
7, 51-72.
Wall, K.D. (1980) Generalized Expectations Modeling in Macroeconometrics, Journal of Economic
Dynamics and Control, 2, 161-184.
Wall, K.D. (1987) Identification Theory for Varying Coefficient Regression Models, Journal of Time
Series Analysis, 8, 3599371.
Watson, M.W. (1989) Recursive Solution Methods for Dynamic Linear Rational Expectations Models,
Journal of Econometrics, 41, 65-89.
Watson, M.W. and R.F. Engle (1983) Alternative Algorithms for the Estimation of Dynamic Factor,
MIMIC, and Varying Coefficient Regression Models, Journal of Econometrics, 23, 385-400.
Watson, M.W. and R.F. Engle (1985) Testing for Regression Coefficient Stability with a Stationary
AR(l) Alternative, Review of Economics and Statistics, 67, 341-346.
White, H. (1982) Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50, l-25,
Chapter 51

STRUCTURAL ESTIMATION OF MARKOV
DECISION PROCESSES*

JOHN RUST

University of Wisconsin

Contents

1. Introduction 3082
2. Solving MDPs via dynamic programming: A brief review 3088
2.1. Finite-horizon dynamic programming and the optimality of Markovian
decision rules 3089
2.2. Infinite-horizon dynamic programming and Bellman's equation 3091
2.3. Bellman's equation, contraction mappings and optimality 3091
2.4. A geometric series representation for MDPs 3094
2.5. Overview of solution methods 3095
3. Econometric methods for discrete decision processes 3099
3.1. Alternative models of the error term 3100
3.2. Maximum likelihood estimation of DDPs 3101
3.3. Alternative estimation methods: Finite-horizon DDP problems 3118
3.4. Alternative estimation methods: Infinite-horizon DDPs 3123
3.5. The identification problem 3125
4. Empirical applications 3130
4.1. Optimal replacement of bus engines 3130
4.2. Optimal retirement from a firm 3134
References 3139

*This is an abridged version of a monograph, Stochastic Decision Processes: Theory, Computation, and Estimation, written for the Leif Johansen lectures at the University of Oslo in the fall of 1991. I am grateful for generous financial support from the Central Bank of Norway and the University of Oslo and comments from John Dagsvik, Peter Frenger and Steinar Strøm.

Handbook of Econometrics, Volume IV, Edited by R.F. Engle and D.L. McFadden
© 1994 Elsevier Science B.V. All rights reserved

1. Introduction

Markov decision processes (MDP) provide a broad framework for modelling sequential decision making under uncertainty. MDPs have two sorts of variables: state variables $s_t$ and control variables $d_t$, both of which are indexed by time $t = 0, 1, 2, 3, \ldots, T$, where the horizon $T$ may be infinity. A decision-maker or agent can be represented by a set of primitives $(u, p, \beta)$ where $u(s_t, d_t)$ is a utility function representing the agent's preferences at time $t$, $p(s_{t+1} \mid s_t, d_t)$ is a Markov transition probability representing the agent's subjective beliefs about uncertain future states,¹ and $\beta \in (0, 1)$ is the rate at which the agent discounts utility in future periods. Agents are assumed to be rational: they behave according to an optimal decision rule $d_t = \delta(s_t)$ that solves $V_T(s) = \max_\delta E_\delta\{\sum_{t=0}^{T} \beta^t u(s_t, d_t) \mid s_0 = s\}$, where $E_\delta$ denotes expectation with respect to the controlled stochastic process $\{s_t, d_t\}$ induced by the decision rule $\delta$. The method of dynamic programming provides a constructive procedure for computing $\delta$ using the value function $V_T$ as a "shadow price" to decentralize a complicated stochastic/multiperiod optimization problem into a sequence of simpler deterministic/static optimization problems.
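To make these objects concrete, the primitives $(u, p, \beta)$ of a small finite MDP can be stored directly as arrays. The following Python sketch (the two-state, two-decision numbers are purely illustrative and not from the chapter) forms the expected-future-value term that appears inside the dynamic programming recursion:

    import numpy as np

    # Illustrative finite MDP primitives (u, p, beta): S = 2 states, D = 2 decisions.
    # u[s, d]   : single period utility of decision d in state s
    # p[d, s, :]: transition probabilities Pr{s' | s, d}; each row sums to one
    u = np.array([[1.0, 0.5],
                  [0.0, 2.0]])
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta = 0.95

    # Expected next-period value of a candidate value function V at each (s, d):
    V = np.zeros(2)
    EV = np.einsum('dst,t->sd', p, V)   # EV[s, d] = sum_{s'} V(s') p(s'|s, d)
    print(u + beta * EV)                # the bracketed term in the Bellman recursion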
MDPs have been extensively used in theoretical studies because the framework
is rich enough to model most economic problems involving choices made over time
and under uncertainty. Applications include the pioneering work on optimal inventory policy by Arrow et al. (1951), investment under uncertainty [Lucas and Prescott (1971)], optimal intertemporal consumption/savings and portfolio selec-
tion under uncertainty [Phelps (1962), Hakansson (1970), Levhari and Srinivasan
(1969), Merton (1969) and Samuelson (1969)], optimal growth under uncertainty
[Brock and Mirman (1972), Leland (1974)], models of asset pricing [Lucas (1978),
Brock (1982)], and models of equilibrium business cycles [Kydland and Prescott
(1982), Long and Plosser (1983)]. By the early 1980s the use of MDPs had become
widespread in both micro- and macroeconomic theory as well as in finance and
operations research.
In addition to providing a normative theory of how rational agents should
behave, econometricians soon realized that MDPs might provide good empirical
models of how real-world decision-makers actually behave. Most data sets take the
form $\{d_t^a, s_t^a\}$ where $d_t^a$ is the decision and $s_t^a$ is the state of an agent $a$ at time $t$.² Reduced-form estimation methods can be viewed as uncovering agents' decision

¹Stochastic control theory can also be used to model learning behavior in which agents update beliefs about unobserved state variables and unknown parameters of the transition probabilities according to Bayes' rule.
²In time-series data, $a$ is fixed at 1 and $t$ ranges over $1, \ldots, T$. In cross-sectional data sets, $T$ is fixed at 1 and $a$ ranges over $1, \ldots, A$. In panel data sets, $t$ ranges over $1, \ldots, T_a$, where $T_a$ is the number of periods agent $a$ is observed (possibly different for each agent) and $a$ ranges over $1, \ldots, A$, where $A$ is the total number of agents in the sample.
rules or, more generally, the stochastic process from which the realizations $\{d_t^a, s_t^a\}$ were drawn, but are generally independent of any particular behavioral theory.³ This chapter focuses on structural estimation of MDPs under the maintained hypothesis that $\{d_t^a, s_t^a\}$ is a realization of a controlled stochastic process. In addition to uncovering the form of this stochastic process (and the associated decision rule $\delta$), structural methods attempt to uncover (estimate) the primitives $(u, p, \beta)$ that generated it.
Before considering whether it is technically possible to estimate agents' preferences and beliefs, we need to consider whether this is even logically possible, i.e. whether $(u, p, \beta)$ is identified. I discuss the identification problem in Section 3.5, and show that the question of identification depends on what type of data we have access to (i.e. experimental vs. non-experimental), and what kinds of a priori restrictions we are willing to impose on $(u, p, \beta)$. If we only have access to non-experimental data (i.e. uncontrolled observations of agents "in the wild"), and if we are unwilling to impose any prior restrictions on $(u, p, \beta)$ beyond basic measurability and regularity conditions on $u$ and $p$, then it is impossible to consistently estimate $(u, p, \beta)$, i.e. the class of all MDPs is non-parametrically unidentified. On the other hand, if we are willing to restrict $u$ and $p$ to a finite-dimensional parametric family, say $\{u = u_\theta, p = p_\theta \mid \theta \in \Theta \subset R^K\}$, then the primitives $(u, p, \beta)$ are identified (generically). If we are willing to impose an even stronger prior restriction, stationarity and rational expectations (RE), then we only need parametric restrictions on $u$ in order to identify $(u, p, \beta)$ since stationarity and the RE hypothesis allow us to use non-parametric methods to consistently estimate agents' subjective beliefs from observations of their past states and decisions. Given that we are already imposing strong prior assumptions by modelling agents' behavior as an optimal decision rule to an MDP, it would be somewhat schizophrenic to be unwilling to impose any additional prior restrictions on $(u, p, \beta)$. In the sequel, I assume that the econometrician is willing to bring to bear prior knowledge in the form of a parametric representation for $(u, p, \beta)$. This reduces the problem of structural estimation to the technical issue of estimating a parameter vector $\theta \in \Theta$ where $\Theta$ is a compact subset of $R^K$.
The appropriate econometric method for estimating $\theta$ depends critically on whether the control variable $d_t$ is continuous or discrete. If $d_t$ can take on a continuum of possible values we say that the MDP is a continuous decision process (CDP), and if $d_t$ can take on a finite or countable number of values then the MDP is a discrete decision process (DDP). The predominant estimation method for CDPs is generalized method of moments (GMM) using the first order conditions from the MDP problem (stochastic Euler equations) as orthogonality conditions [Hansen (1982), Hansen and Singleton (1982)]. Hansen's chapter (this volume) and Pakes's (1994) survey provide excellent introductions to the literature on structural estimation methods for CDPs.

³For an overview of this literature, see Billingsley (1961), Chamberlain (1984), Heckman (1981a), Lancaster (1990) and Basawa and Prakasa Rao (1980).
This chapter focuses on structural estimation of DDPs. DDPs are appropriate for decision problems such as whether or not to quit a job [Gotz and McCall (1984)], search for a new job [Miller (1984)], have a child [Wolpin (1984)], renew a patent [Pakes (1986)], replace a bus or airplane engine [Rust (1987), Kennet (1994)] or retire a cement kiln [Das (1992)]. Although most of the early empirical applications of DDPs have been for binary decision problems, this chapter shows that most of the estimation methods extend naturally to DDPs with any finite number of possible decisions. Examples of multiple choice DDPs include Rust's (1989, 1993) model of retirement behavior where workers decide each period whether to work full-time, work part-time, or quit, and whether or not to apply for Social Security, and Miller's (1984) multi-armed-bandit model of occupation choice.
Since the control variable in a DDP model assumes at most a finite number of possible values, the optimal decision rule is determined by the solution to a system of inequalities rather than as a zero to a first order condition. As a result there is no analog of stochastic Euler equations to serve as orthogonality conditions for GMM estimation of $\theta$ as in the case of CDPs. Instead, most structural estimation methods for DDPs require explicit calculation of the optimal decision rule $\delta$, typically via numerical methods since analytic solutions for $\delta$ are quite rare. Although we also discuss simulation estimators that rely on Monte Carlo simulations of the controlled stochastic process $\{s_t, d_t\}$ rather than on explicit numerical calculation of $\delta$, all of these methods can be conceptualized as forms of nonlinear regression that search for an estimate $\hat{\theta}$ whose implied decision rule $d_t = \delta(s_t, \hat{\theta})$ best fits the data $\{d_t^a, s_t^a\}$ according to some metric. Unfortunately straightforward application of nonlinear regression methods is not possible due to three complications: (1) the dependent variable $d_t$ is discrete rather than continuous; (2) the functional form of $\delta$ is generally not known a priori but rather must be derived from the solution to the stochastic control problem; (3) the error term $\varepsilon_t$ in the regression function $\delta$ is typically multi-dimensional and enters in a non-additive, non-separable fashion: $d_t = \delta(x_t, \varepsilon_t, \theta)$.
The basic motivation for including an error term in the DDP model is to obtain a statistically non-degenerate econometric model. The degeneracy of DDP models without error terms is due to a basic result of MDP theory reviewed in Section 2: the optimal decision rule $\delta$ is a deterministic function of the state $s_t$. Section 3.1 offers several possible interpretations for the error terms in a DDP model, but argues that the most natural and internally consistent interpretation is that $\varepsilon_t$ is an unobserved state variable. Under this interpretation, we partition the full state variable $s_t = (x_t, \varepsilon_t)$ into a subvector $x_t$ that is observed by the econometrician, and a subvector $\varepsilon_t$ that is observed only by the agent. If we are willing to impose two additional restrictions on $u$ and $p$, namely, that $\varepsilon_t$ enters $u$ in an additive separable (AS) fashion and that $p$ satisfies a conditional independence (CI) condition, we can apply a number of powerful results from the literature on estimation of static discrete choice models [McFadden (1981, 1984)] to yield estimators of $\theta$ with desirable asymptotic properties. In particular, the AS and CI assumptions allow us to
integrate out $\varepsilon_t$ from the decision rule $\delta$, yielding a non-degenerate system of conditional choice probabilities $P(d_t \mid x_t, \theta)$ for estimating $\theta$ by the method of maximum likelihood. Under the further restriction that $\{\varepsilon_t\}$ is an IID extreme value process we obtain a dynamic generalization of the well-known multinomial logit model,

$$P(d \mid x, \theta) = \frac{\exp\{v_\theta(x, d)\}}{\sum_{d' \in D(x)} \exp\{v_\theta(x, d')\}}. \qquad (1.1)$$

As far as estimation is concerned, the main difference between the static and dynamic logit models is the interpretation of the $v_\theta$ function: in the static logit model it is a one period utility function that is typically specified as a linear-in-parameters function of $\theta$, whereas in the dynamic logit model it is the sum of a one period utility function plus the expected discounted utility in all future periods. Since the functional form of $v_\theta$ in DDP is generally not known a priori, its values must be computed numerically for any particular value of $\theta$. As a result, maximum likelihood estimation of DDP models requires a nested numerical solution algorithm consisting of an "outer" optimization algorithm that searches over the parameter space $\Theta$ to maximize the likelihood function and an "inner" dynamic programming algorithm that solves (or approximately solves) the stochastic control problem and computes the choice probabilities $P(d \mid x, \theta)$ and derivatives $\partial P(d \mid x, \theta)/\partial\theta$ for each trial value of $\theta$. There are a number of fast algorithms for solving finite- and infinite-horizon stochastic control problems, but space constraints prevent more than a cursory discussion of the main methods in this chapter.
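The nested structure just described can be sketched schematically. What follows is not Rust's production code: the two-state problem, the linear-in-parameters utility $u_\theta$, and the fabricated data pairs are all illustrative assumptions. The inner algorithm iterates a smoothed Bellman operator of the form (2.31) to its fixed point $v_\theta$, and the outer algorithm maximizes a likelihood built from the choice probabilities (1.1), using a derivative-free optimizer to sidestep the derivatives $\partial P/\partial\theta$ for simplicity:

    import numpy as np
    from scipy.optimize import minimize

    S, D, beta = 2, 2, 0.95
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],     # p[d, s, s'] = Pr{s' | s, d}
                  [[0.5, 0.5], [0.1, 0.9]]])

    def solve_v(theta, tol=1e-10):
        """Inner algorithm: fixed point of the smoothed Bellman operator (2.31)."""
        u = np.array([[theta[0], theta[1]],     # illustrative u_theta(x, d)
                      [0.0, theta[0] + theta[1]]])
        v = np.zeros((S, D))
        while True:
            logsum = np.log(np.exp(v).sum(axis=1))        # log sum_d' exp{v(x', d')}
            v_new = u + beta * np.einsum('dst,t->sd', p, logsum)
            if np.max(np.abs(v_new - v)) < tol:
                return v_new
            v = v_new

    def neg_loglike(theta, data):
        """Outer algorithm: negative log-likelihood from the dynamic logit (1.1)."""
        v = solve_v(theta)
        P = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)  # P(d | x, theta)
        return -sum(np.log(P[x, d]) for x, d in data)

    data = [(0, 0), (0, 1), (1, 1), (1, 1)]     # fabricated (x_t, d_t) observations
    theta_hat = minimize(neg_loglike, x0=[0.5, 0.5], args=(data,),
                         method='Nelder-Mead').x
    print(theta_hat)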
Section 3.3 presents other econometric specifications for the error term that allow $\varepsilon_t$ to enter $u$ in a nonlinear, non-additive fashion, and, also, specifications with more complicated patterns of serial dependence in $\{\varepsilon_t\}$ than is allowed by the CI assumption. Section 3.4 discusses the simulation estimator proposed by Hotz et al. (1993) that avoids the computational burden of the nested numerical solution methods, and the associated "curse of dimensionality", i.e. the exponential rise in the amount of computer time/space required to solve a DDP problem as its "size" (measured in terms of number of possible values the state and control variables can assume) increases. However, the curse of dimensionality also has implications for the data and estimation complexity of a DDP model: as the size (i.e. the level of realism or detail) of a DDP model increases, the amount of data needed to estimate the model with an acceptable degree of precision increases more than proportionately. The problems are most severe for estimating beliefs, $p$. Subjective beliefs can be very slippery, high-dimensional objects to estimate. Since the optimal decision rule $\delta$ is generally quite sensitive to the specification of $p$, an inaccurate or inconsistent estimate of $p$ will contaminate the estimates of $u$ and $\beta$. Even under the assumption of rational expectations (which allows us to estimate $p$ non-parametrically), the number of observations required to calculate estimates of $p$ of specified accuracy increases exponentially with the number of state and control variables included in the model. The simulation estimator is particularly data-dependent in that it requires
accurate non-parametric estimates of agents' conditional choice probabilities $P$ as well as their beliefs $p$.
Given all the difficulties involved in structural estimation, the reader might wonder why not simply estimate agents' conditional choice probabilities $P$ using simpler flexible parametric and non-parametric estimation methods. Of course, reduced-form methods can be used, and are quite useful for initial exploratory data analysis and judging whether more tightly parameterized structural models are misspecified. Nevertheless there is considerable interest in structural estimation methods for both intellectual and practical reasons. The intellectual reason is that structural estimation is the most direct way to assess the empirical validity of a specific MDP model: in the process of solving, estimating, and testing a particular MDP model we learn not only about the data, but the detailed implications of the theory. The practical motivation is that structural models can generate more accurate predictions of the impacts of policy changes than reduced-form models. As Lucas (1976) noted, reduced-form econometric techniques can be thought of as uncovering the form of an agent's historical decision rule. The resulting estimate $\hat{\theta}$ can then be used to predict the agent's behavior in the future, provided that the environment is stationary. Lucas showed that reduced-form estimates can produce very misleading forecasts of the effects of policy changes that alter the stochastic environment that agents will face in the future.⁴ The reason is that a policy $\alpha$ (such as government rules for payment of Social Security or welfare benefits) can affect an agent's preferences, beliefs and discount factor. If we denote the dependence of primitives on policy as $(u_\alpha, p_\alpha, \beta_\alpha)$, then under a new policy $\alpha'$ the agent's behavior will be given by a new decision rule $\delta(u_{\alpha'}, p_{\alpha'}, \beta_{\alpha'})$ rather than the historical decision rule $\delta(u_\alpha, p_\alpha, \beta_\alpha)$. Unless there has been a lot of historical variation in policies $\alpha$, reduced-form models won't be able to estimate the independent effect of $\alpha$ on $\delta$, and, therefore, we won't be able to predict how agents will react to a hypothetical policy $\alpha'$. However if we are able to parameterize the way in which policy affects the primitives, $(u_\alpha, p_\alpha, \beta_\alpha)$, then it is a typically straightforward exercise to compute the new decision rule $\delta(u_{\alpha'}, p_{\alpha'}, \beta_{\alpha'})$ for a hypothetical policy $\alpha'$.
One can push this line of argument only so far, since its validity depends on the assumption that agents really are rational expected-utility maximizers and the structural model is correctly specified. If we admit that a tightly parameterized structural model is at best an abstract and approximate representation of reality, there is no reason why a structural model necessarily yields more accurate forecasts than reduced-form models. Furthermore, because of the identification problem it is possible that we could have a situation where two distinct sets of primitives fit an historical data set equally well, but yield very different predictions about the impact of a hypothetical policy. Under such circumstances there is no objective basis for choosing one prediction over another, and we may have to go to the expense of

⁴The limitations of reduced-form models have also been pointed out in an earlier paper by Marschak (1953), although his exposition pertained more to the static econometric models of that period. These general ideas can be traced back even further to the work of Haavelmo (1944) and others at the Cowles Commission.
conducting a controlled experiment to help identify the primitives and predict the impact of a new policy $\alpha'$.⁵ In spite of these problems, the final section of this chapter provides some empirical applications that demonstrate the ability of simple structural models to make much more accurate predictions of the effects of various policy changes than reduced-form models.
Readers who are familiar with the theory of stochastic control are free to skip the brief review of theory and solution methods in Section 2 and move directly to the econometric implementation of the theory in Section 3. A general observation about the current state of the art in this literature is that, while it is easy to formulate very general and detailed MDPs, Bellman's "curse of dimensionality" implies that our ability to actually solve and estimate these problems is much more limited.⁶ However, recent research [Rust (1995b)] shows that use of random Monte Carlo integration methods does succeed in breaking the curse of dimensionality for the subclass of DDPs. This result offers the promise that fairly realistic and detailed DDP models will be estimable in the near future. The approach of this chapter is to start with a presentation of the general theory of MDPs and then show how various restrictions on the general theory lead to subclasses of econometric models that are feasible to estimate.
The first general restriction is to exclude MDPs formulated in continuous time. Although many of the results described in Section 3 can be generalized to continuous-time semi-Markov processes [Ahn (1993b)], there has been little progress on extending the theory to cover other types of continuous-time objects such as controlled diffusion processes. The rationale for using discrete-time models is that solutions to continuous-time problems can be arbitrarily closely approximated by solutions to corresponding discrete-time versions of the problem [cf. Gihman and Skorohod (1979, Chapter 2.3), van Dijk (1984)]. Indeed the standard approach to solving continuous-time stochastic control problems involves solving an approximate version of the problem in discrete time [Kushner (1990)].
The second restriction is implicit in the theory of stochastic control, namely the assumption that agents conform to the von Neumann-Morgenstern axioms for choice under uncertainty so that their preferences can be represented by the expected value of a cardinal utility function. A number of experiments have indicated that human decision-making under uncertainty may not always be consistent with the von Neumann-Morgenstern axioms.⁷ In addition, expected-utility models imply that agents are indifferent about the timing of the resolution of uncertain events, whereas human decision-makers seem to have definite preferences over the time at which uncertainty is resolved [Kreps and Porteus (1978), Chew and Epstein (1989)].
The justification for focusing on expected utility is that it remains the most tractable

⁵Experimental data are subject to their own problems, and it would be a mistake to think of controlled experiments as the only reliable way to predict the response to a new policy. See Heckman (1991, 1994) for an enlightening discussion of some of these limitations.
⁶See Rust (1994, Section 2) for a more detailed discussion of some of the problems faced in estimating MDPs.
⁷Machina (1982) identifies the independence axiom as the source of many of the discrepancies.
framework for modelling choice under uncertainty.⁸ Furthermore, Section 3.5 shows that, from an econometric standpoint, the expected-utility framework is sufficiently rich to model virtually any type of observed behavior. Our ability to discriminate between expected utility and the more subtle non-expected-utility theories of choice under uncertainty may require quasi-econometric methods such as controlled experiments.⁹

2. Solving MDPs via dynamic programming: A brief review

This section reviews the main results on dynamic programming in finite-horizon problems, and the functional equations that must be solved in infinite-horizon problems. Due to space constraints I only give a cursory outline of the main numerical methods for solving these functional equations, referring the reader to Puterman (1990) or Rust (1995a, 1996) for more in-depth surveys.

Definition 2.1

A (discrete-time) Markovian decision process consists of the following objects:

• A time index $t \in \{0, 1, 2, \ldots, T\}$, $T \le \infty$;
• A state space $S$;
• A decision space $D$;
• A family of constraint sets $\{D_t(s_t) \subseteq D\}$;
• A family of transition probabilities $\{p_{t+1}(\cdot \mid s_t, d_t): \mathcal{B}(S) \to [0, 1]\}$;¹⁰
• A family of discount functions $\{\beta_t(s_t, d_t) > 0\}$ and single period utility functions $\{u_t(s_t, d_t)\}$ such that the utility functional $U$ has the additively separable decomposition¹¹

$$U(\tilde{s}, \tilde{d}) = \sum_{t=0}^{T} \left[ \prod_{j=0}^{t-1} \beta_j(s_j, d_j) \right] u_t(s_t, d_t). \qquad (2.1)$$

⁸Recent work by Epstein and Zin (1989) and Hansen and Sargent (1992) on models with non-separable, non-expected-utility functions shows that certain specifications are computationally and analytically tractable. Epstein and Zin have already used their specification of preferences in an empirical investigation of asset pricing. Despite these promising beginnings, the theory and computational methods for these more general problems are in their infancy, and due to space constraints, we are unable to cover these methods in this survey.
⁹An example of the ability of laboratory experiments to uncover discrepancies between human behavior and the predictions of expected-utility theory is the Allais paradox described in Machina (1982, 1987).
¹⁰$\mathcal{B}(S)$ is the Borel $\sigma$-algebra of measurable subsets of $S$. For simplicity, the rest of this chapter avoids measure-theoretic details since they are superfluous in the most commonly encountered case where both the state and control variables are discrete. See Rust (1996) for a statement of the required regularity conditions for problems with continuous state and control variables.
¹¹The boldface notation denotes sequences: $\tilde{s} = (s_0, \ldots, s_T)$. Also, define $\prod_{j=0}^{-1} \beta_j(s_j, d_j) = 1$ in formula (2.1).

The agent's optimization problem is to choose an optimal decision rule $\delta^* = (\delta_0, \ldots, \delta_T)$ to solve the following problem:

$$\max_{\delta = (\delta_0, \ldots, \delta_T)} E_\delta\{U(\tilde{s}, \tilde{d})\}. \qquad (2.2)$$

2.1. Finite-horizon dynamic programming and the optimality of Markovian decision rules

In finite-horizon problems ($T < \infty$), the optimal decision rule $\delta^* = (\delta_0, \ldots, \delta_T)$ can be computed by backward induction starting at the terminal period, $T$. In principle, the optimal decision at each time $t$ can depend not only on the current state $s_t$, but on the entire previous history of the process, $d_t = \delta_t(s_t, H_{t-1})$ where $H_{t-1} = (s_0, d_0, \ldots, s_{t-1}, d_{t-1})$. However in carrying out the process of backward induction it is easy to see that the Markovian structure of $p$ and the additive separability of $U$ imply that it is unnecessary to keep track of the entire previous history: the optimal decision rule depends only on the current time $t$ and the current state $s_t$: $d_t = \delta_t(s_t)$. For example, starting in period $T$ we have

$$\delta_T(H_{T-1}, s_T) = \operatorname{argmax}_{d_T \in D_T(s_T)} U(H_{T-1}, s_T, d_T), \qquad (2.3)$$

where $U$ can be rewritten as

$$U(H_{T-1}, s_T, d_T) = \sum_{t=0}^{T-1} \left[ \prod_{j=0}^{t-1} \beta_j(s_j, d_j) \right] u_t(s_t, d_t) + \left[ \prod_{j=0}^{T-1} \beta_j(s_j, d_j) \right] u_T(s_T, d_T). \qquad (2.4)$$

From (2.4) it is clear that previous history $H_{T-1}$ does not affect the optimal decision of $d_T$ in (2.3) since $d_T$ appears only in the final term $u_T(s_T, d_T)$ on the right hand side of (2.4). Since the final term is affected by $H_{T-1}$ only by the multiplicative discount factor $\prod_{j=0}^{T-1} \beta_j(s_j, d_j)$, it's clear that $\delta_T$ depends only on $s_T$. Working backwards recursively, it is straightforward to verify that at each time $t$ the optimal decision rule $\delta_t$ depends only on $s_t$. A decision rule that depends on the past history of the process only via the current state $s_t$ is called Markovian. Notice also that the optimal decision rule will generally be a deterministic function of $s_t$, because randomization can only reduce expected utility if the optimal value of $d_t$ in (2.3) is unique. This is a generic property, since if there are two distinct values of $d \in D_T(s_T)$ that attain the maximum in (2.3), by a slight perturbation of $u$ we obtain a similar model where the maximizing value is unique.
The value function is the expected discounted value of utility over the remaining horizon assuming an optimal policy is followed in the future. The method of dynamic programming calculates the value function and the optimal policy recursively as follows. In the terminal period $V_T$ and $\delta_T$ are defined by

$$\delta_T(s_T) = \operatorname{argmax}_{d_T \in D_T(s_T)} u_T(s_T, d_T), \qquad (2.5)$$

$$V_T(s_T) = \max_{d_T \in D_T(s_T)} u_T(s_T, d_T). \qquad (2.6)$$

In periods $t = 0, \ldots, T - 1$, $V_t$ and $\delta_t$ are recursively defined by

$$\delta_t(s_t) = \operatorname{argmax}_{d_t \in D_t(s_t)} \left[ u_t(s_t, d_t) + \beta_t(s_t, d_t) \int V_{t+1}(s_{t+1}) \, p_{t+1}(ds_{t+1} \mid s_t, d_t) \right], \qquad (2.7)$$

$$V_t(s_t) = \max_{d_t \in D_t(s_t)} \left[ u_t(s_t, d_t) + \beta_t(s_t, d_t) \int V_{t+1}(s_{t+1}) \, p_{t+1}(ds_{t+1} \mid s_t, d_t) \right]. \qquad (2.8)$$

It's straightforward to verify that at time $t = 0$ the value function $V_0(s_0)$ represents the conditional expectation of utility over all future periods. Since dynamic programming has recursively generated the optimal decision rule $\delta^* = (\delta_0, \ldots, \delta_T)$, it follows that

$$V_0(s) = \max_\delta E_\delta\{U(\tilde{s}, \tilde{d}) \mid s_0 = s\}. \qquad (2.9)$$

These results can be formalized as follows.

Theorem 2.1

Given an MDP that satisfies certain weak regularity conditions [see Gihman and Skorohod (1979)]:
1. An optimal, non-randomized decision rule $\delta^*$ exists.
2. An optimal decision rule can be found within the subclass of non-randomized Markovian strategies.
3. In the finite-horizon case ($T < \infty$) an optimal decision rule $\delta^*$ can be computed by backward induction according to the recursions (2.5), ..., (2.8).
4. In the infinite-horizon case ($T = \infty$) an optimal decision rule $\delta^*$ can be approximated arbitrarily closely by the optimal decision rule $\delta_N^*$ to an $N$-period problem in the sense that

$$\lim_{N \to \infty} E_{\delta_N^*}\{U_N(\tilde{s}, \tilde{d})\} = \lim_{N \to \infty} \sup_\delta E_\delta\{U_N(\tilde{s}, \tilde{d})\} = \sup_\delta E_\delta\{U(\tilde{s}, \tilde{d})\}. \qquad (2.10)$$

2.2. Infinite-horizon dynamic programming and Bellman's equation

Further simplifications are possible in the case of stationary MDPs. In this case the transition probabilities and utility functions are the same for all $t$, and the discount functions $\beta_t(s_t, d_t)$ are set equal to some constant $\beta \in [0, 1)$. In the finite-horizon case the time homogeneity of $u$ and $p$ does not lead to any significant simplifications since there still is a fundamental non-stationarity induced by the fact that remaining utility $\sum_{j=t}^{T} \beta^j u(s_j, d_j)$ depends on $t$. However in the infinite-horizon case, the stationary Markovian structure of the problem implies that the future looks the same whether the agent is in state $s_t$ at time $t$ or in state $s_{t+k}$ at time $t + k$ provided that $s_t = s_{t+k}$. In other words, the only variable which affects the agent's view about the future is the value of his current state $s$. This suggests that the optimal decision rule and corresponding value function should be time invariant, i.e. for all $t \ge 0$ and all $s \in S$, $\delta_t^*(s) = \delta(s)$ and $V_t^*(s) = V(s)$. Analogous to equation (2.7), $\delta$ satisfies

$$\delta(s) = \operatorname{argmax}_{d \in D(s)} \left[ u(s, d) + \beta \int V(s') \, p(ds' \mid s, d) \right], \qquad (2.11)$$

where $V$ is defined recursively as the solution to Bellman's equation,

$$V(s) = \max_{d \in D(s)} \left[ u(s, d) + \beta \int V(s') \, p(ds' \mid s, d) \right]. \qquad (2.12)$$

It is easy to see that if a solution to Bellman's equation exists, then it must be unique. Suppose that $W(s)$ is another solution to (2.12). Then we have

$$|V(s) - W(s)| \le \beta \max_{d \in D(s)} \int |V(s') - W(s')| \, p(ds' \mid s, d) \le \beta \sup_{s' \in S} |V(s') - W(s')|. \qquad (2.13)$$

Since $0 \le \beta < 1$, the only solution to (2.13) is $\sup_{s \in S} |V(s) - W(s)| = 0$.

2.3. Bellman's equation, contraction mappings and optimality

To establish the existence of a solution to Bellman's equation, assume for the moment the following regularity conditions: (1) $u(s, d)$ is jointly continuous and bounded in $(s, d)$, (2) $D(s)$ is a continuous correspondence. Let $C(S)$ denote the vector space of all continuous, bounded functions $f: S \to R$ under the supremum norm, $\|f\| = \sup_{s \in S} |f(s)|$. Then $C(S)$ is a Banach space, i.e. a complete normed linear space.¹² Define an operator $\Gamma: C(S) \to C(S)$ by

$$\Gamma(W)(s) = \max_{d \in D(s)} \left[ u(s, d) + \beta \int W(s') \, p(ds' \mid s, d) \right]. \qquad (2.14)$$

Bellman's equation can then be rewritten in operator notation as

$$V = \Gamma(V), \qquad (2.15)$$

i.e. $V$ is a fixed point of the mapping $\Gamma$. Using an argument similar to (2.13) it is easy to show that given any $V, W \in C(S)$ we have

$$\|\Gamma(V) - \Gamma(W)\| \le \beta \|V - W\|. \qquad (2.16)$$

An operator that satisfies inequality (2.16) for some $\beta \in (0, 1)$ is said to be a contraction mapping.

Theorem 2.2 (Contraction mapping theorem)

If $\Gamma$ is a contraction mapping on a Banach space $B$, then $\Gamma$ has a unique fixed point $V$.

The uniqueness of the fixed point can be established by an argument similar to (2.13). The existence of a solution is a result of the completeness of the Banach space $B$. Starting from any initial element of $B$ (such as 0), the contraction property (2.16) implies that the following sequence of successive approximations forms a Cauchy sequence in $B$:

$$\{0, \Gamma(0), \Gamma^2(0), \Gamma^3(0), \ldots, \Gamma^n(0), \ldots\}. \qquad (2.17)$$

Since the Banach space $B$ is complete, the Cauchy sequence converges to a point $V \in B$, so existence follows by showing that $V$ is a fixed point of $\Gamma$. To see this, note that a contraction $\Gamma$ is (uniformly) continuous, so

$$V = \lim_{n \to \infty} \Gamma^n(0) = \lim_{n \to \infty} \Gamma[\Gamma^{n-1}(0)] = \Gamma(V), \qquad (2.18)$$

i.e. $V$ is indeed the required fixed point of $\Gamma$.
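The successive approximations sequence (2.17) is easy to illustrate numerically for a finite-state problem. In the sketch below the primitives are illustrative toys, and the sup-norm distance between iterates contracts at rate $\beta$ as in (2.16):

    import numpy as np

    u = np.array([[1.0, 0.5], [0.0, 2.0]])
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta = 0.95

    def gamma(V):
        # Gamma(V)(s) = max_d [ u(s,d) + beta sum_s' V(s') p(s'|s,d) ], cf. (2.14)
        return (u + beta * np.einsum('dst,t->sd', p, V)).max(axis=1)

    V = np.zeros(2)
    for k in range(2000):
        V_new = gamma(V)
        if np.max(np.abs(V_new - V)) < 1e-12:   # the Cauchy sequence has converged
            break
        V = V_new
    print(k, V)   # V is (numerically) the fixed point V = Gamma(V)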


We now show that given the single period decision rule $\delta$ defined in (2.11) the stationary, infinite-horizon policy $\delta^* = (\delta, \delta, \ldots)$ does in fact constitute an optimal decision rule for the infinite-horizon MDP. This result follows by showing that the unique solution $V(s)$ to Bellman's equation coincides with the optimal value

¹²A space $B$ is said to be complete if every Cauchy sequence in $B$ converges to a point in $B$.



function $V_\infty^*$ defined by

$$V_\infty^*(s) = \max_\delta E_\delta \left\{ \sum_{t=0}^{\infty} \beta^t u(s_t, d_t) \mid s_0 = s \right\}. \qquad (2.19)$$

Consider approximating the infinite-horizon problem by the solution to a finite-horizon problem with value function

$$V_T(s) = \max_\delta E_\delta \left\{ \sum_{t=0}^{T} \beta^t u(s_t, d_t) \mid s_0 = s \right\}. \qquad (2.20)$$

Since $u$ is bounded and continuous, $\sum_{t=0}^{T} \beta^t u(s_t, d_t)$ converges to $\sum_{t=0}^{\infty} \beta^t u(s_t, d_t)$ for any sequences $\tilde{s} = (s_0, s_1, \ldots)$ and $\tilde{d} = (d_0, d_1, \ldots)$. Theorem 2.1(4) implies that for each $s \in S$, $V_T(s)$ converges to the infinite-horizon value function $V_\infty^*(s)$:

$$\lim_{T \to \infty} V_T(s) = V_\infty^*(s) \quad \forall s \in S. \qquad (2.21)$$

But the contraction mapping theorem also implies that this same sequence converges to $V$ (since $V_T = \Gamma^T(0)$), so $V = V_\infty^*$. Since $V$ is the expected present discounted value of utility under the policy $\delta^*$ (a result we demonstrate in Section 2.4), the fact that $V = V_\infty^*$ immediately implies the optimality of $\delta^*$.
A similar result can be proved under weaker conditions that allow $u(s, d)$ to be an unbounded function of the state variable. As we will see in Section 3, unbounded utility functions arise in DDP problems as a consequence of assumptions about the distribution of unobserved state variables. Although the contraction mapping theorem is no longer directly applicable, one can prove the following result, a generalization of Blackwell's theorem, under a weaker set of regularity conditions that allows for unbounded utility.

Theorem 2.3 (Blackwell's theorem)

Given an infinite-horizon, time homogeneous MDP that satisfies certain regularity conditions [see Bhattacharya and Majumdar (1989)]:
1. A unique solution $V$ to Bellman's equation (2.12) exists, and it coincides with the optimal value function defined in (2.19).
2. There exists a stationary, non-randomized, Markovian optimal control $\delta^*$ given by the solution to (2.11).
3. There is an optimal non-randomized, Markovian decision rule $\delta^*$ which can be approximated by the solution $\delta_N^*$ to an $N$-period problem with utility function $U_N(\tilde{s}, \tilde{d}) = \sum_{t=0}^{N} \beta^t u(s_t, d_t)$:

$$\lim_{N \to \infty} E_{\delta_N^*}\{U_N(\tilde{s}, \tilde{d})\} = \lim_{N \to \infty} \sup_\delta E_\delta\{U_N(\tilde{s}, \tilde{d})\} = \sup_\delta E_\delta\{U(\tilde{s}, \tilde{d})\} = E_{\delta^*}\{U(\tilde{s}, \tilde{d})\}. \qquad (2.22)$$

2.4. A geometric series representation for MDPs

Presently, the most commonly used solution procedure for MDP problems involves discretizing continuous state and control variables into a finite number of possible values. The resulting class of finite state DDP problems has a simple and beautiful algebraic structure that we now review.¹³ Without loss of generality we can identify the state and decision spaces as finite sets of integers $\{1, \ldots, S\}$ and $\{1, \ldots, D\}$, and the constraint set as $\{1, \ldots, D(s)\}$, where for notational simplicity we now let $S$, $D$ and $D(s)$ denote positive integers rather than sets. It follows that a feasible stationary decision rule $\delta$ is an $S$-dimensional vector satisfying $\delta(s) \in \{1, \ldots, D(s)\}$, $s = 1, \ldots, S$, and the value function $V$ is an $S$-dimensional vector in the Euclidean space $R^S$. Given $\delta$ we can define a vector $u_\delta \in R^S$ whose $i$th component is $u[i, \delta(i)]$, and an $S \times S$ transition probability matrix $E_\delta$ whose $(i, j)$ element is $p[j \mid i, \delta(i)] = \Pr\{s_{t+1} = j \mid s_t = i, d_t = \delta(i)\}$. Bellman's equation for a DDP reduces to

$$\Gamma(V)(s) = \max_{1 \le d \le D(s)} \left[ u(s, d) + \beta \sum_{s'=1}^{S} V(s') p(s' \mid s, d) \right]. \qquad (2.23)$$

Given a stationary, Markovian decision rule $\delta$, we define $V_\delta \in R^S$ as the vector of expected discounted utilities under policy $\delta$. It is straightforward to show that $V_\delta$ is the solution to a system of linear equations,

$$V_\delta = u_\delta + \beta E_\delta V_\delta, \qquad (2.24)$$

which can be solved by matrix inversion:

$$V_\delta = [I - \beta E_\delta]^{-1} u_\delta = u_\delta + \beta E_\delta u_\delta + \beta^2 E_\delta^2 u_\delta + \beta^3 E_\delta^3 u_\delta + \cdots. \qquad (2.25)$$

The last equation in (2.25) is simply a geometric series expansion for $V_\delta$ in powers of $\beta$ and $E_\delta$. As is well known, $E_\delta^N = (E_\delta)^N$ is simply the $N$-stage transition probability matrix, whose $(i, j)$ element equals $\Pr\{s_{t+N} = j \mid s_t = i, \delta\}$, where the presence of $\delta$ as a conditioning argument denotes the fact that all intervening decisions satisfy $d_{t+j} = \delta(s_{t+j})$, $j = 0, \ldots, N$. Since $\beta^N E_\delta^N u_\delta$ is the expected discounted utility received in period $N$ under policy $\delta$, formula (2.25) can be thought of as a vector generalization of a geometric series, showing explicitly how $V_\delta$ equals the sum of expected discounted utilities under $\delta$ in all future periods.¹⁴ Since $E_\delta^N$ is a transition probability matrix (i.e. all elements are between 0 and 1, and its rows sum to unity), it
¹³The geometric representation also holds for continuous state MDPs, but in infinite-dimensional space instead of $R^S$.
¹⁴As Lucas (1978) notes, "a little knowledge of geometric series goes a long way."
follows that $\lim_{N \to \infty} \beta^N E_\delta^N = 0$, guaranteeing the invertibility of $[I - \beta E_\delta]$ for any Markovian decision rule $\delta$ and all $\beta \in [0, 1)$.¹⁵
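For a small problem the valuation formula (2.25) can be checked directly. In the sketch below (illustrative primitives and an arbitrary fixed policy $\delta$) the matrix-inversion solution agrees with a long truncation of the geometric series:

    import numpy as np

    u = np.array([[1.0, 0.5], [0.0, 2.0]])
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta, delta = 0.95, np.array([0, 1])     # a fixed stationary decision rule

    u_d = u[np.arange(2), delta]             # u_delta[i] = u[i, delta(i)]
    E_d = p[delta, np.arange(2), :]          # E_delta[i, j] = p(j | i, delta(i))

    V_exact = np.linalg.solve(np.eye(2) - beta * E_d, u_d)   # inversion form of (2.25)

    V_series, term = np.zeros(2), u_d.copy() # partial sums of beta^N E_delta^N u_delta
    for _ in range(2000):
        V_series += term
        term = beta * (E_d @ term)
    print(V_exact, V_series)                 # the two agree to high accuracy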

2.5. Overview of solution methods

This section provides a brief review of solution methods for MDPs. For a more
extensive review we refer the reader to Rust (1995a).
The main solution method for finite-horizon MDPs is backward recursion,
which has already been described in Section 2.1. The amount of computer time/
space required to solve a problem increases linearly in the length of the horizon T
and quadratically in the number of possible states S, the latter result being
due to the fact that the main work involved in dynamic programming is calculating
the conditional expectation of future utility, which requires multiplying an S x S
transition matrix by the S x 1 value function.
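A sketch of the backward recursion for a toy stationary problem (illustrative primitives, constant discount factor) makes this cost structure visible: each period requires one $S \times S$ matrix-vector product per decision:

    import numpy as np

    u = np.array([[1.0, 0.5], [0.0, 2.0]])
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta, T = 0.95, 50

    V = u.max(axis=1)                         # terminal value function (2.6)
    policy = [u.argmax(axis=1)]               # terminal decision rule (2.5)
    for t in range(T - 1, -1, -1):
        Q = u + beta * np.einsum('dst,t->sd', p, V)   # expectation step per period
        V, d_opt = Q.max(axis=1), Q.argmax(axis=1)    # recursions (2.8) and (2.7)
        policy.append(d_opt)
    policy.reverse()                          # policy[t] is delta_t, t = 0, ..., T
    print(V, policy[0])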
In the infinite-horizon case there are a variety of solution methods, most of which can be viewed as different strategies for solving the Bellman functional equation. The method of successive approximations which we described in Section 2.2 is probably the most well-known solution method for infinite-horizon problems: it essentially amounts to using the solution to a finite-horizon problem with a large horizon $T$ to approximate the solution to the infinite-horizon problem. In certain cases we can significantly accelerate successive approximations by employing the McQueen-Porteus error bounds,

$$\Gamma^k(V) + \underline{b}_k e \le V^* \le \Gamma^k(V) + \bar{b}_k e, \qquad (2.26)$$

where $V^*$ is the fixed point to $\Gamma$, $e$ denotes an $S \times 1$ vector of 1's, and

$$\underline{b}_k = \beta/(1 - \beta) \min[\Gamma^k(V) - \Gamma^{k-1}(V)], \qquad \bar{b}_k = \beta/(1 - \beta) \max[\Gamma^k(V) - \Gamma^{k-1}(V)]. \qquad (2.27)$$

The contraction property guarantees that $\underline{b}_k$ and $\bar{b}_k$ approach each other geometrically at rate $\beta$. The fact that the fixed point $V^*$ is bracketed within these bounds suggests that we can obtain an improved estimate of $V^*$ by terminating the contraction iterations when $|\bar{b}_k - \underline{b}_k| < \epsilon$ and setting the final estimate of $V^*$ to be the median bracketed value

$$\hat{V} = \Gamma^k(V) + \frac{(\underline{b}_k + \bar{b}_k)}{2} e. \qquad (2.28)$$

¹⁵If there are continuous state variables, the MDP problem still has the same representation as in (2.25), except that $E_\delta$ is a Markov operator (a bounded, positive linear operator with norm equal to 1) instead of an $S \times S$ transition probability matrix.

Bertsekas (1987, p. 195) shows that the rate of convergence of $\{\Gamma^k(V)\}$ to $V^*$ is geometric at rate $\beta|\lambda_2|$, where $\lambda_2$ is the subdominant eigenvalue of $E_{\delta^*}$. In cases where $|\lambda_2| < 1$, the use of the error bounds can lead to significant speed-ups in the convergence of successive approximations at essentially no extra computational cost. However in problems where $E_{\delta^*}$ has multiple ergodic sets, $|\lambda_2| = 1$, and the error bounds will not lead to an appreciable speed improvement, as illustrated in computational results in Table 5.2 of Bertsekas (1987).
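A sketch of how the bounds are used to terminate successive approximations early (illustrative primitives; the stopping tolerance is arbitrary):

    import numpy as np

    u = np.array([[1.0, 0.5], [0.0, 2.0]])
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta, eps = 0.95, 1e-10

    def gamma(V):
        return (u + beta * np.einsum('dst,t->sd', p, V)).max(axis=1)

    V = np.zeros(2)
    while True:
        V_new = gamma(V)
        b_lo = beta / (1 - beta) * np.min(V_new - V)   # lower bound in (2.27)
        b_hi = beta / (1 - beta) * np.max(V_new - V)   # upper bound in (2.27)
        if b_hi - b_lo < eps:
            V_star = V_new + (b_lo + b_hi) / 2         # median bracketed value (2.28)
            break
        V = V_new
    print(V_star)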
In relatively small scale problems ($S < 10000$) the method of policy iteration is generally the fastest method for computing $V^*$ and the associated optimal decision rule $\delta^*$, provided the discount factor is sufficiently large ($\beta > 0.95$). The method starts by choosing an arbitrary initial policy, $\delta_0$.¹⁶ Next a policy valuation step is carried out to compute the value function $V_{\delta_0}$ implied by the stationary decision rule $\delta_0$. This requires solving the linear system (2.25). Once the solution $V_{\delta_0}$ is obtained, a policy improvement step is used to generate an updated policy $\delta_1$,

$$\delta_1(s) = \operatorname{argmax}_{1 \le d \le D(s)} \left[ u(s, d) + \beta \sum_{s'=1}^{S} V_{\delta_0}(s') p(s' \mid s, d) \right]. \qquad (2.29)$$

Given $\delta_1$, one continues the cycle of policy valuation and policy improvement steps until the first iteration $k$ such that $\delta_k = \delta_{k-1}$ (or alternatively $V_{\delta_k} = V_{\delta_{k-1}}$). It is easy to see from (2.25) and (2.29) that such a $V_{\delta_k}$ satisfies Bellman's equation (2.23), so that by Theorem 2.3 the stationary Markovian decision rule $\delta^* = \delta_k$ is optimal. One can show that policy iteration always generates an improved policy:

$$V_{\delta_k} \ge V_{\delta_{k-1}}. \qquad (2.30)$$

Since there are only a finite number $D(1) \times \cdots \times D(S)$ of feasible stationary Markov policies, it follows that policy iteration always converges to the optimal decision rule $\delta^*$ in a finite number of iterations.
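A compact sketch of the valuation-improvement cycle (illustrative primitives; the initial policy follows footnote 16):

    import numpy as np

    u = np.array([[1.0, 0.5], [0.0, 2.0]])
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta, S = 0.95, 2

    delta = u.argmax(axis=1)                  # delta_0(s) = argmax_d u(s, d)
    while True:
        u_d = u[np.arange(S), delta]
        E_d = p[delta, np.arange(S), :]
        V = np.linalg.solve(np.eye(S) - beta * E_d, u_d)   # valuation step, (2.25)
        delta_new = (u + beta * np.einsum('dst,t->sd', p, V)).argmax(axis=1)  # (2.29)
        if np.array_equal(delta_new, delta):  # delta_k = delta_{k-1}: stop, optimal
            break
        delta = delta_new
    print(delta, V)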
Policy iteration is able to find the optimal decision rule after testing an amazingly small number of trial policies $\delta_k$. However the amount of work per iteration is larger than for successive approximations. Since the number of algebraic operations needed to solve the linear system (2.25) for $V_\delta$ is of order $S^3$, the standard policy iteration algorithm becomes impractical for $S$ much larger than 10000.¹⁷ To solve very large scale MDP problems, it seems that the best strategy is to use policy iteration, but to only attempt to approximately solve for $V_\delta$ in each policy evaluation step (2.25). There are a number of variants of policy iteration that avoid direct numerical solution of the linear system in (2.25) including modified policy iteration

¹⁶One obvious choice is $\delta_0(s) = \operatorname{argmax}_{1 \le d \le D(s)} [u(s, d)]$.


¹⁷Supercomputers using combinations of vector processing and multitasking can now routinely solve dense linear systems exceeding 1000 equations and unknowns in under 1 CPU second. See, for example, Dongarra and Hewitt (1986).
[Puterman and Shin (1978)], and adaptive state aggregation algorithms [Bertsekas and Castañon (1989)].
Puterman and Brumelle (1978, 1979) have shown that policy iteration is identical to Newton's method for computing a zero to a nonlinear function. This insight turns out to be useful for computing fixed points to contraction mappings $\Psi$ that are closely related to, but distinct from, the contraction mapping $\Gamma$ defined by Bellman's equation (2.12). An example of such a mapping is $\Psi: B \to B$ defined by

$$\Psi(v)(s, d) = u(s, d) + \beta \int \log \left[ \sum_{d' \in D(s')} \exp\{v(s', d')\} \right] p(ds' \mid s, d). \qquad (2.31)$$

In Section 3 we show that the fixed point to this mapping is identical to the value function $v_\theta$ entering the dynamic logit model (1.1). Rewriting the fixed point condition as $0 = v - \Psi(v)$, we can apply Newton's method, generating iterations of the form

$$v_{k+1} = v_k - [I - \Psi'(v_k)]^{-1} (I - \Psi)(v_k), \qquad (2.32)$$

where $I$ denotes the identity matrix and $\Psi'(v)$ is the gradient of $\Psi$ evaluated at the point $v \in B$. An argument exactly analogous to the series expansion argument used to prove the existence of $[I - \beta E_\delta]^{-1}$ can be used to establish that the matrix $[I - \Psi'(v)]$ is invertible, so the Newton iterations are always well-defined. Given a starting point $v_0$ in a domain of attraction sufficiently close to the fixed point $v^*$ of $\Psi$, the Newton iterations will converge to $v^*$ at a quadratic rate:

$$\|v_{k+1} - v^*\| \le K \|v_k - v^*\|^2, \qquad (2.33)$$

for a positive constant $K$.

Although Newton iterations yield rapid quadratic rates of convergence, they are only guaranteed to converge for initial estimates $v_0$ in a domain of attraction of $v^*$, whereas the method of successive approximations yields much slower linear rates of convergence but is always guaranteed to converge to $v^*$ starting from any initial point $v_0$.¹⁸ This suggests the following hybrid method or "polyalgorithm": start with successive approximations, and when the McQueen-Porteus error bounds indicate that one is sufficiently close to $v^*$, switch to Newton iterations to rapidly converge to the solution.
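For the finite-state version of the operator $\Psi$ in (2.31) the gradient $\Psi'(v)$ is an $(S \cdot D) \times (S \cdot D)$ matrix whose entries are discounted transition probabilities weighted by logit choice probabilities, so the polyalgorithm can be sketched directly. Everything below (primitives, iteration counts) is an illustrative toy, not the chapter's code:

    import numpy as np

    u = np.array([[1.0, 0.5], [0.0, 2.0]])
    p = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
    beta, S, D = 0.95, 2, 2

    def psi(v):
        logsum = np.log(np.exp(v).sum(axis=1))            # log sum_d' exp{v(s', d')}
        return u + beta * np.einsum('dst,t->sd', p, logsum)

    def psi_prime(v):
        # dPsi(v)(s,d)/dv(s',d') = beta p(s'|s,d) exp{v(s',d')} / sum_e exp{v(s',e)}
        q = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
        return beta * np.einsum('dst,te->sdte', p, q).reshape(S * D, S * D)

    v = np.zeros((S, D))
    for _ in range(20):                                   # contraction warm start
        v = psi(v)
    for _ in range(10):                                   # Newton iterations (2.32)
        step = np.linalg.solve(np.eye(S * D) - psi_prime(v),
                               (v - psi(v)).reshape(-1))
        v = v - step.reshape(S, D)
        if np.max(np.abs(v - psi(v))) < 1e-13:
            break
    print(v)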
There is another class of methods, which Judd (1994) has termed minimum weighted residual (MWR) methods, that can be applied to solve general operator equations of the form

$$\Phi(u^*) = 0, \qquad (2.34)$$

¹⁸Newton's method does exhibit global convergence in finite state DDP problems due to the fact that Newton's method and policy iteration are identical in this case, and policy iteration converges from any starting point. Thus the domain of attraction in this case is all of $R^S$.
where $\Phi: B \to B$ is a nonlinear operator on a potentially infinite-dimensional Banach space $B$. For example, Bellman's equation is a special case of (2.34) for $\Phi(V) = [I - \Gamma](V)$. Similar to policy iteration, Newton's method becomes computationally burdensome in high-dimensional problems. To avoid this, MWR methods attempt to approximate the solution to (2.34) by restricting the search to a smaller-dimensional subspace $B_N$ spanned by the basis elements $\{\chi_1, \chi_2, \ldots, \chi_N\}$. It follows that we can index any approximate solution $u \in B_N$ by a vector $c = (c_1, \ldots, c_N) \in R^N$:

$$u_c = c_1 \chi_1 + \cdots + c_N \chi_N. \qquad (2.35)$$

Unless the true solution $u^*$ is an element of $B_N$, $\Phi(u_c)$ will generally be non-zero for all vectors $c \in R^N$. The MWR method computes an estimate $u_{\hat{c}}$ of $u^*$ using a value of $\hat{c}$ that solves

$$\hat{c} = \operatorname{argmin}_{c \in R^N} \|\Phi(u_c)\|. \qquad (2.36)$$

Variants of MWR methods can be obtained by using different subspaces $B_N$ (e.g., Legendre or Chebyshev polynomials, etc.) and different norms on $\Phi(u_c)$ (e.g., least squares or sup norm, etc.). In cases where $B$ is an infinite-dimensional space (which occurs when the DDP problem contains continuous state variables), one must also choose a finite grid of points over which the norm in (2.36) is evaluated.
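A minimal sketch along the lines of (2.35)-(2.36): the toy continuous-state consumption problem, the monomial basis, the 51-point grid, and the use of a derivative-free optimizer over the sup-norm residual are all illustrative assumptions rather than a recommended implementation:

    import numpy as np
    from scipy.optimize import minimize

    grid = np.linspace(0.1, 2.0, 51)                 # finite grid of evaluation points
    beta, N = 0.95, 5
    basis = np.vstack([grid**k for k in range(N)]).T # chi_1, ..., chi_N at the grid

    def gamma(V_fun):
        # Gamma(V)(s) = max_d [ log(d s) + beta V((1 - d) s) ] over a grid of d's
        d_grid = np.linspace(0.05, 0.95, 19)
        vals = [np.log(d * grid) + beta * V_fun((1 - d) * grid) for d in d_grid]
        return np.max(vals, axis=0)

    def residual(c):
        V_fun = lambda s: np.vstack([s**k for k in range(N)]).T @ c
        return np.max(np.abs(basis @ c - gamma(V_fun)))  # sup norm of Phi(u_c)

    c_hat = minimize(residual, x0=np.zeros(N), method='Nelder-Mead',
                     options={'maxiter': 20000, 'fatol': 1e-10}).x
    print(c_hat)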
Although I have described MWR as parameterizing the value function in terms of a small number of unknown coefficients $c$, there are variants of this approach that are based on parameterizations of other features of the stochastic control problem such as the decision rule $\delta$ [Smith (1991)], or the conditional expectation operator $E_\delta$ [Marcet (1994)]. For simplicity, I refer to all these methods as MWR even though there are important differences in their computational implementation.
The advantage of the MWR approach is that it converts the problem of finding a zero of a high-dimensional operator equation (2.34) into the problem of finding a zero to a smaller-dimensional minimization problem (2.36). MWR methods may be particularly effective for solving DDP problems with several continuous state variables, since straightforward discretization methods quickly run into the curse of dimensionality. However a disadvantage of the procedure is the computational burden of solving (2.36) given that $\|\Phi(u_c)\|$ must be evaluated for each trial value of $c$. Typically, one uses approximate methods to evaluate $\|\Phi(u_c)\|$, such as Gaussian quadrature or Monte Carlo integration. Another disadvantage is that MWR methods are non-iterative, i.e. previous approximations $u_1, u_2, \ldots, u_{N-1}$ are not used to determine the next estimate $u_N$. In practice, one must make do with a single approximation $u_N$; however there is no analog of the McQueen-Porteus error bounds to tell us how far $u_N$ is from the true solution. Indeed, there are no general theorems proving the convergence of MWR methods as the dimension $N$ of the subspace increases. There are also problems to be faced in cases where $\Phi$ has multiple solutions $V^*$, and when the minimization problem (2.36) has multiple local minima. Despite these unresolved problems, versions of the MWR method have proved to be effective in a

variety of applied problems. See, for example, Kortum (1993) (who has nested the
MWR solution of (2.35) in an estimation routine), and Bansal et al. (1993) who have
used Marcets method of parameterized expectations to generate stochastic simula-
tions of dynamic, stochastic models for use by their non-parameteric simulation
estimator.
A final class of methods uses Monte Carlo integration to avoid the computational burden of multivariate numerical integration that is the dominating factor that limits our ability to solve DDP problems. Keane and Wolpin (1994) developed a method that combines Monte Carlo integration and interpolation to dramatically reduce the solution time for large scale DDP problems with continuous multidimensional state variables. As we will see below, incorporation of unobservable state variables $\varepsilon$ implies that DDP problems will always have these multidimensional continuous state variables. Recently, Rust (1995b) has introduced a random multigrid algorithm using a "random Bellman operator" that avoids the need for interpolation and repeated Monte Carlo simulations that is an inherent limiting feature of Keane and Wolpin's method. Rust showed that his algorithm succeeds in breaking the curse of dimensionality of solving the DDP problem, i.e. the amount of computer time required to solve the DDP problem increases only polynomially rather than exponentially with the dimension $d$ of the state variables using Rust's algorithms. These new methods offer the promise that substantially more realistic DDP models will be estimable in the near future.

3. Econometric methods for discrete decision processes

As we discussed in Section 1, structural estimation methods for DDPs are fundamentally different from the Euler equation methods used to estimate CDPs. Since the control variable is discrete, we cannot differentiate to derive first order necessary conditions characterizing the optimal decision rule $\delta^* = (\delta, \delta, \ldots)$. Instead each component function $\delta(s)$ is defined by a finite number of inequality conditions:¹⁹

$$d \in \delta(s) \iff u(s, d) + \beta \int V^*(s') \, p(ds' \mid s, d) \ge u(s, d') + \beta \int V^*(s') \, p(ds' \mid s, d') \quad \forall d' \in D(s). \qquad (3.1)$$
Econometric methods for DDPs borrow heavily from methods developed in the literature on estimation of static discrete choice models.²⁰ The primary difference between estimation of static versus dynamic models of discrete choice is that agents' choices are governed by the relative magnitude of the value function $V$ rather than the single period utility function $u$. Even if the functional form of the latter is

¹⁹For notational simplicity, this section focuses on stationary infinite-horizon DDP problems and ignores the distinction between the optimal policy $\delta^*$ and its components $\delta^* = (\delta, \delta, \ldots)$.
²⁰See McFadden (1981, 1984) for excellent surveys of the huge literature on estimation of static discrete choice models.
specified a priori, the value function is generally unknown, although it can be computed for any value of $\theta$. To date, most empirical applications have used "nested numerical solution algorithms" that compute the best fitting estimate $\hat{\theta}$ by repeatedly solving the dynamic programming problem for each trial value of $\theta$.

3.1. Alternative models of the error term

In addition to the numerical problems involved in computing the value function and
optimal decision rule, we face the problem of how to incorporate error terms into
the structural model. Error terms are necessary in light of Blackwells theorem
(Theorem 2.3) that the optimal decision rule d = 6(s) is a deterministic function of
the agents state s. Blackwells theorem implies that if we were able to observe all
components of s, then a correctly specified DDP model would be able to perfectly
predict agents behavior. Since no theory is realistically capable of perfectly predict-
ing the behavior of human decision-makers, there are basically four ways to recon-
cile discrepancies between the predictions of the DP model and observed behavior:
(1) optimization errors, (2) measurement errors, (3) approximation errors, and (4)
unobserved state variables.l
An optimization error causes an agent who intends to behave according to the
optimal decision rule δ to take an actual decision d given by

\[
d = \delta(s) + \eta, \tag{3.2}
\]

where η is interpreted as an error that prevents the agent from correctly calculating
or implementing the optimal action δ(s). This interpretation of discrepancies be-
tween d and δ(s) seems logically inconsistent: if the agent knew that there were
random factors that lead to ex post discrepancies between intended and realized
decisions, he would re-optimize taking these uncertainties into account. The resulting
decision rule will generally be different from the decision rule δ that is optimal when
intended and realized decisions coincide. On the other hand, if η is simply a way
of accounting for irrational or non-maximizing behavior, it is not clear why this
behavior should take the peculiar form of random deviations from a rational
decision rule δ. Given these logical difficulties, we ignore optimization errors as a
way of explaining discrepancies between d and δ(s).
Measurement errors, due to response or coding errors, must surely be acknowl-
edged in most empirical studies. Measurement errors are usually much more likely
to occur in continuous components of s than in the discrete values of d, although
significant errors can occur in the latter as a result of classification error (e.g.
defining workers as choosing to work full-time vs. part-time based on noisy
measurements of total hours of work). From an econometric standpoint, measurement

21 Another method, unobserved heterogeneity, can be regarded as a special case of unobserved state
variables in which certain components of the state vector vary over individuals but not over time.

errors in s create more serious difficulties since δ is typically a nonlinear function
of s. Unfortunately, the problem of nonlinear errors-in-variables has not yet been
satisfactorily resolved in the econometrics literature. In certain cases [Eckstein and
Wolpin (1989b) and Christensen and Kiefer (1991b)], one can account for measure-
ment error in a statistically and computationally tractable manner, although at the
present time this approach seems to be highly problem-specific.
An approximation error is defined as the difference between the actual and
predicted decision, ε = d − δ(s). This approach amounts to an up-front admission
that the DDP model is misspecified, and does not attempt to impose auxiliary
statistical assumptions about the distribution of ε. The existence of such errors is
hard to deny since by their very nature DDP models are simplified, abstract
representations of human behavior and we would never expect their predictions
to be 100% correct. Under this interpretation the econometric problem is to find
a specification (u, p, β) that minimizes some metric of the approximation error such
as mean squared prediction error. While this approach seems quite natural, it leads
to a degenerate econometric model and estimators with poor asymptotic proper-
ties. The approximation error approach also suffers from ambiguity about the
appropriate metric for determining whether a given model does or does not provide
a good approximation to observed behavior.
The final approach, unobserved state variables, is the subject of Section 3.2.

3.2. Maximum likelihood estimation of DDPs

The remainder of this chapter focuses on structural estimation of DDPs with
unobserved state variables. In these models the state variable s is partitioned into
two components s = (x, ε) where x is a state variable observed by both agent
and econometrician and ε is observed only by the agent. The existence of unobserved
state variables is quite plausible: it is unlikely that any survey could completely
record all information that is relevant to the agent's decision-making process. It also
provides a natural way to rationalize discrepancies between observed behavior
and the predictions of the DDP model: even though the optimal decision rule
d = δ(x, ε) is a deterministic function, if the specification of unobservables is sufficiently
rich any observed (x, d) combination can be explained as the result of an optimal
decision by an agent for an appropriate value of ε. Since ε enters the decision rule
δ in a non-additive fashion, it is infeasible to estimate θ by nonlinear least squares.
The preferred method for estimating θ is maximum likelihood using the conditional
choice probability,

\[
P(d|x) = \int I\{d = \delta(x, \varepsilon)\}\, q(d\varepsilon | x), \tag{3.3}
\]

where q(dε|x) is the conditional distribution of ε given x (to be defined). Even
though δ is a step function, integration over ε in (3.3) leads to a conditional choice

probability that is a smooth function of θ provided that the primitives (u, p, β) are
smooth functions of θ and the DDP problem satisfies certain general properties
given in assumptions AS and CI below. These assumptions guarantee that the
conditional choice probability has full support:

\[
\forall\, d \in D(x): \quad P(d|x) > 0, \tag{3.4}
\]

which is equivalent to saying that the set {ε | d = δ(x, ε)} has positive probability
under q(dε|x). We say that a specification for unobservables is saturated if (3.4) holds
for all possible values of θ. The problem with an unsaturated specification is the
possibility that the DDP model may be contradicted in a sufficiently large data
set: i.e. one may encounter observations (x_t^a, d_t^a) which cannot be rationalized by
any value of ε or θ, i.e. P(d_t^a | x_t^a, θ) = 0 for all θ. This leads to practical difficulties
in maximum likelihood estimation, causing the log-likelihood function to blow
up when it encounters a zero probability observation. Although one might
eliminate such observations to achieve convergence, the impact on the asymptotic
properties of the estimator is unclear. In addition, an unsaturated specification may
yield a likelihood function whose support depends on θ or which may be a non-
smooth function of θ. Little is known about the general asymptotic properties of
these non-standard maximum likelihood estimators.22
Borrowing from the literature on static discrete choice models [McFadden (1981)]
we introduce two assumptions that are sufficient to generate a saturated specifica-
tion for unobservables in a DDP model.

Assumption AS

The choice sets depend only on the observed state variable x: D(s) = D(x). The
unobserved state variable ε is a vector with at least as many components as the
number of elements in D(x).23 The utility function has the additively separable
decomposition

\[
u(s, d) = u(x, d) + \varepsilon(d), \tag{3.5}
\]

where ε(d) is the dth component of the vector ε.

22 Results are available for certain special cases, such as Flinn and Heckman's (1982) and Christensen
and Kiefer's (1991) analysis of the job search model. If wages are measured without error, this model
generates the restriction that any accepted wage offer must be greater than the reservation wage (which
is an implicit function of θ). This implies that the support of the likelihood function depends on θ,
resulting in a non-normal limiting distribution with certain parameters converging faster than the √A
rate that is typical of standard maximum likelihood estimators. The basic result is analogous to
estimating the upper bound θ of a uniform distribution U[0, θ]. The support of this distribution clearly
depends on θ and, as is well known (Cox and Hinkley, 1974), the maximum likelihood estimator is
θ̂ = max{x_1, ..., x_A}, which converges at rate A to an exponential limiting distribution.
23 For technical reasons ε may have a number of superfluous components so that we may formally
embed the ε state vectors in a common state space E. For details, see Definition 3.1.

Figure 1. Pattern of dependence in controlled stochastic process implied by the CI assumption

Assumption CI

The transition density for the controlled Markov process {x_t, ε_t} factors as

\[
p(dx_{t+1}, d\varepsilon_{t+1} \,|\, x_t, \varepsilon_t, d_t) = q(d\varepsilon_{t+1} | x_{t+1})\, \pi(dx_{t+1} | x_t, d_t), \tag{3.6}
\]

where the marginal density q(dε|x) of the first |D(x)| components of ε has support
equal to R^{|D(x)|} and finite absolute first moments.
CI is a conditional independence assumption which limits the pattern of depen-
dence in the {x_t, ε_t} process in two ways. First, x_{t+1} is a sufficient statistic for ε_{t+1},
implying that any serial dependence between ε_t and ε_{t+1} is transmitted entirely
through the observed state x_{t+1}.24 Second, the probability density for x_{t+1} depends
only on x_t and not on ε_t. Intuitively, CI implies that the {ε_t} process is essentially
a noise process superimposed on the main dynamics, which are embodied by the
transition probability π(dx'|x, d).
Under assumptions AS and CI Bellman's equation has the form

\[
V(x, \varepsilon) = \max_{d \in D(x)} \left[ v(x, d) + \varepsilon(d) \right], \tag{3.7}
\]

where

\[
v(x, d) = u(x, d) + \beta \int\!\!\int V(y, \varepsilon')\, q(d\varepsilon'|y)\, \pi(dy | x, d). \tag{3.8}
\]

Equation (3.8) is the key to subsequent results. It shows that the DDP problem has

the same basic structure as a static discrete choice problem except that the value
function v replaces the single period utility function u as an argument of the
conditional choice probability. In particular, AS-CI yields a saturated specification
for unobservables: (3.8) implies that the set {ε | d = δ(x, ε)} is a non-empty intersection
of half-spaces in R^{|D(x)|}, and since ε is continuously distributed with unbounded
support, it follows that regardless of the values of {v(x, d)} the choice probability
P(d|x) is positive for each d ∈ D(x).
In order to formally define the class of DDPs, we need to embed the unobserved
state variables ε in a common space E. Without loss of generality, we can identify
each choice set D(x) as the set of integers D(x) = {1, ..., |D(x)|}, and let the decision
space D be the set D = {1, ..., sup_{x∈X} |D(x)|}. Then we define E = R^{|D|}, and whenever
|D(x)| < |D| then q(dε|x) assigns the remaining |D| − |D(x)| irrelevant components
of ε equal to some arbitrary value, say 0, with probability 1.

24 If q(dε|x) is independent of x then {ε_t} is an IID process which is independent of {x_t}.

Definition 3.1

A discrete decision process (DDP) is an MDP satisfying the following restrictions:

• The decision space D = {1, ..., sup_{s∈S} |D(s)|}, where sup_{s∈S} |D(s)| < ∞.
• The state space S is the product space S = X × E, where X is a Borel subset of
R^J and E = R^{|D|}.
• For each s ∈ S and x ∈ X we have D(s) = D(x) ⊂ D.
• The utility function u(s, d) satisfies assumption AS.
• The transition probability p(ds_{t+1} | s_t, d_t) satisfies assumption CI.
• The component q(dε|x) of the transition probability p(ds|s, d) is itself a product
measure on R^{|D(x)|} × R^{|D|−|D(x)|} whose first component has support R^{|D(x)|} and
whose second component is a unit mass on a vector of 0s of length |D| − |D(x)|.

The conditional choice probability P(d|x) can be defined in terms of a function
McFadden (1981) has called the social surplus,

\[
G[\{u(x,d), d \in D(x)\} \,|\, x] = \int_{R^{|D|}} \max_{d \in D(x)} [u(x, d) + \varepsilon(d)]\, q(d\varepsilon|x). \tag{3.9}
\]

If we think of a population of consumers indexed by ε, then G[{u(x, d), d ∈ D(x)}|x]
is simply the expected indirect utility of choosing alternatives d ∈ D(x). G has an
important property, apparently first noted by Williams (1977) and Daly and Zachary
(1979), that can be thought of as a discrete analog of Roy's identity.

Theorem 3.1

If q(dε|x) has finite first moments, then the social surplus function (3.9) exists, and
has the following properties.
1. G is a convex function of {u(x, d), d ∈ D(x)}.
2. G satisfies the additivity property

\[
G[\{u(x,d) + \alpha, d \in D(x)\} \,|\, x] = \alpha + G[\{u(x,d), d \in D(x)\} \,|\, x]. \tag{3.10}
\]

3. The partial derivative of G with respect to u(x, d) equals the conditional choice
probability:

\[
\frac{\partial G[\{u(x,d), d \in D(x)\} \,|\, x]}{\partial u(x, d)} = P(d|x). \tag{3.11}
\]
From the definition of G in (3.9), it is evident that the proof of Theorem 3.1(3) is
simply an exercise in interchanging integration and differentiation. Taking the
partial derivative operator inside the integral sign we obtain25

\[
\frac{\partial G[\{u(x,d), d \in D(x)\} \,|\, x]}{\partial u(x,d)}
= \int \frac{\partial \max_{d' \in D(x)} [u(x,d') + \varepsilon(d')]}{\partial u(x,d)}\, q(d\varepsilon|x)
= \int I\Big\{d = \operatorname*{argmax}_{d' \in D(x)} [u(x,d') + \varepsilon(d')]\Big\}\, q(d\varepsilon|x)
= P(d|x). \tag{3.12}
\]

Note that the additivity property (3.10) implies that the conditional choice probabi-
lities sum to 1, so P(·|x) is a well-defined probability distribution over D(x).
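Theorem 3.1(3) is easy to verify numerically in the extreme-value case treated below in (3.20)-(3.21), where G has the closed-form "logsum" expression. The following minimal sketch (purely illustrative values; the function names are ours, not part of any library) checks that a finite-difference gradient of G reproduces the logit choice probabilities.

```python
import numpy as np

# Minimal numerical check of Theorem 3.1(3): for mean-zero type-I
# extreme-value errors the social surplus has the closed "logsum" form
# G(v) = log(sum_d exp(v_d)), and its gradient equals the vector of
# choice probabilities.  The values v below are illustrative.

def G(v):
    return np.log(np.exp(v).sum())

def choice_prob(v):
    e = np.exp(v - v.max())          # multinomial logit, cf. (3.21) below
    return e / e.sum()

v = np.array([1.0, 0.3, -0.5])
h = 1e-6
grad = np.array([(G(v + h * np.eye(3)[d]) - G(v - h * np.eye(3)[d])) / (2 * h)
                 for d in range(3)])
print(grad, choice_prob(v))          # the two vectors agree to ~1e-9
```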
The fact that the unobserved state variables ε have unbounded support implies
that the objective function in the DDP problem is unbounded. We need to introduce
three extra assumptions to guarantee the existence of a solution since the general
results for MDPs given in Theorems 2.1 and 2.2 assumed that u is bounded above.

Assumption BU

For each d ∈ D(x), u(x, d) is an upper semicontinuous function of x with bounded
expectation:

\[
\bar R(x) \equiv \sum_{t=1}^{\infty} \beta^t R_t(x) < \infty,
\]
\[
R_{t+1}(x) = \max_{d \in D(x)} \int R_t(y)\, \pi(dy \,|\, x, d),
\]
\[
R_1(x) = \max_{d \in D(x)} \int \max_{d' \in D(y)} \int |u(y, d') + \varepsilon(d')|\, q(d\varepsilon \,|\, y)\, \pi(dy \,|\, x, d). \tag{3.13}
\]

Assumption WC

π(dy|x, d) is a weakly continuous function of (x, d): for each bounded continuous
function h: X → R, ∫h(y)π(dy|x, d) is a continuous function of x for each d ∈ D(x).

Assumption BE

Let B be the Banach space of bounded, Borel measurable functions h: X × D → R
under the essential supremum norm. Then u ∈ B and for each h ∈ B, Eh ∈ B, where Eh
is defined by

\[
Eh(x, d) = \int G[\{h(y, d'), d' \in D(y)\} \,|\, y]\, \pi(dy \,|\, x, d). \tag{3.14}
\]
25 The interchange is justified by the Lebesgue dominated convergence theorem, since the derivative
of max_{d'∈D(x)} [u(x, d') + ε(d')] with respect to u(x, d) is bounded (it equals either 0 or 1) for almost all ε.

Theorem 3.2

If {s_t, d_t} is a DDP satisfying AS, CI and regularity conditions BU, WC and BE, then
the optimal decision rule δ is given by

\[
\delta(x, \varepsilon) = \operatorname*{argmax}_{d \in D(x)} [v(x, d) + \varepsilon(d)], \tag{3.15}
\]

where v is the unique fixed point to the contraction mapping Ψ: B → B defined by

\[
\Psi(v)(x, d) = u(x, d) + \beta \int G[\{v(y, d'), d' \in D(y)\} \,|\, y]\, \pi(dy \,|\, x, d). \tag{3.16}
\]

Theorem 3.3

If {s_t, d_t} is a DDP satisfying AS, CI and regularity conditions BU, WC and BE, then
the controlled process {x_t, d_t} is Markovian with transition probability

\[
\Pr\{dx_{t+1}, d_{t+1} \,|\, x_t, d_t\} = P(d_{t+1} | x_{t+1})\, \pi(dx_{t+1} | x_t, d_t), \tag{3.17}
\]

where the conditional choice probability P(d|x) is given by

\[
P(d|x) = \frac{\partial G[\{v(x,d), d \in D(x)\} \,|\, x]}{\partial v(x, d)}, \tag{3.18}
\]

where G is the social surplus function defined in (3.9), and v is the unique fixed point
to the contraction mapping Ψ defined in (3.16).

The proofs of Theorems 3.2 and 3.3 are straightforward: under assumptions
AS-CI the value function is the unique solution to Bellman's equation given in (3.7)
and (3.8). Substituting the formula for V given in (3.7) into the formula for v given
in (3.8) we obtain

\[
v(x, d) = u(x, d) + \beta \int\!\!\int \max_{d' \in D(y)} [v(y, d') + \varepsilon(d')]\, q(d\varepsilon|y)\, \pi(dy \,|\, x, d)
= u(x, d) + \beta \int G[\{v(y, d'), d' \in D(y)\} \,|\, y]\, \pi(dy \,|\, x, d). \tag{3.19}
\]

The latter formula is the fixed point condition (3.16). It is a simple exercise to verify
that Ψ is a contraction mapping, guaranteeing the existence and uniqueness of the
function v. The fact that the observed components {x_t, d_t} of the controlled process
{x_t, ε_t, d_t} are Markovian is a direct result of the CI assumption: the observed state
x_{t+1} is a sufficient statistic for the agent's choice d_{t+1}. Without the CI assumption,
lagged state and control variables would be useful for predicting the agent's choice

at time t + 1 and {x_t, d_t} will no longer be Markovian. As we will see, this observation
provides the basis for a specification test of CI.
For specific functional forms for q we obtain concrete formulas for the conditional
choice probability P(d|x), the social surplus function G and the contraction map-
ping Ψ. For example if q(dε|x) is a multivariate extreme-value distribution we
have26

\[
q(d\varepsilon|x) = \prod_{d \in D(x)} \exp\{-(\varepsilon(d) + \gamma)\}\, \exp\left[-\exp\{-(\varepsilon(d) + \gamma)\}\right], \qquad \gamma \simeq 0.577. \tag{3.20}
\]

Then P(d|x) is given by the well-known multinomial logit formula

\[
P(d|x) = \frac{\exp\{v(x, d)\}}{\sum_{d' \in D(x)} \exp\{v(x, d')\}}, \tag{3.21}
\]

where v is the fixed point to the contraction mapping Ψ:

\[
\Psi(v)(x, d) = u(x, d) + \beta \int \log\Big[\sum_{d' \in D(y)} \exp\{v(y, d')\}\Big]\, \pi(dy \,|\, x, d). \tag{3.22}
\]
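To make the dynamic logit fixed point concrete, the following sketch solves (3.22) by successive approximations on a hypothetical finite-state model and then forms the choice probabilities (3.21). All primitives (u, π, β) are made-up illustrative values rather than any particular application.

```python
import numpy as np

# Sketch: solve the dynamic logit fixed point (3.22) by successive
# approximations, then form the choice probabilities (3.21).

nx, nd, beta = 5, 2, 0.95
rng = np.random.default_rng(0)
u = rng.normal(size=(nx, nd))                    # u(x, d), illustrative
pi = rng.dirichlet(np.ones(nx), size=(nx, nd))   # pi[x, d, y] = pi(y | x, d)

v = np.zeros((nx, nd))
for _ in range(1000):                            # Psi contracts with modulus beta
    logsum = np.log(np.exp(v).sum(axis=1))       # log sum_d' exp{v(y, d')}
    v_new = u + beta * (pi @ logsum)             # Psi(v)(x, d), eq. (3.22)
    if np.max(np.abs(v_new - v)) < 1e-12:
        v = v_new
        break
    v = v_new

P = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)   # eq. (3.21)
```

Since Ψ is a contraction with modulus β, the iterates converge geometrically; the successive approximation/Newton polyalgorithm discussed further below is faster when β is close to 1.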

The extreme-value specification is especially attractive for empirical applications
since the closed-form expressions for P and G avoid the need for multi-dimensional
numerical integrations required for other distributions.27 A simple consequence of
the extreme-value specification is that, in the static logit model, the log-odds ratio
for two alternatives equals the utility differential:

\[
\log\left[\frac{P(d|x)}{P(1|x)}\right] = u(x, d) - u(x, 1). \tag{3.23}
\]

Suppose that the utility function depends only on the attributes of the chosen
alternative: u(x, d) = u(x_d), where x = (x_1, ..., x_D) is a vector of attributes of all the
alternatives and x_d is the attribute of the dth alternative. In this case the log-odds
ratio implies a property known as independence from irrelevant alternatives (IIA):
the odds of choosing alternative d over alternative 1 depend only on the attributes
of those two alternatives. The IIA property has a number of undesirable implications
such as the red bus/blue bus problem noted by Debreu (1960). Note, however,
that in the dynamic logit model the IIA property does not hold: the log-odds of

26 The constant γ in (3.20) is Euler's constant, which shifts the extreme value distribution so it has
unconditional mean zero.
27 Closed-form solutions for the conditional choice probability are available for the larger family of
multivariate extreme-value distributions [McFadden (1977)]. This family is characterized by the
property that it is max-stable, i.e. it is closed under the operation of maximization. Dagsvik (1991)
showed that this class is dense in the space of all distributions for ε in the sense that the conditional
choice probabilities for an arbitrary density q can be approximated arbitrarily closely by the choice
probability for some multivariate extreme-value distribution.

choosing d over 1 equals the difference in the value functions v(x, d) − v(x, 1), but
from the definition of v(x, d) in (3.22) we see that it generally depends on the
attributes of all of the other alternatives even when the single period utility function
depends only on the attributes of the chosen alternative, u(x, d) = u(x_d). Thus, the
dynamic logit model benefits from the computational simplifications of the extreme-
value specification but avoids the IIA problem of static logit models.
Although Theorems 3.2 and 3.3 appear to apply only to infinite-horizon stationary
DDP problems, they actually include finite-horizon, non-stationary DDP problems
as a special case. To see this, let the time index t be an additional component of x_t,
and assume that the process enters an absorbing state with u_t(x_t, d_t) = u(x_t, t, d_t) = 0
for t > T. Then Theorems 3.2 and 3.3 continue to hold, with the exception that
δ, P, G, π and v all depend on t. The value functions v_t, t = 1, ..., T are given by the
same backward recursion formulas as in the finite-horizon MDP models described
in Section 2:

\[
v_T(x, d) = u_T(x, d),
\]
\[
v_t(x, d) = u_t(x, d) + \beta \int G_{t+1}[\{v_{t+1}(y, d'), d' \in D(y)\} \,|\, y]\, \pi_t(dy \,|\, x, d). \tag{3.24}
\]

Substituting these value functions into (3.18), we obtain choice probabilities P_t that
depend on time. It is easy to see that the process {x_t, d_t} is still Markovian, but with
non-stationary transition probabilities.
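The finite-horizon recursion (3.24) requires no fixed point computation at all; a minimal sketch with hypothetical primitives (held time-invariant only for brevity):

```python
import numpy as np

# Sketch of the backward recursion (3.24): v_T = u_T, then v_t is built
# from the logsum of v_{t+1}.  Illustrative primitives, fixed over t.

nx, nd, beta, T = 5, 2, 0.95, 20
rng = np.random.default_rng(1)
u = rng.normal(size=(nx, nd))
pi = rng.dirichlet(np.ones(nx), size=(nx, nd))   # pi[x, d, y] = pi(y | x, d)

v = {T: u}                                       # v_T(x, d) = u_T(x, d)
for t in range(T - 1, 0, -1):
    logsum = np.log(np.exp(v[t + 1]).sum(axis=1))
    v[t] = u + beta * (pi @ logsum)

# time-dependent choice probabilities P_t(d | x), from (3.18)/(3.21)
P = {t: np.exp(vt) / np.exp(vt).sum(axis=1, keepdims=True) for t, vt in v.items()}
```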
Given panel data {x_t^a, d_t^a} on observed states and decisions of a collection of
individuals, the full information maximum likelihood estimator θ̂^f is defined by

\[
\hat\theta^f = \operatorname*{argmax}_{\theta} L^f(\theta) \equiv \prod_{a=1}^{A} \prod_{t=1}^{T_a} P(d_t^a \,|\, x_t^a, \theta)\, \pi(dx_t^a \,|\, x_{t-1}^a, d_{t-1}^a, \theta). \tag{3.25}
\]

Maximum likelihood estimation is complicated by the fact that even in cases where
the conditional choice probability has a closed-form solution in terms of the value
function v_θ, the latter function does not have an a priori known functional form
and is only implicitly defined by the fixed point condition (3.16). Rust (1987, 1988b)
developed a nested fixed point algorithm for estimating θ: an inner contraction
fixed point algorithm computes v_θ for each trial value of θ, and an outer hill-
climbing algorithm searches for the value of θ that maximizes L^f.
In practice, θ can be estimated by a simpler 2-stage procedure that yields consistent,
asymptotically normal but inefficient estimates of θ*, and a 3-stage procedure which
is asymptotically equivalent to full information maximum likelihood. Suppose we
partition θ into two components (θ_1, θ_2), where θ_1 is a subvector of parameters that
appear only in π and θ_2 is a subvector of parameters that appear only in (u, q, β).
In the first stage we estimate θ_1 using the partial likelihood estimator θ̂_1^p:28

\[
\hat\theta_1^p = \operatorname*{argmax}_{\theta_1 \in R^{N_1}} L_1^p(\theta_1) \equiv \prod_{a=1}^{A} \prod_{t=1}^{T_a} \pi(dx_t^a \,|\, x_{t-1}^a, d_{t-1}^a, \theta_1). \tag{3.26}
\]

28 Cox (1975) has shown that under standard regularity conditions, the partial likelihood estimator
will be consistent and asymptotically normally distributed.

Note that the first stage does not require a nested fixed point algorithm to solve the
DDP problem. In the second stage we estimate the remaining parameters using the
partial likelihood estimator θ̂_2^p defined by

\[
\hat\theta_2^p = \operatorname*{argmax}_{\theta_2} L_2^p(\theta_2) \equiv \prod_{a=1}^{A} \prod_{t=1}^{T_a} P(d_t^a \,|\, x_t^a, \theta_2, \hat\theta_1^p). \tag{3.27}
\]

The second stage treats the consistent first stage estimates of π(dx_{t+1}|x_t, d_t, θ̂_1^p) as
the truth, reducing the problem to estimating the remaining parameters θ_2 of
(u, q, β). It is well known that for any optimization method the number of likelihood
function evaluations needed to find a maximum increases rapidly with the number
of parameters being estimated. Since the second stage estimation requires a nested
fixed point algorithm to solve the DDP problem at each likelihood function evalua-
tion, any reduction in the number of parameters being estimated can lead to
substantial computational savings.
Note that, due to the presence of estimation error in the first stage estimate
θ̂_1^p, the covariance matrix formed by inverting the information matrix for the partial
likelihood function (3.27) will be inconsistent. Although there is a standard correc-
tion formula that yields a consistent estimate of the covariance matrix [Amemiya
(1976)], in practice it is just as simple to use the consistent estimates θ̂^p = (θ̂_1^p, θ̂_2^p)
from stages 1 and 2 as starting values for one or more Newton steps on the full
likelihood function (3.25):

\[
\hat\theta^n = \hat\theta^p - \gamma\, \hat\rho(\hat\theta^p), \tag{3.28}
\]

where the search direction ρ̂(θ̂^p) is given by

\[
\hat\rho(\hat\theta^p) = - \left[ \partial^2 \log L^f(\hat\theta^p)/\partial\theta\,\partial\theta' \right]^{-1} \left[ \partial \log L^f(\hat\theta^p)/\partial\theta \right] \tag{3.29}
\]

and γ > 0 is a step-size parameter. Ordinarily the step size γ is set equal to 1, but one
can also choose γ to maximize L^f without changing the asymptotic properties of θ̂^n.
Using the well-known information equality we can obtain an alternative asymp-
totically equivalent version of (3.29) by replacing the Hessian matrix with the
negative of the information matrix Î(θ̂^p) defined by

\[
\hat I(\hat\theta^p) = \sum_{a=1}^{A} \left[ \sum_{t=1}^{T_a} \partial \log P(d_t^a | x_t^a, \hat\theta^p)\, \pi(x_t^a | x_{t-1}^a, d_{t-1}^a, \hat\theta^p)/\partial\theta \right]
\left[ \sum_{t=1}^{T_a} \partial \log P(d_t^a | x_t^a, \hat\theta^p)\, \pi(x_t^a | x_{t-1}^a, d_{t-1}^a, \hat\theta^p)/\partial\theta \right]'. \tag{3.30}
\]

We call θ̂^n the Newton-step estimator. It is straightforward to show that this
procedure results in parameter estimates that are asymptotically equivalent to full
information maximum likelihood and, as a by-product, consistent estimates of the
asymptotic covariance matrix Î^{-1}(θ̂^n).29
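A minimal sketch of one such Newton step, using the outer-product information matrix (3.30) in place of the negative Hessian as in the BHHH algorithm discussed below; score_fn is a hypothetical placeholder for the model-specific score computation (in a DDP it would itself invoke the nested fixed point solver):

```python
import numpy as np

# Sketch of the Newton step (3.28) with the information matrix (3.30)
# standing in for the negative Hessian.  score_fn(theta) is assumed to
# return an A x N matrix whose a-th row is agent a's score, i.e. the
# derivative of agent a's log-likelihood contribution at theta.

def newton_step(theta, score_fn, gamma=1.0):
    S = score_fn(theta)                  # shape (A, N)
    info = S.T @ S                       # outer-product estimate, eq. (3.30)
    grad = S.sum(axis=0)                 # score of the full log-likelihood
    return theta + gamma * np.linalg.solve(info, grad)
```

Iterating newton_step from the consistent two-stage estimates (θ̂_1^p, θ̂_2^p) yields estimates asymptotically equivalent to full information maximum likelihood.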
The feasibility of the nested fixed point maximum likelihood procedure depends
on our ability to rapidly compute the fixed point v_θ for any given value of θ, and
to find the maximum of L^f or L^p in as few likelihood function evaluations as possible.
At a minimum, the likelihood function should be a smooth function of θ so that
more efficient gradient optimization algorithms can be employed. The smoothness
of the likelihood is also crucial for establishing the large sample properties of the
maximum likelihood estimator. Since the primitives (u, p, β) are specified a priori,
they can be chosen to be smooth functions of θ. The convexity of the social surplus
function implies that the conditional choice probabilities are smooth functions of
v_θ. Therefore the question of smoothness further reduces to finding sufficient condi-
tions under which θ → v_θ is a smooth mapping from R^N into B. This follows from
the implicit function theorem since the pair (θ, v_θ) is a zero of the nonlinear operator
F: R^N × B → B defined by

\[
0 = F(v, \theta) = (I - \Psi_\theta)(v). \tag{3.31}
\]

Theorem 3.4

Under regularity conditions (A1) to (A13) given in Rust (1988b), ∂v_θ/∂θ exists and is
a continuous function of θ given by

\[
\frac{\partial v_\theta}{\partial \theta} = \left[ I - \Psi_\theta'(v_\theta) \right]^{-1} \frac{\partial \Psi_\theta(v_\theta)}{\partial \theta}. \tag{3.32}
\]

The successive approximation/Newton iteration polyalgorithm described in Sec-
tion 2.5 can be used to compute v_θ. Since Newton's algorithm involves inverting the
operator [I − Ψ'_θ(v)], it follows that one can use it to compute ∂v_θ/∂θ using formula
(3.32) at negligible marginal cost.
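In the finite-state dynamic logit case the operator inversion in (3.32) amounts to a single linear solve. The sketch below assumes θ enters only u (so ∂Ψ_θ(v_θ)/∂θ = ∂u/∂θ) and uses the fact that the linearization Ψ'_θ(v) then has entries βπ(y|x, d)P(d'|y); all primitives are hypothetical.

```python
import numpy as np

# Sketch of formula (3.32) for the finite-state dynamic logit model:
# dv/dtheta = [I - Psi'(v)]^{-1} du/dtheta, one linear solve.

nx, nd, beta = 5, 2, 0.95
rng = np.random.default_rng(0)
u = rng.normal(size=(nx, nd))
pi = rng.dirichlet(np.ones(nx), size=(nx, nd))   # pi[x, d, y] = pi(y | x, d)

v = np.zeros((nx, nd))
for _ in range(500):                              # fixed point of (3.22)
    v = u + beta * (pi @ np.log(np.exp(v).sum(axis=1)))

P = np.exp(v) / np.exp(v).sum(axis=1, keepdims=True)
# Linearized operator: Psi'(v)[(x,d),(y,d')] = beta * pi(y|x,d) * P(d'|y)
M = beta * np.einsum('xdy,ye->xdye', pi, P).reshape(nx * nd, nx * nd)
du_dtheta = np.ones(nx * nd)        # hypothetical du/dtheta for a scalar theta
dv_dtheta = np.linalg.solve(np.eye(nx * nd) - M, du_dtheta)
```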
Once we have the derivatives ∂v_θ/∂θ, it is a straightforward exercise to compute
the derivatives of the likelihood, ∂L^f/∂θ. This allows us to employ more efficient
quasi-Newton gradient optimization algorithms to search for the maximum likeli-
hood estimate θ̂. Some of these methods require second derivatives of the likeli-
hood function, which are significantly harder to compute. However, the information
equality implies that the information matrix (3.30) is a good approximation to the
negative of the Hessian of L^f in large samples. This idea forms the basis of the BHHH
optimization algorithm [Berndt, Hall, Hall and Hausman (1974)] which only

29 In practice, the two-stage estimator θ̂^p may be sufficiently far away from the maximum likelihood
estimates that several Newton steps (3.28) are necessary. In this case, the Newton-step estimator is
simply a way of generating starting values for computing the full information maximum likelihood
estimates in (3.25). Also, we haven't attempted to correct the estimated standard errors for possible
misspecification as in White (1982), due to the fact that such corrections require second derivatives of
the likelihood function which are difficult to compute in DDP models.

requires first derivatives of the likelihood function.30 The nested fixed point algorithm
combines the successive approximation/Newton iteration polyalgorithm and the
BHHH/Broyden gradient optimization algorithm in order to obtain an efficient and
numerically stable method for computing the maximum likelihood estimate θ̂.31
In order to derive the asymptotic properties of the maximum likelihood estima-
tors θ̂^i, i = f, p, n, we need to make some additional assumptions about the sampling
process. First, we assume that the periods at which agents are observed coincide
with the periods at which they make their decisions. In practice agents do not make
decisions at exogenously spaced intervals of time, so it is unlikely that the particular
points in time at which agents are interviewed coincide with the times they make
their decisions. One way to deal with the problem is to use retrospective data on
decisions made between survey dates. In order to minimize problems of time
aggregation one should in principle formulate a sufficiently fine-grained model with
null actions that allow one to model decision-making processes with randomly
varying times between decisions. However if the DDP model has a significantly
shorter decision interval than the observation interval, we may face the problem
that the data set may not contain observations on the agent's intervening states and
decisions. In principle, this problem can be solved by using a partial likelihood
function that omits the intervening periods, or a full likelihood function that
integrates out the unobserved states and decisions in the intervening periods. The
practical limitation of this approach is the curse of dimensionality of solving very
fine-grained DP models.
Next, we need to make some assumptions about the dependence between the
realizations {x_t^a, d_t^a} and {x_t^b, d_t^b} for agents a ≠ b. The standard assumption is that
these realizations are independent, but this may not be plausible in models where
agents are affected by macroeconomic shocks (examples of such shocks include
prices, unemployment rates and news announcements). We assume that the observed
state variable can be partitioned into two components, x_t = (m_t, z_t), where m_t rep-
resents a macroeconomic shock that is common to all agents and z_t represents an
idiosyncratic component that is independently distributed across agents conditional
on the realization of {m_t}. Sufficient conditions for such independence are given in
the three assumptions below.

Assumption CI-X
The transition probability for the observed state variable x_t = (m_t, z_t) is given by

\[
\pi(dx_{t+1} \,|\, x_t, d_t) = \pi_1(dz_{t+1} \,|\, z_t, m_t, d_t)\, \pi_2(dm_{t+1} \,|\, m_t). \tag{3.33}
\]

30 Convergence of the BHHH method in small samples can be accelerated by Broyden and Davidon-
Fletcher-Powell updating procedures that adaptively improve the accuracy of the information matrix
approximation to the Hessian of L^f. The method also applies to maximization of the partial likelihood
function L^p.
31 A documented example of the algorithm written in the Gauss programming language is available
from the author upon request.

Assumption SI-E
For each t ≥ 0 the distributions of the unobserved state variables ε_t^a are conditionally
independent:

\[
\Pr\{d\varepsilon_t^1, \ldots, d\varepsilon_t^A \,|\, x_t^1, \ldots, x_t^A\} = \prod_{a=1}^{A} q(d\varepsilon_t^a \,|\, x_t^a). \tag{3.34}
\]

Assumption SI-Z

For each t ≥ 1 the transition probabilities for the idiosyncratic components z_t^a of
the observed state variables x_t^a are conditionally independent:

\[
\Pr\{dz_{t+1}^1, \ldots, dz_{t+1}^A \,|\, x_t^1, \ldots, x_t^A, d_t^1, \ldots, d_t^A\} = \prod_{a=1}^{A} \pi_1(dz_{t+1}^a \,|\, z_t^a, m_t, d_t^a) \tag{3.35}
\]

and, when t = 0, the initial distributions of z_0^a are independent, conditional on m_0:

\[
\Pr\{dz_0^1, \ldots, dz_0^A \,|\, m_0\} = \prod_{a=1}^{A} \pi_0(dz_0^a \,|\, m_0). \tag{3.36}
\]

Assumption CI-X is an additional conditional independence assumption impos-
ed when the observed state variable x_t includes macroeconomic shocks. It corre-
sponds to an asymmetry that seems reasonable when individual agents are small
relative to the economy: macroeconomic shocks can affect the evolution of agents'
idiosyncratic states {z_t^a, ε_t^a}, but an individual's decision d_t^a has no effect on the
evolution of the {m_t} process, modelled as an exogenous Markov process with
transition probability π_2.32 SI-E and SI-Z require that any correlation between
agents' idiosyncratic states {z_t^a, ε_t^a} is a result of the common macroeconomic shocks.
Together with assumption CI-X, these conditions imply that realizations {z_t^a, d_t^a}
and {z_t^b, d_t^b} are independent conditional on {m_t}.
A final assumption is needed about the relative sizes of the time-series and cross-sectional
dimensions of the data set. There are three cases to consider: (1) the number of
time-series observations for each agent is fixed and A → ∞, (2) the number of
cross-section observations is fixed and T_a → ∞, or (3) both A and T_a tend to infinity.
In most panel data sets the cross-sectional dimension is much larger relative to
the time-series dimension, so we will focus on case 1. If we further assume that the
observation period T_a is fixed at a common value T for all agents a, then it is
straightforward to show that, conditional on (m_0, ..., m_T), the realizations of {x_t^a, d_t^a}
are IID. This allows us to use the simpler IID strong law of large numbers and
Lindeberg-Levy central limit theorem to establish the consistency and asymptotic
normality of θ̂^i, i = p, f, n, requiring only continuous second derivatives of the
likelihood.33

32 Note that while {m_t} is independent of the decisions of individual agents, it will generally not be
independent of their collective behavior. Thus, CI-X is not inconsistent with dynamic general
equilibrium in large economies.
33 In cases where the observations are INID or weakly dependent, somewhat stronger smoothness
and boundedness conditions are required to establish the asymptotic properties of the MLE.

Theorem 3.5
Under assumptions AS, CI, BU, WC, BE, CI-X, SI-E and SI-Z and regularity
conditions (A10) to (A35) of Rust (1988b), the full information and 3-stage maximum
likelihood estimators θ̂^i, i = f, n satisfy
1. θ̂^i is a well-defined random variable,
2. θ̂^i converges to θ* with probability 1 as A → ∞,
3. the distribution of √A(θ̂^i − θ*) converges weakly to N[0, I(θ*)^{-1}] where the
information matrix I(θ*) is given by
\[
\begin{aligned}
I(\theta^*) ={}& -E\bigg\{ \sum_{t=1}^{T} \partial^2 \log P(\tilde d_t | \tilde z_t, m_t, \theta^*)/\partial\theta\,\partial\theta' \,\bigg|\, (m_0, \ldots, m_T) \bigg\} \\
&{}- E\bigg\{ \sum_{t=1}^{T} \partial^2 \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta^*)/\partial\theta\,\partial\theta' \,\bigg|\, (m_0, \ldots, m_T) \bigg\}
 - \sum_{t=1}^{T} \partial^2 \log \pi_2(m_t | m_{t-1}, \theta^*)/\partial\theta\,\partial\theta' \\
={}& E\bigg\{ \sum_{t=1}^{T} \partial \log P(\tilde d_t | \tilde z_t, m_t, \theta^*)/\partial\theta \; \sum_{t=1}^{T} \partial \log P(\tilde d_t | \tilde z_t, m_t, \theta^*)/\partial\theta' \,\bigg|\, (m_0, \ldots, m_T) \bigg\} \\
&{}+ E\bigg\{ \sum_{t=1}^{T} \partial \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta^*)/\partial\theta \; \sum_{t=1}^{T} \partial \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta^*)/\partial\theta' \,\bigg|\, (m_0, \ldots, m_T) \bigg\} \\
&{}+ 2E\bigg\{ \sum_{t=1}^{T} \partial \log P(\tilde d_t | \tilde z_t, m_t, \theta^*)/\partial\theta \; \sum_{t=1}^{T} \partial \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta^*)/\partial\theta' \,\bigg|\, (m_0, \ldots, m_T) \bigg\} \\
&{}+ 2E\bigg\{ \sum_{t=1}^{T} \partial \log P(\tilde d_t | \tilde z_t, m_t, \theta^*)/\partial\theta \,\bigg|\, (m_0, \ldots, m_T) \bigg\} \sum_{t=1}^{T} \partial \log \pi_2(m_t | m_{t-1}, \theta^*)/\partial\theta' \\
&{}+ 2E\bigg\{ \sum_{t=1}^{T} \partial \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta^*)/\partial\theta \,\bigg|\, (m_0, \ldots, m_T) \bigg\} \sum_{t=1}^{T} \partial \log \pi_2(m_t | m_{t-1}, \theta^*)/\partial\theta' \\
&{}+ \sum_{t=1}^{T} \partial \log \pi_2(m_t | m_{t-1}, \theta^*)/\partial\theta \; \sum_{t=1}^{T} \partial \log \pi_2(m_t | m_{t-1}, \theta^*)/\partial\theta'.
\end{aligned} \tag{3.37}
\]

Theorem 3.6

Under assumptions AS, CI, BU, WC, BE, CI-X, SI-E and SI-Z and regularity
conditions (A10) to (A35) of Rust (1988b), the 2-stage partial likelihood estimator
θ̂^p = (θ̂_1^p, θ̂_2^p) satisfies
1. θ̂^p is a well-defined random variable,
2. θ̂^p converges to θ* with probability 1 as A → ∞,
3. the distribution of √A(θ̂^p − θ*) converges weakly to N(0, Σ) where Σ is given
by

\[
\Sigma = \Lambda^{-1}\, \Omega\, (\Lambda^{-1})', \tag{3.38}
\]

where Λ and Ω are given by

\[
\Lambda = \begin{pmatrix} \Lambda_{11} & 0 \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix}, \qquad
\Omega = \begin{pmatrix} \Omega_{11} & \Omega_{12} \\ \Omega_{12}' & \Omega_{22} \end{pmatrix}, \tag{3.39}
\]

where

\[
\begin{aligned}
\Lambda_{11} &= E\bigg\{ \sum_{t=1}^{T} \partial^2 \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta_1^*)\, \pi_2(m_t | m_{t-1}, \theta_1^*)/\partial\theta_1\,\partial\theta_1' \,\bigg|\, (m_0, \ldots, m_T) \bigg\}, \\
\Lambda_{21} &= E\bigg\{ \sum_{t=1}^{T} \partial^2 \log P(\tilde d_t | \tilde z_t, m_t, \theta_1^*, \theta_2^*)/\partial\theta_2\,\partial\theta_1' \,\bigg|\, (m_0, \ldots, m_T) \bigg\}, \\
\Lambda_{22} &= E\bigg\{ \sum_{t=1}^{T} \partial^2 \log P(\tilde d_t | \tilde z_t, m_t, \theta_1^*, \theta_2^*)/\partial\theta_2\,\partial\theta_2' \,\bigg|\, (m_0, \ldots, m_T) \bigg\}, \\
\Omega_{11} &= E\bigg\{ \sum_{t=1}^{T} \partial \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta_1^*)\, \pi_2(m_t | m_{t-1}, \theta_1^*)/\partial\theta_1 \\
&\qquad\quad \times \sum_{t=1}^{T} \partial \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta_1^*)\, \pi_2(m_t | m_{t-1}, \theta_1^*)/\partial\theta_1' \,\bigg|\, (m_0, \ldots, m_T) \bigg\}, \\
\Omega_{12} &= E\bigg\{ \sum_{t=1}^{T} \partial \log \pi_1(\tilde z_t | \tilde z_{t-1}, m_{t-1}, \tilde d_{t-1}, \theta_1^*)\, \pi_2(m_t | m_{t-1}, \theta_1^*)/\partial\theta_1
\times \sum_{t=1}^{T} \partial \log P(\tilde d_t | \tilde z_t, m_t, \theta_1^*, \theta_2^*)/\partial\theta_2' \,\bigg|\, (m_0, \ldots, m_T) \bigg\}, \\
\Omega_{22} &= E\bigg\{ \sum_{t=1}^{T} \partial \log P(\tilde d_t | \tilde z_t, m_t, \theta_1^*, \theta_2^*)/\partial\theta_2
\times \sum_{t=1}^{T} \partial \log P(\tilde d_t | \tilde z_t, m_t, \theta_1^*, \theta_2^*)/\partial\theta_2' \,\bigg|\, (m_0, \ldots, m_T) \bigg\}.
\end{aligned} \tag{3.40}
\]

In addition to the distribution of the parameter vector θ̂, we are often interested in
the distribution of the utility and value functions u_θ̂ and v_θ̂ treated as random elements
of B. This is useful, for example, in computing uniform confidence bands for these
functions. The following result is essentially a Banach space version of the delta
theorem.

Theorem 3.7

If θ̂ converges with probability 1 to θ* and √A[θ̂ − θ*] converges weakly to N(0, Σ),
and v_θ is a smooth function of θ, then
1. v_θ̂ is a B-valued random element,
2. v_θ̂ converges with probability 1 to v_θ*,
3. the distribution of √A[v_θ̂ − v_θ*] converges weakly to a Gaussian random
element of B with mean 0 and covariance operator [∂v_θ*/∂θ]Σ[∂v_θ*/∂θ]'.

We conclude this section with some comments on specification and hypothesis
testing. Since structural models are often highly simplified and tightly parameterized,
one has an obligation to go beyond simply reporting parameter estimates and their
standard errors. If the structural model is to have any credibility, it needs to be
subjected to a battery of in-sample specification tests, and if possible, out-of-sample
predictive tests (a good example of the latter type of test is presented in Section 4.2).
The maximum likelihood framework allows formulation of the standard "holy trinity"
of test statistics, the Wald, Likelihood Ratio, and Lagrange Multiplier (LM) tests [see
Engle (1984)]. Examples 1 and 2 below show how these statistics can be used to test
the validity of various functional-form restrictions on the DDP model. Example 3
discusses the chi-square goodness-of-fit statistic, which is perhaps the most useful
omnibus test of the correctness of the DDP specification.

Example 1

The holy trinity can be used to conduct a heterogeneity test of the null hypothesis
that the parameter vector θ* is the same for two or more subgroups of agents. If
there are K subgroups, we can formulate the null hypothesis as K − 1 linear restric-
tions on a KN-dimensional full parameter vector (θ_1, ..., θ_K) where θ_k is the N-
dimensional parameter vector for subgroup k. The likelihood ratio test involves
computing −2 times the difference in the restricted and unrestricted log-likelihood
functions, where we compute the restricted log-likelihood by pooling all K sub-
groups and estimating a single N-dimensional parameter vector θ. The Wald test
statistic is a quadratic form in the K − 1 differences in the group-specific coefficient
estimates, θ̂_{k+1} − θ̂_k, k = 1, ..., K − 1. In this case the LM statistic is the easiest to
compute since it only requires computation of a single N-dimensional parameter
estimate θ̂ for the pooled sample under the null hypothesis of no heterogeneity. The
LM statistic tests whether the score of the likelihood function is approximately zero
for all K subgroups. All three test statistics have an asymptotic chi-square distribu-
tion under the null, with degrees of freedom equal to the number of restrictions
being tested. In the example, there are (K − 1)N degrees of freedom. Computation
of the Wald and LM statistics requires an estimate of the information matrix Î(θ̂_k)
for each of the subgroups k = 1, ..., K.

Example 2

The holy trinity can be used to test the validity of the conditional independence
assumption CI. Recall that CI implies that the unobserved state variable ε_t is
independent of ε_{t−1} conditional on the value x_t of the observed state variable.
This is a strong restriction although, as we will see in Section 4.6, it seems to be
necessary to obtain a computationally tractable estimation algorithm. A natural
way to test CI is to add some function f of the previous period's control variable to
the current period value function: v_θ(x_t, d_t) + αf(d_{t−1}). Under the null hypothesis
that CI is valid, the decision taken in period t − 1 will have no effect on the decision
made in period t once we condition on x_t, since {v_θ(x_t, d), d ∈ D(x_t)} constitutes a set
of sufficient statistics for the agent's decision in period t. Thus, α = 0 under the
null hypothesis that CI holds. However under the alternative hypothesis that CI
doesn't hold, ε_t and ε_{t−1} will be serially correlated, even conditional on x_t, so that
d_{t−1} will generally be useful for predicting the agent's choice d_t. Thus, α ≠ 0 under
the alternative hypothesis. The Wald, Likelihood Ratio or LM statistics can be used
to test the hypothesis that α = 0. For example, the Wald statistic is simply Aα̂²/σ̂²(α̂),
where σ̂²(α̂) is the asymptotic variance of α̂.

Example 3

The chi-square goodness-of-fit statistic provides an overall test of the null hypothesis
that an econometric model is correctly specified (i.e. that the parametric model
coincides with the true data generating process). In the case of a DDP model, this
amounts to a test of the joint hypotheses that (1) agents are rational, i.e. they act
as if their behavior is governed by an optimal decision rule from some DDP
model, and (2) the particular parametric specification of this model is correct.
However, the analysis of the identification problem in Section 3.5 reveals that the
hypothesis of rationality per se imposes essentially no testable restrictions: the
empirical content of the theory arises from additional restrictions on the primitives
(u, p, β) of the DDP model. In this sense, testing the theory is tantamount to testing
the econometrician's assumptions about the parametric functional form of (u, p, β).
Although there are other omnibus specification tests [such as White's (1982) infor-
mation matrix test] the chi-square goodness-of-fit test is far more useful in diag-
nosing the source of specification errors. There are two versions of this statistic, one
for models without covariates [Pollard (1979)], and one for models with covariates
[Andrews (1988, 1989)].34 The former is useful for testing complete realizations of

34 Strictly speaking, Pollard's results are only applicable if the full likelihood includes the probability
density of the initial state x_0. Otherwise the full likelihood (3.25) can be analyzed as a conditional
likelihood using Andrews' analysis of chi-square tests of parametric models with covariates.

the controlled process {x_t, d_t} using the full Markov transition probability
P(d_{t+1}|x_{t+1}, θ)π(dx_{t+1}|x_t, d_t, θ) derived in Theorem 3.3, whereas the version with
covariates is useful for testing the conditional choice probability P(d_t|x_t, θ) and the
transition probability π(dx_{t+1}|x_t, d_t, θ). Both formulations are based on a partition
of the relevant space of realizations of x_t and d_t. In the case of the full likelihood
function, the relevant space is X^T × D^T, and in the case of P or π it is X × D or
X × X × D, respectively. We partition this space into a fixed number M of mutually
exclusive cells. The cells can be randomly chosen, or chosen based on some
data-dependent procedure provided (1) the total number of cells M is fixed for all
sample sizes, (2) the elements of the partition are members of a Vapnik-Červonenkis
(VC) class, and (3) the partition converges in probability to a fixed, non-stochastic
partition whose elements are also members of a VC class.35 If we let Ω denote the
M × 1 vector of elements of this partition, Ω = (Ω_1, ..., Ω_M), we can define a vector
Δ(Ω, θ) of differences between sample frequencies of Ω and the predicted probabilities
of the DDP model with parameter θ. In a chi-square test of the specification of the
conditional choice probability P, the ith element of Δ(Ω, θ) is given by

\[
\Delta_i(\Omega, \theta) = \frac{1}{AT} \sum_{a=1}^{A} \sum_{t=1}^{T} \Big[ I\{(x_t^a, d_t^a) \in \Omega_i\}
- \sum_{d \in D(x_t^a)} I\{(x_t^a, d) \in \Omega_i\}\, P(d \,|\, x_t^a, \theta) \Big]. \tag{3.41}
\]

The first term in (3.41) is the sample proportion of (x_t, d_t) pairs falling into partition
element Ω_i whereas the second term is the DDP model's prediction of the probabil-
ity that (x_t, d_t) ∈ Ω_i. Note that since the law of motion for x_t is not specified, we simply
average over all sample points x_t^a. An analogous formula holds for the case of
chi-square tests of the full likelihood function: in that case Δ_i(Ω, θ) is the difference
between the sample fraction of paths {x_t^a, d_t^a} falling in partition element Ω_i less the
probability that these realizations fall in Ω_i, computed by integrating with respect
to the probability measure on X^T × D^T generated by the controlled transition
probability P(d_{t+1}|x_{t+1}, θ)π(dx_{t+1}|x_t, d_t, θ). The chi-square test statistic is given by

\[
\chi^2(\Omega, \hat\theta) = A\, \Delta(\Omega, \hat\theta)'\, \hat\Sigma^{-}\, \Delta(\Omega, \hat\theta), \tag{3.42}
\]

where Σ̂⁻ is a generalized inverse of the asymptotic covariance matrix Σ̂ (which
is generally singular). Andrews (1989) showed that under the null hypothesis of
correct specification, √A Δ(Ω, θ̂) converges in distribution to N(0, Σ), which implies
that χ²(Ω, θ̂) converges in distribution to a chi-square random variable whose
degrees of freedom equal the rank of Σ. Andrews provides formulas for Σ̂ that
take relatively simple forms when θ̂ is an asymptotically efficient estimator such as
in the case of the full information maximum likelihood estimator θ̂^f. A natural
strategy is to start with a chi-square test of the specification of the full likelihood
(3.25). If this is rejected, one can then do a chi-square test of π to see if the rejection
is due to a misspecification of agents' beliefs. If this is not rejected, then we have an

35 See Pollard (1984) for a definition of a VC class.



indication that the rejection of the model is due to a misspecification of preferences
(u, β) or that the CI assumption is not valid. A further test of the CI assumption
using one of the holy trinity specification tests described in Example 2 can be used
to determine if CI is the source of the problem. If not, an investigation of the cells
which have the largest prediction errors (3.41) may provide valuable insights on
the source of specification errors in u or β.
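A sketch of the cell computations in (3.41)-(3.42) for the conditional choice probability on a finite state space is given below. The covariance matrix is estimated here by a crude outer product of the per-observation cell contributions and inverted with a Moore-Penrose pseudo-inverse, so the sketch illustrates the structure of the statistic rather than Andrews' exact formula for Σ̂; all inputs are hypothetical.

```python
import numpy as np

# Sketch of the chi-square statistic (3.41)-(3.42) for P(d | x, theta_hat).
# Cells are sets of (x, d) pairs partitioning X x D; the covariance is a
# simple outer-product estimate, not Andrews' formula for Sigma_hat.

def chi_square_stat(x_obs, d_obs, P_hat, cells):
    """x_obs, d_obs: integer-coded observed states and decisions;
    P_hat[x, d]: model choice probabilities at theta_hat;
    cells: list of sets of (x, d) pairs partitioning X x D."""
    n = len(x_obs)
    contrib = np.zeros((n, len(cells)))
    for i, cell in enumerate(cells):
        for k, (x, d) in enumerate(zip(x_obs, d_obs)):
            emp = float((x, d) in cell)                    # indicator in (3.41)
            pred = sum(P_hat[x, dd] for dd in range(P_hat.shape[1])
                       if (x, dd) in cell)                 # model's cell probability
            contrib[k, i] = emp - pred
    delta = contrib.mean(axis=0)                           # Delta_i(Omega, theta_hat)
    sigma = (contrib - delta).T @ (contrib - delta) / n    # crude covariance estimate
    return n * delta @ np.linalg.pinv(sigma) @ delta       # quadratic form (3.42)
```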

3.3. Alternative estimation methods: Finite-horizon DDP problems

Although the previous section showed that the AS-CI assumptions lead to a simple
and general estimation theory for DDPs, it would be desirable to develop estima-
tion methods that can relax these assumptions. This section considers estimation
methods for DDP models with unobservables that are serially correlated and enter
the utility function in a non-separable fashion. Unfortunately there is no general
estimation theory for this class of models at present. Instead there are a variety of
problem-specific specifications and estimation methods, most of which are designed
for finite-horizon binary choice problems. As we will see, there are substantial
theoretical and computational obstacles to developing an estimation theory for a
more general class of DDP models that relaxes AS-CI. This section presents two
examples that illustrate the successful approaches.36

Example 1

Wolpin (1984) pioneered the nested numerical solution method for a class of binary
choice models where unobservables enter the utility function u(x, ε, d) in a non-
additive fashion. In a binary choice model, one does not need a 2-dimensional vector
of unobservables to yield a saturated choice model: in Wolpin's specification ε is
uni-dimensional and interacts with the observed state variable x, yet the choice
probability satisfies P_t(1|x) > 0 for all x and is continuously differentiable in θ.
Wolpin's application concerned a Malaysian family's choice of whether or not to
conceive a child in period t: d_t = 1 if a child is born in year t, d_t = 0 otherwise.37
Wolpin used the following quadratic family utility function,

\[
u_t(x_t, \varepsilon_t, d_t) = (\theta_1 + \varepsilon_t)\, n_t - \theta_2 n_t^2 + \theta_3 c_t - \theta_4 c_t^2, \tag{3.43}
\]

where x_t = (n_t, c_t) and n_t is the number of children in the family and c_t is total family
consumption (treated as an exogenous state variable rather than as a control
variable). Assuming that {ε_t} is an IID Gaussian process with mean 0 and variance
σ² and that {x_t} is a Markov process independent of {ε_t}, the family's optimal

36 For additional examples, see the survey by Eckstein and Wolpin (1989a).
37 Wolpin's model thus assumed that there is no uncertainty about fertility and that contraception
is 100% effective.

decision rule d_t = δ_t(x_t, ε_t) can be computed by backward induction starting in the
final period T at which the family is fertile. It is not difficult to see that (3.43) implies
δ_t is given by the threshold rule

\[
\delta_t(x_t, \varepsilon_t) = \begin{cases} 1 & \text{if } \varepsilon_t > \eta_t(x_t, \theta) \\ 0 & \text{if } \varepsilon_t \le \eta_t(x_t, \theta). \end{cases} \tag{3.44}
\]

The cutoffs {η_t(x, θ)} define the value of ε_t such that the family is indifferent as to
whether or not to have another child: for any given value of θ and for each possible
t and x, the cutoffs can be computed by backward induction. The assumption that
{ε_t} is IID N(0, σ²) then implies that the conditional choice probability is given by

\[
P_t(d_t \,|\, x_t, \theta) = \begin{cases} 1 - \Phi[\eta_t(x_t, \theta)/\sigma] & \text{if } d = 1 \\ \Phi[\eta_t(x_t, \theta)/\sigma] & \text{if } d = 0, \end{cases} \tag{3.45}
\]

where Φ is the standard normal CDF. One can estimate θ by maximum likelihood
using the partial, full or Newton-step maximum likelihood estimators given in
equations (3.25) to (3.27) of Section 3.2. Using the implicit function theorem one
can show that the cutoffs are smooth functions of θ. From (3.45) it is clear that this
implies that the likelihood function is a smooth function of θ, allowing Wolpin to
establish that his maximum likelihood estimator has standard asymptotic properties.
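The backward induction for the cutoffs is short enough to sketch. The code below uses a stripped-down version of (3.43) in which consumption is dropped, the state is just n_t with n_{t+1} = n_t + d_t, and utility is (θ_1 + ε)n − θ_2 n²; the parameter values, horizon and the absorbing boundary at nmax are illustrative, not Wolpin's.

```python
import numpy as np
from scipy.stats import norm

# Sketch of backward induction for the cutoffs eta_t(x, theta) in
# (3.44)-(3.45), on a stripped-down fertility model with state n only.

th1, th2, sigma, beta, T, nmax = 1.0, 0.1, 1.0, 0.95, 20, 25

EV = np.zeros(nmax + 2)                 # EV_{T+1} = 0 after the fertile period
eta = np.zeros((T + 1, nmax + 1))
for t in range(T, 0, -1):
    EV_new = np.zeros(nmax + 2)         # EV_new[nmax+1] stays 0 (crude boundary)
    for n in range(nmax + 1):
        # cutoff: the eps at which the d = 1 and d = 0 values are equal
        eta[t, n] = th2 * (2 * n + 1) - th1 - beta * (EV[n + 1] - EV[n])
        w0 = th1 * n - th2 * n ** 2 + beta * EV[n]   # mean value of d = 0
        z = eta[t, n] / sigma
        # EV_t(n) = w0 + E[(eps - eta)^+] for eps ~ N(0, sigma^2)
        EV_new[n] = w0 + sigma * norm.pdf(z) - eta[t, n] * (1 - norm.cdf(z))
    EV = EV_new

P1 = 1 - norm.cdf(eta[1:] / sigma)      # P_t(1 | n) from (3.45), t = 1..T
```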

Example 2

Although Example 1 shows how one can incorporate measurement errors and
unobservables that enter the utility function non-separably, it relies on the indepen-
dence of the {ε_t} process, a strong form of CI. Pakes (1986) developed a method for
estimating binary DDPs with serially correlated unobservables that enter u additively.
Pakes developed an optimal stopping model of whether or not to renew a patent.
He used Monte Carlo simulations to integrate out the serially correlated {ε_t}
process in order to evaluate the likelihood function, avoiding a potentially intract-
able numerical integration problem. Let t denote the number of years a patent has
been in force, and let ε_t denote the cash flow accruing to the patent in year t. In
Europe, patent holders must pay an annual renewal fee c_t, an increasing function
of t. The patent holder must decide whether or not to pay the cost c_t and renew the
patent, or let it lapse, in which case the patent is permanently cancelled and the idea
is assumed to have 0 net value thereafter. In Pakes's data set one cannot observe
patent holders' earnings ε_t, so the only observable state variable is x_t = t, the number
of years the patent is kept in force. Assuming that {ε_t} is a first order Markov process
with transition probability q_t(dε_t|ε_{t−1}), Bellman's equation is given by

\[
V_T(\varepsilon) = \max\{0, \varepsilon - c_T\},
\]
\[
V_t(\varepsilon) = \max\Big\{0,\; \varepsilon - c_t + \beta \int V_{t+1}(\varepsilon')\, q_{t+1}(d\varepsilon' \,|\, \varepsilon)\Big\}, \tag{3.46}
\]

where T is the statutory limit on patent life. Under fairly general conditions on the
transition probabilities {q_t} (analogous to the weak continuity conditions of Sec-
tion 2) one can establish that the optimal decision rule, d = δ_t(ε), is given by a
threshold rule of the form (3.44), but in this case the cutoffs η_t(θ) depend only on t
and the parameter θ. Pakes used a particular family {q_t} to simplify the recursive
calculation of the cutoffs:

\[
q_{t+1}(d\varepsilon_{t+1} \,|\, \varepsilon_t) = \begin{cases}
\exp\{-\theta_1 \varepsilon_t\} & \text{if } \varepsilon_{t+1} = 0 \\
\cdots & \text{if } 0 \le \varepsilon_{t+1} \le \theta_2 \varepsilon_t \\
\cdots & \text{if } \varepsilon_{t+1} > \theta_2 \varepsilon_t.
\end{cases} \tag{3.47}
\]

The interpretation of (3.47) is that with probability exp{−θ_1 ε_t} the patent is dis-
covered to be worthless (in which case ε_{t+k} = 0 for all k ≥ 1), or alternatively, the
patent is believed to have value next period given by ε_{t+1} = max{θ_2 ε_t, z}, where z
is an IID exponential random variable whose density is given by the third term in
(3.47). The likelihood function for this problem requires computation of the pro-
bability λ_t(θ) that the patent lapses in year t:

\[
\lambda_t(\theta) = \Pr\{\delta_t(\varepsilon_t) = 0,\, \delta_{t-1}(\varepsilon_{t-1}) = 1, \ldots, \delta_1(\varepsilon_1) = 1\}
= \int \cdots \int I\{\varepsilon_t < \eta_t(\theta)\} \prod_{s=1}^{t-1} I\{\varepsilon_s \ge \eta_s(\theta)\}\,
q_t(d\varepsilon_t | \varepsilon_{t-1}) \cdots q_1(d\varepsilon_1 | \varepsilon_0)\, q_0(d\varepsilon_0), \tag{3.48}
\]

where q_0 is the initial distribution of returns at the time the patent was applied for,
assumed to be log-normal with parameters (θ_6, θ_7). To establish the asymptotic pro-
perties of the maximum likelihood estimator θ̂ = (θ̂_1, ..., θ̂_7), we need to show that
λ_t(θ) is a smooth function of θ. It is straightforward to show that the cutoffs η_t(θ)
are smooth functions of θ. Pakes showed that this fact, together with the smoothing
that occurs when integrating over the indicator functions in (3.48), yields a likeli-
hood function which is continuously differentiable in θ and has standard asymptotic
properties. However, in practice the numerical integrations required in (3.48) become
intractable for t larger than 3 or 4, whereas many patents are held for 20 years or
more. To overcome the problem Pakes used Monte Carlo integration, calculating
a consistent estimate λ̂_t(θ) by simulating a large number of realizations of the process
{ε_t} and tabulating the fraction of realizations that lead to drop-outs at year t, i.e.
ε̃_t < η_t(θ), ε̃_s ≥ η_s(θ), s = 1, ..., t − 1. This requires a nested simulation algorithm
consisting of an outer optimization algorithm that searches for a value of θ to
maximize the likelihood function, and an inner simulation/DP algorithm that
1. solves the DDP problem for a given value of θ, using backward induction to
calculate the cutoffs {η_t(θ)},

2. simulates NSIM draws of the process {ε_t} to calculate a consistent estimate of
λ_t(θ) and the full likelihood function L̂^f(θ).

Note that the realizations {ε̃_t} change each time we update θ. The main burden of
Pakes's simulation estimator arose from having to repeatedly re-simulate NSIM
realizations of {ε̃_t}: the recursive calculation of the cutoffs {η_t(θ)} took little time in
comparison.38 An additional burden is imposed by the need to calculate numerical
derivatives of the likelihood function: each iteration requires 8 separate solutions
and simulations of the DDP problem, once to compute L̂^f(θ) and 7 additional times
to compute its partial derivatives with respect to (θ_1, ..., θ_7).
McFadden (1989) and Pakes and Pollard (1989) subsequently developed a new
class of simulation estimators that dramatically lowers the value of NSIM needed
to obtain consistent and asymptotically normal parameter estimates. In fact, consis-
tent, asymptotically normal estimates of θ can be obtained with NSIM as small as
1, whereas consistency of Pakes's original simulation estimator required NSIM to
tend to infinity with sample size. The new approach uses minimum distance rather
than maximum likelihood to estimate θ. In the case of the patent renewal problem,
the estimator is defined by

\[
\hat\theta = \operatorname*{argmin}_{\theta} H_A(\theta)'\, W_A\, H_A(\theta),
\]
\[
H_A(\theta) = \left[ n_1/A - \hat\lambda_1(\theta), \ldots, n_T/A - \hat\lambda_T(\theta) \right]',
\]
\[
A = \sum_{t=1}^{T} n_t,
\]
\[
\hat\lambda_t(\theta) = \frac{1}{NSIM} \sum_{j=1}^{NSIM} I\{\tilde\varepsilon_{jt} < \eta_t(\theta),\; \tilde\varepsilon_{js} \ge \eta_s(\theta),\; s = 1, \ldots, t-1\}, \tag{3.49}
\]

where n_t is the number of patent holders who dropped out (failed to renew) after t
years, A is the total number of agents (patent holders) in the sample, W_A is a T × T
positive definite weighting matrix, and {ε̃_jt} denotes the jth simulation of the
Markov process when the transition probabilities q_t in (3.47) are evaluated at θ. In
order to satisfy the uniformity conditions necessary to prove consistency and
asymptotic normality of the estimator, it is crucial to note that one does not draw
independent realizations {ε̃_jt} for distinct values of θ. Instead, at the beginning of
estimation we simulate NSIM random vectors (τ̃_1, ..., τ̃_NSIM) which are held fixed
throughout the entire estimation process, and the jth draw τ̃_j is used to construct {ε̃_jt}
for each trial value of θ. In the patent renewal problem τ̃_j is given by

\[
\tilde\tau_j = (\tilde z, \tilde\gamma_1, \ldots, \tilde\gamma_T, \tilde\eta_1, \ldots, \tilde\eta_T), \tag{3.50}
\]

38 Pakes varied NSIM depending on how close the outer algorithm was to converging, setting NSIM
equal to a few thousand in the initial stages of the optimization when one is far from convergence and
gradually increasing NSIM to 20 000 in the final stages of estimation in order to get more precise
estimates of the likelihood function and its derivatives.

where zis standard normal, and the 6, and 7, are IID draws from an exponential
distribution with unit mean. Using tj we can construct a realization of {Ejl} for any
value of 8 using the recursion

Fjo= exp(0, + ~9,zS},

ej,=r{Q,Ej,-, >,~1}max[8,~j,_,,8,B,y,-8,]. (3.51)

This simulation process insures that no extraneous simulation error is introduced
in the process of minimizing over θ in (3.49), insuring that the uniformity conditions
given in Theorems 3.1 and 3.2 of Pakes and Pollard hold. These conditions imply
the following asymptotic distribution for θ̂:

\[
\sqrt{A}\,[\hat\theta - \theta^*] \Rightarrow N(0, \Omega),
\]
\[
\Omega = (1 + NSIM^{-1})\, (\Delta' W \Delta)^{-1} \Delta' W \Sigma W \Delta\, (\Delta' W \Delta)^{-1},
\]
\[
\Delta = \partial\lambda(\theta^*)/\partial\theta,
\]
\[
\Sigma = \operatorname{diag}[\lambda(\theta^*)] - \lambda(\theta^*)\lambda(\theta^*)',
\]
\[
\lambda(\theta) = [\lambda_1(\theta), \ldots, \lambda_T(\theta)]'. \tag{3.52}
\]

By Aitken's theorem [Theil (1971)], the most efficient choice for the weighting
matrix is W = diag[λ(θ*)]^{-1}. Notice that when one uses the optimal weighting matrix,
W_A, the relative efficiency of the simulation estimator increases rapidly to the
efficiency of maximum likelihood [which requires exact numerical integration to
compute λ(θ)]: the asymptotic variance of the simulation estimator is only twice
that of maximum likelihood when NSIM equals 1, and only 10% greater
when NSIM equals 10.
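The role of the fixed draws is easiest to see in code. The sketch below implements a generic frequency simulator of the lapse probabilities with draws generated once and reused at every trial θ; the cutoffs η_t and the transition recursion are illustrative stand-ins for the backward-induction cutoffs and Pakes's exact law in (3.47)/(3.51).

```python
import numpy as np

# Sketch of the frequency simulator (3.49) with common random numbers:
# the draws tau are generated ONCE and reused at every trial theta, so
# no extraneous simulation noise enters the minimization over theta.

T, NSIM = 20, 1000
rng = np.random.default_rng(0)
tau = {"z": rng.normal(size=NSIM),                  # fixed draws, cf. (3.50)
       "g": rng.exponential(size=(NSIM, T + 1)),
       "e": rng.exponential(size=(NSIM, T + 1))}

def lambda_hat(theta):
    """Simulated lapse probabilities (lambda_1, ..., lambda_T) at theta."""
    th1, th2, th3, mu, s = theta
    eps = np.exp(mu + s * tau["z"])                 # initial returns, log-normal
    alive = np.ones(NSIM, bool)
    out = np.zeros(T)
    for t in range(1, T + 1):
        eta_t = 0.1 * t                             # hypothetical cutoff eta_t(theta)
        lapse = alive & (eps < eta_t)
        out[t - 1] = lapse.mean()
        alive &= ~lapse
        # worthless with probability exp(-th1*eps); else max{th2*eps, th3*g}
        worthless = tau["e"][:, t] >= th1 * eps
        eps = np.where(worthless, 0.0, np.maximum(th2 * eps, th3 * tau["g"][:, t]))
    return out
```

Because lambda_hat is a step function of θ, the outer minimization of H_A(θ)'W_A H_A(θ) built from it must use a derivative-free search such as Nelder-Mead, which is exactly the drawback taken up next.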
While the frequency simulation estimator (3.49) substantially reduces the value
of NSIM required to estimate θ, it has the drawback that the objective function is
a step function in θ, so a non-differentiable search algorithm must be used, which
typically requires many more function evaluations to find θ̂ than gradient optimiza-
tion methods.39 Thus, the frequency simulator will be feasible for problems where we
can rapidly solve the DDP problem for the cutoffs {η_t(θ)}. An important question
is whether simulation methods can be used to avoid solving the DDP problem itself.
Note that the main burden of DDP estimation methods involves calculation of the
value function V_θ(s), which is itself a conditional expectation with respect to the
controlled stochastic process {s_t}. Can we also use Monte Carlo integration to
compute an approximate value function Ṽ_θ(s) rather than computing it exactly by
backward induction? The next section discusses how this might be done in the
context of infinite-horizon DDP models.

39 Recent progress in developing smooth simulation estimators [Stern (1989), McFadden (1990)]
may help overcome this problem.

3.4. Alternative estimation methods: Infinite-horizon DDPs

Hotz et al. (1993) introduced a simulation estimator for DDP problems that avoids
the computational burden of nested numerical solution of the Bellman equation.
Unlike the simulation estimators described in Section 3.3, the Hotz et al. estimator
is a smooth function of θ. The simulation estimator is based on the following result
of Hotz and Miller (1993).

Lemma 3.1

Suppose that q(dε|x) has a density with finite first moments and support equal to
R^{|D(x)|}. Then for each x ∈ X, the choice probabilities given in equation (3.18) define a
1:1 mapping between the space of normalized value functions {v ∈ R^{|D(x)|} | v(1) = 0}
and the |D(x)|-dimensional simplex Δ^{|D(x)|}.

In the special case where q has an extreme-value distribution, the inverse mapping
has a closed-form solution: it is simply the log-odds transformation. The idea behind
the simulation estimator is to use non-parametric estimation to obtain consistent
estimates of the conditional choice probabilities P(d|x) and then invert these esti-
mates to obtain consistent estimates of the normalized value function, v̂(x, d) −
v̂(x, 1). If we also have estimates of agents' beliefs π̂ we can simulate (a consistent
estimate of) the controlled stochastic process {x_t, ε_t, d_t}. For any given values of x
and θ we can use the simulated realizations of {x_t, ε_t, d_t} to construct a simulation
estimate of the normalized value function, ṽ_θ(x, d) − ṽ_θ(x, 1). At the true parameter
θ*, ṽ_θ*(x, d) − ṽ_θ*(x, 1) is an (asymptotically) unbiased estimate of v_θ*(x, d) − v_θ*(x, 1).
Since the latter quantity can be consistently estimated by inverting non-parametric
estimates of the choice probability, the simulation estimator can be roughly described
as the value of θ that minimizes the distance between the simulated value function
ṽ_θ(x, d) − ṽ_θ(x, 1) and v̂(x, d) − v̂(x, 1), averaged over the various (x, d) pairs in the
sample.
More precisely, the simulated value (SV) function estimator consists of 5 steps. For simplicity I assume that the observed state variable x has finite support, that there are no unknown parameters in the transition probability π, and that q is an IID extreme-value distribution.

Step 1. Invert non-parametric estimates of P̂(d|x) to compute consistent estimates of the normalized value functions Δv̂(x, d) = v̂(x, d) - v̂(x, 1):

Δv̂(x, d) = log P̂(d|x) - log P̂(1|x).   (3.53)
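The log-odds inversion in (3.53) is easy to illustrate numerically. The sketch below (with toy probabilities of my own choosing, assuming the extreme-value case of Lemma 3.1) recovers the normalized values from a vector of choice probabilities and checks that the logit formula maps them back:

    import numpy as np

    P = np.array([0.6, 0.3, 0.1])            # hypothetical estimates of P(d | x), d = 1, 2, 3
    dv = np.log(P) - np.log(P[0])            # Delta v-hat(x, d) = v(x, d) - v(x, 1), as in (3.53)

    P_back = np.exp(dv) / np.exp(dv).sum()   # logit choice probabilities implied by dv
    assert np.allclose(P, P_back)            # the 1:1 mapping of Lemma 3.1 is inverted exactly
    print(dv)                                # [ 0.  -0.693... -1.792...]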

Step 2. Use the consistent estimate of Δv̂ to uncover a consistent estimate of the decision rule δ:

δ̂(x, ε) = argmax_{d∈D(x)} [Δv̂(x, d) + ε(d)].   (3.54)

Step 3. Using δ̂, q and π, simulate realizations of the controlled stochastic process {x_t, ε_t, d_t}. Given an initial condition (x_0, d_0), this consists of the following steps:

1. Given (x_{t-1}, d_{t-1}) draw x_t from the transition probability π(dx_t | x_{t-1}, d_{t-1}),
2. Given x_t draw ε_t from the density q(ε_t | x_t),
3. Given (x_t, ε_t) compute d_t = δ̂(x_t, ε_t),
4. If t > S(A), stop; otherwise set t = t + 1 and return to step 1.⁴⁰

Repeat this simulation using each of the sample points (x_t^a, d_t^a), a = 1, ..., A, t = 1, ..., T, as initial conditions.
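A schematic version of this simulation loop is sketched below; the finite state space, transition probabilities, and stand-in estimates Δv̂ are all hypothetical, and a fixed horizon plays the role of the stopping rule S(A):

    import numpy as np

    rng = np.random.default_rng(0)
    n_x, n_d = 5, 2                                    # hypothetical state and choice sets
    pi = rng.dirichlet(np.ones(n_x), size=(n_x, n_d))  # pi[x, d] = distribution over next x
    dv_hat = rng.normal(size=(n_x, n_d))               # stand-in for the step-1 estimates
    dv_hat[:, 0] = 0.0                                 # normalization: dv(x, 1) = 0

    def simulate(x0, d0, horizon):
        # One path {x_t, eps_t, d_t}; the fixed horizon stands in for S(A).
        xs, ds, eps_list = [x0], [d0], []
        for _ in range(horizon):
            x = rng.choice(n_x, p=pi[xs[-1], ds[-1]])  # 1. draw x_t given (x_{t-1}, d_{t-1})
            eps = rng.gumbel(size=n_d)                 # 2. draw extreme-value eps_t
            d = int(np.argmax(dv_hat[x] + eps))        # 3. d_t = delta-hat(x_t, eps_t) from step 2
            xs.append(x); ds.append(d); eps_list.append(eps)
        return xs, ds, eps_list

    xs, ds, eps = simulate(x0=0, d0=0, horizon=10)
    print(xs, ds)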

Step 4. Using the simulations {x̃_t, d̃_t} from step 3, compute the simulated value function ṽ_θ(x_0, d_0):

ṽ_θ(x_0, d_0) = u_θ(x_0, d_0) + Σ_{t=1}^{S(A)} β^t [u_θ(x̃_t, d̃_t) + ε̃_t(d̃_t)],   (3.55)

where (x_0, d_0) is the initial condition from step 3.

Step 5. Using the normalized simulated value functions Δṽ_θ(x_t^a, d_t^a) and the corresponding non-parametric estimates Δv̂(x_t^a, d_t^a) as data, compute the parameter estimate θ̂_A as the solution to

θ̂_A = argmin_θ H_A(θ)' W_A H_A(θ),   (3.56)

where W_A is a K x K positive definite weighting matrix and K is the dimension of the instrument vector Z_t^a.
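Steps 4 and 5 can be sketched in the same spirit. Since the exact form of H_A(θ) is not reproduced above, the sketch below simply stacks unweighted differences between the non-parametric and simulated normalized values (in effect taking W_A = I and a constant instrument); it should be read as an illustration of the construction, not as the estimator itself:

    import numpy as np

    beta = 0.95                                   # assumed discount factor

    def v_tilde(theta, path, u):
        # Simulated value as in (3.55): discounted utilities along one path from (x0, d0).
        xs, ds, eps = path
        total = u(xs[0], ds[0], theta)
        for t in range(1, len(xs)):
            total += beta ** t * (u(xs[t], ds[t], theta) + eps[t - 1][ds[t]])
        return total

    def sv_objective(theta, paths_d, paths_1, dv_hat, u):
        # H_A(theta)'W_A H_A(theta) with W_A = I; paths_d[k] was simulated from the
        # observed pair (x_k, d_k) and paths_1[k] from (x_k, 1), so the difference in
        # their simulated values is matched against the non-parametric dv_hat[k].
        h = np.array([dv_hat[k] - (v_tilde(theta, paths_d[k], u)
                                   - v_tilde(theta, paths_1[k], u))
                      for k in range(len(dv_hat))])
        return float(h @ h)

Because θ enters only through the utilities evaluated along fixed simulated paths, this objective is smooth in θ, which is the key computational advantage noted below.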

The SV estimator has several attractive features: (1) it avoids the need for repeated calculation of the fixed point (3.22); (2) the simulation error in Δṽ_θ(x_t^a, d_t^a) averages out over sample observations, so that one can estimate the underlying parameters using as few as one simulated path of {x̃_t, ε̃_t} per (x_t^a, d_t^a) observation; (3) since each term Δṽ_θ(x, d) is a smooth function of θ, the objective function (3.56) is a smooth function of θ, allowing the use of gradient optimization algorithms to compute θ̂_A. Note that the simulation in step 3 needs to be done only once at the start of the estimation, so the main computational burden is the repeated evaluation of (3.55) each time θ is updated.
Although the SV estimator is consistent, its main drawback is its sensitivity to

⁴⁰Here S(A) denotes any stopping rule (possibly stochastic) with the property that, with probability 1, S(A) → ∞ as A → ∞.

the non-parametric estimates of P(d|x) in finite samples. One can see from the log-odds transformation (3.53) that if one of the estimated choice probabilities entering the odds ratio is 0, the normalized value function will equal plus or minus ∞. For states x with few observations, the natural non-parametric estimator for P(d|x), sample frequencies, can frequently be 0 even when the true value of P(d|x) is non-zero but small. In general, the inversion process can amplify estimation errors in the non-parametric estimates of P(d|x), and these errors can result in biased estimates of θ. A related problem is that the SV estimator requires estimates of the normalized value function Δv(x, d) for all possible (x, d) pairs, but many data sets will have observations concentrated in only a small subset of (x, d) pairs. Such cases may require parametric specifications of P(d|x) to be able to extrapolate estimates of P(d|x) to the set of (x, d) pairs where there are no observations. In their Monte Carlo study of the SV estimator, Hotz et al. found that smoothed estimates of P(d|x) (such as those produced by kernel estimators, for example) can reduce bias, but in general the SV estimator depends on having large samples to obtain accurate non-parametric estimates of P.⁴¹
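A toy calculation makes the point concrete: with a handful of observations at some state x, a raw frequency of zero sends the log-odds to minus infinity, while even crude smoothing (here an add-one-half adjustment, one of many possible fixes) keeps the inversion finite:

    import numpy as np

    d = np.array([1, 1, 1, 1, 1])                   # five observations at some state x, all d = 1
    n = len(d)
    p_raw = np.mean(d == 0)                         # frequency estimate of P(0 | x): exactly 0
    p_smooth = (np.sum(d == 0) + 0.5) / (n + 1.0)   # add-one-half smoothing

    with np.errstate(divide="ignore"):
        print(np.log(p_raw) - np.log(1 - p_raw))    # -inf: the inversion breaks down
    print(np.log(p_smooth) - np.log(1 - p_smooth))  # finite, though biased toward zero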

3.5. The identification problem

I conclude with a brief discussion of the identification problem for MDPs and DDPs. Without strong restrictions on the primitives (u, p, β) the structural model is non-parametrically unidentified: there are infinitely many distinct primitives that are consistent with any given decision rule δ. Indeed, we show that the hypothesis that agents behave according to Bellman's principle of optimality imposes no testable restrictions, in the sense that we can always find primitives {β, u, p} that rationalize an arbitrary decision rule δ. This result stands in contrast to the case of static choice models, where we know that the hypothesis of optimization per se does imply testable restrictions.⁴² The absence of restrictions in the dynamic case may seem surprising given that the structure of the MDP problem already imposes a number of strong restrictions, such as time-additive preferences and constant intertemporal discount factors, as well as the expected utility hypothesis itself. While Bellman's principle does not place restrictions on historical choice behavior, it can yield restrictions in choice experiments where we have at least partial control over agents' preferences or beliefs.⁴³ For example, by varying agents' beliefs from p to p̃, an experiment implies a new optimal decision rule δ(β*, u*, p̃), where β* and u* are the agent's true discount factor and utility function. This experiment imposes the

⁴¹See also Bansal et al. (1993) for another interesting non-parametric simulation estimator that might also be effective for estimating large-scale DDP problems.
⁴²These restrictions include the symmetry and negative-semidefiniteness of the Slutsky matrix [Hurwicz and Uzawa (1971)], the generalized axiom of revealed preference [Varian (1984)], and in the case of discrete choice, restrictions on conditional choice probabilities [Block and Marschak (1960)].
⁴³An example is the experiment which revealed the Allais paradox mentioned in Footnote 9.

following restriction on candidate (β, u) combinations: they must lie in the set R defined by

R = {(β, u) | δ(β, u, p) = δ(β*, u*, p)} ∩ {(β, u) | δ(β, u, p̃) = δ(β*, u*, p̃)}.   (3.57)

Although experimental methods offer a valuable source of additional testable restrictions that can help narrow down the equivalence class of competing structural explanations of observed behavior, it should be clear that even extensive experimentation will be unable to uniquely identify the true structure of agents' decision-making processes.
As is standard in analyses of the identification problem, we will assume that the econometrician has access to an unlimited number of observations. This is justified since our results in this section are primarily negative: if the primitives {β, u, p} cannot be identified with infinite amounts of data, then they certainly can't be identified in finite samples. We consider general MDPs without unobservables, since the existence of unobservables can only make the identification problem worse. In order to formulate the identification problem à la Cowles Commission, we need to translate the concepts of reduced-form and structure to the context of a nonlinear MDP model.

Definition 3.2

The reduced-form of an MDP is the agent's optimal decision rule, δ.

Definition 3.3

The structure of an MDP is the mapping Λ: (β, u, p) → δ defined by

δ(s) = argmax_{d∈D(s)} [v(s, d)],   (3.58)

where v is the unique fixed point to

v(s, d) = u(s, d) + β ∫ max_{d'∈D(s')} [v(s', d')] p(ds' | s, d).   (3.59)

The rationale for identifying δ as the reduced-form of the MDP is that it embodies all observable implications of the theory and can be consistently estimated by non-parametric regression given a sufficient number of observations {s_t, d_t}.⁴⁴ We can

⁴⁴Since the econometrician fully observes all components of (s, d), the non-parametric regression is degenerate in the sense that the model d = δ(s) has an error term that is identically 0. Nevertheless, a variety of non-parametric regression methods will be able to consistently estimate δ under weak regularity conditions.

use the reduced-form estimate of δ to define an equivalence relation over the space of primitives.

Definition 3.4

Primitives (u, p, β) and (ū, p̄, β̄) are observationally equivalent if

Λ(β, u, p) = Λ(β̄, ū, p̄).   (3.60)

Thus Λ⁻¹(δ) is the equivalence class of primitives consistent with decision rule δ. Expected-utility theory implies that Λ(β, u, p) = Λ(β, au + b, p) for any constants a and b satisfying a > 0, so at best we will only be able to identify an agent's preferences u modulo cardinal equivalence, i.e. up to a positive linear transformation of u.

Definition 3.5

The stationary MDP problem (3.58) and (3.59) is non-parametrically identified if, given any reduced-form δ in the range of Λ and any primitives (u, p, β) and (ū, p̄, β̄) in Λ⁻¹(δ), we have

β = β̄,
p = p̄,
u = aū + b,   (3.61)

for some constants a and b satisfying a > 0.

Lemma 3.2

The MDP problem (3.58) and (3.59) is non-parametrically unidentified.

The proof of this result is quite simple. Given any δ in the range of Λ, let (β, u, p) ∈ Λ⁻¹(δ). Define a new set of primitives (ū, p̄, β̄) by

β̄ = β,

p̄(ds' | s, d) = p(ds' | s, d),

ū(s, d) = u(s, d) + f(s) - β ∫ f(s') p(ds' | s, d),   (3.62)

where f is an arbitrary measurable function of s. Then ū is clearly not cardinally equivalent to u unless f is a constant. To see that (u, p, β) and (ū, p̄, β̄) are observationally equivalent, note that if v(s, d) is the value function corresponding to primitives (u, p, β), then we conjecture that v̄(s, d) = v(s, d) + f(s) is the value

function corresponding to (ū, p̄, β̄):

v̄(s, d) = ū(s, d) + β ∫ max_{d'∈D(s')} [v̄(s', d')] p(ds' | s, d)

        = u(s, d) + f(s) - β ∫ f(s') p(ds' | s, d) + β ∫ max_{d'∈D(s')} [v(s', d') + f(s')] p(ds' | s, d)

        = u(s, d) + f(s) + β ∫ max_{d'∈D(s')} [v(s', d')] p(ds' | s, d)

        = v(s, d) + f(s).   (3.63)

Since v is the unique fixed point to (3.59), it follows that v + f is the unique fixed point to Bellman's equation with primitives (ū, p̄, β̄), so our conjecture v̄ = v + f is indeed correct. Clearly {v(s, d), d ∈ D(s)} and {v(s, d) + f(s), d ∈ D(s)} yield the same decision rule δ. It follows that (β, u + f - βEf, p) is observationally equivalent to (β, u, p), but u + f - βEf is not cardinally equivalent to u.
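The construction in (3.62) and (3.63) can be verified numerically. The sketch below, on a random finite MDP of my own construction, solves Bellman's equation under (u, p, β) and under (u + f - βEf, p, β) and confirms that the two value functions differ by f while the decision rules coincide:

    import numpy as np

    rng = np.random.default_rng(1)
    nS, nD, beta = 6, 3, 0.9                         # small hypothetical MDP
    u = rng.normal(size=(nS, nD))
    p = rng.dirichlet(np.ones(nS), size=(nS, nD))    # p[s, d] = distribution of s'
    f = rng.normal(size=nS)                          # an arbitrary function of s

    def solve(u_mat):
        v = np.zeros((nS, nD))
        for _ in range(2000):                        # value iteration on (3.59)
            v = u_mat + beta * p @ v.max(axis=1)
        return v

    u_bar = u + f[:, None] - beta * p @ f            # the construction in (3.62)
    v, v_bar = solve(u), solve(u_bar)

    assert np.allclose(v_bar, v + f[:, None])        # conjecture (3.63): v_bar = v + f
    assert (v.argmax(axis=1) == v_bar.argmax(axis=1)).all()   # identical decision rules
    print("observationally equivalent")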
We now ask whether Bellman's principle places any restrictions on the decision rule δ. In the case of MDPs, Blackwell's theorem (Theorem 2.3) does provide two restrictions: (1) δ is Markovian, (2) δ is deterministic. In practice, it is extremely difficult to test these restrictions empirically. Presumably we could test the first restriction by seeing whether agents' decisions depend on lagged states s_{t-k} for k = 1, 2, .... However, given that we have not placed any a priori bounds on the dimensionality of S, the well-known trick of expanding the state space [Bertsekas (1987, Chapter 4)] can be used to transform an Nth order Markov process into a 1st order Markov process. The second restriction might be tested by looking for agents who make different choices in the same state: δ(s_t) ≠ δ(s_{t+k}) for some state s_t = s_{t+k} = s. However, this behavior can be rationalized by a model where the agent is indifferent between several alternatives available in state s and simply chooses one at random. The following lemma shows that Bellman's principle implies no other restrictions beyond the two essentially untestable restrictions of Blackwell's theorem:

Lemma 3.3

Given an arbitrary measurable mapping δ: S → D there exist primitives (u, p, β) such that δ = Λ(β, u, p).

The proof of this result is straightforward. Given an arbitrary discount factor β ∈ (0, 1) and transition probability p, define u by

u(s, d) = I{d = δ(s)} - β.   (3.64)

Then it is easy to see that v(s, d) = I{d = δ(s)} is the unique solution to Bellman's equation (3.59), so that δ is the optimal decision rule implied by (u, p, β).
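The proof can likewise be checked by direct computation. In the sketch below (an arbitrary decision rule on a small hypothetical state space), the continuation value max_{d'} v(s', d') is identically 1, so v = I{d = δ(s)} solves (3.59):

    import numpy as np

    rng = np.random.default_rng(2)
    nS, nD, beta = 4, 3, 0.8                        # small hypothetical state and choice sets
    delta = rng.integers(nD, size=nS)               # an arbitrary decision rule delta: S -> D
    p = rng.dirichlet(np.ones(nS), size=(nS, nD))   # arbitrary transition probabilities

    u = (np.arange(nD)[None, :] == delta[:, None]).astype(float) - beta   # (3.64)
    v = u + beta * (p @ np.ones(nS))                # RHS of (3.59), using max_d' v(s', d') = 1
    assert np.allclose(v, u + beta)                 # so v(s, d) = I{d = delta(s)}
    assert (v.argmax(axis=1) == delta).all()        # and delta is the optimal decision rule
    print("delta rationalized")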
If we are unwilling to place any restrictions on (u, p, β), then Lemma 3.2 shows that the resulting MDP model is non-parametrically unidentified and Lemma 3.3 shows that it has no testable implications, in the sense that we can rationalize any decision rule δ. However, it is clear that from an economic standpoint many of the utility functions u + f - βEf will generally be implausible, as is the utility function u(s, d) = I{d = δ(s)}. These results provide a simple illustration of the need for auxiliary identifying restrictions to eliminate primitives that we deem unreasonable, while at the same time pointing out the futility of direct non-parametric estimation of (u, p, β). The proofs of the lemmas also indicate that in order to obtain testable implications, we will need to impose very strong identifying restrictions on u. To see this, suppose that the agent's discount factor β is known a priori, and suppose that we invoke the hypothesis of rational expectations: i.e. agents' subjective beliefs about the evolution of the state variables coincide with the objectively measurable population probability measure. This implies that we can use observed realizations of the controlled process {s_t, d_t} to consistently estimate p(ds_{t+1} | s_t, d_t), which means that we can treat both β and p as known a priori. Note, however, that the proofs of Lemmas 3.2 and 3.3 are unchanged whether or not we know β and p, so it is clear that we must impose identifying restrictions directly on u itself. The usual way to impose restrictions is to assume that u is a smooth function of a vector of unknown parameters θ. However, in the case where S and D are finite sets, this is insufficient to uniquely identify the model: there is an equivalence class of θ with non-empty interior consistent with any decision rule δ_θ.
This latter identification problem is an artifact of the degeneracy of MDP models without unobservables. Rust (1996) provides a parallel identification analysis for the subclass of DDP models satisfying the AS-CI assumptions presented in Section 4.3, and shows that while AS-CI does impose testable restrictions, the primitives (u, q, π, β) are still non-parametrically unidentified in the absence of further identifying restrictions. However, when identifying restrictions take the form of smooth parametric specifications (u_θ, q_θ, π_θ, β_θ), the presence of unobservables succeeds in smoothing out the problem, and the results of Section 4.3 imply that the likelihood is a smooth function of θ. Results from differentiable topology can then be used to show that the resulting parametric model is generically identified.
While the results in this section suggest that, in principle, one can always rig a
DDP model to rationalize any given set of observations, we should emphasize that
there is no guarantee that we can do it in a way that is theoretically plausible. Given
the increasing size and complexity of current data sets, it can be a considerable
challenge to find a plausible DDP model that is consistent with the available data,
let alone one that is capable of making accurate out-of-sample forecasts of policy
changes. However, in the cases where we do succeed in specifying a plausible
structural model that fits the data well, we need to exercise a certain amount of
caution in using the model for policy forecasting and welfare analyses, etc. Keep in

mind that a model will be credible only until such time as it is rejected by new,
out-of-sample observations, or faces competition from alternative models that are
also consistent with available data and are equally plausible. The identification
analysis suggests that we can't rule out the existence of alternative plausible, obser-
vationally equivalent models that generate completely different policy forecasts or
welfare impacts. Put simply, since structural models can be falsified but never
proven to be true, their predictions should always be treated as tentative and subject
to continual verification.

4. Empirical applications

I conclude this chapter with brief reviews of two empirical applications of DDP models: Rust's (1987) model of optimal replacement of bus engines and the Lumsdaine et al. (1992) model of optimal retirement from a firm.

4.1. Optimal replacement of bus engines

One of the simplest applications of the specific class of DDP models given in Definition 3.1 is Rust's (1987) application to bus engine replacement. In contrast to the macroeconomic studies that investigate aggregate replacement investment, this model goes to the other extreme and examines the replacement investment decision at the level of an individual agent. In this case the agent, Harold Zurcher, is the maintenance manager of the Madison Metropolitan Bus Company. One of the problems he faces is to decide how long to operate a bus before replacing its engine with a new or completely overhauled bus engine. We can represent Zurcher's problem as a DDP with state variable x_t equal to the cumulative mileage on a bus since the last engine replacement, and control variable d_t which equals 1 if Zurcher decides to replace the bus engine, and 0 otherwise. Rust assumed that Zurcher behaves as a cost minimizer, so his utility function is given by

u(x_t, ε_t, d_t, θ) =
    -θ₁ - c(0, θ₂) + ε_t(1)   if d_t = 1,
    -c(x_t, θ₂) + ε_t(0)      if d_t = 0,      (4.1)

where θ₁ represents the labor and parts cost of installing a replacement bus engine and c(x, θ₂) represents the expected monthly operating and maintenance costs of a bus with x accumulated miles since the last replacement. Implicit in the specification (4.1) is the assumption that when a bus engine is replaced, it is as good as new, so the state of the system regenerates to x_t = 0 when d_t = 1. This regeneration property is also captured in the transition probability for x_t:

π(dx_{t+1} | x_t, d_t) =
    g(x_{t+1} - 0)     if d_t = 1,
    g(x_{t+1} - x_t)   if d_t = 0,             (4.2)
Ch. 51: Structural Estimation of Markov Decision Processes 3131

where g is a probability density function. The renewal property given in equations (4.1) and (4.2) defines a regenerative optimal stopping problem, and under an optimal decision rule d_t = δ*(x_t, ε_t, θ) the mileage process {x_t} is a regenerative random walk.
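A stripped-down version of this optimal stopping problem is easy to solve by value iteration. The sketch below discretizes mileage, suppresses the ε's, and uses illustrative parameter values of my own (Rust's estimates use β = 0.9999; a smaller β is used here only to speed convergence):

    import numpy as np

    K = 90                                   # mileage grid (cells of, say, 5000 miles each)
    beta = 0.95                              # illustrative; Rust's estimates use beta = 0.9999
    theta1, theta2 = 10.0, 0.05              # replacement cost and maintenance cost slope
    g = np.array([0.3, 0.5, 0.2])            # discretized density of the monthly mileage increment

    P = np.zeros((K, K))                     # keep-transition matrix implied by (4.2) with d = 0
    for s in range(K):
        for j, pr in enumerate(g):
            P[s, min(s + j, K - 1)] += pr    # the top cell absorbs any overflow

    x = np.arange(K)
    V = np.zeros(K)
    for _ in range(100000):
        keep = -theta2 * x + beta * (P @ V)                  # d = 0: pay c(x) and drift upward
        replace = -theta1 - theta2 * 0 + beta * (P @ V)[0]   # d = 1: regenerate to x = 0
        V_new = np.maximum(keep, replace)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new

    print("replace once the mileage cell reaches", int(np.argmax(replace >= keep)))

The solution takes the cutoff form characteristic of regenerative optimal stopping: keep the engine below a threshold mileage and replace it above.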
Using data on 8156 bus/month observations over the period 1975 to 1986, Rust estimated θ and g using the maximum likelihood estimator (3.25). Figures 2 and 3 present the estimated value and replacement hazard functions assuming a linear cost function, c(x, θ₂) = θ₂x, and two different values of β. Comparing the estimated hazard function P(1 | x, θ) to the non-parametrically estimated hazard, both the dynamic (β = 0.9999) and myopic (β = 0) models appear to fit the data equally well. In particular, both models predict that the probability of a replacement is essentially 0 for x less than 100,000 miles. However, likelihood ratio and chi-square tests both strongly reject the myopic model in favor of the dynamic model: the data imply that the concave value function for β = 0.9999 fits the data better than the linear value function θ₂x when β = 0. The precise value of β could not be identified: the likelihood was virtually flat for β ≥ 0.98, although with a very slight upward slope as β → 1. The latter finding, together with the final value theorem [Bhattacharya and Majumdar (1989)], may indicate that Zurcher is minimizing long-run average costs rather than discounted costs:

lim_{β→1} (1 - β) max_δ E_δ { Σ_{t=0}^∞ β^t u(x_t, d_t) } = lim_{T→∞} max_δ (1/T) E_δ { Σ_{t=1}^T u(x_t, d_t) }.   (4.3)
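The limit in (4.3) is easy to check numerically on a small chain. The sketch below (an uncontrolled two-state example of my own, with the max over decision rules suppressed) shows (1 - β) times the discounted value converging to the stationary average reward:

    import numpy as np

    P = np.array([[0.9, 0.1], [0.4, 0.6]])     # transition matrix of a two-state chain
    u = np.array([1.0, -2.0])                  # per-period utility in each state

    for beta in (0.9, 0.99, 0.999, 0.9999):
        v = np.linalg.solve(np.eye(2) - beta * P, u)   # solves v = u + beta P v
        print(beta, (1 - beta) * v)                    # approaches the average reward

    evals, evecs = np.linalg.eig(P.T)                  # stationary distribution (left eigenvector)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = pi / pi.sum()
    print("long-run average reward:", pi @ u)          # 0.8 * 1.0 + 0.2 * (-2.0) = 0.4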

Figure 2. Estimated value functions (horizontal axis: mileage since last replacement, in thousands of miles).



Figure 3. Estimated hazard functions (horizontal axis: mileage since last replacement, in thousands of miles).

The estimated model implies that expected monthly maintenance costs increase by $1.87 for every additional 5000 miles. Thus, a bus with 300,000 miles costs an average of $112 per month more to maintain than a bus with a newly replaced engine. Rust found evidence of heterogeneity across bus types, since the estimated value of θ₂ for the newer 1979 model GMC buses is nearly twice as high as for the 1974 models. This finding resolves a puzzling aspect of the raw data: engine replacements for the 1979 model buses occur on average 57,000 miles earlier than for the 1974 model buses, despite the fact that the engine replacement cost for the 1979 models is 25% higher. One of the nice features of estimating preferences at the level of a single individual is that one can evaluate the accuracy of the structural model by simply asking the individual whether the estimated utility function is reasonable.⁴⁵ In this case conversations with Zurcher revealed that the implied cost estimates of the structural model corresponded closely with his perceptions of operating costs, including the finding that monthly maintenance expenditures for the 1979 model GMC buses were nearly twice as high as for the 1974 models.

⁴⁵Another nice feature is that we completely avoid the problem of unobserved heterogeneity that can confound attempts to estimate dynamic models using panel data. Heckman (1981a, b) provides a good discussion of this problem in the context of models for discrete panel data.

Figure 4 illustrates the value of structural models for use in policy forecasting. In this case, we are interested in forecasting Zurcher's demand for replacement bus engines as their price varies. Figure 4 contains two demand curves computed for models estimated with β = 0.9999 and β = 0, respectively. The maximum likelihood procedure insures that both models generate the same predictions at the actual replacement price of $4343. A reduced-form approach to forecasting bus engine demand would bypass the difficulties of structural estimation and simply regress the number of engine replacements in a given period on replacement costs during the period. However, since the cost of replacement bus engines has not varied much in the past, the reduced-form approach will be incapable of generating reliable predictions of replacement demand. In terms of Figure 4, all the data would be clustered in a small ball about the intersection of the two curves: obviously many different demand functions will be able to fit the data equally well. By parameterizing our prior knowledge about the replacement problem Zurcher is facing, and by efficiently using the additional information contained in the {x_t^a, d_t^a} sequences, the structural model is able to generate very precise estimates of the replacement demand function.
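Tracing out a demand curve like the one in Figure 4 amounts to re-solving the DP at a grid of hypothetical engine prices. The sketch below reuses the same stand-in primitives as the value-iteration example above and reports the optimal replacement cutoff at each price (a higher cutoff means fewer replacements per bus per year, i.e. lower demand):

    import numpy as np

    K, beta, theta2 = 90, 0.95, 0.05           # same illustrative primitives as above
    g = np.array([0.3, 0.5, 0.2])
    P = np.zeros((K, K))
    for s in range(K):
        for j, pr in enumerate(g):
            P[s, min(s + j, K - 1)] += pr

    def replacement_cutoff(theta1):
        # Re-solve the DP at engine price theta1 and return the cutoff mileage cell.
        V, x = np.zeros(K), np.arange(K)
        for _ in range(100000):
            keep = -theta2 * x + beta * (P @ V)
            repl = -theta1 + beta * (P @ V)[0]
            V_new = np.maximum(keep, repl)
            if np.max(np.abs(V_new - V)) < 1e-10:
                break
            V = V_new
        return int(np.argmax(repl >= keep))

    for price in (5.0, 10.0, 20.0):
        print(price, replacement_cutoff(price))   # cutoff rises with price: demand slopes down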

Figure 4. Expected replacement demand function (horizontal axis: parts cost of replacement bus engine, in thousands of dollars).



4.2. Optimal retirement from a firm

Lumsdaine, Stock and Wise (1992) (LSW) used a data set that provides a unique natural experiment that allows them to compare the policy forecasts of four competing structural and reduced-form models. The data consist of observations on departure rates of older office employees at a Fortune 500 company. These workers were covered by a defined benefit pension plan that provided substantial incentives to remain with the firm until age 55, and then substantial incentives to leave the firm before age 65. In 1982 non-managerial employees who were over 55 and vested in the pension plan were offered a temporary one-year "window plan". Under the window plan, employees who retired in 1982 were offered a bonus equivalent to 3 to 12 months' salary, depending on their years of service with the firm. Needless to say, the firm experienced a substantial increase in departure rates in 1982.
Using data prior to 1982, LSW fit four alternative econometric models and used the fitted models to make out-of-sample forecasts of departure rates under the 1982 window plan. One of the models was a reduced-form probit model with various explanatory variables, and the remaining three were different types of dynamic structural models.⁴⁶ Two of the structural models were finite-horizon DDP models with a binary decision variable: continue working (d = 1) or quit (d = 0). Since quitting is treated as an absorbing state and workers were subject to mandatory retirement at age 70, the DDP model reduces to a simple finite-horizon optimal stopping problem. The observed state variable x_t is the benefit (wage or pension) obtained in year t, and the unobserved state variable ε_t is assumed to be an IID extreme-value process in the first specification and an IID Gaussian process in the other specification. LSW used the following specification for workers' utility functions:

u(x_t, ε_t, d_t) =
    x_t^θ₁ + μ₁ + ε_t(1)        if d_t = 1,
    (μ₂ θ₂ x_t)^θ₁ + ε_t(0)     if d_t = 0.

The two components of μ = (μ₁, μ₂) represent time-invariant worker-specific heterogeneity. In the extreme-value specification for {ε_t}, μ₁ is assumed to be identically 0 and μ₂ is assumed to have a log-normal population distribution with mean 1 and scale parameter θ₄. In the Gaussian specification for {ε_t}, μ₂ is identically 1 and μ₁ is assumed to have a Gaussian population distribution with mean 0 and standard deviation θ₄. Although the model does not directly include leisure in the utility

⁴⁶LSW used different sets of explanatory variables in the reduced-form model, including calculated "option values" of continued work (the expected present value of benefits earned by retiring at the optimal age, i.e. the age at which the total present value of benefits, wages plus pensions, is maximized). Other specifications used the levels and present values of Social Security and pension benefits, as well as changes in the present value of these benefits ("pension accruals"), predicted earnings in the next year of employment, and age.

function, it is implicitly included via the parameter θ₂. Thus, we expect that θ₂ > 1 since the additional leisure time should imply that a dollar of pension income is worth more than a dollar of wage income.
The third structural model is the "option value" model developed in previous work by Stock and Wise (1990). The option value model predicts that the worker will leave the firm in the first year t in which the expected present discounted value of benefits from departing at t exceeds the maximum of the expected values of departing at any future date. This rule differs from an optimal stopping rule generated from the solution to a DP problem by interchanging the expectation and maximization operators. This results in a temporally inconsistent decision rule in which the worker ignores the fact that as new information arrives he will be continually revising his estimate of the optimal departure date t*.⁴⁷
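The difference between the two rules is easily seen in a two-period toy problem (my construction, not LSW's model): because E[max(.)] ≥ max(E[.]), interchanging the operators understates the value of waiting and can stop too early:

    import numpy as np

    rng = np.random.default_rng(3)
    s = 1.0                                            # payoff to retiring today
    X = rng.normal(loc=1.0, scale=1.0, size=100000)    # next year's retirement payoff

    dp_wait = np.mean(np.maximum(X, s))    # DP continuation value E[max(X, s)], about 1.40
    ov_wait = max(np.mean(X), s)           # option value rule: max(E[X], s), about 1.00

    print("DP value of waiting:          ", dp_wait)   # exceeds s: keep working
    print("option-value rule's valuation:", ov_wait)   # nearly indifferent: retires too early
    # The interchange discards the value of reacting to next year's information,
    # which is why Stern (1995) finds it can approximate the optimal rule poorly.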
The parameter estimates for the three structural models are presented in Table 1. There are significant differences in the parameter estimates across the three models. The

Table 1
Parameter estimates for the option value and the dynamic programming models.

                Option Value Models     Dynamic Programming Models
                                        Extreme Value                 Normal
Parameter       (1)       (2)           (1)       (2)       (3)       (4)       (5)       (6)
θ₁              1.00*     0.612         1.00*     1.018     1.187     1.00*     1.187     1.109
                          (0.072)                 (0.045)   (0.215)             (0.110)   (0.275)
θ₂              1.902     1.477         1.864     1.881     1.411     2.592     2.975     2.974
                (0.192)   (0.445)       (0.144)   (0.185)   (0.307)   (0.100)   (0.039)   (0.374)
β               0.855     0.895         0.618     0.620     0.583     0.899     0.916     0.920
                (0.046)   (0.083)       (0.048)   (0.063)   (0.105)   (0.017)   (0.013)   (0.023)
θ₃              0.168     0.109         0.306     0.302     0.392     0.224     0.202     0.168
                (0.016)   (0.046)       (0.037)   (0.036)   (0.090)   (0.021)   (0.022)   (0.023)
θ₄                                      0.00*     0.00*     0.407     0.00*     0.00*     0.183
                                                            (0.138)                       (0.243)

Summary statistics
-ln L           294.59    280.32        279.60    279.57    277.25    277.24    276.49    276.17
χ² sample       36.5      53.5          38.9      38.2      36.2      45.0      40.7      41.5
χ² window       43.9      37.5          32.4      33.5      33.4      29.0      25.0      24.3

Notes: Estimation is by maximum likelihood; standard errors are in parentheses. All monetary values are in $100,000 (1980 dollars).
*Parameter value imposed.

⁴⁷Thus, it is somewhat of a misnomer to call it an "option value" model, since it ignores the option value of new information that is explicitly accounted for in a dynamic programming formulation. Stern (1995) shows that in many problems the computationally simpler procedure of interchanging the expectation and maximization operators yields a very poor approximation to the optimal decision rule computed by DP methods.

Gaussian DP specification predicts a much higher implicit value of leisure (θ₂) than the other two models, and the extreme-value specification yields a much lower estimate of the discount factor β. The estimated standard deviation σ of the ε_t's is also much higher in the extreme-value specification than in the Gaussian. Allowing for unobserved heterogeneity did not significantly improve the fit of the Gaussian DP model, although it did have a marginal impact on the extreme-value model.⁴⁸
Figures 5 to 7 summarize the ability of the four models to fit the historical data.

Figure 5. Predicted versus actual 1980 departure rates and implicit cumulative departures, option value model.

Figure 6. Predicted versus actual 1980 departure rates and implicit cumulative departures, dynamic programming models.

⁴⁸The log-likelihoods in the Gaussian and extreme-value specifications improved to -276.17 and -277.25, respectively.

Figure 7. Predicted versus actual departure rates and implicit cumulative departures, probit model.

Figure 5 compares the actual departure rates (solid line) to the departure rates predicted by the option value model (dashed line). Figure 6 presents a similar comparison for the two DDP models, and Figure 7 presents the results for the best-fitting probit model. As one can see from the graphs, all four models provide a relatively good fit of actual departure rates, except that they all miss the pronounced peak in departure rates at age 65. Table 1 presents the χ² goodness-of-fit statistics for each of the models. The four models are generally comparable, although the option value model fits slightly worse and the probit model fits slightly better than the two DDP models. The superior fit of the probit model is probably due to the inclusion of age trends that are excluded from the other models.
Figures 8 to 10 summarize the ability of the models to track the shift in departure rates induced by the 1982 window plan. All forecasts were based on the utility function parameters estimated using data prior to 1982. Using these parameters, predictions were generated from all four models after incorporating the extra bonus provisions of the window plan. As is evident from Figures 8-10, the structural models were generally able to accurately predict the large increase in departure rates induced by the window plan, although once again none of the models was able to capture the peak in departure rates at age 65. On the other hand, the reduced-form probit model predicted that the window plan had essentially no effect on departure rates. Other reduced-form specifications greatly overpredicted departure rates under the window plan. The χ² goodness-of-fit statistics presented in Table 1 show that all of the structural models do a significantly better job of predicting the impact of the window plan than any of the reduced-form models.⁴⁹ LSW concluded that:

⁴⁹The smallest χ² value for any of the reduced-form models under the window plan was 57.3; the largest was 623.3.

Figure 8. Predicted versus actual departure rates and implicit cumulative departures under the 1982 window plan, based on 1980 parameter estimates, and 1981 actual rates: option value model.

Figure 9. Predicted versus actual departure rates and implicit cumulative departures under the 1982 window plan, based on 1980 parameter estimates, and 1981 actual rates: dynamic programming models.

The option value and the dynamic programming models fit the data equally well,
with a slight advantage to the normal dynamic programming model. Both models
correctly predicted the very large increase in retirement under the window plan,
with some advantage in the fit to the option value model. In short, this evidence
suggests that the option value and dynamic programming models are considerably
more successful than the less complex probit model in approximating the rules

Figure 10. Predicted versus actual departure rates and implicit cumulative departures under the 1982 window plan, based on 1980 parameter estimates, and 1981 actual rates: probit model.

individuals use to make retirement decisions, but that the more complex dynamic
programming rule approximates behavior no better than the simpler option value
rule. More definitive conclusions will have to await accumulated evidence based
on additional comparisons using different data sets and with respect to different
pension plan provisions. (p. 31).

References

Ahn, M.Y. (1993a) Duration Dependence, Search Effort, and Labor Market Outcomes: A Structural Model of Labor Market History, manuscript, Duke University.
Ahn, M.Y. (1993b) Econometric Analysis of Sequential Discrete Choice Behavior with Duration Dependence, manuscript, Duke University.
Ahn, H. and C. Manski (1993) Distribution Theory for the Analysis of Binary Choice Under Uncertainty with Nonparametric Estimation of Expectations, Journal of Econometrics, 56(3), 270-291.
Amemiya, T. (1976) On a Two-Step Estimation of a Multivariate Logit Model, Journal of Econometrics, 8, 13-21.
Andrews, D.W.K. (1988a) Chi-Square Diagnostic Tests for Econometric Models, Journal of Econometrics, 37, 135-156.
Andrews, D.W.K. (1988b) Chi-Square Tests for Econometric Models: Theory, Econometrica, 56, 1414-1453.
Arrow, K.J., T. Harris and J. Marschak (1951) Optimal Inventory Policy, Econometrica, 19(3), 250-272.
Bansal, R., A.R. Gallant, R. Hussey and G. Tauchen (1993) Nonparametric Estimation of Structural Models for High Frequency Currency Market Data, Journal of Econometrics, forthcoming.
Basawa, I.V. and B.L.S. Prakasa Rao (1980) Statistical Inference for Stochastic Processes. Academic Press: New York.
Bellman, R. (1957) Dynamic Programming. Princeton University Press: Princeton.
Berkovec, J. and S. Stern (1991) Job Exit Behavior of Older Men, Econometrica, 59(1), 189-210.

Berndt, E., B. Hall, R. Hall and J. Hausman (1974) Estimation and Inference in Nonlinear Structural
Models, Annals ofEconomic and Social Measurement, 3, 6533665.
Bertsekas, D. (1987) Dynamic Programming Deterministic and Stochastic Models, Prentice Hall: New
York.
Bertsekas, D. and D. Castaiion (1989) Adaptive Aggregation Methods for Infinite Horizon Dynamic
Programming, IEEE Transactions on Automatic Control, 34(6), 5899598.
Bhattacharya, R.N. and M. Majumdar (1989) Controlled Semi-Markov Models ~ The Discounted
Case, Journal ofStatistical Planning and Inference, 21, 3655381.
Billingsley, P. (1961) Statistical Inferencefor Markou Processes. University of Chicago Press: Chicago.
Blackwell, D. (1962) Discrete Dynamic Programming, Annals ofMathematical Statistics, 33, 719~726.
Blackwell, D. (1965) Discounted Dynamic Programming, Annals of Mathematical Statistics, 36, 2266
235.
Blackwell, D. (1967) Positive Dynamic Programming, Proceedings of the 5th BerkeCey Symposium
on Mathematical Statistics and Probability, 1, 415-418.
Block, H. and J. Marschak (1960) Random Orderings and Stochastic Theories of Response, in: I.
Olkin, ed., Contributions to Probability and Statistics, Stanford University Press: Stanford.
Boldrin, M. and L. Montrucchio (1986) On the Indeterminacy of Capital Accumulation Paths,
Journal of Economic Theory, 40(l), 26-39.
Brock, W.A. (1982) Asset Prices in a Production Economy, in: J.J. McCall, ed., The Economics of
Information and Uncertainty, Chicago: University of Chicago Press.
Brock, W.A. and L.J. Mirman (1972) Optimal Economic Growth Under Uncertainty: The Discounted
Case, Journal of Economic Theory, 4,4799513.
Chamberlain, G. (1984) Panel Data, in: Z. Griliches and M.D. Intrilligator, eds., Handbook of Econo-
metrics Volume 2. North-Holland: Amsterdam. 1247-1318.
Chew, S.H. and L.G. Epstein (1989) The Structure of Preferences and Attitudes Towards the Timing
and Resolution of Uncertainty, Znternational Economic Review, 30(l), 103-l 18.
Christensen, B.J. and N.M. Kiefer (1991a) The Exact Likelihood Function for an Empirical Job
Search Model, Econometric Theory.
Christensen, B.J. and N.M. Kiefer (1991b) Measurement Error in the Prototypical Job Search Model,
manuscript, Cornell University.
Cox, D.R. (1975) Partial Likelihood, Biometrika, 62(2), 2699276.
Cox, D.R. and D.V. Hinkley (1974) Theoretical Statistics. Chapman and Hall: London.
Dagsvik, J. (1983) Discrete Dynamic Choice: An Extension of the Choice Models of Lute and Thurs-
tone, Journal of Mathematical Psychology, 27, l-43.
Dagsvik, J. (1991) A Note on the Theoretical Content of the GEU Model for Discrete Choice, manuscript,
Central Bureau of Statistics, Oslo.
Daly, A.J. and S. Zachary, (1978) Improved Multiple Choice Models, in: D.A. Hensher and Q, Dalvi,
eds., Determinants of Travel Choice, 3355357, Teakfield, Hampshire.
Das, M. (1992) A Micro Econometric Model of Capital Utilization and Retirement: The Case of the
Cement Industry, Review of Economic Studies, 59(2), 287-298.
Daula, T. and R. Moffitt (1991) Estimating a Dynamic Programming Model of Army Reenlistment
Behavior, Military Compensation and Personnel Retention.
Debreu, G. (1960) Review of R.D. Lute Individual Choice Behavior, American Economic Review, 50,
1866188.
Denardo, E.V. (1967) Contraction Mappings Underlying the Theory of Dynamic Programming, SIAM
Review, 9, 1655177.
Dongarra, J.J. and T. Hewitt (1986) Implementing Dense Linear Algebra Algorithms Using Multi-
tasking on the Cray X-MP-4, SIAM Journal on Scientific and Statistical Computing, 7(l), 3477350.
Eckstein, Z. and K. Wolpin (1989a) The Specification and Estimation of Dynamic Stochastic Discrete
Choice Models, Journal of Human Resources, 24(4),562-598.
Eckstein, Z. and K. Wolpin (1989b) Dynamic Labour Force Participation of Married Women and
Endogenous Work Experience, Review of Economic Studies, 56, 375-390.
Eckstein, Z. and K. Wolpin (1990) Estimating a Market Equilibrium Search Model from Panel Data
on Individuals, Econometrica, 58(4), 783-808.
Engle, R. (1984) Wald, Likelihood Ratio and Lagrange Multiplier Tests in Econometrics, in:
Z. Griliches and M.D. Intrilligator, eds., Handbook of Econometrics Vol 2. North-Holland:
Amsterdam.

Epstein, L.C. and SE. Zin (1989) Substitution, Risk Aversion, and the Temporal Behavior of Consump-
tion and Asset Returns: A Theoretical Framework, Econometrica, 57(4), 937-970.
Flinn, C. and J.J. Heckman (1982) New Methods for Analyzing Laborforce Dynamics, Journal qf
Econometrics, 18, 115-168.
Gihman, 1.1. and A.V. Skorohod (1979) Controlled Stochastic Processes, Springer-Verlag New York.
Gotz. G.A. and J.J. McCall (1980) Estimation in Sequential Decisionmaking Models: A Methodological
Note, Economics Letters,6, 131-136.
Gotr, G.A. and J.J. McCall (1984) A Dynamic Retention Model for Air Force Officers, Report R-3028-
AF, The RAND Corporation, Santa Monica, California.
Haavelmii, T. (1944) The Probability Approach in Econometrics, Econometrica Supplement, 12, 1-l 15.
Hakansson. N. (1970) Optimal Investment and Consumption Strategies Under Risk for a Class of
Utility Functibns,Econometrica, 38, 587-607.
Hansen, L.P. (1982) Large Sample Properties of Method of Moments Estimators, Econometrica, 50,
1029-1054.
Hansen, L.P. (1994) in: R. Engle and D. McFadden, eds., Handbook of Econometrics Vol. 4. North-
Holland: Amsterdam.
Hansen, L.P. and T.J. Sargent (1980a) Formulating and Estimating Dynamic Linear Rational Expec-
tations Models, Journal qf Economic Dynamics and Control, 2(l), 7-46.
Hansen, L.P. and T.J. Sargent (1980b) Linear Rational Expectations Models for Dynamically Inter-
related Variables, in: R.E. Lucas, Jr. and T.J. Sargent, eds., Rational Expectations and Econometric
Practice, Minneapolis: University of Minnesota Press.
Hansen, L.P. and T.J. Sargent (1993) Recursive Models of Dynamic Economies, manuscript, Hoover
Institution.
Hansen, L.P. and K. Singleton (1982) Generalized Instrumental Variables Estimation of Nonlinear
Rational Expectations Models, Econometrica, 50, 1269~1281.
Hansen, L.P. and K. Singleton (1983) Stochastic Consumption, Risk Aversion, and the Temporal
Behavior of Asset Returns, Journal of Political Economy, 91(2), 249-265.
Heckman, J.J. (1981a) Statistical Models of Discrete Panel Data, in: CF. Manski and D. McFadden,
eds., Srructural Analysis ofDiscrete Data, MIT Press: Cambridge, Massachusetts.
Heckman, J.J. (1981 b) The Incidental Parameters Problem and the Problem of Initial Conditions in
Estimating Discrete-Time, Discrete-Data Stochastic Processes, in: C.F. Manski and D. McFadden,
eds., Structural Analysis ofDiscrete Data, MIT Press: Cambridge, Massachusetts.
Heckman, J.J. (1991) Randomization and Social Policy Evaluation, NBER Technical Working paper
107.
Heckman, J.J. (1994) Alternative Approaches to the Evaluation of Social Programs: Econometric and
Experimental Methods, in: J. Laffont and C. Sims, eds., Advances in Econometrics: Sixth World
Congress, Econometric Society Monographs, Cambridge University Press.
Hotz, V.J. and R.A. Miller (1993) Conditional Choice Probabilities and the Estimation of Dynamic
Models, Review of Economic Studies, forthcoming.
Hotz, V.J., R.A. Miller, S. Sanders and J. Smith (1993) A Simulation Estimator for Dynamic Models of
Discrete Choice, Review ofEconomic Studies, 60, 397-429.
Howard, R. (1960) Dynamic Programming and Markou Processes. J. Wiley: New York.
Howard, R. (1971) Dynamic Probabilistic Systems: Volume 2 ~ Semi-Markov and Decision Processes. J.
Wiley: New York.
Hurwicz, L. and H. Uzawa (1971) On the Integrability of Demand Functions, in: J. Chipman et al.,
eds., Preferences, Utility, and Demand. New York: Harcourt, Brace and Jovanovich.
Judd, K. (1994) Numerical Methods in Economics, manuscript, Hoover Institution.
Keane, M. and K. Wolpin (1994) The Solution and Estimation of Discrete Choice Dynamic Programming
Models by Simulation: Monte Carlo Evidence, manuscript, University of Minnesota.
Kennet, M. (1994) A Structural Model of Aircraft Engine Maintenance, Journal of Applied Econometrics,
forthcoming.
Kreps, D. and E. Porteus (1978) Temporal Resolution of Uncertainty and Dynamic Choice Theory,
Econometrica, 46, 185-200.
Kushner, H.J. (1990) Numerical Methods for Stochastic Control Problems in Continuous Time, SIAM
Journal on Control and Optimization, 28(5), 999-1048.
Kydland, F. and E.C. Prescott (1982) Time to Build and Aggregate Fluctuations, Econometrica, 50,
1345-1370.

Lancaster, T. (1990) The Econometric Anulysis of Transition Data Cambridge University Press.
Leland, H. (1974) Optimal Growth in a Stochastic Environment, Review ofEconomic Studies, 41,75-86.
Levhari, D. and T. Srinivasan (1969) Optimal Savings Under Uncertainty, Review ofEconomic Studies,
36, 153-163.
Long, J.B. and C. Plosser (1983) Real Business Cycles, Journal of Political Economy, 91(I), 39-69.
Lucas, R.E. Jr. (1976) Econometric Policy Evaluation: A Critique, in: K. Brunner and A.K. Meltzer,
eds., The Phillips Curve and Lahour Markets. Carnegie-Rochester Conference on Public Policy,
North-Holland: Amsterdam.
Lucas, R.E. Jr. (1978) Asset Prices in an Exchange Economy, Econometrica, 46, 1426-1446.
Lucas, R.E. Jr. and C.E. Prescott (1971)Investment Under Uncertainty, Econometrica, 39(5), 659-681.
Lumsdaine, R., J. Stock and D. Wise (1992) Three Models of Retirement: Computational Complexity
vs. Predictive Validity, in: D. Wise, ed., Topics in the Economics of Aging. Chicago: University of
Chicago Press.
Machina, M.J. (1982) Expected Utility without the Independence Axiom, Econometricu, 50-2,277-324.
Machina, M.J. (1987) Choice Under Uncertainty: Problems Solved and Unsolved, Journul c>fEconomic
Perspectives, l(l), 121-154.
Mantel, R. (1974) On the Characterization of Excess Demand, Journal qfEconomic Theory, 7,348-353.
Marcet, A. (1994) Simulation Analysis of Stochastic Dynamic Models: Applications to Theory and
Estimation, in: C. Sims and J. Laffont, eds., Aduances in Econometrics: Proceedings of the 1990
Meetings ofthe Econometric Society.
Marschak, T. (1953) Economic Measurements for Policy and Prediction, in: W.C. Hood and T.J.
Koopmans, eds., Studies in Econometric Method. Wiley: New York.
McFadden, D. (1973) Conditional Logit Analysis of Qualitative Choice Behavior, in: P. Zarembka, ed.,
Frontiers of Econometrics. Academic Press: New York.
McFadden, D. (1981) Econometric Models of Probabilistic Choice, in: CF. Manski and D. McFadden,
eds. Structural Analysis of Discrete Data. MIT Press: Cambridge, Massachusetts.
McFadden, D. (1984) Econometric Analysis of Qualitative Response Models, in: 2. Griliches and
M.D. Intriligator, eds., Handbook of Econometrics Vol. 2. North-Holland: Amsterdam. 1395-1457.
McFadden, D. (1989) A Method of Simulated Moments for Estimation of Discrete Response Models
without Numerical Integration, Econometrica, 57(5), 995-1026.
Merton, R.C. (1969) Lifetime Portfolio Selection Under Uncertaintv: The Continuous-time Case.
Review of E&on& and Statistics, 51, 247-257.
Miller, R. (1984) Job Matching and Occupational Choice, Journal of Political Economy, 92(6), 1086-
1120.
Pakes, A. (1986) Patents as Options: Some Estimates of the Value of Holding European Patent Stocks,
Econometrica, 54, 755-785.
Pakes, A. (1994) Estimation of Dynamic Structural Models: Problems and Prospects Part II: Mixed
Continuous+Discrete Models and Market Interactions, in: C. Sims and J.J. Laffont, eds., Proceedings
of the 6th World Congress of the Econometric Society, Barcelona, Spain. Cambridge University Press.
Pakes, A. and D. Pollard (1989) Simulation and the Asvmptotics of Optimization Estimators. Econo-
metrica, 57(5), 1027-1057.
Phelps, E. (1962) Accumulation of Risky Capital, Econometrica, 30, 729-743.
Pollard, D. (1979) General Chi-Square Goodness of Fit Tests with Data-Dependent Cells, 2. Wahr-
scheinlichkeitstheorie verw. Gebeite, 50, 317-331.
Polland, D. (1984) Convergence of Stochastic Processes. Springer Verlag.
Puterman, M.L. (1990) Markov Decision Processes, in: D.P. Heyman and M.J. Sobel, eds., Handbooks
in Operations Research and Management Science Volume 2. North-HoIIand/Elsevier: Amsterdam.
Rust, J. (1987) Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher,
Econometrica, 55(S), 999-1033.
Rust, J. (1988a) Statistical Models of Discrete Choice Processes, Transportation Research, 22B(2),
125-158.
Rust, J. (1988b) Maximum Likelihood Estimation of Discrete Control Processes, SIAM Journal on
Control and Optimization, 26(5), 1006-1023.
Rust, J. (1989) A Dynamic Programming Model of Retirement Behavior, in: D. Wise, ed., The
Economics of Aging. University of Chicago Press: Chicago. 359-398.
Rust, J. (1992) Do People Behave According to Bellmans Principal of Optimality?, Hoover Institution
Working Paper, E-92-10.

Rust, J. (1993) How Social Security and Medicare Affect Retirement Behavior in a World of Incomplete
Markets, manuscript, University of Wisconsin.
Rust, J. (1994) Estimation of Dynamic Structural Models: Problems and Prospects Part I: Discrete
Decision Processes, in: C. Sims and J.J. Laflont, eds., Proceedings of the 6th World Congress of the
Econometric Society, Barcelona, Spain. Cambridge University Press.
Rust, J. (1995a) Numerical Dynamic Programming in Economics, in H. Amman, D. Kendrick and J.
Rust, eds., Handbook of Computational Economics, North-Holland, forthcoming.
Rust, J. (1995b) Using Randomization to Break the Curse of Dimensionality, manuscript, University of
Wisconsin.
Rust, J. (1996) Stochastic Decision Processes: Theory, Computation, and Estimation, manuscript,
University of Wisconsin.
Samuelson, P.A. (1969) Lifetime Portfolio Selection by Dynamic Stochastic Programming, Review of
Economics and Statistics, 51, 239-246.
Sargent, T.J. (1978) Estimation of Dynamic Labor Demand Schedules Under Rational Expectations,
Journal of Political Economy, 86(6), 1009-1044.
Sargent, T.J. (1981) Interpreting Economic Time Series, Journal of Political Economy, 89(2), 213-248.
Smith, A.A. Jr. (1991) Solving Stochastic Dynamic Programming Problems Using Rules of Thumb,
Discussion Paper 818, Institute for Economic Research, Queens University, Ontario, Canada.
Sonnenschein, H. (1973) Do Walras Law and Continuity Characterize the Class of Community Excess
Demand Functions?, Journal of Economic Theory, 6,345-354.
Stern, S. (1992) A Method for Smoothing Simulated Moments of Discrete Probabilities in Multinomial
Probit Models, Econometrica, 60(4), 943-952.
Stern, S. (1995) Approximate Solutions to Stochastic Dynamic Programming Problems, Econometric
Theory, forthcoming.
Stock, J. and D. Wise (1990) Pensions, The Option Value of Work and Retirement, Econometrica,
58(5), 1151-1180.
Stokey, N.L., R.E. Lucas, Jr. and E.C. Prescott (1989) Recursioe Methods in Economic Dynamics. Harvard
University Press: Cambridge, Massachusetts.
Theil, H. (1971) Principles of Econometrics. Wiley: New York.
van Diik. N.M. (1984) Controlled Markov Processes: Time Discretization. CWI Tract 11. Center for
Mathematics and domputer Science, Amsterdam.
Verim, H. (1982) The Nonparametric Approach to Demand Analysis, Econometrica, 52(3), 945-972.
White, H. (1982) Maximum Likelihood Estimation of Misspecified Models, Econometrica, 50, l-26.
Williams, H.C. (1977) On the Formation of Travel Demand Models and Economic Evaluation of User
Benefit, Environment and Planning, A-9,285-344.
Wolpin, K. (1984) An Estimable Dynamic Stochastic Model of Fertility and Child Mortality, Journal
of Political Economy, 92(5), 852-874.
Wolpin, K. (1987) Estimating a Structural Search Model: The Transition from Schooling to Work,
Econometrica, 55, 801-818.
