
Lecture Notes 2015-2016

Advanced Econometrics I
Chapter 4

Francisco Blasques

These lecture notes contain the material covered in the master course
Advanced Econometrics I. Further study material can be found in the
lecture slides and the many references cited throughout the text.

Contents
4 Nonlinear dynamic probability models: definitions and properties
4.1 Models and DGPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.1 Probability spaces and random variables . . . . . . . . . . . .
4.1.2 What is a probability model? . . . . . . . . . . . . . . . . . .
4.1.3 What is a DGP? . . . . . . . . . . . . . . . . . . . . . . . . .
4.1.4 Correct Specification . . . . . . . . . . . . . . . . . . . . . . .
4.1.5 Generality of linear dynamic models (Wolds representation) .
4.1.6 Nonlinear dynamic models . . . . . . . . . . . . . . . . . . . .
4.2 Examples of nonlinear dynamic models . . . . . . . . . . . . . . . . .
4.2.1 Nonlinear autoregressions . . . . . . . . . . . . . . . . . . . .
4.2.2 Random coefficient autoregressions . . . . . . . . . . . . . . .
4.2.3 Time-varying parameter models: parameter-driven . . . . . .
4.2.4 Time-varying parameter models: observation-driven . . . . . .
4.2.5 Nonlinear dynamic models with exogenous variables . . . . . .
4.3 Stationarity, dependence, ergodicity and moments . . . . . . . . . . .
4.3.1 Strict stationarity and m-dependence . . . . . . . . . . . . . .
4.3.2 Strict stationarity and ergodicity (SE) . . . . . . . . . . . . .
4.3.3 Sufficient conditions for SE . . . . . . . . . . . . . . . . . . . .
4.3.4 Examples and Counter Examples . . . . . . . . . . . . . . . .
4.3.5 Bounded unconditional moments . . . . . . . . . . . . . . . .
4.3.6 Examples and Counter Examples . . . . . . . . . . . . . . . .
4.3.7 Notes for Time-varying Parameter Models . . . . . . . . . . .
4.3.8 Notes for Models with Exogenous Variables . . . . . . . . . .
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


4 Nonlinear dynamic probability models: definitions and properties

4.1 Models and DGPs
Some more reading material:
1. Davidson (1994), Stochastic Limit Theory
Chapter 1.6, 2.3, 3.1 and 7.1
2. Billingsley (1995), Probability and Measure
Chapter 2 and 5
3. White (1996), Estimation, Inference and Specification Analysis
Chapter 2.1, 2.2 and 20
4. Fan and Yao (2005), Nonlinear Time-Series
Chapter 1.3

4.1.1 Probability spaces and random variables

In this section we venture into the world of set theory and measure theory. We will
not dig too deep into this world. Actually, we will barely scratch its surface. Our
objective is just to understand certain basic concepts.
As you may know, it was only in the late 19th and 20th centuries that mathematicians attempted to give proper foundations to mathematics. The famous work of
Gottlob Frege attempted to give proper definitions of numbers, functions and variables.
What is the number 2 after all? We know how to use the number 2. But what is it?
Frege set himself to answer these questions. Unfortunately, just a few days before publishing his monumental work, written over more than 10 years, Frege received a letter
from the great mathematician and philosopher Bertrand Russell pointing out that he
had found a small inconsistency in Frege's argument. This small problem turned
out to completely destroy Frege's work. As a result, Alfred Whitehead and Bertrand
Russell tried themselves to give foundations to mathematics and published their outstanding work of art Principia Mathematica in 3 volumes. In Volume 1 of Principia
Mathematica the authors take 379 pages just to arrive at the proof that 1 + 1 = 2. The
authors planned a fourth volume of Principia Mathematica but could not go further on account of intellectual exhaustion. Finally, in 1931, Kurt Gödel's historic Incompleteness Theorem showed once and for all that any such attempts to give proper
foundations to mathematics would always fail! In effect, he showed that any logical
axiomatic system (like mathematics) will always be incomplete (i.e. contain statements
that cannot be proved) or contradictory (i.e. contain statements that are both true and
false).

The world of set theory and measure theory is a fascinating one indeed! We will
now make use of set theory and measure theory to arrive at a precise definition of
probability model. First we will have to define what a probability space is. Next we
define the concept of random variable. Finally, we turn to probability model.

Definition 1 (Probability space) A probability space is a triplet (E, F, P) where E is the event space, F is a σ-field defined on the event space E, and P is a probability measure defined on the σ-field F.

Please do not be confused by the strange words! The definition of probability space above is simpler than it looks! It just states that a probability space is composed of three things: an event space E, a σ-field F and a probability measure P.

The event space E (also known as sample space) is the collection of all possible outcomes for the random variable. For example, in the case of coin tosses, the event space is the set E = {heads, tails}, or alternatively E = {0, 1}. In the case of tosses of a dice, the event space is E = {1, 2, 3, 4, 5, 6}. In the case of a continuous random variable like the Gaussian innovations ε_t ∼ N(0, σ²), the event space is the entire real line E = R.

The probability measure P defines essentially the probability associated to each event as well as to collections of events of E. All the relevant collections of events of E are precisely contained in the σ-field F. Hence, the probability measure P : F → [0, 1] maps elements of F to the interval [0, 1]; i.e. it gives us the probability of each element of F.

For example, a σ-field (also known as σ-algebra) F of the event space E = {heads, tails} may be given by

F := { ∅, {heads}, {tails}, {heads, tails} }.

Note that the σ-algebra F contains the empty set ∅, the full set E = {heads, tails}, as well as each individual element of E. Hence, the probability measure P must define the probability of nothing happening, P(∅), the probability of drawing heads, P(heads), the probability of drawing tails, P(tails), and the probability of drawing either heads or tails. Since we are talking about a coin toss, the most natural probabilities would be the following:

P(∅) = 0 (something always happens!);
P(heads) = π (for some π ∈ [0, 1]);
P(tails) = 1 − π;
P({heads, tails}) = 1 (either heads or tails must be drawn).
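For concreteness, this little probability space can be written out explicitly in a few lines of Python. The sketch below is purely illustrative; the value assigned to pi_heads is an arbitrary choice for the parameter π.

    # Illustrative sketch: the sigma-field and probability measure of a single coin toss.
    pi_heads = 0.5  # the parameter "pi" of the text (arbitrary illustrative value)

    # The sigma-field F: empty set, the two singletons, and the full event space E.
    F = [frozenset(), frozenset({"heads"}), frozenset({"tails"}), frozenset({"heads", "tails"})]

    # The probability measure P maps every element of F into [0, 1].
    P = {
        frozenset(): 0.0,                           # something always happens
        frozenset({"heads"}): pi_heads,
        frozenset({"tails"}): 1.0 - pi_heads,
        frozenset({"heads", "tails"}): 1.0,         # either heads or tails is drawn
    }

    # Check the defining properties of a probability measure on this finite example.
    assert all(P[A] >= 0.0 for A in F)              # non-negativity
    assert P[frozenset({"heads", "tails"})] == 1.0  # P(E) = 1
    assert P[frozenset({"heads"})] + P[frozenset({"tails"})] == P[frozenset({"heads", "tails"})]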


In general, there are certain rules that must be followed for constructing a σ-algebra F. We will not analyze such rules in detail here. The most important thing to keep in mind is that the σ-algebra is needed to make sure that we do not run into problems. The most famous problem is known as the Banach-Tarski paradox!

Banach and Tarski proved a theorem in 1924 showing that you can take a ball, cut it
into pieces and then re-arrange those pieces in such a manner as to obtain two balls
of the exact same size, that have no parts missing! This unbelievable result warns us
about the mathematical problems that may occur if we are not careful enough when
measuring things! The use of the -algebra solves this problem.

Definition 2 (σ-field) A σ-field F of a set E is a collection of subsets of E satisfying:

(i) E ∈ F.
(ii) If F ∈ F, then F^c ∈ F.¹
(iii) If {F_n}_{n∈N} is a sequence of sets in F, then ∪_{n=1}^{∞} F_n ∈ F.

Since the σ-algebra is important to avoid problems, we will often work with measurable spaces.

Definition 3 (Measurable space) A measurable space is just a pair (E, F) composed of an event space E and respective σ-algebra F.

Definition 4 (Probability measure) A probability measure P defined on a measurable space (E, F) is a function P : F → [0, 1] satisfying:

(i) P(F) ≥ 0 ∀ F ∈ F.
(ii) P(E) = 1.
(iii) If {F_n}_{n∈N} is a collection of pairwise disjoint sets in F, then P(∪_{n=1}^{∞} F_n) = Σ_{n=1}^{∞} P(F_n).

Now that we know what a probability space is, we can finally define the concept of random variable! As we shall see, a random variable is essentially a measurable function (or measurable map).

Definition 5 (Measurable function) Given two measurable spaces (A, F_A) and (B, F_B), a function f : A → B is said to be measurable if f is such that every element b ∈ F_B satisfies f^{-1}(b) ∈ F_A; i.e. the inverse image of each element of F_B is in F_A.
Note that the inverse map f^{-1} in the definition above always exists! You may recall from your introductory mathematics courses that some functions are not invertible. However, that only means that the inverse map f^{-1} is not a function. The inverse map always exists; it may just not be a function. Do you remember the properties of a function?

Definition 6 (Function) Given two sets A and B, and a map f : A → B, we say that f is a function if and only if each element a of A is associated to a unique element b of B.

¹ F^c denotes the complement of F in E.

We are finally ready to give a proper definition of random variable. In this definition, the random variable emerges as a real-valued measurable function that maps each event in E to a number in R. Depending on the event that occurs, we obtain a different number in R for our random variable!

Definition 7 (Random variable) Given a probability space (E, F, P) and the measurable space (R, F_R), a random variable x_t is a measurable map x_t : E → R that maps elements of E to the real numbers R; i.e. the inverse map x_t^{-1} : F_R → F satisfies x_t^{-1}(r) ∈ F for every element r ∈ F_R.²

This definition of random variable is actually quite intuitive! The requirement that the function be measurable just means that we can assign probabilities to each value of the random variable x_t. Indeed, we can define a probability measure P_R on F_R by assigning to each element r in F_R the probability P_R(r) = P(x_t^{-1}(r)). In essence, the random variable induces a probability measure P_R on F_R. We thus obtain a probability space (R, F_R, P_R).

From the probability measure P_R we can now define the cumulative distribution function F that you know so well as

F(a) = P_R(x ≤ a)  ∀ a ∈ R.
This notion of random variable generalizes easily to random vectors and other
random elements such as random functions.
Definition 8 (Random vector) Given a probability space (E, F, P) and a measurable space (R^n, F_{R^n}) with n ∈ N, an n-variate random vector x_t is a measurable map x_t : E → R^n that maps elements of E to R^n.

Now that we have defined a random variable x_t as a function from the event space E to the reals R, we can easily distinguish the random variable from its realizations. In particular, note that while x_t is a random variable, x_t(e) ∈ R is the actual realization of the random variable produced by event e ∈ E. A different event e′ ∈ E may produce a different realization x_t(e′) ≠ x_t(e).

The definition of random variable is also interesting for another reason: whether something is a random variable or not depends on the σ-algebra that one is using! As an example, consider the case where x_t is a normal random variable, x_t ∼ N(0, σ²). Is x_t² also a random variable? How about exp(x_t)? Or log(x_t)?

Below we define the Borel σ-algebra, introduced by one of the pioneers of measure theory and probability theory, the French mathematician Émile Borel.
² Please do not be fooled by the notation x_t^{-1} into thinking that x_t^{-1} maps from R to E, i.e. that x_t^{-1} : R → E. It does not! First, since the inverse f^{-1} is not necessarily a function, it will naturally map elements of R to subsets of E like those in F. Second, note that, given any function g : A → B, we can always define a mapping g* from subsets A* of A to subsets of B satisfying g*(A*) = {g(a), a ∈ A* ⊆ A}.
Definition 9 (Borel σ-algebra) Given a set A, the Borel σ-algebra B_A is the smallest σ-algebra containing all open sets of A.³
Luckily, you do not have to feel intimidated by the definition above. In fact, you don't even have to understand it! The only thing you have to know is its practical consequence: all continuous functions are measurable under the Borel σ-algebra. Indeed, the Borel σ-algebra is famous because it ensures that any continuous function f is measurable. As such, any continuous transformation f(x_t) of a random variable x_t is also a random variable.

The fact that continuous functions are measurable under the Borel σ-algebra is trivial once we give an appropriate definition of continuous function! This definition uses the notion of topological space. Much like a measurable space, a topological space is just a pair (A, T_A) where A is a set and T_A is a collection of subsets of A. This collection, when constructed in a certain way, is called a topology.⁴ Most importantly, the elements of T_A are precisely what we call the open sets.
Definition 10 (Continuous function) Let (A, T_A) and (B, T_B) be topological spaces. A function f : A → B is said to be continuous if its inverse f^{-1} maps open sets to open sets; i.e. if for every b ∈ T_B we have f^{-1}(b) ∈ T_A.

The definition above clarifies everything! If the Borel σ-algebra is made of open sets, then every continuous function is measurable because f^{-1} maps elements of one Borel σ-algebra to another! In other words, given two measurable spaces (A, B_A) and (B, B_B), a continuous function f : A → B will be measurable because for every b ∈ B_B we will have f^{-1}(b) ∈ B_A.

Definition 11 (Random element) Given a probability space (E, F, P) and the measurable space (A, F_A), a random element a_t taking values in A is a measurable map a_t : E → A that maps elements of E to A.

The following is an example of random element that we will often use. In this example C(R) denotes the set of all continuous functions defined on R. Each element of C(R) is thus a continuous function.

Example 1 (Random continuous function) Given a probability space (E, F, P) and the measurable space (C(R), F_{C(R)}), a random continuous function f taking values in C(R) is a measurable map f : E → C(R) that maps elements of E to C(R).

Note that while f is a random element taking values in C(R), the realization f(e), e ∈ E, is an actual continuous function f(e) ∈ C(R).
³ The collection of all open sets of a set A is known as the topology of A. Hence, you might come across some references that define the Borel σ-algebra B_A as the σ-algebra generated by the topology of A.

⁴ A collection T_A of subsets of A is called a topology on A when both the empty set and A are elements of T_A, any union of elements of T_A is an element of T_A, and any intersection of finitely many elements of T_A is an element of T_A. Do not memorize this! This is not important!

4.1.2 What is a probability model?

The notion of probability model is often an elusive one. What exactly is a model? To answer this question, consider first the very simple Bernoulli model for coin tosses. In particular, suppose that we want to model T tosses of a coin that may, or may not, be fair. Then, it is reasonable to suppose that the T observed outcomes x_1, ..., x_T are realizations of T Bernoulli random variables, x_t ∼ Bern(π), with unknown probability parameter π ∈ [0, 1]. Note that for each π we have defined a certain probability distribution for the random vector (x_1, ..., x_T) taking values in R^T (where R^T = ∏_{t=1}^{T} R denotes the Cartesian product of T copies of R). Indeed, our model consists of a collection of probability distributions on R^T. For each π you have a different distribution! Since each distribution in this collection of distributions is identified by the parameter π ∈ [0, 1], we call this model a parametric model.

If you think carefully about it, this definition of model is the one you have been using all along. Even if you did not realize it!

For example, in the Gaussian linear AR(1) model,

x_t = α + β x_{t−1} + ε_t,  ε_t ∼ N(0, σ²),  t ∈ Z,

each value of the parameter vector (α, β, σ²) defines a unique distribution for the time-series {x_t}_{t∈Z}. Hence, this AR(1) model is also a collection of distributions. Furthermore, it is a parametric model because each distribution is indexed by the parameter vector (α, β, σ²). In the case of a time-series model, since we have an infinite sequence of random variables {x_t}_{t∈Z}, the model is actually a collection of probability measures on the σ-algebra of R^∞ (the Cartesian product of infinitely many copies of R).

Definition 12 (Probability model) Given a measurable space (E, F) and a parameter space Θ, a probability model is a collection P_Θ := {P_θ, θ ∈ Θ} of probability measures defined on F.

Definition 13 (Time-series probability model) Given the measurable space (R^∞, F_{R^∞}) and a parameter space Θ, a probability model is a collection P_Θ := {P_θ, θ ∈ Θ} of probability measures defined on F_{R^∞}.
Since a time-series is an infinite random sequence {x_t}_{t∈Z}, a realization of a time-series is an infinite sequence of points {x_t(e)}_{t∈Z}. Alternatively, you can see the time-series as a random element taking values in R^∞, and a realization of the time-series as a point in R^∞. The term stochastic process is usually reserved for the continuous-time series that are obtained by connecting the dots of the realized time-series (see Figure 1). A stochastic process X is typically seen as a random element x : E → C(R) taking values in the space of continuous functions C(R), and every realization of the stochastic process is then a specific continuous function, i.e. x(e) ∈ C(R) and x(e, t) ∈ R ∀ (e, t) ∈ E × R. In this course we will only focus on random sequences! Even when we refer to a stochastic process, we will have in mind a random sequence.
Figure 1: Comparing the realization of a random sequence (left) with the realization of a stochastic process (right).


In this course we will only deal with parametric models. Other model classes, like non-parametric, semi-parametric and semi-nonparametric models, will not be discussed. In order to help clarify much of the confusion that one observes on a daily basis in the minds of so many graduate students, let us use our recent knowledge to find an appropriate definition of parametric model.

Recall from your introductory mathematics courses that the dimension of a space is the number of basis vectors that span the space. For example, R² is a two-dimensional space because it is spanned by the vectors (0, 1) and (1, 0). In other words, every element of R² can be obtained as a linear combination of these two vectors. Similarly, R^n is n-dimensional because n vectors are required to span R^n. The space R^∞ is infinite-dimensional because it is spanned by infinitely many basis vectors. The space of all linear functions defined on R, which we denote L(R), is just two-dimensional, because the space is spanned by the vectors 1 and x. Indeed, as we know well, each linear function f ∈ L(R) can be written as f(x) = a + bx, which is just a linear combination of 1 and x. The space of quadratic functions f(x) = a + bx + cx² is three-dimensional as it is spanned by the vectors {1, x, x²}. Similarly, the space of polynomials of order n is (n + 1)-dimensional. The space of continuous functions C(R) is infinite-dimensional.

Definition 14 (Parametric and nonparametric models) A probability model P_Θ := {P_θ, θ ∈ Θ} is said to be parametric if the parameter space Θ is finite-dimensional, and it is said to be nonparametric if the parameter space Θ is infinite-dimensional.

Since we defined a probability model as a collection of probability measures, it is important to note that certain models may contain others as a special case. In econometrics we usually talk about models that nest others.

Definition 15 (Nested model) Given a measurable space (E, F) and two parametric models P_Θ := {P_θ, θ ∈ Θ} and P*_{Θ*} := {P*_{θ*}, θ* ∈ Θ*}, we say that model P_Θ nests model P*_{Θ*} if and only if P*_{Θ*} ⊆ P_Θ.

At this point it may be useful to clarify the following. The Gaussian linear AR(1)
model above is a probability model because the innovations are random variables.
If the innovations are not recognized as being random variables, then it is not a
probability model! Of course, the distribution of the innovations could belong to
some family other than the normal. For example, we could have specified that the innovations belong to the Student's t family, where each distribution t(λ) is indexed by the degrees of freedom parameter λ. The innovations could be uniformly distributed, ε_t ∼ u(a, b), with each distribution u(a, b) parameterized by the end points of the distribution, a and b. More generally, we may not even specify a parametric family of distributions! For example, we may just state that the innovations {ε_t}_{t∈Z} are white noise (i.e. they have mean zero, are uncorrelated and have constant variance) as in Section 2.2:

x_t = β x_{t−1} + ε_t,  {ε_t}_{t∈Z} ∼ WN(0, σ²),  t ∈ Z.
Now that we know what a probability model is, it is important that we ask ourselves the question: why? Why should we work with probability models? A complete answer to this question would take us to the very foundations of econometrics itself! Below, you can find a very brief review of the history behind the probability approach in econometrics. An excellent reference is Haavelmo's 1944 paper entitled The Probability Approach in Econometrics. Haavelmo's paper is the longest paper ever published in Econometrica and the only one to occupy an entire issue of the journal!

In 1936, the Dutch econometrician Jan Tinbergen published a book describing the relations between economic variables using hundreds of dynamic regressions.⁵ In this way, The Netherlands became the first country for which a macroeconometric model was estimated. Before Tinbergen's model, econometrics was essentially descriptive. In the words of Robert Solow, the work of Jan Tinbergen was "a major force in the transformation of economics from a discursive discipline into a model-building discipline".

In 1939, however, the world-famous economist John Maynard Keynes published an extremely insightful review in which he praised Jan Tinbergen personally but criticized his work.⁶ Keynes was essentially concerned with the fact that the model could not be proved wrong by the data. Consider a regression of consumption C_t on output Y_t given by C_t = α + β Y_t + ε_t. You may notice that any given observation (c_t, y_t) will always fit this regression because the error term can account for anything! In other words, if anyone complains that this model is not accurate because the observation (c_t, y_t) is such that c_t ≠ α + β y_t, we can simply reply: "Well, I told you there would be an error!".
The fact that the model can never be rejected by the data is a real problem: the model becomes totally useless! The solution to this problem was finally stated in Haavelmo's 1944 masterpiece The Probability Approach in Econometrics. Haavelmo argued that regression models should be embedded in a probabilistic structure that describes the distribution of the data. For example, if a probability distribution is assigned to the error term in the regression C_t = α + β Y_t + ε_t, then we can reject the model on the basis that the observations are too unlikely! In particular, we can discuss how likely the error associated with each observation (c_t, y_t) is. If the observation (c_t, y_t) fits well with the probability model, then it constitutes evidence in favor of the model. Otherwise it constitutes evidence against the model.

⁵ Jan Tinbergen was awarded the Nobel Prize for this outstanding contribution to economics.

⁶ John Maynard Keynes, the guardian of Isaac Newton's papers, was not only a brilliant economist, he was a master in various other fields such as logic and mathematics. Bertrand Russell himself (the famous mathematician and co-author of Principia Mathematica) wrote the following about Keynes: "Keynes's intellect was the sharpest and clearest that I have ever known. When I argued with him, I felt that I took my life in my hands, and I seldom emerged without feeling something of a fool." In fact, Keynes's intellect conquered even his life-long opponents like Friedrich Hayek, who described Keynes as follows: "He was the one really great man I ever knew, and for whom I had unbounded admiration. The world will be a very much poorer place without him."

4.1.3 What is a DGP?

In econometrics we refer to the data generating process (DGP) as the place from where the data comes! In other words, the DGP is the unknown mechanism that
generates the data. From an economic perspective, this mechanism is most likely
a very complex one, involving thousands (if not infinitely many) agents, factors,
variables, decisions, etc. Luckily, whatever that mechanism might be in economic
terms, in mathematical terms at least, it is a quite simple animal: the DGP is a given
probability measure. This probability measure is the end result of the immensely
complex workings of the economy.
Definition 16 (Data generating process) Given a measurable space (E, F), a data generating process is a probability measure P_0 defined on F.

In a time-series perspective, the DGP is a probability measure P_0 that defines all the interconnections and dependencies between the infinitely many random variables in the random sequence {x_t}_{t∈Z}.

Definition 17 (Time-series data generating process) Given the measurable space (R^∞, F_{R^∞}), a data generating process is a probability measure P_0 defined on F_{R^∞}.
4.1.4 Correct Specification

We say that a model is correctly specified when the DGP is an element of the model. Otherwise the model is said to be misspecified.

Definition 18 (Correctly specified model) A model P_Θ := {P_θ, θ ∈ Θ} is said to be correctly specified if the data generating process P_0 is an element of the model P_Θ; i.e. if there exists a θ_0 ∈ Θ such that P_{θ_0} = P_0. When this parameter θ_0 exists, it is called the true parameter.

In shorter notation, we will often say that the model is correctly specified if

∃ θ_0 ∈ Θ : P_{θ_0} = P_0,

which means that P_0 ∈ P_Θ. We often refer to a correctly specified model as a well specified model.

Definition 19 (Mis-specified model) A model P_Θ := {P_θ, θ ∈ Θ} is said to be mis-specified (or incorrectly specified) if the data generating process P_0 is not an element of the model P_Θ; i.e. if P_θ ≠ P_0 for all θ ∈ Θ.

Consider again the linear Gaussian AR(1) model

x_t = α + β x_{t−1} + ε_t,  ε_t ∼ N(0, σ²),  t ∈ Z.  (1)

This is a model because the parameter vector θ := (α, β, σ²) is allowed to take several values in the parameter space Θ ⊆ R³, and hence (1) specifies a collection of probability measures on the σ-algebra of R^∞. Given a time series {x_t}_{t∈Z} with probability measure P_0, we say that the model implicitly defined by (1) is well specified if there exists a θ_0 ∈ Θ such that (1) generates a random sequence with measure P_{θ_0} = P_0.
4.1.5 Generality of linear dynamic models (Wold's representation)

Now that we know what a correctly specified model is, we can ask ourselves: how general are the linear dynamic models? Is there any chance that these models are correctly specified? A first answer comes in the form of the much celebrated Wold's decomposition theorem, stated by Wold in 1938. We review Wold's theorem in this section and discuss its strengths and limitations.
We recall first the definitions of weakly stationary process, white noise, linear
stochastic process and linear model.
Definition 20 (Weak/Covariance Stationarity) A time series {x_t} is said to be weakly stationary, or covariance stationary, if the mean μ_t = E(x_t) and autocovariance function γ_t(h) = Cov(x_t, x_{t−h}) are constant in time: μ_t = μ and γ_t(h) = γ(h) ∀ (t, h).
Weak stationarity implies that the mean, variance and autocovariances do not
change over time. Unlike strict stationarity (see Definition 3 in Section 2.2), the
notion of weak stationarity requires the existence of moments of second order (mean,
variance and autocovariances) but allows all higher-order moments (e.g. skewness and
kurtosis) to be time-varying.
A simple example of a weakly stationary sequence is the so-called white noise.
Definition 21 (White Noise) A random sequence {x_t}_{t∈Z} is said to be a white noise process, denoted {x_t}_{t∈Z} ∼ WN(0, σ²), if it is a sequence of uncorrelated random variables, Cov(x_t, x_{t−h}) = 0 ∀ t and every h ≠ 0, with zero mean E(x_t) = 0 ∀ t and constant variance Var(x_t) = σ² ∀ t.
Note that the definition of white noise is silent about higher-order moments. Just
as in the definition of weak stationarity, the higher-order moments are allowed to vary
in time.
Finally, we turn to the definition of linear process, linear model, and Wold's theorem.

Definition 22 (Linear process) A time-series {x_t} is said to be a linear stochastic process if it can be represented as

x_t = Σ_{j=−∞}^{∞} ψ_j z_{t−j},  t ∈ Z,

where {z_t} ∼ WN(0, σ²) and {ψ_j}_{j=−∞}^{∞} is a sequence of constants with Σ_{j=−∞}^{∞} |ψ_j| < ∞.

Definition 23 (Linear time-series model) A time-series model P_Θ := {P_θ, θ ∈ Θ} is said to be linear if every measure P_θ defines a stochastic process that is linear.

Theorem 1 (Wold's representation theorem) Let {x_t} be a weakly stationary process. Then it admits the following representation,

x_t = Σ_{j=0}^{∞} ψ_j z_{t−j} + v_t,

where
a. ψ_0 = 1 and Σ_{j=0}^{∞} ψ_j² < ∞,
b. {z_t}_{t∈Z} ∼ WN(0, σ²),
c. {v_t}_{t∈Z} is deterministic (non-random).
Wold's representation theorem is an important landmark in time-series analysis. It states essentially that any weakly stationary process can be represented as something similar to a linear process. As a result, Wold's theorem has been used as an important justification for the adoption of linear dynamic processes for modeling weakly stationary sequences. A good example is the Gaussian autoregressive moving-average (ARMA) model that you surely studied in your introductory econometrics courses.

Definition 24 (ARMA model) A time-series {x_t} is said to be generated by an autoregressive moving-average (ARMA) model of order (p,q), denoted ARMA(p,q), if and only if

x_t = φ_1 x_{t−1} + ... + φ_p x_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q},  t ∈ Z,

where {ε_t}_{t∈Z} ∼ NID(0, σ²) and φ_1, ..., φ_p, θ_1, ..., θ_q are parameters.


ARMA models are very important models in econometrics. They are quite flexible and constitute an important tool in the arsenal of any econometrician! As you may recall, the ARMA model admits a linear process representation if the autoregressive polynomial φ(L) = 1 − φ_1 L − φ_2 L² − ... − φ_p L^p is invertible, because then it can be re-written in the infinite MA representation

x_t = Σ_{j=0}^{∞} ψ_j ε_{t−j}   with ψ_0 = 1 and Σ_{j=0}^{∞} |ψ_j| < ∞.

For example, the AR(1) model admits an MA(∞) representation when |φ| < 1:

x_t = φ x_{t−1} + ε_t   ⟺   x_t = Σ_{j=0}^{∞} ψ_j ε_{t−j}   with ψ_j = φ^j ∀ j.
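This equivalence is easy to check numerically. The following sketch uses the illustrative values φ = 0.7 and Gaussian white noise, and compares an AR(1) path with its truncated MA(∞) representation; the truncation lag J is an arbitrary choice.

    # Sketch: compare an AR(1) path with its truncated MA(infinity) representation.
    import numpy as np

    rng = np.random.default_rng(0)
    phi, T, J = 0.7, 500, 200          # AR coefficient, sample size, truncation lag
    eps = rng.standard_normal(T + J)   # white noise innovations

    # AR(1) recursion: x_t = phi * x_{t-1} + eps_t
    x_ar = np.zeros(T + J)
    for t in range(1, T + J):
        x_ar[t] = phi * x_ar[t - 1] + eps[t]

    # Truncated MA representation: x_t ~ sum_{j=0}^{J-1} phi^j * eps_{t-j}
    psi = phi ** np.arange(J)          # psi_j = phi^j, absolutely summable since |phi| < 1
    x_ma = np.array([psi @ eps[t - np.arange(J)] for t in range(J, T + J)])

    print("sum |psi_j|     :", np.abs(psi).sum())                    # close to 1/(1-phi)
    print("max discrepancy :", np.max(np.abs(x_ar[J:] - x_ma)))      # small truncation error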

At this point, it may seem that ARMA models can describe any weakly stationary process. Indeed, Wold's representation theorem, together with the MA(∞) representation of ARMA processes, seems to imply that any weakly stationary process can be written in ARMA form. This is not true! In particular, it is crucial to note the following practical limitations of Wold's representation theorem and ARMA models:

1. In Wold's theorem, the stochastic component Σ_{j=0}^{∞} ψ_j z_{t−j} is not necessarily a linear process. Hence, more general models may have to be considered:
As you can see in Definition 22, linear processes have coefficients {ψ_j} that are absolutely summable, Σ |ψ_j| < ∞, whereas Wold's representation theorem features coefficients {ψ_j} that are only square summable, Σ ψ_j² < ∞. Since absolute summability implies square summability, it follows that linear models are too restrictive for Wold's representation theorem.⁷

⁷ It is easy to show that absolute summability implies square summability by noting that: if {ψ_j} is absolutely summable, then there exists an n ∈ N such that |ψ_j| < 1 ∀ j ≥ n, and hence ψ_j² ≤ |ψ_j| ∀ j ≥ n. As a result, Σ_{j=0}^{∞} ψ_j² = Σ_{j=0}^{n−1} ψ_j² + Σ_{j=n}^{∞} ψ_j² ≤ Σ_{j=0}^{n−1} ψ_j² + Σ_{j=n}^{∞} |ψ_j| < ∞.

2. Even if Wold's theorem featured a linear stochastic component {z_t}, the representation would still involve infinitely many parameters {ψ_j} that cannot be estimated from a finite sample of data:
ARMA models solve this problem by reducing the infinite set of MA coefficients {ψ_j} to just a few AR and MA parameters. However, as a result, they can only define linear processes with a restrictive sequence of parameters {ψ_j}. For example, we saw above that the linear AR(1) can only define a sequence {ψ_j} with a decay of the type ψ_j = φ^j ∀ j.

3. Even if the stochastic component of Wold's theorem were linear and well approximated by an ARMA model, the representation would still contain a deterministic term {v_t}_{t∈Z} which is unknown and potentially very complex:
We know how to estimate simple deterministic components such as polynomial trends (e.g. linear trends in t), seasonal waves (e.g. sin(ωt)), break dummies (e.g. 1{t ≥ 17}), etc. However, {v_t} might be extremely complex and have no structure.

4. Finally, even if we could somehow ignore points 1, 2 and 3 above (i.e. if Wold's stochastic component were linear, well approximated by an ARMA, and the deterministic component were simple and easy to estimate), we would still be faced with the problem that the distribution of the white noise sequence is unknown and possibly too complex to be tractable:
Wold's representation theorem leaves the joint distribution of the innovations unspecified. It just tells us that it belongs to a class of distributions with zero mean, fixed variance and zero covariances (white noise). The distribution of the error term may be extremely complex since dependence in higher-order moments is allowed.

The points above might seem to cast an overly negative picture on the use of linear models. However, they should not be interpreted in that way! The objective was simply to enumerate carefully the practical limitations of working with ARMA models and the extent to which it is reasonable to appeal to Wold's representation theorem as a guide for linear models.

At the end of the day, the question of which model to use should always be left to the data to answer. As we shall see, in many cases, ARMA models actually do well in describing economic data! Model specification tests may show that the ARMA model provides a good description of the data, and it may happen that nonlinear models do not provide better results. In other cases, nonlinear models are clearly needed (e.g. for modeling time-varying volatility in financial applications). In a considerable number of cases there is at least room for improvement by adopting nonlinear models.
4.1.6 Nonlinear dynamic models

Nonlinear dynamic models belong to a much more general class of models than linear ones and can generally nest linear models like the ARMA. As the name itself suggests, a model is nonlinear if it is not a linear model. One should be careful, however, not to infer from this statement that a nonlinear model cannot contain linear models as a special case.

Definition 25 (Nonlinear time-series model) A time-series model P_Θ := {P_θ, θ ∈ Θ} is said to be nonlinear if at least some measure P_θ ∈ P_Θ defines a stochastic process that is not linear.⁸

⁸ Most commonly, a model is said to be nonlinear if it seems that, for some θ ∈ Θ, the probability measure P_θ defines a process that is not linear. The emphasis on the word "seems" is due to the fact that we may just lack a proof of linearity; i.e. maybe no one has yet been able to prove that the process defined by P_θ can actually be represented as a weighted infinite sum of a white noise sequence with absolutely summable coefficients.


Consider again the SESTAR model described in Section 3.1.2,

x_t = δ x_{t−1} / (1 + exp(γ + β x_{t−1})) + ε_t,  t ∈ Z,

where δ, γ and β are unknown parameters and {ε_t}_{t∈Z} is a Gaussian iid sequence with ε_t ∼ N(0, σ²) for every t. This model is a nonlinear model as long as the parameter space allows for values (γ, β) ≠ (0, 0). However, this nonlinear model nests the linear model that is obtained by setting (γ, β) = (0, 0) and |δ| < 2.⁹

⁹ Strictly speaking, even the famous Gaussian random walk discussed in Section 3.2 is not a linear process. Indeed, since the random walk takes the form x_t = x_{t−1} + ε_t ⟺ x_t = Σ_{j=0}^{∞} ψ_j ε_{t−j} with ψ_j = 1 ∀ j, the MA coefficients {ψ_j} are not absolutely summable.
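The nesting is easy to verify by simulation. The sketch below uses the SESTAR recursion as written above with purely illustrative parameter values: setting (γ, β) = (0, 0) turns the time-varying coefficient into the constant δ/2, i.e. a linear AR(1).

    # Sketch: the SESTAR recursion and its nested linear AR(1) special case.
    import numpy as np

    def simulate_sestar(delta, gamma, beta, sigma, T, seed=0):
        """Simulate x_t = delta * x_{t-1} / (1 + exp(gamma + beta * x_{t-1})) + eps_t."""
        rng = np.random.default_rng(seed)
        x = np.zeros(T)
        eps = sigma * rng.standard_normal(T)
        for t in range(1, T):
            x[t] = delta * x[t - 1] / (1.0 + np.exp(gamma + beta * x[t - 1])) + eps[t]
        return x

    # Nonlinear case: (gamma, beta) != (0, 0).
    x_nonlinear = simulate_sestar(delta=1.5, gamma=0.0, beta=5.0, sigma=1.0, T=1000)

    # Linear special case: gamma = beta = 0 gives x_t = (delta / 2) * x_{t-1} + eps_t, stable for |delta| < 2.
    x_linear = simulate_sestar(delta=1.5, gamma=0.0, beta=0.0, sigma=1.0, T=1000)
    print(x_nonlinear[:5], x_linear[:5])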

4.2 Examples of nonlinear dynamic models


Some more reading material:
1. Fan and Yao (2005), Nonlinear Time-Series
Chapter 1.4, 1.5 and 4
2. Granger and Terasvirta (1993), Modeling Nonlinear Economic Relationships
Chapter 2.1, 3.1 and 7.3

4.2.1 Nonlinear autoregressions

The linear AR(1) model can be easily extended to a nonlinear setting by considering the general nonlinear autoregressive (GNLAR) model,

x_t = f(x_{t−1}, ε_t; θ),  t ∈ Z,  (2)

where f is some nonlinear parametric function parameterized by the vector θ, and {ε_t}_{t∈Z} are innovations with some assigned distribution.

Just like the AR model, the GNLAR can be specified with several lags of x_t. The GNLAR(p) model takes the form

x_t = f(x_{t−1}, ..., x_{t−p}, ε_t; θ),  t ∈ Z.

More generally, a general nonlinear autoregressive moving-average GNLARMA(p,q) model takes the form

x_t = f(x_{t−1}, ..., x_{t−p}, ε_t, ..., ε_{t−q}; θ),  t ∈ Z.

A popular special case of the GNLAR(1) model is the nonlinear autoregressive (NLAR) model with additive innovations

x_t = f(x_{t−1}; θ) + ε_t,  t ∈ Z.  (3)

NLAR(1) models are often written in the equivalent form

x_t = g(x_{t−1}; θ) x_{t−1} + ε_t,  t ∈ Z,

by defining g(x_{t−1}; θ) := f(x_{t−1}; θ)/x_{t−1}.


Example: (STAR model) A famous NLAR(1) model is the smooth transition autoregressive (STAR) model,

x_t = g(z_{t−1}; θ) x_{t−1} + ε_t,  t ∈ Z,

where {ε_t} are innovations with some specified distribution and

g(z_{t−1}; θ) := α + δ / (1 + exp(γ + β z_{t−1})),  t ∈ Z.

The STAR(1) model has many variants depending on the nature of the dependence driver z_{t−1}. When the driver {z_t} is an exogenous process that is independent of {x_t}, then we call this model the exogenous STAR model. When the driver {z_t} is the lagged dependent variable {x_{t−1}}, then we call this model the self-excited STAR (SESTAR) model. Two important SESTAR examples are: (i) the logistic SESTAR model, and (ii) the exponential SESTAR model.

Example: (Logistic SESTAR model) The logistic SESTAR presented in Section 3.1.2 is obtained when z_{t−1} = x_{t−1}:

g(z_{t−1}; θ) := α + δ / (1 + exp(γ + β x_{t−1})),  t ∈ Z.

The logistic SESTAR model allows us to model changes in the dependence of the time-series {x_t}_{t∈Z} that are related to past realizations of the time-series itself. For example, many macroeconomic variables feature higher temporal dependence in recessions (i.e. when x_{t−1} < 0) than in expansions (i.e. when x_{t−1} > 0).
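A quick way to see this asymmetry is to evaluate the logistic weight g at a few points using the parameter values reported in Figure 2 below; the function name g_logistic is just an illustrative label.

    # Sketch: the logistic SESTAR weight g(x; theta) evaluated at the parameter values of Figure 2.
    import numpy as np

    def g_logistic(x, alpha, delta, gamma, beta):
        """g(x; theta) = alpha + delta / (1 + exp(gamma + beta * x))."""
        return alpha + delta / (1.0 + np.exp(gamma + beta * x))

    alpha, delta, gamma, beta = 0.5, 0.4, 0.0, 15.0   # values used in Figure 2
    for x in (-0.5, -0.1, 0.0, 0.1, 0.5):
        print(f"x_(t-1) = {x:+.1f}  ->  g = {g_logistic(x, alpha, delta, gamma, beta):.3f}")
    # Negative x_(t-1) (recessions) yield g close to alpha + delta = 0.9;
    # positive x_(t-1) (expansions) yield g close to alpha = 0.5.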

Figure 2: Plot of g(x_{t−1}; θ) for (α, δ, γ, β) = (0.5, 0.4, 0, 15). Temporal dependence is low when x_{t−1} is positive.

Example: (Exponential SESTAR model) The exponential SESTAR is obtained when z_{t−1} = (x_{t−1} − μ)²:

g(z_{t−1}; θ) := α + δ / (1 + exp(γ + β (x_{t−1} − μ)²)),  t ∈ Z.

The exponential SESTAR model is useful, for example, in modeling the temporal dependence of real exchange rates. Indeed, real exchange rates tend to have high dependence (no mean reversion) when rates are close to 1 and lower dependence (mean-reverting behavior) when rates are far away from 1.
Figure 3: Real exchange rate of EU 15 vs Danish Kroner (left) and data simulated from an exponential SESTAR model (right).

Figure 4: Plot of g(x_{t−1}; θ) for (α, δ, γ, β, μ) = (0.1, 0.8, 0, 50, 1). Temporal dependence is high when x_{t−1} is close to 1.

4.2.2 Random coefficient autoregressions

Another popular type of nonlinear dynamic model is the random coefficient autoregressive (RCAR) model

x_t = φ_{t−1} x_{t−1} + ε_t,  t ∈ Z,

where both {φ_t}_{t∈Z} and {ε_t}_{t∈Z} are exogenous iid sequences with a certain distribution. This simple model, first proposed by Quinn (1980), has several important applications in finance and biology as it allows for a time-varying conditional mean and variance. Suppose, for example, that

φ_{t−1} ∼ N(μ_φ, σ_φ²)  and  ε_t ∼ N(0, σ_ε²),  ∀ t ∈ Z.

Then the distribution of x_t conditional on x_{t−1} is given by

x_t | x_{t−1} ∼ N( μ_φ x_{t−1}, σ_φ² x_{t−1}² + σ_ε² ).

This model is able to explain temporal dynamics in both the conditional mean and variance of {x_t}_{t∈Z}.
As we have seen above, the NLAR(1) model

x_t = f(x_{t−1}; θ) + ε_t

can be re-written as a time-varying parameter model

x_t = φ_{t−1} x_{t−1} + ε_t

simply by defining φ_{t−1} := f(x_{t−1}; θ)/x_{t−1}. Of course, the crucial difference between the NLAR(1) and the RCAR(1) models is that the time-varying parameter φ_t is a function of x_t in the NLAR(1), whereas in the RCAR(1) it is an exogenous iid random variable (independent of all {x_t} and {ε_t}). This distinction motivates us to follow Cox (1981) in separating time-varying parameter models into those that are parameter-driven and those that are observation-driven. We review these models in the next two sections.

4.2.3 Time-varying parameter models: parameter-driven

A time-varying parameter model is said to be parameter-driven when the time-varying parameters evolve independently of the time-series itself (i.e. when the time-varying parameter is exogenous). Parameter-driven models are typically estimated numerically using filtering techniques like the Kalman filter, which allow us to evaluate the likelihood function for any given parameter value θ.

For example, the Gaussian local-level model with time-varying mean μ_t takes the form

x_t = μ_t + ε_t,  {ε_t} ∼ NID(0, σ_ε²),
μ_t = ω + β μ_{t−1} + v_t,  {v_t} ∼ NID(0, σ_v²).

The crucial feature here is the fact that the time-varying parameter μ_t evolves in time independently of {x_t}_{t∈Z}. This is what makes the model a parameter-driven model. The local-level model has important applications in many areas of science, including e.g. in macroeconometrics.
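A short simulation sketch of this parameter-driven local-level model (parameter names and values are illustrative) makes the exogeneity of μ_t explicit: the path of μ_t is generated without ever looking at x_t.

    # Sketch: simulate the parameter-driven Gaussian local-level model.
    import numpy as np

    rng = np.random.default_rng(2)
    omega, beta, s_eps, s_v, T = 0.0, 0.95, 1.0, 0.1, 500

    mu = np.zeros(T)
    x = np.zeros(T)
    for t in range(1, T):
        mu[t] = omega + beta * mu[t - 1] + rng.normal(0.0, s_v)   # the mean evolves on its own
        x[t] = mu[t] + rng.normal(0.0, s_eps)                     # observation = mean + noise

    # mu_t is driven only by {v_t}: it would be unchanged for any draw of {eps_t},
    # which is what makes the model parameter-driven.
    print(x[:5], mu[:5])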

Another famous parameter-driven model is the Gaussian stochastic volatility model used in empirical finance,

x_t = σ_t ε_t,  {ε_t} ∼ NID(0, σ_ε²),
σ_t² = ω + β σ_{t−1}² + v_t,  {v_t} ∼ NID(0, σ_v²).

This model defines a time-series {x_t}_{t∈Z} with time-varying volatility. Applications in finance are typically related to modeling returns.

4.2.4 Time-varying parameter models: observation-driven

A time-varying parameter model is said to be observation-driven when the stochastic behavior of the time-varying parameters is determined by lagged observations of the time-series itself. Observation-driven models are typically much easier to estimate than parameter-driven models because the likelihood function is available in closed form.

Example: (Local-level model) The Gaussian observation-driven local-level model takes the form

x_t = μ_t + ε_t,  {ε_t} ∼ NID(0, σ_ε²),
μ_t = ω + α (x_{t−1} − μ_{t−1}) + β μ_{t−1}.

The crucial feature here is the fact that the update of the time-varying parameter μ_t is determined by the lagged observation x_{t−1} plus an autoregressive term. This makes the model an observation-driven model. When the signal-to-noise ratio is small, it may be difficult to extract the path of the time-varying mean. In particular, in the presence of outliers, the filter of μ_t may fluctuate too much. Robust filters are thus important.
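The observation-driven update is just a deterministic recursion in the observed data, which is what makes the likelihood available in closed form. A minimal filtering sketch (the names omega, a, b, the initialization mu0 and the simulated data are all illustrative choices):

    # Sketch: the observation-driven local-level update mu_t = omega + a*(x_{t-1} - mu_{t-1}) + b*mu_{t-1},
    # run as a filter on an observed series x.
    import numpy as np

    def local_level_filter(x, omega, a, b, mu0=0.0):
        mu = np.empty(len(x))
        mu[0] = mu0
        for t in range(1, len(x)):
            mu[t] = omega + a * (x[t - 1] - mu[t - 1]) + b * mu[t - 1]
        return mu

    rng = np.random.default_rng(3)
    x = np.cumsum(rng.normal(0.0, 0.1, 300)) + rng.normal(0.0, 1.0, 300)  # slowly moving mean + noise
    mu_hat = local_level_filter(x, omega=0.0, a=0.2, b=0.98)
    print(mu_hat[-5:])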
Example: (Robust local-level model) A robust observation-driven local-level model allows for fat-tailed innovations and bounds the update of the time-varying mean,

x_t = μ_t + ε_t,  {ε_t} ∼ iid t(λ),
μ_t = ω + α tanh(x_{t−1} − μ_{t−1}) + β μ_{t−1}.

Figure 5 compares the performance of the Gaussian local-level model with that of the robust local-level model. The robust local-level model reacts considerably less to outliers than the Gaussian local-level model.
Figure 5: Filtered time-varying means using the local-level and robust local-level models (series shown: data, true mean, LL filter, robust LL filter).

Example: (GARCH model) Another famous observation-driven model is the generalized autoregressive conditional heteroskedasticity (GARCH) model of Engle (1982) and Bollerslev (1986). The GARCH is typically used to model the time-varying conditional volatility in a sequence of returns {x_t} (i.e. the percent changes in the prices of stocks). Mean-zero returns are modeled as

x_t = σ_t ε_t,  {ε_t}_{t∈Z} ∼ NID(0, 1),
σ_t² = ω + α x_{t−1}² + β σ_{t−1}².

This model defines a time-series {x_t}_{t∈Z} with time-varying conditional volatility that has many important applications in finance. Figure 6 below plots a path of S&P500 returns and a path simulated from a GARCH model.
Figure 6: Time-series of S&P500 returns (left), and data simulated from a GARCH model (right).
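A GARCH path like the one in the right panel of Figure 6 can be generated with a few lines of code. The parameter values below are illustrative, chosen so that α + β < 1.

    # Sketch: simulate a GARCH(1,1) return path.
    import numpy as np

    rng = np.random.default_rng(4)
    omega, alpha, beta, T = 1e-6, 0.08, 0.90, 3500   # alpha + beta < 1 keeps the recursion stable

    x = np.zeros(T)
    sig2 = np.full(T, omega / (1.0 - alpha - beta))  # start at the unconditional variance
    for t in range(1, T):
        sig2[t] = omega + alpha * x[t - 1] ** 2 + beta * sig2[t - 1]
        x[t] = np.sqrt(sig2[t]) * rng.standard_normal()

    print("volatility range:", np.sqrt(sig2.min()), np.sqrt(sig2.max()))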
The GARCH model uses the lagged squared return x_{t−1}² as a driver to update the volatility parameter. While this choice of driver for the volatility is quite intuitive, it can often be improved upon. In particular, in the presence of outliers produced by fat-tailed innovations, the GARCH filter tends to produce exaggerated updates of the time-varying parameter σ_t². Robust alternatives are thus important.

Example: (Robust GARCH model) A robust GARCH model can be obtained by recognizing that the innovations ε_t may be fat-tailed, and defining an update that is uniformly bounded. For example, we can have

x_t = σ_t ε_t,  {ε_t} ∼ iid t(λ),
σ_t² = ω + α x_{t−1}² / (1 + x_{t−1}²) + β σ_{t−1}².

The boundedness of the update is crucial for avoiding large explosions in the filtered volatility as a result of outliers in x_{t−1}. As we shall see in Chapter 5, recognizing that the innovations are fat-tailed may be important for robustness since the ML estimator converges to a pseudo-true parameter that renders the volatility filter more robust. Figures 7 and 8 compare the driver (also called the news impact curve) and the filtered volatilities of the GARCH and robust GARCH models.

Figure 7: News impact curve of GARCH and Robust GARCH models.
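The news impact curves compared in Figure 7 are simply the two volatility updates viewed as functions of x_{t−1} with σ_{t−1}² held fixed. A minimal sketch with illustrative parameter values:

    # Sketch: news impact curves of the GARCH and robust GARCH updates as functions of x_{t-1}.
    import numpy as np

    x_lag = np.linspace(-3.0, 3.0, 7)
    sig2_lag = 1.0
    omega, alpha, beta = 0.1, 0.2, 0.7

    garch = omega + alpha * x_lag**2 + beta * sig2_lag                      # unbounded in x_{t-1}
    robust = omega + alpha * x_lag**2 / (1.0 + x_lag**2) + beta * sig2_lag  # bounded update

    for xl, g, r in zip(x_lag, garch, robust):
        print(f"x_(t-1) = {xl:+.1f}   GARCH: {g:6.3f}   robust: {r:6.3f}")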


Clearly, by having a bounded news impact curve, the robust GARCH becomes less
sensitive to outliers than the GARCH model. This is well seen by the fact that the
filtered volatility of the GARCH model for S&P500 returns often explodes, whereas
the Robust GARCH remains rather stable.
Example: (NGARCH model) Another important feature of financial return series is the so-called leverage effect: negative returns seem to have a larger impact on volatility than positive returns. In other words, the so-called news impact curve is not symmetric. For example, in the nonlinear GARCH (NGARCH) proposed by Engle and Ng (1993), the volatility updating equation takes the form

σ_t² = ω + α (x_{t−1} − γ σ_{t−1})² + β σ_{t−1}².

Notice how the leverage effect is obtained through the cross-term x_{t−1} σ_{t−1} in the volatility driver

(x_{t−1} − γ σ_{t−1})² = x_{t−1}² + γ² σ_{t−1}² − 2γ x_{t−1} σ_{t−1}.

In particular, if both α > 0 and γ > 0, then large positive values of x_{t−1} produce less volatility because −2γ x_{t−1} σ_{t−1} < 0.
Example: (Q-GARCH model) Similarly, the quadratic GARCH (QGARCH) introduced in Sentana (1995) proposes the following volatility update with a leverage effect,

σ_t² = ω + α x_{t−1}² + γ x_{t−1} + β σ_{t−1}².

The leverage effect exists when γ < 0, since negative values of x_{t−1} will generate more volatility than positive values of x_{t−1}.
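The asymmetry is easy to see by evaluating the QGARCH update at a positive and a negative return of the same magnitude (illustrative parameter values with γ < 0):

    # Sketch: leverage effect in the QGARCH update for gamma < 0.
    omega, alpha, gamma, beta, sig2_lag = 0.1, 0.1, -0.08, 0.8, 1.0

    def qgarch_update(x_lag):
        return omega + alpha * x_lag**2 + gamma * x_lag + beta * sig2_lag

    # A negative return raises next-period volatility by more than a positive return of the same size.
    print("x_(t-1) = -2:", qgarch_update(-2.0))
    print("x_(t-1) = +2:", qgarch_update(+2.0))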
4.2.5 Nonlinear dynamic models with exogenous variables

The nonlinear autoregressive model with exogenous variables (NLARX) is typically defined as follows,

x_t = f(x_{t−1}, z_t, ε_t; θ),  t ∈ Z,  (4)

where f is some nonlinear parametric function parameterized by the vector θ, {z_t}_{t∈Z} is some exogenous explanatory variable, and {ε_t}_{t∈Z} are innovations with some assigned distribution.

Just like the GNLAR, the NLARX can be specified with several lags of x_t and z_t, giving rise to a nonlinear autoregressive distributed lag (NADL(p,q)) model

x_t = f(x_{t−1}, ..., x_{t−p}, z_t, z_{t−1}, ..., z_{t−q}, ε_t; θ),  t ∈ Z.

A popular special case is the NADL(1,0) with additive innovations

x_t = f(x_{t−1}, z_t; θ) + ε_t,  t ∈ Z.  (5)

This model is useful to describe time-series {x_t} that exhibit nonlinear autoregressive dynamics, or cases where z_t affects x_t nonlinearly.

Example: (NADL(1,0) with logistic contemporaneous link) One possible formulation of the logistic NADL(1,0) model that features a nonlinear contemporaneous relation between x_t and z_t is given by

x_t = φ x_{t−1} + g(x_{t−1}; θ) z_t + ε_t,  t ∈ Z,

with

g(x_{t−1}; θ) = α + δ / (1 + exp(γ + β x_{t−1})),  t ∈ Z.

In economics, this NADL(1,0) model is useful to describe how fiscal policy affects real GDP nonlinearly over the business cycle. In particular, changes in net government expenditure z_t have a larger impact on GDP x_t during periods of economic recession, when GDP x_{t−1} is low and far from potential, compared to periods of economic expansion, when GDP x_{t−1} is high and close to its full employment potential. Indeed, if the economy is already operating at full capacity, with labor and capital fully employed in production, then the rise in aggregate demand that results from increased government net expenditure z_t results in increased prices (inflation) rather than increased output supply. In contrast, if the economy is operating far from potential, then an increase in aggregate demand fostered by a rise in government expenditure z_t leads to a rise in labor and capital employment, increased output, and hence growth in real GDP x_t.
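A simulation sketch of this logistic NADL(1,0) model follows; all parameter values, and the name phi for the autoregressive coefficient, are illustrative assumptions.

    # Sketch: simulate a logistic NADL(1,0) model with an exogenous iid driver z_t.
    import numpy as np

    rng = np.random.default_rng(5)
    phi, alpha, delta, gamma, beta, T = 0.8, 0.1, 0.6, 0.0, 10.0, 500

    z = rng.standard_normal(T)          # exogenous variable (e.g. net expenditure changes)
    x = np.zeros(T)
    for t in range(1, T):
        g = alpha + delta / (1.0 + np.exp(gamma + beta * x[t - 1]))  # impact of z_t is larger when x_{t-1} is low
        x[t] = phi * x[t - 1] + g * z[t] + rng.normal(0.0, 0.5)

    print(x[:5])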

4.3 Stationarity, dependence, ergodicity and moments


Some more reading material:
1. Davidson (1994), Stochastic Limit Theory
Chapter 13.1-13.4, 15.1, 18.6 and 24.3
2. Billingsley (1995), Probability and Measure
Chapter 24 and 35
3. Fan and Yao (2005), Nonlinear Time-Series
Chapter 2.1
4. Granger and Terasvirta (1993), Modeling Nonlinear Economic Relationships
Chapter 1.5 and 1.6


As you may recall from Definition 3 of Section 2.2, a time-series is strictly stationary if its finite-dimensional distributions (fidis) are invariant in time.

Definition 26 (Strict Stationarity) A time-series {x_t}_{t∈Z} is said to be strictly stationary if the distribution of every finite sub-vector is invariant in time,

(x_1, ..., x_h) =^d (x_{t+1}, ..., x_{t+h}),  ∀ (t, h) ∈ Z × N.

The stationarity property is, by itself, not enough to yield laws of large numbers (LLN) or central limit theorems (CLT). In general, some form of limited dependence or fading memory is also needed. Notions of fading memory come in various forms and shapes. From weak and strong mixing to near epoch dependence, there are a multitude of ways in which we can convey the idea that events that occur far apart in time should be (almost) independent. Below we introduce two concepts of limited dependence (m-dependence and ergodicity) that together with strict stationarity yield LLNs and CLTs.

4.3.1 Strict stationarity and m-dependence

The simplest concept of limited dependence is probably that of m-dependence. As we shall see, this concept is simple, but unfortunately too limited to be useful in most cases.

Definition 27 (m-dependence) A random sequence {x_t}_{t∈Z} is said to be m-dependent if there exists an m > 0 such that x_t is independent of x_{t−m−j} for every (t, j) ∈ Z × N.

An MA(q) random sequence is a simple example of an m-dependent time-series.

Proposition 1 Let {x_t}_{t∈Z} be an MA(q) random sequence generated according to

x_t = ε_t + θ_1 ε_{t−1} + θ_2 ε_{t−2} + ... + θ_q ε_{t−q},  {ε_t}_{t∈Z} ∼ NID(0, σ²).

Then {x_t}_{t∈Z} is an m-dependent sequence for any m > q.

Proof: For m > q, the vector (ε_t, ..., ε_{t−q}) is independent of the vector (ε_{t−m}, ..., ε_{t−q−m}), and hence a function of (ε_t, ..., ε_{t−q}) is also independent of a function of (ε_{t−m}, ..., ε_{t−q−m}). The desired result follows by noting that x_t = f(ε_t, ..., ε_{t−q}) and x_{t−m} = f(ε_{t−m}, ..., ε_{t−q−m}), where f happens to be a linear function.¹⁰ □

The notion of m-dependence is useful because it allows us to apply the following law of large numbers (LLN) and central limit theorem (CLT) for stationary m-dependent sequences.

¹⁰ The statement that functions of independent random variables are themselves independent is rather trivial once you appropriately define the notion of independent random variables in terms of independence of sub-σ-algebras; see Theorem 1 in Chapter 4 of David Pollard's (2002) A User's Guide to Measure Theoretic Probability for further details. Here we will just take this for granted.

Theorem 2 (LLN for stationary m-dependent sequences) Let {x_t}_{t∈Z} be a strictly stationary m-dependent sequence with E|x_t| < ∞. Then

(1/T) Σ_{t=1}^{T} x_t → E(x_t) in probability as T → ∞.

Theorem 3 (CLT for stationary m-dependent sequences) Let {x_t}_{t∈Z} be a strictly stationary m-dependent sequence with E(x_1) = μ and Var(x_1) = σ_x² < ∞. Then

√T ( (1/T) Σ_{t=1}^{T} x_t − μ ) → N(0, σ_w²) in distribution as T → ∞,

where σ_w² = σ_x² + 2 Σ_{k=1}^{m−1} Cov(x_1, x_{1+k}).

Unfortunately, the notion of m-dependence is too limited to hold even for the simple AR(1) process with β ≠ 0 and σ² ≠ 0.

Proposition 2 Let {x_t}_{t∈Z} be generated by an AR(1) process with β ≠ 0 and σ² ≠ 0,

x_t = β x_{t−1} + ε_t,  {ε_t}_{t∈Z} ∼ NID(0, σ²).

Then {x_t}_{t∈Z} is not m-dependent for any m ∈ N.

Proof: Let γ(h) denote the h-lag covariance of the time-series, i.e. γ(h) = Cov(x_t, x_{t−h}). Then

γ(0) = Var(x_t) = β² γ(0) + σ²  ⟹  γ(0) = σ² / (1 − β²) ≠ 0.

The desired result thus follows immediately by noting that

γ(m) = Cov(x_t, x_{t−m}) = Cov(β x_{t−1} + ε_t, x_{t−m})
     = β Cov(x_{t−1}, x_{t−m}) + Cov(ε_t, x_{t−m})
     = β γ(m − 1) + 0,   ∀ m ∈ N,

because Cov(ε_t, x_{t−m}) = 0 ∀ m > 0. As a result, by unfolding the recursion,

γ(m) = β γ(m − 1) = β (β γ(m − 2)) = β² γ(m − 2) = ... = β^m γ(0) ≠ 0,

because β^m ≠ 0 and γ(0) ≠ 0. □

4.3.2 Strict stationarity and ergodicity (SE)

The property of a stationary sequence which ensures that the ensemble average and the time average converge to the same point is known as ergodicity. The ensemble average is the sample average obtained from N realizations x_t(e_1), x_t(e_2), ..., x_t(e_N) of the same random variable x_t,

(1/N) Σ_{i=1}^{N} x_t(e_i)  for some given t ∈ Z, where each e_i is an event e_i ∈ E.

The time average is the average obtained from a single realization {x_t(e)}_{t∈Z} of the time-series {x_t}_{t∈Z},

(1/T) Σ_{t=1}^{T} x_t(e)  for some e ∈ E.
The formal measure-theoretic definition of ergodicity is a tricky one. Here we provide an approximate definition.

Definition 28 (Ergodic sequence) A random sequence {z_t}_{t∈Z} is said to be ergodic if and only if, over an infinite amount of time, every event occurs with probability 0 or 1.

The definition above is essentially stating that every event that can happen will eventually happen. In a loose sense, it is not possible that no element of the sequence {z_t}_{t∈Z} ever takes a certain value, if that value has a positive probability of occurring. In essence, the sequence will eventually visit all corners of the space, i.e. it will take any possible value that has positive probability of occurring.

Well, fortunately, you need not worry too much about the complications surrounding the concept of ergodicity, because here we will only be concerned with the practical result of ergodicity: the LLNs and CLTs that it produces! In particular, you may recall the following theorems from Sections 2.2.2 and 2.2.3.

Theorem 4 (Ergodic theorem: LLN for SE data) Let {z_t}_{t∈Z} be a strictly stationary and ergodic random sequence with finite first moment E|z_1| < ∞. Then we have

(1/T) Σ_{t=2}^{T} z_t → E(z_t) in probability as T → ∞.

Definition 29 (Martingale difference sequence) A sequence {z_t}_{t∈Z} is said to be a martingale difference sequence if and only if E(z_t | z_{t−1}, z_{t−2}, ...) = 0 ∀ t ∈ Z.

Theorem 5 (Billingsley's Central Limit Theorem: CLT for SE data) Let {z_t}_{t∈Z} be a strictly stationary and ergodic martingale difference sequence of random variables with E(z_1) = 0 and Var(z_1) = σ² < ∞. Then

√T (1/T) Σ_{t=2}^{T} z_t → N(0, σ²) in distribution as T → ∞.

4.3.3 Sufficient conditions for SE

At this point it should be obvious that we need tools for establishing the strict stationarity and ergodicity of random sequences {x_t}_{t∈Z}, since this will open the door to obtaining LLNs and CLTs.

In general we will be concerned with conditions for stationarity and ergodicity that apply to time-series generated by stochastic Markov dynamical systems.

Definition 30 (Markov dynamical system) An n_x-variate time-series {x_t}_{t∈Z} is said to be generated by a random Markov dynamical system if and only if every x_t satisfies

x_{t+1} = φ(x_t, ε_t, θ)

for some parametric function φ : R^{n_x} × R^{n_ε} → R^{n_x} and some n_ε-variate random innovation sequence {ε_t}_{t∈Z}.
In the definition above we say that the system is dynamic because every x_t depends on its past values. Naturally, the system is stochastic because the innovations {ε_t}_{t∈Z} make the generated time-series a random process. Finally, the dynamical system is said to be a Markov dynamical system because x_{t+1} is a function only of x_t and not of further lags (e.g. x_{t−1}).

Andrey Markov was one of the most famous Russian mathematicians of all time. He was a doctoral student of the great Pafnuty Chebyshev.¹¹ Entire fields of probability and statistics (e.g. the theory of Markov chains) are today named after Markov.

¹¹ Andrey Markov's brother, Vladimir Markov, was himself a great mathematician, also a doctoral student of Pafnuty Chebyshev. Together they wrote the famous Markov brothers' inequality. Unfortunately, Vladimir Markov died very early, at the age of 25.

Example: (Linear and nonlinear AR(1)) The linear AR(1) model

x_t = β x_{t−1} + ε_t

is an obvious example of a Markov dynamical system, where

φ(x_{t−1}, ε_t, θ) = β x_{t−1} + ε_t.

Clearly, any nonlinear AR(1) is also a Markov system as it fits Definition 30. Furthermore, it is important to note that dynamic systems with an arbitrary number of lags can also be re-written as a vector Markov dynamical system by adopting a companion form.
Example: (Linear and nonlinear AR(p)) The linear AR(p) model

x_t = β_1 x_{t−1} + ... + β_p x_{t−p} + ε_t

can be written as a multivariate Markov


xt
1 2 . . .
xt1 1 0 . . .

.. . .
..

= .
.
.


xtp+2 0 . . . 1
xtp+1
0 ... 0

dynamical system xt = Bxt1 + t ,

. . . p
xt1
t

... 0
xt2 0
..
..
..

+ . .
.
.

0 0 xtp+1 0
1 0
xtp
0

The problem of finding conditions which ensure that a dynamical system is stable or stationary dates back to the work of the famous mathematician Aleksandr Lyapunov, another doctoral student of the great Pafnuty Chebyshev and a colleague of Andrey Markov! It all began with Lyapunov's doctoral thesis, entitled The General Problem of the Stability of Motion, published in 1892.

The following theorem is the result of decades of research by many mathematicians since Lyapunov. This theorem, proven by Bougerol in 1992, gives conditions for the asymptotic strict stationarity of a random sequence generated by a Markov dynamical system. In particular, it gives conditions for a sequence $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$, generated under the parameter $\theta$ and initialized at a point $x_1 \in X$, to converge to a strictly stationary and ergodic sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ initialized in the infinite past. Please note that the sequence initialized at $t = 1$ has $t \in \mathbb{N}$ whereas the limit sequence has $t \in \mathbb{Z}$.
In the theorem below we let $\log^+$ denote the truncated logarithm function
$$\log^+(x) = \begin{cases} 0 & \text{for } x \in (0, 1) \\ \log(x) & \text{for } x \geq 1 \end{cases}.$$
Furthermore, we let the symbol $\xrightarrow{e.a.s.}$ denote exponentially fast almost sure convergence. In particular, we write $z_t \xrightarrow{e.a.s.} 0$ if and only if there exists a $\gamma > 1$ such that $\gamma^t |z_t| \xrightarrow{a.s.} 0$.
Theorem 6 (Bougerol's Theorem) For some $\theta \in \Theta$, let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be a random sequence initialized at $t = 1$ with value $x_1 \in X \subseteq \mathbb{R}$ and generated by the Markov dynamical system
$$x_{t+1} = \phi(x_t, \varepsilon_t, \theta) \quad \forall\, t \in \mathbb{N}$$
with differentiable function $\phi: X \times \mathbb{R}^{n_\varepsilon} \to X$ and elements $x_t(\theta, x_1)$ taking values in $X \subseteq \mathbb{R}$ for every $t \in \mathbb{N}$. Suppose further that the following conditions hold:
A1. $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is an exogenous $n_\varepsilon$-variate SE sequence,
A2. there exists an $x_1 \in X$ such that $E \log^+ |\phi(x_1, \varepsilon_t)| < \infty$,
A3. the dynamical system is contracting on average: $E \log \sup_{x\in X} \big|\partial\phi(x, \varepsilon_t)/\partial x\big| < 0$.
Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ converges e.a.s. to a unique strictly stationary and ergodic (SE) sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$. In other words, $|x_t(\theta, x_1) - x_t(\theta)| \xrightarrow{e.a.s.} 0$.
Bougerol's theorem essentially tells us that a smooth stochastic Markov dynamical system with exogenous SE innovations (condition A1) will generate a time-series that is asymptotically SE as long as, for a given $\theta$, the first iteration has a log moment (condition A2) and the system is contracting on average (condition A3).
Condition A1 is simple. The SE nature of the innovations is trivially satisfied, for example, when $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is an iid sequence. The property of exogeneity refers essentially to the fact that $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is generated outside the system. Here we shall not dwell on the important differences between notions of exogeneity, such as weak, strong or super exogeneity. Instead, we shall adopt the more practical approach of imposing a strict definition of exogeneity which is easy to work with, while leaving for later the work of analyzing the restrictive nature and potential pitfalls of our assumption. In particular, we define the exogeneity of $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ as follows.
Definition 31 (Exogenous Innovations) In the context of the dynamical system of Theorem 6, the innovation process $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is said to be exogenous if and only if each element $\varepsilon_{it}$ of the vector $\varepsilon_t = (\varepsilon_{1t}, \ldots, \varepsilon_{n_\varepsilon t})$ is independent of all other elements $\varepsilon_{j,t+h}$ for every $(t, h)$ and any pair $i \neq j$, as well as of the present and past values $x_{t-h}$ for every $t$ and $h \geq 0$.
The exogeneity of the innovation process may not always be stated explicitly. The following remark highlights, however, that we shall always work under that assumption.
Remark 1 We shall always assume that the vector process $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is exogenous, even if this property is not explicitly stated!
Conditions A2 and A3 are written in a weak form that is not always transparent. The
following two remarks highlight important features concerning conditions A2 and A3.
Remark 2 For any random variable $z_t$, the log moment condition $E\log^+|z_t| < \infty$ is implied by $E|z_t|^n < \infty$ for any $n > 0$. As a result, condition A2 in Theorem 6 is implied by having
$$E|\phi(x_1, \varepsilon_t)|^n < \infty \ \text{ for any } n > 0.$$
Remark 3 For any random variable $z_t$, Jensen's inequality tells us that $E\log|z_t| < 0$ is implied by $E|z_t| < 1$. As a result, condition A3 in Theorem 6 is implied by having
$$E\sup_{x\in X}\Big|\frac{\partial\phi(x, \varepsilon_t)}{\partial x}\Big| < 1.$$
Remark 4 The conditions stated in Theorem 6 are sufficient conditions for a random
sequence to converge to an SE limit. Hence, if they fail to hold we cannot conclude
that the random sequence is non-stationary. We simply cannot conclude anything!
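Since conditions A2 and A3 are expectations, for a given model and given parameter values they can be approximated by Monte Carlo averages. A minimal sketch (in Python; the linear AR(1) is used purely as an illustration and all values are assumptions):

import numpy as np

rng = np.random.default_rng(2)
alpha, x1 = 0.7, 2.0
eps = rng.normal(0.0, 1.0, size=100_000)

# A2: E log+ |phi(x1, eps_t)| < infinity (here log+(z) = log(max(z, 1)))
print("A2:", np.log(np.maximum(np.abs(alpha * x1 + eps), 1.0)).mean())

# A3: E log sup_x |d phi / d x| < 0; for the AR(1) the derivative is simply alpha
print("A3:", np.log(abs(alpha)))   # negative if and only if |alpha| < 1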

4.3.4 Examples and Counter Examples
Example: (AR(1) model) Consider a time-series $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ generated by a Gaussian AR(1) initialized at some value $x_1 \in \mathbb{R}$,
$$x_{t+1} = \alpha x_t + \varepsilon_t, \quad t \in \mathbb{N}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2). \qquad (6)$$

It is easy to show that the Gaussian AR(1) with $|\alpha| < 1$ generates asymptotically SE time-series by applying Bougerol's theorem.[12]
Proposition 3 (SE Gaussian AR) Let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be generated by the Gaussian AR(1) in (6) with $\theta = (\alpha, \sigma_\varepsilon^2)$ satisfying $|\alpha| < 1$ and $0 < \sigma_\varepsilon^2 < \infty$. Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ for any initialization $x_1 \in \mathbb{R}$.
Proof. Condition A1 holds trivially since the innovations $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ are iid (and hence also SE).
Condition A2 holds because, for any given value $x_1 \in \mathbb{R}$, we have
$$E\log^+|\phi(x_1, \varepsilon_t)| = E\log^+|\alpha x_1 + \varepsilon_t| \leq E|\alpha x_1 + \varepsilon_t| \leq |\alpha|\,|x_1| + E|\varepsilon_t| < \infty.$$
The first equality $E\log^+|\phi(x_1, \varepsilon_t)| = E\log^+|\alpha x_1 + \varepsilon_t|$ holds by definition. The first inequality holds because $\log^+(x) \leq x$ for every $x$. The second inequality holds by the sub-additivity of the absolute value, $|a + b| \leq |a| + |b|$ for every $a$ and $b$, and the absolute homogeneity of the absolute value, $|ab| = |a||b|$ for every $a$ and $b$. The terms $|\alpha|$ and $|x_1|$ are bounded constants in $\mathbb{R}$, and $E|\varepsilon_t| < \infty$ because $\varepsilon_t$ is normally distributed with bounded variance $\sigma_\varepsilon^2 < \infty$, and hence has bounded moments of any order; i.e. $E|\varepsilon_t|^n < \infty$ for all $n \geq 0$.
Condition A3 holds since $\partial\phi(x, \varepsilon_t)/\partial x = \alpha$, and hence,
$$E\log\sup_{x\in X}\Big|\frac{\partial\phi(x, \varepsilon_t)}{\partial x}\Big| < 0 \iff E\log\sup_{x\in X}|\alpha| < 0 \iff E\log|\alpha| < 0,$$
where the second equivalence holds because the derivative does not depend on $x$, and hence we can drop the supremum. Finally, we obtain the desired result by noting that we can drop the expectation, $E\log|\alpha| < 0 \iff \log|\alpha| < 0$, because $\log|\alpha|$ is just a constant, and that $\log|\alpha| < 0 \iff |\alpha| < 1$. □

Two important tools we used in the proof of Proposition 3 were:
1. Sub-additivity of the absolute value: $|a + b| \leq |a| + |b|$ for all $(a, b)$.
2. Bounded moments of Gaussian random variables: $E|\varepsilon_t|^n < \infty$ for all $n$.
[12] Note that $\{x_t\}_{t\in\mathbb{N}}$ initialized at some point $x_1 \in \mathbb{R}$ is always non-stationary! Suppose that $x_1 = 5$. Well, then the mean of the process changes from $E(x_1) = x_1 = 5$ to $E(x_2) = \alpha x_1 = 5\alpha$. Hence the process is non-stationary! In fact, even if $x_1 = 0$, we still have a changing variance, from $\mathrm{Var}(x_1) = 0$ to $\mathrm{Var}(x_2) = \sigma_\varepsilon^2$. Hence, the process is still non-stationary. There is really no $x_1 \in \mathbb{R}$ that would make the model stationary.

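The e.a.s. convergence claimed in Proposition 3 is easy to visualize: two AR(1) paths driven by the same innovations but started at different points merge at a geometric rate, so the effect of the initialization dies out. A small illustrative sketch (in Python; parameter values are assumptions):

import numpy as np

rng = np.random.default_rng(3)
alpha, T = 0.7, 50
eps = rng.normal(size=T)

def ar1_path(x1):
    x = np.empty(T + 1)
    x[0] = x1
    for t in range(T):
        x[t + 1] = alpha * x[t] + eps[t]
    return x

gap = np.abs(ar1_path(10.0) - ar1_path(-10.0))
print(gap[[0, 10, 25, 50]])   # decays like |alpha|^t = 0.7^t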
Surely, you already knew the stationarity condition ($|\alpha| < 1$) for the AR(1) model from your introductory econometrics courses. Bougerol's theorem can, however, be applied to many other models!
Example: (random-coefficient AR(1) model) Consider for example the random-coefficient Gaussian AR(1) initialized at some value $x_1 \in \mathbb{R}$,
$$x_{t+1} = \alpha_t x_t + \varepsilon_t, \quad t \in \mathbb{N}, \qquad \{\alpha_t\}_{t\in\mathbb{N}} \sim \mathrm{NID}(b, \sigma_\alpha^2), \qquad \{\varepsilon_t\}_{t\in\mathbb{N}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2). \qquad (7)$$

Note that we can apply Bougerol's Theorem by defining the $n_\varepsilon$-variate vector $\varepsilon_t$ in Theorem 6 as $\varepsilon_t = (\alpha_t, \varepsilon_t)$. As such, it is easy to show that this random-coefficient model generates asymptotically SE time-series if $\alpha_t$ is smaller than 1 on average; i.e. if $E|\alpha_t| < 1$. Note that the condition $E|\alpha_t| < 1$ implicitly imposes restrictions on the parameters $b$ and $\sigma_\alpha^2$ that define the distribution of $\alpha_t$.
Proposition 4 (SE Gaussian RCAR) Let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be generated by the Gaussian random-coefficient AR(1) in (7) with $\theta = (b, \sigma_\alpha^2, \sigma_\varepsilon^2)$ satisfying $E|\alpha_t| < 1$ and $0 < \sigma_\varepsilon^2 < \infty$. Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ for any initialization $x_1 \in \mathbb{R}$.
Proof. Condition A1 holds trivially since the innovations $\{\alpha_t\}_{t\in\mathbb{N}}$ and $\{\varepsilon_t\}_{t\in\mathbb{N}}$ are iid.
Condition A2 holds because for any given value $x_1 \in \mathbb{R}$ we have
$$E\log^+|\phi(x_1, \varepsilon_t)| = E\log^+|\alpha_t x_1 + \varepsilon_t| \leq E|\alpha_t x_1 + \varepsilon_t| \leq |x_1|\,E|\alpha_t| + E|\varepsilon_t| < \infty.$$
The second inequality holds by the sub-additivity of the absolute value. The term $|x_1|$ is a bounded constant in $\mathbb{R}$, $E|\alpha_t| < 1 < \infty$, and $E|\varepsilon_t| < \infty$ because $\{\varepsilon_t\}$ is NID with finite variance and hence has bounded moments of any order.
Condition A3 holds since $\partial\phi(x, \varepsilon_t)/\partial x = \alpha_t$, and hence,
$$E\log\sup_{x\in X}\Big|\frac{\partial\phi(x, \varepsilon_t)}{\partial x}\Big| < 0 \iff E\log\sup_{x\in X}|\alpha_t| < 0 \iff E\log|\alpha_t| < 0,$$
and $E\log|\alpha_t| < 0$ is implied by $E|\alpha_t| < 1$ by Jensen's inequality. □
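Because the contraction condition now involves an expectation over the random coefficient, it can be checked numerically for given values of $b$ and $\sigma_\alpha^2$. A hedged sketch (in Python; the values below are illustrative assumptions) compares the condition used in Proposition 4 with the weaker log condition:

import numpy as np

rng = np.random.default_rng(4)
b, s_a = 0.9, 0.3
alpha = rng.normal(b, s_a, size=1_000_000)

print("E|alpha_t|     ~", np.abs(alpha).mean())          # sufficient for SE if < 1
print("E log|alpha_t| ~", np.log(np.abs(alpha)).mean())  # weaker: sufficient if < 0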

It is important to note that the condition E|t | < 1 used in Proposition 4 is unnecessarily restrictive. We could have assumed E log |t | < 0 which is weaker. In the two
examples above, the contraction condition was rather trivial because the derivative
(x, t )/x did not depend on x. We turn now to an example where (x, t )/x
is a function of x. The logistic AR(1) model ensures that every xt takes values in the
interval [0, 1] and is useful to model time-series that refer to probabilities, intensities,
percentages, etc.

Example: (logistic AR(1) model) Consider a time-series $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ generated by the logistic AR(1) with initialization $x_1 \in (0, 1)$,
$$x_{t+1} = \frac{1}{1 + \exp(\delta - \alpha x_t + \varepsilon_t)}, \quad t \in \mathbb{N}, \qquad \{\varepsilon_t\}_{t\in\mathbb{N}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2). \qquad (8)$$

It is easy to show that this nonlinear autoregressive model generates asymptotically SE time-series if $|\alpha| < 4$.
Proposition 5 (SE Logistic AR) Let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be generated by the logistic AR(1) in (8) with $\theta = (\delta, \alpha, \sigma_\varepsilon^2)$ satisfying $|\delta| < \infty$, $|\alpha| < 4$ and $0 < \sigma_\varepsilon^2 < \infty$. Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ for any initialization $x_1 \in (0, 1)$.
Proof. Condition A1 holds because the innovations $\{\varepsilon_t\}_{t\in\mathbb{N}}$ are iid.
Condition A2 holds because for any given value $x_1 \in \mathbb{R}$ we have
$$E\log^+|\phi(x_1, \varepsilon_t)| = E\log^+\Big|\frac{1}{1 + \exp(\delta - \alpha x_1 + \varepsilon_t)}\Big| \leq 0 < \infty.$$
The first inequality holds because the logistic function is uniformly bounded by 1.
Condition A3 holds because
$$\frac{\partial\phi(x, \varepsilon_t)}{\partial x} = \frac{\alpha\exp(\delta - \alpha x + \varepsilon_t)}{\big(1 + \exp(\delta - \alpha x + \varepsilon_t)\big)^2}.$$
It is easy to see that this derivative attains its maximum value of $\alpha/4$ by setting, for any given $\varepsilon_t$, the value of $x$ at $x = (\delta + \varepsilon_t)/\alpha$.[13] As a result, we have
$$E\log\sup_{x\in X}\Big|\frac{\partial\phi(x, \varepsilon_t)}{\partial x}\Big| < 0 \iff E\log\sup_{x\in X}\Big|\frac{\alpha\exp(\delta - \alpha x + \varepsilon_t)}{\big(1 + \exp(\delta - \alpha x + \varepsilon_t)\big)^2}\Big| < 0,$$
and since the supremum over $X \subseteq \mathbb{R}$ is dominated by the supremum over $\mathbb{R}$,
$$E\log\sup_{x\in\mathbb{R}}\Big|\frac{\alpha\exp(\delta - \alpha x + \varepsilon_t)}{\big(1 + \exp(\delta - \alpha x + \varepsilon_t)\big)^2}\Big| < 0 \implies E\log\sup_{x\in X}\Big|\frac{\alpha\exp(\delta - \alpha x + \varepsilon_t)}{\big(1 + \exp(\delta - \alpha x + \varepsilon_t)\big)^2}\Big| < 0,$$
we conclude that
$$E\log\sup_{x\in\mathbb{R}}\Big|\frac{\alpha\exp(\delta - \alpha x + \varepsilon_t)}{\big(1 + \exp(\delta - \alpha x + \varepsilon_t)\big)^2}\Big| < 0 \iff E\log|\alpha/4| < 0 \iff |\alpha| < 4. \ \Box$$

[13] This follows from the fact that $w/(1 + w)^2$ is maximized at $w = 1$, and hence the function $\exp(z)/(1 + \exp(z))^2$ is maximized at $z = 0$.

It is important to note that the condition $|\alpha| < 4$ is simple, but unnecessarily restrictive. Indeed, by enlarging the supremum set from $X$ to $\mathbb{R}$ we have made the condition possibly more restrictive. We could have instead stated in Proposition 5 that the parameter $\theta$ had to satisfy the condition
$$E\log\sup_{x\in X}\Big|\frac{\alpha\exp(\delta - \alpha x + \varepsilon_t)}{\big(1 + \exp(\delta - \alpha x + \varepsilon_t)\big)^2}\Big| < 0,$$
as this is the actual contraction condition. In any case, it is crucial to keep in mind that even this more general contraction condition is only sufficient for obtaining strict stationarity and ergodicity! For those parameters that do not satisfy this condition, we simply do not know if $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ converges to a limit SE sequence or not. This is in stark contrast with the linear Gaussian AR(1), where we know that the condition $|\alpha| < 1$ is both necessary and sufficient for obtaining limit SE sequences. Curiously, the contraction condition can also be shown to be necessary and sufficient for SE in the random-coefficient AR(1) model reviewed above and in the GARCH model that we review next. These constitute however special cases. In general, we must always keep in mind that the contraction condition is only a sufficient condition!
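For given parameter values, this sharper condition is easy to evaluate numerically over the set $X = [0, 1]$ in which the logistic AR(1) actually lives. The sketch below (in Python; parameter values are illustrative assumptions) shows a case where the simple bound $|\alpha| < 4$ fails while the exact condition can still be satisfied:

import numpy as np

rng = np.random.default_rng(5)
delta, alpha = -6.0, 5.0            # note |alpha| = 5 > 4: the simple bound fails
eps = rng.normal(size=20_000)
x_grid = np.linspace(0.0, 1.0, 201)

z = delta - alpha * x_grid[None, :] + eps[:, None]        # one row per draw of eps_t
deriv = abs(alpha) * np.exp(z) / (1.0 + np.exp(z)) ** 2   # |d phi / d x| over the grid
sup_deriv = deriv.max(axis=1)                             # sup over x in [0, 1]

print(np.log(sup_deriv).mean())   # the exact contraction condition holds if this is < 0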
Example: (GARCH model) Consider a time-series $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ generated by the famous GARCH model introduced already in Section 4.2.4, with initialization of the filter at $t = 1$ given by some value $\sigma_1^2 \in \mathbb{R}^+$,
$$x_t = \sigma_t\varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{N}} \sim \mathrm{NID}(0, 1), \qquad \sigma_{t+1}^2 = \omega + \alpha x_t^2 + \beta\sigma_t^2, \quad t \in \mathbb{N}.$$

It is easy to show that $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE sequence as long as $E\log|\alpha\varepsilon_t^2 + \beta| < 0$. A simpler sufficient condition is thus that $|\alpha| < 1 - |\beta|$.
Proposition 6 (SE GARCH) Let $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ be generated by the GARCH(1,1) above under some $\theta = (\omega, \alpha, \beta)$ with $E\log|\alpha\varepsilon_t^2 + \beta| < 0$. Then $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$.
Proof. We first substitute $x_t$ in the recursion equation for $\sigma_t^2$ using the fact that $x_t = \sigma_t\varepsilon_t$ and obtain
$$\sigma_{t+1}^2 = \omega + \alpha\sigma_t^2\varepsilon_t^2 + \beta\sigma_t^2 \quad \forall\, t \in \mathbb{N}.$$
Now we can analyze this recursion as a stochastic Markov dynamical system and show that $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges to a limit SE sequence.
Condition A1 holds trivially because the innovations $\{\varepsilon_t^2\}_{t\in\mathbb{N}}$ are iid.
Condition A2 holds since for any given value $\sigma_1^2 \in \mathbb{R}$ we have
$$E\log^+|\phi(\sigma_1^2, \varepsilon_t)| = E\log^+|\omega + \alpha\sigma_1^2\varepsilon_t^2 + \beta\sigma_1^2| \leq |\omega| + |\alpha|\,|\sigma_1^2|\,E|\varepsilon_t^2| + |\beta|\,|\sigma_1^2| < \infty,$$
because every real-valued parameter $\omega$, $\alpha$ and $\beta$, and the real-valued initialization $\sigma_1^2$, are naturally finite, and $E|\varepsilon_t|^2 < \infty$ because $\varepsilon_t$ is standard Gaussian.
Condition A3 holds because $\partial\phi(\sigma^2, \varepsilon_t)/\partial\sigma^2 = \alpha\varepsilon_t^2 + \beta$, and hence,
$$E\log\sup_{\sigma^2\in\mathbb{R}^+}\Big|\frac{\partial\phi(\sigma^2, \varepsilon_t)}{\partial\sigma^2}\Big| < 0 \iff E\log\sup_{\sigma^2\in\mathbb{R}^+}|\alpha\varepsilon_t^2 + \beta| < 0 \iff E\log|\alpha\varepsilon_t^2 + \beta| < 0.$$
We thus conclude that $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE process $\{\sigma_t^2(\theta)\}_{t\in\mathbb{Z}}$. Finally, $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ also converges to a limit SE process $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ since, by the product e.a.s. convergence theorem (see below), we have
$$\gamma^t|x_t(\theta, \sigma_1^2) - x_t(\theta)| = \gamma^t|\sigma_t(\theta, \sigma_1^2)\varepsilon_t - \sigma_t(\theta)\varepsilon_t| \leq |\varepsilon_t|\,\gamma^t|\sigma_t(\theta, \sigma_1^2) - \sigma_t(\theta)| \xrightarrow{a.s.} 0,$$
and because the limit product sequence $\{\sigma_t(\theta)\varepsilon_t\}_{t\in\mathbb{Z}}$ is SE under the Borel $\sigma$-algebra, by Krengel's theorem (see below). □

The proof of Proposition 6 above revealed an important fact: in observation-driven models, the proof of SE is obtained in two steps! In the first step, we showed that the autoregressive sequence $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE process $\{\sigma_t^2(\theta)\}_{t\in\mathbb{Z}}$. In the second step, we used the product e.a.s. convergence theorem and Krengel's theorem to conclude that $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ also converges to a limit SE process $\{x_t(\theta)\}_{t\in\mathbb{Z}}$. Both theorems are stated below.
Theorem 7 (Product e.a.s. convergence) Let $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ be an iid sequence satisfying $E\log^+|\varepsilon_t| < \infty$ and let $\{z_t\}_{t\in\mathbb{N}}$ be a sequence satisfying $z_t \xrightarrow{e.a.s.} 0$ as $t \to \infty$. Then,
$$\varepsilon_t z_t \xrightarrow{e.a.s.} 0 \quad \text{as } t \to \infty.$$

Theorem 8 (Krengel's Theorem) Let $\{z_t\}_{t\in\mathbb{Z}}$ be an $n$-variate SE random sequence. If $g$ is a measurable function, then the sequence $\{w_t\}_{t\in\mathbb{Z}}$ with elements given by $w_t = g(z_t, \ldots, z_{t-q})$ for all $t \in \mathbb{Z}$ and some $q \in \mathbb{N}$ is also an SE sequence.[14]
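As an illustration of Proposition 6 (a hedged sketch in Python, not taken from the notes; parameter values are assumptions), the contraction condition can be approximated by Monte Carlo, and the first step of the argument can be seen directly by running the volatility recursion from two different initializations with the same innovations:

import numpy as np

rng = np.random.default_rng(6)
omega, alpha, beta = 0.1, 0.1, 0.85
eps = rng.normal(size=200_000)

# (i) SE contraction condition of Proposition 6
print(np.log(np.abs(alpha * eps**2 + beta)).mean())   # should be < 0

# (ii) sigma2_{t+1} = omega + alpha*sigma2_t*eps_t^2 + beta*sigma2_t from two starts
def vol_path(s2_init, eps_path):
    s2 = np.empty(len(eps_path) + 1)
    s2[0] = s2_init
    for t, e in enumerate(eps_path):
        s2[t + 1] = omega + alpha * s2[t] * e**2 + beta * s2[t]
    return s2

gap = np.abs(vol_path(0.1, eps[:100]) - vol_path(5.0, eps[:100]))
print(gap[[0, 20, 50, 100]])   # the effect of the initialization vanishes geometrically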
We finish this section with a counter example: a model that can only satisfy Bougerol's contraction condition on a degenerate parameter space that eliminates all the dynamics.
Example: (Quadratic AR(1) model) Consider the quadratic AR(1) model,
$$x_{t+1} = \delta + \alpha x_t^2 + \varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{N}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2).$$
It is easy to see that this stochastic Markov dynamical system can only satisfy the contraction condition on a degenerate parameter space with $\alpha = 0$. In particular, note that $\partial\phi(x, \varepsilon_t)/\partial x = 2\alpha x$, and hence,
$$E\log\sup_{x\in X}\Big|\frac{\partial\phi(x, \varepsilon_t)}{\partial x}\Big| = E\log\sup_{x\in X}|2\alpha x|$$
is unbounded, because the derivative can be made arbitrarily large by taking $x$ to infinity. Strictly speaking, the contraction condition is also not satisfied for $\alpha = 0$, since $\log(0)$ is not defined. In any case, we know that for $\alpha = 0$ the process $\{x_t\} = \{\delta + \varepsilon_t\}$ is a simple iid sequence.
[14] An $n$-variate sequence is said to be SE if its elements are SE.

4.3.5 Bounded unconditional moments
Every law of large numbers imposes the condition that the first unconditional moment exists, $E|x_t| < \infty$. Every central limit theorem imposes the condition that the second unconditional moment exists, $E|x_t|^2 < \infty$. In the previous section we learned how to verify that a stochastic dynamical system generates time-series that are asymptotically SE. We now turn to the last element that we need to obtain laws of large numbers and central limit theorems: ensuring that the unconditional moments are bounded. As in the previous section, we focus on time-series generated by stochastic Markov dynamical systems.
Theorem 9 (Power-n contraction) For some $\theta \in \Theta$, let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be a random sequence initialized at $t = 1$ with value $x_1 \in X \subseteq \mathbb{R}$ and generated by the Markov dynamical system
$$x_{t+1} = \phi(x_t, \varepsilon_t, \theta) \quad \forall\, t \in \mathbb{N}$$
with differentiable function $\phi: X \times \mathbb{R}^{n_\varepsilon} \to X$ and elements $x_t(\theta, x_1)$ taking values in $X \subseteq \mathbb{R}$ for every $t \in \mathbb{N}$. Suppose further that the following conditions hold for some $n > 0$:
A1. $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is an exogenous $n_\varepsilon$-variate SE sequence,
A2. there exists an $x_1 \in X$ such that $E|\phi(x_1, \varepsilon_t)|^n < \infty$,
A3. the system is power-n contracting: $E\sup_{x\in X}\big|\partial\phi(x, \varepsilon_t)/\partial x\big|^n < 1$.
Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ satisfies $\sup_t E|x_t(\theta, x_1)|^n < \infty$ and converges e.a.s. to a unique strictly stationary and ergodic (SE) sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ satisfying $E|x_t(\theta)|^n < \infty$.
Note that, besides the bounded moments, Theorem 9 also delivers the e.a.s. convergence to an SE limit. This is simply due to the fact that conditions A1, A2 and A3 of Theorem 9 imply conditions A1, A2 and A3 of Theorem 6. Indeed, the conditions of both theorems are very similar: condition A1 is the same, and conditions A2 and A3 just impose $n$ moments rather than the log moment.
The power-n contraction theorem ensures that a Markov dynamical system generates time-series with bounded unconditional moments. If conditions A1, A2 and A3 hold for $n = 2$, then the time-series has bounded unconditional variance. This means that the time-series is weakly stationary!
Two important inequalities will be useful for verifying that the conditions of the power-n contraction theorem hold. These are known as the $c_n$-inequality and the generalized Hölder inequality.
Theorem 10 ($c_n$-inequality) For every $n > 0$, there exists a constant $c > 0$ such that
$$E|z_t + w_t|^n \leq c\,E|z_t|^n + c\,E|w_t|^n.$$
Theorem 11 (Generalized Hölder inequality) For every $p > 0$ and $q > 0$, it holds true that
$$E|z_t w_t|^n \leq \big(E|z_t|^p\big)^{n/p}\big(E|w_t|^q\big)^{n/q} \quad \text{with } n = \frac{pq}{p + q}.$$
Three important implications of the generalized Hölder inequality are:
1. Let $E|z_t|^p < \infty$ and $E|w_t|^q < \infty$; then $E|z_t w_t|^n < \infty$ with $n = \frac{pq}{p+q}$.
2. Let $E|z_t|^p < \infty$ and $|w_t| \leq c < \infty$ a.s.; then $E|z_t w_t|^n < \infty$ with $n = p$.
3. Let $E|z_t|^{p+\delta} < \infty$ for some $\delta > 0$ and $E|w_t|^q < \infty$ for all $q > 0$; then $E|z_t w_t|^n < \infty$ with $n = p$.
4.3.6 Examples and Counter Examples
Example: (AR(1) model) Consider a time-series $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ generated by a Gaussian AR(1) initialized at some value $x_1 \in \mathbb{R}$,
$$x_{t+1} = \alpha x_t + \varepsilon_t, \quad t \in \mathbb{N}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2). \qquad (9)$$

It is easy to show that the time-series $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ has bounded moments of any order when $|\alpha| < 1$ and $\sigma_\varepsilon^2$ is finite.
Proposition 7 Let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be generated by the Gaussian AR(1) in (9) with $\theta = (\alpha, \sigma_\varepsilon^2)$ satisfying $|\alpha| < 1$ and $0 < \sigma_\varepsilon^2 < \infty$. Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ and its SE limit $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ satisfy $\sup_t E|x_t(\theta, x_1)|^n < \infty$ and $E|x_t(\theta)|^n < \infty$ for any $n > 0$ and any $x_1 \in X$.
Proof. Condition A1 holds trivially for every $n > 0$. Condition A2 holds for all $n > 0$ and any $x_1 \in \mathbb{R}$, since we have
$$E|\phi(x_1, \varepsilon_t)|^n = E|\alpha x_1 + \varepsilon_t|^n \leq c\,|\alpha|^n|x_1|^n + c\,E|\varepsilon_t|^n < \infty.$$
The inequality above holds by the $c_n$-inequality (Theorem 10).
Condition A3 holds since $\partial\phi(x, \varepsilon_t)/\partial x = \alpha$, and hence,
$$E\sup_{x\in X}\Big|\frac{\partial\phi(x, \varepsilon_t)}{\partial x}\Big|^n < 1 \iff E|\alpha|^n < 1.$$
Finally, we obtain $E|\alpha|^n < 1 \iff |\alpha|^n < 1 \iff |\alpha| < 1$. □

It is important to note that the fat-tailed AR(1) model with iid Student's t (TID) innovations,
$$x_{t+1} = \alpha x_t + \varepsilon_t, \quad t \in \mathbb{N}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{TID}(\lambda),$$
generates time-series with bounded moments of order $n$ only for $\lambda > n$.


Example: (GARCH model) Consider a time-series $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ generated by
$$x_t = \sigma_t\varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{N}} \sim \mathrm{NID}(0, 1),$$
with volatility initialization at $\sigma_1^2 \in \mathbb{R}^+$ and
$$\sigma_{t+1}^2 = \omega + \alpha x_t^2 + \beta\sigma_t^2, \quad t \in \mathbb{N}.$$

Proposition 8 Let $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ be generated by a GARCH model under some $\theta = (\omega, \alpha, \beta)$ with $E|\alpha\varepsilon_t^2 + \beta|^{n+\delta} < 1$ for some $\delta > 0$. Then $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ and its SE limit $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ satisfy $\sup_t E|x_t(\theta, \sigma_1^2)|^n < \infty$ and $E|x_t(\theta)|^n < \infty$ for any $\sigma_1^2 \in \mathbb{R}^+$.
Proof. We first show that the volatility sequence $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ initialized at $\sigma_1^2$ and the limit SE process $\{\sigma_t^2(\theta)\}_{t\in\mathbb{Z}}$ both have bounded moments of order $n + \delta$. In order to do this, we substitute $x_t$ by $\sigma_t\varepsilon_t$ in the volatility equation,
$$\sigma_{t+1}^2 = \omega + \alpha\sigma_t^2\varepsilon_t^2 + \beta\sigma_t^2, \quad t \in \mathbb{N},$$
and check that the conditions of the power-n contraction theorem hold for this Markov system.
Condition A1 holds trivially because the innovations $\{\varepsilon_t^2\}_{t\in\mathbb{N}}$ are iid.
Condition A2 holds for $m = n + \delta$ and any $\sigma_1^2 \in \mathbb{R}^+$, since we have
$$E|\phi(\sigma_1^2, \varepsilon_t)|^m = E|\omega + \alpha\sigma_1^2\varepsilon_t^2 + \beta\sigma_1^2|^m \leq c|\omega|^m + c|\alpha|^m|\sigma_1^2|^m E|\varepsilon_t^2|^m + c|\beta|^m|\sigma_1^2|^m < \infty.$$
Condition A3 holds because $\partial\phi(\sigma^2, \varepsilon_t)/\partial\sigma^2 = \alpha\varepsilon_t^2 + \beta$, and hence,
$$E\sup_{\sigma^2\in\mathbb{R}^+}\Big|\frac{\partial\phi(\sigma^2, \varepsilon_t)}{\partial\sigma^2}\Big|^{n+\delta} < 1 \iff E\sup_{\sigma^2\in\mathbb{R}^+}|\alpha\varepsilon_t^2 + \beta|^{n+\delta} < 1 \iff E|\alpha\varepsilon_t^2 + \beta|^{n+\delta} < 1.$$
As a result, we conclude that $\sup_t E|\sigma_t^2(\theta, \sigma_1^2)|^{n+\delta} < \infty$ and $E|\sigma_t^2(\theta)|^{n+\delta} < \infty$.
Finally, we use the bounded moments of order $n + \delta$ of the volatility sequences $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ and $\{\sigma_t^2(\theta)\}_{t\in\mathbb{Z}}$ to conclude that the data sequences $\{x_t(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ and $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ have bounded moments of order $n$. In particular, $\sup_t E|x_t(\theta, \sigma_1^2)|^n < \infty$ and $E|x_t(\theta)|^n < \infty$ hold for all $\sigma_1^2 \in \mathbb{R}^+$ because $E|\varepsilon_t|^q < \infty$ for all $q > 0$, and hence, by the generalized Hölder inequality, we have
$$\sup_t E|x_t(\theta, \sigma_1^2)|^n = \sup_t E|\sigma_t(\theta, \sigma_1^2)\varepsilon_t|^n < \infty \quad \text{and} \quad E|x_t(\theta)|^n = E|\sigma_t(\theta)\varepsilon_t|^n < \infty. \ \Box$$
We end this section by comparing the SE contraction condition of Bougerol's Theorem with the power-n contraction condition used to obtain bounded moments. In particular, since the power-n contraction condition in the GARCH,
$$E|\alpha\varepsilon_t^2 + \beta|^n < 1,$$
is more restrictive than the SE contraction condition,
$$E\log|\alpha\varepsilon_t^2 + \beta| < 0,$$
there are naturally fewer values of $\alpha$ and $\beta$ that satisfy the power-n contraction compared to the SE contraction. Figure 9 shows the pairs $(\alpha, \beta)$ satisfying each condition. Note that $\alpha > 0$ and $\beta > 0$ are required for ensuring the positivity of the volatility.
Figure 9: Pairs $(\alpha, \beta)$ below each frontier (SE frontier; 1st, 2nd and 4th moment frontiers) are pairs $(\alpha, \beta)$ that ensure SE or the corresponding bounded moments.
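The frontiers in Figure 9 can be approximated numerically. The sketch below (in Python; an illustrative reconstruction, not the code behind the figure) finds, for each value of $\alpha$ on a small grid, the largest $\beta$ satisfying the SE condition $E\log|\alpha\varepsilon_t^2 + \beta| < 0$ or the power-n condition $E|\alpha\varepsilon_t^2 + \beta|^n < 1$:

import numpy as np

rng = np.random.default_rng(7)
eps2 = rng.normal(size=50_000) ** 2
betas = np.linspace(0.0, 1.0, 201)

def frontier(alpha, n=None):
    # largest beta on the grid satisfying the SE (n=None) or power-n condition
    best = 0.0
    for beta in betas:
        a = alpha * eps2 + beta
        ok = np.log(a).mean() < 0 if n is None else (a ** n).mean() < 1
        if ok:
            best = beta
    return best

for alpha in (0.05, 0.1, 0.2, 0.4):
    print(alpha, "SE:", round(frontier(alpha), 3),
          "n=1:", round(frontier(alpha, 1), 3),
          "n=2:", round(frontier(alpha, 2), 3))

As expected, the moment frontiers lie below the SE frontier, and higher moments push the admissible region further down.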

4.3.7 Notes for Time-varying Parameter Models

Filters and DGPs

In observation-driven time-varying parameter models, it is important to note that
the properties of the filtered time-varying parameter, can be very different from the
properties of the true time-varying parameter. The filtered parameter takes the data
{xt } as given, whereas the true parameter determines the behavior of the data {xt }.
Take, for example, the observation-driven model for the time-varying mean
$$x_t = \mu_t + \varepsilon_t, \qquad \varepsilon_t \sim N(0, \sigma_\varepsilon^2).$$
The filtered time-varying mean $\{\mu_t(\theta, \mu_1)\}$ initialized at some value $\mu_1 \in \mathbb{R}$ takes the data $\{x_t\}$ as given. The filtered parameter $\{\mu_t(\theta, \mu_1)\}$ is updated according to the updating equation
$$\mu_{t+1} = \omega + \alpha(x_t - \mu_t) + \beta\mu_t.$$
If the model is correctly specified, then the true time-varying mean $\{\mu_t(\theta_0)\}$ is supposed to be generated by the model under some $\theta_0$. It is important to note that the true parameter $\mu_t(\theta_0)$ influences the data $x_t$, because the data is generated according to $x_t = \mu_t(\theta_0) + \varepsilon_t$. As a result, we cannot simply analyze the properties of $\{\mu_t(\theta_0)\}$ as if the data $\{x_t\}$ were given. Instead, we must recognize that $x_t = \mu_t + \varepsilon_t$ and hence substitute $x_t$ in the updating equation to obtain
$$\mu_{t+1} = \omega + \alpha\varepsilon_t + \beta\mu_t.$$
When estimating time-varying parameter models, we always need to analyze the properties of the filtered parameter $\{\mu_t(\theta, \mu_1)\}$. After all, it is the filtered parameter, not the true parameter, that enters the criterion function used for parameter estimation. The true parameter is unobserved! For example, it is the filtered parameter that enters the log likelihood function
$$L(x_T, \theta) = \frac{1}{T}\sum_{t=2}^{T}\Big(-\frac{1}{2}\log 2\pi\sigma_\varepsilon^2 - \frac{\big(x_t - \mu_t(\theta, \mu_1)\big)^2}{2\sigma_\varepsilon^2}\Big)$$
that we use to obtain maximum likelihood estimates of the parameters of the model! On the contrary, we only need to analyze the properties of the true sequence $\{\mu_t(\theta_0)\}$ if we want to establish properties of the data $\{x_t\}$.
Consider now the GARCH model for time-varying volatility
$$x_t = \sigma_t\varepsilon_t, \qquad \varepsilon_t \sim N(0, 1).$$
The filtered time-varying volatility $\{\sigma_t^2(\theta, \sigma_1^2)\}$ takes the data $\{x_t\}$ as given. The filtered volatility $\{\sigma_t^2(\theta, \sigma_1^2)\}$ is then updated according to the updating equation
$$\sigma_{t+1}^2 = \omega + \alpha x_t^2 + \beta\sigma_t^2.$$
If the model is correctly specified, then the true time-varying volatility $\{\sigma_t^2(\theta_0)\}$ is supposed to be generated by the model under some $\theta_0$. Again, it is important to note that $\sigma_t^2(\theta_0)$ influences the data $x_t$, because the data is generated according to $x_t = \sigma_t(\theta_0)\varepsilon_t$. As a result, we cannot simply analyze the properties of $\{\sigma_t^2(\theta_0)\}$ as if the data $\{x_t\}$ were given. Instead, we must recognize that $x_t = \sigma_t\varepsilon_t$ and hence substitute $x_t$ in the updating equation to obtain
$$\sigma_{t+1}^2 = \omega + \alpha\sigma_t^2\varepsilon_t^2 + \beta\sigma_t^2.$$

Again, please keep in mind that when estimating the parameters of the GARCH model we always need to analyze the properties of the filtered volatility $\{\sigma_t^2(\theta, \sigma_1^2)\}$, since the filtered volatility enters the log likelihood function
$$L_T(x_T, \theta) = \sum_{t=2}^{T}\Big(-\frac{1}{2}\log 2\pi - \frac{1}{2}\log\sigma_t^2(\theta, \sigma_1^2) - \frac{1}{2}\,\frac{x_t^2}{\sigma_t^2(\theta, \sigma_1^2)}\Big).$$
On the contrary, we only need to analyze the properties of the true volatility $\{\sigma_t^2(\theta_0)\}$ if we want to establish the properties of the data $\{x_t\}$.
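To make the distinction concrete, here is a minimal sketch (in Python; function names, parameter values and the simulated data are illustrative assumptions, not part of the notes) of the filtered volatility recursion and the Gaussian log likelihood it enters, both of which take the observed data as given:

import numpy as np

def garch_filter(x, omega, alpha, beta, sigma2_init):
    # filtered variance: the recursion treats the data x as given
    s2 = np.empty_like(x)
    s2[0] = sigma2_init
    for t in range(len(x) - 1):
        s2[t + 1] = omega + alpha * x[t] ** 2 + beta * s2[t]
    return s2

def garch_loglik(x, omega, alpha, beta, sigma2_init):
    s2 = garch_filter(x, omega, alpha, beta, sigma2_init)
    # Gaussian log likelihood evaluated at the FILTERED volatility, t = 2,...,T
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2[1:]) - 0.5 * x[1:] ** 2 / s2[1:])

# quick check on simulated data
rng = np.random.default_rng(8)
T, omega, alpha, beta = 1_000, 0.1, 0.1, 0.85
eps = rng.normal(size=T)
x, s2 = np.empty(T), np.empty(T)
s2[0] = omega / (1 - alpha - beta)       # start the DGP at its unconditional variance
for t in range(T):
    x[t] = np.sqrt(s2[t]) * eps[t]
    if t + 1 < T:
        s2[t + 1] = omega + alpha * x[t] ** 2 + beta * s2[t]
print(garch_loglik(x, omega, alpha, beta, sigma2_init=1.0))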
Stationarity and Moments
We have noted above that in time-varying parameter models the filtered parameter and the true parameter can have very different behavior. Above, we learned how to show that the true time-varying parameter is strictly stationary and ergodic (SE) and has bounded moments of order $n$, and we used this to show that $\{x_t\}_{t\in\mathbb{Z}}$ is also SE and has bounded moments. In particular, we applied Bougerol's Theorem and the Power-n Theorem to the true time-varying parameter.
As we shall now see, we can also use Bougerol's Theorem and the Power-n Theorem to establish the stochastic properties of the filtered time-varying parameter. In this case, however, we take the properties of the data $\{x_t\}$ as given, and look at the data $\{x_t\}$ as innovations.
Example: (Local-level model) In the case of the local-level model, we can show that the true sequence $\{\mu_t(\theta_0)\}_{t\in\mathbb{Z}}$ is SE and has bounded moments of order $n$ by applying Bougerol's Theorem and the Power-n Theorem to the updating equation
$$\mu_{t+1} = \omega + \alpha\varepsilon_t + \beta\mu_t.$$
The conditions of Bougerol's Theorem which ensure that the true parameter $\{\mu_t(\theta_0)\}_{t\in\mathbb{Z}}$ is SE are:
A1: $\{\varepsilon_t\}$ is iid,
A2: $E\log^+|\omega + \alpha\varepsilon_t + \beta\mu_1| < \infty$,
A3: $E\log\sup_\mu|\beta| < 0 \iff |\beta| < 1$.
The conditions of the Power-n Theorem which ensure that the true parameter $\{\mu_t(\theta_0)\}_{t\in\mathbb{Z}}$ is SE and has bounded moments $E|\mu_t(\theta_0)|^n < \infty$ are:
A1: $\{\varepsilon_t\}$ is iid,
A2: $E|\omega + \alpha\varepsilon_t + \beta\mu_1|^n < \infty$,
A3: $E\sup_\mu|\beta|^n < 1 \iff |\beta| < 1$.

On the other hand, if we instead want to ensure that the filtered parameter $\{\mu_t(\theta, \mu_1)\}_{t\in\mathbb{N}}$ initialized at some $\mu_1 \in \mathbb{R}$ converges to a limit SE process with bounded moments of order $n$, we should apply Bougerol's Theorem and the Power-n Theorem to the updating equation
$$\mu_{t+1} = \omega + \alpha(x_t - \mu_t) + \beta\mu_t.$$
Bougerol's conditions for a limit SE filtered parameter $\{\mu_t(\theta, \mu_1)\}_{t\in\mathbb{N}}$ are then:
A1: $\{x_t\}$ is SE,
A2: $E\log^+|\omega + \alpha(x_t - \mu_1) + \beta\mu_1| < \infty$,
A3: $E\log\sup_\mu|\beta - \alpha| < 0 \iff |\beta - \alpha| < 1$.
The Power-n conditions for the filtered parameter $\{\mu_t(\theta, \mu_1)\}_{t\in\mathbb{N}}$ to converge to a limit SE process with bounded moments of order $n$ are then given by:
A1: $\{x_t\}$ is SE,
A2: $E|\omega + \alpha(x_t - \mu_1) + \beta\mu_1|^n < \infty$,
A3: $E\sup_\mu|\beta - \alpha|^n < 1 \iff |\beta - \alpha| < 1$.
Note that the conditions used for the filtered parameter are very different from those used for the true parameter. For example, condition A1 for the filtered parameter requires that we already know certain properties of the data (i.e. the SE nature of $\{x_t\}$). The contraction conditions are also very different. In particular, $|\beta| < 1$ is sufficient to show that the true parameter is SE and has bounded moments, whereas $|\beta - \alpha| < 1$ is required to ensure that the filtered parameter is SE and has bounded moments.
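A quick way to see the filtered contraction condition at work is to run the filter twice on the same data from two different initializations (an illustrative sketch in Python; the data and parameter values are assumptions):

import numpy as np

rng = np.random.default_rng(9)
omega, alpha, beta = 0.5, 0.4, 0.7     # |beta - alpha| = 0.3 < 1
x = rng.normal(size=60)                 # any SE data series will do for the illustration

def filter_mu(mu_init):
    mu = np.empty(len(x) + 1)
    mu[0] = mu_init
    for t in range(len(x)):
        mu[t + 1] = omega + alpha * (x[t] - mu[t]) + beta * mu[t]
    return mu

gap = np.abs(filter_mu(10.0) - filter_mu(-10.0))
print(gap[[0, 5, 20, 60]])              # decays like |beta - alpha|^t = 0.3^t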
Example: (GARCH model) In the case of the GARCH model, we can show that the true volatility sequence $\{\sigma_t^2(\theta_0)\}_{t\in\mathbb{Z}}$ is SE and has bounded moments of order $n$ by applying Bougerol's Theorem and the Power-n Theorem to the updating equation
$$\sigma_{t+1}^2 = \omega + \alpha\sigma_t^2\varepsilon_t^2 + \beta\sigma_t^2.$$
Bougerol's conditions for the true volatility $\{\sigma_t^2(\theta_0)\}_{t\in\mathbb{Z}}$ to be SE are then given by:
A1: $\{\varepsilon_t\}$ is iid,
A2: $E\log^+|\omega + \alpha\sigma_1^2\varepsilon_t^2 + \beta\sigma_1^2| < \infty$,
A3: $E\log\sup_{\sigma^2}|\alpha\varepsilon_t^2 + \beta| < 0 \iff E\log|\alpha\varepsilon_t^2 + \beta| < 0$.

The conditions of the Power-n Theorem for the true volatility $\{\sigma_t^2(\theta_0)\}_{t\in\mathbb{Z}}$ to be SE and have bounded moments $E|\sigma_t^2(\theta_0)|^n < \infty$ are:
A1: $\{\varepsilon_t\}$ is iid,
A2: $E|\omega + \alpha\sigma_1^2\varepsilon_t^2 + \beta\sigma_1^2|^n < \infty$,
A3: $E\sup_{\sigma^2}|\alpha\varepsilon_t^2 + \beta|^n < 1 \iff E|\alpha\varepsilon_t^2 + \beta|^n < 1$.

On the other hand, if we instead want to ensure that the filtered parameter $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ initialized at some $\sigma_1^2 \in \mathbb{R}$ converges to a limit SE process with bounded moments of order $n$, we should apply Bougerol's Theorem and the Power-n Theorem to the updating equation
$$\sigma_{t+1}^2 = \omega + \alpha x_t^2 + \beta\sigma_t^2.$$
The conditions of Bougerol's Theorem which ensure that the filtered volatility $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges to an SE limit are:
A1: $\{x_t\}$ is SE,
A2: $E\log^+|\omega + \alpha x_t^2 + \beta\sigma_1^2| < \infty$,
A3: $E\log\sup_{\sigma^2}|\beta| < 0 \iff |\beta| < 1$.
The conditions of the Power-n Theorem which ensure that the filtered parameter $\{\sigma_t^2(\theta, \sigma_1^2)\}_{t\in\mathbb{N}}$ converges to a limit SE process with bounded moments of order $n$ are:
A1: $\{x_t\}$ is SE,
A2: $E|\omega + \alpha x_t^2 + \beta\sigma_1^2|^n < \infty$,
A3: $E\sup_{\sigma^2}|\beta|^n < 1 \iff |\beta| < 1$.

Note that the conditions used for the filtered parameter are different from those used for the true parameter. For example, condition A1 for the filtered parameter requires that we already know certain properties of the data (i.e. the SE nature of $\{x_t\}$). The contraction conditions are also very different. In particular, $E\log|\alpha\varepsilon_t^2 + \beta| < 0$ is sufficient to show that the true parameter is SE and has bounded moments, whereas $|\beta| < 1$ is required to ensure that the filtered parameter is SE and has bounded moments.

4.3.8 Notes for Models with Exogenous Variables
Before we end this chapter, it is convenient to note that the theory covered in the previous sections applies also to models with exogenous variables. Indeed, since both Bougerol's Theorem and the Power-n Theorem were not specific about the nature of the innovation vector $\{\varepsilon_t\}_{t\in\mathbb{Z}}$, our freedom here is considerable. Typically, the label innovations applies only to sequences that exhibit weak or no temporal dependence, such as white noise processes. Those two theorems, however, allowed the vector process $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ to contain dependence, as long as it was SE and exogenous in nature. As such, there is a host of variables that can potentially fit that definition.
As an example, consider the following nonlinear autoregressive distributed lag (NADL) model
$$x_{t+1} = f(x_t, z_t, \varepsilon_t; \theta), \quad t \in \mathbb{N},$$
where $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is a sequence of iid innovations and $\{z_t\}_{t\in\mathbb{Z}}$ denotes some exogenous SE time-series. Clearly, the theorems above can be directly applied by defining the innovation vector in those theorems as $(\varepsilon_t, z_t)$ for all $t$. It is really that simple!
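For instance, an NADL recursion can be simulated by stacking the iid innovation and the exogenous series into one innovation vector. The sketch below (in Python) uses an assumed functional form for $f$ and an assumed AR(1) driver for $\{z_t\}$, purely for illustration:

import numpy as np

rng = np.random.default_rng(10)
T = 500
eps = rng.normal(size=T)                 # iid innovations
z = np.empty(T); z[0] = 0.0
for t in range(T - 1):                   # an exogenous SE AR(1) driver (assumption)
    z[t + 1] = 0.8 * z[t] + rng.normal()

def f(x, z_t, e, theta):
    a, b = theta
    return (a / (1 + np.exp(-z_t))) * x + b + e   # |a * logistic(z)| < 1 keeps the map contracting

x = np.empty(T); x[0] = 0.0
for t in range(T - 1):
    x[t + 1] = f(x[t], z[t], eps[t], theta=(0.9, 0.2))
print(x[:5])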
Below we treat explicitly the example of a random coefficient ADL (RC-ADL)
model with autoregressive dynamics in the time-varying coefficients. From here the
reader should immediately guess how to proceed for other models.
Example: (Gaussian RC-ADL model) Consider the Gaussian random-coefficient ADL model for $\{x_t\}_{t\in\mathbb{N}}$ initialized at some $x_1 \in \mathbb{R}$,
$$x_{t+1} = \alpha_t x_t + \beta_t z_t + \varepsilon_t, \quad t \in \mathbb{N}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2),$$
where $\{\alpha_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(b, \sigma_\alpha^2)$, $\{\beta_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(a, \sigma_\beta^2)$, and $\{z_t\}_{t\in\mathbb{Z}}$ is some exogenous sequence.
In the model above, the innovation vector $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ that appears in both Bougerol's Theorem and the Power-n Theorem effectively takes the form $\varepsilon_t = (\alpha_t, \beta_t, z_t, \varepsilon_t)$. We recall that, in this context, the exogeneity of the vector process $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ (which we always assume implicitly!) ensures that $\alpha_t$, $\beta_t$, $z_t$ and $\varepsilon_t$ are independent of each other at all leads and lags, as well as independent of present and past values of $x_t$. As we shall now see, this random-coefficient ADL model generates asymptotically SE time-series if $\alpha_t$ is smaller than 1 on average.
Proposition 9 (SE Gaussian RC-ADL) Let $\{z_t\}_{t\in\mathbb{Z}}$ be an exogenous sequence satisfying $E|z_t|^{1+\delta} < \infty$ for some $\delta > 0$, and let $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ be generated by the Gaussian RC-ADL model above, parameterized by a vector $\theta = (b, a, \sigma_\alpha^2, \sigma_\beta^2, \sigma_\varepsilon^2)$ which ensures that $E|\alpha_t| < 1$, $0 < \sigma_\beta^2 < \infty$ and $0 < \sigma_\varepsilon^2 < \infty$. Then $\{x_t(\theta, x_1)\}_{t\in\mathbb{N}}$ converges e.a.s. to a limit SE sequence $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ for any initialization $x_1 \in \mathbb{R}$.
Proof. Condition A1 of Bougerol's Theorem holds trivially since the innovations $\{\alpha_t\}_{t\in\mathbb{N}}$, $\{\beta_t\}_{t\in\mathbb{N}}$ and $\{\varepsilon_t\}_{t\in\mathbb{N}}$ are iid, and $\{z_t\}_{t\in\mathbb{N}}$ is SE.
Condition A2 of Bougerol's Theorem holds because for any given value $x_1 \in \mathbb{R}$ we have
$$E\log^+|\phi(x_1, z_t, \varepsilon_t)| = E\log^+|\alpha_t x_1 + \beta_t z_t + \varepsilon_t| \leq E|\alpha_t x_1 + \beta_t z_t + \varepsilon_t| \leq |x_1|\,E|\alpha_t| + E|\beta_t z_t| + E|\varepsilon_t| < \infty.$$
The second inequality holds by the sub-additivity of the absolute value. The term $|x_1|$ is a bounded constant in $\mathbb{R}$; $E|\alpha_t| < 1 < \infty$ holds by assumption; $E|\beta_t z_t| < \infty$ holds by the generalized Hölder inequality because $\{\beta_t\}$ is NID with bounded variance (and hence has bounded moments of any order, i.e. $E|\beta_t|^q < \infty$ for all $q > 0$) and $E|z_t|^{1+\delta} < \infty$ by assumption. Finally, $E|\varepsilon_t| < \infty$ because $\{\varepsilon_t\}$ is NID with finite variance and hence has bounded moments of any order.
Condition A3 of Bougerol's Theorem holds since $\partial\phi(x, \varepsilon_t)/\partial x = \alpha_t$, and hence,
$$E\log\sup_{x\in X}\Big|\frac{\partial\phi(x, \varepsilon_t)}{\partial x}\Big| < 0 \iff E\log\sup_{x\in X}|\alpha_t| < 0 \iff E\log|\alpha_t| < 0,$$
and $E\log|\alpha_t| < 0$ is implied by $E|\alpha_t| < 1$ by Jensen's inequality. □

Notice that the Power-n Theorem could be applied to show that $\{x_t\}_{t\in\mathbb{N}}$ has bounded moments of a certain order. This is left for the reader to explore as an exercise at the end of this chapter. Note also that we could have explicitly stated a DGP for $\{z_t\}_{t\in\mathbb{Z}}$. For example, $\{z_t\}_{t\in\mathbb{Z}}$ could be generated exogenously by a logistic SESTAR model. Bougerol's Theorem and the Power-n Theorem could then be applied to show that $\{z_t\}_{t\in\mathbb{Z}}$ is SE and that it has the required bounded moments. The last exercise of this chapter explores this possibility!

4.4 Exercises
1. Which of the following models are nested?


(a) Gaussian AR(1) and RCAR(1)
(b) Gaussian AR(1) and Gaussian SESTAR
(c) Gaussian AR(1) and logistic AR(1)
(d) Gaussian AR(1) and LL
(e) GARCH and QGARCH
2. Show that any model nests itself.
3. Show that if model A nests model B and model B nests model A, then they
are the same model; i.e. A = B.
4. Let $\{x_t\}_{t\in\mathbb{Z}}$ be a sequence of iid standard normal random variables; i.e. $x_t \sim N(0, 1)$ for all $t \in \mathbb{Z}$. Which of the following models is well specified?
(a) $\{x_t\}_{t\in\mathbb{Z}} \sim \mathrm{UID}(a, b)$, with $(a, b) \in \mathbb{R}^2$, where UID stands for uniformly independently distributed; i.e. each $x_t$ follows a uniform $U(a, b)$ distribution.
(b) $\{x_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(\mu, \sigma^2)$, with $\mu > 0$ and $\sigma^2 > 0$.
(c) $\{x_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma^2)$, with $\sigma^2 > 0$.
(d) $\{x_t\}_{t\in\mathbb{Z}} \sim \mathrm{SNID}(\mu, \sigma^2, \lambda)$, with $(\mu, \sigma^2, \lambda) \in \mathbb{R}^3$, where SNID stands for skewed normal independently distributed; i.e. each $x_t$ follows a skew normal distribution $\mathrm{SN}(\mu, \sigma^2, \lambda)$ where $\lambda$ is the skewness parameter.
(e) Unrestricted Gaussian AR(1) model: $\{x_t\}_{t\in\mathbb{Z}}$ is generated by
$$x_t = \delta + \alpha x_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma^2)$ with $(\delta, \alpha, \sigma^2) \in \mathbb{R}^3$.
(f) Restricted Gaussian MA(2) model: $\{x_t\}_{t\in\mathbb{Z}}$ is generated by
$$x_t = \theta_0\varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2},$$
where $\{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 9)$ with $(\theta_0, \theta_1, \theta_2) \in \mathbb{R}^3$.
(g) GARCH model: $\{x_t\}_{t\in\mathbb{Z}}$ is generated by
$$x_t = \sigma_t\varepsilon_t, \qquad \sigma_{t+1}^2 = \omega + \alpha x_t^2 + \beta\sigma_t^2,$$
where $\{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 1)$ with $\omega \geq 0$, $\alpha \geq 0$ and $\beta \geq 0$.


5. Comment on the following popular statement: All models are mis-specified!
6. Given two models, A and B, designed to describe a given time-series, comment on the following statement: Only one model can be correctly specified! In other words, either model A is correctly specified or model B is correctly specified.
7. Suppose that model A nests models B and C. Suppose further that the model D := B ∪ C nests model A. Comment on the following statements:
(a) if model A is well specified, then model B is also well specified
(b) if model C is well specified, then model A is also well specified
(c) if model A is well specified, then model D is also well specified
(d) if model D is well specified and model B is mis-specified, then model C is well specified.
8. Comment on the following statements:
(a) if a time-series $\{x_t\}_{t\in\mathbb{Z}}$ is weakly stationary then it admits an ARMA(p, q) representation
(b) if a time-series is weakly stationary then it is also strictly stationary
(c) if a time-series is strictly stationary then it is also weakly stationary
(d) if a time-series is m-dependent then it is strictly stationary
(e) if a time-series is m-dependent then it is weakly stationary
(f) if a time-series is iid, then it is m-dependent
(g) if a time-series is white noise, then it is m-dependent
(h) if a time-series is white noise, then it is strictly stationary
9. Which of the following DGPs generate m-dependent time-series? How about
strictly stationary and ergodic time-series?
(a) Gaussian MA(2):
$$x_t = \varepsilon_t + 0.2\varepsilon_{t-1} + 3\varepsilon_{t-2}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 1).$$
(b) Fat-tailed MA(2):[15]
$$x_t = \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{TID}(\lambda),$$
for some $(\theta_1, \theta_2) \in \mathbb{R}^2$ and $2 < \lambda < \infty$.
(c) Fat-tailed AR(1):
$$x_t = 0.7x_{t-1} + \varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{TID}(\lambda).$$
[15] TID($\lambda$) stands for Student's t independently distributed; i.e. elements of the sequence are independently identically distributed with a Student's t distribution with $\lambda$ degrees of freedom.
(d) ARMA(1,1):[16]
$$x_t = 0.3x_{t-1} + \varepsilon_t + 2\varepsilon_{t-1}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{WN}(0, 2).$$
(e) Random-coefficient Gaussian MA(1):
$$x_t = \varepsilon_t + \alpha_t\varepsilon_{t-1}, \qquad \alpha_t \sim N(\mu_\alpha, \sigma_\alpha^2), \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2),$$
for some strictly positive $\mu_\alpha$, $\sigma_\alpha^2$ and $\sigma_\varepsilon^2$.
(f) Random-coefficient AR(1):[17]
$$x_t = \alpha_t x_{t-1} + \varepsilon_t, \qquad \alpha_t \sim U(0, 1.5), \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 1).$$
(g) Nonlinear MA(2):
$$x_t = \varepsilon_t^2 + \sin(2 + 3\varepsilon_{t-1}\varepsilon_{t-2}), \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 1).$$
(h) Fat-tailed sigmoid AR(1):
$$x_t = \delta + \alpha\cos(x_{t-1}) + \varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{TID}(3),$$
for some $\delta \in \mathbb{R}$ and $|\alpha| < 1$.


(i) STAR with exogenous AR(1) driver:
$$x_t = g(z_{t-1}; \theta)x_{t-1} + \varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2), \quad t \in \mathbb{Z},$$
$$g(z_{t-1}; \theta) := 2 + \frac{\gamma}{1 + \exp(0.13 + 24z_{t-1})}, \qquad z_t = 0.85z_{t-1} + v_t, \qquad \{v_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_v^2).$$
(j) Logistic SESTAR:
$$x_t = g(x_{t-1}; \theta)x_{t-1} + \varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2), \quad t \in \mathbb{Z},$$
$$g(x_{t-1}; \theta) := 0.2 + \frac{\gamma}{1 + \exp(2 + 1.4x_{t-1})}.$$
(k) Exponential SESTAR:
$$x_t = g(x_{t-1}; \theta)x_{t-1} + \varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, \sigma_\varepsilon^2), \quad t \in \mathbb{Z},$$
$$g(x_{t-1}; \theta) := 0.1 + \frac{\gamma}{1 + \exp\!\big(0.5 + 7.2(x_{t-1})^2\big)}.$$
[16] WN(0, $\sigma^2$) stands for white noise with mean zero and variance $\sigma^2$.
[17] UID(0, 1.5) stands for uniformly independently distributed; i.e. elements of the sequence are independently identically distributed with a uniform distribution between 0 and 1.5.

(l) Gaussian local-level model:
$$x_t = \mu_t + \varepsilon_t, \qquad \{\varepsilon_t\} \sim \mathrm{NID}(0, 3), \qquad \mu_t = 5 + 0.9\mu_{t-1} + v_t, \qquad \{v_t\} \sim \mathrm{NID}(0, 0.6).$$
(m) Fat-tailed stochastic volatility model:
$$x_t = \sigma_t\varepsilon_t, \qquad \{\varepsilon_t\} \sim \mathrm{NID}(0, 1), \qquad \sigma_t^2 = \omega + \beta\sigma_{t-1}^2 + v_t, \qquad \{v_t\} \sim \mathrm{TID}(\lambda),$$
for some $\omega \in \mathbb{R}$, some $|\beta| < 1$ and $\lambda > 5$.
(n) Gaussian observation-driven local-level model:
$$x_t = \mu_t + \varepsilon_t, \qquad \{\varepsilon_t\} \sim \mathrm{NID}(0, 1), \qquad \mu_t = 0.5 + 0.4(x_{t-1} - \mu_{t-1}) + 0.7\mu_{t-1}.$$
(o) GARCH:
$$x_t = \sigma_t\varepsilon_t, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 1), \qquad \sigma_t^2 = \omega + \alpha x_{t-1}^2 + \beta\sigma_{t-1}^2,$$
for some $(\omega, \alpha) \in \mathbb{R}^2$ and $|\beta| < 1$.
(p) Robust GARCH:
$$x_t = \sigma_t\varepsilon_t, \qquad \{\varepsilon_t\} \sim \mathrm{TID}(7), \qquad \sigma_t^2 = 5 + 0.2\tanh(x_{t-1}^2) + 0.4\sigma_{t-1}^2.$$
(q) NGARCH:
$$x_t = \sigma_t\varepsilon_t, \qquad \{\varepsilon_t\} \sim \mathrm{NID}(0, 1), \qquad \sigma_t^2 = 0.5 + 0.1(x_{t-1} - \sigma_{t-1})^2 + 0.9\sigma_{t-1}^2.$$
(r) QGARCH:
$$x_t = \sigma_t\varepsilon_t, \qquad \{\varepsilon_t\} \sim \mathrm{NID}(0, 1), \qquad \sigma_t^2 = 0.5 + 0.1x_{t-1}^2 + 0.2x_{t-1} + 0.3\sigma_{t-1}^2.$$
t2 = 0.5 + 0.1x2t1 + 0.2xt1 + 0.3t1

10. Which of the DGPs in the previous question generate time-series $\{x_t\}$ that satisfy a law of large numbers? Which DGPs generate time-series $\{x_t\}$ that satisfy a central limit theorem? Which DGPs generate a time-series $\{x_t\}$ with a bounded fourth moment?

11. Which of the DGPs above can be written as
$$x_t = \sum_{j=0}^{\infty}\psi_j z_{t-j} + v_t,$$
where
a. $\psi_0 = 1$ and $\sum_{j=0}^{\infty}\psi_j^2 < \infty$,
b. $\{z_t\}_{t\in\mathbb{Z}} \sim \mathrm{WN}(0, \sigma^2)$,
c. $\mathrm{Cov}(z_s, v_t) = 0$ for all $(s, t) \in \mathbb{Z} \times \mathbb{Z}$,
d. $\{v_t\}_{t\in\mathbb{Z}}$ is deterministic (non-random).
12. State sufficient conditions for the strict stationarity and ergodicity of all time
series generated by the following models:
(a) Nonlinear MA(q):
$$x_t = h(\varepsilon_t, \varepsilon_{t-1}, \ldots, \varepsilon_{t-q}; \theta), \qquad \varepsilon_t \sim p_\varepsilon(\cdot; \theta),$$
(b) NLAR(1):
$$x_t = \phi(x_{t-1}, \varepsilon_t, \theta), \qquad \varepsilon_t \sim p_\varepsilon(\cdot; \theta),$$
(c) NLARMA(p,q):
$$x_t = \phi(x_{t-1}, \ldots, x_{t-p}, \varepsilon_t, \ldots, \varepsilon_{t-q}, \theta), \qquad \varepsilon_t \sim p_\varepsilon(\cdot; \theta),$$
(d) Nonlinear parameter-driven model:
$$x_t = g(f_t, \varepsilon_t), \qquad \varepsilon_t \sim p_\varepsilon(\cdot; \theta), \qquad f_t = \phi(f_{t-1}, v_t, \theta), \qquad v_t \sim p_v(\cdot; \theta),$$
(e) Nonlinear observation-driven model:
$$x_t = g(f_t, \varepsilon_t), \qquad \varepsilon_t \sim p_\varepsilon(\cdot; \theta), \qquad f_t = \phi(f_{t-1}, x_{t-1}, \theta).$$
13. Find a model that satisfies Bougerol's contraction (condition A3) of Theorem 6 only on a degenerate set.
14. Find a model that never satisfies Bougerol's contraction (condition A3) of Theorem 6.
15. Find a model that does not satisfy the moment bound (condition A2) of Theorem 6.
16. Consider the RC-ADL model analyzed in Proposition 9. Give sufficient conditions for $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ to have two bounded moments.
17. Consider the following system
$$x_{t+1} = \alpha_t x_t + \beta_t z_t + \varepsilon_t, \quad t \in \mathbb{N}, \qquad \{\varepsilon_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 0.45), \qquad \{\alpha_{t+1}\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0.8, 0.07),$$
where
$$\beta_t = 0.97\beta_{t-1} + v_t, \quad t \in \mathbb{Z}, \qquad \{v_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 0.03),$$
$$z_t = 0.9z_{t-1} + w_t + 2.3w_{t-1}, \quad t \in \mathbb{Z}, \qquad \{w_t\}_{t\in\mathbb{Z}} \sim \mathrm{NID}(0, 1.1).$$
Can you show that $\{x_t(\theta)\}_{t\in\mathbb{Z}}$ is SE and has two bounded moments?
