Contents
1 Introduction
1.3 Some formalism
1.5 Stationary processes
2.1 Motivation
3.2.3 The variance/covariance matrix and precision matrix of an autoregressive and moving average process
5 Prediction
7 Parameter estimation
8 Spectral Representations
8.7 Extensions
9 Spectral Analysis
B.1 Obtaining almost sure rates of convergence for some sums
B.2 Proof of Theorem 10.7.3
Preface
The material for these notes comes from several different places, in particular:
Brockwell and Davis (1998)
Shumway and Stoffer (2006) (a shortened version is Shumway and Stoffer EZ).
Fuller (1995)
Pourahmadi (2001)
Priestley (1983)
Box and Jenkins (1970)
A whole bunch of articles.
Tata Subba Rao and Piotr Fryzlewicz were very generous in giving advice and sharing
homework problems.
When doing the homework, you are encouraged to use all materials available, including
Wikipedia and Mathematica/Maple (software which allows you to easily derive analytic
expressions; a web-based version which is not sensitive to syntax is WolframAlpha).
You are encouraged to use R (see David Stoffer's tutorial). I have tried to include
R code in the notes so that you can replicate some of the results.
Exercise questions will be in the notes and will be set at regular intervals.
You will be given some projects at the start of the semester, from which you should select one and
then present it in November.
Chapter 1
Introduction
A time series is a series of observations $x_t$, observed over a period of time. Typically the
observations can be over an entire interval, randomly sampled on an interval or at fixed time
points. Different types of time sampling require different approaches to the data analysis.
In this course we will focus on the case that observations are observed at fixed equidistant
time points, hence we will suppose we observe $\{x_t : t \in \mathbb{Z}\}$ ($\mathbb{Z} = \{\ldots, 0, 1, 2, \ldots\}$).
Let us start with a simple example, independent, uncorrelated random variables (the
simplest example of a time series). A plot is given in Figure 1.1. We observe that there
aren't any clear patterns in the data. Our best forecast (predictor) of the next observation
is zero (which appears to be the mean). The feature that distinguishes a time series from
classical statistics is that there is dependence in the observations. This allows us to obtain
better forecasts of future observations. Keep Figure 1.1 in mind, and compare this to the
following real examples of time series (observe that in all these examples you see patterns).
1.1
[Figure 1.1: plot of whitenoise against Time]
[Figure: plot of soi1 against Time, 1880 to 2020]
[Figure: plot of nasdaq1 against Time, 1985 to 2015]
the goal is to make money! Therefore the main object is to forecast (predict future volatility).
[Figure: plot of sunspot1 against Time, 1700 to 2000]
1.1.1
R code
A large number of the methods and concepts will be illustrated in R. If you are not familiar
with this language please learn the very basics.
Here we give the R code for making the plots above.
# assuming the data is stored in your main directory we scan the data into R
soi <- scan("~/soi.txt")
# ts creates a time series object: start = c(year, month) of the first observation,
# frequency = number of observations per
# unit of time (year). As the data is monthly it is 12.
soi1 <- ts(soi, start=c(1876,1), frequency=12)
plot.ts(soi1)
[Figure 1.5: Plot of global, yearly average, temperature anomalies, 1880-2013 (series temp against Time)]
[Figure 1.6: Plot of global, monthly average, temperatures January, 1996 - July, 2014 (series monthlytemp1 against Time)]
1.2
In time series, the main focus is on modelling the relationship between observations. Time
series analysis is usually performed after the data has been detrended. In other words, if
$Y_t = \mu_t + \varepsilon_t$, where $\{\varepsilon_t\}$ is a zero mean time series, we first estimate $\mu_t$ and then conduct the
time series analysis on the residuals. Once the analysis has been performed, we return to
the trend estimators and use the results from the time series analysis to construct confidence
intervals etc. In this course the main focus will be on the data after detrending. However,
we start by reviewing some well known detrending methods. A very good primer is given in
Shumway and Stoffer, Chapter 2, and you are strongly encouraged to read it.
1.2.1
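$$Y_t = \beta_0 + \beta_1 t + \varepsilon_t \quad \text{(linear trend)} \tag{1.1}$$
$$Y_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \varepsilon_t \quad \text{(quadratic trend)}. \tag{1.2}$$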
For example we may fit such models to the yearly average temperature data. Alternatively
we may want to include seasonal terms
$$Y_t = \beta_0 + \beta_1 \sin\left(\frac{2\pi t}{12}\right) + \beta_3 \cos\left(\frac{2\pi t}{12}\right) + \varepsilon_t.$$
For example, we may believe that the Southern Oscillation Index has a period 12 (since
the observations are taken monthly) and we use sine and cosine functions to model the
seasonality. For these types of models, least squares can be used to estimate the parameters.
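As a rough illustration of fitting trend and seasonal terms by least squares, the sketch below simulates a monthly series and fits it with lm. The data and all coefficient values are hypothetical (they are not the temperature or SOI data), and lm is used here in place of lsfit purely for convenience.

```r
# Simulated monthly data with a linear trend and a period-12 seasonal component.
# All coefficient values are hypothetical and chosen only for illustration.
set.seed(1)
n <- 240
t <- 1:n
y <- 1 + 0.02 * t + 3 * sin(2 * pi * t / 12) + 2 * cos(2 * pi * t / 12) + rnorm(n)

# Linear trend plus seasonal regressors, estimated by least squares.
fit <- lm(y ~ t + sin(2 * pi * t / 12) + cos(2 * pi * t / 12))
beta_hat <- unname(coef(fit))   # order: intercept, t, sin, cos
print(round(beta_hat, 3))

# The residuals are the detrended series on which a time series analysis would be run.
res <- residuals(fit)
```

The standard errors printed by summary(fit) assume uncorrelated errors, so they should be read with caution when the residuals are dependent.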
Remark 1.2.1 (Taking differences to avoid fitting linear and higher order trends)
A commonly used method to avoid fitting a linear trend is to take first differences.
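A small sketch of the idea, using simulated data with hypothetical coefficients: differencing turns the linear trend into a constant, at the price of changing the noise from $\varepsilon_t$ to $\varepsilon_t - \varepsilon_{t-1}$.

```r
# Deterministic linear trend plus iid noise (hypothetical coefficients).
set.seed(2)
n <- 200
t <- 1:n
y <- 5 + 0.3 * t + rnorm(n)

d1 <- diff(y)   # first differences Y_t - Y_{t-1}, length n - 1
# The trend 5 + 0.3 t becomes the constant 0.3, so d1 has no trend left.
slope_before <- unname(coef(lm(y ~ t))[2])
slope_after  <- unname(coef(lm(d1 ~ t[-1]))[2])
print(c(mean_d1 = mean(d1), slope_before = slope_before, slope_after = slope_after))
```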
(i) Import the yearly temperature data (file global mean temp.txt) into R
and fit the linear model in (1.1) to the data (use the R command lsfit).
(ii) Suppose the errors in (1.1) are correlated. Under the correlated assumption, explain
why the standard errors reported in the R output are unreliable.
(iii) Make a plot of the residuals after fitting the linear model in (i). Make a plot of the
first differences. What do you notice about the two plots? Are they similar?
(What I found was quite strange.)
The AIC (Akaike Information Criterion) is usually used to select the parameters in the model
(see wiki). You should have studied the AIC/AICc/BIC in several of the prerequisites you
have taken. In this course it will be assumed that you are familiar with it.
1.2.2
In Section 1.2.1 we assumed that the mean had a certain known parametric form. This may
not always be the case. If we have no a priori idea of what features may be in the mean, we
can estimate the mean trend using a nonparametric approach. However, without any a priori
knowledge of the mean function we cannot estimate it without placing some assumptions on
its structure. The most common is to assume that the mean $\mu_t$ is a sample from a smooth
function, i.e. $\mu_t = \mu(t/n)$. Under this assumption the following approaches are valid.
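Possibly one of the simplest methods is to use a rolling window. There are several
windows that one can use. We describe, below, the exponential window, since it can be
computed recursively:
$$\hat{\mu}_t = (1-\lambda)\hat{\mu}_{t-1} + \lambda Y_t,$$
where $0 < \lambda < 1$. The choice of $\lambda$ depends on how much weight one wants to give the present
observation. Iterating the recursion from the starting value $\hat{\mu}_0 = 0$, it is straightforward to show that
$$\hat{\mu}_t = \sum_{j=1}^{t} (1-\lambda)^{t-j}\lambda Y_j.$$
Writing $(1-\lambda) = e^{-1/b}$ and $K(u) = e^{-u}$ for $u \geq 0$ (with $K(u) = 0$ for $u < 0$), this becomes
$$\hat{\mu}_t = \underbrace{(1-e^{-1/b})}_{\approx b^{-1}} \sum_{j=1}^{n} K\left(\frac{t-j}{b}\right) Y_j.$$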
Thus we observe that the exponential rolling window estimator is very close to a nonparametric kernel estimator of the mean, which has the form
$$\tilde{\mu}_t = \sum_{j=1}^{n} \frac{1}{b} K\left(\frac{t-j}{b}\right) Y_j.$$
It is likely you came across such estimators in your nonparametric classes. The main difference between the rolling window estimator and the nonparametric kernel estimator is that
the kernel/window for the rolling window is not symmetric. This is because we are trying
to estimate the mean at time $t$, given only the observations up to time $t$. Whereas for nonparametric kernel estimators we can use observations on both sides of the neighbourhood of
$t$.
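The link between the recursion and the one-sided kernel sum can be checked numerically. The sketch below is only an illustration: it assumes the recursion is started at zero, uses a simulated series with a smooth quadratic mean, and picks $\lambda = 0.1$ arbitrarily. The bandwidth is taken as $b = -1/\log(1-\lambda)$, so that $(1-\lambda) = e^{-1/b}$.

```r
set.seed(3)
n <- 200
u <- (1:n) / n
y <- 5 * (2 * u - 2.5 * u^2) + 20 + rnorm(n)   # smooth quadratic mean plus iid noise
lambda <- 0.1                                   # hypothetical smoothing weight

# Recursion mu_t = (1 - lambda) * mu_{t-1} + lambda * Y_t, started at mu_0 = 0.
mu_rec <- as.numeric(stats::filter(lambda * y, 1 - lambda, method = "recursive"))

# The same estimator written as a one-sided kernel sum with K(u) = exp(-u) for u >= 0.
b <- -1 / log(1 - lambda)
mu_ker <- sapply(1:n, function(s) {
  j <- 1:s
  (1 - exp(-1 / b)) * sum(exp(-(s - j) / b) * y[j])
})
print(max(abs(mu_rec - mu_ker)))   # the two versions agree up to rounding error
```

Because the recursion starts at zero, the first few values of the smoother are biased towards zero; in practice one might start it at the first observation instead.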
Other types of estimators include sieve estimators. This is where we expand $\mu(u)$ in terms
of an orthogonal basis $\{\phi_k(u); k \in \mathbb{Z}\}$
$$\mu(u) = \sum_{k=1}^{\infty} a_k \phi_k(u).$$
Examples of basis functions are the Fourier $\phi_k(u) = \exp(iku)$, Haar/other wavelet functions
etc. We observe that the unknown coefficients $a_k$ enter linearly, with the $\phi_k$ as regressors. Thus
we can use least squares to estimate the coefficients $\{a_k\}$. To do so, we
truncate the above expansion to order $M$ and minimise
$$\sum_{t=1}^{n} \left[ Y_t - \sum_{k=1}^{M} a_k \phi_k\left(\frac{t}{n}\right) \right]^2. \tag{1.3}$$
The orthogonality of the basis means that the least squares estimator $\hat{a}_k$ is
$$\hat{a}_k \approx \frac{1}{n} \sum_{t=1}^{n} Y_t \phi_k\left(\frac{t}{n}\right).$$
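The following sketch illustrates the sieve idea with a real-valued Fourier basis (cosines and sines, rescaled by $\sqrt{2}$) rather than the complex exponentials. The mean function, the noise level and the truncation order $M$ are all hypothetical choices. On the equally spaced grid $t/n$ this basis is exactly orthogonal, so the least squares solution and the simple average $\frac{1}{n}\sum_t Y_t\phi_k(t/n)$ coincide.

```r
set.seed(4)
n <- 200
u <- (1:n) / n
y <- 2 + sin(2 * pi * u) + 0.5 * cos(4 * pi * u) + rnorm(n, sd = 0.5)

M <- 3   # truncation order (hypothetical choice)
# Real Fourier basis, scaled so that (1/n) * sum_t phi_k(t/n)^2 = 1.
Phi <- cbind(1,
             sapply(1:M, function(k) sqrt(2) * cos(2 * pi * k * u)),
             sapply(1:M, function(k) sqrt(2) * sin(2 * pi * k * u)))

a_ls   <- unname(coef(lm(y ~ Phi - 1)))        # least squares solution of (1.3)
a_proj <- as.numeric(crossprod(Phi, y)) / n    # (1/n) * sum_t Y_t phi_k(t/n)
mu_hat <- as.numeric(Phi %*% a_ls)             # fitted mean function
print(max(abs(a_ls - a_proj)))
```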
It is worth pointing out that regardless of the method used, correlations in the errors
$\{\varepsilon_t\}$ will play a role in the quality of the estimator and even in the choice of bandwidth, $b$, or
equivalently the number of basis functions, $M$ (see Hart (1991)). To understand why, suppose
the mean function is $\mu_t = \mu(t/200)$ (the sample size $n = 200$), where $\mu(u) = 5(2u - 2.5u^2) + 20$.
We corrupt this quadratic function with both iid and dependent noise (the dependent noise is
the AR(2) process defined in equation (1.6)). The plots are given in Figure 1.7. We observe
that the dependent noise looks smooth (dependence can induce smoothness in a realisation).
This means that in the case that the mean has been corrupted by dependent noise it is difficult
to see that the underlying trend is a simple quadratic function.
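A sketch of how such a comparison could be simulated is given below. The seed and plotting layout are arbitrary choices, and the AR(2) coefficients 1.5 and -0.75 are those of equation (1.6); this is not necessarily the code used to produce Figure 1.7.

```r
set.seed(5)
n <- 200
u <- (1:n) / n
mu <- 5 * (2 * u - 2.5 * u^2) + 20               # the quadratic mean function

iid <- rnorm(n)
ar2 <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = n)   # AR(2) noise

par(mfrow = c(2, 2))
plot(u, iid, type = "l"); plot(u, ar2, type = "l")
plot(u, mu + iid, type = "l"); lines(u, mu, col = "red")
plot(u, mu + ar2, type = "l"); lines(u, mu, col = "red")

# Lag-one sample autocorrelation: near zero for iid noise, large for the AR(2) noise.
r1 <- c(iid = acf(iid, plot = FALSE)$acf[2], ar2 = acf(ar2, plot = FALSE)$acf[2])
print(round(r1, 2))
```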
1.2.3
Suppose that the observations $\{Y_t; t = 1, \ldots, n\}$ satisfy the following regression model
$$Y_t = A\cos(\omega t) + B\sin(\omega t) + \varepsilon_t$$
where $\{\varepsilon_t\}$ are iid standard normal random variables and $0 < \omega < \pi$. The parameters $A$, $B$,
and $\omega$ are real and unknown. Unlike the regression models given in Section 1.2.1 the model here
is nonlinear, since the unknown parameter, $\omega$, is inside a trigonometric function. Standard
least squares methods cannot be used to estimate the parameters. Assuming Gaussianity of
$\{\varepsilon_t\}$, the log-likelihood corresponding to the model is (up to an additive constant)
$$L_n(A, B, \omega) = -\frac{1}{2}\sum_{t=1}^{n} (Y_t - A\cos(\omega t) - B\sin(\omega t))^2.$$
[Figure 1.7: Top: realisations from iid random noise and dependent noise (left = iid and right
= dependent). Bottom: Quadratic trend plus corresponding noise.]
Nonlinear least squares (which would require the use of a numerical maximisation
scheme) can be employed to estimate the parameters. However, using some algebraic manipulations, explicit expressions for the estimators can be obtained (see Walker (1971) and
Exercise 1.3). These are
$$\hat{\omega}_n = \arg\max_{\omega} I_n(\omega)$$
where
$$I_n(\omega) = \frac{1}{n}\left|\sum_{t=1}^{n} Y_t \exp(it\omega)\right|^2 \tag{1.4}$$
(the maximum is taken over the frequencies $\omega_k = \frac{2\pi k}{n}$ for $1 \leq k \leq n$),
$$\hat{A}_n = \frac{2}{n}\sum_{t=1}^{n} Y_t \cos(\hat{\omega}_n t) \quad \text{and} \quad \hat{B}_n = \frac{2}{n}\sum_{t=1}^{n} Y_t \sin(\hat{\omega}_n t).$$
The rather remarkable aspect of this result is that the rate of convergence of $|\hat{\omega}_n - \omega| = O(n^{-3/2})$, which is faster than the standard $O(n^{-1/2})$ that we usually encounter (we will see
this in Example 1.2.1).
$I_n(\omega)$ is usually called the periodogram. Searching for peaks in the periodogram is a long
established method for detecting periodicities. If we believe that there are two or more
periods in the time series, we can generalize the method to searching for the largest and
second largest peak etc. We consider an example below.
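The estimators above can be sketched in a few lines of R. The example below simulates a sinusoid with frequency $2\pi/8$ and amplitude 2 plus iid standard normal noise; the seed and the restriction of the search to $0 < \omega < \pi$ are choices made here for illustration.

```r
set.seed(6)
n <- 128
t <- 1:n
y <- 2 * sin(2 * pi * t / 8) + rnorm(n)        # true frequency 2*pi/8, A = 0, B = 2

# Periodogram I_n(omega_k) = |sum_t Y_t exp(i t omega_k)|^2 / n at omega_k = 2*pi*k/n.
omega <- 2 * pi * (1:(n / 2 - 1)) / n          # search over 0 < omega < pi only
I_n <- sapply(omega, function(w) Mod(sum(y * exp(1i * t * w)))^2 / n)

omega_hat <- omega[which.max(I_n)]
A_hat <- (2 / n) * sum(y * cos(omega_hat * t))
B_hat <- (2 / n) * sum(y * sin(omega_hat * t))
print(c(omega_hat = omega_hat, A_hat = A_hat, B_hat = B_hat))
```

The same periodogram values could be obtained faster with fft; the explicit sum is used only to mirror definition (1.4).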
Example 1.2.1 Consider the following model
$$Y_t = 2\sin\left(\frac{2\pi t}{8}\right) + \varepsilon_t, \qquad t = 1, \ldots, n, \tag{1.5}$$
where $\varepsilon_t$ are iid standard normal random variables. It is clear that $\{Y_t\}$ is made up of
a periodic signal with period eight. In Figure 1.8 we give a plot of one realisation (using sample size
$n = 128$) together with a plot of the periodogram $I(\omega)$ (defined in (1.4)). We observe that there is a symmetry:
because of the $e^{i\omega}$ in the definition of $I(\omega)$ we can show that $I(\omega) = I(2\pi - \omega)$. Notice
there is a clear peak at frequency $2\pi/8 \approx 0.78$ (where we recall that 8 is the period).
[Figure 1.8: Left: Realisation of (1.5) with iid noise, Right: Periodogram]
This method works extremely well if the error process $\{\varepsilon_t\}$ is uncorrelated. However, problems arise when the errors are correlated. To illustrate this issue, consider again model (1.5)
but this time let us suppose the errors are correlated. More precisely, they satisfy the AR(2)
model,
$$\varepsilon_t = 1.5\varepsilon_{t-1} - 0.75\varepsilon_{t-2} + \epsilon_t, \tag{1.6}$$
where $\{\epsilon_t\}$ are iid random variables (do not worry if this does not make sense to you; we
define this class of models precisely in Chapter 2). As in the iid case we use a sample size
$n = 128$. In Figure 1.9 we give a plot of one realisation and the corresponding periodogram.
We observe that the peak at $2\pi/8$ is not the highest. The correlated errors (often called
coloured noise) are masking the peak by introducing new peaks.
[Figure 1.9: Left: Realisation of (1.5) with correlated noise and n = 128, Right: Periodogram]
To see what happens for
larger sample sizes, we consider exactly the same model (1.5) with the noise generated as
in (1.6). But this time we use $n = 1024$ (8 times the previous sample size). A plot of one
realisation, together with the periodogram is given in Figure 1.10.
[Figure 1.10: Left: Realisation of (1.5) with correlated noise and n = 1024, Right: Periodogram]
In contrast to the smaller
sample size, a large peak is seen at $2\pi/8$. These examples illustrate two important points:
(i) When the noise is correlated and the sample size is relatively small it is difficult to
disentangle the deterministic period from the noise. Indeed we will show in Chapters
2 and 3 that linear time series can exhibit similar types of behaviour to a periodic
deterministic signal. This is a subject of ongoing research that dates back at least 60
years (see Quinn and Hannan (2001)).
However, the similarity is only to a point. Given a large enough sample size (which
may in practice not be realistic), the deterministic frequency dominates again.
(ii) The periodogram holds important information about the correlations in the noise (observe the periodogram in both Figures 1.9 and 1.10): there is some interesting activity
in the lower frequencies, which appears to be due to the noise.
This is called spectral analysis and is explored in Chapters 8 and 9. Indeed a lot of
time series analysis can be done within the so-called frequency or time domain.
1, . . . , 128}.
(iv) Let $Y_t = 2\sin(\frac{2\pi t}{8})$. Plot the Periodogram of $\{Y_t; t = 1, \ldots, 128\}$.
(v) Let $Y_t = 2\sin(\frac{2\pi t}{8}) + 4\cos(\frac{2\pi t}{12})$. Plot the Periodogram of $\{Y_t; t = 1, \ldots, 128\}$.
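Exercise 1.3
(i) Let
$$S_n(A, B, \omega) = \sum_{t=1}^{n} Y_t^2 - 2\sum_{t=1}^{n} Y_t\left(A\cos(\omega t) + B\sin(\omega t)\right) + \frac{1}{2}n(A^2 + B^2).$$
Show that
$$2L_n(A, B, \omega) + S_n(A, B, \omega) = -\frac{(A^2 - B^2)}{2}\sum_{t=1}^{n}\cos(2t\omega) - AB\sum_{t=1}^{n}\sin(2t\omega),$$
and thus $|L_n(A, B, \omega) + \frac{1}{2}S_n(A, B, \omega)| = O(1)$ (i.e. the difference does not grow with
$n$).
Since $L_n(A, B, \omega)$ and $-\frac{1}{2}S_n(A, B, \omega)$ are asymptotically equivalent, (i) shows that we
can maximise $-\frac{1}{2}S_n(A, B, \omega)$ instead of the likelihood $L_n(A, B, \omega)$.
(ii) By profiling out the parameters $A$ and $B$, use the profile likelihood to show that
$\hat{\omega}_n = \arg\max_{\omega} |\sum_{t=1}^{n} Y_t \exp(it\omega)|^2$. The following identity (the sum of a geometric series) is useful:
$$\sum_{t=1}^{n}\exp(i\Omega t) = \begin{cases} \exp\left(\frac{1}{2}i(n+1)\Omega\right)\dfrac{\sin(\frac{1}{2}n\Omega)}{\sin(\frac{1}{2}\Omega)} & 0 < \Omega < 2\pi \\ n & \Omega = 0 \text{ or } 2\pi. \end{cases} \tag{1.7}$$
It implies that for $0 < \Omega < 2\pi$
$$\sum_{t=1}^{n} t\cos(\Omega t) = O(n), \quad \sum_{t=1}^{n} t\sin(\Omega t) = O(n), \quad \sum_{t=1}^{n} t^2\cos(\Omega t) = O(n^2), \quad \sum_{t=1}^{n} t^2\sin(\Omega t) = O(n^2).$$
Using the above identities, show that the Fisher Information of $L_n(A, B, \omega)$ (denoted
as $I(A, B, \omega)$) is asymptotically equivalent to
$$2I(A, B, \omega) = E\left(\nabla^2 S_n\right) = \begin{pmatrix} n & 0 & \frac{n^2}{2}B + O(n) \\ 0 & n & -\frac{n^2}{2}A + O(n) \\ \frac{n^2}{2}B + O(n) & -\frac{n^2}{2}A + O(n) & \frac{n^3}{3}(A^2 + B^2) + O(n^2) \end{pmatrix}.$$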
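(i) Simulate three hundred times from model (1.5) using $n = 128$. Estimate $\omega$, $A$ and $B$ for each simulation and obtain the empirical mean squared error
$\frac{1}{300}\sum_{i=1}^{300}(\hat{\theta}_i - \theta)^2$ (where $\theta$ denotes the parameter and $\hat{\theta}_i$ its estimate from the $i$th simulation).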
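Simulation and periodogram for model (1.5) with iid errors:
# signal, P and frequency are assumed to be defined as in the correlated case below
set.seed(10)
signal <- 2*sin(2*pi*c(1:128)/8) + rnorm(128)
P <- abs(fft(signal)/128)**2
frequency <- 2*pi*c(0:127)/128
par(mfrow=c(2,1))
plot.ts(signal)
plot(frequency, P,type="o")
Simulation and periodogram for model (1.5) with correlated errors:
set.seed(10)
# AR(2) errors from (1.6): coefficients 1.5 and -0.75
ar2 <- arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=128)
# note the amplitude here is 1.5, whereas model (1.5) is written with amplitude 2
signal2 <- 1.5*sin(2*pi*c(1:128)/8) + ar2
P2 <- abs(fft(signal2)/128)**2
frequency <- 2*pi*c(0:127)/128
par(mfrow=c(2,1))
plot.ts(signal2)
plot(frequency, P2,type="o")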
1.3
Some formalism
When we observe the time series $\{x_t\}$, usually we assume that $\{x_t\}$ is a realisation from a random
process $\{X_t\}$. We formalise this notion below. The random process $\{X_t; t \in \mathbb{Z}\}$ (where $\mathbb{Z}$ denotes
the integers) is defined on the probability space $\{\Omega, \mathcal{F}, P\}$. We explain what these mean below:
(i) $\Omega$ is the set of all possible outcomes. Suppose that $\omega \in \Omega$, then $\{X_t(\omega)\}$ is one realisation
from the random process. For any given $\omega$, $\{X_t(\omega)\}$ is not random. In time series we will
usually assume that what we observe $x_t = X_t(\omega)$ (for some $\omega$) is a typical realisation. That
is, for any other $\omega^* \in \Omega$, $X_t(\omega^*)$ will be different, but its general or overall characteristics
will be similar.
(ii) $\mathcal{F}$ is known as a sigma algebra. It is a set of subsets of $\Omega$ (though not necessarily the set of
all subsets, as this can be too large). But it consists of all sets for which a probability can
be assigned. That is if $A \in \mathcal{F}$, then a probability is assigned to the set $A$.
(iii) $P$ is the probability measure.
Different types of convergence we will be using in class:
(i) Almost sure convergence: $X_n \overset{a.s.}{\to} a$ as $n \to \infty$ (in this course $a$ will always be a constant).
This means for every $\omega \in \Omega^*$, $X_n(\omega) \to a$, where $P(\Omega^*) = 1$ (this is the classical limit of a sequence,
see Wiki for a definition).
(ii) Convergence in probability: $X_n \overset{P}{\to} a$. This means that for every $\varepsilon > 0$, $P(|X_n - a| > \varepsilon) \to 0$
as $n \to \infty$ (see Wiki).
(iii) Convergence in mean square: $X_n \overset{2}{\to} a$. This means $E|X_n - a|^2 \to 0$ as $n \to \infty$ (see Wiki).
(iv) Convergence in distribution. This means the distribution of $X_n$ converges to the distribution
of $X$, i.e. for all $x$ where $F_X$ is continuous, we have $F_n(x) \to F_X(x)$ as $n \to \infty$ (where $F_n$ and
$F_X$ are the distribution functions of $X_n$ and $X$ respectively). This is the simplest definition
(see Wiki).
Which implies which?
(i), (ii) and (iii) imply (iv).
(i) implies (ii).
(iii) implies (ii).
Central limit theorems require (iv). It is often easiest to show (iii) (since this only requires
mean and variance calculations).
1.4
Based on one realisation of a time series we want to make inference about parameters associated
with the process $\{X_t\}$, such as the mean etc. Let us consider the simplest case, estimating the
mean. We recall that in classical statistics we usually assume we observe several independent
realisations, $\{X_k\}$, of a random variable $X$, and use the multiple realisations to make inference
about the mean: $\bar{X} = \frac{1}{n}\sum_{k=1}^{n} X_k$. Roughly speaking, by using several independent realisations we
are sampling over the entire probability space and obtaining a good estimate of the mean. On the
other hand if the samples were highly dependent, then it is likely that $\{X_k\}$ would be concentrated
over small parts of the probability space. In this case, the variance of the sample mean would not
converge to zero as the sample size grows.
A typical time series is a halfway house between totally dependent data and independent data.
Unlike classical statistics, in time series, parameter estimation is based on only one realisation
$x_t = X_t(\omega)$ (not multiple, independent, replications). Therefore, it would appear impossible to
obtain a good estimator of the mean. However, good estimates of the mean can be made based
on just one realisation so long as certain assumptions are satisfied: (i) the process has a constant
mean (a type of stationarity) and (ii) despite the fact that each time series is generated from one
realisation there is short memory in the observations. That is, what is observed today, $x_t$, has
little influence on observations in the future, $x_{t+k}$ (when $k$ is relatively large). Hence, even though
we observe one trajectory, that trajectory traverses much of the probability space. The amount
of dependency in the time series determines the quality of the estimator. There are several ways
to measure the dependency. We know that the most common measure of linear dependency is the
covariance. The covariance in the stochastic process $\{X_t\}$ is defined as
$$\text{cov}(X_t, X_{t+k}) = E(X_t X_{t+k}) - E(X_t)E(X_{t+k}).$$
Note that if $\{X_t\}$ has zero mean, then the above reduces to $\text{cov}(X_t, X_{t+k}) = E(X_t X_{t+k})$.
Remark 1.4.1 It is worth bearing in mind that the covariance only measures linear dependence.
For some statistical analysis, such as deriving an expression for the variance of an estimator, the
covariance is often sufficient as a measure. However, given $\text{cov}(X_t, X_{t+k})$ we cannot say anything
about $\text{cov}(g(X_t), g(X_{t+k}))$, where $g$ is a nonlinear function. There are occasions where we require
a more general measure of dependence (for example, to show asymptotic normality). Examples of
more general measures include mixing (and other related notions, such as mixingales, near-epoch
dependence, approximate m-dependence, physical dependence, weak dependence), first introduced by
Rosenblatt in the 50s (M. and Grenander (1997)). In this course we will not cover mixing.
Returning to the sample mean example suppose that $\{X_t\}$ is a time series. In order to estimate the
mean we need to be sure that the mean is constant over time (else the estimator will be meaningless).
Therefore we will assume that $\{X_t\}$ is a time series with constant mean $\mu$. We observe $\{X_t\}_{t=1}^{n}$
and estimate the mean $\mu$ with the sample mean $\bar{X} = \frac{1}{n}\sum_{t=1}^{n} X_t$. It is clear that this is an unbiased
estimator of $\mu$, since $E(\bar{X}) = \mu$. Thus to see whether it converges in mean square
to $\mu$ we consider its variance
$$\text{var}(\bar{X}) = \frac{1}{n^2}\sum_{t=1}^{n}\text{var}(X_t) + \frac{2}{n^2}\sum_{t=1}^{n}\sum_{\tau=t+1}^{n}\text{cov}(X_t, X_\tau). \tag{1.8}$$
If the covariance structure decays at such a rate that the sum over all lags is finite ($\sup_t \sum_{\tau=-\infty}^{\infty} |\text{cov}(X_t, X_\tau)| < \infty$, often called short memory), then the variance is $O(\frac{1}{n})$, just as in the iid case. However, even
with this assumption we need to be able to estimate $\text{var}(\bar{X})$ in order to test/construct CI for $\mu$.
Usually this requires the stronger assumption of stationarity, which we define in Section 1.5.
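The decomposition of the variance of the sample mean into a variance part and a covariance part can be checked numerically for a known covariance structure. The sketch below assumes, purely for illustration, the covariance $\phi^{|r|}/(1-\phi^2)$ of an AR(1) process with unit innovation variance and a hypothetical coefficient $\phi = 0.7$.

```r
n <- 100
phi <- 0.7                                   # hypothetical AR(1) coefficient
# Covariance matrix with entries cov(X_t, X_tau) = phi^|t - tau| / (1 - phi^2).
Sigma <- toeplitz(phi^(0:(n - 1)) / (1 - phi^2))

# Variance terms plus twice the covariances above the diagonal, each divided by n^2.
var_terms <- sum(diag(Sigma)) / n^2
cov_terms <- 2 * sum(Sigma[upper.tri(Sigma)]) / n^2
var_mean  <- var_terms + cov_terms

# Direct calculation: var(mean) = 1' Sigma 1 / n^2.
var_direct <- sum(Sigma) / n^2
print(c(var_mean = var_mean, var_direct = var_direct, iid_part = var_terms))
```

With positive correlation the covariance part is far from negligible, even though the variance is still of order $1/n$.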
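Example 1.4.1 (The variance of a regression model with correlated errors) Let us return
to the parametric models discussed in Section 1.2.1. The general model is
$$Y_t = \beta_0 + \sum_{j=1}^{p}\beta_j u_{t,j} + \varepsilon_t = \beta' u_t + \varepsilon_t,$$
where $E[\varepsilon_t] = 0$ and we will assume that $\{u_{t,j}\}$ are nonrandom regressors. Note this includes the
parametric trend models discussed in Section 1.2.1. We use least squares to estimate $\beta$
$$L_n(\beta) = \sum_{t=1}^{n}(Y_t - \beta' u_t)^2,$$
with
$$\hat{\beta}_n = \arg\min L_n(\beta) = \left(\sum_{t=1}^{n} u_t u_t'\right)^{-1}\sum_{t=1}^{n} Y_t u_t.$$
Thus $\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} = 0$. To evaluate the variance of $\hat{\beta}_n$ we derive an expression for $\hat{\beta}_n - \beta$ (this
expression also applies to many nonlinear estimators too). Since $\frac{\partial L_n(\beta)}{\partial\beta} = -2\sum_{t=1}^{n}(Y_t - \beta' u_t)u_t$, by using straightforward
algebra we can show that
$$\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} - \frac{\partial L_n(\beta)}{\partial\beta} = 2\sum_{t=1}^{n} u_t u_t'\left[\hat{\beta}_n - \beta\right]. \tag{1.9}$$
Moreover because $\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} = 0$ we have
$$\frac{\partial L_n(\hat{\beta}_n)}{\partial\beta} - \frac{\partial L_n(\beta)}{\partial\beta} = -\frac{\partial L_n(\beta)}{\partial\beta} = 2\sum_{t=1}^{n}\underbrace{[Y_t - \beta' u_t]}_{=\varepsilon_t} u_t = 2\sum_{t=1}^{n} u_t\varepsilon_t. \tag{1.10}$$
Together (1.9) and (1.10) give
$$\sum_{t=1}^{n} u_t u_t'\left[\hat{\beta}_n - \beta\right] = \sum_{t=1}^{n} u_t\varepsilon_t$$
and
$$\hat{\beta}_n - \beta = \left(\sum_{t=1}^{n} u_t u_t'\right)^{-1}\sum_{t=1}^{n} u_t\varepsilon_t.$$
Therefore
$$\text{var}\left[\hat{\beta}_n - \beta\right] = \left(\frac{1}{n}\sum_{t=1}^{n} u_t u_t'\right)^{-1}\text{var}\left(\frac{1}{n}\sum_{t=1}^{n} u_t\varepsilon_t\right)\left(\frac{1}{n}\sum_{t=1}^{n} u_t u_t'\right)^{-1}.$$
It remains to evaluate $\text{var}\left(\frac{1}{n}\sum_{t=1}^{n} u_t\varepsilon_t\right)$, which is
$$\frac{1}{n^2}\sum_{t,\tau=1}^{n}\text{cov}[\varepsilon_t,\varepsilon_\tau]u_t u_\tau' = \underbrace{\frac{1}{n^2}\sum_{t=1}^{n}\text{var}[\varepsilon_t]u_t u_t'}_{\text{expression if independent}} + \underbrace{\frac{1}{n^2}\sum_{t=1}^{n}\sum_{\tau=t+1}^{n}\text{cov}[\varepsilon_t,\varepsilon_\tau]\left(u_t u_\tau' + u_\tau u_t'\right)}_{\text{additional term due to correlation in the errors}}.$$
Under the assumption that $\frac{1}{n}\sum_{t=1}^{n} u_t u_t'$ is nonsingular, $\sup_t\|u_t\|_1 < \infty$ and $\sup_t\sum_{\tau=-\infty}^{\infty}|\text{cov}(\varepsilon_t,\varepsilon_\tau)| < \infty$, we can see that $\text{var}[\hat{\beta}_n - \beta] = O(n^{-1})$, but just as in the case of the sample mean we need to
impose some additional conditions on $\{\varepsilon_t\}$ if we want to construct confidence intervals/test for $\beta$.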
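To get a feel for how much correlation in the errors can matter, the sketch below evaluates the variance of the least squares estimator exactly for a linear trend design. The error covariance is assumed, for illustration only, to be that of an AR(1) process with a hypothetical coefficient $\phi = 0.7$; the names bread and meat are just labels for the outer and inner matrices.

```r
n <- 100
U <- cbind(1, (1:n) / n)                            # regressors u_t = (1, t/n)', a linear trend
phi <- 0.7                                          # hypothetical AR(1) coefficient for the errors
Sigma <- toeplitz(phi^(0:(n - 1)) / (1 - phi^2))    # cov(eps_t, eps_tau)

bread <- solve(crossprod(U) / n)                    # (n^{-1} sum u_t u_t')^{-1}
meat  <- t(U) %*% Sigma %*% U / n^2                 # var(n^{-1} sum u_t eps_t)
var_beta <- bread %*% meat %*% bread

# If the errors were uncorrelated with the same variance, the formula reduces to
sigma2 <- Sigma[1, 1]
var_beta_iid <- sigma2 * solve(crossprod(U))
print(diag(var_beta) / diag(var_beta_iid))          # inflation due to correlation
```

The printed ratios show how far the usual least squares standard errors would be from the truth under this assumed error structure.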
1.5
Stationary processes
We have established that one of the main features that distinguish time series analysis from classical
methods is that observations taken over time (a time series) can be dependent and this dependency
tends to decline the further apart in time the observations are. However, to do any sort of analysis
of this time series we have to assume some sort of invariance in the time series, for example the mean
or variance of the time series does not change over time. If the marginal distributions of the time
series were totally different no sort of inference would be possible (suppose in classical statistics you
were given independent random variables all with different distributions; what parameter would
you be estimating? It is not possible to estimate anything!).
The typical assumption that is made is that a time series is stationary. Stationarity is a rather
intuitive concept: it is an invariance property which means that statistical characteristics of the time
series do not change over time. For example, the yearly rainfall may vary year by year, but the
average rainfall in two equal length time intervals will be roughly the same, as would the number of
times the rainfall exceeds a certain threshold. Of course, over long periods of time this assumption
may not be so plausible. For example, the climate change that we are currently experiencing is
causing changes in the overall weather patterns (we will consider nonstationary time series towards
the end of this course). However in many situations, including short time intervals, the assumption
of stationarity is quite plausible. Indeed often the statistical analysis of a time series is done
under the assumption that a time series is stationary.
1.5.1
There are two definitions of stationarity, weak stationarity which only concerns the covariance of a
process and strict stationarity which is a much stronger condition and supposes the distributions
are invariant over time.
Definition 1.5.1 (Strict stationarity) The time series {Xt } is said to be strictly stationary
if for any finite sequence of integers t1 , . . . , tk and shift h the distribution of (Xt1 , . . . , Xtk ) and
(Xt1 +h , . . . , Xtk +h ) are the same.
The above assumption is often considered to be rather strong (and given data it is very
hard to check). Often it is possible to work under a weaker assumption called weak/second order
stationarity.
Definition 1.5.2 (Second order stationarity/weak stationarity) The time series $\{X_t\}$ is said
to be second order stationary if the mean is constant for all $t$ and if for any $t$ and $k$ the covariance
between $X_t$ and $X_{t+k}$ only depends on the lag difference $k$. In other words there exists a function
$c$ such that $\text{cov}(X_t, X_{t+k}) = c(k)$ for all $t$ and $k$.
(i) If a process is strictly stationary and $E|X_t^2| < \infty$, then it is also second order stationary. But the converse is not necessarily
true. To show that strict stationarity (with $E|X_t^2| < \infty$) implies second order stationarity,
suppose that $\{X_t\}$ is a strictly stationary process, then
$$\begin{aligned}\text{cov}(X_t, X_{t+k}) &= E(X_t X_{t+k}) - E(X_t)E(X_{t+k}) \\ &= \int xy\left[P_{X_t,X_{t+k}}(dx,dy) - P_{X_t}(dx)P_{X_{t+k}}(dy)\right] \\ &= \int xy\left[P_{X_0,X_k}(dx,dy) - P_{X_0}(dx)P_{X_k}(dy)\right] = \text{cov}(X_0,X_k),\end{aligned}$$
where $P_{X_t,X_{t+k}}$ and $P_{X_t}$ are the joint distribution and marginal distribution of $X_t, X_{t+k}$ respectively. The above shows that $\text{cov}(X_t, X_{t+k})$ does not depend on $t$ and $\{X_t\}$ is second order
stationary.
(ii) If a process is strictly stationary but the second moment is not finite, then it is not second
order stationary.
(iii) It should be noted that a weakly stationary Gaussian time series is also strictly stationary
(this is the only case where weak stationarity implies strict stationarity).
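Example 1.5.1 (The sample mean and its variance under stationarity) Returning to the variance of the sample mean discussed in (1.8), if a time series is second order stationary, then the sample
mean $\bar{X}$ is estimating the mean $\mu$ and the variance of $\bar{X}$ is
$$\begin{aligned}\text{var}(\bar{X}) &= \frac{1}{n}\text{var}(X_0) + \frac{2}{n^2}\sum_{t=1}^{n}\sum_{\tau=t+1}^{n}\text{cov}(X_t, X_\tau) \\ &= \frac{1}{n}\underbrace{\text{var}(X_0)}_{c(0)} + \frac{2}{n}\sum_{r=1}^{n}\frac{n-r}{n}\underbrace{\text{cov}(X_0, X_r)}_{c(r)}.\end{aligned}$$
We approximate the above, by using that the covariances satisfy $\sum_r |c(r)| < \infty$. Therefore for all $r$,
$(1 - r/n)c(r) \to c(r)$ and $|\sum_{r=1}^{n}(1 - r/n)c(r)| \leq \sum_r |c(r)|$, thus by dominated convergence (see
Chapter A) $\sum_{r=1}^{n}(1 - r/n)c(r) \to \sum_{r=1}^{\infty} c(r)$. This implies that
$$\text{var}(\bar{X}) \approx \frac{1}{n}c(0) + \frac{2}{n}\sum_{r=1}^{\infty} c(r) = O\left(\frac{1}{n}\right).$$
The above is often called the long term variance. The above implies that
$$E(\bar{X} - \mu)^2 = \text{var}(\bar{X}) \to 0,$$
which is convergence in mean square, and hence $\bar{X} \overset{P}{\to} \mu$.
The example above illustrates how second order stationarity gave a rather elegant expression
for the variance. We now motivate the concept of ergodicity.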
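The approximation of $n\,\text{var}(\bar{X})$ by the long term variance can be illustrated numerically. The sketch below assumes an autocovariance of AR(1) form, $c(r) = \phi^{|r|}/(1-\phi^2)$ with a hypothetical $\phi = 0.5$, for which $c(0) + 2\sum_{r \geq 1} c(r) = 1/(1-\phi)^2$.

```r
phi <- 0.5                                     # hypothetical AR(1) coefficient, innovation variance one
c_r <- function(r) phi^abs(r) / (1 - phi^2)    # autocovariance c(r)

# Exact variance of the sample mean: (1/n) c(0) + (2/n) sum_{r=1}^{n} (1 - r/n) c(r).
var_mean <- function(n) {
  r <- 1:n
  c_r(0) / n + (2 / n) * sum((1 - r / n) * c_r(r))
}

# Long term variance c(0) + 2 * sum_{r >= 1} c(r), which equals 1 / (1 - phi)^2 here.
lrv <- 1 / (1 - phi)^2
n_grid <- c(10, 100, 1000, 10000)
scaled <- sapply(n_grid, function(n) n * var_mean(n))
print(cbind(n = n_grid, n_var_mean = scaled, lrv = lrv))
```

The column n_var_mean increases towards lrv as $n$ grows, which is the dominated convergence argument in numerical form.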
Sometimes, it is difficult to evaluate the mean and variance of an estimator, but often we
may only require almost sure convergence or convergence in probability. Therefore, we may want to find an
alternative method to evaluating the mean squared error. To see whether this is possible we recall
that for iid random variables we have the very useful law of large numbers
$$\frac{1}{n}\sum_{t=1}^{n} X_t \overset{a.s.}{\to} \mu$$
and in general $\frac{1}{n}\sum_{t=1}^{n} g(X_t) \overset{a.s.}{\to} E[g(X_0)]$ (provided $E|g(X_0)| < \infty$). Does such a result hold for time
series? It does, but we require the condition that the time series is ergodic (which
is a slightly stronger condition than strict stationarity).
Definition 1.5.3 (Ergodicity - very rough) Let $(\Omega, \mathcal{F}, P)$ be a probability space. A transformation $T: \Omega \to \Omega$ is said to be measure preserving if for every set $A \in \mathcal{F}$, $P(T^{-1}A) = P(A)$.
Moreover, it is said to be an ergodic transformation if $T^{-1}A = A$ implies that $P(A) = 0$ or $1$.
It is not obvious what this has to do with stochastic processes, but we attempt to make a link. Let
us suppose that $X = \{X_t\}$ is a strictly stationary process defined on the probability space $(\Omega, \mathcal{F}, P)$.
By strict stationarity the transformation (shifting a sequence by one)
$$T(x_1, x_2, \ldots) = (x_2, x_3, \ldots),$$
is a measure preserving transformation. Thus a process which is stationary is measure preserving.
$$\frac{1}{n}\sum_{t=1}^{n} g(X_t) \overset{a.s.}{\to} E[g(X_0)]$$
for any function $g(\cdot)$. And in general for any shifts $\tau_1, \ldots, \tau_k$ and function $g: \mathbb{R}^{k+1} \to \mathbb{R}$ we have
$$\frac{1}{n}\sum_{t=1}^{n} g(X_t, X_{t+\tau_1}, \ldots, X_{t+\tau_k}) \overset{a.s.}{\to} E[g(X_0, X_{\tau_1}, \ldots, X_{\tau_k})] \tag{1.11}$$
(often (1.11) is used as the definition of ergodicity, as it is an iff with the ergodic definition). Later
you will see how useful this is.
(1.11) gives us an idea of what constitutes an ergodic process. Suppose that $\{\varepsilon_t\}$ is an ergodic
process (a classical example is iid random variables), then any reasonable (meaning measurable)
function of $\{\varepsilon_t\}$ is also ergodic. More precisely, if $X_t$ is defined as
$$X_t = h(\ldots, \varepsilon_t, \varepsilon_{t-1}, \ldots), \tag{1.12}$$
where $\{\varepsilon_t\}$ are iid random variables and $h(\cdot)$ is a measurable function, then $\{X_t\}$ is an ergodic
process. For full details see Stout (1974), Theorem 3.4.5.
Remark 1.5.2 As mentioned above all ergodic processes are stationary, but a stationary process
is not necessarily ergodic. Here is one simple example. Suppose that $\{\varepsilon_t\}$ are iid random variables
and $Z$ is a Bernoulli random variable with outcomes $\{1, 2\}$ (where the chance of either outcome is
one half). Define
$$X_t = \begin{cases}\mu_1 + \varepsilon_t & Z = 1 \\ \mu_2 + \varepsilon_t & Z = 2.\end{cases}$$
It is clear that $E(X_t | Z = i) = \mu_i$ and $E(X_t) = \frac{1}{2}(\mu_1 + \mu_2)$. This sequence is stationary. However,
we observe that $\frac{1}{T}\sum_{t=1}^{T} X_t$ will only converge to one of the means, hence we do not have almost
sure convergence (or convergence in probability) to $\frac{1}{2}(\mu_1 + \mu_2)$.
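The failure of the time average to converge to the overall mean is easy to see in a simulation. In the sketch below the means 0 and 5, the standard normal errors and the number of replications are hypothetical choices; the helper one_path_mean is introduced only for this illustration.

```r
set.seed(7)
mu <- c(0, 5)                                   # hypothetical means mu_1 and mu_2
n <- 2000
one_path_mean <- function() {
  Z <- sample(1:2, 1)                           # drawn once per realisation, probability 1/2 each
  mean(mu[Z] + rnorm(n))                        # X_t = mu_Z + eps_t for t = 1, ..., n
}
means <- replicate(20, one_path_mean())
print(round(means, 2))
# Every time average is close to 0 or to 5; none is close to E(X_t) = 2.5.
```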
Exercise 1.5 State, with explanation, which of the following time series are second order stationary,
which are strictly stationary and which are both.
(i) $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one.
(ii) $\{\varepsilon_t\}$ are iid random variables from a Cauchy distribution.
(iii) $X_{t+1} = X_t + \varepsilon_t$, where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one.
(iv) $X_t = Y$ where $Y$ is a random variable with mean zero and variance one.
(v) $X_t = U_t + U_{t-1} + V_t$, where $\{(U_t, V_t)\}$ is a strictly stationary vector time series with $E[U_t^2] < \infty$
and $E[V_t^2] < \infty$.
Example 1.5.2 In Chapter 6 we consider estimation of the autocovariance function. However, for now we use the R command acf. In Figure 1.11 we give the sample acf plots of the Southern Oscillation Index and the Sunspot data. We observe that they are very different. The acf of the SOI decays rapidly, but there does appear to be some sort of pattern in the correlations. On the other hand, there is more persistence in the acf of the Sunspot data. The correlations of the Sunspot data appear to decay, but over a longer period of time, and there is a clear periodicity in the correlation.
Exercise 1.6
(i) Make an ACF plot of the monthly temperature data from 1996-2014.
(ii) Make an ACF plot of the yearly temperature data from 1880-2013.
(iii) Make an ACF plot of the residuals (after fitting a line through the data (using the command lsfit(..)$res)) of the yearly temperature data from 1880-2013.
Briefly describe what you see.
Figure 1.11: Top: ACF of Southern Oscillation data (panel title "Series soi"). Bottom: ACF plot of Sunspot data (panel title "Series sunspot"). Horizontal axis: Lag; vertical axis: ACF.
R code
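To make the above plots we use the commands
par(mfrow=c(2,1))         # stack the two plots in one window
acf(soi,lag.max=300)      # sample ACF of the SOI series up to lag 300
acf(sunspot,lag.max=60)   # sample ACF of the sunspot series up to lag 60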
1.5.2
Returning to the sample mean in Example 1.5.1, suppose we want to construct CIs or apply statistical tests on the mean. This requires us to estimate the long run variance (assuming stationarity)
$$\mathrm{var}(\bar{X}) \approx \frac{1}{n} c(0) + \frac{2}{n}\sum_{r=1}^{\infty} c(r).$$
There are several ways this can be done: either by fitting a model to the data and estimating the covariance from the model, or by doing it nonparametrically. This example motivates the contents of the course:
(i) Modelling: finding suitable time series models to fit to the data.
(ii) Forecasting: this is essentially predicting the future given current and past observations.
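As a rough illustration of the nonparametric route, the sketch below plugs sample autocovariances into the long run variance formula and truncates the sum at a lag `M`. The AR(1) test series, the sample size and the truncation lag are all assumptions chosen for the example; the choice of `M` in practice is a separate question.

```r
set.seed(2)
phi <- 0.5      # hypothetical AR(1) coefficient used to generate test data
n   <- 20000    # sample size
M   <- 20       # truncation lag (an assumption of this sketch)

x <- arima.sim(model = list(ar = phi), n = n)

# Sample autocovariances c(0), ..., c(M)
chat <- acf(x, lag.max = M, type = "covariance", plot = FALSE)$acf[, 1, 1]

lrv_hat  <- chat[1] + 2 * sum(chat[-1])   # c(0) + 2 * sum_{r=1}^{M} c(r)
lrv_true <- 1 / (1 - phi)^2               # AR(1) with unit innovation variance

c(estimate = lrv_hat, truth = lrv_true, var_of_mean = lrv_hat / n)
```

Dividing the estimate by `n` gives the approximate variance of the sample mean.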
1.6
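The covariance of a stationary process has several very interesting properties. The most important is that it is positive semidefinite, which we define below.
Definition 1.6.1 (Positive semidefinite sequence)
(i) A sequence $\{c(k); k \in \mathbb{Z}\}$ ($\mathbb{Z}$ is the set of all integers) is said to be positive semidefinite if for any positive integer $n$ and sequence $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ the following is satisfied
$$\sum_{i,j=1}^{n} c(i-j) x_i x_j \geq 0.$$
(ii) A sequence is said to be an even positive semidefinite sequence if (i) is satisfied and $c(k) = c(-k)$ for all $k \in \mathbb{Z}$.
An extension of this notion is the positive semidefinite function.
Definition 1.6.2 (Positive semidefinite function)
(i) A function $\{c(u); u \in \mathbb{R}\}$ is said to be positive semidefinite if for any positive integer $n$, points $u_1, \ldots, u_n \in \mathbb{R}$ and sequence $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ the following is satisfied
$$\sum_{i,j=1}^{n} c(u_i - u_j) x_i x_j \geq 0.$$
(ii) A function is said to be an even positive semidefinite function if (i) is satisfied and $c(u) = c(-u)$ for all $u \in \mathbb{R}$.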
Remark 1.6.1 You have probably encountered this positive definite notion before, when dealing with positive definite matrices. Recall the $n \times n$ matrix $\Sigma_n$ is positive semidefinite if for all $x \in \mathbb{R}^n$, $x'\Sigma_n x \geq 0$. To see how this is related to positive semidefinite sequences, suppose that the matrix $\Sigma_n$ has a special form, that is the elements of $\Sigma_n$ are $(\Sigma_n)_{i,j} = c(i-j)$. Then $x'\Sigma_n x = \sum_{i,j=1}^{n} c(i-j) x_i x_j$. We observe that in the case that $\{X_t\}$ is a stationary process with covariance $c(k)$, the variance covariance matrix of $\underline{X}_n = (X_1, \ldots, X_n)$ is $\Sigma_n$, where $(\Sigma_n)_{i,j} = c(i-j)$.
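The link between the sequence $c(k)$ and the matrix $\Sigma_n$ can be checked numerically for a fixed $n$. The sketch below builds $\Sigma_n$ with `toeplitz` and looks at its smallest eigenvalue. The two sequences and the dimension `n` are assumptions made for illustration. A nonnegative smallest eigenvalue for one $n$ does not prove positive semidefiniteness, whereas a negative one does rule it out.

```r
n <- 10
acov_ok  <- c(1, 0.5, rep(0, n - 2))   # c(0) = 1, c(1) = 0.5, c(k) = 0 otherwise
acov_bad <- c(1, 0.9, rep(0, n - 2))   # hypothetical sequence, not a valid autocovariance

Sigma_ok  <- toeplitz(acov_ok)         # (Sigma_n)_{i,j} = c(i - j)
Sigma_bad <- toeplitz(acov_bad)

min_eig_ok  <- min(eigen(Sigma_ok,  symmetric = TRUE, only.values = TRUE)$values)
min_eig_bad <- min(eigen(Sigma_bad, symmetric = TRUE, only.values = TRUE)$values)

c(min_eig_ok = min_eig_ok, min_eig_bad = min_eig_bad)
```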
We now take the above remark further and show that the covariance of a stationary process is positive semidefinite.
Theorem 1.6.1 Suppose that $\{X_t\}$ is a discrete time/continuous time stationary time series with covariance function $\{c(k)\}$, then $\{c(k)\}$ is a positive semidefinite sequence/function. Conversely, for any even positive semidefinite sequence/function there exists a stationary time series with this positive semidefinite sequence/function as its covariance function.
PROOF. We prove the result in the case that $\{X_t\}$ is a discrete time time series, ie. $\{X_t; t \in \mathbb{Z}\}$.
We first show that $\{c(k)\}$ is a positive semidefinite sequence. Consider any sequence $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, and the double sum $\sum_{i,j=1}^{n} x_i c(i-j) x_j$. Define the random variable $Y = \sum_{i=1}^{n} x_i X_i$. It is straightforward to see that $\mathrm{var}(Y) = x'\mathrm{var}(\underline{X}_n)x = \sum_{i,j=1}^{n} c(i-j) x_i x_j$ where $\underline{X}_n = (X_1, \ldots, X_n)$. Since for any random variable $Y$, $\mathrm{var}(Y) \geq 0$, this means that $\sum_{i,j=1}^{n} x_i c(i-j) x_j \geq 0$, hence $\{c(k)\}$ is a positive semidefinite sequence.
To show the converse, that is, for any positive semidefinite sequence $\{c(k)\}$ we can find a corresponding stationary time series with the covariance $\{c(k)\}$, is relatively straightforward, but depends on defining the characteristic function of a process and using Kolmogorov's extension theorem. We omit the details but refer an interested reader to Brockwell and Davis (1998), Section 1.5.
In time series analysis the data is usually analysed by fitting a model to the data. The model (so long as it is correctly specified, we will see what this means in later chapters) guarantees that the covariance function corresponding to the model (again we cover this in later chapters) is positive semidefinite. This means that, in general, we do not have to worry about positive semidefiniteness of the covariance function, as it is implied by the model.
On the other hand, in spatial statistics, often the object of interest is the covariance function and specific classes of covariance functions are fitted to the data. In this case it is necessary to ensure that the covariance function is positive semidefinite (noting that once a covariance function has been found, by Theorem 1.6.1 there must exist a spatial process which has this covariance function). It is impossible to check for positive semidefiniteness using Definitions 1.6.1 or 1.6.2. Instead an alternative but equivalent criterion is used. The general result, which does not impose any conditions on $\{c(k)\}$, is stated in terms of positive measures (this result is often called Bochner's theorem). Instead, we place some conditions on $\{c(k)\}$, and state a simpler version of the theorem.
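Theorem 1.6.2 Suppose the coefficients $\{c(k); k \in \mathbb{Z}\}$ are absolutely summable (that is $\sum_{k} |c(k)| < \infty$). Then the sequence $\{c(k)\}$ is positive semidefinite if and only if the function $f(\omega)$, where
$$f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} c(k)\exp(ik\omega),$$
is nonnegative for all $\omega \in [0, 2\pi]$. The function $f$ is called the spectral density. The analogous criterion for an absolutely integrable function $\{c(u); u \in \mathbb{R}\}$ replaces the sum by the Fourier transform
$$f(\omega) = \frac{1}{2\pi}\int_{-\infty}^{\infty} c(u)\exp(iu\omega)\,du,$$
which must be nonnegative for all $\omega \in \mathbb{R}$.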
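Example 1.6.1 We will show that the sequence $c(0) = 1$, $c(1) = 0.5$, $c(-1) = 0.5$ and $c(k) = 0$ for $|k| > 1$ is a positive semidefinite sequence.
From the definition of spectral density given above we see that the spectral density corresponding to the above sequence is
$$f(\omega) = \frac{1}{2\pi}\left(1 + 2 \times 0.5 \times \cos(\omega)\right).$$
Since $|\cos(\omega)| \leq 1$, $f(\omega) \geq 0$, thus the sequence is positive semidefinite. An alternative method is to find a model which has this as the covariance structure. Let $X_t = \varepsilon_t + \varepsilon_{t-1}$, where $\varepsilon_t$ are iid random variables with $E[\varepsilon_t] = 0$ and $\mathrm{var}(\varepsilon_t) = 0.5$. This model has this covariance structure, since $\mathrm{var}(X_t) = 2 \times 0.5 = 1$ and $\mathrm{cov}(X_t, X_{t+1}) = 0.5$.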
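The spectral density criterion is easy to evaluate on a grid of frequencies. The following sketch is illustrative only: the helper name `spec_dens`, the frequency grid and the second (invalid) sequence are assumptions introduced for the example.

```r
# f(omega) = (1 / (2 * pi)) * sum_k c(k) exp(i k omega); for an even sequence this
# equals (1 / (2 * pi)) * (c(0) + 2 * sum_{k >= 1} c(k) cos(k omega)).
spec_dens <- function(omega, c0, cpos) {
  k <- seq_along(cpos)
  sapply(omega, function(w) (c0 + 2 * sum(cpos * cos(k * w))) / (2 * pi))
}

omega <- seq(0, 2 * pi, length.out = 1001)
f_ex  <- spec_dens(omega, c0 = 1, cpos = 0.5)   # sequence of Example 1.6.1
f_bad <- spec_dens(omega, c0 = 1, cpos = 0.9)   # hypothetical sequence

c(min_f_example = min(f_ex), min_f_bad = min(f_bad))

# Covariances implied by X_t = eps_t + eps_{t-1} with var(eps_t) = 0.5
sigma2 <- 0.5
ma1_acov <- c(c0 = sigma2 * (1 + 1^2), c1 = sigma2 * 1)
ma1_acov
```

The first minimum is zero up to rounding (attained at $\omega = \pi$), while the second is negative, so the second sequence cannot be an autocovariance.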
We note that Theorem 1.6.2 can easily be generalized to higher dimensions, $d$, by taking Fourier transforms over $\mathbb{Z}^d$ or $\mathbb{R}^d$.
Exercise 1.7 Which of these sequences can be used as the autocovariance function of a second order stationary time series?
(i) $c(-1) = 1/2$, $c(0) = 1$, $c(1) = 1/2$ and for all $|k| > 1$, $c(k) = 0$.
(ii) $c(-1) = 1/2$, $c(0) = 1$, $c(1) = 1/2$ and for all $|k| > 1$, $c(k) = 0$.
(iii) $c(-2) = 0.8$, $c(-1) = 0.5$, $c(0) = 1$, $c(1) = 0.5$ and $c(2) = 0.8$ and for all $|k| > 2$, $c(k) = 0$.
Exercise 1.8
(i) Show that the function $c(u) = \exp(-a|u|)$ where $a > 0$ is a positive semidefinite function.
(ii) Show that the commonly used exponential spatial covariance defined on $\mathbb{R}^2$, $c(u_1, u_2) = \exp(-a\sqrt{u_1^2 + u_2^2})$, where $a > 0$, is a positive semidefinite function.
Chapter 2
Linear time series
Prerequisites
Familiarity with linear models.
Solve polynomial equations.
Be familiar with complex numbers.
Understand under what conditions the partial sum $S_n = \sum_{j=1}^{n} a_j$ has a well defined limit (ie. if $\sum_{j=1}^{\infty} |a_j| < \infty$, then $S_n \to S$, where $S = \sum_{j=1}^{\infty} a_j$).
Objectives
Understand what causal and invertible mean.
Know what an AR, MA and ARMA time series model is.
Know how to find a solution of an ARMA time series, and understand why this is important (how the roots determine causality and why this is important to know, in terms of characteristics in the process and also simulations).
Understand how the roots of the AR can determine features in the time series and covariance structure (such as pseudo periodicities).
2.1 Motivation
The objective of this chapter is to introduce the linear time series model. Linear time series models are designed to model the covariance structure in the time series. There are two popular subgroups of linear time models, (a) the autoregressive and (b) the moving average models, which can be combined to make the autoregressive moving average models.
We motivate the autoregressive model from the perspective of classical linear regression. We recall one objective in linear regression is to predict the response variable given variables that are observed. To do this, typically linear dependence between response and variable is assumed and we model $Y_i$ as
$$Y_i = \sum_{j=1}^{p} a_j X_{ij} + \varepsilon_i,$$
where $\varepsilon_i$ is such that $E[\varepsilon_i | X_{ij}] = 0$ and more commonly $\varepsilon_i$ and $X_{ij}$ are independent. In linear regression once the model has been defined, we can immediately find estimators of the parameters, do model selection etc.
Returning to time series, one major objective is to predict/forecast the future given current and past observations (just as in linear regression our aim is to predict the response given the observed variables). At least formally, it seems reasonable to represent this as
$$X_t = \sum_{j=1}^{p} a_j X_{t-j} + \varepsilon_t, \qquad (2.1)$$
where we assume that $\{\varepsilon_t\}$ are independent, identically distributed, zero mean random variables. Model (2.1) is called an autoregressive model of order $p$ (AR(p) for short). It is easy to see that
$$E(X_t | X_{t-1}, \ldots, X_{t-p}) = \sum_{j=1}^{p} a_j X_{t-j}$$
(this is the expected value of $X_t$ given that $X_{t-1}, \ldots, X_{t-p}$ have already been observed), thus the past values of $X_t$ have a linear influence on the conditional mean of $X_t$ (compare with the linear model $Y_t = \sum_{j=1}^{p} a_j X_{t,j} + \varepsilon_t$, then $E(Y_t | X_{t,j}) = \sum_{j=1}^{p} a_j X_{t,j}$). Conceptually, the autoregressive model appears to be a straightforward extension of the linear regression model. Do not be fooled by this, it is a more complex object. Unlike the linear regression model, (2.1) is an infinite set of linear difference equations. This means that, for this system of equations to be well defined, it needs to have a solution which is meaningful. To understand why, recall that the equation is defined for all $t \in \mathbb{Z}$, so let us start the equation at the beginning of time ($t = -\infty$) and run it on. Without any constraint on the parameters $\{a_j\}$, there is no reason to believe the solution is finite (contrast this with linear regression where these issues are not relevant). Therefore, the first thing to understand is under what conditions the AR model (2.1) has a well defined stationary solution and what features in a time series the solution is able to capture.
Of course, one could ask why go to the effort. One could simply use least squares to estimate the parameters. This is possible, but without a proper analysis it is not clear whether the model has a meaningful solution (for example in Section 3.3 we show that the least squares estimator can lead to misspecified models); it is not even possible to make simulations of the process. Therefore, there is a practical motivation behind our theoretical treatment.
In this chapter we will be deriving conditions for a strictly stationary solution of (2.1). We will place moment conditions on the innovations $\{\varepsilon_t\}$; these conditions will be sufficient but not necessary. Under these conditions we obtain a strictly stationary solution but not necessarily a second order stationary process. In Chapter 3 we obtain conditions for (2.1) to have a solution that is both strictly and second order stationary. It is possible to obtain a strictly stationary solution under far weaker conditions (see Theorem 4.0.1), but these will not be considered here.
Example 2.1.1 How would you simulate from the model
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \varepsilon_t?$$
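One common way to approach this question is to run the recursion forward from arbitrary starting values and discard an initial stretch. The sketch below does this; the function name `simulate_ar2`, the burn-in length `burn` and the coefficient values are assumptions made for illustration. It only approximates a stationary realisation when the model has a causal stationary solution, which is precisely the issue studied in the rest of this chapter.

```r
simulate_ar2 <- function(n, phi1, phi2, burn = 500, eps = rnorm(n + burn)) {
  total <- n + burn
  x <- numeric(total)
  x[1] <- eps[1]                       # arbitrary start: pretend X_0 = X_{-1} = 0
  x[2] <- phi1 * x[1] + eps[2]
  for (t in 3:total) {
    x[t] <- phi1 * x[t - 1] + phi2 * x[t - 2] + eps[t]
  }
  x[(burn + 1):total]                  # drop the burn-in period
}

set.seed(3)
x <- simulate_ar2(n = 200, phi1 = 0.75, phi2 = -0.125)

# Feeding in a unit impulse shows how a single innovation propagates.
impulse <- simulate_ar2(n = 3, phi1 = 0.75, phi2 = -0.125, burn = 0, eps = c(1, 0, 0))
impulse
```

If the coefficients do not correspond to a causal solution, the same recursion typically explodes rather than settling down.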
2.2
2.2.1
Before defining a linear time series, we define the MA(q) model which is a subclass of linear time series. Let us suppose that $\{\varepsilon_t\}$ are iid random variables with mean zero and finite variance. The time series $\{X_t\}$ is said to have a MA(q) representation if it satisfies
$$X_t = \sum_{j=0}^{q}\theta_j\varepsilon_{t-j},$$
where $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) = 1$. It is clear that $X_t$ is a rolling finite weighted sum of $\{\varepsilon_t\}$, therefore $\{X_t\}$ must be well defined (which for finite sums means it is almost surely finite; this you can see because it has a finite variance). We extend this notion and consider infinite sums of random variables. Now, things become more complicated, since care must always be taken with anything involving infinite sums. More precisely, for the sum
$$\sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$$
to be well defined, the partial sums $S_n = \sum_{j=-n}^{n}\psi_j\varepsilon_{t-j}$ should be (almost surely) finite and the sequence $S_n$ should converge (ie. $|S_{n_1} - S_{n_2}| \to 0$ as $n_1, n_2 \to \infty$). Below, we give conditions under which this is true.
Lemma 2.2.1 Suppose $\{X_t\}$ is a strictly stationary time series with $E|X_t| < \infty$, then $\{Y_t\}$ defined by
$$Y_t = \sum_{j=-\infty}^{\infty}\psi_j X_{t-j},$$
where $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, is a strictly stationary time series. Furthermore, the partial sum converges almost surely, $Y_{n,t} = \sum_{j=-n}^{n}\psi_j X_{t-j} \to Y_t$. If $\mathrm{var}(X_t) < \infty$, then $\{Y_t\}$ is second order stationary.
Example 2.2.1 Suppose $\{X_t\}$ is a strictly stationary time series with $\mathrm{var}(X_t) < \infty$. Define $\{Y_t\}$ as the following infinite sum
$$Y_t = \sum_{j=0}^{\infty} j^k \rho^j |X_{t-j}|$$
where $|\rho| < 1$. Then $\{Y_t\}$ is also a strictly stationary time series with a finite variance.
We will use this example later in the course.
Having derived conditions under which infinite sums are well defined, we can now define the general class of linear and MA($\infty$) processes.
Definition 2.2.1 (The linear process and moving average (MA)($\infty$)) Suppose that $\{\varepsilon_t\}$ are iid random variables, $\sum_{j=0}^{\infty}|\psi_j| < \infty$ and $E(|\varepsilon_t|) < \infty$.
A time series is said to be a linear time series if it can be represented as
$$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},$$
and it is said to have an MA($\infty$) representation if it satisfies
$$X_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}.$$
Note that since these sums are well defined, by equation (1.12) $\{X_t\}$ is a strictly stationary (ergodic) time series.
The difference between an MA($\infty$) process and a linear process is quite subtle. A linear process involves past, present and future innovations $\{\varepsilon_t\}$, whereas the MA($\infty$) uses only past and present innovations.
Definition 2.2.2 (Causal and invertible)
(i) A process is said to be causal if it has the representation
$$X_t = \sum_{j=0}^{\infty} a_j\varepsilon_{t-j}.$$
(ii) A process is said to be invertible if it has the representation
$$X_t = \sum_{j=1}^{\infty} b_j X_{t-j} + \varepsilon_t$$
(so far we have yet to give conditions under which the above has a well defined solution).
Causal and invertible solutions are useful in both estimation and forecasting (predicting the future based on the current and past).
A very interesting class of models which have MA($\infty$) representations are autoregressive and autoregressive moving average models. In the following sections we prove this.
2.3
In this section we will examine under what conditions the AR process has a stationary solution.
2.3.1
Recall that the AR(p) model is defined by
$$X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t, \qquad t \in \mathbb{Z},$$
where $\{\varepsilon_t\}$ are zero mean, finite variance random variables. As we mentioned previously, the autoregressive model is a difference equation (which can be treated as an infinite number of simultaneous equations). Therefore for it to make any sense it must have a solution. To obtain a general solution we write the autoregressive model in terms of the backshift operator $B$ (where $B^k X_t = X_{t-k}$):
$$X_t - \phi_1 B X_t - \ldots - \phi_p B^p X_t = \varepsilon_t, \qquad \text{that is} \qquad \phi(B)X_t = \varepsilon_t,$$
where $\phi(B) = 1 - \sum_{j=1}^{p}\phi_j B^j$. Simply rearranging $\phi(B)X_t = \varepsilon_t$ gives the solution of the autoregressive difference equation to be $X_t = \phi(B)^{-1}\varepsilon_t$; however, this is just an algebraic manipulation, and below we investigate whether it really has any meaning. To do this, we start with an example.
2.3.2
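Below we consider two different AR(1) models and obtain their solutions.
(i) Consider the AR(1) process
$$X_t = 0.5X_{t-1} + \varepsilon_t, \qquad t \in \mathbb{Z}. \qquad (2.2)$$
Iterating this equation backwards $k$ times gives $X_t = \sum_{j=0}^{k}(0.5)^j\varepsilon_{t-j} + (0.5)^{k+1}X_{t-k-1}$. Since $(0.5)^{k+1} \to 0$ as $k \to \infty$, this suggests the solution $X_t = \sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t-j}$. Let us see whether the backshift operator gives the same answer. Formally,
$$X_t = (1 - 0.5B)^{-1}\varepsilon_t = \Big(\sum_{j=0}^{\infty}(0.5B)^j\Big)\varepsilon_t = \Big(\sum_{j=0}^{\infty}(0.5)^j B^j\Big)\varepsilon_t = \sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t-j},$$
which corresponds to the solution above. Hence the backshift operator in this example helps us to obtain a solution. Moreover, because the solution can be written in terms of past values of $\varepsilon_t$, it is causal.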
(ii) Let us consider the AR model, which we will see has a very different solution:
$$X_t = 2X_{t-1} + \varepsilon_t. \qquad (2.3)$$
Doing what we did in (i) we find that after the $k$th back iteration we have $X_t = \sum_{j=0}^{k} 2^j\varepsilon_{t-j} + 2^{k+1}X_{t-k-1}$. However, unlike example (i), $2^k$ does not converge as $k \to \infty$. This suggests that if we continue the iteration, $X_t = \sum_{j=0}^{\infty} 2^j\varepsilon_{t-j}$ is not a quantity that is finite (when $\varepsilon_t$ are iid). Therefore $X_t = \sum_{j=0}^{\infty} 2^j\varepsilon_{t-j}$ cannot be considered as a solution of (2.3). We need to write (2.3) in a slightly different way in order to obtain a meaningful solution.
Rewriting (2.3) we have $X_{t-1} = 0.5X_t - 0.5\varepsilon_t$. Forward iterating this we get $X_{t-1} = -(0.5)\sum_{j=0}^{k}(0.5)^j\varepsilon_{t+j} + (0.5)^{k+1}X_{t+k}$. Since $(0.5)^{k+1} \to 0$ we have
$$X_{t-1} = -(0.5)\sum_{j=0}^{\infty}(0.5)^j\varepsilon_{t+j}$$
as a solution of (2.3).
Let us see whether the difference equation can also offer a solution. Since $(1 - 2B)X_t = \varepsilon_t$, using the crude manipulation we have $X_t = (1 - 2B)^{-1}\varepsilon_t$. Now we see that $(1 - 2B)^{-1} = \sum_{j=0}^{\infty}(2B)^j$ for $|B| < 1/2$. Using this expansion gives $X_t = \sum_{j=0}^{\infty} 2^j B^j\varepsilon_t$, but as pointed out above this sum is not well defined. What we find is that $\phi(B)^{-1}\varepsilon_t$ only makes sense (is well defined) if the series expansion of $\phi(B)^{-1}$ converges in a region that includes the unit circle $|B| = 1$.
What we need is another series expansion of $(1 - 2B)^{-1}$ which converges in a region which includes the unit circle $|B| = 1$ (as an aside, we note that a function does not necessarily have a unique series expansion; it can have different series expansions which may converge in different regions). We now show that a convergent series expansion needs to be defined in terms of negative powers of $B$, not positive powers. Writing $(1 - 2B) = -(2B)(1 - (2B)^{-1})$, we have
$$(1 - 2B)^{-1} = -(2B)^{-1}\sum_{j=0}^{\infty}(2B)^{-j},$$
which converges for $|B| > 1/2$. Using this expansion we have
$$X_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}B^{-j-1}\varepsilon_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}\varepsilon_{t+j+1},$$
which agrees with the solution obtained above by forward iteration. In summary, $(1 - 2B)^{-1}$ has two series expansions:
$$\frac{1}{(1 - 2B)} = \sum_{j=0}^{\infty}(2B)^j,$$
which converges for $|B| < 1/2$, and
$$\frac{1}{(1 - 2B)} = -(2B)^{-1}\sum_{j=0}^{\infty}(2B)^{-j},$$
which converges for $|B| > 1/2$. The one that is useful for us is the series which converges when $|B| = 1$.
It is clear from the above examples how to obtain the solution of a general AR(1). We now show that this solution is the unique stationary solution.
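The two behaviours in example (ii) can be checked numerically. The sketch below builds the forward-looking solution $X_t = -\sum_{j\geq 0}(0.5)^{j+1}\varepsilon_{t+1+j}$ from simulated innovations, truncating the infinite sum at `J` terms, and confirms that it satisfies $X_t = 2X_{t-1} + \varepsilon_t$ up to a negligible truncation error. It also runs the naive backward recursion to show that it explodes. The sample size, truncation point and Gaussian innovations are assumptions of the sketch.

```r
set.seed(1)
n <- 200
J <- 60                                   # truncation point of the infinite sum
eps <- rnorm(n + J + 1)                   # eps[i] plays the role of eps_i

# Noncausal solution: X_t = - sum_{j=0}^{J} 0.5^(j+1) * eps_{t+1+j}
w <- 0.5^(1:(J + 1))
x <- sapply(1:n, function(t) -sum(w * eps[(t + 1):(t + 1 + J)]))

# Residual of the difference equation X_t - 2 X_{t-1} - eps_t, t = 2, ..., n
resid <- x[2:n] - 2 * x[1:(n - 1)] - eps[2:n]
max_resid <- max(abs(resid))

# Naive backward recursion started from X_1 = eps_1: grows like 2^t
y <- numeric(n)
y[1] <- eps[1]
for (t in 2:n) y[t] <- 2 * y[t - 1] + eps[t]

c(max_resid = max_resid, sd_x = sd(x), abs_y_n = abs(y[n]))
```

The stationary solution stays bounded in probability, whereas the recursion driven by past innovations does not.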
2.3.3
Consider the AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$, where $|\phi| < 1$. Using the method outlined in (i), it is straightforward to show that $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is its stationary solution; we now show that this solution is unique.
We first show that $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is well defined (that it is almost surely finite). We note that $|X_t| \leq \sum_{j=0}^{\infty}|\phi|^j|\varepsilon_{t-j}|$. Thus we will show that $\sum_{j=0}^{\infty}|\phi|^j|\varepsilon_{t-j}|$ is almost surely finite, which will imply that $X_t$ is almost surely finite. By monotone convergence we can exchange sum and expectation and we have
$$E(|X_t|) \leq E\Big(\lim_{n\to\infty}\sum_{j=0}^{n}|\phi|^j|\varepsilon_{t-j}|\Big) = \lim_{n\to\infty}\sum_{j=0}^{n}|\phi|^j E|\varepsilon_{t-j}| = E(|\varepsilon_0|)\sum_{j=0}^{\infty}|\phi|^j < \infty.$$
Therefore since $E|X_t| < \infty$, $\sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is a well defined solution of $X_t = \phi X_{t-1} + \varepsilon_t$.
To show that it is the unique stationary causal solution, let us suppose there is another (causal) solution, call it $Y_t$ (note that this part of the proof is useful to know as such methods are often used when obtaining solutions of time series models). Clearly, by recursively applying the difference equation to $Y_t$, for every $s$ we have
$$Y_t = \sum_{j=0}^{s}\phi^j\varepsilon_{t-j} + \phi^{s+1}Y_{t-s-1}.$$
Evaluating the difference between the two solutions gives $Y_t - X_t = A_s - B_s$ where $A_s = \phi^{s+1}Y_{t-s-1}$ and $B_s = \sum_{j=s+1}^{\infty}\phi^j\varepsilon_{t-j}$ for all $s$. To show that $Y_t$ and $X_t$ coincide almost surely we will show that for every $\delta > 0$, $\sum_{s=1}^{\infty}P(|A_s - B_s| > \delta) < \infty$ (and then apply the Borel-Cantelli lemma). We note that if $|A_s - B_s| > \delta$, then either $|A_s| > \delta/2$ or $|B_s| > \delta/2$. Therefore $P(|A_s - B_s| > \delta) \leq P(|A_s| > \delta/2) + P(|B_s| > \delta/2)$. To bound these two terms we use Markov's inequality. It is straightforward to show that $P(|B_s| > \delta/2) \leq C|\phi|^s/\delta$. To bound $E|A_s|$, we note that $|Y_s| \leq |\phi||Y_{s-1}| + |\varepsilon_s|$; since $\{Y_t\}$ is a stationary solution, $E|Y_s|(1 - |\phi|) \leq E|\varepsilon_s|$, thus $E|Y_t| \leq E|\varepsilon_t|/(1 - |\phi|) < \infty$. Altogether this gives $P(|A_s - B_s| > \delta) \leq C|\phi|^s/\delta$ (for some finite constant $C$). Hence $\sum_{s=1}^{\infty}P(|A_s - B_s| > \delta) \leq \sum_{s=1}^{\infty}C|\phi|^s/\delta < \infty$. Thus by the Borel-Cantelli lemma, the event $\{|A_s - B_s| > \delta\}$ happens only finitely often (almost surely). Since $Y_t - X_t = A_s - B_s$ does not depend on $s$ and, for every $\delta > 0$, $\{|A_s - B_s| > \delta\}$ occurs (almost surely) only finitely often, we have $Y_t = X_t$ almost surely. Hence $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$ is (almost surely) the unique causal solution.
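For readers who want the Markov inequality step spelled out, here is one way to obtain the bounds, writing $A_s = \phi^{s+1}Y_{t-s-1}$ and $B_s = \sum_{j \geq s+1}\phi^j\varepsilon_{t-j}$ and using a generic tolerance $\delta > 0$. The explicit constant below is just one admissible choice of $C$, not necessarily the one intended in the notes.

```latex
\begin{align*}
P(|B_s| > \delta/2)
  &\le \frac{2E|B_s|}{\delta}
   \le \frac{2}{\delta}\sum_{j=s+1}^{\infty}|\phi|^{j}E|\varepsilon_0|
   = \frac{2E|\varepsilon_0|}{\delta}\cdot\frac{|\phi|^{s+1}}{1-|\phi|}, \\
P(|A_s| > \delta/2)
  &\le \frac{2|\phi|^{s+1}E|Y_{t-s-1}|}{\delta}
   \le \frac{2E|\varepsilon_0|}{\delta}\cdot\frac{|\phi|^{s+1}}{1-|\phi|}, \\
P(|A_s - B_s| > \delta)
  &\le \frac{C|\phi|^{s}}{\delta},
  \qquad C = \frac{4|\phi|\,E|\varepsilon_0|}{1-|\phi|}.
\end{align*}
```

Since $|\phi| < 1$, the right hand side is summable over $s$, which is exactly what the Borel-Cantelli argument requires.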
2.3.4
Let us now summarise our observation for the general AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$. If $|\phi| < 1$, then the solution is in terms of past values of $\{\varepsilon_t\}$; if on the other hand $|\phi| > 1$, the solution is in terms of future values of $\{\varepsilon_t\}$.
Now we try to understand this in terms of the expansions of the characteristic polynomial $\phi(B) = 1 - \phi B$ (using the AR(1) as a starting point). From what we learnt in the previous section, we require the inverse of the characteristic polynomial of the AR process to have a convergent power series expansion in a region including the ring $|B| = 1$. In terms of the AR(1) process, if the root of $\phi(B)$ is greater than one in absolute value, then the power series of $\phi(B)^{-1}$ is in terms of positive powers; if it is less than one, then $\phi(B)^{-1}$ is in terms of negative powers.
Generalising this argument to a general polynomial, if the roots of $\phi(B)$ are all greater than one in absolute value, then the power series of $\phi(B)^{-1}$ (which converges for $|B| = 1$) is in terms of positive powers (hence the solution $\phi(B)^{-1}\varepsilon_t$ will be in past terms of $\{\varepsilon_t\}$). On the other hand, if some roots are less than one and some are greater than one in absolute value (but none lie on the unit circle), then the power series of $\phi(B)^{-1}$ will be in both negative and positive powers. Thus the solution $X_t = \phi(B)^{-1}\varepsilon_t$ will be in terms of both past and future values of $\{\varepsilon_t\}$. We summarize this result in a lemma below.
Lemma 2.3.1 Suppose that the AR(p) process satisfies the representation $\phi(B)X_t = \varepsilon_t$, where none of the roots of the characteristic polynomial lie on the unit circle and $E|\varepsilon_t| < \infty$. Then $\{X_t\}$ has a stationary, almost surely unique, solution.
We see that where the roots of the characteristic polynomial $\phi(B)$ lie defines the solution of the AR process. We will show in Sections 2.3.6 and 3.1.2 that it not only defines the solution but also determines some of the characteristics of the time series.
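In practice the location of the roots can be inspected with `polyroot`, which takes polynomial coefficients in increasing order of power. The helper below, `classify_ar`, is a hypothetical name introduced for this sketch, and the tolerance used to decide whether a root is "on" the unit circle is an arbitrary assumption.

```r
classify_ar <- function(phi, tol = 1e-8) {
  # Moduli of the roots of 1 - phi_1 z - ... - phi_p z^p
  mod <- Mod(polyroot(c(1, -phi)))
  if (any(abs(mod - 1) < tol)) {
    "root on the unit circle"
  } else if (all(mod > 1)) {
    "causal"
  } else {
    "stationary but not causal"
  }
}

classify_ar(0.5)              # X_t = 0.5 X_{t-1} + eps_t
classify_ar(2)                # X_t = 2 X_{t-1} + eps_t
classify_ar(1)                # random walk
classify_ar(c(0.75, -0.125))  # an AR(2) with roots 2 and 4
```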
Exercise 2.1 Suppose $\{X_t\}$ satisfies the AR(p) representation
$$X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t,$$
where $\sum_{j=1}^{p}|\phi_j| < 1$ and $E|\varepsilon_t| < \infty$. Show that $\{X_t\}$ will always have a causal stationary solution.
2.3.5
Specific example
Suppose $\{X_t\}$ satisfies
$$X_t = 0.75X_{t-1} - 0.125X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables. We want to obtain a solution for the above equations.
It is not easy to use the backward (or forward) iterating technique for AR processes beyond order one. This is where using the backshift operator becomes useful. We start by writing $X_t = 0.75X_{t-1} - 0.125X_{t-2} + \varepsilon_t$ as $\phi(B)X_t = \varepsilon_t$, where $\phi(B) = 1 - 0.75B + 0.125B^2$, which leads to what is commonly known as the characteristic polynomial $\phi(z) = 1 - 0.75z + 0.125z^2$. If we can find a power series expansion of $\phi(B)^{-1}$ which is valid for $|B| = 1$, then the solution is $X_t = \phi(B)^{-1}\varepsilon_t$.
We first observe that $\phi(z) = 1 - 0.75z + 0.125z^2 = (1 - 0.5z)(1 - 0.25z)$. Therefore by using partial fractions we have
$$\frac{1}{\phi(z)} = \frac{1}{(1 - 0.5z)(1 - 0.25z)} = \frac{2}{(1 - 0.5z)} - \frac{1}{(1 - 0.25z)}.$$
We recall from geometric expansions that
$$\frac{2}{(1 - 0.5z)} = 2\sum_{j=0}^{\infty}(0.5)^j z^j \quad |z| < 2, \qquad \frac{-1}{(1 - 0.25z)} = -\sum_{j=0}^{\infty}(0.25)^j z^j \quad |z| < 4.$$
Combining the two expansions gives
$$\frac{1}{(1 - 0.5z)(1 - 0.25z)} = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}z^j \qquad |z| < 2.$$
The above expansion is valid for $|z| = 1$, because $\sum_{j=0}^{\infty}|2(0.5)^j - (0.25)^j| < \infty$ (see 2.3.2). Hence
$$X_t = \{(1 - 0.5B)(1 - 0.25B)\}^{-1}\varepsilon_t = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}B^j\varepsilon_t = \sum_{j=0}^{\infty}\{2(0.5)^j - (0.25)^j\}\varepsilon_{t-j},$$
which gives a stationary solution to the AR(2) process (see Lemma 2.2.1).
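The coefficients of this expansion can be cross-checked in R. The partial-fraction decomposition $\frac{2}{1-0.5z} - \frac{1}{1-0.25z}$ gives the closed form $\psi_j = 2(0.5)^j - (0.25)^j$, and `ARMAtoMA` computes the same MA($\infty$) coefficients recursively (it returns $\psi_1, \psi_2, \ldots$ and omits $\psi_0 = 1$). The number of lags `J` is an arbitrary choice for the sketch.

```r
phi <- c(0.75, -0.125)

# Roots of the characteristic polynomial 1 - 0.75 z + 0.125 z^2
roots <- polyroot(c(1, -phi))
Mod(roots)

J <- 10
psi_closed  <- 2 * 0.5^(1:J) - 0.25^(1:J)        # psi_j from partial fractions, j = 1..J
psi_numeric <- ARMAtoMA(ar = phi, lag.max = J)   # psi_j from the AR recursion

max(abs(psi_closed - psi_numeric))
```

A quick sanity check by hand: $\psi_1 = 2(0.5) - 0.25 = 0.75$, which matches the first AR coefficient as it must.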
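The discussion above shows how the backshift operator can be applied and how it can be used to obtain solutions of AR processes. More generally, for an AR(2) process whose characteristic polynomial is $(1 - az)(1 - bz)$ with $a \neq b$, partial fractions give
$$\frac{1}{(1 - za)(1 - zb)} = \frac{1}{b - a}\left(\frac{b}{1 - bz} - \frac{a}{1 - az}\right).$$
Cases:
(i) $|a| < 1$ and $|b| < 1$; this means the roots lie outside the unit circle. Thus the expansion is
$$\frac{1}{(1 - za)(1 - zb)} = \frac{1}{(b - a)}\left(b\sum_{j=0}^{\infty} b^j z^j - a\sum_{j=0}^{\infty} a^j z^j\right),$$
which leads to the causal solution
$$X_t = \frac{1}{b - a}\sum_{j=0}^{\infty}\left(b^{j+1} - a^{j+1}\right)\varepsilon_{t-j}. \qquad (2.4)$$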
(ii) $|a| > 1$ and $|b| < 1$; this means the roots lie inside and outside the unit circle and we have the expansion
$$\frac{1}{(1 - za)(1 - zb)} = \frac{1}{b - a}\left(\frac{b}{1 - bz} - \frac{a}{(az)((az)^{-1} - 1)}\right) = \frac{1}{(b - a)}\left(b\sum_{j=0}^{\infty} b^j z^j + z^{-1}\sum_{j=0}^{\infty} a^{-j} z^{-j}\right), \qquad (2.5)$$
which leads to the noncausal solution
$$X_t = \frac{1}{b - a}\left(\sum_{j=0}^{\infty} b^{j+1}\varepsilon_{t-j} + \sum_{j=0}^{\infty} a^{-j}\varepsilon_{t+1+j}\right). \qquad (2.6)$$
Later we show that the noncausal solution has the same correlation structure as the causal solution in which $a$ is replaced by $a^{-1}$.
This solution throws up additional interesting results. Let us return to the expansion in (2.5) and apply it to $X_t$:
$$X_t = \frac{1}{(1 - Ba)(1 - Bb)}\varepsilon_t = \frac{1}{b - a}\Big(\underbrace{\frac{b}{1 - bB}\varepsilon_t}_{\text{causal AR(1)}} + \underbrace{\frac{1}{B(1 - a^{-1}B^{-1})}\varepsilon_t}_{\text{noncausal AR(1)}}\Big) = \frac{1}{b - a}\left(bY_t + Z_{t+1}\right),$$
where $Y_t = bY_{t-1} + \varepsilon_t$ and $Z_{t+1} = a^{-1}Z_{t+2} + \varepsilon_{t+1}$. In other words, the noncausal AR(2) process is the sum of a causal and a "future" AR(1) process. This is true for all noncausal time series (except when there is multiplicity in the roots) and is discussed further in Section 2.6.
Several authors including Richard Davis, Jay Breidt and Beth Andrews argue that noncausal time series can model features in data which causal time series cannot.
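(iii) $a = b$ with $|a| < 1$ (a repeated root outside the unit circle). The characteristic polynomial is $(1 - az)^2$. To obtain the convergent expansion when $|z| = 1$ we note that $(1 - az)^{-2} = \frac{d(1 - az)^{-1}}{d(az)}$. Thus
$$\frac{1}{(1 - az)^2} = \sum_{j=1}^{\infty} j(az)^{j-1}.$$
This leads to the causal solution
$$X_t = \sum_{j=1}^{\infty} j a^{j-1}\varepsilon_{t-j+1}.$$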
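Exercise 2.2 Show that for the AR(2) model $X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \varepsilon_t$ to have a causal stationary solution the parameters $\phi_1, \phi_2$ must lie in the region
$$\phi_2 + \phi_1 < 1, \qquad \phi_2 - \phi_1 < 1, \qquad |\phi_2| < 1.$$
Exercise 2.3 (a) Consider the AR(2) process
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one. Suppose the roots of the characteristic polynomial $1 - \phi_1 z - \phi_2 z^2$ are greater than one in absolute value. Show that $|\phi_1| + |\phi_2| < 4$.
(b) Now consider a generalisation of this result. Consider the AR(p) process
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \ldots + \phi_p X_{t-p} + \varepsilon_t.$$
Suppose the roots of the characteristic polynomial $1 - \phi_1 z - \ldots - \phi_p z^p$ are greater than one in absolute value. Show that $|\phi_1| + \ldots + |\phi_p| \leq 2^p$.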
2.3.6
We now explain why the AR(2) (and higher orders) can characterise some very interesting behaviour (over the rather dull AR(1)). For now we assume that $X_t$ is a causal time series which satisfies the AR(2) representation
$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \varepsilon_t$$
where $\{\varepsilon_t\}$ are iid with mean zero and finite variance. The characteristic polynomial is $\phi(B) = 1 - \phi_1 B - \phi_2 B^2$. Let us assume the roots of $\phi(B)$ are complex; since $\phi_1$ and $\phi_2$ are real, the roots are complex conjugates. Thus by using case (i) above we have
$$\frac{1}{1 - \phi_1 B - \phi_2 B^2} = \frac{C}{1 - \lambda B} + \frac{\bar{C}}{1 - \bar{\lambda}B},$$
where $\lambda^{-1}$ and $\bar{\lambda}^{-1}$ are the roots of the characteristic polynomial. Thus
$$X_t = C\sum_{j=0}^{\infty}\lambda^j\varepsilon_{t-j} + \bar{C}\sum_{j=0}^{\infty}\bar{\lambda}^j\varepsilon_{t-j}, \qquad (2.7)$$
where $C = [1 - \bar{\lambda}\lambda^{-1}]^{-1}$. Since $\lambda$ and $C$ are complex we use the representation $\lambda = r\exp(i\theta)$ and $C = \alpha\exp(i\beta)$ (noting that $|r| < 1$), and substitute these expressions for $\lambda$ and $C$ into (2.7) to give
$$X_t = 2\alpha\sum_{j=0}^{\infty} r^j\cos(\theta j + \beta)\varepsilon_{t-j}.$$
We can see that $X_t$ is effectively the sum of cosines with frequency $\theta$ that have been modulated by the iid errors and exponentially damped. This is why for realisations of autoregressive processes you will often see periodicities (depending on the roots of the characteristic polynomial). These arguments can be generalised to higher orders $p$.
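The damped-cosine form of the coefficients can be checked numerically. In the sketch below the AR(2) coefficients are hypothetical values chosen to give complex roots. It computes $r$ and $\theta$ from the roots and compares the MA($\infty$) coefficients returned by `ARMAtoMA` with the closed form $r^j\sin((j+1)\theta)/\sin\theta$, which is (2.4) with $b = \lambda$ and $a = \bar{\lambda}$ and is a damped cosine with phase $\beta = \theta - \pi/2$.

```r
phi <- c(1.5, -0.75)                   # hypothetical AR(2) coefficients with complex roots

roots  <- polyroot(c(1, -phi))         # roots of 1 - phi_1 z - phi_2 z^2
lambda <- 1 / roots                    # lambda and its complex conjugate
r      <- Mod(lambda[1])
theta  <- abs(Arg(lambda[1]))

c(r = r, theta = theta, pseudo_period = 2 * pi / theta)

J <- 40
psi_numeric <- c(1, ARMAtoMA(ar = phi, lag.max = J))          # psi_0, ..., psi_J
psi_closed  <- r^(0:J) * sin((0:J + 1) * theta) / sin(theta)  # damped sinusoid

max(abs(psi_numeric - psi_closed))
```

For these coefficients $r = \sqrt{0.75}$ and $\theta = \pi/6$, so the coefficients oscillate with a pseudo-period of 12 while decaying geometrically.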
Exercise 2.4
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.
Does the solution have an MA($\infty$) representation?
(b) Obtain the stationary solution of the AR(2) process
$$X_t = \frac{4\sqrt{3}}{5}X_{t-1} - \frac{4^2}{5^2}X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.
Does the solution have an MA($\infty$) representation?
(c) Obtain the stationary solution of the AR(2) process
$$X_t = X_{t-1} - 4X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.
Does the solution have an MA($\infty$) representation?
Exercise 2.5 Construct a causal stationary AR(2) process with pseudo-period 17. Using the R function arima.sim simulate a realisation from this process (of length 200) and make a plot of the periodogram. What do you observe about the peak in this plot?
Below we now consider solutions to general AR($\infty$) processes.
2.3.7
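AR($\infty$) models are more general than the AR(p) model and are able to model more complex behaviour, such as slower decay of the covariance structure. It is arguable how useful these models are in modelling data; however, recently they have become quite popular in time series bootstrap methods.
In order to obtain the stationary solution of an AR($\infty$), we need to define an analytic function and its inverse.
Definition 2.3.1 (Analytic functions in the region $\Omega$) Suppose that $z \in \mathbb{C}$. $\phi(z)$ is an analytic complex function in the region $\Omega$, if it has a power series expansion which converges in $\Omega$, that is $\phi(z) = \sum_{j=-\infty}^{\infty}\phi_j z^j$. If there exists a function $\tilde{\phi}(z) = \sum_{j=-\infty}^{\infty}\tilde{\phi}_j z^j$ such that $\tilde{\phi}(z)\phi(z) = 1$ for all $z \in \Omega$, then $\tilde{\phi}(z)$ is the inverse of $\phi(z)$ in the region $\Omega$.
Well known examples of analytic functions include finite order polynomials such as $\phi(z) = \sum_{j=0}^{p}\phi_j z^j$ for $\Omega = \mathbb{C}$, and the expansion $(1 - 0.5z)^{-1} = \sum_{j=0}^{\infty}(0.5z)^j$ for $\Omega = \{z; |z| < 2\}$.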
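We observe that for AR processes we can represent the equation as $\phi(B)X_t = \varepsilon_t$, which formally gives the solution $X_t = \phi(B)^{-1}\varepsilon_t$. This raises the question: under what conditions on $\phi(B)^{-1}$ is $\phi(B)^{-1}\varepsilon_t$ a valid solution? For $\phi(B)^{-1}\varepsilon_t$ to make sense, $\phi(B)^{-1}$ should be represented as a power series expansion. Below, we give conditions on the power series expansion which give a stationary solution. It is worth noting this is closely related to Lemma 2.2.1.
Lemma 2.3.2 Suppose that $\psi(z) = \sum_{j=-\infty}^{\infty}\psi_j z^j$ is analytic in a region that includes the unit circle $|z| = 1$ and $\{X_t\}$ is a strictly stationary process with $E|X_t| < \infty$. Then $Y_t = \psi(B)X_t = \sum_{j=-\infty}^{\infty}\psi_j X_{t-j}$ is almost surely finite and strictly stationary.
PROOF. It can be shown that if $\psi(z)$ is analytic in a region that includes the unit circle (so that in particular $\sup_{|z|=1}|\psi(z)| < \infty$, in other words on the unit circle $\sum_{j=-\infty}^{\infty}\psi_j z^j$ is finite), then $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$. Since the coefficients are absolutely summable, by Lemma 2.2.1 we have that $Y_t = \psi(B)X_t = \sum_{j=-\infty}^{\infty}\psi_j X_{t-j}$ is almost surely finite and strictly stationary.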
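Using the above we can obtain the solution of an AR($\infty$) (which includes an AR(p) as a special case). Suppose that
$$X_t = \sum_{j=1}^{\infty}\phi_j X_{t-j} + \varepsilon_t$$
and that $\phi(z)^{-1} = \sum_{j=-\infty}^{\infty}\psi_j z^j$ is analytic in a region that includes the unit circle $|z| = 1$. Then the solution of $X_t$ is
$$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}.$$
In particular, this applies to the AR(p) process
$$X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t,$$
with characteristic polynomial $\phi(B) = 1 - \sum_{j=1}^{p}\phi_j B^j$, provided none of its roots lie on the unit circle. More generally, if $a(B)X_t = Z_t$ and $\frac{1}{a(z)}$ is analytic in a region that includes the unit circle, then $X_t = \frac{1}{a(B)}Z_t$.
Example 2.3.1
(i) Clearly $a(z) = 1 - 0.5z$ is analytic for all $z \in \mathbb{C}$, and has no zeros for $|z| < 2$. The inverse $\frac{1}{a(z)} = \sum_{j=0}^{\infty}(0.5z)^j$ is well defined in the region $|z| < 2$.
(ii) Clearly $a(z) = 1 - 2z$ is analytic for all $z \in \mathbb{C}$, and has no zeros for $|z| > 1/2$. The inverse is $\frac{1}{a(z)} = (-2z)^{-1}(1 - (1/2z))^{-1} = (-2z)^{-1}(\sum_{j=0}^{\infty}(1/(2z))^j)$, well defined in the region $|z| > 1/2$.
(iii) The function $a(z) = \frac{1}{(1 - 0.5z)(1 - 2z)}$ is analytic in the region $0.5 < |z| < 2$.
(iv) $a(z) = 1 - z$ is analytic for all $z \in \mathbb{C}$, but is zero for $z = 1$. Hence its inverse is not well defined for regions which involve $|z| = 1$ (see Example 2.3.2).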
Example 2.3.2 (Unit root/integrated processes and noninvertible processes)
(i) If the difference equation has root one, then an (almost sure) stationary solution of the AR model does not exist. The simplest example is the random walk $X_t = X_{t-1} + \varepsilon_t$ ($\phi(z) = (1 - z)$). This is an example of an Autoregressive Integrated Moving Average ARIMA(0, 1, 0) model $(1 - B)X_t = \varepsilon_t$.
To see that it does not have a stationary solution, we iterate the equation $n$ steps backwards and we see that $X_t = \sum_{j=0}^{n}\varepsilon_{t-j} + X_{t-n-1}$. $S_{t,n} = \sum_{j=0}^{n}\varepsilon_{t-j}$ is the partial sum, but it is clear that the partial sum $S_{t,n}$ does not have a limit, since it is not a Cauchy sequence, ie. $|S_{t,n} - S_{t,m}|$ does not converge to zero. However, given some initial value $X_0$, for $t > 0$ we can define the unit process $X_t = X_{t-1} + \varepsilon_t$. Notice that the nonstationary solution of this sequence is $X_t = X_0 + \sum_{j=1}^{t}\varepsilon_j$ which has variance $\mathrm{var}(X_t) = \mathrm{var}(X_0) + t$ (assuming that $\{\varepsilon_t\}$ are iid random variables with variance one and independent of $X_0$).
We observe that we can stationarize the process by taking first differences, ie. defining $Y_t = X_t - X_{t-1} = \varepsilon_t$.
(ii) The unit process described above can be generalised to ARIMA(0, d, 0), where $(1 - B)^d X_t = \varepsilon_t$. In this case to stationarize the sequence we take $d$ differences, ie. let $Y_{t,0} = X_t$ and for $1 \leq i \leq d$ define the iteration
$$Y_{t,i} = Y_{t,i-1} - Y_{t-1,i-1}$$
and $Y_t = Y_{t,d}$ will be a stationary sequence. Or define
$$Y_t = \sum_{j=0}^{d}\frac{d!}{j!(d - j)!}(-1)^j X_{t-j}.$$
(iii) The general ARIMA(p, d, q) is defined as $(1 - B)^d\phi(B)X_t = \theta(B)\varepsilon_t$, where $\phi(B)$ and $\theta(B)$ are $p$ and $q$ order polynomials respectively and the roots of $\phi(B)$ lie outside the unit circle.
Another way of describing the above model is that after taking $d$ differences (as detailed in (ii)) the resulting process is an ARMA(p, q) process (see Section 2.5 for the definition of an ARMA model).
To illustrate the difference between stationary ARMA and ARIMA processes, Figure 2.1 shows realisations from an AR process and its corresponding integrated process.
(iv) In examples (i) and (ii) a stationary solution does not exist. We now consider an example where the process is stationary but an autoregressive representation does not exist.
Consider the MA(1) model $X_t = \varepsilon_t - \varepsilon_{t-1}$. We recall that this can be written as $X_t = \theta(B)\varepsilon_t$ where $\theta(B) = 1 - B$. From Example 2.3.1(iv) we know that $\theta(z)^{-1}$ does not exist, therefore it does not have an autoregressive representation.
Figure 2.1: Realisations from an AR process (series ar2) and its corresponding integrated process (series ar2I), plotted against Time, using N(0, 1) innovations (generated using the same seed).
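The two ways of differencing described in Example 2.3.2(ii) can be compared directly in R. The sketch below builds an ARIMA(0, 2, 0) path started at zero, then differences it both iteratively (with `diff`) and with the binomial formula; the order `d`, the series length and the Gaussian innovations are assumptions made for the example.

```r
set.seed(4)
d   <- 2
eps <- rnorm(300)
x   <- cumsum(cumsum(eps))               # (1 - B)^2 X_t = eps_t, started at zero

# (a) iterate first differences d times
y_iter <- diff(x, differences = d)

# (b) binomial formula: Y_t = sum_{j=0}^{d} choose(d, j) (-1)^j X_{t-j}
j <- 0:d
w <- choose(d, j) * (-1)^j
y_binom <- sapply((d + 1):length(x), function(t) sum(w * x[t - j]))

c(max_diff_between_methods = max(abs(y_iter - y_binom)),
  max_diff_from_innovations = max(abs(y_iter - eps[(d + 1):length(eps)])))
```

Both versions return the original innovations (from time $d + 1$ onwards), which is the sense in which differencing stationarizes the integrated process.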
2.4
To understand why the magic backshift operator works, we use matrix notation to rewrite the
AR(p) model as an infinite set of difference equations
..
.
Xt
... 0
1 1 . . . p . . .
t1
... 0
0
1
1 . . . p
X
t2
.
... ... ... ...
...
...
...
..
..
.
t
=
t1
t2
.
..
The above is an infinite dimensional equation (and the matrix is an infinite upper triangular matrix).
Formally to obtain a simulation we invert the matrix to get a solution of Xt in terms of t . Of course
in reality it is not straightfoward to define this inverse. Instead let us consider a finite (truncated)
version of the above matrix equation. Except for the edge effects this is a circulant matrix (where
the rows are repeated, but each time shifted by one, see wiki for a description). Truncating the
matrix to have dimension n, we approximate the above by the finite set of nequations
...
0
1
1
...
...
...
1 2 . . .
...
...
...
...
...
Xn
Xn1
..
.
X0
= n1
..
C n X n n .
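The approximation of the AR(p) equation only arises in the first p equations, where
$$X_0 - \sum_{j=1}^{p}\phi_jX_{n-j} = \varepsilon_0,$$
$$X_1 - \phi_1X_0 - \sum_{j=2}^{p}\phi_jX_{n+1-j} = \varepsilon_1,$$
$$\vdots$$
$$X_p - \sum_{j=1}^{p-1}\phi_jX_{p-j} - \phi_pX_n = \varepsilon_p.$$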
$$U_n = \begin{pmatrix}
0 & 1 & 0 & 0 & \cdots & 0 \\
0 & 0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & 0 & \cdots & 1 \\
1 & 0 & 0 & 0 & \cdots & 0
\end{pmatrix}.$$
We observe that $U_n$ is a deformed diagonal matrix where all the ones along the diagonal have been shifted once to the right, and the left over one is placed in the bottom left hand corner. $U_n$ is another example of a circulant matrix; moreover $U_n^2$ shifts all the ones once again to the right:
$$U_n^2 = \begin{pmatrix}
0 & 0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 0 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & 0 & \cdots & 0
\end{pmatrix}.$$
$U_n^3$ shifts the ones to the third off-diagonal and so forth until $U_n^n = I$. Thus all circulant matrices can be written in terms of powers of $U_n$ (the matrix $U_n$ can be considered as the building block of circulant matrices). In particular
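$$C_n = I_n - \sum_{j=1}^{p}\phi_jU_n^j,$$
so the truncated system reads $[I_n - \sum_{j=1}^{p}\phi_jU_n^j]\underline{X}_n = \underline{\varepsilon}_n$, which has the solution
$$\underline{X}_n = \Big(I_n - \sum_{j=1}^{p}\phi_jU_n^j\Big)^{-1}\underline{\varepsilon}_n.$$
It remains to expand the inverse $(I_n - \sum_{j=1}^{p}\phi_jU_n^j)^{-1}$, which is where the connection with the backshift operator appears.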
To do this we recall the similarity between the matrix $I_n - \sum_{j=1}^{p}\phi_jU_n^j$ and the characteristic equation $\phi(z) = 1 - \sum_{j=1}^{p}\phi_jz^j$. In particular, since we can factorize the characteristic equation as $\phi(B) = \prod_{j=1}^{p}[1 - \lambda_jB]$, we can factorize the matrix $I_n - \sum_{j=1}^{p}\phi_jU_n^j = \prod_{j=1}^{p}[I_n - \lambda_jU_n]$. To obtain the inverse, for simplicity, we assume that the roots of the characteristic function are greater than one in absolute value (ie. $|\lambda_j| < 1$, which we recall corresponds to a causal solution) and are all different. Then there exist constants $c_j$ where
$$\Big[I_n - \sum_{j=1}^{p}\phi_jU_n^j\Big]^{-1} = \sum_{j=1}^{p}c_j(I_n - \lambda_jU_n)^{-1}$$
(just as in partial fractions); to see why, multiply the above by $[I_n - \sum_{j=1}^{p}\phi_jU_n^j]$. Finally, we recall that if the eigenvalues of $A$ are less than one in absolute value, then $(I - A)^{-1} = \sum_{j=0}^{\infty}A^j$. The eigenvalues of $U_n$ are $\{\exp(2\pi ij/n); j = 1, \ldots, n\}$, thus the eigenvalues of $\lambda_jU_n$ are less than one in absolute value. This gives $(I_n - \lambda_jU_n)^{-1} = \sum_{k=0}^{\infty}\lambda_j^kU_n^k$ and
$$\Big[I_n - \sum_{j=1}^{p}\phi_jU_n^j\Big]^{-1} = \sum_{j=1}^{p}c_j\sum_{k=0}^{\infty}\lambda_j^kU_n^k. \qquad (2.8)$$
Therefore
$$\underline{X}_n = C_n^{-1}\underline{\varepsilon}_n = \sum_{j=1}^{p}c_j\sum_{k=0}^{\infty}\lambda_j^kU_n^k\underline{\varepsilon}_n.$$
Let us focus on the first element of the vector $\underline{X}_n$, which is $X_n$. Since $U_n^k\underline{\varepsilon}_n$ shifts the elements of $\underline{\varepsilon}_n$ up by k (note that this shift is with wrapping of the vector) we have
$$X_n = \sum_{j=1}^{p}c_j\sum_{k=0}^{n}\lambda_j^k\varepsilon_{n-k} + \underbrace{\sum_{j=1}^{p}c_j\sum_{k=n+1}^{\infty}\lambda_j^k\varepsilon_{n-k \bmod (n)}}_{\to 0}. \qquad (2.9)$$
Note that the second term decays geometrically fast to zero. This gives the stationary solution $X_n = \sum_{j=1}^{p}c_j\sum_{k=0}^{\infty}\lambda_j^k\varepsilon_{n-k}$.
To recollect, we have shown that $[I_n - \sum_{j=1}^{p}\phi_jU_n^j]^{-1}$ admits the solution in (2.8) (which is the same as the expansion of $\phi(B)^{-1}$) and that $U_n^j\underline{\varepsilon}_n$ plays the role of the backshift operator. Therefore, we can use the backshift operator in obtaining a solution of an AR process because it plays the role of the matrix $U_n$.
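The circulant argument can be checked numerically. The following R sketch is illustrative only: the dimension n = 150, the seed and the AR(2) coefficients (1.5, -0.75) are assumed values, and the vector is stored with n elements $(\varepsilon_n, \ldots, \varepsilon_1)$. It builds $U_n$, confirms $U_n^n = I_n$, solves the circulant system and compares its first element with the truncated causal expansion.

```r
# Illustrative check of the circulant approximation (n, seed and phi are assumed values)
n   <- 150
phi <- c(1.5, -0.75)                    # causal AR(2): roots outside the unit circle

# U_n: ones on the first superdiagonal, remaining one in the bottom-left corner
U <- rbind(cbind(0, diag(n - 1)), c(1, rep(0, n - 1)))

# U_n^n should be the identity (products of 0/1 permutation matrices are exact)
P <- diag(n)
for (k in 1:n) P <- P %*% U
Un_power_is_identity <- all(P == diag(n))

# C_n = I_n - phi_1 U_n - phi_2 U_n^2
C <- diag(n) - phi[1] * U - phi[2] * (U %*% U)

set.seed(1)
eps    <- rnorm(n)                      # stored as (eps_n, eps_{n-1}, ..., eps_1)
X_circ <- solve(C, eps)                 # solution of the circulant system

# Causal expansion X_n = sum_k a_k eps_{n-k}, truncated at k = n - 1
a        <- c(1, ARMAtoMA(ar = phi, lag.max = n - 1))
X_causal <- sum(a * eps)

wrap_error <- abs(X_circ[1] - X_causal) # contribution of the wrapped (edge) terms
```

The quantity wrap_error plays the role of the second term in (2.9); it shrinks geometrically as n grows.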
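$$\begin{pmatrix}
1 & -\phi_1 & 0 & \cdots & 0 \\
0 & 1 & -\phi_1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
-\phi_1 & 0 & \cdots & 0 & 1
\end{pmatrix}
\begin{pmatrix} X_n \\ X_{n-1} \\ \vdots \\ X_0 \end{pmatrix}
=
\begin{pmatrix} \varepsilon_n \\ \varepsilon_{n-1} \\ \vdots \\ \varepsilon_0 \end{pmatrix},$$
that is, $C_n\underline{X}_n = \underline{\varepsilon}_n$.
The approximation of the AR(1) is only for the first equation, where $X_0 - \phi_1X_n = \varepsilon_0$. Using the matrix $U_n$, the above equation can be written as $(I_n - \phi_1U_n)\underline{X}_n = \underline{\varepsilon}_n$, which gives the solution $\underline{X}_n = (I_n - \phi_1U_n)^{-1}\underline{\varepsilon}_n$.
Let us suppose that $|\phi_1| > 1$ (ie, the root lies inside the unit circle and the solution is noncausal), then to get a convergent expansion of $(I_n - \phi_1U_n)^{-1}$ we rewrite $(I_n - \phi_1U_n) = -\phi_1U_n(I_n - \phi_1^{-1}U_n^{-1})$. Thus we have
$$(I_n - \phi_1U_n)^{-1} = -\Big[\sum_{k=0}^{\infty}\phi_1^{-k}U_n^{-k}\Big](\phi_1U_n)^{-1}.$$
Therefore
$$\underline{X}_n = -\Big(\sum_{k=0}^{\infty}\frac{1}{\phi_1^{k+1}}U_n^{-(k+1)}\Big)\underline{\varepsilon}_n,$$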
2.4.1
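$$X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t.$$
For the rest of this section we will assume that the roots of the characteristic function, $\phi(z)$, lie outside the unit circle, thus the solution is causal. We can rewrite the above as a Vector Autoregressive (VAR(1)) process
$$\underline{X}_t = A\underline{X}_{t-1} + \underline{\varepsilon}_t \qquad (2.10)$$
where
$$A = \begin{pmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{pmatrix}, \qquad (2.11)$$
$\underline{X}_t' = (X_t, \ldots, X_{t-p+1})$ and $\underline{\varepsilon}_t' = (\varepsilon_t, 0, \ldots, 0)$. The characteristic polynomial of $A$ is
$$\det(zI - A) = z^p - \sum_{i=1}^{p}\phi_iz^{p-i} = \underbrace{z^p\Big(1 - \sum_{i=1}^{p}\phi_iz^{-i}\Big)}_{= z^p\phi(z^{-1})},$$
thus the eigenvalues of $A$ are the reciprocals of the roots of $\phi(z)$ and lie inside the unit circle. It can be shown that for any $|\lambda_{\max}(A)| < \rho < 1$, there exists a constant $C_\rho$ such that $\|A^j\|_{spec} \le C_\rho\rho^j$ (see Appendix A). Note that this result is straightforward if the eigenvalues are distinct (in which case the spectral decomposition can be used), in which case $\|A^j\|_{spec} \le C|\lambda_{\max}(A)|^j$ (note that $\|A\|_{spec}$ is the spectral norm of $A$, which is the square root of the largest eigenvalue of the symmetric matrix $AA'$).
We can apply the same back iterating that we did for the AR(1) to the vector AR(1). Iterating (2.10) backwards k times gives
$$\underline{X}_t = \sum_{j=0}^{k-1}A^j\underline{\varepsilon}_{t-j} + A^k\underline{X}_{t-k}.$$
Letting $k \to \infty$ gives
$$\underline{X}_t = \sum_{j=0}^{\infty}A^j\underline{\varepsilon}_{t-j}.$$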
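The link between the companion matrix $A$ and the characteristic polynomial can be illustrated numerically. This is a minimal R sketch under assumed values: the AR(2) coefficients (1.5, -0.75) and the number of powers (50) are chosen for illustration, and the construction of A below assumes $p \ge 2$.

```r
# Sketch: eigenvalues of the companion matrix versus roots of phi(z) (phi is an assumed example)
phi <- c(1.5, -0.75)
p   <- length(phi)                               # construction below assumes p >= 2

A <- rbind(phi, cbind(diag(p - 1), 0))           # companion matrix as in (2.11)

eig   <- eigen(A)$values                         # eigenvalues of A
roots <- polyroot(c(1, -phi))                    # roots of phi(z) = 1 - phi_1 z - phi_2 z^2

# Eigenvalue moduli should equal the reciprocal root moduli
modulus_gap <- max(abs(sort(Mod(eig)) - sort(1 / Mod(roots))))

# Spectral norm of A^j for j = 1, ..., 50: decays geometrically
Aj    <- diag(p)
norms <- numeric(50)
for (j in 1:50) {
  Aj       <- Aj %*% A
  norms[j] <- norm(Aj, type = "2")
}
```

Plotting log(norms) against j gives a roughly straight line whose slope is close to log(max(Mod(eig))).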
2.5
Up to now, we have defined the moving average and the autoregressive model. The MA(q) model has the feature that after q lags there isn't any correlation between two random variables. On the other hand, there are correlations at all lags for an AR(p) model. In addition, as we shall see later on, it is much easier to estimate the parameters of an AR model than an MA. Therefore, there are several advantages in fitting an AR model to the data (note that when the roots of the characteristic polynomial lie outside the unit circle, then the AR can also be written as an MA($\infty$), since it is causal). However, if we do fit an AR model to the data, what order of model should we use? Usually one uses the AIC (BIC or similar criterion) to determine the order. But for many data sets, the selected order tends to be relatively large, for example order 14. The large order is usually chosen when correlations tend to decay slowly and/or the autocorrelation structure is quite complex (not just monotonically decaying). However, a model involving 10-15 unknown parameters is not particularly parsimonious and more parsimonious models which can model the same behaviour would be useful. A very useful generalisation which can be more flexible (and parsimonious) is the ARMA(p, q) model, in this case $X_t$ satisfies
$$X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.$$
(i) The autoregressive AR(p) model: $\{X_t\}$ satisfies
$$X_t = \sum_{i=1}^{p}\phi_iX_{t-i} + \varepsilon_t. \qquad (2.12)$$
(ii) The moving average MA(q) model: $\{X_t\}$ satisfies
$$X_t = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. \qquad (2.13)$$
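(iii) The autoregressive moving average ARMA(p, q) model: $\{X_t\}$ satisfies
$$X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}. \qquad (2.14)$$
The solutions of (2.14) can be summarised as follows.
(i) If the roots of $\phi(z)$ lie outside the unit circle, with absolute value greater than $1 + \delta$ for some $\delta > 0$, then $X_t$ has the causal solution
$$X_t = \sum_{j=0}^{\infty}a_j\varepsilon_{t-j}, \qquad (2.15)$$
where for $j > q$, $a_j = [A^j]_{1,1} + \sum_{i=1}^{q}\theta_i[A^{j-i}]_{1,1}$, with
$$A = \begin{pmatrix}
\phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\
1 & 0 & \cdots & 0 & 0 \\
\vdots & \ddots & \ddots & \vdots & \vdots \\
0 & \cdots & \cdots & 1 & 0
\end{pmatrix},$$
where $\sum_j|a_j| < \infty$ (we note that really $a_j = a_j(\phi, \theta)$ since it is a function of $\{\phi_i\}$ and $\{\theta_i\}$). Moreover
$$|a_j| \le K\rho^j \qquad (2.16)$$
for some finite constant $K$ and $(1 + \delta)^{-1} < \rho < 1$.
(ii) If the roots of $\phi(z)$ lie both inside and outside the unit circle (but not on it), then
$$X_t = \sum_{j=-\infty}^{\infty}a_j\varepsilon_{t-j}, \qquad (2.17)$$
where
$$|a_j| \le K\rho^{|j|}. \qquad (2.18)$$
(iii) If the roots of $\theta(z) = 1 + \sum_{j=1}^{q}\theta_jz^j$ lie outside the unit circle, then (2.14) can be written as
$$X_t = \sum_{j=1}^{\infty}b_jX_{t-j} + \varepsilon_t, \qquad (2.19)$$
where
$$|b_j| \le K\rho^j. \qquad (2.20)$$
To prove (i) we write the ARMA process as the vector difference equation
$$\underline{X}_t = A\underline{X}_{t-1} + \underline{\varepsilon}_t, \qquad (2.21)$$
where $\underline{X}_t' = (X_t, \ldots, X_{t-p+1})$ and $\underline{\varepsilon}_t' = (\varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}, 0, \ldots, 0)$.
Now iterating (2.21), we have
$$\underline{X}_t = \sum_{j=0}^{\infty}A^j\underline{\varepsilon}_{t-j}, \qquad (2.22)$$
and concentrating on the first element of the vector $\underline{X}_t$ we see that
$$X_t = \sum_{i=0}^{\infty}[A^i]_{1,1}\Big(\varepsilon_{t-i} + \sum_{j=1}^{q}\theta_j\varepsilon_{t-i-j}\Big).$$
Comparing (2.15) with the above it is clear that for $j > q$, $a_j = [A^j]_{1,1} + \sum_{i=1}^{q}\theta_i[A^{j-i}]_{1,1}$.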
Observe that the above representation is very similar to the AR(1). Indeed as we will show below the $A^j$ behaves in much the same way as the $\phi^j$ in the AR(1) example. As with $\phi^j$, we will show that $A^j$ converges to zero as $j \to \infty$ (because the eigenvalues of $A$ are less than one in absolute value). We now show that
$$|X_t| \le K\sum_{j=0}^{\infty}\rho^j|\varepsilon_{t-j}|$$
for some $0 < \rho < 1$; this will mean that $|a_j| \le K\rho^j$. To bound $|X_t|$ we use (2.22):
$$|X_t| \le \|\underline{X}_t\|_2 \le \sum_{j=0}^{\infty}\|A^j\|_{spec}\|\underline{\varepsilon}_{t-j}\|_2.$$
Hence, by using Section 2.4.1 we have $\|A^j\|_{spec} \le C_\rho\rho^j$ (for any $|\lambda_{\max}(A)| < \rho < 1$), which gives the corresponding bound for $|a_j|$.
To prove (ii) we consider a power series expansion of $\theta(z)/\phi(z)$. If the roots of $\phi(z)$ are distinct, then it is straightforward to write $\phi(z)^{-1}$ in terms of partial fractions and a convergent power series for $|z| = 1$. This expansion immediately gives the linear coefficients $a_j$ and shows that $|a_j| \le C(1 + \delta)^{-|j|}$ for some finite constant $C$. On the other hand, if there are multiple roots, say the roots of $\phi(z)$ are $\lambda_1, \ldots, \lambda_s$ with multiplicity $m_1, \ldots, m_s$ (where $\sum_{j=1}^{s}m_j = p$) then we need to adjust the partial fraction expansion. It can be shown that $|a_j| \le C|j|^{\max_s|m_s|}(1 + \delta)^{-|j|}$. We note that for every $(1 + \delta)^{-1} < \rho < 1$, there exists a constant such that $|j|^{\max_s|m_s|}(1 + \delta)^{-|j|} \le C\rho^{|j|}$, thus we obtain the desired result.
To show (iii) we use a similar proof to (i), and omit the details.
Corollary 2.5.1 An ARMA process is invertible if the roots of $\theta(B)$ (the MA coefficients) lie outside the unit circle and causal if the roots of $\phi(B)$ (the AR coefficients) lie outside the unit circle.
The representation of an ARMA process is unique up to AR and MA polynomials $\phi(B)$ and $\theta(B)$ having common roots. The simplest example is $X_t = \varepsilon_t$, which also satisfies the representation $X_t - \phi X_{t-1} = \varepsilon_t - \phi\varepsilon_{t-1}$ etc. Therefore it is not possible to identify common factors in the polynomials.
One of the main advantages of the invertibility property is in prediction and estimation. We will consider this in detail below. It is worth noting that even if an ARMA process is not invertible, one can generate a time series which has identical correlation structure but is invertible (see Section 3.3).
2.6
2j
1 =
j=0
r1
.
1 21
j1 tj .
j=0
66
t
X
j1 tj .
j=0
$$X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t,$$
whose characteristic function has roots both inside and outside the unit circle. Thus, the stationary solution of this equation is not causal. It is not possible to simulate directly from this equation. To see why, consider directly simulating from $X_t = 2X_{t-1} + \varepsilon_t$ without rearranging it as $X_{t-1} = \frac{1}{2}X_t - \frac{1}{2}\varepsilon_t$; the solution would explode. Now if the roots are both inside and outside the unit circle, there would not be a way to rearrange the equation to iterate a stationary solution. There are two methods to remedy this problem:
(i)
$$X_t = \sum_{j=-\infty}^{\infty}a_j\varepsilon_{t-j}, \qquad (2.23)$$
where the coefficients $a_j$ are determined from the characteristic equation. Thus to simulate the process we use the above representation, though we do need to truncate the number of terms in (2.23) and use
$$\widetilde{X}_t = \sum_{j=-M}^{M}a_j\varepsilon_{t-j}.$$
(ii) The above is a brute force method; it is only an approximation and is also difficult to evaluate. There is a simpler method, if one studies the roots of the characteristic equation.
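Let us suppose that $\{\lambda_{j_1}^{-1}; j_1 = 1, \ldots, p_1\}$ are the roots of $\phi(z)$ which lie outside the unit circle and $\{\mu_{j_2}^{-1}; j_2 = 1, \ldots, p_2\}$ are the roots which lie inside the unit circle (so that $|\lambda_{j_1}| < 1$ and $|\mu_{j_2}| > 1$). For ease of calculation we will assume the roots are distinct. We can rewrite $\phi(z)^{-1}$ as
$$\phi(z)^{-1} = \frac{1}{\big[\prod_{j_1=1}^{p_1}(1 - \lambda_{j_1}z)\big]\big[\prod_{j_2=1}^{p_2}(1 - \mu_{j_2}z)\big]}
= \sum_{j_1=1}^{p_1}\frac{c_{j_1}}{(1 - \lambda_{j_1}z)} + \sum_{j_2=1}^{p_2}\frac{d_{j_2}}{(1 - \mu_{j_2}z)}
= \sum_{j_1=1}^{p_1}\frac{c_{j_1}}{(1 - \lambda_{j_1}z)} - \sum_{j_2=1}^{p_2}\frac{d_{j_2}}{\mu_{j_2}z(1 - \mu_{j_2}^{-1}z^{-1})}.$$
Thus the solution is
$$X_t = \phi(B)^{-1}\varepsilon_t = \sum_{j_1=1}^{p_1}\frac{c_{j_1}}{(1 - \lambda_{j_1}B)}\varepsilon_t - \sum_{j_2=1}^{p_2}\frac{d_{j_2}}{\mu_{j_2}B(1 - \mu_{j_2}^{-1}B^{-1})}\varepsilon_t.$$
Let $Y_{j_1,t} = \lambda_{j_1}Y_{j_1,t-1} + \varepsilon_t$ and $Z_{j_2,t} = \mu_{j_2}Z_{j_2,t-1} + \varepsilon_t$ (thus the stationary solution is generated with $Z_{j_2,t-1} = \mu_{j_2}^{-1}Z_{j_2,t} - \mu_{j_2}^{-1}\varepsilon_t$). Generate the time series $\{Y_{j_1,t}; j_1 = 1, \ldots, p_1\}$ and $\{Z_{j_2,t}; j_2 = 1, \ldots, p_2\}$ using the method described above. Then the noncausal time series can be generated by using
$$X_t = \sum_{j_1=1}^{p_1}c_{j_1}Y_{j_1,t} + \sum_{j_2=1}^{p_2}d_{j_2}Z_{j_2,t},$$
where the second sum is the second term of the previous display rewritten, since $Z_{j_2,t} = (1 - \mu_{j_2}B)^{-1}\varepsilon_t$.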
Comments:
Remember $Y_{j,t}$ is generated using the past $\varepsilon_t$ and $Z_{j,t}$ is generated using future innovations. Therefore to ensure that the generated $\{Y_{j,t}\}$ and $\{Z_{j,t}\}$ are close to stationarity we need to ensure that the initial value of $Y_{j,t}$ is far in the past and the initial value for $Z_{j,t}$ is far in the future.
If the roots are complex conjugates, then the corresponding $\{Y_{j,t}\}$ or $\{Z_{j,t}\}$ should be written as AR(2) models (to avoid complex processes).
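The backward iteration described above can be sketched in R for the simplest noncausal case, $X_t = 2X_{t-1} + \varepsilon_t$. The sample size, the length of the discarded stretch, the seed and the Gaussian innovations are all illustrative assumptions.

```r
# Sketch: simulate the noncausal AR(1) X_t = 2 X_{t-1} + eps_t by iterating backwards.
# The sample size, burn-in length and seed are illustrative assumptions.
set.seed(42)
n    <- 300                      # number of observations to keep
burn <- 200                      # extra "future" values discarded afterwards
N    <- n + burn
eps  <- rnorm(N)

x    <- numeric(N)
x[N] <- 0                        # arbitrary initial value placed far in the future
for (t in N:2) {
  x[t - 1] <- 0.5 * x[t] - 0.5 * eps[t]   # X_{t-1} = (1/2) X_t - (1/2) eps_t
}
x_noncausal <- x[1:n]            # drop the values closest to the initial condition

# The simulated values satisfy the original recursion X_t = 2 X_{t-1} + eps_t
recursion_error <- max(abs(x[2:N] - 2 * x[1:(N - 1)] - eps[2:N]))
```

Iterating forwards from the same innovations would instead double the previous value at each step and diverge, which is the explosion referred to in the text.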
R functions
Shumway and Stoffer (2006) and David Stoffer's website give a comprehensive introduction to time series R functions.
The function arima.sim simulates from a Gaussian ARIMA process. For example, arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=150) simulates from the AR(2) model $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$, where the innovations are Gaussian.
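Innovations other than Gaussian can be supplied through the rand.gen argument of arima.sim. The sketch below is one possible choice, not a prescribed one: the t-distribution with 5 degrees of freedom and the seed are assumptions made for illustration.

```r
# Sketch: AR(2) simulation with non-Gaussian innovations via the rand.gen argument.
# The t(5) innovation distribution and the seed are illustrative assumptions.
set.seed(10)
x_t5 <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)),
                  n = 150,
                  rand.gen = function(n, ...) rt(n, df = 5))

plot.ts(x_t5)   # quasi-periodic behaviour similar to the Gaussian case
```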
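Exercise 2.6 In the following simulations, use non-Gaussian innovations.
(i) Simulate an AR(4) process with characteristic function
$$\phi(z) = \Big[1 - 0.8\exp\Big(i\frac{2\pi}{13}\Big)z\Big]\Big[1 - 0.8\exp\Big(-i\frac{2\pi}{13}\Big)z\Big]\Big[1 - 1.5\exp\Big(i\frac{2\pi}{5}\Big)z\Big]\Big[1 - 1.5\exp\Big(-i\frac{2\pi}{5}\Big)z\Big].$$
(ii) Simulate an AR(4) process with characteristic function
$$\phi(z) = \Big[1 - 0.8\exp\Big(i\frac{2\pi}{13}\Big)z\Big]\Big[1 - 0.8\exp\Big(-i\frac{2\pi}{13}\Big)z\Big]\Big[1 - \frac{2}{3}\exp\Big(i\frac{2\pi}{5}\Big)z\Big]\Big[1 - \frac{2}{3}\exp\Big(-i\frac{2\pi}{5}\Big)z\Big].$$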
Chapter 3
The autocovariance function of a
linear time series
Objectives
Be able to determine the rate of decay of an ARMA time series.
Be able to solve for the autocovariance structure of an AR process.
Understand what partial correlation is and how this may be useful in determining the order
of an AR model.
Understand why autocovariance is blind to processes which are noncausal. But the higher
order cumulants are not blind to causality.
3.1
The autocovariance function (ACF) is defined as the sequence of covariances of a stationary process. That is, suppose that $\{X_t\}$ is a stationary process with mean zero, then $\{c(k) : k \in \mathbb{Z}\}$ is the ACF of $\{X_t\}$ where $c(k) = E(X_0X_k)$. Clearly different time series give rise to different features in the ACF. We will explore some of these features below.
Before investigating the structure of ARMA processes we state a general result connecting linear
time series and the summability of the autocovariance function.
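Lemma 3.1.1 Suppose the stationary time series $X_t$ satisfies the linear representation $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$. The covariance is $c(r) = \mathrm{var}(\varepsilon_t)\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+r}$.
(i) If $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, then $\sum_k|c(k)| < \infty$.
(ii) If $\sum_{j=-\infty}^{\infty}|j\psi_j| < \infty$, then $\sum_k|k \cdot c(k)| < \infty$.
(iii) If $\sum_{j=-\infty}^{\infty}|\psi_j|^2 < \infty$, then we cannot say anything about summability of the covariance.
PROOF. Since $c(k) = \mathrm{var}(\varepsilon_t)\sum_j\psi_j\psi_{j-k}$, we have
$$\sum_k|c(k)| \le \mathrm{var}(\varepsilon_t)\sum_k\sum_j|\psi_j| \cdot |\psi_{j-k}|,$$
thus $\sum_k|c(k)| < \infty$, which proves (i).
The proof of (ii) is similar. To prove (iii), we observe that $\sum_j|\psi_j|^2 < \infty$ is a weaker condition than $\sum_j|\psi_j| < \infty$ (for example the sequence $\psi_j = |j|^{-1}$ satisfies the former condition but not the latter). Thus based on the condition we cannot say anything about summability of the covariances.
First we consider a general result on the covariance of a causal ARMA process (always to obtain the covariance we use the MA($\infty$) expansion; you will see why below).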
3.1.1
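We evaluate the covariance of an ARMA process using its MA($\infty$) representation. Let us suppose that $\{X_t\}$ is a causal ARMA process, then it has the representation in (2.17) (where the roots of $\phi(z)$ have absolute value greater than $1 + \delta$). Using (2.17) and the independence of $\{\varepsilon_t\}$ we have
$$\mathrm{cov}(X_t, X_\tau) = \mathrm{cov}\Big(\sum_{j_1=0}^{\infty}a_{j_1}\varepsilon_{t-j_1}, \sum_{j_2=0}^{\infty}a_{j_2}\varepsilon_{\tau-j_2}\Big) = \sum_{j=0}^{\infty}a_ja_{j+|t-\tau|}\mathrm{var}(\varepsilon_t) \qquad (3.1)$$
(here we see the beauty of the MA($\infty$) expansion). Using (2.18) we have
$$|\mathrm{cov}(X_t, X_\tau)| \le \mathrm{var}(\varepsilon_t)C_\rho^2\sum_{j=0}^{\infty}\rho^j\rho^{j+|t-\tau|} = \mathrm{var}(\varepsilon_t)C_\rho^2\rho^{|t-\tau|}\sum_{j=0}^{\infty}\rho^{2j} = \mathrm{var}(\varepsilon_t)C_\rho^2\frac{\rho^{|t-\tau|}}{1 - \rho^2}, \qquad (3.2)$$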
3.1.2
$$X_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t. \qquad (3.3)$$
From now onwards we will assume that $\{X_t\}$ is causal (the roots of $\phi(z)$ lie outside the unit circle). Given that $\{X_t\}$ is causal we can derive a recursion for the covariances. Multiplying both sides of the above equation by $X_{t-k}$ ($k \ge 1$) and taking expectations gives the equation
$$E(X_tX_{t-k}) = \sum_{j=1}^{p}\phi_jE(X_{t-j}X_{t-k}) + \underbrace{E(\varepsilon_tX_{t-k})}_{=0} = \sum_{j=1}^{p}\phi_jE(X_{t-j}X_{t-k}). \qquad (3.4)$$
It is worth mentioning that if the process were not causal this equation would not hold, since $\varepsilon_t$ and $X_{t-k}$ are not necessarily independent. These are the Yule-Walker equations; we will discuss them in detail when we consider estimation. For now, letting $c(k) = E(X_0X_k)$ and using the above, we see that the autocovariance satisfies the homogeneous difference equation
$$c(k) - \sum_{j=1}^{p}\phi_jc(k - j) = 0, \qquad (3.5)$$
for $k \ge 1$. In other words, the autocovariance function of $\{X_t\}$ is the solution of this difference equation. The study of difference equations is an entire field of research, however we will now scratch the surface to obtain a solution for (3.5). Solving (3.5) is very similar to solving homogeneous differential equations, which some of you may be familiar with (do not worry if you are not).
The characteristic polynomial associated with (3.5) is $1 - \sum_{j=1}^{p}\phi_jz^j = 0$, which has the roots $\lambda_1, \ldots, \lambda_p$. In Section 2.3.4 we used the roots of the characteristic equation to find the stationary solution of the AR process. In this section we use the roots of the characteristic polynomial to obtain the solution of (3.5). It can be shown that if the roots are distinct (the roots are all different) the solution of (3.5) is
$$c(k) = \sum_{j=1}^{p}C_j\lambda_j^{-k}, \qquad (3.6)$$
where the constants $\{C_j\}$ are chosen depending on the initial values $\{c(k) : 1 \le k \le p\}$ and are such that they ensure that $c(k)$ is real (recalling that $\lambda_j$ can be complex).
The simplest way to prove (3.6) is to use a plug-in method. Plugging $c(k) = \sum_{j=1}^{p}C_j\lambda_j^{-k}$ into (3.5) gives
$$c(k) - \sum_{i=1}^{p}\phi_ic(k - i) = \sum_{j=1}^{p}C_j\Big(\lambda_j^{-k} - \sum_{i=1}^{p}\phi_i\lambda_j^{-(k-i)}\Big) = \sum_{j=1}^{p}C_j\lambda_j^{-k}\underbrace{\Big(1 - \sum_{i=1}^{p}\phi_i\lambda_j^{i}\Big)}_{\phi(\lambda_j)} = 0.$$
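The closed form (3.6) can be checked numerically. This is a minimal R sketch under assumed values: the AR(2) coefficients (1.5, -0.75) are an example with distinct roots, the autocorrelations from ARMAacf stand in for $c(k)/c(0)$, and the constants are fitted from lags 0 and 1 (one convenient choice of initial values).

```r
# Sketch: verify c(k) = sum_j C_j lambda_j^(-k) for an assumed AR(2) example
phi    <- c(1.5, -0.75)
lambda <- polyroot(c(1, -phi))              # roots of 1 - 1.5 z + 0.75 z^2
rho    <- ARMAacf(ar = phi, lag.max = 20)   # rho[k + 1] is the lag-k autocorrelation

# Solve for C_1, C_2 from the values at lags 0 and 1
M <- rbind(lambda^0, lambda^(-1))
C <- solve(M, rho[1:2] + 0i)

k           <- 0:20
closed_form <- Re(sapply(k, function(h) sum(C * lambda^(-h))))
max_gap     <- max(abs(closed_form - rho))  # difference between (3.6) and ARMAacf

root_modulus <- Mod(lambda)                 # both equal sqrt(4/3)
period       <- 2 * pi / abs(Arg(lambda))   # both equal 12
```

For these coefficients the roots have modulus $\sqrt{4/3}$ and argument $\pm\pi/6$, so the autocorrelation decays geometrically while oscillating with period 12.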
In the case that the roots of $\phi(z)$ are not distinct, let the roots be $\lambda_1, \ldots, \lambda_s$ with multiplicity $m_1, \ldots, m_s$ ($\sum_{k=1}^{s}m_k = p$). In this case the solution is
$$c(k) = \sum_{j=1}^{s}\lambda_j^{-k}P_{m_j}(k), \qquad (3.7)$$
where $P_{m_j}(k)$ is $m_j$th order polynomial and the coefficients $\{C_j\}$ are now hidden in $P_{m_j}(k)$. We now study the covariance in greater detail and see what it tells us about a realisation. As a motivation consider the following example.
Example 3.1.1 Consider the AR(2) process
$$X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t, \qquad (3.8)$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance one. The corresponding characteristic polynomial is $1 - 1.5z + 0.75z^2$, which has roots $1 \pm i3^{-1/2} = \sqrt{4/3}\exp(\pm i\pi/6)$. Using the
and the complex roots are $\{\lambda_j, \overline{\lambda}_j\}_{j=r+1}^{(p-r)/2}$. The covariance in (3.6) can be written as
$$c(k) = \sum_{j=1}^{r}C_j\lambda_j^{-k} + \sum_{j=r+1}^{(p-r)/2}a_j|\lambda_j|^{-k}\cos(k\theta_j + b_j) \qquad (3.9)$$
where for $j > r$ we write $\lambda_j = |\lambda_j|\exp(i\theta_j)$ and $a_j$ and $b_j$ are real constants. Notice that, as in the example above, the covariance decays exponentially with lag, but there is undulation. A typical realisation from such a process will be quasi-periodic with periods at $\theta_{r+1}, \ldots, \theta_{(p-r)/2}$, though the
[Figure 3.1: plot of acf against lag.]
[Figure 3.2: plot of the series ar2 against Time.]
[Figure 3.3, top panel: Periodogram against frequency. Lower panel, titled "Autoregressive": spectrum against frequency.]
Figure 3.3: Top: Periodogram of $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ for sample size $n = 144$. Lower: The corresponding spectral density function (note that 0.5 of the x-axis on spectral density corresponds to $\pi$ on the x-axis of the periodogram).
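$$X_t = \sum_{j=0}^{\infty}\frac{1}{b - a}\big(b^{j+1} - a^{j+1}\big)\varepsilon_{t-j}.$$
$$\sum_{j=0}^{\infty}\big(b^{j+1} - a^{j+1}\big)\big(b^{j+1+r} - a^{j+1+r}\big).$$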
Example 3.1.3 (The autocorrelation of a causal and noncausal time series) Let us consider the two AR(1) processes considered in Section 2.3.2. We recall that the model
$$X_t = 0.5X_{t-1} + \varepsilon_t$$
has the stationary causal solution
$$X_t = \sum_{j=0}^{\infty}0.5^j\varepsilon_{t-j}.$$
Assuming $\mathrm{var}(\varepsilon_t) = 1$, this gives $c_X(0) = \frac{1}{1 - 0.5^2}$ and
$$c_X(k) = \frac{0.5^{|k|}}{1 - 0.5^2}.$$
On the other hand, the model $X_t = 2X_{t-1} + \varepsilon_t$ has the noncausal stationary solution
$$X_t = -\sum_{j=0}^{\infty}(0.5)^{j+1}\varepsilon_{t+j+1}.$$
Thus $c_X(0) = \frac{0.5^2}{1 - 0.5^2}$ and
$$c_X(k) = \frac{0.5^{2+|k|}}{1 - 0.5^2}.$$
Thus we observe that except for a factor $(0.5)^2$ both models have an identical autocovariance function. Indeed their autocorrelation functions are the same. Furthermore, by letting the innovation of the causal model have standard deviation 0.5, both time series would have the same autocovariance function.
Therefore, we observe an interesting feature, that the noncausal time series has the same correlation structure as a causal time series. In Section 3.3 we show that for every noncausal time series there exists a causal time series with the same autocovariance function. Therefore autocorrelation is blind to noncausality.
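The comparison in this example can be reproduced with a few lines of R. The sketch below simply truncates the two linear expansions at 200 terms (an assumed truncation point, far beyond what is needed for $0.5^j$) and takes $\mathrm{var}(\varepsilon_t) = 1$.

```r
# Sketch: autocovariances of the causal and noncausal AR(1) solutions (var(eps) = 1 assumed)
j <- 0:200                                   # truncation of the infinite sums

acov_causal    <- function(k) sum(0.5^j * 0.5^(j + abs(k)))            # X_t = 0.5 X_{t-1} + eps_t
acov_noncausal <- function(k) sum(0.5^(j + 1) * 0.5^(j + 1 + abs(k)))  # X_t = 2 X_{t-1} + eps_t

lags  <- 0:5
c_c   <- sapply(lags, acov_causal)
c_nc  <- sapply(lags, acov_noncausal)

ratio         <- c_nc / c_c                  # constant factor 0.5^2 at every lag
acf_causal    <- c_c / c_c[1]
acf_noncausal <- c_nc / c_nc[1]              # identical autocorrelations
```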
Exercise 3.1 Recall the AR(2) models considered in Exercise 2.4. Now we want to derive their ACF functions.
(i)
(b) Obtain the ACF corresponding to
$$X_t = \frac{4\sqrt{3}}{5}X_{t-1} - \frac{4^2}{5^2}X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.
(c) Obtain the ACF corresponding to
$$X_t = X_{t-1} - 4X_{t-2} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.
(ii) For all these models plot the true ACF in R. You will need to use the function ARMAacf. BEWARE of the ACF it gives for noncausal solutions. Find a method of plotting a causal solution in the noncausal case.
Exercise 3.2 In Exercise 2.5 you constructed a causal AR(2) process with period 17.
Load Shumway and Stoffer's package astsa into R (use the command install.packages("astsa") and then library("astsa")).
Use the command arma.spec to make a plot of the corresponding spectral density function. How does your periodogram compare with the true spectral density function?
R code
We use the code given in Shumway and Stoffer (2006), page 101 to make Figures 3.1 and 3.2.
To make Figure 3.1:
acf = ARMAacf(ar=c(1.5,-0.75),ma=0,50)
plot(acf,type="h",xlab="lag")
abline(h=0)
To make Figures 3.2 and 6.1:
set.seed(5)
ar2 <- arima.sim(list(order=c(2,0,0), ar = c(1.5, -0.75)), n=144)
plot.ts(ar2, axes=F); box(); axis(2)
axis(1,seq(0,144,24))
abline(v=seq(0,144,12),lty="dotted")
# frequency and Periodogram are not defined in the code above;
# the next two lines are one assumed definition (unnormalised Fourier frequencies on [0, 2*pi))
frequency <- 2*pi*(0:143)/144
Periodogram <- abs(fft(ar2)/144)^2
plot(frequency, Periodogram,type="o")
library("astsa")
arma.spec( ar = c(1.5, -0.75), log = "no", main = "Autoregressive")
3.1.3
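$$X_t = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.$$
The covariance is
$$\mathrm{cov}(X_t, X_{t-k}) = \begin{cases} \mathrm{var}(\varepsilon_t)\sum_{i=0}^{q}\theta_i\theta_{i-k} & k = -q, \ldots, q \\ 0 & \text{otherwise} \end{cases}$$
where $\theta_0 = 1$ and $\theta_i = 0$ for $i < 0$ and $i > q$. Therefore we see that there is no correlation when the lag between $X_t$ and $X_{t-k}$ is greater than q.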
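The MA(q) covariance formula can be evaluated directly and compared with ARMAacf. In this R sketch the MA(2) coefficients (0.5, 0.3) are assumed example values, $\mathrm{var}(\varepsilon_t) = 1$, and ma_acov is a hypothetical helper name.

```r
# Sketch: MA(q) autocovariance from the formula, compared with ARMAacf (theta is an assumed example)
theta <- c(0.5, 0.3)              # theta_1, theta_2
q     <- length(theta)
th    <- c(1, theta)              # prepend theta_0 = 1

ma_acov <- function(k) {          # hypothetical helper: sum_i theta_i theta_{i-k}, var(eps) = 1
  k <- abs(k)
  if (k > q) return(0)            # no correlation beyond lag q
  sum(th[(k + 1):(q + 1)] * th[1:(q + 1 - k)])
}

acov        <- sapply(0:5, ma_acov)
acf_formula <- acov / acov[1]
acf_R       <- ARMAacf(ma = theta, lag.max = 5)

formula_gap <- max(abs(acf_formula - acf_R))
```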
3.1.4
We see from the above that an MA(q) model is only really suitable when we believe that there is no correlation between two random variables separated by more than a certain distance. Often autoregressive models are fitted. However in several applications we find that autoregressive models of a very high order are needed to fit the data. If a very long autoregressive model is required a more suitable model may be the autoregressive moving average process. It has several of the properties of an autoregressive process, but can be more parsimonious than a long autoregressive process. In this section we consider the ACF of an ARMA process.
Let us suppose that the causal time series $\{X_t\}$ satisfies the equations
$$X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j}.$$
We now define a recursion for the ACF, which is similar to the ACF recursion for AR processes. Let us suppose that the lag k is such that $k > q$, then it can be shown that the autocovariance function of the ARMA process satisfies
$$E(X_tX_{t-k}) - \sum_{i=1}^{p}\phi_iE(X_{t-i}X_{t-k}) = 0.$$
For $1 \le k \le q$ we instead have
$$E(X_tX_{t-k}) - \sum_{i=1}^{p}\phi_iE(X_{t-i}X_{t-k}) = \sum_{j=1}^{q}\theta_jE(\varepsilon_{t-j}X_{t-k}) = \sum_{j=k}^{q}\theta_jE(\varepsilon_{t-j}X_{t-k}).$$
We recall that $X_t$ has the MA($\infty$) representation $X_t = \sum_{j=0}^{\infty}a_j\varepsilon_{t-j}$, therefore for $k \le j \le q$ we have $E(\varepsilon_{t-j}X_{t-k}) = a_{j-k}\mathrm{var}(\varepsilon_t)$ (where $a(z) = \theta(z)\phi(z)^{-1}$). Altogether the above gives the difference equations
$$c(k) - \sum_{i=1}^{p}\phi_ic(k - i) = \mathrm{var}(\varepsilon_t)\sum_{j=k}^{q}\theta_ja_{j-k} \quad \text{for } 1 \le k \le q, \qquad (3.10)$$
$$c(k) - \sum_{i=1}^{p}\phi_ic(k - i) = 0 \quad \text{for } k > q,$$
where $c(k) = E(X_0X_k)$. For $k > q$, (3.10) is a homogeneous difference equation, and it can be shown that the solution is
$$c(k) = \sum_{j=1}^{s}\lambda_j^{-k}P_{m_j}(k),$$
where $\lambda_1, \ldots, \lambda_s$ with multiplicity $m_1, \ldots, m_s$ ($\sum_k m_k = p$) are the roots of the characteristic polynomial $1 - \sum_{j=1}^{p}\phi_jz^j$. Observe the similarity to the autocovariance function of the AR process (see (3.7)). The coefficients in the polynomials $P_{m_j}$ are determined by the initial condition given in (3.10).
You can also look at Brockwell and Davis (1998), Chapter 3.3 and Shumway and Stoffer (2006), Chapter 3.4.
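The two regimes of the ARMA recursion (inhomogeneous for $1 \le k \le q$, homogeneous for $k > q$) can be seen numerically. In this R sketch the ARMA(2,1) coefficients are assumed example values, the autocorrelations from ARMAacf are used in place of $c(k)$ (they satisfy the same recursion up to the factor $c(0)$), and recursion_gap is a hypothetical helper name.

```r
# Sketch: the AR-type recursion holds for k > q but not for k <= q (assumed ARMA(2,1) example)
phi   <- c(1.5, -0.75)
theta <- 0.5                                   # q = 1
rho   <- ARMAacf(ar = phi, ma = theta, lag.max = 20)   # rho[k + 1] is the lag-k autocorrelation

# rho(k) - phi_1 rho(k - 1) - phi_2 rho(k - 2), using rho(-h) = rho(h)
recursion_gap <- function(k) {
  unname(rho[k + 1] - phi[1] * rho[abs(k - 1) + 1] - phi[2] * rho[abs(k - 2) + 1])
}

gap_k1        <- recursion_gap(1)              # k = q: right-hand side of (3.10) is non-zero
gaps_beyond_q <- sapply(2:20, recursion_gap)   # k > q: homogeneous equation
```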
3.2
We see that by using the autocovariance function we are able to identify the order of an MA(q) process: when the covariance lag is greater than q the covariance is zero. However the same is not true for AR(p) processes. The autocovariances do not enlighten us on the order p. However a variant of the autocovariance, called the partial autocovariance, is quite informative about the order of an AR(p). We start by reviewing the partial autocovariance, and its relationship to the inverse variance/covariance matrix (often called the precision matrix).
3.2.1
Partial correlation
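Suppose $X = (X_1, \ldots, X_d)'$ is a zero mean random vector (we impose the zero mean condition to simplify notation; it is not necessary). The partial covariance is the covariance between $X_i$ and $X_j$, conditioned on the other elements in the vector. In other words, it is the covariance between the residual of $X_i$ conditioned on $X_{-(ij)}$ (the vector not containing $X_i$ and $X_j$) and the residual of $X_j$ conditioned on $X_{-(ij)}$. That is, the partial covariance between $X_i$ and $X_j$ given $X_{-(ij)}$ is defined as
$$\mathrm{cov}\Big(X_i - E[X_{-(ij)}X_i]'\mathrm{var}[X_{-(ij)}]^{-1}X_{-(ij)},\; X_j - E[X_{-(ij)}X_j]'\mathrm{var}[X_{-(ij)}]^{-1}X_{-(ij)}\Big).$$
The variance/covariance matrix of the residual of $X_{ij} = (X_i, X_j)'$ given $X_{-(ij)}$ is
$$\Sigma_{ij} - c_{ij}'\Sigma_{-(ij)}^{-1}c_{ij} \qquad (3.11)$$
where $\Sigma_{ij} = \mathrm{var}(X_{ij})$, $c_{ij} = E(X_{ij} \otimes X_{-(ij)})$ ($= \mathrm{cov}(X_{ij}, X_{-(ij)})$) and $\Sigma_{-(ij)} = \mathrm{var}(X_{-(ij)})$ ($\otimes$ denotes the tensor product). Let $s_{ij}$ denote the $(i, j)$th element of the $(2 \times 2)$ matrix $\Sigma_{ij} - c_{ij}'\Sigma_{-(ij)}^{-1}c_{ij}$. The partial correlation between $X_i$ and $X_j$ given $X_{-(ij)}$ is
$$\rho_{ij} = \frac{s_{12}}{\sqrt{s_{11}s_{22}}},$$
observing that
(i) $s_{12}$ is the partial covariance between $X_i$ and $X_j$.
(ii) $s_{11} = E(X_i - \sum_{k \ne i,j}\beta_{i,k}X_k)^2$ (where $\beta_{i,k}$ are the coefficients of the best linear predictor of $X_i$ given $\{X_k; k \ne i, j\}$).
(iii) $s_{22} = E(X_j - \sum_{k \ne i,j}\beta_{j,k}X_k)^2$ (where $\beta_{j,k}$ are the coefficients of the best linear predictor of $X_j$ given $\{X_k; k \ne i, j\}$).
The partial correlation is related to the precision matrix $\Sigma^{-1}$, whose $(i, j)$th element we denote by $\Sigma^{ij}$, through
$$\frac{\Sigma^{ij}}{\sqrt{\Sigma^{ii}\Sigma^{jj}}} = -\rho_{ij}. \qquad (3.12)$$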
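The proof uses the inverse of block matrices. To simplify the notation, we will focus on the $(1, 2)$th element of $\Sigma$ and $\Sigma^{-1}$ (which concerns the correlation between $X_1$ and $X_2$). Let $X_{1,2} = (X_1, X_2)'$, $X_{-(1,2)} = (X_3, \ldots, X_d)'$, $\Sigma_{-(1,2)} = \mathrm{var}(X_{-(1,2)})$, $c_{1,2} = \mathrm{cov}(X_{-(1,2)}, X_{1,2})$ and $\Sigma_{1,2} = \mathrm{var}(X_{1,2})$. Using this notation it is clear that
$$\mathrm{var}(X) = \Sigma = \begin{pmatrix} \Sigma_{1,2} & c_{1,2}' \\ c_{1,2} & \Sigma_{-(1,2)} \end{pmatrix}. \qquad (3.13)$$
The inverse of this block matrix is
$$\Sigma^{-1} = \begin{pmatrix} P^{-1} & -P^{-1}c_{1,2}'\Sigma_{-(1,2)}^{-1} \\ -\Sigma_{-(1,2)}^{-1}c_{1,2}P^{-1} & \Sigma_{-(1,2)}^{-1} + \Sigma_{-(1,2)}^{-1}c_{1,2}P^{-1}c_{1,2}'\Sigma_{-(1,2)}^{-1} \end{pmatrix}, \qquad (3.14)$$
where $P = \Sigma_{1,2} - c_{1,2}'\Sigma_{-(1,2)}^{-1}c_{1,2}$. Comparing $P$ with (3.11), the partial correlation between $X_1$ and $X_2$ is
$$\rho_{1,2} = \frac{P_{1,2}}{\sqrt{P_{1,1}P_{2,2}}}, \qquad (3.15)$$
where $P_{ij}$ denotes the elements of the matrix $P$. Inverting $P$ (since it is a two by two matrix), we see that
$$P^{-1} = \frac{1}{P_{1,1}P_{2,2} - P_{1,2}^2}\begin{pmatrix} P_{2,2} & -P_{1,2} \\ -P_{1,2} & P_{1,1} \end{pmatrix}. \qquad (3.16)$$
Thus, by comparing (3.14) and (3.16) and by the definition of partial correlation given in (3.15) we have
$$\frac{[P^{-1}]_{1,2}}{\sqrt{[P^{-1}]_{1,1}[P^{-1}]_{2,2}}} = -\frac{P_{1,2}}{\sqrt{P_{1,1}P_{2,2}}} = -\rho_{1,2}.$$
Let $\Sigma^{ij}$ denote the $(i, j)$th element of $\Sigma^{-1}$. Thus we have shown (3.12):
$$\rho_{ij} = -\frac{\Sigma^{ij}}{\sqrt{\Sigma^{ii}\Sigma^{jj}}}.$$
In other words, the $(i, j)$th element of $\Sigma^{-1}$ divided by the square root of the corresponding diagonal elements gives the negative partial correlation. Therefore, if the partial correlation between $X_i$ and $X_j$ given $X_{-(ij)}$ is zero, then $\Sigma^{i,j} = 0$.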
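The precision matrix, $\Sigma^{-1}$, contains many other hidden treasures. For example, the coefficients of $\Sigma^{-1}$ convey information about the best linear predictor of $X_i$ given $X_{-i} = (X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_d)$ (all elements of $X$ except $X_i$). Let
$$X_i = \sum_{j \ne i}\beta_{i,j}X_j + \varepsilon_i,$$
where $\{\beta_{i,j}\}$ are the coefficients of the best linear predictor. Then it can be shown that
$$\beta_{i,j} = -\frac{\Sigma^{ij}}{\Sigma^{ii}} \quad \text{and} \quad \Sigma^{ii} = \frac{1}{E[X_i - \sum_{j \ne i}\beta_{i,j}X_j]^2}. \qquad (3.17)$$
The proof again uses the inverse of block matrices, with the partition
$$\mathrm{var}(X) = \Sigma = \begin{pmatrix} \mathrm{var}(X_1) & c_1' \\ c_1 & \Sigma_{-(1)} \end{pmatrix}. \qquad (3.18)$$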
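$$X_t = \sum_{j=1}^{t-1}\beta_{t,j}X_j + \varepsilon_t, \quad t = 2, \ldots, k, \qquad (3.19)$$
where $\{\beta_{t,j}; 1 \le j \le t - 1\}$ are the coefficients of the best linear predictor of $X_t$ given $X_1, \ldots, X_{t-1}$. Let $\sigma_t^2 = \mathrm{var}[\varepsilon_t] = E[X_t - \sum_{j=1}^{t-1}\beta_{t,j}X_j]^2$ and $\sigma_1^2 = \mathrm{var}[X_1]$. We standardize (3.19) and define
$$\sum_{j=1}^{t}\gamma_{t,j}X_j = \frac{1}{\sigma_t}\Big(X_t - \sum_{j=1}^{t-1}\beta_{t,j}X_j\Big), \qquad (3.20)$$
where we note that $\gamma_{t,t} = \sigma_t^{-1}$ and for $1 \le j \le t - 1$, $\gamma_{t,j} = -\beta_{t,j}/\sigma_t$. By construction it is clear that $\mathrm{var}(LX) = I_k$, where
$$L = \begin{pmatrix}
\gamma_{1,1} & 0 & 0 & \cdots & 0 \\
\gamma_{2,1} & \gamma_{2,2} & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
\gamma_{k,1} & \gamma_{k,2} & \cdots & \cdots & \gamma_{k,k}
\end{pmatrix} \qquad (3.21)$$
and, since $L\Sigma L' = I_k$, we have $L'L = \Sigma^{-1}$ (see ?, equation (18)), where $\Sigma = \mathrm{var}(X_k)$. Let $\Sigma_t = \mathrm{var}[X_t]$, then
$$\Sigma_t^{ij} = \sum_{k=\max(i,j)}^{t}\gamma_{k,i}\gamma_{k,j}.$$
We apply these results to the analysis of the partial correlations of autoregressive processes and the inverse of their variance/covariance matrix.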
3.2.2
We now obtain an expression for the partial correlation between $X_t$ and $X_{t+k+1}$ in terms of their autocovariance function (for the final result see equation (3.22)). As the underlying assumption is that the time series is stationary it is the same as the partial covariance/correlation between $X_{k+1}$ and $X_0$. In Chapter 5 we will introduce the idea of the linear predictor of a future time point given the present and the past (usually called forecasting); this can be neatly described using the idea of projections onto subspaces. This notation is quite succinct, therefore we derive an expression for the partial correlation using projection notation. The projection of $X_{k+1}$ onto the space spanned by $\underline{X}_k = (X_1, X_2, \ldots, X_k)$ is the best linear predictor of $X_{k+1}$ given $\underline{X}_k$. We will denote the projection of $X_{k+1}$ onto the space spanned by $X_1, X_2, \ldots, X_k$ as $P_{\underline{X}_k}(X_{k+1})$ (note that this is the same as the best linear predictor). Thus
$$P_{\underline{X}_k}(X_{k+1}) = \underline{X}_k'\Sigma_k^{-1}c_k := \sum_{j=1}^{k}\phi_{k,j}X_j,$$
where $\Sigma_k = \mathrm{var}(\underline{X}_k)$ and $c_k = E(X_{k+1}\underline{X}_k)$. To derive a similar expression for $P_{\underline{X}_k}(X_0)$ we use the stationarity property
$$P_{\underline{X}_k}(X_0) = \underline{X}_k'(\mathrm{var}[\underline{X}_k]^{-1}E[X_0\underline{X}_k]) = \underline{X}_k'(\mathrm{var}[\underline{X}_k]^{-1}E_kE[X_{k+1}\underline{X}_k]) = \underline{X}_k'\Sigma_k^{-1}E_kc_k = \underline{X}_k'E_k\Sigma_k^{-1}c_k := \sum_{j=1}^{k}\phi_{k,k+1-j}X_j,$$
where $E_k$ is the exchange matrix
$$E_k = \begin{pmatrix}
0 & 0 & 0 & \cdots & 0 & 1 \\
0 & 0 & 0 & \cdots & 1 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
1 & 0 & 0 & \cdots & 0 & 0
\end{pmatrix}.$$
Thus the partial correlation between $X_t$ and $X_{t+k+1}$ (where $k > 0$) is the correlation between $X_0 - P_{\underline{X}_k}(X_0)$ and $X_{k+1} - P_{\underline{X}_k}(X_{k+1})$, whose covariance is
$$\mathrm{cov}(X_{k+1} - P_{\underline{X}_k}(X_{k+1}), X_0 - P_{\underline{X}_k}(X_0)) = \mathrm{cov}(X_{k+1}, X_0) - c_k'\Sigma_k^{-1}E_kc_k. \qquad (3.22)$$
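Equation (3.22) can be evaluated directly from an autocorrelation sequence. The R sketch below is an illustration under assumptions: the AR(2) coefficients are example values, autocorrelations are used instead of autocovariances (the common factor $c(0)$ cancels in the correlation), the denominator is the variance of each residual, and partial_corr is a hypothetical helper name.

```r
# Sketch: partial correlation from (3.22), normalised by the residual variance (assumed AR(2) example)
phi <- c(1.5, -0.75)
rho <- ARMAacf(ar = phi, lag.max = 6)        # rho[h + 1] is the lag-h autocorrelation

partial_corr <- function(k) {
  # partial correlation between X_0 and X_{k+1} given X_1, ..., X_k (k >= 1)
  Sigma_k <- toeplitz(as.numeric(rho[1:k]))  # var(X_1, ..., X_k) up to the factor c(0)
  c_k     <- as.numeric(rho[(k + 1):2])      # cov(X_{k+1}, (X_1, ..., X_k))
  E_k     <- diag(k)[k:1, , drop = FALSE]    # exchange matrix
  num <- rho[k + 2] - drop(t(c_k) %*% solve(Sigma_k, E_k %*% c_k))
  den <- 1 - drop(t(c_k) %*% solve(Sigma_k, c_k))
  unname(num / den)
}

by_formula <- sapply(1:3, partial_corr)                     # lags 2, 3, 4
pacf_R     <- ARMAacf(ar = phi, lag.max = 4, pacf = TRUE)   # lags 1, ..., 4
```

For this AR(2) the lag-2 value equals $\phi_2 = -0.75$ and the values at higher lags are zero up to rounding error.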
We consider an example.
Example 3.2.1 (The PACF of an AR(1) process) Consider the causal AR(1) process $X_t = 0.5X_{t-1} + \varepsilon_t$ where $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) = 1$. Using (3.1) it can be shown that $\mathrm{cov}(X_t, X_{t-2}) = 0.5^2/(1 - 0.5^2)$ (compare with the MA(1) process $X_t = \varepsilon_t + 0.5\varepsilon_{t-1}$, where the covariance $\mathrm{cov}(X_t, X_{t-2}) = 0$). We evaluate the partial covariance between $X_t$ and $X_{t-2}$. Remember we have to condition out the random variables in between, which in this case is $X_{t-1}$. It is clear that the projection of $X_t$ onto $X_{t-1}$ is $0.5X_{t-1}$ (since $X_t = 0.5X_{t-1} + \varepsilon_t$). Therefore $X_t - P_{sp(X_{t-1})}X_t = X_t - 0.5X_{t-1} = \varepsilon_t$. The projection of $X_{t-2}$ onto $X_{t-1}$ is a little more complicated; it is $P_{sp(X_{t-1})}X_{t-2} = \frac{E(X_{t-1}X_{t-2})}{E(X_{t-1}^2)}X_{t-1}$. Therefore the partial covariance between $X_t$ and $X_{t-2}$ is
$$\mathrm{cov}\big(X_t - P_{sp(X_{t-1})}X_t,\; X_{t-2} - P_{sp(X_{t-1})}X_{t-2}\big) = \mathrm{cov}\Big(\varepsilon_t,\; X_{t-2} - \frac{E(X_{t-1}X_{t-2})}{E(X_{t-1}^2)}X_{t-1}\Big) = 0.$$
In fact the above is true for the partial covariance between $X_t$ and $X_{t-k}$, for all $k \ge 2$. Hence we see that although the autocovariance of an AR(1) process is non-zero at all lags, the partial covariance is zero for all lags greater than or equal to two.
Using the same argument as above, it is easy to show that the partial covariance of an AR(p) for lags greater than p is zero. Hence in many respects the partial covariance can be considered as an analogue of the autocovariance. It should be noted that though the covariance of MA(q) is zero for lag greater than q, the same is not true for the partial covariance. Whereas partial covariances remove correlation for autoregressive processes, they seem to add correlation for moving average processes!
Model identification:
If the autocovariances after a certain lag q are zero, it may be appropriate to fit an MA(q) model to the time series.
On the other hand, the autocovariances of any AR(p) process will only decay to zero as the lag increases.
If the partial autocovariances after a certain lag p are zero, it may be appropriate to fit an AR(p) model to the time series.
On the other hand, the partial covariances of any MA(q) process will only decay to zero as the lag increases.
Exercise 3.4 (The partial correlation of an invertible MA(1)) Let $\phi_{t,t}$ denote the partial correlation between $X_{t+1}$ and $X_1$. It is well known (this is the Levinson-Durbin algorithm, which we cover in Chapter 5) that $\phi_{t,t}$ can be deduced recursively from the autocovariance function using the algorithm:
Step 1 $\phi_{1,1} = c(1)/c(0)$ and $r(2) = E[X_2 - X_{2|1}]^2 = E[X_2 - \phi_{1,1}X_1]^2 = c(0) - \phi_{1,1}c(1)$.
Step 2 For $j = t$
$$\phi_{t,t} = \frac{c(t) - \sum_{j=1}^{t-1}\phi_{t-1,j}c(t - j)}{r(t)},$$
$$\phi_{t,j} = \phi_{t-1,j} - \phi_{t,t}\phi_{t-1,t-j}, \quad 1 \le j \le t - 1,$$
and $r(t + 1) = r(t)(1 - \phi_{t,t}^2)$.
(i) Using this algorithm show that the PACF of the MA(1) process $X_t = \varepsilon_t + \theta\varepsilon_{t-1}$, where $|\theta| < 1$ (so it is invertible) is
$$\phi_{t,t} = \frac{(-1)^{t+1}(\theta)^t(1 - \theta^2)}{1 - \theta^{2(t+1)}}.$$
(ii) Explain how this partial correlation is similar to the ACF of the AR(1) model $X_t = -\theta X_{t-1} + \varepsilon_t$.
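Before attempting the algebra, the recursion can be run numerically as a sanity check. The R sketch below is one straightforward implementation of the two steps above; the function name durbin_levinson_pacf and the value $\theta = 0.5$ are assumptions made for illustration.

```r
# Sketch: Levinson-Durbin recursion for the PACF (hypothetical helper; theta = 0.5 is an assumed value)
durbin_levinson_pacf <- function(acov, max_lag) {
  # acov[h + 1] holds c(h) for h = 0, ..., max_lag
  pacf     <- numeric(max_lag)
  phi_prev <- acov[2] / acov[1]                 # phi_{1,1}
  r        <- acov[1] - phi_prev * acov[2]      # r(2)
  pacf[1]  <- phi_prev
  if (max_lag >= 2) for (t in 2:max_lag) {
    j        <- 1:(t - 1)
    phi_tt   <- (acov[t + 1] - sum(phi_prev * acov[t - j + 1])) / r
    phi_prev <- c(phi_prev - phi_tt * rev(phi_prev), phi_tt)   # phi_{t,1}, ..., phi_{t,t}
    r        <- r * (1 - phi_tt^2)              # r(t + 1)
    pacf[t]  <- phi_tt
  }
  pacf
}

theta   <- 0.5
max_lag <- 8
acov    <- c(1 + theta^2, theta, rep(0, max_lag - 1))   # c(0), c(1), 0, ... for the MA(1)

pacf_dl      <- durbin_levinson_pacf(acov, max_lag)
t_idx        <- 1:max_lag
pacf_formula <- (-1)^(t_idx + 1) * theta^t_idx * (1 - theta^2) / (1 - theta^(2 * (t_idx + 1)))
```

The two vectors pacf_dl and pacf_formula agree to rounding error, and both decay geometrically with alternating sign.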
Exercise 3.5 (Comparing the ACF and PACF of an AR process) Compare the below plots:
(i) Compare the ACF and PACF of the AR(2) model $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ using ARMAacf(ar=c(1.5,-0.75),ma=0,30) and ARMAacf(ar=c(1.5,-0.75),ma=0,pacf=T,30).
(ii) Compare the ACF and PACF of the MA(1) model $X_t = \varepsilon_t - 0.5\varepsilon_{t-1}$ using ARMAacf(ar=0,ma=c(-0.5),30) and ARMAacf(ar=0,ma=c(-0.5),pacf=T,30).
(iii) Compare the ACF and PACF of the ARMA(2, 1) model $X_t - 1.5X_{t-1} + 0.75X_{t-2} = \varepsilon_t - 0.5\varepsilon_{t-1}$ using ARMAacf(ar=c(1.5,-0.75),ma=c(-0.5),30) and ARMAacf(ar=c(1.5,-0.75),ma=c(-0.5),pacf=T,30).
Exercise 3.6 Compare the ACF and PACF plots of the monthly temperature data from 1996-2014. Would you fit an AR, MA or ARMA model to this data?
R code
The sample partial autocorrelation of a time series can be obtained using the command pacf. However, remember that just because the sample PACF is not zero, it does not mean the true PACF is non-zero. This is why we require error bars!
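A minimal R sketch of this point follows. The simulated AR(2) series, the seed, the sample size and the approximate 95% band $\pm 1.96/\sqrt{n}$ (the usual large-sample band for a zero partial autocorrelation) are illustrative assumptions.

```r
# Sketch: sample PACF with approximate 95% error bars (series, seed and n are assumed values)
set.seed(1)
x <- arima.sim(list(order = c(2, 0, 0), ar = c(1.5, -0.75)), n = 500)

sample_pacf <- pacf(x, lag.max = 10, plot = FALSE)   # use plot = TRUE to draw the bars
band        <- qnorm(0.975) / sqrt(length(x))        # approximate 95% band around zero

# Lags whose sample PACF falls outside the band
outside <- which(abs(sample_pacf$acf) > band)
```

For an AR(2) one expects lags 1 and 2 to fall well outside the band; an occasional higher lag falling just outside it is consistent with sampling variation.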
3.2.3
Let us suppose that $\{X_t\}$ is a stationary time series. In this section we consider the variance/covariance matrix $\mathrm{var}(\underline{X}_k) = \Sigma_k$, where $\underline{X}_k = (X_1, \ldots, X_k)'$. We will consider two cases (i) when $X_t$ follows an MA(p) model and (ii) when $X_t$ follows an AR(p) model. The variance and inverse of the variance matrices for both cases yield quite interesting results. We will use classical results from multivariate analysis, stated in Section 3.2.1.
We recall that the variance/covariance matrix of a stationary time series has a (symmetric) Toeplitz structure (see wiki for a definition). Let $\underline{X}_k = (X_1, \ldots, X_k)'$, then
$$\Sigma_k = \mathrm{var}(\underline{X}_k) = \begin{pmatrix}
c(0) & c(1) & \cdots & c(k-2) & c(k-1) \\
c(1) & c(0) & \cdots & c(k-3) & c(k-2) \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
c(k-1) & c(k-2) & \cdots & c(1) & c(0)
\end{pmatrix}.$$
$\Sigma_k^{-1}$ for an AR(p) model
We now consider the inverse of $\Sigma_k$. Warning: note that the inverse of a Toeplitz matrix is not necessarily Toeplitz (unlike the circulant, whose inverse is circulant). We use the results in Section 3.2.1. Suppose that we have an AR(p) process and we consider the precision matrix of $\underline{X}_k = (X_1, \ldots, X_k)$, where $k > p$.
negative partial correlation between $X_i$ and $X_j$ given $X_{-(ij)}$, not just the elements between $X_i$ and $X_j$). To show this we use the Cholesky decomposition given in (3.19). Since $X_t$ is an autoregressive process of order p, plugging this information into (3.19), for $t > p$ we have
$$X_t = \sum_{j=1}^{t-1}\beta_{t,j}X_j + \varepsilon_t = \sum_{j=1}^{p}\phi_jX_{t-j} + \varepsilon_t,$$
thus $\beta_{t,t-j} = \phi_j$ for $1 \le j \le p$, otherwise $\beta_{t,t-j} = 0$. Moreover, for $t > p$ we have $\sigma_t^2 = \mathrm{var}(\varepsilon_t) = 1$. For $t \le p$ we use the same notation as that used in (3.19). This gives the lower triangular p-bandlimited matrix
1,1
2,1
2,2
..
..
.
.
p p1
.
..
.
Lk =
.
.
0
0
0
0
.
..
.
.
.
0
0
...
...
...
..
.
0
..
.
0
..
.
...
..
.
0
..
.
0
..
.
1
..
.
...
..
.
0
..
.
0
..
.
. . . 1
..
..
.
.
. . . p p1 . . . 1
...
..
.
0
..
.
p
..
.
...
. . . 2 1
..
..
..
.
.
.
...
0 ... 0
0 ... 0
.. .. ..
. . .
0 ... 0
.. .. ..
. . .
0 ... 0
1 ... 0
.. .. ..
. . .
0 ... 1
(3.23)
(the above matrix has not been formated well, but after the first p 1 rows, there are ones along
the diagonal and the p lower offdiagonals are nonzero).
We recall that $\Sigma_k^{-1} = L_k'L_k$, thus we observe that since $L_k$ is a lower triangular bandlimited
matrix, $\Sigma_k^{-1} = L_k'L_k$ is a bandlimited matrix with the $p$ off-diagonals either side of the diagonal
nonzero. Let $\Sigma^{ij}$ denote the $(i,j)$th element of $\Sigma_k^{-1}$. Then we observe that $\Sigma^{ij} = 0$ if $|i-j| > p$.
Moreover, if $0 < |i-j| \le p$, either $i$ or $j$ is greater than $p$, and $\min(i,j) \le k-p$, then
$$
\Sigma^{ij} = \sum_{s=|i-j|+1}^{p}\phi_s\phi_{s-|i-j|} - \phi_{|i-j|}.
$$
The coefficients $\Sigma^{ij}$ give us a fascinating insight into the prediction of $X_t$ given the past
and future observations. We recall from equation (3.17) that $-\Sigma^{ij}/\Sigma^{ii}$ are the coefficients of the
best linear predictor of $X_i$ given $\underline{X}_{-i}$. This result tells us that if the observations came from a stationary
AR($p$) process, then the best linear predictor of $X_i$ given $X_{i-1},\ldots,X_{i-a}$ and $X_{i+1},\ldots,X_{i+b}$ (where
$a$ and $b > p$) is the same as the best linear predictor of $X_i$ given $X_{i-1},\ldots,X_{i-p}$ and $X_{i+1},\ldots,X_{i+p}$
(knowledge of other values will not improve the prediction).
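The banded structure of the precision matrix can be checked numerically. The following is only a sketch: the AR(2) coefficients, the dimension k = 10 and all object names are illustrative assumptions. It builds the Toeplitz correlation matrix from the true autocorrelations, inverts it, and inspects the entries more than p places from the diagonal. Because a correlation (rather than covariance) matrix is used, the inverse is only known up to a constant, so ratios of entries are compared.

```r
phi <- c(1.5, -0.75)   # AR(2) coefficients, chosen for illustration
p <- length(phi)
k <- 10                # dimension of the vector (X_1, ..., X_k)

# True autocorrelations at lags 0..k-1 and the Toeplitz correlation matrix
rho <- as.numeric(ARMAacf(ar = phi, lag.max = k - 1))
R <- toeplitz(rho)

# Precision matrix (up to the constant c(0))
P <- solve(R)

# Entries more than p places off the diagonal should be numerically zero
off_band <- abs(row(P) - col(P)) > p
max(abs(P[off_band]))

# Interior entries relative to the diagonal, compared with the AR coefficients
d <- 1 + sum(phi^2)
c(P[5, 6] / P[5, 5], (phi[1] * phi[2] - phi[1]) / d)
c(P[5, 7] / P[5, 5], -phi[2] / d)
```

Each of the last two lines prints a numerical ratio next to the value implied by the AR coefficients; the two numbers in each pair should agree.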
There is an interesting duality between the AR and MA model which we will explore further
in the course.
3.3
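The spectral density is defined as the Fourier transform of the autocovariances,
$$
f(\omega) = \sum_{k=-\infty}^{\infty} c(k)\exp(ik\omega).
$$
The covariances can be obtained from the spectral density by using the inverse Fourier transform
$$
c(k) = \frac{1}{2\pi}\int_{0}^{2\pi} f(\omega)\exp(-ik\omega)\,d\omega.
$$
Consider the AR($p$) process
$$
X_t = \sum_{i=1}^{p}\phi_i X_{t-i} + \varepsilon_t,
$$
where the roots of the characteristic polynomial $\phi(z) = 1 - \sum_{j=1}^{p}\phi_j z^j$ may lie inside or outside,
but not on the unit circle (thus it has a stationary solution). We will show in Chapter 8 that the
spectral density of this AR process is
$$
f(\omega) = \frac{1}{\left|1 - \sum_{j=1}^{p}\phi_j\exp(ij\omega)\right|^2}. \qquad (3.24)
$$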
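Factorizing $f(\omega)$.
Let us suppose the characteristic polynomial $\phi(z) = 1 - \sum_{j=1}^{p}\phi_j z^j$ factorizes as
$\phi(z) = \prod_{j=1}^{p}(1-\lambda_j z)$, so that its roots are $\{\lambda_j^{-1}\}_{j=1}^{p}$. Using this factorization,
(3.24) can be written as
$$
f(\omega) = \frac{1}{\prod_{j=1}^{p}\left|1-\lambda_j\exp(i\omega)\right|^2}. \qquad (3.25)
$$
As we have not assumed $\{X_t\}$ is causal, the roots of $\phi(z)$ can lie both inside and outside the
unit circle. We separate the factors into those whose roots lie outside the unit circle, $\{\lambda_{O,j_1}; j_1 = 1,\ldots,p_1\}$ (so that $|\lambda_{O,j_1}| < 1$),
and those whose roots lie inside the unit circle, $\{\lambda_{I,j_2}; j_2 = 1,\ldots,p_2\}$ (so that $|\lambda_{I,j_2}| > 1$), where $p_1 + p_2 = p$. Thus
$$
\phi(z) = \Big[\prod_{j_1=1}^{p_1}(1-\lambda_{O,j_1}z)\Big]\Big[\prod_{j_2=1}^{p_2}(1-\lambda_{I,j_2}z)\Big]
= (-1)^{p_2}\Big(\prod_{j_2=1}^{p_2}\lambda_{I,j_2}\Big)z^{p_2}\Big[\prod_{j_1=1}^{p_1}(1-\lambda_{O,j_1}z)\Big]\Big[\prod_{j_2=1}^{p_2}(1-\lambda_{I,j_2}^{-1}z^{-1})\Big]. \qquad (3.26)
$$
Since $|z| = 1$ when $z = \exp(i\omega)$, substituting (3.26) into (3.25) gives
$$
f(\omega) = \frac{1}{\prod_{j_2=1}^{p_2}|\lambda_{I,j_2}|^2}\cdot
\frac{1}{\prod_{j_1=1}^{p_1}\left|1-\lambda_{O,j_1}\exp(i\omega)\right|^2\prod_{j_2=1}^{p_2}\left|1-\lambda_{I,j_2}^{-1}\exp(-i\omega)\right|^2}. \qquad (3.27)
$$
Because the coefficients $\{\phi_j\}$ are real, the non-real $\lambda_{I,j_2}$ occur in conjugate pairs, so the last product is unchanged if $\exp(-i\omega)$ is replaced by $\exp(i\omega)$.
Let
$$
f_O(\omega) = \frac{1}{\prod_{j_1=1}^{p_1}\left|1-\lambda_{O,j_1}\exp(i\omega)\right|^2\prod_{j_2=1}^{p_2}\left|1-\lambda_{I,j_2}^{-1}\exp(i\omega)\right|^2}.
$$
Then $f(\omega) = \prod_{j_2=1}^{p_2}|\lambda_{I,j_2}|^{-2}f_O(\omega)$.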
A parallel causal AR(p) process with the same covariance structure always exists.
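We now define a process which has the same autocovariance function as $\{X_t\}$ but is causal.
Using (3.26) we define the polynomial
$$
\widetilde{\phi}(z) = \Big[\prod_{j_1=1}^{p_1}(1-\lambda_{O,j_1}z)\Big]\Big[\prod_{j_2=1}^{p_2}(1-\lambda_{I,j_2}^{-1}z)\Big]. \qquad (3.28)
$$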
By construction, the roots of this polynomial lie outside the unit circle. We then define the
AR($p$) process
$$
\widetilde{\phi}(B)\widetilde{X}_t = \varepsilon_t, \qquad (3.29)
$$
from Lemma 2.3.1 we know that $\{\widetilde{X}_t\}$ has a stationary, almost surely unique solution. Moreover, because the roots lie outside the unit circle the solution is causal.
By using (3.24) the spectral density of $\{\widetilde{X}_t\}$ is $\widetilde{f}(\omega) = f_O(\omega)$. We know that the spectral density
function uniquely gives the autocovariance function. Comparing the spectral density of $\{\widetilde{X}_t\}$
with the spectral density of $\{X_t\}$ we see that they both are the same up to a multiplicative
constant. Thus they both have the same autocovariance structure up to a multiplicative
constant (which can be made the same, if in the definition (3.29) the innovation process has
variance $\prod_{j_2=1}^{p_2}|\lambda_{I,j_2}|^{-2}$).
Therefore, for every non-causal process, there exists a causal process with the same autocovariance function.
By using the same arguments above, we can generalize the result to ARMA processes.
Definition 3.3.2 An ARMA process is said to have minimum phase when the roots of $\phi(z)$ and
$\theta(z)$ both lie outside of the unit circle.
Remark 3.3.1 For Gaussian random processes it is impossible to discriminate between a causal
and non-causal time series, this is because the mean and autocovariance function uniquely identify
the process.
However, if the innovations are non-Gaussian, even though the autocovariance function is 'blind'
to non-causal processes, by looking for other features in the time series we are able to discriminate
between a causal and non-causal process.
3.3.1
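Consider the AR($p$) model
$$
X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t,
$$
where $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) < \infty$. Suppose the roots of the corresponding characteristic polynomial lie outside the
unit circle, then $\{X_t\}$ is strictly stationary where the solution of $X_t$ is only in terms of past and
present values of $\{\varepsilon_t\}$. Moreover, it is second order stationary with covariance $\{c(k)\}$. We recall
from Section 3.1.2, equation (3.4) that we derived the Yule-Walker equations for causal AR($p$)
processes, where for $k \ge 1$
$$
E(X_tX_{t-k}) = \sum_{j=1}^{p}\phi_j E(X_{t-j}X_{t-k}) \quad\Rightarrow\quad c(k) - \sum_{j=1}^{p}\phi_j c(k-j) = 0. \qquad (3.30)
$$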
Let us now consider the case that the roots of the characteristic polynomial lie both outside
and inside the unit circle, thus $X_t$ does not have a causal solution but it is still strictly and second
order stationary (with autocovariance, say $\{c(k)\}$). In the previous section we showed that there
exists a causal AR($p$) process $\widetilde{\phi}(B)\widetilde{X}_t = \varepsilon_t$ (where $\phi(B)$ and $\widetilde{\phi}(B) = 1 - \sum_{j=1}^{p}\widetilde{\phi}_jB^j$ are the characteristic
polynomials defined in (3.26) and (3.28)). We showed that both have the same autocovariance
structure. Therefore,
$$
c(k) - \sum_{j=1}^{p}\widetilde{\phi}_j c(k-j) = 0.
$$
This means the Yule-Walker equations for $\{X_t\}$ would actually give the AR($p$) coefficients of $\{\widetilde{X}_t\}$.
Thus if the Yule-Walker equations were used to estimate the AR coefficients of $\{X_t\}$, in reality we
would be estimating the AR coefficients of the corresponding causal $\{\widetilde{X}_t\}$.
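This can be seen in a small simulation. The sketch below is an illustration under stated assumptions rather than the notes' own code: the coefficient phi = 2, the t-distributed innovations and the backward-recursion simulation scheme are all choices made here. A non-causal AR(1) is generated by running the recursion backwards in time, and the Yule-Walker estimate returned by ar.yw is compared with 1/phi.

```r
set.seed(3)
phi <- 2                      # non-causal AR(1) coefficient (illustrative choice)
n <- 5000; burn <- 200
eps <- rt(n + burn, df = 5)   # non-Gaussian innovations

# X_t = phi X_{t-1} + eps_t rearranged as X_{t-1} = (X_t - eps_t) / phi and
# run backwards in time, so the burn-in is discarded from the END of the series
x <- numeric(n + burn)
for (t in (n + burn - 1):1) x[t] <- (x[t + 1] - eps[t + 1]) / phi
x <- x[1:n]

# Yule-Walker fit of an AR(1): the estimate is close to 1/phi, not phi
fit <- ar.yw(x, order.max = 1, aic = FALSE)
c(estimate = fit$ar, causal_coefficient = 1 / phi)
```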
3.3.2
Here we discuss the surprising result that filtering a non-causal time series with the corresponding
causal AR parameters leaves a sequence which is uncorrelated but not independent. Let us suppose
that
$$
X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \varepsilon_t,
$$
where $\varepsilon_t$ are iid, $E(\varepsilon_t) = 0$ and $\mathrm{var}(\varepsilon_t) < \infty$. It is clear that given the input $X_t$, if we apply the
filter $X_t - \sum_{j=1}^{p}\phi_j X_{t-j}$ we obtain an iid sequence (which is $\{\varepsilon_t\}$).
Suppose that we filter $\{X_t\}$ with the causal coefficients $\{\widetilde{\phi}_j\}$; the output $\widetilde{\varepsilon}_t = X_t - \sum_{j=1}^{p}\widetilde{\phi}_j X_{t-j}$
is, in general, not an independent sequence (the exception is Gaussian innovations). However, it is an uncorrelated sequence. We illustrate this with an
example.
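Example 3.3.1 Let us return to the AR(1) example, where $X_t = \phi X_{t-1} + \varepsilon_t$. Let us suppose that
$|\phi| > 1$, which corresponds to a non-causal time series, then $X_t$ has the solution
$$
X_t = -\sum_{j=1}^{\infty}\frac{1}{\phi^{j}}\varepsilon_{t+j}.
$$
The causal time series with the same covariance structure as $X_t$ is $\widetilde{X}_t = \frac{1}{\phi}\widetilde{X}_{t-1} + \varepsilon_t$ (which has
backshift representation $(1-\phi^{-1}B)\widetilde{X}_t = \varepsilon_t$). Filtering $X_t$ with the causal coefficient gives
$$
\widetilde{\varepsilon}_t = (1-\phi^{-1}B)X_t = X_t - \frac{1}{\phi}X_{t-1}
= \frac{1}{\phi^{2}}\varepsilon_t - \Big(1-\frac{1}{\phi^{2}}\Big)\sum_{j=1}^{\infty}\frac{1}{\phi^{j}}\varepsilon_{t+j}.
$$
Successive terms of $\{\widetilde{\varepsilon}_t\}$ share innovations, but, taking $\mathrm{var}(\varepsilon_t) = 1$, for $r \ge 1$
$$
\mathrm{cov}(\widetilde{\varepsilon}_t,\widetilde{\varepsilon}_{t+r})
= -\frac{1}{\phi^{2}}\Big(1-\frac{1}{\phi^{2}}\Big)\frac{1}{\phi^{r}}
+ \Big(1-\frac{1}{\phi^{2}}\Big)^{2}\frac{1}{\phi^{r}}\sum_{j=1}^{\infty}\frac{1}{\phi^{2j}} = 0.
$$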
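The example can be checked by simulation. The following is a minimal sketch, assuming phi = 2 and centred exponential innovations (both choices, like the object names, are made here for illustration and are not from the notes). The non-causal AR(1) is simulated by a backward recursion, filtered with the causal coefficient 1/phi, and the sample autocorrelations of the output and of its square are compared.

```r
set.seed(1)
phi <- 2
n <- 20000; burn <- 200
eps <- rexp(n + burn) - 1     # iid, mean zero, variance one, non-Gaussian

# Backward recursion X_{t-1} = (X_t - eps_t) / phi for the non-causal AR(1)
x <- numeric(n + burn)
for (t in (n + burn - 1):1) x[t] <- (x[t + 1] - eps[t + 1]) / phi
x <- x[1:n]

# Filter with the causal coefficient 1/phi
res <- x[-1] - x[-n] / phi

# Lag-one sample autocorrelation of the output and of its square
r1    <- acf(res,   plot = FALSE)$acf[2]
r1_sq <- acf(res^2, plot = FALSE)$acf[2]
c(r1 = r1, r1_sq = r1_sq)
```

In this setup the output should show essentially no linear correlation, while its square should be clearly correlated, which is one way of seeing that the sequence is uncorrelated but not independent.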
Exercise 3.7
Derive a parallel process with the same autocovariance structure but that is non-causal (it
should be real).
(ii) Simulate both from the causal process above and the corresponding non-causal process with
non-Gaussian innovations (see Section 2.6). Show that they have the same ACF.
(iii) Find features which allow you to discriminate between the causal and non-causal process.
Chapter 4
Nonlinear Time Series Models
Prerequisites
A basic understanding of expectations, conditional expectations and how one can use conditioning to obtain an expectation.
Objectives:
Use relevant results to show that a model has a stationary solution.
Derive moments of these processes.
Understand the differences between linear and nonlinear time series.
So far we have focused on linear time series, that is time series which have the representation
$$
X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},
$$
where $\{\varepsilon_t\}$ are iid random variables. Such models are extremely useful, because they are designed
to model the autocovariance structure and are straightforward to use for forecasting. These are
some of the reasons that they are used widely in several applications.
A typical realisation from a linear time series will be quite regular with no sudden bursts
or jumps. This is due to the linearity of the system. However, if one looks at financial data, for
example, there are sudden bursts in volatility (variation) and extreme values, which calm down
after a while. It is not possible to model such behaviour well with a linear time series. In order to
capture 'nonlinear behaviour' several nonlinear models have been proposed. The models typically
consist of products of random variables which make possible the sudden erratic bursts seen in
the data. Over the past 30 years there has been a lot of research into nonlinear time series models.
Probably one of the first nonlinear models proposed for time series analysis is the bilinear model;
this model is used extensively in signal processing and engineering. A popular class of models for
financial data is the (G)ARCH family. Other popular models are random autoregressive
coefficient models and threshold models, to name but a few (see, for example, Subba Rao (1977),
Granger and Andersen (1978), Nicholls and Quinn (1982), Engle (1982), Subba Rao and Gabr
(1984), Bollerslev (1986), Terdik (1999), Fan and Yao (2003), Straumann (2005) and Douc et al.
(2014)).
Once a model has been defined, the first difficult task is to show that it actually has a solution
which is almost surely finite (recall these models have dynamics which start at $-\infty$, so if they
are not well defined they could be infinite), with a stationary solution. Typically, in the nonlinear
world, we look for causal solutions. I suspect this is because the mathematics behind the existence of
non-causal solutions makes the problem even more complex.
We state a result that gives sufficient conditions for a stationary, causal solution of a certain
class of models. These models include ARCH/GARCH and Bilinear models. We note that the
theorem guarantees a solution, but does not give conditions for its moments. The result is based
on Brandt (1986), but under stronger conditions.
Theorem 4.0.1 (Brandt (1986)) Let us suppose that $\{\underline{X}_t\}$ is a $d$-dimensional time series defined by the stochastic recurrence relation
$$
\underline{X}_t = A_t\underline{X}_{t-1} + \underline{B}_t, \qquad (4.1)
$$
where $\{A_t\}$ and $\{\underline{B}_t\}$ are iid random matrices and vectors respectively. If $E\log\|A_t\| < 0$ and
$E\log\|\underline{B}_t\| < \infty$ (where $\|\cdot\|$ denotes the spectral norm of a matrix), then
$$
\underline{X}_t = \underline{B}_t + \sum_{k=1}^{\infty}\Big(\prod_{i=0}^{k-1}A_{t-i}\Big)\underline{B}_{t-k} \qquad (4.2)
$$
converges almost surely and is the unique strictly stationary causal solution.
Note: The conditions given above are very strong and Brandt (1986) states the result under
weaker conditions; we outline the differences here. Firstly, the assumption that $\{A_t, B_t\}$ are iid
can be relaxed to their being ergodic sequences. Secondly, the assumption $E\log\|A_t\| < 0$ can be
relaxed to $E\log\|A_t\| < \infty$ and that $\{A_t\}$ has a negative Lyapunov exponent, where the Lyapunov
exponent is defined as $\lim_{n\to\infty}\frac{1}{n}\log\|\prod_{j=1}^{n}A_j\| = \gamma$, with $\gamma < 0$ (see Brandt (1986)).
The conditions given in the above theorem may appear a little cryptic. However, the condition
$E\log|A_t| < 0$ (in the univariate case) becomes quite clear if you compare the SRE model with
the AR(1) model $X_t = \phi X_{t-1} + \varepsilon_t$, where $|\phi| < 1$ (which is the special case of the SRE, where
the coefficient is deterministic). We recall that the solution of the AR(1) is $X_t = \sum_{j=0}^{\infty}\phi^j\varepsilon_{t-j}$.
The important part in this decomposition is that $|\phi^j|$ decays geometrically fast to zero. Now let
us compare this to (4.2), we see that $\phi^j$ plays a similar role to $\prod_{i=0}^{k-1}A_{t-i}$. Given that there are
similarities between the AR(1) and SRE, it seems reasonable that for (4.2) to converge, $\prod_{i=0}^{k-1}A_{t-i}$
should converge geometrically too (at least almost surely). However analysis of a product is not
straightforward, therefore we take logarithms to turn it into a sum
$$
\frac{1}{k}\log\Big|\prod_{i=0}^{k-1}A_{t-i}\Big| = \frac{1}{k}\sum_{i=0}^{k-1}\log|A_{t-i}| \overset{a.s.}{\to} E[\log|A_t|] := \gamma,
$$
by the strong law of large numbers. Thus, for large $k$,
$$
\Big|\prod_{i=0}^{k-1}A_{t-i}\Big| \approx \exp[k\gamma],
$$
which only converges to zero if $\gamma < 0$, in other words $E[\log|A_t|] < 0$. Thus we see that the condition
$E\log|A_t| < 0$ is quite a logical condition after all.
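The law of large numbers argument can be illustrated numerically. In the sketch below the random coefficient is taken to be A = a Z^2 with Z standard normal and a = 1.5; this choice and the object names are assumptions made for illustration only. The average of log A is compared with its expectation, log(a) + digamma(1/2) + log(2), which is negative even though E[A] = a exceeds one.

```r
set.seed(4)
a <- 1.5
k <- 1e5
Z <- rnorm(k)

# (1/k) * log of the product A_1 ... A_k, computed as an average of logs
gamma_hat <- mean(log(a * Z^2))

# E[log(a Z^2)] for standard normal Z, using E[log Z^2] = digamma(1/2) + log(2)
gamma_true <- log(a) + digamma(0.5) + log(2)

# The average log is negative although the mean of A is larger than one
c(gamma_hat = gamma_hat, gamma_true = gamma_true, mean_A = mean(a * Z^2))
```

So the product of the coefficients decays geometrically almost surely, even though each coefficient has mean greater than one.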
4.1 Data Motivation
4.1.1 Yahoo data from 1996-2014
We consider here the closing share price of the Yahoo daily data downloaded from https://uk.finance.yahoo.com/q/hp?s=YHOO. The data runs from 10th April 1996 to 8th August 2014
(over 4000 observations). A plot is given in Figure 4.1. Typically the logarithm of such data is taken,
and in order to remove linear and/or stochastic trend the first difference of the logarithm is taken
(i.e. $X_t = \log S_t - \log S_{t-1}$). The hope is that after taking differences the data has been stationarized
(see Example 2.3.2). However, the data set spans almost 20 years and this assumption is rather
precarious and will be investigated later. A plot of the data after taking first differences, together
with the QQ-plot, is given in Figure 4.2.
[Figure 4.1: time series plot of the daily Yahoo closing share price (yahoo against Time).]
[Figure 4.2: Plot of log differences of daily Yahoo share price 1996-2014 and the corresponding QQ-plot.]
The log differences $\{X_t\}$ appear to have very thick tails, which may mean that higher order moments of
the log returns do not exist (not finite).
In Figure 4.3 we give the autocorrelation (ACF) plots of the log differences, absolute log differences and squares of the log differences. Note that the sample autocorrelation is defined as
$$
\widehat{\rho}(k) = \frac{\widehat{c}(k)}{\widehat{c}(0)}, \qquad\text{where}\qquad
\widehat{c}(k) = \frac{1}{T}\sum_{t=1}^{T-k}(X_t-\bar{X})(X_{t+k}-\bar{X}). \qquad (4.3)
$$
The dotted lines are the error bars (the 95% confidence intervals of the sample correlations constructed
under the assumption the observations are independent, see Section 6.2.1). From Figure 4.3a
we see that there appears to be no correlation in the data. More precisely, most of the sample
correlations are within the error bars, the few that are outside could be by chance, as the error
bars are constructed pointwise. However, in Figure 4.3b the ACF plot of the absolutes gives significant
large correlations. In contrast, in Figure 4.3c we give the ACF plot of the squares, where there
do not appear to be any significant correlations.
[Figure 4.3: ACF plots (ACF against Lag) of (a) the log differences (series yahoo.log.diff), (b) the absolute of the log differences (series abs(yahoo.log.diff)), (c) the square of the log differences (series (yahoo.log.diff)^2).]
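# yahoo.log.diff is assumed to hold the log differences of the closing prices,
# e.g. yahoo.log.diff <- diff(log(yahoo)) if yahoo contains the price series
plot.ts(yahoo.log.diff)
qqnorm(yahoo.log.diff)
qqline(yahoo.log.diff)
par(mfrow=c(1,3))
acf(yahoo.log.diff) # ACF plot of log differences
acf(abs(yahoo.log.diff)) # ACF plot of absolute log differences
acf((yahoo.log.diff)^2) # ACF plot of square of log differences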
4.1.2
For completeness we discuss a much shorter data set, the daily closing price of the FTSE 100
from 20th January - 8th August, 2014 (141 observations). This data was downloaded from http://markets.ft.com/research//Tearsheets/PriceHistoryPopup?symbol=FTSE:FSI.
Exactly the same analysis that was applied to the Yahoo data is applied to the FTSE data and
the plots are summarised below.
[Figure 4.4: time series plot of the daily FTSE 100 closing price (ftse against Time).]
[Figure 4.5: Plot of log differences of daily FTSE price Jan-August, 2014 and the corresponding QQ-plot.]
[Figure 4.6: ACF plots (ACF against Lag) of (a) the log differences (series ftse.log.diff), (b) the absolute of the log differences (series abs(ftse.log.diff)), (c) the square of the log differences (series (ftse.log.diff)^2).]
4.2
During the early 80s econometricians were trying to find a suitable model for forecasting stock
prices. They were faced with data similar to the log differences of the Yahoo data in Figure 4.2. As
Figure 4.3a demonstrates, there does not appear to be any linear dependence in the data, which
makes the best linear predictor quite useless for forecasting. Instead, they tried to predict the
variance of future prices given the past, $\mathrm{var}[X_{t+1}|X_t,X_{t-1},\ldots]$. This called for a model that has a
zero autocorrelation function, but models the conditional variance.
To address this need, Engle (1982) proposed the autoregressive conditionally heteroskedastic
(ARCH) model (note that Rob Engle, together with Clive Granger, received the Nobel prize
for Economics in 2003; Engle for the ARCH model and Granger for cointegration). He proposed the ARCH($p$) which satisfies the representation
$$
X_t = \sigma_tZ_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2,
$$
where $Z_t$ are iid random variables where $E(Z_t) = 0$ and $\mathrm{var}(Z_t) = 1$, $a_0 > 0$ and for $1 \le j \le p$,
$a_j \ge 0$.
Before worrying about whether a solution of such a model exists, let us consider the reasons
behind why this model was first proposed.
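4.2.1 Features of an ARCH
Let us suppose that a causal, stationary solution of the ARCH model exists ($X_t$ is a function of
$Z_t, Z_{t-1}, Z_{t-2},\ldots$) and all the necessary moments exist. Then we obtain the following.
(i) The first moment:
$$
E[X_t] = E[Z_t\sigma_t] = E[E(Z_t\sigma_t|X_{t-1},X_{t-2},\ldots)] = E[\sigma_tE(Z_t|X_{t-1},X_{t-2},\ldots)]
= E[\sigma_tE(Z_t)] = E[0\cdot\sigma_t] = 0,
$$
where the third equality uses that $\sigma_t$ is a function of $X_{t-1},\ldots,X_{t-p}$ and the fourth follows by causality.
(ii) The conditional variance:
$$
\mathrm{var}(X_t|X_{t-1},X_{t-2},\ldots) = E(Z_t^2\sigma_t^2|X_{t-1},X_{t-2},\ldots) = \sigma_t^2E[Z_t^2] = \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2,
$$
thus the conditional variance is a function of the squared past.
(iii) The autocovariance function:
Without loss of generality assume $k > 0$
$$
\mathrm{cov}[X_t,X_{t+k}] = E[X_tX_{t+k}] = E[X_tE(X_{t+k}|X_{t+k-1},X_{t+k-2},\ldots)]
= E[X_t\sigma_{t+k}E(Z_{t+k}|X_{t+k-1},X_{t+k-2},\ldots)] = E[X_t\sigma_{t+k}E(Z_{t+k})] = E[X_t\sigma_{t+k}\cdot 0] = 0.
$$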
(iv) The squares: using $X_t^2 = \sigma_t^2Z_t^2$ we can write
$$
X_t^2 = \sigma_t^2Z_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2 + (Z_t^2-1)\sigma_t^2, \qquad (4.4)
$$
which is an AR($p$)-type representation of $\{X_t^2\}$ with characteristic polynomial $\phi(z) = 1 - \sum_{j=1}^{p}a_jz^j$.
If $\sum_{j=1}^{p}a_j < 1$, then the
roots of the characteristic polynomial $\phi(z)$ lie outside the unit circle (see Exercise 2.1). Moreover,
the innovations $\epsilon_t = (Z_t^2-1)\sigma_t^2$ are martingale differences (see wiki). This can be shown by noting
that
$$
E[(Z_t^2-1)\sigma_t^2|X_{t-1},X_{t-2},\ldots] = \sigma_t^2E(Z_t^2-1|X_{t-1},X_{t-2},\ldots) = \sigma_t^2E(Z_t^2-1) = 0.
$$
Thus $\mathrm{cov}(\epsilon_t,\epsilon_s) = 0$ for $s \ne t$. Martingales are a useful asymptotic tool in time series, we demonstrate how they can be used in Chapter 10.
To summarise, in many respects the ARCH($p$) model resembles the AR($p$) except that the
innovations $\{\epsilon_t\}$ are martingale differences and not iid random variables. This means that despite
the resemblance, it is not a linear time series.
We show that a unique, stationary causal solution of the ARCH model exists and derive conditions under which the moments exist.
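These features can be seen in a simulated realisation. The following is a sketch under stated assumptions (an ARCH(1) with a0 = 1, a1 = 0.2 and Gaussian Z_t; the parameter values, burn-in length and object names are illustrative choices, not from the notes). It checks that the series itself is uncorrelated while its squares behave like an AR(1) with coefficient a1.

```r
set.seed(5)
a0 <- 1; a1 <- 0.2
n <- 50000; burn <- 500
z <- rnorm(n + burn)

# ARCH(1) recursion: sigma_t^2 = a0 + a1 X_{t-1}^2 and X_t = sigma_t Z_t
x <- numeric(n + burn)
x[1] <- sqrt(a0 / (1 - a1)) * z[1]
for (t in 2:(n + burn)) {
  sigma2 <- a0 + a1 * x[t - 1]^2
  x[t] <- sqrt(sigma2) * z[t]
}
x <- x[(burn + 1):(n + burn)]   # discard the burn-in

# Sample autocorrelations at lags 1..5 of the series and of its squares
r_x  <- acf(x,   lag.max = 5, plot = FALSE)$acf[-1]
r_x2 <- acf(x^2, lag.max = 5, plot = FALSE)$acf[-1]
round(rbind(r_x, r_x2), 3)

# Sample second moment against a0 / (1 - a1)
c(mean(x^2), a0 / (1 - a1))
```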
4.2.2
Consider the ARCH(1) model, where $X_t = \sigma_tZ_t$ and
$$
\sigma_t^2 = a_0 + a_1X_{t-1}^2. \qquad (4.5)
$$
It is difficult to directly obtain a solution of $X_t$, instead we obtain a solution for $\sigma_t^2$ (since $X_t$ can
immediately be obtained from this). Using that $X_{t-1}^2 = \sigma_{t-1}^2Z_{t-1}^2$ and substituting this into (4.5)
we obtain
$$
\sigma_t^2 = a_0 + a_1X_{t-1}^2 = (a_1Z_{t-1}^2)\sigma_{t-1}^2 + a_0. \qquad (4.6)
$$
We observe that (4.6) can be written in the stochastic recurrence relation form given in (4.1)
with $A_t = a_1Z_{t-1}^2$ and $B_t = a_0$. Therefore, by using Theorem 4.0.1, if $E[\log a_1Z_{t-1}^2] = \log a_1 + E[\log Z_{t-1}^2] < 0$, then $\sigma_t^2$ has the strictly stationary causal solution
$$
\sigma_t^2 = a_0 + a_0\sum_{k=1}^{\infty}a_1^k\prod_{j=1}^{k}Z_{t-j}^2. \qquad (4.7)
$$
The condition is immediately implied if $a_1 < 1$ (since, by Jensen's inequality, $E[\log Z_t^2] \le \log E[Z_t^2] = 0$), but it is also satisfied
under weaker conditions on $a_1$.
To obtain the moments of $X_t^2$ we use that it has the solution
$$
X_t^2 = Z_t^2\Big(a_0 + a_0\sum_{k=1}^{\infty}a_1^k\prod_{j=1}^{k}Z_{t-j}^2\Big),
$$
therefore taking expectations we have
$$
E[X_t^2] = E[Z_t^2]\,E\Big(a_0 + a_0\sum_{k=1}^{\infty}a_1^k\prod_{j=1}^{k}Z_{t-j}^2\Big) = a_0 + a_0\sum_{k=1}^{\infty}a_1^k.
$$
Thus $E[X_t^2] < \infty$ if and only if $a_1 < 1$ (heuristically we can see this from $E[X_t^2] = E[Z_t^2](a_0 + a_1E[X_{t-1}^2])$).
By placing stricter conditions on $a_1$, namely $a_1E(Z_t^{2d})^{1/d} < 1$, we can show that $E[X_t^{2d}] < \infty$.
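For the general ARCH($p$) model the solution can be written as
$$
X_t^2 = \sum_{k\ge 0}m_t(k), \qquad (4.8)
$$
where $m_t(0) = a_0Z_t^2$ and, for $k \ge 1$,
$$
m_t(k) = \sum_{j_1,\ldots,j_k\ge 1}a_0\prod_{r=1}^{k}a_{j_r}\prod_{r=0}^{k}Z^2_{t-\sum_{s=0}^{r}j_s} \qquad (j_0 = 0).
$$
The above solution belongs to a general class of functions called a Volterra expansion. We note
that $E[X_t^2] < \infty$ iff $\sum_{j=1}^{p}a_j < 1$.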
4.3
A possible drawback of the ARCH($p$) model is that the conditional variance only depends on a finite
number of the past squared observations/log returns (in finance, the log differences of the share price are often called
the log returns). However, when fitting the model to the data, analogous to order selection of an
autoregressive model (using, say, the AIC), often a large order $p$ is selected. This suggests that
the conditional variance should involve a large (infinite?) number of past terms. This observation
motivated the GARCH model (first proposed in Bollerslev (1986) and Taylor (1986)), which in
many respects is analogous to the ARMA. The conditional variance of the GARCH model is a
weighted average of the squared returns, the weights decline with the lag, but never go completely
to zero. The GARCH class of models is a rather parsimonious class of models and is extremely
popular in finance. The GARCH($p,q$) model is defined as
$$
X_t = \sigma_tZ_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2 + \sum_{i=1}^{q}b_i\sigma_{t-i}^2, \qquad (4.9)
$$
where $Z_t$ are iid random variables where $E(Z_t) = 0$ and $\mathrm{var}(Z_t) = 1$, $a_0 > 0$ and for $1 \le j \le p$,
$a_j \ge 0$ and for $1 \le i \le q$, $b_i \ge 0$.
Under the assumption that a causal solution with sufficient moments exist, the same properties
defined for the ARCH(p) in Section 4.2.1 also apply to the GARCH(p, q) model.
It can be shown that under suitable conditions on $\{b_j\}$ that $X_t$ satisfies an ARCH($\infty$) representation. Formally, we can write the conditional variance $\sigma_t^2$ (assuming that a stationary solution
exists) as
$$
\Big(1-\sum_{i=1}^{q}b_iB^i\Big)\sigma_t^2 = \Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2\Big),
$$
where $B$ denotes the backshift notation defined in Chapter 2. Therefore if the roots of $b(z) = (1-\sum_{i=1}^{q}b_iz^i)$ lie outside the unit circle (which is satisfied if $\sum_ib_i < 1$) then
$$
\sigma_t^2 = \frac{1}{(1-\sum_{j=1}^{q}b_jB^j)}\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2\Big) = \alpha_0 + \sum_{j=1}^{\infty}\alpha_jX_{t-j}^2, \qquad (4.10)
$$
where a recursive equation for the derivation of $\alpha_j$ can be found in Berkes et al. (2003). In other
words the GARCH($p,q$) process can be written as an ARCH($\infty$) process. This is analogous to the
invertibility representation given in Definition 2.2.2. This representation is useful when estimating
the parameters of a GARCH process (see Berkes et al. (2003)) and also for prediction. The expansion
in (4.10) helps explain why the GARCH($p,q$) process is so popular. As we stated at the start of this
section, the conditional variance of the GARCH is a weighted average of the squared returns, the
weights decline with the lag, but never go completely to zero, a property that is highly desirable.
Example 4.3.1 (Inverting the GARCH(1,1)) If $b_1 < 1$, then we can write $\sigma_t^2$ as
$$
\sigma_t^2 = \Big[\sum_{j=0}^{\infty}b_1^jB^j\Big]\big[a_0 + a_1X_{t-1}^2\big] = \frac{a_0}{1-b_1} + a_1\sum_{j=0}^{\infty}b_1^jX_{t-1-j}^2.
$$
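The inversion in Example 4.3.1 can be verified numerically. The sketch below is illustrative only: the parameter values, the truncation point J and the object names are assumptions. A GARCH(1,1) path is simulated with the usual recursion and the conditional variance at the final time point is recomputed from the truncated ARCH(infinity) sum.

```r
set.seed(2)
a0 <- 0.1; a1 <- 0.1; b1 <- 0.85
n <- 1000
z <- rnorm(n)

# GARCH(1,1) recursion, started at the stationary mean of sigma_t^2
x <- numeric(n); sig2 <- numeric(n)
sig2[1] <- a0 / (1 - a1 - b1)
x[1] <- sqrt(sig2[1]) * z[1]
for (t in 2:n) {
  sig2[t] <- a0 + a1 * x[t - 1]^2 + b1 * sig2[t - 1]
  x[t] <- sqrt(sig2[t]) * z[t]
}

# ARCH(infinity) form at time n, truncated after J terms
# (the neglected remainder is of order b1^J)
J <- 300
lags <- 0:(J - 1)
sig2_inf <- a0 / (1 - b1) + a1 * sum(b1^lags * x[n - 1 - lags]^2)

c(recursion = sig2[n], arch_infinity = sig2_inf)
```

The two numbers agree up to the truncation error, which here is negligible because the weights decay geometrically in the lag.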
4.3.1
We will focus on the GARCH(1,1) model as this substantially simplifies the conditions. We recall
the conditional variance of the GARCH(1,1) can be written as
$$
\sigma_t^2 = a_0 + a_1X_{t-1}^2 + b_1\sigma_{t-1}^2 = (a_1Z_{t-1}^2 + b_1)\sigma_{t-1}^2 + a_0. \qquad (4.11)
$$
We observe that (4.11) can be written in the stochastic recurrence relation form given in (4.1) with
$A_t = (a_1Z_{t-1}^2 + b_1)$ and $B_t = a_0$. Therefore, by using Theorem 4.0.1, if $E[\log(a_1Z_{t-1}^2 + b_1)] < 0$,
then $\sigma_t^2$ has the strictly stationary causal solution
$$
\sigma_t^2 = a_0 + a_0\sum_{k=1}^{\infty}\prod_{j=1}^{k}(a_1Z_{t-j}^2 + b_1). \qquad (4.12)
$$
These conditions are relatively weak and depend on the distribution of $Z_t$. They are definitely
satisfied if $a_1 + b_1 < 1$, since $E[\log(a_1Z_{t-1}^2 + b_1)] \le \log E[a_1Z_{t-1}^2 + b_1] = \log(a_1 + b_1)$. However
existence of a stationary solution does not require such a strong condition on the coefficients (and
there can still exist a stationary solution if $a_1 + b_1 > 1$, so long as the distribution of $Z_t^2$ is such
that $E[\log(a_1Z_t^2 + b_1)] < 0$).
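As a rough numerical illustration of this last point, the expectation can be approximated by Monte Carlo. The parameter values a1 = 1.2 and b1 = 0.05 and the Gaussian choice for Z_t are assumptions made for this sketch only. Here a1 + b1 exceeds one, so the process is not second order stationary, yet the estimate of the expectation is clearly negative.

```r
set.seed(6)
a1 <- 1.2; b1 <- 0.05
Z <- rnorm(1e5)

# Monte Carlo estimate of E[log(a1 Z^2 + b1)] for standard normal Z
lyap <- mean(log(a1 * Z^2 + b1))

# log(a1 + b1) is positive, but the estimated expectation is negative
c(log_sum = log(a1 + b1), lyap = lyap)
```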
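By taking expectations of (4.12) we can see that
$$
E[X_t^2] = E[\sigma_t^2] = a_0 + a_0\sum_{k=1}^{\infty}\prod_{j=1}^{k}(a_1 + b_1) = a_0 + a_0\sum_{k=1}^{\infty}(a_1 + b_1)^k.
$$
Thus $E[X_t^2] < \infty$ iff $a_1 + b_1 < 1$ (noting that $a_1$ and $b_1$ are both positive). Expanding on this
argument, if $d > 1$ we can use the Minkowski inequality to show
$$
(E[\sigma_t^{2d}])^{1/d} \le a_0 + a_0\sum_{k=1}^{\infty}\Big(E\Big[\Big(\prod_{j=1}^{k}(a_1Z_{t-j}^2 + b_1)\Big)^d\Big]\Big)^{1/d}
= a_0 + a_0\sum_{k=1}^{\infty}\Big(\prod_{j=1}^{k}E[(a_1Z_{t-j}^2 + b_1)^d]\Big)^{1/d}.
$$
Therefore, if $E[(a_1Z_{t-j}^2 + b_1)^d] < 1$, then $E[X_t^{2d}] < \infty$. This is an iff condition, since from the
definition in (4.11) and $a_0 > 0$ we have
$$
E[\sigma_t^{2d}] = E[(a_0 + (a_1Z_{t-1}^2 + b_1)\sigma_{t-1}^2)^d] > E[(a_1Z_{t-1}^2 + b_1)^d]\,E[\sigma_{t-1}^{2d}],
$$
since $\sigma_{t-1}^2$ has a causal solution, it is independent of $Z_{t-1}^2$. We observe that by stationarity and if
$E[\sigma_t^{2d}] < \infty$, then $E[\sigma_t^{2d}] = E[\sigma_{t-1}^{2d}]$, so it is necessary that $E[(a_1Z_{t-1}^2 + b_1)^d] < 1$.
Indeed in order for $E[X_t^{2d}] < \infty$ a huge constraint needs to be placed on the parameter space
of $a_1$ and $b_1$.
Exercise 4.1 Suppose $\{Z_t\}$ are standard normal random variables. Find conditions on $a_1$ and $b_1$
such that $E[X_t^4] < \infty$.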
The above results can be generalised to the GARCH($p,q$) model. Conditions for existence of a
stationary solution hinge on the random matrix corresponding to the SRE representation of the
GARCH model (see Bougerol and Picard (1992a) and Bougerol and Picard (1992b)), which are
nearly impossible to verify. A sufficient and necessary condition for both a stationary (causal)
solution and second order stationarity ($E[X_t^2] < \infty$) is $\sum_{j=1}^{p}a_j + \sum_{i=1}^{q}b_i < 1$. However, many
econometricians believe this condition places an unreasonable constraint on the parameter space of
$\{a_j\}$ and $\{b_i\}$. A large amount of research has been done on finding consistent parameter estimators
under weaker conditions. Indeed, in the very interesting paper by Berkes et al. (2003) (see also
Straumann (2005)) they derive consistent estimates of GARCH parameters under a far milder set of
conditions on $\{a_j\}$ and $\{b_i\}$ (which do not require $E(X_t^2) < \infty$).
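Definition 4.3.1 The IGARCH model is a GARCH model where
$$
X_t = \sigma_tZ_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2 + \sum_{i=1}^{q}b_i\sigma_{t-i}^2, \qquad (4.13)
$$
where the coefficients are such that $\sum_{j=1}^{p}a_j + \sum_{i=1}^{q}b_i = 1$. This is an example of a time series
model which has a strictly stationary solution but it is not second order stationary.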
Exercise 4.2 Simulate realisations of ARCH(1) and GARCH(1,1) models. Simulate with both iid
Gaussian and t-distribution errors ($\{Z_t\}$ where $E[Z_t^2] = 1$). Remember to 'burn-in' each realisation.
In all cases fix $a_0 > 0$. Then
(i) Simulate an ARCH(1) with $a_1 = 0.3$ and $a_1 = 0.9$.
(ii) Simulate a GARCH(1,1) with $a_1 = 0.1$ and $b_1 = 0.85$, and a GARCH(1,1) with $a_1 = 0.85$
and $b_1 = 0.1$. Compare the two behaviours.
4.3.2
One criticism of the GARCH model is that it is 'blind' to the sign of the return $X_t$. In
other words, the conditional variance of $X_t$ only takes into account the magnitude of $X_t$ and does
not depend on increases or decreases in $S_t$ (which corresponds to $X_t$ being positive or negative).
In contrast it is largely believed that the financial markets react differently to negative or positive
$X_t$. The general view is that there is greater volatility/uncertainty/variation in the market when
the price goes down.
This observation has motivated extensions to the GARCH, such as the EGARCH, which take
into account the sign of $X_t$. Deriving conditions for such a stationary solution to exist can be a
difficult task, and the reader is referred to Straumann (2005) for more details.
Other extensions to the GARCH include an autoregressive type model with GARCH innovations.
4.3.3 R code
4.4 Bilinear models
The Bilinear model was first proposed in Subba Rao (1977) and Granger and Andersen (1978) (see
also Subba Rao (1981)). The general Bilinear (BL($p,q,r,s$)) model is defined as
$$
X_t - \sum_{j=1}^{p}\phi_jX_{t-j} = \varepsilon_t + \sum_{i=1}^{q}\theta_i\varepsilon_{t-i} + \sum_{k=1}^{r}\sum_{k'=1}^{s}b_{k,k'}X_{t-k}\varepsilon_{t-k'},
$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$.
To motivate the Bilinear model let us consider the simplest version of the model BL(1,0,1,1)
$$
X_t = \phi_1X_{t-1} + b_{1,1}X_{t-1}\varepsilon_{t-1} + \varepsilon_t = [\phi_1 + b_{1,1}\varepsilon_{t-1}]X_{t-1} + \varepsilon_t. \qquad (4.14)
$$
Comparing (4.16) with the conditional variance of the GARCH(1,1) in (4.11) we see that they are
very similar, the main differences are that (a) the bilinear model does not constrain the coefficients
to be positive (whereas the conditional variance requires the coefficients to be positive), (b) the
$\varepsilon_{t-1}$ depends on $X_{t-1}$, whereas in the GARCH(1,1) $Z_{t-1}^2$ and $\sigma_{t-1}^2$ are independent and
(c) the innovation in the GARCH(1,1) model is deterministic, whereas the innovation in the
Bilinear model is random. Points (b) and (c) make the analysis of the Bilinear model more complicated than
the GARCH model.
4.4.1
In this section we assume a causal, stationary solution of the bilinear model exists, that the appropriate
number of moments exist and that it is invertible in the sense that there exists a function $g$ such that
$\varepsilon_t = g(X_t,X_{t-1},\ldots)$.
Under the assumption that the Bilinear process is invertible we can show that
$$
E[X_t|X_{t-1},X_{t-2},\ldots] = E[(\phi_1 + b_{1,1}\varepsilon_{t-1})X_{t-1}|X_{t-1},X_{t-2},\ldots] + E[\varepsilon_t|X_{t-1},X_{t-2},\ldots]
= (\phi_1 + b_{1,1}\varepsilon_{t-1})X_{t-1}, \qquad (4.15)
$$
thus unlike the autoregressive model the conditional expectation of $X_t$ given the past is a
nonlinear function of the past. It is this nonlinearity that gives rise to the spontaneous peaks that
we see in a typical realisation.
To see how the bilinear model was motivated in Figure 4.7 we give a plot of
$$
X_t = \phi_1X_{t-1} + b_{1,1}X_{t-1}\varepsilon_{t-1} + \varepsilon_t, \qquad (4.16)
$$
where $\phi_1 = 0.5$ and $b_{1,1} = 0, 0.35, 0.65$ and $-0.65$, and $\{\varepsilon_t\}$ are iid standard normal random
variables. We observe that Figure 4.7a is a realisation from an AR(1) process and the subsequent
plots are for different values of $b_{1,1}$. Figure 4.7a is quite regular, whereas the sudden bursts in
activity become more pronounced as $b_{1,1}$ grows (see Figures 4.7b and 4.7c). In Figure 4.7d we give
a plot of a realisation from a model where $b_{1,1}$ is negative and we see that the fluctuation has changed
direction.
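Realisations of this kind can be generated with a few lines of R. The function below is a hypothetical helper written for this sketch (its name, signature and burn-in length are assumptions, not necessarily the code used for the figures in these notes). It simulates the BL(1,0,1,1) recursion in (4.16) with standard normal innovations.

```r
# Hypothetical helper: simulate n observations from
# X_t = phi X_{t-1} + b X_{t-1} eps_{t-1} + eps_t with N(0,1) innovations
bilinear <- function(n, phi, b, burn = 200) {
  eps <- rnorm(n + burn)
  x <- numeric(n + burn)
  for (t in 2:(n + burn)) {
    x[t] <- (phi + b * eps[t - 1]) * x[t - 1] + eps[t]
  }
  x[(burn + 1):(n + burn)]   # discard the burn-in
}

set.seed(7)
x_ar <- bilinear(400, 0.5, 0)      # b = 0 reduces to an AR(1)
x_bl <- bilinear(400, 0.5, 0.65)   # bursts become visible
par(mfrow = c(1, 2))
plot.ts(x_ar)
plot.ts(x_bl)
```

Setting b = 0 recovers the AR(1), which gives a simple sanity check of the recursion.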
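Remark 4.4.1 (Markov Bilinear model) Some authors define the BL(1,0,1,1) as
$$
Y_t = \phi_1Y_{t-1} + b_{1,1}Y_{t-1}\varepsilon_t + \varepsilon_t = [\phi_1 + b_{1,1}\varepsilon_t]Y_{t-1} + \varepsilon_t.
$$
The fundamental difference between this model and (4.16) is that the multiplicative innovation
(using $\varepsilon_t$ rather than $\varepsilon_{t-1}$) is independent of $Y_{t-1}$.
[Figure 4.7: realisations (length 400, plotted against Time) of the bilinear model (4.16) with $\phi_1 = 0.5$ and (a) $b_{1,1} = 0$, (b) $b_{1,1} = 0.35$, (c) $b_{1,1} = 0.65$, (d) $b_{1,1} = -0.65$.]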
4.4.2
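We observe that (4.16) can be written in the stochastic recurrence relation form given in (4.1) with
$A_t = (\phi_1 + b_{1,1}\varepsilon_{t-1})$ and $B_t = \varepsilon_t$. Therefore, by using Theorem 4.0.1, if $E[\log|\phi_1 + b_{1,1}\varepsilon_{t-1}|] < 0$
and $E|\varepsilon_t| < \infty$, then $X_t$ has the strictly stationary, causal solution
$$
X_t = \sum_{k=1}^{\infty}\Big(\prod_{j=1}^{k-1}(\phi_1 + b_{1,1}\varepsilon_{t-j})\Big)[(\phi_1 + b_{1,1}\varepsilon_{t-k})\varepsilon_{t-k}] + \varepsilon_t. \qquad (4.17)
$$
To show that it is second order stationary we require that $E[X_t^2] < \infty$, which imposes additional
conditions on the parameters. To derive conditions for $E[X_t^2]$ we use (4.17) and the Minkowski
inequality to give
$$
(E[X_t^2])^{1/2} \le \sum_{k=1}^{\infty}\Big(E\Big[\Big(\prod_{j=1}^{k-1}(\phi_1 + b_{1,1}\varepsilon_{t-j})\Big)^2\Big]\Big)^{1/2}\Big(E\big[((\phi_1 + b_{1,1}\varepsilon_{t-k})\varepsilon_{t-k})^2\big]\Big)^{1/2} + (E[\varepsilon_t^2])^{1/2}
$$
$$
= \sum_{k=1}^{\infty}\prod_{j=1}^{k-1}\Big(E\big[(\phi_1 + b_{1,1}\varepsilon_{t-j})^2\big]\Big)^{1/2}\Big(E\big[((\phi_1 + b_{1,1}\varepsilon_{t-k})\varepsilon_{t-k})^2\big]\Big)^{1/2} + (E[\varepsilon_t^2])^{1/2}. \qquad (4.18)
$$
Therefore if $E[\varepsilon_t^4] < \infty$ and $E[(\phi_1 + b_{1,1}\varepsilon_t)^2] = \phi_1^2 + b_{1,1}^2\mathrm{var}(\varepsilon_t) < 1$, then $E[X_t^2] < \infty$.
Writing $\phi = \phi_1$ and $b = b_{1,1}$, the model can be rearranged as $\varepsilon_t = -bX_{t-1}\varepsilon_{t-1} + [X_t - \phi X_{t-1}]$; iterating this backwards gives
$$
\varepsilon_t = \sum_{j=0}^{\infty}\Big((-b)^j\prod_{i=0}^{j-1}X_{t-1-i}\Big)[X_{t-j} - \phi X_{t-j-1}],
$$
provided the sum converges.
This invertible representation is useful both in forecasting and estimation (see Section 5.5.3).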
Exercise 4.3 Simulate the BL(2,0,1,1) model (using the AR(2) parameters $\phi_1 = 1.5$ and $\phi_2 = -0.75$). Experiment with different parameters to give different types of behaviour.
Exercise 4.4 The random coefficient AR model is a nonlinear time series proposed by Barry Quinn
(see Nicholls and Quinn (1982) and Aue et al. (2006)). The random coefficient AR(1) model is
defined as
$$
X_t = (\phi + \eta_t)X_{t-1} + \varepsilon_t
$$
where $\{\varepsilon_t\}$ and $\{\eta_t\}$ are iid random variables which are independent of each other.
(i) State sufficient conditions which ensure that $\{X_t\}$ has a strictly stationary solution.
(ii) State conditions which ensure that $\{X_t\}$ is second order stationary.
(iii) Simulate from this model for different $\phi$ and $\mathrm{var}[\eta_t]$.
4.4.3 R code
4.5
Many researchers argue that fitting parametric models can lead to misspecification and argue that
it may be more realistic to fit nonparametric or semi-parametric time series models instead. There
exist several nonstationary and semi-parametric time series models (see Fan and Yao (2003) and Douc
et al. (2014) for a comprehensive summary), we give a few examples below. The most general
nonparametric model is
$$
X_t = m(X_{t-1},\ldots,X_{t-p},\varepsilon_t),
$$
but this is so general it loses all meaning, especially if the need is to predict. A slight restriction
is to make the innovation term additive (see Jones (1978))
$$
X_t = m(X_{t-1},\ldots,X_{t-p}) + \varepsilon_t,
$$
it is clear that for this model $E[X_t|X_{t-1},\ldots,X_{t-p}] = m(X_{t-1},\ldots,X_{t-p})$. However this model has
the distinct disadvantage that without placing any structure on $m(\cdot)$, for $p > 2$ nonparametric
estimators of $m(\cdot)$ are poor (as they suffer from the curse of dimensionality).
Thus such a generalisation renders the model useless. Instead semi-parametric approaches have
been developed. Examples include the functional AR($p$) model defined as
$$
X_t = \sum_{j=1}^{p}\phi_j(X_{t-p})X_{t-j} + \varepsilon_t
$$
and the semi-parametric ARCH-type model defined as
$$
X_t = \sigma_tZ_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p}a_j(X_{t-j}^2).
$$
In the case of all these models it is not easy to establish conditions in which a stationary solution
exists. More often than not, if conditions are established they are similar in spirit to those that
are used in the parametric setting. For some details on the proof see Vogt (2013), who
considers nonparametric and nonstationary models (note the nonstationarity he considers is when
the covariance structure changes over time, not the unit root type). For example in the case of
the semiparametric AR(1) model, a stationary causal solution exists if  + 0 (0) < 1.
Chapter 5
Prediction
Prerequisites
The best linear predictor.
Some idea of what a basis of a vector space is.
Objectives
Understand that prediction using a long past can be difficult because a large matrix has to be inverted; thus alternative, recursive methods are often used to avoid direct inversion.
Understand the derivation of the Levinson-Durbin algorithm, and why the coefficient $\phi_{t,t}$ corresponds to the partial correlation between $X_1$ and $X_{t+1}$.
Understand how these predictive schemes can be used to write the space $\mathrm{sp}(X_t, X_{t-1}, \ldots, X_1)$ in terms of an orthogonal basis $\mathrm{sp}(X_t - P_{X_{t-1},X_{t-2},\ldots,X_1}(X_t), \ldots, X_1)$.
Understand how the above leads to the Wold decomposition of a second order stationary time series.
Understand how to approximate the prediction for an ARMA time series by a scheme which explicitly uses the ARMA structure, and that this approximation improves geometrically as the observed past grows.
One motivation behind fitting models to a time series is to forecast future unobserved observations, which would not be possible without a model. In this chapter we consider forecasting, based on the assumption that the model and/or autocovariance structure is known.
5.1
In this section we will assume that the linear time series {Xt } is both causal and invertible, that is
$$X_t = \sum_{j=0}^{\infty} a_j \varepsilon_{t-j} = \sum_{i=1}^{\infty} b_i X_{t-i} + \varepsilon_t, \qquad (5.1)$$
where $\{\varepsilon_t\}$ are iid random variables (recall Definition 2.2.2). Both these representations play an important role in prediction. Furthermore, in order to predict $X_{t+k}$ given $X_t, X_{t-1}, \ldots$ we will assume that the infinite past is observed. In later sections we consider the more realistic situation that only the finite past is observed. We note that since $X_t, X_{t-1}, X_{t-2}, \ldots$ is observed we can obtain $\varepsilon_\tau$ (for $\tau \le t$) by using the invertibility condition
$$\varepsilon_\tau = X_\tau - \sum_{i=1}^{\infty} b_i X_{\tau-i}.$$
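Now we consider the prediction of $X_{t+k}$ given $\{X_\tau;\ \tau \le t\}$. Using the MA($\infty$) representation (since the time series is causal) of $X_{t+k}$ we have
$$X_{t+k} = \underbrace{\sum_{j=0}^{\infty} a_{j+k}\varepsilon_{t-j}}_{\text{function of the observed past}} + \underbrace{\sum_{j=0}^{k-1} a_j \varepsilon_{t+k-j}}_{\text{future innovations}}.$$
The second term cannot be predicted from the past, since $E[\sum_{j=0}^{k-1} a_j \varepsilon_{t+k-j} | X_t, X_{t-1}, \ldots] = E[\sum_{j=0}^{k-1} a_j \varepsilon_{t+k-j}] = 0$. Therefore, the best linear predictor of $X_{t+k}$ given $X_t, X_{t-1}, \ldots$, which we denote as $X_t(k)$, is
$$X_t(k) = \sum_{j=0}^{\infty} a_{j+k}\varepsilon_{t-j} = \sum_{j=0}^{\infty} a_{j+k}\Big(X_{t-j} - \sum_{i=1}^{\infty} b_i X_{t-i-j}\Big). \qquad (5.2)$$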
$X_t(k)$ is called the $k$-step ahead predictor and it is straightforward to see that its mean squared error is
$$E[X_{t+k} - X_t(k)]^2 = E\Big[\sum_{j=0}^{k-1} a_j \varepsilon_{t+k-j}\Big]^2 = \mathrm{var}[\varepsilon_t]\sum_{j=0}^{k-1} a_j^2, \qquad (5.3)$$
where the last equality is due to the uncorrelatedness and zero mean of the innovations.
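As a rough numerical illustration of (5.3), the Python sketch below computes the MA coefficients $a_j$ from a finite list of AR coefficients $b_i$, using $a_0 = 1$ and $a_j = \sum_{i=1}^{j} b_i a_{j-i}$ (which follows from matching the two representations in (5.1)), and then evaluates the $k$-step mean squared error. The function names and the truncation of $\{b_i\}$ to a finite list are assumptions made for this example only.

```python
def ma_coefficients(b, n):
    """Return [a_0, ..., a_{n-1}] of the MA(infinity) representation.

    b = [b_1, b_2, ...] are the (truncated) AR(infinity) coefficients;
    coefficients beyond len(b) are treated as zero (an assumption).
    """
    a = [1.0]  # a_0 = 1
    for j in range(1, n):
        # a_j = sum_{i=1}^{j} b_i a_{j-i}
        a.append(sum(b[i - 1] * a[j - i] for i in range(1, min(j, len(b)) + 1)))
    return a


def k_step_mse(b, k, var_eps=1.0):
    """Mean squared error var[eps] * sum_{j=0}^{k-1} a_j^2 of the k-step predictor."""
    a = ma_coefficients(b, k)
    return var_eps * sum(x * x for x in a)
```

For an AR(1) with $b_1 = 0.5$ and unit innovation variance, `k_step_mse([0.5], 2)` gives $1 + 0.5^2 = 1.25$.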
Often we would like to obtain the $k$-step ahead predictor for $k = 1, \ldots, n$ where $n$ is some time in the future. We now explain how $X_t(k)$ can be evaluated recursively using the invertibility assumption.
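Step 1 Use invertibility in (5.1) to give
$$X_t(1) = \sum_{i=1}^{\infty} b_i X_{t+1-i},$$
and $E[X_{t+1} - X_t(1)]^2 = \mathrm{var}[\varepsilon_t]$.
Step 2 To obtain the 2-step ahead predictor we note that
$$X_{t+2} = \sum_{i=2}^{\infty} b_i X_{t+2-i} + b_1 X_{t+1} + \varepsilon_{t+2} = \sum_{i=2}^{\infty} b_i X_{t+2-i} + b_1[X_t(1) + \varepsilon_{t+1}] + \varepsilon_{t+2}.$$
Thus
$$X_t(2) = \sum_{i=2}^{\infty} b_i X_{t+2-i} + b_1 X_t(1)$$
and $E[X_{t+2} - X_t(2)]^2 = \mathrm{var}[\varepsilon_t](b_1^2 + 1) = \mathrm{var}[\varepsilon_t](a_1^2 + a_0^2)$.
Step 3 To obtain the 3-step ahead predictor we note that
$$X_{t+3} = \sum_{i=3}^{\infty} b_i X_{t+3-i} + b_2 X_{t+1} + b_1 X_{t+2} + \varepsilon_{t+3}$$
$$= \sum_{i=3}^{\infty} b_i X_{t+3-i} + b_2(X_t(1) + \varepsilon_{t+1}) + b_1(X_t(2) + b_1\varepsilon_{t+1} + \varepsilon_{t+2}) + \varepsilon_{t+3}.$$
Thus
$$X_t(3) = \sum_{i=3}^{\infty} b_i X_{t+3-i} + b_2 X_t(1) + b_1 X_t(2)$$
and $E[X_{t+3} - X_t(3)]^2 = \mathrm{var}[\varepsilon_t]\big((b_2 + b_1^2)^2 + b_1^2 + 1\big) = \mathrm{var}[\varepsilon_t](a_2^2 + a_1^2 + a_0^2)$.
Step k Using the same arguments,
$$X_t(k) = \sum_{i=k}^{\infty} b_i X_{t+k-i} + \sum_{i=1}^{k-1} b_i X_t(k-i).$$
For an AR($p$) process with coefficients $\phi_1, \ldots, \phi_p$ (so that $b_j = \phi_j$ for $j \le p$ and $b_j = 0$ otherwise) the recursion reduces to
$$X_t(1) = \sum_{j=1}^{p} \phi_j X_{t+1-j},$$
$$X_t(k) = \sum_{j=k}^{p} \phi_j X_{t+k-j} + \sum_{j=1}^{k-1} \phi_j X_t(k-j) \quad \text{for } 2 \le k \le p,$$
$$X_t(k) = \sum_{j=1}^{p} \phi_j X_t(k-j) \quad \text{for } k > p. \qquad (5.4)$$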
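The AR($p$) recursion (5.4) can be sketched in a few lines of Python: each forecast is appended to the observed values and then reused as if it were an observation. The function name `ar_forecasts` and the convention that the most recent observation is last in the list are assumptions of this sketch.

```python
def ar_forecasts(phi, history, n_ahead):
    """k-step ahead forecasts X_t(1), ..., X_t(n_ahead) for an AR(p) model.

    phi     = [phi_1, ..., phi_p]
    history = observed values, most recent last (needs at least p values)
    """
    p = len(phi)
    values = list(history)
    for _ in range(n_ahead):
        # next value = sum_j phi_j * (observation or earlier forecast)
        nxt = sum(phi[j] * values[-1 - j] for j in range(p))
        values.append(nxt)
    return values[len(history):]
```

For example, with $\phi_1 = 0.5$ and last observation 2, the forecasts decay geometrically: `ar_forecasts([0.5], [2.0], 3)` returns `[1.0, 0.5, 0.25]`.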
However, in the general case more sophisticated algorithms are required when only the finite
past is known.
[Figure: yearly temperature series (temp against Time, 1880-2000) and its ACF against Lag; panel title "Series global.mean".]
Figure 5.2: Second differences of yearly temperature from 1880-2013 and its ACF. [Panels: second.differences against Time and ACF against Lag; panel title "Series diff2".]
with $\mathrm{var}[\varepsilon_t] = \sigma^2 = 0.02294$. An ACF plot of the estimated residuals $\{\widehat{\varepsilon}_t\}$, obtained after fitting this model, is given in Figure 5.3. We observe that the residuals appear to be uncorrelated, which suggests that the AR(7) model fitted the data well (there is a formal test for this called the Ljung-Box test which we cover later). By using the sequence of equations
[Figure 5.3: ACF of the residuals against Lag; panel title "Series residuals".]
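If we believe the residuals are Gaussian we can use the mean squared error to construct confidence intervals for the predictions.
On the other hand, if the residuals are not Gaussian, we can construct 95% confidence intervals for the forecast using the bootstrap. Specifically, we rewrite the AR(7) process as an MA($\infty$) process
$$X_t = \sum_{j=0}^{\infty} \psi_j(\widehat{\phi})\varepsilon_{t-j}.$$
Hence the best linear predictor can be written as
$$X_t(k) = \sum_{j=k}^{\infty} \psi_j(\widehat{\phi})\varepsilon_{t+k-j},$$
which gives the prediction error
$$X_{t+k} - X_t(k) = \sum_{j=0}^{k-1} \psi_j(\widehat{\phi})\varepsilon_{t+k-j}.$$
We have the prediction estimates, therefore all we need is to obtain the distribution of $\sum_{j=0}^{k-1} \psi_j(\widehat{\phi})\varepsilon_{t+k-j}$. This can be done by estimating the residuals and then using the bootstrap to estimate the distribution of $\sum_{j=0}^{k-1} \psi_j(\widehat{\phi})\varepsilon_{t+k-j}$, using the empirical distribution of $\sum_{j=0}^{k-1} \psi_j(\widehat{\phi})\varepsilon^{*}_{t+k-j}$, where $\{\varepsilon^{*}_t\}$ are resampled from the estimated residuals. From this we can construct the 95% CI for the forecasts.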
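The bootstrap step described above can be sketched as follows. This is only an illustration under simplifying assumptions: the coefficients `psi` (the first $k$ MA coefficients of the fitted model), the estimated `residuals` and the point `forecast` are taken as given, residuals are resampled with replacement, and the interval is read off from empirical quantiles. The function name and arguments are hypothetical.

```python
import random


def bootstrap_forecast_interval(psi, residuals, forecast, k,
                                n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the k-step ahead forecast.

    psi       = [psi_0, ..., psi_{k-1}] MA coefficients of the fitted model
    residuals = estimated residuals to resample from
    forecast  = point forecast X_t(k)
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_boot):
        # resample k residuals and form sum_{j=0}^{k-1} psi_j * eps*_j
        draw = [rng.choice(residuals) for _ in range(k)]
        errors.append(sum(psi[j] * draw[j] for j in range(k)))
    errors.sort()
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return forecast + errors[lo_idx], forecast + errors[hi_idx]
```

The interval is centred on the supplied forecast; its width is driven entirely by the resampled prediction errors, so no Gaussian assumption is used.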
A small criticism of our approach is that we have fitted a rather large AR(7) model to a time series of length 127. It may be more appropriate to fit an ARMA model to this time series.
[Figure: forecast and true value of the second difference against year, 2000-2012.]
Exercise 5.1 In this exercise we analyze the Sunspot data found on the course website. In the data analysis below only use the data from 1700-2003 (the remaining data we will use for prediction). In this section you will need to use the function ar.yw in R.
(i) Fit the following models to the data and study the residuals (using the ACF). Using this, decide which model, … or $X_t = \mu + \varepsilon_t$ (where $\{\varepsilon_t\}$ is an AR process), is more appropriate (take into account the number of parameters estimated overall).
(ii) Use these models to forecast the sunspot numbers from 2004-2013.
… - 1.1565*res[126] - 1.0784*res[125] - 0.7745*res[124] - 0.6132*res[…
res[129] = -1.1472*res[128] - 1.1565*res[127] - 1.0784*res[126] - 0.7745*res[125] - 0.6132*res[…
res[130] = -1.1472*res[129] - 1.1565*res[128] - 1.0784*res[127] - 0.7745*res[126] - 0.6132*res[…
res[131] = -1.1472*res[130] - 1.1565*res[129] - 1.0784*res[128] - 0.7745*res[127] - 0.6132*res[…
res[132] = -1.1472*res[131] - 1.1565*res[130] - 1.0784*res[129] - 0.7745*res[128] - 0.6132*res[…
5.2
In the next few sections we will consider prediction/forecasting for stationary time series. In particular, we want to find the best linear predictor of $X_{t+1}$ given the finite past $X_t, \ldots, X_1$. Setting up notation, our aim is to find
$$X_{t+1|t} = P_{X_1,\ldots,X_t}(X_{t+1}) = X_{t+1|t,\ldots,1} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j},$$
where $\{\phi_{t,j}\}$ are chosen to minimise the mean squared error $E(X_{t+1} - \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j})^2$. The coefficients are the solution of
$$\begin{pmatrix} \phi_{t,1} \\ \vdots \\ \phi_{t,t} \end{pmatrix} = \Sigma_t^{-1} r_t,$$
where $(\Sigma_t)_{i,j} = E(X_i X_j)$ and $(r_t)_i = E(X_{t+1-i} X_{t+1})$. Given the covariances this can easily be done. However, if $t$ is large a brute force method would require $O(t^3)$ computing operations to calculate (5.7). Our aim is to exploit stationarity to reduce the number of operations. To do this, we will briefly discuss the notion of projections on a space, which helps in our derivation of computationally more efficient methods.
Before we continue we first briefly discuss the idea of a vector space, inner product spaces, Hilbert spaces, spans and bases. A more complete review is given in Brockwell and Davis (1998), Chapter 2.
First a brief definition of a vector space. $\mathcal{X}$ is called a vector space if for every $x, y \in \mathcal{X}$ and $a, b \in \mathbb{R}$ (this can be generalised to $\mathbb{C}$), $ax + by \in \mathcal{X}$. An inner product space is a vector space which comes with an inner product; in other words, for every pair of elements $x, y \in \mathcal{X}$ we can define an inner product $\langle x, y \rangle$, where $\langle \cdot, \cdot \rangle$ satisfies all the conditions of an inner product. Thus for every element $x \in \mathcal{X}$ we can define its norm as $\|x\| = \langle x, x \rangle^{1/2}$. If the inner product space is complete (meaning every Cauchy sequence in the space converges to a limit which is also in the space) then the inner product space is a Hilbert space (see wiki).
Example 5.2.1
(i) The classical example of a Hilbert space is the Euclidean space $\mathbb{R}^n$ where the inner product between two elements is simply the scalar product, $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$.
(ii) The subset of the probability space $(\Omega, \mathcal{F}, P)$, where all the random variables defined on $\Omega$ have a finite second moment, ie. $E(X^2) = \int_{\Omega} X(\omega)^2 dP(\omega) < \infty$. This space is denoted as $L^2(\Omega, \mathcal{F}, P)$. In this case, the inner product is $\langle X, Y \rangle = E(XY)$.
(iii) The function space $L^2[\mathbb{R}, \mu]$, where $f \in L^2[\mathbb{R}, \mu]$ if $f$ is $\mu$-measurable and $\int_{\mathbb{R}} |f(x)|^2 d\mu(x) < \infty$. In this case the inner product is
$$\langle f, g \rangle = \int_{\mathbb{R}} f(x)g(x)d\mu(x).$$
In this chapter we will not use this function space, but it will be used in Chapter ?? (when we prove the Spectral representation theorem).
It is straightforward to generalize the above to complex random variables and functions defined on $\mathbb{C}$. We simply need to remember to take conjugates when defining the inner product, ie. $\langle X, Y \rangle = E(X\overline{Y})$ and $\langle f, g \rangle = \int_{\mathbb{C}} f(z)\overline{g(z)}d\mu(z)$.
In this chapter our focus will be on certain spaces of random variables which have a finite variance.
Basis
The random variables $\{X_t, X_{t-1}, \ldots, X_1\}$ span the space $\mathcal{X}_t^1$ (denoted as $\mathrm{sp}(X_t, X_{t-1}, \ldots, X_1)$) if for every $Y \in \mathcal{X}_t^1$ there exist coefficients $\{a_j \in \mathbb{R}\}$ such that
$$Y = \sum_{j=1}^{t} a_j X_{t+1-j}. \qquad (5.5)$$
Conversely, every linear combination $\sum_{j=1}^{t} a_j X_{t+1-j}$ belongs to $\mathcal{X}_t^1$. We now define the basis of a vector space, which is closely related to the span. The random variables $\{X_t, \ldots, X_1\}$ form a basis of the space $\mathcal{X}_t^1$ if for every $Y \in \mathcal{X}_t^1$ we have a representation (5.5) and this representation is unique. More precisely, there does not exist another set of coefficients $\{b_j\}$ such that $Y = \sum_{j=1}^{t} b_j X_{t+1-j}$. For this reason, one can consider a basis as the minimal span, that is the smallest set of elements which can span a space.
Definition 5.2.1 (Projections) The projection of the random variable $Y$ onto the space $\mathrm{sp}(X_t, X_{t-1}, \ldots, X_1)$ (often denoted as $P_{X_t,X_{t-1},\ldots,X_1}(Y)$) is defined as $P_{X_t,X_{t-1},\ldots,X_1}(Y) = \sum_{j=1}^{t} c_j X_{t+1-j}$, where $\{c_j\}$ is chosen such that the difference $Y - P_{X_t,X_{t-1},\ldots,X_1}(Y)$ is uncorrelated with (orthogonal/perpendicular to) any element in $\mathrm{sp}(X_t, X_{t-1}, \ldots, X_1)$. In other words, $P_{X_t,X_{t-1},\ldots,X_1}(Y)$ is the best linear predictor of $Y$ given $X_t, \ldots, X_1$.
Orthogonal basis
An orthogonal basis is a basis where every element in the basis is orthogonal to every other element in the basis. It is straightforward to orthogonalize any given basis using the method of projections.
To simplify notation let $X_{t|t-1} = P_{X_{t-1},\ldots,X_1}(X_t)$ (with the convention $X_{1|0} = 0$). By definition, $X_t - X_{t|t-1}$ is orthogonal to the space $\mathrm{sp}(X_{t-1}, X_{t-2}, \ldots, X_1)$. In other words $X_t - X_{t|t-1}$ and $X_s$ ($1 \le s < t$) are orthogonal ($\mathrm{cov}(X_s, X_t - X_{t|t-1}) = 0$), and by a similar argument $X_t - X_{t|t-1}$ and $X_s - X_{s|s-1}$ are orthogonal.
Thus by using projections we have created an orthogonal basis $X_1, (X_2 - X_{2|1}), \ldots, (X_t - X_{t|t-1})$ of the space $\mathrm{sp}(X_1, (X_2 - X_{2|1}), \ldots, (X_t - X_{t|t-1}))$. By construction it is clear that $\mathrm{sp}(X_1, (X_2 - X_{2|1}), \ldots, (X_t - X_{t|t-1}))$ is a subspace of $\mathrm{sp}(X_t, \ldots, X_1)$. We now show that
$$\mathrm{sp}(X_1, (X_2 - X_{2|1}), \ldots, (X_t - X_{t|t-1})) = \mathrm{sp}(X_t, \ldots, X_1).$$
To do this we define the sum of spaces. If $U$ and $V$ are two orthogonal vector spaces (which share the same inner product), then $y \in U \oplus V$ if there exist $u \in U$ and $v \in V$ such that $y = u + v$. By the definition of $\mathcal{X}_t^1$, it is clear that $(X_t - X_{t|t-1}) \in \mathcal{X}_t^1$, but $(X_t - X_{t|t-1}) \notin \mathcal{X}_{t-1}^1$. Hence $\mathcal{X}_t^1 = \mathrm{sp}(X_t - X_{t|t-1}) \oplus \mathcal{X}_{t-1}^1$. Continuing this argument we see that $\mathcal{X}_t^1 = \mathrm{sp}(X_t - X_{t|t-1}) \oplus \mathrm{sp}(X_{t-1} - X_{t-1|t-2}) \oplus \ldots \oplus \mathrm{sp}(X_1)$. Hence $\mathrm{sp}(X_t, \ldots, X_1) = \mathrm{sp}(X_t - X_{t|t-1}, \ldots, X_2 - X_{2|1}, X_1)$.
Therefore for every $P_{X_t,\ldots,X_1}(Y) = \sum_{j=1}^{t} a_j X_{t+1-j}$, there exist coefficients $\{b_j\}$ such that
$$P_{X_t,\ldots,X_1}(Y) = \sum_{j=1}^{t} P_{X_{t+1-j} - X_{t+1-j|t-j}}(Y) = \sum_{j=1}^{t-1} b_j (X_{t+1-j} - X_{t+1-j|t-j}) + b_t X_1,$$
where $b_j = E(Y(X_{t+1-j} - X_{t+1-j|t-j}))/E(X_{t+1-j} - X_{t+1-j|t-j})^2$ for $1 \le j \le t-1$ and $b_t = E(YX_1)/E(X_1^2)$. A useful property of an orthogonal basis is the ease of obtaining the coefficients $b_j$, which avoids the inversion of a matrix. This is the underlying idea behind the innovations algorithm proposed in Brockwell and Davis (1998), Chapter 5.
5.2.1
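The notions above can be generalised to spaces which have an infinite number of elements in their basis (and are useful to prove Wold's decomposition theorem). We now construct the space spanned by an infinite number of random variables $\{X_t, X_{t-1}, \ldots\}$. As with anything that involves $\infty$ we need to define precisely what we mean by an infinite basis. To do this we construct a sequence of subspaces, each defined with a finite number of elements in the basis. We increase the number of elements in the subspace and consider the limit of this space. Let $\mathcal{X}_t^{-n} = \mathrm{sp}(X_t, \ldots, X_{-n})$; clearly if $m > n$, then $\mathcal{X}_t^{-n} \subset \mathcal{X}_t^{-m}$. We define $\mathcal{X}_t^{-\infty} = \cup_{n=1}^{\infty} \mathcal{X}_t^{-n}$; in other words, if $Y \in \mathcal{X}_t^{-\infty}$, then there exists an $n$ such that $Y \in \mathcal{X}_t^{-n}$. However, we also need to ensure that the limits of all the sequences lie in this infinite dimensional space, therefore we close the space by defining a new space which includes the old space and also includes all the limits. To make this precise suppose the sequence of random variables is such that $Y_s \in \mathcal{X}_t^{-s}$, and $E(Y_{s_1} - Y_{s_2})^2 \to 0$ as $s_1, s_2 \to \infty$. Since the sequence $\{Y_s\}$ is a Cauchy sequence there exists a limit. More precisely, there exists a random variable $Y$ such that $E(Y_s - Y)^2 \to 0$ as $s \to \infty$. Since the closure of the space, $\overline{\mathcal{X}}_t^{-\infty}$, contains the sets $\mathcal{X}_t^{-n}$ and all the limits of the Cauchy sequences in these sets, $Y \in \overline{\mathcal{X}}_t^{-\infty}$. We let
$$\mathcal{X}_t^{-\infty} = \overline{\mathrm{sp}}(X_t, X_{t-1}, \ldots). \qquad (5.6)$$
Let $X_{t-1}(1)$ denote the best linear predictor of $X_t$ given $X_{t-1}, X_{t-2}, \ldots$. It is clear that $(X_t - X_{t-1}(1))$ and $X_s$ for $s \le t-1$ are uncorrelated and $\mathcal{X}_t^{-\infty} = \mathrm{sp}(X_t - X_{t-1}(1)) \oplus \mathcal{X}_{t-1}^{-\infty}$, where $\mathcal{X}_t^{-\infty} = \overline{\mathrm{sp}}(X_t, X_{t-1}, \ldots)$. Thus we can construct the orthogonal basis $(X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)), \ldots$ and the corresponding space $\overline{\mathrm{sp}}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)), \ldots)$. It is clear that $\overline{\mathrm{sp}}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)), \ldots) \subset \overline{\mathrm{sp}}(X_t, X_{t-1}, \ldots)$. However, unlike the finite dimensional case it is not clear that they are equal; roughly speaking this is because $\overline{\mathrm{sp}}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)), \ldots)$ lacks the initial value $X_{-\infty}$. Of course the time $-\infty$ in the past is not really a well defined quantity. Instead, the way we overcome this issue is that we define the initial starting space as the intersection of the subspaces, $\mathcal{X}_{-\infty} = \cap_{n=-\infty}^{\infty} \mathcal{X}_n^{-\infty}$.
Furthermore, we note that since $X_n - X_{n-1}(1)$ and $X_s$ (for any $s < n$) are orthogonal, $\overline{\mathrm{sp}}((X_t - X_{t-1}(1)), (X_{t-1} - X_{t-2}(1)), \ldots)$ and $\mathcal{X}_{-\infty}$ are orthogonal spaces. Using $\mathcal{X}_{-\infty}$, we have
$$\oplus_{j=0}^{\infty}\, \mathrm{sp}(X_{t-j} - X_{t-j-1}(1)) \oplus \mathcal{X}_{-\infty} = \overline{\mathrm{sp}}(X_t, X_{t-1}, \ldots).$$
We will use this result when we prove the Wold decomposition theorem (in Section 5.7).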
5.3
Levinson-Durbin algorithm
We recall that in prediction the aim is to predict $X_{t+1}$ given $X_t, X_{t-1}, \ldots, X_1$. The best linear predictor is
$$X_{t+1|t} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j}, \qquad (5.7)$$
where $\{\phi_{t,j}\}$ are chosen to minimise the mean squared error, and are the solution of the equation
$$\begin{pmatrix} \phi_{t,1} \\ \vdots \\ \phi_{t,t} \end{pmatrix} = \Sigma_t^{-1} r_t, \qquad (5.8)$$
where $(\Sigma_t)_{i,j} = E(X_i X_j)$ and $(r_t)_i = E(X_{t+1-i} X_{t+1})$. Using standard methods, such as Gauss-Jordan elimination, to solve this system of equations requires $O(t^3)$ operations. However, we recall that $\{X_t\}$ is a stationary time series, thus $\Sigma_t$ is a Toeplitz matrix. Using this information, in the 1940s Norman Levinson proposed an algorithm which reduced the number of operations to $O(t^2)$. In the 1960s, Jim Durbin adapted the algorithm to time series and improved it.
We first outline the algorithm. We recall that the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_1$ is
$$X_{t+1|t} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j}. \qquad (5.9)$$
The mean squared error is $r(t+1) = E[X_{t+1} - X_{t+1|t}]^2$. Given the second order stationary covariance structure, the idea of the Levinson-Durbin algorithm is to recursively evaluate $\{\phi_{t,j};\ j = 1, \ldots, t\}$ given $\{\phi_{t-1,j};\ j = 1, \ldots, t-1\}$ (which are the coefficients of the best linear predictor of $X_t$ given $X_{t-1}, \ldots, X_1$). Let us suppose that the autocovariance function $c(k) = \mathrm{cov}[X_0, X_k]$ is known. The Levinson-Durbin algorithm is calculated using the following recursion.
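Step 1 $\phi_{1,1} = c(1)/c(0)$ and $r(2) = E[X_2 - X_{2|1}]^2 = E[X_2 - \phi_{1,1}X_1]^2 = c(0) - \phi_{1,1}c(1)$.
Step 2 For $t \ge 2$,
$$\phi_{t,t} = \frac{c(t) - \sum_{j=1}^{t-1} \phi_{t-1,j}\, c(t-j)}{r(t)},$$
$$\phi_{t,j} = \phi_{t-1,j} - \phi_{t,t}\,\phi_{t-1,t-j}, \qquad 1 \le j \le t-1,$$
$$r(t+1) = r(t)(1 - \phi_{t,t}^2).$$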
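The recursion can be sketched in Python as follows. This is a minimal illustration, assuming the autocovariances $c(0), \ldots, c(n)$ are known and using $r(1) = c(0)$ as the starting mean squared error; the function name and the dictionary layout of the output are choices made for this example.

```python
def levinson_durbin(c, n):
    """Levinson-Durbin recursion.

    c = [c(0), c(1), ..., c(n)] autocovariances.
    Returns (phi, r): phi[t] = [phi_{t,1}, ..., phi_{t,t}] for t = 1..n and
    r[t] = mean squared error of predicting X_t from X_{t-1}, ..., X_1.
    """
    phi = {}
    r = {1: c[0]}  # r(1) = c(0): nothing to predict from
    for t in range(1, n + 1):
        prev = phi.get(t - 1, [])  # [phi_{t-1,1}, ..., phi_{t-1,t-1}]
        num = c[t] - sum(prev[j - 1] * c[t - j] for j in range(1, t))
        phi_tt = num / r[t]
        # phi_{t,j} = phi_{t-1,j} - phi_{t,t} * phi_{t-1,t-j}
        cur = [prev[j - 1] - phi_tt * prev[t - j - 1] for j in range(1, t)]
        cur.append(phi_tt)
        phi[t] = cur
        r[t + 1] = r[t] * (1 - phi_tt ** 2)
    return phi, r
```

For an AR(1) with coefficient 0.5 and unit innovation variance, $c(k) = 0.5^k \cdot 4/3$, and the recursion returns $\phi_{t,1} \approx 0.5$ with all other coefficients approximately zero, as expected.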
(i) Suppose $X_t = \phi X_{t-1} + \varepsilon_t$ (where $|\phi| < 1$). Use the Levinson-Durbin algorithm,
5.3.1
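Let us suppose $\{X_t\}$ is a zero mean stationary time series and $c(k) = E(X_k X_0)$. Let $P_{X_t,\ldots,X_2}(X_1)$ denote the best linear predictor of $X_1$ given $X_t, \ldots, X_2$ and $P_{X_t,\ldots,X_2}(X_{t+1})$ denote the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_2$. Stationarity means that the following predictors share the same coefficients
$$X_{t|t-1} = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t-j},$$
$$P_{X_t,\ldots,X_2}(X_{t+1}) = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j}, \qquad (5.10)$$
$$P_{X_t,\ldots,X_2}(X_1) = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}.$$
The last line is because stationarity means that flipping a time series round has the same correlation structure. These three relations are an important component of the proof.
Recall our objective is to derive the coefficients of the best linear predictor $P_{X_t,\ldots,X_1}(X_{t+1})$ based on the coefficients of the best linear predictor $P_{X_{t-1},\ldots,X_1}(X_t)$. To do this we partition the space $\mathrm{sp}(X_t, \ldots, X_2, X_1)$ into two orthogonal spaces, $\mathrm{sp}(X_t, \ldots, X_2, X_1) = \mathrm{sp}(X_t, \ldots, X_2) \oplus \mathrm{sp}(X_1 - P_{X_t,\ldots,X_2}(X_1))$. Therefore by uncorrelatedness we have the partition
$$X_{t+1|t} = P_{X_t,\ldots,X_2}(X_{t+1}) + P_{X_1 - P_{X_t,\ldots,X_2}(X_1)}(X_{t+1})$$
$$= \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j} + \phi_{t,t}\big(X_1 - P_{X_t,\ldots,X_2}(X_1)\big)$$
$$= \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j} + \phi_{t,t}\underbrace{\Big(X_1 - \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}\Big)}_{\text{by (5.10)}}. \qquad (5.11)$$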
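We start by evaluating an expression for $\phi_{t,t}$ (which in turn will give the expression for the other coefficients). It is straightforward to see that
$$\phi_{t,t} = \frac{E[X_{t+1}(X_1 - P_{X_t,\ldots,X_2}(X_1))]}{E(X_1 - P_{X_t,\ldots,X_2}(X_1))^2} = \frac{E[(X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}))(X_1 - P_{X_t,\ldots,X_2}(X_1))]}{E(X_1 - P_{X_t,\ldots,X_2}(X_1))^2}. \qquad (5.12)$$
Therefore we see that the numerator of $\phi_{t,t}$ is the partial covariance between $X_{t+1}$ and $X_1$ (see Section 3.2.2); furthermore the denominator of $\phi_{t,t}$ is the mean squared prediction error, since by stationarity
$$E(X_1 - P_{X_t,\ldots,X_2}(X_1))^2 = E(X_t - P_{X_{t-1},\ldots,X_1}(X_t))^2 = r(t). \qquad (5.13)$$
Returning to (5.12), expanding out the expectation in the numerator and using (5.13) we have
$$\phi_{t,t} = \frac{E(X_{t+1}(X_1 - P_{X_t,\ldots,X_2}(X_1)))}{r(t)} = \frac{c(t) - E[X_{t+1}P_{X_t,\ldots,X_2}(X_1)]}{r(t)} = \frac{c(t) - \sum_{j=1}^{t-1} \phi_{t-1,j}\, c(t-j)}{r(t)}, \qquad (5.14)$$
which immediately gives us the first equation in Step 2 of the Levinson-Durbin algorithm. To obtain the recursion for $\phi_{t,j}$ we use (5.11) to give
$$X_{t+1|t} = \sum_{j=1}^{t} \phi_{t,j} X_{t+1-j} = \sum_{j=1}^{t-1} \phi_{t-1,j} X_{t+1-j} + \phi_{t,t}\Big(X_1 - \sum_{j=1}^{t-1} \phi_{t-1,j} X_{j+1}\Big).$$
Comparing the coefficients of $X_{t+1-j}$ on both sides gives
$$\phi_{t,j} = \phi_{t-1,j} - \phi_{t,t}\,\phi_{t-1,t-j}, \qquad 1 \le j \le t-1.$$
This gives the middle equation in Step 2. To obtain the recursion for the mean squared prediction error we note that by orthogonality of $\{X_t, \ldots, X_2\}$ and $X_1 - P_{X_t,\ldots,X_2}(X_1)$ we use (5.11) to give
$$r(t+1) = E(X_{t+1} - X_{t+1|t})^2 = E[X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}) - \phi_{t,t}(X_1 - P_{X_t,\ldots,X_2}(X_1))]^2$$
$$= E[X_{t+1} - P_{X_2,\ldots,X_t}(X_{t+1})]^2 + \phi_{t,t}^2 E[X_1 - P_{X_t,\ldots,X_2}(X_1)]^2 - 2\phi_{t,t}E[(X_{t+1} - P_{X_t,\ldots,X_2}(X_{t+1}))(X_1 - P_{X_t,\ldots,X_2}(X_1))]$$
$$= r(t) + \phi_{t,t}^2 r(t) - 2\phi_{t,t}\underbrace{E[X_{t+1}(X_1 - P_{X_t,\ldots,X_2}(X_1))]}_{= r(t)\phi_{t,t} \text{ by (5.14)}}$$
$$= r(t)[1 - \phi_{t,t}^2].$$
This gives the final part of the equation in Step 2 of the Levinson-Durbin algorithm.
Further references: Brockwell and Davis (1998), Chapter 5 and Fuller (1995), pages 82.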
5.3.2
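We now give an alternative proof which is based on properties of the (symmetric) Toeplitz matrix. We use (5.8), which is a matrix equation where
$$\Sigma_t \begin{pmatrix} \phi_{t,1} \\ \vdots \\ \phi_{t,t} \end{pmatrix} = r_t, \qquad (5.15)$$
with
$$\Sigma_t = \begin{pmatrix} c(0) & c(1) & c(2) & \ldots & c(t-1) \\ c(1) & c(0) & c(1) & \ldots & c(t-2) \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ c(t-1) & c(t-2) & \ldots & c(1) & c(0) \end{pmatrix} \quad \text{and} \quad r_t = \begin{pmatrix} c(1) \\ c(2) \\ \vdots \\ c(t) \end{pmatrix}.$$
The proof is based on embedding $r_{t-1}$ and $\Sigma_{t-1}$ into $\Sigma_t$ and using that $\Sigma_{t-1}\phi_{t-1} = r_{t-1}$, where $\phi_{t-1} = (\phi_{t-1,1}, \ldots, \phi_{t-1,t-1})'$.
To do this, we define the $(t-1) \times (t-1)$ matrix $E_{t-1}$ which basically swops round all the elements in a vector
$$E_{t-1} = \begin{pmatrix} 0 & 0 & \ldots & 0 & 1 \\ 0 & 0 & \ldots & 1 & 0 \\ \vdots & & & & \vdots \\ 1 & 0 & \ldots & 0 & 0 \end{pmatrix}$$
(recall we came across this swopping matrix in Section 3.2.2). Using the above notation, we have the interesting block matrix structure
$$\Sigma_t = \begin{pmatrix} \Sigma_{t-1} & E_{t-1}r_{t-1} \\ r_{t-1}'E_{t-1} & c(0) \end{pmatrix} \quad \text{and} \quad r_t = (r_{t-1}', c(t))'.$$
Returning to the matrix equations in (5.15) and substituting the above into (5.15) we have
$$\Sigma_t \phi_t = r_t \ \Rightarrow\ \begin{pmatrix} \Sigma_{t-1} & E_{t-1}r_{t-1} \\ r_{t-1}'E_{t-1} & c(0) \end{pmatrix} \begin{pmatrix} \phi_{t-1,t} \\ \phi_{t,t} \end{pmatrix} = \begin{pmatrix} r_{t-1} \\ c(t) \end{pmatrix},$$
where $\phi_{t-1,t}' = (\phi_{t,1}, \ldots, \phi_{t,t-1})$. This leads to the two equations
$$\Sigma_{t-1}\phi_{t-1,t} + E_{t-1}r_{t-1}\phi_{t,t} = r_{t-1}, \qquad (5.16)$$
$$r_{t-1}'E_{t-1}\phi_{t-1,t} + c(0)\phi_{t,t} = c(t). \qquad (5.17)$$
We first show that equation (5.16) corresponds to the second equation in the Levinson-Durbin algorithm. Multiplying (5.16) by $\Sigma_{t-1}^{-1}$, and using that for a symmetric Toeplitz matrix $E_{t-1}\Sigma_{t-1}E_{t-1} = \Sigma_{t-1}$ (so that $\Sigma_{t-1}^{-1}E_{t-1} = E_{t-1}\Sigma_{t-1}^{-1}$), gives
$$\phi_{t-1,t} = \underbrace{\Sigma_{t-1}^{-1}r_{t-1}}_{=\phi_{t-1}} - \phi_{t,t}\underbrace{\Sigma_{t-1}^{-1}E_{t-1}r_{t-1}}_{=E_{t-1}\phi_{t-1}}.$$
Thus we have
$$\phi_{t-1,t} = \phi_{t-1} - \phi_{t,t}E_{t-1}\phi_{t-1}. \qquad (5.18)$$
This proves the second equation in Step 2 of the Levinson-Durbin algorithm. Substituting (5.18) into (5.17) gives
$$r_{t-1}'E_{t-1}\big(\phi_{t-1} - \phi_{t,t}E_{t-1}\phi_{t-1}\big) + c(0)\phi_{t,t} = c(t). \qquad (5.19)$$
Since $E_{t-1}E_{t-1} = I$, solving for $\phi_{t,t}$ gives
$$\phi_{t,t} = \frac{c(t) - r_{t-1}'E_{t-1}\phi_{t-1}}{c(0) - r_{t-1}'\phi_{t-1}}. \qquad (5.20)$$
Noting that $r(t) = c(0) - r_{t-1}'\phi_{t-1}$ and $r_{t-1}'E_{t-1}\phi_{t-1} = \sum_{j=1}^{t-1}\phi_{t-1,j}c(t-j)$, (5.20) is the first equation of Step 2 in the Levinson-Durbin algorithm.
Note from this proof it does not appear that we need that the (symmetric) Toeplitz matrix is positive semidefinite.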
5.3.3
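Using the Durbin-Levinson algorithm to obtain the Cholesky decomposition of the precision matrix
We recall from Section 3.2.1 that sequentially projecting the elements of a random vector on the past elements in the vector gives rise to the Cholesky decomposition of the inverse of the variance/covariance (precision) matrix. This is exactly what is done in the Durbin-Levinson algorithm. In other words,
$$\mathrm{var}\begin{pmatrix} X_1/\sqrt{r(1)} \\ (X_2 - \phi_{1,1}X_1)/\sqrt{r(2)} \\ \vdots \\ \big(X_n - \sum_{j=1}^{n-1}\phi_{n-1,j}X_{n-j}\big)/\sqrt{r(n)} \end{pmatrix} = I_n.$$
Therefore, if $\Sigma_n = \mathrm{var}[\underline{X}_n]$, where $\underline{X}_n = (X_1, \ldots, X_n)'$, then $\Sigma_n^{-1} = L_n' D_n L_n$, where
$$L_n = \begin{pmatrix} 1 & 0 & \ldots & \ldots & 0 \\ -\phi_{1,1} & 1 & 0 & \ldots & 0 \\ -\phi_{2,2} & -\phi_{2,1} & 1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ -\phi_{n-1,n-1} & -\phi_{n-1,n-2} & \ldots & -\phi_{n-1,1} & 1 \end{pmatrix} \qquad (5.21)$$
and $D_n = \mathrm{diag}(r(1)^{-1}, r(2)^{-1}, \ldots, r(n)^{-1})$, with $r(1) = c(0)$.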
5.4
Given the autocovariance of any stationary process the Levinson-Durbin algorithm allows us to systematically obtain one-step predictors of second order stationary time series without directly inverting a matrix.
In this section we consider forecasting for a special case of stationary processes, the ARMA process. We will assume throughout this section that the parameters of the model are known.
We showed in Section 5.1 that if $\{X_t\}$ has an AR($p$) representation and $t > p$, then the best linear predictor can easily be obtained using (5.4). Therefore, when $t > p$, there is no real gain in using the Levinson-Durbin algorithm for prediction of AR($p$) processes. However, we do show in Chapter ?? that we can apply the Levinson-Durbin algorithm for obtaining estimators of the autoregressive parameters.
Similarly if $\{X_t\}$ satisfies an ARMA($p, q$) representation, then the prediction scheme can be greatly simplified. Unlike the AR($p$) process, which is $p$-Markovian, $P_{X_t,X_{t-1},\ldots,X_1}(X_{t+1})$ does involve all regressors $X_t, \ldots, X_1$. However, some simplifications can be made in the scheme. To see this, suppose the ARMA process satisfies
$$X_t - \sum_{j=1}^{p} \phi_j X_{t-j} = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i},$$
where $\{\varepsilon_t\}$ are iid zero mean random variables and the roots of $\phi(z)$ and $\theta(z)$ lie outside the unit circle. For the analysis below, let $W_t = X_t$ for $1 \le t \le p$ and for $t > \max(p, q)$ let $W_t = \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i}$ (which is the MA($q$) part of the process). Since $X_{p+1} = \sum_{j=1}^{p} \phi_j X_{p+1-j} + W_{p+1}$ and so forth it is clear that $\mathrm{sp}(X_1, \ldots, X_t) = \mathrm{sp}(W_1, \ldots, W_t)$ (ie. they are linear combinations of each other). We will show for $t > \max(p, q)$ that
$$X_{t+1|t} = P_{X_t,\ldots,X_1}(X_{t+1}) = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_{t,i}(X_{t+1-i} - X_{t+1-i|t-i}), \qquad (5.22)$$
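for some $\theta_{t,i}$ which can be evaluated from the autocovariance structure. To prove the result we use the following steps:
$$P_{X_t,\ldots,X_1}(X_{t+1}) = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i P_{X_t,\ldots,X_1}(\varepsilon_{t+1-i})$$
$$= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i P_{X_t - X_{t|t-1},\ldots,X_2 - X_{2|1},X_1}(\varepsilon_{t+1-i})$$
$$= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i P_{W_t - W_{t|t-1},\ldots,W_2 - W_{2|1},W_1}(\varepsilon_{t+1-i})$$
$$= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i P_{W_{t+1-i} - W_{t+1-i|t-i},\ldots,W_t - W_{t|t-1}}(\varepsilon_{t+1-i})$$
$$= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i \sum_{s=0}^{i-1} P_{W_{t+1-i+s} - W_{t+1-i+s|t-i+s}}(\varepsilon_{t+1-i})$$
$$= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_{t,i}\underbrace{(W_{t+1-i} - W_{t+1-i|t-i})}_{= X_{t+1-i} - X_{t+1-i|t-i}}$$
$$= \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_{t,i}(X_{t+1-i} - X_{t+1-i|t-i}), \qquad (5.23)$$
this gives the desired result. Thus given the parameters $\{\theta_{t,i}\}$ it is straightforward to construct the predictor $X_{t+1|t}$. It can be shown that $\theta_{t,i} \to \theta_i$ as $t \to \infty$ (see Brockwell and Davis (1998), Chapter 5).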
Remark 5.4.1 In terms of notation we can understand the above result for the MA($q$) case. In this case
$$\widehat{X}_{t+1|t} = \sum_{i=1}^{q} \theta_{t,i}\big(X_{t+1-i} - \widehat{X}_{t+1-i|t-i}\big).$$
(Since for symmetric matrices the spectral norm and the largest eigenvalue are the same, then
1
k1
t kspec ).
Remark 5.4.2 Suppose $\{X_t\}$ is an ARMA process, where the roots of $\phi(z)$ and $\theta(z)$ have absolute value greater than $1 + \delta_1$ and less than $\delta_2$, then the spectral density $f(\omega)$ is bounded by
(1 1 )2p
var(t ) (1(
2
1
)2p
1+1
f () var(t )
1
)2p
(1( 1+
1
(1 1 )2p
and max (1
t ) is bounded uniformly over t.
The prediction can be simplified if we make a simple approximation (which works well if $t$ is relatively large). For $1 \le t \le \max(p, q)$, set $\widehat{X}_{t+1|t} = X_t$ and for $t > \max(p, q)$ we define the recursion
$$\widehat{X}_{t+1|t} = \sum_{j=1}^{p} \phi_j X_{t+1-j} + \sum_{i=1}^{q} \theta_i\big(X_{t+1-i} - \widehat{X}_{t+1-i|t-i}\big). \qquad (5.24)$$
This approximation seems reasonable, since in the exact predictor (5.23), $\theta_{t,i} \to \theta_i$.
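The approximate recursion (5.24) is simple to implement. The Python sketch below assumes the ARMA parameters are known and follows the initialisation stated above ($\widehat{X}_{t+1|t} = X_t$ for $t \le \max(p,q)$); the function name and the 1-based internal indexing are choices made for this example.

```python
def arma_approx_predictors(x, phi, theta):
    """Approximate one-step predictors from recursion (5.24).

    x     = [X_1, ..., X_n] observations
    phi   = [phi_1, ..., phi_p], theta = [theta_1, ..., theta_q]
    Returns a list whose entry t-1 is the approximation of X_{t+1|t}, t = 1..n.
    """
    p, q = len(phi), len(theta)
    m = max(p, q)
    n = len(x)
    X = [None] + list(x)      # 1-based: X[t] = X_t
    pred = [None] * (n + 1)   # pred[t] approximates X_{t+1|t}
    for t in range(1, n + 1):
        if t <= m:
            pred[t] = X[t]    # initial values
        else:
            ar = sum(phi[j - 1] * X[t + 1 - j] for j in range(1, p + 1))
            # X_{t+1-i} minus its own one-step predictor pred[t-i]
            ma = sum(theta[i - 1] * (X[t + 1 - i] - pred[t - i])
                     for i in range(1, q + 1))
            pred[t] = ar + ma
    return pred[1:]
```

For an MA(1) with $\theta_1 = 0.5$ and observations 1, 2, 3, the predictors are 1, $0.5(2-1) = 0.5$ and $0.5(3-0.5) = 1.25$.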
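In the following proposition we show that the best linear predictor of $X_{t+1}$ given $X_1, \ldots, X_t$, $X_{t+1|t}$, the approximating predictor $\widehat{X}_{t+1|t}$ and the best linear predictor given the infinite past, $X_t(1)$, are asymptotically equivalent. To do this we obtain expressions for $X_t(1)$ and $\widehat{X}_{t+1|t}$:
$$X_t(1) = \sum_{j=1}^{\infty} b_j X_{t+1-j} \qquad \Big(\text{since } X_{t+1} = \sum_{j=1}^{\infty} b_j X_{t+1-j} + \varepsilon_{t+1}\Big).$$
Furthermore, by iterating (5.24) backwards we can show that
$$\widehat{X}_{t+1|t} = \sum_{j=1}^{t-\max(p,q)} b_j X_{t+1-j} + \sum_{j=1}^{\max(p,q)} \gamma_j X_j, \qquad (5.25)$$
where $|\gamma_j| \le C\rho^t$, with $1/(1+\delta) < \rho < 1$ and the roots of $\theta(z)$ are outside $(1+\delta)$. We give a proof in the remark below.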
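Remark 5.4.3 We prove (5.25) for the ARMA(1, 2). We first recall that the AR(1) part in the ARMA(1, 2) model does not play any role since $\mathrm{sp}(X_1, X_2, \ldots, X_t) = \mathrm{sp}(W_1, W_2, \ldots, W_t)$, where $W_1 = X_1$ and for $t \ge 2$ we define the corresponding MA(2) process $W_t = \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \varepsilon_t$. The corresponding approximating predictor is defined as $\widehat{W}_{2|1} = W_1$, $\widehat{W}_{3|2} = W_2$ and for $t > 3$
$$\widehat{W}_{t|t-1} = \theta_1[W_{t-1} - \widehat{W}_{t-1|t-2}] + \theta_2[W_{t-2} - \widehat{W}_{t-2|t-3}].$$
Using this and rearranging (5.24) gives
$$\underbrace{\widehat{X}_{t+1|t} - \phi_1 X_t}_{\widehat{W}_{t+1|t}} = \theta_1\underbrace{[X_t - \widehat{X}_{t|t-1}]}_{=(W_t - \widehat{W}_{t|t-1})} + \theta_2\underbrace{[X_{t-1} - \widehat{X}_{t-1|t-2}]}_{=(W_{t-1} - \widehat{W}_{t-1|t-2})}.$$
Subtracting both sides from $W_{t+1}$ and writing the result as a vector recursion gives
$$\underbrace{\begin{pmatrix} W_{t+1} - \widehat{W}_{t+1|t} \\ W_t - \widehat{W}_{t|t-1} \end{pmatrix}}_{=\underline{\widehat{\varepsilon}}_{t+1}} = -\underbrace{\begin{pmatrix} \theta_1 & \theta_2 \\ -1 & 0 \end{pmatrix}}_{=Q}\underbrace{\begin{pmatrix} W_t - \widehat{W}_{t|t-1} \\ W_{t-1} - \widehat{W}_{t-1|t-2} \end{pmatrix}}_{=\underline{\widehat{\varepsilon}}_t} + \underbrace{\begin{pmatrix} W_{t+1} \\ 0 \end{pmatrix}}_{\underline{W}_{t+1}}. \qquad (5.26)$$
The true innovations satisfy the same matrix difference equation, since $W_{t+1} = \varepsilon_{t+1} + \theta_1\varepsilon_t + \theta_2\varepsilon_{t-1}$:
$$\underbrace{\begin{pmatrix} \varepsilon_{t+1} \\ \varepsilon_t \end{pmatrix}}_{=\underline{\varepsilon}_{t+1}} = -\underbrace{\begin{pmatrix} \theta_1 & \theta_2 \\ -1 & 0 \end{pmatrix}}_{=Q}\underbrace{\begin{pmatrix} \varepsilon_t \\ \varepsilon_{t-1} \end{pmatrix}}_{=\underline{\varepsilon}_t} + \underbrace{\begin{pmatrix} W_{t+1} \\ 0 \end{pmatrix}}_{\underline{W}_{t+1}}. \qquad (5.27)$$
Thus iterating backwards we can write
$$\varepsilon_{t+1} = \sum_{j=0}^{\infty} (-1)^j [Q^j]_{(1,1)} W_{t+1-j} = \sum_{j=0}^{\infty} \tilde{b}_j W_{t+1-j},$$
where $\tilde{b}_j = (-1)^j [Q^j]_{(1,1)}$ (noting that $\tilde{b}_0 = 1$) and $[Q^j]_{(1,1)}$ denotes the $(1,1)$th element of the matrix $Q^j$ (note we did something similar in Section 2.4.1). Furthermore the same iteration shows that
$$\varepsilon_{t+1} = \sum_{j=0}^{t-3} (-1)^j [Q^j]_{(1,1)} W_{t+1-j} + (-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1 = \sum_{j=0}^{t-3} \tilde{b}_j W_{t+1-j} + (-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1. \qquad (5.28)$$
Therefore, by comparison we see that
$$(-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1 = \varepsilon_{t+1} - \sum_{j=0}^{t-3} \tilde{b}_j W_{t+1-j} = \sum_{j=t-2}^{\infty} \tilde{b}_j W_{t+1-j}.$$
We now return to the approximating predictor in (5.26). Comparing (5.26) and (5.27) we see that they are almost the same difference equations. The only difference is the point at which the algorithm starts. $\underline{\varepsilon}_t$ goes all the way back to the start of time, whereas we have set the initial values $\widehat{W}_{2|1} = W_1$, $\widehat{W}_{3|2} = W_2$, thus $\underline{\widehat{\varepsilon}}_3' = (W_3 - W_2, W_2 - W_1)$. Therefore, by iterating both (5.26) and (5.27) backwards, focusing on the first element of the vector and using (5.28) we have
$$\varepsilon_{t+1} - \widehat{\varepsilon}_{t+1} = \underbrace{(-1)^{t-2}[Q^{t-2}\underline{\varepsilon}_3]_1}_{=\sum_{j=t-2}^{\infty} \tilde{b}_j W_{t+1-j}} - (-1)^{t-2}[Q^{t-2}\underline{\widehat{\varepsilon}}_3]_1.$$
We recall that $\varepsilon_{t+1} = W_{t+1} + \sum_{j=1}^{\infty} \tilde{b}_j W_{t+1-j}$ and that $\widehat{\varepsilon}_{t+1} = W_{t+1} - \widehat{W}_{t+1|t}$. Substituting this into the above gives
$$\widehat{W}_{t+1|t} + \sum_{j=1}^{\infty} \tilde{b}_j W_{t+1-j} = \sum_{j=t-2}^{\infty} \tilde{b}_j W_{t+1-j} - (-1)^{t-2}[Q^{t-2}\underline{\widehat{\varepsilon}}_3]_1.$$
Replacing $W_t$ with $X_t - \phi_1 X_{t-1}$ gives (5.25), where the $b_j$ can be easily deduced from $\tilde{b}_j$ and $\phi_1$.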
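Proposition 5.4.1 Suppose $\{X_t\}$ is an ARMA process where the roots of $\phi(z)$ and $\theta(z)$ are greater in absolute value than $1 + \delta$. Let $X_{t+1|t}$, $\widehat{X}_{t+1|t}$ and $X_t(1)$ be defined as in (5.23), (5.24) and (5.2) respectively. Then
$$E[X_{t+1|t} - \widehat{X}_{t+1|t}]^2 \le K\rho^t, \qquad (5.29)$$
$$E[\widehat{X}_{t+1|t} - X_t(1)]^2 \le K\rho^t, \qquad (5.30)$$
$$\big|E[X_{t+1} - X_{t+1|t}]^2 - \sigma^2\big| \le K\rho^t \qquad (5.31)$$
for any $\frac{1}{1+\delta} < \rho < 1$, where $\sigma^2 = \mathrm{var}(\varepsilon_t)$.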
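PROOF. The proof of (5.29) becomes clear when we use the expansion $X_{t+1} = \sum_{j=1}^{\infty} b_j X_{t+1-j} + \varepsilon_{t+1}$, where $|b_j| \le C\rho^j$. Evaluating the best linear predictor of $X_{t+1}$ given $X_t, \ldots, X_1$ using this expansion gives
$$X_{t+1|t} = \sum_{j=1}^{\infty} b_j P_{X_t,\ldots,X_1}(X_{t+1-j}) + \underbrace{P_{X_t,\ldots,X_1}(\varepsilon_{t+1})}_{=0}$$
$$= \underbrace{\sum_{j=1}^{t-\max(p,q)} b_j X_{t+1-j}}_{=\widehat{X}_{t+1|t} - \sum_{j=1}^{\max(p,q)} \gamma_j X_j} + \sum_{j=t-\max(p,q)+1}^{\infty} b_j P_{X_t,\ldots,X_1}(X_{t+1-j}).$$
Therefore by using (5.25) we see that the difference between the best linear predictor and $\widehat{X}_{t+1|t}$ is
$$X_{t+1|t} - \widehat{X}_{t+1|t} = \sum_{j=1-\max(p,q)}^{\infty} b_{t+j} P_{X_t,\ldots,X_1}(X_{1-j}) - \sum_{j=1}^{\max(p,q)} \gamma_j X_j = I + II.$$
By using (5.25), it is straightforward to show that the second term satisfies $E[II^2] = E[\sum_{j=1}^{\max(p,q)} \gamma_j X_j]^2 \le C\rho^t$, therefore what remains is to show that $E[I^2]$ attains a similar bound. Heuristically, this seems reasonable, since $|b_{t+j}| \le K\rho^{t+j}$; the main obstacle is to show that $E[P_{X_t,\ldots,X_1}(X_{1-j})]^2$ does not grow with $t$. To obtain a bound, we first obtain an expression for the projection. Basic results in linear regression show that
$$P_{X_t,\ldots,X_1}(X_{1-j}) = \beta_{j,t}'\underline{X}_t, \qquad (5.32)$$
where $\beta_{j,t} = \Sigma_t^{-1} r_{t,j}$, with $\beta_{j,t}' = (\beta_{1,j,t}, \ldots, \beta_{t,j,t})$, $\underline{X}_t' = (X_1, \ldots, X_t)$, $\Sigma_t = E(\underline{X}_t\underline{X}_t')$ and $r_{t,j} = E(\underline{X}_t X_{1-j})$. Substituting (5.32) into $I$ gives
$$I = \sum_{j=1-\max(p,q)}^{\infty} b_{t+j}\beta_{j,t}'\underline{X}_t = \Big(\sum_{j=1-\max(p,q)}^{\infty} b_{t+j} r_{t,j}'\Big)\Sigma_t^{-1}\underline{X}_t. \qquad (5.33)$$
Therefore the mean squared error of $I$ is
$$E[I^2] = \Big(\sum_{j=1-\max(p,q)}^{\infty} b_{t+j} r_{t,j}'\Big)\Sigma_t^{-1}\Big(\sum_{j=1-\max(p,q)}^{\infty} b_{t+j} r_{t,j}\Big).$$
To bound the above we use the Cauchy-Schwarz inequality ($|a'Bb| \le \|a\|_2\|Bb\|_2$), the spectral norm inequality ($\|a\|_2\|Bb\|_2 \le \|a\|_2\|B\|_{spec}\|b\|_2$) and Minkowski's inequality ($\|\sum_{j=1}^{n} a_j\|_2 \le \sum_{j=1}^{n}\|a_j\|_2$) to give
$$E[I^2] \le \Big\|\sum_{j} b_{t+j} r_{t,j}\Big\|_2^2\,\|\Sigma_t^{-1}\|_{spec} \le \Big(\sum_{j} |b_{t+j}| \cdot \|r_{t,j}\|_2\Big)^2\|\Sigma_t^{-1}\|_{spec}. \qquad (5.34)$$
We now bound each of the terms above. We note that for all $t$, using Remark 5.4.2, $\|\Sigma_t^{-1}\|_{spec} \le K$ (for some constant $K$). We now consider $r_{t,j}' = (E(X_1X_{1-j}), \ldots, E(X_tX_{1-j})) = (c(j), \ldots, c(t+j-1))$. By using (3.2) we have $|c(k)| \le C\rho^{|k|}$, therefore for $j \ge 1$
$$\|r_{t,j}\|_2 \le K\Big(\sum_{r=0}^{t-1} \rho^{2(j+r)}\Big)^{1/2} \le K\frac{\rho^j}{(1-\rho^2)^{1/2}},$$
while for the finitely many indices $1 - \max(p,q) \le j \le 0$, $\|r_{t,j}\|_2$ is bounded by a constant which does not depend on $t$. Substituting these bounds into (5.34) gives $E[I^2] \le K\rho^t$. Altogether the bounds for $I$ and $II$ give
$$E(X_{t+1|t} - \widehat{X}_{t+1|t})^2 \le K\rho^t,$$
thus proving (5.29).
To prove (5.30) we note that, by (5.25) and $X_t(1) = \sum_{j=1}^{\infty} b_j X_{t+1-j}$,
$$E[X_t(1) - \widehat{X}_{t+1|t}]^2 = E\Big[\sum_{j=t-\max(p,q)+1}^{\infty} b_j X_{t+1-j} - \sum_{j=1}^{\max(p,q)} \gamma_j X_j\Big]^2.$$
Using the above and that $|b_{t+j}| \le K\rho^{t+j}$, it is straightforward to prove the result.
Finally to prove (5.31), we note that by Minkowski's inequality we have
$$\big[E(X_{t+1} - X_{t+1|t})^2\big]^{1/2} \le \underbrace{\big[E(X_{t+1} - X_t(1))^2\big]^{1/2}}_{=\sigma} + \underbrace{\big[E(X_t(1) - \widehat{X}_{t+1|t})^2\big]^{1/2}}_{\le K\rho^{t/2} \text{ by (5.30)}} + \underbrace{\big[E(\widehat{X}_{t+1|t} - X_{t+1|t})^2\big]^{1/2}}_{\le K\rho^{t/2} \text{ by (5.29)}}.$$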
5.5
In this section we consider forecasting for nonlinear models. The forecasts we construct may not necessarily/formally be the best linear predictor, because the best linear predictor is based on minimising the mean squared error, which we recall from Chapter 4 requires the existence of the higher order moments. Instead our forecast will be the conditional expectation of $X_{t+1}$ given the past (note that we can think of it as the best linear predictor). Furthermore, with the exception of the ARCH model we will derive approximations of the conditional expectation/best linear predictor, analogous to the forecasting approximation for the ARMA model, $\widehat{X}_{t+1|t}$ (given in (5.24)).
5.5.1
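We recall that the ARCH($p$) process $\{X_t\}$ is defined by
$$X_t = \sigma_t Z_t, \qquad \sigma_t^2 = a_0 + \sum_{j=1}^{p} a_j X_{t-j}^2.$$
Since $Z_{t+1}$ has zero mean and is independent of the past, $E[X_{t+1} | X_t, X_{t-1}, \ldots] = \sigma_{t+1}E[Z_{t+1}] = 0$. In other words, past values of $X_t$ have no influence on the expected value of $X_{t+1}$. On the other hand, in Section 4.2.1 we showed that
$$E(X_{t+1}^2 | X_t, X_{t-1}, \ldots, X_{t-p+1}) = E(Z_{t+1}^2\sigma_{t+1}^2 | X_t, X_{t-1}, \ldots, X_{t-p+1}) = \sigma_{t+1}^2 E[Z_{t+1}^2] = \sigma_{t+1}^2 = a_0 + \sum_{j=1}^{p} a_j X_{t+1-j}^2,$$
thus $X_t$ has an influence on the conditional mean squared/variance. Therefore, if we let $X^2_{t+k|t}$ denote the conditional variance of $X_{t+k}$ given $X_t, \ldots, X_{t-p+1}$, it can be derived using the following recursion
$$X^2_{t+1|t} = a_0 + \sum_{j=1}^{p} a_j X_{t+1-j}^2,$$
$$X^2_{t+k|t} = a_0 + \sum_{j=k}^{p} a_j X_{t+k-j}^2 + \sum_{j=1}^{k-1} a_j X^2_{t+k-j|t}, \qquad 2 \le k \le p,$$
$$X^2_{t+k|t} = a_0 + \sum_{j=1}^{p} a_j X^2_{t+k-j|t}, \qquad k > p.$$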
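The ARCH recursion for the conditional variance has the same structure as the AR($p$) forecasting recursion: observed squares are used where available and earlier forecasts otherwise. A short Python sketch, assuming known parameters $a_0, a_1, \ldots, a_p$ with $p \ge 1$ (the function name is hypothetical):

```python
def arch_variance_forecasts(a0, a, x, n_ahead):
    """Forecasts X^2_{t+1|t}, ..., X^2_{t+n_ahead|t} for an ARCH(p) model.

    a0, a = intercept and [a_1, ..., a_p] (p >= 1 is assumed)
    x     = observations, most recent last (needs at least p values)
    """
    p = len(a)
    sq = [v * v for v in x[-p:]]  # X_{t-p+1}^2, ..., X_t^2
    out = []
    for _ in range(n_ahead):
        # a_j multiplies an observed square or an earlier forecast
        nxt = a0 + sum(a[j] * sq[-1 - j] for j in range(p))
        sq.append(nxt)
        out.append(nxt)
    return out
```

With $a_0 = 1$, $a_1 = 0.5$ and $X_t = 2$ the forecasts are $3, 2.5, 2.25, \ldots$, converging towards $a_0/(1-a_1) = 2$.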
5.5.2
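For the GARCH(1, 1) model, $X_t = \sigma_t Z_t$ with $\sigma_t^2 = a_0 + a_1 X_{t-1}^2 + b_1\sigma_{t-1}^2$, we have
$$E[X_{t+1}^2 | X_t, X_{t-1}, \ldots] = \sigma_{t+1}^2 = a_0 + a_1 X_t^2 + b_1\sigma_t^2 = \frac{a_0}{1-b_1} + a_1\sum_{j=0}^{\infty} b_1^j X_{t-j}^2. \qquad (5.35)$$
Of course, in reality we only observe the finite past $X_t, X_{t-1}, \ldots, X_1$. We can approximate $E[X_{t+1}^2 | X_t, X_{t-1}, \ldots, X_1]$ using the following recursion: set $\widehat{\sigma}^2_{1|0} = 0$, then for $t \ge 1$ let
$$\widehat{\sigma}^2_{t+1|t} = a_0 + a_1 X_t^2 + b_1\widehat{\sigma}^2_{t|t-1}$$
(noting that this is similar in spirit to the recursive approximate one-step ahead predictor defined in (5.24)). It is straightforward to show that
$$\widehat{\sigma}^2_{t+1|t} = \frac{a_0(1 - b_1^{t})}{1-b_1} + a_1\sum_{j=0}^{t-1} b_1^j X_{t-j}^2.$$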
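As a quick check of the GARCH(1, 1) recursion, the Python sketch below runs the recursion from the starting value $\widehat{\sigma}^2_{1|0} = 0$ and compares it with the expression obtained by unrolling it, $a_0(1-b_1^t)/(1-b_1) + a_1\sum_{j=0}^{t-1} b_1^j X_{t-j}^2$. The parameters are assumed known with $|b_1| < 1$, and the function names are illustrative only.

```python
def garch_variance_recursion(a0, a1, b1, x):
    """Return [sigma2_{2|1}, ..., sigma2_{t+1|t}] from the recursion,
    started at sigma2_{1|0} = 0; x = [X_1, ..., X_t]."""
    sigma2 = 0.0
    out = []
    for xt in x:
        sigma2 = a0 + a1 * xt * xt + b1 * sigma2
        out.append(sigma2)
    return out


def garch_variance_closed_form(a0, a1, b1, x):
    """Unrolled form of the same recursion for sigma2_{t+1|t}."""
    t = len(x)
    geometric = a0 * (1 - b1 ** t) / (1 - b1)
    weighted = a1 * sum(b1 ** j * x[t - 1 - j] ** 2 for j in range(t))
    return geometric + weighted
```

The two functions agree up to rounding error, which confirms that the recursion only differs from (5.35) through the truncation of the infinite past.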
Exercise 5.3 To answer this question you need to install the R package tseries using install.packages("tseries") and then remember to load it with library("tseries").
(i) You will find the Nasdaq data from 4th January 2010 - 15th October 2014 on my website.
(ii) By taking log differences fit a GARCH(1,1) model to the daily closing data (ignore the adjusted closing value) from 4th January 2010 - 30th September 2014 (use the function garch(x, order = c(1, 1)) to fit the GARCH(1, 1) model).
(iii) Using the fitted GARCH(1, 1) model, forecast the volatility $\sigma_t^2$ from October 1st-15th (noting that no trading is done during the weekends). Denote these forecasts as $\sigma^2_{t|0}$. Evaluate $\sum_{t=1}^{11} \sigma^2_{t|0}$ and $\sum_{t=1}^{11} X_t^2$.
5.5.3
t =
X
j=0
(b)
j1
Y
!
Xt1j
[Xtj Xtj1 ],
i=0
(5.36)
We will compare the above recursion to the recursion based on $\varepsilon_{t+1}$. Rearranging the bilinear equation gives
$$\varepsilon_{t+1} = -b\varepsilon_t X_t + \underbrace{(X_{t+1} - \phi_1 X_t)}_{= b\varepsilon_t X_t + \varepsilon_{t+1}}. \qquad (5.37)$$
We observe that (5.36) and (5.37) are almost the same difference equation, the only difference is
b10 . This gives the difference between the two equations as
that an initial value is set for X
bt+1t ] = (1)t bt X1
t+1 [Xt+1 X
t
Y
b10 ]
j + (1)t bt [X1 X
j=1
t
Y
j .
j=1
P
bt+1t
0 as t , then X
Xt (1) as t . We now show that if
Q
Q
a.s.
E[log t  < log b, then bt tj=1 j 0. Since bt tj=1 j is a product, it seems appropriate to
Thus if bt
Qt
j=1 j
a.s.
take logarithms to transform it into a sum. To ensure that it is positive, we take absolutes and
troots
t
Y
log bt
1X
log j 
t
j=1
{z
}

j=1
log b
t
Y
j=1
1/t
j 
1X
P
= log b +
log j  log b + E log 0  = .
t
1/t a.s.
j=1 j 
Qt
j=1
Qt
j=1 j 
5.6
Nonparametric prediction
In this section we briefly consider how prediction can be achieved in the nonparametric world. Let
us assume that {Xt } is a stationary time series. Our objective is to predict Xt+1 given the past.
However, we dont want to make any assumptions about the nature of {Xt }. Instead we want to
obtain a predictor of Xt+1 given Xt which minimises the means squared error, E[Xt+1 g(Xt )]2 . It
is well known that this is conditional expectation E[Xt+1 Xt ]. (since E[Xt+1 g(Xt )]2 = E[Xt+1
E(Xt+1 Xt )]2 + E[g(Xt ) E(Xt+1 Xt )]2 ). Therefore, one can estimate
E[Xt+1 Xt = x] = m(x)
146
xXt
t=1 Xt+1 K( b )
,
Pn1
xXt
t=1 K( b )
m
b n (x) =
where $K:\mathbb{R}\to\mathbb{R}$ is a kernel function (see Fan and Yao (2003), Chapter 5 and 6). Under some
regularity conditions it can be shown that $\hat{m}_n(x)$ is a consistent estimator of $m(x)$ and converges
to $m(x)$ in mean square (with the typical mean squared rate $O(b^4+(bn)^{-1})$). The advantage of
going the nonparametric route is that we have not imposed any form of structure on the process
(such as linear/(G)ARCH/Bilinear). Therefore, we do not run the risk of misspecifying the model.
A disadvantage is that nonparametric estimators tend to converge more slowly than parametric estimators
(in Chapter ?? we show that parametric estimators have $O(n^{-1/2})$ convergence which is faster than
the nonparametric rate $O(b^2+(bn)^{-1/2})$). Another possible disadvantage is that if we wanted to
include more past values in the predictor, i.e. $m(x_1,\ldots,x_d)=E[X_{t+1}|X_t=x_1,\ldots,X_{t-d+1}=x_d]$, then
the estimator will have an extremely poor rate of convergence (due to the curse of dimensionality).
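The kernel predictor is simple to compute directly. The following is a minimal sketch in R, assuming a Gaussian kernel and a hand-picked bandwidth; the function name `nw_predict`, the simulated AR(1) data and the value `b = 0.3` are illustrative choices and not part of the notes.

```r
# Nadaraya-Watson one-step-ahead predictor (illustrative sketch).
# x  : observed series X_1, ..., X_n
# x0 : point at which m(x0) = E[X_{t+1} | X_t = x0] is estimated
# b  : bandwidth (chosen by hand here; no data-driven selection)
nw_predict <- function(x, x0, b) {
  n <- length(x)
  past   <- x[1:(n - 1)]            # X_t,     t = 1, ..., n-1
  future <- x[2:n]                  # X_{t+1}, t = 1, ..., n-1
  w <- dnorm((x0 - past) / b)       # Gaussian kernel weights K((x0 - X_t)/b)
  sum(w * future) / sum(w)          # kernel-weighted average of X_{t+1}
}

set.seed(1)
# Simulated AR(1) with coefficient 0.7, so that m(x) = 0.7 x
x <- as.numeric(arima.sim(model = list(ar = 0.7), n = 2000))
m_hat <- nw_predict(x, x0 = 1, b = 0.3)
print(m_hat)
```

For this simulated AR(1) the true regression function is $m(x)=0.7x$, so the printed value should be in the vicinity of 0.7, up to the bias and variance of the kernel estimator.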
A possible solution to the problem is to assume some structure on the nonparametric model,
and define a semi-parametric time series model. We state some examples below:
(i) An additive structure of the type
$$X_t = \sum_{j=1}^{p}g_j(X_{t-j}) + \varepsilon_t.$$
(ii) A functional coefficient autoregressive structure of the type
$$X_t = \sum_{j=1}^{p}g_j(X_{t-d})X_{t-j} + \varepsilon_t.$$
(iii) A semi-parametric GARCH-type structure for the conditional variance
$$\sigma_t^2 = b\sigma_{t-1}^2 + m(X_{t-1}).$$
However, once a structure has been imposed, conditions need to be derived in order that the model
has a stationary solution (just as we did with the fully-parametric models).
5.7
Section 5.2.1 nicely leads to the Wold decomposition, which we now state and prove. The Wold
decomposition theorem states that any stationary process has something that appears close to
an MA($\infty$) representation (though it is not). We state the theorem below and use some of the
notation introduced in Section 5.2.1.
Theorem 5.7.1 Suppose that $\{X_t\}$ is a second order stationary time series with a finite variance
(we shall assume that it has mean zero, though this is not necessary). Then $X_t$ can be uniquely
expressed as
$$X_t = \sum_{j=0}^{\infty}\psi_j Z_{t-j} + V_t, \tag{5.38}$$
where $\{Z_t\}$ are uncorrelated random variables, with $\mathrm{var}(Z_t)=E(X_t-X_{t-1}(1))^2$ (noting that $X_{t-1}(1)$
is the best linear predictor of $X_t$ given $X_{t-1},X_{t-2},\ldots$) and $V_t\in X_{-\infty}=\bigcap_{n}X_n^{-\infty}$, where $X_n^{-\infty}$
is defined in (5.6).
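PROOF. First let us consider the one-step ahead prediction of $X_t$ given the infinite past, denoted
$X_{t-1}(1)$. Since $\{X_t\}$ is a second order stationary process it is clear that $X_{t-1}(1)=\sum_{j=1}^{\infty}b_jX_{t-j}$,
where the coefficients $\{b_j\}$ do not vary with $t$. For this reason $\{X_{t-1}(1)\}$ and $\{X_t-X_{t-1}(1)\}$ are
second order stationary random variables. Furthermore, since $X_t-X_{t-1}(1)$ is uncorrelated with
$X_s$ for any $s<t$, then $\{X_s-X_{s-1}(1); s\in\mathbb{Z}\}$ are uncorrelated random variables. Define $Z_s=X_s-X_{s-1}(1)$, and observe that $Z_s$ is the one-step ahead prediction error. We recall from Section 5.2.1
that $X_t\in\overline{\mathrm{sp}}((X_t-X_{t-1}(1)),(X_{t-1}-X_{t-2}(1)),\ldots)\oplus\overline{\mathrm{sp}}(X_{-\infty}) = \oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})\oplus\overline{\mathrm{sp}}(X_{-\infty})$. Since
the spaces $\oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$ and $\overline{\mathrm{sp}}(X_{-\infty})$ are orthogonal, we shall first project $X_t$ onto $\oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$;
due to orthogonality the difference between $X_t$ and its projection will be in $\overline{\mathrm{sp}}(X_{-\infty})$. This will
lead to the Wold decomposition.
First we consider the projection of $X_t$ onto the space $\oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})$, which is
$$\sum_{j=0}^{\infty}\psi_jZ_{t-j},$$
where due to orthogonality $\psi_j=\mathrm{cov}(X_t,(X_{t-j}-X_{t-j-1}(1)))/\mathrm{var}(X_{t-j}-X_{t-j-1}(1))$. Since $X_t\in\oplus_{j=0}^{\infty}\mathrm{sp}(Z_{t-j})\oplus\overline{\mathrm{sp}}(X_{-\infty})$, the difference $X_t-\sum_{j=0}^{\infty}\psi_jZ_{t-j}$ is orthogonal to $\{Z_t\}$ and belongs to $\overline{\mathrm{sp}}(X_{-\infty})$. Hence we have
$$X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j} + V_t,$$
where $V_t=X_t-\sum_{j=0}^{\infty}\psi_jZ_{t-j}$. To show
that the representation is unique we note that $Z_t,Z_{t-1},\ldots$ are an orthogonal basis of $\overline{\mathrm{sp}}(Z_t,Z_{t-1},\ldots)$,
which pretty much leads to uniqueness.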
Exercise 5.4 Consider the process $X_t=A\cos(Bt+U)$ where $A$, $B$ and $U$ are random variables
such that $A$, $B$ and $U$ are independent and $U$ is uniformly distributed on $(0,2\pi)$.
(i) Show that $X_t$ is second order stationary (actually it is stationary) and obtain its mean and
covariance function.
(ii) Show that the distribution of $A$ and $B$ can be chosen in such a way that $\{X_t\}$ has the same
covariance function as the MA(1) process $Y_t=\varepsilon_t+\phi\varepsilon_{t-1}$ (where $|\phi|<1$) (quite amazing).
(iii) Suppose $A$ and $B$ have the same distribution found in (ii).
(a) What is the best predictor of $X_{t+1}$ given $X_t,X_{t-1},\ldots$?
(b) What is the best linear predictor of $X_{t+1}$ given $X_t,X_{t-1},\ldots$?
It is worth noting that variants on the proof can be found in Brockwell and Davis (1998),
Section 5.7 and Fuller (1995), page 94.
Remark 5.7.1 Notice that the representation in (5.38) looks like an MA($\infty$) process. There is,
however, a significant difference. The random variables $\{Z_t\}$ of an MA($\infty$) process are iid random
variables and not just uncorrelated.
We recall that we have already come across the Wold decomposition of some time series. In
Section 3.3 we showed that a non-causal linear time series could be represented as a causal linear
time series with uncorrelated but dependent innovations. Another example is in Chapter 4, where
we explored ARCH/GARCH processes, which have an AR and ARMA type representation. Using this
representation we can represent ARCH and GARCH processes as the weighted sum of $\{(Z_t^2-1)\sigma_t^2\}$
which are uncorrelated random variables.
Remark 5.7.2 (Variation on the Wold decomposition) In many technical proofs involving
time series, we often use a result related to the Wold decomposition. More precisely, we often
decompose the time series in terms of an infinite sum of martingale differences. In particular,
we define the sigma-algebra $\mathcal{F}_t=\sigma(X_t,X_{t-1},\ldots)$, and suppose that $E(X_t|\mathcal{F}_{-\infty})=\mu$. Then by
telescoping we can formally write $X_t$ as
$$X_t-\mu = \sum_{j=0}^{\infty}Z_{t,j}$$
where $Z_{t,j}=E(X_t|\mathcal{F}_{t-j})-E(X_t|\mathcal{F}_{t-j-1})$. It is straightforward to see that $Z_{t,j}$ are martingale
differences, and under certain conditions (mixing, physical dependence, your favourite dependence
flavour etc) it can be shown that $\sum_{j=0}^{\infty}\|Z_{t,j}\|_p<\infty$ (where $\|\cdot\|_p$ is the $p$th moment norm). This means
the above representation holds almost surely. Thus in several proofs we can replace $X_t-\mu$ by
$\sum_{j=0}^{\infty}Z_{t,j}$. This decomposition allows us to use martingale theorems to prove results.
Chapter 6
Estimation of the mean and
covariance
Prerequisites
Some idea of what a cumulant is.
Objectives
To derive the sample autocovariance of a time series, and show that this is a positive definite
sequence.
To show that the variance of the sample covariance involves fourth order cumulants, which
can be unwieldy to estimate in practice. But under linearity the expression for the variance
greatly simplifies.
To show that under linearity the correlation does not involve the fourth order cumulant. This
is the Bartlett formula.
To use the above results to construct a test for uncorrelatedness of a time series (the Portmanteau test), and to understand how this test may be useful for testing for independence in
various different settings. Also understand situations where the test may fail.
6.1
6.1.1
We recall from Example 1.5.1 that we obtained an expression for the variance of the sample mean. We showed
that
$$\mathrm{var}(\bar{Y}_n) = \frac{1}{n}\mathrm{var}(X_0) + \frac{2}{n}\sum_{k=1}^{n}\left(\frac{n-k}{n}\right)c(k).$$
Furthermore, if $\sum_k|c(k)|<\infty$, then
$$\mathrm{var}(\bar{Y}_n) = \frac{1}{n}\mathrm{var}(X_0) + \frac{2}{n}\sum_{k=1}^{\infty}c(k) + o\left(\frac{1}{n}\right).$$
Thus if the time series has sufficient decay in its correlation structure a mean squared consistent
estimator of the mean can be achieved. However, one drawback is that the dependency
means that one observation will influence the next, and if the influence is positive (seen by a positive
covariance), the resulting estimator may have a (much) larger variance than the iid case.
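The size of this variance inflation can be checked numerically. The sketch below, which is an illustration rather than part of the notes, uses an AR(1) process with coefficient `phi = 0.6` and unit innovation variance (both arbitrary choices), for which $c(k)=\phi^{|k|}/(1-\phi^2)$, and compares the finite-sample formula with a Monte Carlo estimate and with the variance one would have under independence.

```r
set.seed(2)
phi  <- 0.6      # AR(1) coefficient (illustrative choice)
n    <- 200      # sample size
nrep <- 2000     # number of Monte Carlo replications

# Autocovariance of a causal AR(1) with unit innovation variance
ck <- function(k) phi^abs(k) / (1 - phi^2)

# Finite-sample formula: var(X_0)/n + (2/n) * sum_{k=1}^{n} ((n-k)/n) c(k)
k <- 1:n
var_formula <- ck(0) / n + (2 / n) * sum((n - k) / n * ck(k))

# Leading-order approximation (long run variance divided by n)
var_longrun <- (ck(0) + 2 * sum(ck(1:1000))) / n

# Variance of the mean if the observations were uncorrelated
var_iid <- ck(0) / n

# Monte Carlo estimate of var(mean) from simulated AR(1) paths
means  <- replicate(nrep, mean(arima.sim(model = list(ar = phi), n = n)))
var_mc <- var(means)

print(c(formula = var_formula, longrun = var_longrun,
        monte_carlo = var_mc, iid = var_iid))
```

With positive correlation the first three numbers should be close to one another and several times larger than the last.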
The above result does not require any more conditions on the process, besides second order
stationarity and summability of its covariance. However, to obtain confidence intervals we require
a stronger result, namely a central limit theorem for the sample mean. The above conditions are
not enough to give a central limit theorem. To obtain a CLT for sums of the form $\sum_{t=1}^{n}X_t$ we
need the following main ingredients:
(i) The variance needs to be finite.
(ii) The dependence between $X_t$ decreases the further apart in time the observations are. However,
this is more than just the correlation, it really means the dependence.
The above conditions are satisfied by linear time series, if the coefficients $\psi_j$ decay sufficiently fast.
However, these conditions can also be verified for nonlinear time series (for example the (G)ARCH
and Bilinear model described in Chapter 4).
We now state the asymptotic normality result for linear models.
Theorem 6.1.1 Suppose that $X_t$ is a linear time series, of the form $X_t=\sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\varepsilon_t$
are iid random variables with mean zero and variance one, $\sum_{j=-\infty}^{\infty}|\psi_j|<\infty$ and $\sum_{j=-\infty}^{\infty}\psi_j\neq 0$.
Let $Y_t=\mu+X_t$, then we have
$$\sqrt{n}\left(\bar{Y}_n-\mu\right)\overset{D}{\to}N(0,\sigma^2)$$
where $\sigma^2=\mathrm{var}(X_0)+2\sum_{k=1}^{\infty}c(k)$.
PROOF. Later in this course we will give precise details on how to prove asymptotic normality of
several different types of estimators in time series. However, we give a small flavour here by showing
asymptotic normality of $\bar{Y}_n$ in the special case that $\{X_t\}_{t=1}^{n}$ satisfy an MA($q$) model, then explain
how it can be extended to MA($\infty$) processes.
The main idea of the proof is to transform/approximate the average into a quantity that we
know is asymptotically normal. We know if $\{\varepsilon_t\}_{t=1}^{n}$ are iid random variables with mean $\mu$ and variance
one then
$$\sqrt{n}(\bar{\varepsilon}_n-\mu)\overset{D}{\to}N(0,1). \tag{6.1}$$
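We aim to use this result to prove the theorem. Returning to $\bar{Y}_n$, by a change of variables ($s=t-j$)
we can show that
$$\begin{aligned}
\frac{1}{n}\sum_{t=1}^{n}Y_t &= \mu+\frac{1}{n}\sum_{t=1}^{n}X_t = \mu+\frac{1}{n}\sum_{t=1}^{n}\sum_{j=0}^{q}\psi_j\varepsilon_{t-j}\\
&= \mu+\frac{1}{n}\left[\sum_{s=1}^{n-q}\varepsilon_s\Big(\sum_{j=0}^{q}\psi_j\Big)+\sum_{s=-q+1}^{0}\varepsilon_s\Big(\sum_{j=1-s}^{q}\psi_j\Big)+\sum_{s=n-q+1}^{n}\varepsilon_s\Big(\sum_{j=0}^{n-s}\psi_j\Big)\right]\\
&= \mu+\frac{n-q}{n}\Big(\sum_{j=0}^{q}\psi_j\Big)\frac{1}{n-q}\sum_{s=1}^{n-q}\varepsilon_s+\frac{1}{n}\sum_{s=-q+1}^{0}\varepsilon_s\Big(\sum_{j=1-s}^{q}\psi_j\Big)+\frac{1}{n}\sum_{s=n-q+1}^{n}\varepsilon_s\Big(\sum_{j=0}^{n-s}\psi_j\Big)\\
&:= \mu+\frac{(n-q)}{n}\Psi\bar{\varepsilon}_{n-q}+E_1+E_2,
\end{aligned} \tag{6.2}$$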
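where $\Psi=\sum_{j=0}^{q}\psi_j$. If $\Psi=0$, then only the boundary terms remain:
$$\frac{1}{n}\sum_{t=1}^{n}Y_t = \mu+\frac{1}{n}\sum_{s=-q+1}^{0}\varepsilon_s\sum_{j=1-s}^{q}\psi_j+\frac{1}{n}\sum_{s=n-q+1}^{n}\varepsilon_s\sum_{j=0}^{n-s}\psi_j.$$
This is a degenerate case, since $E_1$ and $E_2$ only consist of a finite number of terms and thus if $\varepsilon_t$ are
non-Gaussian these terms will never be asymptotically normal. Therefore, in this case we simply
have that $\frac{1}{n}\sum_{t=1}^{n}Y_t=\mu+O_p(\frac{1}{n})$ (this is why in the assumptions it was stated that $\Psi\neq 0$).
On the other hand, if $\Psi\neq 0$, then the dominating term in $\bar{Y}_n$ is $\Psi\bar{\varepsilon}_{n-q}$. From (6.1) it is
clear that $\sqrt{n-q}\,\bar{\varepsilon}_{n-q}\overset{D}{\to}N(0,1)$ as $n\to\infty$. However, for finite $q$, $(n-q)/n\to 1$, therefore
$\sqrt{n}\,\bar{\varepsilon}_{n-q}\overset{D}{\to}N(0,1)$. Altogether, substituting $E|E_1|\leq Cn^{-1}$ and $E|E_2|\leq Cn^{-1}$ into (6.2) gives
$$\sqrt{n}\left(\bar{Y}_n-\mu\right) = \Psi\sqrt{n}\,\bar{\varepsilon}_{n-q}+O_p\left(\frac{1}{\sqrt{n}}\right)\overset{D}{\to}N\left(0,\Psi^2\right).$$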
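For the MA($\infty$) case (with $\mu=0$ to simplify notation) we write
$$\sqrt{n}\,\bar{Y}_n = \frac{1}{\sqrt{n}}\sum_{t=1}^{n}\sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j} = \frac{1}{\sqrt{n}}\sum_{j=0}^{\infty}\psi_j\sum_{t=1}^{n}\varepsilon_t+R_n$$
where
$$\begin{aligned}
R_n &= \frac{1}{\sqrt{n}}\sum_{j=0}^{\infty}\psi_j\left(\sum_{s=1-j}^{n-j}\varepsilon_s-\sum_{s=1}^{n}\varepsilon_s\right)\\
&= \frac{1}{\sqrt{n}}\sum_{j=0}^{n}\psi_j\left(\sum_{s=1-j}^{0}\varepsilon_s-\sum_{s=n-j+1}^{n}\varepsilon_s\right)+\frac{1}{\sqrt{n}}\sum_{j=n+1}^{\infty}\psi_j\left(\sum_{s=1-j}^{n-j}\varepsilon_s-\sum_{s=1}^{n}\varepsilon_s\right)\\
&:= R_{n,1}+R_{n,2}+R_{n,3}+R_{n,4}.
\end{aligned}$$
We bound the second moment of $R_{n,1}$; since $\mathrm{var}(\varepsilon_t)=1$ and the sum $\sum_{s=1-j}^{0}\varepsilon_s$ contains $j$ terms,
$$\begin{aligned}
E[R_{n,1}^2] &= \frac{1}{n}\sum_{j_1,j_2=0}^{n}\psi_{j_1}\psi_{j_2}\,\mathrm{cov}\left(\sum_{s=1-j_1}^{0}\varepsilon_s,\sum_{s=1-j_2}^{0}\varepsilon_s\right) = \frac{1}{n}\sum_{j_1,j_2=0}^{n}\psi_{j_1}\psi_{j_2}\min[j_1,j_2]\\
&= \frac{1}{n}\sum_{j=0}^{n}j\psi_j^2+\frac{2}{n}\sum_{j_1=0}^{n}\psi_{j_1}\sum_{j_2=0}^{j_1-1}j_2\psi_{j_2}\\
&\leq \frac{1}{n}\sum_{j=0}^{n}j\psi_j^2+\frac{2}{n}\Big(\sum_{j=0}^{\infty}|\psi_j|\Big)\sum_{j_1=0}^{n}j_1|\psi_{j_1}|.
\end{aligned}$$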
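Since $\sum_{j=0}^{\infty}|\psi_j|<\infty$ and, thus, $\sum_{j=0}^{\infty}\psi_j^2<\infty$, then by dominated convergence
$\sum_{j=0}^{n}[1-j/n]\psi_j\to\sum_{j=0}^{\infty}\psi_j$ and $\sum_{j=0}^{n}[1-j/n]\psi_j^2\to\sum_{j=0}^{\infty}\psi_j^2$ as $n\to\infty$. This implies that
$\sum_{j=0}^{n}(j/n)\psi_j\to 0$ and $\sum_{j=0}^{n}(j/n)\psi_j^2\to 0$. Substituting this into the above bounds for $E[R_{n,1}^2]$ we immediately obtain
$E[R_{n,1}^2]=o(1)$. Using the same argument we obtain the same bound for $R_{n,2}$, $R_{n,3}$ and $R_{n,4}$. Thus
$$\sqrt{n}\,\bar{Y}_n = \Psi\frac{1}{\sqrt{n}}\sum_{t=1}^{n}\varepsilon_t+o_p(1),$$
where here $\Psi=\sum_{j=0}^{\infty}\psi_j$.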
Estimation of the so-called long run variance (given in Theorem 6.1.1) can be difficult. There
are various methods that can be used, such as estimating the spectral density function (which we
define in Chapter 8) at zero. An interesting approach advocated by Xiaofeng Shao is to use the
method of so-called self-normalization, which circumvents the need to estimate the long run variance,
see Shao (2010).
6.2
Suppose we observe $\{Y_t\}_{t=1}^{n}$; we can estimate the covariance $c(k)=\mathrm{cov}(Y_0,Y_k)$ from the observations. A plausible estimator is
$$\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}(Y_t-\bar{Y}_n)(Y_{t+|k|}-\bar{Y}_n), \tag{6.3}$$
since $E[(Y_t-\bar{Y}_n)(Y_{t+|k|}-\bar{Y}_n)]\approx c(k)$. Of course if the mean of $Y_t$ is known to be zero ($Y_t=X_t$),
then the covariance estimator is
$$\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}. \tag{6.4}$$
The eagle-eyed amongst you may wonder why we do not use $\frac{1}{n-|k|}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$, when $\hat{c}_n(k)$ is a
biased estimator, whereas $\frac{1}{n-|k|}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$ is not. However $\hat{c}_n(k)$ has some very nice properties. In particular, if we define the sample covariances as
$$\hat{c}_n(k) = \begin{cases}\frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|} & |k|\leq n-1\\ 0 & \text{otherwise}\end{cases}$$
then $\{\hat{c}_n(k)\}$ is a positive definite sequence. Therefore, using Lemma 1.6.1 there exists a stationary
time series $\{Z_t\}$ which has the covariance $\hat{c}_n(k)$.
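The difference between the two normalisations can be seen numerically. The sketch below is an illustration under stated assumptions: the helper `sample_acv`, the simulated Gaussian data and the sample size 20 are choices made here. It builds the Toeplitz matrices from both estimators, reports their smallest eigenvalues, and checks that the divisor-$n$ matrix equals $n^{-1}X_nX_n'$ for a matrix $X_n$ of shifted, zero-padded copies of the data.

```r
# Sample autocovariance at lag k, mean assumed known to be zero.
# unbiased = FALSE divides by n; unbiased = TRUE divides by n - |k|.
sample_acv <- function(x, k, unbiased = FALSE) {
  n <- length(x)
  k <- abs(k)
  if (k >= n) return(0)
  s <- sum(x[1:(n - k)] * x[(1 + k):n])
  if (unbiased) s / (n - k) else s / n
}

set.seed(3)
x <- rnorm(20)
n <- length(x)
lags <- 0:(n - 1)

# Toeplitz matrices built from the two competing estimators
Sigma_n  <- toeplitz(sapply(lags, function(k) sample_acv(x, k)))
Sigma_nk <- toeplitz(sapply(lags, function(k) sample_acv(x, k, unbiased = TRUE)))

min_eig_n  <- min(eigen(Sigma_n,  symmetric = TRUE, only.values = TRUE)$values)
min_eig_nk <- min(eigen(Sigma_nk, symmetric = TRUE, only.values = TRUE)$values)

# n x 2n matrix whose i-th row is the data shifted and padded with zeros
Xmat <- t(sapply(1:n, function(i) c(rep(0, n - i + 1), x, rep(0, i - 1))))
max_diff <- max(abs(tcrossprod(Xmat) / n - Sigma_n))

print(c(min_eig_divisor_n = min_eig_n,
        min_eig_divisor_n_minus_k = min_eig_nk,
        max_diff = max_diff))
```

The smallest eigenvalue for the divisor-$n$ estimator is guaranteed to be non-negative; for the divisor $n-|k|$ there is no such guarantee, and in practice it is often negative.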
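PROOF. There are various ways to show that $\{\hat{c}_n(k)\}$ is a positive definite sequence. One method
uses that the spectral density corresponding to this sequence is non-negative, we give this proof in
Section 8.3.1.
Here we give an alternative proof. We recall a sequence is positive definite if for any vector
$a=(a_1,\ldots,a_r)'$ we have
$$\sum_{k_1,k_2=1}^{r}a_{k_1}a_{k_2}\hat{c}_n(k_1-k_2)\geq 0.$$
For $r=n$ this quadratic form is
$$\sum_{k_1,k_2=1}^{n}a_{k_1}a_{k_2}\hat{c}_n(k_1-k_2) = a'\hat{\Sigma}_na$$
where
$$\hat{\Sigma}_n = \begin{pmatrix}
\hat{c}_n(0) & \hat{c}_n(1) & \hat{c}_n(2) & \ldots & \hat{c}_n(n-1)\\
\hat{c}_n(1) & \hat{c}_n(0) & \hat{c}_n(1) & \ldots & \hat{c}_n(n-2)\\
\vdots & \vdots & \vdots & \ddots & \vdots\\
\hat{c}_n(n-1) & \hat{c}_n(n-2) & \ldots & \ldots & \hat{c}_n(0)
\end{pmatrix},$$
noting that $\hat{c}_n(k)=\frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$. However, $\hat{c}_n(k)=\frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|}$ has a very interesting
construction, it can be shown that the above covariance matrix is $\hat{\Sigma}_n=\frac{1}{n}X_nX_n'$, where $X_n$ is an
$n\times 2n$ matrix with
$$X_n = \begin{pmatrix}
0 & 0 & \ldots & 0 & X_1 & X_2 & \ldots & X_{n-1} & X_n\\
0 & 0 & \ldots & X_1 & X_2 & \ldots & X_{n-1} & X_n & 0\\
\vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots & \vdots\\
X_1 & X_2 & \ldots & X_{n-1} & X_n & 0 & \ldots & \ldots & 0
\end{pmatrix}.$$
Hence $a'\hat{\Sigma}_na=\frac{1}{n}\|X_n'a\|^2\geq 0$.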
6.2.1
The main reason we construct an estimator is either for testing or constructing a confidence interval
for the parameter of interest. To do this we need the variance and distribution of the estimator. It
is usually impossible to derive the finite sample distribution, thus we look at the asymptotic distribution.
Besides showing asymptotic normality, it is important to derive an expression for the variance.
In an ideal world the variance will be simple and will not involve unknown parameters. Usually
in time series this will not be the case, and the variance will involve several (often an infinite
number of) parameters which are not straightforward to estimate. Later in this section we show
that the variance of the sample covariance can be extremely complicated. However, a substantial
simplification can arise if we consider only the sample correlation (not covariance) and assume linearity
of the time series. This result is known as Bartlett's formula (you may have come across Maurice
Bartlett before, besides his fundamental contributions in time series he is well known for proposing
the famous Bartlett correction). This example demonstrates how the assumption of linearity can
really simplify problems in time series analysis and also how we can circumvent certain problems
which arise by making slight modifications of the estimator (such as going from covariance to
correlation).
The following theorem gives the asymptotic sampling properties of the covariance estimator
(6.3). One proof of the result can be found in Brockwell and Davis (1998), Chapter 8, and Fuller
(1995), but it goes back to Bartlett (indeed it is called Bartlett's formula). We prove the result in
Section 6.2.2.
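Theorem 6.2.1 Suppose $\{X_t\}$ is a linear stationary time series where
$$X_t = \mu+\sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},$$
where $\sum_j|\psi_j|<\infty$, $\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t^4)<\infty$. Suppose we observe $\{X_t: t=1,\ldots,n\}$ and use (6.3) as an estimator of the covariance $c(k)=\mathrm{cov}(X_0,X_k)$. Define $\hat{\rho}_n(r)=\hat{c}_n(r)/\hat{c}_n(0)$ as the sample correlation. Then for each $h\in\{1,\ldots,n\}$
$$\sqrt{n}\left(\hat{\rho}_n(h)-\rho(h)\right)\overset{D}{\to}N(0,W_h) \tag{6.5}$$
where $\hat{\rho}_n(h)=(\hat{\rho}_n(1),\ldots,\hat{\rho}_n(h))$, $\rho(h)=(\rho(1),\ldots,\rho(h))$ and
$$(W_h)_{ij} = \sum_{k=-\infty}^{\infty}\Big[\rho(k+i)\rho(k+j)+\rho(k-i)\rho(k+j)+2\rho(i)\rho(j)\rho^2(k)-2\rho(i)\rho(k)\rho(k+j)-2\rho(j)\rho(k)\rho(k+i)\Big]. \tag{6.6}$$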
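To get a feel for the asymptotic covariance matrix $W_h$, it can be evaluated numerically for a given autocorrelation function. The sketch below assumes the standard form of Bartlett's formula as stated in Brockwell and Davis, $(W_h)_{ij}=\sum_k[\rho(k+i)\rho(k+j)+\rho(k-i)\rho(k+j)+2\rho(i)\rho(j)\rho^2(k)-2\rho(i)\rho(k)\rho(k+j)-2\rho(j)\rho(k)\rho(k+i)]$, truncates the infinite sum at $|k|\le K$, and uses hypothetical helper names (`bartlett_W`, `rho_iid`, `rho_ma1`).

```r
# Entry (i, j) of Bartlett's asymptotic covariance matrix,
# with the infinite sum truncated to k = -K, ..., K.
# rho: vectorised function returning the autocorrelation at integer lags.
bartlett_W <- function(rho, i, j, K = 500) {
  k <- -K:K
  sum(rho(k + i) * rho(k + j) + rho(k - i) * rho(k + j) +
        2 * rho(i) * rho(j) * rho(k)^2 -
        2 * rho(i) * rho(k) * rho(k + j) -
        2 * rho(j) * rho(k) * rho(k + i))
}

# Autocorrelation of iid noise: one at lag zero, zero elsewhere
rho_iid <- function(k) as.numeric(k == 0)

# Autocorrelation of an MA(1) with coefficient theta (illustrative value)
theta <- 0.5
rho1  <- theta / (1 + theta^2)
rho_ma1 <- function(k) ifelse(k == 0, 1, ifelse(abs(k) == 1, rho1, 0))

# W_3 for iid data and the (1, 1) entry for the MA(1)
W_iid   <- outer(1:3, 1:3, Vectorize(function(i, j) bartlett_W(rho_iid, i, j)))
w11_ma1 <- bartlett_W(rho_ma1, 1, 1)

print(W_iid)
print(c(w11_ma1 = w11_ma1, closed_form = 1 - 3 * rho1^2 + 4 * rho1^4))
```

For iid data the matrix reduces to the identity, and for the MA(1) the (1,1) entry agrees with the closed form $1-3\rho(1)^2+4\rho(1)^4$ obtained by evaluating the sum by hand.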
6.2.2
The variance of the covariance estimator (6.4) is
$$\mathrm{var}[\hat{c}_n(r)] = \frac{1}{n^2}\sum_{t,\tau=1}^{n-r}\mathrm{cov}(X_tX_{t+r},X_\tau X_{\tau+r}).$$
One approach for the analysis of $\mathrm{cov}(X_tX_{t+r},X_\tau X_{\tau+r})$ is to expand it in terms of expectations
$\mathrm{cov}(X_tX_{t+r},X_\tau X_{\tau+r})=E(X_tX_{t+r}X_\tau X_{\tau+r})-E(X_tX_{t+r})E(X_\tau X_{\tau+r})$, however it is not clear how this
will give $\mathrm{var}[\hat{c}_n(r)]=O(n^{-1})$. Instead we observe that $\mathrm{cov}(X_tX_{t+r},X_\tau X_{\tau+r})$ is the covariance
of the product of random variables. This belongs to the general class of cumulants of products of
random variables. We now use standard results on cumulants, which show that for zero mean random variables $\mathrm{cov}[XY,UV]=
\mathrm{cov}[X,U]\mathrm{cov}[Y,V]+\mathrm{cov}[X,V]\mathrm{cov}[Y,U]+\mathrm{cum}(X,Y,U,V)$ (note this result can be generalized to
higher order cumulants, see ?). Using this result we have
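$$\begin{aligned}
\mathrm{var}[\hat{c}_n(r)] &= \frac{1}{n^2}\sum_{t,\tau=1}^{n-r}\Big(\underbrace{\mathrm{cov}(X_t,X_\tau)}_{=c(t-\tau)\text{ by stationarity}}\mathrm{cov}(X_{t+r},X_{\tau+r})+\mathrm{cov}(X_t,X_{\tau+r})\mathrm{cov}(X_{t+r},X_\tau)+\mathrm{cum}(X_t,X_{t+r},X_\tau,X_{\tau+r})\Big)\\
&= \frac{1}{n^2}\sum_{t,\tau=1}^{n-r}\Big(c(t-\tau)^2+c(t-\tau-r)c(t+r-\tau)+\kappa_4(r,\tau-t,\tau+r-t)\Big)\\
&:= I+II+III,
\end{aligned}$$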
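where the above is due to strict stationarity of the time series. We analyse the above term by
term. Either (i) by changing variables and letting $k=t-\tau$ and thus changing the limits of the
summand in an appropriate way or (ii) observing that $\sum_{t,\tau=1}^{n-r}c(t-\tau)^2$ is the sum of the elements
in the Toeplitz matrix
$$\begin{pmatrix}
c(0)^2 & c(1)^2 & \ldots & c(n-r-1)^2\\
c(-1)^2 & c(0)^2 & \ldots & c(n-r-2)^2\\
\vdots & \vdots & \ddots & \vdots\\
c(-(n-r-1))^2 & c(-(n-r-2))^2 & \ldots & c(0)^2
\end{pmatrix},$$
the sum $I$ can be written as
$$I = \frac{1}{n^2}\sum_{t,\tau=1}^{n-r}c(t-\tau)^2 = \frac{1}{n^2}\sum_{k=-(n-r-1)}^{(n-r-1)}c(k)^2\sum_{t=1}^{n-r-|k|}1 = \frac{1}{n}\sum_{k=-(n-r)}^{n-r}\left(\frac{n-r-|k|}{n}\right)c(k)^2.$$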
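For all $k$, $\left(\frac{n-r-|k|}{n}\right)c(k)^2\to c(k)^2$ and $\left|\sum_{k=-(n-r)}^{n-r}\left(\frac{n-r-|k|}{n}\right)c(k)^2\right|\leq\sum_kc(k)^2$, thus by dominated
convergence (see Chapter A) $\sum_{k=-(n-r)}^{n-r}\left(\frac{n-r-|k|}{n}\right)c(k)^2\to\sum_{k=-\infty}^{\infty}c(k)^2$. This gives
$$I = \frac{1}{n}\sum_{k=-\infty}^{\infty}c(k)^2+o\left(\frac{1}{n}\right).$$
Using a similar argument we can show that
$$II = \frac{1}{n}\sum_{k=-\infty}^{\infty}c(k+r)c(k-r)+o\left(\frac{1}{n}\right).$$
For the third term, the same change of variables gives
$$III = \frac{1}{n}\sum_{k=-(n-r)}^{n-r}\left(\frac{n-r-|k|}{n}\right)\kappa_4(r,k,k+r).$$
To bound we note that for all $k$, $\left(\frac{n-r-|k|}{n}\right)\kappa_4(r,k,k+r)\to\kappa_4(r,k,k+r)$ and $\left|\sum_{k=-(n-r)}^{n-r}\left(\frac{n-r-|k|}{n}\right)\kappa_4(r,k,k+r)\right|\leq\sum_k|\kappa_4(r,k,k+r)|$, thus by dominated convergence we have $\sum_{k=-(n-r)}^{n-r}\left(\frac{n-r-|k|}{n}\right)\kappa_4(r,k,k+r)\to\sum_{k=-\infty}^{\infty}\kappa_4(r,k,k+r)$. This gives
$$III = \frac{1}{n}\sum_{k}\kappa_4(r,k,k+r)+o\left(\frac{1}{n}\right).$$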
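Therefore altogether we have
$$n\,\mathrm{var}[\hat{c}_n(r)] = \sum_{k=-\infty}^{\infty}c(k)^2+\sum_{k=-\infty}^{\infty}c(k+r)c(k-r)+\sum_{k=-\infty}^{\infty}\kappa_4(r,k,k+r)+o(1).$$
Using similar arguments we obtain
$$n\,\mathrm{cov}[\hat{c}_n(r_1),\hat{c}_n(r_2)] = \sum_{k=-\infty}^{\infty}c(k)c(k+r_1-r_2)+\sum_{k=-\infty}^{\infty}c(k-r_1)c(k+r_2)+\sum_{k=-\infty}^{\infty}\kappa_4(r_1,k,k+r_2)+o(1).$$
We observe that the covariance of the covariance estimator contains both covariance and cumulant
terms. Thus if we need to estimate them, for example to construct confidence intervals, this can be
extremely difficult. However, we show below that under linearity the above fourth order cumulant
term has a simpler form.
In what follows we label the three sums as
$$T_1 = \sum_{k=-\infty}^{\infty}c(k+r_1-r_2)c(k),\qquad T_2 = \sum_{k=-\infty}^{\infty}c(k-r_1)c(k+r_2),\qquad T_3 = \sum_{k=-\infty}^{\infty}\kappa_4(r_1,k,k+r_2).$$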
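We now show that under linearity, $T_3$ (the fourth order cumulant) has a much simpler form. Let
us suppose that the time series is linear
$$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$$
where $\sum_j|\psi_j|<\infty$, $\{\varepsilon_t\}$ are iid with $E(\varepsilon_t)=0$, $\mathrm{var}(\varepsilon_t)=1$ and $\kappa_4=\mathrm{cum}_4(\varepsilon_t)$. Then
$$\begin{aligned}
T_3 &= \sum_{k=-\infty}^{\infty}\mathrm{cum}\left(\sum_{j_1=-\infty}^{\infty}\psi_{j_1}\varepsilon_{-j_1},\sum_{j_2=-\infty}^{\infty}\psi_{j_2}\varepsilon_{r_1-j_2},\sum_{j_3=-\infty}^{\infty}\psi_{j_3}\varepsilon_{k-j_3},\sum_{j_4=-\infty}^{\infty}\psi_{j_4}\varepsilon_{k+r_2-j_4}\right)\\
&= \sum_{k=-\infty}^{\infty}\sum_{j_1,\ldots,j_4=-\infty}^{\infty}\psi_{j_1}\psi_{j_2}\psi_{j_3}\psi_{j_4}\,\mathrm{cum}\left(\varepsilon_{-j_1},\varepsilon_{r_1-j_2},\varepsilon_{k-j_3},\varepsilon_{k+r_2-j_4}\right).
\end{aligned}$$
Standard results in cumulants (which can be proved using the characteristic function), show that
$\mathrm{cum}[Y_1,Y_2,\ldots,Y_n]=0$, if any of these variables is independent of all the others. Applying this
result to $\mathrm{cum}(\varepsilon_{-j_1},\varepsilon_{r_1-j_2},\varepsilon_{k-j_3},\varepsilon_{k+r_2-j_4})$ reduces $T_3$ to
$$T_3 = \kappa_4\sum_{k=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+r_1}\psi_{j+k}\psi_{j+k+r_2}.$$
Using a change of variables $j_1=j$ and $j_2=j+k$ we have
$$T_3 = \kappa_4\left(\sum_{j_1=-\infty}^{\infty}\psi_{j_1}\psi_{j_1+r_1}\right)\left(\sum_{j_2=-\infty}^{\infty}\psi_{j_2}\psi_{j_2+r_2}\right) = \kappa_4c(r_1)c(r_2),$$
recalling that $c(r)=\sum_j\psi_j\psi_{j+r}$. Altogether this gives
$$n\,\mathrm{cov}[\hat{c}_n(r_1),\hat{c}_n(r_2)] = \sum_{k=-\infty}^{\infty}c(k)c(k+r_1-r_2)+\sum_{k=-\infty}^{\infty}c(k-r_1)c(k+r_2)+\kappa_4c(r_1)c(r_2)+o(1).$$
Thus in the case of linearity our expression for the variance is simpler, and the only difficult
parameter to estimate is $\kappa_4$.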
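We now consider the sample correlation
$$\hat{\rho}_n(r) = \frac{\hat{c}_n(r)}{\hat{c}_n(0)}.$$
By a Taylor expansion of $\hat{c}_n(0)^{-1}$ about $c(0)^{-1}$ we have
$$\begin{aligned}
\hat{\rho}_n(r)-\rho(r) &= \frac{\hat{c}_n(r)}{\hat{c}_n(0)}-\frac{c(r)}{c(0)}\\
&= \frac{[\hat{c}_n(r)-c(r)]}{c(0)}-[\hat{c}_n(0)-c(0)]\frac{\hat{c}_n(r)}{c(0)^2}+\underbrace{[\hat{c}_n(0)-c(0)]^2\frac{\hat{c}_n(r)}{\bar{c}_n(0)^3}}_{=O_p(n^{-1})}\\
&= \frac{[\hat{c}_n(r)-c(r)]}{c(0)}-[\hat{c}_n(0)-c(0)]\frac{c(r)}{c(0)^2}+\underbrace{[\hat{c}_n(0)-c(0)]^2\frac{\hat{c}_n(r)}{\bar{c}_n(0)^3}-[\hat{c}_n(0)-c(0)]\frac{[\hat{c}_n(r)-c(r)]}{c(0)^2}}_{O_p(n^{-1})}\\
&:= A_n+O_p\left(\frac{1}{n}\right),
\end{aligned}$$
where $\bar{c}_n(0)$ lies between $\hat{c}_n(0)$ and $c(0)$. We observe that the last two terms of the above are
of order $O_p(n^{-1})$ (by (6.7) and that $c(0)$ is bounded away from zero) and the dominating term is
$A_n$ which is of order $O_p(n^{-1/2})$ (again by (6.7)). Thus the limiting distribution of $\hat{\rho}_n(r)-\rho(r)$ is
determined by $A_n$ and the variance of the limiting distribution is also determined by $A_n$. It is
straightforward to show that
$$n\,\mathrm{var}[A_n] = n\frac{\mathrm{var}[\hat{c}_n(r)]}{c(0)^2}-2n\,\mathrm{cov}[\hat{c}_n(r),\hat{c}_n(0)]\frac{c(r)}{c(0)^3}+n\,\mathrm{var}[\hat{c}_n(0)]\frac{c(r)^2}{c(0)^4}. \tag{6.8}$$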
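Under linearity, the expressions for the covariance of the sample covariances give
$$n\,\mathrm{var}\begin{pmatrix}\hat{c}_n(r)\\ \hat{c}_n(0)\end{pmatrix} = \begin{pmatrix}
\sum_{k}c(k)^2+\sum_{k}c(k+r)c(k-r)+\kappa_4c(r)^2 & 2\sum_{k}c(k)c(k-r)+\kappa_4c(r)c(0)\\
2\sum_{k}c(k)c(k-r)+\kappa_4c(r)c(0) & 2\sum_{k}c(k)^2+\kappa_4c(0)^2
\end{pmatrix}+o(1).$$
Substituting the above into (6.8) gives us
$$\begin{aligned}
n\,\mathrm{var}[A_n] &= \left(\sum_{k}c(k)^2+\sum_{k}c(k+r)c(k-r)+\kappa_4c(r)^2\right)\frac{1}{c(0)^2}\\
&\quad-2\left(2\sum_{k}c(k)c(k-r)+\kappa_4c(r)c(0)\right)\frac{c(r)}{c(0)^3}\\
&\quad+\left(2\sum_{k}c(k)^2+\kappa_4c(0)^2\right)\frac{c(r)^2}{c(0)^4}+o(1).
\end{aligned}$$
Focusing on the fourth order cumulant terms, we see that these cancel, which gives the result.
To prove Theorem 6.2.1, we simply use Lemma 6.2.2 to obtain an asymptotic expression
for the variance, then we use $A_n$ to show asymptotic normality of $\hat{c}_n(r)$ (under linearity).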
Exercise 6.1 Under the assumption that $\{X_t\}$ are iid random variables show that $\hat{c}_n(1)$ is asymptotically normal.
Hint: Let $m=n/(b+1)$ and partition the sum $\sum_{t=1}^{n-1}X_tX_{t+1}$ as follows
$$\begin{aligned}
\sum_{t=1}^{n-1}X_tX_{t+1} &= \sum_{t=1}^{b}X_tX_{t+1}+X_{b+1}X_{b+2}+\sum_{t=(b+1)+1}^{(b+1)+b}X_tX_{t+1}+\ldots\\
&\quad+X_{(m-1)(b+1)}X_{(m-1)(b+1)+1}+\sum_{t=(m-1)(b+1)+1}^{(m-1)(b+1)+b}X_tX_{t+1}\\
&= \sum_{j=0}^{m-1}U_{b,j}+\sum_{j=0}^{m-2}X_{(j+1)(b+1)}X_{(j+1)(b+1)+1}
\end{aligned}$$
where $U_{b,j}=\sum_{t=j(b+1)+1}^{j(b+1)+b}X_tX_{t+1}$. Show that the second term in the above summand is asymptotically
negligible and show that the classical CLT for iid random variables can be applied to the first term.
Exercise 6.2 Under the assumption that $\{X_t\}$ is a MA(1) process, show that $\hat{c}_n(1)$ is asymptotically normal.
Exercise 6.3 The block bootstrap scheme is a commonly used method for estimating the finite
sample distribution of a statistic (which includes its variance). The aim in this exercise is to see
how well the bootstrap variance approximates the finite sample variance of a statistic.
(i) In R write a function to calculate the autocovariance $\hat{c}_n(1)=\frac{1}{n}\sum_{t=1}^{n-1}X_tX_{t+1}$.
$$\frac{\frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)I_n(\omega_k)}{\frac{1}{n}\sum_{k=1}^{n}I_n(\omega_k)},$$
where $I_n(\omega)$ is the periodogram, and $\{X_t\}$ is a linear time series, then we will show later that the
asymptotic distribution of the above has a variance which is only in terms of the covariances not
higher order cumulants. We prove this result in Section 9.5.
6.3
Bartlett's formula is commonly used to check by eye whether a time series is uncorrelated (there
are more sensitive tests, but this one is often used to construct CIs for the sample autocovariances
in several statistical packages). This is an important problem, for many reasons:
Given a data set, we need to check whether there is dependence; if there is, we need to analyse
it in a different way.
Suppose we fit a linear regression to time series data. We may want to check whether the residuals
are actually uncorrelated, else the standard errors based on the assumption of uncorrelatedness would be unreliable.
We need to check whether a time series model is the appropriate model. To do this we fit
the model and estimate the residuals. If the residuals appear to be uncorrelated it would
seem likely that the model is correct. If they are correlated, then the model is inappropriate.
For example, we may fit an AR(1) to the data and estimate the residuals $\hat{\varepsilon}_t$; if there is still
correlation in the residuals, then the AR(1) was not the correct model, since $X_t-\hat{\phi}X_{t-1}$ is
still correlated (which it would not be, if it were the correct model).
Figure 6.1: The sample ACF of an iid sample with error bars (sample size n = 200).
We now apply Theorem 6.2.1 to the case that the time series are iid random variables. Suppose $\{X_t\}$
are iid random variables, then it is clear that it is a trivial example of a (not necessarily Gaussian)
linear process. We use (6.3) as an estimator of the autocovariances.
To derive the asymptotic variance of $\{\hat{\rho}_n(r)\}$, we recall that if $\{X_t\}$ are iid then $\rho(k)=0$ for
$k\neq 0$. Substituting this into (6.6) we see that
$$\sqrt{n}\left(\hat{\rho}_n(h)-\rho(h)\right)\overset{D}{\to}N(0,W_h),$$
where
$$(W_h)_{ij} = \begin{cases}1 & i=j\\ 0 & i\neq j.\end{cases}$$
In other words, $\sqrt{n}(\hat{\rho}_n(h)-\rho(h))\overset{D}{\to}N(0,I_h)$. Hence the sample autocorrelations at different lags
are asymptotically uncorrelated and, after scaling by $\sqrt{n}$, have variance one. This allows us to easily construct error
bars for the sample autocorrelations under the assumption of independence. If the vast majority of
the sample autocorrelations lie inside the error bars there is not enough evidence to reject the hypothesis that the
data is a realisation of iid random variables (often called a white noise process). An example of
the empirical ACF and error bars is given in Figure 6.1. We see that the empirical autocorrelations
of the realisation from iid random variables all lie within the error bars.
In contrast in Figure
6.2 we give a plot of the sample ACF of an AR(2). We observe that a large number of the sample
autocorrelations lie outside the error bars.
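The comparison between the two figures can be reproduced in a few lines. This is a sketch under stated assumptions: the seed, the lag range `h = 20` and the AR(2) coefficients `c(1.5, -0.75)` (a causal choice) are picked here for illustration; the error bars are the usual $\pm z_{0.975}/\sqrt{n}$ band implied by the identity limit above.

```r
set.seed(4)
n <- 200     # sample size, as in the figures
h <- 20      # number of lags inspected (illustrative choice)

# 95% error bars under the null of iid data: +/- 1.96 / sqrt(n)
band <- qnorm(0.975) / sqrt(n)

x_iid <- rnorm(n)
x_ar2 <- as.numeric(arima.sim(model = list(ar = c(1.5, -0.75)), n = n))

# Sample autocorrelations at lags 1, ..., h (lag 0 dropped)
rho_iid <- acf(x_iid, lag.max = h, plot = FALSE)$acf[-1]
rho_ar2 <- acf(x_ar2, lag.max = h, plot = FALSE)$acf[-1]

# Number of lags whose sample autocorrelation falls outside the bars
out_iid <- sum(abs(rho_iid) > band)
out_ar2 <- sum(abs(rho_ar2) > band)

print(c(band = band, outside_iid = out_iid, outside_ar2 = out_ar2))
```

For iid data roughly one lag in twenty is expected outside the bars by chance alone, whereas for the AR(2) many of the early lags should fall outside.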
Figure 6.2: Top: The sample ACF of the AR(2) process $X_t=1.5X_{t-1}-0.75X_{t-2}+\varepsilon_t$ with
error bars n = 200. Bottom: The true ACF.
Of course, simply checking by eye means that we risk misconstruing a sample coefficient that
lies outside the error bars as meaning that the time series is correlated, whereas this could simply
be a false positive (due to multiple testing). To counter this problem, we construct a test statistic
for testing uncorrelatedness. Since under the null $\sqrt{n}(\hat{\rho}_n(h)-\rho(h))\overset{D}{\to}N(0,I)$, one method of
testing is to use the squared correlations
$$S_h = n\sum_{r=1}^{h}|\hat{\rho}_n(r)|^2, \tag{6.9}$$
under the null it will asymptotically have a $\chi^2$-distribution with $h$ degrees of freedom, under the
alternative it will be a non-central (generalised) chi-squared. The non-centrality is what makes us
reject the null if the alternative of correlatedness is true. This is known as the Box-Pierce test (a
test which gives better finite sample results is the Ljung-Box test). Of course, a big question is
how to select $h$. In general, we do not have to use large $h$ since most correlations will arise when
$r$ is small. However the choice of $h$ will have an influence on power. If $h$ is too large the test will
lose power (since the mean of the chi-squared grows as $h\to\infty$), on the other hand choosing $h$ too
small may mean that certain correlations at higher lags are missed. How to select $h$ is discussed
in several papers, see for example Escanciano and Lobato (2009).
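The statistic $S_h$ can be computed by hand and compared with R's built-in `Box.test`. The following sketch uses simulated iid data and `h = 10`, both illustrative choices.

```r
set.seed(5)
n <- 200
h <- 10            # number of lags in the test (illustrative choice)
x <- rnorm(n)      # data generated under the null of no correlation

# Sample autocorrelations at lags 1, ..., h
rho_hat <- acf(x, lag.max = h, plot = FALSE)$acf[-1]

# Box-Pierce statistic S_h = n * sum of squared sample autocorrelations
S_h <- n * sum(rho_hat^2)

# Asymptotic p-value from the chi-squared distribution with h degrees of freedom
p_value <- pchisq(S_h, df = h, lower.tail = FALSE)

# The same test, and its Ljung-Box variant, from the stats package
bp <- Box.test(x, lag = h, type = "Box-Pierce")
lb <- Box.test(x, lag = h, type = "Ljung-Box")

print(c(S_h = S_h, p_value = p_value,
        box_pierce = unname(bp$statistic), ljung_box = unname(lb$statistic)))
```

The Ljung-Box statistic reweights each squared autocorrelation by $(n+2)/(n-r)$, so it is always slightly larger than $S_h$.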
6.4
A process is said to have long range dependence if the autocovariances are not absolutely summable,
i.e. $\sum_k|c(k)|=\infty$.
From a practical point of view data is said to exhibit long range dependence if the autocovariances do not decay very fast to zero as the lag increases. Returning to the Yahoo data considered
in Section 4.1.1 we recall that the ACF plot of the absolute log differences, given again in Figure
6.3, appears to exhibit this type of behaviour. However, it has been argued by several authors that
the appearance of long memory is really because a time-dependent mean has not been corrected
for. Could this be the reason we see the memory in the log differences?
Figure 6.3: ACF plot of abs(yahoo.log.diff) (ACF against lag).
We now demonstrate that one must be careful when diagnosing long range dependence, because
a slow (or absent) decay of the autocovariance could also imply a time-dependent mean that has not been
corrected for. This was shown in Bhattacharya et al. (1983), and applied to econometric data in
Mikosch and Stărică (2000) and Mikosch and Stărică (2003). A test for distinguishing between long
range dependence and change points is proposed in Berkes et al. (2006).
Suppose that $Y_t$ satisfies
$$Y_t = \mu_t+\varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables and the mean $\mu_t$ depends on $t$. We observe $\{Y_t\}$ but do not
know the mean is changing. We want to evaluate the autocovariance function, hence estimate the
autocovariance at lag $k$ with
$$\hat{c}_n(k) = \frac{1}{n}\sum_{t=1}^{n-k}(Y_t-\bar{Y}_n)(Y_{t+k}-\bar{Y}_n).$$
Observe that $\bar{Y}_n$ is not really estimating the mean but the average mean! If we plotted the empirical
ACF $\{\hat{c}_n(k)\}$ we would see that the covariances do not decay with the lag. However the true ACF
would be zero at all lags except lag zero. The reason the empirical ACF does not decay to zero is
because we have not corrected for the time dependent mean. Indeed it can be shown that
$$\begin{aligned}
\hat{c}_n(k) &= \frac{1}{n}\sum_{t=1}^{n-k}(Y_t-\mu_t+\mu_t-\bar{Y}_n)(Y_{t+k}-\mu_{t+k}+\mu_{t+k}-\bar{Y}_n)\\
&\approx \frac{1}{n}\sum_{t=1}^{n-k}(Y_t-\mu_t)(Y_{t+k}-\mu_{t+k})+\frac{1}{n}\sum_{t=1}^{n-k}(\mu_t-\bar{Y}_n)(\mu_{t+k}-\bar{Y}_n)\\
&\approx \underbrace{c(k)}_{\text{true autocovariance}=0}+\frac{1}{n}\sum_{t=1}^{n-k}(\mu_t-\bar{Y}_n)(\mu_{t+k}-\bar{Y}_n).
\end{aligned}$$
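Expanding the second term and assuming that $k\ll n$ and $\mu_t\approx\mu(t/n)$ (and is thus smooth) we
have
$$\begin{aligned}
\frac{1}{n}\sum_{t=1}^{n-k}(\mu_t-\bar{Y}_n)(\mu_{t+k}-\bar{Y}_n) &\approx \frac{1}{n}\sum_{t=1}^{n}\mu_t^2-\left(\frac{1}{n}\sum_{t=1}^{n}\mu_t\right)^2+o_p(1)\\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_t^2-\left(\frac{1}{n}\sum_{t=1}^{n}\mu_t\right)^2+o_p(1)\\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_t(\mu_t-\mu_s)+o_p(1)\\
&= \frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_t-\mu_s)^2+\underbrace{\frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_s(\mu_t-\mu_s)}_{=-\frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}\mu_t(\mu_t-\mu_s)}+o_p(1).
\end{aligned}$$
Therefore
$$\frac{1}{n}\sum_{t=1}^{n-k}(\mu_t-\bar{Y}_n)(\mu_{t+k}-\bar{Y}_n)\approx\frac{1}{2n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}(\mu_t-\mu_s)^2.$$
Thus we observe that the sample covariances are positive and do not tend to zero for large lags.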
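This effect is easy to reproduce by simulation. In the sketch below, which is an illustration and not taken from the notes, the mean is a single level shift from 0 to 2 halfway through the sample (a hypothetical choice of $\mu_t$) and the noise is iid standard normal, so the true autocovariance is zero at all non-zero lags.

```r
set.seed(6)
n  <- 1000
mu <- ifelse((1:n) <= n / 2, 0, 2)   # hypothetical level shift in the mean
y  <- mu + rnorm(n)                  # iid noise around a changing mean

# Sample autocorrelations with and without correcting for the mean
rho_shift     <- acf(y,      lag.max = 50, plot = FALSE)$acf[-1]
rho_corrected <- acf(y - mu, lag.max = 50, plot = FALSE)$acf[-1]

# Sample autocovariance at lag 1 against the approximation
# (1 / (2 n^2)) * sum_{s,t} (mu_t - mu_s)^2
c1_hat   <- acf(y, lag.max = 1, type = "covariance", plot = FALSE)$acf[2]
c_approx <- mean(outer(mu, mu, "-")^2) / 2

print(c(min_rho_shift = min(rho_shift),
        max_abs_rho_corrected = max(abs(rho_corrected)),
        c1_hat = c1_hat, c_approx = c_approx))
```

The uncorrected autocorrelations stay well away from zero even at lag 50, giving the appearance of long memory, while those of the mean-corrected series are all small.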
Chapter 7
Parameter estimation
Prerequisites
The Gaussian likelihood.
Objectives
To be able to derive the Yule-Walker and least squares estimator of the AR parameters.
To understand what the quasi-Gaussian likelihood for the estimation of ARMA models is,
and how the Durbin-Levinson algorithm is useful in obtaining this likelihood (in practice).
Also how we can approximate it by using approximations of the predictions.
Understand that there exist alternative methods for estimating the ARMA parameters,
which exploit the fact that the ARMA can be written as an AR($\infty$).
We will consider various methods for estimating the parameters in a stationary time series.
We first consider estimating the parameters of an AR and ARMA process. It is worth noting that we
will look at maximum likelihood estimators for the AR and ARMA parameters. The maximum
likelihood will be constructed as if the observations were Gaussian. However, these estimators
work both when the process is Gaussian and when it is non-Gaussian. In the non-Gaussian case, the
likelihood simply acts as a contrast function (and is commonly called the quasi-likelihood). In time
series, often the distribution of the random variables is unknown and the notion of likelihood has
little meaning. Instead we seek methods that give good estimators of the parameters, meaning that
they are consistent and as close to efficiency as possible without placing too many assumptions on
the distribution. We need to free ourselves from the notion of likelihood acting as a likelihood
(and attaining the Cramér-Rao lower bound).
7.1
Let us suppose that $\{X_t\}$ is a zero mean stationary time series which satisfies the AR($p$) representation
$$X_t = \sum_{j=1}^{p}\phi_jX_{t-j}+\varepsilon_t,$$
where $E(\varepsilon_t)=0$ and $\mathrm{var}(\varepsilon_t)=\sigma^2$ and the roots of the characteristic polynomial $1-\sum_{j=1}^{p}\phi_jz^j$ lie
outside the unit circle. We will assume that the AR($p$) is causal (the techniques discussed here
will not consistently estimate the parameters in the case that the process is non-causal, they will
only consistently estimate the corresponding causal model). Our aim in this section is to construct
estimators of the AR parameters $\{\phi_j\}$. We will show that in the case that $\{X_t\}$ has an AR($p$)
representation the estimation is relatively straightforward, and the estimation methods all have
properties which are asymptotically equivalent to the Gaussian maximum likelihood estimator.
The Yule-Walker estimator is based on the Yule-Walker equations derived in (3.4) (Section
3.1.4).
7.1.1
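We recall that the Yule-Walker equations state that if an AR process is causal, then for $i>0$ we
have
$$E(X_tX_{t-i}) = \sum_{j=1}^{p}\phi_jE(X_{t-j}X_{t-i}),\quad\Rightarrow\quad c(i) = \sum_{j=1}^{p}\phi_jc(i-j). \tag{7.1}$$
Putting the cases $1\leq i\leq p$ together we can write the above as
$$r_p = \Sigma_p\phi_p, \tag{7.2}$$
where $(\Sigma_p)_{i,j}=c(i-j)$, $(r_p)_i=c(i)$ and $\phi_p'=(\phi_1,\ldots,\phi_p)$. Thus the autoregressive parameters
solve these equations.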
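The Yule-Walker equations inspire the method of moments estimator called the Yule-Walker
estimator. We use (7.2) as the basis of the estimator. It is clear that $\hat{r}_p$ and $\hat{\Sigma}_p$ are estimators of
$r_p$ and $\Sigma_p$ where $(\hat{\Sigma}_p)_{i,j}=\hat{c}_n(i-j)$ and $(\hat{r}_p)_i=\hat{c}_n(i)$. Therefore we can use
$$\hat{\phi}_p = \hat{\Sigma}_p^{-1}\hat{r}_p, \tag{7.3}$$
as an estimator of the AR parameters. The estimator can be evaluated with the Durbin-Levinson recursion
$$\hat{\phi}_{t,t} = \frac{\hat{c}_n(t)-\sum_{j=1}^{t-1}\hat{\phi}_{t-1,j}\hat{c}_n(t-j)}{\hat{r}_n(t)},\qquad \hat{\phi}_{t,j} = \hat{\phi}_{t-1,j}-\hat{\phi}_{t,t}\hat{\phi}_{t-1,t-j},\quad 1\leq j\leq t-1,$$
where $\hat{r}_n(t)$ is the corresponding estimate of the one-step prediction error variance used in the Durbin-Levinson algorithm.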
Figure 7.1: Top: The sample partial autocorrelation plot of the AR(2) process $X_t=1.5X_{t-1}-0.75X_{t-2}+\varepsilon_t$ with error bars n = 200.
$\hat{c}_n(i-j)$, using this and the following result it follows that $\{\hat{\phi}_j; j=1,\ldots,p\}$ corresponds to a causal
AR process.
Lemma 7.1.1 Let us suppose $Z_{p+1}=(Z_1,\ldots,Z_{p+1})$ is a random vector, where $\mathrm{var}[Z_{p+1}]=\Sigma_{p+1}$ with
$(\Sigma_{p+1})_{i,j}=c_n(i-j)$ (which is Toeplitz). Let $Z_{p+1|p}$ be the best linear predictor of $Z_{p+1}$ given
$Z_p,\ldots,Z_1$, where $\phi_p=(\phi_1,\ldots,\phi_p)=\Sigma_p^{-1}r_p$ are the coefficients corresponding to the best linear
predictor. Then the roots of the corresponding characteristic polynomial $\phi(z)=1-\sum_{j=1}^{p}\phi_jz^j$ lie
outside the unit circle.
PROOF. We first note that by definition of the best linear predictor, for any coefficients $\{a_j\}$ we
have the inequality
$$E\left(\phi(B)Z_{p+1}\right)^2 = E\left(Z_{p+1}-\sum_{j=1}^{p}\phi_jZ_{p+1-j}\right)^2\leq E\left(Z_{p+1}-\sum_{j=1}^{p}a_jZ_{p+1-j}\right)^2 = E\left(a(B)Z_{p+1}\right)^2. \tag{7.4}$$
We use the above inequality to prove the result by contradiction. Let us suppose that there exists
at least one root of $\phi(z)$ which lies inside the unit circle. We denote this root as $\lambda^{-1}$ ($|\lambda|>1$) and
factorize $\phi(z)$ as $\phi(z)=(1-\lambda z)R(z)$, where $R(z)$ contains the other remaining roots, which can be
either inside or outside the unit circle.
Define the two new random variables, $Y_{p+1}=R(B)Z_{p+1}$ and $Y_p=R(B)Z_p$ (where $B$ acts
as the backshift operator), which we note are linear combinations of $Z_{p+1},\ldots,Z_2$ and $Z_p,\ldots,Z_1$
respectively, i.e. $Y_{p+1}=\sum_{i=0}^{p-1}R_iZ_{p+1-i}$ and $Y_p=\sum_{i=0}^{p-1}R_iZ_{p-i}$. The most important observation
in this construction is that the matrix $\Sigma_{p+1}$ is Toeplitz (i.e. $\{Z_t\}$ is a stationary vector), therefore
$Y_{p+1}$ and $Y_p$ have the same covariance structure, in particular they have the same variance. Let
$$\rho = \frac{\mathrm{cov}[Y_{p+1},Y_p]}{\mathrm{var}[Y_p]} = \underbrace{\frac{\mathrm{cov}[Y_{p+1},Y_p]}{\sqrt{\mathrm{var}[Y_p]\mathrm{var}[Y_{p+1}]}}}_{\text{by stationarity}} = \mathrm{cor}(Y_{p+1},Y_p).$$
The above result can immediately be used to show that the Yule-Walker estimators of the AR(p) coefficients yield a causal solution. Since the autocovariance estimators $\{\hat c_n(r)\}$ form a positive semi-definite sequence, there exists a vector $Y_{p+1}$ where $var[Y_{p+1}] = \hat\Sigma_{p+1}$ with $(\hat\Sigma_{p+1})_{i,j} = \hat c_n(i-j)$, thus by the above lemma we have that $\hat\Sigma_p^{-1}\hat r_p$ are the coefficients of a causal AR process. We note that below we define the intuitively obvious least squares estimator, which does not necessarily have this property.
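The causality property can be checked numerically. The following R sketch is an illustration only: the simulated AR(2) model, the seed and all variable names are assumptions made for the example. It solves the Yule-Walker system (7.3) directly and inspects the roots of the fitted characteristic polynomial.

```r
# Sketch: Yule-Walker estimation of an AR(p) model and a check of causality.
set.seed(1)
n <- 200
p <- 2
x <- arima.sim(model = list(ar = c(1.5, -0.75)), n = n)   # causal AR(2)

# Sample autocovariances c_n(0), ..., c_n(p) with divisor n
# (acf() centres the series by its sample mean).
chat <- drop(acf(x, lag.max = p, type = "covariance", plot = FALSE)$acf)

Sigma_hat <- toeplitz(chat[1:p])       # (Sigma_hat)_{i,j} = c_n(i - j)
r_hat     <- chat[2:(p + 1)]           # (r_hat)_i = c_n(i)
phi_hat   <- solve(Sigma_hat, r_hat)   # equation (7.3)

# Roots of phi(z) = 1 - phi_1 z - ... - phi_p z^p; their moduli should exceed 1.
root_mod <- Mod(polyroot(c(1, -phi_hat)))

# Cross-check against the built-in Yule-Walker routine.
phi_builtin <- ar.yw(x, order.max = p, aic = FALSE)$ar
```

Here `root_mod` holds the moduli of the roots; by Lemma 7.1.1 they should all be larger than one, whatever the realisation.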
The least squares estimator can either be defined in its own right or be considered as the maximiser of the conditional Gaussian likelihood. We start by defining the Gaussian likelihood.
7.1.2
Our object here is to obtain the maximum likelihood estimator of the AR(p) parameters. We recall that the maximum likelihood estimator is the parameter which maximises the joint density of the observations. Since the log-likelihood often has a simpler form, we will focus on the log-likelihood. We note that the Gaussian MLE is constructed as if the observations $\{X_t\}$ were Gaussian, though it is not necessary that $\{X_t\}$ is Gaussian when doing the estimation. In the case that the innovations are not Gaussian, the estimator will be less efficient (it will not attain the Cramer-Rao lower bound) than the estimator based on the likelihood constructed as if the distribution were known.
Suppose we observe $\{X_t; t = 1,\ldots,n\}$ where $X_t$ are observations from an AR(p) process. Let us suppose for the moment that the innovations of the AR process are Gaussian; this implies that $X_n = (X_1,\ldots,X_n)$ is an $n$-dimensional Gaussian random vector, with the corresponding log-likelihood
\[
L_n(a) = -\log|\Sigma_n(a)| - X_n'\Sigma_n(a)^{-1}X_n, \tag{7.5}
\]
where $\Sigma_n(a)$ is the variance covariance matrix of $X_n$ constructed as if $X_n$ came from an AR process with parameters $a$. Of course, in practice the likelihood in the form given above is impossible to maximise. Therefore we need to rewrite the likelihood in a more tractable form.
We now derive a tractable form of the likelihood under the assumption that the innovations come from an arbitrary distribution. To construct the likelihood, we use the method of conditioning, to write the likelihood as the product of conditional likelihoods. In order to do this, we derive the conditional distribution of $X_{t+1}$ given $X_t,\ldots,X_1$. We first note that the AR(p) process is $p$-Markovian, therefore if $t \ge p$ all the information about $X_{t+1}$ is contained in the past $p$ observations, therefore
\[
P(X_{t+1} \le x\,|\,X_t,X_{t-1},\ldots,X_1) = P(X_{t+1} \le x\,|\,X_t,X_{t-1},\ldots,X_{t-p+1}). \tag{7.6}
\]
Since the Markov property applies to the distribution function it also applies to the density
\[
f(X_{t+1}\,|\,X_t,\ldots,X_1) = f(X_{t+1}\,|\,X_t,\ldots,X_{t-p+1}).
\]
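By using (7.6) we have
\[
P(X_{t+1} \le x\,|\,X_t,\ldots,X_1) = P_\varepsilon\Big(\varepsilon \le x - \sum_{j=1}^{p}a_jX_{t+1-j}\Big), \tag{7.7}
\]
where $P_\varepsilon$ denotes the distribution of the innovation. Differentiating $P_\varepsilon$ with respect to $X_{t+1}$ gives
\[
\frac{\partial P_\varepsilon\big(\varepsilon \le X_{t+1} - \sum_{j=1}^{p}a_jX_{t+1-j}\big)}{\partial X_{t+1}} = f_\varepsilon\Big(X_{t+1} - \sum_{j=1}^{p}a_jX_{t+1-j}\Big). \tag{7.8}
\]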
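Example 7.1.1 (AR(1)) To understand why (7.6) is true consider the simple case that $p = 1$ (AR(1)). Studying the conditional probability gives
\[
P(X_{t+1} \le x_{t+1}\,|\,X_t = x_t,\ldots,X_1 = x_1) = P(\underbrace{aX_t + \varepsilon_{t+1}}_{X_{t+1}} \le x_{t+1}\,|\,X_t = x_t,\ldots,X_1 = x_1) = P_\varepsilon(\varepsilon_{t+1} \le x_{t+1} - ax_t),
\]
which depends on the past only through $x_t$, since $\varepsilon_{t+1}$ is independent of $X_t,\ldots,X_1$.
Using the above, the joint density of the observations can be written as
\[
\begin{aligned}
f(X_1,X_2,\ldots,X_n) &= f(X_1,\ldots,X_p)\prod_{t=p}^{n-1}f(X_{t+1}\,|\,X_t,\ldots,X_1) \\
&= f(X_1,\ldots,X_p)\prod_{t=p}^{n-1}f(X_{t+1}\,|\,X_t,\ldots,X_{t-p+1}) \qquad \text{(by (7.6))} \\
&= f(X_1,\ldots,X_p)\prod_{t=p}^{n-1}f_\varepsilon\Big(X_{t+1} - \sum_{j=1}^{p}a_jX_{t+1-j}\Big) \qquad \text{(by (7.8))}.
\end{aligned}
\]
Taking logarithms, $\log f(X_1,\ldots,X_n) = \log f(X_1,\ldots,X_p) + \sum_{t=p}^{n-1}\log f_\varepsilon(X_{t+1} - \sum_{j=1}^{p}a_jX_{t+1-j})$; the second term is the conditional log-likelihood.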
In the case that the sample size is large, $n \gg p$, the contribution of the initial observations $\log f(X_1,\ldots,X_p)$ is minimal and the conditional log-likelihood and full log-likelihood are asymptotically equivalent.
So far we have not specified the distribution of $\varepsilon$. From now on we shall assume that it is Gaussian. In the case that $\varepsilon$ is Gaussian, $(X_1,\ldots,X_p)$ is multivariate normal with mean zero (since we are assuming, for convenience, that the time series has zero mean) and variance $\Sigma_p$. We recall that $\Sigma_p(a)$ is a Toeplitz matrix whose covariance is determined by the AR parameters $a$, see (3.7). As can be seen from (3.7), the coefficients are buried within the covariance (which is in terms of the roots of the characteristic polynomial); this makes it quite an unpleasant part of the likelihood to maximise. On the other hand the conditional log-likelihood has a far simpler form
\[
L_n(a;X) = -(n-p)\log\sigma^2 - \frac{1}{\sigma^2}\sum_{t=p}^{n-1}\Big(X_{t+1} - \sum_{j=1}^{p}a_jX_{t+1-j}\Big)^2.
\]
The maximum likelihood estimator is
\[
\hat\phi_n = \arg\max_{a\in\Theta}\Big(-\log|\Sigma_p(a)| - X_p'\Sigma_p(a)^{-1}X_p + L_n(a;X)\Big). \tag{7.9}
\]
By constraining the parameter space, we can ensure the estimator corresponds to a causal AR process. However, it is clear that despite having the advantage that it attains the Cramer-Rao lower bound in the case that the innovations are Gaussian, it is not simple to evaluate. A far simpler estimator can be obtained by simply focusing on the conditional log-likelihood $L_n(a;X)$. An explicit expression for its maximum can easily be obtained (as long as we do not constrain the parameter space). It is simply the least squares estimator, in other words, $\tilde\phi_p = \arg\max L_n(a;X)$ and
\[
\tilde\phi_p = \tilde\Sigma_p^{-1}\tilde r_p,
\]
where $(\tilde\Sigma_p)_{i,j} = \frac{1}{n-p}\sum_{t=p+1}^{n}X_{t-i}X_{t-j}$ and $(\tilde r_p)_i = \frac{1}{n-p}\sum_{t=p+1}^{n}X_tX_{t-i}$.
Remark 7.1.1 (A comparison of the Yule-Walker and least squares estimators) Comparing the least squares estimator $\tilde\phi_p = \tilde\Sigma_p^{-1}\tilde r_p$ with the Yule-Walker estimator $\hat\phi_p = \hat\Sigma_p^{-1}\hat r_p$ we see that they are very similar. The difference lies in $\tilde\Sigma_p$ and $\hat\Sigma_p$ (and the corresponding $\tilde r_p$ and $\hat r_p$). We see that $\hat\Sigma_p$ is a Toeplitz matrix, defined entirely by the positive definite sequence $\hat c_n(r)$. On the other hand, $\tilde\Sigma_p$ is not a Toeplitz matrix; the estimator of $c(r)$ changes subtly at each row. This means that the proof given in Lemma 7.1.1 cannot be applied to the least squares estimator as it relies on the matrix $\Sigma_{p+1}$ (which is a combination of $\Sigma_p$ and $r_p$) being Toeplitz (thus stationary). Thus the characteristic polynomial corresponding to the least squares estimator will not necessarily have roots which lie outside the unit circle.
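The difference between the two matrices in the remark can be seen directly in a small computation. The R sketch below is illustrative only (the simulated model, seed and helper `is_toeplitz` are assumptions for the example, and the series is treated as zero mean, as in the text). It builds $\tilde\Sigma_p$ and $\hat\Sigma_p$ and checks which of them is Toeplitz.

```r
# Sketch: least squares versus Yule-Walker for an AR(p) fit.
set.seed(2)
n <- 100
p <- 2
x <- as.numeric(arima.sim(model = list(ar = c(1.5, -0.75)), n = n))

# Least squares: regress X_t on (X_{t-1}, ..., X_{t-p}) for t = p+1, ..., n.
lagmat <- embed(x, p + 1)             # columns: X_t, X_{t-1}, ..., X_{t-p}
y <- lagmat[, 1]
Z <- lagmat[, -1, drop = FALSE]
Sigma_tilde <- crossprod(Z) / (n - p)
r_tilde     <- crossprod(Z, y) / (n - p)
phi_ls      <- drop(solve(Sigma_tilde, r_tilde))

# Yule-Walker with uncentred sample autocovariances (divisor n).
chat <- sapply(0:p, function(k) sum(x[1:(n - k)] * x[(1 + k):n]) / n)
Sigma_hat <- toeplitz(chat[1:p])
phi_yw    <- solve(Sigma_hat, chat[2:(p + 1)])

# Helper (hypothetical name): is a symmetric matrix constant along its diagonals?
is_toeplitz <- function(M) all(abs(M - toeplitz(M[1, ])) < 1e-12)
```

In this sketch `is_toeplitz(Sigma_hat)` holds by construction, while `Sigma_tilde` typically fails the check because each diagonal entry sums over a slightly different window of the data.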
Example 7.1.2 (Toy Example) To illustrate the difference between the Yule-Walker and least squares estimator (at least for small samples) consider the rather artificial example that the time series consists of two observations $X_1$ and $X_2$ (we will assume the mean is zero). We fit an AR(1) model to the data; the least squares estimator of the AR(1) parameter is
\[
\hat\phi_{LS} = \frac{X_1X_2}{X_1^2},
\]
whereas the Yule-Walker estimator of the AR(1) parameter is
\[
\hat\phi_{YW} = \frac{X_1X_2}{X_1^2 + X_2^2}.
\]
It is clear that $|\hat\phi_{LS}| < 1$ only if $|X_2| < |X_1|$. On the other hand $|\hat\phi_{YW}| < 1$. Indeed since $(|X_1| - |X_2|)^2 \ge 0$, we see that $|\hat\phi_{YW}| \le 1/2$.
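As a quick numerical illustration of the toy example (the function name `toy_estimators` and the values $X_1 = 1$, $X_2 = 3$ are assumptions chosen for the example), the two estimators can be evaluated as follows.

```r
# Sketch: the two AR(1) estimators based on two observations (mean assumed zero).
toy_estimators <- function(x1, x2) {
  c(ls = x1 * x2 / x1^2,            # minimises (X_2 - phi * X_1)^2
    yw = x1 * x2 / (x1^2 + x2^2))   # c_n(1) / c_n(0)
}

est <- toy_estimators(x1 = 1, x2 = 3)

# The Yule-Walker estimator stays in [-1/2, 1/2] for any pair of observations.
set.seed(3)
pairs <- matrix(rnorm(2000), ncol = 2)
yw_all <- pairs[, 1] * pairs[, 2] / (pairs[, 1]^2 + pairs[, 2]^2)
```

With these values the least squares estimate is 3, which does not correspond to a causal AR(1), whereas the Yule-Walker estimate is 0.3.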
Exercise 7.1 In R you can estimate the AR parameters using ordinary least squares (ar.ols), Yule-Walker (ar.yw) and (Gaussian) maximum likelihood (ar.mle).
Simulate the causal AR(2) model $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ using the routine arima.sim (which gives Gaussian realizations) and also with innovations which come from a t-distribution with 4 df. Use the sample sizes $n = 100$ and $n = 500$ and compare the three methods through a simulation study.
Exercise 7.2 None of these methods are able to consistently estimate the parameters of a non-causal AR(p) time series. This is because all these methods are estimating the autocovariance function (regardless of whether the Yule-Walker or least squares method is used). It is possible that other criteria may give a consistent estimator. For example the $\ell_1$-norm defined as
\[
L_n(\phi) = \sum_{t=p+1}^{n}\Big|X_t - \sum_{j=1}^{p}\phi_jX_{t-j}\Big|,
\]
with $\hat\phi_n = \arg\min L_n(\phi)$.
(i) Simulate a stationary solution of the non-causal AR(1) process $X_t = 2X_{t-1} + \varepsilon_t$, where the innovations come from a double exponential, and estimate $\phi$ using $L_n(\phi)$. Do this 100 times; does this estimator appear to consistently estimate 2?
(ii) Simulate a stationary solution of the non-causal AR(1) process $X_t = 2X_{t-1} + \varepsilon_t$, where the innovations come from a t-distribution with 4 df, and estimate $\phi$ using $L_n(\phi)$. Do this 100 times; does this estimator appear to consistently estimate 2?
You will need to use a Quantile Regression package to minimise the $\ell_1$ norm. I suggest using the package quantreg and the function rq where we set $\tau = 0.5$ (the median).
7.2
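Let us suppose that $\{X_t\}$ satisfies the ARMA representation
\[
X_t - \sum_{i=1}^{p}\phi_iX_{t-i} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j},
\]
with parameters $\phi = (\phi_1,\ldots,\phi_p)$, $\theta = (\theta_1,\ldots,\theta_q)$ and $\sigma^2 = var(\varepsilon_t)$.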
7.2.1
We now derive the Gaussian maximum likelihood estimator (GMLE) to estimate the parameters $\phi$ and $\theta$. Let $X_n' = (X_1,\ldots,X_n)$. The criterion (the GMLE) is constructed as if $\{X_t\}$ were Gaussian, but this need not be the case. The likelihood is similar to the likelihood given in (7.5), but just as in the autoregressive case it cannot be directly maximised, ie.
\[
L_n(\phi,\theta,\sigma) = -\log|\Sigma_n(\phi,\theta,\sigma)| - X_n'\Sigma_n(\phi,\theta,\sigma)^{-1}X_n, \tag{7.10}
\]
where $\Sigma_n(\phi,\theta,\sigma)$ is the variance covariance matrix of $X_n$. However, in Section 5.3.3, equation (5.21), the Cholesky decomposition of $\Sigma_n$ is given and using this we can show that
\[
X_n'\Sigma_n(\phi,\theta,\sigma)^{-1}X_n = \frac{X_1^2}{r(1;\vartheta)} + \sum_{t=1}^{n-1}\frac{(X_{t+1} - \sum_{j=1}^{t}\phi_{t,j}(\vartheta)X_{t+1-j})^2}{r(t+1;\vartheta)},
\]
where $\vartheta = (\phi,\theta,\sigma^2)$. Furthermore, since $\Sigma_n^{-1} = L_n'D_nL_n$, then $\det(\Sigma_n^{-1}) = \det(L_n)^2\det(D_n) = \prod_{t=1}^{n}r(t)^{-1}$; this implies $\log|\Sigma_n(\phi,\theta,\sigma)| = \sum_{t=1}^{n}\log r(t;\vartheta)$. Thus the log-likelihood is
\[
L_n(\phi,\theta,\sigma) = -\sum_{t=1}^{n}\log r(t;\vartheta) - \frac{X_1^2}{r(1;\vartheta)} - \sum_{t=1}^{n-1}\frac{(X_{t+1} - \sum_{j=1}^{t}\phi_{t,j}(\vartheta)X_{t+1-j})^2}{r(t+1;\vartheta)}.
\]
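For an ARMA(p, q) process the best linear predictors simplify once $t \ge \max(p,q)$, and the log-likelihood can be written as
\[
\begin{aligned}
L_n(\vartheta) = &-\sum_{t=1}^{n}\log r(t;\vartheta) - \frac{X_1^2}{r(1;\vartheta)} - \sum_{t=1}^{\max(p,q)-1}\frac{(X_{t+1} - \sum_{j=1}^{t}\phi_{t,j}(\vartheta)X_{t+1-j})^2}{r(t+1;\vartheta)} \\
&- \sum_{t=\max(p,q)}^{n-1}\frac{\big(X_{t+1} - \sum_{j=1}^{p}\phi_jX_{t+1-j} - \sum_{i=1}^{q}\theta_{t,i}(X_{t+1-i} - X_{t+1-i|t-i}(\vartheta))\big)^2}{r(t+1;\vartheta)}.
\end{aligned}
\]
An approximation to the log-likelihood is
\[
\hat L_n(\vartheta) = -\sum_{t=1}^{n}\log\sigma^2 - \sum_{t=2}^{n-1}\frac{[X_{t+1} - \hat X_{t+1|t}(\vartheta)]^2}{\sigma^2}
= -\sum_{t=1}^{n}\log\sigma^2 - \sum_{t=2}^{n-1}\frac{[(\theta(B)^{-1}\phi(B))_{[t]}X_{t+1}]^2}{\sigma^2},
\]
where $(\theta(B)^{-1}\phi(B))_{[t]}$ denotes the approximation of the polynomial in $B$ to the $t$th order. This approximate likelihood greatly simplifies the estimation scheme because the derivatives (which are the main tool used in maximising it) can be easily obtained. To do this we note that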
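(recalling that $\phi(B) = 1 - \sum_{j=1}^{p}\phi_jB^j$ and $\theta(B) = 1 + \sum_{i=1}^{q}\theta_iB^i$)
\[
\frac{d}{d\theta_i}\frac{\phi(B)}{\theta(B)}X_t = -\frac{B^i\phi(B)}{\theta(B)^2}X_t = -\frac{\phi(B)}{\theta(B)^2}X_{t-i},
\qquad
\frac{d}{d\phi_j}\frac{\phi(B)}{\theta(B)}X_t = -\frac{B^j}{\theta(B)}X_t = -\frac{1}{\theta(B)}X_{t-j}, \tag{7.11}
\]
therefore
\[
\frac{d}{d\theta_i}\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)^2 = -2\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)\Big(\frac{\phi(B)}{\theta(B)^2}X_{t-i}\Big)
\quad\text{and}\quad
\frac{d}{d\phi_j}\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)^2 = -2\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)\Big(\frac{1}{\theta(B)}X_{t-j}\Big). \tag{7.12}
\]
Substituting this into the approximate likelihood gives the derivatives
\[
\begin{aligned}
\frac{\partial\hat L}{\partial\theta_i} &= \frac{2}{\sigma^2}\sum_{t=1}^{n}\Big[\big(\theta(B)^{-1}\phi(B)\big)_{[t]}X_t\Big]\Big[\Big(\frac{\phi(B)}{\theta(B)^2}\Big)_{[t-i]}X_{t-i}\Big], \\
\frac{\partial\hat L}{\partial\phi_j} &= \frac{2}{\sigma^2}\sum_{t=1}^{n}\Big[\big(\theta(B)^{-1}\phi(B)\big)_{[t]}X_t\Big]\Big[\Big(\frac{1}{\theta(B)}\Big)_{[t-j]}X_{t-j}\Big], \\
\frac{\partial\hat L}{\partial\sigma^2} &= -\frac{n}{\sigma^2} + \frac{1}{\sigma^4}\sum_{t=1}^{n}\Big[\big(\theta(B)^{-1}\phi(B)\big)_{[t]}X_t\Big]^2.
\end{aligned} \tag{7.13}
\]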
We then use the Newton-Raphson scheme to maximise the approximate likelihood. It can be shown that the approximate likelihood is close to the actual true likelihood and asymptotically both methods are equivalent.
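To make the approximate likelihood concrete, here is a minimal R sketch for an ARMA(1,1) model. It is an illustration under stated assumptions, not the scheme described above: the truncated residuals are computed by the recursion $e_t = X_t - \phi X_{t-1} - \theta e_{t-1}$ with $X_0 = e_0 = 0$, $\sigma^2$ is profiled out, and the general-purpose quasi-Newton optimiser `optim` (method BFGS, with numerical derivatives) is used in place of a hand-coded Newton-Raphson step. The model, seed and function names are hypothetical.

```r
# Sketch: approximate (conditional) Gaussian likelihood for an ARMA(1,1).
set.seed(3)
n <- 500
x <- as.numeric(arima.sim(model = list(ar = 0.6, ma = 0.4), n = n))

# Truncated residuals e_t = X_t - phi X_{t-1} - theta e_{t-1}, with X_0 = e_0 = 0.
arma11_resid <- function(par, x) {
  phi <- par[1]
  theta <- par[2]
  e <- numeric(length(x))
  e[1] <- x[1]
  for (t in 2:length(x)) e[t] <- x[t] - phi * x[t - 1] - theta * e[t - 1]
  e
}

# Negative approximate log-likelihood with sigma^2 profiled out:
# for fixed (phi, theta) the maximiser is sigma^2 = mean(e^2),
# which leaves n * log(mean(e^2)) up to an additive constant.
neg_loglik <- function(par, x) {
  if (any(abs(par) >= 1)) return(1e10)   # stay in the causal and invertible region
  e <- arma11_resid(par, x)
  length(x) * log(mean(e^2))
}

fit <- optim(c(0, 0), neg_loglik, x = x, method = "BFGS")
phi_hat    <- fit$par[1]
theta_hat  <- fit$par[2]
sigma2_hat <- mean(arma11_resid(fit$par, x)^2)
```

The result can be compared with `arima(x, order = c(1, 0, 1), include.mean = FALSE, method = "CSS")`, which minimises a closely related conditional sum of squares.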
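Theorem 7.2.1 Let us suppose that $X_t$ has a causal and invertible ARMA representation
\[
X_t - \sum_{j=1}^{p}\phi_jX_{t-j} = \varepsilon_t + \sum_{i=1}^{q}\theta_i\varepsilon_{t-i}
\]
where $\{\varepsilon_t\}$ are iid random variables with mean zero and $var[\varepsilon_t] = \sigma^2$. Then the (quasi-)Gaussian maximum likelihood estimator $(\hat\phi_n,\hat\theta_n)$ satisfies
\[
\sqrt{n}\begin{pmatrix}\hat\phi_n - \phi \\ \hat\theta_n - \theta\end{pmatrix} \overset{D}{\to} N(0,\Lambda^{-1}),
\]
with
\[
\Lambda = \frac{1}{\sigma^2}\begin{pmatrix} E(U_tU_t') & E(U_tV_t') \\ E(V_tU_t') & E(V_tV_t') \end{pmatrix}
\]
and $U_t = (U_t,\ldots,U_{t-p+1})$ and $V_t = (V_t,\ldots,V_{t-q+1})$, where $\{U_t\}$ and $\{V_t\}$ are autoregressive processes which satisfy $\phi(B)U_t = \varepsilon_t$ and $\theta(B)V_t = \varepsilon_t$.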
We do not give the proof in this section, however it is possible to understand where this result comes from. We recall that the maximum likelihood and the approximate likelihood are asymptotically equivalent. Both can be viewed as approximations of the unobserved likelihood
\[
\tilde L_n(\vartheta) = -\sum_{t=1}^{n}\log\sigma^2 - \sum_{t=2}^{n-1}\frac{[X_{t+1} - X_t(1;\vartheta)]^2}{\sigma^2}
= -\sum_{t=1}^{n}\log\sigma^2 - \sum_{t=2}^{n-1}\frac{[\theta(B)^{-1}\phi(B)X_{t+1}]^2}{\sigma^2},
\]
where $X_t(1;\vartheta)$ denotes the one-step ahead predictor given the infinite past. This likelihood is infeasible in the sense that it cannot be maximised since the infinite past $X_0, X_{-1},\ldots$ is unobserved; however it is a very convenient tool for doing the asymptotic analysis. Using Lemma 5.4.1 we can show that all three likelihoods $L_n$, $\hat L_n$ and $\tilde L_n$ are asymptotically equivalent. Therefore, to obtain the asymptotic sampling properties of $L_n$ or $\hat L_n$ we can simply consider the unobserved likelihood $\tilde L_n$.
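To show asymptotic normality (we assume here that the estimators are consistent) we need to consider the first and second derivative of $\tilde L_n$ (since the asymptotic properties are determined by Taylor expansions). In particular we need to consider the distribution of $\partial\tilde L_n/\partial\vartheta$ and the expectation of $\partial^2\tilde L_n/\partial\vartheta^2$ at the true parameters. By (7.12) we have
\[
\begin{aligned}
\frac{\partial\tilde L}{\partial\theta_i} &= \frac{2}{\sigma^2}\sum_{t=1}^{n}\big(\theta(B)^{-1}\phi(B)X_t\big)\Big(\frac{\phi(B)}{\theta(B)^2}X_{t-i}\Big), \\
\frac{\partial\tilde L}{\partial\phi_j} &= \frac{2}{\sigma^2}\sum_{t=1}^{n}\big(\theta(B)^{-1}\phi(B)X_t\big)\Big(\frac{1}{\theta(B)}X_{t-j}\Big).
\end{aligned} \tag{7.14}
\]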
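Since we are considering the derivatives at the true parameters we observe that $\theta(B)^{-1}\phi(B)X_t = \varepsilon_t$,
\[
\frac{\phi(B)}{\theta(B)^2}X_{t-i} = \frac{\phi(B)}{\theta(B)^2}\frac{\theta(B)}{\phi(B)}\varepsilon_{t-i} = \frac{1}{\theta(B)}\varepsilon_{t-i} = V_{t-i}
\]
and
\[
\frac{1}{\theta(B)}X_{t-j} = \frac{1}{\theta(B)}\frac{\theta(B)}{\phi(B)}\varepsilon_{t-j} = \frac{1}{\phi(B)}\varepsilon_{t-j} = U_{t-j}.
\]
Thus $\phi(B)U_t = \varepsilon_t$ and $\theta(B)V_t = \varepsilon_t$ are autoregressive processes (compare with the theorem). This means that the derivative of the unobserved likelihood can be written as
\[
\frac{\partial\tilde L}{\partial\theta_i} = \frac{2}{\sigma^2}\sum_{t=1}^{n}\varepsilon_tV_{t-i}
\quad\text{and}\quad
\frac{\partial\tilde L}{\partial\phi_j} = \frac{2}{\sigma^2}\sum_{t=1}^{n}\varepsilon_tU_{t-j}. \tag{7.15}
\]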
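Note that by causality $\varepsilon_t$ is independent of $U_{t-j}$ and $V_{t-i}$ (for $i,j \ge 1$). Again, like many of the other estimators we have encountered, this sum is "mean-like", so we can show normality of it by using a central limit theorem designed for dependent data. Indeed we can show asymptotic normality of $\{\partial\tilde L/\partial\theta_i; i = 1,\ldots,q\}$, $\{\partial\tilde L/\partial\phi_j; j = 1,\ldots,p\}$ and their linear combinations using the Martingale central limit theorem, see Theorem 3.2 (and Corollary 3.1), Hall and Heyde (1980); note that one can also use m-dependence. Moreover, it is relatively straightforward to show from (7.15) that $n^{-1/2}(\partial\tilde L/\partial\phi_j, \partial\tilde L/\partial\theta_i)$ has the limit variance matrix $4\Lambda$. Finally, by taking the second derivative of the likelihood we can show that $E[n^{-1}\partial^2\tilde L/\partial\vartheta^2] = -2\Lambda$ (for the derivatives with respect to $\phi$ and $\theta$). Combining the two through the Taylor expansion gives the limiting variance $(2\Lambda)^{-1}(4\Lambda)(2\Lambda)^{-1} = \Lambda^{-1}$, thus giving us the desired result.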
7.2.2
The methods detailed above require good initial values in order to begin the maximisation (in order to prevent convergence to a local maximum).
We now describe a simple method first proposed in Hannan and Rissanen (1982) and An et al. (1982). It is worth bearing in mind that currently the "large p small n problem" is a hot topic. These are generally regression problems where the sample size $n$ is quite small but the number of regressors $p$ is quite large (usually model selection is of importance in this context). The method proposed by Hannan involves expanding the ARMA process (assuming invertibility) as an AR($\infty$) process and estimating the parameters of the AR($\infty$) process. In some sense this can be considered as a regression problem with an infinite number of regressors. Hence there are some parallels between the estimation described below and the large p, small n problem.
As we mentioned in Lemma 2.5.1, if an ARMA process is invertible it can be represented as
\[
X_t = \sum_{j=1}^{\infty}b_jX_{t-j} + \varepsilon_t. \tag{7.16}
\]
The idea behind Hannan's method is to estimate the parameters $\{b_j\}$, then estimate the innovations $\varepsilon_t$, and use the estimated innovations to construct a multiple linear regression estimator of the ARMA parameters $\{\theta_i\}$ and $\{\phi_j\}$. Of course in practice we cannot estimate all parameters $\{b_j\}$ as there are an infinite number of them. So instead we do a type of sieve estimation where we only estimate a finite number and let the number of parameters to be estimated grow as the sample size increases. We describe the estimation steps below:
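(i) Suppose we observe $\{X_t\}_{t=1}^{n}$. Recalling (7.16), we will estimate the $\{b_j\}_{j=1}^{p_n}$ parameters. We will suppose that $p_n \to \infty$ as $n \to \infty$ and $p_n \ll n$ (we will state the rate below).
We use Yule-Walker to estimate $\{b_j\}_{j=1}^{p_n}$, where
\[
\hat b_{p_n} = \hat\Sigma_{p_n}^{-1}\hat r_{p_n},
\]
where
\[
(\hat\Sigma_{p_n})_{i,j} = \frac{1}{n}\sum_{t=1}^{n-|i-j|}(X_t - \bar X)(X_{t+|i-j|} - \bar X)
\quad\text{and}\quad
(\hat r_{p_n})_j = \frac{1}{n}\sum_{t=1}^{n-|j|}(X_t - \bar X)(X_{t+|j|} - \bar X).
\]
(ii) Having estimated the first $\{b_j\}_{j=1}^{p_n}$ coefficients we estimate the residuals with
\[
\tilde\varepsilon_t = X_t - \sum_{j=1}^{p_n}\hat b_{j,n}X_{t-j}.
\]
(iii) Now use as estimates of $\phi_0$ and $\theta_0$ the estimators $\tilde\phi_n, \tilde\theta_n$, where
\[
(\tilde\phi_n,\tilde\theta_n) = \arg\min\sum_{t=p_n+1}^{n}\Big(X_t - \sum_{j=1}^{p}\phi_jX_{t-j} - \sum_{i=1}^{q}\theta_i\tilde\varepsilon_{t-i}\Big)^2.
\]
This is a least squares problem with the explicit solution $(\tilde\phi_n,\tilde\theta_n)' = \tilde R_n^{-1}\tilde s_n$, where
\[
\tilde R_n = \frac{1}{n}\sum_{t=\max(p,q)}^{n}\tilde Y_t\tilde Y_t'
\quad\text{and}\quad
\tilde s_n = \frac{1}{n}\sum_{t=\max(p,q)}^{n}\tilde Y_tX_t,
\qquad
\tilde Y_t' = (X_{t-1},\ldots,X_{t-p},\tilde\varepsilon_{t-1},\ldots,\tilde\varepsilon_{t-q}).
\]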
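The three steps above can be sketched in R as follows. This is a minimal illustration rather than a tuned implementation: the ARMA(1,1) model, the seed, the choice $p_n = \lfloor(\log n)^2\rfloor$ and the variable names are all assumptions made for the example, and the regression in step (iii) starts at $t = p_n + \max(p,q) + 1$ so that every lagged residual is available.

```r
# Sketch: Hannan-Rissanen type estimation of an ARMA(p, q) model.
set.seed(4)
n <- 1000
p <- 1
q <- 1
x <- as.numeric(arima.sim(model = list(ar = 0.6, ma = 0.4), n = n))

# Step (i): fit a long AR(p_n) model by Yule-Walker (p_n is an assumed choice).
pn <- floor(log(n)^2)
b_hat <- ar.yw(x, order.max = pn, aic = FALSE, demean = FALSE)$ar

# Step (ii): estimated innovations for t = pn + 1, ..., n.
lagmat <- embed(x, pn + 1)                 # columns: X_t, X_{t-1}, ..., X_{t-pn}
eps_tilde <- c(rep(NA, pn), drop(lagmat[, 1] - lagmat[, -1] %*% b_hat))

# Step (iii): regress X_t on lagged observations and lagged residuals.
tt <- (pn + max(p, q) + 1):n
Y <- cbind(sapply(1:p, function(j) x[tt - j]),
           sapply(1:q, function(i) eps_tilde[tt - i]))
coef_hr <- drop(solve(crossprod(Y), crossprod(Y, x[tt])))
phi_tilde   <- coef_hr[1:p]
theta_tilde <- coef_hr[(p + 1):(p + q)]
```

The resulting `phi_tilde` and `theta_tilde` are cheap to compute and can serve as initial values for the likelihood maximisation of Section 7.2.1.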
7.3
We now construct an estimator of the ARCH parameters based on $Z_t \sim N(0,1)$. It is worth mentioning that despite the criterion being constructed under this condition it is not necessary that the innovations $Z_t$ are normally distributed. In fact in the case that the innovations are not normally distributed but have a finite fourth moment the estimator is still good. This is why it is called the quasi-maximum likelihood, rather than the maximum likelihood (similar to how the GMLE estimates the parameters of an ARMA model regardless of whether the innovations are Gaussian or not).
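Let us suppose that $Z_t$ is Gaussian. Since $Z_t = X_t/\sqrt{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}$, $E(X_t|X_{t-1},\ldots,X_{t-p}) = 0$ and $var(X_t|X_{t-1},\ldots,X_{t-p}) = a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2$, then, up to an additive constant and a factor of $-1/2$, the log density of $X_t$ given $X_{t-1},\ldots,X_{t-p}$ is
\[
\log\Big(a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2\Big) + \frac{X_t^2}{a_0 + \sum_{j=1}^{p}a_jX_{t-j}^2}.
\]
Therefore, conditioning on $X_1,\ldots,X_p$, we define the criterion
\[
L_n(\alpha) = \sum_{t=p+1}^{n}\Big(\log\Big(\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2\Big) + \frac{X_t^2}{\alpha_0 + \sum_{j=1}^{p}\alpha_jX_{t-j}^2}\Big).
\]
We define the parameter space
\[
\Theta = \Big\{\alpha = (\alpha_0,\ldots,\alpha_p): \sum_{j=1}^{p}\alpha_j \le 1,\ 0 < c_1 \le \alpha_0 \le c_2 < \infty,\ c_1 \le \alpha_j\Big\}
\]
and assume the true parameters lie in its interior, $a = (a_0,\ldots,a_p) \in Int(\Theta)$. We let
\[
\hat a_n = \arg\min_{\alpha\in\Theta}L_n(\alpha). \tag{7.17}
\]
The method for estimation of GARCH parameters parallels the approximate likelihood ARMA estimator given in Section 7.2.1.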
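A minimal R sketch of the quasi-likelihood criterion for an ARCH(1) model is given below. The parameter values $a_0 = 1$, $a_1 = 0.5$, the burn-in length, the box constraints and the function name `arch1_crit` are assumptions made for the example; the box-constrained optimiser `optim` (method L-BFGS-B) stands in for minimisation over the parameter space $\Theta$.

```r
# Sketch: quasi-maximum likelihood estimation of an ARCH(1) model.
set.seed(5)
n <- 2000
burn <- 200
a0 <- 1
a1 <- 0.5

# Simulate X_t = sqrt(a0 + a1 * X_{t-1}^2) * Z_t with Gaussian Z_t.
z <- rnorm(n + burn)
x <- numeric(n + burn)
x[1] <- sqrt(a0) * z[1]
for (t in 2:(n + burn)) x[t] <- sqrt(a0 + a1 * x[t - 1]^2) * z[t]
x <- x[-(1:burn)]                               # discard the burn-in

# Criterion L_n(alpha) = sum_t { log(s2_t) + X_t^2 / s2_t }, t = 2, ..., n.
arch1_crit <- function(alpha, x) {
  s2 <- alpha[1] + alpha[2] * x[-length(x)]^2   # conditional variances
  sum(log(s2) + x[-1]^2 / s2)
}

fit <- optim(c(0.5, 0.2), arch1_crit, x = x, method = "L-BFGS-B",
             lower = c(1e-4, 1e-4), upper = c(100, 1))
a_hat <- fit$par
```

Replacing `rnorm` by another zero mean, unit variance innovation distribution (with a finite fourth moment) leaves the criterion unchanged, which is the sense in which the estimator is a quasi-likelihood estimator.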
Chapter 8
Spectral Representations
Prerequisites
Knowledge of complex numbers.
Have some idea of what the covariance of a complex random variable is (we do define it below).
Some idea of a Fourier transform (a review is given in Section A.3).
Objectives
Know the definition of the spectral density.
The spectral density is always non-negative and this is a way of checking that a sequence is actually non-negative definite (is an autocovariance).
The DFT of a second order stationary time series is almost uncorrelated.
The spectral density of an ARMA time series, and how the roots of the characteristic polynomial of an AR may influence the spectral density function.
There is no need to understand the proofs of either Bochner's (generalised) theorem or the spectral representation theorem, just know what these theorems are. However, you should know the proof of Bochner's theorem in the simple case that $\sum_r|rc(r)| < \infty$.
8.1
t = 1, . . . , n.
(8.1)
where $\varepsilon_t$ are iid random variables with mean zero and variance $\sigma^2$ and $\omega$ is unknown. We estimated the frequency $\omega$ by taking the Fourier transform $J_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_te^{it\omega}$ and using as an estimator of $\omega$ the value which maximised $|J_n(\omega)|^2$. As the sample size grows the peak (which corresponds to the frequency estimator) grows in size. Besides the fact that this corresponds to the least squares estimator of $\omega$, we note that
\[
\frac{1}{\sqrt n}J_n(\omega_k) = \frac{1}{\sqrt{2\pi}\,n}\sum_{t=1}^{n}X_t\exp(it\omega_k)
= \underbrace{\frac{1}{\sqrt{2\pi}\,n}\sum_{t=1}^{n}\mu\Big(\frac{t}{n}\Big)\exp(it\omega_k)}_{=O(1)}
+ \underbrace{\frac{1}{\sqrt{2\pi}\,n}\sum_{t=1}^{n}\varepsilon_t\exp(it\omega_k)}_{=O_p(n^{-1/2}),\ \text{compare with } \frac{1}{n}\sum_{t=1}^{n}\varepsilon_t}, \tag{8.2}
\]
where $\omega_k = \frac{2\pi k}{n}$ and $\mu(\cdot)$ denotes the deterministic mean of the time series. In the case that the mean is simply the sin function, there is only one frequency which is non-zero. A plot of one realization ($n = 128$), the periodogram of the realization, the periodogram of the iid noise and the periodogram of the sin function is given in Figure 8.1. Take careful note of the scale (y-axis); observe that the periodogram of the sin function dominates the periodogram of the noise (it is magnitudes larger). We can understand why from (8.2), where the asymptotic rates are given and we see that the periodogram of the deterministic signal is estimating $n\times$ the Fourier coefficient, whereas the periodogram of the noise is $O_p(1)$. However, this is an asymptotic result; for small sample sizes you may not see such a big difference between the deterministic mean and the noise. Next look at the periodogram of the noise; we see that it is very erratic (we will show later that this is because it is an inconsistent estimator of the spectral density function), however, despite the erraticness, the amount of variation over all frequencies seems to be the same (there is just one large peak, which could be explained by the randomness of the periodogram).
[Figure 8.1: panels signal (against Time), PS, P1 and P2 (against frequency).]
Returning again to Section 1.2.3, we now consider the case that the sin function has been corrupted by colored noise, which follows the AR(2) model
\[
\varepsilon_t = 1.5\varepsilon_{t-1} - 0.75\varepsilon_{t-2} + \epsilon_t. \tag{8.3}
\]
A realisation and the corresponding periodograms are given in Figure 8.2. The results are different to the iid case. The peak in the periodogram no longer corresponds to the period of the sin function. From the periodogram of just the AR(2) process we observe that it is erratic, just as in the iid case; however, there appear to be varying degrees of variation over the frequencies (though this is not so obvious in this plot). We recall from Chapters 2 and 3 that the AR(2) process has a pseudo-period, which means the periodogram of the colored noise will have pronounced peaks which correspond to the frequencies around the pseudo-period. It is these pseudo-periods which are dominating the periodogram, which is giving a peak at a frequency that does not correspond to the sin function. However, asymptotically the rates given in (8.2) still hold in this case too. In other words, for large enough sample sizes the DFT of the signal should dominate the noise. To see that this is the case, we increase the sample size to $n = 1024$; a realisation is given in Figure 8.3. We see that the period corresponding to the sin function dominates the periodogram. Studying the periodogram of just the AR(2) noise we see that it is still erratic (despite the large sample size), but we also observe that the variability clearly changes over frequency.
[Figure 8.2: panels signal2 (against Time), PS, P1 and P2 (against frequency).]
[Figure 8.3: panels signal3 (against Time), PS, P1 and P2 (against frequency).]
From now on we focus on stationary time series (where the mean is either constant or zero). As we have observed above, the periodogram is the absolute square of the discrete Fourier Transform (DFT), where
\[
J_n(\omega_k) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\exp(it\omega_k). \tag{8.4}
\]
This is simply a (linear) transformation of the data, thus it is easily reversible by taking the inverse DFT
\[
X_t = \frac{\sqrt{2\pi}}{\sqrt n}\sum_{k=1}^{n}J_n(\omega_k)\exp(-it\omega_k). \tag{8.5}
\]
Therefore, just as one often analyzes the log transform of data (which is also an invertible transform), one can analyze a time series through its DFT.
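The invertibility of the DFT is easy to verify numerically. The R sketch below (the sample size, seed and variable names are assumptions for the example) evaluates (8.4) directly, inverts it with (8.5), and also shows how the same quantities can be obtained from R's `fft`, which uses the convention $\sum_{t=1}^{n}x_t\exp(-2\pi ik(t-1)/n)$.

```r
# Sketch: the DFT (8.4), its inverse (8.5), and the link to fft().
set.seed(6)
n <- 128
x <- rnorm(n)
tt <- 1:n
omega <- 2 * pi * (1:n) / n            # fundamental frequencies omega_1, ..., omega_n

# Direct evaluation of J_n(omega_k) = (2 pi n)^{-1/2} sum_t X_t exp(i t omega_k).
J <- sapply(omega, function(w) sum(x * exp(1i * tt * w))) / sqrt(2 * pi * n)

# Inverse DFT: X_t = sqrt(2 pi / n) sum_k J_n(omega_k) exp(-i t omega_k).
x_back <- sapply(tt, function(t) sum(J * exp(-1i * t * omega))) * sqrt(2 * pi / n)

# Same DFT via fft(): conjugate (x is real) and correct the phase for the
# index shift t - 1 -> t; entry k + 1 of fft(x) corresponds to frequency omega_k.
k <- 1:(n - 1)
J_fft <- Conj(fft(x))[k + 1] * exp(1i * omega[k]) / sqrt(2 * pi * n)
```

The direct sums cost $O(n^2)$ operations, whereas `fft` needs only $O(n\log n)$, which matters for long series.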
In Figure 8.4 we give plots of the periodogram of an iid sequence and the AR(2) process defined in equation (8.3). We recall from Chapter 3 that the periodogram is an inconsistent estimator of the spectral density function $f(\omega) = (2\pi)^{-1}\sum_{r=-\infty}^{\infty}c(r)\exp(ir\omega)$; a plot of the spectral density function corresponding to the iid and AR(2) process defined in (8.3) is given in Figure 8.5. We will show later that by inconsistent estimator we mean that $E[|J_n(\omega_k)|^2] = f(\omega_k) + O(n^{-1})$ but $var[|J_n(\omega_k)|^2] \not\to 0$ as $n \to \infty$. This explains why the general shape of $|J_n(\omega_k)|^2$ looks like $f(\omega_k)$ but $|J_n(\omega_k)|^2$ is extremely erratic.
Figure 8.4: Left: Periodogram of iid noise. Right: Periodogram of AR(2) process. (Panels P1 and P2; horizontal axis: frequency.)
Figure 8.5: Left: Spectral density of iid noise. Right: Spectral density of AR(2), note that the interval [0, 1] corresponds to $[0, 2\pi]$ in Figure 8.5. (Panels IID and Autoregressive (2); axes: spectrum against frequency.)
Remark 8.1.1 (Properties of the spectral density function) The spectral density function was first introduced in Section 1.6. We recall that given an autocovariance $\{c(k)\}$, the spectral density is defined as
\[
f(\omega) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c(r)\exp(ir\omega).
\]
And vice versa, given the spectral density we can recover the autocovariance via the inverse transform $c(r) = \int_0^{2\pi}f(\omega)\exp(-ir\omega)d\omega$. We recall from Section 1.6 that the spectral density function can be used to construct a valid autocovariance function since only a sequence whose Fourier transform is real and positive can be positive definite.
In Section 5.4 we used the spectral density function to define conditions under which the variance covariance matrix of a stationary time series had minimum and maximum eigenvalues. Now from the discussion above we observe that the variance of the DFT is approximately the spectral density function (note that for this reason the spectral density is sometimes called the power spectrum).
We now collect some of the above observations, to summarize some of the basic properties of the DFT:
(i) We note that $J_n(\omega_k) = \overline{J_n(\omega_{n-k})}$, therefore all the information on the time series is contained in the first $n/2$ frequencies $\{J_n(\omega_k); k = 1,\ldots,n/2\}$.
(ii) If the time series has mean $E[X_t] = \mu$ and $k \ne 0$ then
\[
E[J_n(\omega_k)] = \frac{\mu}{\sqrt{2\pi n}}\sum_{t=1}^{n}\exp(it\omega_k) = 0.
\]
In other words, the mean of the DFT is zero regardless of whether the time series has a zero mean (it just needs to have a constant mean).
(iii) However, unlike the original stationary time series, we observe that the variance of the DFT depends on frequency (unless it is a white noise process) and that for $k \ne 0$, $var[J_n(\omega_k)] = E[|J_n(\omega_k)|^2] = f(\omega_k) + O(n^{-1})$.
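The three properties can be checked numerically. The R sketch below is illustrative only: the helper `dft_at`, the AR(1) model with $\phi = 0.5$ and unit innovation variance, the frequency index $k = 32$ and the number of replications are all assumptions made for the example. Properties (i) and (ii) are checked exactly, and (iii) by a small Monte Carlo experiment.

```r
# Sketch: checking properties (i)-(iii) of the DFT.
dft_at <- function(x, k) {
  x <- as.numeric(x)
  n <- length(x)
  sum(x * exp(1i * (1:n) * 2 * pi * k / n)) / sqrt(2 * pi * n)
}

set.seed(7)
n <- 256
k <- 32
x <- rnorm(n)

# (i) conjugate symmetry: J_n(omega_k) equals the conjugate of J_n(omega_{n-k}).
sym_err <- Mod(dft_at(x, k) - Conj(dft_at(x, n - k)))

# (ii) adding a constant mean does not change the DFT at omega_k, k != 0.
mean_shift <- Mod(dft_at(x + 10, k) - dft_at(x, k))

# (iii) E|J_n(omega_k)|^2 is close to f(omega_k); here for an AR(1) process.
phi <- 0.5
I_k <- replicate(2000, Mod(dft_at(arima.sim(list(ar = phi), n = n), k))^2)
f_k <- 1 / (2 * pi * Mod(1 - phi * exp(1i * 2 * pi * k / n))^2)
ratio <- mean(I_k) / f_k
```

The individual values in `I_k` vary widely around `f_k`, which is the inconsistency of the periodogram mentioned above; only their average is close to the spectral density.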
The focus of this chapter will be on properties of the spectral density function (proving some of the results we stated previously) and on the so called Cramer representation (or spectral representation) of a second order stationary time series. However, before we go into these results (and proofs) we give one final reason why the analysis of a time series is frequently done by transforming to the frequency domain via the DFT. Above we showed that there is a one-to-one correspondence between the DFT and the original time series; below we show that the DFT almost decorrelates the stationary time series. In other words, one of the main advantages of working within the frequency domain is that we have transformed a correlated time series into something that is almost uncorrelated (this also happens to be a heuristic reason behind the spectral representation theorem).
8.2
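Let $X_n = (X_1,\ldots,X_n)'$ be a random vector with variance $\Sigma_n$. It is clear that $\Sigma_n^{-1/2}X_n$ is an uncorrelated sequence, so to decorrelate $X_n$ in general we need to know $\Sigma_n^{-1/2}$. However, if $X_t$ is a second order stationary time series, something curious and remarkable happens. The DFT almost uncorrelates the $X_n$. The implication of this is extremely useful in time series, and we shall be using this transform in estimation in Chapter 9.
We start by defining the Fourier transform of $\{X_t\}_{t=1}^{n}$ as
\[
J_n(\omega_k) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\exp\Big(ik\frac{2\pi t}{n}\Big),
\]
where the frequencies $\omega_k = 2\pi k/n$ are often called the fundamental, Fourier frequencies.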
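Lemma 8.2.1 Suppose $\{X_t\}$ is a second order stationary time series, where $\sum_r|rc(r)| < \infty$. Then we have
\[
cov\Big(J_n\Big(\frac{2\pi k_1}{n}\Big), J_n\Big(\frac{2\pi k_2}{n}\Big)\Big) =
\begin{cases}
f(\frac{2\pi k_1}{n}) + O(\frac{1}{n}) & k_1 = k_2 \\
O(\frac{1}{n}) & k_1 \ne k_2
\end{cases}
\]
where $f(\omega) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c(r)\exp(ir\omega)$.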
In the sections below we give two proofs for the same result.
We note that the principal reason behind both proofs is that
\[
\sum_{t=1}^{n}\exp\Big(it\frac{2\pi j}{n}\Big) =
\begin{cases}
0 & j \notin n\mathbb{Z} \\
n & j \in n\mathbb{Z}
\end{cases}. \tag{8.6}
\]
8.2.1
We evaluate the DFT using the following piece of code (note that we do not standardize by $\sqrt{2\pi}$)
dft <- function(x) {
  # reconstructed helper (assumed form): DFT standardised by sqrt(n) only
  fft(x) / sqrt(length(x))
}
We treat
\[
Z_k = \frac{J_n(\omega_k)}{f(\omega_k)^{1/2}}
\]
as the transformed random variables; noting that $\{Z_k\}$ is complex, our aim is to show that the acf corresponding to $\{Z_k\}$ is close to zero. Of course, in practice we do not know the spectral density function $f$, therefore we estimate it using the piece of code (where test is the time series)
k <- kernel("daniell", 6)
temp2 <- spec.pgram(test, k, taper = 0, log = "no")$spec
n <- length(temp2)
# The next two lines are a reconstruction (assumed): extend the spectral
# estimate to all Fourier frequencies and standardise the DFT. This assumes
# that test has even length and is not padded by spec.pgram.
temp3 <- c(temp2[1], temp2, rev(temp2)[-1])
temp4 <- dft(test) / sqrt(temp3)
We then estimate the covariance of the standardised DFT,
\[
\hat C_n(r) = \frac{1}{n}\sum_{k=1}^{n}\hat Z_k\overline{\hat Z_{k+r}} = \frac{1}{n}\sum_{k=1}^{n}\frac{J_n(\omega_k)\overline{J_n(\omega_{k+r})}}{\sqrt{\hat f_n(\omega_k)\hat f_n(\omega_{k+r})}};
\]
to do this we exploit the speed of the FFT (Fast Fourier Transform)
temp5 <- Mod(dft(temp4))**2
dftcov <- fft(temp5, inverse = TRUE)/(length(temp5))
dftcov1 <- dftcov[-1]   # drop lag zero
and make an ACF plot of the real and imaginary parts of the Fourier ACF:
n <- length(temp5)
par(mfrow = c(2, 1))
plot(sqrt(n)*Re(dftcov1[1:30]))
lines(c(1, 30), c(1.96, 1.96))
lines(c(1, 30), c(-1.96, -1.96))
plot(sqrt(n)*Im(dftcov1[1:30]))
lines(c(1, 30), c(1.96, 1.96))
lines(c(1, 30), c(-1.96, -1.96))
Note that the 1.96 corresponds to the 2.5% limits; however this bound only holds if the time series is Gaussian. If it is non-Gaussian some corrections have to be made (see Dwivedi and Subba Rao (2011) and Jentsch and Subba Rao (2014)). A plot of the AR(2) model
\[
\varepsilon_t = 1.5\varepsilon_{t-1} - 0.75\varepsilon_{t-2} + \epsilon_t
\]
together with the real and imaginary parts of its DFT autocovariance is given in Figure 8.6.
Figure 8.6: Top: Realization. Bottom: Real and Imaginary of $\hat C_n(r)$ plotted against the lag $r$.
Exercise 8.1
(a) Simulate an AR(2) process and run the above code using the sample size
8.2.2
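Let $X_n' = (X_n,\ldots,X_1)$ and $F_n$ be the Fourier transformation matrix $(F_n)_{s,t} = n^{-1/2}\omega_n^{(s-1)(t-1)} = n^{-1/2}\exp(\frac{2\pi i(s-1)(t-1)}{n})$ (note that $\omega_n = \exp(\frac{2\pi i}{n})$). It is clear that $F_nX_n = (J_n(\omega_0),\ldots,J_n(\omega_{n-1}))'$ (up to the standardisation by $\sqrt{2\pi}$ and the ordering of the observations). Writing the variance matrix of $X_n$ as
\[
\Sigma_n = \begin{pmatrix}
c(0) & c(1) & c(2) & \ldots & c(n-1) \\
c(1) & c(0) & c(1) & \ldots & c(n-2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c(n-1) & c(n-2) & \ldots & c(1) & c(0)
\end{pmatrix},
\]
we observe that it can be written as the sum of two circulant matrices, plus some error, that we will bound. That is, we define the two circulant matrices
\[
C_{1n} = \begin{pmatrix}
c(0) & c(1) & c(2) & \ldots & c(n-1) \\
c(n-1) & c(0) & c(1) & \ldots & c(n-2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c(1) & c(2) & \ldots & c(n-1) & c(0)
\end{pmatrix}
\]
and
\[
C_{2n} = \begin{pmatrix}
0 & c(n-1) & c(n-2) & \ldots & c(1) \\
c(1) & 0 & c(n-1) & \ldots & c(2) \\
c(2) & c(1) & 0 & \ldots & c(3) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c(n-1) & c(n-2) & \ldots & c(1) & 0
\end{pmatrix}.
\]
We observe that the upper right hand sides of $C_{1n}$ and $\Sigma_n$ match and the lower left hand sides of $C_{2n}$ and $\Sigma_n$ match. As the above are circulant their eigenvector matrix is $F_n$ (note that $F_n^{-1} = \bar F_n'$). Furthermore, the eigenvalue matrix of $C_{1n}$ is
\[
diag\Big(\sum_{j=0}^{n-1}c(j), \sum_{j=0}^{n-1}c(j)\omega_n^{j}, \ldots, \sum_{j=0}^{n-1}c(j)\omega_n^{(n-1)j}\Big),
\]
whereas the eigenvalue matrix of $C_{2n}$ is
\[
diag\Big(\sum_{j=1}^{n-1}c(n-j), \sum_{j=1}^{n-1}c(n-j)\omega_n^{j}, \ldots, \sum_{j=1}^{n-1}c(n-j)\omega_n^{(n-1)j}\Big)
= diag\Big(\sum_{j=1}^{n-1}c(j), \sum_{j=1}^{n-1}c(j)\omega_n^{-j}, \ldots, \sum_{j=1}^{n-1}c(j)\omega_n^{-(n-1)j}\Big).
\]
More succinctly, the $k$th eigenvalues of $C_{1n}$ and $C_{2n}$ are $\lambda_{k1} = \sum_{j=0}^{n-1}c(j)\omega_n^{j(k-1)}$ and $\lambda_{k2} = \sum_{j=1}^{n-1}c(j)\omega_n^{-j(k-1)}$. Observe that $\lambda_{k1} + \lambda_{k2} = \sum_{|j|\le(n-1)}c(j)e^{2\pi ij(k-1)/n} \approx 2\pi f(\omega_{k-1})$, thus the sum of these eigenvalues approximates (up to the factor $2\pi$) the spectral density function.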
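The circulant construction can be verified numerically for a small $n$. In the R sketch below the autocovariance $c(r) = 0.5^{|r|}$, the dimension $n = 8$ and the helper `circulant` are assumptions made for the example. It builds $C_{1n}$, $C_{2n}$ and $F_n$ and checks that $F_n(C_{1n} + C_{2n})\bar F_n'$ is diagonal with entries $\sum_{|r|\le n-1}c(r)\exp(ir\omega_k)$.

```r
# Sketch: the Fourier matrix diagonalises the circulant matrices C1n and C2n.
n <- 8
cvec <- 0.5^(0:(n - 1))                  # c(0), ..., c(n-1), an assumed example
omega_n <- exp(2i * pi / n)
Fn <- outer(0:(n - 1), 0:(n - 1), function(s, t) omega_n^(s * t)) / sqrt(n)

# Circulant matrix whose (s, t) entry is first[((t - s) mod n) + 1].
circulant <- function(first) {
  m <- length(first)
  outer(1:m, 1:m, function(s, t) first[((t - s) %% m) + 1])
}

C1 <- circulant(cvec)                    # first row c(0), c(1), ..., c(n-1)
C2 <- circulant(c(0, rev(cvec[-1])))     # first row 0, c(n-1), ..., c(1)
Sigma <- toeplitz(cvec)

# F_n (C1 + C2) conj(F_n)' should be diagonal.
D <- Fn %*% (C1 + C2) %*% Conj(t(Fn))
fn <- sapply(0:(n - 1), function(k) {
  r <- -(n - 1):(n - 1)
  Re(sum(cvec[abs(r) + 1] * exp(1i * r * 2 * pi * k / n)))
})
diag_err <- max(Mod(D - diag(fn)))
```

Applying the same transformation to `Sigma` instead of `C1 + C2` gives a matrix that is only approximately diagonal; the off-diagonal entries are of order $1/n$.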
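We now show that under the condition $\sum_r|rc(r)| < \infty$ we have
\[
F_n\Sigma_n\bar F_n' - F_n(C_{1n} + C_{2n})\bar F_n' = O\Big(\frac{1}{n}\Big)I, \tag{8.7}
\]
where $I$ is an $n\times n$ matrix of ones. To show the above we consider the differences element by element. Since the upper right hand sides of $C_{1n}$ and $\Sigma_n$ match and the lower left hand sides of $C_{2n}$ and $\Sigma_n$ match, the $(s,t)$th entry of $\Sigma_n - C_{1n} - C_{2n}$ is $-c(n-|s-t|)$ for $s \ne t$ and zero on the diagonal. Since every entry of $F_n$ has modulus $n^{-1/2}$, each entry of the difference in (8.7) is bounded in absolute value by
\[
\frac{1}{n}\sum_{s\ne t}|c(n-|s-t|)| = \frac{2}{n}\sum_{r=1}^{n-1}r|c(r)| = O(n^{-1}).
\]
Thus we have shown (8.7). Therefore, since $F_n$ is the eigenvector matrix of $C_{1n}$ and $C_{2n}$, altogether we have
\[
F_n(C_{1n} + C_{2n})\bar F_n' = diag\Big(f_n(0), f_n\Big(\frac{2\pi}{n}\Big), \ldots, f_n\Big(\frac{2\pi(n-1)}{n}\Big)\Big),
\]
where $f_n(\omega) = \sum_{r=-(n-1)}^{n-1}c(r)\exp(ir\omega)$. Altogether this gives
\[
var(F_nX_n) = F_n\Sigma_n\bar F_n' =
\begin{pmatrix}
f_n(0) & 0 & \ldots & 0 \\
0 & f_n(\frac{2\pi}{n}) & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & \ldots & 0 & f_n(\frac{2\pi(n-1)}{n})
\end{pmatrix}
+ O\Big(\frac{1}{n}\Big)
\begin{pmatrix}
1 & 1 & \ldots & 1 \\
\vdots & \vdots & \ddots & \vdots \\
1 & 1 & \ldots & 1
\end{pmatrix}.
\]
Finally, we note that since
\[
|f_n(\omega) - 2\pi f(\omega)| \le \sum_{|r|\ge n}|c(r)| \le \frac{1}{n}\sum_{|r|\ge n}|rc(r)| = O(n^{-1}), \tag{8.8}
\]
and $F_nX_n$ differs from the DFT only through the standardisation by $\sqrt{2\pi}$, the result follows.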
8.2.3
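A more hands on proof is to just calculate $cov(J_n(\frac{2\pi k_1}{n}), J_n(\frac{2\pi k_2}{n}))$. The important aspect of this proof is that if we can isolate the exponentials then we can use (8.6). It is this that gives rise to the near uncorrelatedness property. Remember also that $\exp(i\frac{2\pi}{n}jk) = \exp(ij\omega_k) = \exp(ik\omega_j)$, hence
\[
\begin{aligned}
cov\Big(J_n\Big(\frac{2\pi k_1}{n}\Big), J_n\Big(\frac{2\pi k_2}{n}\Big)\Big)
&= \frac{1}{2\pi n}\sum_{t,\tau=1}^{n}c(t-\tau)\exp(it\omega_{k_1} - i\tau\omega_{k_2}) \\
&= \frac{1}{2\pi}\sum_{r=-(n-1)}^{n-1}c(r)\exp\Big(ir\frac{2\pi k_2}{n}\Big)\underbrace{\frac{1}{n}\sum_{t=1}^{n}\exp\Big(\frac{2\pi it(k_1-k_2)}{n}\Big)}_{\delta_{k_1}(k_2)} + R_n,
\end{aligned}
\]
where
\[
R_n = -\frac{1}{2\pi n}\sum_{r=-(n-1)}^{n-1}c(r)\exp\Big(ir\frac{2\pi k_2}{n}\Big)\sum_{t\in E_r}\exp\Big(\frac{2\pi it(k_1-k_2)}{n}\Big)
\]
and $E_r$ is the set of $|r|$ indices $t \in \{1,\ldots,n\}$ for which $\tau = t - r$ falls outside $\{1,\ldots,n\}$ (that is, $E_r = \{1,\ldots,r\}$ for $r > 0$ and $E_r = \{n-|r|+1,\ldots,n\}$ for $r < 0$). Thus $|R_n| \le \frac{1}{2\pi n}\sum_{|r|\le n}|rc(r)| = O(n^{-1})$, which together with (8.6) gives the result.
Exercise 8.2 The above proof (in Section 8.2.3) uses that $\sum_r|rc(r)| < \infty$. What do we obtain if we relax this assumption to $\sum_r|c(r)| < \infty$?
8.2.4
Heuristics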
In this section we summarize some spectral properties. We do this by considering the DFT of the data $\{J_n(\omega_k)\}_{k=1}^{n}$. It is worth noting that calculating $\{J_n(\omega_k)\}_{k=1}^{n}$ is computationally very fast and requires only $O(n\log n)$ computing operations (see Section A.5, where the Fast Fourier Transform is described).
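By the inverse DFT (8.5) we have
\[
X_t = \frac{\sqrt{2\pi}}{\sqrt n}\sum_{k=1}^{n}J_n(\omega_k)\exp(-it\omega_k) \tag{8.9}
\]
\[
= \sum_{k=1}^{n}\exp(-it\omega_k)[Z_n(\omega_k) - Z_n(\omega_{k-1})] = \int_0^{2\pi}\exp(-it\omega)dZ_n(\omega), \tag{8.10}
\]
where $Z_n(\omega) = \frac{\sqrt{2\pi}}{\sqrt n}\sum_{k=1}^{\lfloor\frac{\omega n}{2\pi}\rfloor}J_n(\omega_k)$.
The second order stationary property of $X_t$ means that the DFT $J_n(\omega_k)$ is close to an uncorrelated sequence or equivalently the process $Z_n(\omega)$ has near orthogonal increments, meaning that for any two non-intersecting intervals $[\omega_1,\omega_2]$ and $[\omega_3,\omega_4]$ the increments $Z_n(\omega_2) - Z_n(\omega_1)$ and $Z_n(\omega_4) - Z_n(\omega_3)$ are almost uncorrelated.
The spectral representation theorem generalizes this result; it states that for any second order stationary time series $\{X_t\}$ there exists a process $\{Z(\omega); \omega \in [0,2\pi]\}$ where for all $t \in \mathbb{Z}$
\[
X_t = \int_0^{2\pi}\exp(-it\omega)dZ(\omega) \tag{8.11}
\]
and $Z(\omega)$ has orthogonal increments, meaning that for any two non-intersecting intervals $[\omega_1,\omega_2]$ and $[\omega_3,\omega_4]$, $E[Z(\omega_2) - Z(\omega_1)]\overline{[Z(\omega_4) - Z(\omega_3)]} = 0$.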
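We now explore the relationship between the DFT and the orthogonal increment process.
Using (8.11) we see that
$$J_n(\omega_k) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\exp(it\omega_k) = \frac{1}{\sqrt{2\pi n}}\int_0^{2\pi}\left(\sum_{t=1}^{n}\exp(it[\omega_k-\omega])\right)dZ(\omega)$$
$$\phantom{J_n(\omega_k)} = \frac{1}{\sqrt{2\pi n}}\int_0^{2\pi} e^{i(n+1)(\omega_k-\omega)/2}\,\frac{\sin[n(\omega_k-\omega)/2]}{\sin[(\omega_k-\omega)/2]}\,dZ(\omega).$$
The ratio of sines is, up to the shift from $n$ to $n+1$ in the numerator, the Dirichlet kernel $D_{n/2}(x) = \sin[((n+1)/2)x]/\sin(x/2)$ (see Priestley (1983), page
419). We recall that the Dirichlet kernel converges to the Dirac delta function, therefore very crudely
speaking we observe that the DFT is an approximation of the orthogonal increment localized about
$\omega_k$ (though mathematically this is not strictly correct).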
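Bochner's theorem
This is a closely related result that is stated in terms of the so called spectral distribution. First
the heuristics. We see from Lemma 8.2.1 that the DFT $J_n(\omega_k)$ is close to uncorrelated. Using
this and inverse Fourier transforms we see that for $1\le t,\tau\le n$ we have
$$c(t-\tau) = \mathrm{cov}(X_t,X_\tau) = \frac{1}{n}\sum_{k_1=1}^{n}\sum_{k_2=1}^{n}\mathrm{cov}\left(J_n(\omega_{k_1}),J_n(\omega_{k_2})\right)\exp(-it\omega_{k_1}+i\tau\omega_{k_2}) \approx \frac{1}{n}\sum_{k=1}^{n}\mathrm{var}(J_n(\omega_k))\exp(-i(t-\tau)\omega_k). \qquad (8.12)$$
Let $F_n(\omega) = \frac{1}{n}\sum_{k=1}^{\lfloor \frac{\omega}{2\pi}n\rfloor}\mathrm{var}(J_n(\omega_k))$; then (8.12) can be written as
$$c(t-\tau) \approx \int_0^{2\pi}\exp(-i(t-\tau)\omega)\,dF_n(\omega).$$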
c(t ) =
Z
exp(i(t ))f ()d =
j tj ,
j=
j= j
A() exp(ik)dZ(),
(8.13)
$E(|dZ(\omega)|^2) = d\omega$, i.e. the variance of the increments does not vary over frequency (as this variation has
been absorbed by $A(\omega)$, since $F'(\omega) = |A(\omega)|^2$).
We mention that a more detailed discussion on spectral analysis in time series is given in Priestley
(1983), Chapters 4 and 6, Brockwell and Davis (1998), Chapters 4 and 10, Fuller (1995), Chapter
3, and Shumway and Stoffer (2006), Chapter 4. Many of these references also discuss tests for
periodicity (see also Quinn and Hannan (2001) for estimation of frequencies).
8.3
8.3.1
We start by showing that under certain strong conditions the spectral density function is nonnegative. We later weaken these conditions (and this is often called Bochner's theorem).
Theorem 8.3.1 (Positiveness of the spectral density) Suppose the coefficients $\{c(k)\}$ are absolutely summable (that is $\sum_k |c(k)| < \infty$). Then the sequence $\{c(k)\}$ is positive semi-definite if and
only if the function $f(\omega)$, where
$$f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} c(k)\exp(ik\omega),$$
is nonnegative. Moreover
$$c(k) = \int_0^{2\pi}\exp(-ik\omega)f(\omega)\,d\omega. \qquad (8.14)$$
It is worth noting that $f$ is called the spectral density corresponding to the covariances $\{c(k)\}$.
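As a rough numerical illustration of Theorem 8.3.1, the sketch below evaluates $f$ for two short symmetric sequences and inverts it with a Riemann sum. The sequences and the helper names `spectral_density` and `invert` are assumptions made for the example: the first sequence is the autocovariance of an MA(1) with coefficient 0.6 and unit innovation variance, the second ($c(0)=1$, $c(\pm1)=0.8$) is deliberately not a covariance.

```python
import cmath
import math

def spectral_density(c, omega):
    """f(omega) = (1/2pi) sum_k c(k) exp(i k omega) for symmetric c given as {lag >= 0: value}."""
    total = c[0]
    for k, ck in c.items():
        if k > 0:
            total += 2 * ck * math.cos(k * omega)
    return total / (2 * math.pi)

def invert(c, k, grid=512):
    """Riemann-sum approximation of the integral of exp(-i k w) f(w) over [0, 2pi]."""
    h = 2 * math.pi / grid
    return sum(cmath.exp(-1j * k * j * h) * spectral_density(c, j * h)
               for j in range(grid)) * h

valid = {0: 1 + 0.6 ** 2, 1: 0.6}   # MA(1) autocovariance, theta = 0.6
invalid = {0: 1.0, 1: 0.8}          # lag-one correlation 0.8 is too large

grid = [2 * math.pi * j / 512 for j in range(512)]
min_valid = min(spectral_density(valid, w) for w in grid)
min_invalid = min(spectral_density(invalid, w) for w in grid)
print(min_valid > 0, min_invalid < 0)
print(invert(valid, 0), invert(valid, 1), invert(valid, 2))
```

The negative minimum for the second sequence is exactly the failure of nonnegative definiteness that the theorem detects, while the inversion of the first recovers $c(0)$, $c(1)$ and $c(2)=0$ as in (8.14).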
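PROOF. We first show that if $\{c(k)\}$ is a nonnegative definite sequence, then $f(\omega)$ is a nonnegative
function. We recall that since $\{c(k)\}$ is nonnegative definite, for any sequence $x = (x_1,\ldots,x_n)$ (real
or complex) we have $\sum_{s,t=1}^{n} x_s c(s-t)\bar{x}_t \ge 0$ (where $\bar{x}_t$ is the complex conjugate of $x_t$). Now we
consider the above for the particular case $x = (\exp(i\omega),\ldots,\exp(in\omega))$. Define the function
$$f_n(\omega) = \frac{1}{2\pi n}\sum_{s,t=1}^{n}\exp(is\omega)c(s-t)\exp(-it\omega).$$
Thus $f_n(\omega)\ge 0$. Collecting the terms with $s-t=k$ gives
$$f_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\left(\frac{n-|k|}{n}\right)c(k)\exp(ik\omega).$$
Comparing $f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}c(k)\exp(ik\omega)$ with $f_n(\omega)$ we see that
$$|f(\omega)-f_n(\omega)| \le \left|\frac{1}{2\pi}\sum_{|k|\ge n}c(k)\exp(ik\omega)\right| + \left|\frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\frac{|k|}{n}c(k)\exp(ik\omega)\right| := I_n + II_n.$$
Since $\sum_{k=-\infty}^{\infty}|c(k)| < \infty$, both $I_n\to0$ and $II_n\to0$, and therefore
$$|f(\omega)-f_n(\omega)| \to 0 \quad \text{as } n\to\infty. \qquad (8.15)$$
Now it is clear that since for all $n$, $f_n(\omega)$ are nonnegative functions, the limit $f$ must be nonnegative
(if we suppose the contrary, then there must exist a subsequence of functions $\{f_{n_k}(\omega)\}$ which are not
necessarily nonnegative, which is not true). Therefore we have shown that if $\{c(k)\}$ is a nonnegative
definite sequence, then $f(\omega)$ is a nonnegative function.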
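We now show the converse, that is, the Fourier coefficients of any nonnegative $\ell_2$ function $f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}c(k)\exp(ik\omega)$ form a positive semi-definite sequence. Writing $c(k) = \int_0^{2\pi} f(\omega)\exp(-ik\omega)\,d\omega$ we have
$$\sum_{s,t=1}^{n}x_s c(s-t)\bar{x}_t = \int_0^{2\pi} f(\omega)\sum_{s,t=1}^{n}x_s\exp(-i(s-t)\omega)\bar{x}_t\,d\omega = \int_0^{2\pi} f(\omega)\left|\sum_{s=1}^{n}x_s\exp(-is\omega)\right|^2 d\omega \ge 0.$$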
The above theorem is very useful. It gives a simple way to check whether a sequence
$\{c(k)\}$ is nonnegative definite or not (hence whether it is a covariance function; recall Theorem 1.6.1). See Brockwell and Davis (1998), Corollary 4.3.2 or Fuller (1995), Theorem 3.1.9, for
alternative explanations.
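Example 8.3.1 Consider the empirical covariances (here we give an alternative proof to an earlier remark)
$$c_n(k) = \begin{cases}\frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|} & |k|\le n-1\\ 0 & \text{otherwise.}\end{cases}$$
Its Fourier transform is
$$\sum_{k=-(n-1)}^{n-1}\exp(ik\omega)c_n(k) = \sum_{k=-(n-1)}^{n-1}\exp(ik\omega)\frac{1}{n}\sum_{t=1}^{n-|k|}X_tX_{t+|k|} = \frac{1}{n}\left|\sum_{t=1}^{n}X_t\exp(it\omega)\right|^2 \ge 0.$$
Hence, by Theorem 8.3.1, $\{c_n(k)\}$ is a nonnegative definite sequence.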
and
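PROOF. Let $e_1$ be the eigenvector with smallest eigenvalue $\lambda_1$ corresponding to $\Sigma_n$. Then using
$c(s-t) = \int_0^{2\pi} f(\omega)\exp(-i(s-t)\omega)\,d\omega$ we have
$$\lambda_{\min}(\Sigma_n) = e_1'\Sigma_n e_1 = \sum_{s,t=1}^{n}e_{s,1}c(s-t)e_{t,1} = \int_0^{2\pi} f(\omega)\sum_{s,t=1}^{n}e_{s,1}\exp(-i(s-t)\omega)e_{t,1}\,d\omega$$
$$= \int_0^{2\pi} f(\omega)\left|\sum_{s=1}^{n}e_{s,1}\exp(-is\omega)\right|^2 d\omega \ge \inf_\omega f(\omega)\int_0^{2\pi}\left|\sum_{s=1}^{n}e_{s,1}\exp(-is\omega)\right|^2 d\omega = 2\pi\inf_\omega f(\omega),$$
since by definition $\int_0^{2\pi}|\sum_{s=1}^{n}e_{s,1}\exp(-is\omega)|^2 d\omega = 2\pi\sum_{s=1}^{n}|e_{s,1}|^2 = 2\pi$ (using Parseval's identity). A similar argument gives the corresponding upper bound for the largest eigenvalue in terms of $\sup_\omega f(\omega)$.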
We now state a version of the above result which requires weaker conditions on the autocovariance function (only that they decay to zero).
Lemma 8.3.2 Suppose the covariance $\{c(k)\}$ decays to zero as $k\to\infty$. Then for all $n$, $\Sigma_n = \mathrm{var}(\underline{X}_n)$ is a nonsingular matrix. (Note that we do not require the stronger condition that the covariances
are absolutely summable.)
PROOF. See Brockwell and Davis (1998), Proposition 5.1.1.
8.3.2
Theorem 8.3.1 only holds when the sequence $\{c(k)\}$ is absolutely summable. Of course this may not
always be the case. An extreme example is the time series $X_t = Z$. Clearly this is a stationary time
series and its covariance is $c(k) = \mathrm{var}(Z) = 1$ for all $k$. In this case the autocovariances $\{c(k) = 1\}$
are not absolutely summable, hence the representation of the covariance in Theorem 8.3.1 does not
apply. This is because the Fourier transform of the infinite sequence $\{c(k) = 1\}_k$
is not well defined (clearly $\{c(k) = 1\}_k$ does not belong to $\ell_2$).
However, we now show that Theorem 8.3.1 can be generalised to include all nonnegative definite
sequences and stationary processes, by considering the spectral distribution rather than the spectral
density.
Theorem 8.3.2 A sequence $\{c(k)\}$ is a nonnegative definite sequence if and only if
$$c(k) = \int_0^{2\pi}\exp(-ik\omega)\,dF(\omega), \qquad (8.16)$$
where $F(\omega)$ is a distribution function on $[0,2\pi]$.
of this distribution function does not exist. However, this sequence will not belong to $\ell_2$ (i.e. the
correlation function will not decay to zero as the lag grows).
Figure 8.7: Both plots are of nondecreasing functions, hence are valid distribution functions.
The top plot is continuous and smooth, thus its derivative (the spectral density function)
exists. Whereas the bottom plot is not (spectral density does not exist).
PROOF of Theorem 8.3.2. We first show that if $\{c(k)\}$ is a nonnegative definite sequence,
then we can write $c(k) = \int_0^{2\pi}\exp(-ik\omega)\,dF(\omega)$, where $F(\omega)$ is a distribution function.
To prove the result we adapt some of the ideas used to prove Theorem 8.3.1. As in the proof
of Theorem 8.3.1 define the (nonnegative) function
$$f_n(\omega) = \mathrm{var}[J_n(\omega)] = \frac{1}{2\pi n}\sum_{s,t=1}^{n}\exp(is\omega)c(s-t)\exp(-it\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\left(\frac{n-|k|}{n}\right)c(k)\exp(ik\omega).$$
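If $\{c(k)\}$ is not absolutely summable, the limit of $f_n(\omega)$ is no longer well defined. Instead we consider its integral, which will always be a distribution function (in the sense that it is nondecreasing
and bounded). Let us define the function $F_n(\omega)$ whose derivative is $f_n(\omega)$, that is
$$F_n(\omega) = \int_0^{\omega}f_n(\lambda)\,d\lambda = \frac{\omega}{2\pi}c(0) + \frac{2}{2\pi}\sum_{r=1}^{n-1}\left(1-\frac{r}{n}\right)c(r)\frac{\sin(\omega r)}{r}, \qquad 0\le\omega\le2\pi.$$
Since $f_n(\lambda)$ is nonnegative, $F_n$ is a nondecreasing function. Furthermore it is bounded, since
$$F_n(2\pi) = \int_0^{2\pi}f_n(\lambda)\,d\lambda = c(0).$$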
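Hence $F_n$ satisfies all properties of a distribution and can be treated as a distribution function. This
means that we can use Helly's theorem, which states that for any sequence of distributions $\{G_n\}$
defined on $[0,2\pi]$, where $G_n(0) = 0$ and $\sup_n G_n(2\pi) < M < \infty$, there exists a subsequence $\{n_m\}_m$
where $G_{n_m}(x)\to G(x)$ as $m\to\infty$ for each $x\in[0,2\pi]$ at which $G$ is continuous. Furthermore, since
$G_{n_m}(x)\to G(x)$ (pointwise as $m\to\infty$), this implies that for any bounded continuous function $h$ we have that
$$\int h(x)\,dG_{n_m}(x) \to \int h(x)\,dG(x) \quad \text{as } m\to\infty.$$
Applying this to $\{F_n\}$, there exists a subsequence $\{F_{n_m}\}$ with limit $F$ such that
$$\int h(x)\,dF_{n_m}(x) \to \int h(x)\,dF(x) \quad \text{as } m\to\infty.$$
We focus on the function $h(\omega) = \exp(-ik\omega)$. It is clear that for every $k$ and $n$ we have
$$\int_0^{2\pi}\exp(-ik\omega)\,dF_n(\omega) = \int_0^{2\pi}\exp(-ik\omega)f_n(\omega)\,d\omega = \begin{cases}\left(1-\frac{|k|}{n}\right)c(k) & |k|\le n\\ 0 & \text{otherwise.}\end{cases} \qquad (8.17)$$
Define $d_{n,k} = \int_0^{2\pi}\exp(-ik\omega)\,dF_n(\omega)$. For fixed $k$,
$$d_{n,k} = \left(1-\frac{|k|}{n}\right)c(k) \to c(k) \qquad (8.18)$$
as $n\to\infty$. Thus
$$d_{n_m,k} = \int\exp(-ikx)\,dF_{n_m}(x) \to \int\exp(-ikx)\,dF(x) \quad \text{as } m\to\infty,$$
and since the limit is unique,
$$c(k) = \int\exp(-ikx)\,dF(x),$$
where $F(x)$ is a well defined distribution. This gives the first part of the assertion.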
To show the converse, that is, $\{c(k)\}$ is a nonnegative definite sequence when $c(k)$ is defined as
$c(k) = \int\exp(-ik\omega)\,dF(\omega)$, we use the same method given in the proof of Theorem 8.3.1, that is
$$\sum_{s,t=1}^{n}x_s c(s-t)\bar{x}_t = \int_0^{2\pi}\sum_{s,t=1}^{n}x_s\exp(-i(s-t)\omega)\bar{x}_t\,dF(\omega) = \int_0^{2\pi}\left|\sum_{s=1}^{n}x_s\exp(-is\omega)\right|^2 dF(\omega) \ge 0,$$
since $F(\omega)$ is a distribution.
Finally, if $\{c(k)\}$ were absolutely summable, then we can use Theorem 8.3.1 to write $c(k) = \int_0^{2\pi}\exp(-ik\omega)\,dF(\omega)$, where $F(\omega) = \int_0^{\omega}f(\lambda)\,d\lambda$ and $f(\lambda) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty}c(k)\exp(ik\lambda)$. By using
Theorem 8.3.1 we know that $f(\lambda)$ is nonnegative, hence $F(\omega)$ is a distribution, and we have the
result.
Example 8.3.2 Using the above we can construct the spectral distribution for the (rather silly)
time series $X_t = Z$. Let $F(\omega) = 0$ for $\omega < 0$ and $F(\omega) = \mathrm{var}(Z)$ for $\omega\ge0$ (hence $F$ is a step
function). Then we have
$$\mathrm{cov}(X_t,X_{t+k}) = \mathrm{var}(Z) = \int\exp(-ik\omega)\,dF(\omega).$$
2
[exp(ik) + exp(ik)] .
2
Observe that this covariance does not decay with the lag k. Then
Z
exp(ik)dF ().
where
F () =
8.4
<
2 /2 <
2
.
We now state the spectral representation theorem and give a rough outline of the proof.
Theorem 8.4.1 If $\{X_t\}$ is a second order stationary time series with mean zero and spectral distribution function $F(\omega)$, then there exists a right continuous,
orthogonal increment process $\{Z(\omega)\}$ (that is $E[(Z(\omega_1)-Z(\omega_2))\overline{(Z(\omega_3)-Z(\omega_4))}] = 0$ when the
intervals $[\omega_1,\omega_2]$ and $[\omega_3,\omega_4]$ do not overlap) such that
$$X_t = \int_0^{2\pi}\exp(it\omega)\,dZ(\omega), \qquad (8.19)$$
where for $\omega_1\ge\omega_2$, $E|Z(\omega_1)-Z(\omega_2)|^2 = F(\omega_1)-F(\omega_2)$ (noting that $F(0) = 0$). (One example
of a right continuous, orthogonal increment process is Brownian motion, though this is just one
example, and usually $Z(\omega)$ will be far more general than Brownian motion.)
Heuristically we see that (8.19) is the decomposition of $X_t$ in terms of frequencies, whose
amplitudes are orthogonal. In other words $X_t$ is decomposed in terms of frequencies $\exp(it\omega)$
which have the orthogonal amplitudes $dZ(\omega)\approx(Z(\omega+\delta)-Z(\omega))$.
Remark 8.4.1 Note that so far we have not defined the integral on the right hand side of (8.19). It
is known as a stochastic integral. Unlike many deterministic functions (functions whose derivative
exists), one cannot really suppose $dZ(\omega)\approx Z'(\omega)d\omega$, because usually a typical realisation of $Z(\omega)$
will not be smooth enough to differentiate. For example, it is well known that Brownian motion is quite
rough, that is a typical realisation of Brownian motion satisfies $|B(t_1,\bar\omega)-B(t_2,\bar\omega)| \le K(\bar\omega)|t_1-t_2|^{\gamma}$, where
$\bar\omega$ is a realisation and $\gamma<1/2$, but in general $\gamma$ will not be larger. The integral
$\int g(\omega)\,dZ(\omega)$ is well defined if it is defined as the limit (in the mean squared sense) of discrete
sums. More precisely, let $Z_n(\omega) = \sum_{k=1}^{n}Z(\omega_k)I_{[\omega_{k-1},\omega_k)}(\omega) = \sum_{k=1}^{\lfloor n\omega/2\pi\rfloor}[Z(\omega_k)-Z(\omega_{k-1})]$, then
$$\int g(\omega)\,dZ_n(\omega) = \sum_{k=1}^{n}g(\omega_k)\{Z(\omega_k)-Z(\omega_{k-1})\}.$$
The limit of $\int g(\omega)\,dZ_n(\omega)$ as $n\to\infty$ is $\int g(\omega)\,dZ(\omega)$ (in the mean squared sense, that is $E[\int g(\omega)\,dZ(\omega) - \int g(\omega)\,dZ_n(\omega)]^2\to0$). Compare this with our heuristics in equation (8.10).
For a more precise explanation, see Parzen (1959), Priestley (1983), Sections 3.6.3 and Section
4.11, page 254, and Brockwell and Davis (1998), Section 4.7. For a very good review of elementary
stochastic calculus see Mikosch (1999).
A very elegant explanation on the different proofs of the spectral representation theorem is given
in Priestley (1983), Section 4.11. We now give a rough outline of the proof using the functional
theory approach.
Rough PROOF of the Spectral Representation Theorem To prove the result we first
define two Hilbert spaces $H_1$ and $H_2$, where $H_1$ contains deterministic functions and $H_2$ contains
random variables.
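First we define the space
$$H_1 = \overline{\mathrm{sp}}\{e^{it\omega}; t\in\mathbb{Z}\}$$
with inner product
$$\langle f,g\rangle = \int_0^{2\pi}f(x)\overline{g(x)}\,dF(x). \qquad (8.20)$$
This inner product is well defined because $\langle f,f\rangle\ge0$ (since $F$ is a measure). It can be shown (see Brockwell and Davis
(1998), page 144) that $H_1 = \left\{g; \int_0^{2\pi}|g(\omega)|^2dF(\omega) < \infty\right\}$ (see footnote 1). We also define the space
$$H_2 = \overline{\mathrm{sp}}\{X_t; t\in\mathbb{Z}\}$$
with inner product $\mathrm{cov}(X,Y)$, and the linear mapping $T: H_1\to H_2$
$$T\left(\sum_{j=1}^{n}a_j\exp(ij\omega)\right) = \sum_{j=1}^{n}a_jX_j \qquad (8.21)$$
for any $n$ (it is necessary to show that this can be extended to infinite $n$, but we won't do so here).
Footnote 1: Roughly speaking it is because all continuous functions on $[0,2\pi]$ are dense in $L_2([0,2\pi],\mathcal{B},F)$ (using
the metric $\|f-g\| = \langle f-g,f-g\rangle$ and the limit of Cauchy sequences). Since all continuous functions can be
written as linear combinations of the Fourier basis, this gives the result.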
We will show that $T$ defines an isomorphism (i.e. it is a one-to-one linear mapping that preserves
norm). To show that it is a one-to-one mapping see Brockwell and Davis (1998), Section 4.7. It is
clear that it is linear, therefore all that remains is to show that the mapping preserves the inner product.
Suppose $f,g\in H_1$, then there exist coefficients $\{f_j\}$ and $\{g_j\}$ such that $f(\omega) = \sum_j f_j\exp(ij\omega)$
and $g(\omega) = \sum_j g_j\exp(ij\omega)$. Hence by definition of $T$ in (8.21) we have
$$\langle Tf,Tg\rangle = \mathrm{cov}\Big(\sum_j f_jX_j,\sum_j g_jX_j\Big) = \sum_{j_1,j_2}f_{j_1}\overline{g_{j_2}}\,\mathrm{cov}(X_{j_1},X_{j_2}) \qquad (8.22)$$
$$= \int_0^{2\pi}\sum_{j_1,j_2}f_{j_1}\overline{g_{j_2}}\exp(i(j_1-j_2)\omega)\,dF(\omega) = \int_0^{2\pi}f(x)\overline{g(x)}\,dF(x), \qquad (8.23)$$
where the second line uses Bochner's theorem (Theorem 8.3.2) together with $c(k) = c(-k)$.
Hence $\langle Tf,Tg\rangle = \langle f,g\rangle$, so the inner product is preserved (hence $T$ is an isometry).
Altogether this means that $T$ defines an isomorphism between $H_1$ and $H_2$. Therefore all functions
which are in $H_1$ have a corresponding random variable in $H_2$ which has similar properties.
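For all $\omega\in[0,2\pi]$, it is clear that the indicator functions $I_{[0,\omega]}(x)\in H_1$. Thus we define the
random function $\{Z(\omega); 0\le\omega\le2\pi\}$, where $T(I_{[0,\omega]}(\cdot)) = Z(\omega)\in H_2$ (since $T$ is an isomorphism).
Since the mapping $T$ is linear we observe that
$$T(I_{[\omega_1,\omega_2]}) = T(I_{[0,\omega_1]}-I_{[0,\omega_2]}) = T(I_{[0,\omega_1]})-T(I_{[0,\omega_2]}) = Z(\omega_1)-Z(\omega_2).$$
Moreover, since $T$ preserves the norm, for any nonintersecting intervals $[\omega_1,\omega_2]$ and $[\omega_3,\omega_4]$ we have
$$\mathrm{cov}\left(Z(\omega_1)-Z(\omega_2),\,Z(\omega_3)-Z(\omega_4)\right) = \langle T(I_{[\omega_1,\omega_2]}),T(I_{[\omega_3,\omega_4]})\rangle = \langle I_{[\omega_1,\omega_2]},I_{[\omega_3,\omega_4]}\rangle = \int I_{[\omega_1,\omega_2]}(\omega)I_{[\omega_3,\omega_4]}(\omega)\,dF(\omega) = 0.$$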
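Having defined the two spaces which are isomorphic, the random function $\{Z(\omega); 0\le\omega\le2\pi\}$ and the functions $I_{[0,\omega]}(x)$ which have orthogonal increments, we can now prove the result. Since
$dI_{[0,\omega]}(s) = \delta_\omega(s)ds$, where $\delta_\omega(s)$ is the Dirac delta function, any function $g\in L_2[0,2\pi]$ can be
represented as
$$g(\omega) = \int_0^{2\pi}g(s)\,dI_{[\omega,2\pi]}(s).$$
Thus for $g(\omega) = \exp(it\omega)$ we have
$$\exp(it\omega) = \int_0^{2\pi}\exp(its)\,dI_{[\omega,2\pi]}(s).$$
Therefore
$$T(\exp(it\omega)) = T\left(\int_0^{2\pi}\exp(its)\,dI_{[\omega,2\pi]}(s)\right) = \int_0^{2\pi}\exp(its)\,T[dI_{[\omega,2\pi]}(s)] = \int_0^{2\pi}\exp(its)\,dT[I_{[\omega,2\pi]}(s)],$$
where the mapping goes inside the integral due to the linearity of the isomorphism. Using that
$I_{[\omega,2\pi]}(s) = I_{[0,s]}(\omega)$ we have
$$T(\exp(it\omega)) = \int_0^{2\pi}\exp(its)\,dT[I_{[0,s]}(\omega)].$$
By definition we have $T(I_{[0,s]}(\omega)) = Z(s)$, which we substitute into the above to give
$$X_t = \int_0^{2\pi}\exp(its)\,dZ(s).$$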
It is worth taking a step back from the proof to see where the assumption of stationarity crept
in. By Bochner's theorem we have that
$$c(t-\tau) = \int\exp(-i(t-\tau)\omega)\,dF(\omega),$$
where $F$ is a distribution. We use $F$ to define the space $H_1$, the mapping $T$ (through $\{\exp(ik\omega)\}_k$),
the inner product and thus the isomorphism. However, it was the construction of the orthogonal
random functions $\{Z(\omega)\}$ that was instrumental. The main idea of the proof was that there are
functions $\{\phi_k(\omega)\}$ and a distribution $H$ such that all the covariances of the stochastic process $\{X_t\}$
can be written as
$$E(X_tX_\tau) = c(t,\tau) = \int_0^{2\pi}\phi_t(\omega)\overline{\phi_\tau(\omega)}\,dH(\omega),$$
where $H$ is a measure. As long as the above representation exists, we can define two spaces
$H_1$ and $H_2$, where $\{\phi_k\}$ is the basis of the functional space $H_1$, which contains all functions $f$ such
that $\int|f(\omega)|^2dH(\omega) < \infty$, and $H_2$ is the random space defined by $\overline{\mathrm{sp}}(X_t; t\in\mathbb{Z})$. From here we can
define an isomorphism $T: H_1\to H_2$, where for all functions $f(\omega) = \sum_k f_k\phi_k(\omega)\in H_1$
$$T(f) = \sum_k f_kX_k\in H_2.$$
An important example is $T(\phi_k) = X_k$. Now by using the same arguments as those in the proof
above we have
$$X_t = \int\phi_t(\omega)\,dZ(\omega),$$
where $\{Z(\omega)\}$ are orthogonal random functions and $E|Z(\omega)|^2 = H(\omega)$. We state this result in the
theorem below (see Priestley (1983), Section 4.11).
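Theorem 8.4.2 (General orthogonal expansions) Let $\{X_t\}$ be a time series (not necessarily
second order stationary) with covariance $\{E(X_tX_s) = c(t,s)\}$. If there exists a sequence of functions
$\{\phi_k(\omega)\}$ which satisfy for all $k$
$$\int_0^{2\pi}|\phi_k(\omega)|^2dH(\omega) < \infty$$
and the covariance admits the representation
$$c(t,s) = \int_0^{2\pi}\phi_t(\omega)\overline{\phi_s(\omega)}\,dH(\omega), \qquad (8.24)$$
where $H$ is a distribution, then for all $t$, $X_t$ has the representation
$$X_t = \int\phi_t(\omega)\,dZ(\omega), \qquad (8.25)$$
where $\{Z(\omega)\}$ are orthogonal random functions and $E|Z(\omega)|^2 = H(\omega)$. On the other hand, if $X_t$
has the representation (8.25), then $c(s,t)$ admits the representation (8.24).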
Remark 8.4.2 We mention that the above representation applies to both stationary and nonstationary time series. What makes the exponential functions $\{\exp(ik\omega)\}$ special is that if a process is
stationary then the representation of $c(k) := \mathrm{cov}(X_t,X_{t+k})$ in terms of exponentials is guaranteed:
$$c(k) = \int_0^{2\pi}\exp(-ik\omega)\,dF(\omega). \qquad (8.26)$$
Therefore there always exists an orthogonal random function $\{Z(\omega)\}$ such that
$$X_t = \int\exp(it\omega)\,dZ(\omega).$$
Indeed, whenever the exponential basis is used in the definition of either the covariance or the
process $\{X_t\}$, the resulting process will always be second order stationary.
We mention that it is not always guaranteed that for any basis $\{\phi_t\}$ we can represent the
covariance $\{c(t,s)\}$ as (8.24). However (8.25) is a very useful starting point for characterising
nonstationary processes.
8.5
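We obtain the spectral density function for MA($\infty$) processes. Using this we can easily obtain the
spectral density for ARMA processes. Let us suppose that $\{X_t\}$ satisfies the representation
$$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}, \qquad (8.27)$$
where $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$ and $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$. We
recall that the covariance of the above is
$$c(k) = E(X_tX_{t+k}) = \sigma^2\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+k}. \qquad (8.28)$$
Since $\sum_{j=-\infty}^{\infty}|\psi_j| < \infty$, it can be seen that
$$\sum_k|c(k)| \le \sigma^2\sum_k\sum_{j=-\infty}^{\infty}|\psi_j|\cdot|\psi_{j+k}| < \infty.$$
Hence by using Theorem 8.3.1, the spectral density function of $\{X_t\}$ is well defined.
There are several ways to derive the spectral density of $\{X_t\}$: we can either use (8.28) and $f(\omega) = \frac{1}{2\pi}\sum_k c(k)\exp(ik\omega)$, or obtain the spectral representation of $\{X_t\}$ and derive $f(\omega)$ from the spectral representation. We prove the results using the latter method.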
8.5.1
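Since $\{\varepsilon_t\}$ are iid random variables, using Theorem 8.4.1 there exists an orthogonal random function
$\{Z(\omega)\}$ such that
$$\varepsilon_t = \int_0^{2\pi}\exp(it\omega)\,dZ(\omega).$$
Since $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2$, multiplying the above by $\varepsilon_t$, taking expectations and noting that
due to the orthogonality of $\{Z(\omega)\}$ we have $E(dZ(\omega_1)\overline{dZ(\omega_2)}) = 0$ unless $\omega_1 = \omega_2$, we have that
$E(|dZ(\omega)|^2) = \frac{\sigma^2}{2\pi}d\omega$, hence $f_\varepsilon(\omega) = (2\pi)^{-1}\sigma^2$.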
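Using the above we obtain the following spectral representation for $\{X_t\}$:
$$X_t = \int_0^{2\pi}\left\{\sum_{j=-\infty}^{\infty}\psi_j\exp(-ij\omega)\right\}\exp(it\omega)\,dZ(\omega).$$
Hence
$$X_t = \int_0^{2\pi}A(\omega)\exp(it\omega)\,dZ(\omega), \qquad (8.29)$$
where $A(\omega) = \sum_{j=-\infty}^{\infty}\psi_j\exp(-ij\omega)$.
Definition 8.5.1 (The Cramer Representation) We mention that the representation in (8.29)
of a stationary process is usually called the Cramer representation of a stationary process, where
$$X_t = \int_0^{2\pi}A(\omega)\exp(it\omega)\,dZ(\omega).$$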
Cramer representation?
(ii) Suppose that $\{X_t\}$ has a causal AR(1) representation $X_t = \phi X_{t-1}+\varepsilon_t$. What is its Cramer
representation?
8.5.2
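Due to the orthogonality of $\{Z(\omega)\}$ we have $E(dZ(\omega_1)\overline{dZ(\omega_2)}) = 0$ unless $\omega_1 = \omega_2$; altogether this
gives
$$E(X_tX_{t+k}) = c(k) = \int_0^{2\pi}|A(\omega)|^2\exp(-ik\omega)E(|dZ(\omega)|^2) = \int_0^{2\pi}f(\omega)\exp(-ik\omega)\,d\omega,$$
where $f(\omega) = \frac{\sigma^2}{2\pi}|A(\omega)|^2$. Comparing the above with (8.14) we see that $f(\omega)$ is the spectral density
function.
The spectral density function corresponding to the linear process defined in (8.27) is
$$f(\omega) = \frac{\sigma^2}{2\pi}\left|\sum_{j=-\infty}^{\infty}\psi_j\exp(-ij\omega)\right|^2.$$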
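Remark 8.5.1 (An alternative, more hands on proof) An alternative proof which avoids the
Cramer representation is to use that the acf of a linear time series is $c(r) = \sigma^2\sum_j\psi_j\psi_{j+r}$ (see
Lemma 3.1.1). Thus by definition the spectral density function is
$$f(\omega) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c(r)\exp(ir\omega) = \frac{\sigma^2}{2\pi}\sum_{r=-\infty}^{\infty}\sum_{j=-\infty}^{\infty}\psi_j\psi_{j+r}\exp(ir\omega).$$
Making the change of variables $s = j+r$ gives
$$f(\omega) = \frac{\sigma^2}{2\pi}\sum_{j=-\infty}^{\infty}\sum_{s=-\infty}^{\infty}\psi_j\psi_s\exp(i(s-j)\omega) = \frac{\sigma^2}{2\pi}|A(\omega)|^2.$$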
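The two routes to the spectral density of a linear process (Fourier transform of the autocovariance, or squared modulus of the transfer function) can be compared numerically. The sketch below is only an illustration: the MA(2) coefficients, the innovation variance and the function names are assumptions chosen for the example.

```python
import cmath
import math

def acov(psi, sigma2, r):
    """c(r) = sigma^2 * sum_j psi_j psi_{j+|r|} for a finite MA with coefficients psi."""
    r = abs(r)
    return sigma2 * sum(psi[j] * psi[j + r] for j in range(len(psi) - r))

def f_from_acov(psi, sigma2, omega):
    """(1/2pi) * sum_r c(r) exp(i r omega)."""
    q = len(psi) - 1
    s = sum(acov(psi, sigma2, r) * cmath.exp(1j * r * omega) for r in range(-q, q + 1))
    return s.real / (2 * math.pi)

def f_from_transfer(psi, sigma2, omega):
    """(sigma^2/2pi) * |sum_j psi_j exp(-i j omega)|^2."""
    a = sum(p * cmath.exp(-1j * j * omega) for j, p in enumerate(psi))
    return sigma2 * abs(a) ** 2 / (2 * math.pi)

psi = [1.0, 0.5, -0.3]   # hypothetical MA(2) coefficients
sigma2 = 2.0             # hypothetical innovation variance
grid = [2 * math.pi * j / 100 for j in range(100)]
max_gap = max(abs(f_from_acov(psi, sigma2, w) - f_from_transfer(psi, sigma2, w)) for w in grid)
print(max_gap)  # zero up to rounding error
```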
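Example 8.5.1 Let us suppose that $\{X_t\}$ is a stationary ARMA$(p,q)$ time series (not necessarily
invertible or causal), where
$$X_t - \sum_{j=1}^{p}\phi_jX_{t-j} = \varepsilon_t + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j},$$
$\{\varepsilon_t\}$ are iid random variables with $E(\varepsilon_t) = 0$ and $E(\varepsilon_t^2) = \sigma^2$. Then the spectral density of $\{X_t\}$ is
$$f(\omega) = \frac{\sigma^2}{2\pi}\frac{|1+\sum_{j=1}^{q}\theta_j\exp(ij\omega)|^2}{|1-\sum_{j=1}^{p}\phi_j\exp(ij\omega)|^2}.$$
We note that because the spectral density of an ARMA process is a ratio of trigonometric polynomials, it is known as a rational
spectral density.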
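A small helper makes the rational form easy to evaluate. This is a sketch under the sign convention $X_t - \sum_j\phi_jX_{t-j} = \varepsilon_t + \sum_j\theta_j\varepsilon_{t-j}$; the function name and the parameter values are illustrative assumptions.

```python
import cmath
import math

def arma_spectral_density(phi, theta, sigma2, omega):
    """(sigma^2/2pi) |1 + sum theta_j e^{ij omega}|^2 / |1 - sum phi_j e^{ij omega}|^2."""
    z = cmath.exp(1j * omega)
    num = 1 + sum(t * z ** (j + 1) for j, t in enumerate(theta))
    den = 1 - sum(p * z ** (j + 1) for j, p in enumerate(phi))
    return sigma2 * abs(num) ** 2 / (2 * math.pi * abs(den) ** 2)

# AR(1) with phi = 0.75: closed form sigma^2 / (2pi (1 + phi^2 - 2 phi cos(omega)))
w = 0.9
ar1 = arma_spectral_density([0.75], [], 1.0, w)
ar1_closed = 1.0 / (2 * math.pi * (1 + 0.75 ** 2 - 2 * 0.75 * math.cos(w)))

# MA(1) with theta = 0.5: closed form sigma^2 (1 + theta^2 + 2 theta cos(omega)) / 2pi
ma1 = arma_spectral_density([], [0.5], 1.0, w)
ma1_closed = (1 + 0.5 ** 2 + 2 * 0.5 * math.cos(w)) / (2 * math.pi)
print(ar1, ar1_closed, ma1, ma1_closed)
```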
Remark 8.5.2 The roots of the characteristic function of an AR process will have an influence on
the location of peaks in its corresponding spectral density function. To see why consider the AR(2)
model
$$X_t = \phi_1X_{t-1}+\phi_2X_{t-2}+\varepsilon_t,$$
where $\{\varepsilon_t\}$ are iid random variables with zero mean and $E(\varepsilon^2) = \sigma^2$. Suppose the roots of the
characteristic polynomial $\phi(B) = 1-\phi_1B-\phi_2B^2$ lie outside the unit circle and are complex
conjugates $\lambda_1 = r^{-1}\exp(i\theta)$ and $\lambda_2 = r^{-1}\exp(-i\theta)$ (so that $|r| < 1$). Then the spectral density function is
$$f(\omega) = \frac{\sigma^2}{2\pi|1-r\exp(i(\theta-\omega))|^2|1-r\exp(i(-\theta-\omega))|^2} = \frac{\sigma^2}{2\pi[1+r^2-2r\cos(\theta-\omega)][1+r^2-2r\cos(\theta+\omega)]}.$$
If $r > 0$, then $f(\omega)$ is (approximately) maximum when $\omega = \theta$; on the other hand, if $r < 0$ then the above is maximum
when $\omega = \pi-\theta$. Thus the peaks in $f(\omega)$ correspond to peaks in the pseudo periodicities of the
time series and covariance structure (which one would expect), see Section 3.1.2. How pronounced
these peaks are depends on how close $|r|$ is to one. The closer $|r|$ is to one, the larger the peak. We
can generalise the above argument to higher order autoregressive models; in this case there may be
multiple peaks. In fact, this suggests that the larger the number of peaks, the higher the order of the
AR model that should be fitted.
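The location of the peak can be checked on a grid. The sketch assumes the factorisation $1-\phi_1z-\phi_2z^2 = (1-re^{i\theta}z)(1-re^{-i\theta}z)$, so that $\phi_1 = 2r\cos\theta$ and $\phi_2 = -r^2$; the values $r = 0.95$ and $\theta = \pi/3$ are arbitrary choices for illustration.

```python
import cmath
import math

r, theta, sigma2 = 0.95, math.pi / 3, 1.0
phi1, phi2 = 2 * r * math.cos(theta), -r ** 2

def f_ar2(w):
    """Spectral density computed directly from the AR(2) coefficients."""
    z = cmath.exp(1j * w)
    return sigma2 / (2 * math.pi * abs(1 - phi1 * z - phi2 * z ** 2) ** 2)

def f_product_form(w):
    """The same density written in terms of r and theta."""
    return sigma2 / (2 * math.pi
                     * (1 + r * r - 2 * r * math.cos(theta - w))
                     * (1 + r * r - 2 * r * math.cos(theta + w)))

grid = [math.pi * j / 2000 for j in range(2001)]
peak = max(grid, key=f_ar2)          # frequency in [0, pi] with the largest density
print(peak, theta)                   # the peak sits very close to theta
```

Repeating the experiment with a smaller $r$ (say 0.5) gives a much flatter density, in line with the remark that the peak becomes more pronounced as $|r|$ approaches one.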
8.5.3
The spectral density
$$f(\omega) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c(r)\exp(ir\omega)$$
can be approximated to any order by the spectral density of an AR($p$) or MA($q$) process.
We do this by truncating the infinite number of covariances to a finite number; however, this
does not necessarily lead to a positive definite spectral density. This can easily be seen by noting
that
$$\tilde f_m(\omega) = \frac{1}{2\pi}\sum_{r=-m}^{m}c(r)\exp(ir\omega) = \frac{1}{2\pi}\int_0^{2\pi}f(\lambda)D_m(\omega-\lambda)\,d\lambda,$$
where $D_m(\lambda) = \sin[(m+1/2)\lambda]/\sin(\lambda/2)$. Observe that $D_m(\lambda)$ can be negative, which means that
$\tilde f_m(\omega)$ can be negative despite $f$ being positive.
Example 8.5.2 Consider the AR(1) process $X_t = 0.75X_{t-1}+\varepsilon_t$ where $\mathrm{var}[\varepsilon_t] = 1$. In Lemma
3.1.1 we showed that the autocovariance corresponding to this model is $c(r) = [1-0.75^2]^{-1}0.75^{|r|}$.
Let us define a process whose autocovariance is $\tilde c(0) = [1-0.75^2]^{-1}$, $\tilde c(1) = \tilde c(-1) = [1-0.75^2]^{-1}0.75$ and $\tilde c(r) = 0$ for $|r| > 1$. The spectral density of this process is
$$\tilde f_m(\omega) = \frac{1}{2\pi}\cdot\frac{1}{1-0.75^2}\left(1+2\times\frac{3}{4}\cos[\omega]\right).$$
It is clear that this function can be negative for some values of $\omega$. This means that $\{\tilde c(r)\}$ is not a well
defined covariance function, hence there does not exist a time series with this covariance structure.
In other words, simply truncating an autocovariance is not enough to guarantee that it is a positive
definite sequence.
Instead we consider a slight variant on this and define
$$f_m(\omega) = \frac{1}{2\pi}\sum_{r=-m}^{m}\left(1-\frac{|r|}{m}\right)c(r)\exp(ir\omega),$$
which is positive.
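The contrast between plain truncation and the weighted version can be seen numerically for the AR(1) autocovariance $c(r) = 0.75^{|r|}/(1-0.75^2)$ used above. The function names and the grid size below are assumptions made for this sketch.

```python
import math

def c(r):
    """Autocovariance of the AR(1) process X_t = 0.75 X_{t-1} + e_t with var(e_t) = 1."""
    return 0.75 ** abs(r) / (1 - 0.75 ** 2)

def truncated(m, w):
    """(1/2pi) * sum_{|r| <= m} c(r) cos(r w): covariances cut off at lag m."""
    return sum(c(r) * math.cos(r * w) for r in range(-m, m + 1)) / (2 * math.pi)

def cesaro(m, w):
    """(1/2pi) * sum_{|r| <= m} (1 - |r|/m) c(r) cos(r w): triangular weights."""
    return sum((1 - abs(r) / m) * c(r) * math.cos(r * w) for r in range(-m, m + 1)) / (2 * math.pi)

grid = [2 * math.pi * j / 400 for j in range(400)]
min_truncated = min(truncated(1, w) for w in grid)
min_cesaro = {m: min(cesaro(m, w) for w in grid) for m in range(2, 7)}
print(min_truncated)   # negative near omega = pi
print(min_cesaro)      # strictly positive for each m shown
```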
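Remark 8.5.3 We note that $f_m$ is known as a Cesaro sum because it can be written as
$$f_m(\omega) = \frac{1}{2\pi}\sum_{r=-m}^{m}\left(1-\frac{|r|}{m}\right)c(r)\exp(ir\omega) = \frac{1}{m}\sum_{n=0}^{m-1}\tilde f_n(\omega), \qquad (8.30)$$
where $\tilde f_n(\omega) = \frac{1}{2\pi}\sum_{r=-n}^{n}c(r)\exp(ir\omega)$. Strangely, there is no guarantee that the truncated Fourier
transform $\tilde f_n$ is nonnegative; however, $f_n(\omega)$ is definitely nonnegative. There are a few ways to prove
this:
(i) The first method we came across previously: since $\mathrm{var}[J_n(\omega)] = f_n(\omega)$, it is clear from this
construction that $\inf_\omega f_n(\omega)\ge0$.
(ii) By using (8.30) we can write $f_m(\omega)$ as
$$f_m(\omega) = \frac{1}{2\pi}\int_0^{2\pi}f(\lambda)F_m(\omega-\lambda)\,d\lambda,$$
where $F_m(\lambda) = \frac{1}{m}\sum_{r=0}^{m-1}D_r(\lambda) = \frac{1}{m}\left(\frac{\sin(m\lambda/2)}{\sin(\lambda/2)}\right)^2$ and $D_r(\lambda) = \sum_{j=-r}^{r}\exp(ij\lambda)$ (these are
the Fejer and Dirichlet kernels respectively). Since both $f$ and $F_m$ are nonnegative, $f_m$ has
to be nonnegative.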
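The Cesaro sum is special in the sense that
$$\sup_\omega|f_m(\omega)-f(\omega)|\to0 \quad \text{as } m\to\infty. \qquad (8.31)$$
Thus for a large enough $m$, $f_m(\omega)$ will be within $\delta$ of the spectral density $f$. Using this we can
prove the results below.
Lemma 8.5.1 Suppose that $\sum_r|c(r)| < \infty$ and $f$ is the corresponding spectral density. Then
for every $\delta > 0$, there exists an $m$ such that $|f(\omega)-f_m(\omega)| < \delta$ and $f_m(\omega) = \sigma^2|\psi(\omega)|^2$, where
$\psi(\omega) = \sum_{j=0}^{m}\psi_j\exp(ij\omega)$. Thus we can approximate the spectral density $f$ with the spectral
density of an MA.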
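PROOF. We show that there exists an MA($m$) which has the spectral density $f_m(\omega)$, where $f_m$ is
defined in (8.30). Thus by (8.31) we have the result.
Before proving the result we note that if a polynomial is of the form
$$p(z) = a_0+\sum_{r=1}^{m-1}a_rz^r+a_mz^m+\sum_{r=m+1}^{2m}a_{2m-r}z^r,$$
then it has the factorisation $p(z) = C\prod_{j=1}^{m}[1-\lambda_jz][1-\lambda_j^{-1}z]$ (the roots come in reciprocal pairs, labelled so that $|\lambda_j|\ge1$). The
proof is clear by factorising the polynomial and using contradiction. Therefore, if $a(r) = a(-r)$,
then we have
$$\sum_{r=-m}^{m}a(r)\exp(ir\omega) = \exp(-im\omega)\sum_{r=0}^{2m}a(r-m)\exp(ir\omega)$$
$$= C\exp(-im\omega)\prod_{j=1}^{m}[1-\lambda_j\exp(i\omega)]\left[1-\lambda_j^{-1}\exp(i\omega)\right]$$
$$= C(-1)^m\prod_{j=1}^{m}\lambda_j\prod_{j=1}^{m}\left[1-\lambda_j^{-1}\exp(-i\omega)\right]\left[1-\lambda_j^{-1}\exp(i\omega)\right],$$
for some finite constant $C$. Using the above, with $a(r) = [1-|r|m^{-1}]c(r)$, we can write $f_m$ as
$$f_m(\omega) = K\prod_{j=1}^{m}(1-\lambda_j^{-1}\exp(i\omega))\prod_{j=1}^{m}(1-\lambda_j^{-1}\exp(-i\omega)) = KA(\omega)A(-\omega) = K|A(\omega)|^2,$$
where $A(\omega)$ is shorthand for $A(\exp(i\omega))$ and
$$A(z) = \prod_{j=1}^{m}(1-\lambda_j^{-1}z).$$
Since $A(z)$ is an $m$th order polynomial where all the roots are greater than 1 in modulus, we can always
construct an MA($m$) process which has $A(z)$ as its transfer function. Thus there exists an MA($m$)
process which has $f_m(\omega)$ as its spectral density function.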
where $\inf_\omega f(\omega) > 0$. Then for every $\delta > 0$, there exists an $m$ such that $|f(\omega)-g_m(\omega)| < \delta$ and
$g_m(\omega) = \sigma^2|\phi(\omega)^{-1}|^2$, where $\phi(\omega) = \sum_{j=0}^{m}\phi_j\exp(ij\omega)$ and the roots of $\phi(z)$ lie outside the unit
circle. Thus we can approximate the spectral density $f$ with the spectral density of a causal
autoregressive process.
PROOF. We first note that we can write
$$f(\omega)-g_m(\omega) = f(\omega)\left[g_m(\omega)^{-1}-f(\omega)^{-1}\right]g_m(\omega).$$
Since $f(\cdot)\in L_2$ and is bounded away from zero, then $f^{-1}\in L_2$ and we can write $f^{-1}$ as
$$f^{-1}(\omega) = \sum_{r=-\infty}^{\infty}d_r\exp(ir\omega),$$
where $d_r$ are the Fourier coefficients of $f^{-1}$. Since $f$ is positive and symmetric, $f^{-1}$ is positive and symmetric, and $\{d_r\}$ is a positive definite symmetric
sequence. Thus we can define the positive function $g_m$ where
$$g_m^{-1}(\omega) = \sum_{|r|\le m}\left(1-\frac{|r|}{m}\right)d_r\exp(ir\omega)$$
and which is such that $|g_m^{-1}(\omega)-f^{-1}(\omega)| < \delta$, which implies
$$|f(\omega)-g_m(\omega)| \le \Big[\sum_r|c(r)|\Big]^2\delta.$$
Now applying the same arguments used to prove Lemma 8.5.1, we can show that $g_m^{-1}$ can be
factorised as $g_m^{-1}(\omega) = C|\phi_m(\omega)|^2$ (where $\phi_m$ is an $m$th order polynomial whose roots lie outside
the unit circle). Thus $g_m(\omega) = C^{-1}|\phi_m(\omega)|^{-2}$ and we obtain the desired result.
8.6
We recall that the covariance is a measure of linear dependence between two random variables.
Higher order cumulants are a measure of higher order dependence. For example, the third order
cumulant for the zero mean random variables $X_1,X_2,X_3$ is
$$\mathrm{cum}(X_1,X_2,X_3) = E(X_1X_2X_3)$$
and the fourth order cumulant for the zero mean random variables $X_1,X_2,X_3,X_4$ is
$$\mathrm{cum}(X_1,X_2,X_3,X_4) = E(X_1X_2X_3X_4)-E(X_1X_2)E(X_3X_4)-E(X_1X_3)E(X_2X_4)-E(X_1X_4)E(X_2X_3).$$
From the definition we see that if $X_1,X_2,X_3,X_4$ are independent then $\mathrm{cum}(X_1,X_2,X_3) = 0$ and
$\mathrm{cum}(X_1,X_2,X_3,X_4) = 0$.
Moreover, if $X_1,X_2,X_3,X_4$ are Gaussian random variables then $\mathrm{cum}(X_1,X_2,X_3) = 0$ and
$\mathrm{cum}(X_1,X_2,X_3,X_4) = 0$. Indeed all cumulants of order higher than two are zero. This comes from
the fact that cumulants are the coefficients of the power series expansion of the logarithm of the
characteristic function of $\{X_t\}$, which is
$$g_X(t) = i\underbrace{\mu'}_{\text{mean}}t-\frac{1}{2}t'\underbrace{\Sigma}_{\text{cumulant}}t.$$
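The fourth order cumulant formula for zero mean variables can be evaluated exactly on a small finite sample space. The sketch below is purely illustrative: the sample space (two independent fair $\pm1$ coin flips) and the helper names `expect` and `cum4` are assumptions made for the example.

```python
from itertools import product

def expect(outcomes, g):
    """Expectation of g over equally likely outcomes."""
    return sum(g(o) for o in outcomes) / len(outcomes)

def cum4(outcomes, a, b, c, d):
    """Fourth order cumulant of the zero mean random variables a, b, c, d."""
    e = lambda g: expect(outcomes, g)
    return (e(lambda o: a(o) * b(o) * c(o) * d(o))
            - e(lambda o: a(o) * b(o)) * e(lambda o: c(o) * d(o))
            - e(lambda o: a(o) * c(o)) * e(lambda o: b(o) * d(o))
            - e(lambda o: a(o) * d(o)) * e(lambda o: b(o) * c(o)))

outcomes = list(product([-1, 1], repeat=2))   # two independent +/-1 variables
X = lambda o: o[0]
Y = lambda o: o[1]

same = cum4(outcomes, X, X, X, X)     # E(X^4) - 3 E(X^2)^2 = 1 - 3
mixed = cum4(outcomes, X, X, Y, Y)    # contains independent blocks, so it vanishes
print(same, mixed)
```

The nonzero value for the non-Gaussian variable $X$ and the zero value for the mixed cumulant illustrate the two properties stated above: cumulants vanish under independence, and a nonzero fourth order cumulant signals departure from Gaussianity.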
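Since the spectral density is the Fourier transform of the covariance, it is natural to ask whether
one can define the higher order spectral density as the Fourier transform of the higher order cumulants. This turns out to be the case, and the higher order spectra have several interesting properties.
Let us suppose that $\{X_t\}$ is a stationary time series (notice that we are assuming it is strictly stationary and not second order). Let $\kappa_3(t,s) = \mathrm{cum}(X_0,X_t,X_s)$, $\kappa_4(t,s,r) = \mathrm{cum}(X_0,X_t,X_s,X_r)$
and $\kappa_q(t_1,\ldots,t_{q-1}) = \mathrm{cum}(X_0,X_{t_1},\ldots,X_{t_{q-1}})$ (noting that like the covariance the higher order cumulants are invariant to shift). The third, fourth and the general $q$th order spectra are defined
as
$$f_3(\omega_1,\omega_2) = \sum_{s=-\infty}^{\infty}\sum_{t=-\infty}^{\infty}\kappa_3(s,t)\exp(is\omega_1+it\omega_2),$$
$$f_4(\omega_1,\omega_2,\omega_3) = \sum_{s=-\infty}^{\infty}\sum_{t=-\infty}^{\infty}\sum_{r=-\infty}^{\infty}\kappa_4(s,t,r)\exp(is\omega_1+it\omega_2+ir\omega_3),$$
$$f_q(\omega_1,\omega_2,\ldots,\omega_{q-1}) = \sum_{t_1,\ldots,t_{q-1}=-\infty}^{\infty}\kappa_q(t_1,\ldots,t_{q-1})\exp(it_1\omega_1+\ldots+it_{q-1}\omega_{q-1}).$$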
Example 8.6.1 (Third and Fourth order spectral density of a linear process) Let us suppose that {Xt } satisfies
Xt =
j tj
j=
where
j= j 
j= j
exp(ij). Then it is
covariance assumption
fq (k2 , . . . , kq )
nq/2
where ki =
8.7
8.7.1
n
X
exp(ij(k1 . . . kq )) + O(
j=1
1
f ( , . . . , kq )
n(q1)/2 q k2
1
O( nq/2
)
1
+ O( nq/2
)
Pq
i=1 ki
1
nq/2
= nZ
otherwise
2ki
n .
Extensions
The spectral density of a time series with randomly missing
observations
Let us suppose that $\{X_t\}$ is a second order stationary time series. However $\{X_t\}$ is not observed at
every time point and there are observations missing, thus we only observe $X_t$ at $\{\tau_k\}_k$. Thus what is
observed is $\{X_{\tau_k}\}$. The question is how to deal with this type of data. One method was suggested
in ?. He suggested that the missingness mechanism $\{\tau_k\}$ be modelled stochastically. That is, define
the random process $\{Y_t\}$ which only takes the values $\{0,1\}$, where $Y_t = 1$ if $X_t$ is observed, but
$Y_t = 0$ if $X_t$ is not observed. Thus we observe $\{X_tY_t\}_t = \{X_{\tau_k}\}$ and also $\{Y_t\}$ (which gives the time
points at which the process is observed). He also suggests modelling $\{Y_t\}$ as a stationary process, which is
independent of $\{X_t\}$ (thus the missingness mechanism and the time series are independent).
The spectral densities of $\{X_tY_t\}$, $\{X_t\}$ and $\{Y_t\}$ have an interesting relationship, which can be
exploited to estimate the spectral density of $\{X_t\}$ given estimators of the spectral densities of $\{X_tY_t\}$
and $\{Y_t\}$ (which we recall are observed). We first note that since $\{X_t\}$ and $\{Y_t\}$ are stationary,
then $\{X_tY_t\}$ is stationary; furthermore
$$\mathrm{cov}(X_tY_t,X_\tau Y_\tau) = \mathrm{cov}(X_t,X_\tau)\mathrm{cov}(Y_t,Y_\tau)+\mathrm{cov}(X_t,Y_\tau)\mathrm{cov}(Y_t,X_\tau)+\mathrm{cum}(X_t,Y_t,X_\tau,Y_\tau)$$
$$= \mathrm{cov}(X_t,X_\tau)\mathrm{cov}(Y_t,Y_\tau) = c_X(t-\tau)c_Y(t-\tau),$$
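where the above is due to independence of $\{X_t\}$ and $\{Y_t\}$. Thus the spectral density of $\{X_tY_t\}$ is
$$f_{XY}(\omega) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}\mathrm{cov}(X_0Y_0,X_rY_r)\exp(ir\omega) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c_X(r)c_Y(r)\exp(ir\omega) = \int f_X(\lambda)f_Y(\omega-\lambda)\,d\lambda,$$
where $f_X(\lambda) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c_X(r)\exp(ir\lambda)$ and $f_Y(\lambda) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c_Y(r)\exp(ir\lambda)$.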
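The convolution identity can be checked numerically for two short covariance sequences. In the sketch below both sequences are supported on lags $0,\pm1$; the numerical values and function names are illustrative assumptions and are not tied to a particular missingness mechanism.

```python
import math

def density(c0, c1, w):
    """(1/2pi) (c0 + 2 c1 cos w): spectral density of a covariance supported on lags 0, +/-1."""
    return (c0 + 2 * c1 * math.cos(w)) / (2 * math.pi)

cx = (1.25, 0.5)   # hypothetical c_X(0), c_X(1)
cy = (1.09, 0.3)   # hypothetical c_Y(0), c_Y(1)

def f_product(w):
    """Fourier transform of the product covariance c_X(r) c_Y(r)."""
    return density(cx[0] * cy[0], cx[1] * cy[1], w)

def f_convolution(w, grid=64):
    """Riemann sum for the integral of f_X(lam) f_Y(w - lam) over [0, 2pi]."""
    h = 2 * math.pi / grid
    return sum(density(cx[0], cx[1], j * h) * density(cy[0], cy[1], w - j * h)
               for j in range(grid)) * h

gap = max(abs(f_product(w) - f_convolution(w)) for w in (0.0, 0.7, 1.9, 3.1, 5.5))
print(gap)  # zero up to rounding error
```

The Riemann sum is exact here (up to rounding) because the integrand is a low order trigonometric polynomial.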
Chapter 9
Spectral Analysis
Prerequisites
The Gaussian likelihood.
The approximation of a Toeplitz by a Circulant (covered in previous chapters).
Objectives
The DFTs are close to uncorrelated but have a frequency dependent variance (under stationarity).
The DFTs are asymptotically Gaussian.
For a linear time series the DFT is almost equal to the transfer function times the DFT of
the innovations.
The periodogram is the squared modulus of the DFT, whose expectation is approximately equal to the
spectral density. Smoothing the periodogram leads to an estimator of the spectral density, as
does truncating the covariances.
The Whittle likelihood and how it is related to the Gaussian likelihood.
Understand that many estimators can be written in the frequency domain.
Calculating the variance of an estimator.
9.1
In the previous section we motivated transforming the stationary time series $\{X_t\}$ into its discrete
Fourier transform
$$J_n(\omega_k) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\exp\left(ik\frac{2\pi t}{n}\right) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\cos\left(k\frac{2\pi t}{n}\right)+i\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}X_t\sin\left(k\frac{2\pi t}{n}\right), \qquad k = 0,\ldots,n/2$$
(frequency series) as an alternative way of analysing the time series. Since there is a one-to-one
mapping between the two, nothing is lost by making this transformation. Our principal reason for
using this transformation is given in Lemma 8.2.1, where we showed that $\{J_n(\omega_k)\}_{k=1}^{n/2}$ is an almost
uncorrelated series. However, there is a cost to the uncorrelatedness property: unlike the
original stationary time series $\{X_t\}$, the variance of the DFT varies over the frequencies, and the
variance is the spectral density at that frequency. We summarise this result below, but first we
recall the definition of the spectral density function
$$f(\omega) = \frac{1}{2\pi}\sum_{r=-\infty}^{\infty}c(r)\exp(ir\omega), \qquad \omega\in[0,2\pi]. \qquad (9.1)$$
$$|J_n(\omega)|^2 = \frac{1}{2\pi}\sum_{r=-(n-1)}^{n-1}c_n(r)\exp(ir\omega), \qquad (9.2)$$
as n ,
rn
(9.3)
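(iii)
$$\mathrm{cov}\left(J_n\Big(\frac{2\pi k_1}{n}\Big),J_n\Big(\frac{2\pi k_2}{n}\Big)\right) = \begin{cases}f(\frac{2\pi k_1}{n})+o(1) & k_1 = k_2\\ o(1) & k_1\ne k_2,\end{cases}$$
where $f(\omega)$ is the spectral density function defined in (9.1). Under the stronger condition
$\sum_r|rc(r)| < \infty$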
1X
b
cn (r) + b
cn (n r) =
Jn (k )2 exp(irk ).
n
(9.4)
k=1
1
n
Pn
t=nr Xt Xt+nr ,
b
cn (r)
1X
Jn (k )2 exp(irk ).
n
(9.5)
k=1
The squared modulus of the DFT plays such an important role in time series analysis that it has
its own name, the periodogram, which is defined as
$$I_n(\omega) = |J_n(\omega)|^2 = \frac{1}{2\pi}\sum_{r=-(n-1)}^{n-1}c_n(r)\exp(ir\omega). \qquad (9.6)$$
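The identity between the squared DFT and the Fourier transform of the sample autocovariances can be verified directly on a short data vector. The sketch assumes $c_n(r) = \frac{1}{n}\sum_{t=1}^{n-|r|}X_tX_{t+|r|}$ (no mean correction); the data values and function names are made up for illustration.

```python
import cmath
import math

def dft(x, w):
    """J_n(w) = (2 pi n)^{-1/2} sum_{t=1}^n x_t exp(i t w)."""
    n = len(x)
    return sum(x[t - 1] * cmath.exp(1j * t * w) for t in range(1, n + 1)) / math.sqrt(2 * math.pi * n)

def sample_acov(x, r):
    """c_n(r) = (1/n) sum_{t=1}^{n-|r|} x_t x_{t+|r|} (no mean correction)."""
    n, r = len(x), abs(r)
    return sum(x[t] * x[t + r] for t in range(n - r)) / n

def periodogram_from_acov(x, w):
    """(1/2pi) sum_{|r| <= n-1} c_n(r) exp(i r w)."""
    n = len(x)
    s = sum(sample_acov(x, r) * cmath.exp(1j * r * w) for r in range(-(n - 1), n))
    return s.real / (2 * math.pi)

x = [1.0, -0.5, 2.0, 0.3, -1.7, 0.8]   # hypothetical observations
freqs = [0.0, 0.4, 1.3, 2.0 * math.pi / 6, 3.0]
gap = max(abs(abs(dft(x, w)) ** 2 - periodogram_from_acov(x, w)) for w in freqs)
print(gap)  # zero up to rounding error
```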
By using Lemma 9.1.1 or Theorem 8.6.1 we have $E(I_n(\omega)) = f(\omega)+O(\frac{1}{n})$. Moreover, (9.4) belongs
to a general class of integrated mean periodogram estimators which have the form
$$A(\phi,I_n) = \frac{1}{n}\sum_{k=1}^{n}I_n(\omega_k)\phi(\omega_k). \qquad (9.7)$$
Replacing the sum by an integral and the periodogram by its limit, it is clear that these are
estimators of the integrated spectral density
$$A(f,\phi) = \int_0^{2\pi}f(\omega)\phi(\omega)\,d\omega.$$
Before we consider these estimators (in Section 9.5), we analyse some of the properties of the
DFT.
9.2
An interesting aspect of the DFT, is that under certain conditions the DFT is asymptotically
normal. We can heuristically justify this by noting that the DFT is a (weighted) sample mean. In
pn
In this section we prove this result,
X).
fact at frequency zero, it is the sample mean (Jn (0) = 2
and a similar result for the periodogram. We do the proof under linearity of the time series, that is
Xt =
j tj ,
j=
however the result also holds for nonlinear time series (but is beyond this course).
The DFT of the innovations, $J_\varepsilon(\omega_k) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\varepsilon_t e^{it\omega_k}$, is a very simple object to deal with. First, the DFT is an orthogonal transformation and the orthogonal transformation of iid random variables leads to uncorrelated random variables. In other words, $\{J_\varepsilon(\omega_k)\}$ is completely uncorrelated, as are its real and imaginary parts. Secondly, if $\{\varepsilon_t\}$ are Gaussian, then $\{J_\varepsilon(\omega_k)\}$ are independent and Gaussian. Thus we start by showing that the DFT of a linear time series is approximately equal to the DFT of the innovations multiplied by the transfer function. This allows us to transfer results regarding $J_\varepsilon(\omega_k)$ to $J_n(\omega_k)$.

We will use the assumption that $\sum_j |j|^{1/2}|\psi_j| < \infty$; this is a slightly stronger assumption than $\sum_j |\psi_j| < \infty$ (which we worked under in Chapter 2).
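The claim that the DFT of iid innovations is uncorrelated across the Fourier frequencies comes down to the orthogonality relation $\sum_{t=1}^n \exp(it(\omega_j - \omega_k)) = n\,I(j = k)$. A minimal sketch of this, assuming the $(2\pi n)^{-1/2}$ normalisation (the helper name `dft_cov_iid` is hypothetical):

```python
import cmath
import math

def dft_cov_iid(j, k, n, sigma2=1.0):
    """E[J_eps(omega_j) * conj(J_eps(omega_k))] for iid innovations.

    Only the terms with equal time index survive, so the covariance is
    sigma2/(2*pi*n) * sum_{t=1}^n exp(i*t*(omega_j - omega_k)).
    """
    omega_j = 2 * math.pi * j / n
    omega_k = 2 * math.pi * k / n
    s = sum(cmath.exp(1j * t * (omega_j - omega_k)) for t in range(1, n + 1))
    return sigma2 * s / (2 * math.pi * n)

n = 16
same = dft_cov_iid(3, 3, n)   # variance at one frequency: sigma2/(2*pi)
diff = dft_cov_iid(3, 5, n)   # covariance across two distinct frequencies
print(same, diff)
```

The variance equals $\sigma^2/(2\pi)$, the spectral density of white noise, and the cross-covariance vanishes up to rounding.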
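Lemma 9.2.1 Let us suppose that $\{X_t\}$ satisfy $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$, and $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$. Let
$$J_\varepsilon(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\varepsilon_t\exp(it\omega).$$
Then we have
$$J_n(\omega) = \Big\{\sum_{j}\psi_j\exp(ij\omega)\Big\}J_\varepsilon(\omega) + Y_n(\omega), \tag{9.8}$$
where $Y_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{j}\psi_j\exp(ij\omega)U_{n,j}$, with $U_{n,j} = \sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t$, and
$$E|Y_n(\omega)|^2 \le \Big(\frac{C}{n^{1/2}}\sum_{j=-\infty}^{\infty}|\psi_j|\min(|j|, n)^{1/2}\Big)^2 = O\Big(\frac{1}{n}\Big)$$
for some finite constant $C$.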
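PROOF. We note that
$$J_n(\omega) = \sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\varepsilon_{t-j}\exp(i(t-j)\omega)$$
$$= \sum_{j=-\infty}^{\infty}\psi_j\exp(ij\omega)\frac{1}{\sqrt{2\pi n}}\sum_{s=1-j}^{n-j}\varepsilon_s\exp(is\omega)$$
$$= \Big\{\sum_{j}\psi_j\exp(ij\omega)\Big\}J_\varepsilon(\omega) + \underbrace{\frac{1}{\sqrt{2\pi n}}\sum_{j}\psi_j\exp(ij\omega)\Big(\sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t\Big)}_{=Y_n(\omega)}.$$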
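We will show that $Y_n(\omega)$ is negligible with respect to the first term. We decompose $Y_n(\omega)$ into three terms
$$Y_n(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{j=-\infty}^{-(n+1)}\psi_j e^{ij\omega}\underbrace{\Big(\sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t\Big)}_{\text{no terms in common}}$$
$$+ \frac{1}{\sqrt{2\pi n}}\sum_{j=-n}^{n}\psi_j e^{ij\omega}\underbrace{\Big(\sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t\Big)}_{n-|j| \text{ terms in common}}$$
$$+ \frac{1}{\sqrt{2\pi n}}\sum_{j=n+1}^{\infty}\psi_j e^{ij\omega}\underbrace{\Big(\sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t\Big)}_{\text{no terms in common}}$$
$$= I + II + III.$$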
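If we took the expectation of the absolute value of $Y_n(\omega)$ we would find that we require the condition $\sum_j |j\psi_j| < \infty$ (and we would not exploit independence of the innovations). However, by evaluating $E|Y_n(\omega)|^2$ we exploit the independence of $\{\varepsilon_t\}$, i.e.
$$[E(I^2)]^{1/2} \le \frac{1}{\sqrt{2\pi n}}\sum_{j=-\infty}^{-(n+1)}|\psi_j|\Big[E\Big|\sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t\Big|^2\Big]^{1/2}$$
$$\le \frac{1}{\sqrt{2\pi n}}\sum_{j=-\infty}^{-(n+1)}|\psi_j|\big[2n\sigma^2\big]^{1/2} \le \frac{\sqrt{2}\sigma}{\sqrt{2\pi n}}\sum_{j=-\infty}^{-(n+1)}|j|^{1/2}|\psi_j| \le \frac{\sqrt{2}\sigma}{\sqrt{2\pi n}}\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j|,$$
and, since for $|j| \le n$ the two sums differ in only $2|j|$ terms,
$$[E(II^2)]^{1/2} \le \frac{1}{\sqrt{2\pi n}}\sum_{j=-n}^{n}|\psi_j|\Big[E\Big|\sum_{t=1-j}^{n-j}\exp(it\omega)\varepsilon_t - \sum_{t=1}^{n}\exp(it\omega)\varepsilon_t\Big|^2\Big]^{1/2}$$
$$\le \frac{1}{\sqrt{2\pi n}}\sum_{j=-n}^{n}|\psi_j|\big[2|j|\sigma^2\big]^{1/2} \le \frac{\sqrt{2}\sigma}{\sqrt{2\pi n}}\sum_{j=-n}^{n}|j|^{1/2}|\psi_j| \le \frac{\sqrt{2}\sigma}{\sqrt{2\pi n}}\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j|.$$
The term $III$ is bounded in the same way as $I$, which gives the result. In particular, we have
$$J_n(\omega) = \Big\{\sum_{j}\psi_j\exp(ij\omega)\Big\}J_\varepsilon(\omega) + O_p\Big(\frac{1}{\sqrt{n}}\Big). \tag{9.9}$$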
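For an MA(1) process the remainder $Y_n(\omega)$ in (9.8) has a closed form, which makes the $O_p(n^{-1/2})$ rate in (9.9) easy to see. The sketch below assumes $X_t = \varepsilon_t + \theta\varepsilon_{t-1}$ (so $\psi_0 = 1$, $\psi_1 = \theta$), for which $U_{n,1} = \varepsilon_0 - \varepsilon_n\exp(in\omega)$. The simulated data, seed and variable names are illustrative choices.

```python
import cmath
import math
import random

def dft(values, omega):
    """(2*pi*n)^(-1/2) * sum_{t=1}^n v_t exp(i*t*omega)."""
    n = len(values)
    s = sum(v * cmath.exp(1j * (t + 1) * omega) for t, v in enumerate(values))
    return s / math.sqrt(2 * math.pi * n)

rng = random.Random(0)
n, theta = 200, 0.8
eps = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]            # eps_0, ..., eps_n
x = [eps[t] + theta * eps[t - 1] for t in range(1, n + 1)]   # X_1, ..., X_n

omega = 2 * math.pi * 7 / n
transfer = 1 + theta * cmath.exp(1j * omega)                 # sum_j psi_j exp(i*j*omega)
y_n = dft(x, omega) - transfer * dft(eps[1:], omega)         # remainder Y_n(omega)

# Closed form from Lemma 9.2.1 with psi_1 = theta, U_{n,1} = eps_0 - eps_n exp(i*n*omega)
y_closed = (theta * cmath.exp(1j * omega)
            * (eps[0] - eps[n] * cmath.exp(1j * n * omega))
            / math.sqrt(2 * math.pi * n))
print(abs(y_n), abs(y_closed))
```

Only the two boundary innovations $\varepsilon_0$ and $\varepsilon_n$ enter the remainder, which is why it shrinks at the rate $n^{-1/2}$.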
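This implies that the distribution of $J_n(\omega)$ is determined by the DFT of the innovations $J_\varepsilon(\omega)$. We generalise the above result to the periodogram.

Lemma 9.2.2 Let us suppose that $\{X_t\}$ is a linear time series $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$, and $\{\varepsilon_t\}$ are iid random variables with mean zero, variance $\sigma^2$ and $E(\varepsilon_t^4) < \infty$. Then we have
$$I_n(\omega) = \Big|\sum_{j}\psi_j\exp(ij\omega)\Big|^2|J_\varepsilon(\omega)|^2 + R_n(\omega), \tag{9.10}$$
where the remainder $R_n(\omega)$ is asymptotically negligible.

To summarise, for a general linear process $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$ we have
$$I_n(\omega) = \Big|\sum_{j}\psi_j\exp(ij\omega)\Big|^2|J_\varepsilon(\omega)|^2 + O_p\Big(\frac{1}{\sqrt{n}}\Big) = 2\pi f(\omega)I_\varepsilon(\omega) + O_p\Big(\frac{1}{\sqrt{n}}\Big), \tag{9.11}$$
where, taking $\sigma^2 = 1$, $f(\omega) = \frac{1}{2\pi}\big|\sum_{j}\psi_j\exp(ij\omega)\big|^2$ is the spectral density of $\{X_t\}$.

The asymptotic normality of $J_n(\omega)$ follows from the asymptotic normality of $J_\varepsilon(\omega)$, which we prove in the following proposition.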
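Proposition 9.2.1 Suppose $\{\varepsilon_t\}$ are iid random variables with mean zero and variance $\sigma^2$. We define $J_\varepsilon(\omega) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\varepsilon_t\exp(it\omega)$ and $I_\varepsilon(\omega) = \frac{1}{2\pi n}\big|\sum_{t=1}^{n}\varepsilon_t\exp(it\omega)\big|^2$. Then we have
$$\mathbf{J}_\varepsilon(\omega) = \begin{pmatrix}\Re J_\varepsilon(\omega)\\ \Im J_\varepsilon(\omega)\end{pmatrix} \xrightarrow{D} N\Big(0, \frac{\sigma^2}{2(2\pi)}I_2\Big), \tag{9.12}$$
where $I_2$ is the identity matrix. Furthermore, for a finite number $m$ of distinct frequencies,
$$\big(\mathbf{J}_\varepsilon(\omega_{k_1})', \ldots, \mathbf{J}_\varepsilon(\omega_{k_m})'\big)' \xrightarrow{D} N\Big(0, \frac{\sigma^2}{2(2\pi)}I_{2m}\Big), \tag{9.13}$$
$\frac{2\pi}{\sigma^2}I_\varepsilon(\omega) \xrightarrow{D} \frac{1}{2}\chi^2(2)$ (which is equivalent to the exponential distribution with mean one) and
$$\mathrm{cov}\big(|J_\varepsilon(\omega_j)|^2, |J_\varepsilon(\omega_k)|^2\big) = \begin{cases}\dfrac{\kappa_4}{(2\pi)^2 n} & j \neq k\\[2mm] \dfrac{\kappa_4}{(2\pi)^2 n} + \dfrac{2\sigma^4}{(2\pi)^2} & j = k\end{cases} \tag{9.14}$$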
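PROOF. We first show (9.12). Fix a frequency $\omega_k = 2\pi k/n$ and write $\Re J_\varepsilon(\omega_k) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\alpha_{t,n}$ and $\Im J_\varepsilon(\omega_k) = \frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\beta_{t,n}$, where $\alpha_{t,n} = \cos(2\pi kt/n)\varepsilon_t$ and $\beta_{t,n} = \sin(2\pi kt/n)\varepsilon_t$. Since $\{\varepsilon_t\}$ are iid with mean zero, $E(\alpha_{t,n}|\varepsilon_{t-1},\ldots,\varepsilon_1) = E(\beta_{t,n}|\varepsilon_{t-1},\ldots,\varepsilon_1) = 0$, hence $\{\alpha_{t,n}\}$ and $\{\beta_{t,n}\}$ are martingale differences. Therefore, to show asymptotic normality, we will use the martingale central limit theorem with the Cramer-Wold device to show (9.12). We note that since $\{\alpha_{t,n}\}$ and $\{\beta_{t,n}\}$ are independent random variables we can prove the same result using a CLT for independent, non-identically distributed variables. However, for practice we will use a martingale CLT. To prove the result we need to verify the three conditions of the martingale CLT. First we consider the conditional variances
$$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|\alpha_{t,n}|^2\,\big|\,\varepsilon_{t-1},\varepsilon_{t-2},\ldots,\varepsilon_1\big) = \frac{1}{2\pi n}\sum_{t=1}^{n}\cos(2\pi kt/n)^2\sigma^2 \to \frac{\sigma^2}{2(2\pi)},$$
$$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|\beta_{t,n}|^2\,\big|\,\varepsilon_{t-1},\varepsilon_{t-2},\ldots,\varepsilon_1\big) = \frac{1}{2\pi n}\sum_{t=1}^{n}\sin(2\pi kt/n)^2\sigma^2 \to \frac{\sigma^2}{2(2\pi)},$$
$$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(\alpha_{t,n}\beta_{t,n}\,\big|\,\varepsilon_{t-1},\varepsilon_{t-2},\ldots,\varepsilon_1\big) = \frac{1}{2\pi n}\sum_{t=1}^{n}\cos(2\pi kt/n)\sin(2\pi kt/n)\sigma^2 \to 0,$$
where the above follows from basic calculations (for $0 < k < n/2$ we have $\sum_{t=1}^n\cos(2\pi kt/n)^2 = \sum_{t=1}^n\sin(2\pi kt/n)^2 = n/2$ and $\sum_{t=1}^n\cos(2\pi kt/n)\sin(2\pi kt/n) = 0$). Finally we need to verify the Lindeberg condition. We only verify it for $\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\alpha_{t,n}$; the same argument holds true for $\frac{1}{\sqrt{2\pi n}}\sum_{t=1}^{n}\beta_{t,n}$. We note that for every $\eta > 0$ we have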
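$$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|\alpha_{t,n}|^2 I(|\alpha_{t,n}| \ge 2\pi\sqrt{n}\eta)\,\big|\,\varepsilon_{t-1},\varepsilon_{t-2},\ldots\big) = \frac{1}{2\pi n}\sum_{t=1}^{n}E\big(|\alpha_{t,n}|^2 I(|\alpha_{t,n}| \ge 2\pi\sqrt{n}\eta)\big).$$
Since $|\alpha_{t,n}| \le |\varepsilon_t|$ we have
$$\frac{1}{2\pi n}\sum_{t=1}^{n}E\big[|\alpha_{t,n}|^2 I(|\alpha_{t,n}| \ge 2\pi\sqrt{n}\eta)\big] \le \frac{1}{2\pi n}\sum_{t=1}^{n}E\big[|\varepsilon_t|^2 I(|\varepsilon_t| \ge 2\pi\sqrt{n}\eta)\big] = \frac{1}{2\pi}E\big[|\varepsilon_t|^2 I(|\varepsilon_t| \ge 2\pi\sqrt{n}\eta)\big] \to 0$$
as $n \to \infty$; the above is true because $E(\varepsilon_t^2) < \infty$. Hence we have verified the Lindeberg condition and we obtain (9.12). The proof of (9.13) is similar, hence we omit the details. Because $I_\varepsilon(\omega) = \Re(J_\varepsilon(\omega))^2 + \Im(J_\varepsilon(\omega))^2$, from (9.12) we have $\frac{2\pi}{\sigma^2}I_\varepsilon(\omega) \xrightarrow{D} \frac{1}{2}\chi^2(2)$.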
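To prove (9.14) we can either derive it from first principles or by using Proposition 8.6.1. Here we do it from first principles. We observe
$$\mathrm{cov}(I_\varepsilon(\omega_j), I_\varepsilon(\omega_k)) = \frac{1}{(2\pi)^2 n^2}\sum_{t_1}\sum_{t_2}\sum_{k_1}\sum_{k_2}\mathrm{cov}(\varepsilon_{t_1}\varepsilon_{t_1+k_1}, \varepsilon_{t_2}\varepsilon_{t_2+k_2})\exp(-ik_1\omega_j + ik_2\omega_k).$$
Since $\{\varepsilon_t\}$ are iid random variables, for most $t_1$, $t_2$, $k_1$ and $k_2$ the above covariance is zero. The exceptions are when $t_1 = t_2$ and $k_1 = k_2$, or $t_1 = t_2$ and $k_1 = k_2 = 0$, or $t_1 - t_2 = k_1 = -k_2$. Counting all these combinations we have
$$\mathrm{cov}(|J_\varepsilon(\omega_j)|^2, |J_\varepsilon(\omega_k)|^2) = \frac{2\sigma^4}{(2\pi)^2 n^2}\sum_{t_1}\sum_{t_2}\exp\big(i(t_1 - t_2)(\omega_j - \omega_k)\big) + \frac{\kappa_4}{(2\pi)^2 n^2}\sum_{t}1,$$
where $\sigma^2 = \mathrm{var}(\varepsilon_t)$ and $\kappa_4 = \mathrm{cum}_4(\varepsilon) = \mathrm{cum}(\varepsilon_t, \varepsilon_t, \varepsilon_t, \varepsilon_t)$. We note that for $j \neq k$, $\sum_t\exp(it(\omega_j - \omega_k)) = 0$ and for $j = k$, $\sum_t\exp(it(\omega_j - \omega_k)) = n$. Substituting this into $\mathrm{cov}(|J_\varepsilon(\omega_j)|^2, |J_\varepsilon(\omega_k)|^2)$ gives us the desired result.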
By combining (9.9) in Lemma 9.2.1 with (9.12), the following result follows immediately.

Corollary 9.2.1 Let us suppose that $\{X_t\}$ is a linear time series $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$, and $\{\varepsilon_t\}$ are iid random variables with mean zero, variance $\sigma^2$ and $E(\varepsilon_t^4) < \infty$. Then for $0 < \omega < \pi$ we have
$$\begin{pmatrix}\Re J_n(\omega)\\ \Im J_n(\omega)\end{pmatrix} \xrightarrow{D} N\Big(0, \frac{1}{2}f(\omega)I_2\Big). \tag{9.15}$$

Using (9.11) we see that $I_n(\omega) \approx 2\pi f(\omega)|J_\varepsilon(\omega)|^2$. This suggests that most of the properties which apply to $|J_\varepsilon(\omega)|^2$ also apply to $I_n(\omega)$. Indeed in the following theorem we show that the asymptotic distribution of $I_n(\omega)$ is exponential with asymptotic mean $f(\omega)$ and variance $f(\omega)^2$ (unless $\omega = 0$, in which case it is $2f(\omega)^2$).

By using Lemma 9.2.1 we now generalise Proposition 9.2.1 to linear processes. We show that, just like the DFT, the periodogram is also near uncorrelated at different frequencies. This result will be useful when motivating and deriving the sampling properties of the spectral density estimator in Section 9.3.
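Theorem 9.2.1 Suppose $\{X_t\}$ is a linear time series $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|j|^{1/2}|\psi_j| < \infty$ with $E[\varepsilon_t] = 0$, $\mathrm{var}[\varepsilon_t] = \sigma^2$ and $E[\varepsilon_t^4] < \infty$. Let $I_n(\omega)$ denote the periodogram associated with $\{X_1, \ldots, X_n\}$ and $f(\cdot)$ be the spectral density. Then

(i) If $f(\omega) > 0$ for all $\omega \in [0, 2\pi]$ and $0 < \omega_1, \ldots, \omega_m < \pi$, then
$$\big(I_n(\omega_1)/f(\omega_1), \ldots, I_n(\omega_m)/f(\omega_m)\big)$$
converges in distribution (as $n \to \infty$) to a vector of independent exponential random variables with mean one.

(ii) Furthermore, for $\omega_j = \frac{2\pi j}{n}$ and $\omega_k = \frac{2\pi k}{n}$ we have
$$\mathrm{cov}(I_n(\omega_k), I_n(\omega_j)) = \begin{cases}2f(\omega_k)^2 + O(n^{-1/2}) & \omega_j = \omega_k = 0 \text{ or } \pi\\ f(\omega_k)^2 + O(n^{-1/2}) & 0 < \omega_j = \omega_k < \pi\\ O(n^{-1}) & \omega_j \neq \omega_k\end{cases}$$
where the bound is uniform in $\omega_j$ and $\omega_k$.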
).
(ii) It is symmetric about zero, $I_n(\omega) = I_n(-\omega)$, like the spectral density function.
(iii) At the fundamental frequencies $\{I_n(\omega_j)\}$ are asymptotically uncorrelated.
(iv) If $0 < \omega < \pi$, $I_n(\omega)$ is asymptotically exponentially distributed with mean $f(\omega)$.
It should be mentioned that Theorem 9.2.1 also holds for several nonlinear time series.
9.3
There are several explanations as to why the raw periodogram cannot be used as an estimator of the spectral density function, despite its mean being approximately equal to the spectral density. One explanation is a direct consequence of Theorem 9.2.1, where we showed that the distribution of the periodogram standardized with the spectral density function is (a multiple of) a chi-squared with two degrees of freedom; from here it is clear that it will not converge to its mean, however large the sample size. An alternative explanation is that the periodogram is the Fourier transform of the autocovariance estimators at $n$ different lags. Typically the variance of each covariance estimator $\hat{c}_n(k)$ will be about $O(n^{-1})$; thus, roughly speaking, the variance of $I_n(\omega)$ will be the sum of these $n$ $O(n^{-1})$ variances, which leads to a variance of $O(1)$. This clearly does not converge to zero.
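A small simulation illustrates why the raw periodogram is not consistent. The sketch below assumes Gaussian white noise with $\sigma^2 = 1$, so that $f(\omega) = 1/(2\pi)$, and looks at the standardized periodogram $I_n(\omega)/f(\omega)$ at the frequency $\pi/2$. The sample sizes, the seed and the number of replications are arbitrary illustrative choices.

```python
import cmath
import math
import random

def periodogram(x, omega):
    """I_n(omega) = |sum_{t=1}^n x_t exp(i*t*omega)|^2 / (2*pi*n)."""
    n = len(x)
    s = sum(xt * cmath.exp(1j * (t + 1) * omega) for t, xt in enumerate(x))
    return abs(s) ** 2 / (2 * math.pi * n)

def standardized_draws(n, reps, rng):
    """Replicates of I_n(omega)/f(omega) for Gaussian white noise, f = 1/(2*pi)."""
    omega = 2 * math.pi * (n // 4) / n          # Fourier frequency pi/2
    f = 1.0 / (2 * math.pi)
    return [periodogram([rng.gauss(0.0, 1.0) for _ in range(n)], omega) / f
            for _ in range(reps)]

def mean_and_variance(values):
    m = sum(values) / len(values)
    v = sum((u - m) ** 2 for u in values) / (len(values) - 1)
    return m, v

rng = random.Random(1)                          # fixed seed for reproducibility
summary = {n: mean_and_variance(standardized_draws(n, 500, rng)) for n in (32, 128)}
for n, (m, v) in summary.items():
    print(f"n={n:4d}  mean={m:.2f}  variance={v:.2f}")
```

The empirical mean stays near one (approximate unbiasedness), but the variance also stays near one instead of shrinking as $n$ grows.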
Both these explanations motivate estimators of the spectral density function, which turn out to be the same. It is worth noting that Parzen (1957) first proposed a consistent estimator of the spectral density. These results not only led to a revolution in spectral density estimation but also in the usual density estimation that you may have encountered in nonparametric statistics (one of the first papers on density estimation is Parzen (1962)).

We recall that $J_n(\omega_k)$ are zero mean uncorrelated random variables whose variance is almost equal to $f(\omega_k)$. This means that $E|J_n(\omega_k)|^2 = E[I_n(\omega_k)] \approx f(\omega_k)$.
Remark 9.3.1 (Smoothness of the spectral density) We observe that
$$f^{(s)}(\omega) = \frac{1}{2\pi}\sum_{r \in \mathbb{Z}}(ir)^s c(r)\exp(ir\omega).$$
Therefore the smoothness of the spectral density is determined by the finiteness of $\sum_r |r^s c(r)|$, in other words by how fast the autocovariance function converges to zero. We recall that the acf of ARMA processes decays exponentially fast to zero, thus $f$ is extremely smooth (all derivatives exist).

Assuming that the autocovariance function converges to zero sufficiently fast, $f$ will vary slowly over frequency. Furthermore, using Theorem 9.2.1, we know that $\{I_n(\omega_k)\}$ are close to uncorrelated and $I_n(\omega_k)/f(\omega_k)$ is exponentially distributed. Therefore we can write $I_n(\omega_k)$ as
$$I_n(\omega_k) = E(I_n(\omega_k)) + [I_n(\omega_k) - E(I_n(\omega_k))] \approx f(\omega_k) + f(\omega_k)U_k, \qquad k = 1, \ldots, n, \tag{9.16}$$
where $\{U_k\}$ is a sequence of mean zero, constant variance, almost uncorrelated random variables. We note that (9.16) resembles the usual nonparametric equation (function plus noise) often considered in nonparametric statistics.
Remark 9.3.2 (Nonparametric Kernel estimation) Let us suppose that we observe $Y_i$ where
$$Y_i = g\Big(\frac{i}{n}\Big) + \varepsilon_i, \qquad 1 \le i \le n,$$
and $\{\varepsilon_i\}$ are iid random variables and $g(\cdot)$ is a smooth function. The kernel estimator of $g(\frac{j}{n})$ is
$$\hat{g}_n\Big(\frac{j}{n}\Big) = \sum_{i}\frac{1}{bn}W\Big(\frac{j - i}{bn}\Big)Y_i,$$
where $W(\cdot)$ is a smooth kernel function of your choosing, such as the Gaussian kernel, etc.
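This suggests that to estimate the spectral density we could use a local weighted average of $\{I_n(\omega_k)\}$. Equation (9.16) motivates the following nonparametric estimator of $f(\omega)$:
$$\hat{f}_n(\omega_j) = \sum_{k}\frac{1}{bn}W\Big(\frac{j - k}{bn}\Big)I_n(\omega_k), \tag{9.17}$$
where $W(\cdot)$ is a spectral window which satisfies $\int W(x)\,dx = 1$ and $\int W(x)^2\,dx < \infty$.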
Example 9.3.1 (Spectral windows) Here we give examples of spectral windows (see Section 6.2.3, page 437 in Priestley (1983)).

(i) The Daniell spectral window is the local average
$$W(x) = \begin{cases}1/2 & |x| \le 1\\ 0 & |x| > 1\end{cases}$$
This window leads to the estimator
$$\hat{f}_n(\omega_j) = \frac{1}{bn}\sum_{k=j-bn/2}^{j+bn/2}I_n(\omega_k).$$
A plot of the periodogram, spectral density and different estimators (using the Daniell kernel with $bn = 2$ and $bn = 10$) of the AR(2) process $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ is given in Figure 9.1. We observe that too small a $b$ leads to undersmoothing but too large a $b$ leads to oversmoothing of features. There are various methods for selecting the bandwidth; one commonly used method, based on the Kullback-Leibler criterion, is proposed in Beltrao and Bloomfield (1987).

(ii) The Bartlett-Priestley spectral window
$$W(x) = \begin{cases}\frac{3}{4}(1 - x^2) & |x| \le 1\\ 0 & |x| > 1\end{cases}$$
This spectral window was designed to reduce the mean squared error of the spectral density estimator (under certain smoothness conditions).
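The Daniell estimator in (i) is simply a moving average of periodogram ordinates. A minimal sketch follows, assuming the window covers `2*half_width + 1` neighbouring Fourier frequencies (playing the role of $bn$) and that indices wrap around, since the periodogram is $2\pi$-periodic. The function name and the toy ordinates are hypothetical.

```python
def daniell_smooth(pgram, half_width):
    """Local average of periodogram ordinates over 2*half_width + 1 frequencies.

    pgram[k] is I_n(omega_k) on the full grid omega_k = 2*pi*k/n; indices are
    taken modulo n because the periodogram is periodic in frequency.
    """
    n = len(pgram)
    width = 2 * half_width + 1
    return [sum(pgram[(j + d) % n] for d in range(-half_width, half_width + 1)) / width
            for j in range(n)]

raw = [1.0, 3.0, 0.5, 2.5, 1.5, 0.2, 4.0, 1.3]   # toy periodogram ordinates
smooth = daniell_smooth(raw, 1)                   # three-point local average
flat = daniell_smooth([2.0] * 8, 2)               # a flat spectrum stays flat
print(smooth)
```

A wider window lowers the variance of each smoothed ordinate but blurs peaks, which is the undersmoothing versus oversmoothing trade-off described above.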
Figure 9.1 (plots not reproduced; four panels of spectrum against frequency on $[0, 0.5]$): Using a realisation of the AR(2): $X_t = 1.5X_{t-1} - 0.75X_{t-2} + \varepsilon_t$ where $n = 256$. Top left: Periodogram, Top right: True spectral density function. Bottom left: Spectral density estimator with $bn = 2$ (smoothed periodogram, bandwidth = 0.00564) and Bottom right: Spectral density estimator with $bn = 10$ (smoothed periodogram, bandwidth = 0.0237).

The above estimator was constructed within the frequency domain. We now consider a spectral density estimator constructed within the time domain. We do this by considering the periodogram from an alternative angle. We recall that
$$I_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\hat{c}_n(k)\exp(ik\omega),$$
thus it is the sum of $n$ autocovariance estimators. This is a type of sieve estimator (a nonparametric function estimator which estimates the coefficients/covariances in a series expansion). But as we explained above, this estimator is not viable because it uses too many coefficient estimators. Since the true coefficients/covariances decay to zero for large lags, this suggests that we do not use all the sample covariances in the estimator, just some of them. Hence a viable estimator of the spectral density is the truncated autocovariance estimator
$$\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-m}^{m}\hat{c}_n(k)\exp(ik\omega), \tag{9.18}$$
or a generalised version of this which down-weights the sample autocovariances at larger lags
$$\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\hat{c}_n(k)\exp(ik\omega), \tag{9.19}$$
where $\lambda(\cdot)$ is the so called lag window. The estimators (9.17) and (9.19) are conceptually very similar. This can be understood if we rewrite $\hat{c}_n(r)$ in terms of the periodogram, $\hat{c}_n(r) = \int_0^{2\pi}I_n(\nu)\exp(-ir\nu)\,d\nu$, and transform (9.19) back into the frequency domain:
$$\tilde{f}_n(\omega) = \frac{1}{2\pi}\int_0^{2\pi}I_n(\nu)\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\exp(ik(\omega - \nu))\,d\nu = \int_0^{2\pi}I_n(\nu)W_m(\omega - \nu)\,d\nu, \tag{9.20}$$
where $W_m(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda(\frac{k}{m})\exp(ik\omega)$.
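The lag-window estimator (9.19) can be sketched directly from its definition. The code below is an illustration under stated assumptions: the sample autocovariances are not mean-corrected, and the two lag windows used (an indicator on $[-1,1]$ and the triangular Bartlett window) are standard choices written out here as hypothetical helper functions.

```python
import cmath
import math

def sample_autocov(x, r):
    """c_n(r) = (1/n) * sum_{t=1}^{n-|r|} x_t x_{t+|r|} (no mean correction)."""
    n, r = len(x), abs(r)
    return sum(x[t] * x[t + r] for t in range(n - r)) / n

def lag_window_estimate(x, omega, m, window):
    """(1/(2*pi)) * sum_{|k|<n} window(k/m) * c_n(k) * exp(i*k*omega)."""
    n = len(x)
    total = 0.0
    for k in range(-(n - 1), n):
        weight = window(k / m)
        if weight != 0.0:
            # c_n(k) is symmetric in k, so only the cosine part survives.
            total += weight * sample_autocov(x, k) * math.cos(k * omega)
    return total / (2 * math.pi)

def truncated(u):
    return 1.0 if abs(u) <= 1 else 0.0

def bartlett(u):
    return max(0.0, 1.0 - abs(u))

def periodogram(x, omega):
    n = len(x)
    s = sum(xt * cmath.exp(1j * (t + 1) * omega) for t, xt in enumerate(x))
    return abs(s) ** 2 / (2 * math.pi * n)

x = [1.0, -0.5, 2.0, 0.3, -1.2, 0.7, 0.0, 1.5, -0.9, 0.4]   # toy data
n = len(x)
grid = [2 * math.pi * k / 50 for k in range(50)]
full = [lag_window_estimate(x, w, n - 1, truncated) for w in grid]   # keeps all lags
raw = [periodogram(x, w) for w in grid]
smooth = [lag_window_estimate(x, w, 4, bartlett) for w in grid]      # m = 4
```

With the indicator window and $m = n - 1$ all lags are kept and the estimator reproduces the periodogram; the Bartlett-weighted version is a smoothed, nonnegative curve.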
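Example 9.3.2 (Examples of Lag windows) Here we detail examples of lag windows.

(i) Truncated Periodogram lag window $\lambda(u) = I_{[-1,1]}(u)$, where $\{\lambda(k/m)\}$ corresponds to
$$W_m(x) = \frac{1}{2\pi}\sum_{k=-m}^{m}e^{ikx} = \frac{1}{2\pi}\frac{\sin[(m + 1/2)x]}{\sin(x/2)},$$
which is the Dirichlet kernel. The Dirichlet kernel can take negative values, so by (9.20) the corresponding estimator can be negative.

(ii) Bartlett lag window $\lambda(u) = (1 - |u|)I_{[-1,1]}(u)$, where $\{\lambda(k/m)\}$ corresponds to
$$W_m(x) = \frac{1}{2\pi}\sum_{k=-m}^{m}\Big(1 - \frac{|k|}{m}\Big)e^{ikx} = \frac{1}{2\pi m}\Big(\frac{\sin(mx/2)}{\sin(x/2)}\Big)^2,$$
which is the Fejer kernel. We can immediately see that one advantage of the Bartlett window is that it corresponds to a spectral density estimator which is positive.

Note that in the case that $m = n$ (the sample size), the truncated periodogram window estimator corresponds to $\sum_{|r| \le n}\hat{c}_n(r)e^{ir\omega}$ and the Bartlett window estimator corresponds to $\sum_{|r| \le n}[1 - |r|/n]\hat{c}_n(r)e^{ir\omega}$.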
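The closed forms of the two kernels, and the sign behaviour that separates them, can be checked numerically. This is a small sketch using the standard Dirichlet and Fejer kernel identities with the $1/(2\pi)$ normalisation; the function names are illustrative.

```python
import cmath
import math

def dirichlet_sum(m, x):
    """(1/(2*pi)) * sum_{k=-m}^{m} exp(i*k*x)."""
    return sum(cmath.exp(1j * k * x) for k in range(-m, m + 1)).real / (2 * math.pi)

def dirichlet_closed(m, x):
    return math.sin((m + 0.5) * x) / (2 * math.pi * math.sin(x / 2))

def fejer_sum(m, x):
    """(1/(2*pi)) * sum_{k=-m}^{m} (1 - |k|/m) * exp(i*k*x)."""
    return sum((1 - abs(k) / m) * cmath.exp(1j * k * x)
               for k in range(-m, m + 1)).real / (2 * math.pi)

def fejer_closed(m, x):
    return (math.sin(m * x / 2) / math.sin(x / 2)) ** 2 / (2 * math.pi * m)

m = 5
grid = [0.05 + 0.1 * k for k in range(60)]   # points in (0, 2*pi), avoiding x = 0
dirichlet_gap = max(abs(dirichlet_sum(m, x) - dirichlet_closed(m, x)) for x in grid)
fejer_gap = max(abs(fejer_sum(m, x) - fejer_closed(m, x)) for x in grid)
dirichlet_min = min(dirichlet_closed(m, x) for x in grid)   # takes negative values
fejer_min = min(fejer_closed(m, x) for x in grid)           # never negative
```

The Dirichlet kernel dips below zero, whereas the Fejer kernel is a squared quantity, which is why the Bartlett estimator cannot be negative.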
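$W_m(\omega)$ and $\frac{1}{b}W(\frac{\omega}{b})$ (defined in (9.17)) cannot be the same function, but they share many of the same characteristics. In particular, if the lag window $\lambda(\cdot)$ is supported on $[-1, 1]$, then
$$W_m(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda\Big(\frac{k}{m}\Big)\exp(ik\omega) = \frac{m}{2\pi}\sum_{k=-(m-1)}^{m-1}\frac{1}{m}\lambda\Big(\frac{k}{m}\Big)\exp(ik\omega) = \frac{m}{2\pi}\sum_{k=-(m-1)}^{m-1}\frac{1}{m}\lambda(u_k)\exp(iu_k m\omega),$$
where $u_k = k/m$. By using (A.2) and (A.3) (in the appendix), we can approximate the sum by the integral and obtain
$$W_m(\omega) = mW(m\omega) + O(1), \qquad \text{where } W(\omega) = \frac{1}{2\pi}\int\lambda(x)\exp(i\omega x)\,dx.$$
Therefore
$$\tilde{f}_n(\omega) \approx m\int I_n(\nu)W(m(\omega - \nu))\,d\nu.$$
Comparing $\hat{f}_n$ with $\tilde{f}_n(\omega)$ we see that $m$ plays the same role as $b^{-1}$. Furthermore, we observe that $\sum_k\frac{1}{bn}W(\frac{j-k}{bn})I(\omega_k)$ is the sum of about $nb$ $I(\omega_k)$ terms. The equivalent for $W_m(\cdot)$ is that it has the spectral width $n/m$. In other words, since $\tilde{f}_n(\omega) = \frac{1}{2\pi}\sum_{k=-(n-1)}^{n-1}\lambda(\frac{k}{m})\hat{c}_n(k)\exp(ik\omega) \approx \int mI_n(\nu)W(m(\omega - \nu))\,d\nu$, it is the sum of about $n/m$ terms.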
We now analyze the sampling properties of the spectral density estimator. It is worth noting that the analysis is very similar to the analysis of the nonparametric kernel regression estimator $\hat{g}_n(\frac{j}{n}) = \frac{1}{bn}\sum_i W(\frac{j-i}{bn})Y_i$, where $Y_i = g(\frac{i}{n}) + g(\frac{i}{n})\varepsilon_i$ and $\{\varepsilon_i\}$ are iid random variables. This is because the periodogram $\{I_n(\omega_k)\}_k$ is near uncorrelated. However, some care still needs to be taken in the proof to ensure that the errors in the near uncorrelated terms do not build up.
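Theorem 9.3.1 Suppose $\{X_t\}$ satisfy $X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j}$, where $\sum_{j=-\infty}^{\infty}|j\psi_j| < \infty$. Let $\hat{f}_n$ be the spectral density estimator defined in (9.17). Then
$$\big|E[\hat{f}_n(\omega_j)] - f(\omega_j)\big| \le C\Big(\frac{1}{n} + b\Big) \tag{9.21}$$
and
$$\mathrm{var}[\hat{f}_n(\omega_j)] \approx \begin{cases}\dfrac{1}{bn}f(\omega_j)^2 & 0 < \omega_j < \pi\\[2mm] \dfrac{2}{bn}f(\omega_j)^2 & \omega_j = 0 \text{ or } \pi\end{cases} \tag{9.22}$$
as $bn \to \infty$, $b \to 0$ and $n \to \infty$.

PROOF. The proofs of both (9.21) and (9.22) are based on the spectral window $W(x/b)$ becoming narrower as $b \to 0$, hence there is increasing localisation as the sample size grows (just like nonparametric regression).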
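We first note that by using Lemma 3.1.1(ii) we have $\sum_r |rc(r)| < \infty$. To prove (9.21) we decompose the bias as
$$E[\hat{f}_n(\omega_j)] - f(\omega_j) = \sum_{k}\frac{1}{bn}W\Big(\frac{k}{bn}\Big)\big[E\,I(\omega_{j-k}) - f(\omega_j)\big]$$
$$= \sum_{k}\frac{1}{bn}W\Big(\frac{k}{bn}\Big)\big[E\,I(\omega_{j-k}) - f(\omega_{j-k})\big] + \sum_{k}\frac{1}{bn}W\Big(\frac{k}{bn}\Big)\big[f(\omega_{j-k}) - f(\omega_j)\big] := I + II.$$
Using Lemma 9.1.1 we have
$$|I| = \Big|\sum_{k}\frac{1}{bn}W\Big(\frac{k}{bn}\Big)\big[E\,I(\omega_{j-k}) - f(\omega_{j-k})\big]\Big| \le C\Big(\sum_{k}\frac{1}{bn}\Big|W\Big(\frac{k}{bn}\Big)\Big|\Big)\Big(\sum_{|k| \ge n}|c(k)| + \frac{1}{n}\sum_{|k| \le n}|kc(k)|\Big) = O\Big(\frac{1}{n}\Big).$$
For $II$, the condition $\sum_r |rc(r)| < \infty$ implies that $f$ has a bounded derivative, so $|f(\omega_{j-k}) - f(\omega_j)| \le C|k|/n$, and only the frequencies with $|k| \le bn$ receive weight. Altogether this gives $I = O(n^{-1})$ and $II = O(b)$ as $bn \to \infty$, $b \to 0$ and $n \to \infty$. The above two bounds give (9.21).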
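We will use Theorem 9.2.1 to prove (9.22). We first assume that $j \neq 0$ or $n$. To prove the result we use that
$$\mathrm{cov}(|J_n(\omega_{k_1})|^2, |J_n(\omega_{k_2})|^2) = \Big[f(\omega_{k_1})I(k_1 = k_2) + O\Big(\frac{1}{n}\Big)\Big]^2 + \Big[f(\omega_{k_1})I(k_1 = n - k_2) + O\Big(\frac{1}{n}\Big)\Big]\Big[f(\omega_{k_1})I(n - k_1 = k_2) + O\Big(\frac{1}{n}\Big)\Big]$$
$$+ \Big[\frac{1}{n}f_4(\omega_{k_1}, -\omega_{k_1}, \omega_{k_2}) + O\Big(\frac{1}{n^2}\Big)\Big].$$
Substituting this into the variance of the weighted average (9.17) gives (9.22).

The above result means that the mean squared error of the estimator satisfies
$$E\big[\hat{f}_n(\omega_j) - f(\omega_j)\big]^2 \to 0,$$
where $bn \to \infty$ and $b \to 0$ as $n \to \infty$. Moreover
$$E\big[\hat{f}_n(\omega_j) - f(\omega_j)\big]^2 = O\Big(\frac{1}{bn} + b\Big).$$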
Remark 9.3.3 (The distribution of the spectral density estimator) Using that the periodogram ratio $2I_n(\omega)/f(\omega)$ is asymptotically $\chi^2(2)$ distributed and uncorrelated at the fundamental frequencies, we can heuristically deduce the limiting distribution of $\hat{f}_n(\omega)$. Here we consider the distribution of the local average (Daniell) estimator
$$\hat{f}_n(\omega_j) = \frac{1}{bn}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k).$$
Since $2I(\omega_k)/f(\omega_k)$ are approximately $\chi^2(2)$, and the sum $\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k)$ is taken over a local neighbourhood of $\omega_j$, we have that $2f(\omega_j)^{-1}\sum_{k=j-bn/2}^{j+bn/2}I(\omega_k)$ is approximately $\chi^2(2bn)$. We note that when $bn$ is large, $\chi^2(2bn)$ is close to normal. Hence
$$\sqrt{bn}\big(\hat{f}_n(\omega_j) - f(\omega_j)\big) \approx N\big(0, f(\omega_j)^2\big).$$
Using these asymptotic results, we can construct confidence intervals for $f(\omega_j)$.
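As an illustration of the last remark, a normal-approximation confidence interval can be formed from $\sqrt{bn}(\hat{f}_n(\omega_j) - f(\omega_j)) \approx N(0, f(\omega_j)^2)$ by plugging $\hat{f}_n(\omega_j)$ in for the unknown $f(\omega_j)$ in the variance. The sketch below assumes the local average (Daniell) estimator, with `num_terms` playing the role of $bn$; the function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def spectral_ci(f_hat, num_terms, level=0.95):
    """Approximate interval f_hat +/- z * f_hat / sqrt(num_terms).

    num_terms is the number of periodogram ordinates averaged (bn in the
    text); the plug-in standard error f_hat / sqrt(num_terms) is an assumption
    of this sketch, not an exact finite-sample result.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * f_hat / math.sqrt(num_terms)
    return f_hat - half_width, f_hat + half_width

lower, upper = spectral_ci(f_hat=2.0, num_terms=25)
print(round(lower, 3), round(upper, 3))
```

Because the standard error is proportional to $f$ itself, intervals are often built on the log scale instead, where the width no longer depends on the level of the spectrum.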
In general, to prove normality of $\hat{f}_n$ we rewrite it as a quadratic form, from which asymptotic normality can be derived, where
$$\sqrt{bn}\big(\hat{f}_n(\omega_j) - f(\omega_j)\big) \xrightarrow{D} N\Big(0, f(\omega_j)^2\int W(u)^2\,du\Big).$$
The variance of the spectral density estimator is simple to derive by using Proposition 8.6.1. The remarkable aspect is that the variance of the spectral density estimator does not involve (asymptotically) the fourth order cumulant (as it is of lower order).
9.4
In Chapter 6 we considered various methods for estimating the parameters of an ARMA process. The most efficient method (in terms of Fisher efficiency), when the errors are Gaussian, is the Gaussian maximum likelihood estimator. This estimator was defined in the time domain, but it is interesting to note that a very similar estimator which is asymptotically equivalent to the GMLE can be defined within the frequency domain. We start by using heuristics to define the Whittle likelihood. We then show how it is related to the Gaussian maximum likelihood.

To motivate the method let us return to the Sunspot data considered in Exercise 5.1. The periodogram and the spectral density corresponding to the best fitting autoregressive model,
$$f(\omega) = (2\pi)^{-1}\big|1 - 1.1584e^{i\omega} - 0.3890e^{i2\omega} - 0.1674e^{i3\omega} - 0.1385e^{i4\omega} - 0.1054e^{i5\omega} - 0.0559e^{i6\omega} - 0.0049e^{i7\omega} - 0.0572e^{i8\omega} - 0.2378e^{i9\omega}\big|^{-2},$$
is given in Figure 9.2. We see that the spectral density of the best fitting AR process closely follows the shape of the periodogram (the squared modulus of the DFT). This means that indirectly the autoregressive estimator (Yule-Walker) chose the AR parameters which best fitted the shape of the periodogram. The Whittle likelihood estimator, which we describe below, does this directly, by selecting the parametric spectral density function which best fits the periodogram.

Figure 9.2 (plots not reproduced; top panel: periodogram against frequency, bottom panel: autoregressive spectrum against frequency, both on $[0, 0.5]$): The periodogram of sunspot data (with the mean removed, which is necessary to prevent a huge peak at zero) and the spectral density of the best fitting AR model.

The Whittle likelihood measures the distance between $I_n(\omega)$ and the parametric spectral density function using the Kullback-Leibler criterion
$$L_n^w(\theta) = \sum_{k=1}^{n}\Big(\log f_\theta(\omega_k) + \frac{I_n(\omega_k)}{f_\theta(\omega_k)}\Big), \qquad \omega_k = \frac{2\pi k}{n},$$
and the parametric model which minimises this distance is used as the estimated model. The choice of this criterion over the other distance criteria may appear to be a little arbitrary, however there are several reasons why this is considered a good choice. Below we give some justifications as to why.
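Suppose that $\{X_t\}$ satisfies the ARMA representation
$$X_t = \sum_{j=1}^{p}\phi_j X_{t-j} + \sum_{j=1}^{q}\theta_j\varepsilon_{t-j} + \varepsilon_t,$$
and $\{\varepsilon_t\}$ are iid random variables. We will assume that $\{\phi_j\}$ and $\{\theta_j\}$ are such that the roots of their corresponding characteristic polynomials are greater than $1 + \delta$ in absolute value. Let $\theta = (\{\phi_j\}, \{\theta_j\})$ denote the parameter vector. As we mentioned in Section 8.2, if $\sum_r |rc(r)| < \infty$, then
$$\mathrm{cov}(J_n(\omega_{k_1}), J_n(\omega_{k_2})) = \begin{cases}f_\theta(\omega_{k_1}) + O(\frac{1}{n}) & k_1 = k_2\\ O(\frac{1}{n}) & k_1 \neq k_2,\end{cases}$$
where
$$f_\theta(\omega) = \frac{\sigma^2\big|1 + \sum_{j=1}^{q}\theta_j\exp(ij\omega)\big|^2}{2\pi\big|1 - \sum_{j=1}^{p}\phi_j\exp(ij\omega)\big|^2}.$$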
In other words, if the time series satisfies an ARMA representation, the DFT is near uncorrelated, its mean is zero and its variance has a well specified parametric form. Using this information we can define a criterion for estimating the parameters. We motivate this criterion through the likelihood; however, there are various other methods for motivating the criterion, for example the Kullback-Leibler criterion is an alternative motivation (we comment on this later on).

If the innovations are Gaussian then $\Re J_n(\omega)$ and $\Im J_n(\omega)$ are also Gaussian, thus by using the above we approximately have
$$\mathbf{J}_n = \begin{pmatrix}\Re J_n(\omega_1)\\ \Im J_n(\omega_1)\\ \vdots\\ \Re J_n(\omega_{n/2})\\ \Im J_n(\omega_{n/2})\end{pmatrix} \sim N\Big(0, \tfrac{1}{2}\mathrm{diag}\big(f_\theta(\omega_1), f_\theta(\omega_1), \ldots, f_\theta(\omega_{n/2}), f_\theta(\omega_{n/2})\big)\Big).$$
In the case that the innovations are not normal then, by Corollary 9.2.1, the above holds asymptotically for a finite number of frequencies. Here we construct the likelihood under normality of the innovations; however, this assumption is not required and is only used to motivate the construction.

Since $\mathbf{J}_n$ is a normally distributed random vector with mean zero and approximately diagonal variance matrix, its negative log-likelihood is, ignoring constants, approximately proportional to
$$L_n^w(\theta) = \sum_{k=1}^{n}\Big(\log|f_\theta(\omega_k)| + \frac{|J_n(\omega_k)|^2}{f_\theta(\omega_k)}\Big).$$
To estimate the parameter we would choose the $\theta$ which minimises the above criterion, that is
$$\hat{\theta}_n^w = \arg\min_{\theta \in \Theta}L_n^w(\theta), \tag{9.23}$$
where $\Theta$ consists of all parameters where the roots of the corresponding characteristic polynomial have absolute value greater than $(1 + \delta)$ (note that under this assumption all spectral densities corresponding to these parameters will be bounded away from zero).
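Example 9.4.1 (Fitting an ARMA(1, 1) model to the data) To fit an ARMA(1, 1) model to the data using the Whittle likelihood we use the criterion
$$L_n^w(\theta) = \sum_{k=1}^{n/2}\Big(\log\frac{\sigma^2|1 + \theta e^{i\omega_k}|^2}{2\pi|1 - \phi e^{i\omega_k}|^2} + I_n(\omega_k)\frac{2\pi|1 - \phi e^{i\omega_k}|^2}{\sigma^2|1 + \theta e^{i\omega_k}|^2}\Big).$$
By differentiating $L_n^w$ with respect to $\phi$, $\sigma^2$ and $\theta$ we obtain three equations which we solve (usually numerically); this gives us the Whittle likelihood estimators.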
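The criterion in Example 9.4.1 is straightforward to evaluate. The sketch below is a simplified illustration rather than a full estimation routine: it simulates an AR(1) process (an ARMA(1, 1) with $\theta = 0$), holds $\sigma^2 = 1$ and $\theta = 0$ fixed, and minimises over $\phi$ by a coarse grid search instead of solving the score equations. The function names, seed and grid are hypothetical.

```python
import cmath
import math
import random

def arma11_spectral_density(omega, phi, theta, sigma2):
    """f(omega) = sigma2 * |1 + theta e^{i omega}|^2 / (2 pi |1 - phi e^{i omega}|^2)."""
    num = abs(1 + theta * cmath.exp(1j * omega)) ** 2
    den = abs(1 - phi * cmath.exp(1j * omega)) ** 2
    return sigma2 * num / (2 * math.pi * den)

def periodogram(x, omega):
    n = len(x)
    s = sum(xt * cmath.exp(1j * (t + 1) * omega) for t, xt in enumerate(x))
    return abs(s) ** 2 / (2 * math.pi * n)

def whittle_criterion(pgram, phi, theta, sigma2):
    """Sum over the stored frequencies of log f(omega_k) + I_n(omega_k)/f(omega_k)."""
    total = 0.0
    for omega, i_k in pgram:
        f = arma11_spectral_density(omega, phi, theta, sigma2)
        total += math.log(f) + i_k / f
    return total

# Simulate an AR(1) with phi = 0.6, discarding a burn-in of 100 observations.
rng = random.Random(42)
n, phi_true = 256, 0.6
x, prev = [], 0.0
for _ in range(n + 100):
    prev = phi_true * prev + rng.gauss(0.0, 1.0)
    x.append(prev)
x = x[100:]

# Periodogram at omega_k = 2*pi*k/n, k = 1, ..., n/2.
pgram = [(2 * math.pi * k / n, periodogram(x, 2 * math.pi * k / n))
         for k in range(1, n // 2 + 1)]

grid = [i / 50 for i in range(-45, 46)]   # candidate phi in [-0.9, 0.9]
phi_hat = min(grid, key=lambda p: whittle_criterion(pgram, p, 0.0, 1.0))
print(phi_hat)
```

In practice all three parameters would be optimised jointly with a numerical optimiser; the grid search only makes the shape of the criterion easy to inspect.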
Whittle (1962) showed that the above criterion is an approximation of the GMLE. The correct proof is quite complicated and uses several matrix approximations due to Grenander and Szegö (1958). Instead we give a heuristic proof which is quite enlightening.

Returning to the Gaussian likelihood for the ARMA process, defined in (7.10), we rewrite it as
$$L_n(\theta) = \log\det R_n(\theta) + X_n'R_n(\theta)^{-1}X_n = \log\det R_n(f_\theta) + X_n'R_n(f_\theta)^{-1}X_n, \tag{9.24}$$
where $R_n(f_\theta)_{s,t} = \int f_\theta(\omega)\exp(i(s - t)\omega)\,d\omega$. We now show that $L_n(\theta) \approx L_n^w(\theta)$.
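Lemma 9.4.1 Suppose that $\{X_t\}$ is a stationary ARMA time series with absolutely summable autocovariances and spectral density $f_\theta$. Then
$$\log\det R_n(f_\theta) + X_n'R_n(f_\theta)^{-1}X_n = \sum_{k=1}^{n}\Big(\log|f_\theta(\omega_k)| + \frac{|J_n(\omega_k)|^2}{f_\theta(\omega_k)}\Big) + O(1),$$
for large $n$.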
PROOF. There are various ways to precisely prove this result. All of them show that the Toeplitz matrix can in some sense be approximated by a circulant matrix. This result uses Szegö's identity (Grenander and Szegö (1958)). The main difficulty in the proof is showing that $R_n(f_\theta)^{-1} \approx U_n(f_\theta^{-1})$, where $U_n(f_\theta^{-1})_{s,t} = \int f_\theta(\omega)^{-1}\exp(i(s - t)\omega)\,d\omega$. An interesting derivation is given in Brockwell and Davis (1998), Section 10.8. The main ingredients in the proof are:

1. For a sufficiently large $m$, $R_n(f_\theta)^{-1}$ can be approximated by $R_n(g_m)^{-1}$, where $g_m$ is the spectral density of an $m$th order autoregressive process (this follows from Lemma 8.5.2), and showing that
$$X_n'R_n(f_\theta)^{-1}X_n - X_n'R_n(g_m)^{-1}X_n = X_n'\big[R_n(f_\theta)^{-1} - R_n(g_m)^{-1}\big]X_n = X_n'R_n(g_m)^{-1}\big[R_n(g_m) - R_n(f_\theta)\big]R_n(f_\theta)^{-1}X_n \to 0.$$

2. From Section 3.2.3, we recall that if $g_m$ is the spectral density of an AR($m$) process, then for $n \gg m$, $R_n(g_m)^{-1}$ will be bandlimited with most of its rows a shift of the others (thus, with the exception of the first $m$ and last $m$ rows, it is close to circulant).

3. We approximate $R_n(g_m)^{-1}$ with a circulant matrix, showing that
$$X_n'\big[R_n(g_m)^{-1} - C_n(g_m^{-1})\big]X_n \to 0,$$
where $C_n(g_m^{-1})$ is the corresponding circulant matrix (where for $0 < |i - j| \le m$ and either $i$ or $j$ greater than $m$, $(C_n(g_m^{-1}))_{ij} = 2\sum_{k=|i-j|}^{m}\phi_{m,k}\phi_{m,k-|i-j|+1} - \phi_{m,|i-j|}$) with the eigenvalues $\{g_m(\omega_k)^{-1}\}_{k=1}^{n}$.

4. These steps show that
$$X_n'\big[R_n(f_\theta)^{-1} - U_n(g_m^{-1})\big]X_n \to 0$$
as $m \to \infty$ as $n \to \infty$, which gives the result.
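Remark 9.4.1 (A heuristic derivation) We give a heuristic proof. Using the results in Section 8.2 we have seen that $R_n(f_\theta)$ can be approximately written in terms of the eigenvalues and eigenvectors of the circulant matrix associated with $R_n(f_\theta)$, that is (ignoring constant factors of $2\pi$)
$$R_n(f_\theta) \approx F_n\Delta(f_\theta^{(n)})\bar{F}_n \quad \text{thus} \quad R_n(f_\theta)^{-1} \approx F_n\Delta(f_\theta^{(n)})^{-1}\bar{F}_n, \tag{9.25}$$
where $\Delta(f_\theta^{(n)}) = \mathrm{diag}(f_\theta^{(n)}(\omega_1), \ldots, f_\theta^{(n)}(\omega_n))$, $f_\theta^{(n)}(\omega) = \frac{1}{2\pi}\sum_{j=-(n-1)}^{(n-1)}c_\theta(j)\exp(ij\omega) \to f_\theta(\omega)$ and $\omega_k = 2\pi k/n$. Up to normalisation, $X_n'F_n$ is the vector of DFTs,
$$X_n'F_n = (J_n(\omega_1), \ldots, J_n(\omega_n)). \tag{9.26}$$
Substituting (9.25) and (9.26) into (9.24) yields
$$\frac{1}{n}L_n(\theta) \approx \frac{1}{n}\sum_{k=1}^{n}\Big(\log f_\theta(\omega_k) + \frac{|J_n(\omega_k)|^2}{f_\theta(\omega_k)}\Big) = \frac{1}{n}L_n^w(\theta). \tag{9.27}$$
Hence using the approximation in (9.25) leads to a heuristic equivalence between the Whittle and Gaussian likelihood.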
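The circulant approximation behind (9.25) rests on the fact that the eigenvectors of a circulant matrix are the Fourier vectors, with eigenvalues given by the spectral density at the Fourier frequencies. The sketch below checks this for the circulant matrix built from MA(1) autocovariances $c(0) = 1 + \theta^2$, $c(\pm 1) = \theta$ (with $\sigma^2 = 1$); under the convention $f(\omega) = \frac{1}{2\pi}\sum_r c(r)e^{ir\omega}$ the eigenvalues are $2\pi f(\omega_k)$. All names are illustrative.

```python
import cmath
import math

def circulant_ma1(n, theta):
    """Circulant matrix whose first column holds the MA(1) autocovariances."""
    first = [0.0] * n
    first[0] = 1 + theta ** 2      # c(0)
    first[1] = theta               # c(1)
    first[n - 1] = theta           # c(-1), wrapped around
    return [[first[(s - t) % n] for t in range(n)] for s in range(n)]

def ma1_spectral_density(omega, theta):
    return abs(1 + theta * cmath.exp(1j * omega)) ** 2 / (2 * math.pi)

n, theta = 12, 0.6
C = circulant_ma1(n, theta)
max_error = 0.0
for k in range(n):
    omega_k = 2 * math.pi * k / n
    v = [cmath.exp(1j * t * omega_k) / math.sqrt(n) for t in range(n)]   # Fourier vector
    Cv = [sum(C[s][t] * v[t] for t in range(n)) for s in range(n)]
    lam = 2 * math.pi * ma1_spectral_density(omega_k, theta)             # eigenvalue
    max_error = max(max_error, max(abs(Cv[s] - lam * v[s]) for s in range(n)))
print(max_error)
```

The Toeplitz matrix $R_n(f_\theta)$ differs from this circulant matrix only in its corner entries, which is the sense in which the two are close for large $n$.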
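Lemma 9.4.2 (Consistency) Suppose that $\{X_t\}$ is a causal ARMA process with parameters $\theta$ whose roots lie outside the $(1 + \delta)$-circle (where $\delta > 0$ is arbitrary). Let $\hat{\theta}^w$ be defined as in (9.23) and suppose that $E(\varepsilon_t^4) < \infty$. Then we have
$$\hat{\theta}^w \xrightarrow{P} \theta.$$

PROOF. To show consistency we need to show pointwise convergence and equicontinuity of $\frac{1}{n}L_n^w$. Let
$$L^w(\theta) = \frac{1}{2\pi}\int_0^{2\pi}\Big(\log f_\theta(\omega) + \frac{f_{\theta_0}(\omega)}{f_\theta(\omega)}\Big)d\omega.$$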
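For the variance we have
$$\mathrm{var}\Big(\frac{1}{n}L_n^w(\theta)\Big) = \frac{1}{n^2}\sum_{k_1,k_2=1}^{n}\frac{1}{f_\theta(\omega_{k_1})f_\theta(\omega_{k_2})}\mathrm{cov}(|J_n(\omega_{k_1})|^2, |J_n(\omega_{k_2})|^2) = O\Big(\frac{1}{n}\Big).$$
Thus we have
$$\frac{1}{n}L_n^w(\theta) \xrightarrow{P} L^w(\theta).$$
To show equicontinuity we apply the mean value theorem to $\frac{1}{n}L_n^w$. We note that because the parameters $(\phi, \theta) \in \Theta$ have characteristic polynomials whose roots are greater than $(1 + \delta)$, $f_\theta(\omega)$ is bounded away from zero (there exists a $\delta^* > 0$ where $\inf_{\omega,\theta}f_\theta(\omega) \ge \delta^*$). Hence it can be shown that there exists a random sequence $\{K_n\}$, bounded in probability, such that $|\frac{1}{n}L_n^w(\theta_1) - \frac{1}{n}L_n^w(\theta_2)| \le K_n\|\theta_1 - \theta_2\|$. Therefore $\frac{1}{n}L_n^w$ is stochastically equicontinuous. Since the parameter space $\Theta$ is compact, the three standard conditions are satisfied and we have consistency of the Whittle estimator.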
To show asymptotic normality we note that
$$\frac{1}{n}L_n^w(\theta) = \frac{1}{n}\sum_{k=1}^{n}\log f_\theta(\omega_k) + \frac{1}{2\pi n}\sum_{r=-(n-1)}^{n-1}d_n(r; \theta)\sum_{k=1}^{n-|r|}X_kX_{k+r},$$
so that $\frac{1}{n}L_n^w(\theta)$ can be written as a quadratic form, where
$$d_n(r; \theta) = \frac{1}{n}\sum_{k=1}^{n}f_\theta(\omega_k)^{-1}\exp(ir\omega_k).$$
Using the above quadratic form and its derivatives with respect to $\theta$ one can show normality of the Whittle likelihood under various dependence conditions on the time series. Using this result, in the following theorem we show asymptotic normality of the Whittle estimator. Note that this result applies not only to linear time series, but to several types of nonlinear time series too.
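Theorem 9.4.1 Let us suppose that $\{X_t\}$ is a strictly stationary time series with a sufficient dependence structure (such as linearity, mixing at a certain rate, etc.) with spectral density function $f(\omega)$ and fourth order cumulant spectrum $f_4$. Let
$$L_n^w(\theta) = \sum_{k=1}^{n}\Big(\log|f_\theta(\omega_k)| + \frac{|J_n(\omega_k)|^2}{f_\theta(\omega_k)}\Big),$$
$$\hat{\theta}_n = \arg\min_{\theta \in \Theta}L_n^w(\theta), \qquad \theta_0 = \arg\min_{\theta \in \Theta}L^w(\theta).$$
Then we have
$$\sqrt{n}\big(\hat{\theta}_n - \theta_0\big) \xrightarrow{D} N(0, 2V^{-1} + V^{-1}WV^{-1}),$$
where
$$V = \frac{1}{2\pi}\int_0^{2\pi}\Big(\frac{\nabla_\theta f_\theta(\omega)}{f_\theta(\omega)}\Big)\Big(\frac{\nabla_\theta f_\theta(\omega)}{f_\theta(\omega)}\Big)'d\omega,$$
$$W = \frac{2}{(2\pi)^2}\int_0^{2\pi}\int_0^{2\pi}\big(\nabla_\theta f_\theta(\omega_1)^{-1}\big)\big(\nabla_\theta f_\theta(\omega_2)^{-1}\big)'f_{4,\theta_0}(\omega_1, -\omega_1, \omega_2)\,d\omega_1d\omega_2.$$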
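Lemma 9.4.3 Suppose $f_\theta(\omega) = \frac{\sigma^2}{2\pi}\big|1 + \sum_{j=1}^{\infty}\psi_j\exp(ij\omega)\big|^2$, where $\psi(z) = 1 + \sum_{j=1}^{\infty}\psi_jz^j$ is nonzero for $|z| \le 1$. Then $\frac{1}{2\pi}\int_0^{2\pi}\log f_\theta(\omega)\,d\omega = \log\frac{\sigma^2}{2\pi}$.

PROOF. Since $\psi(z)$ is nonzero for $|z| \le 1$, $\log\psi(z)$ has no poles in $\{z; |z| \le 1\}$. Thus we have
$$\frac{1}{2\pi}\int_0^{2\pi}\log f_\theta(\omega)\,d\omega = \frac{1}{2\pi}\int_0^{2\pi}\log\frac{\sigma^2}{2\pi}\,d\omega + \frac{1}{2\pi}\int_0^{2\pi}\log\Big|1 + \sum_{j=1}^{\infty}\psi_j\exp(ij\omega)\Big|^2d\omega$$
$$= \frac{1}{2\pi}\int_0^{2\pi}\log\frac{\sigma^2}{2\pi}\,d\omega + \frac{1}{2\pi}\int_{|z|=1}\log\Big|1 + \sum_{j=1}^{\infty}\psi_jz^j\Big|^2dz$$
$$= \frac{1}{2\pi}\int_0^{2\pi}\log\frac{\sigma^2}{2\pi}\,d\omega.$$
An alternative proof is that since $\psi(z)$ is analytic and does not have any zeros for $|z| \le 1$, $\log\psi(z)$ is also analytic in the region $|z| \le 1$, thus for $|z| \le 1$ we have the power series expansion $\log\big(1 + \sum_{j=1}^{\infty}\psi_jz^j\big) = \sum_{j=1}^{\infty}b_jz^j$ (a Taylor expansion about $\log 1$). Using this we have
$$\frac{1}{2\pi}\int_0^{2\pi}\log\Big(1 + \sum_{j=1}^{\infty}\psi_j\exp(ij\omega)\Big)d\omega = \frac{1}{2\pi}\int_0^{2\pi}\sum_{j=1}^{\infty}b_j\exp(ij\omega)\,d\omega = \frac{1}{2\pi}\sum_{j=1}^{\infty}b_j\int_0^{2\pi}\exp(ij\omega)\,d\omega = 0,$$
and the same holds for the complex conjugate, which gives the result.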
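The key step of the proof, $\frac{1}{2\pi}\int_0^{2\pi}\log|1 + \sum_j\psi_j e^{ij\omega}|^2\,d\omega = 0$ when the polynomial has no zeros in the closed unit disc, can be checked numerically in the simplest case $\psi(z) = 1 + \theta z$. The sketch below uses a Riemann sum on an equally spaced grid; the function name and grid size are illustrative. For contrast it also evaluates a case with $|\theta| > 1$, where the root lies inside the unit circle and the integral equals $\log\theta^2$ instead.

```python
import cmath
import math

def mean_log_modulus_sq(theta, num_points=2048):
    """Riemann sum for (1/(2*pi)) * int_0^{2 pi} log|1 + theta e^{i omega}|^2 d omega."""
    total = 0.0
    for k in range(num_points):
        omega = 2 * math.pi * k / num_points
        total += math.log(abs(1 + theta * cmath.exp(1j * omega)) ** 2)
    return total / num_points

invertible = mean_log_modulus_sq(0.7)       # root at -1/0.7, outside the unit circle
non_invertible = mean_log_modulus_sq(2.0)   # root at -0.5, inside the unit circle
print(invertible, non_invertible, math.log(4.0))
```

This is why the assumption on the roots matters: it is exactly what makes the integral of the log spectral density free of the ARMA coefficients.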
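Lemma 9.4.4 Suppose that $\{X_t\}$ is a linear ARMA time series $X_t - \sum_{j=1}^{p}\phi_jX_{t-j} = \sum_{i=1}^{q}\theta_i\varepsilon_{t-i} + \varepsilon_t$, where $E[\varepsilon_t] = 0$, $\mathrm{var}[\varepsilon_t] = \sigma^2$ and $E[\varepsilon_t^4] < \infty$. Let $\theta = (\{\phi_j, \theta_j\})$, then we have $W = 0$ and
$$\sqrt{n}\big(\hat{\theta}_n^w - \theta\big) \xrightarrow{D} N(0, 2V^{-1}).$$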
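PROOF. The result follows from Theorem 9.4.1; however, we need to show that in the case of linearity $W = 0$.

We use Example 8.6.1 for linear processes to give $f_{4,\theta}(\omega_1, -\omega_1, \omega_2) = \kappa_4|A(\omega_1)|^2|A(\omega_2)|^2 = \frac{\kappa_4}{\sigma^4}f(\omega_1)f(\omega_2)$. Substituting this into $W$ gives
$$W = \frac{2}{(2\pi)^2}\int_0^{2\pi}\int_0^{2\pi}\nabla_\theta\frac{1}{f_\theta(\omega_1)}\Big(\nabla_\theta\frac{1}{f_\theta(\omega_2)}\Big)'f_{4,\theta_0}(\omega_1, -\omega_1, \omega_2)\,d\omega_1d\omega_2$$
$$= \frac{2\kappa_4}{\sigma^4}\Big(\frac{1}{2\pi}\int_0^{2\pi}\frac{\nabla_\theta f_\theta(\omega)}{f_\theta(\omega)^2}f_\theta(\omega)\,d\omega\Big)\Big(\frac{1}{2\pi}\int_0^{2\pi}\frac{\nabla_\theta f_\theta(\omega)}{f_\theta(\omega)}\,d\omega\Big)'$$
$$= \frac{2\kappa_4}{\sigma^4}\Big(\frac{1}{2\pi}\int_0^{2\pi}\nabla_\theta\log f_\theta(\omega)\,d\omega\Big)\Big(\frac{1}{2\pi}\int_0^{2\pi}\nabla_\theta\log f_\theta(\omega)\,d\omega\Big)'$$
$$= \frac{2\kappa_4}{\sigma^4}\Big(\nabla_\theta\frac{1}{2\pi}\int_0^{2\pi}\log f_\theta(\omega)\,d\omega\Big)\Big(\nabla_\theta\frac{1}{2\pi}\int_0^{2\pi}\log f_\theta(\omega)\,d\omega\Big)' = 0,$$
where by using Lemma 9.4.3 we have $\frac{1}{2\pi}\int_0^{2\pi}\log f_\theta(\omega)\,d\omega = \log\frac{\sigma^2}{2\pi}$, and since $\theta$ does not include $\sigma^2$ we obtain the above. Hence for linear processes the higher order cumulant does not play an asymptotic role in the variance, thus giving the result.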
On first appearances there does not seem to be a connection between the Whittle likelihood and the sample autocorrelation estimator defined in Section 6.2.1. However, we observe that the variances of both estimators, under linearity, do not contain the fourth order cumulant (even for non-Gaussian linear time series). In Section 9.5 we explain that there is a connection between the two, and it is this connection that explains away this fourth order cumulant term.

Remark 9.4.2 Under linearity, the GMLE and the Whittle likelihood are asymptotically equivalent, therefore they have the same asymptotic distributions. The GMLE has the asymptotic distribution $\sqrt{n}(\hat{\phi}_n - \phi, \hat{\theta}_n - \theta) \xrightarrow{D} N(0, \Lambda^{-1})$, where
$$\Lambda = \begin{pmatrix}E(U_tU_t') & E(V_tU_t')\\ E(U_tV_t') & E(V_tV_t')\end{pmatrix}$$
and $\{U_t\}$ and $\{V_t\}$ are autoregressive processes which satisfy $\phi(B)U_t = \varepsilon_t$ and $\theta(B)V_t = \varepsilon_t$. By using similar derivatives to those given in (7.11) we can show that
$$\begin{pmatrix}E(U_tU_t') & E(V_tU_t')\\ E(U_tV_t') & E(V_tV_t')\end{pmatrix} = \frac{1}{2\pi}\int_0^{2\pi}\Big(\frac{\nabla_\theta f_\theta(\omega)}{f_\theta(\omega)}\Big)\Big(\frac{\nabla_\theta f_\theta(\omega)}{f_\theta(\omega)}\Big)'d\omega.$$

9.5
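We recall from (9.4) that the covariance can be written as a general periodogram mean which has the form
$$A(\phi, I_n) = \frac{1}{n}\sum_{k=1}^{n}I_n(\omega_k)\phi(\omega_k). \tag{9.28}$$
Its variance is
$$\mathrm{var}(A(\phi, I_n)) = \frac{1}{n^2}\sum_{k_1,k_2=1}^{n}\phi(\omega_{k_1})\phi(\omega_{k_2})\mathrm{cov}(|J_n(\omega_{k_1})|^2, |J_n(\omega_{k_2})|^2)$$
$$= \frac{1}{n^2}\sum_{k_1,k_2=1}^{n}\phi(\omega_{k_1})\phi(\omega_{k_2})\Big[\mathrm{cov}(J_n(\omega_{k_1}), J_n(\omega_{k_2}))\mathrm{cov}(\overline{J_n(\omega_{k_1})}, \overline{J_n(\omega_{k_2})}) + \mathrm{cov}(J_n(\omega_{k_1}), \overline{J_n(\omega_{k_2})})\mathrm{cov}(\overline{J_n(\omega_{k_1})}, J_n(\omega_{k_2}))$$
$$+ \mathrm{cum}(J_n(\omega_{k_1}), \overline{J_n(\omega_{k_1})}, J_n(\omega_{k_2}), \overline{J_n(\omega_{k_2})})\Big]. \tag{9.29}$$
Using the expressions for the covariances and cumulants of the DFT gives
$$\mathrm{var}(A(\phi, I_n)) = \frac{1}{n^2}\sum_{k=1}^{n}\phi(\omega_k)^2f(\omega_k)^2 + \frac{1}{n^2}\sum_{k=1}^{n}\phi(\omega_k)\phi(\omega_{n-k})f(\omega_k)^2 + \frac{1}{n^3}\sum_{k_1,k_2=1}^{n}\phi(\omega_{k_1})\phi(\omega_{k_2})f_4(\omega_{k_1}, -\omega_{k_1}, \omega_{k_2}) + O\Big(\frac{1}{n^2}\Big)$$
$$= \frac{1}{n}\int_0^{2\pi}\phi(\omega)^2f(\omega)^2\,d\omega + \frac{1}{n}\int_0^{2\pi}\phi(\omega)\phi(2\pi - \omega)f(\omega)^2\,d\omega + \frac{1}{n}\int_0^{2\pi}\int_0^{2\pi}\phi(\omega_1)\phi(\omega_2)f_4(\omega_1, -\omega_1, \omega_2)\,d\omega_1d\omega_2 + O\Big(\frac{1}{n^2}\Big), \tag{9.31}$$
where $f_4$ is the fourth order cumulant spectrum of $\{X_t\}$. From the above we see that unless $\phi$ satisfies some special conditions, $\mathrm{var}(A(\phi, I_n))$ contains the fourth order spectrum, which can be difficult to estimate. There are bootstrap methods which can be used to estimate the variance or finite sample distribution, but simple bootstrap methods, such as the frequency domain bootstrap, cannot be applied to $A(\phi, I_n)$, since they are unable to capture the fourth order cumulant structure. However, in special cases the fourth order structure disappears. We consider this case below and then discuss how it can be generalised.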
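Lemma 9.5.1 Suppose $\{X_t\}$ is a linear time series, with spectral density $f(\omega)$. Let $A(\phi, I_n)$ be defined as in (9.28) and suppose the condition
$$A(\phi, f) = \int_0^{2\pi}\phi(\omega)f(\omega)\,d\omega = 0 \tag{9.32}$$
holds, then
$$\mathrm{var}(A(\phi, I_n)) = \frac{1}{n}\int_0^{2\pi}\phi(\omega)^2f(\omega)^2\,d\omega + \frac{1}{n}\int_0^{2\pi}\phi(\omega)\phi(2\pi - \omega)f(\omega)^2\,d\omega + O\Big(\frac{1}{n^2}\Big).$$

PROOF. Under linearity $f_4(\omega_1, -\omega_1, \omega_2) = \frac{\kappa_4}{\sigma^4}f(\omega_1)f(\omega_2)$. Substituting this into (9.31) gives
$$\mathrm{var}(A(\phi, I_n)) = \frac{1}{n}\int_0^{2\pi}\phi(\omega)^2f(\omega)^2\,d\omega + \frac{1}{n}\int_0^{2\pi}\phi(\omega)\phi(2\pi - \omega)f(\omega)^2\,d\omega$$
$$+ \frac{\kappa_4}{\sigma^4}\frac{1}{n}\int_0^{2\pi}\int_0^{2\pi}\phi(\omega_1)\phi(\omega_2)f(\omega_1)f(\omega_2)\,d\omega_1d\omega_2 + O\Big(\frac{1}{n^2}\Big)$$
$$= \frac{1}{n}\int_0^{2\pi}\phi(\omega)^2f(\omega)^2\,d\omega + \frac{1}{n}\int_0^{2\pi}\phi(\omega)\phi(2\pi - \omega)f(\omega)^2\,d\omega + \frac{\kappa_4}{\sigma^4}\frac{1}{n}\Big|\int_0^{2\pi}\phi(\omega)f(\omega)\,d\omega\Big|^2 + O\Big(\frac{1}{n^2}\Big).$$
Since $\int_0^{2\pi}\phi(\omega)f(\omega)\,d\omega = 0$ by (9.32), the fourth order cumulant term vanishes and we obtain the result.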
Example 9.5.1 (The Whittle likelihood) Let us return to the Whittle likelihood in the case of linearity. In Lemma 9.4.4 we showed that the fourth order cumulant term does not play a role in the variance of the ARMA estimator. We now show that condition (9.32) holds.

Consider the partial derivative of the Whittle likelihood
$$\nabla_\theta L_n^w(\theta) = \sum_{k=1}^{n}\Big(\frac{\nabla_\theta f_\theta(\omega_k)}{f_\theta(\omega_k)} - \frac{I_n(\omega_k)}{f_\theta(\omega_k)^2}\nabla_\theta f_\theta(\omega_k)\Big).$$
To show normality we consider the above at the true parameter $\theta$. Only the second term of the above is random, therefore it is only this term that yields the variance. Let
$$A(f_\theta^{-2}\nabla_\theta f_\theta, I_n) = \frac{1}{n}\sum_{k=1}^{n}\frac{I_n(\omega_k)}{f_\theta(\omega_k)^2}\nabla_\theta f_\theta(\omega_k).$$
To see whether this term satisfies the conditions of Lemma 9.5.1 we evaluate
$$A(f_\theta^{-2}\nabla_\theta f_\theta, f_\theta) = \int_0^{2\pi}\frac{f_\theta(\omega)}{f_\theta(\omega)^2}\nabla_\theta f_\theta(\omega)\,d\omega = \int_0^{2\pi}\nabla_\theta\log f_\theta(\omega)\,d\omega = \nabla_\theta\int_0^{2\pi}\log f_\theta(\omega)\,d\omega = 0,$$
by using Lemma 9.4.3. Thus we see that the derivative of the Whittle likelihood satisfies the condition (9.32). Therefore the zero cumulant term is really due to this property.
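The Whittle likelihood is a rather special example. However, we now show that any statistic of the form $A(\phi, I_n)$ can be transformed such that the resulting transformed statistic satisfies condition (9.32). To find the suitable transformation we recall from Section 6.2.1 that the variance of $\hat{c}_n(r)$ involves the fourth order cumulant, but under linearity the sample correlation $\hat{\rho}_n(r) = \hat{c}_n(r)/\hat{c}_n(0)$ does not. Returning to the frequency representation of the autocovariance given in (9.5) we observe that
$$\hat{\rho}_n(r) = \frac{1}{\hat{c}_n(0)}\frac{1}{n}\sum_{k=1}^{n/2}I_n(\omega_k)\exp(ir\omega_k) \approx \frac{1}{\hat{c}_n(0)}\frac{1}{n}\sum_{k=1}^{n}I_n(\omega_k)\exp(ir\omega_k)$$
(it does not matter whether we sum over $n$ or $n/2$; for the remainder of this section we choose the case of summing over $n$). Motivated by this example we define the so called ratio statistic
$$\tilde{A}(\phi, I_n) = \frac{1}{n}\sum_{k=1}^{n}\frac{I_n(\omega_k)\phi(\omega_k)}{\hat{c}_n(0)} = \frac{1}{n}\sum_{k=1}^{n}\frac{I_n(\omega_k)\phi(\omega_k)}{\hat{F}_n(2\pi)}, \tag{9.33}$$
where $\hat{F}_n(2\pi) = \frac{1}{n}\sum_{k=1}^{n}I_n(\omega_k) = \frac{1}{n}\sum_{t=1}^{n}X_t^2 = \hat{c}_n(0)$. We show in the following lemma that $\tilde{A}(\phi, I_n)$ can be written in a form that almost satisfies condition (9.32).

Lemma 9.5.2 Let us suppose that $\tilde{A}(\phi, I_n)$ satisfies (9.33) and
$$\tilde{A}(\phi, f) = \frac{1}{n}\sum_{k=1}^{n}\frac{f(\omega_k)\phi(\omega_k)}{F_n(2\pi)},$$
where $F_n(2\pi) = \frac{1}{n}\sum_{j=1}^{n}f(\omega_j)$. Then we can represent $\tilde{A}(\phi, I_n)$ as
$$\tilde{A}(\phi, I_n) - \tilde{A}(\phi, f) = \frac{1}{F_n(2\pi)\hat{F}_n(2\pi)}\frac{1}{n}\sum_{k=1}^{n}\tilde{\phi}_n(\omega_k)I_n(\omega_k),$$
where
$$\tilde{\phi}_n(\omega_k) = \phi(\omega_k)F_n(2\pi) - \frac{1}{n}\sum_{j=1}^{n}\phi(\omega_j)f(\omega_j) \quad \text{and} \quad \frac{1}{n}\sum_{k=1}^{n}\tilde{\phi}_n(\omega_k)f(\omega_k) = 0. \tag{9.34}$$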
k=1
In ) A(,
f) =
A(,
1X
n
k=1
=
=
(k )In (k ) (k )f (k )
Fn (2)
Fn (2)
n
1 X (k )Fn (2)In (k ) (k )Fn (2)f (k )
n
Fn (2)Fn (2)
k=1
n
n
1X
In (k )
1X
(k )Fn (2)
(k )f (k )
n
n
Fn (2)Fn (2)
k=1
k=1
n
1 X (k )In (k )
,
n
Fn (2)Fn (2)
k=1
where Fn (2) and are defined as above. To show (9.34), again we use basic algebra to give
n
1X
(k )f (k ) =
n
k=1
n
n
1X
1X
()Fn (2)
(j )f (j ) f (k )
n
n
j=1
k=1
n
n
n
1X
1X
1X
(k )f (k )Fn (2)
(k )f (k )
f (j ) = 0.
n
n
n
k=1
k=1
j=1
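The algebra in the proof holds for any finite set of weights, periodogram ordinates and spectral density values, so it can be checked with arbitrary numbers. A minimal sketch follows; the inputs are made-up values and not a real periodogram, and the function name is hypothetical.

```python
import math

def ratio_decomposition(phi, pgram, f):
    """Return (A~(phi,I) - A~(phi,f), centred representation, centring sum).

    phi[k], pgram[k], f[k] stand for phi(omega_k), I_n(omega_k), f(omega_k).
    """
    n = len(phi)
    F_hat = sum(pgram) / n                                  # estimated F_n(2*pi)
    F = sum(f) / n                                          # F_n(2*pi)
    A_I = sum(p * i for p, i in zip(phi, pgram)) / (n * F_hat)
    A_f = sum(p * v for p, v in zip(phi, f)) / (n * F)
    mean_phi_f = sum(p * v for p, v in zip(phi, f)) / n
    phi_tilde = [p * F - mean_phi_f for p in phi]           # centred weights
    rhs = sum(pt * i for pt, i in zip(phi_tilde, pgram)) / (n * F * F_hat)
    centring = sum(pt * v for pt, v in zip(phi_tilde, f)) / n
    return A_I - A_f, rhs, centring

n = 8
omegas = [2 * math.pi * k / n for k in range(1, n + 1)]
phi = [math.cos(w) for w in omegas]                         # phi(omega) = cos(omega)
f = [1.0 + 0.5 * math.cos(w) for w in omegas]               # a positive "spectral density"
pgram = [0.9, 1.7, 0.4, 0.6, 1.1, 0.3, 2.0, 1.2]            # made-up ordinates
lhs, rhs, centring = ratio_decomposition(phi, pgram, f)
print(lhs, rhs, centring)
```

The centring sum is zero whatever $\phi$ is, which is the discrete analogue of condition (9.32) for the transformed weights.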
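From the lemma above we see that $\tilde{A}(\phi, I_n) - \tilde{A}(\phi, f)$ almost seems to satisfy the conditions in Lemma 9.5.1; the only difference is the random term $\hat{c}_n(0) = \hat{F}_n(2\pi)$ in the denominator. We now show that we can replace $\hat{F}_n(2\pi)$ with its limit and that the error is asymptotically negligible.

Let
$$\tilde{A}(\phi, I_n) - \tilde{A}(\phi, f) = \frac{1}{F_n(2\pi)\hat{F}_n(2\pi)}\frac{1}{n}\sum_{k=1}^{n}\tilde{\phi}_n(\omega_k)I_n(\omega_k) := \tilde{B}(\tilde{\phi}_n, I_n)$$
and
$$B(\tilde{\phi}_n, I_n) = \frac{1}{F_n(2\pi)^2}\frac{1}{n}\sum_{k=1}^{n}\tilde{\phi}_n(\omega_k)I_n(\omega_k).$$
By using the mean value theorem (basically the Delta method) and expanding $\tilde{B}(\tilde{\phi}_n, I_n)$ about $B(\tilde{\phi}_n, I_n)$ (noting that $B(\tilde{\phi}_n, f) = 0$) gives
$$\tilde{B}(\tilde{\phi}_n, I_n) - B(\tilde{\phi}_n, I_n) = -\underbrace{\big(\hat{F}_n(2\pi) - F_n(2\pi)\big)}_{O_p(n^{-1/2})}\frac{1}{F_n(2\pi)\bar{F}_n(2\pi)^2}\underbrace{\frac{1}{n}\sum_{k=1}^{n}\tilde{\phi}_n(\omega_k)I_n(\omega_k)}_{O_p(n^{-1/2})} = O_p\Big(\frac{1}{n}\Big),$$
where $\bar{F}_n(2\pi)$ lies between $F_n(2\pi)$ and $\hat{F}_n(2\pi)$. Therefore the limiting distribution and variance of $\tilde{A}(\phi, I_n) - \tilde{A}(\phi, f)$ are determined by
$$\tilde{A}(\phi, I_n) - \tilde{A}(\phi, f) = B(\tilde{\phi}_n, I_n) + O_p(n^{-1}).$$
$B(\tilde{\phi}_n, I_n)$ does satisfy the condition in (9.32) and the lemma below immediately follows.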
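Lemma 9.5.3 Suppose that $\{X_t\}$ is a linear time series, then
$$\mathrm{var}(B(\psi_n, I_n)) = \frac{1}{n}\int_0^{2\pi}\psi(\omega)^2 f(\omega)^2\,d\omega + \frac{1}{n}\int_0^{2\pi}\psi(\omega)\psi(2\pi-\omega)f(\omega)^2\,d\omega + O\left(\frac{1}{n^2}\right),$$
where
$$\psi(\omega) = \phi(\omega)F(2\pi) - \frac{1}{2\pi}\int_0^{2\pi}\phi(\lambda)f(\lambda)\,d\lambda.$$
Therefore, the limiting variance of $\tilde{A}(\phi, I_n)$ is
$$\frac{1}{n}\int_0^{2\pi}\psi(\omega)^2 f(\omega)^2\,d\omega + \frac{1}{n}\int_0^{2\pi}\psi(\omega)\psi(2\pi-\omega)f(\omega)^2\,d\omega + O\left(\frac{1}{n^2}\right).$$
This is a more elegant explanation as to why under linearity the limiting variance of the correlation estimator does not contain the fourth order cumulant term. It also allows for a general class of statistics.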
Remark 9.5.1 (Applications) As we remarked above, many statistics can be written as a ratio statistic. The advantage of this is that the variance of the limiting distribution is only in terms of the spectral densities, and not any other higher order terms (which are difficult to estimate). Another perk is that simple schemes such as the frequency domain bootstrap can be used to estimate the finite sample distributions of statistics which satisfy the assumptions in Lemma 9.5.1 or are ratio statistics (so long as the underlying process is linear); see Dahlhaus and Janas (1996) for the details. The frequency domain bootstrap works by constructing the DFT from the data $\{J_n(\omega_k)\}$ and dividing by the square root of either the nonparametric estimator of $f$ or a parametric estimator, i.e. $\{J_n(\omega_k)/\sqrt{\hat{f}_n(\omega_k)}\}$; these are close to constant variance random variables.
Suppose that $\{X_t\}$ is a linear time series
$$X_t = \sum_{j=-\infty}^{\infty}\psi_j\varepsilon_{t-j},$$
with $E(\varepsilon_t) = 0$, $\mathrm{var}(\varepsilon_t) = \sigma^2$ and $\mathrm{cum}_4(\varepsilon_t) = \kappa_4$. Then we can use the spectral density estimator to estimate $\kappa_4$ without any additional assumptions on $\{X_t\}$ (besides linearity). Let $f(\omega)$ denote the spectral density of $\{X_t\}$ and $g_2(\omega)$ the spectral density of $\{X_t^2\}$, then it can be shown that
$$\kappa_4 = \frac{2\pi g_2(0) - 4\pi\int_0^{2\pi}f(\omega)^2\,d\omega}{\left(\int_0^{2\pi}f(\omega)\,d\omega\right)^2}.$$
Thus by estimating $f$ and $g_2$ we can estimate $\kappa_4$.
Alternatively, we can use the fact that for linear time series, the fourth order spectral density is $f_4(\omega_1, \omega_2, \omega_3) = \kappa_4 A(\omega_1)A(\omega_2)A(\omega_3)A(-\omega_1-\omega_2-\omega_3)$. Thus we have
$$\kappa_4 = \frac{\sigma^4 f_4(\omega_1, -\omega_1, \omega_2)}{f(\omega_1)f(\omega_2)}.$$
9.6
As with many other areas in statistics, we often want to test the appropriateness of a model. In this section we briefly consider methods for validating whether, say, an ARMA(p, q) is the appropriate model to fit to a time series. One method is to fit the model to the data, then estimate the residuals and conduct a Portmanteau test (see Section 3, equation (6.9)) on the estimated residuals. It can be shown that if the model fitted to the data is the correct one, the estimated residuals behave almost like the true residuals in the model. The Portmanteau test statistic is
$$S_h = n\sum_{r=1}^{h}|\hat{\rho}_n(r)|^2,$$
where $\hat{\rho}_n(r)$ denotes the sample correlation of the estimated residuals at lag $r$.
Therefore, if we fit the correct model to the data we would expect that
$$\frac{I_X(\omega)}{f_{\hat{\theta}}(\omega)} = |J_\varepsilon(\omega)|^2 + o_p(1),$$
where $\hat{\theta}$ are the model parameter estimators. Now $|J_\varepsilon(\omega)|^2$ has the special property that not only is it almost uncorrelated at various frequencies, but it is constant over all the frequencies. Therefore, we would expect that
$$\frac{1}{2\pi\sqrt{n}}\sum_{k=1}^{n/2}\left(\frac{I_X(\omega_k)}{f_{\hat{\theta}}(\omega_k)} - 2\pi\right) \xrightarrow{D} N(0, 1).$$
Thus, as an alternative to the goodness of fit test based on the Portmanteau test statistic we can use the above as a test statistic, noting that under the alternative the mean would be different.
Chapter 10
Consistency and asymptotic normality of estimators
In the previous chapter we considered estimators of several different parameters. The hope is that as the sample size increases the estimator should get closer to the parameter of interest. When we say closer we mean to converge. In the classical sense the sequence $\{x_k\}$ converges to $x$ ($x_k \to x$), if $|x_k - x| \to 0$ as $k \to \infty$ (or for every $\varepsilon > 0$, there exists an $n$ where for all $k > n$, $|x_k - x| < \varepsilon$). Of course the estimators we have considered are random, that is for every $\omega \in \Omega$ (the set of all outcomes) we have a different estimate. The natural question to ask is what does convergence mean for random sequences.
10.1
Modes of convergence
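$\{X_t\}$ converges almost surely to $\mu$, if there exists a set $M \subset \Omega$, such that $P(M) = 1$ and for every $\omega \in M$ we have
$$X_t(\omega) \to \mu.$$
In other words, for every $\varepsilon > 0$ there exists an $N(\omega)$ such that
$$|X_t(\omega) - \mu| < \varepsilon \tag{10.1}$$
for all $t > N(\omega)$. Note that the above definition is very close to classical convergence. We denote this by $X_t \xrightarrow{a.s.} \mu$. An equivalent definition, in terms of probabilities, is that for every $\varepsilon > 0$
$$P\left(\omega; \cap_{m=1}^{\infty}\cup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\}\right) = 0.$$
If $\cap_{m=1}^{\infty}\cup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\} \neq \emptyset$, then there exists an $\omega^* \in \cap_{m=1}^{\infty}\cup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\}$ such that for some infinite sequence $\{k_j\}$, we have $|X_{k_j}(\omega^*) - \mu| > \varepsilon$; this means $X_t(\omega^*)$ does not converge to $\mu$. Now let $\cap_{m=1}^{\infty}\cup_{t=m}^{\infty}\{|X_t(\omega) - \mu| > \varepsilon\} = A$; if $P(A) = 0$, then for most $\omega$ the sequence $\{X_t(\omega)\}$ converges.
$\{X_t\}$ converges to $\mu$ in probability if for every $\varepsilon > 0$, $P(|X_t - \mu| > \varepsilon) \to 0$ as $t \to \infty$.
If $E|X_t - \mu|^2 \to 0$ as $t \to \infty$ (convergence in mean square), then it implies convergence in probability (to see this, use Markov's inequality).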
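Rates of convergence:
(i) Suppose $a_t \to 0$ as $t \to \infty$. We say the stochastic process $\{X_t\}$ is $|X_t - \mu| = O_p(a_t)$, if the sequence $\{a_t^{-1}|X_t - \mu|\}$ is bounded in probability (this is defined below).
Boundedness:
(i) Almost surely bounded. If the random variable $X$ is almost surely bounded, then for a positive sequence $\{e_k\}$, such that $e_k \to \infty$ as $k \to \infty$ (typically $e_k = 2^k$ is used), we have
$$P\left(\omega; \cup_{k=1}^{\infty}\{|X(\omega)| \le e_k\}\right) = 1.$$
Usually to prove the above we consider the complement
$$P\left(\left(\omega; \cup_{k=1}^{\infty}\{|X| \le e_k\}\right)^c\right) = 0.$$
Since $(\cup_{k=1}^{\infty}\{|X| \le e_k\})^c = \cap_{k=1}^{\infty}\{|X| > e_k\} \subset \cap_{k=1}^{\infty}\cup_{m=k}^{\infty}\{|X| > e_m\}$, to show the above we show
$$P\left(\omega : \cap_{k=1}^{\infty}\cup_{m=k}^{\infty}\{|X(\omega)| > e_m\}\right) = 0. \tag{10.2}$$
We note that if $\cap_{k=1}^{\infty}\cup_{m=k}^{\infty}\{|X(\omega)| > e_m\} \neq \emptyset$, then there exists an $\omega^* \in \Omega$ and an infinite subsequence $k_j$, where $|X(\omega^*)| > e_{k_j}$, hence $X(\omega^*)$ is not bounded (since $e_k \to \infty$). To prove (10.2) we usually use the Borel-Cantelli Lemma. This states that if $\sum_{k=1}^{\infty}P(A_k) < \infty$, the events $\{A_k\}$ occur only finitely often with probability one. Applying this to our case, if we can show that $\sum_{m=1}^{\infty}P(\omega : |X(\omega)| > e_m) < \infty$, then $\{|X(\omega)| > e_m\}$ happens only finitely often with probability one. Hence if $\sum_{m=1}^{\infty}P(\omega : |X(\omega)| > e_m) < \infty$, then $P(\omega : \cap_{k=1}^{\infty}\cup_{m=k}^{\infty}\{|X(\omega)| > e_m\}) = 0$ and $X$ is a bounded random variable.
It is worth noting that often we choose the sequence $e_k = 2^k$; in this case $\sum_{m=1}^{\infty}P(\omega : |X(\omega)| > e_m) = \sum_{m=1}^{\infty}P(\omega : \log|X(\omega)| > \log 2^m) \le C\,E(\log|X|)$. Hence if we can show that $E(\log|X|) < \infty$, then $X$ is bounded almost surely.
(ii) Sequences which are bounded in probability. A sequence is bounded in probability, written $X_t = O_p(1)$, if for every $\varepsilon > 0$, there exists a $\delta(\varepsilon) < \infty$ such that $P(|X_t| \ge \delta(\varepsilon)) < \varepsilon$. Roughly speaking this means that the sequence is only extremely large with a very small probability. And as the largeness grows the probability declines.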
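The bound $\sum_m P(|X| > 2^m) \le C\,E(\log|X|)$ used with $e_k = 2^k$ can be made concrete. The sketch below assumes, purely for illustration, a Pareto-type law with $P(X > x) = 1/x$ for $x \ge 1$ (not a distribution taken from these notes). For this law $\log X$ is a standard exponential variable, so $E(\log_2 X) = 1/\log 2$, and the tail sum is $\sum_{m \ge 1} 2^{-m} = 1$.

```python
import math


def tail_prob(x):
    """P(X > x) for the assumed Pareto-type law with P(X > x) = 1/x on x >= 1."""
    return 1.0 if x < 1.0 else 1.0 / x


def tail_sum(num_terms):
    """Partial sum of sum_{m >= 1} P(X > 2^m)."""
    return sum(tail_prob(2.0 ** m) for m in range(1, num_terms + 1))


# E(log X) = 1 for this law, hence E(log_2 X) = 1 / log(2), which bounds
# sum_m P(log_2 X > m) because log_2 X is a non-negative random variable.
expected_log2 = 1.0 / math.log(2.0)
partial = tail_sum(50)
```

Since the tail sum is finite, the Borel-Cantelli Lemma gives that $\{X > 2^m\}$ occurs only finitely often with probability one.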
10.2
Sampling properties
Often we will estimate the parameters by maximising (or minimising) a criterion. Suppose we have the criterion $L_n(a)$ (e.g. likelihood, quasi-likelihood, Kullback-Leibler etc.); we use as an estimator of $a_0$, $\hat{a}_n$, where
$$\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$$
and $\Theta$ is the parameter space we do the maximisation (minimisation) over. Typically the true parameter $a_0$ should maximise (minimise) the limiting criterion $L$.
If this is to be a good estimator, as the sample size grows the estimator should converge (in some sense) to the parameter we are interested in estimating. As we discussed above, there are various modes in which we can measure this convergence: (i) almost surely, (ii) in probability and (iii) in mean squared error. Usually we show either (i) or (ii) (noting that (i) implies (ii)); in time series it is usually quite difficult to show (iii).
Definition 10.2.1
(i) An estimator $\hat{a}_n$ is said to be an almost surely consistent estimator of $a_0$, if $\hat{a}_n \xrightarrow{a.s.} a_0$ as $n \to \infty$.
(ii) An estimator $\hat{a}_n$ is said to converge in probability to $a_0$, if for every $\delta > 0$
$$P(|\hat{a}_n - a_0| > \delta) \to 0 \quad\text{as } n \to \infty.$$
To prove either (i) or (ii) usually involves verifying two main things, pointwise convergence and equicontinuity.
10.3
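We now consider the general case where $L_n(a)$ is a criterion which we maximise. Let us suppose we can write $L_n$ as
$$L_n(a) = \frac{1}{n}\sum_{t=1}^{n}\ell_t(a), \tag{10.3}$$
with limiting criterion
$$L(a) = E(\ell_0(a)); \tag{10.4}$$
we assume that $L(a)$ is continuous and has a unique maximum in $\Theta$. We define the estimator $\hat{a}_n$ where $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$.
Definition 10.3.1 (Uniform convergence) $L_n(a)$ is said to almost surely converge uniformly to $L(a)$, if
$$\sup_{a \in \Theta}|L_n(a) - L(a)| \xrightarrow{a.s.} 0.$$
Theorem 10.3.1 (Consistency) Suppose that $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$ and $a_0 = \arg\max_{a \in \Theta} L(a)$ is the unique maximum. If $\sup_{a \in \Theta}|L_n(a) - L(a)| \xrightarrow{a.s.} 0$ as $n \to \infty$, then $\hat{a}_n \xrightarrow{a.s.} a_0$ as $n \to \infty$.
PROOF. We note that by definition we have $L_n(a_0) \le L_n(\hat{a}_n)$ and $L(\hat{a}_n) \le L(a_0)$. Using this inequality we have
$$L_n(a_0) - L(a_0) \le L_n(\hat{a}_n) - L(a_0) \le L_n(\hat{a}_n) - L(\hat{a}_n).$$
Therefore from the above we have
$$|L_n(\hat{a}_n) - L(a_0)| \le \max\{|L_n(a_0) - L(a_0)|, |L_n(\hat{a}_n) - L(\hat{a}_n)|\} \le \sup_{a \in \Theta}|L_n(a) - L(a)|.$$
Hence, by uniform convergence, $|L_n(\hat{a}_n) - L(a_0)| \xrightarrow{a.s.} 0$ as $n \to \infty$. Since $L(a)$ is continuous and has a unique maximum, this implies $\hat{a}_n \xrightarrow{a.s.} a_0$.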
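We note that directly establishing uniform convergence is not easy. Usually it is done by assuming the parameter space is compact and showing pointwise convergence and stochastic equicontinuity; these three facts imply uniform convergence. Below we define stochastic equicontinuity and show consistency under these conditions.
Definition 10.3.2 The sequence of stochastic functions $\{f_n(a)\}_n$ is said to be stochastically equicontinuous if there exists a set $M \subset \Omega$ where $P(M) = 1$ and for every $\omega \in M$ and $\varepsilon > 0$, there exists a $\delta$ such that
$$\sup_{|a_1 - a_2| \le \delta}|f_n(\omega, a_1) - f_n(\omega, a_2)| \le \varepsilon,$$
for all $n > N(\omega)$.
A sufficient condition for stochastic equicontinuity, which is often the one verified in practice, is a Lipschitz-type bound $|f_n(a_1) - f_n(a_2)| \le K_n|a_1 - a_2|$, where $K_n$ is a random variable which converges almost surely to a finite constant ($K_n \xrightarrow{a.s.} K_0$ as $n \to \infty$). To show that this implies equicontinuity we note that $K_n \xrightarrow{a.s.} K_0$ means that for every $\omega \in M$ ($P(M) = 1$) and $\gamma > 0$, we have $|K_n(\omega) - K_0| < \gamma$ for all $n > N(\omega)$. Therefore if we choose $\delta = \varepsilon/(K_0 + \gamma)$ we have
$$\sup_{|a_1 - a_2| \le \varepsilon/(K_0 + \gamma)}|f_n(\omega, a_1) - f_n(\omega, a_2)| < \varepsilon,$$
for all $n > N(\omega)$.
Theorem 10.3.2 Suppose the parameter space $\Theta$ is compact, for every $a \in \Theta$ we have $L_n(a) \xrightarrow{a.s.} L(a)$ and $L_n(a)$ is stochastic equicontinuous. Then $\sup_{a \in \Theta}|L_n(a) - L(a)| \xrightarrow{a.s.} 0$ as $n \to \infty$.
Corollary 10.3.1 Suppose that $\hat{a}_n = \arg\max_{a \in \Theta} L_n(a)$ and $a_0 = \arg\max_{a \in \Theta} L(a)$, where $L(a)$ has a unique maximum. If
(i) We have pointwise convergence, that is for every $a \in \Theta$ we have $L_n(a) \xrightarrow{a.s.} L(a)$.
(ii) The parameter space $\Theta$ is compact.
(iii) $L_n(a)$ is stochastic equicontinuous.
then $\hat{a}_n \xrightarrow{a.s.} a_0$ as $n \to \infty$.
PROOF. By using Theorem 10.3.2 the three assumptions imply that $\sup_{\theta \in \Theta}|L_n(\theta) - L(\theta)| \xrightarrow{a.s.} 0$, thus by using Theorem 10.3.1 we obtain the result.
We prove Theorem 10.3.2 in the section below, but it can be omitted on first reading.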
10.3.1
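We now show that stochastic equicontinuity and almost sure pointwise convergence imply uniform convergence. We note that on its own, pointwise convergence is a much weaker condition than uniform convergence, since for pointwise convergence the rate of convergence can be different for each parameter.
Before we continue, a few technical points. We recall that we are assuming almost sure pointwise convergence. This means for each parameter $a \in \Theta$ there exists a set $N_a \subset \Omega$ (with $P(N_a) = 1$) such that for all $\omega \in N_a$, $L_n(\omega, a) \to L(a)$. In the following lemma we unify this set. That is, we show (using stochastic equicontinuity) that there exists a set $N \subset \Omega$ (with $P(N) = 1$) such that for all $\omega \in N$ and all $a \in \Theta$, $L_n(\omega, a) \to L(a)$.
Lemma 10.3.1 Suppose the sequence $\{L_n(a)\}_n$ is stochastically equicontinuous and also pointwise convergent (that is $L_n(a)$ converges almost surely to $L(a)$), then there exists a set $\bar{M} \subset \Omega$ where $P(\bar{M}) = 1$ and for every $\omega \in \bar{M}$ and $a \in \Theta$ we have
$$|L_n(\omega, a) - L(a)| \to 0.$$
PROOF. Enumerate all the rationals in the set $\Theta$ and call this sequence $\{a_i\}_i$. Since we have almost sure convergence, this implies for every $a_i$ there exists a set $M_{a_i}$ where $P(M_{a_i}) = 1$ and for every $\omega \in M_{a_i}$ we have $|L_n(\omega, a_i) - L(a_i)| \to 0$. Define $M = \cap_i M_{a_i}$; since the number of sets is countable, $P(M) = 1$ and for every $\omega \in M$ and $a_i$ we have $L_n(\omega, a_i) \to L(a_i)$.
Since we have stochastic equicontinuity, there exists a set $\tilde{M}$ where $P(\tilde{M}) = 1$ and for every $\omega \in \tilde{M}$, $\{L_n(\omega, \cdot)\}$ is equicontinuous. Let $\bar{M} = \tilde{M} \cap \{\cap_i M_{a_i}\}$; we will show that for all $a \in \Theta$ and $\omega \in \bar{M}$ we have $L_n(\omega, a) \to L(a)$. By stochastic equicontinuity, for every $\omega \in \bar{M}$ and $\varepsilon/3 > 0$, there exists a $\delta > 0$ such that
$$\sup_{|b_1 - b_2| \le \delta}|L_n(\omega, b_1) - L_n(\omega, b_2)| \le \varepsilon/3, \tag{10.5}$$
for all $n > N(\omega)$. Furthermore, by definition of $\bar{M}$, for every rational $a_i \in \Theta$ and $\omega \in \bar{M}$ we have
$$|L_n(\omega, a_i) - L(a_i)| \le \varepsilon/3, \tag{10.6}$$
where $n > N'(\omega)$. Now for any given $a \in \Theta$, there exists a rational $a_i$ such that $\|a - a_i\| \le \delta$. Using this, (10.5), (10.6) and the continuity of $L$ (taking $\delta$ smaller if necessary so that $|L(a) - L(a_i)| \le \varepsilon/3$) we have
$$|L_n(\omega, a) - L(a)| \le |L_n(\omega, a) - L_n(\omega, a_i)| + |L_n(\omega, a_i) - L(a_i)| + |L(a) - L(a_i)| \le \varepsilon,$$
for $n > \max(N(\omega), N'(\omega))$. To summarise, for every $\omega \in \bar{M}$ and $a \in \Theta$, we have $|L_n(\omega, a) - L(a)| \to 0$. Hence we have pointwise convergence for every realisation in $\bar{M}$.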
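PROOF of Theorem 10.3.2. By Lemma 10.3.1 there exists a set $\bar{M}$ with $P(\bar{M}) = 1$ on which $\{L_n(\omega, \cdot)\}$ is both equicontinuous and pointwise convergent. For $\omega \in \bar{M}$ and $\varepsilon/3 > 0$, let $\delta$ be such that
$$\sup_{|a_1 - a_2| \le \delta}|L_n(\omega, a_1) - L_n(\omega, a_2)| \le \varepsilon/3, \tag{10.7}$$
for all $n > n(\omega)$. Since $\Theta$ is compact it can be covered by a finite number of open sets. Construct the sets $\{O_i\}_{i=1}^{p}$, such that $\Theta \subset \cup_{i=1}^{p} O_i$ and $\sup_{i}\sup_{x,y \in O_i}\|x - y\| \le \delta$. Let $\{a_i\}_{i=1}^{p}$ be such that $a_i \in O_i$. We note that for every $\omega \in \bar{M}$ we have $L_n(\omega, a_i) \to L(a_i)$, hence for every $\varepsilon/3$, there exists an $n_i(\omega)$ such that for all $n > n_i(\omega)$ we have $|L_n(\omega, a_i) - L(a_i)| \le \varepsilon/3$. Therefore, since $p$ is finite (due to compactness), for all $n > \max_{1 \le i \le p} n_i(\omega)$ we have
$$\max_{1 \le i \le p}|L_n(\omega, a_i) - L(a_i)| \le \varepsilon/3.$$
For any $a \in \Theta$, choosing $i$ such that $a \in O_i$ and combining the above with (10.7) and the uniform continuity of $L$ on the compact set $\Theta$ (shrinking $\delta$ if necessary) gives $\sup_{a \in \Theta}|L_n(\omega, a) - L(a)| \le \varepsilon$ for all sufficiently large $n$, which is the required uniform convergence.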
10.4
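In Chapter ?? we will consider the sampling properties of many of the estimators defined in Chapter 6. However, to illustrate the consistency result above we apply it to the least squares estimator of the autoregressive parameters.
To simplify notation we only consider the estimator for AR(1) models. Suppose that $X_t$ satisfies $X_t = \phi X_{t-1} + \varepsilon_t$ (where $|\phi| < 1$). To estimate $\phi$ we use the least squares estimator defined below. Let
$$L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n}(X_t - aX_{t-1})^2; \tag{10.8}$$
we use $\hat{\phi}_n$ as an estimator of $\phi$, where
$$\hat{\phi}_n = \arg\min_{a \in \Theta} L_n(a), \tag{10.9}$$
and $\Theta = [-1, 1]$.
In the case of least squares for AR processes, $\hat{\phi}_n$ has the explicit form
$$\hat{\phi}_n = \frac{\frac{1}{n-1}\sum_{t=2}^{n}X_tX_{t-1}}{\frac{1}{n-1}\sum_{t=1}^{n-1}X_t^2}.$$
By just applying the ergodic theorem to the numerator and denominator we get $\hat{\phi}_n \xrightarrow{a.s.} \phi$. It is worth noting that, unlike the Yule-Walker estimator,
$$\left|\frac{\frac{1}{n-1}\sum_{t=2}^{n}X_tX_{t-1}}{\frac{1}{n-1}\sum_{t=1}^{n-1}X_t^2}\right| < 1$$
is not necessarily true.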
Here we will tackle the problem in a rather artificial way and assume that it does not have an explicit form and instead assume that $\hat{\phi}_n$ is obtained by minimising $L_n(a)$ using a numerical routine.
In order to derive the sampling properties of $\hat{\phi}_n$ we need to directly study the least squares criterion $L_n(a)$. We will do this now in the least squares case.
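To make the two routes to the AR(1) least squares estimator concrete, the following sketch simulates an AR(1) process and computes the estimator both from the explicit ratio and by minimising $L_n(a)$ over a grid on $[-1, 1]$. It is only an illustration: the Gaussian innovations, the burn-in length, the grid search (standing in for "a numerical routine") and all function names are assumptions made here.

```python
import random


def simulate_ar1(phi, n, seed=0, burn_in=200):
    """Simulate X_t = phi * X_{t-1} + e_t with standard Gaussian e_t."""
    rng = random.Random(seed)
    x, prev = [], 0.0
    for t in range(n + burn_in):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        if t >= burn_in:          # discard the burn-in so the start-up effect fades
            x.append(prev)
    return x


def ls_criterion(a, x):
    """L_n(a) = (n-1)^{-1} sum_{t=2}^n (X_t - a X_{t-1})^2."""
    n = len(x)
    return sum((x[t] - a * x[t - 1]) ** 2 for t in range(1, n)) / (n - 1)


def ls_explicit(x):
    """Explicit minimiser: sum X_t X_{t-1} / sum X_{t-1}^2."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


def ls_grid(x, step=0.01):
    """Minimise L_n(a) over an equally spaced grid on [-1, 1]."""
    grid = [-1.0 + step * i for i in range(int(round(2.0 / step)) + 1)]
    return min(grid, key=lambda a: ls_criterion(a, x))


x = simulate_ar1(phi=0.5, n=1000)
phi_explicit = ls_explicit(x)
phi_grid = ls_grid(x, step=0.01)
```

Because $L_n(a)$ is a quadratic in $a$, the grid minimiser is the grid point closest to the explicit form whenever the latter lies in $[-1, 1]$, so the two estimates differ by at most half the grid spacing.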
We will first show almost sure convergence, which will involve repeated use of the ergodic theorem. We will then demonstrate how to show convergence in probability. We look at almost sure convergence as it is easier to follow. Note that almost sure convergence implies convergence in probability (but the converse is not necessarily true).
The first thing to do is let
$$\ell_t(a) = (X_t - aX_{t-1})^2.$$
Since $\{X_t\}$ is an ergodic process (recall Example ??(ii)), by using Theorem ?? we have for every $a$ that $\{\ell_t(a)\}_t$ is an ergodic process. Therefore by using the ergodic theorem we have
$$L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n}\ell_t(a) \xrightarrow{a.s.} E(\ell_0(a)).$$
In other words for every $a \in [-1, 1]$ we have that $L_n(a) \xrightarrow{a.s.} E(\ell_0(a))$ (almost sure pointwise convergence).
Since the parameter space $\Theta = [-1, 1]$ is compact and $\phi$ is the unique minimum of $L(\cdot) = E(\ell_0(\cdot))$ in the parameter space, all that remains is to show stochastic equicontinuity. From this we deduce almost sure uniform convergence.
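To show stochastic equicontinuity we expand $L_n(a)$ and use the mean value theorem to obtain
$$L_n(a_1) - L_n(a_2) = \nabla L_n(\bar{a})(a_1 - a_2), \tag{10.10}$$
where $\bar{a} \in [\min[a_1, a_2], \max[a_1, a_2]]$ and
$$\nabla L_n(\bar{a}) = -\frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}(X_t - \bar{a}X_{t-1}).$$
Because $\bar{a} \in [-1, 1]$ we have
$$|\nabla L_n(\bar{a})| \le D_n, \quad\text{where } D_n = \frac{2}{n-1}\sum_{t=2}^{n}\left(|X_{t-1}X_t| + X_{t-1}^2\right).$$
Since $\{X_t\}$ is an ergodic process, so is $\{|X_{t-1}X_t| + X_{t-1}^2\}$, and by the ergodic theorem
$$D_n \xrightarrow{a.s.} D := 2E\left(|X_{t-1}X_t| + X_{t-1}^2\right).$$
Therefore there exists a set $M \subset \Omega$ with $P(M) = 1$ such that for every $\omega \in M$ and $\gamma > 0$ we have $|D_n(\omega) - D| \le \gamma$ for all $n \ge N(\omega)$. Substituting this into (10.10) gives $|L_n(\omega, a_1) - L_n(\omega, a_2)| \le (D + \gamma)|a_1 - a_2|$ for all $n \ge N(\omega)$. Hence for every $\varepsilon > 0$, choosing $\delta = \varepsilon/(D + \gamma)$ gives
$$\sup_{|a_1 - a_2| \le \delta}|L_n(\omega, a_1) - L_n(\omega, a_2)| \le \varepsilon,$$
for all $n \ge N(\omega)$. Since this is true for all $\omega \in M$ we see that $\{L_n(a)\}$ is stochastically equicontinuous.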
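Theorem 10.4.1 Let $\hat{\phi}_n$ be defined as in (10.9). Then we have $\hat{\phi}_n \xrightarrow{a.s.} \phi$.
PROOF. Since $\{L_n(a)\}$ is almost surely equicontinuous, the parameter space $[-1, 1]$ is compact and we have pointwise convergence of $L_n(a) \xrightarrow{a.s.} L(a)$, by using Theorems 10.3.2 and 10.3.1 we have that $\hat{\phi}_n \xrightarrow{a.s.} a$, where $a = \arg\min_{a \in \Theta} L(a)$. It remains to show that $a = \phi$. Since $L(a) = E(X_1 - aX_0)^2$, differentiating with respect to $a$ shows that $L(a)$ is minimised at $a = E(X_0X_1)/E(X_0^2)$. Multiplying $X_t = \phi X_{t-1} + \varepsilon_t$ by $X_{t-1}$ and taking expectations gives
$$E(X_tX_{t-1}) = \phi E(X_{t-1}^2) + \underbrace{E(\varepsilon_tX_{t-1})}_{=0}.$$
Therefore $\phi = E(X_0X_1)/E(X_0^2)$, hence $\hat{\phi}_n \xrightarrow{a.s.} \phi$.
We note that by using very similar methods we can show strong consistency of the least squares estimator of the parameters in an AR(p) model.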
10.5
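As an alternative to almost sure consistency, weak consistency, where $\hat{a}_n \xrightarrow{P} a_0$ (convergence in probability), is often shown. This requires a weaker set of conditions, which we now describe:
(i) The parameter space $\Theta$ should be compact.
(ii) Pointwise convergence in probability: for every $a \in \Theta$, $L_n(a) \xrightarrow{P} L(a)$.
(iii) Equicontinuity in probability: for every $\varepsilon > 0$ and $\eta > 0$ there exists a $\delta$ such that
$$\lim_{n \to \infty}P\left(\sup_{|a_1 - a_2| \le \delta}|L_n(a_1) - L_n(a_2)| > \varepsilon\right) < \eta. \tag{10.11}$$
Verifying conditions (ii) and (iii) may look a little daunting but by using Chebyshev's (or Markov's) inequality it can be quite straightforward. For example, suppose we can show that for every $a \in \Theta$
$$E(L_n(a) - L(a))^2 \to 0 \quad\text{as } n \to \infty.$$
Then by Chebyshev's inequality, for every $\varepsilon > 0$,
$$P(|L_n(a) - L(a)| > \varepsilon) \le \frac{E(L_n(a) - L(a))^2}{\varepsilon^2} \to 0 \quad\text{as } n \to \infty,$$
which gives (ii). To show (iii) we often use the mean value theorem, which gives $|L_n(a_1) - L_n(a_2)| \le \sup_a\|\nabla_a L_n(a)\|_2\,\|a_1 - a_2\|$. Now if we can show that $\sup_n E\sup_a\|\nabla_a L_n(a)\|_2 < \infty$ (in other words it is uniformly bounded in probability over $n$) then we have the result. To see this observe that
$$P\left(\sup_{|a_1 - a_2| \le \delta}|L_n(a_1) - L_n(a_2)| > \varepsilon\right) \le P\left(\delta\sup_a\|\nabla_a L_n(a)\|_2 > \varepsilon\right) \le \frac{\delta\,\sup_n E\sup_a\|\nabla_a L_n(a)\|_2}{\varepsilon}.$$
Therefore by a careful choice of $\delta > 0$ we see that (10.11) is satisfied (and we have equicontinuity in probability).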
10.6
Once consistency of an estimator has been shown this paves the way to showing normality. To make the derivations simple we will assume that $\theta$ is univariate (this allows us to easily use a Taylor expansion). We will assume that the third derivative of the contrast function, $L_n(\theta)$, exists, its expectation is bounded and its variance converges to zero as $n \to \infty$. If this is the case we have the following result.
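Lemma 10.6.1 Suppose that the third derivative of the contrast function $L_n(\theta)$ exists, for $k = 0, 1, 2$, $E\left(\frac{\partial^k L_n(\theta)}{\partial\theta^k}\right) = \frac{\partial^k L}{\partial\theta^k}$ and $\mathrm{var}\left(\frac{\partial^k L_n(\theta)}{\partial\theta^k}\right) \to 0$ as $n \to \infty$, and $\frac{\partial^3 L_n(\theta)}{\partial\theta^3}$ is bounded by a random variable $Z_n$ whose expectation is bounded uniformly in $n$ and where $\mathrm{var}(Z_n) \to 0$. Then we have
$$(\hat{\theta}_n - \theta_0) = -V(\theta_0)^{-1}\frac{\partial L_n(\theta)}{\partial\theta}\Big|_{\theta=\theta_0} + o_p(1)\frac{\partial L_n(\theta)}{\partial\theta}\Big|_{\theta=\theta_0},$$
where $V(\theta_0) = \frac{\partial^2 L(\theta)}{\partial\theta^2}\Big|_{\theta_0}$.
PROOF. Since $\frac{\partial L_n(\theta)}{\partial\theta}\big|_{\theta=\hat{\theta}_n} = 0$, by the mean value theorem we have
$$\frac{\partial L_n(\theta)}{\partial\theta}\Big|_{\theta=\theta_0} = \frac{\partial L_n(\theta)}{\partial\theta}\Big|_{\theta=\hat{\theta}_n} - (\hat{\theta}_n - \theta_0)\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\Big|_{\theta=\bar{\theta}_n} = -(\hat{\theta}_n - \theta_0)\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\Big|_{\theta=\bar{\theta}_n}, \tag{10.12}$$
where $\bar{\theta}_n$ lies between $\theta_0$ and $\hat{\theta}_n$. We first study $\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\big|_{\theta=\bar{\theta}_n}$. By using the mean value theorem again we have
$$\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\Big|_{\theta=\bar{\theta}_n} = \frac{\partial^2 L_n(\theta)}{\partial\theta^2}\Big|_{\theta_0} + (\bar{\theta}_n - \theta_0)\frac{\partial^3 L_n(\theta)}{\partial\theta^3}\Big|_{\theta=\tilde{\theta}_n},$$
where $\tilde{\theta}_n$ lies between $\theta_0$ and $\bar{\theta}_n$. Since $\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\big|_{\theta_0} \xrightarrow{P} \frac{\partial^2 L(\theta)}{\partial\theta^2}\big|_{\theta_0} = V(\theta_0)$, under the stated assumptions we have
$$\left|\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\Big|_{\theta=\bar{\theta}_n} - V(\theta_0)\right| \le |\bar{\theta}_n - \theta_0|\left|\frac{\partial^3 L_n(\theta)}{\partial\theta^3}\Big|_{\theta=\tilde{\theta}_n}\right| + o_p(1) \le |\bar{\theta}_n - \theta_0|Z_n + o_p(1).$$
Therefore, by consistency of the estimator it is clear that $\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\big|_{\theta=\bar{\theta}_n} \xrightarrow{P} V(\theta_0)$. Substituting this into (10.12) we have
$$\frac{\partial L_n(\theta)}{\partial\theta}\Big|_{\theta=\theta_0} = -(\hat{\theta}_n - \theta_0)(V(\theta_0) + o_p(1));$$
since $V(\theta_0)$ is bounded away from zero we have $\left[\frac{\partial^2 L_n(\theta)}{\partial\theta^2}\big|_{\theta=\bar{\theta}_n}\right]^{-1} = V(\theta_0)^{-1} + o_p(1)$ and we obtain the desired result.
The above result means that the distribution of $(\hat{\theta}_n - \theta_0)$ is determined by $\frac{\partial L_n(\theta)}{\partial\theta}\big|_{\theta=\theta_0}$. In the following section we show asymptotic normality of $\frac{\partial L_n(\theta)}{\partial\theta}\big|_{\theta=\theta_0}$.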
10.6.1
The first central limit theorem goes back to the asymptotic distribution of sums of binary random variables (these have a binomial distribution and Bernoulli showed that they could be approximated by a normal distribution). This result was later generalised to sums of iid random variables. However, from the mid 20th century to the late 20th century several advances were made in generalising the results to dependent random variables. These include generalisations to random variables which have m-dependence, mixing properties, cumulant properties, near-epoch dependence etc. (see, for example, Billingsley (1995) and Davidson (1994)). In this section we will concentrate on a central limit theorem for martingales. Our reason for choosing this flavour of CLT is that it can be applied in various estimation settings, as it can often be shown that the derivative of a criterion at the true parameter is a martingale.
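Let us suppose that
$$S_n = \frac{1}{n}\sum_{t=1}^{n}Z_t.$$
In general, the average $S_n$ is such that $(S_n - E(S_n)) \xrightarrow{a.s.} 0$ as $n \to \infty$, hence in terms of distributions it converges towards the point mass at zero. Therefore we need to increase the magnitude of the difference, typically by multiplying by $\sqrt{n}$.
Definition 10.6.1 The random variables $\{Z_t\}$ are called martingale differences if $E(Z_t|Z_{t-1}, Z_{t-2}, \ldots) = 0$. The sequence $\{S_T\}_T$, where
$$S_T = \sum_{t=1}^{T}Z_t,$$
is called a martingale if $\{Z_t\}$ are martingale differences.
Let
$$S_T = \sum_{t=1}^{T}Z_t, \tag{10.13}$$
where $\mathcal{F}_t = \sigma(Z_t, Z_{t-1}, \ldots)$, $E(Z_t|\mathcal{F}_{t-1}) = 0$ and $E(Z_t^2) < \infty$. In the following theorem, adapted from Hall and Heyde (1980), Theorem 3.2 and Corollary 3.1, we show that $S_T$ is asymptotically normal.
Theorem 10.6.1 Let $\{S_T\}_T$ be defined as in (10.13). Further suppose
$$\frac{1}{T}\sum_{t=1}^{T}Z_t^2 \xrightarrow{P} \sigma^2, \tag{10.14}$$
where $\sigma^2$ is a finite constant, for all $\eta > 0$,
$$\frac{1}{T}\sum_{t=1}^{T}E\left(Z_t^2 I(|Z_t| > \eta\sqrt{T})\,\big|\,\mathcal{F}_{t-1}\right) \xrightarrow{P} 0 \tag{10.15}$$
(this is known as the conditional Lindeberg condition) and
$$\frac{1}{T}\sum_{t=1}^{T}E(Z_t^2|\mathcal{F}_{t-1}) \xrightarrow{P} \sigma^2. \tag{10.16}$$
Then we have
$$T^{-1/2}S_T \xrightarrow{D} N(0, \sigma^2). \tag{10.17}$$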
10.6.2
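In this section we show asymptotic normality of the least squares estimator of the AR(1) process ($X_t = \phi X_{t-1} + \varepsilon_t$, with $\mathrm{var}(\varepsilon_t) = \sigma^2$) defined in (10.8).
We recall that the least squares estimator is $\hat{\phi}_n = \arg\min_{a \in [-1,1]} L_n(a)$. Recalling the criterion
$$L_n(a) = \frac{1}{n-1}\sum_{t=2}^{n}(X_t - aX_{t-1})^2,$$
the first derivative is
$$\nabla L_n(a) = -\frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}(X_t - aX_{t-1}), \quad\text{so that}\quad \nabla L_n(\phi) = -\frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}\varepsilon_t,$$
and the second derivative is
$$\nabla^2 L_n(a) = \frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}^2.$$
Therefore by using (10.12) we have
$$(\hat{\phi}_n - \phi) = -\left(\nabla^2 L_n\right)^{-1}\nabla L_n(\phi). \tag{10.18}$$
Since $\{X_t^2\}$ are ergodic random variables, by using the ergodic theorem we have $\nabla^2 L_n \xrightarrow{a.s.} 2E(X_0^2)$. This with (10.18) implies
$$\sqrt{n}(\hat{\phi}_n - \phi) = -\underbrace{\left(\nabla^2 L_n\right)^{-1}}_{\xrightarrow{a.s.}\,(2E(X_0^2))^{-1}}\sqrt{n}\,\nabla L_n(\phi). \tag{10.19}$$
To show asymptotic normality of $\sqrt{n}(\hat{\phi}_n - \phi)$, we will show asymptotic normality of $\sqrt{n}\,\nabla L_n(\phi)$.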
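We observe that
$$\nabla L_n(\phi) = -\frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}\varepsilon_t$$
is the sum of martingale differences, since $E(X_{t-1}\varepsilon_t|X_{t-1}) = X_{t-1}E(\varepsilon_t|X_{t-1}) = X_{t-1}E(\varepsilon_t) = 0$ (here we used Definition 10.6.1). In order to show asymptotic normality of $\nabla L_n(\phi)$ we will use the martingale central limit theorem.
We now use Theorem 10.6.1 to show that $\sqrt{n}\,\nabla L_n(\phi)$ is asymptotically normal, which means we have to verify conditions (10.14)-(10.16). We note in our example that $Z_t := X_{t-1}\varepsilon_t$, and that the series $\{X_{t-1}\varepsilon_t\}_t$ is an ergodic process. Furthermore, since for any function $g$, $E(g(X_{t-1}\varepsilon_t)|\mathcal{F}_{t-1}) = E(g(X_{t-1}\varepsilon_t)|X_{t-1})$, where $\mathcal{F}_t = \sigma(X_t, X_{t-1}, \ldots)$, we need only to condition on $X_{t-1}$ rather than the entire sigma-algebra $\mathcal{F}_{t-1}$.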
C1 : By using the ergodicity of $\{X_{t-1}\varepsilon_t\}_t$ we have
$$\frac{1}{n}\sum_{t=1}^{n}Z_t^2 = \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2\varepsilon_t^2 \xrightarrow{P} E(X_{t-1}^2)E(\varepsilon_t^2) = \sigma^2c(0).$$
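C2 : We now verify the conditional Lindeberg condition. For any $\eta > 0$,
$$\frac{1}{n}\sum_{t=1}^{n}E\left(Z_t^2 I(|Z_t| > \eta\sqrt{n})\,\big|\,\mathcal{F}_{t-1}\right) = \frac{1}{n}\sum_{t=1}^{n}E\left(X_{t-1}^2\varepsilon_t^2 I(|X_{t-1}\varepsilon_t| > \eta\sqrt{n})\,\big|\,X_{t-1}\right).$$
We now use the Cauchy-Schwarz inequality for conditional expectations to split $X_{t-1}^2\varepsilon_t^2$ and $I(|X_{t-1}\varepsilon_t| > \eta\sqrt{n})$. We recall that the Cauchy-Schwarz inequality for conditional expectations is $E(X_tZ_t|\mathcal{G}) \le [E(X_t^2|\mathcal{G})E(Z_t^2|\mathcal{G})]^{1/2}$ almost surely. Therefore
$$\begin{aligned}
\frac{1}{n}\sum_{t=1}^{n}E\left(Z_t^2 I(|Z_t| > \eta\sqrt{n})\,\big|\,\mathcal{F}_{t-1}\right)
&\le \frac{1}{n}\sum_{t=1}^{n}\left\{E(X_{t-1}^4\varepsilon_t^4|X_{t-1})\,E\left(I(|X_{t-1}\varepsilon_t| > \eta\sqrt{n})^2\,\big|\,X_{t-1}\right)\right\}^{1/2} \\
&\le \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2E(\varepsilon_t^4)^{1/2}\left\{E\left(I(|X_{t-1}\varepsilon_t| > \eta\sqrt{n})^2\,\big|\,X_{t-1}\right)\right\}^{1/2}.
\end{aligned} \tag{10.20}$$
We note that rather than use the Cauchy-Schwarz inequality we can use a generalisation of it called the Hölder inequality. The Hölder inequality states that if $p^{-1} + q^{-1} = 1$, then $E(XY) \le \{E(X^p)\}^{1/p}\{E(Y^q)\}^{1/q}$ (the conditional version also exists). The advantage of using this inequality is that one can reduce the moment assumptions on $X_t$.
Returning to (10.20), and studying $E(I(|X_{t-1}\varepsilon_t| > \eta\sqrt{n})^2|X_{t-1})$, we use that $E(I(A)) = P(A)$ and the Chebyshev inequality to show
$$E\left(I(|X_{t-1}\varepsilon_t| > \eta\sqrt{n})^2\,\big|\,X_{t-1}\right) = P\left(|\varepsilon_t| > \frac{\eta\sqrt{n}}{|X_{t-1}|}\,\Big|\,X_{t-1}\right) \le \frac{X_{t-1}^2\mathrm{var}(\varepsilon_t)}{\eta^2 n}. \tag{10.21}$$
Substituting (10.21) into (10.20) we have
$$\begin{aligned}
\frac{1}{n}\sum_{t=1}^{n}E\left(Z_t^2 I(|Z_t| > \eta\sqrt{n})\,\big|\,\mathcal{F}_{t-1}\right)
&\le \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2E(\varepsilon_t^4)^{1/2}\left\{\frac{X_{t-1}^2\mathrm{var}(\varepsilon_t)}{\eta^2 n}\right\}^{1/2} \\
&\le \frac{E(\varepsilon_t^4)^{1/2}E(\varepsilon_t^2)^{1/2}}{\eta\,n^{3/2}}\sum_{t=1}^{n}|X_{t-1}|^3 \\
&= \frac{E(\varepsilon_t^4)^{1/2}E(\varepsilon_t^2)^{1/2}}{\eta\,n^{1/2}}\,\frac{1}{n}\sum_{t=1}^{n}|X_{t-1}|^3.
\end{aligned}$$
If $E(\varepsilon_t^4) < \infty$, then $E(X_t^4) < \infty$, therefore by using the ergodic theorem we have $\frac{1}{n}\sum_{t=1}^{n}|X_{t-1}|^3 \xrightarrow{a.s.} E(|X_0|^3)$. Since almost sure convergence implies convergence in probability we have
$$\frac{1}{n}\sum_{t=1}^{n}E\left(Z_t^2 I(|Z_t| > \eta\sqrt{n})\,\big|\,\mathcal{F}_{t-1}\right) \le \frac{E(\varepsilon_t^4)^{1/2}E(\varepsilon_t^2)^{1/2}}{\eta\,n^{1/2}}\,\frac{1}{n}\sum_{t=1}^{n}|X_{t-1}|^3 \xrightarrow{P} 0.$$
Hence condition (10.15) is satisfied.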
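C3 : We need to verify that $\frac{1}{n}\sum_{t=1}^{n}E(Z_t^2|\mathcal{F}_{t-1})$ converges in probability to the same limit as in C1, namely $\sigma^2c(0)$. Since $\{X_t\}_t$ is an ergodic sequence we have
$$\frac{1}{n}\sum_{t=1}^{n}E(Z_t^2|\mathcal{F}_{t-1}) = \frac{1}{n}\sum_{t=1}^{n}E(X_{t-1}^2\varepsilon_t^2|X_{t-1}) = \frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2E(\varepsilon_t^2|X_{t-1}) = E(\varepsilon_t^2)\underbrace{\frac{1}{n}\sum_{t=1}^{n}X_{t-1}^2}_{\xrightarrow{a.s.}\,E(X_0^2)} \xrightarrow{P} \sigma^2c(0),$$
hence we have verified condition (10.16).
Altogether conditions C1-C3 imply that
$$\frac{1}{\sqrt{n}}\sum_{t=1}^{n}X_{t-1}\varepsilon_t \xrightarrow{D} N(0, \sigma^2c(0)), \quad\text{so that}\quad \sqrt{n}\,\nabla L_n(\phi) \xrightarrow{D} N(0, 4\sigma^2c(0)). \tag{10.22}$$
Recalling (10.19), and noting that $(\nabla^2 L_n)^{-1} \xrightarrow{a.s.} (2c(0))^{-1}$ since $E(X_0^2) = c(0)$, we have
$$\sqrt{n}(\hat{\phi}_n - \phi) = -\underbrace{\left(\nabla^2 L_n\right)^{-1}}_{\xrightarrow{a.s.}\,(2c(0))^{-1}}\underbrace{\sqrt{n}\,\nabla L_n(\phi)}_{\xrightarrow{D}\,N(0, 4\sigma^2c(0))}. \tag{10.23}$$
Thus
$$\sqrt{n}(\hat{\phi}_n - \phi) \xrightarrow{D} N(0, \sigma^2c(0)^{-1}). \tag{10.24}$$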
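For a stationary AR(1) process $c(0) = \sigma^2/(1 - \phi^2)$, so the limiting variance in (10.24) is $\sigma^2 c(0)^{-1} = 1 - \phi^2$. The following Monte Carlo sketch compares this with the sample variance of $\sqrt{n}(\hat{\phi}_n - \phi)$ over independent replications. The Gaussian innovations, the sample size, the number of replications and the function names are all illustrative assumptions.

```python
import math
import random


def ls_estimate(x):
    """Least squares estimator: sum X_t X_{t-1} / sum X_{t-1}^2."""
    num = sum(x[t] * x[t - 1] for t in range(1, len(x)))
    den = sum(x[t - 1] ** 2 for t in range(1, len(x)))
    return num / den


def simulate_ar1(phi, n, rng, burn_in=100):
    """Simulate X_t = phi * X_{t-1} + e_t with standard Gaussian e_t."""
    x, prev = [], 0.0
    for t in range(n + burn_in):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            x.append(prev)
    return x


def scaled_errors(phi, n, reps, seed=1):
    """Replications of sqrt(n) * (phi_hat - phi)."""
    rng = random.Random(seed)
    return [math.sqrt(n) * (ls_estimate(simulate_ar1(phi, n, rng)) - phi)
            for _ in range(reps)]


def asymptotic_variance(phi, sigma2=1.0):
    """sigma^2 / c(0), with c(0) = sigma^2 / (1 - phi^2) for an AR(1)."""
    c0 = sigma2 / (1.0 - phi ** 2)
    return sigma2 / c0


errs = scaled_errors(phi=0.5, n=400, reps=300)
mean = sum(errs) / len(errs)
mc_var = sum((e - mean) ** 2 for e in errs) / (len(errs) - 1)
```

With $\phi = 0.5$ the theoretical value is $0.75$; `mc_var` should be close to it, up to Monte Carlo error of roughly ten percent for 300 replications.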
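We recall that
$$(\hat{\phi}_n - \phi) = -\left(\nabla^2 L_n\right)^{-1}\nabla L_n(\phi) = \frac{\frac{2}{n-1}\sum_{t=2}^{n}\varepsilon_tX_{t-1}}{\frac{2}{n-1}\sum_{t=2}^{n}X_{t-1}^2}, \tag{10.25}$$
and that $\mathrm{var}\left(\frac{2}{n-1}\sum_{t=2}^{n}\varepsilon_tX_{t-1}\right) = O(\frac{1}{n})$. This implies
$$(\hat{\phi}_n - \phi) = O_p(n^{-1/2}).$$
Indeed the result also holds almost surely
$$(\hat{\phi}_n - \phi) = O(n^{-1/2}). \tag{10.26}$$
The same result is true for autoregressive processes of arbitrary finite order. That is
$$\sqrt{n}(\hat{\phi}_n - \phi) \xrightarrow{D} N(0, E(\Gamma_p)^{-1}\sigma^2). \tag{10.27}$$
10.6.3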
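Previously we have discussed the weighted periodogram; here we show normality of it, in the case that the time series $X_t$ is a zero mean linear time series (has the representation $X_t = \sum_j\psi_j\varepsilon_{t-j}$). We recall that
$$A(\phi, I_n) = \frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)I_n(\omega_k) = \frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)|A(\omega_k)|^2I_\varepsilon(\omega_k) + o\left(\frac{1}{n}\right).$$
Therefore we will show asymptotic normality of $\frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)|A(\omega_k)|^2I_\varepsilon(\omega_k)$, which will give asymptotic normality of $A(\phi, I_n)$. Expanding $I_\varepsilon(\omega_k)$ gives
$$\frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)|A(\omega_k)|^2I_\varepsilon(\omega_k) = \frac{1}{n}\sum_{t,\tau=1}^{n}\varepsilon_t\varepsilon_\tau\frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)|A(\omega_k)|^2\exp(i\omega_k(t-\tau)) = \frac{1}{n}\sum_{t,\tau=1}^{n}\varepsilon_t\varepsilon_\tau g_n(t-\tau),$$
where
$$g_n(t-\tau) = \frac{1}{n}\sum_{k=1}^{n}\phi(\omega_k)|A(\omega_k)|^2\exp(i\omega_k(t-\tau)) = \frac{1}{2\pi}\int_0^{2\pi}\phi(\omega)|A(\omega)|^2\exp(i\omega(t-\tau))\,d\omega + O\left(\frac{1}{n^2}\right)$$
(the rate for the derivative exchange is based on assuming that the second derivatives of $A(\omega)$ and $\phi$ exist and $\phi(0) = \phi(2\pi)$). We can rewrite $\frac{1}{n}\sum_{t,\tau=1}^{n}\varepsilon_t\varepsilon_\tau g_n(t-\tau)$, centred about its mean, as
$$\begin{aligned}
\frac{1}{n}\sum_{t,\tau=1}^{n}[\varepsilon_t\varepsilon_\tau - E(\varepsilon_t\varepsilon_\tau)]g_n(t-\tau)
&= \frac{1}{n}\sum_{t=1}^{n}\left([\varepsilon_t^2 - E(\varepsilon_t^2)]g_n(0) + \varepsilon_t\sum_{\tau<t}\varepsilon_\tau[g_n(t-\tau) + g_n(\tau-t)]\right) \\
&:= \frac{1}{n}\sum_{t=1}^{n}Z_{t,n},
\end{aligned}$$
where it is straightforward to show that $\{Z_{t,n}\}$ are martingale differences. Thus we can show that
$$\frac{1}{\sqrt{n}}\sum_{t,\tau=1}^{n}\varepsilon_t\varepsilon_\tau g_n(t-\tau) - E\left(\frac{1}{\sqrt{n}}\sum_{t,\tau=1}^{n}\varepsilon_t\varepsilon_\tau g_n(t-\tau)\right) = \frac{1}{\sqrt{n}}\sum_{t=1}^{n}Z_{t,n}$$
satisfies the conditions of the martingale central limit theorem, which gives asymptotic normality of $\frac{1}{n}\sum_{t,\tau=1}^{n}\varepsilon_t\varepsilon_\tau g_n(t-\tau)$ and thus $A(\phi, I_n)$.
In the remainder of this chapter we obtain the sampling properties of the ARMA estimators.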
10.7
In this section we will derive the sampling properties of the Hannan-Rissanen estimator. We will obtain an almost sure rate of convergence (this will be the only estimator where we obtain an almost sure rate). Typically obtaining almost sure rates can be more difficult than obtaining probabilistic rates; moreover the rates can be different (worse in the almost sure case). We now illustrate why that is with a small example. Suppose $\{X_t\}$ are iid random variables with mean zero and variance one. Let $S_n = \sum_{t=1}^{n}X_t$. It can easily be shown that
$$\mathrm{var}\left(\frac{1}{n}S_n\right) = \frac{1}{n}, \quad\text{therefore}\quad \frac{1}{n}S_n = O_p\left(\frac{1}{\sqrt{n}}\right). \tag{10.28}$$
However, from the law of iterated logarithm we have for any $\varepsilon > 0$
$$P\left(S_n \ge (1+\varepsilon)\sqrt{2n\log\log n} \text{ infinitely often}\right) = 0, \qquad P\left(S_n \ge (1-\varepsilon)\sqrt{2n\log\log n} \text{ infinitely often}\right) = 1. \tag{10.29}$$
Comparing (10.28) and (10.29) we see that for any given trajectory (realisation) most of the time $\frac{1}{n}S_n$ will be within the $O(\frac{1}{\sqrt{n}})$ bound, but there will be excursions above it, up to the $O(\sqrt{\frac{\log\log n}{n}})$ bound. In other words we cannot say that $\frac{1}{n}S_n = O(\frac{1}{\sqrt{n}})$ almost surely, but we can say that
$$\frac{1}{n}S_n = O\left(\sqrt{\frac{2\log\log n}{n}}\right) \quad\text{almost surely.}$$
Hence the probabilistic and the almost sure rates are (slightly) different. Given this result is true for the average of iid random variables, it is likely that similar results will hold true for various estimators.
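The difference between the two rates can be seen along a single simulated trajectory. The sketch below is an illustration only, assuming Gaussian increments and hypothetical function names: it tracks the running maximum of $|S_n|/\sqrt{n}$ and of $|S_n|/\sqrt{2n\log\log n}$ along one path. The first ratio drifts upwards slowly as $n$ grows, while the second is the one the law of the iterated logarithm keeps bounded.

```python
import math
import random


def lil_envelope(n):
    """sqrt(2 n log log n), the law of the iterated logarithm envelope for S_n."""
    return math.sqrt(2.0 * n * math.log(math.log(n)))


def max_ratios(n_max, seed=0, start=100):
    """Largest |S_n|/sqrt(n) and |S_n|/envelope over start <= n <= n_max."""
    rng = random.Random(seed)
    s, worst_clt, worst_lil = 0.0, 0.0, 0.0
    for n in range(1, n_max + 1):
        s += rng.gauss(0.0, 1.0)   # iid increments with mean 0 and variance 1
        if n >= start:
            worst_clt = max(worst_clt, abs(s) / math.sqrt(n))
            worst_lil = max(worst_lil, abs(s) / lil_envelope(n))
    return worst_clt, worst_lil


worst_clt, worst_lil = max_ratios(100_000)
```

Rerunning with a larger `n_max` or a different `seed` gives a feel for how rarely, but persistently, the path makes excursions beyond a fixed multiple of $\sqrt{n}$.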
In this section we derive an almost sure rate for the Hannan-Rissanen estimator. This rate will be determined by a few factors: (a) an almost sure bound similar to the one derived above, (b) the increasing number of parameters $p_n$, and (c) the bias due to estimating only a finite number of parameters when there are an infinite number in the model.
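(i) Fit a long autoregression of order $p_n$ by least squares:
$$\hat{b}_n = \hat{R}_n^{-1}\hat{r}_n, \tag{10.30}$$
where $\hat{b}_n' = (\hat{b}_{1,n}, \ldots, \hat{b}_{p_n,n})$,
$$\hat{R}_n = \sum_{t=p_n+1}^{n}\mathbf{X}_{t-1}\mathbf{X}_{t-1}', \qquad \hat{r}_n = \sum_{t=p_n+1}^{n}X_t\mathbf{X}_{t-1},$$
and $\mathbf{X}_{t-1}' = (X_{t-1}, \ldots, X_{t-p_n})$.
(ii) Estimate the residuals with
$$\hat{\varepsilon}_t = X_t - \sum_{j=1}^{p_n}\hat{b}_{j,n}X_{t-j}.$$
(iii) Now use as estimates of $\phi_0$ and $\theta_0$ the estimators $\hat{\phi}_n, \hat{\theta}_n$, where
$$\hat{\phi}_n, \hat{\theta}_n = \arg\min\sum_{t=p_n+1}^{n}\left(X_t - \sum_{j=1}^{p}\phi_jX_{t-j} - \sum_{i=1}^{q}\theta_i\hat{\varepsilon}_{t-i}\right)^2. \tag{10.31}$$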
sn =
1
T
n
X
t Xt ,
Y
t=pn +1
t = Xt
= t +
pn
X
bj,n Xtj
j=1
p
n
X
j=1
j=pn +1
(bj,n bj )Xtj
bj Xtj .
(10.32)
Hence
pn
X
X
t t
bj Xtj .
(bj,n bj )Xtj +
(10.33)
j=pn +1
j=1
, we need to
=
Therefore to study the asymptotic properties of
n n
Obtain a rate of convergence for supj bj,n bj .
Obtain a rate for 
t t .
, ).
n = (
Use the above to obtain a rate for
n n
We first want to obtain the uniform rate of convergence for $\sup_j|\hat{b}_{j,n} - b_j|$. Deriving this is technically quite challenging. We state the rate in the following theorem; an outline of the proof can be found in Section 10.7.1. The proof uses results from mixingale theory which can be found in Chapter B.
Theorem 10.7.1 Suppose that {Xt } is from an ARMA process where the roots of the true char n be defined
acteristic polynomials (z) and (z) both have absolute value greater than 1 + . Let b
as in (10.30), then we have almost surely
r
n bn k2 = O p2
kb
n
Corollary 10.7.1 Suppose the conditions in Theorem 10.7.1 are satisfied. Then we have
t t pn max bj,n bj Zt,pn + Kpn Ytpn ,
1jpn
(10.34)
where Zt,pn =
1
pn
Ppn
t=1 Xtj 
and Yt =
Ppn
j
t=1 Xt ,
n
1 X
ti Xtj ti Xtj = O(pn Q(n) + pn )
n
(10.35)
n
1 X
ti tj ti tj = O(pn Q(n) + pn )
n
(10.36)
t=pn +1
t=pn +1
p3n
n
+ pn pn .
O(pn Q(n))
t=pn +1
1
n
n
X
t=pn +1
pn
n
1 X
Xt Ytpn 
n
t=pn +1
= O(pn Q(n) + ).
!
.
, ) and = ( , ).
n = (
for any > 0, where
0
n n
0 0
n that
PROOF. We note from the definition of
n 0
1
n
0 .
= R
sn R
n
n and sn we replace the estimated residuals n with the true unobserved residuals.
Now in the R
This gives us
n 0
Rn =
1
n
1
1 sn )
= R1
n sn Rn 0 + (Rn sn Rn
n
X
Yt Yt
sn =
t=max(p,q)
1
n
n
X
(10.37)
Yt Xt ,
t=max(p,q)
1
1
(R1
sn ).
n sn Rn
n (Rn Rn )Rn sn + Rn (sn
1
Now, almost surely R1
n , Rn = O(1) (if E(Rn ) is nonsingular). Hence we only need to obtain a
n Rn and sn sn . We recall that
bound for R
X
tY
0 Yt Y0 ),
n Rn = 1
(Y
R
t
t
n
t=pn +1
hence the terms differ where we replace the estimated t with the true t , hence by using (10.35)
and (10.36) we have almost surely
n Rn  = O(pn Q(n) + pn ) and sn sn  = O(pn Q(n) + pn ).
R
Therefore by substituting the above into (10.38) we obtain
n 0
pn
= R1
n sn Rn 0 + O(pn Q(n) + ).
(10.38)
n
X
t Y t .
t=max(p,q)
Substituting
the above bound into (??), and noting that O(Q(n)) dominates O(
r
n n
2 = O p3n
10.7.1
gives
T bT k 2 )
Proof of Theorem 10.7.1 (A rate for kb
We observe that
n bn = R1 rn R
n bn + R
n1 Rn1 rn R
n bn
b
n
where b, Rn and rn are deterministic, with bn = (b1 . . . , bpn ), (Rn )i,j = E(Xi Xj ) and (rn )i =
E(X0 Xi ). Evaluating the Euclidean distance we have
n bn k2 kR1 kspec
rn R
n bn
+ kR1 kspec kR
1 kspec
R
n Rn
rn R
n bn
, (10.39)
kb
n
n
n
2
2
2
1 R
1 = R
1 (Rn R
n )R1 and the norm inequalities. Now by using
where we used that R
n
n
n
n
Lemma 5.4.1 we have min (Rn1 ) > /2 for all T . Thus our aim is to obtain almost sure bounds
n bn k2 and kR
n Rn k2 , which requires the lemma below.
for krn R
Theorem 10.7.3 Let us suppose that {Xt } has an ARMA representation where the roots of the
characteristic polynomials (z) and (z) lie are greater than 1 + . Then
(i)
n
1 X
t Xtr = O(
n
t=r+1
(10.40)
(ii)
1
n
n
X
r
Xti Xtj = O(
t=max(i,j)
(10.41)
To obtain the bounds we first note that if there was not an MA component in the ARMA process, in other words $\{X_t\}$ was an AR(p) process with $p_n \ge p$, then $(\hat{r}_n - \hat{R}_nb_n)_r = \frac{1}{n}\sum_{t=p_n+1}^{n}\varepsilon_tX_{t-r}$, which has mean zero. However, because an ARMA process has an AR($\infty$) representation and we are only estimating the first $p_n$ parameters, there exists a bias in $\hat{r}_n - \hat{R}_nb_n$. Therefore we obtain the decomposition
n bn )r =
(rn R
=
n
1 X
n
Xt
t=pn +1
n
X
bj Xtj
n
1 X
Xtr +
n
t=pn +1 j=pn +1
j=1
1
1 X X
t Xtr +
bj Xtj Xtr
n
n
t=pn +1
t=pn +1 j=pn +1
{z
} 

{z
}
stochastic term
(10.43)
bias
X
X
X
pn 1
n bn ) r 1
(rn R
K
X

j Xtpn j .
t tr
tr
n
n
t=pn +1
Let Yt =
j
j=1 Xtj
and Sn,k,r =
1
n
t=1
Pn
t=1 Xtr 
j=1
(10.44)
j=1
j X
tkj .
are ergodic sequences. By applying the ergodic theorm we can show that for a fixed k and r,
a.s.
Sn,k,r E(Xtr Ytk ). Hence Sn,k,r are almost surely bounded sequences and
pn
t=1
j=1
X
1X
Xtr 
j Xtpn j  = O(pn ).
n
n
1 X
t Xt1 k2 + O(pn pn ).
n
t=pn +1
)!
.
(10.45)
n bn . Next we consider R
n . It is clear from the definition of R
n that
This gives us a rate for rn R
almost surely we have
n )i,j E(Xi Xj ) =
(R
n
1 X
Xti Xtj E(Xi Xj )
n
1
n
1
n
t=pn +1
n
X
[Xti Xtj
t=min(i,j)
T
X
1
E(Xi Xj )]
n
pn
X
t=min(i,j)
t=min(i,j)
Xti Xtj +
min(i, j)
E(Xi Xj )
n
pn
).
n
pn
E(Xi Xj ) = O( +
n
p2n
pn
+
n
)!
.
(10.46)
!
.
As we mentioned previously, because the spectrum of Xt is bounded away from zero, min (Rn )
n ) min (Rn ) max (R
n Rn )
is bounded away from zero for all T . Moreover, since min (R
n Rn )2 ), which for a large enough n is bounded away from zero. Hence we obtain
min (Rn )tr((R
almost surely
r
n bn k2 = O
kb
p2n
!
,
(10.47)
10.8
Xt
p
X
(0)
i Xti = t +
i=1
(0)
(0)
(0)
q
X
(0)
j tj ,
(10.48)
j=1
(0)
n1
n1
t=1
t=1
(,) 2
1X
1 X (Xt+1 Xt+1t )
.
log rt+1 (, , ) +
n
n
rt+1 (, , )
(10.49)
To show consistency and asymptotic normality we will use the following assumptions.
Assumption 10.8.1
(ii) The parameter space $\Theta$ should be such that all $\phi(z)$ and $\theta(z)$ in the parameter space have roots whose absolute value is greater than $1 + \delta$. $\phi_0(z)$ and $\theta_0(z)$ belong to this space.
Assumption 10.8.1 means for for some finite constant K and
P
P
j
j
j
j
1
K
j=0  Z .
j=0  z  and (z)  K
1
1+
To prove the result, we require the following approximations of the GML. Let
(,) =
X
t+1t,...
t
X
bj (, )Xt+1j .
(10.50)
j=1
This is an approximation of the one-step ahead predictor. Since the likelihood is constructed from the one-step ahead predictors, we can approximate the likelihood $\frac{1}{n}L_n(\phi, \theta, \sigma)$ with the above and define
T 1
1
1 X
(,) )2 .
Ln (, , ) = log 2 +
(Xt+1 X
t+1t,...
2
n
n
t=1
(10.51)
(,) was derived from X (,) which is the onestep ahead predictor of Xt+1
We recall that X
t+1t,...
t+1t,...
given Xt , Xt1 , . . ., this is
(,)
Xt+1t,...
bj (, )Xt+1j .
(10.52)
j=1
1
n Ln (, , )
(since the infinite past of {Xt } is not observed). Let us define the criterion
T 1
1
1 X
(,)
Ln (, , ) = log 2 +
(Xt+1 Xt+1t,... )2 .
n
n 2
(10.53)
t=1
In practice
1
n Ln (, , )
(10.52), obtained using the parameters = {j } and = {i }, where the roots the corresponding
characteristic polynomial (z) and (z) have absolute value greater than 1 + . Then
t
(,)
(,)
X
X
t+1t
t+1t,...
t X i
Xi ,
1
(10.54)
i=1
(,)
(,) )2 Kt ,
E(Xt+1t X
t+1t,...
(10.55)
X
X
t
t+1t,... (1) Xt+1t,... =
X
bj (, )Xt+1j K
j Xj ,
j=t+1
(10.56)
j=0
(,)
(,) )2 Kt
E(Xt+1t,... X
t+1t,...
(10.57)
rt (, , ) 2  Kt
(10.58)
and
t
X
bj (, )Xt+1j +
j=1
bj (, )r 0t,j (, )t (, )1 X t ,
(10.59)
j=t+1
bj r 0t,j t (, )1 X t .
j=t+1
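Since the largest eigenvalue of $\Sigma_t(\phi,\theta)^{-1}$ is bounded (see Lemma 5.4.1) and $|(\underline{r}_{t,j})_i| = |E(Y_{t-i}Y_{-j})| \le K\rho^{|t-i+j|}$ we obtain the bound in (10.54). Taking expectations, we have
\[ E\big(X_{t+1|t}^{\phi,\theta}-\widehat{X}_{t+1|t,\ldots}^{\phi,\theta}\big)^2 = \Big[\sum_{j=t+1}^{\infty}b_j\underline{r}_{t,j}'\Big]\Sigma_t(\phi,\theta)^{-1}\Sigma_t(\phi_0,\theta_0)\Sigma_t(\phi,\theta)^{-1}\Big[\sum_{j=t+1}^{\infty}b_j\underline{r}_{t,j}\Big]. \]
Now by using the same arguments given in the proof of (5.29) we obtain (10.55).
To prove (10.57) we note that
\[ E\big(\widehat{X}_{t+1|t,\ldots}(1)-X_{t+1|t,\ldots}\big)^2 = E\Big(\sum_{j=t+1}^{\infty}b_j(\phi,\theta)X_{t+1-j}\Big)^2 = E\Big(\sum_{j=1}^{\infty}b_{t+j}(\phi,\theta)X_{-j}\Big)^2, \]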
1
1+
j=1 bj (0 , 0 )Xtj
X
2
1
1
Ln (, , ) = log 2 +
Xt
bj (, )Xt+1j
2
n
n
j=1
T 1
1
1 X
Ln (, , ) log 2 +
(B)1 (B)Xt (B)1 (B)Xt
2
n
n
t=1
n
n
X
1 X 2 2X
= log 2 +
bj (, )Xtj
t
t
2
n
n
t=1
t=1
j=1
2
1X X
(bj (, ) bj (0 , 0 ))Xtj .
+
n
t=1
j=1
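Remark 10.8.1 (Derivatives involving the Backshift operator) Consider the transformation
\[ \frac{1}{1-\theta B}X_t = \sum_{j=0}^{\infty}\theta^jB^jX_t = \sum_{j=0}^{\infty}\theta^jX_{t-j}. \]
Suppose we want to differentiate the above with respect to $\theta$; there are two ways this can be done. Either differentiate $\sum_{j=0}^{\infty}\theta^jX_{t-j}$ with respect to $\theta$, or differentiate $\frac{1}{1-\theta B}$ with respect to $\theta$. In other words
\[ \frac{d}{d\theta}\frac{1}{1-\theta B}X_t = \frac{B}{(1-\theta B)^2}X_t = \sum_{j=0}^{\infty}j\theta^{j-1}X_{t-j}. \]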
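With $\phi(B) = 1-\sum_{j=1}^{p}\phi_jB^j$ and $\theta(B) = 1+\sum_{j=1}^{q}\theta_jB^j$ (the polynomials of the ARMA model (10.48)), the same argument gives
\[ \frac{d}{d\theta_j}\frac{\phi(B)}{\theta(B)}X_t = -\frac{B^j\phi(B)}{\theta(B)^2}X_t = -\frac{\phi(B)}{\theta(B)^2}X_{t-j}, \]
\[ \frac{d}{d\phi_j}\frac{\phi(B)}{\theta(B)}X_t = -\frac{B^j}{\theta(B)}X_t = -\frac{1}{\theta(B)}X_{t-j}. \]
Moreover in the case of squares we have
\[ \frac{d}{d\theta_j}\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)^2 = -2\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)\Big(\frac{\phi(B)}{\theta(B)^2}X_{t-j}\Big), \]
\[ \frac{d}{d\phi_j}\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)^2 = -2\Big(\frac{\phi(B)}{\theta(B)}X_t\Big)\Big(\frac{1}{\theta(B)}X_{t-j}\Big). \]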
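The two routes in Remark 10.8.1 for the filter $\frac{1}{1-\theta B}$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the input series `x` is a made-up deterministic sequence, it is treated as zero before time 0, and the function names are hypothetical. It compares the term-by-term derivative of the series, the operator form $B/(1-\theta B)^2$, and a finite difference.

```python
import math

def recursive_filter(theta, x):
    """Apply 1/(1 - theta*B) through y_t = theta*y_{t-1} + x_t (x_s = 0 for s < 0 is assumed)."""
    y, previous = [], 0.0
    for value in x:
        previous = theta * previous + value
        y.append(previous)
    return y

def derivative_by_series(theta, x):
    """Term-by-term derivative: sum_{j>=1} j * theta^(j-1) * x_{t-j} at the last time point t."""
    t = len(x) - 1
    return sum(j * theta ** (j - 1) * x[t - j] for j in range(1, t + 1))

def derivative_by_operator(theta, x):
    """Operator derivative B/(1 - theta*B)^2: filter twice, then lag by one time step."""
    twice_filtered = recursive_filter(theta, recursive_filter(theta, x))
    return twice_filtered[-2]

# Hypothetical deterministic input series, chosen only for illustration.
x = [math.sin(0.7 * s) + 0.05 * s for s in range(40)]
theta, h = 0.6, 1e-6

series_value = derivative_by_series(theta, x)
operator_value = derivative_by_operator(theta, x)
# Central finite difference of the filtered value at the last time point.
finite_difference = (recursive_filter(theta + h, x)[-1]
                     - recursive_filter(theta - h, x)[-1]) / (2 * h)

print(series_value, operator_value, finite_difference)
```

The three printed numbers should agree up to floating point and finite-difference error, which is the content of the remark for this simple filter.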
1
n Ln
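\[ \nabla_{\theta_i}\frac{1}{n}L_n(\phi,\theta,\sigma) = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\big(\theta(B)^{-1}\phi(B)X_t\big)\frac{\phi(B)}{\theta(B)^2}X_{t-i} \]
\[ \nabla_{\phi_j}\frac{1}{n}L_n(\phi,\theta,\sigma) = -\frac{2}{n\sigma^2}\sum_{t=1}^{n}\big(\theta(B)^{-1}\phi(B)X_t\big)\frac{1}{\theta(B)}X_{t-j} \]
\[ \nabla_{\sigma^2}\frac{1}{n}L_n(\phi,\theta,\sigma) = \frac{1}{\sigma^2}-\frac{1}{n\sigma^4}\sum_{t=1}^{n}\Big(X_t-\sum_{j=1}^{\infty}b_j(\phi,\theta)X_{t-j}\Big)^2. \tag{10.60} \]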
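\[ \sup_{\phi,\theta}\Big\|\frac{1}{n}\nabla^3L_n\Big\|_2 \le KS_n, \tag{10.61} \]
\[ S_n = \frac{1}{n}\sum_{r_1,r_2=0}^{\max(p,q)}\sum_{t=1}^{n}Y_{t-r_1}Y_{t-r_2}, \tag{10.62} \]
where
\[ Y_t = K\sum_{j=0}^{\infty}\rho^j|X_{t-j}| \]
for any $\frac{1}{1+\delta}<\rho<1$.
PROOF. The proof follows from the roots of $\phi(z)$ and $\theta(z)$ having absolute value greater than $1+\delta$.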
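\[ L(\phi,\theta,\sigma) = \log\sigma^2 + \frac{\sigma_0^2}{\sigma^2} + \frac{1}{\sigma^2}E\big(Z_t(\phi,\theta)^2\big) \]
where
\[ Z_t(\phi,\theta) = \sum_{j=1}^{\infty}\big(b_j(\phi,\theta)-b_j(\phi_0,\theta_0)\big)X_{t-j}. \]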
Lemma 10.8.2 Suppose that Assumption 10.8.1 is satisfied. Then for all $\phi,\theta,\sigma$ in the parameter space we have
(i) $\frac{1}{n}\nabla^iL_n(\phi,\theta,\sigma) \overset{a.s.}{\to} \nabla^iL(\phi,\theta,\sigma)$ for $i=0,1,2,3$.
(ii) Let $S_n$ be defined as in (10.62), then $S_n \overset{a.s.}{\to} E\big(\sum_{r_1,r_2=0}^{\max(p,q)}Y_{t-r_1}Y_{t-r_2}\big)$.
PROOF. Since the ARMA process $\{X_t\}$ is ergodic, $\{Z_t(\phi,\theta)\}$ and $\{Y_t\}$ are also ergodic, and the result follows immediately from the ergodic theorem.
We use these results in the proofs below.
,
Theorem 10.8.1 Suppose that Assumption 10.8.1 is satisfied. Let (n ,
) = arg min Ln (, , )
n n
a.s.
,
(i) (n ,
) (0 , 0 , 0 ).
n n
(ii)
D
)
n(n 0 ,
N (0, 02 1 ), where
0
n
and {Ut } and {Vt } are autoregressive processes which satisfy 0 (B)Ut = t and 0 (B)Vt = t .
PROOF. We prove the result in two stages below.
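PROOF of Theorem 10.8.1(i) We will first prove Theorem 10.8.1(i). Noting the results in Section 10.3, to prove consistency we recall that we must show (a) that $(\phi_0,\theta_0,\sigma_0)$ is the unique minimum of $L(\cdot)$, (b) pointwise convergence $\frac{1}{n}L_n(\phi,\theta,\sigma) \overset{a.s.}{\to} L(\phi,\theta,\sigma)$ and (c) stochastic equicontinuity (as defined in Definition 10.3.2). To show that $(\phi_0,\theta_0,\sigma_0)$ is the minimum we note that
\[ L(\phi,\theta,\sigma)-L(\phi_0,\theta_0,\sigma_0) = \log\frac{\sigma^2}{\sigma_0^2} + \frac{\sigma_0^2}{\sigma^2} - 1 + \frac{1}{\sigma^2}E\big(Z_t(\phi,\theta)^2\big). \]
Since for all positive $x$, $-\log x + x - 1$ is a non-negative function (here $x=\sigma_0^2/\sigma^2$) and $E(Z_t(\phi,\theta)^2) = E\big(\sum_{j=1}^{\infty}(b_j(\phi,\theta)-b_j(\phi_0,\theta_0))X_{t-j}\big)^2$ is non-negative and zero at $(\phi_0,\theta_0,\sigma_0)$, it is clear that $(\phi_0,\theta_0,\sigma_0)$ is the minimum of $L$. We will assume for now it is the unique minimum. Pointwise convergence is an immediate consequence of Lemma 10.8.2(i). To show stochastic equicontinuity we note that for any $\vartheta_1=(\phi_1,\theta_1,\sigma_1)$ and $\vartheta_2=(\phi_2,\theta_2,\sigma_2)$ we have by the mean value theorem
\[ L_n(\phi_1,\theta_1,\sigma_1)-L_n(\phi_2,\theta_2,\sigma_2) = (\vartheta_1-\vartheta_2)\nabla L_n(\bar\phi,\bar\theta,\bar\sigma). \]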
PROOF of Theorem 10.8.1(ii) We now prove Theorem 10.8.1(ii) using the martingale central limit theorem (see Billingsley (1995) and Hall and Heyde (1980)) in conjunction with the Cramér-Wold device (see Theorem 10.6.1).
Using the mean value theorem we have
n 0 = 2 Ln (
n )1 Ln (0 , 0 , 0 )
,
,
), = ( , 0 , 0 ) and
n = ,
n and 0 .
n = (
lies between
where
0
0
n n,
n
Using the same techniques given in Theorem 10.8.1(i) and Lemma 10.8.2 we have pointwise
a.s.
a.s.
nonsingular) we have
a.s.
n )1 2 1 .
2 Ln (
(10.63)
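Now we show that $\nabla L_n(\phi_0,\theta_0,\sigma_0)$ is asymptotically normal. By using (10.60) and replacing $X_{t-i} = \phi_0(B)^{-1}\theta_0(B)\varepsilon_{t-i}$ we have
\[ \frac{1}{n}\nabla_{\theta_i}L_n(\phi_0,\theta_0,\sigma_0) = \frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_t\frac{(-1)}{\theta_0(B)}\varepsilon_{t-i} = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_tV_{t-i}, \qquad i=1,\ldots,q \]
\[ \frac{1}{n}\nabla_{\phi_j}L_n(\phi_0,\theta_0,\sigma_0) = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_t\frac{1}{\phi_0(B)}\varepsilon_{t-j} = -\frac{2}{\sigma^2n}\sum_{t=1}^{n}\varepsilon_tU_{t-j}, \qquad j=1,\ldots,p \]
\[ \frac{1}{n}\nabla_{\sigma^2}L_n(\phi_0,\theta_0,\sigma_0) = \frac{1}{\sigma^2}-\frac{1}{\sigma^4n}\sum_{t=1}^{n}\varepsilon_t^2 = \frac{1}{\sigma^4n}\sum_{t=1}^{n}(\sigma^2-\varepsilon_t^2), \]
where $U_t = \frac{1}{\phi_0(B)}\varepsilon_t$ and $V_t = \frac{1}{\theta_0(B)}\varepsilon_t$.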
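We observe that $\frac{1}{n}\nabla L_n$ is the sum of martingale differences. If $E(\varepsilon_t^4)<\infty$, it is clear that $E((\varepsilon_tU_{t-j})^4) = E(\varepsilon_t^4)E(U_{t-j}^4)<\infty$, $E((\varepsilon_tV_{t-i})^4) = E(\varepsilon_t^4)E(V_{t-i}^4)<\infty$ and $E((\sigma^2-\varepsilon_t^2)^2)<\infty$. Hence Lindeberg's condition is satisfied (see the proof given in Section 10.6.2, for why this is true). Hence we have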
nLn (0 , 0 , 0 ) N (0, ).
n 0 = n2 Ln (
n )1 Ln (0 )
n
D
n 0 N (0, 4 1 ).
n
The above result proves consistency and asymptotic normality of $(\tilde\phi_n,\tilde\theta_n,\tilde\sigma_n)$, which is based on $L_n(\phi,\theta,\sigma)$, which in practice is impossible to evaluate. However we will show below that the Gaussian likelihood, $\mathcal{L}_n(\phi,\theta,\sigma)$, and its derivatives are sufficiently close to $L_n(\phi,\theta,\sigma)$ such that the estimators $(\tilde\phi_n,\tilde\theta_n,\tilde\sigma_n)$ and the GMLE, $(\hat\phi_n,\hat\theta_n,\hat\sigma_n) = \arg\min\mathcal{L}_n(\phi,\theta,\sigma)$, are asymptotically equivalent.
1
1
Ln (, , ) L(, , ) = O( ),
n
(,,) n
sup
for k = 0, 1, 2, 3.
PROOF. The proof of the result follows from (10.54) and (10.56).
, ) Ln (, , ), a similar proof can be used for the rest of the result.
sup(,,) n1 L(,
Let us consider the difference
Ln (, ) Ln (, ) =
1
(In + IIn + IIIn ),
n
where
In =
IIIn =
n1
X
t=1
n1
X
t=1
rt (, , ) 2 ,
IIn =
n1
X
t=1
1
(,)
(,)
(X
Xt+1t )2
rt (, , ) t+1
1
(,)
(,) ) + ((X (,) )2 (X
(,) )2 ) .
2X
(X
X
t+1
t+1t
t+1t,...
t+1t
t+1t,...
2
X
t+1t
t+1t,...
where Vt =
Pt
i
i=1 Xi .
t
(1 )
Hence since $E(X_t^2)<\infty$ and $E(V_t^2)<\infty$ we have that $\sup_nE|I_n|<\infty$, $\sup_nE|II_n|<\infty$ and $\sup_nE|III_n|<\infty$. Hence the sequence $\{|I_n+II_n+III_n|\}_n$ is almost surely bounded. This means that almost surely
\[ \sup_{\phi,\theta,\sigma}\big|\mathcal{L}_n(\phi,\theta)-L_n(\phi,\theta)\big| = O\Big(\frac{1}{n}\Big). \]
Thus giving the required result.
Now by using the above proposition the result below immediately follows.
= arg min LT (, , ) and (, )
= arg min L
T (, , )
Theorem 10.8.2 Let (, )
a.s.
a.s.
(i) (, )
(0 , 0 ) and (, )
(0 , 0 ).
(ii)
D
)
T (T 0 ,
N (0, 04 1 )
0
T
D
)
and T (T 0 ,
N (0, 04 1 ).
0
T
Appendix A
Background
A.1
formula). Suppose $\rho(A)<1$, then Gelfand's formula implies that for any $\rho(A)<\lambda<1$ there exists a constant $C_{A,\lambda}$ (which depends only on $A$ and $\lambda$) such that $\|A^j\|\le C_{A,\lambda}\lambda^j$.
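The mean value theorem.
This basically states that if the partial derivatives of the function $f(x_1,x_2,\ldots,x_n)$ are bounded in the domain $\Omega$, then for $x=(x_1,\ldots,x_n)$ and $y=(y_1,\ldots,y_n)$
\[ f(x_1,x_2,\ldots,x_n)-f(y_1,y_2,\ldots,y_n) = \sum_{i=1}^{n}(x_i-y_i)\frac{\partial f}{\partial x_i}\Big\rfloor_{x=x^*}, \]
where $x^*$ lies between $x$ and $y$. The corresponding second order expansion is
\[ f(x)-f(y) = \sum_{i=1}^{n}(x_i-y_i)\frac{\partial f}{\partial x_i}\Big\rfloor_{x=y} + \frac{1}{2}\sum_{i,j=1}^{n}(x_i-y_i)(x_j-y_j)\frac{\partial^2f}{\partial x_i\partial x_j}\Big\rfloor_{x=x^*}. \]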
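Partial Fractions.
We use the following result mainly for obtaining the MA($\infty$) expansion of an AR process.
Suppose that $|g_i|>1$ for $1\le i\le n$. Then if $g(z) = \prod_{i=1}^{n}(1-z/g_i)^{r_i}$ (with $r_i$ denoting the multiplicity of the root $g_i$), the inverse of $g(z)$ satisfies
\[ \frac{1}{g(z)} = \sum_{i=1}^{n}\sum_{j=1}^{r_i}\frac{g_{i,j}}{(1-\frac{z}{g_i})^j}, \]
where $g_{i,j} = .....$ Now we can make a polynomial series expansion of $(1-\frac{z}{g_i})^{-j}$ which is valid for all $|z|\le1$.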
Dominated convergence.
Suppose a sequence of functions $f_n(x)$ is such that pointwise $f_n(x)\to f(x)$ and for all $n$ and $x$, $|f_n(x)|\le g(x)$ with $\int g(x)dx<\infty$, then $\int f_n(x)dx\to\int f(x)dx$ as $n\to\infty$.
We use this result all over the place to exchange infinite sums and expectations. For example, if $\sum_{j=1}^{\infty}|a_j|E|Z_j|<\infty$, then
\[ E\Big(\sum_{j=1}^{\infty}a_jZ_j\Big) = \sum_{j=1}^{\infty}a_jE(Z_j). \]
Dominated convergence can be used to prove the following lemma. A more hands-on proof is given below the lemma.
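Lemma A.1.1 Suppose $\sum_{k=-\infty}^{\infty}|c(k)|<\infty$, then we have
\[ \frac{1}{n}\sum_{k=-(n-1)}^{(n-1)}|kc(k)|\to0 \]
as $n\to\infty$. Moreover, if $\sum_{k=-\infty}^{\infty}|kc(k)|<\infty$, then $\frac{1}{n}\sum_{k=-(n-1)}^{(n-1)}|kc(k)| = O(\frac{1}{n})$.
PROOF. The proof is straightforward in the case that $\sum_{k=-\infty}^{\infty}|kc(k)|<\infty$ (the second assertion); in this case $\sum_{k=-(n-1)}^{(n-1)}\frac{|k|}{n}|c(k)| = O(\frac{1}{n})$. The proof is slightly more tricky in the case that $\sum_{k=-\infty}^{\infty}|c(k)|<\infty$. First we note that since $\sum_{k=-\infty}^{\infty}|c(k)|<\infty$, for every $\varepsilon>0$ there exists an $N_\varepsilon$ such that for all $n\ge N_\varepsilon$, $\sum_{|k|\ge n}|c(k)|<\varepsilon$. Let us suppose that $n>N_\varepsilon$, then we have the bound
\[ \frac{1}{n}\sum_{k=-(n-1)}^{(n-1)}|kc(k)| \le \frac{1}{n}\sum_{k=-(N_\varepsilon-1)}^{(N_\varepsilon-1)}|kc(k)| + \frac{1}{n}\sum_{N_\varepsilon\le|k|\le n}|kc(k)| \le \frac{1}{n}\sum_{k=-(N_\varepsilon-1)}^{(N_\varepsilon-1)}|kc(k)| + \varepsilon. \]
Hence if we keep $N_\varepsilon$ fixed we see that $\frac{1}{n}\sum_{k=-(N_\varepsilon-1)}^{(N_\varepsilon-1)}|kc(k)|\to0$ as $n\to\infty$. Since this is true for all $\varepsilon$ (for different thresholds $N_\varepsilon$) we obtain the required result.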
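The two assertions of Lemma A.1.1 can be illustrated numerically. The sketch below is not part of the proof; the two sequences are hypothetical examples chosen for illustration: a geometric sequence, for which $\sum|kc(k)|<\infty$, and $c(k)=1/(1+k^2)$, which is absolutely summable but has $\sum|kc(k)|=\infty$.

```python
def weighted_average(c, n):
    """(1/n) * sum over |k| <= n-1 of |k * c(k)|, for a symmetric sequence c(k) = c(-k)."""
    return 2.0 * sum(k * abs(c(k)) for k in range(1, n)) / n

def geometric(k):
    # sum |k c(k)| is finite, so the average should be O(1/n)
    return 0.8 ** abs(k)

def quadratic(k):
    # sum |c(k)| is finite but sum |k c(k)| diverges; the average still tends to zero
    return 1.0 / (1.0 + k * k)

sizes = [10, 100, 1000, 10000]
geometric_values = [weighted_average(geometric, n) for n in sizes]
quadratic_values = [weighted_average(quadratic, n) for n in sizes]

for n, g, q in zip(sizes, geometric_values, quadratic_values):
    # n * value stays bounded in the first case and grows slowly in the second
    print(n, g, n * g, q, n * q)
```

In both cases the average decreases to zero, but only for the geometric sequence does $n$ times the average remain bounded, matching the distinction the lemma draws.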
Cauchy-Schwarz inequality.
In terms of sequences it is
\[ \Big|\sum_{j=1}^{\infty}a_jb_j\Big| \le \Big(\sum_{j=1}^{\infty}a_j^2\Big)^{1/2}\Big(\sum_{j=1}^{\infty}b_j^2\Big)^{1/2}. \]
Hölder's inequality.
This is a generalisation of the Cauchy-Schwarz inequality. It states that if $1\le p,q\le\infty$ and $p^{-1}+q^{-1}=1$, then
\[ E|XY| \le E(|X|^p)^{1/p}E(|Y|^q)^{1/q}. \]
A similar result is true for sequences too.
Martingale differences. Let $\mathcal{F}_t$ be a sigma-algebra, where $X_t,X_{t-1},\ldots\in\mathcal{F}_t$. Then $\{X_t\}$ is a sequence of martingale differences if $E(X_t|\mathcal{F}_{t-1})=0$.
Minkowski's inequality.
If $1<p<\infty$, then
\[ \Big(E\Big|\sum_{i=1}^{n}X_i\Big|^p\Big)^{1/p} \le \sum_{i=1}^{n}\big(E(|X_i|^p)\big)^{1/p}. \]
Doob's inequality.
This inequality concerns sums of martingale differences. Let $S_n = \sum_{t=1}^{n}X_t$, then
\[ E\Big(\sup_{n\le N}|S_n|^2\Big) \le 4E(S_N^2). \]
Burkholder's inequality.
Suppose that $\{X_t\}$ are martingale differences and define $S_n = \sum_{k=1}^{n}X_k$. For any $p\ge2$ we have
\[ \{E(|S_n|^p)\}^{1/p} \le \Big(2p\sum_{k=1}^{n}E(|X_k|^p)^{2/p}\Big)^{1/2}. \]
An application is to the case that $\{X_t\}$ are identically distributed random variables; then we have the bound $E(|S_n|^p)\le E(|X_0|^p)(2p)^{p/2}n^{p/2}$.
It is worth noting that the Burkholder inequality can also be defined for $p<2$ (see Davidson (1994), page 242). It can also be generalised to random variables $\{X_t\}$ which are not necessarily martingale differences (see Dedecker and Doukhan (2003)).
Riemann-Stieltjes Integrals.
In basic calculus we often use the basic definition of the Riemann integral, $\int g(x)f(x)dx$, and if the function $F(x)$ is continuous and $F'(x)=f(x)$, we can write $\int g(x)f(x)dx = \int g(x)dF(x)$. There are several instances where we need to broaden this definition to include functions $F$ which are not continuous everywhere. To do this we define the Riemann-Stieltjes integral, which coincides with the Riemann integral in the case that $F(x)$ is continuous.
$\int g(x)dF(x)$ is defined in a slightly different way to the Riemann integral $\int g(x)f(x)dx$. Let us first consider the case that $F(x)$ is the step function $F(x) = \sum_{i=1}^{n}a_iI_{[x_{i-1},x_i]}$, then $\int g(x)dF(x)$ is defined as $\int g(x)dF(x) = \sum_{i=1}^{n}(a_i-a_{i-1})g(x_i)$ (with $a_0=0$). Already we see the advantage of this definition, since the derivative of the step function is not well defined at the jumps. As most functions can be written as the limit of step functions ($F(x) = \lim_{k\to\infty}F_k(x)$, where $F_k(x) = \sum_{i=1}^{n_k}a_{i,n_k}I_{[x_{i_{k-1}},x_{i_k}]}$), we define $\int g(x)dF(x) = \lim_{k\to\infty}\sum_{i=1}^{n_k}(a_{i,n_k}-a_{i-1,n_k})g(x_{i_k})$.
In statistics, the function $F$ will usually be non-decreasing and bounded. We call such functions distributions.
Theorem A.1.1 (Helly's Theorem) Suppose that $\{F_n\}$ are a sequence of distributions with $F_n(-\infty)=0$ and $\sup_nF_n(\infty)\le M<\infty$. There exists a distribution $F$, and a subsequence $F_{n_k}$ such that for each $x\in\mathbb{R}$, $F_{n_k}(x)\to F(x)$ and $F$ is right continuous.
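A.2 Martingales
Definition A.2.1 A sequence $\{X_t\}$ is said to be a martingale difference if $E[X_t|\mathcal{F}_{t-1}]=0$, where $\mathcal{F}_{t-1} = \sigma(X_{t-1},X_{t-2},\ldots)$. In other words, the best predictor of $X_t$ given the past is simply zero.
Martingales are very useful when proving several results, including central limit theorems.
Martingales arise naturally in several situations. We now show that if the correct likelihood is used (not the quasi-case), then the gradient of the conditional log likelihood evaluated at the true parameter is the sum of martingale differences. To see why, let $B_T(\theta) = \sum_{t=2}^{T}\log f_\theta(X_t|X_{t-1},\ldots,X_1)$ and
\[ C_T(\theta) = \sum_{t=2}^{T}\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}. \]
$C_T(\theta_0)$ is a sum of martingale differences if
\[ E\Big(\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big\rfloor_{\theta=\theta_0}\Big|X_{t-1},X_{t-2},\ldots,X_1\Big) = 0, \]
we will show this. Rewriting the above in terms of integrals and exchanging derivative with integral we have
\[ E\Big(\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big\rfloor_{\theta=\theta_0}\Big|X_{t-1},X_{t-2},\ldots,X_1\Big) = \int\frac{\partial\log f_\theta(x_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big\rfloor_{\theta=\theta_0}f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)dx_t \]
\[ = \int\frac{1}{f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)}\frac{\partial f_\theta(x_t|X_{t-1},\ldots,X_1)}{\partial\theta}\Big\rfloor_{\theta=\theta_0}f_{\theta_0}(x_t|X_{t-1},\ldots,X_1)dx_t = \frac{\partial}{\partial\theta}\int f_\theta(x_t|X_{t-1},\ldots,X_1)dx_t\Big\rfloor_{\theta=\theta_0} = 0. \]
Therefore $\{\frac{\partial\log f_\theta(X_t|X_{t-1},\ldots,X_1)}{\partial\theta}\rfloor_{\theta=\theta_0}\}_t$ are a sequence of martingale differences and $C_T(\theta_0)$ is the sum of martingale differences.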
A.3
The Fourier transform is a commonly used tool. We recall that $\{\exp(2\pi ij\omega); j\in\mathbb{Z}\}$ is an orthogonal basis of the space $L_2[0,1]$. In other words, if $f\in L_2[0,1]$ (i.e., $\int_0^1|f(\omega)|^2d\omega<\infty$) then
\[ f_n(u) = \sum_{j=-n}^{n}c_je^{iju2\pi}, \qquad c_j = \int_0^1f(u)\exp(-i2\pi ju)du, \]
where
\[ f(u) = \sum_{j\in\mathbb{Z}}c_je^{i2\pi ju} \]
is understood as the $L_2$ limit of $f_n$.
An important property is that $f(u)\equiv$ constant iff $c_j=0$ for all $j\ne0$. Moreover, for all $n\in\mathbb{Z}$, $f(u+n)=f(u)$ (hence $f$ is periodic).
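Some relations:
(i) Discrete Fourier transforms of finite sequences
It is straightforward to show (by using the property $\sum_{j=1}^{n}\exp(i2\pi jk/n)=0$ for $1\le k\le n-1$) that if
\[ d_k = \frac{1}{\sqrt{n}}\sum_{j=1}^{n}x_j\exp(i2\pi jk/n), \]
then $\{x_r\}$ can be recovered from $\{d_k\}$ through
\[ x_r = \frac{1}{\sqrt{n}}\sum_{k=1}^{n}d_k\exp(-i2\pi rk/n). \]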
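The inversion property in (i) can be verified directly. The following is a minimal sketch, assuming the symmetric $1/\sqrt{n}$ normalisation in both directions and the index ranges $j,k=1,\ldots,n$; the function names are hypothetical and the $O(n^2)$ loops are written for clarity, not speed.

```python
import cmath
import math

def dft(x):
    """d_k = n^(-1/2) * sum_{m=1}^n x_m exp(i 2 pi m k / n) for k = 1..n (x[0] holds x_1)."""
    n = len(x)
    return [sum(x[m - 1] * cmath.exp(1j * 2 * math.pi * m * k / n) for m in range(1, n + 1))
            / math.sqrt(n)
            for k in range(1, n + 1)]

def inverse_dft(d):
    """x_r = n^(-1/2) * sum_{k=1}^n d_k exp(-i 2 pi r k / n) for r = 1..n (d[0] holds d_1)."""
    n = len(d)
    return [sum(d[k - 1] * cmath.exp(-1j * 2 * math.pi * r * k / n) for k in range(1, n + 1))
            / math.sqrt(n)
            for r in range(1, n + 1)]

x = [1.0, -2.0, 0.5, 3.0, 4.0]          # arbitrary example data
recovered = inverse_dft(dft(x))
max_error = max(abs(a - b) for a, b in zip(x, recovered))
print(max_error)                          # should be at floating point rounding level
```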
1 X
xk exp(ik),
f () =
2 k=
where
R 2
0
f ()2 d =
f () exp(ik).
0
Z
aj bkj
j=
1 ak
2
A()B() exp(ik)d
=
0
j=
A()B( )d.
aj bj exp(ij) =
0
(A.1)
aj bj exp(ij) =
Z
X
r= 0
j=
Z Z
=
A(1 )B(2 )
exp(ir( 1 2 )) d1 d2
r=

Z
{z
= (1 +2 )
A()B( )d.
=
0
$\sum_{j=s+1}^{n}a_jb_{j-s}$ for all $s=0,\ldots,n-1$ in as few computing operations as possible. This is typically done via the DFT. An example in time series where this is useful is the calculation of the sample autocovariance function.
Suppose we have two sequences $a=(a_1,\ldots,a_n)$ and $b=(b_1,\ldots,b_n)$. Let $A_n(\omega_{k,n}) = \sum_{j=1}^{n}a_j\exp(ij\omega_{k,n})$ and $B_n(\omega_{k,n}) = \sum_{j=1}^{n}b_j\exp(ij\omega_{k,n})$ where $\omega_{k,n}=2\pi k/n$. It is straightforward to show that
\[ \frac{1}{n}\sum_{k=1}^{n}A_n(\omega_{k,n})\overline{B_n(\omega_{k,n})}\exp(-is\omega_{k,n}) = \sum_{j=s+1}^{n}a_jb_{j-s} + \sum_{j=1}^{s}a_jb_{j-s+n}, \]
this is very fast to compute (requiring only $O(n\log n)$ operations using first the FFT and then inverse FFT). The only problem is that we don't want the second term.
By padding the sequences and defining $A_n(\omega_{k,2n}) = \sum_{j=1}^{n}a_j\exp(ij\omega_{k,2n}) = \sum_{j=1}^{2n}a_j\exp(ij\omega_{k,2n})$, with $\omega_{k,2n}=2\pi k/2n$ (where we set $a_j=0$ for $j>n$) and analogously $B_n(\omega_{k,2n}) = \sum_{j=1}^{n}b_j\exp(ij\omega_{k,2n})$, we are able to remove the second term. Using the same calculations we have
\[ \frac{1}{2n}\sum_{k=1}^{2n}A_n(\omega_{k,2n})\overline{B_n(\omega_{k,2n})}\exp(-is\omega_{k,2n}) = \sum_{j=s+1}^{n}a_jb_{j-s} + \underbrace{\sum_{j=1}^{s}a_jb_{j-s+2n}}_{=0}. \]
This only requires $O(2n\log(2n))$ operations to compute the convolution for all $0\le s\le n-1$.
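The effect of padding can be seen on a small example. The sketch below is an illustration only: it evaluates the transforms by direct summation (an FFT routine would replace `transform` in practice), the function names are hypothetical, and the conjugate on $B_n$ and the sign of the exponent are the conventions assumed here. With `m = n` the wrap-around term contaminates the lagged sums; with `m = 2n` it vanishes.

```python
import cmath
import math

def transform(seq, m):
    """sum_{j=1}^{len(seq)} seq_j exp(i j omega_k) with omega_k = 2 pi k / m, for k = 1..m."""
    return [sum(v * cmath.exp(1j * 2 * math.pi * (idx + 1) * k / m) for idx, v in enumerate(seq))
            for k in range(1, m + 1)]

def lagged_products_via_dft(a, b, m):
    """(1/m) sum_k A(omega_k) conj(B(omega_k)) exp(-i s omega_k) for lags s = 0..n-1."""
    n = len(a)
    A, B = transform(a, m), transform(b, m)
    return [sum(A[k - 1] * B[k - 1].conjugate() * cmath.exp(-1j * 2 * math.pi * s * k / m)
                for k in range(1, m + 1)).real / m
            for s in range(n)]

def lagged_products_direct(a, b):
    """sum_{j=s+1}^{n} a_j b_{j-s} for s = 0..n-1, written with 0-based Python indices."""
    n = len(a)
    return [sum(a[j] * b[j - s] for j in range(s, n)) for s in range(n)]

a = [1.0, 2.0, 3.0, 4.0]                  # arbitrary positive example data
b = [0.5, 1.5, 2.5, 3.5]
n = len(a)

direct = lagged_products_direct(a, b)
circular = lagged_products_via_dft(a, b, n)       # no padding: includes the wrap-around term
padded = lagged_products_via_dft(a, b, 2 * n)     # frequencies 2 pi k / (2n): wrap-around removed

print(direct)
print(circular)
print(padded)
```

Evaluating the length-$n$ sums at the frequencies $2\pi k/(2n)$ is the same as zero-padding both sequences to length $2n$.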
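(v) The Poisson Summation Formula Suppose we do not observe the entire function and observe a sample from it, say $f_{t,n} = f(\frac{t}{n})$; we can use this to estimate the Fourier coefficient $c_j$ via the discrete Fourier transform
\[ c_{j,n} = \frac{1}{n}\sum_{t=1}^{n}f\Big(\frac{t}{n}\Big)\exp\Big(-ij\frac{2\pi t}{n}\Big). \]
The Poisson summation formula is
\[ c_{j,n} = c_j + \sum_{k=1}^{\infty}c_{j+kn} + \sum_{k=1}^{\infty}c_{j-kn}, \]
which follows by replacing $f(\frac{t}{n})$ with $\sum_{j\in\mathbb{Z}}c_je^{ij2\pi t/n}$. In other words,
\[ |c_{j,n}-c_j| \le \sum_{k=1}^{\infty}|c_{j+kn}| + \sum_{k=1}^{\infty}|c_{j-kn}|. \tag{A.2} \]
However, we cannot use this result in the case that $f$ is bounded and piecewise monotone; it can still be shown that
\[ |c_{j,n}-c_j| \le Cn^{-1}, \tag{A.3} \]
where $C$ is some finite constant; see Section 6.3, page 189, Briggs and Henson (1997).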
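The aliasing described by the Poisson summation formula is easy to see on a function with only a few non-zero Fourier coefficients. The sketch below uses a hypothetical choice of coefficients (`coefficients`, all others zero) and assumes the sign convention $f(u)=\sum_jc_je^{i2\pi ju}$; it compares the sampled coefficient $c_{j,n}$ with the sum of all $c_m$ for which $m-j$ is a multiple of $n$.

```python
import cmath
import math

# Hypothetical Fourier coefficients c_m of f; every other coefficient is zero.
coefficients = {0: 1.0, 1: 0.5, 5: 0.25, -3: 0.125}

def f(u):
    """f(u) = sum_m c_m exp(i 2 pi m u)."""
    return sum(c * cmath.exp(1j * 2 * math.pi * m * u) for m, c in coefficients.items())

def sampled_coefficient(j, n):
    """c_{j,n} = (1/n) sum_{t=1}^n f(t/n) exp(-i j 2 pi t / n)."""
    return sum(f(t / n) * cmath.exp(-1j * 2 * math.pi * j * t / n) for t in range(1, n + 1)) / n

def aliased_sum(j, n):
    """c_j plus all c_{j+kn} and c_{j-kn}: the coefficients whose index equals j modulo n."""
    return sum(c for m, c in coefficients.items() if (m - j) % n == 0)

coarse = sampled_coefficient(1, 4)    # indices 1, 5 and -3 all alias onto j = 1 when n = 4
fine = sampled_coefficient(1, 16)     # with n = 16 only index 1 contributes
print(coarse, aliased_sum(1, 4))
print(fine, aliased_sum(1, 16))
```

With the coarse grid the estimate of $c_1$ picks up $c_5$ and $c_{-3}$; refining the grid removes the aliased terms, which is why the error is controlled by the tail coefficients.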
A.4
There are two inequalities (one for $1<p\le2$), which is the following:
Theorem A.4.1 Suppose that Yk are martingale differences and that Sn =
Pn
j=1 Yk ,
then for
0<q2
ESn q 2
n
X
E(Xkq ),
(A.4)
j=1
C1 E
m
X
!p/2
Xi2
ESn p C2 E
i=1
m
X
!p/2
Xi2
(A.5)
i=1
An immediate consequence of the above for $p\ge2$ is the following corollary (by using Hölder's inequality):
Corollary A.4.1 Suppose {Si : Fi } is a martingale and 2 p < . Then there exists constants
C1 , C2 depending only on p such that
kSn kE
p
2/p
C2
m
X
!1/2
kXi2 kE
p/2
(A.6)
i=1
1/p
ESn p
C2 E
Xi2
i=1
E 1/2
m
X
2/p
= C2
Xi2
.
i=1
p/2
(A.7)
p 1/p
ESn 
2/p
C2
m
X
#1/2
kXi2 kE
p/2
(A.8)
i=1
We see the value of the above result in the following application. Suppose Sn =
1
n
Pn
k=1 Xk
and kXk kE
p K. Then we have
n
1X
Xk
n
!p
#p/2
n
1 2/p X
2 E
C
kXk kp/2
n 2
k=1
#p/2
" n
" n
#p/2
C2 X 2
1
C2 X
2 E
p
kXk kp/2
K
= O( p/2 ).
np
n
n
"
k=1
k=1
(A.9)
k=1
Below is the result that Moulines et al (2004) use (they call it the generalised Burkholder inequality); the proof can be found in Dedecker and Doukhan (2003). Note that it is for $p\ge2$.
Lemma A.4.1 Suppose {k : k = 1, 2, . . .} is a stochastic process which satisfies E(k ) = 0 and
E(pk ) < for some p 2. Let Fk = (k , k1 , . . .). Then we have that
1/2
s
s
s
X
E
X
X
.
k
2p
kk kE
kE(j Fk )kE
p
p
k=1
We note if
Ps
j=k
k=1
(A.10)
j=k
rate as (A.9).
But I think one can obtain something similar for $1\le p\le2$. I think the below is correct.
Lemma A.4.2 Suppose {k : k = 1, 2, . . .} is a stochastic process which satisfies E(k ) = 0 and
E(qk ) < for some 1 < q 2. Let Fk = (k , k1 , . . .). Further, we suppose that there exists a
0 < < 1, and 0 < K < such that kE(t Ftj )kq < Kj . Then we have that
E
s
X
ak k
k=1
K
1
s
X
k=1
!1/q
ak q
(A.11)
k =
(A.12)
j=0
ak k =
s X
Ps
k=1 ak k
[Ekj (k ) Ekj1 (k )] =
k=1 j=0
k=1
we obtain
s
X
X
j=0
!
[Ekj (k ) Ekj1 (k )] .
(A.13)
k=1
Keeping j constant, we see that {Ekj (k ) Ekj1 (k )}k is a martingale sequence. Hence
Ps
k=1 [Ekj (k ) Ekj1 (k )] is the sum of martingale differences. This implies we can apply
(A.4) to (A.13), and get
s
E
s
E
X
X
X
ak k
j=0
X
j=0
k=1
s
X
!1/q
ak (kEkj (k )
q
Ekj1 (k )kE
q )
k=1
j
Under the stated assumption kEkj (k ) Ekj1 (k )kE
q 2K . Substituting this inequality into
X
X
X
X
q
j q
1+1/q
ak k
2
ak  (2K )
2
K
j
k=1
j=0
j=0
k=1
A.5
s
X
!1/q
ak q
k=1
The Discrete Fourier transform is used widely in several disciplines. Even in areas where its use may not be immediately obvious (such as inverting Toeplitz matrices) it is still used because it can be evaluated in a speedy fashion using what is commonly called the fast Fourier transform (FFT). It is an algorithm which reduces the number of computing operations required to compute the Fourier transform of a sequence of data. Given that we are in the age of big data it is useful to learn what one of the most popular computing algorithms since the 60s actually does.
Recalling the notation in Section 8.2.2 the Fourier transform is the linear transformation
\[ F_n\underline{X}_n = (J_n(\omega_0),\ldots,J_n(\omega_{n-1})). \]
If this is done without using any tricks it requires $O(n^2)$ computing operations. By using some neat factorizations, the FFT reduces this to $n\log n$ computing operations.
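To prove this result we will ignore the standardization factor $(2\pi n)^{-1/2}$ and consider just the Fourier transform
\[ d(\omega_{k,n}) = \underbrace{\sum_{t=1}^{n}x_t\exp(it\omega_{k,n})}_{k\text{ different frequencies}}, \]
where $\omega_{k,n} = \frac{2\pi k}{n}$. Here we consider the proof for general $n$; later in Example A.5.1 we consider the specific case that $n=2^m$. Let us assume that $n$ is not a prime (if it is then we simply pad the vector with one zero and increase the length to $n+1$), then it can be factorized as $n=pq$. Using these factors we write $t$ as $t = t_1p + (t\bmod p)$ where $t_1$ is some integer value that lies between $0$ and $q-1$ and $t_0 = t\bmod p$ lies between $0$ and $p-1$. Substituting this into $d(\omega_k)$ gives
\[ d(\omega_k) = \sum_{t=1}^{n}x_t\exp[i(t_1p+t\bmod p)\omega_{k,n}] = \sum_{t_0=0}^{p-1}\sum_{t_1=0}^{q-1}x_{t_1p+t_0}\exp[i(t_1p+t_0)\omega_{k,n}] = \sum_{t_0=0}^{p-1}\exp[it_0\omega_{k,n}]\sum_{t_1=0}^{q-1}x_{t_1p+t_0}\exp[it_1p\omega_{k,n}]. \]
Since $t_1p\omega_{k,n} = \frac{2\pi t_1pk}{n} = \frac{2\pi t_1k}{q} = t_1\omega_{k,q}$ and $\exp(it_1\omega_{k,q}) = \exp(it_1\omega_{k\bmod q,q})$, we have
\[ d(\omega_k) = \sum_{t_0=0}^{p-1}\exp[it_0\omega_{k,n}]\underbrace{\sum_{t_1=0}^{q-1}x_{t_1p+t_0}\exp[it_1\omega_{k\bmod q,q}]}_{=A(t_0,k\bmod q)} = \sum_{t_0=0}^{p-1}\exp[it_0\omega_{k,n}]A(t_0,k_0), \]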
where $k_0 = k\bmod q$ can take values from $0,\ldots,q-1$. Thus to evaluate $d(\omega_k)$ we need to evaluate $A(t_0,k\bmod q)$ for $0\le t_0\le p-1$, $0\le k_0\le q-1$. To evaluate $A(t_0,k\bmod q)$ requires $q$ computing operations; to evaluate it for all $t_0$ and $k\bmod q$ requires $pq^2$ operations. Note, the key is that fewer frequencies need to be evaluated when calculating $A(t_0,k\bmod q)$, in particular $q$ frequencies rather than $n$. After evaluating $\{A(t_0,k_0); 0\le t_0\le p-1, 0\le k_0\le q-1\}$ we then need to take the Fourier transform of this over $t_0$ to evaluate $d(\omega_k)$, which is $p$ operations, and this needs to be done $n$ times (to get all $\{d(\omega_k)\}_k$); this leads to $np$. Thus in total this leads to
\[ \underbrace{pq^2}_{\text{evaluation of all }A} + np = pq^2 + pn = n(q+p). \tag{A.14} \]
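The two-stage evaluation and the operation count in (A.14) can be sketched as follows. This is a minimal illustration, assuming the data are indexed $t=0,\ldots,n-1$ (so that $t=t_1p+t_0$ covers every index exactly once) and counting one operation per term of each sum; the function names are hypothetical.

```python
import cmath
import math

def dft_direct(x):
    """d(omega_k) = sum_{t=0}^{n-1} x_t exp(i t 2 pi k / n) for k = 0..n-1, by direct summation."""
    n = len(x)
    return [sum(x[t] * cmath.exp(1j * 2 * math.pi * t * k / n) for t in range(n))
            for k in range(n)]

def dft_two_stage(x, p, q):
    """Evaluate the same transform for n = p*q through the inner transforms A(t0, k0)."""
    n = p * q
    if len(x) != n:
        raise ValueError("len(x) must equal p * q")
    operations = 0
    # Stage 1: A(t0, k0) = sum_{t1=0}^{q-1} x_{t1*p + t0} exp(i t1 2 pi k0 / q)
    inner = [[0j] * q for _ in range(p)]
    for t0 in range(p):
        for k0 in range(q):
            inner[t0][k0] = sum(x[t1 * p + t0] * cmath.exp(1j * 2 * math.pi * t1 * k0 / q)
                                for t1 in range(q))
            operations += q
    # Stage 2: d(omega_k) = sum_{t0=0}^{p-1} exp(i t0 2 pi k / n) A(t0, k mod q)
    result = []
    for k in range(n):
        result.append(sum(cmath.exp(1j * 2 * math.pi * t0 * k / n) * inner[t0][k % q]
                          for t0 in range(p)))
        operations += p
    return result, operations

x = [0.3, -1.2, 2.5, 0.7, 1.1, -0.4, 3.0, 0.0, -2.2, 1.6, 0.9, -0.8]   # arbitrary data, n = 12
p, q = 3, 4
two_stage, operations = dft_two_stage(x, p, q)
direct = dft_direct(x)
max_error = max(abs(u - v) for u, v in zip(two_stage, direct))
print(max_error, operations, len(x) * (p + q), len(x) ** 2)
```

The counter reproduces $pq^2+np=n(p+q)$, to be compared with $n^2$ for the direct sum.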
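\[ A(t_0,k_0) = \sum_{t_1=0}^{q-1}x_{t_1p+t_0}\exp[it_1\omega_{k_0,q}]. \]
Therefore, we can use the same method as was used above to reduce this number. To do this we need to factorize $q$ into $q=p_1q_1$ and using the above method we can write this as
\[ A(t_0,k_0) = \sum_{t_2=0}^{p_1-1}\sum_{t_3=0}^{q_1-1}x_{(t_2+t_3p_1)p+t_0}\exp[i(t_2+t_3p_1)\omega_{k_0,q}] = \sum_{t_2=0}^{p_1-1}\exp[it_2\omega_{k_0,q}]\sum_{t_3=0}^{q_1-1}x_{(t_2+t_3p_1)p+t_0}\exp[it_3p_1\omega_{k_0,q}] \]
\[ = \sum_{t_2=0}^{p_1-1}\exp[it_2\omega_{k_0,q}]\sum_{t_3=0}^{q_1-1}x_{(t_2+t_3p_1)p+t_0}\exp[it_3\omega_{k_0\bmod q_1,q_1}]. \]
We note that $k_0\bmod q_1 = (k\bmod(p_1q_1))\bmod q_1 = k\bmod q_1$; substituting this into the above we have
\[ A(t_0,k_0) = \sum_{t_2=0}^{p_1-1}\exp[it_2\omega_{k_0,q}]\underbrace{\sum_{t_3=0}^{q_1-1}x_{(t_2+t_3p_1)p+t_0}\exp[it_3\omega_{k\bmod q_1,q_1}]}_{=A(t_0,t_2,k\bmod q_1)} = \sum_{t_2=0}^{p_1-1}\exp[it_2\omega_{k_0,q}]A(t_0,t_2,k\bmod q_1). \]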
Thus we see that $q_1$ computing operations are required to calculate $A(t_0,t_2,k_0\bmod q_1)$, and to calculate $A(t_0,t_2,k\bmod q_1)$ for all $0\le t_2\le p_1-1$ and $0\le k\bmod q_1\le q_1-1$ requires in total $q_1^2p_1$ computing operations. After evaluating $\{A(t_0,t_2,k_0\bmod q_1); 0\le t_2\le p_1-1, 0\le k\bmod q_1\le q_1-1\}$ we then need to take its Fourier transform over $t_2$ to evaluate $A(t_0,k_0)$, which is $p_1$ operations. Thus in total to evaluate $A(t_0,k_0)$ over all $k_0$ we require $q_1^2p_1+p_1q$ operations. Thus we have reduced the number of computing operations for $A(t_0,k_0)$ from $q^2$ to $q(p_1+q_1)$; substituting this into (A.14) gives the total number of computing operations to calculate $\{d(\omega_k)\}$
\[ pq(p_1+q_1)+np = n(p+p_1+q_1). \]
In general the same idea can be used to show that given the prime factorization $n=\prod_{s=1}^{m}p_s$, the number of computing operations to calculate the DFT is $n(\sum_{s=1}^{m}p_s)$.
\[ d(\omega_k) = \sum_{t=1}^{n}x_t\exp(it\omega_k) = \sum_{t=1}^{n/2}X_{2t}\exp(i2t\omega_k) + \sum_{t=0}^{(n/2)-1}X_{2t+1}\exp(i(2t+1)\omega_k) \]
\[ = \sum_{t=1}^{n/2}X_{2t}\exp(i2t\omega_k) + \exp(i\omega_k)\sum_{t=0}^{(n/2)-1}X_{2t+1}\exp(i2t\omega_k) = A(0,k\bmod(n/2)) + \exp(i\omega_k)A(1,k\bmod(n/2)), \]
since $\sum_{t=1}^{n/2}X_{2t}\exp(i2t\omega_k)$ and $\sum_{t=0}^{(n/2)-1}X_{2t+1}\exp(i2t\omega_k)$ are the Fourier transforms of $\{X_t\}$ on a coarser scale, therefore we can only identify the frequencies on a coarser scale. It is clear from the above that the evaluation of $A(0,k\bmod(n/2))$ for $0\le k\bmod(n/2)\le n/2-1$ requires $(n/2)^2$ operations and the same for $A(1,k\bmod(n/2))$. Thus to evaluate both $A(0,k\bmod(n/2))$ and $A(1,k\bmod(n/2))$ requires $2(n/2)^2$ operations. Then taking the Fourier transform of these two terms over all $0\le k\le n-1$ is an additional $2n$ operations, leading to
\[ 2(n/2)^2+2n = n^2/2+2n \text{ operations} < n^2. \]
We can continue this argument and partition
\[ A(0,k\bmod(n/2)) = \sum_{t=1}^{n/2}X_{2t}\exp(i2t\omega_k) = \sum_{t=1}^{n/4}X_{4t}\exp(i4t\omega_k) + \exp(i2\omega_k)\sum_{t=0}^{(n/4)-1}X_{4t+2}\exp(i4t\omega_k). \]
Using the same argument as above the calculation of this term over all $k$ requires $2(n/4)^2+2(n/2) = n^2/8+n$ operations. The same decomposition applies to $A(1,k\bmod(n/2))$. Thus calculation of both terms over all $k$ requires $2[n^2/8+n] = n^2/4+2n$ operations. In total this gives
\[ (n^2/4+2n+2n) \text{ operations}. \]
Continuing this argument gives $mn = n\log_2n$ operations, which is the often cited rate.
Typically, if the sample size is not of order $2^m$, zeros are added to the end of the sequence (called padding) to increase the length to $2^m$.
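The even/odd splitting for $n=2^m$ leads to the familiar recursive radix-2 algorithm. The sketch below is a textbook-style illustration and not the notes' own implementation; it assumes 0-based indexing $t=0,\ldots,n-1$ and the sign convention $\exp(+it\omega_k)$, and the helper `pad_to_power_of_two` is a hypothetical name for the zero-padding step.

```python
import cmath
import math

def fft_radix2(x):
    """Recursive radix-2 FFT of sum_{t=0}^{n-1} x_t exp(i t 2 pi k / n); n must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])    # transform of x_0, x_2, ... at the n/2 coarser frequencies
    odd = fft_radix2(x[1::2])     # transform of x_1, x_3, ... at the n/2 coarser frequencies
    half = n // 2
    # d(omega_k) = A(0, k mod n/2) + exp(i omega_k) * A(1, k mod n/2)
    return [even[k % half] + cmath.exp(1j * 2 * math.pi * k / n) * odd[k % half]
            for k in range(n)]

def dft_direct(x):
    """The same transform by direct O(n^2) summation, used here only as a check."""
    n = len(x)
    return [sum(x[t] * cmath.exp(1j * 2 * math.pi * t * k / n) for t in range(n))
            for k in range(n)]

def pad_to_power_of_two(x):
    """Append zeros until the length is a power of two."""
    m = 1
    while m < len(x):
        m *= 2
    return list(x) + [0.0] * (m - len(x))

x = [1.0, 2.0, 3.0, 4.0, 0.0, -1.0, -2.0, 5.0]     # arbitrary data of length 2^3
fast = fft_radix2(x)
slow = dft_direct(x)
max_error = max(abs(u - v) for u, v in zip(fast, slow))
padded = pad_to_power_of_two([1.0, 2.0, 3.0])
print(max_error, len(padded))
```

Note that padding changes the grid of frequencies at which the transform is evaluated, from $2\pi k/n$ to $2\pi k/2^m$.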
Appendix B
Mixingales
In this section we prove some of the results stated in the previous sections using mixingales.
We first define a mixingale, noting that the definition we give is not the most general definition.
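Definition B.0.1 (Mixingale) Let $\mathcal{F}_t = \sigma(X_t,X_{t-1},\ldots)$; $\{X_t\}$ is called a mixingale if it satisfies
\[ \rho_{t,k} = \Big\{E\big(E(X_t|\mathcal{F}_{t-k})-E(X_t)\big)^2\Big\}^{1/2}, \]
where $\rho_{t,k}\to0$ as $k\to\infty$.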
(B.1)
j=0
m
X
E(Xt Ftk ) E(Xt Ftk1 ) + E(Xt Ftm1 ) E(Xt ) .
k=0
2
X
=
E(Xt Ftk ) E(Xt Ftk1 ) .
k=0
We observe that (B.1) resembles the Wold decomposition. The difference is that the Wold decomposition decomposes a stationary process into elements which are the errors in the best linear predictors, whereas the result above decomposes a process into sums of martingale differences.
It can be shown that functions of several ARCH-type processes are mixingales (where $\rho_{t,k}\le K\rho^k$ with $\rho<1$), and Subba Rao (2006) and Dahlhaus and Subba Rao (2007) used these properties to obtain the rate of convergence for various types of ARCH parameter estimators. In a series of papers, Wei Biao Wu considered properties of a general class of stationary processes which satisfied Definition B.0.1, where $\sum_{k=1}^{\infty}\rho_k<\infty$.
In Section B.2 we use the mixingale property to prove Theorem 10.7.3. This is a simple illustration of how useful mixingales can be. In the following section we give a result on the rate of convergence of some random variables.
B.1 Obtaining almost sure rates of convergence for some sums
The following lemma is a simple variant on a result proved in Moricz (1976), Theorem 6.
Lemma B.1.1 Let $\{S_T\}$ be a random sequence where $E(\sup_{1\le t\le T}|S_t|^2)\le\phi(T)$ and $\{\phi(t)\}$ is a monotonically increasing sequence where $\phi(2^j)/\phi(2^{j-1})\le K<\infty$ for all $j$. Then we have almost surely
\[ \frac{1}{T}S_T = O\Big(\frac{\sqrt{\phi(T)(\log T)(\log\log T)^{1+\delta}}}{T}\Big). \]
PROOF. The idea behind the proof is that we find a subsequence of the natural numbers and define a random variable on this subsequence. This random variable should dominate (in some sense) $S_T$. We then obtain a rate of convergence for the subsequence (you will see that for the subsequence it is quite easy by using the Borel-Cantelli lemma), which, due to the dominance, can be transferred over to $S_T$. We make this argument precise below.
Define the sequence $V_j = \sup_{t\le2^j}|S_t|$. Using Chebyshev's inequality we have
\[ P(V_j>\varepsilon) \le \frac{\phi(2^j)}{\varepsilon^2}. \]
Let $\varepsilon(t) = \sqrt{\phi(t)(\log\log t)^{1+\delta}\log t}$. It is clear that
\[ \sum_{j=1}^{\infty}P(V_j>\varepsilon(2^j)) \le \sum_{j=1}^{\infty}\frac{C\phi(2^j)}{\phi(2^j)(\log j)^{1+\delta}j} < \infty, \]
where $C$ is a finite constant. Now by Borel-Cantelli, this means that almost surely $V_j\le\varepsilon(2^j)$. Let us now return to the original sequence $S_T$. Suppose $2^{j-1}\le T\le2^j$, then by definition of $V_j$ we have
\[ \frac{S_T}{\varepsilon(T)} \le \frac{V_j}{\varepsilon(2^{j-1})} \overset{a.s.}{\le} \frac{\varepsilon(2^j)}{\varepsilon(2^{j-1})} < \infty \]
under the stated assumptions. Therefore almost surely we have $S_T = O(\varepsilon(T))$, which gives us the required result.
We observe that the above result resembles the law of iterated logarithms. The above result is a very simple and nice way of obtaining an almost sure rate of convergence. The main problem is obtaining bounds for $E(\sup_{1\le t\le T}|S_t|^2)$. There is one exception to this: when $S_t$ is the sum of martingale differences one can simply apply Doob's inequality, where $E(\sup_{1\le t\le T}|S_t|^2)\le4E(|S_T|^2)$. In the case that $S_T$ is not the sum of martingale differences it is not so straightforward. However if we can show that $S_T$ is the sum of mixingales then with some modifications a bound for $E(\sup_{1\le t\le T}|S_t|^2)$ can be obtained. We will use this result in the section below.
B.2 Proof of Theorem 10.7.3
t=r+1
(B.2)
(ii)
1
n
n
X
r
Xti Xtj = O(
t=max(i,j)
(B.3)
Pn
t=r+1 t Xtr
T
X
r
Xti Xtj = O(
t=max(i,j)
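However the proof is more complex, since $\{X_{t-i}X_{t-j}\}$ are not martingale differences and we cannot directly use Doob's inequality. However by showing that $\{X_{t-i}X_{t-j}\}$ is a mixingale we can still show the result.
To prove the result let $\mathcal{F}_t = \sigma(X_t,X_{t-1},\ldots)$ and $\mathcal{G}_t = \sigma(X_{t-i}X_{t-j},X_{t-1-i}X_{t-1-j},\ldots)$. We observe that $\mathcal{G}_t\subset\mathcal{F}_{t-\min(i,j)}$.
Lemma B.2.1 Let $\mathcal{F}_t = \sigma(X_t,X_{t-1},\ldots)$ and suppose $X_t$ comes from an ARMA process, where the roots are greater than $1+\delta$. Then if $E(\varepsilon_t^4)<\infty$ we have
\[ E\big(E(X_{t-i}X_{t-j}|\mathcal{F}_{t-\min(i,j)-k})-E(X_{t-i}X_{t-j})\big)^2 \le C\rho^k. \]
\[ E(X_{t-i}X_{t-j}|\mathcal{F}_{t-k-\min(i,j)})-E(X_{t-i}X_{t-j}) = \sum_{j_1,j_2=0}^{\infty}a_{j_1}a_{j_2}\big(E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2}|\mathcal{F}_{t-k-\min(i,j)})-E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2})\big). \]
Now in the case that $t-i-j_1>t-k-\min(i,j)$ and $t-j-j_2>t-k-\min(i,j)$, $E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2}|\mathcal{F}_{t-k-\min(i,j)}) = E(\varepsilon_{t-i-j_1}\varepsilon_{t-j-j_2})$. Now by considering when $t-i-j_1\le t-k-\min(i,j)$ or $t-j-j_2\le t-k-\min(i,j)$ we have the result.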
(B.4)
n
X
Vt,k
(B.5)
k=0 t=min(i,j)
where Vt,k = E(Xti Xtj Ftkmin(i,j) ) E(Xti Xtj Ftkmin(i,j)1 ), are martingale differences.
2 ) Kk and
(ii) Furthermore E(Vt,k
E
s
X
sup
min(i,j)sn
{Xti Xtj E(Xti Xtj )})2 Kn,
(B.6)
t=min(i,j)
s
X
sup
min(i,j)sn
= E
sup
min(i,j)sn
= E
t=min(i,j)
s
X
X
Vt,k
k=0
(B.7)
2
k=0 t=min(i,j)
sup
s
X
Vt,k1
k1 =0 k2 =0 min(i,j)sn t=min(i,j)
X
=
E
sup
s
X
2
Vt,k1
s
X
Vt,k2
t=min(i,j)
1/2 2
min(i,j)sn t=min(i,j)
Now we see that {Vt,k }t = {E(Xti Xtj Ftkmin(i,j) ) E(Xti Xtj Ftkmin(i,j)1 )}t , therefore
Ps
{Vt,k }t are also martingale differences. Hence we can apply Doobs inequality to E supmin(i,j)sn
t=min(i,j) Vt,k
s
X
sup
min(i,j)sn
Vt,k
2
t=min(i,j)
n
X
Vt,k
t=min(i,j)
2
n
X
2
E(Vt,k
) K nk .
t=min(i,j)
sup
min(i,j)sn
s
X
{Xti Xtj E(Xti Xtj )})2 Kn.
t=min(i,j)
Bibliography
Hong-Zhi An, Zhao-Guo Chen, and E.J. Hannan. Autocorrelation, autoregression and autoregressive approximation. Ann. Statist., 10:926-936, 1982.
A. Aue, L. Horváth, and J. Steinbach. Estimation in random coefficient autoregressive models. Journal of Time Series Analysis, 27:61-76, 2006.
K. I. Beltrao and P. Bloomfield. Determining the bandwidth of a kernel spectrum estimate. Journal of Time Series Analysis, 8:23-38, 1987.
I. Berkes, L. Horváth, and P. Kokoszka. GARCH processes: Structure and estimation. Bernoulli, 9:201-227, 2003.
I. Berkes, L. Horváth, P. Kokoszka, and Q. Shao. On discriminating between long range dependence and changes in mean. Ann. Statist., 34:1140-1165, 2006.
R.N. Bhattacharya, V.K. Gupta, and E. Waymire. The Hurst effect under trend. J. Appl. Probab., 20:649-662, 1983.
P. Billingsley. Probability and Measure. Wiley, New York, 1995.
T. Bollerslev. Generalized autoregressive conditional heteroscedasticity. J. Econometrics, 31:301-327, 1986.
P. Bougerol and N. Picard. Stationarity of GARCH processes and some nonnegative time series. J. Econometrics, 52:115-127, 1992a.
P. Bougerol and N. Picard. Strict stationarity of generalised autoregressive processes. Ann. Probab., 20:1714-1730, 1992b.
G. E. P. Box and G. M. Jenkins. Time Series Analysis, Forecasting and Control. Cambridge University Press, Oakland, 1970.
A. Brandt. The stochastic equation $Y_{n+1} = A_nY_n + B_n$ with stationary coefficients. Adv. in Appl. Probab., 18:211-220, 1986.
W.L. Briggs and V. E. Henson. The DFT: An Owner's Manual for the Discrete Fourier Transform. SIAM, Philadelphia, 1997.
D.R. Brillinger. Time Series: Data Analysis and Theory. SIAM Classics, 2001.
P. Brockwell and R. Davis. Time Series: Theory and Methods. Springer, New York, 1998.
W. W. Chen, C. Hurvich, and Y. Lu. On the correlation matrix of the discrete Fourier Transform and the fast solution of large Toeplitz systems for long memory time series. Journal of the American Statistical Association, 101:812-821, 2006.
R. Dahlhaus and D. Janas. A frequency domain bootstrap for ratio statistics in time series analysis. Ann. Statist., 24:1934-1963, 1996.
R. Dahlhaus and S. Subba Rao. A recursive online algorithm for the estimation of time-varying ARCH parameters. Bernoulli, 13:389-422, 2007.
J. Davidson. Stochastic Limit Theory. Oxford University Press, Oxford, 1994.
J. Dedecker and P. Doukhan. A new covariance inequality. Stochastic Processes and their Applications, 106:63-80, 2003.
R. Douc, E. Moulines, and D. Stoffer. Nonlinear Time Series: Theory, Methods and Applications with R Examples. Chapman and Hall, 2014.
Y. Dwivedi and S. Subba Rao. A test for second order stationarity based on the discrete Fourier transform. Journal of Time Series Analysis, 32:68-91, 2011.
R. Engle. Autoregressive conditional heteroscedasticity with estimates of the variance of the United Kingdom inflation. Econometrica, 50:987-1006, 1982.
J. C. Escanciano and I. N. Lobato. An automatic Portmanteau test for serial correlation. Journal of Econometrics, 151:140-149, 2009.
J. Fan and Q. Yao. Nonlinear time series: Nonparametric and parametric methods. Springer, Berlin, 2003.
W. Fuller. Introduction to Statistical Time Series. Wiley, New York, 1995.
C. W. J. Granger and A. P. Andersen. An introduction to Bilinear Time Series models. Vandenhoek and Ruprecht, Göttingen, 1978.
U. Grenander and G. Szegő. Toeplitz forms and Their applications. Univ. California Press, Berkeley, 1958.
P. Hall and C.C. Heyde. Martingale Limit Theory and its Application. Academic Press, New York, 1980.
E.J. Hannan and Rissanen. Recursive estimation of ARMA order. Biometrika, 69:81-94, 1982.
J. Hart. Kernel regression estimation with time series errors. Journal of the Royal Statistical Society, 53:173-187, 1991.
C. Jentsch and S. Subba Rao. A test for second order stationarity of multivariate time series. Journal of Econometrics, 2014.
D. A. Jones. Nonlinear autoregressive processes. Proceedings of the Royal Society (A), 360:71-95, 1978.
M. Rosenblatt and U. Grenander. Statistical Analysis of Stationary Time Series. Chelsea Publishing Co, 1997.
T. Mikosch. Elementary Stochastic Calculus With Finance in View. World Scientific, 1999.
T. Mikosch and C. Stărică. Is it really long memory we see in financial returns? In P. Embrechts, editor, Extremes and Integrated Risk Management, pages 149-168. Risk Books, London, 2000.
T. Mikosch and C. Stărică. Long-range dependence effects and ARCH modelling. In P. Doukhan, G. Oppenheim, and M.S. Taqqu, editors, Theory and Applications of Long Range Dependence, pages 439-459. Birkhäuser, Boston, 2003.
F. Moricz. Moment inequalities and the strong law of large numbers. Z. Wahrsch. verw. Gebiete, 35:298-314, 1976.
Gy. Terdik. Bilinear Stochastic Models and Related Problems of Nonlinear Time Series Analysis; A Frequency Domain Approach, volume 142 of Lecture Notes in Statistics. Springer Verlag, New York, 1999.
M. Vogt. Nonparametric regression for locally stationary time series. Annals of Statistics, 40:2601-2633, 2013.
A. M. Walker. On the estimation of a harmonic component in a time series with stationary independent residuals. Biometrika, 58:21-36, 1971.
P. Whittle. Gaussian estimation in stationary time series. Bulletin of the International Statistical Institute, 39:105-129, 1962.