
STA457 Notes

Shawn Unger and Associates


University of Toronto
Winter 2017

Contents

I Time Series

1 Intro
  1.1 Time Series
  1.2 Goal
  1.3 Objectives
  1.4 Time series data
  1.5 Time and Frequency
  1.6 Graphical Methods for Description
  1.7 Time Series Plot

2 Correlogram and Periodogram
  2.1 Autocovariance and Autocorrelation
    2.1.1 Definitions
    2.1.2 Properties
  2.2 Correlogram
    2.2.1 Short memory time series
    2.2.2 Long memory time series
    2.2.3 Partial Correlogram
  2.3 Periodogram
    2.3.1 Properties
    2.3.2 Computation
    2.3.3 Tapering
    2.3.4 Interpreting the Periodogram
    2.3.5 Transformation

3 Estimating Trends
  3.1 Concept
  3.2 Reg/Curve Fitting
  3.3 Filtering
  3.4 Differencing
    3.4.1 Seasonal Differencing

4 Stochastic Process
  4.1 Probability Theory
  4.2 Stochastic Process
    4.2.1 Probability Distribution
    4.2.2 Stationarity

5 Spectral Representation
  5.1 Application
  5.2 Important Stationary Stochastic Process
  5.3 General Result

6 ARIMA Models
  6.1 Autoregressive Process
    6.1.1 AR(1)
    6.1.2 AR(p)
  6.2 Stationarity Condition
  6.3 Autocovariance Function
    6.3.1 Yule-Walker Equations
    6.3.2 Applying Spectral Density Function
  6.4 Random Walk
  6.5 Unit Root AR(p) process

7 Filtering Theorem
  7.1 Application to Moving Average and Autoregressive Processes
    7.1.1 AR(p)
    7.1.2 ARMA Process (Autoregressive Moving Average Process)
  7.2 Properties
  7.3 Summary
  7.4 Prediction and Partial Autocorrelations
    7.4.1 ρ(s) vs π(s)
    7.4.2 Computing π(s)
    7.4.3 Kolmogorov's Formula

Part I
Time Series
1 Intro
1.1 Time Series
Observations occur in a temporal order induced by time.
e.g. daily/hourly stock prices; EEG potential every 2 seconds; monthly revenues of a business.
Univariate vs. multivariate time series: we concentrate on the univariate case. A multivariate example is panel (longitudinal) data, in which n individuals are followed over time.

1.2 Goal
The goal is to understand the structure of the process generating the time series data.
Data may also be indexed by space (as well as time); ideas from univariate TSA apply there too.

1.3 Objectives
1. Description
Understand, in crude terms, the structure of the data:
use graphical methods;
suggest possible models.
2. Modelling and inference
Use the data to build a model for past and future data; test the validity of the models and estimate the uncertainty of each parameter.
3. Prediction/Forecasting
Predict future values of the time series given past values; typically model-based prediction is used.
4. Control
Assume an input and an output process (a multivariate time series).

Example (geostatistics): observe a variable at some points in space (the x's); use the data to predict its value at an unobserved point (the o).

Example (control): input = interest rate, output = value of the dollar; use the interest rate to control the value of the dollar.

Process of TSA: description → modelling/inference → prediction or control.

1.4 Time series data


Time series are almost always observed at discrete points in time.
Typically we assume that we observe at equally spaced points; sometimes time is transformed to achieve this.
Economic data are often aggregated over months, but:
months have different lengths;
months have different numbers of weekdays and weekends;
certain feast dates (Easter, Ramadan) fall in different months from year to year.

1.5 Time and Frequency


1. Time domain (more straightforward)
Look at the time series as a function of time.
Example: autoregressive model of order 1,
$$x_t = \alpha x_{t-1} + \varepsilon_t,$$
i.e. value at time $t$ = unknown parameter $\times$ value at time $t-1$ + noise.

2. Frequency domain
Look at the time series as a function of frequency (or of periods of a certain length):
$$x_t = \sum_j A_j\cos(2\pi\omega_j t) + \sum_j B_j\sin(2\pi\omega_j t),$$
where the $A_j$ and $B_j$ are random variables and the $\omega_j$ are frequencies.

In practice the time-domain approach is more useful for modelling and prediction, but frequency-domain approaches can provide insight for building time-domain models.
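As an illustration, here is a minimal R sketch of the time-domain view: it simulates the AR(1) model above with an assumed $\alpha = 0.8$ (the seed, sample size and coefficient are choices for the example only) and plots $x_t$ against $t$.

set.seed(457)
x <- arima.sim(model = list(ar = 0.8), n = 200)   # simulate x_t = 0.8 x_{t-1} + noise
plot.ts(x, main = "Simulated AR(1), alpha = 0.8") # time-domain view: x_t as a function of t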

1.6 Graphical Methods for Description


Goal: describe the general behaviour of the time series and identify its memory type.
1. Short memory: the immediate past gives some information about the short-term future but essentially no information about the long-term future.
Special case: white noise, or a purely random process.
2. Long memory: the past gives significant information about the future.
e.g. seasonal/monthly trends (January 2015 is informative about January 2016/2017).

1.7 Time Series Plot


Time series: $\{x_1, x_2, \ldots, x_n\}$.
Time series plot: plot of $x_t$ vs $t$, obtained by plotting the points over $t$ and connecting $x_t$ with $x_{t+1}$.
But we need to be careful: the choice of scale or aspect ratio is very important (a bug, but also a feature).

2 Correlogram and Periodogram
Correlogram: the time-domain approach.
Periodogram: the frequency-domain approach.

2.1 Autocovariance and Autocorrelation


2.1.1 Definitions
We define the sample autocovariance function $\gamma(s)$, where $s$ is the lag, as
$$\gamma(s) = \frac{1}{n}\sum_{t=1}^{n-s}(x_t - \bar{x})(x_{t+s} - \bar{x}) = \frac{1}{n}\sum_{t=1+s}^{n}(x_t - \bar{x})(x_{t-s} - \bar{x}).$$
We define the sample autocorrelation as
$$\rho(s) = \frac{\gamma(s)}{\frac{1}{n}\sum_{t=1}^{n}(x_t - \bar{x})^2} = \frac{\gamma(s)}{\gamma(0)}.$$
This function is well defined for $s \in \mathbb{N}$ with $s \le n$, but there is not enough information for $s > n$.

2.1.2 Properties
1. $\rho(0) = 1$
2. $-1 \le \rho(s) \le 1$ for all $s$
3. $\rho(s) = \rho(-s)$ (even function)
4. For any constants $a_1, \ldots, a_n$, $\rho$ is positive semi-definite:
$$\sum_{t=1}^{n}\sum_{s=1}^{n}a_t a_s\,\rho(s-t) \ge 0$$
(this holds because we divide $\gamma(s)$ by $n$, not by $n - s$).

2.2 Correlogram
Plot $\rho(s)$ vs $s$ for $s \in \{0, \ldots, \text{max lag}\}$.
R function: acf
Horizontal lines at $\pm 2/\sqrt{n}$.
This allows us to consider whether the data have attributes of short- or long-term memory.
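A minimal R sketch of these quantities (the series is an assumed AR(1) simulation; any numeric vector would do): the hand-rolled gam() and rho() implement the definitions in 2.1.1, and acf() draws the correlogram with reference lines at approximately $\pm 2/\sqrt{n}$.

set.seed(457)
x   <- as.numeric(arima.sim(model = list(ar = 0.8), n = 200))   # assumed example series
n   <- length(x)
gam <- function(s) sum((x[1:(n - s)] - mean(x)) * (x[(1 + s):n] - mean(x))) / n  # gamma(s)
rho <- function(s) gam(s) / gam(0)                                               # rho(s)
rho(1); rho(2)                      # compare with acf(x, plot = FALSE)$acf[2:3]
acf(x, lag.max = 30)                # correlogram with dashed lines near +/- 2/sqrt(n)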

2.2.1 Short memory time series
$\rho(s)$ decays to 0 relatively quickly as $s$ increases.

2.2.2 Long memory time series


$\rho(s)$ decays to 0 very slowly (it will eventually go to zero, just slowly).
$\rho(s)$ may oscillate around 0.

2.2.3 Partial Correlogram


A refinement of the correlogram.
Idea: look at the correlation between $x_t$ and $x_{t+s}$ after adjusting for $x_{t+1}, \ldots, x_{t+s-1}$ (a partial correlation).

2.3 Periodogram
The primary motivation for the periodogram is identifying periodicity/cycles within the time series (TS).
Idea: given some frequency $\omega$ (period $= 1/\omega$), look at the correlation between $x_t$ and sinusoids with frequency $\omega$.
Assume a regression model for $x_t$:
$$x_t = \beta_0 + \beta_1\cos(2\pi\omega t) + \beta_2\sin(2\pi\omega t) + \varepsilon_t.$$
Assume $\omega$ is known, and use the least squares estimate with design matrix
$$A = \begin{pmatrix}1 & \cos(2\pi\omega) & \sin(2\pi\omega)\\ 1 & \cos(4\pi\omega) & \sin(4\pi\omega)\\ \vdots & \vdots & \vdots\\ 1 & \cos(2\pi n\omega) & \sin(2\pi n\omega)\end{pmatrix}$$
and
$$\hat\beta = \begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix} = (A^TA)^{-1}A^T\begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix}.$$
If $n$ is moderately large (i.e. not small), then
$$\hat\beta_1 \approx \frac{2}{n}\sum_t(x_t-\bar{x})\cos(2\pi\omega t) \qquad\text{and}\qquad \hat\beta_2 \approx \frac{2}{n}\sum_t(x_t-\bar{x})\sin(2\pi\omega t).$$
Thus the estimated amplitude of the sinusoid of frequency $\omega$ is $\sqrt{\hat\beta_1^2 + \hat\beta_2^2}$ (the square root of the sum of squares of the two estimates above).
The larger the amplitude, the more important the frequency.
Repeating this for $\omega$ between 0 and 1/2 gives the periodogram:
$$I(\omega) = \frac{1}{n}\left(\sum_{t=1}^{n}(x_t-\bar{x})\cos(2\pi\omega t)\right)^2 + \frac{1}{n}\left(\sum_{t=1}^{n}(x_t-\bar{x})\sin(2\pi\omega t)\right)^2 = \frac{1}{n}\left|\sum_{t=1}^{n}(x_t-\bar{x})e^{2\pi i\omega t}\right|^2.$$
The two expressions above give a fundamental formulation of the periodogram. Below we derive another:
$$I(\omega) = \frac{1}{n}\left|\sum_{t=1}^{n}(x_t-\bar{x})e^{2\pi i\omega t}\right|^2 = \frac{1}{n}\sum_{s}\sum_{t}(x_t-\bar{x})(x_s-\bar{x})e^{2\pi i\omega t}e^{-2\pi i\omega s} = \frac{1}{n}\sum_{s}\sum_{t}(x_t-\bar{x})(x_s-\bar{x})e^{2\pi i\omega(t-s)}.$$
Using the change of variables $v = t - s$ (with $t = t$) and summing along the diagonals,
$$I(\omega) = \frac{1}{n}\sum_{v=-(n-1)}^{n-1}\sum_{t=1}^{n-|v|}(x_t-\bar{x})(x_{t+|v|}-\bar{x})e^{2\pi i\omega v}$$
(the bounds are $v = -(n-1)$ to $n-1$)
$$= \sum_{v=-(n-1)}^{n-1}\gamma(v)e^{2\pi i\omega v} = \sum_{v=-(n-1)}^{n-1}\gamma(v)\cos(2\pi\omega v) = \gamma(0) + 2\sum_{v=1}^{n-1}\gamma(v)\cos(2\pi\omega v).$$
So $\gamma(0), \gamma(1), \ldots, \gamma(n-1)$ determine $I(\omega)$.

Note: the reverse is also true:
$$\gamma(s) = \int_0^1 I(\omega)\cos(2\pi\omega s)\,d\omega, \qquad\text{where } I(\omega) = I(1-\omega).$$

Even though $\gamma(s)$ and $I(\omega)$ carry the same information, we need both because:
1. $\gamma(s)$ and $I(\omega)$ highlight different features of a time series;
2. $\gamma(s)$ gives information about serial (temporal) dependence in the time series;
3. $I(\omega)$ tells us which frequencies are dominant within the time series: low vs high frequency, and identifying cycles in the data.
One caveat: the periodogram is very difficult to interpret because it is very noisy,
$$V(I(\omega)) \approx \left[E(I(\omega))\right]^2 \quad\text{for all } \omega,$$
so we need some means of smoothing.

2.3.1 Properties
1. $I(\omega)$ is completely determined by its values for $0 \le \omega \le \tfrac{1}{2}$ ($\tfrac{1}{2}$ is the Nyquist frequency).
2. $I(\omega) = I(-\omega)$.
3. $I(\omega) = I(\omega + k)$ for all $k \in \mathbb{Z}$.

Note: the definition of $I(\omega)$ often varies from book to book, e.g.
$$I(\omega) = \frac{1}{2\pi n}\left|\sum_{t=1}^{n}(x_t-\bar{x})e^{-i\omega t}\right|^2, \qquad 0 \le \omega \le \pi.$$

2.3.2 Computation
At which frequencies $\omega$ should we evaluate $I(\omega)$?
If $n$ is not too small, evaluate $I(\omega)$ at the Fourier frequencies $\omega_k = \frac{k-1}{n}$ for $k \in \{1, \ldots, [\tfrac{n}{2}]+1\}$.
If $n$ can be factored into a product of small prime numbers, i.e. $n = n_1 n_2 \cdots n_p$, then we can evaluate $I(\omega)$ at the Fourier frequencies very efficiently (the fast Fourier transform). Best case: $n = 2^p$.
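A minimal R sketch of this computation (the series is an assumed simulation): fft() evaluates the sum at the Fourier frequencies with a negative exponent, but the sign does not matter once the modulus is squared; spec.pgram() in base R packages the same computation.

set.seed(457)
x    <- as.numeric(arima.sim(model = list(ar = 0.8), n = 256))
n    <- length(x)
I    <- Mod(fft(x - mean(x)))^2 / n        # I(w_k) at the Fourier frequencies w_k = (k-1)/n
w    <- (0:(n - 1)) / n
keep <- w <= 0.5                           # only 0 <= w <= 1/2 (Nyquist) is needed
plot(w[keep], log(I[keep]), type = "h", xlab = "frequency", ylab = "log periodogram")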

2.3.3 Tapering
The periodogram can be improved by tapering:
$$I^{\star}(\omega) = \frac{1}{H}\left|\sum_{t=1}^{n}h_t(x_t-\bar{x})e^{2\pi i\omega t}\right|^2, \qquad\text{where } H = \sum_{t=1}^{n}h_t^2.$$
$\{h_t\}$ is called a taper; $\{h_t\}$ typically has a trapezoidal shape.

Reasons for tapering:
peaks in the tapered periodogram tend to be better defined;
tapering reduces leakage.
It may not always be useful, but it is good insurance.
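A minimal R sketch of tapering (the same assumed simulated series as before): spec.pgram() applies a split-cosine-bell taper to the stated proportion of the data at each end, so the two calls below give the raw and the tapered periodogram.

set.seed(457)
x       <- as.numeric(arima.sim(model = list(ar = 0.8), n = 256))
raw     <- spec.pgram(x, taper = 0,   plot = FALSE)   # untapered periodogram
tapered <- spec.pgram(x, taper = 0.1, plot = FALSE)   # 10% split-cosine-bell taper
plot(raw$freq, log(raw$spec), type = "l", xlab = "frequency", ylab = "log periodogram")
lines(tapered$freq, log(tapered$spec), col = "red")   # peaks tend to be better defined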

2.3.4 Interpreting the Periodogram
Always plot $\ln(I(\omega))$ or $\ln(I^{\star}(\omega))$ vs $\omega$:

Appearance of data and corresponding appearance of the periodogram:
Smooth (i.e. little local variation): $I(\omega)$ is larger for $\omega$ close to 0.
Wiggly (high values followed by low values): $I(\omega)$ is larger for $\omega$ close to $\tfrac{1}{2}$.
Random: $I(\omega)$ is more or less constant across $\omega$.
Periodic with period $p$: $I(\omega)$ has peaks at $\omega = t/p$, largest at $t = 1$ with smaller peaks for $t > 1$, for $\omega \le \tfrac{1}{2}$.

2.3.5 Transformation
Original scale or transformed scale?
The goal is to transform the data so that the local variability is more or less constant.
If the time series is positive and takes values over several orders of magnitude, then taking logs is a good idea.
e.g. financial data: it is almost always a good idea to take logs.

3 Estimating Trends
3.1 Concept
1. The trend itself may be very important to know.
2. Removing the trend allows us to find more interesting structure in the time series.
e.g. seasonal adjustment of a monthly time series $x_t$.
Model: $x_t = T_t + S_t + \varepsilon_t$, where
(a) $T_t$: trend (varies slowly over time)
(b) $S_t$: seasonal component ($S_t = S_{t+12}$)
(c) $\varepsilon_t$: irregular component.
Seasonally adjusted data (ideally):
$$x_t^{(adj)} = T_t + \varepsilon_t = x_t - S_t.$$
But we don't know $S_t$ and $T_t$, so we estimate them: find $\hat{T}_t$ and $\hat{S}_t$ so that $x_t^{(adj)} = x_t - \hat{S}_t$ (we do need $\hat{T}_t$ as well).

Methods
1. Regression/curve fitting
2. Filtering (i.e. running averages)
3. Differencing (removes trends)

Conceptual model: $x_t = f(t) + \varepsilon_t$ = trend + noise.

3.2 Reg/Curve Fitting


Assume a parametric model for $f$, e.g.
$$f(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \ldots + \beta_p t^p,$$
and estimate the parameters by minimizing
$$\sum_{t=1}^{n}(x_t - \beta_0 - \beta_1 t - \ldots - \beta_p t^p)^2.$$
Don't use this in practice unless you know what you are doing: the fit is very sensitive to small changes.

Non-parametric alternative:
e.g. loess (local polynomial regression), available in R.
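A minimal R sketch of the two approaches on an assumed series with a linear trend: lm() with poly() fits a polynomial trend, and loess() is the local polynomial regression mentioned above.

set.seed(457)
t <- 1:120
x <- 0.02 * t + as.numeric(arima.sim(model = list(ar = 0.5), n = 120))  # assumed trend + noise
poly.fit  <- lm(x ~ poly(t, 2))           # parametric: quadratic trend
loess.fit <- loess(x ~ t, span = 0.4)     # non-parametric: local polynomial regression
plot(t, x, type = "l")
lines(t, fitted(poly.fit),  col = "blue")
lines(t, fitted(loess.fit), col = "red")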

3.3 Filtering
Estimate $f(t)$ by a weighted average of the values around $t$:
$$\hat{f}(t) = \sum_{u=-r}^{r} c_u x_{t-u},$$
where $\sum_{u=-r}^{r} c_u = 1$ and $\{c_u\}$ is called a filter.
e.g. simple running average:
$$c_u = \frac{1}{2r+1} \text{ for } -r \le u \le r, \qquad \hat{f}(t) = \frac{\sum_{u=-r}^{r} x_{t-u}}{2r+1}.$$

Issues
1. Choice of filter and length of filter.
Simple approach: start with a simple running average and apply it several times, e.g. (iterating $k$ times; see the R sketch at the end of this subsection):
$$\hat{f}^{(1)}(t) = \tfrac{1}{3}\left(x_{t-1} + x_t + x_{t+1}\right) \quad (r = 1)$$
$$\hat{f}^{(2)}(t) = \tfrac{1}{3}\left(\hat{f}^{(1)}(t-1) + \hat{f}^{(1)}(t) + \hat{f}^{(1)}(t+1)\right)$$
$$\vdots$$
$$\hat{f}^{(k+1)}(t) = \tfrac{1}{3}\left(\hat{f}^{(k)}(t-1) + \hat{f}^{(k)}(t) + \hat{f}^{(k)}(t+1)\right)$$

2. How do we define $\hat{f}(t)$ near the endpoints?
$\hat{f}(t) = \sum_{u=-r}^{r} c_u x_{t-u}$ is only well defined for $t = r+1, \ldots, n-r$, so we lose $2r$ points ($r$ on either side).
How do we define $\hat{f}(t)$ for $t = 1, \ldots, r$ and $t = n-r+1, \ldots, n$, where the filter would need $x_t$ with $t \le 0$ or $t \ge n+1$?
e.g. use pseudo-observations:
$$x_{-r+1} = \cdots = x_0 = x_1, \qquad x_{n+1} = \cdots = x_{n+r} = x_n.$$

The two approaches, regression and filtering, share a general formulation: define an $n \times n$ matrix $A$ with
$$\begin{pmatrix}\hat{f}(1)\\ \vdots\\ \hat{f}(n)\end{pmatrix} = A \begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix}.$$
Filtering: the rows of $A$ contain the filtering coefficients.
Regression: $A$ is the projection ("hat") matrix of the regression.
Conditions which the matrix $A$ should satisfy:
e.g. if $x_1, \ldots, x_n$ are already smooth, then
$$\begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix} = A \begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix},$$
i.e. such an $x$ is an eigenvector of $A$ with eigenvalue $\lambda = 1$, and the eigenvectors corresponding to such eigenvalues should be smooth functions.

Minimal requirements:
1. First property: constants and linear trends pass through unchanged,
$$\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix} = A \begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix} \qquad\text{and}\qquad \begin{pmatrix}1\\ \vdots\\ n\end{pmatrix} = A \begin{pmatrix}1\\ \vdots\\ n\end{pmatrix}.$$
2. Second property: if
$$\begin{pmatrix}\hat{f}(1)\\ \vdots\\ \hat{f}(n)\end{pmatrix} = A \begin{pmatrix}x_1\\ \vdots\\ x_n\end{pmatrix}$$
then
$$\begin{pmatrix}\hat{f}(n)\\ \vdots\\ \hat{f}(1)\end{pmatrix} = A \begin{pmatrix}x_n\\ \vdots\\ x_1\end{pmatrix}$$
(such an $A$ is called a centro-symmetric matrix).

(This will come up again in seasonal adjustment.)
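Here is the R sketch referred to in issue 1 above (the series is an assumed trend-plus-noise example): stats::filter() applies the running-average filter, and applying it again gives the iterated filter; the NA values at the ends illustrate the endpoint problem of issue 2.

set.seed(457)
t <- 1:120
x <- 0.02 * t + as.numeric(arima.sim(model = list(ar = 0.5), n = 120))  # assumed series
c3    <- rep(1 / 3, 3)                        # simple running average, r = 1
fhat1 <- stats::filter(x, c3, sides = 2)      # f-hat^(1)(t); NA at one point on each end
fhat2 <- stats::filter(fhat1, c3, sides = 2)  # f-hat^(2)(t): the filter applied twice
plot(t, x, type = "l")
lines(t, fhat1, col = "blue")
lines(t, fhat2, col = "red")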

3.4 Differencing
Differencing is useful for removing trends (so that we can focus on other, more interesting, structure).

First differencing: for a time series $x_1, \ldots, x_n$ we define the differenced series
$$y_t = x_t - x_{t-1} = x_t - Bx_t = (1 - B)x_t = \nabla x_t,$$
where $B$ is the backshift operator, which shifts the index back by 1 (so $Bx_t = x_{t-1}$), and $\nabla = 1 - B$ is the difference operator.

Second differencing:
$$z_t = y_t - y_{t-1} = \nabla y_t = \nabla^2 x_t = (1 - B)^2 x_t = (1 - 2B + B^2)x_t = x_t - 2x_{t-1} + x_{t-2}.$$

Idea behind how differencing removes trends: suppose $x_t = f(t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \ldots + \beta_p t^p$. Then $\nabla^p x_t$ is constant.
e.g. $p = 1$: $x_t = \beta_0 + \beta_1 t$ and $x_{t-1} = \beta_0 + \beta_1(t-1)$, so $\nabla x_t = x_t - x_{t-1} = \beta_1$.

3.4.1 Seasonal Differencing

$\nabla_d = 1 - B^d$, so $\nabla_d x_t = x_t - x_{t-d}$.
e.g. monthly data: use $d = 12$ to remove seasonality.
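A minimal R sketch of the differencing operations, on an assumed monthly series with a linear trend and a period-12 seasonal component; diff() implements $\nabla$, $\nabla^2$ and $\nabla_{12}$.

set.seed(457)
t  <- 1:144
x  <- 0.05 * t + 2 * sin(2 * pi * t / 12) + rnorm(144)  # assumed: trend + seasonality + noise
d1 <- diff(x)                    # first difference  (1 - B) x_t, removes the linear trend
d2 <- diff(x, differences = 2)   # second difference (1 - B)^2 x_t
ds <- diff(x, lag = 12)          # seasonal difference (1 - B^12) x_t, removes the period-12 component
plot.ts(d1); plot.ts(ds)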

4 Stochastic Process
4.1 Probability Theory
The goal of time series analysis: descriptive methods → models → theory.

4.2 Stochastic Process


A stochastic process is a collection of random variables $\{x_t : t \in I\}$.
Examples of the index set $I$:
1. $I = \mathbb{R}$ or $I = [0, \infty) = \mathbb{R}^+$: $\{x_t\}$ is a continuous-time stochastic process.
2. $I = \{0, 1, 2, \ldots\}$ or e.g. $\{0, 4, 24, \ldots\}$ (not restricted to $\mathbb{Z}$): $\{x_t\}$ is a discrete-time stochastic process.
Convention: we write $\{x_t\}$ both for the stochastic process and for a realization of the stochastic process, i.e. a time series.

4.2.1 Probability Distribution


Consider $\{x_t : t \in I\}$.
The probability distribution of the stochastic process is determined by all possible finite-dimensional distributions.
That is, knowledge of $P[x_{t_1} \le a_1, \ldots, x_{t_p} \le a_p]$ for all $t_1, \ldots, t_p \in I$ and all $a_1, \ldots, a_p$ tells us everything about the distribution of $\{x_t : t \in I\}$.

4.2.2 Stationarity

$\{x_t : t \in I\}$ is strictly stationary if the distribution of $(x_{t_1}, \ldots, x_{t_p})$ is the same as that of $(x_{t_1+\Delta}, \ldots, x_{t_p+\Delta})$ for all $\Delta, t_1, \ldots, t_p$ such that $t_1, \ldots, t_p \in I$ and $t_1+\Delta, \ldots, t_p+\Delta \in I$.
If in addition $E(x_t^2) < \infty$, then
1. $E(x_t) = \mu(t) = \mu$ (constant);
2. $V(x_t) = \sigma^2(t) = \sigma^2$ (constant);
3. $Cov(x_t, x_{t+s}) = \gamma(s)$, where $\gamma(s)$ is the autocovariance function of $\{x_t\}$.
Strict stationarity is almost impossible to verify given a finite realization (i.e. a time series), so we use a weaker condition: second-order (or weak, or covariance) stationarity.
A stochastic process $\{x_t\}$ is second-order stationary if $E(x_t^2) < \infty$ and
1. $E(x_t) = \mu$ (constant);
2. $V(x_t) = \sigma^2$ (constant);
3. $Cov(x_t, x_{t+s}) = \gamma(s)$ depends only on $s$.

5 Spectral Representation
Let $\{x_t\}$ be a discrete-time process (assume $I = \{0, 1, 2, 3, \ldots\}$) and assume $\{x_t\}$ is (second-order) stationary.

1. There exists a function $F(\omega)$ for $0 \le \omega \le 1$ such that
$$\gamma(s) = \int_0^1 \cos(2\pi\omega s)\,dF(\omega) = \lim_{N\to\infty}\sum_{k=1}^{N}\cos\!\left(2\pi\frac{k}{N}s\right)\left[F\!\left(\frac{k}{N}\right) - F\!\left(\frac{k-1}{N}\right)\right].$$
($F(\omega)$ is called the spectral distribution function; it is nondecreasing.)

Properties of $F(\omega)$:
(a) $F(0) = 0$ and $F(1) = \gamma(0) = Var(x_t)$
(b) $F(\omega) + F(1-\omega) = \gamma(0) = Var(x_t)$

2. If
$$\sum_{s=0}^{\infty}|\gamma(s)| < \infty,$$
then $F(\omega)$ is differentiable with $F'(\omega) = f(\omega)$, and so
$$\gamma(s) = \int_0^1 \cos(2\pi\omega s)\,f(\omega)\,d\omega.$$
($f(\omega)$ is called the spectral density.)

Properties of $f(\omega)$:
(a) $f(\omega) = f(1-\omega)$
(b) $f(\omega) = \sum_{s=-\infty}^{\infty}\gamma(s)\cos(2\pi\omega s) = \gamma(0) + 2\sum_{s=1}^{\infty}\gamma(s)\cos(2\pi\omega s)$

Important notes:
1. $\sum_{s=1}^{\infty}|\gamma(s)| < \infty$ implies that $\gamma(s) \to 0$ as $s \to \infty$, and in fact that $\gamma(s)$ converges to 0 reasonably fast,
e.g. $\gamma(s) = \dfrac{k}{1+s^{\delta}}$ for $\delta > 0$.
2.
3. The relationship between $f(\omega)$ and $\gamma(s)$ is analogous to that between $I(\omega)$ and the sample autocovariance $\gamma(s)$:
$$I(\omega) = \gamma(0) + 2\sum_{s=1}^{n-1}\gamma(s)\cos(2\pi\omega s), \qquad \gamma(s) = \int_0^1 I(\omega)\cos(2\pi\omega s)\,d\omega.$$
This suggests that $I(\omega)$ is an estimate of $f(\omega)$ (it is, but not a good one).
Examples
1.
2.
Note that for $\{y_t\}$ the spectral density function does not exist: its autocovariance contains a term proportional to $\cos(2\pi\omega s)$, which does not decay, so
$$\sum_{s=0}^{\infty}|\gamma(s)| = \infty, \quad\text{since } \sum_{s=1}^{\infty}|\cos(2\pi\omega s)| \text{ diverges.}$$

Spectral distribution function of $\{y_t\}$: $F$ is linear (slope $\sigma_x^2$) with jumps of $\tfrac{1}{2}A^2$ at the frequencies $\omega$ and $1-\omega$ (the same slope on each linear piece and the same jump of $\tfrac{1}{2}A^2$ at both frequencies).

5.1 Application
Simulate a stationary stochastic process with a given spectral density function $f(\omega)$ (note: $f(\omega) = f(1-\omega)$).
Take $N$ very large and define random variables $A_1, A_2, \ldots, A_N, B_1, \ldots, B_N$ ($2N$ random variables in total) with
$E(A_k) = E(B_k) = 0$ for all $k$,
$Cov(A_j, A_k) = 0$ for $j \ne k$,
$Cov(B_j, B_k) = 0$ for $j \ne k$,
$Cov(A_j, B_k) = 0$ for all $j, k$,
$Var(A_k) = Var(B_k) = \frac{1}{N}f\!\left(\frac{k}{N}\right)$, where $f$ is the s.d.f.
Define
$$x_t = \mu + \sum_{k=1}^{N}A_k\cos\!\left(2\pi\tfrac{k}{N}t\right) + \sum_{k=1}^{N}B_k\sin\!\left(2\pi\tfrac{k}{N}t\right).$$
Then
$$Var(x_t) = \sum_{k=1}^{N}Var(A_k) = \frac{1}{N}\sum_{k=1}^{N}f\!\left(\tfrac{k}{N}\right) \approx \int_0^1 f(\omega)\,d\omega$$
since $N$ is large. Likewise, by a similar calculation,
$$Cov(x_t, x_{t+s}) \approx \int_0^1 f(\omega)\cos(2\pi\omega s)\,d\omega.$$
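A minimal R sketch of this construction, with independent normal $A_k$, $B_k$ (one convenient choice satisfying the moment conditions) and an assumed AR(1)-type spectral density $f(\omega) = 1/(1.81 - 1.8\cos 2\pi\omega)$; the simulated series should then behave like an AR(1) with coefficient 0.9.

set.seed(457)
N <- 2000                                             # number of frequencies k/N
n <- 400                                              # length of the simulated series
f <- function(w) 1 / (1.81 - 1.8 * cos(2 * pi * w))   # assumed s.d.f.
k <- 1:N
A <- rnorm(N, mean = 0, sd = sqrt(f(k / N) / N))      # Var(A_k) = f(k/N)/N
B <- rnorm(N, mean = 0, sd = sqrt(f(k / N) / N))      # Var(B_k) = f(k/N)/N
x <- sapply(1:n, function(t) sum(A * cos(2 * pi * k * t / N) + B * sin(2 * pi * k * t / N)))
plot.ts(x)
acf(x)   # should resemble the ACF of an AR(1) with phi = 0.9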

5.2 Important Stationary Stochastic Process


1. White noise process
$\{x_t\}$ is a sequence of uncorrelated random variables with constant mean and constant variance $Var(x_t) = \sigma^2$, so
$$Cov(x_t, x_{t+s}) = \begin{cases}\sigma^2 & s = 0\\ 0 & s \ne 0\end{cases}, \qquad \gamma(s) = \begin{cases}\sigma^2 & s = 0\\ 0 & s \ne 0\end{cases}, \qquad \rho(s) = \frac{\gamma(s)}{\gamma(0)} = \begin{cases}1 & s = 0\\ 0 & s \ne 0.\end{cases}$$
Importance:
(a) Most stochastic processes used in modelling are driven by white noise.
(b) Time series models produce residuals; if the model fits the data well, the residuals should look like white noise.
2. Moving average process
Let $\{\varepsilon_t\}$ be a white noise process with $Var(\varepsilon_t) = \sigma^2$ and $E(\varepsilon_t) = 0$, and define
$$x_t = \sum_{u=0}^{q}\theta_u\varepsilon_{t-u} \qquad\text{or}\qquad x_t = \mu + \sum_{u=0}^{q}\theta_u\varepsilon_{t-u}.$$
Typically we assume $\theta_0 = 1$, so
$$x_t = \mu + \sum_{u=1}^{q}\theta_u\varepsilon_{t-u} + \varepsilon_t.$$
We write: $\{x_t\}$ is an MA(q) (moving average) process.
$$Cov(x_t, x_{t-s}) = \sum_{u=0}^{q}\sum_{v=0}^{q}\theta_u\theta_v\,Cov(\varepsilon_{t-u}, \varepsilon_{t-s-v}),$$
and $Cov(\varepsilon_{t-u}, \varepsilon_{t-s-v}) = 0$ unless $t - u = t - s - v$, i.e. $u = v + s$.
So
$$\gamma(s) = \begin{cases}\sigma^2\sum_{u=0}^{q-|s|}\theta_u\theta_{u+|s|} & |s| \le q\\ 0 & |s| > q\end{cases}$$
and
$$\rho(s) = \begin{cases}\dfrac{\sum_{u=0}^{q-|s|}\theta_u\theta_{u+|s|}}{\sum_{u=0}^{q}\theta_u^2} & |s| \le q\\ 0 & |s| > q.\end{cases}$$

Graphical representation: the correlogram of an MA(q) cuts off after lag $q$.
Spectral density:
$$f(\omega) = \gamma(0) + 2\sum_{s=1}^{\infty}\gamma(s)\cos(2\pi\omega s) = \gamma(0) + 2\sum_{s=1}^{q}\gamma(s)\cos(2\pi\omega s) = \sigma^2\sum_{u=0}^{q}\theta_u^2 + 2\sigma^2\sum_{s=1}^{q}\sum_{u=0}^{q-s}\theta_u\theta_{u+s}\cos(2\pi\omega s) = \sigma^2\left|\sum_{u=0}^{q}\theta_u e^{2\pi i\omega u}\right|^2.$$

5.3 General Result


If $\{x_t\}$ is stationary with s.d.f. $f(\omega)$ and
$$\bar{X} = \frac{1}{n}\sum_{t=1}^{n}x_t,$$
then (for large $n$)
$$Var(\bar{X}) \approx \frac{f(0)}{n}.$$
The variability of $\bar{X}$ is controlled by the low-frequency behaviour of the process.

6 ARIMA Models
6.1 Autoregressive Process
An autoregressive process of order one is abbreviated AR(1).
Let $\{\varepsilon_t\}$ be white noise (WN) with $E(\varepsilon_t) = 0$ and $Var(\varepsilon_t) = \sigma^2$. Then
$$x_t = \phi x_{t-1} + \varepsilon_t \qquad\text{or}\qquad x_t = \mu + \phi(x_{t-1}-\mu) + \varepsilon_t,$$
where $\varepsilon_t$ is called the innovation. Assume $Cov(x_{t-1}, \varepsilon_t) = 0$. Substituting repeatedly,
$$x_t = \phi(\phi x_{t-2} + \varepsilon_{t-1}) + \varepsilon_t = \phi^2 x_{t-2} + \phi\varepsilon_{t-1} + \varepsilon_t = \phi^s x_{t-s} + \phi^{s-1}\varepsilon_{t-s+1} + \ldots + \varepsilon_t.$$
Now assume $|\phi| < 1$ and let $s \to \infty$:
$$x_t = \sum_{u=0}^{\infty}\phi^u\varepsilon_{t-u}.$$
We need $|\phi| < 1$ for $\{x_t\}$ to be a stationary process.

6.1.1 AR(1)
$$x_t = \phi x_{t-1} + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is white noise and $\varepsilon_t$ is uncorrelated with $x_{t-1}, x_{t-2}, \ldots$; $\{x_t\}$ is stationary when $|\phi| < 1$.

6.1.2 AR(p)

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \varepsilon_t \quad\text{or}\quad x_t = \mu + \phi_1(x_{t-1}-\mu) + \phi_2(x_{t-2}-\mu) + \ldots + \phi_p(x_{t-p}-\mu) + \varepsilon_t,$$
where $\{\varepsilon_t\}$ is WN and $Cov(\varepsilon_t, x_{t-s}) = 0$ for all $s \ge 1$.
This can be rewritten using the backshift operator as
$$(1 - \phi_1 B - \phi_2 B^2 - \ldots - \phi_p B^p)x_t = \varepsilon_t,$$
or
$$x_t = \frac{\varepsilon_t}{1 - \phi_1 B - \phi_2 B^2 - \ldots - \phi_p B^p},$$
which can be rewritten, factoring the polynomial in the operator, as
$$x_t = \frac{\varepsilon_t}{(1 - a_1 B)(1 - a_2 B)\cdots(1 - a_p B)}.$$
For an AR(1) we then have
$$x_t = \frac{\varepsilon_t}{1 - \phi B} = \varepsilon_t + \phi\varepsilon_{t-1} + \phi^2\varepsilon_{t-2} + \ldots = \sum_{s=0}^{\infty}\phi^s\varepsilon_{t-s}.$$

Generalization for AR(p): if $|a_1|, |a_2|, \ldots, |a_p| < 1$ then
$$x_t = \left(\sum_{s=0}^{\infty}a_1^s B^s\right)\left(\sum_{s=0}^{\infty}a_2^s B^s\right)\cdots\left(\sum_{s=0}^{\infty}a_p^s B^s\right)\varepsilon_t = \sum_{s=0}^{\infty}b_s B^s\varepsilon_t = \sum_{s=0}^{\infty}b_s\varepsilon_{t-s}.$$

6.2 Stationarity Condition

Define $\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p$.
$\{x_t\}$ is stationary if all solutions of $\phi(z) = 0$ satisfy $|z| > 1$.
If $\phi(z) = (1 - a_1 z)(1 - a_2 z)\cdots(1 - a_p z)$, then "all solutions of $\phi(z) = 0$ have $|z| > 1$" is equivalent to $|a_i| < 1$ for $i = 1, \ldots, p$.
The resulting representation $x_t = \sum_{s=0}^{\infty}b_s\varepsilon_{t-s}$ is called the causal form; any stationary AR process can be written in causal form.

6.3 Autocovariance Function


Assume $E(x_t) = 0$ (i.e. $\mu = 0$). Multiplying the AR(p) equation by $x_{t-s}$,
$$x_t x_{t-s} = \sum_{i=1}^{p}\phi_i x_{t-i}x_{t-s} + \varepsilon_t x_{t-s} \qquad\text{for } s \ge 1.$$
Now take the expected value of both sides (using $E(\varepsilon_t x_{t-s}) = 0$ for $s \ge 1$):
$$\gamma(s) = \sum_{i=1}^{p}\phi_i\gamma(s-i).$$
And if $E(\varepsilon_t^2) = Var(\varepsilon_t) = \sigma^2$, then (taking $s = 0$, where $E(\varepsilon_t x_t) = \sigma^2$)
$$\sigma^2 = \gamma(0) - \sum_{i=1}^{p}\phi_i\gamma(i),$$
where $\gamma(0) = Var(x_t)$.

6.3.1 Yule-Walker Equations

Collectively, the equations above are known as the Yule-Walker equations.
Given $\phi_1, \ldots, \phi_p$ and $\sigma^2 = Var(\varepsilon_t)$, they allow us to determine $\gamma(s)$ and $\rho(s)$ for $s \ge 0$.
Conversely, given data (a time series $\{x_t\}$), we can estimate $\phi_1, \ldots, \phi_p$ using the estimated autocorrelations.
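A minimal R sketch of this last point, on an assumed AR(2) with coefficients 0.6 and 0.2: the sample autocorrelations are plugged into the Yule-Walker equations and solved for the coefficients, and ar.yw() in base R is shown for comparison.

set.seed(457)
x   <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 500)     # assumed AR(2)
p   <- 2
r   <- as.numeric(acf(x, lag.max = p, plot = FALSE)$acf)[-1]  # rho-hat(1), ..., rho-hat(p)
R   <- toeplitz(c(1, r[-p]))                                  # matrix of rho-hat(|i - j|)
phi <- solve(R, r)                                            # Yule-Walker estimates of phi_1, ..., phi_p
phi
ar.yw(x, order.max = p, aic = FALSE)$ar                       # base-R Yule-Walker fit, for comparison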


6.3.2 Applying Spectral Density Function


In principle,
$$f(\omega) = \gamma(0) + 2\sum_{s=1}^{\infty}\gamma(s)\cos(2\pi\omega s),$$
but it is difficult to apply this formula directly; there are other, easier methods (the filtering theorem of Section 7).

6.4 Random Walk


A random walk is a limiting case of the stationary AR(1) ($\phi \to 1$) and is not stationary:
$$x_t = x_{t-1} + \varepsilon_t,$$
where $\varepsilon_t$ is WN with $E(\varepsilon_t) = 0$, $Var(\varepsilon_t) = \sigma^2$ and $Cov(\varepsilon_t, x_{t-s}) = 0$.
Why is it not stationary? Writing
$$x_t = x_0 + \sum_{u=1}^{t}\varepsilon_u,$$
we get
$$Var(x_t) = Var(x_0) + Var\!\left(\sum_{u=1}^{t}\varepsilon_u\right) = Var(x_0) + t\sigma^2.$$
(Note that the final expression depends on $t$, so the process is not stationary.)
We often write $\nabla x_t = \varepsilon_t$, where $\nabla = 1 - B$.
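A minimal R sketch contrasting a random walk with a stationary AR(1) (lengths, seed and $\sigma^2 = 1$ are assumed for the example); the replicated simulation at the end illustrates that $Var(x_t) \approx t\sigma^2$ grows with $t$.

set.seed(457)
rw  <- cumsum(rnorm(300))                          # random walk: x_t = x_{t-1} + e_t, x_0 = 0
ar1 <- arima.sim(model = list(ar = 0.9), n = 300)  # stationary AR(1) for comparison
plot.ts(cbind(rw, ar1))
sims <- replicate(2000, cumsum(rnorm(200)))        # 2000 independent random walks
var(sims[50, ]); var(sims[200, ])                  # roughly 50 and 200: variance grows like t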

6.5 Unit Root AR(p) process



Consider
$$x_t = \sum_{s=1}^{p}\phi_s x_{t-s} + \varepsilon_t, \qquad \phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p = (1 - a_1 z)(1 - a_2 z)\cdots(1 - a_p z),$$
and write $\phi(z) = (1 - a_1 z)\psi(z)$, where $\psi(z) = (1 - a_2 z)\cdots(1 - a_p z)$.

Suppose $a_1 = 1$ (a unit root). Then
$$\phi(B)x_t = \varepsilon_t \iff \psi(B)(1 - B)x_t = \varepsilon_t \iff \psi(B)\nabla x_t = \varepsilon_t,$$
so $\{\nabla x_t\}$ is an AR(p-1) process.
Later we will see that we can use a unit root test:
$$H_0: \{x_t\} \text{ is non-stationary with } \{\nabla x_t\} \text{ stationary} \qquad\text{vs}\qquad H_a: \{x_t\} \text{ is stationary.}$$

7 Filtering Theorem
Let $\{x_t\}$ be stationary with spectral density function (s.d.f.) $f_x(\omega)$, let $\{c_u\}$ be a linear filter, and let
$$y_t = \sum_{u=-\infty}^{\infty}c_u x_{t-u}.$$
Then the s.d.f. of $\{y_t\}$ is
$$f_y(\omega) = \gamma_y(0) + 2\sum_{s=1}^{\infty}\gamma_y(s)\cos(2\pi\omega s) = \sum_{s=-\infty}^{\infty}\gamma_y(s)e^{2\pi i\omega s}.$$
(Due to symmetry, the imaginary parts cancel out.)

Assume $E(x_t) = 0$, so $E(y_t) = 0$. Then
$$\gamma_y(s) = E[y_t\,y_{t+s}] = E\left[\left(\sum_{u=-\infty}^{\infty}c_u x_{t-u}\right)\left(\sum_{v=-\infty}^{\infty}c_v x_{t+s-v}\right)\right] = \sum_{u=-\infty}^{\infty}\sum_{v=-\infty}^{\infty}c_u c_v\,E[x_{t-u}x_{t+s-v}],$$
where the expected value equals $\gamma_x(s - v + u)$. Hence
$$f_y(\omega) = \sum_{s=-\infty}^{\infty}\sum_{u=-\infty}^{\infty}\sum_{v=-\infty}^{\infty}c_u c_v\,\gamma_x(s - v + u)\,e^{2\pi i\omega s}.$$
Change of variables: $r = s + u - v$, i.e. $s = r + v - u$:
$$f_y(\omega) = \sum_{r}\sum_{u}\sum_{v}c_u c_v\,\gamma_x(r)\,e^{2\pi i\omega(r + v - u)} = \left(\sum_{u}c_u e^{-2\pi i\omega u}\right)\left(\sum_{r}\gamma_x(r)e^{2\pi i\omega r}\right)\left(\sum_{v}c_v e^{2\pi i\omega v}\right) = \left|\sum_{u=-\infty}^{\infty}c_u e^{2\pi i\omega u}\right|^2 f_x(\omega) = |c(\omega)|^2 f_x(\omega).$$

Here $c(\omega)$ is:
1. the transfer function of the filter;
2. the effect of the filter at the different frequencies;
3. $c(\omega) = \sum_{u=-\infty}^{\infty}c_u e^{2\pi i\omega u} = |c(\omega)|\,e^{i\arg c(\omega)}$, where $|c(\omega)|$ is the gain and $\arg c(\omega)$ is the phase shift (degree of distortion);
4. given a time series $x_1, \ldots, x_n$ with periodogram $I_x(\omega)$: if $y_t = \sum_u c_u x_{t-u}$, then
$$I_y(\omega) \simeq |c(\omega)|^2 I_x(\omega), \qquad\text{where } c(\omega) = \sum_u c_u e^{2\pi i\omega u}.$$

Below are three examples.

1.
2.
3. Cascaded filters:
$$y_t = \sum_u c_u x_{t-u}, \qquad z_t = \sum_u d_u y_{t-u}$$
(a double filter), e.g. $y_t = \nabla x_t$ and $z_t = \nabla y_t = \nabla^2 x_t$ = the second difference of $x_t$.
$$f_z(\omega) = \left|\sum_{u=-\infty}^{\infty}d_u e^{2\pi i\omega u}\right|^2 f_y(\omega), \qquad\text{where } f_y(\omega) = \left|\sum_{u=-\infty}^{\infty}c_u e^{2\pi i\omega u}\right|^2 f_x(\omega),$$
so
$$f_z(\omega) = |c_d(\omega)|^2\,|c_c(\omega)|^2\,f_x(\omega)$$
(note that the order of the filters does not matter here).
A running average filter can be improved by cascading it several times.
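A minimal R sketch computing the squared gain $|c(\omega)|^2$ of a filter numerically; the 5-point running average is an assumed example, the index offset of the weights does not change the gain, and cascading the same filter twice multiplies the squared gains as in the formula for $f_z(\omega)$ above.

gain2 <- function(c.u, w) abs(sapply(w, function(wi) sum(c.u * exp(2i * pi * wi * seq_along(c.u)))))^2
w  <- seq(0, 0.5, length.out = 200)
c5 <- rep(1 / 5, 5)                          # 5-point running average filter
plot(w, gain2(c5, w), type = "l", ylab = "|c(w)|^2")   # squared gain: passes low, damps high frequencies
lines(w, gain2(c5, w)^2, col = "red")        # the cascaded (twice-applied) filter: squared gains multiply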

7.1 Application to Moving Average and Autoregressive Processes
7.1.1 AR(p)
$$x_t = \phi_1 x_{t-1} + \ldots + \phi_p x_{t-p} + \varepsilon_t \quad(\text{where } \varepsilon_t \text{ is WN}),$$
so
$$x_t - \phi_1 x_{t-1} - \ldots - \phi_p x_{t-p} = \varepsilon_t.$$
Note that
$$x_t - \phi_1 x_{t-1} - \ldots - \phi_p x_{t-p} = \sum_{u=0}^{p}c_u x_{t-u},$$
where $c_0 = 1$, $c_1 = -\phi_1, \ldots, c_p = -\phi_p$. By the filtering theorem,
$$f_\varepsilon(\omega) = |c(\omega)|^2 f_x(\omega), \qquad\text{where } f_\varepsilon(\omega) = \sigma^2,$$
so
$$f_x(\omega) = \frac{\sigma^2}{\left|1 - \sum_{u=1}^{p}\phi_u e^{2\pi i\omega u}\right|^2}.$$

Comparing two autoregressive processes:
$$x_t = 0.9\,x_{t-1} + \varepsilon_t \qquad\text{and}\qquad y_t = -0.9\,y_{t-1} + \varepsilon_t,$$
$$f_x(\omega) = \frac{\sigma^2}{1.81 - 1.8\cos(2\pi\omega)}, \qquad f_y(\omega) = \frac{\sigma^2}{1.81 + 1.8\cos(2\pi\omega)}.$$
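A minimal R sketch plotting these two spectral densities with $\sigma^2 = 1$ (assumed); the $\phi = 0.9$ process has its power concentrated near $\omega = 0$ (smooth series) and the $\phi = -0.9$ process near $\omega = 1/2$ (wiggly series).

w  <- seq(0, 0.5, length.out = 200)
fx <- 1 / (1.81 - 1.8 * cos(2 * pi * w))   # phi =  0.9, sigma^2 = 1
fy <- 1 / (1.81 + 1.8 * cos(2 * pi * w))   # phi = -0.9, sigma^2 = 1
plot(w, fx, type = "l", ylab = "spectral density")
lines(w, fy, col = "red")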

7.1.2 ARMA Process (Autoregressive Moving Average Process)


Let $\{\varepsilon_t\}$ be WN with $E(\varepsilon_t) = 0$ and $Var(\varepsilon_t) = \sigma^2$, and let
$$x_t = \phi_1 x_{t-1} + \ldots + \phi_p x_{t-p} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_q\varepsilon_{t-q}$$
or
$$x_t = \mu + \phi_1(x_{t-1}-\mu) + \phi_2(x_{t-2}-\mu) + \ldots + \phi_p(x_{t-p}-\mu) + \varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_q\varepsilon_{t-q}.$$
The time series is an ARMA(p, q) process.
We need restrictions on the $\phi_i$ and $\theta_j$ ($i = 1, \ldots, p$; $j = 1, \ldots, q$). Reason: stationarity and identifiability.
1. Define $\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p$; all solutions of $\phi(z) = 0$ must satisfy $|z| > 1$.
2. Define $\theta(z) = 1 + \theta_1 z + \ldots + \theta_q z^q$; all solutions of $\theta(z) = 0$ must satisfy $|z| > 1$.

3. The equations $\phi(z) = 0$ and $\theta(z) = 0$ have no common solutions, i.e. $\phi(z) = 0 \Rightarrow \theta(z) \ne 0$.

Reasons for the conditions:
1. Condition 1 allows us to write the ARMA(p, q) process as an MA($\infty$) (moving average, or linear) process,
$$x_t = \varepsilon_t + \sum_{u=1}^{\infty}c_u\varepsilon_{t-u},$$
where the coefficients $c_u$ depend on the values of the $\phi_i$ and $\theta_j$.
2. Condition 2 allows us to write $\{x_t\}$ as an AR($\infty$) process,
$$x_t = \sum_{u=1}^{\infty}b_u x_{t-u} + \varepsilon_t,$$
where the $b_u$ depend on the $\phi_i$ and $\theta_j$. The time series is then said to be an invertible ARMA(p, q) process.

If condition 3 is violated, then the time series is really an ARMA(p-k, q-k) process, where k is the number of common roots.
e.g. ARMA(1, 1) with
$$x_t = \phi_1 x_{t-1} + \varepsilon_t - \phi_1\varepsilon_{t-1}, \qquad \phi(z) = 1 - \phi_1 z = \theta(z).$$
With $|\phi_1| < 1$, conditions 1 and 2 are satisfied but condition 3 is not.
Note: WN $\{\varepsilon_t\}$ satisfies this difference equation, so the process is ARMA(0, 0) = WN, since $\varepsilon_t$ is uncorrelated over time periods.

7.2 Properties
The autocovariance/autocorrelation functions of an ARMA process are very complicated in general, but the s.d.f. is easy to compute. Write
$$x_t - \phi_1 x_{t-1} - \ldots - \phi_p x_{t-p} = \varepsilon_t + \theta_1\varepsilon_{t-1} + \ldots + \theta_q\varepsilon_{t-q} = y_t.$$
By the filtering theorem:
1. $$f_y(\omega) = \left|1 - \sum_{s=1}^{p}\phi_s e^{2\pi i\omega s}\right|^2 f_x(\omega)$$
2. $$f_y(\omega) = \left|1 + \sum_{s=1}^{q}\theta_s e^{2\pi i\omega s}\right|^2\sigma^2$$
Equating and solving for $f_x(\omega)$:
$$f_x(\omega) = \frac{\left|1 + \sum_{s=1}^{q}\theta_s e^{2\pi i\omega s}\right|^2\sigma^2}{\left|1 - \sum_{s=1}^{p}\phi_s e^{2\pi i\omega s}\right|^2}.$$

7.3 Summary:
1. ARMA processes (which include AR and MA as special cases) provide a rich class of stationary processes, useful for modelling stationary time series.
2. ARIMA processes (AutoRegressive / Integrated / Moving Average): a time series is ARIMA(p, d, q) if $\nabla^d x_t = (1 - B)^d x_t$ is ARMA(p, q).
The modelling procedure is:
(a) check whether $\{x_t\}$ is stationary (if not, difference the series until it is);
(b) fit an ARMA(p, q) model to the (possibly differenced) data;
(c) diagnostics: white noise tests on the residuals, etc.

7.4 Prediction and Partial Autocorrelations


Consider a stationary time series with autocovariance $\gamma(s)$ (and possibly s.d.f. $f(\omega)$).
Given $x_{t-1}, x_{t-2}, \ldots, x_{t-p}$, we want to predict the value of $x_t$ using these past values.
Linear prediction:
$$\hat{x}_t = \beta_0 + \beta_1 x_{t-1} + \cdots + \beta_p x_{t-p}.$$
We choose the coefficients to minimize the prediction mean square error (PMSE)
$$E\left[(x_t - \hat{x}_t)^2\right].$$
Three main questions:
1. What are the optimal $\beta_0, \ldots, \beta_p$?
2. How much do we gain by having the past $p+1$ values rather than $p$ values?
3. What is the best possible PMSE?

Find the optimal $\beta_0, \ldots, \beta_p$ by minimizing $E\left[(x_t - \beta_0 - \beta_1 x_{t-1} - \cdots - \beta_p x_{t-p})^2\right]$: setting each partial derivative $-2E\left[(x_t - \beta_0 - \beta_1 x_{t-1} - \cdots - \beta_p x_{t-p})x_{t-i}\right]$ equal to zero gives the normal equations. The optimal $\{\beta_k\}$ satisfy
$$\gamma(s) = \beta_1\gamma(s-1) + \cdots + \beta_p\gamma(s-p) \quad\text{for } s = 1, \ldots, p$$
(equivalently $\rho(s) = \beta_1\rho(s-1) + \cdots + \beta_p\rho(s-p)$), and
$$\beta_0 = (1 - \beta_1 - \beta_2 - \cdots - \beta_p)E(x_t).$$
We then define
$$\sigma_p^2 = E\left[(x_t - \hat{x}_t)^2\right] = \text{optimal PMSE} = \sigma_{p-1}^2\left(1 - \pi^2(p)\right),$$
and $\pi(1), \pi(2), \ldots$ are called the partial autocorrelations of the time series $\{x_t\}$.

7.4.1 ρ(s) vs π(s)

$\rho(s)$ is the correlation between $x_t$ and $x_{t+s}$.
$\pi(s)$ is the correlation between $x_t$ and $x_{t+s}$ after adjusting for the dependence of $x_t$ and $x_{t+s}$ on the intermediate values $\{x_{t+i}\}_{i=1}^{s-1}$: form the two residuals
$$x_t - \left(\beta_0^{(1)} + \beta_1^{(1)}x_{t+1} + \cdots + \beta_{s-1}^{(1)}x_{t+s-1}\right), \qquad x_{t+s} - \left(\beta_0^{(2)} + \beta_1^{(2)}x_{t+1} + \cdots + \beta_{s-1}^{(2)}x_{t+s-1}\right);$$
$\pi(s)$ is the correlation between these two residuals.
Note $\pi(1) = \rho(1)$.

7.4.2 Computing π(s)

1. Transparent (but time-consuming) approach: solve the system
$$\begin{pmatrix}1 & \rho(1) & \rho(2) & \cdots & \rho(s-1)\\ \rho(1) & 1 & \rho(1) & \cdots & \rho(s-2)\\ \vdots & & \ddots & & \vdots\\ \rho(s-1) & \rho(s-2) & \cdots & \rho(1) & 1\end{pmatrix}\begin{pmatrix}\beta_1\\ \beta_2\\ \vdots\\ \beta_s\end{pmatrix} = \begin{pmatrix}\rho(1)\\ \rho(2)\\ \vdots\\ \rho(s)\end{pmatrix},$$
and then $\pi(s) = \beta_s$, the last coefficient.

2. Levinson algorithm (less transparent but more efficient):
Start with $\gamma(0), \gamma(1)$.
(a) $\pi(1) = \rho(1) = \gamma(1)/\gamma(0) = \beta_{11}$, and the optimal PMSE based on $x_{t-1}$ alone is $\sigma_1^2 = \gamma(0)(1 - \rho^2(1))$.
(b) For $s = 2, 3, 4, \ldots$:
i. $$\pi(s) = \beta_{ss} = \frac{\gamma(s) - \sum_{u=1}^{s-1}\beta_{u,s-1}\gamma(s-u)}{\sigma_{s-1}^2}$$
ii. $\sigma_s^2 = \sigma_{s-1}^2\left(1 - \pi^2(s)\right)$
iii. $\beta_{u,s} = \beta_{u,s-1} - \pi(s)\,\beta_{s-u,s-1}$ for $u = 1, \ldots, s-1$.
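A minimal R sketch of the Levinson recursion above: it returns $\pi(1), \ldots, \pi(S)$ from the sample autocovariances, and pacf() should give the same values (the AR(2) series is assumed for the example).

levinson.pacf <- function(x, S) {
  g    <- as.numeric(acf(x, lag.max = S, type = "covariance", plot = FALSE)$acf)  # gamma(0), ..., gamma(S)
  pi.s <- numeric(S)
  pi.s[1] <- g[2] / g[1]                       # pi(1) = gamma(1)/gamma(0)
  beta    <- pi.s[1]                           # beta_{1,1}
  sig2    <- g[1] * (1 - pi.s[1]^2)            # sigma_1^2
  for (s in 2:S) {
    pi.s[s] <- (g[s + 1] - sum(beta * g[s:2])) / sig2   # step (b) i
    beta    <- c(beta - pi.s[s] * rev(beta), pi.s[s])   # step (b) iii, then append beta_{s,s} = pi(s)
    sig2    <- sig2 * (1 - pi.s[s]^2)                   # step (b) ii
  }
  pi.s
}
set.seed(457)
x <- arima.sim(model = list(ar = c(0.6, 0.2)), n = 500)       # assumed AR(2)
round(levinson.pacf(x, 5), 3)
round(as.numeric(pacf(x, lag.max = 5, plot = FALSE)$acf), 3)  # should agree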

Comments
1. If $\{x_t\}$ is an AR(p) process, then $\pi(s) = 0$ for $s = p+1, p+2, \ldots$, i.e. for $s > p$:
$$x_t = \mu + \phi_1(x_{t-1}-\mu) + \cdots + \phi_p(x_{t-p}-\mu) + \varepsilon_t,$$
and so $\pi(p) = \phi_p$.
2. If $\{x_t\}$ is WN, then $\pi(s) = 0$ for all $s \ge 1$.
3. Typically $\pi(s) \ne 0$ for all $s$, but $\pi(s) \to 0$ as $s \to \infty$.

Suppose $\{x_t\}$ is stationary; how well can $x_t$ be predicted from the infinite past?
$$\hat{x}_t = \beta_0 + \sum_{s=1}^{\infty}\beta_s x_{t-s}, \qquad \sigma_\infty^2 = E\left[(x_t - \hat{x}_t)^2\right].$$

7.4.3 Kolmogorov's Formula

If $\{x_t\}$ has s.d.f. $f(\omega)$, then
$$E\left[(x_t - \hat{x}_t)^2\right] \ge e^{\int_0^1\ln(f(\omega))\,d\omega}.$$
The lower bound is attainable: we can find $\beta_0, \beta_1, \ldots$ such that $E\left[(x_t - \hat{x}_t)^2\right]$ equals the lower bound.
Note that $e^{\int_0^1\ln(f(\omega))\,d\omega} \le \int_0^1 f(\omega)\,d\omega = V(x_t)$.

Special cases
1. $\{x_t\}$ WN: then $V(x_t) = \sigma^2 = \gamma(0)$ and $f(\omega) = \sigma^2$, so
$$e^{\int_0^1\ln(\sigma^2)\,d\omega} = e^{\ln(\sigma^2)} = \sigma^2.$$
2. Suppose $f(\omega) = 0$ for $a \le \omega \le b$; then $e^{\int_0^1\ln(f(\omega))\,d\omega} = 0$: if we know the entire past of $\{x_t\}$, then $x_t$ can be predicted exactly.
3. If $\{x_t\}$ is a stationary invertible ARMA(p, q) process, e.g.
$$x_t = \phi_1 x_{t-1} + \cdots + \phi_p x_{t-p} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \cdots + \theta_q\varepsilon_{t-q}$$
with $V(\varepsilon_t) = \sigma^2$, then
$$e^{\int_0^1\ln(f(\omega))\,d\omega} = \sigma^2 = V(\varepsilon_t).$$
This result can be thought of as a consequence of the fact that we can write the invertible process as an AR($\infty$): $x_t = \sum_{u=1}^{\infty}b_u x_{t-u} + \varepsilon_t$.
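A minimal R sketch checking special case 3 numerically for an AR(1) with assumed $\phi = 0.9$ and $\sigma^2 = 2$: the Kolmogorov bound equals $\sigma^2$, while $\int_0^1 f(\omega)\,d\omega = \gamma(0) = \sigma^2/(1-\phi^2)$ is much larger.

phi    <- 0.9
sigma2 <- 2
f      <- function(w) sigma2 / (1 + phi^2 - 2 * phi * cos(2 * pi * w))  # AR(1) spectral density
exp(integrate(function(w) log(f(w)), lower = 0, upper = 1)$value)       # Kolmogorov bound: equals sigma2 = 2
integrate(f, lower = 0, upper = 1)$value                                # gamma(0) = sigma2/(1 - phi^2), about 10.5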
