Zhi Ying Feng
Time Reversibility
Birth and Death Process
Transition Rates and Embedded Markov Chain
Examples of Birth and Death Processes
Expected Time in States
Kolmogorov Equations
Limiting Probabilities
Application: Pure Birth Process
Application: Simple Sickness Model
Occupancy Probabilities and Time
First Holding Time
Non-homogeneous Markov Jump Processes
Residual Holding Time
Current Holding Time
Part 2: Time Series
Introduction to Time Series
Classical Decomposition Model
Moving Average Linear Filters
Differencing
Stationarity
Sample Statistics
Noise
Linear Processes
Time Series Models
Moving Average Process
Autoregressive Process
Autoregressive Moving Average Models
Causality
Invertibility
Calculation of ACF
Partial Autocorrelation Function
Model Building
Model Selection
Parameter Estimation
Model Diagnosis
Non-Stationarity
Stochastic Trends
ARIMA Model
SARIMA Model
Dickey-Fuller Test
Overdifferencing
Cointegrated Time Series
Time Series Forecasting
Time Series and Markov Property
k-step Ahead Predictor
Best Linear Predictor
Part 3: Brownian Motion
Definitions
Properties of Brownian Motion
Brownian Motion and Symmetric Random Walk
Brownian Motion with Drift
Geometric Brownian Motion
Gaussian Processes
Differential Form of Brownian Motion
Stochastic Differential Equations
Stochastic Integration
Part 4: Simulation
Continuous Random Variables Simulation
Pseudo-Random Numbers
Inverse Transform Method
Discrete Random Variable Simulation
Acceptance-Rejection Method
Simulation Using Distributional Relationships
Monte Carlo Simulation
Expectation and Variance
Antithetic Variables
Control Variates
Importance Sampling
Number of Simulations
A stochastic process $\{X_t, t \in T\}$ has independent increments if $X_{t_0}, X_{t_1} - X_{t_0}, \ldots, X_{t_n} - X_{t_{n-1}}$ are independent for all $t_0 < t_1 < \cdots < t_n$. Equivalently, the r.v. $X_{t+s} - X_t$ and $X_t$ are independent for all $s, t > 0$, i.e. future increases are independent of the past or present.
A stochastic process has stationary increments if $X_{t_2+h} - X_{t_1+h}$ and $X_{t_2} - X_{t_1}$, i.e. increments of the same length, have the same probability distribution, for all $t_1, t_2$ and $h > 0$.
Markov Processes
A Markov process is a stochastic process that has the Markov property: given the present state, the future state is independent of the past states,
$$\Pr(X_{t_{n+1}} = j \mid X_{t_n} = i, X_{t_{n-1}} = i_{n-1}, \ldots, X_{t_0} = i_0) = \Pr(X_{t_{n+1}} = j \mid X_{t_n} = i)$$
A Markov chain is a Markov process on a discrete index set $T = \{0, 1, 2, \ldots\}$, denoted by $\{X_n, n = 0, 1, 2, \ldots\}$ where $X_n = k$ means the process is in state $k$ at time $n$, together with a finite or countable state space $S$.
Transition Probability
The one-step transition probabilities of a Markov chain are the conditional probabilities of moving to state $j$ in one step, given that the process is in state $i$ at present:
$$P_{ij}^{(n,n+1)} = \Pr(X_{n+1} = j \mid X_n = i)$$
If the one-step transition probabilities do NOT depend on time, i.e. are the same for every step $n$ to $n+1$, then the Markov chain is homogeneous with stationary transition probabilities, collected in the transition matrix
$$P = [P_{ij}] = \begin{pmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ P_{20} & P_{21} & P_{22} & \cdots \\ \vdots & & & \ddots \end{pmatrix}$$
The $n$-step transition probability is
$$P_{ij}^{(n)} = \Pr(X_{n+m} = j \mid X_m = i)$$
Chapman-Kolmogorov Equations
The Chapman-Kolmogorov equations compute the $(n+m)$-step transition probabilities:
$$P_{ij}^{(n+m)} = \sum_k P_{ik}^{(n)} P_{kj}^{(m)}$$
This is the sum over all intermediate states $k$ of the probability of reaching $k$ after $n$ steps, then reaching state $j$ after $m$ more steps. Or in matrix multiplication:
$$P^{(n+m)} = P^{(n)} P^{(m)}, \qquad P^{(n)} = \underbrace{P \cdot P \cdots P}_{n \text{ times}} = P^n$$
where $P^{(n)} = [P_{ij}^{(n)}]$ is the matrix consisting of the $n$-step transition probabilities.
Consider the matrix multiplication $A \cdot B = AB$:
To get the $i$th row of $AB$, multiply the $i$th row of $A$ by $B$
To get the $j$th column of $AB$, multiply $A$ by the $j$th column of $B$
Kolmogorov Forward Equations: start in state $i$, transition into state $k$ after $n$ steps, then make a one-step transition to state $j$: $P_{ij}^{(n+1)} = \sum_k P_{ik}^{(n)} P_{kj}$
Kolmogorov Backward Equations: start in state $i$, make a one-step transition into state $k$, then an $n$-step transition into state $j$: $P_{ij}^{(n+1)} = \sum_k P_{ik} P_{kj}^{(n)}$
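The matrix form of the Chapman-Kolmogorov equations can be checked directly; a minimal sketch, using a hypothetical 3-state transition matrix and assuming NumPy is available:

```python
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

def n_step(P, n):
    """n-step transition probabilities: P^(n) = P multiplied n times."""
    return np.linalg.matrix_power(P, n)

# Chapman-Kolmogorov: P^(n+m) = P^(n) P^(m)
lhs = n_step(P, 5)
rhs = n_step(P, 2) @ n_step(P, 3)
assert np.allclose(lhs, rhs)
```

Any split of the exponent works, which is exactly the statement that the intermediate state after $n$ steps can be summed out.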
Classification of States
Absorbing state
A state $i$ is said to be an absorbing state if $P_{ii} = 1$, or equivalently $P_{ij} = 0$ for all $j \neq i$. An absorbing state is a state in which, once the process arrives in it, the process will stay.
Accessible
State $j$ is accessible from state $i$ if $P_{ij}^{(n)} > 0$ for some $n \geq 0$. State $j$ is accessible from $i$ if there is a positive probability that the process will be in state $j$ at some future time, given that the process is currently in state $i$. This is written as:
$$i \to j$$
Note that:
if $i \to j$ and $j \to k$ then $i \to k$
An absorbing state is the ONLY state accessible from itself.
Communicate
States $i$ and $j$ communicate if:
$$i \to j \text{ and } j \to i, \text{ i.e. } i \leftrightarrow j$$
That is, there is a positive probability that if the process is currently in state $i$, then at a future point in time the process will return to state $i$ after visiting state $j$. Note that:
$i \leftrightarrow i$ for all $i$
if $i \leftrightarrow j$ then $j \leftrightarrow i$
if $i \leftrightarrow j$ and $j \leftrightarrow k$ then $i \leftrightarrow k$
The class of states that communicate with state $i$ is the set:
$$C(i) = \{ j \in S : i \leftrightarrow j \}$$
That is, there is a positive probability that if the process is in state $i$, then at a future point in time the process will return to state $i$ after visiting at least one state in class $C(i)$.
Irreducible Markov Chain
A Markov chain is irreducible if there is only ONE class, i.e. all states communicate with each
other. Properties:
All states in a finite, irreducible Markov chain are recurrent
The probability of returning to the current state in the long run is positive
The probability of being in any state in the long run is also positive
Recurrent and Transient States
Let $f_i$ be the probability that the process, starting in state $i$, will return to state $i$ at some time in the future:
$$f_i = \Pr(X_n = i \text{ for some } n \geq 1 \mid X_0 = i)$$
A state is:
Recurrent if $f_i = 1$
Transient if $f_i < 1$
If state $i$ is recurrent then, starting in state $i$, the process will return to state $i$ at some point in the future, and will in fact enter state $i$ infinitely often.
If state $i$ is transient then, starting in state $i$, whether the process ever returns to state $i$ is a Bernoulli random variable:
$$X_i \sim \text{Ber}(p = 1 - f_i), \qquad X_i = \begin{cases} 0 & \text{if the process returns to state } i \\ 1 & \text{if the process does not return to state } i \end{cases}$$
Then the probability that the process will be in state $i$ for exactly $n$ time periods can be modelled using the geometric distribution:
$$\Pr(\text{exactly } n \text{ periods in } i) = f_i^{\,n-1}(1 - f_i)$$
The total number of periods that the process is in state $i$ is given by:
$$\sum_{n=0}^{\infty} I_n, \qquad \text{where } I_n = \begin{cases} 1 & \text{if } X_n = i \\ 0 & \text{if } X_n \neq i \end{cases}$$
with the expected number of time periods, given the process starts in state $i$:
$$E\left[\sum_{n=0}^{\infty} I_n \,\middle|\, X_0 = i\right] = \sum_{n=0}^{\infty} E[I_n \mid X_0 = i] = \sum_{n=0}^{\infty} \Pr(X_n = i \mid X_0 = i) = \sum_{n=0}^{\infty} P_{ii}^{(n)} = \frac{1}{1 - f_i}$$
Hence state $i$ is:
recurrent if $\sum_{n=1}^{\infty} P_{ii}^{(n)} = \infty$
transient if $\sum_{n=1}^{\infty} P_{ii}^{(n)} < \infty$
That is, a transient state will only be visited a finite number of times.
Note:
In a finite-state Markov chain, AT LEAST one state must be recurrent and NOT ALL states can be transient. Otherwise, after a finite number of steps, no state would ever be visited again!
An absorbing state is recurrent, since it revisits itself infinitely many times
If state i communicates with state j and state i is:
Recurrent, then state j is also recurrent
Transient, then state j is also transient
A class of states is:
Recurrent if all states in the class are recurrent
Transient if all states in the class are transient
Closed if all of the states in the class can only lead to states WITHIN the class. Hence,
states in a finite closed class are recurrent
Open if states in the class can lead to states OUTSIDE the class. Hence states in an open
class are transient
Class Properties
Period
State $i$ has period $d(i)$ if $d(i)$ is the greatest common divisor of all $n \geq 1$ for which $P_{ii}^{(n)} > 0$, i.e. it is the GCD of the numbers of steps in all possible paths back to state $i$, if the process starts in state $i$. A state with period 1 is aperiodic. Note that:
If $i \leftrightarrow j$, then $d(i) = d(j)$
Positive Recurrent
A recurrent state i is positive recurrent if the expected time of return to itself is finite. In a finite
state Markov chain, ALL recurrent states are positive recurrent.
Ergodic
A state i is ergodic if it is positive recurrent and aperiodic.
Limiting Probabilities
The limiting probabilities give the long-run probability that a process is in a certain state. For an irreducible (only one class, all states communicate) and ergodic Markov chain, the limiting probabilities exist and are independent of $i$, i.e. of where the process starts:
$$\pi_j = \lim_{n \to \infty} P_{ij}^{(n)}, \qquad \pi_j \geq 0$$
They are the unique nonnegative solution of
$$\pi_j = \sum_i \pi_i P_{ij} \quad \text{and} \quad \sum_j \pi_j = 1$$
or in matrix form $\boldsymbol{\pi} = \boldsymbol{\pi} P$ where $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots)$.
This can be interpreted as:
The probability that the process is in state $j$ at time $t$ is the same as at $t+1$, as $t \to \infty$
$\pi_j$ is the long-run proportion of time that the Markov chain is in state $j$
Note that the limiting probabilities may not exist at all, or may exist only along even/odd transitions
If state $j$ is transient, then $\lim_{n \to \infty} P_{ij}^{(n)} = 0$ for all $i$
If the distribution of the initial state is chosen to be the limiting distribution, then the probability of being in state $j$ initially is the same as the probability of being in state $j$ at time $n$:
If $\Pr(X_0 = j) = \pi_j$, then $\Pr(X_n = j) = \pi_j$
The mean time between visits to state $j$ is the expected number of transitions $m_{jj}$ until a Markov chain which starts in state $j$ returns to state $j$, related to the limiting probability by
$$\pi_j = \frac{1}{m_{jj}}$$
That is, the proportion of time in state $j$ equals the inverse of the mean time between visits to state $j$.
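The stationary equations $\boldsymbol{\pi} = \boldsymbol{\pi} P$, $\sum_j \pi_j = 1$ can be solved as a linear system; a minimal sketch with a hypothetical 3-state chain, assuming NumPy is available:

```python
import numpy as np

# Hypothetical 3-state transition matrix (rows sum to 1).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

def stationary(P):
    """Solve pi = pi P together with sum(pi) = 1 as a least-squares system."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])  # (P^T - I) pi = 0 and 1^T pi = 1
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

pi = stationary(P)
assert np.allclose(pi, pi @ P)                       # pi is stationary
assert np.allclose(np.linalg.matrix_power(P, 100)[0], pi)  # rows of P^n converge to pi
```

The second assertion illustrates $\pi_j = \lim_{n\to\infty} P_{ij}^{(n)}$: every row of $P^n$ converges to $\boldsymbol{\pi}$.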
Expected Time in Transient States
For the transient states $T = \{1, 2, \ldots, t\}$ of a Markov chain, let $P_T$ denote the restriction of the transition matrix to transitions between transient states, and let $s_{ij}$ denote the expected number of time periods spent in transient state $j$ given that the chain starts in transient state $i$, collected in the matrix
$$S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1t} \\ s_{21} & s_{22} & \cdots & s_{2t} \\ \vdots & & & \vdots \\ s_{t1} & s_{t2} & \cdots & s_{tt} \end{pmatrix}$$
Conditioning on the initial transition, we have for $i, j \in T$:
$$s_{ij} = \begin{cases} \displaystyle\sum_{k=1}^{t} P_{ik} s_{kj} & \text{if } i \neq j \\[3mm] 1 + \displaystyle\sum_{k=1}^{t} P_{ik} s_{kj} & \text{if } i = j \end{cases}$$
i.e. the mean time spent in state $j$ given the process is currently in state $i$ is the transition probability into state $k$ starting in state $i$, times the time spent in state $j$ starting in state $k$, summed over all possible $k$ (plus the initial period when $i = j$).
Then we have:
$$S = I + P_T S \implies S = (I - P_T)^{-1}$$
Note the inverse of a $2 \times 2$ matrix:
$$A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad A^{-1} = \frac{1}{\det A}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
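The identity $S = (I - P_T)^{-1}$ can be checked numerically; a small sketch with a hypothetical chain having two transient states and one absorbing state, assuming NumPy is available:

```python
import numpy as np

# Hypothetical chain: states 0 and 1 are transient, state 2 is absorbing.
P = np.array([
    [0.4, 0.4, 0.2],
    [0.3, 0.3, 0.4],
    [0.0, 0.0, 1.0],
])

P_T = P[:2, :2]                      # transitions among transient states only
S = np.linalg.inv(np.eye(2) - P_T)   # S = (I - P_T)^{-1}

# s[i, j] = expected number of periods in transient state j starting from i;
# it satisfies the conditioning identity S = I + P_T @ S.
assert np.allclose(S, np.eye(2) + P_T @ S)
```

Starting in a transient state, the diagonal entry $s_{ii} \geq 1$ since the initial period in state $i$ always counts.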
Branching Processes
A branching process is a Markov chain that describes the size of a population in which each member of each generation produces a random number of offspring in the next generation.
Let $p_j$, $j \geq 0$ be the probability that each individual in each generation produces $j$ offspring
Assume that the number of offspring of each individual is independent of the numbers produced by others
Let $X_0$ be the number of individuals initially present, i.e. the zeroth generation
Individuals produced by the $n$th generation belong to the $(n+1)$th generation
Let $X_n$ be the size of the $n$th generation.
If $p_0 > 0$, all states other than 0 are transient, i.e. the population will either die out or grow to infinity, since transient states are visited only a finite number of times.
The size of the next generation is
$$X_{n+1} = \sum_{i=1}^{X_n} Y_i$$
where the $Y_i$ are i.i.d. r.v. representing the number of offspring of the $i$th individual of the $n$th generation:
$$\Pr(Y_i = k) = p_k \text{ for } k = 0, 1, \ldots, \qquad \sum_{k=0}^{\infty} p_k = 1$$
$$\mu = E[Y_i] = \sum_{k=0}^{\infty} k p_k, \qquad \sigma^2 = \text{var}(Y_i) = \sum_{k=0}^{\infty} (k - \mu)^2 p_k$$
The mean and variance of the population size of the $n$th generation are:
$$E[X_n] = \mu^n, \qquad \text{var}(X_n) = \begin{cases} \sigma^2 \mu^{n-1} \dfrac{\mu^n - 1}{\mu - 1} & \text{if } \mu \neq 1 \\[2mm] n \sigma^2 & \text{if } \mu = 1 \end{cases}$$
Under the assumption that $X_0 = 1$, the probability of extinction $\pi_0$, i.e. that everyone eventually dies, is the smallest nonnegative solution of
$$\pi_0 = \sum_{k=0}^{\infty} \pi_0^k p_k = G_Y(\pi_0)$$
If $\mu \leq 1$ then $\pi_0 = 1$.
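The fixed-point equation $\pi_0 = G_Y(\pi_0)$ can be solved by iterating the PGF from 0; a sketch with a hypothetical offspring distribution (chosen so the extinction probability has a closed form):

```python
# Extinction probability of a branching process as the smallest fixed point
# of the offspring PGF, pi0 = G_Y(pi0). Hypothetical offspring distribution:
p = [0.25, 0.25, 0.5]   # p_k = Pr(Y = k) for k = 0, 1, 2; mean mu = 1.25 > 1

def pgf(s, p):
    """Offspring probability generating function G_Y(s) = sum p_k s^k."""
    return sum(pk * s**k for k, pk in enumerate(p))

# Iterating pi <- G_Y(pi) from 0 converges to the smallest root in [0, 1].
pi0 = 0.0
for _ in range(1000):
    pi0 = pgf(pi0, p)

# Here s = G(s) reads 0.5 s^2 - 0.75 s + 0.25 = 0, with roots s = 0.5 and s = 1,
# so the extinction probability is 0.5.
assert abs(pi0 - 0.5) < 1e-9
```

Since $\mu = 1.25 > 1$ the extinction probability is strictly less than 1, consistent with the criterion above.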
The probability generating function of a discrete r.v. $X$ is
$$G_X(t) = E[t^X] = \sum_{x=0}^{\infty} t^x \Pr(X = x)$$
Properties:
Relationship to the probability mass function of $X$:
$$p_X(x) = \frac{1}{x!} \left. \frac{\partial^x}{\partial t^x} G_X(t) \right|_{t=0}$$
Relationship to the moment generating function: $G_X(t) = m_X(\log t)$
For the reversed chain, the transition probability from state $i$ to state $j$ is
$$Q_{ij} = \Pr(X_m = j \mid X_{m+1} = i) = \frac{\Pr(X_m = j, X_{m+1} = i)}{\Pr(X_{m+1} = i)} = \frac{\Pr(X_m = j)\Pr(X_{m+1} = i \mid X_m = j)}{\Pr(X_{m+1} = i)} = \frac{\pi_j P_{ji}}{\pi_i}$$
The chain is time reversible if $Q_{ij} = P_{ij}$ for all $i, j$, i.e.
$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j$$
Thus, the rate at which the process goes from state $i$ to state $j$ (LHS) is the same as the rate at which it goes from state $j$ to state $i$.
An exponential r.v. $X$ with parameter $\lambda$ has p.d.f.
$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
The cumulative probability distribution and the survival function are of the form:
$$F(x) = \begin{cases} 1 - e^{-\lambda x} & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases} \qquad S(x) = 1 - F(x) = \begin{cases} e^{-\lambda x} & \text{if } x \geq 0 \\ 1 & \text{if } x < 0 \end{cases}$$
The moment generating function is given by:
$$M_X(t) = E[e^{tX}] = \frac{\lambda}{\lambda - t} \quad \text{for } t < \lambda$$
with
$$E[X] = \frac{1}{\lambda}, \qquad \text{var}(X) = \frac{1}{\lambda^2}$$
The exponential distribution is memoryless, $\Pr(X > s + t \mid X > s) = \Pr(X > t)$, with survival function $S(x) = 1 - F(x) = e^{-\lambda x}$.
The sum of $n$ i.i.d. exponential($\lambda$) r.v. has a gamma($n$, $\lambda$) distribution with p.d.f.
$$f(t) = \lambda e^{-\lambda t} \frac{(\lambda t)^{n-1}}{(n-1)!}$$
For independent exponential r.v. $X_1, \ldots, X_n$ with rates $\lambda_1, \ldots, \lambda_n$, the minimum $X = \min(X_1, X_2, \ldots, X_n)$ also has an exponential distribution, with parameter and mean:
$$\lambda = \sum_{i=1}^{n} \lambda_i \qquad \text{and} \qquad E[X] = \frac{1}{\sum_{i=1}^{n} \lambda_i}$$
and the probability that $X_i$ is the minimum is
$$\Pr\left(X_i = \min_j X_j\right) = \frac{\lambda_i}{\sum_{j=1}^{n} \lambda_j}$$
If the parameters are all distinct, i.e. $\lambda_i \neq \lambda_j$ for $i \neq j$, then the sum of the $n$ independent exponential r.v. $\sum_{i=1}^n X_i$ has a hypoexponential (generalised Erlang) distribution with p.d.f.:
$$f_{\sum_{i=1}^n X_i}(x) = \sum_{i=1}^{n} C_{i,n} \lambda_i e^{-\lambda_i x}, \qquad \text{where } C_{i,n} = \prod_{j \neq i} \frac{\lambda_j}{\lambda_j - \lambda_i}$$
Counting Process
A counting process $\{N_t, t \geq 0\}$ represents the number of events that occur up to time $t$, i.e. it is a discrete state space, continuous time stochastic process. A counting process has the properties:
$N_t \geq 0$
$N_t$ is integer-valued and non-decreasing
A counting process has independent increments if the number of events that occur in the interval $(s, t]$, $N_t - N_s$, is independent of the number of events that occur up to time $s$. It has stationary increments if the distribution of $N_{t_2+h} - N_{t_1+h}$ is the same as that of $N_{t_2} - N_{t_1}$, i.e. the number of events that occur in any interval depends only on the length of the interval.
Poisson Process
A Poisson process denoted by $\{N_t, t \geq 0\}$ is a counting process that counts the number of events which occur at a rate $\lambda$ from time 0. It has the properties:
$N_0 = 0$
independent increments
stationary increments: $\Pr(N_{s+t} - N_s = n) = \Pr(N_t = n) = e^{-\lambda t} \dfrac{(\lambda t)^n}{n!}$
For a Poisson process, the inter-arrival times $T_n$ are i.i.d. exponential r.v. with:
$$f_{T_n}(t) = \lambda e^{-\lambda t}, \qquad \Pr(T_n > t) = e^{-\lambda t} \qquad \text{and} \qquad E[T_n] = \frac{1}{\lambda}$$
The waiting time $S_n$ is the time until the $n$th event, given by:
$$S_n = T_1 + T_2 + \cdots + T_n = \sum_{i=1}^{n} T_i$$
The sum of exponential random variables each with the SAME parameter $\lambda$ has a gamma($n$, $\lambda$) distribution, therefore:
$$f_{S_n}(t) = \lambda e^{-\lambda t} \frac{(\lambda t)^{n-1}}{(n-1)!}$$
Note: if the events come from $m$ independent Poisson streams each of rate $\lambda$, the combined stream is Poisson with rate $m\lambda$, so $S_n$ has a gamma($n$, $m\lambda$) distribution. E.g. for a motor insurer, each insured motorist makes a claim at a rate of 0.2 per year, and there are a total of 200 insured. Then $S_{100}$, the waiting time until the 100th claim, has a gamma($100$, $200 \times 0.2$) = gamma($100$, $40$) distribution.
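The insurer example above can be checked by Monte Carlo, simulating the waiting time as a sum of exponential inter-arrival times (a sketch using only the standard library; the sample counts are arbitrary):

```python
import random

# Pooled claim rate for 200 insured each claiming at 0.2/year: 40 claims/year.
# S_100 ~ gamma(100, 40), so E[S_100] = 100/40 = 2.5 years.
random.seed(42)
rate = 200 * 0.2
n_events = 100

def waiting_time(rate, n_events):
    """Waiting time to the n-th event: sum of i.i.d. exponential inter-arrivals."""
    return sum(random.expovariate(rate) for _ in range(n_events))

samples = [waiting_time(rate, n_events) for _ in range(5000)]
mean = sum(samples) / len(samples)
assert 2.4 < mean < 2.6   # close to E[S_100] = n / lambda = 2.5
```

The gamma mean $n/\lambda$ falls straight out of the representation $S_n = \sum_i T_i$ with $E[T_i] = 1/\lambda$.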
Order Statistics
Let $Y_{(i)}$ be the $i$th smallest value among $Y_1, Y_2, \ldots, Y_n$; then $Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}$ are called the order statistics. If the $Y_i$ are i.i.d. continuous r.v. with p.d.f. $f(y)$, then the joint p.d.f. of the order statistics $Y_{(1)}, Y_{(2)}, \ldots, Y_{(n)}$ is:
$$f(y_1, y_2, \ldots, y_n) = n! \prod_{i=1}^{n} f(y_i), \qquad \text{where } y_1 < y_2 < \cdots < y_n$$
If the $Y_i$ are uniformly distributed over $(0, t)$ then the joint p.d.f. of the order statistics is:
$$f(y_1, y_2, \ldots, y_n) = \frac{n!}{t^n}$$
Given that $N_t = n$, the joint density of the arrival times $S_1, \ldots, S_n$ follows from the inter-arrival times (the inter-arrival time between the $(n-1)$th and $n$th event is $T_n = S_n - S_{n-1}$, and no events occur in $(s_n, t]$):
$$f(s_1, \ldots, s_n \mid N_t = n) = \frac{f(s_1, \ldots, s_n, n)}{\Pr(N_t = n)} = \frac{\lambda e^{-\lambda s_1} \cdot \lambda e^{-\lambda(s_2 - s_1)} \cdots \lambda e^{-\lambda(s_n - s_{n-1})} \cdot e^{-\lambda(t - s_n)}}{e^{-\lambda t} (\lambda t)^n / n!} = \frac{n!}{t^n}$$
That is, conditional on $N_t = n$, the arrival times are distributed as the order statistics of $n$ i.i.d. uniform$(0, t)$ r.v.
Similarly, for $s < t$ and $m \leq n$:
$$\Pr(N_s = m \mid N_t = n) = \frac{\Pr(N_s = m, N_t = n)}{\Pr(N_t = n)} = \frac{\Pr(N_s = m)\Pr(N_{t-s} = n - m)}{\Pr(N_t = n)}$$
$$= \frac{\dfrac{(\lambda s)^m e^{-\lambda s}}{m!} \cdot \dfrac{(\lambda(t-s))^{n-m} e^{-\lambda(t-s)}}{(n-m)!}}{\dfrac{(\lambda t)^n e^{-\lambda t}}{n!}} = \binom{n}{m}\left(\frac{s}{t}\right)^m \left(\frac{t-s}{t}\right)^{n-m}$$
i.e. given $N_t = n$, the number of events by time $s < t$ is binomial$(n, s/t)$.
Thinning of Poisson Process
Consider a Poisson process $\{N_t, t \geq 0\}$ with rate $\lambda$ where each event, independently of all others, is classified as either type 1 (with probability $p$) or type 2 (with probability $1 - p$). Then
$$N_t = N_1(t) + N_2(t)$$
where $\{N_1(t)\}$ and $\{N_2(t)\}$ are independent Poisson processes with rates $\lambda p$ and $\lambda(1-p)$, each satisfying $N_i(0) = 0$ and $\Pr(N_i(t+h) - N_i(t) \geq 2) = o(h)$.
For a non-homogeneous Poisson process with time-varying rate $\lambda(t)$, the mean value function is
$$m(t) = \int_0^t \lambda(y)\, dy$$
Compound Poisson Process
A compound Poisson process is
$$X(t) = \sum_{i=1}^{N_t} Y_i$$
where:
$\{N_t, t \geq 0\}$ is a Poisson process
the $Y_i$ are i.i.d. r.v. independent of $N_t$
This is useful for insurance companies, where the uncertainty in the total claim size $X(t)$ is due both to the number of claims $N(t)$, which follows a Poisson process, and to the claim sizes $Y_i$, which can have any distribution as long as they are i.i.d. Note that claim size $Y$ is independent of the number of claims $N(t)$.
The mean and variance of a compound Poisson process are given by:
$$E[X(t)] = \lambda t\, E[Y_i], \qquad \text{var}(X(t)) = \lambda t\, E[Y_i^2]$$
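The moment formulas can be verified by simulation; a sketch with hypothetical rate and claim-size choices (uniform$(0,2)$ claims), using only the standard library:

```python
import random

# Compound Poisson with lam = 3 events/unit time over t = 2,
# hypothetical uniform(0, 2) claim sizes, so E[Y] = 1 and E[X(t)] = lam*t*E[Y] = 6.
random.seed(1)
lam, t = 3.0, 2.0

def compound_poisson(lam, t):
    """One realisation: count Poisson events by time t, sum i.i.d. claim sizes."""
    n = 0
    arrival = random.expovariate(lam)
    while arrival < t:
        n += 1
        arrival += random.expovariate(lam)
    return sum(random.uniform(0, 2) for _ in range(n))

samples = [compound_poisson(lam, t) for _ in range(20000)]
mean = sum(samples) / len(samples)
assert 5.5 < mean < 6.5   # close to E[X(t)] = lam * t * E[Y] = 6
```

The variance formula uses $E[Y^2]$ rather than $\text{var}(Y)$ because the number of terms in the sum is itself random.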
A Markov jump process is a stochastic process in continuous time with a discrete state space. The Markov jump process $\{X_t, t \geq 0\}$ has a continuous version of the Markov property: for all $s, t \geq 0$, states $i, j$ and paths $x(u)$, $0 \leq u < t$:
$$\Pr(X_{t+s} = j \mid X_t = i,\; X_u = x(u),\; 0 \leq u < t) = \Pr(X_{t+s} = j \mid X_t = i)$$
i.e. the process at time $t+s$ is conditional only on the state at time $t$ and independent of its past states before time $t$, i.e. of $X_u = x(u)$ for $0 \leq u < t$. This means the time path up to the state at time $t$ does not matter.
The Markov jump process is a homogeneous process, i.e. all transition probabilities are stationary, if:
$$\Pr(X_{t+s} = j \mid X_t = i) = \Pr(X_s = j \mid X_0 = i) = P_{ij}(s)$$
i.e. independent of $t$. Hence, the transition probability over a time period depends only on the duration of the period.
The Poisson process is a continuous-time Markov chain:
$$\Pr(X_{s+t} = j \mid X_s = i) = \Pr(j - i \text{ jumps in } (s, s+t] \mid i \text{ jumps in } (0, s])$$
$$= \Pr(j - i \text{ jumps in } (s, s+t]) \quad \text{(independent increments)}$$
$$= \Pr(j - i \text{ jumps in } (0, t]) \quad \text{(stationary increments)} = e^{-\lambda t} \frac{(\lambda t)^{j-i}}{(j-i)!}$$
Let $T_i$ be the time the process spends in state $i$. Then
$$\Pr(T_i > s + t \mid T_i > s) = \Pr(X_v = i,\; s < v \leq s+t \mid X_u = i,\; 0 \leq u \leq s)$$
$$= \Pr(X_v = i,\; s < v \leq s+t \mid X_s = i) \quad \text{(Markov property)}$$
$$= \Pr(X_v = i,\; 0 < v \leq t \mid X_0 = i) \quad \text{(stationary increments)} = \Pr(T_i > t)$$
so the holding time is memoryless. The only continuous distribution with the memoryless property is the exponential distribution. Hence, the time spent in a state is exponentially distributed with:
$$T_i \sim \exp(v_i), \qquad \Pr(T_i \leq t) = 1 - e^{-v_i t}$$
The expected time spent in state $i$ before transitioning into another state is:
$$E[T_i] = \frac{1}{v_i}$$
where $v_i$ is the rate of transition when in state $i$, i.e. the transition rate out of state $i$ to another state. It is the sum of all the rates of jumping from state $i$ to another state $j$, denoted $q_{ij}$:
$$v_i = \sum_{j \neq i} q_{ij} = -q_{ii}$$
For a homogeneous process the transition probabilities satisfy
$$P_{ij}(t, t+s) = \Pr(X_{t+s} = j \mid X_t = i) = \Pr(X_s = j \mid X_0 = i) = P_{ij}(s)$$
with the initial conditions
$$P_{ij}(s, s) = P_{ij}(0) = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$
Transition Rates of Homogeneous Processes
The instantaneous transition rate $q_{ij}$ from state $i$ to state $j$ of a continuous-time Markov chain is defined as the rate of change of the transition probability over a small period of time:
$$\left.\frac{\partial}{\partial t} P_{ij}(t)\right|_{t=0} = \begin{cases} \displaystyle\lim_{h \to 0} \frac{P_{ij}(h)}{h} = q_{ij} & \text{if } i \neq j \\[3mm] \displaystyle\lim_{h \to 0} \frac{P_{ii}(h) - 1}{h} = q_{ii} = -v_i & \text{if } i = j \end{cases}$$
where $v_i = -q_{ii}$ denotes the transition rate out of state $i$ when the process is in state $i$. Note that:
$$\sum_j P_{ij}(t) = 1 \implies \sum_j q_{ij} = 0$$
Equivalently, the transition probability over a small time $h$ can be defined as:
$$P_{ij}(h) = \begin{cases} q_{ij} h + o(h) & \text{if } i \neq j \\ 1 + q_{ii} h + o(h) & \text{if } i = j \end{cases}$$
For a non-homogeneous process the transition rates depend on time:
$$\left.\frac{\partial}{\partial t} P_{ij}(s, t)\right|_{t=s} = \lim_{h \to 0} \frac{P_{ij}(s, s+h) - P_{ij}(s, s)}{h} = \begin{cases} \displaystyle\lim_{h \to 0} \frac{P_{ij}(s, s+h)}{h} = q_{ij}(s) & \text{if } i \neq j \\[3mm] \displaystyle\lim_{h \to 0} \frac{P_{ii}(s, s+h) - 1}{h} = q_{ii}(s) = -v_i(s) & \text{if } i = j \end{cases}$$
Equivalently, the transition probability over a small time $h$ can be defined as:
$$P_{ij}(s, s+h) = \begin{cases} q_{ij}(s) h + o(h) & \text{if } i \neq j \\ 1 + q_{ii}(s) h + o(h) & \text{if } i = j \end{cases}$$
Now, ignore when the transitions occur and how long is spent in each state, and consider only the series of states that the process transitions into. Let:
$v_i$ be the transition rate OUT of state $i$ when the process is in state $i$
$P_{ij}$ be the probability that the transition is into state $j$, conditional on the fact that a transition has OCCURRED and the process is currently in state $i$ (see embedded Markov chain)
Then $q_{ij}$, the transition rate INTO state $j$ when the process is in state $i$, is given by:
$$q_{ij} = v_i P_{ij}$$
Therefore,
$$P_{ij} = \frac{q_{ij}}{v_i} = \frac{q_{ij}}{\sum_{j \neq i} q_{ij}}$$
Chapman-Kolmogorov Equations
For a homogeneous continuous-time Markov chain the Chapman-Kolmogorov equations are:
$$P_{ij}(t+s) = \sum_{k=0}^{\infty} P_{ik}(t) P_{kj}(s), \qquad P_{ij}(0) = \begin{cases} 0 & \text{if } i \neq j \\ 1 & \text{if } i = j \end{cases}$$
or in matrix form, with generator matrix $Q = [q_{ij}]$ and $P(t) = [P_{ij}(t)]$:
$$P(t+s) = P(t) P(s)$$
The Kolmogorov backward equations are:
$$\frac{\partial}{\partial t} P(t) = Q\, P(t)$$
and the forward equations are:
$$\frac{\partial}{\partial t} P(t) = P(t)\, Q$$
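Both Kolmogorov equations, with $P(0) = I$, are solved by the matrix exponential $P(t) = e^{Qt}$; a sketch with a hypothetical two-state generator, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical two-state generator: rate 0.5 out of state 0, rate 2.0 out of
# state 1. Each row of Q sums to zero.
Q = np.array([
    [-0.5, 0.5],
    [ 2.0, -2.0],
])

def P(t):
    """Transition matrix P(t) = exp(Q t), solving P'(t) = Q P(t) = P(t) Q."""
    return expm(Q * t)

# Semigroup / Chapman-Kolmogorov property: P(t + s) = P(t) P(s)
assert np.allclose(P(0.7), P(0.3) @ P(0.4))
# Each row of P(t) is a probability distribution.
assert np.allclose(P(1.0).sum(axis=1), 1.0)
```

The semigroup identity holds because $e^{Q(t+s)} = e^{Qt} e^{Qs}$ when the exponents share the same generator $Q$.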
Limiting Probabilities
The limiting probability $P_j$ of a continuous-time Markov chain is the long-run probability, or the long-run proportion of time, that the process will be in state $j$, independent of the initial state:
$$P_j = \lim_{t \to \infty} P_{ij}(t)$$
The limiting probabilities satisfy the balance equations together with $\sum_k P_k = 1$:
$$v_j P_j = \sum_{k \neq j} q_{kj} P_k \quad \Longleftrightarrow \quad \sum_k q_{kj} P_k = 0 \quad \Longleftrightarrow \quad \mathbf{P} Q = \mathbf{0}$$
This equation implies that the rate at which the process leaves state $j$ is equal to the rate at which the process enters state $j$:
$v_j P_j$ is the rate at which the process leaves state $j$, since $P_j$ is the long-run proportion of time the process is in state $j$ and $v_j$ is the rate of transition out of state $j$ when the process is in state $j$
$\sum_{k \neq j} q_{kj} P_k$ is the rate at which the process enters state $j$, since $P_k$ is the long-run proportion of the time the process is in state $k$ and $q_{kj}$ is the rate of transition from state $k$ to state $j$
For the limiting probabilities $\lim_{t \to \infty} P_{ij}(t)$ to exist, the conditions are:
All states communicate, so that starting in state $i$ there is a positive probability of being in state $j$
The Markov chain is positive recurrent, so that starting in any state, the mean time to return to that state is finite
When the limiting probabilities exist, the Markov chain is ergodic; note that aperiodicity is unnecessary, as periodicity does not apply to continuous-time Markov chains
If the initial state is chosen according to the limiting distribution, then the probability of being in state $j$ at time $t$ is the same for all $t$ (homogeneous):
If $\Pr(X_0 = j) = P_j$, then $\Pr(X_t = j) = P_j$
Assuming that the embedded Markov chain, with transition probabilities $P_{ij} = q_{ij} / \sum_{j \neq i} q_{ij}$, is ergodic, its limiting probabilities $\pi_i$, i.e. the long-run proportion of transitions into state $i$, are the unique solutions of the set of equations:
$$\pi_i = \sum_j \pi_j P_{ji} \qquad \text{and} \qquad \sum_i \pi_i = 1$$
The proportion of time the continuous-time process spends in state $i$, i.e. the limiting probabilities of the original continuous Markov process, can also be found using:
$$P_i = \frac{\pi_i / v_i}{\sum_j \pi_j / v_j}$$
Note that $\pi_i$, the limiting probability of the embedded Markov chain, is the proportion of transitions into state $i$; multiplied by $1/v_i$, the mean time spent in state $i$ during a visit, it gives the proportion of time spent in state $i$.
Time Reversibility
Going backwards, given the process is in state $i$ at time $t$, the probability that the process has been in state $i$ for an amount of time greater than $s$ is $e^{-v_i s}$:
$$\Pr(X_u = i \text{ for } u \in [t-s, t] \mid X_t = i) = \Pr(T_i > s) = e^{-v_i s}$$
Thus, the continuous-time Markov chain will be time reversible, i.e. have the same probability structure as the original process, if the embedded chain is time reversible, i.e.
$$\pi_i P_{ij} = \pi_j P_{ji}$$
Using the proportion of time the continuous-time chain is in state $i$, $P_i = \dfrac{\pi_i / v_i}{\sum_j \pi_j / v_j}$, the condition $\pi_i P_{ij} = \pi_j P_{ji}$ becomes
$$P_i v_i P_{ij} = P_j v_j P_{ji}$$
Note that $q_{ij} = v_i P_{ij}$; then we have an equivalent condition for time reversibility:
$$P_i q_{ij} = P_j q_{ji}$$
i.e. the rate at which the process goes directly from $i$ to $j$ is the same as the rate to go directly from $j$ to $i$.
Birth and Death Process
A birth and death process is a continuous-time Markov chain on states $0, 1, 2, \ldots$ in which, when the process is in state $n$:
The time until the next arrival (birth) is exponentially distributed with mean $1/\lambda_n$
The time until the next departure (death) is exponentially distributed with mean $1/\mu_n$
Transitions occur only to neighbouring states $j = i \pm 1$, with rates out of each state
$$v_0 = \lambda_0, \qquad v_i = \lambda_i + \mu_i \quad \text{for } i \geq 1$$
For the corresponding embedded Markov chain, let the transition probability $P_{ij}$ denote the transition probabilities between states. If the Markov chain is in state 0 and a transition occurs, it must transition into state 1, therefore:
$$P_{01} = 1$$
To derive the transition probabilities of the embedded Markov chain, consider a population of size $i$. It will jump to $i+1$ if a birth occurs before a death, and jump to $i-1$ if a death occurs before a birth:
The time to a birth $T_b$ is exponential with rate $\lambda_i$
The time to a death $T_d$ is exponential with rate $\mu_i$
Therefore, the probability that a birth occurs before a death is given by (minimum of independent exponentials):
$$\Pr(T_b < T_d) = \frac{\lambda_i}{\lambda_i + \mu_i} \quad \text{for } i > 0$$
so that
$$P_{i,i+1} = \frac{\lambda_i}{\lambda_i + \mu_i}, \qquad P_{i,i-1} = \frac{\mu_i}{\lambda_i + \mu_i} \quad \text{for } i = 1, 2, \ldots$$
Example: M/M/s queue. Customers arrive at rate $\lambda$ and are served by $s$ servers, each serving at rate $\mu$:
$$\lambda_n = \lambda \quad \text{for } n \geq 0, \qquad \mu_n = \min(n, s)\,\mu = \begin{cases} n\mu & \text{for } 1 \leq n \leq s \\ s\mu & \text{for } n > s \end{cases}$$
Poisson Process
A Poisson process is a special case of a birth and death process with no deaths:
$$\lambda_n = \lambda \quad \text{for } n \geq 0, \qquad \mu_n = 0 \quad \text{for } n \geq 0, \qquad P_{i,i+1} = 1 \quad \text{for } i = 0, 1, 2, \ldots$$
Population Growth
Each individual in a population gives birth at an exponential rate $\lambda$, plus there is immigration at an exponential rate $\theta$. Each individual has an exponential death rate $\mu$:
$$\lambda_n = n\lambda + \theta \quad \text{for } n \geq 0, \qquad \mu_n = n\mu \quad \text{for } n \geq 1$$
Expected Time in States
Let $T_i$ denote the time, starting from state $i$, until the process first reaches state $i+1$. The time to the first transition, regardless of whether it is a birth or a death, is exponential with rate $\lambda_i + \mu_i$; if the first transition is a death, the time taken to reach $i+1$ is the time to go from $i-1$ back to $i$ and then from $i$ to $i+1$. Conditioning on the first transition:
$$E[T_i] = \frac{1}{\lambda_i + \mu_i} + \frac{\mu_i}{\lambda_i + \mu_i}\left(E[T_{i-1}] + E[T_i]\right)$$
which rearranges to the recursion
$$E[T_i] = \frac{1}{\lambda_i} + \frac{\mu_i}{\lambda_i} E[T_{i-1}], \qquad E[T_0] = \frac{1}{\lambda_0}$$
In the case where the rates are homogeneous, i.e. $\lambda_i = \lambda$ and $\mu_i = \mu$ for all $i$:
$$E[T_i] = \frac{1}{\lambda}\left(1 + \frac{\mu}{\lambda} + \cdots + \left(\frac{\mu}{\lambda}\right)^i\right) = \frac{1}{\lambda}\sum_{k=0}^{i}\left(\frac{\mu}{\lambda}\right)^k$$
Then the expected time to go from state $i$ to a higher state $j$ is:
$$E[\text{time from } i \text{ to } j] = E[T_i] + E[T_{i+1}] + \cdots + E[T_{j-1}]$$
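The recursion for $E[T_i]$ is easy to evaluate numerically; a sketch with hypothetical constant rates $\lambda = 2$, $\mu = 1$, checked against the closed form above:

```python
# Expected time for a birth-death process to go from state i to i+1, via
# E[T_i] = 1/lam_i + (mu_i/lam_i) * E[T_{i-1}], with E[T_0] = 1/lam_0.
def expected_hitting_times(lam, mu, n):
    """E[T_i] for i = 0..n-1, given rate functions lam(i) and mu(i)."""
    times = [1.0 / lam(0)]
    for i in range(1, n):
        times.append(1.0 / lam(i) + mu(i) / lam(i) * times[i - 1])
    return times

# Hypothetical homogeneous rates lam = 2, mu = 1.
T = expected_hitting_times(lambda i: 2.0, lambda i: 1.0, 4)

# Closed form for constant rates: E[T_i] = (1/lam) * sum_{k=0..i} (mu/lam)^k
expected = [sum(0.5**k for k in range(i + 1)) / 2.0 for i in range(4)]
assert all(abs(a - b) < 1e-12 for a, b in zip(T, expected))

# Expected time from state 0 to state 3 is the sum E[T_0] + E[T_1] + E[T_2].
time_0_to_3 = sum(T[:3])
```

Passing the rates as functions keeps the same code usable for the population-growth model ($\lambda_n = n\lambda + \theta$, $\mu_n = n\mu$).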
Kolmogorov Equations
We can use the transition rates of a birth and death process to write down the Kolmogorov equations, and then solve these differential equations to find the transition probabilities. The non-zero transition rates are:
$$q_{i,i+1} = \lambda_i, \qquad q_{i,i-1} = \mu_i, \qquad q_{i,i} = -v_i = -(\lambda_i + \mu_i), \qquad q_{0,1} = \lambda_0, \qquad q_{0,0} = -v_0 = -\lambda_0$$
and all other rates are zero.
Kolmogorov Backward Equations, for all $i \geq 1$:
$$\frac{\partial}{\partial t} P_{i,j}(t) = \lambda_i P_{i+1,j}(t) + \mu_i P_{i-1,j}(t) - (\lambda_i + \mu_i) P_{i,j}(t)$$
Kolmogorov Forward Equations:
$$\frac{\partial}{\partial t} P_{i,0}(t) = \mu_1 P_{i,1}(t) - \lambda_0 P_{i,0}(t)$$
$$\frac{\partial}{\partial t} P_{i,j}(t) = \lambda_{j-1} P_{i,j-1}(t) + \mu_{j+1} P_{i,j+1}(t) - (\lambda_j + \mu_j) P_{i,j}(t)$$
Limiting Probabilities
The limiting probabilities are determined by the balance equations together with $\sum_k P_k = 1$:
$$(\lambda_n + \mu_n) P_n = \lambda_{n-1} P_{n-1} + \mu_{n+1} P_{n+1}$$
This states that the rate at which the process leaves state $n$ (LHS) is equal to the rate at which the process enters state $n$ (RHS). For a birth and death process, this condition breaks down iteratively to the detailed balance equations:
$$\lambda_n P_n = \mu_{n+1} P_{n+1}$$
Then in general:
$$P_1 = \frac{\lambda_0}{\mu_1} P_0, \qquad P_2 = \frac{\lambda_1}{\mu_2} P_1 = \frac{\lambda_1 \lambda_0}{\mu_2 \mu_1} P_0, \qquad \ldots, \qquad P_n = \frac{\lambda_{n-1}}{\mu_n} P_{n-1} = \frac{\lambda_{n-1} \lambda_{n-2} \cdots \lambda_0}{\mu_n \mu_{n-1} \cdots \mu_1} P_0$$
Using $\sum_n P_n = 1$:
$$P_0 = \frac{1}{1 + \displaystyle\sum_{n=1}^{\infty} \frac{\lambda_{n-1} \cdots \lambda_0}{\mu_n \cdots \mu_1}}, \qquad P_n = \frac{\dfrac{\lambda_{n-1} \cdots \lambda_0}{\mu_n \cdots \mu_1}}{1 + \displaystyle\sum_{n=1}^{\infty} \frac{\lambda_{n-1} \cdots \lambda_0}{\mu_n \cdots \mu_1}}$$
This also gives the condition for the long-run probabilities to exist, i.e.:
$$\sum_{n=1}^{\infty} \frac{\lambda_{n-1} \lambda_{n-2} \cdots \lambda_0}{\mu_n \mu_{n-1} \cdots \mu_1} < \infty$$
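The product formula can be evaluated directly; a sketch for the M/M/1 queue (hypothetical rates $\lambda = 1$, $\mu = 2$), where the answer is known to be geometric, truncating the state space at a large $N$:

```python
# Limiting probabilities of a birth-death process via
# P_n = P_0 * prod_{j=1..n} lam_{j-1}/mu_j, for an M/M/1 queue.
lam, mu = 1.0, 2.0
N = 200   # truncation level; the tail beyond N is negligible here

weights = [1.0]                            # unnormalised P_n / P_0
for n in range(1, N):
    weights.append(weights[-1] * lam / mu) # multiply by lam_{n-1} / mu_n

total = sum(weights)
P = [w / total for w in weights]

# For M/M/1 the exact answer is geometric: P_n = (1 - rho) rho^n, rho = lam/mu.
rho = lam / mu
assert abs(P[0] - (1 - rho)) < 1e-10
assert abs(P[3] - (1 - rho) * rho**3) < 1e-10
```

The truncation is harmless because $\rho^{200}$ is far below the tolerance; the existence condition above is exactly the requirement that the untruncated sum of weights converges.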
Application: Pure Birth Process
A pure birth process has $\mu_n = 0$ for all $n \geq 0$, so the only transitions are births. For a constant birth rate $\lambda_i = \lambda$, rearranging the Kolmogorov backward equations:
$$\frac{\partial}{\partial t} P_{i,j}(t) = \lambda P_{i+1,j}(t) - \lambda P_{i,j}(t) \implies \frac{\partial}{\partial t} P_{i,j}(t) + \lambda P_{i,j}(t) = \lambda P_{i+1,j}(t)$$
Multiply both sides by $e^{\lambda t}$:
$$\frac{\partial}{\partial t}\left(e^{\lambda t} P_{i,j}(t)\right) = \lambda e^{\lambda t} P_{i+1,j}(t)$$
Integrating from 0 to $s$, with $P_{i,j}(0) = 0$ for $i \neq j$:
$$e^{\lambda s} P_{i,j}(s) - e^{0} P_{i,j}(0) = \int_0^s \lambda e^{\lambda t} P_{i+1,j}(t)\, dt \implies P_{i,j}(s) = \int_0^s \lambda e^{-\lambda(s-t)} P_{i+1,j}(t)\, dt$$
The LHS is the probability of moving from state $i$ to $j$ over the time period 0 to $s$.
The RHS is the probability of the process staying in state $i$ for a time $s - t$, i.e. $e^{-\lambda(s-t)}$, then jumping to state $i+1$ over the time interval $(s-t, s-t+dt)$, i.e. $\lambda\, dt$, then the probability that the process starts in $i+1$ and finishes in $j$ over a time interval of length $t$, i.e. $P_{i+1,j}(t)$, integrated over all possible times from 0 to $s$, which is equivalent to $t$ varying from 0 to $s$.
Application: Simple Sickness Model
In a simple sickness model, an individual can be in state 0 (healthy) or in state 1 (sick).
An individual remains healthy for an exponential time with mean $1/\sigma$ before becoming sick.
An individual remains sick for an exponential time with mean $1/\rho$ before becoming healthy.
Solving the Kolmogorov equations with rates $\sigma$ (healthy to sick) and $\rho$ (sick to healthy) gives:
$$P_{0,0}(s) = \frac{\rho}{\sigma + \rho} + \frac{\sigma}{\sigma + \rho} e^{-(\sigma + \rho)s}, \qquad P_{0,1}(s) = \frac{\sigma}{\sigma + \rho} \left( 1 - e^{-(\sigma + \rho)s} \right)$$
$$P_{1,0}(s) = \frac{\rho}{\sigma + \rho} \left( 1 - e^{-(\sigma + \rho)s} \right), \qquad P_{1,1}(s) = \frac{\sigma}{\sigma + \rho} + \frac{\rho}{\sigma + \rho} e^{-(\sigma + \rho)s}$$

Hmm, note the distinction: $\bar{P}_{1,1}(s)$ is the probability that a sick individual remains sick for the whole period, without ever recovering. By the Markov property, the time spent in state 0 or state 1 is exponential with the memoryless property:
$$\bar{P}_{0,0}(s) = \Pr(T_0 > s) = 1 - \Pr(T_0 \le s) = 1 - (1 - e^{-\sigma s}) = e^{-\sigma s}$$
$$\bar{P}_{1,1}(s) = \Pr(T_1 > s) = 1 - \Pr(T_1 \le s) = 1 - (1 - e^{-\rho s}) = e^{-\rho s}$$
The occupancy time O(t) is the total time that the process spends in each state during the interval
(0,t). If we define the indicator function:
$$I(s) = \begin{cases} 1 & \text{if } X(s) = 1 \\ 0 & \text{if } X(s) = 0 \end{cases}$$
Then the occupation time for being sick is:
$$O(t) = \int_0^t I(s)\, ds$$
The expected occupation time being sick, given that the initial state is healthy, is given by:
$$E[O(t) \mid X(0) = 0] = E\left[ \int_0^t I(s)\, ds \,\Big|\, X(0) = 0 \right] = \int_0^t E[I(s) \mid X(0) = 0]\, ds = \int_0^t \Pr(X(s) = 1 \mid X(0) = 0)\, ds$$
$$= \int_0^t P_{0,1}(s)\, ds = \int_0^t \frac{\sigma}{\sigma + \rho} \left( 1 - e^{-(\sigma + \rho)s} \right) ds$$
First Holding Time
The first holding time is the first time the process leaves its initial state:
$$T_0 = \inf\{ t : X(t) \ne X(0) \}$$
For a homogenous Markov jump process, this is exponentially distributed with rate $\lambda_i$, which is the rate of jumping out of state $i$, previously denoted $v_i$:
$$\lambda_i = v_i = \sum_{j \ne i} q_{ij}$$
Thus, the first holding time has survival function and p.d.f.:
$$\Pr(T_0 > t \mid X_0 = i) = e^{-\lambda_i t}, \qquad f_{T_0}(t \mid X_0 = i) = \lambda_i e^{-\lambda_i t}$$
The probability of the state to which the process jumps is:
$$\Pr(X_{T_0} = j \mid X_0 = i) = \frac{q_{ij}}{v_i} = P_{ij}$$
where $X_{T_0}$ is independent of $T_0$.
Non-homogenous Markov Jump Processes
For a non-homogenous Markov jump process, the transition rates depend on time:
$$\sigma_{ij}(s) = \lim_{h \to 0} \frac{P_{ij}(s, s+h) - \delta_{ij}}{h} = \frac{\partial}{\partial t} P_{ij}(s,t) \Big|_{t=s}$$
So the non-homogenous transition probabilities over a small time period $h$ are:
$$P_{ij}(s, s+h) = \begin{cases} h\, \sigma_{ij}(s) + o(h) & \text{if } i \ne j \\ 1 + h\, \sigma_{ii}(s) + o(h) & \text{if } i = j \end{cases}$$
where $\sigma_{ii}(s) = -\sum_{j \ne i} \sigma_{ij}(s)$. Similarly, looking backwards over a small period:
$$P_{ij}(s-h, s) = \begin{cases} h\, \sigma_{ij}(s-h) + o(h) & \text{if } i \ne j \\ 1 + h\, \sigma_{ii}(s-h) + o(h) & \text{if } i = j \end{cases}$$
The Chapman-Kolmogorov equations are:
$$P(s,t) = P(s,u)\, P(u,t), \qquad P(s,s) = I$$
The Kolmogorov forward equations are derived by differentiating the Chapman-Kolmogorov equation w.r.t. $t$, then setting $u = t$:
$$\frac{\partial}{\partial t} P(s,t) = P(s,t)\, Q(t), \qquad \frac{\partial}{\partial t} P_{ij}(s,t) = \sum_{k \ne j} P_{ik}(s,t)\, \sigma_{kj}(t) - v_j(t) P_{ij}(s,t)$$
The Kolmogorov backward equations are derived by differentiating the Chapman-Kolmogorov equation w.r.t. $s$, then setting $u = s$:
$$\frac{\partial}{\partial s} P(s,t) = -Q(s)\, P(s,t), \qquad \frac{\partial}{\partial s} P_{ij}(s,t) = -\sum_{k \ne i} \sigma_{ik}(s) P_{kj}(s,t) + v_i(s) P_{ij}(s,t)$$
Where $Q(t)$ is the matrix containing $\sigma_{ij}(t)$ for all $i, j$, and $v_i(s) = -\sigma_{ii}(s)$.
Residual Holding Time
The residual holding time $R_s$ is the remaining time spent in the current state:
$$\{ R_s > w,\, X_s = i \} = \{ X_u = i \text{ for } s \le u \le s + w \}$$
i.e. the process remains in the same state $i$ between time $s$ and $s+w$. Note that this is the non-homogenous case of the first holding time. We can show that:
$$\Pr(R_s > w \mid X_s = i) = \exp\left( -\int_s^{s+w} v_i(u)\, du \right)$$
The density of $R_s \mid X_s = i$ is:
$$-\frac{\partial}{\partial w} \Pr(R_s > w \mid X_s = i) = v_i(s+w) \exp\left( -\int_s^{s+w} v_i(u)\, du \right)$$
Define $X_{s^+} = X(s + R_s)$ as the state the process jumps to at the next jump. The conditional distribution of this is:
$$\Pr(X_{s^+} = j \mid X_s = i, R_s = w) = \frac{\sigma_{ij}(s+w)}{v_i(s+w)}$$
Combining these gives the integral form of the Kolmogorov backward equation:
$$P_{ij}(s,t) = \sum_{k \ne i} \int_0^{t-s} \exp\left( -\int_s^{s+w} v_i(u)\, du \right) \sigma_{ik}(s+w)\, P_{kj}(s+w, t)\, dw$$
(for $j = i$ there is in addition the probability $\exp(-\int_s^t v_i(u)\, du)$ of never leaving state $i$). This is the conditional probability that a process is in state $j$ at time $t$, given that it started in state $i$ at time $s$. This transition can be done by:
(1) Staying in state $i$ from time $s$ for a duration of $w$, i.e. $\exp(-\int_s^{s+w} v_i(u)\, du)$
(2) Jumping to some state $k \ne i$ at time $s + w$, i.e. $\sigma_{ik}(s+w)\, dw$
(3) Moving from state $k$ at time $s + w$ to state $j$ at time $t$, i.e. $P_{kj}(s+w, t)$
summed over all $k \ne i$ and integrated over all jump times $w$ from 0 to $t - s$.
Current Holding Time
The current holding time $C_t$ is the time already spent in the current state:
$$\{ C_t > w,\, X_t = j \} = \{ X_u = j \text{ for } t - w \le u \le t \}$$
We can show that:
$$\Pr(C_t > w \mid X_t = j) = \exp\left( -\int_{t-w}^{t} v_j(u)\, du \right)$$
The density of $C_t \mid X_t = j$ is given by:
$$-\frac{\partial}{\partial w} \Pr(C_t > w \mid X_t = j) = v_j(t-w) \exp\left( -\int_{t-w}^{t} v_j(u)\, du \right)$$
Then we have the transition probability as:
$$P_{ij}(s,t) = \sum_{k \ne j} \int_0^{t-s} P_{ik}(s, t-w)\, \sigma_{kj}(t-w) \exp\left( -\int_{t-w}^{t} v_j(u)\, du \right) dw$$
This is the integral form of the Kolmogorov forward equation, i.e. we consider the jump ($q$) last.
Part 2: Time Series
Let $\{X_t\}$ denote the time series, where each $X_t$ is a r.v.:
$$\dots, X_{t-1}, X_t, X_{t+1}, \dots$$
Therefore, $x_{t-1}, x_t, x_{t+1}$ are realisations of the r.v. $X_{t-1}, X_t, X_{t+1}$. A time series model for $\{x_t\}$ is a family of distributions to which the joint distribution of $\{X_t\}$ is assumed to belong.
Classical Decomposition Model
The classical decomposition model decomposes the original data Xt into 3 components
$$X_t = T_t + S_t + N_t, \qquad E[N_t] = 0, \qquad S_{t+d} = S_t, \qquad \sum_{j=1}^{d} S_{t+j} = 0$$
Where:
Tt is a deterministic trend component that is slowly changing and perfectly predictable
$S_t$ is a deterministic seasonal component with a known period of $d$ and perfectly predictable. Note that the seasonal component sums to zero over a complete cycle; e.g. $d$ would be 4 if the data is quarterly.
Nt is a random component with expected value 0, as all information should be captured in
the trend and seasonal component. Nt may be correlated and hence partially predictable
Moving Average Linear Filters
A moving average linear filter has the form:
$$\hat{T}_t = \sum_{j=-q}^{q} a_j X_{t+j} = \frac{1}{2q+1} \sum_{j=-q}^{q} X_{t+j}$$
If the period $d$ is odd, i.e. $d = 2q + 1$, use the filter:
$$\hat{T}_t = \frac{1}{d} X_{t-q} + \frac{1}{d} X_{t-q+1} + \dots + \frac{1}{d} X_{t+q}$$
If the period $d$ is even, i.e. $d = 2q$, use the filter:
$$\hat{T}_t = \frac{1}{d} \left( \tfrac{1}{2} X_{t-q} + X_{t-q+1} + \dots + X_{t+q-1} + \tfrac{1}{2} X_{t+q} \right)$$
Applying the filter to the decomposition, the seasonal terms cancel over a full cycle and the noise terms average out, so the filter recovers the trend:
$$\frac{1}{2q+1} \sum_{j=-q}^{q} X_{t+j} = \frac{1}{2q+1} \sum_{j=-q}^{q} T_{t+j} + \frac{1}{2q+1} \sum_{j=-q}^{q} N_{t+j} \approx T_t$$
The approximation $\hat{T}_t \approx T_t$ holds if the trend is approximately linear over the filter window and the average of the noise terms over the window is close to zero. Once the trend and seasonal components have been estimated, the noise component is estimated by:
$$\hat{N}_t = x_t - \hat{T}_t - \hat{S}_t$$
Differencing
The backshift operator, or the lag operator, B is defined by:
$$B X_t = X_{t-1}, \qquad B^j X_t = X_{t-j}$$
The difference operator $\nabla$ is defined by:
$$\nabla X_t = (1 - B) X_t = X_t - X_{t-1}$$
The powers of $\nabla$ are defined by:
$$\nabla^j X_t = (1 - B)^j X_t, \qquad \nabla^0 X_t = X_t$$
The difference operator with lag $d$ is defined by:
$$\nabla_d X_t = (1 - B^d) X_t = X_t - X_{t-d}$$
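These operators map directly onto array operations; a minimal sketch (the example series $x_t = t^2$ is hypothetical):

```python
import numpy as np

# nabla = (1 - B) and nabla_d = (1 - B^d) as array operations
x = np.array([1.0, 4.0, 9.0, 16.0, 25.0, 36.0])  # x_t = t^2, a quadratic trend

first_diff = np.diff(x)         # nabla x_t = x_t - x_{t-1}
second_diff = np.diff(x, n=2)   # nabla^2 x_t: a degree-2 trend becomes constant
lag3_diff = x[3:] - x[:-3]      # nabla_3 x_t = x_t - x_{t-3}
```

Note how twice-differencing the quadratic trend leaves the constant $2! \cdot 1 = 2$, in line with the polynomial-trend result below.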
Eliminating the Trend
Trend can be eliminated by using differencing. E.g. consider a linear trend:
X t Tt Nt , Tt c0 c1t
Apply differencing to the power of one:
$$\nabla X_t = \nabla T_t + \nabla N_t = \left( c_0 + c_1 t \right) - \left( c_0 + c_1 (t-1) \right) + \nabla N_t = c_1 + \nabla N_t$$
i.e. the trend becomes a constant $c_1$, which is stationary. This constant can be estimated by the sample average of $\nabla X_t$. In general, a trend of a polynomial of degree $k$ can be reduced to a constant by differencing $k$ times:
$$T_t = c_0 + c_1 t + \dots + c_k t^k \;\Rightarrow\; \nabla^k X_t = \nabla^k T_t + \nabla^k N_t = k!\, c_k + \nabla^k N_t$$
Stationarity
The theory of stationary random processes is important in modelling time series, as stationarity
allows parameters to be estimated efficiently, since we can treat all samples as from the same
distribution. A non-stationary random process must be transformed into a stationary one before
analysis and modelling, i.e. by removing trend and seasonality, or by applying transformations.
A random process $\{X_t\}$ is said to be integrated of order $n$, i.e. $I(n)$, if $\nabla^{n-1} X_t$ is still not stationary but the $n$-th differenced series $\nabla^n X_t$ is stationary.
Recall the properties of covariance and correlation:
$$\gamma_X(0) = \operatorname{Cov}(X_t, X_t) = \operatorname{var}(X_t), \qquad \operatorname{Corr}(X_t, X_{t+\tau}) = \frac{\gamma_X(\tau)}{\gamma_X(0)}, \qquad \operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}$$
$$\operatorname{Cov}(aX, bY) = ab\operatorname{Cov}(X, Y)$$
$$\operatorname{Cov}(aX + bY, cU + dV) = ac\operatorname{Cov}(X,U) + ad\operatorname{Cov}(X,V) + bc\operatorname{Cov}(Y,U) + bd\operatorname{Cov}(Y,V)$$
A process $\{X_t\}$ is said to be weakly stationary if:
$$E[X_t] = \mu \text{ for all } t, \qquad \operatorname{Cov}(X_t, X_{t+\tau}) = \gamma_X(\tau)$$
i.e. the mean is constant and the covariance of the process only depends on the time difference $\tau$.
Noise
I.I.D Noise
X t is i.i.d. noise if X t and X t h are independently and identically distributed with mean zero, i.e.
no covariance.
$$X_t \sim \text{IID}(0, \sigma^2)$$
Assuming $E[X_t^2] < \infty$, as for weakly stationary series we need bounded first and second moments, then:
$$\mu_X = 0, \qquad \gamma_X(h) = \begin{cases} \sigma^2 & \text{if } h = 0 \\ 0 & \text{if } h \ne 0 \end{cases}$$
White Noise
X t is white noise with zero mean if:
$$X_t \sim \text{WN}(0, \sigma^2)$$
Where:
$$\mu_X = 0, \qquad \gamma_X(h) = \begin{cases} \sigma^2 & \text{if } h = 0 \\ 0 & \text{if } h \ne 0 \end{cases}$$
IID noise is white noise, but white noise is not necessarily IID noise.
White noise is weakly stationary
Usually, we assume that the error terms have a normal distribution for the purpose of parameter estimation:
X t ~ N 0, 2
Linear Processes
$\{X_t\}$ is a linear process if it can be represented as:
$$X_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j} = \sum_{j=-\infty}^{\infty} \psi_j B^j Z_t = \psi(B) Z_t$$
Linear processes are stationary because they are a linear combination of stationary white noise
terms. For linear processes, the regularity condition $\sum_j |\psi_j| < \infty$ holds, i.e. $\sum_j \psi_j$ is absolutely convergent. This ensures that the infinite sum can be manipulated the same way as a finite sum, i.e. two absolutely convergent series can be added or multiplied together.
In general, if $\{Y_t\}$ is stationary and $\sum_j |\psi_j| < \infty$ holds, then $X_t = \psi(B) Y_t$ is also stationary.
Moving Average Process
$\{X_t\}$ is a moving average process of order 1, MA(1), if:
$$X_t = Z_t + \theta Z_{t-1}$$
i.e. the next term depends on the current noise plus a proportion of the noise of the previous period. The mean is:
$$E[X_t] = E[Z_t] + \theta E[Z_{t-1}] = 0$$
The autocovariance function is:
$$\gamma(0) = \operatorname{Cov}(Z_t + \theta Z_{t-1},\, Z_t + \theta Z_{t-1}) = \operatorname{var}(Z_t) + \theta^2 \operatorname{var}(Z_{t-1}) = \sigma^2 (1 + \theta^2)$$
$$\gamma(1) = \operatorname{Cov}(Z_{t+1} + \theta Z_t,\, Z_t + \theta Z_{t-1}) = \theta \operatorname{var}(Z_t) = \theta \sigma^2$$
$$\gamma(h) = \operatorname{Cov}(Z_{t+h} + \theta Z_{t+h-1},\, Z_t + \theta Z_{t-1}) = 0 \quad \text{for } |h| > 1$$
So the ACF is:
$$\rho(h) = \begin{cases} 1 & \text{if } h = 0 \\ \dfrac{\theta}{1 + \theta^2} & \text{if } |h| = 1 \\ 0 & \text{if } |h| > 1 \end{cases}$$
The conditional mean is stochastic and depends on the past:
$$E[X_{t+1} \mid X_t, X_{t-1}, \dots] = E[Z_{t+1} + \theta Z_t \mid Z_t] = \theta Z_t \ne 0 = E[X_{t+1}]$$
$\{X_t\}$ is a moving average process of order q, MA(q), if the process depends on its previous q realisations of noise:
$$X_t = Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}$$
Note:
Moving average processes are stationary processes as Xt is a linear combination of stationary
white noise terms
The ACF of a MA(q) process has non-zero values up until lag q, and near-zero values for all
lags greater than q
The conditional mean/variance is used for forecasts, whereas the unconditional mean/variance gives the long run results
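A short simulation sketch of the MA(1) ACF results above (the parameter values and seed are arbitrary choices, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 0.6, 1.0, 200_000

z = rng.normal(0.0, sigma, n + 1)
x = z[1:] + theta * z[:-1]          # X_t = Z_t + theta Z_{t-1}

def sample_acf(x, h):
    """Sample ACF at lag h (1/(n-h) averaging; close to the 1/n convention for large n)."""
    xbar = x.mean()
    gamma0 = ((x - xbar) ** 2).mean()
    gammah = ((x[h:] - xbar) * (x[:-h] - xbar)).mean()
    return gammah / gamma0

rho1_theory = theta / (1 + theta ** 2)   # non-zero at lag 1
rho1_hat = sample_acf(x, 1)
rho3_hat = sample_acf(x, 3)              # theory: 0 beyond lag 1
```

The sample ACF should be close to $\theta/(1+\theta^2)$ at lag 1 and near zero beyond lag 1, the signature used later for model selection.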
Autoregressive Process
$\{X_t\}$ is an autoregressive process of order 1, AR(1), if:
$$X_t = \phi X_{t-1} + Z_t$$
i.e. the next term depends on the previous value of the process, and the process slowly reverts to its mean. Note that an AR(1) can also be written by iterating backwards:
$$X_t = \phi^h X_{t-h} + \sum_{j=0}^{h-1} \phi^j Z_{t-j}$$
The mean is:
$$E[X_t] = \phi E[X_{t-1}] + E[Z_t] = 0$$
The autocovariance function (for $|\phi| < 1$) is:
$$\gamma_X(h) = \operatorname{Cov}(X_t, X_{t+h}) = \frac{\sigma^2 \phi^{|h|}}{1 - \phi^2}$$
So the ACF is:
$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \phi^{|h|}$$
The conditional mean is stochastic and depends on $X_t$:
$$E[X_{t+1} \mid X_t, X_{t-1}, \dots] = \phi X_t + E[Z_{t+1} \mid X_t] = \phi X_t \ne 0 = E[X_{t+1}]$$
Note:
For an AR(1) process to be stationary, $|\phi| < 1$, so that it can be expressed as a linear process:
$$X_t = \phi^{h+1} X_{t-h-1} + \sum_{j=0}^{h} \phi^j Z_{t-j} \to \sum_{j=0}^{\infty} \phi^j Z_{t-j} \quad \text{as } h \to \infty$$
When $\phi = 1$, the process is known as a random walk.
For a higher order AR(p) process, there will also be conditions on $\phi_1, \phi_2, \dots, \phi_p$ for stationarity.
An AR(p) process has, in general, a decaying non-zero ACF (in absolute value) for all lags. The smaller $|\phi|$, the faster the ACF decays. If $\phi$ is negative, then the ACF will have alternating signs.
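The geometric decay of the AR(1) ACF can be checked by simulation; a sketch with arbitrary parameter values and seed:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, n = 0.7, 100_000

z = rng.normal(size=n)
x = np.empty(n)
x[0] = z[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + z[t]    # X_t = phi X_{t-1} + Z_t

def sample_acf(x, h):
    xbar = x.mean()
    num = ((x[h:] - xbar) * (x[:-h] - xbar)).mean()
    return num / ((x - xbar) ** 2).mean()

# Theory: rho(h) = phi^|h|, so the ACF decays geometrically
acf_hat = np.array([sample_acf(x, h) for h in (1, 2, 3)])
acf_theory = phi ** np.arange(1, 4)
```

Replacing `phi` with a negative value would show the alternating-sign behaviour noted above.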
ARMA Process
$\{X_t\}$ is an ARMA(1,1) process if:
$$\phi(B) X_t = \theta(B) Z_t$$
Where $\phi(B) = 1 - \phi B$ and $\theta(B) = 1 + \theta B$. Note that for $X_t$ to be stationary, the condition is the same as for the AR(1) process, i.e. $|\phi| < 1$.
In general, $\{X_t\}$ is an ARMA(p,q) process if:
$$\phi(B) X_t = \theta(B) Z_t$$
Where:
$$\phi(z) = 1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p, \qquad \theta(z) = 1 + \theta_1 z + \theta_2 z^2 + \dots + \theta_q z^q$$
The ARMA(p,q) process $\{X_t\}$ is stationary if the equation $\phi(z) = 0$ has no roots on the unit circle (note that the roots can be complex), i.e.:
$$\phi(z) \ne 0 \quad \text{for } |z| = 1$$
If we estimate and then remove the mean, we can treat the centred series $X_t - \mu$ as a standard zero-mean ARMA(p,q) process. The drift here represents the expected change after differencing.
Causality
$\{X_t\}$ is causal if it can be written as:
$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$$
i.e. $X_t$ is expressible in terms of current and past noise terms. Causal processes are a subset of stationary processes, i.e. to be causal it must be stationary first. Causality is important in practice, since if $X_t$ is not causal then it depends on future noise terms, which doesn't make sense.
Theorem:
The $\{X_t\}$ satisfying $\phi(B) X_t = \theta(B) Z_t$ is causal if and only if all the roots of the equation $\phi(z) = 0$ are outside the unit circle.
Invertibility
$\{X_t\}$ is invertible if the noise can be written as:
$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}$$
i.e. $Z_t$ is expressible in terms of current and past $X_t$. If $X_t$ is not invertible then the noise depends on future values of the process, which again does not make sense.
Theorem:
The $\{X_t\}$ satisfying $\phi(B) X_t = \theta(B) Z_t$ is invertible if and only if all the roots of the equation $\theta(z) = 0$ are outside the unit circle.
Calculation of ACF
Linear Filter Method
Consider the causal ARMA(p,q) process:
$$X_t - \phi_1 X_{t-1} - \dots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}$$
This can be written as:
$$X_t = \frac{\theta(B)}{\phi(B)} Z_t = \psi(B) Z_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$$
Note that the summation is only from 0 to infinity, as the process is causal.
The first step is to determine the $\psi_j$ by equating the coefficients in the equation:
$$\phi(B)\, \psi(B)\, Z_t = \theta(B)\, Z_t$$
Then calculate the ACF by replacing the $X_t$ by their linear filter form. For $h > 0$:
$$\gamma(h) = \operatorname{Cov}(X_{t+h}, X_t) = \operatorname{Cov}\left( \sum_{j=0}^{\infty} \psi_j Z_{t+h-j},\; \sum_{j=0}^{\infty} \psi_j Z_{t-j} \right) = \operatorname{Cov}\left( \sum_{j=-h}^{\infty} \psi_{j+h} Z_{t-j},\; \sum_{j=0}^{\infty} \psi_j Z_{t-j} \right) = \sigma^2 \sum_{j=0}^{\infty} \psi_{j+h}\, \psi_j$$
This method is convenient for MA processes, since they are easily expressed as linear processes
Yule-Walker Equations
Consider the causal ARMA(p,q) process:
$$X_t - \phi_1 X_{t-1} - \dots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q}$$
First, multiply both sides by $X_{t-h}$ and then take the covariance:
$$\gamma(h) - \phi_1 \gamma(h-1) - \dots - \phi_p \gamma(h-p) = C_h$$
Where $C_h$ is the covariance of $X_{t-h}$ with the moving average component on the RHS. For $h = 0, 1, 2, \dots, p$ there are a total of $p+1$ equations, which can be expressed in matrix form:
$$\begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(p) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(p-1) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(p) & \gamma(p-1) & \cdots & \gamma(0) \end{pmatrix} \begin{pmatrix} 1 \\ -\phi_1 \\ \vdots \\ -\phi_p \end{pmatrix} = \begin{pmatrix} C_0 \\ C_1 \\ \vdots \\ C_p \end{pmatrix}$$
Thus, given the $C_j$ and $\phi_j$, the ACF can be computed by solving the linear system, i.e. by taking the inverse. For $h > p$, we can find the ACF recursively through:
$$\gamma(h) = \phi_1 \gamma(h-1) + \dots + \phi_p \gamma(h-p) + C_h$$
To figure out $C_0, C_1, \dots, C_p$, if the $\psi_j$'s are available, then we can use $X_{t-h} = \sum_{j \ge 0} \psi_j Z_{t-h-j} = \sum_{j \ge h} \psi_{j-h} Z_{t-j}$:
$$C_h = \operatorname{Cov}\left( X_{t-h},\; Z_t + \theta_1 Z_{t-1} + \dots + \theta_q Z_{t-q} \right) = \sigma^2 \sum_{j=h}^{q} \theta_j\, \psi_{j-h}$$
(with $\theta_0 = 1$).
Note that for an AR(p) process, the $C_h$ are zero for $h \ge 1$, since the noise term is uncorrelated with the past. The Yule-Walker equations for $h = 1, \dots, p$ then become:
$$\begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(p-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(p-1) & \gamma(p-2) & \cdots & \gamma(0) \end{pmatrix} \begin{pmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{pmatrix} = \begin{pmatrix} \gamma(1) \\ \gamma(2) \\ \vdots \\ \gamma(p) \end{pmatrix}$$
or equivalently $\Gamma_p \phi_p = \gamma_p$, where $\Gamma_p$ is the covariance matrix and $\gamma_p = (\gamma(1), \dots, \gamma(p))^T$, so $\phi_p = \Gamma_p^{-1} \gamma_p$.

Partial Autocorrelation Function
Define the vector $\phi_h = (\phi_{h1}, \phi_{h2}, \dots, \phi_{hh})^T = \Gamma_h^{-1} \gamma_h$. The PACF at lag $h$ is $\phi_{hh}$, the last element of the vector $\Gamma_h^{-1} \gamma_h$:
$$\alpha(h) = \begin{cases} 1 & \text{if } h = 0 \\ \phi_{hh} & \text{if } h \ge 1 \end{cases}$$
Because an AR(p) process is a special case of an AR(h) process with $\phi_t = 0$ for $t > p$, we have:
$$\phi_{hh} = \begin{cases} \phi_p & \text{if } h = p \\ 0 & \text{if } h > p \end{cases}$$
E.g., for an AR(2) process, the PACF at lag 2 is given by $\alpha(2) = \phi_{22}$, where:
$$\begin{pmatrix} \phi_{21} \\ \phi_{22} \end{pmatrix} = \begin{pmatrix} \gamma(0) & \gamma(1) \\ \gamma(1) & \gamma(0) \end{pmatrix}^{-1} \begin{pmatrix} \gamma(1) \\ \gamma(2) \end{pmatrix}$$
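The AR(2) PACF example above can be verified numerically; a sketch with hypothetical coefficients $\phi_1 = 0.5$, $\phi_2 = 0.3$ (solving the 2x2 system should recover $\phi_{22} = \phi_2$ exactly):

```python
import numpy as np

# Hypothetical AR(2) coefficients (stationary choice)
phi1, phi2 = 0.5, 0.3

# ACF of an AR(2) from the Yule-Walker recursions:
rho1 = phi1 / (1 - phi2)
rho2 = phi1 * rho1 + phi2

# Solve the 2x2 system; gamma(0) scales out, so it can be written in terms of rho
R = np.array([[1.0, rho1],
              [rho1, 1.0]])
phi_vec = np.linalg.solve(R, np.array([rho1, rho2]))
pacf_lag2 = phi_vec[1]   # phi_22
```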
Model Building
For a stationary time series, model building can be classified into 3 stages: model selection, parameter estimation, and model diagnosis.
Model Selection
To determine the appropriate model we need to look at the sample ACF and PACF:
The sample autocovariance function is:
$$\hat{\gamma}(h) = \frac{1}{n} \sum_{t=1}^{n-h} (x_{t+h} - \bar{x})(x_t - \bar{x})$$
Note that since SPACF will be calculated from SACF, both measures will have estimation error.
Once we have graphed the sample ACF and sample PACF, we first check that the sample ACF
should quickly converge to zero, which shows that the time series is stationary.
If the sample ACF decreases slowly but steadily from a value near 1, then the data need to
be differenced before fitting the model
If the sample ACF exhibits a periodic oscillation, then there may be some seasonality still.
Then we compare the sample ACF and PACF to the theoretical ACF and PACF of different
processes to see if there is a match.
For an AR(p) process:
Sample ACF shows exponential decay towards near-zero values
Sample PACF shows significant values up to lag p, then near-zero values thereafter
For an MA(q) process:
Sample ACF shows significant values up to lag q, then near-zero values thereafter
Sample PACF shows exponential decay towards near-zero values
If neither of these situations occur, then consider an ARMA(p,q) process. However, the sample
ACF and PACF of an ARMA(p,q) process is very flexible, but in general the ACF and PACF are
the sum of the ACF and PACF of an AR(p) and a MA(q) process. So for an ARMA(p,q) model, it
should display:
ACF that decays towards zero after lag q, either directly or with oscillation
PACF that decays towards zero after lag p, either directly or with oscillation
Information Criteria
An information criterion has the form:
$$IC(p, q) = -2 \log L + P(n, p, q)$$
Where the first term, i.e. minus twice the maximised log-likelihood, always decreases with the no. of parameters. However, the second penalty term, a function of the no. of observations and the no. of parameters, always increases with the no. of parameters. Thus the IC seeks to balance out bias and variance; the model minimising the chosen IC is selected. The common ICs are:

Information Criteria | Penalty Term $P(n, p, q)$
AIC | $2(p + q + 1)$
BIC | $(p + q + 1) \log n$
AICc | $\dfrac{2(p + q + 1)\, n}{n - p - q - 2}$
Parameter Estimation
Mean Estimation
Let X t be a weakly stationary process with mean . An estimator of is the sample mean:
$$\bar{X} = \frac{1}{n} \sum_{t=1}^{n} X_t$$
This estimator is unbiased:
$$E[\bar{X}] = \frac{1}{n} \sum_{t=1}^{n} E[X_t] = \mu$$
This estimator is also consistent:
$$\operatorname{var}(\bar{X}) = \frac{1}{n^2} \sum_{t=1}^{n} \sum_{s=1}^{n} \operatorname{Cov}(X_t, X_s) = \frac{1}{n^2} \sum_{t=1}^{n} \sum_{s=1}^{n} \gamma(t-s) = \frac{1}{n^2} \sum_{h=-(n-1)}^{n-1} (n - |h|)\, \gamma(h) \to 0$$
as $n \to \infty$, provided $\gamma(h) \to 0$; for a stationary time series the ACF should eventually converge to zero regardless of whether it is AR or MA. For a more detailed proof, see ACTL2003 Proofs.
Parameter Estimation
If the sufficient number of sample ACF is known, then one way to estimate the parameters in a
model, i.e. , , 2, is by equating them to the theoretical ACF derived from the Yule-Walker
equations. Use this to set up equations in terms of the parameters and solve. If there are more than
one solution, choose the solution that makes the model causal and/or invertible.
Another method is to use maximum likelihood estimation. Suppose we have a set of errors that are
assumed to be normal, then the Xt themselves are also normal so:
$$\mathbf{X}_n = (X_1, \dots, X_n)^T \sim N(\mathbf{0}, \Gamma_n)$$
Where:
$$\Gamma_n = \begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(n-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(n-1) & \gamma(n-2) & \cdots & \gamma(0) \end{pmatrix}$$
Assuming that the observations follow a multivariate normal distribution, the likelihood function is:
$$L(\beta) = \frac{1}{(2\pi)^{n/2} \sqrt{\det \Gamma_n}} \exp\left( -\frac{1}{2} \mathbf{X}_n^T \Gamma_n^{-1} \mathbf{X}_n \right)$$
The maximum likelihood estimator $\hat{\beta}$ is the value that maximises the likelihood function $L(\beta)$. Under the normality assumption, we have the asymptotic distribution of the MLE:
$$\hat{\beta} \overset{A}{\sim} N\left( \beta, \operatorname{var}(\hat{\beta}) \right)$$
Where the variance can be estimated by the inverse of the negative Hessian of the log-likelihood:
$$\widehat{\operatorname{var}}(\hat{\beta}) = \left( -\frac{\partial^2 \ln L}{\partial \beta\, \partial \beta^T} \right)^{-1} \Bigg|_{\beta = \hat{\beta}}$$
This result can be used to compute confidence intervals for the parameters and for hypothesis
testing about the parameters, e.g. whether to include certain parameters.
Model Diagnosis
The residuals of the proposed model are:
$$\hat{Z}_t = X_t - \hat{X}_t$$
Where $\hat{X}_t$ are the fitted values computed using the estimated parameters. If the proposed model is a
good approximation to the underlying time series process, then the residuals should be
approximately a white noise process. There are several methods to check this:
Plot of Residuals
If the plot of Zt against t shows any trend or patterns in fluctuations, then the model is inadequate.
SACF of Residuals
If $\hat{Z}_t \sim WN(0,1)$, then it can be shown that the sample ACF of $\hat{Z}_t$ has the distribution:
$$\hat{\rho}_{\hat{Z}}(h) \sim N\left( 0, \frac{1}{n} \right)$$
Therefore, at the 95% level, the sample ACF should be within the range:
$$0 \pm \frac{1.96}{\sqrt{n}}$$
since for a white noise process, $\rho(h) = 0$ for $h \ge 1$. If too many values of the SACF lie outside this range, then the model does not fit the process well and more parameters will be needed.
Ljung-Box Test
We can test the null hypothesis that Zt ~ WN 0,1 using the Ljung-Box test statistic. This tests
whether jointly, all correlations at lags greater than zero are zero. Under the null hypothesis, for
large n, the Ljung-Box test statistic is:
$$Q = n(n+2) \sum_{j=1}^{h} \frac{\hat{\rho}_{\hat{Z}}^2(j)}{n - j} \sim \chi^2_{h - p - q}$$
Where $n$ is the no. of time series observations. In practice, $h$ is chosen to be between 15 and 30, and $n$ should be large, i.e. $n \ge 100$.
Using the Ljung-Box test, we reject the null hypothesis, i.e. conclude the residuals are not white noise, if:
$$Q > \chi^2_{h - p - q,\; 1 - \alpha}$$
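A minimal sketch of the Ljung-Box computation, assuming the residuals are simulated white noise (the sample size, number of lags, and seed are illustrative choices):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, h, p, q = 500, 20, 0, 0   # pure white-noise "residuals", no fitted ARMA terms

z = rng.normal(size=n)       # stand-in for model residuals

# Sample ACF of the residuals at lags 1..h
zbar = z.mean()
denom = ((z - zbar) ** 2).sum()
acf = np.array([((z[j:] - zbar) * (z[:-j] - zbar)).sum() / denom
                for j in range(1, h + 1)])

# Ljung-Box statistic and the chi-squared critical value at the 5% level
Q = n * (n + 2) * np.sum(acf ** 2 / (n - np.arange(1, h + 1)))
critical = chi2.ppf(0.95, df=h - p - q)
reject = Q > critical
```

For genuine white noise, $Q$ should typically fall below the critical value, so the null is usually not rejected.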
Non-Stationarity
In practice, a non-stationary time series may exhibit a non-stationary level of mean, variance or
both. Transformations can be used to remove non-stationarity, e.g. taking logarithm of an
exponential trend can remove non-stationary mean, or it can smooth out the variance
Stochastic Trends
Apart from a deterministic trend or seasonality, a stochastic trend also causes non-stationarity. A
stochastic trend is when the noise terms have a permanent effect on the process. Consider the
random walk rewritten iteratively
$$X_t = X_{t-1} + Z_t \;\Rightarrow\; X_t = X_0 + \sum_{j=1}^{t} Z_j, \qquad Z_t \sim WN(0, \sigma^2)$$
In this case, the effect of any $Z_t$ on $X_{t+h}$ is the SAME for all $h \ge 0$, since $\psi_j = 1$ for all $j$. This is not true for stationary processes like AR(1) or ARMA(1,1), where $|\phi| < 1$, since depending on $h$, $Z_t$ will have a different level of impact as the coefficient $\psi_j$ changes, e.g.:
$$X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}$$
Since noise terms have a lasting impact, the correlation between Xt and Xt-h is relatively high, so a
distinctive feature of a random walk is a very slowly decaying positive ACF. Note that differencing
the random walk once obtains a stationary series!
$$Y_t = (1 - B) X_t = X_t - X_{t-1} = Z_t$$
ARIMA Model
The process Yt is an autoregressive integrated moving average model ARIMA(p,d,q) with order of
integration d if:
$$\phi(B)(1 - B)^d Y_t = \theta(B) Z_t$$
i.e. $W_t = (1 - B)^d Y_t$ is a stationary ARMA(p,q) process satisfying $\phi(B) W_t = \theta(B) Z_t$.
E.g. consider the process defined by:
$$X_t - 0.6 X_{t-1} - 0.3 X_{t-2} - 0.1 X_{t-3} = Z_t + 0.25 Z_{t-1}$$
The AR polynomial $\phi(z) = 1 - 0.6z - 0.3z^2 - 0.1z^3$ has a root at $z = 1$, so it factors as $\phi(z) = (1 - z)(1 + 0.4z + 0.1z^2)$. This process can therefore be re-written as:
$$(1 + 0.4B + 0.1B^2)(1 - B) X_t = (1 + 0.25B) Z_t$$
i.e. an ARIMA(2,1,1) process.
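The unit root and the factorisation in this example can be checked numerically; a small sketch:

```python
import numpy as np

# AR polynomial phi(z) = 1 - 0.6 z - 0.3 z^2 - 0.1 z^3 from the example above
phi = np.polynomial.Polynomial([1.0, -0.6, -0.3, -0.1])

# z = 1 is a root, confirming one unit root (hence d = 1)
value_at_1 = phi(1.0)

# Factoring out (1 - z) leaves 1 + 0.4 z + 0.1 z^2
quotient = np.polynomial.Polynomial([1.0, 0.4, 0.1])
product = quotient * np.polynomial.Polynomial([1.0, -1.0])   # (1 + 0.4z + 0.1z^2)(1 - z)
```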
SARIMA Model
Suppose that $\{X_t\}$ exhibits a stochastic seasonal trend with period $s$, i.e. $X_t$ not only depends on recent values but also on values one season back. A SARIMA$(p,d,q)\times(P,D,Q)_s$ process is defined by:
$$\underbrace{\phi(B)}_{AR(p)} \underbrace{(1-B)^d}_{I(d)} \underbrace{\Phi(B^s)}_{AR(P)} \underbrace{(1-B^s)^D}_{I(D)} X_t = \underbrace{\theta(B)}_{MA(q)} \underbrace{\Theta(B^s)}_{MA(Q)} Z_t$$
Where $\Phi(B^s) = 1 - \Phi_1 B^s - \dots - \Phi_P B^{sP}$ and $\Theta(B^s) = 1 + \Theta_1 B^s + \dots + \Theta_Q B^{sQ}$ are polynomials in $B^s$.
E.g. consider a SARIMA$(1,0,0)\times(0,1,1)_{12}$ process given by:
$$\underbrace{(1 - \phi B)}_{AR(1)} \underbrace{(1 - B^{12})}_{I(1)} X_t = \underbrace{(1 + \Theta B^{12})}_{MA(1)} Z_t$$
Expanding:
$$(1 - \phi B - B^{12} + \phi B^{13}) X_t = Z_t + \Theta Z_{t-12}$$
$$X_t = X_{t-12} + \phi \left( X_{t-1} - X_{t-13} \right) + Z_t + \Theta Z_{t-12}$$
Dickey-Fuller Test
The Dickey-Fuller test is a unit root test, i.e. it tests whether there is a unit root in the time series.
Note that if the AR polynomial $\phi(z)$ has a unit root, then the time series is not stationary and requires differencing.
Consider a time series process $\{X_t\}$ with:
$$X_t = \alpha + \beta t + \phi X_{t-1} + Z_t, \qquad Z_t \sim WN(0, \sigma^2)$$
We test:
$$H_0 : \phi = 1 \quad \text{against} \quad H_1 : \phi < 1$$
Note that if $\phi = 1$ then there is a unit root, which leads to a stochastic trend, so $X_t$ is not stationary. We can write the above model as:
$$\nabla X_t = X_t - X_{t-1} = \alpha + \beta t + (\phi - 1) X_{t-1} + Z_t = \alpha + \beta t + \phi^* X_{t-1} + Z_t, \qquad \phi^* = \phi - 1$$
so the null hypothesis becomes $H_0 : \phi^* = 0$. This is known as the Dickey-Fuller test, and the test statistic is:
$$\hat{\tau} = \frac{\hat{\phi}^*}{se(\hat{\phi}^*)}$$
Once the parameters $\alpha, \beta, \phi^*$ have been estimated, reject the null hypothesis $\phi^* = 0$ if $\hat{\tau}$ is LESS than the critical value, which will be negative since $\phi^*$ is negative under the alternative. Note that rejection of the null hypothesis implies that the time series is stationary, and accepting the null hypothesis implies that differencing is required.
The distribution of this test statistic is a non-standard distribution depending on $\alpha$ and $\beta$, with asymptotic percentiles:

Probability to the left | 0.01 | 0.05 | 0.10
Standard Normal | -2.33 | -1.65 | -1.28
DF with $\alpha = 0, \beta = 0$ | -2.58 | -1.95 | -1.62
DF with $\beta = 0$ | -3.43 | -2.86 | -2.57
DF (unconstrained) | -3.96 | -3.41 | -3.12
Note that the DF distributions are much more spread out than the standard normal. When choosing whether or not to constrain $\alpha = 0$ or $\beta = 0$, there is a trade-off:
If $\alpha$ or $\beta$ is set to 0 when in fact the true values are nonzero, the test becomes inconsistent and the asymptotic critical values are no longer valid. Decisions based on the test are likely to be wrong, i.e. it might confuse deterministic and stochastic trends.
However, allowing $\alpha$ or $\beta$ to be non-zero reduces the power of the test, i.e. it is harder to detect a false null hypothesis.
How to constrain $\alpha$ and $\beta$ usually depends on what type of series we have. E.g. if a linear trend exists, then we expect the differenced series to only have a constant, so $\alpha \ne 0, \beta = 0$.
Overdifferencing
Let Ut be an ARMA(p,q) process:
$$\phi(B) U_t = \theta(B) Z_t$$
Suppose we difference $U_t$ even though it is already stationary:
$$V_t = (1 - B) U_t = (1 - B) \frac{\theta(B)}{\phi(B)} Z_t \;\Rightarrow\; \phi(B) V_t = (1 - B)\, \theta(B) Z_t$$
However, this MA polynomial has a unit root, so the process $V_t$ is not invertible. Therefore, we should avoid overdifferencing, as it gives a non-invertible process even though it is still stationary.
Cointegrated Time Series
Many time series in finance and economics are non-stationary (random walks), e.g. CPI and GDP, but at the same time do not move too far apart from each other. Cointegration is used to model non-stationary series that move together.
A bivariate process $\mathbf{X}_t = (X_{t,1}, X_{t,2})^T$ is $I(d)$ if $\nabla^d \mathbf{X}_t$ is stationary but $\nabla^{d-1} \mathbf{X}_t$ is not. An $I(d)$ bivariate process is cointegrated if there is a cointegrating vector $(\alpha_1, \alpha_2)^T$ such that:
$$\alpha_1 X_{t,1} + \alpha_2 X_{t,2} \sim I(0)$$
If $X_{t,1}$ and $X_{t,2}$ are cointegrated with cointegrating vector $(1, -a)^T$, then:
$$e_t = X_{t,1} - a X_{t,2} \text{ is } I(0)$$
That is, $X_{t,1}$ and $X_{t,2}$ are random walks themselves, but the difference $X_{t,1} - a X_{t,2}$ is stationary. The $a$ term can be estimated by the regression:
$$X_{t,1} = a X_{t,2} + \varepsilon_t$$
Then we expect in the long run that the two processes satisfy $X_{t,1} \approx a X_{t,2}$.
Recall the Markov property:
$$\Pr(X_t \in A \mid X_{s_1} = x_1, \dots, X_{s_n} = x_n, X_s = x) = \Pr(X_t \in A \mid X_s = x)$$
for all times $s_1 < s_2 < \dots < s_n < s < t$, all states $x_1, x_2, \dots, x_n, x \in S$, and all subsets $A$ of $S$.
AR Processes
An AR(1) process has the Markov property, since the conditional distribution of X n1 given all
previous X t depends only on X n . However, an AR(2) process does not have the Markov property,
since the conditional distribution of X n1 given all previous X t depends on both X n and X n1 . Thus in
general, AR(p) processes do not have the Markov property for p greater than 1.
However, for an AR(2) process, if we define a vector-valued process $\mathbf{Y}$ by $\mathbf{Y}_t = (X_t, X_{t-1})^T$, then $\mathbf{Y}$ has the Markov property, since the conditional distribution of $\mathbf{Y}_{n+1}$ given all previous $\mathbf{Y}_t$ depends only on $\mathbf{Y}_n$. In general, for an AR(p) process we can define a vector-valued process with $p$ elements that will have the Markov property.
MA Processes
A MA(q) process can never have the Markov property, even in vector form, since the distribution of $X_{n+1}$ depends on the value of $Z_n$, and in theory no knowledge of the value of $X_n$ or any finite collection $X_n, \dots, X_{n-q+1}$ will ever be enough to deduce the value of $Z_n$. However, in practice this dependence on the unknown initial noise becomes negligible for large $n$ if the process is invertible.
Now consider an ARIMA process, e.g. ARIMA(1,1,0):
$$(1 - \alpha B)(1 - B) X_t = Z_t \;\Rightarrow\; X_t = (1 + \alpha) X_{t-1} - \alpha X_{t-2} + Z_t$$
This is clearly not Markov. However, it can still be written as a vector-valued process that has the Markov property. In general, the vector process needs $p + d$ terms to be Markov.
If $q$ is not equal to zero, i.e. it has a moving average part, then it will never be Markov, for the same reason that MA(q) is never Markov.
Forecasting
The $k$-step ahead forecast of $X_{n+k}$ given observations up to time $n$ is the conditional expectation:
$$\hat{X}_{n+k|n} = E[X_{n+k} \mid X_n, \dots, X_1]$$
E.g. the one-step and two-step ahead forecasts are:
$$\hat{X}_{n+1|n} = E[X_{n+1} \mid X_n, \dots, X_1], \qquad \hat{X}_{n+2|n} = E[X_{n+2} \mid X_n, \dots, X_1]$$
In practice we do not observe Zt so if there are MA terms in the model, then there are more values
of Z t than X t and there is no way of determining all of them from data. Consider the MA(1) process:
$$X_t = Z_t + 0.5 Z_{t-1} \;\Rightarrow\; Z_t = X_t - 0.5 Z_{t-1}$$
Iterating backwards:
$$Z_n = \sum_{j=0}^{n-1} (-0.5)^j X_{n-j} + (-0.5)^n Z_0$$
To determine $Z_n$ we need $Z_0$ first; one simple way is to assume that $Z_0 = 0$. If the process is invertible, then this assumption will have negligible effect on $\hat{X}_{n+k|n}$ if $n$ is large, since $(0.5)^n$ will be small.
Best Linear Predictor
The best linear predictor of $X_{n+h}$ based on $X_1, \dots, X_n$ is:
$$P_n X_{n+h} = a_0 + a_1 X_n + a_2 X_{n-1} + \dots + a_n X_1$$
We need the values of $a_0, \dots, a_n$ that minimise the mean squared error:
$$MSE = E\left[ \left( X_{n+h} - P_n X_{n+h} \right)^2 \right]$$
The general solution is found from the $n+1$ first-order conditions:
$$\frac{\partial MSE}{\partial a_0} = 0 \;\Rightarrow\; E[X_{n+h} - P_n X_{n+h}] = 0$$
$$\frac{\partial MSE}{\partial a_1} = 0 \;\Rightarrow\; E[(X_{n+h} - P_n X_{n+h}) X_n] = 0$$
$$\vdots$$
$$\frac{\partial MSE}{\partial a_n} = 0 \;\Rightarrow\; E[(X_{n+h} - P_n X_{n+h}) X_1] = 0$$
Note that due to the very first condition, the expected prediction error is zero, so each remaining condition is a zero-covariance condition, e.g.:
$$\operatorname{Cov}(X_{n+h} - P_n X_{n+h}, X_n) = \gamma(h) - a_1 \gamma(0) - a_2 \gamma(1) - \dots - a_n \gamma(n-1) = 0$$
Applying this trick to every subsequent MSE minimising condition, we end up with a system of $n$ equations (excluding the first one) that is very similar to the process of finding PACF coefficients. This system can be represented in matrix form:
$$\Gamma_n \mathbf{a}_n = \gamma_n(h)$$
Where:
$$\Gamma_n = \begin{pmatrix} \gamma(0) & \gamma(1) & \cdots & \gamma(n-1) \\ \gamma(1) & \gamma(0) & \cdots & \gamma(n-2) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(n-1) & \gamma(n-2) & \cdots & \gamma(0) \end{pmatrix}, \qquad \mathbf{a}_n = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}, \qquad \gamma_n(h) = \begin{pmatrix} \gamma(h) \\ \gamma(h+1) \\ \vdots \\ \gamma(h+n-1) \end{pmatrix}$$
Therefore, the solution is:
$$\mathbf{a}_n = \Gamma_n^{-1} \gamma_n(h)$$
Once this is known, $a_0$ can be found by rewriting the first condition in matrix form:
$$a_0 = \mu \left( 1 - \mathbf{1}^T \mathbf{a}_n \right)$$
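A sketch of solving the prediction equations numerically, assuming AR(1) autocovariances (for which the one-step predictor should reduce to $\phi X_n$, i.e. $\mathbf{a}_n = (\phi, 0, \dots, 0)^T$; the parameter values are hypothetical):

```python
import numpy as np

# Hypothetical AR(1) with phi = 0.8, sigma^2 = 1: gamma(h) = phi^|h| / (1 - phi^2)
phi, n, h = 0.8, 5, 1
gamma = lambda k: phi ** abs(k) / (1 - phi ** 2)

# Gamma_n a = gamma_n(h), with Gamma_n[i, j] = gamma(i - j) and gamma_n(h)[i] = gamma(h + i)
Gamma = np.array([[gamma(i - j) for j in range(n)] for i in range(n)])
rhs = np.array([gamma(h + i) for i in range(n)])
a = np.linalg.solve(Gamma, rhs)
```

Only the most recent observation gets a non-zero weight, consistent with the AR(1) process being Markov.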
Brownian Motion
A stochastic process $\{X_t, t \ge 0\}$ is a Brownian motion if:
(1) $X_0 = 0$
(2) $\{X_t\}$ has stationary and independent increments
(3) For all $t > 0$, $X_t \sim N(0, \sigma^2 t)$
The covariance between any $X_s$ and $X_t$ for $s < t$ can be found using the independent increments:
$$\operatorname{Cov}(X_s, X_t) = \operatorname{Cov}(X_s, X_s + (X_t - X_s)) = \operatorname{Cov}(X_s, X_s) + \operatorname{Cov}(X_s, X_t - X_s) = \sigma^2 s$$
since for any $s < t$, $X_s$ and $X_t - X_s$ are independent, and both are normally distributed, i.e.:
$$X_s \sim N(0, \sigma^2 s), \qquad X_t - X_s \sim N(0, \sigma^2 (t - s))$$
In general, $\operatorname{Cov}(X_s, X_t) = \sigma^2 \min(s, t)$.
Consider a symmetric random walk that takes a step of size $\pm \Delta x$ every $\Delta t$ units of time:
$$X(t) = \Delta x \left( Y_1 + Y_2 + \dots + Y_{\lfloor t / \Delta t \rfloor} \right)$$
Then, if we let $\Delta x = \sigma \sqrt{\Delta t}$ and let $\Delta t \to 0$, the limiting process of $X(t)$ is a Brownian motion. This is because:
$\{X(t), t \ge 0\}$ has stationary increments, since the distribution of the change in position of the random walk over any time interval depends only on the length of the interval.
For all $t > 0$, $X(t)$ converges to a $N(0, \sigma^2 t)$ distribution by the Central Limit Theorem.
A Brownian motion with drift $\mu$ satisfies $X_t \sim N(\mu t, \sigma^2 t)$ for all $t > 0$, and can be converted to a standard Brownian motion by defining:
$$B_t = \frac{X_t - \mu t}{\sigma}$$
Similarly, a standard Brownian motion can be converted to a Brownian motion with drift by defining:
$$X_t = \mu t + \sigma B_t$$
Geometric Brownian Motion
Let X t , t 0 be a Brownian motion process with drift and volatility 2 , then the process
Yt , t 0 defined by:
$$Y_t = \exp(X_t)$$
is called a geometric Brownian motion. This is useful when negative values do not make sense, e.g. for modelling asset prices.
Gaussian Processes
A Brownian motion is a Gaussian process: any finite collection $(X_{t_1}, \dots, X_{t_n})$ has a multivariate normal distribution, since each term can be written as a sum of independent normal increments:
$$X_n = X_1 + (X_2 - X_1) + \dots + (X_n - X_{n-1})$$
Remember that $X_s$ and $X_t$ are not independent, but their increments are!
Differential Form of Brownian Motion
Consider a standard Brownian motion $\{B_t, t \ge 0\}$ and define $\Delta B_t$ as the change in $B_t$ over a small interval $\Delta t$. Using the properties of Brownian motion, we can write this as:
$$\Delta B_t = B_{t + \Delta t} - B_t = Z \sqrt{\Delta t}$$
Where $Z$ denotes a standard normal random variable. Taking the differential limit $\Delta t \to 0$, we have the differential form:
$$dB_t = Z \sqrt{dt}$$
With:
$$E[dB_t] = 0 \qquad \text{and} \qquad \operatorname{var}(dB_t) = dt$$
Now consider $(\Delta B_t)^2 = Z^2 \Delta t$, since $\Delta B_t = Z\sqrt{\Delta t}$:
$$E\left[ (\Delta B_t)^2 \right] = E[Z^2]\, \Delta t = \Delta t$$
$$\operatorname{var}\left( (\Delta B_t)^2 \right) = \operatorname{var}(Z^2)\, (\Delta t)^2 = 2 (\Delta t)^2$$
since $Z^2 \sim \chi^2_1$, and $\chi^2_k$ has mean $k$ and variance $2k$. Taking the limit:
$$E\left[ (dB_t)^2 \right] = dt, \qquad \operatorname{var}\left( (dB_t)^2 \right) = 2 (dt)^2 = o(dt)$$
We can essentially treat $o(dt)$ as zero, so the variance is zero and, loosely speaking, $(dB_t)^2$ is just a constant:
$$(dB_t)^2 = dt$$
Next consider $\Delta t\, \Delta B_t$:
$$E[\Delta t\, \Delta B_t] = \Delta t\, E[\Delta B_t] = 0$$
$$\operatorname{var}(\Delta t\, \Delta B_t) = (\Delta t)^2 \operatorname{var}(\Delta B_t) = (\Delta t)^3 = o(\Delta t)$$
So in the limit:
$$E[dt\, dB_t] = 0, \qquad \operatorname{var}(dt\, dB_t) = o(dt) \;\Rightarrow\; dt\, dB_t = 0$$
Finally, consider $(dt)^2$:
$$(dt)^2 = o(dt) = 0$$
The following multiplication table summarises the results:

$\times$ | $dt$ | $dB_t$
$dt$ | 0 | 0
$dB_t$ | 0 | $dt$
Ito's Formula
Let $\{X_t\}$ be an Ito process satisfying the stochastic differential equation:
$$dX_t = \mu(X_t, t)\, dt + \sigma(X_t, t)\, dB_t$$
and let $Y_t = F(X_t, t)$ for a twice-differentiable function $F$. Then the stochastic differential equation satisfied by $Y_t$ is given by Ito's formula:
$$dY_t = \frac{\partial F(x,t)}{\partial x}\bigg|_{x = X_t} dX_t + \left( \frac{\partial F(x,t)}{\partial t}\bigg|_{x = X_t} + \frac{1}{2}\, \sigma^2(X_t, t)\, \frac{\partial^2 F(x,t)}{\partial x^2}\bigg|_{x = X_t} \right) dt$$
Proof:
The second order Taylor series expansion for a function of two variables gives:
$$dF(x, y) = F_x\, dx + F_y\, dy + \frac{1}{2} \left( F_{xx}\, (dx)^2 + F_{yy}\, (dy)^2 + 2 F_{xy}\, dx\, dy \right)$$
Applying this to $Y_t = F(X_t, t)$:
$$dY_t = \frac{\partial F}{\partial x}\, dX_t + \frac{\partial F}{\partial t}\, dt + \frac{1}{2} \frac{\partial^2 F}{\partial x^2}\, (dX_t)^2 + \frac{1}{2} \frac{\partial^2 F}{\partial t^2}\, (dt)^2 + \frac{\partial^2 F}{\partial x\, \partial t}\, dX_t\, dt$$
Using the multiplication table, $(dt)^2 = 0$ and:
$$dX_t\, dt = \mu(X_t, t)\, (dt)^2 + \sigma(X_t, t)\, dB_t\, dt = 0$$
And:
$$(dX_t)^2 = \mu^2(X_t, t)\, (dt)^2 + \sigma^2(X_t, t)\, (dB_t)^2 + 2 \mu(X_t, t)\, \sigma(X_t, t)\, dB_t\, dt = \sigma^2(X_t, t)\, dt$$
Therefore, the expansion reduces to Ito's formula.
E.g. consider a Brownian motion $\{X_t, t \ge 0\}$ with 0 drift and variance $\sigma^2$; find the SDE for $t X_t^2$.
We have the SDE for $X_t$:
$$dX_t = \sigma\, dB_t$$
Let $Y_t = t X_t^2 = F(X_t, t)$ where $F(x, t) = t x^2$, so the derivatives are:
$$\frac{\partial F}{\partial x} = 2tx, \qquad \frac{\partial F}{\partial t} = x^2, \qquad \frac{\partial^2 F}{\partial x^2} = 2t$$
Applying Ito's formula:
$$d(t X_t^2) = 2 t X_t\, dX_t + X_t^2\, dt + \frac{1}{2} \sigma^2 (2t)\, dt = 2 \sigma t X_t\, dB_t + \left( X_t^2 + \sigma^2 t \right) dt$$
Stochastic Integration
Consider a Brownian motion $\{X_t, t \ge 0\}$ with zero drift and variance $\sigma^2$. Let $f$ be a function with a continuous derivative on $[a, b]$. The random variable $Y$ defined by the stochastic/Ito integral is:
$$Y = \int_a^b f(t)\, dX_t = \lim_{\substack{n \to \infty \\ \max |t_k - t_{k-1}| \to 0}} \sum_{k=1}^{n} f(t_{k-1}) \left( X_{t_k} - X_{t_{k-1}} \right)$$
Note that the integrand is evaluated at the left endpoint $t_{k-1}$ of each subinterval, and that "functions of $t$" in this context includes stochastic integrands, e.g. $\int B_t^n\, dB_t$.
The expectation is zero, since each increment has mean zero and is independent of the value of the integrand at the left endpoint:
$$E\left[ \int_a^b f(t)\, dX_t \right] = \lim \sum_{k=1}^{n} f(t_{k-1})\, E\left[ X_{t_k} - X_{t_{k-1}} \right] = 0$$
Thus, the variance equals the expectation of the square:
$$\operatorname{var}\left( \int_a^b f(t)\, dX_t \right) = \lim \sum_{k=1}^{n} f^2(t_{k-1}) \operatorname{var}\left( X_{t_k} - X_{t_{k-1}} \right) = \lim \sum_{k=1}^{n} \sigma^2 f^2(t_{k-1}) (t_k - t_{k-1}) = \sigma^2 \int_a^b f^2(t)\, dt$$
Part 4: Simulation
Continuous Random Variables Simulation
When simulating continuous random variables, we work with the probability/cumulative distribution of the random variable. If $X$ has continuous c.d.f. $F$, then $F(X) \sim U(0,1)$; conversely, if $U \sim U(0,1)$ then $F^{-1}(U)$ is distributed as $F$. This means the probability that $F(X)$ lies in an interval $[a, b]$ depends only on the length $b - a$. Therefore, the first step in simulation is usually to generate random variables from a Uniform(0,1).
Pseudo-Random Numbers
The procedure to generate pseudo-random numbers, i.e. from a Uniform(0,1), is:
1. Start with a seed X0 and specify positive integers a, c and m, which are usually given
2. Generate pseudo-random numbers recursively using:
$$X_{n+1} = (a X_n + c) \bmod m$$
3. The sequence $U_n = X_n / m$ will be an approximation to a Uniform(0,1) random variable
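A minimal sketch of the recursion, using the well-known "minstd" constants $a = 16807$, $c = 0$, $m = 2^{31} - 1$ as an example choice (the seed is arbitrary):

```python
# A minimal linear congruential generator
def lcg(seed, n, a=16807, c=0, m=2**31 - 1):
    """Return n pseudo-random numbers in [0, 1) from X_{k+1} = (a X_k + c) mod m."""
    x = seed
    out = []
    for _ in range(n):
        x = (a * x + c) % m
        out.append(x / m)
    return out

u = lcg(seed=12345, n=1000)
sample_mean = sum(u) / len(u)   # should be close to the uniform mean 0.5
```

The same seed always reproduces the same stream, which is what makes simulation results repeatable.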
Inverse Transform Method
Let $U \sim U(0,1)$ and set:
$$X = F_X^{-1}(U)$$
Then X will have the distribution function:
$$F_X(x) = \Pr(X \le x) = \Pr(F_X^{-1}(U) \le x) = \Pr(U \le F_X(x)) = F_U(F_X(x)) = F_X(x)$$
using $F_U(y) = y$ for $y \in [0, 1]$.
Then the inverse transform procedure to generate r.v. from a cumulative distribution F is:
1. Compute $F_X^{-1}$ from the p.d.f. or c.d.f., if possible
2. Generate a Uniform(0,1) random variable U
3. Set $X = F_X^{-1}(U)$; then X will be from the distribution F
Note that for the inverse transform method to work, we must be able to calculate the inverse of the c.d.f., i.e. $F_X^{-1}$ must have an explicit expression.
E.g. to simulate an exponential random variable, first find $F_X^{-1}$:
$$F_X(x) = \int_0^x \lambda e^{-\lambda s}\, ds = 1 - e^{-\lambda x} \;\Rightarrow\; F_X^{-1}(y) = -\frac{\ln(1-y)}{\lambda}$$
So $X = -\ln(1-U)/\lambda$ is Exp($\lambda$); equivalently $X = -\ln(U)/\lambda$, since $1 - U$ is also Uniform(0,1).
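A sketch of the exponential inverse transform (the rate, sample size, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
lam, n = 2.0, 100_000

u = rng.uniform(size=n)
x = -np.log(1 - u) / lam   # inverse transform: F^{-1}(u) = -ln(1 - u) / lam

mean_hat = x.mean()        # theory: E[X] = 1/lam = 0.5
var_hat = x.var()          # theory: var(X) = 1/lam^2 = 0.25
```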
Discrete Random Variables Simulation
Suppose we want to simulate a discrete random variable with p.m.f.:
$$\Pr(X = x_j) = P_j \text{ for } j = 1, 2, \dots, \qquad \sum_j P_j = 1$$
Then the procedure to simulate r.v. from this p.m.f. is:
1. Generate a Uniform(0,1) random variable U
2. Set:
$$X = \begin{cases} x_1 & \text{if } U < P_1 \\ x_2 & \text{if } P_1 \le U < P_1 + P_2 \\ \;\vdots \\ x_j & \text{if } \sum_{i=1}^{j-1} P_i \le U < \sum_{i=1}^{j} P_i \\ \;\vdots \end{cases}$$
E.g. simulate random variables from a geometric distribution.
Consider the probability mass function for a geometric random variable:
$$P_j = \Pr(X = j) = p (1-p)^{j-1}, \qquad j = 1, 2, \dots$$
Notice that, using the geometric sum:
$$\sum_{i=1}^{j-1} P_i = p \sum_{i=1}^{j-1} (1-p)^{i-1} = p\, \frac{1 - (1-p)^{j-1}}{1 - (1-p)} = 1 - (1-p)^{j-1}$$
Then we have:
$$\sum_{i=1}^{j-1} P_i \le U < \sum_{i=1}^{j} P_i \;\Leftrightarrow\; 1 - (1-p)^{j-1} \le U < 1 - (1-p)^{j} \;\Leftrightarrow\; (1-p)^{j} < 1 - U \le (1-p)^{j-1}$$
Since $U$ is a Uniform(0,1) random variable, so is $1 - U$, and this is equivalent to:
$$(1-p)^{j} < U \le (1-p)^{j-1} \;\Leftrightarrow\; j \ln(1-p) < \ln U \le (j-1) \ln(1-p) \;\Leftrightarrow\; j - 1 \le \frac{\ln U}{\ln(1-p)} < j$$
where the inequality flips since $\ln(1-p) < 0$.
Therefore, to generate a random variable from a geometric distribution:
1. Generate a uniform random number U
2. Set $X = j$, where $j$ is the first integer for which:
$$j > \frac{\ln U}{\ln(1-p)}$$
i.e. $X = \left\lfloor \dfrac{\ln U}{\ln(1-p)} \right\rfloor + 1$.
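A sketch of the geometric procedure using the closed form above (parameter values and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 0.3, 100_000

u = rng.uniform(size=n)
x = np.floor(np.log(u) / np.log(1 - p)) + 1   # smallest j with j > ln U / ln(1-p)

mean_hat = x.mean()   # theory: E[X] = 1/p
```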
Acceptance-Rejection Method
Suppose we have a method, e.g. inverse transform, to simulate an r.v. with density g(y). Then we
can use this as the basis for simulating from the continuous distribution with density f(x).
The procedure to simulate random variables using the rejection method is:
1. Choose a distribution g for which you know you can simulate outcomes
2. Let $c$ be some constant such that $c \ge f(y)/g(y)$ for all $y$
3. Simulate a random variable Y with density function g(y)
4. Simulate a Uniform(0,1) random variable U
5. Accept this as the random number, i.e. set X = Y, if:
    U ≤ f(Y)/(c·g(Y))
6. Otherwise, reject and return to step 3.
Therefore, the value for X is Y_N, where N is the number of iterations until a random number is
accepted. We want to be as efficient as possible, i.e. minimise the no. of iterations, so:
- For efficiency, choose a density g(y) similar to f(y), e.g. an exponential to simulate a normal
- For efficiency, choose the smallest value of c that satisfies the inequality, using calculus, i.e.:
    c = max_y f(y)/g(y)
E.g. simulate a standard normal random variable. First consider X = |Z|, which has density:
    f(x) = (2/√(2π)) exp(-x²/2),  for x > 0
1. Let another random variable Y be from the density g(x) = exp(-x), x > 0; note that this is
comparable to f(x)
2. Choose the smallest value of c such that c ≥ f(y)/g(y), i.e.:
    c = max_x f(x)/g(x) = max_x √(2/π) exp(x - x²/2) = √(2e/π)
with the maximum attained at x = 1, so that:
    f(Y)/(c·g(Y)) = exp(-(Y - 1)²/2)
3. Generate U_1 and set Y = -log U_1, which is Exp(1)
4. Generate a Uniform(0,1) random variable U_2
5. Check whether:
    U_2 ≤ f(Y)/(c·g(Y)) = exp(-(Y - 1)²/2) = exp(-(-log U_1 - 1)²/2)
6. If true, then set X = Y = -log U_1; if false then return to step 3 and repeat.
7. To now generate a standard normal random variable Z, generate U_3 and set:
    Z =  X   if U_3 > 0.5
        -X   if U_3 ≤ 0.5
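The seven steps can be sketched as follows (a minimal illustration, not the notes' own code):

```python
import math
import random

def standard_normal_rejection():
    """Simulate Z ~ N(0,1): rejection-sample X = |Z| from an Exp(1) proposal,
    then attach a random sign."""
    while True:
        u1 = random.random()
        y = -math.log(u1)                          # step 3: Y ~ Exp(1)
        u2 = random.random()                       # step 4
        if u2 <= math.exp(-(y - 1.0) ** 2 / 2.0):  # step 5: accept w.p. f/(c g)
            x = y                                  # step 6: X = |Z|
            break
    u3 = random.random()                           # step 7: random sign
    return x if u3 > 0.5 else -x
```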
3. Since the sum of n independent Exp(λ) random variables has a Gamma(n, λ) distribution, set:
    X = -(1/λ) Σ_{i=1}^{n} ln U_i ~ Gamma(n, λ)
Chi-Squared Distribution
The sum of the squares of n independent standard normal r.v. has a chi-squared distribution with n
degrees of freedom:
    Σ_{i=1}^{n} Z_i² ~ χ²_n
If the degrees of freedom are even, i.e. n = 2k, then the χ²_{2k} distribution is the same as a
Gamma(k, 1/2) distribution. If the degree of freedom is odd, i.e. 2k + 1, we can add on an extra Z²
term, where Z is standard normal. That is:
    -2 Σ_{i=1}^{k} ln U_i ~ χ²_{2k}
    Z² - 2 Σ_{i=1}^{k} ln U_i ~ χ²_{2k+1}
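A sketch combining the two cases, using Box-Muller for the extra Z² term when the degrees of freedom are odd (function name is my own):

```python
import math
import random

def chi_squared(df):
    """Simulate chi-squared with df degrees of freedom: -2*sum(ln U_i) covers
    the even part (a Gamma(k, 1/2)); an extra Z^2 handles an odd df."""
    k, odd = divmod(df, 2)
    x = -2.0 * sum(math.log(random.random()) for _ in range(k))
    if odd:
        # one standard normal via Box-Muller
        z = math.sqrt(-2.0 * math.log(random.random())) * math.cos(
            2.0 * math.pi * random.random())
        x += z * z
    return x
```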
Poisson Distribution
Recall that the no. of events within one period is Poi(λ) when the inter-event times follow an Exp(λ)
distribution. To simulate random variables from a Poisson(λ) distribution, the procedure is:
1. Generate independent Uniform(0,1) random variables U_1, U_2, ...
2. Since the waiting time until the nth event has a Gamma(n, λ) distribution, which is a sum of n
independent Exp(λ) random variables, we set:
    X = max{ n : -(1/λ) Σ_{i=1}^{n} ln U_i ≤ 1 } ~ Poi(λ)
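A sketch of this counting procedure: accumulate Exp(λ) inter-arrival times until they exceed 1 and count the arrivals that fit inside the period.

```python
import math
import random

def poisson(lam):
    """Count arrivals in [0, 1] when inter-arrival times are Exp(lam):
    X = max{n : -(1/lam) * sum(ln U_i) <= 1}."""
    n, total = 0, 0.0
    while True:
        total += -math.log(random.random()) / lam  # next Exp(lam) inter-arrival
        if total > 1.0:
            return n
        n += 1
```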
Normal Distribution
One method of simulating random variables from a normal distribution is the Box-Muller approach:
1. Generate two random variables U_1 and U_2 from Uniform(0,1).
2. Set:
    X = (-2 ln U_1)^{1/2} cos(2π U_2)
    Y = (-2 ln U_1)^{1/2} sin(2π U_2)
Then X and Y are independent standard normal random variables.
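A minimal sketch of the Box-Muller transform (function name is my own):

```python
import math
import random

def box_muller():
    """Return a pair (X, Y) of independent standard normals from two uniforms."""
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)
```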
Let X^(i) and Y^(i) be the ith simulated sample path of X and Y. Then the MC simulation procedure is:
1. Generate a random vector X^(1) = (X_1^(1), ..., X_n^(1)) with joint density f(x_1, ..., x_n) and compute
    Y^(1) = g(X^(1))
2. Generate a second random vector X^(2) = (X_1^(2), ..., X_n^(2)), independent from step 1, with joint
    density f(x_1, ..., x_n) and compute Y^(2) = g(X^(2))
3. Repeat this process until r (a fixed number) i.i.d. random vectors are generated:
    Y^(i) = g(X^(i)),  for i = 1, 2, ..., r
4. Estimate E[g(X)] with the sample average:
    Ȳ = (Y^(1) + ... + Y^(r))/r
This method works due to the strong law of large numbers:
    lim_{r→∞} Ȳ = lim_{r→∞} (1/r) Σ_{i=1}^{r} Y^(i) = E[g(X)] = E[g(X_1, ..., X_n)]
since each Y^(i) is independent and identically distributed with the distribution of g(X_1, ..., X_n).
The variance of Ȳ is given by:
    Var(Ȳ) = Var( (1/r) Σ_{i=1}^{r} Y^(i) ) = (1/r²) Σ_{i=1}^{r} Var(Y^(i)) = Var(Y^(i))/r
Note that usually we do not know Var(Y^(i)), so we estimate it using the sample estimate:
    s² = (1/(r-1)) Σ_{i=1}^{r} (Y^(i) - Ȳ)²
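A minimal sketch of the crude Monte Carlo estimator and its variance estimate; the test integrand g(x) = x² with E[U²] = 1/3 is my own illustrative choice:

```python
import random

def monte_carlo(g, sampler, r):
    """Crude Monte Carlo: average r i.i.d. evaluations Y_i = g(X_i) and
    return (estimate, estimated Var(Y_bar) = s^2 / r)."""
    ys = [g(sampler()) for _ in range(r)]
    y_bar = sum(ys) / r
    s2 = sum((y - y_bar) ** 2 for y in ys) / (r - 1)   # sample estimate of Var(Y_i)
    return y_bar, s2 / r

random.seed(0)
est, var = monte_carlo(lambda x: x * x, random.random, 20000)  # E[U^2] = 1/3
```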
In the next 3 sections, we will describe some techniques to REDUCE this variance
Antithetic Variables
Reducing the variance of the estimate using antithetic variables involves generating pairs of
estimates with negative correlation, then averaging these estimates to obtain a final estimate.
Assume that r is even. The antithetic variates procedure is:
1. Generate a set of n variates X_1, ..., X_n and determine Y^(1) = g(X_1, ..., X_n)
2. Use the antithetic variates X̃_1, ..., X̃_n to determine Ỹ^(1) = g(X̃_1, ..., X̃_n)
3. Repeat steps 1 and 2 r/2 times to form Y^(1), Y^(2), ..., Y^(r/2) and Ỹ^(1), Ỹ^(2), ..., Ỹ^(r/2)
4. Compute the two sample averages:
    Ȳ₁ = (2/r) Σ_{i=1}^{r/2} Y^(i)   and   Ȳ₂ = (2/r) Σ_{i=1}^{r/2} Ỹ^(i)
5. Use:
    Ȳ_AV = (Ȳ₁ + Ȳ₂)/2
as the final estimate for E[g(X)]
Using this method, as long as the correlation between Ȳ₁ and Ȳ₂ is negative, the variance will
be reduced. One example of this is by using X and 1 - X when X is Uniform(0,1), since:
    Cov(X, 1 - X) = -Var(X) < 0
To show why this method reduces the variance, consider the variance of the estimator using the
antithetic variable:
    Var(Ȳ_AV) = Var((Ȳ₁ + Ȳ₂)/2)
              = (1/4)[Var(Ȳ₁) + Var(Ȳ₂) + 2ρ√(Var(Ȳ₁)Var(Ȳ₂))]
              = (1/2)(1 + ρ)Var(Ȳ₁)              since Var(Ȳ₁) = Var(Ȳ₂)
              = (1/2)(1 + ρ) · Var(Y^(i))/(r/2)
              = (1 + ρ) · Var(Y^(i))/r
              < Var(Y^(i))/r = Var(Ȳ)            for ρ < 0
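A sketch of the antithetic procedure for a function of a single uniform, pairing U with 1 - U (the integrand g(x) = x² is my own illustrative choice):

```python
import random

def antithetic_estimate(g, r):
    """Estimate E[g(U)] with r evaluations: r/2 uniforms U, each paired with 1 - U."""
    assert r % 2 == 0
    total = 0.0
    for _ in range(r // 2):
        u = random.random()
        total += g(u) + g(1.0 - u)   # negatively correlated pair
    return total / r

random.seed(0)
est = antithetic_estimate(lambda x: x * x, 10000)   # E[U^2] = 1/3
```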
Control Variates
Suppose we want to estimate E[g(X)] and there is a function f such that the expected
value of f(X) can be evaluated analytically with:
    μ = E[f(X)]
Then, for any constant a, the control variate estimator:
    Y_cv = g(X) + a(f(X) - μ)
is unbiased for E[g(X)], since:
    E[Y_cv] = E[g(X) + a(f(X) - μ)] = E[g(X)] + a(E[f(X)] - μ) = E[g(X)]
Based on r simulations, the estimate of E[g(X)] is:
    Ȳ_cv = (1/r) Σ_{i=1}^{r} [g(X^(i)) + a(f(X^(i)) - μ)]
Its variance is:
    Var(Ȳ_cv) = (1/r²) Σ_{i=1}^{r} Var(g(X^(i)) + a·f(X^(i)))
              = (1/r) Var(g(X^(i)) + a·f(X^(i)))
              = (1/r)[Var(g(X^(i))) + a² Var(f(X^(i))) + 2a Cov(g(X^(i)), f(X^(i)))]
This variance can be minimised by solving:
    ∂/∂a Var(Ȳ_cv) = (1/r)[2a Var(f(X^(i))) + 2 Cov(g(X^(i)), f(X^(i)))] = 0
With the solution being the best choice of a:
    a* = -Cov(g(X^(i)), f(X^(i))) / Var(f(X^(i)))
Substituting this value back, we have that the minimised variance is:
    Var(Ȳ_cv) = (1/r)[Var(g(X^(i))) - Cov(g(X^(i)), f(X^(i)))²/Var(f(X^(i)))]
              ≤ (1/r) Var(g(X^(i)))
              = Var(Ȳ)
since Cov(g(X^(i)), f(X^(i)))²/Var(f(X^(i))) ≥ 0.
Therefore, using a control variate has decreased the variance compared to the original estimator.
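A sketch with g(x) = eˣ, control f(x) = x and μ = 1/2 (my own illustrative choices); the optimal a is estimated from the same sample:

```python
import math
import random

def control_variate_estimate(g, f, mu_f, sampler, r):
    """Control-variate estimator: average g(X) + a*(f(X) - mu_f), with the
    optimal a = -Cov(g, f)/Var(f) estimated from the sample itself."""
    xs = [sampler() for _ in range(r)]
    gs = [g(x) for x in xs]
    fs = [f(x) for x in xs]
    g_bar, f_bar = sum(gs) / r, sum(fs) / r
    cov = sum((gi - g_bar) * (fi - f_bar) for gi, fi in zip(gs, fs)) / (r - 1)
    var_f = sum((fi - f_bar) ** 2 for fi in fs) / (r - 1)
    a = -cov / var_f
    return sum(gi + a * (fi - mu_f) for gi, fi in zip(gs, fs)) / r

random.seed(0)
# E[exp(U)] = e - 1 for U ~ Uniform(0,1), with E[U] = 1/2 known exactly
est = control_variate_estimate(math.exp, lambda x: x, 0.5, random.random, 5000)
```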
Importance Sampling
We can write:
    E[g(X)] = ∫ g(x) f(x) dx
            = ∫ g(x) (f(x)/h(x)) h(x) dx
            = E_h[ g(Y) f(Y)/h(Y) ]
where Y is a random variable with density h(x). Then, we can simulate Y_i from the density h(x) and
estimate the expectation E[g(X)] with:
    Ȳ_h = (1/r) Σ_{i=1}^{r} g(Y_i) f(Y_i)/h(Y_i)
This reduces the variance of the estimate whenever:
    Var_h( g(Y) f(Y)/h(Y) ) < Var( g(X) )
To achieve this, we need to select a density h(x) such that the ratio of the two densities, i.e. f(x)/h(x), is
large when g(x) is small and vice versa.
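A sketch estimating E[X²] for X ~ Uniform(0,1) (so f(x) = 1) by sampling instead from h(y) = 2y on (0,1), simulated as Y = √U; the choice of densities is my own illustration:

```python
import math
import random

def importance_sampling_estimate(r):
    """Estimate E[g(X)] = E[X^2], X ~ Uniform(0,1), by sampling Y from
    h(y) = 2y on (0,1) and averaging g(Y) * f(Y)/h(Y)."""
    total = 0.0
    for _ in range(r):
        y = math.sqrt(random.random())        # Y has density h(y) = 2y
        total += (y * y) * 1.0 / (2.0 * y)    # g(Y) * f(Y)/h(Y) = Y/2
    return total / r

random.seed(0)
est = importance_sampling_estimate(20000)     # true value is 1/3
```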
Number of Simulations
Ideally, we want to carry out simulations as efficiently as possible, since the no. of simulations
required for accuracy is quite large. Assume we generate n samples from a known distribution. To
estimate its mean μ we will use the sample average as an estimator:
    Ȳ = (1/n) Σ_{i=1}^{n} Y^(i)
This is an unbiased estimator, i.e.:
    E[Ȳ] = μ
with variance:
    Var(Ȳ) = Var(Y^(i))/n = σ²/n
However, the value of Var(Y^(i)) is usually not known, so we estimate it using the sample variance
from the first k runs, where k < n and usually at least 30. Then the estimates of the mean and variance
become:
    Ȳ_k = (1/k) Σ_{i=1}^{k} Y^(i)
    s² = (1/(k-1)) Σ_{i=1}^{k} (Y^(i) - Ȳ_k)²
For large values of n, we know that by the Central Limit Theorem this estimator will be
approximately normal:
    Ȳ ~ N(μ, σ²/n)
Then, we can select n such that the estimate is within a desired accuracy of the true mean, i.e. a
percentage of the true mean with a specified probability.
E.g. random variates are generated from a Gamma(α, λ) distribution:
    f(x) = (λ^α/Γ(α)) x^{α-1} e^{-λx},  for x > 0
Twenty values are generated for each sample and the mean and standard deviation of each sample
are given as:

Sample    1      2      3      4      5      6      7      8      9      10
Mean      12.01  11.79  13.43  14.01  11.44  11.19  11.24  12.42  12.91  12.29
SD        6.15   4.73   6.42   7.02   4.30   3.90   3.84   4.35   3.59   4.60
Determine the no. of simulated values required for the estimate of the mean to be within 5% of
the true mean with 95% certainty.
We require n such that:
    Pr( |X̄ - μ| ≤ 0.05μ ) ≥ 0.95
    Pr( |X̄ - μ|/(σ/√n) ≤ 0.05μ/(σ/√n) ) ≥ 0.95
    Pr( |Z| ≤ 0.05μ√n/σ ) ≥ 0.95
    Pr( -0.05μ√n/σ ≤ Z ≤ 0.05μ√n/σ ) ≥ 0.95
    2 Pr( Z ≤ 0.05μ√n/σ ) - 1 ≥ 0.95
    0.05μ√n/σ ≥ 1.96
We do not know the true values for μ and σ, so we must use the sample estimates. The sample
estimate for the mean is given by averaging the mean of each sample:
    X̄ = (1/(10·20)) Σ_{i=1}^{10} Σ_{j=1}^{20} X_ij = (1/10) Σ_{i=1}^{10} X̄_i = 12.273
The sample estimate of the variance based on all n = 10·20 = 200 values is:
    s² = (1/(n-1)) Σ_{i=1}^{n} (X_i - X̄)²
       = (1/(n-1)) [ Σ_{i=1}^{n} X_i² - n X̄² ]
       = (1/(10·20 - 1)) [ Σ_{i=1}^{10} Σ_{j=1}^{20} x_ij² - 10·20·X̄² ]
Where we have:
    Σ_{i=1}^{10} Σ_{j=1}^{20} x_ij² = Σ_{i=1}^{10} [ (20-1) s_i² + 20 X̄_i² ]
                                    = 19 Σ_{i=1}^{10} s_i² + 20 Σ_{i=1}^{10} X̄_i²
                                    = 4790.0596 + 30285.702
                                    = 35075.7616
Therefore the estimated variance and standard deviation are:
    s² = (1/199)(35075.7616 - 200 × 12.273²) = 24.87666
    s = 4.98765
Substituting back into the previous equation, we have:
    0.05 μ √n / σ ≥ 1.96
    0.05 × 12.273 × √n / 4.98765 ≥ 1.96
    n ≥ 253.8, i.e. n ≥ 254
We can also estimate the parameters of the Gamma distribution, since we know that:
    E[X] = α/λ = 12.273
    Var(X) = α/λ² = 24.87666
Solving simultaneously gives:
    λ = 12.273/24.87666 = 0.4934,  α = 12.273 × 0.4934 = 6.0549
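The whole calculation above can be checked numerically (the variable names are my own):

```python
import math

# Sample means and SDs from the table (10 samples of 20 values each)
means = [12.01, 11.79, 13.43, 14.01, 11.44, 11.19, 11.24, 12.42, 12.91, 12.29]
sds = [6.15, 4.73, 6.42, 7.02, 4.30, 3.90, 3.84, 4.35, 3.59, 4.60]

x_bar = sum(means) / 10                                          # pooled mean
sum_sq = sum(19 * s**2 + 20 * m**2 for m, s in zip(means, sds))  # sum of x_ij^2
s2 = (sum_sq - 200 * x_bar**2) / 199                             # pooled variance
s = math.sqrt(s2)

# Smallest n with 0.05 * x_bar * sqrt(n) / s >= 1.96
n = math.ceil((1.96 * s / (0.05 * x_bar)) ** 2)

# Method-of-moments Gamma parameters: lambda = mean/var, alpha = mean*lambda
lam = x_bar / s2
alpha = x_bar * lam
```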