
A Simple Guide to Partial Observability

Matthew Mitchell

May 27, 2008

1 Introduction

This guide provides an introduction to Hidden Markov Models and draws heavily from the excellent tutorial paper written by Rabiner (1989). I have attempted to provide intuition into the various algorithms by using some simple fully worked examples to hopefully illuminate the mathematical descriptions provided by Rabiner. It is hoped the explanation of the concepts is sufficient to help an otherwise uninformed reader to understand later descriptions (such as the Baum-Welch algorithm) where worked examples are not provided.

The second part of the guide aims to explain the relationship between Hidden Markov Models (HMMs) and Partially Observable Markov Decision Processes (POMDPs). While the primary concern in HMMs is to learn a good model, the addition of actions (the ability to make decisions) in POMDPs adds the additional concern of learning which action to select. POMDPs therefore combine two related problems: learning a model and learning which action to select. Some systems address one of these concerns exclusively, others attempt to address both simultaneously.

2 Partially Observable Processes

In partially observable processes it is assumed that there is an underlying Markov process which cannot be observed directly; instead, for each state a number of observations are possible, each with a certain probability. Because the observations do not uniquely identify the state, the underlying state transition model is difficult to discover and the true state at any time is difficult to keep track of.

A first step in solving a partially observable problem is to discover the probability of occurrence of particular observations in each state. This may seem a circular problem because to do this successfully you need to know what the state transitions are. Ignoring this circularity for now, and assuming that we can discover the probability of observations given states, we then need a method of using this knowledge to track which state we are currently in. This difficulty arises because observations do not correspond uniquely with states, so the same sequence of observations may occur with many different underlying state sequences. The solution to this involves maintaining a probability for each of the possible states we may currently be in, given the observations we have encountered since the start of the process (or since we were last certain of what state we were in). The sum of all the probabilities must equal 1.0 (that is, we must be in one state and one state only), and this probability distribution across states at a particular time is called a belief state. The belief state is a single point in multi-dimensional space where the dimensions are the different possible states. The set of all possible belief states (all possible points) is called the belief space. Since probabilities are real numbers this space is continuous, not discrete.

2.1 A Hidden Markov Model


An example of a partially observable process is provided by Rabiner (1989). In his example we have a small number of urns, each containing a collection of different coloured balls. The urns all contain balls of the same colours, but different numbers of each. Let us assume there are 3 urns each containing 20 balls. The second urn has 10 green balls while each of the others has 5. Now imagine the urns are hidden from our view and someone randomly draws a ball from one of the urns and calls out its colour. If we know which urns contain which balls, we can construct a belief state about which urn the ball came from. For example, if we are told (reliably) that the ball is green we can assume it came from the second urn with probability 0.5 (since half the green balls are in this urn) and from each of the other urns with probability 0.25. Thus our states are the urns, our observations the balls, and our belief state in this case is (0.25, 0.50, 0.25). The set of states is denoted as $S = \{S_1, S_2, \ldots, S_N\}$ where $N$ is the number of states.
             Urn 1    Urn 2    Urn 3
P(GREEN)     0.25     0.50     0.25
P(YELLOW)    0.25     0.25     0.50
P(RED)       0.50     0.25     0.25

Figure 1: The three urns.

Let's continue the example assuming that each urn contains 5 yellow and 5 green balls, except that urn 2 contains 10 green balls and urn 3 contains 10 yellow balls, the remaining balls in each urn being red (see Figure 1). Then a possible sequence of observations is $(RED_1, YELLOW_2, RED_3)$, where the first ball (at timestep 1) is red, the second (at timestep 2) yellow and the last (at timestep 3) red again; assume also that each ball is replaced in its urn before the next is drawn. There are $3^3 = 27$ possible state sequences that match this observation sequence, which are shown in Tables 2(a) to 2(c).
There are 3 possible urns that the third ball may have come from, and for each of these there are 9 possible paths (state transitions) that may have led there. The belief state will depend on the probability of each of these paths given the observation sequence. Since our observation sequence is $(RED_1, YELLOW_2, RED_3)$ and each urn is equally likely to be selected from each time, the most likely state sequence is $(1_1, 3_2, 1_3)$; that is, the first ball was drawn from Urn 1 at timestep 1, the second ball from Urn 3 at timestep 2 and the third ball from Urn 1 again at timestep 3. The belief state after the third observation is (0.5, 0.25, 0.25).
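To make the enumeration concrete, the following short Python sketch (my own illustration, not code from Rabiner) computes the belief state over the three urns after observing (RED, YELLOW, RED) by summing the probability of every possible state sequence, using the observation probabilities of Figure 1 and the assumption that urns are chosen uniformly at random at every timestep.

from itertools import product

# Observation probabilities for each urn (Figure 1)
b = {
    1: {"RED": 0.50, "YELLOW": 0.25, "GREEN": 0.25},
    2: {"RED": 0.25, "YELLOW": 0.25, "GREEN": 0.50},
    3: {"RED": 0.25, "YELLOW": 0.50, "GREEN": 0.25},
}
urns = [1, 2, 3]
obs = ["RED", "YELLOW", "RED"]

# Each urn is equally likely to be chosen at every timestep, so every state
# sequence has the same prior probability and only the observation terms matter.
belief = {u: 0.0 for u in urns}
for seq in product(urns, repeat=len(obs)):            # all 27 state sequences
    p = 1.0
    for urn, colour in zip(seq, obs):
        p *= b[urn][colour]
    belief[seq[-1]] += p                              # group by the final urn

total = sum(belief.values())
belief = {u: p / total for u, p in belief.items()}    # normalise to a distribution
print({u: round(p, 2) for u, p in belief.items()})    # {1: 0.5, 2: 0.25, 3: 0.25}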

2.2 Describing the Hidden Markov Model


Our example has all the features of a Hidden Markov Model (HMM). It has 3 states (the three urns) and for each of these states three possible observations (the coloured balls). Each state has its own observation probability for each of the colours and an underlying transition probability to the other states. Combining this with our equi-probable initial state distribution, we have a model which we can use either to generate observation sequences or to explain how a given observation sequence was generated (Rabiner 1989).

3 Using HMMs - The Three Tasks

Rabiner (1989) describes three tasks we need to be able to do with HMMs to make them useful:
1. to compute the probability of an observation sequence;
2. to choose a state sequence which explains the observations; and
3. to adjust the model parameters.
If we can do 1, and we have many possible models (possible states, state transitions or observation probabilities), we can select the model which best matches an observation sequence by determining in which model the sequence was most probable. Using 2 we can determine the most probable state sequence given an observation sequence. If we can do 3 we can construct models from training sets of observation sequences.

4 Implementing Task 1

Task 1, calculating the probability of an observation sequence for a given model, is implemented by taking into account the state transitions in that model along with the observation probabilities for each of those states. For a particular state sequence we multiply the probability of each state occurrence by the probability of each observation occurrence.

Sequence   Prob 1 x Prob 2 x Prob 3   Product
(1,1,1)    0.50 x 0.25 x 0.50         0.06
(1,2,1)    0.50 x 0.25 x 0.50         0.06
(1,3,1)    0.50 x 0.50 x 0.50         0.13
(2,1,1)    0.25 x 0.25 x 0.50         0.03
(2,2,1)    0.25 x 0.25 x 0.50         0.03
(2,3,1)    0.25 x 0.50 x 0.50         0.06
(3,1,1)    0.25 x 0.25 x 0.50         0.03
(3,2,1)    0.25 x 0.25 x 0.50         0.03
(3,3,1)    0.25 x 0.50 x 0.50         0.06
Sum                                   0.50
(a) Possible state sequences terminating with Urn 1.

Sequence   Prob 1 x Prob 2 x Prob 3   Product
(1,1,2)    0.50 x 0.25 x 0.25         0.03
(1,2,2)    0.50 x 0.25 x 0.25         0.03
(1,3,2)    0.50 x 0.50 x 0.25         0.06
(2,1,2)    0.25 x 0.25 x 0.25         0.02
(2,2,2)    0.25 x 0.25 x 0.25         0.02
(2,3,2)    0.25 x 0.50 x 0.25         0.03
(3,1,2)    0.25 x 0.25 x 0.25         0.02
(3,2,2)    0.25 x 0.25 x 0.25         0.02
(3,3,2)    0.25 x 0.50 x 0.25         0.03
Sum                                   0.25
(b) Possible state sequences terminating with Urn 2.

Sequence   Prob 1 x Prob 2 x Prob 3   Product
(1,1,3)    0.50 x 0.25 x 0.25         0.03
(1,2,3)    0.50 x 0.25 x 0.25         0.03
(1,3,3)    0.50 x 0.50 x 0.25         0.06
(2,1,3)    0.25 x 0.25 x 0.25         0.02
(2,2,3)    0.25 x 0.25 x 0.25         0.02
(2,3,3)    0.25 x 0.50 x 0.25         0.03
(3,1,3)    0.25 x 0.25 x 0.25         0.02
(3,2,3)    0.25 x 0.25 x 0.25         0.02
(3,3,3)    0.25 x 0.50 x 0.25         0.03
Sum                                   0.25
(c) Possible state sequences terminating with Urn 3.

Figure 2: Possible state sequences.

The specification of the example problem in Section 2.1 will be the model, $\lambda$, used in the following discussion. The probability of the observation sequence $O = (RED_1, YELLOW_2, RED_3)$ and the state sequence $Q = (URN1_1, URN1_2, URN1_3)$ occurring given this model is calculated using equation 1. This gives us $P(O, Q \mid \lambda)$, i.e. the probability of $O$ and $Q$ occurring together given the model.

P(O, Q \mid \lambda) = P(RED \mid URN1) \cdot P(URN1) \cdot P(YELLOW \mid URN1) \cdot P(URN1) \cdot P(RED \mid URN1) \cdot P(URN1)    (1)
                     = 0.50 \times 0.33 \times 0.25 \times 0.33 \times 0.50 \times 0.33
                     = 0.225 \times 10^{-2}
To find the probability of $O$ given the model, $P(O \mid \lambda)$, we need to calculate $P(O, Q \mid \lambda)$ for all possible state sequences and sum these probabilities. Tables 3(a) to 3(c) demonstrate this for all possible state sequences given the observation sequence $(RED_1, YELLOW_2, RED_3)$, showing that the probability of seeing this observation sequence given this model is 0.04.
The enumerative approach used in Tables 3(a) to 3(c) can be quite computationally expensive for large problems. Fortunately, there is a more efficient method called the forward-backward procedure (Rabiner 1989). In fact we will see as we progress that the forward-backward procedure has a role to play in all three of the tasks we require of HMMs.

4.1 The Forward Procedure


For now we are primarily interested in the forward part of the forward-backward procedure. The procedure gains its efficiency by taking advantage of the fact that there are only $N$ states, and any sequence of states, regardless of length, must merge back into one of these $N$ states.
For a given observation sequence (or partial sequence) and model we calculate the probability of being in each state $S_i$ as the forward variable (Rabiner 1989):

\alpha_t(i) = P(O_1, O_2, \ldots, O_t, q_t = S_i \mid \lambda)    (2)

Here the observation sequence includes the observations from time 1 to time $t$, $(O_1, \ldots, O_t)$, and $q_t$ denotes the state at time $t$, which in this case is $S_i$. We are assuming the model $\lambda$ from Section 2.1.
Starting with the first observation in the sequence we calculate the forward variable for each state. We then use the forward variables from one timestep to calculate the forward variables of the next. Each new set of forward variables is adapted from the last using the probability of the next observation in the sequence along with the probabilities of the possible next states. This continues until we reach the end of the sequence.
We will use the observation sequence $(RED_1, YELLOW_2, RED_3)$ for a worked example. Each iteration of the algorithm is split into two loops.

Q         P[O1 = R]   a_ij   P[O2 = Y]   a_ij   P[O3 = R]   a_ij   Product
(1,1,1)   0.50        0.33   0.25        0.33   0.50        0.33   0.00225
(1,2,1)   0.50        0.33   0.25        0.33   0.50        0.33   0.00225
(1,3,1)   0.50        0.33   0.50        0.33   0.50        0.33   0.00449
(2,1,1)   0.25        0.33   0.25        0.33   0.50        0.33   0.00112
(2,2,1)   0.25        0.33   0.25        0.33   0.50        0.33   0.00112
(2,3,1)   0.25        0.33   0.50        0.33   0.50        0.33   0.00225
(3,1,1)   0.25        0.33   0.25        0.33   0.50        0.33   0.00112
(3,2,1)   0.25        0.33   0.25        0.33   0.50        0.33   0.00112
(3,3,1)   0.25        0.33   0.50        0.33   0.50        0.33   0.00225
Sum                                                                0.01797
(a) Possible observation and state sequences terminating with Urn 1.

Q         P[O1 = R]   a_ij   P[O2 = Y]   a_ij   P[O3 = R]   a_ij   Product
(1,1,2)   0.50        0.33   0.25        0.33   0.25        0.33   0.00112
(1,2,2)   0.50        0.33   0.25        0.33   0.25        0.33   0.00112
(1,3,2)   0.50        0.33   0.50        0.33   0.25        0.33   0.00225
(2,1,2)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(2,2,2)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(2,3,2)   0.25        0.33   0.50        0.33   0.25        0.33   0.00112
(3,1,2)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(3,2,2)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(3,3,2)   0.25        0.33   0.50        0.33   0.25        0.33   0.00112
Sum                                                                0.00898
(b) Possible observation and state sequences terminating with Urn 2.

Q         P[O1 = R]   a_ij   P[O2 = Y]   a_ij   P[O3 = R]   a_ij   Product
(1,1,3)   0.50        0.33   0.25        0.33   0.25        0.33   0.00112
(1,2,3)   0.50        0.33   0.25        0.33   0.25        0.33   0.00112
(1,3,3)   0.50        0.33   0.50        0.33   0.25        0.33   0.00225
(2,1,3)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(2,2,3)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(2,3,3)   0.25        0.33   0.50        0.33   0.25        0.33   0.00112
(3,1,3)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(3,2,3)   0.25        0.33   0.25        0.33   0.25        0.33   0.00056
(3,3,3)   0.25        0.33   0.50        0.33   0.25        0.33   0.00112
Sum                                                                0.00898
(c) Possible observation and state sequences terminating with Urn 3.

Figure 3: Possible observation and state sequences.

In the inner loop of the first iteration we calculate the initial forward variables for each state (the $\alpha_1(i)$ values for all $i$) as the probability of having selected a ball from a particular urn (of being in each state) given that the ball is red (the partial sequence $(RED_1)$). This is simply the product of the initial state occupation probability for each state ($\pi_i$) and the corresponding probability of the observation RED ($b_i(RED)$) for each state, as shown in Table 1.

$\pi_i$   $b_i(RED)$   $\alpha_1(i)$
0.33      0.50         0.1650
0.33      0.25         0.0825
0.33      0.25         0.0825

Table 1: Inner loop of iteration 1 of the forward-backward procedure - the initial set of state and observation probabilities for our model given the partial observation sequence $(RED_1)$.

The forward values (Æ1 (i)) in Table 1 give us the probability for ea h urn
of it being the urn the ball was sele ted from and of the sele ted ball being
RED at timestep 1. We now move to the outer loop of iteration 1 and use these
probabilites to al ulate the probabilities of the next state. We use the forward
variables along with the state transition probabilities (ai;j , for the transition
from state Si at time t to state Sj at t + 1 where 0  i  N and 0  j  N )
to work out the di erent paths to ea h of our three states. The sum of the
paths to a parti ular state is the likelihood we are in that state given the partial
observation sequen e so far. The paths to ea h of our states given the initial
forward variables are shown in Table 2.
Æ1 (i) ai;j Produ t
0.1650 0.33 0.05445
0.0825 0.33 0.02723
0.0825 0.33 0.02723
Sum 0.10890
Table 2: Outer loop of iteration 1 of the forward-ba kward algorithm. This loop
al ulates the probability of being in a state Sj given the previous state Si and
the observation probabilities. Sin e the transition probabilites from Si to Sj are
the same for all states in S , the al ulation is the same for all 3 states.

Notice that at this point our state occupation probabilities have collapsed down into a single probability for each of the three possible states. We now take the product of these state probabilities and the next observation to calculate the next set of forward values. This is very similar to the calculation of our original forward values. We replace the initial state occupation probabilities with the state occupation probabilities we just calculated, which take into account the observation $O_1$ (RED), and also include the observation probabilities for $O_2$
sums     $b_i(YELLOW)$   $\alpha_2(i)$
0.1089   0.25            0.02723
0.1089   0.25            0.02723
0.1089   0.50            0.05445

Table 3: Inner loop of iteration 2 of the forward-backward procedure - the second set of state and observation probabilities for our model given the partial observation sequence $(RED_1, YELLOW_2)$.

(YELLOW). Table 3 shows the calculation of the second set of forward values.
This calculation, including the inner and outer loops for a particular timestep $t$ (where $2 \le t \le T$), forms the update rule of the forward procedure, which is summarised in equation 3 (Rabiner 1989).

\alpha_t(j) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(i) \, a_{i,j} \right] b_j(O_t)    1 \le j \le N    (3)

From these we again calculate the probabilities of the next state, as shown in Table 4.
$\alpha_2(i)$   $a_{i,j}$   Product
0.0272          0.33        0.00898
0.0272          0.33        0.00898
0.0545          0.33        0.01797
Sum                         0.03594

Table 4: Outer loop of iteration 2 of the forward-backward algorithm. This loop calculates the probability of being in a state given the previous state and the observation probabilities. Since the transition probabilities from $S_i$ to $S_j$ are the same for all states in $S$, the calculation is the same for all 3 states.

Finally, the procedure terminates by calculating the product of the state probabilities produced by iteration 2 with the respective observation probabilities for the final observation of RED (see Table 5). As with the enumerative approach, the probability of the observation sequence $(RED_1, YELLOW_2, RED_3)$ given our model is 0.04.
The forward procedure implements Task 1: it finds the probability of an observation sequence given a model. It does this by first finding the probability of the observation sequence occurring with each state sequence, $P(O, Q \mid \lambda)$, and then summing these probabilities to arrive at the probability of the observation sequence given the model, $P(O \mid \lambda)$. For Tasks 2 and 3 we also need to know the backward part of the forward-backward procedure.
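As a compact illustration, the following Python sketch (my own code, not from Rabiner) implements the forward procedure for the urn model of Section 2.1 and reproduces the value of approximately 0.04 for the sequence (RED, YELLOW, RED).

# Model from Section 2.1: uniform initial distribution and transitions,
# observation probabilities from Figure 1.
states = [0, 1, 2]                       # urns 1, 2, 3
pi = [1/3, 1/3, 1/3]                     # initial state occupation probabilities
A = [[1/3]*3 for _ in states]            # A[i][j]: transition probabilities
B = {"RED":    [0.50, 0.25, 0.25],       # B[colour][i]: observation probabilities
     "YELLOW": [0.25, 0.25, 0.50],
     "GREEN":  [0.25, 0.50, 0.25]}

def forward(obs):
    """Return P(O | lambda) using the forward procedure (equation 3)."""
    # Initialisation: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [pi[i] * B[obs[0]][i] for i in states]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(O_t)
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in states) * B[o][j]
                 for j in states]
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)

print(round(forward(["RED", "YELLOW", "RED"]), 2))   # 0.04 (exactly 1/27 before rounding)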

4.2 The Backward Procedure
The backward part of the forward-backward procedure finds the probability of a partial observation sequence occurring from a given state. Recall that the forward variable (see equation 2) tells us the probability of a partial observation sequence up to a certain state. The backward variable is similar. It tells us the probability of a partial observation sequence from a certain state, as follows (Rabiner 1989):

\beta_t(i) = P(O_{t+1}, O_{t+2}, \ldots, O_T \mid q_t = S_i, \lambda)    (4)

Here, given that the state at time $t$, $q_t$, is $S_i$, we calculate the probability of the partial observation sequence $O$ from $t+1$ to $T$, given the model.
Starting at time $t$ we need to consider all the possible state transitions to states at $t+1$, along with the observation probabilities in those states. We then continue considering all the possible transitions from these states. This looks like a large amount of computation, but the backward variable, like the forward variable, gives us some efficiency due to the collapsing of states.
For each state $i$ at time $t$ we need the product of the state transition from $S_i$ to $S_j$ (i.e. $a_{i,j}$), the probability of the observation we are interested in at $t+1$ occurring in state $S_j$ (i.e. $b_j(O_{t+1})$), and the backward variable for $S_j$ at $t+1$, $\beta_{t+1}(j)$. The calculation of this product for a given $i$ and $j$ will be referred to as the product rule for the backward procedure. The backward variable for state $S_i$ at time $t$ is the sum of the product rule applied for all possible next states (all $j$ at $t+1$).
Since we need the backward variables for $t+1$ to calculate the backward variables for $t$, we work from $T$ back to $t$ (from the end to the beginning). This process is started (initialised arbitrarily) by defining the backward variable for all states (all $j$) at time $T$ as 1. We then compute the backward variables for each previous timestep in turn until we reach $t$.
As a worked example we will assume we are currently in state 1 at time $t$ (the last ball selected was from urn 1) and we are using the model from Section 2.1. We want to know the probability of the observation sequence $(RED_{t+1}, YELLOW_{t+2}, RED_T)$, where $T = t+3$.

sums     $b_i(RED)$   $\alpha_3(i)$
0.0359   0.50         0.01797
0.0359   0.25         0.00898
0.0359   0.25         0.00898
Sum                   0.04

Table 5: Termination of the forward-backward procedure - the third set of state and observation probabilities for our model given the observation sequence $(RED_1, YELLOW_2, RED_3)$.

Since we know the backward variables for all states in $S$ (all $S_j$) at time $T$ (they are all 1), we first use these with our product rule to calculate the backward variables for time $T - 1$ (which is also $t+2$), as shown in Table 6.

$a_{i,j}$   $b_j(RED)$   $\beta_T(j)$   Product
0.33        0.50         1.0            0.17
0.33        0.25         1.0            0.08
0.33        0.25         1.0            0.08
Sum $\beta_{t+2}(i)$                    0.33

Table 6: Applying the product rule for all $j$ and calculating $\beta_{t+2}(i)$. Since the transitions from $i$ to $j$ are the same for all states in $S$, the sum is the same for all $S_i$ (i.e. all states).

We then use the $\beta_{t+2}(i)$ values for all $i$ (all states) in our next iteration to calculate $\beta_{t+1}(i)$, as shown in Table 7. This process repeats one more time to terminate with the calculation, shown in Table 8, of $\beta_t(i)$, which gives the probability of the observation sequence $(RED_{t+1}, YELLOW_{t+2}, RED_T)$ given the state $S_i$ at time $t$ as 0.04.

$a_{i,j}$   $b_j(YELLOW)$   $\beta_{t+2}(j)$   Product
0.33        0.25            0.33               0.03
0.33        0.25            0.33               0.03
0.33        0.50            0.33               0.05
Sum $\beta_{t+1}(i)$                           0.11

Table 7: Applying the product rule for all $j$ and calculating $\beta_{t+1}(i)$. Again, since the transitions from $S_i$ to $S_j$ are the same for all states in $S$, the sum is the same for all $S_i$.

$a_{i,j}$   $b_j(RED)$   $\beta_{t+1}(j)$   Product
0.33        0.50         0.11               0.02
0.33        0.25         0.11               0.01
0.33        0.25         0.11               0.01
Sum $\beta_t(i)$                            0.04

Table 8: Applying the product rule for all $j$ and calculating $\beta_t(i)$.

Now that we have both the forward and backward parts of the forward-backward procedure, we can look at how they, or other procedures based on them, can be applied to solving Tasks 2 and 3.
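For completeness, here is a matching Python sketch of the backward procedure (again my own illustration, repeating the small model definition so that it runs standalone); it reproduces the beta values of Tables 6 to 8 up to rounding.

states = [0, 1, 2]                          # urns 1, 2, 3 (as in the forward sketch)
A = [[1/3]*3 for _ in states]               # uniform transition probabilities
B = {"RED": [0.50, 0.25, 0.25], "YELLOW": [0.25, 0.25, 0.50], "GREEN": [0.25, 0.50, 0.25]}

def backward_from(future_obs):
    """Probability of seeing the given future observations from each current state,
    computed with the backward recursion of equation 4:
    beta_t(i) = sum_j a_ij * b_j(O_{t+1}) * beta_{t+1}(j)."""
    beta = [1.0 for _ in states]            # initialisation at the final time step
    for o in reversed(future_obs):          # step back one observation at a time
        beta = [sum(A[i][j] * B[o][j] * beta[j] for j in states)
                for i in states]
    return beta

# Probability of the sequence (RED_{t+1}, YELLOW_{t+2}, RED_T) from each state at
# time t; each entry is about 0.037 (0.04 with the 0.33 rounding used in the tables).
print([round(b, 3) for b in backward_from(["RED", "YELLOW", "RED"])])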

5 Implementing Task 2

Task 2 was to find a state sequence which best explains the observations. Two possible solutions to this, described in Rabiner (1989), are presented. The first solution uses both the forward variable described in Section 4.1 and the backward variable described in Section 4.2.

5.1 Using the forward and backward variables


One possible solution to Task 2 is to look at each observation in turn in the observation sequence and find the state that is most likely at that point given the entire sequence. This is expressed in equation 5, which defines our forward-backward variable $\gamma_t(i)$ as the probability of being in state $S_i$ at time $t$ given an observation sequence and a model (Rabiner 1989).

\gamma_t(i) = P(q_t = S_i \mid O, \lambda)    (5)

To take into consideration the entire sequence, we must take into account the observations leading up to the one we are interested in, as well as the observations following it. The forward variable can be used to take into account those observations leading up to the current observation of interest (the current point in the observation sequence), while the backward variable can be used to include those following it. Equation 6 illustrates how the forward-backward variable $\gamma_t(i)$ is calculated from the forward variable, $\alpha_t(i)$, and the backward variable, $\beta_t(i)$. The denominator in equation 6, which sums the product of the forward and backward variables over all states, makes the forward-backward variable a probability (Rabiner 1989).

\gamma_t(i) = \frac{\alpha_t(i) \, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i) \, \beta_t(i)}    (6)

Once we have the forward-backward variables for all states at a given point, the most likely state at that point is the one whose forward-backward variable has the highest value (i.e. is most probable). Calculating the most likely sequence of states using forward-backward variables requires the variables to be calculated for every state at each observation point. However, because the most likely state at each point is calculated independently of the other points, it is possible that the most likely sequence of states produced by this process is one that in practice would never occur. This problem arises when the probability of a transition from one state in the calculated sequence to the next is 0 (Rabiner 1989). An alternative to finding a sequence of states by combining the independently calculated best states is to construct a sequence of best states which takes into account the other states in the sequence. In other words, the sequence as a whole must be optimal in some sense rather than the individual parts. There is a method for constructing such a sequence, called the Viterbi algorithm, which is based on the forward procedure and dynamic programming techniques (Rabiner 1989).
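The following Python sketch (my own illustration) combines the forward and backward recursions to compute the gamma variables of equation 6 for the urn example; the most likely state at each timestep is then simply the index of the largest entry of each gamma vector.

states = [0, 1, 2]
pi = [1/3]*3
A = [[1/3]*3 for _ in states]
B = {"RED": [0.50, 0.25, 0.25], "YELLOW": [0.25, 0.25, 0.50], "GREEN": [0.25, 0.50, 0.25]}

def forward_backward(obs):
    """Return gamma[t][i] = P(q_t = S_i | O, lambda) (equations 5 and 6)."""
    # Forward pass: alphas[t][i]
    alphas = [[pi[i] * B[obs[0]][i] for i in states]]
    for o in obs[1:]:
        prev = alphas[-1]
        alphas.append([sum(prev[i] * A[i][j] for i in states) * B[o][j]
                       for j in states])
    # Backward pass: betas[t][i]
    betas = [[1.0] * len(states)]
    for o in reversed(obs[1:]):
        nxt = betas[0]
        betas.insert(0, [sum(A[i][j] * B[o][j] * nxt[j] for j in states)
                         for i in states])
    # Combine and normalise (equation 6)
    gammas = []
    for a, b in zip(alphas, betas):
        norm = sum(ai * bi for ai, bi in zip(a, b))
        gammas.append([ai * bi / norm for ai, bi in zip(a, b)])
    return gammas

for t, g in enumerate(forward_backward(["RED", "YELLOW", "RED"]), start=1):
    print(t, [round(x, 2) for x in g])   # most likely urn at each timestep: 1, 3, 1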

5.2 Using the Viterbi Algorithm


The Viterbi algorithm finds the most probable sequence of states followed given a sequence of observations (i.e. the most probable path), together with its probability. For each subsequent observation in the sequence it selects the most probable state transition which would provide the observation. The probability of that state transition is used to calculate the probability of the final path, and each state transition selected is stored so the final path can be retrieved once the path is completed (Rabiner 1989).
The Viterbi algorithm is very similar to the forward algorithm, but rather than summing all the transition probabilities from one state to another it considers only the most likely transition. The probability of the best path up to time $t$ is calculated using what I will call the Viterbi variable, $\delta_t(j)$, presented in equation 7. The algorithm is initialised using the initial state occupation probabilities, as in equation 8 (Rabiner 1989).

\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i) \, a_{i,j}] \, b_j(O_t)    2 \le t \le T, \; 1 \le j \le N    (7)

\delta_1(i) = \pi_i \, b_i(O_1)    1 \le i \le N    (8)


The probability of the best path for the entire sequence, $P^*$, is determined using equation 9 (Rabiner 1989).

P^* = \max_{1 \le i \le N} [\delta_T(i)]    (9)

To keep track of the actual sequence of states we need to store each state on the most probable path as we do our calculations. Determining which state to store takes into account the current value of each state's Viterbi variable as well as its transition probabilities to other states. The state which has the maximum product of these two values is the one we store.
Our worked example will again be the observation sequence $(RED_1, YELLOW_2, RED_3)$. The initial values of the Viterbi variables for each state are the same as the initial values used in our forward procedure (see Section 4.1) and are shown in Table 9.
The calculation of the most probable state is shown in Table 10. The state selected is the one with the maximum product of its Viterbi variable and next state transition. This state (urn 1) is stored so we can keep track of the most

$\pi_i$   $b_i(RED)$   $\delta_1(i)$
0.33      0.50         0.1650
0.33      0.25         0.0825
0.33      0.25         0.0825

Table 9: Initial values for the Viterbi variables - the initial set of state and observation probabilities for our model given the partial observation sequence $(RED_1)$.

probable sequence of states. The product is then used along with the second observation (YELLOW) to calculate the next set of Viterbi variables. The use of a single product rather than a sum of products is where the Viterbi algorithm differs from the forward procedure presented in Section 4.1. The calculation of the products for the second step is shown in Table 12. Here the third urn is stored, and the maximum product is used along with the final observation RED to calculate the last set of Viterbi variables shown in Table 13.

State (urn)   $\delta_1(i)$   $a_{i,j}$   Product
1             0.16500         0.33        0.05445
2             0.08250         0.33        0.02723
3             0.08250         0.33        0.02723

Table 10: Calculating the most probable state (which is stored). This is the state with the maximum product of the Viterbi variable and the transition to state $S_j$. This state is urn 1. Since the transition probabilities are the same for all $S_j$, each state has only one row in the table.

Urn   Product   $b_j(YELLOW)$   $\delta_2(j)$
1     0.05445   0.25            0.01361
2     0.05445   0.25            0.01361
3     0.05445   0.50            0.02723

Table 11: Calculation of the Viterbi variables given the second observation.

Since the transitions are the same from all states, the third state to be stored is the state with the highest Viterbi variable, urn 1. So our most probable state sequence given the observation sequence $(RED_1, YELLOW_2, RED_3)$ is $(URN1_1, URN3_2, URN1_3)$. The probability of this state sequence occurring is 0.0045.

State (urn)   $\delta_2(i)$   $a_{i,j}$   Product
1             0.01361         0.33        0.00449
2             0.01361         0.33        0.00449
3             0.02723         0.33        0.00899

Table 12: Calculating the most probable state (which is stored) after the second observation. This state is urn 3. Since the transition probabilities are the same for all $S_j$, each state has only one row in the table.

Urn   Product   $b_j(RED)$   $\delta_3(j)$
1     0.00899   0.50         0.00450
2     0.00899   0.25         0.00225
3     0.00899   0.25         0.00225

Table 13: Calculation of the Viterbi variables given the third observation.
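A short Python sketch of the Viterbi algorithm for the same urn model follows (my own illustration); it recovers the path (urn 1, urn 3, urn 1) and its probability of roughly 0.0046 (0.0045 when 1/3 is rounded to 0.33 as in the worked tables).

states = [0, 1, 2]                  # urns 1, 2, 3
pi = [1/3]*3
A = [[1/3]*3 for _ in states]
B = {"RED": [0.50, 0.25, 0.25], "YELLOW": [0.25, 0.25, 0.50], "GREEN": [0.25, 0.50, 0.25]}

def viterbi(obs):
    """Return the most probable state path and its probability (equations 7-9)."""
    delta = [pi[i] * B[obs[0]][i] for i in states]        # equation 8
    backpointers = []
    for o in obs[1:]:
        psi = []
        new_delta = []
        for j in states:
            # Best predecessor for state j (equation 7)
            best_i = max(states, key=lambda i: delta[i] * A[i][j])
            psi.append(best_i)
            new_delta.append(delta[best_i] * A[best_i][j] * B[o][j])
        backpointers.append(psi)
        delta = new_delta
    # Termination (equation 9) and backtracking through the stored states
    last = max(states, key=lambda i: delta[i])
    path = [last]
    for psi in reversed(backpointers):
        path.insert(0, psi[path[0]])
    return [s + 1 for s in path], delta[last]             # report urns as 1-based

print(viterbi(["RED", "YELLOW", "RED"]))   # ([1, 3, 1], about 0.0046)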

6 Implementing Task 3

The third task involves adjusting the model parameters to maximise the probability of the observation sequence given the model. In other words, we would like to develop our model from training data, where the training data consists of observation sequences.
The model parameters we are interested in learning are the state transition probabilities (our $a_{i,j}$ values, represented as $A$), the initial state distribution (represented as $\pi$) and the observation probabilities for each state (represented as $B$). These are the parameters that define our model, $\lambda = (A, B, \pi)$. Unfortunately, implementing Task 3 is quite difficult (Rabiner 1989). One way to approach Task 3 is to iteratively update and improve our model using re-estimation techniques, such as the Baum-Welch algorithm (an instance of the Expectation-Maximisation (EM) algorithm) or gradient descent (Rabiner 1989).

6.1 Updating Model Parameters


What is needed is a way to calculate a new set of model parameters given an existing set. As it turns out, we can calculate a new initial distribution probability $\bar{\pi}_i$ for a state $S_i$, a new transition probability $\bar{a}_{i,j}$ from one state to another, and a new probability $\bar{b}_i(k)$ of observing symbol $v_k$ in a state $S_i$ (where $V = \{v_1, v_2, \ldots, v_M\}$ is the set of $M$ observation symbols). This gives a new model, $\bar{\lambda}$, from an existing model, $\lambda$. The new model parameters are calculated based on equations 10 to 12 (Rabiner 1989):

\bar{\pi}_i = expected number of times in state S_i at time t = 1    (10)

\bar{a}_{i,j} = \frac{expected number of transitions from state S_i to state S_j}{expected number of transitions from state S_i}    (11)

\bar{b}_i(k) = \frac{expected number of times in state S_i and observing symbol v_k}{expected number of times in state S_i}    (12)
Without going into too many details, the calculations in equations 10 to 12 are all based on our forward and backward variables. The probability of being in state $S_i$ at time $t$ and state $S_j$ at time $t+1$, given a model and observation sequence, is shown in equation 13.

\xi_t(i, j) = P(q_t = S_i, q_{t+1} = S_j \mid O, \lambda)    (13)
            = \frac{\alpha_t(i) \, a_{i,j} \, b_j(O_{t+1}) \, \beta_{t+1}(j)}{P(O \mid \lambda)}

The variable $\xi_t(i, j)$ from equation 13 can be used to calculate the expected number of transitions from state $S_i$ to state $S_j$ (see equation 14). Similarly, our combined forward-backward variable $\gamma_t(i)$ from Section 5.1 can be used to calculate the expected number of transitions from state $S_i$ (see equation 15) (Rabiner 1989).

\sum_{t=1}^{T-1} \xi_t(i, j) = expected number of transitions from S_i to S_j    (14)

\sum_{t=1}^{T-1} \gamma_t(i) = expected number of transitions from S_i    (15)

6.2 The Baum-Welch Algorithm


The above use of forward and backward variables implements the Baum-Welch algorithm, which is itself an implementation of the statistical Expectation-Maximisation (EM) algorithm (Rabiner 1989). Note that, as we saw in Section 5.1, the forward-backward variables only allow local maximisation.
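To make the re-estimation formulas concrete, here is a minimal Python sketch of a single Baum-Welch update for the urn model, written directly from equations 10 to 15 (my own illustration; a practical implementation would also handle multiple sequences and numerical underflow).

states = [0, 1, 2]
symbols = ["RED", "YELLOW", "GREEN"]
pi = [1/3]*3
A = [[1/3]*3 for _ in states]
B = {"RED": [0.50, 0.25, 0.25], "YELLOW": [0.25, 0.25, 0.50], "GREEN": [0.25, 0.50, 0.25]}

def baum_welch_step(obs):
    """One re-estimation step (equations 10-15) for a single observation sequence."""
    T = len(obs)
    # Forward and backward variables
    alpha = [[pi[i] * B[obs[0]][i] for i in states]]
    for o in obs[1:]:
        alpha.append([sum(alpha[-1][i] * A[i][j] for i in states) * B[o][j]
                      for j in states])
    beta = [[1.0]*3 for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[obs[t+1]][j] * beta[t+1][j] for j in states)
                   for i in states]
    prob_O = sum(alpha[-1])
    # gamma_t(i) and xi_t(i, j)
    gamma = [[alpha[t][i] * beta[t][i] / prob_O for i in states] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[obs[t+1]][j] * beta[t+1][j] / prob_O
            for j in states] for i in states] for t in range(T - 1)]
    # Re-estimated parameters (equations 10-12)
    new_pi = gamma[0]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in states] for i in states]
    new_B = {k: [sum(g[i] for o, g in zip(obs, gamma) if o == k) /
                 sum(g[i] for g in gamma) for i in states] for k in symbols}
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step(["RED", "YELLOW", "RED"])
print([round(p, 2) for p in new_pi])    # [0.5, 0.25, 0.25] for this sequence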

7 Reinforcement Learning

Reinforcement learning uses a scalar signal as a method of specifying what is to be achieved without specifying how it is to be done. Using this approach the agent learns how to complete its task through interactions with its environment in a process of trial and error. Formally, reinforcement learning is based on learning a function where at each step of interaction with its environment the agent receives an input indicating the current state of the environment (Kaelbling, Littman, and Moore 1996). Based on this input, the agent then selects an action as an output to the environment. This output is capable of changing the state of the environment, resulting in a new input at the next timestep. The value of the state transition achieved through the selected action is indicated to the agent by a reinforcement signal $r$, selected from a scalar range of possible rewards $R$, which the agent receives as a result of arriving in the new state.
The aim is to learn a function (behaviour) of inputs and actions which allows the agent's behaviour to maximise the long-run reinforcement (utility) received. This evaluation function implements the agent's policy: its selection of an action in a particular state is determined by the agent's current learned policy (Lin 1991). The behaviour function usually involves learning a value function which estimates the expected future reward associated with particular states or state-action pairs. The prediction of the value function is based on the assumption that subsequent actions are selected according to a particular policy, $\pi$. Issues relating to following policies and calculating value functions are discussed later.
The advantage of reinforcement learning lies in its ability to specify the goals of the agent without specifying how they should be achieved. This gives a very general architecture for learning. All the environments in the following discussion are assumed to be first-order Markov problems.

7.1 Approaches to Reinforcement Learning


Sutton and Barto (1998) discuss the basic approaches to reinforcement learning that are implemented by learning action-values. In this case, the true value of an action, $Q^*(a)$, is estimated by $Q_t(a)$, which is updated at each timestep $t$. This estimate is calculated as the mean reward received so far from past selections of $a$. If $a$ has been selected $k_a$ times in the past, then by the law of large numbers, as $k_a \to \infty$, $Q_t(a)$ converges to $Q^*(a)$. In practice it saves time and space to update the estimate incrementally after the action has been taken $k$ times, as follows:

Q_{k+1} = Q_k + \frac{1}{k+1} [r_{k+1} - Q_k]    (16)

where $r_{k+1}$ is the reward received at time $k+1$. If the environment is non-stationary (i.e. either the transition probabilities from one state to another or the rewards associated with states change over time), more recent rewards can be given increased significance (weighted more heavily) by using a constant step size $\alpha$, $0 < \alpha \le 1$, giving:

Q_{k+1} = Q_k + \alpha [r_{k+1} - Q_k]    (17)


As the approach above is based on a sample of the rewards, it is called a sample-average method. After a number of samples it is possible to select a greedy action $a^*$ for which the estimate of expected return at time $t$ is maximal: $Q_t(a^*) = \max_a Q_t(a)$. Such a strategy greedily tries to exploit current knowledge to maximise immediate reward; however, it does not dedicate any effort to trying other actions to see if they are really better than they appear. One method for exploring other actions is to occasionally select actions independently of the action-value estimates. One such method is the $\epsilon$-greedy method, which involves making random action selections with probability $\epsilon$, according to a uniform distribution. An alternative to the $\epsilon$-greedy method is softmax methods. These methods rank actions so that more promising actions are selected more often and actions which appear quite poor are avoided. One possible softmax method is to use a Gibbs distribution for selecting actions at time $t$, where $\tau$ is a positive temperature parameter:

\frac{e^{Q_t(a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(b)/\tau}}    (18)
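A small Python sketch of these ideas follows (my own illustration): an incremental sample-average update (equation 16), an epsilon-greedy selection rule and a Gibbs/softmax selection rule (equation 18).

import math
import random

def update_estimate(q, k, reward):
    """Incremental sample-average update (equation 16)."""
    return q + (reward - q) / (k + 1)

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, else a greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def softmax(q_values, tau=1.0):
    """Gibbs/softmax action selection (equation 18); tau is the temperature."""
    prefs = [math.exp(q / tau) for q in q_values]
    total = sum(prefs)
    return random.choices(range(len(q_values)), weights=[p / total for p in prefs])[0]

# Example: estimate the value of a single action from three sample rewards.
q, rewards = 0.0, [1.0, 0.0, 1.0]
for k, r in enumerate(rewards):
    q = update_estimate(q, k, r)
print(round(q, 2))                      # 0.67, the mean of the rewards seen so far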

In contrast to action-value approaches there are reinforcement-comparison approaches. These do not maintain action-values but preferences for actions in relation to a reference reward. The reference reward may be the average of recently received rewards, while action preferences can be calculated based on a comparison between the reward associated with a selected action and the reference reward. The preference for an action $a$ at timestep $t$ may be given by $p_t(a)$ and used to determine action selection probabilities using the following softmax relationship:

\pi_t(a) = \frac{e^{p_t(a)}}{\sum_{b=1}^{n} e^{p_t(b)}}    (19)

Here $\pi_t(a)$ denotes the probability of selecting action $a$ at timestep $t$. The preference is updated based on a comparison of the actual reward, $r_t$, received after selecting the action, and the reference reward, $\bar{r}_t$:

p_{t+1}(a_t) = p_t(a_t) + \beta [r_t - \bar{r}_t]    (20)

where $\beta$ is a positive step-size parameter. After the update of equation 20 the reference reward is updated incrementally as an average of all recently received rewards:

\bar{r}_{t+1} = \bar{r}_t + \alpha [r_t - \bar{r}_t]    (21)

where $\alpha$, $0 < \alpha \le 1$, is a step-size parameter, which is probably best set to weight recent rewards more heavily as action selection improves.
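The following Python fragment (my own sketch, using an invented three-armed toy problem) shows one reinforcement-comparison step combining equations 19 to 21: select an action from the softmax over preferences, then update the chosen preference and the reference reward.

import math
import random

def reinforcement_comparison_step(prefs, ref_reward, reward_fn, beta=0.1, alpha=0.1):
    """One step of a reinforcement-comparison learner (equations 19-21)."""
    # Softmax selection probabilities from the preferences (equation 19)
    weights = [math.exp(p) for p in prefs]
    total = sum(weights)
    action = random.choices(range(len(prefs)), weights=[w / total for w in weights])[0]
    reward = reward_fn(action)
    # Update the preference of the selected action (equation 20)
    prefs[action] += beta * (reward - ref_reward)
    # Update the reference reward (equation 21)
    ref_reward += alpha * (reward - ref_reward)
    return prefs, ref_reward

# Toy problem: action 1 pays 1.0, the others pay 0.0.
prefs, ref = [0.0, 0.0, 0.0], 0.0
for _ in range(500):
    prefs, ref = reinforcement_comparison_step(prefs, ref, lambda a: 1.0 if a == 1 else 0.0)
print([round(p, 2) for p in prefs])   # the preference for action 1 grows largest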

7.2 Exploration
While some possible choices for encouraging exploration are the use of $\epsilon$-greedy strategies or softmax selection mechanisms, there are a number of other possibilities. One is to base the selection of actions on the certainty of the action's estimated value. Interval estimation is one such method; it estimates a confidence interval for an action's value. However, such methods involve a large degree of complexity in the statistical methods used to compute the confidence intervals. Another possibility is to compute the Bayes optimal way to balance exploration and exploitation. In this case, the method used would need to be an approximation, as the full case is computationally intractable.

7.3 Delayed Effects of Actions


Both action-value and reinforcement-comparison techniques, and hybrids of these two techniques, only take into account the immediate results of actions, ignoring any delayed effects. For example, an action may result in an immediate high reward yet place the agent in a state from which only low rewards can be obtained (Sutton 1991b). This is the problem of temporal credit assignment, and it can be viewed as trying to attribute credit to actions which, though they did not directly lead to a reward, contributed to receiving a reward as part of a sequence of actions. The effects of sequences of actions can be taken into account by trying to maximise long-term returns rather than immediate rewards. The following sections discuss learning approaches which take into account the delayed effects of actions. One popular approach for solving the temporal credit assignment problem is the use of temporal difference methods. Temporal difference methods can be seen as combining elements of two simpler problem solving techniques, dynamic programming and Monte Carlo methods. Dynamic programming calculates expected future returns as averages of returns over all subsequent states, while Monte Carlo methods create an estimate of expected future returns based on sample transitions to subsequent states.

7.3.1 Episodic Tasks

Sutton (1991b) describes the case of applying dynamic programming to a deterministic MDP. The agent observes its environment at each of a sequence of discrete time steps, $t = 0, 1, 2, \ldots$, to be in a state $s_t \in S$. At each time step the agent selects an action $a_t \in A$ which determines the state of the world at the next time step. The agent's objective is to arrive at a goal state $g \in G \subseteq S$ and, as each action incurs a cost, $c_{sa} \in \mathbb{R}^+$, it aims to achieve this with the minimum cumulative cost.
The aim of dynamic programming is to associate a numerical value, $V(s)$, with each state indicating the cost-to-go from that state to a goal state.
The minimum cost-to-go from a particular state can be determined recursively as the minimum cost action from that state plus the minimum cost-to-go from the resulting state:

V(s) = \min_{a \in A} [c_{sa} + V(succ(s, a))]    \forall s \in S \setminus G    (22)

where $succ(s, a)$ denotes the successor state of $s$ given the selection of action $a$, and where:

V(s) = 0    \forall s \in G    (23)

Equation 23 defines the values of the goal states, while equation 22 specifies how these are backed up to define the values of all other states. Once the values for states are determined, the best action is the one which minimises the right-hand side of equation 22.
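As a concrete illustration, the following Python sketch (my own, with a made-up four-state chain as the example domain) applies the cost-to-go backup of equation 22 repeatedly until the values stop changing.

# Hypothetical deterministic chain: states 0..3, state 3 is the goal.
# succ[s][a] gives the successor state and cost[s][a] the cost of action a in s.
states = [0, 1, 2, 3]
goal = {3}
actions = ["left", "right"]
succ = {0: {"left": 0, "right": 1}, 1: {"left": 0, "right": 2}, 2: {"left": 1, "right": 3}}
cost = {s: {"left": 1.0, "right": 1.0} for s in succ}

V = {s: 0.0 for s in states}
changed = True
while changed:                       # repeat the backup of equation 22 until stable
    changed = False
    for s in states:
        if s in goal:
            continue                 # equation 23: V(s) = 0 for goal states
        new_v = min(cost[s][a] + V[succ[s][a]] for a in actions)
        if new_v != V[s]:
            V[s], changed = new_v, True

print(V)                             # {0: 3.0, 1: 2.0, 2: 1.0, 3: 0.0}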

7.3.2 Continuing Tasks

The cost-to-goal formulation has some limitations, a major one being that it does not take into account what happens once the goal has been reached. This is appropriate for problems with a finite sequence of actions leading up to an absorbing terminal state, but not for continuing tasks where the agent must continue making decisions once the goal is reached.
This problem can be overcome by changing the formulation to remove the use of special goal states and make the agent's objective the optimisation of costs on state transitions. To implement this, the per-transition costs can be converted to per-transition outcomes or rewards. States may be associated with positive rewards if they are desirable or with negative rewards if they are states which should be avoided.
As in the original formulation, the agent observes its environment at each of a sequence of discrete time steps, $t = 0, 1, 2, \ldots$, to be in a state $s_t \in S$. At each time step the agent selects an action $a_t \in A$ which determines the state of the world at the next time step. However, now after each action the agent receives a reward from the environment, $r_{t+1} \in \mathbb{R}$, as well as the next state $s_{t+1}$. The expected value of $r_{t+1}$ is $R(s_t, a_t)$, and the probability that $s_{t+1} = x$ for a state $x \in S$ is $P(s_t, x, a_t)$. The agent's objective may now be seen as maximising the sum of the total rewards it receives from the environment:

r_{t+1} + r_{t+2} + r_{t+3} + \cdots    (24)


As this sequence is infinite, the sum is also likely to be infinite or divergent. To overcome this, rewards can be discounted according to time using a discount factor $\gamma$, $0 < \gamma < 1$:

r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots    (25)

The use of the discount factor has two effects: it overcomes the problem of an infinite sum and it acts as a cost for actions. Now action sequences which lead to more immediate rewards will be preferred over action sequences for which the same reward is delayed. In navigation tasks, the discount factor encourages shorter paths to be used when searching for terminal states. At each time step the agent chooses an action $a_t$ so as to maximise the expected return:

E \left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \right]    (26)

Now the value of a state, $V(s)$, is defined as the best expected discounted return achievable from that state if the agent behaves optimally. We arrive at a recursive relationship very similar to equation 22, where the cost to goal is replaced by the reward received and we include the discount factor $\gamma$:

V(s) = \max_{a \in A} [R(s, a) + \gamma V(succ(s, a))]    \forall s \in S    (27)

where the deterministic successor of state $s$ given action $a$ is given by $succ(s, a)$. This relationship can be extended to cover probabilistic state transitions:

V(s) = \max_{a \in A} \left[ R(s, a) + \gamma \sum_{x \in S} P(s, x, a) V(x) \right]    \forall s \in S    (28)

Now the optimal action can be determined by selecting the action which maximises the right-hand side of equation 28.

7.4 Value Functions


The deterministic examples just given covered state-value functions for both episodic and continuing tasks. Here we go into more detail on value functions, expanding our notation and incorporating action-value functions.
In the above, we were calculating a value function $V$ which was in fact the optimal value function, from now on denoted as $V^*$. As an alternative to the optimal value function, $V(s)$ may be calculated for a particular policy $\pi$, which gives the probability $\pi(s, a)$ of selecting action $a \in A$ when in state $s \in S$. In this case, the value of state $s$ under policy $\pi$ is denoted as $V^\pi(s)$ and defined for MDPs as follows:

V^\pi(s) = E_\pi \{ R_t \mid s_t = s \} = E_\pi \left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s \right\}    (29)

where $E_\pi$ denotes the expected value given that the policy $\pi$ is followed. Accordingly, this function, $V^\pi$, is called the state-value function for policy $\pi$.
So far we have discussed only state-value functions; however, the same principles apply to action-value functions. The action-value function for policy $\pi$ is the expected return for taking action $a$ starting in state $s$ and following policy $\pi$ thereafter. This function, denoted as $Q^\pi(s, a)$, is defined as:

Q^\pi(s, a) = E_\pi \{ R_t \mid s_t = s, a_t = a \} = E_\pi \left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right\}    (30)

7.4.1 Optimal Value Functions

The optimal policy for finite MDPs can be defined as a policy $\pi$ that is better than or equal to every other policy $\pi'$, in that its expected return is greater than or equal to that of $\pi'$ for all states. This takes advantage of the partial ordering defined over policies by their value functions: $\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$. There is always at least one policy that is better than or equal to all other policies. Policies which meet this condition are optimal policies; they all share the same state-value function and are denoted by $\pi^*$. The value function shared by optimal policies is denoted as $V^*$ and defined as

V^*(s) = \max_\pi V^\pi(s)    (31)

for all $s \in S$. Optimal policies also share the same optimal action-value function, $Q^*$, defined as:

Q^*(s, a) = \max_\pi Q^\pi(s, a)    (32)

for all $s \in S$ and $a \in A(s)$. The function $Q^*$ can be written in terms of $V^*$ as follows:

Q^*(s, a) = E \{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}    (33)


Value functions must satisfy certain recursive relationships. For any policy $\pi$ and any state $s$, the following Bellman equation for $V^\pi$ summarises the consistency condition which holds between the value of $s$ and the values of its successor states:

V^\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]    (34)

where actions $a$ are taken from the set $A(s)$ and next states, $s'$, are taken from the set $S$.
The value function $V^*$ is the value function for an optimal policy. Under such a policy the value of a state must equal the expected return for the best action from that state. Because of this relationship, and because $V^*$ is optimal, it can be described without referring to any particular policy. In this case the equation is called the Bellman optimality equation and is written as follows:

V^*(s) = \max_a E \{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}    (35)
       = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^*(s')]    (36)

The Bellman optimality equation for $Q^*$ is:

Q^*(s, a) = E \left\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t = s, a_t = a \right\}    (37)
          = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]    (38)

7.4.2 Update Rules

If the above equations are converted into update rules, they can be applied as dynamic programming (DP) techniques for finite state problems where the world model is known. The computations using the equations described above require backing up all the values from the start state to the goal states. A less computationally expensive approach is to base back-ups on a calculated estimate of $V$, $\hat{V}$. For a finite state space MDP the current estimate $\hat{V}$ can be stored in a table. In the episodic case, the approximate value function $\hat{V}(s)$ at a state $s \in S \setminus G$ can be calculated using the following assignment, converted from equation 22:

\hat{V}(s) := \min_{a \in A} [c_{sa} + \hat{V}(succ(s, a))]    (39)

where, for $s \in G$, $\hat{V}(s) = V(s) = 0$. Repeated replacement of $\hat{V}(s)$ with its backed-up value leads to $\hat{V}$ becoming a better approximation of $V$.
Similarly, the equality for continuing tasks (equation 28) can also be converted into an improvement operator, successive applications of which can be used to calculate $V$ from the approximation $\hat{V}$:

\hat{V}(s) := \max_{a \in A} \left[ R(s, a) + \gamma \sum_{x \in S} P(s, x, a) \hat{V}(x) \right]    \forall s \in S    (40)
x2S
Under our assumptions the dynamics of the model are provided and the transition probabilities $P(s, x, a)$ do not need to be learned. Equations 39 and 40 also demonstrate two different styles of back-up. The update in equation 39 is based on a single successor state for each action, namely the deterministic successor of that action, while the back-up used in equation 40 is based on all successor states weighted by their transition probabilities and is referred to as a full back-up. Back-ups based instead on a single sampled successor are referred to as sample back-ups and appear in the temporal difference methods of Section 7.5.

7.4.3 Value Iteration

The updates performed using equation 39 can stop once they no longer cause the value of any state to change, at which point equation 22 holds for $\hat{V}$. This process of computing $V$ by applying the update to individual states, possibly in an arbitrary order, is called asynchronous value iteration. The synchronous version of value iteration updates all state values at one time, repeating this process until convergence. Value iteration requires enough memory to store values for every state in the state space. This space requirement is unreasonable for very large state spaces; however, it offers the advantage of faster search and convergence once sufficient trials have been conducted.
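Below is a minimal Python sketch of synchronous value iteration using the probabilistic backup of equation 40 (my own illustration on a made-up two-state, two-action problem; R, P and the discount factor are invented for the example).

# Hypothetical MDP: two states, two actions, known model.
states = [0, 1]
actions = ["a", "b"]
gamma = 0.9
R = {(0, "a"): 0.0, (0, "b"): 1.0, (1, "a"): 2.0, (1, "b"): 0.0}   # R(s, a)
P = {(0, "a"): [0.8, 0.2], (0, "b"): [0.0, 1.0],                   # P(s, x, a) over x
     (1, "a"): [1.0, 0.0], (1, "b"): [0.5, 0.5]}

V = {s: 0.0 for s in states}
while True:
    # Synchronous sweep: compute all new values from the old ones (equation 40)
    new_V = {s: max(R[(s, a)] + gamma * sum(P[(s, a)][x] * V[x] for x in states)
                    for a in actions)
             for s in states}
    if max(abs(new_V[s] - V[s]) for s in states) < 1e-6:
        break
    V = new_V

greedy = {s: max(actions, key=lambda a: R[(s, a)] +
                 gamma * sum(P[(s, a)][x] * V[x] for x in states)) for s in states}
print({s: round(v, 2) for s, v in V.items()}, greedy)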

7.4.4 Dynamic Programming

In fact value iteration can be seen as two processes occurring simultaneously: policy evaluation and policy improvement. Both of these are based on principles from dynamic programming. Policy evaluation consists of calculating the value function $V^\pi$ for a particular policy $\pi$. This can be done using the following update rule based on the Bellman equation for $V^\pi$:

V_{k+1}(s) = E_\pi \{ r_{t+1} + \gamma V_k(s_{t+1}) \mid s_t = s \}    (41)
           = \sum_{a} \pi(s, a) \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V_k(s')]    (42)

for all $s \in S$. Each approximation $V_{k+1}$ is calculated from $V_k$ for each state $s$ using the rule given by equations 41 and 42. This update can be done where the new values are computed one by one without overwriting any existing values until all states have been updated in this one sweep of the states. Alternatively it can be done in-place, where the newly calculated values overwrite the values existing before the sweep began. This means that during the same sweep newly calculated values are sometimes used to calculate other new values.
The second process, policy improvement, involves calculating a value function and then using this value function to help select a new, better, policy. In this case, we consider the benefits of selecting some action $a \ne \pi(s)$ when in a state $s$. Whether this change is better or worse than the original policy can be determined by comparing the return of following the original policy before and after the change. If the change results in a better return all the time then we should update to the new policy, as indicated by the policy improvement theorem:

Q^\pi(s, \pi'(s)) \ge V^\pi(s)    (43)

where $\pi$ and $\pi'$ are any pair of deterministic policies; if equation 43 holds for all $s \in S$, then $\pi'$ is as good as or better than $\pi$. This idea can be applied to all states and actions as follows:

\pi'(s) = \arg\max_a Q^\pi(s, a)
        = \arg\max_a E \{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}    (44)
        = \arg\max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^\pi(s')]

The policy $\pi'$ described here always selects the action for which $Q^\pi(s, a)$ is maximised. Policy improvement is the application of this greedy strategy for improvement based on the value function for the current policy. If the new policy $\pi'$ is as good as, but not better than, the old policy $\pi$, then $V^\pi = V^{\pi'}$ for all $s \in S$, so $V^{\pi'}$ must be $V^*$ and both $\pi$ and $\pi'$ must be optimal policies (Sutton and Barto 1998).

The interleaving of policy evaluation and policy improvement is referred to as Generalised Policy Iteration (GPI). In GPI the policy evaluation and improvement steps can interact at various levels of granularity.
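The following Python sketch (my own illustration, repeating the small hypothetical MDP from the value-iteration sketch) alternates policy evaluation (equations 41 and 42) with greedy policy improvement (equation 44) until the policy is stable.

states = [0, 1]
actions = ["a", "b"]
gamma = 0.9
R = {(0, "a"): 0.0, (0, "b"): 1.0, (1, "a"): 2.0, (1, "b"): 0.0}
P = {(0, "a"): [0.8, 0.2], (0, "b"): [0.0, 1.0],
     (1, "a"): [1.0, 0.0], (1, "b"): [0.5, 0.5]}

def q_value(s, a, V):
    """Expected return of taking action a in s and then following the values in V."""
    return R[(s, a)] + gamma * sum(P[(s, a)][x] * V[x] for x in states)

policy = {s: "a" for s in states}        # arbitrary initial deterministic policy
while True:
    # Policy evaluation: iterate equation 42 for the current (deterministic) policy
    V = {s: 0.0 for s in states}
    for _ in range(200):
        V = {s: q_value(s, policy[s], V) for s in states}
    # Policy improvement: make the policy greedy with respect to V (equation 44)
    new_policy = {s: max(actions, key=lambda a: q_value(s, a, V)) for s in states}
    if new_policy == policy:
        break
    policy = new_policy

print(policy, {s: round(v, 2) for s, v in V.items()})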

7.5 Temporal Difference Learning


Temporal difference methods are a class of approaches for implementing GPI. Typically these techniques are based on the comparison of predictions from one time to another (Sutton 1988). In particular they use the ideas of dynamic programming presented above, combined with the use of estimates at one step to improve estimates at previous steps. Temporal difference methods make improvements to value function estimates by updating value functions using sample back-ups rather than the more expensive full back-ups of full dynamic programming approaches. An example of this is the TD(0) algorithm (Sutton 1988; Sutton and Barto 1998):

V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]    (45)

Here a TD error is calculated based on the difference between the estimated value of the state at time $t$ and the reward plus the discounted estimated value of the subsequent state at time $t+1$. In addition to the discount factor $\gamma$, a learning rate $\alpha$, $0 < \alpha \le 1$, is used. TD(0) is somewhat similar to the asynchronous value-iteration procedure of equation 40, only here the sample back-ups are based on actual experience rather than generated from a known model.
Temporal difference methods allow a range of variation in the granularity with which policy evaluation and policy improvement can be interleaved. One approach which conceptually separates the two procedures of policy evaluation and policy improvement is the actor-critic model.

7.5.1 Adaptive Heuristic Critic

Actor-critic methods are an extension of the reinforcement-comparison concepts discussed in Section 7.1 to include temporal differences. An example of an actor-critic method is the following implementation of an Adaptive Heuristic Critic (AHC) architecture. This can be seen as incorporating two components: a TD(0) learner, called the adaptive heuristic critic, which learns a value function, and an actor, which selects actions. The critic learns the value function for whatever policy the actor is currently following and provides feedback to the actor on the quality of its action selections. This feedback may be in the form of a temporal difference error between predicted and actual performance:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)    (46)

where $V$ is the current value function maintained by the critic. Based on this error, the actor can evaluate the action just taken. Negative errors can be used to bias the policy away from the associated actions, while positive errors may increase the probability of selecting recent actions (Sutton and Barto 1998).

7.5.2 Q-Learning

Q-Learning was developed by Watkins (1989) and aims to construct a Q-function as the action-value evaluation (prediction) function (Lin and Mitchell 1993). An update rule to learn the Q-function in deterministic environments is (Sutton 1991a):

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]    (47)

where $\alpha$ is a positive learning rate parameter. Variations on this learning rule are possible (Lin and Mitchell 1993). It is also possible to generalise the update rule to apply to stochastic environments (Sutton 1991b). The learned Q-function converges to $Q^*$, the optimal action-value function, as long as certain conditions are met, including the requirement that all state-action pairs are continually updated. As convergence is guaranteed no matter which policy is being followed, Sutton and Barto (1998) categorise Q-Learning as an off-policy technique.
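A minimal tabular Q-learning sketch in Python follows (my own illustration on an invented three-state chain environment, with epsilon-greedy exploration).

import random

# Hypothetical deterministic chain: states 0, 1, 2; reaching state 2 pays 1.0.
n_states, actions = 3, ["left", "right"]
def step(s, a):
    """Environment model used only to generate experience for the learner."""
    s2 = min(s + 1, n_states - 1) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

alpha, gamma, epsilon = 0.1, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}

for _ in range(2000):
    s = 0
    while s != n_states - 1:
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda x: Q[(s, x)])
        s2, r = step(s, a)
        # Q-learning update (equation 47): off-policy, backs up the greedy value of s2
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, x)] for x in actions) - Q[(s, a)])
        s = s2

print({k: round(v, 2) for k, v in Q.items()})   # "right" is preferred in states 0 and 1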

7.5.3 Sarsa and the Bucket Brigade

An on-policy TD technique based on action-values rather than state-values is Sarsa (or modified Q-learning). Sarsa develops an estimate $Q^\pi(s, a)$ for a policy $\pi$ and for all states $s \in S$ and actions $a \in A$. The same convergence theorems that apply to TD(0) also apply to the following update rule for action-values:

Q(s_t, a_t) := Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]    (48)

This technique can be used with $\epsilon$-greedy techniques to estimate $Q^\pi$ for a policy $\pi$. Sarsa converges with probability 1 to an optimal policy and action-value function as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy policy (Sutton and Barto 1998). The Sarsa technique is also related to the bucket-brigade algorithm of Holland, Holyoak, Nisbett, and Thagard (1986).

7.5.4 TD(λ) and Multi-step Prediction

TD(0) is a specialised case of a more general class of TD algorithms, TD($\lambda$), with $\lambda = 0$. TD(0) looks only one step ahead when making value adjustments, while the more general TD($\lambda$) applies updates to every state based on its eligibility. The TD($\lambda$) rule is similar to equation 45:

V(s) \leftarrow V(s) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)] \, e_t(s)    (49)

where $e_t(s) \in \mathbb{R}^+$ determines the eligibility of state $s$ at time $t$ for updates. At each step the eligibility traces for all states decay by $\gamma\lambda$, while the eligibility trace for the state just visited is incremented by 1:

e_t(s) = \gamma\lambda \, e_{t-1}(s)        if s \ne s_t
e_t(s) = \gamma\lambda \, e_{t-1}(s) + 1    if s = s_t    (50)

for all $s \in S$, where $\gamma$ is the discount factor and $\lambda$ is the trace-decay parameter which determines the rate at which the eligibility decreases for states. The trace given in equation 50 is called an accumulating trace because it accumulates each time a state is visited and fades away during periods when states are not visited.
An alternative to accumulating traces is replacing traces. A replacing trace is defined for a state $s$ as follows:

e_t(s) = \gamma\lambda \, e_{t-1}(s)    if s \ne s_t
e_t(s) = 1                              if s = s_t    (51)

The replacing trace can in some cases provide a significant improvement in learning rate over accumulating traces (Sutton and Barto 1998). This is due to the way in which accumulating traces may increase the eligibility of commonly selected poor states or state-action pairs relative to less commonly selected states or state-action pairs which actually lead to higher returns.
Multi-step updates based on eligibility traces may also be applied to Q-learning and Sarsa.
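The sketch below (my own Python illustration, using the same invented chain-environment idea as the Q-learning sketch) applies the TD(λ) update of equation 49 with accumulating traces (equation 50) under a fixed random policy.

import random

# Hypothetical chain: states 0..3, state 3 is terminal and pays 1.0 on entry.
n_states = 4
def step(s):
    """Random-policy environment step: move left or right with equal probability."""
    s2 = min(s + 1, n_states - 1) if random.random() < 0.5 else max(s - 1, 0)
    return s2, (1.0 if s2 == n_states - 1 else 0.0)

alpha, gamma, lam = 0.1, 0.9, 0.8
V = [0.0] * n_states

for _ in range(2000):
    e = [0.0] * n_states                 # eligibility traces, reset each episode
    s = 0
    while s != n_states - 1:
        s2, r = step(s)
        delta = r + gamma * V[s2] - V[s]         # TD error (as in equation 45)
        e[s] += 1.0                              # accumulating trace (equation 50)
        for x in range(n_states):
            V[x] += alpha * delta * e[x]         # equation 49, applied to every state
            e[x] *= gamma * lam                  # decay all traces
        s = s2

print([round(v, 2) for v in V])   # values increase toward the rewarding end of the chain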

7.6 Temporal Difference Methods


To overcome the limitations of methods which consider only immediate rewards, learning methods are used where the agent is expected to learn a policy which achieves the highest possible reward in the long run. There are a number of models of what is optimal for this type of policy learning, one of which is the infinite-horizon model (Kaelbling, Littman, and Moore 1996). Under this model rewards received in the future are discounted according to a discount factor $\gamma$ (where $0 \le \gamma \le 1$) that can be interpreted as a cost for taking actions. At each time step $t$ it is possible for the agent to receive a reward, $r_t$, and at a given time the agent should select actions which optimise its future discounted reward:

E \left( \sum_{t=0}^{\infty} \gamma^t r_t \right)

7.7 Summary of Reinforcement Learning


The basis of temporal difference methods is the comparison of predictions from one time to another (Sutton 1988). These methods offer two main advantages over supervised learning. One is the ability to support a higher level of incremental learning: changes to predictions for observations in a sequence can be calculated for each observation rather than waiting for the sequence to complete. This has the additional advantage of removing the necessity of remembering the predictions for each observation, reducing the memory requirements of learning algorithms. Another advantage is the ability to converge more rapidly and to construct models which better match the true models.

8 Learning Value functions for POMDPs

The discussion of reinforcement learning so far has concentrated on discrete spaces. However, for POMDPs our current state is represented by a belief state, which is a point in multi-dimensional continuous space. So one of the areas of research in the area of POMDPs is how to learn value functions over continuous belief spaces.

References

Holland, J., K. Holyoak, R. Nisbett, and P. Thagard (1986). Induction: Processes of Inference, Learning and Discovery. The MIT Press.
Kaelbling, L., M. Littman, and A. Moore (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4, 237-285.
Lin, L. (1991). Self-improving reactive agents: Case studies of reinforcement learning frameworks. In J. Meyer and S. Wilson (Eds.), From Animals to Animats, pp. 297-305. First International Conference on Simulation of Adaptive Behaviour: MIT Press.
Lin, L. and T. Mitchell (1993). Reinforcement learning with hidden states. In J. Meyer, H. Roitblat, and S. Wilson (Eds.), From Animals to Animats 2, USA, pp. 271-280. Second International Conference on Simulation of Adaptive Behaviour: MIT Press.
Rabiner, L. (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257-286.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9-44.
Sutton, R. (1991a). Planning by incremental dynamic programming. In Proceedings of the Ninth Conference on Machine Learning, pp. 353-357. Morgan-Kaufmann.
Sutton, R. (1991b). Reinforcement learning architectures for animats. In J. Meyer and S. Wilson (Eds.), From Animals to Animats, pp. 288-296. First International Conference on Simulation of Adaptive Behaviour: MIT Press.
Sutton, R. and A. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
Watkins, C. (1989). Learning From Delayed Rewards. Ph.D. thesis, Cambridge University.
