1 Introduction

This guide provides an introduction to Hidden Markov Models and draws heavily from the excellent tutorial paper written by Rabiner (1989). I have attempted to provide intuition into the various algorithms by using some simple fully worked examples to hopefully illuminate the mathematical descriptions provided by Rabiner. It is hoped the explanation of the concepts is sufficient to help an otherwise uninformed reader to understand later descriptions (such as the Baum-Welch algorithm) where worked examples are not provided.
The second part of the guide aims to explain the relationship between Hidden Markov Models (HMMs) and Partially Observable Markov Decision Problems (POMDPs). While the primary concern in HMMs is to learn a good model, the addition of actions (the ability to make decisions) in POMDPs adds the further concern of learning which action to select. POMDPs therefore combine two related problems: learning a model and learning which action to select. Some systems address one of these concerns exclusively; others attempt to address both simultaneously.
sequence of observations may occur with many different underlying state sequences. The solution to this involves maintaining a probability for each of the possible states we may currently be in, given the observations we have encountered since the start of the process (or since we were last certain of what state we were in). The sum of all the probabilities must equal 1.0 (that is, we must be in one state and one state only) and this probability distribution across states at a particular time is called a belief state. The belief state is a single point in multi-dimensional space where the dimensions are the different possible states. The set of all possible belief states (all possible points) is called the belief space. Since probabilities are real numbers, this space is continuous, not discrete.
Let's continue the example assuming that each urn contains 5 yellow and 5 green balls, except that urn 2 contains 10 green balls and urn 3 contains 10 yellow balls (see Figure 1). Then a possible sequence of observations is (RED_1, YELLOW_2, RED_3), where the first ball (at timestep 1) is red, the second yellow (at timestep 2) and the last (at timestep 3) red again; assume also that each ball is replaced in its urn before the next is drawn. There are 3^3 = 27 possible state sequences that match this observation sequence, which are shown in Tables 2(a) to 2(c).
There are 3 possible urns that the third ball may have come from, and for each of these there are 9 possible paths (state transitions) that may have led there. The belief state will depend on the probability of each of these paths given the observation sequence. Since our observation sequence is (RED_1, YELLOW_2, RED_3) and each urn is equally likely to be selected from each time, the most likely state sequence is (1_1, 2_2, 1_3); that is, the first ball was drawn from Urn 1 at timestep 1, the second ball from Urn 2 at timestep 2 and the third ball from Urn 1 again at timestep 3. The belief state after the third observation is (0.5, 0.25, 0.25).
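To make the belief-state idea concrete, the short sketch below enumerates every state sequence for the observation sequence (RED_1, YELLOW_2, RED_3) and normalises over the final state. It is a minimal illustration of my own, not code from the guide; the GREEN row of the emission table is an assumption added only so that each urn's observation probabilities sum to 1, and the YELLOW probabilities follow the values used later in Table 3.

```python
import itertools

# Urn example: 3 hidden states (urns), uniform initial and transition probabilities.
pi = [1/3, 1/3, 1/3]                      # initial state distribution
A = [[1/3, 1/3, 1/3] for _ in range(3)]   # state transition probabilities a_{i,j}
# Observation probabilities b_i(o); the GREEN column is an assumption so rows sum to 1.
B = {
    "RED":    [0.50, 0.25, 0.25],
    "YELLOW": [0.25, 0.25, 0.50],
    "GREEN":  [0.25, 0.50, 0.25],
}
obs = ["RED", "YELLOW", "RED"]

# P(O, Q | lambda) for every possible state sequence Q.
joint = {}
for q in itertools.product(range(3), repeat=len(obs)):
    p = pi[q[0]] * B[obs[0]][q[0]]
    for t in range(1, len(obs)):
        p *= A[q[t - 1]][q[t]] * B[obs[t]][q[t]]
    joint[q] = p

# Belief state: probability of each final state, normalised over all sequences.
unnorm = [sum(p for q, p in joint.items() if q[-1] == i) for i in range(3)]
total = sum(unnorm)
belief = [u / total for u in unnorm]
print([round(b, 2) for b in belief])      # -> [0.5, 0.25, 0.25], as in the text
```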
Rabiner (1989) describes three tasks we need to be able to do with HMMs to make them useful:

1. to compute the probability of an observation sequence;
2. to choose a state sequence which explains the observations; and
3. to adjust the model parameters.
If we can do 1, and we have many possible models (possible states, state transitions or observation probabilities), we can select the model which best matches an observation sequence by determining in which model the sequence was most probable. Using 2 we can determine the most probable state given an observation sequence. If we can do 3 we can construct models from training sets of observation sequences.
4 Implementing Task 1
Sequence   P[O_1=R]   P[O_2=Y]   P[O_3=R]   Product
(1,1,1)    0.50       0.25       0.50       0.06
(1,2,1)    0.50       0.50       0.50       0.13
(1,3,1)    0.50       0.25       0.50       0.06
(2,1,1)    0.25       0.25       0.50       0.03
(2,2,1)    0.25       0.50       0.50       0.06
(2,3,1)    0.25       0.25       0.50       0.03
(3,1,1)    0.25       0.25       0.50       0.03
(3,2,1)    0.25       0.50       0.50       0.06
(3,3,1)    0.25       0.25       0.50       0.03
Sum                                         0.50

(a) Possible state sequences terminating with Urn 1
The specification of the example problem in Section 2.1 will be the model, λ, used in the following discussion. The probability of the observation sequence O = (RED_1, YELLOW_2, RED_3) and state sequence Q = (URN1_1, URN1_2, URN1_3) occurring given this model is calculated using equation 1. This gives us P(O, Q|λ), i.e. the probability of O and Q occurring together given the model.
Q         P[O_1=R]   a_{i,j}   P[O_2=Y]   a_{i,j}   P[O_3=R]   a_{i,j}   Product
(1,1,1)   0.50       0.33      0.25       0.33      0.50       0.33      0.00225
(1,2,1)   0.50       0.33      0.50       0.33      0.50       0.33      0.00449
(1,3,1)   0.50       0.33      0.25       0.33      0.50       0.33      0.00225
(2,1,1)   0.25       0.33      0.25       0.33      0.50       0.33      0.00112
(2,2,1)   0.25       0.33      0.50       0.33      0.50       0.33      0.00225
(2,3,1)   0.25       0.33      0.25       0.33      0.50       0.33      0.00112
(3,1,1)   0.25       0.33      0.25       0.33      0.50       0.33      0.00112
(3,2,1)   0.25       0.33      0.50       0.33      0.50       0.33      0.00225
(3,3,1)   0.25       0.33      0.25       0.33      0.50       0.33      0.00112
Sum                                                                      0.01797

(a) Possible observation and state sequences terminating with Urn 1.
In the inner loop of the first iteration we calculate the initial forward variables for each state (the α_1(i) values for all i) as the probability of having selected a ball from a particular urn (of being in each state) given that the ball is red (the partial sequence (RED_1)). This is simply the product of the initial state occupation probability for each state (π_i) and the corresponding probability of the observation RED (b_i(RED)) for each state, as shown in Table 1.
π_i    b_i(RED)   α_1(i)
0.33   0.50       0.1650
0.33   0.25       0.0825
0.33   0.25       0.0825

Table 1: Inner loop of iteration 1 of the forward-backward procedure - the initial set of state and observation probabilities for our model given partial observation sequence (RED_1).
The forward values (α_1(i)) in Table 1 give us, for each urn, the probability of it being the urn the ball was selected from and of the selected ball being RED at timestep 1. We now move to the outer loop of iteration 1 and use these probabilities to calculate the probabilities of the next state. We use the forward variables along with the state transition probabilities (a_{i,j}, for the transition from state S_i at time t to state S_j at t+1, where 1 ≤ i ≤ N and 1 ≤ j ≤ N) to work out the different paths to each of our three states. The sum of the paths to a particular state is the likelihood we are in that state given the partial observation sequence so far. The paths to each of our states given the initial forward variables are shown in Table 2.
α_1(i)   a_{i,j}   Product
0.1650   0.33      0.05445
0.0825   0.33      0.02723
0.0825   0.33      0.02723
Sum                0.10890

Table 2: Outer loop of iteration 1 of the forward-backward algorithm. This loop calculates the probability of being in a state S_j given the previous state S_i and the observation probabilities. Since the transition probabilities from S_i to S_j are the same for all states in S, the calculation is the same for all 3 states.
Notice that at this point our state occupation probabilities have collapsed down into a single probability for each of the three possible states. We now take the product of these state probabilities and the next observation to calculate the next set of forward values. This is very similar to the calculation of our original forward values. We replace the initial state occupation probabilities with the state occupation probabilities we just calculated, which take into account the observation O_1 (RED), and also include the observation probabilities for O_2
sums     b_i(YELLOW)   α_2(i)
0.1089   0.25          0.02723
0.1089   0.25          0.02723
0.1089   0.50          0.05445

Table 3: Inner loop of iteration 2 of the forward-backward procedure - the second set of state and observation probabilities for our model given partial observation sequence (RED_1, YELLOW_2).
(YELLOW). Table 3 shows the calculation of the second set of forward values.

This calculation, including the inner and outer loops for a particular timestep t (where 2 ≤ t ≤ T), forms the update rule of the forward procedure, which is summarised in equation 3 (Rabiner 1989).

    α_t(j) = [ Σ_{i=1}^{N} α_{t-1}(i) a_{i,j} ] b_j(O_t),    1 ≤ j ≤ N    (3)
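As a concrete illustration of equation 3, the sketch below (a minimal implementation of my own, using the urn model's probabilities with exact thirds rather than the rounded 0.33 values) runs the forward procedure on the worked example:

```python
def forward(pi, A, B, obs):
    """Forward procedure: returns P(O | lambda) and the alpha_t(i) values."""
    N = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(O_1)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    # Induction: alpha_t(j) = [sum_i alpha_{t-1}(i) * a_{i,j}] * b_j(O_t)
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    # Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha[-1]), alpha

# Urn model from the worked example (observations indexed 0=RED, 1=YELLOW).
pi = [1/3, 1/3, 1/3]
A = [[1/3, 1/3, 1/3] for _ in range(3)]
B = [[0.50, 0.25], [0.25, 0.25], [0.25, 0.50]]   # rows: urns; columns: RED, YELLOW
prob, alpha = forward(pi, A, B, [0, 1, 0])        # (RED, YELLOW, RED)
print(round(prob, 3))                             # -> 0.037, which the text rounds to 0.04
```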
From these we again calculate the probabilities of the next state, as shown in Table 4.
α_2(i)   a_{i,j}   Product
0.0272   0.33      0.00898
0.0272   0.33      0.00898
0.0545   0.33      0.01797
Sum                0.03594

Table 4: Outer loop of iteration 2 of the forward-backward algorithm. This loop calculates the probability of being in a state given the previous state and observation probabilities. Since the transition probabilities from S_i to S_j are the same for all states in S, the calculation is the same for all 3 states.
Finally, the procedure terminates by calculating the product of the state probabilities produced by iteration 2 with the respective observation probabilities for the final observation of RED. As with the enumerative approach, the probability of the observation sequence (RED_1, YELLOW_2, RED_3) given our model is 0.04.

The forward procedure implements task 1: it finds the probability of an observation sequence given a model. It does this by first finding the probability of the observation sequence occurring with each state sequence, P(O, Q|λ), and then summing these probabilities to arrive at the probability of the observation sequence given the model, P(O|λ). For tasks 2 and 3 we also need to know the backward part of the forward-backward procedure.
4.2 The Backward Procedure

The backward part of the forward-backward procedure finds the probability of a partial observation sequence occurring from a given state. Recall that the forward variable (see equation 2) tells us the probability of a partial observation sequence up to a certain state. The backward variable is similar. It tells us the probability of a partial observation sequence from a certain state as follows (Rabiner 1989):
Since we know the backward variables for all states in S (all S_j) at time T (they are all 1), we first use these with our product rule to calculate the backward variables for time T-1 (which is also t+2) as shown in Table 6. We then use the β_{t+2}(i) values for all i (all states) in our next iteration to calculate β_{t+1}(i) as shown in Table 7. This process repeats one more time to terminate with the calculation, shown in Table 8, of β_t(i), which produces the probability of the observation sequence (RED_{t+1}, YELLOW_{t+2}, RED_T) given the state S_i at time t, as 0.04.
a_{i,j}   b_j(YELLOW)   β_{t+2}(j)   Product
0.33      0.25          0.33         0.03
0.33      0.25          0.33         0.03
0.33      0.50          0.33         0.05
Sum β_{t+1}(i)                       0.11

Table 7: Applying the product rule for all j and calculating β_{t+1}(i). Again, since the transitions from S_i to S_j are the same for all states in S, the sum is the same for all S_i.
Now that we have both the forward and backward parts of the forward-backward procedure, we can look at how they, or other procedures based on them, can be applied to solving Tasks 2 and 3.
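For completeness, here is a matching sketch of the backward pass (again a minimal illustration of my own under the same assumed urn model, not code from the guide). Folding the future observations in from the end reproduces the values in Tables 6 to 8: roughly 0.33, then 0.11, then 0.04.

```python
def backward_window(A, B, future_obs):
    """Backward pass: P(future observations | current state S_i) for each state i."""
    N = len(A)
    beta = [1.0] * N                          # beyond the last observation, beta = 1
    for o in reversed(future_obs):
        # beta_t(i) = sum_j a_{i,j} * b_j(O_{t+1}) * beta_{t+1}(j)
        beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(N)) for i in range(N)]
        print([round(b, 2) for b in beta])    # prints ~0.33, then ~0.11, then ~0.04
    return beta

A = [[1/3, 1/3, 1/3] for _ in range(3)]
B = [[0.50, 0.25], [0.25, 0.25], [0.25, 0.50]]   # rows: urns; columns: RED, YELLOW
backward_window(A, B, [0, 1, 0])                  # (RED_{t+1}, YELLOW_{t+2}, RED_T)
```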
5 Implementing Task 2

Task 2 was to find a state sequence which best explains the observations. Two possible solutions to this, described in Rabiner (1989), are presented. The first solution uses both the forward variable described in Section 4.1 and the backward variable described in Section 4.2.

    γ_t(i) = α_t(i) β_t(i) / Σ_{i=1}^{N} α_t(i) β_t(i)    (6)
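A small self-contained sketch of equation 6 (my own illustration for the urn example, combining the forward and backward passes shown above) computes γ_t(i) at each observation point:

```python
# gamma_t(i) from equation 6 for the urn model (illustration only).
pi = [1/3, 1/3, 1/3]
A = [[1/3, 1/3, 1/3] for _ in range(3)]
B = [[0.50, 0.25], [0.25, 0.25], [0.25, 0.50]]   # rows: urns; columns: RED, YELLOW
obs = [0, 1, 0]                                   # (RED, YELLOW, RED)
N, T = 3, len(obs)

# Forward pass (alpha) and backward pass (beta), as in Sections 4.1 and 4.2.
alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
               for i in range(N)]

# gamma_t(i) = alpha_t(i) * beta_t(i) / sum_i alpha_t(i) * beta_t(i)
for t in range(T):
    num = [alpha[t][i] * beta[t][i] for i in range(N)]
    gamma = [x / sum(num) for x in num]
    print(t + 1, [round(g, 2) for g in gamma])    # the most likely state has the largest gamma
```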
Once we have the forward-backward variables for all states at a given point, the most likely state at that point is the one whose forward-backward variable has the highest value (i.e. is most probable). Calculating the most likely sequence of states using forward-backward variables requires the variables to be calculated for every state at each observation point. However, because the most likely state at each point is calculated independently of the other points, it is possible that the most likely sequence of states produced by this process is one that in practice would never occur. This problem arises when the probability of a transition from one state in the calculated sequence to the next is 0 (Rabiner 1989). An alternative to finding a sequence of states by combining the independently calculated best states is to construct a sequence of best states which takes into account the other states in the sequence. In other words, the
sequence as a whole must be optimal in some sense rather than the individual parts. There is a method, called the Viterbi algorithm, to construct such a sequence, which is based on the forward procedure and dynamic programming techniques (Rabiner 1989).

    δ_t(j) = [ max_{1≤i≤N} δ_{t-1}(i) a_{i,j} ] b_j(O_t),    1 ≤ j ≤ N    (7)
π_i    b_i(RED)   δ_1(i)
0.33   0.50       0.1650
0.33   0.25       0.0825
0.33   0.25       0.0825

Table 9: Initial values for the Viterbi variables - the initial set of state and observation probabilities for our model given partial observation sequence (RED_1).
probable sequence of states. The product is then used along with the second observation (YELLOW) to calculate the next set of Viterbi variables. The use of a single product, rather than a sum of products, is where the Viterbi algorithm differs from the forward procedure presented in Section 4.1. The calculation of the products for the second observation is shown in Table 12. Here the second urn is stored, and the maximum product is used along with the final observation RED to calculate the last set of Viterbi variables shown in Table 13.
State (urn)   δ_1(i)    a_{i,j}   Product
1             0.16500   0.33      0.05445
2             0.08250   0.33      0.02723
3             0.08250   0.33      0.02723

Table 10: Calculating the most probable state (which is stored). This is the state with the maximum product of the Viterbi variable and the transition to state S_j. This state is urn 1, which is shown in bold. Since the transition probabilities are the same for all S_j, each state has only one row in the table.
Since the transitions are the same from all states, the third state to be stored is the state with the highest Viterbi variable, urn 1. So our most probable state sequence given the observation sequence (RED_1, YELLOW_2, RED_3) is (URN1_1, URN3_2, URN1_3). The probability of this state sequence occurring is 0.0045.
State (urn)   δ_2(i)    a_{i,j}   Product
1             0.01361   0.33      0.00449
2             0.01361   0.33      0.00449
3             0.02723   0.33      0.00899

Table 12: Calculating the most probable state (which is stored) after the second observation. This state is urn 3, which is shown in bold. Since the transition probabilities are the same for all S_j, each state has only one row in the table.
6 Implementing Task 3

The third task involves adjusting the model parameters to maximise the probability of the observation sequence given the model. In other words, we would like to develop our model from training data, where the training data includes observation sequences.

The model parameters we are interested in learning are the state transition probabilities (our a_{i,j} values, represented as A), the initial state distribution (represented as π) and the observation probabilities for each state (represented as B). These are the parameters that define our model, λ = (A, B, π). Unfortunately, implementing Task 3 is quite difficult (Rabiner 1989). One way to approach task 3 is to iteratively update and improve our model using reestimation techniques. Some such techniques are the Expectation-Maximisation (EM) algorithm, the Baum-Welch algorithm and gradient descent (Rabiner 1989).
    a_{i,j} = (expected number of transitions from state S_i to state S_j) / (expected number of transitions from state S_i)    (11)

    b_j(k) = (expected number of times in state S_j and observing symbol v_k) / (expected number of times in state S_j)    (12)

Without going into too many details, the calculations in equations 10 to 12 are all based on our forward and backward variables. The calculation of the probability of being in state S_i at time t and state S_j at time t+1, given a model and observation sequence, is shown in equation 13.

    Σ_{t=1}^{T-1} ξ_t(i, j) = expected number of transitions from S_i to S_j    (14)

    Σ_{t=1}^{T-1} γ_t(i) = expected number of transitions from S_i    (15)
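As a sketch of how these expectations are computed in practice, the following minimal Baum-Welch-style reestimation of the transition matrix is my own illustration under the same assumed urn model; a full implementation would also reestimate π and B and would iterate to convergence over a training set of observation sequences.

```python
pi = [1/3, 1/3, 1/3]
A = [[1/3, 1/3, 1/3] for _ in range(3)]
B = [[0.50, 0.25], [0.25, 0.25], [0.25, 0.50]]    # rows: urns; columns: RED, YELLOW
obs = [0, 1, 0]                                    # (RED, YELLOW, RED)
N, T = 3, len(obs)

# Forward and backward variables, as in Section 4.
alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
for t in range(1, T):
    alpha.append([sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
                  for j in range(N)])
beta = [[1.0] * N for _ in range(T)]
for t in range(T - 2, -1, -1):
    beta[t] = [sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
               for i in range(N)]
prob_O = sum(alpha[-1])                            # P(O | lambda)

# xi_t(i,j): probability of being in S_i at t and S_j at t+1 (as in equation 13).
# gamma_t(i): probability of being in S_i at t (sum of xi over j).
xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / prob_O
        for j in range(N)] for i in range(N)] for t in range(T - 1)]
gamma = [[sum(xi[t][i]) for i in range(N)] for t in range(T - 1)]

# Reestimate a_{i,j} using equations 11, 14 and 15.
A_new = [[sum(xi[t][i][j] for t in range(T - 1)) /
          sum(gamma[t][i] for t in range(T - 1))
          for j in range(N)] for i in range(N)]
print([[round(a, 2) for a in row] for row in A_new])
```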
Reinforcement learning uses a scalar signal as a method of specifying what is to be achieved without specifying how it is to be done. Using this approach
the agent learns how to complete its task through interactions with its environment in a process of trial and error. Formally, reinforcement learning is based on learning a function where at each step of interaction with its environment the agent receives an input indicating the current state of the environment (Kaelbling, Littman, and Moore 1996). Based on this input, the agent then selects an action as an output to the environment. This output is capable of changing the state of the environment, resulting in a new input in the next timestep. The value of the state transition achieved through the action selected is indicated to the agent by a reinforcement signal r, selected from a scalar range of possible rewards R, which the agent receives as a result of arriving in the new state. The aim is to learn a function (behaviour) of inputs and actions which allows the agent's behaviour to maximise the long-run reinforcement (utility) received.

This evaluation function implements the agent's policy; its selection of an action in a particular state is determined by the agent's current learned policy (Lin 1991). The behaviour function usually involves learning a value function which estimates the expected future reward associated with particular states or state-action pairs. The prediction of the value function is based on the assumption that subsequent actions are selected based on a particular policy, π. Issues relating to following policies and calculating value functions are discussed later.

The advantage of reinforcement learning lies in its ability to specify the goals of the agent without specifying how they should be achieved. This gives a very general architecture for learning. All the environments in the following discussion are assumed to be first order Markov problems.
action a* for which the estimate of expected return at time t is greatest: Q_t(a*) = max_a Q_t(a). Such a strategy greedily tries to exploit current knowledge to maximise immediate reward; however, it does not dedicate any effort to trying other actions to see if they are really better than they appear. One method for exploring other actions is to occasionally select actions independently of the action-value estimates. One such method is the ε-greedy method, which involves making action selections with probability ε according to a uniform distribution.
An alternative to the ε-greedy method is softmax methods. These methods rank actions so that more promising actions are selected more often and actions which appear quite poor are avoided. One possible softmax method is to use a Gibbs distribution for selecting actions at time t, choosing action a with probability

    e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}    (18)

where τ is a temperature parameter.
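A small sketch of both selection rules (a minimal illustration of ε-greedy and the Gibbs/Boltzmann softmax of equation 18, not code from any of the cited sources; the action-value estimates and parameter values below are made up):

```python
import math
import random

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def softmax_action(q, tau=0.5):
    """Gibbs/Boltzmann selection: P(a) = exp(Q(a)/tau) / sum_b exp(Q(b)/tau)."""
    weights = [math.exp(v / tau) for v in q]
    total = sum(weights)
    return random.choices(range(len(q)), weights=[w / total for w in weights])[0]

q_estimates = [0.2, 1.0, 0.5]          # hypothetical action-value estimates Q_t(a)
print(epsilon_greedy(q_estimates))      # usually action 1, occasionally a random action
print(softmax_action(q_estimates))      # favours action 1 but still explores the others
```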
7.2 Exploration

While some possible choices for encouraging exploration are the use of ε-greedy strategies or softmax selection mechanisms, there are a number of other possibilities. One is to base selection of actions on the certainty of the action's estimated value. Interval estimation is one such method, which estimates a confidence interval for an action's value. However, these involve a large degree of complexity
for the statistical methods used to compute the confidence intervals. Another possibility is to compute the Bayes optimal way to balance exploration and exploitation. In this case, the method used would need to be an approximation, as the full case is computationally intractable.
Sutton (1991b) describes the case of applying dynamic programming to a deterministic MDP. The agent observes its environment at each of a sequence of discrete time steps, t = 0, 1, 2, ..., to be in a state s_t ∈ S. At each time step the agent selects an action a_t ∈ A which determines the state of the world at the next time step. The agent's objective is to arrive at a goal state g ∈ G ⊆ S and, as each action incurs a cost c_{s,a} ∈ ℜ+, it aims to achieve this with the minimum cumulative cost.
The aim of dynamic programming is to associate a numerical value, V(s), with each state indicating the cost-to-go from that state to a goal state. The minimum cost-to-go from a particular state can be determined recursively as the minimum cost action from a particular state plus the minimum cost-to-go from the resulting state:

    V(s) = min_{a∈A} [ c_{s,a} + V(succ(s, a)) ]    ∀s ∈ S - G    (22)

where succ(s, a) denotes the successor state to s given the selection of action a and where:
    V(s) = 0    ∀s ∈ G    (23)

Equation (23) defines the values of the goal states while (22) specifies how these are backed up to define the values of all other states. Once the values for states are determined, the best action is the one which minimises the righthand side of (22).
The cost-to-goal formulation has some limitations, a major one being that it does not take into account what happens once the goal has been reached. This is appropriate for problems with a finite sequence of actions leading up to an absorbing terminal state, but not for continuing tasks where the agent must continue making decisions once the goal is reached.

This problem can be overcome by changing the formulation to remove the use of special goal states and to make the agent's objective the optimisation of costs on state transitions. To implement this, the per-transition costs can be converted to per-transition outcomes or rewards. States may be associated with positive rewards if they are desirable or with negative rewards if they are states which should be avoided.
Like the original formulation, the agent observes its environment at each of a sequence of discrete time steps, t = 0, 1, 2, ..., to be in a state s_t ∈ S. At each time step the agent selects an action a_t ∈ A which determines the state of the world at the next time step. However, now after each action the agent receives a reward from the environment, r_{t+1} ∈ ℜ, as well as the next state s_{t+1}. The expected value of r_{t+1} is R(s_t, a_t), and the probability of s_{t+1} = x, for a random variable x ∈ S, is P(s_t, x, a_t). The agent's objective may now be seen as maximising the sum of total rewards it receives from the environment:
Now the value of a state, V(s), is defined as the best expected discounted return achievable from that state if the agent behaves optimally. Now we arrive at a recursive relationship very similar to (22), where the cost to goal is replaced by the reward received and we include the discount factor γ:

    V(s) = max_{a∈A} [ R(s, a) + γ Σ_{x∈S} P(s, x, a) V(x) ]    ∀s ∈ S    (28)

Now the optimal action can be determined by selecting the action which maximises the righthand side of (28).
7.4.1 Optimal Value Functions

The optimal policy for finite MDPs can be defined as a policy π that is better than or equal to a policy π' in that its expected return is greater than or equal to that of π' for all states. This takes advantage of the partial ordering defined by value functions over policies: π ≥ π' if and only if V^π(s) ≥ V^π'(s) for all s ∈ S. There is always at least one policy that is better than or equal to all other policies. Policies which meet these conditions are optimal policies; they all share the same state value function and are denoted π*. The value function shared by optimal policies is denoted V* and defined as

    V*(s) = max_π V^π(s),    (31)

for all s ∈ S. Optimal policies also have the same optimal action-value function, Q*, defined as:

    Q*(s, a) = max_π Q^π(s, a),    (32)

for all s ∈ S and a ∈ A(s). The function Q* can be written in terms of V* as follows:

    Q*(s, a) = E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a } = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ],

where actions a are taken from the set A(s) and next states, s', are taken from the set S.
The value function V* is the value function for an optimal policy. Under such a policy the value of a state must equal the expected return for the best action from that state. Because of this relationship, and because V* is optimal, it can be described without referring to any particular policy. In this case the equation is called the Bellman optimality equation and is written as follows:

    V*(s) = max_a E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }    (35)
          = max_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V*(s') ].    (36)
    Q*(s, a) = E{ r_{t+1} + γ max_{a'} Q*(s_{t+1}, a') | s_t = s, a_t = a }    (37)
             = Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ max_{a'} Q*(s', a') ].    (38)
If the above equations are converted into update rules, they can be applied as dynamic programming (DP) techniques for finite state problems where the world model is known. The computations using the equations described above require backing up all the values from the start state to the goal states. A less computationally expensive approach is to base back-ups on a calculated estimate of V, V̂. For a finite state space MDP the current estimate V̂ can be stored in a table. In the episodic case, the approximate value function V̂(s) at a state s ∈ S - G can be calculated using the following assignment converted from equation (22):

    V̂(s) := max_{a∈A} [ R(s, a) + Σ_{x∈S} P(s, x, a) V̂(x) ]    ∀s ∈ S - G    (40)
Under our assumptions the dynamics of the model are provided and the transition probabilities P(s, x, a) do not need to be learned. Equations 39 and 40 also demonstrate two different types of back-up. The updates in 40 are based on a sample of the successor states, in this case the state with the minimum value function; as such, the back-up used here is called a sample back-up. On the other hand, the back-up used in 39 is based on all successor states and is referred to as a full back-up.
The updates performed using (40) can stop once (39) no longer causes the value of a state to change, and therefore (22) holds for V̂. This process of computing V by occasionally applying the assignment to individual states is called asynchronous value iteration. The synchronous version of value iteration updates all state values at one time, repeating this process until convergence. Value iteration requires enough memory to store values for every state in the state space. This space requirement is unreasonable for very large state spaces; however, it offers the advantage of faster search and convergence once sufficient trials have been conducted.
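A minimal sketch of synchronous value iteration (my own illustration on a tiny made-up MDP, not an example from Sutton (1991b); the states, rewards and discount factor are arbitrary assumptions):

```python
def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Synchronous value iteration: sweep all states until values stop changing."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    while True:
        V_new = [max(R[s][a] + gamma * sum(P[s][a][x] * V[x] for x in range(n_states))
                     for a in range(n_actions))
                 for s in range(n_states)]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) < theta:
            return V_new
        V = V_new

# Tiny two-state, two-action MDP (all values hypothetical).
P = [  # P[s][a][x] = probability of moving from state s to state x under action a
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.0, 1.0]],
]
R = [  # R[s][a] = expected immediate reward
    [0.0, 1.0],
    [2.0, 0.0],
]
V = value_iteration(P, R)
greedy = [max(range(2), key=lambda a: R[s][a] + 0.9 * sum(P[s][a][x] * V[x] for x in range(2)))
          for s in range(2)]
print([round(v, 2) for v in V], greedy)   # converged values and the greedy policy
```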
    π'(s) = argmax_a Q^π(s, a) = argmax_a Σ_{s'} P^a_{ss'} [ R^a_{ss'} + γ V^π(s') ]
The policy π' described here always selects the action for which Q^π(s, a) is maximised. Policy improvement is the application of this greedy strategy for improvement based on the value function for the current policy. If the new policy π' is as good as, but not better than, the old policy π, then V^π = V^π' for all s ∈ S; then V^π' must be V* and both π and π' must be optimal policies (Sutton and Barto 1998).
Actor-critic methods are an extension of the concepts of reinforcement comparison discussed in Section 7.1 to include temporal differences. An example of actor-critic methods is the following implementation of an Adaptive Heuristic Critic (AHC) architecture. This can be seen as incorporating two components: a TD(0) learner called the adaptive heuristic critic (AHC), which learns a value function, and an actor, which selects actions. The critic learns the value function for whatever policy the actor is currently following and provides feedback to the actor on the quality of its action selections. This feedback may be in the form of a temporal difference error between predicted and actual performance:
Negative errors may be used to bias the policy away from associated actions, while positive errors may increase the probability of selecting recent actions (Sutton and Barto 1998).
7.5.2 Q-Learning

Q-Learning was developed by Watkins (1989) and aims to construct a Q-function as the action-value evaluation (prediction) function (Lin and Mitchell 1993). An update rule to learn the Q-function in deterministic environments is (Sutton 1991a):
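For a deterministic environment the standard form of this backup is Q(s, a) ← r + γ max_{a'} Q(s', a'). The sketch below is a minimal illustration of one learning step under this rule inside a simple ε-greedy loop; it is not code from the cited sources, and env_step is a hypothetical deterministic environment supplied by the caller.

```python
import random
from collections import defaultdict

# Q-table over (state, action) pairs; unseen entries default to 0.
Q = defaultdict(float)
gamma = 0.9            # discount factor (assumed value)

def q_update(s, a, r, s_next, actions):
    """Deterministic-environment backup: Q(s,a) <- r + gamma * max_a' Q(s',a')."""
    Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)

def epsilon_greedy(s, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

# Sketch of the learning loop; env_step(s, a) -> (reward, next_state) is hypothetical.
def train(env_step, start_state, actions, episodes=100, steps=50):
    for _ in range(episodes):
        s = start_state
        for _ in range(steps):
            a = epsilon_greedy(s, actions)
            r, s_next = env_step(s, a)
            q_update(s, a, r, s_next, actions)
            s = s_next
```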
TD(0) is a specialised case of a more general class of TD algorithms, TD(λ), with λ = 0. TD(0) looks only one step ahead when making value adjustments, while the more general TD(λ) applies adjustments to every state based on its eligibility. The TD(λ) rule is similar to (45), as follows:
At each step the eligibility traces for all states decay by γλ, while the eligibility trace for the state just visited is incremented by 1:

    e_t(s) = γλ e_{t-1}(s)        if s ≠ s_t
    e_t(s) = γλ e_{t-1}(s) + 1    if s = s_t    (50)

for all s ∈ S, where γ is the discount factor and λ is the trace-decay parameter which determines the rate at which the eligibility decreases for states. The trace given in (50) is called an accumulating trace because it accumulates each time a state is visited and fades away during periods when states are not visited.
An alternative to accumulating traces is replacing traces. This is defined for a state s as follows:

    e_t(s) = γλ e_{t-1}(s)    if s ≠ s_t
    e_t(s) = 1                if s = s_t    (51)
The replacing trace can in some cases provide a significant improvement in learning rate over accumulating traces (Sutton and Barto 1998). This is due to the way in which accumulating traces may increase the eligibility of commonly selected poor states or state-action pairs relative to less commonly selected states or state-action pairs which actually lead to higher returns.
Multi-step updates based on eligibility traces may also be applied to Q-learning and Sarsa.
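As a sketch of how the traces in equations 50 and 51 are used, the following minimal TD(λ) value update (my own illustration; the states, rewards, learning rate and trace type are assumptions) decays every trace by γλ, bumps the trace of the visited state, and then applies the TD error to all states in proportion to their eligibility:

```python
def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8, replacing=False):
    """One TD(lambda) update: adjust V for every state in proportion to its trace."""
    delta = r + gamma * V[s_next] - V[s]          # TD error for the visited transition
    for x in e:                                   # decay all traces by gamma * lambda
        e[x] *= gamma * lam
    e[s] = 1.0 if replacing else e[s] + 1.0       # replacing (eq. 51) or accumulating (eq. 50)
    for x in V:                                   # eligibility-weighted value updates
        V[x] += alpha * delta * e[x]

# Hypothetical three-state example.
V = {0: 0.0, 1: 0.0, 2: 0.0}
e = {0: 0.0, 1: 0.0, 2: 0.0}
td_lambda_step(V, e, s=0, r=1.0, s_next=1)
print(V)   # only state 0 has non-zero eligibility so far, so only V[0] changes
```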
8 Learning Value Functions for POMDPs

The discussion of reinforcement learning so far has concentrated on discrete spaces. However, for POMDPs our current state is represented by a belief state, which is a point in multi-dimensional continuous space. So one of the areas of research in the area of POMDPs is how to learn value functions over continuous belief spaces.
References

Holland, J., K. Holyoak, R. Nisbett, and P. Thagard (1986). Induction: Processes of Inference, Learning and Discovery. The MIT Press.

Kaelbling, L., M. Littman, and A. Moore (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4, 237-285.

Lin, L. (1991). Self-improving reactive agents: Case studies of reinforcement learning frameworks. In J. Meyer and S. Wilson (Eds.), From Animals to Animats, pp. 297-305. First International Conference on Simulation of Adaptive Behaviour: MIT Press.

Lin, L. and T. Mitchell (1993). Reinforcement learning with hidden states. In J. Meyer, H. Roitblat, and S. Wilson (Eds.), From Animals to Animats 2, USA, pp. 271-280. Second International Conference on Simulation of Adaptive Behaviour: MIT Press.

Rabiner, L. (1989). A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257-286.

Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning 3, 9-44.

Sutton, R. (1991a). Planning by incremental dynamic programming. In Proceedings of the Ninth Conference on Machine Learning, pp. 353-357. Morgan Kaufmann.

Sutton, R. (1991b). Reinforcement learning architectures for animats. In J. Meyer and S. Wilson (Eds.), From Animals to Animats, pp. 288-296. First International Conference on Simulation of Adaptive Behaviour: MIT Press.

Sutton, R. and A. Barto (1998). Reinforcement Learning: An Introduction. MIT Press.

Watkins, C. (1989). Learning From Delayed Rewards. Ph.D. thesis, Cambridge University.