
6.231 DYNAMIC PROGRAMMING


LECTURE 8
LECTURE OUTLINE
DP for imperfect state info
Sufficient statistics
Conditional state distribution as a sufficient statistic
Finite-state systems
Examples
REVIEW: IMPERFECT STATE INFO PROBLEM
Instead of knowing x_k, we receive observations

z_0 = h_0(x_0, v_0),   z_k = h_k(x_k, u_{k-1}, v_k),   k ≥ 1
I_k: information vector available at time k:

I_0 = z_0,   I_k = (z_0, z_1, . . . , z_k, u_0, u_1, . . . , u_{k-1}),   k ≥ 1
Optimization over policies π = {μ_0, μ_1, . . . , μ_{N-1}}, where μ_k(I_k) ∈ U_k, for all I_k and k.
Find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} [ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) ]
subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k),   k ≥ 0,

z_0 = h_0(x_0, v_0),   z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k),   k ≥ 1
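To make the setup concrete, here is a minimal simulation sketch (with made-up scalar dynamics, noise levels, and a placeholder policy, none of which come from the lecture) showing how the controller acts only on the growing information vector I_k:

```python
import random

# Minimal sketch of the imperfect-info loop: the controller never sees x_k,
# only I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}). Dynamics, noise levels, and
# the policy below are made-up placeholders, not from the lecture.

N = 5
x = random.gauss(0.0, 1.0)              # x_0
zs = [x + random.gauss(0.0, 0.5)]       # z_0 = h_0(x_0, v_0)
us = []                                 # controls applied so far

def mu(zs, us):
    """Placeholder policy mu_k(I_k): here, react to the latest observation."""
    return -0.5 * zs[-1]

for k in range(N):
    u = mu(zs, us)                                # u_k = mu_k(I_k)
    x = 0.9 * x + u + random.gauss(0.0, 0.1)      # x_{k+1} = f_k(x_k, u_k, w_k)
    us.append(u)
    zs.append(x + random.gauss(0.0, 0.5))         # z_{k+1} = h_{k+1}(x_{k+1}, u_k, v_{k+1})

print(len(zs), len(us))   # I_N holds N+1 observations and N controls
```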
DP ALGORITHM
DP algorithm:

J_k(I_k) = min_{u_k ∈ U_k} E_{x_k, w_k, z_{k+1}} [ g_k(x_k, u_k, w_k)
           + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k ]

for k = 0, 1, . . . , N-2, and for k = N-1,
J_{N-1}(I_{N-1}) = min_{u_{N-1} ∈ U_{N-1}} E_{x_{N-1}, w_{N-1}} [ g_N( f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) )
                   + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1} ]
The optimal cost J* is given by

J* = E_{z_0} [ J_0(z_0) ].
SUFFICIENT STATISTICS
Suppose that we can find a function S_k(I_k) such that the right-hand side of the DP algorithm can be written in terms of some function H_k as

min_{u_k ∈ U_k} H_k( S_k(I_k), u_k ).
Such a function S_k is called a sufficient statistic.

An optimal policy obtained by the preceding minimization can be written as

μ*_k(I_k) = μ̄_k( S_k(I_k) ),

where μ̄_k is an appropriate function.
Example of a sufficient statistic: S_k(I_k) = I_k

Another important sufficient statistic: S_k(I_k) = P_{x_k|I_k}
DP ALGORITHM IN TERMS OF P_{x_k|I_k}
It turns out that P_{x_k|I_k} is generated recursively by a dynamic system (estimator) of the form

P_{x_{k+1}|I_{k+1}} = Φ_k( P_{x_k|I_k}, u_k, z_{k+1} )

for a suitable function Φ_k
DP algorithm can be written as

J̄_k(P_{x_k|I_k}) = min_{u_k ∈ U_k} E_{x_k, w_k, z_{k+1}} [ g_k(x_k, u_k, w_k)
                   + J̄_{k+1}( Φ_k(P_{x_k|I_k}, u_k, z_{k+1}) ) | I_k, u_k ]
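For a finite-state system, the estimator Φ_k is just one step of Bayes' rule. A minimal sketch, assuming a hypothetical two-state chain with transition matrix P and an observation model Q that, for simplicity, depends only on the new state (the lecture's more general h_k allows dependence on u_{k-1} as well):

```python
import numpy as np

# One estimator step mapping P_{x_k|I_k} to P_{x_{k+1}|I_{k+1}} given u_k and
# z_{k+1}. The matrices below are hypothetical placeholders.
P = {0: np.array([[0.9, 0.1],      # P[u][i, j] = P(x_{k+1}=j | x_k=i, u_k=u)
                  [0.3, 0.7]])}
Q = np.array([[0.8, 0.2],          # Q[j, z] = P(z_{k+1}=z | x_{k+1}=j)
              [0.1, 0.9]])

def phi(b, u, z):
    """One estimator step: predict through the dynamics, correct by Bayes' rule."""
    pred = b @ P[u]                # P(x_{k+1} | I_k, u_k)
    post = pred * Q[:, z]          # unnormalized P(x_{k+1} | I_k, u_k, z_{k+1})
    return post / post.sum()

b = np.array([0.5, 0.5])           # a belief P_{x_k|I_k}
print(phi(b, u=0, z=1))
```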
[Figure: block diagram of the closed loop. The system x_{k+1} = f_k(x_k, u_k, w_k) and measurement z_k = h_k(x_k, u_{k-1}, v_k) feed the estimator, which outputs P_{x_k|I_k}; the actuator μ_k maps this to u_k, and a one-period delay feeds u_{k-1} back to the estimator and the measurement.]
EXAMPLE: A SEARCH PROBLEM
At each period, decide to search or not search a site that may contain a treasure.

If we search and a treasure is present, we find it with prob. β and remove it from the site.

Treasure's worth: V. Cost of search: C

States: treasure present & treasure not present

Each search can be viewed as an observation of the state

Denote

p_k: prob. of treasure present at the start of time k, with p_0 given.
p_k evolves at time k according to the equation

p_{k+1} =
    p_k                                  if not search,
    0                                    if search and find treasure,
    p_k(1-β) / ( p_k(1-β) + 1 - p_k )    if search and no treasure.
SEARCH PROBLEM (CONTINUED)
DP algorithm

J_k(p_k) = max[ 0, -C + p_k β V
           + (1 - p_k β) J_{k+1}( p_k(1-β) / ( p_k(1-β) + 1 - p_k ) ) ],

with J_N(p_N) = 0.
Can be shown by induction that the functions J_k satisfy

J_k(p_k) = 0,   for all p_k ≤ C/(βV)
Furthermore, it is optimal to search at period k if and only if

p_k β V ≥ C

(expected reward from the next search ≥ the cost of the search)
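A minimal sketch of this recursion, with hypothetical values for β, V, C, N (not from the lecture), that also checks the threshold rule numerically:

```python
# Search-problem DP: the belief after an unsuccessful search shrinks, and
# searching is worthwhile only while p*beta*V >= C. Toy numbers throughout.
beta, V, C, N = 0.6, 10.0, 2.0, 10

def next_p(p):
    """Belief after searching and not finding the treasure."""
    return p * (1 - beta) / (p * (1 - beta) + 1 - p)

def J(p, k):
    """J_k(p) = max[0, -C + p*beta*V + (1 - p*beta) * J_{k+1}(next_p(p))]."""
    if k == N:
        return 0.0
    return max(0.0, -C + p * beta * V + (1 - p * beta) * J(next_p(p), k + 1))

# Searching at k is optimal iff p*beta*V >= C, i.e. p >= C/(beta*V) = 1/3 here.
for p in (0.2, 1 / 3, 0.5, 0.9):
    print(p, round(J(p, 0), 4), "search" if p * beta * V >= C else "stop")
```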
FINITE-STATE SYSTEMS - POMDP
Suppose the system is a finite-state Markov chain, with states 1, . . . , n.

Then the conditional probability distribution P_{x_k|I_k} is a vector

( P(x_k = 1 | I_k), . . . , P(x_k = n | I_k) )
The DP algorithm can be executed over the n-dimensional simplex (state space is not expanding with increasing k)

When the control and observation spaces are also finite sets the problem is called a POMDP (Partially Observed Markov Decision Problem).
For POMDP it turns out that the cost-to-go functions J_k in the DP algorithm are piecewise linear and concave (Exercise 5.7).
This is conceptually important. It is also useful in practice because it forms the basis for approximations.
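As an illustration of this structure (not part of the lecture), here is a tiny, unpruned alpha-vector value iteration for a hypothetical two-state POMDP: each J_k is stored as a finite set of linear functions of the belief and evaluated as their pointwise minimum, which is exactly the piecewise linear, concave form claimed above.

```python
import itertools
import numpy as np

U, Z, N = 2, 2, 3                               # controls, observations, horizon
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),        # P[u][i, j] = P(x+ = j | x = i, u)
     np.array([[0.5, 0.5], [0.4, 0.6]])]
Q = [np.array([[0.8, 0.2], [0.3, 0.7]])] * U    # Q[u][j, z] = P(z | x+ = j)
g = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # stage cost g(x, u), vector over x
gN = np.array([0.0, 2.0])                       # terminal cost g_N

# J_k is represented by a set Gamma of vectors: J_k(b) = min_{a in Gamma} a @ b.
Gamma = [gN]
for k in range(N):
    new = []
    for u in range(U):
        # one backed-up vector per assignment of a next-stage vector to each z
        for choice in itertools.product(Gamma, repeat=Z):
            alpha = g[u].copy()
            for z, a_z in enumerate(choice):
                alpha += P[u] @ (Q[u][:, z] * a_z)   # E over x+ and z of a_z(x+)
            new.append(alpha)
    Gamma = new   # no pruning of dominated vectors; fine for a toy horizon

b = np.array([0.6, 0.4])                        # a belief P_{x_k|I_k}
print(min(a @ b for a in Gamma))                # J_0(b): a min of linear functions
```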
INSTRUCTION EXAMPLE
Teaching a student some item. Possible states are L: Item learned, or L̄: Item not learned.

Possible decisions: T: Terminate the instruction, or T̄: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.

The test has two possible outcomes: R: Student gives a correct answer, or R̄: Student gives an incorrect answer.
Probabilistic structure

[Figure: transition and test diagrams. If not learned, the student moves L̄ → L with probability t and stays at L̄ with probability 1 - t; the state L is absorbing (probability 1). On the test, a student at L answers R with probability 1; a student at L̄ answers R with probability r and R̄ with probability 1 - r.]
Cost of instruction is I per period

Cost of terminating instruction: 0 if student has learned the item, and C > 0 if not.
INSTRUCTION EXAMPLE II
Let p_k: prob. student has learned the item given the test results so far

p_k = P(x_k = L | I_k) = P(x_k = L | z_0, z_1, . . . , z_k).
Using Bayes' rule we can obtain

p_{k+1} = φ(p_k, z_{k+1})
        =
    ( 1 - (1-t)(1-p_k) ) / ( 1 - (1-t)(1-r)(1-p_k) )   if z_{k+1} = R,
    0                                                   if z_{k+1} = R̄.
DP algorithm:

J_k(p_k) = min[ (1 - p_k)C, I + E_{z_{k+1}} { J_{k+1}( φ(p_k, z_{k+1}) ) } ],
starting with

J_{N-1}(p_{N-1}) = min[ (1 - p_{N-1})C, I + (1-t)(1 - p_{N-1})C ].
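A minimal sketch of φ and this DP recursion, with hypothetical values for t, r, I, C, N (not from the lecture):

```python
# Instruction-example DP on the belief p = P(learned). Toy numbers throughout.
t, r, I_cost, C_cost, N = 0.4, 0.3, 1.0, 10.0, 8

def phi(p, z):
    """Bayes update of p after test outcome z in {'R', 'Rbar'}."""
    if z == 'R':
        return (1 - (1 - t) * (1 - p)) / (1 - (1 - t) * (1 - r) * (1 - p))
    return 0.0            # an incorrect answer reveals the item is not learned

def prob_R(p):
    """P(z_{k+1} = R | p_k): prob. of a correct answer after one more period."""
    return 1 - (1 - t) * (1 - r) * (1 - p)

def J(p, k):
    if k == N:
        return (1 - p) * C_cost   # instruction must stop at the horizon; this
                                  # base case reproduces J_{N-1} given above
    terminate = (1 - p) * C_cost
    pR = prob_R(p)
    cont = I_cost + pR * J(phi(p, 'R'), k + 1) + (1 - pR) * J(phi(p, 'Rbar'), k + 1)
    return min(terminate, cont)

for p in (0.0, 0.5, 0.9):
    print(p, round(J(p, 0), 3))
```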
INSTRUCTION EXAMPLE III
Write the DP algorithm as

J_k(p_k) = min[ (1 - p_k)C, I + A_k(p_k) ],
where

A_k(p_k) = P(z_{k+1} = R | I_k) J_{k+1}( φ(p_k, R) )
           + P(z_{k+1} = R̄ | I_k) J_{k+1}( φ(p_k, R̄) )
Can show by induction that the functions A_k(p) are piecewise linear, concave, monotonically decreasing, with

A_{k-1}(p) ≤ A_k(p) ≤ A_{k+1}(p),   for all p ∈ [0, 1].
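A self-contained numerical check of this monotonicity claim, repeating the hypothetical toy values of the previous sketch:

```python
# Check that A_k(p) increases with k on a belief grid (toy numbers, as before).
t, r, I_cost, C_cost, N = 0.4, 0.3, 1.0, 10.0, 8

def phi(p, z):
    if z == 'R':
        return (1 - (1 - t) * (1 - p)) / (1 - (1 - t) * (1 - r) * (1 - p))
    return 0.0

def prob_R(p):
    return 1 - (1 - t) * (1 - r) * (1 - p)

def A(p, k):
    # A_k(p) = P(R) J_{k+1}(phi(p, R)) + P(Rbar) J_{k+1}(phi(p, Rbar))
    pR = prob_R(p)
    return pR * J(phi(p, 'R'), k + 1) + (1 - pR) * J(phi(p, 'Rbar'), k + 1)

def J(p, k):
    if k == N:
        return (1 - p) * C_cost         # must terminate at the horizon
    return min((1 - p) * C_cost, I_cost + A(p, k))

grid = [i / 100 for i in range(101)]
assert all(A(p, N - 2) <= A(p, N - 1) + 1e-12 for p in grid)  # A_{k-1} <= A_k
assert all(A(p, 0) <= A(p, 1) + 1e-12 for p in grid)
print("A_k increases with k on the grid, as claimed")
```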
[Figure: plot over p ∈ [0, 1] of the termination cost (1 - p)C and the curves I + A_{N-1}(p), I + A_{N-2}(p), I + A_{N-3}(p), with the threshold points a_{N-1}, a_{N-2}, a_{N-3} and the value 1 - I/C marked on the p-axis.]
MIT OpenCourseWare
http://ocw.mit.edu
6.231 Dynamic Programming and Stochastic Control
Fall 2011
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.