
6.231 DYNAMIC PROGRAMMING


LECTURE 8
LECTURE OUTLINE
DP for imperfect state info
Sufficient statistics
Conditional state distribution as a sufficient statistic
Finite-state systems
Examples
REVIEW: IMPERFECT STATE INFO PROBLEM
Instead of knowing x_k, we receive observations

z_0 = h_0(x_0, v_0),   z_k = h_k(x_k, u_{k-1}, v_k),   k ≥ 1
I_k: information vector available at time k:

I_0 = z_0,   I_k = (z_0, z_1, . . . , z_k, u_0, u_1, . . . , u_{k-1}),   k ≥ 1
Optimization over policies π = {μ_0, μ_1, . . . , μ_{N-1}}, where μ_k(I_k) ∈ U_k, for all I_k and k.
Find a policy π that minimizes

J_π = E_{x_0, w_k, v_k, k=0,...,N-1} [ g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(I_k), w_k) ]
subject to the equations

x_{k+1} = f_k(x_k, μ_k(I_k), w_k),   k ≥ 0,

z_0 = h_0(x_0, v_0),   z_k = h_k(x_k, μ_{k-1}(I_{k-1}), v_k),   k ≥ 1
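To make the setup concrete, here is a minimal simulation sketch (with made-up scalar dynamics, noise levels, and a placeholder policy, none of which come from the lecture) showing how the controller acts only on the growing information vector I_k:

```python
import random

# Minimal sketch of the imperfect-info loop: the controller never sees x_k,
# only I_k = (z_0, ..., z_k, u_0, ..., u_{k-1}). Dynamics, noise levels, and
# the policy below are made-up placeholders, not from the lecture.

N = 5
x = random.gauss(0.0, 1.0)              # x_0
zs = [x + random.gauss(0.0, 0.5)]       # z_0 = h_0(x_0, v_0)
us = []                                 # controls applied so far

def mu(zs, us):
    """Placeholder policy mu_k(I_k): here, react to the latest observation."""
    return -0.5 * zs[-1]

for k in range(N):
    u = mu(zs, us)                                # u_k = mu_k(I_k)
    x = 0.9 * x + u + random.gauss(0.0, 0.1)      # x_{k+1} = f_k(x_k, u_k, w_k)
    us.append(u)
    zs.append(x + random.gauss(0.0, 0.5))         # z_{k+1} = h_{k+1}(x_{k+1}, u_k, v_{k+1})

print(len(zs), len(us))   # I_N holds N+1 observations and N controls
```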
DP ALGORITHM
DP algorithm:

J_k(I_k) = min_{u_k ∈ U_k} E_{x_k, w_k, z_{k+1}} [ g_k(x_k, u_k, w_k)
           + J_{k+1}(I_k, z_{k+1}, u_k) | I_k, u_k ]

for k = 0, 1, . . . , N-2, and for k = N-1,
J_{N-1}(I_{N-1}) = min_{u_{N-1} ∈ U_{N-1}} E_{x_{N-1}, w_{N-1}} [ g_N( f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) )
                   + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1} ]
The optimal cost J* is given by

J* = E_{z_0} [ J_0(z_0) ].
SUFFICIENT STATISTICS
Suppose that we can find a function S_k(I_k) such that the right-hand side of the DP algorithm can be written in terms of some function H_k as

min_{u_k ∈ U_k} H_k( S_k(I_k), u_k ).
Such a function S_k is called a sufficient statistic.

An optimal policy obtained by the preceding minimization can be written as

μ*_k(I_k) = μ̄_k( S_k(I_k) ),

where μ̄_k is an appropriate function.
Example of a sufficient statistic: S_k(I_k) = I_k

Another important sufficient statistic: S_k(I_k) = P_{x_k|I_k}
DP ALGORITHM IN TERMS OF P_{x_k|I_k}
It turns out that P_{x_k|I_k} is generated recursively by a dynamic system (estimator) of the form

P_{x_{k+1}|I_{k+1}} = Φ_k( P_{x_k|I_k}, u_k, z_{k+1} )

for a suitable function Φ_k
DP algorithm can be written as

J̄_k(P_{x_k|I_k}) = min_{u_k ∈ U_k} E_{x_k, w_k, z_{k+1}} [ g_k(x_k, u_k, w_k)
                   + J̄_{k+1}( Φ_k(P_{x_k|I_k}, u_k, z_{k+1}) ) | I_k, u_k ]
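For a finite-state system, the estimator Φ_k is just one step of Bayes' rule. A minimal sketch, assuming a hypothetical two-state chain with transition matrix P and an observation model Q that, for simplicity, depends only on the new state (the lecture's more general h_k allows dependence on u_{k-1} as well):

```python
import numpy as np

# One estimator step mapping P_{x_k|I_k} to P_{x_{k+1}|I_{k+1}} given u_k and
# z_{k+1}. The matrices below are hypothetical placeholders.
P = {0: np.array([[0.9, 0.1],      # P[u][i, j] = P(x_{k+1}=j | x_k=i, u_k=u)
                  [0.3, 0.7]])}
Q = np.array([[0.8, 0.2],          # Q[j, z] = P(z_{k+1}=z | x_{k+1}=j)
              [0.1, 0.9]])

def phi(b, u, z):
    """One estimator step: predict through the dynamics, correct by Bayes' rule."""
    pred = b @ P[u]                # P(x_{k+1} | I_k, u_k)
    post = pred * Q[:, z]          # unnormalized P(x_{k+1} | I_k, u_k, z_{k+1})
    return post / post.sum()

b = np.array([0.5, 0.5])           # a belief P_{x_k|I_k}
print(phi(b, u=0, z=1))
```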
[Figure: block diagram of the closed loop. The system x_{k+1} = f_k(x_k, u_k, w_k) and measurement z_k = h_k(x_k, u_{k-1}, v_k) feed the estimator, which outputs P_{x_k|I_k}; the actuator μ_k maps this to u_k, and a one-period delay feeds u_{k-1} back to the estimator and the measurement.]
EXAMPLE: A SEARCH PROBLEM
At each period, decide to search or not search a site that may contain a treasure.

If we search and a treasure is present, we find it with prob. β and remove it from the site.

Treasure's worth: V. Cost of search: C

States: treasure present & treasure not present

Each search can be viewed as an observation of the state

Denote

p_k: prob. of treasure present at the start of time k, with p_0 given.
p_k evolves at time k according to the equation

p_{k+1} =
    p_k                                  if not search,
    0                                    if search and find treasure,
    p_k(1-β) / ( p_k(1-β) + 1 - p_k )    if search and no treasure.
SEARCH PROBLEM (CONTINUED)
DP algorithm

J_k(p_k) = max[ 0, -C + p_k β V
           + (1 - p_k β) J_{k+1}( p_k(1-β) / ( p_k(1-β) + 1 - p_k ) ) ],

with J_N(p_N) = 0.
Can be shown by induction that the functions J_k satisfy

J_k(p_k) = 0,   for all p_k ≤ C/(βV)
Furthermore, it is optimal to search at period k if and only if

p_k β V ≥ C

(expected reward from the next search ≥ the cost of the search)
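A minimal sketch of this recursion, with hypothetical values for β, V, C, N (not from the lecture), that also checks the threshold rule numerically:

```python
# Search-problem DP: the belief after an unsuccessful search shrinks, and
# searching is worthwhile only while p*beta*V >= C. Toy numbers throughout.
beta, V, C, N = 0.6, 10.0, 2.0, 10

def next_p(p):
    """Belief after searching and not finding the treasure."""
    return p * (1 - beta) / (p * (1 - beta) + 1 - p)

def J(p, k):
    """J_k(p) = max[0, -C + p*beta*V + (1 - p*beta) * J_{k+1}(next_p(p))]."""
    if k == N:
        return 0.0
    return max(0.0, -C + p * beta * V + (1 - p * beta) * J(next_p(p), k + 1))

# Searching at k is optimal iff p*beta*V >= C, i.e. p >= C/(beta*V) = 1/3 here.
for p in (0.2, 1 / 3, 0.5, 0.9):
    print(p, round(J(p, 0), 4), "search" if p * beta * V >= C else "stop")
```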
FINITE-STATE SYSTEMS - POMDP
Suppose the system is a finite-state Markov chain, with states 1, . . . , n.

Then the conditional probability distribution P_{x_k|I_k} is a vector

( P(x_k = 1 | I_k), . . . , P(x_k = n | I_k) )
The DP algorithm can be executed over the n-dimensional simplex (state space is not expanding with increasing k)

When the control and observation spaces are also finite sets the problem is called a POMDP (Partially Observed Markov Decision Problem).
For POMDP it turns out that the cost-to-go functions J_k in the DP algorithm are piecewise linear and concave (Exercise 5.7).
This is conceptually important. It is also useful in practice because it forms the basis for approximations.
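As an illustration of this structure (not part of the lecture), here is a tiny, unpruned alpha-vector value iteration for a hypothetical two-state POMDP: each J_k is stored as a finite set of linear functions of the belief and evaluated as their pointwise minimum, which is exactly the piecewise linear, concave form claimed above.

```python
import itertools
import numpy as np

U, Z, N = 2, 2, 3                               # controls, observations, horizon
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),        # P[u][i, j] = P(x+ = j | x = i, u)
     np.array([[0.5, 0.5], [0.4, 0.6]])]
Q = [np.array([[0.8, 0.2], [0.3, 0.7]])] * U    # Q[u][j, z] = P(z | x+ = j)
g = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # stage cost g(x, u), vector over x
gN = np.array([0.0, 2.0])                       # terminal cost g_N

# J_k is represented by a set Gamma of vectors: J_k(b) = min_{a in Gamma} a @ b.
Gamma = [gN]
for k in range(N):
    new = []
    for u in range(U):
        # one backed-up vector per assignment of a next-stage vector to each z
        for choice in itertools.product(Gamma, repeat=Z):
            alpha = g[u].copy()
            for z, a_z in enumerate(choice):
                alpha += P[u] @ (Q[u][:, z] * a_z)   # E over x+ and z of a_z(x+)
            new.append(alpha)
    Gamma = new   # no pruning of dominated vectors; fine for a toy horizon

b = np.array([0.6, 0.4])                        # a belief P_{x_k|I_k}
print(min(a @ b for a in Gamma))                # J_0(b): a min of linear functions
```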
INSTRUCTION EXAMPLE
Teaching a student some item. Possible states are L: Item learned, or L̄: Item not learned.

Possible decisions: T: Terminate the instruction, or T̄: Continue the instruction for one period and then conduct a test that indicates whether the student has learned the item.

The test has two possible outcomes: R: Student gives a correct answer, or R̄: Student gives an incorrect answer.
Probabilistic structure

[Figure: transition and test diagrams. If not learned, the student moves L̄ → L with probability t and stays at L̄ with probability 1 - t; the state L is absorbing (probability 1). On the test, a student at L answers R with probability 1; a student at L̄ answers R with probability r and R̄ with probability 1 - r.]
Cost of instruction is I per period

Cost of terminating instruction: 0 if student has learned the item, and C > 0 if not.
INSTRUCTION EXAMPLE II
Let p_k: prob. student has learned the item given the test results so far

p_k = P(x_k = L | I_k) = P(x_k = L | z_0, z_1, . . . , z_k).
Using Bayes' rule we can obtain

p_{k+1} = φ(p_k, z_{k+1})
        =
    ( 1 - (1-t)(1-p_k) ) / ( 1 - (1-t)(1-r)(1-p_k) )   if z_{k+1} = R,
    0                                                   if z_{k+1} = R̄.
DP algorithm:

J_k(p_k) = min[ (1 - p_k)C, I + E_{z_{k+1}} { J_{k+1}( φ(p_k, z_{k+1}) ) } ],
starting with

J_{N-1}(p_{N-1}) = min[ (1 - p_{N-1})C, I + (1-t)(1 - p_{N-1})C ].
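A minimal sketch of φ and this DP recursion, with hypothetical values for t, r, I, C, N (not from the lecture):

```python
# Instruction-example DP on the belief p = P(learned). Toy numbers throughout.
t, r, I_cost, C_cost, N = 0.4, 0.3, 1.0, 10.0, 8

def phi(p, z):
    """Bayes update of p after test outcome z in {'R', 'Rbar'}."""
    if z == 'R':
        return (1 - (1 - t) * (1 - p)) / (1 - (1 - t) * (1 - r) * (1 - p))
    return 0.0            # an incorrect answer reveals the item is not learned

def prob_R(p):
    """P(z_{k+1} = R | p_k): prob. of a correct answer after one more period."""
    return 1 - (1 - t) * (1 - r) * (1 - p)

def J(p, k):
    if k == N:
        return (1 - p) * C_cost   # instruction must stop at the horizon; this
                                  # base case reproduces J_{N-1} given above
    terminate = (1 - p) * C_cost
    pR = prob_R(p)
    cont = I_cost + pR * J(phi(p, 'R'), k + 1) + (1 - pR) * J(phi(p, 'Rbar'), k + 1)
    return min(terminate, cont)

for p in (0.0, 0.5, 0.9):
    print(p, round(J(p, 0), 3))
```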
INSTRUCTION EXAMPLE III
Write the DP algorithm as

J_k(p_k) = min[ (1 - p_k)C, I + A_k(p_k) ],
where

A_k(p_k) = P(z_{k+1} = R | I_k) J_{k+1}( φ(p_k, R) )
           + P(z_{k+1} = R̄ | I_k) J_{k+1}( φ(p_k, R̄) )
Can show by induction that the functions A_k(p) are piecewise linear, concave, monotonically decreasing, with

A_{k-1}(p) ≤ A_k(p) ≤ A_{k+1}(p),   for all p ∈ [0, 1].
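A self-contained numerical check of this monotonicity claim, repeating the hypothetical toy values of the previous sketch:

```python
# Check that A_k(p) increases with k on a belief grid (toy numbers, as before).
t, r, I_cost, C_cost, N = 0.4, 0.3, 1.0, 10.0, 8

def phi(p, z):
    if z == 'R':
        return (1 - (1 - t) * (1 - p)) / (1 - (1 - t) * (1 - r) * (1 - p))
    return 0.0

def prob_R(p):
    return 1 - (1 - t) * (1 - r) * (1 - p)

def A(p, k):
    # A_k(p) = P(R) J_{k+1}(phi(p, R)) + P(Rbar) J_{k+1}(phi(p, Rbar))
    pR = prob_R(p)
    return pR * J(phi(p, 'R'), k + 1) + (1 - pR) * J(phi(p, 'Rbar'), k + 1)

def J(p, k):
    if k == N:
        return (1 - p) * C_cost         # must terminate at the horizon
    return min((1 - p) * C_cost, I_cost + A(p, k))

grid = [i / 100 for i in range(101)]
assert all(A(p, N - 2) <= A(p, N - 1) + 1e-12 for p in grid)  # A_{k-1} <= A_k
assert all(A(p, 0) <= A(p, 1) + 1e-12 for p in grid)
print("A_k increases with k on the grid, as claimed")
```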
[Figure: plot over p ∈ [0, 1] of the termination cost (1 - p)C and the curves I + A_{N-1}(p), I + A_{N-2}(p), I + A_{N-3}(p), with the threshold points a_{N-1}, a_{N-2}, a_{N-3} and the value 1 - I/C marked on the p-axis.]
MIT OpenCourseWare
http://ocw.mit.edu
6.231 Dynamic Programming and Stochastic Control
Fall 2011
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.