
MIT OpenCourseWare

http://ocw.mit.edu
6.231 Dynamic Programming and Stochastic Control
Fall 2008
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
6.231 DYNAMIC PROGRAMMING
LECTURE 16
LECTURE OUTLINE
Control of continuous-time Markov chains
Semi-Markov problems
Problem formulation
Equivalence to discrete-time problems
Discounted problems
Average cost problems
CONTINUOUS-TIME MARKOV CHAINS
Stationary system with finite number of states
and controls
State transitions occur at discrete times
Control applied at these discrete times and stays
constant between transitions
Time between transitions is random
Cost accumulates in continuous time (may also
be incurred at the time of transition)
Example: Admission control in a system with
restricted capacity (e.g., a communication link)
Customer arrivals: a Poisson process
Customers entering the system depart after an exponentially distributed time
Upon arrival we must decide whether to admit or to block a customer
There is a cost for blocking a customer
For each customer that is in the system, there
is a customer-dependent reward per unit time
Minimize time-discounted or average cost
PROBLEM FORMULATION
x(t) and u(t): State and control at time t
t_k: Time of the kth transition (t_0 = 0)
x_k = x(t_k); x(t) = x_k for t_k \le t < t_{k+1}
u_k = u(t_k); u(t) = u_k for t_k \le t < t_{k+1}
No transition probabilities; instead transition
distributions (quantify the uncertainty about both
transition time and next state)
Q_{ij}(\tau, u) = P\{ t_{k+1} - t_k \le \tau,\ x_{k+1} = j \mid x_k = i,\ u_k = u \}
Two important formulas:
(1) Transition probabilities are specified by
p_{ij}(u) = P\{ x_{k+1} = j \mid x_k = i,\ u_k = u \} = \lim_{\tau \to \infty} Q_{ij}(\tau, u)
(2) The Cumulative Distribution Function (CDF) of \tau given i, j, u is (assuming p_{ij}(u) > 0)
P\{ t_{k+1} - t_k \le \tau \mid x_k = i,\ x_{k+1} = j,\ u_k = u \} = \frac{Q_{ij}(\tau, u)}{p_{ij}(u)}
Thus, Q_{ij}(\tau, u) can be viewed as a scaled CDF
EXPONENTIAL TRANSITION DISTRIBUTIONS
Important example of transition distributions:
Q_{ij}(\tau, u) = p_{ij}(u)\left( 1 - e^{-\nu_i(u)\tau} \right),
where p_{ij}(u) are transition probabilities, and \nu_i(u) is called the transition rate at state i.
Interpretation: If the system is in state i and
control u is applied
the next state will be j with probability p_{ij}(u)
the time between the transition to state i and the transition to the next state j is exponentially distributed with parameter \nu_i(u) (independently of j):
P\{ \text{transition time interval} > \tau \mid i, u \} = e^{-\nu_i(u)\tau}
The exponential distribution is memoryless.
This implies that for a given policy, the system is a continuous-time Markov chain (the future depends on the past only through the present).
Without the memoryless property, the Markov
property holds only at the times of transition.
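
To make the exponential case concrete, here is a minimal simulation sketch (not part of the lecture; the data structures and function name are illustrative assumptions): under a fixed policy, the holding time in state i is drawn as an exponential with rate \nu_i(u), independently of the next state, and the next state is drawn according to p_{ij}(u).

    import random

    def simulate(p, nu, mu, x0, num_transitions, seed=0):
        """Simulate the controlled chain under a fixed policy mu.
        p[i][u][j]: transition probabilities, nu[i][u]: transition rates, mu[i]: control at state i."""
        rng = random.Random(seed)
        t, x = 0.0, x0
        path = [(0.0, x0)]
        for _ in range(num_transitions):
            u = mu[x]
            t += rng.expovariate(nu[x][u])       # holding time ~ Exp(nu_i(u)), independent of j
            x = rng.choices(range(len(p[x][u])), weights=p[x][u])[0]   # next state ~ p_ij(u)
            path.append((t, x))
        return path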
COST STRUCTURES
There is a cost g(i, u) per unit time, i.e.
g(i, u)dt = the cost incurred in time dt
There may be an extra instantaneous cost \hat{g}(i, u) at the time of a transition (let's ignore this for the moment)
Total discounted cost of \pi = \{\mu_0, \mu_1, \ldots\} starting from state i (with discount factor \beta > 0)
\lim_{N \to \infty} E\left\{ \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} e^{-\beta t}\, g\big(x_k, \mu_k(x_k)\big)\, dt \;\Big|\; x_0 = i \right\}
Average cost per unit time
\lim_{N \to \infty} \frac{1}{E\{t_N\}}\, E\left\{ \sum_{k=0}^{N-1} \int_{t_k}^{t_{k+1}} g\big(x_k, \mu_k(x_k)\big)\, dt \;\Big|\; x_0 = i \right\}
We will see that both problems have equivalent
discrete-time versions.
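
As a rough illustration (not in the slides), the discounted cost can be estimated by Monte Carlo along simulated trajectories, using \int_{t_k}^{t_{k+1}} e^{-\beta t}\, dt = e^{-\beta t_k}\big(1 - e^{-\beta(t_{k+1}-t_k)}\big)/\beta. The sketch below assumes a path in the (time, state) format produced by the simulate() sketch above; g, mu, and beta are illustrative.

    import math

    def discounted_cost(path, g, mu, beta):
        """Discounted cost of one simulated trajectory: sum over transitions of
        e^{-beta t_k} * (1 - e^{-beta (t_{k+1}-t_k)}) / beta * g(x_k, mu[x_k])."""
        total = 0.0
        for (t_k, x_k), (t_next, _) in zip(path, path[1:]):
            total += math.exp(-beta * t_k) * (1.0 - math.exp(-beta * (t_next - t_k))) / beta * g(x_k, mu[x_k])
        return total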
A NOTE ON NOTATION
The scaled CDF Q_{ij}(\tau, u) can be used to model discrete, continuous, and mixed distributions for the transition time \tau.
Generally, expected values of functions of \tau can be written as integrals involving dQ_{ij}(\tau, u). For example, the conditional expected value of \tau given i, j, and u is written as
E\{\tau \mid i, j, u\} = \int_0^\infty \tau\, \frac{dQ_{ij}(\tau, u)}{p_{ij}(u)}
If Q_{ij}(\tau, u) is continuous with respect to \tau, its derivative
q_{ij}(\tau, u) = \frac{dQ_{ij}}{d\tau}(\tau, u)
can be viewed as a scaled density function. Expected values of functions of \tau can then be written in terms of q_{ij}(\tau, u). For example
E\{\tau \mid i, j, u\} = \int_0^\infty \tau\, \frac{q_{ij}(\tau, u)}{p_{ij}(u)}\, d\tau
If Q_{ij}(\tau, u) is discontinuous and staircase-like, expected values can be written as summations.
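
A small numerical check of the first formula (my addition, with illustrative values): for the uniform scaled CDF Q(\tau) = p \min(1, \tau/\tau_{max}), the scaled density is q(\tau) = p/\tau_{max} on [0, \tau_{max}], so E\{\tau \mid i, j, u\} should come out to \tau_{max}/2.

    # Trapezoidal approximation of E{tau | i,j,u} = (1/p_ij) * int_0^upper tau * q(tau) dtau
    def expected_tau(q_density, p_ij, upper, steps=100000):
        h = upper / steps
        total = 0.0
        for k in range(steps + 1):
            tau = k * h
            w = 0.5 if k in (0, steps) else 1.0
            total += w * tau * q_density(tau)
        return h * total / p_ij

    tau_max, p = 1.0, 1.0
    print(expected_tau(lambda tau: p / tau_max, p, tau_max))   # ~ tau_max / 2 = 0.5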
DISCOUNTED PROBLEMS COST CALCULATION
For a policy \pi = \{\mu_0, \mu_1, \ldots\}, write
J_\pi(i) = E\{\text{1st transition cost}\} + E\big\{ e^{-\beta\tau} J_{\pi_1}(j) \mid i, \mu_0(i) \big\}
where J_{\pi_1}(j) is the cost-to-go of the policy \pi_1 = \{\mu_1, \mu_2, \ldots\}
We calculate the two costs in the RHS. The E\{\text{1st transition cost}\}, if u is applied at state i, is
G(i, u) = E_j\Big\{ E_\tau\{\text{1st transition cost} \mid j\} \Big\}
= \sum_{j=1}^n p_{ij}(u) \int_0^\infty \left( \int_0^\tau e^{-\beta t} g(i, u)\, dt \right) \frac{dQ_{ij}(\tau, u)}{p_{ij}(u)}
= \sum_{j=1}^n \int_0^\infty \frac{1 - e^{-\beta\tau}}{\beta}\, g(i, u)\, dQ_{ij}(\tau, u)
Thus the E\{\text{1st transition cost}\} is
G\big(i, \mu_0(i)\big) = g\big(i, \mu_0(i)\big) \sum_{j=1}^n \int_0^\infty \frac{1 - e^{-\beta\tau}}{\beta}\, dQ_{ij}\big(\tau, \mu_0(i)\big)
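
As a sanity check (my addition, not in the slides): with the exponential transition distributions from the earlier slide, dQ_{ij}(\tau, u) = p_{ij}(u)\, \nu_i(u) e^{-\nu_i(u)\tau} d\tau and \sum_j p_{ij}(u) = 1, so the formula reduces to G(i, u) = g(i, u)/(\nu_i(u) + \beta). The sketch below verifies this numerically with illustrative values.

    import math

    # Midpoint-rule evaluation of g(i,u) * int_0^inf (1 - e^{-beta*tau})/beta * nu * e^{-nu*tau} dtau
    g_iu, nu, beta = 1.0, 2.0, 0.5
    steps, upper = 200000, 50.0
    h = upper / steps
    integral = sum((1.0 - math.exp(-beta * (k + 0.5) * h)) / beta
                   * nu * math.exp(-nu * (k + 0.5) * h) * h
                   for k in range(steps))
    print(g_iu * integral, g_iu / (nu + beta))   # both ~ 0.4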
COST CALCULATION (CONTINUED)
Also the expected (discounted) cost from the next state j is
E\big\{ e^{-\beta\tau} J_{\pi_1}(j) \mid i, \mu_0(i) \big\} = E_j\Big\{ E\{ e^{-\beta\tau} \mid i, \mu_0(i), j \}\, J_{\pi_1}(j) \mid i, \mu_0(i) \Big\}
= \sum_{j=1}^n p_{ij}(u) \left( \int_0^\infty e^{-\beta\tau}\, \frac{dQ_{ij}(\tau, u)}{p_{ij}(u)} \right) J_{\pi_1}(j)
= \sum_{j=1}^n m_{ij}\big(\mu_0(i)\big)\, J_{\pi_1}(j)
where m_{ij}(u) is given by
m_{ij}(u) = \int_0^\infty e^{-\beta\tau}\, dQ_{ij}(\tau, u) \quad \left( < \int_0^\infty dQ_{ij}(\tau, u) = p_{ij}(u) \right)
and can be viewed as the effective discount factor [the analog of \alpha p_{ij}(u) in the discrete-time case].
So J_\pi(i) can be written as
J_\pi(i) = G\big(i, \mu_0(i)\big) + \sum_{j=1}^n m_{ij}\big(\mu_0(i)\big)\, J_{\pi_1}(j)
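
For the same exponential special case (again my addition, with illustrative values), m_{ij}(u) = \int_0^\infty e^{-\beta\tau}\, dQ_{ij}(\tau, u) = p_{ij}(u)\, \nu_i(u)/(\nu_i(u)+\beta), which is indeed strictly below p_{ij}(u):

    import math

    # m_ij(u) for exponential transition distributions: numeric integral vs. closed form
    p_ij, nu, beta = 0.3, 2.0, 0.5
    steps, upper = 200000, 50.0
    h = upper / steps
    m_ij = sum(math.exp(-beta * (k + 0.5) * h) * p_ij * nu * math.exp(-nu * (k + 0.5) * h) * h
               for k in range(steps))
    print(m_ij, p_ij * nu / (nu + beta))   # both ~ 0.24, strictly less than p_ij = 0.3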
EQUIVALENCE TO AN SSP
Similar to the discrete-time case, introduce a stochastic shortest path problem with an artificial termination state t
Under control u, from state i the system moves to state j with probability m_{ij}(u) and to the termination state t with probability 1 - \sum_{j=1}^n m_{ij}(u)
Bellman's equation: For i = 1, \ldots, n,
J^*(i) = \min_{u \in U(i)} \left[ G(i, u) + \sum_{j=1}^n m_{ij}(u)\, J^*(j) \right]
Analogs of value iteration, policy iteration, and linear programming.
If in addition to the cost per unit time g, there is an extra (instantaneous) one-stage cost \hat{g}(i, u), Bellman's equation becomes
J^*(i) = \min_{u \in U(i)} \left[ \hat{g}(i, u) + G(i, u) + \sum_{j=1}^n m_{ij}(u)\, J^*(j) \right]
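
A minimal value-iteration sketch for this equation (the data structures and names are assumptions for illustration; the lecture only states that the analogs of value iteration, policy iteration, and linear programming apply):

    def value_iteration(states, controls, G, ghat, m, tol=1e-9, max_iters=100000):
        """Iterate J(i) <- min_u [ ghat[i][u] + G[i][u] + sum_j m[i][u][j] * J(j) ]."""
        J = {i: 0.0 for i in states}
        for _ in range(max_iters):
            J_new = {i: min(ghat[i][u] + G[i][u] + sum(m[i][u][j] * J[j] for j in states)
                            for u in controls[i])
                     for i in states}
            if max(abs(J_new[i] - J[i]) for i in states) < tol:
                return J_new
            J = J_new
        return J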
MANUFACTURER'S EXAMPLE REVISITED
A manufacturer receives orders with interarrival times uniformly distributed in [0, \tau_{max}].
He may process all unfilled orders at cost K > 0, or process none. The cost per unit time of an unfilled order is c. Max number of unfilled orders is n.
The nonzero transition distributions are
Q_{i1}(\tau, \text{Fill}) = Q_{i(i+1)}(\tau, \text{Not Fill}) = \min\left( 1, \frac{\tau}{\tau_{max}} \right)
The one-stage expected cost G is
G(i, \text{Fill}) = 0, \quad G(i, \text{Not Fill}) = \gamma\, c\, i,
where
\gamma = \sum_{j=1}^n \int_0^\infty \frac{1 - e^{-\beta\tau}}{\beta}\, dQ_{ij}(\tau, u) = \int_0^{\tau_{max}} \frac{1 - e^{-\beta\tau}}{\beta\,\tau_{max}}\, d\tau
There is an instantaneous cost
\hat{g}(i, \text{Fill}) = K, \quad \hat{g}(i, \text{Not Fill}) = 0
MANUFACTURER'S EXAMPLE CONTINUED
The effective discount factors m_{ij}(u) in Bellman's equation are
m_{i1}(\text{Fill}) = m_{i(i+1)}(\text{Not Fill}) = \alpha,
where
\alpha = \int_0^\infty e^{-\beta\tau}\, dQ_{ij}(\tau, u) = \int_0^{\tau_{max}} \frac{e^{-\beta\tau}}{\tau_{max}}\, d\tau = \frac{1 - e^{-\beta\tau_{max}}}{\beta\,\tau_{max}}
Bellman's equation has the form
J^*(i) = \min\big[\, K + \alpha J^*(1),\ \gamma c\, i + \alpha J^*(i+1) \,\big], \quad i = 1, 2, \ldots
As in the discrete-time case, we can conclude that there exists an optimal threshold i^*:
fill the orders <==> their number i exceeds i^*
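
Putting the pieces together numerically (a sketch under illustrative parameter values for K, c, \beta, \tau_{max}, n; it also assumes, as in the discrete-time version of the example, that all orders must be filled once the buffer of n unfilled orders is full):

    import math

    K, c, beta, tau_max, n = 5.0, 1.0, 0.1, 2.0, 10      # illustrative values

    alpha = (1.0 - math.exp(-beta * tau_max)) / (beta * tau_max)                    # effective discount factor
    gamma = (tau_max - (1.0 - math.exp(-beta * tau_max)) / beta) / (beta * tau_max) # closed form of the gamma integral

    J = [0.0] * (n + 1)                                   # J[1..n]; index 0 unused
    for _ in range(100000):
        J_new = J[:]
        for i in range(1, n):
            J_new[i] = min(K + alpha * J[1], gamma * c * i + alpha * J[i + 1])
        J_new[n] = K + alpha * J[1]                       # buffer full: must fill
        if max(abs(a - b) for a, b in zip(J, J_new)) < 1e-12:
            J = J_new
            break
        J = J_new

    policy = {i: "Fill" if K + alpha * J[1] <= gamma * c * i + alpha * J[i + 1] else "Not Fill"
              for i in range(1, n)}
    policy[n] = "Fill"
    print(alpha, gamma, policy)                           # policy switches to Fill above a threshold i*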
AVERAGE COST
Minimize
\lim_{N \to \infty} \frac{1}{E\{t_N\}}\, E\left\{ \int_0^{t_N} g\big(x(t), u(t)\big)\, dt \right\}
assuming there is a special state that is recurrent
under all policies
Total expected cost of a transition
G(i, u) = g(i, u)\, \bar{\tau}_i(u),
where \bar{\tau}_i(u): Expected transition time.
We now apply the SSP argument used for the discrete-time case. Divide the trajectory into cycles marked by successive visits to n. The cost at (i, u) is G(i, u) - \lambda^* \bar{\tau}_i(u), where \lambda^* is the optimal expected cost per unit time. Each cycle is viewed as a state trajectory of a corresponding SSP problem with the termination state being essentially n.
So Bellman's Eq. for the average cost problem:
h^*(i) = \min_{u \in U(i)} \left[ G(i, u) - \lambda^* \bar{\tau}_i(u) + \sum_{j=1}^n p_{ij}(u)\, h^*(j) \right]
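
One way to solve this equation numerically (a sketch, not a method stated in the lecture): for a trial value of \lambda, run value iteration on the associated SSP with the special state n acting as termination, then bisect on \lambda until the optimal cycle cost from n is zero. All names and data structures below are illustrative assumptions.

    def ssp_value(lam, states, controls, G, tau_bar, p, ref, iters=2000):
        """h(i) <- min_u [ G[i][u] - lam*tau_bar[i][u] + sum_{j != ref} p[i][u][j]*h(j) ];
        visits to ref terminate the cycle, so h(ref) is the optimal cycle cost for this lam."""
        h = {i: 0.0 for i in states}
        for _ in range(iters):
            h_new = {}
            for i in states:
                h_new[i] = min(G[i][u] - lam * tau_bar[i][u]
                               + sum(p[i][u][j] * h[j] for j in states if j != ref)
                               for u in controls[i])
            h = h_new
        return h

    def solve_average_cost(states, controls, G, tau_bar, p, ref, lo=0.0, hi=100.0, tol=1e-6):
        """Bisect on lam: the optimal cycle cost from ref decreases in lam and equals 0 at lambda*."""
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if ssp_value(mid, states, controls, G, tau_bar, p, ref)[ref] > 0.0:
                lo = mid          # cycle cost still positive: lambda too small
            else:
                hi = mid
        return 0.5 * (lo + hi)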
AVERAGE COST MANUFACTURER'S EXAMPLE
The expected transition times are
\bar{\tau}_i(\text{Fill}) = \bar{\tau}_i(\text{Not Fill}) = \frac{\tau_{max}}{2}
the expected transition cost is
G(i, \text{Fill}) = 0, \quad G(i, \text{Not Fill}) = \frac{c\, i\, \tau_{max}}{2}
and there is also the instantaneous cost
\hat{g}(i, \text{Fill}) = K, \quad \hat{g}(i, \text{Not Fill}) = 0
Bellman's equation:
h^*(i) = \min\left[\, K - \lambda^* \frac{\tau_{max}}{2} + h^*(1),\ \ c\, i\, \frac{\tau_{max}}{2} - \lambda^* \frac{\tau_{max}}{2} + h^*(i+1) \,\right]
Again it can be shown that a threshold policy
is optimal.