The process has transition probability matrix

$$
\begin{bmatrix}
1 & 0 & \cdots & \cdots & \cdots & 0 \\
\delta_0 & D_1 & D_0 & & & \\
\delta & A_2 & A_1 & A_0 & & \\
\vdots & & \ddots & \ddots & \ddots & \\
\delta & & & A_2 & A_1 & A_0 \\
\delta_{N-1} & & & & C_2 & C_1
\end{bmatrix}.
$$

The lower right block of this matrix (call it Q) is of size P × P, with each sub-block having size M × M. The diagonal blocks of Q contain the (unnormalized) transition probabilities associated with each level, whilst the off-diagonal blocks contain the transition probabilities between levels. The first column contains the transition probabilities from each level to the absorbing state, which are independent of level apart from level 0 and level N − 1. At most two of δ_0, δ, δ_{N−1} may be zero.

In an MDP, the transition probability matrix Q is parametrized by a finite set of control functions U. Thus, each block in Q is also a function of u ∈ U. In the sequel, we shall only consider SSP problems, where the process starts in a given state at time t = 0 and evolves over t = 1, 2, . . .. We assume there is a unique absorbing state as described above, and that the absorbing state is reached with probability one in finite time. We will assume that a stationary policy (see [3]) is applied. This means that the mapping (policy) μ : X(t) → u(t) is independent of t. The reward-to-go from state i under policy μ is defined by

$$
J^{\mu}(i) = E\left[\, \sum_{t=0}^{\infty} g(X_t, X_{t+1}, u_t) \;\middle|\; X_0 = i \right] \tag{1}
$$

where g(i, j, u) is the reward received on a transition from state i to state j under control u. Defining the expected one-step reward

$$
[g(u)]_i = \sum_{j=1}^{P-1} [Q(u)]_{i,j}\, g(i, j, u) \tag{2}
$$

and conditioning (1) on the first transition yields

$$
J^{\mu}(i) = [g(u)]_i + \sum_{j=1}^{P-1} [Q(u)]_{i,j}\, J^{\mu}(j) \tag{3}
$$

or, in vector form,

$$
J^{\mu} = g + Q J^{\mu} \tag{4}
$$
where, again, the control u is specified by the policy μ for each component state. Thus, the reward-to-go under a specified policy can be evaluated by solving the set of linear equations (4). There is a unique solution to these equations because of the probabilistic assumptions made on the SSP problem, namely that the absorbing state is reachable with probability one from any initial state (see [3]). The solution of (4) is called policy evaluation and is the main computational cost in performing standard policy iteration. For general transition probability matrices Q, this requires O(P^3) operations, which can be prohibitively large. However, in the QBD case, we can exploit the special structure of Q to reduce this complexity substantially.
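As a point of reference, here is a minimal sketch (with illustrative, randomly generated substochastic blocks; all function names are mine, not the paper's) that assembles Q for a level-independent QBD and evaluates a policy by solving (4) densely, i.e., the O(P^3) approach the remainder of the section avoids:

```python
import numpy as np

def random_blocks(M, k, rng, mass=0.95):
    """k adjacent M x M blocks whose combined rows sum to `mass` (< 1)."""
    R = rng.random((M, k * M))
    R *= mass / R.sum(axis=1, keepdims=True)
    return np.split(R, k, axis=1)

def build_Q(D1, D0, A0, A1, A2, C1, C2, N):
    """Assemble the P x P transient block Q, P = N * M."""
    M = A1.shape[0]
    Q = np.zeros((N * M, N * M))
    s = lambda n: slice(n * M, (n + 1) * M)
    Q[s(0), s(0)], Q[s(0), s(1)] = D1, D0
    for n in range(1, N - 1):
        Q[s(n), s(n - 1)], Q[s(n), s(n)], Q[s(n), s(n + 1)] = A2, A1, A0
    Q[s(N - 1), s(N - 2)], Q[s(N - 1), s(N - 1)] = C2, C1
    return Q

rng = np.random.default_rng(0)
M, N = 4, 8
D1, D0 = random_blocks(M, 2, rng)
A2, A1, A0 = random_blocks(M, 3, rng)
C2, C1 = random_blocks(M, 2, rng)
Q = build_Q(D1, D0, A0, A1, A2, C1, C2, N)
g = rng.random(N * M)                      # expected one-step rewards
J = np.linalg.solve(np.eye(N * M) - Q, g)  # dense solve of (4): O(P^3)
```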
Writing (4) blockwise for the even levels gives

$$
\begin{aligned}
J_0 &= g_0 + D_1 J_0 + D_0 J_1 \\
J_{2n} &= g_{2n} + A_2 J_{2n-1} + A_1 J_{2n} + A_0 J_{2n+1}
\end{aligned} \tag{5}
$$

for n = 1, . . . , N/2 − 1, where J = [J_0^T · · · J_{N−1}^T]^T. So if the odd blocks of J are available, we can easily compute the even blocks using (5) via

$$
\begin{aligned}
J_0 &= (I - D_1)^{-1}(g_0 + D_0 J_1) \\
J_{2n} &= (I - A_1)^{-1}(g_{2n} + A_0 J_{2n+1} + A_2 J_{2n-1})
\end{aligned} \tag{6}
$$

for n = 1, . . . , N/2 − 1. The inverse of the matrix I − A_1 is guaranteed to exist because A_1 is substochastic. Similar remarks apply to I − D_1. Thus, we need to compute the three matrices (I − D_1)^{−1} D_0, (I − A_1)^{−1} A_0, and (I − A_1)^{−1} A_2, which requires O(M^3) effort, independent of N. We also need N matrix-vector multiplications, requiring O(N M^2) effort.
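A minimal sketch of the even-block back-substitution (6), assuming the blocks A_0, A_1, A_2, D_0, D_1 are given as M × M NumPy arrays and the reward blocks g_n as length-M vectors (the argument names are illustrative):

```python
import numpy as np

def even_blocks(J_odd, g, A0, A1, A2, D0, D1):
    """Recover even blocks J_0, J_2, ..., J_{N-2} from odd blocks via (6).

    J_odd : list of the odd blocks [J_1, J_3, ..., J_{N-1}]
    g     : list of all reward blocks [g_0, ..., g_{N-1}]
    """
    M = A1.shape[0]
    I = np.eye(M)
    # Factor the two matrices once: O(M^3), independent of N.
    inv_ID1 = np.linalg.inv(I - D1)   # exists: D1 is substochastic
    inv_IA1 = np.linalg.inv(I - A1)   # exists: A1 is substochastic
    J_even = [inv_ID1 @ (g[0] + D0 @ J_odd[0])]          # J_0
    for n in range(1, len(J_odd)):
        # J_{2n} = (I - A1)^{-1} (g_{2n} + A0 J_{2n+1} + A2 J_{2n-1})
        J_even.append(inv_IA1 @ (g[2 * n] + A0 @ J_odd[n] + A2 @ J_odd[n - 1]))
    return J_even
```

Each call performs the O(M^3) inversions once and then O(N) block operations, matching the O(M^3) + O(N M^2) accounting above.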
Now, following the folding idea of [5], consider the process Y_t defined to be X_t observed on odd levels, including the absorbing state. This process is also a level-independent QBD (having N/2 + 1 levels),
with transition probability matrix
$$
\begin{bmatrix}
1 & 0 & \cdots & \cdots & 0 \\
\epsilon_0 & E_1 & B_0 & & \\
\epsilon & B_2 & B_1 & B_0 & \\
\vdots & & \ddots & \ddots & \ddots \\
\epsilon_{N/2-1} & & & F_2 & F_1
\end{bmatrix}
$$
where
$$
\begin{aligned}
E_1 &= A_1 + A_0(I - A_1)^{-1}A_2 + A_2(I - D_1)^{-1}D_0 \\
B_0 &= A_0(I - A_1)^{-1}A_0, \qquad B_2 = A_2(I - A_1)^{-1}A_2 \\
F_2 &= C_2(I - A_1)^{-1}A_2
\end{aligned} \tag{7}
$$

$$
\begin{aligned}
B_1 &= A_1 + A_0(I - A_1)^{-1}A_2 + A_2(I - A_1)^{-1}A_0 \\
F_1 &= C_1 + C_2(I - A_1)^{-1}A_0.
\end{aligned} \tag{8}
$$
The absorption probability vectors of the folded process are

$$
\epsilon_0 = [I + A_0(I - A_1)^{-1}]\delta + A_2(I - D_1)^{-1}\delta_0 \tag{9}
$$

$$
\epsilon = [I + (A_0 + A_2)(I - A_1)^{-1}]\delta \tag{10}
$$

$$
\epsilon_{N/2-1} = \delta_{N-1} + C_2(I - A_1)^{-1}\delta. \tag{11}
$$
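A minimal sketch of one folding step, computing the folded blocks (7)-(8) and absorption vectors (9)-(11); the function and argument names are illustrative, with the matrix blocks assumed to be M × M NumPy arrays and the deltas length-M vectors:

```python
import numpy as np

def fold_once(A0, A1, A2, D0, D1, C1, C2, d0, d, dN):
    """One folding step: parameters of X_t observed on odd levels."""
    I = np.eye(A1.shape[0])
    S = np.linalg.inv(I - A1)      # (I - A1)^{-1}, exists: A1 substochastic
    T = np.linalg.inv(I - D1)      # (I - D1)^{-1}, exists: D1 substochastic
    E1 = A1 + A0 @ S @ A2 + A2 @ T @ D0             # (7)
    B0 = A0 @ S @ A0
    B2 = A2 @ S @ A2
    F2 = C2 @ S @ A2
    B1 = A1 + A0 @ S @ A2 + A2 @ S @ A0             # (8)
    F1 = C1 + C2 @ S @ A0
    e0 = (I + A0 @ S) @ d + A2 @ T @ d0             # (9)
    e  = (I + (A0 + A2) @ S) @ d                    # (10)
    eN = dN + C2 @ S @ d                            # (11)
    return E1, B0, B1, B2, F1, F2, e0, e, eN
```

Applying such a step recursively halves the number of levels each time, which is where the log_2 N factor in the overall complexity comes from.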
It remains to compute the expected one-step rewards for the folded process. Consider a path from (2n + 1, i) to (2n + 3, j) that makes exactly ℓ stops k_1, . . . , k_ℓ in level 2n + 2. The total reward accumulated along the path is

$$
[g_0]_{i,k_1} + \sum_{m=1}^{\ell-1} [g_1]_{k_m,k_{m+1}} + [g_0]_{k_\ell,j}
$$

where [g_0]_{i,j} and [g_1]_{i,j} denote the one-step rewards associated with the corresponding A_0 and A_1 transitions, while the probability of the path is

$$
[A_0]_{i,k_1} \Big( \prod_{m=1}^{\ell-1} [A_1]_{k_m,k_{m+1}} \Big) [A_0]_{k_\ell,j}. \tag{12}
$$

Multiplying these two terms together and summing over all the k_m terms yields the average reward of a path from (2n + 1, i) to (2n + 3, j) having exactly ℓ stops in level 2n + 2. Let this quantity be called h(i, j; ℓ), which is independent of n, and is given by

$$
h(i, j; \ell) = \sum_{k_1,\ldots,k_\ell} \Big( [g_0]_{i,k_1} + \sum_{m=1}^{\ell-1} [g_1]_{k_m,k_{m+1}} + [g_0]_{k_\ell,j} \Big) [A_0]_{i,k_1} \Big( \prod_{m=1}^{\ell-1} [A_1]_{k_m,k_{m+1}} \Big) [A_0]_{k_\ell,j}. \tag{13}
$$
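For concreteness, a brute-force sketch of (13) that enumerates the stops k_1, . . . , k_ℓ directly; the per-transition reward matrices g0 and g1 are part of the reconstruction above, and the enumeration is exponential in ℓ, so this is a correctness check rather than a piece of the fast algorithm:

```python
import itertools
import numpy as np

def h(i, j, ell, A0, A1, g0, g1):
    """Average reward over paths (2n+1, i) -> (2n+3, j) with exactly ell stops."""
    M = A0.shape[0]
    total = 0.0
    for ks in itertools.product(range(M), repeat=ell):
        # Probability of this particular path, per (12).
        prob = A0[i, ks[0]] * A0[ks[-1], j]
        for m in range(ell - 1):
            prob *= A1[ks[m], ks[m + 1]]
        # Total reward accumulated along the path.
        reward = g0[i, ks[0]] + g0[ks[-1], j]
        for m in range(ell - 1):
            reward += g1[ks[m], ks[m + 1]]
        total += reward * prob
    return total
```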
It should be noted that LLR allows the QBD MDP to be level dependent, and so is more general.
For reasons of brevity, we consider the familiar case (e.g., [7]) where the server is allocated to either queue with probability taking a value in a finite set, and that probability does not depend on the queue occupancy (apart from when either of the queues is empty). Thus p_m and q_m equal constant values p and q, respectively. We show the reward-to-go from the initial state of both queues being empty, until the data queue overflows (the absorbing state). As detailed above, the service probability can be different when the data queue is full. In our example, we took the data queue arrival rate to be 0.8 packets per time unit and the voice queue arrival rate to be 0.1 packets per time unit. The service rate is 1 packet per time unit. The data packet buffer is 32 and the voice buffer is 5. A reward of m units is received for serving the voice queue when it contains m packets, and a reward of 0.5 units is received for serving the data queue unless it is full, in which case a reward of 5 units is received. Table I shows the reward-to-go for various values of p and q (each pair (p, q) is one of the policies being ranked). This problem has a deterministic solution which allocates all service capacity to the data queue when it is non-empty. This result is in contrast to the so-called cμ-rule [7], which considers the infinite-horizon problem, and where all capacity would be allocated to the voice queue unless the data queue was full. The effect of having to reset the system due to data queue overflow, and the high reward obtained when serving a full data queue, changes the service policy significantly compared to [7].
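As an illustration only, here is a minimal Monte Carlo sketch that estimates the reward-to-go of one (p, q) policy for this two-queue example. The slotted dynamics below (Bernoulli arrivals, one service per slot, p read as the probability of serving the voice queue when both queues are non-empty and q as its counterpart when the data queue is full) are my assumptions about the model, not details taken from the paper; the exact values would come from solving (4) with the algorithm above:

```python
import random

def simulate_reward(p, q, trials=1000, seed=1):
    """Average accumulated reward from (0, 0) until the data queue overflows."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        data, voice, reward = 0, 0, 0.0
        while True:
            # Serve one queue per slot (service rate 1 packet per time unit).
            serve_voice_prob = q if data == 32 else p
            if voice > 0 and (data == 0 or rng.random() < serve_voice_prob):
                reward += voice                   # m units for m packets queued
                voice -= 1
            elif data > 0:
                reward += 5.0 if data == 32 else 0.5
                data -= 1
            # Bernoulli arrivals: 0.8 data, 0.1 voice packets per time unit.
            if rng.random() < 0.8:
                data += 1
            if rng.random() < 0.1:
                voice = min(voice + 1, 5)         # voice buffer of 5
            if data > 32:                         # data buffer of 32: overflow
                break                             # absorbing state reached
        total += reward
    return total / trials

print(simulate_reward(p=0.5, q=0.5))
```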
V. CONCLUSION

We have presented a new algorithm for policy evaluation for stochastic shortest path MDPs with quasi-birth-death structure. The algorithm has computational complexity of O(M^3 log_2 N) + O(M^2 N), compared with the O(P^3) cost of solving (4) directly.
REFERENCES

[1] M. Neuts, Matrix-Geometric Solutions in Stochastic Models. Baltimore, MD, USA: Johns Hopkins Univ. Press, 1981.
[2] G. Latouche and V. Ramaswami, Introduction to Matrix Analytic Methods in Stochastic Modeling. Philadelphia, PA, USA: ASA-SIAM, 1999.
[3] D. P. Bertsekas, Dynamic Programming and Optimal Control, 2nd ed. Nashua, NH, USA: Athena Scientific, 2001.
[4] L. B. White, "A new algorithm for policy evaluation for Markov decision processes having quasi-birth death structure," Stochastic Models, vol. 21, no. 2-3, pp. 785–797, 2005.
[5] J. Ye and S.-Q. Li, "Folding algorithm: A computational method for finite QBD processes with level-dependent transitions," IEEE Trans. Commun., vol. 42, no. 2/3/4, pp. 625–639, 1994.
[6] J. Lambert, B. Van Houdt, and C. Blondia, "A policy iteration algorithm for Markov decision processes skip-free in one direction," in Proc. ValueTools, ICST, Belgium, 2007, pp. 75:1–75:9.
[7] J.-B. Suk and C. G. Cassandras, "Optimal scheduling of two competing queues with blocking," IEEE Trans. Autom. Control, vol. 36, no. 9, pp. 1086–1091, Sep. 1991.