
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 60, NO. 4, APRIL 2015

Folding Algorithm for Policy Evaluation for Markov Decision Processes With Quasi-Birth Death Structure

Yassir Yassir and Langford B. White
Abstract—This technical note presents a new numerical procedure for policy evaluation of Stochastic Shortest Path Markov Decision Processes (MDPs) having a level independent Quasi-Birth-Death structure. The algorithm is derived using a method analogous to the folding method of Ye and Li (1994). The computational complexity is $O(M^3 \log_2 N) + O(M^2 N)$, where the process has $N$ levels and $M$ phases. A simple example involving the control of two queues is presented to illustrate the application of this efficient policy evaluation algorithm to compare and rank control policies.

Index Terms—Dynamic programming, optimisation, queueing analysis.
I. INTRODUCTION
A finite level Quasi Birth-Death (QBD) Process is a (finite) discrete
state Markov process with a transition probability matrix having a
block tridiagonal structure. QBD processes represent an extension of
standard birth-death processes (which possess tridiagonal transition
probability matrices) to more than one dimension. QBD processes are
the subject of texts such as [1] and [2] to which the reader is referred
for details. An important application of QBD models is in telecommunications systems modelling, where the determination of the stationary
distribution of the states of the system (usually queue occupancies)
permits the evaluation of various performance measures such as blocking probabilities and delays which are important in assessing system
performance. Matrix analytic methods (MAM) are a commonly used
approach for determining the stationary distribution of a QBD process
(see [1], [2] and references therein). A significant computational
saving is obtained by use of various MAM algorithms which exploit
the QBD structure of the process. In such problems, resources can be
allocated in different ways in order to optimize some utility function
or cost/reward. Examples would be minimizing blocking probabilities
or maximizing throughput. Thus there is a notion of the reward of a
particular policy of allocating resources. In this technical note, we are interested in the evaluation of the reward of a specified policy associated with a QBD process, rather than the evaluation of its stationary distribution. Thus our class of models becomes that of Markov decision processes (MDPs), and we are interested in evaluating policies for controlling MDPs with QBD transition probability structure.
In previous work [4], White presented an approach to policy evaluation for QBD MDPs which was based on the MAM technique


known as linear level reduction. This approach has computational complexity of $O(NM^3)$, where $N$ is the number of levels and $M$ is the number of phases in the QBD MDP, and was applicable to the general level dependent case. The purpose of this technical note is to describe a faster numerical procedure for policy evaluation applicable to level independent QBD MDPs. A QBD MDP is level independent if both its transition probabilities and one-stage rewards are independent of level (apart from the boundary levels). The technique described is based on the folding method presented by Ye and Li [5], and has computational complexity of $O(M^3 \log_2 N) + O(M^2 N)$. The first term dominates except when $M$ is small and $N$ is large. There is thus a significant computational saving compared to the linear level reduction case [4]. In linear algebraic terms, linear level reduction corresponds to LU factorization of a general block tridiagonal matrix, whilst the folding method, a form of logarithmic reduction, corresponds to a kind of factorisation applicable to block tridiagonal Toeplitz matrices. Importantly, as argued in the MAM literature, the probabilistic interpretation assists in the proof of the applicability of these factorizations in terms of the existence of certain inverses and other relevant issues; the computational complexity is the same. The other significant contribution of this technical note is that here we consider stochastic shortest path MDPs, rather than uniformly discounted MDPs (as in [4]), although the latter case can also be addressed in the current framework. We note at this point that we do not address general approaches to finding optimal policies for QBD MDPs in this technical note. This problem is more complicated than the general approaches based on value or policy iteration [3], although policy evaluation as presented here would be expected to form part of an appropriate policy iteration method. The general optimization case is a matter for ongoing work.
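To give a rough sense of the scale of this saving (illustrative figures only, not taken from the experiments reported later in this note), consider $N = 1024$ levels and $M = 16$ phases:

$$
NM^3 = 1024 \times 16^3 \approx 4.2 \times 10^6, \qquad M^3 \log_2 N + M^2 N = 16^3 \times 10 + 16^2 \times 1024 \approx 3.0 \times 10^5,
$$

i.e., roughly an order of magnitude fewer operations, with the gap widening as $N$ grows.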
This technical note briefly describes discrete-time, discrete-state
QBD MDP models in Section II. In Section III, our algorithm for
policy evaluation based on the folding method is described. Finally,
in Section IV, we present a queueing example which utilises our new method in the context of ranking a number of control policies in terms of expected reward. The technical note concludes with some suggestions for further research.
II. QUASI BIRTH-DEATH MARKOV DECISION PROCESSES
A discrete time QBD process $X(t)$, $t \geq 0$, is a finite state Markov process defined on a state space labelled (without loss of generality) by the set of all ordered pairs $(n, m)$ for $0 \leq n \leq N-1$, $0 \leq m \leq M-1$, where $P = NM$ denotes the total number of states. In MAM terminology, the set of states corresponding to a given index $n$ is called a level, and the set of states corresponding to a given $m$ is called a phase. The set of states $\mathcal{L}(n) = \{(n, m) : m = 0, \ldots, M-1\}$ is called level $n$. The key property of a QBD that distinguishes it from a general two-dimensional Markov process is that transitions are allowable only within a given level, or between adjacent levels. Thus allowable transitions from a given state $(n, m)$ to state $(j, k)$ are restricted to the cases where $|n - j| \leq 1$. In this technical note we assume that the number of levels is a power of 2, although this assumption can be relaxed as indicated in [5].


In addition, as we will be interested in solving stochastic shortest path (SSP) problems [3], we augment the state space with a unique absorbing state which we shall characterize as level/state $-1$ (with a single phase).

The transition probability matrix for a level independent QBD process $X(t)$, $t \geq 0$, with an absorbing state has the block matrix form (rows and columns ordered by levels $-1, 0, 1, \ldots, N-1$)

$$
\left[\begin{array}{cccccc}
1 & 0 & & & & \\
\delta_0 & D_1 & D_0 & & & \\
\delta & A_2 & A_1 & A_0 & & \\
\vdots & & \ddots & \ddots & \ddots & \\
\delta_{N-1} & & & & C_2 & C_1
\end{array}\right] \qquad (1)
$$

The lower right block of this matrix (call it $Q$) is of size $P \times P$, with each sub-block having size $M \times M$. The diagonal blocks of $Q$ contain the (unnormalized) transition probabilities associated with each level, whilst the off-diagonal blocks contain the transition probabilities between levels. The first column contains the transition probabilities from each level to the absorbing state, which are independent of level apart from level 0 and level $N-1$. At most two of $\delta_0$, $\delta$, $\delta_{N-1}$ may be zero.

In an MDP, the transition probability matrix $Q$ is parametrized by a finite set of control functions $U$. Thus, each block in $Q$ is also a function of $u \in U$. In the sequel, we shall only consider SSP problems, where the process starts in a given state at time $t = 0$ and evolves over $t = 1, 2, \ldots$. We assume there is a unique absorbing state as described above, and that the absorbing state is reached with probability one in a finite time. We will assume that a stationary policy $\mu$ (see [3]) is applied. This means that the mapping (policy) $\mu : X(t) \mapsto u(t)$ is independent of $t$. The reward-to-go from state $i$ under policy $\mu$ is defined by

$$
J^{\mu}(i) = E\left[\, \sum_{t=0}^{\infty} g\bigl(X(t), X(t+1), u(t)\bigr) \,\Big|\, X(0) = i \right] \qquad (2)
$$

where $g(\cdot,\cdot,\cdot) \geq 0$ denotes the reward obtained for a transition from $X(t)$ to $X(t+1)$ under control $u(t) = \mu(X(t))$, and the expectation is with respect to all states evolving under the policy $\mu$. We restrict attention to functions $\mu$ which are level independent. The one-stage reward for self-transitions in the absorbing state is zero, i.e., $g(-1, -1, u) = 0$ for all controls $u$. We also assume that the one-stage rewards are independent of level, i.e., if $X(t) = (n, i)$ and $X(t+1) = (m, j)$ then $g(X(t), X(t+1), u)$ depends only on $i$ and $j$, for all controls $u$.¹ Similar remarks apply to transitions to the absorbing state. A policy is admissible if all resulting controls $u(t) \in U$. In particular, this means in the current problem that all admissible policies give rise to a level independent QBD transition matrix (1). An optimal policy is one which realizes the maximum reward-to-go for each initial state. The determination of an optimal policy for a level independent QBD MDP is not a standard dynamic programming problem, and will not be addressed in this technical note. We propose that the policy evaluation algorithm presented here can find utility in ranking a number of candidate policies which might be selected on a heuristic basis.

¹This assumption can be relaxed at the boundary levels 0 and $N-1$ as in the example of Section IV.

Note that the reward-to-go from the absorbing state is zero, i.e., $J(-1) = 0$ for all admissible policies (including an optimal policy), because the Markov chain always remains in the absorbing state if initially there, and self-transitions from this state yield zero reward. From (2), using the strong Markov property of the process $X(t)$, we can write, for $i = 0, \ldots, P-1$,

$$
J^{\mu}(i) = \sum_{j=-1}^{P-1} \bigl[Q(\mu(i))\bigr]_{i,j} \bigl\{ J^{\mu}(j) + g(i, j, \mu(i)) \bigr\} \qquad (3)
$$

which can be conveniently written in matrix-vector form as follows. Let $g(u) \in \mathbb{R}^{P}$ denote the vector of average rewards out of state $i$ under control $u$, given by

$$
[g(u)]_i = \sum_{j=-1}^{P-1} [Q(u)]_{i,j}\, g(i, j, u);
$$

then (3) becomes

$$
J = g(u) + Q(u)J \;\Longleftrightarrow\; \bigl(I - Q(u)\bigr)J = g(u) \qquad (4)
$$

where, again, the control $u$ is specified by the policy $\mu$ for each component state. Thus the reward-to-go under a specified policy can be evaluated by solving the set of linear equations (4). There is a unique solution to these equations because of the probabilistic assumptions made on the SSP problem, namely that the absorbing state is reachable with probability one from any initial state (see [3]). The solution of (4) is called policy evaluation and is the main computational cost in performing standard policy iteration. For general transition probability matrices $Q$, this requires $O(P^3)$ operations, which can be prohibitively large. However, in the QBD case, we can exploit the special structure of $Q$ to reduce this complexity substantially.
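To make the baseline concrete, here is a minimal sketch (illustrative only: the blocks below are random substochastic placeholders, not the queueing model of Section IV) that assembles the lower right block $Q$ of (1) from hypothetical $M \times M$ sub-blocks and solves $(I - Q)J = g$ densely with NumPy. This is exactly the $O(P^3)$ computation that the folding method of Section III is designed to avoid.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3

# Hypothetical level-independent blocks (placeholders, not the note's model):
# split a random stochastic matrix into [delta | A2 | A1 | A0] so that
# A2 + A1 + A0 is strictly substochastic, the leftover mass being the
# transition probability into the absorbing state.
W = rng.random((M, 3 * M + 1))
W /= W.sum(axis=1, keepdims=True)
delta, A2, A1, A0 = W[:, 0], W[:, 1:M + 1], W[:, M + 1:2 * M + 1], W[:, 2 * M + 1:]
D1, D0 = A1, A0 + A2        # arbitrary placeholder choice for boundary level 0
C1, C2 = A1 + A0, A2        # arbitrary placeholder choice for boundary level N-1

def assemble_Q(D1, D0, A0, A1, A2, C1, C2, N):
    """P x P lower right block of (1): level 0 row (D1, D0), interior rows
    (A2, A1, A0), level N-1 row (C2, C1)."""
    M = A1.shape[0]
    Q = np.zeros((N * M, N * M))
    blk = lambda n, k: (slice(n * M, (n + 1) * M), slice(k * M, (k + 1) * M))
    Q[blk(0, 0)], Q[blk(0, 1)] = D1, D0
    for n in range(1, N - 1):
        Q[blk(n, n - 1)], Q[blk(n, n)], Q[blk(n, n + 1)] = A2, A1, A0
    Q[blk(N - 1, N - 2)], Q[blk(N - 1, N - 1)] = C2, C1
    return Q

Q = assemble_Q(D1, D0, A0, A1, A2, C1, C2, N)
g = rng.random(N * M)                        # average one-stage reward vector of (4)
J = np.linalg.solve(np.eye(N * M) - Q, g)    # dense O(P^3) policy evaluation
```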

III. FOLDING METHOD FOR POLICY EVALUATION

In the sequel, we consider the policy fixed and delete explicit reference to it. Consider the equation $(I - Q)J = \mathbf{g}$, where $J$ represents the reward-to-go vector, and $\mathbf{g} = [\mathbf{g}_0^T \cdots \mathbf{g}_{N-1}^T]^T$ is the vector of average one-stage rewards. Then

$$
(I - D_1)J_0 - D_0 J_1 = \mathbf{g}_0, \qquad
(I - A_1)J_{2n} - A_2 J_{2n-1} - A_0 J_{2n+1} = \mathbf{g}_{2n} \qquad (5)
$$

for $n = 1, \ldots, N/2 - 1$, where $J = [J_0^T \cdots J_{N-1}^T]^T$. So if the odd blocks of $J$ are available we can easily compute the even blocks using (5) via

$$
J_0 = (I - D_1)^{-1}(\mathbf{g}_0 + D_0 J_1), \qquad
J_{2n} = (I - A_1)^{-1}(\mathbf{g}_{2n} + A_0 J_{2n+1} + A_2 J_{2n-1}) \qquad (6)
$$

for $n = 1, \ldots, N/2 - 1$. The inverse of the matrix $I - A_1$ is guaranteed to exist because $A_1$ is substochastic. Similar remarks apply to $I - D_1$. Thus, we need to compute the three matrices $(I - D_1)^{-1}D_0$, $(I - A_1)^{-1}A_0$, and $(I - A_1)^{-1}A_2$, which requires $O(M^3)$ effort, independent of $N$. We also need $N$ matrix-vector multiplications, requiring $O(NM^2)$ effort.
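A minimal sketch of the back-substitution (6), assuming the odd-level blocks are already available (here `J_odd[k]` stands for $J_{2k+1}$ and `g[n]` for the length-$M$ vector $\mathbf{g}_n$); a practical implementation would factor $I - D_1$ and $I - A_1$ once and reuse the factors, as the operation count above assumes.

```python
import numpy as np

def recover_even_levels(J_odd, g, D1, D0, A0, A1, A2):
    """Back-substitution (6): given J_odd[k] = J_{2k+1}, return J_0,...,J_{N-1}."""
    N = 2 * len(J_odd)
    M = A1.shape[0]
    inv_ImD1 = np.linalg.inv(np.eye(M) - D1)   # exists since D1 is substochastic
    inv_ImA1 = np.linalg.inv(np.eye(M) - A1)   # exists since A1 is substochastic
    J = [None] * N
    J[1::2] = J_odd                            # odd levels already known
    J[0] = inv_ImD1 @ (g[0] + D0 @ J[1])
    for n in range(1, N // 2):                 # even levels 2, 4, ..., N-2
        J[2 * n] = inv_ImA1 @ (g[2 * n] + A0 @ J[2 * n + 1] + A2 @ J[2 * n - 1])
    return J
```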
Now, following the folding idea of [5], consider the process $Y_t$ defined to be $X_t$ observed on odd levels, including the absorbing state. This process is also a level independent QBD (having $N/2 + 1$ levels), with transition probability matrix (rows and columns ordered by levels $-1, 0, 1, \ldots, N/2-1$)

$$
\left[\begin{array}{cccccc}
1 & 0 & & & & \\
\tilde{\delta}_0 & E_1 & B_0 & & & \\
\tilde{\delta} & B_2 & B_1 & B_0 & & \\
\vdots & & \ddots & \ddots & \ddots & \\
\tilde{\delta}_{N/2-1} & & & & F_2 & F_1
\end{array}\right] \qquad (7)
$$

where

$$
E_1 = A_1 + A_0(I - A_1)^{-1}A_2 + A_2(I - D_1)^{-1}D_0,
$$
$$
B_0 = A_0(I - A_1)^{-1}A_0, \qquad B_2 = A_2(I - A_1)^{-1}A_2,
$$
$$
B_1 = A_1 + A_0(I - A_1)^{-1}A_2 + A_2(I - A_1)^{-1}A_0,
$$
$$
F_2 = C_2(I - A_1)^{-1}A_2, \qquad F_1 = C_1 + C_2(I - A_1)^{-1}A_0. \qquad (8)
$$
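The block formulas (8) transcribe directly into code. The sketch below makes no attempt at efficiency (in practice $I - A_1$ and $I - D_1$ would be factored once and the factors reused):

```python
import numpy as np

def fold_transition_blocks(D1, D0, A0, A1, A2, C1, C2):
    """One folding step: the transition blocks of the process Y_t, per (8)."""
    M = A1.shape[0]
    R_A = np.linalg.inv(np.eye(M) - A1)        # (I - A1)^{-1}
    R_D = np.linalg.inv(np.eye(M) - D1)        # (I - D1)^{-1}
    E1 = A1 + A0 @ R_A @ A2 + A2 @ R_D @ D0    # level 0 of Y_t
    B0 = A0 @ R_A @ A0                         # up one level of Y_t (two of X_t)
    B1 = A1 + A0 @ R_A @ A2 + A2 @ R_A @ A0    # interior diagonal block
    B2 = A2 @ R_A @ A2                         # down one level of Y_t
    F2 = C2 @ R_A @ A2                         # level N/2 - 1 of Y_t
    F1 = C1 + C2 @ R_A @ A0
    return E1, B0, B1, B2, F1, F2
```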

The transition probabilities for $Y_t$ to the absorbing state are

$$
\tilde{\delta}_0 = \bigl[I + A_0(I - A_1)^{-1}\bigr]\delta + A_2(I - D_1)^{-1}\delta_0,
$$
$$
\tilde{\delta} = \bigl[I + (A_0 + A_2)(I - A_1)^{-1}\bigr]\delta,
$$
$$
\tilde{\delta}_{N/2-1} = \delta_{N-1} + C_2(I - A_1)^{-1}\delta. \qquad (9)
$$
The calculation of these quantities requires $O(M^3)$ effort, independent of $N$. Let the lower right hand block of (7) be denoted by $\tilde{Q}$. We now seek to find average one-stage rewards $\tilde{\mathbf{g}}$ for the process $Y_t$ so that the reward-to-go for $Y_t$ corresponds to those for the odd levels of $X_t$, i.e., if we solve $(I - \tilde{Q})\tilde{J} = \tilde{\mathbf{g}}$, then $\tilde{J}_n = J_{2n+1}$, $n = 0, \ldots, N/2 - 1$. Because of space limitations of this technical note, the proofs are omitted, but they are straightforward using the sample path approach of [4]. Let $0 \leq n \leq N/2 - 2$ and let $X_0 \in \mathcal{L}(2n+1)$. Suppose we move up to level $2n+3$ via level $2n+2$. A typical sample path having $\ell$ stops in $\mathcal{L}(2n+2)$ has the form
$$
(2n+1, i) \rightarrow (2n+2, k_1) \rightarrow \cdots \rightarrow (2n+2, k_\ell) \rightarrow (2n+3, j). \qquad (10)
$$
The corresponding sample path for $Y_t$ is simply the one step from $(n, i)$ to $(n+1, j)$ (assuming we relabel the levels for $Y_t$). The reward associated with this path is

$$
[g_0]_{i,k_1} + \sum_{m=1}^{\ell-1} [g_1]_{k_m,k_{m+1}} + [g_0]_{k_\ell,j} \qquad (11)
$$

where $g_0 \in \mathbb{R}^{M \times M}$ are the one-stage rewards for $X_t$ from $\mathcal{L}(n)$ to $\mathcal{L}(n+1)$, and $g_1 \in \mathbb{R}^{M \times M}$ are the one-stage rewards for $X_t$ from $\mathcal{L}(n)$ to $\mathcal{L}(n)$. This path has probability
$$
[A_0]_{i,k_1} \prod_{m=1}^{\ell-1} [A_1]_{k_m,k_{m+1}}\, [A_0]_{k_\ell,j}. \qquad (12)
$$

Multiplying these two terms together and summing over all the $k_m$ terms yields the average reward of a path from $(2n+1, i)$ to $(2n+3, j)$ having exactly $\ell$ stops in level $2n+2$. Let this quantity be called $h(i, j; \ell)$, which is independent of $n$, and is given by
$$
h(i, j; \ell) = \sum_{k} [g_0]_{i,k}[A_0]_{i,k}\bigl[A_1^{\ell-1}A_0\bigr]_{k,j}
+ \sum_{k} [g_0]_{k,j}\bigl[A_0 A_1^{\ell-1}\bigr]_{i,k}[A_0]_{k,j}
+ \sum_{m=1}^{\ell-1} \sum_{k,r} [g_1]_{k,r}\bigl[A_0 A_1^{m-1}\bigr]_{i,k}[A_1]_{k,r}\bigl[A_1^{\ell-m-1}A_0\bigr]_{r,j}. \qquad (13)
$$

In matrix terms, (13) can be written

$$
h(\ell) = (g_0 \circ A_0)A_1^{\ell-1}A_0 + A_0 A_1^{\ell-1}(g_0 \circ A_0) + \sum_{m=1}^{\ell-1} A_0 A_1^{m-1}(g_1 \circ A_1)A_1^{\ell-m-1}A_0. \qquad (14)
$$

Here $\circ$ denotes the Hadamard (componentwise) product. Using this result, we can then show that the average reward of a transition upwards of the process $Y_t$ from level $n$, $0 \leq n \leq N/2 - 2$, is given by

$$
\tilde{g}_0 = (g_0 \circ A_0)(I - A_1)^{-1}A_0 + A_0(I - A_1)^{-1}(g_0 \circ A_0) + A_0(I - A_1)^{-1}(g_1 \circ A_1)(I - A_1)^{-1}A_0. \qquad (15)
$$
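For completeness, (15) follows from (14) by summing over the number of stops $\ell \geq 1$ and using the matrix geometric series $\sum_{\ell \geq 1} A_1^{\ell-1} = (I - A_1)^{-1}$, which converges because $A_1$ is substochastic; for the double sum, reindexing with $k = \ell - m - 1$ gives

$$
\sum_{\ell \geq 1} \sum_{m=1}^{\ell-1} A_0 A_1^{m-1}(g_1 \circ A_1)A_1^{\ell-m-1}A_0
= A_0 \Bigl(\sum_{m \geq 1} A_1^{m-1}\Bigr)(g_1 \circ A_1)\Bigl(\sum_{k \geq 0} A_1^{k}\Bigr)A_0
= A_0 (I - A_1)^{-1}(g_1 \circ A_1)(I - A_1)^{-1}A_0.
$$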

By a similar process, for $1 \leq n \leq N/2 - 1$, the one-stage rewards for $Y_t$ going from $\mathcal{L}(n)$ to $\mathcal{L}(n-1)$ are

$$
\tilde{g}_2 = (g_2 \circ A_2)(I - A_1)^{-1}A_2 + A_2(I - A_1)^{-1}(g_2 \circ A_2) + A_2(I - A_1)^{-1}(g_1 \circ A_1)(I - A_1)^{-1}A_2 \qquad (16)
$$

where $g_2 \in \mathbb{R}^{M \times M}$ are the one-stage rewards for $X_t$ from $\mathcal{L}(n)$ to $\mathcal{L}(n-1)$.
For $1 \leq n \leq N/2 - 2$, the one-stage rewards for $Y_t$ remaining in $\mathcal{L}(n)$ are

$$
\tilde{g}_1 = (g_1 \circ A_1) + (g_0 \circ A_0)(I - A_1)^{-1}A_2 + A_0(I - A_1)^{-1}(g_1 \circ A_1)(I - A_1)^{-1}A_2 + A_0(I - A_1)^{-1}(g_2 \circ A_2) + (g_2 \circ A_2)(I - A_1)^{-1}A_0 + A_2(I - A_1)^{-1}(g_1 \circ A_1)(I - A_1)^{-1}A_0 + A_2(I - A_1)^{-1}(g_0 \circ A_0). \qquad (17)
$$

For $1 \leq n \leq N/2 - 2$, the one-stage rewards for $Y_t$ from level $n$ to the absorbing state are given by

$$
\tilde{g}_{-1} = (\delta \circ g_{-1}) + (g_0 \circ A_0)(I - A_1)^{-1}\delta + A_0(I - A_1)^{-1}(\delta \circ g_{-1}) + A_0(I - A_1)^{-1}(g_1 \circ A_1)(I - A_1)^{-1}\delta + (g_2 \circ A_2)(I - A_1)^{-1}\delta + A_2(I - A_1)^{-1}(\delta \circ g_{-1}) + A_2(I - A_1)^{-1}(g_1 \circ A_1)(I - A_1)^{-1}\delta \qquad (18)
$$

where $g_{-1}$ are the one-stage rewards to the absorbing state.


The boundary cases for $\tilde{g}_1$ and $\tilde{g}_{-1}$ for levels 0 and $N/2 - 1$ are treated separately using the same approach with the appropriate modifications. Space limitations preclude the inclusion of details here.

The average one-stage rewards $\tilde{\mathbf{g}}_n$ for $Y_t$ going from $\mathcal{L}(n)$, $n = 1, \ldots, N/2 - 2$, are thus obtained via

$$
\tilde{\mathbf{g}}_n = (\tilde{g}_0 + \tilde{g}_1 + \tilde{g}_2)\mathbf{1} + \tilde{g}_{-1} \qquad (19)
$$

with appropriate modifications for levels 0 and $N/2 - 1$. Here $\mathbf{1}$ denotes a vector of all ones.
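Collecting (15)–(19) for the interior levels into a single sketch (the boundary levels 0 and $N/2 - 1$ require the modifications alluded to above and are not handled here; the Hadamard product $\circ$ is elementwise `*` on NumPy arrays):

```python
import numpy as np

def fold_rewards_interior(g0, g1, g2, gm1, delta, A0, A1, A2):
    """Interior-level reward folding per (15)-(18) and the row-sum step (19).
    g0, g1, g2 are the M x M one-stage reward matrices of X_t (up / same level /
    down), gm1 the length-M rewards into the absorbing state, delta the length-M
    absorbing probabilities."""
    M = A1.shape[0]
    R = np.linalg.inv(np.eye(M) - A1)          # (I - A1)^{-1}
    gA0, gA1, gA2 = g0 * A0, g1 * A1, g2 * A2  # Hadamard products
    gt0 = gA0 @ R @ A0 + A0 @ R @ gA0 + A0 @ R @ gA1 @ R @ A0             # (15)
    gt2 = gA2 @ R @ A2 + A2 @ R @ gA2 + A2 @ R @ gA1 @ R @ A2             # (16)
    gt1 = (gA1 + gA0 @ R @ A2 + A0 @ R @ gA1 @ R @ A2 + A0 @ R @ gA2      # (17)
           + gA2 @ R @ A0 + A2 @ R @ gA1 @ R @ A0 + A2 @ R @ gA0)
    gd = delta * gm1                           # rewards paired with absorption
    gtm1 = (gd + gA0 @ R @ delta + A0 @ R @ gd + A0 @ R @ gA1 @ R @ delta  # (18)
            + gA2 @ R @ delta + A2 @ R @ gd + A2 @ R @ gA1 @ R @ delta)
    g_level = (gt0 + gt1 + gt2) @ np.ones(M) + gtm1                        # (19)
    return gt0, gt1, gt2, gtm1, g_level
```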
We repeat the above reduction process, by considering a new Markov process which is $Y_t$ observed on its odd levels, and determine its transition probabilities and average one-stage rewards using the results above. At the $k$-th step, $k = 0, \ldots, \log_2 N$, the process has $n = N/2^k$ levels, which correspond to levels $j2^k - 1$, $j = 1, \ldots, n$ of the original process $X_t$. The computational overhead of each step is $O(M^3)$, independent of the number of levels $n$. At the final step, we will have a process with only one level, corresponding to level $N-1$ of $X_t$. We can then evaluate $J_{N-1}$ with $O(M)$ cost. As pointed out in [5], it is important that this process is performed with numerical accuracy in mind, since errors made in determining $J_{N-1}$ will propagate backwards as the remaining reward-to-go blocks are evaluated. We can then determine the corresponding even level reward-to-go, $J_{N/2-1}$. We then recurse backwards evaluating the even levels at each step, until finally we return to step 0, and then have the complete reward-to-go vector. The overall computational requirement is $O(M^3 \log_2 N)$ for the forward reduction process and $O(M^2 N)$ for the backwards process, as in [5]. Ye and Li [5] also provide considerable discussion regarding computational and memory requirements for their folding algorithm, much of which is directly relevant (with appropriate modifications) to the policy evaluation method presented above. Readers are referred to [5] for these details.
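Since $Y_t$ is $X_t$ watched on the odd levels together with the absorbing state, one folding step is, in linear algebraic terms, a censoring (Schur complement) of the even levels. This suggests a convenient sanity check for an implementation of (8) and (9): compare the folded blocks against direct censoring of a small instance. A self-contained sketch with random placeholder blocks (not the model of Section IV) follows.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 8, 3                                    # N a power of 2, as assumed

# Level-independent substochastic placeholder blocks: split a random stochastic
# matrix into [delta | A2 | A1 | A0]; the leftover column is absorption mass.
W = rng.random((M, 3 * M + 1))
W /= W.sum(axis=1, keepdims=True)
delta, A2, A1, A0 = W[:, 0], W[:, 1:M + 1], W[:, M + 1:2 * M + 1], W[:, 2 * M + 1:]
D1, D0, delta0 = A1, A0 + A2, delta            # arbitrary boundary choices for
C1, C2, deltaN1 = A1 + A0, A2, delta           # the placeholder model

# Full transition matrix over {absorbing state} + levels 0..N-1, as in (1).
P = np.zeros((1 + N * M, 1 + N * M))
P[0, 0] = 1.0
P[1:, 0] = np.concatenate([delta0, np.tile(delta, N - 2), deltaN1])
blk = lambda n, k: (slice(1 + n * M, 1 + (n + 1) * M), slice(1 + k * M, 1 + (k + 1) * M))
P[blk(0, 0)], P[blk(0, 1)] = D1, D0
for n in range(1, N - 1):
    P[blk(n, n - 1)], P[blk(n, n)], P[blk(n, n + 1)] = A2, A1, A0
P[blk(N - 1, N - 2)], P[blk(N - 1, N - 1)] = C2, C1

# Censor the chain on S = {absorbing} + odd levels: P_SS + P_SE (I - P_EE)^{-1} P_ES.
odd = np.concatenate([[0]] + [1 + n * M + np.arange(M) for n in range(1, N, 2)])
even = np.concatenate([1 + n * M + np.arange(M) for n in range(0, N, 2)])
P_cens = P[np.ix_(odd, odd)] + P[np.ix_(odd, even)] @ np.linalg.solve(
    np.eye(len(even)) - P[np.ix_(even, even)], P[np.ix_(even, odd)])

# The same interior diagonal block via (8).
R = np.linalg.inv(np.eye(M) - A1)
B1 = A1 + A0 @ R @ A2 + A2 @ R @ A0
print(np.allclose(P_cens[1 + M:1 + 2 * M, 1 + M:1 + 2 * M], B1))   # expect True
```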


IV. NUMERICAL EXAMPLE


In this section, we present an example of an MDP problem to illustrate the application of the method of policy evaluation presented in
this technical note. The problem addressed is the scheduling of service
in a generalisation of the two class priority queueing system such as
considered in [7], to which the reader is referred for more details. We
firstly compare the execution time of evaluating a specified scheduling
policy using the linear level reduction (LLR) method of [4],² and the
method of this technical note. We then show how the method can be
used to rank a number of candidate scheduling policies.
In this example, there are two priority classes of traffic. The high
priority class is delay-sensitive, but loss-tolerant (e.g., voice traffic)
whilst the low priority class is delay-tolerant, but loss-sensitive (e.g.,
data traffic). In a switch or router, a buffer (queue) is provided for each
class of packets. There is a single server, with adequate capacity to handle total offered traffic, allocated to serve each queue with a specified
probability. This probability is, in general, dependent on the number of
packets waiting in the system. We assume that there is a requirement
to buffer a significantly larger number of data packets than voice
packets. So we identify the levels of a QBD process with the number
of data packets in the system (since this is the quantity to which the
logarithmic reduction applies), and the phases as the number of voice
packets in the system. The maximum buffer sizes for the data and voice
packets are N and M respectively. Because loss of data packets is important, the switch needs to signal higher levels of the communication
system that data buffer overflow has occurred. We model this in the
context of a stochastic shortest path problem where the absorbing state
corresponds to overflow of the data buffer. When the absorbing state is
entered, the switch sends a message to higher levels of the system, and
subsequent appropriate action is taken which does not concern us here.
A reward is accumulated as each packet is switched (served). The
QBD structure allows these rewards to be functions of the number of
voice packets (phases) in the system, so we reward service of the voice
queue more highly when it has a larger number of packets present to
mitigate against delay. These rewards are independent of the number
of data packets (i.e., level independent) apart from the boundary condition where the data queue is full. We then allocate a significantly larger
reward to mitigate against loss of data packets due to overflow. The
control variables are the probabilities pm that we serve the low priority
(data) queue given that the voice queue has m 1 packets present, and
the data queue is not full. In this case, the high priority (voice) queue
is served with probability 1 pm unless it is empty. When the data
queue is full, we use a different set of server scheduling probabilities,
qm defined as a function of m, the number of voice packets as for pm .
This is to allow a higher level of service for the data queue when it is
full because the cost of overflow in this case is large. Thus admissible
policies are those which yield the QBD transition structure (1) with
blocks dependent on the pm and qm terms. We compared the execution
times of policy evaluation between folding algorithm method and
linear level reduction method as a function of number of levels N ,
with number of phases fixed at M = 8 averaging over 200 000 independent trials. From Fig. 1, we observe that the time required to evaluate a policy with folding algorithm is considerably faster than that of
the LLR method.
Now we turn to the application of the policy evaluation algorithm to
the ranking of a number of candidate resource allocation schemes. As
mentioned in the introduction, we do not consider the full optimization
problem in this technical note because the level independent case is not
a standard DP problem, and is the subject of ongoing work. Here, for
²It should be noted that LLR allows the QBD MDP to be level dependent, so is more general.

Fig. 1. Execution times of policy evaluation as a function of the number of levels.
TABLE I
AVERAGE REWARD-TO-GO ('000S) FOR QUEUE SERVICE POLICIES WITH VARIOUS VALUES OF p AND q

reasons of brevity, we consider the familiar case (e.g., [7]) where the
server is allocated to either queue with probability taking a value in a
finite set, and that probability is not dependent on the queue occupancy
(apart from when either of the queues is empty). Thus $p_m$ and $q_m$ equal constant values $p$ and $q$ respectively. We show the reward-to-go
from the initial state of both queues being empty, until the data queue
overflows (absorbing state). As detailed above, the service probability
can be different in the case when the data queue is full. In our example,
we took the data queue arrival rate to be 0.8 packets per time unit, and
the voice queue arrival rate to be 0.1 packets per time unit. The service
rate is 1 packet per time unit. The data packet buffer is 32 and the voice
buffer is 5. A reward of m units is received for serving the voice queue,
when it contains m packets, and a reward of 0.5 units is received for
serving the data queue unless it is full when a reward of 5 units is
received. Table I shows the reward-to-go for various values of p and q
(each pair (p, q) is one of the policies being ranked). This problem has
a deterministic solution which allocates all service capacity to the data
queue when it is non-empty. This result is in contrast to the so-called
$c\mu$-rule [7], which considers the infinite-horizon problem, and where
all capacity would be allocated to the voice queue unless the data queue
was full. The effect of having to reset the system due to data
queue overflow and the high reward obtained when serving a full data
queue changes the service policy significantly compared to [7].
V. CONCLUSION
We have presented a new algorithm for policy evaluation for stochastic shortest path MDPs with Quasi-Birth Death structure. The algorithm has computational complexity of $O(M^3 \log_2 N) + O(M^2 N)$, where the MDP has $N$ levels and $M$ phases, as compared to $O(NM^3)$
for the linear level reduction based method presented in [4]. This
new method represents the policy evaluation analogue to the folding
method for evaluation of the stationary probabilities of a Markov chain
presented by Ye and Li [5]. A simple example involving the capacity
scheduling of a two class priority queueing system is presented to
illustrate the applicability of the method. There are several possible
extensions of the work presented here. Clearly, the issue of optimisation via a suitable DP methodology is a subject of current research by
the authors. Also, one can apply the ideas of [5] to piecewise level independent MDPs with a commensurate increase in computational and
memory requirements. One could also address uniformly discounted
and average reward-per-stage QBD MDPs using our approach. Further
investigation of the priority queue example is also ongoing work.
ACKNOWLEDGMENT
The authors would like to thank Professor Peter G. Taylor and the paper's referees for helpful comments.

REFERENCES

[1] M. Neuts, Matrix-Geometric Solutions in Stochastic Models. Baltimore, MD, USA: Johns Hopkins Univ. Press, 1981.
[2] G. Latouche and V. Ramaswami, Introduction to Matrix Analytic Methods in Stochastic Modeling. Philadelphia, PA, USA: ASA-SIAM, 1999.
[3] D. P. Bertsekas, Dynamic Programming and Optimal Control, 2nd ed. Nashua, NH: Athena Scientific, 2001.
[4] L. B. White, "A new algorithm for policy evaluation for Markov decision processes having quasi-birth death structure," Stochastic Models, vol. 21, no. 2–3, pp. 785–797, 2005.
[5] J. Ye and S.-Q. Li, "Folding algorithm: A computational method for finite QBD processes with level-dependent transitions," IEEE Trans. Commun., vol. 42, no. 2/3/4, pp. 625–639, 1994.
[6] J. Lambert, B. Van Houdt, and C. Blondia, "A policy iteration algorithm for Markov decision processes skip-free in one direction," in ValueTools, ICST, Belgium, 2007, pp. 75:1–75:9.
[7] J.-B. Suk and C. G. Cassandras, "Optimal scheduling of two competing queues with blocking," IEEE Trans. Autom. Control, vol. 36, no. 9, pp. 1086–1091, Sep. 1991.
