
Optimal Energy-Delay Tradeoff for Opportunistic Spectrum Access in Cognitive Radio Networks
Oussama Habachi, Yezekael Hayel and Rachid El-azouzi
CERI/LIA, University of Avignon, France

Abstract
Cognitive radio (CR) is considered a promising technology for enhancing spectrum efficiency
via opportunistic transmission at the link level. Basic CR features allow secondary users to transmit only
when the licensed channel is not occupied by primary users. However, waiting for an idle time slot may
incur large packet delay and high energy consumption. We therefore consider an Opportunistic Spectrum
Access (OSA) mechanism that takes packet delay and energy consumption into account. We formulate
the OSA problem as a Partially Observable Markov Decision Process (POMDP) that explicitly considers
the energy constraint as well as the delay constraint, which are often ignored in existing OSA solutions.
Specifically, we consider a POMDP with an average reward criterion. We further assume that the
secondary user may decide, at any moment, to use another dedicated means of communication (3G) in order
to transmit its packets. We derive structural properties of the value function and show that optimal
strategies exist within the class of threshold strategies. For implementation purposes, we propose
online learning mechanisms that estimate the statistics of the primary user activity. Numerical
illustrations validate our theoretical findings: the optimal policy is shown to have a threshold
structure. We also present numerical illustrations of the convergence of the proposed algorithms for
estimating primary user activity.
Index Terms
POMDP, Cognitive Radio Networks, QoS.

I. INTRODUCTION
Access to the frequency spectrum is defined by licenses assigned to primary users. The latter must
conform to the specifications described in the license (e.g. location of the base station, frequency and
January 16, 2012

DRAFT

the maximum transmission power). Nonetheless, a study by the Federal Communications
Commission (FCC) has shown that some frequency bands are underused by licensed users at
particular times and in specific locations [1].
Cognitive radio, a new paradigm for designing wireless communication systems, has emerged
to improve the utilization of the radio frequency spectrum. It is considered
the key technology that enables secondary users to access the licensed spectrum. A cognitive user,
as defined in [2], is a mobile device able to adapt its transmission parameters (e.g. frequency
and modulation) to the wireless environment and to support different communication standards (e.g. GSM,
CDMA, WiMAX and WiFi). Moreover, when there is no opportunity to transmit over the licensed
channels, secondary users may be able to transmit on dedicated channels, generally
at a higher cost and/or a lower throughput than over licensed channels. The possibility of
having dedicated channels reserved for secondary mobiles has been proposed in [3], [4] and [5]. These CR
architectures are described in [6], where the authors also present the network components, the spectrum
and network heterogeneity, and the spectrum management framework. In this paper, we focus on a CR
network where a secondary user communicates with other secondary users through an ad-hoc connection
using a spectrum hole of a licensed frequency (see Figure 1). A secondary user can be considered as a
pair of transmitter-receiver nodes. We assume that there are no interactions with other secondary users.
This model also suits the scenario depicted in Figure 2, where the secondary user is a cognitive
radio base station that senses the activity of a primary base station and exploits
spectrum holes to transmit on the downlink. Our main contribution is to consider, in this cognitive
radio setting, an optimal opportunistic spectrum access (OSA) mechanism that takes energy
and delay constraints into account.

Many works have studied optimal sensing and access policies in
cognitive radio networks (see [7], [8] and [9]). All these works focus on either spectrum sensing
or dynamic spectrum sharing. In [10], the authors study an OSA problem with an energy constraint.
They formulate their problem as a POMDP and derive some properties of the optimal
sensing control policies. Their control parameter is the sensing duration used by a secondary user at
each time slot to determine the primary user activity. They provide heuristic control policies based
on grid-based approximation, myopic policies and static policies, which have low complexity but
are suboptimal. Finally, they compare their heuristic methods with optimal solutions obtained
using a POMDP solver. The authors of [11] incorporate the energy constraint in the design of the optimal
sensing and access policy in a cognitive radio network. They also formulate the problem as a POMDP,
but with a finite horizon, and establish a threshold structure of the optimal policy for the single channel
model. However, they do not provide an analytical expression of the optimal control policy. It is noteworthy
that the impact of the energy constraint, and the capacity of cognitive radio to support additional
Quality-of-Service (QoS) requirements such as the expected delay, have been somewhat overlooked in the
literature. Yet it is very important for today's multimedia applications on wireless networks to provide
reliable communication while sustaining a certain level of QoS. Taking into account the delay constraint
as well as the energy constraint significantly complicates the optimization problem. Without the delay
constraint, the secondary user only seeks the best tradeoff between trying to access the licensed channel
and sleeping to conserve energy. With it, the tradeoff involves several conflicting objectives:
gaining immediate access, gaining spectrum occupancy information, conserving energy and minimizing
packet delay. The goal of our paper is to study this energy-QoS tradeoff in order to determine an
optimal OSA mechanism for secondary users in a cognitive radio network. The major contributions of
our work are:

The problem is formulated as an infinite horizon POMDP with the average criterion. The average
criterion is better suited than the discounted or the total criterion because the secondary user makes
decisions frequently.

In order to gain insight into the energy-delay constrained OSA problem, we derive structural
properties of the value function. We show that the value function is increasing in the
belief and decreasing in the packet delay. These structural results not only give us the fundamental
design thresholds but also reduce the computational complexity of finding the optimal policies.

We show that the secondary user can maximize its average reward by adopting a simple threshold
policy, and we derive closed-form expressions for these thresholds.

Since the secondary user may use a dedicated channel for its packets, the optimal threshold policy
guarantees a bounded delay.

The paper is organized as follows. In the next section, we describe the primary and the
secondary user models. Section III presents our Markov decision process framework. In Section IV, we
study the existence of an optimal threshold policy for our opportunistic spectrum access with an
energy-QoS tradeoff. We propose two learning-based protocols for estimating the state transition rates in
Section V. Before concluding the paper and giving some perspectives, we present, in Section VI, some
numerical illustrations.
II. COGNITIVE RADIO NETWORK MODEL
We consider a wireless system with $N$ independent channels licensed to primary users. The state of
each channel $n \in \{1, \dots, N\}$ is modeled by a time-homogeneous discrete Markov process $s_n(t)$. The
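state space is $\{0, 1\}$, where $s_n(t) = 0$ means that channel $n$ is free for secondary access and $s_n(t) = 1$
means that channel $n$ is occupied by a primary user. The transition probabilities of channel $n$ are
given by the following matrix:

\[
P_n = \begin{pmatrix} \alpha_n & 1 - \alpha_n \\ \beta_n & 1 - \beta_n \end{pmatrix},
\]

where $\alpha_n$ (resp. $\beta_n$) is the probability that channel $n$ is idle in the next slot given that it is
idle (resp. occupied) in the current slot. The transition rates evolve as illustrated in Figure 3.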


The global system state, composed of the $N$ channels, is denoted by the vector $s(t) = [s_1(t), \dots, s_N(t)]$
and the global state space is $\mathcal{S} = \{0, 1\}^N$. The transition probabilities can be determined from the statistics
of the primary network traffic and are assumed to be known by secondary users. We present in Section
V some methods allowing the secondary user to estimate these transition probabilities on the fly.
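As a rough illustration of the channel model above, the following Python sketch simulates one channel as a two-state Markov chain and compares the empirical idle fraction with the stationary idle probability of such a chain. The names `alpha` (probability of staying idle) and `beta` (probability of moving from occupied to idle) reflect our reading of the transition matrix; the function names, the seed and the numerical values are illustrative assumptions, not part of the model.

```python
import random


def simulate_channel(alpha, beta, n_slots, seed=0):
    """Simulate one channel; state 0 = idle, state 1 = occupied.

    alpha: assumed Pr(idle at t+1 | idle at t)
    beta:  assumed Pr(idle at t+1 | occupied at t)
    """
    rng = random.Random(seed)
    state = 1  # arbitrary initial state (occupied)
    states = []
    for _ in range(n_slots):
        p_idle = alpha if state == 0 else beta
        state = 0 if rng.random() < p_idle else 1
        states.append(state)
    return states


def stationary_idle_probability(alpha, beta):
    """Stationary probability of state 0 for the two-state chain."""
    return beta / (1.0 + beta - alpha)


# Illustrative values: a channel that is occupied most of the time.
states = simulate_channel(alpha=0.15, beta=0.1, n_slots=50_000)
empirical_idle = states.count(0) / len(states)
theoretical_idle = stationary_idle_probability(0.15, 0.1)
```

With these values the channel is idle roughly one slot in ten, which is the kind of regime in which waiting for an idle slot becomes costly.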
We consider a secondary user that can access any one of the $N$ licensed channels.
The objective of the secondary user is to detect the channels that are free during a given time slot.
However, waiting for an idle time slot may incur large packet delay and high energy consumption due
to sensing. To overcome this, we consider an OSA mechanism that takes packet delay, throughput
and energy consumption into account. Since today's wireless networks are highly heterogeneous, with mobile
devices equipped with multiple wireless network interfaces, we assume that at any time the secondary user has
access to the network through another technology such as 3G. This is typically the case with the 802.22
standard, in which secondary users transmit over the TV bands [12]. The secondary user prefers to
transmit its packet on a licensed channel because it is cheaper than a dedicated communication, whereas the
dedicated channel guarantees access.
The goal of each secondary user is to minimize the expected delay of its packets, accounting for
energy, throughput and monetary costs. To achieve this goal, a secondary user has to choose at
each time slot one of the following actions:

to be inactive during the slot,

to sense a primary channel and to transmit if the channel is available during the time slot, otherwise to wait
for the next time slot,

or to sense a primary channel and to transmit if the channel is available during the time slot, otherwise to
use the dedicated channel.

An important contribution of this work is to account for the average transmission delay of a packet in the
optimal decision. Sensing a primary channel also has a cost for the secondary user. We look for an optimal
sensing policy, which depends on the history of observations and actions.

III. PARTIAL OBSERVATION MARKOV DECISION PROCESS FRAMEWORK


Due to partial spectrum sensing, the global system state $s(t)$ cannot be directly observed by a secondary
user. To overcome this difficulty, the secondary user infers the global system state from observations
that can be summarized in a belief vector $\lambda(t) = \{\lambda_1(t), \dots, \lambda_{2^N}(t)\}$, where $\lambda_j(t)$ is the conditional
probability (given the observation history and the decisions) that the system state is $s(t) = j$ in slot $t$. Since
the $N$ channels are independent, it has been proved in [13] that we can consider the following simpler
belief vector:
\[
\vec{\lambda}(t) = [\lambda_1(t), \dots, \lambda_N(t)],
\]
where $\lambda_i(t)$ is the conditional probability that channel $i$ is available in slot $t$. Hence, we study the
OSA problem for a secondary user as a POMDP.

A. Description of the POMDP


1) State: The state of the system at time slot $t$ is given by $(\vec{\lambda}(t), l(t))$, where $l(t)$ is the delay of the
packet held by the secondary user at time $t$. The delay of a new packet equals one and increases by one
every time slot, except when the secondary user transmits the packet.
2) Action: For each time slot $t$ and each state $(\vec{\lambda}(t), l(t))$, the three possible actions are:
\[
a(t) = \begin{cases}
0, & \text{to be inactive,} \\
1, & \text{to sense and to transmit only if the channel is available during the time slot,} \\
2, & \text{to sense and to transmit if the channel is available during the time slot,} \\
   & \text{else to transmit through the dedicated channel.}
\end{cases}
\]


3) Observation and belief: When the secondary user decides to sense (i.e. to take action $a(t) \in \{1, 2\}$),
one channel $n^*(t)$ is determined and the secondary user observes the channel occupancy state $s_{n^*(t)}(t) \in \{0, 1\}$.
Let $\Theta(t)$ be the observation outcome at time $t$, where $\Theta(t) = 0$ if the sensed channel is idle and
$\Theta(t) = 1$ otherwise. The user updates the belief vector $\vec{\lambda}(t)$ after the observation outcome. For each
channel $n$, the conditional probability $\lambda_n(t+1) := \Pr(s_n(t+1) = 0 \mid a(t), \Theta(t))$ is therefore defined as follows:
\[
\lambda_n(t+1) = \begin{cases}
\beta_n + (\alpha_n - \beta_n)\,\lambda_n(t) & \text{if } a(t) = 0 \text{ or } n \neq n^*(t), \\
\alpha_n & \text{if } a(t) \neq 0,\ \Theta(t) = 0 \text{ and } n = n^*(t), \\
\beta_n & \text{if } a(t) \neq 0,\ \Theta(t) = 1 \text{ and } n = n^*(t).
\end{cases}
\tag{1}
\]

Note that our model can easily be extended to sense a subset of the primary channels rather than a single
channel.
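The belief update of Eq. (1) can be sketched in a few lines of Python. This is a minimal sketch under our reading of the update rule: an unsensed channel drifts according to its transition probabilities, while the sensed channel resets to `alpha[n]` or `beta[n]` depending on the observation. The function name, argument names and numerical values are assumptions made for illustration.

```python
def update_belief(beliefs, alpha, beta, action, sensed=None, observation=None):
    """Return the belief vector for slot t+1, following Eq. (1).

    beliefs[n]  : probability that channel n is idle in slot t
    alpha[n]    : assumed Pr(idle at t+1 | idle at t)
    beta[n]     : assumed Pr(idle at t+1 | occupied at t)
    action      : 0 (inactive), 1 or 2 (sense)
    sensed      : index of the sensed channel (required if action != 0)
    observation : 0 if the sensed channel was idle, 1 otherwise
    """
    if action != 0 and (sensed is None or observation not in (0, 1)):
        raise ValueError("a sensing action needs a channel and an observation")
    new_beliefs = []
    for n, lam in enumerate(beliefs):
        if action == 0 or n != sensed:
            # No information on channel n: propagate the belief one step.
            new_beliefs.append(beta[n] + (alpha[n] - beta[n]) * lam)
        elif observation == 0:
            new_beliefs.append(alpha[n])  # sensed idle
        else:
            new_beliefs.append(beta[n])   # sensed occupied
    return new_beliefs


# Two symmetric channels; channel 0 is sensed and found occupied.
alpha = [0.85, 0.85]
beta = [0.7, 0.7]
after_sensing = update_belief([0.5, 0.5], alpha, beta, action=1,
                              sensed=0, observation=1)
after_waiting = update_belief([0.5, 0.5], alpha, beta, action=0)
```

In the example, the sensed channel takes the belief `beta[0]`, while the unsensed channel moves from 0.5 to 0.775.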
4) Channel choice policy: At each time slot $t$, based on its belief vector $\vec{\lambda}(t)$, the secondary user
chooses a channel $n^*(t) \in \{1, \dots, N\}$ to be sensed. Several channel choice policies exist in the literature,
such as deterministic, randomized and periodic policies (see [1]). An example of a channel choice policy is to sense
the channel that has the highest probability of being idle, i.e. $n^*(t) := \arg\max_n \lambda_n(t)$.
5) Policies: The strategy of the secondary user is defined by the probability of choosing a given action
depending on the system state. We define a sensing and access policy $\pi$ as a vector $[\pi_1, \pi_2, \dots]$, where $\pi_t$
is a mapping from a state $(\vec{\lambda}(t), l(t))$ to an action $a(t)$. The set of policies is denoted by $\Pi$. A stationary
policy is a mapping that specifies for each state, independently of the time slot $t$, an action to be chosen.
We show below that our POMDP problem has an optimal stationary policy, which allows
us to restrict our problem to stationary policies.
6) Reward and costs:

Reward: Let $\phi$ be the reward representing the number of delivered bits when the secondary user
transmits its packet.

Costs: Let $c_s$ be the energy cost function for sensing a primary channel, measured in monetary
units. This function depends on the action $a(t)$ as:
\[
c_s(a(t)) = \begin{cases}
c_s, & \text{if } a(t) > 0, \\
0, & \text{if } a(t) = 0.
\end{cases}
\]
The primary user and the service provider for the dedicated access charge a price for each packet
transmitted. These prices are, respectively, $P_p$ for a transmission over a primary channel and $P_{3G}$ for
a transmission over the dedicated channel.
Hence, when the secondary user successfully transmits a packet, it gets the reward $z_t(a(t), \Theta(t))$,
which depends on the action $a(t)$ and the observation $\Theta(t)$:
\[
z_t(a(t), \Theta(t)) = \begin{cases}
0, & \text{if } a(t) = 0, \\
\phi - P_p, & \text{if } a(t) \geq 1 \text{ and } \Theta(t) = 0, \\
\phi - P_{3G}, & \text{if } a(t) = 2 \text{ and } \Theta(t) = 1.
\end{cases}
\]
In order to model the impact of the delay, we introduce an additional cost when a packet is not
transmitted. This cost depends on the current delay $l$ of the packet and is defined by the function
$f(l)$. This function is assumed to be increasing in $l$, so that the incentive to transmit
the packet grows the longer it has been delayed.

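Instantaneous reward: At time slot $t$, the instantaneous reward $r_t$ of a secondary user depends on
the system state $(\vec{\lambda}(t), l(t))$ and the action $a(t)$, and is expressed by:
\[
r_t((\vec{\lambda}(t), l(t)), a(t)) = z_t(a(t), \Theta(t)) - f(l(t)) - c_s(a(t)),
\]
where the delay cost $f(l(t))$ is incurred only in slots in which the packet is not transmitted.

The problem faced by the secondary user consists of finding the sensing policy $\pi$ that maximizes its
expected average reward, defined by:
\[
R(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left( \sum_{t=1}^{T} r_t((\vec{\lambda}(t), l(t)), a(t)) \,\middle|\, \vec{\lambda}(0) \right),
\]
where $\vec{\lambda}(0)$ is the initial belief vector. Our objective is then to find an optimal sensing policy $\pi^*$ that
maximizes the average reward $R(\pi)$, i.e.:
\[
\pi^* = \arg\max_{\pi \in \Pi}\ \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left( \sum_{t=1}^{T} r_t((\vec{\lambda}(t), l(t)), a(t)) \,\middle|\, \vec{\lambda}(0) \right).
\tag{2}
\]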
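The reward components defined above can be sketched in Python as follows. The sketch assumes that the delay cost is charged only in slots where the packet stays in the buffer, and it uses a hypothetical linear delay cost; the default prices are the values used later in the numerical section, and all function names are illustrative.

```python
def transmitted(action, observation):
    """True if the packet leaves the buffer in this slot."""
    on_licensed = action in (1, 2) and observation == 0
    on_dedicated = action == 2 and observation == 1
    return on_licensed or on_dedicated


def transmission_reward(action, observation, phi, p_primary, p_3g):
    """Net reward z_t of a transmission (0 if nothing is sent)."""
    if action in (1, 2) and observation == 0:
        return phi - p_primary  # sent over the licensed channel
    if action == 2 and observation == 1:
        return phi - p_3g       # sent over the dedicated channel
    return 0.0


def instantaneous_reward(action, observation, delay, f,
                         phi=35.0, p_primary=10.0, p_3g=80.0, c_s=5.0):
    """Reward of one slot: transmission reward minus delay and sensing costs."""
    z = transmission_reward(action, observation, phi, p_primary, p_3g)
    delay_cost = 0.0 if transmitted(action, observation) else f(delay)
    sensing_cost = c_s if action > 0 else 0.0
    return z - delay_cost - sensing_cost


def average_reward(rewards):
    """Empirical counterpart of the average criterion over a finite trace."""
    return sum(rewards) / len(rewards)


def linear_delay_cost(delay):
    """Hypothetical increasing delay cost f(l) = l."""
    return float(delay)


# A short hand-written trace of (action, observation, delay) triples.
trace = [(0, None, 1), (1, 1, 2), (2, 1, 3), (1, 0, 1)]
rewards = [instantaneous_reward(a, obs, l, linear_delay_cost)
           for a, obs, l in trace]
mean_reward = average_reward(rewards)
```

The trace covers the four possible outcomes of a slot: staying inactive, sensing a busy channel and waiting, falling back to the dedicated channel, and transmitting over the licensed channel.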

In some particular MDP and POMDP problems, an optimal policy can be found in the
smaller set of stationary policies. We prove in the following proposition that there exists an
average optimal stationary policy for our POMDP problem.
Proposition 1: There exists an average optimal stationary policy for our POMDP formulation described
in (2).
Proof: See Appendix A.
Given this result, we can restrict our problem to the set $\Pi_S$ of stationary policies. For the
remainder of this paper, we therefore omit the time index $t$ and look for an optimal sensing policy that
maps a system state $(\vec{\lambda}, l)$ to an action $a$, independently of the time slot $t$. We now carry out a
first analysis of the value function of the POMDP.
We denote by $\Lambda_{ns}(\vec{\lambda} \mid \Theta)$ the function that updates the belief vector $\vec{\lambda}$ when the user chooses to be
inactive in the current slot, i.e. the secondary user takes action 0. The function $\Lambda_s(\vec{\lambda} \mid \Theta)$ updates the belief
vector $\vec{\lambda}$ when the secondary user senses a licensed channel in the current slot and observes $\Theta$, i.e. the
secondary user takes action 1 or 2.
The value function is denoted $V(\vec{\lambda}, l)$. Let $Q_a(\vec{\lambda}, l)$ denote the action-value function of taking
action $a$ in the current slot when the information state is $(\vec{\lambda}, l)$. The value function is then expressed
by
\[
g_u + V(\vec{\lambda}, l) = \max_{a \in A} Q_a(\vec{\lambda}, l),
\tag{3}
\]
where $g_u$ is a constant, and the optimal action is given by
\[
a^*(\vec{\lambda}, l) = \arg\max_{a \in A} Q_a(\vec{\lambda}, l).
\tag{4}
\]

We determine the action-value function for each of the actions 0, 1 and 2. When the secondary user
decides to wait, i.e. to take action $a = 0$, we have:
\[
Q_0(\vec{\lambda}, l) = -f(l) + V(\Lambda_{ns}(\vec{\lambda} \mid \Theta = 0), l + 1).
\tag{5}
\]
When the secondary user chooses to sense channel $n$ and decides to wait for the next time slot if
channel $n$ is busy, i.e. to take action 1, we have:
\[
Q_1(\vec{\lambda}, l) = -c_s + \lambda_n \big(\phi - P_p + V(\Lambda_s(\vec{\lambda} \mid \Theta = 0), 1)\big)
+ (1 - \lambda_n)\big(-f(l) + V(\Lambda_s(\vec{\lambda} \mid \Theta = 1), l + 1)\big).
\tag{6}
\]
When the secondary user chooses to sense channel $n$ and to transmit using the dedicated channel if
channel $n$ is busy, i.e. to take action 2, we have:
\[
Q_2(\vec{\lambda}, l) = -c_s + \lambda_n \big(\phi - P_p + V(\Lambda_s(\vec{\lambda} \mid \Theta = 0), 1)\big)
+ (1 - \lambda_n)\big(\phi - P_{3G} + V(\Lambda_s(\vec{\lambda} \mid \Theta = 1), 1)\big).
\tag{7}
\]
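To make the structure of (5), (6) and (7) concrete, the sketch below evaluates the three action values for a single channel, given some candidate value function `V(belief, delay)`. It relies on the single-channel reading of Eq. (1), under which the belief after sensing is `alpha` (channel found idle) or `beta` (channel found occupied). The function name, the callable `V`, the linear delay cost and the numerical values are assumptions made for illustration; no value function is computed here.

```python
def action_values(lam, delay, V, alpha, beta, f,
                  phi=35.0, p_primary=10.0, p_3g=80.0, c_s=5.0):
    """Return (Q0, Q1, Q2) for one channel with idle belief `lam`.

    V(belief, delay) is any candidate value function supplied by the caller.
    """
    lam_unsensed = beta + (alpha - beta) * lam  # belief drift without sensing
    # Action 0: wait, pay the delay cost, keep the packet.
    q0 = -f(delay) + V(lam_unsensed, delay + 1)
    # Common branch of actions 1 and 2: channel idle, packet sent, delay resets.
    idle_branch = phi - p_primary + V(alpha, 1)
    # Action 1: if the channel is busy, wait for the next slot.
    q1 = -c_s + lam * idle_branch + (1 - lam) * (-f(delay) + V(beta, delay + 1))
    # Action 2: if the channel is busy, use the dedicated channel.
    q2 = -c_s + lam * idle_branch + (1 - lam) * (phi - p_3g + V(beta, 1))
    return q0, q1, q2


def zero_value(belief, delay):
    """Placeholder value function (identically zero), for illustration only."""
    return 0.0


def linear_delay_cost(delay):
    """Hypothetical increasing delay cost f(l) = l."""
    return float(delay)


q0, q1, q2 = action_values(lam=0.5, delay=2, V=zero_value,
                           alpha=0.15, beta=0.1, f=linear_delay_cost)
best_action = max(range(3), key=lambda a: (q0, q1, q2)[a])
```

With the zero placeholder, only the one-slot terms remain, so the comparison reduces to the immediate tradeoff between sensing cost, delay cost and the two transmission prices.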

We focus on the case of one licensed channel. The multichannel case is studied in Section III-C.
We assume that there exists a packet delay $l$ such that the secondary user transmits its
packet using the dedicated channel if the observation is $\Theta = 1$. This assumption is
realistic, as the user has no interest in keeping the packet in its buffer indefinitely. We denote by $\alpha$ and $\beta$ the
transition rates of the channel, and by $\lambda$ the belief of the secondary user. We consider that $\alpha \geq \beta$. When
$\alpha < \beta$, the analysis is similar and the results are unchanged.

B. The single channel model


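Let us focus on the belief update function $\Lambda_{ns}$.
Lemma 1: The belief update function $\Lambda_{ns}$ has the following properties.
1) The update function $\Lambda_{ns}(\lambda \mid \Theta)$ is increasing in the belief $\lambda$.
2) We have the following equivalences:
\[
\Lambda_{ns}(\lambda \mid \Theta) \geq \lambda \iff \lambda \leq \lambda(0),
\quad \text{and} \quad
\Lambda_{ns}(\lambda \mid \Theta) \leq \lambda \iff \lambda \geq \lambda(0),
\]
where $\lambda(0) = \frac{\beta}{1 + \beta - \alpha}$ is the stationary probability that the primary channel is idle. Figure 4 depicts
the belief evolution.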


Proof: See Appendix B.
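The behaviour stated in Lemma 1 can be checked numerically with a short Python sketch. It assumes that the no-sensing update is the one-step propagation of the belief, lam -> beta + (alpha - beta) * lam, whose fixed point is beta / (1 + beta - alpha); the function names and parameter values are illustrative.

```python
def no_sensing_update(lam, alpha, beta):
    """One-step belief propagation when the channel is not sensed."""
    return beta + (alpha - beta) * lam


def stationary_belief(alpha, beta):
    """Fixed point of the no-sensing update (stationary idle probability)."""
    return beta / (1.0 + beta - alpha)


def iterate_belief(lam, alpha, beta, n_slots):
    """Belief trajectory over n_slots consecutive slots without sensing."""
    trajectory = [lam]
    for _ in range(n_slots):
        lam = no_sensing_update(lam, alpha, beta)
        trajectory.append(lam)
    return trajectory


alpha, beta = 0.85, 0.7
lam0 = stationary_belief(alpha, beta)
# Starting below the stationary value (e.g. right after sensing "occupied").
from_below = iterate_belief(beta, alpha, beta, 20)
# Starting above the stationary value (e.g. right after sensing "idle").
from_above = iterate_belief(alpha, alpha, beta, 20)
```

A belief that starts below the stationary value increases towards it, and a belief that starts above decreases towards it, which matches the two equivalences of the lemma when `alpha >= beta`.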
It has been shown in [15] that the value function of a POMDP over a finite time horizon is piecewise
linear and convex with respect to the belief vector. In Proposition 2, we show that the value function of
our POMDP problem over an infinite horizon with the average criterion also has this property.
Proposition 2: The value function $V(\lambda, l)$ given in (3) is piecewise linear and convex with respect to
the belief vector $\lambda$.
Proof: See Appendix C.
Monotonicity results help establish the structure of the optimal policies (see [16]
for an example) and provide insight into the underlying problem. The following propositions state
monotonicity results for the value function with respect to each of its parameters.
Proposition 3: For each belief vector $\lambda$, the value function is monotonically decreasing in the packet
delay $l$, i.e. $V(\lambda, l) \leq V(\lambda, l')$ for $l \geq l'$.
Proof: See Appendix D.
This result is intuitive: for the same belief, the maximum expected remaining reward that can be
accrued with a given packet delay is lower than the one the secondary user can get with a
smaller packet delay.
Proposition 4: The value function is monotonically increasing in the belief vector $\lambda$, i.e. $V(\lambda, l) \geq
V(\lambda', l)$ for $\lambda \geq \lambda'$.

Proof: See Appendix F.


This result is also intuitive: for the same packet delay, a higher belief vector
yields a higher maximum expected remaining reward.
Given all the previous results on the value function $V(\lambda, l)$, we are able to show the existence of
an optimal sensing policy for our POMDP problem. Moreover, we determine explicitly the threshold
structure of such an optimal policy.

C. The multichannel model


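Lemma 1 holds for the multichannel model. Indeed, if $\vec{\lambda}^1 \leq \vec{\lambda}^2$, then $\lambda_n^1 \leq \lambda_n^2$ and $\Lambda_{ns}(\lambda_n^1) \leq
\Lambda_{ns}(\lambda_n^2)$ for every channel $n$, and therefore $\Lambda_{ns}(\vec{\lambda}^1) \leq \Lambda_{ns}(\vec{\lambda}^2)$. Second, if $\lambda_n \leq \lambda_n(0)$ for every channel $n$, then we
have $\Lambda_{ns}(\lambda_n) \geq \lambda_n$, and thus $\Lambda_{ns}(\vec{\lambda}) \geq \vec{\lambda}$. Otherwise, we have $\Lambda_{ns}(\vec{\lambda}) \leq \vec{\lambda}$.

Proposition 2 extends straightforwardly to the multichannel model. Furthermore, we


studied in Proposition 3 the monotonicity of the value function with respect to
the packet delay for a fixed belief value. This proposition also extends to the multichannel model.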

Let us focus on Proposition 4. The monotonicity with respect to the belief vector depends on the
order relation over the belief set and also on the monotonicity of the belief update functions $\Lambda_s(\vec{\lambda} \mid \Theta = 0)$
and $\Lambda_s(\vec{\lambda} \mid \Theta = 1)$ with respect to the belief vector.
IV. OPTIMAL THRESHOLD POLICY
Let us focus on the characteristics of an optimal policy for the secondary user. Intuitively, when the
delay $l$ and the belief probability $\lambda$ are small, the secondary user waits for a better opportunity. Thus,
depending on the belief probability, the secondary user decides whether or not to sense a primary channel.
We prove in this section that this intuition is correct and that there exists an optimal sensing policy
with a threshold structure.
The first decision for a secondary user is whether to sense licensed channels or to wait, depending on
its belief $\lambda$ and the current delay $l$ of the packet. The following result gives a threshold
on the belief probability that answers this question.
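Proposition 5: For every packet delay $l$, the optimal action for the secondary user is to wait for the
next slot, i.e. $a^*(\lambda, l) = 0$, if and only if $\lambda \leq \lambda^*$, where $\lambda^*$ is the solution of the equation $\lambda^* =
\max(0, \min\{\mathrm{Th}_1(\lambda^*, l), \mathrm{Th}_2(\lambda^*, l)\})$ with
\[
\mathrm{Th}_1(\lambda^*, l) = \frac{V(\Lambda_{ns}(\lambda^* \mid \Theta), l + 1) - V(\beta, l + 1) + c_s}{f(l) + \phi - P_p + V(\alpha, 1) - V(\beta, l + 1)},
\]
and
\[
\mathrm{Th}_2(\lambda^*, l) = \frac{V(\Lambda_{ns}(\lambda^* \mid \Theta), l + 1) - V(\beta, 1) + c_s - f(l) - \phi + P_{3G}}{-P_p + V(\alpha, 1) + P_{3G} - V(\beta, 1)}.
\]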

Proof: see Appendix G.
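The fixed-point characterization of Proposition 5 can be sketched numerically. The Python function below iterates lam <- max(0, min(Th1(lam), Th2(lam))) for a caller-supplied value function `V(belief, delay)`, under the single-channel reading in which the belief after sensing is `alpha` or `beta`. This is only a sketch: the plain fixed-point iteration is our choice and is not guaranteed to converge for an arbitrary `V`, both denominators are assumed to be positive, and all names, the linear delay cost and the parameter values are illustrative assumptions.

```python
def wait_threshold(delay, V, alpha, beta, f,
                   phi=35.0, p_primary=10.0, p_3g=80.0, c_s=5.0, n_iter=100):
    """Approximate the belief threshold below which waiting is preferred."""
    lam = beta  # arbitrary starting point in [0, 1]
    for _ in range(n_iter):
        v_wait = V(beta + (alpha - beta) * lam, delay + 1)
        th1 = (v_wait - V(beta, delay + 1) + c_s) / (
            f(delay) + phi - p_primary + V(alpha, 1) - V(beta, delay + 1))
        th2 = (v_wait - V(beta, 1) + c_s - f(delay) - phi + p_3g) / (
            -p_primary + V(alpha, 1) + p_3g - V(beta, 1))
        lam = max(0.0, min(th1, th2))
    return lam


def zero_value(belief, delay):
    """Placeholder value function (identically zero), for illustration only."""
    return 0.0


def linear_delay_cost(delay):
    """Hypothetical increasing delay cost f(l) = l."""
    return float(delay)


threshold = wait_threshold(delay=2, V=zero_value, alpha=0.15, beta=0.1,
                           f=linear_delay_cost)
```

With the zero placeholder the two ratios no longer depend on the belief, so the iteration settles immediately; with an actual value function, the threshold would have to be recomputed for each packet delay.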


This proposition gives a necessary and sufficient condition for the use of action 0 in terms of
the belief probability $\lambda$. Consequently, if $\lambda > \lambda^*$, then the optimal action is to sense a primary channel,
i.e. $a^*(\lambda, l) \neq 0$.
Furthermore, the optimal policy has the following property.
Proposition 6: For all $\lambda > \lambda(0)$ and all $l$, the secondary user never takes action 0, and thus $Q_0(\lambda, l) <
\max\{Q_1(\lambda, l), Q_2(\lambda, l)\}$.

Proof: See Appendix H.


Therefore, the secondary user never chooses action 0 right after it transmits a packet over the primary
channel, because $\Lambda_s(\lambda, \Theta = 0) = \alpha > \lambda(0)$. Furthermore, we have the following result about the use of
the dedicated channel.

Proposition 7: For every belief $\lambda$, the secondary user chooses to use the dedicated channel instead of
waiting for the next slot if and only if the delay $l$ of the current packet satisfies:
\[
f(l) + \phi - P_{3G} - V(\beta, l + 1) + V(\beta, 1) > 0.
\]

Proof: See Appendix I.
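As an illustration, the condition of Proposition 7 can be read as a comparison between the two "busy channel" branches of (6) and (7): using the dedicated channel yields the net transmission reward plus the value of a fresh packet, whereas waiting yields the delay penalty plus the value of an older packet. The Python sketch below encodes this comparison and searches for the smallest delay at which the dedicated channel is preferred. The value function `V`, the delay cost and all names are assumptions made for illustration.

```python
def prefers_dedicated(delay, V, beta, f, phi=35.0, p_3g=80.0):
    """True if, after sensing a busy channel, the dedicated channel beats waiting."""
    use_dedicated = phi - p_3g + V(beta, 1)
    keep_waiting = -f(delay) + V(beta, delay + 1)
    return use_dedicated > keep_waiting


def first_dedicated_delay(V, beta, f, phi=35.0, p_3g=80.0, max_delay=1000):
    """Smallest delay at which the dedicated channel is preferred, or None."""
    for delay in range(1, max_delay + 1):
        if prefers_dedicated(delay, V, beta, f, phi, p_3g):
            return delay
    return None


def zero_value(belief, delay):
    """Placeholder value function (identically zero), for illustration only."""
    return 0.0


def steep_delay_cost(delay):
    """Hypothetical increasing delay cost f(l) = 5 l."""
    return 5.0 * delay


switch_delay = first_dedicated_delay(zero_value, beta=0.1, f=steep_delay_cost)
```

Neither the sensing cost nor the current belief appears in the comparison, since both branches are evaluated after the channel has already been sensed as busy.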


We note that this expression depends neither on the sensing cost $c_s$ nor on the belief vector $\lambda$.
This is expected, as the expression determines the best action to take after sensing a channel. Finally, we have the
following property of the optimal threshold policy.
Corollary 1 (Never Wait After Sensing): If, for all $l$, the penalty cost $-f(l)$ is lower than $\phi - P_{3G}$,
then the secondary user transmits on the dedicated channel when the sensed channel is not idle.
Proof: See Appendix J.
This result is also somewhat intuitive. When the secondary user senses the channel as busy, it
gets $\phi - P_{3G}$ as reward if it uses the dedicated channel; otherwise it incurs a penalty $-f(l)$ if it decides
to wait. Thus, if $\phi - P_{3G} + f(l)$ is positive, the secondary user has no incentive to wait after sensing the
licensed channels.
In all these results, the optimal sensing policy depends on the transition rates $\alpha$ and $\beta$ of the primary
user activity. In the literature, these parameters are assumed to be known by the secondary user. We
focus in the next section on online learning algorithms that allow the secondary user to estimate these
rates on the fly.
V. ONLINE LEARNING OF PRIMARY USER'S ACTIVITY
We proved that the secondary user has an optimal energy-delay constrained policy given perfect
knowledge of the channels' transition rates. However, in practice, some information such as the transition
rates $\alpha$ and $\beta$ is not available to the secondary user. In this section, we consider a model where the
secondary user does not have external information about the state transition rates. We present two
learning-based protocols that the secondary user can use to estimate the primary channels' dynamics: a rate
estimator and a transition matrix estimator.
A. Rate Estimator
In this approach, the secondary user begins with arbitrary initial values of $\alpha$ and $\beta$, and
updates them every time slot depending on the information about the system state. The secondary user then
computes its sensing policy based on the estimators $\hat{\alpha} = \{\hat{\alpha}_1, \dots, \hat{\alpha}_N\}$ and $\hat{\beta} = \{\hat{\beta}_1, \dots, \hat{\beta}_N\}$,
where $\hat{\alpha}_i$ (resp. $\hat{\beta}_i$) is the estimator of $\alpha_i$ (resp. $\beta_i$).
First, the secondary user estimates $\hat{\alpha}_i$, the probability that channel $i$ will be sensed idle
given that it was idle in the previous slot. Second, the secondary user estimates $\hat{\lambda}_i(0)$, the stationary
probability for this channel to be idle. The secondary user obtains the estimated value of $\beta_i$ from
the relation
\[
\hat{\beta}_i = (1 - \hat{\alpha}_i)\, \frac{\hat{\lambda}_i(0)}{1 - \hat{\lambda}_i(0)}.
\]

Formally, we consider the following counting processes for the estimation of $\hat{\alpha}_i$ and $\hat{\lambda}_i(0)$:

The vector $\hat{K} = \{\hat{K}_1, \dots, \hat{K}_N\}$, where $\hat{K}_i$ represents the number of time slots channel $i$ stays in the
idle state, i.e. $\hat{K}_i$ is incremented if channel $i$ is sensed and is idle at time slots $t$ and $t - 1$.

The vector $I = \{I_1, \dots, I_N\}$, where $I_i$ represents the number of time slots in which channel $i$ is sensed
and is idle.

The vector $\hat{M} = \{\hat{M}_1, \dots, \hat{M}_N\}$, where $\hat{M}_i$ represents the number of time slots in which channel $i$ is
sensed.

Therefore, the secondary user estimates the state transition rate $\hat{\alpha}_i$ and $\hat{\lambda}_i(0)$ with the following
expressions:
\[
\hat{\alpha}_i = \frac{\hat{K}_i}{I_i}
\quad \text{and} \quad
\hat{\lambda}_i(0) = \frac{I_i}{\hat{M}_i}.
\]
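The counting processes above translate directly into a small estimator. The Python sketch below keeps the three counters for one channel and derives the three estimates from them; the class name, method names and the observation encoding (0 for idle, 1 for occupied, `None` when the channel is not sensed) are assumptions made for illustration. On short samples the derived estimate of beta can fall outside [0, 1]; the sketch does not clip it.

```python
class RateEstimator:
    """Counting-based estimator for one channel (state 0 = idle)."""

    def __init__(self):
        self.k = 0  # slots where the channel is sensed idle at both t-1 and t
        self.i = 0  # slots where the channel is sensed and idle
        self.m = 0  # slots where the channel is sensed
        self._previous = None  # previous observation, None if not sensed

    def observe(self, observation):
        """Record one slot: 0 (idle), 1 (occupied) or None (not sensed)."""
        if observation is not None:
            self.m += 1
            if observation == 0:
                self.i += 1
                if self._previous == 0:
                    self.k += 1
        self._previous = observation

    def estimates(self):
        """Return (alpha_hat, lambda0_hat, beta_hat), or None if undefined."""
        if self.i == 0 or self.i == self.m:
            return None
        alpha_hat = self.k / self.i
        lambda0_hat = self.i / self.m
        beta_hat = (1 - alpha_hat) * lambda0_hat / (1 - lambda0_hat)
        return alpha_hat, lambda0_hat, beta_hat


estimator = RateEstimator()
for obs in [1, 0, 0, 1, 1, None, 1, 0, 1, 1]:
    estimator.observe(obs)
alpha_hat, lambda0_hat, beta_hat = estimator.estimates()
```

Only one pair of consecutive idle observations occurs in this short trace, which illustrates why the estimate of alpha needs the same channel to be sensed in successive slots.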

B. Transition Matrices Estimator


The convergence of the previous estimators $\hat{\alpha}$ and $\hat{\beta}$ depends on the occurrence of two successive
sensing actions on the same channel. The secondary user may not often sense the same channel in
two successive time slots, so the previous learning mechanism converges slowly. We present, in
this section, a learning protocol that estimates the transition matrices. We define the set of transition
matrices $\{P_i(0), P_i(1), \dots\}$, where $P_i(j)$ is the transition matrix of channel $i$ when this channel was
not sensed during $j$ consecutive slots. For example, if channel $i$ was sensed $j$ slots before and found idle,
the current belief on the state of this channel is $(1, 0)\, P_i(j)$. As for the rate estimator, the transition
matrices are estimated using a counting process. The previous learning protocol is a particular
case of this approach: estimating $\alpha$ and $\beta$ is equivalent to estimating the set of transition matrices
for which the channel was sensed in the previous slot, $\{P_1(0), \dots, P_N(0)\}$. Therefore, this learning
protocol gives a more accurate estimation of the primary users' activity. However, it requires more memory and
computation than the rate estimator method.
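One possible way to organise the counting process for the transition matrices is sketched below in Python for a single channel. For each gap `j` (number of unsensed slots between two observations of the channel), it counts the observed transitions and row-normalises them to estimate the matrix for that gap. The class, its methods and the observation encoding (0 idle, 1 occupied, `None` not sensed) are assumptions; the source only states that the matrices are estimated by counting.

```python
class TransitionMatrixEstimator:
    """Per-gap transition counts for one channel (state 0 = idle)."""

    def __init__(self):
        self.counts = {}  # counts[j][previous_state][current_state]
        self._last_state = None
        self._gap = 0  # unsensed slots since the last observation

    def observe(self, observation):
        """Record one slot: 0 (idle), 1 (occupied) or None (not sensed)."""
        if observation is None:
            self._gap += 1
            return
        if self._last_state is not None:
            table = self.counts.setdefault(self._gap, [[0, 0], [0, 0]])
            table[self._last_state][observation] += 1
        self._last_state = observation
        self._gap = 0

    def matrix(self, j):
        """Row-normalised estimate for gap j; rows without data are None."""
        table = self.counts.get(j)
        if table is None:
            return None
        rows = []
        for row in table:
            total = sum(row)
            rows.append([c / total for c in row] if total else None)
        return rows

    def belief_after_idle(self, j):
        """Estimated idle probability j unsensed slots after an idle observation."""
        estimate = self.matrix(j)
        if estimate is None or estimate[0] is None:
            return None
        return estimate[0][0]


estimator = TransitionMatrixEstimator()
for obs in [0, 0, None, 1, None, None, 0, 0, 1]:
    estimator.observe(obs)
p0 = estimator.matrix(0)
```

Each gap needs its own table, which reflects the extra memory mentioned above, while every pair of observations contributes to some table even when the channel is not sensed in consecutive slots.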
VI. NUMERIC ILLUSTRATIONS
We illustrate our results through simulations of the system over a large number of packets (we
consider 3000 packets). It was shown in [17] that, in practice, the average number of available primary
channels is about 15. However, we consider only 4 i.i.d. primary channels, i.e. $N = 4$, due to
the exponential growth of the state space (with 4 primary channels, we have approximately $10^6$ states). Furthermore, we
consider the following system parameters: $P_{3G} = 80$, $P_p = 10$, $c_s = 5$ and $\phi = 35$.
We illustrate our results in three scenarios with symmetric channels:
1) Scenario 1: Primary channels are often occupied ($\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0.15$ and $\beta_1 = \beta_2 =
\beta_3 = \beta_4 = 0.1$),

2) Scenario 2: Primary channels are often idle ($\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0.85$ and $\beta_1 = \beta_2 = \beta_3 =
\beta_4 = 0.7$),

3) Scenario 3: Primary channels have low transition rates ($\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0.95$ and $\beta_1 =
\beta_2 = \beta_3 = \beta_4 = 0.05$). This last scenario is realistic if we consider TV white space [17].

We first describe the optimal threshold policy given perfect knowledge of the transition rates of the
primary channels. We then give some results obtained with estimated values of the transition rates.
A. Single channel model
We consider only one licensed channel with transition rates $\alpha = 0.15$ and $\beta = 0.1$. Figure 5
illustrates the optimal policy of the secondary user as a function of the belief and the packet delay. For
each packet delay, the secondary user has a threshold policy in the belief. Moreover, the threshold
belief probability decreases with the packet delay. We observe that the maximum packet delay is
13 slots.
Consider the same scenario with transition rates $\alpha = 0.2$ and $\beta = 0.25$. We observe in Figure 6
that the secondary user policy also has a threshold structure. A packet has a delay of at most 3 slots.
B. Optimal policy with perfect knowledge of $\alpha$ and $\beta$
We simulate the first scenario and depict in Figure 7 the thresholds $\lambda^*(l)$ determined in Proposition
5 as a function of the packet delay $l$. For each packet delay $l$, the best action for the secondary user is to
wait for the next slot if its belief probability is lower than $\lambda^*(l)$. Otherwise, the secondary user decides to
sense the primary channels. In this context, where the primary channels are often occupied (Scenario 1,
Figure 7), the maximum packet delay $l$ obtained with Proposition 7 equals 9. Thus, when the packet
delay is $l = 9$, the user decides to sense and to transmit using the dedicated channel if the sensed channel
is occupied. The optimal policy for Scenario 2 is shown in Figure 8. The maximum packet delay
in this case is $l = 5$. This result is intuitive since, in this scenario, the primary channels are more often
idle, which induces a lower packet delay. Finally, for the last scenario, depicted in Figure 9,
the maximum packet delay is 5. We observe that the secondary user policy also has a threshold
structure. However, the threshold belief probability does not decrease with the packet delay. Indeed,
because the primary channels are more static (the probability for each channel to stay occupied or idle is
high), a kind of periodic threshold strategy appears.
C. Average reward using estimated values of $\alpha$ and $\beta$
We consider the learning approaches proposed in Section V. Let us first compare the average reward
and the average delay obtained with the two learning-based protocols against those obtained with perfect
knowledge of the channels' transition rates. Figures 10 and 11 show that both learning protocols converge:
in these figures, both protocols converge before 400 iterations. However, in Figures 12 and 13, we
observe that the transition matrices estimation method converges 3 times faster (about 1000 iterations)
than the rate estimator method (about 3000 iterations). Moreover, the average reward and the average
packet delay obtained with the estimated transition rates are close to the average reward and the average delay
obtained with known channel transition rates.
VII. CONCLUSION AND PERSPECTIVES
In this paper, we have used a POMDP framework to determine an optimal sensing policy for opportunistic
spectrum sensing and access (OSA) that takes into account an energy-delay tradeoff for secondary
users. Introducing a QoS metric in the spectrum sensing policy is very important with the emergence
of heterogeneous mobiles that are able to transmit their traffic, possibly with high QoS constraints, at
any time over different means of communication such as 3G, WiFi and TV White Space. We have provided
some structural properties of the value function and then proved the existence of an optimal average
stationary spectrum sensing policy. We have been able to determine explicitly the threshold structure of
the optimal policy. The interaction between several secondary users has not been considered here, and
very few works in the literature address it. This perspective is also very important because, if the channel
choice policy is the same for all the secondary users, there could be many collisions between secondary users
that have sensed the same idle primary channel. This decentralized system with partial information can
be modeled using a decentralized POMDP or an interactive POMDP and will be studied in future work.
APPENDIX
A. Proof of Proposition 1
We use Theorems 8.10.9 and 8.10.7 from [14] to prove the existence of an optimal stationary
policy for our problem. First, the immediate reward rt ((s, l), a) is finite, i.e. −∞ < rt ((s, l), a) < +∞
(as all costs and rewards are finite). Second, we prove that there exists a stationary policy d for which
the derived Markov chain is positive recurrent.
Let us focus on the following belief vector:
0 = (1 , 2 , ..., N )

such that j = j1 (j |0),

for

j = 1, . . . , N,

where j represents the belief of a channel that was not sensed for j successive slots.
Denote by d the stationary policy which senses licensed channels at every slot, with periodic channel
choice policy. Let us prove that the derived Markov chain is positive recurrent. The probability that the
Q
n
system returns to the initial belief from any state is p() = N
k=0 (1 (j )) > 0, n {O, ..., N }
and then the return time to the initial belief j follows a geometric distribution, so that E{j } = 1/p(j ), and
therefore all states are positive recurrent under d .
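For completeness, the mean of a geometric return time follows from a standard series identity. In the worked equation below, T denotes the return time and p the per-slot return probability; both are generic symbols used only for this illustration.

```latex
\mathbb{E}\{T\}
  = \sum_{k=1}^{\infty} k \, p \, (1-p)^{k-1}
  = p \cdot \frac{1}{\bigl(1-(1-p)\bigr)^{2}}
  = \frac{1}{p},
  \qquad 0 < p \le 1 .
```

Since p is strictly positive, the expected return time is finite, which is the property needed for positive recurrence.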

Third, let us prove that g d > −∞ and that the set {b ∈ Sb : rt ((s, l), a) > g d for some a ∈ A} is finite
and nonempty. As the policy d senses licensed channels every slot, g d = f (l(t)) cs (f (l(t)) +
Pp )n . If we have
f (l(t)) cs (f (l(t)) + Pp )n > max{f (l(t)), cs P3G (Pp P3G )n }

for all beliefs b, the policy that always senses primary channels is optimal and we have achieved our goal.
Otherwise, the set {b ∈ Sb : rt ((s, l), a) > g d for some a ∈ A} is finite and nonempty.

Finally, we obtain from Theorems 8.10.9 and 8.10.7 of [14] that there exists an average optimal
stationary policy.

B. Proof of Lemma 1
First, the update function ns is linear in the belief because ns () = + ( ). As
we considered the case where , the update function is increasing with the belief.
Second, let us prove that ns () if (0) by induction on the belief.
1) We have the initial condition: (0) =

1+

and ns () = + ( ) .

2) We assume that ns () for a given (0).


3) The induction operator gives: ns (ns ()) = + ( )ns () + ( ) = ns ().
Thus, ns () for all (0). The analysis for (0) is similar.
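The behaviour stated in Lemma 1 can be checked numerically with a short sketch. It assumes the usual two-state Markov channel, in which the idle belief of an unsensed channel evolves as an affine function of itself. The names p01 and p11 (probability of being idle in the next slot given that the channel is occupied, respectively idle, now) are notation chosen for this example and are not taken from the text above.

```python
def propagate_belief(belief, p01, p11):
    """One-slot update of the idle belief for a channel that is not sensed.

    p01: assumed Pr(idle next | occupied now).
    p11: assumed Pr(idle next | idle now).
    """
    return p01 + (p11 - p01) * belief


def stationary_belief(p01, p11):
    """Fixed point of propagate_belief (requires 1 + p01 - p11 != 0)."""
    return p01 / (1.0 + p01 - p11)


def trajectory(belief, p01, p11, slots):
    """Beliefs after 0, 1, ..., slots consecutive unsensed slots."""
    path = [belief]
    for _ in range(slots):
        path.append(propagate_belief(path[-1], p01, p11))
    return path


if __name__ == "__main__":
    # Hypothetical positively correlated channel (p11 >= p01).
    for start in (0.1, 0.9):
        print([round(b, 4) for b in trajectory(start, 0.2, 0.8, 5)])
    print(stationary_belief(0.2, 0.8))
```

With p11 at least p01, a belief that starts below the fixed point increases towards it without crossing it, and a belief that starts above decreases towards it, which is the monotone behaviour used in the induction.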

C. Proof of Proposition 2
The proof of Proposition 2 is similar to that in [15], where the authors consider the finite time horizon
problem. Hence, we only briefly describe the procedure for this proof. Considering the maximum packet delay
l and for every belief vector , the value function V (, l ) is linear in the belief because
V (, l ) = Q2 (, l ) gu ,
= gu + cs P3G + V (s (| = 1), 1) +
n (P3G Pp + V (s (| = 0), 1) V (s (| = 1), 1)).

Then the value function V (, l ) can be rewritten as an inner product of the belief vector and a -vector.
As Q2 (, l) = Q2 (, l ), for all l, the action-value function Q2 (, l) can be also rewritten as an inner
product of the belief vector and a -vector. We suppose that Proposition 2 holds for all packet delays
higher than l + 1 and we prove that the proposition is true for packet delay l. After some algebra, we
can rewrite the action-value functions given in (5) and (7) in terms of -vector:
"
#
X
X
ns
(|)
Q0 (, l) = f (l) + max < ns (|), >= f (l) +
s
P (s0 |s)l+1
,
l+1

sS

(8)

s0 S

and

Q1 (, l) = cs + ( Pp + V (, 1)) + (1 )(f (l) + max < s (| = 1), >)


l+1
"
#
X
X
s
(|=1)
= cs + ( Pp + V (, 1)) + (1 )(f (l) +
s
P (s0 |s)l+1
), (9)
sS
ns

where l+1

(|)

s0 S

(|=1)

and l+1

are the -vectors for the regions containing belief vectors

ns (|) and s (| = 1), respectively. Each term in the square brackets of (8) and (9) is an element
,l of a -vector l . Then the action-value functions can be rewritten as an inner product of the belief

vector and a -vector l . Moreover, there are only a finite number of such -vectors l since we have
a finite set of beliefs for all l. As the maximum of a finite set of piecewise linear and convex functions
is also piecewise linear and convex, Proposition 2 holds.
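The representation used in this proof, a value function written as the maximum of finitely many inner products with the belief vector, can be sketched as follows. The term "alpha-vector" is the standard POMDP name and is used here as an assumption about the notation; the function name and the numeric vectors are purely illustrative.

```python
def value_from_alpha_vectors(belief, alpha_vectors):
    """Evaluate V(belief) as the maximum of <belief, alpha> over all vectors.

    belief: probabilities over the hidden states (should sum to 1).
    alpha_vectors: non-empty list of per-state value vectors (illustrative).
    Returns (value, index of the maximizing vector).
    """
    best_value, best_index = None, None
    for index, alpha in enumerate(alpha_vectors):
        # Inner product of the belief vector and one vector.
        value = sum(b * a for b, a in zip(belief, alpha))
        if best_value is None or value > best_value:
            best_value, best_index = value, index
    return best_value, best_index


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]]  # hypothetical values
    for b in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
        print(b, value_from_alpha_vectors(b, vectors))
```

Each vector defines a linear function of the belief, and the pointwise maximum of linear functions is piecewise linear and convex, which is the property claimed in Proposition 2.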

D. Proof of Proposition 3
Let us first prove that the value function V (, l) is monotonically decreasing with the packet delay l
for every belief vector . The secondary user takes the action 2 for all when the packet delay is l , thus

we have:
V (, l ) = cs + (Pp + V (, 1)) + (1 )(P3G + V (, 1)).

The secondary user chooses the action that maximizes its average utility and thus:
V (, l 1) = max Qa (, l 1) gu Q2 (, l 1) gu ,
a

= cs + (Pp + V (, 1)) + (1 )(P3G + V (, 1)) gu ,


= V (, l ).

Let us prove that this property holds for all packet delays using backward induction on l:
1) Initial condition: for all belief vectors , V (, l ) V (, l 1).
2) We suppose that V (, l + 2) V (, l + 1), .
3) We have:
Q0 (, l) = f (l) + V (ns (|), l + 1),
f (l + 1) + V (ns (|), l + 2),
= Q0 (, l + 1).
Q1 (, l) = cs + ( Pp + V (, 1)) + (1 )(f (l) + V (, l + 1)),
cs ( Pp + V (, 1)) + (1 )(f (l + 1) + V (, l + 2)),
= Q1 (, l + 1).
Q2 (, l) = cs + P3G + V (, 1) + (P3G Pp + V (, 1) V (, 1)),
Q2 (, l + 1).

The inequalities come from the induction assumption and the monotonicity of the penalty function
f (l). Thus, we have:
,

V (, l) V (, l + 1).

The value function is therefore decreasing with the packet delay.


Lemma 2: We have the following inequality:
Pp + V (, 1) P3G + V (, 1).

E. Proof of Lemma 2
We prove this lemma by contradiction, so we suppose that Pp + V (, 1) < P3G + V (, 1). We
first prove the following:
gu + V (, 1) Q2 (, 1),
gu + V (, 1) cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)),
gu + V (, 1) cs + Pp + V (, 1),
gu > cs Pp .

and we take the assumption that the immediate reward when the channel is idle is positive, i.e. cs
Pp 0.

We know that the secondary user takes the action 2 in the state (, l ) for all belief vector , i.e
a (, l ) = 2, . We have:
gu + V (, l ) = cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)).

Let us focus on the packet delay l 1. If (0), we have:


Q0 (, l 1) = f (l 1) + V (ns (), l ),
= gu f (l 1) cs + ns ()( Pp + V (, 1)) + (1 ns ())( P3G + V (, 1)),
= V (, l ) f (l 1) + (ns () )(P3G Pp + V (, 1) V (, 1)),
< V (, l ).

The inequality is due to the assumption that Pp +V (, 1) < P3G +V (, 1), ns () and f (l 1)
is positive. As the value function V (, l) is decreasing with the packet delay l (see Proposition 3), then
Q0 (, l 1) < V (, l ) < V (, l 1). As we proved that gu 0, the secondary user does not take

the action 0 when the packet delay is l 1. For the action 1, we have:
Q1 (, l 1) = cs + ( Pp + V (, 1)) + (1 )(f (l 1) + V (, l )),
= cs + ( Pp + V (, 1)) + (1 ) ( gu f (l 1) cs
+(Pp + V (, 1)) + (1 )(P3G + V (, 1))) ,
< cs + ( Pp + V (, 1)) + (1 )( gu f (l 1) cs P3G + V (, 1)),
< cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)),
= Q2 (, l 1).
The first inequality is due to the assumption that Pp + V (, 1) < P3G + V (, 1) and the second one
is because gu , f (l 1) and cs are positive. Thus, the optimal strategy is to take the action 2 when the
packet delay is l 1.
Let us prove now by backward induction on l that the optimal action is the action 2 for all belief
vector (0).

If the secondary user takes the action 2 when the packet delay is l , then it takes also the action 2
when the packet delay is l 1.

We suppose that secondary user takes the action 2 when the packet delay is l < l 1.

We have the following inequalities:


Q0 (, l 1) = f (l 1) + V (ns (), l),
= gu f (l 1) cs + ns ()( Pp + V (, 1)) + (1 ns ())( P3G + V (, 1)),
= V (, l) f (l 1) + (ns () )(P3G Pp + V (, 1) V (, 1)),
< V (, l).

The inequality is due to the assumption that Pp + V (, 1) < P3G + V (, 1) and ns () ,


and f (l 1) is positive. As the value function is decreasing with the packet delay (see Proposition
3), then Q0 (, l 1) < V (, l 1) + gu , i.e. the secondary user does not take the action 0 with the
packet delay l 1.
Q1 (, l 1) = cs + ( Pp + V (, 1)) + (1 )(f (l 1) + V (, l)),
= cs + ( Pp + V (, 1)) + (1 ) ( gu f (l 1) cs
+(Pp + V (, 1)) + (1 )(P3G + V (, 1))) ,
< cs + ( Pp + V (, 1)) + (1 )( gu f (l 1) cs P3G + V (, 1)),
< cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)),
= Q2 (, l 1).

The first inequality is due to the assumption that Pp + V (, 1) < P3G + V (, 1) and the second
one is because gu , f (l 1) and cs are positive. Thus, the optimal strategy is to take action 2 when
the packet delay is l 1, and the secondary user does not take the action 1 with the packet delay
l 1. Finally, the secondary user takes action 2 for all packet delays and beliefs lower than (0).

We now look at the action-value function Q2 (, 1) when the packet delay is l = 1.


Q2 (, 1) = cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)),
Q2 (, 1) = cs P3G + V (, 1) + (P3G Pp + V (, 1) V (, 1)),
gu + Q2 (, 1) = gu + V (, 1) Pp + cs + ( 1)(P3G Pp + V (, 1) V (, 1)).

As the secondary user takes the action 2 also for the state (, 1), we have:
gu + V (, 1) = cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)),
gu + V (, 1) = cs P3G + V (, 1) + (P3G Pp + V (, 1) V (, 1)),
gu = cs P3G + (P3G Pp + V (, 1) V (, 1)).

Thus, we obtain:
gu + Q2 (, 1) = V (, 1) + P3G Pp + ( 1)(P3G Pp + V (, 1) V (, 1)).

As we assumed that P3G Pp + V (, 1) V (, 1) < 0, and P3G > Pp , then we obtain V (, 1) + gu


Q2 (, 1) and therefore the secondary user takes also the action 2 in the state (, 1). Then we get:
gu + V (, 1) = Q2 (, 1) = cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)).

Let us evaluate finally the difference V (, 1) V (, 1):


V (, 1) V (, 1) = ( )(P3G Pp + V (, 1) V (, 1)),
V (, 1) V (, 1) < 0.

and
V (, 1) V (, 1) = ( )(P3G Pp + V (, 1) V (, 1)),
(V (, 1) V (, 1))(1 + ) = ( )(P3G Pp ),
V (, 1) V (, 1) =

( )(P3G Pp )
,
1+

> 0.

which leads to a contradiction, and therefore, Pp + V (, 1) P3G + V (, 1). The analysis is similar
when > (0).

F. Proof of Proposition 4
Let us prove that the value function V (, l) is increasing with the belief vector for any packet delay
l. For all 1 2 , we have that:
V (1 , l ) = gu cs + P3G + V (, 1) + 1 (P3G Pp + V (, 1) V (, 1)),
gu cs + P3G + V (, 1) + 2 (P3G Pp + V (, 1) V (, 1)),
= V (2 , l ).

This inequality results from Lemma 2. Let us prove that this property holds for all packet delays l
using backward induction:

Initial condition: There exists a packet delay l such that V (1 , l ) V (2 , l ), 1 2 ,

We suppose that V (1 , l + 1) V (2 , l + 1), 1 2 ,


First case: We assume that + f (l) Pp + V (, 1) V (, l + 1) 0, then:
Q0 (1 , l) = f (l) + V (ns (1 |), l + 1),
f (l) + V (ns (2 |), l + 1),
= Q0 (2 , l).

The inequality is a direct result from the induction assumption and the Lemma 1. We have also:
Q1 (1 , l) = cs f (l) + V (, l + 1) + 1 ( + f (l) Pp + V (, 1) V (, l + 1)),
cs f (l) + V (, l + 1) + 2 ( + f (l) Pp + V (, 1) V (, l + 1)),
= Q1 (2 , l).
Q2 (1 , l) = cs + P3G + V (, 1) + 1 (P3G Pp + V (, 1) V (, 1)),
cs + P3G + V (, 1) + 2 (P3G Pp + V (, 1) V (, 1)),
= Q2 (2 , l).

The inequalities come from Lemma 2. Thus, we have proved that V (1 , l) V (2 , l).

Second case: We suppose that + f (l) Pp + V (, 1) V (, l + 1) < 0, then for all we have:
Q1 (, l) = cs + ( Pp + V (, 1)) + (1 )(f (l) + V (, l + 1)),
cs f (l) + V (, l + 1),
f (l) + V (, l + 1),
cs f (l) + V (ns (|), l + 1),
Q0 (, l).

In fact, we have that ns (|) for all belief vector and the value function V (, l) is
increasing with the belief for the packet delay l + 1 (induction assumption). Thus, gu + V (, l) =
max {Q0 (, l), Q2 (, l)}. Moreover, we have:
Q0 (1 , l) = f (l) + V (ns (1 |), l + 1),
f (l) + V (ns (2 |), l + 1),
= Q0 (2 , l).

The inequality is a direct result from the induction assumption. Finally, we have that:
Q2 (1 , l) = cs + P3G + V (, 1) + 1 (P3G Pp + V (, 1) V (, 1)),
cs + P3G + V (, 1) + 2 (P3G Pp + V (, 1) V (, 1)),
= Q2 (2 , l).

The inequality comes from the Lemma 2.


Thus, V (1 , l) V (2 , l) for belief vectors 1 2 and for all packet delays l.

G. Proof of Proposition 5
In this proposition, we determine explicitly the best action a (, l) for the secondary user depending
on the belief and the packet delay l. At each time slot and for a given information state (, l), the
secondary user decides to take the action 0 if Q0 (, l) max {Q1 (, l), Q2 (, l)}.

First we assume that Q1 (, l) > Q2 (, l), then, let us compare Q0 (, l) and Q1 (, l). The inequality
Q0 (, l) Q1 (, l) is equivalent to:
f (l) + V (ns (|), l + 1) cs + ( Pp + V (, 1)) + (1 )(f (l) + V (, l + 1)),
V (ns (|), l + 1) V (, l + 1) cs + (f (l) + Pp + V (, 1) V (, l + 1)).

As the value function V (, l) is decreasing with the packet delay l and increasing with the belief ,
we have V (, 1) V (, l + 1). As we assumed that the immediate reward is higher than the cost
Pp , we obtain that f (l) + Pp + V (, 1) V (, l + 1) is positive. Then, we have the following

equivalence:
Q0 (, l) Q1 (, l) V (ns (|), l+1) V (, l+1)cs +(f (l)+Pp +V (, 1)V (, l+1)).

Define the functions F and G as follows:


F (, l) = V (ns (|), l + 1),
G(, l) = V (, l + 1) cs + (f (l) + Pp + V (, 1) V (, l + 1)).

We proved in Proposition 2 that the value function is piecewise linear and convex (PWLC). Therefore, for
all packet delays, the function F (, l) is PWLC and increasing with , and the function G(, l) is
linear and increasing with . Note that:
If F (, l) ≥ G(, l), then Q0 (, l) ≥ Q1 (, l) and therefore the best action is 0.
If F (, l) < G(, l), then Q0 (, l) < Q1 (, l) and therefore the best action is 1.
Let us study the sign of the function H(, l) = F (, l) − G(, l). Under this setting, six cases
arise:
1) F (, l) is always higher than G(, l), see Figure (14, case 1).
2) F (, l) is always lower than G(, l), see Figure (14, case 2).
3) F (, l) and G(, l) intersect once and F (, l) < G(, l), see Figure (14, case 3).
4) F (, l) and G(, l) intersect once and F (, l) (, l), see Figure (14, case 4).
5) F (, l) and G(, l) intersect twice and F (, l) (, l), see Figure (14, case 5).
6) G(, l) is tangent to F (, l), see Figure (14, case 6).
Let us focus on F ((0), l) and G((0), l).
Let us prove that gu > f (l). We have:
gu + V (, 1) Q0 (, 1),
gu + V (, 1) f (l) + V (ns (), l + 1),
gu + V (, 1) V (ns (), l + 1) f (l),
gu > f (l).

The inequality is because of the monotonicity of the value function and ns () < . Suppose that
the secondary user chooses the action 0 for the state ((0), l). We have:
gu + V ((0), l) = f (l) + V (ns ((0)), l + 1),
gu + V ((0), l) f (l) + V (ns ((0)), l),
gu + V ((0), l) f (l) + V ((0), l),
gu f (l).

This leads to a contradiction as gu > f (l). Thus, Q0 (, l) < Q1 (, l) and therefore, F ((0), l) <
G((0), l). Therefore, the cases 1, 3, 5 and 6 are eliminated. Finally, the optimal policy is of
threshold type and is described as follows:


The secondary user takes the action 0 for all beliefs lower than the following threshold:
Th1(, l) = [V (ns (|), l + 1) V (, l + 1) + cs ] / [f (l) + Pp + V (, 1) V (, l + 1)],
and takes the action 1 otherwise.

Second, we assume that Q2 (, l) > Q1 (, l); we then have to compare the actions 0 and 2, which
is equivalent to comparing the action-value functions Q0 (, l) and Q2 (, l). The secondary user takes
the action 0 instead of the action 2 if Q0 (, l) Q2 (, l), which is equivalent to:
f (l) + V (ns (|), l + 1) cs + ( Pp + V (, 1)) + (1 )( P3G + V (, 1)),
V (ns (|), l + 1) V (, 1) + + f (l) cs P3G + (P3G Pp + V (, 1) V (, 1)).

From Lemma 2, we have P3G Pp + V (, 1) V (, 1) 0. Then, we can carry out the
same analysis as in the previous case with the function F (, l) = V (ns (|), l + 1) and
the function G(, l) = V (, 1) + + f (l) cs P3G + (P3G Pp + V (, 1) V (, 1)). The
latter is linear and increasing in . We obtain the following threshold policy:
The secondary user takes the action 0 for all beliefs lower than the following threshold:
Th2(, l) = [V (ns (|), l + 1) V (, 1) f (l) + cs + P3G ] / [P3G Pp + V (, 1) V (, 1)],
and takes the action 2 otherwise.
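The threshold in this proof is the belief at which a convex piecewise linear function F crosses a linear function G. The sketch below locates such crossings numerically on a grid. The functions F and G used in the example are hypothetical stand-ins with the right shape (they are not derived from the model), and the function name is an assumption of this example.

```python
def sign_change_points(h, grid_size=1000):
    """Approximate the beliefs in [0, 1] where h changes sign.

    h: function of the belief, for instance H = F - G at a fixed delay.
    Returns the midpoints of the grid cells in which a strict sign change
    occurs. Points where h only touches zero (tangency) are not reported.
    """
    points = []
    previous = h(0.0)
    for k in range(1, grid_size + 1):
        belief = k / grid_size
        current = h(belief)
        if current == 0.0:
            continue  # keep the last non-zero sign
        if previous != 0.0 and previous * current < 0:
            points.append(belief - 0.5 / grid_size)
        previous = current
    return points


if __name__ == "__main__":
    # Hypothetical shapes: F is PWLC and non-decreasing, G is linear and increasing.
    F = lambda b: max(0.3, 0.5 * b + 0.1)
    G = lambda b: 0.13 + 0.8 * b
    print(sign_change_points(lambda b: F(b) - G(b)))
```

In this example F lies above G for small beliefs and below it afterwards, so there is a single crossing, which plays the role of the threshold below which waiting without sensing is preferred.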

H. Proof of Proposition 6
We have from the Lemma 1 that if > (0) then ns () . Suppose that the secondary user takes
the action 0 for a belief and packet delay l. Thus we have
gu + V (, l) = f (l) + V (ns (), l + 1),
gu + V (, l) f (l) + V (ns (), l),
gu + V (, l) f (l) + V (, l),
gu f (l).

This leads to a contradiction as gu > f (l). The first inequality holds because the value function is decreasing
with the packet delay, and the second one holds because the value function is increasing with the
belief and ns () . Thus, if > (0), then the secondary user never takes the action 0 and then
Q0 (, l) < max {Q1 (, l), Q2 (, l)}.

I. Proof of Proposition 7
Let us compare the action-value functions Q1 (, l) and Q2 (, l) for every belief vector and packet delay
l. The secondary user waits for the next time slot after sensing if Q1 (, l) Q2 (, l), which is equivalent
to:
cs + ( Pp + V (, 1)) + (1 )(f (l) + V (, l + 1)) cs + ( Pp + V (, 1))
+(1 )( P3G + V (, 1)),
f (l) + V (, l + 1) P3G + V (, 1) 0.

Note that this condition depends only on the packet delay l and not on the belief vector .
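Taken together, Propositions 5 and 7 suggest a simple decision rule: wait without sensing when the belief is below a threshold, and otherwise sense, with the reaction to an occupied channel depending only on the packet delay. The sketch below illustrates that rule under a simplification: the thresholds, which in the propositions depend on the value function, are treated as precomputed numbers per delay. All names and numeric values are hypothetical.

```python
WAIT = 0                  # do not sense, wait for the next slot
SENSE_THEN_WAIT = 1       # sense; if occupied, wait for the next slot
SENSE_THEN_DEDICATED = 2  # sense; if occupied, use the dedicated channel


def choose_action(belief, delay, threshold, use_dedicated):
    """Threshold rule on the belief, with a delay-only fallback choice.

    belief: assumed probability that the channel is idle.
    threshold: hypothetical mapping delay -> belief below which the user waits.
    use_dedicated: hypothetical mapping delay -> True if an occupied channel
                   should trigger the dedicated link at that delay.
    """
    if belief < threshold[delay]:
        return WAIT
    if use_dedicated[delay]:
        return SENSE_THEN_DEDICATED
    return SENSE_THEN_WAIT


if __name__ == "__main__":
    threshold = {1: 0.6, 2: 0.4, 3: 0.0}            # illustrative values only
    use_dedicated = {1: False, 2: False, 3: True}   # delay 3 is the maximum delay
    for belief, delay in [(0.5, 1), (0.7, 1), (0.5, 2), (0.1, 3)]:
        print(belief, delay, choose_action(belief, delay, threshold, use_dedicated))
```

Setting the threshold to zero at the maximum delay encodes the fact that the user always senses there and falls back to the dedicated channel if the sensed channel is occupied.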

J. Proof of Corollary 1
If f (l) is lower than P3G , then f (l) + P3G + V (, l + 1) V (, 1) is always negative.
In fact, V (, 2) V (, 1) is negative and f (l) + Pp + V (, l + 1) V (, 1) is decreasing with l.
Therefore, the previous expression is negative for all l 1.

REFERENCES
[1] E. Hossain, D. Niyato and Z. Han, Dynamic Spectrum Access and Management in Cognitive Radio Networks, Cambridge
University Press, 2009.
[2] J. Mitola, Cognitive radio: An integrated agent architecture for software defined radio, PhD dissertation, Royal Inst.
Technol. (KTH), Stockholm, Sweden, 2000.
[3] I. F. Akyildiz, W.-Y. Lee et al., NeXt generation dynamic spectrum access cognitive radio wireless networks: A
survey, Computer Networks, 2006.
[4] K. Jagannathan, I. Menache, E. Modiano and G. Zussman, Non-cooperative Spectrum Access - The Dedicated vs. Free
Spectrum Choice, in Proc. ACM MobiHoc '11, May 2011.
[5] O. Habachi and Y. Hayel, Optimal sensing strategy for opportunistic secondary users in a cognitive radio network, in the
13th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM),
2010.
[6] I. Akyildiz, W. Lee, M. Vuran and S. Mohanty, A Survey on Spectrum Management in Cognitive Radio Networks, IEEE
Communications Magazine, 2008.
[7] Q. Zhao et al., Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP
framework, IEEE Journal on Selected Areas in Communications, vol. 25, no. 3, April 2007.
[8] H. Liu, B. Krishnamachari and Q. Zhao, Cooperation and learning in multiuser opportunistic spectrum access, in ICC,
2008.
[9] H. Zheng and C. Peng, Collaboration and Fairness in Opportunistic Spectrum Access, in Proc. of IEEE International
Conference on Communications (ICC), 2005.
[10] A. T. Hoang, Y. C. Liang, D. T. C. Wong, Y. Zeng and R. Zhang, Opportunistic Spectrum Access for Energy-constrained
Cognitive Radios, IEEE Transactions on Wireless Communications, 2008.
[11] Y. Chen, Q. Zhao and A. Swami, Distributed Spectrum Sensing and Access in Cognitive Radio Networks With Energy
Constraint, IEEE Transactions on Signal Processing, February 2009.
[12] K. Challapali, C. Cordeiro and D. Birru, Evolution of spectrum-agile cognitive radios: first wireless internet standard and
beyond, in Proceedings of WICON, 2006.
[13] Q. Zhao, L. Tong and A. Swami, Decentralized cognitive MAC for dynamic spectrum access, in Proc. 1st IEEE Symp.
New Frontiers Dynamic Spectrum Access Networks, Nov. 2005.
[14] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, Wiley Series in
Probability and Statistics, 2005.
[15] R. D. Smallwood and E. J. Sondik, The optimal control of partially observable Markov decision processes over a finite
horizon, Operations Research, vol. 21, pp. 1071-1088, 1973.
[16] W. S. Lovejoy, Some Monotonicity Results for Partially Observed Markov Decision Processes, Oper. Res., vol. 35, no.
5, pp. 736-743, Sept. 1987.
[17] S. Shellhammer, A. Sadek and W. Zhang, Technical Challenges for Cognitive Radio in the TV White Space Spectrum,
Information Theory and Applications, 2009.

Fig. 1. Using cognitive radio in ad-hoc communication. If the licensed frequency f1 is not used by primary users, secondary users can communicate in ad-hoc mode using f1.

Fig. 2. Cognitive radio network architecture.

Fig. 3. The channel transition probabilities for channel i.

Fig. 4. The belief update function ns with respect to the packet delay.

Fig. 5. Optimal policy with one licensed channel.

Fig. 6. Optimal policy with one licensed channel.

Fig. 7. Optimal policy for the secondary user in scenario 1.

Fig. 8. Optimal policy for the secondary user in scenario 2.

Fig. 9. Optimal policy for the secondary user in scenario 3.

Fig. 10. Average reward as a function of the number of iterations for scenario 2.

Fig. 11. Average delay as a function of the number of iterations for scenario 2.

Fig. 12. Average reward as a function of the number of iterations for scenario 1.

Fig. 13. Average delay as a function of the number of iterations for scenario 1.

Fig. 14. The functions F (, l) and G(, l).
