Abstract
Cognitive radio (CR) has been considered as a promising technology to enhance spectrum efficiency via opportunistic transmission at the link level. Basic CR features allow secondary users to transmit only when the licensed channel is not occupied by primary users. However, waiting for an idle time slot may induce large packet delays and high energy consumption. We therefore consider an Opportunistic Spectrum Access (OSA) mechanism that takes packet delay and energy consumption into account. We formulate the OSA problem as a Partially Observable Markov Decision Process (POMDP) by explicitly considering the energy constraint as well as the delay constraint, which are often ignored in existing OSA solutions. Specifically, we consider a POMDP with an average reward criterion. We further consider that the secondary user may decide, at any moment, to use another, dedicated way of communication (3G) in order to transmit its packets. We derive structural properties of the value function and we show the existence of optimal strategies in the class of threshold strategies. For implementation purposes, we propose online learning mechanisms that estimate the statistics of the primary user activity. Numerical illustrations validate our theoretical findings, showing that the optimal policy has a threshold structure, and illustrate the convergence of the proposed algorithms for estimating the primary user activity.
Index Terms
POMDP, Cognitive Radio Networks, QoS.
I. INTRODUCTION
The access to the spectrum frequency is defined by licenses assigned to primary users. The latter must conform to the specifications described in the license (e.g. location of the base station, frequency and
January 16, 2012
DRAFT
the maximum transmission power). Nonetheless, a recent study by the Federal Communications Commission (FCC) showed that some frequency bands are not sufficiently used by licensed users at a particular time and in a specific location [1].
Cognitive radio, a new paradigm for designing wireless communication systems, has appeared in order to enhance the utilization of the radio frequency spectrum. Cognitive radio has been considered as the key technology that enables secondary users to access the licensed spectrum. A cognitive user, as defined in [2], is a mobile that has the ability to adapt its transmission parameters (e.g. frequency and modulation) to the wireless environment, and to support different communication standards (e.g. GSM, CDMA, WiMAX and WiFi). Moreover, when there is no opportunity to transmit over the licensed channels, the secondary users may have the possibility to transmit on dedicated channels, generally with a higher cost and/or a lower throughput than transmitting over licensed channels. The possibility of having dedicated channels reserved for secondary mobiles has been proposed in [3], [4] and [5]. These CR architectures are described in [6], where the authors also present the network components, the spectrum and network heterogeneity, and the spectrum management framework. In this paper, we focus on a CR network where a secondary user communicates with other secondary users through an ad-hoc connection using a spectrum hole of a licensed frequency (see Figure 1). A secondary user can be considered as a pair of transmitter-receiver nodes. We assume that there are no interactions with other secondary users. This model also suits the scenario depicted in Figure 2, where the secondary user is a cognitive radio base station which is able to sense the activity of a primary base station and then takes advantage of spectrum holes for transmitting on the downlink. Our main contribution is to consider, in this cognitive radio setting, an optimal opportunistic spectrum access (OSA) mechanism that takes into account energy
and delay constraints. Many works have focused on the study of optimal sensing and access policies in
cognitive radio networks (see [7], [8] and [9]). All these works have focused on either spectrum sensing or dynamic spectrum sharing. In [10], the authors focus on an OSA problem with an energy constraint. They formulate their problem as a POMDP and derive some properties of the optimal sensing control policies. Their control parameter is the duration of sensing used by a secondary user at each time slot to determine the primary user activity. They provide heuristic control policies based on grid-based approximation, myopic policies and static policies, which have low complexity but give suboptimal control policies. Finally, they compare their heuristic methods with optimal solutions obtained using a POMDP solver. The authors of [11] incorporate the energy constraint in the design of the optimal sensing and access policy in cognitive radio networks. They also formulate the problem as a POMDP, but with a finite horizon, and establish a threshold structure of the optimal policy for the single channel
model. However, they did not provide an analytical expression of the optimal control policy. It is noteworthy that the impact of the energy constraint, or the capacity of cognitive radio to support additional Quality-of-Service (QoS) requirements such as the expected delay, has been largely ignored in the literature. Yet it is very important for today's multimedia applications on wireless networks to provide reliable communication while sustaining a certain level of QoS. Taking into account the delay constraint as well as the energy constraint significantly complicates the optimization problem. Without considering the delay constraint, the secondary user achieves the best tradeoff between trying to access the licensed channel and sleeping to conserve energy. The design of such a tradeoff lies among several conflicting objectives: gaining immediate access, gaining spectrum occupancy information, conserving energy and minimizing packet delay. The goal of our paper is thus to study this energy-QoS tradeoff in order to determine an optimal OSA mechanism for secondary users in a cognitive radio network. The major contributions of our work are:
- The problem is formulated as an infinite horizon POMDP with the average criterion. The average criterion is better suited than the discounted or total criteria since the secondary user takes decisions frequently.
- In order to gain insight into the energy-delay constrained OSA problem, we derive structural properties of the value function. We are able to show that the value function is increasing with the belief and decreasing with the packet delay. These structural results not only give us the fundamental design thresholds but also reduce the computational complexity when seeking the optimal policies.
- We show that the secondary user can maximize its average reward by adopting a simple threshold policy, and we derive closed-form expressions for these thresholds.
- Since the secondary user may use a dedicated channel for its packets, the optimal threshold policy guarantees a bounded delay.
The organization of the paper is as follows. In the next section, we describe the primary and secondary user models. Section III presents our Markov decision process framework. In Section IV, we study the existence of an optimal threshold policy for our opportunistic spectrum access with an energy-QoS tradeoff. We propose two learning based protocols for the estimation of the state transition rates in Section V. Before concluding the paper and giving some perspectives, we present, in Section VI, some numerical illustrations.
II. COGNITIVE RADIO NETWORK MODEL
We consider a wireless system with N independent channels licensed to primary users. The state of each channel n ∈ {1, . . . , N} is modeled by a time-homogeneous discrete Markov process sn(t). The
state space is {0, 1}, where sn(t) = 0 means that channel n is free for secondary access and sn(t) = 1 means that channel n is occupied by a primary user. The transition probabilities of channel n are given by the following matrix:

Pn = ( αn   1 − αn
       βn   1 − βn ),

where αn = P(sn(t+1) = 0 | sn(t) = 0) and βn = P(sn(t+1) = 0 | sn(t) = 1).
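As an illustration, the two-state channel model above can be simulated directly; the sketch below is ours (the names `alpha`, `beta` and `simulate_channel` are illustrative), with `alpha` = P(idle → idle) and `beta` = P(busy → idle). The long-run idle fraction should approach the stationary idle probability of the chain.

```python
import random

def simulate_channel(alpha, beta, T, seed=0):
    """Simulate T slots of a two-state primary channel: 0 = idle, 1 = busy.

    alpha = P(s(t+1)=0 | s(t)=0) and beta = P(s(t+1)=0 | s(t)=1).
    """
    rng = random.Random(seed)
    s, states = 0, []
    for _ in range(T):
        p_idle = alpha if s == 0 else beta
        s = 0 if rng.random() < p_idle else 1
        states.append(s)
    return states

# Long-run idle fraction approaches beta / (1 + beta - alpha).
states = simulate_channel(alpha=0.2, beta=0.25, T=200_000)
print(sum(s == 0 for s in states) / len(states))
```

With these rates the empirical idle fraction settles near 0.25/1.05 ≈ 0.238.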
At each time slot, the secondary user chooses one of the following: to stay inactive; to sense a primary channel and to transmit if the channel is available during the time slot, else to wait for the next time slot; or to sense a primary channel and to transmit if the channel is available during the time slot, else to use the dedicated channel.
An important contribution of ours is to consider the average transmission delay of a packet in the optimal decision. Indeed, sensing a primary channel has a cost for the secondary user. We look for an optimal sensing policy which depends on the history of observations and actions.
The belief vector is ω⃗(t) = (ω_1(t), . . . , ω_N(t)), where ω_i(t) is the conditional probability that channel i is available in slot t, given the history of observations and actions. Hence, we study the OSA problem of the secondary user as a POMDP.
At each time slot t, the secondary user takes an action a(t):
a(t) = 0, to be inactive,
a(t) = 1, to sense and to transmit only if the channel is available during the time slot,
a(t) = 2, to sense and to transmit on the sensed channel if it is available, else on the dedicated channel.
The belief of the sensed channel n̂(t) is then updated according to the observation θ(t):

ω_n(t+1) = α_n, if a(t) ≠ 0, θ(t) = 0 and n = n̂(t),
ω_n(t+1) = β_n, if a(t) ≠ 0, θ(t) = 1 and n = n̂(t).    (1)
Note that our model extends easily to sensing not only one channel but a subset of the primary channels.
4) Channel choice policy: At each time slot t, based on its belief vector ω⃗(t), the secondary user chooses a channel n̂(t) ∈ {1, . . . , N} to be sensed. There exist several channel choice policies in the literature, such as deterministic, randomized and periodic policies (see [1]). An example of channel choice policy is to sense the channel which has the highest probability of being idle, i.e. n̂(t) := arg max_n (ω_n(t)).
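The belief dynamics and this greedy channel choice can be sketched as follows (function names are ours, and identical rates across channels are assumed for brevity): a sensed channel's belief jumps to `alpha` (observed idle) or `beta` (observed busy), while every unsensed channel follows the one-step Markov prediction `beta + w * (alpha - beta)`.

```python
def update_beliefs(omega, sensed, observation, alpha, beta):
    """One-slot belief update for N channels with identical rates.

    omega[n] = P(channel n idle). The sensed channel's belief becomes
    alpha (observed idle, theta = 0) or beta (observed busy, theta = 1);
    every other channel follows the prediction beta + w * (alpha - beta).
    """
    new = [beta + w * (alpha - beta) for w in omega]
    if sensed is not None:
        new[sensed] = alpha if observation == 0 else beta
    return new

def greedy_choice(omega):
    """Channel choice policy: sense the channel most likely to be idle."""
    return max(range(len(omega)), key=lambda n: omega[n])

omega = [0.5, 0.9, 0.2]
n_hat = greedy_choice(omega)                      # channel 1 here
omega = update_beliefs(omega, n_hat, observation=1, alpha=0.2, beta=0.25)
print(n_hat, omega)
```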
5) Policies: The strategy of the secondary user is defined by the probability of choosing a given action depending on the system state. We define a sensing and access policy as a vector u = [u_1, u_2, . . .] where u_t is a mapping from a state (ω⃗(t), l(t)) to an action a(t). The set of policies is denoted by U. A stationary policy is a mapping that specifies for each state, independently of the time slot t, an action to be chosen. In the next section, we show that our POMDP problem has an optimal stationary policy, which allows us to restrict our attention to stationary policies.
6) Reward and costs:
Reward: Let σ be the reward representing the number of delivered bits when the secondary user transmits its packet.
Costs: Let c_s be the energy cost for sensing a primary channel, measured in monetary units. This cost depends on the action a(t) as:

c_s(a(t)) = c_s, if a(t) > 0,
            0,   if a(t) = 0.

The primary user and the service provider for the dedicated access charge a price for each packet transmitted. Those prices are respectively P_p for a transmission over a primary channel and P_3G for a transmission over the dedicated channel.
Hence, when the secondary user transmits a packet successfully, it gets the reward z_t(a(t), θ(t)), which depends on the action a(t) and the observation θ(t):

z_t(a(t), θ(t)) = 0,         if a(t) = 0,
                  σ − P_p,   if a(t) ≥ 1 and θ(t) = 0,
                  σ − P_3G,  if a(t) = 2 and θ(t) = 1,
                  0,         if a(t) = 1 and θ(t) = 1.
Instantaneous reward: At time slot t, the instantaneous reward r_t of a secondary user depends on the system state (ω⃗(t), l(t)) and the action a(t), and is expressed by:

r_t((ω⃗(t), l(t)), a(t)) = z_t(a(t), θ(t)) − f(l(t)) − c_s(a(t)),

where f(l) is a penalty function increasing with the packet delay l.
The problem faced by the secondary user consists of finding the sensing policy u that maximizes its expected average reward, defined by:

R(u) = lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} r_t((ω⃗(t), l(t)), a(t)) | ω⃗(0) ],

where ω⃗(0) is the initial belief vector. Then our objective is to find an optimal sensing policy u* that maximizes the average reward R(u), i.e.:

u* = arg max_u lim_{T→∞} (1/T) E[ Σ_{t=1}^{T} r_t((ω⃗(t), l(t)), a(t)) | ω⃗(0) ].    (2)
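The average-reward criterion lends itself to Monte-Carlo estimation. As a minimal sketch (the "always take action 2" policy, the one-channel setting and the parameter values are our illustrative assumptions, not the paper's), the following estimates R for one fixed stationary policy:

```python
import random

def average_reward_action2(alpha, beta, sigma, Pp, P3G, cs, T=100_000, seed=1):
    """Monte-Carlo estimate of R(u) for the stationary policy 'always take
    action 2': sense every slot, transmit on the primary channel if idle,
    on the dedicated (3G) channel otherwise. Every packet then leaves in
    one slot, so no delay penalty accumulates.
    """
    rng = random.Random(seed)
    s, total = 0, 0.0          # s is the true channel state, 0 = idle
    for _ in range(T):
        p_idle = alpha if s == 0 else beta
        s = 0 if rng.random() < p_idle else 1
        reward = (sigma - Pp) if s == 0 else (sigma - P3G)
        total += reward - cs   # minus the per-slot sensing cost
    return total / T

r = average_reward_action2(alpha=0.2, beta=0.25, sigma=35, Pp=10, P3G=80, cs=5)
print(r)
```

The estimate settles near the stationary average pi(0)·(σ − Pp − cs) + (1 − pi(0))·(σ − P3G − cs) for this policy.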
For some particular MDP and POMDP problems, an optimal policy can be found within the smaller set of stationary policies. We prove in the following proposition that there exists an average optimal stationary policy for our POMDP problem.
Proposition 1: There exists an average optimal stationary policy for our POMDP formulation described in (2).
Proof: See Appendix A.
Given this result, we can restrict our problem to the set U_S of stationary policies. Then, for the remainder of this paper, we omit the time index t and we look for an optimal sensing policy which is a mapping from a system state (ω⃗, l) to an action a, independently of the time slot t. We now make a first analysis of the value function of the POMDP.
We denote by Λ_ns(ω⃗) the function that updates the belief vector ω⃗ when the user chooses to be inactive in the current slot, i.e. the secondary user takes action 0. The function Λ_s(ω⃗|θ) updates the belief vector ω⃗ when the secondary user senses a licensed channel in the current slot and observes θ, i.e. the secondary user takes action 1 or 2.
The value function is denoted V(ω⃗, l). Let us denote by Q_a(ω⃗, l) the action-value function for taking action a in the current slot when the information state is (ω⃗, l). The value function satisfies the average-reward optimality equation

g_u + V(ω⃗, l) = max_{a∈A} Q_a(ω⃗, l),    (3)

where g_u is the optimal average reward (gain).
We determine the action-value function for each action 0, 1 and 2. When the secondary user decides to wait, i.e. to take action a = 0, we have:

Q_0(ω⃗, l) = −f(l) + V(Λ_ns(ω⃗), l + 1).    (5)

When the secondary user chooses to sense channel n̂ and decides to wait for the next time slot if channel n̂ is busy, i.e. to take action 1, we have:

Q_1(ω⃗, l) = −c_s + ω_n̂ (σ − P_p + V(Λ_s(ω⃗|θ = 0), 1)) + (1 − ω_n̂)(−f(l) + V(Λ_s(ω⃗|θ = 1), l + 1)).    (6)

When the secondary user chooses to sense channel n̂ and to transmit using the dedicated channel if channel n̂ is busy, i.e. to take action 2, we have:

Q_2(ω⃗, l) = −c_s + ω_n̂ (σ − P_p + V(Λ_s(ω⃗|θ = 0), 1)) + (1 − ω_n̂)(σ − P_3G + V(Λ_s(ω⃗|θ = 1), 1)).    (7)
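The optimality equation (3) together with the action values (5)-(7) can be solved numerically for one channel by relative value iteration on a discretized belief grid. This is a hedged sketch: the grid resolution, iteration count, damping, the delay cap and the linear penalty f(l) = l are our assumptions for illustration, not the paper's.

```python
def solve_pomdp(alpha, beta, sigma, Pp, P3G, cs,
                l_max=15, grid=51, iters=800):
    """Relative value iteration for g_u + V(w, l) = max_a Q_a(w, l),
    single licensed channel, beliefs on a uniform grid, f(l) = l assumed.
    Returns the gain estimate g, the normalized table V and a policy."""
    ws = [i / (grid - 1) for i in range(grid)]
    idx = lambda w: round(w * (grid - 1))
    ia, ib = idx(alpha), idx(beta)      # beliefs after observing idle / busy
    f = lambda l: float(l)              # assumed linear delay penalty

    def backup(V, i, l):
        w, lp = ws[i], min(l + 1, l_max)
        ins = idx(beta + w * (alpha - beta))          # Lambda_ns(w)
        q0 = -f(l) + V[ins][lp]                       # wait, eq. (5)
        q1 = (-cs + w * (sigma - Pp + V[ia][1])       # sense, wait if busy (6)
              + (1 - w) * (-f(l) + V[ib][lp]))
        q2 = (-cs + w * (sigma - Pp + V[ia][1])       # sense, 3G if busy (7)
              + (1 - w) * (sigma - P3G + V[ib][1]))
        return (q0, q1, q2)

    V, g = [[0.0] * (l_max + 1) for _ in range(grid)], 0.0
    for _ in range(iters):
        newV = [[max(backup(V, i, l)) if l else 0.0
                 for l in range(l_max + 1)] for i in range(grid)]
        g = newV[0][1]                  # gain estimate at the reference state
        # damped, normalized update keeps the iteration stable
        V = [[0.5 * V[i][l] + 0.5 * (newV[i][l] - g)
              for l in range(l_max + 1)] for i in range(grid)]

    def policy(i, l):
        qs = backup(V, i, l)
        return qs.index(max(qs))
    return g, V, policy

g, V, policy = solve_pomdp(alpha=0.15, beta=0.1, sigma=35, Pp=10, P3G=80, cs=5)
print(round(g, 2), policy(0, 1), policy(50, 1))
```

The resulting policy waits (action 0) at low beliefs and senses at high beliefs, and V is numerically increasing in the belief and decreasing in the delay, matching Propositions 3-5.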
We focus on the case of one licensed channel; the multichannel case will be studied in Section III-C. We assume that there exists a packet delay l* at which the secondary user transmits its packet using the dedicated channel if the observation is θ = 1. This assumption is realistic, as the user has no interest in keeping the packet in its buffer indefinitely. We denote by α and β the transition rates of the channel, and by ω the belief of the secondary user. We consider that α ≥ β. When α ≤ β, the analysis is similar and the results are unchanged.
In this single-channel case, the belief updates reduce to Λ_s(ω|θ = 0) = α, Λ_s(ω|θ = 1) = β and Λ_ns(ω) = β + ω(α − β). Moreover, Λ_ns(ω) ≥ ω for ω ≤ π(0) and Λ_ns(ω) ≤ ω for ω ≥ π(0), where π(0) = β / (1 + β − α) is the stationary probability that the primary channel is idle. Figure 4 depicts these belief update functions.
It has been shown in [15] that the value function of a POMDP over a finite time horizon is piecewise linear and convex with respect to the belief vector. In Proposition 2, we show that the value function of our POMDP problem over an infinite horizon with the average criterion also has this property.
Proposition 2: The value function V(ω, l) given in (3) is piecewise linear and convex with respect to the belief vector ω.
Proof: See Appendix C.
Note that monotonicity results help us establish the structure of the optimal policies (see [16] for an example) and provide insights into the underlying problem. The following propositions state monotonicity results of the value function with respect to each of its parameters.
Proposition 3: For each belief vector ω, the value function is monotonically decreasing with the packet delay l, i.e. V(ω, l) ≥ V(ω, l′) for l ≤ l′.
Proof: See Appendix D.
This result is intuitive: for the same belief and a given packet delay, the maximum expected remaining reward that can be accrued is lower than the one the secondary user can get with a smaller packet delay.
Proposition 4: The value function is monotonically increasing with the belief vector ω, i.e. V(ω, l) ≤ V(ω′, l) for ω ≤ ω′.
Let us focus on Proposition 4. The monotonicity with respect to the belief vector depends on the order relation over the belief set and also on the monotonicity of the belief update functions Λ_s(ω⃗|θ = 0) and Λ_s(ω⃗|θ = 1) with respect to the belief vector.
IV. OPTIMAL THRESHOLD POLICY
Let us focus on the characteristics of an optimal policy for the secondary user. Intuitively, when the delay l and the belief probability ω are small, the secondary user waits for a better opportunity. Thus, depending on the belief probability, the secondary user decides whether or not to sense a primary channel. We prove in this section that this intuition is correct: there exists an optimal sensing policy which has a threshold structure.
The first decision for a secondary user is whether to sense licensed channels or to wait, depending on its belief ω and the current delay l of the packet. The following result gives a threshold on the belief probability that answers this question.
Proposition 5: For all packet delays l, the optimal action for the secondary user is to wait for the next slot, i.e. a*(ω, l) = 0, if and only if ω ≤ ω*, where ω* is the solution of the equation ω = max(0, min{Th1(ω, l), Th2(ω, l)}) with

Th1(ω, l) = (V(Λ_ns(ω), l + 1) − V(β, l + 1) + c_s) / (f(l) + σ − P_p + V(α, 1) − V(β, l + 1)),

and

Th2(ω, l) = (V(Λ_ns(ω), l + 1) − V(β, 1) + c_s − f(l) − σ + P_3G) / (P_3G − P_p + V(α, 1) − V(β, 1)).
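Given any value function, a fixed-point equation of the form ω = max(0, min{Th1, Th2}) can be solved by bisection. In the sketch below, Th1 comes from the action-0 versus action-1 comparison and Th2 from the action-0 versus action-2 comparison; the linear `V_toy` is only a placeholder we made up (the true V is piecewise linear and convex), so the resulting number is purely illustrative.

```python
def threshold(l, V, alpha, beta, sigma, Pp, P3G, cs, f=lambda l: float(l)):
    """Solve w* = max(0, min(Th1(w*, l), Th2(w*, l))) by bisection.

    V(w, l) may be any value function increasing in w, decreasing in l.
    """
    lam = lambda w: beta + w * (alpha - beta)               # Lambda_ns(w)
    th1 = lambda w: ((V(lam(w), l + 1) - V(beta, l + 1) + cs)
                     / (f(l) + sigma - Pp + V(alpha, 1) - V(beta, l + 1)))
    th2 = lambda w: ((V(lam(w), l + 1) - V(beta, 1) + cs - f(l) - sigma + P3G)
                     / (P3G - Pp + V(alpha, 1) - V(beta, 1)))
    h = lambda w: max(0.0, min(th1(w), th2(w))) - w
    if h(1.0) >= 0:              # waiting is optimal on the whole belief range
        return 1.0
    lo, hi = 0.0, 1.0            # h(0) >= 0 >= h(1): bisect the sign change
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if h(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Illustrative placeholder value function (NOT the solution of (3)):
V_toy = lambda w, l: 20.0 * w - 2.0 * l
w_star = threshold(2, V_toy, alpha=0.15, beta=0.1,
                   sigma=35, Pp=10, P3G=80, cs=5)
print(w_star)   # here the fixed point of w = (w + 5)/32, i.e. 5/31
```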
Proposition 7: For all beliefs ω, the secondary user chooses to use the dedicated channel instead of waiting for the next slot if and only if the delay l of the current packet verifies:

f(l) + σ − P_3G + V(β, 1) − V(Λ_ns(ω), l + 1) > 0.
V. LEARNING THE PRIMARY USER ACTIVITY
Given the stationary idle probability π_i(0) of channel i, the transition rates are linked by the relation β_i = (1 − α_i) π_i(0) / (1 − π_i(0)). The secondary user maintains the following counters:
The vector K̂ = {K̂_1, . . . , K̂_N}, where K̂_i represents the number of time slots channel i stays in the idle state, i.e. K̂_i is incremented if channel i is sensed and idle at time slots t and t − 1.
The vector I = {I_1, . . . , I_N}, where I_i represents the number of time slots that channel i is sensed and idle.
The vector M̂ = {M̂_1, . . . , M̂_N}, where M̂_i represents the number of time slots that channel i is sensed.
Therefore the secondary user estimates the state transition rates α_i and the idle probabilities π_i(0) based on the following expressions:

α̂_i = K̂_i / I_i   and   π̂_i(0) = I_i / M̂_i.
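The counters above translate directly into an online estimator. This sketch (names ours) simulates one always-sensed channel and recovers α and β, the latter through the relation β = (1 − α) π(0) / (1 − π(0)):

```python
import random

def estimate_rates(alpha, beta, T=200_000, seed=2):
    """Estimate (alpha, beta, pi(0)) for one channel sensed every slot.

    K counts slots idle at both t-1 and t, I counts sensed-idle slots,
    M counts sensed slots; then alpha_hat = K/I and pi0_hat = I/M, and
    beta_hat follows from beta = (1 - alpha) * pi0 / (1 - pi0).
    """
    rng = random.Random(seed)
    K = I = M = 0
    s = 0                      # true channel state, 0 = idle
    for _ in range(T):
        prev, s = s, (0 if rng.random() < (alpha if s == 0 else beta) else 1)
        M += 1                             # channel sensed this slot
        if s == 0:
            I += 1                         # sensed and idle
            if prev == 0:
                K += 1                     # idle at t-1 and t
    alpha_hat = K / I
    pi0_hat = I / M
    beta_hat = (1 - alpha_hat) * pi0_hat / (1 - pi0_hat)
    return alpha_hat, beta_hat, pi0_hat

a_hat, b_hat, p_hat = estimate_rates(0.15, 0.1)
print(a_hat, b_hat, p_hat)
```

With 200,000 slots both estimates land within about 1% of the true rates used in the simulation.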
VI. NUMERICAL ILLUSTRATIONS
The state space grows exponentially with the number of channels (with 4 primary channels, we have approximately 10^6 states). Furthermore, we consider the following system parameters: P_3G = 80, P_p = 10, c_s = 5 and σ = 35.
We propose to illustrate our results in three scenarios with symmetric channels:
1) Scenario 1: Primary channels are often occupied (α_1 = α_2 = α_3 = α_4 = 0.15 and β_1 = β_2 = β_3 = β_4 = 0.1).
We first describe the optimal threshold policy given perfect knowledge of the transition rates of the primary channels. Second, we give some results using estimated values of the transition rates.
A. Single channel model
We consider only one licensed channel with the transition rates α = 0.15 and β = 0.1. Figure 5 illustrates the optimal policy of the secondary user depending on the belief and the packet delay. For each packet delay, the secondary user has a threshold policy depending on the belief. Moreover, the threshold belief probability is decreasing with the packet delay. We observe that the maximum packet delay is 13 slots.
Consider the same scenario with transition rates α = 0.2 and β = 0.25. We observe in Figure 6 that the secondary user policy also has a threshold structure. A packet has at most a delay of 3 slots.
B. Optimal policy with perfect knowledge of α and β
We simulate the first scenario and we depict in Figure 7 the thresholds ω*(l) determined in Proposition 5, depending on the packet delay l. For each packet delay l, the best action for the secondary user is to wait for the next slot if its belief probability is lower than ω*(l). Otherwise, the secondary user decides to sense the primary channels. In this context, where the primary channels are often occupied (Scenario 1, Figure 7), the maximum packet delay l* obtained with Proposition 7 equals 9. Then, when the packet delay is l = 9, the user decides to sense and to transmit using the dedicated channel if the sensed channel is occupied. We describe the optimal policy for Scenario 2 in Figure 8. The maximum packet delay in this case is l* = 5. This result is intuitive as, in this scenario, the primary channels are more often idle, inducing a lower packet delay. Finally, the last scenario, depicted in Figure 9, implies a maximum packet delay of 5. We observe that the secondary user policy also has a threshold
structure. However, the threshold belief probability is not decreasing with the packet delay. In fact, since the primary channels are more static (the probability for each channel to stay occupied or idle is high), a kind of periodic threshold strategy appears.
C. Average reward using estimated values of α and β
We consider the learning approaches proposed in Section V. Let us first compare the average reward and the average delay obtained with the two learning based protocols against those obtained with perfect knowledge of the channels' transition rates. Figures 10 and 11 show that both learning protocols converge: we observe that both protocols converge within 400 iterations. However, in Figures 12 and 13, we can observe that the transition matrix estimation method converges 3 times faster (about 1000 iterations) than the rate estimators method (about 3000 iterations). Moreover, the average reward and the average packet delay using the estimated transition rates are close to the average reward and the average delay with known channel transition rates.
VII. CONCLUSION AND PERSPECTIVES
In this paper, we have used a POMDP framework for determining an optimal sensing policy for opportunistic spectrum sensing and access (OSA), taking into account an energy-delay tradeoff for secondary users. Introducing a QoS metric in the spectrum sensing policy is very important with the emergence of heterogeneous mobiles that are able to transmit their traffic, with possibly high QoS constraints, at any time over different ways of communication like 3G, WiFi and TV White Space. We have provided some structural properties of the value function and then proved the existence of an optimal average stationary spectrum sensing policy. We have been able to determine explicitly the threshold structure of the optimal policy. The interaction between several secondary users has not been considered here, and has rarely been considered in the literature. This perspective is also very important because, if the channel choice policy is the same for all the secondary users, there could be many collisions between secondary users that have sensed the same idle primary channel. This decentralized system with partial information can be modeled using decentralized POMDPs or interactive POMDPs and will be studied in future work.
APPENDIX
A. Proof of Proposition 1
We use Theorems 8.10.9 and 8.10.7 from [14] to prove the existence of an optimal stationary policy for our problem. First, the immediate reward r_t((s, l), a) is finite, i.e. −∞ < r_t((s, l), a) < +∞
(as all costs and rewards are finite). Second, we prove that there exists a stationary policy d for which the derived Markov chain is positive recurrent.
Let us focus on the following belief vector:

ω⃗_0 = (ω_1, ω_2, . . . , ω_N) for j = 1, . . . , N,

where ω_j represents the belief of a channel that was not sensed for j successive slots.
Denote by d the stationary policy which senses licensed channels at every slot, with a periodic channel choice policy. Let us prove that the derived Markov chain is positive recurrent. The probability that the system returns to the initial belief from any state is p(ω⃗) = ∏_{j} (1 − Λ^n(ω_j)) > 0, n ∈ {0, . . . , N}, and then the return time to the initial belief ω_j follows a geometric distribution, so that E{τ_j} = 1/p(ω_j) < ∞.
Third, let us prove that g_d > −∞ and that the set {b ∈ S_b : r_t((s, l), a) > g_d} is nonempty. As the policy d senses licensed channels every slot, g_d = −f(l(t)) − c_s + (f(l(t)) + σ − P_p) ω_n. If we have

−f(l(t)) − c_s + (f(l(t)) + σ − P_p) ω_n > max{−f(l(t)), −c_s + σ − P_3G + (P_3G − P_p) ω_n}

for all beliefs b, then the policy that always senses primary channels is optimal and we have achieved our goal. Otherwise, the set {b ∈ S_b : r_t((s, l), a) > g_d} is nonempty.
Finally, we obtain from Theorems 8.10.9 and 8.10.7 of [14] that there exists an average optimal stationary policy.
B. Proof of Lemma 1
First, the update function Λ_ns is linear in the belief because Λ_ns(ω) = β + ω(α − β). As we consider the case where α ≥ β, the update function is increasing with the belief.
Second, let us prove that Λ_ns(ω) ≥ ω if ω ≤ π(0), by induction on the belief.
1) We have the initial condition: π(0) = β / (1 + β − α) and Λ_ns(π(0)) = β + π(0)(α − β) = π(0).
C. Proof of Proposition 2
The proof of Proposition 2 is similar to [15], where the authors consider the finite time horizon problem. Hence, we briefly describe the procedure. Considering the maximum packet delay l* and for all belief vectors ω, the value function V(ω, l*) is linear in the belief because

V(ω, l*) = Q_2(ω, l*) − g_u
= −g_u − c_s + σ − P_3G + V(Λ_s(ω|θ = 1), 1) + ω_n̂ (P_3G − P_p + V(Λ_s(ω|θ = 0), 1) − V(Λ_s(ω|θ = 1), 1)).

Then the value function V(ω, l*) can be rewritten as an inner product of the belief vector and a γ-vector. As Q_2(ω, l) = Q_2(ω, l*) for all l, the action-value function Q_2(ω, l) can also be rewritten as an inner product of the belief vector and a γ-vector. We suppose that Proposition 2 holds for all packet delays higher than l + 1 and we prove that the proposition is true for packet delay l. After some algebra, we can rewrite the action-value functions given in (5) and (7) in terms of γ-vectors:

Q_0(ω, l) = −f(l) + max_γ <Λ_ns(ω), γ> = −f(l) + Σ_{s∈S} ω_s [ Σ_{s′∈S} P(s′|s) γ^{Λ_ns(ω)}_{l+1, s′} ],    (8)

and similarly for (9), where γ^{Λ_ns(ω)}_{l+1} and γ^{Λ_s(ω|θ=1)}_{l+1} are, respectively, the γ-vectors for the regions containing the belief vectors Λ_ns(ω) and Λ_s(ω|θ = 1). Each term in the square brackets of (8) and (9) is an element γ_{s,l} of a γ-vector γ_l. Then the action-value functions can be rewritten as an inner product of the belief vector and a γ-vector γ_l. Moreover, there is only a finite number of such γ-vectors γ_l since we have a finite set of beliefs for all l. As the maximum of a finite set of piecewise linear and convex functions is also piecewise linear and convex, Proposition 2 holds.
D. Proof of Proposition 3
Let us first prove that the value function V(ω, l) is monotonically decreasing with the packet delay l for all belief vectors ω. The secondary user takes action 2 for all ω when the packet delay is l*, thus we have:

V(ω, l*) = −g_u − c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(σ − P_3G + V(β, 1)).

The secondary user chooses the action that maximizes its average utility, and thus:

V(ω, l* − 1) = max_a Q_a(ω, l* − 1) − g_u ≥ Q_2(ω, l* − 1) − g_u = V(ω, l*).

Let us prove that this property holds for all packet delays using a backward induction on l:
1) Initial condition: for all belief vectors ω, V(ω, l*) ≤ V(ω, l* − 1).
2) We suppose that V(ω, l + 2) ≤ V(ω, l + 1), ∀ω.
3) We have:

Q_0(ω, l) = −f(l) + V(Λ_ns(ω), l + 1)
≥ −f(l + 1) + V(Λ_ns(ω), l + 2)
= Q_0(ω, l + 1).

Q_1(ω, l) = −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(−f(l) + V(β, l + 1))
≥ −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(−f(l + 1) + V(β, l + 2))
= Q_1(ω, l + 1).

Q_2(ω, l) = −c_s + σ − P_3G + V(β, 1) + ω(P_3G − P_p + V(α, 1) − V(β, 1))
= Q_2(ω, l + 1).

The inequalities come from the induction assumption and the monotonicity of the penalty function f(l). Thus, we have, for all ω:

V(ω, l) ≥ V(ω, l + 1).
E. Proof of Lemma 2
We prove this lemma by contradiction, so we suppose that σ − P_p + V(α, 1) < σ − P_3G + V(β, 1). We first prove the following:

g_u + V(α, 1) ≥ Q_2(α, 1),
g_u + V(α, 1) ≥ −c_s + α(σ − P_p + V(α, 1)) + (1 − α)(σ − P_3G + V(β, 1)),
g_u + V(α, 1) ≥ −c_s + σ − P_p + V(α, 1),
g_u ≥ −c_s + σ − P_p,

and we take the assumption that the immediate reward when the channel is idle is positive, i.e. −c_s + σ − P_p ≥ 0, so that g_u ≥ 0.
We know that the secondary user takes action 2 in the state (ω, l*) for all belief vectors ω, i.e. a*(ω, l*) = 2, ∀ω. We have:

g_u + V(ω, l*) = −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(σ − P_3G + V(β, 1)).

Since the value function V(ω, l) is decreasing with the packet delay l (see Proposition 3), we obtain Q_0(ω, l* − 1) < V(ω, l*) < V(ω, l* − 1); this inequality is due to the assumption that σ − P_p + V(α, 1) < σ − P_3G + V(β, 1), to Λ_ns(ω) ≥ ω, and to the fact that f(l* − 1) is positive. As we proved that g_u ≥ 0, the secondary user does not take action 0 when the packet delay is l* − 1. For action 1, we have:

Q_1(ω, l* − 1) = −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(−f(l* − 1) + V(β, l*))
= −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(−g_u − f(l* − 1) − c_s + β(σ − P_p + V(α, 1)) + (1 − β)(σ − P_3G + V(β, 1)))
< −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(−g_u − f(l* − 1) − c_s + σ − P_3G + V(β, 1))
< −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(σ − P_3G + V(β, 1))
= Q_2(ω, l* − 1).

The first inequality is due to the assumption that σ − P_p + V(α, 1) < σ − P_3G + V(β, 1) and the second one holds because g_u, f(l* − 1) and c_s are positive. Thus, the optimal strategy is to take action 2 when the packet delay is l* − 1.
Let us now prove by backward induction on l that the optimal action is action 2 for all belief vectors ω ≤ π(0):
If the secondary user takes action 2 when the packet delay is l*, then it also takes action 2 when the packet delay is l* − 1.
We suppose that the secondary user takes action 2 when the packet delay is l < l* − 1.
By the same arguments as above (the first inequality due to the assumption that σ − P_p + V(α, 1) < σ − P_3G + V(β, 1), the second because g_u, f(l − 1) and c_s are positive), the optimal strategy is to take action 2 when the packet delay is l − 1, and the secondary user does not take action 1 with packet delay l − 1. Finally, the secondary user takes action 2 for all packet delays and all beliefs lower than π(0).
As the secondary user takes action 2 also for the state (α, 1), we have:

g_u + V(α, 1) = −c_s + α(σ − P_p + V(α, 1)) + (1 − α)(σ − P_3G + V(β, 1)),
g_u + V(α, 1) = −c_s + σ − P_3G + V(β, 1) + α(P_3G − P_p + V(α, 1) − V(β, 1)),
g_u = −c_s + σ − P_3G + V(β, 1) − V(α, 1) + α(P_3G − P_p + V(α, 1) − V(β, 1)).

Thus, we obtain:

g_u + V(β, 1) = Q_2(β, 1) = g_u + V(α, 1) + (β − α)(P_3G − P_p + V(α, 1) − V(β, 1)),

and

V(β, 1) − V(α, 1) = (β − α)(P_3G − P_p + V(α, 1) − V(β, 1)),
(V(β, 1) − V(α, 1))(1 + β − α) = (β − α)(P_3G − P_p),
V(β, 1) − V(α, 1) = (β − α)(P_3G − P_p) / (1 + β − α) ≤ 0,

since β ≤ α, whereas the assumption σ − P_p + V(α, 1) < σ − P_3G + V(β, 1) requires V(β, 1) − V(α, 1) > P_3G − P_p > 0. This leads to a contradiction, and therefore σ − P_p + V(α, 1) ≥ σ − P_3G + V(β, 1). The analysis is similar when ω > π(0).
F. Proof of Proposition 4
Let us prove that the value function V(ω, l) is increasing with the belief vector ω for any packet delay l. For all ω_1 ≤ ω_2, we have:

V(ω_1, l*) = −g_u − c_s + σ − P_3G + V(β, 1) + ω_1 (P_3G − P_p + V(α, 1) − V(β, 1))
≤ −g_u − c_s + σ − P_3G + V(β, 1) + ω_2 (P_3G − P_p + V(α, 1) − V(β, 1))
= V(ω_2, l*).

This inequality results from Lemma 2, which ensures that P_3G − P_p + V(α, 1) − V(β, 1) ≥ 0. Let us prove that this property holds for all packet delays l using backward induction:

Q_0(ω_1, l) = −f(l) + V(Λ_ns(ω_1), l + 1) ≤ −f(l) + V(Λ_ns(ω_2), l + 1) = Q_0(ω_2, l).

The inequality is a direct result of the induction assumption and Lemma 1. First case: we suppose that σ + f(l) − P_p + V(α, 1) − V(β, l + 1) ≥ 0. We also have:

Q_1(ω_1, l) = −c_s − f(l) + V(β, l + 1) + ω_1 (σ + f(l) − P_p + V(α, 1) − V(β, l + 1))
≤ −c_s − f(l) + V(β, l + 1) + ω_2 (σ + f(l) − P_p + V(α, 1) − V(β, l + 1))
= Q_1(ω_2, l).

Q_2(ω_1, l) = −c_s + σ − P_3G + V(β, 1) + ω_1 (P_3G − P_p + V(α, 1) − V(β, 1))
≤ −c_s + σ − P_3G + V(β, 1) + ω_2 (P_3G − P_p + V(α, 1) − V(β, 1))
= Q_2(ω_2, l).

The inequalities come from Lemma 2. Thus, we have proved that V(ω_1, l) ≤ V(ω_2, l).
Second case: we suppose that σ + f(l) − P_p + V(α, 1) − V(β, l + 1) < 0. Then for all ω we have:

Q_1(ω, l) = −c_s + ω(σ − P_p + V(α, 1)) + (1 − ω)(−f(l) + V(β, l + 1))
≤ −c_s − f(l) + V(β, l + 1)
≤ −f(l) + V(β, l + 1)
≤ −f(l) + V(Λ_ns(ω), l + 1)
= Q_0(ω, l).

In fact, we have Λ_ns(ω) ≥ β for all belief vectors ω, and the value function V(ω, l) is increasing with the belief for the packet delay l + 1 (induction assumption). Thus, g_u + V(ω, l) = max{Q_0(ω, l), Q_2(ω, l)}. Moreover, we have:

Q_0(ω_1, l) = −f(l) + V(Λ_ns(ω_1), l + 1)
≤ −f(l) + V(Λ_ns(ω_2), l + 1)
= Q_0(ω_2, l).

The inequality is a direct result of the induction assumption. Finally, we have:

Q_2(ω_1, l) = −c_s + σ − P_3G + V(β, 1) + ω_1 (P_3G − P_p + V(α, 1) − V(β, 1))
≤ −c_s + σ − P_3G + V(β, 1) + ω_2 (P_3G − P_p + V(α, 1) − V(β, 1))
= Q_2(ω_2, l).
G. Proof of Proposition 5
In this proposition, we determine explicitly the best action a (, l) for the secondary user depending
on the belief and the packet delay l. At each time slot and for a given information state (, l), the
secondary use will decide to take the action 0 if Q0 (, l) max {Q1 (, l), Q2 (, l)}.
First we assume that Q1 (, l) > Q2 (, l), then, let us compare Q0 (, l) and Q1 (, l). The inequality
Q0 (, l) Q1 (, l) is equivalent to:
f (l) + V (ns (|), l + 1) cs + ( Pp + V (, 1)) + (1 )(f (l) + V (, l + 1)),
V (ns (|), l + 1) V (, l + 1) cs + (f (l) + Pp + V (, 1) V (, l + 1)).
DRAFT
23
As the value function V(π, l) is decreasing in the packet delay l and increasing in the belief π,
we have V(p, 1) ≥ V(q, l+1). As we assumed that the immediate reward ρ is higher than the cost
P_p, we obtain that ρ + f(l) − P_p + V(p, 1) − V(q, l+1) is positive. Then, we have the following
equivalence:
Q0(π, l) ≥ Q1(π, l) ⟺ V(τ_ns(π), l+1) − V(q, l+1) ≥ −c_s + π(ρ + f(l) − P_p + V(p, 1) − V(q, l+1)).
We proved in Proposition 2 that the value function is piecewise linear and convex (PWLC). Define
F(π, l) = V(τ_ns(π), l+1) − V(q, l+1) and G(π, l) = −c_s + π(ρ + f(l) − P_p + V(p, 1) − V(q, l+1)).
Then, for all packet delays, the function F(π, l) is PWLC and increasing in π, and the function G(π, l) is
linear and increasing in π. Note that:
If F(π, l) ≥ G(π, l), then Q0(π, l) ≥ Q1(π, l) and therefore the best action is 0.
If F(π, l) < G(π, l), then Q0(π, l) < Q1(π, l) and therefore the best action is 1.
Let us study the sign of the function H(π, l) = F(π, l) − G(π, l). Under these settings, six cases can
arise:
1) F(π, l) is always higher than G(π, l), see Figure 14, case 1.
2) F(π, l) is always lower than G(π, l), see Figure 14, case 2.
3) F(π, l) and G(π, l) intersect once and F(0, l) < G(0, l), see Figure 14, case 3.
4) F(π, l) and G(π, l) intersect once and F(0, l) ≥ G(0, l), see Figure 14, case 4.
5) F(π, l) and G(π, l) intersect twice and F(0, l) ≥ G(0, l), see Figure 14, case 5.
6) G(π, l) is tangent to F(π, l), see Figure 14, case 6.
Let us now examine F and G on the beliefs π ≥ π₀, where π₀ denotes the fixed point of the update
operator τ_ns. First, let us prove that g_u > −f(l). We have:
g_u + V(π, 1) ≥ Q0(π, 1),
g_u + V(π, 1) ≥ −f(1) + V(τ_ns(π), 2),
g_u ≥ V(τ_ns(π), 2) − V(π, 1) − f(1),
g_u > −f(l).
The inequality follows from the monotonicity of the value function and τ_ns(π) ≤ π. Suppose now that
the secondary user chooses the action 0 in a state (π, l) with π ≥ π₀. We have:
g_u + V(π, l) = −f(l) + V(τ_ns(π), l+1),
g_u + V(π, l) ≤ −f(l) + V(τ_ns(π), l),
g_u + V(π, l) ≤ −f(l) + V(π, l),
g_u ≤ −f(l).
This leads to a contradiction, as g_u > −f(l). Thus, Q0(π, l) < Q1(π, l), and therefore F(π, l) <
G(π, l) for every π ≥ π₀. Consequently, the cases 1, 3, 5 and 6 are eliminated. Finally, the optimal
policy is of threshold type: the secondary user takes the action 0 for all beliefs π lower than the threshold
Th1(π, l) = (V(τ_ns(π), l+1) − V(q, l+1) + c_s) / (ρ + f(l) − P_p + V(p, 1) − V(q, l+1)).
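Given numerical approximations of F and G (for instance from a value-iteration approximation of V), the threshold can be located by bisection on H = F − G, relying only on the single-crossing configuration of case 4 (F ≥ G near π = 0 and F < G near π = 1). The functions F and G below are illustrative stand-ins with the right shape (PWLC convex increasing versus linear increasing), not the paper's value functions:

```python
# Bisection for the crossing of F and G in the single-crossing case (case 4):
# F >= G near pi = 0 and F < G near pi = 1. F and G below are illustrative
# stand-ins, not the paper's value functions.
def threshold(F, G, lo=0.0, hi=1.0, tol=1e-9):
    assert F(lo) >= G(lo) and F(hi) < G(hi)  # case-4 configuration
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) >= G(mid):
            lo = mid        # still in the "take action 0" region
        else:
            hi = mid        # already in the "take action 1" region
    return 0.5 * (lo + hi)

F = lambda pi: max(0.05 + 0.1 * pi, 0.3 * pi - 0.05)  # PWLC, convex, increasing
G = lambda pi: -0.1 + 0.8 * pi                        # linear, increasing, G(0) < 0
th = threshold(F, G)   # crossing at pi = 3/14
```

The same routine applies to the second threshold below, since only the single-crossing structure is used.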
Second, we assume that Q2(π, l) > Q1(π, l); then, we have to compare the actions 0 and 2, which
amounts to comparing the action-value functions Q0(π, l) and Q2(π, l). The secondary user takes
the action 0 instead of the action 2 if Q0(π, l) ≥ Q2(π, l), which is equivalent to:
−f(l) + V(τ_ns(π), l+1) ≥ −c_s + π(ρ − P_p + V(p, 1)) + (1 − π)(ρ − P_3G + V(q, 1)),
V(τ_ns(π), l+1) ≥ V(q, 1) + ρ + f(l) − c_s − P_3G + π(P_3G − P_p + V(p, 1) − V(q, 1)).
We have from Lemma 2 that P_3G − P_p + V(p, 1) − V(q, 1) ≥ 0. Then, we can apply the same
analysis as in the previous case with the function F(π, l) = V(τ_ns(π), l+1) and
the function G(π, l) = V(q, 1) + ρ + f(l) − c_s − P_3G + π(P_3G − P_p + V(p, 1) − V(q, 1)). The
latter is linear and increasing in π. We obtain the following threshold policy:
the secondary user takes the action 0 for all beliefs lower than the following threshold:
Th2(π, l) = (V(τ_ns(π), l+1) − V(q, 1) − ρ − f(l) + c_s + P_3G) / (P_3G − P_p + V(p, 1) − V(q, 1)).
H. Proof of Proposition 6
We have from Lemma 1 that if π > π₀, then τ_ns(π) ≤ π. Suppose that the secondary user takes
the action 0 for a belief π > π₀ and a packet delay l. Then we have:
g_u + V(π, l) = −f(l) + V(τ_ns(π), l+1),
g_u + V(π, l) ≤ −f(l) + V(τ_ns(π), l),
g_u + V(π, l) ≤ −f(l) + V(π, l),
g_u ≤ −f(l).
This leads to a contradiction, as g_u > −f(l). The first inequality holds because the value function is decreasing
in the packet delay, and the second because the value function is increasing in the
belief and τ_ns(π) ≤ π. Thus, if π > π₀, then the secondary user never takes the action 0, that is,
Q0(π, l) < max {Q1(π, l), Q2(π, l)}.
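The argument above rests on the behavior of the no-sensing belief update around its fixed point. With hypothetical transition probabilities p = P(idle → idle) and q = P(busy → idle) (illustrative values, not from the paper), the update τ_ns(π) = πp + (1 − π)q has the unique fixed point π₀ = q/(1 − p + q), and a quick check confirms that τ_ns(π) < π exactly above it:

```python
# Fixed point of the no-sensing belief update tau_ns.
# p, q are hypothetical transition probabilities (illustrative values only).
p, q = 0.8, 0.3

def tau_ns(pi):
    """One-step belief update when the channel is not sensed."""
    return pi * p + (1.0 - pi) * q

pi0 = q / (1.0 - p + q)  # unique fixed point of tau_ns (here 0.6)

assert abs(tau_ns(pi0) - pi0) < 1e-12
# tau_ns pulls beliefs toward pi0: below pi for pi > pi0, above pi for pi < pi0.
assert all(tau_ns(x) < x for x in (0.7, 0.8, 0.95))
assert all(tau_ns(x) > x for x in (0.1, 0.3, 0.5))
```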
I. Proof of Proposition 7
Let us compare the action-value functions Q1(π, l) and Q2(π, l) for every belief π and packet delay
l. The secondary user waits for the next time slot after sensing a busy channel if Q1(π, l) ≥ Q2(π, l), which is equivalent
to:
−c_s + π(ρ − P_p + V(p, 1)) + (1 − π)(−f(l) + V(q, l+1)) ≥ −c_s + π(ρ − P_p + V(p, 1))
+ (1 − π)(ρ − P_3G + V(q, 1)),
−f(l) + V(q, l+1) − ρ + P_3G − V(q, 1) ≥ 0.
Remark that this condition depends only on the packet delay l, and not on the belief π.
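Since the comparison above depends on l alone, the wait-or-3G decision can be tabulated by a simple scan over delays. The sketch below uses hypothetical stand-in values for ρ, P_3G, f and V(q, ·) (none taken from the paper); because the tested expression is decreasing in l, the delays for which waiting is preferred form an initial segment:

```python
# Proposition 7: the Q1-vs-Q2 comparison reduces to the sign of
# h(l) = -f(l) + V(q, l+1) - rho + P3G - V(q, 1), a function of l only.
# All numbers below are hypothetical stand-ins, not derived from the paper.
rho, P3g = 1.0, 1.2
f = lambda l: 0.1 * l
Vq = {l: -0.05 * l for l in range(1, 8)}  # stand-in for V(q, l), decreasing in l

def prefers_waiting(l):
    """True iff waiting (action 1) beats switching to 3G (action 2) at delay l."""
    return -f(l) + Vq[l + 1] - rho + P3g - Vq[1] >= 0

# h is decreasing in l, so the delays where waiting wins form an initial segment.
waiting_delays = [l for l in range(1, 7) if prefers_waiting(l)]
```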
J. Proof of Corollary 1
If −f(1) is lower than ρ − P_3G, then −f(l) − ρ + P_3G + V(q, l+1) − V(q, 1) is always negative.
Indeed, V(q, 2) − V(q, 1) is negative, and −f(l) − ρ + P_3G + V(q, l+1) − V(q, 1) is decreasing in l.
Therefore, the previous expression is negative for all l ≥ 1, and the secondary user always prefers the action 2 to the action 1.
Fig. 1. Using cognitive radio in ad-hoc communication. If the licensed frequency f1 is not used by primary users, secondary
Fig. 14. The six possible configurations of the functions F(π, l) and G(π, l).