Adaptive Channel Recommendation for Dynamic Spectrum Access
Xu Chen∗, Jianwei Huang∗, Husheng Li†
∗ Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong
† Department of Electrical Engineering and Computer Science, The University of Tennessee Knoxville, TN, USA

email:{cx008,jwhuang}@ie.cuhk.edu.hk,husheng@eecs.utk.edu

Abstract—We propose a dynamic spectrum access scheme where secondary users recommend "good" channels to each other and access the spectrum accordingly. We formulate the problem as an average reward based Markov decision process. We show the existence of the optimal stationary spectrum access policy, and explore its structural properties in two asymptotic cases. Since the action space of the Markov decision process is continuous, it is difficult to find the optimal policy by simply discretizing the action space and using the policy iteration, value iteration, or Q-learning methods. Instead, we propose a new algorithm based on the Model Reference Adaptive Search method, and prove its convergence to the optimal policy. Numerical results show that the proposed algorithm achieves up to 18% performance improvement over the static channel recommendation scheme and 10% performance improvement over the Q-learning method, and is robust to channel dynamics.

Fig. 1. Illustration of the channel recommendation scheme. User D recommends channel 4 to other users. As a result, both user A and user C access the same channel 4, which leads to congestion and a reduced rate for both users.

I. INTRODUCTION

Cognitive radio technology enables unlicensed secondary wireless users to opportunistically share the spectrum with licensed primary users, and thus offers a promising solution to address the spectrum under-utilization problem [1]. Designing an efficient spectrum access mechanism for cognitive radio networks, however, is challenging for several reasons: (1) time-variation: spectrum opportunities available for secondary users are often time-varying due to primary users' stochastic activities [1]; and (2) limited observations: each secondary user often has a limited view of the spectrum opportunities due to its limited spectrum sensing capability [2]. Several characteristics of the wireless channels, on the other hand, turn out to be useful for designing efficient spectrum access mechanisms: (1) temporal correlations: spectrum availabilities are correlated in time, and thus observations in the past can be useful in the near future [3]; and (2) spatial correlation: secondary users close to one another may experience similar spectrum availabilities [4]. In this paper, we shall exploit the time and space correlations and propose a recommendation-based collaborative spectrum access algorithm, which achieves good communication performance for the secondary users.

Our algorithm design is directly inspired by the recommendation systems in the electronic commerce industry. For example, existing owners of various products can provide recommendations (reviews) on Amazon.com, so that other potential customers can pick the products that best suit their needs. Motivated by this, Li in [5] proposed a static channel recommendation scheme, where secondary users recommend the channels they have successfully accessed to nearby secondary users. Since each secondary user originally has only a limited view of spectrum availability, such information exchange enables secondary users to take advantage of the correlations in time and space, make more informed decisions, and achieve a high total transmission rate.

The recommendation scheme in [5], however, is rather static and does not dynamically change with network conditions. In particular, the static scheme ignores two important characteristics of cognitive radios. The first one is the time variability mentioned above. The second one is the congestion effect: as depicted in Figure 1, too many users accessing the same good channel leads to congestion and a reduced rate for everyone.

To address the shortcomings of the static recommendation scheme, in this paper we propose an adaptive channel recommendation scheme, which adaptively changes the spectrum access probabilities based on users' latest channel recommendations. We formulate and analyze the system as a Markov decision process (MDP), and propose a numerical algorithm that always converges to the optimal spectrum access policy. The main results and contributions of this paper include:

• Markov decision process formulation: we formulate and analyze the optimal recommendation-based spectrum access as an average reward MDP.
• Existence and structure of the optimal policy: we show that there always exists a stationary optimal spectrum access policy, which requires only the channel recommendation information of the most recent time slot. We also explicitly characterize the structure of the optimal stationary policy in two asymptotic cases (either the number of channels or the number of users goes to infinity).
• Novel algorithm for finding the optimal policy: we propose an algorithm based on the recently developed Model Reference Adaptive Search method [6] to find the optimal stationary spectrum access policy. The algorithm has a low complexity even when dealing with the continuous action space of the MDP. We also show that it always converges to the optimal stationary policy.
• Superior performance: we show that the proposed algorithm achieves up to 18% performance improvement over the static channel recommendation scheme and 10% performance improvement over the Q-learning method, and is also robust to channel dynamics.

The rest of the paper is organized as follows. We introduce the system model and the static channel recommendation scheme in Sections II and III, respectively. We then discuss the motivation for designing an adaptive channel recommendation scheme in Section IV. The Markov decision process formulation and the structure results of the optimal policy are presented in Section V, followed by the Model Reference Adaptive Search based algorithm in Section VI. We illustrate the performance of the algorithm through numerical results in Section VII. We discuss the related work in Section VIII and conclude in Section IX.

II. SYSTEM MODEL

We consider a cognitive radio network with M independent and stochastically identical primary channels. N secondary users try to access these channels using a slotted transmission structure (see Figure 2). The secondary users can exchange information by broadcasting messages over a common control channel.¹ We assume that the secondary users are located close-by, thus they experience the same channel availability and can hear one another's broadcast messages. To protect the primary transmissions, secondary users need to sense the channel states before the data transmission.

Fig. 2. Structure of each spectrum access time slot

Fig. 3. Two-state Markovian channel model

The system model is described as follows:

• Channel State: For each primary channel m, the channel state at time slot t is
$$S_m(t)=\begin{cases}0, & \text{if channel } m \text{ is occupied by primary transmissions},\\ 1, & \text{if channel } m \text{ is idle}.\end{cases}$$
• Channel State Transition: The channel states change according to independent and identical Markovian processes (see Figure 3). We denote the channel state probability vector of channel m at time t as
$$\boldsymbol{p}_m(t) \triangleq (\Pr\{S_m(t)=0\},\ \Pr\{S_m(t)=1\}),$$
which follows a two-state Markov chain as
$$\boldsymbol{p}_m(t) = \boldsymbol{p}_m(t-1)\Gamma, \quad \forall t \ge 1,$$
with the transition matrix
$$\Gamma = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}.$$
Note that when p = 0 or q = 0, the channel state stays unchanged. In the rest of the paper, we will look at the more interesting and challenging cases where 0 < p ≤ 1 and 0 < q ≤ 1. The stationary distribution of the Markov chain is given as
$$\lim_{t\to\infty}\Pr\{S_m(t)=0\} = \frac{q}{p+q}, \qquad(1)$$
$$\lim_{t\to\infty}\Pr\{S_m(t)=1\} = \frac{p}{p+q}. \qquad(2)$$
• Maximum Rate per Channel: When a secondary user transmits successfully on an idle channel, it achieves a data rate of B. Here we assume that channels and users are homogeneous.
• Congestion Effect: When multiple secondary users try to access the same channel, each secondary user will execute the following two steps:
  – Randomly generate a continuous backoff timer value τ according to a common uniform distribution on (0, τ_max).
  – Once the timer expires, monitor the channel and transmit data only if the channel is clear.

Lemma 1. If k_m(t) secondary users compete to access the same channel m at time slot t, then the expected throughput of each user is B S_m(t)/k_m(t).

Lemma 1 shows that the expected throughput of a secondary user decreases as the number of users accessing the same channel increases. However, the total expected rate of all k_m(t) secondary users is B S_m(t), i.e., there is no wasted resource due to users' competition.² We give the detailed proof of Lemma 1 in the Appendix.

¹ Please refer to [7] for the details on how to set up and maintain a reliable common control channel in cognitive radio networks.
² This may not be true for other random MAC mechanisms such as slotted Aloha.
III. REVIEW OF STATIC CHANNEL RECOMMENDATION

The key idea of the static channel recommendation scheme in [5] is that secondary users inform each other about the available channels they have just accessed. More specifically, each secondary user executes the following three stages synchronously during each time slot (see Figure 2):

• Spectrum sensing: sense one of the channels based on the channel selection result made at the end of the previous time slot.
• Data transmission: if the channel sensing result is idle, compete for the channel with the timer mechanism described in Section II. Then transmit data packets if the user successfully grabs the channel.
• Channel recommendation and selection:
  – Announce recommendation: if the user has successfully accessed an idle channel, broadcast this channel ID to all other secondary users.
  – Collect recommendation: collect recommendations from other secondary users and store them in a buffer. Typically, the correlation of channel availabilities between two slots diminishes as the time difference increases. Therefore, each secondary user will only keep the recommendations received within the most recent W slots and discard the out-of-date information. The user's own successful transmission history within the W recent time slots is also stored in the buffer. W is a system design parameter and will be further discussed later.
  – Select channel: choose a channel to sense in the next time slot by putting more weight on the recommended channels according to a static branching probability P_rec. Suppose that the user has R different channel recommendations in the buffer; then the probability of accessing a channel m is
$$P_m=\begin{cases}\dfrac{P_{rec}}{R}, & \text{if channel } m \text{ is recommended},\\[6pt] \dfrac{1-P_{rec}}{M-R}, & \text{otherwise}.\end{cases}\qquad(3)$$
    A larger value of P_rec means putting more weight on the recommended channels.

To illustrate the channel selection process, let us take the network in Figure 1 as an example. Suppose that the branching probability P_rec = 0.4. Since only R = 1 recommendation is available (i.e., channel 4), the probabilities of choosing the recommended channel 4 and any unrecommended channel are 0.4/1 = 0.4 and (1 − 0.4)/(6 − 1) = 0.12, respectively.

Numerical studies in [5] showed that the static channel recommendation scheme achieves a higher performance than the traditional random channel access scheme without information exchange. However, the fixed value of P_rec limits the performance of the static scheme, as explained next.

IV. MOTIVATIONS FOR ADAPTIVE CHANNEL RECOMMENDATION

The static channel recommendation mechanism is simple to implement due to the fixed value of P_rec. However, it may lead to significant congestion when the number of recommended channels is small. In the extreme case when only R = 1 channel is recommended, rule (3) suggests that every user will access that channel with probability P_rec. When the number of users N is large, the expected number of users accessing this channel, N·P_rec, will be high. Thus heavy congestion happens and each secondary user will get a low expected throughput.

A better way is to adaptively change the value of P_rec based on the number of recommended channels. This is the key idea of our proposed algorithm. To illustrate the advantage of adaptive algorithms, let us first consider a simple heuristic adaptive algorithm. In this algorithm, we choose the branching probability such that the expected number of secondary users choosing a single recommended channel is one. To achieve this, we need to set P_rec as in Lemma 2.

Lemma 2. If we choose the branching probability P_rec = R/N, then the expected number of secondary users choosing any one of the R recommended channels is one.

Please refer to the Appendix for the detailed proof of Lemma 2.

Without going through a detailed analysis, it is straightforward to show the benefit of such an adaptive approach through simple numerical examples. Let us consider a network with M = 10 channels and N = 5 secondary users. For each channel m, the initial channel state probability vector is p_m(0) = (0, 1) and the transition matrix is
$$\Gamma=\begin{pmatrix}1-0.01\epsilon & 0.01\epsilon\\ 0.01\epsilon & 1-0.01\epsilon\end{pmatrix},$$
where ε is called the dynamic factor. A larger value of ε implies that the channels are more dynamic over time. We are interested in the time average system throughput
$$U=\frac{\sum_{t=1}^{T}\sum_{n=1}^{N}u_n(t)}{T},$$
where u_n(t) is the throughput of user n at time slot t. In the simulation, we set the total number of time slots T = 2000. We implement the following three channel access schemes:

• Random access scheme: each secondary user selects a channel randomly.
• Static channel recommendation scheme as in [5] with the optimal constant branching probability P_rec = 0.7.
• Heuristic adaptive channel recommendation scheme with the variable branching probability P_rec = R/N.

Figure 4 shows that the heuristic adaptive channel recommendation scheme outperforms the static channel recommendation scheme, which in turn outperforms the random access scheme. Moreover, the heuristic adaptive scheme is more robust to the dynamic channel environment, as it degrades more slowly than the static scheme when ε increases.

We can imagine that an optimal adaptive scheme (by setting the right P_rec(t) over time) can further increase the network performance. However, computing the optimal branching probability in closed form is very difficult. In the rest of the paper, we will focus on characterizing the structure of the optimal spectrum access strategy and designing an efficient algorithm to achieve the optimum.
Fig. 4. Comparison of three channel access schemes

V. ADAPTIVE CHANNEL RECOMMENDATION SCHEME

To find the optimal adaptive spectrum access strategy, we formulate the system as a Markov decision process (MDP). For the sake of simplicity, we assume that the recommendation buffer size W = 1, i.e., users only consider the recommendations received in the last time slot. Our method also applies to the case when W > 1 by using a higher-order MDP formulation, although the analysis is more involved.

A. MDP Formulation For Adaptive Channel Recommendation

We model the system as an MDP as follows:

• System state: R ∈ 𝓡 ≜ {0, 1, ..., min{M, N}} denotes the number of recommended channels at the end of time slot t. Since we assume that all channels are statistically identical, there is no need to keep track of the recommended channel IDs.³
• Action: P_rec ∈ 𝓟 ≜ (0, 1) denotes the branching probability of choosing the set of recommended channels.
• Transition probability: the probability that action P_rec in system state R in time slot t leads to system state R′ in the next time slot is
$$P^{P_{rec}}_{R,R'} = \Pr\{R(t+1)=R' \mid R(t)=R,\ P_{rec}(t)=P_{rec}\}.$$
We can compute this probability as in (4), with the detailed derivation given in Appendix C.
• Reward: U(R, P_rec) is the expected system throughput in the next time slot when the action P_rec is taken under the current system state R, i.e.,
$$U(R,P_{rec}) = \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R,R'} U_{R'},$$
where U_{R′} is the system throughput in state R′. If R′ idle channels are utilized by the secondary users in a time slot, then these R′ channels will be recommended at the end of the time slot. Thus, we have
$$U_{R'} = R'B.$$
Recall that B is the data rate that a single user can obtain on an idle channel.
• Stationary policy: π ∈ Ω ≜ 𝓟^{|𝓡|} maps each state R to an action P_rec, i.e., π(R) is the action P_rec taken when the system is in state R. The mapping is stationary and does not depend on time t.

Given a stationary policy π and the initial state R₀ ∈ 𝓡, we define the network's value function as the time average system throughput, i.e.,
$$\Phi_\pi(R_0) = \lim_{T\to\infty}\frac{1}{T}E_\pi\left[\sum_{t=0}^{T-1}U(R(t),\pi(R(t)))\right].$$
We want to find an optimal stationary policy π* that maximizes the value function Φ_π(R₀) for any initial state R₀, i.e.,
$$\pi^* = \arg\max_\pi \Phi_\pi(R_0), \quad \forall R_0\in\mathcal{R}.$$
Notice that this is a system-wide optimization, although the optimal solution can be implemented in a distributed fashion. This is because every user knows the number of recommended channels R, and it can determine the same optimal access probability locally.

B. Existence of Optimal Stationary Policy

The MDP formulation above is an average reward based MDP. We can prove that an optimal stationary policy that is independent of the initial system state always exists in our MDP formulation. The proof relies on the following lemma from [8].

Lemma 3. If the state space is finite and every stationary policy leads to an irreducible Markov chain, then there exists a stationary policy that is optimal for the average reward based MDP.

The irreducibility of a Markov chain means that it is possible to get to any state from any state. For the adaptive channel recommendation scheme, we have

Lemma 4. Given a stationary policy π for the adaptive channel recommendation MDP, the resulting Markov chain is irreducible.

Proof: We consider the following two cases.

Case I, when 0 < q < 1: since 0 < P_rec < 1, 0 < p ≤ 1, and 0 < q < 1, we can verify that given any state R, the transition probability P^{P_rec}_{R,R′} > 0 for all R′ ∈ 𝓡. Thus, any two states communicate with each other.

Case II, when q = 1: for all R ∈ 𝓡, the transition probability P^{P_rec}_{R,R′} > 0 if R′ ∈ {0, ..., min{M − R, N}}. It follows that the state 0 is accessible from any other state R ∈ 𝓡. By setting R = 0, we see that P^{P_rec}_{R,R′} > 0 for all R′ ∈ {0, ..., min{M, N}}. That is, any other state R′ ∈ 𝓡 is also accessible from the state R = 0. Thus, any two states communicate with each other.

Since any two states communicate with each other in all cases and the number of system states |𝓡| is finite, the resulting Markov chain is irreducible. ∎

³ Users need to know the IDs of the recommended channels in order to access them. However, the IDs are not important in terms of the MDP analysis.
$$P^{P_{rec}}_{R,R'} = \sum_{m_r+m_u=R'}\ \sum_{\substack{R\ge\bar m_r\ge m_r\\ M-R\ge\bar m_u\ge m_u}}\ \sum_{\substack{n_r+n_u=N\\ n_r\ge\bar m_r,\ n_u\ge\bar m_u}} \binom{N}{n_r}P_{rec}^{n_r}(1-P_{rec})^{n_u} \cdot \binom{\bar m_r}{m_r}(1-q)^{m_r}q^{\bar m_r-m_r}\,\frac{R!}{(R-\bar m_r)!}\binom{n_r-1}{\bar m_r-1}R^{-n_r} \cdot \binom{\bar m_u}{m_u}\left(\frac{p}{p+q}\right)^{m_u}\left(\frac{q}{p+q}\right)^{\bar m_u-m_u}\frac{(M-R)!}{(M-R-\bar m_u)!}\binom{n_u-1}{\bar m_u-1}(M-R)^{-n_u}. \qquad(4)$$
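The transition probability (4) is combinatorial; as a sanity check, one can estimate it by direct Monte Carlo simulation of a single slot under the same modeling assumptions (recommended channels were idle in the previous slot and stay idle w.p. 1 − q; unrecommended channels are idle with the stationary probability p/(p + q)). The following Python sketch is an added illustration, not part of the paper, and its parameter values are arbitrary.

```python
import random
from collections import Counter

def sample_next_R(R, M, N, p, q, p_rec):
    """One slot: draw R(t+1) given R(t) = R, following the assumptions behind (4)."""
    # channels 0..R-1 are the recommended ones, R..M-1 the unrecommended ones
    idle = [random.random() < 1 - q for _ in range(R)] + \
           [random.random() < p / (p + q) for _ in range(M - R)]
    accessed = set()
    for _ in range(N):
        pick_recommended = R > 0 and (R == M or random.random() < p_rec)
        if pick_recommended:
            accessed.add(random.randrange(R))
        else:
            accessed.add(R + random.randrange(M - R))
    # a channel is recommended next slot iff it is idle and somebody accessed it
    return sum(1 for m in accessed if idle[m])

def estimate_transition_row(R, M, N, p, q, p_rec, runs=20_000):
    counts = Counter(sample_next_R(R, M, N, p, q, p_rec) for _ in range(runs))
    return {r: counts[r] / runs for r in sorted(counts)}

print(estimate_transition_row(R=2, M=10, N=5, p=0.2, q=0.1, p_rec=0.4))
```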

Combining Lemmas 3 and 4, we have

Theorem 1. There exists an optimal stationary policy for the adaptive channel recommendation MDP.

Furthermore, the irreducibility of the adaptive channel recommendation MDP also implies that the optimal stationary policy π* is independent of the initial state R₀ [8], i.e.,
$$\Phi_{\pi^*}(R_0) = \Phi_{\pi^*}, \quad \forall R_0\in\mathcal{R},$$
where Φ_{π*} is the maximum time average system throughput. In the rest of the paper, we will simply use "optimal policy" to refer to the "optimal stationary policy that is independent of the initial system state".

C. Structure of Optimal Stationary Policy

Next we characterize the structure of the optimal policy without using a closed-form expression of the policy (which is generally hard to obtain). The key idea is to treat the average reward based MDP as the limit of a sequence of discounted reward MDPs with discount factors going to one. Under the irreducibility condition, the average reward based MDP thus inherits the structure property from the corresponding discounted reward MDP [8]. We can write down the Bellman equations of the discounted version of our MDP problem as
$$V_t(R) = \max_{P_{rec}\in\mathcal{P}} \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R,R'}[U_{R'}+\beta V_{t+1}(R')], \quad \forall R\in\mathcal{R},\qquad(5)$$
where V_t(R) is the discounted maximum expected system throughput starting from time slot t when the system is in state R.

Due to the combinatorial complexity of the transition probability P^{P_rec}_{R,R′} in (4), it is difficult to obtain structure results for the general case. We further limit our attention to the following two asymptotic cases.

1) Case One, the number of channels M goes to infinity while the number of users N stays finite: In this case, the number of channels is much larger than the number of secondary users, and thus heavy congestion rarely happens on any channel. Thus it is safe to emphasize accessing the recommended channels. Before proving the main result of Case One in Theorem 2, let us first characterize a property of the discounted maximum expected system payoff V_t(R).

Proposition 1. When M = ∞ and N < ∞, the value function V_t(R) for the discounted adaptive channel recommendation MDP is nondecreasing in R.

The proof of Proposition 1 is given in the Appendix. Based on the monotone property of the value function V_t(R), we prove the following main result.

Theorem 2. When M = ∞ and N < ∞, for the adaptive channel recommendation MDP, the optimal stationary policy π* is monotone, that is, π*(R) is nondecreasing in R ∈ 𝓡.

Proof: For ease of discussion, we define
$$Q_t(R,P_{rec}) = \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R,R'}[U_{R'}+\beta V_{t+1}(R')],$$
with the partial cross derivative being
$$\frac{\partial^2 Q_t(R,P_{rec})}{\partial R\,\partial P_{rec}} = \frac{\partial \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R+1,R'}[U_{R'}+\beta V_{t+1}(R')]}{\partial P_{rec}} - \frac{\partial \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R,R'}[U_{R'}+\beta V_{t+1}(R')]}{\partial P_{rec}}.$$
By Lemma 6 in the Appendix, we know that the reverse cumulative distribution function $\sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R,R'}$ is supermodular on 𝓡 × 𝓟. It implies
$$\frac{\partial \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R+1,R'}}{\partial P_{rec}} - \frac{\partial \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R,R'}}{\partial P_{rec}} \ge 0.$$
Since V_{t+1}(R′) is nondecreasing in R′ by Proposition 1 and U_{R′} = R′B, we know that U_{R′} + βV_{t+1}(R′) is also nondecreasing in R′. Then we have
$$\frac{\partial \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R+1,R'}[U_{R'}+\beta V_{t+1}(R')]}{\partial P_{rec}} \ge \frac{\partial \sum_{R'\in\mathcal{R}} P^{P_{rec}}_{R,R'}[U_{R'}+\beta V_{t+1}(R')]}{\partial P_{rec}},$$
i.e.,
$$\frac{\partial^2 Q_t(R,P_{rec})}{\partial R\,\partial P_{rec}} \ge 0,$$
which implies that Q_t(R, P_rec) is supermodular on 𝓡 × 𝓟. Since
$$\pi^*(R) = \arg\max_{P_{rec}} Q_t(R,P_{rec}),$$
by the property of supermodularity, the optimal policy π*(R) is nondecreasing in R for the discounted MDP above. Since the average reward based MDP inherits this structure property, the result is also true for the adaptive channel recommendation MDP. ∎

2) Case Two, the number of users N goes to infinity while the number of channels M stays finite: In this case, the number of secondary users is much larger than the number of channels, and thus congestion becomes a major concern. However, since there are infinitely many secondary users, all the idle channels at each time slot can be utilized as long as users have positive probabilities of accessing all channels. From the system's point of view, the cognitive radio network operates in the saturation state. Formally, we show that
Theorem 3. When N = ∞ and M < ∞, for the adaptive channel recommendation MDP, any stationary policy π satisfying
$$0 < \pi(R) < 1, \quad \forall R\in\mathcal{R},$$
is optimal.

Proof: We first define the sets of policies ∆ ≜ {π : 0 < π(R) < 1, ∀R ∈ 𝓡} and ∆ᶜ = Ω\∆. Recall that the value of π(R) equals the probability of choosing the set of recommended channels, i.e., P_rec.

Then it is easy to check that the probability of accessing an arbitrary channel m is positive under any policy π ∈ ∆. Since the number of secondary users N = ∞, this implies that all the channels will be accessed by the secondary users. In this case, the transition probability from a system state R to R′ of the resulting Markov chain is given by
$$P^{\pi(R)}_{R,R'} = \sum_{\substack{m_r+m_u=R'\\ m_r\le R,\ m_u\le M-R}} \binom{R}{m_r}(1-q)^{m_r}q^{R-m_r}\binom{M-R}{m_u}\left(\frac{p}{p+q}\right)^{m_u}\left(\frac{q}{p+q}\right)^{M-R-m_u},\qquad(6)$$
which is independent of the branching probability π(R). It implies that any policy π ∈ ∆ leads to a Markov chain with the same transition probabilities P^{P_rec}_{R,R′}. Thus, any policy π ∈ ∆ offers the same time average system throughput.

We next show that any policy π′ ∈ ∆ᶜ leads to a payoff no better than the payoff of a policy π ∈ ∆. For a policy π′ where there exist some states R̄ such that π′(R̄) = 0, the transition probability from the system state R̄ to R′ is
$$P^{\pi'(\bar R)}_{\bar R,R'}=\begin{cases}\binom{M-\bar R}{R'}\left(\frac{p}{p+q}\right)^{R'}\left(\frac{q}{p+q}\right)^{M-\bar R-R'}, & \text{if } R'\le M-\bar R,\\ 0, & \text{if } R'> M-\bar R.\end{cases}$$
If there exist some states R̂ such that π′(R̂) = 1, we have the transition probability as
$$P^{\pi'(\hat R)}_{\hat R,R'}=\begin{cases}\binom{\hat R}{R'}(1-q)^{R'}q^{\hat R-R'}, & \text{if } R'\le\hat R,\\ 0, & \text{if } R'>\hat R.\end{cases}$$
Since
$$\binom{M-\bar R}{R'}\left(\frac{p}{p+q}\right)^{R'}\left(\frac{q}{p+q}\right)^{M-\bar R-R'} = \sum_{j=0}^{\bar R}\binom{\bar R}{j}(1-q)^{j}q^{\bar R-j}\cdot\binom{M-\bar R}{R'}\left(\frac{p}{p+q}\right)^{R'}\left(\frac{q}{p+q}\right)^{M-\bar R-R'},$$
and
$$\binom{\hat R}{R'}(1-q)^{R'}q^{\hat R-R'} = \sum_{j=0}^{M-\hat R}\binom{M-\hat R}{j}\left(\frac{p}{p+q}\right)^{j}\left(\frac{q}{p+q}\right)^{M-\hat R-j}\cdot\binom{\hat R}{R'}(1-q)^{R'}q^{\hat R-R'},$$
compared with (6), we have
$$\sum_{R'=i}^{M}P^{\pi(R)}_{R,R'} \ge \sum_{R'=i}^{M}P^{\pi'(R)}_{R,R'}, \quad \forall i, R\in\mathcal{R},\ \pi\in\Delta,\ \pi'\in\Delta^c.$$
Suppose that the time horizon consists of any T time slots, and V_t^π(R) denotes the expected system throughput under the policy π starting from time slot t when the system is in state R.

When t = T,
$$V_T^\pi(R) = V_T^{\pi'}(R) = U_R = RB, \quad \forall R\in\mathcal{R},\ \pi\in\Delta,\ \pi'\in\Delta^c.$$
It follows that U_{R′} + βV_T^π(R′) = U_{R′} + βV_T^{π′}(R′), and hence
$$\sum_{R'=0}^{M}P^{\pi(R)}_{R,R'}[U_{R'}+\beta V_T^\pi(R')] \ge \sum_{R'=0}^{M}P^{\pi'(R)}_{R,R'}[U_{R'}+\beta V_T^{\pi'}(R')],$$
i.e.,
$$V_{T-1}^\pi(R) \ge V_{T-1}^{\pi'}(R), \quad \forall R\in\mathcal{R},\ \pi\in\Delta,\ \pi'\in\Delta^c.$$
Recursively, for any time slot t ≤ T, we can show that
$$V_t^\pi(R) \ge V_t^{\pi'}(R), \quad \forall R\in\mathcal{R},\ \pi\in\Delta,\ \pi'\in\Delta^c.$$
Thus, if there exists a policy π′ ∈ ∆ᶜ that is optimal, then all the policies π ∈ ∆ are also optimal. If there does not exist such a policy π′, then we conclude that only the policies π ∈ ∆ are optimal. ∎

VI. MODEL REFERENCE ADAPTIVE SEARCH FOR OPTIMAL SPECTRUM ACCESS POLICY

Next we will design an algorithm that can converge to the optimal policy under general system parameters (not limited to the two asymptotic cases). Since the action space of the adaptive channel recommendation MDP is continuous (i.e., choosing a probability P_rec in (0, 1)), the traditional method of discretizing the action space followed by policy iteration, value iteration, or Q-learning cannot guarantee convergence to the optimal policy. To overcome this difficulty, we propose a new algorithm developed from the Model Reference Adaptive Search method, which was recently developed in the Operations Research community [6]. We will show that the proposed algorithm is easy to implement and is provably convergent to the optimal policy.
A. Model Reference Adaptive Search Method

We first introduce the basic idea of the Model Reference Adaptive Search (MRAS) method. Later on, we will show how the method can be used to obtain the optimal spectrum access policy for our problem.

The MRAS method is a new randomized method for global optimization [6]. The key idea is to randomize the original optimization problem over the feasible region according to a specified probabilistic model. The method then generates candidate solutions and updates the probabilistic model on the basis of elite solutions and a reference model, so as to guide the future search toward better solutions.

Formally, let J(x) be the objective function to maximize. The MRAS method is an iterative algorithm, and it includes three phases in each iteration k:

• Random solution generation: generate a set of random solutions {x} in the feasible set χ according to a parameterized probabilistic model f(x, v_k), which is a probability density function (pdf) with parameter v_k. The number of solutions to generate is a fixed system parameter.
• Reference distribution construction: select elite solutions among the randomly generated set in the previous phase, such that the chosen ones satisfy J(x) ≥ γ. Construct a reference probability distribution as
$$g_k(x)=\begin{cases}\dfrac{I_{\{J(x)\ge\gamma\}}}{E_{f(x,v_0)}\left[\frac{I_{\{J(x)\ge\gamma\}}}{f(x,v_0)}\right]}, & k=1,\\[12pt] \dfrac{e^{J(x)}I_{\{J(x)\ge\gamma\}}\,g_{k-1}(x)}{E_{g_{k-1}}\left[e^{J(x)}I_{\{J(x)\ge\gamma\}}\right]}, & k\ge 2,\end{cases}\qquad(7)$$
where I_{ϖ} is an indicator function, which equals 1 if the event ϖ is true and zero otherwise. Parameter v₀ is the initial parameter of the probabilistic model (used during the first iteration, i.e., k = 1), and g_{k−1}(x) is the reference distribution from the previous iteration (used when k ≥ 2).
• Probabilistic model update: update the parameter v of the probabilistic model f(x, v) by minimizing the Kullback-Leibler divergence between g_k(x) and f(x, v), i.e.,
$$v_{k+1} = \arg\min_v E_{g_k}\left[\ln\frac{g_k(x)}{f(x,v)}\right].\qquad(8)$$

By constructing the reference distribution according to (7), the expected performance of the random elite solutions can be improved under the new reference distribution, i.e.,
$$E_{g_k}[e^{J(x)}I_{\{J(x)\ge\gamma\}}] = \frac{\int_{x\in\chi}e^{2J(x)}I_{\{J(x)\ge\gamma\}}g_{k-1}(x)dx}{E_{g_{k-1}}[e^{J(x)}I_{\{J(x)\ge\gamma\}}]} = \frac{E_{g_{k-1}}[e^{2J(x)}I_{\{J(x)\ge\gamma\}}]}{E_{g_{k-1}}[e^{J(x)}I_{\{J(x)\ge\gamma\}}]} \ge E_{g_{k-1}}[e^{J(x)}I_{\{J(x)\ge\gamma\}}].\qquad(9)$$
To find a better solution to the optimization problem, it is natural to update the probabilistic model (from which random solutions are generated in the first phase) to be as close to the new reference distribution as possible, as done in the third phase.

B. Model Reference Adaptive Search For Optimal Spectrum Access Policy

In this section, we design an algorithm based on the MRAS method to find the optimal spectrum access policy. Here we treat the adaptive channel recommendation MDP as a global optimization problem over the policy space. The key challenge is the choice of a proper probabilistic model f(·), which is crucial for the convergence of the MRAS algorithm.

1) Random Policy Generation: To apply the MRAS method, we first need to set up a random policy generation mechanism. Since the action space of the channel recommendation MDP is continuous, we use Gaussian distributions. Specifically, we generate sample actions π(R) from a Gaussian distribution for each system state R ∈ 𝓡 independently, i.e., π(R) ∼ N(µ_R, σ_R²).⁴ In this case, a candidate policy π can be generated from the joint distribution of |𝓡| independent Gaussian distributions, i.e.,
$$(\pi(0),...,\pi(\min\{M,N\})) \sim N(\mu_0,\sigma_0^2)\times\cdots\times N(\mu_{\min\{M,N\}},\sigma_{\min\{M,N\}}^2).$$
As shown later, the Gaussian distribution has nice analytical and convergence properties for the MRAS method.

For the sake of brevity, we denote f(π(R), µ_R, σ_R) as the pdf of the Gaussian distribution N(µ_R, σ_R²), and f(π, µ, σ) as the random policy generation mechanism with parameters µ ≜ (µ₀, ..., µ_{min{M,N}}) and σ ≜ (σ₀, ..., σ_{min{M,N}}), i.e.,
$$f(\pi,\mu,\sigma) = \prod_{R=0}^{\min\{M,N\}} f(\pi(R),\mu_R,\sigma_R) = \prod_{R=0}^{\min\{M,N\}} \frac{1}{\sqrt{2\varphi\sigma_R^2}}e^{-\frac{(\pi(R)-\mu_R)^2}{2\sigma_R^2}},$$
where φ is the circumference-to-diameter ratio (i.e., the constant pi, denoted φ here to avoid confusion with the policy π).

2) System Throughput Evaluation: Given a candidate policy π randomly generated based on f(π, µ, σ), we need to evaluate the expected system throughput Φ_π. From (4), we obtain the transition probabilities P^{π(R)}_{R,R′} for any system states R, R′ ∈ 𝓡. Since a policy π leads to a finite irreducible Markov chain, we can obtain its stationary distribution. Let us denote the transition matrix of the Markov chain as Q ≜ [P^{π(R)}_{R,R′}]_{|𝓡|×|𝓡|} and the stationary distribution as p = (Pr(0), ..., Pr(min{M, N})). Obviously, the stationary distribution can be obtained by solving the equation
$$pQ = p.$$
We then calculate the expected system throughput Φ_π by
$$\Phi_\pi = \sum_{R\in\mathcal{R}}\Pr(R)U_R.$$

⁴ Note that the Gaussian distribution has a support over (−∞, +∞), which is larger than the feasible region of π(R). This issue will be handled in Section VI-B2.
reference probability as possible, as done in the third stage. VI-B2.
Note that in the discussion above, we implicitly assume that π ∈ Ω, where Ω is the feasible policy space. Since the Gaussian distribution has support over (−∞, +∞), we extend the definition of the expected system throughput Φ_π over (−∞, +∞)^{|𝓡|} as
$$\Phi_\pi=\begin{cases}\sum_{R\in\mathcal{R}}\Pr(R)U_R, & \pi\in\Omega,\\ -\infty, & \text{otherwise}.\end{cases}$$
In this case, whenever a generated policy π is not feasible, we have Φ_π = −∞. As a result, such a policy π will not be selected as an elite sample (discussed next) and will not be used for the probability updating. Hence the search of the MRAS algorithm will not be biased towards any infeasible policy space.

3) Reference Distribution Construction: To construct the reference distribution, we first need to select the elite policies. Suppose L candidate policies, π₁, π₂, ..., π_L, are generated in each iteration. We order them according to increasing expected system throughputs Φ_π, i.e., Φ_{π̂₁} ≤ Φ_{π̂₂} ≤ ... ≤ Φ_{π̂_L}, and set the elite threshold as
$$\gamma = \Phi_{\hat\pi_{\lceil(1-\rho)L\rceil}},$$
where 0 < ρ < 1 is the elite ratio. For example, when L = 100 and ρ = 0.4, then γ = Φ_{π̂₆₀} and the last 40 samples in the sequence will be selected as elite samples. Note that as long as L is sufficiently large, we shall have γ < ∞ and hence only feasible policies π are selected. According to (7), we then construct the reference distribution as
$$g_k(\pi)=\begin{cases}\dfrac{I_{\{\Phi_\pi\ge\gamma\}}}{E_{f(\pi,\mu_0,\sigma_0)}\left[\frac{I_{\{\Phi_\pi\ge\gamma\}}}{f(\pi,\mu_0,\sigma_0)}\right]}, & k=1,\\[12pt] \dfrac{e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}\,g_{k-1}(\pi)}{E_{g_{k-1}}\left[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}\right]}, & k\ge2.\end{cases}\qquad(10)$$

4) Policy Generation Update: For the MRAS algorithm, the critical issue is the updating of the random policy generation mechanism f(π, µ, σ), i.e., solving the problem in (8). The optimal update rule is described as follows.

Theorem 4. The optimal parameter (µ, σ) that minimizes the Kullback-Leibler divergence between the reference distribution g_k(π) in (10) and the new policy generation mechanism f(π, µ, σ) is
$$\mu_R = \frac{\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}\pi(R)d\pi}{\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}d\pi}, \quad \forall R\in\mathcal{R},\qquad(11)$$
$$\sigma_R^2 = \frac{\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}[\pi(R)-\mu_R]^2 d\pi}{\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}d\pi}, \quad \forall R\in\mathcal{R}.\qquad(12)$$

Proof: First, from (10), we have
$$g_1(\pi) = \frac{I_{\{\Phi_\pi\ge\gamma\}}}{E_{f(\pi,\mu_0,\sigma_0)}\left[\frac{I_{\{\Phi_\pi\ge\gamma\}}}{f(\pi,\mu_0,\sigma_0)}\right]} = \frac{I_{\{\Phi_\pi\ge\gamma\}}}{\int_{\pi\in\Omega}I_{\{\Phi_\pi\ge\gamma\}}d\pi},$$
and
$$g_2(\pi) = \frac{e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}\,g_1(\pi)}{E_{g_1}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}]} = \frac{e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}}{\int_{\pi\in\Omega}e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}d\pi}.$$
Repeating the above computation iteratively, we have
$$g_k(\pi) = \frac{e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}}{\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}d\pi}, \quad k\ge1.\qquad(13)$$
Then the problem in (8) is equivalent to solving
$$\max_{\mu,\sigma}\int_{\pi\in\Omega}g_k(\pi)\ln f(\pi,\mu,\sigma)d\pi,\quad \text{subject to } \mu,\sigma\succeq0.\qquad(14)$$
Substituting (13) into (14), we have
$$\max_{\mu,\sigma}\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}\ln f(\pi,\mu,\sigma)d\pi,\quad \text{subject to } \mu,\sigma\succeq0.\qquad(15)$$
The function f(π(R), µ_R, σ_R) is log-concave, since it is the pdf of a Gaussian distribution. Since log-concavity is closed under multiplication, f(π, µ, σ) = Π_{R=0}^{min{M,N}} f(π(R), µ_R, σ_R) is also log-concave. This implies that the problem in (14) is a concave optimization problem. Solving it by the first order conditions, we have
$$\frac{\partial\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}\ln f(\pi,\mu,\sigma)d\pi}{\partial\mu_R}=0,\quad\forall R\in\mathcal{R},$$
$$\frac{\partial\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}\ln f(\pi,\mu,\sigma)d\pi}{\partial\sigma_R}=0,\quad\forall R\in\mathcal{R},$$
which leads to (11) and (12). Due to the concavity of the optimization problem in (14), the solution is also the global optimum for the random policy generation updating. ∎

5) MRAS Algorithm For Optimal Spectrum Access Policy: Based on the MRAS method, we generate L candidate policies in each iteration. Then the updates in (11) and (12) are replaced by their sample average versions in (18) and (19), respectively. As a summary, we describe the MRAS-based algorithm for finding the optimal spectrum access policy of the adaptive channel recommendation MDP in Algorithm 1.

C. Convergence of Model Reference Adaptive Search

In this part, we discuss the convergence property of the MRAS-based optimal spectrum access policy. For ease of exposition, we assume that the adaptive channel recommendation MDP has a unique global optimal policy. Numerical studies in [6] show that the MRAS method also converges when there are multiple global optimal solutions. We shall show that the random policy generation mechanism f(π, µ_k, σ_k) will eventually generate the optimal policy.
Algorithm 1 MRAS-based Algorithm For Adaptive Recommendation Based Optimal Spectrum Access
1: Initialize the parameters for the Gaussian distributions (µ₀, σ₀), the elite ratio ρ, and the stopping criterion ξ. Set the initial elite threshold γ₀ = 0 and the iteration index k = 0.
2: repeat:
3: Increase the iteration index k by 1.
4: Generate L candidate policies π₁, ..., π_L from the random policy generation mechanism f(π, µ_{k−1}, σ_{k−1}).
5: Select elite policies by setting the elite threshold γ_k = max{Φ_{π̂⌈(1−ρ)L⌉}, γ_{k−1}}.
6: Update the random policy generation mechanism by
$$\mu_{R,k} = \frac{\sum_{i=1}^{L}e^{(k-1)\Phi_{\pi_i}}I_{\{\Phi_{\pi_i}\ge\gamma_k\}}\pi_i(R)}{\sum_{i=1}^{L}e^{(k-1)\Phi_{\pi_i}}I_{\{\Phi_{\pi_i}\ge\gamma_k\}}}, \quad \forall R\in\mathcal{R},\qquad(18)$$
$$\sigma_{R,k}^2 = \frac{\sum_{i=1}^{L}e^{(k-1)\Phi_{\pi_i}}I_{\{\Phi_{\pi_i}\ge\gamma_k\}}[\pi_i(R)-\mu_{R,k}]^2}{\sum_{i=1}^{L}e^{(k-1)\Phi_{\pi_i}}I_{\{\Phi_{\pi_i}\ge\gamma_k\}}}, \quad \forall R\in\mathcal{R}.\qquad(19)$$
7: until max_{R∈𝓡} σ_{R,k} < ξ.
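The following Python sketch mirrors the steps of Algorithm 1 (an added illustration under simplifying assumptions, not the authors' implementation). The throughput evaluator `evaluate(policy)` is a placeholder: it should return Φ_π for a candidate policy, for instance via the stationary-distribution computation of Section VI-B2, and −∞ for an infeasible policy.

```python
import math, random

def mras_search(evaluate, n_states, L=500, rho=0.1, xi=1e-3, max_iter=200):
    """Sketch of Algorithm 1: Gaussian policy sampling, elite selection, and
    the sample-average updates (18)-(19)."""
    mu, sigma = [0.5] * n_states, [0.5] * n_states
    gamma = 0.0
    for k in range(1, max_iter + 1):
        # step 4: sample L candidate policies, one Gaussian per state R
        cands = [[random.gauss(mu[R], sigma[R]) for R in range(n_states)]
                 for _ in range(L)]
        scores = [evaluate(pi) for pi in cands]
        # step 5: raise the elite threshold gamma_k
        gamma = max(gamma, sorted(scores)[math.ceil((1 - rho) * L) - 1])
        elite = [(s, pi) for s, pi in zip(scores, cands) if s >= gamma]
        if not elite:
            continue
        # step 6: sample-average updates (18)-(19); shifting the exponent by the
        # best elite score leaves the weight ratios unchanged and avoids overflow
        best = max(s for s, _ in elite)
        w = [math.exp((k - 1) * (s - best)) for s, _ in elite]
        tot = sum(w)
        for R in range(n_states):
            mu[R] = sum(wi * pi[R] for wi, (_, pi) in zip(w, elite)) / tot
            var = sum(wi * (pi[R] - mu[R]) ** 2 for wi, (_, pi) in zip(w, elite)) / tot
            sigma[R] = math.sqrt(var)
        # step 7: stopping criterion
        if max(sigma) < xi:
            break
    return mu
```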
algorithm with 100, 300, and 500 candidate policies per itera-
tion, respectively. We have two observations. First, the number
Theorem 5. For the MRAS algorithm, the policy sequence {π_k} generated by the sequence of random policy generation mechanisms {f(π, µ_k, σ_k)} converges pointwise to the optimal spectrum access policy π* for the adaptive channel recommendation MDP, i.e.,
$$\lim_{k\to\infty}E_{f(\pi,\mu_k,\sigma_k)}[\pi(R)] = \pi^*(R),\quad\forall R\in\mathcal{R},\qquad(16)$$
$$\lim_{k\to\infty}Var_{f(\pi,\mu_k,\sigma_k)}[\pi(R)] = 0,\quad\forall R\in\mathcal{R}.\qquad(17)$$

The proof is given in the Appendix.

From Theorem 5, we see that the parameters (µ_{R,k}, σ_{R,k}) used in the updates (18) and (19) also converge, i.e.,
$$\lim_{k\to\infty}\mu_{R,k}=\pi^*(R),\quad\forall R\in\mathcal{R},$$
$$\lim_{k\to\infty}\sigma_{R,k}=0,\quad\forall R\in\mathcal{R}.$$
Thus, we can use max_{R∈𝓡} σ_{R,k} < ξ as the stopping criterion in Algorithm 1.

VII. SIMULATION RESULTS

In this section, we investigate the proposed adaptive channel recommendation scheme by simulations. The results show that the adaptive channel recommendation scheme not only achieves a higher performance than the static channel recommendation scheme and the random access scheme, but is also more robust to dynamic changes of the channel environment.

A. Simulation Setup

We consider a cognitive radio network consisting of multiple independent and stochastically identical primary channels. In order to take the impact of the primary users' long-run behavior into account, we consider the following two types of channel state transition matrices:
$$\text{Type 1: }\Gamma_1=\begin{pmatrix}1-0.005\epsilon & 0.005\epsilon\\ 0.025\epsilon & 1-0.025\epsilon\end{pmatrix},\qquad(20)$$
$$\text{Type 2: }\Gamma_2=\begin{pmatrix}1-0.01\epsilon & 0.01\epsilon\\ 0.01\epsilon & 1-0.01\epsilon\end{pmatrix},\qquad(21)$$
where ε is the dynamic factor. Recall that a larger ε means that the channels are more dynamic over time. Using (2), we know that channel models Γ₁ and Γ₂ have stationary channel idle probabilities of 1/6 and 1/2, respectively. In other words, the primary activity level is much higher with the Type 1 channel than with the Type 2 channel.

We initialize the parameters of the MRAS algorithm as follows. We set µ_R = 0.5 and σ_R = 0.5 for the Gaussian distributions, which have 68.2% support over the feasible region (0, 1). We found that the performance of the MRAS algorithm is insensitive to the elite ratio ρ when ρ ≤ 0.3. We thus choose ρ = 0.1.

When using the MRAS-based algorithm, we need to determine how many (feasible) candidate policies to generate in each iteration. Figure 5 shows the convergence of the MRAS algorithm with 100, 300, and 500 candidate policies per iteration, respectively. We have two observations. First, the number of iterations needed to achieve convergence decreases as the number of candidate policies increases. Second, the improvement in convergence speed is insignificant when the number changes from 300 to 500. We thus choose L = 500 for the experiments in the sequel.

Fig. 5. The convergence of the MRAS-based algorithm with different numbers of candidate policies per iteration

B. Simulation Results

We consider two simulation scenarios: (1) the number of channels is greater than the number of secondary users, and (2) the number of channels is smaller than the number of secondary users. For each case, we compare the adaptive channel recommendation scheme with the static channel recommendation scheme in [5] and a random access scheme.

1) More Channels, Fewer Users: We implement the three spectrum access schemes with M = 10 channels and N = 5 secondary users. As there are enough channels to choose from, congestion is not a major issue in this setting. We choose the dynamic factor ε within a wide range to investigate the robustness of the schemes to the channel dynamics. The results are shown in Figures 6 – 9. From these figures, we see that:

Fig. 6. System throughput with M = 10 channels and N = 5 users under the Type 1 channel state transition matrix
Fig. 7. System throughput with M = 10 channels and N = 5 users under the Type 2 channel state transition matrix
Fig. 8. Performance gain over the random access scheme. The Type 1 and Type 2 channels have stationary channel idle probabilities of 1/6 and 1/2, respectively.
Fig. 9. Performance gain over the static channel recommendation scheme. The Type 1 and Type 2 channels have stationary channel idle probabilities of 1/6 and 1/2, respectively.

• Superior performance of the adaptive channel recommendation scheme (Figures 6 and 7): the adaptive channel recommendation scheme performs better than the random access scheme and the static channel recommendation scheme. Typically, it offers a 5%–18% performance gain over the static channel recommendation scheme.
• Impact of channel dynamics (Figures 6 and 7): the performance of both the adaptive and static channel recommendation schemes degrades as the dynamic factor ε increases. The reason is that both schemes rely on recommendation information from previous time slots to make decisions. When channel states change rapidly, the value of recommendation information diminishes. However, the adaptive channel recommendation is much more robust to the changing channel environment (see Figure 9). This is because the optimal adaptive policy takes the channel dynamics into account while the static one does not.
• Impact of channel idleness level (Figures 8 and 9): Figure 8 shows the performance gain of the adaptive channel recommendation scheme over the random access scheme under the two types of transition matrices. We see that the performance gain decreases with the idle probability of the channel, as in the Type 1 channel environment. This shows that the information in channel recommendations can enhance spectrum access more effectively when the primary activity level increases (i.e., when the channel idle probability is low). Interestingly, Figure 9 shows that the performance gain of the adaptive channel recommendation scheme over the static channel recommendation scheme tends to increase with the channel idle probability. This illustrates that, given the channel recommendation information, the adaptive channel recommendation scheme can better utilize the channel opportunities than the static recommendation scheme.

2) Fewer Channels, More Users: We next consider the case of M = 5 channels and N = 10 users, and show the simulation results in Figures 10 and 11. We can check that the observations in Section VII-B1 still hold. In other words, the adaptive channel recommendation scheme still has a better performance than the static and random access schemes when the cognitive radio network suffers from a severe congestion effect.

Fig. 10. System throughput with M = 5 channels and N = 10 users under the Type 1 channel state transition matrix
Fig. 11. System throughput with M = 5 channels and N = 10 users under the Type 2 channel state transition matrix

C. Comparison of MRAS Algorithm and Q-Learning

To benchmark the performance of the spectrum access policy based on the MRAS algorithm, we compare it with the policy obtained by the Q-learning algorithm [9].

Since Q-learning can only be used over a discrete action space, we first discretize the action space 𝓟 into a finite discrete action space 𝓟̂ = {0.1, ..., 1.0}. Q-learning then defines a Q-value representing the estimated quality of a state-action combination as
$$Q: \mathcal{R}\times\hat{\mathcal{P}}\to\mathbb{R}.$$
When a new reward U(R(t), P_rec(t)) is received, we update the Q-value as
$$Q(R(t),P_{rec}(t)) = (1-\alpha)Q(R(t),P_{rec}(t)) + \alpha\left[U(R(t),P_{rec}(t)) + \max_{P_{rec}\in\hat{\mathcal{P}}}Q(R(t+1),P_{rec})\right],$$
where 0 < α < 1 is the smoothing factor. Given a system state R, the probability of choosing an action P_rec is
$$\Pr(P_{rec}(t)=P_{rec}\mid R(t)=R) = \frac{e^{\tau Q(R,P_{rec})}}{\sum_{P'_{rec}\in\hat{\mathcal{P}}}e^{\tau Q(R,P'_{rec})}},$$
where τ > 0 is the temperature.

After the Q-learning converges, we obtain the corresponding spectrum access policy π_Q over the discretized action space 𝓟̂. Note that π_Q is a sub-optimal policy for the adaptive channel recommendation MDP over the continuous action space 𝓟. We compare the Q-learning based policy with our MRAS-based optimal policy when there are M = 10 channels and N = 5 users, and show the simulation results in Figures 12 and 13. From these figures, we see that the MRAS-based algorithm outperforms Q-learning by up to 10%, which demonstrates the effectiveness of our proposed algorithm.

Fig. 12. Comparison of the MRAS-based algorithm and Q-learning with the Type 1 channel state transition matrix
Fig. 13. Comparison of the MRAS-based algorithm and Q-learning with the Type 2 channel state transition matrix
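For reference, here is a compact Python sketch of the Q-learning benchmark described above (an added illustration, not the authors' code): a tabular Q-function over the discretized actions, the smoothed update as printed in the text (with no explicit discount factor), and Boltzmann exploration. The environment hook `step(R, p_rec)`, which should return the observed reward and the next number of recommendations, is an assumed placeholder.

```python
import math, random

ACTIONS = [round(0.1 * i, 1) for i in range(1, 11)]      # discretized P_rec values

def boltzmann_pick(q_row, tau):
    """Choose an action index with probability proportional to exp(tau * Q)."""
    weights = [math.exp(tau * q) for q in q_row]
    r, acc = random.random() * sum(weights), 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

def q_learning(step, n_states, alpha=0.1, tau=5.0, T=20_000):
    Q = [[0.0] * len(ACTIONS) for _ in range(n_states)]
    R = 0
    for _ in range(T):
        a = boltzmann_pick(Q[R], tau)
        reward, R_next = step(R, ACTIONS[a])             # assumed environment hook
        Q[R][a] = (1 - alpha) * Q[R][a] + alpha * (reward + max(Q[R_next]))
        R = R_next
    # greedy policy pi_Q over the discretized action space
    return [ACTIONS[Q[r].index(max(Q[r]))] for r in range(n_states)]
```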
VIII. RELATED WORK

The spectrum access by multiple secondary users can be either uncoordinated or coordinated. For the uncoordinated case, multiple secondary users compete with each other for the resource. Huang et al. in [10] designed two auction mechanisms to allocate the interference budget among selfish users. Southwell and Huang in [11] studied the largest and smallest convergence times to an equilibrium when secondary users access multiple channels in a distributed fashion. Liu et al. in [12] modeled the interactions among spatially separated users as congestion games with resource reuse. Li and Han in [13] applied graphical game theory to address the spectrum access problem with a limited range of mutual interference. Anandkumar et al. in [14] proposed a learning-based approach for competitive spectrum access with incomplete spectrum information. Law et al. in [15] showed that uncoordinated spectrum access may lead to poor system performance.

For the coordinated spectrum access, Zhao et al. in [16] proposed a dynamic group formation algorithm to distribute secondary users' transmissions across multiple channels. Shu and Krunz proposed a multi-level spectrum opportunity framework in [17]. The above papers assumed that each secondary user knows the entire channel occupancy information. We consider the case where each secondary user only has a limited view of the system, and users improve each other's information by recommendation.

Our algorithm design is partially inspired by the recommendation systems in the electronic commerce industry, where analytical methods such as collaborative filtering [18] and multi-armed bandit process modeling [19] are useful. However, we cannot directly apply the existing methods to analyze cognitive radio networks due to the unique congestion effect in our model.

IX. CONCLUSION

In this paper, we propose an adaptive channel recommendation scheme for efficient spectrum sharing. We formulate the problem as an average reward based Markov decision process. We first prove the existence of the optimal stationary spectrum access policy, and then characterize the structure of the optimal policy in two asymptotic cases. Furthermore, we propose a novel MRAS-based algorithm that is provably convergent to the optimal policy. Numerical results show that our proposed algorithm outperforms the static approach in the literature by up to 18% and the Q-learning method by up to 10% in terms of system throughput. Our algorithm is also more robust to the channel dynamics compared to the static counterpart.

In terms of future work, we are currently extending the analysis by taking the heterogeneity of channels into consideration. We also plan to consider the case where the secondary users are selfish. Designing an incentive-compatible channel recommendation mechanism for that case will be very interesting and challenging.

APPENDIX

A. Proof of Lemma 1

When S_m(t) = 0, the result trivially holds. We focus on the case that S_m(t) = 1.

Let K_m = {1, ..., k_m(t)} be the set of secondary users accessing the channel m, τ_m^i be the backoff timer value generated by secondary user i, and τ_m^{(1)} = min{τ_m^i | i ≠ n, i ∈ K_m}. The probability that user n captures the channel m is given as
$$\Pr_{n,m} = \Pr\{\tau_m^{(1)} > \tau_m^n\} = \left(1-\frac{\tau_m^n}{\tau_{max}}\right)^{k_m(t)-1}.$$
Thus, the expected throughput of user n is
$$u_n(t) = \int_0^{\tau_{max}} B\,\Pr_{n,m}\frac{1}{\tau_{max}}d\tau_m^n = \int_0^{\tau_{max}} B\left(1-\frac{\tau_m^n}{\tau_{max}}\right)^{k_m(t)-1}\frac{1}{\tau_{max}}d\tau_m^n = \frac{B}{k_m(t)} = \frac{BS_m(t)}{k_m(t)}.$$

B. Proof of Lemma 2

Let Λ_C denote the event that C secondary users choose the recommended channels, and Pr(c₁, ..., c_R) denote the probability mass function that the numbers of secondary users on these R recommended channels equal c₁, ..., c_R, respectively. Given the event Λ_C, we have
$$\Pr(c_1,...,c_R\mid\Lambda_C) = \binom{C}{c_1,...,c_R}R^{-C},$$
which is a multinomial mass function. By the property of the multinomial distribution, we have
$$E[c_m\mid\Lambda_C] = \frac{C}{R}.$$
It follows that the expected number of users choosing a recommended channel m is
$$E[c_m] = \sum_{C=0}^{N}E[c_m\mid\Lambda_C]\Pr(\Lambda_C) = \sum_{C=0}^{N}\frac{C}{R}\binom{N}{C}P_{rec}^C(1-P_{rec})^{N-C} = \frac{P_{rec}N}{R}.$$
Then E[c_m] = 1 requires that
$$P_{rec} = \frac{R}{N}.$$

p 2
nr secondary users have accessed these m̄r recommended Prec
P0,2 (2) = (1 − Prec )2 ( ) ,
p+q
channels and nu secondary users have accessed those m̄u
unrecommended channels at time slot t + 1. Obviously, we
Prec 2 q 2
have nr + nu = N , nr ≥ m̄r and nu ≥ m̄u . P1,0 (2) = Prec q + (1 − Prec )2 ( )
For the first term, the probability that the user distribu- p+q
tion (nr , nu ) happens follows the Binomial distribution as q2
 +2Prec (1 − Prec ) ,
N nr p+q
Prec (1 − Prec )nu .
nr
For the second term, when m̄r ≥ 1, it is easy to check Prec 2 2pq
nr − 1 P1,1 (2) = Prec (1 − q) + (1 − Prec )2
that there are ways for nr secondary users to (p + q)2
m̄r − 1
(1 − q)q + pq
choose m̄r recommended channels and there are (R−R!m̄r )! +2Prec (1 − Prec ) ,
possibilities for these m̄r recommended channels out of the R p+q
recommended channels, each of which has probability ( R1 )nr . p 2 (1 − q)p
Among these m̄r recommended channels that have been ac-
Prec
P1,2 (2) = (1 − Prec )2 ( ) + 2Prec (1 − Prec ) ,
p+q p+q
 that mr channels
cessed by the secondary users, theprobability
m̄r
turn out to be idle is given as (1 − q)mr q m̄r −mr . q + q2 q 2
mr Prec 2
P2,0 (2) = Prec + (1 − Prec )2 ( )
When m̄r = 0, it requires that ur = 0. Thus, we define 2 p+q
  ( q2
nr − 1 1 If nr =0, +2Prec (1 − Prec ) ,
= p+q
−1 0 Otherwise.
Similarly, we can obtain the third term for the unrecom- 2 1 − q + (1 − q)q 2pq
mended channels case.
Prec
P2,1 (2) = Prec + (1 − Prec )2
2 (p + q)2
(1 − q)q + pq
+2Prec (1 − Prec ) ,
D. Lemma 5 p+q
Since the operation R′ ∈R P P
P
R,R′ [·] plays a key role in the
rec

Bellman equation, to facilitate the study, we first define the Prec 2 (1 − q)2 p 2
P2,2 (2) = Prec + (1 − Prec )2 ( )
following function 2 p+q
min{M,N }
(1 − q)p
+2Prec (1 − Prec ) .
p+q
X
fr (R, Prec ) , PP
R,i , ∀r ∈ R.
rec

i=r It is easy to check the following holds


Since Prec Prec Prec
P0,0 (2) ≥ P1,0 (2) ≥ P2,0 (2),
fr (R, Prec ) Prec
P2,2 Prec
(2) ≥ P1,2 Prec
(2) ≥ P0,2 (2).
= P r(R(t + 1) ≥ r|R(t) = R, Prec (t) = Prec )
Since
= 1 − P r(R(t + 1) < r|R(t) = R, Prec (t) = Prec ),
2
X 2
X 2
X
We call the function fr (R, Prec ) as the reverse cumulative Prec Prec Prec
P0,i (2) = P1,i (2) = P2,i (2) = 1,
distribution function in the sequel. i=0 i=0 i=0

Lemma 5. When M = ∞ and N < ∞, the reverse we thus obtain


cumulative distribution function fr (R, Prec ) is nondecreasing
in R for all r, R ∈ R, Prec ∈ P. fr2 (R + 1, Prec ) ≥ fr2 (R, Prec ), ∀R, r ∈ R, Prec ∈ P,

proof: We prove the result by induction argument. In abuse i.e. fr (R, Prec ) is nondecreasing in R for the case N = 2.
of notation, we denote the transition probability P P rec
R,R′ and
We then assume that fr (R, Prec ) is nondecreasing in R for
the reverse cumulative distribution function fr (R, Prec ) when all R ∈ R, Prec ∈ P for the case that N = k ≥ 2 i.e.
Prec k
the number of users N = k as PR,R ′ (k) and fr (R, Prec )

respectively. frk (R + 1, Prec ) ≥ frk (R, Prec ), ∀R, r ∈ R, Prec ∈ P.


When N = 2, from (4), we have We next prove that fr (R, Prec ) is nondecreasing for the case
Prec 2 q 2 the N = k + 1 under this hypothesis.
P0,0 (2) = Prec + (1 − Prec )2 ( )
p+q Let ψ denote the event that one arbitrary user out of these
q k + 1 users, does not generate a recommendation at time slot
+2Prec (1 − Prec ) ,
p+q t + 1. Obviously,
2pq p q
Prec
P0,1 (2) = (1 − Prec )2 2
+ 2Prec (1 − Prec ) , P r(ψ) = Prec q + (1 − Prec ) ,
(p + q) p+q p+q
14

which depends on Prec and the channel environment only. By Now, we assume it also holds for Vt (R) when t = k +
conditioning on the event ϕ, we have 1, k + 2, ..., T. Let R̂ be a system state such that R̂ ≥ R. By
the hypothesis, we have Vk+1 (R̂) ≥ Vk+1 (R). Let π ∗ be the
PP
R+1,i (k + 1) =
rec
PP
R+1,i−1 (k)[1 − P r(ψ)]
rec
optimal policy. From the Bellman equation in (5), we have
Prec
+P R+1,i (k)P r(ψ), (22) min{M,N }
π ∗ (R)
X
PP
R,i (k
rec
+ 1) = PP
R,i−1 (k)[1 − P r(ψ)].
rec
Vk (R) = P R,R′ [UR′ + βVk+1 (R′ )], ∀R ∈ R.
Prec R′ =0
+P R,i (k)P r(ψ) (23) (26)
Thus, By defining a new system state −1 such that U−1 +
βVk+1 (−1) = 0, we can rewrite the equation in (26) as
frk+1 (R + 1, Prec ) − frk+1 (R, Prec ) min{M,N } R ′

k+1 k+1 π ∗ (R)


X X
X X Vk (R) = P R,R′ {[Ui + βVk+1 (i)]
= PP R+1,i (k + 1) −
rec
PP R,i (k + 1)
rec
R′ =0 i=0
i=r i=r
−[Ui−1 + βVk+1 (i − 1)]}
k+1 k+1
X X min{M,N }
= [ PP
R+1,i−1 (k)
rec
− PP
R,i−1 (k)][1
rec
− P r(ψ)] X
i=r i=r
= {[UR′ + βVk+1 (R′ )]
k k R′ =0
min{M,N }
X X
+[ PP
R+1,i (k) −
rec
PP
R,i (k)]P r(ψ)
rec

X π ∗ (R)
i=r i=r −[UR′ −1 + βVk+1 (R − 1)]} P R,i .
k k i=R′
X X
= [ PP
R+1,j (k) −
rec
PP
R,j (k)][1 − P r(ψ)]
rec
By lemma 5 in the Appendix, we have
j=r j=r
min{M,N } min{M,N }
k k π ∗ (R) π ∗ (R)
X X
X X P R̂,i ≥ P R,i , ∀R′ ∈ R.
+[ PP
R+1,i (k)
rec
− PP
R,i (k)]P r(ψ)
rec
i=R′ i=R′
i=r i=r
k k Then
= [fr−1 (R + 1, Prec ) − fr−1 (R, Prec )][1 − P r(ψ)]
k k min{M,N }
= [fr (R + 1, Prec ) − fr (R, Prec )]P r(ψ) X
Vk (R) ≤ {[UR′ + βVk+1 (R′ )]
≥ 0. (24) R′ =0
min{M,N }
i.e. fr (R, Prec ) is also nondecreasing for the case the N = X π ∗ (R)
−[UR′ −1 + βVk+1 (R′ − 1)]} P R̂,i
k + 1. By the induction argument, the result holds for the case
i=R′
that N ≥ 2.
min{M,N }
π ∗ (R)
X
E. Lemma 6 = P R̂,R′ [UR′ + βVk+1 (R′ )]
Lemma 6. When M = +∞ and N < +∞, the reverse R′ =0
X
Prec
cumulative distribution function fr (R, Prec ) is supermodular ≤ max P R̂,R ′
[UR′ + βVt+1 (R′ )]
Prec ∈P
on R × P. R′ ∈R
min{M,N }
proof: To show fr (R, Prec ) is supermodular on R × P is X π ∗ (R̂)
= P R̂,R′ [UR′ + βVk+1 (R′ )]
equivalent to proving the following is true:
R′ =0
∂ 2 fr (R, Prec ) = Vk (R̂),
≥ 0. (25)
∂Prec ∂R
i.e., for t = k, Vk (R̂) ≥ Vk (R) also holds. This completes the
Since R is an integral variable, (25) is equivalent to proof.
∂fr (R + 1, Prec ) ∂fr (R, Prec )
− ≥ 0.
∂Prec ∂Prec G. Proof of Theorem 5
(R,Prec )
That is, it is equivalent to showing ∂fr∂P is nondecreas- We first show that under the reference distribution, the
rec
ing in R. By the similar procedure in proof of Lemma 5, we optimal policy is attainable.
show this holds. Lemma 7. For the MRAS algorithm, the policy π generated
by the sequence of reference distributions {gk } converges
F. Proof of Proposition 1 point-wisely to the optimal spectrum access policy π ∗ for the
adaptive channel recommendation MDP, i.e.
We prove the proposition by induction. Suppose that the
time horizon consists of any T time slots. lim Egk [π(R)] = π(R)∗ , ∀R ∈ R, (27)
k→∞
When t = T , VT (R) = UR = RB, and the proposition is lim V argk [π(R)] = 0, ∀R ∈ R. (28)
trivially true. k→∞
Proof: The proof is developed on the basis of the results in [6].

First, from the MRAS algorithm, we have
$$\gamma_k \le \gamma_{k+1},$$
i.e., the sequence {γ_k} is monotone. Since 0 ≤ γ_k ≤ Φ_{π*} is bounded, there must exist a finite K such that γ_{k+1} = γ_k, ∀k ≥ K.

When γ_K = Φ_{π*}, the result
$$\lim_{k\to\infty}E_{g_k}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}] = e^{\Phi_{\pi^*}}$$
holds. When γ_K < Φ_{π*}, from (9), we know that
$$E_{g_k}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}] \ge E_{g_{k-1}}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}],\quad\forall k\ge K.$$
That is, the sequence {E_{g_k}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}]} is monotone and hence converges. We then show that the limit of this sequence must be e^{\Phi_{π*}} by contradiction.

Suppose that
$$\lim_{k\to\infty}E_{g_k}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}] = e^{\Phi_*} < e^{\Phi_{\pi^*}}.$$
Define the set
$$\Theta = \left\{\pi: \Phi_\pi \ge \max\left\{\gamma_K,\ \ln\frac{e^{\Phi_*}+e^{\Phi_{\pi^*}}}{2}\right\}\right\}.$$
Since γ_K < Φ_{π*}, the set Θ is not empty by the continuity property over the policy space of the MDP [8]. Note that
$$g_k(\pi) = \prod_{i=1}^{k}\frac{e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_i\}}}{E_{g_i}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_i\}}]}\,g_1(\pi),$$
and
$$\lim_{k\to\infty}\frac{e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}}{E_{g_k}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}]} = \frac{e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_K\}}}{e^{\Phi_*}} > 1,\quad\forall\pi\in\Theta,$$
we thus have
$$\lim_{k\to\infty}g_k(\pi)=\infty,\quad\forall\pi\in\Theta.$$
By Fatou's lemma, we have
$$1 = \liminf_{k\to\infty}\int_{\pi\in\Omega}g_k(\pi)d\pi \ge \liminf_{k\to\infty}\int_{\pi\in\Theta}g_k(\pi)d\pi \ge \int_{\pi\in\Theta}\liminf_{k\to\infty}g_k(\pi)d\pi = \infty,$$
which forms a contradiction. Hence, we have
$$\lim_{k\to\infty}E_{g_k}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}] = e^{\Phi_{\pi^*}}.$$
Since e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}} is a monotone function of Φ_π and a one-to-one map over the set {π: Φ_π ≥ γ}, the result above implies that
$$\lim_{k\to\infty}E_{g_k}[\pi]=\pi^*,\qquad(29)$$
$$\lim_{k\to\infty}Var_{g_k}[\pi]=0.\qquad(30)$$
To complete the proof of the theorem, we next show that
$$E_{g_k}[\pi(R)]=E_{f(\pi,\mu,\sigma)}[\pi(R)],\quad\forall R\in\mathcal{R},$$
$$E_{g_k}[\pi^2(R)]=E_{f(\pi,\mu,\sigma)}[\pi^2(R)],\quad\forall R\in\mathcal{R}.$$
For the sake of simplicity, we first define a function
$$H(\mu,\sigma,\gamma_k) \triangleq \int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}\ln f(\pi,\mu,\sigma)d\pi.$$
Since
$$f(\pi,\mu,\sigma) = \prod_{R=0}^{\min\{M,N\}}f(\pi(R),\mu_R,\sigma_R) = \prod_{R=0}^{\min\{M,N\}}\frac{1}{\sqrt{2\varphi\sigma_R^2}}e^{-\frac{(\pi(R)-\mu_R)^2}{2\sigma_R^2}} = \prod_{R=0}^{\min\{M,N\}}e^{\frac{\mu_R\pi(R)}{\sigma_R^2}-\frac{\mu_R^2}{2\sigma_R^2}}f(\pi(R),0,\sigma_R) = \prod_{R=0}^{\min\{M,N\}}\frac{e^{\frac{\mu_R\pi(R)}{\sigma_R^2}}f(\pi(R),0,\sigma_R)}{\int_{\pi(R)\in\mathcal{P}}e^{\frac{\mu_R\pi(R)}{\sigma_R^2}}f(\pi(R),0,\sigma_R)d\pi(R)},$$
we then obtain
$$H(\mu,\sigma,\gamma_k) = \sum_{R=0}^{\min\{M,N\}}\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}\frac{\mu_R\pi(R)}{\sigma_R^2}d\pi + \sum_{R=0}^{\min\{M,N\}}\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}\ln f(\pi(R),0,\sigma_R)d\pi - \sum_{R=0}^{\min\{M,N\}}\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}\ln\left[\int_{\pi(R)\in\mathcal{P}}e^{\frac{\mu_R\pi(R)}{\sigma_R^2}}f(\pi(R),0,\sigma_R)d\pi(R)\right]d\pi.$$
Since the optimization problem in (15) is to solve
$$\max_{\mu,\sigma}H(\mu,\sigma,\gamma_k),$$
the updated parameters (µ_k, σ_k) thus maximize H(µ, σ, γ_k). It means that
$$\nabla H(\mu_k,\sigma_k,\gamma_k) = 0.$$
That is, for each R ∈ 𝓡, the first order condition ∇H(µ, σ, γ_k) = 0 can be written as
$$\frac{\int_{\pi(R)\in\mathcal{P}}e^{\frac{\mu_R\pi(R)}{\sigma_R^2}}f(\pi(R),0,\sigma_R)\frac{\pi(R)}{\sigma_R^2}d\pi(R)}{\int_{\pi(R)\in\mathcal{P}}e^{\frac{\mu_R\pi(R)}{\sigma_R^2}}f(\pi(R),0,\sigma_R)d\pi(R)}\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}d\pi - \int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}\frac{\pi(R)}{\sigma_R^2}d\pi = 0.$$
It follows that
$$\frac{\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}\pi(R)d\pi}{\int_{\pi\in\Omega}e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}d\pi} = \frac{\int_{\pi(R)\in\mathcal{P}}e^{\frac{\mu_R\pi(R)}{\sigma_R^2}}f(\pi(R),0,\sigma_R)\pi(R)d\pi(R)}{\int_{\pi(R)\in\mathcal{P}}e^{\frac{\mu_R\pi(R)}{\sigma_R^2}}f(\pi(R),0,\sigma_R)d\pi(R)},\quad\forall R\in\mathcal{R}.$$
By multiplying the same constant on the numerator and the denominator of the terms on both sides, we have
$$\frac{\int_{\pi\in\Omega}\frac{e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}g_{k-1}(\pi)}{E_{g_{k-1}}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}]}\pi(R)d\pi}{\int_{\pi\in\Omega}\frac{e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}g_{k-1}(\pi)}{E_{g_{k-1}}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}]}d\pi} = \frac{\int_{\pi(R)\in\mathcal{P}}f(\pi(R),\mu_R,\sigma_R)\pi(R)d\pi(R)}{\int_{\pi(R)\in\mathcal{P}}f(\pi(R),\mu_R,\sigma_R)d\pi(R)},\quad\forall R\in\mathcal{R}.$$
Since
$$\int_{\pi(R)\in\mathcal{P}}f(\pi(R),\mu_R,\sigma_R)d\pi(R) = \int_{\pi\in\Omega}\frac{e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}g_{k-1}(\pi)}{E_{g_{k-1}}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}]}d\pi = 1,$$
we obtain
$$\int_{\pi\in\Omega}\frac{e^{(k-1)\Phi_\pi}I_{\{\Phi_\pi\ge\gamma_k\}}g_{k-1}(\pi)}{E_{g_{k-1}}[e^{\Phi_\pi}I_{\{\Phi_\pi\ge\gamma\}}]}\pi(R)d\pi = \int_{\pi(R)\in\mathcal{P}}f(\pi(R),\mu_R,\sigma_R)\pi(R)d\pi(R),\quad\forall R\in\mathcal{R},$$
i.e.,
$$E_{g_k}[\pi(R)] = E_{f(\pi,\mu,\sigma)}[\pi(R)],\quad\forall R\in\mathcal{R}.$$
Similarly, we can show that
$$E_{g_k}[\pi^2(R)] = E_{f(\pi,\mu,\sigma)}[\pi^2(R)],\quad\forall R\in\mathcal{R}.$$
From (29), it follows that
$$\lim_{k\to\infty}E_{f(\pi,\mu_k,\sigma_k)}[\pi] = \lim_{k\to\infty}E_{g_k}[\pi] = \pi^*,$$
and
$$\lim_{k\to\infty}Var_{f(\pi,\mu_k,\sigma_k)}[\pi(R)] = \lim_{k\to\infty}\{E_{f(\pi,\mu_k,\sigma_k)}[\pi^2(R)] - E_{f(\pi,\mu_k,\sigma_k)}[\pi(R)]^2\} = \lim_{k\to\infty}\{E_{g_k}[\pi^2(R)] - E_{g_k}[\pi(R)]^2\} = \lim_{k\to\infty}Var_{g_k}[\pi(R)] = 0.\ \blacksquare$$

REFERENCES

[1] J. Mitola, "Cognitive radio: An integrated agent architecture for software defined radio," Ph.D. dissertation, Royal Institute of Technology (KTH), Stockholm, Sweden, 2000.
[2] Q. Zhao, L. Tong, A. Swami, and Y. Chen, "Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework," IEEE Journal on Selected Areas in Communications, vol. 25, pp. 589–600, 2007.
[3] M. Wellens, J. Riihijarvi, and P. Mahonen, "Empirical time and frequency domain models of spectrum use," Elsevier Physical Communications, vol. 2, pp. 10–32, 2009.
[4] M. Wellens, J. Riihijarvi, M. Gordziel, and P. Mahonen, "Spatial statistics of spectrum usage: From measurements to spectrum models," in IEEE International Conference on Communications, 2009.
[5] H. Li, "Customer reviews in spectrum: recommendation system in cognitive radio networks," in IEEE Symposia on New Frontiers in Dynamic Spectrum Access Networks (DySPAN), 2010.
[6] J. Hu, M. Fu, and S. Marcus, "A model reference adaptive search algorithm for global optimization," Operations Research, vol. 55, pp. 549–568, 2007.
[7] C. Cormio and K. R. Chowdhury, "Common control channel design for cognitive radio wireless ad hoc networks using adaptive frequency hopping," Elsevier Journal of Ad Hoc Networks, vol. 8, pp. 430–438, 2010.
[8] S. M. Ross, Introduction to Stochastic Dynamic Programming. Academic Press, 1993.
[9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. A Bradford Book, 1998.
[10] J. Huang, R. Berry, and M. L. Honig, "Auction-based spectrum sharing," ACM/Springer Mobile Networks and Applications Journal, 2006.
[11] R. Southwell and J. Huang, "Convergence dynamics of resource-homogeneous congestion games," in International Conference on Game Theory for Networks, Shanghai, China, April 2011.
[12] M. Liu, S. Ahmad, and Y. Wu, "Congestion games with resource reuse and applications in spectrum sharing," in International Conference on Game Theory for Networks, 2009.
[13] H. Li and Z. Han, "Competitive spectrum access in cognitive radio networks: graphical game and learning," in IEEE Wireless Communications and Networking Conference (WCNC), 2010.
[14] A. Anandkumar, N. Michael, and A. Tang, "Opportunistic spectrum access with multiple users: learning under competition," in IEEE International Conference on Computer Communications (INFOCOM), 2010.
[15] L. M. Law, J. Huang, M. Liu, and S. Li, "Price of anarchy of cognitive MAC games," in IEEE Global Communications Conference, 2009.
[16] J. Zhao, H. Zheng, and G. Yang, "Distributed coordination in dynamic spectrum allocation networks," in IEEE Symposia on New Frontiers in Dynamic Spectrum Access Networks (DySPAN), 2005.
[17] T. Shu and M. Krunz, "Coordinated channel access in cognitive radio networks: a multi-level spectrum opportunity perspective," in IEEE International Conference on Computer Communications (INFOCOM), 2009.
[18] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry, "Using collaborative filtering to weave an information tapestry," Communications of the ACM, vol. 35, pp. 61–70, 1992.
[19] B. Awerbuch and R. Kleinberg, "Competitive collaborative learning," Journal of Computer and System Sciences, vol. 74, pp. 1271–1288, 2008.