email:{cx008,jwhuang}@ie.cuhk.edu.hk,husheng@eecs.utk.edu
arXiv:1102.4728v1 [cs.DC] 23 Feb 2011
1 Please refer to [7] for the details on how to set up and maintain a reliable common control channel in cognitive radio networks.
2 This may not be true for other random MAC mechanisms such as slotted Aloha.
III. REVIEW OF STATIC CHANNEL RECOMMENDATION

The key idea of the static channel recommendation scheme in [5] is that secondary users inform each other about the available channels they have just accessed. More specifically, each secondary user executes the following three stages synchronously during each time slot (see Figure 2):

• Spectrum sensing: sense one of the channels based on the channel selection result made at the end of the previous time slot.
• Data transmission: if the channel sensing result is idle, compete for the channel with the timer mechanism described in Section II. Then transmit data packets if the user successfully grabs the channel.
• Channel recommendation and selection:
  – Announce recommendation: if the user has successfully accessed an idle channel, broadcast this channel ID to all other secondary users.
  – Collect recommendation: collect recommendations from other secondary users and store them in a buffer. Typically, the correlation of channel availabilities between two slots diminishes as the time difference increases. Therefore, each secondary user keeps only the recommendations received during the most recent W slots and discards the out-of-date information. The user's own successful transmission history within the W most recent time slots is also stored in the buffer. W is a system design parameter and will be discussed further later.
  – Select channel: choose a channel to sense in the next time slot by putting more weight on the recommended channels according to a static branching probability P_{rec}. Suppose that the user has R different channel recommendations in the buffer; then the probability of accessing a channel m is

    P_m = \begin{cases} \frac{P_{rec}}{R}, & \text{if channel } m \text{ is recommended}, \\ \frac{1 - P_{rec}}{M - R}, & \text{otherwise}. \end{cases} \quad (3)

    A larger value of P_{rec} means putting more weight on the recommended channels.

To illustrate the channel selection process, let us take the network in Figure 1 as an example. Suppose that the branching probability P_{rec} = 0.4. Since only R = 1 recommendation is available (i.e., channel 4), the probabilities of choosing the recommended channel 4 and any unrecommended channel are 0.4/1 = 0.4 and (1 − 0.4)/(6 − 1) = 0.12, respectively.

Numerical studies in [5] showed that the static channel recommendation scheme achieves higher performance than the traditional random channel access scheme without information exchange. However, the fixed value of P_{rec} limits the performance of the static scheme, as explained next.

IV. MOTIVATIONS FOR ADAPTIVE CHANNEL RECOMMENDATION

The static channel recommendation mechanism is simple to implement due to the fixed value of P_{rec}. However, it may lead to significant congestion when the number of recommended channels is small. In the extreme case when only R = 1 channel is recommended, calculation (3) suggests that every user will access that channel with probability P_{rec}. When the number of users N is large, the expected number of users accessing this channel, N P_{rec}, will be high. Thus heavy congestion occurs and each secondary user obtains a low expected throughput.

A better way is to adaptively change the value of P_{rec} based on the number of recommended channels. This is the key idea of our proposed algorithm. To illustrate the advantage of adaptive algorithms, let us first consider a simple heuristic adaptive algorithm, in which we choose the branching probability such that the expected number of secondary users choosing a single recommended channel is one. To achieve this, we need to set P_{rec} as in Lemma 2.

Lemma 2. If we choose the branching probability P_{rec} = \frac{R}{N}, then the expected number of secondary users choosing any one of the R recommended channels is one.

Please refer to the Appendix for the detailed proof of Lemma 2.

Without going through a detailed analysis, it is straightforward to show the benefit of such an adaptive approach through simple numerical examples. Let us consider a network with M = 10 channels and N = 5 secondary users. For each channel m, the initial channel state probability vector is p_m(0) = (0, 1) and the transition matrix is

\Gamma = \begin{pmatrix} 1 - 0.01\epsilon & 0.01\epsilon \\ 0.01\epsilon & 1 - 0.01\epsilon \end{pmatrix},

where ε is called the dynamic factor. A larger value of ε implies that the channels are more dynamic over time. We are interested in the time average system throughput

U = \frac{\sum_{t=1}^{T} \sum_{n=1}^{N} u_n(t)}{T},

where u_n(t) is the throughput of user n at time slot t. In the simulation, we set the total number of time slots T = 2000. We implement the following three channel access schemes:

• Random access scheme: each secondary user selects a channel randomly.
• Static channel recommendation scheme as in [5] with the optimal constant branching probability P_{rec} = 0.7.
• Heuristic adaptive channel recommendation scheme with the variable branching probability P_{rec} = \frac{R}{N}.

Figure 4 shows that the heuristic adaptive channel recommendation scheme outperforms the static channel recommendation scheme, which in turn outperforms the random access scheme. Moreover, the heuristic adaptive scheme is more robust to the dynamic channel environment, as its throughput decreases more slowly than that of the static scheme when ε increases.

We can imagine that an optimal adaptive scheme (setting the right P_{rec}(t) over time) can further increase the network performance. However, computing the optimal branching probability in closed form is very difficult. In the rest of the paper, we focus on characterizing the structure of the optimal spectrum access strategy and designing an efficient algorithm to achieve the optimum.
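The comparison above can be reproduced in spirit with a small Monte-Carlo sketch. This is an idealized model, not the authors' simulator: the timer contention is collapsed to "exactly one winner per idle channel", the recommendation window is W = 1, and every channel flips state with probability 0.01ε per slot as in the symmetric Γ above.

```python
import random

def simulate(scheme, M=10, N=5, eps=1.0, T=3000, B=1.0, seed=1):
    """Idealized slot simulation of the three access schemes."""
    rng = random.Random(seed)
    idle = [True] * M            # p_m(0) = (0, 1): all channels start idle
    recommended = []             # channels recommended in the previous slot
    total = 0.0
    for _ in range(T):
        R = len(recommended)
        if scheme == "static":
            p_rec = 0.7
        else:                    # heuristic adaptive: P_rec = R / N
            p_rec = min(1.0, R / N)
        unrec = [m for m in range(M) if m not in recommended]
        choices = []
        for _ in range(N):
            if scheme == "random":
                choices.append(rng.randrange(M))       # uniform over all M
            elif recommended and rng.random() < p_rec:
                choices.append(rng.choice(recommended))
            else:
                choices.append(rng.choice(unrec))
        used = {c for c in choices if idle[c]}         # one timer winner each
        total += B * len(used)
        recommended = sorted(used)                     # announce channel IDs
        # each channel flips state with probability 0.01 * eps, as in Gamma
        idle = [(not s) if rng.random() < 0.01 * eps else s for s in idle]
    return total / T
```

Running `simulate("random")`, `simulate("static")` and `simulate("adaptive")` reproduces the qualitative ordering reported for Figure 4, since recommended channels remain idle with probability 0.99 per slot while a randomly chosen channel is idle only about half the time.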
Recall that B is the data rate that a single user can obtain
on an idle channel.
• Stationary Policy: π ∈ Ω ≜ P^{|R|} maps each state R to an action P_{rec}, i.e., π(R) is the action P_{rec} taken when the system is in state R. The mapping is stationary and does not depend on time t.

Given a stationary policy π and the initial state R_0 ∈ R, we define the network's value function as the time average system throughput, i.e.,

\Phi_{\pi}(R_0) = \lim_{T \to \infty} \frac{1}{T} E_{\pi}\left[\sum_{t=0}^{T-1} U(R(t), \pi(R(t)))\right].

… the current system state R, i.e.,

U(R, P_{rec}) = \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} U_{R'},

where U_{R'} is the system throughput in state R'. If R' idle channels are utilized by the secondary users in a time slot, then these R' channels will be recommended at the end of the time slot. Thus, we have

U_{R'} = R' B.

… two states communicate with each other.

Case II, when q = 1: for all R ∈ R, the transition probability P^{P_{rec}}_{R,R'} > 0 if R' ∈ {0, ..., min{M − R, N}}. It follows that the state R' = 0 is accessible from any other state R ∈ R. By setting R = 0, we see that P^{P_{rec}}_{R,R'} > 0 for all R' ∈ {0, ..., min{M, N}}. That is, any other state R' ∈ R is also accessible from the state R = 0. Thus, any two states communicate with each other.

Since any two states communicate with each other in all cases and the number of system states |R| is finite, the resulting Markov chain is irreducible. Combining Lemmas 3 and 4, we have

3 Users need to know the IDs of the recommended channels in order to access them. However, the IDs are not important in terms of the MDP analysis.
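The one-slot expected throughput U(R, P_{rec}) is just a transition row weighted by U_{R'} = R'B. A minimal sketch with a hypothetical 3-state transition row (the numbers are illustrative, not from the paper):

```python
B_rate = 1.0                          # data rate B on an idle channel
row = [0.2, 0.5, 0.3]                 # assumed transition row P_{R, R'}
U_next = [B_rate * r for r in range(3)]   # U_{R'} = R' * B

# U(R, P_rec) = sum_{R'} P_{R,R'} * U_{R'}
U_expected = sum(p * u for p, u in zip(row, U_next))
print(U_expected)                     # 0.2*0 + 0.5*1 + 0.3*2 = 1.1
```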
P^{P_{rec}}_{R,R'} = \sum_{m_r + m_u = R'} \; \sum_{\substack{R \ge \bar m_r \ge m_r \\ M - R \ge \bar m_u \ge m_u}} \; \sum_{\substack{n_r + n_u = N \\ n_r \ge \bar m_r, \; n_u \ge \bar m_u}} \binom{N}{n_r} P_{rec}^{n_r} (1 - P_{rec})^{n_u}
\cdot \binom{\bar m_r}{m_r} (1 - q)^{m_r} q^{\bar m_r - m_r} \frac{R!}{(R - \bar m_r)!} \binom{n_r - 1}{\bar m_r - 1} R^{-n_r}
\cdot \binom{\bar m_u}{m_u} \left(\frac{p}{p+q}\right)^{m_u} \left(\frac{q}{p+q}\right)^{\bar m_u - m_u} \frac{(M-R)!}{(M - R - \bar m_u)!} \binom{n_u - 1}{\bar m_u - 1} (M-R)^{-n_u}. \quad (4)
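Equation (4) sums the probability of a one-slot sampling process in closed form. As an illustration only (not the authors' derivation), the same transition can be estimated by Monte Carlo, reusing the conditional idle probabilities that appear as terms of (4): 1 − q for a recommended channel and p/(p+q) for an unrecommended one. All parameter values below are hypothetical.

```python
import random
from collections import Counter

def slot_transition_mc(R, M=6, N=4, p_rec=0.5, p=0.1, q=0.2,
                       trials=20000, seed=7):
    """Monte-Carlo estimate of the one-slot transition R -> R'.

    Each user picks a recommended channel (uniformly among the R of
    them) with prob. p_rec, otherwise an unrecommended one; R' is the
    number of distinct idle channels that receive at least one user.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        # recommended channels idle w.p. 1-q, unrecommended w.p. p/(p+q)
        idle = [rng.random() < (1 - q) for _ in range(R)] + \
               [rng.random() < p / (p + q) for _ in range(M - R)]
        used = set()
        for _ in range(N):
            if R > 0 and rng.random() < p_rec:
                c = rng.randrange(R)
            else:
                c = R + rng.randrange(M - R)
            if idle[c]:
                used.add(c)
        counts[len(used)] += 1
    return {r: c / trials for r, c in sorted(counts.items())}
```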
Theorem 1. There exists an optimal stationary policy for the adaptive channel recommendation MDP.

Furthermore, the irreducibility of the adaptive channel recommendation MDP also implies that the optimal stationary policy π* is independent of the initial state R_0 [8], i.e.,

\Phi_{\pi^*} = \max_{\pi \in \Omega} \Phi_{\pi}(R_0), \quad \forall R_0 \in \mathcal{R},

where Φ_{π*} is the maximum time average system throughput. In the rest of the paper, we will just use "optimal policy" …

C. Structure of Optimal Stationary Policy

Next we characterize the structure of the optimal policy without using closed-form expressions of the policy (which are generally hard to obtain). The key idea is to treat the average reward based MDP as the limit of a sequence of … transition probability from the system state R̄ to R' is …

V^{\pi}_{T-1}(R) \ge V^{\pi'}_{T-1}(R), \quad \forall R \in \mathcal{R}, \; \pi \in \Delta, \; \pi' \in \Delta^c.

Theorem 2. When M = ∞ and N < ∞, for the adaptive channel recommendation MDP, the optimal stationary policy π* is monotone, that is, π*(R) is nondecreasing in R ∈ R.

Proof: For ease of discussion, we define

Q_t(R, P_{rec}) = \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} \left[U_{R'} + \beta V_{t+1}(R')\right],

with the partial cross derivative being

\frac{\partial^2 Q_t(R, P_{rec})}{\partial R \, \partial P_{rec}} = \frac{\partial \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R+1,R'} [U_{R'} + \beta V_{t+1}(R')]}{\partial P_{rec}} - \frac{\partial \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'} [U_{R'} + \beta V_{t+1}(R')]}{\partial P_{rec}}.

By Lemma 6 in the Appendix, we know that the reverse cumulative distribution function \sum_{R' \ge \tilde R} P^{P_{rec}}_{R,R'} is supermodular on \mathcal{R} \times \mathcal{P}. It implies

\frac{\partial \sum_{R' \ge \tilde R} P^{P_{rec}}_{R+1,R'}}{\partial P_{rec}} \ge \frac{\partial \sum_{R' \ge \tilde R} P^{P_{rec}}_{R,R'}}{\partial P_{rec}} …
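The structural results above rest on the Bellman recursion V(R) = max_{P_rec} Σ_{R'} P_{R,R'}[U_{R'} + βV(R')], with the average reward MDP approached as β → 1. A toy discounted value iteration sketch, using a made-up two-action kernel (not the channel-recommendation transition law of (4)):

```python
def value_iteration(P, U, beta=0.9, iters=300):
    """V <- max_a sum_{R'} P[a][R][R'] * (U[R'] + beta * V[R'])."""
    S = len(U)
    V = [0.0] * S
    for _ in range(iters):
        V = [max(sum(P[a][s][t] * (U[t] + beta * V[t]) for t in range(S))
                 for a in range(len(P)))
             for s in range(S)]
    return V

# Hypothetical two-action, three-state kernel (each row sums to one).
P = [
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.3, 0.7]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],   # action 1
]
U = [0.0, 1.0, 2.0]   # U_{R'} = R' * B with B = 1
V = value_iteration(P, U)
```

Because the Bellman operator is a β-contraction, the iterates converge geometrically; with rows that stochastically dominate each other (as here), the value function is also monotone in the state, mirroring the monotone-policy theme of this section.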
A. Model Reference Adaptive Search Method

We first introduce the basic idea of the Model Reference Adaptive Search (MRAS) method. Later on, we will show how the method can be used to obtain the optimal spectrum access policy for our problem.

The MRAS method is a new randomized method for global optimization [6]. The key idea is to randomize the original optimization problem over the feasible region according to a specified probabilistic model. The method then generates candidate solutions and updates the probabilistic model on the basis of elite solutions and a reference model, so as to guide the future search toward better solutions.

Formally, let J(x) be the objective function to maximize. The MRAS method is an iterative algorithm, and it includes three phases in each iteration k:

• Random solution generation: generate a set of random solutions {x} in the feasible set χ according to a parameterized probabilistic model f(x, v_k), which is a probability density function (pdf) with parameter v_k. The number of solutions to generate is a fixed system parameter.
• Reference distribution construction: select elite solutions among the randomly generated set in the previous phase, such that the chosen ones satisfy J(x) ≥ γ. Construct a reference probability distribution as

g_k(x) = \begin{cases} \dfrac{I_{\{J(x) \ge \gamma\}}}{E_{f(x,v_0)}\left[\frac{I_{\{J(x) \ge \gamma\}}}{f(x,v_0)}\right]}, & k = 1, \\[2ex] \dfrac{e^{J(x)} I_{\{J(x) \ge \gamma\}} g_{k-1}(x)}{E_{g_{k-1}}\left[e^{J(x)} I_{\{J(x) \ge \gamma\}}\right]}, & k \ge 2, \end{cases} \quad (7)

where I_{\{\varpi\}} is an indicator function, which equals 1 if the event ϖ is true and zero otherwise. Parameter v_0 is the initial parameter of the probabilistic model (used during the first iteration, i.e., k = 1), and g_{k−1}(x) is the reference distribution of the previous iteration (used when k ≥ 2).
• Probabilistic model update: update the parameter v of the probabilistic model f(x, v) by minimizing the Kullback-Leibler divergence between g_k(x) and f(x, v), i.e.,

v_{k+1} = \arg\min_{v} E_{g_k}\left[\ln \frac{g_k(x)}{f(x, v)}\right]. \quad (8)

By constructing the reference distribution according to (7), the expected performance of random elite solutions can be improved under the new reference distribution, i.e.,

E_{g_k}\left[e^{J(x)} I_{\{J(x) \ge \gamma\}}\right] = \frac{\int_{x \in \chi} e^{2J(x)} I_{\{J(x) \ge \gamma\}} g_{k-1}(x) \, dx}{E_{g_{k-1}}\left[e^{J(x)} I_{\{J(x) \ge \gamma\}}\right]} = \frac{E_{g_{k-1}}\left[e^{2J(x)} I_{\{J(x) \ge \gamma\}}\right]}{E_{g_{k-1}}\left[e^{J(x)} I_{\{J(x) \ge \gamma\}}\right]} \ge E_{g_{k-1}}\left[e^{J(x)} I_{\{J(x) \ge \gamma\}}\right]. \quad (9)

To find a better solution to the optimization problem, it is natural to update the probabilistic model (from which random solutions are generated in the first phase) to be as close to the new reference distribution as possible, as done in the third phase.

B. Model Reference Adaptive Search For Optimal Spectrum Access Policy

In this section, we design an algorithm based on the MRAS method to find the optimal spectrum access policy. Here we treat the adaptive channel recommendation MDP as a global optimization problem over the policy space. The key challenge is the choice of a proper probabilistic model f(·), which is crucial for the convergence of the MRAS algorithm.

1) Random Policy Generation: To apply the MRAS method, we first need to set up a random policy generation mechanism. Since the action space of the channel recommendation MDP is continuous, we use Gaussian distributions. Specifically, we generate a sample action π(R) from a Gaussian distribution for each system state R ∈ R independently, i.e., π(R) ∼ N(µ_R, σ_R^2).4 In this case, a candidate policy π can be generated from the joint distribution of |R| independent Gaussian distributions, i.e.,

(\pi(0), ..., \pi(\min\{M,N\})) \sim N(\mu_0, \sigma_0^2) \times \cdots \times N(\mu_{\min\{M,N\}}, \sigma_{\min\{M,N\}}^2).

As shown later, the Gaussian distribution has nice analytical and convergence properties for the MRAS method.

For the sake of brevity, we denote f(π(R), µ_R, σ_R) as the pdf of the Gaussian distribution N(µ_R, σ_R^2), and f(π, µ, σ) as the random policy generation mechanism with parameters µ ≜ (µ_0, ..., µ_{min{M,N}}) and σ ≜ (σ_0, ..., σ_{min{M,N}}), i.e.,

f(\pi, \mu, \sigma) = \prod_{R=0}^{\min\{M,N\}} f(\pi(R), \mu_R, \sigma_R) = \prod_{R=0}^{\min\{M,N\}} \frac{1}{\sqrt{2\varphi\sigma_R^2}} e^{-\frac{(\pi(R) - \mu_R)^2}{2\sigma_R^2}},

where ϕ is the circumference-to-diameter ratio.

2) System Throughput Evaluation: Given a candidate policy π randomly generated based on f(π, µ, σ), we need to evaluate the expected system throughput Φ_π. From (4), we obtain the transition probabilities P^{π(R)}_{R,R'} for any system states R, R' ∈ R. Since a policy π leads to a finite irreducible Markov chain, we can obtain its stationary distribution. Let us denote the transition matrix of the Markov chain as Q ≜ [P^{π(R)}_{R,R'}]_{|R| \times |R|} and the stationary distribution as p = (Pr(0), ..., Pr(min{M, N})). Obviously, the stationary distribution can be obtained by solving the equation

p Q = p.

We then calculate the expected system throughput Φ_π by

\Phi_\pi = \sum_{R \in \mathcal{R}} Pr(R) U_R.

Note that in the discussion above, we implicitly assume that π ∈ Ω, where Ω is the feasible policy space. Since the Gaussian distribution has a support over (−∞, +∞), we thus …

4 Note that the Gaussian distribution has a support over (−∞, +∞), which is larger than the feasible region of π(R). This issue will be handled in Section VI-B2.
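The three MRAS phases can be sketched in one dimension. This is a minimal illustration, not the paper's Algorithm 1: the objective, sample size, elite ratio, and clamping of samples to the feasible interval are all illustrative choices, with the exponential weighting e^{J(x)} mirroring the reference-distribution construction in (7).

```python
import math
import random

def mras_maximize(J, lo=0.0, hi=1.0, L=200, rho=0.1, iters=60, seed=3):
    """1-D MRAS sketch: sample from a Gaussian, keep the elites above a
    nondecreasing threshold gamma, and refit the Gaussian with weights
    exp(J(x)) on the elite samples."""
    rng = random.Random(seed)
    mu, sigma, gamma = (lo + hi) / 2, (hi - lo), -math.inf
    for _ in range(iters):
        # phase 1: random solution generation (clamped to feasible set)
        xs = [min(hi, max(lo, rng.gauss(mu, sigma))) for _ in range(L)]
        xs.sort(key=J)
        # phase 2: elite selection with a monotone threshold
        gamma = max(gamma, J(xs[int((1 - rho) * L)]))
        elites = [x for x in xs if J(x) >= gamma]
        if not elites:
            continue
        # phase 3: refit the probabilistic model (weighted mean/variance)
        w = [math.exp(J(x)) for x in elites]
        tot = sum(w)
        mu = sum(wi * x for wi, x in zip(w, elites)) / tot
        var = sum(wi * (x - mu) ** 2 for wi, x in zip(w, elites)) / tot
        sigma = max(math.sqrt(var), 1e-6)
    return mu

best = mras_maximize(lambda x: -(x - 0.3) ** 2)
```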
where 0 < ρ < 1 is the elite ratio. For example, when L = 100 and ρ = 0.4, then γ = Φ_{π̂_{60}} and the last 40 samples in the sequence will be selected as elite samples. Note that as long …

Substituting (13) into (14), we have

\max_{\mu, \sigma} \int_{\pi \in \Omega} e^{(k-1)\Phi_\pi} I_{\{\Phi_\pi \ge \gamma\}} \ln f(\pi, \mu, \sigma) \, d\pi, \quad (15)
Algorithm 1 MRAS-based Algorithm For Adaptive Recommendation Based Optimal Spectrum Access
1: Initialize the parameters of the Gaussian distributions (µ_0, σ_0), the elite ratio ρ, and the stopping criterion ξ. Set the initial elite threshold γ_0 = 0 and the iteration index k = 0.
2: repeat:
3: Increase the iteration index k by 1.
4: Generate L candidate policies π_1, ..., π_L from the random policy generation mechanism f(π, µ_{k−1}, σ_{k−1}).
5: Select elite policies by setting the elite threshold γ_k = max{Φ_{π̂_{⌈(1−ρ)L⌉}}, γ_{k−1}}.
6: Update the random policy generation mechanism by

\mu_{R,k} = \frac{\sum_{i=1}^{L} e^{(k-1)\Phi_{\pi_i}} I_{\{\Phi_{\pi_i} \ge \gamma_k\}} \pi_i(R)}{\sum_{i=1}^{L} e^{(k-1)\Phi_{\pi_i}} I_{\{\Phi_{\pi_i} \ge \gamma_k\}}}, \quad \forall R \in \mathcal{R}, \quad (18)

\sigma_{R,k}^2 = \frac{\sum_{i=1}^{L} e^{(k-1)\Phi_{\pi_i}} I_{\{\Phi_{\pi_i} \ge \gamma_k\}} \left[\pi_i(R) - \mu_{R,k}\right]^2}{\sum_{i=1}^{L} e^{(k-1)\Phi_{\pi_i}} I_{\{\Phi_{\pi_i} \ge \gamma_k\}}}, \quad \forall R \in \mathcal{R}. \quad (19)

7: until max_{R ∈ R} σ_{R,k} < ξ.

… into account, we consider the following two types of channel state transition matrices:

Type 1: \Gamma_1 = \begin{pmatrix} 1 - 0.005\epsilon & 0.005\epsilon \\ 0.025\epsilon & 1 - 0.025\epsilon \end{pmatrix}, \quad (20)

Type 2: \Gamma_2 = \begin{pmatrix} 1 - 0.01\epsilon & 0.01\epsilon \\ 0.01\epsilon & 1 - 0.01\epsilon \end{pmatrix}, \quad (21)

where ε is the dynamic factor. Recall that a larger ε means that the channels are more dynamic over time. Using (2), we know that the channel models Γ_1 and Γ_2 have stationary channel idle probabilities of 1/6 and 1/2, respectively. In other words, the primary activity level is much higher with the Type 1 channel than with the Type 2 channel.

We initialize the parameters of the MRAS algorithm as follows. We set µ_R = 0.5 and σ_R = 0.5 for the Gaussian distributions, which places 68.2% of the support over the feasible region (0, 1). We found that the performance of the MRAS algorithm is insensitive to the elite ratio ρ when ρ ≤ 0.3. We thus choose ρ = 0.1.

When using the MRAS-based algorithm, we need to determine how many (feasible) candidate policies to generate in each iteration. Figure 5 shows the convergence of the MRAS algorithm with 100, 300, and 500 candidate policies per iteration, respectively. We have two observations. First, the number of iterations to achieve convergence decreases as the number of candidate policies increases. Second, the improvement in convergence speed is insignificant when the number increases from 300 to 500. We thus choose L = 500 for the experiments in the sequel.

Theorem 5. For the MRAS algorithm, the limiting point of the policy sequence {π_k} generated by the sequence of random policy generation mechanisms {f(π, µ_k, σ_k)} converges pointwise to the optimal spectrum access policy π* for the adaptive channel recommendation MDP, i.e.,

\lim_{k \to \infty} \pi_k(R) = \pi^*(R), \quad \forall R \in \mathcal{R}.

Fig. 6. System throughput with M = 10 channels and N = 5 users under the Type 1 channel state transition matrix.

Fig. 8. Performance gain over the random access scheme. The Type 1 and Type 2 channels have stationary channel idle probabilities of 1/6 and 1/2, respectively.
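The stated idle probabilities follow from the stationary distribution of a two-state chain [[1−a, a], [b, 1−b]], which is (b, a)/(a+b). Assuming the second state is "idle" (consistent with the 1/6 and 1/2 figures; the ordering is not spelled out in this excerpt), for Γ_1 and Γ_2:

```python
def stationary(a, b):
    """Stationary distribution of [[1-a, a], [b, 1-b]]: (b, a)/(a+b)."""
    return (b / (a + b), a / (a + b))

eps = 2.0  # any eps > 0 cancels in the ratio
pi1 = stationary(0.005 * eps, 0.025 * eps)   # Type 1 chain (20)
pi2 = stationary(0.01 * eps, 0.01 * eps)     # Type 2 chain (21)
print(pi1, pi2)   # idle probabilities: pi1[1] = 1/6, pi2[1] = 1/2
```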
… performance than the static and random access schemes when the cognitive radio network suffers a severe congestion effect.

… where 0 < α < 1 is the smoothing factor. Given a system state R, the probability of choosing an action P_{rec} is

Pr(P_{rec}(t) = P_{rec} \mid R(t) = R) = \frac{e^{\tau Q(R, P_{rec})}}{\sum_{P'_{rec} \in \hat{\mathcal{P}}} e^{\tau Q(R, P'_{rec})}},
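The action-selection rule above is a Boltzmann (softmax) distribution over the Q-values. A minimal sketch, with placeholder Q-values (larger τ makes the selection greedier):

```python
import math

def boltzmann_probs(q_values, tau=1.0):
    """Softmax over Q(R, .): Pr(a) = e^{tau*Q(a)} / sum_a' e^{tau*Q(a')}."""
    ws = [math.exp(tau * q) for q in q_values]
    tot = sum(ws)
    return [w / tot for w in ws]

probs = boltzmann_probs([0.2, 1.0, 0.5], tau=2.0)
```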
… allocate the interference budget among selfish users. Southwell and Huang in [11] studied the largest and smallest convergence times to an equilibrium when secondary users access multiple channels in a distributed fashion. Liu et al. in [12] modeled the interactions among spatially separated users as congestion games with resource reuse. Li and Han in [13] applied graphical game theory to address the spectrum access problem with a limited range of mutual interference. Anandkumar et al. in [14] proposed a learning-based approach for competitive spectrum access with incomplete spectrum information. Law et al. in [15] showed that uncoordinated spectrum access may lead to poor system performance.

For coordinated spectrum access, Zhao et al. in [16] proposed a dynamic group formation algorithm to distribute secondary users' transmissions across multiple channels. Shu and Krunz proposed a multi-level spectrum opportunity framework in [17]. The above papers assumed that each secondary user knows the entire channel occupancy information. We consider the case where each secondary user has only a limited view of the system, and users improve each other's information by recommendation.

Our algorithm design is partially inspired by the recommendation systems in the electronic commerce industry, where analytical methods such as collaborative filtering [18] and multi-armed bandit process modeling [19] are useful. However, we cannot directly apply the existing methods to analyze cognitive radio networks due to the unique congestion effect in our model.

IX. CONCLUSION

In this paper, we propose an adaptive channel recommendation scheme for efficient spectrum sharing. We formulate the problem as an average reward based Markov decision process. We first prove the existence of the optimal stationary spectrum access policy, and then characterize the structure of the optimal policy in two asymptotic cases. Furthermore, we propose a novel MRAS-based algorithm that is provably convergent to the optimal policy. Numerical results show that our proposed algorithm outperforms the static approach in the literature by up to 18% and the Q-learning method by up to 10% in terms of system throughput. Our algorithm is also more robust to the channel dynamics than its static counterpart.

In terms of future work, we are currently extending the analysis by taking the heterogeneity of channels into consideration. We also plan to consider the case where the secondary users are selfish. Designing an incentive-compatible channel recommendation mechanism for that case will be very interesting and challenging.

APPENDIX

A. Proof of Lemma 1

When S_m(t) = 0, the result trivially holds. We focus on the case that S_m(t) = 1. Let K_m = {1, ..., k_m(t)} be the set of secondary users accessing channel m, τ^i_m be the backoff time generated by secondary user i, and τ^{(1)}_m = \min\{\tau^i_m \mid i \ne n, i \in K_m\}. The probability that user n captures channel m is given as

Pr_{n,m} = Pr\left\{\tau^{(1)}_m > \tau^n_m\right\} = \left(1 - \frac{\tau^n_m}{\tau_{\max}}\right)^{k_m(t)-1}.

Thus, the expected throughput of user n is

u_n(t) = \int_0^{\tau_{\max}} \frac{1}{\tau_{\max}} B \, Pr_{n,m} \, d\tau^n_m = \int_0^{\tau_{\max}} B \left(1 - \frac{\tau^n_m}{\tau_{\max}}\right)^{k_m(t)-1} \frac{1}{\tau_{\max}} \, d\tau^n_m = \frac{B}{k_m(t)}.

Combining both cases, we have

u_n(t) = \frac{B S_m(t)}{k_m(t)}.

B. Proof of Lemma 2

Let Λ_C denote the event that C secondary users choose the recommended channels, and Pr(c_1, ..., c_R) denote the probability mass function that the numbers of secondary users on these R recommended channels equal c_1, ..., c_R, respectively. Given the event Λ_C, we have

Pr(c_1, ..., c_R \mid \Lambda_C) = \binom{C}{c_1, ..., c_R} R^{-C},

which is a multinomial mass function. By the property of the multinomial distribution, we have

E[c_m \mid \Lambda_C] = \frac{C}{R}.

It follows that the expected number of users choosing a recommended channel m is

E[c_m] = \sum_{C=0}^{N} E[c_m \mid \Lambda_C] Pr(\Lambda_C) = \sum_{C=0}^{N} \frac{C}{R} \binom{N}{C} P_{rec}^{C} (1 - P_{rec})^{N-C} = \frac{P_{rec} N}{R}.

Then E[c_m] = 1 requires that

P_{rec} = \frac{R}{N}.

C. Derivation of Transition Probability

When the system state transits from R to R', we assume that m_r and m_u recommendations, out of the R' recommendations, are channels that have and have not been recommended at time slot t, respectively. Obviously, m_r + m_u = R'. We assume that m̄_r recommended channels and m̄_u unrecommended channels have been accessed by the secondary users at time slot t + 1. We thus have R ≥ m̄_r ≥ m_r and M − R ≥ m̄_u ≥ m_u. We also assume that there are
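The closing step of the proof of Lemma 2 can be checked numerically: the binomial sum equals P_{rec} N / R, so P_{rec} = R/N gives exactly one expected user per recommended channel.

```python
from math import comb

def expected_users_per_recommended(N, R, p_rec):
    """E[c_m] = sum_{C=0}^{N} (C/R) * C(N, C) p^C (1-p)^{N-C}
             = p_rec * N / R, as in the proof of Lemma 2."""
    return sum((C / R) * comb(N, C) * p_rec ** C * (1 - p_rec) ** (N - C)
               for C in range(N + 1))

val = expected_users_per_recommended(N=5, R=2, p_rec=2 / 5)
print(val)   # 1.0 when p_rec = R / N
```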
n_r secondary users have accessed these m̄_r recommended channels and n_u secondary users have accessed those m̄_u unrecommended channels at time slot t + 1. Obviously, we have n_r + n_u = N, n_r ≥ m̄_r and n_u ≥ m̄_u.

For the first term, the probability that the user distribution (n_r, n_u) occurs follows the Binomial distribution as

\binom{N}{n_r} P_{rec}^{n_r} (1 - P_{rec})^{n_u}.

For the second term, when m̄_r ≥ 1, it is easy to check that there are \binom{n_r - 1}{\bar m_r - 1} ways for the n_r secondary users to choose m̄_r recommended channels, and there are \frac{R!}{(R - \bar m_r)!} possibilities for these m̄_r recommended channels out of the R recommended channels, each of which has probability (1/R)^{n_r}. Among these m̄_r recommended channels that have been accessed by the secondary users, the probability that m_r channels turn out to be idle is given as

\binom{\bar m_r}{m_r} (1 - q)^{m_r} q^{\bar m_r - m_r}.

When m̄_r = 0, it requires that n_r = 0. Thus, we define

\binom{n_r - 1}{-1} = \begin{cases} 1, & \text{if } n_r = 0, \\ 0, & \text{otherwise}. \end{cases}

Similarly, we can obtain the third term for the unrecommended channels case.

D. Lemma 5

Since the operation \sum_{R' \in \mathcal{R}} P^{P_{rec}}_{R,R'}[\cdot] plays a key role in the Bellman equation, to facilitate the study, we first define the following function:

f_r(R, P_{rec}) \triangleq \sum_{i=r}^{\min\{M,N\}} P^{P_{rec}}_{R,i}, \quad \forall r \in \mathcal{R}.

Proof: We prove the result by an induction argument. For the case N = 2, direct computation gives

P_{0,2}(2) = (1 - P_{rec})^2 \left(\frac{p}{p+q}\right)^2,

P_{1,0}(2) = P_{rec}^2 q + (1 - P_{rec})^2 \left(\frac{q}{p+q}\right)^2 + 2 P_{rec} (1 - P_{rec}) \frac{q^2}{p+q},

P_{1,1}(2) = P_{rec}^2 (1 - q) + (1 - P_{rec})^2 \frac{2pq}{(p+q)^2} + 2 P_{rec} (1 - P_{rec}) \frac{(1-q)q + pq}{p+q},

P_{1,2}(2) = (1 - P_{rec})^2 \left(\frac{p}{p+q}\right)^2 + 2 P_{rec} (1 - P_{rec}) \frac{(1-q)p}{p+q},

P_{2,0}(2) = P_{rec}^2 \frac{q + q^2}{2} + (1 - P_{rec})^2 \left(\frac{q}{p+q}\right)^2 + 2 P_{rec} (1 - P_{rec}) \frac{q^2}{p+q},

P_{2,1}(2) = P_{rec}^2 \frac{1 - q + (1-q)q}{2} + (1 - P_{rec})^2 \frac{2pq}{(p+q)^2} + 2 P_{rec} (1 - P_{rec}) \frac{(1-q)q + pq}{p+q},

P_{2,2}(2) = P_{rec}^2 \frac{(1-q)^2}{2} + (1 - P_{rec})^2 \left(\frac{p}{p+q}\right)^2 + 2 P_{rec} (1 - P_{rec}) \frac{(1-q)p}{p+q},

i.e., f_r(R, P_{rec}) is nondecreasing in R for the case N = 2.

We then assume that f_r(R, P_{rec}) is nondecreasing in R for all R ∈ R, P_{rec} ∈ P for the case that N = k ≥ 2. With a slight abuse of notation, we denote the transition probability P^{P_{rec}}_{R,R'} and the reverse cumulative distribution function f_r(R, P_{rec}) when the number of users is N = k as P^{P_{rec}}_{R,R'}(k) and f_r^k(R, P_{rec}), respectively … which depends on P_{rec} and the channel environment only. By conditioning on the event ψ, we have

P^{P_{rec}}_{R+1,i}(k+1) = P^{P_{rec}}_{R+1,i-1}(k)\left[1 - Pr(\psi)\right] + P^{P_{rec}}_{R+1,i}(k) Pr(\psi), \quad (22)

P^{P_{rec}}_{R,i}(k+1) = P^{P_{rec}}_{R,i-1}(k)\left[1 - Pr(\psi)\right] + P^{P_{rec}}_{R,i}(k) Pr(\psi). \quad (23)

Thus,

f_r^{k+1}(R+1, P_{rec}) - f_r^{k+1}(R, P_{rec}) = …

Now, we assume it also holds for V_t(R) when t = k+1, k+2, ..., T. Let R̂ be a system state such that R̂ ≥ R. By the hypothesis, we have V_{k+1}(R̂) ≥ V_{k+1}(R). Let π* be the optimal policy. From the Bellman equation in (5), we have

V_k(R) = \sum_{R'=0}^{\min\{M,N\}} P^{\pi^*(R)}_{R,R'} \left[U_{R'} + \beta V_{k+1}(R')\right], \quad \forall R \in \mathcal{R}. \quad (26)

By defining a new system state −1 such that U_{-1} + \beta V_{k+1}(-1) = 0, we can rewrite the equation in (26) as …
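As a consistency check on the N = 2 expressions listed above, the R = 1 row (P_{1,0}, P_{1,1}, P_{1,2}) should sum to one for any p, q, P_{rec}; the sketch below transcribes those three terms directly.

```python
def row_R1_N2(p, q, p_rec):
    """The three R = 1, N = 2 transition terms written out above."""
    a = p / (p + q)          # unrecommended-channel conditional idle prob.
    b = q / (p + q)
    P10 = p_rec ** 2 * q + (1 - p_rec) ** 2 * b ** 2 \
        + 2 * p_rec * (1 - p_rec) * q * b
    P11 = p_rec ** 2 * (1 - q) \
        + (1 - p_rec) ** 2 * 2 * p * q / (p + q) ** 2 \
        + 2 * p_rec * (1 - p_rec) * ((1 - q) * q + p * q) / (p + q)
    P12 = (1 - p_rec) ** 2 * a ** 2 \
        + 2 * p_rec * (1 - p_rec) * (1 - q) * a
    return P10, P11, P12
```

Grouping by the P_{rec}^2, (1 − P_{rec})^2, and cross terms shows each group sums to one, so the row is a valid probability distribution.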
… Since e^{\Phi_\pi} I_{\{\Phi_\pi \ge \gamma\}} is a monotone function of Φ_π and a one-to-one map over the field {π : Φ_π ≥ γ}, the result above implies that

\lim_{k \to \infty} E_{g_k}[\pi] = \pi^*, \quad (29)

\lim_{k \to \infty} Var_{g_k}[\pi] = 0. \quad (30)

Since the updated parameters (µ_k, σ_k) solve \max_{\mu, \sigma} H(\mu, \sigma, \gamma_k), they maximize H(µ, σ, γ_k). It means that

\nabla H(\mu_k, \sigma_k, \gamma_k) = 0.

That is, …