Introduction
This work was supported in part by ETRI and ICU-OIRC funded by KOSEF.
Z. Mammeri and P. Lorenz (Eds.): HSNMC 2004, LNCS 3079, pp. 268-279, 2004.
© Springer-Verlag Berlin Heidelberg 2004
and parallel iterative matching (PIM) schemes have been proposed to achieve
100% throughput for cell-based input queueing switches [8]. Marsan et al. developed novel scheduling algorithms to deal with variable-length IP packets in
IP switching systems, and proved that operating an input queueing switch in packet mode imposes no throughput limitation compared with output queueing
switches [4]. Nong et al. evaluated the maximum throughput of cell-based
IP switching systems for the PIM algorithm under bursty traffic [5]. Note that
all of these works are based on VOQ-based maximum matching algorithms such
as PIM and iSLIP, which can achieve 100% throughput even under nonuniform
traffic. However, they have two types of constraints. One constraint is that the
multiple arbitrations have to be completed within one cell time slot. The other
constraint is that each arbitration logic has to handle up to N contending cells
at a time.
For the former constraint, a pipeline-based scheduling algorithm called
round-robin greedy scheduling (RRGS) was proposed by Smiljanic et al. [9].
More recently, Oki et al. introduced a pipeline-based scheduling scheme that relaxes the timing constraint for arbitration [10]. However, the constraint
on the arbitration logic has not been studied yet, even though arbitrating
multiple cells per output port becomes impractical to implement
as the switch size increases.
In this paper, we consider only the complexity of the arbitration logic and show that it
is still a bottleneck. Then, we propose a VOQ-based windowing (VOQW) scheme
and analyze its performance under nonuniform IP traffic.
We believe that combining VOQ with the windowing scheme can overcome
the performance degradation of the conventional windowing
scheme under nonuniform traffic. Moreover, the arbitration logic is suitable
for hardware implementation, since the proposed scheme handles only a small
number of contending cells in each arbitration, similar to the dual round-robin (DRR)
scheme [3]. By analyzing the maximum throughput, we also show that the
proposed scheme outperforms the FIFO-based windowing scheme
and the DRR scheme, even though its performance is slightly lower than that of iSLIP. We
verify the analytic results through computer simulation.
The remainder of this paper is organized as follows. In Section 2, we describe
the switch model and the VOQW scheme. In Section 3, we analyze the complexity of the arbitration logic and obtain the maximum throughput of the switch
under various traffic patterns. In Section 4, we present the numerical results and
compare them with simulation. Finally, we conclude in Section 5.
2
2.1
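[Fig. 1: switch structure with the VOQW scheduling scheme. Diagram labels: inputs 1 to N, W-HOL queues, HOL contention logic, switching fabric, arbiters 1 to N, signals a1 ... aN and g1 ... gN]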
The buffer size of each input queue is assumed to be infinite. The switch operates
synchronously, so that cells are received and transmitted within a fixed
time interval called a slot. In each slot, each input port can transmit at most one cell to any
output, and each output can receive at most one cell. However,
multiple cells arrive in the input queue as a train of cells.
Each input has a separate FIFO queue for each output, called a virtual output
queue (VOQ). For example, input port i has N VOQs, numbered from 1 to N, and
VOQi,j stores cells arriving at input port i that are destined for output port j.
Each input has its own contention logic, which operates independently of the
others. The contention logic decides which VOQ at the input port will contend for
transfer to its output in each contention phase. Each output also has an arbiter, which
picks a cell from among the contending cells. Fig. 1 shows an example of the switch
structure with the VOQW scheduling scheme, where the W-HOL queue of input i consists of
the HOL cells, i.e., the cells queued first in each VOQ.
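As a rough illustration of the queueing structure described above, the following Python sketch models one input port with N VOQs. The class and method names (InputPort, w_hol, pick_contender) are hypothetical, and outputs are indexed from 0 rather than 1. The uniform random choice among nonempty VOQs follows the random selection that Section 3.2 attributes to VOQW; it is a sketch under these assumptions, not the authors' implementation.

```python
import random
from collections import deque


class InputPort:
    """One input port holding a FIFO virtual output queue (VOQ) per output."""

    def __init__(self, num_outputs, rng=None):
        self.voqs = [deque() for _ in range(num_outputs)]
        self.rng = rng if rng is not None else random.Random()

    def enqueue(self, cell, output):
        """Store an arriving cell in the VOQ of its destination output."""
        self.voqs[output].append(cell)

    def w_hol(self):
        """Head-of-line cell of every nonempty VOQ, keyed by output index."""
        return {j: q[0] for j, q in enumerate(self.voqs) if q}

    def pick_contender(self, exclude=()):
        """Pick one nonempty VOQ uniformly at random, skipping outputs in
        `exclude` (e.g. outputs already tried in this slot). Returns None
        when no candidate is left."""
        candidates = [j for j, q in enumerate(self.voqs) if q and j not in exclude]
        return self.rng.choice(candidates) if candidates else None

    def dequeue(self, output):
        """Remove and return the HOL cell of the given VOQ after a grant."""
        return self.voqs[output].popleft()
```

A blocked input can call pick_contender again with the blocked outputs in `exclude`, which mimics moving on to another VOQ within the window.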
2.2
3 Performance Analysis
3.1 Traffic Model
The traffic intensity in the switch can be represented by a rate matrix
describing the traffic passing from input i to output j. The particular form of the
rate matrix used in previous studies is

$$\lambda_{ij} = \lambda_i Q_j \tag{1}$$

where $\lambda_i$ is the average arrival rate of cells at input i, and $Q_j$ is the probability
that a cell at any input is destined for output j.
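A small numerical illustration of the rate matrix in (1) may help. The sketch below is in Python, and the function name rate_matrix and the example values (a 4x4 switch, load 0.8 per input, uniform destinations) are assumptions chosen only for illustration.

```python
def rate_matrix(arrival_rates, output_probs):
    """Return the matrix with entries lambda_i * Q_j, as a list of rows."""
    if abs(sum(output_probs) - 1.0) > 1e-9:
        raise ValueError("output probabilities Q_j must sum to 1")
    return [[lam_i * q_j for q_j in output_probs] for lam_i in arrival_rates]


# Example: 4x4 switch, every input loaded at 0.8, uniform destinations.
rates = rate_matrix([0.8] * 4, [0.25] * 4)
row_sums = [sum(row) for row in rates]          # load offered by each input
col_sums = [sum(col) for col in zip(*rates)]    # load offered to each output
```

Each row sums to the arrival rate of that input, and each column sum is the load offered to one output, which is the quantity that becomes unbalanced under nonuniform traffic.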
The arrival statistics considered in this paper are correlated bursty traffic.
The correlated bursty traffic model is a realistic representation of IP traffic, since real IP
packets of variable length are fragmented into cells that arrive
in bursts. The input traffic alternates between burst and idle periods with geometrically distributed lengths, while the cells of each burst are all
destined for the same output. We can model the input traffic as a simple
on/off arrival process, namely the interrupted Bernoulli process. For the input
traffic model, we also consider the self-similar arrival process modelled by Pareto-distributed ON/OFF traffic with Hurst parameter $H = (3 - \alpha)/2$, where $\alpha$ is the shape parameter of the Pareto distribution. It can be used
to characterize packet interarrival times with a
heavy-tailed distribution.
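The on/off arrival process can be sketched in a few lines of Python. This is only an illustrative generator under stated assumptions: the names bursty_source, mean_burst and mean_idle are hypothetical, burst and idle lengths are geometric and at least one slot long, and one cell arrives in every slot of a burst. The helper hurst_from_pareto_shape simply evaluates H = (3 - alpha)/2.

```python
import random


def hurst_from_pareto_shape(alpha):
    """Hurst parameter H = (3 - alpha) / 2 for Pareto ON/OFF periods."""
    return (3.0 - alpha) / 2.0


def bursty_source(num_slots, mean_burst, mean_idle, output_probs, rng):
    """Yield one value per slot: the destination output of the arriving
    cell, or None for an idle slot.

    Burst and idle lengths are geometric with the given means, and every
    cell of a burst carries the same destination (correlated bursts).
    """
    outputs = list(range(len(output_probs)))
    on, dest = False, None
    for _ in range(num_slots):
        yield dest if on else None
        if on:
            if rng.random() < 1.0 / mean_burst:    # burst ends after this slot
                on = False
        elif rng.random() < 1.0 / mean_idle:       # idle period ends after this slot
            on = True
            dest = rng.choices(outputs, weights=output_probs)[0]


# Example: mean burst of 20 cells, mean idle of 20 slots, 4 uniform outputs.
slots = list(bursty_source(200_000, 20, 20, [0.25] * 4, random.Random(1)))
load = sum(s is not None for s in slots) / len(slots)
```

With these parameters the long-run offered load is mean_burst / (mean_burst + mean_idle) = 0.5; passing a nonuniform output_probs vector gives nonuniform destinations.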
Next, we consider the outgoing traffic intensity. In a real environment, some particular destinations, such as a popular database, a communication server, or outgoing
trunks, can cause traffic concentration. Output ports attached to such destinations may
cause traffic imbalance. Thus, the number of packets destined for different
outputs may not be identical. Traffic imbalance of this kind, as opposed to the
uniform case, is referred to as nonuniform traffic.
In this paper, we do not consider input imbalance traffic. We only consider
nonuniform traffic in which the output addresses are not uniformly distributed. The
Qj = 1.
for all j
(2)
for all j
(3)
It can be divided into the following two cases. The most general nonuniform
traffic pattern is the output imbalance traffic consisting of two output groups.
In this case, the outputs are divided into two groups $N_1^o$ and $N_2^o$. The output
imbalance factor for each output group is given by

$$Q_j = \begin{cases} P_1 \cdot \dfrac{1}{N_1^o} & \text{if } j \in N_1^o \\[4pt] (1 - P_1) \cdot \dfrac{1}{N_2^o} & \text{if } j \in N_2^o \end{cases} \tag{4}$$

where $P_1$ (or $1 - P_1$) is the portion of input traffic going to group $N_1^o$ ($N_2^o$)
and $1/N_1^o$ ($1/N_2^o$) is the portion going to a specific output within the same output
group. From now on, we call $P_1$ the bi-group coefficient.
Another nonuniform traffic pattern is the hot-spot imbalance, where a single
hot spot is superimposed on a background of uniform traffic. This is a special
case of the bi-group imbalance model with $N_1^o = 1$, and the output imbalance factor
becomes

$$Q_j = \begin{cases} h + \dfrac{1 - h}{N} & \text{if } j \in N_1^o \\[4pt] \dfrac{1 - h}{N} & \text{otherwise} \end{cases} \tag{5}$$

where h is called the hot-spot coefficient.
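The two output-imbalance patterns can be checked numerically. The Python sketch below uses hypothetical function names (bigroup_probs, hotspot_probs), indexes outputs from 0, and places group 1 (or the hot spot) at the lowest indices; these are illustration choices, not part of the model. It assumes 0 < n1 < n.

```python
def bigroup_probs(n, n1, p1):
    """Q_j for the bi-group pattern: the first n1 outputs share P1 equally,
    the remaining n - n1 outputs share 1 - P1 equally."""
    n2 = n - n1
    return [p1 / n1] * n1 + [(1.0 - p1) / n2] * n2


def hotspot_probs(n, h):
    """Q_j for the hot-spot pattern: output 0 receives the extra share h
    on top of the uniform background (1 - h) / n."""
    base = (1.0 - h) / n
    return [h + base] + [base] * (n - 1)


# Hot-spot traffic as a special case of bi-group traffic with one output
# in group 1 and P1 = h + (1 - h) / N.
n, h = 16, 0.005
hot = hotspot_probs(n, h)
as_bigroup = bigroup_probs(n, 1, h + (1.0 - h) / n)
```

Both vectors sum to 1 and agree entry by entry, which confirms that the second branch of the bi-group formula reduces to (1 - h)/N when group 1 holds a single output. Setting h = 0, or P1 equal to the fraction of outputs in group 1, gives the uniform case.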
3.2 Complexity Analysis
For example, iSLIP and PIM may let all cells queued in the W-HOL queue contend. The contending cells with the same destination have to be arbitrated in one
arbiter. The iSLIP and PIM schemes therefore greatly increase the number of contending
cells for arbitration, even though they can improve the switch throughput. On the
other hand, the proposed VOQW scheme picks one cell from each W-HOL queue
at random. The FIFO-based scheduling scheme picks the oldest cell.
The DRR scheduling scheme picks one cell from the W-HOL queue based on a slightly more
complicated round-robin service discipline. Hence, the VOQW scheme limits the
total number of contending cells to the number of inputs N, and distributes
the contending cells over all outputs randomly. Therefore, the average number
of contending cells destined for an output is reduced considerably, to about the same level
as in the DRR scheme. Moreover, the average number of contending cells per
contention phase is almost the same as that of the DRR scheme.
Fig. 2 compares the average number of contending cells of the VOQW scheme
with that of iSLIP under hot-spot nonuniform traffic with h = 0.005:
(a) as a function of the offered load, and (b) per contention phase when the
offered load is 0.98. As shown in this figure, with iSLIP (w = 1) the
average number of contending cells increases in proportion to the offered load, up to N.
When the offered load approaches 0.98, the number of contending cells becomes
100 or more. For the iterative scheme (w = 5), the average number of contending
cells is reduced considerably, to 28. However, the number of contending cells in
the first contention phase is still high, greater than 40. On the other hand, the
proposed VOQW scheme picks one cell from each W-HOL queue, which then contends for
its output. Since each input picks a nonempty VOQ at random, the
contending cells are evenly distributed over all outputs. Therefore, the number of
contending cells remains low as the traffic load increases. In addition, the
arbitration of the proposed VOQW scheme can be performed through a request-grant procedure, while iSLIP requires a three-way handshaking mechanism
(request-grant-accept) to arbitrate input-queued cells. This indicates that the
VOQW scheme can considerably reduce the complexity of arbitration compared
with iSLIP or PIM.
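The difference in arbiter load can be illustrated with a small Monte Carlo sketch in Python. It is a simplification, not the simulation used in the paper: every VOQ is assumed to be nonempty independently with probability `occupancy`, "request all" stands for sending a request from every nonempty VOQ (as in the first iteration of iSLIP or PIM), and "one per input" stands for the VOQW-style random pick. The function name contention_stats is hypothetical.

```python
import random


def contention_stats(n, occupancy, trials, rng):
    """Estimate requests per output arbiter in one contention phase.

    Returns (avg_request_all, avg_one_per_input, avg_max_one_per_input):
    the mean number of requests per arbiter when every nonempty VOQ
    requests, the same mean when each input sends a single request for a
    randomly chosen nonempty VOQ, and the mean busiest-arbiter load in
    the single-request case.
    """
    all_avg = one_avg = one_max = 0.0
    for _ in range(trials):
        req_all = [0] * n
        req_one = [0] * n
        for _input in range(n):
            nonempty = [j for j in range(n) if rng.random() < occupancy]
            for j in nonempty:
                req_all[j] += 1
            if nonempty:
                req_one[rng.choice(nonempty)] += 1
        all_avg += sum(req_all) / n
        one_avg += sum(req_one) / n
        one_max += max(req_one)
    return all_avg / trials, one_avg / trials, one_max / trials


# Example: 16x16 switch with every VOQ backlogged.
stats = contention_stats(16, 1.0, 200, random.Random(0))
```

With every VOQ backlogged, requesting from all VOQs loads each arbiter with N requests, whereas one request per input gives an average of exactly one request per arbiter, so only the occasional collision has to be resolved.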
3.3 Throughput Analysis
Let us analyze the dynamics of the VOQW scheme. In the switching system,
cells are served by the windowing scheme. Each input port contends
for its desired outputs up to w times per slot, one attempt after another. That is, a cell
contends for its output one at a time, and the next cell may contend only when all
former contentions have been blocked. For each contention, each input chooses
a VOQ with equal probability.

Let us focus on the dynamics of a tagged output. From the point of view of the tagged
output, the probability that none of the contending VOQs is destined for the
tagged output is $(1 - 1/N)^M$. Here, M is the expected total number of contending
VOQs. If $\rho$ is the utilization of each VOQ, the expected number of contending
VOQs in the system is given by $N\rho$. So, the probability that at least one cell
among the M cells is destined for the tagged output becomes $1 - (1 - 1/N)^{N\rho}$. By
taking the expectation of this probability, we get the expected throughput for
the tagged output as

$$E[T] = E\left[1 - (1 - 1/N)^{N\rho}\right]. \tag{6}$$

Fig. 2. Average number of contending cells per arbiter as a function of (a) offered
load and (b) arbitration window under hot-spot nonuniform traffic (h = 0.005)
(7)
(8)
$$M_1 = N$$
$$M_2 = M_1 + M_1 (1 - E[T_1])$$
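By applying (7) to (6), we obtain the maximum throughput of the
VOQ-based windowing scheme as follows:

$$
\begin{aligned}
E[T_1] &= 1 - e^{-1} \\
E[T_2] &= 1 - e^{-(1 + e^{-1})} \\
&\;\;\vdots \\
E[T_w] &= 1 - e^{-\left(1 + e^{-1} + e^{-(1 + e^{-1})} + \cdots\right)}
\end{aligned}
\tag{9}
$$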
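The nested exponentials in (9) can be evaluated with a short recursion. The Python sketch below rests on assumptions that should be read as such: inputs are saturated and N is large, so that (1 - 1/N)^N is replaced by e^{-1}, and the expected number of contenders is assumed to grow from each attempt to the next in the same way as M_2 = M_1 + M_1(1 - E[T_1]). The function names are hypothetical.

```python
import math


def voqw_max_throughput(window):
    """Evaluate E[T_w] via x_1 = 1, x_{k+1} = x_k + exp(-x_k),
    E[T_k] = 1 - exp(-x_k), where x_k stands for M_k / N."""
    x = 1.0
    for _ in range(window - 1):
        x += math.exp(-x)      # add the share of inputs still blocked
    return 1.0 - math.exp(-x)


def first_attempt_throughput(n):
    """Finite-N value 1 - (1 - 1/N)^N of the single-attempt throughput."""
    return 1.0 - (1.0 - 1.0 / n) ** n


# Maximum throughput for window sizes 1 to 7.
table = {w: voqw_max_throughput(w) for w in range(1, 8)}
```

Under these assumptions the recursion gives roughly 0.63 for w = 1 and roughly 0.86 for w = 5, close to the value of 0.85 quoted for w = 5 in the numerical results, and the gain from each additional attempt shrinks as w grows.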
With P1 = 0.25 or h = 0, both cases become the uniform traffic case (i.e., the addresses
of incoming cells are uniformly distributed over all outputs). Here, we use the
Pareto distribution with Hurst parameters of H = 0.7 and H = 0.8 as self-similar traffic. In the following figures, lines indicate the simulation results and
small circles indicate the analytic results. The close match between the analytic
and simulation results indicates that the analysis is adequate for predicting the
performance.
Fig. 3 shows the maximum throughput versus the bi-group coefficient and the hot-spot coefficient under correlated bursty traffic in (a) and (b), respectively. Fig. 4
shows the maximum throughput as a function of the average burst length as well as
the window size under correlated bursty and hot-spot nonuniform traffic (h = 0.005).
As shown in these figures, the maximum throughput of the FIFO-based windowing
scheme decreases dramatically as the nonuniform coefficient increases. Hot-spot traffic has a more adverse effect on the maximum throughput. The reason
is that the majority of cells in the input queues are destined for the hot-spot
output, so the load offered to that output is many times its
capacity. The FIFO-based windowing scheme has little effect under correlated bursty traffic (the maximum throughput per port abruptly converges to
0.5 when the burst size is greater than 5); the performance improvement of the
windowing scheme is rapidly reduced as the burst size increases [12]. This is
because correlated bursty traffic only increases HOL blocking, owing to the dependency among consecutive cells at the same input port. Moreover, the blocked cells
accumulate in the same input queue. On the other hand, correlated bursty
traffic and nonuniform traffic have no impact on the maximum throughput of
the VOQW scheme. This is because the blocked cells accumulate in their own
VOQ, and each input can select another VOQ at random for contention.
Thus, the VOQW scheme can considerably increase the maximum throughput
as the window size increases. As shown in this figure, the maximum throughput
becomes 0.85 when w = 5. The maximum throughput remains
at the same value under correlated bursty and nonuniform traffic. Consequently, the VOQW scheme is useful under correlated bursty and
nonuniform traffic as well as uniform traffic.
Figs. 5 and 6 show the switch throughput and delay performance below the
saturation point. These results are obtained from computer simulation for various self-similar traffic with 1/b = 20 and h = 0.05. As shown in Fig. 5,
the self-similar traffic (H = 0.7 and H = 0.8) has little impact on the switch
throughput. Moreover, the switch throughput of the VOQW scheme increases linearly
with the offered load until it levels off at the saturated traffic load. The saturated traffic
load is determined by the window size. With the iSLIP scheme, the
switch throughput also increases linearly below the saturation point, but it then continues to increase at a lower rate up to 1. This means that, below the saturation point, the switch throughput
of the VOQW scheme is almost the same as that of iSLIP regardless of the traffic
condition.
Fig. 6 shows the waiting time of the iSLIP and DRR schemes as well as the
VOQW scheme. As shown in this figure, the self-similar traffic slightly degrades the
waiting time performance. Moreover, this figure shows that the VOQW
scheme has a lower waiting time than iSLIP just below the saturated traffic
load, while it has a higher waiting time than iSLIP above the saturated traffic load.
This is because the switch throughput of the iSLIP scheme can increase
continuously below the saturated traffic load through the desynchronization effect.
The iSLIP (or DRR) scheme can reduce the waiting time in that region, even
though its waiting time is slightly higher below the saturated traffic load
due to the desynchronization effect. On the other hand, the VOQW scheme
restricts the switch throughput to its upper bound. The waiting time of
the VOQW scheme increases abruptly at the saturated traffic load, but
remains low below it. From the results,
we observe that the VOQW scheme can considerably reduce the total waiting
time below the saturated traffic load compared with the iSLIP scheme,
even though the waiting time increases abruptly at the saturated traffic load.
Consequently, a designer can consider the VOQ-based windowing scheme for operation below
the saturated traffic load.

Fig. 3. Maximum throughput versus imbalance coefficient under correlated bursty traffic for FIFO and VOQ based windowing schemes (1/b = 20): (a) bi-group coefficient, (b) hot-spot coefficient; curves for w = 1, 3, 5, 7
Fig. 4. Maximum throughput versus (a) average burst length (b) window size, under
correlated bursty and nonuniform traffic (1/b = 20)
Fig. 5. Switch throughput under correlated bursty and nonuniform self-similar traffic
Fig. 6. Total waiting time under correlated bursty and nonuniform self-similar traffic
Conclusion
The objective of this paper is to show the performance of the proposed VOQ-based windowing (VOQW) scheme under correlated bursty and nonuniform
traffic. From the results, we know that the VOQW scheme can be implemented
with simple arbitration logic similar to that of the DRR scheme. The VOQW scheme can
considerably reduce the switch complexity compared with iSLIP. Moreover,
nonuniform or correlated bursty traffic has no impact on the performance
of the switch with the VOQW scheme. That is, the VOQW scheme provides
consistent performance under various traffic conditions. In addition, the VOQW scheme
can considerably increase the switch throughput compared with the FIFO-based
windowing scheme or the DRR scheme, even though the throughput of the VOQW
scheme is slightly lower than that of iSLIP. Consequently, we conclude that the
VOQW scheme is a useful choice when designing scheduling schemes
for high-speed IP switches operating below the saturation point.
References
1. A. Adas, "Traffic models in broadband networks," IEEE Commun. Mag., vol. 35,
pp. 82-89, July 1997.
2. P. Gupta, "Scheduling in input queued switches: a survey," in
citeseer.nj.nec.com/246798.html
3. Y. Li, S. Panwar, and H. J. Chao, "On the performance of a Dual Round-Robin switch,"
in IEEE INFOCOM '01, vol. 3, pp. 1688-1697, April 2001.
4. M. A. Marsan, A. Bianco, P. Giaccone, E. Leonardi, and F. Neri, "Packet scheduling
in input-queued cell-based switches," in IEEE INFOCOM '01, 2001.
5. G. Nong, M. Hamdi, and J. K. Muppala, "Performance evaluation of multiple
input-queued ATM switches with PIM scheduling under bursty traffic," IEEE
Trans. Commun., vol. 49, pp. 1329-1333, Aug. 2001.
6. D. Manjunath and B. Sikdar, "Variable length packet switches: delay analysis of
crossbar switches under Poisson and self similar traffic," in IEEE INFOCOM '00,
2000.
7. A. Mekkittikul and N. McKeown, "A practical scheduling algorithm to achieve
100% throughput in input-queued switches," IEEE INFOCOM '98, pp. 792-799,
1998.
8. N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand, "Achieving 100%
throughput in an input-queued switch," IEEE Trans. Commun., vol. 47, pp. 1260-1267,
Aug. 1999.
9. A. Smiljanic, R. Fan, and G. Ramamurthy, "RRGS-round-robin greedy scheduling
for electronic/optical terabit switches," Proc. of GLOBECOM '99, pp. 1244-1250,
1999.
10. E. Oki, R. Rojas-Cessa, and H. J. Chao, "A pipeline-based approach for maximal-sized matching scheduling in input-buffered switches," IEEE Commun. Lett.,
vol. 5, no. 6, pp. 263-265, June 2001.
11. A. Santhanam and A. Karandikar, "Window-based cell scheduling algorithm for
VLSI implementation of an input-queued ATM switch," in IEE Proc.-Commun.,
vol. 147, no. 2, April 2000.
12. J. S. Choi and H. H. Lee, "Performance study of an input queueing ATM switch
with windowing scheme for IP switching system," in Proceedings of HPSR 2002,
Kobe, Japan, May 2002.
13. Y. J. Hui, Switching and traffic theory for integrated broadband networks. Boston:
Kluwer Academic Publishers, 1990.