See https://www.researchgate.net/publication/292148529 for discussions, stats, and author profiles for this publication. Authors include Ali Yekkehkhany (University of Illinois, Urbana-Champaign).
Abstract—A fundamental problem for all data-parallel applications is data locality. An example is map task scheduling in the MapReduce framework. Existing theoretical work analyzes systems with only two levels of locality, despite the existence of multiple locality levels within and across data centers. We found that going from two to three levels of locality changes the problem drastically, as a tradeoff between performance and throughput emerges. The recently proposed priority algorithm, which is throughput and heavy-traffic optimal for two locality levels, is not even throughput-optimal with three locality levels. The JSQ-MaxWeight algorithm proposed by Wang et al. is heavy-traffic optimal only for a special traffic scenario with two locality levels. We show that an extension of the JSQ-MaxWeight algorithm to three locality levels preserves its throughput-optimality, but suffers from the same lack of heavy-traffic optimality for most traffic scenarios. We propose a novel algorithm that uses Weighted-Workload (WW) routing and priority service, and establish its throughput and heavy-traffic optimality for all traffic scenarios. The main challenge is the construction of an appropriate ideal load decomposition that allows the separate treatment of different subsystems.
I. INTRODUCTION
Data-parallel applications have become prevalent for processing large data sets in online social networks, search engines, scientific research and the health-care industry. A fundamental problem for data-parallel applications is near-data scheduling, or scheduling with data locality, as the amount of time and resources consumed by data-processing tasks varies with their locations. An example is map task scheduling in the MapReduce framework. Even with the increase in the speed of data center networks, there remains a significant difference in average processing speed [19], [1], [2] depending on whether the data reside in memory, on a local disk, in a local rack, in the same cluster or in a different data center. As the processing speed depends on the task-server pair, this is an affinity scheduling problem [14], [7], [8], albeit with an explosive number of task types, defined by the combinatorial set of locations of a task's data, and only a few different processing speeds due to the multi-level locality.
We are interested in designing a throughput and heavy-traffic optimal scheduling algorithm in a system with multi-level locality. We define the optimality criteria as follows.
Throughput Optimality: An algorithm stabilizes any arrival rate vector strictly within the capacity region, hence is robust to variations in traffic.
Heavy-traffic Optimality: An algorithm asymptotically minimizes the average delay as the arrival rate vector approaches the boundary of the capacity region.
Fig. 1. An example system with two racks, each consisting of two servers; tasks local to server 4 arrive at rate 1.9.
However, for a system with three levels of locality, the priority algorithm is not throughput-optimal. Consider a system with two racks, each consisting of two servers, as illustrated in Fig. 1. There are three types of tasks: one type is only local to server 1 and has rate λ₁, one is only local to server 4 and has rate 1.9, and the third type is local to both servers 2 and 3 and has rate λ₃. Assume a local task is served at rate α = 1, a rack-local task is served at rate β = 0.9, and a remote task is served at rate γ = 0.5. Under the priority algorithm [17], every server gives priority to its own local tasks, so server 4 receives rack-local help from server 3 and remote help from servers 1 and 2 only when they are idle from their own local work. The system is then stable only if

1.9 < α + (1 − ρ₃)β + (1 − ρ₁)γ + (1 − ρ₂)γ,

where ρ_ℓ denotes the local load of server ℓ, a condition that can fail even for arrival rate vectors strictly within the capacity region.
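The failure mode can be sketched numerically. The helper-idle-time decomposition below is an illustrative assumption rather than the exact derivation in [17]: it shows that the residual capacity available to the overloaded server 4 under priority scheduling drops below 1.9 once the helpers' own local loads grow.

```python
# Hypothetical check of the priority-algorithm stability condition for the
# two-rack example (Fig. 1): server 4 is overloaded (local rate 1.9 > alpha)
# and is helped only by servers idle from their own local work.

ALPHA, BETA, GAMMA = 1.0, 0.9, 0.5  # local, rack-local, remote service rates

def priority_capacity_for_server4(lam1, lam3):
    """Maximum rate at which tasks local to server 4 can be served when
    every server prioritizes its own local tasks (illustrative model)."""
    rho1 = lam1 / ALPHA          # fraction of time server 1 is busy locally
    rho23 = lam3 / (2 * ALPHA)   # type-3 load assumed split across servers 2, 3
    rack_local_help = BETA * (1 - rho23)                     # server 3, same rack
    remote_help = GAMMA * (1 - rho1) + GAMMA * (1 - rho23)   # servers 1 and 2
    return ALPHA + rack_local_help + remote_help

# With lightly loaded helpers the condition 1.9 < capacity holds ...
print(priority_capacity_for_server4(0.1, 0.1) > 1.9)  # -> True
# ... but with heavily loaded helpers it fails, even though the arrival
# vector may still lie inside the capacity region.
print(priority_capacity_for_server4(0.9, 0.9) > 1.9)  # -> False
```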
We write ℓ ∼ i if server ℓ is local to type-i tasks, and similarly ℓ ≈ i if server ℓ is rack-local to type-i tasks.
Arrivals. Let A_i(t) denote the number of type-i tasks that arrive at the beginning of time slot t. We assume that the arrival process of type-i tasks is i.i.d. with rate λ_i. We denote the arrival rate vector by λ = (λ_i : i ∈ ℐ). The total number of arrivals in one time slot is assumed to be bounded.
Services. For each task, we assume that its service time follows a geometric distribution with mean 1/α if processed at a local server, and with means 1/β and 1/γ at a rack-local server and a remote server, respectively. On average, a task is processed fastest at a local server and slowest at a remote server, hence we assume α > β > γ. Each server can process one task at a time and all services are non-preemptive.
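As a quick sanity check on this service model, the following sketch samples geometric service times and verifies the mean 1/γ at a remote server; the rates α = 1, β = 0.9, γ = 0.5 are the example values used in Section I.

```python
import random

ALPHA, BETA, GAMMA = 1.0, 0.9, 0.5  # example local, rack-local, remote rates

def service_time(p):
    """Sample a geometric service time: number of slots until a Bernoulli(p)
    completion succeeds; the mean is 1/p."""
    t = 1
    while random.random() >= p:
        t += 1
    return t

random.seed(0)
n = 200_000
mean_remote = sum(service_time(GAMMA) for _ in range(n)) / n
print(abs(mean_remote - 1 / GAMMA) < 0.05)  # close to 1/gamma = 2 slots
```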
A. Outer Bound of the Capacity Region
We consider a decomposition of the arrival rate vector λ = (λ_i : i ∈ ℐ). For any task type i, λ_i is decomposed into (λ_{i,ℓ} : ℓ = 1, …, M), where λ_{i,ℓ} is assumed to be the arrival rate of type-i tasks served by server ℓ. For such a decomposition to be supportable, the load on each server ℓ must satisfy

Σ_{i: ℓ∼i} λ_{i,ℓ}/α + Σ_{i: ℓ≈i} λ_{i,ℓ}/β + Σ_{i: ℓ remote to i} λ_{i,ℓ}/γ < 1.  (1)

Let Λ be the set of arrival rate vectors such that each element has a decomposition satisfying condition (1):

Λ = { λ : ∃ (λ_{i,ℓ} : i ∈ ℐ, ℓ = 1, …, M) with λ_{i,ℓ} ≥ 0, Σ_{ℓ=1}^{M} λ_{i,ℓ} = λ_i for all i, and (1) holds for all ℓ }.

Therefore Λ gives an outer bound of the capacity region.
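Condition (1) is easy to evaluate for a candidate decomposition. The sketch below uses a hypothetical per-server split (the 0.5/0.2/0.1 rates are assumptions for illustration) and checks that the server's weighted load stays below 1.

```python
ALPHA, BETA, GAMMA = 1.0, 0.9, 0.5  # example service rates

def server_load(local, rack_local, remote):
    """Left-hand side of condition (1) for one server: each unit of arrival
    rate consumes 1/rate of the server's time at that locality level."""
    return local / ALPHA + rack_local / BETA + remote / GAMMA

# Hypothetical decomposition for one server.
load = server_load(local=0.5, rack_local=0.2, remote=0.1)
print(load < 1)  # 0.5 + 0.222... + 0.2 < 1, so this server is not overloaded
```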
III. RESULTS ON JSQ-MAXWEIGHT
We summarize the results on an extension of the JSQ-MaxWeight algorithm proposed by Wang et al. [16]. We refer the reader to our technical report [18] for the complete proofs. We extend the JSQ-MaxWeight algorithm to a system with three levels of locality: local, rack-local and remote. The central scheduler maintains a set of M queues, where the ℓ-th queue, denoted by Q_ℓ, receives tasks local to server ℓ. Let Q = (Q_1, Q_2, …, Q_M) denote the vector of these queue lengths. The algorithm consists of JSQ routing and MaxWeight scheduling:
JSQ routing: When a task of type i arrives, the scheduler routes it to

arg min { min_{ℓ: ℓ∼i} Q_ℓ(t), min_{ℓ: K(ℓ)=K(i)} Q_ℓ(t), min_{ℓ: K(ℓ)≠K(i)} Q_ℓ(t) },

where K(ℓ) denotes the rack of server ℓ and K(i) the rack containing the data of type-i tasks. Ties are broken randomly.
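A minimal sketch of this routing step (the queue contents are hypothetical; ties are broken randomly as stated):

```python
import random

def jsq_route(queue_len, local, rack_local, remote):
    """Join the shortest queue among the local, rack-local and remote
    servers of the arriving task; ties are broken uniformly at random."""
    candidates = list(local) + list(rack_local) + list(remote)
    shortest = min(queue_len[s] for s in candidates)
    return random.choice([s for s in candidates if queue_len[s] == shortest])

# Example: the rack-local server 2 happens to hold the shortest queue.
q = {1: 5, 2: 1, 3: 4, 4: 3}
print(jsq_route(q, local={1}, rack_local={2}, remote={3, 4}))  # -> 2
```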
Theorem 1: Any arrival rate vector strictly within Λ is supportable by JSQ-MaxWeight. Thus Λ is the capacity region of the system and JSQ-MaxWeight is throughput-optimal.
For a system with only two levels of data locality, the JSQ-MaxWeight algorithm has been shown to be heavy-traffic optimal for a special traffic scenario [16], where a server is either locally overloaded or receives zero local traffic. For a system with the rack structure, hence three levels of locality, we consider the following traffic scenario.
All traffic concentrates on a subset of racks, and any rack with non-zero local tasks is overloaded. Moreover, any server in an overloaded rack either receives zero local traffic or is locally overloaded. Denote the set of racks that can have local tasks by 𝒦, the set of servers that receive non-zero local traffic by ℬ, the set of servers that receive zero local traffic but belong to racks in 𝒦 by ℋ, and the set of servers in racks that receive zero local traffic by 𝒢. For any subset of servers 𝒮, we denote by λ(𝒮) the total arrival rate of task types whose local servers all lie in 𝒮. Local overload means λ({ℓ}) > α for each ℓ ∈ ℬ, and rack overload means that the local traffic of a rack exceeds the local and rack-local service capacity within the rack. It is easy to see that in a stable system, λ(ℬ) < α|ℬ| + β|ℋ| + γ|𝒢|. We assume that

λ(ℬ) + ε = α|ℬ| + β|ℋ| + γ|𝒢|,  (2)

where ε > 0 is the heavy-traffic parameter: ε → 0 corresponds to the arrival rate vector approaching the boundary of the capacity region.
IV. OUR ALGORITHM
The Weighted-Workload algorithm is illustrated in Fig. 2. The central scheduler maintains a set of M queues, where the ℓ-th queue consists of 3 sub-queues denoted by Q_ℓ^L, Q_ℓ^K and Q_ℓ^R, which receive tasks local, rack-local and remote to server ℓ, respectively. An arriving task is routed according to the weighted workload

W_ℓ(t) = Q_ℓ^L(t)/α + Q_ℓ^K(t)/β + Q_ℓ^R(t)/γ.
Fig. 2. The Weighted-Workload algorithm: an arriving task of type i joins a local, rack-local or remote sub-queue at one of servers 1, …, M, and each server schedules among its local, rack-local and remote sub-queues.
The lengths of the local sub-queue Q_ℓ^L, rack-local sub-queue Q_ℓ^K, and remote sub-queue Q_ℓ^R, denoted by Q_ℓ^L(t), Q_ℓ^K(t) and Q_ℓ^R(t), respectively, count the tasks routed to each sub-queue of server ℓ that have not yet been served. Each server ℓ has a working status

f_ℓ(t) = −1 if server ℓ is idle; 0 if server ℓ serves a local task from Q_ℓ^L; 1 if server ℓ serves a rack-local task from Q_ℓ^K; 2 if server ℓ serves a remote task from Q_ℓ^R.
When server ℓ completes a task at the end of time slot t − 1, i.e., f_ℓ(t⁻) = −1, it is available for a new task at time slot t. The scheduling decision is based on the working status vector f(t) = (f_1(t), f_2(t), …, f_M(t)) and the queue length vector Q(t). Let η_ℓ(t) denote the scheduling decision for server ℓ at time slot t. Note that η_ℓ(t) = f_ℓ(t) for all busy servers, and when f_ℓ(t⁻) = −1, i.e., server ℓ is idle, η_ℓ(t) is determined by the scheduler according to the algorithm.
Let S_ℓ^L(t), S_ℓ^K(t) and S_ℓ^R(t) denote the local, rack-local and remote service provided by server ℓ, respectively, where S_ℓ^L(t) ∼ Bern(α·1{η_ℓ(t)=0}), S_ℓ^K(t) ∼ Bern(β·1{η_ℓ(t)=1}) and S_ℓ^R(t) ∼ Bern(γ·1{η_ℓ(t)=2}), i.e., Bernoulli random variables with varying probability. For instance, S_ℓ^L(t) ∼ Bern(α) when server ℓ is scheduled to its local sub-queue, and S_ℓ^L(t) = 0 otherwise. The sub-queues evolve as

Q_ℓ^L(t+1) = Q_ℓ^L(t) + A_ℓ^L(t) − S_ℓ^L(t) + U_ℓ^L(t),
Q_ℓ^K(t+1) = Q_ℓ^K(t) + A_ℓ^K(t) − S_ℓ^K(t) + U_ℓ^K(t),
Q_ℓ^R(t+1) = Q_ℓ^R(t) + A_ℓ^R(t) − S_ℓ^R(t) + U_ℓ^R(t),

where A_ℓ^L(t), A_ℓ^K(t) and A_ℓ^R(t) denote the arrivals routed to each sub-queue and U_ℓ(t) = max{0, S_ℓ(t) − Q_ℓ(t) − A_ℓ(t)} is the unused service. As the service times follow geometric distributions, Q(t) together with the working status vector f(t) form an irreducible and aperiodic Markov chain {Z(t) = (Q(t), f(t)), t ≥ 0}.
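The dynamics above can be sketched as a small simulation. This is a simplified illustration, not the paper's exact algorithm: routing here joins the server with the smallest weighted workload, and service re-selects a sub-queue each slot in the priority order local, then rack-local, then remote, whereas the actual algorithm is non-preemptive and tracks f_ℓ(t) explicitly.

```python
import random

ALPHA, BETA, GAMMA = 1.0, 0.9, 0.5  # local, rack-local, remote service rates

class Server:
    """One server with local (L), rack-local (K) and remote (R) sub-queues."""
    def __init__(self):
        self.q = {"L": 0, "K": 0, "R": 0}

    def workload(self):
        # Weighted workload W_l(t) = Q^L/alpha + Q^K/beta + Q^R/gamma.
        return self.q["L"] / ALPHA + self.q["K"] / BETA + self.q["R"] / GAMMA

def route(servers, local, rack_local):
    """Send an arriving task to the server with the smallest weighted
    workload, joining the sub-queue that matches its locality there."""
    best = min(range(len(servers)), key=lambda s: servers[s].workload())
    kind = "L" if best in local else "K" if best in rack_local else "R"
    servers[best].q[kind] += 1
    return best, kind

def serve(server):
    """Priority service: pick the first non-empty sub-queue in the order
    local > rack-local > remote; completion is Bernoulli(rate) per slot,
    which yields geometric service times."""
    for kind, rate in (("L", ALPHA), ("K", BETA), ("R", GAMMA)):
        if server.q[kind] > 0:
            if random.random() < rate:
                server.q[kind] -= 1
            return

servers = [Server() for _ in range(4)]
print(route(servers, local={1}, rack_local={2}))  # empty system -> (0, 'R')
```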
B. Throughput Optimality
Theorem 3: The Weighted-Workload algorithm is throughput optimal. That is, it stabilizes any arrival rate vector strictly
within the capacity region.
To prove Theorem 3, we use a Lyapunov function that is quadratic in the expected workload of each queue:

V(t) = ‖W(t)‖² = Σ_{ℓ=1}^{M} ( Q_ℓ^L(t)/α + Q_ℓ^K(t)/β + Q_ℓ^R(t)/γ )².
Note that the service discipline does not affect the proof, as the expected workload is reduced at the same rate regardless of which sub-queue is served. The proof is similar to that for the throughput-optimality of JSQ-MaxWeight. The weighted-workload queueing effectively replaces the role of MaxWeight services, but leaves the choice of service discipline free for the potential achievement of delay optimality. We defer the full proof to our technical report [18] due to space constraints.
An equivalent characterization of the capacity region refines the decomposition by rack:

Λ = { λ : ∃ λ_{i,ℓ,k} ≥ 0 with Σ_{ℓ=1}^{M} Σ_k λ_{i,ℓ,k} = λ_i for all i, and the load condition analogous to (1) holds for each server ℓ },  (3)

where λ_{i,ℓ,k} denotes the rate of type-i tasks local to rack k that are served by server ℓ. The equivalence of the capacity region can be established by setting λ_{i,ℓ} = Σ_k λ_{i,ℓ,k}. Each λ_{i,ℓ} is thus further decomposed, with

λ_{i,ℓ,k} = 0 for every rack k that holds no data of type i.  (4)
A rack k requires remote service when

Σ_{ℓ: K(ℓ)=k, λ({ℓ})>α} ( λ({ℓ}) − α ) > Σ_{ℓ: K(ℓ)=k, λ({ℓ})<α} β ( 1 − λ({ℓ})/α ).  (5)

Note that the LHS of (5) gives the amount of local traffic for overloaded servers in rack k that could not be served locally. The RHS of (5) is the maximum rack-local service that can be provided by underloaded servers within rack k. Hence rack k requires remote service if (5) holds.
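Condition (5) can be checked per rack from the servers' local loads. In the sketch below the loads are hypothetical, and an underloaded server's busy fraction is taken to be λ/α, matching the idle-fraction term in (5).

```python
ALPHA, BETA = 1.0, 0.9  # local and rack-local service rates

def rack_needs_remote_service(local_loads):
    """Evaluate condition (5) for one rack: does the local overflow of
    overloaded servers exceed the rack-local help that underloaded
    servers can offer (beta times their idle fraction)?"""
    overflow = sum(lam - ALPHA for lam in local_loads if lam > ALPHA)
    idle_help = sum(BETA * (1 - lam / ALPHA) for lam in local_loads if lam < ALPHA)
    return overflow > idle_help

print(rack_needs_remote_service([1.9, 0.2]))  # overflow 0.9 > help 0.72 -> True
print(rack_needs_remote_service([1.2, 0.0]))  # overflow 0.2 < help 0.9 -> False
```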
Lemma 2: Assume 2β > α. For any arrival rate vector λ strictly within Λ, there exists a decomposition {λ_{i,ℓ,k}} that satisfies (3), and

λ_{i,ℓ,k} = 0 for every server ℓ that is not designated to help rack k.  (6)

We call this the ideal load decomposition: helpers in overloaded racks serve only local and rack-local tasks, and remote help for an overloaded rack is drawn only from racks that receive zero local traffic. The formal classification of the helper sets, which distinguishes servers by whether their local load is below α and whether their rack satisfies (5), is given in our technical report [18].

For the heavy-traffic analysis, we consider a sequence of arrival rate vectors λ^(ε) with

λ^(ε)(ℬ) = α|ℬ| + β|ℋ| + γ|𝒢| − ε,  (7)

while the total local load of the helpers is fixed as ε varies. This assumption can be removed with more care.
Consider the arrival processes {A_i^(ε)(t)} with arrival rate vector λ^(ε) satisfying the above conditions. We denote by (σ^(ε))² the variance of the number of arrivals that are only local to beneficiaries in overloaded racks, i.e., Var(A_ℬ^(ε)(t)) = E[(A_ℬ^(ε)(t) − λ_ℬ^(ε))²], which converges to a constant σ² as ε → 0. Let Z^(ε)(t) = (Q^(ε)(t), f^(ε)(t)), t ≥ 0, be the system state under the proposed algorithm when the arrival rate is λ^(ε). Since λ^(ε) ∈ Λ, the Markov chain Z^(ε)(t) is positive recurrent and has a steady-state distribution. All theorems in this section concern the steady-state process.
Theorem 4: (Helper queues) The rack-local and remote sub-queues of the helpers vanish in the heavy-traffic limit:

lim_{ε→0} E[ Σ_{ℓ∈ℋ} ( Q_ℓ^{K,(ε)} + Q_ℓ^{R,(ε)} ) ] = 0,  lim_{ε→0} E[ Σ_{ℓ∈𝒢} ( Q_ℓ^{K,(ε)} + Q_ℓ^{R,(ε)} ) ] = 0.

In order to characterize the scaling order of the total queue length E[‖Q^(ε)‖₁], by Theorem 4 we only need to consider

E[ Σ_{ℓ∈ℬ} ( Q_ℓ^{L,(ε)} + Q_ℓ^{K,(ε)} + Q_ℓ^{R,(ε)} ) + Σ_{ℓ∈ℋ} Q_ℓ^{L,(ε)} + Σ_{ℓ∈𝒢} Q_ℓ^{L,(ε)} ].
Define c as a vector with unit ℓ₂ norm in the direction of state space collapse. The parallel and perpendicular components of the steady-state weighted queue-length vector W with respect to c are

W_∥ = ⟨c, W⟩ c,  W_⊥ = W − W_∥.
Theorem 5: (State space collapse) There exists a sequence of finite numbers {C_r : r ∈ ℕ} such that for each positive integer r,

E[ ‖W_⊥^(ε)‖^r ] ≤ C_r,

that is, the deviations of W from the direction c are bounded and independent of the heavy-traffic parameter ε.
Define the service process

S^(ε)(t) = Σ_{ℓ∈ℬ} S_ℓ(t) + Σ_{ℓ∈ℋ} S_ℓ(t) + Σ_{ℓ∈𝒢} S_ℓ(t),  (8)

where {S_ℓ(t)}_{ℓ∈ℬ}, {S_ℓ(t)}_{ℓ∈ℋ} and {S_ℓ(t)}_{ℓ∈𝒢} are independent and each process is i.i.d. For all ℓ ∈ ℬ, S_ℓ(t) ∼ Bern(α). For all ℓ ∈ ℋ, S_ℓ(t) ∼ Bern(β(1 − p_ℓ)), where p_ℓ is the proportion of time helper ℓ spends on local tasks in steady state. For all ℓ ∈ 𝒢, S_ℓ(t) ∼ Bern(γ(1 − q_ℓ)), where q_ℓ is the proportion of time helper ℓ spends on local and rack-local tasks in steady state. We denote Var(S^(ε)(t)) by (ν^(ε))², which converges to a constant ν² as ε → 0.
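Since (8) is a sum of independent Bernoulli services, its variance is just the sum of the per-server Bernoulli variances. The sketch below checks a Monte Carlo estimate against this closed form for hypothetical rates: two beneficiaries at α = 1, one in-rack helper at β(1 − p) = 0.72 and one remote helper at γ(1 − q) = 0.35.

```python
import random

random.seed(1)
rates = [1.0, 1.0, 0.72, 0.35]  # assumed per-server service probabilities
exact_var = sum(p * (1 - p) for p in rates)  # variance of an independent sum

n = 100_000
samples = [sum(random.random() < p for p in rates) for _ in range(n)]
mean = sum(samples) / n
est_var = sum((x - mean) ** 2 for x in samples) / n
print(abs(est_var - exact_var) < 0.01)  # Monte Carlo matches the closed form
```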
Theorem 6: (Lower Bound) In steady state,

E[ q^(ε) ] ≥ ( (σ^(ε))² + (ν^(ε))² + ε² ) / (2ε) − c₀,

where q^(ε) denotes the total queue length and c₀ is a constant independent of ε. Therefore, in the heavy-traffic limit as ε → 0,

lim inf_{ε→0} ε E[ q^(ε) ] ≥ ( σ² + ν² ) / 2.  (9)
B. Outline of Proofs
(Theorem 4.) We first show that in steady state, the expected local load on any helper is upper-bounded by a constant ρ < 1 which is independent of ε. As shown in [17], with upper-bounded local load and priority scheduling for local tasks, the expected local queue length is bounded and independent of ε. Therefore the local sub-queue lengths of ℋ and 𝒢 are bounded and independent of ε. Under the ideal load decomposition, all tasks of types local to ℋ are served locally by ℋ in order to achieve maximum remote capacity for overloaded racks. We can show that the number of tasks in ℋ that are served rack-locally or remotely vanishes as ε → 0. Hence we can also show the uniform boundedness of the rack-local sub-queue lengths of ℋ.
(Theorem 5.) We consider the Lyapunov function V_⊥(Z) = ‖W_⊥‖. We can show that the drift of V_⊥(Z) is always finite and becomes negative for sufficiently large ‖W_⊥‖. According to the extended version of Lemma 1 in [6], all moments of V_⊥(Z) then exist and are finite. The main challenge is to show that the ideal load decomposition {λ_{i,ℓ,k}} given in Lemma 3 assigns each task type to servers of a single locality level, so that the drift of each subsystem can be treated separately; the detailed conditions are given in [18].

(Theorem 6.) Let S^L(t), S^K(t) and S^R(t) denote the maximum amounts of local, rack-local and remote service that can be provided in time slot t. Then in steady state, the aggregate service is stochastically dominated by S^(ε)(t) as defined in (8), and the aggregate queue length satisfies

Q̄(t+1) = Q̄(t) + Ā(t) − S̄(t) + Ū(t),

where Ā, S̄ and Ū are defined similarly to Q̄. We consider the Lyapunov function V(Z) = ‖Q̄‖² and apply the drift-based lower-bound technique of [6].
[Figure: mean task delay of the Weighted-Workload, JSQ-MaxWeight and Simple Priority algorithms as the load varies from 0.2 to 0.9.]

Fig. 6. Mean task delay at high load, comparing Weighted-Workload and JSQ-MaxWeight: (a) distribution-1 (load 0.6 to 0.75); (b) distribution-2 (load 0.8 to 0.9).
Figure 6 compares the delay performance of JSQ-MaxWeight and the Weighted-Workload algorithm at high load. With distribution-1, both algorithms achieve heavy-traffic optimality. Figure 6(a) shows that the Weighted-Workload algorithm has performance similar to JSQ-MaxWeight. With distribution-2, however, the Weighted-Workload algorithm achieves up to a 4-fold improvement over the JSQ-MaxWeight algorithm at high load. The Weighted-Workload algorithm is shown to be heavy-traffic optimal for all traffic scenarios. The significant improvement of the Weighted-Workload algorithm over JSQ-MaxWeight at high load in Fig. 6(b) confirms that JSQ-MaxWeight is not heavy-traffic optimal for all traffic scenarios.
VIII. CONCLUSION
We considered a stochastic model with multi-level data locality for computing clusters. We studied an extension of the JSQ-MaxWeight algorithm [16] to three locality levels, which is throughput-optimal but not heavy-traffic optimal for most traffic scenarios, and proposed the Weighted-Workload algorithm, which we proved to be both throughput and heavy-traffic optimal for all traffic scenarios.
REFERENCES
[1] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica,
D. Harlan, and E. Harris. Scarlett: Coping with Skewed Popularity
Content in MapReduce Clusters. In Proceedings of the European
Conference on Computer Systems (EuroSys), 2011.
[2] G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula,
S. Shenker, and I. Stoica. PACMan: Coordinated Memory Caching
for Parallel Jobs. In Proceedings of Symposium on Networked Systems
Design and Implementation (NSDI). USENIX, 2012.
[3] Apache Hadoop, June 2011.
[4] S. L. Bell and R. J. Williams. Dynamic scheduling of a system with
two parallel servers in heavy traffic with resource pooling: asymptotic
optimality of a threshold policy. Annals of Applied Probability, 2001.
[5] S. L. Bell and R. J. Williams. Dynamic scheduling of a parallel server
system in heavy traffic with complete resource pooling: Asymptotic
optimality of a threshold policy. Electron. J. Probab., 10(33):1044–1115, 2005.
[6] A. Eryilmaz and R. Srikant. Asymptotically tight steady-state queue
length bounds implied by drift conditions. Queueing Syst. Theory Appl.,
72(3-4):311–359, 2012.
[7] J. M. Harrison. Heavy traffic analysis of a system with parallel servers:
Asymptotic optimality of discrete review policies. Annals of Applied
Probability, 1998.
[8] J. M. Harrison and M. J. Lopez. Heavy traffic resource pooling in
parallel-server systems. Queueing Syst. Theory Appl., 33(4), Apr. 1999.
[9] C. He, Y. Lu, and D. Swanson. Matchmaking: A New MapReduce
Scheduling Technique. In Proceedings of the International Conference
on Cloud Computing Technology and Science (CloudCom). IEEE, 2011.
[10] S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, and S. Wu. Maestro:
Replica-aware Map Scheduling for MapReduce. In Proceedings of
the International Symposium on Cluster, Cloud and Grid Computing
(CCGrid). IEEE, 2012.
[11] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In
Proceedings of the Symposium on Operating Systems Principles (SOSP).
ACM, 2009.
[12] J. Jin, J. Luo, A. Song, F. Dong, and R. Xiong. Bar: An Efficient Data
Locality Driven Task Scheduling Algorithm for Cloud Computing. In
Proceedings of the International Symposium on Cluster, Cloud and Grid
Computing (CCGrid). IEEE, 2011.
[13] A. Mandelbaum and A. L. Stolyar. Scheduling flexible servers with
convex delay costs: Heavy-traffic optimality of the generalized cμ-rule.
Operations Research, 52, 2004.
[14] M. Squillante, C. Xia, D. Yao, and L. Zhang. Threshold-based priority
policies for parallel-server systems with affinity scheduling. In Proc.
IEEE American Control Conf., 2001.
[15] A. L. Stolyar. Maxweight scheduling in a generalized switch: State
space collapse and workload minimization in heavy traffic. Annals of
Applied Probability, 14(1):1–53, 2004.
[16] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang. Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-traffic
Optimality. In Proceedings of INFOCOM. IEEE, 2013.
[17] Q. Xie and Y. Lu. Priority Algorithm for Near-data Scheduling:
Throughput and Heavy-Traffic Optimality. In Proceedings of INFOCOM. IEEE, 2015.
[18] Q. Xie and Y. Lu. Scheduling with multi-level data locality: Throughput
and heavy-traffic optimality. Technical Report, 2015.
[19] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and
I. Stoica. Delay scheduling: A Simple Technique for Achieving Locality
and Fairness in Cluster Scheduling. In Proceedings of the European
Conference on Computer Systems (EuroSys), 2010.