
Scheduling with Multi-level Data Locality: Throughput and Heavy-traffic Optimality
Qiaomin Xie, Ali Yekkehkhany, Yi Lu
{qxie3, yekkehk2, yilu4}@illinois.edu
University of Illinois at Urbana-Champaign

Abstract: A fundamental problem for all data-parallel applications is data locality. An example is map task scheduling in the MapReduce framework. Existing theoretical work analyzes systems with only two levels of locality, despite the existence of multiple locality levels within and across data centers. We found that going from two to three levels of locality changes the problem drastically, as a tradeoff between performance and throughput emerges. The recently proposed priority algorithm, which is throughput and heavy-traffic optimal for two locality levels, is not even throughput-optimal with three locality levels. The JSQ-MaxWeight algorithm proposed by Wang et al. is heavy-traffic optimal only for a special traffic scenario with two locality levels. We show that an extension of the JSQ-MaxWeight algorithm to three locality levels preserves its throughput-optimality, but suffers from the same lack of heavy-traffic optimality for most traffic scenarios. We propose a novel algorithm that uses Weighted-Workload (WW) routing and priority service. We establish its throughput and heavy-traffic optimality for all traffic scenarios. The main challenge is the construction of an appropriate ideal load decomposition that allows the separate treatment of different subsystems.

I. INTRODUCTION
Data-parallel applications have become prevalent for processing large data sets in online social networks, search engines, scientific research and the health-care industry. A fundamental problem for data-parallel applications is near-data
scheduling, or scheduling with data locality, as the amount
of time and resource consumed by data-processing tasks vary
with their locations. An example is map task scheduling in
the MapReduce framework. Even with the increase in the
speed of data center networks, there remains a significant
difference in average processing speed [19], [1], [2] depending
on whether the data reside in memory, on a local disk, in a
local rack, in the same cluster or in a different data center.
As the processing speed depends on the task-server pair, it is
an affinity scheduling problem [14], [7], [8], albeit with an
explosive number of task types defined by the combinatorial
set of locations of its data, and only a few different processing
speeds due to the multi-level locality.
We are interested in designing a throughput and heavy-traffic optimal scheduling algorithm in a system with multi-level locality. We define the optimality criteria as follows.
Throughput Optimality: An algorithm stabilizes any arrival rate vector strictly within the capacity region, hence is robust to variations in traffic.
Heavy-traffic Optimality: An algorithm asymptotically minimizes the average delay as the arrival rate vector approaches the boundary of the capacity region, hence is efficient under stressed conditions.
A. Previous Work
The existing algorithms for affinity scheduling fall into two
categories. The first category includes the work by Harrison [7], Harrison and Lopez [8] and Bell and Williams [4],
[5]. These algorithms require the arrival rates of all task types
to compute a graph of basic activities and achieve optimality
using carefully constructed thresholds on per-task-type queues.
The second category consists of the MaxWeight algorithm
proposed by Stolyar [15], [13]. The MaxWeight algorithm
does not require task arrival rates, but it does not achieve delay
optimality in general. Instead, it minimizes a specific weighted
sum of delay. In addition, both categories of algorithms use
per-task-type queues, which is impractical for this setting as
the typical number of task types scales cubically with the
number of servers.
Recently Wang et al. proposed the JSQ-MaxWeight algorithm [16], which solves the per-task-type queue problem of MaxWeight when there are two locality levels. Like
MaxWeight, JSQ-MaxWeight is throughput-optimal. However,
it was shown to be heavy-traffic optimal only for a special
traffic scenario where a server is either locally overloaded or
receives zero local traffic. We explain in Section III that an
extension of the JSQ-MaxWeight algorithm to three locality
levels preserves its throughput optimality, but suffers from the
same lack of heavy-traffic optimality in all but a special set
of scenarios.
Another recent work proposed a priority algorithm [17] for
the map task assignment problem. It is shown to be both
throughput and heavy-traffic optimal for two levels of locality.
However, with three-level locality, it is not even throughput-optimal. We defer the detailed explanation to Section I-B.
There are also a number of heuristics [3], [19], [11], [12],
[10], [9] proposed for near-data scheduling with multi-level
locality, but their fundamental throughput and delay properties
are not known.
B. A Performance-versus-Throughput Dilemma
Under the priority algorithm [17], each server maintains a
queue that only receives tasks local to this server. The load
balancing step balances tasks across their local queues. Each
server serves a local task if its queue is not empty; otherwise
it serves a remote task from the longest queue in the system.

With two levels of locality, the priority algorithm achieves


good delay performance as it maximizes the number of tasks
served locally. The system is also throughput-optimal as any
remaining capacity of an underloaded server is devoted to
remote service.

Fig. 1. A simple system with two racks: servers 1 and 2 form rack 1, and servers 3 and 4 form rack 2.

However, for a system with three levels of locality, the priority algorithm is not throughput-optimal. Consider a system with two racks, each consisting of two servers, as illustrated in Fig. 1. There are three types of tasks: one type is only local to server 1 and has rate λ, one is only local to server 4 and has rate 1.9λ, and the third type is local to both servers 2 and 3 and has rate λ. Assume a local task is served at rate α = 1, a rack-local task is served at rate β = 0.9, and a remote task is served at rate γ = 0.5. With the priority algorithm [17], the system is stable only if

1.9λ < α + β(1 − λ/2) + γ(1 − λ) + γ(1 − λ/2).

Thus the achievable throughput is λ < 0.9355, while the system is clearly stabilizable for λ < 1. The problem with the priority algorithm is that the shared local traffic of rate λ is split evenly among servers 2 and 3. However, the throughput will increase if server 3 serves no local tasks, but instead devotes its capacity to rack-local service for server 4.
While the priority algorithm achieves good delay performance at low load when most tasks can be served locally,
it sacrifices throughput at high load. This example raises the
question as to whether there exist scheduling algorithms that
can simultaneously achieve throughput and delay optimality.
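To make the dilemma concrete, here is a small numeric check of the example above (an illustrative sketch, not the authors' code; the capacity accounting follows the stability condition in Section I-B):

```python
ALPHA, BETA, GAMMA = 1.0, 0.9, 0.5  # local, rack-local, remote service rates

def priority_capacity(lam):
    """Service available to server 4's tasks when the shared traffic of
    rate lam is split evenly between servers 2 and 3 (priority algorithm)."""
    return (ALPHA                      # server 4 serves locally full-time
            + BETA * (1 - lam / 2)     # rack-local help from server 3
            + GAMMA * (1 - lam)        # remote help from server 1
            + GAMMA * (1 - lam / 2))   # remote help from server 2

def dedicated_capacity(lam):
    """Server 3 serves no local tasks and helps server 4 rack-locally;
    server 2 takes all of the shared traffic."""
    return ALPHA + BETA + GAMMA * (1 - lam) + GAMMA * (1 - lam)

lam = 0.95
print(priority_capacity(lam) > 1.9 * lam)   # False: 1.76 < 1.805, unstable
print(dedicated_capacity(lam) > 1.9 * lam)  # True: 1.95 > 1.805, stable
```

At λ = 0.95 the even split leaves server 4 short of capacity, while dedicating server 3 to rack-local service keeps the system stable, which is exactly the tradeoff the example illustrates.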
C. Our Contributions
We propose a novel algorithm that uses Weighted-Workload
(WW) routing and priority service. The key insight is that
throughput optimality requires the workload to be kept at
the correct ratio at different queues, but the composition of
the workload can be designed appropriately such that it is
delay-optimal in the heavy-traffic regime. We note that this is
the only known delay-optimal algorithm in the heavy-traffic
regime when the arrival rates are unknown.
We state our results in the rest of the paper for three levels of locality. We consider a discrete-time model for the system, where tasks arrive at the beginning of each time slot according to some stochastic process. Each task processes one data chunk. Within each time slot, a task is completed with probability α at a local server, β at a rack-local server, or γ at a remote server, with α > β > γ. Our main contributions are as follows:
• We identify the capacity region of a system with three locality levels. The capacity region is defined to be the set of arrival rate vectors under which the system can be stabilized by some scheduling algorithm.
• We extend the JSQ-MaxWeight algorithm [16] and show that it is throughput-optimal. It is heavy-traffic optimal for special traffic scenarios analogous to that with two-level locality [16].
• We establish the throughput optimality of our proposed algorithm.
• We establish the heavy-traffic optimality of our proposed algorithm for all traffic scenarios provided 2β > α + γ, which holds in the case of map task scheduling. The priority service precludes the use of the ℓ2-norm Lyapunov drift. The main idea is the construction of a multi-level ideal load decomposition for each arrival rate vector, which resolves the problem encountered by the priority algorithm [17].
II. SYSTEM MODEL
We consider a MapReduce system with a hierarchical network. The system consists of racks, each of which contains
multiple servers. Servers within a rack share a common switch.
A large data set is divided into blocks, each of which is
replicated on a few servers for fault tolerance and performance.
A job consists of a number of map tasks, each of which
processes a different data chunk. For each task, we call a server
a local server for the task if the data block to be processed by
the task is stored locally, and we call this task a local task for
the server. Analogously, we call a server a rack-local server if
the data block to be processed by the task is not stored on the
server, but in the same rack as the server, and we call this task
a rack-local task for the server. A server is a remote server if
it is neither local nor rack-local for the task and this task is
called a remote task for the server.
We consider a discrete-time model that consists of K racks indexed by k ∈ K, where K = {1, 2, …, K}. There are M parallel servers in the system, indexed by m ∈ M, where M = {1, 2, …, M}. For each server m, we denote by K(m) the index of the rack where it is located. Each data chunk is replicated on a set of servers. As each task processes one data chunk, it has a set of local servers. Define the type of a task as the set L of its local servers. For instance, with a replication factor of 3, the task type is defined as

L ∈ {(m1, m2, m3) ∈ M³ : m1 < m2 < m3},

where m1, m2, m3 are the indices of the three local servers. We use m ∈ L to denote that server m is a local server for type-L tasks. We use the notation m ∈ L^k if server m is rack-local to type-L tasks, and similarly, m ∈ L^r if server m is remote to type-L tasks. Let 𝓛 denote the set of task types.

Arrivals. Let A_L(t) denote the number of type-L tasks that arrive at the beginning of time slot t. We assume that the arrival process of type-L tasks is i.i.d. with rate λ_L. We denote the arrival rate vector by λ = (λ_L : L ∈ 𝓛). The number of total arrivals in one time slot is assumed to be bounded.

Services. For each task, we assume that its service time follows a geometric distribution with mean 1/α if processed at a local server, and with mean 1/β and 1/γ at a rack-local server and a remote server respectively. On average, a task is processed fastest at a local server, and slowest at a remote server, hence we assume α > β > γ. Each server can process one task at a time and all services are non-preemptive.
A. Outer Bound of the Capacity Region
We consider a decomposition of the arrival rate vector λ = (λ_L : L ∈ 𝓛). For any task type L, λ_L is decomposed into (λ_{L,m} : m ∈ M), where λ_{L,m} is the arrival rate of type-L tasks for server m. To ensure that the arrival rate vector is supportable, a necessary condition is that the sum of local, rack-local and remote load on any server is strictly less than 1, i.e., for all m ∈ M,

∑_{L: m∈L} λ_{L,m}/α + ∑_{L: m∈L^k} λ_{L,m}/β + ∑_{L: m∈L^r} λ_{L,m}/γ < 1.   (1)

Let Λ be the set of arrival rate vectors such that each element has a decomposition satisfying condition (1):

Λ = {λ = (λ_L : L ∈ 𝓛) : ∃ λ_{L,m} ≥ 0 s.t. λ_L = ∑_{m=1}^{M} λ_{L,m} ∀L ∈ 𝓛, and (1) holds ∀m ∈ M}.

Therefore Λ gives an outer bound of the capacity region.
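Membership in the outer bound Λ can be checked numerically by solving a small linear program over the decomposition {λ_{L,m}}. The sketch below is illustrative and not from the paper; the function names and the use of scipy are assumptions:

```python
import numpy as np
from scipy.optimize import linprog

ALPHA, BETA, GAMMA = 1.0, 0.9, 0.5  # local, rack-local, remote rates

def load_weight(m, L, rack_of):
    """Load (1/alpha, 1/beta or 1/gamma) a type-L task places on server m."""
    if m in L:
        return 1.0 / ALPHA
    if rack_of[m] in {rack_of[n] for n in L}:
        return 1.0 / BETA
    return 1.0 / GAMMA

def strictly_in_outer_bound(lam, rack_of, M):
    """lam maps a task type (tuple of local servers) to its arrival rate.
    Minimizes the maximum server load t over all decompositions
    {lambda_{L,m}}; lam lies strictly within the outer bound iff t < 1."""
    types = list(lam)
    idx = {(L, m): i for i, (L, m) in
           enumerate((L, m) for L in types for m in range(M))}
    n = len(idx)                       # decomposition variables, plus t
    c = np.zeros(n + 1); c[-1] = 1.0   # objective: minimize t
    A_eq = np.zeros((len(types), n + 1))
    b_eq = np.array([lam[L] for L in types])
    for i, L in enumerate(types):      # sum_m lambda_{L,m} = lambda_L
        for m in range(M):
            A_eq[i, idx[L, m]] = 1.0
    A_ub = np.zeros((M, n + 1)); b_ub = np.zeros(M)
    for (L, m), j in idx.items():      # load on server m minus t <= 0
        A_ub[m, j] = load_weight(m, L, rack_of)
    A_ub[:, -1] = -1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (n + 1))
    return res.success and res.x[-1] < 1.0

# The two-rack example of Fig. 1 (servers 0-3, racks {0,1} and {2,3}):
rack_of = {0: 0, 1: 0, 2: 1, 3: 1}
print(strictly_in_outer_bound(
    {(0,): 0.9, (3,): 1.9 * 0.9, (1, 2): 0.9}, rack_of, M=4))  # True
```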
III. RESULTS ON JSQ-MAXWEIGHT
We summarize the results on an extension of the JSQ-MaxWeight algorithm proposed by Wang et al. [16]. We refer the reader to our technical report [18] for the complete proofs.
We extend the JSQ-MaxWeight algorithm to a system with three levels of locality: local, rack-local and remote. The central scheduler maintains a set of M queues, where the m-th queue, denoted by Q_m, receives tasks local to server m. Let Q = (Q_1, Q_2, …, Q_M) denote the vector of these queue lengths. The algorithm consists of JSQ routing and MaxWeight scheduling:
JSQ routing: When a task of type L arrives, the scheduler compares the lengths of the task's local queues, {Q_m : m ∈ L}, and inserts the task into the shortest queue. Ties are broken randomly.
MaxWeight scheduling: When server m becomes idle, its scheduling decision η_m(t) is chosen from the set

arg max_{n∈M} {α Q_n(t) 1_{n=m}, β Q_n(t) 1_{K(n)=K(m)}, γ Q_n(t) 1_{K(n)≠K(m)}},

i.e., server m serves a task from the queue with the largest locality-weighted length. Ties are broken randomly.
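The two decision rules can be summarized in code. The following is a minimal sketch with illustrative names, not the authors' implementation; Q holds the M queue lengths and rack_of maps a server to its rack:

```python
import random

def jsq_route(task_type, Q):
    """JSQ routing: enqueue the task at the shortest of its local queues."""
    shortest = min(Q[m] for m in task_type)
    return random.choice([m for m in task_type if Q[m] == shortest])

def maxweight_schedule(m, Q, rack_of, alpha=1.0, beta=0.9, gamma=0.5):
    """MaxWeight scheduling: an idle server m picks the queue n maximizing
    the locality-weighted length."""
    def weight(n):
        if n == m:
            return alpha               # local
        if rack_of[n] == rack_of[m]:
            return beta                # rack-local
        return gamma                   # remote
    scores = [weight(n) * Q[n] for n in range(len(Q))]
    best = max(scores)
    return random.choice([n for n, s in enumerate(scores) if s == best])
```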
Theorem 1: Any arrival rate vector strictly within Λ is supportable by JSQ-MaxWeight. Thus Λ is the capacity region of the system and JSQ-MaxWeight is throughput-optimal.
For a system with only two levels of data locality, the JSQ-MaxWeight algorithm has been shown to be heavy-traffic optimal for a special traffic scenario [16], where a server is either locally overloaded or receives zero local traffic. For a system with the rack structure, hence three levels of locality, we consider the following traffic scenario.
All traffic concentrates on a subset of racks, and any rack with non-zero local tasks is overloaded. Moreover, any server in an overloaded rack either receives zero local traffic or is locally overloaded. Denote by K_o the set of racks that can have local tasks, by S_o the set of servers that receive non-zero local traffic, by S_h the set of servers that receive zero local traffic but belong to racks in K_o, and by S_r the set of servers in racks that receive zero local traffic. For any subset of servers S ⊆ M, we denote by 𝓛(S) = {L ∈ 𝓛 : ∃m ∈ S s.t. m ∈ L} the set of task types with local servers in S. Analogously, for any subset of racks A ⊆ K, denote by 𝓛(A) = {L ∈ 𝓛 : ∃m ∈ L s.t. K(m) ∈ A} the set of task types with local servers in racks A. For a rack k ∈ K_o, let S_o^(k) = {m ∈ S_o : K(m) = k} be the set of servers having local traffic and belonging to rack k, and S_h^(k) = {m ∈ S_h : K(m) = k} the set of servers without any local traffic and belonging to rack k. Formally, the heavy-traffic regime assumes that for any m ∈ S_o,

∑_{L ∈ 𝓛({m})} λ_L > α,

and for any k ∈ K_o,

∑_{L ∈ 𝓛({k})} λ_L > α|S_o^(k)| + β|S_h^(k)|.

It is easy to see that in a stable system, ∑_{L∈𝓛} λ_L < α|S_o| + β|S_h| + γ|S_r|. We assume that

∑_{L∈𝓛} λ_L = α|S_o| + β|S_h| + γ|S_r| − ε,   (2)

where ε > 0 characterizes the distance of the arrival rate vector from the capacity boundary.
Theorem 2: Consider the arrival processes {A_L^(ε)(t), t ≥ 0} with mean λ^(ε) satisfying the above condition. Then JSQ-MaxWeight is heavy-traffic optimal.


Note that JSQ-MaxWeight is not heavy-traffic optimal in
other traffic scenarios, when the underloaded racks, and the
underloaded servers in overloaded racks, receive local traffic,
for the same reason as with two levels of locality [16]. One
problem is the growth of queues of local tasks at the servers
that have zero local queue lengths in the special scenario.
The growing queues of local tasks at the underloaded servers
and racks result in non-optimal delay. Our Weighted-Workload algorithm solves this problem.

IV. OUR ALGORITHM
The Weighted-Workload algorithm is illustrated in Fig. 2.
The central scheduler maintains a set of M queues, where the m-th queue consists of 3 sub-queues denoted by Q_m^l, Q_m^k and Q_m^r, which receive tasks local, rack-local and remote to server m respectively. We denote by Q(t) = (Q_1(t), Q_2(t), …, Q_M(t)) the queue lengths at time t, where Q_m(t) = (Q_m^l(t), Q_m^k(t), Q_m^r(t)). We define the expected workload of the m-th queue, W_m(t), as

W_m(t) = Q_m^l(t)/α + Q_m^k(t)/β + Q_m^r(t)/γ.

Fig. 2. The Weighted-Workload algorithm.

At the beginning of each time slot t, the central scheduler routes new arrivals to one of the queues and schedules a new task for each idle server as follows:
Weighted-Workload queueing: When a task of type L arrives, the scheduler selects the three queues that have the least workload among its local servers, rack-local servers and remote servers respectively. Their workloads are further weighted by 1/α, 1/β, 1/γ respectively, and the task joins the queue with the minimum weighted workload. Ties are broken randomly. The task then joins the corresponding sub-queue depending on whether it is local, rack-local or remote for the selected server. Formally, the final selected queue m*(t) is in the set

arg min { min_{m∈L} W_m(t)/α, min_{m∈L^k} W_m(t)/β, min_{m∈L^r} W_m(t)/γ }.

Prioritized scheduling: When a server becomes idle, it serves tasks from its queue in the order of local, rack-local and remote. For instance, both the local and rack-local sub-queues need to be empty before a remote task is served. When all its sub-queues are empty, the server remains idle.
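The routing and service rules above can be sketched as follows (an illustrative sketch, not the authors' implementation; Q[m] holds the three sub-queue lengths of server m and rack_of maps a server to its rack):

```python
def workload(q, alpha=1.0, beta=0.9, gamma=0.5):
    """Expected workload W_m of queue q = (local, rack_local, remote)."""
    return q[0] / alpha + q[1] / beta + q[2] / gamma

def ww_route(task_type, Q, rack_of, alpha=1.0, beta=0.9, gamma=0.5):
    """Weighted-Workload routing: pick the least-workload local, rack-local
    and remote candidates, weight them by 1/alpha, 1/beta, 1/gamma, and
    join the smallest; returns (server, sub-queue index)."""
    M = len(Q)
    racks = {rack_of[n] for n in task_type}
    groups = [
        (list(task_type), alpha, 0),                                      # local
        ([m for m in range(M)
          if m not in task_type and rack_of[m] in racks], beta, 1),       # rack-local
        ([m for m in range(M)
          if m not in task_type and rack_of[m] not in racks], gamma, 2),  # remote
    ]
    options = []
    for members, rate, sub in groups:
        if members:
            m = min(members, key=lambda n: workload(Q[n]))
            options.append((workload(Q[m]) / rate, m, sub))
    _, m, sub = min(options)   # ties here break by index, not randomly
    return m, sub

def priority_schedule(m, Q):
    """Prioritized service: local first, then rack-local, then remote."""
    for sub in (0, 1, 2):
        if Q[m][sub] > 0:
            return sub
    return None                # all sub-queues empty: server stays idle
```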
A. Queue Dynamics
Let A_{L,m}(t) denote the number of type-L tasks that are routed to queue m. The total numbers of tasks that join the local sub-queue Q_m^l, the rack-local sub-queue Q_m^k, and the remote sub-queue Q_m^r, denoted by A_m^l(t), A_m^k(t) and A_m^r(t) respectively, are given by

A_m^l(t) = ∑_{L: m∈L} A_{L,m}(t),  A_m^k(t) = ∑_{L: m∈L^k} A_{L,m}(t),  A_m^r(t) = ∑_{L: m∈L^r} A_{L,m}(t).
Let f_m(t) denote the working status of server m at time slot t:

f_m(t) = −1, if server m is idle;
f_m(t) = 0, if server m serves a local task from Q_m^l;
f_m(t) = 1, if server m serves a rack-local task from Q_m^k;
f_m(t) = 2, if server m serves a remote task from Q_m^r.

When server m completes a task at the end of time slot t − 1, i.e., f_m(t⁻) = −1, it is available for a new task at time slot t. The scheduling decision is based on the working status vector f(t) = (f_1(t), f_2(t), …, f_M(t)) and the queue length vector Q(t). Let η_m(t) denote the scheduling decision for server m at time slot t. Note that η_m(t) = f_m(t) for all busy servers, and when f_m(t⁻) = −1, i.e., server m is idle, η_m(t) is determined by the scheduler according to the algorithm.

Let S_m^l(t), S_m^k(t) and S_m^r(t) denote the local, rack-local and remote service provided by server m respectively, where S_m^l(t) ∼ Bern(α 1_{η_m(t)=0}), S_m^k(t) ∼ Bern(β 1_{η_m(t)=1}) and S_m^r(t) ∼ Bern(γ 1_{η_m(t)=2}) are Bernoulli random variables with varying probability. For instance, S_m^l(t) ∼ Bern(α) when server m is scheduled to its local sub-queue, and S_m^l(t) ∼ Bern(0) otherwise. The same applies to S_m^k(t) and S_m^r(t). The dynamics of the three sub-queues at server m can be described as

Q_m^l(t + 1) = Q_m^l(t) + A_m^l(t) − S_m^l(t),
Q_m^k(t + 1) = Q_m^k(t) + A_m^k(t) − S_m^k(t),
Q_m^r(t + 1) = Q_m^r(t) + A_m^r(t) − S_m^r(t) + U_m(t),

where U_m(t) = max{0, S_m^r(t) − Q_m^r(t) − A_m^r(t)} is the unused service. As the service times follow geometric distributions, Q(t) together with the working status vector f(t) forms an irreducible and aperiodic Markov chain {Z(t) = (Q(t), f(t)), t ≥ 0}.

B. Throughput Optimality
Theorem 3: The Weighted-Workload algorithm is throughput optimal. That is, it stabilizes any arrival rate vector strictly
within the capacity region.
To prove Theorem 3, we use a Lyapunov function that is quadratic in the expected workload of each queue:

V(t) = ‖W(t)‖² = ∑_{m=1}^{M} (Q_m^l(t)/α + Q_m^k(t)/β + Q_m^r(t)/γ)².

Note that the service discipline does not affect the proof as the expected workload is reduced at the same rate regardless of which sub-queue is served. The proof is similar to that for the throughput-optimality of JSQ-MaxWeight. The weighted-workload queueing effectively replaces the role of MaxWeight service, but leaves the choice of service discipline free for the potential achievement of delay optimality. We defer the full proof to our technical report [18] due to space constraints.

V. IDEAL LOAD DECOMPOSITION


A key component of the proof of heavy-traffic optimality of the WW algorithm is the construction of an ideal load decomposition. An analogous method was used in [17] for two levels of locality, but the construction with three levels of locality is more involved and lends insight into general systems. The construction serves two purposes: 1) The ideal load obtained for each server is used as an intermediary in the proofs of state-space collapse; 2) The construction uniquely identifies four types of servers, namely helpers and beneficiaries in underloaded and overloaded racks respectively, which have very different traffic compositions and require distinct treatment in the proofs.

Fig. 3. The queue compositions of the four types of servers.

Figure 3 illustrates the different sub-queue compositions of the four subsystems under the ideal load decomposition:
Helpers in underloaded racks, H_u: A server belongs to H_u if it is not overloaded, provides rack-local service and remote service, and all tasks local to this server are served locally in the system.
Beneficiaries in underloaded racks, B_u: A server belongs to B_u if it is overloaded, does not provide rack-local or remote service, and tasks local to this server receive rack-local service but not remote service.
Helpers in overloaded racks, H_o: A server belongs to H_o if it is not overloaded, provides rack-local service but not remote service, and all tasks local to this server are served locally in the system.
Beneficiaries in overloaded racks, B_o: A server belongs to B_o if it is overloaded, does not provide rack-local or remote service, and tasks local to this server receive rack-local service and remote service.
We will define overloaded servers and racks in a more precise manner in Section V-B. While pure helpers and beneficiaries in underloaded or overloaded racks do not exist in a real system, the ideal load decomposition approximately depicts the load distribution in the heavy-traffic regime.
The condition 2β > α + γ. We focus on the case where 2β > α + γ, i.e., the rack-local rate is significantly larger than the remote rate. Note that this condition holds in MapReduce clusters. In addition, it simplifies the procedure of constructing the ideal load decomposition.

Consider an overloaded rack k_o and an underloaded rack k_u. Suppose there exists traffic that is local to both racks. The condition 2β > α + γ dictates that all such traffic should be moved to the underloaded rack in the ideal load decomposition regardless of the load on the servers. For instance, moving a δ amount of traffic from k_o to k_u creates new capacity for k_o so that it can serve an additional (β/α)δ amount of rack-local traffic in the overloaded rack. On the other hand, when a server in k_u becomes overloaded (and hence becomes a beneficiary), the movement creates new rack-local traffic in the underloaded rack and as a result reduces a (γ/β)δ amount of remote traffic served in this rack. The condition 2β > α + γ implies that β/α > γ/β, i.e., the increase in rack-local capacity outweighs the decrease in remote capacity. Hence the movement of shared local traffic continues even if a server changes from H_u to B_u. And in the ideal load decomposition constructed, no shared local traffic between underloaded racks and overloaded racks is routed to overloaded racks.
In other words, the condition 2β > α + γ ensures that the sacrifice of local service for rack-local service benefits the system capacity by reducing the amount of traffic that should be served remotely. The final load decomposition is ideal in the sense that it minimizes the amount of remote traffic.
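The implication invoked above can be written out explicitly; by the AM-GM inequality (a sketch of the algebra, using the rates defined in Section II):

```latex
2\beta > \alpha + \gamma \;\ge\; 2\sqrt{\alpha\gamma}
\quad\Longrightarrow\quad \beta^{2} > \alpha\gamma
\quad\Longleftrightarrow\quad \frac{\beta}{\alpha} > \frac{\gamma}{\beta}.
```

Moving δ units of shared local traffic out of the overloaded rack frees δ/α units of server time there, worth (β/α)δ of rack-local service, while consuming δ/β units of helper time in the underloaded rack, costing (γ/β)δ of remote service; the displayed inequality says the gain strictly exceeds the loss.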
We construct the ideal load decomposition in the rest of the section. We highlight the key steps here and defer the full proofs to our technical report [18]. First, we need an equivalent capacity region with a more refined decomposition to define the overloaded set. The ideal load decomposition is constructed from this refined decomposition in two steps: 1) Identify the overloaded servers and racks; 2) Construct the decomposition that produces H_u, B_u, H_o, B_o.
A. An Equivalent Capacity Region
We define the following equivalent capacity region:

Λ' = {λ = (λ_L : L ∈ 𝓛) : ∃ λ_{L,m,n} ≥ 0 s.t. λ_L = ∑_{m=1}^{M} ∑_{n: n∈L} λ_{L,m,n} ∀L ∈ 𝓛, and ∀m ∈ M,
∑_{L: m∈L} ∑_{n: n∈L} λ_{L,m,n}/α + ∑_{L: m∈L^k} ∑_{n: n∈L} λ_{L,m,n}/β + ∑_{L: m∈L^r} ∑_{n: n∈L} λ_{L,m,n}/γ < 1}.   (3)

The equivalence of the capacity region can be established by setting λ_{L,m} = ∑_{n: n∈L} λ_{L,m,n}. Each λ_{L,m} is further decomposed into λ_{L,m,n}, which denotes the rate of type-L tasks local to the server n but processed at server m. The additional index n provides a pseudo-distribution of tasks across their local servers only. It does not affect where they are processed. The information is used for constructing the overloaded set, which only depends on the types and rates of local tasks to a server.

B. Overloaded Servers and Racks


Our definition of overloaded servers is the same as with two-level locality [17]. A server is overloaded only if its local traffic cannot be distributed by load balancing with underloaded servers. We reproduce the lemma here. Define

λ̃_n = ∑_{m=1}^{M} ∑_{L: n∈L} λ_{L,m,n},

which gives the pseudo-arrival rate of local tasks to server n under a decomposition {λ_{L,m,n}}. For any subset of servers S ⊆ M, we denote by 𝓛_S the set of task types local only to servers in S, and by 𝓛^S the set of task types that have at least one local server in S.
Lemma 1 ([17]): For any arrival rate vector λ ∈ Λ, there exists a decomposition {λ_{L,m,n}} which satisfies (3) and, with S = {n ∈ M : λ̃_n ≥ α},

λ_{L,m,n} = 0,  ∀L ∉ 𝓛_S, n ∈ S.   (4)

Next, we define the overloaded racks. With a slight abuse of notation, for any subset of racks A ⊆ K, we denote by 𝓛_A the set of task types that are local only to servers in racks A. A rack k is overloaded under a decomposition {λ_{L,m,n}} if

∑_{n: K(n)=k, λ̃_n ≥ α} (λ̃_n − α) > β ∑_{n: K(n)=k, λ̃_n < α} (1 − λ̃_n/α).   (5)

Note that the LHS of (5) gives the amount of local traffic for overloaded servers in rack k that could not be served locally. The RHS of (5) is the maximum rack-local service that can be provided by underloaded servers within rack k. Hence rack k requires remote service if (5) holds.
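Condition (5) is easy to evaluate for a given pseudo-arrival vector. A minimal sketch (illustrative, not from the paper):

```python
def rack_overloaded(local_rates, alpha=1.0, beta=0.9):
    """Condition (5): the excess local load of overloaded servers in a rack
    exceeds the rack-local service its underloaded servers can provide.
    local_rates lists the pseudo-arrival rate of local tasks per server."""
    excess = sum(lam - alpha for lam in local_rates if lam >= alpha)
    spare = beta * sum(1 - lam / alpha for lam in local_rates if lam < alpha)
    return excess > spare

# Rack 2 of Fig. 1 at lam = 0.9: server 3 carries 0.45, server 4 carries 1.71.
print(rack_overloaded([0.45, 1.71]))  # True: excess 0.71 > spare 0.495
```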
Lemma 2: Assume 2β > α + γ. For any arrival rate vector λ ∈ Λ, there exists a decomposition {λ_{L,m,n}} that satisfies (3), and for any n such that K(n) ∈ K_o, where K_o is the set of overloaded racks satisfying Eq. (5),

λ_{L,m,n} = 0,  ∀L ∉ 𝓛_{K_o}.   (6)

The decomposition is such that the overloaded set of racks K_o only receives non-zero arrivals from task types that are local only to K_o. The concept is similar to that of overloaded servers. The proof starts with the decomposition given by Lemma 1 and iteratively moves an appropriate amount of load from currently overloaded racks to underloaded racks. This is possible whenever an overloaded rack receives local arrivals that are also local to some underloaded rack. The assumption 2β > α + γ ensures that this process can continue until (6) is satisfied. At the end of each step, either there is no more shared local load between the two racks, or they have both become underloaded or overloaded. It can be shown that at each step, the decomposition continues to satisfy (3) and reduces the total load in the system.
C. Ideal Load Decomposition
We are ready to formally define the four types of servers.
Lemma 3: For any arrival rate vector λ ∈ Λ, there exists a decomposition {λ_{L,m,n}} satisfying (3) and Lemma 2. Let K_o and K_u denote the set of overloaded and underloaded racks, respectively. The system is divided into the following four subsystems:

H_u = {n : K(n) ∈ K_u, λ̃_n < α, all tasks local to n are processed locally, and n serves rack-local and remote tasks},
B_u = {n : K(n) ∈ K_u, λ̃_n ≥ α, n serves no rack-local or remote tasks, and the excess local tasks of n are served rack-locally within K(n)},
H_o = {n : K(n) ∈ K_o, λ̃_n < α, all tasks local to n are processed locally, and n serves rack-local but not remote tasks},
B_o = {n : K(n) ∈ K_o, λ̃_n ≥ α, n serves no rack-local or remote tasks, and the excess local tasks of n are served rack-locally within K(n) and remotely}.

Lemma 3 states that for any arrival vector, there exists an ideal load decomposition under which all servers are classified into these four types. The proof constructs the decomposition iteratively from {λ_{L,m,n}} given in Lemma 2.

VI. HEAVY-TRAFFIC OPTIMALITY


To establish the heavy-traffic optimality of the Weighted-Workload algorithm, we use the framework developed in [6]. The main steps include: 1. Establish state-space collapse of the system in the heavy-traffic limit; 2. Obtain a lower bound on the expected queue length as ε → 0; 3. Obtain a matching upper bound on the expected queue length as ε → 0. The matching bounds establish the delay-optimality of the algorithm in the heavy-traffic regime.

Fig. 4. The queue compositions of the four types of servers in the heavy-traffic regime with α : β : γ = 1 : 0.8 : 0.5. The workloads at the four types of servers maintain the ratio α : β : αγ/β : γ = 1 : 0.8 : 0.625 : 0.5.

However, the Lyapunov drift analysis developed in [6] cannot be applied directly to our algorithm due to the prioritized service and a more complicated state-space collapse. Figure 4 illustrates the one-dimensional state-space vector the system collapses to in the heavy-traffic regime. There are two key ideas. First, the prioritized service allows us to have a uniformly bounded helper subsystem in the heavy-traffic regime, which corresponds to the disappearance of the rack-local and local queues for H_u and that of the local queue for H_o in Figure 4. Second, the weighted-workload routing distributes the tasks local only to B_o in the ratio of α : β : γ in terms of server workload across B_o, H_o and H_u. More interestingly, the workload of servers in B_u and that in H_u have a ratio of α : β. This is because servers in B_u are only helped by rack-local service from H_u in the ideal load decomposition, and the weighted-workload routing of tasks local only to B_u maintains the α : β ratio even if the workload of H_u consists only of remote tasks from B_o in the heavy-traffic regime.
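The ratio in Fig. 4 can be read off from the routing rule: in the heavy-traffic limit the weighted workloads compared by the router equalize (a sketch of the algebra, using the subsystem labels above):

```latex
\frac{W_{B_o}}{\alpha} = \frac{W_{H_o}}{\beta} = \frac{W_{H_u}}{\gamma},
\qquad
\frac{W_{B_u}}{\alpha} = \frac{W_{H_u}}{\beta}
\;\Longrightarrow\;
W_{B_o} : W_{H_o} : W_{B_u} : W_{H_u} \;=\; \alpha : \beta : \frac{\alpha\gamma}{\beta} : \gamma .
```

With α : β : γ = 1 : 0.8 : 0.5 this gives 1 : 0.8 : 0.625 : 0.5, the ratio in Fig. 4.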
Traffic distributions. The traffic distribution λ = (λ_L : L ∈ 𝓛) on the system can be classified into two categories: the set of overloaded racks K_o = ∅, or K_o ≠ ∅. In the first case, each rack can accommodate its load, and the system in the heavy-traffic regime decomposes into independent racks, each of which has two levels of locality. We focus on the second case in this paper, which is the more challenging of the two, and defer the proof for the first case to [18].
A. Formal Statement of Results
For simplicity, we formally state the main theorems for K_o ≠ ∅ and K_u ≠ ∅. The proof techniques are similar for K_u = ∅. Due to space constraints, the complete proofs can be found in [18], and we provide the outline of the proofs in Section VI-B.
Let H = H_u ∪ H_o and B = B_u ∪ B_o. Let the local traffic on H and B be

λ_H = ∑ {λ_L : L ∈ 𝓛, ∃m ∈ H s.t. m ∈ L}  and  λ_B = ∑ {λ_L : L ∈ 𝓛_B},

where 𝓛_B is the set of task types local only to servers in B. We define the heavy-traffic regime to be

λ_B = α|B| + β ∑_{m∈H_o} (1 − λ̃_m/α) + γ ∑_{m∈H_u} (1 − λ̃_m/α) − ε,   (7)

where ε > 0 characterizes the distance of the arrival rate vector from the capacity boundary. We make the further assumption that the rates {λ_L : L ∈ 𝓛(H)} are independent of ε. That is, the total local load for helpers is fixed. This assumption can be removed with more care.
Consider the arrival processes {A_L^(ε)(t)} with arrival rate vector λ^(ε) satisfying the above conditions. Note that the variance of {A_L^(ε)(t)} is independent of ε. We denote by (σ^(ε))² the variance of the number of arrivals that are only local to beneficiaries in overloaded racks, i.e., Var(∑_{L∈𝓛_{B_o}} A_L^(ε)(t)) = (σ^(ε))², which converges to σ² as ε → 0. Let {Z^(ε)(t) = (Q^(ε)(t), f^(ε)(t)), t ≥ 0} be the system state under the proposed algorithm when the arrival rate is λ^(ε). Since λ^(ε) ∈ Λ, the Markov chain Z^(ε)(t) is positive recurrent and has a steady state distribution. All theorems in this section concern the steady state process.
Theorem 4: (Helper queues)

lim_{ε→0} ε E[∑_{m∈H_u} (Q_m^{l,(ε)} + Q_m^{k,(ε)})] = 0,  lim_{ε→0} ε E[∑_{m∈H_o} Q_m^{l,(ε)}] = 0.

Theorem 4 states that the helper subsystem is uniformly bounded and independent of ε: as the arrival rate approaches the capacity boundary, i.e., ε → 0, the steady state mean lengths of these helper sub-queues stay bounded. In order to characterize the scaling order of the mean total queue length, by Theorem 4, we only need to consider

q_Σ = ∑_{m∈B} (Q_m^l + Q_m^k + Q_m^r) + ∑_{m∈H_o} (Q_m^k + Q_m^r) + ∑_{m∈H_u} Q_m^r.

Define c = (c_m : m ∈ M) as the vector with unit ℓ2 norm in the direction of the workload ratio of Fig. 4, i.e., c_m ∝ α for m ∈ B_o, c_m ∝ β for m ∈ H_o, c_m ∝ αγ/β for m ∈ B_u, and c_m ∝ γ for m ∈ H_u. The parallel and perpendicular components of the steady-state weighted queue-length vector W with respect to c are

W_∥ = ⟨c, W⟩ c,  W_⊥ = W − W_∥.
Theorem 5: (State space collapse) There exists a sequence of finite numbers {C_r : r ∈ N} such that for each positive integer r,

E[‖W_⊥^(ε)‖^r] ≤ C_r,

that is, the deviation of W from the direction c is bounded by constants independent of the heavy-traffic parameter ε.
Define the service process

Φ^(ε)(t) = ∑_{m∈B} S_m(t) + ∑_{m∈H_o} Ŝ_m(t) + ∑_{m∈H_u} S̃_m(t),   (8)

where {S_m(t)}, {Ŝ_m(t)} and {S̃_m(t)} are independent and each process is i.i.d. For all m ∈ B, S_m(t) ∼ Bern(α). For all m ∈ H_o, Ŝ_m(t) ∼ Bern(β(1 − ρ_m)), where ρ_m is the proportion of time helper m spends on local tasks in steady state. For all m ∈ H_u, S̃_m(t) ∼ Bern(γ(1 − ρ̃_m)), where ρ̃_m is the proportion of time helper m spends on local and rack-local tasks in steady state. We denote Var(Φ^(ε)(t)) by (ν^(ε))², which converges to a constant ν² as ε → 0.
Theorem 6: (Lower Bound)

E[q_Σ^(ε)] ≥ ((σ^(ε))² + (ν^(ε))² + ε²)/(2ε) − ε/2.

Therefore, in the heavy-traffic limit as ε → 0,

lim inf_{ε→0} ε E[q_Σ^(ε)] ≥ (σ² + ν²)/2.   (9)

Theorem 7: (Upper bound)

E[q_Σ^(ε)] ≤ ((σ^(ε))² + (ν^(ε))²)/(2ε) + B^(ε),

where B^(ε) = o(1/ε), i.e., lim_{ε→0} ε B^(ε) = 0. Therefore, in the heavy-traffic limit, we have

lim sup_{ε→0} ε E[q_Σ^(ε)] ≤ (σ² + ν²)/2,

which coincides with the lower bound (9).

B. Outline of Proofs
(Theorem 4.) We first show that in steady state, the expected local load on any helper is upper bounded by a constant ρ < 1 which is independent of ε. As shown in [17], with upper-bounded local load and priority scheduling for local tasks, the expected local queue length is bounded and independent of ε. Therefore the local sub-queue lengths of H_u and H_o are bounded and independent of ε. Under the ideal load decomposition, all tasks of types local only to helpers are served locally by H in order to achieve maximum remote capacity for overloaded racks. We can show that the number of tasks in H_u that are served rack-locally or remotely vanishes as ε → 0. Hence we can also show the uniform boundedness of the rack-local sub-queue lengths of H_u.
(Theorem 5.) We consider the Lyapunov function V_⊥(Z) = ‖W_⊥‖. We can show that the drift of V_⊥(Z) is always finite and becomes negative for sufficiently large ‖W_⊥‖. According to the extended version of Lemma 1 in [6], all moments of V_⊥(Z) exist and are finite. The main challenge is to show that the ideal load decomposition {λ_{L,m,n}} given in Lemma 3 satisfies the following: for each task type L ∈ 𝓛_B, λ_{L,m,n} ≥ δ for every local server n ∈ L and for servers m that are local to L, rack-local in H_o, or remote in H_u, where δ is a positive constant independent of ε. That is, each task type only local to B receives service from all of its local servers, rack-local servers in H_o and remote servers in H_u. A crucial step to bound the drift of V_⊥(Z) is to use the ideal load decomposition as an intermediary.
(Theorem 6.) In order to obtain a lower bound on E[q_Σ^(ε)], we construct a single server system q^(ε)(t) with an arrival process {∑_{L∈𝓛_{B_o}} A_L^(ε)(t), t ≥ 0} and a service process {Φ^(ε)(t), t ≥ 0}, as defined in Eq. (8). The definition of ρ_m and ρ̃_m is such that ∑_{m∈B} E[S_m(t)], ∑_{m∈H_o} E[Ŝ_m(t)] and ∑_{m∈H_u} E[S̃_m(t)] are the maximum amounts of local, rack-local and remote service that can be provided for these arrivals. Then in steady state, q^(ε)(t) is stochastically smaller than q_Σ^(ε)(t). Using Lemma 4 in [6], we can obtain a lower bound on E[q^(ε)(t)].
(Theorem 7.) We obtain an upper bound on E[q_Σ^(ε)] by bounding E[⟨c, Q̄⟩], where Q̄ = (Q̄_1, Q̄_2, …, Q̄_M) and

Q̄_m = Q_m^l/α + Q_m^k/β + Q_m^r/γ.

The corresponding queueing dynamics are given by

Q̄(t + 1) = Q̄(t) + Ā(t) − S̄(t) + Ū(t),

where Ā, S̄ and Ū are defined similarly to Q̄. We consider the Lyapunov function V_∥(Z) = ‖Q̄_∥‖², where Q̄_∥ is the parallel component of the vector Q̄ with respect to the direction c. Note that the drift of V_∥(Z) is zero in steady state. However, since the service rate of each server varies with the task type and depends on the status of its three sub-queues, the terms related to service in the drift of V_∥(Z) cannot be bounded directly. In addition, the arrivals of some task types, such as those local to both helpers and beneficiaries, also depend on Q̄, which makes the corresponding terms difficult to bound. To solve the problem, we construct a series of ideal arrival and service processes, which allows us to rewrite the dynamics of Q̄ and bound the terms using Lemma 8 in [6].
VII. EVALUATION
We compare the performance of the Weighted-Workload algorithm with the JSQ-MaxWeight algorithm and the priority algorithm [17] via simulation. We consider a continuous-time system of 10 racks, where each rack consists of 50 servers. Tasks arrive at the system according to a Poisson process. The service rates for local, rack-local and remote tasks are α = 1, β = 0.9 and γ = 0.5, respectively. So the mean slowdown of remote tasks is 2, which is consistent with the measurements in [19]. We consider an exponential service time distribution for each task.
The task type is designated at arrival. For each task, a set of three servers is chosen to be its local servers according to the distribution of requested data in the system. We consider two cases:
1. Distribution-1. All the data requested by the incoming traffic are distributed uniformly over a subset of servers, which co-locate in a subset of racks. This simulates the special scenario where the JSQ-MaxWeight algorithm achieves heavy-traffic optimality. Here we report the results for 5 racks, i.e., 250 servers: the set of three local servers for each task is sampled uniformly at random from all servers of the 5 racks.
2. Distribution-2. At each task arrival, with probability p_1 the task samples a set of three servers uniformly at random from a subset of M_1 servers in the first rack; with probability p_2, it samples uniformly from a subset of M_2 servers in the second rack; with probability 1 − p_1 − p_2, it samples from all other M − M_1 − M_2 servers. We choose p_1 = 0.2, M_1 = 10, p_2 = 0.06, M_2 = 25. This simulates the traffic with four types of servers when the mean arrival rate is large. In particular, the first rack becomes overloaded, with the M_1 servers as B_o and the other 50 − M_1 servers as H_o; the M_2 servers in the second rack become B_u; all other servers in the system become H_u.
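For reference, distribution-2 can be sampled as follows (an illustrative sketch of the simulation input, not the authors' code):

```python
import random

M, RACK_SIZE = 500, 50      # 10 racks of 50 servers
P1, M1 = 0.2, 10            # hot subset in the first rack (becomes B_o)
P2, M2 = 0.06, 25           # hot subset in the second rack (becomes B_u)

def sample_task_type():
    """Sample the three local servers of an arriving task (distribution-2)."""
    u = random.random()
    if u < P1:
        pool = list(range(M1))                          # first rack's hot servers
    elif u < P1 + P2:
        pool = list(range(RACK_SIZE, RACK_SIZE + M2))   # second rack's hot servers
    else:
        pool = [m for m in range(M)
                if m >= M1 and not RACK_SIZE <= m < RACK_SIZE + M2]
    return tuple(sorted(random.sample(pool, 3)))
```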
Figure 5 compares the stability regions of JSQ-MaxWeight, the Weighted-Workload algorithm and the priority algorithm. The x-axis shows the mean arrival rate per server, and the y-axis shows the mean completion time for all tasks. A drastic increase in completion time indicates that an algorithm is close to its critical load. For distribution-2, we can compute the boundary of the capacity region, λ < 0.9027. Observe that both the proposed Weighted-Workload algorithm and JSQ-MaxWeight are stable for λ < 0.9027, hence are throughput-optimal. However, the simple priority algorithm becomes unstable at λ ≈ 0.83. This shows that maximizing the number of tasks served locally can lead to instability at a much lower load than the full capacity.

Fig. 5. Capacity regions with distribution-2.

Fig. 6. Mean task completion time: (a) distribution-1; (b) distribution-2.

Figure 6 compares the delay performance of JSQ-MaxWeight and the Weighted-Workload algorithm at high load. With distribution-1, both algorithms achieve heavy-traffic optimality. Figure 6(a) shows that the Weighted-Workload algorithm has performance similar to JSQ-MaxWeight. With distribution-2, however, the Weighted-Workload algorithm achieves up to a 4-fold improvement over the JSQ-MaxWeight algorithm at high load. The Weighted-Workload algorithm is shown to be heavy-traffic optimal for all traffic scenarios. The significant improvement of the Weighted-Workload algorithm over JSQ-MaxWeight at high load in Fig. 6(b) shows that JSQ-MaxWeight is not heavy-traffic optimal for all traffic scenarios.
VIII. CONCLUSION
We considered a stochastic model with multi-level data locality for computing clusters. We studied an extension of the JSQ-MaxWeight algorithm to three locality levels. We have shown that JSQ-MaxWeight is throughput optimal but heavy-traffic optimal only for a special traffic scenario. We proposed an algorithm that uses weighted-workload routing and priority service. The proposed algorithm is shown to achieve throughput and heavy-traffic optimality for all traffic scenarios.

REFERENCES
[1] G. Ananthanarayanan, S. Agarwal, S. Kandula, A. Greenberg, I. Stoica,
D. Harlan, and E. Harris. Scarlett: Coping with Skewed Content Popularity in MapReduce Clusters. In Proceedings of the European
Conference on Computer Systems (EuroSys), 2011.
[2] G. Ananthanarayanan, A. Ghodsi, A. Wang, D. Borthakur, S. Kandula,
S. Shenker, and I. Stoica. PACMan: Coordinated Memory Caching
for Parallel Jobs. In Proceedings of Symposium on Networked Systems
Design and Implementation (NSDI). USENIX, 2012.
[3] Apache Hadoop, June 2011.
[4] S. L. Bell and R. J. Williams. Dynamic scheduling of a system with
two parallel servers in heavy traffic with resource pooling: asymptotic
optimality of a threshold policy. Annals of Applied Probability, 2001.
[5] S. L. Bell and R. J. Williams. Dynamic scheduling of a parallel server
system in heavy traffic with complete resource pooling: Asymptotic
optimality of a threshold policy. Electron. J. Probab., 10:no. 33, 1044–1115, 2005.
[6] A. Eryilmaz and R. Srikant. Asymptotically tight steady-state queue
length bounds implied by drift conditions. Queueing Syst. Theory Appl., 72(3-4):311–359, 2012.
[7] J. M. Harrison. Heavy traffic analysis of a system with parallel servers:
Asymptotic optimality of discrete review policies. Annals of Applied
Probability, 1998.
[8] J. M. Harrison and M. J. Lopez. Heavy traffic resource pooling in
parallel-server systems. Queueing Syst. Theory Appl., 33(4), Apr. 1999.
[9] C. He, Y. Lu, and D. Swanson. Matchmaking: A New MapReduce
Scheduling Technique. In Proceedings of the International Conference
on Cloud Computing Technology and Science (CloudCom). IEEE, 2011.
[10] S. Ibrahim, H. Jin, L. Lu, B. He, G. Antoniu, and S. Wu. Maestro:
Replica-aware Map Scheduling for MapReduce. In Proceedings of
the International Symposium on Cluster, Cloud and Grid Computing
(CCGrid). IEEE, 2012.
[11] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair Scheduling for Distributed Computing Clusters. In
Proceedings of the Symposium on Operating Systems Principles (SOSP).
ACM, 2009.
[12] J. Jin, J. Luo, A. Song, F. Dong, and R. Xiong. Bar: An Efficient Data
Locality Driven Task Scheduling Algorithm for Cloud Computing. In
Proceedings of the International Symposium on Cluster, Cloud and Grid
Computing (CCGrid). IEEE, 2011.
[13] A. Mandelbaum and A. L. Stolyar. Scheduling flexible servers with
convex delay costs: Heavy-traffic optimality of the generalized cμ-rule.
Operations Research, 52, 2004.
[14] M. Squillante, C. Xia, D. Yao, and L. Zhang. Threshold-based priority
policies for parallel-server systems with affinity scheduling. In Proc.
IEEE American Control Conf., 2001.
[15] A. L. Stolyar. Maxweight scheduling in a generalized switch: State
space collapse and workload minimization in heavy traffic. Annals of
Applied Probability, 14(1):1–53, 2004.
[16] W. Wang, K. Zhu, L. Ying, J. Tan, and L. Zhang. Map Task Scheduling in MapReduce with Data Locality: Throughput and Heavy-traffic
Optimality. In Proceedings of INFOCOM. IEEE, 2013.
[17] Q. Xie and Y. Lu. Priority Algorithm for Near-data Scheduling:
Throughput and Heavy-Traffic Optimality. In Proceedings of INFOCOM. IEEE, 2015.
[18] Q. Xie and Y. Lu. Scheduling with multi-level data locality: Throughput
and heavy-traffic optimality. Technical Report, 2015.
[19] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and
I. Stoica. Delay scheduling: A Simple Technique for Achieving Locality
and Fairness in Cluster Scheduling. In Proceedings of the European
Conference on Computer Systems (EuroSys), 2010.
