Kairos - Preemptive Data Center Scheduling Without Runtime Estimates

Kairos: Preemptive Data Center Scheduling
Without Runtime Estimates

Pamela Delgado1 , Diego Didona1 , Florin Dinu1 and Willy Zwaenepoel1
1
EPFL
Submission Type: Research
Abstract user-facing services. Scheduling such jobs while achiev-

ing low scheduling times, good job placement and high
The vast majority of data center schedulers use job run- resource utilization is a challenging task. The complex-
time estimates to improve the quality of their schedul- ity is exacerbated by the data-parallel nature of these jobs.
ing decisions. Knowledge about the runtimes allows the That is, a job is composed of multiple tasks and the job
schedulers, among other things, to achieve better load bal- completes only when all of its tasks complete.
ance and to avoid head-of-line blocking. Obtaining accu- Many state-of-the-art systems rely on estimates of the
rate runtime estimates is, however, far from trivial, and runtimes of tasks within a job1 to improve the quality of
erroneous estimates may lead to sub-optimal scheduling their scheduling decisions in the face of job heterogeneity
decisions. Techniques to mitigate the effect of inaccurate and data-parallelism [5, 19, 21, 22, 30, 31, 38]. Execu-
estimates have shown some success, but the fundamental tion times from prior runs [5] or a preliminary profiling
problem remains. phase [13] are often used for this purpose. The accuracy
This paper presents Kairos, a novel data center sched- of such estimates has a significant impact on the perfor-
uler that assumes no prior information on job runtimes. mance of these schedulers. For instance, queueing a 1-
Kairos introduces a distributed approximation of the Least second job behind a job that is estimated to take 1 second
Attained Service (LAS) scheduling policy. Kairos con- but in reality takes 3 seconds doubles the completion time
sists of a centralized scheduler and a per-node scheduler. of the former job. Similarly, scheduling at the same time
The per-node schedulers implement LAS for tasks on two jobs estimated to be of equal length may seem to pro-
their node, using preemption as necessary to avoid head- vide excellent load balance, but in fact significant load
of-line blocking. The centralized scheduler distributes imbalance occurs if one job turns out to be shorter and the
tasks among nodes in a manner that balances the load and other longer.
imposes on each node a workload in which LAS provides Limitations of estimates-based approaches. Unfortu-
favorable performance. nately, obtaining accurate job runtime estimates is far
We have implemented Kairos in YARN. We compare from being trivial. We show in Section 2 that a widely
its performance against YARN FIFO scheduler and Big- employed estimation technique – using the mean task ex-
C, an open source state-of-the-art YARN-based scheduler ecution time as a predictor of the execution of all tasks
that also uses preemption. We show that Kairos reduces in a job [11, 31] – can lead to large errors ( > 100%).
the median job completion time by 37% (resp. 73%) and Our findings are confirmed by recent work that shows that
the 99-th percentile by 57% (resp. 30%), with respect to more sophisticated approaches based on machine learn-
Big-C and FIFO. We evaluate Kairos at scale by imple- ing [30] still exhibit significant estimation errors. Many
menting it in the Eagle simulator and comparing its per- factors contribute to the difficulty of obtaining reliable
formance against Eagle. Kairos improves the 99th per- runtime estimates. For example, small changes in the in-
centile of short job completion times by up to 55% and put data of a recurring job may substantially change its ex-
85% the for the Google and Yahoo traces respectively. ecution time [1], thus compromising the accuracy of esti-
mates based on previous executions. Data skew may lead
a few tasks in a job to take considerably more time to com-
1 Introduction plete than other tasks in the same job [9, 27]. Techniques
to tackle these issues, such as queue re-balancing [31]
Modern data centers face increasingly heterogeneous or uncertainty-aware scheduling policies [30] have shown
workloads composed of long batch jobs, e.g., data ana-
lytics, and latency-sensitive short jobs, e.g., operations of 1 We refer to such estimates as “job runtime estimates”.
1
some success in mitigating the impact of misestimations, scheduling discipline.
but they do not fundamentally address the problem. 3) We implement this distributed LAS in YARN, and
Kairos. In this paper, we propose an alternative approach compare its performance to state-of-the-art alternatives by
to data center scheduling, which does not use job runtime measurement and simulation.
estimates. Our approach draws from the Least Attained Roadmap. The outline of the rest of this paper is as fol-
Service (LAS) scheduling policy [28]. LAS is a preemp- lows. Section 2 provides the necessary background. Sec-
tive scheduling technique that executes first the task that tion 3 describes the design of Kairos. Section 4 describes
has received the smallest amount of service so far. LAS its implementation in YARN. Section 5 evaluates the per-
is known to achieve good job completion times when the formance of the Kairos YARN implementation. Section 6
distribution of job runtimes has a high variance, as is the provides simulation results. Section 7 discusses related
case in the often heavy-tailed data center workloads. work. Section 8 concludes the paper.
The main challenge is to find a good approximation for
LAS in a data center environment. A naive implemen-
tation would cause frequent task migrations, with its at- 2 Background
tendant performance penalties. Task migration would be
needed to allow a preempted task to resume its execution 2.1 Misestimations
on any worker node with available resources.
Instead, we have developed a two-level scheduler that Estimates in existing systems. Earlier schedulers like
avoids task migrations altogether, but still offers good Sparrow [29] do not rely on prior information about job
performance. In particular, Kairos consists of a central- runtimes, but in heavy-tailed workloads latency-sensitive
ized scheduler and per-node schedulers. The per-node short tasks often experience long queueing delays due to
schedulers implement LAS for tasks on their node, using head-of-line blocking [12].
preemption as necessary to avoid head-of-line blocking. Most state-of-the-art data center schedulers rely on
The centralized scheduler distributes tasks among worker job runtime estimates to make informed scheduling de-
nodes in a manner that balances the load. cisions [5, 11, 12, 13, 19, 20, 21, 22, 26, 41]. Estimates
A first challenge in this design is how to ensure high are used to avoid head-of-line blocking and resource con-
resource utilization in absence of runtime estimates. To tention, provide load balancing and fairness, and meet
address this issue, the central scheduler aims to equal- deadlines. The accuracy of job runtime estimates is there-
ize the number of tasks per node, and reduces the amount fore of paramount importance. Estimates of the runtime of
of load imbalance possible among nodes by limiting the a task within a job can be obtained from past executions
maximum number of tasks assigned to a worker core. of the same task, if any, or at past executions of similar
A second challenge is how to ensure that the distributed tasks [5], or by means of on-line profiling [13]. A com-
approximation of LAS preserves the performance bene- mon estimation technique for the task duration is to take
fits of the original formulation of LAS. Kairos addresses the average of the task durations over previous executions
this issue by means of a novel task-to-node dispatching of the job [12, 31]. More sophisticated techniques rely on
approach. In this approach, the central scheduler assigns machine learning [30].
tasks to nodes in such a way to induce high variance in the Challenges in obtaining accurate estimates. Unfortu-
distribution of the runtime of tasks assigned to each node. nately, obtaining accurate and reliable estimates is a far
We have implemented Kairos in YARN. We compare from easy task. Many factors contribute to the difficulty
its performance against the YARN FIFO scheduler and in obtaining reliable estimates. The scheduler may have
Big-C [6], a state-of-the-art YARN-based scheduler that limited or no information to produce estimates for new
also uses preemption. We show that Kairos reduces the jobs, i.e., jobs that have never been submitted before [31].
median job completion time by 37%, resp. 73%, and the Even if jobs are recurring, evidence indicates that changes
99-th percentile by 57%, resp., 30%, with respect to Big-C in the input data set may lead to significant and hard-to-
and FIFO. We evaluate Kairos at scale by implementing predict shifts in the runtime for a job [1]. Changes in
it in the Eagle scheduler [11] and comparing its perfor- the data placement may cause the job execution time to
mance against Eagle using traces from Google and Yahoo. change. Skew in the input data distribution can lead tasks
Kairos improves the completion times of latency-sensitive in the same job to have radically different runtimes [9, 27].
jobs by up to 55% and 85% respectively. Finally, failures and transient resource utilization spikes
may lead to stragglers [2], which not only have an un-
Contributions. We make the following contributions:
predictable duration, but represent outliers in the data-set
1) We demonstrate good data center scheduling perfor- used to predict future runtimes for the same job.
mance without using task runtime estimates. We provide an example of the estimation errors that can
2) We present an efficient distributed version of the LAS affect job scheduling decisions by analyzing public traces
2
0.08 1.1
Yahoo 1
Google 0.9
0.06 Cloudera 0.8
Facebook 0.7
PDF
CDF
0.6
0.04 0.5
0.4 Yahoo
0.02 0.3 Google
0.2 Cloudera
0.1 Facebook
0 0
<-100 -75 -50 -25 -1 1 25 50 75 >100 0 25 50 75 100 125 150
Relative error (%) Absolute relative error (%)
(a) PDF of relative prediction error. (b) CDF of the absolute relative prediction error.
Figure 1: Prediction error when estimating the duration of each task in a job as the mean task duration in that job.
(a) The PDF of the relative error (the ’<100’ datapoint includes all under-estimations of more than 100%, and the
’>100’ data point all over-estimations larger than 100%. (b) The CDF of the absolute relative error (the very tail of
this distribution is not shown for the sake of readability).
that are widely used to evaluate data center schedulers. 2.2 Least-Attained-Service
In particular, we consider the Cloudera [7],Yahoo [8],
Google [32] and Facebook [7] traces. We study the distri- Prioritizing short jobs. Typical data centers workloads
bution of the error incurred when using the mean execu- are a mix of long and short jobs [8, 7, 32]. Giving higher
tion time of tasks in a job as an indicator of the execution priority to short jobs improves their response times by re-
time of a task in that job. ducing head-of-line-blocking. The Shortest Remaining
Processing Time (SRPT) scheduling policy [35] priori-
Let J be a job in the trace and T the set of tasks t1 , · · ·, tizes short tasks by executing pending tasks in increasing
tn , in the job, each with an associated execution time ti .t. order of expected runtime and by preempting a task if a
Let TJ be the mean execution time of tasks in J. Then, shorter task arrives. SRPT is provably optimal with re-
we compute the relative prediction error for a task as E = spect to mean response time [34].
100 × (ti .t − T )/T , and the absolute relative prediction
Recent systems have successfully adopted SRPT in the
error as |E|. We show the PDF of E and the CDF of |E| in
context of data center scheduling [11, 24, 31]. These sys-
Figure 1. While up to 50% of the predictions are accurate
tems do not support preemption, so they implement a vari-
to within 10%, some prediction errors can be higher than
ant of SRPT, where the shortest task executes but, once
100%.
started, a task runs to completion.
Similar degrees of misestimation have also been re-
Least Attained Service (LAS). SRPT requires task run-
ported in recent work that uses a machine learning ap-
time estimates to determine which task should be exe-
proach to predicting task resource demands [30].
cuted. LAS is a scheduling policy akin to SRPT, but it
Coping with misestimations and Kairos approach. does not rely on a priori estimates [28]. LAS instead uses
Previous work has shown that job runtime misestima- the service time already received by the task as an indica-
tion leads to worse job completion times [11], and fail- tion of the remaining runtime of the task.
ure to meet service level objectives [13, 38] or job com- Given a set of tasks to run, LAS schedules for execu-
pletion deadlines [38]. Some systems deal with misesti- tion the one with the lowest attained service, i.e., the one
mations by runtime correction mechanisms such as task that has executed for the smallest amount of time so far.
cloning [2] and queue re-balancing [31], or by taking us- We call such task the youngest one. In case there are
ing a distribution of estimates rather than a single value es- n youngest tasks, all of them are assigned an equal 1/n
timates [30]. These solutions mitigate the effects of mis- share of processing time, i.e., they run according to the
estimations, but they do not avoid the problem entirely, Processor Sharing (PS) scheduling policy (as in typical
and increase the complexity of the system. multiprogramming operating systems). LAS makes use
Kairos overcomes the limitations of scheduling based of preemption to allow the youngest task to execute at any
on runtime estimates by adapting the LAS scheduling pol- moment.
icy [28] to a data center environment. LAS does not re- Rationale. LAS uses the attained service as an indication
quire a priori information about task runtimes and is well of the remaining service demand of a task. The rationale
suited to workloads with high variance in runtimes, as is behind the effectiveness of this service demand prediction
the case in the often heavy-tailed data center workloads. policy lies in the heavy-tailed service demand distribution
3
that is prevalent among production workloads. That is, if
a task has executed for a long amount of time, it is likely
that it is a large task, and hence it still has much to execute
before completion. Hence, it is better to execute younger
tasks, as they are more likely to be short tasks.
In addition, if the youngest task in the queue has an
attained service T , a new incoming task is going to be
the youngest one until it has received a service of T (if
no other task arrives in the meantime). Hence, if the task
is a short one –which is likely under the assumption of
heavy-tailed runtime distribution– then it is likely that the
task is going to complete within T , thus experiencing no
queueing at all.
Figure 2: Kairos architecture. Nodes schedulers imple-

3 Kairos ment LAS locally (Section 3.2) and periodically send in-
formation to the central scheduler, such as the number of
3.1 Design overview tasks and the variance of their attained service times. The
Challenges of LAS in a data center. LAS is an appealing central scheduler uses this information to assign tasks to
starting point to design a data center scheduler that does worker nodes in such a way to achieve load balancing
not require a priori job runtime estimates. In a strict im- (Section 3.3.2) and to maximize the effectiveness of LAS
plementation of LAS, however, the youngest task should within each worker node (Section 3.3.3).
be running at any moment in time. Then, adapting LAS
to the data center scenario with a distributed set of worker
ning tasks, the start time of their current quantum. Each
nodes requires that a preempted task must be able to re-
node scheduler implements LAS taking as input the num-
sume its execution on any worker node.
ber of available cores N and the quantum of time W to
Allowing task migration across worker nodes incurs allow the interleaved execution of tasks with identical at-
costs such as transferring input data or intermediate out- tained service times (as described in Section 2.2).
put of the task, and setting up the environment in which
When a new task arrives, it is immediately executed.
the task runs (e.g., a container). Determining whether or
If there is at least one core available, the task is assigned
not to migrate a task is a challenging problem, especially
to that core (Line 8). Else, the task preempts the running
in the absence of an estimate of the remaining runtime of
task with the highest attained service time (Line 11). This
the task. Therefore, Kairos does not strictly follow LAS,
task is moved to the node queue, and its attained service
but rather implements an approximation thereof.
time is increased by the service time that it has received.
Kairos approach to LAS. Kairos implements LAS in an When a task terminates, if the node queue is not empty,
approximate fashion. It uses a two-level scheduling hier- the task with the lowest attained service is scheduled for
archy consisting of a central scheduler and a node sched- execution (Line 21).
uler on each worker node. We depict the high-level ar- When a task t is assigned to a core, a timer is set to
chitecture of Kairos in Figure 2. The node schedulers expire after W seconds (Line 17). If t has not completed
implement LAS locally on each worker node (see Sec- by the time the timer fires (Line 27), the scheduler in-
tion 3.2). The central scheduler assigns tasks to worker creases the attained service time of the task by W . Let
nodes in such a way to achieve load balancing and to max- T be the updated value of the attained service time t. If
imize the effectiveness of LAS within each worker node there is a task t0 in the node queue with attained service
(see Section 3.3). time lower than T , t0 is scheduled for execution by pre-
empting t (Line 31). Otherwise, t continues its execution,
and the corresponding timer is reset (Line 39).
3.2 Node scheduler
Periodically, the node scheduler communicates to the
Each worker node has N cores, which can run N con- central scheduler the number of tasks currently assigned
current tasks, and a queue, in which preempted tasks are to it, and the variance in the service times already attained
placed. Algorithm 1 presents the data structures main- by such tasks (Line 42). The latter information is used by
tained by the node schedulers and the operations they per- the central scheduler in deciding where to send a task, as
form. A TaskEntry structure maintains for each task explained in Section 3.3.3.
information such as its attained service time and, for run- The node scheduler implements an anti-starvation
4
Algorithm 1 Node scheduler Algorithm 2 Central scheduler
1: Set<TaskEntry> IdleTasks, RunningTasks . Track suspended/running tasks 1: Queue CentralQueue . Queue where incoming tasks are placed
2: Node[numNodes] Nodes . Entries track # tasks and attained service times
2: upon event Task t arrives do
3: TaskEntry te 3: upon event New job J arrives do
4: te.task ← t 4: for task t ∈ J do
5: te.attained ← 0 5: Queue.push(t)
6: te.start ← now()
7: RunningTasks.add(te) 6: upon event Heartbeat HB from Node i arrives do
8: if (idleCores.size() < N ) then . Free core can execute t 7: Nodes[i].var ← HB.var
9: core c = idleCores.pop() 8: Nodes[i].numTasks ← HB.numTasks
10: else . Preempt oldest running task
11: tp ← argmax{tt.attained} {tt ∈ RunningT asks} 9: procedure MAINLOOP
12: tp .attained+ = now() − tp .start 10: while (true) do
13: c ← core serving t 11: for i = 0, . . . , N + Q do
14: remove tp from c 12: Si ← {Node m ∈ N odes : m.numT asks = i}
15: IdleTasks.add(tp ) 13: while (!Si .isEmpty() ∧!CentralQueue.isEmpty()) do
16: assign t to c 14: Node m ← argminn.var {n ∈ Si }
17: c.startTimer(W) 15: Task t ← CentralQueue.pop()
18: start t 16: Assign t to m
17: Si ← Si \ {m}
19: upon event Task t finishes on core c do 18: Sleep(∆)
20: RunningTasks.remove(t)
21: if (!IdleTasks.isEmpty() then . Run youngest suspended task
22: TaskEntry tr ← argmin{ti .attained} {ti ∈ IdleT asks}
23: RunningTasks.add(tr )
24: assign tr to c able situation for a short task that has been preempted to
25: tr .start ← now() make room for a new incoming task. A low value for W ,
26: c.startTimer(W)
27: Start tr .task instead, gives a task frequent opportunities to execute and
28: else hence potentially complete. However, it may also lead to
29: IdleCores.push(c)
long completion times, because task completion time may
30: upon event Timer fires on core c running task t do be delayed by frequent interleaving. We study the sensi-
31: TaskEntry ts ←TaskEntry e : e.task = t tivity of Kairos to the setting of W in Section 5.3.3, where
32: ts.attained+ = now() − ts.start
. Find youngest suspended task we show that Kairos is relatively robust to sub-optimal
33: TaskEntry tm ← argmin{ti.attained} {ti ∈ IdleT asks} settings of W .
34: if (tm.attained ≤ ts.attained) then . Preempt t
35: IdleTasks.remove(tm)
36: IdleTasks.add(ts)
37: RunningTasks.remove(ts) 3.3 Central scheduler
38: RunninTasks.add(tm)
39: tm.start ← now()
40: place tm.task on c Algorithm 2 presents the data structures maintained by the
41: Start tm.task central scheduler and its operations.
42: else . Continue running t
43: ts.start ← now()
44: c.startTimer(W)
3.3.1 Challenges in the absence of estimates.
45: upon event Every ∆ do
46: Heartbeat HB The lack of a priori job runtime estimates makes it cum-
47: HB.num ← IdleTasks.size() + RunningTasks.size()
48: HB.var ← var{t.attained} {t ∈ IdleT asks ∪ RunningT asks} bersome to achieve load balancing. Existing approaches
49: send HB to the central scheduler use job runtime estimates to place a task on the worker
node that is expected to minimize the waiting time of the
task [5, 31]. This strategy improves task completion times
and achieve high resource utilization by equalizing the
mechanism to avoid that long jobs can be preempted in-
load on the worker nodes. Kairos cannot re-use such ex-
definitely (not shown in the pseudocode). Each task is
isting techniques in a straightforward fashion, because it
associated with a counter that tracks how many times the
cannot accurately estimate the backlog on a worker node
task has been preempted. If a task is preempted more than
and the additional load posed by a task being scheduled.
a given number of times, then it has the right to run for
To circumvent this problem, Kairos decouples the prob-
a quantum of time, during which it cannot be preempted.
lems of achieving load balance and high resource utiliza-
This mechanism ensures the progress of every task.
tion from the problem of achieving low completion times.
Impact and setting of W . The value of W determines Kairos leverages the insight that short completion times
the trade-off between task waiting times and completion are already achieved by implementing LAS in the individ-
times. A high value for W allows the shortest tasks to ual node schedulers. In fact, LAS gives shorter tasks the
complete within a single execution window. However, it possibility to completely or partially bypass the queues on
may also lead a preempted task in the node queue to wait the worker nodes. This means that the central scheduler
for a long time before being it can run again, an undesir- can be to some extent agnostic of the actual backlog on
5
worker nodes, because the backlog is not an indicator of 3.3.3 Maximizing LAS effectiveness.
the waiting time for a task.
Kairos implements a LAS-aware policy to break ties in
Hence, in Kairos, the central scheduler has two goals
cases in which two or more worker nodes have an equal
1) Enforcing that resources do not get wasted, i.e., there number of tasks assigned to them. In more detail, it as-
are no cores idle if there is any tasks in some queue (ei- signs the task to the worker node with the lowest variance
ther the eligible idle tasks in the central or in any worker in the attained service times of tasks currently placed on
queue). This leads to high resource utilization and implies the worker node, in the hope that by doing so it can signif-
balancing the load among worker nodes (Section 3.3.2). icantly increase the variance on that node. The rationale
behind this choice is that LAS is most effective when the
2) Maximizing LAS effectiveness, e.g., by improving the
task duration distribution has a high variance. Intuitively,
chances that short tasks bypass long tasks and task do not
if only short tasks were assigned to a node, the youngest
hurt each other response times by excessive interleaved
short tasks would preempt older short tasks, hurting their
executions (Section 3.3.3).
completion times. Similarly, if only long tasks were as-
signed to a node, all would run in an interleaved fashion,
each one hurting the completion time of the others.
3.3.2 Load Balancing The effectiveness of this policy is grounded in previ-
ous analysis of SRPT in distributed environments, that
The central scheduler aims to balance the load across
show that maximizing the heterogeneity of task runtimes
worker nodes by enforcing that each of them is assigned
on each worker node is key to improve task completion
an equal number of tasks. Hence, the first outstanding task
times [4, 14]. Unlike previous studies, however, Kairos
in the central queue is placed on the worker node with the
does not rely on an exact knowledge of the runtimes of the
smallest amount of assigned tasks.
tasks on each worker node, and uses the attained service
This policy alone, however, is not sufficient with heavy- times of tasks on a worker node to estimate the variability
tailed runtime distributions, as it may lead to temporary in task runtimes on that worker node.
load imbalance scenarios. For example, a worker node
may be assigned many short tasks while another worker
node is loaded with longer tasks. Then, the first worker 4 Kairos implementation
node might complete all its short tasks and becomes idle
while some tasks lie idle on the other worker node, wait- We implement Kairos as part of YARN [18], a widely
ing to receive service time. used scheduler for data parallel jobs. Figure 3 shows the
To address this issue, the central scheduler enforces that main building blocks of YARN, their interactions and the
each worker is assigned at most Q + N tasks at any mo- components introduced by Kairos.
ment in time. This admission control mechanism bounds YARN. YARN consists of a ResourceManager re-
the amount of load imbalance possible, since a worker siding on a master node, and a NodeManager re-
node can host at most Q idle tasks that could have been siding on each worker node. YARN runs a task on
assigned to other worker nodes with available resources. a worker node within a container, which specifies the
Kairos task-to-node dispatching policy achieves load node resources allocated to the task. Each worker
balancing and high resource utilization, and is cheap node also has a ContainerManager that manages
enough to lead to low-latency scheduling decisions. This the containers on the node. Finally, each job has an
allows Kairos to sustain high job arrival rates without in- ApplicationManager that runs on a worker node
curring long scheduling delays. and tracks the advancement of all tasks within the job.
Impact and setting of Q. The value of Q determines the The ResourceManager assigns tasks to worker
trade-off between load balance and effectiveness of LAS. nodes and communicates with the NodeManagers on
A small value of Q reduces the amount of possible load the worker nodes. A NodeManager communicates with
imbalance, but may lead many short tasks to sit in the cen- the ResourceManager by means of periodic heartbeat
tral queue, instead of being assigned to a worker node, ex- messages. These heartbeats contain information about the
ecute by preempting a previous task and potentially com- node’s health and the containers running on it.
plete quickly. A high value of Q, on the contrary, may lead Kairos central scheduler. The Kairos central schedul-
to higher load imbalance, but enables more parallelism. ing policy is implemented in the ResourceManager.
We assess the sensitivity of Kairos to the setting of Q In particular, the Kairos central scheduler extends the Ca-
in Section 5.3.3, where we show that Kairos performance pacityScheduler, to allow the possibility of a worker node
are not dramatically affected by sub-optimal settings of to be allocated more containers than available cores.
Q. Kairos node scheduler. The node scheduler of
6
Category input #maps #reduces extraFlops duration probability
1 small 4GB 15 15 0 85s 0.32
2 medium small 4GB 15 15 500 201s 0.31
3 medium 8GB 30 30 0 239s 0.31
4 medium long 30GB 112 60 500 308s 0.04
5 long 60GB 224 60 1000 1175s 0.02
Table 1: Job categories composing our workload. Job

runtimes follow a heavy-tailed distribution, as typical in
modern data centers.
publicly available, and to the FIFO YARN scheduler.

Background on Big-C. Big-C uses available runtime es-
timates to perform task placements, and preemption as a
mechanism to prioritize short tasks in case of high utiliza-
tion. Big-C extends the capacity scheduler of YARN. Jobs
Figure 3: Kairos integration in YARN. As a part of Re- are partitioned in classes, and each class is assigned a pri-
source Manager, Kairos Centralized Scheduler assigns re- ority, which determines the share of resources dedicated
sources to tasks by means of containers. On each worker to jobs in that class. When, according to its priority, a
node, the KairosNodeScheduler extends the Container- job should run and there is not enough resource available,
Manager to implement LAS, by suspending and resuming tasks belonging to lower priority jobs are preempted.
containers. Big-C defines two job classes, corresponding to long
and short jobs. A job is classified according to available
runtime estimates. A high share of the resources is as-
Kairos is implemented within the ContainerManager signed to short jobs, so as to prioritize them over long
by the KairosNodeScheduler component. The ones. Long tasks can opportunistically use more than their
KairosNodeScheduler consists of a thread that share of resources, if there is unused capacity. Such long
monitors the status of the containers and implements tasks can be preempted by newly incoming short tasks.
LAS. We configure Big-C to use its standard value for the
The KairosNodeScheduler maintains the at- share of resources for short jobs (95%) and we classify as
tained service time of the tasks running within the contain- short the jobs in category 1, 2 and 3 (which represent 94%
ers, and implements preemption. It preempts a container of the total jobs in our workload).
by reducing the resources allocated to it to a minimum,
and resumes it by restoring the original allocation, similar
to what is done in Chen et al. [6]. Reducing the resources
5.2 Test-bed
to a minimum (rather than to zero) allows the heartbeat Platform. We use a cluster composed of 30 nodes inter-
mechanism to continue to function correctly when a con- connected by an 10Gbps Ethernet network. Each node
tainer is preempted. The KairosNodeScheduler sets is equipped with 4 CPUs per node, having a total of 120
the timers necessary for implementing the processor shar- cores to schedule on. We use Hadoop-2.7.1 to have the
ing window W, and also extends the information sent by same code base as in Big-C. Containers use Docker-1.12.1
the NodeManager in the heartbeat messages, by includ- and their images were downloaded from the online repos-
ing the standard deviation of the attained service of all itory sequenceiq/hadoop-docker.
containers hosted by the node. We set Q = 4, corresponding to hosting on a node a
number of tasks twice as high as the number of available
cores, and W = 50s, which allows the shortest tasks to
5 Evaluation on a Real Cluster execute within one quantum of time. We provide a sen-
sitivity analysis to the setting of these parameters in Sec-
5.1 Methodology and Baselines tion 5.3.3.
We evaluate our Kairos prototype integrated in YARN by Workloads. We consider a workload composed of 100
means of experiments on a medium-sized cluster using WordCount jobs, taken randomly from five categories of
Hadoop jobs. We will release the source code of Kairos. jobs described in Table 1 given their respective probabil-
We compare Kairos to Big-C [6], the most recent ities. The job inter-arrival times follow a Poisson distri-
scheduling system based on YARN whose source code is bution with mean 60s. The workload is inspired by the
7
1 1
0.8 0.8
CDF
CDF
0.6 0.6
0.4 0.4
Kairos Random
0.2 Big-C 0.2 Sum
FIFO Var
0 0
0 500 1000 1500 2000 2500 3000 3500 0 500 1000 1500
Job completion times (s) Job completion times (s)
Figure 4: CDF of the job completion times in Kairos,

Big-C and stock FIFO from YARN. Kairos avoids high Figure 5: Comparison of alternative task-to-node dis-
waiting times of short tasks by means of LAS. In Big-C, patching policies in Kairos. Var achieves better perfor-
instead, the shortest tasks can wait for the completion of mance by aiming to increase the variance in the runtime
other longer tasks classified as short. distribution of tasks assigned to a worker node, which in-
creases the effectiveness of LAS.
typical heavy-tailed job runtime distribution that charac-

terizes production workloads. is destined to long jobs. Second, worker nodes in Kairos
We modified the Hadoop WordCount implementation accept Q more tasks than what they can process. This al-
by adding a floating point operation inside a double loop lows short tasks to be placed on a busy node and execute
in both the mapper and the reducer. The number of these thanks to LAS. Instead, in Big-C a short task ts cannot
extra flops is passed as a parameter to artificially make preempt another short task t0s , even if this t0s is actually
the tasks last longer. The resulting workload takes ap- longer than ts . Hence, ts has to wait for some node to
proximatively 2 hours to run in our setup. The input for have free resources before being scheduled for execution.
the WordCount consists of randomly generated 100-char Kairos is also more effective in achieving low comple-
strings. tion times for longer jobs, which is visible at right-end of
The number of mappers in each category its determined the CDF. Kairos reduces the 99-th percentile of job com-
by the input size divided by the block size in HDFS. The pletion times by 57% with respect to Big-C (1452 sec-
number of reducers is a parameter we chose according to onds vs 3368) and by 30% with respect to FIFO (1452
the best configuration for each category. The durations in seconds vs 2061). Kairos achieves better job completion
Table 1 correspond to the total makespan of a job when times at high percentiles by not restricting the share of
running alone in the cluster. resources for longer jobs, and by enhancing the effective-
The memory allocation configuration is 2GB for map ness of LAS by its task-to-node assignment policy. We
tasks and 4GB for reduce tasks. HDFS block size used is note that Big-C achieves worse tail latency than FIFO,
256MB for categories 1 to 4, and 1GB for category 5. The because long jobs can be delayed due to frequent preemp-
replication level is set to default 3. The container size is tions and the low priority enforced by Big-C.
set to <5120 MB,1 vCore>.
5.3.2 LAS-aware task dispatching
5.3 Experimental results Figure 5 reports the CDF of job completion times in
5.3.1 Job completion times. Kairos with different policies used to choose where to
place a task. These policies are implemented when there
Figure 4 reports the CDF of job completion times with are multiple worker nodes with the minimum number of
Kairos, Big-C and stock FIFO scheduling in YARN. tasks already assigned. The Sum policy assigns a task
Kairos achieves better job completion time than Big-C to the node whose tasks have the lowest cumulative at-
and FIFO at all percentiles. Short tasks in Kairos com- tained service time. The rationale is that by using attained
plete more quickly than in Big-C, which is visible by look- service time as an estimation of remaining runtime, the
ing at lower percentiles. For example, Kairos reduces the Sum policy tries to assign a task to the least loaded node.
50-th percentile of job completion times by 37% with re- Random assigns the task to one node at random. The
spect to Big-C (217 seconds vs 341) and by 73% with Var policy is the one implemented by Kairos (see Sec-
respect to FIFO (241 seconds vs 808). The reason for the tion 3.3.2).
performance improvement over Big-C is two-fold. First, The plots show that the Var policy is able to deliver
short tasks can execute on any worker node at any moment better job completion times at all percentiles. The biggest
in time in Kairos, while in Big-C a share of the resources gains over Random and Sum are at around the 30-th per-
8
1 1
0.8 0.8
CDF
CDF
0.6 0.6
0.4 Q=0 0.4 W=10
Q=4 W=50
0.2 Q=8 0.2 W=100
Big-C Big-C
0 0
0 500 1000 1500 0 500 1000 1500
Job completion times (s) Job completion times (s)
(a) Varying Q. (b) Varying W .
Figure 6: Sensitivity analysis to varying the maximum number of tasks on a node scheduler (Q) and the size of the
quantum of time in LAS (W ). Kairos is robust to sub-optimal settings of these parameters and achieves performance
gains over Big-C even without optimal tuning.
centile and towards the tail of the distribution. The ben- much interleaved. This penalizes the longest jobs, i.e., the
efit at the 30-th percentile indicates that the shortest jobs, very tail of the completion times distribution, but leads
which account for 30% of the total (see Table 1), are effec- to better values for lower percentiles. The dual holds for
tively prioritized. The benefit at higher percentiles show W = 100. The longest jobs have big quanta of times to
that Var is also able to effectively use LAS to improve use. This improves their completion times at the detriment
the response time of larger tasks as well. of shorter jobs.
5.3.3 Sensitivity Analysis

6 Large Scale Simulation Study
We now show that Kairos maintains performance that are
better than or comparable to Big-C even in case of sub- 6.1 Methodology and baseline
optimal setting of the parameters W and Q. To this end,
We now evaluate Kairos on large-scale data centers by
we study how the performance of Kairos vary with differ-
means of a simulation study using the popular Yahoo [8]
ent settings for W and Q. When studying the sensitivity of
and Google [32] traces. We compare Kairos to Eagle [11],
Kairos to the setting of one parameter, we keep the other
the most recent system whose design is implemented in a
one to its default value.
simulator. We integrate Kairos design in the Eagle simu-
Sensitivity to Q. Figure 6a shows the CDF of job re- lator and will make the source code of the simulator avail-
sponse times in Kairos with Q = 2, 4, 8 and in Big-C. able. We do not compare our prototype of Kairos with
Q = 8 and Q = 2 perform slightly worse than the de- Eagle because Eagle is built on top of Spark’s scheduler.
fault value Q = 4 we use in Kairos, but still deliver better We report average values of 10 runs for the Yahoo trace,
performance than Big-C at each percentile. and 5 runs for the Google trace.
The shape of the CDFs for different values of Q match
Background on Eagle. Eagle partitions the set of worker
our analysis of Section 3.3. If Q is too low, sometimes
nodes in a sub-cluster for long jobs and one for short jobs.
short tasks are kept in the central queue, thus preventing
Resources are assigned to the two sub-clusters proportion-
them to execute right away by means of LAS. This is vis-
ally to the expected load posed by short and long jobs.
ible for at the 20-th percentile of the CDF, where Q = 2
Hence, in the traces we consider, the majority of the re-
is worse than Q = 8. If Q is too high, instead, short tasks
sources is assigned to long jobs, as they consume the bulk
can more often preempt longer ones, increasing their re-
of the resources. Short tasks are allowed to opportunis-
sponse times and thus leading to worse tail latencies.
Sensitivity to W . Figure 6b shows the CDF of job re-
Trace Total # jobs % Long jobs % Task-Seconds long jobs
sponse times in Kairos with W = 10, 50, 100 and in Big- Yahoo [8] 24262 9.41 98
C. Similar to what is seen for Q, setting W too high or Google [32] 506460 10.00 83
too low negatively impacts the performance of Kairos, but
Kairos maintains its performance lead over Big-C. Table 2: Job heterogeneity in the traces. % Task-seconds
Comparing the performance achieved with the W = 10 long jobs is the sum of the execution times of all long tasks
and W = 100 we see that a too low value for W has divided by the sum of the execution times of all tasks.
the effect that the execution of tasks on a worker node is
9
2 2
50th perc. short jobs 50th perc. long jobs
1.8 90th perc. short jobs 1.8 90th perc. long jobs
Kairos normalized to Eagle

Kairos norm. to Eagle
1.4
avg. utilization for Eagle 1.4
avg. utilization for Eagle
avg. utilization for Kairos avg. utilization for Kairos
1.2 1.2
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
15000 17000 19000 21000 23000 15000 17000 19000 21000 23000
Number of nodes in the cluster Number of nodes in the cluster
(a) Kairos short jobs normalized to Eagle. Google trace. (b) Kairos long jobs normalized to Eagle. Google trace.
Figure 7: Kairos normalized to Eagle short (a) and long (b) jobs. Google trace.
2 2
50th perc. short jobs 50th perc. long jobs
Kairos normalized to Eagle

Kairos norm. to Eagle
1.4
avg. utilization for Eagle 1.4
avg. utilization for Eagle
avg. utilization for Kairos avg. utilization for Kairos
1.2 1.2
1 1
0.8 0.8
0.6 0.6
0.4 0.4
0.2 0.2
0 0
4000 5000 6000 7000 8000 4000 5000 6000 7000 8000
Number of nodes in the cluster Number of nodes in the cluster
(a) Kairos short jobs normalized to Eagle. Yahoo trace. (b) Kairos long jobs normalized to Eagle. Yahoo trace.
Figure 8: Kairos normalized to Eagle short (a) and long (b) jobs. Yahoo trace.
tically use idle nodes in the partition for long jobs. By with 15000-23000 worker nodes for the Google trace and
this workload partition technique, Eagle avoids head-of- 4000-8000 nodes for the Yahoo trace. We keep constant
line-blocking altogether. In addition, short jobs are exe- the job arrival rates to the values in the traces, so increas-
cuted according to a distributed approximation of SRPT ing the number of worker nodes reduces the load on the
that does not use preemption. That is, Eagle aims to ex- worker nodes. We set the network delay to 0.5 millisec-
ecute first the tasks of shorter jobs, but tasks cannot be onds, and we do not assign any cost to making scheduling
suspended once they start. Eagle uses task runtime esti- decisions.
mates to classify jobs as long or short, and to implement
Workloads. Table 2 shows the total number of jobs,
the SRPT policy.
the percentage of long jobs and the percentage of task-
We configure Eagle to use the same parameters as in
seconds for long jobs for the two traces. The percentage
its original implementation (which vary depending on the
of the execution times (task-seconds) of all short jobs is
target workload trace). These include sub-cluster sizes,
17% in the Google trace and 2% for Yahoo. These values
cutoffs to distinguish short jobs from long ones and pa-
determine the size of the partition for short jobs in Eagle.
rameters to implement SRPT.
Each simulated worker node has one core. Kairos uses
Q = 2 for both workloads, W = 100 time units for the
6.2 Simulated Test-bed Yahoo trace and W = 10000 time units for the Google
trace. The anti-starvation counter for Kairos is set to 3 for
Platform. We simulate different large-scale data centers both traces.
10
6.3 Experimental results 7.1 Scheduling policies
Figure 7 and Figure 8 report the 50th, 90th and 99th per- 7.1.1 Scheduling with runtime estimates
centiles of job completion times for Kairos normalized
Most of the state-of-the-art scheduling systems rely on
to the ones obtained by Eagle for the Yahoo and Google
runtime estimates to perform informed scheduling deci-
traces respectively. The plots on the left report short job
sions. These systems differ in how such estimates are in-
completion times; the plots on the right report long job
tegrated into the scheduling policy.
completion times. In addition, we report the average clus-
ter utilization for both Kairos and Eagle as a function of Apollo [5], Yaq [31] and Mercury [26] disseminate in-
the number of worker nodes in the cluster. formation about the expected backlog on worker nodes.
Tasks are scheduled to minimize expected queueing delay
Figure 7 (a) and Figure 8 (a) show that Kairos improves
and to equalize the load. Yaq also uses per-task runtime
the short job completions times significantly at high loads
estimates to implement queue reordering strategies, aimed
(up to 55% for Google and 85% for Yahoo). Kairos im-
at prioritizing short tasks.
provement are due to the fact that, when the load is very
Hawk [12], Eagle [11] and Big-C [6] use runtime esti-
high, short jobs in Eagle are confined in the portion of the
mates to classify jobs as long or short. In Eagle and Hawk,
cluster reserved from them. Hence, short jobs compete for
the set of worker nodes is partitioned in two subsets, sized
the same scarce resources. In Kairos, instead, short jobs
proportionally to the expected load in each class. Then,
can run on any node, and preempt long jobs to achieve
tasks of a job are sent to either of the two sub-clusters
small completion times. As the load decreases, the two
depending on their expected runtime. Big-C gives prior-
systems achieve increasingly similar performance.
ity to short jobs by assigning a higher priority to them
Long jobs exhibit different dynamics. Kairos reduces in the YARN capacity scheduler. Workload partitioning
the completion times of most of the long jobs with respect and short job prioritization aim to reduce [12, 6] or elimi-
to Eagle, when the load is at least 50%. This is visible nate [11] head-of-line-blocking.
looking at the 50-th and 90-th percentiles in Figure 7 (b) Tetrisched [38], Rayon [10], Firmament [19],
and Figure 8 (b). Kairos improves the completion times Quincy [25], Tetris [20], 3Sigma [30] and Medea [16]
of most long jobs because it interleaves their executions, formalize the scheduling decision as a combinatorial op-
leading to better completion times for the shortest among timization problem. The resulting Mixed-Integer Linear
the long jobs. In Eagle, instead, the absence of preemp- Program is solved either exactly, or an approximation is
tion may lead a relatively short task among the long ones computed by means of heuristics.
to wait for the whole execution of a longer task. As what
Jockey [15] uses a simulator to speculate on the evo-
is seen for the short jobs, the differences in performance
lution of the system and accordingly decides the task-to-
at the 50-th and 90-th percentiles level out as the load de-
node placement. Graphene [22] uses estimates to decide
creases.
first the placement of the job with the most complex re-
Kairos achieves a slightly worse 99-th percentile with quirements, and then packs other jobs depending on the
respect to Eagle (between 14% and 50% in Yahoo and be- remaining available resources. Carabyne [21] exploits
tween 11% and 33% for Google). This is due to the fact temporary relaxations of the fairness guarantees to allow
that Kairos frequently preempts the longest jobs to prior- a job to use resources destined to another one.
itize the shorter ones. This is an unavoidable, and we ar-
gue favorable, trade-off that Kairos makes to improve the As opposed to these systems, Kairos eschews the need
performance for the vast majority of the jobs, especially for any a priori information about job runtimes. In-
latency-sensitive-ones, without requiring a priori knowl- stead, Kairos infers the expected remaining runtime of
edge on job runtimes. tasks by looking at the the amount of time they have al-
ready executed, and uses preemption and a novel task-
Finally, Kairos and Eagle achieve the same resource
to-node assignment policy to avoid head-of-line blocking
utilization in both workloads and for all cluster sizes. This
and achieve high resource utilization.
result showcases the capability of Kairos of achieving the
Correction mechanisms. The systems that rely on task
same high resource utilization as approaches that rely on
prior knowledge on the job runtimes. runtime estimates also encompass several techniques to
cope with unavoidable misestimations.
Borg [39] and Mercury [26] kill low-priority jobs to
reallocate the resources they are using to higher-priority
7 Related Work jobs. In Hawk, if a node becomes idle, it steal tasks from
other nodes. Yaq [31] and Mercury [26] migrate tasks that
We compare Kairos to existing systems first focusing on have not started yet to re-balance the load. LATE [40],
scheduling policy, and then on scheduler architecture. Mantri [3], Dolly [2], Hopper [33] and DieHard [37] use
11
techniques like restarting or cloning tasks to cope with tion. Hence, latency-sensitive tasks may incur head-of-
stragglers due to misestimations or due to unexpected line blocking and suffer from high waiting times in case
worker nodes slowdowns or failures. Tetrisched [38], of high utilization. In contrast, Kairos uses preemption
3Sigma [30], Rayon [10] and Jockey [15] periodically re- to allow an incoming task to run as soon as it arrives
evaluate the scheduling plan and change it accordingly in on a worker node, offering short tasks th possibility of
case tasks take longer than expected to complete. completing with limited or no waiting time, even in high-
By contrast, Kairos uses preemption and limits the utilization scenarios.
amount of queue imbalance by means of admission con-
trol. Kairos can integrate speculative execution or queue
re-balancing techniques techniques at the cost of intro-
7.2 Scheduler architecture
ducing heuristics to detect stragglers (e.g., based on their Kairos can be classified as a centralized scheduler, be-
progress rate) and support for task migration (e.g., based cause all tasks are dispatched by a single component, al-
on checkpointing). though the worker nodes also perform local scheduling
Some systems like Rayon [10], 3Sigma [30] and Big- decisions. There is a recent trend towards distributed
C [6] make use of preemption to correct the scheduling schedulers, such as Omega [36], Sparrow [29], Apollo [5]
decision in case a new job arrives that must use resources and Yaq [31], or hybrid schedulers such as Mercury [26],
already allocated. The difference with the use of preemp- Hawk [12] and Eagle [11] to achieve low scheduling la-
tion in Kairos is twofold. First, Kairos uses preemption to tency under high job arrival rates.
avoid the need for runtime estimates, which makes Kairos Kairos can sustain high load and achieve low schedul-
suitable also for environments with highly variable run- ing latency despite being centralized, because i) it ef-
times across several executions of the same job or where fectively distributes the burden of performing scheduling
data on previous runs of the jobs is not available. Second, decisions between the central scheduler and the worker
preemption in Kairos, in addition to allowing short tasks nodes and ii) the task-to-node assignment policy is very
to get served quickly, also allows longer tasks to take turns lightweight.
to execute, thereby ensuring progress. Because of these characteristics, we argue that Kairos
could also be implemented as a distributed scheduler. The
7.1.2 Scheduling without runtime estimates state of the worker nodes could be gossiped across the sys-
tem, e.g., as in Apollo [5] and Yaq [31], or shared among
Sparrow [29] avoids the use of runtime estimates by the distributed schedulers, e.g., as in Omega [36]. Exist-
means of batch sampling. A job with t tasks sends 2t ing techniques like randomly perturbing the state commu-
probes to 2t worker nodes, where the probes are en- nicated to different schedulers [5] and atomic transactions
queued. One task of the job is served when one of the over the shared view of the cluster [36] could be used to
probes reaches the head of its queue. Sparrow improves limit or avoid concurrent conflicting scheduling decisions
response times because the t tasks in a job are executed by different schedulers.
by the least loaded t worker nodes out of the 2t that have
been contacted.
Tyrex [17] aims to avoid head-of-line blocking by par- 8 Conclusion
titioning the workload in classes depending on task run-
times, and by assigning different classes to disjoint parti- We present Kairos, a new data center scheduler that makes
tions of worker nodes. Because runtimes are not known no use of a priori job runtime estimates. Kairos achieves
a priori, workload partitioning is achieved by initially as- a good quality of the scheduling decisions, hence attain-
signing all tasks to partition 1, and then migrating a task ing low latency and high resource utilization, by employ-
from partition i to i + 1 when the task execution time has ing in synergy two techniques. First, a lightweight use
exceeded a threshold ti . of preemption aimed to prioritize short tasks over long
The system in [23] aims to prioritize short jobs by orga- ones and avoid head-of-line-blocking. Second, a novel
nizing jobs in priority queues depending on the cumula- task-to-node assignment that employs in combination an
tive time its tasks have received so far. Jobs in higher- admission control policy, aimed to reduce load imbalance
priority queues are assigned more resources than those among worked nodes, and assigns tasks to nodes so as to
in lower-priority queues. Tasks are hosted in a system- improve the chances that they complete quickly.
wide queue on a centralized scheduler, and are assigned We evaluate Kairos by means of a experiments on
to worker nodes depending on the priority of the corre- a cluster with a full fledge prototype in YARN, and
sponding job. by means of large scale simulations. We show that
Unlike Kairos, in all these systems there is no support Kairos achieves better job latencies than state-of-the-art
for preemption, and tasks, once started, run to comple- approaches that use a priori job runtime estimates.
12
References Analysis, and Simulation of Computer and Telecom-
munication Systems, MASCOTS ’11, pages 390–
[1] O. Alipourfard, H. H. Liu, J. Chen, S. Venkatara- 399, Washington, DC, USA, 2011. IEEE Computer
man, M. Yu, and M. Zhang. Cherrypick: Adaptively Society.
unearthing the best cloud configurations for big data
analytics. In Proceedings of the 14th USENIX Con- [9] E. Coppa and I. Finocchi. On data skewness, strag-
ference on Networked Systems Design and Imple- glers, and mapreduce progress indicators. In Pro-
mentation, NSDI’17, pages 469–482, Berkeley, CA, ceedings of the Sixth ACM Symposium on Cloud
USA, 2017. USENIX Association. Computing, SoCC ’15, pages 139–152, New York,
NY, USA, 2015. ACM.
[2] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and [10] C. Curino, D. E. Difallah, C. Douglas, S. Krishnan,
I. Stoica. Effective straggler mitigation: Attack of R. Ramakrishnan, and S. Rao. Reservation-based
the clones. In Proceedings of the 10th USENIX Con- scheduling: If you’re late don’t blame us! In Pro-
ference on Networked Systems Design and Imple- ceedings of the ACM Symposium on Cloud Com-
mentation, nsdi’13, pages 185–198, Berkeley, CA, puting, SOCC ’14, pages 2:1–2:14, New York, NY,
USA, 2013. USENIX Association. USA, 2014. ACM.
[3] G. Ananthanarayanan, S. Kandula, A. Greenberg, [11] P. Delgado, D. Didona, F. Dinu, and W. Zwaenepoel.
I. Stoica, Y. Lu, B. Saha, and E. Harris. Rein- Job-aware scheduling in eagle: Divide and stick to
ing in the outliers in map-reduce clusters using your probes. In Proceedings of the Seventh ACM
mantri. In Proceedings of the 9th USENIX Confer- Symposium on Cloud Computing, number EPFL-
ence on Operating Systems Design and Implementa- CONF-221125, 2016.
tion, OSDI’10, pages 265–278, Berkeley, CA, USA,
[12] P. Delgado, F. Dinu, A.-M. Kermarrec, and
2010. USENIX Association.
W. Zwaenepoel. Hawk: Hybrid datacenter schedul-
[4] N. Avrahami and Y. Azar. Minimizing total flow ing. In 2015 USENIX Annual Technical Conference
time and total completion time with immediate dis- (USENIX ATC 15), pages 499–510, Santa Clara,
patching. In Proceedings of the Fifteenth Annual CA, July 2015. USENIX Association.
ACM Symposium on Parallel Algorithms and Archi- [13] C. Delimitrou and C. Kozyrakis. Quasar: Resource-
tectures, SPAA ’03, pages 11–18, New York, NY, efficient and qos-aware cluster management. In Pro-
USA, 2003. ACM. ceedings of the 19th International Conference on
Architectural Support for Programming Languages
[5] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, and Operating Systems, ASPLOS ’14, pages 127–
Z. Qian, M. Wu, and L. Zhou. Apollo: Scalable and 144, New York, NY, USA, 2014. ACM.
coordinated scheduling for cloud-scale computing.
In 11th USENIX Symposium on Operating Systems [14] D. G. Down and R. Wu. Multi-layered round robin
Design and Implementation (OSDI 14), pages 285– routing for parallel servers. Queueing Syst. Theory
300, Broomfield, CO, Oct. 2014. USENIX Associa- Appl., 53(4):177–188, Aug. 2006.
tion.
[15] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin,
[6] W. Chen, J. Rao, and X. Zhou. Preemptive, low la- and R. Fonseca. Jockey: Guaranteed job latency
tency datacenter scheduling via lightweight virtual- in data parallel clusters. In Proceedings of the 7th
ization. In 2017 USENIX Annual Technical Con- ACM European Conference on Computer Systems,
ference (USENIX ATC 17), pages 251–263, Santa EuroSys ’12, pages 99–112, New York, NY, USA,
Clara, CA, 2017. 2012. ACM.
[16] P. Garefalakis, K. Karanasos, P. Pietzuch, A. Suresh,
[7] Y. Chen, S. Alspaugh, and R. Katz. Interactive an- and S. Rao. Medea: Scheduling of long running ap-
alytical processing in big data systems: A cross- plications in shared production clusters. In Proceed-
industry study of mapreduce workloads. Proc. ings of the Thirteenth EuroSys Conference, EuroSys
VLDB Endow., 5(12):1802–1813, Aug. 2012. ’18, pages 4:1–4:13, 2018.
[8] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz. The [17] B. Ghit and D. H. J. Epema. Tyrex: Size-based
case for evaluating mapreduce performance using resource allocation in mapreduce frameworks. In
workload suites. In Proceedings of the 2011 IEEE IEEE/ACM 16th International Symposium on Clus-
19th Annual International Symposium on Modelling, ter, Cloud and Grid Computing, CCGrid 2016,
13
Cartagena, Colombia, May 16-19, 2016, pages 11– Conference on Usenix Annual Technical Confer-
20, 2016. ence, USENIX ATC ’15, pages 485–497, Berkeley,
CA, USA, 2015. USENIX Association.
[18] A. Ghodsi, M. Zaharia, B. Hindman, A. Kon-
winski, S. Shenker, and I. Stoica. Dominant re- [27] Y. Kwon, M. Balazinska, B. Howe, and J. A. Rolia.
source fairness: Fair allocation of multiple resource Skewtune: mitigating skew in mapreduce applica-
types. In Proceedings of the 8th USENIX Con- tions. In K. S. Candan, Y. C. 0001, R. T. Snodgrass,
ference on Networked Systems Design and Imple- L. Gravano, and A. Fuxman, editors, SIGMOD Con-
mentation, NSDI’11, pages 323–336, Berkeley, CA, ference, pages 25–36. ACM, 2012.
USA, 2011. USENIX Association.
[28] M. Nuyens and A. Wierman. The foreground-
[19] I. Gog, M. Schwarzkopf, A. Gleave, R. M. N. Wat- background queue: A survey. Perform. Eval., 65(3-
son, and S. Hand. Firmament: Fast, centralized clus- 4):286–307, Mar. 2008.
ter scheduling at scale. In Proc. of OSDI, 2016. [29] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica.
Sparrow: Distributed, low latency scheduling. In
[20] R. Grandl, G. Ananthanarayanan, S. Kandula,
Proceedings of the Twenty-Fourth ACM Symposium
S. Rao, and A. Akella. Multi-resource packing for
on Operating Systems Principles, SOSP ’13, pages
cluster schedulers. SIGCOMM Comput. Commun.
69–84, New York, NY, USA, 2013. ACM.
Rev., 44(4):455–466, Aug. 2014.
[30] J. W. Park, A. Tumanov, A. Jiang, M. A. Kozuch,
[21] R. Grandl, M. Chowdhury, A. Akella, and G. Anan- and G. R. Ganger. 3sigma: distribution-based cluster
thanarayanan. Altruistic scheduling in multi- scheduling for runtime uncertainty. In Proceedings
resource clusters. In Proceedings of the 12th of the Thirteenth EuroSys Conference, page 2. ACM,
USENIX Conference on Operating Systems Design 2018.
and Implementation, OSDI’16, pages 65–80, Berke-
ley, CA, USA, 2016. USENIX Association. [31] J. Rasley, K. Karanasos, S. Kandula, R. Fonseca,
M. Vojnovic, and S. Rao. Efficient queue manage-
[22] R. Grandl, S. Kandula, S. Rao, A. Akella, and ment for cluster scheduling. In Proceedings of the
J. Kulkarni. GRAPHENE: Packing and dependency- Eleventh European Conference on Computer Sys-
aware scheduling for data-parallel clusters. In 12th tems, page 36. ACM, 2016.
USENIX Symposium on Operating Systems Design
and Implementation (OSDI 16), pages 81–97, Sa- [32] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz,
vannah, GA, 2016. USENIX Association. and M. A. Kozuch. Heterogeneity and dynamicity of
clouds at scale: Google trace analysis. In Proceed-
[23] Z. Hu, B. Li, Z. Qin, and R. S. M. Goh. Job schedul- ings of the Third ACM Symposium on Cloud Com-
ing without prior information in big data processing puting, SoCC ’12, pages 7:1–7:13, New York, NY,
systems. In Proc. of ICDCS, 2017. USA, 2012. ACM.
[24] C.-C. Hung, L. Golubchik, and M. Yu. Scheduling [33] X. Ren, G. Ananthanarayanan, A. Wierman, and
jobs across geo-distributed datacenters. In Proceed- M. Yu. Hopper: Decentralized speculation-aware
ings of the Sixth ACM Symposium on Cloud Com- cluster scheduling at scale. In Proceedings of the
puting, SoCC ’15, pages 111–124, New York, NY, 2015 ACM Conference on Special Interest Group on
USA, 2015. ACM. Data Communication, SIGCOMM ’15, pages 379–
392, New York, NY, USA, 2015. ACM.
[25] M. Isard, V. Prabhakaran, J. Currey, U. Wieder,
[34] L. Schrage. A proof of the optimality of the shortest
K. Talwar, and A. Goldberg. Quincy: Fair schedul-
remaining processing time discipline. Operations
ing for distributed computing clusters. In Proceed-
Research, 16(3):687–690, 1968.
ings of the ACM SIGOPS 22Nd Symposium on Oper-
ating Systems Principles, SOSP ’09, pages 261–276, [35] L. E. Schrage and L. W. Miller. The queue m/g/1
2009. with the shortest remaining processing time disci-
pline. Oper. Res., 14(4):670–684, Aug. 1966.
[26] K. Karanasos, S. Rao, C. Curino, C. Douglas,
K. Chaliparambil, G. M. Fumarola, S. Heddaya, [36] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek,
R. Ramakrishnan, and S. Sakalanaga. Mercury: Hy- and J. Wilkes. Omega: Flexible, scalable schedulers
brid centralized and distributed scheduling in large for large compute clusters. In Proceedings of the 8th
shared clusters. In Proceedings of the 2015 USENIX ACM European Conference on Computer Systems,
14
EuroSys ’13, pages 351–364, New York, NY, USA,
2013. ACM.
[37] M. Sedaghat, E. Wadbro, J. Wilkes, S. de Luna,
O. Seleznjev, and E. Elmroth. Diehard: Reliable
scheduling to survive correlated failures in cloud
data centers. In IEEE/ACM 16th International Sym-
posium on Cluster, Cloud and Grid Computing, CC-
Grid 2016,.
[38] A. Tumanov, T. Zhu, J. W. Park, M. A. Kozuch,
M. Harchol-Balter, and G. R. Ganger. Tetrisched:
global rescheduling with adaptive plan-ahead in dy-
namic heterogeneous clusters. In Proceedings of the
Eleventh European Conference on Computer Sys-
tems, page 35. ACM, 2016.
[39] A. Verma, L. Pedrosa, M. Korupolu, D. Oppen-
heimer, E. Tune, and J. Wilkes. Large-scale clus-
ter management at google with borg. In Proceed-
ings of the Tenth European Conference on Computer
Systems, EuroSys ’15, pages 18:1–18:17, New York,
NY, USA, 2015. ACM.
[40] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz,

and I. Stoica. Improving mapreduce performance
in heterogeneous environments. In Proceedings of
the 8th USENIX Conference on Operating Systems
Design and Implementation, OSDI’08, pages 29–42,
Berkeley, CA, USA, 2008. USENIX Association.
[41] Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura,

I. n. Goiri, and R. Bianchini. History-based harvest-
ing of spare cycles and storage in large-scale data-
centers. In Proceedings of the 12th USENIX Confer-
ence on Operating Systems Design and Implementa-
tion, OSDI’16, pages 755–770, Berkeley, CA, USA,
2016. USENIX Association.
15

Kairos - Preemptive Data Center Scheduling Without Runtime Estimates

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Kairos - Preemptive Data Center Scheduling Without Runtime Estimates

Uploaded by

Copyright:

Available Formats

Kairos: Preemptive Data Center Scheduling

Without Runtime Estimates

Submission Type: Research

Abstract user-facing services. Scheduling such jobs while achiev-

Figure 2: Kairos architecture. Nodes schedulers imple-

Table 1: Job categories composing our workload. Job

publicly available, and to the FIFO YARN scheduler.

Figure 4: CDF of the job completion times in Kairos,

typical heavy-tailed job runtime distribution that charac-

5.3.3 Sensitivity Analysis

Kairos normalized to Eagle

Kairos normalized to Eagle

[40] M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz,

[41] Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura,

You might also like