some success in mitigating the impact of misestimations, but they do not fundamentally address the problem.

Kairos. In this paper, we propose an alternative approach to data center scheduling that does not use job runtime estimates. Our approach draws from the Least Attained Service (LAS) scheduling policy [28]. LAS is a preemptive scheduling technique that executes first the task that has received the smallest amount of service so far. LAS is known to achieve good job completion times when the distribution of job runtimes has high variance, as is the case in the often heavy-tailed data center workloads.

The main challenge is to find a good approximation of LAS in a data center environment. A naive implementation would cause frequent task migrations, with their attendant performance penalties. Task migration would be needed to allow a preempted task to resume its execution on any worker node with available resources.

Instead, we have developed a two-level scheduler that avoids task migrations altogether but still offers good performance. In particular, Kairos consists of a centralized scheduler and per-node schedulers. The per-node schedulers implement LAS for the tasks on their node, using preemption as necessary to avoid head-of-line blocking. The centralized scheduler distributes tasks among worker nodes in a manner that balances the load.

A first challenge in this design is how to ensure high resource utilization in the absence of runtime estimates. To address this issue, the central scheduler aims to equalize the number of tasks per node, and it bounds the possible load imbalance among nodes by limiting the maximum number of tasks assigned to a worker core.

A second challenge is how to ensure that the distributed approximation of LAS preserves the performance benefits of the original formulation of LAS. Kairos addresses this issue by means of a novel task-to-node dispatching approach, in which the central scheduler assigns tasks to nodes so as to induce high variance in the distribution of the runtimes of the tasks assigned to each node.

We have implemented Kairos in YARN. We compare its performance against the YARN FIFO scheduler and Big-C [6], a state-of-the-art YARN-based scheduler that also uses preemption. We show that Kairos reduces the median job completion time by 37% and 73%, and the 99th percentile by 57% and 30%, with respect to Big-C and FIFO, respectively. We evaluate Kairos at scale by implementing it in the Eagle scheduler [11] and comparing its performance against Eagle using traces from Google and Yahoo. Kairos improves the completion times of latency-sensitive jobs by up to 55% and 85%, respectively.

Contributions. We make the following contributions:
1) We demonstrate good data center scheduling performance without using task runtime estimates.
2) We present an efficient distributed version of the LAS scheduling discipline.
3) We implement this distributed LAS in YARN and compare its performance to state-of-the-art alternatives by measurement and simulation.

Roadmap. The outline of the rest of this paper is as follows. Section 2 provides the necessary background. Section 3 describes the design of Kairos. Section 4 describes its implementation in YARN. Section 5 evaluates the performance of the Kairos YARN implementation. Section 6 provides simulation results. Section 7 discusses related work. Section 8 concludes the paper.

2 Background

2.1 Misestimations

Estimates in existing systems. Earlier schedulers like Sparrow [29] do not rely on prior information about job runtimes, but in heavy-tailed workloads latency-sensitive short tasks often experience long queueing delays due to head-of-line blocking [12].

Most state-of-the-art data center schedulers rely on job runtime estimates to make informed scheduling decisions [5, 11, 12, 13, 19, 20, 21, 22, 26, 41]. Estimates are used to avoid head-of-line blocking and resource contention, to provide load balancing and fairness, and to meet deadlines. The accuracy of job runtime estimates is therefore of paramount importance. Estimates of the runtime of a task within a job can be obtained from past executions of the same task, if any, from past executions of similar tasks [5], or by means of on-line profiling [13]. A common estimation technique for the task duration is to take the average of the task durations over previous executions of the job [12, 31]. More sophisticated techniques rely on machine learning [30].

Challenges in obtaining accurate estimates. Unfortunately, obtaining accurate and reliable estimates is far from an easy task. Many factors contribute to the difficulty of obtaining reliable estimates. The scheduler may have limited or no information to produce estimates for new jobs, i.e., jobs that have never been submitted before [31]. Even if jobs are recurring, evidence indicates that changes in the input data set may lead to significant and hard-to-predict shifts in the runtime of a job [1]. Changes in data placement may cause the job execution time to change. Skew in the input data distribution can lead tasks in the same job to have radically different runtimes [9, 27]. Finally, failures and transient resource utilization spikes may lead to stragglers [2], which not only have an unpredictable duration, but also represent outliers in the data set used to predict future runtimes for the same job.

We provide an example of the estimation errors that can affect job scheduling decisions by analyzing public traces
[Plots for the Yahoo, Google, Cloudera and Facebook traces; axes: relative error (%) and absolute relative error (%).]
Figure 1: Prediction error when estimating the duration of each task in a job as the mean task duration in that job. (a) The PDF of the relative error (the '<-100' data point includes all under-estimations of more than 100%, and the '>100' data point all over-estimations larger than 100%). (b) The CDF of the absolute relative error (the very tail of this distribution is not shown for the sake of readability).
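The metric of Figure 1 — the error made when predicting each task's duration by the job's mean task duration — can be computed in a few lines. The following Python sketch uses hypothetical task durations; it is an illustration, not the authors' analysis code:

```python
# Sketch: relative prediction error when each task's duration is
# estimated by the mean task duration of its job (durations are
# hypothetical, in seconds).
def prediction_errors(durations):
    """Relative error E (in %) of predicting each duration by the mean."""
    mean = sum(durations) / len(durations)
    return [100.0 * (t - mean) / mean for t in durations]

# A skewed job: one straggler pulls the mean far from the typical task.
job = [10.0, 10.0, 10.0, 10.0, 60.0]   # mean is 20s
errors = prediction_errors(job)
abs_errors = [abs(e) for e in errors]
print(errors)   # [-50.0, -50.0, -50.0, -50.0, 200.0]
```

Even this small amount of skew produces errors of 50% for the typical tasks and 200% for the straggler, mirroring the heavy tails in Figure 1.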
that are widely used to evaluate data center schedulers. In particular, we consider the Cloudera [7], Yahoo [8], Google [32] and Facebook [7] traces. We study the distribution of the error incurred when using the mean execution time of the tasks in a job as an indicator of the execution time of a task in that job.

Let J be a job in the trace and T the set of tasks t1, ..., tn in the job, each with an associated execution time ti.t. Let TJ be the mean execution time of the tasks in J. Then, we compute the relative prediction error for a task as E = 100 × (ti.t − TJ)/TJ, and the absolute relative prediction error as |E|. We show the PDF of E and the CDF of |E| in Figure 1. While up to 50% of the predictions are accurate to within 10%, some prediction errors can be higher than 100%.

Similar degrees of misestimation have also been reported in recent work that uses a machine learning approach to predict task resource demands [30].

Coping with misestimations and the Kairos approach. Previous work has shown that job runtime misestimation leads to worse job completion times [11], failure to meet service level objectives [13, 38], and missed job completion deadlines [38]. Some systems deal with misestimations through runtime correction mechanisms such as task cloning [2] and queue re-balancing [31], or by using a distribution of estimates rather than a single-value estimate [30]. These solutions mitigate the effects of misestimations, but they do not avoid the problem entirely, and they increase the complexity of the system.

Kairos overcomes the limitations of scheduling based on runtime estimates by adapting the LAS scheduling policy [28] to a data center environment. LAS does not require a priori information about task runtimes and is well suited to workloads with high variance in runtimes, as is the case in the often heavy-tailed data center workloads.

2.2 Least-Attained-Service

Prioritizing short jobs. Typical data center workloads are a mix of long and short jobs [8, 7, 32]. Giving higher priority to short jobs improves their response times by reducing head-of-line blocking. The Shortest Remaining Processing Time (SRPT) scheduling policy [35] prioritizes short tasks by executing pending tasks in increasing order of expected runtime and by preempting a task if a shorter task arrives. SRPT is provably optimal with respect to mean response time [34].

Recent systems have successfully adopted SRPT in the context of data center scheduling [11, 24, 31]. These systems do not support preemption, so they implement a variant of SRPT in which the shortest task executes first but, once started, a task runs to completion.

Least Attained Service (LAS). SRPT requires task runtime estimates to determine which task should be executed. LAS is a scheduling policy akin to SRPT, but it does not rely on a priori estimates [28]. LAS instead uses the service time already received by a task as an indication of its remaining runtime.

Given a set of tasks to run, LAS schedules for execution the one with the lowest attained service, i.e., the one that has executed for the smallest amount of time so far. We call this task the youngest one. If there are n youngest tasks, all of them are assigned an equal 1/n share of processing time, i.e., they run according to the Processor Sharing (PS) scheduling policy (as in typical multiprogramming operating systems). LAS makes use of preemption to allow the youngest task to execute at any moment.

Rationale. LAS uses the attained service as an indication of the remaining service demand of a task. The rationale behind the effectiveness of this service demand prediction policy lies in the heavy-tailed service demand distribution
that is prevalent among production workloads. That is, if
a task has executed for a long amount of time, it is likely
that it is a large task, and hence it still has much to execute
before completion. Hence, it is better to execute younger
tasks, as they are more likely to be short tasks.
In addition, if the youngest task in the queue has an
attained service T , a new incoming task is going to be
the youngest one until it has received a service of T (if
no other task arrives in the meantime). Hence, if the task
is a short one –which is likely under the assumption of
heavy-tailed runtime distribution– then it is likely that the
task is going to complete within T , thus experiencing no
queueing at all.
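The two properties above — running the youngest task, and the no-queueing window of size T enjoyed by a newcomer — can be illustrated with a minimal Python sketch. The task names and numbers are hypothetical; this is a simplification of the policy, not the Kairos implementation:

```python
# Sketch of the LAS selection rule: always run the task with the least
# attained service (the "youngest"). Task records are hypothetical.
def youngest(attained):
    """attained: dict task -> attained service time. Pick the youngest."""
    return min(attained, key=attained.get)

attained = {"long_a": 120.0, "long_b": 45.0}
attained["newcomer"] = 0.0   # a new task arrives with zero attained service

# The newcomer runs immediately, and keeps running until its attained
# service reaches T = 45.0, the service of the youngest other task.
assert youngest(attained) == "newcomer"
attained["newcomer"] = 44.0
assert youngest(attained) == "newcomer"   # still below T: no queueing yet
attained["newcomer"] = 46.0
assert youngest(attained) == "long_b"     # past T, the newcomer loses priority
```

If the newcomer needs less than 45 time units of service, it completes without ever being queued behind the two long tasks.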
Algorithm 1: Node scheduler

1: Set<TaskEntry> IdleTasks, RunningTasks  ▷ Track suspended/running tasks

2: upon event Task t arrives do
3:   TaskEntry te
4:   te.task ← t
5:   te.attained ← 0
6:   te.start ← now()
7:   RunningTasks.add(te)
8:   if (!IdleCores.isEmpty()) then  ▷ Free core can execute t
9:     core c ← IdleCores.pop()
10:  else  ▷ Preempt oldest running task
11:    tp ← argmax{tt.attained} {tt ∈ RunningTasks}
12:    tp.attained += now() − tp.start
13:    c ← core serving tp
14:    remove tp from c
15:    IdleTasks.add(tp)
16:  assign t to c
17:  c.startTimer(W)
18:  start t

19: upon event Task t finishes on core c do
20:  RunningTasks.remove(t)
21:  if (!IdleTasks.isEmpty()) then  ▷ Run youngest suspended task
22:    TaskEntry tr ← argmin{ti.attained} {ti ∈ IdleTasks}
23:    RunningTasks.add(tr)
24:    assign tr to c
25:    tr.start ← now()
26:    c.startTimer(W)
27:    start tr.task
28:  else
29:    IdleCores.push(c)

30: upon event Timer fires on core c running task t do
31:  TaskEntry ts ← TaskEntry e : e.task = t
32:  ts.attained += now() − ts.start
     ▷ Find youngest suspended task
33:  TaskEntry tm ← argmin{ti.attained} {ti ∈ IdleTasks}
34:  if (tm.attained ≤ ts.attained) then  ▷ Preempt t
35:    IdleTasks.remove(tm)
36:    IdleTasks.add(ts)
37:    RunningTasks.remove(ts)
38:    RunningTasks.add(tm)
39:    tm.start ← now()
40:    place tm.task on c
41:    start tm.task
42:  else  ▷ Continue running t
43:    ts.start ← now()
44:    c.startTimer(W)

45: upon event Every ∆ do
46:  Heartbeat HB
47:  HB.numTasks ← IdleTasks.size() + RunningTasks.size()
48:  HB.var ← var{t.attained} {t ∈ IdleTasks ∪ RunningTasks}
49:  send HB to the central scheduler

Algorithm 2: Central scheduler

1: Queue CentralQueue  ▷ Queue where incoming tasks are placed
2: Node[numNodes] Nodes  ▷ Entries track # tasks and attained service times

3: upon event New job J arrives do
4:   for task t ∈ J do
5:     CentralQueue.push(t)

6: upon event Heartbeat HB from Node i arrives do
7:   Nodes[i].var ← HB.var
8:   Nodes[i].numTasks ← HB.numTasks

9: procedure MAINLOOP
10:  while (true) do
11:    for i = 0, ..., N + Q do
12:      Si ← {Node m ∈ Nodes : m.numTasks = i}
13:      while (!Si.isEmpty() ∧ !CentralQueue.isEmpty()) do
14:        Node m ← argmin{n.var} {n ∈ Si}
15:        Task t ← CentralQueue.pop()
16:        assign t to m
17:        Si ← Si \ {m}
18:    Sleep(∆)

mechanism to prevent long jobs from being preempted indefinitely (not shown in the pseudocode). Each task is associated with a counter that tracks how many times the task has been preempted. If a task is preempted more than a given number of times, then it acquires the right to run for a quantum of time during which it cannot be preempted. This mechanism ensures the progress of every task.

Impact and setting of W. The value of W determines the trade-off between task waiting times and completion times. A high value for W allows the shortest tasks to complete within a single execution window. However, it may also lead a preempted task in the node queue to wait for a long time before it can run again, an undesirable situation for a short task that has been preempted to make room for a new incoming task. A low value for W, instead, gives a task frequent opportunities to execute and hence potentially complete. However, it may also lead to long completion times, because task completion may be delayed by frequent interleaving. We study the sensitivity of Kairos to the setting of W in Section 5.3.3, where we show that Kairos is relatively robust to sub-optimal settings of W.

3.3 Central scheduler

Algorithm 2 presents the data structures maintained by the central scheduler and its operations.

3.3.1 Challenges in the absence of estimates

The lack of a priori job runtime estimates makes it cumbersome to achieve load balancing. Existing approaches use job runtime estimates to place a task on the worker node that is expected to minimize the waiting time of the task [5, 31]. This strategy improves task completion times and achieves high resource utilization by equalizing the load on the worker nodes. Kairos cannot re-use such existing techniques in a straightforward fashion, because it cannot accurately estimate the backlog on a worker node and the additional load posed by a task being scheduled.

To circumvent this problem, Kairos decouples the problems of achieving load balance and high resource utilization from the problem of achieving low completion times. Kairos leverages the insight that short completion times are already achieved by implementing LAS in the individual node schedulers. In fact, LAS gives shorter tasks the possibility to completely or partially bypass the queues on the worker nodes. This means that the central scheduler can be to some extent agnostic of the actual backlog on
worker nodes, because the backlog is not an indicator of the waiting time for a task.

Hence, in Kairos, the central scheduler has two goals:

1) Enforcing that resources do not get wasted, i.e., that no core is idle while there are tasks in some queue (either eligible idle tasks in the central queue or tasks in any worker queue). This leads to high resource utilization and implies balancing the load among worker nodes (Section 3.3.2).

2) Maximizing LAS effectiveness, e.g., by improving the chances that short tasks bypass long tasks, and by ensuring that tasks do not hurt each other's response times through excessive interleaved executions (Section 3.3.3).

3.3.2 Load Balancing

The central scheduler aims to balance the load across worker nodes by enforcing that each of them is assigned an equal number of tasks. Hence, the first outstanding task in the central queue is placed on the worker node with the smallest number of assigned tasks.

This policy alone, however, is not sufficient with heavy-tailed runtime distributions, as it may lead to temporary load imbalance. For example, a worker node may be assigned many short tasks while another worker node is loaded with longer tasks. Then, the first worker node might complete all its short tasks and become idle while some tasks lie idle on the other worker node, waiting to receive service time.

To address this issue, the central scheduler enforces that each worker is assigned at most Q + N tasks at any moment in time. This admission control mechanism bounds the possible load imbalance, since a worker node can host at most Q idle tasks that could have been assigned to other worker nodes with available resources.

The Kairos task-to-node dispatching policy achieves load balancing and high resource utilization, and it is cheap enough to allow low-latency scheduling decisions. This allows Kairos to sustain high job arrival rates without incurring long scheduling delays.

Impact and setting of Q. The value of Q determines the trade-off between load balance and the effectiveness of LAS. A small value of Q reduces the possible load imbalance, but may lead many short tasks to sit in the central queue instead of being assigned to a worker node, where they could execute by preempting a previous task and potentially complete quickly. A high value of Q, on the contrary, may lead to higher load imbalance, but enables more parallelism. We assess the sensitivity of Kairos to the setting of Q in Section 5.3.3, where we show that Kairos's performance is not dramatically affected by sub-optimal settings of Q.

3.3.3 Maximizing LAS effectiveness

Kairos implements a LAS-aware policy to break ties when two or more worker nodes have an equal number of tasks assigned to them. In more detail, it assigns the task to the worker node with the lowest variance in the attained service times of the tasks currently placed on that node, in the hope that by doing so it can significantly increase the variance on that node. The rationale behind this choice is that LAS is most effective when the task duration distribution has high variance. Intuitively, if only short tasks were assigned to a node, the youngest short tasks would preempt older short tasks, hurting their completion times. Similarly, if only long tasks were assigned to a node, all would run in an interleaved fashion, each one hurting the completion times of the others.

The effectiveness of this policy is grounded in previous analyses of SRPT in distributed environments, which show that maximizing the heterogeneity of task runtimes on each worker node is key to improving task completion times [4, 14]. Unlike previous studies, however, Kairos does not rely on exact knowledge of the runtimes of the tasks on each worker node; it uses the attained service times of the tasks on a worker node to estimate the variability of task runtimes on that node.

4 Kairos implementation

We implement Kairos as part of YARN [18], a widely used scheduler for data-parallel jobs. Figure 3 shows the main building blocks of YARN, their interactions, and the components introduced by Kairos.

YARN. YARN consists of a ResourceManager residing on a master node, and a NodeManager residing on each worker node. YARN runs a task on a worker node within a container, which specifies the node resources allocated to the task. Each worker node also has a ContainerManager that manages the containers on the node. Finally, each job has an ApplicationManager that runs on a worker node and tracks the advancement of all tasks within the job.

The ResourceManager assigns tasks to worker nodes and communicates with the NodeManagers on the worker nodes. A NodeManager communicates with the ResourceManager by means of periodic heartbeat messages. These heartbeats contain information about the node's health and the containers running on it.

Kairos central scheduler. The Kairos central scheduling policy is implemented in the ResourceManager. In particular, the Kairos central scheduler extends the CapacityScheduler to allow a worker node to be allocated more containers than available cores.

Kairos node scheduler. The node scheduler of
Category          input   #maps  #reduces  extraFlops  duration  probability
1  small            4GB      15        15           0       85s         0.32
2  medium small     4GB      15        15         500      201s         0.31
3  medium           8GB      30        30           0      239s         0.31
4  medium long     30GB     112        60         500      308s         0.04
5  long            60GB     224        60        1000     1175s         0.02
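For illustration, a synthetic workload following the category mix in the table above could be drawn as follows. The probabilities come from the table; the generator itself is our sketch, not the paper's benchmark harness:

```python
import random

# Sketch: sample job categories 1-5 according to the probabilities in the
# table above (0.32, 0.31, 0.31, 0.04, 0.02). Illustrative generator only.
CATEGORY_PROB = {1: 0.32, 2: 0.31, 3: 0.31, 4: 0.04, 5: 0.02}

def sample_jobs(n, seed=42):
    rng = random.Random(seed)
    names, weights = zip(*CATEGORY_PROB.items())
    return rng.choices(names, weights=weights, k=n)

jobs = sample_jobs(1000)
# Categories 1-3 (the shorter jobs) should dominate, roughly 94% of draws.
short_share = sum(1 for c in jobs if c <= 3) / len(jobs)
```

The heavy skew toward the three shortest categories is what makes head-of-line blocking behind the rare long jobs so costly in this workload.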
[Two CDFs of job completion times (s): the left plot compares Kairos, Big-C and FIFO; the right plot compares the Random, Sum and Var task-to-node assignment policies.]
[Two CDFs of job completion times (s) comparing Kairos against Big-C: (a) Kairos with Q = 0, 4, 8; (b) Kairos with W = 10, 50, 100.]
(a) Varying Q. (b) Varying W.
Figure 6: Sensitivity analysis to varying the maximum number of tasks on a node scheduler (Q) and the size of the quantum of time in LAS (W). Kairos is robust to sub-optimal settings of these parameters and achieves performance gains over Big-C even without optimal tuning.
centile and towards the tail of the distribution. The benefit at the 30th percentile indicates that the shortest jobs, which account for 30% of the total (see Table 1), are effectively prioritized. The benefit at higher percentiles shows that Var is also able to effectively use LAS to improve the response times of larger tasks as well.
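The Var policy evaluated here is the tie-breaking rule of Section 3.3.3: among the least-loaded nodes, dispatch to the one whose tasks have the lowest variance in attained service. A minimal Python sketch of that rule, over hypothetical node state (not the YARN implementation):

```python
from statistics import pvariance

# Sketch of the Var tie-breaking rule (Section 3.3.3): among the nodes
# with the fewest assigned tasks, choose the one whose tasks' attained
# service times have the lowest variance. Node state is hypothetical.
def pick_node(nodes, capacity):
    """nodes: dict name -> list of attained service times of its tasks.
    Returns the chosen node, or None if every node already holds N + Q tasks."""
    eligible = {n: ts for n, ts in nodes.items() if len(ts) < capacity}
    if not eligible:
        return None   # admission control: all nodes are full
    fewest = min(len(ts) for ts in eligible.values())
    tied = [n for n, ts in eligible.items() if len(ts) == fewest]
    return min(tied, key=lambda n: pvariance(eligible[n]) if eligible[n] else 0.0)

nodes = {"n1": [5.0, 5.0],        # two similar tasks: low variance
         "n2": [1.0, 90.0],        # one short, one long: high variance
         "n3": [3.0, 4.0, 50.0]}   # more tasks: not among the least loaded
assert pick_node(nodes, capacity=4) == "n1"
```

Sending the new task (with zero attained service) to the low-variance node n1 increases the runtime heterogeneity on that node the most, which is where LAS is most effective.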
[Plots: 50th and 90th percentile job completion times for Kairos normalized to Eagle, together with the average cluster utilization of both systems, for 15000-23000 nodes.]
(a) Kairos short jobs normalized to Eagle. Google trace. (b) Kairos long jobs normalized to Eagle. Google trace.
Figure 7: Kairos normalized to Eagle, short (a) and long (b) jobs. Google trace.
[Plots: 50th and 90th percentile job completion times for Kairos normalized to Eagle, together with the average cluster utilization of both systems, for 4000-8000 nodes.]
(a) Kairos short jobs normalized to Eagle. Yahoo trace. (b) Kairos long jobs normalized to Eagle. Yahoo trace.
Figure 8: Kairos normalized to Eagle, short (a) and long (b) jobs. Yahoo trace.
tically use idle nodes in the partition for long jobs. By this workload partitioning technique, Eagle avoids head-of-line blocking altogether. In addition, short jobs are executed according to a distributed approximation of SRPT that does not use preemption. That is, Eagle aims to execute first the tasks of shorter jobs, but tasks cannot be suspended once they start. Eagle uses task runtime estimates to classify jobs as long or short, and to implement the SRPT policy.

We configure Eagle to use the same parameters as in its original implementation (which vary depending on the target workload trace). These include sub-cluster sizes, cutoffs to distinguish short jobs from long ones, and parameters to implement SRPT.

6.2 Simulated Test-bed

Platform. We simulate large-scale data centers of different sizes, with 15000-23000 worker nodes for the Google trace and 4000-8000 nodes for the Yahoo trace. We keep the job arrival rates constant at the values in the traces, so increasing the number of worker nodes reduces the load on the worker nodes. We set the network delay to 0.5 milliseconds, and we do not assign any cost to making scheduling decisions.

Workloads. Table 2 shows the total number of jobs, the percentage of long jobs, and the percentage of task-seconds for long jobs for the two traces. The percentage of the execution times (task-seconds) of all short jobs is 17% in the Google trace and 2% for Yahoo. These values determine the size of the partition for short jobs in Eagle. Each simulated worker node has one core. Kairos uses Q = 2 for both workloads, W = 100 time units for the Yahoo trace and W = 10000 time units for the Google trace. The anti-starvation counter for Kairos is set to 3 for both traces.
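The anti-starvation counter just mentioned refers to the mechanism described earlier: a task preempted more than a threshold number of times earns one quantum during which it cannot be preempted. A simplified Python model of that rule follows; the class and its names are ours, and only the threshold of 3 comes from the text:

```python
# Sketch of the anti-starvation rule: a task that has been preempted more
# than `threshold` times earns one quantum during which it cannot be
# preempted. Simplified model; names are illustrative.
class AntiStarvation:
    def __init__(self, threshold=3):   # threshold of 3 as in the simulations
        self.threshold = threshold
        self.preemptions = 0
        self.protected = False

    def may_preempt(self):
        """Return True if the scheduler is allowed to preempt this task."""
        if self.protected:
            return False
        self.preemptions += 1
        if self.preemptions > self.threshold:
            self.protected = True   # grant a non-preemptible quantum
            return False
        return True

    def quantum_finished(self):
        """The guaranteed quantum ended; the task is preemptible again."""
        self.protected = False
        self.preemptions = 0

task = AntiStarvation()
allowed = [task.may_preempt() for _ in range(5)]
# The first three preemption attempts succeed; then the task runs protected.
```

This bounds how long LAS can starve the oldest tasks, guaranteeing that every task makes progress.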
6.3 Experimental results

Figure 7 and Figure 8 report the 50th, 90th and 99th percentiles of job completion times for Kairos, normalized to the ones obtained by Eagle, for the Google and Yahoo traces, respectively. The plots on the left report short job completion times; the plots on the right report long job completion times. In addition, we report the average cluster utilization for both Kairos and Eagle as a function of the number of worker nodes in the cluster.

Figure 7 (a) and Figure 8 (a) show that Kairos improves short job completion times significantly at high loads (by up to 55% for Google and 85% for Yahoo). Kairos's improvements are due to the fact that, when the load is very high, short jobs in Eagle are confined to the portion of the cluster reserved for them. Hence, short jobs compete for the same scarce resources. In Kairos, instead, short jobs can run on any node, and can preempt long jobs to achieve short completion times. As the load decreases, the two systems achieve increasingly similar performance.

Long jobs exhibit different dynamics. Kairos reduces the completion times of most long jobs with respect to Eagle when the load is at least 50%. This is visible in the 50th and 90th percentiles in Figure 7 (b) and Figure 8 (b). Kairos improves the completion times of most long jobs because it interleaves their executions, leading to better completion times for the shortest among the long jobs. In Eagle, instead, the absence of preemption may lead a relatively short task among the long ones to wait for the whole execution of a longer task. As with the short jobs, the differences in performance at the 50th and 90th percentiles level out as the load decreases.

Kairos achieves a slightly worse 99th percentile than Eagle (by between 14% and 50% for Yahoo and between 11% and 33% for Google). This is because Kairos frequently preempts the longest jobs to prioritize shorter ones. This is an unavoidable, and we argue favorable, trade-off that Kairos makes to improve performance for the vast majority of jobs, especially latency-sensitive ones, without requiring a priori knowledge of job runtimes.

Finally, Kairos and Eagle achieve the same resource utilization in both workloads and for all cluster sizes. This result showcases the capability of Kairos to achieve the same high resource utilization as approaches that rely on prior knowledge of job runtimes.

7 Related Work

We compare Kairos to existing systems, focusing first on scheduling policies and then on scheduler architectures.

7.1 Scheduling policies

7.1.1 Scheduling with runtime estimates

Most state-of-the-art scheduling systems rely on runtime estimates to make informed scheduling decisions. These systems differ in how such estimates are integrated into the scheduling policy.

Apollo [5], Yaq [31] and Mercury [26] disseminate information about the expected backlog on worker nodes. Tasks are scheduled to minimize expected queueing delay and to equalize the load. Yaq also uses per-task runtime estimates to implement queue reordering strategies aimed at prioritizing short tasks.

Hawk [12], Eagle [11] and Big-C [6] use runtime estimates to classify jobs as long or short. In Eagle and Hawk, the set of worker nodes is partitioned into two subsets, sized proportionally to the expected load in each class. Then, the tasks of a job are sent to either of the two sub-clusters depending on their expected runtime. Big-C gives priority to short jobs by assigning them a higher priority in the YARN capacity scheduler. Workload partitioning and short-job prioritization aim to reduce [12, 6] or eliminate [11] head-of-line blocking.

Tetrisched [38], Rayon [10], Firmament [19], Quincy [25], Tetris [20], 3Sigma [30] and Medea [16] formalize the scheduling decision as a combinatorial optimization problem. The resulting Mixed-Integer Linear Program is solved either exactly, or an approximation is computed by means of heuristics.

Jockey [15] uses a simulator to speculate on the evolution of the system and accordingly decides the task-to-node placement. Graphene [22] uses estimates to decide first the placement of the job with the most complex requirements, and then packs other jobs depending on the remaining available resources. Carbyne [21] exploits temporary relaxations of fairness guarantees to allow a job to use resources destined for another one.

As opposed to these systems, Kairos eschews the need for any a priori information about job runtimes. Instead, Kairos infers the expected remaining runtime of tasks from the amount of time they have already executed, and uses preemption and a novel task-to-node assignment policy to avoid head-of-line blocking and achieve high resource utilization.

Correction mechanisms. The systems that rely on task runtime estimates also encompass several techniques to cope with unavoidable misestimations.

Borg [39] and Mercury [26] kill low-priority jobs to reallocate the resources they are using to higher-priority jobs. In Hawk, if a node becomes idle, it steals tasks from other nodes. Yaq [31] and Mercury [26] migrate tasks that have not started yet to re-balance the load. LATE [40], Mantri [3], Dolly [2], Hopper [33] and DieHard [37] use
techniques like restarting or cloning tasks to cope with tion. Hence, latency-sensitive tasks may incur head-of-
stragglers due to misestimations or due to unexpected line blocking and suffer from high waiting times in case
worker nodes slowdowns or failures. Tetrisched [38], of high utilization. In contrast, Kairos uses preemption
3Sigma [30], Rayon [10] and Jockey [15] periodically re- to allow an incoming task to run as soon as it arrives
evaluate the scheduling plan and change it accordingly in on a worker node, offering short tasks th possibility of
case tasks take longer than expected to complete. completing with limited or no waiting time, even in high-
By contrast, Kairos uses preemption and limits the utilization scenarios.
amount of queue imbalance by means of admission con-
trol. Kairos can integrate speculative execution or queue
re-balancing techniques techniques at the cost of intro-
7.2 Scheduler architecture
ducing heuristics to detect stragglers (e.g., based on their Kairos can be classified as a centralized scheduler, be-
progress rate) and support for task migration (e.g., based cause all tasks are dispatched by a single component, al-
on checkpointing). though the worker nodes also perform local scheduling
Some systems like Rayon [10], 3Sigma [30] and Big- decisions. There is a recent trend towards distributed
C [6] make use of preemption to correct the scheduling schedulers, such as Omega [36], Sparrow [29], Apollo [5]
decision in case a new job arrives that must use resources and Yaq [31], or hybrid schedulers such as Mercury [26],
already allocated. The difference with the use of preemp- Hawk [12] and Eagle [11] to achieve low scheduling la-
tion in Kairos is twofold. First, Kairos uses preemption to tency under high job arrival rates.
avoid the need for runtime estimates, which makes Kairos Kairos can sustain high load and achieve low schedul-
suitable also for environments with highly variable run- ing latency despite being centralized, because i) it ef-
times across several executions of the same job or where fectively distributes the burden of performing scheduling
data on previous runs of the jobs is not available. Second, decisions between the central scheduler and the worker
preemption in Kairos, in addition to allowing short tasks nodes and ii) the task-to-node assignment policy is very
to get served quickly, also allows longer tasks to take turns lightweight.
to execute, thereby ensuring progress. Because of these characteristics, we argue that Kairos
could also be implemented as a distributed scheduler. The
7.1.2 Scheduling without runtime estimates state of the worker nodes could be gossiped across the sys-
tem, e.g., as in Apollo [5] and Yaq [31], or shared among
Sparrow [29] avoids the use of runtime estimates by the distributed schedulers, e.g., as in Omega [36]. Exist-
means of batch sampling. A job with t tasks sends 2t ing techniques like randomly perturbing the state commu-
probes to 2t worker nodes, where the probes are en- nicated to different schedulers [5] and atomic transactions
queued. One task of the job is served when one of the over the shared view of the cluster [36] could be used to
probes reaches the head of its queue. Sparrow improves limit or avoid concurrent conflicting scheduling decisions
response times because the t tasks in a job are executed by different schedulers.
by the least loaded t worker nodes out of the 2t that have
been contacted.
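The batch-sampling step can be sketched with a toy simulation. The helper name `batch_sample` is hypothetical, and instantaneous queue lengths stand in for Sparrow's late-binding probes, which resolve a placement only when a probe reaches the head of a queue:

```python
import random

def batch_sample(queue_lengths, t, rng=random):
    """Probe 2t workers chosen at random, then place the job's t
    tasks on the t least-loaded workers among those probed."""
    candidates = rng.sample(range(len(queue_lengths)), 2 * t)
    # Keep the t probed workers with the shortest queues.
    best = sorted(candidates, key=lambda w: queue_lengths[w])[:t]
    for w in best:
        queue_lengths[w] += 1  # each task queues behind existing work
    return best

# A job with t = 2 tasks probes 4 of the 8 workers and lands on the
# two least-loaded workers among those probed.
queues = [3, 0, 5, 1, 4, 0, 2, 6]
placed = batch_sample(queues, t=2)
```

Probing 2t nodes rather than t is what gives batch sampling its power-of-two-choices flavor: the job avoids the worst queues among its sample without inspecting the whole cluster.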
Tyrex [17] aims to avoid head-of-line blocking by partitioning the workload into classes depending on task runtimes, and by assigning different classes to disjoint partitions of worker nodes. Because runtimes are not known a priori, workload partitioning is achieved by initially assigning all tasks to partition 1, and then migrating a task from partition i to i + 1 when the task's execution time exceeds a threshold t_i.
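This threshold-driven promotion can be sketched as follows; the threshold values are illustrative, not Tyrex's actual configuration:

```python
# Illustrative thresholds t_1, t_2, t_3 (seconds); tasks that outlive
# every threshold stay in the last partition.
THRESHOLDS = [10, 60, 300]

def partition_for(elapsed):
    """Return the 1-based partition of a task that has run for
    `elapsed` seconds: start in partition 1, move from partition i
    to i + 1 each time elapsed exceeds t_i."""
    part = 1
    for t_i in THRESHOLDS:
        if elapsed > t_i:
            part += 1
    return part

# A task is promoted through the partitions as it keeps running.
stages = [partition_for(s) for s in (5, 30, 120, 1000)]
```

Since each partition is served by a disjoint set of workers, short tasks in partition 1 never queue behind a long task that has already been promoted away.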
The system in [23] aims to prioritize short jobs by organizing jobs into priority queues depending on the cumulative time their tasks have received so far. Jobs in higher-priority queues are assigned more resources than those in lower-priority queues. Tasks are hosted in a system-wide queue on a centralized scheduler, and are assigned to worker nodes depending on the priority of the corresponding job.

Unlike Kairos, none of these systems supports preemption, and tasks, once started, run to completion. Hence, latency-sensitive tasks may incur head-of-line blocking and suffer from high waiting times under high utilization. In contrast, Kairos uses preemption to let an incoming task run as soon as it arrives on a worker node, offering short tasks the possibility of completing with limited or no waiting time, even in high-utilization scenarios.

7.2 Scheduler architecture

Kairos can be classified as a centralized scheduler, because all tasks are dispatched by a single component, although the worker nodes also make local scheduling decisions. There is a recent trend towards distributed schedulers, such as Omega [36], Sparrow [29], Apollo [5] and Yaq [31], and hybrid schedulers, such as Mercury [26], Hawk [12] and Eagle [11], to achieve low scheduling latency under high job arrival rates.

Kairos can sustain high load and achieve low scheduling latency despite being centralized, because i) it effectively distributes the burden of scheduling decisions between the central scheduler and the worker nodes, and ii) its task-to-node assignment policy is very lightweight.

Because of these characteristics, we argue that Kairos could also be implemented as a distributed scheduler. The state of the worker nodes could be gossiped across the system, e.g., as in Apollo [5] and Yaq [31], or shared among the distributed schedulers, e.g., as in Omega [36]. Existing techniques, such as randomly perturbing the state communicated to different schedulers [5] and atomic transactions over the shared view of the cluster [36], could be used to limit or avoid conflicting concurrent scheduling decisions by different schedulers.
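The atomic-transaction technique can be illustrated with a minimal optimistic-concurrency sketch over a version-checked shared view; the class and method names are hypothetical, not Omega's or Kairos's API:

```python
class SharedClusterState:
    """Shared view of free slots per node, updated by optimistic,
    version-checked transactions (Omega-style)."""

    def __init__(self, free_slots):
        self.free_slots = dict(free_slots)
        self.version = 0

    def snapshot(self):
        """Return the version and a private copy of the view."""
        return self.version, dict(self.free_slots)

    def commit(self, version, node, slots):
        """Atomically claim `slots` on `node`; fail if the state has
        changed since the snapshot (another scheduler won the race)."""
        if version != self.version or self.free_slots[node] < slots:
            return False  # conflict: caller must re-read and retry
        self.free_slots[node] -= slots
        self.version += 1
        return True

# Two schedulers read the same snapshot; only the first commit succeeds,
# and the loser must retry against the fresh state.
state = SharedClusterState({"n1": 4, "n2": 2})
v, view = state.snapshot()
ok_a = state.commit(v, "n1", 2)  # first scheduler wins
ok_b = state.commit(v, "n2", 1)  # stale snapshot: rejected
```

The version check is what makes concurrent conflicting decisions impossible: a placement derived from a stale view is rejected rather than applied twice.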
8 Conclusion

We present Kairos, a new data center scheduler that makes no use of a priori job runtime estimates. Kairos achieves good scheduling decisions, and hence low latency and high resource utilization, by employing two techniques in synergy. First, a lightweight use of preemption prioritizes short tasks over long ones and avoids head-of-line blocking. Second, a novel task-to-node assignment combines an admission control policy, aimed at reducing load imbalance among worker nodes, with a placement policy that improves the chances that tasks complete quickly.

We evaluate Kairos by means of experiments on a cluster with a full-fledged prototype in YARN, and by means of large-scale simulations. We show that Kairos achieves better job latencies than state-of-the-art approaches that use a priori job runtime estimates.
References

[1] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang. CherryPick: Adaptively unearthing the best cloud configurations for big data analytics. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI'17, pages 469–482, Berkeley, CA, USA, 2017. USENIX Association.

[2] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective straggler mitigation: Attack of the clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI'13, pages 185–198, Berkeley, CA, USA, 2013. USENIX Association.

[3] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI'10, pages 265–278, Berkeley, CA, USA, 2010. USENIX Association.

[4] N. Avrahami and Y. Azar. Minimizing total flow time and total completion time with immediate dispatching. In Proceedings of the Fifteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '03, pages 11–18, New York, NY, USA, 2003. ACM.

[5] E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: Scalable and coordinated scheduling for cloud-scale computing. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 285–300, Broomfield, CO, Oct. 2014. USENIX Association.

[6] W. Chen, J. Rao, and X. Zhou. Preemptive, low latency datacenter scheduling via lightweight virtualization. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 251–263, Santa Clara, CA, 2017. USENIX Association.

[7] Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proc. VLDB Endow., 5(12):1802–1813, Aug. 2012.

[8] Y. Chen, A. Ganapathi, R. Griffith, and R. Katz. The case for evaluating MapReduce performance using workload suites. In Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS '11, pages 390–399, Washington, DC, USA, 2011. IEEE Computer Society.

[9] E. Coppa and I. Finocchi. On data skewness, stragglers, and MapReduce progress indicators. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC '15, pages 139–152, New York, NY, USA, 2015. ACM.

[10] C. Curino, D. E. Difallah, C. Douglas, S. Krishnan, R. Ramakrishnan, and S. Rao. Reservation-based scheduling: If you're late don't blame us! In Proceedings of the ACM Symposium on Cloud Computing, SoCC '14, pages 2:1–2:14, New York, NY, USA, 2014. ACM.

[11] P. Delgado, D. Didona, F. Dinu, and W. Zwaenepoel. Job-aware scheduling in Eagle: Divide and stick to your probes. In Proceedings of the Seventh ACM Symposium on Cloud Computing, 2016. ACM.

[12] P. Delgado, F. Dinu, A.-M. Kermarrec, and W. Zwaenepoel. Hawk: Hybrid datacenter scheduling. In 2015 USENIX Annual Technical Conference (USENIX ATC 15), pages 499–510, Santa Clara, CA, July 2015. USENIX Association.

[13] C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 127–144, New York, NY, USA, 2014. ACM.

[14] D. G. Down and R. Wu. Multi-layered round robin routing for parallel servers. Queueing Syst. Theory Appl., 53(4):177–188, Aug. 2006.

[15] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: Guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys '12, pages 99–112, New York, NY, USA, 2012. ACM.

[16] P. Garefalakis, K. Karanasos, P. Pietzuch, A. Suresh, and S. Rao. Medea: Scheduling of long running applications in shared production clusters. In Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, pages 4:1–4:13, 2018. ACM.

[17] B. Ghit and D. H. J. Epema. Tyrex: Size-based resource allocation in MapReduce frameworks. In IEEE/ACM 16th International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016, Cartagena, Colombia, May 16–19, 2016, pages 11–20, 2016.

[18] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, pages 323–336, Berkeley, CA, USA, 2011. USENIX Association.

[19] I. Gog, M. Schwarzkopf, A. Gleave, R. M. N. Watson, and S. Hand. Firmament: Fast, centralized cluster scheduling at scale. In Proc. of OSDI, 2016.

[20] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. SIGCOMM Comput. Commun. Rev., 44(4):455–466, Aug. 2014.

[21] R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan. Altruistic scheduling in multi-resource clusters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 65–80, Berkeley, CA, USA, 2016. USENIX Association.

[22] R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni. GRAPHENE: Packing and dependency-aware scheduling for data-parallel clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 81–97, Savannah, GA, 2016. USENIX Association.

[23] Z. Hu, B. Li, Z. Qin, and R. S. M. Goh. Job scheduling without prior information in big data processing systems. In Proc. of ICDCS, 2017.

[24] C.-C. Hung, L. Golubchik, and M. Yu. Scheduling jobs across geo-distributed datacenters. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC '15, pages 111–124, New York, NY, USA, 2015. ACM.

[25] M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP '09, pages 261–276, 2009. ACM.

[26] K. Karanasos, S. Rao, C. Curino, C. Douglas, K. Chaliparambil, G. M. Fumarola, S. Heddaya, R. Ramakrishnan, and S. Sakalanaga. Mercury: Hybrid centralized and distributed scheduling in large shared clusters. In Proceedings of the 2015 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC '15, pages 485–497, Berkeley, CA, USA, 2015. USENIX Association.

[27] Y. Kwon, M. Balazinska, B. Howe, and J. A. Rolia. SkewTune: Mitigating skew in MapReduce applications. In K. S. Candan, Y. Chen, R. T. Snodgrass, L. Gravano, and A. Fuxman, editors, SIGMOD Conference, pages 25–36. ACM, 2012.

[28] M. Nuyens and A. Wierman. The foreground-background queue: A survey. Perform. Eval., 65(3-4):286–307, Mar. 2008.

[29] K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 69–84, New York, NY, USA, 2013. ACM.

[30] J. W. Park, A. Tumanov, A. Jiang, M. A. Kozuch, and G. R. Ganger. 3Sigma: Distribution-based cluster scheduling for runtime uncertainty. In Proceedings of the Thirteenth EuroSys Conference, page 2. ACM, 2018.

[31] J. Rasley, K. Karanasos, S. Kandula, R. Fonseca, M. Vojnovic, and S. Rao. Efficient queue management for cluster scheduling. In Proceedings of the Eleventh European Conference on Computer Systems, page 36. ACM, 2016.

[32] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the Third ACM Symposium on Cloud Computing, SoCC '12, pages 7:1–7:13, New York, NY, USA, 2012. ACM.

[33] X. Ren, G. Ananthanarayanan, A. Wierman, and M. Yu. Hopper: Decentralized speculation-aware cluster scheduling at scale. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM '15, pages 379–392, New York, NY, USA, 2015. ACM.

[34] L. Schrage. A proof of the optimality of the shortest remaining processing time discipline. Operations Research, 16(3):687–690, 1968.

[35] L. E. Schrage and L. W. Miller. The queue M/G/1 with the shortest remaining processing time discipline. Oper. Res., 14(4):670–684, Aug. 1966.

[36] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 351–364, New York, NY, USA, 2013. ACM.

[37] M. Sedaghat, E. Wadbro, J. Wilkes, S. de Luna, O. Seleznjev, and E. Elmroth. DieHard: Reliable scheduling to survive correlated failures in cloud data centers. In IEEE/ACM 16th International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016.

[38] A. Tumanov, T. Zhu, J. W. Park, M. A. Kozuch, M. Harchol-Balter, and G. R. Ganger. TetriSched: Global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters. In Proceedings of the Eleventh European Conference on Computer Systems, page 35. ACM, 2016.

[39] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys '15, pages 18:1–18:17, New York, NY, USA, 2015. ACM.