
Improvements in Gang Scheduling for Parallel Supercomputers

Fabricio Alves Barbosa da Silva (1,*)        Luis Miguel Campos (2)        Isaac D. Scherson (1,2,†)

fabricio.silva@asim.lip6.fr, {lcampos,isaac}@ics.uci.edu

(1) Laboratoire ASIM, LIP6, Université Pierre et Marie Curie, Paris, France.
(2) Dept. of Information and Comp. Science, University of California, Irvine, CA 92697, U.S.A.

(*) Supported by Capes, Brazilian Government, grant number 1897/95-11.
(†) Supported in part by the Irvine Research Unit in Advanced Computing and NASA under grant #NAG5-3692.

Abstract

Gang scheduling has been widely used as a practical solution to the dynamic parallel job scheduling problem. Parallel threads of a single job are scheduled for simultaneous execution on a parallel computer even if the job does not fully utilize all available processors. Non-allocated processors go idle for the duration of the time quantum assigned to the threads. In this paper we propose a class of scheduling policies, dubbed Concurrent Gang, that is a generalization of gang scheduling and allows for the flexible simultaneous scheduling of multiple parallel jobs, thus improving the space sharing characteristics of gang scheduling. However, all the advantages of gang scheduling, such as responsiveness, efficient sharing of resources, ease of programming, etc., are maintained. The resulting policy is simulated and compared with gang scheduling using a general purpose event driven simulator specially developed for this purpose.

1 Introduction

Gang Scheduling [1][5] has been proposed as a practical solution to the dynamic parallel job scheduling problem. Dynamic means that the possibility of arbitrary arrival times for new jobs is allowed. A parallel job scheduler in general is responsible for finding a good scheduling allocation, both temporal and spatial, as a function of the existing workload. The temporal and spatial allocation represent the two dimensions in which computing resources are shared: temporal sharing is also known as time slicing or preemption, and space sharing is also known as space slicing or partitioning. These two classifications are orthogonal, and may lead to a taxonomy based on all possible combinations [1].

In gang scheduling each thread of execution of a parallel job is scheduled on an independent processor. The threads of a job are supplied with an environment that is very similar to a dedicated machine [4][5], and may or may not use all available processors.

In this paper we propose a class of scheduling policies, dubbed Concurrent Gang. It is a generalization of gang scheduling, and allows for the flexible simultaneous scheduling of multiple parallel jobs, thus improving the space sharing characteristics of gang scheduling. However, all the advantages of gang scheduling, such as responsiveness [2], efficient sharing of resources, ease of programming, fine grain synchronization performance benefits [4], etc., are maintained.

This paper is organized as follows: in Section 2 the general class of Concurrent Gang policies is described. In Section 3 space sharing under Concurrent Gang is considered, with the definition of important concepts for the precise description of the space sharing strategy used. Section 4 gives the simulation results of Concurrent Gang with first fit as the space sharing strategy, with the respective analysis and comparison with gang scheduling.
2 Concurrent Gang

In gang scheduling, each job is allocated to the whole parallel machine for a time slice before being preempted. However, not all jobs necessarily use all of the machine's processors at all times. In order to further optimize the use of massively parallel systems, the operating system must support scheduling policies aimed at scheduling simultaneously several jobs of different sizes and with no predefined arrival times.

Let us consider a machine with N processors in a MIMD architectural model as described in [6]. In Concurrent Gang the machine is shared by more than one job at any given time. All the jobs running concurrently are preempted by the end of the time slice. The scheduler is responsible for providing an efficient machine utilization, in both the temporal and spatial dimensions, always gang scheduling each job on available resources.

For the definition of Concurrent Gang, we view the parallel machine as composed of a general queue of jobs to be scheduled and a number of servers, each server corresponding to one processor. Each processor may have a queue of eligible threads to execute. The mapping and allocation of threads of a job from the general queue to the processors' queues is effected by the scheduler. In the event of a job arrival, a job termination or a job changing its number of eligible threads (events which effectively define a workload change) the Concurrent Gang scheduler will:

1. Update the eligible thread list
2. Allocate the threads of the first job of the general queue on the required number of processors
3. While not at the end of the job queue:
   - Allocate all threads of the remaining jobs using a defined spatial sharing strategy
4. Run
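To make the steps above concrete, here is a minimal Python sketch of the workload-change handler; the data layout and names (on_workload_change, proc_queues, the greedy placeholder strategy) are our own illustration, not the paper's implementation.

```python
# Sketch (our names, not the paper's) of how a Concurrent Gang scheduler might
# react to a workload change: a job arrival, a job termination, or a change in
# a job's number of eligible threads. Processors are modelled as per-processor
# queues of (job_id, thread_id) entries for the next slice.

def on_workload_change(general_queue, n_processors, space_sharing_strategy):
    proc_queues = [[] for _ in range(n_processors)]

    # 1. Update the eligible thread list of every job in the general queue.
    for job in general_queue:
        job["eligible"] = job["threads"]          # simplification: all threads eligible

    # 2. Allocate the threads of the first job of the general queue on the
    #    required number of processors (plain gang scheduling for that job).
    rest = []
    if general_queue:
        first = general_queue[0]
        for t in range(first["eligible"]):
            proc_queues[t % n_processors].append((first["id"], t))
        rest = general_queue[1:]

    # 3. Allocate all threads of the remaining jobs using the chosen spatial
    #    sharing strategy (first fit, best fit, greedy, ...).
    for job in rest:
        space_sharing_strategy(job, proc_queues)

    # 4. Run: the per-processor queues are then executed slice by slice, every
    #    running job being preempted at the end of its time slice.
    return proc_queues

# A trivial placeholder strategy: put each thread on the currently shortest queue.
def greedy(job, proc_queues):
    for t in range(job["eligible"]):
        min(proc_queues, key=len).append((job["id"], t))

jobs = [{"id": "J1", "threads": 4}, {"id": "J2", "threads": 2}]
queues = on_workload_change(jobs, 4, greedy)
```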
It should be noted that this algorithm leads to a bidimensional diagram, where one dimension corresponds to the number of processors and the other dimension is time. As each job is gang scheduled, it allocates the necessary number of processors in a given slot of time. As we suppose that the number of jobs at any given moment is finite, the time dimension is also finite, with the diagram defining a period that repeats itself if there is no change in the number and/or corresponding requirements of the jobs.

A variation of Concurrent Gang was proposed in [7], where jobs are preempted after all threads, running in parallel, execute a fixed number of instructions. The reason behind preempting the jobs after they execute a fixed number of instructions is that it provides better performance under the competitive ratio metric [7, 6]. In that case, a time slice can vary as a function of the characteristics of the job.

From the job's perspective, with Concurrent Gang it still has the impression of running on a dedicated machine, as in gang scheduling, except perhaps for some possible reduction in I/O and network bandwidth due to interference from other jobs. Still, the CPU and memory resources required by the job are dedicated.

Concurrent Gang implies better performance and machine utilization than pure gang scheduling, since gang scheduled jobs may not use all processors, resulting in a lower rate of processor utilization. Hence, Concurrent Gang space sharing is proposed to improve the utilization of individual processors in a parallel machine by combining the best characteristics of gang scheduling and partitioning, assuming current MIMD machines. Besides that, the execution of jobs in parallel also implies better overall execution times, since we are not obliged to always create a new time slice for a newly arriving job, but may run that job in parallel with another already-running job, provided that there is a sufficient number of processors in the time slice. The queueing system approach of Concurrent Gang provides a general framework for describing space sharing strategies based on gang scheduling, with or without thread migration, given the capability of queueing systems to model the workload-resources interaction.

However, the description of a scheduler under Concurrent Gang is not complete if a space sharing strategy is not defined. In the next section we state some important concepts that are useful for this definition and give some examples of Concurrent Gang schedulers.
3 Space sharing in Concurrent Gang

It is clear that once the first job, if any, in the general queue is allocated, the remaining available resources can be allocated to other eligible threads by using a space sharing strategy. Some possible strategies are first fit, best fit and greedy policies. First fit and best fit policies were originally defined by Feitelson [3].

[Figure 1 shows a trace diagram: processors P0 through Pn-1 on the vertical axis and time on the horizontal axis; jobs J1-J6 occupy slots within slices, hatched slots are idle, and workload changes delimit cycles, each made up of repeated periods.]
Figure 1: Cycle, slice, period and slot definitions
To clarify the application of these policies in Concurrent Gang, let us first state some important concepts: the concepts of cycle, slice, period and slot. Figure 1 illustrates these definitions. A workload change occurs at the arrival of a new job, the termination of an existing one, or through a variation in the number of eligible threads of a job to be scheduled. The time between workload changes is defined as a cycle. Between workload changes, Concurrent Gang scheduling is periodic, with a period that is a function of the workload and the spatial allocation. A period is composed of slices; a slice corresponds to a time slice as in gang scheduling, with the difference that in Concurrent Gang we may have more than one job simultaneously scheduled in a slice. A slot is a processor's view of a slice. A slice is composed of N slots, for a machine with N processors. If a processor has no assigned thread during its slot in a slice, then we have an idle slot. The bidimensional diagram shown in Figure 1 is inherent to the Concurrent Gang algorithm, and it is used to define the spatial allocation strategy. We refer to this diagram as the trace diagram.

The implementation of Concurrent Gang with first fit with thread migration is a first example of a Concurrent Gang scheduler. It is based on a greedy algorithm applied at the time of a workload change. During a cycle, the workload is obviously assumed constant. Thus, the eligible threads of queued jobs are allocated to processors using the first fit strategy for each slice. Clearly, after all eligible threads are scheduled on a processor for some slice (slot), the temporal sequence is repeated periodically until a workload change occurs again. In the event of a workload change, the distribution of jobs in the machine is reorganized depending on the change in the workload, and as we have a queue of jobs, some thread migration may occur because of this reorganization. We will refer to this strategy henceforth simply as first fit.

Although we defined an algorithm where thread migration is possible, if the machine under consideration has no efficient mechanism for thread migration, algorithms with no thread migration are also possible using these concepts. A very simple policy for spatial sharing under Concurrent Gang without thread migration is the greedy one. At arrival, a job is scheduled in a slice that has sufficient idle slots to accommodate it. In this case the definitions of cycle, slice, etc. remain valid. The scheduler should maintain a list of idle slots in the period in order to know, at job arrival, whether it is possible to schedule the job in an already existing slice.
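To make the trace-diagram bookkeeping concrete, the following is a minimal Python sketch of first fit allocation over the slices of a period, assuming (our simplification) that all eligible threads of a job are placed in a single slice; the names Slice and first_fit_allocate are illustrative only, not taken from the paper.

```python
# Minimal sketch (not from the paper): first fit allocation of a job's eligible
# threads into the slices of a period. A slice has N slots, one per processor;
# a free slot is None, an occupied slot holds a (job_id, thread_id) pair.

N_PROCESSORS = 8

class Slice:
    def __init__(self, n=N_PROCESSORS):
        self.slots = [None] * n          # slots[p] is processor p's view of this slice

    def idle_slots(self):
        return [p for p, s in enumerate(self.slots) if s is None]

def first_fit_allocate(period, job_id, n_threads, n=N_PROCESSORS):
    """Place all n_threads of a job in the first slice with enough idle slots.
    Returns the slice index used; a new slice is appended if none fits
    (which is where pure gang scheduling would always create a new slice)."""
    for i, sl in enumerate(period):
        free = sl.idle_slots()
        if len(free) >= n_threads:
            for t, p in zip(range(n_threads), free):
                sl.slots[p] = (job_id, t)
            return i
    new_slice = Slice(n)
    for t in range(n_threads):
        new_slice.slots[t] = (job_id, t)
    period.append(new_slice)
    return len(period) - 1

# Example: jobs J1 (5 threads) and J2 (3 threads) end up sharing one slice
# on an 8-processor machine, leaving no idle slots in that slice.
period = []
first_fit_allocate(period, "J1", 5)
first_fit_allocate(period, "J2", 3)
```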
It is worth noting that, relative to its definition as a queueing network with a processor sharing discipline, Concurrent Gang is particularly convenient for describing schedulers that are periodic between workload changes. We will now state a theorem proving that a periodic schedule performs at least as well as any non-periodic one with respect to the total number of idle slots, i.e., periodic schedulers achieve spatial allocation that is better than (or at least as good as) that of non-periodic ones when processor utilization is measured through the ratio of the total number of empty (idle) slots to the total number of slots in the period. We denote this measure as the idling ratio.
Theorem 1. Given a workload W, for every temporal schedule S there exists a periodic schedule S_p such that the idling ratio of S_p is at most that of S.

Proof. First of all, let us make a definition that will be useful in this proof. We define the happiness of a job in an interval of time as the number of slots allocated to the job divided by the total number of slots in the interval.

Define the progress of a job at a particular time as the number of slices granted to each of its threads up to that time. Thus, if a job has V threads, its progress at slice t may be represented by a progress vector of V components, where each component is an integer less than or equal to t. By the rules of legal execution, no thread may lag behind another thread of the same job by more than a constant number C of slices. Therefore, no two elements in the progress vector can differ by more than C. Define the differential progress of a job at a particular time as the number of slices by which each thread leads the slowest thread of the job. Thus the differential progress vector at time t is also a vector of V components, where each component is an integer less than or equal to C. The differential progress vector is obtained by subtracting the minimum component of the progress vector from each component of the progress vector. The system's differential progress vector (SDPV) at time t is the concatenation of all jobs' differential progress vectors at time t. The key is to note that the SDPV can only assume a finite number of values. Therefore there exists an infinite sequence of times t_{i_1}, t_{i_2}, ... such that the SDPVs at these times are identical.

Consider any time interval [t_{i_k}, t'_{i_k}]. One may construct a periodic schedule by cutting out the portion of the trace diagram between t_{i_k} and t'_{i_k} and replicating it infinitely in the time dimension.

First of all, we claim that such a periodic schedule is legal. From the equality of the SDPVs at t_{i_k} and t'_{i_k} it follows that all threads belonging to the same job receive the same number of slices during each period. In other words, at the end of each period, all the threads belonging to the same job have made equal progress. Therefore, no thread lags behind another thread of the same job by more than a constant number of slices.

Secondly, observe that it is possible to choose a time interval [t_{i_k}, t'_{i_k}] such that the happiness of each job during this interval is at least as much as in the complete trace diagram. This implies that the happiness of each job in the constructed periodic schedule is greater than or equal to the happiness of each job in the original temporal schedule.

Therefore, the idling ratio of the constructed periodic schedule must be less than or equal to the idling ratio of the original temporal schedule. Since the fraction of area in the trace diagram covered by each job increases, the fraction covered by the idle slots must necessarily decrease. This concludes the proof.
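For reference, the quantities used in the proof can be stated compactly in our own notation (consistent with, but not copied from, the definitions above):

```latex
% Our notation for the quantities defined in the proof.
% Happiness of job j over an interval I of the trace diagram:
\[
  h_j(I) \;=\; \frac{\text{slots allocated to } j \text{ in } I}{\text{total number of slots in } I}
\]
% Progress vector of a job with V threads at slice t, and its differential form:
\[
  P_j(t) = \bigl(p_1,\dots,p_V\bigr), \qquad
  D_j(t) = \bigl(p_1 - \min_i p_i,\;\dots,\;p_V - \min_i p_i\bigr),
  \qquad 0 \le D_j(t)_i \le C .
\]
% The SDPV is the concatenation of all jobs' differential progress vectors; since
% every component lies in {0, ..., C}, it can take only finitely many values, so
% some value recurs infinitely often (pigeonhole), giving the times t_{i_1}, t_{i_2}, ...
% Idling ratio of a schedule over one period:
\[
  \rho \;=\; \frac{\text{idle slots in the period}}{\text{total slots in the period}} .
\]
```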
4 Simulation and Verification

To verify the results above, we used a general purpose event driven simulator developed by our research group for studying a variety of related problems (e.g., dynamic scheduling, load balancing, etc.). The simulator accepts two different formats for describing jobs. The first is a fully qualified DAG. The second is a set of parameters used to describe the job characteristics, such as the computation/communication ratio.

When the second form is used, the actual communication type, timing and pattern are left unspecified and it is up to the simulator to convert this user specification into a DAG, using probabilistic distributions, provided by the user, for each of the parameters. Other parameters include the spawning factor for each thread, the thread life span, the synchronization pattern, the degree of parallelism (maximum number of threads that can be executed at any given time), the depth of the critical path, etc. Even though probabilistic distributions are used to generate the DAG, the DAG itself behaves in a completely deterministic way.

Once the input is in the form of a DAG, and the module responsible for implementing a particular scheduling heuristic is plugged into the simulator, several experiments can be performed using the same input by changing some of the parameters of the simulation, such as the number of processing elements available and the topology of the network, among others; their outputs, in a variety of formats, are recorded in a file for later visualization.
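As an illustration only, the parametric job description could be grouped into a record such as the following; the field names are ours, and the simulator's actual input format may differ.

```python
from dataclasses import dataclass

# Illustrative only: a record grouping the job-description parameters mentioned
# above. The field names are ours; the simulator's real input format may differ.
@dataclass
class JobDescription:
    degree_of_parallelism: int      # max number of threads runnable at any given time
    spawning_factor: float          # threads spawned per thread
    thread_life_span: float         # mean thread life span (in instructions)
    sync_pattern: str               # synchronization pattern between threads
    comp_comm_ratio: float          # computation/communication ratio
    critical_path_depth: int        # depth of the critical path of the generated DAG

example = JobDescription(degree_of_parallelism=64, spawning_factor=1.0,
                         thread_life_span=1e6, sync_pattern="barrier",
                         comp_comm_ratio=1000.0, critical_path_depth=20)
```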
For this study we grouped parallel jobs into classes, where each class represents a particular degree of parallelism (maximum number of threads that can be executed at any given time). The reason behind grouping parallel jobs by their degree of parallelism is to evaluate the performance of the algorithms being studied across the vast spectrum of real parallel applications (ranging from massively parallel programs to programs requiring only two processing elements) and therefore reduce the bias towards a single type of application.

We divided the workload into ten different classes, with each class containing 50 different jobs. The arrival time of a job is described by a Poisson random variable with an average rate of two job arrivals per time slice. The actual job selection is done in a round robin fashion by picking one job per class. In this way we guarantee the interleaving of heavily parallel jobs with shorter ones.
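A minimal sketch of this experimental workload, under our own simplifying assumptions about inter-arrival times and job records:

```python
import random

# Sketch of the experimental workload described above: ten classes, one per
# degree of parallelism in {2, 4, ..., 1024}, 50 jobs per class, arrivals
# modelled as a Poisson process averaging two arrivals per time slice, and
# round-robin selection of one job per class. Job records are our placeholders.

CLASSES = [2 ** k for k in range(1, 11)]        # degrees of parallelism: 2 .. 1024
JOBS_PER_CLASS = 50
ARRIVALS_PER_SLICE = 2.0                        # mean of the Poisson arrival process

def build_workload(seed=0):
    rng = random.Random(seed)
    workload, t = [], 0.0
    for i in range(len(CLASSES) * JOBS_PER_CLASS):
        dop = CLASSES[i % len(CLASSES)]            # round robin over the ten classes
        t += rng.expovariate(ARRIVALS_PER_SLICE)   # exponential inter-arrival times
        workload.append({"arrival": t, "threads": dop})
    return workload

workload = build_workload()
```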
We distinguish the class of computation instructions from that of communication instructions in the various threads that compose a job. The latter forces the thread to be suspended until the communication is concluded. If the communication is concluded during the currently assigned time slice, the thread resumes execution. We used a factor of 0.001 communications per computation instruction.
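As a rough illustration of this instruction model (our own simplification, not the simulator's code), a thread's progress within one time slice could be simulated as follows:

```python
import random

# Sketch (our simplification) of the instruction model: a thread executes
# computation instructions and, with probability 1/1000 per instruction, issues
# a communication; it is then suspended until the communication concludes, and
# resumes within the same time slice only if completion falls inside it.

COMM_PROBABILITY = 0.001      # communications per computation instruction

def run_thread_slice(rng, slice_length, comm_latency):
    """Return the number of instructions a thread completes in one time slice."""
    executed, now = 0, 0
    while now < slice_length:
        if rng.random() < COMM_PROBABILITY:
            now += comm_latency          # suspended until the communication concludes
            if now >= slice_length:      # not concluded within the slice: the thread
                break                    # stays blocked for the rest of the slice
        executed += 1
        now += 1
    return executed

print(run_thread_slice(random.Random(1), slice_length=10_000, comm_latency=500))
```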
The classes are ranked according to their degree of parallelism (between 2 and 1024, in powers of two increments) and the jobs were scheduled on a simulated 1024 processor machine. In Table 1 we compare gang scheduling with Concurrent Gang using first fit as the space sharing strategy.

                    Total Running Time (%)   Total Idle Time (%)
  Gang                               123.6                  41.9
  Concurrent Gang                    100                    28.2

Table 1: Experimental results

It is important to dissect the value obtained for idle time, which is the result of three factors:

1 - Communications
2 - Absence of ready threads
3 - Inefficiency of allocation

The first is a natural consequence of threads communicating among themselves. The second reflects the fact that jobs arrive and finish at random times, and at any given instant there might not be any job ready to be scheduled. The last is a result of inefficiencies due to the non-optimality of the first fit algorithm.

References

[1] Feitelson, D. G.: Job Scheduling in Multiprogrammed Parallel Systems. IBM Research Report RC 19970, Second Revision, 1997.

[2] Feitelson, D. G., Jette, M. A.: Improved Utilization and Responsiveness with Gang Scheduling. In: Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pp. 238-261, Springer-Verlag, 1997.

[3] Feitelson, D. G.: Packing Schemes for Gang Scheduling. In: Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), pp. 89-110, Springer-Verlag, 1996.

[4] Feitelson, D. G., Rudolph, L.: Gang Scheduling Performance Benefits for Fine-Grain Synchronization. Journal of Parallel and Distributed Computing 16, pp. 306-318, 1992.

[5] Jette, M. A.: Performance Characteristics of Gang Scheduling in Multiprogrammed Environments. Supercomputing '97, 1997.

[6] Scherson, I. D., Subramaniam, R., Reis, V. L. M., Campos, L. M.: Scheduling Computationally Intensive Data Parallel Programs. École Placement Dynamique et Répartition de Charge, pp. 39-61, 1996.

[7] Silva, F., Campos, L. M., Scherson, I. D.: A Lower Bound for Dynamic Scheduling of Data Parallel Programs. EUROPAR '98, 1998 (to appear).