Scheduling

Keshav Pingali
University of Texas, Austin

Goal of lecture

So far, we have studied
- how parallelism and locality arise in programs
- ordering constraints between tasks for correctness

This lecture: how do we assign tasks to workers?
- multicore: workers might be cores
- distributed-memory machines: workers might be machines
- a rich literature exists for dependence-graph scheduling
  - most of it is not very useful in practice because it relies on
    unrealistic program and machine models (e.g., it assumes task
    execution times are known)
  - nevertheless, it is useful to study because it builds intuition
    for the issues that arise in scheduling in practice

Dependence DAGs

- DAG with START and END nodes
  - all nodes reachable from START
  - END reachable from all nodes
  - START and END are not essential
- Nodes are computations
  - each computation can be executed by a processor in some number of
    time-steps
  - a computation may require reading/writing shared memory
  - node weight: time taken by a processor to perform that computation
  - wi is the weight of node i
- Edges are precedence constraints
  - nodes other than START can be executed only after their immediate
    predecessors in the graph have been executed
  - known as dependences
- Very old model: PERT charts (late 50s)
  - Program Evaluation and Review Technique
  - developed by the US Navy to manage Polaris submarine contracts

Computer model

- P identical processors
- processors have local memory
- all shared data is stored in global memory
- How does a processor know which nodes it must execute?
  - work assignment
- How does a processor know when it is safe to execute a node?
  - (e.g.) if P1 executes node a and P2 executes node b, how does P2
    know when P1 is done?
  - synchronization
- For now, let us defer these questions
  - in general, the time to execute the program depends on the work
    assignment
  - for now, assume only that if there is an idle processor and a ready
    node, that node is assigned immediately to the idle processor
- TP = best possible time to execute the program on P processors

Work and critical path

- Work = Σi wi
  - time required to execute the program on one processor
  - = T1
- Path weight
  - sum of the weights of the nodes on the path
- Critical path
  - path from START to END that has maximal weight
  - this work must be done sequentially, so you need this much time
    regardless of how many processors you have
  - call this T∞

Terminology

- Instantaneous parallelism
  - IP(t) = maximum number of processors that can be kept busy at each
    point in the execution of the algorithm
- Maximal parallelism
  - MP = highest instantaneous parallelism
- Average parallelism
  - AP = T1/T∞
- These are properties of the computation DAG, not of the machine or
  the work assignment

[Figure: computation DAG with node weights, and the corresponding plot
of instantaneous and average parallelism over time]

Computing critical path etc.

- Algorithm for computing earliest start times of nodes
  - keep a value called minimum-start-time (mst) with each node,
    initialized to 0
  - do a topological sort of the DAG, ignoring node weights
  - for each node n (≠ START) in topological order
      for each node p in predecessors(n)
        mstn = max(mstn, mstp + wp)
- Complexity = O(|V|+|E|)
- Critical path and instantaneous, maximal, and average parallelism can
  easily be computed from this

Speed-up

- Speed-up(P) = T1/TP
  - intuitively, how much faster is it to execute the program on P
    processors than on 1 processor?
- Bound on speed-up
  - regardless of how many processors you have, you need at least T∞
    units of time
  - speed-up(P) ≤ T1/T∞ = Σi wi / CP = AP
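The mst algorithm above can be sketched in a few lines of Python. The DAG encoding used here (a dict of node weights plus an edge list) is an illustration chosen for this sketch, not anything prescribed by the lecture:

```python
from collections import defaultdict

def earliest_start_times(weights, edges):
    """Compute the minimum start time (mst) of every node in a weighted DAG.

    weights: dict node -> weight; edges: list of (pred, succ) pairs.
    Returns (mst dict, critical-path weight T_inf).
    """
    succs = defaultdict(list)
    indeg = defaultdict(int)
    for p, s in edges:
        succs[p].append(s)
        indeg[s] += 1
    # topological traversal, ignoring node weights
    worklist = [n for n in weights if indeg[n] == 0]
    mst = {n: 0 for n in weights}
    while worklist:
        p = worklist.pop()
        for s in succs[p]:
            mst[s] = max(mst[s], mst[p] + weights[p])
            indeg[s] -= 1
            if indeg[s] == 0:
                worklist.append(s)
    # critical path = latest finish time over all nodes
    t_inf = max(mst[n] + weights[n] for n in weights)
    return mst, t_inf
```

On a diamond DAG (START fans out to a and b, which both feed END), the heavier branch determines the critical path, and AP can then be computed as T1/T∞.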

Amdahl's law

- Amdahl:
  - suppose a fraction p of a program can be done in parallel
  - suppose you have an unbounded number of parallel processors, and
    they operate infinitely fast
  - speed-up will be at most 1/(1-p)
- Follows trivially from the previous result
- Plug in some numbers:
  - p = 90%: speed-up ≤ 10
  - p = 99%: speed-up ≤ 100
- To obtain significant speed-up, most of the program must be performed
  in parallel
  - serial bottlenecks can really hurt you

Scheduling

- Suppose P < MP
- There will be times during the execution when only a subset of the
  ready nodes can be executed
- The time to execute the DAG can depend on which subset of P nodes is
  chosen for execution
- To understand this better, it is useful to have a more detailed
  machine model
- What if we only had 2 processors?
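Amdahl's bound from the slide above can be checked numerically. The helper below is my own sketch, not from the lecture; it also accepts a finite processor count n, in which case the parallel fraction takes p/n of the serial time:

```python
def amdahl_speedup(p, n=float('inf')):
    """Upper bound on speed-up when a fraction p of the work is parallel.

    The serial fraction (1 - p) always runs at full cost; the parallel
    fraction is divided across n processors. As n -> infinity the bound
    approaches 1 / (1 - p).
    """
    return 1.0 / ((1.0 - p) + p / n)
```

For example, amdahl_speedup(0.90) is about 10 and amdahl_speedup(0.99) is about 100, matching the numbers on the slide; with only 4 processors, amdahl_speedup(0.5, 4) is just 1.6.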

Machine Model

- Processors operate synchronously (in lock-step)
  - barrier synchronization in hardware
  - if a processor has reached step i, it can assume all other
    processors have completed tasks in all previous steps
- Shared memory
- Each processor also has private memory

Schedules

- Schedule: function from node to (processor, start time)
- Also known as a space-time mapping

  Schedule 1
    time: 0      1  2  3  4
    P0:   START  a  c  END
    P1:          b  d

  Schedule 2
    time: 0      1  2  3  4
    P0:   START  a  b  d  END
    P1:          c

- Intuition: nodes along the critical path should be given preference
  in scheduling

Optimal schedules

- Optimal schedule
  - shortest possible schedule for a given DAG and a given number of
    processors
- Complexity of finding optimal schedules
  - one of the most studied problems in CS
  - DAG is a tree: level-by-level schedule is optimal (Aho, Hopcroft)
  - general DAGs:
    - variable number of processors (the number of processors is an
      input to the problem): NP-complete
    - fixed number of processors:
      - 2 processors: polynomial-time algorithm
      - 3, 4, 5 processors: complexity is unknown!
- Many heuristics available in the literature

Heuristic: list scheduling

- Maintain a list of nodes that are ready to execute
  - a node is ready when all its predecessor nodes have completed
    execution
- Fill in the schedule cycle-by-cycle
  - in each cycle, choose nodes from the ready list
  - use heuristics to choose the best nodes in case you cannot schedule
    all the ready nodes
- One popular heuristic:
  - assign node priorities before scheduling
  - priority of node n: weight of the maximal-weight path from n to END
  - intuitively, the further a node is from END, the higher its priority

List scheduling algorithm

  cycle = 0;
  ready-list = {START};
  inflight-list = { };
  while (|ready-list| + |inflight-list| > 0) {
    for each node n in ready-list in priority order { // schedule new tasks
      if (a processor is free at this cycle) {
        remove n from ready-list and add it to inflight-list;
        add n to the schedule at time cycle;
      } else break;
    }
    cycle = cycle + 1; // increment time
    for each node n in inflight-list { // determine ready tasks
      if (n finishes at time cycle) {
        remove n from inflight-list;
        add every ready successor of n in the DAG to ready-list;
      }
    }
  }

- For the example DAG, the heuristic picks the good schedule
  (P0: START a c END; P1: b d)
- Not always guaranteed to produce an optimal schedule (otherwise we
  would have a polynomial-time algorithm!)
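The pseudocode above can be turned into a runnable sketch. The Python below is my own rendering (the DAG encoding and function names are assumptions, not from the lecture); it fills in a schedule cycle-by-cycle on P identical processors, taking ready nodes in priority order:

```python
from collections import defaultdict

def list_schedule(weights, edges, num_procs, priority):
    """Cycle-by-cycle list scheduling of a weighted DAG (a sketch).

    weights: dict node -> integer weight; edges: (pred, succ) pairs;
    priority: dict node -> priority (higher is scheduled first).
    Returns dict node -> (processor, start cycle).
    """
    succs, preds_left = defaultdict(list), defaultdict(int)
    for p, s in edges:
        succs[p].append(s)
        preds_left[s] += 1
    ready = [n for n in weights if preds_left[n] == 0]
    inflight = {}              # node -> cycle at which it finishes
    free_at = [0] * num_procs  # cycle at which each processor is next free
    schedule, cycle = {}, 0
    while ready or inflight:
        # schedule new tasks, highest priority first
        for n in sorted(ready, key=lambda m: -priority[m]):
            proc = next((i for i in range(num_procs)
                         if free_at[i] <= cycle), None)
            if proc is None:
                break          # no processor free at this cycle
            ready.remove(n)
            schedule[n] = (proc, cycle)
            free_at[proc] = inflight[n] = cycle + weights[n]
        cycle += 1             # increment time
        # determine tasks that become ready at this cycle
        for n in [m for m, f in inflight.items() if f == cycle]:
            del inflight[n]
            for s in succs[n]:
                preds_left[s] -= 1
                if preds_left[s] == 0:
                    ready.append(s)
    return schedule
```

Note one fix relative to a literal transcription: the processor check and the break are placed so that scheduling stops for the cycle as soon as no processor is free, as the slide intends.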

Generating dependence graphs

- How do we produce dependence graphs in the first place?
- Two approaches
  - specify the DAG explicitly
    - parallel programming
    - easy to make mistakes
    - data races: two tasks that write to the same location but are not
      ordered by a dependence
  - by compiler analysis of sequential programs
    - called dependence analysis
- Let us study the second approach

Data dependence

- Basic blocks
  - straight-line code
- Nodes represent statements
- Edge S1 → S2:
  - flow dependence (read-after-write (RAW))
    - S1 is executed before S2 in the basic block
    - S1 writes to a variable that is read by S2
  - anti-dependence (write-after-read (WAR))
    - S1 is executed before S2 in the basic block
    - S1 reads from a variable that is written by S2
  - output dependence (write-after-write (WAW))
    - S1 is executed before S2 in the basic block
    - S1 and S2 write to the same variable
  - input dependence (read-after-read (RAR)) (usually not important)
    - S1 is executed before S2 in the basic block
    - S1 and S2 read from the same variable
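Given per-statement read and write sets, the first three dependence kinds can be computed mechanically. The encoding below (a list of (writes, reads) sets per statement) is a hypothetical illustration, not real compiler IR:

```python
def classify_dependences(stmts):
    """Classify pairwise dependences in a basic block (straight-line code).

    stmts: list of (writes, reads) sets per statement, in execution order.
    Returns a list of (i, j, kind) triples for statement pairs i < j.
    """
    deps = []
    for i, (w1, r1) in enumerate(stmts):
        for j in range(i + 1, len(stmts)):
            w2, r2 = stmts[j]
            if w1 & r2:
                deps.append((i, j, 'flow (RAW)'))   # S1 writes what S2 reads
            if r1 & w2:
                deps.append((i, j, 'anti (WAR)'))   # S1 reads what S2 writes
            if w1 & w2:
                deps.append((i, j, 'output (WAW)')) # both write the same var
    return deps
```

For the two-statement block `x = a + b; a = x * 2`, this reports a flow dependence (through x) and an anti-dependence (through a) from the first statement to the second.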

Conservative approximation

- In real programs, we often cannot determine precisely whether a
  dependence exists
  - in the example: if i = j, the dependence exists; if i ≠ j, it does not
  - the dependence may exist for some invocations and not for others
- Conservative approximation
  - when in doubt, assume the dependence exists
  - at worst, this will prevent us from executing some statements in
    parallel even when doing so would be safe
- Aliasing: two program names for the same storage location
  - (e.g.) X(i) and X(j) are may-aliases
  - may-aliasing is the major source of imprecision in dependence
    analysis

Putting it all together

- Write a sequential program.
- Compiler produces parallel code
  - generates the control-flow graph
  - produces a computation DAG for each basic block by performing
    dependence analysis
  - generates a schedule for each basic block
    - use list scheduling or some other heuristic
  - the branch at the end of a basic block is scheduled on all
    processors
- Problem:
  - the average basic block is fairly small (~5 RISC instructions)
- One solution:
  - transform the program to produce bigger basic blocks

One transformation: loop unrolling

Original program:

  for i = 1,100
    X(i) = i

Unroll the loop 4 times: not very useful!

  for i = 1,100,4
    X(i) = i
    i = i+1
    X(i) = i
    i = i+1
    X(i) = i
    i = i+1
    X(i) = i

Smarter loop unrolling

Use a new name for the loop iteration variable in each unrolled
instance:

  for i = 1,100,4
    X(i) = i
    i1 = i+1
    X(i1) = i1
    i2 = i+2
    X(i2) = i2
    i3 = i+3
    X(i3) = i3
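One way to see why renaming helps is to count the scalar dependences between statements of the two unrolled bodies. The (writes, reads) encoding below is my own illustration (array stores are modeled only through the scalar index variables they read):

```python
def scalar_deps(stmts):
    """List (i, j) pairs of statements ordered by a scalar dependence.

    stmts: list of (writes, reads) sets of scalar variable names,
    one pair per statement, in program order.
    """
    out = []
    for a in range(len(stmts)):
        for b in range(a + 1, len(stmts)):
            w1, r1 = stmts[a]
            w2, r2 = stmts[b]
            # flow (RAW), anti (WAR), or output (WAW) through any scalar
            if (w1 & r2) or (r1 & w2) or (w1 & w2):
                out.append((a, b))
    return out

# Naive unrolling reuses i:  X(i)=i; i=i+1; X(i)=i; i=i+1; ...
naive = [(set(), {'i'}), ({'i'}, {'i'})] * 4

# Smarter unrolling renames:  X(i)=i; i1=i+1; X(i1)=i1; i2=i+2; ...
renamed = [(set(), {'i'}),
           ({'i1'}, {'i'}), (set(), {'i1'}),
           ({'i2'}, {'i'}), (set(), {'i2'}),
           ({'i3'}, {'i'}), (set(), {'i3'})]
```

The naive body is tied into a long chain of anti- and output dependences through i, while the renamed body has only one short flow dependence per instance (i1, i2, i3), so the instances are mutually independent.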

Array dependence analysis

If the compiler can also figure out that X(i), X(i+1), X(i+2), and
X(i+3) are different locations, we get a dependence graph in which the
four assignments in the loop body are mutually independent:

  for i = 1,100,4
    X(i) = i
    i1 = i+1
    X(i1) = i1
    i2 = i+2
    X(i2) = i2
    i3 = i+3
    X(i3) = i3

- We will study techniques for array dependence analysis later in the
  course
- The problem can be formulated as an integer linear programming
  problem: is there an integer point within a certain polyhedron
  derived from the loop bounds and the array subscripts?
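For the unrolled loop above, the question "can references X(i+a) and X(i+b) ever touch the same element?" is exactly such an integer feasibility problem. Real compilers solve it symbolically (ILP over the loop bounds and subscripts); for these tiny bounds a brute-force sketch makes the idea concrete:

```python
def dependence_exists(a, b, lo=1, hi=97, step=4):
    """Brute-force integer feasibility test (a sketch, not a compiler
    algorithm): is there a pair of iterations i1, i2 of the stride-4
    loop with i1 + a == i2 + b, i.e. do references X(i+a) and X(i+b)
    ever name the same array element?"""
    iters = range(lo, hi + 1, step)
    return any(i1 + a == i2 + b for i1 in iters for i2 in iters)
```

Because every iteration value is congruent to 1 mod 4, offsets 0..3 within the body can never collide across iterations, which is what licenses the dependence-free graph above; a reference pair like X(i) and X(i+4) would collide.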

Two applications

- Static scheduling
  - create the space-time diagram at compile-time
  - application: VLIW code generation
- Dynamic scheduling
  - create the space-time diagram at runtime
  - application: multicore scheduling for dense linear algebra

Scheduling instructions for VLIW machines

- Correspondence with the abstract model:
  - processors → functional units
  - local memories → registers
  - global memory → memory
- Nodes in the DAG are operations (load/store/add/mul/branch/...)
  - instruction-level parallelism
- List scheduling is useful for scheduling code for pipelined,
  superscalar, and VLIW machines
  - used widely in commercial compilers
  - loop unrolling and array dependence analysis are also used widely

Historical note on VLIW processors

- Ideas originated in the late 70s-early 80s
- Two key people:
  - Bob Rau (Stanford, UIUC, TRW, Cydrome, HP)
  - Josh Fisher (NYU, Yale, Multiflow, HP)
- Bob Rau's contributions:
  - transformations for making basic blocks larger: software pipelining
  - hardware support for these techniques
    - predicated execution
    - rotating register files
  - most of these ideas were later incorporated into the Intel Itanium
    processor
- Josh Fisher's contributions:
  - transformations for making basic blocks larger: trace scheduling,
    which uses the key idea of branch probabilities
  - the Multiflow compiler used loop unrolling

DAG scheduling for multicores

- Reality: it is hard to build a single-cycle memory that can be
  accessed by large numbers of cores
- Architectural change: decouple the cores so there is no notion of a
  global step
  - each core/processor has its own PC and cache
  - memory is accessed independently by each core
- New problem: since cores do not operate in lock-step, how does a core
  know when it is safe to execute a node?
  - (e.g.) P0 executes a, P1 executes b, P2 executes c and then d: how
    does P2 know when P0 and P1 are done?
- Solution: software synchronization
  - a counter is associated with each DAG node, decremented when a
    predecessor task is done
- Software synchronization increases the overhead of parallel execution
  - cannot afford to synchronize at the instruction level
  - nodes of the DAG must be coarse-grain: loop iterations or larger
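The counter scheme can be sketched with worker threads. Everything below (names, the task encoding, the queue-based workers) is my own illustration of the idea, not the implementation of any particular runtime:

```python
import queue
import threading
from collections import defaultdict

def run_dag(tasks, edges, num_workers=4):
    """Run a task DAG using counter-based software synchronization.

    tasks: dict name -> callable; edges: (pred, succ) pairs.
    Each node carries a count of unfinished predecessors; a finishing
    task decrements its successors' counters under a lock, and a counter
    reaching zero means every predecessor is done, so it is safe to run
    that successor. Returns the completion order, for inspection.
    """
    if not tasks:
        return []
    succs = defaultdict(list)
    counter = {n: 0 for n in tasks}
    for p, s in edges:
        succs[p].append(s)
        counter[s] += 1
    lock = threading.Lock()
    work = queue.Queue()
    remaining = [len(tasks)]
    order = []
    for n in tasks:
        if counter[n] == 0:
            work.put(n)            # no predecessors: ready immediately

    def worker():
        while True:
            n = work.get()
            if n is None:          # sentinel: all tasks are done
                return
            tasks[n]()             # execute the task body (no lock held)
            with lock:
                order.append(n)
                remaining[0] -= 1
                if remaining[0] == 0:
                    for _ in range(num_workers):
                        work.put(None)
                for s in succs[n]:
                    counter[s] -= 1      # one more predecessor is done
                    if counter[s] == 0:
                        work.put(s)      # all predecessors done: s is ready

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Note how all synchronization cost is per DAG node, which is why the nodes must be coarse-grain for this to pay off.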

Increasing granularity: block matrix algorithms

Original matrix multiplication:

  for I = 1,N
    for J = 1,N
      for K = 1,N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)

Block (tiled) matrix multiplication:

  for IB = 1,N step B
    for JB = 1,N step B
      for KB = 1,N step B
        for I = IB, IB+B-1
          for J = JB, JB+B-1
            for K = KB, KB+B-1
              C(I,J) = C(I,J) + A(I,K)*B(K,J)

For a 2x2 blocking, the block computations are:

  C00 = A00*B00 + A01*B10
  C01 = A00*B01 + A01*B11
  C10 = A10*B00 + A11*B10
  C11 = A10*B01 + A11*B11

New problem

- It is difficult to get accurate execution times for coarse-grain
  nodes:
  - conditionals inside loop iterations
  - cache misses
  - O/S processes
  - parallel loops
- Solution: runtime scheduling
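The tiled loop nest performs exactly the same arithmetic as the original triply nested loop, just reordered so each (IB, JB, KB) step touches only B-by-B tiles. A pure-Python sketch (list-of-lists matrices are an assumption chosen for illustration):

```python
def matmul_tiled(A, B, n, b):
    """Blocked (tiled) n x n matrix multiplication with tile size b.

    Each (IB, JB, KB) iteration is one coarse-grain DAG node: it updates
    one b x b tile of C using one tile of A and one tile of B. The min()
    guards also handle n not divisible by b.
    """
    C = [[0] * n for _ in range(n)]
    for IB in range(0, n, b):
        for JB in range(0, n, b):
            for KB in range(0, n, b):
                # one block computation: C tile += A tile * B tile
                for i in range(IB, min(IB + b, n)):
                    for j in range(JB, min(JB + b, n)):
                        for k in range(KB, min(KB + b, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

Because only the summation order changes (and integer addition is associative and commutative), the tiled result matches the naive triple loop exactly.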

Example: DAGuE

- Dongarra et al. (UTK)
- Programming model for specifying DAGs for parallel blocked dense
  linear algebra codes
  - nodes: block computations
  - DAG edges specified by the programmer (see next slide)
- Runtime system
  - keeps track of ready nodes
  - assigns ready nodes to cores
  - determines whether new nodes become ready when a node completes

DAGuE: Tiled QR

- Tiled QR (using tiles and in/out notations)
- [Figure: dataflow graph for a 2x2 processor grid]
- Machine: 81 nodes, 648 cores

Summary of multicore

- The DAG of tasks is known
- Each task is heavy-weight, and executing a task on one worker
  exploits adequate locality
- No assumptions about the runtimes of tasks
- No lock-step execution of processors or synchronous global memory
- Keep a work-list of tasks that are ready to execute
- Use heuristic priorities to choose from the ready tasks

Summary

- Dependence graphs
  - nodes are computations
  - edges are dependences
- Static dependence graphs: obtained by
  - studying the algorithm
  - analyzing the program
- Limits on speed-ups
  - critical path
  - Amdahl's law
- DAG scheduling
  - heuristic: list scheduling (many variations)
  - static and dynamic scheduling
  - applications: VLIW code generation, multicore scheduling for dense
    linear algebra
- Major limitations:
  - works for topology-driven algorithms with fixed neighborhoods,
    since we know the tasks and dependences before executing the
    program
  - not very useful for data-driven algorithms, since tasks are created
    dynamically
  - one solution: work-stealing and work-sharing. Study later.