
Scheduling

Keshav Pingali
University of Texas, Austin

Context: how parallelism and locality arise in programs, and the ordering constraints between tasks needed for correctness or efficiency.

This lecture: how do we assign tasks to workers?
- multicore: workers might be cores
- distributed-memory machines: workers might be hosts/machines

A rich literature exists for dependence graph scheduling, but most of it is not very useful in practice because it uses unrealistic program and machine models (e.g., it assumes that task execution times are known). Nevertheless, it is useful to study because it gives us intuition for the issues that arise in scheduling in practice.

Machine and program model

Machine model:
- P identical processors
- processors have local memory
- all shared data is stored in global memory

Program model: a dependence DAG with START and END nodes.
- all nodes are reachable from START, and END is reachable from all nodes (START and END are not essential)
- nodes are computations: each computation can be executed by a processor in some number of time steps, and a computation may require reading/writing shared memory
- node weight wi: the time taken by a processor to perform computation i
- edges are precedence constraints, known as dependences: a node other than START can be executed only after all its immediate predecessors in the graph have been executed

Questions this model raises:
- work assignment: how does a processor know which nodes it must execute?
- synchronization: if P1 executes node a and P2 executes a node b that depends on a, how does P2 know when P1 is done?

For now, let us defer these questions. In general, the time to execute the program depends on the work assignment; for now, assume only that if there is an idle processor and a ready node, that node is assigned immediately to an idle processor.

TP = best possible time to execute the program on P processors.

This is a very old model: PERT charts (late 50s), the Program Evaluation and Review Technique, were developed by the US Navy to manage Polaris submarine contracts.

Work and critical path

Work = Σi wi
- the time required to execute the program on one processor
- call this T1

Path weight: the sum of the weights of the nodes on a path.

Critical path: the path from START to END that has maximal weight. This work must be done sequentially, so you need this much time regardless of how many processors you have; call this T∞.

Terminology

- Instantaneous parallelism IP(t): the maximum number of processors that can be kept busy at each point in the execution of the algorithm
- Maximal parallelism MP: the highest instantaneous parallelism
- Average parallelism AP = T1/T∞

These are properties of the computation DAG, not of the machine or the work assignment.

Computing earliest start times

Algorithm for computing the earliest start times of nodes:
- keep a value minimum-start-time (mst) with each node, initialized to 0
- do a topological sort of the DAG, ignoring node weights
- for each node n (≠ START) in topological order:
    for each node p in predecessors(n):
      mst_n = max(mst_n, mst_p + w_p)

Complexity = O(|V| + |E|).

The critical path and the instantaneous, maximal, and average parallelism can easily be computed from this.

Speed-up

Speed-up(P) = T1/TP: intuitively, how much faster is it to execute the program on P processors than on 1 processor?

Bound on speed-up: regardless of how many processors you have, you need at least T∞ units of time, so

  speed-up(P) ≤ T1/T∞ = Σi wi / CP = AP
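The mst computation above can be sketched directly in Python; the diamond DAG at the bottom is a hypothetical example, not one from the lecture.

```python
from collections import deque

def earliest_start_times(weights, preds):
    """Compute mst (minimum start time) for every node of a weighted DAG.

    weights: dict node -> weight w_i
    preds:   dict node -> list of immediate predecessors
    """
    # Build successor lists and in-degree counts for a topological sort.
    succs = {n: [] for n in weights}
    indeg = {n: 0 for n in weights}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)
            indeg[n] += 1

    mst = {n: 0 for n in weights}          # minimum start time, initialized to 0
    queue = deque(n for n in weights if indeg[n] == 0)
    while queue:                           # process nodes in topological order
        p = queue.popleft()
        for n in succs[p]:
            # A node cannot start before any of its predecessors finishes.
            mst[n] = max(mst[n], mst[p] + weights[p])
            indeg[n] -= 1
            if indeg[n] == 0:
                queue.append(n)
    return mst

# Hypothetical diamond DAG: START -> a, b -> END, all weights 1.
w = {"START": 1, "a": 1, "b": 1, "END": 1}
p = {"START": [], "a": ["START"], "b": ["START"], "END": ["a", "b"]}
mst = earliest_start_times(w, p)
T1 = sum(w.values())            # work: 4
Tinf = mst["END"] + w["END"]    # critical-path length: 3
AP = T1 / Tinf                  # average parallelism: 4/3
```

The same O(|V|+|E|) pass thus yields the work, the critical path, and the average parallelism for any DAG.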


Amdahl's law

Suppose a fraction p of a program can be done in parallel, and suppose you have an unbounded number of parallel processors that operate infinitely fast. Then the speed-up will be at most 1/(1-p). This follows trivially from the previous result.

Plug in some numbers:
- p = 90%: speed-up ≤ 10
- p = 99%: speed-up ≤ 100

To obtain significant speed-up, most of the program must be performed in parallel; serial bottlenecks can really hurt you.

Scheduling with limited processors

Suppose P < MP. There will be times during the execution when only a subset of the ready nodes can be executed, and the time to execute the DAG can depend on which subset of P nodes is chosen for execution. To understand this better, it is useful to have a more detailed machine model.
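The numbers above fall out of a two-line formula. The sketch below also includes the standard finite-processor form of Amdahl's bound (with n processors, the parallel fraction takes p/n), which is not stated on the slide but follows from the same argument.

```python
def amdahl_speedup(p, n=None):
    """Upper bound on speed-up when a fraction p of the program is
    parallelizable; n=None models an unbounded number of infinitely
    fast processors, giving the 1/(1-p) bound."""
    serial = 1.0 - p            # the serial fraction cannot be sped up
    parallel = 0.0 if n is None else p / n
    return 1.0 / (serial + parallel)

print(amdahl_speedup(0.90))        # about 10: 90% parallel caps speed-up at 10x
print(amdahl_speedup(0.99))        # about 100
print(amdahl_speedup(0.90, n=8))   # finite-processor variant of the bound
```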

Schedules

What if we only had 2 processors?

A more detailed machine model:
- processors operate synchronously (in lock-step), with barrier synchronization in hardware
- shared memory: at each step, a processor can assume all other processors have completed their tasks from all previous steps
- each processor also has private memory

Schedule: a function from nodes to (processor, start time) pairs; also known as a space-time mapping.

Example DAG (all weights 1): START feeds a, b, and c; d depends on a and b; END depends on c and d. Two possible schedules on 2 processors:

Schedule 1
  time  0      1  2  3
  P0    START  a  c  END
  P1           b  d

Schedule 2
  time  0      1  2  3  4
  P0    START  a  b  d  END
  P1           c

Intuition: nodes along the critical path should be given preference in scheduling.

Optimal schedules

An optimal schedule is the shortest possible schedule for a given DAG and a given number of processors.

Complexity of finding optimal schedules: one of the most studied problems in CS.
- DAG is a tree: a level-by-level schedule is optimal (Aho, Hopcroft)
- general DAGs:
  - variable number of processors (the number of processors is an input to the problem): NP-complete
  - fixed number of processors: for 2 processors there is a polynomial-time algorithm; for 3, 4, or 5 processors the complexity is unknown!
- many heuristics are available in the literature

Heuristic: list scheduling

Maintain a list of ready nodes: nodes all of whose predecessors have completed execution. Fill in the schedule cycle-by-cycle: in each cycle, choose nodes from the ready list, using heuristics to choose the best nodes in case you cannot schedule all the ready nodes.

One popular heuristic: assign node priorities before scheduling.
- priority of node n: the weight of the maximal-weight path from n to END
- intuitively, the further a node is from END, the higher its priority

List scheduling algorithm

  cycle c = 0;
  ready-list = {START};
  inflight-list = { };
  while (|ready-list| + |inflight-list| > 0) {
    while (|ready-list| > 0) {
      if (a processor is free at cycle c) {
        remove highest-priority node n from ready-list and add it to inflight-list;
        add n to schedule at time c;
      }
      else break;
    }
    c = c + 1;                               // increment time
    for each node n in inflight-list {       // determine ready tasks
      if (n finishes at time c) {
        remove n from inflight-list;
        add every ready successor of n in the DAG to ready-list;
      }
    }
  }

Example: on the DAG above, the priorities are END = 1, c = d = 2, a = b = 3, START = 4. At cycle 1 the scheduler prefers a and b (priority 3) over c (priority 2), which yields the good schedule:

  time  0      1  2  3
  P0    START  a  c  END
  P1           b  d

The heuristic picks the good schedule here, but list scheduling is not always guaranteed to produce an optimal schedule (otherwise we would have a polynomial-time algorithm!).
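The pseudocode above can be rendered as a short Python function. The priority heuristic is the maximal-weight path to END described earlier; the diamond DAG at the bottom is a hypothetical example, not one from the lecture.

```python
def list_schedule(weights, preds, num_procs):
    """Cycle-by-cycle list scheduler; returns {node: (processor, start_cycle)}."""
    succs = {n: [] for n in weights}
    for n, ps in preds.items():
        for p in ps:
            succs[p].append(n)

    prio = {}
    def priority(n):
        # Priority heuristic: weight of the maximal-weight path from n to END.
        if n not in prio:
            prio[n] = weights[n] + max((priority(s) for s in succs[n]), default=0)
        return prio[n]

    remaining = {n: len(preds[n]) for n in weights}   # unexecuted predecessors
    ready = [n for n in weights if remaining[n] == 0]
    inflight = {}    # node -> (processor, finish cycle)
    schedule = {}    # node -> (processor, start cycle)
    c = 0
    while ready or inflight:
        busy = {proc for proc, _ in inflight.values()}
        ready.sort(key=priority, reverse=True)
        # Assign the highest-priority ready nodes to the free processors.
        while ready and len(busy) < num_procs:
            n = ready.pop(0)
            proc = min(set(range(num_procs)) - busy)
            schedule[n] = (proc, c)
            inflight[n] = (proc, c + weights[n])
            busy.add(proc)
        c += 1                                        # increment time
        for n in [m for m, (_, fin) in inflight.items() if fin == c]:
            del inflight[n]                           # n finishes at cycle c
            for s in succs[n]:
                remaining[s] -= 1
                if remaining[s] == 0:
                    ready.append(s)                   # s has become ready
    return schedule

# Hypothetical diamond DAG: START -> a, b -> END, all weights 1.
w = {"START": 1, "a": 1, "b": 1, "END": 1}
p = {"START": [], "a": ["START"], "b": ["START"], "END": ["a", "b"]}
s = list_schedule(w, p, 2)
```

On this DAG the schedule finishes at cycle 3, matching the critical-path length.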


Generating dependence graphs

How do we produce dependence graphs in the first place? Two approaches:
- specify the DAG explicitly (parallel programming)
  - easy to make mistakes: data races, i.e. two tasks that write to the same location but are not ordered by a dependence
- by compiler analysis of sequential programs, called dependence analysis

Let us study the second approach.

Data dependence

Consider basic blocks: straight-line code. Nodes represent statements. There is an edge S1 → S2 when S1 is executed before S2 in the basic block and:
- flow dependence (read-after-write, RAW): S1 writes to a variable that is read by S2
- anti-dependence (write-after-read, WAR): S1 reads from a variable that is written by S2
- output dependence (write-after-write, WAW): S1 and S2 write to the same variable
- input dependence (read-after-read, RAR): S1 and S2 read from the same variable (usually not important)
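The four kinds of dependence can be computed mechanically from the read and write sets of each statement. A toy illustration (the statements in the comments are hypothetical examples, not from the lecture):

```python
def dependences(s1, s2):
    """Classify dependences from earlier statement s1 to later statement s2.

    Each statement is modeled as a (reads, writes) pair of variable sets.
    """
    r1, w1 = s1
    r2, w2 = s2
    kinds = []
    if w1 & r2: kinds.append("flow (RAW)")      # s1 writes what s2 reads
    if r1 & w2: kinds.append("anti (WAR)")      # s1 reads what s2 writes
    if w1 & w2: kinds.append("output (WAW)")    # both write the same variable
    if r1 & r2: kinds.append("input (RAR)")     # both read the same variable
    return kinds

S1 = ({"b", "c"}, {"a"})   # a = b + c
S2 = ({"a", "d"}, {"b"})   # b = a * d
print(dependences(S1, S2))  # -> ['flow (RAW)', 'anti (WAR)']
```

Building the DAG for a basic block amounts to running this check for every ordered pair of statements and adding an edge whenever a RAW, WAR, or WAW dependence is found.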

Conservative approximation

In real programs, we often cannot determine precisely whether a dependence exists. In the example, for references X(i) and X(j):
- if i = j, the dependence exists
- if i ≠ j, the dependence does not exist
- the dependence may exist for some invocations and not for others

Conservative approximation: when in doubt, assume the dependence exists. At worst, this will prevent us from executing some statements in parallel even when that would be legal.

Aliasing: two program names for the same storage location. For example, X(i) and X(j) are may-aliases. May-aliasing is the major source of imprecision in dependence analysis.

Compiling for parallelism

Write a sequential program; the compiler produces parallel code:
- generates the control-flow graph
- produces a computation DAG for each basic block by performing dependence analysis
- generates a schedule for each basic block, using list scheduling or some other heuristic
- the branch at the end of a basic block is scheduled on all processors

Problem: the average basic block is fairly small (~5 RISC instructions).

One solution: transform the program to produce bigger basic blocks.


One transformation: loop unrolling

Original program:

  for i = 1,100
    X(i) = i

Unrolling the loop 4 times is not very useful by itself, because the increments of i create dependences between the unrolled copies:

  for i = 1,100,4
    X(i) = i
    i = i+1
    X(i) = i
    i = i+1
    X(i) = i
    i = i+1
    X(i) = i

Smarter loop unrolling

Use a new name for the loop iteration variable in each unrolled instance:

  for i = 1,100,4
    X(i) = i
    i1 = i+1
    X(i1) = i1
    i2 = i+2
    X(i2) = i2
    i3 = i+3
    X(i3) = i3

If the compiler can also figure out that X(i), X(i+1), X(i+2), and X(i+3) are different locations, the statements of the loop body become independent in the dependence graph and can be scheduled in parallel.

Array dependence analysis

We will study techniques for array dependence analysis later in the course. The problem can be formulated as an integer linear programming problem: is there an integer point within a certain polyhedron derived from the loop bounds and the array subscripts?
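The integer-feasibility view of array dependence testing can be illustrated with a brute-force check over a small iteration space. This is a toy stand-in for a real dependence test (which would use integer linear programming rather than enumeration), and the subscript functions below are hypothetical examples:

```python
def loop_carried_dependence(f, g, lo, hi):
    """Do two iterations i1 != i2 in [lo, hi] ever touch the same array
    element, i.e., is there an integer solution to f(i1) == g(i2)?"""
    return any(f(i1) == g(i2)
               for i1 in range(lo, hi + 1)
               for i2 in range(lo, hi + 1)
               if i1 != i2)

# X(2i) written vs X(2i+1) read: even never equals odd, so no dependence.
print(loop_carried_dependence(lambda i: 2 * i, lambda i: 2 * i + 1, 1, 100))  # -> False

# X(i) written vs X(i+1) read: iteration i touches what iteration i+1 reads.
print(loop_carried_dependence(lambda i: i, lambda i: i + 1, 1, 100))  # -> True
```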


Two applications

Static scheduling:
- create the space-time diagram at compile-time
- application: VLIW code generation

Dynamic scheduling:
- create the space-time diagram at runtime
- application: multicore scheduling for dense linear algebra

Scheduling instructions for VLIW machines

The model maps onto VLIW machines as follows: processors become functional units, local memories become registers, and global memory becomes memory. Nodes in the DAG are operations (load/store/add/mul/branch/...); this is instruction-level parallelism, and the operations scheduled in one cycle are packed into one long instruction.

List scheduling is useful for scheduling code for pipelined, superscalar, and VLIW machines, and is used widely in commercial compilers. Loop unrolling and array dependence analysis are also used widely.

Some history

These ideas originated in the late 70s-early 80s. Two key people:
- Bob Rau (Stanford, UIUC, TRW, Cydrome, HP)
- Josh Fisher (NYU, Yale, Multiflow, HP)

Bob Rau's contributions:
- transformations for making basic blocks larger: predication, software pipelining
- hardware support for these techniques: predicated execution, rotating register files
- most of these ideas were later incorporated into the Intel Itanium processor

Josh Fisher's contributions:
- transformations for making basic blocks larger: trace scheduling, which uses the key idea of branch probabilities
- the Multiflow compiler used loop unrolling

Multicore processors

Reality: it is hard to build a single-cycle memory that can be accessed by large numbers of cores.

Architectural change: decouple the cores so there is no notion of a global step.
- each core/processor has its own PC and cache
- memory is accessed independently by each core

New problem: since cores do not operate in lock-step, how does a core know when it is safe to execute a node? For example, if P0 executes a, P1 executes b, and P2 executes c and then d, how does P2 know when P0 and P1 are done?

Solution: software synchronization. A counter is associated with each DAG node and is decremented when a predecessor task is done; when the counter reaches zero, the node is safe to execute.

Software synchronization increases the overhead of parallel execution: we cannot afford to synchronize at the instruction level, so the nodes of the DAG must be coarse-grain, such as loop iterations.
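The counter scheme can be sketched with threads and a shared work queue. This is a minimal sketch under the slide's assumptions (coarse-grain tasks, one counter per node); the DAG here is a hypothetical four-node example echoing the P0/P1/P2 discussion above, and "executing" a task just records its name.

```python
import threading
from queue import Queue

# Hypothetical DAG: a, b, c all feed d.
succs = {"a": ["d"], "b": ["d"], "c": ["d"], "d": []}
counter = {"a": 0, "b": 0, "c": 0, "d": 3}   # per-node counter:
                                             # number of unfinished predecessors
NUM_WORKERS = 3
ready = Queue()
order = []                                   # execution record, for inspection
lock = threading.Lock()

def worker():
    while True:
        n = ready.get()
        if n is None:                        # shutdown signal
            break
        with lock:
            order.append(n)                  # "execute" the coarse-grain task
            for s in succs[n]:
                counter[s] -= 1              # one predecessor of s is done
                if counter[s] == 0:          # all predecessors done:
                    ready.put(s)             # s is now safe to execute
        ready.task_done()

for n, c in counter.items():                 # seed the queue with ready nodes
    if c == 0:
        ready.put(n)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
ready.join()                                 # wait until all tasks are done
for _ in threads:
    ready.put(None)
for t in threads:
    t.join()
```

Whatever order the workers pick a, b, and c in, d runs only after its counter reaches zero, i.e. last.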


Increasing granularity: block matrix algorithms

Original matrix multiplication:

  for I = 1,N
    for J = 1,N
      for K = 1,N
        C(I,J) = C(I,J) + A(I,K)*B(K,J)

Block (tiled) matrix multiplication:

  for IB = 1,N step B
    for JB = 1,N step B
      for KB = 1,N step B
        for I = IB, IB+B-1
          for J = JB, JB+B-1
            for K = KB, KB+B-1
              C(I,J) = C(I,J) + A(I,K)*B(K,J)

For a 2x2 blocking of A, B, and C, the block computations are:

  C00 = A00*B00 + A01*B10
  C01 = A00*B01 + A01*B11
  C10 = A10*B00 + A11*B10
  C11 = A10*B01 + A11*B11

New problem

It is difficult to get accurate execution times for coarse-grain nodes, because of:
- conditionals inside loop iterations
- cache misses
- exceptions
- O/S processes
- parallel loops

Solution: runtime scheduling.

Dongarra et al (UTK): a programming model for specifying DAGs for parallel blocked dense linear algebra codes.
- nodes: block computations
- DAG edges are specified by the programmer (see next slides)
- the runtime system keeps track of ready nodes, assigns ready nodes to cores, and determines whether new nodes become ready when a node completes

Tiled QR (using tiles and in/out notations).
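The tiled loop nest above can be written out in plain Python and checked against the untiled version; the tiling only reorders the additions into per-tile block computations, so the results agree. N and B below are hypothetical sizes chosen so that B divides N.

```python
def matmul(A, M, N):
    """Plain triple-loop matrix multiplication of N x N lists of lists."""
    C = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            for k in range(N):
                C[i][j] += A[i][k] * M[k][j]
    return C

def tiled_matmul(A, M, N, B):
    """Block (tiled) matrix multiplication with B x B tiles."""
    C = [[0] * N for _ in range(N)]
    for ib in range(0, N, B):                 # loop over tile origins
        for jb in range(0, N, B):
            for kb in range(0, N, B):
                for i in range(ib, ib + B):   # loops within one tile
                    for j in range(jb, jb + B):
                        for k in range(kb, kb + B):
                            C[i][j] += A[i][k] * M[k][j]
    return C

N, B = 4, 2
A = [[i * N + j for j in range(N)] for i in range(N)]
M = [[(i + j) % 3 for j in range(N)] for i in range(N)]
assert tiled_matmul(A, M, N, B) == matmul(A, M, N)   # tiling preserves the result
```

Each (ib, jb, kb) iteration is exactly one of the 2x2 block products listed above, which is what makes the block computations natural coarse-grain DAG nodes.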

DAGuE: tiled QR

Dataflow graph for a 2x2 processor grid; machine: 81 nodes, 648 cores.

Summary of multicore scheduling

Assumptions:
- the DAG of tasks is known
- each task is heavy-weight, and executing a task on one worker exploits adequate locality
- no assumptions about the runtimes of tasks
- no lock-step execution of processors and no synchronous global memory

Scheduling:
- keep a work-list of tasks that are ready to execute
- use heuristic priorities to choose from the ready tasks

Summary

Dependence graphs: nodes are computations, edges are dependences. Static dependence graphs are obtained by studying the algorithm or by analyzing the program.

Limits on speed-ups: the critical path and Amdahl's law.

DAG scheduling:
- heuristic: list scheduling (many variations)
- static and dynamic scheduling
- applications: VLIW code generation, multicore scheduling for dense linear algebra

Major limitations:
- works for topology-driven algorithms with fixed neighborhoods, since we know the tasks and dependences before executing the program
- not very useful for data-driven algorithms, since tasks are created dynamically
- one solution: work-stealing and work-sharing, which we will study later
