Dependency Analysis:
Types of dependency
Dependency Graphs
Bernstein's Conditions of Parallelism
Asymptotic Notations for Algorithm Complexity Analysis
Parallel Random-Access Machine (PRAM)
Example: sum algorithm on P processor PRAM
Network Model of Message-Passing Multicomputers
Example: Asynchronous Matrix Vector Product on a Ring
Levels of Parallelism in Program Execution
Hardware Vs. Software Parallelism
Parallel Task Grain Size
Software Parallelism Types: Data Vs. Functional Parallelism
Example Motivating Problem With high levels of concurrency
Limited Parallel Program Concurrency: Amdahl's Law
Parallel Performance Metrics: Degree of Parallelism (DOP)
Concurrency Profile + Average Parallelism
Steps in Creating a Parallel Program:
1- Decomposition, 2- Assignment, 3- Orchestration, 4- (Mapping + Scheduling)
Program Partitioning Example (handout)
Static Multiprocessor Scheduling Example (handout)
A task only executes on one processor to which it has been mapped or allocated
EECC756 - Shaaban
#4 lec # 3 Spring2011 3-17-2011
Conditions of Parallelism:
Data & Name Dependence
As part of the algorithm/computation
(True) Data (or Flow) Dependence
Assume task S2 follows task S1 in sequential program order.
If task S1 produces one or more results used by task S2, then task S2 is said to be data dependent on task S1 (S1 → S2).
Changing the relative execution order of tasks S1, S2 in the parallel program violates this data dependence and results in incorrect execution.
(Figure: S1 is the producer, writing the shared operands; S2 is the consumer, reading them; program order S1 → S2.)
Name Dependence Classification: Anti-Dependence
Assume task S2 follows task S1 in sequential program order.
Task S1 reads one or more values from one or more names (registers or memory locations).
Task S2 writes one or more values to the same names (same registers or memory locations read by S1).
Then task S2 is said to be anti-dependent on task S1 (S1 → S2).
Changing the relative execution order of tasks S1, S2 in the parallel program violates this name dependence and may result in incorrect execution.
(Figure: task dependency graph representation, e.g. shared memory locations in a shared address space (SAS): S1 reads the shared names, S2 writes them; program order S1 → S2.)
Does anti-dependence matter for message passing?
Name: Register or Memory Location
Name Dependence Classification: Output (or Write) Dependence
Assume task S2 follows task S1 in sequential program order.
Both tasks S1, S2 write to the same name or names (same registers or memory locations).
Then task S2 is said to be output-dependent on task S1 (S1 → S2).
Changing the relative execution order of tasks S1, S2 in the parallel program violates this name dependence and may result in incorrect execution.
(Figure: task dependency graph representation, e.g. shared memory locations in a shared address space (SAS): both S1 and S2 write the same shared names; program order S1 → S2.)
Does output dependence matter for message passing?
Name: Register or Memory Location
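The three dependence types above can be seen in a short code fragment. This is a hypothetical illustration (the variable names x, y, z are not from the slides); the shared "names" are ordinary variables.

```python
# Hypothetical statements illustrating the three dependence types.

# (True) data (flow) dependence: S1 writes x, S2 reads x.
x = 5          # S1: producer
y = x + 1      # S2: consumer -> S2 is data dependent on S1

# Anti-dependence: S1 reads x, S2 writes x.
z = x * 2      # S1: reads x
x = 0          # S2: writes x -> S2 is anti-dependent on S1

# Output dependence: S1 and S2 both write x.
x = 1          # S1: writes x
x = 2          # S2: writes x -> S2 is output-dependent on S1

print(y, z, x)  # -> 6 10 2; reordering any pair above changes a result
```

Swapping the statements inside any one of the three pairs changes the final values, which is exactly why the relative order must be preserved in a parallel program.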
Dependency Graph Example
Here assume each instruction is treated as a task.
Anti-dependence: (S2, S3), i.e. S2 → S3
(Figure: dependency graph involving registers R1, R3.)
Dependency Graph Example
Here assume each instruction is treated as a task.
MIPS Code:
1  ADD.D F2, F1, F0
2  ADD.D F4, F2, F3
3  ADD.D F2, F2, F4
4  ADD.D F4, F2, F6
Output dependence: (1, 3) (2, 4), i.e. 1 → 3, 2 → 4
Anti-dependence: (2, 3) (3, 4), i.e. 2 → 3, 3 → 4
(Figure: task dependency graph of instructions 1-4.)
Dependency Graph Example
Here assume each instruction is treated as a task.
MIPS Code:
1  L.D  F0, 0(R1)
2  ADD.D F4, F0, F2
3  S.D  F4, 0(R1)
4  L.D  F0, -8(R1)
5  ADD.D F4, F0, F2
6  S.D  F4, -8(R1)
True data dependence: (1, 2) (2, 3) (4, 5) (5, 6), i.e. 1 → 2, 2 → 3, 4 → 5, 5 → 6
(Figure: task dependency graph of instructions 1-6.)
Bernstein's Conditions: An Example
For the following instructions P1, P2, P3, P4, P5:
P1: C = D x E
P2: M = G + C
P3: A = B + C
P4: C = L + M
P5: F = G / E
Each instruction requires one step to execute; two adders are available.
Using Bernstein's conditions after checking statement pairs:
P1 || P5, P2 || P3, P2 || P5, P3 || P5, P4 || P5
(Figure: sequential execution vs. parallel execution of the dependency graph; the parallel version groups independent statements between Co-Begin and Co-End, finishing with P4.)
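Bernstein's conditions state that Pi and Pj can execute in parallel iff Ii ∩ Oj = ∅, Oi ∩ Ij = ∅, and Oi ∩ Oj = ∅. A minimal sketch checking the five statements above (read/write sets taken from the slide's statements):

```python
# Read (input) and write (output) sets for P1..P5 from the example above.
IN  = {1: {"D", "E"}, 2: {"G", "C"}, 3: {"B", "C"}, 4: {"L", "M"}, 5: {"G", "E"}}
OUT = {1: {"C"},      2: {"M"},      3: {"A"},      4: {"C"},      5: {"F"}}

def parallel(i, j):
    """Bernstein's three conditions: no flow, anti, or output conflict."""
    return (not (IN[i] & OUT[j]) and      # Ii intersect Oj empty
            not (OUT[i] & IN[j]) and      # Oi intersect Ij empty
            not (OUT[i] & OUT[j]))        # Oi intersect Oj empty

pairs = [(i, j) for i in range(1, 6) for j in range(i + 1, 6) if parallel(i, j)]
print(pairs)   # -> [(1, 5), (2, 3), (2, 5), (3, 5), (4, 5)]
```

The printed pairs match the five parallel pairs listed above; every other pair conflicts on C, M, or another shared name.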
Asymptotic Lower bound: Big Omega Notation Ω( )
f(n) = Ω(g(n))
if there exist positive constants c, n0 such that
| f(n) | ≥ c | g(n) | for all n > n0
i.e. g(n) is a lower bound on f(n)
Asymptotic Tight bound: Big Theta Notation Θ( )
Used in finding a tight limit on algorithm performance
f(n) = Θ(g(n))
if there exist positive constants c1, c2, and n0 such that
c1 | g(n) | ≤ | f(n) | ≤ c2 | g(n) | for all n > n0
i.e. g(n) is both an upper and lower bound on f(n)
AKA Tight bound
Graphs of O, Ω, Θ
(Figure: three plots versus n, each holding for n > n0:
f(n) = O(g(n)): f(n) stays below c g(n) — Upper Bound
f(n) = Ω(g(n)): f(n) stays above c g(n) — Lower Bound
f(n) = Θ(g(n)): f(n) stays between c1 g(n) and c2 g(n) — Tight Bound)
Or other metric or quantity such as memory requirement, communication etc.

log2 n   n     n log2 n   n^2    n^3     2^n           n!
0        1     0          1      1       2             1
1        2     2          4      8       4             2
2        4     8          16     64      16            24
3        8     24         64     512     256           40320
4        16    64         256    4096    65536         20922789888000
5        32    160        1024   32768   4294967296    2.6 x 10^35

O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(n^3) < O(2^n) < O(n!)
NP = Nondeterministic Polynomial
Rate of Growth of Common Computing Time Functions
(Figure: growth curves of 2^n, n^2, n log(n), n, and log(n) versus n.)
O(1) < O(log n) < O(n) < O(n log n) < O(n^2) < O(n^3) < O(2^n)
Theoretical Models of Parallel Computers
Parallel Random-Access Machine (PRAM):
p-processor, global shared-memory model.
Models idealized parallel shared-memory computers with zero synchronization, communication or memory access overhead.
Why? Utilized in parallel algorithm development and scalability and complexity analysis.
PRAM variants: more realistic models than pure PRAM
EREW-PRAM: Simultaneous memory reads or writes to/from the same memory location are not allowed.
CREW-PRAM: Simultaneous memory writes to the same location are not allowed. (Better to model SAS MIMD?)
ERCW-PRAM: Simultaneous reads from the same memory location are not allowed.
CRCW-PRAM: Concurrent reads or writes to/from the same memory location are allowed.
Example: Sum Algorithm on a p-Processor PRAM
(Figure: binary-tree summation of n = 8 elements on p = 4 processors P1..P4:
Level 1: B(i) = A(i), i = 1..8, two elements handled per processor
Level 2: four pairwise additions into B(1)..B(4), one per processor P1..P4
Level 3: two pairwise additions into B(1), B(2) by P1, P2, then the final addition produces the total.)
For n >> p: O(n/p + log2 p)
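A sequential sketch of this tree-summation pattern (assuming p divides n and p is a power of 2, which the slide's n = 8, p = 4 case satisfies): each "processor" first sums its n/p local elements, then the p partial sums are combined pairwise in log2 p levels, for O(n/p + log2 p) parallel steps under the PRAM's zero-overhead assumptions.

```python
def pram_tree_sum(A, p):
    """Simulate the PRAM sum: p local sums, then log2(p) pairwise levels."""
    n = len(A)
    assert n % p == 0 and (p & (p - 1)) == 0, "assume p | n and p a power of 2"
    # Step 1: each processor sums its n/p elements -> O(n/p) parallel time.
    B = [sum(A[k * (n // p):(k + 1) * (n // p)]) for k in range(p)]
    # Step 2: pairwise tree combination -> O(log2 p) parallel time.
    while len(B) > 1:
        B = [B[2 * i] + B[2 * i + 1] for i in range(len(B) // 2)]
    return B[0]

print(pram_tree_sum(list(range(1, 9)), 4))   # sum of 1..8 -> 36
```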
Performance of Parallel Algorithms
Performance of a parallel algorithm is typically measured in terms of worst-case analysis, i.e. using order notation O( ).
For problem Q with a PRAM algorithm that runs in time T(n) using P(n) processors, for an instance size of n:
The time-processor product C(n) = T(n) . P(n) represents the cost of the parallel algorithm.
For p < P(n), each of the T(n) parallel steps is simulated in O(P(n)/p) substeps. Total simulation takes O(T(n)P(n)/p) = O(C(n)/p).
The following four measures of performance are asymptotically equivalent:
P(n) processors and T(n) time
C(n) = P(n)T(n) cost and T(n) time
O(T(n)P(n)/p) time for any number of processors p < P(n)
O(C(n)/p + T(n)) time for any number of processors.
Illustrated next with an example.
Matrix Multiplication On PRAM
Multiply matrices A x B = C of sizes n x n.
Sequential matrix multiplication on one processor: each element of C is a dot product, O(n); with n^2 elements this gives O(n^3) sequentially.
For i = 1 to n
  For j = 1 to n
    C(i, j) = sum over k = 1 to n of A(i, k) x B(k, j)
From last slide: O(C(n)/p + T(n)) time complexity for any number of processors.
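The sequential dot-product formulation above can be written directly (plain Python, 0-based indices rather than the slides' 1-based):

```python
def matmul(A, B):
    """C = A x B for n x n matrices: n^2 dot products of length n -> O(n^3)."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))  # O(n) dot product
             for j in range(n)]
            for i in range(n)]

C = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)   # -> [[19, 22], [43, 50]]
```

On a PRAM, the n^2 dot products are independent and can be assigned to separate processors; only the O(n) summation inside each remains.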
The Power of The PRAM Model
Well-developed techniques and algorithms to handle many
computational problems exist for the PRAM model.
Removes algorithmic details regarding synchronization and
communication cost, concentrating on the structural and
fundamental data dependency properties (dependency graph) of the
parallel computation/algorithm.
Captures several important parameters of parallel computations.
Operations performed in unit time, as well as processor allocation.
The PRAM design paradigms are robust and many parallel
network (message-passing) algorithms can be directly derived
from PRAM algorithms.
It is possible to incorporate synchronization and communication
costs into the shared-memory PRAM model.
Network Model of Message-Passing Multicomputers
A network of processors can be viewed as a graph G(N, E).
Each node i ∈ N represents a processor; the graph represents the network topology.
Example: binary hypercube — node labels across each link differ in one bit position; Degree = Diameter = d = log2 N, here d = 4.
Levels of Parallelism in Program Execution
According to task (grain) size. Moving down the levels: higher degree of software parallelism (DOP), smaller tasks; moving up: increasing communications demand, mapping/scheduling overheads, and higher C-to-C ratio.
Level 5: Jobs or programs (multiprogramming, i.e. multi-tasking) — coarse grain
Level 4: Subprograms, job steps or related parts of a program
Level 3: Procedures, subroutines, or co-routines — medium grain
Level 2: Non-recursive loops or unfolded iterations — Thread Level Parallelism (TLP)
Level 1: Instructions or statements — Instruction Level Parallelism (ILP) — fine grain
Software Parallelism Types: Data Vs. Functional Parallelism
1 - Data Parallelism:
Parallel (often similar) computations performed on elements of large data structures (e.g. numeric solution of linear systems, pixel-level image processing), such as resulting from parallelization of loops.
Usually easy to load balance.
Degree of concurrency (i.e. max DOP) usually increases with input or problem size, e.g. O(n^2) in the equation solver example (next slide).
2 - Functional Parallelism:
Entire large tasks (procedures) with possibly different functionality that can be done in parallel on the same or different data.
Software Pipelining: different functions or software stages of the pipeline performed on different data, as in video encoding/decoding or polygon rendering.
Concurrency degree is usually modest and does not grow with input (i.e. problem) size.
Difficult to load balance.
Often used to reduce synchronization wait time between data parallel phases.
The most scalable parallel computations/programs (more concurrency as problem size increases) are data parallel computations/programs (per this loose definition); functional parallelism can still be exploited to reduce synchronization wait time between data parallel phases.
Example Motivating Problem with High Levels of Concurrency
(Figure: (a) cross sections, (b) spatial discretization of a cross section.)
Model as two-dimensional n x n grids; one task updates/computes one grid element.
Discretize in space and time: finer spatial and temporal resolution => greater accuracy.
Many different computations per time step, O(n^2) per grid; set up and solve equations iteratively (Gauss-Seidel).
Concurrency across and within grid computations per iteration: n^2 parallel computations per grid x number of grids.
Maximum Degree of Parallelism (DOP) or concurrency: O(n^2) data parallel computations per 2D grid per iteration; O(n^3) total computations per iteration over the n grids.
Covered next lecture (PCA 2.3)
Limited Concurrency: Amdahl's Law
Most fundamental limitation on parallel speedup.
Assume a fraction s of sequential execution time runs on a single processor (DOP = 1) and cannot be parallelized, i.e. the sequential or serial portion of the computation.
Assuming that the problem size remains fixed and that the remaining fraction (1 - s) can be parallelized without any parallelization overheads to run on p processors (DOP = p), and is thus reduced by a factor of p (i.e. perfect speedup for the parallelizable portion), the resulting speedup for p processors is:

    Speedup(p) = 1 / ( s + (1 - s)/p )

As p grows, Speedup(p) approaches the limit 1/s.
Equation solver example (see previous slide), parallel execution on p processors:
Phase 1: time reduced by p; Phase 2: divided into two steps.

    Speedup = 2n^2 / ( 2n^2/p + p ) = 1 / ( 1/p + p/2n^2 )

For large n, Speedup ≈ p.
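Amdahl's Law as a one-line function (a sketch; s is the serial fraction, p the processor count):

```python
def amdahl_speedup(s, p):
    """Speedup(p) = 1 / (s + (1 - s)/p); bounded above by 1/s."""
    return 1.0 / (s + (1.0 - s) / p)

# Even with 1024 processors, a 5% serial fraction caps speedup near 1/0.05 = 20.
print(round(amdahl_speedup(0.05, 1024), 1))   # -> 19.6
```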
Parallel Performance Metrics: Degree of Parallelism (DOP)
For a given time period, DOP reflects the number of processors in a specific parallel computer actually executing a particular parallel program, i.e. DOP at a given time = Min (Software Parallelism, Hardware Parallelism).
Average Degree of Parallelism A:
Given: maximum parallelism = m, n homogeneous processors, computing capacity of a single processor Δ (computations/sec), and t_i = total time during which DOP = i.
Total amount of work W (instructions, computations):

    W = Δ x sum (i = 1 to m) of i . t_i

Average parallelism:

    A = ( sum (i = 1 to m) of i . t_i ) / ( sum (i = 1 to m) of t_i )

(Figure: example DOP profile from time t1 to t2 with DOP ranging from 1 to 6; the area under the profile equals the total number of computations (work) W; here Average parallelism = 3.72.)
Ignoring parallelization overheads, what is the speedup?
(i.e. total work on parallel processors equal to work on a single processor)
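Given a DOP profile as (DOP i, time t_i) pairs, the average parallelism A is total work divided by total time. A sketch with made-up profile values (not the 3.72 profile in the figure):

```python
def average_parallelism(profile):
    """profile: list of (dop, time) pairs; A = sum(i * t_i) / sum(t_i)."""
    work = sum(i * t for i, t in profile)        # total computations (per Delta)
    total_time = sum(t for i, t in profile)
    return work / total_time

# Hypothetical profile: DOP 1 for 2 units, DOP 3 for 4 units, DOP 6 for 2 units.
A = average_parallelism([(1, 2), (3, 4), (6, 2)])
print(A)   # -> (2 + 12 + 12) / 8 = 3.25
```

Ignoring parallelization overheads, this A is also the speedup over a single processor doing the same total work.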
Concurrency Profile & Speedup
For a parallel program, DOP may range from 1 (serial) to a maximum m.
(Figure: concurrency profile of an example program — DOP, scale 0 to 1,400, plotted versus clock cycle number, roughly cycles 150 through 733.)
What is the parallel speedup when running on a parallel system without any parallelization overheads?

    Speedup = 1 / ( (1 - sum_i F_i) + sum_i F_i / S_i )

where F_i is the fraction of sequential execution time that runs at speedup S_i. (From 551)
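The formula above generalizes Amdahl's Law to multiple fractions F_i, each enhanced by speedup S_i. A sketch (the fractions below are made up for illustration):

```python
def overall_speedup(enhancements):
    """enhancements: list of (F_i, S_i) pairs; unenhanced fraction runs at 1x."""
    serial = 1.0 - sum(F for F, S in enhancements)
    return 1.0 / (serial + sum(F / S for F, S in enhancements))

# Hypothetical: 40% of time sped up 10x, 30% sped up 2x, 30% unenhanced.
print(round(overall_speedup([(0.4, 10), (0.3, 2)]), 2))   # -> 2.04
```

For a concurrency profile, each S_i is the DOP value i and F_i the fraction of time spent at that DOP.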
Pictorial Depiction of Example
Before: original sequential execution time split into three enhanced sections with speedups S1 = 10, S2 = 15, S3 = 30 (their times reduced by /10, /15, /30 respectively), plus an unchanged remainder.
Parallel performance scaling of three algorithms:
N = Problem Size
P = Number of Processors
Algorithm 1: T = N + N^2/P
Algorithm 2: T = (N + N^2)/P + 100
Algorithm 3: T = (N + N^2)/P + 0.6P^2
(Figure: execution time T versus P for Algorithms 1, 2, 3.)
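The three execution-time expressions can be compared directly (a sketch; N and the chosen P values are arbitrary, in the same abstract time units as the slide):

```python
def t1(N, P): return N + N**2 / P                  # serial N term limits scaling
def t2(N, P): return (N + N**2) / P + 100          # fixed overhead of 100
def t3(N, P): return (N + N**2) / P + 0.6 * P**2   # overhead grows with P^2

N = 1000
for P in (1, 4, 16, 64, 256):
    print(P, round(t1(N, P)), round(t2(N, P)), round(t3(N, P)))
```

Running it shows the contrast: Algorithm 1 flattens out at T = N, Algorithm 2 keeps improving toward its fixed overhead, and Algorithm 3 eventually gets worse as the 0.6P^2 term dominates.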
Creating a Parallel Program
Assumption: a sequential algorithm to solve the problem is given, or a different algorithm with more inherent parallelism is devised.
Most programming problems have several parallel solutions or algorithms. The best solution may differ from that suggested by existing sequential algorithms.
Computational Problem → Parallel Algorithm → Parallel Program
One must:
Identify work that can be done in parallel (dependency analysis)
Partition work and perhaps data among processes (tasks) — determines size and number of tasks
Manage data access, communication and synchronization
Note: work includes computation, data access and I/O.
Main goal: maximize speedup of parallel processing, by: 1- minimizing parallelization overheads, 2- balancing workload on processors.

    Speedup(p) = Performance(p) / Performance(1)

For a fixed-size problem:

    Speedup(p) = Time(1) / Time(p)

The processor with the maximum execution time determines the parallel execution time:

    Time(p) = Max over processors ( Work + Synch Wait Time + Comm Cost + Extra Work )
Steps in Creating a Parallel Program
4 steps: 1- Decomposition, 2- Assignment, 3- Orchestration, 4- Mapping (+ Scheduling)
(Figure: Computational Problem → Decomposition → fine-grain parallel computations (tasks) → Assignment → processes p0..p3 → Orchestration → Mapping → processors P0..P3 + execution order (scheduling).)
Decomposition and Assignment sit at or above the parallel algorithm / communication abstraction.
Decomposition: find the tasks, i.e. the maximum degree of parallelism (DOP) or concurrency of the sequential computation (dependency analysis/graph); Max DOP = max. no. of tasks.
Assignment: how many tasks? task (grain) size? Tasks → Processes.
Done by programmer or system software (compiler, runtime, ...).
Issues are the same, so assume the programmer does it all explicitly (Vs. implicitly by a parallelizing compiler).
Computational Problem → Parallel Algorithm → Parallel Program
Partitioning: Decomposition & Assignment
Decomposition (dependency analysis/graph):
Break up the computation into the maximum number of small concurrent computations that can be combined into fewer/larger tasks in the assignment step (the grain/task size problem):
Tasks may become available dynamically.
No. of available tasks may vary with time.
Together with assignment, also called partitioning.
Sequential Execution on One Processor Vs. Possible Parallel Execution Schedule on Two Processors P0, P1
(Figure: task dependency graph of tasks A-G, the sequential schedule on one processor, and a two-processor schedule showing communication and idle periods. DOP?)
Partitioning into tasks (given): tasks A-G.
Mapping of tasks to processors (given): P0: tasks A, C, E, F; P1: tasks B, D, G.
Assume computation time for each task A-G = 3.
Assume communication time between parallel tasks = 1.
Assume communication can overlap with computation.
Sequential execution time on one processor: T1 = 21.
Parallel execution time on two processors: T2 = 16.
Speedup on two processors = T1/T2 = 21/16 = 1.3.
A simple parallel execution example from lecture #1.
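A schedule like this can be computed mechanically from the dependency graph, the mapping, and the cost assumptions. The sketch below uses a hypothetical task graph (the figure's exact edges are not reproduced here, so its numbers differ from T2 = 16) with the same task time 3, cross-processor communication time 1, and the slide's mapping:

```python
from collections import defaultdict

TASK_TIME = 3
COMM_TIME = 1

deps = {          # task -> predecessor tasks (assumed edges, for illustration)
    "A": [], "B": ["A"], "C": ["A"],
    "D": ["B"], "E": ["C"], "F": ["D", "E"], "G": ["F"],
}
mapping = {"A": 0, "C": 0, "E": 0, "F": 0,   # P0 (as in the slide's mapping)
           "B": 1, "D": 1, "G": 1}           # P1

def schedule(deps, mapping):
    """Greedy schedule in topological order; returns task finish times."""
    # Topological order via Kahn's algorithm.
    indeg = {t: len(p) for t, p in deps.items()}
    ready = [t for t, d in indeg.items() if d == 0]
    order = []
    while ready:
        t = ready.pop(0)
        order.append(t)
        for u, ps in deps.items():
            if t in ps:
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.append(u)
    # Each task starts when its processor is free and its inputs have arrived;
    # cross-processor inputs pay COMM_TIME (overlapped with other computation).
    finish, proc_free = {}, defaultdict(int)
    for t in order:
        proc = mapping[t]
        data_ready = max(
            (finish[p] + (COMM_TIME if mapping[p] != proc else 0)
             for p in deps[t]), default=0)
        start = max(proc_free[proc], data_ready)
        finish[t] = start + TASK_TIME
        proc_free[proc] = finish[t]
    return finish

T1 = TASK_TIME * len(deps)               # sequential: 7 tasks x 3 = 21
T2 = max(schedule(deps, mapping).values())
print(T1, T2, round(T1 / T2, 2))         # -> 21 18 1.17
```

With this assumed graph, the two-processor makespan is 18 rather than the figure's 16, which shows how strongly the schedule length depends on the dependency edges and mapping.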
Static Multiprocessor Scheduling
Dynamic multiprocessor scheduling is an NP-hard problem.
Table 2.1 Steps in the Parallelization Process and Their Goals

Decomposition — Architecture-dependent? Mostly no.
  Goal: Expose enough concurrency but not too much.
Assignment (determines size and number of tasks; Tasks → Processes) — Architecture-dependent? Mostly no.
  Goals: Balance workload; Reduce communication volume.
Orchestration — Architecture-dependent? Yes.
  Goals: Reduce noninherent communication via data locality; Reduce communication and synchronization cost as seen by the processor; Reduce serialization at shared resources; Schedule tasks to satisfy dependences early.
Mapping (Processes → Processors + Execution Order (scheduling)) — Architecture-dependent? Yes.
  Goals: Put related processes on the same processor if necessary; Exploit locality in network topology.