
Brief Overview of Parallel Programming

M. D. Jones, Ph.D.

Center for Computational Research
University at Buffalo, State University of New York

Spring 2014

Background

A Little Motivation

Why do we parallel process in the first place?

Performance - either to solve a bigger problem, or to reach a solution faster (or both).

Hardware is becoming (actually it has been for a while now) intrinsically parallel.

Parallel programming is not easy. How easy is sequential programming? Now add another layer (a sizable one, at that) for parallelism ...


Background

Big Picture

Decomposition

The basic idea of parallel programming is a simple one:

[Figure: decomposition of a problem domain in 1D, 2D, and 3D across multiple CPUs.]

in which we decompose a large computational problem across the available processing elements (CPUs). How you achieve that decomposition is where all the fun lies ...

Uniform volume example - decomposition in 1D, 2D, and 3D. Note that
load balancing in this case is simpler if the work load per cell is the
same ...
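As a concrete (and deliberately simplified) illustration of the 1D case, the sketch below shows one common convention for block-decomposing N cells across p processing elements; the remainder handling here is an assumption of this example, not the only choice.

#include <stdio.h>

/* Compute the contiguous block [lo, hi) of N cells owned by rank r of p.
   Any remainder cells are spread one per rank over the first ranks. */
static void block_range(long N, int p, int r, long *lo, long *hi)
{
    long base = N / p, rem = N % p;
    *lo = r * base + (r < rem ? r : rem);
    *hi = *lo + base + (r < rem ? 1 : 0);
}

int main(void)
{
    long N = 1000, lo, hi;
    int p = 6;

    for (int r = 0; r < p; r++) {
        block_range(N, p, r, &lo, &hi);
        printf("CPU %d owns cells [%ld, %ld)\n", r, lo, hi);
    }
    return 0;
}

The same idea extends to 2D and 3D by decomposing each dimension independently.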

Decomposition (continued)

Nonuniform example - decomposition in 3D for a 16-way parallel adaptive finite element calculation (in this case using a terrain elevation). Load balancing in this case is much more complex.

Demand for HPC

More Speed!

Why Use Parallel Computation?

Note that, in the modern era, inevitably HPC = parallel (or concurrent) computation. The driving forces behind this are pretty simple - the desire is:

Solve my problem faster, i.e., I want the answer now (and who doesn't want that?)

I want to solve a bigger problem than I (or anyone else, for that matter) have ever before been able to tackle, and do so in a reasonable amount of time (generally, reasonable = within a graduate student's time to graduate!)

A Concrete Example

Well, more of a discrete example, actually. Let's consider the gravitational N-body problem.

Example
Using classical gravitation, we have a very simple (but long-ranged) force/potential. For each of $N$ bodies, the resulting force is computed from the other $N-1$ bodies, thereby requiring $N^2$ force calculations per step. If a galaxy consists of approximately $10^{12}$ such bodies, and even the best algorithm for computing them requires $N\log_2 N$ calculations, that means $\simeq 10^{12}\,\ln(10^{12})/\ln(2)$ calculations. If each calculation takes $\simeq 1\,\mu$s, that is $40\times 10^{6}$ seconds per step. That is about 1.3 CPU-years per step. Ouch!
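As a sanity check on that arithmetic, a few lines of C reproduce the estimate (the 1 microsecond per calculation is the assumed figure from the example):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double N = 1e12;                /* bodies in the galaxy                   */
    double calcs = N * log2(N);     /* best-case N log2(N) force calculations */
    double t_calc = 1e-6;           /* assumed 1 microsecond per calculation  */
    double seconds = calcs * t_calc;

    printf("%.2e calculations -> %.2e s = %.1f CPU-years per step\n",
           calcs, seconds, seconds / (365.25 * 24.0 * 3600.0));
    return 0;
}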

A Size Example

On the other side, suppose that we need to diagonalize (or invert) a dense matrix being used in, say, an eigenproblem derived from some engineering or multiphysics problem.

Example
In this problem, we want to increase the resolution to capture the essential underlying behavior of the physical process being modeled. So we determine that we need a matrix of order, say, 400000 (i.e., $400000\times 400000$ elements). Simply to store this matrix, in 64-bit representation, requires $\simeq 1.28\times 10^{12}$ Bytes of memory, or about 1200 GBytes. We could fit this onto a cluster with, say, $10^3$ nodes, each having 4 GBytes of memory, by distributing the matrix across the individual memories of each cluster node.
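A quick back-of-the-envelope check of the storage claim (the 4 GByte-per-node figure is simply the example's assumption):

#include <stdio.h>

int main(void)
{
    double n = 4.0e5;            /* matrix order (400000 x 400000)         */
    double bytes = n * n * 8.0;  /* 8 bytes per 64-bit element             */
    double node_mem = 4.0e9;     /* assumed memory per cluster node (4 GB) */

    printf("matrix storage: %.2e bytes (~%.0f GBytes)\n", bytes, bytes / 1.0e9);
    printf("minimum nodes just to hold it: %.0f\n", bytes / node_mem);
    return 0;
}

which is comfortably accommodated by the $10^3$-node cluster in the example.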


So ...

So the lessons to take from the preceding examples should be clear:

It is the nature of the research (and commercial) enterprise to always be trying to accomplish larger tasks with ever increasing speed, or efficiency. For that matter, it is human nature.

These examples, while specific, apply quite generally - do you see the connection between the first example and general molecular modeling? The second example and finite element (or finite difference) approaches to differential equations? How about web searching?

Inherent Limitations

Scaling Concepts

By scaling we typically mean the relative performance of a parallel vs. serial implementation:

Definition (Scaling): the speedup factor, $S(p)$, is given by

$$ S(p) = \frac{\text{sequential execution time (using optimal implementation)}}{\text{parallel execution time using } p \text{ processors}} , $$

so, in the ideal case, $S(p) = p$.
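A quick numerical illustration (the timings are invented for the example): if the optimal sequential implementation takes 100 s and the parallel version takes 16 s on $p = 8$ processors, then $S(8) = 100/16 = 6.25$, compared with the ideal value of 8 (i.e., about 78% of ideal).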


Parallel Efficiency

Using $t_S$ as the (best) sequential execution time, we note that

$$ S(p) = \frac{t_S}{t_p(p)} \le p , $$

and the parallel efficiency is given by

$$ E(p) = \frac{S(p)}{p} = \frac{t_S}{p\, t_p} . $$

Inherent Limitations in Parallel Speedup

Limitations on the maximum speedup:

Fraction of the code, $f$, can not be made to execute in parallel

Parallel overhead (communication, duplication costs)

Using this serial fraction, $f$, we can note that

$$ t_p \ge f\, t_S + (1-f)\, t_S/p $$

for a lower bound.


Amdahl's Law

This simplification for $t_p$ leads directly to Amdahl's Law:

Definition (Amdahl's Law):

$$ S(p) \le \frac{t_S}{f\, t_S + (1-f)\, t_S/p} = \frac{p}{1 + f(p-1)} . $$

G. M. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," AFIPS Conference Proceedings 30 (AFIPS Press, Atlantic City, NJ) 483-485, 1967.

Implications of Amdahl's Law

The implications of Amdahl's law are pretty straightforward:

The limit as $p \to \infty$ is

$$ \lim_{p\to\infty} S(p) \le \frac{1}{f} . $$

If the serial fraction is 5%, the maximum parallel speedup is only 20.
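A minimal sketch in C that evaluates the Amdahl bound for a few processor counts; the 5% serial fraction is the case just quoted, and the processor counts are arbitrary.

#include <stdio.h>

/* Amdahl's Law: upper bound on speedup for serial fraction f on p processors. */
static double amdahl(double f, double p)
{
    return p / (1.0 + f * (p - 1.0));
}

int main(void)
{
    double f = 0.05;                        /* 5% serial fraction */
    double plist[] = {2, 8, 64, 1024, 1e6};

    for (int i = 0; i < 5; i++)
        printf("p = %9.0f   S <= %6.2f\n", plist[i], amdahl(f, plist[i]));
    printf("p -> infinity: S <= 1/f = %.1f\n", 1.0 / f);
    return 0;
}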


Implications of Amdahl's Law (cont'd)

[Figure: maximum parallel speedup, MAX(S(p)), versus the number of processors, p (1-256), for serial fractions f = 0.001, 0.01, 0.02, 0.05, and 0.1.]

A Practical Example

Let $t_p = t_{\rm comm} + t_{\rm comp}$, where $t_{\rm comm}$ is the time spent in communication between parallel processes (unavoidable overhead) and $t_{\rm comp}$ is the time spent in (parallelizable) computation.

The maximum parallel speedup then follows from

$$ t_1 \simeq p\, t_{\rm comp} , $$

and

$$ S(p) = \frac{t_1}{t_p} \simeq \frac{p\, t_{\rm comp}}{t_{\rm comm} + t_{\rm comp}} = \frac{p}{1 + t_{\rm comm}/t_{\rm comp}} . $$

The point here is that it is critical to minimize communication time relative to the time spent doing computation (a recurring theme).
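For example, assuming (purely for illustration) a communication overhead of one quarter of the computation time, $t_{\rm comm}/t_{\rm comp} = 0.25$, the best one can achieve is $S(p) \simeq p/1.25 = 0.8\,p$: 20% of the ideal speedup is lost to communication, independent of $p$.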


Defeating Amdahl's Law

There are ways to work around the serious implications of Amdahl's law:

We assumed that the problem size was fixed, which is (very) often not the case.

Now consider a case where the problem size is allowed to vary.

Assume now that the problem size is scaled such that $t_p$ is held constant.

Gustafson's Law

Now let $t_p$ be held constant as $p$ is increased:

$$ S_s(p) = \frac{t_S}{t_p} = \frac{f\,t_S + (1-f)\,t_S}{t_p} = \frac{f\,t_S + (1-f)\,t_S}{f\,t_S + (1-f)\,t_S/p} = \frac{p}{1 - (1-p)f} \simeq p + p(1-p)f + \cdots $$

Another way of looking at this is that the serial fraction becomes negligible as the problem size is scaled. Actually, that is a pretty good definition of a scalable code ...

J. L. Gustafson, "Reevaluating Amdahl's Law," Comm. ACM 31(5), 532-533 (1988).


Scalability

Definition (Scalable): An algorithm is scalable if there is a minimal nonzero efficiency as $p \to \infty$ and the problem size is allowed to take on any value.

I like this (equivalent) one better:

Definition (Scalable): For a scalable code, the sequential fraction becomes negligible as the problem size (and the number of processors) grows.

Overview of Parallel APIs

Basic Terminology

The Parallel Zoo

There are many parallel programming models, but roughly they break down into the following categories:

Threaded models: e.g., OpenMP or POSIX threads as the application programming interface (API)

Message passing: e.g., MPI = Message Passing Interface

Hybrid methods: combine elements of other models


Thread Models

[Diagram: a single program (a.out) with shared data and multiple threads (thread 1, thread 2, ..., thread N) executing concurrently over time.]

Typical thread model (e.g., OpenMP).

Shared address space (all threads can access the data of program a.out).

Generally limited to a single machine except in very specialized cases.

GPGPU (general purpose GPU computing) uses a thread model (with a very large number of threads on the GPU device).
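A minimal OpenMP sketch of this model in C (illustrative only; the loop is arbitrary, and an OpenMP-enabling flag such as gcc -fopenmp is needed at compile time):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    const long n = 100000000L;
    double sum = 0.0;

    /* The loop iterations are divided among the threads of the team; each
       thread accumulates a private partial sum, combined by the reduction. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 1; i <= n; i++)
        sum += 1.0 / (double)i;

    printf("harmonic sum H(%ld) = %.12f using up to %d threads\n",
           n, sum, omp_get_max_threads());
    return 0;
}

All threads share the program's address space, so no explicit data movement is needed - only the reduction coordinates the shared result.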


Message Passing Models

[Diagram: N tasks (task 0, task 1, ..., task N-1), each a separate copy of a.out with its own PID and private data, exchanging messages via send/recv and via collective communication (broadcast/reduce/...) over time.]

Typical message passing model (e.g., MPI).

Separate address space (tasks can not access the data of other tasks without sending messages or participating in collective communications).

General purpose model - messages can be sent over a network, shared memory, etc.

Managing communication is generally up to the programmer.
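A minimal MPI sketch of this model in C (illustrative; each task's "work" is just its own rank here):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    local = (double)rank;     /* each task computes a private partial result */

    /* a collective reduction combines the partial results on task 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d tasks = %g\n", size, total);

    MPI_Finalize();
    return 0;
}

Each task runs in its own address space; the only way data moves between tasks is through the explicit MPI calls.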


Parallelism at Three Levels

Task - macroscopic view in which an algorithm is organized around separate concurrent tasks

Data - structure data in (separately) updateable chunks

Instruction - instruction-level parallelism (ILP) through the flow of data in a predictable fashion

Approaching the Problem

The first step in deciding where to add parallelism to an algorithm or application is usually to analyze it from the point of view of tasks and data:

Tasks: How do I reduce this problem into a set of tasks that can be executed concurrently?

Data: How do I take the key data and represent it in such a way that large chunks can be operated on independently (and thus concurrently)?


Dependencies

Bear in mind that you always want to minimize the dependencies between your concurrent tasks and data:

On the task level, design your tasks to be as independent as possible - this can also be temporal, namely take into account any necessity for ordering the tasks and their execution.

Sharing of data will require synchronization and thus introduce data dependencies, if not race conditions or contention.

An Hour of Planning ...

Taking the time to carefully analyze an existing application, or plan a new one, can really pay off later:

Plan for parallel programming by keeping data and task parallel constructs/opportunities in mind.

Analyze an existing or new program to discover/confirm the locations of the time-consuming portions.

Optimize by implementing a strategy using data or task parallel constructs.


Auto-Parallelization

Often called implicit parallelism; some compilers (usually commercial ones) have the ability to:

automatically parallelize simple (outer) loops (thread-level parallelism, or TLP)

automatically vectorize (usually innermost loops) using ILP

Note that thread-level parallelism is only usable on an SMP architecture.

Auto-Vectorization

Like auto-parallelization, a feature of high-performance (often commercial) compilers on hardware that supports at least some level of vectorization:

the compiler finds low-level operations that can operate simultaneously on multiple data elements using a single instruction

Intel/AMD processors with SSE/SSE2/SSE3 are usually limited to vector lengths of $2^1$ to $2^3$ elements (a hardware limitation that true vector processors do not share)

the user can exert some control through compiler directives (usually vendor specific) and data organization focused on vector operations
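As an illustration, the kind of loop an auto-vectorizing compiler can typically convert to SIMD instructions (independent iterations, unit stride); the exact flags are compiler-dependent (GCC, for example, enables -ftree-vectorize at -O3):

/* y = a*x + y over n elements; 'restrict' tells the compiler the arrays
   do not overlap, which helps it prove that vectorization is safe. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}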


Downside of Automatic Methods

The automatic parallelization schemes have a number of shortfalls:

Very limited in scope - parallelizes only a very small subset of code

Without directives from the programmer, the compiler is very limited in identifying parallelizable regions of code

Performance is rather easily degraded instead of sped up

Wrong results are easy to (automatically) produce

Shared Memory APIs

Shared Memory Parallelism

Here we consider explicit modes of parallel programming on shared memory architectures:

HPF, High Performance Fortran

SHMEM, the old Cray shared memory library

Pthreads, the POSIX threads standard (available on Windows as well)

OpenMP, common specification for compiler directives and API


Distributed Memory APIs

Distributed Memory Parallel Programming

In this case, mainly message passing, with a few other (compiler directives again) possibilities:

HPF, High Performance Fortran

Cluster OpenMP, a new product from Intel to push OpenMP into a distributed memory regime

PVM, Parallel Virtual Machine

MPI, Message Passing Interface, which has become the dominant API for general purpose parallel programming

UPC, Unified Parallel C, extensions to ISO C99 for parallel programming (new)

CAF, Co-Array Fortran, parallel extensions to Fortran 95/2003 (new, officially part of Fortran 2008)

APIs/Language Extensions & Availability

OpenMP

Synopsis: designed for multi-platform shared memory parallel programming in C/C++/Fortran, primarily through the use of compiler directives and environment variables (library routines are also available).

Target: Shared memory systems.

Current Status: Specification 1.0 released in 1997, 2.0 in 2002, 2.5 in 2005, 3.0 in 2008, 3.1 in 2011. Widely implemented by commercial compiler vendors, and now also in most open-source compilers (the 4.0 specification was released in 2013-07, but is not yet available in production compilers).

More Info: The OpenMP home page: http://www.openmp.org


OpenMP Availability

Platform       Compiler       Version   Invocation (example)
Linux x86_64   Intel          3.1       ifort -openmp -openmp_report2 ...
               PGI            3.0       pgf90 -mp ...
               GNU (>=4.2)    2.5       gfortran -fopenmp ...
               GNU (>=4.4)    3.0       gfortran -fopenmp ...
               GNU (>=4.7)    3.1       gfortran -fopenmp ...

MPI

Message Passing Interface

Synopsis: Message passing library specification proposed as a standard by a committee of vendors, implementors, and users.

Target: Distributed and shared memory systems.

Current Status: Most popular (and most portable) of the message passing APIs.

More Info: ANL's main MPI site: www-unix.mcs.anl.gov/mpi


MPI Availability at CCR

Platform       Version (+MPI-2)
Linux IA64     1.2+ (C++, MPI-I/O)
Linux x86_64   1.2+ (C++, MPI-I/O), 2.x (various)

I am starting to favor the commercial Intel MPI for its ease of use, especially in terms of supporting multiple networks/protocols.

Unified Parallel C (UPC)

Unified Parallel C (UPC) is a project based at Lawrence Berkeley Laboratory to extend C (ISO C99) and provide large-scale parallel computing on both shared and distributed memory hardware:

Explicit parallel execution (SPMD)

Shared address space (shared/private data like OpenMP), but exploits data locality

Primitives for memory management

BSD license (free download)

http://upc.lbl.gov/, community website at http://upc.gwu.edu/


Simple example of some UPC syntax:

shared int all_hits[THREADS];
...
for (i = 0; i < my_trials; i++) my_hits += hit();
all_hits[MYTHREAD] = my_hits;
upc_barrier;
if (MYTHREAD == 0) {
    total_hits = 0;
    for (i = 0; i < THREADS; i++) {
        total_hits += all_hits[i];
    }
    pi = 4.0*((double)total_hits)/((double)trials);
    printf("PI estimated to %10.7f from %d trials on %d threads.\n",
           pi, trials, THREADS);
}

Co-Array Fortran

Co-Array Fortran (CAF) is an extension to Fortran (Fortran 95/2003) to provide data decomposition for parallel programs (somewhat akin to UPC and HPF):

Original specification by Numrich and Reid, ACM Fortran Forum 17, no. 2, pp. 1-31 (1998).

The ISO Fortran committee included co-arrays in the next revision to the Fortran standard (c.f. Numrich and Reid, ACM Fortran Forum 24, no. 2, pp. 4-17 (2005)), Fortran 2008.

Does not require shared memory (more applicable than OpenMP).

Early compiler work at Rice: http://www.hipersoft.rice.edu/caf/index.html


Simple examples of CAF usage:

REAL, DIMENSION(N)[*] :: X, Y
X        = Y[PE]     ! get from Y[PE]
Y[PE]    = X         ! put into Y[PE]
Y[:]     = X         ! broadcast X
Y[LIST]  = X         ! broadcast X over subset of PEs in array LIST
Z(:)     = Y[:]      ! collect all Y
S = MINVAL(Y[:])     ! min (reduce) all Y
B(1:M)[1:N] = S      ! S scalar, promoted to array of shape (1:M,1:N)

UPC and CAF are examples of partitioned global address space (PGAS) languages, for which a large push is being made by AHPCRC/DARPA as part of the Petascale computing initiative (there is also one based on Java, called Titanium).

Libraries

Leveraging Existing Parallel Libraries

One easy way to utilize parallel programming resources is through the use of existing parallel libraries. The most common examples (note the orientation around standard mathematical routines):

BLAS - Basic Linear Algebra Subprograms

LAPACK - Linear Algebra PACKage (uses BLAS)

ScaLAPACK - distributed memory (MPI) solvers for common LAPACK functions

PETSc - Portable, Extensible Toolkit for Scientific Computation

FFTW - Fastest Fourier Transforms in the West


BLAS

The Basic Linear Algebra Subprograms (BLAS) form a standard set of library functions for vector and matrix operations:

Level 1: Vector-Vector (e.g. xdot, xaxpy, where x=s,d,c,z)

Level 2: Matrix-Vector (e.g. xgemv, where x=s,d,c,z)

Level 3: Matrix-Matrix (e.g. xgemm, where x=s,d,c,z)

These routines are generally provided by vendors, hand-tuned at the level of assembly code for optimum performance on a particular processor. Shared memory (multithreaded) and distributed memory versions are available.

www.netlib.org/blas
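A minimal Level 1 example in C, assuming the CBLAS interface (cblas.h) provided by implementations such as the reference BLAS or MKL; the link flags vary by vendor:

#include <stdio.h>
#include <cblas.h>   /* C interface to the BLAS */

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};

    /* Level 1 BLAS: double-precision dot product; the increments of 1
       mean both vectors are contiguous in memory. */
    double d = cblas_ddot(4, x, 1, y, 1);

    printf("x . y = %g\n", d);   /* 1*4 + 2*3 + 3*2 + 4*1 = 20 */
    return 0;
}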


Vendor BLAS

Vendor implementations of the BLAS:

Vendor    Library
AMD       ACML
Apple     Velocity Engine
Compaq    CXML
Cray      libsci
HP        MLIB
IBM       ESSL
Intel     MKL
NEC       PDLIB/SX
SGI       SCSL
SUN       Sun Performance Library


Performance Example

DDOT Performance

[Figure: ddot performance (MFlop/s) versus vector length (in 8-byte words) on a U2 compute node, comparing Intel cMKL 8.1.1 against the reference BLAS (-lblas); the L1 cache, L2 cache, and main memory regimes are indicated.]

Performance advantage of Intel MKL ddot versus the reference (system) version. Level 3 BLAS routines have even more significant gains.

LAPACK

The Linear Algebra PACKage (LAPACK) is a library that lies on top of the BLAS (for optimum performance and parallelism) and provides:

solvers for systems of simultaneous linear equations

least-squares solutions

eigenvalue problems

singular value problems

On CCR systems, the Intel MKL is generally preferred (it includes optimized BLAS and LAPACK).

www.netlib.org/lapack
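A minimal sketch of a linear solve, assuming the LAPACKE C interface to LAPACK is available (Fortran codes would call dgesv directly); the 2x2 system is invented for the example:

#include <stdio.h>
#include <lapacke.h>   /* C interface to LAPACK */

int main(void)
{
    /* Solve A x = b with A = [3 1; 1 2], b = (9, 8); expect x = (2, 3). */
    double A[4] = { 3.0, 1.0,
                    1.0, 2.0 };      /* row-major storage */
    double b[2] = { 9.0, 8.0 };
    lapack_int ipiv[2];

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1, A, 2, ipiv, b, 1);
    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);
    else
        printf("dgesv failed, info = %d\n", (int)info);
    return 0;
}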

ScaLAPACK

The Scalable LAPACK library, ScaLAPACK, is designed for use on the largest of problems (typically implemented on distributed memory systems):

a subset of LAPACK routines redesigned for distributed memory MIMD parallel computers

explicit message passing for interprocessor communication

assumes matrices are laid out in a two-dimensional block cyclic decomposition

On CCR systems, the Intel Cluster MKL includes the ScaLAPACK libraries (optimized BLAS, LAPACK, and ScaLAPACK).

www.netlib.org/scalapack

Parallel Program Design Considerations

Programming Costs


Consider:
Complexity: parallel codes can be orders of magnitude more complex
(especially those using message passing) - you have to
plan for and deal with multiple instruction/data streams
Portability: parallel codes frequently have long lifetimes (proportional
to the amount of effort invested in them) - all of the serial
application porting issues apply, plus the choice of
parallel API (MPI, OpenMP, and POSIX threads are
currently good portable choices, but implementations of
them can differ from platform to platform)
Resources: overhead for parallel computation can be significant for
smaller calculations
Scalability: limitations in hardware (CPU-memory speed and
contention, for one example) and the parallel algorithms
will limit speedups. All codes will eventually reach a state
of decreasing returns at some point

Examples

Speedup Limitation Example (MD/NAMD)

[Figure: Joint Amber-CHARMM benchmark with NAMD v2.6b1 on U2 - benchmark MD time per step and parallel speedup versus the number of processors (ppn=2), comparing the ch_p4 and ch_gm interconnects.]

JAC (Joint Amber-CHARMM) Benchmark: DHFR protein, 7182 residues, 23558 atoms, 7023 TIP3 waters, PME (9 Å cutoff).


Communication Costs

Communications

All parallel programs will need some communication between tasks, but the amount varies considerably:

minimal communication, also known as embarrassingly parallel - think Monte Carlo calculations

robust communication, in which processes must exchange or update information frequently - time evolution codes, domain decomposed finite difference/element applications

Communication Considerations

Communication needs between parallel processes affect parallel programs in several important ways:

Cost: there is always overhead when communicating:

latency - the cost in overhead for a zero-length message (microseconds)

bandwidth - the amount of data that can be sent per unit time (MBytes/second); many applications utilize many small messages and are latency-bound

Scope: point-to-point communication is always faster than collective communication (which involves a large subset of, or all, processes)

Efficiency: are you using the best available resource (network) for communicating?
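A minimal sketch of the ping-pong measurement behind such latency/bandwidth numbers (the message size and repetition count are arbitrary choices; run with at least two ranks, e.g. mpirun -np 2):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    enum { NBYTES = 1 << 20, NREPS = 100 };
    static char buf[NBYTES];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 sends to rank 1, which echoes the message straight back;
       half the averaged round-trip time approximates the one-way time */
    double t0 = MPI_Wtime();
    for (int i = 0; i < NREPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = (MPI_Wtime() - t0) / (2.0 * NREPS);

    if (rank == 0)
        printf("%d bytes: %.2f us one-way, %.1f MByte/s\n",
               NBYTES, 1e6 * t, NBYTES / t / 1e6);

    MPI_Finalize();
    return 0;
}

Repeating the measurement over a range of message lengths yields curves like those in the following examples.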


Latency Example - MPI/CCR

[Figure: MPI PingPong benchmark performance - message time versus message length for the U2 cluster interconnects (ch_mx, ch_gm, ch_p4, DAPL-QDR-IB).]

Latency on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR InfiniBand.

Bandwidth Example - MPI/CCR

[Figure: MPI PingPong benchmark performance - bandwidth (MByte/s) versus message length for the U2 cluster interconnects (ch_mx, ch_gm, ch_p4, DAPL-QDR-IB).]

Bandwidth on CCR - TCP/IP Ethernet/ch_p4, Myrinet/ch_mx, and QDR InfiniBand.

Collective Example - MPI/CCR

MPI_Alltoall Benchmark

[Figure: time for MPI_Alltoall with a 4KB buffer versus the number of MPI processes (ppn = 1, 2, 4, 6, 12), using Intel MPI on 12-core Xeon E5645 nodes with QLogic QDR IB.]

Cost of MPI_Alltoall for a 4KB buffer on CCR/QDR IB.

Communication Considerations (cont'd)

Visibility: message-passing codes require the programmer to explicitly manage all communications, but many data-parallel codes do not, masking the underlying data transfers

Synchronicity: synchronous communications are called blocking, as they require an explicit handshake between parallel processes - asynchronous (non-blocking) communications offer the ability to overlap computation with communication, but place an extra burden on the programmer. Examples include:

Barriers - force collective synchronization; often implied, and always expensive

Locks/Semaphores - typically protect memory locations from simultaneous/conflicting updates to prevent race conditions
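A minimal sketch of non-blocking (asynchronous) communication overlapping with computation, using a ring exchange; the amount of "local work" is arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf, recvbuf = 0.0, work = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;   /* ring neighbors */
    int right = (rank + 1) % size;
    sendbuf = (double)rank;

    /* post the transfers, then compute while the messages are in flight */
    MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int i = 0; i < 1000000; i++)
        work += 1.0e-6 * i;

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* handshake completes here */

    printf("rank %d received %g from rank %d (work = %g)\n",
           rank, recvbuf, left, work);
    MPI_Finalize();
    return 0;
}

Only after the MPI_Waitall is it safe to use recvbuf (or to reuse sendbuf); that deferred handshake is what distinguishes the non-blocking calls from blocking MPI_Send/MPI_Recv.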

