CSL730: An Introduction to Parallel Computation

Focus of the Course


Hybrid many-core computation
Parallel algorithms
General techniques
Learn to analyse, program, debug
Related terms: supercomputing, grid computing, cloud computing

Moore's Law (1965)

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."

1975: Revised the rate of circuit complexity doubling to 18 months going forward. "There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions."

2003: "Another decade is probably straightforward... There is certainly no end to creativity."

Current Sizes
Intel 6-core Core i7 (Sandy Bridge-E): 2.27B transistors, 32nm, 434mm²
Intel 10-core Xeon Westmere-EX: 2.6B transistors, 32nm, 512mm²
NVIDIA GF100: 3B transistors, 40nm, 529mm²
AMD Tahiti: 4.31B transistors, 28nm, 365mm²

1971 Intel 4004: 2,300 transistors, 10µm, 12mm²

Parallel Computers
One of Google's data centers (40-80 servers per rack) sits on a 68,680 square foot space. Another uses a 139,797 square foot space. (courtesy Harper's magazine)

Tianhe-1: 2.566 petaFLOPS; 14,336 Intel Xeon X5670 (2.93 GHz) processors and 7,168 Nvidia Tesla M2050 general-purpose GPUs. Power: 4,040 kW. (courtesy Science & Technology)

K computer
Fujitsu SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, at RIKEN Advanced Institute for Computational Science (AICS)
Cores: 705,024
Power: 12,659.89 kW
Memory: 1,410,048 GB
Interconnect: custom (Tofu)
Operating system: Linux
Linpack performance: 10,510,000 GFlop/s

India #85: @CRL, Xeon 53xx 3 GHz, InfiniBand interconnect, 14,384 cores, 132,800 GFlop/s

Why Parallel
Can't clock faster
Do more per clock (bigger ICs ...)
Execute complex special-purpose instructions
Execute more simple instructions

Even if a processor performs more operations per second, DRAM access times remain a bottleneck (improving only ~10% per year)
Multiple processors can access memory in parallel; more processors also means more aggregate cache
Some of the fastest growing applications of parallel computing exploit not their raw computational speed but their ability to pump data to memory and disk faster

Some Complex Problems


N-body simulation
1 million bodies
days/iteration

Atmospheric simulation
1km 3D grid, each point interacts with its neighbors
Days of simulation time

Movie making
A few minutes = 30 days of rendering time

Oil exploration
months of sequential processing of seismic data

Financial processing
market prediction, investing

Computational biology
drug design, gene sequencing (Celera)

Executing Stored Programs

Sequential: a single stream of instructions (OP + operands), executed one after another
Parallel: multiple streams of instructions (OP + operands), executed concurrently

Serial vs parallel (concurrent)

atmWithdraw(int accountnum, int amount) {
    int curbalance = balance(accountnum);            // read shared balance
    if (curbalance > amount) {
        setbalance(accountnum, curbalance - amount); // write it back
        eject(amount);
    } else {
        // insufficient balance (branch body elided on the slide)
    }
}
// Two concurrent calls can both read the same balance before either writes:
// a read-modify-write race.
Programming in the Parallel


Understand target model (Semantics)
Implications/Restrictions of constructs/features

Design for the target model


Choice of granularity, synchronization primitive
Usually more of a performance issue

Think concurrent
For each thread, other threads are adversaries
At least with regard to timing

Process launch, Communication, Synchronization


Clearly define pre- and post-conditions

Employ high-level constructs when possible


Debugging is extra hard without them

Learn Parallel Programming?


Let the compiler extract parallelism?
In general, not successful so far
Too context-sensitive
Many efficient serial data structures and algorithms are parallel-inefficient
Even if the compiler extracted parallelism from serial code, it may not be what you want

Programmer must conceptualize and code parallelism
Understand parallel algorithms and data structures

Automatic vs Manual Parallelization


Manually implementing parallel code is hard, slow, and bug-prone
Automatic parallelizing compilers can analyze the source code to identify parallelism
They use a cost-benefit framework to decide where parallelism would improve performance
Loops are a common target for automatic parallelization

Programmer directives may be used to guide the compiler (see the sketch after this list)
But:
Wrong results may be produced
Performance may degrade
Typically limited to a subset of the code (mostly loops)
May miss parallelization due to static cost-benefit analysis
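OpenMP is one common directive mechanism (not named on this slide). A minimal sketch of a directive-guided loop, assuming an OpenMP-capable C compiler; the array names and size are illustrative:

#include <stdio.h>

/* The "parallel for" directive asserts that iterations are independent,
   so the compiler/runtime may distribute them across threads.
   Compile with e.g. `cc -fopenmp`; without OpenMP support the pragma is
   ignored and the loop simply runs serially. */
int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];           /* independent iterations */

    printf("c[42] = %f\n", c[42]);
    return 0;
}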


Communication
Shared memory vs message passing
[Diagram: processors, an interconnect, and memory modules arranged in shared-memory and message-passing configurations]

Shared Memory Architecture


Processors access memory as a global address space
Memory updates by one processor are visible to others (eventually)
Memory consistency models

UMA
Typically Symmetric Multiprocessors (SMP)
Equal access and access times to memory
Hardware support for cache coherency (CC-UMA)

NUMA
Typically multiple SMPs, with access to each other's memories
Not all processors have equal access time to all memories
CC-NUMA: cache coherency is harder

Shared Memory
[Diagram: UMA and NUMA organizations - processors (P), interconnect, memory controllers, and memory modules]

Pros/Cons of Shared Memory

+ Easier to program with a global address space
+ Typically fast memory access (when hardware supported)
- Hard to scale
- Adding CPUs (geometrically) increases traffic
- Programmer-initiated synchronization of memory accesses

Distributed Memory Arch.


Communication network (typically between processors, but also memory)
Ethernet, InfiniBand, custom made

Processor-local memory
Access to another processor's data through a well-defined communication protocol
Implicit synchronization semantics

Inter-process synchronization by programmer


Pros/Cons of Distributed Memory


+ Memory is scalable with the number of processors
+ Local access is fast (no cache coherency overhead)
+ Cost effective, with off-the-shelf processors/network
- Programs are often more complex (no RAM model)
- Data communication is complex to manage

Parallel Programming Models

Shared Memory
Tasks share a common address space, which they access asynchronously
Locks / semaphores used to control access to the shared memory
Data may be cached on the processor that works on it
Compiler translates user variables into global memory addresses

Message Passing
A set of tasks that use their own local memory during computation
Data transfer usually requires cooperation: a send matched by a receive (see the sketch after this list)

Data Parallel
Focus on parallel operations on a set (array) of data items
Tasks perform the same operation on different parts of some data structure

Often organized as threads of computation
Multiple threads, each with local data, but they also share common data
Threads may communicate through global memory; the user synchronizes
Commonly associated with shared memory architectures and OS features
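A minimal message-passing sketch, assuming MPI (the slide names the model, not a specific library); the array contents and tag value are illustrative:

#include <mpi.h>
#include <stdio.h>

/* Send/receive pair: rank 0 sends, rank 1 posts the matching receive.
   Run with e.g. `mpicc send_recv.c && mpirun -np 2 ./a.out`. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[4] = {1, 2, 3, 4};
    if (rank == 0) {
        /* Cooperation: the transfer completes only when rank 1 posts a matching receive. */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}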

Parallel Task Decomposition

Data Parallel: perform f(x) for many x (a sketch follows this list)
Task Parallel: perform many functions fi
Pipeline
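To make "perform f(x) for many x" concrete, a minimal sketch (not from the slides) that statically decomposes an array across POSIX threads; N, P, and f are illustrative choices:

#include <pthread.h>
#include <stdio.h>
#include <math.h>

#define N 1000000
#define P 4

static double x[N], y[N];

static double f(double v) { return sqrt(v) + 1.0; }

struct chunk { int begin, end; };

/* Each thread applies the same f to its own contiguous chunk of the data. */
static void *worker(void *arg) {
    struct chunk *c = arg;
    for (int i = c->begin; i < c->end; i++)
        y[i] = f(x[i]);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = (double)i;

    pthread_t tid[P];
    struct chunk chunks[P];
    for (int t = 0; t < P; t++) {
        chunks[t].begin = t * (N / P);
        chunks[t].end   = (t == P - 1) ? N : (t + 1) * (N / P);
        pthread_create(&tid[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < P; t++)
        pthread_join(tid[t], NULL);

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}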


Fundamental Questions
Is the problem amenable to parallelization?
Are there (serial) dependencies?

What machine architectures are available?
Can they be re-configured?
Communication network

Algorithm
How to decompose the problem into tasks
How to map tasks to processors

Measuring Performance
How fast does a job complete?
Elapsed time (latency): compute + communicate + synchronize

How many jobs complete in a given time?
Throughput
Are they independent jobs?

How well does the system scale?
Increasing processors, memory, interconnect

Simple Performance Metrics

Speedup: Sp = T1 / Tp
  T1 = execution time using a 1-processor system
  Tp = execution time using p processors

Efficiency: Ep = Sp / p

Cost: Cp = p × Tp
Optimal if Cp = T1

Look out for inefficiency: T1 = n^3, Tp = n^2.5 for p = n^2, so Cp = n^4.5
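Working the inefficiency example through the definitions above (using the standard cost definition Cp = p·Tp):

\[
S_p = \frac{T_1}{T_p} = \frac{n^3}{n^{2.5}} = n^{0.5},
\qquad
E_p = \frac{S_p}{p} = \frac{n^{0.5}}{n^2} = n^{-1.5} \rightarrow 0,
\qquad
C_p = p\,T_p = n^2 \cdot n^{2.5} = n^{4.5} \gg T_1 = n^3 .
\]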



Amdahl's Law

f = fraction of the problem that is sequential
(1 - f) = fraction that is parallel

Best parallel time: Tp = T1 (f + (1 - f)/p)

Speedup with p processors: Sp = T1 / Tp = 1 / (f + (1 - f)/p)

Only the fraction (1 - f) is shared by p processors
Increasing p cannot speed up the fraction f

Upper bound on speedup: as p → ∞, the (1 - f)/p term converges to 0, so S ≤ 1/f

Example: f = 2%, S = 1 / 0.02 = 50
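The limit behind the upper bound, written out with the slide's example:

\[
S_p = \frac{T_1}{T_p} = \frac{1}{f + \frac{1-f}{p}}
\;\longrightarrow\; \frac{1}{f} \quad (p \to \infty),
\qquad
f = 0.02 \;\Rightarrow\; S_\infty = \frac{1}{0.02} = 50 .
\]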
