CSL730: An Introduction to Parallel Computation

Focus of the Course


Hybrid many-core computation
Parallel algorithms
General techniques
Learn to analyse, program, debug
Related terms: supercomputing, grid computing, cloud computing

Moore's Law (1965)

"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000."

1975: Revised the rate of circuit complexity doubling to 18 months going forward. "There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions."

2003: "Another decade is probably straightforward... There is certainly no end to creativity."

Current Sizes
Intel 6-core Core i7 (Sandy Bridge-E): 2.27B transistors, 32nm, 434mm²
Intel 10-core Xeon Westmere-EX: 2.6B transistors, 32nm, 512mm²
NVIDIA GF100: 3B transistors, 40nm, 529mm²
AMD Tahiti: 4.31B transistors, 28nm, 365mm²

1971 Intel 4004: 2,300 transistors, 10µm, 12mm²

Parallel Computers
One of Google's data centers (40-80 servers per rack) sits on a 68,680 square foot space. Another uses a 139,797 square foot space. (courtesy Harper's magazine)

Tianhe-1: 2.566 petaFLOPS; 14,336 Intel Xeon X5670 (2.93 GHz) processors and 7,168 Nvidia Tesla M2050 general-purpose GPUs. Power: 4,040 kW. (courtesy Science & Technology)

K computer
Fujitsu SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, at RIKEN Advanced Institute for Computational Science (AICS)
Cores: 705,024
Power: 12,659.89 kW
Memory: 1,410,048 GB
Interconnect: custom (Tofu)
Operating system: Linux
Linpack performance: 10,510,000 GFlop/s

India #85: @CRL, Xeon 53xx 3 GHz, InfiniBand interconnect, 14,384 cores, 132,800 GFlop/s

Why Parallel
Can't clock faster
Do more per clock (bigger ICs ...)
Execute complex special-purpose instructions
Execute more simple instructions

Even if a processor performs more operations per second, DRAM access times remain a bottleneck (improving only ~10% per year)
Multiple processors can access memory in parallel; more processors also means more aggregate cache
Some of the fastest growing applications of parallel computing exploit not their raw computational speed but their ability to pump data to memory and disk faster

Some Complex Problems


N-body simulation
1 million bodies
days/iteration

Atmospheric simulation
1km 3D grid, each point interacts with its neighbors
Days of simulation time

Movie making
A few minutes = 30 days of rendering time

Oil exploration
months of sequential processing of seismic data

Financial processing
market prediction, investing

Computational biology
drug design, gene sequencing (Celera)

Executing Stored Programs

Sequential: a single stream of instructions (OP + operands), executed one after another
Parallel: multiple streams of instructions (OP + operands), executed concurrently

Serial vs parallel (concurrent)

atmWithdraw(int accountnum, int amount) {
    int curbalance = balance(accountnum);            // read shared balance
    if (curbalance > amount) {
        setbalance(accountnum, curbalance - amount); // write it back
        eject(amount);
    } else {
        // insufficient balance (branch body elided on the slide)
    }
}
// Two concurrent calls can both read the same balance before either writes:
// a read-modify-write race.
Programming in the Parallel


Understand target model (Semantics)
Implications/Restrictions of constructs/features

Design for the target model


Choice of granularity, synchronization primitive
Usually more of a performance issue

Think concurrent
For each thread, other threads are adversaries
At least with regard to timing

Process launch, Communication, Synchronization


Clearly define pre- and post-conditions

Employ high-level constructs when possible


Debugging is extra hard without them

Learn Parallel Programming?


Let the compiler extract parallelism?
In general, not successful so far
Too context-sensitive
Many efficient serial data structures and algorithms are parallel-inefficient
Even if the compiler extracted parallelism from serial code, it may not be what you want

Programmer must conceptualize and code parallelism
Understand parallel algorithms and data structures

Automatic vs Manual Parallelization


Manually implementing parallel code is hard, slow, and bug-prone
Automatic parallelizing compilers can analyze the source code to identify parallelism
They use a cost-benefit framework to decide where parallelism would improve performance
Loops are a common target for automatic parallelization

Programmer directives may be used to guide the compiler (see the sketch after this list)
But:
Wrong results may be produced
Performance may degrade
Typically limited to a subset of the code (mostly loops)
May miss parallelization due to static cost-benefit analysis
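OpenMP is one common directive mechanism (not named on this slide). A minimal sketch of a directive-guided loop, assuming an OpenMP-capable C compiler; the array names and size are illustrative:

#include <stdio.h>

/* The "parallel for" directive asserts that iterations are independent,
   so the compiler/runtime may distribute them across threads.
   Compile with e.g. `cc -fopenmp`; without OpenMP support the pragma is
   ignored and the loop simply runs serially. */
int main(void) {
    enum { N = 1000000 };
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];           /* independent iterations */

    printf("c[42] = %f\n", c[42]);
    return 0;
}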


Communication
Shared memory vs message passing
[Diagram: processors, an interconnect, and memory modules arranged in shared-memory and message-passing configurations]

Shared Memory Architecture


Processors access memory as a global address space
Memory updates by one processor are visible to others (eventually)
Memory consistency models

UMA
Typically Symmetric Multiprocessors (SMP)
Equal access and access times to memory
Hardware support for cache coherency (CC-UMA)

NUMA
Typically multiple SMPs, with access to each other's memories
Not all processors have equal access time to all memories
CC-NUMA: cache coherency is harder

Shared Memory
[Diagram: UMA and NUMA organizations - processors (P), interconnect, memory controllers, and memory modules]

Pros/Cons of Shared Memory

+ Easier to program with a global address space
+ Typically fast memory access (when hardware supported)
- Hard to scale
- Adding CPUs (geometrically) increases traffic
- Programmer-initiated synchronization of memory accesses

Distributed Memory Arch.


Communication network (typically between processors, but also memory)
Ethernet, InfiniBand, custom made

Processor-local memory
Access to another processor's data through a well-defined communication protocol
Implicit synchronization semantics

Inter-process synchronization by programmer


Pros/Cons of Distributed Memory


+ Memory is scalable with the number of processors
+ Local access is fast (no cache coherency overhead)
+ Cost effective, with off-the-shelf processors/network
- Programs are often more complex (no RAM model)
- Data communication is complex to manage

Parallel Programming Models

Shared Memory
Tasks share a common address space, which they access asynchronously
Locks / semaphores used to control access to the shared memory
Data may be cached on the processor that works on it
Compiler translates user variables into global memory addresses

Message Passing
A set of tasks that use their own local memory during computation
Data transfer usually requires cooperation: a send matched by a receive (see the sketch after this list)

Data Parallel
Focus on parallel operations on a set (array) of data items
Tasks perform the same operation on different parts of some data structure

Often organized as threads of computation
Multiple threads, each with local data, but they also share common data
Threads may communicate through global memory; the user synchronizes
Commonly associated with shared memory architectures and OS features
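A minimal message-passing sketch, assuming MPI (the slide names the model, not a specific library); the array contents and tag value are illustrative:

#include <mpi.h>
#include <stdio.h>

/* Send/receive pair: rank 0 sends, rank 1 posts the matching receive.
   Run with e.g. `mpicc send_recv.c && mpirun -np 2 ./a.out`. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int data[4] = {1, 2, 3, 4};
    if (rank == 0) {
        /* Cooperation: the transfer completes only when rank 1 posts a matching receive. */
        MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d %d %d %d\n", data[0], data[1], data[2], data[3]);
    }

    MPI_Finalize();
    return 0;
}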

Parallel Task Decomposition

Data Parallel: perform f(x) for many x (a sketch follows this list)
Task Parallel: perform many functions fi
Pipeline
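To make "perform f(x) for many x" concrete, a minimal sketch (not from the slides) that statically decomposes an array across POSIX threads; N, P, and f are illustrative choices:

#include <pthread.h>
#include <stdio.h>
#include <math.h>

#define N 1000000
#define P 4

static double x[N], y[N];

static double f(double v) { return sqrt(v) + 1.0; }

struct chunk { int begin, end; };

/* Each thread applies the same f to its own contiguous chunk of the data. */
static void *worker(void *arg) {
    struct chunk *c = arg;
    for (int i = c->begin; i < c->end; i++)
        y[i] = f(x[i]);
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) x[i] = (double)i;

    pthread_t tid[P];
    struct chunk chunks[P];
    for (int t = 0; t < P; t++) {
        chunks[t].begin = t * (N / P);
        chunks[t].end   = (t == P - 1) ? N : (t + 1) * (N / P);
        pthread_create(&tid[t], NULL, worker, &chunks[t]);
    }
    for (int t = 0; t < P; t++)
        pthread_join(tid[t], NULL);

    printf("y[N-1] = %f\n", y[N - 1]);
    return 0;
}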


Fundamental Questions
Is the problem amenable to parallelization?
Are there (serial) dependencies?

What machine architectures are available?
Can they be re-configured?
Communication network

Algorithm
How to decompose the problem into tasks
How to map tasks to processors

Measuring Performance
How fast does a job complete?
Elapsed time (latency): compute + communicate + synchronize

How many jobs complete in a given time?
Throughput
Are they independent jobs?

How well does the system scale?
Increasing processors, memory, interconnect

Simple Performance Metrics

Speedup: Sp = T1 / Tp
  T1 = execution time using a 1-processor system
  Tp = execution time using p processors

Efficiency: Ep = Sp / p

Cost: Cp = p × Tp
Optimal if Cp = T1

Look out for inefficiency: T1 = n^3, Tp = n^2.5 for p = n^2, so Cp = n^4.5
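Working the inefficiency example through the definitions above (using the standard cost definition Cp = p·Tp):

\[
S_p = \frac{T_1}{T_p} = \frac{n^3}{n^{2.5}} = n^{0.5},
\qquad
E_p = \frac{S_p}{p} = \frac{n^{0.5}}{n^2} = n^{-1.5} \rightarrow 0,
\qquad
C_p = p\,T_p = n^2 \cdot n^{2.5} = n^{4.5} \gg T_1 = n^3 .
\]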



Amdahl's Law

f = fraction of the problem that is sequential
(1 - f) = fraction that is parallel

Best parallel time: Tp = T1 (f + (1 - f)/p)

Speedup with p processors: Sp = T1 / Tp = 1 / (f + (1 - f)/p)

Only the fraction (1 - f) is shared by p processors
Increasing p cannot speed up the fraction f

Upper bound on speedup: as p → ∞, the (1 - f)/p term converges to 0, so S ≤ 1/f

Example: f = 2%, S = 1 / 0.02 = 50
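The limit behind the upper bound, written out with the slide's example:

\[
S_p = \frac{T_1}{T_p} = \frac{1}{f + \frac{1-f}{p}}
\;\longrightarrow\; \frac{1}{f} \quad (p \to \infty),
\qquad
f = 0.02 \;\Rightarrow\; S_\infty = \frac{1}{0.02} = 50 .
\]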
