Current Sizes
- Intel 6-Core Core i7 (Sandy Bridge-E): 2.27B transistors, 32 nm, 434 mm²
- Intel 10-Core Xeon Westmere-EX: 2.6B transistors, 32 nm, 512 mm²
- NVIDIA GF100: 3B transistors, 40 nm, 529 mm²
- AMD Tahiti: 4.31B transistors, 28 nm, 365 mm²
- Xeon Nehalem-…
Parallel Computers
One of Google's data centers (40-80 servers per rack) occupies 68,680 square feet. Another uses 139,797 square feet. (Courtesy Harper's Magazine.)
Tianhe-1: 2.566 petaFLOPS. 14,336 Intel Xeon X5670 (2.93 GHz) processors and 7,168 NVIDIA Tesla M2050 general-purpose GPUs. Power: 4,040 kW.
K computer
Fujitsu SPARC64 VIIIfx 2.0 GHz, Tofu interconnect, at the RIKEN Advanced Institute for Computational Science (AICS)
- Cores: 705,024
- Power: 12,659.89 kW
- Memory: 1,410,048 GB
- Interconnect: custom
- Operating system: Linux
- Linpack performance: 10,510,000 GFLOPS (10.51 petaFLOPS)
India (#85): at CRL, Xeon 53xx 3 GHz, InfiniBand interconnect, 14,384 cores, 132,800 GFLOPS
Why Parallel
Can't clock faster. Do more per clock (bigger ICs, ...):
- Execute complex special-purpose instructions
- Execute more simple instructions
Even if a processor performs more operations per second, DRAM access times remain a bottleneck (improving only ~10% per year). Multiple processors can access memory in parallel, and also bring increased aggregate caching. Some of the fastest-growing applications of parallel computing exploit not raw computational speed but rather the ability to pump data to memory and disk faster.
- Atmospheric simulation: 1 km 3D grid, each point interacts with its neighbors; days of simulation time
- Movie making: a few minutes of film = 30 days of rendering time
- Oil exploration: months of sequential processing of seismic data
- Financial processing: market prediction, investing
- Computational biology: drug design, gene sequencing (Celera)
Parallel
[Figure: a stream of operations (OP), each applied to its operands, executed one at a time serially vs. all at once in parallel]
Serial vs parallel (concurrent)

atmWithdraw(int accountnum, int amount) {
    int curbalance = balance(accountnum);
    if (curbalance > amount) {
        setbalance(accountnum, curbalance - amount);
        eject(amount);
    } else
        ...   /* insufficient funds: the slide leaves this case elided */
}
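The read-check-write sequence above races under concurrency: two threads can both read the same balance and both pass the check, losing one update. A minimal sketch of one fix with a pthread mutex (my choice; the slides name no locking API), assuming the slide's helpers balance(), setbalance(), and eject() exist:

#include <pthread.h>

extern int  balance(int accountnum);
extern void setbalance(int accountnum, int newbalance);
extern void eject(int amount);

static pthread_mutex_t account_lock = PTHREAD_MUTEX_INITIALIZER;

void atmWithdraw(int accountnum, int amount) {
    pthread_mutex_lock(&account_lock);   /* make read-check-write atomic */
    int curbalance = balance(accountnum);
    if (curbalance > amount) {
        setbalance(accountnum, curbalance - amount);
        eject(amount);
    }
    pthread_mutex_unlock(&account_lock);
}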
Think concurrent
For each thread, other threads are adversaries, at least with regard to timing.
The programmer must conceptualize and code parallelism, and understand parallel algorithms and data structures.
Communication
Two models: Shared Memory and Message Passing.
[Figure: processors connected through an interconnect to memory modules]
UMA
- Typically Symmetric Multiprocessors (SMP)
- Equal access and access times to memory
- Hardware support for cache coherency (CC-UMA)
NUMA
- Typically multiple SMPs with access to each other's memories
- Not all processors have equal access time to all memories
- CC-NUMA: cache coherency is harder
Shared Memory
[Figure: UMA, processors (P) sharing one memory through an interconnect; NUMA, multiple such nodes linked together]
- Hard to scale
- Adding CPUs (geometrically) increases traffic
- Processor-local memory
- Access to another processor's data through a well-defined communication protocol
- Implicit synchronization semantics
Message Passing
A set of tasks that use their own local memory during computation. Data transfer usually requires cooperation: a send must be matched by a receive.
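A minimal sketch of a matched send/receive using MPI (my choice of library; the slides don't name one). Compile with mpicc and run with mpirun -np 2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;   /* data lives in rank 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);       /* send ... */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* ... matched by a receive */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}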
Data Parallel
Focus on parallel operations on a set (array) of data items. Each task performs the same operation on a different part of some data structure.
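A minimal data-parallel sketch using OpenMP (an assumption; the slide names no API): every thread applies the same operation to a different chunk of the array. Compile with -fopenmp:

#include <stdio.h>

#define N 8

int main(void) {
    double a[N];
    for (int i = 0; i < N; i++) a[i] = i;

    #pragma omp parallel for        /* iterations split across threads */
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * a[i];          /* same operation, different data items */

    for (int i = 0; i < N; i++) printf("%g ", a[i]);
    printf("\n");
    return 0;
}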
Task Parallel
Perform many different functions f_i in parallel (see the sketch below).
Pipeline
A special case of task parallelism: tasks are chained in stages.
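A minimal task-parallel sketch (an assumed example, not from the slides): two pthreads run different functions, here a sum task and a max task, over the same data:

#include <pthread.h>
#include <stdio.h>

static double data[4] = {3.0, 1.0, 4.0, 1.5};

static void *sum_task(void *arg) {            /* task 1: sum of the array */
    static double s; s = 0;
    for (int i = 0; i < 4; i++) s += data[i];
    return &s;
}

static void *max_task(void *arg) {            /* task 2: max of the array */
    static double m; m = data[0];
    for (int i = 1; i < 4; i++) if (data[i] > m) m = data[i];
    return &m;
}

int main(void) {
    pthread_t t1, t2;
    void *r1, *r2;
    pthread_create(&t1, NULL, sum_task, NULL); /* different functions ... */
    pthread_create(&t2, NULL, max_task, NULL); /* ... run in parallel */
    pthread_join(t1, &r1);
    pthread_join(t2, &r2);
    printf("sum = %g, max = %g\n", *(double *)r1, *(double *)r2);
    return 0;
}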
Fundamental Questions
Is the problem amenable to parallelization? Are there (serial) dependencies?
Algorithm
- How to decompose the problem into tasks
- How to map tasks to processors
Measuring Performance
How fast does a job complete?
Elapsed time (latency) = compute + communicate + synchronize
Speedup: S_p = T_1 / T_p (T_1 = time on one processor, T_p = time on p processors)
Efficiency: E_p = S_p / p
Cost: C_p = p · T_p
Cost-optimal if C_p = T_1
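A small worked example with assumed numbers (T_1 = 100 s on one processor, T_8 = 20 s on eight):

S_8 = \frac{T_1}{T_8} = \frac{100}{20} = 5, \qquad
E_8 = \frac{S_8}{8} = 0.625, \qquad
C_8 = 8 \cdot T_8 = 160\,\mathrm{s} > T_1 = 100\,\mathrm{s}

so this run is not cost-optimal.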
Amdahl's Law
f = fraction of the problem that is sequential
(1 - f) = fraction that is parallel
Only the fraction (1 - f) is shared among p processors; increasing p cannot speed up the fraction f.
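The resulting bound, in its standard form (the extracted slide lost the formula itself):

S_p = \frac{T_1}{f\,T_1 + (1-f)\,T_1 / p} = \frac{1}{f + \frac{1-f}{p}} \le \frac{1}{f}

For example, with f = 0.1 the speedup can never exceed 1/0.1 = 10, no matter how large p grows.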