
Chapter 1: Perspectives

Copyright © 2005-2008 Yan Solihin


Copyright notice:
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted by any means (electronic, mechanical, photocopying, recording, or
otherwise) without the prior written permission of the author.
An exception is granted for academic lectures at universities and colleges, provided
that the following text is included in such copy: Source: Yan Solihin, Fundamentals
of Parallel Computer Architecture, 2008.

Evolution in Microprocessors


Key Points

- More and more components can be integrated on a single chip
  - The pace of integration tracks Moore's law: doubling every 18-24 months
  - Performance tracked the pace of integration until recently
- At the architecture level, two main techniques exploit the extra transistors:
  - Instruction-level parallelism
  - Cache memory
- Performance gains from uniprocessor systems were so significant that multiprocessor systems were not profitable

Illustration

- 100-processor system with perfect speedup, compared to a single-processor system:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - Year 10: 0.9x faster
- Single-processor performance catches up in just a few years!
- Even worse:
  - It takes longer to develop a multiprocessor system
  - Low volume means prices must be very high
  - High prices delay adoption
  - Perfect speedup is unattainable

Why did uniprocessor performance grow so fast?

- Roughly half from circuit improvements (smaller transistors, faster clocks, etc.)
- Roughly half from architecture/organization:
  - Instruction-level parallelism (ILP)
    - Pipelining: RISC, and CISC with a RISC back-end
    - Superscalar execution
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploiting spatial and temporal locality
    - Multiple cache levels

But uniprocessor performance growth is stalling

- Source of uniprocessor performance growth: instruction-level parallelism (ILP)
  - Parallel execution of independent instructions from a single thread
- ILP growth has slowed abruptly:
  - Memory wall: processor speed grows at ~55% per year, memory speed at only ~7% per year
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)
  - Power efficiency: thermal packaging limit vs. cost

Types of parallelism

- Instruction level (ECE 521)
  - Pipelining: consecutive instructions overlap in the pipeline stages, e.g. with instruction A (a load) followed by two more instructions:

        A (a load):  IF  ID  EX  MEM  WB
        B:               IF  ID  EX   MEM  WB
        C:                   IF  ID   EX   MEM  WB

Superscalar / VLIW

Original:

    LD    F0, 34(R2)
    ADDD  F4, F0, F2
    LD    F7, 45(R3)
    ADDD  F8, F7, F6

Schedule as:

    LD    F0, 34(R2)   |  LD    F7, 45(R3)
    ADDD  F4, F0, F2   |  ADDD  F8, F7, F6

+ Moderate degree of parallelism (sometimes 50)
- Requires fast communication (register level)

Why ILP is slowing

- Branch prediction accuracy is already > 90%
  - Hard to improve it even more
- Number of pipeline stages is already deep (~20-30 stages)
  - But critical dependence loops do not change
  - Memory latency requires more clock cycles to satisfy
- Processor width is already high
  - Quadratically increasing complexity to increase the width
- Cache size
  - Effective, but also shows diminishing returns
  - In general, the size must be doubled to reduce the miss rate by half

Current Trend: Multicore and Manycore

Aspect        Intel Clovertown     AMD Barcelona                        IBM Cell
# cores       4                    4                                    8+1
Clock freq    2.66 GHz             2.3 GHz                              3.2 GHz
Core type     OOO superscalar      OOO superscalar                      2-issue SIMD
Caches        2x4 MB L2            512 KB L2 (private), 2 MB L3 (shd)   256 KB local store
Chip power    120 Watts            95 Watts                             100 Watts

Historical Perspectives

- 1980s to early 1990s: prime time for parallel architecture research
  - A microprocessor could not fit on a single chip, so it was natural to use multiple chips (and processors)
  - J-machine, M-machine, Alewife, Tera, HEP, etc.
- 1990s: at the low end, uniprocessor speed grows much faster than parallel system speed
  - A microprocessor fits on a chip. So do the branch predictor, multiple functional units, large caches, etc.!
  - The microprocessor also exploits parallelism (pipelining, multiple issue, VLIW), forms of parallelism originally invented for multiprocessors
  - Many parallel computer vendors went bankrupt
  - High-performance computing remained a prestigious but small market

"If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get a million miles per gallon, and it would be cheaper to throw it away than to park it."
-- Gordon Moore, Intel Corporation

- 1990s: emergence of distributed (vs. parallel) machines
  - Progress in network technologies:
    - Network bandwidth grew faster than Moore's law
    - Fast interconnection networks became cheap
  - Connecting cheap uniprocessor systems into a large distributed machine:
    - Networks of Workstations, Clusters, Grid
- 2000s: parallel architectures are back
  - Transistors per chip >> transistors needed by a single microprocessor
  - Harder to get more performance from a uniprocessor
  - SMT (Simultaneous Multithreading), CMP (Chip Multiprocessor), and ultimately massive CMP
  - E.g. Intel Pentium D, Core Duo, AMD dual core, IBM Power5, Sun Niagara, etc.

What is a Parallel Architecture?

"A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast."
-- Almasi & Gottlieb

Parallel computers

"A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast." [Almasi & Gottlieb]

- "collection of processing elements"
  - How many? How powerful is each? Scalability?
  - A few very powerful processors (e.g., Altix) vs. many small ones (BlueGene)
- "that can communicate"
  - How do PEs communicate? (shared memory vs. message passing)
  - Interconnection network (bus, multistage, crossbar, etc.)
  - Evaluation criteria: cost, latency, throughput, scalability, and fault tolerance

"and cooperate"

- Issues: granularity, synchronization, and autonomy
- Synchronization allows sequencing of operations to ensure correctness
- As granularity goes up, parallelism goes down, but communication and overhead go down too
  - Statement/instruction level: 2-10 instructions (ECE 521)
  - Loop level: 10-1K instructions
  - Task level: 1K-1M instructions
  - Program level: > 1M instructions
- Autonomy
  - SIMD (single instruction stream) vs. MIMD (multiple instruction streams)

"solve a large problem fast"

- General- vs. special-purpose machine?
  - Any machine can solve certain problems well
- What domains?
  - Highly (embarrassingly) parallel apps: many scientific codes
  - Medium-parallel apps: many engineering apps (finite elements, VLSI CAD)
  - Not-parallel apps: compilers, editors (do we care?)

Why parallel computers?

- Absolute performance: can we afford to wait?
  - Folding of a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
  - Weather forecasting: timeliness is crucial
- Cost/performance
  - It is getting harder to improve performance on a single processor
  - A bigger monolithic processor vs. many simple processors
- Power/performance
- Reliability and availability
- Key enabling technologies:
  - Advances in microprocessor and interconnect technology
  - Advances in software technology

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid

Loop-level parallelism

- Each iteration can be computed independently (see the OpenMP sketch after this slide):

    for (i = 0; i < 8; i++)
        a[i] = b[i] + c[i];

- Each iteration cannot be computed independently, thus there is no loop-level parallelism:

    for (i = 0; i < 8; i++)
        a[i] = b[i] + a[i-1];

+ Very high degree of parallelism (> 1K)
+ Often easy to achieve load balance
- Some loops are not parallel
- Some apps do not have many loops
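As a concrete illustration (my addition, not from the original slides), the independent loop above can be parallelized with an OpenMP pragma. This is a minimal sketch, assuming a C compiler with OpenMP support (e.g. gcc -fopenmp); without that flag the pragma is simply ignored and the loop runs serially:

    #include <stdio.h>

    int main(void) {
        double a[8], b[8], c[8];
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* Iterations are independent, so they can be divided among threads. */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++)
            a[i] = b[i] + c[i];

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }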

Task-level parallelism

- Arbitrary code segments in a single program
- Across loops (a sketch using OpenMP sections follows after this slide):

    for (i = 0; i < n; i++)
        sum = sum + a[i];
    for (i = 0; i < n; i++)
        prod = prod * a[i];

- Across subroutines:

    Cost = getCost();
    A = computeSum();
    B = A + Cost;

- Threads: e.g. in an editor: GUI, printing, parsing

+ Larger granularity => lower overheads and communication
- Low degree of parallelism
- Hard to balance the load
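A hedged sketch (my own example, not the book's) of running the two independent loops above as separate tasks with OpenMP sections; each section updates a different variable, so no synchronization is needed beyond the implicit barrier at the end:

    #include <stdio.h>

    int main(void) {
        enum { N = 8 };
        double a[N], sum = 0.0, prod = 1.0;
        for (int i = 0; i < N; i++) a[i] = i + 1;

        /* The two loops produce disjoint results, so they can run as two tasks. */
        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < N; i++) sum += a[i];

            #pragma omp section
            for (int i = 0; i < N; i++) prod *= a[i];
        }

        printf("sum = %.1f, prod = %.1f\n", sum, prod);
        return 0;
    }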

Program-level parallelism

- Various independent programs execute together
- gmake:

    gcc -c code1.c               // assign to proc1
    gcc -c code2.c               // assign to proc2
    gcc -c main.c                // assign to proc3
    gcc main.o code1.o code2.o

+ No communication
- Hard to balance the load
- Few opportunities
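In practice (my note, not from the slide), GNU make can exploit exactly this kind of parallelism itself: invoking the build as gmake -j3 lets it run the three independent compilations concurrently before the final link step.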

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid

Taxonomy of Parallel Computers

The Flynn taxonomy:
- Single or multiple instruction streams
- Single or multiple data streams

1. SISD machine (most desktops, laptops)
- Only one instruction fetch stream
- Most of today's workstations or desktops

[Diagram: a single control unit issues one instruction stream to one ALU, which operates on a single data stream]

SIMD

- Examples: vector processors, SIMD extensions (MMX)
- A single instruction operates on multiple data items:

    SISD:
    for (i = 0; i < 8; i++)
        a[i] = b[i] + c[i];

    SIMD:
    a = b + c;   // vector addition

[Diagram: one control unit broadcasts a single instruction stream to ALU 1, ALU 2, ..., ALUn, each operating on its own data stream]

- Pseudo-SIMD is popular for multimedia extensions (an illustrative SSE sketch follows after this slide)
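As an illustrative sketch (not from the slides), the same eight-element addition can be written with x86 SSE intrinsics, where one instruction adds four floats at a time; this assumes an x86 machine and a compiler providing xmmintrin.h:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[8], b[8] = {1,2,3,4,5,6,7,8}, c[8] = {10,20,30,40,50,60,70,80};

        /* Two vector additions replace eight scalar additions. */
        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
        }

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }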

MISD machine

- Example: CMU Warp
  - Systolic arrays

[Diagram: a single data stream passes through ALU 1, ALU 2, ..., ALUn, each driven by its own control unit and instruction stream]

Systolic Arrays (cont'd)

Example: a systolic array for 1-D convolution (a plain-C reference follows after this slide):

    y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)

[Figure: the inputs x1, x2, ..., x8 stream through a chain of four cells holding the weights w4, w3, w2, w1, while partial results y1, y2, y3 flow in the opposite direction. Each cell performs:
    xout = x
    x    = xin
    yout = yin + w * xin ]

- Practical realizations (e.g. iWARP) use quite general processors
  - Enable a variety of algorithms on the same hardware
- But dedicated interconnect channels
  - Data transferred directly from register to register across a channel
- Specialized, and same problems as SIMD
  - General-purpose systems work well for the same algorithms (locality etc.)
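For reference (my addition, not part of the slide), here is a minimal sequential C version of the convolution that the systolic array evaluates in pipelined fashion; the weight and input values are made up purely for illustration:

    #include <stdio.h>

    int main(void) {
        enum { NX = 8, NW = 4 };
        double x[NX] = {1, 2, 3, 4, 5, 6, 7, 8};
        double w[NW] = {0.5, 1.0, 1.5, 2.0};   /* w1..w4 (hypothetical values) */
        double y[NX - NW + 1];

        /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
        for (int i = 0; i < NX - NW + 1; i++) {
            y[i] = 0.0;
            for (int j = 0; j < NW; j++)
                y[i] += w[j] * x[i + j];
        }

        for (int i = 0; i < NX - NW + 1; i++)
            printf("y(%d) = %.1f\n", i + 1, y[i]);
        return 0;
    }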

MIMD machine

- Independent processors connected together to form a multiprocessor system
- Physical organization:
  - Determines which memory hierarchy level is shared
- Programming abstraction:
  - Shared memory:
    - On a chip: Chip Multiprocessor (CMP)
    - Interconnected by a bus: Symmetric Multiprocessor (SMP)
    - Point-to-point interconnection: Distributed Shared Memory (DSM)
  - Distributed memory:
    - Clusters, Grid

MIMD Physical Organization

Shared-cache architecture:
[Diagram: several processors (P) share one level of caches and a memory M]
  - CMP (or Simultaneous Multi-Threading)
  - e.g.: Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
  - Implies shared memory hardware

UMA (Uniform Memory Access) shared memory:
[Diagram: processors with private caches connected through a network to a shared memory M]
  - Pentium Pro Quad, Sun Enterprise, etc.
  - What interconnection network?
    - Bus
    - Multistage
    - Crossbar
    - etc.
  - Implies shared memory hardware

MIMD Physical Organization (2)

NUMA (Non-Uniform Memory Access) shared memory:
[Diagram: nodes of processor, caches, and local memory connected by a network]
  - SGI Origin, Altix, IBM p690, AMD Hammer-based systems
  - What interconnection network?
    - Crossbar
    - Mesh
    - Hypercube
    - etc.
  - Also referred to as Distributed Shared Memory

MIMD Physical Organization (3)

Distributed system / distributed memory:
[Diagram: complete nodes (processor, caches, memory, I/O) connected by a network]
  - Also called clusters, grid
  - Don't confuse it with distributed shared memory

Parallel vs. Distributed Computers

[Figure: two qualitative plots against system size, each comparing a parallel computer with a distributed computer. Cost rises much faster with size for the parallel computer; performance is higher for the parallel computer.]

- Small-scale machines: a parallel system is cheaper
- Large-scale machines: a distributed system is cheaper
- Performance: a parallel system is better (but more expensive)
- System size: a parallel system is limited, and its cost grows fast
- However, one must also consider software cost

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - MIMD
    - Shared memory machines (SMP and DSM)
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid (e.g., UPC)
  - Data parallel

Programming Models

Shared memory / shared address space:
- Each processor can see the entire memory

[Diagram: several processors (P) all connected to a single shared memory (M)]

- Programming model = thread programming, as in uniprocessor systems (a small POSIX-threads sketch follows after this slide)
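A minimal sketch (my own illustration, not from the slides) of the shared-memory model using POSIX threads: both threads read and write the same arrays through ordinary loads and stores; compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static double a[N], b[N], c[N];   /* shared: visible to all threads */

    /* Each call adds half of the vectors; arg selects the starting index. */
    static void *add_half(void *arg) {
        int start = *(int *)arg;
        for (int i = start; i < start + N / 2; i++)
            a[i] = b[i] + c[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        pthread_t t;
        int lo = 0, hi = N / 2;
        pthread_create(&t, NULL, add_half, &hi);  /* second half in a new thread */
        add_half(&lo);                            /* first half in the main thread */
        pthread_join(t, NULL);

        for (int i = 0; i < N; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }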

Distributed memory / message passing / multiple address spaces:
- A processor can only directly access its own local memory; all communication happens through explicit messages (an illustrative MPI sketch follows after this slide)

[Diagram: several nodes, each with a processor (P) and its own memory (M), connected by a network]
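An illustrative message-passing sketch in C using MPI (my example, not the book's): rank 0 holds a value in its private memory, and rank 1 can only obtain it through an explicit message. Run with at least two ranks, e.g. mpirun -np 2 ./a.out:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double x = 3.14;                   /* lives only in rank 0's memory */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %.2f\n", x);
        }

        MPI_Finalize();
        return 0;
    }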

Shared Memory compared to Message Passing

+ Can easily be automated (parallelizing compiler, OpenMP)
+ Shared variables are not communicated, but must be guarded
- How to provide shared memory? Complex hardware
- Synchronization overhead grows fast with more processors
+/- Difficult to debug, not intuitive for users

Data Parallel Programming Paradigm & Systems

- Programming model:
  - Operations are performed in parallel on each element of a data structure
  - Logically a single thread of control, performing sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model:
  - Array of many simple, cheap processors, each with little memory
    - Processors don't sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations:
  - Matches simple differential equation solvers
  - Centralizes the high cost of instruction fetch/sequencing

Application of Data Parallelism

- Each PE contains an employee record with his/her salary (a plain-C rendering follows after this slide):

    If salary > 100K then
        salary = salary * 1.05
    else
        salary = salary * 1.10

- Logically, the whole operation is a single step
- Some processors are enabled for the arithmetic operation, others are disabled
- Other examples:
  - Finite differences, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Some recent machines:
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - Maspar MP-1 and MP-2, ...
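In a conventional C rendering (illustrative only, with made-up salary data), the same per-element update is just a loop; on a data-parallel machine, every element's test-and-multiply happens in one logical step:

    #include <stdio.h>

    int main(void) {
        double salary[5] = {80e3, 120e3, 95e3, 150e3, 60e3};   /* hypothetical data */

        /* Per-element update: a data-parallel machine does this in a single step. */
        for (int i = 0; i < 5; i++)
            salary[i] *= (salary[i] > 100e3) ? 1.05 : 1.10;

        for (int i = 0; i < 5; i++)
            printf("salary[%d] = %.0f\n", i, salary[i]);
        return 0;
    }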

Common Today

- Systolic arrays: idea adopted in graphics and network processors
- Dataflow: idea adopted in superscalar processors
- Shared memory: most small-scale servers (up to 128 processors)
  - Now in workstations/desktops/laptops, too
- Message passing: most large-scale systems
  - Clusters, grid (hundreds to thousands of processors)
- Data parallel / SIMD:
  - Small scale: SIMD multimedia extensions (MMX, VIS)
  - Large scale: vector processors

Top 500 Supercomputers

- http://www.top500.org
- Let's look at the Earth Simulator
  - Was #1 in 2004, now #10 in 2006
- Hardware:
  - 5,120 NEC CPUs at 500 MHz (640 nodes, 8 CPUs per node)
  - 8 GFLOPS per CPU (41 TFLOPS total)
    - Sustained performance in the 30s of TFLOPS!
  - 2 GB per CPU (4 x 512 MB FPLRAM modules); 10 TB total memory
    - Shared memory inside each node
  - 640 x 640 crossbar switch between the nodes
  - 16 GB/s inter-node bandwidth
  - 20 kVA power consumption per node

Programming Model

- Within a CPU: data parallel, using automatic vectorization (instruction level)
- Within a node (8 CPUs): shared memory using OpenMP (loop level)
- Across nodes: message passing using MPI-2 or HPF (algorithm level)

The machine room sits at approximately the 4th-floor level. The 3rd-floor level is taken up by hundreds of kilometers of copper cabling, and the lower floors house the air conditioning and electrical equipment. The structure is enclosed in a cooling shell, with air pumped from underneath through the cabinets and collected along the two long sides of the building. The aeroshell gives the building its "pumped-up" appearance. The machine room is electromagnetically shielded to prevent interference from the nearby expressway and railway. Even the halogen light sources are outside the shield, and the light is distributed by a grid of scattering pipes under the ceiling. The entire structure is mechanically isolated from its surroundings, suspended in order to make it less prone to earthquake damage. All attachments (power, cooling, access walkways) are flexible.

- Linpack performance: 40 TFLOPS, 80% of peak
- Real-world performance: 33-66% of peak (vs. less than 15% for clusters)
- Cost? Hint: it starts with a 4
- Maintenance: $15M per year
- Failures: about one processor per week
- A distributed-memory parallel computing system in which 640 processor nodes are interconnected by a single-stage crossbar network

Fastest (#1 as of Aug 2006)

- BlueGene
- 65,536 processors
- Each processor: PowerPC 440 at 700 MHz (2.8 GFLOPS)
- Rpeak: 183 TFLOPS
- Rmax: 136 TFLOPS

Limitations of very large machines

- Niche market
- Power wall
  - By using low-power processors, BlueGene can scale to a very large processor count
  - Many practical issues: electricity, cooling, etc.
- Programming wall
  - Extremely hard to extract performance out of a very large machine
