
Chapter 1: Perspectives

Copyright © 2005-2008 Yan Solihin


Copyright notice:
No part of this publication may be reproduced, stored in a retrieval system, or
transmitted by any means (electronic, mechanical, photocopying, recording, or
otherwise) without the prior written permission of the author.
An exception is granted for academic lectures at universities and colleges, provided
that the following text is included in such copy: Source: Yan Solihin, Fundamentals
of Parallel Computer Architecture, 2008.

Evolution in Microprocessors


Key Points

- More and more components can be integrated on a single chip
  - The pace of integration tracks Moore's law: doubling every 18-24 months
  - Performance tracked the pace of integration until recently
- At the architecture level, two main techniques exploit the extra transistors:
  - Instruction-level parallelism
  - Cache memory
- Performance gains from uniprocessor systems were so significant that multiprocessor systems were not profitable

Illustration

- 100-processor system with perfect speedup, compared to a single-processor system:
  - Year 1: 100x faster
  - Year 2: 62.5x faster
  - Year 3: 39x faster
  - Year 10: 0.9x faster
- Single-processor performance catches up in just a few years!
- Even worse:
  - It takes longer to develop a multiprocessor system
  - Low volume means prices must be very high
  - High prices delay adoption
  - Perfect speedup is unattainable

Why did uniprocessor performance grow so fast?

- Roughly half from circuit improvements (smaller transistors, faster clocks, etc.)
- Roughly half from architecture/organization:
  - Instruction-level parallelism (ILP)
    - Pipelining: RISC, and CISC with a RISC back-end
    - Superscalar execution
    - Out-of-order execution
  - Memory hierarchy (caches)
    - Exploiting spatial and temporal locality
    - Multiple cache levels

But uniprocessor performance growth is stalling

- Source of uniprocessor performance growth: instruction-level parallelism (ILP)
  - Parallel execution of independent instructions from a single thread
- ILP growth has slowed abruptly:
  - Memory wall: processor speed grows at ~55% per year, memory speed at only ~7% per year
  - ILP wall: achieving higher ILP requires quadratically increasing complexity (and power)
  - Power efficiency: thermal packaging limit vs. cost

Types of parallelism

- Instruction level (ECE 521)
  - Pipelining: consecutive instructions overlap in the pipeline stages, e.g. with instruction A (a load) followed by two more instructions:

        A (a load):  IF  ID  EX  MEM  WB
        B:               IF  ID  EX   MEM  WB
        C:                   IF  ID   EX   MEM  WB

Superscalar / VLIW

Original:

    LD    F0, 34(R2)
    ADDD  F4, F0, F2
    LD    F7, 45(R3)
    ADDD  F8, F7, F6

Schedule as:

    LD    F0, 34(R2)   |  LD    F7, 45(R3)
    ADDD  F4, F0, F2   |  ADDD  F8, F7, F6

+ Moderate degree of parallelism (sometimes 50)
- Requires fast communication (register level)

Why ILP is slowing

- Branch prediction accuracy is already > 90%
  - Hard to improve it even more
- Number of pipeline stages is already deep (~20-30 stages)
  - But critical dependence loops do not change
  - Memory latency requires more clock cycles to satisfy
- Processor width is already high
  - Quadratically increasing complexity to increase the width
- Cache size
  - Effective, but also shows diminishing returns
  - In general, the size must be doubled to reduce the miss rate by half

Current Trend: Multicore and Manycore

Aspect        Intel Clovertown     AMD Barcelona                        IBM Cell
# cores       4                    4                                    8+1
Clock freq    2.66 GHz             2.3 GHz                              3.2 GHz
Core type     OOO superscalar      OOO superscalar                      2-issue SIMD
Caches        2x4 MB L2            512 KB L2 (private), 2 MB L3 (shd)   256 KB local store
Chip power    120 Watts            95 Watts                             100 Watts

Historical Perspectives

- 1980s to early 1990s: prime time for parallel architecture research
  - A microprocessor could not fit on a single chip, so it was natural to use multiple chips (and processors)
  - J-machine, M-machine, Alewife, Tera, HEP, etc.
- 1990s: at the low end, uniprocessor speed grows much faster than parallel system speed
  - A microprocessor fits on a chip. So do the branch predictor, multiple functional units, large caches, etc.!
  - The microprocessor also exploits parallelism (pipelining, multiple issue, VLIW), forms of parallelism originally invented for multiprocessors
  - Many parallel computer vendors went bankrupt
  - High-performance computing remained a prestigious but small market

"If the automobile industry advanced as rapidly as the semiconductor industry, a Rolls Royce would get a million miles per gallon, and it would be cheaper to throw it away than to park it."
-- Gordon Moore, Intel Corporation

- 1990s: emergence of distributed (vs. parallel) machines
  - Progress in network technologies:
    - Network bandwidth grew faster than Moore's law
    - Fast interconnection networks became cheap
  - Connecting cheap uniprocessor systems into a large distributed machine:
    - Networks of Workstations, Clusters, Grid
- 2000s: parallel architectures are back
  - Transistors per chip >> transistors needed by a single microprocessor
  - Harder to get more performance from a uniprocessor
  - SMT (Simultaneous Multithreading), CMP (Chip Multiprocessor), and ultimately massive CMP
  - E.g. Intel Pentium D, Core Duo, AMD dual core, IBM Power5, Sun Niagara, etc.

What is a Parallel Architecture?

"A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast."
-- Almasi & Gottlieb

Parallel computers

"A parallel computer is a collection of processing elements that can communicate and cooperate to solve a large problem fast." [Almasi & Gottlieb]

- "collection of processing elements"
  - How many? How powerful is each? Scalability?
  - A few very powerful processors (e.g., Altix) vs. many small ones (BlueGene)
- "that can communicate"
  - How do PEs communicate? (shared memory vs. message passing)
  - Interconnection network (bus, multistage, crossbar, etc.)
  - Evaluation criteria: cost, latency, throughput, scalability, and fault tolerance

"and cooperate"

- Issues: granularity, synchronization, and autonomy
- Synchronization allows sequencing of operations to ensure correctness
- As granularity goes up, parallelism goes down, but communication and overhead go down too
  - Statement/instruction level: 2-10 instructions (ECE 521)
  - Loop level: 10-1K instructions
  - Task level: 1K-1M instructions
  - Program level: > 1M instructions
- Autonomy
  - SIMD (single instruction stream) vs. MIMD (multiple instruction streams)

"solve a large problem fast"

- General- vs. special-purpose machine?
  - Any machine can solve certain problems well
- What domains?
  - Highly (embarrassingly) parallel apps: many scientific codes
  - Medium-parallel apps: many engineering apps (finite elements, VLSI CAD)
  - Not-parallel apps: compilers, editors (do we care?)

Why parallel computers?

- Absolute performance: can we afford to wait?
  - Folding of a single protein takes years to simulate on the most advanced microprocessor; it takes only days on a parallel computer
  - Weather forecasting: timeliness is crucial
- Cost/performance
  - It is getting harder to improve performance on a single processor
  - A bigger monolithic processor vs. many simple processors
- Power/performance
- Reliability and availability
- Key enabling technologies:
  - Advances in microprocessor and interconnect technology
  - Advances in software technology

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid

Loop-level parallelism

- Each iteration can be computed independently (see the OpenMP sketch after this slide):

    for (i = 0; i < 8; i++)
        a[i] = b[i] + c[i];

- Each iteration cannot be computed independently, thus there is no loop-level parallelism:

    for (i = 0; i < 8; i++)
        a[i] = b[i] + a[i-1];

+ Very high degree of parallelism (> 1K)
+ Often easy to achieve load balance
- Some loops are not parallel
- Some apps do not have many loops
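As a concrete illustration (my addition, not from the original slides), the independent loop above can be parallelized with an OpenMP pragma. This is a minimal sketch, assuming a C compiler with OpenMP support (e.g. gcc -fopenmp); without that flag the pragma is simply ignored and the loop runs serially:

    #include <stdio.h>

    int main(void) {
        double a[8], b[8], c[8];
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = 2.0 * i; }

        /* Iterations are independent, so they can be divided among threads. */
        #pragma omp parallel for
        for (int i = 0; i < 8; i++)
            a[i] = b[i] + c[i];

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }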

Task-level parallelism

- Arbitrary code segments in a single program
- Across loops (a sketch using OpenMP sections follows after this slide):

    for (i = 0; i < n; i++)
        sum = sum + a[i];
    for (i = 0; i < n; i++)
        prod = prod * a[i];

- Across subroutines:

    Cost = getCost();
    A = computeSum();
    B = A + Cost;

- Threads: e.g. in an editor: GUI, printing, parsing

+ Larger granularity => lower overheads and communication
- Low degree of parallelism
- Hard to balance the load
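A hedged sketch (my own example, not the book's) of running the two independent loops above as separate tasks with OpenMP sections; each section updates a different variable, so no synchronization is needed beyond the implicit barrier at the end:

    #include <stdio.h>

    int main(void) {
        enum { N = 8 };
        double a[N], sum = 0.0, prod = 1.0;
        for (int i = 0; i < N; i++) a[i] = i + 1;

        /* The two loops produce disjoint results, so they can run as two tasks. */
        #pragma omp parallel sections
        {
            #pragma omp section
            for (int i = 0; i < N; i++) sum += a[i];

            #pragma omp section
            for (int i = 0; i < N; i++) prod *= a[i];
        }

        printf("sum = %.1f, prod = %.1f\n", sum, prod);
        return 0;
    }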

Program-level parallelism

- Various independent programs execute together
- gmake:

    gcc -c code1.c               // assign to proc1
    gcc -c code2.c               // assign to proc2
    gcc -c main.c                // assign to proc3
    gcc main.o code1.o code2.o

+ No communication
- Hard to balance the load
- Few opportunities
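In practice (my note, not from the slide), GNU make can exploit exactly this kind of parallelism itself: invoking the build as gmake -j3 lets it run the three independent compilations concurrently before the final link step.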

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - SIMD (vector architecture)
  - MIMD
    - Shared memory machines (SMP and DSM)
    - Clusters
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid

Taxonomy of Parallel Computers

The Flynn taxonomy:
- Single or multiple instruction streams
- Single or multiple data streams

1. SISD machine (most desktops, laptops)
- Only one instruction fetch stream
- Most of today's workstations or desktops

[Diagram: a single control unit issues one instruction stream to one ALU, which operates on a single data stream]

SIMD

- Examples: vector processors, SIMD extensions (MMX)
- A single instruction operates on multiple data items:

    SISD:
    for (i = 0; i < 8; i++)
        a[i] = b[i] + c[i];

    SIMD:
    a = b + c;   // vector addition

[Diagram: one control unit broadcasts a single instruction stream to ALU 1, ALU 2, ..., ALUn, each operating on its own data stream]

- Pseudo-SIMD is popular for multimedia extensions (an illustrative SSE sketch follows after this slide)
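As an illustrative sketch (not from the slides), the same eight-element addition can be written with x86 SSE intrinsics, where one instruction adds four floats at a time; this assumes an x86 machine and a compiler providing xmmintrin.h:

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[8], b[8] = {1,2,3,4,5,6,7,8}, c[8] = {10,20,30,40,50,60,70,80};

        /* Two vector additions replace eight scalar additions. */
        for (int i = 0; i < 8; i += 4) {
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_loadu_ps(&c[i]);
            _mm_storeu_ps(&a[i], _mm_add_ps(vb, vc));
        }

        for (int i = 0; i < 8; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }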

MISD machine

- Example: CMU Warp
  - Systolic arrays

[Diagram: a single data stream passes through ALU 1, ALU 2, ..., ALUn, each driven by its own control unit and instruction stream]

Systolic Arrays (cont'd)

Example: a systolic array for 1-D convolution (a plain-C reference follows after this slide):

    y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3)

[Figure: the inputs x1, x2, ..., x8 stream through a chain of four cells holding the weights w4, w3, w2, w1, while partial results y1, y2, y3 flow in the opposite direction. Each cell performs:
    xout = x
    x    = xin
    yout = yin + w * xin ]

- Practical realizations (e.g. iWARP) use quite general processors
  - Enable a variety of algorithms on the same hardware
- But dedicated interconnect channels
  - Data transferred directly from register to register across a channel
- Specialized, and same problems as SIMD
  - General-purpose systems work well for the same algorithms (locality etc.)
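For reference (my addition, not part of the slide), here is a minimal sequential C version of the convolution that the systolic array evaluates in pipelined fashion; the weight and input values are made up purely for illustration:

    #include <stdio.h>

    int main(void) {
        enum { NX = 8, NW = 4 };
        double x[NX] = {1, 2, 3, 4, 5, 6, 7, 8};
        double w[NW] = {0.5, 1.0, 1.5, 2.0};   /* w1..w4 (hypothetical values) */
        double y[NX - NW + 1];

        /* y(i) = w1*x(i) + w2*x(i+1) + w3*x(i+2) + w4*x(i+3) */
        for (int i = 0; i < NX - NW + 1; i++) {
            y[i] = 0.0;
            for (int j = 0; j < NW; j++)
                y[i] += w[j] * x[i + j];
        }

        for (int i = 0; i < NX - NW + 1; i++)
            printf("y(%d) = %.1f\n", i + 1, y[i]);
        return 0;
    }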

MIMD machine

- Independent processors connected together to form a multiprocessor system
- Physical organization:
  - Determines which memory hierarchy level is shared
- Programming abstraction:
  - Shared memory:
    - On a chip: Chip Multiprocessor (CMP)
    - Interconnected by a bus: Symmetric Multiprocessor (SMP)
    - Point-to-point interconnection: Distributed Shared Memory (DSM)
  - Distributed memory:
    - Clusters, Grid

MIMD Physical Organization

Shared-cache architecture:
[Diagram: several processors (P) share one level of caches and a memory M]
  - CMP (or Simultaneous Multi-Threading)
  - e.g.: Pentium 4 chip, IBM Power4 chip, Sun Niagara, Pentium D, etc.
  - Implies shared memory hardware

UMA (Uniform Memory Access) shared memory:
[Diagram: processors with private caches connected through a network to a shared memory M]
  - Pentium Pro Quad, Sun Enterprise, etc.
  - What interconnection network?
    - Bus
    - Multistage
    - Crossbar
    - etc.
  - Implies shared memory hardware

MIMD Physical Organization (2)

NUMA (Non-Uniform Memory Access) shared memory:
[Diagram: nodes of processor, caches, and local memory connected by a network]
  - SGI Origin, Altix, IBM p690, AMD Hammer-based systems
  - What interconnection network?
    - Crossbar
    - Mesh
    - Hypercube
    - etc.
  - Also referred to as Distributed Shared Memory

MIMD Physical Organization (3)

Distributed system / distributed memory:
[Diagram: complete nodes (processor, caches, memory, I/O) connected by a network]
  - Also called clusters, grid
  - Don't confuse it with distributed shared memory

Parallel vs. Distributed Computers

[Figure: two qualitative plots against system size, each comparing a parallel computer with a distributed computer. Cost rises much faster with size for the parallel computer; performance is higher for the parallel computer.]

- Small-scale machines: a parallel system is cheaper
- Large-scale machines: a distributed system is cheaper
- Performance: a parallel system is better (but more expensive)
- System size: a parallel system is limited, and its cost grows fast
- However, one must also consider software cost

Scope of CSC/ECE 506

- Parallelism
  - Loop-level and task-level parallelism
- Flynn taxonomy:
  - MIMD
    - Shared memory machines (SMP and DSM)
- Programming model:
  - Shared memory
  - Message passing
  - Hybrid (e.g., UPC)
  - Data parallel

Programming Models

Shared memory / shared address space:
- Each processor can see the entire memory

[Diagram: several processors (P) all connected to a single shared memory (M)]

- Programming model = thread programming, as in uniprocessor systems (a small POSIX-threads sketch follows after this slide)
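A minimal sketch (my own illustration, not from the slides) of the shared-memory model using POSIX threads: both threads read and write the same arrays through ordinary loads and stores; compile with -pthread:

    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static double a[N], b[N], c[N];   /* shared: visible to all threads */

    /* Each call adds half of the vectors; arg selects the starting index. */
    static void *add_half(void *arg) {
        int start = *(int *)arg;
        for (int i = start; i < start + N / 2; i++)
            a[i] = b[i] + c[i];
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2.0 * i; }

        pthread_t t;
        int lo = 0, hi = N / 2;
        pthread_create(&t, NULL, add_half, &hi);  /* second half in a new thread */
        add_half(&lo);                            /* first half in the main thread */
        pthread_join(t, NULL);

        for (int i = 0; i < N; i++)
            printf("a[%d] = %.1f\n", i, a[i]);
        return 0;
    }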

Distributed memory / message passing / multiple address spaces:
- A processor can only directly access its own local memory; all communication happens through explicit messages (an illustrative MPI sketch follows after this slide)

[Diagram: several nodes, each with a processor (P) and its own memory (M), connected by a network]
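An illustrative message-passing sketch in C using MPI (my example, not the book's): rank 0 holds a value in its private memory, and rank 1 can only obtain it through an explicit message. Run with at least two ranks, e.g. mpirun -np 2 ./a.out:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            double x = 3.14;                   /* lives only in rank 0's memory */
            MPI_Send(&x, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %.2f\n", x);
        }

        MPI_Finalize();
        return 0;
    }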

Shared Memory compared to Message Passing

+ Can easily be automated (parallelizing compiler, OpenMP)
+ Shared variables are not communicated, but must be guarded
- How to provide shared memory? Complex hardware
- Synchronization overhead grows fast with more processors
+/- Difficult to debug, not intuitive for users

Data Parallel Programming Paradigm & Systems

- Programming model:
  - Operations are performed in parallel on each element of a data structure
  - Logically a single thread of control, performing sequential or parallel steps
  - Conceptually, a processor is associated with each data element
- Architectural model:
  - Array of many simple, cheap processors, each with little memory
    - Processors don't sequence through instructions
  - Attached to a control processor that issues instructions
  - Specialized and general communication, cheap global synchronization
- Original motivations:
  - Matches simple differential equation solvers
  - Centralizes the high cost of instruction fetch/sequencing

Application of Data Parallelism

- Each PE contains an employee record with his/her salary (a plain-C rendering follows after this slide):

    If salary > 100K then
        salary = salary * 1.05
    else
        salary = salary * 1.10

- Logically, the whole operation is a single step
- Some processors are enabled for the arithmetic operation, others are disabled
- Other examples:
  - Finite differences, linear algebra, ...
  - Document searching, graphics, image processing, ...
- Some recent machines:
  - Thinking Machines CM-1, CM-2 (and CM-5)
  - Maspar MP-1 and MP-2, ...
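In a conventional C rendering (illustrative only, with made-up salary data), the same per-element update is just a loop; on a data-parallel machine, every element's test-and-multiply happens in one logical step:

    #include <stdio.h>

    int main(void) {
        double salary[5] = {80e3, 120e3, 95e3, 150e3, 60e3};   /* hypothetical data */

        /* Per-element update: a data-parallel machine does this in a single step. */
        for (int i = 0; i < 5; i++)
            salary[i] *= (salary[i] > 100e3) ? 1.05 : 1.10;

        for (int i = 0; i < 5; i++)
            printf("salary[%d] = %.0f\n", i, salary[i]);
        return 0;
    }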

Common Today

- Systolic arrays: idea adopted in graphics and network processors
- Dataflow: idea adopted in superscalar processors
- Shared memory: most small-scale servers (up to 128 processors)
  - Now in workstations/desktops/laptops, too
- Message passing: most large-scale systems
  - Clusters, grid (hundreds to thousands of processors)
- Data parallel / SIMD:
  - Small scale: SIMD multimedia extensions (MMX, VIS)
  - Large scale: vector processors

Top 500 Supercomputers

- http://www.top500.org
- Let's look at the Earth Simulator
  - Was #1 in 2004, now #10 in 2006
- Hardware:
  - 5,120 NEC CPUs at 500 MHz (640 nodes, 8 CPUs per node)
  - 8 GFLOPS per CPU (41 TFLOPS total)
    - Sustained performance in the 30s of TFLOPS!
  - 2 GB per CPU (4 x 512 MB FPLRAM modules); 10 TB total memory
    - Shared memory inside each node
  - 640 x 640 crossbar switch between the nodes
  - 16 GB/s inter-node bandwidth
  - 20 kVA power consumption per node

Programming Model

- Within a CPU: data parallel, using automatic vectorization (instruction level)
- Within a node (8 CPUs): shared memory using OpenMP (loop level)
- Across nodes: message passing using MPI-2 or HPF (algorithm level)

The machine room sits at approximately the 4th-floor level. The 3rd-floor level is taken up by hundreds of kilometers of copper cabling, and the lower floors house the air conditioning and electrical equipment. The structure is enclosed in a cooling shell, with air pumped from underneath through the cabinets and collected along the two long sides of the building. The aeroshell gives the building its "pumped-up" appearance. The machine room is electromagnetically shielded to prevent interference from the nearby expressway and railway. Even the halogen light sources are outside the shield, and the light is distributed by a grid of scattering pipes under the ceiling. The entire structure is mechanically isolated from its surroundings, suspended in order to make it less prone to earthquake damage. All attachments (power, cooling, access walkways) are flexible.

- Linpack performance: 40 TFLOPS, 80% of peak
- Real-world performance: 33-66% of peak (vs. less than 15% for clusters)
- Cost? Hint: it starts with a 4
- Maintenance: $15M per year
- Failures: about one processor per week
- A distributed-memory parallel computing system in which 640 processor nodes are interconnected by a single-stage crossbar network

Fastest (#1 as of Aug 2006)

- BlueGene
- 65,536 processors
- Each processor: PowerPC 440 at 700 MHz (2.8 GFLOPS)
- Rpeak: 183 TFLOPS
- Rmax: 136 TFLOPS

Limitations of very large machines

- Niche market
- Power wall
  - By using low-power processors, BlueGene can scale to a very large processor count
  - Many practical issues: electricity, cooling, etc.
- Programming wall
  - Extremely hard to extract performance out of a very large machine
