Professional Documents
Culture Documents
Parallel Processing
CPIstall contributors
Data hazards
Control hazards: branches, exceptions
Memory latency: cache misses
Advantages
CPIideal is 1 (pipelining)
IF
Simple, elegant
RF Still used in ARM & MIPS processors
See http://people.ee.duke.edu/~sorin/ece252/lectures/3-superscalar.pdf
6.004 Computation Structures L21: Parallel Processing, Slide #9
A Modern Out-of-Order Superscalar Processor
Needed to
avoid high
I-Cache
For OoO:
CPISTALL on Branch
Fetch Unit
determine when
operands are
Predict
In Order
deep pipelines
Instruction Buffer
ready for inst.
Decode/Rename
Dispatch
Reservation Stations
Out Of Order
Retire
order!
Write Buffer D-Cache
+1 or Branch
PC data
Data
addr Reg File Reg File Reg File Reg File Memory
Instruction addr
ALU ALU ALU ALU
Memory
Addressing
Unit
data
Control Each datapath has its own local storage (register file)
All datapaths execute the same instruction
Memory access with vector loads and stores + wide memory port
4 cores
Moderate perf,
efficient core
2 cores
Performance
What is the optimal tradeoff between core cost and number of cores?
(1 - f) f
timenew
(1 - f) f/S
Soverall lim
S
1 / (1-f)
Say you write a program that can do 90% of the work
in parallel, but the other 10% is sequential
f = 0.9, S= Soverall = 10
What f do you need to use a 1000-core machine well?
Mem
Communication models:
Shared memory:
Single address space
Implicit communication by
memory loads & stores Mem Mem Mem
Message passing:
Separate address spaces
Explicit communication by
sending and receiving messages
Network
Thread A Thread B
x = 3; y = 4;
print(y); print(x);
$1: x = X
13 $2: x = 1
y=2 y=X 24
Thread A Thread B
x = 3; y = 4;
print(y); print(x);
Notice that
Plausible Uniprocessor execution sequences: the outcome
2, 1 does not
SEQUENCE A prints B prints appear in
this list!
x=3; print(y); y=4; print(x); 2 3
x=3; y=4; print(y); print(x); 4 3
x=3; y=4; print(x); print(y); 4 3
y=4; x=3; print(x); print(y); 4 3
y=4; x=3; print(y); print(x); 4 3
y=4; print(x); x=3; print(y); 4 1
Shared Memory
int x=1, y=2;
Thread A Thread B
Werent
x = 3; y = 4; caches
print(y); print(x); supposed to
be invisible
Possible printed values: 2, 3; 4, 3; 4, 1. to programs?
(each corresponds to at least one interleaved execution)
ALTERNATIVE APPROACH:
Weak consistency, by default;
MEMORY BARRIER instruction: stalls thread until all
previous memory operations have completed.
See http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf
for a very readable discussion of memory semantics in multicore systems.
Modified The cache line is present only in the current cache, and
is dirty; it has been modified from the value in main memory. The
cache is required to write the data back to main memory at some
time in the future, before permitting any other read of the (no
longer valid) main memory state.
Exclusive The cache line is present only in the current cache, but
is clean; it matches main memory. It may be changed to the
Shared state at any time, in response to a bus read request.
Alternatively, it may be changed to the Modified state when writing
to it.
Shared Indicates that this cache line may be stored in other
caches of the machine and is clean; it matches the main memory.
The line may be discarded (changed to the Invalid state) at any
time. Writes to SHARED cache lines get special handling
Invalid Indicates that this cache line is invalid (unused).
https://en.wikipedia.org/wiki/MESI_protocol
6.004 Computation Structures L21: Parallel Processing, Slide #30
The Cache Has Two Customers!
Queue up write misses,
STORE_BARRIER inst
Read waits until store queue
CPU is empty
Write Store queue
Intel adds F
state to
choose which
cache
responds
CC0: https://en.wikipedia.org/wiki/MESI_protocol#/media/File:MESI_protocol_activity_diagram.png
6.004 Computation Structures L21: Parallel Processing, Slide #32
Cache Coherence in Action
Thread A Thread B
1 x = 3; 2 y = 4;
4 print(y); 3 print(x);
$0: [S] x =1, [S] y = 2 $1: [S] x =1, [S] y = 2
Programming languages
Corollary: implementation
technologies can evolve while
preserving the engineering
Instruction set + memory
investment at higher levels.
4
IL
O P
3
J T
P C
1 0
00
Ins trucion
A
M em ory
+ 4 D
0 1 R A2SE L
+
WA S E L
Z
J T
C :< 1 5 :0 >
< P C > + 4+ C
*4
C :< 1 5 :0 >
sign-extd ed
IR Q Z
C ontrl Logic
P C SE L
R A2SE L
A B
ASE L A LU Wr
WD R /W
A LU F N
BSE L
WD S E L D at M em ory
WE R F
WA S E L
< P C > + 4
0 1 2 W D S E L
I/O MEM
DISK I/O I/
O
Tools
Design Languages
FPGA prototyping
Timing Analyzers
Algorithms
Arithmetic Systems Software
Signal Processing Storage
Language Virtual Machines
implementation Networking