Doug Burger
Stephen W. Keckler
Charles R. Moore
8/31/05
CART UT-CS
Increased specialization
among processors
Benefits of specialization
Performance, power, area
Problems of specialization
Poor performance outside
intended domain
Little design re-use
[Figure: performance of specialized processors by application domain: Network (Intel IXP), Server (Power4), Graphics (GeForce), Desktop (Pentium4)]
Courtesy: Bob Graybill, DARPA
[Figure: heterogeneous chip assembled from specialized cores labeled UNI, VEC, DSP, and THR]
+ Performance advantages
- Load balancing inefficiencies
- Higher design complexity
Fine-grained concurrency
Fine-grain CMP:
64 in-order cores
Coarse-grain CMP:
16 O-O-O cores
4 ultra-large cores
Outline
Conclusions
TRIPS Overview
CMP with large Grid Processor cores and L2 cache banks
L2 cache bank
[Figure: TRIPS core block diagram with L2 cache banks, two groups of banks 0-3, bank M, moves, load/store queues, and blocks labeled IF and CT]
Thread-level parallelism
Partition instruction window among different threads
Reduce contention for instruction and data supply
Data-level parallelism
Provide high density of computational elements
Provide high bandwidth to/from data memory
[Figure: TRIPS core block diagram with L2 cache banks, two groups of banks 0-3, load/store queues, and blocks labeled IF and CT]
Reservation stations
Instruction window management
Register files
Speculative vs. non-speculative data storage
L2 cache banks
tag lookup, replacement, b/w to near banks
[Figure: execution node with a router, control logic, and instruction slots holding add, sub, add, add]
4 logical frames
each with 16 instruction slots
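The frame arithmetic can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes a 4x4 array of execution nodes (the shape named on the results summary slide) and assumes that a frame is formed by taking the same slot index at every node, so that 16 slots per frame means one slot per node. The row-major ordering and the names `slot_location`, `ROWS`, and `COLS` are hypothetical.

```python
# Sketch only: the 4x4 array shape and the frame-to-node mapping are assumptions.
ROWS, COLS = 4, 4        # assumed execution array shape
FRAMES = 4               # logical frames (from the slide)
SLOTS_PER_FRAME = 16     # instruction slots per frame (from the slide)


def slot_location(frame: int, index: int) -> tuple:
    """Map (frame, index within frame) to (row, col, slot at that node).

    Assumes frame f occupies slot f at every node, with the 16 slots of a
    frame laid out over the nodes in row-major order (hypothetical layout).
    """
    if not 0 <= frame < FRAMES:
        raise IndexError(f"frame {frame} out of range")
    if not 0 <= index < SLOTS_PER_FRAME:
        raise IndexError(f"index {index} out of range")
    row, col = divmod(index, COLS)
    return (row, col, frame)


total_slots = FRAMES * SLOTS_PER_FRAME          # 64 slots in the example
slots_per_node = total_slots // (ROWS * COLS)   # 4 slots at each node

print(total_slots, slots_per_node)   # 64 4
print(slot_location(2, 5))           # (1, 1, 2)
```

Under these assumptions the four instructions drawn inside the execution node would correspond to that node's slot in each of the four frames.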
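[Figure: block-level speculation over blocks A-E from start to end. Sequence shown: Execute A, Predict C, Execute C, Predict D, Execute D, Predict E, Execute E. Frames hold A together with C (spec) and D (spec).]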
[Figure: IPC for SPEC Int programs bzip2, compr, m88k, mcf, vortex, and MEAN; series by #blocks: 1, 4, 16, Perfect]
[Figure: IPC for SPEC FP programs ammp, equake, mgrid, swim, tomcatv, and MEAN; series by #blocks: 1, 4, 16, Perfect]
[Figure: two threads sharing the core. Thread 1 holds block A1 and B1 (spec); Thread 2 holds block A2 and B2 (spec).]
TLP Results
[Figure: TLP results chart; x-axis: # of threads; series: Sequential Execution, TLP-mode execution, Multiple processors]
Streaming Kernel:
- read input stream element
- process element
- write output stream element
[Figure: kernel loop running N times between start and end, drawn with blocks labeled (1), (2), (3); after unrolling 8X, a single block labeled (8)]
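The read/process/write kernel and its 8X unrolling can be illustrated in software. This is a minimal sketch, not the TRIPS compiler's transformation: `process` is a hypothetical stand-in for the per-element work, and the function names are invented for the example. The unrolled version handles eight elements per loop iteration, with a cleanup loop for any leftover elements.

```python
from typing import Callable, List, Sequence


def process(x: int) -> int:
    """Hypothetical per-element computation, used only for illustration."""
    return 2 * x + 1


def stream_kernel(inp: Sequence[int], f: Callable[[int], int] = process) -> List[int]:
    """Baseline kernel: one element per loop iteration."""
    out: List[int] = []
    for x in inp:        # read input stream element
        y = f(x)         # process element
        out.append(y)    # write output stream element
    return out


def stream_kernel_unrolled8(inp: Sequence[int], f: Callable[[int], int] = process) -> List[int]:
    """Same kernel with the loop body unrolled 8X."""
    out: List[int] = []
    n = len(inp)
    main = n - n % 8     # largest multiple of 8 not exceeding n
    for i in range(0, main, 8):
        # Eight independent copies of the body in one iteration.
        out.append(f(inp[i]))
        out.append(f(inp[i + 1]))
        out.append(f(inp[i + 2]))
        out.append(f(inp[i + 3]))
        out.append(f(inp[i + 4]))
        out.append(f(inp[i + 5]))
        out.append(f(inp[i + 6]))
        out.append(f(inp[i + 7]))
    for i in range(main, n):   # cleanup loop for the remainder
        out.append(f(inp[i]))
    return out


data = list(range(19))
print(stream_kernel(data) == stream_kernel_unrolled8(data))   # True
```

Unrolling gives each iteration a larger body of independent operations, and it reduces the number of loop-control branches by the unroll factor.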
[Figure: TRIPS core block diagram with L2 cache banks, L1 banks, two groups of banks 0-3, load/store queues, and blocks labeled IF and CT]
[Figure: compute instructions per cycle for convert, dct, fft8, fir16, idea, transform, and MEAN; series: ILP mode, DLP-mode, 1/4 LD B/W, NoRevitalize]
Results: Summary
ILP: instruction window occupancy
Peak: 4x4x128 array = 2048 instructions
Sustained: 493 for SPEC Int, 1412 for SPEC FP
Bottleneck: branch prediction
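The occupancy figures above are easier to compare as fractions of the peak window. The short calculation below uses only the numbers on this slide. Reading 4x4x128 as 4 rows by 4 columns by 128 instruction slots per node is an assumption, and the variable names are illustrative.

```python
# Numbers taken from the slide; the rows/columns/slots reading is an assumption.
ROWS, COLS, SLOTS_PER_NODE = 4, 4, 128

peak = ROWS * COLS * SLOTS_PER_NODE   # 2048 instructions
sustained = {"SPEC Int": 493, "SPEC FP": 1412}

# Fraction of the peak instruction window that is occupied on average.
occupancy = {name: count / peak for name, count in sustained.items()}

for name, frac in occupancy.items():
    print(f"{name}: {sustained[name]}/{peak} = {frac:.1%}")
# SPEC Int: 493/2048 = 24.1%
# SPEC FP: 1412/2048 = 68.9%
```

On these numbers the integer benchmarks fill roughly a quarter of the window and the floating-point benchmarks roughly two thirds.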
Related Work
Polymorphous homogeneous
SmartMemories: Modular reconfigurable architecture [Mai, ISCA 01]
Fine-grained homogeneous
RAW: Baring it all to software [Waingold, IEEE Computer 00]
Heterogeneous
Tarantula Vector Extensions to the EV8 [Espasa, ISCA 02]
Conclusions
Future work
Demonstrate viability with HW/SW prototype
Design software interfaces to exploit configurable hardware