Professional Documents
Culture Documents
history and motivation for systolic arrays systolic array features systolic design techniques
composing regular components retiming slowdown clustering bit-level design systolic state machines
Systolic arrays
Memory Processing Element
introduced by Kung and Leiserson, 1978 designs for matrix computations illustrated by snapshots of operation
Systolic: rhythmical contraction; describes the contraction of the heart forcing blood onward and keeping up the circulation. Array: multiple PEs to maximise processing per memory access.
PE
wl 3/2002 4
motivations: improve performance of special-purpose systems - e.g. maximise processing per memory access reduce their design and implementation costs - e.g. exploit latest technology: FPGAs
wl 3/2002 3
M PE PE PE
these result in: simple and reduced costs high performance modular and expandable
wl 3/2002 5
technology: 10 million-gate FPGA, GHz clock speed good platform for implementing systolic designs
- array structure - increased flexibility, adaptable at run time - reduced design/implementation time and cost
wl 3/2002 6
Applications
signal, image, video, multimedia, numerial processing
add, multiply, divide, square root...in various number systems recursive and non-recursive, linear and non-linear filtering DFT, FFT, FHT, DCT, DWT, FNT matrix and graph algorithms, algebraic path problem neural nets, motion estimation, shading, texture mapping sorting, searching, matching, priority queue, LRU dynamic programming data compression and encryption discrete event simulation database operations
wl 3/2002 7
non-numerical processing
-
Rectangular array
R R
R R
R R
wl 3/2002 8
Hexagonal array
Hexagonal array
R R R
R R R
R R R
wl 3/2002 9
R R R
R R R
R R R
wl 3/2002 10
Triangular-shaped arrays
R R
R R R
R R
R R R
x2
D D
x3
D D
aij xi
0
a 01 a 11 a 21 a 31
D
y0 y1 y2 y3
wl 3/2002 11
wl 3/2002 12
w0
D
w1
0
D
w2
D
w3
CbCellb
D
x0 y0
CbCellc
D
CbCellc
CbCellc
D
CbCellc
techniques
-
x1 y1
CbCellc
CbCellc
D
CbCellc
CbCellc
x0 y0
CbCellc
D
CbCellc
CbCellc
CbCellc
D
x1 y1
CbCellc
CbCellc
D
CbCellc
CbCellc
0
wl 3/2002 13
Pipelining
clock speed depends on longest combinational path
w2 w2 xt
D D
w3 w3
data
result
xt 0
w0
xt-1
w1
xt-2
w2
xt-3
w3 y +
wl 3/2002 15 wl 3/2002 16
mac
mac
mac
mac
Pipelining
insert latches between circuits to increase throughput
Retime a chain
idea: introduce anti-latch which cancels effect of a latch OK to have anti-latch at inputs or outputs graphical contours linking introduction of latch/anti-latch pre-condition:
given
data clock
result
R then
R R R
D-1
retiming: graphical method, introduce/relocate latches to improve performance/regularity and preserve behaviour may apply this method several times; avoid overkill
wl 3/2002 17
D-1
D-1
D-1
wl 3/2002 18
Retime a row
pre-condition:
given
D
w1 xt w1 xt
D
w2 w2 xt
D D
w3 w3
R
D-1
xt
w0
then
R R R
D D
xt
D D-1 D-1 D-1
w0
xt-1
w1
xt-2
w2
xt-3
w3 y +
wl 3/2002 20
R
D-1
R
D-1 D-1
0 Cu0:
mac
mac
mac
mac
w2
D D
w3
D D D
xt
xt
w0
xt
D-1 D-1
w1
xt
D-1 D-1 D-1 D-1
w2
xt
D-1 D-1 D-1
w3
D-1 D-1 D-1
D-1
D-1
D-1
x + 0
CuCell1
w0
D
w1
D
w2
D
w3
D
mac
mac
mac
mac
D D D
xt
w0 mac w0
xt-1
w1
xt-2
w2
xt-3
w3
mac w1
D D
mac w2
D
mac w3
D
x 0
CuCell1
mac
mac
mac
mac
y
wl 3/2002 21
semi-systolic regular connection one type of cell: CuCell1 speed? impact of array size? concurrency?
+ mac
Cu1
wl 3/2002 22
w2
w1
w0
w2
w1
w0
D-1
D-1
D-1
D-1
mac w3
D-1
mac w2
D-1 D-1
mac w1
D-1
mac w0
mac
mac
mac
mac
CuCell2
D-1 D-1
x 0
wl 3/2002 23
D
Cu2
mac
mac
mac
mac
y
wl 3/2002 24
w1
w2
D
w3
D D D D D D D
w1
w2
D
w3
D D D D D D D
D D D D D
D D D D D
+ Cu3
+ Cu3
wl 3/2002 25
wl 3/2002 26
Design tree
relate designs by transformation
- root: obvious but inefficient design - leaves: efficient but not obvious designs
w1
w2
D
w3
D D D D D D D
D D D D D
x
D
convolver example
uni-directional flow data
Cu1
Cu3 Cu2
Cu4
+ Cu4
Cu0
counter-flow data
reverse coefficients
...
wl 3/2002 28
Characterise designs
express features of a composite design in terms of the number and features of its components number of cells and registers: impact on size and power consumption and latency latency: determined by the path from input to output which has the maximum number of registers critical path: determined by the path from input to output which has the largest combinational delay e.g. Cu1: N latches, N-1 cycles of latency, Tmult + NTadd critical path assumptions: negligible effect of wires and word-length growth
wl 3/2002 29
Pipelining a grid
R R R
R R R
R R R
R R R
wl 3/2002 30
Pipelining a grid
D D D D D D
Pipelining a grid
D D D D D D
R R R
R R R
D-1
R R R
D-1 D-1
R R R
D-1 D-1 D-1
R
D D
R
D
R
D
R
D
R
D
R
D
R
D
R
D
D D
R
D D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1 D-1 D-1
wl 3/2002 31
wl 3/2002 32
Pipelining a grid
D D D D D D
Pipelining a grid
D D D D D D
R
D D
R
D
R
D
R
D
R
D D
R
D
R
D
R
D
R
D
R
D
R
D
R
D
R
D
R
D
R
D
R
D
D D
R
D D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1 D-1 D-1
D D
R
D D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1 D-1
R
D D-1 D-1 D-1 D-1 D-1 D-1
wl 3/2002 33
wl 3/2002 34
constant multiplier:
xi
a03 a13 a23 a33
aij xi
a00 a01 a11 a21 a31
aij xi
0 0 0 0
y0 y1 y2 y3
0 0 0 0
y0 y1 y2 y3
wl 3/2002 35
wl 3/2002 36
x2
D D D
x3
D D D
constant multiplier:
a03 a13 a23 a33
D
x0
x1
D
x2
D D D
x3
D D D
aij xi
0 0 0 0
xi
y0 y1 y2 y3
aij
aij xi
a00 0 0 0 0 a10 a20 a30
D
y0 y1 y2 y3
wl 3/2002 37
wl 3/2002 38
w1
w2
D
w3
D D D D D D D
w2
D
w3
D
D D D D D
x y
mac"
mac"
mac"
mac"
+
+ Cu3
Cb1
mac"
wl 3/2002 39 wl 3/2002 40
w1
D D -1
w2
D D -1
w3
D D -1
mac"
mac"
mac"
mac"
mac" Cb2
mac"
mac"
mac"
mac"
mac"
mac"
mac"
mac"
mac"
mac"
mac"
wl 3/2002 43
wl 3/2002 44
Fully-pipelined convolver
D-1 D-1 D D D D-1
-1
D D D D-1
-1
mac"
mac"
mac"
mac" mac"
D-1 D
mac"
mac"
mac"
D -1 D -1 D -1 D -1
D-1 D D D D-1
-1
D D D D-1
-1
D-1
CbCell3
D-1 D D
D-1 D-1 D
D-1 D-1 D
mac"
mac"
mac"
mac"
D-1 D -1 D -1 D -1
mac"
wl 3/2002 45
mac"
mac"
mac"
Cb3
wl 3/2002 46
1 / (Tcell+T latch)
R
R4
=
=
D-1 D-1
input-output speed limit clock skew clock rise and fall times significant control degree of pipeling
wl 3/2002 47
Partially-pipelined designs
w1
D
w2
D
w3
D
mac" 0
mac"
mac"
mac"
mac"
mac"
mac"
mac"
mac"
mac"
Cb5
D D
Cb1
mac"
mac"
mac"
mac"
mac"
mac"
Result
D D
Result
D D
R R
D D
R R
D
R R
D
R R
D
D -1
D -1
R R
D D
R R
D
R R
D
R R
D
D -1
D -1
D -1
D -1
D -1
D -1
R R
D D -1 D -1
R R
D D -1 D -1
R R
D D -1 D -1 D -1
R R
D D -1 D -1 D -1
D -1
D -1
D -1
R R
D D -1 D -1
R R
D D -1 D -1
R R
D D -1 D -1 D -1
R R
D D -1 D -1 D -1
D -1
D -1
D -1
D -1
D -1
D -1
D -1
D -1
D -1
wl 3/2002 53
wl 3/2002 54
Result R R
D D
R
D
R R
D D
R
D
D -1
D -1
R R
D D
R R
D D D -1
D -1
D -1
R R
D D -1 D
-1
R R
D D -1 D -1 D
-1
D -1
D -1
R
D -1 D
-1
R
D -1 D -1 D -1
D -1
D -1
D -1
wl 3/2002 56
Result R R
D D D
Lead R
D D
R
D
R R
D
D -1
D -1
Q R R R
wl 3/2002 57
R Q R R
R R Q R
R R R Q
wl 3/2002 58
R R
D D
R R
D D D -1
D -1
D -1
R R
D D -1 D -1 D
R R
D D -1 D -1 D -1 D
D -1
D -1
R
D -1 D -1
R
D -1 D -1 D -1
D -1
D -1
D -1
Trail
Q
R R R Q
R R Q R
R
= R
D D
Q R Q R
R Q R Q
Q R Q R
wl 3/2002 60
Q R Q
10
w x y + D
becomes
w x A H H H
wl 3/2002 61
w D x y + y D
becomes
w x A A A
D D D
0
F F F
wl 3/2002 62
w x y + wx CbCell
0
w x A A A y
D D D
becomes
A F A F
D
xs
w
D
0
D
0
D
0
D
D D
wl 3/2002 63
ys
0
D
F F F
y A F
where
CbCellc: fadd and
wl 3/2002 64
Pipelining strategy 1
(1) slowdown by 2 (double all latches) (2) Retime to get fully-pipelined circuit
0
fadd and
Design Cbb1
0
D
CbCellc:
yt = 0in
w2
D
wi xt,i
0
D
fadd
and
w0
D
0
D
w1
D
0
D
w3
D
CbCellc
x0 y0
CbCellc
D D
CbCellc
D D
CbCellc
D D
CbCellc
D D
0
D D
w
D D
0
D D
CbCellc
D D
x1 0 y1
CbCellc
D D
CbCellc
D D
CbCellc
D D
CbCellc
D D
xs ys
D D
D D
D D
D D
0
D D
D D
D D
D D
0
D D
x2 y2
CbCellc
D D
CbCellc
D D
CbCellc
D D
CbCellc
D D
D D
D D
D D
x3 y3
CbCellc
D D
CbCellc
D D
CbCellc
D D
CbCellc
D D
0
wl 3/2002 66
wl 3/2002 65
11
Pipelining strategy 2
pipeline clusters of K by K cells (K > 1) e.g. K = 2
x0 0 0
D
Design Cbb2
0
D
w0
D
w1
0
D
w2
D
w3
CbCellb
D
w
D
y0 0 0
CbCellc
D
CbCellc
CbCellc
D
CbCellc
CbCellc
0
D
x1 y1
CbCellc
CbCellc
D
CbCellc
CbCellc
D
xs ys
0
D
0
D
x0 y0
CbCellc
D
CbCellc
CbCellc
CbCellc
D
0 x1
CbCellc
D
Cbb2
wl 3/2002 67
y1
CbCellc
D
CbCellc
CbCellc
0
wl 3/2002 68
use clustering to control degree of pipelining n-slow: can replace each latch by n latches in series, provided that (n-1) additional values are inserted between successive inputs; similarly for outputs
wl 3/2002 69
----------------------------------------------------------------------------
Note that the minimum clock period for Cu2 should be (N-1) Tp + Tm + Ta, where Tp is the delay across the wiring cell
wl 3/2002 70
State machines
state-transition function R usually includes an output part y and a next state part s counter sorter
priority queue x
R
s
D
LRU processor
wl 3/2002 71
wl 3/2002 72
12
hadd D0
hadd D0
hadd D0
hadd D0
inc D0
wl 3/2002 73
wl 3/2002 74
hadd D0
hadd D0
hadd D0
hadd D0
wl 3/2002 75
wl 3/2002 76
hadd D0
hadd D0
hadd D0
hadd D0
hadd D0 D0
hadd D0 D0
D0 hadd D0
hadd D0
D0
wl 3/2002 77
wl 3/2002 78
13
Example: inserter
insert an element into an ordered list to form an ordered list: insert <3, <1, 2, 5, 6>> = <<1, 2, 3, 5>, 6> 1 2 5 6 max min
S2 3
Insertion sort
state registers initialised with + load n values to be sorted cycle by cycle load n - values to extract the sorted result
5
S2
1
S2
2
S2
3
S2
5
S2
5
Dmax
S2
Dmax
S2
Dmax
S2
Dmax
wl 3/2002 79
wl 3/2002 80
Insertion sort
state registers initialised with + load n values to be sorted cycle by cycle load n - values to extract the sorted result
5 4 5 1 -
2 1
3 2
8 3
S2
4
Dmax
S2
5
Dmax
S2
Dmax
S2
Dmax
S2
-
Dmax
S2
1
Dmax
S2
2
Dmax
S2
3
Dmax
wl 3/2002 81
wl 3/2002 82
- -
1 -
2 2
3 3
S2
3
Dmax
S2
Dmax
S2
Dmax
S2
Dmax
S2
-
Dmax
S2
-
Dmax
S2
1
Dmax
S2
2
Dmax
Q1: can we avoid reloading the -s (and +s ?) Q2: can we reduce the combinational delay through S2s?
wl 3/2002 83 wl 3/2002 84
14
S2
Dmax
S2
Dmax
S2
Dmax
S2
Dmax
S2
Dmax
Dmax
S2
Dmax
Dmax
S2
Dmax
Dmax
S2
Dmax
Dmax
wl 3/2002 85
wl 3/2002 86
S2
Dmax
Dmax
S2
Dmax
Dmax
S2
Dmax
Dmax
S2
Dmax
Dmax
make sure that registers are initialised appropriately decompose a large state machine into a collection of small state machines pipeline the collection of state machines as required
wl 3/2002 87
wl 3/2002 88
Further reading
H. T. Kung. Why Systolic Architectures?, IEEE Computer, 15(1):37-46, January 1982. - excellent introductions to systolic architectures S.Y. Kung, VLSI Array Processors, Prentice Hall, 1988. - comprehensive reference textbook Proceedings of IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP). This conference began as the International Workshop on Systolic Arrays, 1986. - research on theory and practice of systolic design Proceedings of IEEE International Conference on Field-Programmable Custom Computing Machines. - research on systolic systems implemented by FPGAs
wl 3/2002 90
15