
PIPELINING

why wait . . . ?
... let's solve a "real problem"

device: washer
function: fill, agitate, spin
washerPD = 30 mins

device: dryer
function: heat, fast spin
dryerPD = 60 mins

everyone knows that the real reason that students put off
doing laundry so long is not because they procrastinate,
are lazy, or even because they are working with their
computation slides

the fact is, doing one load at a time is not smart

one load at a time (clean, dry laundry):
step 1: washer
step 2: dryer
total = washerPD + dryerPD = 90 mins


doing N loads of laundry

here's how 0th year E.E.'s do laundry,
the "combinational" way,
because they did not hear of pipelining yet!

step 1:
step 2:
step 3:
step 4:
.....

total = N*(washerPD + dryerPD)
      = N*90 mins

doing N loads... the 1st year E.E. way

if students "pipeline" the laundry process

step 1:
step 2:
step 3:
.....

total = N*max(washerPD, dryerPD)
      = N*60 mins

actually, it's more like N*60 + 30
if we account for the startup transient correctly

when doing pipeline analysis, we're mostly interested in the
"steady state" where we assume we have an infinite supply of inputs
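The two totals can be checked with a few lines of Python. This is only a sketch of the arithmetic on the slides; the function names `serial_total` and `pipelined_total` are made up for illustration, and the pipelined formula assumes the dryer is the slower device, as it is here.

```python
WASHER_PD = 30  # minutes, from the slide
DRYER_PD = 60   # minutes, from the slide

def serial_total(n_loads):
    """0th-year way: finish each load completely before starting the next."""
    return n_loads * (WASHER_PD + DRYER_PD)

def pipelined_total(n_loads):
    """1st-year way, including the startup transient.

    Assumes the dryer is the bottleneck: after the first wash has filled
    the pipeline, one load leaves the dryer every DRYER_PD minutes.
    """
    if n_loads == 0:
        return 0
    return WASHER_PD + n_loads * DRYER_PD

for n in (1, 4, 10):
    print(n, serial_total(n), pipelined_total(n))
```

For large N the extra 30 minutes becomes negligible, which is why the steady-state figure of N*60 is the one used in pipeline analysis.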

some definitions

latency:
the delay from when an input is established until the
output associated with that input becomes valid
(0th year's laundry = 90 mins)
(1st year's laundry = 120 mins, assuming that the wash is started
as soon as possible and waits (wet) in the washer until the dryer
is available)

this implies that in a six hour wait the 0th year gets 4 loads done,
while the 1st year gets 5 and goes home half an hour earlier

throughput:
the rate at which inputs or outputs are processed
(0th year's laundry = 1/90 mins^-1 = 0.011 mins^-1)
(1st year's laundry = 1/60 mins^-1 = 0.017 mins^-1)
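The "six hour" comparison follows from the totals for N loads. The sketch below counts how many complete loads fit in a time budget; `loads_done` and its parameters are illustrative names, and the pipelined case assumes the first load takes washer + dryer time and every later load adds one bottleneck-stage delay.

```python
def loads_done(budget_min, pipelined, washer_pd=30, dryer_pd=60):
    """Number of complete loads that fit in budget_min minutes."""
    first_load = washer_pd + dryer_pd
    if budget_min < first_load:
        return 0
    if not pipelined:
        return budget_min // first_load
    stage = max(washer_pd, dryer_pd)  # steady-state time per extra load
    return 1 + (budget_min - first_load) // stage

def finish_time(n_loads, pipelined, washer_pd=30, dryer_pd=60):
    """Minutes until the n-th load comes out of the dryer."""
    if n_loads == 0:
        return 0
    first_load = washer_pd + dryer_pd
    if not pipelined:
        return n_loads * first_load
    return first_load + (n_loads - 1) * max(washer_pd, dryer_pd)

six_hours = 6 * 60
print(loads_done(six_hours, pipelined=False), loads_done(six_hours, pipelined=True))
```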

okay, back to circuits...

for combinational logic:
latency = tPD
throughput = 1/tPD

[figure: input X feeds F and G, whose outputs feed H, producing P(X)]

we can't get the answer faster, but are we making
effective use of our hardware at all times?

[timing diagram: X, F(X), G(X), P(X)]

F and G are "idle", just holding their outputs
stable while H performs its computation

pipelined circuits

use registers to hold H's input stable!
now F and G can be working on input X_{i+1}.

because of the 2-stage pipeline:
for a valid input X during clock cycle j,
P(X) is valid during clock j+2.

suppose F, G, H have propagation delays of 15, 20, 25 ns
and we are using ideal zero-delay registers:

                   latency    throughput
unpipelined        45 ns      1/45 ns^-1
2-stage pipeline   50 ns      1/25 ns^-1
                   (worse)    (better)
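The numbers in the table can be reproduced as follows. This sketch assumes the topology suggested by the slides (F and G operate in parallel on X, and H combines their outputs) and ideal zero-delay registers; the function names are illustrative.

```python
def unpipelined(t_f, t_g, t_h):
    """Combinational circuit: F and G in parallel, followed by H."""
    t_pd = max(t_f, t_g) + t_h
    return t_pd, 1 / t_pd            # (latency, throughput)

def two_stage(t_f, t_g, t_h):
    """Registers after F/G and after H; the clock is set by the slowest stage."""
    period = max(max(t_f, t_g), t_h)
    return 2 * period, 1 / period    # (latency, throughput)

print(unpipelined(15, 20, 25))
print(two_stage(15, 20, 25))
```

The latency gets worse because the faster first stage (20 ns) is still given a full 25 ns clock period.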

pipeline diagrams

                             clock cycle
pipeline stages   i          i+1          i+2          i+3          ...
input             X_i        X_{i+1}      X_{i+2}      X_{i+3}      ...
F reg                        F(X_i)       F(X_{i+1})   F(X_{i+2})   ...
G reg                        G(X_i)       G(X_{i+1})   G(X_{i+2})   ...
H reg                                     H(X_i)       H(X_{i+1})   H(X_{i+2})

the results associated with a particular set of input data
move diagonally through the diagram,
progressing through one pipeline stage each clock cycle
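A pipeline diagram like this can be generated by stepping the registers one clock edge at a time. The sketch below tracks only which input index each register's contents belong to, not the actual values; `pipeline_diagram` is an illustrative name and the F/G/H two-stage circuit is assumed.

```python
def pipeline_diagram(n_cycles):
    """Per clock cycle, record the input index held in each pipeline register.

    None means the register does not yet hold a valid result.
    """
    rows = []
    f_reg = g_reg = h_reg = None
    for cycle in range(n_cycles):
        rows.append({"input": cycle, "F reg": f_reg, "G reg": g_reg, "H reg": h_reg})
        # Clock edge: the H register captures H(F reg, G reg), while the
        # F and G registers capture the results for the current input.
        h_reg = f_reg
        f_reg = g_reg = cycle
    return rows

for row in pipeline_diagram(4):
    print(row)
```

Reading the rows top to bottom, input 0 appears in the F/G registers one cycle later and in the H register two cycles later, which is the diagonal movement described above.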
pipeline conventions
definition:
a K-stage pipeline ("K-pipeline") is an acyclic circuit having
exactly K registers on every path from an input to an output
a combinational circuit is thus a 0-stage pipeline

convention:
every pipeline stage, hence every K-stage pipeline, has a
register on its output (not on its input)

always:
the clock common to all registers must have a period
sufficient to cover propagation over combinational paths
PLUS (input) register tPD PLUS (output) register tSETUP

the latency of a K-pipeline is K times the
period of the clock common to all registers

the throughput of a K-pipeline is the
frequency of the clock
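The clock-period rule and the resulting latency and throughput can be written out directly. This is a sketch with illustrative function names; the register timing values in the second call are hypothetical and not taken from the slides.

```python
def min_clock_period(stage_delays, t_pd_reg=0.0, t_setup=0.0):
    """Shortest legal period: register tPD + slowest combinational stage + register setup."""
    return t_pd_reg + max(stage_delays) + t_setup

def latency_and_throughput(stage_delays, t_pd_reg=0.0, t_setup=0.0):
    """Latency is K clock periods; throughput is the clock frequency."""
    period = min_clock_period(stage_delays, t_pd_reg, t_setup)
    return len(stage_delays) * period, 1.0 / period

print(latency_and_throughput([20, 25]))            # ideal registers
print(latency_and_throughput([20, 25], 2.0, 1.0))  # hypothetical register timing
```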
ill-formed pipelines
consider a bad job of pipelining:

[figure: ill-formed pipelined circuit]

for what value of K is the above circuit a K-pipeline?
none

problem:
successive inputs get mixed: e.g., B(A(X_{i+1}), Y_i)
this happened because some paths from inputs to outputs
had 2 registers, and some had only 1!
can this happen in a well-formed K-pipeline?
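Whether a circuit is a K-pipeline can be checked mechanically by counting registers along every input-to-output path. The sketch below does this by exhaustive path walking, which is fine for small circuits. The example circuit is an assumption made up to match the description (X passes through A and then B, Y goes straight into B, with registers on the outputs of A and B); it is not taken from the slide's figure.

```python
def pipeline_depth(edges, inputs, outputs):
    """Return K if every input-to-output path crosses exactly K registers, else None.

    edges maps a node to a list of (successor, registers_on_that_connection).
    The circuit is assumed to be acyclic.
    """
    counts = set()

    def walk(node, regs):
        if node in outputs:
            counts.add(regs)
            return
        for successor, n_regs in edges.get(node, []):
            walk(successor, regs + n_regs)

    for source in inputs:
        walk(source, 0)
    return counts.pop() if len(counts) == 1 else None

# Hypothetical ill-formed circuit: the X path has 2 registers, the Y path only 1.
bad = {"X": [("A", 0)], "A": [("B", 1)], "Y": [("B", 0)], "B": [("out", 1)]}
# Same circuit with an extra register on the Y connection.
fixed = {"X": [("A", 0)], "A": [("B", 1)], "Y": [("B", 1)], "B": [("out", 1)]}

print(pipeline_depth(bad, ["X", "Y"], {"out"}))
print(pipeline_depth(fixed, ["X", "Y"], {"out"}))
```

In the fixed version the added register delays Y by one cycle, so B always sees A(X_i) together with Y_i.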

a pipelining methodology

step 1:
draw a line that crosses every output in the circuit,
and select one endpoint as an origin

step 2:
continue to draw new lines from the origin across various
circuit connections such that these new lines partition
the inputs from the outputs

adding a pipeline register at every point where a separating
line crosses a connection will always generate a valid pipeline

STRATEGY:
focus your attention on placing pipelining registers around
the slowest circuit elements (bottlenecks)

pipeline example

         LATENCY   THROUGHPUT
0-pipe   4         1/4
1-pipe   4         1/4
2-pipe   4         1/2
3-pipe   6         1/2

observations:
• 1-pipeline improves neither latency nor throughput
• throughput is improved by breaking long combinational
  paths, allowing a faster clock
• too many stages cost latency while not improving throughput
• back-to-back registers are often required to keep
  the pipeline well-formed
considering pipelining

• advantages
– higher throughput than
the corresponding combinational device
– different parts of the logic
work on different parts of the problem
• disadvantages
– generally, increases latency
– only as good as the weakest link

is there a way around this "weak link" problem?

how do 1st year E.E.'s do laundry, A.D. 2010?
they work around the bottleneck:
first they find a place
with twice as many dryers as washers

step 1:
step 2:
step 3:
step 4:
step 5:

throughput = 1/30 loads/min
latency = 90 min
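The effect of the second dryer can be seen by scheduling the loads explicitly. In this sketch (illustrative names, slide values as defaults) loads alternate between the dryers, and a wet load waits in the washer until its dryer is free.

```python
def finish_times(n_loads, washer_pd=30, dryer_pd=60, n_dryers=2):
    """Return the time (in minutes) at which each load leaves its dryer."""
    washer_free = 0
    dryer_free = [0] * n_dryers
    done = []
    for load in range(n_loads):
        dryer = load % n_dryers                  # alternate between the dryers
        wash_end = washer_free + washer_pd
        dry_start = max(wash_end, dryer_free[dryer])
        washer_free = dry_start                  # wet load blocks the washer until then
        dryer_free[dryer] = dry_start + dryer_pd
        done.append(dryer_free[dryer])
    return done

print(finish_times(4))              # two dryers
print(finish_times(3, n_dryers=1))  # the single-dryer pipeline, for comparison
```

With two dryers a load finishes every 30 minutes and each load takes 90 minutes from start to finish; with one dryer the spacing falls back to 60 minutes.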

circuit interleaving

one way to overcome a pipeline bottleneck is to replicate
the critical element as many times as needed
and alternate inputs between the various copies

[figure: interleaved element; labels: N-1 registers, latency = 2 clocks]

N-way interleaving is equivalent to N pipeline stages

combining pipelining and interleaving

combining interleaving with pipelining moves the bottleneck
from the C-element to the F-element

here, C' interleaves two C-elements,
each with a propagation delay of 8 ns

the resulting C' circuit has a throughput of one result
every 4 ns (1/4 ns^-1) and a latency of 8 ns

this can be considered as an extra pipelining stage
that passes through the middle of the C' module
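The 4 ns / 8 ns figures can be reproduced by feeding inputs alternately to the copies of the slow element. The sketch below is a simplified timing model with illustrative names: inputs arrive every tPD/N, and the internal check confirms that no copy is handed a new input while it is still computing.

```python
def interleave_schedule(n_inputs, t_pd, n_copies):
    """Return the output time of each input for n_copies interleaved elements."""
    interval = t_pd / n_copies          # one new input per interval
    busy_until = [0.0] * n_copies
    outputs = []
    for i in range(n_inputs):
        start = i * interval
        unit = i % n_copies             # alternate inputs between the copies
        assert busy_until[unit] <= start, "copy is still busy"
        busy_until[unit] = start + t_pd
        outputs.append(busy_until[unit])
    return outputs

print(interleave_schedule(4, t_pd=8, n_copies=2))
```

Each result still takes 8 ns to compute, but successive results are only 4 ns apart.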

assignment B4

Pipeline a combinational encryptor for throughput!
The device takes an integer value X
and computes an encrypted version C(X).
The propagation delay of each module is given in ms
(contamination delays are zero).

[figure: encryptor circuit from input X to output C(X), with module
propagation delays and numbered edges; labels in extraction order:
5 1 3 3 1 / 0 / 2 4 7 / 1 5 3 8 5 / 6 9 11 / 13 / 5 10 3 12 1]

before monday, march 8, 9:00
• what is the latency and throughput
  of the unpipelined device?
• give the locations for registers (ideal, zero-delay)
  by edge number after maximizing the throughput!
  use as few registers as possible!
• give the latency and throughput
  of your pipelined device!

From: student@tue.nl
To: computation@ics.ele.tue.nl
Subject: B4

27 0.40
4 9 10
30 0.80
drawing in attachment

attachment <student_B3.xxx>

multiplication (positive numbers)

multiplicand                       A3   A2   A1   A0
multiplier                    x    B3   B2   B1   B0
                                   ------------------
                                   A3B0 A2B0 A1B0 A0B0
                              A3B1 A2B1 A1B1 A0B1
                         A3B2 A2B2 A1B2 A0B2
                 +  A3B3 A2B3 A1B3 A0B3

ABi is called a "partial product"

multiplying N-bit number by M-bit number gives (N+M)-bit result

easy part:
forming partial products (just an AND gate since Bi is either 0 or 1)

hard part:
adding partial products column by column with carry

multiplication

multiplicand                  1 0 0 1    (A3..A0)
multiplier               x    1 0 1 1    (B3..B0)
                              -------
                              1 0 0 1    (A*B0)
                            1 0 0 1      (A*B1)
                          0 0 0 0        (A*B2)
                   +    1 0 0 1          (A*B3)
                        -------------
                        1 1 0 0 0 1 1

multiplying N-bit number by M-bit number gives (N+M)-bit result

easy part:
forming partial products (just an AND gate since Bi is either 0 or 1)

hard part:
adding M N-bit partial products
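The "easy part" and "hard part" map onto two small functions. This is a sketch with illustrative names: each partial product is A ANDed with one multiplier bit and shifted into its column, and Python's built-in addition stands in for the adder hardware.

```python
def partial_products(a, b, m_bits):
    """Row i is (A AND B_i), shifted left by i bit positions."""
    return [(a if (b >> i) & 1 else 0) << i for i in range(m_bits)]

def multiply(a, b, m_bits):
    """Sum the M partial products to get the (N+M)-bit result."""
    return sum(partial_products(a, b, m_bits))

rows = partial_products(0b1001, 0b1011, 4)
print([format(r, "08b") for r in rows])
print(format(multiply(0b1001, 0b1011, 4), "b"))
```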

sequential multiplier

assume the multiplicand (A) has N bits
and the multiplier (B) has M bits.

init: P ← 0, load A&B

repeat M times {
  P ← P + (BLSB == 1 ? A : 0)
  shift P/B right one bit
}

done: (N+M)-bit result in P/B

a sequential circuit with a single N-bit adder
can process one partial product at a time
and then cycle the circuit M times
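The loop above translates almost line by line into Python. This is a sketch of the algorithm, not of any particular hardware: `p` and `b` model the P and B halves of the combined register, and the function name is illustrative.

```python
def sequential_multiply(a, b, n_bits, m_bits):
    """Shift-and-add multiplier: one partial product per iteration."""
    assert 0 <= a < (1 << n_bits) and 0 <= b < (1 << m_bits)
    p = 0                                    # init: P <- 0
    for _ in range(m_bits):                  # repeat M times
        if b & 1:                            # B_LSB == 1 ?
            p += a                           # P <- P + A (may carry into bit N)
        # Shift the combined P/B register right by one bit:
        # the LSB of P moves into the MSB of B.
        b = (b >> 1) | ((p & 1) << (m_bits - 1))
        p >>= 1
    return (p << m_bits) | b                 # done: (N+M)-bit result in P/B

print(sequential_multiply(0b1001, 0b1011, 4, 4))
```

Each iteration consumes one multiplier bit, so the bits of B that are shifted out make room for the low-order bits of the product.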

sequential multiplier (64-bit ALU)

after initialization (product register at 0)
and loading the operands;
repeat 32 times:
{ if 1==LSB in multiplier,
    then add multiplicand;
  shift multiplier 1 right;
  shift multiplicand 1 left;
}

[figure: multiplicand register (bits 63..0, operand in bits 31..0),
multiplier register (bits 31..0), 64-bit ALU, product register (bits 63..0),
finite state machine; labels: E, <<, >>, add, zero, LSB]
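The loop for the 64-bit ALU version can be sketched as follows. The 64-bit masks model the register widths and are an assumption about how overflow would be handled; the function name is illustrative.

```python
MASK64 = (1 << 64) - 1

def multiply_64bit_alu(multiplicand, multiplier):
    """32x32-bit multiply with a 64-bit adder and a left-shifting multiplicand."""
    assert 0 <= multiplicand < (1 << 32) and 0 <= multiplier < (1 << 32)
    product = 0                                        # product register at 0
    for _ in range(32):                                # repeat 32 times
        if multiplier & 1:                             # 1 == LSB in multiplier
            product = (product + multiplicand) & MASK64
        multiplier >>= 1                               # shift multiplier 1 right
        multiplicand = (multiplicand << 1) & MASK64    # shift multiplicand 1 left
    return product

print(multiply_64bit_alu(9, 11))
```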

sequential multiplier (32-bit ALU)

after initialization (product register at 0)
and loading the operands;
repeat 32 times:
{ if 1==LSB in multiplier,
    then add multiplicand;
  shift multiplier 1 right;
  shift product 1 right;
}

[figure: multiplicand register (bits 31..0), multiplier register (bits 31..0),
32-bit ALU, product register (bits 63..0), finite state machine;
labels: E, <<, >>, add, zero, LSB]

sequential multiplier (32-bit ALU)

after initialization (product register at 0)
and loading the operands;
repeat 32 times:
{ if 1==LSB in multiplier,
    then add multiplicand;
  shift content of product register 1 right;
}

[figure: multiplicand register (bits 31..0), 32-bit ALU,
product register (bits 63..0) labelled with the multiplier,
finite state machine; labels: E, <<, >>, add, zero, LSB]


a combinational multiplier

[figure: 4x4 array multiplier; one row of four full adders (FA) per
multiplier bit B0..B3, each row fed by A3 A2 A1 A0; outputs P7..P0]

tPD = 10*tPD,FA
(follow the path from A0 to P7)

pipelined multiplier

"carry save" configuration

[figure: 4x4 multiplier; one row of four full adders (FA) per
multiplier bit B0..B3, each row fed by A3 A2 A1 A0, followed by
a final row of four full adders; outputs P7..P0]
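The idea behind a carry-save configuration is that each row of full adders passes its carries down to the next row instead of rippling them sideways, so every row has the delay of a single full adder. The sketch below models that at the word level with bitwise operations; it illustrates the general technique and does not claim to match the exact wiring of the slide's figure.

```python
def carry_save_add(x, y, z):
    """One row of full adders without a carry chain.

    Returns (sum_bits, carry_bits) such that sum_bits + carry_bits == x + y + z.
    """
    sum_bits = x ^ y ^ z
    carry_bits = ((x & y) | (x & z) | (y & z)) << 1
    return sum_bits, carry_bits

def carry_save_multiply(a, b, m_bits):
    """Accumulate the partial products in carry-save form, then resolve the carries."""
    s, c = 0, 0
    for i in range(m_bits):
        partial = (a if (b >> i) & 1 else 0) << i
        s, c = carry_save_add(s, c, partial)
    return s + c   # final carry-propagating addition (the extra adder row)

print(carry_save_multiply(0b1001, 0b1011, 4))
```

Because no row waits for a carry from its neighbour, registers can be placed between rows to form short, equally long pipeline stages.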

summary
• latency (L) = time it takes for a given input to affect an output
• throughput (T) = rate at which new outputs appear
• for combinational circuits: L = tPD of device, T = 1/L
• for K-stage pipeline (K > 0):
– always have registers on output(s)
– K registers on every path from input to output
– T = (tPD,reg + tPD,slowest pipeline stage + tSETUP)^-1
  • to increase throughput: split the slowest stage
  • no further splitting possible, use replication/interleaving
– L = K/T (K clock periods)
• pipelined latency ≥ combinational latency
• pipelining can be combined with circuit interleaving

chapter 4.5 (p. 332) and 3.3