
PIPELINE PROCESSING

PARALLEL PROCESSING
• A parallel processing system is able to perform
  concurrent data processing to achieve a faster
  execution time.

• The system may have two or more ALUs and be
  able to execute two or more instructions at the
  same time.

• A computer employing parallel processing is
  also called a parallel computer.
Parallel processing classification

Single instruction stream, single data stream – SISD

Single instruction stream, multiple data stream – SIMD

Multiple instruction stream, single data stream – MISD

Multiple instruction stream, multiple data stream – MIMD


Types of Parallel Computers
• Based on architectural configurations
– Pipelined computers
– Array processors
– Multiprocessor systems
Architectural Classification

– Flynn's classification
• Based on the multiplicity of Instruction Streams and
Data Streams
• Instruction Stream
– Sequence of Instructions read from memory
• Data Stream
– Operations performed on the data in the processor

                          Number of Data Streams
                          Single       Multiple

Number of     Single      SISD         SIMD
Instruction
Streams       Multiple    MISD         MIMD
COMPUTER ARCHITECTURES FOR PARALLEL PROCESSING

Von-Neumann   SISD   Superscalar processors
based                Superpipelined processors
                     VLIW (Very Long Instruction Word Arch.)

              MISD   Nonexistent

              SIMD   Array processors
                     Systolic arrays
                     Associative processors

              MIMD   Shared-memory multiprocessors
                       Bus based
                       Crossbar switch based
                       Multistage IN based
                     Message-passing multicomputers
                       Hypercube
                       Mesh
                       Reconfigurable
Dataflow
Reduction
PIPELINE PROCESSING
• Pipelining is a technique of overlapping the
  execution of several instructions to reduce
  the execution time of a set of instructions.

• It is a cascade of processing stages that
  are linearly connected to perform a fixed
  function over a stream of data flowing from
  one end to the other.
Advantages of Pipeline Processing
• Reduced access time
• Increased throughput
Types of Pipeline Models
• Asynchronous pipeline
• Synchronous pipeline

In both models, external inputs are fed into the first
stage. The processed results are passed from
stage Si to stage Si+1 for all i = 1, 2, …, k−1.
The final result appears in stage Sk.


Asynchronous Pipeline Model
• Data flow between adjacent stages is controlled
by a handshaking protocol.
• When stage Si is ready to transmit data, it sends a
  ready signal to stage Si+1. After stage Si+1
  receives the incoming data, it returns an ACK
  signal to stage Si.
• Advantages
  – Useful for designing communication channels
    in message-passing multicomputers
• Disadvantages
  – Variable throughput rate
  – Different amounts of delay may occur in
    different stages
Synchronous pipeline
• Here, clocked latches are used to interface
  between stages. The latches are used to
  isolate inputs from outputs.
• Upon arrival of the clock pulse, all
  latches transfer data to the next stage
  simultaneously.
Clock
  |
Input -> S1 -> R1 -> S2 -> R2 -> S3 -> R3 -> S4 -> R4

• Advantage
– Equal delay in all stages
Instruction Execution steps
• Instruction fetch (IF) from MM
• Instruction Decoding (ID)
• Operand Fetch (OF), if any
• Execution of the decoded instruction (EX)
Non-pipelined computer
- 6 stages:
  Instruction fetch, Instruction Decode, Operand
  Address calculation, Operand fetch, Execute,
  Write Result
Space-Time Diagram

            1    2    3    4    5    6    7    8    9   Clock cycles
Segment 1   T1   T2   T3   T4   T5   T6
        2        T1   T2   T3   T4   T5   T6
        3             T1   T2   T3   T4   T5   T6
        4                  T1   T2   T3   T4   T5   T6
Pipelined Computer

Stages/Time  1    2    3    4    5    6    7
IF           I1   I2   I3   I4
ID                I1   I2   I3
OF                     I1   I2   I3
EX                          I1   I2   I3
In the first cycle instruction I1 is fetched from memory. In the second
cycle another instruction I2 is fetched from memory and simultaneously
I1 is decoded by the instruction decoding unit.
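The timing pattern above can be captured by two simple cycle-count formulas (a sketch, not from the slides: function names are mine). With k stages and n instructions, a non-pipelined machine needs n × k cycles, while a pipelined machine delivers its first result after k cycles and one more result every cycle thereafter.

```python
# Cycle counts implied by the space-time diagram,
# for k pipeline stages and n instructions.
def nonpipelined_cycles(n, k):
    return n * k          # each instruction uses all k stages in turn

def pipelined_cycles(n, k):
    return k + (n - 1)    # first result after k cycles, then one per cycle

# Four instructions through the 4-stage IF/ID/OF/EX pipeline above:
print(nonpipelined_cycles(4, 4))  # 16
print(pipelined_cycles(4, 4))     # 7
```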
INSTRUCTION PIPELINE

Execution of Three Instructions in a 4-Stage Pipeline

Conventional:

i    FI  DA  FO  EX
i+1                  FI  DA  FO  EX
i+2                                  FI  DA  FO  EX

Pipelined:

i    FI  DA  FO  EX
i+1      FI  DA  FO  EX
i+2          FI  DA  FO  EX
PIPELINING
A technique of decomposing a sequential process
into suboperations, with each subprocess being
executed in a special dedicated segment that
operates concurrently with all other segments.
Example: Ai * Bi + Ci for i = 1, 2, 3, ..., 7

   Ai        Bi        Ci (from memory)
    |         |         |
  [R1]      [R2]        |        <- Segment 1
     \      /           |
   [Multiplier]         |
        |               |
      [R3]            [R4]       <- Segment 2
         \            /
          [  Adder  ]
               |
             [R5]                <- Segment 3

R1 ← Ai, R2 ← Bi           Load Ai and Bi
R3 ← R1 * R2, R4 ← Ci      Multiply and load Ci
R5 ← R3 + R4               Add
OPERATIONS IN EACH PIPELINE STAGE

Clock     Segment 1      Segment 2          Segment 3
Pulse
Number    R1    R2       R3        R4       R5
1 A1 B1
2 A2 B2 A1 * B1 C1
3 A3 B3 A2 * B2 C2 A1 * B1 + C1
4 A4 B4 A3 * B3 C3 A2 * B2 + C2
5 A5 B5 A4 * B4 C4 A3 * B3 + C3
6 A6 B6 A5 * B5 C5 A4 * B4 + C4
7 A7 B7 A6 * B6 C6 A5 * B5 + C5
8 A7 * B7 C7 A6 * B6 + C6
9 A7 * B7 + C7
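The table above can be reproduced with a short clock-by-clock simulation. This is a minimal sketch (the function name and register-variable names are mine): on each clock pulse all registers load simultaneously, so each new value is computed from the registers' contents before the pulse.

```python
# Clock-by-clock simulation of the three-segment Ai*Bi + Ci pipeline.
def simulate(A, B, C):
    n = len(A)
    R1 = R2 = R3 = R4 = None
    out = []
    for p in range(1, n + 3):  # clock pulses 1 .. n+2 (n inputs + 2 drain cycles)
        # All registers load simultaneously on the clock edge:
        new_R5 = R3 + R4 if R3 is not None else None    # segment 3: add
        new_R3 = R1 * R2 if R1 is not None else None    # segment 2: multiply
        new_R4 = C[p - 2] if 2 <= p <= n + 1 else None  # segment 2: load Ci
        new_R1 = A[p - 1] if p <= n else None           # segment 1: load Ai
        new_R2 = B[p - 1] if p <= n else None           # segment 1: load Bi
        R1, R2, R3, R4 = new_R1, new_R2, new_R3, new_R4
        if new_R5 is not None:
            out.append(new_R5)
    return out

A = [1, 2, 3, 4, 5, 6, 7]
B = [2, 2, 2, 2, 2, 2, 2]
C = [1, 1, 1, 1, 1, 1, 1]
print(simulate(A, B, C))  # [3, 5, 7, 9, 11, 13, 15]
```

The first result appears on pulse 3 and the last on pulse n + 2, matching the 9 clock pulses in the table for 7 inputs.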
INSTRUCTION EXECUTION IN A 4-STAGE PIPELINE

Segment 1: Fetch instruction from memory

Segment 2: Decode instruction and calculate
           effective address
           Branch? yes -> update PC, empty the pipe,
           and fetch the branch target next

Segment 3: Fetch operand from memory

Segment 4: Execute instruction
           Interrupt? yes -> interrupt handling,
           update PC, empty the pipe
Step:           1   2   3   4   5   6   7   8   9  10  11  12  13
Instruction 1   FI  DA  FO  EX
            2       FI  DA  FO  EX
(Branch)    3           FI  DA  FO  EX
            4               FI  -   -   FI  DA  FO  EX
            5                               FI  DA  FO  EX
            6                                   FI  DA  FO  EX
            7                                       FI  DA  FO  EX
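The cost of the branch above can be expressed as stall cycles added to the basic pipelined cycle count (a sketch under my own naming, not from the slides): without the branch, 7 instructions in a 4-stage pipeline would finish in 4 + 6 = 10 steps; the branch's refetch delay adds 3 more, giving the 13 steps shown.

```python
# Total steps for a pipelined instruction stream with branch stalls.
def total_cycles(n_instructions, n_stages, branch_penalties):
    # base pipelined time plus the stall cycles each branch adds
    return n_stages + (n_instructions - 1) + sum(branch_penalties)

# 7 instructions, 4 stages, one branch costing a 3-cycle refetch delay:
print(total_cycles(7, 4, [3]))  # 13 (matches the diagram)
print(total_cycles(7, 4, []))   # 10 (same stream with no branch)
```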
Example: 6 tasks, divided into 4 segments

           1   2   3   4   5   6   7   8   9
Segment 1  T1  T2  T3  T4  T5  T6
        2      T1  T2  T3  T4  T5  T6
        3          T1  T2  T3  T4  T5  T6
        4              T1  T2  T3  T4  T5  T6
Pipeline Performance
• Latency
  – The amount of time that a single operation takes
    to execute
• Throughput
  – The rate at which operations get executed
    (operations/second or operations/cycle)

• For a non-pipelined processor, throughput = 1 / Latency
• For a pipelined processor, throughput > 1 / Latency
• Cycle time of a pipelined processor
  – Depends on 4 factors:
    • Cycle time of the unpipelined processor
    • Number of pipeline stages
    • How evenly the datapath logic is divided among the stages
    • Latency of the pipeline latches
• If the logic is evenly divided, then the clock period
  of the pipelined processor is

  Cycle Time(pipelined) = (Cycle Time(unpipelined) / No. of pipeline stages)
                          + pipeline latch latency

• If the logic cannot be evenly divided, then the
  clock period of the pipelined processor is

  Cycle Time = Longest stage latency + pipeline latch latency

• Latency of the pipeline = Cycle time of the pipeline x No. of pipeline stages
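The three formulas above translate directly into code. A minimal sketch (function names are mine) that the worked examples on the following slides can be checked against:

```python
# Cycle-time and latency formulas for a pipelined processor.
def cycle_time_even(t_unpipelined, n_stages, latch_latency):
    """Clock period when the logic is evenly divided among the stages."""
    return t_unpipelined / n_stages + latch_latency

def cycle_time_uneven(stage_latencies, latch_latency):
    """Clock period when the logic cannot be evenly divided:
    the longest stage sets the clock."""
    return max(stage_latencies) + latch_latency

def pipeline_latency(cycle_time, n_stages):
    """Time for one instruction to traverse the whole pipeline."""
    return cycle_time * n_stages

# Example: 25 ns unpipelined cycle time, 5 even stages, 1 ns latches
print(cycle_time_even(25, 5, 1))   # 6.0 ns
print(pipeline_latency(6, 5))      # 30 ns
```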
Questions

• An unpipelined processor has a cycle time
  of 25 ns. What is the cycle time of a
  pipelined version of the processor with 5
  evenly divided pipeline stages, if each
  pipeline latch has a latency of 1 ns? What
  is the total latency of the pipeline? What if
  the processor is divided into 50 pipeline
  stages?
Solution
Given data: Cycle Time(unpipelined) = 25 ns
No. of pipeline stages = 5
Pipeline latch latency = 1 ns

Cycle Time(pipelined) = (Cycle Time(unpipelined) / No. of pipeline stages)
                        + pipeline latch latency
                      = (25 / 5) + 1 ns = 6 ns
Therefore, cycle time of the 5-stage pipeline = 6 ns
Latency of the pipeline = Cycle time of the pipeline x No. of pipeline stages
                        = 6 ns x 5 = 30 ns
For the 50-stage pipeline, cycle time = (25 ns / 50) + 1 ns = 1.5 ns
Therefore, cycle time of the 50-stage pipeline = 1.5 ns
Latency of the pipeline = 1.5 ns x 50 = 75 ns
Questions
• Suppose an unpipelined processor with a
25 ns cycle time is divided into 5 pipeline
stages with latencies of 5, 7, 3, 6, and 4 ns.
If the pipeline latch latency is 1 ns, what is
the cycle time of the pipeline processor?
What is the latency of the resulting
pipeline?
Solution
• Here the stages are unevenly divided, so the cycle
  time is set by the longest stage.
• The longest pipeline stage is: 7 ns
• Pipeline latch latency = 1 ns
• Cycle time = Longest stage latency + pipeline latch latency
             = 7 + 1 = 8 ns
Therefore, cycle time of the pipelined processor = 8 ns
There are 5 pipeline stages.
Total latency = Cycle time of the pipeline x No. of pipeline stages
              = 8 ns x 5 = 40 ns
Question
• Suppose that an unpipelined processor has a
cycle time of 25 ns and that its datapath is made
up of modules with latencies of 2,3,4,7,3,2 and 4
ns (in that order). In pipelining this processor, it
is not possible to rearrange the order of the
modules (For example, putting the register read
stage before the instruction decode stage) or to
divide a module into multiple pipeline stages (for
complexity reasons). Given pipeline latches with
1 ns latency, what is the minimum cycle time
that can be achieved by pipelining this
processor?
Solution
• There is no limit on the number of pipeline stages.
• The minimum cycle time
  = Latency of the longest module in the datapath + pipeline latch latency
  = 7 + 1 ns
  = 8 ns
Question
• Given an unpipelined processor with a 10 ns
  cycle time and pipeline latches with 0.5 ns
  latency:
a. What are the cycle times of pipelined versions of
the processor with 2,4,7 and 16 stages if the
datapath logic is evenly divided among the
pipeline stages?
b. What is the latency of the pipelined versions of
the processor?
c. How many stages of pipelining are required to
achieve a cycle time of 2 ns and 1 ns?
Solution – a

Cycle Time(pipelined) = (Cycle Time(unpipelined) / No. of pipeline stages)
                        + pipeline latch latency

Given data: Cycle Time(unpipelined) = 10 ns
No. of pipeline stages = 2, 4, 7 and 16
Pipeline latch latency = 0.5 ns
Cycle time for the 2-stage pipeline = (10 ns / 2) + 0.5 = 5.5 ns
Cycle time for the 4-stage pipeline = (10 ns / 4) + 0.5 = 3 ns
Cycle time for the 7-stage pipeline = (10 ns / 7) + 0.5
                                    = 1.42857 + 0.5 = 1.92857 ns
Cycle time for the 16-stage pipeline = (10 ns / 16) + 0.5
                                     = 0.625 + 0.5 = 1.125 ns
Solution – b
• Latency of each pipeline
= Cycle time of the pipeline x No. of pipeline stages
Latency for 2 stage pipeline = 5.5 x 2 = 11 ns
Latency for 4 stage pipeline = 3 x 4 = 12 ns
Latency for 7 stage pipeline = 1.92857 x 7
= 13.49999 ns
Latency for 16 stage pipeline = 1.125 x 16 = 18 ns
Solution – c
• First solve for the number of pipeline stages:

  Cycle Time(pipelined) = (Cycle Time(unpipelined) / No. of pipeline stages)
                          + pipeline latch latency

  No. of pipeline stages = Cycle Time(unpipelined)
                           / (Cycle Time(pipelined) - pipeline latch latency)

For a 2 ns cycle time:
  = 10 ns / (2 ns - 0.5 ns) = 10 / 1.5 = 6.6667
Therefore, the number of pipeline stages required to achieve a 2 ns cycle
time is 7 (rounded up, since a fractional number of pipeline stages is not
allowed).

Similarly, the number of pipeline stages required to achieve a 1 ns cycle time
  = 10 ns / (1 ns - 0.5 ns) = 10 / 0.5 = 20 stages
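The rearranged formula, with the round-up step made explicit, can be sketched as follows (the function name is mine):

```python
# Number of pipeline stages needed to hit a target cycle time,
# rounded up because fractional stages are not allowed.
import math

def stages_needed(t_unpipelined, target_cycle_time, latch_latency):
    return math.ceil(t_unpipelined / (target_cycle_time - latch_latency))

print(stages_needed(10, 2, 0.5))   # 7
print(stages_needed(10, 1, 0.5))   # 20
```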
Pipeline Hazards
• Pipelining increases processor performance.
  – When several instructions are overlapped in the
    pipeline, the cycle time can be reduced, increasing
    the rate at which instructions are executed.
  – A number of factors limit a pipeline's ability to
    execute instructions at its peak rate, including
    dependencies between instructions, branches,
    and the time required to access memory.
Types of Hazards
• Instruction Hazards
• Structural Hazards
• Control Hazards
• Branches
Hazards in Pipelining
• Procedural dependencies => Control hazards
– conditional and unconditional branches,
calls/returns
• Data dependencies => Data hazards
– RAW (read after write)
– WAR (write after read)
– WAW (write after write)
• Resource conflicts => Structural hazards
– use of same resource in different stages
Instruction Hazards
– Occur when instructions read/write registers that are used by other
  instructions.
• RAR Hazards
  – Occur when 2 instructions both read from the same register.
  – Example:
      ADD R1, R2, R3
      SUB R4, R5, R3
• RAW Hazards
  – Occur when an instruction reads a register that was written by a
    previous instruction.
  – Example:
      ADD R1, R2, R3
      SUB R4, R5, R1
• WAR Hazards
  – Occur when the output register of an instruction has been read by a
    previous instruction.
  – Example:
      ADD R1, R2, R3
      SUB R2, R5, R6
• WAW Hazards
  – Occur when the output register of an instruction has been written by a
    previous instruction.
  – Example:
      ADD R1, R2, R3
      SUB R1, R5, R6
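The three data-hazard definitions above can be checked mechanically. A minimal sketch (the function name is mine), assuming the "DEST, SRC1, SRC2" three-register format used in the examples:

```python
# Detect RAW/WAR/WAW hazards between an earlier and a later instruction,
# each given as a (dest, src1, src2) register tuple.
def hazards(earlier, later):
    dst1, srcs1 = earlier[0], earlier[1:]
    dst2, srcs2 = later[0], later[1:]
    found = []
    if dst1 in srcs2:
        found.append("RAW")   # later reads what earlier wrote
    if dst2 in srcs1:
        found.append("WAR")   # later writes what earlier read
    if dst1 == dst2:
        found.append("WAW")   # both write the same register
    return found

add = ("R1", "R2", "R3")                  # ADD R1, R2, R3
print(hazards(add, ("R4", "R5", "R1")))   # ['RAW']
print(hazards(add, ("R2", "R5", "R6")))   # ['WAR']
print(hazards(add, ("R1", "R5", "R6")))   # ['WAW']
```

RAR pairs are deliberately not reported: two reads of the same register impose no ordering constraint, so RAR is not a true hazard.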
STRUCTURAL HAZARDS
• Occur when the processor's H/W is not capable of executing all the
  instructions in the pipeline simultaneously.

• Example: With one memory port, a data fetch and an instruction fetch
  cannot be initiated in the same clock cycle.

i     FI  DA  FO  EX
i+1       FI  DA  FO  EX
i+2           stall  stall  FI  DA  FO  EX

• The pipeline is stalled for a structural hazard:
  - Two loads with a one-port memory cause a stall
  - A two-port memory would serve both without a stall
Control Hazards
• The delay between when a branch
  instruction enters the pipeline and the
  time at which the next instruction enters
  the pipeline is called the processor's
  branch delay, or a control hazard.
• The delay is mainly due to the control flow
  of the program.
CONTROL HAZARDS

Branch Instructions
- The branch target address is not known until
  the branch instruction has completed.

Branch        FI  DA  FO  EX
Instruction
Next                          FI  DA  FO  EX
Instruction                   ^
                              Target address available

- Stall -> waste of cycle time
Branches
• Branch instructions can also cause delay
in pipelined processor because the
processor cannot determine which
instruction to fetch next until the branch
has executed.
• Conditional branch instruction creates data
dependencies between the branch
instructions and the instruction fetch stage
of the pipeline.
Question
A. Identify all of the RAW hazards in this
instruction queue
DIV R2, R5, R8
SUB R9, R2, R7
ASH R5, R14, R6
MUL R11, R9, R5
BEQ R10, #0, R12
OR R8, R15, R2
B. Identify all of the WAR hazards in the previous instruction
sequence
C. Identify all of the WAW hazards in the previous instruction
sequence
D. Identify all of the control hazards in the previous instruction
sequence
Solution
A) RAW hazards exist between the following
   instructions:
   – Between DIV and SUB
   – Between ASH and MUL
   – Between SUB and MUL
   – Between DIV and OR

B) WAR hazards exist between the following
   instructions:
   – Between DIV and ASH
   – Between DIV and OR

C) There are no WAW hazards.

D) There is only one control hazard: between BEQ and OR.
