1. Pipelining & ILP

CS2354 - Advanced Computer Architecture Pipelining
BY
N R REJIN PAUL LECTURER,CSE DEPT
1/24/02
CS252/Culler Lec 2.1
Review: Visualizing Pipelining

[Figure: pipeline timing diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7). Vertical axis: instruction order. Four instructions each pass through Ifetch, Reg, ALU, DMem, Reg, and each instruction starts one cycle after the one before it.]
Limits to pipelining
Hazards: circumstances that would cause incorrect execution if the next instruction were launched.
- Structural hazards: attempting to use the same hardware to do two different things at the same time.
- Data hazards: an instruction depends on the result of a prior instruction that is still in the pipeline.
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Example: One Memory Port/Structural Hazard

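[Figure: pipeline timing diagram for a machine with one memory port. Instructions in order: Load, Instr 1, Instr 2, Instr 3, Instr 4, each passing through Ifetch, Reg, ALU, DMem, Reg. The structural hazard is marked in the cycle where the Load's DMem access and the Ifetch of Instr 3 both need the single memory port.]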
Resolving structural hazards

Definition: an attempt to use the same hardware for two different things at the same time.
Solution 1: Wait
- must detect the hazard
- must have a mechanism to stall
Solution 2: Throw more hardware at the problem
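The one-memory-port conflict can be found mechanically. The sketch below is only an illustration under stated assumptions: the five-stage pipeline from these slides, no stalls, instruction i (0-based) fetched in cycle i + 1 and in DMem in cycle i + 4. The function name and the list encoding are hypothetical.

```python
def memory_port_conflicts(uses_dmem):
    """Return (cycle, mem_instr, fetch_instr) triples where a DMem access and
    an Ifetch fall in the same cycle on a single shared memory port.

    uses_dmem[i] is True if instruction i (0-based, program order) accesses
    data memory. Assumes no stalls: instruction i is fetched in cycle i + 1
    and reaches the DMem stage in cycle i + 4.
    """
    conflicts = []
    n = len(uses_dmem)
    for i, is_mem in enumerate(uses_dmem):
        if not is_mem:
            continue
        dmem_cycle = i + 4
        fetch_instr = dmem_cycle - 1  # instruction whose Ifetch is in that cycle
        if fetch_instr < n:
            conflicts.append((dmem_cycle, i, fetch_instr))
    return conflicts


# Load, Instr 1, Instr 2, Instr 3, Instr 4: only the Load touches data memory.
print(memory_port_conflicts([True, False, False, False, False]))
# prints [(4, 0, 3)]: in cycle 4 the Load (index 0) and Instr 3 (index 3) collide
```

Solution 1 corresponds to delaying the fetch of the conflicting instruction by one cycle; Solution 2 corresponds to giving instruction fetch and data access separate ports.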
Detecting and Resolving Structural Hazard

[Figure: the same timing diagram with the hazard resolved by stalling. Rows in order: Load, Instr 1, Instr 2, Stall, Instr 3. The stall appears as a row of bubbles, one per pipeline stage, and Instr 3 is fetched one cycle later than it would be without the stall.]
Data Hazards
[Figure: pipeline timing diagram. Stages: IF, ID/RF, EX, MEM, WB. Instructions in order:
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
Every instruction after the add reads r1, which the add writes.]
Three Generic Data Hazards

Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
    I: add r1,r2,r3
    J: sub r4,r1,r3
Caused by a data dependence (in compiler nomenclature). This hazard results from an actual need for communication.

Write After Read (WAR): InstrJ writes an operand before InstrI reads it.
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an anti-dependence by compiler writers. This results from reuse of the name r1.
Can't happen in the MIPS 5-stage pipeline because:
- all instructions take 5 stages,
- reads are always in stage 2, and
- writes are always in stage 5.

Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an output dependence by compiler writers. This also results from the reuse of the name r1.
Can't happen in the MIPS 5-stage pipeline because:
- all instructions take 5 stages, and
- writes are always in stage 5.
We will see WAR and WAW in later, more complicated pipes.
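The three definitions can be checked mechanically for a pair of instructions. The following is a minimal sketch, assuming the three-register "op rd,rs,rt" format used in the examples; the helper names parse and hazards are hypothetical. It reports which hazards are possible from the register names alone, not whether a given pipeline actually suffers them.

```python
def parse(instr):
    """Split 'op rd,rs,rt' into (destination, [sources]).

    Assumes the three-register format used in the slide examples.
    """
    _op, regs = instr.split(None, 1)
    rd, *srcs = [r.strip() for r in regs.split(",")]
    return rd, srcs


def hazards(instr_i, instr_j):
    """Return the set of potential hazard types when instr_j follows instr_i."""
    dst_i, src_i = parse(instr_i)
    dst_j, src_j = parse(instr_j)
    found = set()
    if dst_i in src_j:   # J reads what I writes
        found.add("RAW")
    if dst_j in src_i:   # J writes what I reads
        found.add("WAR")
    if dst_i == dst_j:   # J writes what I writes
        found.add("WAW")
    return found


print(hazards("add r1,r2,r3", "sub r4,r1,r3"))  # prints {'RAW'}
print(hazards("sub r4,r1,r3", "add r1,r2,r3"))  # prints {'WAR'}
print(hazards("sub r1,r4,r3", "add r1,r2,r3"))  # prints {'WAW'}
```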
Forwarding to Avoid Data Hazard

[Figure: pipeline timing diagram for the sequence below. The result of the add is forwarded to the later instructions that read r1.
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11]
Data Hazard Even with Forwarding

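[Figure: pipeline timing diagram for a load followed by instructions that use the loaded value:
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
The loaded value is available only at the end of the lw's DMem stage, which is the same cycle in which the sub needs it in its ALU stage, so forwarding alone cannot remove this hazard.]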
Resolving this load hazard

- Adding hardware? ... not
- Detection?
- Compilation techniques?
- What is the cost of load delays?
Resolving the Load Data Hazard

[Figure: the same sequence with the load hazard resolved by a stall:
    lw  r1, 0(r2)
    sub r4,r1,r6
    and r6,r1,r7
    or  r8,r1,r9
A bubble is inserted after the lw, so the sub, and, and or each proceed one cycle later.]
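The difference between the forwarding case and the load case comes down to how soon a result can be consumed. The sketch below is an illustration only, assuming the five-stage pipeline of these slides with full forwarding; the function name and the "distance" parameter are hypothetical.

```python
def stalls_with_forwarding(producer_op, distance):
    """Stall cycles needed before a consumer that sits `distance` instructions
    after its producer (distance 1 = the very next instruction).

    Assumes the five-stage pipeline on these slides with full forwarding:
    an ALU result can be used by the next instruction, while a loaded value
    can be used only by an instruction two or more slots later.
    """
    ready_distance = 2 if producer_op == "lw" else 1
    return max(0, ready_distance - distance)


print(stalls_with_forwarding("add", 1))  # prints 0: add then sub, no bubble
print(stalls_with_forwarding("lw", 1))   # prints 1: lw then sub, one bubble
print(stalls_with_forwarding("lw", 2))   # prints 0: one instruction in between
```

The last case is why a compiler can hide the load delay by placing an independent instruction directly after the load.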
Control Hazard on Branches => Three Stage Stall
[Figure: pipeline timing diagram for a branch and the instructions around it:
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11
The three instructions at 14, 18, and 22 are fetched before the outcome of the beq is known.]
Example: Branch Stall Impact

Two-part solution:
- Determine whether the branch is taken or not sooner, AND
- Compute the taken-branch address earlier

MIPS branch tests if register = 0 or ≠ 0

MIPS solution:
- Move the zero test to the ID/RF stage
- Add an adder to calculate the new PC in the ID/RF stage
- 1 clock cycle penalty for a branch versus 3
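To see what cutting the penalty from 3 cycles to 1 is worth, the usual estimate is CPI = base CPI + branch frequency x branch penalty. The sketch below applies it; the branch frequency of 0.2 and the base CPI of 1.0 are hypothetical values chosen for illustration, not figures from the slides.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when every branch costs a fixed penalty."""
    return base_cpi + branch_fraction * penalty_cycles


BRANCH_FRACTION = 0.2  # hypothetical: one instruction in five is a branch

cpi_late = cpi_with_branch_stalls(BRANCH_FRACTION, 3)   # branch resolved late
cpi_early = cpi_with_branch_stalls(BRANCH_FRACTION, 1)  # zero test moved to ID/RF
print(cpi_late, cpi_early, cpi_late / cpi_early)
```

Under these assumed numbers the CPI drops from about 1.6 to about 1.2, a speedup of roughly 1.33.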
Pipelined MIPS Datapath

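[Figure: pipelined MIPS datapath. Stages from left to right: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back, separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Labeled elements: Next PC, Next SEQ PC, two adders (one adding 4), Address, Memory, Reg File (RS1, RS2, RD), Imm with Sign Extend, Zero? test, ALU, Data Memory, WB Data, and multiplexers.]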
Alternatives to Overcome Branch Hazards

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
#3: Predict Branch Taken
Extending Pipeline To Handle Multicycle operations

Floating-point numbers have two parts: the exponent and the significand. The two parts must be handled separately.
Example: 3.25 x 10^3 + 2.63 x 10^-1
- Shift the smaller operand to the right until the exponents match: 2.63 x 10^-1 = 0.000263 x 10^3
- Add the significands: 3.25 x 10^3 + 0.000263 x 10^3 = 3.250263 x 10^3
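The align-then-add step can be written out directly. This is a minimal decimal sketch of the idea, not a model of real binary floating-point hardware (no rounding, normalization, or special values); the function name add_aligned and its argument layout are hypothetical.

```python
from decimal import Decimal


def add_aligned(sig_a, exp_a, sig_b, exp_b):
    """Add sig_a * 10**exp_a and sig_b * 10**exp_b by aligning exponents first.

    Returns (significand, exponent) using the larger of the two exponents.
    """
    # Make (sig_a, exp_a) the operand with the larger exponent.
    if exp_a < exp_b:
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    # Shift the smaller operand right: one decimal place per exponent step.
    shifted_b = sig_b * Decimal(10) ** (exp_b - exp_a)
    return sig_a + shifted_b, exp_a


print(add_aligned(Decimal("3.25"), 3, Decimal("2.63"), -1))
# prints (Decimal('3.250263'), 3)
```

The exponent comparison, the shift, and the add are separate steps, which is one reason a floating-point add takes more than one cycle in the pipeline.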
Cont.
An algorithm therefore has to be implemented in hardware to perform each floating-point operation. The functional units must be redesigned to perform all of these operations, and a functional unit of this type takes more than one clock cycle to produce its result.
Latency of a functional unit: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.
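Both definitions translate into simple arithmetic. The sketch below is an illustration only; the function names are hypothetical, and the latency of 3 and initiation interval of 25 used in the calls are assumed example values, not figures taken from these slides.

```python
def raw_stall_cycles(latency, independent_between):
    """Stall cycles for a consumer of a result, given the producer's latency
    and the number of independent instructions placed between the two."""
    return max(0, latency - independent_between)


def earliest_issue_cycles(count, initiation_interval, first_cycle=0):
    """Earliest issue cycle of each of `count` back-to-back operations of the
    same type on one functional unit."""
    return [first_cycle + k * initiation_interval for k in range(count)]


# Hypothetical unit with latency 3: a dependent instruction right behind the
# producer waits 3 cycles; one with 2 independent instructions between waits 1.
print(raw_stall_cycles(3, 0), raw_stall_cycles(3, 2))  # prints 3 1

# Hypothetical non-pipelined unit with initiation interval 25.
print(earliest_issue_cycles(3, 25))  # prints [0, 25, 50]
```

A fully pipelined unit has an initiation interval of 1 even when its latency is large, so the two numbers are independent of each other.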
Latencies and initiation intervals for FU
Pipeline support for FP operations
FP example
Cont.
Assuming that the pipeline does all hazard detection in ID, there are three checks that must be performed before an instruction can issue:
- Check for structural hazards: wait until the required functional unit is available.
- Check for a RAW data hazard: wait until the source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result.
- Check for a WAW data hazard: determine whether any instruction in A1, ..., A4, D, M1, ..., M7 has the same register destination as this instruction.
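The three checks can be read as a single decision made in ID. The sketch below is one possible encoding, assuming the pipeline state has already been summarized into three sets; the dictionary layout, the set names, and the function issue_check are all hypothetical.

```python
def issue_check(instr, busy_units, unavailable_dests, in_flight_dests):
    """Return the reason an instruction in ID must stall, or None if it can issue.

    instr             : {'unit': str, 'dest': str, 'srcs': [str, ...]}
    busy_units        : functional units that cannot accept an operation now
    unavailable_dests : pending destination registers whose values will not be
                        ready when this instruction needs them
    in_flight_dests   : destination registers of instructions still in the
                        multicycle stages (A1..A4, D, M1..M7)
    """
    if instr["unit"] in busy_units:
        return "structural"
    if any(src in unavailable_dests for src in instr["srcs"]):
        return "RAW"
    if instr["dest"] in in_flight_dests:
        return "WAW"
    return None


addd = {"unit": "fp_add", "dest": "F10", "srcs": ["F0", "F8"]}
print(issue_check(addd, set(), {"F0"}, {"F0"}))  # prints RAW
print(issue_check(addd, set(), set(), set()))    # prints None
```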
Instruction Level Parallelism

Pipelining can overlap the execution of instructions when they are independent of one another. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. ILP is a measure of how many of the operations in a computer program can be performed simultaneously.
Consider the following program:
    1. e = a + b
    2. f = c + d
    3. g = e * f

Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. Operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously.
A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible. Ordinary programs are typically written under a sequential execution model, where instructions execute one after the other and in the order specified by the programmer.
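The grouping of operations 1 to 3 into parallel steps can be computed from the data dependences alone. The sketch below is a simplified illustration that assumes unlimited hardware and considers only true (read-after-write) dependences; the function name and the (dest, sources) encoding are hypothetical.

```python
def parallel_steps(ops):
    """Group operations into steps so that each operation runs one step after
    the latest operation that produces one of its inputs.

    ops: list of (dest, [sources]) in program order.
    Returns a list of steps; each step is a list of 1-based operation numbers.
    """
    produced_in = {}  # variable name -> index of the step that computes it
    steps = []
    for number, (dest, srcs) in enumerate(ops, start=1):
        step = max((produced_in[s] + 1 for s in srcs if s in produced_in), default=0)
        if step == len(steps):
            steps.append([])
        steps[step].append(number)
        produced_in[dest] = step
    return steps


program = [("e", ["a", "b"]), ("f", ["c", "d"]), ("g", ["e", "f"])]
steps = parallel_steps(program)
print(steps)                      # prints [[1, 2], [3]]
print(len(program) / len(steps))  # prints 1.5
```

Three operations finish in two steps, so this fragment offers an average of 1.5 operations per step.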

The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism.
Example 1:
    for (i=1; i<=1000; i=i+1)
        x[i] = x[i] + y[i];
This is a parallel loop. Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little opportunity for overlap.
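One way to convince yourself that the iterations of x[i] = x[i] + y[i] are independent is to run them in different orders and compare the results. The sketch below does that on small made-up arrays; the function name and the data are hypothetical, and matching results on one input illustrate the property rather than prove it.

```python
import random


def add_in_order(x, y, order):
    """Apply x[i] = x[i] + y[i] for each index in `order`; returns a new list."""
    result = x[:]
    for i in order:
        result[i] = result[i] + y[i]
    return result


n = 1000
x = list(range(n))
y = [2 * i for i in range(n)]

forward = list(range(n))
backward = forward[::-1]
shuffled = forward[:]
random.Random(0).shuffle(shuffled)  # fixed seed so the run is repeatable

same = (add_in_order(x, y, forward)
        == add_in_order(x, y, backward)
        == add_in_order(x, y, shuffled))
print(same)  # prints True
```

Each iteration reads and writes only its own element i, which is why the order does not matter.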

    for (i=1; i<=100; i=i+1) {
        a[i] = a[i] + b[i];      // s1
        b[i+1] = c[i] + d[i];    // s2
    }
Is this loop parallel? If not, how can it be made parallel?
Statement s1 uses the value assigned in the previous iteration by statement s2, so there is a loop-carried dependency between s1 and s2. Despite this dependency, the loop can be made parallel because the dependency is not circular:
- neither statement depends on itself;
- while s1 depends on s2, s2 does not depend on s1.
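One standard way to exploit the non-circular dependency is to peel off the first s1 and the last s2 and swap the two statements, so that each new iteration computes b[i+1] and then uses it immediately. The sketch below shows this transformation in Python and checks it against the original loop on made-up data; the function names and test arrays are hypothetical, and the slides do not say which transformation the lecturer had in mind.

```python
def original(a, b, c, d, n):
    """The loop as written: s1 reads b[i] produced by s2 one iteration earlier."""
    a, b = a[:], b[:]
    for i in range(1, n + 1):
        a[i] = a[i] + b[i]        # s1
        b[i + 1] = c[i] + d[i]    # s2
    return a, b


def transformed(a, b, c, d, n):
    """Same result, but no iteration depends on another iteration."""
    a, b = a[:], b[:]
    a[1] = a[1] + b[1]                    # first s1, peeled off
    for i in range(1, n):
        b[i + 1] = c[i] + d[i]            # s2 of iteration i
        a[i + 1] = a[i + 1] + b[i + 1]    # s1 of iteration i + 1
    b[n + 1] = c[n] + d[n]                # last s2, peeled off
    return a, b


n = 100
a0 = [3 * i for i in range(n + 2)]   # indices 1..n are used, as in the slide
b0 = [i + 7 for i in range(n + 2)]
c0 = [i * i for i in range(n + 2)]
d0 = [5 - i for i in range(n + 2)]
print(original(a0, b0, c0, d0, n) == transformed(a0, b0, c0, d0, n))  # prints True
```

In the transformed loop the dependency between the two statements stays inside a single iteration, so the iterations themselves can overlap.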
Ideas To Reduce Stalls

Technique                                   Reduces
Dynamic scheduling                          Data hazard stalls
Dynamic branch prediction                   Control stalls
Issuing multiple instructions per cycle     Ideal CPI
Speculation                                 Data and control stalls
Dynamic memory disambiguation               Data hazard stalls involving memory
Loop unrolling                              Control hazard stalls
Basic compiler pipeline scheduling          Data hazard stalls
Compiler dependence analysis                Ideal CPI and data hazard stalls
Software pipelining and trace scheduling    Ideal CPI and data hazard stalls
Compiler speculation                        Ideal CPI, data and control stalls
Data Dependence and Hazards
InstrJ is data dependent on InstrI when InstrJ reads an operand that InstrI writes:
    I: add r1,r2,r3
    J: sub r4,r1,r3
or when InstrJ is data dependent on InstrK, which is data dependent on InstrI.
Caused by a true dependence (compiler term). If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.
Data Dependence and Hazards
- Dependences are a property of programs.
- The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
- Importance of data dependences:
  1) they indicate the possibility of a hazard
  2) they determine the order in which results must be calculated
- Today: looking at HW schemes to avoid hazards.
Name Dependence #1: Anti-dependence
Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence.
Anti-dependence: InstrJ writes an operand before InstrI reads it.
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an anti-dependence by compiler writers. This results from reuse of the name r1. If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
Name Dependence #2: Output dependence
Output dependence: InstrJ writes an operand before InstrI writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an output dependence by compiler writers. This also results from the reuse of the name r1. If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
Control Dependencies
Every instruction is control dependent on some set of branches and, in general, these control dependencies must be preserved to preserve program order.
    if p1 { S1; };
    if p2 { S2; }
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Out-Of-Order Execution
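    DIVD F0,F2,F4
    ADDD F10,F0,F8
    SUBD F12,F8,F14
ADDD must wait for the F0 result of DIVD, but SUBD depends on neither instruction and does not need to wait behind the stalled ADDD. Letting it proceed enables out-of-order execution => out-of-order completion.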
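The effect on the three-instruction example can be seen with a very small timing model. The sketch below is an illustration under strong assumptions: the latencies (DIVD 10, ADDD 2, SUBD 2) are hypothetical, one instruction enters per cycle, only true dependences are tracked, and structural, WAR, and WAW hazards are ignored. It does not model any real scheduler.

```python
def start_cycles(program, latency, in_order):
    """Cycle in which each instruction starts executing.

    program : list of (op, dest, [sources]) in program order
    latency : op -> cycles until its result is available (assumed values)
    in_order: if True, an instruction may not start before its predecessor
    """
    ready = {}        # register -> cycle in which its value becomes available
    starts = []
    prev_start = -1
    for slot, (op, dest, srcs) in enumerate(program):
        start = max([slot] + [ready.get(s, 0) for s in srcs])
        if in_order:
            start = max(start, prev_start + 1)
        starts.append(start)
        ready[dest] = start + latency[op]
        prev_start = start
    return starts


program = [
    ("DIVD", "F0", ["F2", "F4"]),
    ("ADDD", "F10", ["F0", "F8"]),
    ("SUBD", "F12", ["F8", "F14"]),
]
latency = {"DIVD": 10, "ADDD": 2, "SUBD": 2}  # hypothetical latencies

print(start_cycles(program, latency, in_order=True))   # prints [0, 10, 11]
print(start_cycles(program, latency, in_order=False))  # prints [0, 10, 2]
```

With these assumed latencies the SUBD starts in cycle 2 instead of cycle 11 and finishes before both earlier instructions, which is the out-of-order completion the slide refers to.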
THANK YOU