
CS2354 - Advanced Computer Architecture Pipelining

BY
N R REJIN PAUL, LECTURER, CSE DEPT

1/24/02

CS252/Culler Lec 2.1

Review: Visualizing Pipelining


[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7); vertical axis: instruction order. Four instructions each pass through the stages Ifetch, Reg, ALU, DMem, Reg, each starting one cycle after the previous one.]

Limits to pipelining
Hazards: circumstances that would cause incorrect execution if the next instruction were launched
  Structural hazards: attempting to use the same hardware to do two different things at the same time
  Data hazards: instruction depends on result of prior instruction still in the pipeline
  Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

Example: One Memory Port/Structural Hazard


[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7); vertical axis: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4). With one memory port, the DMem access of Load and the Ifetch of Instr 3 fall in the same cycle (Cycle 4); this conflict is labeled "Structural Hazard".]

Resolving structural hazards


Defn: attempt to use same hardware for two different things at the same time
Solution 1: Wait
  must detect the hazard
  must have mechanism to stall
Solution 2: Throw more hardware at the problem

Detecting and Resolving Structural Hazard


[Figure: pipeline diagram. Horizontal axis: time in clock cycles (Cycle 1 to Cycle 7); vertical axis: instruction order (Load, Instr 1, Instr 2, Stall, Instr 3). A stall, drawn as a row of bubbles, is inserted after Instr 2 so that Instr 3 is fetched one cycle later, once Load has finished its DMem access.]

Data Hazards
[Figure: pipeline diagram. Horizontal axis: time in clock cycles, with stages IF, ID/RF, EX, MEM, WB; vertical axis: instruction order for the sequence below. Every instruction after add reads r1, which add writes.]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

Three Generic Data Hazards


Read After Write (RAW)
InstrJ tries to read operand before InstrI writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "Data Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards


Write After Read (WAR)
InstrJ writes operand before InstrI reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1".
Can't happen in MIPS 5 stage pipeline because:
  All instructions take 5 stages, and
  Reads are always in stage 2, and
  Writes are always in stage 5

Three Generic Data Hazards


Write After Write (WAW)
InstrJ writes operand before InstrI writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of name "r1".
Can't happen in MIPS 5 stage pipeline because:
  All instructions take 5 stages, and
  Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes
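The three cases above can be told apart by comparing the destination and source registers of two instructions. The following is a minimal sketch, not part of the original slides: the function name `classify_hazards` and the tuple layout `(dest, src1, src2)` are assumptions made for illustration. It reports which dependence exists between an earlier and a later instruction; whether that dependence becomes an actual hazard still depends on the pipeline.

```python
def classify_hazards(first, second):
    """Name the dependences between two instructions.

    Each instruction is a hypothetical (dest, src1, src2) tuple, and
    `first` comes before `second` in program order.
    """
    dest_i, *srcs_i = first
    dest_j, *srcs_j = second
    found = []
    if dest_i in srcs_j:      # J reads what I writes
        found.append("RAW")
    if dest_j in srcs_i:      # J writes what I reads
        found.append("WAR")
    if dest_i == dest_j:      # J writes what I writes
        found.append("WAW")
    return found


# The three instruction pairs used on the slides.
raw = classify_hazards(("r1", "r2", "r3"), ("r4", "r1", "r3"))  # add ; sub
war = classify_hazards(("r4", "r1", "r3"), ("r1", "r2", "r3"))  # sub ; add
waw = classify_hazards(("r1", "r4", "r3"), ("r1", "r2", "r3"))  # sub ; add
print(raw, war, waw)
```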

Forwarding to Avoid Data Hazard


[Figure: pipeline diagram. Horizontal axis: time in clock cycles; vertical axis: instruction order for the sequence below. The result of add is forwarded to the later instructions that read r1 instead of waiting for the register write-back.]

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

Data Hazard Even with Forwarding


[Figure: pipeline diagram. Horizontal axis: time in clock cycles; vertical axis: instruction order for the sequence below. The loaded value of r1 is available only after the DMem stage of lw, which is too late for the ALU stage of sub even with forwarding.]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9

Resolving this load hazard


Adding hardware? ... not
Detection?
Compilation techniques?

What is the cost of load delays?

Resolving the Load Data Hazard


[Figure: pipeline diagram. Horizontal axis: time in clock cycles; vertical axis: instruction order for the sequence below. A bubble delays sub and the instructions after it by one cycle, so the ALU stage of sub follows the DMem stage of lw.]

lw  r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or  r8,r1,r9
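The detection step can be sketched as a check on adjacent instructions: a bubble is needed when a load is immediately followed by an instruction that reads the loaded register. This is an illustrative sketch under stated assumptions, not the hardware interlock itself: it assumes the classic 5-stage pipeline with full forwarding, and the name `load_use_stalls` and the `(op, dest, srcs)` tuple layout are hypothetical.

```python
def load_use_stalls(program):
    """Count one-cycle bubbles caused by load-use pairs.

    `program` is a list of hypothetical (op, dest, srcs) tuples in program
    order. Assumes a 5-stage pipeline with forwarding, where only the
    instruction directly after a load can be forced to wait.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


# The sequence from the slide: sub uses r1 right after the load.
slide_sequence = [
    ("lw", "r1", ("r2",)),
    ("sub", "r4", ("r1", "r6")),
    ("and", "r6", ("r1", "r7")),
    ("or", "r8", ("r1", "r9")),
]

# A hypothetical compiler-scheduled variant: an independent instruction
# (made up for this example) is placed directly after the load.
scheduled_sequence = [
    ("lw", "r1", ("r2",)),
    ("add", "r12", ("r13", "r14")),
    ("sub", "r4", ("r1", "r6")),
]

print(load_use_stalls(slide_sequence), load_use_stalls(scheduled_sequence))
```

The second sequence illustrates the "compilation techniques" option: filling the slot after the load with independent work removes the bubble without extra hardware.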

Control Hazard on Branches => Three Stage Stall

[Figure: pipeline diagram for the sequence below. The three instructions fetched after beq enter the pipeline before the branch outcome is known.]

10: beq r1,r3,36
14: and r2,r3,r5
18: or  r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11

Example: Branch Stall Impact


Two part solution:
  Determine branch taken or not sooner, AND
  Compute taken branch address earlier

MIPS branch tests if register = 0 or ≠ 0
MIPS Solution:
  Move Zero test to ID/RF stage
  Adder to calculate new PC in ID/RF stage
  1 clock cycle penalty for branch versus 3
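The effect of the branch penalty on performance can be estimated with simple arithmetic: effective CPI = base CPI + (fraction of instructions that are branches) x (stall cycles per branch). The sketch below assumes a base CPI of 1 and a branch frequency of 30%; both numbers are illustrative assumptions, and the function name `effective_cpi` is hypothetical.

```python
def effective_cpi(base_cpi, branch_fraction, branch_penalty):
    """CPI once every branch pays `branch_penalty` stall cycles."""
    return base_cpi + branch_fraction * branch_penalty


BASE_CPI = 1.0          # assumed ideal pipeline CPI
BRANCH_FRACTION = 0.30  # assumed share of branches in the instruction mix

cpi_three_cycle = effective_cpi(BASE_CPI, BRANCH_FRACTION, 3)  # branch resolved late
cpi_one_cycle = effective_cpi(BASE_CPI, BRANCH_FRACTION, 1)    # branch resolved in ID/RF

print(f"3-cycle penalty: CPI = {cpi_three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {cpi_one_cycle:.2f}")
```

Under these assumed numbers, moving the zero test and the target adder into ID/RF cuts the branch overhead from roughly 0.9 to roughly 0.3 cycles per instruction.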

Pipelined MIPS Datapath


[Figure: pipelined MIPS datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labeled components: Next PC, Next SEQ PC, two adders (one adding 4), instruction memory (Address), register file (RS1, RS2, RD), Imm with Sign Extend, Zero? test, ALU, data memory, MUXes, WB Data.]

Alternatives to Overcome Branch Hazards


#1: Stall until branch direction is clear

#2: Predict Branch Not Taken

#3: Predict Branch Taken


Extending Pipeline To Handle Multicycle operations


Floating-point numbers have two parts: the exponent and the significand.
We should deal with the exponent and the significand separately.
Example: 3.25 x 10^3 + 2.63 x 10^-1
  3.25 x 10^3 + 0.000263 x 10^3    (shift the smaller operand right until the exponents match)
  = 3.250263 x 10^3
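The alignment step in the example can be written out as a short routine. This is a sketch in decimal for readability, whereas real hardware works on binary significands with rounding; the function name `align_and_add` is hypothetical, and exact rational arithmetic is used so that no rounding error hides the idea.

```python
from fractions import Fraction


def align_and_add(sig_a, exp_a, sig_b, exp_b):
    """Add two values of the form significand x 10^exponent.

    The operand with the smaller exponent is shifted right until both
    exponents match, then the significands are added.
    """
    if exp_a < exp_b:
        # Make operand a the one with the larger exponent.
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    shift = exp_a - exp_b
    aligned_b = sig_b / Fraction(10) ** shift
    return sig_a + aligned_b, exp_a


# 3.25 x 10^3 + 2.63 x 10^-1
significand, exponent = align_and_add(Fraction("3.25"), 3, Fraction("2.63"), -1)
print(float(significand), exponent)
```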

Cont
So some algorithm needs to be implemented in order to perform the operation. The functional unit should be redesigned to perform all operations, and this type of functional unit requires a longer pipeline.
Latency of the functional unit: the number of intervening cycles between an instruction that produces a result and an instruction that uses the result.
Initiation interval: the number of cycles that must elapse between issuing two operations of a given type.
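The two definitions translate directly into issue-cycle arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the latency and initiation-interval values in the example are assumed numbers chosen for the demonstration, not figures taken from these slides.

```python
def earliest_dependent_issue(producer_issue_cycle, latency):
    """First cycle in which a consumer of the result can issue without stalling.

    `latency` counts the intervening cycles between producer and consumer,
    so a latency of 0 lets the consumer issue in the very next cycle.
    """
    return producer_issue_cycle + latency + 1


def earliest_same_type_issue(previous_issue_cycle, initiation_interval):
    """First cycle in which another operation of the same type can issue."""
    return previous_issue_cycle + initiation_interval


# Assumed example values: an integer ALU op with latency 0, and an
# unpipelined divider with latency 24 and initiation interval 25.
print(earliest_dependent_issue(1, 0))    # consumer of an ALU result
print(earliest_dependent_issue(1, 24))   # consumer of a divide result
print(earliest_same_type_issue(1, 25))   # next divide after a divide
```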

Latencies and initiation intervals for FU

Pipeline support for FP operations

FP example

Cont.
Assuming that the pipeline does all hazard detection in ID, there are three checks that must be performed before an instruction can issue:
  Check for structural hazards: wait until the required functional unit is available.
  Check for a RAW data hazard: wait until the source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result.
  Check for a WAW data hazard: determine if any instruction in A1, ..., A4, D, M1, ..., M7 has the same register destination as this instruction. If so, stall the issue of the instruction in ID.
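The three checks can be sketched as a single function that is run in ID for each instruction. This is a simplified illustration under stated assumptions: the dictionary layout, the unit names, and the function name `check_issue` are all hypothetical, and `pending_dests` stands in for the destination registers of in-flight instructions whose results are not yet available.

```python
def check_issue(instr, busy_units, pending_dests):
    """Return "issue" or the name of the hazard that forces a stall in ID.

    instr:         hypothetical dict with keys "unit", "dest", "srcs"
    busy_units:    set of functional units that cannot accept a new operation
    pending_dests: set of registers still to be written by earlier instructions
    """
    if instr["unit"] in busy_units:
        return "structural"
    if any(src in pending_dests for src in instr["srcs"]):
        return "RAW"
    if instr["dest"] in pending_dests:
        return "WAW"
    return "issue"


# A hypothetical FP add that reads F0 while an earlier divide still owes F0.
fp_add = {"unit": "fp_add", "dest": "F10", "srcs": ("F0", "F8")}
print(check_issue(fp_add, busy_units={"fp_div"}, pending_dests={"F0"}))
```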

Instruction Level Parallelism


Pipelining can overlap the execution of instructions when they are independent of one another. This potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel.
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.
Consider the following program:
  1. e = a + b
  2. f = c + d
  3. g = e * f

Instruction Level Parallelism


Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. Operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously.
A goal of compiler and processor designers is to identify and take advantage of as much ILP as possible.
Ordinary programs are typically written under a sequential execution model where instructions execute one after the other and in the order specified by the programmer.
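The amount of ILP in the three-operation program can be estimated by placing each operation in the earliest time step allowed by its dependences. The sketch below assumes every operation takes one unit of time and that there are no resource limits; both are simplifying assumptions, and the name `schedule_levels` is hypothetical.

```python
def schedule_levels(deps):
    """Map each operation to its earliest time step (1-based).

    `deps` maps an operation to the list of operations it depends on.
    Assumes unit latency for every operation and unlimited hardware.
    """
    level = {}

    def visit(op):
        if op not in level:
            level[op] = 1 + max((visit(d) for d in deps[op]), default=0)
        return level[op]

    for op in deps:
        visit(op)
    return level


# 1. e = a + b    2. f = c + d    3. g = e * f
deps = {1: [], 2: [], 3: [1, 2]}
levels = schedule_levels(deps)
steps = max(levels.values())
ilp = len(deps) / steps
print(levels, steps, ilp)
```

Under these assumptions, operations 1 and 2 share the first time step and operation 3 takes the second, so three operations finish in two steps.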

Instruction Level Parallelism


The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism.
Example 1
  for (i=1; i<=1000; i= i+1)
      x[i] = x[i] + y[i];
This is a parallel loop. Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little opportunity for overlap.

Instruction Level Parallelism


for (i=1; i<=100; i= i+1){
    a[i] = a[i] + b[i];     //s1
    b[i+1] = c[i] + d[i];   //s2
}

Is this loop parallel? If not, how can it be made parallel?
Statement s1 uses the value assigned in the previous iteration by statement s2, so there is a loop-carried dependency between s1 and s2.
Despite this dependency, this loop can be made parallel because the dependency is not circular:
  - neither statement depends on itself;
  - while s1 depends on s2, s2 does not depend on s1.
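One common way to remove the loop-carried dependency is to peel the first s1 and the last s2 out of the loop and swap the two statements inside it, so that each iteration only uses the value of b it has just computed. The sketch below checks on sample data that this rewrite gives the same result as the original loop. The function names and the test data are assumptions made for illustration, and the arrays are 1-based to match the slide (index 0 is unused).

```python
def original_loop(a, b, c, d, n):
    """The loop from the slide, with s1 reading b[i] written by the previous iteration."""
    a, b = a[:], b[:]
    for i in range(1, n + 1):
        a[i] = a[i] + b[i]        # s1
        b[i + 1] = c[i] + d[i]    # s2
    return a, b


def transformed_loop(a, b, c, d, n):
    """Rewritten loop: no iteration depends on a value from another iteration."""
    a, b = a[:], b[:]
    a[1] = a[1] + b[1]                  # first s1, peeled off
    for i in range(1, n):
        b[i + 1] = c[i] + d[i]          # s2 of iteration i
        a[i + 1] = a[i + 1] + b[i + 1]  # s1 of iteration i + 1
    b[n + 1] = c[n] + d[n]              # last s2, peeled off
    return a, b


N = 100
a0 = [k for k in range(N + 2)]
b0 = [2 * k for k in range(N + 2)]
c0 = [3 * k for k in range(N + 2)]
d0 = [5 * k for k in range(N + 2)]

print(original_loop(a0, b0, c0, d0, N) == transformed_loop(a0, b0, c0, d0, N))
```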

Ideas To Reduce Stalls


Technique                                   Reduces
Dynamic scheduling                          Data hazard stalls
Dynamic branch prediction                   Control stalls
Issuing multiple instructions per cycle     Ideal CPI
Speculation                                 Data and control stalls
Dynamic memory disambiguation               Data hazard stalls involving memory
Loop unrolling                              Control hazard stalls
Basic compiler pipeline scheduling          Data hazard stalls
Compiler dependence analysis                Ideal CPI and data hazard stalls
Software pipelining and trace scheduling    Ideal CPI and data hazard stalls
Compiler speculation                        Ideal CPI, data and control stalls

Instruction Level Parallelism

Data Dependence and Hazards

InstrJ is data dependent on InstrI: InstrJ tries to read operand before InstrI writes it
  I: add r1,r2,r3
  J: sub r4,r1,r3
or InstrJ is data dependent on InstrK, which is dependent on InstrI
Caused by a "True Dependence" (compiler term)
If true dependence caused a hazard in the pipeline, called a Read After Write (RAW) hazard

Instruction Level Parallelism

Data Dependence and Hazards

Dependences are a property of programs
Presence of dependence indicates potential for a hazard, but actual hazard and length of any stall is a property of the pipeline
Importance of the data dependencies:
  1) indicates the possibility of a hazard
  2) determines order in which results must be calculated
Today looking at HW schemes to avoid hazard

Instruction Level Parallelism

Name Dependence #1: Anti-dependence

Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; 2 versions of name dependence
InstrJ writes operand before InstrI reads it
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers. This results from reuse of the name "r1"
If anti-dependence caused a hazard in the pipeline, called a Write After Read (WAR) hazard

Instruction Level Parallelism

Name Dependence #2: Output dependence

InstrJ writes operand before InstrI writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "output dependence" by compiler writers. This also results from the reuse of name "r1"
If output dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard

Instruction Level Parallelism

Control Dependencies

Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order
  if p1 {
      S1;
  };
  if p2 {
      S2;
  }
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.

Out-Of-Order Execution
  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F12,F8,F14
ADDD must wait for F0 from DIVD, but SUBD uses neither result and can proceed past the stalled ADDD.
Enables out-of-order execution => out-of-order completion
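The effect on completion order can be seen with a very small timing model. This is a sketch under simplifying assumptions, not a model of a real scheduler: one instruction issues per cycle in program order, an instruction starts executing as soon as its source registers are ready, structural hazards and name dependences are ignored, and the latencies (10 cycles for divide, 2 for add and subtract) are made-up values. The function name `completion_times` and the tuple layout are hypothetical.

```python
def completion_times(instrs, latency):
    """Return the cycle in which each instruction finishes.

    instrs:  list of hypothetical (name, op, dest, srcs) tuples in program order
    latency: dict mapping op to its assumed execution time in cycles
    """
    ready = {}  # register -> cycle in which its new value becomes available
    done = {}
    for issue_cycle, (name, op, dest, srcs) in enumerate(instrs, start=1):
        # Wait for issue and for every source register still being computed.
        start = max([issue_cycle] + [ready.get(s, 0) for s in srcs])
        finish = start + latency[op]
        ready[dest] = finish
        done[name] = finish
    return done


instrs = [
    ("DIVD", "div", "F0", ("F2", "F4")),
    ("ADDD", "add", "F10", ("F0", "F8")),
    ("SUBD", "add", "F12", ("F8", "F14")),
]
assumed_latency = {"div": 10, "add": 2}

done = completion_times(instrs, assumed_latency)
completion_order = sorted(done, key=done.get)
print(done, completion_order)
```

With these assumed latencies SUBD finishes first even though it is last in program order, which is the out-of-order completion named on the slide.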

THANK YOU
