You are on page 1of 43

Chapter 4A: The Processor, Part A

Mary Jane Irwin ( www.cse.psu.edu/~mji )

[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, Hennessy 2008, 2008 MK]

CSE431 Chapter 4A.1

Irwin, PSU, 2008

Review: MIPS (RISC) Design Principles

S Simplicity f favors regularity


z z z

fixed size instructions small number of instruction formats opcode always the first 6 bits

Smaller is faster
z z z

limited instruction set limited number of registers in register file li it d number limited b of f addressing dd i modes d

Make the common case fast


z z

arithmetic operands from the register file (load-store (load store machine) allow instructions to contain immediate operands

Good design demands good compromises


z

three instruction formats


Irwin, PSU, 2008

CSE431 Chapter 4A.2

The Processor: Datapath & Control

Our implementation of the MIPS is simplified


z z z

memory-reference instructions: lw, sw arithmetic-logical instructions: add, sub, and, or, slt control flow instructions: beq, j

Generic implementation
z

use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC) decode the instruction (and read registers) execute the instruction

Fetch PC = PC+4 Exec Decode

z z

All instructions (except j) use the ALU after reading the registers How? memory-reference? arithmetic? control flow?

CSE431 Chapter 4A.3

Irwin, PSU, 2008

Aside: Clocking Methodologies

The clocking methodology defines when data in a state element is valid and stable relative to the clock
z z

State elements - a memory element such as a register Edge-triggered all state changes occur on a clock edge read contents of state elements -> send values through g combinational logic -> write results to one or more state elements
State element 1 Combinational logic State element 2

Typical execution
z

clock

one clock cycle

Assumes state elements are written on every clock cycle; if not, need explicit write control signal
z

write occurs only when both the write control is asserted and the clock edge occurs
Irwin, PSU, 2008

CSE431 Chapter 4A.4

Fetching Instructions

Fetching instructions involves


z z

reading the instruction from the Instruction Memory updating the PC value to be the address of the next (sequential) instruction

clock
4

Add dd

Fetch PC = PC+4 Exec Decode

Instruction Memory PC Read Address Instruction

z z

PC is updated every clock cycle, so it does not need an explicit write control signal just a clock signal Reading from the Instruction Memory is a combinational activity, so it doesnt need an explicit read control signal
Irwin, PSU, 2008

CSE431 Chapter 4A.5

Decoding Instructions

Decoding instructions involves


z

sending the fetched instructions opcode and function field bits to the control unit

Fetch PC = PC+4 Exec Decode

Control Unit

and

Instruction

Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2

reading two values from the Register File


- Register File addresses are contained in the instruction

CSE431 Chapter 4A.6

Irwin, PSU, 2008

Executing R Format Operations

R format operations (add, add sub sub, slt slt, and and, or)
31 R-type: op p
z z

25 rs

20 rt

15 rd

10

shamt funct

perform operation (op and funct) on values in rs and rt store the result back into the Register File (into location rd)
RegWrite ALU control

Fetch PC = PC+4 Exec Decode

Instruction

Read R d Add Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2

ALU

overflow zero

Note N t that th t Register R i t File Fil i is not t written itt every cycle l ( (e.g. sw), ) so we need an explicit write control signal for the Register File
Irwin, PSU, 2008

CSE431 Chapter 4A.7

Executing Load and Store Operations

Load and store operations involves


z

z z

compute memory address by adding the base register (read from the Register File during decode) to the 16-bit signed-extended offset field in the instruction store value (read from the Register File during decode) written to the Data Memory load value, value read from the Data Memory Memory, written to the Register RegWrite ALU control MemWrite File
Read R d Add Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2

overflow zero
Address ALU Data Memory Read Data W it Data Write D t

Instruction

16

Sign Extend

MemRead
32

CSE431 Chapter 4A.8

Irwin, PSU, 2008

Executing Branch Operations

Branch operations involves


z z

compare the operands read from the Register File during decode for equality (zero ALU output) compute the branch target address by adding the updated PC to the 16-bit signed-extended offset field in the instr
Add 4 Shift left 2 Add

Branch target address

ALU control
PC Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read Write Data Data 2

zero (to branch control logic)


ALU

I t ti Instruction

CSE431 Chapter 4A.9

16

Sign Extend

32

Irwin, PSU, 2008

Executing Jump Operations

Jump operation involves


z

replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits

Add 4 4 Instruction Memory PC Read Address Instruction 26 Shift left 2

Jump address
28

CSE431 Chapter 4A.10

Irwin, PSU, 2008

Creating a Single Datapath from the Parts

Assemble the datapath segments and add control lines and multiplexors as needed Single cycle design fetch, fetch decode and execute each instructions in one clock cycle
z

no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders) multiplexors needed at the input of shared elements with control lines to do the selection write signals to control writing to the Register File and Data Memory y

Cycle time is determined by length of the longest path

CSE431 Chapter 4A.11

Irwin, PSU, 2008

Fetch, R, and Memory Access Portions

Add 4

RegWrite

ALUSrc ALU control ovf zero

MemWrite

MemtoReg

Instruction st uct o Memory PC Read Address Instruction

Read Addr 1 Register Read Read Addr 2 Data 1 File Write Addr Read W it Data Write D t Data 2

Address ALU Data Memory Read Data Write Data

Sign 16 Extend

MemRead
32

CSE431 Chapter 4A.12

Irwin, PSU, 2008

Adding the Control

Selecting the operations to perform (ALU (ALU, Register File and Memory read/write) Controlling the flow of data (multiplexor inputs)
31 R-type: op 25 rs 25 rs 25 20 rt 20 rt 15 rd 15 address offset 0 10 5 0 shamt funct 0

Ob Observations ti
z

31 I-Type: op 31

op field always in bits 31-26

addr of registers J-type: op target address to be read are always specified by the rs field fi ld (bits (bit 25 25-21) 21) and d rt t field fi ld (bits (bit 20 20-16); 16) f for lw l and d sw rs is i th the b base register addr. of register to be written is in one of two places in rt (bits 20-16) for lw; in rd (bits 15 15-11) 11) for R R-type type instructions offset for beq, lw, and sw always in bits 15-0
Irwin, PSU, 2008

CSE431 Chapter 4A.13

Single Cycle Datapath with Control Unit


0
Add 4 ALUOp Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instr[25-21] Read Addr 1 Read R i t Register Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read Branch Shift left 2 Add

1
PCSrc MemRead MemtoReg MemWrite

ovf
Address Data Memory Read Data Write Data

Instruction Memory PC Read Address Instr[31-0]

zero
ALU

1 0

0 1
ALU control

Instr[15 -11] Instr[15-0]

Write Data

Data 2

Sign 16 Extend

32

Instr[5-0]

CSE431 Chapter 4A.14

Irwin, PSU, 2008

R-type Instruction Data/Control Flow


0
Add 4 ALUOp Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instr[25-21] Read Addr 1 Read R i t Register Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read Branch Shift left 2 Add

1
PCSrc MemRead MemtoReg MemWrite

ovf
Address Data Memory Read Data Write Data

Instruction Memory PC Read Address Instr[31-0]

zero
ALU

1 0

0 1
ALU control

Instr[15 -11] Instr[15-0]

Write Data

Data 2

Sign 16 Extend

32

Instr[5-0]

CSE431 Chapter 4A.15

Irwin, PSU, 2008

Load Word Instruction Data/Control Flow


0
Add 4 ALUOp Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instr[25-21] Read Addr 1 Read R i t Register Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read Branch Shift left 2 Add

1
PCSrc MemRead MemtoReg MemWrite

ovf
Address Data Memory Read Data Write Data

Instruction Memory PC Read Address Instr[31-0]

zero
ALU

1 0

0 1
ALU control

Instr[15 -11] Instr[15-0]

Write Data

Data 2

Sign 16 Extend

32

Instr[5-0]

CSE431 Chapter 4A.16

Irwin, PSU, 2008

Load Word Instruction Data/Control Flow


0
Add 4 ALUOp Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instr[25-21] Read Addr 1 Read R i t Register Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read Branch Shift left 2 Add

1
PCSrc MemRead MemtoReg MemWrite

ovf
Address Data Memory Read Data Write Data

Instruction Memory PC Read Address Instr[31-0]

zero
ALU

1 0

0 1
ALU control

Instr[15 -11] Instr[15-0]

Write Data

Data 2

Sign 16 Extend

32

Instr[5-0]

CSE431 Chapter 4A.17

Irwin, PSU, 2008

Branch Instruction Data/Control Flow


0
Add 4 ALUOp Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instr[25-21] Read Addr 1 Read R i t Register Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read Branch Shift left 2 Add

1
PCSrc MemRead MemtoReg MemWrite

ovf
Address Data Memory Read Data Write Data

Instruction Memory PC Read Address Instr[31-0]

zero
ALU

1 0

0 1
ALU control

Instr[15 -11] Instr[15-0]

Write Data

Data 2

Sign 16 Extend

32

Instr[5-0]

CSE431 Chapter 4A.18

Irwin, PSU, 2008

Branch Instruction Data/Control Flow


0
Add 4 ALUOp Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instr[25-21] Read Addr 1 Read R i t Register Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read Branch Shift left 2 Add

1
PCSrc MemRead MemtoReg MemWrite

ovf
Address Data Memory Read Data Write Data

Instruction Memory PC Read Address Instr[31-0]

zero
ALU

1 0

0 1
ALU control

Instr[15 -11] Instr[15-0]

Write Data

Data 2

Sign 16 Extend

32

Instr[5-0]

CSE431 Chapter 4A.19

Irwin, PSU, 2008

Adding the Jump Operation


Instr[25-0] [ ] 26 Add 4 ALUOp Instr[31-26] Control Unit ALUSrc RegWrite RegDst Instr[25-21] Read Addr 1 Register Read Instr[20-16] Read Addr 2 Data 1 File 0 Write Addr Read Jump Branch Shift left 2 Add Shift left 2

1
28 32 PC+4[31-28]

0 0 1
PCSrc MemRead MemtoReg MemWrite

ovf
Address Data Memory Read Data Write Data

Instruction I t ti Memory PC Read Address Instr[31-0]

zero
ALU

1 0

0 1
ALU control

Instr[15 I t [15 -11] Instr[15-0]

Write Data

Data 2

Sign 16 Extend

32

Instr[5-0]
CSE431 Chapter 4A.20 Irwin, PSU, 2008

Instruction Times (Critical Paths)


What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:

Instruction and Data Memory (200 ps) and adders (200 ps) Register File access (reads or writes) (100 ps)

z ALU z

Instr. RR type load store beq jump

I Mem

Reg Rd ALU Op D Mem Reg Wr

Total

CSE431 Chapter 4A.21

Irwin, PSU, 2008

Instruction Critical Paths


What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:

Instruction and Data Memory (200 ps) and adders (200 ps) Register File access (reads or writes) (100 ps)

z ALU z

Instr. RR type load store beq jump

I Mem 200 200 200 200 200

Reg Rd 100 100 100 100

ALU Op D Mem Reg Wr 200 200 200 200 200 200 100 100

Total 600 800 700 500 200


Irwin, PSU, 2008

CSE431 Chapter 4A.22

Single Cycle Disadvantages & Advantages

Uses th U the clock l k cycle l i inefficiently ffi i tl the th clock l k cycle l must t be timed to accommodate the slowest instruction
z

especially problematic for more complex instructions like floating point multiply
Cycle 1 Cycle 2

Clk lw sw Waste

May be wasteful of area since some functional units (e g adders) must be duplicated since they can not be (e.g., shared during a clock cycle but Is simple and easy to understand

CSE431 Chapter 4A.23 Irwin, PSU, 2008

How Can We Make It Faster?

Start fetching S f and executing the next instruction before f the current one has completed
z z

Pipelining p g( (all?) ) modern processors p are p pipelined p for performance Remember the performance equation: CPU time = CPI * CC * IC

Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
z

A five stage pipeline is nearly five times faster because the CC is nearly five times faster

Fetch (and execute) more than one instruction at a time


z

S Superscalar l processing i stay t tuned t d


Irwin, PSU, 2008

CSE431 Chapter 4A.24

The Five Stages of Load Instruction


Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

l lw

IF t h Dec IFetch D

E Exec

M Mem

WB

IFetch: Instruction Fetch and Update PC Dec: Registers Fetch and Instruction Decode Exec: Execute R-type; calculate memory address Mem: Read/write the data from/to the Data Memory y WB: Write the result data into the register file

CSE431 Chapter 4A.25

Irwin, PSU, 2008

A Pipelined MIPS Processor

Start the next instruction before the current one has completed
z z

improves throughput - total amount of work done in a given time instruction latency (execution time, delay time, response time time from the start of an instruction to its completion) is not reduced
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8

lw sw R-type

IFetch

Dec IFetch

Exec Dec IFetch

Mem Exec Dec

WB Mem Exec WB Mem WB

- clock cycle (pipeline stage time) is limited by the slowest stage dont t need the whole clock cycle (e.g., WB) - for some stages don - for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)
CSE431 Chapter 4A.26 Irwin, PSU, 2008

Single Cycle versus Pipeline


Single Cycle Implementation (CC = 800 ps):
Cycle 1 Clk lw sw Waste Cycle 2

Pipeline Implementation (CC = 200 ps):


lw IFetch sw Dec IFetch Exec Dec Mem Exec Dec WB Mem Exec WB Mem WB

400 ps

R-type IFetch

To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle y case). ) Why y? How long does each take to complete 1,000,000 adds ?
Irwin, PSU, 2008

CSE431 Chapter 4A.27

Pipelining the MIPS ISA

What makes it easy


z

all instructions are the same length (32 bits)


- can fetch in the 1st stage and decode in the 2nd stage

few instruction formats (three) with symmetry across formats


- can begin reading register file in 2nd stage

memory operations occur only in loads and stores


- can use the execute stage to calculate memory addresses

each instruction writes at most one result ( (i.e., , changes g the machine state) and does it in the last few pipeline stages (MEM or WB) ope a ds must operands us be a aligned g ed in memory e o y so a s single g e da data a transfer a se takes only one data memory access

CSE431 Chapter 4A.28

Irwin, PSU, 2008

MIPS Pipeline Datapath Additions/Mods

State registers between each pipeline stage to isolate them


IF:IFetch ID:Dec EX:Execute MEM: MemAccess WB: WriteBack

IF/ID Add 4

ID/EX

EX/MEM

Shift left 2

Add

MEM/WB

Instruction Memory
Read Address PC

Read Addr 1 R d Read R i t Register Data 1 Read Addr 2

Data Memory
ALU Address Write Data Read Data

File
Write Addr W it Data Write D t Read Data 2

16

Sign Extend

32

System Clock
CSE431 Chapter 4A.29 Irwin, PSU, 2008

MIPS Pipeline Control Path Modifications

All control signals can be determined during Decode


z

and held in the state registers between pipeline stages


PCSrc ID/EX EX/MEM Control IF/ID Add 4 RegWrite Shift left 2 Add Branch MEM/WB

Instruction Memory
Read Address PC

Read Addr 1

Register Read
Data 1 Read Addr 2

Data M Memory
ALUSrc ALU Address Write Data ALU cntrl Read Data

MemtoReg

File
Write Addr Write Data Read Data 2

16

Sign Extend

MemRead

32

ALUOp

RegDst
CSE431 Chapter 4A.30 Irwin, PSU, 2008

Pipeline Control

IF Stage: S read Instr Memory (always ( asserted) ) and write PC (on System Clock) ID Stage: no optional control signals to set

EX Stage Reg Dst 1 0 X X

MEM Stage

WB Stage

R lw sw beq

ALU ALU ALU Brch Mem Mem Reg Mem Op1 Op0 Src Read Write Write toReg 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 1 0 1 0 0 1 X X

CSE431 Chapter 4A.31

Irwin, PSU, 2008

Graphically Representing MIPS Pipeline

ALU

IM

Reg

DM

Reg

Can help with answering questions like:


z z z

How many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed?

CSE431 Chapter 4A.32

Irwin, PSU, 2008

Why Pipeline? For Performance!


Ti Time (clock ( l k cycles) l )

I n s t r. O r d e r

Inst 0 Inst 1 Inst 2 Inst 3 Inst 4

IM

R Reg

DM

R Reg

IM

Reg g

DM

Reg g

Once the pipeline i li i is f full, ll one instruction is completed every cycle, cycle so CPI = 1
Reg

IM

Time to fill the pipeline

ALU A

Reg

IM

ALU

Reg

IM

ALU

DM

Reg

ALU

DM

Reg

ALU

DM

Reg

CSE431 Chapter 4A.33

Irwin, PSU, 2008

Can Pipelining Get Us Into Trouble?

Yes: Pipeline Hazards


z

structural hazards: attempt to use the same resource by two different instructions at the same time data hazards: attempt to use data before it is ready
- An instructions source operand(s) are produced by a prior instruction still in the pipeline

control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
- branch and jump instructions, exceptions

Can
z z

usually resolve hazards by waiting

pp pipeline control must detect the hazard and take action to resolve hazards
Irwin, PSU, 2008

CSE431 Chapter 4A.34

A Single Memory Would Be a Structural Hazard


Ti Time (clock ( l k cycles) l )

I n s t r. O r d e r

lw Inst 1 Inst 2 Inst 3 Inst 4

M Mem

R Reg

M Mem

R Reg

Reading data from memory


Reg g

ALU A Reg g Mem

ALU

Mem

Mem

ALU Reg Mem

Reg

Mem

Reg

ALU

Mem

Mem

Reg

ALU

Reading instruction from memory y

Reg

Mem

Reg

Fix with separate instr and data memories (I$ and D$)
Irwin, PSU, 2008

CSE431 Chapter 4A.35

How About Register File Access?


Time ( (clock cycles) y )

I n s t r. O r d e r

add $1, , Inst 1 Inst 2

IM

Reg g

DM

Reg g

IM

Reg

DM

Reg

Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
Reg

IM

ALU U

Reg

ALU

ALU

DM

ALU

add $2,$1,

IM

Reg

DM

Reg

clock edge that controls register writing


CSE431 Chapter 4A.36

clock edge that controls loading of pipeline state registers Irwin, PSU, 2008

Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards


ALU A

I n s t r. O r d e r

add $1, $1 sub $ $4,$1,$5 ,$ ,$ and $6,$1,$7 or $8,$1,$9

IM

R Reg

DM

R Reg

ALU

IM

Reg g

DM

Reg g

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

xor $4,$1,$5

IM

Reg

DM

Reg

Read before write data hazard


Irwin, PSU, 2008

CSE431 Chapter 4A.37

Register Usage Can Cause Data Hazards

Dependencies backward in time cause hazards


AL LU

add $1, $1 sub $ $4,$1,$5 ,$ ,$ and $6,$1,$7 or $8,$1,$9

IM

Reg

DM

Reg

ALU U

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

xor $4,$1,$5

IM

Reg

DM

Reg

Read before write data hazard


Irwin, PSU, 2008

CSE431 Chapter 4A.38

Loads Can Cause Data Hazards

Dependencies backward in time cause hazards


AL LU

I n s t r. O r d e r

lw

$1 4($2) $1,4($2)

IM

Reg

DM

Reg

ALU U

sub $ $4,$1,$5 ,$ ,$ and $6,$1,$7 or $8,$1,$9

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

ALU

xor $4,$1,$5

IM

Reg

DM

Reg

Load-use data hazard


Irwin, PSU, 2008

CSE431 Chapter 4A.39

Branch Instructions Cause Control Hazards

Dependencies backward in time cause hazards


ALU A

I n s t r. O r d e r

beq lw Inst 3 Inst 4

IM

R Reg

DM

R Reg

ALU

IM

Reg g

DM

Reg g

ALU

IM

Reg

DM

Reg

ALU

IM

Reg

DM

Reg

CSE431 Chapter 4A.40

Irwin, PSU, 2008

Other Pipeline Structures Are Possible

What about the (slow) multiply operation?


z z

Make the clock twice as slow or let it take two cycles (since it doesnt use the DM stage)
MUL AL LU IM Reg DM Reg

What if the data memory y access is twice as slow as the instruction memory?
z z

make the clock twice as slow or let data memory access take two cycles (and keep the same clock rate)
ALU IM Reg DM1 DM2 Reg

CSE431 Chapter 4A.41

Irwin, PSU, 2008

Other Sample Pipeline Alternatives

ARM7

IM PC update IM access

Reg decode reg access

EX ALU op DM access shift/rotate commit result (write back)

XScale
IM1 PC update BTB access start IM access IM2 Reg SHFT decode reg 1 access IM access

ALU op p

shift/rotate reg 2 access

ALU

DM1

Reg DM2

DM write reg write start DM access exception

CSE431 Chapter 4A.42

Irwin, PSU, 2008

Summary

All modern day processors use pipelining Pipelining doesnt help latency of single task, it helps throughput of entire workload Potential speedup: a CPI of 1 and fast a CC Pipeline rate limited by slowest pipeline stage
z z

Unbalanced pipe stages makes for inefficiencies The time to fill fill pipeline and time to drain drain it can impact speedup for deep pipelines and short code runs

Must detect and resolve hazards


z

Stalling negatively affects CPI (makes CPI less than the ideal of 1)

CSE431 Chapter 4A.43

Irwin, PSU, 2008

You might also like