Lec4a Supp

Chapter 4A: The Processor, Part A
Mary Jane Irwin ( www.cse.psu.edu/~mji )
[Adapted from Computer Organization and Design, 4th Edition, Patterson & Hennessy, © 2008, MK]
CSE431 Chapter 4A.1
Irwin, PSU, 2008
Review: MIPS (RISC) Design Principles
Simplicity favors regularity
- fixed size instructions
- small number of instruction formats
- opcode always the first 6 bits
Smaller is faster
- limited instruction set
- limited number of registers in register file
- limited number of addressing modes
Make the common case fast
- arithmetic operands from the register file (load-store machine)
- allow instructions to contain immediate operands
Good design demands good compromises
- three instruction formats

The Processor: Datapath & Control
Our implementation of the MIPS is simplified
- memory-reference instructions: lw, sw
- arithmetic-logical instructions: add, sub, and, or, slt
- control flow instructions: beq, j
Generic implementation
- use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC)
- decode the instruction (and read registers)
- execute the instruction
[Figure: instruction cycle: Fetch (PC = PC+4) -> Decode -> Exec]
All instructions (except j) use the ALU after reading the registers
- How? memory-reference? arithmetic? control flow?

Aside: Clocking Methodologies
The clocking methodology defines when data in a state element is valid and stable relative to the clock
- State elements: a memory element such as a register
- Edge-triggered: all state changes occur on a clock edge
Typical execution
- read contents of state elements -> send values through combinational logic -> write results to one or more state elements
[Figure: State element 1 -> Combinational logic -> State element 2, within one clock cycle]
Assumes state elements are written on every clock cycle; if not, need explicit write control signal
- write occurs only when both the write control is asserted and the clock edge occurs

Fetching Instructions
Fetching instructions involves
- reading the instruction from the Instruction Memory
- updating the PC value to be the address of the next (sequential) instruction
[Figure: fetch datapath: the PC drives the Instruction Memory Read Address; an adder computes PC + 4; the PC is clocked]
PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal
Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal

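The fetch step can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `fetch` and the choice to model the Instruction Memory as a list of 32-bit words are assumptions.

```python
def fetch(instr_mem, pc):
    """Return (instruction word, next sequential PC).

    instr_mem is a hypothetical model of the Instruction Memory:
    a list holding one 32-bit word per entry, so byte address pc
    maps to index pc // 4.
    """
    instr = instr_mem[pc // 4]          # combinational read, no read control
    next_pc = (pc + 4) & 0xFFFFFFFF     # PC is updated every cycle
    return instr, next_pc
```

Calling `fetch(mem, 0)` returns the first word together with the address 4 of the next sequential instruction.
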
Decoding Instructions
Decoding instructions involves
- sending the fetched instruction's opcode and function field bits to the control unit
and
- reading two values from the Register File
  - Register File addresses are contained in the instruction
[Figure: the Instruction feeds the Control Unit and the Register File (Read Addr 1, Read Addr 2, Write Addr, Write Data; outputs Read Data 1, Read Data 2)]

Executing R Format Operations
R format operations (add, sub, slt, and, or)
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
- perform operation (op and funct) on values in rs and rt
- store the result back into the Register File (into location rd)
[Figure: the Register File read ports feed the ALU (outputs: result, overflow, zero); the ALU result returns to the Register File Write Data port; control signals RegWrite and ALU control]
Note that Register File is not written every cycle (e.g. sw), so we need an explicit write control signal for the Register File

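A minimal sketch of the ALU behaviour for these five operations, assuming 32-bit operands passed as non-negative Python integers. The names `alu`, `to_signed` and `MASK32` are hypothetical, and the overflow output shown in the figure is left out to keep the example short.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(op, a, b):
    """Return (result, zero) for the R format operations on the slide."""
    if op == "add":
        result = (a + b) & MASK32
    elif op == "sub":
        result = (a - b) & MASK32
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    elif op == "slt":
        # set on less than uses a signed comparison
        result = 1 if to_signed(a) < to_signed(b) else 0
    else:
        raise ValueError(f"unsupported op: {op}")
    return result, result == 0
```

The second element of the returned pair models the ALU zero output, which later slides use for the beq comparison.
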
Executing Load and Store Operations
Load and store operations involve
- compute memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction
- store value (read from the Register File during decode) written to the Data Memory
- load value, read from the Data Memory, written to the Register File
[Figure: the Register File and the sign-extended (16 to 32 bits) offset feed the ALU; the ALU result is the Data Memory Address; Read Data 2 goes to the Data Memory Write Data; the Data Memory Read Data returns to the Register File Write Data; control signals RegWrite, ALU control, MemWrite, MemRead]

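The address calculation for lw and sw can be sketched as follows. The helper names `sign_extend16` and `effective_address` are assumptions made for this illustration; the arithmetic follows the description above (base register plus sign-extended 16-bit offset, wrapped to 32 bits).

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def effective_address(base, imm16):
    """Memory address used by lw/sw: base register + sign-extended offset."""
    return (base + sign_extend16(imm16)) & 0xFFFFFFFF
```

For example, an offset field of 0xFFFC represents -4, so with a base of 0x1000 the access goes to 0x0FFC.
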
Executing Branch Operations
Branch operations involve
- compare the operands read from the Register File during decode for equality (zero ALU output)
- compute the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instr (shifted left by 2 bits)
[Figure: one adder computes PC + 4; a second adder adds PC + 4 to the sign-extended (16 to 32 bits) offset after Shift left 2, giving the branch target address; the ALU compares the two Register File outputs and sends zero to the branch control logic]

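A small sketch of the beq next-PC computation, under the assumption that register values are 32-bit patterns held in Python integers. The function names are hypothetical; the target is the updated PC (PC + 4) plus the sign-extended offset shifted left by 2, as in the figure.

```python
def sign_extend16(imm16):
    """Sign-extend a 16-bit field to a Python integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_target(pc, imm16):
    """Branch target = (PC + 4) + (sign-extended offset << 2)."""
    return ((pc + 4) + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF

def next_pc_beq(pc, imm16, rs_val, rt_val):
    """beq is taken when the ALU subtraction produces zero."""
    zero = ((rs_val - rt_val) & 0xFFFFFFFF) == 0
    return branch_target(pc, imm16) if zero else (pc + 4) & 0xFFFFFFFF
```

With pc = 0x100 and an offset field of 3, a taken branch goes to 0x110; a not-taken branch falls through to 0x104.
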
Executing Jump Operations
Jump operation involves
- replace the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
[Figure: fetch datapath (PC, Instruction Memory, adder for PC + 4) with the 26-bit instruction field passed through Shift left 2 to form the 28-bit jump address]

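The jump address formation can be sketched like this. The function name `jump_address` is an assumption; taking the upper four bits from PC + 4 follows the PC+4[31-28] label on the later jump datapath figure.

```python
def jump_address(pc, instr):
    """Form the 32-bit jump address for a j instruction.

    The low 26 bits of the instruction are shifted left by 2 to give
    28 bits; the top 4 bits come from PC + 4.
    """
    target26 = instr & 0x03FFFFFF
    upper4 = (pc + 4) & 0xF0000000
    return upper4 | (target26 << 2)
```

For instance, a j instruction word 0x08000010 fetched at 0x40000000 jumps to 0x40000040.
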
Creating a Single Datapath from the Parts
Assemble the datapath segments and add control lines and multiplexors as needed
Single cycle design: fetch, decode and execute each instruction in one clock cycle
- no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
- multiplexors needed at the input of shared elements with control lines to do the selection
- write signals to control writing to the Register File and Data Memory
Cycle time is determined by length of the longest path

Fetch, R, and Memory Access Portions
[Figure: combined datapath: PC, Instruction Memory and PC + 4 adder; Register File; sign extend (16 to 32 bits); ALU (ovf, zero); Data Memory; control signals RegWrite, ALUSrc, ALU control, MemWrite, MemRead, MemtoReg]

Adding the Control
Selecting the operations to perform (ALU, Register File and Memory read/write)
Controlling the flow of data (multiplexor inputs)
R-type: op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
I-type: op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
J-type: op [31-26] | target address [25-0]
Observations
- op field always in bits 31-26
- addr of registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw rs is the base register
- addr. of register to be written is in one of two places: in rt (bits 20-16) for lw; in rd (bits 15-11) for R-type instructions
- offset for beq, lw, and sw always in bits 15-0

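Because the fields sit at fixed bit positions, every field can be extracted with shifts and masks before the instruction type is known. The sketch below assumes the field positions listed on this slide; the function name `decode_fields` and the dictionary keys are hypothetical, and the sample encodings in the usage note are standard MIPS encodings rather than values taken from the slides.

```python
def decode_fields(instr):
    """Split a 32-bit MIPS instruction word into all candidate fields."""
    return {
        "op":       (instr >> 26) & 0x3F,   # bits 31-26
        "rs":       (instr >> 21) & 0x1F,   # bits 25-21
        "rt":       (instr >> 16) & 0x1F,   # bits 20-16
        "rd":       (instr >> 11) & 0x1F,   # bits 15-11 (R-type)
        "shamt":    (instr >> 6) & 0x1F,    # bits 10-6  (R-type)
        "funct":    instr & 0x3F,           # bits 5-0   (R-type)
        "imm16":    instr & 0xFFFF,         # bits 15-0  (I-type offset)
        "target26": instr & 0x03FFFFFF,     # bits 25-0  (J-type)
    }
```

As a usage example, 0x00430820 (add $1,$2,$3) decodes to op 0, rs 2, rt 3, rd 1, funct 0x20, and 0x8C410004 (lw $1,4($2)) decodes to op 0x23, rs 2, rt 1, imm16 4.
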
Single Cycle Datapath with Control Unit
[Figure: single cycle datapath with control unit. Instr[31-26] feeds the Control Unit, which drives RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; Instr[25-21] and Instr[20-16] address the Register File read ports; Instr[20-16] or Instr[15-11] supplies the Write Addr; Instr[15-0] is sign-extended from 16 to 32 bits; Instr[5-0] feeds ALU control; PCSrc selects between PC + 4 and the branch target]

R-type Instruction Data/Control Flow
[Figure: single cycle datapath with the R-type data and control flow highlighted]

Load Word Instruction Data/Control Flow
[Figure: single cycle datapath with the load word data and control flow highlighted]

Branch Instruction Data/Control Flow
[Figure: single cycle datapath with the branch data and control flow highlighted]

Adding the Jump Operation
[Figure: single cycle datapath extended for jump: Instr[25-0] (26 bits) passes through Shift left 2 to give 28 bits, which are combined with PC+4[31-28] into the 32-bit jump address; an extra multiplexor controlled by Jump selects it as the next PC]

Instruction Times (Critical Paths)
What is the clock cycle time assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times except:
- Instruction and Data Memory (200 ps)
- ALU and adders (200 ps)
- Register File access (reads or writes) (100 ps)

Instr.   I Mem  Reg Rd  ALU Op  D Mem  Reg Wr  Total (ps)
R-type   200    100     200            100     600
load     200    100     200     200    100     800
store    200    100     200     200            700
beq      200    100     200                    500
jump     200                                   200

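The totals in the table are just sums of the unit delays along each instruction's path, and the single cycle clock must cover the largest of them. A short sketch, with hypothetical names `DELAY_PS`, `PATHS`, `path_delays` and `cycle_time`, using only the delays given on the slide:

```python
# Unit delays from the slide, in picoseconds.
DELAY_PS = {"I Mem": 200, "Reg Rd": 100, "ALU Op": 200, "D Mem": 200, "Reg Wr": 100}

# Functional units on each instruction's critical path.
PATHS = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU Op"],
    "jump":   ["I Mem"],
}

def path_delays():
    """Total delay per instruction class, in ps."""
    return {name: sum(DELAY_PS[u] for u in units) for name, units in PATHS.items()}

def cycle_time():
    """Single cycle clock period: the longest path over all instructions."""
    return max(path_delays().values())
```

Under these assumptions the load path is the longest, so the clock cycle time is 800 ps.
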
Single Cycle Disadvantages & Advantages
Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
- especially problematic for more complex instructions like floating point multiply
[Figure: clock waveform over Cycle 1 and Cycle 2; lw fills its cycle, sw leaves part of its cycle wasted]
May be wasteful of area since some functional units (e.g., adders) must be duplicated since they can not be shared during a clock cycle
but
Is simple and easy to understand

How Can We Make It Faster?
Start fetching and executing the next instruction before the current one has completed
- Pipelining: (all?) modern processors are pipelined for performance
- Remember the performance equation: CPU time = CPI * CC * IC
Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
- A five stage pipeline is nearly five times faster because the clock cycle (CC) is nearly five times shorter
Fetch (and execute) more than one instruction at a time
- Superscalar processing: stay tuned

The Five Stages of Load Instruction
[Figure: lw occupies IFetch, Dec, Exec, Mem and WB in Cycle 1 through Cycle 5]
IFetch: Instruction Fetch and Update PC
Dec: Registers Fetch and Instruction Decode
Exec: Execute R-type; calculate memory address
Mem: Read/write the data from/to the Data Memory
WB: Write the result data into the register file

A Pipelined MIPS Processor
Start the next instruction before the current one has completed
- improves throughput: total amount of work done in a given time
- instruction latency (execution time, delay time, response time: time from the start of an instruction to its completion) is not reduced

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
lw       IFetch   Dec      Exec     Mem      WB
sw                IFetch   Dec      Exec     Mem      WB
R-type                     IFetch   Dec      Exec     Mem      WB

- clock cycle (pipeline stage time) is limited by the slowest stage
- for some stages don't need the whole clock cycle (e.g., WB)
- for some instructions, some stages are wasted cycles (i.e., nothing is done during that cycle for that instruction)

Single Cycle versus Pipeline
Single Cycle Implementation (CC = 800 ps):
[Figure: lw in Cycle 1 and sw in Cycle 2; part of the sw cycle is wasted]
Pipeline Implementation (CC = 200 ps):
[Figure: lw, sw and R-type overlapped, each passing through IFetch, Dec, Exec, Mem, WB, one stage per cycle; annotation: 400 ps]
To complete an entire instruction in the pipelined case takes 1000 ps (as compared to 800 ps for the single cycle case). Why?
How long does each take to complete 1,000,000 adds?

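One way to work the 1,000,000 adds question is sketched below. It assumes an ideal five stage pipeline with no stalls or hazards, so the first instruction takes five cycles and each later one completes one cycle after its predecessor; the function names are hypothetical.

```python
def single_cycle_time_ps(n_instr, cc_ps=800):
    """Every instruction takes one full 800 ps cycle."""
    return n_instr * cc_ps

def pipelined_time_ps(n_instr, stages=5, cc_ps=200):
    """Ideal pipeline: `stages` cycles for the first instruction,
    then one more cycle per remaining instruction."""
    return (stages + n_instr - 1) * cc_ps

n = 1_000_000
single = single_cycle_time_ps(n)     # 800,000,000 ps
piped = pipelined_time_ps(n)         # 200,000,800 ps
speedup = single / piped             # just under 4
```

The speedup is close to 4 rather than 5 because the pipelined clock is set by the slowest 200 ps stage, so five stages cost 1000 ps per instruction instead of 800 ps.
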
Pipelining the MIPS ISA
What makes it easy
- all instructions are the same length (32 bits)
  - can fetch in the 1st stage and decode in the 2nd stage
- few instruction formats (three) with symmetry across formats
  - can begin reading register file in 2nd stage
- memory operations occur only in loads and stores
  - can use the execute stage to calculate memory addresses
- each instruction writes at most one result (i.e., changes the machine state) and does it in the last few pipeline stages (MEM or WB)
- operands must be aligned in memory so a single data transfer takes only one data memory access

MIPS Pipeline Datapath Additions/Mods
State registers between each pipeline stage to isolate them
[Figure: pipelined datapath with five stages (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack) separated by the state registers IF/ID, ID/EX, EX/MEM and MEM/WB, all clocked by the System Clock]

MIPS Pipeline Control Path Modifications
All control signals can be determined during Decode
- and held in the state registers between pipeline stages
[Figure: pipelined datapath with a Control unit in the decode stage; the signals RegDst, ALUOp, ALUSrc, Branch, MemRead, MemWrite, PCSrc, RegWrite and MemtoReg are carried through ID/EX, EX/MEM and MEM/WB to the stage that uses them]

Pipeline Control
IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
ID Stage: no optional control signals to set

       EX Stage                         MEM Stage                 WB Stage
       RegDst  ALUOp1  ALUOp0  ALUSrc   Brch  MemRead  MemWrite   RegWrite  MemtoReg
R      1       1       0       0        0     0        0          1         0
lw     0       0       0       1        0     1        0          1         1
sw     X       0       0       1        0     0        1          0         X
beq    X       0       1       0        1     0        0          0         X

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the stage sequence IM, Reg, ALU, DM, Reg]
Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Performance!
[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg and offset by one cycle from the previous one; the initial cycles are labelled "Time to fill the pipeline"]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1

Can Pipelining Get Us Into Trouble?
Yes: Pipeline Hazards
- structural hazards: attempt to use the same resource by two different instructions at the same time
- data hazards: attempt to use data before it is ready
  - An instruction's source operand(s) are produced by a prior instruction still in the pipeline
- control hazards: attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated
  - branch and jump instructions, exceptions
Can usually resolve hazards by waiting
- pipeline control must detect the hazard
- and take action to resolve hazards

A Single Memory Would Be a Structural Hazard
[Figure: lw and Inst 1 through Inst 4 against time (clock cycles) with one shared Mem; in the same cycle lw is reading data from memory while a later instruction is reading its instruction from memory]
Fix with separate instr and data memories (I$ and D$)

How About Register File Access?
[Figure: add $1, followed by Inst 1, Inst 2 and add $2,$1, against time (clock cycles), each drawn as IM, Reg, ALU, DM, Reg; the first add writes the register file in the cycle in which the second add reads it. Marked: the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers]
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half

Register Usage Can Cause Data Hazards
Dependencies backward in time cause hazards
add $1,
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5
[Figure: the five instructions above pipelined as IM, Reg, ALU, DM, Reg, with the dependences on $1 drawn from the add to the later register reads]
Read before write data hazard

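Which of the later instructions actually hit the read before write hazard can be checked mechanically. The sketch below assumes a five stage pipeline with no forwarding and a register file that writes in the first half of a cycle and reads in the second half, so only the two instructions immediately after a producer can read a stale value. The function name `raw_hazards` and the tuple encoding of instructions are hypothetical; the add's source operands are elided on the slide and are not needed here.

```python
def raw_hazards(program, max_distance=2):
    """Return (producer index, consumer index, register) triples.

    program is a list of (mnemonic, dest_reg, source_regs) tuples.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:
            continue
        last = min(i + max_distance, len(program) - 1)
        for j in range(i + 1, last + 1):
            _, dest_j, sources_j = program[j]
            if dest in sources_j:
                hazards.append((i, j, dest))
            if dest_j == dest:
                break  # a later write supersedes this producer
    return hazards

program = [
    ("add", 1, ()),        # sources elided on the slide
    ("sub", 4, (1, 5)),
    ("and", 6, (1, 7)),
    ("or",  8, (1, 9)),
    ("xor", 4, (1, 5)),
]
found = raw_hazards(program)
```

Under these assumptions sub and and read $1 too early, while or and xor read it in or after the cycle in which add writes it.
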
Loads Can Cause Data Hazards
lw  $1,4($2)
sub $4,$1,$5
and $6,$1,$7
or  $8,$1,$9
xor $4,$1,$5
[Figure: the five instructions above pipelined as IM, Reg, ALU, DM, Reg, with the dependences on $1 drawn from the lw to the later register reads]
Load-use data hazard

Branch Instructions Cause Control Hazards
[Figure: beq, lw, Inst 3 and Inst 4 pipelined as IM, Reg, ALU, DM, Reg]

Other Pipeline Structures Are Possible
What about the (slow) multiply operation?
- Make the clock twice as slow or
- let it take two cycles (since it doesn't use the DM stage)
[Figure: pipeline stages IM, Reg, ALU, DM, Reg, with a MUL unit]
What if the data memory access is twice as slow as the instruction memory?
- make the clock twice as slow or
- let data memory access take two cycles (and keep the same clock rate)
[Figure: pipeline stages IM, Reg, ALU, DM1, DM2, Reg]

Other Sample Pipeline Alternatives
ARM7
- IM: PC update, IM access
- Reg: decode, reg access
- EX: ALU op, DM access, shift/rotate, commit result (write back)
XScale
- IM1: PC update, BTB access, start IM access
- IM2: IM access
- Reg: decode, reg 1 access
- SHFT: shift/rotate, reg 2 access
- ALU: ALU op
- DM1: start DM access, exception
- DM2 / Reg: DM write, reg write

Summary
All modern day processors use pipelining
Pipelining doesn't help latency of single task, it helps throughput of entire workload
Potential speedup: a CPI of 1 and a fast CC
Pipeline rate limited by slowest pipeline stage
- Unbalanced pipe stages make for inefficiencies
- The time to "fill" pipeline and time to "drain" it can impact speedup for deep pipelines and short code runs
Must detect and resolve hazards
- Stalling negatively affects CPI (makes CPI greater than the ideal of 1)
