
COMPUTER ORGANIZATION AND DESIGN
The Hardware/Software Interface
5th Edition

Chapter 4
The Processor
Contents
 4.1 Introduction
 4.3 Building a Datapath
 4.4 A Simple Implementation Scheme
 4.5 An Overview of Pipelining
 4.6 Pipelined Datapath and Control

2
4.1 Introduction
 Three CPU performance factors
 Instruction count
 Determined by ISA and compiler
 E.g.,
 Pseudo instruction: bgt $t4, $t5, L
 Actual ones: slt $at, $t5, $t4
 bne $at, $zero, L
 CPI and Cycle time
 Determined by CPU hardware (i.e., implementation)
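
The bgt example above can be sketched in Python to show how one pseudo instruction becomes two actual instructions, which is why the instruction count depends on the ISA and the compiler/assembler. The function name expand_bgt is hypothetical and only illustrates the expansion shown on the slide.

```python
def expand_bgt(rs: str, rt: str, label: str) -> list[str]:
    """Expand the pseudo instruction `bgt rs, rt, label` into real MIPS instructions.

    rs > rt is the same condition as rt < rs, so slt sets $at to 1 exactly
    when the branch should be taken, and bne branches when $at is nonzero.
    """
    return [
        f"slt $at, {rt}, {rs}",
        f"bne $at, $zero, {label}",
    ]


# One pseudo instruction in the source, two instructions executed.
for line in expand_bgt("$t4", "$t5", "L"):
    print(line)
```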

3
Implementation
 This chapter examines two MIPS
implementations (i.e., the datapath and the
control unit)
 A simplified version
 A more realistic pipelined version
 Simple subset, shows most aspects
 Memory reference: lw, sw
 Arithmetic/logical: add, sub, and, or, slt
 Control transfer: beq, j

4
Instruction Execution (1/2)
 Phases of instruction execution
 Instruction fetch
 Send the PC to the memory and fetch the
instruction
 Decoding
 Operands fetch
 Read one or two registers, using fields of the
instruction to select the registers to read

e.g., lw $10, 24($12)


add $10, $11, $12

5
Instruction Execution (2/2)
 Execution, depending on instruction class
 memory-reference instructions: an address calculation
 arithmetic-logical instructions: an operation execution
 branches: comparison
 Write back
 memory-reference instructions: access the memory
 arithmetic-logical instructions: write the data from
the ALU back into a register
 branches: PC ← target address or PC + 4
e.g., lw $10, 24($12)
add $10, $11, $12
6
An Abstract View of the Implementation
Fig. 4.1

                     op   rs   rt   address
lw $10, 24($12)      35   12   10   24        (rs = base register)

                     op   rs   rt   rd   shamt   funct
add $10, $11, $12    0    11   12   10   0       32
7
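
As a rough check of the field values listed above, the two instructions can be packed into 32-bit words. This is a minimal sketch that assumes the standard MIPS field widths (op 6 bits, rs/rt/rd/shamt 5 bits each, funct 6 bits, address 16 bits); the helper names are made up for illustration.

```python
def encode_i_type(op: int, rs: int, rt: int, address: int) -> int:
    """Pack an I-format instruction: op(6) rs(5) rt(5) address(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (address & 0xFFFF)


def encode_r_type(op: int, rs: int, rt: int, rd: int, shamt: int, funct: int) -> int:
    """Pack an R-format instruction: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


lw_word = encode_i_type(35, 12, 10, 24)          # lw $10, 24($12)
add_word = encode_r_type(0, 11, 12, 10, 0, 32)   # add $10, $11, $12

print(f"{lw_word:#010x}")   # 0x8d8a0018
print(f"{add_word:#010x}")  # 0x016c5020
```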
Basic Implementation
 Fig. 4.2 shows the datapath of Fig. 4.1 with the
three required multiplexors added, as well as
control lines for the major functional units
 In the remainder of the chapter, we refine this
view to fill in the details

8
Multiplexers
 Can’t just join wires together
 Use multiplexers

Fig. 4.1
9
The Basic Implementation

Fig. 4.2

1/3 hour
10
4.3 Building a Datapath
 Datapath elements
 Elements that process data and addresses
in the CPU
 E.g., Registers, ALUs, mux’s, memories, …
 Instructions supported
 Arithmetic/logical (R-format): add, sub, and,
or, slt
 Memory reference: lw, sw
 Control transfer: beq, j

11
Datapath for Fetching Instruction

Fig. 4.6

Increment by
4 for next
instruction

12
R-Format Instructions
 Register file
 A collection of registers in which any register can be
read or written by specifying the number of the
register in the file
 Read two register operands
 Perform arithmetic/logical operation
 Write register result (controlled by the write
control signal)

                     op   rs   rt   address
lw $10, 24($12)      35   12   10   24

                     op   rs   rt   rd   shamt   funct
add $10, $11, $12    0    11   12   10   0       32
13
Full Datapath

Fig. 4.11
14
Load/Store Instructions
 Read register operands
 Calculate address using 16-bit offset
 Use ALU, but sign-extend offset
 Load: Read memory and update register
 Store: Write register value to memory
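
The load/store steps above can be sketched as follows. This is only an illustrative model: the register and memory contents are made-up values, and the dictionaries stand in for the register file and data memory.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base_value: int, offset_field: int) -> int:
    """ALU step for lw/sw: base register value + sign-extended 16-bit offset."""
    return (base_value + sign_extend_16(offset_field)) & 0xFFFFFFFF


# Hypothetical machine state, chosen only for the example.
registers = {10: 55, 12: 1000}
memory = {}

# sw $10, 24($12): write the register value to memory
address = effective_address(registers[12], 24)
memory[address] = registers[10]

# lw $10, 24($12): read memory and update the register
registers[10] = 0
registers[10] = memory[effective_address(registers[12], 24)]
print(address, registers[10])
```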

Exercise: data path for sw $10, 24($12)


2/3 hour
15
Exercise    Class:    Student ID:    Name:
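Answer: sw $10, 24($12)
[Datapath figure: read register 1 = 12 (value of $12), read register 2 = 10 (value of $10), offset = 24 (sign-extended to 24)]

lw $10, 24($12)
[Datapath figure: read register 1 = 12 (value of $12), write register = 10, offset = 24 (sign-extended to 24)]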
R-Type/Load/Store Datapath

Fig. 4.10
19
Branch Instructions
 The datapath for a branch (Fig 4.9)
 e.g., beq $t0, $s5, Exit
 Calculate the branch target address
 The offset field is shifted left 2 bits so that it is a word
offset
 The sign-extended, shifted offset field of the instruction
is added to the PC (already incremented to PC + 4)
 Compare operands
 Do a subtract
 If the Zero signal out of the ALU is asserted, we
know that the two values are equal
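
The two branch steps above (target calculation and comparison) can be put together in a short sketch. It assumes the adder uses the already incremented PC (PC + 4) as the base; the function names are hypothetical.

```python
def sign_extend_16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def next_pc_for_beq(pc: int, rs_value: int, rt_value: int, offset_field: int) -> int:
    """Return the next PC for beq, mimicking the datapath steps."""
    pc_plus_4 = pc + 4
    # Sign-extend the 16-bit offset, then shift left 2 to turn words into bytes.
    target = pc_plus_4 + (sign_extend_16(offset_field) << 2)
    # The ALU subtracts; its Zero output is asserted when the operands are equal.
    zero = ((rs_value - rt_value) & 0xFFFFFFFF) == 0
    return target if zero else pc_plus_4


print(next_pc_for_beq(80016, 7, 7, 2))  # taken: 80028
print(next_pc_for_beq(80016, 7, 9, 2))  # not taken: 80020
```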

20
Review: Showing Branch Offset

Loop: add $t1, $s3, $s3     # location 80000
      add …
      add …
      lw …                  # location 80012
      beq $t0, $s5, Exit    # location 80016
      add …
      j Loop                # location 80024
Exit:                       # location 80028
>> Answer
80028 (Exit) = 80020 (PC) + 8 (Offset)
      beq   $t0   $s5   Exit
      4     8     21    2
PC-relative addressing
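
The offset value 2 in the answer above can be derived mechanically. A minimal sketch, assuming the offset is counted in words relative to the instruction after the branch; branch_offset_words is a hypothetical helper, not an assembler API.

```python
def branch_offset_words(branch_address: int, target_address: int) -> int:
    """Offset field for a PC-relative branch: distance from PC + 4, in words."""
    byte_distance = target_address - (branch_address + 4)
    if byte_distance % 4 != 0:
        raise ValueError("branch target must be word aligned")
    return byte_distance // 4


offset = branch_offset_words(80016, 80028)   # beq at 80016, Exit at 80028
print(offset)                                # 2

# Going back the other way: (PC + 4) + offset * 4 gives the target again.
print(80016 + 4 + offset * 4)                # 80028
```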

21
Branch Instructions

Fig. 4.9

Sign-bit wire
replicated

beq $t0, $s5, Exit 4 8 21 2
22


3/3 hour
beq $t0, $s5, Exit 4 8 21 2 23

80020

80016

Fig. 4.11
23
4.5 An Overview of Pipelining
 Pipelining is an implementation technique
in which multiple instructions are
overlapped in execution
 All stages in pipelining are operating
concurrently

24
Laundry Analogy for Pipelining (1/2)

1. Washer
2. Dryer
3. Fold
4. Put away

Fig. 4.25

25
Laundry Analogy for Pipelining (2/2)
 Four stages in pipeline (Figure 4.25)
 non-pipelining approach: 16 units of execution
time (four loads)
 pipelining approach: 7 units
 Pipelining improves throughput of our
laundry system without improving the time
to complete a single task
 With balanced stages and enough tasks, the
speedup due to pipelining approaches the
number of stages in the pipeline (with only
four loads it is 16/7 ≈ 2.3)
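
The 16-unit and 7-unit figures above follow from a simple count. A small sketch, assuming every stage takes one time unit and the pipeline never stalls:

```python
def sequential_time(tasks: int, stages: int) -> int:
    """Each task finishes all stages before the next one starts."""
    return tasks * stages


def pipelined_time(tasks: int, stages: int) -> int:
    """The first task takes `stages` units; each later task finishes one unit after."""
    return stages + (tasks - 1)


for tasks in (4, 1000):
    seq = sequential_time(tasks, 4)
    pipe = pipelined_time(tasks, 4)
    print(tasks, seq, pipe, round(seq / pipe, 2))
```

With four loads the speedup is about 2.29; with 1000 loads it is about 3.99, close to the number of stages.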
26
MIPS Pipeline
 Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access (data) memory operand
5. WB: Write result back to register

Clock cycle length = 800 ps

Fig. 4.27 (a) Single-cycle, nonpipelined
27


1-Cycle vs Pipelined Performance

Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200 ps        100 ps          200 ps   200 ps          100 ps           800 ps
sw         200 ps        100 ps          200 ps   200 ps                           700 ps
R-format   200 ps        100 ps          200 ps                   100 ps           600 ps
beq        200 ps        100 ps          200 ps                                    500 ps

Fig. 4.26
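
The totals in the table, and the 800 ps versus 200 ps clock cycles, can be recomputed from the per-stage times. This is a sketch under the assumption that the table columns map onto the five stages IF, ID, EX, MEM, WB and that the pipeline never stalls.

```python
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}

# Which stages each instruction class actually uses (from the table).
USES = {
    "lw": ["IF", "ID", "EX", "MEM", "WB"],
    "sw": ["IF", "ID", "EX", "MEM"],
    "R-format": ["IF", "ID", "EX", "WB"],
    "beq": ["IF", "ID", "EX"],
}

totals = {name: sum(STAGE_PS[s] for s in stages) for name, stages in USES.items()}

# Single-cycle: the clock must fit the slowest instruction.
single_cycle_clock = max(totals.values())
# Pipelined: the clock must fit the slowest stage.
pipelined_clock = max(STAGE_PS.values())


def single_cycle_time(n: int) -> int:
    return n * single_cycle_clock


def pipelined_time(n: int) -> int:
    # Fill the pipeline once, then finish one instruction per cycle.
    return (len(STAGE_PS) + n - 1) * pipelined_clock


print(totals)
print(single_cycle_time(3), pipelined_time(3))  # three lw instructions
```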

28
Figure 4.27

Clock cycle length = 200 ps


(b) Single-cycle, pipelined

29
ISA Design for Pipelining
 MIPS ISA designed for pipelining
 All instructions are 32 bits long
 Easier to fetch and decode in one cycle
 c.f. x86: 1- to 15-byte instructions
 Few and regular instruction formats
 Can decode and read registers in one step
 Load/store addressing
 Can calculate address in 3rd stage, access memory
in 4th stage
 Alignment of memory operands
 Memory access takes only one cycle
1/5 hour
30
Case Study
 [video] A Day Made of Glass (11:24)

31
Pipeline Hazards
 There are situations in pipelining when the
next instruction cannot execute in the
following clock cycle.
 These events are called hazards
 Structural hazard
 Conflict for use of a resource
 Data hazard
 Need to wait for previous instruction to complete its
data read/write
 Control hazard
 Deciding on control action depends on previous
instruction
32
Structural Hazards
 Example
 If the pipeline in Figure 4.27(b) had a fourth instruction
 In the same clock cycle, the first instruction would be accessing data
from memory while the fourth instruction is fetching an instruction
from the same memory
 Without two memories, our pipeline could have a
structural hazard

lw $4, 400($0)
Memory
reference

33
2/5 hour
34
Data Hazards
 Data hazards arise from the dependence of one
instruction on an earlier one that is still in the pipeline
 Example
add $s0, $t0, $t1
Data dependency
sub $t2, $s0, $t3
 The add instruction does not write its result until the fifth stage,
meaning that we would have to add two bubbles to the pipeline
(next page)

add $s0, $t0, $t1

sub $t2, $s0, $t3



35
How to Solve Data Hazards?
 Waiting 2 clock cycles
 Correct, but too pessimistic

new $s0 available

36
Exercise
 Solve three data hazards by specifying
extra waiting clock cycles.
add $3, $4, $6
sub $5, $3, $2
lw $7, 100($5)
add $8, $7, $2

37
Answer

add $3, $4, $6
    2 waiting clock cycles
sub $5, $3, $2
    2 waiting clock cycles
lw $7, 100($5)
    2 waiting clock cycles
add $8, $7, $2
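
The waiting cycles in this answer can be reproduced with a small scheduling sketch. It assumes no forwarding, and that a register written in WB can be read by an instruction whose ID stage falls in the same clock cycle (write in the first half, read in the second half). The tuple format (destination, sources) is made up for the example.

```python
def stalls_without_forwarding(program):
    """program: list of (dest_register, [source_registers]).

    Returns the number of waiting cycles inserted before each instruction.
    """
    write_back_cycle = {}   # register -> cycle in which its producer is in WB
    stalls = []
    previous_id = 1         # so the first instruction decodes in cycle 2
    for dest, sources in program:
        earliest_id = previous_id + 1
        # ID may not happen before every source register has been written back.
        needed = max((write_back_cycle.get(r, 0) for r in sources), default=0)
        id_cycle = max(earliest_id, needed)
        stalls.append(id_cycle - earliest_id)
        if dest is not None:
            write_back_cycle[dest] = id_cycle + 3   # ID -> EX -> MEM -> WB
        previous_id = id_cycle
    return stalls


exercise = [
    (3, [4, 6]),   # add $3, $4, $6
    (5, [3, 2]),   # sub $5, $3, $2
    (7, [5]),      # lw  $7, 100($5)
    (8, [7, 2]),   # add $8, $7, $2
]
print(stalls_without_forwarding(exercise))  # [0, 2, 2, 2]
```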

38
Forwarding
 Use result ($s0) when it is computed
 Don’t wait for it to be stored in a register
 Requires extra connections in the datapath

Fig. 4.29
39
Load-Use Data Hazard
 Can’t always avoid stalls by forwarding
 If value not computed when needed
 Can’t forward backward in time!

stall one stage

Fig. 4.30
40
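
A load-use hazard can be detected by looking at adjacent instructions only. The sketch below assumes full forwarding, so the only remaining stall is a load immediately followed by an instruction that reads the loaded register. The two-instruction program and the tuple format (opcode, destination, sources) are illustrative assumptions, not taken from the slides.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest_register, [source_registers]).

    Returns, per instruction, 1 if it must stall one cycle behind a load
    whose result it needs, otherwise 0.
    """
    stalls = []
    previous = None
    for opcode, dest, sources in program:
        needs_stall = (
            previous is not None
            and previous[0] == "lw"      # producer is a load
            and previous[1] in sources   # consumer reads the loaded register
        )
        stalls.append(1 if needs_stall else 0)
        previous = (opcode, dest, sources)
    return stalls


example = [
    ("lw", "$s0", ["$t1"]),            # lw  $s0, 20($t1)
    ("sub", "$t2", ["$s0", "$t3"]),    # sub $t2, $s0, $t3
]
print(load_use_stalls(example))  # [0, 1]
```

The loaded value only exists after the MEM stage, one cycle too late for the next instruction's EX stage, which is why forwarding alone cannot remove this stall.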
Exercise 1
 Show the forwarding paths needed to
execute the following four instructions:
add $3, $4, $6
sub $5, $3, $2
lw $7, 100($5)
add $8, $7, $2

41
Answer

add $3, $4, $6

sub $5, $3, $2

lw $7, 100($5)

add $8, $7, $2

42
Exercise 2
 Identify all of the data dependencies in the
following code.
 Which dependencies are data hazards that
will be resolved via forwarding?
 Which dependencies are data hazards that
will cause a stall?

add $3, $4, $2
sub $5, $3, $1
lw $6, 200($3)
add $7, $3, $6
43
Answer
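 add $3, $4, $2
sub $5, $3, $1
lw $6, 200($3)
add $7, $3, $6
 Answer
 sub, lw, and the last add all depend on $3 from the first add;
the last add also depends on $6 from the lw.
 The dependencies of sub and lw on $3 are data hazards that
are resolved via forwarding.
 The data dependency between the load and the last add
instruction cannot be resolved by using forwarding; it causes a stall.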

IF ID EXE MEM WB

IF ID EXE MEM WB

IF ID EXE MEM WB

IF ID EXE MEM WB
44
Reordering Code to Avoid Stalls
 Example.
Consider the following code segment in C:
A = B + E;
C = B + F;
Here is the generated MIPS code for this segment:
lw $t1, 0($t0) #load B
lw $t2, 4($t0) #load E
add $t3, $t1, $t2 #$t3 = B+E
sw $t3, 12($t0) #store A = $t3
lw $t4, 8($t0) #load F
add $t5, $t1, $t4 #$t5 = B+F
sw $t5, 16($t0) #store C = $t5
Reorder the instructions to avoid any pipeline stalls.
45
Reordering Code to Avoid Stalls
 Answer.
 Both add instructions have a hazard because each
depends on the lw instruction immediately before it

lw $t2, 4($t0)
add $t3, $t1, $t2

 On a pipelined processor with forwarding, the
reordered sequence will complete in two
fewer cycles than the original version
46
Reordering Code to Avoid Stalls
 Answer (cont’d).
 Moving up the 3rd lw instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
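
The "two fewer cycles" claim can be checked by counting cycles for both orderings. This sketch assumes a five-stage pipeline with full forwarding, where the only stall is a one-cycle load-use stall between adjacent instructions; the tuple encoding of the instructions is made up for the example.

```python
def total_cycles(program, stages=5):
    """program: list of (opcode, dest_register, [source_registers])."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        # A load followed immediately by a reader of its result costs one cycle.
        if previous[0] == "lw" and previous[1] in current[2]:
            stalls += 1
    return stages + len(program) - 1 + stalls


original = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]

reordered = [
    ("lw", "$t1", ["$t0"]),
    ("lw", "$t2", ["$t0"]),
    ("lw", "$t4", ["$t0"]),
    ("add", "$t3", ["$t1", "$t2"]),
    ("sw", None, ["$t3", "$t0"]),
    ("add", "$t5", ["$t1", "$t4"]),
    ("sw", None, ["$t5", "$t0"]),
]

print(total_cycles(original), total_cycles(reordered))  # 13 11
```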

lw $t2, 4($t0)

lw $t4, 8($t0)

add $t3, $t1, $t2


4/5 hour
47
Control Hazards
 Control hazard arises from the need to make a
decision based on the results of one instruction
while others are executing
 Fetching next instruction depends on branch outcome
 Pipeline can’t always fetch correct instruction
 Still working on ID stage of branch

48
Fig. 4.21
49
Stall on Branch
 Example:
beq $1, $2, 40
or $7, $8, $9
We must begin fetching the or instruction on the next
clock cycle. But the pipeline cannot possibly know
what the next instruction should be.

beq $1, $2, 40    IF ID EXE MEM WB

or $7, $8, $9     IF ID EXE MEM WB

50
Stall on Branch
 Assume that we put in extra hardware so that we
can test registers, calculate the branch address,
and update the PC during the 2nd stage of the
pipeline.

beq $1,$2,40 IF ID EXE MEM WB

or $7, $8, $9 IF ID EXE MEM WB

Fig. 4.31
51
Branch Prediction
 One simple approach is to always predict
that branches will be untaken
 When you are right, the pipeline proceeds at
full speed (see Figure 4.32 (a))
 Only when branches are taken does the
pipeline stall (see Figure 4.32 (b))
 A more sophisticated version of branch
prediction would have some branches
predicted as taken and some as untaken

52
MIPS with Predict Not Taken

Prediction
correct

Prediction
incorrect

Fig. 4.32
53
More-Realistic Branch Prediction
 Dynamic branch prediction
 Hardware measures actual branch behavior
 e.g., record recent history of each branch
 Assume future behavior will continue the trend
 When wrong, stall while re-fetching, and update history
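
One very simple way to "record recent history" is to remember only the last outcome of each branch and predict that it repeats. The sketch below is that one-bit scheme, chosen for illustration; it is not necessarily the predictor the slides have in mind. The branch address and the loop trace are hypothetical.

```python
class LastOutcomePredictor:
    """Predict that each branch does what it did the last time it executed."""

    def __init__(self):
        self.history = {}   # branch address -> last outcome (True = taken)

    def predict(self, branch_address: int) -> bool:
        # Branches never seen before are predicted not taken.
        return self.history.get(branch_address, False)

    def update(self, branch_address: int, taken: bool) -> None:
        self.history[branch_address] = taken


def count_mispredictions(predictor, trace) -> int:
    misses = 0
    for branch_address, taken in trace:
        if predictor.predict(branch_address) != taken:
            misses += 1     # here the pipeline would stall and re-fetch
        predictor.update(branch_address, taken)
    return misses


# A loop branch taken 9 times and then falling through, executed twice.
loop_branch = 80016
trace = ([(loop_branch, True)] * 9 + [(loop_branch, False)]) * 2
print(count_mispredictions(LastOutcomePredictor(), trace))  # 4
```

The four misses are the first and last iteration of each pass through the loop; the other 16 predictions are correct.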

54
Exercise
 Refer to the following sequence of
instructions that are executed in a five-stage
pipeline.
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)

55
1) Indicate hazards and add nop instructions
to eliminate them.
2) Assume there is full forwarding. Indicate
hazards and add nop instructions to
eliminate them.

56
Answer
(1)
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)

hazard: sw needs $1 before lw has written it back

lw $1, 40($6)     IF ID EXE MEM WB
add $6, $2, $2       IF ID EXE MEM WB
sw $5, 50($1)           IF ID EXE MEM WB

57
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)

lw $1, 40($6)     IF ID EXE MEM WB
add $6, $2, $2       IF ID EXE MEM WB
nop
sw $5, 50($1)              IF ID EXE MEM WB

58
(2)
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)

forwarding

lw $1, 40($6)     IF ID EXE MEM WB
add $6, $2, $2       IF ID EXE MEM WB
sw $5, 50($1)           IF ID EXE MEM WB

59
Case Study: NFC
 [video] Amazing-Smart-Mobile-Life-with-NFC (3:55)

3/5 hour
60
4.6 MIPS Pipelined Datapath

Right-to-left flow (MEM and WB stages) leads to
hazards

61
Pipeline registers
 Need registers between stages
 To hold information produced in previous cycle

62
Pipeline Operation
 Cycle-by-cycle flow of instructions through
the pipelined datapath
 “Single-clock-cycle” pipeline diagram
 Shows pipeline usage in a single cycle
 Highlight resources used
 c.f. “multi-clock-cycle” diagram
 Graph of operation over time
 We’ll look at “single-clock-cycle” diagrams
for load & store
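
The relation between the two diagram styles can be sketched in a few lines: the multi-clock-cycle diagram is a table of instructions over time, and a single-clock-cycle diagram is one column of that table. This assumes an ideal five-stage pipeline with no stalls; the function names and the two-instruction program are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def multi_cycle_diagram(instructions):
    """One row per instruction, one column per clock cycle."""
    total_cycles = len(STAGES) + len(instructions) - 1
    rows = []
    for i, text in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage    # instruction i enters IF in cycle i + 1
        rows.append((text, cells))
    return rows


def single_cycle_view(rows, cycle):
    """Which instruction occupies each stage in the given cycle (1-based)."""
    return {cells[cycle - 1]: text for text, cells in rows if cells[cycle - 1]}


rows = multi_cycle_diagram(["lw $10, 24($12)", "add $10, $11, $12"])
for text, cells in rows:
    print(f"{text:<20}" + " ".join(f"{c:<3}" for c in cells))
print(single_cycle_view(rows, 2))
```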

63
IF for Load, Store, …

64
ID for Load, Store, …

65
EX for Load

66
MEM for Load

67
WB for Load

Wrong
register
number

68
Corrected Datapath for Load

69
EX for Store

70
MEM for Store

71
WB for Store

72
