Edition
The Hardware/Software Interface
Chapter 4
The Processor
Contents
4.1 Introduction
4.3 Building a Datapath
4.4 A Simple Implementation Scheme
4.5 An Overview of Pipelining
4.6 Pipelined Datapath and Control
4.1 Introduction
Three CPU performance factors
Instruction count
Determined by ISA and compiler
E.g.,
Pseudo instruction: bgt $t4, $t5, L
Actual ones: slt $at, $t5, $t4
bne $at, $zero, L
CPI and Cycle time
Determined by CPU hardware (i.e., implementation)
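The three factors combine in the classic performance equation: CPU time = instruction count × CPI × clock cycle time. A minimal Python sketch follows; the function name and the workload numbers are illustrative assumptions, not values from the slides.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s

# Hypothetical program: 1 million instructions, CPI of 2, 1 ns cycle (1 GHz clock).
t = cpu_time(1_000_000, 2.0, 1e-9)
print(f"{t * 1e3:.3f} ms")
```

The ISA and compiler change the first argument; the hardware implementation changes the other two.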
Implementation
This chapter examines two MIPS
implementations (i.e., the datapath and the
control unit)
A simplified version
A more realistic pipelined version
Simple subset, shows most aspects
Memory reference: lw, sw
Arithmetic/logical: add, sub, and, or, slt
Control transfer: beq, j
Instruction Execution (1/2)
Phases of instruction execution
Instruction fetch
Send the PC to the memory and fetch the
instruction
Decoding
Operands fetch
Read one or two registers, using fields of the
instruction to select the registers to read
Instruction Execution (2/2)
Execution, depending on instruction class
memory reference instructions: an address calculation
arithmetic-logical instructions: an operation execution
branches: comparison
Write back
memory reference instructions: access the memory
arithmetic-logical instructions: write the data from
the ALU back into a register
branches: PC target address or PC + 4
e.g., lw $10, 24($12)
add $10, $11, $12
An Abstract View of the Implementation
Fig. 4.1
Instruction          op   rs          rt   address
lw $10, 24($12)      35   12 (base)   10   24
Instruction          op   rs   rt   rd   shamt   funct
add $10, $11, $12    0    11   12   10   0       32
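The field values above can be packed into 32-bit instruction words. The sketch below assumes the standard MIPS field widths (op 6 bits, rs/rt/rd/shamt 5 bits each, funct 6 bits, address 16 bits); the helper names `encode_i` and `encode_r` are hypothetical.

```python
def encode_i(op, rs, rt, imm):
    """Pack an I-format instruction: op(6) rs(5) rt(5) immediate(16)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack an R-format instruction: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

lw_word = encode_i(35, 12, 10, 24)          # lw $10, 24($12)
add_word = encode_r(0, 11, 12, 10, 0, 32)   # add $10, $11, $12
print(f"{lw_word:#010x} {add_word:#010x}")  # prints 0x8d8a0018 0x016c5020
```

The datapath reverses this packing: it slices the fetched word back into the same fields to select registers and the ALU operation.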
Basic Implementation
Fig. 4.2 shows the datapath of Fig. 4.1 with the
three required multiplexors added, as well as
control lines for the major functional units
In the remainder of the chapter, we refine this
view to fill in the details
Multiplexers
Can’t just join wires together
Use multiplexers
Fig. 4.1
The Basic Implementation
Fig. 4.2
4.3 Building a Datapath
Datapath elements
Elements that process data and addresses
in the CPU
E.g., Registers, ALUs, mux’s, memories, …
Instructions supported
Arithmetic/logical (R-format): add, sub, and, or, slt
Memory reference: lw, sw
Control transfer: beq, j
Datapath for Fetching Instruction
Fig. 4.6
Increment by 4 for next instruction
R-Format Instructions
Register file
A collection of registers in which any register can be
read or written by specifying the number of the
register in the file
Read two register operands
Perform arithmetic/logical operation
Write register result (controlled by the write
control signal)
Instruction          op   rs   rt   address
lw $10, 24($12)      35   12   10   24
Instruction          op   rs   rt   rd   shamt   funct
add $10, $11, $12    0    11   12   10   0       32
Full Datapath
Fig. 4.11
Load/Store Instructions
Read register operands
Calculate address using 16-bit offset
Use ALU, but sign-extend offset
Load: Read memory and update register
Store: Write register value to memory
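The address calculation for loads and stores can be sketched in Python as follows. The helper names and the base register value 0x1000 are assumptions chosen for illustration.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def effective_address(base_value, imm16):
    """Address = base register value + sign-extended offset, wrapped to 32 bits."""
    return (base_value + sign_extend16(imm16)) & 0xFFFFFFFF

# lw $10, 24($12), assuming $12 currently holds 0x1000
print(hex(effective_address(0x1000, 24)))       # prints 0x1018
# An offset field of 0xFFFC encodes -4, so the address moves backward
print(hex(effective_address(0x1000, 0xFFFC)))   # prints 0xffc
```

Sign extension is what lets a 16-bit offset address memory both above and below the base register.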
Exercise Class: Student ID: Name:
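Answer
sw $10, 24($12): register number 12 reads out $12, register number 10 reads out $10, offset 24 is sign-extended to 24
lw $10, 24($12): register number 12 reads out $12, register number 10 is the write register, offset 24 is sign-extended to 24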
R-Type/Load/Store Datapath
Fig. 4.10
Branch Instructions
The datapath for a branch (Fig 4.9)
e.g., beq $t0, $s5, Exit
Calculate the branch target address
The offset field is a word offset, so it is sign-extended and
shifted left 2 bits to form a byte offset
The shifted offset is added to the address of the instruction
after the branch (PC + 4)
Compare operands
Do a subtract
If the Zero signal out of the ALU unit is asserted, we
know that the two values are equal
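The branch datapath can be sketched as a small Python function. It assumes the usual MIPS rule that the target is relative to PC + 4; the function names and the addresses in the example are illustrative.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc_beq(pc, offset16, rs_value, rt_value):
    """Next PC for beq: the ALU subtracts, and Zero asserted means the operands are equal."""
    zero = ((rs_value - rt_value) & 0xFFFFFFFF) == 0
    target = (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF
    return target if zero else pc + 4

# Hypothetical beq at address 80012 with an offset field of 2
print(next_pc_beq(80012, 2, 7, 7))   # operands equal: branch taken, prints 80024
print(next_pc_beq(80012, 2, 7, 8))   # operands differ: fall through, prints 80016
```

A negative offset field, such as 0xFFFA for -6, yields a backward branch of the kind used to close a loop.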
Review: Showing Branch Offset
Branch Instructions
Fig. 4.9
Sign-bit wire replicated
Fig. 4.11
4.5 An Overview of Pipelining
Pipelining is an implementation technique
in which multiple instructions are
overlapped in execution
All stages in pipelining are operating
concurrently
Laundry Analogy for Pipelining (1/2)
1. Washer
2. Dryer
3. Fold
4. Put away
Fig. 4.25
Laundry Analogy for Pipelining (2/2)
Four stages in pipeline (Figure 4.25)
non-pipelining approach: 16 units of execution time
pipelining approach: 7 units
Pipelining improves throughput of our
laundry system without improving the time
to complete a single task
With balanced stages and enough tasks, the
speedup due to pipelining approaches the
number of stages in the pipeline; for only four
loads it is 16/7, or about 2.3
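The 16-unit and 7-unit figures follow from a simple count, sketched below in Python. The sketch assumes every stage takes exactly one time unit; the function names are illustrative.

```python
def sequential_time(tasks, stages):
    """Each task runs all of its stages before the next task starts."""
    return tasks * stages

def pipelined_time(tasks, stages):
    """The first task takes `stages` units; each later task finishes one unit after the previous."""
    return stages + (tasks - 1)

for n in (4, 1000):
    seq, pipe = sequential_time(n, 4), pipelined_time(n, 4)
    print(n, seq, pipe, round(seq / pipe, 2))
# prints:
# 4 16 7 2.29
# 1000 4000 1003 3.99
```

With many loads the speedup approaches the stage count of 4, while the time for any single load stays at 4 units.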
MIPS Pipeline
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access (data) memory operand
5. WB: Write result back to register
Fig. 4.26
Figure 4.27
ISA Design for Pipelining
MIPS ISA designed for pipelining
All instructions are 32 bits
Easier to fetch and decode in one cycle
c.f. x86: 1- to 17-byte instructions
Few and regular instruction formats
Can decode and read registers in one step
Load/store addressing
Can calculate address in 3rd stage, access memory
in 4th stage
Alignment of memory operands
Memory access takes only one cycle
Case Study
[video] A Day Made of Glass (11:24)
Pipeline Hazards
There are situations in pipelining when the
next instruction cannot execute in the
following clock cycle.
These events are called hazards
Structural hazards
Conflict for use of a resource
Data hazard
Need to wait for previous instruction to complete its
data read/write
Control hazard
Deciding on control action depends on previous
instruction
Structural Hazards
Example
Suppose the pipeline in Figure 4.27(b) had a fourth instruction
In the same clock cycle, the first instruction would be accessing data
from memory while the fourth instruction would be fetching an instruction
from the same memory
Without two memories, our pipeline could have a
structural hazard
lw $4, 400($0)
Memory
reference
Data Hazards
Data hazards arise from the dependence of one
instruction on an earlier one that is still in the pipeline
Example
add $s0, $t0, $t1
Data dependency
sub $t2, $s0, $t3
The add instruction does not write its result until the fifth stage,
meaning that we would have to add two bubbles to the pipeline
(next page)
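The two-bubble count can be reproduced with a small Python sketch that inserts nops for a pipeline without forwarding. It assumes the register file is written in the first half of a cycle and read in the second half, so a consumer must issue at least three slots after its producer; the tuple layout and function name are illustrative.

```python
def insert_nops(program):
    """program: list of (text, dest, sources). Returns instruction texts with nops added."""
    out = []  # emitted (text, dest) pairs, nops included
    for text, dest, sources in program:
        needed = 0
        for distance in (1, 2):  # only the two preceding slots can still be in flight
            if len(out) >= distance:
                producer_dest = out[-distance][1]
                if producer_dest is not None and producer_dest in sources:
                    needed = max(needed, 3 - distance)
        out.extend([("nop", None)] * needed)
        out.append((text, dest))
    return [text for text, _ in out]

program = [
    ("add $s0, $t0, $t1", "$s0", ("$t0", "$t1")),
    ("sub $t2, $s0, $t3", "$t2", ("$s0", "$t3")),
]
print(insert_nops(program))
# prints ['add $s0, $t0, $t1', 'nop', 'nop', 'sub $t2, $s0, $t3']
```

The sketch ignores special cases such as $zero, which never creates a real dependence.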
Exercise
Solve three data hazards by specifying
extra waiting clock cycles.
add $3, $4, $6
sub $5, $3, $2
lw $7, 100($5)
add $8, $7, $2
Answer
Forwarding
Use result ($s0) when it is computed
Don’t wait for it to be stored in a register
Requires extra connections in the datapath
Fig. 4.29
Load-Use Data Hazard
Can’t always avoid stalls by forwarding
If value not computed when needed
Can’t forward backward in time!
Fig. 4.30
Exercise 1
Show the forwarding paths needed to
execute the following four instructions:
add $3, $4, $6
sub $5, $3, $2
lw $7, 100($5)
add $8, $7, $2
Answer
lw $7, 100($5)
Exercise 2
Identify all of the data dependencies in the
following code.
Which dependencies are data hazards that
will be resolved via forwarding?
Which dependencies are data hazards that
will cause a stall?
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
IF ID EXE MEM WB
Reordering Code to Avoid Stalls
Example.
Consider the following code segment in C:
A = B + E;
C = B + F;
Here is the generated MIPS code for this segment:
lw $t1, 0($t0) #load B
lw $t2, 4($t0) #load E
add $t3, $t1, $t2 #$t3 = B+E
sw $t3, 12($t0) #store A = $t3
lw $t4, 8($t0) #load F
add $t5, $t1, $t4 #$t5 = B+F
sw $t5, 16($t0) #store C = $t5
Reorder the instructions to avoid any pipeline stalls.
Reordering Code to Avoid Stalls
Answer.
Both add instructions have a hazard: each uses a
register loaded by the lw immediately before it, for example
lw $t2, 4($t0)
add $t3, $t1, $t2
On a pipelined processor with forwarding, the
reordered sequence will complete in two
fewer cycles than the original version
Reordering Code to Avoid Stalls
Answer (cont’d).
Moving up the 3rd lw instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
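The two-cycle saving can be checked with a small Python model of load-use stalls. It assumes a five-stage pipeline with full forwarding, where the only stall is one cycle when an instruction reads a register loaded by the lw immediately before it; the tuple layout and function names are illustrative.

```python
def load_use_stalls(program):
    """Count one-cycle stalls: an instruction reading a register loaded by the lw just before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "lw" and prev_dest in cur_sources:
            stalls += 1
    return stalls

def total_cycles(program, stages=5):
    """Cycles to drain the pipeline: one per instruction, plus fill time, plus stalls."""
    return len(program) + (stages - 1) + load_use_stalls(program)

original = [
    ("lw", "$t1", ("$t0",)),          # load B
    ("lw", "$t2", ("$t0",)),          # load E
    ("add", "$t3", ("$t1", "$t2")),   # B + E, uses $t2 right after its load
    ("sw", None, ("$t3", "$t0")),     # store A
    ("lw", "$t4", ("$t0",)),          # load F
    ("add", "$t5", ("$t1", "$t4")),   # B + F, uses $t4 right after its load
    ("sw", None, ("$t5", "$t0")),     # store C
]
reordered = [original[i] for i in (0, 1, 4, 2, 3, 5, 6)]  # third lw moved up
print(total_cycles(original), total_cycles(reordered))    # prints 13 11
```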
Fig. 4.21
Stall on Branch
Example:
beq $1, $2, 40
or $7, $8, $9
We must begin fetching the or instruction on the next
clock cycle, but the pipeline cannot possibly know
what the next instruction should be.
or $7, $8, $9 IF ID EXE MEM WB
Stall on Branch
Assume that we put in extra hardware so that we
can test registers, calculate the branch address,
and update the PC during the 2nd stage of the
pipeline.
Fig. 4.31
Branch Prediction
One simple approach is to always predict
that branches will be untaken
When you are right, the pipeline proceeds at
full speed (see Figure 4.32 (a))
Only when branches are taken does the
pipeline stall (see Figure 4.32 (b))
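The cost of the predict-not-taken scheme can be estimated with a short Python sketch. It assumes branches are resolved in the 2nd stage, so each taken branch costs one bubble, and that there are no other hazards; the function name and the workload numbers are hypothetical.

```python
def cycles_predict_not_taken(n_instructions, n_taken_branches, stages=5, penalty=1):
    """Total cycles: pipeline fill + one slot per instruction + one bubble per taken branch."""
    return (stages - 1) + n_instructions + n_taken_branches * penalty

# Hypothetical workload of 100 instructions
print(cycles_predict_not_taken(100, 0))    # no branch taken, prints 104
print(cycles_predict_not_taken(100, 10))   # ten taken branches, prints 114
```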
A more sophisticated version of branch
prediction would have some branches
predicted as taken and some as untaken
MIPS with Predict Not Taken
Prediction
correct
Prediction
incorrect
Fig. 4.32
More-Realistic Branch Prediction
Dynamic branch prediction
Hardware measures actual branch behavior
e.g., record recent history of each branch
Assume future behavior will continue the trend
When wrong, stall while re-fetching, and update history
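The slides do not name a specific history scheme. As one common possibility, the sketch below uses a 2-bit saturating counter per branch; the class name, table layout, and branch address 0x400 are assumptions made for illustration.

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state (hypothetical table layout)

    def predict(self, pc):
        return self.counters.get(pc, 0) >= 2

    def update(self, pc, taken):
        state = self.counters.get(pc, 0)
        self.counters[pc] = min(state + 1, 3) if taken else max(state - 1, 0)

predictor = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch: taken 9 times, then falls through
mispredictions = 0
for taken in outcomes * 2:        # run the loop twice
    if predictor.predict(0x400) != taken:
        mispredictions += 1       # in hardware: stall while re-fetching
    predictor.update(0x400, taken)  # record the actual outcome in the history
print(mispredictions)             # prints 4
```

Two of the four mispredictions are warm-up misses in the first pass; after that, only the loop exit is mispredicted on each pass.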
Exercise
Refer to the following sequence of
instructions that are executed in a five-stage pipeline.
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)
1) Indicate hazards and add nop instructions
to eliminate them.
2) Assume there is full forwarding. Indicate
hazards and add nop instructions to
eliminate them.
Answer
(1)
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)
hazard (on $1, between lw and sw)
lw $1, 40($6)     IF  ID  EXE MEM WB
add $6, $2, $2        IF  ID  EXE MEM WB
sw $5, 50($1)             IF  ID  EXE MEM WB
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)
lw $1, 40($6)     IF  ID  EXE MEM WB
add $6, $2, $2        IF  ID  EXE MEM WB
nop
sw $5, 50($1)                 IF  ID  EXE MEM WB
(2)
lw $1, 40($6)
add $6, $2, $2
sw $5, 50($1)
forwarding (no nop needed)
lw $1, 40($6)     IF  ID  EXE MEM WB
add $6, $2, $2        IF  ID  EXE MEM WB
sw $5, 50($1)             IF  ID  EXE MEM WB
Case Study: NFC
[video] Amazing-Smart-Mobile-Life-with-NFC (3:55)
4.6 MIPS Pipelined Datapath
Right-to-left flow (MEM, WB) leads to hazards
Pipeline registers
Need registers between stages
To hold information produced in previous cycle
Pipeline Operation
Cycle-by-cycle flow of instructions through
the pipelined datapath
“Single-clock-cycle” pipeline diagram
Shows pipeline usage in a single cycle
Highlight resources used
c.f. “multi-clock-cycle” diagram
Graph of operation over time
We’ll look at “single-clock-cycle” diagrams
for load & store
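The two diagram styles can be sketched in Python. The sketch assumes an ideal pipeline with no stalls, in which instruction i enters IF in cycle i + 1; the function names and the two-instruction program are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def multi_cycle_diagram(instructions):
    """One row per instruction, one column per clock cycle (operation over time)."""
    width = 4  # column width in characters
    rows = []
    for i, text in enumerate(instructions):
        cells = [""] * i + STAGES  # shift each instruction right by one cycle
        rows.append(f"{text:<18}" + "".join(f"{c:<{width}}" for c in cells).rstrip())
    return "\n".join(rows)

def single_cycle_view(instructions, cycle):
    """Which stage each instruction occupies in one given clock cycle (1-based)."""
    view = {}
    for i, text in enumerate(instructions):
        stage_index = cycle - 1 - i
        if 0 <= stage_index < len(STAGES):
            view[text] = STAGES[stage_index]
    return view

prog = ["lw $10, 24($12)", "add $10, $11, $12"]
print(multi_cycle_diagram(prog))
print(single_cycle_view(prog, 2))
```

The first function gives the view over time; the second is a vertical slice through it, which is what a single-clock-cycle diagram highlights on the datapath.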
IF for Load, Store, …
ID for Load, Store, …
EX for Load
MEM for Load
WB for Load
Wrong register number
Corrected Datapath for Load
EX for Store
MEM for Store
WB for Store