
Jordan Daniel Ulmer, Computer Org., HW#5 CH(6)

FIGURE CREDIT: Computer Organization and Embedded Systems, Hamacher, Vranesic, Zaky, Manjikian, 6th ed., McGraw-Hill, 2012
Time of print: 10:32 PM 10/8/2014
6.1


6.1 [M] Consider the following instructions at the given addresses in the memory:
1000 Add R3, R2, #20
1004 Subtract R5, R4, #3
1008 And R6, R4, #0x3A
1012 Add R7, R2, R4
Initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer that has a five-stage
pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1, and the remaining
instructions are fetched in successive cycles.

(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the
pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1
through 8.
--
(6.1.a) Flow Of Instructions:
*Assuming a positive-edge clock, where all operations are latched at the END of the cycle.

1000 Add R3, R2, #20
  Cycle 1: Fetch    IR <- (1000)
  Cycle 2: Decode   RA <- [R2], RB <- [R3]
  Cycle 3: Compute  RZ <- [R2] + #20
  Cycle 4: Memory   RY <- [RZ]
  Cycle 5: Write    R3 <- [RY]

1004 Subtract R5, R4, #3
  Cycle 2: Fetch    IR <- (1004)
  Cycle 3: Decode   RA <- [R4], RB <- [R5]
  Cycle 4: Compute  RZ <- [R4] - #3
  Cycle 5: Memory   RY <- [RZ]
  Cycle 6: Write    R5 <- [RY]

1008 And R6, R4, #0x3A
  Cycle 3: Fetch    IR <- (1008)
  Cycle 4: Decode   RA <- [R4], RB <- [R6]
  Cycle 5: Compute  RZ <- [R4] & #0x3A
  Cycle 6: Memory   RY <- [RZ]
  Cycle 7: Write    R6 <- [RY]

1012 Add R7, R2, R4
  Cycle 4: Fetch    IR <- (1012)
  Cycle 5: Decode   RA <- [R2], RB <- [R4]
  Cycle 6: Compute  RZ <- [R2] + [R4]
  Cycle 7: Memory   RY <- [RZ]
  Cycle 8: Write    R7 <- [RY]
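The flow above can also be generated mechanically. A minimal Python sketch (my own, not from the textbook) that maps each instruction to the stage it occupies in each cycle, assuming one instruction issued per cycle and no stalls:

```python
# Sketch: cycle-by-cycle stage occupancy for a 5-stage pipeline with no
# stalls, as in Figure 6.1. Instruction i enters Fetch in cycle i+1.
STAGES = ["Fetch", "Decode", "Compute", "Memory", "Write"]

def pipeline_diagram(instructions):
    """Map each instruction to {cycle: stage}, advancing one stage per cycle."""
    diagram = {}
    for i, instr in enumerate(instructions):
        diagram[instr] = {i + 1 + s: stage for s, stage in enumerate(STAGES)}
    return diagram

program = ["Add R3,R2,#20", "Subtract R5,R4,#3", "And R6,R4,#0x3A", "Add R7,R2,R4"]
d = pipeline_diagram(program)
print(d["Add R7,R2,R4"][8])  # the last instruction writes back in cycle 8
```

With four instructions the last write-back lands in cycle 8, matching the table above.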

(b) With reference to Figures 5.8 and 5.9, describe the contents of registers R2, R3, R4, R5, R6, R7, IR,
PC, RA, RB, RY, and RZ in the pipeline during cycles 2 to 8.
--


(6.1.b) Contents of Registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ:

Cycle:  1          2               3          4          5      6      7      8      9
[PC]    1000       1004            1008       1012       ?      ?      ?      ?      ?
[IR]    Add(1000)  Subtract(1004)  And(1008)  Add(1012)  ?      ?      ?      ?      ?
[RA]    ?          2000            50         50         2000   ?      ?      ?      ?
[RB]    ?          ?               ?          ?          50     ?      ?      ?      ?
[RZ]    ?          ?               ?          2020       47     50     2050   ?      ?
[RY]    ?          ?               ?          ?          2020   47     50     2050   ?
[R2]    2000       2000            2000       2000       2000   2000   2000   2000   2000
[R3]    ?          ?               ?          ?          ?      2020   2020   2020   2020
[R4]    50         50              50         50         50     50     50     50     50
[R5]    ?          ?               ?          ?          ?      ?      47     47     47
[R6]    ?          ?               ?          ?          ?      ?      ?      50     50
[R7]    ?          ?               ?          ?          ?      ?      ?      ?      2050

Sources: RA holds [R2], [R4], [R4], [R2] in cycles 2-5; RB holds [R3], [R5], [R6] (values unknown) and [R4] = 50 in cycles 2-5.
Values: 2020 = [R2]+#20, 47 = [R4]-#3, 50 = [R4]&#0x3A, 2050 = [R2]+[R4].
6.2

6.2 [M] Repeat Problem 6.1 for the following program:
1000 Add R3, R2, #20
1004 Subtract R5, R4, #3
1008 And R6, R3, #0x3A // Data Dependency R3!!!
1012 Add R7, R2, R4
Assume that the pipeline provides forwarding paths to the ALU from registers RY and RZ
in Figure 5.8 and that the processor uses forwarding of operands.
--
From Problem 6.1: Initially, [R2] = 2000 and [R4] = 50. These instructions are executed in a computer
that has a five-stage pipeline as shown in Figure 6.2. The first instruction is fetched in clock cycle 1,
and the remaining instructions are fetched in successive cycles.


(a) Draw a diagram similar to Figure 6.1 that represents the flow of the instructions through the
pipeline. Describe the operation being performed by each pipeline stage during clock cycles 1
through 8.
--
(6.2.a) Flow Of Instructions:
*Assuming a positive-edge clock, where all operations are latched at the END of the cycle.

1000 Add R3, R2, #20
  Cycle 1: Fetch    IR <- (1000)
  Cycle 2: Decode   RA <- [R2], RB <- [R3]
  Cycle 3: Compute  RZ <- [R2] + #20
  Cycle 4: Memory   RY <- [RZ]
  Cycle 5: Write    R3 <- [RY]

1004 Subtract R5, R4, #3
  Cycle 2: Fetch    IR <- (1004)
  Cycle 3: Decode   RA <- [R4], RB <- [R5]
  Cycle 4: Compute  RZ <- [R4] - #3
  Cycle 5: Memory   RY <- [RZ]
  Cycle 6: Write    R5 <- [RY]

1008 And R6, R3, #0x3A
  Cycle 3: Fetch    IR <- (1008)
  Cycle 4: Decode   RA <- [RZ] (the new [R3], forwarded from RZ), RB <- [R6]
  Cycle 5: Compute  RZ <- [R3] & #0x3A
  Cycle 6: Memory   RY <- [RZ]
  Cycle 7: Write    R6 <- [RY]

1012 Add R7, R2, R4
  Cycle 4: Fetch    IR <- (1012)
  Cycle 5: Decode   RA <- [R2], RB <- [R4]
  Cycle 6: Compute  RZ <- [R2] + [R4]
  Cycle 7: Memory   RY <- [RZ]
  Cycle 8: Write    R7 <- [RY]
(b) With reference to Figures 5.8 and 5.9, describe the contents of R2, R3, R4, R5, R6, R7, IR, PC, RA,
RB, RY, and RZ in the pipeline during cycles 2 to 8.
--
(6.2.b) Contents of Registers R2, R3, R4, R5, R6, R7, IR, PC, RA, RB, RY, and RZ:

Cycle:  1          2               3          4          5      6      7      8      9
[PC]    1000       1004            1008       1012       ?      ?      ?      ?      ?
[IR]    Add(1000)  Subtract(1004)  And(1008)  Add(1012)  ?      ?      ?      ?      ?
[RA]    ?          2000            50         2020       2000   ?      ?      ?      ?
[RB]    ?          ?               ?          ?          50     ?      ?      ?      ?
[RZ]    ?          ?               ?          2020       47     32     2050   ?      ?
[RY]    ?          ?               ?          ?          2020   47     32     2050   ?
[R2]    2000       2000            2000       2000       2000   2000   2000   2000   2000
[R3]    ?          ?               ?          ?          ?      2020   2020   2020   2020
[R4]    50         50              50         50         50     50     50     50     50
[R5]    ?          ?               ?          ?          ?      ?      47     47     47
[R6]    ?          ?               ?          ?          ?      ?      ?      32     32
[R7]    ?          ?               ?          ?          ?      ?      ?      ?      2050

Sources: in cycle 4, RA receives 2020, the new value of R3 forwarded from RZ (the And instruction depends on R3); RB holds [R3], [R5], [R6] (values unknown) and [R4] = 50 in cycles 2-5.
Values: 2020 = [R2]+#20, 47 = [R4]-#3, 32 = [R3]&#0x3A (with forwarded [R3] = 2020), 2050 = [R2]+[R4].
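As a sanity check on the values in the table, a quick Python sketch (my own) that evaluates each instruction with the initial values [R2] = 2000 and [R4] = 50, using the forwarded R3 value for the And:

```python
# Sanity check of the register results above, assuming [R2] = 2000 and
# [R4] = 50 initially. Forwarding means And sees the NEW [R3], not a stale one.
R2, R4 = 2000, 50
R3 = R2 + 20           # Add R3, R2, #20
R5 = R4 - 3            # Subtract R5, R4, #3
R6 = R3 & 0x3A         # And R6, R3, #0x3A -- operates on the forwarded [R3] = 2020
R7 = R2 + R4           # Add R7, R2, R4
print(R3, R5, R6, R7)  # 2020 47 32 2050
```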
6.7

6.7 [M] Assume that 20 percent of the dynamic count of the instructions executed for a program are
branch instructions. Delayed branching is used, with one delay slot. Assume that there are no stalls
caused by other factors.
(a.) First, derive an expression for the execution time in cycles if all delay slots are filled with NOP
instructions.
Branches: 20% of instructions
Delay slots: 100% No-Operation
--
T_NOP = N(1 + 0.2 × 1) = 1.2N cycles
(each of the 0.2N branches wastes its single delay slot)

(b.) Then, derive another expression that reflects the execution time with 70 percent of delay slots filled
with useful instructions by the optimizing compiler.
Branches: 20% of instructions
Delay slots: 30% No-Operation, 70% useful operation
--
T_OPT = N(1 + 0.2 × 0.3) = 1.06N cycles

(c.) From these expressions, determine the compiler's contribution to the increase in performance,
expressed as a speedup percentage.
--
Speedup = T_NOP / T_OPT = 1.2N / 1.06N ≈ 1.132

The optimizing compiler with a 1-branch-delay processor makes operations about 13.2% faster than
the same processor without an optimizing compiler.
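The arithmetic can be checked with a few lines of Python (my own sketch; N and the clock period factor out of the ratio):

```python
# Problem 6.7: cycles per instruction with one branch delay slot.
branch_frac = 0.20
T_nop = 1 + branch_frac * 1.0          # every delay slot is a NOP -> 1.20 N cycles
T_opt = 1 + branch_frac * (1 - 0.70)   # 70% of slots hold useful work -> 1.06 N
speedup_pct = (T_nop / T_opt - 1) * 100
print(round(T_nop, 2), round(T_opt, 2), round(speedup_pct, 1))  # 1.2 1.06 13.2
```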
6.8

6.8 [D] Repeat Problem 6.7, but this time for a pipelined processor with two branch delay slots.
The output from the optimizing compiler is such that the first delay slot is filled with a useful instruction
70 percent of the time, but the second slot is filled with a useful instruction only 10 percent of the time.

Compare the compiler-optimized execution time for this case with the compiler-optimized execution
time for Problem 6.7. Assume that the two processors have the same clock rate. Indicate which
processor/compiler combination is faster, and determine the speedup percentage by which it is faster.
--
(a.) The execution time in cycles if all delay slots are filled with NOP instructions:
T_NOP = N(1 + 0.2 × 2) = 1.4N cycles

(b.) The execution time in cycles if the delay slots are filled as described above:
Branches: 20% of operations
Slot #1: 30% No-Operation, 70% useful operation
Slot #2: 90% No-Operation, 10% useful operation
--
T_OPT = N(1 + 0.2 × (0.3 + 0.9)) = N(1 + 0.2 × 1.2) = 1.24N cycles
(c.) From these expressions, determine the compiler's contribution to the increase in performance,
expressed as a speedup percentage, and compare the two processor/compiler combinations.
--
On this processor, the optimizing compiler gives a speedup of 1.4N / 1.24N ≈ 1.129: an optimizing
compiler for a 2-branch-delay processor makes executions about 12.9% faster than the same processor
without an optimizing compiler.
Comparing across processors at the same clock rate, the 1-delay-slot pipeline with optimizing compiler
(Problem 6.7) is best: 1.24N / 1.06N ≈ 1.17, so that combination is about 17% faster.
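The same check in Python (my own sketch; 1.06N is the compiler-optimized count from Problem 6.7):

```python
# Problem 6.8: two delay slots; slot 1 useful 70% of the time, slot 2 only 10%.
branch_frac = 0.20
T_nop = 1 + branch_frac * 2.0                        # 1.40 N cycles
T_opt = 1 + branch_frac * ((1 - 0.70) + (1 - 0.10))  # 1.24 N cycles
print(round(T_nop / T_opt, 3))   # compiler speedup on this processor (~1.129)
print(round(T_opt / 1.06, 2))    # vs. the 1-slot machine of Problem 6.7 (~1.17)
```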
6.14


6.14 [E] Assume that a program contains no branch instructions. It is executed on the superscalar
processor shown in Figure 6.13.
(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of
75 percent arithmetic instructions and 25 percent memory-access instructions?
Arithmetic (75%)+ Memory Access(25%)
--
Given the assumptions marked on Figure 6.13 (one cycle per stage, except two cycles for memory
access), the arithmetic path takes on average:
T_arith = (0.75N instructions) × (1 cycle) = 0.75N cycles
Again, given the same assumptions, the load/store path takes:
T_mem = (0.25N instructions) × (2 cycles) = 0.50N cycles
Because a significantly unequal number of instructions is passed through each pipeline, we can safely
assume that the load/store unit (1/4 of total instructions, the MINORITY) will be able to finish before
the arithmetic unit (3/4 of total instructions, the MAJORITY).
[Figure 6.13 omitted: superscalar processor with separate arithmetic and load/store units; my
assumptions, marked in yellow in the original, give each stage one cycle except the two-cycle memory
access.]

Comparing the two paths, 0.75N cycles (arithmetic) vs. 0.50N cycles (load/store), our limiting
bottleneck is the arithmetic sector of the pipeline:
T_superscalar ≈ 0.75N cycles
(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2
using the same clock?
--
Given the assumptions marked on Figure 6.2 (one cycle per stage, except two cycles for memory
access), the simple pipeline will be limited by the memory accesses:
T_simple = N × (0.75 × 1 cycle + 0.25 × 2 cycles) = 1.25N cycles
[Figure 6.2 omitted: simple five-stage pipeline; my assumptions, marked in yellow in the original, give
each stage one cycle except the two-cycle memory access.]
Thus our comparison now becomes:
T_superscalar ≈ 0.75N cycles vs. T_simple = 1.25N cycles
Because both processors deal with the same set of instructions and the same clock, the comparison
simplifies to 0.75N vs. 1.25N cycles: we save 0.5N cycles by re-routing memory access to a separate
pipeline. We could have saved even more cycles with a more balanced mix of instructions (e.g.,
Arithmetic 50%, Memory Access 50%).
Furthermore, looking at our speedup ratio:
Speedup = T_simple / T_superscalar = 1.25N / 0.75N ≈ 1.67
And so we find that our double-path superscalar processor is about 67% faster than a single-path
pipelined processor.
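A short Python sketch of the comparison (my own, under the same one-cycle / two-cycle assumptions; N factors out):

```python
# Problem 6.14: best-case cycles per N instructions for the two machines.
arith_frac, mem_frac = 0.75, 0.25
superscalar = max(arith_frac * 1, mem_frac * 2)  # parallel units: 0.75 N bottleneck
simple = arith_frac * 1 + mem_frac * 2           # single path: 1.25 N
print(superscalar, simple, round((simple / superscalar - 1) * 100))  # 0.75 1.25 67
```

The `max` expresses that the two execution units run in parallel, so the slower (busier) unit sets the finish time.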

6.15

HAS BRANCHES
6.15 [M] Repeat Problem 6.14 to find the best possible execution times for the processors in Figures 6.2
and 6.13, assuming that the mix of instructions consists of 15 percent branch instructions that are never
taken, 65 percent arithmetic instructions, and 20 percent memory access instructions. Assume a
prediction accuracy of 100 percent for all branch instructions (so the branch delay is minimized).
--
(a.) What is the best execution time in cycles that can be expected if the mix of instructions consists of
Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%) instructions?
Arithmetic (65%) + Memory Access (20%) + Branch Never Taken (15%)
--
Making the same assumptions as in Problem 6.14, marked on Figure 6.13: because the number of
arithmetic instructions outweighs the number of memory accesses, the bottleneck constraint will again
originate from the arithmetic sector of the pipeline, where:
T_arith = (0.65N instructions) × (1 cycle) = 0.65N cycles
T_mem = (0.20N instructions) × (2 cycles) = 0.40N cycles
We now also have the additional possibility of a branch delay, HOWEVER THIS IS TAKEN CARE OF by
the fetch unit, as a precursor to the execution stage, so branches never occupy an execution unit:
T_superscalar ≈ 0.65N cycles
So, in this scenario, our best execution time is the SAME regardless of branching operations.




(b.) How does this time compare to the best execution time on the simpler processor in Figure 6.2
using the same clock?
--
Making the same assumptions as in Problem 6.14, marked on Figure 6.2, the simple pipeline must also
issue every branch instruction through its single path:
T_simple = N × (0.65 × 1 + 0.20 × 2 + 0.15 × 1) = 1.2N cycles
Thus our comparison becomes:
T_superscalar ≈ 0.65N cycles vs. T_simple = 1.2N cycles
Because both processors deal with the same set of instructions and the same clock, the comparison
simplifies to 0.65N vs. 1.2N cycles. The simple pipeline takes longer because the branch instructions
occupy its single path, while the superscalar pipeline is essentially unaffected by branching
considerations (they are handled in the fetch stage).
Furthermore, looking at our speedup ratio:
Speedup = T_simple / T_superscalar = 1.2N / 0.65N ≈ 1.85
And so we find that our double-path superscalar processor is about 85% faster than a single-path
pipelined processor when we also take branching effects into consideration.
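The branch-aware version of the sketch (my own, same assumptions as in Problem 6.14):

```python
# Problem 6.15: add never-taken, perfectly predicted branches to the mix.
arith_frac, mem_frac, branch_frac = 0.65, 0.20, 0.15
# Branches are resolved by the fetch unit, so they never occupy an execution unit:
superscalar = max(arith_frac * 1, mem_frac * 2)           # 0.65 N cycles
# The simple pipeline still issues every branch through its single path:
simple = arith_frac * 1 + mem_frac * 2 + branch_frac * 1  # 1.20 N cycles
print(round(simple / superscalar, 2))  # ~1.85x speedup
```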

Supplementary (a, b, c)






KEY For Tables Of Operations
Fetch = F
Decode = D
Execute (using processor hardware) = E
Execute with register renaming (not using processor hardware) = E (distinguished by highlighting in the original)
Write Back = W
Waiting on = the register (e.g. R6) or instruction number (e.g. ONE) being waited on
NOTE: This method of display was developed in collaboration with Nathan Genetzky.
(Supplementary.A.) Completed the preceding table of operations (out-of-order execution).
NOTE: Horizontal = temporal axis.

F D E W
F D R3 E E E E E W
F D ONE E W
F D R6 R6 R6 R6 R6 E W
F D E W
F D R7 E W
F D FIVE E W
F D R1 R1 R1 E W
F D E E E E E W
F D R6 R6 R6 R6 R6 E W
F D SEVEN E W
R3 R7 R7 R6,R2 R5 R1 R0 R3 R6 R2




OOPS, ERROR: instruction 8 cannot use the ALU in the same cycle as instruction 3.

(B.2) What is the savings using out-of-order capability?
--
We save SEVEN cycles: 11 operations complete in 17 cycles with out-of-order capability instead of 24 cycles
without it. This is really significant!
(Supplementary.B.) Completed the preceding table of operations: NO out-of-order capability, NO register renaming.
NOTE: Horizontal = temporal axis.

F D E W
F D R3 E E E E E W
F D ONE ONE ONE ONE ONE E W
F D R6 R6 R6 R6 R6 E W
F D THREE THREE THREE THREE E W
F D R7 R7 R7 R7 R7 R7 E W
F D FIVE FIVE FIVE FIVE FIVE FIVE E W
F D R1 R1 R1 SIX SIX SIX E W
F D SEVEN SEVEN SEVEN SEVEN SEVEN SEVEN E E E E E W
F D R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 R6 E W
F D NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE NINE E W
R3 R6 R7 R1 R7 R2 R5 R0 R6 R2 R3




(Supplementary.C) Re-completed the table of operations: OUT-OF-ORDER CAPABILITY (superscalar).
NOTE: Horizontal = temporal axis.

F D E W
F D R3 R3 E E E E E W
F D E W
F D R6 R6 R6 R6 R6 R6 R6 E W
F D E W
F D R7 R7 E W
F D E W
F D R1 R1 R1 R1 R1 R1 R1 E W
F D E E E E E W
F D R6 R6 R6 R6 R6 R6 E W
F D E W
R3 R7 R7 R5 R2 R3 R6 R1,R6 R0,R2


(A.) OUT-OF-ORDER EXECUTION
Instruction              Fetch   Decode   Execute   Writeback
0 ADD R3, R1, R2 0 1 2 3
1 LOAD R6, [R3] 1 2 4 9
2 AND R7, R5, 3 2 3 5 7
3 ADD R1, R6, R0 3 4 10 11
4 SRL R7, R0, 8 4 5 6 7
5 OR R2, R4, R7 5 6 8 9
6 SUB R5, R3, R4 6 7 9 10
7 ADD R0, R1, R10 7 8 12 13
8 LOAD R6, [R5] 8 9 10 15
9 SUB R2, R1, R6 9 10 16 17
10 AND R3, R7, 15 10 11 13 14


(C.) SUPERSCALAR (2 instructions each stage) WITH OUT-OF-ORDER CAPABILITIES
Instruction              Fetch   Decode   Execute   Writeback
0 ADD R3, R1, R2 0 1 2 3
1 LOAD R6, [R3] 0 1 4 9
2 AND R7, R5, 3 1 2 3 4
3 ADD R1, R6, R0 1 2 10 11
4 SRL R7, R0, 8 2 3 4 5
5 OR R2, R4, R7 2 3 6 7
6 SUB R5, R3, R4 3 4 5 6
7 ADD R0, R1, R10 3 4 12 13
8 LOAD R6, [R5] 4 5 6 11
9 SUB R2, R1, R6 4 5 12 13
10 AND R3, R7, 15 5 6 7 8
(B.) NO OUT-OF-ORDER OPERATIONS
Instruction              Fetch   Decode   Execute   Writeback
0 ADD R3, R1, R2 0 1 2 3
1 LOAD R6, [R3] 1 2 4 9
2 AND R7, R5, 3 2 3 9 10
3 ADD R1, R6, R0 3 4 10 11
4 SRL R7, R0, 8 4 5 11 12
5 OR R2, R4, R7 5 6 13 14
6 SUB R5, R3, R4 6 7 14 15
7 ADD R0, R1, R10 7 8 15 16
8 LOAD R6, [R5] 8 9 16 21
9 SUB R2, R1, R6 9 10 22 23
10 AND R3, R7, 15 10 11 23 24
OOPS, ERROR: instruction 8 cannot use the ALU in the same cycle as instruction 3.
