Surname:
Instructor:
Course:
Date:
Chapter 1
High-level languages influence the processor by requiring that language constructs be translated
efficiently into instructions that the computer can execute.
3. Answer true or false, with justification: “The compiler writer is intimately aware of the
details of the processor implementation.”
False. The compiler writer must know the details of the ISA, but most of the processor
implementation is of no use to the compiler writer.
4. Explain the levels of abstraction found inside the computer, from the silicon substrate
to a complex multi-player video game.
- The semiconductor materials possess electrical properties that enable them to be used to make
transistors, which are switches.
- The transistors can be connected to implement logic gates, simple circuits that realize
logic functions such as AND, OR, and NOT.
- Logic gates can be connected to form functional units, which execute operations such as
decoding an n-bit binary number to select one of n outputs, or adding two n-bit binary
numbers together.
- The logic elements and logic gates can be used to build devices such as memories or state
machines that can be used to control logic circuits. These elements are joined to create a data path
and control system, which is the processor.
- The instruction set tells the processor what to execute within its capability limits (e.g.,
add two numbers together, fetch something from memory, etc.).
- The compiler is then developed to translate a program written in a high-level language
into instructions drawn from the instruction set.
- Computer programs written in the high-level language are connected using networking and
communication technology, enabling people to interact and play games.
True and false; it depends on interpretation. Details of the transistors can change
based on speed vs. power-consumption trade-offs. All computers need the ability to add,
and the essential circuitry for it will be similar, but the organization of logic elements
and the design of the data path and control system might be entirely different between a graphics
processor and one used to control a hearing aid.
Chapter 2
A frame pointer acts as a reference point: the value in the frame pointer register is the
address from which the addresses of the local variables are determined. A stack pointer, on the
other hand, is a value in a processor register that holds the memory address of the top of the
stack. In most cases this value decreases as the stack grows (i.e., the stack grows from high
addresses toward low ones). Any time a push is done, the stack pointer value decreases.
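The push/pop behavior described above can be sketched in Python (a toy model, not LC-2200 code; the starting address 1000 and the sparse word-addressable memory are illustrative assumptions):

```python
class Stack:
    """Toy model of a downward-growing stack in word-addressable memory."""
    def __init__(self, top=1000):       # 1000 is an arbitrary starting address
        self.mem = {}                   # sparse memory: address -> value
        self.sp = top                   # stack pointer starts at the high end

    def push(self, value):
        self.sp -= 1                    # a push DECREASES the stack pointer
        self.mem[self.sp] = value

    def pop(self):
        value = self.mem[self.sp]
        self.sp += 1                    # a pop moves it back up
        return value

s = Stack()
s.push(42)
s.push(7)
assert s.sp == 998                      # two pushes moved SP down by two words
assert s.pop() == 7 and s.pop() == 42   # LIFO order
assert s.sp == 1000
```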
3. In the LC-2200 architecture, where are operands normally found for an add
instruction?
BZ: that’s if (a == 0)
BEQ: that’s if (a == b)
BN: that’s if (a < 0) or if (a < b) which becomes if (a-b < 0)
BP: that’s if (a > 0) or if (a>b) which becomes if (a-b > 0)
8. Procedure A has important data in both S and T registers and is about to call procedure
B. Which registers should B store on the stack?
Procedure A should save the T registers before calling procedure B. B should save any S
registers it uses OR save any T registers it needs before calling another function.
ISA, or Instruction Set Architecture, defines the machine code that a processor reads and acts upon,
as well as the word size, memory addressing modes, processor registers, and data types.
15. What are the influences on instruction set design?
Instruction sets are influenced by ease of implementation, efficiency, and ease of programming.
Larger instruction sets make writing compilers easier, but at the detriment of speed or ease of
implementation.
Chapter 3
1. What is the difference between level triggered logic and edge triggered logic? Which do
we use? Why?
In level triggered logic, register contents change state from current to new while the clock
signal is high, whereas in edge triggered logic the change to the register contents happens
on the rising or falling clock edge. The rising clock edge gives positive edge
triggered logic and the falling edge gives negative edge triggered logic.
2. Given the FSM and state transition diagram for a garage door opener (Figure 3.12-a and
Table 3.1) implement the sequential logic circuit for the garage door opener.
(Hint: The sequential logic circuit has 2 states and produces three outputs, namely, next
state, up motor control and down motor control).
6. What are the advantages and disadvantages of a bus-based data path design?
The advantage of a bus-based data path design is that data signals are available to every
piece of hardware in the circuit, hence no worry about sending signals to multiple devices. The
disadvantage of this design is that there is a limit on how many signals can be sent in
each clock cycle. For instance, in a single-bus design, only one signal can be sent out each
clock cycle, which makes the data path function less efficiently. In addition, there are cost
and space problems that arise from having so many wires.
9. The Instruction Fetch is implemented in the text with first 4 states and then three.
What would have to be done to the datapath to make it two states long?
In order to implement the Instruction Fetch in 2 states, a second bus would have to be
incorporated into the datapath that could either take MEM[MAR] to the IR in the first
state, or take A+1 to the PC in the second state, thereby combining 2 of the present
states into 1 and eliminating 1 state overall.
10. How many words of memory will this code snippet require when assembled? Is space
allocated for “L1”?
This code snippet, when assembled, would require 3 words, 1 per instruction. No space is
allocated for L1. Instead, when the code is assembled, the L1 reference made in the first
beq is replaced with the line number of the instruction it refers to, which in this case would
be line 2, rendering the L1 label no longer useful.
One of the advantages of fixed-length instructions is that they ease and simplify instruction
pipelining, which enables single-clock throughput at high frequencies. Variable-length
instructions make it difficult to decouple memory fetches, requiring the processor to fetch
part of an instruction, then decide whether to fetch more, possibly missing in the cache before
the instruction is complete, whereas fixed length allows the full instruction to be fetched in
one access, increasing speed and efficiency.
A leaf procedure refers to a procedure that never calls any other procedure.
14. Suppose you are writing an LC-2200 program and you want to jump a distance that is
farther than allowed by a BEQ instruction. Is there a way to jump to an address?
If you are writing an LC-2200 program and you want to jump a distance that is further than
allowed by a BEQ instruction, you can jump using a JALR instruction. This instruction
stores PC+1 into Reg Y, where PC is the address of the current JALR instruction. Then, it
branches to the address currently in Reg X, which can be any address you want to go to,
any distance. You can also use JALR to perform an unconditional jump, where, after using
the JALR instruction, you discard the value stored in Reg Y.
15. Could the LC series processors be made to run faster if it had a second bus? If your
answer was no, what else would you need?
Another ALU
An additional mux
A TPRF
The LC series processors could not be made to run faster with only a second bus. However,
if they also had a DPRF (Dual-Ported Register File) instead of the temporary A and B
registers, they could run faster. Using a DPRF to get values to be operands for the ALU
enables both buses to drive values to the DPRF at the same time, allowing the ALU
operations, and therefore the overall performance of the processor, to run faster.
Chapter 4
1. Upon an interrupt what has to happen implicitly in hardware before control is transferred to
the interrupt handler?
The hardware must implicitly save the program counter value before control is transferred
to the handler. The hardware must also determine the address of the handler in order to transfer
control from the currently executing program to the handler. Depending on the architecture,
interrupts may also be disabled.
4. How does the processor know which device has requested an interrupt?
Initially the processor has no knowledge. However, once the processor acknowledges the
interrupt, the interrupting device supplies the processor with its identity. For instance,
the device might supply the address of its interrupt handler, or it might supply a vector
into a table of interrupt handler addresses.
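The vectored case can be sketched in Python (a hypothetical model; the device numbers and handler names are made up for illustration):

```python
# The acknowledged device supplies a vector; the processor uses it to index
# a table of interrupt handler addresses (modeled here as functions).
def disk_handler():
    return "disk handler ran"

def keyboard_handler():
    return "keyboard handler ran"

interrupt_vector_table = {0: disk_handler, 1: keyboard_handler}

def on_interrupt_ack(vector):
    handler = interrupt_vector_table[vector]   # look up the handler "address"
    return handler()                           # transfer control to it

assert on_interrupt_ack(1) == "keyboard handler ran"
```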
6. In the following interrupt handler code select the ONES THAT DO NOT BELONG.
___X___ disable interrupts;
___X___ save PC;
_______ save $k0;
_______ enable interrupts;
_______ save processor registers;
_______ execute device code;
_______ restore processor registers;
_______ disable interrupts;
7. In the following actions in the INT macro state select the ONES THAT DO NOT BELONG.
___X__ save PC;
___X__ save SP;
______ $k0←PC;
___X__ enable interrupts;
___X__ save processor registers;
______ ACK INT by asserting INTA;
______ Receive interrupt vector from device on the data bus;
______ Retrieve PC address from the interrupt vector table;
___X__ Retrieve SP value from the interrupt vector table;
___X__ disable interrupts
___X__ PC←PC retrieved from the vector table;
______ SP←SP value retrieved from the vector table;
Chapter 5
1. True or false: For a given workload and a given instruction-set architecture, reducing the
CPI (clocks per instruction) of all the instructions will always improve the performance
of the processor?
False. Processor execution time depends on the number of instructions, the average CPI, and the
clock cycle time. If decreasing the average CPI requires lengthening the clock cycle time, no
improvement may be achieved; performance may even decrease.
3. What would be the execution time for a program containing 2,000,000 instructions if the
processor clock was running at 8 MHz and each instruction takes 4 clock cycles?
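As a sanity check of question 3 using the standard performance equation (execution time = instruction count × CPI / clock rate):

```python
instructions = 2_000_000
cpi = 4                              # clock cycles per instruction
clock_hz = 8_000_000                 # 8 MHz

cycles = instructions * cpi          # 8,000,000 cycles in total
exec_time_s = cycles / clock_hz
assert exec_time_s == 1.0            # the program takes 1 second
```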
Notes:
a) Feedback lines: these are useful in the hardware implementation of the pipeline. They tell
the previous stage of the pipeline that the current instruction is still being processed, so it
should not send any more instructions; a NOP (dummy operation) is sent to the next stage of
the pipeline until the current instruction is ready to proceed.
b) Data forwarding: the EX stage forwards the new value of the register being written to the
ID/RR stage, so that it can work with the updated value of the register.
c) Branch prediction: the idea here is to predict the outcome of the branch and let instructions
flow along the pipeline based on this prediction. If the prediction is correct, it completely
eliminates stalls. If the prediction is incorrect, it creates 2 stalls.
d) Delayed branch: the idea here is to find a useful instruction that we can feed the pipeline with,
while we test the branch instruction.
10. How can a read after write (RAW) hazard be minimized or eliminated?
RAW hazards can be eliminated by adding data forwarding to a pipelined datapath. This
involves adding mechanisms to send the data being written to a register back to the earlier
pipeline stages that want to read from the same register.
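A minimal sketch of the idea in Python (a toy model of a single EX→ID/RR forwarding path, not the book's full datapath; the register names and values are illustrative):

```python
regs = {"t0": 0, "t1": 5, "t2": 7}      # register file (t0 not yet written back)
# Instruction 1 (add t0, t1, t2) has its result sitting in the EX-stage buffer:
ex_buffer = {"dest": "t0", "value": regs["t1"] + regs["t2"]}

def read_reg(name, forwarding):
    # With forwarding, ID/RR takes the value straight from the EX buffer
    # instead of the stale register-file copy.
    if forwarding and ex_buffer["dest"] == name:
        return ex_buffer["value"]
    return regs[name]

assert read_reg("t0", forwarding=False) == 0    # RAW hazard: stale value
assert read_reg("t0", forwarding=True) == 12    # forwarded: correct value
```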
14. Regardless of whether we use a conservative approach or branch prediction (branch not
taken), explain why there is always a 2-cycle delay if the branch is taken (i.e., 2 NOPs
injected into the pipeline) before normal execution can resume in the 5-stage pipeline used
in Section 5.13.3.
The processor does not know whether the branch is taken until the BEQ instruction is in the
execution stage. Since there is no mechanism to load the instruction from the new PC before
the branch outcome is known, new instructions can only be loaded after the BEQ instruction has
exited the execution stage, which occurs 2 cycles after it has been loaded.
18. Using the 5-stage pipeline shown in Figure 5.6c answer the following two questions:
a. Show the actions (similar to Section 5.12.1) in each stage of the pipeline for BEQ
instruction of LC-2200.
a) IF stage (cycle 1):
I-MEM [PC] -> FBUF
PC + 1 -> PC
A process is a program in execution. A program is static, has no state, and has a fixed
size on disk, whereas a process is dynamic, exists in memory, may grow or shrink, and
has associated with it “state” that represents the information associated with the
execution of the program.
Round Robin
10. Given the following processes that arrived in the order shown
Show the processor activities and the I/O area using the FCFS, SJF, and Round Robin
algorithms.
Assuming each process requires a CPU burst followed by an I/O burst followed by a final
CPU burst (as in Example 1 in Section 6.6):
FCFS
SJF
1. Consider Google Earth application. You launch the application, move the mouse on the
earth’s surface, and click on Mount Everest to see an up-close view of the mountain
range. Identify the interactions in layman’s terms between the operating system and the
hardware during the above sequence of actions.
Launching the application causes the operating system to create a process, which requests from
the operating system a connection to Google Earth. Each action performed by the user
either alters the state of the program and requests that output be performed by the operating
system, or alters the state of the program and requests that the operating system send information
to Google Earth asking for additional information.
3. Answer True or False with justification: “The compiler writer is intimately aware of
the details of the processor implementation.”
False: The compiler writer must know details such as the instruction set architecture, but many
details of the processor implementation are of no use or interest to the compiler writer.
4. Explain the levels of abstraction found inside the computer from the silicon substrate
to a complex multi-player video game.
The semiconductor materials possess electrical properties that enable them to be used to make
transistors, which are switches.
The transistors can be connected to implement logic gates, simple circuits that realize
logic functions such as AND, OR, and NOT.
Logic gates can be connected together to form functional units, which execute operations such as
decoding an n-bit binary number to select one of n outputs, or adding two n-bit binary
numbers together.
The logic elements and logic gates can be used to build devices such as memories or state
machines that can be used to control logic circuits. These elements are joined to create a data path
and control system, which is the processor.
The instruction set tells the processor what to execute within its capability limits (e.g., add
two numbers together, fetch something from memory, etc.).
The compiler is then developed to translate a program written in a high-level language
into instructions drawn from the instruction set.
Computer programs written in high-level languages are connected using networking
and communication technology, enabling people to interact and play games.
True and false; it depends on interpretation. Details of the transistors can change
based on speed vs. power-consumption trade-offs. All computers need the ability to add,
and the essential circuitry for it will be similar, but the organization of logic elements
and the design of the data path and control system might be quite different between a graphics
processor and one used to control a hearing aid.
6. What is the role of a “bridge” between computer buses as shown in Figure 1.8?
It acts as a kind of translator/communication path between the two buses, which
may use dissimilar operational protocols.
7. What is the role of a “controller” in Figure 1.8?
Controllers appear to the computer to be memory locations that are in reality control registers
for the particular I/O device being controlled. The controller takes the information supplied by
the processor and converts it into the appropriate control signals for the I/O device, and/or
retrieves information from the device and sets bits in control registers so that the processor can
receive the information.
8. Using the Internet, research and explain 5 major milestones in the evolution of computer
hardware.
Examples:
Vacuum tubes to transistors
Integrated circuits
Disk drives
Display technology (from paper to glass)
Networking
9. Using the Internet, research and explain 5 major milestones in the evolution of the
operating system.
Examples:
Multiprogramming
Scheduling
Time sharing
GUI Interface
Parallel operating systems
Error recovery
10. Compare and contrast grid computing and the power grid. Explain how the analogy makes
sense. Also, explain how the analogy breaks down.
There is an interconnected network of devices serving some useful purpose in both cases.
The generating stations can be thought of as powerful resources supplying
power to the industrial and residential consumers of electricity. In grid computing, information
flow is more two-way (or even n-way). There are differences in the way things are paid for:
in the electric grid the consumers pay the producers, whereas in grid computing
additional revenue streams may be provided by advertisers or others wishing to use
information generated by the grid. In the power grid there are a relatively small number of
suppliers compared to a vast number of consumers. In grid computing there would perhaps be
more consumers than producers, but far more producers than in the case of the power grid.
11. Match the left and right hand sides.
Unix operating system: Thompson and Ritchie
Microchip: Kilby and Noyce
FORTRAN programming language: Backus
C programming language: Ritchie
Transistor: Bardeen, Brattain, and Shockley
World's first programmer: Lovelace
World's first computing machine: Babbage
Vacuum Tube: De Forest
ENIAC: Mauchly and Eckert
Linux operating system: Torvalds
Disagree.
By judicious use of calling conventions defining saved and temporary registers, call/return
overhead is manageable to any desired level of performance.
A frame pointer acts as a reference point: the value in the frame pointer register is the
address from which the addresses of the local variables are determined. A stack pointer, on the
other hand, is a value in a processor register that holds the memory address of the top of the
stack. In most cases this value decreases as the stack grows (i.e., the stack grows from high
addresses toward low ones). Any time a push is done, the stack pointer value decreases.
Note: this is not to say that when a function calls another function the frame pointer will
remain fixed. It will not. Rather, it will be changed on call and reestablished upon return;
thus, throughout the execution of a given function's own code, it will be fixed.
3. In the LC-2200 architecture, where are operands normally found for an add
instruction?
4. Endianness: Let’s say you want to write a program for comparing two strings. You
have a choice of using a 32-bit byte-addressable Big-endian or Little-endian
architecture to do this. In either case, you can pack 4 characters in each word of 32 bits.
Which one would you choose and how will you write such a program? [Hint:
Normally, you would do string comparison one character at a time. If you can do it a
word at a time instead of a character at a time, that implementation will be faster.]
The choice of endianness does not matter for an equality check, so long as you are consistent
in the comparison. Subtract each word in string A from the corresponding word in string B; if
every result is zero, the strings are identical. Note that you can only perform this operation
on the same system. The operation is more complex if you are trying to compare data from a
Big-endian system with data from a Little-endian system; there you would have to compare
character by character.
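The equality argument can be checked in Python with a word-at-a-time model; the sketch below also shows why Big-endian has an edge if you want lexicographic ordering rather than just equality (the packing helper and the test strings are illustrative, not from the text):

```python
import struct

def words(s, byte_order):
    """Pack a string 4 ASCII chars per 32-bit word, padded with NULs."""
    data = s.encode("ascii")
    data = data.ljust(-(-len(data) // 4) * 4, b"\0")   # round up to a word
    return struct.unpack(byte_order + "I" * (len(data) // 4), data)

a, b = "abce", "abda"                       # character-wise, a < b
assert a < b
# Equality tests work with either byte order:
assert words(a, ">") != words(b, ">") and words(a, "<") != words(b, "<")
# Ordering: big-endian words compare in the same order as the characters...
assert words(a, ">") < words(b, ">")
# ...but little-endian words do not, so word-at-a-time ordering breaks there.
assert not (words(a, "<") < words(b, "<"))
```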
BZ: if (a == 0)
BEQ: if (a == b)
BN: if (a < 0) or if (a < b) which becomes if (a-b < 0)
BP: if (a > 0) or if (a>b) which becomes if (a-b > 0)
6. We said that endianness will not affect your program performance or correctness so
long as the use of a (high level) data structure is commensurate with its declaration.
Are there situations where even if your program does not violate the above rule, you
could be bitten by the endianness of the architecture? [Hint: Think of programs that
cross network boundaries.]
Yes: if data from a big-endian computer were transferred over a network to a little-endian
computer, data corruption could be experienced. However, this problem has been solved for the
Internet and for networks using similar technology. The solution is for the network to use a
standard endianness (network byte order). If the endianness of the network differs from that of
the host computer, the host's network interface (or networking software) applies the appropriate
conversion.
7. Work out the details of the implementing the switch statement using jump tables in
assembly using any flavor of conditional branch instruction. [Hint: After ensuring
that the value of the switch variable is within the bounds of valid case values, jump to
the start of the appropriate code segment corresponding to the current switch value,
execute the code and finally jump to exit.]
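The jump-table idea can be sketched in Python, where functions stand in for the case-code addresses (the case bodies and names are made up for illustration): bounds-check the switch value, then make an indirect jump through the table, then fall out to a common exit.

```python
def case0(): return "zero"
def case1(): return "one"
def case2(): return "two"

jump_table = [case0, case1, case2]    # table of "code addresses"

def switch(value):
    if not (0 <= value < len(jump_table)):   # bounds check the switch variable
        return "default"
    return jump_table[value]()               # indirect jump through the table

assert switch(1) == "one"
assert switch(7) == "default"
```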
8. Procedure A has important data in both S and T registers and is about to call
procedure B. Which registers should A store on the stack? Which registers should B
store on the stack?
Procedure A should save the T registers before calling procedure B. B should save any S
registers it uses OR save any T registers it needs before calling another function.
9. Consider the usage of the stack abstraction in executing procedure calls. Do all
actions on the stack happen only via pushes and pops on to and from the top of the
stack? Explain circumstances that warrant reaching into other parts of the stack
during program execution. How is this accomplished?
No. The amount of memory included in the stack at any given time is controlled by simply
changing the stack pointer value. Values may then be read or written at locations defined
as offsets from the address stored in the stack pointer (or frame pointer).
False: the frame pointer is not strictly necessary to implement procedure calls, but it can make
code simpler.
11. DEC VAX had a single instruction for loading and storing all the program visible
registers from/to memory. Can you see a reason for such an instruction pair?
Consider both the pros and cons.
Pros: If you are a caller and need to use all of the saved and temporary registers, you can
perform this operation in one call. If you want to save the current state of execution, it can
be done in one call.
Cons: In most cases you do not need all available registers, so this command uses more
memory (and possibly time) than required.
12. Show how you can simulate a subtract instruction using the existing LC-2200 ISA?
Since our system uses 2's complement, the negative of a number X is (NOT X) + 1.
The LC-2200 does not have direct support for NOT, but NAND can serve the same function:
B ← B NAND B ; B = NOT B
B ← B + 1 ; B = -B
A ← A + B ; net result is A - B
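The NAND-based subtraction above can be checked in Python with 32-bit wraparound (a sketch under the assumption of 32-bit registers; the masking models the fixed register width):

```python
MASK = 0xFFFFFFFF                 # model 32-bit registers

def nand(a, b):
    return ~(a & b) & MASK

def sub(a, b):
    not_b = nand(b, b)            # NAND(b, b) == NOT b
    neg_b = (not_b + 1) & MASK    # two's complement: -b = (NOT b) + 1
    return (a + neg_b) & MASK     # a + (-b) == a - b

assert sub(10, 3) == 7
assert sub(3, 10) == (-7) & MASK  # wraps to the 32-bit encoding of -7
```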
13. The BEQ instruction restricts the distance you can branch to from the current
position of the PC. If your program warrants jumping to a distance larger than that
allowed by the offset field of the BEQ instruction, show how you can accomplish such
“long” branches using the existing LC-2200 ISA.
Note: Assume the address of the location that is a "long" way away is in $s2
The ISA (Instruction Set Architecture) serves as a kind of contractual document that
enables all parties concerned with the design, implementation, and use of the processor to
know what is expected of them and what resources will be provided by that processor.
• Implementation engineers can come up with the detail that will allow the processor
to meet the ISA specification
• Assembler and compiler writers create appropriate assemblers and compilers for use
with the processor long before a working model even exists.
• Operating system designers/maintainers determine what needs to be done to
enable their operating system to run on this processor.
• I/O Device engineers design controllers and driver software that will be used with
the processor.
• Box (or equivalent) engineers can determine how to use the processor in their
designs etc.
16. What are conditional statements and how are they handled in the ISA?
An addressing mode specifies how the bits of the instruction encode the locations of the
operands. For instance, some of the bits might be a register number or an offset to be added
to the PC, etc.
18. In Section 2.8, we mentioned that local variables in a procedure are allocated on the
stack. While this description is convenient for keeping the exposition simple, modern
compilers work quite differently. This exercise is for you to search the Internet and
find out how exactly modern compilers allocate space for local variables in a
procedure call. [Hint: Recall that registers are faster than memory. So, the objective
should be to keep as many of the variables in registers as possible.]
Generally, many local variables that would technically be located in memory on the stack are
instead maintained in registers, because of the significant speed advantage registers enjoy.
We have already noted the conventions for saved and temporary registers, argument registers,
and return-value and return-address registers; all of these are aimed at increasing speed and
efficiency. In addition, modern optimizing compilers employ sophisticated register allocation
strategies designed to maximize the use of registers. However, arrays and structures are
maintained on the stack and not in registers.
We use the term abstraction to refer to the stack. What is meant by this term? Does
the term abstraction imply how it is implemented? For example, is a stack used in a
procedure call/return a hardware device or a software device?
Show how you can realize the effect of the following instruction:
Assume that the registers and the Imm field are 8-bits wide. You can ignore the case that
the SUB instruction causes an overflow.
Solution:
Show how to realize a new addressing mode, called indirect, for use with the load
instruction that is represented in assembly language as:
LW Rx, @ (Ry);
The semantics of this instruction is that the contents of register Ry is the address of a
pointer to the memory operand that must be loaded in Rx.
Solution:
LW Rx, Ry, 0 ; Rx ← MEM[Ry], the pointer
LW Rx, Rx, 0 ; Rx ← MEM[Rx], the memory operand
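The double dereference can be modeled in Python with a dictionary standing in for memory (the addresses 100 and 200 and the value 42 are made up for illustration):

```python
mem = {100: 200, 200: 42}   # Ry holds 100; MEM[100] is a pointer to MEM[200]
ry = 100

rx = mem[ry + 0]            # LW Rx, Ry, 0  -> Rx holds the pointer (200)
rx = mem[rx + 0]            # LW Rx, Rx, 0  -> Rx holds the operand (42)
assert rx == 42
```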
22. Convert this statement:
g = h + A[i];
into LC-2200 assembly with the assumption that the address of A is located in $t0, g is
in $s1, h is in $s2, and i is in $t1.
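A Python model of the address arithmetic the assembly must perform (word-addressable memory; the base address 500 and the data values are illustrative assumptions, not from the text):

```python
mem = {500: 11, 501: 22, 502: 33}   # array A at an assumed address 500
t0, t1, s2 = 500, 2, 8              # $t0 = &A, $t1 = i, $s2 = h

t2 = t0 + t1                        # ADD: compute the address of A[i]
t2 = mem[t2]                        # LW:  load A[i] (here, 33)
s1 = s2 + t2                        # ADD: g = h + A[i]
assert s1 == 41
```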
23. Suppose you design a computer called the Big Looper 2000 that will never be used to
call procedures and that will automatically jump back to the beginning of memory
when it reaches the end. Do you need a program counter? Justify your answer.
The Big Looper 2000 needs a PC to identify which address to fetch an instruction from on each
cycle. The PC is also useful in calculating relative addresses, like the ones used with branch
instructions.
24. Consider the following program and assume that for this processor:
1. What is the difference between level triggered logic and edge triggered logic? Which
do we use? Why?
In level triggered logic, register contents change state from current to new when the clock
signal is high. In edge triggered logic, the register contents change state on the rising or
falling edge of the clock. In edge triggered logic, if the change happens on the rising edge
it is referred to as positive edge triggered logic; as opposed to change happening on the
falling edge, which is referred to as negative edge triggered logic.
We use positive edge triggered logic.
Edge triggered logic is a method which avoids certain instability problems that are found
in level triggered circuits.
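The difference can be illustrated with a small Python model (a toy sampling of one clock pulse, not real hardware behavior; the sample values are made up):

```python
clock    = [0, 1, 1, 1, 0]     # one clock pulse, sampled over time
d_inputs = [1, 2, 3, 4, 5]     # value at the register's D input at each sample

def level_triggered(clock, d):
    q = 0
    for c, v in zip(clock, d):
        if c == 1:             # captures on EVERY sample where the clock is high
            q = v
    return q

def pos_edge_triggered(clock, d):
    q, prev = 0, 0
    for c, v in zip(clock, d):
        if prev == 0 and c == 1:   # captures only on the low-to-high transition
            q = v
        prev = c
    return q

assert level_triggered(clock, d_inputs) == 4      # kept changing while clock was high
assert pos_edge_triggered(clock, d_inputs) == 2   # latched once, at the rising edge
```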
2. Given the FSM and state transition diagram for a garage door opener (Figure 3.12
and Table 3.1), implement the sequential logic circuit for the garage door opener.
(Hint: The sequential logic circuit has 2 states and produces three outputs, namely,
next state, up motor control and down motor control).
3. Re-implement the above logic circuit using the ROM plus state register approach
detailed in this chapter.
Micro programmed:
Hardwired:
5. One of the optimizations to reduce the space requirement of the control ROM based
design is to club together independent control signals and represent them using an
encoded field in the control ROM. What are the pros and cons of this approach?
What control signals can be clubbed together and what cannot be? Justify your
answer.
The main pro is that this approach can reduce the size of the control ROM. The main con is
that it adds decoding steps, which delay the data path for generating the control signals.
Drive signals can be grouped together (encoded), since only one entity should drive a given
bus at a time, but load signals cannot be, because multiple storage elements may need to be
clocked in the same clock cycle.
The advantage of a bus-based datapath design is that data signals are available to every
piece of hardware in the circuit, so you do not have to worry about sending signals to
multiple devices because they all have access. The disadvantage of this design is that you
are limited to how many signals you can send on each clock cycle. For example, in a
single bus design, only one signal can be sent out each clock cycle, which makes the
datapath function less efficiently. In addition, there are cost and space problems that arise
out of having so many wires.
7. Consider a three-bus design. How would you use it for organizing the above datapath
elements? How does this help compared to the two-bus design?
For organizing the above datapath using a 3-bus design, it would look as follows:
This design would function more efficiently than the 2-bus design: it pulls values
from memory, transmits them to the ALU, and stores the result in the register file all in one
step, as shown by the ADD instruction. A 2-bus design needs 2 clock cycles to
complete the ADD instruction, whereas the 3-bus design can do it in 1 clock cycle, using
the additional bus to drive the ALU result to the register file, where the IR can then supply
the destination register number to the register file to complete the write and therefore the
instruction.
Storing intermediate values in a register is helpful so that a value can be carried from one
instruction to another. This precludes using the PC or IR specifically for this task, since
they are needed for fetching and decoding every instruction.
9. The Instruction Fetch is implemented in the text with first 4 states and then three.
What would have to be done to the datapath to make it two states long?
For the Instruction Fetch to be implemented in 2 states, a second bus would have to be
incorporated into the datapath, which could either convey MEM[MAR] to the IR in the first
state, or take A+1 to the PC in the second state, hence combining 2 of the present states
into 1 and eliminating 1 state overall.
10. How many words of memory will this code snippet require when assembled? Is space
allocated for “L1”?
When the code snippet is assembled it requires 3 words, 1 per instruction. L1 is not
allocated space; instead, during assembly, the L1 reference made in the first beq is
replaced with the line number of the instruction it refers to, in this case line 2,
rendering the L1 label no longer useful.
The advantage of fixed-length instructions is that they ease and simplify instruction
pipelining, allowing single-instruction-per-clock throughput at high frequencies.
Variable-length instructions make it difficult to decouple memory fetches: the
processor must fetch part of the instruction, then decide whether to fetch more, possibly
missing in the cache before the instruction is complete. A fixed length allows the full
instruction to be fetched in one access, increasing speed and efficiency.
A leaf procedure is a procedure that does not call any other procedure.
13. For this portion of a datapath (assuming that all lines are 16 bits wide). Fill in the
table below.
Time A B C D E F
1 0x42 0xFE 0 0 0 0
2 0 0 0x42 0xFE 0 0
3 0xCAFE 0x1 0 0 0x140 0
4 0 0 0xCAFE 0x1 0 0x140
5 0 0 0 0 0xCAFE 0
6 0 0 0 0 0 0xCAFE
14. Suppose you are writing an LC-2200 program and you want to jump a distance that is
farther than allowed by a BEQ instruction. Is there a way to jump to an address?
If you are writing an LC-2200 program and you want to jump a distance that is farther than
allowed by a BEQ instruction, you can jump using a JALR instruction. This instruction
stores PC+1 into Reg Y, where PC is the address of the current JALR instruction. Then, it
branches to the address currently in Reg X, which can be any address you want to go to,
any distance away. You can also use JALR to perform an unconditional jump: after
executing the JALR instruction, you simply discard the value stored in Reg Y.
15. Could the LC series processor be made to run faster if it had a second bus? If your
answer was no, what else would you need?
Another ALU
An additional mux
A two-ported register file (TPRF)
A second bus by itself would not make the LC series processor run faster. It would,
however, run faster if the temporary A and B registers were replaced with a dual-ported
register file. With a dual-ported register file supplying both operands for the ALU, both
register values can be read out at the same time, one on each bus, speeding up ALU
operations and therefore the overall performance of the processor.
Translate g = h + A[i]; into LC-2200 datapath control signals (microinstructions), with
the assumption that the address of A is located in $t0, g is in $s1, h is in $s2, and i is
in $t1.
Instruction0
RegSelLo DrReg LdA
goto Instruction1
Instruction1
DrOFF LdB
goto Instruction2
Instruction2
ALU_ADD DrALU LdMAR
goto Instruction3
Instruction3
DrMem LdA
goto Instruction4
Instruction4
RegSelLo DrReg LdB
goto Instruction5
Instruction5
ALU_ADD DrALU WrReg
halt
17. Suppose you design a computer called the Big Looper 2000 that will never be used to
call procedures and that will automatically jump back to the beginning of memory
when it reaches the end. Do you need a program counter? Justify your answer.
For a computer that never calls procedures and automatically jumps back to the
beginning of memory when it reaches the end, a program counter is not strictly
necessary. A PC's purposes include pointing to the current instruction and supporting
branch, jump, and procedure-call instructions; with no procedure calls and with control
wrapping around automatically, a simple memory-address sequencer can step through
the instructions, so there is no reason to have a dedicated PC.
18. In the LC-2200 processor, why is there not a register after the ALU?
In the LC-2200 processor, there is no register after the ALU because the result of an
ALU operation is written directly into the destination register pointed to by the IR; the
result is driven from the ALU onto the bus and into the proper register in the same cycle.
19. In the datapath diagram shown in Figure 3.15, why do we need the A and B registers
in front of the ALU? Why do we need MAR? Under what conditions would you be
able to do without with any of these registers? [Hint: Think of additional ports in the
register file and/or buses.]
We need the A and B registers in front of the ALU because ALU operations (ADD,
NAND, A-B, A+1) require 2 operands, so we need temporary registers to hold at least one
of them: only 1 register value can be read out of the register file at a time, since it has
only 1 output port (Dout). Also, with only 1 bus, there is only 1 channel of communication
between any pair of datapath elements. Similarly, we need the MAR so there is a place to
hold the address sent by the ALU to the memory. We could do without some of these
registers if there were multiple buses and/or multiple output ports on the register file,
thereby allowing multiple values to be communicated simultaneously, so that the ALU
could carry out its operations while the memory looks up data at a specified address.
20. Core memory used to cost $0.01 per bit. Consider your own computer. What would be
a rough estimate of the cost if memory cost is $0.01/bit? If memory were still at that
price what would be the effect on the computer industry?
At $0.01/bit, a computer with 4 GB of memory (about 32 billion bits) would cost roughly
$320,000,000.00. If memory were still at that price, machines would ship with tiny
memories, software would have to be far more frugal, and inexpensive personal
computing would not exist.
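A quick back-of-the-envelope check in Python, assuming a machine with 4 GB of memory (decimal gigabytes):

```python
# Rough cost of a modern machine's memory at core-memory prices ($0.01/bit).
gigabytes = 4                  # assumed memory size of the computer
bits = gigabytes * 10**9 * 8   # decimal gigabytes, 8 bits per byte
cost = bits * 0.01             # $0.01 per bit
print(f"${cost:,.2f}")         # $320,000,000.00
```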
21. If computer designers focused entirely on speed and ignored cost implications, what
would the computer industry look like today? Who would the customers be? Now
consider the same question reversed: If the only consideration was cost what would
the industry be like?
If computer designers focused entirely on speed and ignored cost implications, the
computer industry would produce only supercomputers with enormous processing power,
and the customers would be large organizations with lots of money and a lot of data
to process, such as governments, large universities, and companies. However, if
designers focused entirely on cost, computers would be slow and clunky, unhelpful
and inefficient, with a very limited set of operations and the least amount of
hardware possible, and the customers would be people doing only very simple
tasks, such as students.
22. Consider a CPU with a stack-based instruction set. Operands and results for
arithmetic instructions are stored on the stack; the architecture contains no general
purpose registers.
The data path shown on the next page uses two separate memories: a 65,536 (2^16) byte
memory to hold instructions and (non-stack) data, and a 256-byte memory to hold the
stack.
stack. The stack is implemented with a conventional memory and a stack pointer register.
The stack starts at address 0, and grows upward (to higher addresses) as data are pushed
onto the stack. The stack pointer points to the element on top of the stack (or is -1 if the
stack is empty). You may ignore issues such as stack overflow and underflow.
Memory addresses referring to locations in program/data memory are 16 bits. All data are 8
bits. Assume the program/data memory is byte addressable, i.e., each address refers to an
8-bit byte. Each instruction includes an 8-bit opcode. Many instructions also include a 16-
bit address field. The instruction set is shown below. Below, "memory" refers to the
program/data memory (as opposed to the stack memory).
00000001  POP <addr>   Pop the element on top of the stack into memory at
                       location <addr>
00000010  ADD          Pop the top two elements from the stack, add them,
                       and push the result onto the stack
00000100  BEQ <addr>   Pop the top two elements from the stack; if they are
                       equal, branch to memory location <addr>
Note that the ADD instruction is only 8 bits, but the others are 24 bits. Instructions are
packed into successive byte locations of memory (i.e., do NOT assume every instruction
uses 24 bits).
Assume memory is 8 bits wide, i.e., each read or write operation to main memory accesses
8 bits of instruction or data. This means the instruction fetch for multi-byte instructions
requires multiple memory accesses.
Datapath:
Complete the partial design shown on the next page.
Assume reading or writing the program/data memory or the stack memory requires a single
clock cycle to complete (actually, slightly less to allow time to read/write registers).
Similarly, assume each ALU requires slightly less than one clock cycle to complete an
arithmetic operation, and the zero detection circuit requires negligible time.
Control Unit:
Show a state diagram for the control unit indicating the control signals that must be
asserted in each state of the state diagram.
Solution: (the completed datapath and the control-unit state diagram are figures, omitted here)
The hardware has to save the program counter value implicitly before the control goes to the
handler. The hardware has to determine the address of the handler to transfer control from the
currently executing program to the handler. Depending on the architecture, interrupts may
also be disabled.
4. How does the processor know which device has requested an interrupt?
Initially the processor does not know. However, once the processor acknowledges the
interrupt, the interrupting device will supply information to the processor as to its identity. For
example, the device might supply the address of its interrupt handler or it might supply a
vector into a table of interrupt handler addresses.
Interrupt handler:
    ! Assume interrupts are disabled when we enter
    SW   $k0, OFFSET($sp)   ! save $k0 on the stack so we can return to
                            ! the original program
    ADDI $sp, $sp, OFFSET   ! reserve space on the stack to save registers
    EI                      ! enable interrupts
    SW   $registers($sp)    ! save processor registers on the stack so they
                            ! can be restored when the handler finishes
6. In the following interrupt handler code select the ONES THAT DO NOT BELONG.
___X___ disable interrupts;
___X___ save PC;
_______ save $k0;
_______ enable interrupts;
_______ save processor registers;
_______ execute device code;
_______ restore processor registers;
_______ disable interrupts;
_______ restore $k0;
___X___ disable interrupts;
___X___ restore PC;
___X___ enable interrupts;
_______ return from interrupt;
7. In the following actions in the INT macro state select the ONES THAT DO NOT
BELONG.
___X__ save PC;
___X__ save SP;
______ $k0←PC;
___X__ enable interrupts;
___X__ save processor registers;
______ ACK INT by asserting INTA;
______ Receive interrupt vector from device on the data bus;
______ Retrieve PC address from the interrupt vector table;
___X__ Retrieve SP value from the interrupt vector table;
___X__ disable interrupts
___X__ PC←PC retrieved from the vector table;
______ SP←SP value retrieved from the vector table;
1. True or false: For a given workload and a given instruction-set architecture, reducing
the CPI (clocks per instruction) of all the instructions will always improve the
performance of the processor.
False. The execution time for the processor depends on the number of instructions, the
average CPI, and the clock cycle time. If decreasing the average CPI requires us to
lengthen the clock cycle time, we might see no improvement or even a decrease in
performance.
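To illustrate with made-up numbers, a sketch of why a lower CPI can still lose when the cycle time grows by a larger factor:

```python
# Execution time = instruction count x CPI x clock cycle time.
instructions = 1_000_000
old_time = instructions * 2.0 * 1.0   # CPI = 2.0, cycle = 1.0 ns
new_time = instructions * 1.5 * 1.5   # CPI = 1.5, but cycle = 1.5 ns
print(new_time > old_time)            # True: the "improved" CPI runs slower
```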
2. An architecture has three types of instructions that have the following CPI:
Type CPI
A 2
B 5
C 3
An architect determines that she can reduce the CPI for B by some clever architectural trick,
with no change to the CPIs of the other two instruction types. However, she determines that
this change will increase the clock cycle time by 15%. What is the maximum permissible CPI
of B (round it up to the nearest integer) that will make this change still worthwhile? Assume
that all the workloads that execute on this processor use 40% of A, 10% of B, and 50% of C
types of instructions.
The old average CPI is 0.4 × 2 + 0.1 × 5 + 0.5 × 3 = 2.8 cycles. With the clock cycle
lengthened by 15%, the new average CPI can be at most 2.8 / 1.15 ≈ 2.435 cycles. A and
C contribute 0.4 × 2 + 0.5 × 3 = 2.3 of that, leaving at most 0.135 weighted cycles for B,
i.e., a CPI of about 0.135 / 0.1 ≈ 1.35 for B. Since a CPI of 2 would exceed this budget
and make the new architecture slower than the old one, the maximum permissible integer
CPI for B is 1.
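The arithmetic can be checked with a short Python sketch (instruction mix taken from the problem statement):

```python
# Maximum permissible CPI for type B after a 15% longer clock cycle.
mix = {"A": (0.40, 2), "B": (0.10, 5), "C": (0.50, 3)}   # (frequency, CPI)
old_cpi = sum(freq * cpi for freq, cpi in mix.values())  # 2.8 cycles
budget = old_cpi / 1.15                  # largest new average CPI allowed
fixed = 0.40 * 2 + 0.50 * 3              # A and C contributions (unchanged)
max_b = (budget - fixed) / 0.10          # ~1.35, so integer CPI of 1
print(round(max_b, 2))                   # 1.35
```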
3. What would be the execution time for a program containing 2,000,000 instructions if the
processor clock was running at 8 MHz and each instruction takes 4 clock cycles?
Execution time = (2,000,000 instructions × 4 cycles) / 8,000,000 cycles per second = 1 second.
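The same computation as a one-line sanity check:

```python
# Execution time = (instructions x CPI) / clock rate.
instructions = 2_000_000
cpi = 4
clock_hz = 8_000_000          # 8 MHz
seconds = instructions * cpi / clock_hz
print(seconds)                # 1.0
```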
4. A smart architect re-implements a given instruction-set architecture, halving the CPI for
50% of the instructions, while increasing the clock cycle time of the processor by 10%.
How much faster is the new implementation compared to the original? Assume that the
usage of all instructions is equally likely in determining the execution time of any
program for the purposes of this problem.
Old CPI = 1
Old clock cycle time = 1
Old total time = 1 × 1 = 1
New CPI = 0.5 × 1 + 0.5 × 0.5 = 0.75 (only half the instructions have their CPI halved)
New clock cycle time = 1.1
New total time = 0.75 × 1.1 = 0.825
Speedup = old total time / new total time = 1 / 0.825 ≈ 1.21
The new implementation has a speedup of about 1.21, meaning it is roughly 21% faster
than the original implementation.
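Note that only half the instructions have their CPI halved, so the weighted new CPI is 0.75 rather than 0.5. A quick Python check:

```python
# Speedup when 50% of instructions get half the CPI but the clock slows 10%.
old_cpi, old_cycle = 1.0, 1.0
new_cpi = 0.5 * 1.0 + 0.5 * 0.5   # weighted average over the mix = 0.75
new_cycle = 1.1
speedup = (old_cpi * old_cycle) / (new_cpi * new_cycle)
print(round(speedup, 2))          # 1.21
```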
Compute the CPI of both the original and the new CPU. Show your work in coming up with your
answer.
Cycles per Instruction
Instruction CPI
LW 5
SW 4
ALU 4
BEQ 3
JMP 3
This change in the ALU instructions does improve the overall speed of the architecture.
Program 2 will execute faster since the total number of cycles it will take to execute is less
than the total number of cycles of Program 1.
8. Given
Instruction CPI
Add 2
Shift 3
Others 2 (average for all instructions including Add and Shift)
Add/Shift 3
If the sequence ADD followed by SHIFT appears in 20% of the dynamic frequency of a
program, what is the percentage improvement in the execution time of the program with all
{ADD, SHIFT} pairs replaced by the new instruction?
Out of every 100 dynamic instructions, 20 form 10 ADD/SHIFT pairs. The old execution
takes 100 × 2 = 200 cycles. After fusing, the 80 unchanged instructions take 80 × 2 = 160
cycles and the 10 Add/Shift instructions take 10 × 3 = 30 cycles, for 190 cycles total. The
improvement is (200 − 190) / 200 = 5%.
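Under the interpretation that 20 of every 100 dynamic instructions occur as ADD-SHIFT pairs (i.e., 10 pairs), the improvement works out as:

```python
# Cycle counts per 100 dynamic instructions, before and after fusing.
old_cycles = 100 * 2            # average CPI of 2 over all 100 instructions
new_cycles = 80 * 2 + 10 * 3    # 80 unchanged + 10 fused Add/Shift (CPI 3)
improvement = (old_cycles - new_cycles) / old_cycles
print(f"{improvement:.0%}")     # 5%
```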
9. Compare and contrast structural, data and control hazards. How are the potential
negative effects on pipeline performance mitigated?
A structural hazard arises when two instructions need the same hardware resource in the
same cycle; a data hazard arises when an instruction needs a value that an earlier
instruction still in the pipeline has not yet written; a control hazard arises when the
outcome of a branch is not yet known at the time the next instruction must be fetched.
Notes on mitigation:
a) Feedback lines: used in the hardware implementation of the pipeline. They tell the
previous stage that the current instruction is still being processed, so it should not send
any more work. They also feed a NOP (dummy operation) to the next stage of the
pipeline until the current instruction is ready to proceed.
b) Data forwarding: the EX stage forwards the newly computed value of the destination
register to the ID/RR stage, so later instructions read the updated register value.
c) Branch prediction: the idea here is to predict the outcome of the branch and let the
instructions flow along the pipeline based on this prediction. If the prediction is correct, it
completely eliminates the stalls. If the prediction is incorrect, it creates 2 stalls.
d) Delayed branch: the idea here is to find a useful instruction that we can feed the pipeline with,
while we test the branch instruction.
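As a rough model of how branch handling affects throughput (all numbers below are assumed for illustration, not taken from the text):

```python
# Effective CPI = base CPI + branch freq x miss rate x misprediction penalty.
base_cpi = 1.0
branch_freq = 0.20    # assumed fraction of instructions that are branches
miss_rate = 0.10      # assumed branch-predictor miss rate
penalty = 2           # stall cycles on a mispredicted branch (note c above)
effective_cpi = base_cpi + branch_freq * miss_rate * penalty
print(round(effective_cpi, 2))   # 1.04
```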
10. How can a read after write (RAW) hazard be minimized or eliminated?
RAW hazards can be eliminated by adding data forwarding to a pipelined datapath. This
involves adding paths that send a value being written to a register back to earlier stages
whose buffered instructions want to read the same register.
12. Why is a second ALU needed in the Execute stage of the pipeline?
This second ALU is needed for instructions that change the PC of the processor, since the first
ALU is dedicated to performing arithmetic on the contents of registers.
13. In a processor with a five stage pipeline as discussed in the class and shown in the
picture below (with buffers between the stages), explain the problem posed by branch
instruction. Present a solution.
The problem presented by a branch instruction is that as it moves into the decode stage the
fetch stage does not yet know whether to fetch the next instruction or the branch target
instruction.
Without other changes this requires the pipeline to stall until it has the result of the branch
comparison.
Possible solution to ameliorate this problem:
a.) Perform the comparison in the decode stage.
b.) Change policy to unconditionally execute the instruction following the branch
regardless of the outcome of the branch.
c.) Install a branch predictor which will perhaps record the previous results of a branch
and predict the same outcome. Due to the nature of loops this should have a very high
success rate.
d.) Install a BTB as discussed in question 11.
15. With reference to Figure 5.6a, identify and explain the role of the datapath elements
that deal with the BEQ instruction. Explain in detail what exactly happens cycle by cycle
with respect to this datapath during the passage of a BEQ instruction. Assume a
conservative approach to handling the control hazard. Your answer should include
both the cases of branch taken and branch not taken.
Cycle 1: The BEQ instruction is fetched from memory into the FBUF in the IF stage of
the pipeline.
Cycle 2: The BEQ instruction is decoded into the DBUF in the ID/RR stage of the
pipeline. Also in this cycle, the next instruction is fetched.
Cycle 3: The BEQ instruction is executed (or tested) in the EX stage of the pipeline. Also
in this cycle, the instruction after the BEQ is stalled and a NOP instruction is
fed to the ID/RR stage.
Cycle 4a-1: If the BEQ is not taken, the pipeline continues processing normally.
Cycle 4a-2: The BEQ is fed into the WB stage, followed by 1 NOP and the non-conditional
instruction.
Cycle 4b-1: If the BEQ is taken, the previously stalled instruction (the one coming after the
BEQ) is replaced by the branch-target instruction (the one dictated by the BEQ),
which is fetched in the IF stage while a NOP is fed into the ID/RR stage. The
BEQ moves into the MEM stage.
Cycle 4b-2: The BEQ is fed into the WB stage, followed by 2 NOPs and the conditional
instruction.
16. A smart engineer decides to reduce the 2-cycle "branch-taken" penalty in the 5-stage
pipeline down to 1 cycle. Her idea is to directly use the branch target address computed
in the EX cycle to fetch the instruction (note that the approach presented in Section
5.13.3 requires the target address to be saved in PC first.)
a) Show the modification to the datapath in Figure 5.6a to implement this idea [hint:
you have to simultaneously feed the target address to the PC and the Instruction
memory if the branch is taken].
b) While this reduces the bubbles in the pipeline to 1 for branch taken, it may not be a
good idea. Why? [hint: consider cycle time effects.]
a) The output of the second ALU in the EX stage should be made available as an input to
the Instruction Memory as well as to the PC in the IF stage. Two extra muxes should be
added to determine which address, the contents of the PC or the address computed in the
EX stage, is driven into the Instruction Memory. Another ALU (or incrementer) would
also need to be added to increment the computed address before it is saved to the PC,
since in the next cycle the PC must hold the address of the following instruction.
b) This adds extra circuitry to the processor, which may require an increase in cycle time to
work correctly. This increase in cycle time may increase the run time of processes more
than the reduced CPI of the BEQ instruction would decrease the run time.
17. In a pipelined processor where each instruction could be broken up into 5 stages and
where each stage takes 1 ns what is the best we could hope to do in terms of average
time to execute 1,000,000,000 instructions?
Since the pipeline has 5 stages, each taking 1 ns, the first instruction completes after
5 ns, and every subsequent instruction completes 1 ns after the previous one. Then:
Best total time = 5 ns + 999,999,999 ns = 1,000,000,004 ns
Best average time ≈ 1 ns per instruction.
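The cycle counting above can be verified directly:

```python
# First instruction takes the full pipeline depth; the rest complete 1 ns apart.
stages, ns_per_stage = 5, 1
n = 1_000_000_000
total_ns = stages * ns_per_stage + (n - 1) * ns_per_stage
print(total_ns)        # 1000000004
print(total_ns / n)    # ~1.0 ns average per instruction
```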
18. Using the 5-stage pipeline shown in Figure 5.6c answer the following two questions:
a) Show the actions (similar to Section 5.12.1) in each stage of the pipeline for BEQ
instruction of LC-2200.
b) Considering only the BEQ instruction, compute the sizes of the FBUF, DBUF, EBUF,
and MBUF.
a) IF stage (cycle 1):
I-MEM [PC] -> FBUF // Instruction at PC placed in
FBUF
PC + 1 -> PC // Increment PC
22. Consider
I1: R1 <- R2 + R3
I2: R4 <- R1 + R5
If I2 is immediately following I1 in the pipeline with no forwarding, how many bubbles (i.e.
NOPs) will result in the above execution? Explain your answer.
Three bubbles will result during execution: when I2 is in the ID/RR stage, it must wait
until I1 has left the WB stage, since I2 needs to read R1 while I1 is still writing it.
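The bubble count follows from simple cycle bookkeeping (stage numbering assumed as IF=1 through WB=5 for I1):

```python
# I1 occupies IF=1, ID/RR=2, EX=3, MEM=4, WB=5; I2 starts one cycle behind.
i1_wb = 5
i2_id_nominal = 3          # I2 would normally reach ID/RR in cycle 3
i2_id_actual = i1_wb + 1   # without forwarding, must wait for I1's write-back
bubbles = i2_id_actual - i2_id_nominal
print(bubbles)             # 3
```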
21. You are given the pipelined datapath for a processor as shown below.
LW instruction
FBUF: Opcode (8 bits) rA (4 bits) rB (4 bits) offset (16 bits)
DBUF: Opcode (8 bits) (rA) (32 bits) (rB) (32 bits) offset (32 bits)
Assume that there are no hazards in the above set of instructions. Currently the IF stage is
about to fetch instruction at 1004.
IF: NAND
ID: ADD
EX: LW
MEM: NAND
WB: ADD
a.) 5 cycles would elapse, since all the instructions in the pipeline would need to
complete their execution.
b.) 1 cycle would elapse, since all the pipeline registers can clear their contents at
once, so the INT macro state is entered on the next cycle after the interrupt. The
value of PC that will be stored in $k0 is 1000, since that is the address of the last
instruction to have completed its execution.
A process is a program in execution. A program is static, has no state, and has a fixed
size on disk, whereas a process is dynamic, exists in memory, may grow or shrink, and
has associated with it “state” that represents the information associated with the
execution of the program.
The response time of a job is the most user centric metric in a timesharing environment.
4. Consider a pre-emptive priority processor scheduler. There are three processes P1,
P2, and P3 in the job mix that have the following characteristics:
What is the turnaround time for each of P1, P2, and P3?
P1-turnaround-time = 88 seconds
P2-turnaround-time = 64 seconds
P3-turnaround-time = 76 seconds
The disadvantages of FCFS are its huge potential variation in response time and poor
processor utilization due to the convoy effect.
Round Robin
10. Given the following processes that arrived in the order shown
Show the activity in the processor and the I/O area using the FCFS, SJF, and Round
Robin algorithms.
Assuming each process requires a CPU burst followed by an I/O burst followed by a final
CPU burst (as in Example 1 in Section 6.6):
FCFS
SJF
11. Redo Example 1 in Section 6.6 using SJF and round robin (timeslice = 2)
SJF
a)
b) Response time (P1) = 28 Response time (P2) = 20
Response time (P3) = 7
c) Wait-time (P1) = 10
Wait-time (P2) = 5
Wait-time (P3) = 0
Round Robin
a)
12. Redo Example 3 in Section 6.6 using FCFS and round robin (timeslice =
2)
FCFS
a)
Round Robin
a)
c) Total time = 28
Throughput = 3/28 processes per unit time