
 Computer Design

Part 1B, Dr S.W. Moore


 Introduction

Aims

The aims of this course are to introduce the hardware/software interface models and the hardware
structures used in designing computers. The first seven lectures are concerned with the hardware/software
interface and cover the programmer's model of the computer. The last nine lectures look at hardware
implementation issues at a register transfer level.

Lectures

- Introduction to the course and some background history.
- Historic machines: EDSAC versus Manchester Mark I.
- Introduction to RISC processor design and the MIPS instruction set.
- MIPS tools and code examples.
- Operating system support, including memory hierarchy and management.
- Intel x86 instruction set.
- Java Virtual Machine.
- Memory hierarchy (caching).
- Executing instructions: an algorithmic viewpoint.
- Basic processor hardware, pipelining and hazards. [2 lectures]
- Verilog implementation of a MIPS processor. [3 lectures]
- Internal and external communication.
- Data-flow and comments on future directions.

Objectives

At the end of the course students should:

1. be able to read assembler given a guide to the instruction set, and be able to write short pieces of assembler when given an instruction set or asked to invent one
2. understand the differences between RISC and CISC assembler
3. understand what facilities a processor provides to support operating systems, from memory management to software interrupts
4. understand the memory hierarchy, including different cache structures
5. appreciate the use of pipelining in processor design
6. understand communication structures, from buses close to the processor to peripheral interfaces
7. have an appreciation of the control structures used in processor design
8. have an appreciation of how to implement a processor in Verilog
 Instruction Sets
Objective 1

Accumulator

An accumulator is a register in which intermediate arithmetic and logic results are stored. The
characteristic which distinguishes one register as being the accumulator of a computer architecture is that
the accumulator is used as an implicit operand for arithmetic instructions. For instance, a computer might
have an instruction like: ADD memaddress This instruction would add the value read from the memory
location at memaddress to the value from the accumulator, placing the result in the accumulator.

Register

A processor register is a small amount of storage available on the CPU whose contents can be accessed
more quickly than storage available elsewhere. Moving data from main memory into registers, operating on
it, then moving the result back into main memory is a common pattern; an architecture in which only load
and store instructions access memory is called a load-store architecture.

Stack

A stack machine's memory takes the form of one or more stacks. The term can also refer to a real or
simulated machine with a "0-operand" instruction set: most instructions implicitly operate on values at the
top of the stack and replace them with the result. Typically such machines also have "load" and "store"
instructions that read and write arbitrary RAM locations.

A stack machine is often simpler to program and more reliable to run than other machines. Writing
compilers for stack-based machines is also comparatively simple, as they have fewer exceptional cases to
complicate matters. Since running compilers can take up a significant percentage of machine resources,
building a machine for which an efficient compiler can be written is important.

A stack-based machine instruction is smaller than a register-based one, since there is no need to specify
operand addresses. This nearly always leads to smaller compiled programs.

A given task can, however, often be expressed using fewer register-machine instructions than stack ones.
Stack machine: a = b + c might be translated as LOAD c, LOAD b, ADD, STORE a.
Register machine: the same code would be a single instruction, ADD a, b, c.
 OS Support & Memory Management
Objective 3

Exceptions & Interrupts

Exception handling is a programming language construct or computer hardware mechanism designed to
handle the occurrence of some condition that changes the normal flow of execution, for instance division by
zero. In general, the current state is saved in a predefined location and execution switches to a handler.
The handler may later resume execution at the original location, using the saved information to restore the
original state. Interrupts have a similar effect to exceptions, except that they are caused by external
signals, for example a direct memory access device signalling completion.

Applications normally run in user mode; however, when an interrupt or exception occurs the processor is
switched into an alternative mode which has a higher privilege. The software handler is now exposed to
more of the internals of the processor and sees a more complex memory model.

Memory Management & Protection

Memory management involves providing ways to allocate portions of memory to programs at their request,
and freeing it for reuse when no longer needed.

1. Relocation
Programs must be able to reside in different parts of memory at different times, since when a
program is swapped back into memory after being swapped out for a while it cannot always be
placed in the same location.

2. Protection
Processes should not be able to reference the memory of another process without permission;
this prevents malicious code in one program from interfering with another.

3. Sharing
Processes should be able to share access to the same part of memory, given the necessary
permission.

4. Logical organization
Programs are often organized in modules. Some of these modules could be shared between
different programs, some are read only and some contain data that can be modified. The
memory management system is responsible for handling this logical organization, which differs
from the physical linear address space.

5. Physical organization
Memory is usually divided into fast primary storage and slow secondary storage. The
memory manager handles moving information between these two levels of memory.
Virtual Addressing

Virtual memory gives a program the impression that it has contiguous working memory, while in fact it
may be physically fragmented and may even overflow onto disk storage. This technique makes
programming large applications easier, as well as using physical memory more efficiently.

Virtual addresses are what an application uses to address its memory. These must be converted to
reference a block of physical memory, usually between 1 and 64 kbytes in size, called a page. The upper
bits of a virtual address identify the page and the lower bits specify an index into the page.

If there is insufficient physical memory then some of the pages may be swapped out to disk. If an
application attempts to use a page currently stored on disk then the address translation mechanism causes
an exception which is handled by the operating system. The OS selects a ‘suitable’ page to be swapped out
to disk, and then swaps the required page from disk to memory.

There are two principal virtual addressing schemes:

- Multiple address spaces: each application resides in its own separate virtual address space and is
prohibited from making accesses outside this space. This method makes sharing libraries and data
more difficult, as it involves having several virtual addresses for the same physical address.
- Single address space: there is only one virtual address space and each application being executed is
allocated some part of it. Linking must be done at load time, because the exact address of any
particular component is only known then. However, sharing libraries and data is much simpler since
there is only one virtual address for each physical one.
Address Translation

If each page is 4KB in size then we require 12 bits for the page offset. Assuming we are using 32 bit
addressing, that leaves us with 20 bits to express the virtual page number. Each entry of a page table is 4
bytes. A simple (single-level) page table therefore consumes 4 × 2^20 = 4,194,304 bytes = 4 MB of memory
per table, i.e. per application.
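To make the arithmetic above concrete, here is a small sketch in C (mine, not from the notes); the constants
assume the 4 KB pages and 32-bit virtual addresses used in the example, and the variable names are invented:

//..Example Code (sketch, C)..//
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS  12u                          /* 4 KB pages -> 12-bit offset     */
#define PAGE_SIZE  (1u << PAGE_BITS)
#define VPN_BITS   (32u - PAGE_BITS)            /* 20 bits of virtual page number  */
#define PTE_SIZE   4u                           /* one page-table entry is 4 bytes */

int main(void)
{
    uint32_t vaddr  = 0x12345678u;              /* an arbitrary example address    */
    uint32_t vpn    = vaddr >> PAGE_BITS;       /* upper 20 bits select the page   */
    uint32_t offset = vaddr & (PAGE_SIZE - 1u); /* lower 12 bits index into it     */

    /* A simple flat page table needs one entry per virtual page. */
    uint32_t table_bytes = PTE_SIZE * (1u << VPN_BITS);

    printf("vpn=0x%05x offset=0x%03x table=%u bytes\n",
           (unsigned)vpn, (unsigned)offset, (unsigned)table_bytes);
    return 0;
}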

One solution to this large overhead is to use multilevel page tables. However, the page tables may still be
large, with many unused entries. Particular problems arise with address spaces that are wider than 32 bits
or if the virtual address space is occupied very sparsely.

An alternative is the inverted page table, which has one entry for each of the pages in physical memory,
and uses a hash lookup to translate virtual addresses to physical addresses in nearly constant time.

This page table is inverted in the sense that physical frames, instead of virtual pages, are used as the main
index into the table. A system-wide inverted page table (IPT) maps each physical page number to a virtual
page number. A key advantage of an IPT is that its size grows in direct proportion to the number of
physical pages in the system. Each entry in the table records which page of which process occupies that
particular page of physical memory. The main disadvantage is that a search mechanism is required to
locate the physical frame (if any) that holds a particular page. This can be done efficiently using a hash
function.
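The paragraph above can be turned into a minimal sketch of a hashed IPT lookup (my own illustration; the
structure, sizes and hash function are invented, and the hash-anchor table is assumed to be initialised to -1
when the IPT is built):

//..Example Code (sketch, C)..//
#include <stdint.h>

#define NUM_FRAMES 4096u                  /* physical pages in this toy system           */

struct ipt_entry {                        /* one entry per physical frame                */
    uint32_t pid;                         /* which process occupies the frame            */
    uint32_t vpn;                         /* which of its virtual pages                  */
    int      valid;
    int32_t  next;                        /* next frame in the same hash chain, -1 = end */
};

static struct ipt_entry ipt[NUM_FRAMES];
static int32_t hash_anchor[NUM_FRAMES];   /* hash bucket -> first frame in its chain     */

static uint32_t hash(uint32_t pid, uint32_t vpn)
{
    return ((pid * 2654435761u) ^ vpn) % NUM_FRAMES;   /* any mixing function will do    */
}

/* Translate (pid, vpn) to a physical frame number, or -1 on a page fault. */
int32_t ipt_lookup(uint32_t pid, uint32_t vpn)
{
    for (int32_t f = hash_anchor[hash(pid, vpn)]; f != -1; f = ipt[f].next)
        if (ipt[f].valid && ipt[f].pid == pid && ipt[f].vpn == vpn)
            return f;                     /* frame f holds this virtual page             */
    return -1;                            /* not resident: the OS must fetch the page    */
}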
Memory-Mapped I/O

Input and output devices are usually mapped into part of the address space. Thus communicating with an
I/O device can be the same as reading and writing memory addresses devoted to that device: the I/O
device merely has to use the same protocol to communicate with the CPU as memory does. Reading from
an I/O device often has side effects. Memory protection is used to ensure that only the device driver has
access to the device's area of memory. Some processors have special instructions to access I/O within a
dedicated I/O address space.
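In C, memory-mapped device registers are usually reached through volatile pointers so that every access
really reaches the device; the base address and register layout below are made up for illustration:

//..Example Code (sketch, C)..//
#include <stdint.h>

#define UART_BASE 0x10000000u             /* hypothetical device address                 */

typedef volatile struct {
    uint32_t data;                        /* write: byte to transmit                     */
    uint32_t status;                      /* bit 0 set when the transmitter is ready     */
} uart_regs;

#define UART ((uart_regs *)(uintptr_t)UART_BASE)

void uart_putc(char c)
{
    /* Ordinary load/store instructions talk to the device; 'volatile'
     * stops the compiler caching or optimising away the accesses.     */
    while ((UART->status & 1u) == 0)
        ;                                 /* spin until the device is ready              */
    UART->data = (uint32_t)c;
}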

TLB

Performing a complete address translation for every memory request is prohibitively time consuming. The
translation look-aside buffer (TLB) is used to cache recently performed translations so that they may be
reused. The MIPS R3000 caches 64 translation entries in a fully associative store. When an address needs to
be translated, the TLB is searched in parallel for the appropriate virtual page translation and protection
information. A TLB miss occurs if the translation information is not present in the TLB; on the R3000 a
TLB miss is handled in software.
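A minimal software model of the lookup described above (mine, not the R3000 hardware; the real hardware
compares all 64 entries in parallel, which the loop only simulates):

//..Example Code (sketch, C)..//
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64                    /* as on the MIPS R3000                        */

struct tlb_entry {
    uint32_t vpn;                         /* virtual page number                         */
    uint32_t pfn;                         /* physical frame number                       */
    bool     valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a hit and writes the frame number to *pfn. */
bool tlb_lookup(uint32_t vpn, uint32_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;                         /* TLB miss: refilled in software on the R3000 */
}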

Cache: A cache is a small local memory which makes use of the temporal and spatial characteristics of
data to store values which are likely to be needed in the near future. (see next section)
 Memory Hierarchy
Objective 4

“Ideally one would desire an indefinitely large memory capacity such that any particular word would be
immediately available. We are, however, forced to recognize the possibility of constructing a hierarchy of
memories, each of which has greater capacity than the preceding but which is less quickly accessible.”

Memory Technologies

Static Memory (SRAM):
- Maintains store provided power is kept on
- Typically 4 to 6 transistors per bit
- Fast but expensive

Dynamic Memory (DRAM):
- Relies on storing charge on the gate of a transistor
- Charge decays over time so requires refreshing
- 1 transistor per bit
- Fairly fast, not too expensive

Magnetic Disk:
- Maintains store even when power is turned off
- Much slower than DRAM but much cheaper per MB

ROM/PROM/EPROM/EEPROM/Flash:
- Read only memory
- Programmable ROM
- Erasable PROM
- Electrically erasable PROM (faster erase)

Latency and Bandwidth

Register file:
- Multi-ported small SRAM
- < 1 cycle latency, multiple reads and writes per cycle

First level cache:
- Single or multi-ported SRAM
- 1 to 3 cycles latency, 1 or 2 reads or writes per cycle

Second level cache:
- Single ported SRAM
- Around 3 to 9 cycles latency, 0.5 to 1 reads or writes per cycle

Main memory:
- DRAM
- Reads take anything from 10 to 100 cycles to get the first word and can receive adjacent words every 2 to 8 cycles
- Writes take 8 to 80 cycles and each further consecutive word takes 2 to 8 cycles

Hard Disk:
- Slow to seek, around 2 million clock cycles; returns a large block of data
Cache Design

Temporal locality: if a word is accessed once then it is likely to be accessed again soon.
Spatial locality: if a word is accessed then its neighbors are likely to be accessed soon.

In both cases it is advantageous to store the data close to the processor in a cache.

Fully associative cache: if an entry from main memory is free to reside in any part of the cache, the cache
is fully associative.
Direct mapped cache: at the other extreme, if each entry in main memory can go in just one place in the
cache, the cache is direct mapped.
Set associative cache: if a block can be placed in a restricted set of places in the cache, the cache is said
to be set associative. A set is a group of blocks in the cache. A block is first mapped onto a set, and then
the block can be placed anywhere within that set. See the diagram above for a 2-way set associative cache.

Cache line: a line is an adjacent series of bytes in main memory, that is, their addresses are contiguous.
Typically 4 or 8 words. This utilises DRAM’s high read speed for successive locations.
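As a sketch of how a cache decomposes an address (my own, with made-up parameters: a 32 KB
direct-mapped cache with 32-byte lines), the tag, index and offset fields are extracted like this:

//..Example Code (sketch, C)..//
#include <stdint.h>

#define LINE_BYTES   32u                  /* bytes per cache line                        */
#define NUM_LINES    1024u                /* 1024 lines x 32 B = 32 KB cache             */
#define OFFSET_BITS  5u                   /* log2(LINE_BYTES)                            */
#define INDEX_BITS   10u                  /* log2(NUM_LINES)                             */

struct cache_addr {
    uint32_t tag;                         /* compared against the stored tag             */
    uint32_t index;                       /* selects the line (or the set)               */
    uint32_t offset;                      /* byte within the line                        */
};

struct cache_addr split_address(uint32_t addr)
{
    struct cache_addr a;
    a.offset = addr & (LINE_BYTES - 1u);
    a.index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
    a.tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    return a;
}

/* A hit occurs when the line at 'index' is valid and its stored tag equals a.tag.
 * In an n-way set associative cache the index selects a set of n lines instead.   */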

Cache Replacement Policies

When a miss occurs, the cache controller must select a block to be replaced with the desired data. A
replacement policy determines which block should be replaced.

With direct-mapped placement the decision is simple because there is no choice: only one block frame is
checked for a hit and only that block can be replaced. With fully-associative or set-associative placement
there is more than one block to choose from on a miss.

Least Recently Used – good, but requires usage information to be stored in the cache.
Not Last Used – tends to remove infrequently used cache lines; has a few pathological cases.
Random – actually quite simple and works well in practice.

Victim cache: a buffer of one cache line used to store the last line overwritten in the cache.
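One plausible way to implement the LRU policy for a small set-associative cache is an age counter per way
(a sketch of mine, not a scheme from the notes):

//..Example Code (sketch, C)..//
#include <stdint.h>

#define WAYS 4                            /* 4-way set associative                       */

struct way_state {
    uint32_t tag;
    int      valid;
    uint32_t age;                         /* accesses since this way was last used       */
};

/* Called on every access to a set: age everything, then reset the used way. */
void lru_touch(struct way_state set[WAYS], int used_way)
{
    for (int w = 0; w < WAYS; w++)
        set[w].age++;
    set[used_way].age = 0;
}

/* Called on a miss: prefer an invalid way, otherwise evict the oldest. */
int lru_victim(const struct way_state set[WAYS])
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].age > set[victim].age)
            victim = w;
    }
    return victim;
}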
Reading Memory

When the CPU requests a memory location, the required block is searched for in the cache. If it is found
then it is sent to the CPU and no further work has to be done. If we encounter a cache miss we must go out
to main memory. Upon returning with the block from main memory we have two options:
1. Read Through – do not store the block in the cache; take it straight to the CPU.
2. No Read Through – store the block in the cache and from there transfer it to the CPU.

Writing Memory

When the CPU writes to a memory location and the block currently exists in the cache we have a write hit.
We can do one of two things:
1. Write Through – Data is written to both the cache and the lower level memory so if a cache line is
replaced it doesn’t need to be written back to the memory first. This method is common for
multiprocessor computers so that cache coherency is possible.
2. Write Back - Data is initially written to the cache only and will be written to the lower level memory
when its cache line is replaced. A dirty bit is used to indicate if the cache line has been modified and
therefore requires being written back before removal.

When the CPU writes to a memory location and the block doesn’t exist in the cache we have a write miss.
We can do one of two things:
1. Fetch – bring the block from main memory into the cache and then perform either write-hit action.
2. Write Around – the block is modified in main memory and not loaded into the cache.

Since writing to lower level memory takes time, we avoid making the processor wait by using a write
buffer to store the upcoming writes.
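To make the two write-hit policies concrete, here is a rough sketch (mine, with simplified structures and an
assumed mem_write_word helper) of the stores a write-back and a write-through cache perform:

//..Example Code (sketch, C)..//
#include <stdbool.h>
#include <stdint.h>

struct line {
    uint32_t tag;
    bool     valid;
    bool     dirty;                       /* only meaningful for write-back              */
    uint32_t data[8];                     /* a 32-byte line held as 8 words              */
};

void mem_write_word(uint32_t addr, uint32_t value);   /* assumed helper, not shown here  */

/* Write-back: update only the cache and mark the line dirty; the line is
 * written to lower-level memory later, when it is evicted.                */
void write_back_store(struct line *l, uint32_t addr, uint32_t value)
{
    l->data[(addr >> 2) & 7u] = value;    /* word offset within the line                 */
    l->dirty = true;
}

/* Write-through: update the cache and lower-level memory together, so an
 * evicted line never needs writing back (this also helps cache coherency). */
void write_through_store(struct line *l, uint32_t addr, uint32_t value)
{
    l->data[(addr >> 2) & 7u] = value;
    mem_write_word(addr, value);          /* usually buffered in a write buffer          */
}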
 Pipelining
Objective 5

An instruction pipeline is used in processors to increase instruction throughput, with the result that a
program's overall execution time is lowered. Pipelining doesn't speed up the execution of an individual
instruction, but it does speed up program execution by increasing the number of instructions the processor
can work on simultaneously. The more finely we slice the instruction's lifecycle, the more of the hardware
implementing those phases is in use at any given moment. When a processor does not implement pipelining
it is said to be sequential.

The RISC pipeline is broken into five pipeline stages, with a set of flip-flops between each stage:
1. Instruction fetch (IF)
2. Decode and register fetch (DC)
3. Execute (EX)
4. Memory access (MA)
5. Register write back (WB)

Instruction latency: the number of clock cycles it takes for an instruction to pass through the pipeline.

Sequential processors, ones that aren’t pipelined, have a latency of one clock cycle. In contrast, an n-stage
pipeline has a minimum latency of n cycles, or longer if the instruction stalls.

Pipelining offers greater performance in general, but we can encounter problems when the programmer's
model is violated. When a programmer (or compiler) writes assembly code, they make the assumption that
each instruction is executed before execution of the subsequent instruction begins. This assumption is
invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a
hazard. There are two types of hazard: data and control. Techniques such as forwarding and stalling exist
to help guard against hazards.

When a branch instruction occurs in assembler, a pipelined processor does not know what instruction to
fetch next. We could: a) stall until the branch outcome is known; b) predict the branch outcome and, if the
prediction is found to be wrong, flush the pipeline; or c) execute the instruction after the branch regardless,
which is known as having a branch delay slot.
Data Hazard

Data hazards result from a conflict over the sharing of data. Although the instructions are written with
sequential execution in mind, a previous instruction's output may be the current instruction's input, and
this data dependency creates a problem when the instructions are pipelined.

//..Example Code..//
add t4,t1,t2 // A1: t4=t1+t2
add t5,t4,t3 // A2: t5=t4+t3

The dependency on register t4 is a problem: A2 uses the old value of t4 because A1 hasn’t written its
updated value back.

Pipeline snapshot (no stalling):

time    IF  DC  EX  MA  WB
 0      A1  ~   ~   ~   ~
 1      A2  A1  ~   ~   ~
 2      ~   A2  A1  ~   ~
 3      ~   ~   A2  A1  ~
 4      ~   ~   ~   A2  A1
 5      ~   ~   ~   ~   A2

Stalling solves this problem: bubbles, a kind of no-operation, are inserted into the pipeline to keep A2 in
the decode stage until A1 has written its new value.

Pipeline snapshot (with bubbles, shown as *):

time    IF  DC  EX  MA  WB
 0      A1  ~   ~   ~   ~
 1      A2  A1  ~   ~   ~
 2      ~   A2  A1  ~   ~
 3      ~   A2  *   A1  ~
 4      ~   A2  *   *   A1
 5      ~   ~   A2  *   *
 6      ~   ~   ~   A2  *
 7      ~   ~   ~   ~   A2

In another instance we might want to use the result of the ALU directly, without having to wait for it to be
written back to the register file. For this we need to forward the result of the ALU directly back into the
ALU.

//..Example Code..//
lw  t2,0(t1)  // L: t2=load(t1)
add t3,t3,t2  // A: t3=t3+t2

In this example the value of t2 is fed back into the ALU quickly so that the value of t3 is calculated
correctly and not from an old value. Note that a single stall is still required while the load fetches its data
from memory.
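A rough sketch (mine, not from the notes) of the decision the decode-stage hazard logic makes each cycle,
written in C rather than Verilog for brevity; the structure and field names are invented:

//..Example Code (sketch, C)..//
#include <stdbool.h>
#include <stdint.h>

struct instr {
    uint8_t rs, rt, rd;     /* source and destination register numbers               */
    bool    writes_reg;     /* does this instruction write a register?               */
    bool    is_load;        /* result only available after the MA stage              */
};

/* Assuming full forwarding (EX->EX and MA->EX paths), the only case that still
 * forces a bubble is a load immediately followed by a use of its result;
 * everything else can be forwarded straight back into the ALU.                  */
bool must_stall(struct instr decode, struct instr execute)
{
    bool uses_ex_result =
        execute.writes_reg && execute.rd != 0 &&     /* register 0 is never a hazard */
        (execute.rd == decode.rs || execute.rd == decode.rt);

    return uses_ex_result && execute.is_load;
}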
Control Hazard

Control hazards occur when the processor is told to branch - i.e., if a certain condition is true, then jump
from one part of the instruction stream to another - not necessarily to the next instruction sequentially. In
such a case, the processor cannot tell in advance whether it should process the next instruction (when it
may instead have to move to a distant instruction).

This can result in the processor doing unwanted actions.

//..Example Code..//
J label //J: jump to label
add t3,t1,t2 //A1: t3=t1+t2
label: add t6,t4,t5 //A2: t6=t4+t5

We have two solutions to this problem:
1. Flush A1 from the pipeline, converting it to a bubble.
2. Execute A1 anyway, thus exposing the branch delay slot.

Pros:
1. The cycle time of the processor is reduced, thus increasing instruction bandwidth in most cases.

Cons:
1. The design is complex and harder to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent,
due to the extra flip-flops added to the data path.
3. The performance of a pipelined processor is much harder to predict and may vary more widely
between different programs.

The higher throughput of a pipeline falls short when the executed code contains many branches: the
processor cannot know where to read the next instruction and must wait for the branch instruction to
finish, leaving the pipeline behind it empty. After the branch is resolved, the next instruction has to travel
all the way through the pipeline before its result becomes available and the processor appears to "work"
again. In the extreme case, the performance of a pipelined processor could theoretically approach that of an
unpipelined one, or even be slightly worse, when all but one of the pipeline stages are idle and a small
overhead is present between stages.
 Internal and External Communication
Objective 6

As a general rule of thumb, when a wire is longer than 1/100 of the wavelength of the signal being
transmitted down its length we need to consider the transmission-line properties of the wire, e.g. any wire
over about 3 mm for a 1 GHz signal.

Characteristic Impedance: the ratio of the amplitudes of a single pair of voltage and current waves
propagating along the line in the absence of reflections.

For a loss-less transmission line (i.e. R and G are negligible) the characteristic impedance is given by
Z0 = √(L/C), measured in ohms, where L and C are the line's inductance and capacitance per unit length.
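As a quick worked example (mine; the per-metre values are typical of 50 Ω coaxial cable, not figures from
the notes):

\[
  Z_0 = \sqrt{\frac{L}{C}}
      = \sqrt{\frac{250 \times 10^{-9}\,\mathrm{H/m}}{100 \times 10^{-12}\,\mathrm{F/m}}}
      = \sqrt{2500\,\Omega^2}
      = 50\,\Omega
\]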

An electrical pulse injected into a transmission line carries energy. If the end of the transmission line is
unconnected, the pulse is reflected. To prevent reflections, the energy should be dissipated at the receiver,
e.g. using a terminating resistor equal to Z0.
Communication Methods

Synchronous communication: uses no start and stop bits but instead synchronizes transmission speeds at
both the receiving and sending ends of the transmission using clock signals built into each component. The
data transfer rate is higher, although more errors will occur as the clocks become skewed.

Skew: the difference in arrival time of bits transmitted at the same time.

Asynchronous communication: a start signal is sent prior to each byte, character or code word and a stop
signal is sent after each code word.

Parallel Communication

Parallel transmission involves sending several bits at the same time, with each bit transmitted over a
separate wire. An 8-bit parallel channel transmits eight bits (one byte) simultaneously. Parallel
communication is usually synchronous. Parallel communication won't work well at high clock frequencies
over long distances because of the skew experienced when communicating synchronously; the longer the
distance, the bigger the problem:
- PCI cards were limited to 66 MHz.
- DDR2 memory chips operate at 660 MHz as they are closer to the CPU.

Serial Communication

Serial communication is the process of sending data one bit at a time, sequentially, over a twisted pair or
coaxial connection. Contrast this with parallel communication, where all the bits of each symbol are sent
together. Serial communication is used for all long-haul communications and most computer networks,
where the cost of cable and synchronization difficulties make parallel communication impractical.

High data rates are possible: Ethernet (1 to 10 Gb/s), SATA (3 Gb/s), PCI Express (2.5 Gb/s per lane).

The diagram above shows the ASCII character 'A' being sent over RS-232, a commonly used asynchronous
serial-line data transmission standard.
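A sketch (mine, not from the notes) of how one character is framed for an asynchronous serial line such as
RS-232: the line idles high, a low start bit marks the beginning of the character, the data bits follow
least-significant bit first, and a high stop bit ends the frame:

//..Example Code (sketch, C)..//
#include <stdint.h>

/* Build the 10-bit frame (start + 8 data + stop) for one character.
 * bits[0] is transmitted first; 1 = line high (idle), 0 = line low.  */
void rs232_frame(uint8_t c, uint8_t bits[10])
{
    bits[0] = 0;                          /* start bit: line pulled low                  */
    for (int i = 0; i < 8; i++)
        bits[1 + i] = (c >> i) & 1u;      /* data bits, LSB first                        */
    bits[9] = 1;                          /* stop bit: line returns high                 */
}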

USB

The Universal Serial Bus (USB) was designed to support a large range of devices that can be chained
together (see diagram). Electrically, USB is a twisted data pair together with power and ground lines to
supply power to devices.
- Version 1.1 could transfer at 12 Mb/s at full speed.
- Version 2 added a 480 Mb/s mode.

Devices are identified by class, vendor, etc. to allow plug and play.

On-Chip Communication

His notes make no sense, try to read them.
