
PREFACE

What is Computer Architecture?

Computer Architecture = Instruction Set Architecture + Machine Organisation

The Instruction Set Architecture (or ISA) describes the structure of the computer as seen by a programmer. A family of processors which all run the same (binary) code are all said to have the same architecture. (Note that "architecture" used alone most commonly refers to the ISA.)

Some ISAs:

    ISA          Date
    IBM 360      197x
    SUN SPARC    1987
    DEC Alpha    1992
    SGI MIPS     1986
    Intel x86    1978

In the case of the IBM, SUN and Intel ISAs, it is possible to purchase processors which execute the same instructions from more than one manufacturer. All these processors may have quite different internal organisations (copyright and other legal considerations will generally require significant differences!) but they all appear identical to a programmer, because their instruction sets are the same.

Microcoded Processors

It is even possible to build a microcoded processor which can execute multiple ISAs. We will look briefly at how this is achieved later.

Machine Organisation
This refers to the layout and interconnection of the various functional units. Processors with widely varying performance (but the same ISA) can be constructed by changing details such as the pipeline length, number of functional units, cache size and organisation, sophistication of the instruction issue unit (ability to issue more than one instruction at the same time), etc.

Microcode

Microcode is the lowest programmable level of a machine; it's usually considered part of the hardware. CISC machines usually use microcode extensively to implement their complex instructions. Modifiable microcode even makes it possible for a machine to have more than one ISA.

Instruction Set Architecture and machine organisation are not necessarily related! Details of a machine's organisation are usually transparent to software. Programs compiled for the same ISA will run in the same way (but probably at vastly different speeds) on machines with different internal structures. In particular, caches are transparent to software: a program will run in exactly the same way on a system with no cache as on one which has the largest cache which is technically feasible at any time.

1. Processor Structure
Before we look at basic processor structure, we need to briefly touch on two concepts: von Neumann machines and pipelined, clocked logic systems.

von Neumann machines
In the early 1950s, John von Neumann proposed the concept of a stored program computer - an architecture which has become the foundation for most commercial processors used today. In a von Neumann machine, the program and the data occupy the same memory. The machine has a program counter (PC) which points to the current instruction in memory. The PC is updated on every instruction. When there are no branches, program instructions are fetched from sequential memory locations. (A branch simply updates the PC to some other location in the program memory.) Except for a handful of research machines and a very small collection of commercial devices, all of today's commercial processors work on this simple principle. Later, we will examine some non-von Neumann architectures.

Synchronous Machines
Again, with a very few exceptions - a handful of research and a small number of commercial systems - most machines nowadays are synchronous, that is, they are controlled by a clock.

Datapaths

Registers and combinatorial logic blocks alternate along the datapaths through the machine. Data advances from one register to the next on each cycle of the global clock: as the clock edge clocks new data into a register, its current output (processed by passing through the combinatorial block) is latched into the next register in the pipeline. The registers are master-slave flip-flops which allow the input to be isolated from the output, ensuring a "clean" transfer of the new data into the register. (Some very high performance machines, eg DEC's Alpha, use dynamic latches here to reduce propagation delays, cf Dobberpuhl et al.) In a synchronous machine, the longest propagation delay, tpd,max, through any combinatorial block must be less than the smallest clock cycle time, tcyc - otherwise a hazard will occur and stale data will be clocked into the register again. If tcyc < tpd for any operation in any stage of the pipeline, the clock edge will arrive at the register before the new data has propagated through the combinatorial block.
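To make the constraint concrete, here is a minimal C sketch that finds the slowest stage and the maximum clock frequency it allows; the stage delay values are invented for illustration and are not figures from any real design.

    /* Minimal sketch of the synchronous timing constraint described above.
       The stage delays are illustrative values, not from any real design. */
    #include <stdio.h>

    int main(void) {
        /* Propagation delays (ns) through each stage's combinatorial block */
        double t_pd[] = { 3.2, 4.8, 4.1, 2.9 };
        int n_stages = sizeof t_pd / sizeof t_pd[0];

        /* The clock period must exceed the slowest stage: t_cyc > t_pd,max */
        double t_pd_max = 0.0;
        for (int i = 0; i < n_stages; i++)
            if (t_pd[i] > t_pd_max)
                t_pd_max = t_pd[i];

        double f_max = 1000.0 / t_pd_max;   /* ns -> MHz */
        printf("t_pd,max = %.1f ns  =>  maximum clock frequency ~ %.0f MHz\n",
               t_pd_max, f_max);
        return 0;
    }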

Of course, there may also be feedback loops - in which the output of the current stage is fed back and latched in the same register: a conventional state machine. This sort of logic is used to determine the next operation (ie the next microcode word or the next address for branching purposes).

Basic Processor Structure
Here we will consider the basic structure of a simple processor. We will examine the flow of data through such a simple processor and identify bottlenecks in order to understand what has guided the design of more complex processors.

Here we see a very simple processor structure - such as might be found in a small 8-bit microprocessor. The various components are:

ALU
Arithmetic Logic Unit - this circuit takes two operands on the inputs (labelled A and B) and produces a result on the output (labelled Y). The operations will usually include, as a minimum:

    add, subtract
    and, or, not
    shift right, shift left

ALUs in more complex processors will execute many more instructions.

Register File
A set of storage locations (registers) for storing temporary results. Early machines had just one register - usually termed an accumulator. Modern RISC processors will have at least 32 registers.

Instruction Register
The instruction currently being executed by the processor is stored here.

Control Unit
The control unit decodes the instruction in the instruction register and sets signals which control the operation of most other units of the processor. For example, the operation code (opcode) in the instruction will be used to determine the settings of control signals for the ALU which determine which operation (+, -, ^, v, ~, shift, etc) it performs.

Clock
The vast majority of processors are synchronous, that is, they use a clock signal to determine when to capture the next data word and perform an operation on it. In a globally synchronous processor, a common clock needs to be routed (connected) to every unit in the processor.

Program Counter
The program counter holds the memory address of the next instruction to be executed. It is updated every instruction cycle to point to the next instruction in the program. (Control for the management of branch instructions - which change the program counter by other than a simple increment - has been omitted from this diagram for clarity. Branch instructions and their effect on program execution and efficiency will be examined extensively later.)

Memory Address Register
This register is loaded with the address of the next data word to be fetched from or stored into main memory.

Address Bus
This bus is used to transfer addresses to memory and memory-mapped peripherals. It is driven by the processor acting as a bus master.

Data Bus
This bus carries data to and from the processor, memory and peripherals. It is driven by the source of data, ie processor, memory or peripheral device.

Multiplexed Bus
Of necessity, high performance processors provide separate address and data buses. To limit device pin counts and bus complexity, some simple processors multiplex address and data onto the same bus: naturally this has an adverse effect on performance. See multiplexed buses.

Executing Instructions
Let's examine the steps in the execution of a simple memory fetch instruction, eg

101c₁₆: lw $1,0($2)

This instruction tells the processor to take the address stored in register 2, add 0 to it and load the word found at that address in main memory into register 1.

In this, and most following, examples, we'll use the MIPS instruction set. This is chosen because

    it's simple,
    it exists in one widely available range of machines produced by SGI, and
    there is a public domain simulator for MIPS machines, which we will use for some performance studies.

As the next instruction to be executed (our lw instruction) is at memory address 101c₁₆, the program counter contains 101c.

Execution Steps

For convenience, most numbers - especially memory addresses and instruction contents - will be expressed in hexadecimal. When orders of magnitude and performance are being discussed, decimal numbers will be used: this will generally be obvious from the context and the use of exponent notation, eg 5 x 10¹².

1. The control unit sets the multiplexor to drive the PC onto the address bus.

2. The memory unit responds by placing 8c410000₁₆ - the lw $1,0($2) instruction as encoded for a MIPS processor - on the data bus, from where it is latched into the instruction register.

3. The control unit decodes the instruction, recognises it as a memory load instruction and directs the register file to drive the contents of register 2 onto the A input of the ALU and the value 0 onto the B input. At the same time, it instructs the ALU to add its inputs.

4. The output from the ALU is latched into the MAR. The controller ensures that this value is directed onto the address bus by setting the multiplexor.

5. When the memory responds with the value sought, it is captured on the internal data bus and latched into register 1 of the register file.

6. The program counter is now updated to point to the next instruction and the cycle can start again.

As another example, let's assume the next instruction is an add instruction:

1020₁₆: add $1,$3,$4

This instruction tells the processor to add the contents of registers 3 and 4 and place the result in register 1.

1. The control unit sets the multiplexor to drive the PC onto the address bus.

2. The memory unit responds by placing 00640820₁₆ - the encoded add $1,$3,$4 instruction - on the data bus, from where it is latched into the instruction register.

3. The control unit decodes the instruction, recognises it as an arithmetic instruction and directs the register file to drive the contents of register 3 onto the A input of the ALU and the contents of register 4 onto the B input. At the same time, it instructs the ALU to add its inputs.

4. The output from the ALU is latched into the register file at register address 1.

5. The program counter is now updated to point to the next instruction.
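The steps above can be summarised as a fetch/decode/execute loop. The sketch below is a toy C model which handles only the two encodings used in these examples (lw and add), with a word-addressed memory for simplicity; it is an illustration of the cycle, not a full MIPS simulator.

    /* Toy fetch/decode/execute loop for the two example instructions above.
       Memory is word-addressed here purely to keep the sketch short. */
    #include <stdint.h>
    #include <stdio.h>

    static uint32_t reg[32];       /* register file                      */
    static uint32_t mem[1024];     /* toy main memory (word-addressed)   */
    static uint32_t pc;            /* program counter (word index here)  */

    static void step(void) {
        uint32_t ir = mem[pc];                        /* 1-2: fetch via the PC / address bus */
        uint32_t opcode = ir >> 26;

        if (opcode == 0x23) {                         /* lw rt, offset(rs)                   */
            uint32_t rs = (ir >> 21) & 0x1f;
            uint32_t rt = (ir >> 16) & 0x1f;
            int16_t  offset = (int16_t)(ir & 0xffff);
            uint32_t mar = reg[rs] + offset;          /* 3-4: ALU adds base + offset -> MAR  */
            reg[rt] = mem[mar];                       /* 5: memory responds, latch result    */
        } else if (opcode == 0x00 && (ir & 0x3f) == 0x20) {   /* add rd, rs, rt              */
            uint32_t rs = (ir >> 21) & 0x1f;
            uint32_t rt = (ir >> 16) & 0x1f;
            uint32_t rd = (ir >> 11) & 0x1f;
            reg[rd] = reg[rs] + reg[rt];              /* 3-4: ALU result to register file    */
        }
        pc++;                                         /* 6: update the program counter       */
    }

    int main(void) {
        reg[2] = 100; mem[100] = 42;                  /* data word for the lw                */
        reg[3] = 3; reg[4] = 4;
        mem[0] = 0x8c410000;                          /* lw  $1, 0($2)                       */
        mem[1] = 0x00640820;                          /* add $1, $3, $4                      */

        step();
        printf("$1 after lw  = %u\n", (unsigned)reg[1]);   /* 42        */
        step();
        printf("$1 after add = %u\n", (unsigned)reg[1]);   /* 3 + 4 = 7 */
        return 0;
    }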

Key terms
von Neumann machine
A computer which stores its program in memory and steps through instructions in that memory.

pipeline
A sequence of alternating storage elements (registers or latches) and combinatorial blocks, making up a datapath through the computer.

program counter
A register or memory location which holds the address of the next instruction to be executed.

synchronous (system/machine)
A computer in which instructions and data move from one pipeline stage to the next under the control of a single (global) clock.

Performance
Let's assess the performance of our simple processor. Assume that the whole system is driven by a clock at f MHz. This means that each clock cycle takes t = 1/f microseconds

Thus a processor with a clock running at 100MHz is operating with 10ns clock cycles. Generally, a processor will execute one step every cycle, thus, for a memory load instruction, our simple processor needs:

    Step  Operation                               Time (cycles)  Notes
    1     PC to bus                               1
    2     Memory response                         tac
    3     Decode and register access              1
    4     ALU operation and latch result to MAR   1
    5     Memory response                         tac
    6     Increment PC                                           Overlap with step 3

    Total                                         3 + 2*tac

If the memory response time is, say, 100ns, then our simple processor needs 3x10 + 2x100 = 230ns to execute a load instruction. For the add instruction, we make a similar table:

    Step  Operation                                                Time (cycles)  Notes
    1     PC to bus                                                1
    2     Memory response                                          tac
    3     Decode and register access                               1
    4     ALU operation and latch result to destination register   1
    5     Increment PC                                                            Overlap with step 3

    Total                                                          3 + tac

So an add instruction requires 3x10 + 100 = 130ns to execute. A store operation will also need more than 200ns, so instructions will require, on average, about 150ns.

Performance Measures
One commonly used performance measure is MIPS, or millions of instructions per second. Our simple processor will achieve:

    1/(150 x 10⁻⁹) = ~6.6 x 10⁶ instructions per second = ~6.6 MIPS

As you will know from reading the popular literature, 100MHz is a very common figure for processors in 1998 (leading edge commercial processors have clocks which are more than 5 times faster!) and a MIPS rating of 6.6 is very ordinary. In fact, to be competitive, a 100MHz processor should be achieving of the order of 100 MIPS - or one instruction for each machine cycle. One of the main aims of this course is to examine how this is achieved.
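The arithmetic above can be captured in a few lines. The sketch below simply reproduces the calculation; the 10ns cycle, 100ns memory response and 150ns average are the assumed figures from the text, not measurements of a real processor.

    /* Sketch of the performance arithmetic above, using the assumed figures. */
    #include <stdio.h>

    int main(void) {
        double t_cyc = 10e-9;    /* 100 MHz clock  -> 10 ns cycle  */
        double t_ac  = 100e-9;   /* assumed memory response time   */

        double t_load = 3 * t_cyc + 2 * t_ac;  /* 3 + 2*tac -> 230 ns */
        double t_add  = 3 * t_cyc + 1 * t_ac;  /* 3 + tac   -> 130 ns */
        double t_avg  = 150e-9;                /* rough average from the text */

        printf("load: %.0f ns, add: %.0f ns\n", t_load * 1e9, t_add * 1e9);
        printf("MIPS rating ~ %.1f\n", 1.0 / t_avg / 1e6);   /* ~6.6 MIPS */
        return 0;
    }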

Bottlenecks
From the simplistic analysis presented above, it will be obvious that access to main memory is a major limiting factor in the performance of a processor. Management of the memory hierarchy to achieve maximum performance is one of the major challenges for a computer architect. Unfortunately, the hardware maxim

    smaller is faster

conflicts with programmers' and users' desires for more and more capabilities and more elaborate user interfaces in their programs - resulting in programs that require megabytes of main memory to run! This has led the memory manufacturers to concentrate on density (improving the number of bits stored in a single package) rather than speed. They have been remarkably successful in this: the growth in capacity of the standard DRAM chips which form the bulk of any computer's semiconductor memory has matched the increase in speed of processors. However, the increase in DRAM access speeds has been much more modest - even if we consider recent developments in synchronous RAM and FRAM. Another reason for the manufacturers' concentration on density is that a small increase in DRAM access time has a negligible effect on the effective access time, which needs to include overheads for bus protocols. (The 100ns figure used above assumes 60ns of DRAM access time and a - very optimistic - allowance of 40ns for bus overhead.)

Cache memories are the most significant device used to reduce memory overheads and they will be examined in some detail later. However, a host of other techniques, such as pipelining, pre-fetching, branch prediction, etc, are all used to alleviate the impact of memory fetch times on performance.

ALU
The Arithmetic and Logic Unit is the 'core' of any processor: it's the unit that performs the calculations. A typical ALU will have two input ports (A and B) and a result port (Y). It will also have a control input telling it which operation (add, subtract, and, or, etc) to perform and additional outputs for condition codes (carry, overflow, negative, zero result).

ALUs may be simple and perform only a few operations: integer arithmetic (add, subtract), boolean logic (and, or, complement) and shifts (left, right, rotate). Such simple ALUs may be found in small 4- and 8-bit processors used in embedded systems.

Aside
According to some sources, the most popular processors are not those found in your IBM PC, but the 4-bit microprocessors that control your washing machine, telephone handset, computer keyboard, parts of your car, etc.

More complex ALUs will support a wider range of integer operations (multiply and divide), floating point operations (add, subtract, multiply, divide) and even mathematical functions (square root, sine, cosine, log, etc). Up until a few years ago, many high performance processors did not contain support for integer multiply or divide or floating point operations. The largest market for general purpose programmable processors is the commercial one, where the commonest arithmetic operations are addition and subtraction. Integer multiply and all other more complex operations were performed in software - although this takes considerable time (a 32-bit integer multiply needs 32 adds and shifts), the low frequency of these operations meant that their low speed detracted very little from the machine's overall performance. Thus designers would allocate their valuable silicon area to cache and other devices which had a more direct impact on processor performance in the target marketplace.
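As an illustration of the software approach just mentioned, the sketch below multiplies two 32-bit integers using only adds and shifts - roughly the 32 add-and-shift steps referred to above. It is a simplified illustration, not the routine used by any particular machine.

    /* Shift-and-add multiplication: one add and one shift per multiplier bit. */
    #include <stdint.h>
    #include <stdio.h>

    uint64_t shift_add_multiply(uint32_t a, uint32_t b) {
        uint64_t product = 0;
        uint64_t multiplicand = a;
        for (int i = 0; i < 32; i++) {      /* one step per multiplier bit  */
            if (b & 1)                      /* if this bit is set ...       */
                product += multiplicand;    /* ... add the shifted operand  */
            multiplicand <<= 1;             /* shift left for the next bit  */
            b >>= 1;
        }
        return product;
    }

    int main(void) {
        printf("%llu\n", (unsigned long long)shift_add_multiply(1234, 5678));
        return 0;
    }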

[Figure: Motorola's PowerPC 604e. Note the large area allocated to floating point in the lower right.]

More recently, transistor geometries have shrunk to the point where it's possible to get 10⁷ transistors on a single die. Thus it becomes feasible to include floating point ALUs on every chip - probably more economic than designing separate processors without the floating point capability. In fact, some manufacturers will supply otherwise identical processors with and without floating point capability. This can be achieved economically by marking chips which had defects only in the region of the floating point unit as "integer-only" processors and selling them at a lower price for the commercial information processing market! This has the desirable effect of increasing your semiconductor yield quite significantly: a floating point unit is quite complex and occupies a considerable area of silicon - look at a typical chip micrograph - so the probability of defects in this area is reasonably high.

In simple processors, the ALU is a large block of combinatorial logic with the A and B operands and the opcode (operation code) as inputs and a result, Y, plus the condition codes as outputs. Operands and opcode are applied on one clock edge and the circuit is expected to produce a result before the next clock edge. Thus the propagation delay through the ALU determines a minimum clock period and sets an upper limit to the clock frequency.

In advanced processors, the ALU is heavily pipelined to extract higher instruction throughput. Faster clock speeds are now possible because complex operations (eg floating point operations) are done in multiple stages: each individual stage is smaller and faster.

Note for hackers
A small "industry" has grown up around the phenomenon of "clock-chipping" - the discovery that a processor will generally run at a frequency somewhat higher than its specification. Of necessity, manufacturers are somewhat conservative about the performance of their products and have to specify performance over a certain temperature range. For commercial products this is commonly 0°C - 70°C. A reputable computer manufacturer will also be somewhat conservative, ensuring that the temperature inside the case of his computer normally never rises above, say, 45°C. This allows sufficient margin for error in both directions - chips sometimes degrade with age and computers may encounter unusual environmental conditions - so that systems will continue to function to their specifications. Clock-chippers rely on the fact that propagation delays usually increase with temperature, so that a chip specified at x MHz at 70°C may well run at 1.5x at 45°C. Needless to say, this is a somewhat reckless strategy: your processor may function perfectly well for a few months in winter - and then start failing, initially occasionally, and then more regularly as summer approaches! The manufacturer may also have allowed for some degradation with age, so that a chip specified for 70°C now will still function at x MHz in two years' time. Thus a clock-chipped processor may start to fail after a few months at the higher speed - again, the failures may be irregular and occasional initially, and start to occur with greater frequency as the effects of age show themselves. Restoring the original clock chip may be all that's needed to give you back a functional computer!

Software or Hardware?
The question of which instructions should be implemented in hardware and which can be left to software continues to occupy designers. A high performance processor with 10⁷ transistors is very expensive to design - $10⁸ is probably a minimum! Thus the trend seems to be to place everything on the die. However, there is an enormous market for lower capability processors, primarily for embedded systems.

Key terms
condition codes
A set of bits which store general information about the result of an operation, eg result was zero, result was negative, overflow occurred, etc.

Register File
The Register File is the highest level of the memory hierarchy. In a very simple processor, it consists of a single memory location - usually called an accumulator. The result of ALU operations was stored here and could be re-used in a subsequent operation or saved into memory. In a modern processor, it's considered necessary to have at least 32 registers for integer values and often 32 floating point registers as well. Thus the register file is a small, addressable memory at the top of the memory hierarchy. It's visible to programs (which address registers directly), so that the number and type (integer or floating point) of registers is part of the instruction set architecture (ISA).

Registers are built from fast multi-ported memory cells. They must be fast: a register must be able to drive its data onto an internal bus in a single clock cycle. They are multi-ported because a register must be able to supply its data to either the A or the B input of the ALU and accept a value to be stored from the internal data bus.

Register File Capacity


A modern processor will have at least 32 integer registers, each capable of storing a word of 32 (or, more recently, 64) bits. A processor with floating point capabilities will generally also provide 32 or more floating point registers, each capable of holding a double precision floating point word.

These registers are used by programs as temporary storage for values which will be needed for calculations. Because the registers are "closest" to the processor in terms of access time - able to supply a value within a single clock cycle - an optimising compiler for a high level language will attempt to retain as many frequently used values in the registers as possible. Thus the size of the register file is an important factor in the overall speed of programs. Earlier processors with fewer than 32 registers (eg early members of the x86 family) severely hampered the ability of the compiler to keep frequently referenced values close to the processor.

However, it isn't possible to arbitrarily increase the size of the register file. With too many registers:

a. the capacitative load of too many cells on the data line will reduce its response time,
b. the resistance of long data lines needed to connect many cells will combine with the capacitative load to reduce the response time,
c. the number of bits needed to address the registers will result in longer instructions (see the sketch below). A typical RISC instruction has three operands:

    sub $5, $3, $6

requiring 15 bits with 32 (= 2⁵) registers,
d. the complexity of the address decoder (and thus its propagation delay time) will increase as the size of the register file increases.
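The sketch below illustrates point (c): each of the three operand fields needs ceil(log2(N)) bits for N registers, so with 32 registers the register fields alone take 15 bits. The register-file sizes listed are illustrative choices.

    /* How the register-file size affects instruction length (point c above). */
    #include <stdio.h>

    static int bits_needed(int n_registers) {
        int bits = 0;
        while ((1 << bits) < n_registers)
            bits++;
        return bits;
    }

    int main(void) {
        int sizes[] = { 8, 16, 32, 64, 128 };
        for (int i = 0; i < 5; i++) {
            int per_field = bits_needed(sizes[i]);
            printf("%3d registers: %d bits per operand, %2d bits for 3 operands\n",
                   sizes[i], per_field, 3 * per_field);
        }
        return 0;   /* 32 registers -> 5 bits each -> 15 bits, as in the text */
    }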

Ports
Register files need at least 2 read ports: the ALU has two input ports and it may be necessary to supply both of its inputs from the same register: eg

add $3, $2, $2

The value in register 2 is added to itself and the result stored in register 3. So that both operands can be fetched in the same cycle, the register file must have two read ports. As we will see later, in superscalar processors it's necessary to have two read ports and a write port for each functional unit, because such processors can issue an instruction to every functional unit in the same clock cycle.
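A register file with two read ports and a single write port can be modelled very simply. The sketch below is an illustrative software model, not a hardware description; the 32-register size follows the text, everything else is an assumption.

    /* Minimal model of a register file with two read ports and one write port,
       enough to supply both ALU operands in a single cycle. */
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_REGS 32

    typedef struct {
        uint32_t regs[NUM_REGS];
    } RegFile;

    /* Two read ports: both operands can be fetched in the same cycle,
       even when they name the same register (eg add $3, $2, $2). */
    static void read_ports(const RegFile *rf, int ra, int rb,
                           uint32_t *a, uint32_t *b) {
        *a = rf->regs[ra];
        *b = rf->regs[rb];
    }

    /* One write port for the ALU result. */
    static void write_port(RegFile *rf, int rd, uint32_t value) {
        rf->regs[rd] = value;
    }

    int main(void) {
        RegFile rf = {0};
        rf.regs[2] = 21;

        uint32_t a, b;
        read_ports(&rf, 2, 2, &a, &b);   /* add $3, $2, $2 */
        write_port(&rf, 3, a + b);

        printf("$3 = %u\n", (unsigned)rf.regs[3]);   /* 42 */
        return 0;
    }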

Key terms
memory hierarchy
Storage in a processor may be arranged in a hierarchy, with small, fast memories at the "top" of the hierarchy and slower, larger ones at the bottom. Managing this hierarchy effectively is one of the major challenges of computer architecture.

Cache
Etymology
Cache comes from the French cacher = to hide, presumably because a cache is transparent to, or hidden from, a programmer. Programmers do not generally "see" the cache: programs should run in the same way whether a cache is present or not. Cache should only affect the performance of a program.

Introduction
Cache is a key to the performance of a modern processor. Typically ~25% of instructions reference memory, so memory access time is a critical factor in performance. By effectively reducing the cost of a memory access, caches make the greater-than-one-instruction-per-cycle throughput goal of modern processors attainable. A further indication of the importance of cache may be gained from noting that, of the 6.6 x 10⁶ transistors in the MIPS R10000, 4.4 x 10⁶ were used for the primary caches. (This presumably includes the TLBs as well as the instruction and data caches.)

Locality of Reference
All programs show some locality of reference. This appears to be a universal property of programs - whether commercial, scientific, games, etc. Cache exploits this property to improve the access time to data and to reduce the cost of accessing main memory. There are two types of locality:

a. Temporal Locality
Once a location is referenced, there is a high probability that it will be referenced again in the near future.

Instructions
The simplest example of temporal locality is instructions in loops: once the loop is entered, all the instructions in the loop will be referenced again - perhaps many times before the loop exits. However, commonly-called subroutines and functions, and interrupt handlers (eg the timer interrupt handler), also have the same property: if they are accessed once, then it's very likely that they will be accessed again soon.

Data
Many types of data exhibit temporal locality: at any point in a program there will tend to be some "hot" data that the program uses or updates many times before going on to another block of data. Some examples are:

    o Counters
    o Look-up Tables
    o Accumulation variables
    o Stack variables

b. Spatial Locality
When an instruction or datum is accessed, it is very likely that nearby instructions or data will be accessed soon.

Instructions
It's obvious that an instruction stream will exhibit considerable spatial locality. In the absence of jumps, the next instruction to be executed is the one immediately following the current one.

Data
Data also shows considerable spatial locality - particularly when arrays or strings are accessed. Programs commonly step through an array from beginning to end, accessing each element of the array sequentially.
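The short program below illustrates both kinds of locality: the loop instructions and the accumulation variable are reused on every iteration (temporal locality), while the array is walked element by element so successive accesses fall in neighbouring memory locations (spatial locality).

    /* A small example of temporal and spatial locality. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];
        long sum = 0;                 /* "hot" accumulation variable - temporal */

        for (int i = 0; i < N; i++)
            a[i] = i;

        for (int i = 0; i < N; i++)   /* sequential walk through a[] - spatial  */
            sum += a[i];

        printf("sum = %ld\n", sum);
        return 0;
    }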

Cache operation
The most basic cache is a direct-mapped cache. It is a small table of fast memory (modern processors will store 16-256 kbytes of data in a first-level cache and access a cache word in 2 cycles). There are two parts to each entry in the cache: the data and a tag. If memory addresses have p bits (allowing 2ᵖ bytes of memory to be addressed) and the cache can store 2ᵏ words of memory, then the least significant m bits of the address select a byte within a word (each word contains 2ᵐ bytes). The next k bits of the address select one of the 2ᵏ entries in the cache. The p-k-m bits of the tag in this entry are compared with the most significant p-k-m bits of the memory address: if they match, then the data "belongs" to the required memory address and is used instead of data from the main memory. When the cache tag matches the high bits of the address, we say that we have a cache hit. Thus a request for data from the CPU may be supplied in 2 cycles, rather than the 20-100 cycles needed to fetch the same data from the main memory.
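As a concrete sketch of the lookup just described, the code below splits a p = 32-bit address into offset (m bits), index (k bits) and tag (p-k-m bits) fields for a direct-mapped cache with 2ᵏ = 1024 one-word entries; the sizes are illustrative assumptions, not those of a particular processor.

    /* Direct-mapped lookup: offset, index and tag fields of the address. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define M 2                     /* 2^m = 4 bytes per word   */
    #define K 10                    /* 2^k = 1024 cache entries */
    #define ENTRIES (1u << K)

    typedef struct {
        bool     valid;
        uint32_t tag;               /* the top p-k-m address bits */
        uint32_t data;              /* one cached word            */
    } CacheLine;

    static CacheLine cache[ENTRIES];

    /* Returns true on a hit and places the word in *word. */
    bool cache_lookup(uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> M) & (ENTRIES - 1);  /* next k bits    */
        uint32_t tag   = addr >> (M + K);              /* top p-k-m bits */

        if (cache[index].valid && cache[index].tag == tag) {
            *word = cache[index].data;                 /* cache hit      */
            return true;
        }
        return false;                                  /* miss: fetch from main memory */
    }

    int main(void) {
        uint32_t w;
        printf("hit: %d\n", cache_lookup(0x00101c40, &w));  /* cold cache: miss */
        return 0;
    }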

Basic operations
    Write-Through
    Write-Back

Cache organisations
    Direct Mapped
    Fully Associative
    Set Associative

Cache Performance


Main Memory
Memory Technologies
Two different technologies can be used to store bits in semiconductor random access memory (RAM): static RAM and dynamic RAM.

Static RAM
Static RAM cells use 4-6 transistors to store a single bit of data. This provides faster access times at the expense of lower bit densities. A processor's internal memory (registers and cache) will be fabricated in static RAM. Because the industry has focussed on mass-producing dynamic RAM in ever-increasing densities, static RAM is usually considerably more expensive than dynamic RAM, due both to its lower density and the smaller demand (lower production volumes lead to higher costs!). Static RAM is used extensively for second level cache memory, where its speed is needed and a relatively small memory will lead to a significant increase in performance. A high-performance 1998 processor will generally have 512kB to 4MB of L2 cache. Since it doesn't need refresh, static RAM's power consumption is much lower than dynamic RAM's, so SRAMs will be found in battery-powered systems. The absence of refresh circuitry leads to slightly simpler systems, so SRAM will also be found in very small systems, where the simplicity of the circuitry compensates for the cost of the memory devices themselves.

Dynamic RAM
The bulk of a modern processor's memory is composed of dynamic RAM (DRAM) chips. One of the reasons that memory access times have not reduced as dramatically as processor speeds have increased is probably that the memory manufacturers appear to be involved in a race to produce higher and higher capacity chips. It seems there is considerable kudos in being first to market with the next generation of chips. Thus density increases have been similar to processor speed increases. A DRAM memory cell uses a single transistor and a capacitor to store a bit of data. Devices are reported to be in limited production which provide 256 Mbits of storage in a single device. In the same period, CPUs with 10 million transistors are considered state-of-the-art. Regularity is certainly a major contributor to this apparent discrepancy.

[Figure: A typical DRAM cell, with a single MOSFET and a storage capacitor.]

A DRAM is about as regular as it is possible to imagine any device could be: a massive 2-D array of bit storage cells. In contrast, a CPU has a large amount of irregular control logic.

Access modes
Almost all DRAMs fabricated require the address applied to the device to be asserted in two parts: a row address and a column address. This has a deleterious effect on the access time, but enables devices with large numbers of bits to be fabricated with fewer pins (enabling higher densities): the row and column addresses are applied to the same pins in row- and column-address phases of the access.

[Figure: Read access for a DRAM device, showing application of the row and column addresses in conjunction with the row-address strobe (RAS) and column-address strobe (CAS).]

The Access Time Myth


The performance of commercial DRAMs is commonly quoted in terms of the "access time" (tRAC in the figure), the time from the assertion of RAS to availability of the data. A more relevant figure when considering total system throughput is the cycle time (tRC in the figure) which is usually about twice as long. The cycle time is the minimum time between successive accesses to the same device and is thus the factor which determines data throughput.

Refresh
Charge leaks slowly from the storage capacitor in a DRAM cell, so each cell needs to be periodically refreshed: refresh intervals are in the ms region. When a DRAM is being refreshed, other accesses must be "held off". This increases the complexity of DRAM controllers (and causes SRAM to be the memory of choice in small systems, where the cost of the refresh circuitry would outweigh the extra cost of the SRAM chip itself) and has a small (several per cent) effect on the effective bandwidth, as the memory is effectively "off-line" for a short time every few milliseconds.

Page mode
The bandwidth of DRAM chips can be increased by operating them in page mode. Several column addresses are applied for each row address.

[Figure: Read access for a DRAM device operating in page mode.]

Thus the overhead of asserting the address in two phases is reduced and throughput is increased. Locality of reference makes this an effective strategy: once one location in a page is accessed, there is a high probability that other locations in the same page will be accessed as well. It also helps when filling cache lines, which will span several consecutive words in a modern processor.
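The sketch below shows why page mode raises throughput, using illustrative timings (the 60ns row overhead and 30ns column access are assumptions, not datasheet figures): a full RAS+CAS access per word is compared with one RAS followed by several CAS-only accesses within the same row.

    /* Rough comparison of normal and page-mode access for a multi-word burst. */
    #include <stdio.h>

    int main(void) {
        double t_ras = 60.0;        /* assumed row-access overhead (ns) */
        double t_cas = 30.0;        /* assumed column access time (ns)  */
        int words_per_burst = 4;    /* eg filling a 4-word cache line   */

        double normal = words_per_burst * (t_ras + t_cas);
        double paged  = t_ras + words_per_burst * t_cas;

        printf("normal access: %.0f ns for %d words\n", normal, words_per_burst);
        printf("page mode:     %.0f ns for %d words\n", paged, words_per_burst);
        return 0;
    }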

Processor-Memory Interconnect
Bus

Split Address and Data Buses
Splitting the address and data buses allows a processor to overlap the data phase of a bus transaction with the address phase of a following transaction - achieving faster throughput.

Interleaved Systems
By arranging memory in banks, data throughput can be increased. Successive words of a multi-word burst are fetched from different memory banks: this means that the access latency for a memory word to be fetched from memory is incurred only for the first word of the burst. Subsequent words are fetched in parallel from different banks and are ready at the same time as the first word of the burst, so they can be placed on the bus in succeeding bus cycles with no additional penalty.
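The sketch below shows the usual low-order interleaving scheme behind this idea: consecutive words map to different banks, so a burst can overlap its accesses. The 4-byte word and 4-bank layout are illustrative choices, not a description of any particular system.

    /* Low-order interleaving: consecutive words land in different banks. */
    #include <stdint.h>
    #include <stdio.h>

    #define WORD_BYTES 4
    #define NUM_BANKS  4

    static unsigned bank_of(uint32_t addr) {
        return (addr / WORD_BYTES) % NUM_BANKS;   /* bank selected by low word bits */
    }

    int main(void) {
        /* A 4-word burst starting at 0x1000 touches each bank exactly once,
           so the banks can be accessed in parallel. */
        for (uint32_t addr = 0x1000; addr < 0x1010; addr += WORD_BYTES)
            printf("address 0x%04x -> bank %u\n", (unsigned)addr, bank_of(addr));
        return 0;
    }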

Cross-bar switches
Bus processor-memory interconnects represent such a severe bottleneck in multiple processor systems that high-performance multi-processor clusters now tend to provide cross-bar switch interconnections between processors and memory. One advantage of a cross-bar is that it provides multiple point-to-point connections between processors and banks of memory. Not only are there more links between processors and memory (increasing aggregate bandwidth) but the point-to-point links can be faster. We'll look into this further in the general context of interconnection systems in parallel processors.

Error Detection and Correction
    Parity
    Error Correcting Memory
