
Burst Mode Memories Improve Cache Design

Zwie Amitai, Product Planning and Applications Manager
David C. Wyland, Vice President of Engineering
Quality Semiconductor, Inc.
851 Martin Avenue, Santa Clara, CA 95050-2903
Tel: (408) 986-8326  Fax: (408) 496-0591

ABSTRACT

Burst mode memories improve cache design by improving refill time on cache misses. Burst mode RAMs allow refill of a four word cache line in five clock cycles at 50 MHz rather than the eight clock cycles that would be required for a conventional SRAM. Burst mode RAMs also have clock synchronous interfaces, which make them easier to design into systems, particularly at clock rates of 25 MHz and above.

Figure 2: Burst RAM Read Timing

A burst mode RAM provides high speed transfer of a block of sequential words, called a burst. A block diagram of a burst mode SRAM is shown in Figure 1. A burst mode RAM consists of a conventional SRAM plus an address counter, a read/write flip-flop and a write register. Read and write timing is controlled by a clock in combination with the address counter load and read/write signals. In this configuration, random access to a word in the SRAM requires two clock cycles, with successive words being read or written at one clock cycle per word. This is shown in the timing diagrams of Figures 2 and 3.

Figure 1: Burst RAM Block Diagram

For write operations, the first word of data to be written is clocked into the write register at the same time the address counter and the read/write flip-flop are loaded, as shown in Figure 3. Data from the write register is written into the SRAM during the second clock cycle. At the end of the second clock cycle, new data is clocked into the write register and the address counter is incremented to the next location to write the next sequential word.

Figure 3: Burst RAM Write Timing

In the read timing diagram of Figure 2, the first clock cycle is used to load the address counter and the read/write flip-flop for random access to the first word. Read data comes out of the SRAM before the end of the second clock cycle. The address counter is incremented at the end of the second clock cycle, and the next word is read from the SRAM. This allows one clock cycle per successive word read following the initial random access.
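The read and write sequencing above can be sketched as a small cycle-level model. The class and method names below are illustrative, not a real device model, but the sequencing follows the text: one clock to load the address counter and read/write flip-flop (plus the first write word), then one word transferred per clock while the counter increments internally.

```python
# Cycle-level sketch of a burst mode SRAM: two-cycle first access,
# one cycle per subsequent word. Names are illustrative.

class BurstSRAM:
    def __init__(self, depth):
        self.mem = [0] * depth
        self.counter = 0        # on-chip address counter
        self.write_mode = False
        self.write_reg = 0      # holds data for the pipelined write

    def load(self, address, write=False, data=0):
        """Clock 1: load the address counter and read/write flip-flop.
        For writes, the first data word is clocked into the write
        register at the same time."""
        self.counter = address
        self.write_mode = write
        if write:
            self.write_reg = data

    def clock(self, next_data=0):
        """Each subsequent clock transfers one word and increments the
        internal counter, so no external address is needed."""
        if self.write_mode:
            self.mem[self.counter] = self.write_reg  # write previous word
            self.write_reg = next_data               # capture next word
            self.counter += 1
            return None
        word = self.mem[self.counter]
        self.counter += 1
        return word

# Burst write of a four-word line: 1 load cycle + 4 data cycles.
ram = BurstSRAM(16)
ram.load(4, write=True, data=10)
for d in (11, 12, 13):
    ram.clock(d)
ram.clock()          # final clock writes the last word from the register

# Burst read of the same line: 1 load cycle + 4 data cycles = 5 clocks,
# matching the abstract's five-cycle line refill at 50 MHz.
ram.load(4)
line = [ram.clock() for _ in range(4)]
print(line)          # -> [10, 11, 12, 13]
```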


Authorized licensed use limited to: IEEE Xplore. Downloaded on April 8, 2009 at 11:36 from IEEE Xplore. Restrictions apply.

The burst mode memory is capable of high speed operation after the initial access because the sequential addresses are generated internally by the address counter. This greatly reduces the read and write cycle times for sequential data following the first access. Clock speeds of up to 50 MHz are possible in a TTL system, making the burst mode memory particularly well suited to the newer generations of high speed RISC and CISC chips.

A direct mapped cache for a 32-bit processor is shown in Figure 4. A direct mapped cache consists of a cache tag RAM, a cache data RAM and a small amount of logic to control events when a cache hit or a cache miss occurs. A cache hit is said to occur if a requested word is found in the cache. A miss occurs when the word is not found in the cache.

Figure 4: Cache Block Diagram

Burst mode RAMs are faster than SRAM based memory systems because the address counter is integrated into their design. In a burst mode SRAM, the minimum cycle time of the burst operation is approximately the same as the address access time of an equivalent SRAM. This can be as low as 20 ns. In a conventional burst mode memory system design using an SRAM and an address counter, the minimum cycle time is determined by the sum of the clock-to-output delay of the counter plus the address access time of the SRAM. The cycle time is therefore increased by the delay of the address counter. This adds 6.2 ns to the memory cycle time using the QSFCT161A, one of the fastest counters commercially available. If a 20 ns SRAM is used, the minimum cycle is 26.2 ns. Alternately, a 13.8 ns SRAM would be required to achieve the 20 ns cycle time of a burst mode RAM.

CACHE MEMORY IN RISC AND CISC PROCESSORS

The use of cache memories has become a standard feature of high performance processor design. Indeed, RISC design is based on cache memory. The function of a cache memory is to improve the effective access time of the main memory, usually medium speed DRAM, by eliminating processor wait states. The cache does this by keeping copies of the most frequently read words from main memory in a small, high speed buffer memory. When the processor attempts to read a word from main memory, the cache checks to see if it has a copy. If it does, it responds immediately. If not, the main memory is started on a normal read cycle, and the processor waits for it to respond. The cache therefore speeds up the system by reducing the average amount of time the processor has to wait to read a word from memory. Caches are effective because in typical programs, most memory accesses are read cycles from a relatively small cluster of memory locations. Cache performance can be defined in terms of effective wait states with a cache relative to the number of wait states without it.
A 33 MHz processor with medium speed DRAM memory may require three wait states without a cache and 0.5 wait states with a cache. The three wait states without a cache are determined by the timing requirements of the main memory. The 0.5 wait states is a statistical average. It can be estimated by the product of the cache miss rate and the number of wait states required for cache refill on a miss.
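The estimate above can be written out directly. Note that the 12.5% miss rate and four-wait-state refill used in the example below are one assumed combination that yields the 0.5 figure; the paper does not state them explicitly.

```python
# Effective wait states = miss rate * refill wait states, per the text
# (cache hits add no wait states in this model).
def effective_wait_states(miss_rate, refill_wait_states):
    return miss_rate * refill_wait_states

# Assumed combination: a 12.5% miss rate with a 4-wait-state refill
# reproduces the 0.5 effective wait states quoted for a 33 MHz system.
print(effective_wait_states(0.125, 4))   # -> 0.5
```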


The cache stores copies of words read from main memory in the cache data RAM and stores the locations these words are read from in the cache tag RAM. In the direct mapped cache, the least significant bits of the address bus are sent to both the tag and data RAMs, while the most significant bits are stored in the tag RAM when data is stored in the cache data RAM. In the example shown, both the tag and data RAMs are 8K words deep. When a read request is made to main memory, the least significant bits of the address are used to select one of the 8K words in both memories. The most significant bits of the address are compared against the bits stored in the tag RAM. If there is a match between the two, then the data stored in the data RAM is a copy of the data at the requested location and can be immediately supplied to the processor. This is a cache hit. If the upper address bits do not match, the data stored came from a different location. This is a cache miss.

Direct mapped caches work because most accesses to main memory are typically to a small cluster of a few thousand words located somewhere in the memory space. If the cache is larger than this cluster size, most of the read data will be provided by the cache. The least significant bits of the address bus are used to index within this cluster of words, and the most significant bits identify the region of memory that they came from. (Cache theory is a little more subtle than this. It treats the least significant bits of the address as a hashing function for a hash indexed buffer.)
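The tag/index split just described can be sketched as follows. The 8K depth matches the example of Figure 4; the function names and the sample address are illustrative.

```python
# Direct-mapped lookup: low address bits index both RAMs, high bits
# are compared against the stored tag.

DEPTH = 8 * 1024            # 8K entries, as in the example
INDEX_BITS = 13             # log2(8K)

tag_ram = [None] * DEPTH
data_ram = [0] * DEPTH

def split(address):
    index = address & (DEPTH - 1)   # least significant bits
    tag = address >> INDEX_BITS     # most significant bits
    return index, tag

def read(address, main_memory):
    index, tag = split(address)
    if tag_ram[index] == tag:       # cache hit: supply data at once
        return data_ram[index], "hit"
    # Cache miss: fetch from main memory, refill tag and data together.
    word = main_memory[address]
    tag_ram[index] = tag
    data_ram[index] = word
    return word, "miss"

main = {0x23041: 99}                # illustrative main memory contents
print(read(0x23041, main))          # -> (99, 'miss')  first access
print(read(0x23041, main))          # -> (99, 'hit')   now cached
```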


Figure 6: 80486 32K Byte Cache Block Diagram

Figure 8: 80486 Cache Timing Diagram

The design of Figure 6 uses one QS8813 8Kx18 Tag RAM and two QS8811 8Kx18 Burst Mode RAMs for the tag and data memories respectively. The QS8813 is an 8Kx18 Tag SRAM with built-in match enable logic that allows it to directly drive the BRDY input of the 80486. This eliminates the need for additional logic in the propagation delay path between the Tag SRAM and the microprocessor. This can save five or more nanoseconds in match time. Only 2K of the 8K words are used; however, the QS8813 provides a single chip design solution for the tag RAM. The complete design requires only three RAM chips.

Figure 7: 80486 128K Byte Cache Block Diagram


The design of Figure 7 uses one QS8813 8Kx18 Tag RAM and four QS8839 32Kx9 Burst Mode RAMs for the tag and data memories respectively. The full 8K words of the QS8813 are used to support the 32K words of the QS8839. Both the QS8811 and QS8839 Burst Mode RAM chips provide an on-chip address counter and logic for burst mode operation. The address counter provides for bursts of up to four words using the 80486 address counting algorithm. Also, the burst counter on the QS8811 counts in either binary or 80486 counting modes, pin selectable.
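The 80486 address counting algorithm mentioned above is not a simple binary increment: within a four-word line, the transfer order depends on which word caused the miss. A common way to describe the i486 order is XORing the starting word number with a binary count; this is a hedged sketch, and the exact sequence should be checked against Intel's processor documentation.

```python
# Comparison of the two pin-selectable counting modes.

def burst_order_486(start_word):
    """Word order for a 4-word line fill starting at start_word (0-3),
    assuming the i486 XOR ordering."""
    return [start_word ^ i for i in range(4)]

def burst_order_binary(start_word):
    """Plain binary counting with wraparound, for comparison."""
    return [(start_word + i) % 4 for i in range(4)]

for s in range(4):
    print(s, burst_order_486(s), burst_order_binary(s))
# e.g. a miss on word 1 gives 486 order [1, 0, 3, 2] vs binary [1, 2, 3, 0]
```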

CONCLUSION
Burst mode memories provide a performance improvement for the cache systems used in high speed CISC and RISC systems which use multiple words per cache line. They are particularly useful at CPU clock speeds above 25 MHz due to their higher performance and simpler interface. Because of these advantages, burst mode memories are becoming a standard component for cache design of high speed systems.



CACHE PERFORMANCE VS RELOAD TIME

Cache performance is defined by miss rate and reload time. Miss rate is the percentage of accesses that miss, and reload time is the number of wait states required to get the data for the processor and reload the cache on a miss. The miss rate of a cache is a function of cache size, cache organization and the statistics of the program running on the processor. Miss rates are like EPA gas mileage estimates: with different programs, your miss rate will vary from benchmark estimates. Generally, caches range from 16 KBytes to 256 KBytes in size, with larger caches having lower miss rates. Target miss rates are in the 2-20% range. Cache reload time for the cache in Figure 4 is the time to access one word out of main memory. This may require three wait states in a conventional access and four wait states with a cache. The cache system has an extra wait state because one clock cycle is required to determine if the data is in the cache before main memory access can be started on a miss.

A FOUR WORD PER LINE CACHE

Cache refill performance can be improved by loading more than one word on a miss. A cache using this approach is shown in Figure 5. In this design, the data cache is four times as deep as the cache tag memory. The two least significant bits of the address bus go to the cache data memory but do not go to the tag memory. On a cache miss, four words are loaded into the cache data memory, and a single tag - the common tag for the four locations - is written at the same time. This is called a four word per line cache memory, where a line refers to the amount of data fetched on a cache miss.

Figure 5: Four Word/Line Cache Block Diagram
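The four word per line organization can be sketched as an address split: the tag memory is one quarter the depth of the data memory, and the two least significant word-address bits select the word within a line. The 2K-line tag depth, names, and sample address below are illustrative assumptions, not taken from the figure.

```python
# Address split and line refill for a four-word-per-line cache.

LINE_WORDS = 4
TAG_DEPTH = 2 * 1024                   # assumed tag RAM depth (2K lines)
DATA_DEPTH = TAG_DEPTH * LINE_WORDS    # data RAM is four times as deep

tag_ram = [None] * TAG_DEPTH
data_ram = [0] * DATA_DEPTH

def split(word_address):
    word_in_line = word_address & (LINE_WORDS - 1)    # 2 LSBs: data RAM only
    line_index = (word_address >> 2) & (TAG_DEPTH - 1)
    tag = word_address >> 13                          # remaining upper bits
    return tag, line_index, word_in_line

def refill(word_address, main_memory):
    """On a miss, load all four words of the line under one common tag."""
    tag, line_index, _ = split(word_address)
    tag_ram[line_index] = tag                         # single tag write
    base = word_address & ~(LINE_WORDS - 1)           # line-aligned address
    for i in range(LINE_WORDS):
        data_ram[line_index * LINE_WORDS + i] = main_memory[base + i]

main_memory = {a: 0x1000 + a for a in range(16)}      # illustrative contents
refill(5, main_memory)                                # miss on word 5
tag, idx, word = split(5)
print(tag_ram[idx], data_ram[idx * LINE_WORDS + word])  # -> 0 4101
```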

FOUR WORD/LINE CACHE PERFORMANCE


Changing the cache from one word per line to four words per line does not change performance significantly if the reload timing - i.e., the number of wait states per word - is not changed. If all four words are eventually used by the processor and if four wait states are required per word, a total of 16 wait states will be used by either cache to load the four words from main memory. In some cases, not all four words will be used, so the one word per line cache has a small advantage for the same reload timing. Performance of the four word per line cache of Figure 5 can be improved, however, by reducing the number of wait states required to load the four words. The main memory can be designed using interleaving techniques to provide the first word in four wait states and the next three words at one wait state each, for a total of 7 wait states rather than 16. This approximately doubles the performance of the cache.

The four word per line cache has an implied requirement that the cache data memory must be capable of absorbing data at one clock cycle per word. This is not easy at 33-50 MHz clock rates. The burst mode memory provides a natural advantage at these speeds. A burst mode memory with two cycle first access and one cycle per word thereafter can accept data at the rates capable of being generated by the interleaved DRAM main memory. The burst access memory is particularly useful for cache memory reload because the interleaving techniques that can be applied in main memory using static column or nibble mode access generally result in unacceptable chip count and propagation delay when attempted in the cache. This is because the cache memory must be capable of two cycle first access in normal operation as well as burst mode operation for refill on a miss.

BURST MODE IN SECONDARY CACHES

Burst mode operation is becoming a widely used standard in both RISC and CISC processors.
For example, in the Intel 80486, the small on-board cache uses a four word per line refill which is typically supplied from a larger, off-chip secondary cache. In this case, burst mode operation is used by the secondary cache both in its normal operating mode of supplying data to the 80486 and in the reload-on-a-miss mode, when it receives data from main memory. Figure 6 shows a four word per line 32 KByte secondary cache for an 80486 using 8Kx18 burst mode SRAMs for the data portion of the cache and an 8Kx18 tag RAM with on-board comparator for the tag memory. A 128 KByte cache using this architecture is shown in Figure 7. A timing diagram for both designs is shown in Figure 8. This architecture provides a 32 KByte cache in three chips, expandable to 128 KBytes in nine chips using the same tag RAM.


