Similar to caches, TLBs may have multiple levels. CPUs can be (and nowadays usually are) built with multiple TLBs, for example a small "L1" TLB (potentially fully associative) that is extremely fast, and a larger "L2" TLB that is somewhat slower. When separate ITLBs and DTLBs are used, a CPU can have three (ITLB1, DTLB1, TLB2) or four TLBs.
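The two-level lookup described above can be sketched as follows. This is a minimal illustrative model, not any real CPU's design: the sizes, the dictionary-based storage, and the naive eviction policy are all assumptions chosen for brevity.

```python
# Illustrative sketch of a two-level TLB: a small, fast L1 is checked
# first, then a larger L2. Real TLBs are set-associative hardware
# structures; plain dicts stand in for them here.

class TwoLevelTLB:
    def __init__(self, l1_size=64, l2_size=512):
        self.l1 = {}              # virtual page number -> physical frame number
        self.l2 = {}
        self.l1_size = l1_size
        self.l2_size = l2_size

    def lookup(self, vpn):
        if vpn in self.l1:        # fast path: L1 hit
            return self.l1[vpn]
        if vpn in self.l2:        # slower path: L2 hit, promote into L1
            self._fill_l1(vpn, self.l2[vpn])
            return self.l2[vpn]
        return None               # miss in both levels

    def _fill_l1(self, vpn, pfn):
        if len(self.l1) >= self.l1_size:        # naive eviction, for illustration
            self.l1.pop(next(iter(self.l1)))
        self.l1[vpn] = pfn

    def insert(self, vpn, pfn):
        if len(self.l2) >= self.l2_size:
            self.l2.pop(next(iter(self.l2)))
        self.l2[vpn] = pfn
        self._fill_l1(vpn, pfn)
```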
For instance, Intel's Nehalem microarchitecture has a four-way set-associative L1 DTLB with 64 entries for 4 KiB pages and 32 entries for 2/4 MiB pages, an L1 ITLB with 128 entries for 4 KiB pages using four-way associativity and 14 fully associative entries for 2/4 MiB pages (both parts of the ITLB divided statically between two threads)[6] and a unified 512-entry L2 TLB for 4 KiB pages,[7] both 4-way associative.[8]
Some TLBs may have separate sections for small pages and huge pages.
TLB miss handling
Two schemes for handling TLB misses are commonly found in modern architectures:
With hardware TLB management, the CPU automatically walks the page tables (using the CR3 register on x86, for instance) to see if there is a valid page-table entry for the specified virtual address. If an entry exists, it is brought into the TLB and the TLB access is retried: this time the access will hit, and the program can proceed normally. If the CPU finds no valid entry for the virtual address in the page tables, it raises a page fault exception, which the operating system must handle. Handling page faults usually involves bringing the requested data into physical memory, setting up a page table entry to map the faulting virtual address to the correct physical address, and resuming the program (see Page fault for more details). With a hardware-managed TLB, the format of the TLB entries is not visible to software and can change from CPU to CPU without causing loss of compatibility for the programs.
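The page-table walk described above can be modeled in a few lines. This is a hedged sketch, assuming a 32-bit virtual address split into a 10-bit directory index, a 10-bit table index, and a 12-bit offset (4 KiB pages); real walkers are hardware state machines, and the exact split varies by architecture and mode.

```python
# Sketch of a two-level page-table walk: directory index -> page table,
# table index -> frame, then recombine the frame with the page offset.
# Dicts stand in for in-memory page-table pages.

PAGE_SHIFT = 12            # 4 KiB pages
INDEX_BITS = 10            # 10 index bits per level (illustrative)

def walk_page_tables(page_directory, vaddr):
    """Return the physical address for vaddr, or None (page fault)."""
    dir_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    table_index = (vaddr >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)

    page_table = page_directory.get(dir_index)   # first-level lookup
    if page_table is None:
        return None                              # no valid entry: page fault
    frame = page_table.get(table_index)          # second-level lookup
    if frame is None:
        return None
    return (frame << PAGE_SHIFT) | offset        # physical address
```

On a hit, hardware would load the resulting translation into the TLB and retry the access; on `None`, it would raise the page fault exception for the OS to handle.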
With software-managed TLBs, a TLB miss generates a "TLB miss" exception, and operating system code is responsible for walking the page tables and performing the translation in software. The operating system then loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss. As with hardware TLB management, if the OS finds no valid translation in the page tables, a page fault has occurred, and the OS must handle it accordingly. Instruction sets of CPUs that have software-managed TLBs have instructions that allow loading entries into any slot in the TLB. The format of the TLB entry is defined as a part of the instruction set architecture (ISA).[9] The MIPS architecture specifies a software-managed TLB;[10] the SPARC V9 architecture allows an implementation of SPARC V9 to have no MMU, an MMU with a software-managed TLB, or an MMU with a hardware-managed TLB,[11] and the UltraSPARC architecture specifies a software-managed TLB.[12]
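The software-managed flow above can be outlined as follows. All structures here are illustrative Python stand-ins for OS data and TLB slots, not any real ISA's handler; a real handler runs in the TLB-miss exception vector and uses privileged TLB-write instructions.

```python
# Sketch of software TLB-miss handling: on a miss, OS code walks the
# page tables itself, loads the entry into the TLB, and the faulting
# instruction is retried. A flat dict stands in for the page tables.

class PageFault(Exception):
    """Raised when no valid translation exists (OS must handle it)."""

def tlb_miss_handler(tlb, page_tables, vpn):
    pfn = page_tables.get(vpn)       # page-table walk, in software
    if pfn is None:
        raise PageFault(vpn)         # no valid translation: page fault
    tlb[vpn] = pfn                   # load translation into a TLB slot
    return pfn                       # faulting instruction then restarts

def translate(tlb, page_tables, vpn):
    if vpn in tlb:                   # TLB hit: no exception
        return tlb[vpn]
    return tlb_miss_handler(tlb, page_tables, vpn)   # "TLB miss" exception
```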
The Itanium architecture provides an option of using either software- or hardware-managed TLBs.[13]
The Alpha architecture's TLB is managed in PALcode, rather than in the operating system. As the PALcode for a processor can be processor-specific and operating-system-specific, this allows different versions of PALcode to implement different page-table formats for different operating systems, without requiring that the TLB format, and the instructions to control the TLB, be specified by the architecture.[14]
Typical TLB
These are typical performance levels of a TLB:[15]
size: 12 – 4,096 entries
hit time: 0.5 – 1 clock cycle
miss penalty: 10 – 100 clock cycles
miss rate: 0.01 – 1%
If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 (1.30 clock cycles per memory access).
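The same weighted average can be written as a small helper, which makes it easy to try the other values from the table above:

```python
# Effective cycles per memory access: hits pay hit_time, misses pay
# hit_time plus the miss penalty, weighted by the miss rate.
def effective_cycles(hit_time, miss_penalty, miss_rate):
    return hit_time * (1 - miss_rate) + (hit_time + miss_penalty) * miss_rate

print(round(effective_cycles(1, 30, 0.01), 2))  # prints 1.3
```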
Context switch
On a context switch, some TLB entries can become invalid, since the virtual-to-p
hysical mapping is different. The simplest strategy to deal with this is to comp
letely flush the TLB. This means that after a switch, the TLB is empty and any m
emory reference will be a miss, and it will be some time before things are runni
ng back at full speed. Newer CPUs use more effective strategies marking which pr
ocess an entry is for. This means that if a second process runs for only a short
time and jumps back to a first process, it may still have valid entries, saving
the time to reload them.[16]
For example, in the Alpha 21264, each TLB entry is tagged with an "address space number" (ASN), and only TLB entries with an ASN matching the current task are considered valid. Another example is the Intel Pentium Pro, where the page global enable (PGE) flag in the register CR4 and the global (G) flag of a page-directory or page-table entry can be used to prevent frequently used pages from being automatically invalidated in the TLBs on a task switch or a load of register CR3.
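The ASN-tagging idea can be sketched in a few lines. This is an illustrative model, assuming simple dictionary storage: an entry is only visible when its tag matches the current ASN, global entries match any ASN, and a context switch just changes the current ASN instead of flushing anything.

```python
# Sketch of an ASN-tagged TLB: entries carry the ASN of the task that
# created them, so a context switch needs no flush. Global entries
# (like x86 pages with the G flag set) are shared across all ASNs.

class TaggedTLB:
    def __init__(self):
        self.entries = {}       # (asn, vpn) -> pfn  (per-address-space)
        self.globals = {}       # vpn -> pfn          (valid for any ASN)
        self.current_asn = 0

    def context_switch(self, asn):
        self.current_asn = asn  # no flush: old entries stay, just unmatched

    def insert(self, vpn, pfn, is_global=False):
        if is_global:
            self.globals[vpn] = pfn
        else:
            self.entries[(self.current_asn, vpn)] = pfn

    def lookup(self, vpn):
        if vpn in self.globals:                 # global entries always match
            return self.globals[vpn]
        return self.entries.get((self.current_asn, vpn))
```

Switching back to the first task restores its ASN, so its surviving entries become visible again without being reloaded.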
While selective flushing of the TLB is an option in software-managed TLBs, the only option in some hardware TLBs (for example, the TLB in the Intel 80386) is the complete flushing of the TLB on a context switch. Other hardware TLBs (for example, the TLB in the Intel 80486 and later x86 processors, and the TLB in ARM processors) allow the flushing of individual entries from the TLB indexed by virtual address.
Virtualization and x86 TLB
With the advent of virtualization for server consolidation, a lot of effort has gone into making the x86 architecture easier to virtualize and to ensure better performance of virtual machines on x86 hardware.[17][18] In a long list of such changes to the x86 architecture, the TLB is the latest.
Normally, the entries in the x86 TLBs are not associated with any address space. Hence, every time there is a change in address space, such as a context switch, the entire TLB has to be flushed. Maintaining a tag that associates each TLB entry with an address space in software and comparing this tag during TLB lookup and TLB flush is very expensive, especially since the x86 TLB is designed to operate with very low latency and completely in hardware. In 2008, both Intel (Nehalem)[19] and AMD (SVM)[20] introduced tags as part of the TLB entry and dedicated hardware that checks the tag during lookup. Even though these are not fully exploited, it is envisioned that in the future, these tags will identify the address space to which every TLB entry belongs. Thus a context switch will not result in the flushing of the TLB, but just in changing the tag of the current address space to the tag of the address space of the new task.