Similar to caches, TLBs may have multiple levels. CPUs can be (and nowadays usually are) built with multiple TLBs, for example a small "L1" TLB (potentially fully associative) that is extremely fast, and a larger "L2" TLB that is somewhat slower. When separate ITLBs and DTLBs are used, a CPU can have three (ITLB1, DTLB1, TLB2) or four TLBs.
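The two-level lookup described above can be sketched as follows. This is a minimal illustrative model, not any real CPU's design: the sizes, the dictionary-based storage, and the naive eviction policy are all assumptions chosen for brevity.

```python
# Illustrative sketch of a two-level TLB: a small, fast L1 is checked
# first, then a larger L2. Real TLBs are set-associative hardware
# structures; plain dicts stand in for them here.

class TwoLevelTLB:
    def __init__(self, l1_size=64, l2_size=512):
        self.l1 = {}              # virtual page number -> physical frame number
        self.l2 = {}
        self.l1_size = l1_size
        self.l2_size = l2_size

    def lookup(self, vpn):
        if vpn in self.l1:        # fast path: L1 hit
            return self.l1[vpn]
        if vpn in self.l2:        # slower path: L2 hit, promote into L1
            self._fill_l1(vpn, self.l2[vpn])
            return self.l2[vpn]
        return None               # miss in both levels

    def _fill_l1(self, vpn, pfn):
        if len(self.l1) >= self.l1_size:        # naive eviction, for illustration
            self.l1.pop(next(iter(self.l1)))
        self.l1[vpn] = pfn

    def insert(self, vpn, pfn):
        if len(self.l2) >= self.l2_size:
            self.l2.pop(next(iter(self.l2)))
        self.l2[vpn] = pfn
        self._fill_l1(vpn, pfn)
```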
For instance, Intel's Nehalem microarchitecture has a four-way set-associative L1 DTLB with 64 entries for 4 KiB pages and 32 entries for 2/4 MiB pages, an L1 ITLB with 128 entries for 4 KiB pages using four-way associativity and 14 fully associative entries for 2/4 MiB pages (both parts of the ITLB divided statically between two threads)[6] and a unified 512-entry L2 TLB for 4 KiB pages,[7] both 4-way associative.[8]
Some TLBs may have separate sections for small pages and huge pages.
TLB miss handling
Two schemes for handling TLB misses are commonly found in modern architectures:
With hardware TLB management, the CPU automatically walks the page tables (using the CR3 register on x86, for instance) to see if there is a valid page-table entry for the specified virtual address. If an entry exists, it is brought into the TLB and the TLB access is retried: this time the access will hit, and the program can proceed normally. If the CPU finds no valid entry for the virtual address in the page tables, it raises a page fault exception, which the operating system must handle. Handling page faults usually involves bringing the requested data into physical memory, setting up a page table entry to map the faulting virtual address to the correct physical address, and resuming the program (see Page fault for more details). With a hardware-managed TLB, the format of the TLB entries is not visible to software and can change from CPU to CPU without causing loss of compatibility for the programs.
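The page-table walk described above can be modeled in a few lines. This is a hedged sketch, assuming a 32-bit virtual address split into a 10-bit directory index, a 10-bit table index, and a 12-bit offset (4 KiB pages); real walkers are hardware state machines, and the exact split varies by architecture and mode.

```python
# Sketch of a two-level page-table walk: directory index -> page table,
# table index -> frame, then recombine the frame with the page offset.
# Dicts stand in for in-memory page-table pages.

PAGE_SHIFT = 12            # 4 KiB pages
INDEX_BITS = 10            # 10 index bits per level (illustrative)

def walk_page_tables(page_directory, vaddr):
    """Return the physical address for vaddr, or None (page fault)."""
    dir_index = (vaddr >> (PAGE_SHIFT + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)
    table_index = (vaddr >> PAGE_SHIFT) & ((1 << INDEX_BITS) - 1)
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)

    page_table = page_directory.get(dir_index)   # first-level lookup
    if page_table is None:
        return None                              # no valid entry: page fault
    frame = page_table.get(table_index)          # second-level lookup
    if frame is None:
        return None
    return (frame << PAGE_SHIFT) | offset        # physical address
```

On a hit, hardware would load the resulting translation into the TLB and retry the access; on `None`, it would raise the page fault exception for the OS to handle.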
With software-managed TLBs, a TLB miss generates a "TLB miss" exception, and operating system code is responsible for walking the page tables and performing the translation in software. The operating system then loads the translation into the TLB and restarts the program from the instruction that caused the TLB miss. As with hardware TLB management, if the OS finds no valid translation in the page tables, a page fault has occurred, and the OS must handle it accordingly. Instruction sets of CPUs that have software-managed TLBs have instructions that allow loading entries into any slot in the TLB. The format of the TLB entry is defined as a part of the instruction set architecture (ISA).[9] The MIPS architecture specifies a software-managed TLB;[10] the SPARC V9 architecture allows an implementation of SPARC V9 to have no MMU, an MMU with a software-managed TLB, or an MMU with a hardware-managed TLB,[11] and the UltraSPARC architecture specifies a software-managed TLB.[12]
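The software-managed flow above can be outlined as follows. All structures here are illustrative Python stand-ins for OS data and TLB slots, not any real ISA's handler; a real handler runs in the TLB-miss exception vector and uses privileged TLB-write instructions.

```python
# Sketch of software TLB-miss handling: on a miss, OS code walks the
# page tables itself, loads the entry into the TLB, and the faulting
# instruction is retried. A flat dict stands in for the page tables.

class PageFault(Exception):
    """Raised when no valid translation exists (OS must handle it)."""

def tlb_miss_handler(tlb, page_tables, vpn):
    pfn = page_tables.get(vpn)       # page-table walk, in software
    if pfn is None:
        raise PageFault(vpn)         # no valid translation: page fault
    tlb[vpn] = pfn                   # load translation into a TLB slot
    return pfn                       # faulting instruction then restarts

def translate(tlb, page_tables, vpn):
    if vpn in tlb:                   # TLB hit: no exception
        return tlb[vpn]
    return tlb_miss_handler(tlb, page_tables, vpn)   # "TLB miss" exception
```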
The Itanium architecture provides an option of using either software- or hardware-managed TLBs.[13]
The Alpha architecture's TLB is managed in PALcode, rather than in the operating system. As the PALcode for a processor can be processor-specific and operating-system-specific, this allows different versions of PALcode to implement different page-table formats for different operating systems, without requiring that the TLB format, and the instructions to control the TLB, be specified by the architecture.[14]
Typical TLB
These are typical performance levels of a TLB:[15]
size: 12 – 4,096 entries
hit time: 0.5 – 1 clock cycle
miss penalty: 10 – 100 clock cycles
miss rate: 0.01 – 1%
If a TLB hit takes 1 clock cycle, a miss takes 30 clock cycles, and the miss rate is 1%, the effective memory cycle rate is an average of 1 × 0.99 + (1 + 30) × 0.01 = 1.30 (1.30 clock cycles per memory access).
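The same weighted average can be written as a small helper, which makes it easy to try the other values from the table above:

```python
# Effective cycles per memory access: hits pay hit_time, misses pay
# hit_time plus the miss penalty, weighted by the miss rate.
def effective_cycles(hit_time, miss_penalty, miss_rate):
    return hit_time * (1 - miss_rate) + (hit_time + miss_penalty) * miss_rate

print(round(effective_cycles(1, 30, 0.01), 2))  # prints 1.3
```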
Context switch
On a context switch, some TLB entries can become invalid, since the virtual-to-p
hysical mapping is different. The simplest strategy to deal with this is to comp
letely flush the TLB. This means that after a switch, the TLB is empty and any m
emory reference will be a miss, and it will be some time before things are runni
ng back at full speed. Newer CPUs use more effective strategies marking which pr
ocess an entry is for. This means that if a second process runs for only a short
time and jumps back to a first process, it may still have valid entries, saving
the time to reload them.[16]
For example, in the Alpha 21264, each TLB entry is tagged with an "address space number" (ASN), and only TLB entries with an ASN matching the current task are considered valid. Another example is the Intel Pentium Pro, where the page global enable (PGE) flag in the register CR4 and the global (G) flag of a page-directory or page-table entry can be used to prevent frequently used pages from being automatically invalidated in the TLBs on a task switch or a load of register CR3.
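The ASN-tagging idea can be sketched in a few lines. This is an illustrative model, assuming simple dictionary storage: an entry is only visible when its tag matches the current ASN, global entries match any ASN, and a context switch just changes the current ASN instead of flushing anything.

```python
# Sketch of an ASN-tagged TLB: entries carry the ASN of the task that
# created them, so a context switch needs no flush. Global entries
# (like x86 pages with the G flag set) are shared across all ASNs.

class TaggedTLB:
    def __init__(self):
        self.entries = {}       # (asn, vpn) -> pfn  (per-address-space)
        self.globals = {}       # vpn -> pfn          (valid for any ASN)
        self.current_asn = 0

    def context_switch(self, asn):
        self.current_asn = asn  # no flush: old entries stay, just unmatched

    def insert(self, vpn, pfn, is_global=False):
        if is_global:
            self.globals[vpn] = pfn
        else:
            self.entries[(self.current_asn, vpn)] = pfn

    def lookup(self, vpn):
        if vpn in self.globals:                 # global entries always match
            return self.globals[vpn]
        return self.entries.get((self.current_asn, vpn))
```

Switching back to the first task restores its ASN, so its surviving entries become visible again without being reloaded.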
While selective flushing of the TLB is an option in software-managed TLBs, the only option in some hardware TLBs (for example, the TLB in the Intel 80386) is the complete flushing of the TLB on a context switch. Other hardware TLBs (for example, the TLB in the Intel 80486 and later x86 processors, and the TLB in ARM processors) allow the flushing of individual entries from the TLB indexed by virtual address.
Virtualization and x86 TLB
With the advent of virtualization for server consolidation, a lot of effort has gone into making the x86 architecture easier to virtualize and to ensure better performance of virtual machines on x86 hardware.[17][18] In a long list of such changes to the x86 architecture, the TLB is the latest.
Normally, the entries in the x86 TLBs are not associated with any address space. Hence, every time there is a change in address space, such as a context switch, the entire TLB has to be flushed. Maintaining a tag that associates each TLB entry with an address space in software and comparing this tag during TLB lookup and TLB flush is very expensive, especially since the x86 TLB is designed to operate with very low latency and completely in hardware. In 2008, both Intel (Nehalem)[19] and AMD (SVM)[20] introduced tags as part of the TLB entry and dedicated hardware that checks the tag during lookup. Even though these are not fully exploited, it is envisioned that in the future, these tags will identify the address space to which every TLB entry belongs. Thus a context switch will not result in the flushing of the TLB, but just in changing the tag of the current address space to the tag of the address space of the new task.