
Solaris 10 Physical Memory Management

Physical memory in Solaris is managed globally through a central free pool and a system daemon that regulates its use.

Physical Memory Allocation


Solaris uses the system's RAM as a central pool of physical memory for the different consumers within the system. Physical memory is distributed from the central pool at allocation time and returned to the pool when it is no longer needed. A system daemon (the page scanner) proactively manages memory allocation when there is a systemwide shortage of memory.

The Allocation Cycle of Physical Memory


The most significant component of the central pool of physical memory is the freelist. Physical memory is placed on the freelist in page-size chunks when the system is first booted and then cycles among the consumers shown in the figure above.

Anonymous/Process Allocations
Anonymous memory, the most common form of allocation from the freelist, is used for most of a process's memory, including the heap and the stack. Anonymous memory also fulfills shared memory mapping allocations. A small amount of anonymous memory is also used in the kernel, for items such as thread stacks. Anonymous memory is pageable and is returned to the freelist when it is unmapped or when it is stolen by the page scanner daemon.

File System Page Cache
The page cache is used for caching file data for file systems other than ZFS. The file system page cache grows on demand to consume available physical memory as a file cache and caches file data in page-size chunks. Pages reside in one of three places: the segmap cache, a process's address space to which they are mapped, or the cachelist.

The cachelist is the heart of the page cache. All unmapped file pages reside on the cachelist. Segmap is a cache that holds file data read and written through the read and write system calls; it can be thought of as the fast, first-level file system read/write cache. Memory is allocated from the freelist to satisfy a read of a new file page, which then resides in the segmap file cache. File pages are eventually moved from the segmap cache to the cachelist to make room for more pages in the segmap cache.

The cachelist operates as part of the freelist. When the freelist is depleted, allocations are made from the oldest pages on the cachelist. This allows the file system cache to grow to consume all available memory and to shrink dynamically as memory is required for other purposes.

Kernel Allocations
The kernel uses memory to manage information about internal system state; for example, memory used to hold the list of processes in the system. The kernel allocates memory from the freelist for these purposes with its own allocators, vmem and slab, and the memory allocated is mostly nonpageable. However, unlike process and file allocations, the kernel seldom returns memory to the freelist; memory is allocated and freed between kernel subsystems and the kernel allocators. Memory is consumed from the freelist only when the total kernel allocation grows, and memory is returned to the system freelist by the kernel's allocators when a global memory shortage occurs.

Pages: The Basic Unit of Solaris Memory


Pages are the fundamental unit of physical memory in the Solaris memory management subsystem. Physical memory is divided into pages. Every active (not free) page in the Solaris kernel is a mapping between a file (vnode) and memory; the page can be identified by a vnode pointer and the page-aligned offset within that vnode. A page's identity is its vnode/offset pair. The page structure and its associated lists are shown below:

The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space. The key property of the vnode/offset pair is reusability; that is, we can reuse each physical page for another task by simply synchronizing its contents in RAM with its backing store (the vnode and offset) before the page is reused.
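To make the vnode/offset identity concrete, here is a minimal sketch of the idea in C. It is an illustration only: the kernel's real page structure carries many more fields (mapping lists, locks, flags), and the layout shown here is a simplified assumption, not the actual definition.

    /* Simplified illustration of Solaris-style page identity. */
    #include <stdint.h>
    #include <stddef.h>

    struct vnode;                    /* the file object backing the page   */

    struct page {
            struct vnode *p_vnode;   /* identity: which file (vnode)       */
            uint64_t      p_offset;  /* identity: page-aligned file offset */
            struct page  *p_hash;    /* next page on the same hash chain   */
    };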

The Page Hash List


The VM system hashes pages with identity (a valid vnode/offset pair) onto a global hash list so that they can be located by vnode and offset. Three page functions search the global hash list: page_find(), page_lookup(), and page_lookup_nowait(). The global hash list is an array of pointers to linked lists of pages. The functions use a hash to index into the page_hash array to locate the list of pages that contains the page with the matching vnode/offset pair. The following figure shows how the page_find() function indexes into the page_hash array to locate a page matching a given vnode/offset.

It calculates the slot in the page_hash array containing a list of potential pages by using the PAGE_HASH_FUNC macro, shown below:
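The macro body is not reproduced in this text, so the following is an illustrative sketch of the kind of hash it computes, mixing the vnode pointer and the page offset into an array index. The shift amounts and table size are assumptions chosen for the example, not the kernel's actual values.

    #include <stdint.h>

    #define EXAMPLE_PAGESHIFT   13      /* 8-Kbyte pages              */
    #define EXAMPLE_PAGE_HASHSZ 4096    /* table size, a power of two */

    /* Index into page_hash[] from a vnode pointer and a page offset. */
    static inline unsigned long
    example_page_hash(const void *vp, uint64_t off)
    {
            return (((off >> EXAMPLE_PAGESHIFT) + ((uintptr_t)vp >> 3)) &
                (EXAMPLE_PAGE_HASHSZ - 1));
    }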

It uses the PAGE_HASH_SEARCH macro, shown below, to search the list referenced by the slot for a page matching vnode/offset. The macro traverses the linked list of pages until it finds such a page.
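Again, the macro body is not reproduced here; the sketch below shows the equivalent list walk, reusing the simplified struct page from the earlier sketch. It illustrates the logic rather than the kernel source.

    /* Walk one hash chain looking for the page whose vnode/offset
     * identity matches; returns NULL if no page on the chain matches. */
    static struct page *
    example_page_hash_search(struct page *chain, struct vnode *vp,
        uint64_t off)
    {
            struct page *pp;

            for (pp = chain; pp != NULL; pp = pp->p_hash) {
                    if (pp->p_vnode == vp && pp->p_offset == off)
                            break;
            }
            return (pp);
    }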

Free List and Cache List


The free list and the cache list hold pages that are not mapped into any address space and that have been freed by page_free(). The sum of these pages is reported in the free column in vmstat. Even though vmstat reports these pages as free, they can still contain a valid vnode/offset identity and hence are still part of the global page cache. Memory on the cache list is not really free; it is a valid cache of a page from a file. However, pages will be moved from the cache list to the free list and their contents discarded if the free list becomes exhausted.

The free list contains pages that no longer have a vnode and offset associated with them, which can occur only if the page has been destroyed and removed from a vnode's hash list. The cache list is a hashed list of pages that still have mappings to a valid vnode and offset. Pages can be obtained from the cache list by the page_lookup() routine. This function accepts a vnode and offset as arguments and returns a page structure. If the page is found on the cache list, it is removed from the cache list and returned to the caller. When we find and remove pages from the cache list, we are reclaiming a page. Page reclaims are reported by vmstat in the re column.

Physical Page memseg Lists


The Solaris kernel uses a segmented global physical page list, consisting of segments of contiguous physical memory. (Many hardware platforms now present memory in noncontiguous groups.) Contiguous physical memory segments are added during system boot. They are also added and deleted dynamically when physical memory is added and removed while the system is running. The following figure shows the arrangement of the physical page lists into contiguous segments.
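In code form, a minimal sketch of such a segmented list might look like the following. The field names are loosely modeled on the kernel's memseg structure but are simplified for illustration and are not the exact definition.

    #include <stdint.h>
    #include <stddef.h>

    struct page;    /* simplified page structure from the earlier sketch */

    /* One segment of physically contiguous memory; segments are chained
     * together to form the global physical page list. */
    struct example_memseg {
            struct page *pages;          /* first page structure in segment  */
            struct page *epages;         /* one past the last page structure */
            uint64_t     pages_base;     /* first physical page frame number */
            uint64_t     pages_end;      /* one past the last frame number   */
            struct example_memseg *next; /* next segment, or NULL            */
    };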

The Page-Level Interfaces


The Solaris 10 virtual memory system implementation has grouped page management and manipulation into a central group of functions. These functions are used by the segment drivers and file systems to create, delete and modify pages. The following are some of the page-level interfaces:
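The table of interfaces from the source is not reproduced in this text. A representative, non-exhaustive subset, with brief paraphrased descriptions rather than authoritative signatures, includes:

    page_create_va()       create new pages, taking the mapping virtual
                           address into account (used for page coloring)
    page_lookup()          find a page by vnode/offset on the hash list,
                           lock it, and reclaim it from the cache list
    page_lookup_nowait()   like page_lookup(), but does not wait for a
                           page that is currently locked
    page_find()            search the hash list for a page that is known
                           to exist and is already locked
    page_exists()          test whether a page exists for a vnode/offset
    page_free()            return a page to the cache list or free list
    page_hashin() / page_hashout()
                           add a page to, or remove it from, the vnode
                           hash list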

The Page Throttle


Solaris implements a page creation throttle so that a small core of memory remains available for consumption by critical parts of the kernel. The page throttle, implemented in the page_create() and page_create_va() functions, causes page creates to block when the PG_WAIT flag is specified and available memory is less than the system global parameter throttlefree. By default, throttlefree is set to the same value as the system global parameter minfree. By default, memory allocated through the kernel memory allocator specifies PG_WAIT and is subject to the page-create throttle.
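As a minimal sketch of the throttle decision just described (an illustration, not the kernel's page_create_va() logic), the check boils down to whether a PG_WAIT allocation would drive available memory below throttlefree:

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative globals mirroring the tunables named above. */
    static uint64_t freemem;        /* pages currently available             */
    static uint64_t throttlefree;   /* defaults to minfree on a real system  */

    /* A PG_WAIT allocation of npages blocks (and retries later) while
     * memory is below throttlefree; a non-PG_WAIT allocation simply
     * fails instead of blocking. */
    static bool
    page_create_should_block(uint64_t npages, bool pg_wait)
    {
            if (freemem < throttlefree + npages)
                    return (pg_wait);
            return (false);
    }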

Page Coloring
Some interesting effects result from the organization of pages within the processor caches, and as a result, the page placement policy within these caches can dramatically affect processor performance. When pages overlay other pages in the cache, they can displace cache data that we might not want overlaid, resulting in poorer cache utilization and hot spots. The optimal placement of pages in the cache often depends on the memory access patterns of the application; that is, is the application accessing memory in a random order, or is it doing some sort of strided, ordered access? Several different algorithms can be selected in the Solaris kernel to implement page placement; the default attempts to provide the best overall performance.

To understand how page placement can affect performance, let's look at the cache configuration and see when page overlaying and displacement can occur. The UltraSPARC-I and -II implementations use virtually addressed L1 caches and physically addressed L2 caches. The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary between 512 Kbytes and 8 Mbytes. We can query the operating system with adb to see the size of the caches reported to the operating system: the L1 cache size is recorded in the vac_size parameter, and the L2 cache size is recorded in the ecache_size parameter.

We'll start by using the L2 cache as an example of how page placement can affect performance. The physical addressing of the L2 cache means that the cache is organized in page-sized multiples of the physical address space, so the cache effectively has only a limited number of page-aligned slots. The number of effective page slots in the cache is the cache size divided by the page size. To simplify our examples, let's assume we have a 32-Kbyte L2 cache (much smaller than reality), which means that with a page size of 8 Kbytes there are four page-sized slots in the L2 cache. The cache does not necessarily read and write 8-Kbyte units from memory; it does that in 64-byte chunks, so in reality our 32-Kbyte cache has 512 addressable slots. The following figure shows how our cache would look if we laid it out linearly:

The L2 cache is direct-mapped from physical memory. If we were to access physical addresses on a 32-Kbyte boundary, for example, offsets 0 and 32768, then both memory locations would map to the same cache line. If we were now to access these two addresses repeatedly, we would cause the cache line for the offset 0 address to be read in, then flushed (cleared), the cache line for the offset 32768 address to be read in, then flushed, then the first reloaded, and so on. This ping-pong effect in the cache is known as cache flushing (or cache ping-ponging), and it effectively reduces our performance to real-memory speed rather than cache speed. By accessing memory on our 32-Kbyte cache-size boundary, we have effectively used only 64 bytes of the cache (one cache line) rather than the full cache size. Memory is often 10 to 20 times slower than cache, so this can have a dramatic effect on performance.

Our simple example was based on the assumption that we were accessing physical memory in a regular pattern, but we don't program to physical memory; rather, we program to virtual memory. Therefore, the operating system must provide a sensible mapping between virtual memory and physical memory; otherwise, effects such as those in our example can occur. By default, physical pages are assigned to an address space in the order in which they appear on the free list. In general, the first time a machine boots, the free list may have physical memory in a linear order, and we may end up with the behavior described in our ping-pong example. Once a machine has been running, the physical page free list becomes randomly ordered, and subsequent reruns of an identical application can get very different physical page placement and, as a result, very different performance. On early Solaris implementations, this is exactly what customers saw: differing performance for identical runs, by as much as 30 percent.

To provide better and more consistent performance, the Solaris kernel uses a page coloring algorithm when pages are allocated to a virtual address space. Rather than being allocated randomly, the pages are allocated with a specific predetermined relationship between the virtual address to which they are being mapped and their underlying physical address. The virtual-to-physical relationship is predetermined as follows: the free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache; the number of color bins is determined by the ecache size divided by the page size. (In our example, there would be exactly four colored bins.)
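To make that bin arithmetic concrete, the short program below computes the number of color bins for the example cache and picks a bin directly from a virtual address. The real kernel chooses bins through the selected coloring algorithm (the default hashes the address), so treat this purely as an illustration of the ecache_size / page_size relationship.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            /* Example values from the text: 32-Kbyte L2 cache, 8-Kbyte pages. */
            uint64_t ecache_size = 32 * 1024;
            uint64_t page_size   = 8 * 1024;
            uint64_t nbins       = ecache_size / page_size;   /* 4 bins */

            /* Simplest possible placement: the virtual page number modulo
             * the number of bins selects the color bin. */
            uint64_t vaddr = 0x2a000;                 /* example address */
            uint64_t bin   = (vaddr / page_size) % nbins;

            printf("%llu color bins, vaddr 0x%llx -> bin %llu\n",
                (unsigned long long)nbins, (unsigned long long)vaddr,
                (unsigned long long)bin);
            return (0);
    }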

When a page is put on the free list, the page_free() algorithms assign it to a color bin. When a page is consumed from the free list, the virtual-to-physical algorithm takes the page from a color bin chosen as a function of the virtual address to which the page will be mapped. The algorithm requires that, when pages are allocated from the free list, the page create function know the virtual address to which a page will be mapped. New pages are therefore allocated by calling the page_create_va() function, which accepts as an argument the virtual address of the location to which the page is going to be mapped; the virtual-to-physical color bin algorithm can then decide which color bin to take physical pages from.

No one algorithm suits all applications, because different applications have different memory access patterns. Over time, the page coloring algorithms used in the Solaris kernel have been refined as a result of extensive simulation, benchmarks, and customer feedback. The kernel supports a default algorithm and two optional algorithms. The default algorithm was chosen according to the following criteria:

    Fairly consistent, repeatable results
    Good overall performance for the majority of applications
    Acceptable performance across a wide range of applications

The default algorithm uses a hashing algorithm to distribute pages as evenly as possible throughout the cache. The default and the other available page coloring algorithms are shown here:

You can change the default algorithm by setting the system parameter consistent_coloring, either on-the-fly with adb or permanently in /etc/system.
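For example (the value shown is illustrative only; check the documented settings for your platform and release before changing anything):

    In /etc/system, taking effect at the next boot:
        set consistent_coloring = 1

    On a running system with adb, in kernel read/write mode:
        adb -kw /dev/ksyms /dev/mem
        consistent_coloring/W 1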

So, which algorithm is best? Well, your mileage will vary, depending on your application. Page coloring usually only makes a difference on memory-intensive scientific applications, and the defaults are usually fine for commercial or database systems. If you have a time-critical scientific application, then we recommend that you experiment with the different algorithms and see which is best. Remember that some algorithms will produce different results for each run, so aggregate as many runs as possible.

The Page Scanner


The page scanner is the memory management daemon that manages systemwide physical memory. The page scanner and the virtual memory page fault mechanism are the core of the demand-paged memory allocation system used to manage Solaris memory. When there is a memory shortage, the page scanner runs to steal memory from address spaces: it takes pages that haven't been used recently, syncs them up with their backing store (swap space if they are anonymous pages), and frees them. If paged-out virtual memory is required again by an address space, a memory page fault occurs when the virtual address is referenced, and the pages are re-created and copied back in from their backing store.

The balance between page stealing and page faults determines which parts of virtual memory will be backed by real physical memory and which will be moved out to swap. The page scanner does not understand the memory usage patterns or working sets of processes; it only knows reference information on a physical page-by-page basis. This policy is often referred to as global page replacement; the alternative, process-based page management, is known as local page replacement. The subtleties of which pages are stolen govern the memory allocation policies and can affect different workloads in different ways. During the life of the Solaris kernel, only two significant changes in memory replacement policy have occurred:

    Enhancements to minimize page stealing from extensively shared libraries and executables
    Priority paging to prevent application, shared library, and executable paging on systems with ample memory

Page Scanner Implementation


The page scanner is implemented as two kernel threads, both belonging to the pageout process. One thread scans pages, and the other thread pushes the dirty pages queued for I/O to the swap device. In addition, the kernel callout mechanism wakes the page scanner thread when memory is insufficient. The scanner's schedpaging() function is called four times per second by a callout placed in the callout table. The schedpaging() function checks whether free memory is below the threshold (lotsfree or cachefree) and, if required, prepares to trigger the scanner thread. The page scanner is not only awakened by the callout thread; it is also triggered by the clock() thread if memory falls below minfree, or by the page allocator if memory falls below throttlefree. The following figure, Page Scanner Architecture, illustrates how the page scanner works.
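The wakeup conditions described above can be summarized in a small sketch; the tunable names follow the text, but the logic is a simplification and is not the kernel's schedpaging() source.

    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified thresholds, in pages; real values are derived at boot. */
    static uint64_t freemem, lotsfree, minfree, throttlefree;

    /* schedpaging()-style check, run four times per second by a callout:
     * start a scan cycle when free memory has dropped below lotsfree. */
    static bool scanner_should_run(void)
    {
            return (freemem < lotsfree);
    }

    /* More urgent triggers described in the text. */
    static bool clock_should_wake_scanner(void)
    {
            return (freemem < minfree);
    }

    static bool allocator_should_wake_scanner(void)
    {
            return (freemem < throttlefree);
    }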

When called, the schedpaging routine calculates two setup parameters for the page scanner thread: the number of pages to scan and the number of CPU ticks that the scanner thread can consume while doing so. The number of pages and CPU ticks are calculated according to the equations given under Scan Rate Parameters (Assuming No Priority Paging). Once the scanning parameters have been calculated, schedpaging triggers the page scanner through a condition variable wakeup.
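The scan-rate equations themselves are not reproduced in this text; assuming no priority paging, the documented relationship interpolates the scan rate linearly between slowscan (when free memory equals lotsfree) and fastscan (when free memory reaches zero). The sketch below shows that calculation with illustrative tunable values, not platform defaults.

    #include <stdio.h>

    int main(void)
    {
            double lotsfree = 16384;  /* pages                              */
            double slowscan = 100;    /* pages/sec when freemem == lotsfree */
            double fastscan = 65536;  /* pages/sec when freemem == 0        */
            double freemem  = 4096;   /* current free pages                 */

            /* Linear interpolation between slowscan and fastscan. */
            double scanrate = ((lotsfree - freemem) / lotsfree) * fastscan +
                (freemem / lotsfree) * slowscan;

            printf("scan rate: %.0f pages/sec\n", scanrate);   /* ~49177 */
            return (0);
    }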

The page scanner thread cycles through the physical page list, progressing by the number of pages requested each time it is woken up. The front hand and the back hand each have a page pointer. The front hand is incremented first so that it can clear the referenced and modified bits for the page it currently points to. The back hand is then incremented, and the status of the page it points to is checked by the check_page() function. At this point, if the page has been modified, it is placed on the dirty page queue for processing by the page-out thread. If the page was not referenced (it's clean!), then it is simply freed.

Dirty pages are placed onto a queue so that a separate thread, the page-out thread, can write them out to their backing store. A separate thread is used so that a deadlock can't occur while the system is waiting to swap a page out. The page-out thread uses a preinitialized list of async buffer headers as the queue for I/O requests. The list is initialized with 256 entries, which means the queue can contain at most 256 entries; the number of entries preconfigured on the list is controlled by the async_request_size system parameter. Requests to queue more I/Os will block if the queue is full (256 entries) or if the rate of pages queued has exceeded the system maximum set by the maxpgio parameter.

The page-out thread simply removes entries from the queue and initiates I/O on them by calling the vnode putpage() function for the page in question. In the Solaris kernel, this function calls the swapfs_putpage() function to initiate the swap page-out via the swapfs layer. The swapfs layer delays and gathers pages together (16 pages on sun4u), then writes them out together. The klustsize parameter controls the number of pages that swapfs will cluster; the defaults are shown in the table below.

The Memory Scheduler


In addition to the page-out process, the CPU scheduler/dispatcher can swap out entire processes to conserve memory. This operation is separate from page-out. Swapping out a process involves removing all of a process's thread structures and private pages from memory, and setting flags in the process table to indicate that the process has been swapped out. This is an inexpensive way to conserve memory, but it dramatically affects a process's performance and hence is used only when paging fails to consistently free enough memory.

The memory scheduler is launched at boot time and does nothing unless memory is consistently less than desfree (on a 30-second average). At that point, the memory scheduler starts looking for processes that it can completely swap out. The memory scheduler will soft-swap out processes if the shortage is minimal, or hard-swap out processes in the case of a larger memory shortage.

Soft Swapping
Soft swapping takes place when the 30-second average for free memory is below desfree. Then, the memory scheduler looks for processes that have been inactive for at least maxslp seconds. When the memory scheduler finds a process that has been sleeping for maxslp seconds, it swaps out the thread structures for each thread, then pages out all of the private pages of memory for that process.

Hard Swapping
Hard swapping takes place when all of the following are true:

    At least two processes are on the run queue, waiting for CPU.
    The average free memory over 30 seconds is consistently less than desfree.
    Excessive paging is going on (determined to be true if page-out + page-in > maxpgio).

When hard swapping is invoked, a much more aggressive approach is used to find memory. First, the kernel is requested to unload all modules and cache memory that are not currently active; then processes are sequentially swapped out until the desired amount of free memory is returned.
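A compact sketch of the decision just described (soft swap versus hard swap) is shown below; the tunable names follow the text, and the structure is a simplification of the memory scheduler's behavior rather than its actual implementation.

    #include <stdint.h>

    /* Inputs corresponding to the conditions described above. */
    struct sched_state {
            double   avefree30;   /* 30-second average of free memory (pages) */
            double   desfree;     /* desired free memory threshold            */
            uint64_t runq_len;    /* runnable processes waiting for CPU       */
            uint64_t pgin_pgout;  /* recent page-in + page-out rate           */
            uint64_t maxpgio;     /* paging rate limit                        */
    };

    enum swap_action { SWAP_NONE, SWAP_SOFT, SWAP_HARD };

    static enum swap_action
    memory_scheduler_decision(const struct sched_state *s)
    {
            if (s->avefree30 >= s->desfree)
                    return (SWAP_NONE);     /* no sustained memory shortage */

            /* Hard swap: shortage plus CPU contention plus excessive paging. */
            if (s->runq_len >= 2 && s->pgin_pgout > s->maxpgio)
                    return (SWAP_HARD);

            /* Otherwise soft swap: target processes idle >= maxslp seconds. */
            return (SWAP_SOFT);
    }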

References:

Richard McDougall and Jim Mauro, Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture, 2nd Edition, Pearson Education, ISBN 81-317-1620-1.

http://www.opensolaris.org

Robert A. Gingell, Joseph P. Moran, and William A. Shannon, "Virtual Memory Architecture in SunOS," Proceedings of the Summer 1987 USENIX Technical Conference, USENIX Association, Phoenix, Arizona, USA, June 1987.

Richard McDougall, "Supporting Multiple Page Sizes in the Solaris Operating System," Sun BluePrints OnLine, March 2004, Sun Microsystems Inc.

Steven R. Kleiman, "Vnodes: An Architecture for Multiple File System Types in Sun UNIX," Proceedings of the Summer 1986 USENIX Technical Conference, USENIX Association, Phoenix, Arizona, USA, June 1986.

Marshall Kirk McKusick, Michael J. Karels, and Keith Bostic, "A Pageable Memory Based Filesystem," Proceedings of the Summer 1990 USENIX Technical Conference, USENIX Association, Anaheim, California, USA, June 1990.

The Solaris Memory System: Sizing Tools and Architecture, Copyright 1997 Sun Microsystems, Inc., 2550 Garcia Avenue, Mountain View, California 94043-1100, U.S.A.

http://www.princeton.edu/~unix/Solaris/troubleshoot/ram.html

Peter Snyder, "tmpfs: A Virtual Memory File System," Sun Microsystems Inc.

http://developers.sun.com/solaris/articles/free_phys_ram.html

http://www.dbapool.com/faqs/Q_116.html
