Solaris manages physical memory globally through a central free pool and a system daemon that regulates its use.
The hardware address translation (HAT) and address space layers manage the mapping between a physical page and its virtual address space. Each physical page is identified by a vnode/offset pair, and the key property of that pair is reusability: we can reuse any physical page for another task simply by synchronizing its contents in RAM with its backing store (the vnode and the offset) before the page is reused.
To locate a page by its vnode/offset, the lookup first calculates the slot in the page_hash array, which holds a list of potential pages, by using the PAGE_HASH_FUNC macro.
It then uses the PAGE_HASH_SEARCH macro to search the list referenced by that slot for a page matching the vnode/offset. The macro traverses the linked list of pages until it finds such a page or reaches the end of the list.
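The two-step lookup described above (compute a slot, then walk the chain hanging off that slot) can be sketched in a few lines of Python. This is only a toy model: the table size, the hash mix, and the names Page, PageHash, and page_hash_slot are assumptions made for illustration, not the kernel's PAGE_HASH_FUNC or PAGE_HASH_SEARCH definitions.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGESHIFT = 13     # assumed: 8-Kbyte pages
HASH_SLOTS = 64    # assumed table size (a power of two), not the kernel's value


@dataclass
class Page:
    vnode: int                      # stand-in for a vnode pointer
    offset: int                     # byte offset into the vnode
    next: Optional["Page"] = None   # next page on the same hash chain


def page_hash_slot(vnode: int, offset: int) -> int:
    """Illustrative hash: mix the page number with the vnode identity."""
    return ((offset >> PAGESHIFT) + vnode) & (HASH_SLOTS - 1)


class PageHash:
    def __init__(self) -> None:
        self.slots: List[Optional[Page]] = [None] * HASH_SLOTS

    def insert(self, page: Page) -> None:
        # Push the page onto the front of its slot's chain.
        slot = page_hash_slot(page.vnode, page.offset)
        page.next = self.slots[slot]
        self.slots[slot] = page

    def lookup(self, vnode: int, offset: int) -> Optional[Page]:
        # Walk the chain until a page matches both vnode and offset.
        page = self.slots[page_hash_slot(vnode, offset)]
        while page is not None:
            if page.vnode == vnode and page.offset == offset:
                return page
            page = page.next
        return None
```

Two pages whose hashes collide land on the same chain, which is why the search must compare both the vnode and the offset rather than trusting the slot alone.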
Page Coloring
Some interesting effects result from the organization of pages within the processor caches, and as a result, the page placement policy within these caches can dramatically affect processor performance. When pages overlay other pages in the cache, they can displace cache data that we might not want overlaid, resulting in lower cache utilization and hot spots. The optimal placement of pages in the cache often depends on the memory access patterns of the application: is the application accessing memory in a random order, or is it doing some sort of strided, ordered access? Several different algorithms can be selected in the Solaris kernel to implement page placement; the default attempts to provide the best overall performance.
To understand how page placement can affect performance, let's look at the cache configuration and see when page overlaying and displacement can occur. The UltraSPARC-I and -II implementations use virtually addressed L1 caches and physically addressed L2 caches. The L2 cache is arranged in lines of 64 bytes, and transfers are done to and from physical memory in 64-byte units. The L1 cache is 16 Kbytes, and the L2 (external) cache can vary between 512 Kbytes and 8 Mbytes. We can query the operating system with adb to see the size of the caches reported to the operating system. The L1 cache size is recorded in the vac_size parameter, and the L2 cache size is recorded in the ecache_size parameter.
We'll start by using the L2 cache as an example of how page placement can affect performance. The physical addressing of the L2 cache means that the cache is organized in page-sized multiples of the physical address space, so the cache effectively has only a limited number of page-aligned slots. The number of effective page slots in the cache is the cache size divided by the page size. To simplify our examples, let's assume we have a 32-Kbyte L2 cache (much smaller than reality); with a page size of 8 Kbytes, there are four page-sized slots in the L2 cache. The cache does not necessarily read and write 8-Kbyte units from memory; it does that in 64-byte chunks, so in reality our 32-Kbyte cache has 512 addressable 64-byte slots (32,768 / 64). The following figure shows how our cache would look if we laid it out linearly:
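The slot arithmetic for the simplified cache can be checked with a small Python sketch. The function names cache_geometry and page_slot are hypothetical helpers introduced here; the sizes are the ones assumed in the example (32-Kbyte cache, 8-Kbyte pages, 64-byte lines), for which 32,768 / 64 works out to 512 lines.

```python
KB = 1024


def cache_geometry(cache_bytes: int, page_bytes: int, line_bytes: int) -> dict:
    """Return how many page-sized slots and cache lines a cache holds."""
    return {
        "page_slots": cache_bytes // page_bytes,      # page-aligned slots
        "lines": cache_bytes // line_bytes,           # individually addressable lines
        "lines_per_page": page_bytes // line_bytes,   # lines covered by one page
    }


def page_slot(phys_addr: int, cache_bytes: int, page_bytes: int) -> int:
    """Page-sized cache slot that a physical address falls into (direct-mapped)."""
    return (phys_addr % cache_bytes) // page_bytes
```

For the example sizes, cache_geometry(32 * KB, 8 * KB, 64) gives four page slots, and page_slot shows that physical addresses one cache size apart share a slot.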
The L2 cache is direct-mapped from physical memory. If we were to access physical addresses on a 32-Kbyte boundary, for example, offsets 0 and 32768, then both memory locations would map to the same cache line. If we access these two addresses alternately, the cache line for the offset 0 address is read in and then flushed (cleared), the cache line for the offset 32768 address is read in and then flushed, the first is reloaded, and so on. This ping-pong effect in the cache is known as cache flushing (or cache ping-ponging), and it effectively reduces our performance to that of real-memory speed, rather than cache speed. By accessing memory on our 32-Kbyte cache-size boundary, we have effectively used only 64 bytes of the cache (one cache line), rather than the full cache size. Memory is often 10 to 20 times slower than cache, so this can have a dramatic effect on performance.
Our simple example was based on the assumption that we were accessing physical memory in a regular pattern, but we don't program to physical memory; rather, we program to virtual memory. Therefore, the operating system must provide a sensible mapping between virtual memory and physical memory; otherwise, effects such as those in our example can occur. By default, physical pages are assigned to an address space in the order in which they appear in the free list. In general, the first time a machine boots, the free list may have physical memory in a linear order, and we may end up with the behavior described in our ping-pong example. Once a machine has been running, the physical page free list will become randomly ordered, and subsequent reruns of an identical application could get very different physical page placement and, as a result, very different performance. On early Solaris implementations, this is exactly what customers saw: performance differing by as much as 30 percent between identical runs.
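The ping-pong effect is easy to reproduce with a toy direct-mapped cache. The sketch below is a simplified model, not a description of the UltraSPARC hardware: the class DirectMappedCache and the helper run are hypothetical names, and the default sizes are the 32-Kbyte cache and 64-byte lines assumed in the example.

```python
from typing import Iterable, List, Optional, Tuple


class DirectMappedCache:
    """Toy direct-mapped cache that only counts hits and misses."""

    def __init__(self, cache_bytes: int = 32 * 1024, line_bytes: int = 64) -> None:
        self.line_bytes = line_bytes
        self.num_lines = cache_bytes // line_bytes
        self.tags: List[Optional[int]] = [None] * self.num_lines
        self.hits = 0
        self.misses = 0

    def access(self, phys_addr: int) -> bool:
        block = phys_addr // self.line_bytes   # which 64-byte block of memory
        index = block % self.num_lines         # the single line it may occupy
        tag = block // self.num_lines          # identifies the block within that line
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag                 # evict whatever was there
        self.misses += 1
        return False


def run(addresses: Iterable[int], repeats: int) -> Tuple[int, int]:
    """Access each address in turn, `repeats` times; return (hits, misses)."""
    cache = DirectMappedCache()
    addrs = list(addresses)
    for _ in range(repeats):
        for addr in addrs:
            cache.access(addr)
    return cache.hits, cache.misses
```

Alternating between offsets 0 and 32768 misses on every access because the two blocks evict each other, whereas alternating between offsets 0 and 8192 misses only on the first touch of each line.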
To provide better and more consistent performance, the Solaris kernel uses a page coloring algorithm when pages are allocated to a virtual address space. Rather than being randomly allocated, the pages are allocated with a specific predetermined relationship between the virtual address to which they are being mapped and their underlying physical address. The virtual-to-physical relationship is predetermined as follows: the free list of physical pages is organized into specifically colored bins, one color bin for each slot in the physical cache; the number of color bins is determined by the ecache size divided by the page size. (In our example, there would be exactly four colored bins.) When a page is put on the free list, the page_free() algorithms assign it to a color bin. When a page is consumed from the free list, the virtual-to-physical algorithm takes the page from a physical color bin, chosen as a function of the virtual address to which the page will be mapped.
The algorithm requires that, when allocating pages from the free list, the page create function know the virtual address to which a page will be mapped. New pages are allocated by calling the page_create_va() function. The page_create_va() function accepts as an argument the virtual address of the location to which the page is going to be mapped; the virtual-to-physical color bin algorithm can then decide which color bin to take physical pages from.
No one algorithm suits all applications because different applications have different memory access patterns. Over time, the page coloring algorithms used in the Solaris kernel have been refined as a result of extensive simulation, benchmarks, and customer feedback. The kernel supports a default algorithm and two optional algorithms. The default algorithm was chosen according to the following criteria: fairly consistent, repeatable results; good overall performance for the majority of applications; and acceptable performance across a wide range of applications.
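The color-bin idea can be illustrated with a small Python model. It is a sketch under stated assumptions: the color of a virtual address is taken as a plain modulo of its page number (a simple stand-in, not the kernel's default hashed policy), the fallback to a neighboring bin is invented for the example, and the methods page_free and page_create_va only borrow the names used in the text; they are not the kernel functions.

```python
from collections import deque
from typing import Deque, List

PAGE_BYTES = 8 * 1024
CACHE_BYTES = 32 * 1024
NUM_COLORS = CACHE_BYTES // PAGE_BYTES   # four bins in the simplified example


def phys_color(pfn: int) -> int:
    """Color (cache slot) that a physical page frame number maps to."""
    return pfn % NUM_COLORS


def virt_color(vaddr: int) -> int:
    """Assumed policy: the color is the virtual page number modulo the bin count."""
    return (vaddr // PAGE_BYTES) % NUM_COLORS


class ColoredFreeList:
    def __init__(self) -> None:
        self.bins: List[Deque[int]] = [deque() for _ in range(NUM_COLORS)]

    def page_free(self, pfn: int) -> None:
        # Freed pages are filed under the bin matching their physical color.
        self.bins[phys_color(pfn)].append(pfn)

    def page_create_va(self, vaddr: int) -> int:
        # Prefer the bin chosen by the virtual address; fall back to the
        # next non-empty bin (an assumption of this sketch) if it is empty.
        wanted = virt_color(vaddr)
        for step in range(NUM_COLORS):
            candidates = self.bins[(wanted + step) % NUM_COLORS]
            if candidates:
                return candidates.popleft()
        raise MemoryError("no free pages")
```

With this policy, consecutive virtual pages receive physical pages of consecutive colors regardless of the order in which pages were freed, so they cannot collide in the cache until the bins run dry.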
The default algorithm uses a hashing algorithm to distribute pages as evenly as possible throughout the cache. The default and three other available page coloring algorithms are shown here:
You can change the default algorithm by setting the system parameter consistent_coloring, either on-the-fly with adb or permanently in /etc/system.
So, which algorithm is best? Well, your mileage will vary, depending on your application. Page coloring usually only makes a difference on memory-intensive scientific applications, and the defaults are usually fine for commercial or database systems. If you have a time-critical scientific application, then we recommend that you experiment with the different algorithms and see which is best. Remember that some algorithms will produce different results for each run, so aggregate as many runs as possible.
When called, the schedpaging routine calculates two setup parameters for the page scanner thread: the number of pages to scan and the number of CPU ticks that the scanner thread can consume while doing so. The number of pages and the number of CPU ticks are calculated according to the equations shown in "Scan Rate Parameters (Assuming No Priority Paging)". Once the scanning parameters have been calculated, schedpaging triggers the page scanner through a condition variable wakeup.
The page scanner thread cycles through the physical page list, progressing by the number of pages requested each time it is woken up. The front hand and the back hand each have a page pointer. The front hand is incremented first so that it can clear the referenced and modified bits for the page currently pointed to by the front hand. The back hand is then incremented, and the status of the page pointed to by the back hand is checked by the check_page() function. At this point, if the page has been modified, it is placed in the dirty page queue for processing by the page-out thread. If the page has not been referenced and is clean, it is simply freed.
Dirty pages are placed onto a queue so that a separate thread, the page-out thread, can write them out to their backing store. We use another thread so that a deadlock can't occur while the system is waiting to swap a page out. The page-out thread uses a preinitialized list of async buffer headers as the queue for I/O requests. The list is initialized with 256 entries, which means the queue can contain at most 256 entries. The number of entries preconfigured on the list is controlled by the async_request_size system parameter. Requests to queue more I/Os will block if the queue is full (256 entries) or if the rate of pages queued has exceeded the system maximum set by the maxpgio parameter.
The page-out thread simply removes I/O entries from the queue and initiates I/O on each by calling the vnode putpage() function for the page in question. In the Solaris kernel, this function calls the swapfs_putpage() function to initiate the swap page-out via the swapfs layer. The swapfs layer delays and gathers pages together (16 pages on sun4u), then writes them out together. The klustsize parameter controls the number of pages that swapfs will cluster; the defaults are shown in the table below.
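The interplay of the two hands can be sketched as a small simulation. This is a deliberately simplified model and not the kernel's scanner: the Page class, the scan function, and the handspread argument (the distance between the hands, in pages) are assumptions for illustration, and the front hand here clears only the referenced flag so that the dirty state stays visible to the back hand.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple


@dataclass
class Page:
    referenced: bool = False
    modified: bool = False
    free: bool = False


def scan(pages: List[Page], front: int, handspread: int, nscan: int,
         dirty_queue: Deque[int]) -> Tuple[int, int]:
    """Advance both hands by nscan pages.

    Returns the new front-hand position and the number of pages freed.
    """
    n = len(pages)
    freed = 0
    for _ in range(nscan):
        # Front hand: clear the referenced flag (simplification: only this flag).
        pages[front].referenced = False
        # Back hand trails the front hand by handspread pages.
        back = (front - handspread) % n
        page = pages[back]
        if not page.free and not page.referenced:
            if page.modified:
                dirty_queue.append(back)   # left for the page-out thread
            else:
                page.free = True           # clean and unreferenced: free it
                freed += 1
        front = (front + 1) % n
    return front, freed
```

A page that is touched again after the front hand passes but before the back hand arrives has its referenced flag set once more and so survives the pass; only pages left idle for the whole gap between the hands are freed or queued for page-out.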
The memory scheduler will soft-swap out processes if the shortage is minimal, or hard-swap out processes in the case of a larger memory shortage.
Soft Swapping
Soft swapping takes place when the 30-second average for free memory is below desfree. The memory scheduler then looks for processes that have been inactive for at least maxslp seconds. When the memory scheduler finds a process that has been sleeping for maxslp seconds, it swaps out the thread structures for each thread, then pages out all of the private pages of memory for that process.
Hard Swapping
Hard swapping takes place when all of the following are true: at least two processes are on the run queue, waiting for CPU; the average free memory over 30 seconds is consistently less than desfree; and excessive paging (determined to be true if page-out + page-in > maxpgio) is going on.
When hard swapping is invoked, a much more aggressive approach is used to find memory. First, the kernel is requested to unload all modules and cache memory that are not currently active; then processes are sequentially swapped out until the desired amount of free memory is returned.
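The soft and hard swapping conditions above reduce to a short decision function. The sketch below is only a restatement of those rules in Python: the names SwapAction, swap_decision, and soft_swap_candidates are hypothetical, and the numeric inputs are whatever the caller measures; no default values for desfree, maxslp, or maxpgio are implied.

```python
from enum import Enum
from typing import Dict, List


class SwapAction(Enum):
    NONE = "none"
    SOFT = "soft"
    HARD = "hard"


def swap_decision(avefree_30s: int, desfree: int, runnable: int,
                  pageins: int, pageouts: int, maxpgio: int) -> SwapAction:
    """Choose a swap mode from the conditions described in the text."""
    if avefree_30s >= desfree:
        return SwapAction.NONE            # no sustained memory shortage
    excessive_paging = pageouts + pageins > maxpgio
    if runnable >= 2 and excessive_paging:
        return SwapAction.HARD            # all three hard-swap conditions hold
    return SwapAction.SOFT


def soft_swap_candidates(sleep_seconds: Dict[int, int], maxslp: int) -> List[int]:
    """Process IDs that have been asleep for at least maxslp seconds."""
    return sorted(pid for pid, slept in sleep_seconds.items() if slept >= maxslp)
```

The function returns SOFT whenever free memory is below desfree but the extra hard-swap conditions are not all met, which matches the idea that hard swapping is reserved for the more severe shortage.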