
CacheLayers: A Flexible and Scalable OS-Level Caching Framework

Nitin Gupta
University of Massachusetts, Amherst
ngupta@cs.umass.edu

Emery D. Berger
University of Massachusetts, Amherst
emery@cs.umass.edu

Dan Magenheimer
Oracle Corporation
dan.magenheimer@oracle.com

ABSTRACT
In current systems, the typical caching hierarchy starts with the fastest processor-level caches, moves to RAM, and finally ends at disk storage. This last level represents a huge performance cliff: average rotational disk latencies are orders of magnitude higher than those of DRAM chips, so severe performance degradation is observed when a workload's memory demands exceed the amount of RAM. Previous work has explored many techniques to address this issue, for instance the use of SSDs for caching and memory compression. Many of these studies used simulations and developed prototypes on various platforms to show their effectiveness, and obtained highly encouraging results. Despite these positive results, however, none of the techniques is properly supported by any production operating system. The primary reason is the overwhelming complexity of production kernels, especially considering the need to support a wide variety of devices. We address this issue with CacheLayers, a scalable and generic OS-level caching framework that provides an easy API for such devices to plug into. To demonstrate the effectiveness of this framework, we implemented modules targeting specific devices and techniques: SSD caching, in-memory compressed caching with de-duplication, and hypervisor-level caching (useful for virtualized environments). The framework simplified their development and reduced code duplication by providing generic services common to all these caching providers.

Keywords
Operating Systems, Caching, Compression, Deduplication, Virtualization, Performance

Categories and Subject Descriptors
D.4 [Software]: Operating Systems; D.4.2 [Operating Systems]: Storage Management - Main memory, Storage hierarchies, Swapping, Virtual memory; D.4.8 [Operating Systems]: Performance

General Terms
Design, Performance

1. INTRODUCTION

Caching refers to maintaining the most frequently accessed data in faster but smaller and more expensive forms of storage, hiding the latency of lower levels that are slower but offer higher capacity at a lower cost per unit of storage. Caching thus forms a hierarchy, with each level providing faster access to a small subset of the data in the layer below. In current systems, the typical caching hierarchy starts with the fastest processor-level caches, moves to RAM, and finally ends at disk storage. This last level represents a huge performance cliff: average rotational disk latencies are still on the order of milliseconds, orders of magnitude slower than DRAM. Severe performance degradation is therefore observed when application memory demands exceed the amount of available RAM.

Various techniques have been considered to reduce the effect of slow rotational disks on applications with large memory footprints, for instance the use of SSDs and other flash devices as caches, and even memory compression. These techniques essentially add another layer to the caching hierarchy, between RAM and the much slower disks. Previous work, using simulations and prototypes on various platforms, showed them to be quite effective in memory overcommit scenarios. Despite these positive results, none of these techniques is properly supported by the popular operating systems. Some ad-hoc implementations exist that provide caching over specific devices, but they are not sufficient to provide a comprehensive caching solution covering all the existing and upcoming devices whose price and performance characteristics make them well suited to serve as a caching backend for RAM. The under-utilization of such resources can mainly be attributed to the complexity of implementing such caching solutions; ad-hoc implementations are often application specific and not generic enough to cover the wide variety of devices available even today.

In recent years, a number of new storage technologies, both hardware and software based, have appeared in the middle ground between true RAM and disk, including hypervisor RAM, compressed RAM, SSDs, phase-change RAM, far NUMA RAM, and so on. Each has unique performance, byte-accessibility, and/or reliability idiosyncrasies that hinder it from being treated as true RAM; but each is also too fast and too expensive to be treated as a disk. As a result, there have been many attempts to shoehorn these odd memory types, along with their idiosyncrasies, into various parts of the kernel to serve specific needs. The result has not been particularly aesthetic or maintainable. Nor has this fractured approach come close to achieving the new technologies' full capabilities, thus pigeonholing their use and stunting their potential growth.

To address these issues, we developed a generic OS-level caching framework called CacheLayers. It provides an easy API for a variety of caching devices to be plugged in; separate modules use CacheLayers services to provide caching over a specific target device. The complexity of individual devices is abstracted away using Page Addressable Memory (PAM), a storage abstraction that allows data to be accessed only in page-sized chunks (typically 4K). It manages the mapping of device-specific location identifiers (for example, block numbers in the case of SSDs) to object-oriented PAM handles, with services such as efficient object lookup, insertion, and deletion. An LRU-like queue is also maintained for PAM objects. Additional components common to nearly all caching services, such as an efficient memory allocator (useful for in-memory compressed caching) and a disk-block allocator (for RAM-like disks), are also provided. We implemented CacheLayers for the Linux kernel due to its widespread use and source-code availability. To demonstrate the effectiveness of the framework, we implemented modules providing different caching services: ZCache provides in-memory compressed caching, HCache provides hypervisor-level caching, and SSDCache provides caching over SSDs. The implementation of these modules was greatly facilitated by the CacheLayers framework, which exposes an easy API to plug in these modules and provides much of the common functionality highlighted above.

Figure 1: CacheLayers hierarchy. New components introduced are shown in blue.

2. ARCHITECTURE

To provide a range of caching services (in-memory compressed caching, hypervisor-level caching with de-duplication, SSD caching, and so on), CacheLayers is composed of multiple layers that make it easy to extend caching services to these varied cases. Figure 1 shows the hierarchy introduced by CacheLayers. Each of the components is explained in the following sections.

2.1 Core Kernel Changes

When the system is running low on memory, the Linux kernel invokes the page frame reclaim algorithm (PFRA) to evict pages in an LRU-like manner. Any subsequent access to such a page invokes the page fault handler, which first checks whether the page is in the kernel's caches and, if not, issues a read request to the appropriate disk. To provide a second-chance cache for these evicted pages, we need small modifications to the parts of the kernel where pages are evicted and read back into memory. These changes are divided into two parts: cleancache and frontswap. Cleancache, which deals only with clean pagecache pages, consists of a set of hooks in the PFRA that issue a callback when a page is evicted. Frontswap, which deals only with swapcache pages, consists of a set of hooks in the swap subsystem that issue a callback whenever a page is about to be written to disk.
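To make the placement of these hooks concrete, the sketch below shows, in heavily simplified C, where the two callback families fire. The function names are illustrative rather than the exact kernel symbols, and locking and error handling are omitted.

/* Illustrative sketch of the two hook sites; names are not the exact
 * kernel symbols, and locking/error handling is omitted. */

/* Cleancache hook: invoked by the PFRA when a clean pagecache page is
 * about to be evicted.  The page is offered to a PAM backend first. */
static void evict_clean_page(struct page *page)
{
        cleancache_put_page(page);      /* may be kept by the backend  */
        remove_from_page_cache(page);   /* eviction proceeds as usual  */
}

/* Frontswap hook: invoked when the swap subsystem is about to write a
 * page to the swap device.  A non-zero return from the put means the
 * page was stored in PAM and the disk write is skipped. */
static int swap_out_page(struct page *page)
{
        if (frontswap_put_page(page))
                return 0;                     /* stored in PAM           */
        return write_page_to_swap(page);      /* rejected: write to disk */
}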

2.1.1 Cleancache

Cleancache can be thought of as a page-granularity victim cache for clean pages that the kernel's page frame reclaim algorithm (PFRA) would like to keep around, but cannot because there is not enough memory. When the PFRA evicts a page, it first attempts to put it into a synchronous, concurrency-safe, page-oriented PAM device (such as zcache, SSDCache, or another RAM-like device) that is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. When a cleancache-enabled filesystem wishes to access a page in a file on disk, it first checks cleancache to see whether it already contains the page; if it does, the page is copied into the kernel and a disk access is avoided.

A cleancache backend that interfaces to such a PAM links itself to the kernel's cleancache frontend by setting the cleancache_ops functions appropriately, and the functions it provides must conform to certain semantics. Most important, cleancache is ephemeral: pages copied into cleancache have an indefinite lifetime that is completely unknowable by the kernel, and so may or may not still be in cleancache at any later time. Thus, as its name implies, cleancache is not suitable for dirty pages. Cleancache has complete discretion over which pages to preserve, which pages to discard, and when.

Mounting a cleancache-enabled filesystem calls init_fs to obtain a pool id which, if positive, must be saved in the filesystem's superblock; a negative return value indicates failure. A put_page copies a (presumably about-to-be-evicted) page into cleancache and associates it with the pool id, the file's inode, and a page index into the file. The combination of a pool id, an inode, and an index constitutes a handle. A get_page copies the page, if found, from cleancache into kernel memory. A flush_page ensures the page is no longer present in cleancache; a flush_inode flushes all pages associated with the specified inode; and, when a filesystem is unmounted, a flush_fs flushes all pages in all inodes specified by the given pool id and also surrenders the pool id. An init_shared_fs, like init_fs, obtains a pool id but tells cleancache to treat the pool as shared, using a 128-bit UUID as a key. On systems that may run multiple kernels (such as hard-partitioned or virtualized systems) that share a clustered filesystem, and where cleancache may be shared among those kernels, calls to init_shared_fs that specify the same UUID receive the same pool id, thus allowing the pages to be shared. Note that any security requirements must be imposed outside of the kernel (e.g., by tools that control cleancache); alternatively, a cleancache implementation can simply disable shared pools by always returning a negative value from init_shared_fs.

If a get_page succeeds on a non-shared pool, the page is flushed (making cleancache an exclusive cache). On a shared pool, the page is not flushed on a successful get_page, so that it remains accessible to other sharers. The kernel is responsible for ensuring coherency between cleancache (shared or not), the page cache, and the filesystem, using cleancache flush operations as required. Note that cleancache must enforce put-put-get coherency and get-get coherency. For the former, if two puts are made to the same handle but with different data, say AAA by the first put and BBB by the second, a subsequent get can never return the stale data (AAA). For get-get coherency, if a get for a given handle fails, subsequent gets for that handle will never succeed unless preceded by a successful put with that handle. Finally, cleancache provides no SMP serialization guarantees; if two different Linux threads are simultaneously putting and flushing a page with the same handle, the results are indeterminate.
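Putting the operations above together, a cleancache backend essentially implements a table of callbacks. The following is a simplified sketch based on the operation names described in this section; the in-kernel structure uses kernel types (struct page, filesystem keys) rather than the plain integers shown here.

/* Simplified sketch of the cleancache backend interface, using the
 * operation names described above.  Types are reduced to plain integers
 * for illustration. */
struct cleancache_ops {
        int  (*init_fs)(unsigned long pagesize);   /* returns pool id, negative on failure */
        int  (*init_shared_fs)(char *uuid, unsigned long pagesize);
        int  (*get_page)(int pool_id, unsigned long inode,
                         unsigned long index, void *page);
        void (*put_page)(int pool_id, unsigned long inode,
                         unsigned long index, void *page);
        void (*flush_page)(int pool_id, unsigned long inode,
                           unsigned long index);
        void (*flush_inode)(int pool_id, unsigned long inode);
        void (*flush_fs)(int pool_id);
};

/* A backend such as zcache registers itself by pointing the kernel's
 * cleancache_ops at its implementations; the PFRA and filesystem hooks
 * then call through this table using (pool id, inode, index) handles. */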

2.1.2 Frontswap

Frontswap is so named because it can be thought of as the opposite of a backing store for a swap device. The storage is assumed to be a synchronous, concurrency-safe, page-oriented PAM device (such as zcache, SSDCache, or another RAM-like device) that is not directly accessible or addressable by the kernel and is of unknown and possibly time-varying size. This PAM device links itself to frontswap by setting the frontswap_ops pointer appropriately, and the functions it provides must conform to certain policies. An init prepares the PAM to receive frontswap pages and returns a non-negative pool id, used for all swap device numbers (aka types). A put_page copies the page to PAM and associates it with the type and offset of the page. A get_page copies the page, if found, from PAM into kernel memory, but does not remove the page from PAM. A flush_page removes the page from PAM, and a flush_area removes all pages associated with the swap type (e.g., as on swapoff) and notifies the PAM device to refuse further puts with that swap type.

Once a page is successfully put, a matching get on the page will always succeed. So when the kernel finds itself in a situation where it needs to swap out a page, it first attempts to use frontswap. If the put returns non-zero, the data has been successfully saved to PAM, and a disk write (and, if the data is later read back, a disk read) is avoided. If a put returns zero, PAM has rejected the data, and the page is written to swap as usual. Note that if a page is put and the page already exists in PAM (a duplicate put), either the put succeeds and the data is overwritten, or the put fails and the page is flushed. This ensures that stale data can never be obtained from PAM.
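The corresponding backend interface for frontswap is keyed by (swap type, offset) rather than (pool id, inode, index). Again, this is a simplified sketch following the operation names above, not the exact in-kernel declaration.

/* Simplified sketch of the frontswap backend interface. */
struct frontswap_ops {
        int  (*init)(unsigned type);        /* prepare a pool for a swap device       */
        int  (*put_page)(unsigned type, unsigned long offset,
                         void *page);       /* non-zero: stored; zero: rejected       */
        int  (*get_page)(unsigned type, unsigned long offset,
                         void *page);       /* copy back; page stays in PAM           */
        void (*flush_page)(unsigned type, unsigned long offset);
        void (*flush_area)(unsigned type);  /* e.g. on swapoff: drop all, refuse puts */
};

/* A duplicate put to an existing (type, offset) must either overwrite the
 * old data or fail and flush it, so a later get never returns stale data. */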

Figure 2: HCache: unused memory on the host is used as a second-chance cache for guests.

2.2 Driver Modules

TODO: describe how modules interact with the layers above (cleancache and frontswap) and below (tmem and PAM). Also show example modules: zcache and possibly hcache (hypervisor cache). Figure 2 shows the use of idle hypervisor memory by guests using HCache.
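As a rough illustration of the intended structure (pending the full write-up above), the fragment below sketches a zcache-style put path: the module receives a clean page from the cleancache frontend, compresses it, and stores the result in an ephemeral PAM pool. All function and variable names here are hypothetical placeholders.

/* Hypothetical zcache-style module: sits between the cleancache frontend
 * (above) and an ephemeral PAM pool (below).  Error handling omitted. */
#define PAGE_SIZE 4096

static void zcache_put_page(int pool_id, unsigned long inode,
                            unsigned long index, void *page)
{
        unsigned char buf[2 * PAGE_SIZE];   /* worst-case compressed size */
        unsigned long clen;

        compress_page(page, buf, &clen);    /* e.g. LZO1X */
        if (clen >= PAGE_SIZE)
                return;  /* incompressible: drop it; cleancache is ephemeral */
        pam_put(epam_pool, pool_id, inode, index, buf, clen);
}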

2.3 Storage Abstraction

CacheLayers allows driver modules that provide caching services over a specific target device to be plugged in easily. Each of these devices has unique performance, byte-accessibility, and/or reliability idiosyncrasies that hinder it from being treated as true RAM. To abstract away the complexities of these varied underlying devices, an abstraction layer called Page Addressable Memory (PAM) is introduced. As its name implies, PAM is accessed only by the page, not by the byte (the page size must be specified but need not be 4K). As with a device, data in PAM must be copied or DMAed into RAM before it can be directly used or byte-addressed by the kernel or by userland. Because many of the new memory types are dynamic in nature, the kernel does not know the size of PAM a priori, so the kernel addresses each page with a non-linear, object-oriented handle and accesses the data through a generic synchronous API of get_page, put_page, and flush_page. The idiosyncrasies of each new memory type are then entirely hidden in PAM drivers behind this API.

There are at least two types of PAM: ephemeral PAM (EPAM) and persistent PAM (PPAM). A put to EPAM is always successful, but a get of the same page may fail; so EPAM is not guaranteed to hold all of the pages put to it. A put to PPAM may fail but, once a put is successful, a get of the same page will always succeed. A PAM driver supporting EPAM and/or PPAM must ensure certain additional coherency and concurrency semantics that are beyond the scope of this brief discussion. There may also be other useful types of PAM. Given these semantics, EPAM can be used as an overflow for the page cache, and PPAM can be used as a fronting store for swap devices. For instance, zcache, which caches only clean pagecache pages, compresses incoming pages from cleancache and stores the resulting chunks in EPAM.
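The EPAM/PPAM distinction comes down to which side of the API is allowed to fail. Below is a minimal sketch of the PAM interface described above, with those semantics captured as comments; types and names are simplified assumptions, not the actual implementation.

/* Minimal sketch of the PAM abstraction: data moves only in whole pages,
 * addressed by an opaque, non-linear handle rather than a linear offset. */
struct pam_handle {
        int           pool_id;
        unsigned long object;   /* e.g. inode (cleancache) or swap type (frontswap) */
        unsigned long index;    /* page index within the object */
};

struct pam_ops {
        /* EPAM: put always succeeds, but a later get may fail (the page
         *       may have been discarded by the backend).
         * PPAM: put may fail, but after a successful put the matching get
         *       must always succeed. */
        int  (*put_page)(struct pam_handle *h, void *page);
        int  (*get_page)(struct pam_handle *h, void *page);
        void (*flush_page)(struct pam_handle *h);
};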

2.4 Other Common Services

Services such as memory allocation (useful for zcache), disk block allocation (useful for SSDCache), compression, de-duplication, and LRU queues are provided for use by the various driver modules.

2.4.1 Memory Allocation

TODO: describe xvmalloc (http://code.google.com/p/compcache/wiki/xvMalloc) and zbud. If possible, xfmalloc too (NOT YET IMPLEMENTED).
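For reference while this section is being written up: the xvmalloc allocator from the compcache project (linked above) hands out allocations as (page, offset) pairs rather than raw pointers, so compressed chunks can live in highmem pages. The interface is roughly as follows; types are simplified here and the linked wiki page is authoritative.

/* Rough sketch of the xvmalloc interface (simplified from the compcache
 * sources); allocations are identified by a (page, offset) pair. */
struct xv_pool *xv_create_pool(void);
void xv_destroy_pool(struct xv_pool *pool);

/* On success, *page and *offset identify where the object was placed. */
int  xv_malloc(struct xv_pool *pool, unsigned int size,
               struct page **page, unsigned int *offset, gfp_t flags);
void xv_free(struct xv_pool *pool, struct page *page, unsigned int offset);

unsigned long long xv_get_total_size_bytes(struct xv_pool *pool);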

2.4.2 Disk Block Allocation

TODO: probably we will reuse the existing SSDAlloc (http://www.cs.princeton.edu/~abadam/ssdalloc.html).

2.4.3 Compression and Deduplication

TODO: Should we support multiple compression algorithms? Currently, only LZO1X is supported. For de-duplication, should we attempt to implement something like Difference Engine (http://www.usenix.org/event/osdi08/tech/full_papers/gupta/gupta.pdf)?
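For the currently supported LZO1X path, the kernel's lib/lzo API is the natural fit. A minimal sketch of compressing one page with it follows, assuming a 4K page; buffer sizing and error handling are reduced to the essentials (the destination buffer must be sized for the LZO worst case, slightly larger than the input).

#include <linux/lzo.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* Minimal sketch: compress one 4K page with the kernel's LZO1X-1 routine. */
static int compress_one_page(const unsigned char *page, unsigned char *dst,
                             size_t *dst_len)
{
        void *wrkmem = kmalloc(LZO1X_1_MEM_COMPRESS, GFP_KERNEL);
        int ret;

        if (!wrkmem)
                return -ENOMEM;
        ret = lzo1x_1_compress(page, 4096, dst, dst_len, wrkmem);
        kfree(wrkmem);
        return ret == LZO_E_OK ? 0 : -EIO;
}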

3. CONCLUSIONS

It's all super cool!

4. ACKNOWLEDGMENTS

This section is optional; it is a location for you to acknowledge grants, funding, editing assistance and what have you.

5. REFERENCES
