
Virtual Machine Resource Extensions

Nitin Gupta (ngupta@vflare.org)

1. Abstract
Today, virtualization solutions allow large amounts of RAM and large numbers of virtual CPUs
(vCPUs) to be assigned to individual Virtual Machines (VMs). Additionally, hot add/remove of memory
and vCPUs is now supported by some virtualization solutions [1] (guest OS support is, of course, also
required). However, due to problems inherent in virtualization and software stacks in general,
this does not provide enough flexibility and scalability for applications running inside a guest. A
solution is presented as an addition to VM hardware: Virtual Co-processors (VCOPs). Section 2
describes problems with existing virtualization solutions. Section 3 presents VCOPs as a
solution to these problems. Section 4 lists some use cases for VCOPs. Section 5 discusses limitations of
this approach, and finally Section 6 concludes by describing a prototype developed for KVM.

2. Virtualization Issues

2.1 Co-scheduling
For a multiprocessor VM, all of its vCPUs have to be co-scheduled nearly synchronously to ensure correct
guest execution. Many advances have been made to reduce the overheads associated with co-scheduling
[2]. Still, this kind of scheduling hurts performance in over-commit scenarios, where the total number of
physical CPUs (pCPUs) is, at any particular time, insufficient to co-schedule all vCPUs of every VM. For
example, on an 8-core machine, if a 4-vCPU VM is started while one 4-vCPU VM and one 2-vCPU VM are
already running, co-scheduling can cause significant performance overhead, since only two of the VMs can run
at a time. Thus, any time the 2-vCPU VM is scheduled, two of the pCPUs will effectively remain unused. Provided
that the VMs are not idle, many of the existing techniques to reduce co-scheduling overhead will not work.

As such, this kind of overhead cannot be avoided if a workload indeed requires co-scheduling for all its
threads. However, there are many workloads that can work with more relaxed scheduling, giving more
flexibility to the host scheduler and thus achieving better scalability.
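The co-scheduling waste in the 8-core example above can be made concrete with a small sketch (not from the paper; the greedy slice-filling policy is an illustrative assumption, real schedulers are far more sophisticated):

```python
# Strict co-scheduling: a VM runs in a time slice only if ALL of its vCPUs
# get a pCPU in that slice. Greedily fill a slice and report idle pCPUs.
def schedule_slice(pcpus, vm_sizes):
    """Return (list of co-scheduled VM sizes, number of idle pCPUs)."""
    free = pcpus
    scheduled = []
    for size in sorted(vm_sizes, reverse=True):
        if size <= free:          # schedule all vCPUs of this VM, or none
            scheduled.append(size)
            free -= size
    return scheduled, free

# 8 pCPUs, VMs with 4, 4 and 2 vCPUs: a slice can run both 4-vCPU VMs,
# but any slice pairing a 4-vCPU VM with the 2-vCPU VM idles two pCPUs.
print(schedule_slice(8, [4, 4, 2]))   # ([4, 4], 0)
print(schedule_slice(8, [4, 2]))      # ([4, 2], 2) -> two pCPUs unused
```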

2.2 Additional Hardware Resources


The host CPU might support an extended instruction set (or the host may include other processing entities,
such as GPUs or the IBM Cell) that accelerates tasks like encryption. For example, Intel Nehalem supports the
AES-NI [4] instruction set, which is claimed to provide over a 3x encryption and decryption rate compared to a
purely software approach. Such extended instructions are usually not exposed to guests, so that guests can be
migrated to other machines having CPU(s) from a different family (or perhaps from a different vendor). Thus,
for example, virtualizing a workload where encryption is the main task will result in a significant slowdown.

2.3 Resource Isolation
When a vCPU is hot-added to a guest, there is no guarantee that it will be used to accelerate just one
specific task. The guest OS might time-share this new resource among many other, less critical tasks. With
VCOPs, however, a VCOP can be instantiated for an individual application (including the guest kernel), which
can then offload specific tasks without additional noise.

3. Virtual Co-processors
Physical co-processors are slowly being replaced (consider, for example, the now-obsolete TCP Offload
Engine [3]) by general-purpose CPUs, which are becoming cheaper and faster, obviating the need for special
co-processors. This trend strongly discourages the use of co-processors on physical systems. However,
virtualization brings new problems and challenges that call for revisiting this approach.

VCOPs are visible to a guest as PCI devices which can be dynamically added or removed. Each device
exports a specific functionality such as encryption or compression, or a higher-level functionality like
filesystem caching and indexing. These devices set up a communication channel with the host for receiving
input buffers and returning results back to the guest. The execution happens on the host, where one or more
threads (depending on host resources) can be assigned to individual VCOPs.

[Figure 1: Logical view of a VM with VCOPs. CPUs and VCOPs sit on the system bus alongside system
memory; a VCOP may have its own local memory and local storage.]

The number of host threads assigned to a VCOP can be changed dynamically depending on resource
availability on the host and per-VM bounds. The application running inside a guest need not be aware of
the changing number of threads, as it simply uses the interface exported by a specific VCOP.
The communication between a guest and the host is provided using virtio [5], which sets up as many
virtqueues as needed for the VCOP functionality. The functionality is exported to applications via a
VCOP-specific library. Programming models traditionally used for physical co-processors might also be
considered.

[Figure 2: A typical application <-> VCOP interface. An application inside the VM places work-items on an
input queue; work-item threads running on the host consume them and place results on an output queue.]

An input queue is processed by the VCOP (threads running on the host), feeding results to an output queue
which is then consumed by the application. A queue consists of work-items, each of which is a unit of
command/data consumed by the host.
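The flow above can be sketched in a few lines. This is a hypothetical model only (the names `run_vcop`, `process`, and `num_host_threads` are illustrative, not part of any real VCOP API); it shows how a variable pool of host threads drains an input queue into an output queue:

```python
import queue
import threading

def run_vcop(work_items, process, num_host_threads=2):
    """Model of Figure 2: guest fills an input queue; host threads drain it."""
    inq, outq = queue.Queue(), queue.Queue()
    for item in work_items:
        inq.put(item)                 # guest side: enqueue work-items

    def worker():
        while True:
            try:
                item = inq.get_nowait()
            except queue.Empty:
                return                # input queue drained; thread exits
            outq.put(process(item))   # one work-item = one command/data unit

    # The host can size this pool however it likes; the guest never notices.
    threads = [threading.Thread(target=worker) for _ in range(num_host_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(outq.queue)         # order across host threads is arbitrary

print(run_vcop([1, 2, 3, 4], lambda x: x * x))  # [1, 4, 9, 16]
```

Because each work-item is independent, the number of worker threads can be changed between runs (or even mid-run, in a real implementation) without any change on the guest side.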

4. VCOP Use Cases


These use cases show how the virtualization issues described in section 2 can be addressed by the
VCOP approach.

4.1 Encryption VCOP


For the Encryption VCOP, each work-item can be a fixed-size buffer to be encrypted. The host (hypervisor)
can assign any number of host threads to work on individual work-items. In this case, there is no need to
co-schedule these host threads, giving the scheduler additional flexibility. For an encryption-dominated
workload, the VCOP approach will be much more scalable and resource-friendly than the current approach
of creating multiple 1-2 vCPU machines. It is also better than VMs with a large number of vCPUs (if the
hypervisor supports them), since with a VCOP it is understood that the encryption threads need not be
co-scheduled. The same applies to workloads more practical or complex than simple encryption.
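A minimal sketch of this pattern follows. A toy XOR cipher stands in for real encryption (a real host-side VCOP would run AES-NI-capable code); the point is that buffers are independent work-items, so the thread pool size is purely a host-side decision:

```python
from concurrent.futures import ThreadPoolExecutor

KEY = 0x5A  # illustrative single-byte key; stand-in for a real cipher key

def encrypt_buffer(buf):
    """Toy XOR 'encryption' of one fixed-size work-item buffer."""
    return bytes(b ^ KEY for b in buf)

def encryption_vcop(buffers, host_threads=4):
    # Each buffer is an independent work-item: no co-scheduling needed,
    # and the hypervisor may grow or shrink the pool at any time.
    with ThreadPoolExecutor(max_workers=host_threads) as pool:
        return list(pool.map(encrypt_buffer, buffers))

data = [b"work-item-1", b"work-item-2"]
out = encryption_vcop(data)
# XOR is its own inverse, so round-tripping recovers the plaintext.
assert [encrypt_buffer(c) for c in out] == data
```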

4.2 Caching VCOP


This VCOP exposes three queues: input, output and command. The work-item size can be set equal to the
system page size. Data to be cached is placed on the input queue, and the VCOP provides a handle on the
output queue to identify the page later. The application places get_page(handle),
put_page(handle) and delete_page(handle) commands on the command queue to fetch, store and delete
the given page, respectively. For a fetch command, the page is placed on the output queue.
The memory assigned to these pages, i.e. the size of the VCOP's local memory, can be increased or decreased
dynamically as needed by the application, with the upper limit decided by the hypervisor. Extending
this idea, applications can also create volatile caches, the contents of which can be discarded at
any time. The guest kernel can be modified to use such a volatile cache for maintaining a "clean" filesystem
cache. This cache takes a significant amount of guest memory, and since it is volatile, the hypervisor
can quickly reclaim this memory whenever needed. This is much faster and more reliable than ballooning
or swapping at the host and guest level.
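The protocol can be modeled as follows. This is a hypothetical in-process model of the host side (the class name, eviction policy, and `max_pages` parameter are assumptions for illustration); the key property is that get_page may miss, since a volatile cache lets the hypervisor discard pages at will:

```python
class CachingVCOP:
    """Toy model of the Caching VCOP command protocol (host side)."""

    def __init__(self, max_pages):
        self.max_pages = max_pages   # upper limit decided by the hypervisor
        self.pages = {}              # handle -> page data
        self.next_handle = 0

    def put_page(self, data):
        """Store a page; return a handle the guest uses to refer to it."""
        if len(self.pages) >= self.max_pages:
            # Volatile cache under memory pressure: drop an arbitrary page.
            self.pages.pop(next(iter(self.pages)))
        handle = self.next_handle
        self.next_handle += 1
        self.pages[handle] = data
        return handle

    def get_page(self, handle):
        """Fetch a page; None means it was discarded (callers must cope)."""
        return self.pages.get(handle)

    def delete_page(self, handle):
        """Guest no longer needs the page; free host memory."""
        self.pages.pop(handle, None)

cache = CachingVCOP(max_pages=2)
h = cache.put_page(b"clean fs page")
assert cache.get_page(h) == b"clean fs page"
cache.delete_page(h)
assert cache.get_page(h) is None     # a miss, exactly as after eviction
```

A guest-side "clean" page cache built on this must treat every get_page miss as a normal cache miss and refetch from disk, which is what makes host-side reclaim safe and cheap.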

It is easy to extend this caching VCOP to provide functionality equivalent to
"Paravirtualized Paging" [6], but now in a more general sense: instead of a global, kernel-controlled
cache, such VCOPs can be instantiated by any application (including the kernel itself) running inside a VM.
This extended memory can be transparently compressed and decompressed, which provides significant
memory savings and better performance under memory pressure [7] [8].

4.3 Indexing VCOP


Another application is filesystem indexing. For this, a process running inside a guest can expose data
from different offsets in a file to the VCOP's input queue. The VCOP running on the host will index contents
of the buffers in the input queue and save the resulting tags in its local disk storage (which is separate
from the VM storage). The number of host threads assigned to do this indexing is decided by the host, as
is the limit on the storage used for the tags. The threads need not be co-scheduled, giving the host
more flexibility. Also, the number of threads can change dynamically depending on resources available
on the host.
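A sketch of the host-side indexing step, under stated assumptions (one work-item per file chunk, a simple word-to-location tag map held in memory; a real VCOP would persist tags to local storage separate from the VM disk image):

```python
from collections import defaultdict

def index_buffers(work_items):
    """Build a tag index: word -> list of (file path, offset) locations.

    Each work-item is (path, offset, data) for one chunk of a guest file.
    """
    tags = defaultdict(list)
    for path, offset, data in work_items:
        for word in data.decode(errors="ignore").split():
            tags[word.lower()].append((path, offset))
    return dict(tags)

items = [("/var/log/app.log", 0, b"error timeout"),
         ("/var/log/app.log", 4096, b"ERROR retry")]
idx = index_buffers(items)
print(idx["error"])   # [('/var/log/app.log', 0), ('/var/log/app.log', 4096)]
```

Since each chunk is indexed independently and results are merged, the host can split the input queue across however many indexing threads it currently has to spare.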

5. Limitations
The approach for dynamically extending VM resources detailed above has some limitations. A
VCOP can expose only a specific functionality and cannot be used for executing arbitrary code (as is
possible with vCPUs). Thus "computational VCOPs" (for encryption, compression, etc.) cannot be considered
a replacement for SMP VM support. Instead, they are useful for optimizing hotspots in
applications running in virtualized environments.

"Caching VCOPs" are quite generic, as they can be used for any kind of cache. However, they cannot
completely replace caches inside a guest (say, the page cache in Linux), as copying pages between the
host and the guest can have non-trivial overhead. Instead, they can be used as a "second chance" cache,
potentially with transparent de-duplication and compression, providing an efficient caching service to
guest applications (or to the guest kernel itself).

At this stage, it is not clear how the threads running on the host on behalf of a VCOP can be bounded in
terms of host resource usage: for example, bounding an instance of the encryption VCOP to just
200MHz of host CPU power, or bounding a caching VCOP to just 128MB of host memory. Enforcing
such bounds is essential; otherwise, a single guest could cause a denial of service, starving other
VMs.

Also, it is not yet understood how feasible it is to modify guest applications and libraries to use VCOP
interface libraries. The prototype presented in section 6 uses virtqueues directly to send cache data
to the host. Such direct use is not feasible for every kind of VCOP.

6. Prototype
A simplified version of “Caching VCOP” (described above) has been developed for KVM. It consists of
two parts:

Patch for qemu-kvm: Exposes a virtual PCI device which registers a single virtio queue. Link:
http://code.google.com/p/compcache/source/browse/sub-projects/vswap/qemu_kvm_vswap_support.patch

Module for the Linux kernel (guest): Registers for the vswap virtio device (VIRTIO_ID_VSWAP), so
we can detect and interact with the virtual PCI device. It also creates a virtual block device
/dev/vswap which acts as a swap device. Pages written to this device are simply sent to the host
when a swap write is issued. Reading back from the host is not yet implemented; thus, writes are also
replayed to the actual swap device inside the guest, and reads are forwarded directly to the physical
guest swap device. Link: http://code.google.com/p/compcache/source/browse/sub-projects/vswap/virtio_vswap.c

(Note that the compcache project is not directly related to the vswap work; its repository was simply used
to host this code.)

7. References
[1] What’s New in VMware vSphere™ 4.0:
http://www.vmware.com/support/vsphere4/doc/vsp_40_new_feat.html

[2] VMware vSphere™ 4: The CPU Scheduler in VMware® ESX™ 4:
http://blogs.vmware.com/performance/2009/08/vmware-vsphere-4-the-cpu-scheduler-in-vmware-esx-4.html

[3] End of the Road for TCP Offload:
http://www.solarflare.com/technology/documents/EndoftheRoadforTCPOffload.pdf

[4] Intel® Advanced Encryption Standard Instructions (AES-NI):
http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/

[5] virtio: towards a de-facto standard for virtual I/O devices. Author: Rusty Russell. SIGOPS Oper. Syst.
Rev., Vol. 42, No. 5 (2008), pp. 95-103.

[6] Paravirtualized Paging. Authors: Dan Magenheimer, Chris Mason, Dave McCracken, and Kurt Hackel:
http://www.usenix.org/event/wiov08/tech/full_papers/magenheimer/magenheimer.pdf

[7] Adaptive main memory compression. Authors: Irina Chihaia Tuduce, Thomas Gross. ATEC '05:
Proceedings of the annual conference on USENIX Annual Technical Conference.

[8] Compcache: in-memory compressed swapping: http://lwn.net/Articles/334649/
