1. Abstract
Today, virtualization solutions allow large amounts of RAM and large numbers of virtual CPUs
(vCPUs) to be assigned to individual Virtual Machines (VMs). Additionally, hot add/remove of memory
and vCPUs is now supported by some virtualization solutions [1] (of course, guest OS support is
also required). However, due to problems inherent in virtualization and software stacks in general,
this still does not provide enough flexibility and scalability for applications running inside a guest. As a
solution, we present an addition to VM hardware: Virtual Co-processors (VCOPs). Section 2 describes
problems with existing virtualization solutions. Section 3 presents VCOPs as a solution to these
problems. Section 4 lists some use cases for VCOPs. Section 5 shows limitations of this approach, and
finally Section 6 concludes by describing a prototype developed for KVM.
2. Virtualization Issues
2.1 Co-scheduling
For a multiprocessor VM, all of its vCPUs have to be co-scheduled nearly synchronously to ensure correct
guest execution. Many advances have been made to reduce the overheads associated with co-scheduling
[2]. Still, this kind of scheduling hurts performance in over-commit scenarios, where the number of
physical CPUs (pCPUs) available at any particular time is not sufficient to co-schedule all vCPUs of a VM. For
example, on an 8-core machine, if a 4-vCPU VM is started while another 4-vCPU VM and a 2-vCPU VM
are already running, co-scheduling can cause significant performance overhead, as only two of the VMs can run at a
time. Thus, whenever the 2-vCPU VM is scheduled, two of the pCPUs effectively remain unused. Provided
that the VMs are not idle, many of the existing techniques to reduce co-scheduling overhead will not work.
As such, this kind of overhead cannot be avoided if a workload indeed requires co-scheduling for all its
threads. However, many workloads can work with more relaxed scheduling, giving more
flexibility to the host scheduler and thus achieving better scalability.
2.3 Resource Isolation
When a vCPU is hot-added to a guest, there is no guarantee that it will be used to accelerate just a
specific task. The guest OS might time-share this new resource among many other, less critical tasks. With
VCOPs, however, we can instantiate a co-processor for an individual application (including the guest
kernel), which can then offload specific tasks without additional noise.
3. Virtual Co-processors
Physical co-processors are slowly being replaced (for example, the now-obsolete TCP Offload Engine [3]) by
general-purpose CPUs, which are becoming cheaper and faster, obviating the need for special co-
processors. This trend strongly discourages the use of co-processors on physical systems. However,
virtualization brings new problems and challenges that call for revisiting this approach.
VCOPs are visible to a guest as PCI devices which can be dynamically added or removed. Each device
exports a specific functionality, such as encryption or compression, or a higher-level functionality, such as
filesystem caching and indexing. These devices set up a communication channel with the host for receiving
input buffers and returning results back to the guest. The execution happens on the host, where one or
more threads (depending on host resources) can be assigned to individual VCOPs.
[Figure: A VCOP (VCOP 0) attached to the system bus alongside CPU 0 and CPU 1, with local memory and local storage]
The number of host threads assigned to a VCOP can be changed dynamically, depending on resource
availability on the host and on per-VM bounds. The application running inside a guest need not be aware of
this changing number of threads, as it simply uses the interface exported by a specific VCOP.
Communication between the guest and the host is provided using virtio [5], which sets up as many
virtqueues as the VCOP's functionality requires. The functionality is exported to applications through a
VCOP-specific library. Programming models traditionally used for physical co-processors might also be
considered.
[Figure: An application in the VM places work-items on an input queue; VCOP threads running on the host consume them and return results through an output queue]
An input queue is processed by the VCOP (threads running on the host), which feeds results to an output
queue that is then consumed by the application. A queue consists of work-items, each of which is a unit of
command/data consumed by the host.
It is easy to extend this caching VCOP to provide functionality equivalent to
“Paravirtualized Paging” [6], but in a more general sense: instead of a global, kernel-controlled
cache, such VCOPs can be instantiated by any application (including the kernel itself) running inside a VM.
This extended memory can be transparently compressed and decompressed, which provides significant
memory savings and better performance under memory pressure [7] [8].
5. Limitations
The approach for dynamically extending VM resources detailed above has some limitations. A
VCOP can expose only a specific functionality and cannot be used for executing arbitrary code (as is
possible with vCPUs). Thus, “computational VCOPs” (such as encryption or compression) cannot be considered
a replacement for the need to support SMP VMs. Instead, they are useful for optimizing hotspots in
applications running in a virtualized environment.
“Caching VCOPs” are quite generic, as they can be used for any kind of cache. However, they cannot
completely replace caches inside a guest (say, the page cache in Linux), as copying pages between the
guest and the host can have non-trivial overhead. They can, however, be used as a “second chance” cache,
potentially with transparent de-duplication and compression, providing an efficient caching service to guest
applications (or to the guest kernel itself).
At this stage, it is not clear how the threads running on the host on behalf of a VCOP can be bounded in
terms of resource usage on the host. For example, consider bounding an instance of an encryption VCOP to
just 200 MHz of host CPU power, or bounding a caching VCOP to just 128 MB of host memory. Enforcing
such bounds is essential; otherwise, a single guest can potentially cause denial of service, starving other
VMs.
Also, it is not yet understood how feasible it is to modify guest applications/libraries to use VCOP
interface libraries. The prototype presented in section 6 directly uses virtqueues to send out cache data
to the host. Such direct use is not feasible for every kind of VCOP.
6. Prototype
A simplified version of the “Caching VCOP” (described above) has been developed for KVM. It consists of
two parts:
Patch for qemu-kvm: Exposes a virtual PCI device which registers a single virtio queue. Link:
http://code.google.com/p/compcache/source/browse/sub-projects/vswap/qemu_kvm_vswap_support.patch
Module for the Linux kernel (guest): Registers for the vswap virtio device (VIRTIO_ID_VSWAP) so that
this virtual PCI device can be detected and interacted with. It also creates a virtual block disk,
/dev/vswap, which acts as a swap device. Pages written to this device are simply sent to the host
when a swap write is issued. Reading pages back from the host is not yet implemented; thus, every write
is also replayed to the actual swap device inside the guest, and reads are forwarded directly to that
physical guest swap device. Link: http://code.google.com/p/compcache/source/browse/sub-projects/vswap/virtio_vswap.c
(Note that the compcache project is not directly related to the vswap work; the project repository was simply
used to host this code.)
7. References
[1] What’s New in VMware vSphere™ 4.0:
http://www.vmware.com/support/vsphere4/doc/vsp_40_new_feat.html
[5] virtio: Towards a De-Facto Standard for Virtual I/O Devices. Author: Rusty Russell. SIGOPS Oper. Syst.
Rev., Vol. 42, No. 5 (2008), pp. 95-103.
[6] Paravirtualized Paging. Authors: Dan Magenheimer, Chris Mason, Dave McCracken, Kurt Hackel:
http://www.usenix.org/event/wiov08/tech/full_papers/magenheimer/magenheimer.pdf
[7] Adaptive Main Memory Compression. Authors: Irina Chihaia Tuduce, Thomas Gross. ATEC '05:
Proceedings of the USENIX Annual Technical Conference.