
Virtual Co-processors: Flexible and Scalable Virtual Machines
Nitin Gupta
ngupta@vflare.org
(Work-In-Progress)

ABSTRACT

Current virtualization solutions do not provide enough flexibility in resource allocation to Virtual Machines (VMs); this is especially true for CPU and memory resources. They also have severe scalability issues, which make them unsuitable for many kinds of workloads that would otherwise benefit from virtualization.

Flexibility problem: virtualization solutions currently provide a fairly rigid configuration for VMs. A fixed number of Virtual CPUs (VCPUs) and a fixed amount of memory are assigned to a VM, and this cannot be changed during its runtime. Even where support for hot add/remove of CPU and memory now exists in some hypervisors and OSes, it is still not sufficient: there is no way to ensure that newly added VCPU(s) will be used to accelerate, say, one particular encryption task. Also, as the number of VCPUs increases, we quickly run into scalability issues. The same holds for memory: there is no guarantee that newly added memory will be used to cache just that important database; the guest kernel may share it among numerous other caches.

Scalability problem: due to co-scheduling issues, only a small number of VCPUs (typically 1-4) can be assigned to a VM. At the hypervisor level, this co-scheduling is a must to ensure correct guest execution. However, many applications do not have such strict scheduling requirements.

In addition to the above problems, applications like web servers and DBMSes [1] want direct control over these resources: they have their own memory allocators and thread schedulers. Ideally, such applications would like to completely bypass the, for them largely redundant, operating system layer.

These issues are addressed by the virtual co-processor approach discussed in this paper. Applications can dynamically request more CPU and memory resources from the hypervisor and manage them without involvement of the guest kernel.

1. INTRODUCTION
Operating systems and applications today are highly multi-threaded and are capable of utilizing the large number of cores available on current systems. This is true from the perspective of the virtualization stack too: it manages multiple cores to schedule multiple VMs simultaneously. Each VM typically has 1-2 VCPUs and a small fraction of total system memory. To achieve near-native performance, this model depends on the ability to run multiple instances of the same VM (see, for example, [2]).

Such an approach to scalability brings up a paradox: the OS and applications running inside VMs are being designed to handle an increasing number of cores and to manage huge amounts of memory, while at the same time virtualization focuses on running them with a small number of VCPUs and memory equal to a small fraction of the system RAM. Thus a large amount of effort in OS and application software development is wasted, since it takes significant resources to develop such multi-threaded software.

The problem can be approached in several ways:

One approach is to develop simpler OSes and applications which take significantly fewer resources than the current stack, so that more such VMs can be hosted on given hardware. This approach is taken by "Just enough OS" [3] versions of various Linux and OpenSolaris distributions. However, it is not feasible to maintain such simplified versions of every application to complement these simplified OS stacks. Software like web servers and DBMSes is designed from the ground up to be scalable; it is almost absurd to find an Oracle database running on a 1-2 VCPU machine.

Another approach is to focus on the virtualization stack instead, allowing individual instances of such applications to scale by dynamically providing more CPU and memory resources as needed. This dynamic resource allocation should still be subject to the constraints enforced by the hypervisor, so that we continue to get the benefits offered by virtualization, such as isolation and VM migration (VMotion).

The virtual co-processor method helps with the second approach by allowing applications to dynamically request more CPU and memory resources from the hypervisor.

2. VIRTUAL CO-PROCESSORS

Virtual Co-processors (VCOPs) are abstract computing devices accessible to applications running inside a guest. Access to these devices is requested by the applications and granted by the hypervisor, bypassing the guest kernel once the channel is set up. VCOPs can provide high-level functionality like generic caching devices, or specific functions like encryption, compression, video encoding etc. In general, the VCOP model can handle task-parallel and data-parallel applications, or a mix of the two. Figure 1 shows a logical view of a VM with VCOPs.

Figure 1: Logical view of a VM with VCOPs. VCPUs (CPU 0, CPU 1) and VCOPs 0-2 share the system bus and system memory, while each VCOP can also have its own local memory and local storage.

2.1 VCOP INTERFACE AND APPLICATIONS

Guest applications probe for and request VCOPs using the interface provided by the libvcop library. This library allows setting up multiple channels/queues between an application and the hypervisor. Typically, an application creates queues for input, output and command (for synchronization and for selecting an action: store, retrieve etc.). The number and kind of queues depend on the application and on the function exported by the VCOP.

Figure 2: A typical application-VCOP interface. An input queue is processed by the VCOP (work-item threads running on the host), feeding an output queue which is consumed by the application.

A queue consists of work-items, where a work-item is a unit of command/data consumed by the host. For example, for an encryption VCOP, each work-item can be a fixed-size buffer to be encrypted. The host (hypervisor) can assign any number of host threads to work on individual work-items. In this case there is no need to co-schedule all these host threads, which gives the scheduler additional flexibility. For an encryption-dominated workload, the VCOP approach will be much more scalable and resource friendly than the current approach of creating multiple 1-2 VCPU machines. It is also better than VMs with a large number of VCPUs (where the hypervisor supports them), since the VCOP understands that the encryption threads need not be co-scheduled. The same holds for workloads more practical or complex than simple encryption.
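
As a concrete illustration, the following sketch shows how an application might drive such an encryption VCOP. Since work on libvcop has not yet started, every name here (vcop_probe(), vcop_queue_create(), vcop_submit(), vcop_recv(), vcop_release() and the constants) is hypothetical:

/* Hypothetical libvcop usage for an encryption VCOP. For simplicity,
 * assume len is a multiple of the work-item size, that vcop_submit()
 * copies the buffer it is given, and that the VCOP preserves
 * work-item order on the output queue. */
#include <stddef.h>
#include "vcop.h"                        /* hypothetical libvcop header */

#define WORK_ITEM_SIZE 4096              /* fixed-size buffer per work-item */

int encrypt_buffer(const char *in, char *out, size_t len)
{
    struct vcop *v;
    struct vcop_queue *inq, *outq;

    /* Probe for an encryption VCOP; the hypervisor may refuse. */
    v = vcop_probe(VCOP_TYPE_ENCRYPT);
    if (!v)
        return -1;                       /* fall back to in-guest encryption */

    inq  = vcop_queue_create(v, VCOP_QUEUE_INPUT);
    outq = vcop_queue_create(v, VCOP_QUEUE_OUTPUT);

    /* Submit fixed-size work-items; host threads consume them
     * independently, so no co-scheduling is required. */
    for (size_t off = 0; off < len; off += WORK_ITEM_SIZE)
        vcop_submit(inq, in + off, WORK_ITEM_SIZE);

    /* Collect encrypted work-items as host threads finish them. */
    for (size_t off = 0; off < len; off += WORK_ITEM_SIZE)
        vcop_recv(outq, out + off, WORK_ITEM_SIZE);

    vcop_release(v);
    return 0;
}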

Another application of a VCOP is as a caching device. Here the application creates three queues: input, output and command. The work-item size can be set equal to the system page size. Data to be cached is placed on the input queue, and the VCOP returns, on the output queue, a handle identifying the page. The application later places Fetch(handle) or Delete(handle) on the command queue to fetch or delete this page; for a fetch command, the page is placed on the output queue. The memory assigned to these pages, i.e. the size of the VCOP local memory, can be increased or decreased dynamically as needed by the application, with the upper limit decided by the hypervisor. Extending this idea, applications can also create volatile caches whose contents can be discarded at any time. The guest kernel can be modified to use such a volatile cache for maintaining a "clean" filesystem cache. This cache takes a significant amount of guest memory, and since it is volatile, the hypervisor can quickly reclaim this memory whenever needed; this is much faster and more reliable than ballooning and swapping at the host and guest level.
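
A sketch of this caching interface, using the same hypothetical libvcop names (struct vcop_cmd and the VCOP_CMD_* constants are likewise assumptions):

/* Store, then fetch, one page through a hypothetical caching VCOP. */
#include <stdint.h>
#include "vcop.h"                        /* hypothetical libvcop header */

#define PAGE_SIZE 4096

/* Place one page on the input queue; the VCOP returns a handle on
 * the output queue which identifies the page later. */
uint64_t cache_store(struct vcop_queue *inq, struct vcop_queue *outq,
                     const void *page)
{
    uint64_t handle;

    vcop_submit(inq, page, PAGE_SIZE);
    vcop_recv(outq, &handle, sizeof(handle));
    return handle;
}

/* Fetch a page back. For a volatile cache this may fail, since the
 * hypervisor is free to reclaim the VCOP local memory at any time. */
int cache_fetch(struct vcop_queue *cmdq, struct vcop_queue *outq,
                uint64_t handle, void *page)
{
    struct vcop_cmd cmd = { .op = VCOP_CMD_FETCH, .handle = handle };

    vcop_submit(cmdq, &cmd, sizeof(cmd));
    return vcop_recv(outq, page, PAGE_SIZE);   /* 0 on hit, <0 on miss */
}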

It is easy to extend this caching VCOP to provide functionality equivalent to "Paravirtualized Paging" [4], but now in a more general sense: instead of a single kernel-controlled cache, such VCOPs can be instantiated by any application (including the kernel itself) running inside a VM.

Another application is filesystem indexing. For this, a process running in the guest can expose data from different offsets in a file to the input queue. The VCOP running on the host will index the contents of the buffers on the input queue and save the resulting tags in its local disk storage (which is separate from the VM storage). The number of host threads assigned to the indexing is decided by the host, as is the limit on the storage used for the tags. The host can dynamically increase or decrease the number of threads depending on the availability of cores. Also, the threads need not be co-scheduled, which gives the host more flexibility and thus better scalability. This is clearly better than, say, an 8-VCPU VM: since all cores are busy (non-idle), the scheduler will always try to co-schedule them, and under even slight CPU over-commit the indexing performance can degrade severely. The guest side of such an indexer is sketched below.
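
A guest-side feeder for the indexing VCOP might look as follows; the vcop_* calls are the same hypothetical libvcop names used earlier:

/* Feed file contents, page by page, to a hypothetical indexing VCOP.
 * Host threads index each work-item and keep the resulting tags in
 * host-local storage. Assume vcop_submit() copies the buffer. */
#include <fcntl.h>
#include <unistd.h>
#include "vcop.h"                        /* hypothetical libvcop header */

#define WORK_ITEM_SIZE 4096

int index_file(struct vcop_queue *inq, const char *path)
{
    char buf[WORK_ITEM_SIZE];
    ssize_t n;
    off_t off = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* Expose each file extent as one work-item on the input queue. */
    while ((n = pread(fd, buf, sizeof(buf), off)) > 0) {
        vcop_submit(inq, buf, (size_t)n);
        off += n;
    }
    close(fd);
    return 0;
}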

Note that the VCOP functionality is independent of the guest OS: it runs directly on the host and depends only on the interface provided through the libvcop library. Thus, for example, the same caching, filesystem-indexing, encryption or compression VCOP can be used with all kinds of guests.

2.2 IMPLEMENTATION

The queues needed by a VCOP require setting up a shared region between applications running inside a VM and the hypervisor. For KVM, the virtio infrastructure [5] is already integrated with the Linux kernel, which acts as the hypervisor in this case. An alternative is the Virtual-bus infrastructure [6], which claims to provide even better performance than virtio.

With virtio, a virtual PCI device can be created and exposed to a VM. Virtio also provides a transport abstraction called virtqueues, which internally maintains circular queues and the callback mechanisms needed for a typical producer-consumer interface.
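
For illustration, pushing one work-item into a virtqueue from a guest kernel driver might look roughly like this. Current Linux virtio function names are used, but the exact API has varied across kernel versions, and the PCI device and virtqueue setup done at probe time are omitted:

/* Expose one buffer to the host through a virtqueue and notify it. */
#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>

static int vcop_push_work_item(struct virtqueue *vq, void *buf,
                               unsigned int len)
{
    struct scatterlist sg;
    int err;

    sg_init_one(&sg, buf, len);

    /* Make the buffer visible on the host side of the ring... */
    err = virtqueue_add_outbuf(vq, &sg, 1, buf, GFP_KERNEL);
    if (err)
        return err;

    /* ...and kick the host (qemu-kvm) to tell it work is pending. */
    virtqueue_kick(vq);
    return 0;
}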

The libvcop library is used to probe support for a particular VCOP and to set up the various queues. Internally, it uses the virtqueues and the callback interface provided by virtio to establish application-host communication.

2.3 ISSUES

Applications have to be modified to use the interfaces provided by the libvcop library. Also, the use of this library is specific to the case where the application runs inside a VM, so two versions of the application are required: one for running natively and another for the virtualized case. Additionally, if using this library requires significant changes to an application, it might never be adopted.

To overcome this problem, a wrapper around libvcop can be created for popular threading libraries such as POSIX Threads (Pthreads). This allows developers to create just one version of an application, written against a platform-specific thread library which has been modified to use libvcop when available. Such applications will run normally on native hardware and use VCOPs when run in a virtualized environment. When running natively, an implementation of the corresponding VCOP functionality has to be provided, which will then run as threads within the same OS as the application (instead of on a host).
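
A minimal sketch of such a wrapper, assuming hypothetical vcop_available() and vcop_run() entry points in libvcop:

/* Thread creation routed to a VCOP when one is present; otherwise an
 * ordinary in-guest pthread is created. vcop_available() and
 * vcop_run() are hypothetical libvcop calls; vcop_run() is assumed
 * to schedule fn as a host thread and fill in a thread identifier. */
#include <pthread.h>
#include "vcop.h"                        /* hypothetical libvcop header */

int wrapped_thread_create(pthread_t *tid, void *(*fn)(void *), void *arg)
{
    /* Inside a VM with VCOP support: let host threads do the work. */
    if (vcop_available())
        return vcop_run(tid, fn, arg);

    /* Native hardware (or no VCOP): an ordinary guest-side thread. */
    return pthread_create(tid, NULL, fn, arg);
}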

3. CONCLUSION AND FUTURE WORK

An approach to dynamically changing resources like CPU and memory has been discussed. The example applications also show that this approach is more scalable than increasing the number of VCPUs in a VM.

Apart from the generic VCOP support (a guest driver and a hypervisor-side device), this approach does not require changes to the guest kernel or the host (hypervisor). However, applications running inside the guest need to be modified to use libvcop, the library interface for VCOPs. To minimize these changes, a wrapper for thread libraries like Pthreads has been proposed.

Currently, a 'hello world' VCOP is under development. This involves changes to qemu-kvm to export it as a PCI device, and a Linux (guest) kernel driver to detect this device. Queues for host-guest communication will be set up using virtqueues, which are part of the virtio infrastructure. Work on libvcop has not yet started; this library will allow probing for specific VCOPs and will export an API for application-to-host communication.

4. ACKNOWLEDGEMENTS
I am grateful to Diwakar Rao for helpful discussions.

5. REFERENCES
[1] IBM Informix Dynamic Server Administrator's Guide: Virtual Processors and Threads. http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.admin.doc/admin242.htm

[2] Scaling IBM DB2 in VMware Infrastructure 3 Environment. http://www.vmware.com/pdf/db2_scalability_wp_vi3.pdf

[3] Just Enough OS. http://en.wikipedia.org/wiki/Just_enough_operating_system

[4] Dan Magenheimer, Chris Mason, Dave McCracken, and Kurt Hackel. Paravirtualized Paging. USENIX WIOV '08, 2008. http://www.usenix.org/event/wiov08/tech/full_papers/magenheimer/magenheimer.pdf

[5] virtio: Towards a De-Facto Standard for Virtual I/O Devices. http://portal.acm.org/ft_gateway.cfm?id=1400108&type=pdf

[6] Virtual-bus. http://developer.novell.com/wiki/index.php/Virtual-bus

[7] OpenCL (Open Computing Language) Framework. http://www.khronos.org/opencl/
