
PaaS Under the Hood


Platform as a Service Under the Hood


Episodes 1-5
dotcloud.com


INTRODUCTION
Building a Platform as a Service (PaaS) is rewarding work: we get to make the life of developers easier.
A PaaS helps developers deploy, scale, and manage their applications, without requiring them to become
hardcore systems administrators themselves.
As with many problems, the toughest part is actually not building the PaaS itself. The challenge lies in
scaling the applications that run on it.
To give you a sense of the complexity: each minute, millions of HTTP requests are routed through the platform.
Not only does our PaaS collect millions of metrics, we also aggregate, process, and analyze those metrics, looking for
abnormal patterns. Apps are constantly deployed to and migrated across our PaaS platform.
For economies of scale, virtually all PaaS providers pack applications densely onto their physical machines. How does a PaaS
provider solve the following issues?





How is application isolation accomplished?
How does the platform handle data isolation?
How does the platform deal with resource contention?
How does the platform deploy and run apps efficiently?
How does the platform provide security and resiliency?
How does the platform handle the load from the millions of HTTP requests?

One key element is lightweight virtualization: the use of virtual environments (called containers) to provide isolation
characteristics comparable to full-blown virtual machines, but with much less overhead. In this area, the dotCloud platform relies
on Linux Containers (LXC).

In the following 5 episodes, we will dive into some of the internals of the dotCloud platform or more specifically, the
Linux kernel features used by dotCloud.


Episode 1: Kernel Namespaces

Simplifying complexity takes a lot of work. At dotCloud, we take highly complex
processes, such as deploying and scaling web applications in the cloud, and make them
appear as simple workflows to developers and DevOps teams.
How do we accomplish such a feat? In this eBook, we will show you how dotCloud works
under the hood. We will expose the mechanics behind the kernel-level virtualization and
high-throughput network routing. We will cover other technologies, such as metrics
collection and memory optimization, in later eBooks.
A developer once said, "Diving into the inner workings of a PaaS is like going to
Disneyland; you'll uncover a world of wonder."



Episode 1: Namespaces
Each time a new Linux Container (LXC) is created, a directory named after the container appears under /cgroup. For
example, a new container named sanfrancisco gets the directory /cgroup/sanfrancisco. This makes it easy to assume
that containers rely mainly on control groups. Although cgroups are useful to Linux Containers (we will cover cgroups
more thoroughly in Episode 2), namespaces provide an even more vital function to Linux Containers.
Namespaces isolate the resources of processes. This isolation is the real magic behind Linux Containers! There are five
namespaces, each covering a different resource: pid, net, ipc, mnt, and uts.
The pid namespace
The pid namespace is the most useful one for basic isolation. Each pid namespace has its own process
numbering. Different pid namespaces form a hierarchy, and the kernel keeps track of all the namespaces. A
parent namespace can see and act on its child namespaces, but a child namespace cannot perform
any actions on its parent.
The pid namespace follows a few principles:
Each pid namespace has its own PID 1 init-like process
Processes residing in a namespace cannot affect processes residing in a parent or sibling namespace with system
calls like kill or ptrace because process ids are only meaningful inside a given namespace
If a pseudo-filesystem like proc is mounted by a process within a pid namespace, it will only show the processes
belonging to the namespace
Numbering is different in each namespace, which means that a process in a child namespace can have multiple PIDs:
for example, one in its own namespace and a different one in its parent namespace. The top-level pid namespace can see
all processes running in all namespaces, albeit with different PIDs. A process can have more than two PIDs if there are more
than two levels of hierarchy in the namespaces.
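To make this concrete, here is a minimal C sketch (requires root, error handling trimmed) of the double-PID effect: the child, created in a new pid namespace, sees itself as PID 1, while the parent sees it under an ordinary PID.

    /* Minimal sketch: a child in a new pid namespace sees itself as PID 1,
     * while the parent sees it under a regular PID. Requires root and a
     * kernel with pid namespace support. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char child_stack[1024 * 1024];

    static int child(void *arg)
    {
        /* Inside the new namespace, this process is PID 1. */
        printf("in child namespace: getpid() = %d\n", getpid());
        return 0;
    }

    int main(void)
    {
        pid_t pid = clone(child, child_stack + sizeof(child_stack),
                          CLONE_NEWPID | SIGCHLD, NULL);
        if (pid == -1) { perror("clone"); exit(1); }

        /* In the parent namespace, the same process has an ordinary PID. */
        printf("in parent namespace: child pid = %d\n", pid);
        waitpid(pid, NULL, 0);
        return 0;
    }
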
The net namespace
With the pid namespace, you can start processes in multiple isolated environments called containers. But what if you
need to run a separate instance of the Apache webserver in each container? Generally, only one process can listen on port
80/TCP at a time, so you would have to configure each instance to listen on a different port, unless you use the net
namespace, which has been designed for networking.
Each different net namespace can have different network interfaces. Even lo, the loopback interface supporting
127.0.0.1, can be different in each different net namespace. It is even possible to create a pair of special interfaces, which
will appear in two different net namespaces and allow one of the two net namespaces to talk to the outside world.
A typical container will have its own loopback interface (lo), as well as a special interface on one end, generally named
eth0. The other end of the special interface will be in the original namespace, and will bear a poetic name like
veth42xyz0. It is then possible to put those special interfaces together within an Ethernet bridge (to achieve switching
between containers), or route packets between them, etc. This is similar to the Xen networking model.
Each net namespace has its own local meaning for INADDR_ANY, a.k.a. 0.0.0.0. When your Apache webserver process
binds to INADDR_ANY on port 80 (i.e., *:80) within its namespace, it will only receive connections
directed to the IP addresses and interfaces of its namespace. That allows you to run multiple Apache instances, each
in its own pid and net namespace, with their default configuration listening on port 80, and each remains
individually addressable.
Each net namespace has its own routing table, and its own iptables chains and rules.
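Here is a minimal C sketch of the isolation described above (requires root, error handling trimmed): after unshare(CLONE_NEWNET), the process finds itself in a brand-new network stack containing only the loopback interface.

    /* Minimal sketch: after unshare(CLONE_NEWNET), the process only sees a
     * brand-new network stack containing just the loopback interface. */
    #define _GNU_SOURCE
    #include <net/if.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void list_interfaces(const char *label)
    {
        struct if_nameindex *ifs = if_nameindex();
        printf("%s:", label);
        for (struct if_nameindex *i = ifs; i && i->if_name; i++)
            printf(" %s", i->if_name);
        printf("\n");
        if (ifs)
            if_freenameindex(ifs);
    }

    int main(void)
    {
        list_interfaces("before unshare");   /* lo, eth0, veth..., etc. */

        if (unshare(CLONE_NEWNET) == -1) { perror("unshare"); exit(1); }

        list_interfaces("after unshare");    /* only lo, and it is down */
        return 0;
    }
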
The ipc namespace
The ipc namespace won't appeal to many of you, unless you took UNIX 101 back when engineering schools still taught
classes on IPC (Inter-Process Communication).
IPC provides semaphores, message queues, and shared memory segments.
While still supported by virtually every UNIX flavor, those features are considered by many to be obsolete,
superseded by POSIX semaphores, POSIX message queues, and mmap. Nonetheless, some programs, such as
PostgreSQL, still use IPC.

PaaS Under the Hood

Episode 1: Kernel Namespaces

What's the connection with namespaces? Each IPC resource is accessed through a globally unique 32-bit ID. While IPC
implements permissions on the resource itself, an application could be surprised if it failed to access a given resource
because that resource has already been claimed by another process in a different container. The app doesn't know anything about
other containers!
Meet the ipc namespace. Processes within a given ipc namespace cannot access (or even see) the IPC resources living
in other ipc namespaces. And now you can safely run a PostgreSQL instance in each container without the fear of IPC
key collisions.
The mnt namespace
chroot is a mechanism to sandbox a process (and its children) within a given directory. The mnt namespace takes the
chroot concept even further.
As its name implies, the mnt namespace deals with mount points.
Processes living in different mnt namespaces can see different sets of mounted file systems and different root
directories. If a file system is mounted in an mnt namespace, it will be accessible only to those processes within that
namespace. It will not be visible for processes in other namespaces.
At first, the mnt namespace may sound redundant: if each container is already chrooted into a different
directory, container C1 won't be able to access or see container C2's file system anyway, right? However, plain chroot has
downsides. Inspecting /proc/mounts in a container will show the mount points of all containers. Also, those mount points will be
expressed relative to the original namespace, which can give away hints about the layout of your system, and seeing paths from
the global namespace may confuse applications that rely on the paths listed in /proc/mounts.
The mnt namespace makes the situation much cleaner, allowing each container to have its own mount points, and see
only those mount points, with their path correctly correlated to the actual root of the namespace.
The uts namespace
Finally, the uts namespace deals with one important detail: the hostname that can be seen by a group of processes.
Each uts namespace can have a different hostname, changed through the sethostname system call, and changing the
hostname will only affect processes running in the same uts namespace.
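A minimal sketch of this behavior (requires root, error handling trimmed): the hostname set after unshare(CLONE_NEWUTS) is only visible inside the new uts namespace.

    /* Minimal sketch: changing the hostname after unshare(CLONE_NEWUTS)
     * only affects processes in the new uts namespace, not the host. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char name[64];

        if (unshare(CLONE_NEWUTS) == -1) { perror("unshare"); return 1; }

        sethostname("sanfrancisco", strlen("sanfrancisco"));
        gethostname(name, sizeof(name));
        printf("hostname inside the namespace: %s\n", name);
        /* Meanwhile, `hostname` run in another shell still shows the old name. */
        return 0;
    }
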
Creating namespaces
Namespace creation is achieved with the clone system call. This system call supports a number of flags, allowing you to
specify whether the new process should run within its own pid, net, ipc, mnt, and uts namespaces.
Creating a new container therefore boils down to the following steps: a new process is started with new
namespaces; its network interfaces, including the special pair of interfaces used to talk with the outside world, are
configured; then it executes an init-like process.
When the last process within a namespace exits, the associated resources (IPC, network interfaces...) are automatically
reclaimed. If, for some reason, you want those resources to survive after the termination of the last process of the
namespace, you can use mount --bind to retain the namespace for future use, because each namespace is stored in a
special file in /proc/$PID/ns.
Not all namespaces can be retained this way: as of kernel 3.4, there is support for the ipc, net, and uts namespaces,
but not for the mnt and pid namespaces. This presents a problem that we will address in the next section.
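As an illustration, here is a hedged C sketch of pinning a net namespace with a bind mount so it survives its last process; the paths are illustrative, and this is essentially what the ip netns add command does under the hood.

    /* Minimal sketch: pinning the current net namespace so it survives
     * after its last process exits, by bind-mounting its /proc/self/ns/net
     * entry onto a regular file. Requires root; the pin path is
     * illustrative and its parent directory must already exist. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <sys/mount.h>
    #include <unistd.h>

    int main(void)
    {
        const char *pin = "/var/run/netns/sanfrancisco";

        unshare(CLONE_NEWNET);             /* create a fresh net namespace */

        /* Create an empty file to serve as the mount point, then bind the
         * namespace's pseudo-file onto it to keep the namespace alive. */
        close(open(pin, O_RDONLY | O_CREAT, 0644));
        if (mount("/proc/self/ns/net", pin, NULL, MS_BIND, NULL) == -1)
            perror("mount --bind");
        return 0;
    }
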


Attaching to Existing Namespaces


It is also possible to get into, or enter, a namespace by attaching a process to an existing namespace.
Here are some use cases for attaching to existing namespaces:
Setting up network interfaces from the outside, without relying on scripts inside the container
Running arbitrary commands to retrieve information about the container (for instance, by executing netstat)
Obtaining a shell within a container
Attaching a process to existing namespaces requires two things:
The setns system call (which exists only since kernel 3.0, or with patches for older kernels)
The namespace must appear in /proc/$PID/ns
We mentioned in previous paragraphs that only the ipc, net, and uts namespaces appear in /proc/$PID/ns, and that
the mnt and pid namespaces do not. Only a patched kernel will allow you to attach to existing mnt and pid
namespaces.
Combining the necessary patches can be fairly tricky, because it involves resolving conflicts between AUFS and
GRSEC.
AUFS and GRSEC will be covered in Episodes 3 & 4 respectively.
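For the namespaces that are supported without patches, attaching is straightforward. Here is a minimal C sketch using setns() on a net namespace (available on unpatched kernels since 3.0, with a recent glibc); the PID and command are whatever you pass on the command line, and error handling is kept to a minimum.

    /* Minimal sketch: entering the net namespace of an existing container
     * by attaching to its /proc/$PID/ns/net entry with setns(), then
     * running a command inside it. Requires root. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc < 3) {
            fprintf(stderr, "usage: %s <pid> <command> [args...]\n", argv[0]);
            return 1;
        }

        char path[64];
        snprintf(path, sizeof(path), "/proc/%s/ns/net", argv[1]);

        int fd = open(path, O_RDONLY);
        if (fd == -1) { perror("open"); return 1; }

        /* Attach this process to the target net namespace... */
        if (setns(fd, CLONE_NEWNET) == -1) { perror("setns"); return 1; }
        close(fd);

        /* ...and run e.g. `netstat -ntlp` as seen from inside the container. */
        execvp(argv[2], argv + 2);
        perror("execvp");
        return 1;
    }
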
To avoid running an overly patched kernel, there are three suggested workarounds.
You can run sshd in your containers, and pre-authorize a special SSH key to execute your commands. This is one of
the easiest solutions to implement. But if sshd crashes, or is stopped (either intentionally or by accident), you may
be locked out of the container. Also, if you want to squeeze the memory footprint of your containers as much as
possible, you might want to get rid of sshd. If the latter is your main concern, you can run a low profile SSH server
like dropbear. Or, you can start the SSH service from inetd or a similar service.
If you want something simpler than SSH (or something different from SSH, to avoid interfering with custom sshd
configurations), you can open a backdoor. An example would be to run socat TCP-LISTEN:222,fork,reuseaddr
EXEC:/bin/bash,stderr from init in your containers, making sure that port 222/tcp is adequately firewalled.
An even better solution is to embed this control channel within your init process. Before changing its root
directory, the init process can set up a UNIX socket on a path located outside the container root directory. When it
changes its root directory, it retains its open file descriptors, and therefore the control socket.
How dotCloud uses namespaces
In previous releases, the dotCloud platform used vanilla LXCs (Linux Containers), which made implicit use of
namespaces.
From the beginning, we deployed kernel patches that allowed us to attach arbitrary processes into existing
namespaces. We found this approach to be the most convenient and reliable way to deploy, control, and orchestrate
containers. As the dotCloud platform evolved, we still made use of namespaces to isolate applications from each other
even though we have stripped down the vanilla LXC containers.


Episode 2: cgroups
Control groups, or cgroups, are a set of mechanisms to measure
and limit resource usage for groups of processes.
Conceptually, it works somewhat like the ulimit shell command or
the setrlimit system call. ulimit and setrlimit set resource limits
for a single process. cgroups allow you to set resource limits for
groups of processes.



Pseudo-FS Interface
The easiest way to manipulate control groups is through the cgroup file system.
Assuming that it has been mounted on /cgroup, creating a new group named polkadot is as easy as mkdir /cgroup/
polkadot. When you create this (pseudo) directory, it instantly gets populated with many (pseudo) files to manipulate
the control group. You can then move one (or many) processes into the control group by writing their PID to the right
control file, for example, echo 4242 > /cgroup/polkadot/tasks.
When a process is created, it starts out in the same group as its parent. So if the init process of a container has been placed in
a control group, all the processes of the container will also be in that same control group.
Destroying a control group is as easy as rmdir /cgroup/polkadot. However, the processes within the cgroup have to be
moved to other groups first; otherwise, rmdir will fail, much like trying to remove a non-empty directory.
Technically, control groups are split into many subsystems. Each subsystem is responsible for a set of files in /cgroup/
polkadot, and the file names are prefixed with the subsystem name.
For instance, the files cpuacct.stat, cpuacct.usage, cpuacct.usage_percpu are the interface for the cpuacct subsystem.
The available subsystems will be detailed in the next paragraph.
The subsystems can be used together or independently. In other words, you can decide that each control group will
have limits and counters for all the subsystems; alternatively, each subsystem can use different control groups.
In the latter case, a process could be in the polkadot control group for memory control and in the bluesuedeshoe
control group for CPU control, with polkadot and bluesuedeshoe living in completely separate hierarchies.
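As a sketch, the same sequence of operations can be done programmatically; the polkadot name and the /cgroup mount point are taken from the example above, and error handling is kept to a minimum.

    /* Minimal sketch: creating a "polkadot" control group through the
     * cgroup pseudo-filesystem and moving the current process into it.
     * Assumes the hierarchy is mounted on /cgroup; requires root. */
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Equivalent of `mkdir /cgroup/polkadot`. */
        mkdir("/cgroup/polkadot", 0755);

        /* Equivalent of `echo $$ > /cgroup/polkadot/tasks`. */
        FILE *f = fopen("/cgroup/polkadot/tasks", "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "%d\n", getpid());
        fclose(f);

        /* From now on, this process (and its future children) are accounted
         * to, and constrained by, the polkadot control group. */
        return 0;
    }
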
What can be Controlled?
Many things! We'll highlight the ones we think are the most useful.
Memory
You can limit the amount of RAM and swap space that can be used by a group of processes. It accounts for the
memory used by the processes for their private use (their Resident Set Size, or RSS), but also for the memory
used for caching purposes.
This is actually quite powerful, because traditional tools such as ps or analysis of /proc do not have a way to identify the
cache memory usage incurred by specific processes. This can make a big difference, for instance, with databases.
A database typically consumes very little memory for its own processing but uses a large chunk of memory as cache
(complex queries would consume a lot of memory, but let's leave those aside for this example).
To perform optimally, your whole database (or at least your active set, the data you refer to most often) should
fit into memory.
Setting a memory limit for a cgroup is as easy as echo 1000000000 >
/cgroup/polkadot/memory.limit_in_bytes (the value will be rounded to a multiple of the page size).
To check the current usage of a cgroup, inspect the pseudo-file memory.usage_in_bytes in the cgroup directory. You
can gather very detailed (and very useful) information from memory.stat.
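Here is a small sketch doing the same thing programmatically, reusing the polkadot group from the previous section; the paths assume the hierarchy is mounted on /cgroup, as in the text.

    /* Minimal sketch: setting the memory limit of the polkadot cgroup and
     * reading back its current usage through the same pseudo-files. */
    #include <stdio.h>

    int main(void)
    {
        /* Equivalent of `echo 1000000000 > /cgroup/polkadot/memory.limit_in_bytes`. */
        FILE *f = fopen("/cgroup/polkadot/memory.limit_in_bytes", "w");
        if (!f) { perror("fopen"); return 1; }
        fprintf(f, "%lld\n", 1000000000LL);
        fclose(f);

        /* Read back the current usage (RSS + cache) of the group. */
        long long usage = 0;
        f = fopen("/cgroup/polkadot/memory.usage_in_bytes", "r");
        if (!f) { perror("fopen"); return 1; }
        fscanf(f, "%lld", &usage);
        fclose(f);
        printf("polkadot memory usage: %lld bytes\n", usage);
        return 0;
    }
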


CPU
You might already be familiar with scheduler priorities, and with the nice and renice commands. Once again, control
groups will let you define the amount of CPU that should be shared by a group of processes, instead of by a single
process. You can give each cgroup a relative number of CPU shares, and the kernel will make sure that each group of
processes gets access to the CPU in proportion of the number of shares you gave it.
Setting the number of shares is as simple as echo 250 > /cgroup/polkadot/cpu.shares.
Remember that those shares are just relative numbers: if you multiply everyone's share by 10, the end result will be
exactly the same. This control group also exposes statistics in cpu.stat.
CPU Sets
This is different from the cpu controller. In systems with multiple CPUs (i.e., the vast majority of servers, desktop &
laptop computers, and even phones today!), the cpuset control group lets you define which processes can use which
CPU.
This can be useful to reserve a full CPU to a given process or group of processes. Those processes will receive a fixed
amount of CPU cycles, and they might also run faster because there will be less thrashing at the level of the CPU cache.
On systems with Non Uniform Memory Access (NUMA), the memory is split into multiple memory banks, and each bank
is tied to a specific CPU (or set of CPUs in a multi-core system). There is a penalty to pay when a process running on one
CPU accesses RAM tied to another CPU; binding a process (or group of processes) and its memory to a specific CPU
(or group of CPUs) with a cpuset avoids that penalty, and also reduces scheduling overhead.
Block I/O
The blkio controller provides a lot of information about the disk accesses (technically, block device requests)
performed by a group of processes. This is useful, because I/O resources are much harder to share than
CPU or RAM.
A system has a given, known, and fixed amount of RAM, and a fixed number of CPU cycles every second. This holds
even on systems where the number of CPU cycles per second can vary (such as tickless systems or virtual machines):
the kernel simply slices CPU time into shares of, e.g., 1 millisecond, and there is obviously a given, known, and fixed
number of milliseconds in every second. I/O bandwidth, however, can be unpredictable, or the predictions aren't very useful.
A hard disk with a 10ms average seek time will be able to process about 100 requests of 4 kB per second; but if the
requests are sequential, typical desktop hard drives can easily sustain 80 MB/s transfer rates which means 20000
requests of 4 kB per second.
The average throughput (measured in IOPS, I/O Operations Per Second) will be somewhere between those two
extremes. But as soon as the application performs a task that requires a lot of scattered, random I/O operations, the
performance will drop dramatically. The system can give you some guaranteed performance, but this guaranteed
performance is so low that it is not helpful. That is exactly the problem with AWS EBS, by the way. It's like a highway
that guarantees you will be able to go above a given speed, except that this speed is 5 mph. Not very helpful in
practice, is it?


That's why SSD storage is becoming increasingly popular. SSD has virtually no seek time, and can therefore sustain
random I/O as fast as sequential I/O. The available throughput is therefore predictably good, under any given load.
Actually, there are some workloads that can cause problems. For instance, writing and rewriting a whole disk will
cause performance to drop dramatically. This is because read and write operations are fast, but erase, which must be
performed at some point before write, is slow.
An example of this use case would be to use SSD to manage video on demand for hundreds of HD channels
simultaneously. The disk will sustain the write throughput until it has written every block once. When it needs to erase,
the performance will drop to below acceptable levels.
Going back to dotCloud, what's the purpose of the blkio controller in a PaaS environment?
The blkio controller metrics help detect applications that put an excessive strain on the I/O subsystem. The
controller also lets you set limits, which can be expressed in number of operations and/or bytes per second, with
different limits for read and write operations. This allows you to set thresholds such that no single app can significantly
degrade performance for other apps. Furthermore, once an I/O-intensive app has been identified, its quota can be
adapted to reduce its impact on other apps.
It's Not Only for Containers
As we mentioned, cgroups are convenient for containers, since it is very easy to map each container to a cgroup. But
there are many other uses for cgroups.
The systemd service manager is able to put each service in a different cgroup. This allows you to keep track of all the
subprocesses started by a given service, even when they use the double-fork technique to detach from their parent
and re-attach to init. It also allows fine-grained tracking and control of the resource used by each service.
It is also possible to run a system-wide daemon to automatically classify processes into cgroups. This can be particularly
useful on multi-user systems, to limit and/or meter the resources of each user appropriately, or to run some specific
programs in a special cgroup when you know that those programs are prone to excessive resource use.
dotCloud & Control Groups
Thanks to cgroups, we can meter very accurately the resource usage of each container, and therefore of each unit
of each service for each application. Our metrics collection system uses collectd, along with our in-house lxc plugin.
Metrics are streamed to a custom storage cluster, and can be queried and streamed by the rest of the platform using
our ZeroRPC protocol. We will be writing a more in-depth article on our metrics collection system in the future.
We also use cgroups to allocate resource quotas for each container. For instance, when you use vertical scaling on
dotCloud, you are actually setting limits for memory, swap usage, and CPU shares.


Episode 3: AUFS
AUFS (which initially stood for Another Union File System)
provides fast provisioning while retaining full flexibility and
ensuring disk and memory savings




AUFS is a union file system: it merges multiple directory hierarchies into a single one. On the dotCloud platform, we use AUFS
to stack a writeable layer on top of a large, read-only file system containing a ready-to-run system image. The resulting
file system looks like the large read-only one, except that you can now write anywhere on it, and only the changed
files get stored. LiveCDs and bootable USB keys are common examples of this technique. AUFS allows us to have a common base
image for all applications and a separate read-write layer, unique to each app.
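To illustrate the idea, here is a hedged sketch of an AUFS mount performed from C; the paths are purely illustrative, the kernel must carry the AUFS patch, and the exact branch option syntax can vary between AUFS releases.

    /* Minimal sketch: stacking a writable layer on top of a read-only base
     * image with AUFS, roughly equivalent to
     * `mount -t aufs -o br=/rw:/ro none /union`. Requires root. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* First branch is writable (all changes land there); the second
         * branch is the shared, read-only base image. */
        const char *opts = "br=/var/lib/layers/app42=rw:/var/lib/base-image=ro";

        if (mount("none", "/var/lib/containers/app42", "aufs", 0, opts) == -1) {
            perror("mount aufs");
            return 1;
        }
        printf("union mounted on /var/lib/containers/app42\n");
        return 0;
    }
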
Storage Savings
Let's assume that the base image takes up 1 GB of disk space. In reality, it is actually more than that, since we're talking
about a full server file system, containing everything a dotCloud app could potentially need: Python, Ruby,
Perl, Java, a C compiler and libraries, and so on. If the entire image had to be cloned each time a dotCloud application
is deployed, each new deployment would use 1 GB of disk space. With AUFS, a new deployment typically uses less than
1 MB of disk space, because only the changed files are stored in the read-write layer. This lets us save considerably on storage costs.
Faster Deployments
Copying the whole base image would not only use up precious disk space, but it would also take time, up to a minute
or so depending on the disk speed. Also, the copy would put a significant I/O load on the disk. On the other hand,
creating a new pseudo-image using AUFS takes a fraction of a second, and virtually no I/O at all. AUFS offers a much
better solution when compared to copying an entire image every time.
Better Memory Usage
Virtually all operating systems use a feature called the buffer cache to make disk access faster. Without it, your system
could run 10x, 100x, or even 1000x slower, because it would have to access the disk even to run simple commands, for
example when listing your files with ls! As we will see, AUFS also lets us rack up big savings on this buffer cache.
Every single application will load from disk a number of common files and components, such as the libc standard
library, the /bin/sh standard shell, and a lot of common infrastructure like crond, sshd, and the local Mail Transfer Agent,
just to name a few. Additionally, all applications of the same type will load the same files; for example, every Python
application will load the Python interpreter.
If each app were running from its own copy, identical copies of those common files would be present multiple times in
memory, within the buffer cache. Using AUFS, those common files are in the base image, and the Linux kernel therefore
knows how to load them only once in memory. This will typically save tens of MB for each app.
Easier Upgrades
If you are familiar with storage technology, you might argue that snapshots and copy-on-write devices already provide
the advantages mentioned above.
That's true. However, with those systems, it is not possible to update the base image and have the changes reflected
in the lightweight clones (the snapshots). AUFS, on the other hand, lets you do whatever you want with the
base image: the changes will be immediately visible in the AUFS mount points using that base image. This makes it
easy to do software upgrades, even while the applications are running, just like on a typical single-server environment,
except that you can upgrade thousands of servers all at once.
Allows Arbitrary Changes
All those things can also be done without AUFS. For a decade, skilled UNIX systems administrators have been
deploying machines (workstations, X terminals, servers...) with a read-only root file system, allowing read-write access
through ad hoc mount points. After all, with some clever configuration and tuning, you don't need to write anywhere
except in places like /tmp, /var/run, /var/lock, and of course /home. The latter can be a traditional read-write file
system, and the others can even use tmpfs mounts.


Use Cases for AUFS


Because it allows arbitrary changes to the file system, AUFS offers many advantages. Let's suppose you need an
extra package, or maybe you want to upgrade the version of Python or Ruby. On a system without AUFS, one with
only a shared read-only root file system and distinct writable mount points, you have two alternatives.
Either you upgrade the read-only base image (and potentially affect all other users of the image)
Alternatively, install whatever you need into a specific writable mount point like /home, /tmp, or equivalent; that
means a manual install, and potentially introduces side effects or conflicts with previously installed versions
With AUFS, since the otherwise read-only root file system becomes writeable through your own private layer, you can
just apt-get install whatever you need. The read-only base file system won't be affected, because all the changes are
written onto your private layer.
Other Union File Systems
In addition to AUFS, we considered many file systems with similar properties to those outlined above. We opted for
AUFS because, for what we need to do at dotCloud, we believed it was the most mature and stable solution at the
time of our evaluation.
Caveats
However, technology is constantly evolving, and no solution is ever a perfect match for our changing requirements. We
are currently using AUFS 3. When we were using AUFS 2, it had significant issues, notably with mmap; however, the
other union file systems performed even worse on that specific issue.
We worked around those issues by mounting read-write volumes at strategic places, such as the data
directories of MySQL, PostgreSQL, MongoDB, and Redis, and the home directory where the application code is executed.
Mounting read-write volumes into the data directories gave us the required stability, while letting us leverage the
flexibility provided by AUFS without the downsides.
AUFS at dotCloud
Technically, the main feature that benefits from AUFS is our custom package installation system.
If you need a particular library that is not included in our base image, but that does exist in the Ubuntu package
repository, installing it in your service can be a breeze: use the systempackages option in your dotcloud.yml file.
AUFS allows the package to be installed into your service without ever touching the base image used by other
applications.


Episode 4: GRSEC
GRSEC is a security patch for the Linux kernel. Security features in
GRSEC help detect and deter malicious code.




GRSEC is a fairly large patch for the Linux kernel. It provides strong security features that prevent many kinds of attacks
(or exploits), and detects suspicious activity, such as people probing for new exploits and/or known system vulnerabilities.
There are many features in GRSEC, so our goal here is to provide an overview of the features most relevant to dotCloud.
Randomize Address Space
Many exploits rely on the fact that the base address for the heap or the stack is always the same.
Consider the following example, which is a classic scenario for an attack on a remote service:
A bug is found in the service. Some index is not checked properly, and can be used to alter the stack, and cause a
jump to an arbitrary address (when a function returns)
The stack is altered to introduce some malicious code
A pointer to this malicious code is placed on the stack as well
The bug is triggered. The service jumps to the malicious code and executes it
If the base address of the stack is randomized, it becomes much more difficult for an attacker to exploit the system:
the attacker would first have to locate the malicious code in memory before being able to jump to it.
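A tiny illustration of what randomization means in practice: with ASLR enabled (and even more aggressively under GRSEC/PaX), the addresses printed by the program below change from one run to the next, so an attacker cannot rely on them. Code addresses also change, provided the binary is compiled as a position-independent executable.

    /* Minimal sketch: observing address-space randomization. With ASLR
     * enabled, the stack and heap addresses below differ between runs. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int on_the_stack;
        void *on_the_heap = malloc(16);

        printf("stack: %p  heap: %p  code: %p\n",
               (void *)&on_the_stack, on_the_heap, (void *)&main);

        free(on_the_heap);
        return 0;
    }
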
Prevent Execution of Arbitrary Code
There are two steps to make sure that arbitrary code can't make it inside a running program.
First, program code must be loaded in an area that is marked by the memory management unit as being read-only.
This prevents code from modifying itself. Self-modifying code is sometimes referred to as polymorphic code. There are
legitimate use cases for polymorphic code. However, it is more often associated with dubious intentions.
Second, the heap and the stack must be marked as non-executable. After all, they're supposed to contain data
structures, function parameters, and return addresses, but no opcodes should be in there. On architectures supporting
it, the heap and the stack regions should be marked as non-executable at the hardware level, effectively preventing
accidental or intentional execution of code located there.
At this point, there is no memory that is both executable and writable.
We mentioned that there were some legitimate uses for memory regions with both write and exec permissions. When
does that happen, and what can be done about it?
The most common case is on-the-fly code generation for optimization purposes. This applies, for instance, to Java
and its JIT (Just-In-Time) compiler.
The good news is that GRSEC lets you flag specific executables, allowing them to write to their code region or
execute their data region.
This reduces the security for those specific processes, but there are benefits: to exploit such a process, there has to be a bug in,
e.g., the JVM itself, not just in your program, and bugs in the JVM are likely to be found and fixed much faster than bugs in your
own program. This is not a comment about the quality of anyone's code; it's more about the size of the
Java community and its scrutiny of the JVM.
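The sketch below shows the kind of memory request a JIT makes, and what a GRSEC/PaX kernel restricts; the exact outcome depends on the kernel configuration (MPROTECT enabled or not) and on whether the binary has been flagged as exempt.

    /* Minimal sketch: what a JIT-style program does, and what GRSEC/PaX
     * restricts. On a stock kernel, mapping memory that is both writable
     * and executable succeeds; on a GRSEC/PaX kernel with MPROTECT
     * enabled, the request is refused unless the executable is flagged
     * as exempt (trading away some protection, as discussed above). */
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
            perror("mmap RWX refused");   /* expected under PaX MPROTECT */
        else
            printf("got a writable+executable page at %p\n", p);
        return 0;
    }
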


Audit Suspicious Activity


Another interesting security feature of GRSEC is the ability to log specific events. For instance, it is possible to
record in the kernel log each time a process is terminated by SIGSEGV (a.k.a. Segmentation Fault).
What's the point? Potential attackers will likely run a number of known exploits in an attempt to gain escalated
privileges. Many of these exploits will hopefully fail; often, the failure results in a
segmentation violation, and the process is killed by SIGSEGV.
Any C programmer will tell you that there are legitimate cases where programs are terminated by SIGSEGV. But if the
system detects many different programs started by the same user that are all being killed in the same way, it is
a telltale sign that someone is trying to break into the system.
If you're not familiar with those concepts, here is an analogy: imagine observing scratches around
a padlock. A few scratches on the surface don't mean anything, but if you see the padlock covered in dents, you can bet
that someone is trying to pick it!
There are many other similar events logged by GRSEC. The kernel logs can then be analyzed in real time, and
suspicious patterns can be detected. This allows you to lock out malicious users, or, alternatively, monitor them closely
to see what they're doing. GRSEC can also be useful for forensics, in case someone does successfully breach the system:
GRSEC logs will record how they exploited it, and knowing how someone got in is valuable to whoever
has to close the security gap.
Compile-time Security Features
GRSEC also plays its part during the kernel compilation. It enables a compiler plugin, which will constify some kernel
structures. It will automatically add the const keyword to all structures containing only function pointers (unless they
have a special non const marker to evade the process).
In other words, instead of being mutable by default unless marked const, function tables are now const by default,
unless specified otherwise. Accordingly, attempts to modify function tables will be detected at compile time. The
rationale is to make sure that any code that manipulates a function table will be closely audited before the function
table is marked non-const.
Why the emphasis on function tables? Because if they can be overwritten, they are a convenient way for a potential
attacker to jump to arbitrary code (recall the technique explained at the beginning of this episode)!
Marking those data structures as const helps at compile time, but also later when the kernel is running, because those
data structures will be laid out in a memory region which will be made read-only by the memory management unit.
This not only reduces exposure to attacks, but can also make it harder for successful attackers to cover up their tracks
by hijacking existing function tables.
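To illustrate the idea (this is a plain C analogy, not the actual GRSEC compiler plugin): a const table of function pointers lands in a read-only section, so it cannot be hijacked at run time, and code that tries to modify it will not even compile.

    /* Minimal sketch of "constification": a const table of function
     * pointers is placed in a read-only region and is immutable after
     * load, so it cannot be overwritten to redirect control flow. */
    #include <stdio.h>

    struct ops {
        int (*read)(void);
        int (*write)(int);
    };

    static int my_read(void)   { return 42; }
    static int my_write(int v) { printf("write %d\n", v); return 0; }

    /* const: laid out in a read-only section by the linker. */
    static const struct ops safe_ops = { .read = my_read, .write = my_write };

    int main(void)
    {
        /* safe_ops.read = evil_read;   <- would be rejected at compile time */
        printf("read() -> %d\n", safe_ops.read());
        return safe_ops.write(7);
    }
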
...And Many More
As stated in the introduction, this is just a quick overview. If you want to learn about other features, you can check
GRSEC's website.
If you want to quench your thirst for technical details, you can follow these four steps to get a full listing of all the
GRSEC features, with a description of each:
Get the kernel sources
Apply the GRSEC patch set
Run make menuconfig
Navigate to the compilation options related to GRSEC
Almost every feature of GRSEC can be enabled or disabled at compilation time, and will therefore be listed there. The Help
provided with each compilation option is fairly informative.


In addition to GRSEC, dotCloud has built in additional layers of security. Each service runs in its own container. The
benefits of container isolation were explained in the episodes on namespaces and cgroups.
We do not allow dotCloud users to have root access. No root access means that users cannot SSH as root, cannot
login as root, and cannot get a root shell through sudo. All processes run under a regular, non-privileged UID.
Furthermore, SUID binaries are restricted to a set of well-known, well-audited programs, like ping.
Each of those security layers is strong on its own. We believe that combining them provides a more than adequate
level of security for massively scaled, multi-tenant platforms.


Episode 5: Distributed Routing

The dotCloud platform is powered by hundreds of servers, some of them running more than one thousand containers.
The majority of these containers are HTTP servers and they handle millions of HTTP requests every day to power the
applications hosted on our platform.



All the HTTP traffic is bound to a group of special machines called the gateways. The gateways parse
HTTP requests and route them to the appropriate backends. When there are multiple backends for a
single service, the gateways also deal with load balancing and failover. Last but not least, the gateways
also forward HTTP logs to be processed by the metrics cluster.


[Figure: the HTTP routing layer. Visitors (technically, HTTP clients) send requests to the HTTP routing layer (a set of load balancers), which forwards them to the dotCloud app cluster inside the dotCloud platform.]

This HTTP routing layer, as we call it, runs on an elastic number of dedicated machines. When the load is low, 3
machines are enough to deal with the traffic. When spikes or DoS attacks happen, we scale up to 6, 10, or even more
machines, to ensure optimal availability and latency.
All HTTP requests are directed to the HTTP routing layer, which is a cluster of identical HTTP load balancers. Each time
we create, update (e.g. scale), or delete an application on dotCloud, the configuration of those load balancers has to be
updated.
The master copy of all the configuration is stored within a Riak cluster, working in tandem with a Redis cache. The
configuration is modified using basic commands:
Create an HTTP entry
Add/remove a frontend (virtual host)
Add/remove a backend (container)
The commands are passed through a ZeroRPC API. Each update done through the API propagates through the
platform; in the next sections, we will see which mechanisms are used.
Version 1: Nginx + ZeroRPC
As you probably know, a start-up must be lean, agile, and many other things. It also needs to be pragmatic: the right
solution is not always the ideal one, but the one that allows you to ship on time. That's why the first iteration of our routing
layer had some shortcomings, as we will see. But it worked well enough to support tens of thousands of apps.
Nginx powered the first version of dotCloud's routing layer. Each modification to an app caused the central vhosts
service to push the full configuration to all the load balancers, using ZeroRPC.
Obviously, as the number of apps grew, the size of the configuration grew as well. Sending differential updates would
have been better. But at least, when a load balancer lost a few configuration messages, there was no special case to
handle. The next update would contain the full configuration, and provide all the necessary information.
The configuration was transmitted using a compressed, efficient format. Each load balancer would then transform
this abstract configuration into an Nginx configuration file, and tell Nginx to reload it. Nginx is well
designed: while loading a new configuration, it keeps serving requests with the old one, which means that
no HTTP request is lost during a configuration update.


Nginx also handles load balancing and fail-over well. When a backend server dies, Nginx detects it, removes it from the
pool, periodically retries it, and re-adds it to the pool once it has recovered.
This setup had two issues:
Nginx does not support the WebSocket protocol, which was one of the top features requested by our users at that
time
Nginx has no support for dynamic reconfiguration, which means that each configuration update requires the whole
configuration file to be regenerated and reloaded
At some point, the load balancers started to spend a significant amount of CPU time reloading Nginx configurations.
There was no significant impact on running applications, but it required deploying more and more powerful instances as
the number of apps increased.
Although Nginx was still fast and efficient, we had to find a more dynamic alternative.
Version 2: Node.js + Redis + WebSocket = Hipache
We spent some time evaluating different languages and technologies to solve this issue. We
needed the following features:
Ability to add, update, and remove virtual hosts dynamically, with a very low cost
Support for the WebSocket protocol
Great flexibility and control over the routed requests: we want to be able to trigger actions, log events, etc., during
different steps of the routing
After looking around, we finally decided to implement our own proxy solution. We did not implement everything from
scratch: we based our proxy on the node-http-proxy library developed by NodeJitsu, which included everything needed to
route a request efficiently with the appropriate level of control. The new routing layer would therefore be written in JavaScript,
using Node.js and leveraging NodeJitsu's library; on top of it, we added several features:




Use of multi-core machines, by spreading the load across multiple workers
Ability to store HTTP routes in Redis, allowing live configuration updates
Passive health-checking (when a backend is detected as being down, it is removed from the rotation)
Efficient logging of requests
Memory footprint monitoring: if a leak causes the memory usage of a worker to go beyond a given threshold, the
worker is gracefully recycled
Independence from other dotCloud technologies (like ZeroRPC), to make the proxy fully re-usable by third parties
(the code being, obviously, open source)
After several months of engineering and intensive testing, we released the source code of Hipache: our new distributed
proxy solution!
Behind the scenes, integrating Hipache into the dotCloud platform was very straightforward, thanks to our service-oriented architecture.
We simply wrote a new adapter which consumed virtual host configurations from the existing ZeroRPC service, and
used them to update Hipache's configuration in Redis. No refactoring or modification of the platform was necessary.
Here's a side comment about dynamic configuration and latency. Storing the configuration in an external system (like
Redis) means that you have to make one of the following trade-offs:
You can look up the configuration at each request, but it requires a round-trip to Redis at each request, which adds
latency
You can cache the configuration locally, but you will have to wait a bit for your changes to take effect, or implement
a complex cache-busting mechanism
We implemented a cache mechanism to avoid hitting Redis at each request. But it turned out not to be necessary: we
realized that requests to a local Redis are very, very fast. The difference between direct lookups and cached
lookups was less than 0.1 ms, which was in fact below the error margin of our measurements.


Version 3: Active Health Checks


Hipache has a simple embedded health-check system. When a request fails because of a backend issue (TCP
errors, HTTP 5xx responses, etc.), the backend is flagged as dead, and remains in this state for 30 seconds.
During those 30 seconds, no request is sent to the backend; after that, it goes back to its normal state, but if it is still
faulty, it will immediately be re-flagged as dead. This mechanism is simple and it works, but it has three caveats:
If a backend is frozen, we will still send requests to it, until it gets marked as dead
When a backend is repaired, it can take up to 30 seconds to mark it live again
A backend which is permanently dead will still receive a few requests every 30 seconds
To address those three problems, we implemented active health checks. The health checker permanently monitors the
state of the backends by doing the HTTP equivalent of a simple ping. As soon as a backend stops replying correctly
to those pings, it is marked as dead; as soon as it starts replying again, it is marked as live. The HTTP pings can be
sent every few seconds, so changes in a backend's state are detected much faster.
To implement the active health checker, we considered multiple solutions: Node.js, Python+gevent, Twisted... and finally
decided to write it in Go. Go was chosen for several reasons:
The health checker is massively concurrent (hundreds, and even thousands of HTTP connections can be in flight
at a given time)
Go programs can be compiled and deployed as a single, stand-alone, binary
We have other tools doing massively concurrent queries, and this was an excellent occasion to do some
comparative benchmarks (we will be publishing the benchmarks in future eBooks)
The active health checker is completely optional. You don't need it to run Hipache, and you can plug it on top of an
existing Hipache installation without modifying Hipache's configuration: it will detect and update Hipache's configuration
directly through the Redis instance used by Hipache itself. In other words, it coexists perfectly fine with Hipache's embedded
passive health-checking system, and running it simply improves dead backend detection. And of course, hchecker is
open source, just like Hipache.
What's next?
Since this HTTP routing layer is a major part of the dotCloud infrastructure, we are constantly trying to find ways to make
it better.
Recently, we did some research and tests to see if there was a way to implement dynamic routing with Nginx. In
fact, we aimed even higher: we wanted to route requests with Nginx, using configuration rules stored in
Redis, in the format currently used by Hipache. This would allow us to re-use many components, such as the Redis
feeder and the active health checker, that rely on the same configuration format.
Guess what: we found something! Less than a year ago, when we started to think about the design of Hipache and begin
its implementation, we looked at the Nginx Lua module. It has improved a lot since then, and it may now be an ideal candidate.
We started an experimental project which lets Nginx mimic Hipache, by using the same Redis configuration format.
Nginx deals with the request proxying, while the routing logic is all in Lua. We used the excellent lua-resty-redis module
to talk to Redis from Nginx.
This open source project is called hipache-nginx.
Some preliminary benchmarks show that under high load, hipache-nginx can be 10x faster than the original Hipache in
Node.js. The benchmarks have to be refined, but it appears that hipache-nginx can deliver the same performance as
hipache-nodejs with 10x fewer resources. So, while the code is still experimental, it shows that there is plenty of room
for improvement in the current HTTP routing layer. Even if it will probably matter only for apps handling 10,000-100,000
requests per second, it is still worth investigating.


CONCLUSION
As you can see, building a PaaS like dotCloud or Heroku involves specific knowledge about fundamental technologies.
Of course, you may not choose to implement any of the specific technologies that we've implemented at dotCloud.
Our aim has been to expose the underlying technologies we use to provide isolation between apps, rapid
deployment, protection against security threats, and distributed routing.
In other words, if you are serious about building a robust platform, you may want to become familiar with these types
of technologies. Or, alternatively, you could rely on an existing proven platform like dotCloud.
Join dotCloud's Technical Community
Sign up for your own account
Join the technical discussions in our open forums
Read our blog
Have a technical question?
Email us: support@dotcloud.com


Authors' Biographies
Jérôme Petazzoni, PaaS Under the Hood, Episodes 1-4
Jérôme is a senior engineer at dotCloud, where he rotates between Ops, Support and Evangelist duties and has earned
the nickname of master Yoda. In a previous life he built and operated large scale Xen hosting back when EC2 was just
the name of a plane, supervised the deployment of fiber interconnects through the French subway, built a specialized
GIS to visualize fiber infrastructure, specialized in commando deployments of large-scale computer systems in
bandwidth-constrained environments such as conference centers, and various other feats of technical wizardry. He cares
for the servers powering dotCloud, helps our users feel at home on the platform, and documents the many ways to use
dotCloud in articles, tutorials and sample applications. He's also an avid dotCloud power user who has deployed just
about anything on dotCloud - look for one of his many custom services on our GitHub repository.
Connect with Jérôme on Twitter! @jpetazzo
Sam Alba, PaaS Under the Hood, Episode 5
As dotCloud's first engineering hire, Sam was part of the tiny team that shipped our first private beta in 2010. Since
then, he has been instrumental in scaling the platform to tens of millions of unique visitors for tens of thousands
of developers across the world, leaving his mark on every major feature and component along the way. Today, as
dotCloud's first Director of Engineering, he manages our fast-growing engineering team, which is another way to
say he sits in meetings so that the other engineers don't have to. When not sitting in a meeting, he maintains several
popular open source projects, including Hipache and Cirruxcache and other projects also ending in -ache. In a
previous life, Sam supported Fortune 500s at Akamai, built the web infrastructure at several startups, and wrote
software for self-driving cars in a research lab at INRIA.
Follow Sam on Twitter @sam_alba
