parallels.com || openvz.org || criu.org
Seven Problems
of Linux Containers
Kir Kolyshkin
<kir@openvz.org>
27 April 2013, LinuxFest Northwest

Seventy Seven Problems
of Linux Containers
(of which I am going to cover six)

Problem 1: Effective virtualization

Virtualization is partitioning
Historical way: IBM mainframes
Modern way: virtual machines
Problem: performance overhead
Partial solution: hardware support (Intel VT, AMD V)

Solution: isolation

Run many isolated userspace instances on top of one single (Linux) kernel
All processes see each other
  files, process information, network, shared memory, users, etc.
Make them unsee it!



One historical way to unsee

chroot()

Namespaces

Implemented in the Linux kernel
  PID
  net
  IPC
  UTS
  mnt
  user
clone() with CLONE_NEW* flags
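
A minimal sketch of how the namespace list maps to CLONE_NEW* flags. The flag values are the ones defined in <linux/sched.h>. As an assumption for brevity, the sketch calls unshare(2), which accepts the same flags as clone(), through ctypes; the helper names namespace_flags and unshare_namespaces are made up for this illustration.

```python
import ctypes
import os

# Flag values from <linux/sched.h>, one per namespace type on the slide.
CLONE_FLAGS = {
    "mnt": 0x00020000,   # CLONE_NEWNS
    "uts": 0x04000000,   # CLONE_NEWUTS
    "ipc": 0x08000000,   # CLONE_NEWIPC
    "user": 0x10000000,  # CLONE_NEWUSER
    "pid": 0x20000000,   # CLONE_NEWPID
    "net": 0x40000000,   # CLONE_NEWNET
}


def namespace_flags(names):
    """OR together the CLONE_NEW* flags for the given namespace names."""
    flags = 0
    for name in names:
        flags |= CLONE_FLAGS[name]
    return flags


def unshare_namespaces(names):
    """Detach the caller from the listed namespaces (hypothetical helper).

    Usually needs root or CAP_SYS_ADMIN. For "pid", only children created
    afterwards end up in the new namespace, not the caller itself.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(namespace_flags(names)) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

For example, unshare_namespaces(["uts", "net"]) would give the process its own hostname and an empty network stack while leaving everything else shared.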



Problem 2: Shared resources

All containers share the same set of resources (CPU, RAM, disk, various kernel things ...)
Need fair distribution of goods so everyone gets their share
Need DoS prevention
Need prioritization
  "All animals are equal, but some animals are more equal than others" -- George Orwell

Solution: OpenVZ resource controls

OpenVZ:
  user beancounters
    controls 20 parameters
  hierarchical CPU scheduler
  disk quota per container
  I/O priorities per container
Dynamic control, can "resize" at runtime



Solution: cgroups

Cgroups is a mechanism to control resources per hierarchical groups of processes
Cgroups is nothing without controllers:
  blkio, cpu, cpuacct, cpuset, devices, freezer, memory, net_cls, net_prio
Cgroups are orthogonal to namespaces
Still a work in progress (kernel memory)
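
A rough sketch of what "control resources per group of processes" looks like through the cgroup (v1) filesystem interface: a group is a directory, a limit is a file written inside it, and processes are attached by writing PIDs to the group's tasks file. The group name ct101, the PID, and the helper names are hypothetical, and the demo writes to a scratch directory rather than a mounted hierarchy (on a real host the root would typically be something like /sys/fs/cgroup/memory, which is an assumption here).

```python
import os
import tempfile


def create_group(controller_root, name, settings, pids=()):
    """Create a cgroup directory, write controller settings, attach processes."""
    group = os.path.join(controller_root, name)
    os.makedirs(group, exist_ok=True)  # on cgroupfs, mkdir creates the group
    for filename, value in settings.items():
        with open(os.path.join(group, filename), "w") as f:
            f.write(str(value))
    for pid in pids:
        # v1 interface: one PID per write to the "tasks" file
        with open(os.path.join(group, "tasks"), "a") as f:
            f.write(f"{pid}\n")
    return group


def demo():
    """Dry run against a scratch directory instead of a mounted hierarchy."""
    with tempfile.TemporaryDirectory() as root:
        group = create_group(
            root,
            "ct101",  # hypothetical group name
            {"memory.limit_in_bytes": 512 * 1024 * 1024},
            pids=[1234],  # hypothetical PID
        )
        with open(os.path.join(group, "memory.limit_in_bytes")) as f:
            return f.read()
```

The same function works for any controller from the list, since each one exposes its knobs as files in the group directory.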



Problem 3: easy resources

User Beancounters are complicated:
  http://wiki.openvz.org/UBC_consistency_check
  user has to set all these parameters
  some of which are interdependent
We created a collection of valid configs,
  ... wrote a whole book about UBC
  ... and a set of tools to help



Solution: VSwap

Only two primary parameters: RAM and swap
  others still exist, but no longer required to set
Swap is virtual, no actual I/O is performed
  Slow down to emulate real swap
  Only when actual global RAM shortage occurs, virtual swap goes into the real swap
Currently only available in OpenVZ kernel
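
A toy model of the accounting described above, not the OpenVZ implementation: it only classifies where a container's next page would land given its RAM and swap limits and whether the host as a whole is short of RAM. The function name, the page-count units and the returned labels are assumptions made for this sketch.

```python
def classify_allocation(used_pages, ram_limit, swap_limit, host_ram_short):
    """Return where the container's next page ends up in this toy model."""
    if used_pages < ram_limit:
        return "ram"
    if used_pages < ram_limit + swap_limit:
        # Over the RAM limit: the page is accounted as swap. It stays in host
        # RAM (the container is only slowed down) unless the host itself is
        # short of RAM, in which case it goes to the real swap device.
        return "real swap" if host_ram_short else "virtual swap"
    return "denied"
```

With ram_limit=100 and swap_limit=50, page 120 is "virtual swap" on an idle host and "real swap" on a host under memory pressure; page 150 is refused either way.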



Problem 4: fast live migration

We can migrate an OpenVZ container from one physical server to another without a shutdown
We want to do it fast even for huge containers
  huge disk: use shared storage
  huge RAM: ???



Normal migration process

(Assuming shared storage)
1 Freeze the container
2 Dump its complete state to a dump file
3 Copy dump file to destination server
4 Undump
5 Unfreeze
Problem: huge dump file
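
To see why the dump file size matters, here is a back-of-the-envelope estimate of how long the container stays frozen in the process above. It assumes, for illustration only, that steps 2 to 4 run strictly one after another and that each is limited by a single throughput figure; the function and parameter names are made up.

```python
def frozen_seconds(dump_bytes, dump_bytes_per_s, net_bytes_per_s):
    """Estimate freeze time: dump, copy and undump happen back to back."""
    dump = dump_bytes / dump_bytes_per_s     # step 2: write the dump file
    copy = dump_bytes / net_bytes_per_s      # step 3: send it over the network
    undump = dump_bytes / dump_bytes_per_s   # step 4: read it back in
    return dump + copy + undump
```

Under these assumptions, an 8 GB dump at 200 MB/s dump speed and 100 MB/s network speed keeps the container frozen for 40 + 80 + 40 = 160 seconds.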



Solution 1: network swap

1 Dump the minimal memory, lock the rest
2 Copy, undump what we have, mark the rest as swapped out
3 Set up network swap served from the source
4 Unfreeze. Missing RAM will be "swapped in"
5 Migrate the rest of RAM and kill it on source
PROBLEM? Reliability, no way to roll back
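
The steps above can be sketched as a destination-side memory that starts with only the minimal set of pages and pulls the rest on first access. This is an in-process simulation, not kernel code: the LazyMemory class, its method names, and the use of a plain dict lookup in place of the network swap device are all assumptions for illustration.

```python
class LazyMemory:
    """Destination-side memory: missing pages are pulled on first access."""

    def __init__(self, resident_pages, fetch_from_source):
        self.pages = dict(resident_pages)  # the minimal set copied in step 2
        self.fetch = fetch_from_source     # stands in for the network swap
        self.remote_reads = 0

    def read(self, page_no):
        if page_no not in self.pages:      # "swapped out" page: swap it in
            self.pages[page_no] = self.fetch(page_no)
            self.remote_reads += 1
        return self.pages[page_no]

    def drain(self, all_page_numbers):
        """Step 5: pull everything still on the source so it can be killed."""
        for page_no in all_page_numbers:
            self.read(page_no)


# Hypothetical 8-page container; only page 0 is copied before unfreezing.
source = {n: f"page-{n}".encode() for n in range(8)}
dest = LazyMemory({0: source[0]}, source.__getitem__)
```

The sketch also shows where the reliability problem comes from: once the destination runs and modifies its pages, the source copy is stale, yet the destination still depends on the source until drain() completes.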



Solution 2: Iterative RAM migration

1 Ask kernel to track modified pages
2 Copy all memory to destination system
3 Ask kernel for list of modified pages
4 Copy those pages
5 GOTO 3 until satisfied
6 Freeze and do migration as usual
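
A small simulation of the loop above. Memory is a dict of page number to contents, and the kernel's modified-page tracking is replaced by a workload callback that mutates the source and reports which pages it touched. The names precopy and make_workload, the threshold for "satisfied", and the shrinking write set are assumptions chosen to keep the sketch short.

```python
def precopy(source, workload, threshold=8, max_rounds=20):
    """Copy memory while the source keeps running, then finish under freeze.

    Returns (destination memory, copy rounds, pages copied while frozen).
    """
    dest = {}
    to_copy = set(source)              # step 2: first pass copies everything
    rounds = 0
    while True:
        for page in to_copy:
            dest[page] = source[page]
        rounds += 1
        dirty = workload(source)       # step 3: pages modified meanwhile
        if len(dirty) <= threshold or rounds >= max_rounds:
            break                      # step 5: "satisfied"
        to_copy = dirty                # step 4: copy only those pages
    for page in dirty:                 # step 6: frozen, copy the remainder
        dest[page] = source[page]
    return dest, rounds, len(dirty)


def make_workload(hot_pages):
    """Hypothetical workload whose write set halves on every call."""
    calls = 0

    def workload(memory):
        nonlocal calls
        calls += 1
        dirty = set(hot_pages[: max(1, len(hot_pages) >> calls)])
        for page in dirty:
            memory[page] = b"v%d" % calls
        return dirty

    return workload


source = {n: b"v0" for n in range(64)}
dest, rounds, frozen = precopy(source, make_workload(list(range(64))))
```

Only the last, small set of dirty pages is copied while the container is frozen, which is the point of iterating. A workload that dirties pages faster than they can be copied never converges, hence the max_rounds cut-off.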



Problem 5: upstreaming

OpenVZ was developed separately
Then we wanted to merge it upstream (i.e. to vanilla Linux kernel)
Problem: upstream devs are not accepting our work



Solution 1: rewrite from scratch

User Beancounters -> CGroups
Did 2 rewrites for PID namespace until it finally got accepted
Network namespace redone
It works!

  about 1500 patches landed in vanilla
  Parallels made it to the top 10 contributors



Solution 2: CRIU

We tried hard to merge checkpoint/restore
Other people tried hard too, no luck
Can't make it to the kernel, let's go userspace
With minimal kernel intervention when required
Kernel exports most of information already, so let's just add missing bits and pieces

CRIU

Checkpoint / Restore (mostly) In Userspace
Tools currently at version 0.4
Will do 1.0 release this year

Kernel 3.8 has about 120 patches from us
95% of needed features are there
Memory snapshot recently made it to -mm tree



Problem 6: common file system

Container is just a directory on host, all CTs reside on the same FS
File system journal is a bottleneck
Lots of small-size files I/O on CT backup
No sub-tree disk quota support in upstream
No per-container snapshots
Live migration: rsync -- changed inodes
File system type and properties are fixed



Solution 1: LVM

Only works on top of a block device
Hard to manage (e.g. how to migrate a huge volume?)
No dynamic allocation
Complicated management

Solution 2: loop device

VFS operations lead to double page-caching (already fixed in the recent kernels)
No dynamic allocation, max space is used
Limited feature set



Solution 3: ploop

Basic idea: same as loop, just better
Modular design:
  various image formats (qcow2 in TODO)
  various I/O backends
More features:
  live resize
  instant live snapshots
  write tracker to help in live migration



Any problems questions?

kir@openvz.org
Twitter: @kolyshkin