
Differentiated Availability in Cloud Computing SLAs

Astrid Undheim, Ameen Chilwan and Poul Heegaard

Astrid Undheim: Telenor ASA, Corporate Development, Trondheim, Norway (astrid.undheim@telenor.com)
Ameen Chilwan and Poul Heegaard: Department of Telematics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway (chilwan@alumni.ntnu.no, poul.heegaard@item.ntnu.no)

Abstract: Cloud computing is the new trend in service delivery, and promises large cost savings and agility for the customers. However, some challenges still remain to be solved before widespread use can be seen. This is especially relevant for enterprises, which currently lack the necessary assurance for moving their critical data and applications to the cloud. The cloud SLAs are simply not good enough. This paper focuses on the availability attribute of a cloud SLA, and develops a complete model for cloud data centers, including the network. Different techniques for increasing the availability in a virtualized system are investigated, quantifying the resulting availability. The results show that, depending on the failure rates, different deployment scenarios and fault-tolerance techniques can be used for achieving availability differentiation. However, large differences can be seen from using different priority levels for restarting of virtual machines.

Keywords: Availability, cloud, differentiation, SLA

I. INTRODUCTION

Cloud computing presents a new computing paradigm that has attracted a lot of attention lately. It enables on-demand access to a shared pool of highly scalable computing resources that can be rapidly provisioned and released [1]. This is achieved by offering computing resources and services from large data centers, where the physical resources (servers, network, storage) are virtualized and offered as services over a network.

A large part of cloud applications has so far been targeted at consumers with low willingness to pay and low expectations of service QoS (dependability, performance and security). Recently, more and more enterprises are also investigating how to leverage cloud computing advantages such as the pay-per-use model and rapid elasticity. However, major challenges have to be faced in order for enterprises to trust cloud providers with their core business applications. These challenges are mainly related to QoS, in our view covering dependability, performance and security, and a comprehensive Service Level Agreement (SLA) is needed to cover all these aspects. This is in contrast to the insufficient SLAs offered today.

In this paper, we focus on dependability, and more specifically the availability attribute. Availability is defined in [2] as the readiness for correct service, which can be interpreted as the probability of providing service according to defined requirements. In order to avoid costly downtimes contributing to service unavailability, fault avoidance and fault tolerance are used in the design of dependable systems. Traditionally, fault tolerance has been implemented in hardware, resulting in expensive systems, or using cluster software, often very specific to each application [3]. In cloud computing, the approach to fault tolerance has mainly been to use cheap, off-the-shelf hardware, allowing failures and then tolerating them in software. The reason for this is partly the large size of cloud data centers, which means that hardware will fail constantly. Adding additional hardware resources to account for failures and letting the failover be handled by software is then a more cost-effective approach than using special-built hardware. Another advantage of virtualization is that virtual instances can be migrated to arbitrary physical machines, sharing redundant capacity among a large number of Virtual Machines (VMs). The standby resources needed are thus much smaller than for a traditional system. However, even with virtualization, fault tolerance means adding resources to the system, which adds cost. Fault tolerance should therefore be targeted to specific needs.

Differentiating with respect to fault-tolerance techniques and physical deployment for different applications gives better resource utilization and is cost-effective for the provider, while still delivering service according to user expectations. In particular, stateful applications require synchronized/updated replicas for tolerating failures, while stateless applications that tolerate short downtimes can be implemented using non-updated replicas. In addition, adding replicas at different physical locations increases fault tolerance, tolerating failures that may affect specific parts of a cloud data center.
In related work, hardware reliability for cloud data centers has been characterized in [4], and reliability/availability models for cloud data centers are stated as important ongoing work. The availability of a service running in VMs on two physical hosts is modeled in [5], and a non-virtualized system is compared with a virtualized system. A very simple cloud computing availability model is used in [6], combined with performance models to give Quality of Experience (QoE) measures for online services. The authors state the need for availability models for complete cloud data centers.

The main contribution of this paper is the modeling of a complete cloud system, including the network between the cloud data centers and the customers. The availability resulting from deployment in different physical locations can thus be studied with respect to different failure rates, both in the network and in the cloud infrastructure itself. In addition, these deployment options will be influenced by management software failure rates, an important aspect to include in the analysis.

The rest of this paper is organized as follows. In Section II, we focus on the SLAs offered by commercial cloud providers today and the missing pieces. In Section III, principles for achieving fault tolerance are described, as well as their application in cloud computing. An availability model of a cloud service deployment is described in Section IV, and the different types of failures are discussed. In Section V, different scenarios for VM deployment are described. Numerical results are presented in Section VI. Finally, we conclude the paper in Section VII, together with some thoughts on future work.

II. SLAS IN CLOUD COMPUTING

Cloud computing gives the customers less control of the service delivery, and they need to take precautions in order not to suffer low performance, long downtimes or loss of critical data. Service Level Agreements (SLAs) have therefore become an important part of the cloud service delivery model. An SLA is a binding agreement between the service provider and the service customer, used to specify the level of service to be delivered as well as how measuring, reporting and violation handling should be done. Today, most of the major cloud service providers include QoS guarantees in their SLA proposals, specified in a Service Level Specification (SLS), as seen in Figure 1. The focus in most cases is on dependability, measured as service availability, usually covering a time period of a month or a whole year. Credits are issued if the SLA is violated, e.g., the Amazon EC2 SLA includes an annual uptime percentage of 99.99% and issues 10% service credits (http://aws.amazon.com/ec2-sla/).

Figure 1: The structure of an SLA (the SLA covers measuring, reporting and violation handling; the Service Level Specification (SLS) covers SLS parameters and SLS thresholds)

Recently, we have seen many examples where cloud services have been unavailable for the customer, but where the unavailability has not been covered by the SLA (see http://cloudcomputingfuture.wordpress.com/2011/04/24/why-amazons-cloud-computing-outage-didnt-violate-its-sla). One reason is that the cloud SLAs are not specific enough when defining availability. From the customer's point of view this is a major drawback. Performance (e.g., response time) above a certain threshold will be perceived by the customer as service unavailability and should be credited accordingly. This issue is covered in [7], where the throughput of a load-balanced application is studied under the events of failures. It is clear that the availability parameter alone is not enough to ensure a satisfactory service delivery.

The on-demand characteristic of cloud computing is one aspect that complicates the QoS provisioning and SLA management. The cloud infrastructure needs to adjust to changing user demands, resource conditions and environmental issues. Hence, the cloud management system needs to automatically allocate resources to match the SLAs and also detect possible violations and take appropriate action in order to avoid paying credits. Several challenges for autonomic SLA management still remain. First, resources need to be allocated according to a given SLA. Next, measurements and monitoring are needed to detect possible violations and react accordingly, e.g., by allocating more resources. For availability violations this may require adding more standby resources to handle a given number of failures, and for performance violations this may require moving a VM to another physical machine if the current machine is overloaded. All these actions require a mapping between low-level resource metrics and high-level SLA parameters. One proposal on how to do this mapping is given in [8], where the amount of allocated resources is adjusted on the fly to avoid an SLA violation. Another proposal, for dynamic resource allocation using a feedback control system, is given in [9]. Here, the allocation of physical resources (e.g., physical CPU, memory and input/output) to VMs is adjusted based on measured performance.

With deployment of widely different services in the cloud, there is clearly a need for cloud providers to offer differentiated SLAs, with respect to dependability, performance and security. Core business functions such as production systems and billing need higher availability than applications targeted at consumers such as email and document handling. Also, different user groups may have different requirements. One example is Gmail, where the SLA for email services for consumers and business users is differentiated, offering the business users an availability of 99.9% at a fixed price, while consumers have a free offering without any SLA (http://www.google.com/apps/intl/en/business/features.html).
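To make these availability targets concrete, the short sketch below (a plain Python illustration of our own, not taken from any provider's SLA tooling) converts an availability target and a measurement period into the downtime budget it implies, which is the quantity a credit clause is ultimately checked against.

```python
def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Maximum accumulated downtime (in minutes) allowed by an availability target."""
    return (1.0 - availability) * period_hours * 60.0

def sla_violated(measured_downtime_min: float, availability: float, period_hours: float) -> bool:
    """True if the measured downtime exceeds the budget implied by the SLA target."""
    return measured_downtime_min > downtime_budget_minutes(availability, period_hours)

# An annual uptime target of 99.99% (as in the Amazon EC2 SLA discussed above)
# allows roughly 52.6 minutes of accumulated downtime per year.
print(round(downtime_budget_minutes(0.9999, 365 * 24), 1))   # 52.6
# A 99.9% target over a 30-day month (as in the Gmail business SLA) allows 43.2 minutes.
print(round(downtime_budget_minutes(0.999, 30 * 24), 1))      # 43.2
```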
III. FAULT TOLERANCE IN CLOUD COMPUTING

In the design of dependable systems, a combination of fault avoidance (called fault prevention in [2]) and fault tolerance is used to increase availability. Fault avoidance aims at avoiding faults being introduced, through use of better components (e.g., SSD instead of HDD), debugging of software or protecting the system against environmental faults. Fault tolerance is often used in addition to fault avoidance, allowing a fault to lead to an error but preventing errors from leading to service failure. Fault tolerance thus uses redundancy in order to remove or compensate for errors. This section gives a short overview of general fault tolerance techniques used in the design of dependable communication systems, and then looks at how fault tolerance is achieved in virtualized environments.

A. Fault Tolerance Principles

Cloud infrastructure is built using off-the-shelf hardware, and standby redundancy is the preferred fault tolerance technique. With standby redundancy, there are two or more replicas of the system. Only the active replica will produce results to be presented to the receiver, while the standby replicas are ready to take over should the active replica fail. Hot and cold standbys are possible. Hot standbys are powered standbys, capable of taking over service execution with no downtime (as long as the state is updated). Cold standbys are non-powered and need some time to be started in case of failure in the active replica.

Different levels of synchronization are possible for the hot standbys (updated/not updated), and the backup resources can be dedicated or shared, for both the hot and cold standbys. This gives the overall classification shown in Figure 2.

Figure 2: Standby redundancy classification (hot standbys are updated or not updated, with dedicated (VMware FT) or shared (Remus) backup resources; cold standbys have dedicated or shared backup resources (VMware HA))

The choice between hot or cold standbys will decide the service restoration time, but more importantly the choice should depend upon the application's need for an updated state space, as described in the next section.

B. Fault Tolerance in Cloud Computing

Cloud computing uses virtualization of computing resources, made available as VMs, virtual storage and virtual networks. We concentrate here on computation services and the use of VMs. In this case, backup is made easy with virtualization, since the virtual image contains everything that is needed to run the application and can be transparently migrated between physical machines. One of the downsides of virtualization, though, is that one single hardware fault in a physical server can affect several VMs and hence many applications. Replicas of the same application must therefore always be deployed on different physical machines. The standby resources must also be dimensioned to handle the high number of failed VMs in case of a physical server failure.

Virtualization facilitates live migration of VMs, where a running VM instance can be transferred between physical machines. Live migration has been implemented both for the Xen hypervisor [10] and for VMware with its VMotion [11], and ensures zero downtime in case of planned migrations due to resource optimization or planned maintenance. In case of failures in the physical host running the VM, live migration is not possible. The configuration file or the VM image should then be available on possible new host machines in order to restart the application. In addition, the configuration file should be stored at a centralized location should all replicas fail. How this is performed depends on the type of standby redundancy, as described next.

1) Hot Standby: For stateful applications, state must be stored on the standby virtual machine in order to allow failover. In traditional fault-tolerance terminology, this requires the use of updated hot standbys. Different levels of updating/synchronization between the active and standby replica are possible; either the input is evaluated at each replica, or the state information is transferred at specified checkpoints. The former method consumes more compute resources than the latter, and is denoted dedicated in our classification (Figure 2). The latter method allows many replicas to share backup resources and is hence denoted shared.

Examples of hot standby techniques are VMware's Fault Tolerance [3] and Remus for the Xen hypervisor [12], as seen in Figure 2. VMware Fault Tolerance is designed for mission-critical workloads, using a technique called virtual lockstep, and ensures no loss of data or state, including all active network connections. Both active and standby replicas execute all instructions, but the output from the standby replica is suppressed by the hypervisor. The hypervisor thus hides the complexity from both the application and the underlying hardware. This scheme is classified as updated and dedicated since the standby replica is fully synchronized and consumes resources equal to the active replica.
In Remus, fault tolerance is achieved by transmitting state information to the standby VM at frequent checkpoints, and buffering intermediate inputs between checkpoints. The standby can hence be up and running with a complete state space in case of failures, with only a short downtime needed for catching up with the input buffer. The standby does not execute any inputs, which means that fewer resources are consumed compared to VMware Fault Tolerance, and a short downtime and loss of the ongoing transaction are experienced in case of failure of the active replica. This scheme is classified as updated and shared since the standby only consumes a small amount of resources compared to the active.

Hot standbys can be used for both stateless and stateful applications, but since all replicas consume resources they are most often used for stateful applications. For stateless services, the not-updated hot standby is a possibility if high availability is important.

2) Cold Standby: The cold standby solution requires fewer resources and should in general be used for stateless applications that allow short downtimes. The same is true in cloud computing, but in addition, a virtualized environment adds functionality that is valuable for fault tolerance. Virtualization facilitates the running of different VMs on top of the same hardware, and the standby resources can be shared by different VMs, reducing the total resource needs. Dedicated standby resources are still possible for cold standbys, and should be used for stateless applications with high availability requirements. In practice, the dedicated solution can be implemented by prioritizing the restart of a standby VM in case of a failure. The low priority VMs may then experience a longer downtime, and possibly migration to a different part of the cloud.

VMware High Availability (HA) is one example of the use of cold standbys and supports both dedicated and shared resource usage, i.e., by allowing for different priority levels when restarting failed VMs.

With this simple classification, we end up with four different service levels as seen in Figure 3, where the choice between updated and not-updated hot standby is strictly a choice on state preservation, while choosing between cold shared, cold dedicated and hot not-updated will give different availabilities. Next, the physical deployment of the standby resources may influence the resulting availability. These principles can lay the foundation for offering differentiated availability levels in cloud SLAs.

Figure 3: Classification of fault-tolerance techniques according to state and availability (hot updated standbys preserve state, while cold shared, cold dedicated and hot not-updated standbys differ in availability)

IV. CLOUD AVAILABILITY MODEL

A. High Level Model

A simplified model of a cloud system is developed. Each cloud provider will typically have two or more data centers at different physical locations, connected to the customers via the Internet. Following [13], we model the data centers with racks of servers that are organized into clusters. Each cluster shares some infrastructure elements such as power distribution elements and network switches. The overall network architecture is simplified (inspired by [14]), and consists of two levels of switches (L1 and L2) in addition to the gateway routers. The overall model is then as shown in Figure 4.

Figure 4: High level cloud model (servers hosting VMs on a VMM, organized in clusters with PDUs and L1 switches; data centers with L2 switches, central power and cooling, and duplicated gateways GW1/GW2 towards the Internet and the customer)

B. Failure Classification

From the high level model, we focus on four different types of failures, namely failures in the power distribution/cooling, network failures, management software failures and server failures. These are described next.

1) Power Failures: An overview of power distribution in data centers can be found in [13]. In general, the power supply to the data center is from the utility power network. The Uninterruptible Power Supply (UPS) unit will distribute the power to the data center, and also handle switching from utility to generator and provide backup batteries should there be a utility power failure. We cannot assume perfect failover, and a Markov model is needed to capture the complexity of the power supply. Since the power supply is not the main focus in this paper, we chose to use availability numbers as documented in [15].

Within the data center, each cluster will be connected to a (duplicated) Power Distribution Unit (PDU), which is connected to the central power supply over a power bus. A failure in the distribution system is assumed to only affect one cluster. These failures are independent from failures in the power supply, and the two parts can be modeled in a series structure as seen in Figure 5.

Figure 5: Power model (central power/cooling in series with the duplicated PDUs of a cluster)
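The power block in Figure 5 is a plain series/parallel structure: the central power/cooling supply in series with a duplicated PDU. A minimal sketch of how such reliability block diagrams can be evaluated is given below, using the availability values that appear later in Table II; the helper functions are our own illustration, not code from the paper.

```python
from math import prod

def series(*availabilities: float) -> float:
    """Blocks in series: the service is up only if every block is up."""
    return prod(availabilities)

def parallel(*availabilities: float) -> float:
    """Redundant (parallel) blocks: the group is down only if every block is down."""
    return 1.0 - prod(1.0 - a for a in availabilities)

# Power block of Figure 5: power/cooling in series with two PDUs in parallel.
A_power, A_pdu = 0.9975, 0.9992          # values from Table II
A_power_block = series(A_power, parallel(A_pdu, A_pdu))
print(round(A_power_block, 6))           # approximately 0.997499
```

The same two helpers cover the duplicated switches and gateways of the network model (Figure 6) and the series of management software levels (Figure 7).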
2) Network Failures: The cloud services are accessed over the Internet, and the high level can be seen in Figure 4. In addition, we model the data center internal network in two levels [13]. First, there is one (duplicated) level 1 switch connecting all servers in one cluster. Next, there is one (duplicated) level 2 switch connecting all level 1 switches from all clusters. These are again connected to the WAN gateways, of which there are also two, since we assume the cloud provider to be multi-homed to two independent ISPs. The resulting structure model, including the core Internet and the user access network, is seen in Figure 6. Note that we assume common core Internet failures for different data centers of the same cloud provider; this will typically depend on the physical location of the data centers.

Figure 6: Network model (duplicated L1 switches, L2 switches and WAN gateways in series with the core Internet and the user access network)

3) Management Software Failures: Cloud computing requires extensive management systems, which are complex software systems and thus exposed to failures. Depending on what level of management software these failures affect, a cluster (VM Management), the whole data center (Virtual Infrastructure Management) or the whole cloud (Cloud Management) can be affected. The resulting model is shown in Figure 7, where we assume that these software failures are independent.

Figure 7: Management software model (VM Management, Virtual Infrastructure Management and Cloud Management in series)

4) Server Failures: The server models include failures from hardware, software and operation. However, application software failures that will take all replicas down are excluded. We chose to study the schemes that are currently deployed in commercial products, i.e., VMware FT, Remus, and VMware HA with two different priority levels.

Hot Standby, Updated, Dedicated: The hot standby option with dedicated, updated standbys provides the highest availability and the most updated state, corresponding to the VMware FT scheme [3]. We model two identical VMs running on two different physical servers, always within the same cluster. The replicas receive the same input and perform the same operations, but only the active VM delivers services. In case of a failure in the active replica, the hypervisor will immediately detect the failure and switch to the standby replica, which is ready to perform service without any delay or loss of data. The cluster management software will then deploy a new standby VM. With the failure of the standby VM, the management software will likewise deploy a new standby VM. This means that in a dependability context, it does not matter which VM fails. This setup will always tolerate one failure; however, it may happen that the resources are exhausted when trying to deploy a new standby VM, in which case the service will fail with the next failure. The resulting model is shown in Figure 8. Here, λ is the failure rate of the server (including hardware, software and operational failures), μ is the restart rate of a new VM, and c is the coverage factor, i.e., the probability that a restart is successful. We assume here that the resources are dimensioned so that there will always be enough resources for restarting a hot standby, since these should host the highest priority applications.

Figure 8: Updated, hot stand-by with dedicated backup resources (states: both OK, one down, both down)

Hot Standby, Updated, Shared: The hot standby option with shared updated standbys is different from the dedicated option in that the state information is transmitted at regular intervals instead of running the replicas in a synchronized fashion. This scheme corresponds to the Remus scheme for Xen [12], and as seen above this scheme experiences a short downtime and loss of data in case of failures. This also means that it matters which replica fails, since failure of the active replica will cause a short downtime. The model will therefore be different from that of the dedicated standby.

Since the standbys share the backup resources, there will be a non-zero probability that the standby will not have enough resources to start in case of failure of the primary. We assume here that the overall load on the cluster is dimensioned such that there will always be enough resources. The resulting model is shown in Figure 9, where the parameters are the same as for the updated, dedicated model. However, one additional parameter is introduced, α, where 1/α is the time needed to switch to the standby replica in case of failure in the active replica. It is then clear that when this time is short enough, this model will be equal to the previous model.

Figure 9: Hot stand-by with shared backup resources (states: both OK, active down, standby down, both down)
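The server-level Markov models can be solved numerically for their steady-state probabilities, from which the server availability follows as the probability of not being in a down state. The sketch below shows a generic solver for a small continuous-time Markov chain together with one plausible reading of the three-state model in Figure 8, using the rates λ, μ and c from Table I; the exact generator is an illustrative assumption, not a faithful reproduction of the paper's transition structure.

```python
import numpy as np

def steady_state(Q: np.ndarray) -> np.ndarray:
    """Solve pi Q = 0 with sum(pi) = 1 for an irreducible CTMC generator Q."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])       # transposed balance equations + normalization
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

lam, mu, c = 0.00722, 2.0, 0.95            # failure rate, restart rate, coverage (Table I)

# States: 0 = both replicas OK, 1 = one replica down, 2 = both down (service down).
# Illustrative transition structure (an assumption, as noted above).
Q = np.zeros((3, 3))
Q[0, 1] = 2 * lam                          # either of the two replicas fails
Q[1, 0] = c * mu                           # failed replica successfully restarted
Q[1, 2] = lam + (1 - c) * mu               # remaining replica fails, or the restart fails
Q[2, 1] = c * mu                           # one replica successfully restored
for i in range(3):
    Q[i, i] = -Q[i].sum()                  # diagonal entries make each row sum to zero

pi = steady_state(Q)
print(round(1.0 - pi[2], 6))               # server availability: not in the both-down state
```

The four-state model of Figure 9 and the priority models of Figure 10 can be handled the same way, with larger generators.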

Cold Standby: For the cold standby setup, no state information is retained in the standby, and the standby is simply restarted in case of a failure, corresponding to VMware HA [16], where heartbeats are used to detect failures in the active replica and restart the VM. This restart usually takes some time, during which the service is not available. Also, the backup resources may be shared between different VMs, and are usually dimensioned to allow for a specific number of physical server failures. If the resources are exhausted, VMs cannot be restarted in case of a failure in the active replica.

Here we look at two different classes, both with shared backup resources, but where the high priority class has preemptive priority over the low priority class. Given that the resources are properly dimensioned, the high priority class will in effect have dedicated standby resources.

The resulting models are shown in Figure 10. The additional parameter γ is the preemption rate from a higher priority application, and we introduce the parameter p, which is the probability that there are enough resources for restarting the replica. This is different from the previous models, since we can no longer assume that the resources are dimensioned to handle these restarts for the lowest priority applications. Also, we introduce the parameter δ, which is the rate at which the management system adds more resources to the cluster if the resources are exhausted.

Figure 10: Cold standby with high and low priority ((a) High Priority; (b) Low Priority; states: VM OK, VM down, and, for low priority, a queue state when resources are exhausted)

V. LOCATION OF REPLICAS

The power, network, management software and server failures are assumed to be independent, which means that reliability block diagrams can be used to model the system availability, where individual blocks (here the power block and the server block) are detailed using Markov models. We look at different deployment options and the resulting dependability when replicas are located in the same cluster, in different clusters of a data center, or even in different data centers. The latter two deployment options provide tolerance also towards power, network and management software failures.

A. Same Cluster

The easiest deployment is to place all replicas in the same cluster. This means low network latency for updating replicas, but it also means that power, management software and network failures may lead to unavailability of all replicas and thus the service. The resulting model is shown in Figure 11.

Figure 11: Deployment in the same cluster (Mngmt, Power, Server and Network blocks in series)

B. Same Data Center

Next, replicas are placed in two different clusters, but in the same data center. The cluster block will then incorporate the cluster part of the power, management software and network blocks as well as the server block, all in series. The server block will then include the Markov model from the respective fault tolerance technique. The Mngmt, Power and Network blocks will likewise exclude the cluster part as shown in Figures 5-7. The resulting model is shown in Figure 12.

Figure 12: Deployment in the same data center (Mngmt, Power and Network blocks in series with two parallel Cluster blocks)

C. Same Cloud Provider

The final option is to deploy replicas in two different data centers. The DC block will then include the whole power block, as well as the cluster and data center parts of the network and management software blocks. The resulting model is shown in Figure 13.

Figure 13: Deployment with the same cloud provider (Mngmt and Network blocks in series with two parallel DC blocks)
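Given the availability of each block, the three deployment scenarios reduce to the series/parallel compositions of Figures 11-13. A minimal sketch of this composition is shown below; how each block availability is split into its cluster, data-center and cloud-level parts is abstracted into the function arguments, so the argument names are illustrative rather than the paper's notation.

```python
from math import prod

def series(*a: float) -> float:
    """All blocks must be up."""
    return prod(a)

def parallel(*a: float) -> float:
    """Down only if every redundant block is down."""
    return 1.0 - prod(1.0 - x for x in a)

def scenario_I(A_mngmt: float, A_power: float, A_server: float, A_network: float) -> float:
    """Same cluster: every block is a single point of failure (Figure 11)."""
    return series(A_mngmt, A_power, A_server, A_network)

def scenario_II(A_shared_mngmt: float, A_shared_power: float,
                A_shared_network: float, A_cluster: float) -> float:
    """Same data center: two cluster blocks in parallel behind the shared blocks (Figure 12)."""
    return series(A_shared_mngmt, A_shared_power, A_shared_network,
                  parallel(A_cluster, A_cluster))

def scenario_III(A_cloud_mngmt: float, A_common_network: float, A_dc: float) -> float:
    """Same cloud provider: two data-center blocks in parallel (Figure 13)."""
    return series(A_cloud_mngmt, A_common_network, parallel(A_dc, A_dc))
```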
VI. NUMERICAL RESULTS

The input parameters for the server models are listed in Table I. These are mostly collected from [5]. The latter three parameters are guessed, and will typically depend on the load of the system (the preemption rate and the exhausted probability p) and on the operational aspects of the data center (the cluster expansion rate).

Table I: Parameter values for the server model

Name | Parameter | Value | Source
VM Failure Rate | λ | 0.00722 hr⁻¹ | [5]
VM Restart Rate | μ | 2.0 hr⁻¹ | [5]
Standby Update Rate | α | 60 hr⁻¹ | [12]
VM Restart Coverage | c | 0.95 | [5]
Preemption Rate | γ | 0.05 hr⁻¹ | Guessed
Exhausted Probability | p | 0.99 | Guessed
Cluster Expansion Rate | δ | 6.0 hr⁻¹ | Guessed

The availability values for the different blocks in the high level model are listed in Table II.

Table II: Availability values for the high level model

Name | Parameter | Value | Source
Power | Apower | 0.9975 | [15]
PDU | APDU | 0.9992 | [17]
Management software | Amngmt | 0.999 | Guessed
Switches | Aswitch | 0.97986 | [14]
Router | Arouter | 0.99966 | [17]
Access Network | Aaccess | 0.989 | [18]
Core Network | Acore | 0.999 | [13]
User Access | Auser | 0.99 | [13]

The resulting availability for the different deployment scenarios (I: same cluster, II: same data center, III: same cloud provider) and fault tolerance techniques is shown in Table III. The availability is increased when replica VMs are deployed in different clusters (scenario II) and data centers (scenario III), but the effect is clearly not very big. The difference between the hot and cold standby techniques is more prominent, at least for the scenario with all replicas in one cluster (scenario I). For the hot standbys, there are small differences between the dedicated and shared standbys. However, with shared backup resources, the availability will decrease when the load increases. We also see that the cold standby with high priority gives the same availability as the hot standby solutions. However, only the latter provide updated state, and the cold standby option is only possible for stateless applications.

Table III: Availability results for the different deployment scenarios and fault-tolerance techniques

Scenario | VM Fault Tolerance | Cloud A | Netw A | Tot A
I | Updated Dedicated Hot | 0.99354 | 0.98901 | 0.98262
I | Updated Shared Hot | 0.99342 | 0.98901 | 0.98262
I | Shared Cold (HP) | 0.98981 | 0.98901 | 0.97893
I | Shared Cold (LP) | 0.95385 | 0.98901 | 0.94336
II | Updated Dedicated Hot | 0.994972 | 0.98901 | 0.98403
II | Updated Shared Hot | 0.99497 | 0.98901 | 0.98403
II | Shared Cold (HP) | 0.99494 | 0.98901 | 0.98401
II | Shared Cold (LP) | 0.98311 | 0.98901 | 0.97230
III | Updated Dedicated Hot | 0.997971 | 0.99 | 0.98799
III | Updated Shared Hot | 0.99797 | 0.99 | 0.98799
III | Shared Cold (HP) | 0.99791 | 0.99 | 0.98799
III | Shared Cold (LP) | 0.98683 | 0.99 | 0.97697

The network part (Internet and user access) is separated in order to see the effect of the network availability on the total availability. For scenario III (same cloud provider), the different data centers are accessed using disjoint networks, resulting in a higher overall availability. This influence of the network availability on the resulting end-to-end cloud service availability is a topic for future study.

Next, we look at the updated, dedicated hot standby scenario with different failure rates in the power and management software. The results are shown in Figure 14 and show that scenario III, i.e., using different data centers, is superior to the less distributed scenarios when the availability of the management software is low. The same is true for the power part.

Figure 14: Availability of updated, dedicated hot standbys for different deployment scenarios ((a) service availability versus management software availability; (b) service availability versus power system availability)

Finally, the availability for the cold standby with high and low priority is plotted versus the preemption rate in Figure 15. The preemption rate is dependent on the load in the system. With a preemption rate equal to zero, the high and low priority techniques are equal, but for higher preemption rates the high priority is superior. Hence, using different priority levels and allowing for preemption will have a clear differentiation effect when the load increases.

Figure 15: Availability for high and low priority cold standbys with increasing preemption rate
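Since the cloud and network parts are modeled as independent, the total availability in Table III is simply the product of the cloud and network columns; a quick check of two scenario I rows (our own arithmetic, reproducing the table values up to rounding):

```python
rows = [("Updated Dedicated Hot", 0.99354, 0.98901),
        ("Shared Cold (HP)",      0.98981, 0.98901)]
for name, cloud_a, netw_a in rows:
    print(name, round(cloud_a * netw_a, 5))   # 0.98262 and 0.97893, as in Table III
```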
VII. CONCLUSIONS AND FUTURE WORK

SLAs have received a lot of attention in cloud computing, and especially availability is covered by public cloud SLAs. However, there are some important improvements to be made. First, the SLAs must become more detailed with respect to the actual KPIs used to define availability. Next, in order to also deploy important enterprise services in clouds, different levels of availability should be offered, depending on the actual user requirements. Finally, the SLAs should be available on demand, which also means that they should be adjustable on demand.

This paper has proposed an overall availability model for a cloud system, including the network. We have shown how deploying replicas in different physical locations affects the resulting availability, and also how different applications need different fault tolerance schemes. These are two possible dimensions for differentiating cloud applications.

Future work includes modeling more complex services, e.g., a tiered web service. Also, the server models should be made more detailed, taking into account characteristics of the different failures and repairs. We have discussed the need for well-defined KPIs for availability; the next step is to also include performance measures in the availability models. Finally, the network availability strongly influences the total availability of a cloud service, and should optimally be included in the cloud service SLA.

REFERENCES

[1] P. Mell and T. Grance, "The NIST Definition of Cloud Computing, v.15," 2009. [Online]. Available: http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc
[2] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, Jan. 2004.
[3] VMware White Paper, "Protecting Mission-Critical Workloads with VMware Fault Tolerance," 2009.
[4] K. V. Vishwanath and N. Nagappan, "Characterizing Cloud Computing Hardware Reliability," in Proceedings of the ACM Symposium on Cloud Computing (SOCC), 2010.
[5] D. S. Kim, F. Machida, and K. S. Trivedi, "Availability Modeling and Analysis of a Virtualized System," in Proceedings of the 15th IEEE Pacific Rim International Symposium on Dependable Computing, Nov. 2009, pp. 365-371.
[6] H. Qian, D. Medhi, and K. Trivedi, "A Hierarchical Model to Evaluate Quality of Experience of Online Services hosted by Cloud Computing," 2011, pp. 1-8.
[7] D. Menascé, "Performance and Availability of Internet Data Centers," IEEE Internet Computing, vol. 8, no. 3, pp. 94-96, May 2004.
[8] I. Brandic, V. C. Emeakaroha, M. Maurer, S. Dustdar, S. Acs, A. Kertesz, and G. Kecskemeti, "LAYSI: A Layered Approach for SLA-Violation Propagation in Self-manageable Cloud Infrastructures," in Proceedings of the 2010 34th Annual IEEE Computer Software and Applications Conference Workshops, Jul. 2010, pp. 365-370.
[9] Q. Li, Q. Hao, L. Xiao, and Z. Li, "Adaptive Management of Virtualized Resources in Cloud Computing Using Feedback Control," in Proc. of the 1st International Conference on Information Science and Engineering (ICISE '09), Dec. 2010.
[10] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live Migration of Virtual Machines," in Proceedings of the 2nd Symposium on Networked Systems Design & Implementation (NSDI '05), 2005, pp. 273-286.
[11] M. Nelson, B.-H. Lim, and G. Hutchins, "Fast Transparent Migration for Virtual Machines," in Proceedings of USENIX '05, Anaheim, California, 2005, pp. 5-9.
[12] B. Cully, G. Lefebvre, D. Meyer, M. Feeley, N. Hutchinson, and A. Warfield, "Remus: High Availability via Asynchronous Virtual Machine Replication," in NSDI '08: Proceedings of the 5th USENIX Symposium on Networked Systems Design and Implementation, 2008.
[13] L. A. Barroso and U. Hölzle, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines," Synthesis Lectures on Computer Architecture, vol. 4, no. 1, pp. 1-108, Jan. 2009.
[14] A. Greenberg, D. A. Maltz, and J. R. Hamilton, "VL2: A Scalable and Flexible Data Center Network," in Proceedings of SIGCOMM '09. ACM, 2009.
[15] W. P. Turner and J. Seader, "Tier Classifications Define Site Infrastructure Performance," The Uptime Institute White Paper, 2006.
[16] VMware White Paper, "VMware High Availability: Concepts, Implementation and Best Practices," 2007.
[17] J. Dean, "Designs, Lessons and Advice from Building Large Distributed Systems," Keynote Presentation at LADIS 2009, The 3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware, 2009.
[18] M. Dahlin, B. B. V. Chandra, L. Gao, and A. Nayate, "End-to-End WAN Service Availability," IEEE/ACM Transactions on Networking, vol. 11, no. 2, pp. 300-313, 2003.
