1. FT CLOUD
Unfortunately, the reliability of cloud applications is still far from perfect. The demand for highly reliable cloud applications is now unprecedentedly strong, making the construction of highly reliable clouds a critical, challenging, and urgent research problem. Since cloud applications usually involve a large number of components, it is too expensive to provide alternative components for every one of them. Moreover, there is usually no need to provide fault-tolerance mechanisms for noncritical components, whose failures have limited impact on the system. To reduce cost and develop highly reliable cloud applications within a limited budget, a small set of critical components must be identified in the cloud application.
Our idea is based on the well-known 80-20 rule: by tolerating faults in a small fraction of the most important cloud components, the reliability of the cloud application can be greatly improved. Based on this idea, we propose FTCloud, a component ranking framework for building fault-tolerant cloud applications. FTCloud identifies the most significant components and automatically suggests optimal fault-tolerance strategies for them. Designers can employ FTCloud to build more reliable and robust cloud applications efficiently and effectively.
The contribution of this paper is twofold. First, it identifies the critical problem of locating significant components in complex cloud applications and proposes a ranking-based framework, named FTCloud, for building fault-tolerant cloud applications. We propose two ranking algorithms to identify significant components among the large number of cloud components, and then present an optimal fault-tolerance strategy selection algorithm to determine the most suitable fault-tolerance strategy for each significant component. To our knowledge, FTCloud is the first ranking-based framework for developing fault-tolerant cloud applications. Second, we provide extensive experiments evaluating the impact of significant components on the reliability of cloud applications.
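The section above does not reproduce the ranking algorithms themselves. As a hedged illustration only, a PageRank-style iteration over the component invocation graph is one plausible way to assign significance values: components invoked by many important components score higher. All names and numbers below are hypothetical.

```python
# Sketch of a PageRank-style significance ranking over a component
# invocation graph. graph maps each component to the components it invokes.
def rank_components(graph, d=0.85, iters=50):
    nodes = list(graph)
    n = len(nodes)
    score = {v: 1.0 / n for v in nodes}
    # Precompute which components invoke each component.
    incoming = {v: [u for u in nodes if v in graph[u]] for v in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            # Each invoker splits its score evenly among its invocations.
            s = sum(score[u] / len(graph[u]) for u in incoming[v] if graph[u])
            new[v] = (1 - d) / n + d * s
        score = new
    return sorted(score.items(), key=lambda kv: -kv[1])

# Toy invocation graph: C is invoked by A, B, and D, so it ranks highest.
invocations = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranking = rank_components(invocations)
```

The top entries of `ranking` would then be treated as the "significant components" handed to the fault-tolerance selection stage.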
System Architecture
Fig. 2 shows the system architecture of our fault-tolerance framework (named FTCloud), which includes two parts: 1) ranking and 2) optimal fault-tolerance selection. The procedure of FTCloud is as follows:
1. The system designer provides the initial architecture design of a cloud application to FTCloud. A component graph is built for the cloud application based on the component invocation relationships.
2. Significance values of the cloud components are calculated by employing component ranking algorithms; based on these values, the components are ranked.
3. The most significant components in the cloud application are identified from the ranking results.
4. The performance of the various fault-tolerance strategy candidates is calculated, and the most suitable strategy is selected for each significant component.
5. The component ranking results and the selected fault-tolerance strategies for the significant components are returned to the system designer for building a reliable cloud application.
Our current FTCloud framework can be employed to tolerate crash and value faults.
In the future, we will investigate more types of faults, such as Byzantine faults. In this paper, we study only the most representative type of software component graph, i.e., the scale-free graph. Since different applications may have different system structures, we will investigate more graph models (e.g., the small-world and random-graph models) in future work. Our future work also includes: 1) considering more factors (such as invocation latency and throughput) when computing the weights of invocation links; and 2) investigating the component
FTCloud is a component ranking framework for fault-tolerant cloud applications (Zheng, Zhou, Lyu, & King, 2010, 2012). First, it employs two ranking algorithms to identify significant components among the large number of cloud components. Then, an optimal fault-tolerance strategy selection algorithm determines the most suitable strategy for each significant component. However, it does not address the placement of the VMs on which fault-tolerant cloud applications are deployed.
Cloud computing applications are usually provided at large scale and with considerable complexity. Unfortunately, their reliability is still far from ideal. In (Zheng et al., 2010), a component-ranking-based framework called FTCloud was introduced. This framework comprises two operating phases: ranking and fault tolerance. FTCloud enables cloud computing applications to withstand value and crash faults. The structure of this architecture is illustrated in Fig. 1.
The resource requests of the VMs are initialized according to the sorted fault-tolerant strategies in the VM-placement part of the solution. Then the VM placement is solved according to the resources of the nodes, the resource requests of the VMs, and the sorted fault-tolerant strategies. Finally, the VM placement plan is output to the executor and executed.
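The placement step described above can be sketched as a simple first-fit heuristic; the data shapes and the first-fit rule are assumptions for illustration, not the paper's exact algorithm:

```python
# First-fit placement sketch: VMs, already sorted by their fault-tolerant
# strategy priority, are assigned to the first node with enough free capacity.
def place_vms(sorted_vms, node_capacity):
    # sorted_vms: list of (vm_id, cpu_request); node_capacity: {node: free cpu}
    plan = {}
    for vm, cpu in sorted_vms:
        for node, free in node_capacity.items():
            if cpu <= free:
                plan[vm] = node
                node_capacity[node] -= cpu
                break
        else:
            plan[vm] = None  # no node can host this VM
    return plan

plan = place_vms([("vm1", 4), ("vm2", 3), ("vm3", 2)], {"n1": 5, "n2": 4})
```

A real placement solver would also consider the anti-colocation constraints implied by the fault-tolerant strategies (e.g., not packing all replicas onto one node).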
Three factors considered in the cloud application constraint model are response time, failure rate, and resource consumption. Software fault tolerance is widely adopted to increase overall system reliability in critical applications. Two forms of software fault tolerance are retry and version redundancy. Version redundancy comprises basic and combined methods; the basic methods include recovery block (RB), N-version programming (NVP), and active replication (Avizienis, 1995; Salatge & Fabre, 2007).
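As a hedged sketch (not the surveyed paper's algorithm), the trade-off between these basic strategies can be modeled by comparing failure probability against expected response time under standard independence assumptions. The strategy models and numbers below are illustrative only.

```python
# Compare basic fault-tolerance strategies for one component with failure
# probability p and latency t, picking the most reliable feasible one.
def select_strategy(component_fail_prob, component_latency, budget_ms):
    p, t = component_fail_prob, component_latency
    candidates = {
        # name: (failure probability, expected response time)
        "retry(2)": (p ** 2, t * (1 + p)),          # retry once on failure
        "NVP(3)":   (3 * p**2 * (1 - p) + p**3, t), # 2-of-3 majority vote, parallel
        "RB(2)":    (p ** 2, t * (1 + p)),          # recovery block, one backup
    }
    feasible = {k: v for k, v in candidates.items() if v[1] <= budget_ms}
    if not feasible:
        return None
    return min(feasible, key=lambda k: feasible[k][0])

best = select_strategy(component_fail_prob=0.05, component_latency=100,
                       budget_ms=120)
```

With these numbers, retry and RB both cut the failure probability from 5% to 0.25% at a modest latency cost, while NVP keeps latency flat but is less reliable here; `min` picks the first of the tied best candidates.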
In this paper, the failure rate is defined as the probability that an invocation to a cloud application will fail; the failure rates of VMs and nodes are not considered. Only four basic fault-tolerant strategies are taken into consideration.
3) There are fault tolerance techniques other than replication, such as checkpointing and parallel techniques; selecting the most suitable technique for each service is another challenge in this work. To address the first challenge, we use a heuristic that finds the VMs that can fulfill the customers' QoS requirements and sorts them in ascending order of failure probability. To address the second challenge, we use a VM classification heuristic to determine the most valuable VMs; this heuristic depends on both the usage (service) time of the VMs and the reliability level specified by the customer. For the third challenge, an algorithm is proposed to select the most suitable fault tolerance technique for each selected VM, depending on customer requirements such as cost and application deadline.
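The first two heuristics can be sketched as below; the field names, the QoS filter, and the "value" rule combining usage time with the reliability level are all assumptions for illustration, not the authors' exact definitions:

```python
# Heuristic 1: keep only VMs meeting the QoS (response time) requirement,
# sorted ascending by failure probability.
def shortlist_vms(vms, max_response_ms):
    ok = [v for v in vms if v["response_ms"] <= max_response_ms]
    return sorted(ok, key=lambda v: v["fail_prob"])

# Heuristic 2: flag the "most valuable" VMs, i.e., those whose usage time
# exceeds a threshold scaled by the customer's reliability level (0..1).
def most_valuable(vms, min_usage_hours, reliability_level):
    threshold = min_usage_hours * reliability_level
    return [v["id"] for v in vms if v["usage_hours"] >= threshold]

vms = [
    {"id": "v1", "response_ms": 80, "fail_prob": 0.02, "usage_hours": 12},
    {"id": "v2", "response_ms": 50, "fail_prob": 0.01, "usage_hours": 3},
    {"id": "v3", "response_ms": 200, "fail_prob": 0.005, "usage_hours": 20},
]
ranked = shortlist_vms(vms, max_response_ms=100)
valuable = most_valuable(ranked, min_usage_hours=10, reliability_level=0.9)
```

Here v3 is dropped by the QoS filter despite its low failure probability, and only long-running v1 is flagged as valuable enough to merit a costlier fault tolerance technique.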
Reviewing the literature reveals that most previous work uses response time and number of failures as the main criteria for selecting VMs for customers' applications; no prior work considers the usage time or the failure probability of VMs. Also, most previous work considers only one fault tolerance technique, mostly replication with a static or fixed number of replicas. Thus, extra VMs are consumed in executing user applications, and the cloud loses the monetary benefit of those VMs. A way is therefore required to provide a dynamic number of replicas that maintains the monetary profit of the cloud.
5. Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud
computing environments
To achieve a high level of cloud serviceability and to meet demanding cloud SLOs, a robust fault tolerance strategy is needed. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two different fault tolerance strategies, namely the checkpointing fault tolerance strategy and the data replication fault tolerance strategy; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time.
Our work is originally motivated by the following facts:
(i) The 80/20 rule.
(ii) Temporal locality. More recently accessed data will be accessed again in the near future according to the current data access pattern, which is called temporal locality [28, 34]. With temporal locality, popular data is identified by analyzing users' accesses to the data. When the popularity of a data item passes a dynamic threshold, a replication operation is triggered; the number of replicas is determined from the system availability, the failure probability, and the fault tolerance requirements.
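The temporal-locality trigger described above can be sketched as a sliding window of recent accesses; the window size, threshold, and class shape are assumptions for illustration:

```python
# Sketch of a temporal-locality popularity tracker: a block is flagged for
# replication once its access count within a sliding window crosses a
# threshold.
from collections import Counter, deque

class PopularityTracker:
    def __init__(self, window_size, threshold):
        self.window = deque(maxlen=window_size)
        self.counts = Counter()
        self.threshold = threshold

    def access(self, block):
        # Evict the oldest access from the counts before the deque drops it.
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1
        self.window.append(block)
        self.counts[block] += 1
        return self.counts[block] >= self.threshold  # True => replicate

tracker = PopularityTracker(window_size=100, threshold=3)
hits = [tracker.access(b) for b in ["b1", "b2", "b1", "b3", "b1"]]
```

Only the third access to "b1" crosses the threshold and triggers replication; a production system would then size the replica set from the availability target.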
1.2 Paper contributions
(i) Analyzing the mathematical relationship between different failure rates and two different fault
tolerance strategies, the checkpointing fault tolerance strategy and the replication fault tolerance
strategy; (ii) Building a dynamic adaptive checkpointing fault tolerance model and a dynamic
adaptive replication fault tolerance model, and combining the two fault tolerance models together to
maximize the serviceability and to meet the SLOs; and, (iii) Evaluating the dynamic adaptive fault
tolerance strategy under various conditions in large-scale cloud data centers, with consideration of
different system centric parameters, such as fault tolerance degree, fault tolerance overhead,
response time, etc.
In order to achieve high fault tolerance in clouds with the checkpointing fault tolerance strategy, two important problems must be solved:
1. How often should checkpoints be inserted and system running states saved? If checkpoints are inserted too frequently, a large checkpointing overhead (such as storage) is imposed, which is especially noticeable in cloud environments consisting of virtual nodes. Conversely, if checkpoints are inserted too infrequently, a large fault recovery overhead is incurred, because the system must roll back over many operations when failures do occur.
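A classical way to balance these two overheads is Young's approximation for the checkpoint interval, T_opt ≈ sqrt(2 · C · MTBF), where C is the cost of writing one checkpoint and MTBF the mean time between failures. This is standard background, not a formula taken from the surveyed paper.

```python
# Young's approximation for a near-optimal checkpoint interval: frequent
# checkpoints raise save overhead, infrequent ones raise recovery overhead;
# the optimum balances the two.
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Example: a 30 s checkpoint cost and a 6 h MTBF give roughly a 19-minute
# checkpoint interval.
interval = optimal_checkpoint_interval(checkpoint_cost_s=30, mtbf_s=6 * 3600)
```

Note that the approximation assumes failures are rare relative to the interval and that checkpoint cost is constant; a dynamic adaptive strategy would re-estimate both inputs at runtime.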
2. Which checkpointing strategy should be selected in clouds: full, incremental, or hybrid checkpointing? The full checkpointing strategy
is a strategy to save the whole system running states periodically to a storage platform, as shown in
Fig. 1(a). If a failure does occur while a system is running, the system can be recovered from the
latest checkpoint rather than from the starting checkpoint, which greatly decreases the recomputing
time. However, a larger checkpointing overhead will be imposed due to the operations for saving the
whole system running states. On the contrary, in the incremental checkpointing strategy, as shown in
Fig. 1(b), the first checkpoint contains complete system running states; subsequent checkpoints only
store pages that have been modified since the previous checkpoint. So the incremental checkpointing
is a strategy introduced to reduce the checkpointing overhead by saving pages that have been
changed instead of the whole system running states. However, a larger fault recovery
overhead is needed, because the system has to recover from the starting checkpoint and replay every increment since. The hybrid
checkpointing strategy, as shown in Fig. 1(c), combines the full checkpointing strategy and the
incremental checkpointing strategy, and has a trade-off between the checkpointing overhead and the
fault recovery overhead. Upon failure, the system can restart from the state of the last full
checkpointing.
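The save/recover trade-off among the three strategies in Fig. 1 can be made concrete with a rough cost model; the cost parameters, the worst-case recovery assumption, and the "full checkpoint every k" rule for the hybrid strategy are illustrative assumptions:

```python
# Rough cost model: full checkpointing pays a large save cost each time but
# recovers from the latest checkpoint; incremental saves cheap deltas but
# must replay every delta since the first full checkpoint; hybrid takes a
# full checkpoint every k checkpoints (worst-case recovery: k-1 replays).
def costs(n_checkpoints, full_save, delta_save, restore, replay, k=4):
    full = {"save": n_checkpoints * full_save, "recover": restore}
    incr = {"save": full_save + (n_checkpoints - 1) * delta_save,
            "recover": restore + (n_checkpoints - 1) * replay}
    n_full = -(-n_checkpoints // k)      # ceil: one full checkpoint per k
    n_delta = n_checkpoints - n_full
    hybr = {"save": n_full * full_save + n_delta * delta_save,
            "recover": restore + (k - 1) * replay}
    return full, incr, hybr

full, incr, hybr = costs(n_checkpoints=12, full_save=10, delta_save=2,
                         restore=10, replay=3)
```

With these numbers the hybrid strategy sits between the extremes: it saves far less than full checkpointing and recovers far faster than pure incremental checkpointing, which is exactly the trade-off the text describes.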
Therefore, in the checkpointing fault tolerance strategy, determining a dynamic adaptive
checkpointing strategy is needed which considers a balance between the frequent checkpointing
strategy and the infrequent checkpointing strategy, a balance between the full checkpointing strategy
and the incremental checkpointing strategy, a balance between the checkpointing overhead and the
fault tolerance overhead, and the trade-off between high fault tolerance and high cloud SLOs.
In order to achieve high fault tolerance in clouds with the replication fault tolerance strategy, three important problems must be solved:
(i) Which data should be replicated, and when, to meet the system fault tolerance requirements on reducing waiting time and speeding up data access?
(ii) How many new replicas should be created in the cloud to meet a reasonable system availability requirement?
(iii) Where should the new replicas be placed to meet the requirements on task success rate and bandwidth consumption?
Therefore, in the replication fault tolerance strategy, one needs a dynamic adaptive replication strategy that determines which data to replicate and when, how many new replicas to create and where to place them, and the trade-off between high fault tolerance and high cloud SLOs.
applications as an on-demand service. The advantages of our approach are as follows. First, it gives applications the flexibility to dynamically adjust their fault tolerance properties and their level of reliability and availability over time; resource costs can be bounded, and performance levels can be modified from one point to another based on business requirements. Achieving these features with traditional mechanisms would be extremely difficult. Second, it simplifies the job of application developers, since users only specify the desired properties for each application and have them delivered transparently. This relaxes the requirement of having expertise and experience both in reliable application development and in managing fault tolerance and failures at runtime. System-level complexity is also abstracted away by the service layer, allowing users to receive fault tolerance support composed specifically for their applications without requiring in-depth knowledge of system-level procedures.
FTM ARCHITECTURE: OVERVIEW
FTM is built to work on top of the hypervisor, spanning all the nodes and traversing the abstraction layers of the cloud to transparently tolerate failures among the processing nodes. Fig. 1 illustrates the architecture of FTM, which can primarily be viewed as an assemblage of several Web service components, each with a specific functionality. A brief description of each component's functionality, along with the rationale for its inclusion in the framework, is provided in this section.
Replication Manager: FTM provides fault tolerance by replicating users' applications so that a redundant copy of an application is available after a failure. In the cloud computing environment, redundancy can also be applied to the entire VM instance in which the application is hosted. The set of VM instances controlled by a single instantiation of the replication service is referred to as a replica group. This component receives a reference to the client's VM instance and the expected replication properties, such as the replication style (active, passive, cold passive, hot passive) and the number of replicas in the replica group, from the FTMKernel component. The Replication Manager also includes techniques to maintain consistency within a replica group by updating the state of the backup replicas with that of the primary replica.
Fault Detection/Prediction Manager: To detect replica failures, well-known algorithms such as the gossip-based protocol [20] and the heartbeat message exchange protocol [23] are implemented. A fault detection service should ideally detect faults immediately after their occurrence and send a notification about the faulty replica to the FTMKernel, which invokes services from the Fault Masking Manager and the Recovery Manager. When a failure is detected, the Resource Manager is also notified to update the resource state of the cloud.
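Heartbeat-based detection of the kind cited above can be sketched as follows; the timeout value and class shape are illustrative assumptions, not FTM's implementation:

```python
# Heartbeat failure detector sketch: a replica is suspected failed when no
# heartbeat has arrived within the timeout; the kernel would then be notified.
class HeartbeatDetector:
    def __init__(self, timeout_s):
        self.timeout = timeout_s
        self.last_seen = {}

    def heartbeat(self, replica_id, now_s):
        # Record the arrival time of the latest heartbeat from this replica.
        self.last_seen[replica_id] = now_s

    def suspected(self, now_s):
        # Replicas silent for longer than the timeout are suspected failed.
        return [r for r, t in self.last_seen.items()
                if now_s - t > self.timeout]

det = HeartbeatDetector(timeout_s=5)
det.heartbeat("r1", now_s=0)
det.heartbeat("r2", now_s=3)
failed = det.suspected(now_s=7)
```

With a 5 s timeout, r1 (last seen 7 s ago) is suspected while r2 (4 s ago) is not; real detectors must also tune the timeout against network jitter to limit false suspicions.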
Fault Masking Manager: This component includes algorithms that mask the occurrence of failures and prevent faults from turning into errors. A widely proposed and accepted masking technique in cloud and virtualization environments is the live migration of VM instances [8], in which the entire OS (VM instance) is moved to another location while preserving established sessions.
Recovery Manager: This component includes the mechanisms that restore failed nodes to normal operation.
Messaging Monitor: FTM integrates the WS-RM standard [3] with other Web service specifications, such as the replication approach protocol, so that communication between any two components (and between replicas) is reliable even in the presence of component, system, or network failures. It offers the necessary communication infrastructure in two forms: message exchange among the replicas of a replica group, and inter-component communication within the framework.
Client/Admin Interface: This component obtains users' requirements and acts as the interface between the end user and FTM. After obtaining the user's preferences, the client interface forwards them to the FTMKernel.
FTMKernel: This is the central component of FTM, managing all the reliability mechanisms in the framework. It considers the user's requirements and accordingly selects the Web (reliability) services from the other components. The chosen modules are then orchestrated into an aggregate solution that is delivered to the user's application.
Resource Manager: To achieve efficient, proactive resource allocation and avoid over-provisioning during failures, the working state of the physical and virtual resources in the cloud must be continuously monitored. The Resource Manager realizes this functionality in FTM by maintaining a database of detailed logging information about the machines in the cloud and providing an abstract, simple representation of the working state of resources in the form of a graph.