
Different Strategies

1. FTCloud
Unfortunately, the reliability of cloud applications is still far from perfect in practice. Nowadays,
the demand for highly reliable cloud applications is becoming unprecedentedly strong. Building
highly reliable clouds becomes a critical, challenging, and urgently required research problem.
Since cloud applications usually involve a large number of components, it is still too expensive to
provide alternative components for all the cloud components. Moreover, there is probably no need to
provide fault-tolerance mechanisms for the noncritical components, whose failures have limited
impact on the systems. To reduce the cost so as to develop highly reliable cloud applications within a
limited budget, a small set of critical components needs to be identified from the cloud applications.
Our idea is based on the well-known 80-20 rule, i.e., by tolerating faults of a small part of the
most important cloud components, the cloud application reliability can be greatly improved. Based
on this idea, we propose FTCloud, which is a component ranking framework for building fault-tolerant cloud applications. FTCloud identifies the most significant components and suggests the
optimal fault-tolerance strategies for these significant components automatically. FTCloud can be
employed by designers of cloud applications to design more reliable and robust cloud applications
efficiently and effectively.
The contribution of this paper is twofold. First, it identifies the critical problem of locating significant components in complex cloud applications and proposes a ranking-based framework, named FTCloud, for building fault-tolerant cloud applications: two ranking algorithms identify significant components among the large number of cloud components, and an optimal fault-tolerance strategy selection algorithm then determines the most suitable fault-tolerance strategy for each significant component. We consider FTCloud the first ranking-based framework for developing fault-tolerant cloud applications. Second, we provide extensive experiments to evaluate the impact of significant components on the reliability of cloud applications.
System Architecture
Fig. 2 shows the system architecture of our fault-tolerance framework (named FTCloud), which
includes two parts:
1) ranking and 2) optimal fault-tolerance selection. The procedure of FTCloud is as follows:
1. The system designer provides the initial architecture design of a cloud application to FTCloud. A component graph is built for the cloud application based on the component invocation relationships.
2. Significance values of the cloud components are calculated by the component ranking algorithms, and the components are ranked by these values.
3. The most significant components in the cloud application are identified based on the ranking results.
4. The performance of the various fault-tolerance strategy candidates is calculated, and the most suitable fault-tolerance strategy is selected for each significant component.
5. The component ranking results and the selected fault-tolerance strategies for the significant components are returned to the system designer for building a reliable cloud application.
Our current FTCloud framework can be employed to tolerate crash and value faults.
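Step 2 is the core of the framework. As a rough illustration only, the following sketch computes PageRank-style significance values over a weighted component invocation graph: a component invoked by significant components becomes significant itself. The graph, edge weights, and damping factor are assumptions for illustration; FTCloud's own two ranking algorithms differ in their details.

```python
# Hedged sketch of a significance ranking over a weighted invocation
# graph (PageRank-style); not FTCloud's exact algorithms.

def rank_components(graph, damping=0.85, iterations=50):
    """graph: {component: {invoked_component: weight, ...}, ...}"""
    nodes = set(graph) | {v for tgts in graph.values() for v in tgts}
    sig = {n: 1.0 / len(nodes) for n in nodes}            # uniform start
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for u, tgts in graph.items():
            total = sum(tgts.values())
            for v, w in tgts.items():
                # components invoked by significant components gain significance
                new[v] += damping * sig[u] * w / total
        sig = new
    return sorted(sig.items(), key=lambda kv: kv[1], reverse=True)

# Toy graph: 'a' invokes 'b' heavily and 'c' lightly; 'b' invokes 'c'.
ranking = rank_components({'a': {'b': 3.0, 'c': 1.0}, 'b': {'c': 1.0}})
top_significant = ranking[:1]      # the small critical set per the 80-20 idea
```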
In the future, we will investigate more types of faults, such as Byzantine faults. In this paper, we only
study the most representative type of software component graph, i.e., the scale-free graph. Since different applications may have different system structures, we will investigate more types of graph models (e.g., the small-world model and the random-graph model) in our future work, which also includes:
1. considering more factors (such as invocation latency and throughput) when computing the weights of invocation links; and
2. investigating the component …
FTCloud is a component ranking framework for fault-tolerant cloud applications (Zheng, Zhou, Lyu, & King, 2010, 2012). First, it employs two ranking algorithms to identify significant components among the large number of cloud components. Then, an optimal fault-tolerance strategy selection algorithm determines the most suitable fault-tolerance strategy for each significant component. However, it does not address the placement of the VMs on which fault-tolerant cloud applications are deployed.
Cloud computing applications are usually provided at large scale and with considerable complexity; unfortunately, their reliability is still far from ideal. In (Zheng et al., 2010), a component-ranking-based framework called FT-Cloud has been introduced. This framework comprises two operating phases: ranking and fault-tolerance selection. FT-Cloud enables cloud computing applications to withstand faults; the architecture provides fault tolerance for cloud computing in the face of value and crash faults. Its structure is illustrated in Fig. 1.

2. Using Proactive Fault-Tolerance Approach to Enhance Cloud Service Reliability


Existing schemes rarely consider the problem of coordination among multiple virtual machines
(VMs) that jointly complete a parallel application. Without VM coordination, the parallel application
execution results will be incorrect. To overcome this problem, we first propose an initial virtual
cluster allocation algorithm according to the VM characteristics to reduce the total network resource
consumption and total energy consumption in the data center. Then, we model CPU temperature to
anticipate a deteriorating physical machine (PM). We migrate VMs from a detected deteriorating PM
to some optimal PMs. Finally, the selection of the optimal target PMs is modeled as an optimization
problem that is solved using an improved particle swarm optimization algorithm. We evaluate our
approach against five related approaches in terms of the overall transmission overhead, overall
network resource consumption, and total execution time while executing a set of parallel
applications.
To overcome the upper-level bandwidth resource bottlenecks and enhance cloud service reliability, this paper proposes a proactive coordinated FT (PCFT) approach based on particle swarm optimization (PSO) [29], which addresses the proactive coordinated FT problem of a virtual cluster with the objective of minimizing the overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications.

First, we introduce a deteriorating PM modeling problem; then, we formulate a coordinated FT problem for the VMs on the detected deteriorating PM, in which optimal target PMs are sought for these VMs.
To solve the two above-mentioned problems, we propose the PCFT approach, which is realized in two steps: first, we introduce a PM fault prediction model to proactively anticipate a deteriorating PM, and then, we improve the PSO algorithm to solve the coordinated FT problem.
We set up a system model to evaluate the efficiency and effectiveness of the proposed PSO-based PCFT approach by comparing it with five other related approaches in terms of overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications.
Although the proactive FT scheme and virtual clusters have been widely adopted [30], [31], [45], they are rarely used together to enhance the reliability of cloud data centers. Therefore, this paper proposes a CPU temperature model for anticipating a deteriorating PM. In order to reallocate the VMs on the detected deteriorating PM as close as possible to the other VMs in the same virtual cluster, the PSO-based PCFT approach is introduced to identify optimal PMs for these VMs.
The health monitoring mechanism is adopted to guarantee cloud service reliability in our approach (PCFT). The objective of the PCFT approach is to monitor and anticipate a deteriorating PM. When a deteriorating PM exists, our approach searches for optimal target PMs for the VMs hosted on the deteriorating PM.
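As a concrete (and deliberately simplified) illustration of the prediction idea, the sketch below flags a PM as deteriorating when a linear extrapolation of its recent CPU temperatures leaves an assumed normal range. The paper's actual temperature model is more elaborate; the threshold and horizon here are assumptions.

```python
# Hedged sketch: suspect a PM when a linear forecast of its CPU
# temperature exceeds an assumed safe threshold.

def is_deteriorating(samples, threshold_c=85.0, horizon=5):
    """samples: recent CPU temperatures in deg C, oldest first."""
    n = len(samples)
    if n < 2:
        return False
    mean_x, mean_y = (n - 1) / 2.0, sum(samples) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) \
            / sum((x - mean_x) ** 2 for x in range(n))
    forecast = samples[-1] + slope * horizon   # extrapolate a few steps ahead
    return forecast > threshold_c

print(is_deteriorating([70, 74, 79, 83]))      # rising trend -> True
```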
As shown in Fig. 3, the system architecture of our approach consists of the following two modules.
PM fault prediction: CPU temperature monitoring and forecasting are essential for preventing PM shutdowns due to overheating, as well as for improving the data center's energy efficiency. This module monitors and anticipates a deteriorating PM by checking whether the CPU temperature stays within the normal temperature range.
Optimal target PM selection: When a deteriorating PM is detected, this module searches for optimal target PMs for the VMs on the deteriorating PM. To search for these optimal target PMs and to execute a cloud service consisting of a set of parallel applications, we design a VM coordination mechanism that groups three VMs into a virtual cluster to jointly execute a parallel application, and we model the optimal target PM selection as a constrained PSO-based optimization problem.
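The following sketch shows how the PSO step might be set up: each particle encodes one candidate mapping of the affected VMs to target PMs (one index per VM), and a stand-in cost function plays the role of the paper's transmission-overhead and consumption objectives. The encoding, PSO constants, and cost are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative PSO for target PM selection; positions are continuous
# and rounded to PM indices when evaluated.
import random

def pso_select_pms(num_vms, num_pms, cost, particles=20, iters=100,
                   w=0.7, c1=1.5, c2=1.5):
    """cost(mapping) -> float, lower is better; mapping[i] = PM of VM i."""
    decode = lambda p: [min(num_pms - 1, max(0, round(x))) for x in p]
    pos = [[random.uniform(0, num_pms - 1) for _ in range(num_vms)]
           for _ in range(particles)]
    vel = [[0.0] * num_vms for _ in range(particles)]
    pbest = [list(p) for p in pos]
    gbest = min(pbest, key=lambda p: cost(decode(p)))
    for _ in range(iters):
        for i in range(particles):
            for d in range(num_vms):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if cost(decode(pos[i])) < cost(decode(pbest[i])):
                pbest[i] = list(pos[i])
                if cost(decode(pbest[i])) < cost(decode(gbest)):
                    gbest = list(pbest[i])
    return decode(gbest)

# Toy cost: place a 3-VM virtual cluster on lightly loaded PMs.
load = [0.9, 0.2, 0.4, 0.1]
mapping = pso_select_pms(3, 4, lambda m: sum(load[p] for p in m))
```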
In this work, we proposed a PCFT approach that adopts a VM coordination mechanism to anticipate a deteriorating PM in a cloud data center and then automatically migrates VMs from the deteriorating PM to the optimal target PMs. It is a two-step approach: we first proposed a CPU temperature model to anticipate a deteriorating PM, and we then searched for the optimal target PMs using an efficient heuristic algorithm. We evaluated the performance of the PCFT approach by comparing it with five related approaches in terms of the overall transmission overhead, overall network resource consumption, and total execution time while executing a set of parallel applications. However, complex parallel applications cannot be executed on our experimental platform; hence, in the future, we will design multiple types of parallel applications for execution on it. Meanwhile, we also plan to apply our approach to reactive FT using the full coordinated checkpoint mechanism.

3. A method of virtual machine placement for fault tolerant cloud applications


The placement of virtual machines (VMs) for highly reliable cloud applications is a challenging and critical research problem. To address this challenge, a method of VM placement based on adaptive selection of fault-tolerant strategies for cloud applications is proposed. It involves two phases. In the first phase, the fault-tolerant strategies of cloud applications are sorted according to the constantly changing constraint factors of the cloud application, namely the response time, failure rate, and resource consumption. In the second phase, the VM placement plan based on the adaptively selected fault-tolerant strategy is solved. A prototype of this VM placement framework, named SelfAdaptionFTPlace, is implemented.
Experimental results demonstrate that, compared with existing methods, the proposed method achieves better performance and a better VM placement plan under constantly changing cloud application constraint factors.
To guarantee the reliability of cloud applications, it is desirable to develop a fault-tolerant VM
placement method. Furthermore, as the requirements of the cloud users vary greatly, the VM
placement method has to be adjustable for different cloud applications. In this paper, a method of VM placement based on adaptive selection of fault-tolerant strategies (SelfAdaptionFTPlace) for cloud applications is proposed. Three cloud application constraint factors for VM placement are considered, i.e., the response time, failure rate, and resource consumption. The proposed method involves two phases. In the first phase, the best evaluation function value of VM placement under each fault-tolerant strategy is solved. In the second phase, the VM placement plan is solved according to the solution of the first phase.
System architecture
The system architecture of SelfAdaptionFTPlace in the unified resource layer of the cloud computing system is shown in Figure 1. SelfAdaptionFTPlace consists of three parts: transformation of application requirements into a constraint model, selection of adaptive fault-tolerant strategies, and solution of VM placement. The resources of VMs and nodes change constantly according to the resource requests of cloud applications, and the VM placement solution is computed periodically.
1) Constraint model
In this part, a cloud application is committed in the step of transformation of application requirements into the constraint model. The requirements of the cloud application, including the response time, failure rate, and resource consumption, are initialized. Then, the user's cloud application constraints on the response time, failure rate, and resource consumption are input. Finally, the cloud application requirements and constraints are transformed into the constraint model.
2) Fault tolerance
In the part of selection of adaptive fault-tolerant strategies, the evaluation function value for each strategy is computed according to the constraint model. The fault-tolerant strategies are then sorted by evaluation function value in ascending order; the smaller the value, the more suitable the strategy is for the cloud application (a sketch of this scoring step follows after part 3).
3) VM placement
In the part of solution of VM placement, the resources of nodes and the resource requests of VMs are initialized according to the sorted fault-tolerant strategies. Then, the VM placement is solved according to the resources of nodes, the resource requests of VMs, and the sorted fault-tolerant strategies. Finally, the VM placement plan is output to the executor and executed.
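As a concrete illustration of part 2), the sketch below scores each fault-tolerant strategy with a weighted sum of the three normalized constraint factors and sorts ascending; the weighted-sum form, the weights, and all numbers are assumptions, since the evaluation function is not spelled out here.

```python
# Hedged sketch of the strategy evaluation-and-sort step; smaller
# evaluation values mean a more suitable strategy.

def evaluate(strategy, weights=(0.4, 0.4, 0.2)):
    """strategy: normalized response_time, failure_rate, resource in [0, 1]."""
    w_t, w_f, w_r = weights
    return (w_t * strategy["response_time"]
            + w_f * strategy["failure_rate"]
            + w_r * strategy["resource"])

strategies = {
    "Retry":  {"response_time": 0.8, "failure_rate": 0.4, "resource": 0.1},
    "RB":     {"response_time": 0.6, "failure_rate": 0.2, "resource": 0.3},
    "NVP":    {"response_time": 0.3, "failure_rate": 0.1, "resource": 0.9},
    "Active": {"response_time": 0.2, "failure_rate": 0.1, "resource": 1.0},
}
ranked = sorted(strategies, key=lambda s: evaluate(strategies[s]))
print(ranked)        # most suitable strategy first
```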
The three factors of the cloud application constraint model considered are the response time, failure rate, and resource consumption. Software fault tolerance is widely adopted to increase overall system reliability in critical applications. Two ways of achieving software fault tolerance are retry and version redundancy. Version redundancy comprises basic and combined methods; the basic methods include recovery block (RB), N-version programming (NVP), and active replication (Avizienis, 1995; Salatge & Fabre, 2007).
In this paper, the failure rate is defined as the probability that an invocation to a cloud application will fail; the failure rates of VMs and nodes are not considered. In this paper, only four basic fault-tolerant strategies have been taken into consideration.
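For intuition about how such strategies trade off, the sketch below shows standard closed-form failure-rate aggregations for the retry, recovery block, NVP (majority voting), and active strategies, assuming independent failures. These closed forms are textbook material; the paper may parameterize its four strategies differently.

```python
# Failure probability of a composed service, given per-attempt/per-version
# failure probabilities q (independence assumed).
from itertools import combinations
from math import prod

def retry(q, attempts):        # fails only if every attempt fails
    return q ** attempts

def recovery_block(qs):        # alternates tried in sequence
    return prod(qs)

def active(qs):                # parallel replicas, first reply wins
    return prod(qs)

def nvp(qs):                   # majority voting over n versions
    n, majority = len(qs), len(qs) // 2 + 1
    total = 0.0
    for k in range(majority, n + 1):            # exactly k versions fail
        for failed in combinations(range(n), k):
            total += prod(qs[i] if i in failed else 1 - qs[i]
                          for i in range(n))
    return total

print(retry(0.1, 3), nvp([0.1, 0.1, 0.1]))      # 0.001  ~0.028
```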

4. A Framework for Providing a Hybrid Fault Tolerance in Cloud Computing


These failures have a great impact on the availability, credibility, and economy of the cloud [11]. This is because the system will search for another suitable resource or VM to perform the customer's service, which affects the time needed to serve customers' applications and thus degrades the performance of the cloud. Thus, there is a need to minimize the effect of these failures on performance when they occur.
1) In the cloud environment, there may be many VMs that can fulfill customers' QoS requirements but have a high tendency to fail. In such a scenario, if the broker neglects the failure history of VMs and their replicas when selecting them, the likelihood of failures will be high. This eventually results in compromising the users' QoS parameters in completing cloud applications.
2) It is too expensive to perform replication for all cloud applications and VMs, because profit is lost when extra VMs serve the same application, whereas these VMs could be exploited to serve other applications. Thus, we only need to replicate the applications executed on the most valuable VMs, i.e., those that would have a great impact on the performance of the cloud if they failed. Determining the most valuable VMs is a great challenge.

3) There are fault tolerance techniques other than replication, such as checkpointing and parallel techniques. Selecting the most suitable technique for each service is another challenge in this work. In this paper, in order to address the first challenge, we use a heuristic that finds a list of the VMs that can fulfill customers' QoS requirements and sorts them in ascending order of failure probability. In order to address the second challenge, we use a VM classification heuristic to determine the most valuable VMs; this heuristic depends on both the usage service time of the VMs and the reliability level introduced by the customer. For the third challenge, an algorithm is proposed to select the most suitable fault tolerance technique for the selected VM. The algorithm depends on customers' requirements, such as the cost and deadline of applications.
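A compact sketch of how the three heuristics could fit together is shown below, under assumed data shapes (VMs carrying a failure probability and usage time; applications carrying a deadline flag and budget). The classification rule and the technique choice are illustrative, not the paper's exact algorithms.

```python
# Hedged sketch of the three selection steps described above.

def select_vms(vms, qos_ok):
    """Challenge 1: QoS-feasible VMs, least failure-prone first."""
    return sorted((v for v in vms if qos_ok(v)), key=lambda v: v["p_fail"])

def most_valuable(vms, usage_threshold, reliability_level, level_min=0.9):
    """Challenge 2: replicate only long-serving VMs under a high
    customer reliability level (illustrative rule)."""
    if reliability_level < level_min:
        return []
    return [v for v in vms if v["usage_time"] >= usage_threshold]

def pick_technique(app):
    """Challenge 3: choose by the application's deadline and cost."""
    if app["deadline_tight"] and app["budget"] >= app["replica_cost"]:
        return "replication"       # fastest recovery, but extra VMs
    return "checkpointing"         # cheaper, slower restart

vms = [{"id": 1, "p_fail": 0.05, "usage_time": 40},
       {"id": 2, "p_fail": 0.20, "usage_time": 90}]
ordered = select_vms(vms, qos_ok=lambda v: True)
valuable = most_valuable(ordered, usage_threshold=60, reliability_level=0.95)
```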
A review of the literature reveals that most previous work is mainly based on using the response time and the number of failures as the main criteria for selecting VMs for customers' applications; no existing work considers the usage time or the failure probability of VMs. Also, most previous work considers only one fault tolerance technique, mostly replication with a static or fixed number of replicas. Thus, extra VMs are used in executing user applications, and the cloud loses the monetary benefit of these VMs. So, a way is required to provide a dynamic number of replicas to maintain the monetary profit of the cloud.

5. Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud
computing environments
To achieve a high level of cloud serviceability and to meet high cloud SLOs, a foolproof fault tolerance strategy is needed. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two different fault tolerance strategies, namely the checkpointing fault tolerance strategy and the data replication fault tolerance strategy; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two fault tolerance models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as the fault tolerance degree, fault tolerance overhead, and response time.
Our work is originally motivated by the following facts:
(i) The 80/20 rule.
(ii) Temporal locality rules. More recently accessed data will be accessed again in the near future according to the current data access pattern, which is called temporal locality [28, 34]. With temporal locality, popular data is determined by analyzing the users' access to the data. When the popularity of the data passes a dynamic threshold, a replication operation is triggered. The number of replicas is determined based on the system availability, failure probability, and fault tolerance requirements.
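As a rough illustration of this trigger, the sketch below keeps an exponentially decaying access score per data item and replicates items whose score crosses a dynamic threshold (here, 1.5x the mean score). The decay rate and the threshold rule are assumptions.

```python
# Hedged sketch of a temporal-locality replication trigger.
import collections

class PopularityTracker:
    def __init__(self, half_life=100.0):
        self.score = collections.defaultdict(float)
        self.last = {}
        self.half_life = half_life

    def access(self, item, now):
        dt = now - self.last.get(item, now)
        self.score[item] *= 0.5 ** (dt / self.half_life)  # older hits fade
        self.score[item] += 1.0
        self.last[item] = now

    def to_replicate(self):
        scores = self.score.values()
        threshold = 1.5 * sum(scores) / len(scores)       # dynamic threshold
        return [i for i, s in self.score.items() if s > threshold]

t = PopularityTracker()
for ts in range(10):
    t.access("blockA", ts)
t.access("blockB", 10)
print(t.to_replicate())        # ['blockA'] dominates recent accesses
```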
1.2 Paper contributions
(i) Analyzing the mathematical relationship between different failure rates and two different fault
tolerance strategies, the checkpointing fault tolerance strategy and the replication fault tolerance
strategy; (ii) Building a dynamic adaptive checkpointing fault tolerance model and a dynamic
adaptive replication fault tolerance model, and combining the two fault tolerance models together to
maximize the serviceability and to meet the SLOs; and, (iii) Evaluating the dynamic adaptive fault
tolerance strategy under various conditions in large-scale cloud data centers, with consideration of different system-centric parameters, such as the fault tolerance degree, fault tolerance overhead, and response time.

In order to achieve high fault tolerance in clouds, two important problems must be solved in the checkpointing fault tolerance strategy.
1. How often should one insert checkpoints and save system running states? If checkpoints are inserted too frequently, a larger checkpointing overhead, such as large storage consumption, will be imposed by the many inserted checkpoints; this is especially noticeable in cloud environments consisting of virtual nodes. On the contrary, if checkpoints are inserted too infrequently, a larger fault recovery overhead will be incurred, as the system needs to roll back across too many operations when failures do occur.
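A classic first-order answer to this interval question is Young's approximation, which balances checkpointing overhead against the expected recomputation after a failure. This is standard background rather than a formula taken from the DAFT paper itself.

```python
# Young's approximation for the optimal checkpoint interval:
# T_opt ~ sqrt(2 * C * MTBF), valid when C << MTBF.
from math import sqrt

def young_interval(checkpoint_cost_s, mtbf_s):
    return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# e.g. 30 s to write a checkpoint, one failure per 24 h on average:
print(young_interval(30.0, 24 * 3600.0) / 60.0)   # ~38 minutes
```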
2. Which checkpointing strategy should be selected in clouds: the full checkpointing strategy, the incremental checkpointing strategy, or the hybrid checkpointing strategy? The full checkpointing strategy
is a strategy to save the whole system running states periodically to a storage platform, as shown in
Fig. 1(a). If a failure does occur while a system is running, the system can be recovered from the
latest checkpoint rather than from the starting checkpoint, which greatly decreases the recomputing
time. However, a larger checkpointing overhead will be imposed due to the operations for saving the
whole system running states. On the contrary, in the incremental checkpointing strategy, as shown in
Fig. 1(b), the first checkpoint contains complete system running states; subsequent checkpoints only
store pages that have been modified since the previous checkpoint. So the incremental checkpointing
is a strategy introduced to reduce the checkpointing overhead by saving pages that have been
changed instead of saving the whole system running states. However, a larger fault recovery
overhead will be needed because the system has to recover from the starting checkpoint. The hybrid
checkpointing strategy, as shown in Fig. 1(c), combines the full checkpointing strategy and the
incremental checkpointing strategy, and has a trade-off between the checkpointing overhead and the
fault recovery overhead. Upon failure, the system can restart from the state of the last full
checkpointing.
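The trade-off among the three styles can be made concrete with a toy restore-cost model; the state sizes, delta sizes, and full-checkpoint period below are purely illustrative assumptions.

```python
# Toy recovery-cost comparison for full / incremental / hybrid checkpointing.

def restore_cost(style, n, full_every=10, full_size=100.0, delta_size=8.0):
    """n: checkpoints taken since the job started."""
    if style == "full":
        return full_size                          # latest full state only
    if style == "incremental":
        return full_size + delta_size * (n - 1)   # first full + every delta
    if style == "hybrid":
        return full_size + delta_size * ((n - 1) % full_every)
    raise ValueError(style)

for s in ("full", "incremental", "hybrid"):
    print(s, restore_cost(s, n=25))
# full 100.0, incremental 292.0, hybrid 132.0: hybrid pays a little more
# recovery work than full but writes much cheaper checkpoints in between.
```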
Therefore, in the checkpointing fault tolerance strategy, one needs to determine a dynamic adaptive checkpointing strategy that balances frequent against infrequent checkpointing, the full against the incremental checkpointing strategy, the checkpointing overhead against the fault recovery overhead, and high fault tolerance against high cloud SLOs.
In order to achieve high fault tolerance in clouds, three important problems must be solved in the replication fault tolerance strategy:
(i) Which data should be replicated, and when, in the cloud systems to meet the system fault tolerance requirements on reducing waiting time and speeding up data access?
(ii) How many new replicas should be created in the cloud to meet a reasonable system availability requirement?
(iii) Where should the new replicas be placed to meet the system's requirements on task success rate and bandwidth consumption?
Therefore, in the replication fault tolerance strategy, one needs to determine a dynamic adaptive replication strategy that considers which data to replicate and when, how many new replicas are needed and where to place them, and the trade-off between high fault tolerance and high cloud SLOs.

6. Fault Tolerance Management in Cloud Computing: A System-Level Perspective


In (Jhawar et al., 2012), another similar architecture with the same name, FTM, has been suggested. A view of this architecture is shown in Fig. (7). As can be observed, the fault detection task in this architecture is assigned to the Fault Detector component, and recovery operations begin after fault detection. In this phase, two policies, Checkpoint/Restart and Replication, are exploited; they are managed by the Checkpoint Manager and the Replication Manager, respectively.
An approach for realizing generic fault tolerance mechanisms is presented by Jhawar, Piuri and
Santambrogio [12]. They present their approach as independent modules. The approach validates the fault tolerance properties of each mechanism and matches users' requirements with the available fault tolerance modules to obtain a comprehensive solution with the desired properties. Also, a framework is designed that allows integration between the provider's system and the existing cloud infrastructure.
We introduce an innovative, system-level, modular perspective on creating and managing fault tolerance in Clouds. We propose a comprehensive high-level approach that hides the implementation details of the fault tolerance techniques from application developers and users by means of a dedicated service layer. In particular, the service layer allows the user to specify and apply the desired level of fault tolerance, and it does not require knowledge of the fault tolerance techniques that are available in the envisioned Cloud or of their implementations.
This implies that users must understand fault tolerance techniques and tailor their applications by
considering environment specific parameters during the design phase. However, for the applications
to be deployed in the Cloud computing environment, it is difficult to design a holistic fault tolerance
solution that efficiently combines the failure behavior and system architecture of the application. This
difficulty arises due to: 1) high system complexity, and 2) abstraction layers of Cloud computing that
release limited information about the underlying infrastructure to its users.
In contrast with the traditional approach, we advocate a new dimension where applications deployed
in a Cloud computing infrastructure can obtain required fault tolerance properties from a third party.
To support the new dimension, we extend
our work in [5] and propose an approach to realize general fault tolerance mechanisms as independent modules such that each module can function transparently on users' applications. We then enrich each module with a set of metadata that characterizes its fault tolerance properties, and use the metadata to select mechanisms that satisfy users' requirements. Furthermore, we present a scheme that: 1) delivers a comprehensive fault tolerance solution to users' applications by combining the selected fault tolerance mechanisms, and 2) ascertains the properties of a fault tolerance solution by means of runtime monitoring. Based on the proposed approach, we design a framework that integrates easily with the existing Cloud infrastructure and facilitates a third party in offering fault tolerance as a service.
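A minimal sketch of such metadata-driven matching follows; the property names, the module set, and the matching rule are assumptions for illustration, not the paper's actual schema.

```python
# Hedged sketch: pick fault tolerance modules whose advertised metadata
# satisfy every user requirement (string properties must match exactly,
# numeric properties must meet or exceed the requirement).

MODULES = [
    {"name": "heartbeat_detector", "detects": "crash", "latency_ms": 500},
    {"name": "active_replication", "tolerates": "crash", "availability": 0.999},
    {"name": "checkpoint_restart", "tolerates": "crash", "availability": 0.99},
]

def match_modules(requirements, modules=MODULES):
    chosen = []
    for m in modules:
        if all(k in m and (m[k] == v if isinstance(v, str) else m[k] >= v)
               for k, v in requirements.items()):
            chosen.append(m["name"])
    return chosen

print(match_modules({"tolerates": "crash", "availability": 0.995}))
# ['active_replication'] -- the modules composing the delivered solution
```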
Outline of our approach
We propose to insert a dedicated service layer between the computing infrastructure and the
applications which can offer fault tolerance support to each application individually while
abstracting the complexity of the underlying infrastructure. To facilitate a full-fledged support, the
service layer must contain a range of reliability mechanisms and must be able to create a fault
tolerance solution with desired properties on-the-fly that can be delivered to the application. To
achieve this, we build on the idea that a fault tolerance solution can be seen as a combination of a set
of distinct activities coordinated in a specific event-based logic.
Rational Advantages of the new Perspective
Our proposal aims at overcoming the limitations of existing methodologies by offering fault tolerance properties to the applications as an on-demand service. The rational advantages of our approach are as follows.
- It provides flexibility for applications to dynamically adjust their fault tolerance properties and their level of reliability and availability over time. Resource costs can be limited, and performance levels can be modified from one point to another based on business requirements. Achieving these features with traditional mechanisms would be extremely difficult.
- It simplifies the job of application developers, since users are only expected to specify the desired properties for each application and have them delivered transparently. This relaxes the requirement of having expertise and experience both in reliable application development and in managing fault tolerance and failures at runtime. The system-level complexity is also abstracted by the service layer, which allows users to receive specifically composed fault tolerance support for their applications without requiring in-depth knowledge of system-level procedures.
FTM ARCHITECTURE: OVERVIEW
FTM is built to work on top of the hypervisor, spanning all the nodes and traversing the abstraction layers of the Cloud to transparently tolerate failures among the processing nodes. Fig. 1 illustrates the architecture of FTM, which can primarily be viewed as an assemblage of several Web service components, each with a specific functionality. A brief description of the functionality of each component, along with the rationale behind its inclusion in the framework, is provided further in this section.
Replication Manager: FTM provides fault tolerance by replicating users' applications such that a redundant copy of the application is available after a failure happens. In the Cloud computing environment, redundancy can also be applied to the entire VM instance in which the application is hosted. The set of VM instances controlled by a single implementation of the replication service is referred to as a Replica Group. This component receives the reference of the client's VM instance and the expected replication properties, such as the style of replication (active, passive, cold passive, hot passive) and the number of replicas in a replica group, from the FTMKernel component. The Replication Manager also includes techniques to maintain consistency in a Replica Group by updating the state of the backup replicas with that of the primary replica.
Fault Detection/Prediction Manager: To detect replica failures, well-known algorithms such as the gossip-based protocol [20] and the heartbeat message exchange protocol [23] are implemented. Every fault detection service must ideally detect faults immediately after their occurrence and send a notification about the faulty replica to the FTMKernel to invoke services from the Fault Masking Manager and the Recovery Manager. When a failure is detected, the Resource Manager is also notified to update the resource state of the Cloud.
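As an illustration of the heartbeat style of detection, here is a minimal detector that suspects a replica once no heartbeat arrives within a timeout and then notifies a callback standing in for the FTMKernel; the timeout and names are assumptions.

```python
# Minimal heartbeat failure detector sketch.
import time

class HeartbeatDetector:
    def __init__(self, timeout_s=3.0, on_failure=print):
        self.timeout = timeout_s
        self.last_seen = {}
        self.on_failure = on_failure       # e.g. notify the FTMKernel

    def heartbeat(self, replica_id):
        self.last_seen[replica_id] = time.monotonic()

    def check(self):
        now = time.monotonic()
        for rid, seen in list(self.last_seen.items()):
            if now - seen > self.timeout:
                self.on_failure(f"replica {rid} suspected faulty")
                del self.last_seen[rid]    # report each failure once

detector = HeartbeatDetector()
detector.heartbeat("vm-42")
# call detector.check() periodically from a monitoring loop
```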
Fault Masking Manager: This component includes a collection of algorithms that mask the occurrence of failures and prevent faults from resulting in errors. An example of a widely proposed and accepted masking technique in Cloud and virtualization environments is the Live Migration of VM instances [8], where the entire OS (VM instance) is moved to another location while preserving the established sessions.
Recovery Manager: This component includes all the mechanisms that restore error-prone nodes to a normal operational mode.
Messaging Monitor: FTM integrates the WS-RM standard [3] with other Web service specifications, such as the replication approach protocol, so that communication between any two components (and the replicas) is reliable even in the presence of component, system, or network failures. The Messaging Monitor offers the necessary communication infrastructure in two different forms: message exchange among the replicas of a replica group, and inter-component communication within the framework.
Client/Admin Interface: This component is used to obtain users' requirements and acts as an interface between the end user and FTM. After obtaining the user's preferences, the Client interface forwards them to the FTMKernel.
FTMKernel: This is the central computing component of FTM, which manages all the reliability mechanisms present in the framework. It considers the user's requirements and accordingly selects the Web (reliability) services from the other components. The chosen modules are then orchestrated to form an aggregate solution that is delivered to the user's application.
Resource Manager: To achieve efficient and proactive resource allocation and to avoid over-provisioning during failures, the working state of the physical and virtual resources in the Cloud must be continuously monitored. The Resource Manager realizes this functionality in FTM by maintaining a database with detailed logging information about the machines in the Cloud and providing an abstract, simple representation of the working state of resources in the form of a graph.
