
Dependability Modeling

Paulo R. M. Maciel
Center of Informatics, Federal University of Pernambuco, Brazil

Kishor S. Trivedi
Pratt School of Engineering, Duke University, USA

Rivalino Matias Jr.
School of Computing Science, Federal University of Uberlândia, Brazil

Dong Seong Kim
Pratt School of Engineering, Duke University, USA

Abstract
This chapter presents modeling methods and evaluation techniques for computing dependability metrics of systems. The chapter begins by providing a summary of seminal works. After presenting the background, the most prominent model types are presented, together with the respective methods for computing exact values and bounds. The chapter focuses particularly on combinatorial models, although state-space models such as Markov models, as well as hierarchical models, are also presented. Case studies are then presented at the end of the chapter.
Keywords: dependability, modeling, combinatorial models, state space models

1. An Introduction
Due to the ubiquitous provision of services on the Internet, dependability has become an attribute of prime concern in hardware/software development, deployment, and operation. Providing fault-tolerant services is inherently related to the adoption of redundancy. Redundancy can be exploited either in time or in space. Replication of services is usually provided through hosts distributed across the world, so that whenever the service, the underlying host, or the network fails, another service is ready to take over (1). Dependability of a system can be understood as the ability to deliver a specified functionality that can be justifiably trusted (2). Functionality might be a set of roles or services (functions) observed by an outside agent (a human being, another system, etc.) that interacts with the system at its interfaces; the specified functionality of a system is what the system is intended for. This chapter aims to provide an overview of dependability modeling. The chapter starts by briefly describing some early and seminal work, its motivations, and the succeeding advances. Afterwards, a set of fundamental concepts and definitions is introduced. Subsequently, the modeling techniques are classified, defined, and introduced, and a representative set of evaluation methods is presented. Later on, case studies are discussed, modeled, and evaluated.

2. A Brief History
This section provides a summary of early work related to dependability and briefly describes some seminal efforts as well as their relations with currently prevalent methods. This account is certainly incomplete; nonetheless, we hope it covers the fundamental events, people, and research related to what is now called dependability modeling.
Dependability is related to disciplines such as fault tolerance and reliability. The concept of dependable computing first appeared in the 1820s, when Charles Babbage undertook the enterprise of conceiving and constructing a mechanical calculating engine to eliminate the risk of human errors (3) (4). In his book, On the Economy of Machinery and Manufacture, he mentions: "The first objective of every person who attempts to make any article of consumption is, or ought to be, to produce it in perfect form" (5). In the nineteenth century, reliability theory
Book chapter of the book Performance and Dependability in Service Computing: Concepts,
Techniques and Research Directions. Publisher: IGI Global.

evolved from probability and statistics as a way to support the computation of maritime and life insurance rates. In the early twentieth century, such methods were applied to estimate the survivorship of railroad equipment (5) (7).
The first IEEE (formerly AIEE and IRE) public document to mention reliability is "Answers to Questions Relative to High Tension Transmission", which summarizes the meeting of the Board of Directors of the American Institute of Electrical Engineers held on September 26, 1902 (6). In 1905, H. G. Stott and H. R. Stuart discussed "Time-Limit Relays and Duplication of Electrical Apparatus to Secure Reliability of Service" at New York (7) and at Pittsburgh (8). In these works the concept of reliability was primarily qualitative. In 1907, A. A. Markov began the study of an important new type of chance process, in which the outcome of a given experiment can affect the outcome of the next experiment. This type of process is now called a Markov chain (11). In the 1910s, A. K. Erlang studied telephone traffic planning problems for reliable service provisioning (9). Later, in the 1930s, extreme value theory was applied to model the fatigue life of materials by W. Weibull and E. J. Gumbel (9). In 1931, Kolmogorov, in his famous paper "Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung" ("Analytical Methods in Probability Theory"), laid the foundations for the modern theory of Markov processes (13). In the 1940s, quantitative analysis of reliability was applied to many operational and strategic problems in World War II (5).
The first generation of electronic computers was quite undependable, and thus many techniques were investigated for improving reliability, among them design strategies and evaluation methods. Many methods were proposed for improving system dependability, such as error control codes, replication of components, comparison monitoring, and diagnostic routines. The most prominent researchers of that period were Shannon (11), Von Neumann (12), and Moore (13), who proposed and developed theories for building reliable systems from redundant and less reliable components. These were the predecessors of the statistical and probabilistic techniques that form the foundation of modern dependability theory (5).
In the 1950s, reliability became a subject of great engineering interest as a result of the Cold War efforts, failures of American and Soviet rockets, and failures of the first commercial jet aircraft, the British de Havilland Comet (14) (15). Epstein and Sobel's 1953 paper studying the exponential distribution was a landmark contribution (10). In 1954, the Symposium on Reliability and Quality Control (its proceedings evolved into what is now the IEEE Transactions on Reliability) was held for the first time in the United States, and in 1958 the First All-Union Conference on Reliability took place in Moscow (14) (15). In 1957, S. J. Einhorn and F. B. Thiess adopted Markov chains for modeling system intermittence (22), and in 1960, P. M. Anselone employed Markov chains for evaluating the availability of radar systems (23). In 1961, Birnbaum, Esary, and Saunders published a milestone paper introducing coherent structures (17).
Reliability models may be classified into combinatorial (non-state-space) and state-space models. Reliability Block Diagrams (RBD) and Fault Trees (FT) are combinatorial models and the most widely adopted models in reliability evaluation. The RBD is probably the oldest combinatorial technique for reliability analysis. Fault Tree Analysis (FTA) was originally developed in 1962 at Bell Laboratories by H. A. Watson to evaluate the Minuteman I Intercontinental Ballistic Missile Launch Control System. Afterwards, Boeing and AVCO expanded the use of FTA to the entire Minuteman II (22). In 1965, W. H. Pierce unified the Shannon, Von Neumann, and Moore theories of masking and redundancy as the concept of failure tolerance (23) (24). In 1967, A. Avizienis integrated masking methods with practical techniques for error detection, fault diagnosis, and recovery into the concept of fault-tolerant systems (23).
The formation of the IEEE Computer Society Technical Committee on Fault-Tolerant Computing (now the Dependable Computing and Fault Tolerance TC) in 1970 and of the IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance in 1980 were important steps toward defining a consistent set of concepts and terminology. In the early 1980s, Laprie coined the term dependability to encompass concepts such as reliability, availability, safety, confidentiality, maintainability, security, and integrity (17).


In the late 1970s, methods were proposed for mapping Petri nets to Markov chains (28) (29) (30). These models have been widely adopted for the automatic generation of large Markov chains as well as for discrete event simulation. Natkin was the first to apply what are now generally called Stochastic Petri nets to the dependability evaluation of systems (29).

3. Basic Concepts
This section introduces and defines several fundamental concepts, taxonomy and quantitative measures for
dependability.
As mentioned in the beginning of the chapter, the dependability of a system is its capability of delivering a set of trustable services that are observed by outside agents. A service is trustworthy when it implements the system's specified functionality. A system failure occurs when the system fails to provide its specified functionality.
A fault can be defined as the failure of a component of the system, of a subsystem of the system, or of another system which interacts with the considered system. Hence, every fault is a failure from some point of view. A fault can cause other faults, a system failure, or neither. A system with faults that still delivers its specified functionality is said to be fault tolerant; that is, the system does not fail even when there are faulty components. Distinguishing faults from failures is fundamental for understanding the fault tolerance concept. The observable outcome of a fault at the system interface is called a symptom, and the most extreme symptom of a fault is a failure. Therefore, an analyst evaluating the inner parts of a system might detect faulty components or subsystems. From that point of view, a faulty component (or subsystem) has failed, since the level of detail analyzed is lower.
Consider an indicator random variable X(t) that represents the system state at time t: X(t) = 1 represents the operational state and X(t) = 0 the faulty state (see Figure 1). More formally,

X(t) = 1 if S is operational at time t; X(t) = 0 otherwise.    (1)

Figure 1: States of X(t)

Now, consider a random variable T defined as the time to reach the state X(t) = 0, given that the system started in state X(0) = 1 at time t = 0. Therefore, the random variable T represents the time to failure of the system S, F_T(t) its cumulative distribution function (see Figure 2), and f_T(t) the respective density function (see Figure 3), where:

f_T(t) = dF_T(t)/dt  and  F_T(t) = ∫_0^t f_T(s) ds.    (2)

Figure 2: F_T(t), Cumulative Distribution Function

Figure 3: f_T(t), Density Function

The probability that the system S does not fail up to time t (the reliability; see Figure 4) is

R(t) = P(T > t) = 1 − F_T(t).    (3)

Figure 4: R(t), Reliability Function

The probability of the system S failing within the interval (t, t + Δt] may be calculated by:


P(t < T ≤ t + Δt) = F_T(t + Δt) − F_T(t) = ∫_t^{t+Δt} f_T(s) ds.

The probability of the system S failing during the interval (t, t + Δt], given that it has survived up to the time t (the conditional probability of failure), is

P(t < T ≤ t + Δt | T > t) = (F_T(t + Δt) − F_T(t)) / R(t).

(F_T(t + Δt) − F_T(t)) / (Δt R(t)) is the conditional probability of failure per time unit. When Δt → 0, then

λ(t) = lim_{Δt→0} (F_T(t + Δt) − F_T(t)) / (Δt R(t)) = f_T(t) / R(t),    (4)

where λ(t) is named the hazard function. Hazard rates may be characterized as decreasing failure rate (DFR), constant failure rate (CFR), or increasing failure rate (IFR), according to whether λ(t) decreases, remains constant, or increases with t.
Since

λ(t) = f_T(t)/R(t) = −(dR(t)/dt)/R(t),    (5)

thus,

R(t) = exp(−∫_0^t λ(s) ds) = exp(−H(t)),    (6)

where H(t) = ∫_0^t λ(s) ds is the cumulative hazard rate function (cumulative failure rate function).

Consider the hazard rate of an entire population of products over time, where some products fail in early life (infant mortality), others last until wear-out (end of life), and others fail during their useful life period (normal life). Infant mortality failures are usually caused by material, design, and manufacturing problems, whereas wear-out failures are related to fatigue or exhaustion. Normal life failures are considered to be random.
Infant mortality is commonly represented by a decreasing hazard rate (see Figure 5.a), wear-out failures are typically represented by an increasing hazard rate (Figure 5.c), and normal life failures are usually depicted by a constant hazard rate (see Figure 5.b). The superposition of these three hazard rate functions forms the so-called bathtub curve (Figure 5.d).


Figure 5: Hazard rate: (a) Decreasing, (b) Constant, (c) Increasing, (d) Bathtub curve
The mean time to fail (MTTF) is defined by:

MTTF = E[T] = ∫_0^∞ t f_T(t) dt.    (7)

Since f_T(t) = −dR(t)/dt, thus

MTTF = −∫_0^∞ t (dR(t)/dt) dt.

Let u = t and dv = dR(t), and apply integration by parts (∫ u dv = uv − ∫ v du); then

MTTF = −[t R(t)]_0^∞ + ∫_0^∞ R(t) dt.

Since lim_{t→∞} t R(t) = 0, hence

MTTF = ∫_0^∞ R(t) dt,    (8)

which is often easier to compute than (7).
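Equation 8 can be checked numerically. The sketch below (Python; the failure rate value is an illustrative assumption, not taken from the chapter) integrates the reliability function of a CFR (exponential) component and compares the result with the known MTTF of 1/λ.

```python
import math

def mttf_from_reliability(reliability, upper=1e4, steps=100_000):
    """Approximate MTTF = integral of R(t) dt over [0, upper] (Equation 8),
    using the trapezoidal rule; 'upper' must be large enough that R(upper) ~ 0."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h

lam = 0.002  # assumed constant failure rate, in failures per hour
mttf = mttf_from_reliability(lambda t: math.exp(-lam * t))
print(mttf)  # close to 1/lam = 500 hours
```

The same routine works for any reliability function, e.g. a Weibull one, for which no closed form as simple as 1/λ exists.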
Another central tendency reliability measure is the median time to failure, T_med, defined by:

R(T_med) = F_T(T_med) = 0.5.    (9)

The median time to failure divides the time to failure distribution into two halves, where 50% of the failures occur before T_med and the other 50% after.
Consider a continuous time random variable X′(t) that represents the state of a system under repair: X′(t) = 0 when S is failed, and X′(t) = 1 when S has been repaired (see Figure 6). More formally,

X′(t) = 1 if S has been repaired by time t; X′(t) = 0 otherwise.    (10)

Figure 6: States of X′(t)

Now, consider the random variable D that represents the time to reach the state X′(t) = 1, given that the system started in state X′(0) = 0 at time t = 0. Therefore, the random variable D represents the system time to repair, F_D(t) its cumulative distribution function, and f_D(t) the respective density function, where:

f_D(t) = dF_D(t)/dt  and  F_D(t) = ∫_0^t f_D(s) ds.

The probability that the system S will be repaired by time t, considering a specified set of repair resources, is defined as the maintainability:

M(t) = P(D ≤ t) = F_D(t).    (11)

The mean time to repair (MTTR) is defined by:

MTTR = E[D] = ∫_0^∞ t f_D(t) dt.

An alternative, often easier to compute, is

MTTR = ∫_0^∞ (1 − F_D(t)) dt.    (12)

Consider a repairable system S that is either operational (Up) or faulty (Down). Figure 7 shows the system state transition model: X(t) = 1 when S is up and X(t) = 0 when it is down. Whenever the system fails, a set of activities is conducted in order to carry out the restoring process. These activities might encompass administrative time, transportation time, logistics times, etc. When the maintenance team arrives at the system site, the actual repair process may start. This time may be further divided into diagnosis time, actual repair time, checking time, etc. However, for the sake of simplicity, we group these times such that the downtime equals the time to restore, which is composed of the non-repair time (grouping transportation times, order times, delivery times, etc.) and the time to repair (see Figure 8). Thus,

time to restore = non-repair time + time to repair.

Figure 7: States of a Repairable System

Figure 8: Downtime and Uptime


The simplest definition of availability is expressed as the ratio of the expected system uptime to the expected total of system uptime and downtime:

A = E[Uptime] / (E[Uptime] + E[Downtime]).    (13)

Consider that the system started operating at time t = 0 and fails at some time, after which it is restored, and so on (see Figure 7). Therefore, the system availability may also be expressed by:

A = MTTF / (MTTF + MTR),    (14)

where MTR is the mean time to restore, defined by MTR = MNRT + MTTR (MNRT being the mean non-repair time and MTTR the mean time to repair), so:

A = MTTF / (MTTF + MNRT + MTTR).

If MNRT = 0, then MTR = MTTR; therefore:

A = MTTF / (MTTF + MTTR).    (15)
The instantaneous availability is the probability that the system is operational at time t, that is,

A(t) = P(X(t) = 1) = E[X(t)].

If repair is not possible, the instantaneous availability A(t) is equivalent to the reliability R(t). If the system approaches a stationary state as time increases, it is possible to quantify the steady-state availability, which estimates the long-term fraction of time the system is available:

A = lim_{t→∞} A(t).    (16)
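As an illustration of Equations 14 and 15 (the numeric values below are assumptions for the sake of the example), steady-state availability translates directly into expected downtime per year:

```python
MTTF = 2000.0  # mean time to failure, in hours (assumed)
MNRT = 2.0     # mean non-repair time: logistics, transportation, etc. (assumed)
MTTR = 6.0     # mean time to repair, in hours (assumed)

MTR = MNRT + MTTR               # mean time to restore
A = MTTF / (MTTF + MTR)         # steady-state availability (Equation 14)
downtime_hours_per_year = (1.0 - A) * 8760.0

print(A)                        # about 0.99602
print(downtime_hours_per_year)  # about 34.9 hours per year
```

Setting MNRT to zero reproduces Equation 15 and shows how much of the yearly downtime is due to logistics rather than repair itself.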
3.1. Commonly Used Distributions
The time to failure of a component is a non-negative continuous random variable. This section briefly summarizes the continuous distributions that have been widely adopted in dependability evaluation. The most adopted distributions are the exponential; expolynomial distributions such as the Erlang and the hyperexponential; and the Weibull, normal, and lognormal distributions.
A random variable T representing a component's lifetime has an exponential distribution if its probability density function is given by

f_T(t) = λ e^(−λt), t ≥ 0,

where λ > 0 is the parameter of this distribution. The respective reliability function, cumulative distribution function, hazard function (failure rate), mean (mean time to failure), and variance are, respectively:

R(t) = e^(−λt),  F_T(t) = 1 − e^(−λt),  λ(t) = λ,  MTTF = 1/λ,  and  Var[T] = 1/λ².


Table 1 summarizes the density, reliability, cumulative distribution, and hazard functions, as well as the mean and variance, of the above-mentioned distributions.

Table 1: Distribution Summary

Distribution      | Parameters
Exponential       | λ (rate)
Erlang            | k (number of phases); λ (each phase's rate)
Hyperexponential  | λ_i (rates); p_i (probabilities)
Weibull           | η (scale); β (shape)
Normal            | μ (mean); σ² (variance); left truncated at 0 when used as a lifetime distribution
Standard Normal   | μ = 0 (mean); σ² = 1 (variance)
Lognormal         | μ, σ
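To illustrate how the table's entries relate to the bathtub-curve behaviors discussed earlier, the sketch below (Python; the parameter values are assumptions) evaluates the Weibull reliability and hazard functions: shape β < 1 yields a decreasing hazard (infant mortality), β = 1 reduces to the exponential with its constant hazard (normal life), and β > 1 yields an increasing hazard (wear-out).

```python
import math

def weibull_reliability(t, eta, beta):
    """Weibull reliability R(t) = exp(-(t/eta)^beta); beta = 1 gives the exponential."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, eta, beta):
    """Weibull hazard lambda(t) = (beta/eta) * (t/eta)^(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 1000.0  # assumed scale parameter, in hours
print(weibull_reliability(1000.0, eta, 1.0))  # exp(-1), same as the exponential at t = 1/lambda
print(weibull_hazard(100.0, eta, 0.5) > weibull_hazard(500.0, eta, 0.5))   # True: DFR
print(weibull_hazard(100.0, eta, 1.0) == weibull_hazard(500.0, eta, 1.0))  # True: CFR
print(weibull_hazard(100.0, eta, 2.0) < weibull_hazard(500.0, eta, 2.0))   # True: IFR
```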

3.2. Specific Failure Terminology for Software


A software failure may be defined as the occurrence of an out-of-specification result produced by the software system for a given specified input value. Since this definition is consistent with the system failure definition presented previously, the reader might ask: why should we pay special attention to software dependability issues? The reader should bear in mind, however, the pervasive nature of software systems and the many scientific communities involved in making software systems more dependable. These communities have specific backgrounds, many of which are not rooted in system reliability, and they have developed jargon that does not necessarily match the prevailing system dependability terminology.
Software research communities have long pursued dependable software systems. The correction of coding problems and software testing began with the very origin of software development itself. Since then, the term software bug has been broadly applied to refer to mistakes, failures, and faults in a software system, whereas debugging refers to the methodical process of finding bugs.
The formal methods community has produced substantial contributions on models, methods, and strategies for checking software validity and correctness. These communities, along with dependability researchers, have proposed and applied redundancy mechanisms and failure avoidance techniques as means for achieving highly dependable software systems.
In 1985, Jim Gray proposed a classification of software failures (1). He classified the failures (bugs) as Bohrbugs and Heisenbugs. Heisenbugs are transient or intermittent failures (2): if the program state is reinitialized and the failed operation retried, the operation will usually not fail a second time. The term Heisenbug derives from Heisenberg's Uncertainty Principle, which states that it is impossible to simultaneously determine the position and momentum of a particle. On the other hand, if a Bohrbug is present in the software, there will always be a failure on retrying the operation which caused the failure. The word Bohrbug comes from the deterministic model of the atom proposed by Niels Bohr in 1913. Bohrbugs are easily detectable by standard debugging techniques. This terminology is modified and extended in (1).
3.3 Coherent Systems


Consider a system S composed of a set of components C = {c_1, c_2, ..., c_n}, where the state of the system S and of its components could be either operational or failed. Let the discrete random variable x_i indicate the state of component c_i; thus:

x_i = 1 if component c_i is operational, and x_i = 0 if it has failed.    (17)

The vector x = (x_1, x_2, ..., x_n) represents the state of each component of the system, and it is named the state vector. The system state may be represented by a discrete random variable φ(x) = φ(x_1, x_2, ..., x_n), such that

φ(x) = 1 if the system is operational, and φ(x) = 0 if it has failed.    (18)

φ(x) is called the structure function of the system.

If one is interested in representing the system state at a specific time t, the component state variables should be interpreted as random variables at time t. Hence, φ(x(t)) = φ(x_1(t), x_2(t), ..., x_n(t)).
For any component c_i,

φ(x) = x_i φ(1_i, x) + (1 − x_i) φ(0_i, x),    (19)

where (1_i, x) = (x_1, ..., x_{i−1}, 1, x_{i+1}, ..., x_n) and (0_i, x) = (x_1, ..., x_{i−1}, 0, x_{i+1}, ..., x_n).

Equation 19 expresses the system structure function in terms of two conditions. The first term, x_i φ(1_i, x), represents the condition in which the component c_i is operational and the states of the other components are random variables. The second term, (1 − x_i) φ(0_i, x), on the other hand, states the condition in which the component c_i has failed and the states of the other components are random variables. Equation 19 is known as the factoring of the structure function and is very useful for studying complex system structures, since through its repeated application one can eventually reach a subsystem whose structure function is simple to deal with (1).
A component of a system is irrelevant to the dependability of the system if the state of the system is not affected by the state of the component. In mathematical terms, a component c_i is said to be irrelevant to the structure function φ if φ(1_i, x) = φ(0_i, x) for every x. A system with structure function φ is said to be coherent if and only if φ is non-decreasing in each x_i and every component is relevant. A function φ is non-decreasing if, for every two state vectors x and y such that x < y, φ(x) ≤ φ(y). Another aspect of coherence that should be highlighted is that replacing a failed component in a working system does not make the system fail. It does not mean, however, that a failed system will start working if a failed component is substituted by an operational component.
Example 1: Consider a coherent system S composed of three blocks, a, b, and c (see Figure 10). Wherever needed, and when the context is clear, the system may also be referred to by its set of components.


Figure 10: Structure function

Using Equation 19 and first factoring on component a, we have:

φ(x) = x_a φ(1_a, x_b, x_c) + (1 − x_a) φ(0_a, x_b, x_c).

Since φ(0_a, x_b, x_c) = 0 (the system fails whenever a fails), thus:

φ(x) = x_a φ(1_a, x_b, x_c).

Now factoring φ(1_a, x_b, x_c) using Equation 19 on component b:

φ(1_a, x_b, x_c) = x_b φ(1_a, 1_b, x_c) + (1 − x_b) φ(1_a, 0_b, x_c).

As φ(1_a, 1_b, x_c) = 1, thus:

φ(1_a, x_b, x_c) = x_b + (1 − x_b) φ(1_a, 0_b, x_c).

Therefore:

φ(x) = x_a [x_b + (1 − x_b) φ(1_a, 0_b, x_c)].

Factor φ(1_a, 0_b, x_c) on component c to get:

φ(1_a, 0_b, x_c) = x_c φ(1_a, 0_b, 1_c) + (1 − x_c) φ(1_a, 0_b, 0_c).

Since φ(1_a, 0_b, 1_c) = 1 and φ(1_a, 0_b, 0_c) = 0, thus:

φ(1_a, 0_b, x_c) = x_c.

So

φ(x) = x_a [x_b + (1 − x_b) x_c].
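Since each state variable is binary, a factored expression can be verified exhaustively. A minimal sketch (Python, illustrative) checks that the structure function of Example 1, x_a[x_b + (1 − x_b)x_c], agrees with the intended behavior "a works and (b or c) works" on all 2³ state vectors:

```python
from itertools import product

def phi(xa, xb, xc):
    # Structure function obtained by factoring in Example 1
    return xa * (xb + (1 - xb) * xc)

for xa, xb, xc in product((0, 1), repeat=3):
    expected = 1 if (xa == 1 and (xb == 1 or xc == 1)) else 0
    assert phi(xa, xb, xc) == expected
print("phi matches 'a AND (b OR c)' on all 8 state vectors")
```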

In some cases, simplifying the structure function may not be an easy task. A logic function of a coherent system may be adopted to simplify system functions through Boolean algebra.
As described earlier, assume a system S composed of a set of components C = {c_1, c_2, ..., c_n}, where the state of the system S and of its components could be either operational or faulty. Let b_i denote the operational state of component c_i, and its complement, b̄_i, indicate that c_i has failed.

The Boolean state vector b = (b_1, b_2, ..., b_n) represents the Boolean state of each component of the system. The system state could be either operational or failed. The operational system state is represented by the logic function ψ(b) = 1, whereas ψ(b) = 0 denotes a faulty system. ψ(1_i, b) and ψ(0_i, b) represent a system where the component c_i is either working or failed, respectively.

Using the notation described, b_i is equivalent to x_i = 1, b̄_i represents x_i = 0, ψ(b) = 1 depicts φ(x) = 1, ψ(b) = 0 represents φ(x) = 0, the conjunction b_i ∧ b_j is the respective counterpart of the product x_i x_j, and the disjunction b_i ∨ b_j is the counterpart of x_i + x_j − x_i x_j.

Let us consider a series system composed of a set of components c_1, c_2, ..., c_n; the logic function of a series system is

ψ(b) = b_1 ∧ b_2 ∧ ... ∧ b_n.

For a parallel system with n components, the logic function is

ψ(b) = b_1 ∨ b_2 ∨ ... ∨ b_n.

For any component c_i, the logic function may be represented by

ψ(b) = (b_i ∧ ψ(1_i, b)) ∨ (b̄_i ∧ ψ(0_i, b)).

Adopting the system presented in Example 1, one may observe that the system is functioning if components a and b are working or if a and c are. This gives the system logic function. More formally:

ψ(b) = (b_a ∧ b_b) ∨ (b_a ∧ b_c).

Therefore,

ψ(b) = b_a ∧ (b_b ∨ b_c).

The structure function may be obtained from the logic function using the respective counterparts. Hence, since b_a is represented by x_a, and b_b ∨ b_c corresponds to x_b + x_c − x_b x_c, then:

φ(x) = x_a (x_b + x_c − x_b x_c),

which is the same result obtained in Example 1.

4. Modeling Techniques
The aim of this section is to introduce a set of important model types for dependability evaluation and to offer the reader a summary view of the key methods. The section begins with a classification of models; then the main combinatorial and state-space models are described along with the respective analysis methods.
4.1. Classification of Modeling Techniques

This section presents a classification of dependability models. These models may be broadly classified into combinatorial and state-space models. State-space models may also be referred to as non-combinatorial models, and combinatorial models may be identified as non-state-space models.
Combinatorial models capture the conditions that make a system fail (or work) in terms of structural relationships between the system components. These relations describe which sets of components (and subsystems) of the system should be either properly working or faulty for the system as a whole to be working properly.

State-space models represent the system behavior (failure and repair activities) by its states and by event occurrences expressed as labeled state transitions. Labels can be probabilities, rates, or distribution functions. These models allow the representation of more complex relations between components of the system, such as dependencies involving subsystems and resource constraints. Some state-space models may also be evaluated by discrete event simulation when the state space is intractably large or when a combination of non-exponential distributions prohibits an analytic solution. In some special cases, state-space analytic models can be solved to derive a closed-form answer, but generally a numerical solution of the underlying equations is necessary, using a software package.
The most prominent combinatorial model types are Reliability Block Diagrams, Fault Trees, and Reliability Graphs; Markov chains, Stochastic Petri nets, and stochastic process algebras are the most widely used state-space models. Next we introduce these model types and their respective evaluation methods.

4.2. Combinatorial Models

This section describes the two most relevant combinatorial model types for dependability evaluation, namely,
Reliability Block Diagrams (RBD) and Fault Trees (FT), and their respective evaluation methods.
The first two sections define each model type: their syntax, semantics, modeling power, and constraints. Each model type is introduced and then explained by examples, so as to help the reader not only master the related mathematics but also acquire practical modeling skills.
The subsequent sections concern the analysis methods applied to the previously presented models. First, a basic set of standard methods is presented. These methods apply to models of systems in which components are arranged in series, in parallel, or as a combination of series and parallel compositions. Afterwards, a set of methods that applies to non-series-parallel configurations is presented. These methods are more general than the basic ones, since they can evaluate sophisticated component compositions, but their computational complexity is a concern. The methods described are series-parallel reductions, minimal cut and path computation methods, decomposition, sum of disjoint products (SDP), and delta-star and star-delta transformations. In addition, dependability bounds computation is presented. Finally, measures of component importance are presented.

4.2.1. Reliability Block Diagrams


This section presents the Reliability Block Diagram (RBD) model. RBDs are networks of functional blocks connected according to the effect of each block's failure on the system reliability. An RBD is not a block schematic diagram of a system, although the two might be isomorphic in some particular cases. RBDs only indicate how the functioning of the system's components affects the functioning of the system. Although the RBD was initially proposed as a model for calculating reliability, it can be used for computing other dependability metrics, such as availability and maintainability. The RBD is a success-oriented model. In RBDs, the system state is described as a Boolean function of the states of its components or subsystems, where the Boolean function evaluates to true whenever at least the minimal number of components is operational and able to perform the intended functionality. If the system performs more than one function (operation), a Boolean function should be defined for each operational mode (each function). The meaning of "intended functionality" must be specified and depends on the objective of the study; hence, the system being operational for a particular function does not mean it is also operational for another function. The system state may also be described by the structure function of its components or subsystems, so that the system structure function evaluates to 1 whenever at least the minimal number of components is operational.
RBDs have a source and a target vertex, a set of blocks (usually rectangles), where each block represents a component, and arcs connecting the blocks and the vertices (see Figure 11). The source vertex is usually placed at the left-hand side of the diagram, whereas the target vertex is positioned at the right.


Figure 11: Reliability Block Diagram


Graphically, when a component is working, its block can be substituted by an arc; otherwise, the block is removed. The system is properly working when there is at least one path from the source vertex to the target vertex. RBDs have been adopted to evaluate series-parallel and more generic structures, such as bridges, stars, and delta arrangements. The simplest and most common RBDs support series-parallel structures only.
Consider a series structure composed of n independent components, as presented in Figure 12, where p_1, p_2, ..., p_n are the functioning probabilities of blocks b_1, b_2, ..., b_n. These probabilities could be reliabilities or availabilities, for instance.

Figure 12: RBD of Series Structure


The probability that the system is operational is

P = ∏_{i=1}^{n} p_i.    (23)

Therefore, the system reliability is

R_S(t) = ∏_{i=1}^{n} R_i(t),    (24)

where R_i(t) is the reliability of block b_i.

Likewise, the system instantaneous availability is

A_S(t) = ∏_{i=1}^{n} A_i(t),    (25)

where A_i(t) is the instantaneous availability of block b_i.

The steady-state availability is

A_S = ∏_{i=1}^{n} A_i,    (26)

where A_i is the steady-state availability of block b_i.
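The series formulas all reduce to a product over the blocks. A minimal sketch (Python; the block values are assumed for illustration):

```python
def series_probability(ps):
    """Series structure: the system works only if every block works (Equations 23-26)."""
    result = 1.0
    for p in ps:
        result *= p
    return result

# Assumed reliabilities of three blocks in series
print(series_probability([0.99, 0.98, 0.97]))  # about 0.9411
```

Note that a series composition is always less dependable than its weakest block.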
Now consider a parallel structure composed of n independent components, as presented in Figure 13, where p_1, p_2, ..., p_n are the functioning probabilities of blocks b_1, b_2, ..., b_n.

Figure 13: RBD of Parallel Structure


The probability that the system is operational is

P = 1 − ∏_{i=1}^{n} (1 − p_i).    (27)

Thus

P = 1 − ∏_{i=1}^{n} q_i,    (28)

where q_i = 1 − p_i is the failure probability of block b_i.

The system reliability is then:

R_S(t) = 1 − ∏_{i=1}^{n} (1 − R_i(t)),    (29)

such that

UR_S(t) = ∏_{i=1}^{n} UR_i(t),    (30)

where R_i(t) and UR_i(t) = 1 − R_i(t) are the reliability and the unreliability of block b_i, respectively.

Similarly, the system instantaneous availability is

A_S(t) = 1 − ∏_{i=1}^{n} (1 − A_i(t)),    (31)

such that UA_S(t) = ∏_{i=1}^{n} UA_i(t), where A_i(t) and UA_i(t) = 1 − A_i(t) are the instantaneous availability and unavailability of block b_i, respectively.

The steady-state availability is

A_S = 1 − ∏_{i=1}^{n} (1 − A_i),    (32)

where A_i and UA_i = 1 − A_i are the steady-state availability and unavailability of block b_i, respectively.
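Dually to the series case, the parallel formulas take the complement of the product of the blocks' failure probabilities. A sketch (Python, assumed values):

```python
def parallel_probability(ps):
    """Parallel structure: the system fails only if every block fails (Equations 27-32)."""
    q = 1.0
    for p in ps:
        q *= (1.0 - p)  # product of the blocks' unreliabilities/unavailabilities
    return 1.0 - q

# Two assumed blocks of 0.9 each: the pair is more dependable than either block alone
print(parallel_probability([0.9, 0.9]))  # 1 - 0.1 * 0.1 = 0.99
```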

Due to the importance of the parallel structure, the following simplifying notation is adopted:

Example 2: Consider a system represented by the RBD in Figure 14. This model is composed of four blocks (b_1, b_2, b_3, b_4), where each block b_i has reliability r_i.


Figure 14: RBD of System


The system reliability of the system

is

In a coherent system, a state vector x is called a path vector if φ(x) = 1, and the set of its operational components is the respective path set. Hence, a path set is a set of components such that, if every one of its components is operational, the system is also operational. A path set is a minimal path set if φ(y) = 0 for any y < x; that is, it comprises a minimal number of components that should be operational for the system to be operational. In a series system of n components (see Figure 12), there is only one minimal path set, and it is composed of every component of the system. On the other hand, if we consider a parallel system with n components as depicted in Figure 13, then the system has n minimal path sets, where each set is composed of only one component.
A state vector x is called a cut vector if φ(x) = 0, and the set of its failed components is the respective cut set. A cut set is a minimal cut set if φ(y) = 1 for any y > x. In the series system, there are n minimal cut sets, where each cut set is composed of one component only. The minimal cut set of the parallel system is composed of all the components of the system. The system of Figure 14 has two minimal path sets and three minimal cut sets.
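For small systems, these definitions can be checked mechanically by enumerating all state vectors. The sketch below is a brute-force illustration of ours (exponential in n, so only suitable for small examples); the structure function is supplied as a Python function over 0/1 state tuples:

```python
from itertools import product

def minimal_path_sets(phi, n):
    """All minimal path sets of a coherent structure function phi
    over n components; phi maps a 0/1 state tuple to 0 or 1."""
    paths = [frozenset(i for i in range(n) if x[i])
             for x in product((0, 1), repeat=n) if phi(x)]
    # keep only path sets with no strictly smaller path set
    return [p for p in paths if not any(q < p for q in paths)]

def minimal_cut_sets(phi, n):
    """All minimal cut sets: failed-component sets of failing states,
    keeping only those with no strictly smaller cut set."""
    cuts = [frozenset(i for i in range(n) if not x[i])
            for x in product((0, 1), repeat=n) if not phi(x)]
    return [c for c in cuts if not any(d < c for d in cuts)]
```

For a two-component series system (phi = x1 AND x2), this yields the single minimal path set containing both components and two one-component minimal cut sets, as stated above.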
Structures like k-out-of-n, bridges, and delta and star arrangements have been customarily represented by RBDs; nevertheless, such structures can only be represented if the components are replicated in the model. Consider a system composed of 3 identical and independent components (b_1, b_2, b_3) that is operational if at least 2 out of its 3 components are working properly. The success probability of each of those blocks is p. This system can be considered as a single block (see Figure 15) whose success probability (reliability, availability or maintainability) is given by

P = 3p² − 2p³.   (33)

Figure 15: 2 out of 3 system
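Equation 33 is the k = 2, n = 3 case of the general k-out-of-n expression for identical, independent blocks. A one-function Python sketch of ours, under that independence assumption:

```python
from math import comb

def k_out_of_n(k, n, p):
    """Success probability of a k-out-of-n structure with identical,
    independent blocks of success probability p: sum over the number
    of working blocks i >= k of C(n, i) * p^i * (1-p)^(n-i)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

For k = 2 and n = 3 this reduces algebraically to 3p² − 2p³, matching Equation 33.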


The bridge structure may also be considered as a single block with its respective failure probability, or it can be transformed into the equivalent series-parallel RBD and then evaluated. Consider the bridge system depicted in Figure 16, composed of blocks b_1, b_2, b_3, b_4 and b_5, with b_3 as the bridge component.

Figure 16: Bridge System

The series-parallel equivalent RBD is presented in Figure 16.b, and its structure function is

φ(x) = 1 − (1 − x_1 x_4)(1 − x_2 x_5)(1 − x_1 x_3 x_5)(1 − x_2 x_3 x_4).   (34)

The reader should observe that the series-parallel equivalent model replicates every component of the bridge (every component appears twice in the model of Figure 16.b).
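Rather than building the replicated series-parallel equivalent, the bridge probability can also be checked by enumerating all 2⁵ component states. The sketch below is ours and assumes the labeling used above (b_3 as the bridge component; minimal paths {b_1, b_4}, {b_2, b_5}, {b_1, b_3, b_5}, {b_2, b_3, b_4}):

```python
from itertools import product

def bridge_reliability(p):
    """Success probability of the five-component bridge, by summing the
    probabilities of all state vectors for which the bridge operates.
    p is the list of the five component success probabilities."""
    def phi(x):
        x1, x2, x3, x4, x5 = x
        return (x1 and x4) or (x2 and x5) or (x1 and x3 and x5) or (x2 and x3 and x4)
    total = 0.0
    for x in product((0, 1), repeat=5):
        weight = 1.0
        for xi, pi in zip(x, p):
            weight *= pi if xi else (1 - pi)
        if phi(x):
            total += weight
    return total
```

For identical components this agrees with the closed form 2p² + 2p³ − 5p⁴ + 2p⁵.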

4.2.2. Fault Tree


This section presents the Fault Tree (FT) model. The fault tree was first proposed at Bell Telephone Laboratories in 1962 by H. A. Watson to evaluate the Minuteman I missile. Differently from RBDs, the FT is a failure-oriented model; as with RBDs, it was initially proposed for calculating reliability. Nevertheless, FTs have also been extensively applied to evaluate other dependability metrics.
In an FT, the system failure is represented by the TOP event (undesirable state). The TOP event is caused by lower-level events (faults, component failures etc.) that alone or combined may lead to the TOP event. The combination of events is described by logic gates. Events that are not represented as combinations of other events are named basic events. The term event is somewhat misleading, since it actually represents a state reached by event occurrences.
In FTs, the system state may be described by a Boolean function that evaluates to true whenever at least one minimal cut evaluates to true. The system state may also be represented by a structure function which, in contrast to RBDs, represents the system failure. If the system has more than one undesirable state, a Boolean function (or a structure function) should be defined for each failure mode, that is, one function should be constructed to describe the combination of events that causes each undesirable state.
The most common FT elements are the TOP event, AND and OR gates, and basic events. Many extensions have been
proposed which adopt other gates such as XOR, transfer and priority gates. This chapter, however, does not cover
these extensions.
As in RBDs, the system state may also be described by the FT structure function. The FT structure function is
evaluated to 1 whenever at least one structure function of a minimal cut is evaluated as 1.
Consider a system S composed of a set of components C = {c_1, c_2, …, c_n}. Let the discrete random variable x_i indicate the state of component c_i; thus:

x_i = 1 if component c_i has failed, and x_i = 0 otherwise.   (17)


Table 2: Basic Symbols and their description²

Symbol                  Description
TOP event               Represents the system failure.
Basic event             An event that may cause a system failure.
Basic repeated event    A basic event that appears in more than one place in the tree.
AND gate                Generates an event (A) if all events Bi have occurred.
OR gate                 Generates an event (A) if at least one event Bi has occurred.
KOFN gate               Generates an event (A) if at least K events Bi out of N have occurred.
Comment rectangle       Attaches a comment to an element of the tree.

The vector x = (x_1, x_2, …, x_n) represents the state of each component of the system, and it is named the state vector. The system state may be represented by a discrete random variable ψ(x) = ψ(x_1, x_2, …, x_n), such that

ψ(x) = 1 if the system has failed, and ψ(x) = 0 otherwise.   (18)

ψ(x) is named the Fault Tree structure function of the system.

As ψ(x) is a Bernoulli random variable, its expected value is equal to the probability of occurrence of the respective TOP event. In other words, E[ψ(x)] is the system failure probability, which is denoted by P{ψ(x) = 1}.
Example 3: Consider a system in which software applications read, write and modify the content of a storage device (source). The system periodically replicates the production data (generated by the software applications) of one storage device (D_1) in two storage replicas (targets), so as to allow recovering the data in the event of data loss or data corruption. The system is composed of three storage devices (D_1, D_2, D_3), one server and one hub that connects the disks D_2 and D_3 to the server (see Figure 17.a).
The system is considered to have failed if the hardware infrastructure does not allow the software applications to read, write or modify data on D_1, or if no data replica is available, that is, if both disks D_2 and D_3 have failed. Hence, if D_1 or the Server or the Hub fails, or both replica storages (D_2, D_3) are faulty, the system fails. The respective FT is presented

² It is important to stress that the graphical representation of the symbols that denote these constructors may vary according to the chosen standard and the adopted tools.


in Figure 17.b. For the sake of conciseness, the Boolean variables representing the events (faults) of each device are named after the respective devices; hence d_1, d_2, d_3, s and h denote the failures of D_1, D_2, D_3, the Server and the Hub, respectively.

(a)

(b)

Figure 17: Data Replication.


Denote by Ψ the FT logic function, the counterpart of the FT structure function ψ. In the present context, the events of interest are malfunctioning events (faults, failures, human errors etc.). Hence, let the Boolean variables d_1, d_2, d_3, s and h denote the occurrence of failures of D_1, D_2, D_3, the Server and the Hub, respectively. According to the notation previously introduced, each Boolean variable is equivalent to its state variable taking the value 1, and Ψ (the logic function that describes the conditions causing a system failure) is the counterpart of ψ (the FT structure function, which represents system failures).
In this example, the FT logic function is

Ψ = d_1 ∨ s ∨ h ∨ (d_2 ∧ d_3).

The respective FT structure function may be expressed as

ψ(x) = 1 − (1 − x_{d1})(1 − x_s)(1 − x_h)(1 − x_{d2} x_{d3}).

The reader may note in the above expression that if x_{d1} = 1, or x_s = 1, or x_h = 1, or x_{d2} = x_{d3} = 1, then ψ(x) = 1, which denotes a system failure.
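For independent basic events, the TOP-event probability follows directly from the structure function above. A small Python sketch (the function name and any sample probabilities are ours, not from the chapter):

```python
def replication_ft_top_probability(q_d1, q_s, q_h, q_d2, q_d3):
    """TOP-event probability of the Example 3 fault tree,
    TOP = D1 OR Server OR Hub OR (D2 AND D3), assuming independent
    basic events; each argument is a device failure probability."""
    return 1 - (1 - q_d1) * (1 - q_s) * (1 - q_h) * (1 - q_d2 * q_d3)
```

Note how the AND gate contributes the product q_d2 · q_d3, while each direct OR input contributes a (1 − q) factor to the complement.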

Example 4: Now consider a system composed of two processors (P_1 and P_2), two memory systems local to each processor (M_1 and M_2), and three storage devices (D_1, D_2 and D_3); see Figure 18.a. D_1 and D_2 are only accessed by software applications running on P_1 and P_2, respectively. If either of these devices (D_1, D_2) fails, the storage device D_3 takes over its functions, so as to store the data generated by the software applications running on either processor.

(a)

(b)

(c)

Figure 18: System with shared device.


The system is considered to have failed if both of its subsystems (P_1, M_1, D_1, D_3 and P_2, M_2, D_2, D_3) fail.
Each subsystem fails if its processor fails, or if the respective local memory fails, or if both of its storage devices fail.
The FT structure function is easily derived. Naming the respective state variables x_{P1}, x_{M1}, x_{D1}, x_{P2}, x_{M2}, x_{D2} and x_{D3}, then:

ψ(x) = [1 − (1 − x_{P1})(1 − x_{M1})(1 − x_{D1} x_{D3})] × [1 − (1 − x_{P2})(1 − x_{M2})(1 − x_{D2} x_{D3})].

One should observe that when x_{D3} = 1, the FT structure function reduces to

ψ(x) = [1 − (1 − x_{P1})(1 − x_{M1})(1 − x_{D1})] × [1 − (1 − x_{P2})(1 − x_{M2})(1 − x_{D2})],

and when x_{D3} = 0, it reduces to

ψ(x) = [1 − (1 − x_{P1})(1 − x_{M1})] × [1 − (1 − x_{P2})(1 − x_{M2})].

Therefore, the original FT in Figure 18.b may be factored into two FTs, one considering x_{D3} = 1 and the other x_{D3} = 0, as shown in Figure 18.c.

4.3. Analysis Methods

This section introduces some important methods adopted in combinatorial models for calculating the system probability of failure when components are independent. To simplify the notation, the reliability (R_i), steady-state availability (A_i) and instantaneous availability (A_i(t)) of components and of the system may replace p_i and P in the expressions that follow.
4.3.1. Expected Value of the Structure Function
The most straightforward strategy for computing the reliability (availability, maintainability) of a system composed of independent components is through the respective definition. Hence, consider a system S and its structure function φ(x). The system reliability is defined by

R_S = P{φ(x) = 1} = E[φ(x)],

since φ(x) is a Bernoulli random variable. As each x_i is a binary variable, x_i^k = x_i for any i and k; hence E[φ(x)] is a polynomial function in which each variable has degree 1.

Summarizing, the main steps for computing the system operational (or failure) probability by this method are:
i) obtain the system structure function;
ii) remove the powers of each variable x_i; and
iii) replace each variable x_i by the respective probability p_i.


Example 5: Consider a 2-out-of-3 system represented by the RBD in Figure 15.a. The structure function of the RBD presented in Figure 15.b is

φ(x) = 1 − (1 − x_1 x_2)(1 − x_1 x_3)(1 − x_2 x_3).   (35)

Considering that x_i is a binary variable, x_i^k = x_i for any i and k; hence, after simplification,

φ(x) = x_1 x_2 + x_1 x_3 + x_2 x_3 − 2 x_1 x_2 x_3.   (36)

Since φ(x) is a Bernoulli random variable, its expected value is equal to P{φ(x) = 1}, that is, E[φ(x)] = P{φ(x) = 1}. Therefore

R_S = p_1 p_2 + p_1 p_3 + p_2 p_3 − 2 p_1 p_2 p_3.   (37)

As p_1 = p_2 = p_3 = p,

R_S = 3p² − 2p³,   (38)

which is equal to Equation 33.
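The three steps can be verified numerically: the multilinear form of Equation 37 evaluated at the component reliabilities must equal the brute-force expectation of the structure function. A Python sketch of ours, assuming independent components:

```python
from itertools import product

def two_of_three_multilinear(p1, p2, p3):
    """Eq. 37: structure function with variable powers removed (step ii)
    and each variable replaced by its probability (step iii)."""
    return p1*p2 + p1*p3 + p2*p3 - 2*p1*p2*p3

def two_of_three_bruteforce(p):
    """Direct E[phi(X)]: sum of state probabilities over the states
    with at least two working components."""
    total = 0.0
    for x in product((0, 1), repeat=3):
        if sum(x) >= 2:
            weight = 1.0
            for xi, pi in zip(x, p):
                weight *= pi if xi else (1 - pi)
            total += weight
    return total
```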
4.3.2. Pivotal Decomposition or Factoring

This method is based on conditioning the system state on the states of certain components. Consider the system structure function φ(x) as depicted in Equation 19 and identify the pivot component i; then

φ(x) = x_i φ(1_i, x) + (1 − x_i) φ(0_i, x),

where (1_i, x) denotes the state vector x with x_i set to 1, and (0_i, x) the state vector x with x_i set to 0. As x_i is a Bernoulli random variable:

E[φ(x)] = p_i E[φ(1_i, x)] + (1 − p_i) E[φ(0_i, x)].

Since E[φ(x)] = P{φ(x) = 1} and E[x_i] = p_i, then:

P{φ(x) = 1} = p_i P{φ(1_i, x) = 1} + (1 − p_i) P{φ(0_i, x) = 1}.   (39)

It may also be represented by

R_S = p_i R_S(1_i, x) + (1 − p_i) R_S(0_i, x).   (40)
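Equation 39 can be applied recursively, pivoting on one component at a time until the structure function is fully determined; this yields a general (though exponential-time) evaluation procedure. A Python sketch of ours for independent components:

```python
def factor_reliability(phi, probs):
    """Recursive pivotal decomposition (Eq. 39): condition on the next
    component being up (probability p) or down (1 - p) and recurse.
    phi maps a complete 0/1 state tuple to 0 or 1."""
    def rec(fixed, remaining):
        if not remaining:
            return float(phi(tuple(fixed)))
        p = remaining[0]
        return (p * rec(fixed + [1], remaining[1:])
                + (1 - p) * rec(fixed + [0], remaining[1:]))
    return rec([], list(probs))
```

Applied to the 2-out-of-3 structure with p = 0.9 per component, it returns 0.972, matching Equation 33.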


Example 6: Consider the system composed of three components, a, b and c, depicted in Figure 10, and let φ(x_a, x_b, x_c) denote the system structure function. Factoring on component a,

P{φ(x) = 1} = p_a P{φ(1, x_b, x_c) = 1} + (1 − p_a) P{φ(0, x_b, x_c) = 1}.

Now factoring each conditional structure on component b,

P{φ(1, x_b, x_c) = 1} = p_b P{φ(1, 1, x_c) = 1} + (1 − p_b) P{φ(1, 0, x_c) = 1},

and

P{φ(0, x_b, x_c) = 1} = p_b P{φ(0, 1, x_c) = 1} + (1 − p_b) P{φ(0, 0, x_c) = 1}.

Each of the four remaining conditional structure functions depends only on x_c, so each probability P{φ(·, ·, x_c) = 1} is either 0, 1 or p_c, and is read off directly from the structure depicted in Figure 10. Combining the terms yields the system reliability as a polynomial in p_a, p_b and p_c.
4.3.3. Reductions
Plain series and parallel systems are the most fundamental dependability structures. The dependability of such systems is analyzed through the equations described in Section 4.2.1. Other more complex structures, such as k-out-of-n and bridge structures, may also be directly evaluated as single components using the equations also presented in Section 4.2.1.
The dependability evaluation of complex system structures might be conducted iteratively by identifying series, parallel, k-out-of-n and bridge subsystems, evaluating each of those subsystems, and then reducing each subsystem to one respective equivalent block. This process may be applied iteratively to the resulting structures until a single block remains.
Consider a series system composed of n components (Figure 19.a) whose failure probabilities are q_1, q_2, …, q_n. This system may be reduced to a one-component equivalent system (Figure 19.b) whose failure probability is

q_s = 1 − ∏_{i=1}^{n} (1 − q_i).

Figure 19: Series Reduction

A parallel system composed of n components (Figure 20.a) whose failure probabilities are q_1, q_2, …, q_n may be reduced to a one-component equivalent system (Figure 20.b) whose non-failure probability is 1 − ∏_{i=1}^{n} q_i, and its respective failure probability is

q_p = ∏_{i=1}^{n} q_i.

Figure 20: Parallel Reduction


K-out-of-n and bridge structures may also be represented by a one-component equivalent block, as described in Section 4.2.1.
Example 7: Consider the system represented in Figure 21, composed of four basic blocks (b_1, b_2, b_3, b_4), one 2-out-of-3 structure and one bridge structure. The three components of the 2-out-of-3 block are equivalent, that is, the failure probability of each of those components is the same. The failure probabilities of components b_1, …, b_4 and of the bridge structure are denoted q_1, …, q_4 and q_B, respectively.

Figure 21: Reductions System


The 2-out-of-3 structure can be represented by one equivalent block whose reliability is given by Equation 33. The bridge structure can be transformed into one component (see Figure 22) whose failure probability q_B is computed as described in Section 4.2.1.

Figure 22: Reductions After bridge reduction


After that, two series reductions may be applied: one reducing a set of series blocks into a single equivalent block, and a second that combines another set of series blocks and reduces it to its equivalent block. The reliability of each equivalent block is the product of the reliabilities of the blocks it replaces. The resulting RBD is depicted in Figure 23.

Figure 23: Reductions After first series reductions


Now a parallel reduction may be applied to merge the remaining parallel blocks. Figure 24 shows the RBD after that reduction. The resulting block represents the parallel composition, whose reliability is computed by Equation 29.

Figure 24: Reductions After the parallel reduction

Finally, a last series reduction may be applied to the RBD depicted in Figure 24 and a one-block RBD is generated (Figure 25), whose reliability is the product of the reliabilities of the remaining blocks.

Figure 25: Reductions Final RBD
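The reduction process amounts to repeatedly replacing a recognized sub-structure by one equivalent block. The sketch below (ours) evaluates a hypothetical arrangement, not the actual topology of Figure 21: a 2-out-of-3 block in series with a redundant pair, with illustrative reliabilities:

```python
from math import prod

def series(rs):
    """Series reduction: the equivalent block's reliability is the product."""
    return prod(rs)

def parallel(rs):
    """Parallel reduction: complement of the product of the unreliabilities."""
    return 1 - prod(1 - r for r in rs)

r_2oo3 = 3 * 0.9**2 - 2 * 0.9**3     # Eq. 33 with p = 0.9 (illustrative)
r_pair = parallel([0.8, 0.8])        # parallel pair reduced to one block
r_sys = series([r_2oo3, r_pair])     # final series reduction
```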

4.3.4. Computation Based on Minimal Paths and Minimal Cuts


Structure functions define the arrangement of components in a system. These arrangements can also be expressed by
the path and cut sets.

Consider a system S with n components and its structure function φ(x), where C = {c_1, c_2, …, c_n} is the set of components. A state vector x is named a path vector if φ(x) = 1, and the respective set of operational components is defined as its path set. More formally, the respective path set of a state vector x is defined by PS(x) = {c_i | x_i = 1}. A path vector x is called a minimal path vector if φ(y) = 0 for any y < x, and the respective path set is named a minimal path set.
A state vector x is named a cut vector if φ(x) = 0, and the respective set of faulty components is defined as its cut set. Therefore, CS(x) = {c_i | x_i = 0}. A cut vector x is called a minimal cut vector if φ(y) = 1 for any y > x, and the respective cut set is named a minimal cut set.

Figure 26: Sets and Cuts


Example 8: Consider a system represented by the RBD presented in Figure 26, in which block b_1 is in series with the parallel arrangement of blocks b_2 and b_3. {b_1, b_2}, {b_1, b_3} and {b_1, b_2, b_3} are path sets; and {b_1}, {b_2, b_3}, {b_1, b_2}, {b_1, b_3} and {b_1, b_2, b_3} are cut sets. x = (1, 1, 0), (1, 0, 1) and (1, 1, 1) are the respective path vectors. The respective cut vectors are (0, 1, 1) (for {b_1}), (1, 0, 0) (for {b_2, b_3}), (0, 0, 1) (for {b_1, b_2}), (0, 1, 0) (for {b_1, b_3}) and (0, 0, 0) (for {b_1, b_2, b_3}).
{b_1, b_2} is a minimal path set, since for every y < x = (1, 1, 0), φ(y) = 0; that is, for y = (0, 1, 0), (1, 0, 0) and (0, 0, 0), φ(y) = 0. The same is true for {b_1, b_3}, since for y = (0, 0, 1), (1, 0, 0) and (0, 0, 0) (the vectors y < (1, 0, 1)), φ(y) = 0. On the other hand, {b_1, b_2, b_3} is not minimal, since (1, 1, 0) < (1, 1, 1) and (1, 0, 1) < (1, 1, 1), while φ(1, 1, 0) = 1 and φ(1, 0, 1) = 1.
{b_1} is a minimal cut set, since for y = (1, 1, 1), the only binary vector larger than (0, 1, 1), φ(y) = 1. The cut set {b_2, b_3} is also minimal, because for y = (1, 1, 0), (1, 0, 1) and (1, 1, 1) (the three binary vectors larger than (1, 0, 0)), φ(y) = 1. The same is not true for {b_1, b_2}, {b_1, b_3} and {b_1, b_2, b_3}.

Consider a system S with arbitrary structure, with p minimal path sets {P_1, P_2, …, P_p} and k minimal cut sets {K_1, K_2, …, K_k}. The structure function of a particular minimal path set P_j is

ρ_j(x) = ∏_{i ∈ P_j} x_i,   (41)

which is named the minimal path series structure function. As the system S is working if at least one of the p minimal paths is functioning, the system structure function is

φ(x) = ∐_{j=1}^{p} ρ_j(x).   (42)

Hence,

φ(x) = 1 − ∏_{j=1}^{p} (1 − ∏_{i ∈ P_j} x_i).   (43)

Alternatively, considering the cut sets, the structure function of a particular minimal cut set K_j is

κ_j(x) = ∐_{i ∈ K_j} x_i,   (44)

which is named the minimal cut parallel structure function. As the system S fails if at least one of the k minimal cuts fails, the system structure function is

φ(x) = ∏_{j=1}^{k} κ_j(x) = ∏_{j=1}^{k} ∐_{i ∈ K_j} x_i.   (45)

Example 9: The operational probability of the system depicted in Figure 26 can be computed either by Equation 43 or by Equation 45. The structure functions of the minimal path structures P_1 = {b_1, b_2} and P_2 = {b_1, b_3} are ρ_1(x) = x_1 x_2 and ρ_2(x) = x_1 x_3, respectively. Therefore,

φ(x) = 1 − (1 − x_1 x_2)(1 − x_1 x_3).

So,

φ(x) = x_1 x_2 + x_1 x_3 − x_1² x_2 x_3.

As the x_i are binary variables, x_1² = x_1; hence

φ(x) = x_1 x_2 + x_1 x_3 − x_1 x_2 x_3,

then

P{φ(x) = 1} = p_1 p_2 + p_1 p_3 − p_1 p_2 p_3.

If the minimal cut structures K_1 = {b_1} and K_2 = {b_2, b_3} are considered instead, the structure functions of each are κ_1(x) = x_1 and κ_2(x) = x_2 + x_3 − x_2 x_3, respectively. Since the components are independent,

P{φ(x) = 1} = P{κ_1(x) = 1} P{κ_2(x) = 1} = p_1 (p_2 + p_3 − p_2 p_3),

which is equivalent to the first result.

Minimal paths and cuts are important structures in the dependability evaluation of systems. Many evaluation methods are based on these structures; hence, methods have been proposed for computing minimal paths and cuts. This topic, however, is not covered in this book chapter.

4.3.5. SDP Method

This section introduces the sum of disjoint products (SDP) method for calculating dependability measures. The SDP method uses minimal paths and cuts to compute the system failure probability or the system operational probability³ by summing up the probabilities of disjoint terms. Many strategies have been proposed to derive the disjoint-product terms from minimal paths and cuts, and to consider them when computing the system probability of failure (or probability of functioning). The union of the minimal paths or cuts of a system can be represented by the system logic function. The system logic function may have several terms. If these terms are disjoint, then the dependability measure (reliability, availability or maintainability) can be directly computed by simple summation of the probabilities related to each term. Otherwise, the probability related to one event (path or cut) is summed with the probabilities of the event occurrences represented by disjoint products of terms.

³ The operational probability is the probability that the system is operational.
Consider a system composed of three independent components b_1, b_2 and b_3, depicted in Figure 26, where the component failure probabilities are q_1, q_2 and q_3, respectively. P_1 = {b_1, b_2} and P_2 = {b_1, b_3} are minimal path sets, and K_1 = {b_1} and K_2 = {b_2, b_3} are minimal cut sets. The minimal path sets are depicted in Figure 27. As usual, let the Boolean variable related to component b_i be s_i and the state variable be x_i.

Figure 27: Minimal path sets


As P_1 = {b_1, b_2} is a minimal path set, the system is operational whenever both of its components are working. Therefore the system operational probability is at least the operational probability of this path set, that is, P{φ(x) = 1} ≥ P{s_1 ∧ s_2} = p_1 p_2, where p_i = P{x_i = 1} = 1 − q_i. Now, if the second minimal path set (P_2) is considered, its contribution to the system operational probability should be obtained by considering only the states of the second path that have not been accounted for by the first minimal path. The logic function representing the union of the minimal path sets P_1 and P_2 is (s_1 ∧ s_2) ∨ (s_1 ∧ s_3), which is equivalent to the disjoint form (s_1 ∧ s_2) ∨ (s_1 ∧ ¬s_2 ∧ s_3). The equivalent structure function is x_1 x_2 + x_1 (1 − x_2) x_3, and P{φ(x) = 1} = p_1 p_2 + p_1 (1 − p_2) p_3. As there is no other minimal path, this is the exact operational probability.
The minimal cut sets may be considered instead of the minimal path sets. In this case, the first minimal cut used provides an upper bound on the system operational probability, and the subsequent disjoint products reduce that value, since each additional cut considered introduces terms that describe further system failures.

Figure 28: Event sets


The above explanation shows how the SDP method uses the system logic function expressed as a union of disjoint products. The disjoint terms are products of events representing components that work or fail.
Now, consider three sets representing minimal paths (or minimal cuts) named E_1, E_2 and E_3 (see Figure 28). The sum of disjoint products may be represented by

P{E_1 ∪ E_2 ∪ E_3} = P{E_1} + P{Ē_1 ∩ E_2} + P{Ē_1 ∩ Ē_2 ∩ E_3}.

The first term, P{E_1}, is the contribution related to the first path (cut). The second term, P{Ē_1 ∩ E_2}, is the contribution of the second path (cut) that has not been accounted for by E_1, and the third term, P{Ē_1 ∩ Ē_2 ∩ E_3}, is the contribution related to the path (cut) E_3 that has not been considered in either E_1 or E_2.

Generalizing for n sets, the following expression is obtained:

P{⋃_{i=1}^{n} E_i} = Σ_{i=1}^{n} P{Ē_1 ∩ Ē_2 ∩ … ∩ Ē_{i−1} ∩ E_i}.   (48)

For implementing Expression 48, many algorithms have been proposed for the efficient evaluation of the additional contribution toward the union by events that have not been accounted for by any of the previous events (2).
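Expression 48 can be implemented naively, for small systems, by crediting each component-state vector to the first minimal path that covers it, which makes the terms disjoint by construction. A brute-force Python sketch of ours, assuming independent components:

```python
from itertools import product
from math import prod

def sdp_probability(paths, p):
    """Operational probability via disjoint contributions (Expression 48).
    paths: ordered list of component-index sets (minimal path sets);
    p: list of component operational probabilities."""
    n = len(p)
    total = 0.0
    for x in product((0, 1), repeat=n):
        for path in paths:
            if all(x[j] for j in path):
                # state first covered by this path: a disjoint term
                total += prod(p[j] if x[j] else 1 - p[j] for j in range(n))
                break
    return total
```

For the RBD of Figure 26 (paths {b_1, b_2} and {b_1, b_3}), this reproduces p_1 p_2 + p_1 (1 − p_2) p_3.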

Example 9: Consider the RBD presented in Figure 26, where the operational probabilities are p_1 = p_2 = p_3 = 0.990099. The minimal path sets and cuts are P_1 = {b_1, b_2} and P_2 = {b_1, b_3}, and K_1 = {b_1} and K_2 = {b_2, b_3}, respectively.
The operational probability computed in the first iteration of Expression 48, when considering the minimal path P_1, is 0.980296. If the minimal cuts are adopted instead of paths, and K_1 is considered first, the operational probability is 0.990099. So, 0.980296 ≤ P{φ(x) = 1} ≤ 0.990099. In the second iteration, the operational probability calculated considering the path P_2 is 0.9900019. When adopting the cuts, the next (and sole) disjoint product is K̄_1 ∩ K_2. The operational probability computed considering the additional term is also 0.9900019. The reader may observe that the two bounds converged; thus, the system operational probability is 0.9900019.

4.3.6. Dependability Bounds

The methods introduced so far derive the exact dependability measures of systems. For large systems, computing exact values may be a very compute-intensive task, which may make the respective evaluation impractical. Approximation methods provide bounds on the exact solutions in much shorter times.
There are many approximation methods for estimating the dependability of systems. Among the most important, we may stress the EP (Esary-Proschan) method (3), the min-max method (2)(4), modular decomposition, and the adoption of the SDP method. This section introduces the min-max method and also applies the SDP method for computing bounds.

Min-Max

The min-max method provides dependability bounds of a coherent system using minimal paths and cuts. Consider a coherent system S, whose structure function is φ(x), with p minimal path sets P_1, P_2, …, P_p and k minimal cut sets K_1, K_2, …, K_k, where ρ_i(x) is the structure function of path i and κ_j(x) is the structure function of the minimal cut set j. P{ρ_i(x) = 1} is the operational probability of path i, whereas P{κ_j(x) = 0} is the probability that every component of the minimal cut j has failed. One should bear in mind that P{κ_j(x) = 1} = 1 − P{κ_j(x) = 0}.
A dependability lower bound may be obtained by

P_lb = max_{1≤i≤p} P{ρ_i(x) = 1},

and an upper bound may be calculated by

P_ub = min_{1≤j≤k} P{κ_j(x) = 1}.

Hence,

max_{1≤i≤p} P{ρ_i(x) = 1} ≤ P{φ(x) = 1} ≤ min_{1≤j≤k} P{κ_j(x) = 1}.
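A direct implementation of both bounds, assuming independent components (the function name and the set encoding are ours; `paths` and `cuts` are lists of component-index sets):

```python
from math import prod

def min_max_bounds(paths, cuts, p):
    """Min-max bounds on the system operational probability: the best
    minimal path gives the lower bound, the weakest minimal cut the upper."""
    lower = max(prod(p[i] for i in path) for path in paths)
    upper = min(1 - prod(1 - p[i] for i in cut) for cut in cuts)
    return lower, upper
```

For the RBD of Figure 26 with p_i = 0.9 (paths {b_1, b_2}, {b_1, b_3}; cuts {b_1}, {b_2, b_3}), the bounds are (0.81, 0.9), which indeed bracket the exact value 0.891.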

Example 10: Consider the bridge system depicted in Figure 16. The system is composed of a set of five components, C = {b_1, b_2, b_3, b_4, b_5}. The sets of minimal paths and cuts are P_1 = {b_1, b_4}, P_2 = {b_2, b_5}, P_3 = {b_1, b_3, b_5}, P_4 = {b_2, b_3, b_4}, and K_1 = {b_1, b_2}, K_2 = {b_4, b_5}, K_3 = {b_1, b_3, b_5}, K_4 = {b_2, b_3, b_4}, respectively. The state variables of the components are named by a labeling function such that x_i is the state variable of b_i, with p_i = P{x_i = 1}; therefore the set of state variables is x = (x_1, x_2, x_3, x_4, x_5)⁵.
The structure functions of the minimal paths are ρ_1(x) = x_1 x_4, ρ_2(x) = x_2 x_5, ρ_3(x) = x_1 x_3 x_5 and ρ_4(x) = x_2 x_3 x_4. Since the x_i are independent Bernoulli random variables, P{ρ_1(x) = 1} = p_1 p_4, P{ρ_2(x) = 1} = p_2 p_5, P{ρ_3(x) = 1} = p_1 p_3 p_5 and P{ρ_4(x) = 1} = p_2 p_3 p_4; thus the lower bound is

P_lb = max{p_1 p_4, p_2 p_5, p_1 p_3 p_5, p_2 p_3 p_4}.

The structure functions of the minimal cuts are κ_1(x) = x_1 ∐ x_2, κ_2(x) = x_4 ∐ x_5, κ_3(x) = x_1 ∐ x_3 ∐ x_5 and κ_4(x) = x_2 ∐ x_3 ∐ x_4. Considering the minimal cut set K_1, P{κ_1(x) = 1} = 1 − (1 − p_1)(1 − p_2). Adopting the same process, P{κ_2(x) = 1}, P{κ_3(x) = 1} and P{κ_4(x) = 1} are calculated. Therefore the upper bound is

P_ub = min{1 − (1 − p_1)(1 − p_2), 1 − (1 − p_4)(1 − p_5), 1 − (1 − p_1)(1 − p_3)(1 − p_5), 1 − (1 − p_2)(1 − p_3)(1 − p_4)}.

If p_i = 10/11 ≈ 0.909091 for every component, then 0.826446 ≤ P{φ(x) = 1} ≤ 0.991736.

SDP

The SDP method, described in Section 4.3.5, provides consecutive lower (upper) bounds when adopting successive minimal paths (cuts). The basic algorithm adopted in the SDP method orders the minimal paths and cuts from the shortest to the longest ones. If the system components are similarly dependable, longer paths provide smaller additional contributions to the lower bounds, whereas shorter cuts provide the larger reductions of the upper bounds.
Consider an RBD with n minimal paths and m minimal cuts. First take into account only the set of minimal path sets, {P_1, P_2, …, P_n}. According to Expression 48, the system dependability (reliability, availability etc.) may be successively expressed so as to find tighter and tighter lower bounds. Hence

LB_i = Σ_{j=1}^{i} P{Ē_1 ∩ Ē_2 ∩ … ∩ Ē_{j−1} ∩ E_j},   1 ≤ i ≤ n,

where E_j is the event that every component of the minimal path P_j is operational, and LB_1 ≤ LB_2 ≤ … ≤ LB_n = P{φ(x) = 1}.
Now, consider the set of minimal cut sets, {K_1, K_2, …, K_m}. As P{κ_j(x) = 0} is the probability that every component of the minimal cut j has failed, it represents the system failure probability (unreliability, unavailability etc.) related to cut j. Again adopting Expression 48, these failure probabilities may be successively computed. Thus

UB_i = 1 − Σ_{j=1}^{i} P{F̄_1 ∩ F̄_2 ∩ … ∩ F̄_{j−1} ∩ F_j},   1 ≤ i ≤ m,

where F_j is the event that every component of the minimal cut K_j has failed. As P{φ(x) = 1} = 1 − P{φ(x) = 0}, then

LB_i ≤ P{φ(x) = 1} ≤ UB_i.

⁵ The reader should bear in mind that x has been interchangeably adopted to represent sets and vectors whenever the context is clear.
Example 11: Consider again the bridge system depicted in Figure 16⁷. The sets of minimal paths and cuts are P_1 = {b_1, b_4}, P_2 = {b_2, b_5}, P_3 = {b_1, b_3, b_5}, P_4 = {b_2, b_3, b_4}, and K_1 = {b_1, b_2}, K_2 = {b_4, b_5}, K_3 = {b_1, b_3, b_5}, K_4 = {b_2, b_3, b_4}, respectively, where s_i and x_i are the respective Boolean and state variables of component b_i.
Considering the minimal path sets, the first lower bound is

LB_1 = P{x_1 x_4 = 1} = p_1 p_4.

Adopting the same process, successive lower bounds are obtained; therefore a second, a third and a fourth tighter bound may be computed. The second and the third bounds are:

LB_2 = LB_1 + P{(1 − x_1 x_4) x_2 x_5 = 1},

LB_3 = LB_2 + P{(1 − x_1 x_4)(1 − x_2 x_5) x_1 x_3 x_5 = 1}.

The fourth and last bound (the exact value) is then calculated by

LB_4 = LB_3 + P{(1 − x_1 x_4)(1 − x_2 x_5)(1 − x_1 x_3 x_5) x_2 x_3 x_4 = 1}.

Hence

LB_1 ≤ LB_2 ≤ LB_3 ≤ LB_4 = P{φ(x) = 1}.

The same process may be applied considering the minimal cut sets in order to obtain upper bounds.
Let p_i = 10/11 ≈ 0.909091 for every component. If this process is applied for the minimal paths and minimal cuts, the following lower and upper bounds are obtained:

⁷ The components are independent.
Table 3: Successive Lower and Upper Bounds

Iteration i    Lower bound LB_i    Upper bound UB_i
1              0.826446280992      0.991735537190
2              0.969879106618      0.983539375726
3              0.976088319849      0.982918454403
4              0.982297533080      0.982297533080

Note that at the fourth (and last) iteration the lower and the upper bounds are equal, since then we have the exact result.
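The successive bounds in Table 3 can be reproduced by brute-force enumeration. The sketch below is ours and assumes the bridge labeling used earlier (minimal paths {b_1, b_4}, {b_2, b_5}, {b_1, b_3, b_5}, {b_2, b_3, b_4}) with p_i = 10/11:

```python
from itertools import product
from math import prod

paths = [{0, 3}, {1, 4}, {0, 2, 4}, {1, 2, 3}]   # minimal path sets
cuts = [{0, 1}, {3, 4}, {0, 2, 4}, {1, 2, 3}]    # minimal cut sets
p = [10 / 11] * 5

def state_prob(x):
    """Probability of a particular 0/1 component-state vector."""
    return prod(p[i] if x[i] else 1 - p[i] for i in range(5))

def lower_bounds():
    """LB_i: probability of the union of the first i minimal path events."""
    return [sum(state_prob(x) for x in product((0, 1), repeat=5)
                if any(all(x[i] for i in q) for q in paths[:k]))
            for k in range(1, len(paths) + 1)]

def upper_bounds():
    """UB_i: one minus the probability of the union of the first i cut events."""
    return [1 - sum(state_prob(x) for x in product((0, 1), repeat=5)
                    if any(all(not x[i] for i in c) for c in cuts[:k]))
            for k in range(1, len(cuts) + 1)]
```

Running both functions yields the two columns of Table 3, with the fourth lower and upper bounds coinciding at the exact value 0.982297533080.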

4.4. State-Space Models

Dependability models can be broadly classified into non-state-space models (also called combinatorial models) and state-space models. Non-state-space models such as RBDs and FTs, introduced in the previous sections, can be easily formulated and solved for system dependability under the assumption of stochastic independence between system components. In non-state-space dependability models, it is assumed that the failure or recovery (or any other behavior) of a component is not affected by the behavior of any other component. To model more complicated interactions between components, we use other types of stochastic models such as Markov chains or, more generally, state-space models. In this subsection, we briefly look at the most widely adopted state-space models, namely Markov chains. First introduced by Andrei Andreevich Markov in 1907, Markov chains have been used intensively in dependability modeling and analysis since around the fifties. We provide a short introduction to discrete-state stochastic processes and fundamental characteristics including stationarity, homogeneity, and the memoryless property. We then briefly introduce the Discrete Time Markov Chain (DTMC), the Continuous Time Markov Chain (CTMC), and the semi-Markov process. Finally, we briefly introduce stochastic Petri nets (SPN), a high-level formalism to automate the generation of Markov chains.
A stochastic process is a family of random variables X(t) defined on a sample space. The values assumed by X(t) are called states, and the set of all possible states is the state space, I. The state space of a stochastic process is either discrete or continuous. If the state space is discrete, the stochastic process is called a chain. The time parameter (also referred to as the index set) of a stochastic process is likewise either discrete or continuous. If the time parameter of a stochastic process is discrete (finite or countably infinite), then we have a discrete-time (parameter) process. Similarly, if the time parameter of a stochastic process is continuous, then we have a continuous-time (parameter) process. A stochastic process can be classified by the dependence of its state at a particular time on the states at previous times. If

the state of a stochastic process depends only on the immediately preceding state, we have a Markov process. In other
words, a Markov process is a stochastic process whose dynamic behavior is such that probability distributions for its
future development depend only on the present state and not on how the process arrived in that state. It means that at
the time of a transition, the entire past history is summarized by the current state. If we assume that the state space, I,
is discrete (finite or countably infinite), then the Markov process is known as a Markov chain (or discrete state
Markov process). If we further assume that the parameter space T, is also discrete, then we have a discrete-time
Markov chain (DTMC) whereas if the parameter space is continuous, then we have a continuous-time Markov chain
(CTMC). The changes of state of the system are called transitions, and the probabilities associated with various statechanges are called transition probabilities. For homogenous Markov chain, its transition probability is independent of
time (step) but depends only on the state, the Markov chain in this case is said to be time homogeneous. Homogeneous
DTMC sojourn time in a state follows the geometric distribution. The steady state and transient solution methods of
DTMC and CTMC are described inmore detail in (5). The analysis of CTMC is similar to that of the DTMC, except
that the transitions from a given state to another state can happen at any instant of time. For CTMC, we allow the
parameter to a continuous range of values, the set of values of X(t) is discrete. CTMCs are useful models for
performance as well as availability prediction. We show the CTMC models in the case studies section. Extension of
CTMC to Markov reward models (MRM) make them even more useful. MRM can be used by attaching a reward rate
(or a weight) ri to state i of CTMC. We used MRM in case study 2 to compute capacity oriented availability. For a
homogeneous CTMC, the sojourn time (the amount of time in a state) is exponentially distributed. If we lift this
restriction and allow the sojourn time in a state to be any (non-exponential) distribution function, the process is called
a semi-Markov process (SMP). The SMP is not covered in this chapter, more details can be found in (5).
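For a small CTMC, the steady-state probability vector π solves πQ = 0 with Σπ_i = 1, which is a plain linear system. The pure-Python sketch below is ours; the two-state availability model used to exercise it (failure rate λ = 1/MTTF, repair rate μ = 1/MTTR, values illustrative) yields the familiar A = μ/(λ + μ):

```python
def ctmc_steady_state(Q):
    """Steady-state vector of a CTMC generator matrix Q (rows sum to zero,
    Q[i][j] is the i -> j rate for i != j): solve Q^T pi = 0 with the last
    equation replaced by the normalization sum(pi) = 1."""
    n = len(Q)
    A = [[Q[j][i] for j in range(n)] for i in range(n)]   # transpose of Q
    A[n - 1] = [1.0] * n                                  # normalization row
    b = [0.0] * (n - 1) + [1.0]
    for col in range(n):                                  # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back-substitution
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi
```

The same routine applies unchanged to larger availability models, as long as the chain is irreducible so that the steady state is unique.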
Markov chains are drawn as directed graphs, with transitions labeled by probabilities, rates, and distributions for
homogeneous DTMCs, CTMCs, and SMPs, respectively. In a Markov chain, states represent the various conditions of
the system; they can keep track of the number of functioning resources and of the stage of recovery of each failed
resource. The transitions between states indicate the occurrence of events. A transition can occur from any state to
any other state and may represent a simple or a compound event.
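To make the solution step concrete, the sketch below (ours, not from the chapter; all rates are illustrative) solves the steady-state balance equations πQ = 0 with the normalization Σπi = 1 for a hypothetical two-state up/down availability CTMC, and also shows the Markov reward view, in which an expected steady-state reward is the reward-weighted sum of the state probabilities:

```python
import numpy as np

# Hypothetical two-state availability CTMC: state 0 = up, state 1 = down.
# Failure rate lam = 1/MTTF, repair rate mu = 1/MTTR (illustrative values).
lam, mu = 1.0 / 1000.0, 1.0 / 2.0   # per hour

# Infinitesimal generator matrix Q: each row sums to zero.
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# Steady-state probabilities pi solve pi Q = 0 with sum(pi) = 1.
# Replace one (redundant) balance equation by the normalization condition.
A = np.vstack([Q.T[:-1], np.ones(2)])
b = np.array([0.0, 1.0])
pi = np.linalg.solve(A, b)

availability = pi[0]           # probability of being in the up state
print(availability)            # equals mu / (lam + mu)

# A Markov reward model attaches a reward rate to each state; the expected
# steady-state reward is the reward-weighted sum of the probabilities.
rewards = np.array([1.0, 0.0])
print(rewards @ pi)
```

For this two-state model the closed form is A = μ/(λ+μ); the linear-system route generalizes to the larger chains used later in the chapter.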
Hand construction of a Markov model is tedious and error-prone, especially when the number of states becomes
very large. The Petri net (PN) is a graphical paradigm for the formal description of the logical interactions among
parts of a system or of the flow of activities in complex systems. The original PN did not have a notion of time; for
dependability analysis, durations must be associated with the events represented by PN transitions. PNs can thus be
extended by associating time with the firing of transitions, resulting in timed Petri nets. A special case of the timed
Petri net is the stochastic Petri net (SPN), in which the firing times are random variables with exponential
distributions. An SPN model can be automatically converted into its underlying Markov model and solved. An SPN is
a bipartite directed graph consisting of two kinds of nodes: places and transitions. Places typically represent
conditions within the system being modeled. Transitions represent events occurring in the system that may change the
condition of the system. Tokens are dots (or integers) associated with places; a place containing tokens indicates that
the corresponding condition holds. Arcs connect places to transitions (input arcs) or transitions to places (output arcs).
A cardinality (or multiplicity) may be associated with input and output arcs, which changes the enabling and firing
rules accordingly. Inhibitor arcs are represented by circle-headed arcs; a transition can fire only if the inhibitor place
does not contain any tokens. A priority level can be attached to each PN transition: among all the transitions enabled
in a given marking, only those with the highest associated priority level are allowed to fire. An enabling (or guard)
function is a Boolean expression composed from the PN primitives (places, transitions, tokens). When some events
take an extremely small time to occur, it is useful to model them as instantaneous activities. The SPN model was
extended to the generalized SPN (GSPN) to allow such modeling, by letting some transitions, called immediate
transitions, have zero firing time. For further details see (5).
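The enabling and firing rules described above can be illustrated with a minimal token-game simulation. The sketch below is ours, not a SHARPE/SPNP model: the two-place failure/repair net and its rates are hypothetical, and each enabled transition samples an exponential firing delay, with the earliest one firing (resampling unfired delays is harmless here because the exponential distribution is memoryless):

```python
import random

# Minimal SPN sketch: places hold tokens; a timed transition is enabled when
# its input place is marked, and it fires after an exponential delay.
random.seed(42)

lam, mu = 1.0 / 1000.0, 1.0 / 2.0   # hypothetical failure/repair rates (per hour)

places = {"up": 1, "down": 0}
transitions = [
    {"name": "fail",   "rate": lam, "in": "up",   "out": "down"},
    {"name": "repair", "rate": mu,  "in": "down", "out": "up"},
]

t, horizon, up_time = 0.0, 1_000_000.0, 0.0
while t < horizon:
    # Enabled transitions: every input place marked with at least one token.
    enabled = [tr for tr in transitions if places[tr["in"]] >= 1]
    # Sample a firing delay per enabled transition; the earliest one fires.
    delay, tr = min((random.expovariate(tr["rate"]), tr) for tr in enabled)
    delay = min(delay, horizon - t)
    if places["up"] >= 1:
        up_time += delay               # accumulate time with the "up" condition
    t += delay
    if t < horizon:
        places[tr["in"]] -= 1          # firing removes an input token...
        places[tr["out"]] += 1         # ...and deposits an output token

print(up_time / horizon)   # simulated availability, close to mu / (lam + mu)
```

This is the simulative route; tools such as SPNP instead generate and solve the underlying CTMC exactly, as described in the text.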

5. Case Studies
In this section we present two case studies that illustrate the application of the aforementioned techniques to areas
related to service computing. Some of these cases are based on the authors' experience in developing dependability
research for companies in the IT industry; even the hypothetical cases are built as close as possible to real-world
scenarios.
Book chapter of the book Performance and Dependability in Service Computing: Concepts, Techniques and
Research Directions. Publisher: IGI Global.

5.1 Multiprocessor Subsystem


The main focus of this study is to evaluate the availability of a multiprocessor processing subsystem with respect to
individual and multiple processor failures. Nowadays, multiprocessor computing platforms are at the core of many
service computing projects. The modeling technique used in this case study is the CTMC. This subsystem is an
integral part of many computing platforms, so this study can be easily generalized. The modeled processing subsystem
is based on a symmetric multiprocessing (SMP) dual quad-core platform: it is composed of two physical CPUs,
each containing four cores. Typically, in this case the operating system (e.g., Linux or MS Windows) sees four logical
processors for each physical CPU, a total of eight logical processors; the OS believes it is running on an 8-way
machine.
Assume that this system serves critical applications in a data center environment. From a hardware standpoint, we
assume that in case of a failure in one physical CPU (e.g., CPU0) the computer is able to work with the second CPU
(e.g., CPU1) after rebooting the system. In this situation the system is considered to be running in degraded mode
until the failed CPU is replaced. This reboot-oriented recovery property is very important to ensure a low downtime
and has a significant influence on the overall system availability. Hence, we assume that the modeled motherboard
contains sockets for two physical CPUs on dual independent point-to-point system buses. If only one physical CPU is
operational, these individual buses connecting each CPU directly to the motherboard's north bridge make it physically
possible for either of the two processors, independently of the socket used, to boot up and run the operating system;
the OS kernel is therefore not restricted to starting up from a specific processor. For example, motherboards
compatible with the Intel Blackford-P 5000P chipset implement the abovementioned capabilities, and our modeling
assumptions are based on it.
As described above, although each physical processor has four cores, an individual core failure may be considered an
unlikely event. For that reason we assume that a processor failure stops the entire physical processor and therefore all
four of its cores. Figure 29 shows the modeled platform with two quad-core CPUs.

Figure 29: Dual Quad-core CPU subsystem model


States UP and U1 are up states. State UP represents the subsystem with two operational physical processors. When a
processor fails, the subsystem moves from state UP to D1; with probability 1-Ccpu a common-mode failure brings all
processors down, moving the subsystem from UP to DW instead. In states D1 and DW the computer is down due to a
system crash caused by a processor failure. We assume that after such an event the next action taken by the
administrator is to turn the computer on, and that with probability Capp the computer then comes up with only one
physical processor (four cores). The mean time to reinitialize (reboot) the system is 1/μapp. A successful initialization
after a processor failure brings the subsystem to state U1; since it is running with only one physical processor, it can
fail with a mean time to failure of 1/λcpu1. Parameter λcpu1 is usually assigned a higher value than λcpu, indicating a
shorter lifetime for the remaining processor, which runs under a higher workload than usual (dual mode). While in
degraded mode the system administrator can decide to request a processor replacement in order to restore the system
to full processing capacity; this decision is captured by the decision-factor parameter in Table 4. From states DW and
U1 the only possible transition is to FR (the field replacement service state).
An important aspect of availability modeling and analysis is the repair service, which has a direct impact on the
MTTR (mean time to repair) metric necessary for the system availability analysis. In a data center environment, a
specific field support staff is responsible for the replacement of failed parts. These parts are commonly categorized as
customer replaceable units (CRU) or field replaceable units (FRU), industry terms for the parts of the system that are
designed to be replaced by the customer's personnel (CRU) or only by an authorized manufacturer representative
(FRU). The main difference is the time to repair: an FRU involves a longer repair time than a CRU because it is not
available locally, and the travel time must therefore be accounted for. Since a CPU is very often considered an FRU in
real data centers, we modeled the repair service under this assumption. Hence, from state DW the replacement of the
failed CPU is requested, and the service personnel arrive with a mean time of 1/μspFRU to fix and reboot the system
(transition from FR to UP) afterward. Table 4 shows the CPU subsystem model parameter values. Most of these
values are based on industry standards and specialist judgment.
Table 4: Input Parameters of the CPU subsystem model

Params      Description                                                                      Value
1/λcpu      mean time to processor failure operating with 2 proc.                           1,000,000 hours
1/λcpu1     mean time to processor failure operating with 1 proc.                           (1/λcpu) / 2
Ccpu        coverage factor for CPU failure                                                 0.99
1/μapp      mean time to reboot the whole computer                                          2 minutes
Capp        coverage factor for appliance failure after reboot due to a processor failure   0.95
cpu1        decision factor for keeping the system running with one processor              True (1)
1/μspFRU    mean time for new appliance arrival (FRU)                                       4 hours
1/μappFRU   mean time to install the new CPU (FRU)                                          30 minutes
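As a hedged illustration of how such a model is solved numerically, the sketch below encodes one plausible reading of the transition structure described above. The authoritative structure is Figure 29, which is not reproduced here, so the UP→D1/DW, D1→U1/DW, U1→FR, DW→FR, and FR→UP rates used are our assumptions; the code then computes the steady-state availability over the up states UP and U1:

```python
import numpy as np

# Sketch of the dual quad-core CPU availability CTMC. The rates below are one
# plausible reading of the text and Table 4 -- illustrative, not definitive.
HOURS_PER_YEAR = 8760

lam_cpu  = 1.0 / 1_000_000          # processor failure rate, 2 procs (per hour)
lam_cpu1 = 2 * lam_cpu              # rate with 1 proc: 1 / ((1/lam_cpu) / 2)
mu_app   = 60.0 / 2                 # reboot rate: mean time 2 minutes
mu_fr    = 1.0 / (4.0 + 0.5)        # FRU arrival (4 h) + install (30 min), assumed lumped
C_cpu, C_app = 0.99, 0.95

states = ["UP", "D1", "DW", "U1", "FR"]
i = {s: n for n, s in enumerate(states)}

Q = np.zeros((5, 5))
Q[i["UP"], i["D1"]] = 2 * lam_cpu * C_cpu        # covered processor failure
Q[i["UP"], i["DW"]] = 2 * lam_cpu * (1 - C_cpu)  # common-mode failure
Q[i["D1"], i["U1"]] = C_app * mu_app             # reboot succeeds with 1 proc
Q[i["D1"], i["DW"]] = (1 - C_app) * mu_app       # reboot fails (assumed target)
Q[i["U1"], i["FR"]] = lam_cpu1                   # degraded mode ends (assumed rate)
Q[i["DW"], i["FR"]] = mu_app                     # hand over to field service (assumed)
Q[i["FR"], i["UP"]] = mu_fr                      # replacement completed
np.fill_diagonal(Q, -Q.sum(axis=1))

# Solve pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T[:-1], np.ones(5)])
pi = np.linalg.solve(A, np.array([0.0] * 4 + [1.0]))

availability = pi[i["UP"]] + pi[i["U1"]]
downtime_min_per_year = (1 - availability) * HOURS_PER_YEAR * 60
print(availability, downtime_min_per_year)
```

The per-state downtime breakdown reported in Table 5 corresponds to weighting each down-state probability by the hours in a year.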

Figure 30: Breakdown downtime sensitivity analysis

Table 5: Breakdown Downtime

State   Breakdown downtime
D1      2.08134903e+000
DW      2.50190267e-001
FR      5.25593188e-001

The breakdown analysis of the downtime (Table 5 and Figure 30) shows that the contribution of state D1 to the total
downtime is very significant. Hence, actions should be taken to reduce the time spent in D1, by reducing the reboot
time and by increasing the coverage factor Capp. This multiprocessor subsystem model can be used in conjunction
with other models (e.g., cooling, power supply, storage, etc.) to compose a more complex system, such as an entire
server machine or even a cluster of servers.

5.2 A Virtualized System



This case study presents the availability modeling and analysis of a virtualized system (6). Service computing is
highly dependent on data center infrastructure and virtualization technologies. We develop an availability model of a
virtualized system using a hierarchical model in which fault trees are used at the upper level and homogeneous
continuous-time Markov chains (CTMC) represent the sub-models at the lower level. We incorporate not only
hardware failures (e.g., CPU, memory, power, etc.) but also software failures, including virtual machine monitor
(VMM), virtual machine (VM), and application failures. We also consider the high availability (HA) feature and VM
live migration in the virtualized system. The metrics we use are system steady-state availability, downtime in minutes
per year, and capacity-oriented availability.

Figure 31: Architectures of two hosts virtualized systems


Figure 31 shows a virtualized two-host system. The virtualized system consists of two physical virtualized
servers (called hosts from now on), each running one VM on top of its VMM. The two virtualized hosts share a
common SAN (storage area network); this configuration is commonly used to support VM live migration. The
applications running on the VMs can be the same or different; we assume the same application, denoted APP1 and
APP2 to distinguish the two instances, so this is an active/active configuration in a virtualized system.

Figure 32: System availability model for the virtualized system


We define the system unavailability as the probability that both hosts are down. Figure 32 shows the virtualized
system availability model, in which the top-level fault tree includes both hardware and software failures.
H1 and H2 represent host1 and host2 failure, respectively. HW1 and HW2 represent hardware failure of host1
and host2, respectively; HW1 consists of the CPU (CPU1), memory (Mem1), network (Net1), power (Pwr1), and
cooler (Coo1). A leaf node drawn as a square indicates that a sub-model is defined for it. The system fails if both
hosts fail (hardware, including the VMM), if the SAN fails, or if the VMs fail. This is because the SAN is shared by
the two hosts and the VMs on one host can be migrated to the other host. Below we discuss the VMs availability
model in detail. Descriptions of the other sub-models can be found in (6).
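Assuming statistically independent components, a top-level fault tree of this kind is evaluated by multiplying unavailabilities at AND gates and combining them as 1 − Π(1 − Ui) at OR gates. The sketch below illustrates this for the structure just described; the sub-model unavailability numbers are made up purely for illustration:

```python
# Illustrative evaluation of a top-level fault tree like Figure 32,
# assuming statistically independent components.
def and_gate(*u):
    """All inputs must fail for the gate to fail."""
    p = 1.0
    for x in u:
        p *= x
    return p

def or_gate(*u):
    """Any input failing fails the gate."""
    p = 1.0
    for x in u:
        p *= (1.0 - x)
    return 1.0 - p

# Hypothetical steady-state unavailabilities of the sub-models.
U_H1, U_H2 = 2.0e-4, 2.0e-4      # host hardware + VMM
U_SAN      = 1.0e-5
U_VMs      = 1.0e-4              # from the VMs CTMC sub-model

# Top event: both hosts down, or SAN down, or the VMs sub-system down.
U_system = or_gate(and_gate(U_H1, U_H2), U_SAN, U_VMs)
print(U_system, 1.0 - U_system)  # unavailability and availability
```

In the actual study these leaf unavailabilities come from the CTMC sub-models solved by SHARPE rather than from fixed numbers.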

Figure 33: VMs availability model


Figure 33 shows the VMs subsystem availability model. We use an active/active configuration and consider the
system to be up if at least one application is working properly. In the top state UUxUUx all the components are up.
State UUxUUx represents, from left to right: host1 (in short, H1) is up, VM1 (in short, V1) is up, host1 has the
capacity to run another VM (in short, x), host2 (in short, H2) is up, VM2 (in short, V2) is up, and host2 has the
capacity to run another VM. If H1 fails with rate λh, the model enters state FUxUUx. The failure of H1 is detected by
one of the failure detection mechanisms (for instance, a heartbeat message every 30 seconds). Once the H1 failure is
detected with rate δh (state DxxUUR, where D stands for detection of the host failure), V1 is restarted on H2 (note
that H2 had the capacity to receive V1, as denoted by x, and the restart of a VM is denoted by R). This is called the
VM High Availability (HA) service in VMware. It takes a mean time of 1/rv to restart a VM on the other host. The
host is then repaired with rate μh (state UxxUUU, H1 repaired). Now, on H2, there are two VMs, i.e., V1 and V2. To
make full use of system resources, V1 on H2 may be migrated back to H1 with rate mv (i.e., the transition from state
UxxUUU to UUxUUx), and we are back in state UUxUUx, in which both hosts are up and both VMs on them are up.
If H2 fails (state UUxFUx), as soon as the failure is detected, V2 is restarted on H1 (state UURDxx) and is migrated
back to H2 once H2 is repaired. Next, we describe a second host failure and recovery. We can begin with state
UxxUUU or UUUUxx. In state UxxUUU, both hosts are up and the two VMs are on H2. If H2 fails with mean time
1/λh, the model enters state UxxFUU, and the host failure is detected with rate δh (state UxxDUU). In this state there
are two VMs, but only one VM can be restarted at a time, so the two VMs compete with each other and the rate is
2rv. The next state is UURDxx. Similarly, from state DUUUxx, the two VMs can be moved to the other host. Next,
we incorporate VM failure and recovery. If V1 fails, the model goes to state UFxUUx, in which H1 is up but V1 has
failed. It takes a mean time of 1/δv to detect the VM failure, and then a mean time of 1/μv to recover. In some cases,
if the VM cannot be recovered, it needs to be restarted on the other host; in this case the VM on H1 is migrated to H2
with mean time 1/mv, restarted on H2 (state UxxUUR), and recovered (state UxxUUU). To capture this imperfect
recovery we use a coverage factor for the VM, cv, so the rates become cvμv and (1-cv)μv. Obviously, V2 can also fail
(state UUxUFx); it takes a mean time of 1/δv to detect the failure. It can be recovered with rate cvμv; otherwise it
needs to be moved to H1 with rate (1-cv)μv. The coverage factor cv can be determined by fault injection experiments.
So far we have incorporated host failures and VM failures; we also incorporate application failures. We use the
notation UUxf_UUx to represent the failure of application1 on V1 on H1, where f means the failure of application1
(we use an underscore (_) to distinguish between H1 and H2). If the failure of application1 is detected (state
UUxd_UUx) with rate δa, it can be recovered with rate ca·μ1a, going back to state UUxUUx; otherwise it needs an
additional recovery action (state UUxp_UUx) with repair rate μ2a. Application2 on V2 on H2 can also fail (state
UUx_UUxf); the failure is detected with mean time 1/δa, and it can be recovered with mean time 1/μ1a, or it needs
additional recovery steps with mean time 1/μ2a.
Table 6: Input Parameters of the VMs model

Params   Description                               Value
1/λh     mean time to host failure                 host MTTFeq
1/λv     mean time to VM failure                   2160 hours
1/λa     mean time to application failure          336 hours
1/δh     mean time to host failure detection       30 seconds
1/δv     mean time to VM failure detection         30 seconds
1/δa     mean time to app. failure detection       30 seconds
1/mv     mean time to migrate a VM                 5 minutes
1/rv     mean time to restart a VM                 5 minutes
1/μv     mean time to repair a VM                  30 minutes
1/μ1a    mean time to application first repair     20 minutes
1/μ2a    mean time to application second repair    1 hour
1/μh     mean time to repair a host failure        host MTTReq
cv       coverage factor for VM repair             0.95
ca       coverage factor for application repair    0.9

Table 7: Output measures of the virtualized system

Output measure                                       Value
Steady-state availability (SSA)                      9.99766977e-001
Downtime in minutes per year (DT)                    1.22476683e+002
Capacity-oriented availability (COA) of VMs model    9.96974481e-001

The output measures, such as steady-state availability, downtime, and capacity-oriented availability (COA), are
computed using the hierarchical model. We used SHARPE to compute the output measures. We compute the
equivalent mean time to failure (MTTFeq) and equivalent mean time to repair (MTTReq) of the Markov sub-models
(CPU, memory availability model, etc.) by feeding in the input parameter values. The MTTFeq and MTTReq of
each sub-model are used to compute the MTTFeq and MTTReq of a host, which are then used in the VMs
availability model (1/MTTFeq and 1/MTTReq of a host equal λh and μh, respectively). Finally, we evaluate the
system availability by feeding all the input parameter values from Table 6 into all the sub-models of the system
availability model shown in Figure 32. The steady-state availability and downtime in minutes per year of the
virtualized two-host system are summarized in Table 7. We also compute the COA by assigning reward rates
to each state of the VMs availability model (so that the Markov chain becomes a Markov reward model).
Reward rate 1 is assigned to states where one VM is running on each host (e.g., UUxUUx); reward rate 0.75 is
assigned to states where two VMs are running on one host (e.g., UUUUxx, UxxUUU, UUUDxx, etc.); reward
rate 0.5 is assigned to states where only one VM is running on one host (e.g., UUxFxx, UUxDxx, etc.); and zero
reward rate is assigned to all other states. The computed COA of the virtualized system is also shown in Table 7. The
sensitivity to some parameters is shown in Figures 34 and 35. Figure 34 shows the unavailability vs. the mean
time to VM failure (1/λv in Table 6) and to VMM failure (1/λVMM). As seen in the figure, the system
unavailability drops as the mean time to VMM failure increases, but beyond about 750 hours, further
increasing the mean time to VMM failure does not reduce the system unavailability by much. As also seen in
Figure 34, the system unavailability does not change much as the mean time to VM failure increases, since
the VM on the other host keeps working properly. Figure 35 shows the COA vs. the mean time to restart (migrate)
a VM. The COA drops as the mean time to restart, and the mean time to migrate, a VM increase. Therefore
it is important to minimize the mean time to restart (migrate) a VM in order to maximize the COA. More case
studies using the hierarchical modeling approach can be found in (7), (5), and (8).
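The composition step can be sketched as follows (illustrative values only; the actual sub-model outputs are produced by SHARPE): each sub-model is condensed to an equivalent MTTF/MTTR pair via A = MTTF/(MTTF+MTTR) and folded into host-level rates for the upper-level model, and the COA is the reward-weighted sum of steady-state probabilities:

```python
import numpy as np

# Hierarchical composition sketch. The sub-model MTTF/MTTR values and the
# steady-state vector below are hypothetical, not the chapter's actual numbers.
mttf_eq = {"cpu": 2.0e5, "mem": 1.0e5, "net": 1.2e5}   # hours, hypothetical
mttr_eq = {"cpu": 4.5,   "mem": 2.0,   "net": 2.0}

host_avail = 1.0
for part in mttf_eq:
    a = mttf_eq[part] / (mttf_eq[part] + mttr_eq[part])  # A = MTTF/(MTTF+MTTR)
    host_avail *= a                                      # series structure assumed

# Fold the host back into an equivalent failure/repair pair for the upper
# level: lam_h = 1/MTTFeq and mu_h = 1/MTTReq of the host.
host_mttf = 1.0 / sum(1.0 / mttf_eq[p] for p in mttf_eq)  # series MTTFeq
host_mttr = host_mttf * (1.0 - host_avail) / host_avail   # from A = f/(f+r)

# Capacity-oriented availability: reward-weighted steady-state probabilities.
pi      = np.array([0.9990, 0.0006, 0.0003, 0.0001])  # hypothetical pi vector
rewards = np.array([1.0,    0.75,   0.5,    0.0])     # as assigned in the text
coa = float(rewards @ pi)
print(host_avail, host_mttf, host_mttr, coa)
```

The reward vector mirrors the assignment described above (1 for one VM per host, 0.75 for two VMs on one host, 0.5 for a single VM, 0 otherwise).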
Figure 34: Unavailability vs. mean time to VM/VMM failure

Figure 35: COA vs. mean time to VM restart (migrate)

6. Conclusions
Dependability modeling and evaluation is an area with a long tradition and sound foundations. Dependability
studies are particularly important for the design success of critical systems. In this chapter, we began by
introducing some seminal and influential works that widely shaped the development of this research area.
Then we introduced some model types and important analysis methods. This chapter focused more on
combinatorial models than on state-based models; nevertheless, the case studies presented considered both
classes of models.
Although the presented methods and models are quite mature, there are still many challenging problems,
encompassing model composition strategies, efficient methods for computing measures of large systems,
automatic generation of models, user-friendly tools for industrial use, and the evaluation of user-defined
properties. This list of subjects is not intended to be complete, but it depicts exciting research topics in both
theoretical and practical studies.


Bibliography
1. M. Grottke and K. S. Trivedi. Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate. IEEE Computer, Vol. 40,
No. 2, 2007.
2. Way Kuo and Ming J. Zuo. Optimal Reliability Modeling: Principles and Applications. Wiley, 2003. p. 544.
3. J. D. Esary and F. Proschan. A Reliability Bound for Systems of Maintained, Interdependent Components. Journal
of the American Statistical Association, Vol. 65, No. 329, 1970.
4. R. E. Barlow and F. Proschan. Statistical Theory of Reliability and Life Testing: Probability Models. Holt,
Rinehart and Winston, New York, 1975.
5. Kishor S. Trivedi. Probability and Statistics with Reliability, Queueing and Computer Science Applications.
Wiley, 2002. p. 830.
6. D. S. Kim, F. Machida, and K. S. Trivedi. Availability Modeling and Analysis of a Virtualized System. 15th IEEE
Pacific Rim International Symposium on Dependable Computing (PRDC 2009), 2009.
7. K. S. Trivedi, D. Wang, D. J. Hunt, A. Rindos, W. E. Smith, and B. Vashaw. Availability Modeling of SIP Protocol
on IBM WebSphere. 14th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2008), 2008.
8. Robin A. Sahner, Kishor S. Trivedi, and Antonio Puliafito. Performance and Reliability Analysis of Computer
Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers. p. 404.
9. William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes: The Art of
Scientific Computing. Third edition. Cambridge University Press, 2007. p. 1235.
10. Wallace R. Blischke and D. N. Prabhakar Murthy (eds.). Case Studies in Reliability and Maintenance. John Wiley
& Sons, Hoboken, 2003. p. 661.
11. Igor Ushakov. Is Reliability Theory Still Alive? e-journal Reliability: Theory & Applications, Vol. 2, No. 1,
March 2007.
12. S. J. Einhorn and F. B. Thiess. Intermittence as a Stochastic Process. NYU-RCA Working Conference on Theory
of Reliability, Ardsley-on-Hudson, N.Y., 1957.
13. F. J. W. Symons. Modelling and Analysis of Communication Protocols Using Numerical Petri Nets. Ph.D. thesis,
University of Essex, 1978.
14. H. R. Stuart. Time-Limit Relays and Duplication of Electrical Apparatus to Secure Reliability of Services at
Pittsburg. IEEE, June 1905.
15. H. G. Stott. Time-Limit Relays and Duplication of Electrical Apparatus to Secure Reliability of Services at New
York. IEEE, 1905.
16. David J. Smith. Reliability, Maintainability and Risk. Seventh edition. Elsevier, 2009. p. 346.
17. Nitin M. Shetti. Heisenbugs and Bohrbugs: Why Are They Different? Technical report DCS-TR-579, Department
of Computer Science, Rutgers, The State University of New Jersey, March 2003.
18. C. E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, Vol. 27,
pp. 379-423 and 623-656, July and October 1948.
19. Simon Schaffer. Babbage's Intelligence: Calculating Engines and the Factory System. Critical Inquiry, Vol. 21,
No. 1, The University of Chicago Press, 1994.
20. Richard E. Barlow and Frank Proschan. Mathematical Theory of Reliability. John Wiley, New York, 1967. SIAM
series in applied mathematics.
21. W. H. Pierce. Failure-Tolerant Computer Design. Academic Press, New York, 1965.
22. Patrick D. T. O'Connor. Practical Reliability Engineering. Fourth edition. Wiley, 2009. p. 513.
23. J. von Neumann. Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. In
C. E. Shannon and J. McCarthy (eds.), Automata Studies, Annals of Mathematics Studies, Vol. 34, 1956.
24. S. Natkin. Les Réseaux de Petri Stochastiques et leur Application à l'Évaluation des Systèmes Informatiques.
Thèse de Docteur-Ingénieur, CNAM, Paris, France, 1980.
25. J. M. Nahman. Dependability of Engineering Systems. Springer, 2002. p. 192.
26. Edward F. Moore. Gedanken-Experiments on Sequential Machines. The Journal of Symbolic Logic, Vol. 23,
No. 1, Association for Symbolic Logic, March 1958.
27. M. K. Molloy. On the Integration of Delay and Throughput Measures in Distributed Processing Models. Ph.D.
thesis, UCLA, Los Angeles, CA, USA, 1981.
28. Krishna B. Misra (ed.). Handbook of Performability Engineering. Springer, 2008. p. 1316.
29. Lawrence M. Leemis. Reliability: Probability Models and Statistical Methods. Second edition, 2009. p. 368.
ISBN 978-0-692-00027-4.
30. J. C. Laprie. Dependable Computing and Fault Tolerance: Concepts and Terminology. Proc. 15th IEEE Int. Symp.
on Fault-Tolerant Computing, 1985.
31. A. Kolmogoroff. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung (in German).
Mathematische Annalen, Springer-Verlag, 1931.
32. Marvin Rausand and Arnljot Høyland. System Reliability Theory: Models, Statistical Methods, and Applications.
Second edition. Wiley, 2004. p. 636.
33. Gunter Bolch, Stefan Greiner, Hermann de Meer, and Kishor S. Trivedi. Queueing Networks and Markov Chains:
Modeling and Performance Evaluation with Computer Science Applications. Second edition. Wiley, 2006. p. 878.
34. Boris Gnedenko and Igor A. Ushakov. Probabilistic Reliability Engineering. Wiley-Interscience, 1995.
35. Gely P. Basharin, Amy N. Langville, and Valeriy A. Naumov. The Life and Work of A. A. Markov. Linear
Algebra and its Applications, Vol. 386, Special Issue on the Conference on the Numerical Solution of Markov
Chains 2003, 2004.
36. Clifton Ericson. Fault Tree Analysis: A History. Proceedings of the 17th International Systems Safety
Conference, 1999.
37. Benjamin Epstein and Milton Sobel. Life Testing. Journal of the American Statistical Association, Vol. 48,
No. 263, September 1953.
38. Charles E. Ebeling. An Introduction to Reliability and Maintainability Engineering. Second edition. Waveland
Press, Inc., 2005. p. 486.
39. B. S. Dhillon. Applied Reliability and Quality: Fundamentals, Methods and Applications. Springer-Verlag,
London, 2007.
40. D. R. Cox. Quality and Reliability: Some Recent Developments and a Historical Perspective. Journal of the
Operational Research Society, Vol. 41, No. 2.
41. Z. W. Birnbaum, J. D. Esary, and S. C. Saunders. Multi-Component Systems and Structures and Their Reliability.
Technometrics, Vol. 3, No. 1, 1961.
42. Richard E. Barlow. Mathematical Reliability Theory: From the Beginning to the Present Time. Proceedings of the
Third International Conference on Mathematical Methods in Reliability, Methodology and Practice, 2002.
43. A. Avizienis. Toward Systematic Design of Fault-Tolerant Systems. IEEE Computer, Vol. 30, No. 4, 1997.
44. P. M. Anselone. Persistence of an Effect of a Success in a Bernoulli Sequence. Journal of the Society for
Industrial and Applied Mathematics, Vol. 8, No. 2, 1960.
45. Anatoliy Gorbenko, Vyacheslav Kharchenko, and Alexander Romanovsky. On Composing Dependable Web
Services Using Undependable Web Components. Int. J. Simulation and Process Modelling, Vol. 3, Nos. 1/2, 2007.
46. Algirdas Avizienis, Jean-Claude Laprie, and Brian Randell. Fundamental Concepts of Dependability. Seoul,
Korea, May 21-22, 2001.
47. Jim Gray. Why Do Computers Stop and What Can Be Done About It? Tandem Computers, Technical Report
85.7, June 1985.
48. A. K. Erlang. The Theory of Probabilities and Telephone Conversations. First published in Nyt Tidsskrift for
Matematik B, Vol. 20, 1909 (Principal Works of A. K. Erlang).
49. Board of Directors of the American Institute of Electrical Engineers. Answers to Questions Relative to High
Tension Transmission. IEEE, September 26, 1902.
50. J. C. Laprie. Dependability: Basic Concepts and Terminology. Springer-Verlag, 1992.

Biography

Paulo R. M. Maciel graduated in Electronic Engineering in 1987 and received his MSc and PhD degrees in
Electronic Engineering and Computer Science, respectively, from the Universidade Federal de Pernambuco. He was a
faculty member of the Electrical Engineering Department of the Universidade de Pernambuco from 1989 to 2003.
Since 2001 he has been a member of the Informatics Center of the Universidade Federal de Pernambuco, where he is
currently Associate Professor. He is a research member of the Brazilian research council (CNPq) and an IEEE
member. His research interests include Petri nets, formal models, performance and dependability evaluation, and
power consumption analysis. He has acted as a consultant and as coordinator of research projects funded by
companies such as HP, EMC, CELESTICA, FOXCONN, ITAUTEC, and CHESF.
Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke
University, Durham, NC. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled
Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall; a
thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has
also published two other books, Performance and Reliability Analysis of Computer Systems (Kluwer Academic
Publishers) and Queueing Networks and Markov Chains (John Wiley). He is a Fellow of the Institute of Electrical and
Electronics Engineers and a Golden Core Member of the IEEE Computer Society. He has published over 420 articles
and has supervised 42 Ph.D. dissertations. He is on the editorial boards of IEEE Transactions on Dependable and
Secure Computing, Journal of Risk and Reliability, International Journal of Performability Engineering, and
International Journal of Quality and Safety Engineering. He is the recipient of the IEEE Computer Society Technical
Achievement Award for his research on software aging and rejuvenation. His research interests are in reliability,
availability, performance, performability, and survivability modeling of computer and communication systems. He
works closely with industry, carrying out reliability/availability analyses, providing short courses on reliability,
availability, and performability modeling, and developing and disseminating software packages such as SHARPE
and SPNP.

Rivalino Matias Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned
his M.S. (1997) and Ph.D. (2006) degrees in computer science and in industrial and systems engineering, respectively,
from the Federal University of Santa Catarina, Brazil. In 2008 he was with the Department of Electrical and
Computer Engineering at Duke University, Durham, NC, working as a research associate under the supervision of
Dr. Kishor Trivedi; he also worked with IBM Research Triangle Park on research related to analytical modeling of
embedded system availability and reliability. He is currently an Associate Professor in the Computing School at the
Federal University of Uberlândia, Brazil. Dr. Matias has served as a reviewer for IEEE Transactions on Dependable
and Secure Computing, Journal of Systems and Software, and several international conferences. His research interests
include reliability engineering applied to computing systems, software aging theory, dependability analytical
modeling, and diagnosis protocols for computing systems.
Dong Seong Kim received the B.S. degree in Electronic Engineering from Korea Aerospace University, Republic of
Korea, in 2001, and the M.S. and PhD degrees in Computer Engineering from Korea Aerospace University in 2003
and 2008, respectively. He was a visiting researcher at the University of Maryland, College Park, USA, in 2007. Since
June 2008 he has been a postdoctoral researcher at Duke University. His research interests are in dependable and
secure systems and networks, in particular intrusion detection systems, wireless ad hoc and sensor networks,
virtualization, and dependability and security modeling and analysis of cloud computing systems.

