Paulo R. M. Maciel
Center of Informatics, Federal University of Pernambuco, Brazil

Kishor S. Trivedi
Pratt School of Engineering, Duke University, USA
Abstract
This chapter presents modeling methods and evaluation techniques for computing dependability metrics of systems. It begins by providing a summary of seminal works. After presenting the background, the most prominent model types are introduced, together with the respective methods for computing exact values and bounds. The chapter focuses particularly on combinatorial models, although state-space models such as Markov models and hierarchical models are also presented. Case studies are presented at the end of the chapter.
Keywords: dependability, modeling, combinatorial models, state space models
1. Introduction
Due to ubiquitous provision of services on the Internet, dependability has become an attribute of prime concern in
hardware/software development, deployment, and operation. Providing fault tolerant services is inherently related
to adoption of redundancy. Redundancy can be exploited either in time or in space. Replication of services is
usually provided through distributed hosts across the world, so that whenever the service, the underlying host or
network fails, another service is ready to take over (1). Dependability of a system can be understood as the ability
to deliver a specified functionality that can be justifiably trusted (2). Functionality might be a set of roles or
services (functions) observed by an outside agent (a human being, another system, etc.) that interacts with the system at its interfaces; the specified functionality of a system is what the system is intended for. This chapter aims to provide an overview of dependability modeling. The chapter starts by briefly describing some early and seminal work, their motivations and the succeeding advances. Afterwards, a set of fundamental concepts and definitions is introduced. Subsequently, the modeling techniques are classified, defined and introduced, and a representative set of evaluation methods is presented. Finally, case studies are discussed, modeled and evaluated.
2. A Brief History
This section provides a summary of early work related to dependability and briefly describes some seminal efforts
as well as the respective relations with currently prevalent methods. This account is certainly incomplete; nonetheless, we hope it covers the fundamental events, people and research related to what is now called dependability modeling.
Dependability is related to disciplines such as fault tolerance and reliability. The concept of dependable computing first appeared in the 1820s, when Charles Babbage undertook the enterprise to conceive and construct a mechanical calculating engine to eliminate the risk of human errors (3) (4). In his book On the Economy of Machinery and Manufactures, he mentions that "The first objective of every person who attempts to make any article of consumption is, or ought to be, to produce it in perfect form" (5). In the nineteenth century, reliability theory evolved from probability and statistics as a way to support the computation of maritime and life insurance rates. In the early twentieth century, these methods had been applied to estimate the survivorship of railroad equipment (5) (7).
Book chapter of the book Performance and Dependability in Service Computing: Concepts, Techniques and Research Directions. Publisher: IGI Global.
The first IEEE (formerly AIEE and IRE) public document to mention reliability is "Answers to Questions Relative to High Tension Transmission", which summarizes the meeting of the Board of Directors of the American Institute of Electrical Engineers held on September 26, 1902 (6). In 1905, H. G. Stott and H. R. Stuart discussed "Time-Limit Relays" and "Duplication of Electrical Apparatus to Secure Reliability of Services" at New York (7) and at Pittsburg (8). In these works the concept of reliability was primarily qualitative. In 1907, A. A. Markov began the study of an important new type of chance process, in which the outcome of a given experiment can affect the outcome of the next experiment. This type of process is now called a Markov chain (11). In the 1910s, A. K. Erlang studied telephone traffic planning problems for reliable service provisioning (9). Later, in the 1930s, extreme value theory was applied to model the fatigue life of materials by W. Weibull and E. J. Gumbel (9). In 1931, Kolmogorov, in his famous paper "Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung" (Analytical methods in probability theory), laid the foundations for the modern theory of Markov processes (13).
In the 1940s quantitative analysis of reliability was applied to many operational and strategic problems in World
War II (5).
The first generation of electronic computers was quite undependable; hence, many techniques were investigated for improving their reliability. Among such techniques, many researchers investigated design strategies and evaluation methods. Many methods were then proposed for improving system dependability, such as
error control codes, replication of components, comparison monitoring and diagnostic routines. The most
prominent researchers during that period were Shannon (11), Von Neumann (12) and Moore (13), who proposed
and developed theories for building reliable systems by using redundant and less reliable components. These were
the predecessors of the statistical and probabilistic techniques that form the foundation of modern dependability
theory (5).
In the 1950s, reliability became a subject of great engineering interest as a result of the cold war efforts, failures of
American and Soviet rockets, and failures of the first commercial jet aircraft, the British de Havilland Comet (14) (15). Epstein and Sobel's 1953 paper studying the exponential distribution was a landmark contribution (10).
In 1954, the Symposium on Reliability and Quality Control (it is now the IEEE Transactions on Reliability) was
held for the first time in the United States, and in 1958 the First All-Union Conference on Reliability took place in
Moscow (14) (15). In 1957 S. J. Einhorn and F. B. Thiess adopted Markov chains for modeling system
intermittence (22), and in 1960, P. M. Anselone employed Markov chains for evaluating availability of radar
systems (23). In 1961 Birnbaum, Esary and Saunders published a milestone paper introducing coherent structures
(17).
Reliability models may be classified as combinatorial (non-state-space) and state-space models. Reliability Block Diagrams (RBD) and Fault Trees (FT) are combinatorial models, and the most widely adopted models in reliability evaluation. RBD is probably the oldest combinatorial technique for reliability analysis. Fault Tree Analysis (FTA) was originally developed in 1962 at Bell Laboratories by H. A. Watson to evaluate the Minuteman I Intercontinental Ballistic Missile Launch Control System. Afterwards, Boeing and AVCO expanded the use of FTA to the entire Minuteman II (22). In 1965, W. H. Pierce unified the Shannon, Von Neumann and Moore theories of masking and redundancy into the concept of failure tolerance (23) (24). In 1967, A. Avizienis integrated masking methods with practical techniques for error detection, fault diagnosis, and recovery into the concept of fault-tolerant systems (23).
The formation of the IEEE Computer Society Technical Committee on Fault-Tolerant Computing (now
Dependable Computing and Fault Tolerance TC) in 1970 and of IFIP Working Group 10.4 on Dependable
Computing and Fault Tolerance in 1980 were important means for defining a consistent set of concepts and
terminology. In the early 1980s, Laprie coined the term dependability to encompass concepts such as reliability, availability, safety, confidentiality, maintainability, security and integrity (17).
In the late 1970s, methods were proposed for mapping Petri nets to Markov chains (28) (29) (30). These models
have been widely adopted as high-level Markov chain automatic generation models as well as for discrete event
simulation. Natkin was the first to apply what is now generally called Stochastic Petri nets to dependability
evaluation of systems (29).
3. Basic Concepts
This section introduces and defines several fundamental concepts, taxonomy and quantitative measures for
dependability.
As mentioned in the beginning of the chapter, the dependability of a system is its capability of delivering a set of trustable services that are observed by outside agents. A service is trustworthy when it implements the system's specified functionality. A system failure occurs when the system fails to provide its specified functionality.
A fault can be defined as the failure of a component of the system, a subsystem of the system, or another system which interacts with the considered system. Hence, every fault is a failure from some point of view. A fault can cause other faults, a system failure, or neither. A system with faults that delivers its specified functionality is said to be fault tolerant; that is, the system does not fail even when there are faulty components. Distinguishing faults from failures is fundamental for understanding the fault tolerance concept. The observable outcome of a fault at the system interface is called a symptom, and the most extreme symptom of a fault is a failure. Therefore, an analyst evaluating the inner parts of a system might detect faulty components or subsystems. From that point of view, a faulty component (or subsystem) has failed, since the level of detail analyzed is lower.
Consider an indicator random variable X(t) that represents the state of system S at time t: X(t) = 1 represents the operational state and X(t) = 0 the faulty state (see Figure 1). More formally,

X(t) = 1 if S is operational at time t; X(t) = 0 otherwise.   (1)

Figure 1: States of X(t)
The time to failure of system S is represented by the random variable T, whose cumulative distribution function (see Figure 2) is

F(t) = P(T ≤ t), t ≥ 0,   (2)

and whose density function (see Figure 3) is f(t) = dF(t)/dt.

The probability that the system S does not fail up to time t (the reliability; see Figure 4) is

R(t) = P(T > t) = 1 − F(t).   (3)

The probability of the system S failing within the interval (t, t + Δt] may be calculated by

P(t < T ≤ t + Δt) = F(t + Δt) − F(t).

The probability of the system S failing during the interval (t, t + Δt], given that it has survived up to time t (the conditional probability of failure), is [F(t + Δt) − F(t)]/R(t). Dividing by Δt and letting Δt → 0, we then obtain

λ(t) = lim_{Δt→0} [F(t + Δt) − F(t)] / (Δt · R(t)) = f(t)/R(t),   (4)

where λ(t) is named the hazard function. Hazard rates may be characterized as decreasing failure rate (DFR), constant failure rate (CFR) or increasing failure rate (IFR) according to whether λ(t) decreases, remains constant or increases with t. Since

λ(t) = f(t)/R(t) = −(dR(t)/dt) / R(t),   (5)

thus,

R(t) = exp(−∫₀ᵗ λ(u) du),   (6)

Failures over a system's lifetime are commonly divided into three classes: infant-mortality failures, usually attributed to manufacturing or design problems, wear-out failures, which are related to fatigue or exhaustion, and normal life failures, which are considered to be random.
Infant mortality is commonly represented by a decreasing hazard rate (see Figure 5.a), wear-out failures are typically represented by an increasing hazard rate (Figure 5.c), and normal life failures are usually depicted by a constant hazard rate (see Figure 5.b). The superposition of these three separate hazard rate functions forms the so-called bathtub curve (Figure 5.d).
Figure 5: Hazard rate: (a) Decreasing, (b) Constant, (c) Increasing, (d) Bathtub curve
The mean time to fail (MTTF) is defined by:

MTTF = E[T] = ∫₀^∞ t f(t) dt.   (7)

Since f(t) = −dR(t)/dt, thus MTTF = −∫₀^∞ t (dR(t)/dt) dt. Integrating by parts, and noting that t R(t) → 0 as t → ∞, hence

MTTF = ∫₀^∞ R(t) dt,   (8)

which is often easier to compute than (7).
Another central tendency reliability measure is the median time to failure (MedTTF), defined by:

R(MedTTF) = 0.5.   (9)

The median time to failure divides the time-to-failure distribution into two halves, where 50% of failures occur before MedTTF and the other 50% after.
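As a quick numerical illustration of Equations (8) and (9), the sketch below integrates R(t) numerically for an exponentially distributed lifetime and checks the closed-form MTTF and median; the rate value lam is illustrative:

```python
import math

# Numeric sketch of Equations (8) and (9) for an exponentially distributed
# lifetime; the rate lam = 0.5 is illustrative.
lam = 0.5
def R(t):                     # reliability function R(t) = e^(-lam*t)
    return math.exp(-lam * t)

# MTTF = integral of R(t) dt (Equation 8), by the trapezoidal rule
dt, horizon = 1e-3, 50.0
steps = int(horizon / dt)
mttf = sum((R(i * dt) + R((i + 1) * dt)) / 2 * dt for i in range(steps))

# Median time to failure: solve R(t) = 0.5 (Equation 9)
median = math.log(2) / lam

print(round(mttf, 3))    # close to 1/lam = 2.0
print(round(median, 3))  # 1.386
```

For the exponential distribution the two measures differ: the median ln(2)/λ is always smaller than the mean 1/λ.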
Consider a continuous-time random variable X′(t) that represents the state of a repairable system S: X′(t) = 0 when S is failed, and X′(t) = 1 when S has been repaired (see Figure 6). More formally,

X′(t) = 1 if S has been repaired by time t; X′(t) = 0 otherwise.   (10)

Figure 6: States of X′(t)

Now, consider the random variable D that represents the time to reach the state X′(t) = 1, given that the system started in state X′(0) = 0 at time t = 0. Therefore, the random variable D represents the system time to repair, F_D(t) its cumulative distribution function, and f_D(t) the respective density function, where

F_D(t) = P(D ≤ t) = ∫₀ᵗ f_D(u) du,   (11)

and the maintainability, the probability that the system is repaired by time t, is

M(t) = F_D(t).   (12)
Consider a repairable system S that is either operational (Up) or faulty (Down). Figure 7 shows the system state transition model: the system is in state Up when operational and in state Down when faulty. Whenever the system fails, a set of activities is conducted in order to carry out the restoring process. These activities might encompass administrative time, transportation time, logistic times, etc. When the maintenance team arrives at the system site, the actual repairing process may start. Further, this time may also be divided into diagnosis time, actual repair time, checking time, etc. However, for the sake of simplicity, we group these times such that the downtime equals the time to restore (TR), which is composed of a non-repair time (TNR, which groups transportation times, order times, delivery times, etc.) and the time to repair (TTR) (see Figure 8). Thus,

TR = TNR + TTR.   (13)

Taking expected values, the mean time to restore is

MTR = MNRT + MTTR,   (14)

where MNRT is the mean non-repair time and MTTR is the mean time to repair.
Over a long period of observation, the system alternates between uptime and downtime periods. The steady-state availability is the long-run fraction of time the system is operational; therefore:

A = E[Uptime] / (E[Uptime] + E[Downtime]) = MTTF / (MTTF + MTR),   (15)

and, if the non-repair time is negligible, A = MTTF / (MTTF + MTTR).
The instantaneous availability is the probability that the system is operational at time t, that is, A(t) = P(X(t) = 1).

A widely adopted time-to-failure distribution is the exponential distribution, whose density function is f(t) = λe^{−λt}, t ≥ 0, where λ (the rate) is the parameter of this distribution. The respective reliability function, cumulative distribution function, hazard function (failure rate), mean (mean time to failure) and variance are, respectively, R(t) = e^{−λt}, F(t) = 1 − e^{−λt}, λ(t) = λ, MTTF = 1/λ and Var[T] = 1/λ².
Table 1 summarizes the density, reliability, cumulative distribution and hazard functions, and the mean and variance, of the above-mentioned distributions. Their parameters are:

Distribution              Parameters
Exponential               λ (rate)
Erlang                    λ (rate of each phase); k (number of phases)
Hyperexponential          λ₁, …, λₖ (rates); p₁, …, pₖ (probabilities, Σ pᵢ = 1)
Weibull                   η (scale); β (shape)
Normal (left truncated)   μ (mean); σ² (variance)
Standard Normal           μ = 0 (mean); σ² = 1 (variance)
Lognormal                 μ, σ (parameters of the associated normal distribution)
Let xᵢ denote the state of component i:

xᵢ = 1 if component i is operational; xᵢ = 0 if component i has failed.   (17)

The vector x = (x₁, x₂, …, xₙ) represents the state of each component of the system, and it is named the state vector. The system state may be represented by a discrete random variable φ(x), named the structure function, such that

φ(x) = 1 if the system is operational in state x; φ(x) = 0 otherwise.   (18)

The structure function may be written by conditioning on the state of a component i:

φ(x) = xᵢ φ(1ᵢ, x) + (1 − xᵢ) φ(0ᵢ, x),   (19)

where (1ᵢ, x) = (x₁, …, xᵢ₋₁, 1, xᵢ₊₁, …, xₙ) and (0ᵢ, x) = (x₁, …, xᵢ₋₁, 0, xᵢ₊₁, …, xₙ). Equation 19 expresses the system structure function in terms of two conditions. The first term (xᵢ φ(1ᵢ, x)) represents the state where component i is operational and the states of the other components are random variables. The second term ((1 − xᵢ) φ(0ᵢ, x)), on the other hand, states the condition where component i has failed and the states of the other components are random variables. Equation 19 is known as the factoring of the structure function and is very useful for studying complex system structures, since through its repeated application one can eventually reach a subsystem whose structure function is simple to deal with (1).

A component of a system is irrelevant to the dependability of the system if the state of the system is not affected by the state of the component. In mathematical terms, a component i is said to be irrelevant to the structure function if φ(1ᵢ, x) = φ(0ᵢ, x) for every x. A system with structure function φ is said to be coherent if and only if φ is non-decreasing in each xᵢ and every component is relevant. A function φ is non-decreasing if, for every two state vectors x and y such that x < y, φ(x) ≤ φ(y). Another aspect of coherence that should be highlighted is that replacing a failed component in a working system does not make the system fail. However, it does not mean that a failed system will work if a failed component is substituted by an operational component.
Example 1: Consider a coherent system S composed of three components, a, b and c, that is operational whenever component a and at least one of components b and c are operational. Its structure function is φ(x) = x_a (1 − (1 − x_b)(1 − x_c)).

Now, factoring on component a:

φ(x) = x_a φ(1_a, x) + (1 − x_a) φ(0_a, x).

As φ(0_a, x) = 0, thus φ(x) = x_a φ(1_a, x). Therefore:

φ(x) = x_a [1 − (1 − x_b)(1 − x_c)].

Factoring φ(1_a, x) on component c, we get φ(1_a, 1_c, x) = 1 and φ(1_a, 0_c, x) = x_b; thus:

φ(x) = x_a [x_c + (1 − x_c) x_b] = x_a (x_b + x_c − x_b x_c).

In some cases, simplifying the structure function may not be an easy task. The logic function of a coherent system may be adopted to simplify the system structure function through Boolean algebra.
As described earlier, assume a system S composed of a set of components C = {c₁, c₂, …, cₙ}, where the system S and its components can be either operational or faulty. Let sᵢ be the Boolean variable that indicates that component cᵢ is operational, and s̄ᵢ its complement, that is, s̄ᵢ indicates that cᵢ has failed. The Boolean variable sᵢ is the counterpart of the state variable xᵢ.

Adopting the system presented in Example 1, one may observe that the system is functioning if components a and b are working or if a and c are. The respective system logic function is, more formally,

S = (a ∧ b) ∨ (a ∧ c).

The structure function may be obtained from the logic function using the respective counterparts: each Boolean conjunction corresponds to a product of state variables, each complement corresponds to one minus the respective state variable, and a disjunction of terms corresponds to one minus the product of the complements of the corresponding products. Hence:

φ(x) = 1 − (1 − x_a x_b)(1 − x_a x_c).
4. Modeling Techniques
The aim of this section is to introduce a set of important model types for dependability evaluation, as well as to offer the reader a summary view of key methods. The section begins with a classification of models; then the main combinatorial and state-space models are described along with the respective analysis methods.
4.1. Model Classification
This section presents a classification of dependability models. These models may be broadly classified into combinatorial and state-space models. State-space models may also be referred to as non-combinatorial models, and combinatorial models can be identified as non-state-space models.
Combinatorial models capture the conditions that make a system fail (or work) in terms of structural relationships between the system components. These relations specify the set of components (and sub-systems) of the system that should be either properly working or faulty for the system as a whole to be working properly.
State-space models represent the system behavior (failures and repair activities) by its states and event occurrence
expressed as labeled state transitions. Labels can be probabilities, rates or distribution functions. These models allow
representing more complex relations between components of the system, such as dependencies involving sub-systems
and resource constraints. Some state-space models may also be evaluated by discrete event simulation in the case of intractably large state spaces, or when a combination of non-exponential distributions prohibits an analytic solution. In some special cases state-space analytic models can be solved to derive a closed-form answer, but generally a numerical solution of the underlying equations is necessary, using a software package.
The most prominent combinatorial model types are Reliability Block Diagrams, Fault Trees and Reliability Graphs;
Markov Chains, Stochastic Petri nets, and Stochastic Process algebras are the most widely used state-space models. Next we introduce these model types and their respective evaluation methods.
4.2. Combinatorial Models
This section describes the two most relevant combinatorial model types for dependability evaluation, namely,
Reliability Block Diagrams (RBD) and Fault Trees (FT), and their respective evaluation methods.
The first two sections define each model type: their syntax, semantics, modeling power and constraints. Each model type is introduced and then explained through examples, so as to help the reader not only master the related math but also acquire practical modeling skills.
The subsequent sections concern the analysis methods applied to the previously presented models. First a basic set of
standard methods is presented. The methods are particularly applied to models of systems in which components are
arranged in series, in parallel, or as a combination of series and parallel compositions of components. Afterwards, a set of methods that applies to non-series-parallel configurations is presented. These methods are more general than the basic methods, since they can be applied to evaluate sophisticated component compositions, but their computational complexity is of concern. The methods described are series-parallel reductions, minimal cut and path computation methods, decomposition, sum of disjoint products (SDP), and delta-star and star-delta transformations. Besides, dependability bounds computation is presented. Finally, measures of component importance are presented.
Consider a series system composed of n independent components, in which all components must be operational for the system to be operational. If pᵢ is the probability that component i is operational, the probability that the system is operational is

P = p₁ p₂ … pₙ.   (23)

Therefore, the system reliability is

R_S(t) = ∏_{i=1}^{n} Rᵢ(t),   (24)

where Rᵢ(t) is the reliability of block i. The series system availability and maintainability are obtained analogously, by replacing Rᵢ(t) with Aᵢ(t) and Mᵢ(t), respectively.

Now consider a parallel system composed of n independent components, in which at least one component must be operational for the system to be operational. The system fails only if every component fails; thus

1 − P = (1 − p₁)(1 − p₂) … (1 − pₙ).   (28)

The system reliability is then

R_P(t) = 1 − ∏_{i=1}^{n} (1 − Rᵢ(t)),   (29)

and the parallel system availability and maintainability are obtained by replacing Rᵢ(t) with Aᵢ(t) and Mᵢ(t), respectively.

Due to the importance of the parallel structure, the following simplifying notation is adopted:

R₁(t) ⊕ R₂(t) ⊕ … ⊕ Rₙ(t) = 1 − ∏_{i=1}^{n} (1 − Rᵢ(t)).

Example 2: Consider a system represented by the RBD in Figure 14. This model is composed of four blocks (b₁, b₂, b₃, b₄), where each block bᵢ has Rᵢ(t) as its respective reliability.
In a coherent system S, a state vector x is called a path vector if φ(x) = 1, and P_x = {i | xᵢ = 1} is the respective path set. Hence, a path set is a set of components such that, if every one of its components is operational, the system is also operational. P_x is a minimal path set if φ(y) = 0 for any y < x, that is, P_x comprises a minimal set of components that should be operational for the system to be operational. In a series system of n components (see Figure 11), there is only one minimal path set, and it is composed of every component of the system. On the other hand, if we consider a parallel system with n components, as depicted in Figure 13, then the system has n minimal path sets, where each set is composed of only one component.

A state vector x is called a cut vector if φ(x) = 0, and C_x = {i | xᵢ = 0} is the respective cut set. C_x is a minimal cut set if φ(y) = 1 for any y > x. In the series system, there are n minimal cut sets, where each cut set is composed of one component only. The minimal cut set of the parallel system is composed of all the components of the system. The system of Figure 14 has two minimal path sets and three minimal cut sets.
Structures like k-out-of-n, bridges, and delta and star arrangements have customarily been represented by RBDs; nevertheless, such structures can only be represented if the components are replicated in the model. Consider a system composed of 3 identical and independent components (b₁, b₂, b₃) that is operational if at least 2 out of its 3 components are working properly. The success probability of each of those blocks is p. This system can be considered as a single block (see Figure 15) whose success probability (reliability, availability or maintainability) is given by

P = Σ_{i=2}^{3} C(3, i) pⁱ (1 − p)^{3−i} = 3p² − 2p³.   (33)
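The binomial sum behind Equation (33) generalizes to any k-out-of-n block with i.i.d. components; a minimal sketch (the value p = 0.9 is illustrative):

```python
from math import comb

# Sketch of Equation (33): success probability of a k-out-of-n block with
# i.i.d. components of success probability p. The value p = 0.9 is
# illustrative.
def k_out_of_n(k, n, p):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p = 0.9
r = k_out_of_n(2, 3, p)
print(round(r, 4))   # 0.972
# For 2-out-of-3 the sum reduces to 3p^2 - 2p^3:
assert abs(r - (3 * p**2 - 2 * p**3)) < 1e-12
```

Note the special cases: k = n yields the series formula p^n, and k = 1 yields the parallel formula 1 − (1 − p)^n.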
The main FT constructs are:

Construct              Description
TOP event              Represents the system failure.
Basic event            An event that may cause a system failure.
Basic repeated event   A basic event that appears more than once in the tree.

It is important to stress that the graphical representation of the symbols that denote these constructors may vary according to the chosen standard and the adopted tools.
in Figure 17.b. For the sake of conciseness, the Boolean variables representing the events (faults) of each device are named with the respective device names.
The FT structure function is easily derived. Naming the state variables after the respective events, the structure function of an AND gate is the product of its input variables, whereas that of an OR gate is one minus the product of the complements of its input variables. Composing the gate functions from the basic events up to the TOP event yields the structure function of the whole tree.

As with RBDs, a fault tree containing a repeated event may be factored on that event: when the corresponding state variable equals 0 the FT reduces to one subtree, and when it equals 1 it reduces to another. Therefore, the original FT in Figure 18.b may be factored into two FTs, one considering the repeated event as having occurred and the other as not having occurred, as shown in Figure 18.c.
4.3. Analysis Methods
This section introduces some important methods adopted with combinatorial models for the calculation of the system probability of failure when components are independent. To simplify the notation, the reliability (Rᵢ(t)), steady-state availability (Aᵢ) or instantaneous availability (Aᵢ(t)) of components and system may replace pᵢ and P.

4.3.1. Expected Value of the Structure Function

The most straightforward strategy for computing the system reliability (availability, maintainability) of a system composed of independent components is through the respective definition. Hence, consider a system S and its respective structure function φ(X). The system reliability is defined by P{φ(X) = 1} = E[φ(X)]. Since each Xᵢ is a Bernoulli random variable, E[Xᵢ] = P{Xᵢ = 1} = pᵢ. As Xᵢ is a binary variable, Xᵢᵏ = Xᵢ for any i and k; hence E[Xᵢᵏ] = E[Xᵢ] = pᵢ; therefore, E[φ(X)] is a polynomial function in which each variable has degree 1.
Summarizing, the main steps for computing the system failure (or success) probability by adopting this method are:
i) obtain the system structure function;
ii) remove the powers of each variable Xᵢ; and
iii) replace each variable Xᵢ by the respective pᵢ.
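The steps above can be checked by direct enumeration, which is equivalent to taking the expected value of the structure function; the structure function (2-out-of-3) and the component probabilities below are illustrative:

```python
from itertools import product

# Sketch of the expected-value method: for independent components,
# P{phi(X) = 1} = E[phi(X)], computed here by direct enumeration of all
# state vectors. The structure function (2-out-of-3) and the component
# probabilities are illustrative.
def system_prob(phi, p):
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        weight = 1.0
        for xi, pi in zip(x, p):
            weight *= pi if xi else 1 - pi   # P{X = x} for independent X_i
        total += phi(x) * weight
    return total

phi = lambda x: 1 if sum(x) >= 2 else 0      # 2-out-of-3 structure function
p = [0.9, 0.9, 0.9]
print(round(system_prob(phi, p), 4))         # 0.972, i.e. 3p^2 - 2p^3
```

Enumeration costs 2^n evaluations, which is exactly why the polynomial simplification in steps ii) and iii) matters for larger systems.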
Example 5: Consider a 2-out-of-3 system represented by the RBD in Figure 15.a. The structure function of the RBD presented in Figure 15.b is

φ(X) = 1 − (1 − X₁X₂)(1 − X₁X₃)(1 − X₂X₃).   (35)

Expanding, and considering that Xᵢᵏ = Xᵢ for any k,

φ(X) = X₁X₂ + X₁X₃ + X₂X₃ − 2X₁X₂X₃.   (37)

Therefore

E[φ(X)] = p₁p₂ + p₁p₃ + p₂p₃ − 2p₁p₂p₃.

As p₁ = p₂ = p₃ = p,

E[φ(X)] = 3p² − 2p³,   (38)

which is equal to Equation 33.
4.3.2. Factoring

This method is based on conditioning on the states of certain components. Consider the system structure function as depicted in Equation 19 and identify a pivot component i; then

E[φ(X)] = E[Xᵢ φ(1ᵢ, X)] + E[(1 − Xᵢ) φ(0ᵢ, X)].

As the components are independent, and since E[Xᵢ] = pᵢ and E[1 − Xᵢ] = 1 − pᵢ, then:

P{φ(X) = 1} = pᵢ P{φ(1ᵢ, X) = 1} + (1 − pᵢ) P{φ(0ᵢ, X) = 1}.   (39)
Example 6: Consider the system composed of three components, a, b and c, depicted in Figure 10, and let φ(X) denote the system structure function. Factoring on component a:

P{φ(X) = 1} = p_a P{φ(1_a, X) = 1} + (1 − p_a) P{φ(0_a, X) = 1}.

As component a is in series with the rest of the structure, φ(0_a, X) = 0, so:

P{φ(X) = 1} = p_a P{φ(1_a, X) = 1}.

Now, factoring on component b:

P{φ(1_a, X) = 1} = p_b P{φ(1_a, 1_b, X) = 1} + (1 − p_b) P{φ(1_a, 0_b, X) = 1}.

As φ(1_a, 1_b, X) = 1 and φ(1_a, 0_b, X) = X_c, and E[X_c] = p_c, then:

P{φ(X) = 1} = p_a [p_b + (1 − p_b) p_c],

which is the result obtained by the expected value of the structure function.
4.3.3. Reductions
Plain series and parallel systems are the most fundamental dependability structures. The dependability of such systems is analyzed through the equations described in Section 4.2.1. Other more complex structures, such as k-out-of-n and bridge structures, may also be directly evaluated as single components using the equations also presented in Section 4.2.1. The dependability evaluation of complex system structures might be conducted iteratively by identifying series, parallel, k-out-of-n and bridge subsystems, evaluating each of those subsystems, and then reducing each subsystem to one respective equivalent block. This process may be applied iteratively to the resultant structures until a single block results.
Consider a series system composed of n components (Figure 19.a) whose failure probabilities are q₁, q₂, …, qₙ. It may be reduced to a one-component equivalent system (Figure 19.b) whose failure probability is

q_s = 1 − ∏_{i=1}^{n} (1 − qᵢ).

Likewise, a parallel system of n components may be reduced to one equivalent block whose failure probability is ∏ qᵢ. The 2-out-of-3 structure can be represented by one equivalent block whose reliability is 3p² − 2p³ (Equation 33). The bridge structure can be transformed into one component (see Figure 22) whose failure probability is obtained, for instance, by factoring on the bridging component (Section 4.3.2).
Given a minimal path set Pⱼ, the structure function of the corresponding series arrangement of its components is

ρⱼ(x) = ∏_{i∈Pⱼ} xᵢ,

named the minimal path series structure function. As the system S is working if at least one of the p minimal paths is functioning, the system structure function is

φ(x) = 1 − ∏_{j=1}^{p} (1 − ρⱼ(x)).

Alternatively, considering the cut sets, the structure function of a particular minimal cut set Cⱼ is

κⱼ(x) = 1 − ∏_{i∈Cⱼ} (1 − xᵢ),

named the minimal cut parallel structure function. As the system S fails if at least one of the k minimal cuts fails, the system structure function is

φ(x) = ∏_{j=1}^{k} κⱼ(x).
Example 9: The failure probability of the system depicted in Figure 26 can be computed either by Equation 43 or 45. The structure functions of the minimal path structures P₁ = {a, b} and P₂ = {a, c} are ρ₁(X) = X_a X_b and ρ₂(X) = X_a X_c, respectively. Therefore,

φ(X) = 1 − (1 − X_a X_b)(1 − X_a X_c) = X_a X_b + X_a X_c − X_a X_b X_c.

Hence,

P{φ(X) = 1} = p_a p_b + p_a p_c − p_a p_b p_c,

and the system failure probability is 1 − P{φ(X) = 1}.
The adoption of minimal paths and cuts for dependability computation requires deriving the respective terms from the minimal paths and cuts, and deciding how to combine them when computing the system probability of failure (or probability of functioning). The union of the minimal paths or cuts of a system can be represented by the system logic function, which may have several terms. If these terms are disjoint, then the dependability measure (reliability, availability or maintainability) can be directly computed by the simple summation of the probabilities related to each term. Otherwise, the probability related to one event (path or cut) is summed with the probabilities of the event occurrences represented by disjoint products of terms.

Consider a system composed of three independent components a, b and c, depicted in Figure 26, where the component failure probabilities are q_a, q_b and q_c, respectively. P₁ = {a, b} and P₂ = {a, c} are minimal path sets, and C₁ = {a} and C₂ = {b, c} are minimal cut sets. The minimal path sets are depicted in Figure 27. As usual, let sᵢ denote the Boolean variable related to component i and xᵢ the respective state variable.
The probability of a union of events E₁, …, Eₙ may be expressed as the probability of the first event plus the probabilities of successive disjoint products:

P(E₁ ∪ … ∪ Eₙ) = P(E₁) + P(Ē₁E₂) + P(Ē₁Ē₂E₃) + …   (48)

For implementing Expression 48, many algorithms have been proposed for efficient evaluation of the additional contribution toward the union by events that have not been accounted for by any of the previous events (2).
Example 9: Consider the RBD presented in Figure 26, with the given component operational probabilities. The minimal path sets are P₁ = {a, b} and P₂ = {a, c}, and the minimal cut sets are C₁ = {a} and C₂ = {b, c}.
The operational probability computed in the first iteration of Expression 48, when considering the minimal path P₁, is 0.980296. If the minimal cuts are adopted instead of paths, and if C₁ is considered first, the operational probability is 0.990099. So 0.980296 ≤ P ≤ 0.990099. In the second iteration, the operational probability calculated by adding the disjoint product of P₂ is 0.9900019. When adopting the cuts, the next (and sole) disjoint product is that of C₂; the operational probability computed considering this additional term is 0.9900019. The reader may observe that the two bounds have converged. Thus, the system operational probability is P = 0.9900019.
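The SDP iteration can be sketched on a small hypothetical system (one that works when a works and at least one of b and c works; probabilities are illustrative). Each disjoint-product term tightens the lower bound, and once all paths are included the bound equals the exact value:

```python
from itertools import product

# Sketch of the SDP iteration on a hypothetical system that works when
# component a works and at least one of b, c works; the minimal paths are
# P1 = {a, b} and P2 = {a, c}. Probabilities are illustrative.
pa = pb = pc = 0.99

# First term of the SDP expansion: P{P1 up} -- a first lower bound
lb1 = pa * pb

# Second term: the disjoint product P{P2 up and P1 not up}
#            = P{a up, c up, b down}
lb2 = lb1 + pa * pc * (1 - pb)

# Exact value by brute-force state enumeration, for comparison
exact = sum(
    (pa if xa else 1 - pa) * (pb if xb else 1 - pb) * (pc if xc else 1 - pc)
    for xa, xb, xc in product((0, 1), repeat=3)
    if xa and (xb or xc)
)
print(round(lb1, 6), round(lb2, 6))   # 0.9801 0.989901
```

Since both minimal paths have been accounted for here, the second bound already coincides with the exact probability.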
Min-Max
The min-max method provides dependability bounds for coherent systems using minimal paths and cuts. Consider a coherent system S, whose structure function is φ(X), with p minimal path sets P₁, P₂, …, P_p and k minimal cut sets C₁, C₂, …, C_k, where ρᵢ(X) is the structure function of minimal path i and κⱼ(X) is the structure function of minimal cut j. P{ρᵢ(X) = 1} is the operational probability of path i, whereas P{κⱼ(X) = 0} is the probability that every component of the minimal cut j has failed. One should bear in mind that the system works if at least one minimal path works, and fails if at least one minimal cut fails.
A dependability lower bound may be obtained from the best minimal path, and an upper bound from the weakest minimal cut:

max_{1≤i≤p} P{ρᵢ(X) = 1} ≤ P{φ(X) = 1} ≤ min_{1≤j≤k} P{κⱼ(X) = 1}.

Hence,
Example 10: Consider the bridge system depicted in Figure 16. The system is composed of a set of five components, C = {c_1, c_2, c_3, c_4, c_5}. The sets of minimal paths and cuts are P_1 = {c_1, c_4}, P_2 = {c_2, c_5}, P_3 = {c_1, c_3, c_5}, P_4 = {c_2, c_3, c_4} and K_1 = {c_1, c_2}, K_2 = {c_4, c_5}, K_3 = {c_1, c_3, c_5}, K_4 = {c_2, c_3, c_4}, respectively. The state variables of each component are named by a labeling function such that l(c_i) = x_i, where i ∈ {1, 2, 3, 4, 5}. Therefore the set of state variables is x = {x_1, x_2, x_3, x_4, x_5} (see footnote 5).

The structure functions of the minimal paths are

ρ_1(x) = x_1 x_4, ρ_2(x) = x_2 x_5, ρ_3(x) = x_1 x_3 x_5, ρ_4(x) = x_2 x_3 x_4.

Since the x_i are Bernoulli random variables and every variable is independent of the others, then

P(ρ_1(x) = 1) = p_1 p_4, P(ρ_2(x) = 1) = p_2 p_5, P(ρ_3(x) = 1) = p_1 p_3 p_5, P(ρ_4(x) = 1) = p_2 p_3 p_4,

where p_i = P(x_i = 1). Hence the lower bound is

max{p_1 p_4, p_2 p_5, p_1 p_3 p_5, p_2 p_3 p_4} ≤ P(φ(x) = 1).

Adopting the same process for the minimal cuts, the structure functions

κ_1(x) = 1 − (1 − x_1)(1 − x_2), κ_2(x) = 1 − (1 − x_4)(1 − x_5), κ_3(x) = 1 − (1 − x_1)(1 − x_3)(1 − x_5), κ_4(x) = 1 − (1 − x_2)(1 − x_3)(1 − x_4)

are calculated, so that P(κ_1(x) = 1) = 1 − q_1 q_2, P(κ_2(x) = 1) = 1 − q_4 q_5, P(κ_3(x) = 1) = 1 − q_1 q_3 q_5, and P(κ_4(x) = 1) = 1 − q_2 q_3 q_4, where q_i = 1 − p_i. Therefore the upper bound is

P(φ(x) = 1) ≤ min{1 − q_1 q_2, 1 − q_4 q_5, 1 − q_1 q_3 q_5, 1 − q_2 q_3 q_4}.

If every component has the same operational probability p_i = p, the bounds reduce to p² ≤ P(φ(x) = 1) ≤ 1 − q².
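The min-max bound computation for the bridge system can be sketched in a few lines of Python. The minimal path and cut sets are those of the bridge structure; the identical component probability p = 0.9 is an assumed illustrative value, not one given in the chapter.

```python
from math import prod

# Bridge system: minimal path sets and minimal cut sets (components 1..5).
paths = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]
cuts = [{1, 2}, {4, 5}, {1, 3, 5}, {2, 3, 4}]

p = {i: 0.9 for i in range(1, 6)}  # assumed identical operational probabilities

# Lower bound: the best single minimal path, max_i P(rho_i(x) = 1).
lower = max(prod(p[c] for c in path) for path in paths)

# Upper bound: the weakest minimal cut, min_j P(kappa_j(x) = 1) = min_j (1 - prod of q).
upper = min(1 - prod(1 - p[c] for c in cut) for cut in cuts)
# lower is approximately 0.81 and upper approximately 0.99 for p = 0.9
```

With identical components the bounds collapse to p² and 1 − q², as noted in the example; with distinct p_i the same two lines apply unchanged.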
SDP
The SDP method, described in Section 4.3.5, provides consecutive lower (upper) bounds when adopting successive minimal paths (cuts). The basic algorithm adopted in the SDP method orders the minimal paths and cuts from shorter to longer ones. If the system components are similarly dependable, longer paths provide smaller contributions to the lower bound, whereas shorter cuts provide the smaller (tighter) upper bounds.
Consider an RBD with n minimal paths and m minimal cuts. First take into account only the set of minimal path sets, {P_1, P_2, …, P_n}. According to Equation 48, the system dependability (reliability, availability etc.) may also be successively expressed so as to find tighter and tighter lower bounds: considering only the first i minimal paths, the probability of the union of the respective path events is a lower bound on the system dependability, and each additional disjoint product tightens that bound.
5 The reader should bear in mind that x has been interchangeably adopted to represent sets and vectors whenever the context is clear.
Example 11: Consider again the bridge system depicted in Figure 16. The sets of minimal paths and cuts are P_1 = {c_1, c_4}, P_2 = {c_2, c_5}, P_3 = {c_1, c_3, c_5}, P_4 = {c_2, c_3, c_4} and K_1 = {c_1, c_2}, K_2 = {c_4, c_5}, K_3 = {c_1, c_3, c_5}, K_4 = {c_2, c_3, c_4}, respectively, where x_i is the Boolean state variable of component c_i, p_i = P(x_i = 1), and q_i = 1 − p_i.

Considering the minimal path sets ordered from shorter to longer, the first lower bound is

P(φ(x) = 1) ≥ P(ρ_1(x) = 1) = p_1 p_4.

Adopting the same process, the successive lower bounds are obtained. Therefore, a second, a third and a fourth tighter bound may be computed. The second and the third bounds are:

P(φ(x) = 1) ≥ p_1 p_4 + p_2 p_5 (1 − p_1 p_4),

P(φ(x) = 1) ≥ p_1 p_4 + p_2 p_5 (1 − p_1 p_4) + p_1 p_3 p_5 q_2 q_4.

The fourth and last bound (the exact value) is then calculated by adding the final disjoint product:

P(φ(x) = 1) = p_1 p_4 + p_2 p_5 (1 − p_1 p_4) + p_1 p_3 p_5 q_2 q_4 + p_2 p_3 p_4 q_1 q_5.
The same process may be applied considering the minimal cut sets in order to obtain upper bounds. Let every component have the operational probability p_i = 10/11. If this process is applied for the minimal paths and minimal cuts, the following lower and upper bounds are obtained:
Table 3: Successive Lower and Upper Bounds

Iteration i | Lower bound | Upper bound
1 | 0.826446280992 | 0.991735537190
2 | 0.969879106618 | 0.983539375726
3 | 0.976088319849 | 0.982918454403
4 | 0.982297533080 | 0.982297533080

Note that at the fourth (and last) iteration the lower and the upper bounds are equal; we then have the exact result.
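The successive bounds of Table 3 can be reproduced numerically. The sketch below uses exhaustive enumeration of component states rather than the SDP disjoint-product bookkeeping, but the quantities it computes, P(ρ_1 ∪ … ∪ ρ_i) for the paths and the complements of the cut unions, are exactly the successive bounds; p = 10/11 for every component reproduces the table.

```python
from itertools import product as cartesian

# Bridge system minimal path sets and minimal cut sets, ordered short -> long.
paths = [{1, 4}, {2, 5}, {1, 3, 5}, {2, 3, 4}]
cuts = [{1, 2}, {4, 5}, {1, 3, 5}, {2, 3, 4}]
p = 10 / 11  # identical component operational probability (as in Table 3)

def union_prob(events, prob):
    """Exact P(at least one event occurs); an event occurs when all of its
    components are in the triggering state, each with probability `prob`."""
    comps = sorted(set().union(*events))
    total = 0.0
    for states in cartesian((0, 1), repeat=len(comps)):
        s = dict(zip(comps, states))
        if any(all(s[c] for c in ev) for ev in events):
            weight = 1.0
            for c in comps:
                weight *= prob if s[c] else 1 - prob
            total += weight
    return total

# Successive lower bounds: probability that some of the first i paths is up.
lower = [union_prob(paths[:i], p) for i in range(1, len(paths) + 1)]
# Successive upper bounds: 1 - P(some of the first j cuts has all components
# failed); a cut event triggers on "down" states, each with probability q.
upper = [1 - union_prob(cuts[:j], 1 - p) for j in range(1, len(cuts) + 1)]
```

The last lower bound and the last upper bound coincide, confirming the exact system dependability obtained in the fourth iteration.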
4.4. State-Space Models
Dependability models can be broadly classified into non-state-space models (also called combinatorial models) and state-space models. Non-state-space models such as RBDs and FTs, which were introduced in previous sections, can be easily formulated and solved for system dependability under the assumption of stochastic independence between system components: it is assumed that the failure or recovery (or any other behavior) of a component is not affected by the behavior of any other component. To model more complicated interactions between components, we use other types of stochastic models, such as Markov chains or, more generally, state-space models. In this subsection, we briefly look at the most widely adopted state-space models, namely Markov chains. First introduced by Andrei Andreevich Markov in 1907, Markov chains have been used intensively in dependability modeling and analysis since around the fifties. We provide a short introduction to discrete-state stochastic processes and fundamental characteristics including stationarity, homogeneity, and the memoryless property. We then briefly introduce the Discrete Time Markov Chain (DTMC), the Continuous Time Markov Chain (CTMC), and the semi-Markov process. Finally, we briefly introduce stochastic Petri nets (SPN), a high-level formalism that automates the generation of Markov chains.
A stochastic process is a family of random variables X(t) defined on a sample space. The values assumed by X(t) are called states, and the set of all possible states is the state space, I. The state space of a stochastic process is either discrete or continuous. If the state space is discrete, the stochastic process is called a chain. The time parameter (also referred to as the index set) of a stochastic process is likewise either discrete or continuous: if the time parameter is discrete (finite or countably infinite), we have a discrete-time (parameter) process; if it is continuous, we have a continuous-time (parameter) process. A stochastic process can be classified by the dependence of its state at a particular time on the states at previous times. If the state of a stochastic process depends only on the immediately preceding state, we have a Markov process. In other words, a Markov process is a stochastic process whose dynamic behavior is such that the probability distributions for its future development depend only on the present state and not on how the process arrived in that state: at the time of a transition, the entire past history is summarized by the current state. If we assume that the state space, I, is discrete (finite or countably infinite), then the Markov process is known as a Markov chain (or discrete-state Markov process). If we further assume that the parameter space, T, is also discrete, then we have a discrete-time Markov chain (DTMC), whereas if the parameter space is continuous, we have a continuous-time Markov chain (CTMC). The changes of state of the system are called transitions, and the probabilities associated with the various state changes are called transition probabilities. If the transition probabilities are independent of time (step) and depend only on the state, the Markov chain is said to be (time-)homogeneous. The sojourn time in a state of a homogeneous DTMC follows the geometric distribution. The steady-state and transient solution methods for DTMCs and CTMCs are described in more detail in (5). The analysis of a CTMC is similar to that of a DTMC, except that transitions from a given state to another state can happen at any instant of time: for a CTMC, the parameter takes a continuous range of values, while the set of values of X(t) remains discrete. CTMCs are useful models for performance as well as availability prediction; we show CTMC models in the case studies section. The extension of CTMCs to Markov reward models (MRM) makes them even more useful. An MRM is obtained by attaching a reward rate (or a weight) r_i to each state i of the CTMC; we use an MRM in case study 2 to compute the capacity-oriented availability. For a homogeneous CTMC, the sojourn time (the amount of time spent in a state) is exponentially distributed. If we lift this restriction and allow the sojourn time in a state to follow any (non-exponential) distribution function, the process is called a semi-Markov process (SMP). The SMP is not covered in this chapter; more details can be found in (5).
Markov chains are drawn as directed graphs, with transitions labeled by probabilities, rates, or distributions for homogeneous DTMCs, CTMCs, and SMPs, respectively. In Markov chains, states represent various conditions of the system; they can keep track, for instance, of the number of functioning resources and the recovery stage of each failed resource. The transitions between states indicate occurrences of events. A transition can occur from any state to any other state and may represent a simple or a compound event.
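As a small concrete illustration of a CTMC availability model (not one of the chapter's case-study models), consider a two-state up/down chain; the failure rate λ = 1/1000 per hour and repair rate μ = 1/4 per hour below are assumed values.

```python
# Hypothetical two-state availability CTMC: state UP fails with rate lam,
# state DOWN is repaired with rate mu (illustrative values).
lam = 1 / 1000.0  # failure rate, per hour
mu = 1 / 4.0      # repair rate, per hour

# Generator matrix Q (rows sum to zero):
#         UP      DOWN
# UP   [ -lam,    lam ]
# DOWN [   mu,   -mu  ]
# The steady-state vector pi solves pi Q = 0 with pi_UP + pi_DOWN = 1.
# The balance equation pi_UP * lam = pi_DOWN * mu yields the closed form:
pi_up = mu / (lam + mu)    # steady-state availability
pi_down = lam / (lam + mu)

# Expected downtime accumulated over a year of operation.
downtime_minutes_per_year = pi_down * 365 * 24 * 60
```

For these assumed rates the availability is about 0.99602, i.e. roughly 2094 minutes of downtime per year; larger chains are solved the same way, by the linear system πQ = 0 with normalization.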
Hand construction of a Markov model is tedious and error-prone, especially when the number of states becomes very large. The Petri net (PN) is a graphical paradigm for the formal description of the logical interactions among parts of a system, or of the flow of activities in complex systems. The original PN did not have a notion of time; for dependability analysis, it is necessary to introduce durations for the events associated with PN transitions. PNs can be extended by associating time with the firing of transitions, resulting in timed Petri nets. A special case of timed Petri nets is the stochastic Petri net (SPN), where the firing times are considered to be exponentially distributed random variables. The SPN model can be automatically converted into an underlying Markov model and solved. An SPN is a bipartite directed graph consisting of two kinds of nodes: places and transitions. Places typically represent conditions within the system being modeled. Transitions represent events occurring in the system that may cause changes in its condition. Tokens are dots (or integers) associated with places; a place containing tokens indicates that the corresponding condition holds. Arcs connect places to transitions (input arcs) or transitions to places (output arcs). A cardinality (or multiplicity) may be associated with input and output arcs, whereby the enabling and firing rules are changed accordingly. Inhibitor arcs are represented with a circle-headed arc; a transition can fire iff the inhibitor place does not contain any tokens. A priority level can be attached to each PN transition: among all the transitions enabled in a given marking, only those with the highest associated priority level are allowed to fire. An enabling (or guard) function is a Boolean expression composed from the PN primitives (places, transitions, tokens). Sometimes, when some events take an extremely small time to occur, it is useful to model them as instantaneous activities. SPN models were extended into generalized SPNs (GSPN) to allow for such modeling, by permitting some transitions, called immediate transitions, to have zero firing time. For further details see (5).
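The conversion of an SPN into its underlying CTMC amounts to generating the reachability graph of the net and labeling each arc with the rate of the transition that fired. The sketch below is a minimal, hypothetical illustration (no inhibitor arcs, priorities, guards, or immediate transitions): a two-place up/down net whose reachability graph is the two-state availability CTMC; the rates are assumed values.

```python
from collections import deque

# Minimal SPN sketch: a transition is (name, rate, input_places, output_places);
# a marking is a tuple of token counts, one entry per place.
places = ("up", "down")
transitions = [
    ("fail", 1 / 1000.0, {"up": 1}, {"down": 1}),
    ("repair", 1 / 4.0, {"down": 1}, {"up": 1}),
]

def enabled(marking, inputs):
    m = dict(zip(places, marking))
    return all(m[p] >= n for p, n in inputs.items())

def fire(marking, inputs, outputs):
    m = dict(zip(places, marking))
    for p, n in inputs.items():
        m[p] -= n
    for p, n in outputs.items():
        m[p] = m.get(p, 0) + n
    return tuple(m[p] for p in places)

def reachability(initial):
    """Breadth-first generation of the reachability graph; each arc of the
    graph becomes a CTMC transition labeled with the firing rate."""
    rates, seen, work = {}, {initial}, deque([initial])
    while work:
        mk = work.popleft()
        for _, rate, ins, outs in transitions:
            if enabled(mk, ins):
                nxt = fire(mk, ins, outs)
                rates[(mk, nxt)] = rates.get((mk, nxt), 0.0) + rate
                if nxt not in seen:
                    seen.add(nxt)
                    work.append(nxt)
    return rates

ctmc = reachability((1, 0))  # initial marking: one token in "up"
```

Tools such as SHARPE and SPNP perform this generation (with the full GSPN semantics) automatically before solving the resulting Markov chain.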
5. Case Studies
In this section we present two case studies that illustrate the application of the aforementioned techniques to areas related to service computing. Some of these cases are built upon the authors' experience in dependability research projects carried out for companies in the IT industry. Even the cases created hypothetically are developed as close as possible to real-world scenarios.
Parts of the system are designed to be replaced either by the customer's personnel (customer-replaceable unit, CRU) or only by an authorized manufacturer representative (field-replaceable unit, FRU). The main difference is the time to repair: an FRU involves a longer repair time than a CRU because it is not available locally, and hence the travel time must be accounted for. Since the CPU is very often considered an FRU in real data centers, we modeled the repair service under this assumption. Hence, from state DW the replacement of the failed CPU is requested, and the service personnel arrive within a mean time of 1/spFRU to fix and reboot the system (transition from FR to UP) afterward. Table 4 shows the CPU subsystem model parameter values. Most of these values are based on industry standards and specialist judgment.
Table 4: Input Parameters of the CPU subsystem model

Param | Description | Value
1/cpu | mean time for processor failure operating with 2 proc. | 1,000,000 hours
1/cpu1 | mean time for processor failure operating with 1 proc. | (1/cpu) / 2
Ccpu | coverage factor for cpu failure | 0.99
1/app | mean time to reboot the whole computer | 2 minutes
Capp | coverage factor for appliance failure after reboot due to a processor failure | 0.95
cpu1 | decision factor for keeping the system running with one processor | True (1)
1/spFRU | mean time for new appliance arrival (FRU) | 4 hours
1/appFRU | mean time to install the new CPU (FRU) | 30 minutes
Table 5: Downtime contribution of each down state

State | Value
D1 | 2.08134903e+000
DW | 2.50190267e-001
FR | 5.25593188e-001
The breakdown analysis of the downtime (Table 5 and Figure 30) shows that the contribution of state D1 to the total downtime is very significant. Hence, actions should be taken to reduce the time spent in D1, by reducing the reboot time and by increasing the coverage factor Capp. This multiprocessor subsystem model can be used in conjunction with other models (e.g., cooling, power supply, storage, etc.) to compose a more complex system such as an entire server machine or even a cluster of servers.
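The breakdown analysis above can be sketched directly from the CTMC solution: each down state contributes its steady-state probability multiplied by the number of minutes in a year. The state probabilities below are assumed illustrative values, chosen only to show the computation, not the chapter's actual model outputs.

```python
# Downtime breakdown sketch: yearly downtime contribution of each down state
# is pi_i * minutes_per_year, where pi_i is its steady-state probability.
MIN_PER_YEAR = 365 * 24 * 60  # 525600

# Assumed steady-state probabilities of the down states (illustrative only).
pi_down = {"D1": 3.96e-6, "DW": 4.76e-7, "FR": 1.0e-6}

contrib = {s: pi * MIN_PER_YEAR for s, pi in pi_down.items()}
total_downtime = sum(contrib.values())  # minutes per year
```

Ranking the per-state contributions in this way is what identifies D1 as the dominant source of downtime and hence the best target for improvement.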
This case study presents the availability modeling and analysis of a virtualized system (6). Service computing is highly dependent on data center infrastructure and virtualization technologies. We develop an availability model of a virtualized system using a hierarchical model in which fault trees are used in the upper level and homogeneous continuous time Markov chains (CTMC) are used to represent the sub-models in the lower level. We incorporate not only hardware failures (e.g., CPU, memory, power, etc.) but also software failures, including virtual machine monitor (VMM), virtual machine (VM), and application failures. We also consider the high availability (HA) feature and VM live migration in the virtualized system. The metrics we use are the system steady-state availability, the downtime in minutes per year, and the capacity-oriented availability.
VMs on one host can be migrated to the other host. Below we discuss the VMs availability model in detail; the description of the other sub-models can be found in (6).
The model captures, for instance, the failure of application 1 on VM1 on host H1, where f in a state label denotes a failed application (the underscore (_) is used to distinguish the H1 part of a state label from the H2 part). If the failure of application 1 is detected, it can be recovered, returning the model to the state in which both applications are up; otherwise an additional recovery action, with its own repair rate, is required. Application 2 on VM2 on H2 can fail in the same way: its failure is detected with mean time 1/a, and it is recovered with mean time 1/1a or, if additional recovery steps are needed, with mean time 1/2a.
Table 6: Input Parameters of the VMs model

Param | Description | Value
1/h | mean time for host failure | host MTTFeq
1/v | mean time for VM failure | 2160 hours
1/a | mean time for application failure | 336 hours
1/h | mean time for host failure detection | 30 seconds
1/v | mean time for VM failure detection | 30 seconds
1/a | mean time for app. failure detection | 30 seconds
1/mv | mean time to migrate a VM | 5 minutes
1/rv | mean time to restart a VM | 5 minutes
1/v | mean time to repair VM | 30 minutes
1/1a | mean time to application first repair | 20 minutes
1/2a | mean time to application 2nd repair | 1 hour
1/h | mean time to repair host failure | host MTTReq
cv | coverage factor for VM failure | 0.95
ca | coverage factor for application failure | 0.9
Table 7: Output measures of the virtualized system

Measure | Value
Steady-state availability | 9.99766977e-001
Downtime (minutes per year) | 1.22476683e+002
Capacity oriented availability | 9.96974481e-001
The output measures, namely the steady-state availability, the downtime, and the capacity-oriented availability (COA), are computed using the hierarchical model; we used SHARPE to compute them. We compute the mean time to failure equivalent (MTTFeq) and the mean time to repair equivalent (MTTReq) of the Markov sub-models (CPU, memory availability model, etc.) by feeding in the input parameter values. The MTTFeq and MTTReq of each sub-model are used to compute the MTTFeq and MTTReq of a host, which are in turn used in the VMs availability model (1/MTTFeq and 1/MTTReq of a host give its failure and repair rates, respectively). Finally, we evaluate the system availability by feeding all the input parameter values from Table 6 into all the sub-models of the system availability model shown in Figure 33. The steady-state availability and the downtime in minutes per year of the virtualized two-host system are summarized in Table 7. We also compute the COA by assigning reward rates to each state of the VMs availability model (so that this Markov chain becomes a Markov reward model). Reward rate 1 is assigned to states where one VM is running on each host (e.g., UUxUUx); reward rate 0.75 is assigned to states where two VMs are running on one host (e.g., UUUUxx, UxxUUU, UUUDxx, etc.); reward rate 0.5 is assigned to states where only one VM is running on a host (e.g., UUxFxx, UUxDxx, etc.); and zero reward rate is assigned to all other states. The computed COA of the virtualized system is also shown in Table 7. The sensitivity to some parameters is shown in Figures 34 and 35. Figure 34 shows the unavailability vs. the mean time to VM failure (1/v in Table 6) and to VMM failure (1/VMM). As seen from the figure, the system unavailability drops as the mean time to VMM failure increases, but after it reaches about 750 hours, further increasing the mean time to VMM failure does not reduce the system unavailability much. Also, the system unavailability does not change much as the mean time to VM failure increases, since another VM on the other host keeps working properly. Figure 35 shows the COA vs. the mean time to restart (migrate) a VM. The COA drops as the mean time to restart and the mean time to migrate a VM increase; therefore it is important to minimize the mean time to restart (migrate) a VM in order to maximize the COA. More case studies using the hierarchical modeling approach can be found in (7) (5) (8).
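The COA computation is the expected reward rate of the Markov reward model, COA = Σ_i r_i π_i. The sketch below uses the reward rates described in the text with assumed, illustrative steady-state probabilities (the real values come from solving the VMs availability CTMC).

```python
# Capacity-oriented availability as expected reward rate of an MRM.
# state label: (steady-state probability, reward rate) -- probabilities are
# assumed illustrative values, not the chapter's actual solution.
states = {
    "UUxUUx": (0.990, 1.00),  # one VM running on each host
    "UUUUxx": (0.004, 0.75),  # two VMs running on host H1
    "UxxUUU": (0.004, 0.75),  # two VMs running on host H2
    "UUxDxx": (0.001, 0.50),  # only one VM running
    "FFxFFx": (0.001, 0.00),  # no VM running
}

coa = sum(pi * r for pi, r in states.values())
```

Plain availability would instead assign reward 1 to every state with at least one VM running; the fractional rewards are what make COA sensitive to degraded-capacity states.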
Figure 34: System unavailability vs. mean time to VM/VMM failure.
Figure 35: Capacity oriented availability vs. mean time to restart (migrate) a VM.
6. Conclusions
Dependability modeling and evaluation is an area with a long tradition and sound foundations. Dependability studies are particularly important for the design success of critical systems. In this chapter, we began by introducing some seminal and important works that widely influenced the development of this research area. Then we introduced some model types and important analysis methods. This chapter focused more on combinatorial models than on state-based models; nevertheless, the case studies presented considered both classes of models.
Although the presented methods and models are quite mature, there are still many challenging problems, encompassing composition modeling strategies, efficient methods for computing measures of large systems, automatic generation of models, user-friendly tools for industrial use, and evaluation of user-defined properties. This list of subjects is not intended to be complete, but it depicts exciting research subjects related to both theoretical and practical studies.
Bibliography
1. Grottke, M. and Trivedi, K. S. Fighting Bugs: Remove, Retry, Replicate, and Rejuvenate. IEEE Computer, Vol. 40, No. 2, 2007.
2. Kuo, W. and Zuo, M. J. Optimal Reliability Modeling: Principles and Applications. Wiley, 2003.
3. Esary, J. D. and Proschan, F. A Reliability Bound for Systems of Maintained, Interdependent Components. Journal of the American Statistical Association, Vol. 65, No. 329, 1970.
4. Barlow, R. E. and Proschan, F. Statistical Theory of Reliability and Life Testing: Probability Models. Holt, Rinehart and Winston, New York, 1975.
5. Trivedi, K. S. Probability and Statistics with Reliability, Queueing and Computer Science Applications. Wiley, 2002.
6. Kim, D. S., Machida, F. and Trivedi, K. S. Availability Modeling and Analysis of a Virtualized System. 15th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2009), 2009.
7. Trivedi, K. S., Wang, D., Hunt, D. J., Rindos, A., Smith, W. E. and Vashaw, B. Availability Modeling of SIP Protocol on IBM WebSphere. 14th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2008), 2008.
8. Sahner, R. A., Trivedi, K. S. and Puliafito, A. Performance and Reliability Analysis of Computer Systems: An Example-Based Approach Using the SHARPE Software Package. Kluwer Academic Publishers.
9. Press, W. H., Teukolsky, S. A., Vetterling, W. T. and Flannery, B. P. Numerical Recipes: The Art of Scientific Computing. Third edition. Cambridge University Press, 2007.
10. Blischke, W. R. and Murthy, D. N. P. (eds.). Case Studies in Reliability and Maintenance. John Wiley & Sons, Hoboken, 2003.
11. Ushakov, I. Is Reliability Theory Still Alive? e-journal Reliability: Theory & Applications, Vol. 2, No. 1, March 2007.
12. Einhorn, S. J. and Thiess, F. B. Intermittence as a Stochastic Process. NYU-RCA Working Conference on Theory of Reliability, Ardsley-on-Hudson, N.Y., 1957.
13. Symons, F. J. W. Modelling and Analysis of Communication Protocols Using Numerical Petri Nets. Ph.D. thesis, University of Essex, 1978.
14. Stuart, H. R. Time-Limit Relays and Duplication of Electrical Apparatus to Secure Reliability of Services at Pittsburg. AIEE, June 1905.
15. Stott, H. G. Time-Limit Relays and Duplication of Electrical Apparatus to Secure Reliability of Services at New York. AIEE, 1905.
16. Smith, D. J. Reliability, Maintainability and Risk. Seventh edition. Elsevier, 2009.
17. Shetti, N. M. Heisenbugs and Bohrbugs: Why Are They Different? Technical report DCS-TR-579, Department of Computer Science, Rutgers University, March 2003.
18. Shannon, C. E. A Mathematical Theory of Communication. The Bell System Technical Journal, Vol. 27, pp. 379-423 and 623-656, July and October 1948.
19. Schaffer, S. Babbage's Intelligence: Calculating Engines and the Factory System. Critical Inquiry, Vol. 21, No. 1, The University of Chicago Press, 1994.
20. Barlow, R. E. and Proschan, F. Mathematical Theory of Reliability. John Wiley, New York, 1967. SIAM series in applied mathematics.
21. Pierce, W. H. Failure-Tolerant Computer Design. Academic Press, New York, 1965.
22. O'Connor, P. D. T. Practical Reliability Engineering. Fourth edition. Wiley, 2009.
23. von Neumann, J. Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. In Shannon, C. E. and McCarthy, J. (eds.), Automata Studies, Annals of Mathematics Studies, Vol. 34, 1956.
24. Natkin, S. Les Réseaux de Petri Stochastiques et leur Application à l'Évaluation des Systèmes Informatiques. Thèse de Docteur-Ingénieur, CNAM, Paris, France, 1980.
25. Nahman, J. M. Dependability of Engineering Systems. Springer, 2002.
26. Moore, E. F. Gedanken-Experiments on Sequential Machines. The Journal of Symbolic Logic, Vol. 23, No. 1, Association for Symbolic Logic, March 1958.
27. Molloy, M. K. On the Integration of Delay and Throughput Measures in Distributed Processing Models. Ph.D. thesis, UCLA, Los Angeles, CA, 1981.
28. Misra, K. B. (ed.). Handbook of Performability Engineering. Springer, 2008.
29. Leemis, L. M. Reliability: Probability Models and Statistical Methods. Second edition, 2009.
30. Laprie, J.-C. Dependable Computing and Fault Tolerance: Concepts and Terminology. Proc. 15th IEEE Int. Symp. on Fault-Tolerant Computing, 1985.
31. Kolmogorov, A. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung (in German). Mathematische Annalen, Springer-Verlag, 1931.
32. Rausand, M. and Høyland, A. System Reliability Theory: Models, Statistical Methods, and Applications. Second edition. Wiley, 2004.
33. Bolch, G., Greiner, S., de Meer, H. and Trivedi, K. S. Queueing Networks and Markov Chains: Modeling and Performance Evaluation with Computer Science Applications. Second edition. Wiley, 2006.
34. Gnedenko, B. and Ushakov, I. A. Probabilistic Reliability Engineering. Wiley-Interscience, 1995.
35. Basharin, G. P., Langville, A. N. and Naumov, V. A. The Life and Work of A. A. Markov. Linear Algebra and its Applications, Vol. 386, 2004 (Special Issue on the Conference on the Numerical Solution of Markov Chains 2003).
36. Ericson, C. Fault Tree Analysis: A History. Proceedings of the 17th International Systems Safety Conference, 1999.
37. Epstein, B. and Sobel, M. Life Testing. Journal of the American Statistical Association, Vol. 48, No. 263, September 1953.
38. Ebeling, C. E. An Introduction to Reliability and Maintainability Engineering. Second edition. Waveland Press, 2005.
39. Dhillon, B. S. Applied Reliability and Quality: Fundamentals, Methods and Applications. Springer-Verlag, London, 2007.
40. Cox, D. R. Quality and Reliability: Some Recent Developments and a Historical Perspective. Journal of the Operational Research Society, Vol. 41, No. 2.
41. Birnbaum, Z. W., Esary, J. D. and Saunders, S. C. Multi-Component Systems and Structures and Their Reliability. Technometrics, Vol. 3, No. 1, 1961.
42. Barlow, R. E. Mathematical Reliability Theory: From the Beginning to the Present Time. Proceedings of the Third International Conference on Mathematical Methods in Reliability, Methodology and Practice, 2002.
43. Avizienis, A. Toward Systematic Design of Fault-Tolerant Systems. IEEE Computer, Vol. 30, No. 4, 1997.
44. Anselone, P. M. Persistence of an Effect of a Success in a Bernoulli Sequence. Journal of the Society for Industrial and Applied Mathematics, Vol. 8, No. 2, 1960.
45. Gorbenko, A., Kharchenko, V. and Romanovsky, A. On Composing Dependable Web Services Using Undependable Web Components. Int. J. Simulation and Process Modelling, Vol. 3, Nos. 1/2, 2007.
46. Avizienis, A., Laprie, J.-C. and Randell, B. Fundamental Concepts of Dependability. Seoul, Korea, May 21-22, 2001.
47. Gray, J. Why Do Computers Stop and What Can Be Done About It? Technical Report TR 85.7, Tandem Computers, June 1985.
48. Erlang, A. K. The Theory of Probabilities and Telephone Conversations. First published in Nyt Tidsskrift for Matematik B, Vol. 20, 1909 (Principal Works of A. K. Erlang).
49. Board of Directors of the American Institute of Electrical Engineers. Answers to Questions Relative to High Tension Transmission. AIEE, September 26, 1902.
50. Laprie, J.-C. Dependability: Basic Concepts and Terminology. Springer-Verlag, 1992.
Biography
Paulo R. M. Maciel graduated in Electronic Engineering in 1987, and received his MSc and PhD degrees in Electronic Engineering and Computer Science, respectively, from Universidade Federal de Pernambuco. He was a faculty member of the Electrical Engineering Department of Universidade de Pernambuco from 1989 to 2003. Since 2001 he has been a member of the Informatics Center of Universidade Federal de Pernambuco, where he is currently an Associate Professor. He is a research member of the Brazilian research council (CNPq) and an IEEE member. His research interests include Petri nets, formal models, performance and dependability evaluation, and power consumption analysis. He has acted as a consultant as well as a coordinator of research projects funded by companies such as HP, EMC, CELESTICA, FOXCONN, ITAUTEC and CHESF.
Kishor S. Trivedi holds the Hudson Chair in the Department of Electrical and Computer Engineering at Duke University, Durham, NC. He has been on the Duke faculty since 1975. He is the author of a well-known text entitled Probability and Statistics with Reliability, Queuing and Computer Science Applications, published by Prentice-Hall; a thoroughly revised second edition (including its Indian edition) of this book has been published by John Wiley. He has also published two other books, Performance and Reliability Analysis of Computer Systems, published by Kluwer Academic Publishers, and Queueing Networks and Markov Chains, published by John Wiley. He is a Fellow of the Institute of Electrical and Electronics Engineers and a Golden Core Member of the IEEE Computer Society. He has published over 420 articles and has supervised 42 Ph.D. dissertations. He is on the editorial boards of IEEE Transactions on Dependable and Secure Computing, Journal of Risk and Reliability, International Journal of Performability Engineering, and International Journal of Quality and Safety Engineering. He is the recipient of the IEEE Computer Society Technical Achievement Award for his research on Software Aging and Rejuvenation. His research interests are in reliability, availability, performance, performability and survivability modeling of computer and communication systems. He works closely with industry, carrying out reliability/availability analyses, providing short courses on reliability, availability, and performability modeling, and developing and disseminating software packages such as SHARPE and SPNP.
Rivalino Matias Jr. received his B.S. (1994) in informatics from the Minas Gerais State University, Brazil. He earned his M.S. (1997) and Ph.D. (2006) degrees in computer science, and in industrial and systems engineering, respectively, from the Federal University of Santa Catarina, Brazil. In 2008 he was with the Department of Electrical and Computer Engineering at Duke University, Durham, NC, working as a research associate under the supervision of Dr. Kishor Trivedi. He has also worked with IBM Research Triangle Park on research related to analytical modeling of embedded system availability and reliability. He is currently an Associate Professor in the Computing School at the Federal University of Uberlândia, Brazil. Dr. Matias has served as a reviewer for IEEE Transactions on Dependable and Secure Computing, Journal of Systems and Software, and several international conferences. His research interests include reliability engineering applied to computing systems, software aging theory, dependability analytical modeling, and diagnosis protocols for computing systems.
Dong Seong Kim received his B.S. degree in Electronic Engineering from Korea Aerospace University, Republic of Korea, in 2001, and his M.S. and Ph.D. degrees in Computer Engineering from the same university in 2003 and 2008, respectively. He was a visiting researcher at the University of Maryland at College Park, USA, in 2007. Since June 2008, he has been a postdoctoral researcher at Duke University. His research interests are in dependable and secure systems and networks, in particular intrusion detection systems, wireless ad hoc and sensor networks, and dependability and security modeling and analysis of virtualization and cloud computing systems.