reliabilityprediction
G.V.Berg
g.v.berg@student.utwente.nl
ABSTRACT
In the past few decades the field of reliability engineering has
developed several useful techniques to achieve its main
objective of being able to systematically analyze systems for
reliability and robustness. One important technique is Fault
Tree Analysis and over time it has been extended into the more
versatile method of Dynamic Fault Tree (DFT) Analysis. With
a DFT one can compute the probability of failure during a
certain mission time. Calculating this probability of failure can
be computationally expensive. In this paper we describe a tool
which reduces the computational complexity of calculating the
failure probability for a DFT. The tool computes DFT
failure probabilities using Monte Carlo simulation techniques.
Keywords
Monte Carlo simulation, reliability prediction, dynamic fault
tree
1. INTRODUCTION
The IEEE Reliability Society defines reliability as: "Reliability
is a design engineering discipline which applies scientific
knowledge to assure a product will perform its intended
function for the required duration within a given environment"
[IRS07]. One aspect of reliability engineering is reliability
prediction. NASA, for example, is eager to know the chance of
having one failed component in their satellites and how this
situation affects other components (and thus the functioning of
the satellite as a whole).
One established theory for calculating such probabilities
(based on known failure rates of individual components) is a
technique called Fault Tree Analysis (FTA). The Fault Tree
Handbook published by the United States Nuclear Regulatory
Commission [VGR+81] has set the basic standard for
analyzing the safety of mission critical systems such as nuclear
reactors. Since then much progress has been made in making
FTs more expressive. In particular, Dugan et al. [DBB92] have
extended FTs with so-called dynamic gates, which gave
rise to the dynamic fault tree formalism. The DFT formalism
lets FTs take the order in which events occur into account.
The most notable implementation for analysing (Dynamic) FTs
is Galileo [BSC99]. Galileo attempts to analytically solve
DFTs. It does so by trying to calculate the exact reliability
value of a DFT using a combination of Markov Chains and
Binary Decision Diagrams (BDDs) [And97]. The BDDs are
used when the FT contains no dynamic gates.
For DFTs Galileo computes the system reliability by solving
the underlying Markov Chains. These Markov Chains suffer
from state space explosion: a linear increase in DFT size
makes the state space of the Markov Chain grow
exponentially. For complex DFTs Galileo needs a lot of memory and
time before it can compute the answer.
To counter these state space explosions Monte Carlo sampling
techniques are useful. Instead of computing the exact answer,
sampling techniques approximate it in a computationally
less expensive way. The main idea is that, when enough
samples are taken, the answer will be accurate enough (i.e. close to the
analytically computed value).
To sample system reliabilities several approaches have been
tried. Boyd and Bavuso [BB93] used a variance reduction
technique called importance sampling. They write: "[..]
analytical solution techniques are preferable whenever the
model is small enough [..] simulation is preferred [..] when
the model is too large or exhibits system behaviour too
complex to be accommodated by analytical solution
techniques."
Gedam and Beaudet [GB00] also applied sampling
techniques in the field of reliability engineering. They used
Monte Carlo sampling to solve Reliability Block Diagrams
(RBDs), which are a combinatorial diagram technique. A
static FT can be translated into an RBD and vice versa. For DFTs
this is not possible.
In this paper we describe a tool we have implemented. The tool
uses Monte Carlo sampling to compute the reliability of DFTs.
It does not use Markov Chains or BDDs but works directly on
the FT. We will show that Gedam and Beaudet's approach is
an effective one. Our tool counters the computational
complexity and state space explosion of the traditional Galileo
methods, which we support with the results of
our case studies. Our cases are based upon the ones Boyd and
Bavuso [BB93] used to test their Galileo implementation.
2. BACKGROUND
2.1 Fault trees and dynamic fault trees
A Fault Tree (FT) is a Directed Acyclic Graph (DAG) in which
the leaves are basic events (BEs) and the other elements are
gates. This definition is based on [BCS07]. BEs model
component failures whereas the gates model how component
failures induce a system failure. Fault Trees have three types of
gates:
1.
2.
3.
[Figure: unicycle DFT with the labels "unicycle fails", "sp" and "tire fails"]
2. SPARE gate (Figure 1.e), which has one primary input and
zero or more spare inputs. All inputs are BEs. When the
primary input fails it is replaced by the first available
spare input. When that one fails it is replaced by the next
available spare input, etc. If the primary and all the spares
have failed the SPARE gate fails.
3.
From now on we will use the term Static Fault Trees (SFTs)
when referring to FTs without DFT extensions.
2.3.1 Example
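We take the unicycle DFT in Figure 3. We assume both tires
(i.e. the BEs) have the same failure distribution, for which we
use the exponential distribution. This means that the
PDF of each BE is defined as:
$$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$
where $\lambda$ is the failure rate of the BE and $x$ is the time.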
Primary tire            Spare tire
RNG = u    Time t       RNG = u    Time t
0.9        1.151        0.8        0.8047
0.4        0.2554       0.6        0.4581
0.5        0.3466       0.3        0.1783
0.6        0.4581       0.7        0.6020
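The tabulated pairs are consistent with inverse-transform sampling of an exponential distribution. As a sketch, assuming the standard exponential CDF and a failure rate of $\lambda = 2$ (a value inferred from the tabulated numbers, not stated in the text), a uniform random number u maps to a failure time t as follows:

```latex
F(t) = 1 - e^{-\lambda t} = u
\quad\Longrightarrow\quad
t = -\frac{\ln(1 - u)}{\lambda}

% Example with the assumed \lambda = 2 and u = 0.9:
t = -\frac{\ln(0.1)}{2} \approx 1.151
```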
Sampling yields a list of failure times for all the BEs. We now move on
to the propagation of these failure times through the tree.
We take the BEs we just sampled and move up the tree
along the arcs coming out of the BEs, handing the sampled
values to the gates. We then compute the failure time of each
gate and pass that value further up the tree
along the arcs. This is repeated until we arrive at the top
node, at which point we have a sampled failure time of the
entire tree.
The methodology for calculating the failure time of each gate is
described below:
3.1.1 AND gate
This gate only fails after all its inputs have failed. So the
output failure time of an AND gate is equal to the largest
failure time of its inputs (or infinity if not all of its inputs fail).
3.1.2 OR gate
This gate fails after one or more of its inputs have failed. The
output failure time of an OR gate is equal to the lowest failure
time of its inputs (or infinity if none of its inputs fail).
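The AND and OR rules, together with the bottom-up propagation described earlier, can be sketched as follows. This is a minimal illustration and not the tool's actual code: the tuple-based tree encoding and the names `failure_time` and `GATES` are assumptions made for the example.

```python
import math

INF = math.inf  # failure time of an element that never fails


def and_gate(times):
    """AND fails once all inputs have failed: the largest input failure time."""
    return max(times)


def or_gate(times):
    """OR fails once any input has failed: the smallest input failure time."""
    return min(times)


GATES = {"AND": and_gate, "OR": or_gate}


def failure_time(node, be_times):
    """Propagate sampled BE failure times bottom-up to the given node.

    A node is either a BE name (str) or a tuple (gate_type, [children]).
    """
    if isinstance(node, str):
        return be_times[node]
    gate_type, children = node
    return GATES[gate_type]([failure_time(c, be_times) for c in children])


# Hypothetical tree: top = OR(AND(a, b), c), where c never fails
tree = ("OR", [("AND", ["a", "b"]), "c"])
print(failure_time(tree, {"a": 1.0, "b": 3.0, "c": INF}))  # 3.0
```

Because an input that never fails is represented by infinity, `max` and `min` give the "infinity" cases of both rules without special handling. A shared node in the DAG is simply recomputed here, which keeps the sketch short at the cost of some redundant work.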
This yields the following PDF definition for the spare input
with pfail denoting the time at which the primary tire failed:
We do not have to resample all the values for the spare tire,
since the inverse CDF just returns the time (denoted by x in the
previous equations) at which the component failed. For the
spare tire we need to take x + pfail to adjust for the failing of
the primary tire. In short, for each row we add the values in
column 2 and column 4 together. For each row this sum
denotes the time at which the spare tire failed (after the
primary had already failed). This gives the
following values:
Table 2. Sampled values
1.9557
0.7135
0.5249
1.0601
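The sampled values can be reproduced with a short script. This is a sketch under one assumption: the failure rate of 2.0 is inferred from the tabulated numbers and is not stated in the text. The function name `sample_exponential` is chosen for the example only.

```python
import math


def sample_exponential(u, rate):
    """Inverse-transform sampling: solve 1 - exp(-rate * t) = u for t."""
    return -math.log(1.0 - u) / rate


RATE = 2.0  # assumed; inferred from the tabulated values, not stated in the text

# (u for the primary tire, u for the spare tire), one pair per table row
rows = [(0.9, 0.8), (0.4, 0.6), (0.5, 0.3), (0.6, 0.7)]

spare_fail_times = []
for u_primary, u_spare in rows:
    pfail = sample_exponential(u_primary, RATE)  # column 2
    x = sample_exponential(u_spare, RATE)        # column 4
    spare_fail_times.append(pfail + x)           # shift the spare by pfail

for t in spare_fail_times:
    print(f"{t:.4f}")
```

The printed values agree with Table 2 to within about 0.0005. The small differences arise because the table adds intermediate values that were already rounded.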
3.1 Sampling
The sampling starts by sampling all the BEs. For each BE we
generate a random number r and solve (with the inverse CDF)
for Fcdf(t) = r as described before in the unicycle example.
The PAND gate fails only if all its inputs have failed in order,
from left to right. So we take all the inputs of the gate (in order) and
check whether their failure times are sorted in ascending order. If they are, we return
the last value. If they are not, we return infinity: the inputs
failed in another order than specified, so the
PAND gate will not fail.
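The PAND rule can be sketched as a single function. One detail is an assumption here: the text does not say how equal failure times are treated, so this sketch counts ties as being in order.

```python
import math


def pand_gate(times):
    """Priority-AND: fails only if the inputs fail in left-to-right order.

    Returns the failure time of the last input when the failure times are
    in ascending order (ties allowed, by assumption), and infinity otherwise.
    """
    in_order = all(a <= b for a, b in zip(times, times[1:]))
    return times[-1] if in_order else math.inf


print(pand_gate([0.5, 1.2, 3.0]))  # 3.0
print(pand_gate([1.2, 0.5, 3.0]))  # inf
```

An input that never fails has failure time infinity, so the gate also returns infinity when any input never fails: either the ordering check fails or the last value is itself infinity.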
If pfail > t' we assume the primary fails after the spare
input so the failure time of the spare gate as a whole
becomes pfail + t'.
If pfail < t' we assume the primary fails before the spare
input so the failure time of the spare gate as a whole
becomes pfail + t''.
3.3 Confidence interval
The confidence interval is the range in
which the real system reliability is expected to lie, based upon all the
samples. It is used to determine if an answer is accurate
enough. To compute the confidence interval we first need to
compute the standard deviation of all the samples we have
taken.
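We first calculate the mean of all the samples:
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
With $\sigma$ the standard deviation of the samples and $n$ the number of samples, the confidence offset is:
$$z \, \sigma / \sqrt{n}$$
Here z is a value we can look up in tables for the normal
distribution. According to Moore and McCabe [MM03] we are
allowed to assume that for a large enough n the distribution
of the sample mean will behave like a normal distribution.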
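As a sketch of this computation, assuming the usual normal-approximation offset z * sigma / sqrt(n): the function names and the sample data below are hypothetical and only illustrate the steps (mean, standard deviation, offset). They are not taken from the tool.

```python
import math
import statistics


def confidence_offset(stdev, n, z):
    """Half-width of the confidence interval: z * sigma / sqrt(n)."""
    return z * stdev / math.sqrt(n)


def confidence_interval(samples, z):
    """Return (lower, upper) bounds around the sample mean."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation (n - 1)
    offset = confidence_offset(stdev, len(samples), z)
    return mean - offset, mean + offset


# Hypothetical samples; z = 1.96 corresponds to roughly 95% confidence.
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
low, high = confidence_interval(samples, z=1.96)
print(f"({low:.4f}, {high:.4f})")
```

The five-element list only keeps the example short. The normal approximation is meant for a large n, so a real run would pass in all the sampled values.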
3.3.1 Example
Suppose we take 1000 samples. The standard deviation is
equal to 50. The mean of the sampled values is 300. For a
confidence of 99.9% the table in Moore and McCabe yields
3.291 as a value for z.
This yields a confidence offset of:
$$3.291 \cdot 50 / \sqrt{1000} \approx 5.2035$$
With a confidence of 99.9% we can now say that the real mean
value of the distribution lies between:
(300 - 5.2035, 300 + 5.2035)
4. IMPLEMENTATION
In this section we will briefly describe our implementation.
The tool we have built requires the following input:
Each case will detail the DFT analyzed and give the
unreliability output by our tool. These unreliabilities are
accompanied by the confidence interval and the number
of samples taken. We also analyze the DFTs for several
mission times.
Our results are compared against the unreliability
measures computed by Galileo. Based upon this comparison we
show that the methodology and its implementation work.
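The unreliability reported for a mission time can be read as the fraction of simulated runs in which the top event fails within that time. The sketch below assumes this estimator; the tool's own code is not shown in this paper. It uses a hypothetical system, an AND gate over two BEs with assumed rates 0.01 and 0.02, for which the exact value is known.

```python
import math
import random


def sample_exponential(rng, rate):
    """Draw an exponential failure time by inverse-transform sampling."""
    return -math.log(1.0 - rng.random()) / rate


def estimate_unreliability(sample_top_time, mission_time, n_samples, rng):
    """Fraction of simulated runs in which the top event fails in time."""
    failures = sum(
        1 for _ in range(n_samples) if sample_top_time(rng) <= mission_time
    )
    return failures / n_samples


def and_of_two(rng):
    """Hypothetical system: AND gate over two BEs with assumed rates."""
    return max(sample_exponential(rng, 0.01), sample_exponential(rng, 0.02))


rng = random.Random(42)  # fixed seed so runs are repeatable
estimate = estimate_unreliability(
    and_of_two, mission_time=100, n_samples=20000, rng=rng
)
# Exact unreliability of an AND over two independent exponential BEs
exact = (1 - math.exp(-0.01 * 100)) * (1 - math.exp(-0.02 * 100))
print(f"estimate={estimate:.4f} exact={exact:.4f}")
```

Comparing the estimate with a closed-form value is only possible for small static examples like this one. For the DFTs in the case studies the comparison is made against Galileo instead.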
3 OR gates
3 AND gates
3 PAND gates
12 BEs
The parser for the Galileo format files is based upon supplied
ANTLR grammar files. They were written to produce Java
code, so we had to adapt the ANTLR files to produce
Python parsers instead.
100
5. CASE STUDIES
To compare the tool with existing techniques we performed
several case studies. These case studies are based upon two
case studies from the paper by Dugan et al. [DBB92]. The
DFTs in that paper model several types of Fault-Tolerant
Computer Systems. These systems were explicitly designed to
be as redundant and reliable as possible. The Dugan case
studies are depicted in appendix A.
The first case in Dugan's paper was chosen because it has four
FDEP and twelve PAND gates. This means that solving
that DFT in the traditional way causes an enormous state space
explosion in the underlying
Markov Chain. This makes it difficult to compute
the system reliability exactly, so our methodology might prove a good
alternative.
The second case in Dugan's paper was chosen because it has
four SPARE gates. The SPARE gates also have shared spare
inputs. This means that for analytically computing the system
reliability of the DFT the Markov Chain will be very complex,
especially because the shared spare inputs make
all the SPARE gates dependent on each other. Being able to
solve this faster with our methodology would be quite
desirable.
We have drawn a few DFTs based upon the selection criteria
we have mentioned before and simulated those. The cases
themselves and the results for each individual case are shown
below.
Mission time   [..] samples             10,000 samples
[..]           0.018000 +/- 0.019989    0.028300 +/- 0.011081
250            0.349000 +/- 0.112466    0.321300 +/- 0.035109
3 OR gates
3 AND gates
3 PAND gates
2 FDEP gates
14 BEs
Mission time   1,000 samples            10,000 samples           100,000 samples
100            0.09000 +/- 0.054845     0.079400 +/- 0.028019    0.082670 +/- 0.009373
250            0.528000 +/- 0.12520     0.509200 +/- 0.069478    0.512460 +/- 0.008879
2 spare gates
5 BEs
The spare gates have warm spare inputs. All the basic events
have dormancy factors of 0.5. For BE1, BE3 and BE5, which are
not spare input BEs, the dormancy factor is set but it is not taken
into account at any point in the simulation.
The lambdas for each BE are shown in the table below:
Table 5. BE Lambda values
BE       BE1     BE2     BE3     BE4     BE5
Lambda   0.006   0.008   0.006   0.009   0.02
This behavior was only observed with SPARE gates with warm
spare inputs. Due to time constraints we were not able to
figure out whether there's a bug in our implementation or not.
Our methodology seems to work for simple DFTs with only
one SPARE gate and one warm spare input. It might be a side
effect of the simulation but it is more probable that we have
stumbled upon an implementation error. This will have to be
investigated further.
6. CONCLUSION
The methodology described in this paper seems to work well
for smaller cases. For the cases we have been able to test, the
results are encouraging. The simulation seems to give accurate
enough answers and is able to calculate the unreliabilities for
systems which Galileo cannot even solve analytically.
The sampling of the warm spare inputs in larger DFTs does not
seem to match the rest of the test data. Most of our
methodology has been put to the test, so it
may give other researchers a good starting point to further improve our
implementation and find more accurate ways of Monte Carlo
simulation for DFTs.
7. ACKNOWLEDGMENTS
The author wants to thank his supervisor, Hichem Boudali, for
making available several references and discussing several
aspects of this paper. Gratitude is also expressed towards his
supervisors, Mariëlle Stoelinga and Lodewijk Bergmans, and
his fellow students for reviewing and commenting on this
paper.
8. REFERENCES
[And97]
H. R. Andersen,
An Introduction to Binary Decision Diagrams,
Lecture notes for 49285 Advanced Algorithms E97, October 1997,
Dept. of Information Technology, Technical University of Denmark,
http://www.itu.dk/people/hra/bdd97.ps, accessed on 21st of March 2008
[BB93]
Mission time   1,000 samples            10,000 samples           100,000 samples
100            0.333000 +/- 0.118184    0.293700 +/- 0.038013    0.294760 +/- 0.018844
250            0.765000 +/- 0.264136    0.803100 +/- 0.059303    0.8032900 +/- 0.025621
[BCS07]
[BSC99]
[Dug04]
J.B. Dugan,
Fault Tree Analysis of Computer-Based systems,
Lecture at Reliability and Maintainability
Symposium, University of Virginia 2004
[DBB92]
[Gen03]
Gentle, J.E.
Random number generation and Monte Carlo
methods, 2nd edition
Springer-Link, New York, 2003
[GB00]
[IRS07]
[KW86]
[MM03]
[Sci07]
SciPy website,
http://www.scipy.org/, accessed on 25th of March 2008
APPENDIX B: DFT #1