
IBTM553 LECTURE-5

Course Program Fault Tree Basics Event Tree Methodology Integrated FT/ET Analysis Conclusions

Course Program
Session  Date    Room       Contents
1        Sep-8   Room 2464  Overview of course contents and schedule; definition of hazard and risk, risk management principles
2        Sep-15  Room 2464  Individual risk, collective risk and societal risk, risk management indices, background risk, risk acceptance criteria
3        Sep-22  Room 2464  Risk Assessment I: Hazard analyses: techniques and applications, risk matrix
4        Sep-29  Room 2464  Risk Assessment II: Concept of frequency, probability and uncertainty, Bayesian data update
5        Oct-6   Room 2464  Risk Assessment III: Quantitative risk assessment, fault tree, data analysis, event tree
6        Oct-13  Room 2464  Mid-term exam
7        Oct-20  Room 2464  Cost/risk-benefit analysis I

Recap of Lecture-3
Objective

Understand the differences between hazard and risk
Be familiar with typical hazard identification tools
Be able to apply the concept of risk matrices

Recap of Lecture-3
Application of Worksheets
Identify hazard scenarios systematically
Use systems, subsystems, job steps, etc.

Apply risk matrices to quickly screen hazards for their importance


To identify show-stoppers
To rank-order hazard scenarios

Recommend mitigation measures to control hazard risks


Not useful if most scenarios end up with similar recommendations
Recommendations should be commensurate with the level of risk
Mitigation measures can reduce likelihood and/or consequences

Residual risk should be realistic

Recap of Lecture-3
Mitigation Measures in order of adoption
(a) eliminate the hazard (e.g., physically remove the hazard, design change);
(b) substitute the hazard with a safe alternative (e.g., replace a hazardous material with a safe material);
(c) prevent exposure of personnel to the hazard;
(d) use active and/or passive safeguards, minimise failure of safeguards with redundancy (e.g., install safety barriers or warning devices) and/or special procedures and administrative controls;
(e) use personal protective equipment;
(f) develop a response plan to reduce the consequence;
(g) conduct focused training to improve the competency of staff and reduce human errors (this should not be the only control measure for high-risk hazards); and
(h) accept the hazard and monitor it continuously (this should not be the only control measure for high-risk hazards).

Recap of Lecture-3
Filling the Worksheet
Be creative but realistic and comprehensive
Each row is one scenario that gives one set of H/M/L or F/S/R; if you lump different scenarios together, it is hard to justify the H/M/L or F/S/R
Write each entry so that you and others can still understand it years later
Show existing and proposed control measures, and residual risks

Recap of Lecture-3
Assigning Risk Matrices
Must tell people which risk matrix you are using
Be consistent: if you use a simple H/M/L type matrix, do not use Frequency and Severity classes
If you use Frequency and Severity classes, you must show F/S/R explicitly for each scenario
If your tables have numeric scores, you must show the scores

Recap of Lecture-3
Common Mistakes

Mixing up risk matrices; if F/S/R is used, all 3 values must be shown
Not showing the residual risk
Mixing up potential causes and hazard scenarios; not following Notes 3B
Scenario descriptions that are not concise or not comprehensive
Providing PPE is not the best bet

Example worksheet row (columns: Hazard ID., Hazard Description, Potential Cause, Consequence, Existing Control Measure, Original Risk F/C/R, Proposed Control Measure, Residual Risk F/C/R, Comment):

Hazard Description: The worker dropped the electrical drill into water due to carelessness, resulting in electrical shorts to the metal ladder
Potential Cause: Lack of safety awareness
Consequence: Possible fatality due to electrical shock and/or drowning
Existing Control Measure: none
Original Risk: F = 2, C = 4, R = A
Proposed Control Measure: Drain water before work; use a rechargeable drill if possible, or a GFCI-protected circuit
Residual Risk: F = 4, C = 2, R = C

Recap of Lecture-4
Objectives

Understand the basic concept of probability and frequency
Understand the different types of uncertainties in risk analyses
Be familiar with Bayes Theorem

Recap of Lecture-4
Probability - Basic Concept

Sample Space
The set of all possible outcomes within the problem area
Everything else is irrelevant

Events
Outcomes, or trials
Combinations of decisions and outcomes
Throw a die = {1}, {2}, {3}, {4}, {5}, {6}
2 children in family = {B1,G2}, {B1,B2}, {G1,B2}, {G1,G2}

Recap of Lecture-4
The Laws of Probability

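If A and B are not mutually exclusive: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

[Figure: Venn diagram of two overlapping events A and B]

Mathematical symbols:
A or B = $A \cup B$ = $A \lor B$ = A+B
A and B = $A \cap B$ = $A \land B$ = A,B = A*B = AB
B given A = B|A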

Recap of Lecture-4
The Laws of Probability
Law of Multiplication
What is the probability that both A and B occur together?
$P(A \text{ and } B) = P(A \cap B) = P(B|A)\,P(A)$
where P(B|A) is the probability of B conditioned on A; $\cap$ is "and", | is "given"
If A and B are statistically independent: $P(B|A) = P(B)$, and then $P(A \cap B) = P(A)\,P(B)$
Most people would only remember $P(A \cap B) = P(A)\,P(B)$

Recap of Lecture-4
Conditional Probability and Bayes Theorem
Conditional probability is DEFINED as:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

The probability of Event A, given Event B, is the probability of Event (A and B) divided by the probability of Event B alone
Equivalently, $P(A \cap B) = P(A|B)\,P(B) = P(B|A)\,P(A)$, where $\cap$ is "and", | is "given"

[Figure: Venn diagram of events A and B with overlap $A \cap B$]

Bayes Theorem: $P(B|A) = P(A|B)\,P(B) / P(A)$

Recap of Lecture-4
Bayes Theorem for a Given Parameter
The prior is the probability of the parameter and represents what was thought before seeing the data
The likelihood is the probability of the data given the parameter and represents the data now available
The posterior represents what is thought given both prior information and the data just seen
It relates the conditional density of a parameter (posterior probability) with its unconditional density (prior, since it depends on information present before the experiment)

$p(\theta_j \mid \text{data}) = \frac{p(\text{data} \mid \theta_j)\,p(\theta_j)}{\sum_i p(\text{data} \mid \theta_i)\,p(\theta_i)}$

Posterior $\propto$ likelihood x prior

The denominator is basically a normalizing constant

Example: Probability of Having Two Girls Given at least one Girl


Family, 2 children
D: Both are girls
C: At least 1 girl

First child  Second child  Number of girls  Prob.
G1           G2            2                1/4
G1           B2            1                1/4
B1           G2            1                1/4
B1           B2            0                1/4

P(C) = 3/4

$P(D|C) = \frac{P(C \cap D)}{P(C)} = \frac{P(G1 \cap G2)}{P(C)} = \frac{1/4}{3/4} = \frac{1}{3}$

$C \cap D$ is the same as $G1 \cap G2$

In the general notation: $P(B|A) = \frac{P(A|B)\,P(B)}{P(A)} = \frac{P(A \cap B)}{P(A)}$, where B = both are girls; A = at least one girl
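
The two-girls result can be checked by brute-force enumeration of the sample space. The sketch below is only an illustration: the function and variable names are invented here, and it assumes, as the slide does, that the four birth-order outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space: (first child, second child); all four outcomes assumed equally likely.
sample_space = list(product("GB", repeat=2))

def prob(event):
    """Probability of an event, given as a predicate on outcomes."""
    favourable = [o for o in sample_space if event(o)]
    return Fraction(len(favourable), len(sample_space))

def at_least_one_girl(outcome):   # event C
    return "G" in outcome

def both_girls(outcome):          # event D
    return outcome == ("G", "G")

p_c = prob(at_least_one_girl)
p_c_and_d = prob(lambda o: at_least_one_girl(o) and both_girls(o))
p_d_given_c = p_c_and_d / p_c     # definition of conditional probability

print(p_c, p_c_and_d, p_d_given_c)  # 3/4 1/4 1/3
```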

Recap of Lecture-4
Problem Solving #3
In a bolt factory, machines A, B and C manufacture, respectively, 25, 35 and 40 percent of the total output. Of their output, 5, 4 and 2 percent are defective bolts. A bolt is drawn at random from the output and is found defective. What are the probabilities that it was manufactured by machines A, B, C?

Machine  Output        Defective
A        P(A) = 0.25   P(d|A) = 0.05
B        P(B) = 0.35   P(d|B) = 0.04
C        P(C) = 0.40   P(d|C) = 0.02

Find: P(A|d), P(B|d) and P(C|d)

Recap of Lecture-4
Problem Solving #3
Machine  Output        Defective
A        P(A) = 0.25   P(d|A) = 0.05
B        P(B) = 0.35   P(d|B) = 0.04
C        P(C) = 0.40   P(d|C) = 0.02

$P(A|d) = \frac{P(A \cap d)}{P(d)} = \frac{P(d|A)\,P(A)}{P(d)}$

P(d) = P(d,A) + P(d,B) + P(d,C)
     = P(d|A) P(A) + P(d|B) P(B) + P(d|C) P(C)
     = 0.05*0.25 + 0.04*0.35 + 0.02*0.40 = 0.0345

Recap of Lecture-4
Problem Solving #3
Machine  Output        Defective       Contribution to P(d)
A        P(A) = 0.25   P(d|A) = 0.05   0.25*0.05 = 0.0125
B        P(B) = 0.35   P(d|B) = 0.04   0.014
C        P(C) = 0.40   P(d|C) = 0.02   0.008
Sum      = 1                           = 0.0345

$P(A|d) = \frac{P(A \cap d)}{P(d)} = \frac{P(d|A)\,P(A)}{P(d)}$

P(A|d) = 0.0125/0.0345 = 0.36
P(B|d) = 0.014/0.0345 = 0.41
P(C|d) = 0.008/0.0345 = 0.23
P(A|d) + P(B|d) + P(C|d) = 1
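
The bolt-factory calculation is a direct application of Bayes Theorem with the law of total probability in the denominator. A minimal sketch follows; the helper name `posterior` and the dictionary layout are illustrative choices, not part of the lecture.

```python
# Priors P(machine) and likelihoods P(d | machine) from the problem statement.
priors = {"A": 0.25, "B": 0.35, "C": 0.40}
defect_rates = {"A": 0.05, "B": 0.04, "C": 0.02}

def posterior(priors, likelihoods):
    """Return (posterior dict, normalizing constant) via Bayes Theorem."""
    joint = {k: priors[k] * likelihoods[k] for k in priors}   # P(d, machine)
    total = sum(joint.values())                               # P(d), total probability
    return {k: v / total for k, v in joint.items()}, total

post, p_d = posterior(priors, defect_rates)

print(f"P(d) = {p_d:.4f}")
for machine, p in post.items():
    print(f"P({machine}|d) = {p:.2f}")
```

Although machine C makes the most bolts, machine B is the most likely source of a defective one, because the posterior weighs output share by defect rate.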

Recap of Lecture-4
Law of Total Probability
$B_1, B_2, \ldots, B_n$ = mutually exclusive and exhaustive events.

$P(A) = \sum_{i=1}^{n} P(A|B_i)\,P(B_i)$

Given: $P(B_i)$, $P(A|B_i)$, $i = 1, 2, \ldots, n$
Want: $P(B_i|A)$

$P(B_i|A) = \frac{P(A|B_i)\,P(B_i)}{\sum_{j=1}^{n} P(A|B_j)\,P(B_j)}$

LECTURE-5 Fault Tree Basics

Lecture-5
Learning Objectives

At the conclusion of this session, participants will:


Understand the concepts of fault tree and event tree
Be familiar with the principles of integrated fault tree/event tree analysis, QRA

Suggested Readings:
Lecture-5 Supplementary Notes

Series System

The system is operative if every component is operative.
$A_i$ = Component i is operative. A = The system is operative.
Assume: $A_1, \ldots, A_n$ are mutually independent
$P(A) = P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2) \cdots P(A_n)$
Note: $P(A) \le \min\{P(A_1), \ldots, P(A_n)\}$, i.e., the system is no stronger than its weakest link (and strictly weaker unless every other component is perfectly reliable).

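Parallel System
The system is operative if any one component is operative.
$A_i$ = Component i is operative. A = System is operative.
$A_1, A_2, \ldots, A_n$ are mutually independent.
$P(\bar{A}) = P(\bar{A}_1 \cap \bar{A}_2 \cap \cdots \cap \bar{A}_n) = P(\bar{A}_1)\,P(\bar{A}_2) \cdots P(\bar{A}_n) = (1 - P(A_1))(1 - P(A_2)) \cdots (1 - P(A_n))$

$P(A) = 1 - \prod_{i=1}^{n} (1 - P(A_i))$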

Series Parallel System

[Figure: network of numbered components between nodes a and b]

The system operates if there is at least one path of operating components from a to b.
$R_i$ = P{Component i operates}
R = P{System operates} = $1 - (1 - R_1 R_4)(1 - R_3 R_5)$
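
The series, parallel and series-parallel formulas compose naturally as small functions. The sketch below assumes independent components; the function names and the numerical reliabilities are hypothetical values chosen only to exercise the formulas.

```python
from math import prod

def series(*r):
    """Reliability of independent components in series: all must operate."""
    return prod(r)

def parallel(*r):
    """Reliability of independent components in parallel: at least one operates."""
    return 1 - prod(1 - x for x in r)

# Hypothetical component reliabilities (not from the lecture).
p = [0.9, 0.8, 0.95]
r_series = series(*p)        # 0.9 * 0.8 * 0.95
r_parallel = parallel(*p)    # 1 - 0.1 * 0.2 * 0.05

# Series-parallel system from the slide: R = 1 - (1 - R1*R4)(1 - R3*R5)
R1 = R3 = R4 = R5 = 0.9      # hypothetical values
r_system = parallel(series(R1, R4), series(R3, R5))

print(r_series, r_parallel, r_system)
```

A series system can never beat its weakest component, while a parallel system is at least as good as its best one, which the assertions `r_series <= min(p)` and `r_parallel >= max(p)` would confirm.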

Fault Trees Analysis


Start with the Top Event and follow through the scenario
Use deductive logic to systematically identify event initiators
Separate the tree into functional level, system level, subsystem level, component level, fault level, etc.
The bottom of the tree consists of basic events or undeveloped events
Can be qualitative or quantitative

Fault Tree Symbols


Two kinds of symbols are used in a fault tree:
Logic symbols Event symbols

Many symbols and styles, we stay with the simple ones here
TOP

Fault Tree Symbols Logic Symbols

Fault Tree Symbols More Symbols

Fault Tree Symbols Event Symbols


Fault Tree Construction


Identify the Undesired Top Event. A different tree is required for each unique Top Event
Constructing the logic
Identify and sketch the Intermediate Events to develop logical branches
Spotting/correcting some common errors
Adding quantitative data

Fault Tree Example

Calculations

Fault Tree Analysis


Fault trees use deductive logic to postulate fault or failure precursors and to quantify the top event probability
Fault tree analysis is based on probability theory and Boolean algebra

[Figure: fault tree with TOP event over basic events A (0.1), B (0.1), C (0.1) and D (0.2)]

Approximation:
$P(\text{Top}) \approx P(A) \times P(B) \times [P(C) + P(D)]$
$P(\text{Top}) \approx 0.1 \times 0.1 \times (0.1 + 0.2) = 0.003$

Exact:
$P(\text{Top}) = P(A) \times P(B) \times [P(C) + P(D) - P(C) \times P(D)]$
$P(\text{Top}) = 0.1 \times 0.1 \times (0.1 + 0.2 - 0.1 \times 0.2) = 0.0028$
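
The approximate and exact quantifications differ only in how the OR gate is evaluated. The sketch below reads the tree from the formulas as Top = A AND B AND (C OR D) and assumes independent basic events; the gate function names are illustrative.

```python
from math import prod

def and_gate(*p):
    """AND gate for independent events: product of input probabilities."""
    return prod(p)

def or_gate_exact(*p):
    """OR gate for independent events: 1 - product of complements."""
    return 1 - prod(1 - x for x in p)

def or_gate_rare_event(*p):
    """Rare-event approximation: simple sum (an upper bound on the exact value)."""
    return sum(p)

pA, pB, pC, pD = 0.1, 0.1, 0.1, 0.2   # basic event probabilities from the slide

p_top_approx = and_gate(pA, pB, or_gate_rare_event(pC, pD))
p_top_exact = and_gate(pA, pB, or_gate_exact(pC, pD))

print(f"approx = {p_top_approx:.4f}, exact = {p_top_exact:.4f}")  # approx = 0.0030, exact = 0.0028
```

The sum overstates the OR gate by the overlap term P(C)P(D), so the approximation is conservative and is close only when the input probabilities are small.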

Typical Faults in Fault Tree Analysis


Fault trees propagate probability or unavailability, NOT frequency
The approximation leads people to think they can simply add events together at an OR gate regardless of their contents
A fault tree should not be used simply to add events: A + B is not necessarily "A or B"; for independent events, P(A or B) = P(A) + P(B) - P(A)*P(B)

Fault Tree Example


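[Figure: fault tree for a tank explosion, with the Boolean expression of each gate]

Tank explodes = (B + C + F(D + E))A: AND gate over
    A: pressure relief valve fails
    Pressure rises = B + C + F(D + E): OR gate over
        Too much input = B + C: OR gate over pump fails, regulator fails
        Temperature rise = F(D + E): AND gate over
            F: temperature alarm fails
            Overheats = D + E: OR gate over fire, process temp increases

Expanded top event: Top = AB + AC + AFD + AFE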
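
The expansion into AB + AC + AFD + AFE can be checked mechanically. The sketch below encodes the top-event logic as a Boolean function and searches for minimal cut sets by brute force; this exhaustive search is an illustration that works only for small trees, not the method a production FTA tool would use, and the function names are invented here.

```python
from itertools import combinations

EVENTS = "ABCDEF"

def top(s):
    """Top event as read from the tree: A AND (B OR C OR (F AND (D OR E)))."""
    return s["A"] and (s["B"] or s["C"] or (s["F"] and (s["D"] or s["E"])))

def minimal_cut_sets(top_fn, events):
    """Smallest sets of basic events whose joint occurrence causes the top event."""
    cuts = []
    for size in range(1, len(events) + 1):          # smallest candidates first
        for combo in combinations(events, size):
            if any(set(c) <= set(combo) for c in cuts):
                continue                             # contains a smaller cut set: not minimal
            state = {e: (e in combo) for e in events}
            if top_fn(state):
                cuts.append(combo)
    return cuts

cut_sets = minimal_cut_sets(top, EVENTS)
print(cut_sets)
```

Every cut set contains A, so the pressure relief valve is the single barrier common to all failure paths.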

A Flood Alarm System

A Flood Alarm System


Two System Redundancy

A Flood Alarm System


Component Level Redundancy

Power Supplies Fail

Alarms Fail

LECTURE-5 Frequency and Rates

Measurement of Likelihood
Typically use generic frequencies or rates
Should use specific data (past failure records) with consideration of generic data
Can use expert judgment for rare events; must handle degree of belief, i.e., uncertainties
Can be a discrete value (like those in a risk matrix) or a continuous function


Frequency
Frequency is a measure of the rate of occurrence. E.g., the failure rate of a pump is 6.2x10^-3/hr
Frequency data are based on statistics with consideration of uncertainties (probability); e.g., the failure rate of a pump is 6.2x10^-3/hr, but it could be

Frequency      Fraction  Product
1.0x10^-4/hr   0.2       2.0x10^-5/hr
2.0x10^-3/hr   0.5       1.0x10^-3/hr
3.2x10^-3/hr   0.2       6.4x10^-4/hr
4.5x10^-2/hr   0.1       4.5x10^-3/hr
Sum:                     6.2x10^-3/hr
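
The quoted pump failure rate is the probability-weighted mean of the discrete frequency distribution in the table. A short sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Discrete distribution of the pump failure rate (per hour) from the table.
frequencies = [1.0e-4, 2.0e-3, 3.2e-3, 4.5e-2]
fractions = [0.2, 0.5, 0.2, 0.1]          # degrees of belief; must sum to 1

products = [f * w for f, w in zip(frequencies, fractions)]
mean_frequency = sum(products)

print(f"mean failure rate = {mean_frequency:.1e} /hr")  # 6.2e-03 /hr
```

The highest frequency carries only 10% of the weight yet contributes most of the mean, which is typical of skewed failure-rate distributions.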

Probability Curves for Frequency

LECTURE-5 Event Tree Methodology

Event Trees
Use inductive logic to postulate and quantify accident scenarios or accident sequences
Start with an initiating event and follow through the scenario to identify possible scenarios which need to be managed
Event trees should be used to display the progression of an accident
A typical event tree in a nuclear power plant risk analysis may generate millions of accident sequences

Event Tree Analysis


Each event tree heading may have more than 2 branches, although a binary tree is most common
Event trees should start with an initiating event, not a damage state
Most people confuse event trees with decision trees

[Figure: event tree for a fire. Headings: Fire Initiating Event; Auto Fire Protection System Available; Auto FPS Controls Fire before Damage; Manual Suppression Available; Consequence. Branch split fractions: 1-U (success) / U (fail), 1-Qauto / Qauto, 1-Qmanual / Qmanual. End states: SAFE or DAMAGE. Annotations: accident sequence or path, split fraction value, damage state.]

Event Tree
Event headings are usually the state of a system, the function of safety barriers, or actions or events that can alter the course of the accident scenario
Easier if you put key actions first
Event tree and fault tree are interchangeable in most cases

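Example
Initiating event: Fire. Each row is one path through the tree; a blank heading means the heading is not asked on that path. Consequences that did not survive in the source are left blank.

Detector exists  Detector works  Detector noticed  Escape  Rescued  Consequence  Probability
Y 0.2            Y 0.99          Y 0.9                              OK           0.1782
Y 0.2            Y 0.99          N 0.1             Y 0.5            OK           0.0099
Y 0.2            Y 0.99          N 0.1             N 0.5                         0.0099
Y 0.2            N 0.01                            Y 0.5            OK           0.001
Y 0.2            N 0.01                            N 0.5   Y 0.2    OK           0.0002
Y 0.2            N 0.01                            N 0.5   N 0.8                 0.0008
N 0.8                                              Y 0.5            OK           0.4
N 0.8                                              N 0.5   Y 0.2    OK           0.08
N 0.8                                              N 0.5   N 0.8                 0.32
Total                                                                            1.0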
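
Each path probability in the example is the product of the split fractions along the path, and the paths together must sum to 1. The sketch below checks this; the path structure is inferred from the listed probabilities (the original figure did not survive extraction), so treat the layout as an assumption.

```python
from math import prod

# (split fractions along the path, consequence); None = consequence not recoverable.
paths = [
    ([0.2, 0.99, 0.9], "OK"),        # detector exists, works, noticed
    ([0.2, 0.99, 0.1, 0.5], "OK"),   # not noticed, escapes
    ([0.2, 0.99, 0.1, 0.5], None),   # not noticed, does not escape
    ([0.2, 0.01, 0.5], "OK"),        # detector fails, escapes
    ([0.2, 0.01, 0.5, 0.2], "OK"),   # detector fails, no escape, rescued
    ([0.2, 0.01, 0.5, 0.8], None),   # detector fails, no escape, not rescued
    ([0.8, 0.5], "OK"),              # no detector, escapes
    ([0.8, 0.5, 0.2], "OK"),         # no detector, no escape, rescued
    ([0.8, 0.5, 0.8], None),         # no detector, no escape, not rescued
]

path_probs = [prod(fractions) for fractions, _ in paths]
total = sum(path_probs)
p_ok = sum(p for p, (_, consequence) in zip(path_probs, paths) if consequence == "OK")

for p, (_, consequence) in zip(path_probs, paths):
    print(f"{p:.4f}  {consequence or ''}")
print(f"total = {total:.4f}, P(OK) = {p_ok:.4f}")
```

A total different from 1 would indicate that some branch point has split fractions that do not sum to 1, which is a quick consistency check on any event tree.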

Another example

Not a good practice

Pressure Tank
Again, Not a good practice

LECTURE-5 Integrated FT/ET Analysis

Quantitative Risk Assessment


Input data of a QRA are characterised by distributions; e.g., the failure rate of a component may be represented by a lognormal distribution, a histogram, or a point value
Results are often presented as probability distributions that characterise the uncertainties in the point value estimates

Steps In A QRA
Typically, a QRA (PRA) consists of four steps:
Risk identification
Risk evaluation
Risk management
Risk communication

The four steps are equally important and are often applied iteratively in phases
These generic steps are applicable to ALL risk assessments

Risk Identification Must Be Systematic And Comprehensive


Each path represents a potential scenario for further analysis
IGNITION SOURCES FIRE COMPARTMENTS FIRE ZONE

SAFETY CRITICAL COMPONENTS (TARGETS)

PLANT

ALSO CONSIDER INTERCOMPARTMENT FIRE PROPAGATION

QRA - Fault Tree/Event Tree


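[Figure: linked event tree and fault tree.
EVENT TREE: initiating event (IE) "bad thing" at .01/yr; headings MFW, AFW, B&F with failure split fractions .1, .5, .1; end states SAFE, SAFE, SAFE, FAIL (unsafe state).
FAULT TREE: top event "AFW Failure" (0.5), developed into Operator Failure, Valve Failure (0.1; test and maintenance unavailability) and Pump Failure (0.3; failure to start or standby failure rate, failure to run), with basic-event values of 0.1.]

Fail Path Frequency:
TRANS x MFW x AFW x B&F = .01/yr x .1 x .5 x .1 = 5x10^-5/yr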
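
The fail path frequency links the two models: the fault tree supplies the AFW split fraction, and the event tree multiplies the initiating event frequency by the split fractions along the failure path. The sketch below is hedged on one point: it assumes the AFW value of 0.5 comes from summing the three gate inputs 0.1, 0.1 and 0.3 (rare-event approximation), which matches the numbers on the slide but cannot be confirmed from the extracted figure.

```python
# Fault tree side (assumed OR gate with rare-event approximation).
operator_failure = 0.1
valve_failure = 0.1
pump_failure = 0.3
p_afw = operator_failure + valve_failure + pump_failure   # 0.5 on the slide

# Event tree side: frequency (per year) times conditional failure probabilities.
ie_frequency = 0.01      # initiating event, per year
p_mfw = 0.1
p_bf = 0.1               # B&F heading

fail_path_frequency = ie_frequency * p_mfw * p_afw * p_bf

print(f"P(AFW fails) = {p_afw:.1f}")
print(f"fail path frequency = {fail_path_frequency:.0e} /yr")  # 5e-05 /yr
```

Only the initiating event carries units of frequency; everything multiplied onto it is a dimensionless probability, consistent with the earlier point that fault trees propagate probability, not frequency.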

Integrated Event Tree/Fault Tree Model


I

Scenario Level Event Tree Analysis

Accident Initiating Event IE 1 IE 2 IE N

Event A 1 - FA Success Fail FA

Event B

Event N

End State S1 S2 SN -1 SN

Success Fail

Event Trees were used to postulate accident sequences and quantify the Frequency of each sequence FS|IEi are conditional probabilities quantified by fault tree analysis or engineering calculations

System Level Fault Tree Analysis

Failure of Event A Failure Deduction Logic


or and

Basic Event a

Basic Event b

Basic Event n

The likelihood of an accident sequence, Freq(Si ), with a defined End State Si , is

$\mathrm{Freq}(S_i) = IE_i \cdot F_{S|IE_i} \cdot Q_i$

The Consequence is assessed by the consideration of the failure scenario. May not be as simple as Safe/Unsafe. Can be many states of failure

System Level Analysis (Event Tree and Fault Tree Analysis, etc.)

Accident Initiating Event IE 1 IE 2 IE N

Event A 1 - FA Success Fail FA

Event B

Event N

Damage State S1 S2 SN-1 SN

QRA Applications

Success Fail

System and Subsystem Level Analysis (Fault Tree Analysis, SCA, etc.)

Failure of Event A Failure Deduction Logic


or and

Basic Event a

Basic Event b

Basic Event n

Subsystem and Component Level Analysis (FMECA, SCA, etc.)

List of Failure Causes Ca1 Ca2 Can

List of Failure Causes Cb1 Cb2 Cbn

Common Cause Failures Cc1 Cc2 Ccn

Data Analysis (initiating event frequency, component failure rate and consequence modelling)

Reliability Test Data Supplier Data KCRC Data HK Data Generic Railroad Data Expert Judgment

Readings
Supplementary Notes Practice problems

Without Risk, there is no opportunity

END

Practice Problems: (1) Draw a fault tree with top event No Output from Reactor. Assuming that either Compressor I or II can supply adequate amount of compressed air to Drier.

(2) Draw a fault tree with top event Latch does not Trip. Assuming that the Latch will trip if the linkage can be driven by either hydraulic system. The linkage may break.

(3) Draw a fault tree for No Output from the System

(4) Draw a fault tree for Motor does not Operate

(5) Write down the probability value of each sequence. Given failure probability of System B, C, and D are PB, PC and PD, respectively
B

A
Success

D
1 2 3 4

Failure

(6) Draw fault trees for the following systems.

Suggested Solutions (1)

(2)

(3)
No Output from System

Group 1,2 fail

Group 2,5,3 fail

Group 1,5,4 fail

Group 3,4 fail

(4)

(5)
Sequence 1 = $(1 - P_B)(1 - P_C)(1 - P_D)$
Sequence 2 = $(1 - P_B)(1 - P_C)\,P_D$
Sequence 3 = $(1 - P_B)\,P_C$
Sequence 4 = $P_B\,(1 - P_D)$
Sequence 5 = $P_B\,P_D$
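
A quick check on any set of event tree sequence probabilities is that they sum to 1 for every choice of inputs. The sketch below evaluates the five expressions with exact fractions; the numerical failure probabilities are hypothetical, and the function name is chosen here for illustration.

```python
from fractions import Fraction

def sequence_probabilities(pb, pc, pd):
    """Sequence probabilities from the suggested solution to problem (5)."""
    return {
        1: (1 - pb) * (1 - pc) * (1 - pd),
        2: (1 - pb) * (1 - pc) * pd,
        3: (1 - pb) * pc,
        4: pb * (1 - pd),
        5: pb * pd,
    }

# Hypothetical failure probabilities for Systems B, C and D.
seq = sequence_probabilities(Fraction(1, 10), Fraction(1, 5), Fraction(3, 10))
total = sum(seq.values())

for number, p in seq.items():
    print(f"Sequence {number} = {p}")
print(f"total = {total}")  # 1
```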

(6). If you can finish Questions 1 to 5, you should know the answer for Question 6!

History of Fault Tree Analysis (FTA)

Fault Tree Analysis (FTA) is another technique for reliability and safety analysis. Bell Telephone Laboratories developed the concept in 1962 for the U.S. Air Force for use with the Minuteman system. It was later adopted and extensively applied by the Boeing Company. Fault tree analysis is one of many symbolic "analytical logic techniques" found in operations research and in system reliability. Other techniques include Reliability Block Diagrams (RBDs).

What is a Fault Tree Diagram (FTD)?

Fault tree diagrams (or negative analytical trees) are logic block diagrams that display the state of a system (top event) in terms of the states of its components (basic events). Like reliability block diagrams (RBDs), fault tree diagrams are also a graphical design technique, and as such provide an alternative methodology to RBDs. An FTD is built top-down and in terms of events rather than blocks. It uses a graphic "model" of the pathways within a system that can lead to a foreseeable, undesirable loss event (or a failure). The pathways interconnect contributory events and conditions, using standard logic symbols (AND, OR, etc.). The basic constructs in a fault tree diagram are gates and events, where the events have the same meaning as a block in an RBD and the gates are the conditions.

Fault Trees and Reliability Block Diagrams

The most fundamental difference between FTDs and RBDs is that in an RBD one is working in the "success space", and thus looks at system success combinations, while in a fault tree one works in the "failure space" and looks at system failure combinations. Traditionally, fault trees have been used to assess fixed probabilities (i.e. each event that comprises the tree has a fixed probability of occurring) while RBDs may have included time-varying distributions for the success (reliability equation) and other properties, such as repair/restoration distributions.

Drawing Fault Trees: Gates and Events

Fault trees are built using gates and events (blocks). The two most commonly used gates in a fault tree are the AND and OR gates. As an example, consider two events (or blocks) comprising a Top Event (or a system). If occurrence of either event causes the top event to occur, then these events (blocks) are connected using an OR gate. Alternatively, if both events need to occur to cause the top event to occur, they are connected by an AND gate. As a visualization example, consider the simple case of a system comprised of two components, A and B, and where a failure of either component causes system failure. The system RBD is made up of two blocks in series (see RBD configurations), as shown next:

The fault tree diagram for this system includes two basic events connected to an OR gate (which is the "Top Event"). For the "Top Event" to occur, either A or B must happen. In other words, failure of A OR B causes the system to fail.

Relationships Between Fault Trees and RBDs

In general (and with some specific exceptions), a fault tree can be easily converted to an RBD. However, it is generally more difficult to convert an RBD into a fault tree, especially if one allows for highly complex configurations. The following table shows gate symbols commonly used in fault tree diagrams and describes their relationship to an RBD. (The term "Classic Fault Tree" refers to the definitions as used in the Fault Tree Handbook (NUREG-0492) by the U.S. Nuclear Regulatory Commission).

Table 1: Classic Fault Tree Gates and their Traditional RBD Equivalents
(Columns: Name of Gate; Classic FTA Symbol [image not reproduced]; Description; RBD Equivalent)

AND
Description: The output event occurs if all input events occur.
RBD Equivalent: Simple Parallel Configuration

OR
Description: The output event occurs if at least one of the input events occurs.
RBD Equivalent: Series Configuration

Voting OR (k-out-of-n)
Description: The output event occurs if k or more of the input events occur.
RBD Equivalent: k-out-of-n Parallel Configuration

Inhibit
Description: The output event occurs if all input events occur and an additional conditional event occurs.
RBD Equivalent: Simple Parallel Configuration of all the events plus the condition

Priority AND
Description: The output event occurs if all input events occur in a specific sequence.
RBD Equivalent: Standby Parallel Configuration (without a quiescent failure distribution)

Dependency AND
Classic FTA Symbol: Not used in classic FTA. Gate defined by ReliaSoft.
Description: The output event occurs if all input events occur, however the events are dependent, i.e. the occurrence of each event affects the probability of occurrence of the other events.
RBD Equivalent: Load Sharing Parallel Configuration

XOR
Description: The output event occurs if exactly one input event occurs.
RBD Equivalent: Cannot be represented and does not apply in terms of system reliability. In system reliability, this would imply that a two-component system would function even if both components have failed.

Table 2: RBD Constructs without a Traditional Fault Tree Equivalent
(Columns: Function; FTA Equivalent; Description; RBD Equivalent)

Dependency (Load Sharing)
FTA Equivalent: Not used in classic FTA.
Description: Allows for modeling event dependency (or load sharing). The output event occurs if all input events occur, however the events are dependent, i.e. the occurrence of each event affects the probability of occurrence of the other events.
RBD Equivalent: Load Sharing Parallel Configuration

True Standby with a quiescent failure distribution
FTA Equivalent: A Priority AND gate can be used. However, this does not account for quiescent failure probabilities.
Description: Standby redundancy configurations consist of items that are inactive and available to be called into service when/if the active item fails (i.e. on standby). Items on standby can also fail (quiescent) while waiting to switch.
RBD Equivalent: Standby Parallel Configuration

Table 3: Traditional Fault Tree Gates without an RBD Equivalent
(Columns: Name of Gate; Classic FTA Symbol [image not reproduced]; Description; RBD Equivalent)

XOR
Description: The output event occurs if exactly one input event occurs. In a two-component system the event does not occur if both or none of the inputs occur. When modeling system reliability, this implies that the system is successful if none of the components fail or if all of the components fail.
RBD Equivalent: Cannot be represented and does not apply in terms of system reliability. In system reliability, this would imply that a two-component system would function even if both components have failed.

Events

The gates in a fault tree are the logic symbols that interconnect contributory events and conditions. An event (or a condition) block in a fault tree is the same as a standard block in an RBD, in that it can have a probability of occurrence (or a distribution function). However, unlike traditional RBDs, where a single graphical representation is utilized to represent the block (or event), fault trees use several graphical block representations. Table 4 discusses these graphical representations.

Table 4: Traditional Fault Tree Event Symbols and their RBD Equivalents
(Columns: Primary Event Block; Classic FTA Symbol [image not reproduced]; Description; RBD Equivalent)

Basic Event
Description: A basic initiating fault (or failure event).
RBD Equivalent: Block

External Event (House Event)
Description: An event that is normally expected to occur. In general, these events can be set to occur or not occur, i.e. they have a fixed probability of 0 or 1.
RBD Equivalent: Block that cannot fail or that is in a failed state.

Undeveloped Event
Description: An event which is not developed further. It is a basic event that does not need further resolution.
RBD Equivalent: Block

Conditioning Event
Description: A specific condition or restriction that can apply to any gate.
RBD Equivalent: Block: Placement of the block will vary depending on the gate applied to.

Table 5: Additional Fault Tree Constructs and their RBD Equivalents
(Columns: Primary Event Block; Classic FTA Symbol [image not reproduced]; Description; RBD Equivalent)

Transfer
Description: Indicates a transfer continuation to a sub tree.
RBD Equivalent: Subdiagram Block

Example 1 A fault tree diagram with a Voting Gate and the RBD equivalent.

Example 2 Fault Trees and Complex RBDs: The best example of a complex reliability block diagram is the so called "bridge." The following RBD represents such a bridge.

Representation of this bridge as a fault tree diagram requires the utilization of duplicate events, since gates can only represent components in series and parallel. An inspection of this system reveals that any of the following failures will cause the system to fail:
Failure of components 1 and 2.
Failure of components 3 and 4.
Failure of components 1 and 5 and 4.
Failure of components 2 and 5 and 3.

In probability terminology, we have: (1 And 2) Or (3 And 4) Or (1 And 5 And 4) Or (2 And 5 And 3).

These sets of events are also called minimal cut sets. It can now be seen how the fault tree can be created by representing the above set of events in the following fault tree.

Conversion of the above fault tree to an RBD (note that components with same name are mirrored blocks).
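
Given the four minimal cut sets of the bridge, the top event probability can be computed exactly by enumerating all component states, and compared with the rare-event sum of cut set probabilities. The sketch below assumes independent components and a hypothetical failure probability of 0.1 for each; neither value nor function name comes from the source.

```python
from itertools import product
from math import prod

# Minimal cut sets of the bridge, as listed in the text.
CUT_SETS = [{1, 2}, {3, 4}, {1, 5, 4}, {2, 5, 3}]

# Hypothetical, independent component failure probabilities.
q = {i: 0.1 for i in range(1, 6)}

def top_event_probability(cut_sets, q):
    """Exact P(system fails): sum the probability of every failing state."""
    components = sorted(q)
    total = 0.0
    for states in product([False, True], repeat=len(components)):
        failed = {c for c, is_failed in zip(components, states) if is_failed}
        if any(cut <= failed for cut in cut_sets):   # some cut set is fully failed
            total += prod(q[c] if c in failed else 1 - q[c] for c in components)
    return total

p_exact = top_event_probability(CUT_SETS, q)
p_rare_event = sum(prod(q[c] for c in cut) for cut in CUT_SETS)

print(f"exact = {p_exact:.5f}, rare-event approximation = {p_rare_event:.5f}")
```

The rare-event sum double-counts states in which more than one cut set has failed, so it slightly overestimates the exact value; the gap shrinks as the component failure probabilities get smaller.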

Using Fault Trees to Identify Potential Faults in Critical Systems


Visualize the Events that Lead to Component Failure

Fault Tree Analysis (FTA) is well recognized worldwide as an important tool for evaluating safety and reliability in system design, development, and operation. For more than 40 years, FTA has been used in the aerospace, nuclear, and transportation industries to translate the failure behavior of a system into a visual diagram that displays system relationships and root cause failure paths. A fault tree provides a concise, visual representation of the various combinations of possible occurrences within a system that can result in a predefined and undesirable event. FTA is most often used for:

Identifying safety critical components.
Verifying product requirements.
Certifying product reliability.
Assessing product risk.
Investigating accidents/incidents.
Evaluating design changes.
Displaying the causes and consequences of events.
Identifying common-cause failures.

FTA is a deductive analysis method that begins with a general conclusion (a system-level undesirable event) and then attempts to determine the specific causes of this conclusion. Based on a set of rules and logic symbols from probability theory and Boolean algebra, FTA uses a top-down approach to generate a logic model that provides for both qualitative and quantitative evaluation of system reliability. The undesirable event at the system level is referred to as the top event. It generally represents a system failure mode or hazard for which predicted availability data is required. The lower level events in each branch of a fault tree are referred to as basic events. They represent hardware, software, and human failures for which the probability of failure is given based on historical data. Basic events are linked via logic symbols (gates) to one or more undesirable top events.

Computerized FTA
Small fault trees have fewer than 100 events, medium fault trees have from 100 to 1,000 events, and large fault trees have more than 1,000 events! Today, computerized FTA can be used to analyze very complex systems as well as very complex relationships between hardware, software, and humans. Using good FTA software, you can cut, copy, paste, rearrange, and delete events and gates to various fault tree branches to quickly and easily compare different hardware configurations. An example of a computer-generated fault tree follows.*

* Generated in Relex Fault Tree.

In the above figure, "Passenger Injury Occurs in Elevator" is defined as the top event. The reasons why passenger injury in an elevator could occur have been determined to be either that the box free falls or that the door is open at an inappropriate time. After determining all possible causes for each event identified, the events and gates for connecting them to higher-level events are added to the fault tree. Any faults that can be further developed to determine causes are then added as lower-level events and connected by the appropriate gates.

The lowest-level events that terminate fault tree paths are called basic events or primary events. They are either component-level events that cannot be further resolved or external events. For example, in the first level of possible events for the free fall of the box, "Cable off Pulley" and "Broken Cable" are basic events. Because these events are primary faults, they are not developed any further in the fault tree.

Fault Tree Construction


To construct a useful fault tree, the analyst must fully understand the system as an integrated interaction of subsystems. In addition to having a logical mind and the ability to visualize the logic structure and interaction of a system and its subsystems, the analyst must have knowledge of the dependencies between the components, their reliability parameters, and the conditions that determine the components that are considered to have failed. Thus, good analysts are generally experts in mechanical, structural, electrical, and control systems and also have an understanding of human interactions, procedural implications, and even chemical interactions.

The most common errors in constructing fault trees include:

Using too wide a scope for the top event, which results in a large, complex, and unfocused fault tree.
Using inconsistent nomenclature for the same events, which prevents you from finding events that occur in multiple branches of the fault tree.
Using the same nomenclature for similar but different components, thereby identifying the same failure for several scenarios when these failures are actually caused by different components.
Breaking the fault tree into branches by electrical, mechanical, and structural subsystems, thereby failing to take the interface and integration of the system into account.

Top Event Definition

Because the top event sets the tone for the series of questions that are considered when constructing the fault tree, the analyst should use the system definition to construct a clear and concise top event. If a top event is vaguely stated, the fault tree is likely to be large, complex, and unfocused. To generate a useful fault tree, the top event must be precisely stated and be narrow in scope. Specifying the mission phase or portion of the mission to which a top event applies in the description of the top event often helps to generate a very concise fault tree.

Event Nomenclature

During fault tree creation, consistently applying the appropriate nomenclature to events is critical to identifying the same event in multiple fault tree branches. If, for example, you give an event a different name in another branch of the fault tree, cutset analysis, which is described in the "Fault Tree Analysis" section, identifies multiple events leading to different failures (rather than the same event leading to different failures). If you do not realize that nomenclature errors exist, you may not recognize an event as a major contributor to the top event and thereby fail to recommend improvements or controls for it. Similarly, when two identical components are installed in different locations within a system, you must be sure to identify that they are physically different components by using reference designators in the nomenclature. Otherwise, cutset analysis identifies how the same component failure contributes to several scenarios when the failures are actually caused by different components.

Branch Arrangement

Because engineering groups so often function autonomously, fitting each piece of hardware together in a system tends to be an afterthought. Organizations that regularly categorize work by engineering disciplines tend to arrange the branches of a fault tree by subsystems. However, such an arrangement limits FTA to considering only component failures. When engineering groups fail to properly coordinate and implement a design as a team, interfaces and interactions are most often the areas in which the system breaks down. When fault tree branches are arranged by subsystems, these areas are never even addressed.

When scenarios that lead to the top event are used to arrange fault tree branches, the analyst can place faults under the cause for a component failure. Causes can include not only hardware failures but also interface and integration problems due to design flaws, software, human errors, operation and maintenance errors, and environmental influences on the system. Fault trees arranged by scenarios often uncover complex relationships and interactions of systems, components, and actions that are believed to be unrelated. For example, such an FTA can reveal a single-point component failure that can fail two supposedly redundant or independent systems.

Fault Tree Analysis

After properly identifying all failures, events, and conditions that can lead to the occurrence of the top event, you can compute the probability of the top event and measure the relative impact of a design fix. The traditional analysis process is to generate the system minimal cutsets, apply the basic event probabilistic data, and then determine the probability of the top event.

The qualitative analysis of fault trees is based on determining the minimal cutsets for the top event. Cutsets identify the sets of events that cause the top event to occur. A cutset can be a single-point failure or event or can be a set of many events. Different cutsets can include different combinations of the same event. A minimal cutset is the smallest group of events that cause the top event to occur. In large trees, the events that cause the top event to occur are often buried deep within the system and are not easily discovered without performing cutset analysis. The basic events that belong to a cutset provide such information as single-point failures and the relative contributions of each cutset. Generally, the cutsets that have the highest probability of occurrence are the ones that have the fewest events.

The minimal cutset information obtained during qualitative analysis can then be used for computing the unavailability and unreliability values of the system during quantitative analysis. (Unavailability and unreliability values are calculated by FTA because fault trees are organized around system failures rather than system successes.) For quantitative analysis, reliability and maintainability information such as failure probability or repair rate is used to determine or quantify the probability of occurrence of the top event.

Conclusion
Because FTA is an event-oriented analysis, it can identify more possible failure causes than structure-oriented FMEAs (Failure Modes and Effects Analysis) and RBDs (Reliability Block Diagrams), which allow only hardware failure considerations. When performed correctly, FTA often identifies system problems that other design and analytical methods would overlook.

Fault Tree Analysis

Topics Covered
Fault Tree Definition Developing the Fault Tree Structural Significance of the Analysis Quantitative Significance of the Analysis Diagnostic Aids and Shortcuts Finding and Interpreting Cut Sets and Path Sets Success-Domain Counterpart Analysis Assembling the Fault Tree Analysis Report Fault Tree Analysis vs. Alternatives Fault Tree Shortcoming/Pitfalls/Abuses
All fault trees appearing in this training module have been drawn, analyzed, and printed using FaultrEase(TM), a computer application available from: Arthur D. Little, Inc., Acorn Park, Cambridge, MA 02140-2390. Phone (617) 864-5770.

8671

First A Bit of Background


Origins of the technique Fault Tree Analysis defined Where best to apply the technique What the analysis produces Symbols and conventions

Origins
Fault tree analysis was developed in 1962 for the U.S. Air Force by Bell Telephone Laboratories for use with the Minuteman system, was later adopted and extensively applied by the Boeing Company, and is one of many symbolic logic analytical techniques found in the operations research discipline.

The Fault Tree is


A graphic model of the pathways within a system that can lead to a foreseeable, undesirable loss event. The pathways interconnect contributory events and conditions, using standard logic symbols. Numerical probabilities of occurrence can be entered and propagated through the model to evaluate probability of the foreseeable, undesirable event. Only one of many System Safety analytical tools and techniques.

Fault Tree Analysis is Best Applied to Cases with


Large, perceived threats of loss, i.e., high risk. Numerous potential contributors to a mishap. Complex or multi-element systems/processes. Already-identified undesirable events. (a must!) Indiscernible mishap causes (i.e., autopsies).
Caveat: Large fault trees are resource-hungry and should not be undertaken without reasonable assurance of need.

Fault Tree Analysis Produces


Graphic display of chains of events/conditions leading to the loss event. Identification of those potential contributors to failure that are critical. Improved understanding of system characteristics. Qualitative/quantitative insight into probability of the loss event selected for analysis. Identification of resources committed to preventing failure. Guidance for redeploying resources to optimize control of risk. Documentation of analytical results.


Some Definitions
FAULT: An abnormal undesirable state of a system or a system element* induced 1) by presence of an improper command or absence of a proper one, or 2) by a failure (see below). All failures cause faults; not all faults are caused by failures. A system which has been shut down by safety features has not faulted.

FAILURE: Loss, by a system or system element*, of functional integrity to perform as intended, e.g., relay contacts corrode and will not pass rated current closed, or the relay coil has burned out and will not close the contacts when commanded: the relay has failed; a pressure vessel bursts: the vessel fails. A protective device which functions as intended has not failed, e.g., a blown fuse.

*System element: a subsystem, assembly, component, piece part, etc.

Definitions

PRIMARY (OR BASIC) FAILURE: The failed element has seen no exposure to environmental or service stresses exceeding its ratings to perform. E.g., fatigue failure of a relay spring within its rated lifetime; leakage of a valve seal within its pressure rating.

SECONDARY FAILURE: Failure induced by exposure of the failed element to environmental and/or service stresses exceeding its intended ratings. E.g., the failed element has been improperly designed, or selected, or installed, or calibrated for the application; the failed element is overstressed/underqualified for its burden.

Assumptions and Limitations


Non-repairable system.
No sabotage.
Markov: Fault rates are constant ($\lambda = 1/\text{MTBF} = K$). The future is independent of the past, i.e., future states available to the system depend only upon its present state and pathways now available to it, not upon how it got where it is.
Bernoulli: Each system element analyzed has two, mutually exclusive states.

The Logic Symbols


TOP Event: foreseeable, undesirable event, toward which all fault tree logic paths flow, or Intermediate Event: describing a system state produced by antecedent events.
Or Gate: produces output if any input exists. Any input, individually, must be (1) necessary and (2) sufficient to cause the output event.
And Gate: produces output if all inputs co-exist. All inputs, individually, must be (1) necessary and (2) sufficient to cause the output event.
Basic Event: Initiating fault/failure, not developed further. (Called Leaf, Initiator, or Basic.) The Basic Event marks the limit of resolution of the analysis.

Most Fault Tree Analyses can be carried out using only these four symbols.

Events and Gates are not component parts of the system being analyzed. They are symbols representing the logic of the analysis. They are bi-modal. They function flawlessly.

OR

AND

Steps in Fault Tree Analysis


1. Identify undesirable TOP event.
2. Identify first-level contributors.
3. Link contributors to TOP by logic gates.
4. Identify second-level contributors.
5. Link second-level contributors to TOP by logic gates.
6. Repeat/continue.

Basic Event (Leaf, Initiator, or Basic) indicates limit of analytical resolution.

Some Rules and Conventions


Do use single-stem gate-feed inputs.
Don't let gates feed gates.

[Figure: NO and YES examples of these conventions.]

More Rules and Conventions


Be CONSISTENT in naming fault events/conditions. Use same name for same event/condition throughout the analysis. (Use index numbering for large trees.)
Say WHAT failed/faulted and HOW, e.g., "Switch Sw-418 contacts fail closed".
Don't expect miracles to save the system. Lightning will not recharge the battery. A large bass will not plug the hole in the hull.

Some Conventions Illustrated


[Figure: a "Flat Tire" tree with the contributors Air Escapes From Casing, Tire Pressure Drops, and Tire Deflates, marked with a question mark.]

Initiators must be statistically independent of one another. Name basics consistently!

MAYBE: A gust of wind will come along and correct the skid. A sudden cloudburst will extinguish the ignition source. There'll be a power outage when the worker's hand contacts the high-voltage conductor. No miracles!

Identifying TOP Events


Explore historical records (own and others). Look to energy sources. Identify potential mission failure contributors. Develop what-if scenarios. Use shopping lists.

Example TOP Events


Wheels-up landing Mid-air collision Subway derailment Turbine engine FOD Rocket failure to ignite Irretrievable loss of primary test data Dengue fever pandemic Sting failure Inadvertent nuke launch Reactor loss of cooling Uncommanded ignition Inability to dewater buoyancy tanks

TOP events represent potential high-penalty losses (i.e., high risk). Either severity of the outcome or frequency of occurrence can produce high risk.

Scope the Tree TOP


Too Broad | Improved
Computer Outage | Outage of Primary Data Collection computer, exceeding eight hours, from external causes
Exposed Conductor | Unprotected body contact with potential greater than 40 volts
Foreign Object Ingestion | Foreign object weighing more than 5 grams and having density greater than 3.2 gm/cc
Jet Fuel Dispensing Leak | Fuel dispensing fire resulting in loss exceeding $2,500

Scoping reduces effort spent in the analysis by confining it to relevant considerations. To scope, describe the level of penalty or the circumstances for which the event becomes intolerable; use modifiers to narrow the event description.

Adding Contributors to the Tree


(1) EACH CONTRIBUTING ELEMENT
(2) must be an INDEPENDENT* FAULT or FAILURE CONDITION (typically described by a noun, an action verb, and specifying modifiers). Examples: Electrical power fails off; Low-temp. alarm fails off.
(3) and, each element must be an immediate contributor to the level above.

* At a given level, under a given gate, each fault must be independent of all others. However, the same fault may appear at other points on the tree.

[Figure: CAUSE (contributing elements) feeding upward to EFFECT (the level above).]

NOTE: As a group under an AND gate, and individually under an OR gate, contributing elements must be both necessary and sufficient to serve as immediate cause for the output event.

Example Fault Tree Development Constructing the logic Spotting/correcting some common errors Adding quantitative data


An Example Fault Tree


Late for Work Undesirable Event

Sequence Initiation Failures


Oversleep

Transport Failures

Life Support Failures

Process and Misc. System Malfunctions Causative Modalities*


* Partitioned aspects of system function, subdivided as the purpose, physical arrangement, or sequence of operation

Sequence Initiation Failures


Oversleep

No Start Pulse

Natural Apathy

Biorhythm Fails

Artificial Wakeup Fails

Verifying Logic
Oversleep

Does this look correct? Should the gate be OR?

No Start Pulse

Natural Apathy Artificial Wakeup Fails

Biorhythm Fails

?

Test Logic in SUCCESS Domain


Oversleep

Redraw invert all statements and gates


trigger

Wakeup Succeeds

motivation

No Start Pulse

Failure Domain

Natural Apathy

Start Pulse Works

Success Domain

Natural High Torque

BioRhythm Fails

Artificial Wakeup Fails

BioRhythm Fails

Artificial Wakeup Works

?
If it was wrong here ... it'll be wrong here, too!

Artificial Wakeup Fails


Artificial Wakeup Fails

Alarm Clocks Fail

Nocturnal Deafness

Main Plug-in Clock Fails

Backup (Windup) Clock Fails

Power Outage

Faulty Innards

Forget to Set

Faulty Mechanism

Forget to Set

Forget to Wind

Electrical Fault

Mechanical Fault

Hour Hand Falls Off

Hour Hand Jams Works

What does the tree tell us about system vulnerability at this point?

Background for Numerical Methods


Relating PF to R The Bathtub Curve Exponential Failure Distribution Propagation through Gates PF Sources

Reliability and Failure Probability Relationships


S = Successes
F = Failures
Reliability: $R = \frac{S}{S+F}$
Failure Probability: $P_F = \frac{F}{S+F}$
$R + P_F = \frac{S}{S+F} + \frac{F}{S+F} \equiv 1$
$\lambda$ = Fault Rate = $\frac{1}{\text{MTBF}}$

Significance of PF
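The Bathtub Curve

Most system elements have fault rates ($\lambda = 1/\text{MTBF}$) that are constant ($\lambda_0$) over long periods of useful life. During these periods, faults occur at random times.

[Figure: the bathtub curve, fault rate $\lambda$ versus time $t$, showing the BURN IN (Infant Mortality), Random Failure ($\lambda = \lambda_0$), and BURN OUT regions.]

Exponentially Modeled Failure Probability

Fault probability is modeled acceptably well as a function of exposure interval (T) by the exponential:

$P_F = 1 - e^{-\lambda T}$, where $\lambda = 1/\text{MTBF}$

For exposure intervals that are brief (T < 0.2 MTBF), PF is approximated by $\lambda T$ to within about 0.02 in absolute terms:

$P_F \cong \lambda T$ (for $\lambda T \le 20\%$)

[Figure: $P_F$ versus $T$, rising from 0 toward 1.0 and passing through $P_F \approx 0.63$ at T = 1 MTBF.]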
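The relationship between the exponential model and its short-exposure approximation can be checked numerically. The following is a minimal sketch, not part of the original slides; the function names and the 1000-hour MTBF are hypothetical values chosen only for illustration.

```python
import math


def failure_probability(fault_rate: float, exposure: float) -> float:
    """Exponential model: PF = 1 - exp(-lambda * T)."""
    return 1.0 - math.exp(-fault_rate * exposure)


def linear_approximation(fault_rate: float, exposure: float) -> float:
    """Short-exposure approximation: PF ~= lambda * T."""
    return fault_rate * exposure


# Hypothetical element with MTBF = 1000 hours (assumed for illustration).
MTBF_HOURS = 1000.0
FAULT_RATE = 1.0 / MTBF_HOURS

for fraction in (0.01, 0.1, 0.2, 1.0):
    exposure = fraction * MTBF_HOURS
    exact = failure_probability(FAULT_RATE, exposure)
    approx = linear_approximation(FAULT_RATE, exposure)
    print(f"T = {fraction:.2f} MTBF: exact PF = {exact:.4f}, lambda*T = {approx:.4f}")
```

At T = 1 MTBF the exact value is about 0.63, while the linear approximation overshoots to 1.0, which is why the approximation is reserved for brief exposure intervals.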

R and PF Through Gates

$R + P_F \equiv 1$

OR Gate [Union / $\cup$]
Either of two, independent, element failures produces system failure.
$R_T = R_A R_B$
$P_F = 1 - R_T$
$P_F = 1 - (R_A R_B)$
$P_F = 1 - [(1 - P_A)(1 - P_B)]$
$P_F = P_A + P_B - P_A P_B$
Rare Event Approximation: for $P_{A,B} \le 0.2$, $P_F \cong P_A + P_B$ with error $\le 11\%$.

AND Gate [Intersection / $\cap$]
Both of two, independent elements must fail to produce system failure.
$R_T = R_A + R_B - R_A R_B$
$P_F = 1 - R_T$
$P_F = 1 - (R_A + R_B - R_A R_B)$
$P_F = 1 - [(1 - P_A) + (1 - P_B) - (1 - P_A)(1 - P_B)]$
$P_F = P_A P_B$

For 3 Inputs
AND Gate: $P_F = P_A P_B P_C$
OR Gate: $P_F = P_A + P_B + P_C - P_A P_B - P_A P_C - P_B P_C + P_A P_B P_C$ (omit the product terms for approximation)

PF Propagation Through Gates


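AND Gate [Intersection / $\cap$]
$P_T = \prod P_e$
For two inputs, 1 and 2, with probabilities P1 and P2: $P_T = P_1 P_2$

OR Gate [Union / $\cup$]
$P_T \cong \sum P_e$
For two inputs, 1 and 2, with probabilities P1 and P2: $P_T \cong P_1 + P_2$
Exact: $P_T = P_1 + P_2 - P_1 P_2$ (the product term is usually negligible)

1 & 2 are INDEPENDENT events.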

Ipping Gives Exact OR Gate Solutions


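[Figure: a failure-domain OR gate with inputs 1, 2, 3 (probabilities P1, P2, P3) and TOP probability PT = ?; its success-domain counterpart, with complemented inputs and $\bar{P}_T = \prod (1 - P_e)$; and the resulting failure-domain solution.]

$\bar{P}_1 = (1 - P_1)$, $\bar{P}_2 = (1 - P_2)$, $\bar{P}_3 = (1 - P_3)$
$P_T = \coprod P_e = 1 - \prod (1 - P_e)$
$P_T = 1 - [(1 - P_1)(1 - P_2)(1 - P_3) \cdots (1 - P_n)]$

The ip operator ($\coprod$) is the co-function of pi ($\prod$). It provides an exact solution for propagating probabilities through the OR gate. Its use is rarely justifiable.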
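The gate formulas above can be sketched in a few lines. This is an illustrative example only, assuming independent inputs as the slides require; the function names are hypothetical and not taken from any fault tree package.

```python
from math import prod
from typing import Sequence


def and_gate(probs: Sequence[float]) -> float:
    """AND gate: product of the input probabilities (independent inputs)."""
    return prod(probs)


def or_gate_exact(probs: Sequence[float]) -> float:
    """OR gate, exact ("ip") form: 1 - product of (1 - Pe)."""
    return 1.0 - prod(1.0 - p for p in probs)


def or_gate_rare_event(probs: Sequence[float]) -> float:
    """OR gate, rare event approximation: sum of the input probabilities."""
    return sum(probs)


# Two inputs at the upper edge of the stated range, PA = PB = 0.2.
inputs = [0.2, 0.2]
exact = or_gate_exact(inputs)
approx = or_gate_rare_event(inputs)
print(f"AND: {and_gate(inputs):.4f}")
print(f"OR exact: {exact:.4f}, OR rare event: {approx:.4f}")
print(f"relative error of approximation: {(approx - exact) / exact:.1%}")
```

The rare event approximation always errs on the high side, so it is conservative; the exact form matters only when input probabilities are large.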

More Gates and Symbols


Inclusive OR Gate: $P_T = P_1 + P_2 - (P_1 \times P_2)$. Opens when any one or more events occur.
Exclusive OR Gate: $P_T = P_1 + P_2 - 2(P_1 \times P_2)$. Opens when any one (but only one) event occurs.
Mutually Exclusive OR Gate: $P_T = P_1 + P_2$. Opens when any one of two or more events occur. All other events are then precluded.

For all OR Gate cases, the Rare Event Approximation $P_T \cong \sum P_e$ may be used for small values of Pe.

Still More Gates and Symbols


Priority AND Gate: $P_T = P_1 \times P_2$. Opens when input events occur in predetermined sequence.
Inhibit Gate: Opens when (single) input event occurs in presence of enabling condition.
Undeveloped Event: An event not further developed.
External Event: An event normally expected to occur.
Conditioning Event: Applies conditions or restrictions to other symbols.

Some Failure Probability Sources


Manufacturers Data Industry Consensus Standards MIL Standards Historical Evidence Same or Similar Systems Simulation/testing Delphi Estimates ERDA Log Average Method

Log Average Method*


If probability is not estimated easily, but upper and lower credible bounds can be judged:
Estimate upper and lower credible bounds of probability for the phenomenon in question.
Average the logarithms of the upper and lower bounds.
The antilogarithm of the average of the logarithms of the upper and lower bounds is less than the upper bound and greater than the lower bound by the same factor. Thus, it is geometrically midway between the limits of estimation.

Example, with Lower Probability Bound $P_L = 10^{-2}$ and Upper Probability Bound $P_U = 10^{-1}$:

$\text{Log Average} = \text{Antilog}\left[\frac{\log P_L + \log P_U}{2}\right] = \text{Antilog}\left[\frac{(-2) + (-1)}{2}\right] = 10^{-1.5} = 0.0316228$

[Figure: logarithmic scale from 0.01 to 0.1 with the log average, 0.0316+, marked between PL and PU.]

Note that, for the example shown, the arithmetic average would be $\frac{0.01 + 0.1}{2} = 0.055$, i.e., 5.5 times the lower bound and 0.55 times the upper bound.

* Reference: Briscoe, Glen J.; Risk Management Guide; System Safety Development Center; SSDC-11; DOE 76-45/11; September 1982.
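The log average method is equivalent to taking the geometric mean of the two bounds. A minimal sketch follows; the function name `log_average` is hypothetical, and the bounds are the ones used in the slide's example.

```python
import math


def log_average(lower: float, upper: float) -> float:
    """Antilog of the mean of the base-10 logs of the two bounds."""
    if not 0.0 < lower <= upper:
        raise ValueError("bounds must satisfy 0 < lower <= upper")
    return 10.0 ** ((math.log10(lower) + math.log10(upper)) / 2.0)


p_lower, p_upper = 0.01, 0.1
estimate = log_average(p_lower, p_upper)
arithmetic = (p_lower + p_upper) / 2.0

print(f"log average:        {estimate:.7f}")
print(f"arithmetic average: {arithmetic:.3f}")
# The log average sits the same factor above the lower bound as below the upper.
print(f"estimate / lower = {estimate / p_lower:.4f}, upper / estimate = {p_upper / estimate:.4f}")
```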

More Failure Probability Sources


WASH-1400 (NUREG-75/014); Reactor Safety Study: An Assessment of Accident Risks in US Commercial Nuclear Power Plants; 1975
IEEE Standard 500
Government-Industry Data Exchange Program (GIDEP)
Rome Air Development Center Tables
NUREG-0492; Fault Tree Handbook; (Table XI-1); 1986
Many others, including numerous industry-specific proprietary listings

Typical Component Failure Rates


Failures Per 10^6 Hours

Device | Minimum | Average | Maximum
Semiconductor Diodes | 0.10 | 1.0 | 10.0
Transistors | 0.10 | 3.0 | 12.0
Microwave Diodes | 3.0 | 10.0 | 22.0
MIL-R-11 Resistors | 0.0035 | 0.0048 | 0.016
MIL-R-22097 Resistors | 29.0 | 41.0 | 80.0
Rotary Electrical Motors | 0.60 | 5.0 | 500.0
Connectors | 0.01 | 0.10 | 10.0

Source: Willie Hammer, Handbook of System and Product Safety, Prentice Hall

Typical Human Operator Failure Rates


Activity | Error Rate
*Error of omission/item embedded in procedure | 3 x 10^-3
*Simple arithmetic error with self-checking | 3 x 10^-2
*Inspector error of operator oversight | 10^-1
*General rate/high stress/dangerous activity | 0.2-0.3
**Checkoff provision improperly used | 0.1-0.9 (0.5 avg.)
**Error of omission/10-item checkoff list | 0.0001-0.005 (0.001 avg.)
**Carry out plant policy/no check on operator | 0.005-0.05 (0.01 avg.)
**Select wrong control/group of identical, labeled, controls | 0.001-0.01 (0.003 avg.)

Sources:
* WASH-1400 (NUREG-75/014); Reactor Safety Study: An Assessment of Accident Risks in U.S. Commercial Nuclear Power Plants, 1975
** NUREG/CR-1278; Handbook of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications, 1980

Some Factors Influencing Human Operator Failure Probability


Experience Stress Training Individual self discipline/conscientiousness Fatigue Perception of error consequences (to self/others) Use of guides and checklists Realization of failure on prior attempt Character of Task Complexity/Repetitiveness

Artificial Wakeup Fails


KEY: each entry shows Faults/Operation (e.g., 8. x 10^-3) followed by Rate, Faults/Year (e.g., 2/1). Assume 260 operations/year.

Artificial Wakeup Fails: 3.34 x 10^-4 (approx. 0.1/yr)
    Alarm Clocks Fail: 3.34 x 10^-4
        Main Plug-in Clock Fails: 1.82 x 10^-2
            Power Outage: 1. x 10^-2 (3/1)
            Faulty Innards: 3. x 10^-4
                Electrical Fault: 3. x 10^-4 (1/15)
                Mechanical Fault: 8. x 10^-8
                    Hour Hand Falls Off: 4. x 10^-4 (1/10)
                    Hour Hand Jams Works: 2. x 10^-4 (1/20)
            Forget to Set: 8. x 10^-3 (2/1)
        Backup (Windup) Clock Fails: 1.83 x 10^-2
            Faulty Mechanism: 4. x 10^-4 (1/10)
            Forget to Set: 8. x 10^-3 (2/1)
            Forget to Wind: 1. x 10^-2 (3/1)
    Nocturnal Deafness: Negligible
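The quantified tree can be re-evaluated bottom-up as a check. The sketch below is illustrative only: the gate types are an assumption inferred from the printed values (Mechanical Fault and Alarm Clocks Fail treated as AND gates, the remaining intermediate events as OR gates), all inputs are treated as independent, and the leaf values are the rounded faults-per-operation figures from the slide.

```python
from math import prod


def evaluate(node):
    """Return the probability of a leaf (number) or a (gate, children) node."""
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probs = [evaluate(child) for child in children]
    if gate == "AND":
        return prod(probs)
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in probs)  # exact OR form
    raise ValueError(f"unknown gate type: {gate}")


# Gate types below are assumed, not stated on the slide.
mechanical_fault = ("AND", [4e-4, 2e-4])           # hand falls off, hand jams works
faulty_innards = ("OR", [3e-4, mechanical_fault])  # electrical fault, mechanical fault
main_clock = ("OR", [1e-2, faulty_innards, 8e-3])  # power outage, innards, forget to set
backup_clock = ("OR", [4e-4, 8e-3, 1e-2])          # mechanism, forget to set, forget to wind
alarm_clocks_fail = ("AND", [main_clock, backup_clock])

OPERATIONS_PER_YEAR = 260
p_fail = evaluate(alarm_clocks_fail)
print(f"Alarm Clocks Fail: {p_fail:.3e} per operation")
print(f"approx. {p_fail * OPERATIONS_PER_YEAR:.2f} per year")
```

Because the leaf values are rounded, the result agrees with the slide's 3.34 x 10^-4 only approximately.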

HOW Much PT is TOO Much?


Consider bootstrapping comparisons with known risks
Human operator error (response to repetitive stimulus) | 10^-2 to 10^-3/exp MH
Internal combustion engine failure (spark ignition) | 10^-3/exp hr
Pneumatic instrument recorder failure | 10^-4/exp hr
Distribution transformer failure | 10^-5/exp hr
U.S. Motor vehicles fatalities | 10^-6/exp MH
Death by disease (U.S. lifetime avg.) | 10^-6/exp MH
U.S. Employment fatalities | 10^-7 to 10^-8/exp MH
Death by lightning | 10^-9/exp MH*
Meteorite (>1 lb) hit on 10^3 x 10^3 ft area of U.S. | 10^-10/exp hr
Earth destroyed by extraterrestrial hit | 10^-14/exp hr

Sources:
Browning, R.L., The Loss Rate Concept in Safety Engineering
* National Safety Council, Accident Facts
Kopecek, J.T., Analytical Methods Applicable to Risk Assessment & Prevention, Tenth International System Safety Conference

Apply Scoping
What power outages are of concern?
Power Outage: 1 x 10^-2 (3/1)

Not all of them! Only those that:
Are undetected/uncompensated
Occur during the hours of sleep
Have sufficient duration to fault the system

This probability must reflect these conditions!

Single-Point Failure A failure of one independent element of a system which causes an immediate hazard to occur and/or causes the whole system to fail.
Professional Safety March 1980


Some AND Gate Properties


For an AND gate feeding TOP from elements 1 and 2: $P_T = P_1 \times P_2$

Cost: Assume two identical elements having P = 0.1. Then PT = 0.01. Two elements having P = 0.1 may cost much less than one element having P = 0.01.

Freedom from single point failure: Redundancy ensures that either 1 or 2 may fail without inducing TOP.

Failures at Any Analysis Level Must Be


Dont
Independent of each other True contributors to the level above
Mechanical Fault Faulty Innards

Do

Independent
Hand Falls Off Hand Jams Works Elect. Fault Hand Falls/ Jams Works Gearing Fails Other Mech. Fault

Alarm Failure

Alarm Failure

True Contributors
Alarm Clock Fails Toast Burns Backup Clock Fails Alarm Clock Fails Backup Clock Fails


Common Cause Events/Phenomena


A Common Cause is an event or a phenomenon which, if it occurs, will induce the occurrence of two or more fault tree elements. Oversight of Common Causes is a frequently found fault tree flaw!

Common Cause Oversight An Example


Unannunciated Intrusion by Burglar

Microwave

ElectroOptical

Seismic Footfall

Acoustic

DETECTOR/ALARM FAILURES


Four, wholly independent alarm systems are provided to detect and annunciate intrusion. No two of them share a common operating principle. Redundancy appears to be absolute. The AND gate to the TOP event seems appropriate. But, suppose the four systems share a single source of operating power, and that source fails, and there are no backup sources?

Common Cause Oversight Correction


Unannunciated Intrusion by Burglar

Detector/Alarm Failure

Detector/Alarm Power Failure

Microwave Electro-Optical Seismic Footfall Acoustic

Basic Power Failure Emergency Power Failure

Here, power source failure has been recognized as an event which, if it occurs, will disable all four alarm systems. Power failure has been accounted for as a common cause event, leading to the TOP event through an OR gate. OTHER COMMON CAUSES SHOULD ALSO BE SEARCHED FOR.

Example Common Cause Fault/Failure Sources


Utility Outage (Electricity, Cooling Water, Pneumatic Pressure, Steam)
Moisture
Corrosion
Seismic Disturbance
Dust/Grit
Temperature Effects (Freezing/Overheat)
Electromagnetic Disturbance
Single Operator Oversight
Many Others

Example Common Cause Suppression Methods


Separation/Isolation/Insulation/Sealing/ Shielding of System Elements. Using redundant elements having differing operating principles. Separately powering/servicing/maintaining redundant elements. Using independent operators/inspectors.

Missing Elements?
Contributing elements must combine to satisfy all conditions essential to the TOP event. The logic criteria of necessity and sufficiency must be satisfied.
Unannunciated Intrusion by Burglar SYSTEM CHALLENGE

Detector/Alarm Failure

Intrusion By Burglar

Detector/Alarm System Failure

Detector/Alarm Power Failure

Burglar Present

Barriers Fail

Microwave Electro-Optical Seismic Footfall Acoustic

Basic Power Failure Emergency Power Failure


Example Problem: Sclerotic Scurvy, The Astronaut's Scourge


BACKGROUND: Sclerotic scurvy infects 10% of all returning astronauts. Incubation period is 13 days. For a week thereafter, victims of the disease display symptoms which include malaise, lassitude, and a very crabby outlook. A test can be used during the incubation period to determine whether an astronaut has been infected. Anti-toxin administered during the incubation period is 100% effective in preventing the disease when administered to an infected astronaut. However, for an uninfected astronaut, it produces disorientation, confusion, and intensifies all undesirable personality traits for about seven days. The test for infection produces a false positive result in 2% of all uninfected astronauts and a false negative result in 1% of all infected astronauts. Both treatment of an uninfected astronaut and failure to treat an infected astronaut constitute malpractice.

PROBLEM: Using the test for infection and the anti-toxin, if the test indicates need for it, what is the probability that a returning astronaut will be a victim of malpractice?

Sclerotic Scurvy Malpractice


Malpractice: 0.019
    Treat Needlessly (Side Effects): 0.018
        Healthy Astronaut: 0.9
        False Positive Test: 0.02
    Fail to Treat Infection (Disease): 0.001
        Infected Astronaut: 0.1
        False Negative Test: 0.01

10% of returnees are infected; 90% are not infected.
1% of infected cases test falsely negative, receive no treatment, succumb to disease.
2% of uninfected cases test falsely positive, receive treatment, succumb to side effects.

What is the greatest contributor to this probability?
Should the test be used?
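The arithmetic behind the malpractice tree can be sketched as follows. The two branches are mutually exclusive (an astronaut is either infected or not), so their probabilities add exactly. The variable names are hypothetical, and the two "no test" comparison cases are added here only as an illustration of the closing question.

```python
# Data from the problem statement.
p_infected = 0.10
p_false_negative = 0.01   # given an infected astronaut
p_false_positive = 0.02   # given an uninfected astronaut

# Each branch is an AND of a population fraction and a test error.
p_fail_to_treat = p_infected * p_false_negative
p_treat_needlessly = (1.0 - p_infected) * p_false_positive

# Mutually exclusive OR: the branch probabilities simply add.
p_malpractice = p_fail_to_treat + p_treat_needlessly

# Comparison cases without the test (illustrative only).
p_malpractice_treat_all = 1.0 - p_infected   # every healthy astronaut is treated
p_malpractice_treat_none = p_infected        # every infected astronaut is untreated

print(f"fail to treat:    {p_fail_to_treat:.3f}")
print(f"treat needlessly: {p_treat_needlessly:.3f}")
print(f"malpractice with test:   {p_malpractice:.3f}")
print(f"malpractice, treat all:  {p_malpractice_treat_all:.3f}")
print(f"malpractice, treat none: {p_malpractice_treat_none:.3f}")
```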

Cut Sets
AIDS TO System Diagnosis Reducing Vulnerability Linking to Success Domain


Cut Sets

A CUT SET is any group of fault tree initiators which, if all occur, will cause the TOP event to occur. A MINIMAL CUT SET is a least group of fault tree initiators which, if all occur, will cause the TOP event to occur.

Finding Cut Sets


1. Ignore all tree elements except the initiators (leaves/basics).
2. Starting immediately below the TOP event, assign a unique letter to each gate, and assign a unique number to each initiator.
3. Proceeding stepwise from TOP event downward, construct a matrix using the letters and numbers. The letter representing the TOP event gate becomes the initial matrix entry. As the construction progresses:
Replace the letter for each AND gate by the letter(s)/number(s) for all gates/initiators which are its inputs. Display these horizontally, in matrix rows.
Replace the letter for each OR gate by the letter(s)/number(s) for all gates/initiators which are its inputs. Display these vertically, in matrix columns. Each newly formed OR gate replacement row must also contain all other entries found in the original parent row.

Finding Cut Sets


4. A final matrix results, displaying only numbers representing initiators. Each row of this matrix is a Boolean Indicated Cut Set. By inspection, eliminate any row that contains all elements found in a lesser row. Also eliminate redundant elements within rows and rows that duplicate other rows. The rows that remain are Minimal Cut Sets.

A Cut Set Example


PROCEDURE: Assign letters to gates. (TOP gate is A.) Do not repeat letters. Assign numbers to basic initiators. If a basic initiator appears more than once, represent it by the same number at each appearance. Construct a matrix, starting with the TOP A gate.
TOP A

B 1 C 2

D 4


A Cut Set Example


TOP event gate is A, the initial matrix entry.
A

A is an AND gate; B & D, its inputs, replace it horizontally.
B D

B is an OR gate; 1 & C, its inputs, replace it vertically. Each requires a new row.
1 D
C D

C is an AND gate; 2 & 3, its inputs, replace it horizontally.
1 D
2 D 3

D (top row), is an OR gate; 2 & 4, its inputs, replace it vertically. Each requires a new row.
1 2
2 D 3
1 4

D (second row), is an OR gate. Replace as before.
1 2
2 2 3
1 4
2 4 3

These Boolean-Indicated Cut Sets reduce to these minimal cut sets.
1 2
2 3
1 4

Minimal Cut Set rows are least groups of initiators which will induce TOP.
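The matrix procedure can be sketched as a short top-down substitution routine. This is an illustrative sketch, not the tool used for the slides: gates are represented by letters (strings) and initiators by integers, the dictionary layout and function names are hypothetical, and the tree is the one described in the example (A = AND of B, D; B = OR of 1, C; C = AND of 2, 3; D = OR of 2, 4).

```python
def boolean_indicated_cut_sets(gates, top):
    """Expand gate letters row by row until only initiator numbers remain."""
    rows = [[top]]
    while any(isinstance(item, str) for row in rows for item in row):
        next_rows = []
        for row in rows:
            index = next((i for i, item in enumerate(row) if isinstance(item, str)), None)
            if index is None:          # row already contains initiators only
                next_rows.append(row)
                continue
            kind, inputs = gates[row[index]]
            rest = row[:index] + row[index + 1:]
            if kind == "AND":          # inputs replace the gate horizontally
                next_rows.append(rest + list(inputs))
            elif kind == "OR":         # each input gets its own row
                next_rows.extend(rest + [inp] for inp in inputs)
            else:
                raise ValueError(f"unknown gate type: {kind}")
        rows = next_rows
    return rows


def minimal_cut_sets(rows):
    """Drop duplicate elements, duplicate rows, and rows containing a lesser row."""
    candidates = {frozenset(row) for row in rows}
    minimal = [s for s in candidates if not any(other < s for other in candidates)]
    return sorted((sorted(s) for s in minimal), key=lambda s: (len(s), s))


example_tree = {
    "A": ("AND", ["B", "D"]),
    "B": ("OR", [1, "C"]),
    "C": ("AND", [2, 3]),
    "D": ("OR", [2, 4]),
}

indicated = boolean_indicated_cut_sets(example_tree, "A")
print("Boolean-indicated cut sets:", indicated)
print("Minimal cut sets:", minimal_cut_sets(indicated))
```

The order in which gates are expanded may differ from the slide's hand-worked order, but the resulting set of rows is the same.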

An Equivalent Fault Tree


An Equivalent Fault Tree can be constructed from Minimal Cut Sets. For example, these Minimal Cut Sets
1 2
2 3
1 4
represent this Fault Tree

[Figure: Boolean Equivalent Fault Tree, with TOP fed by one branch per Minimal Cut Set (1 and 2; 2 and 3; 1 and 4).]

and this Fault Tree is a Logic Equivalent of the original, for which the Minimal Cut Sets were derived.

Equivalent Trees Aren't Always Simpler

This Fault Tree (4 gates, 6 initiators) has this logic equivalent (9 gates, 24 initiators).

Minimal cut sets: 1/3/5, 1/3/6, 1/4/5, 1/4/6, 2/3/5, 2/3/6, 2/4/5, 2/4/6

[Figure: the original tree and its logic equivalent, in which TOP is fed by one three-initiator branch per Minimal Cut Set.]

Another Cut Set Example


Compare this case to the first Cut Set example and note the differences. TOP gate here is OR. In the first example, TOP gate was AND. Proceed as with first example.

TOP A

6
D F

3
E

5
G

Another Cut Set Example Construct Matrix make step-by-step substitutions


A B C 1 D F 6 1 2 F D I E 1 2 3 5 G 6 1 E

Boolean-Indicated Cut Sets Minimal Cut Sets


1 2 3 5 G 1 3 1 4 6 1 3 1 1 3 2 5 G 3 4 5 1 6 1 1 1 3 2 3 4 4

Note that there are four Minimal Cut Sets. Co-existence of all of the initiators in any one of them will precipitate the TOP event.

An EQUIVALENT FAULT TREE can again be constructed



Another Equivalent Fault Tree


These Minimal Cut Sets (1/2, 1/3, 1/4, 3/4/5/6) represent this Fault Tree, a Logic Equivalent of the original tree.

[Figure: equivalent fault tree with TOP fed by one branch per Minimal Cut Set.]

From Tree to Reliability Block Diagram


TOP A

Blocks represent functions of system elements. Paths through them represent success.
Barring terms (n) denotes consideration of their success properties. C

3 5 4 6 1

3 1

1
D F

6 3
E

5
G

TOP The tree models a system fault, in failure domain. Let that fault be System Fails to Function as Intended. Its opposite, System Succeeds to function as intended, can be represented by a Reliability Block Diagram in which success flows through system element functions from left to right. Any path through the block diagram, not interrupted by a fault of an element, results in system success.


Cut Sets and Reliability Blocks


TOP A

3 2
B C

3 1

4 4

5 1 6

1
D F

6 3
E

5
G

1 1 1 3

2 3 4 4 5 6


Each Cut Set (horizontal rows in the matrix) interrupts all left-to-right paths through the Reliability Block Diagram

Note that 3/5/1/6 is a Cut Set, but not a Minimal Cut Set. (It contains 1/3, a true Minimal Cut Set.)

Minimal Cut Sets

Cut Set Uses


Evaluating PT Finding Vulnerability to Common Causes Analyzing Common Cause Probability Evaluating Structural Cut Set Importance Evaluating Quantitative Cut Set Importance Evaluating Item Importance

Cut Set Uses/Evaluating PT


TOP A PT

Minimal Cut Sets

1 1
C

2 3 4 4 5 6

1 3
6

1
D F

2
E

5
G

$P_T \cong \sum P_k = P_1 \times P_2 + P_1 \times P_3 + P_1 \times P_4 + P_3 \times P_4 \times P_5 \times P_6$

Note that propagating probabilities through an unpruned tree, i.e., using Boolean-Indicated Cut Sets rather than minimal Cut Sets, would produce a falsely high PT.

Cut Set Probability (Pk), the product of probabilities for events within the Cut Set, is the probability that the Cut Set being considered will induce TOP.

$P_k = \prod P_e = P_1 \times P_2 \times P_3 \times \cdots \times P_n$
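A minimal sketch of this calculation follows, using the four Minimal Cut Sets of the example (1/2, 1/3, 1/4, 3/4/5/6). The initiator probabilities are hypothetical (every initiator is assumed to have P = 0.1, purely for illustration), and the function names are not from any particular tool.

```python
from math import prod

# Minimal Cut Sets from the example tree.
MINIMAL_CUT_SETS = [(1, 2), (1, 3), (1, 4), (3, 4, 5, 6)]

# Hypothetical initiator probabilities, assumed for illustration only.
initiator_prob = {e: 0.1 for e in range(1, 7)}


def cut_set_probability(cut_set, probs):
    """Pk: product of the probabilities of the events in the cut set."""
    return prod(probs[e] for e in cut_set)


def top_probability(cut_sets, probs):
    """PT by the rare event approximation: sum of Pk over minimal cut sets."""
    return sum(cut_set_probability(cs, probs) for cs in cut_sets)


for cs in MINIMAL_CUT_SETS:
    print(f"Pk for {'/'.join(map(str, cs))}: {cut_set_probability(cs, initiator_prob):.4f}")
print(f"PT ~= {top_probability(MINIMAL_CUT_SETS, initiator_prob):.4f}")
```

With equal initiator probabilities the three two-element cut sets dominate PT, which illustrates why short cut sets usually deserve attention first.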

Cut Set Uses/Common Cause Vulnerability


Uniquely subscript initiators, using letter indicators of common cause susceptibility, e.g.
l = location (code where)
m = moisture
h = human operator
q = heat
f = cold
v = vibration
etc.

[Figure: the example fault tree with subscripted initiators 1v, 2h, 3m, 4m, 5m, 6m]

Minimal Cut Sets: 1v/2h, 1v/3m, 1v/4m, 3m/4m/5m/6m

All Initiators in this Cut Set (3m/4m/5m/6m) are vulnerable to moisture. Moisture is a Common Cause and can induce TOP. ADVICE: Moisture proof one or more items.

Some Initiators may be vulnerable to several Common Causes and receive several corresponding subscript designators. Some may have no Common Cause vulnerability and receive no subscripts.
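The common cause search on this slide can be sketched in a few lines of Python: a cut set is flagged when every one of its initiators carries the same susceptibility letter. The tags below are the subscripts shown on the slide; the function name `common_causes` is an assumption made for this example.

```python
# Susceptibility tags per initiator, as subscripted on the slide
# (v = vibration, h = human operator, m = moisture).
TAGS = {1: {"v"}, 2: {"h"}, 3: {"m"}, 4: {"m"}, 5: {"m"}, 6: {"m"}}
MINIMAL_CUT_SETS = [(1, 2), (1, 3), (1, 4), (3, 4, 5, 6)]


def common_causes(cut_set, tags):
    """Return the causes shared by every initiator in the cut set."""
    return set.intersection(*(tags[e] for e in cut_set))


# Keep only cut sets that a single common cause could defeat entirely.
vulnerable = {}
for cs in MINIMAL_CUT_SETS:
    shared = common_causes(cs, TAGS)
    if shared:
        vulnerable[cs] = shared
```

Only the cut set 3/4/5/6 is flagged, with moisture as the shared cause, which agrees with the slide.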

Analyzing Common Cause Probability


[Figure: TOP (PT) is an OR of System Fault (analyze as usual) and Common-Cause Induced Fault; the Common-Cause Induced Fault is an OR of Moisture, Vibration, Human Operator, Heat, and others. These gates must be OR.]

Introduce each Common Cause identified as a Cut Set Killer at its individual probability level of both (1) occurring, and (2) inducing all terms within the affected cut set.

Cut Set Structural Importance


[Figure: the example fault tree (TOP; initiators 1 through 6)]

Minimal Cut Sets: 1/2, 1/3, 1/4, 3/4/5/6

All other things being equal:
A LONG Cut Set signals low vulnerability.
A SHORT Cut Set signals higher vulnerability.
Presence of NUMEROUS Cut Sets signals high vulnerability, and a singlet cut set signals a Potential Single-Point Failure.

Analyzing Structural Importance enables qualitative ranking of contributions to System Failure.

Cut Set Quantitative Importance


[Figure: the example fault tree (TOP, PT; initiators 1 through 6)]

Minimal Cut Sets: 1/2, 1/3, 1/4, 3/4/5/6

The quantitative importance of a Cut Set (Ik) is the numerical probability that, given that TOP has occurred, that Cut Set has induced it.

$I_k = \dfrac{P_k}{P_T}$, where, for the Cut Set 3/4/5/6, $P_k = \prod P_e = P_3 \times P_4 \times P_5 \times P_6$

Analyzing Quantitative Importance enables numerical ranking of contributions to System Failure. To reduce system vulnerability most effectively, attack Cut Sets having greater Importance. Generally, short Cut Sets have greater Importance, long Cut Sets have lesser Importance.

Item Importance
The quantitative Importance of an item (Ie) is the numerical probability that, given that TOP has occurred, that item has contributed to it.

$I_e \approx \sum^{N_e} I_{k_e}$

Ne = Number of Minimal Cut Sets containing Item e
Ike = Importance of the Minimal Cut Sets containing Item e

Minimal Cut Sets: 1/2, 1/3, 1/4, 3/4/5/6

Example: Importance of item 1

$I_1 \approx \dfrac{(P_1 \times P_2) + (P_1 \times P_3) + (P_1 \times P_4)}{P_T}$
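A minimal Python sketch of cut set importance (Ik = Pk / PT) and item importance (Ie as the sum of Ik over the cut sets containing the item) follows. The probabilities in `P` are hypothetical and serve only to make the ranking visible.

```python
from math import prod

MINIMAL_CUT_SETS = [(1, 2), (1, 3), (1, 4), (3, 4, 5, 6)]

# Hypothetical initiator probabilities (assumed for illustration only).
P = {1: 1e-3, 2: 2e-3, 3: 5e-3, 4: 1e-2, 5: 2e-2, 6: 5e-2}


def cut_set_importance(cut_sets, p):
    """I_k = P_k / P_T for every minimal cut set."""
    pk = {cs: prod(p[e] for e in cs) for cs in cut_sets}
    pt = sum(pk.values())
    return {cs: value / pt for cs, value in pk.items()}


def item_importance(item, cut_sets, p):
    """I_e: sum of I_k over the minimal cut sets that contain the item."""
    ik = cut_set_importance(cut_sets, p)
    return sum(value for cs, value in ik.items() if item in cs)


ik = cut_set_importance(MINIMAL_CUT_SETS, P)
i1 = item_importance(1, MINIMAL_CUT_SETS, P)
i5 = item_importance(5, MINIMAL_CUT_SETS, P)
```

Under these assumed numbers item 1 appears in every dominant cut set, so its importance is close to 1, while item 5 appears only in the long cut set and ranks far lower.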

Path Sets
Aids to Further Diagnostic Measures
Linking to Success Domain
Trade/Cost Studies

Path Sets

A PATH SET is a group of fault tree initiators which, if none of them occurs, will guarantee that the TOP event cannot occur.

TO FIND PATH SETS*, change all AND gates to OR gates and all OR gates to AND gates. Then proceed using matrix construction as for Cut Sets. Path Sets will be the result.

*This Cut Set-to-Path Set conversion takes advantage of De Morgan's duality theorem. Path Sets are complements of Cut Sets.
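The gate-swapping procedure can be sketched directly on the minimal cut sets: the tree is an OR of ANDs, so its dual is an AND of ORs, and expanding that dual means picking one initiator from every cut set and then discarding non-minimal groups. The Python below is an illustrative sketch of that idea, not the matrix method itself; the function name is an assumption.

```python
from itertools import product


def minimal_path_sets(cut_sets):
    """Pick one initiator from every cut set, then keep only minimal groups."""
    candidates = {frozenset(choice) for choice in product(*cut_sets)}
    minimal = [
        c for c in candidates
        if not any(other < c for other in candidates)  # no proper subset exists
    ]
    return sorted(sorted(c) for c in minimal)


MINIMAL_CUT_SETS = [(1, 2), (1, 3), (1, 4), (3, 4, 5, 6)]
path_sets = minimal_path_sets(MINIMAL_CUT_SETS)
```

For the example tree this yields the five path sets 1/3, 1/4, 1/5, 1/6 and 2/3/4. Enumerating every combination grows quickly with tree size, so this approach suits small trees only.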

A Path Set Example


[Figure: the example fault tree (TOP; initiators 1 through 6)]

This Fault Tree has these Minimal Cut Sets: 1/2, 1/3, 1/4, 3/4/5/6

and these Path Sets: $\bar{1}/\bar{3}$, $\bar{1}/\bar{4}$, $\bar{1}/\bar{5}$, $\bar{1}/\bar{6}$, $\bar{2}/\bar{3}/\bar{4}$

Path Sets are least groups of initiators which, if they cannot occur, guarantee against TOP occurring.

Barring terms ($\bar{n}$) denotes consideration of their success properties.

Path Sets and Reliability Blocks


[Figure: the example fault tree (TOP; gates A through G; initiators 1 through 6) beside its Reliability Block Diagram]

Path Sets: 1/3, 1/4, 1/5, 1/6, 2/3/4

Each Path Set (a horizontal row in the matrix) represents a left-to-right path through the Reliability Block Diagram.

Path Sets and Trade Studies

[Figure: Reliability Block Diagram of the example system]

Path Set    Initiators    Pp     $
a           1/3           PPa    $a
b           1/4           PPb    $b
c           1/5           PPc    $c
d           1/6           PPd    $d
e           2/3/4         PPe    $e

$P_p \approx \sum P_e$

Path Set Probability (Pp) is the probability that the system will suffer a fault at one or more points along the operational route modeled by the path. To minimize failure probability, minimize path set probability.

Sprinkle countermeasure resources amongst the Path Sets. Compute the probability decrement for each newly adjusted Path Set option. Pick the countermeasure ensemble(s) giving the most favorable $\Delta P_p / \Delta \$$. (Selection results can be verified by computing $\Delta P_T / \Delta \$$ for competing candidates.)

Reducing Vulnerability A Summary


Inspect tree: find/operate on major PT contributors
Add interveners/redundancy (lengthen cut sets).
Derate components (increase robustness/reduce Pe).
Fortify maintenance/parts replacement (increase MTBF).

Examine/alter system architecture: increase path set/cut set ratio.

Evaluate Cut Set Importance. Rank items using $I_k = P_k / P_T$. Identify items amenable to improvement.

Evaluate item importance. Rank items using $I_e \approx \sum^{N_e} I_{k_e}$. Identify items amenable to improvement.

Evaluate path set probability, $P_p \approx \sum P_e$. Reduce Pp at most favorable $\Delta P_p / \Delta \$$.

For all new countermeasures, THINK:
COST
EFFECTIVENESS
FEASIBILITY (incl. schedule)
AND
Does the new countermeasure introduce new HAZARDS? Cripple the system?

Some Diagnostic and Analytical Gimmicks


A Conceptual Probabilistic Model
Sensitivity Testing
Finding a PT Upper Limit
Limit of Resolution: Shutting off Tree Growth
State-of-Component Method
When to Use Another Technique: FMECA

Some Diagnostic Gimmicks


Using a generic all-purpose fault tree

[Figure: generic fault tree with TOP (PT) above numbered events, the lowest row running from 30 to 34]

Think Roulette Wheels


A convenient, thought-tool model of probabilistic tree modeling

[Figure: the generic fault tree with a roulette wheel at each initiator]

Imagine a roulette wheel representing each initiator. The peg count ratio for each wheel is determined by probability for that initiator. Spin all initiator wheels once for each system exposure interval. Wheels winning in gate-opening combinations provide a path to the TOP.

Example: $P_{22} = 3 \times 10^{-3}$ corresponds to 1,000 peg spaces, 997 white and 3 red.
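The roulette-wheel picture corresponds to a simple Monte Carlo simulation. The sketch below spins one "wheel" per initiator for each exposure interval and counts how often the winners open a path to TOP. It assumes the six-initiator example tree used earlier in the lecture rather than the generic tree on this slide, and the probabilities are hypothetical and deliberately large so that a modest number of spins gives a stable estimate.

```python
import random

MINIMAL_CUT_SETS = [(1, 2), (1, 3), (1, 4), (3, 4, 5, 6)]

# Hypothetical, deliberately large probabilities (assumed for illustration).
P = {1: 0.1, 2: 0.2, 3: 0.1, 4: 0.2, 5: 0.5, 6: 0.5}


def spin_wheels(p, rng):
    """Spin one wheel per initiator; return the initiators that occur."""
    return {e for e, pe in p.items() if rng.random() < pe}


def estimate_top(cut_sets, p, spins, seed=0):
    """Fraction of exposure intervals in which some cut set fully occurs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(spins):
        occurred = spin_wheels(p, rng)
        if any(occurred.issuperset(cs) for cs in cut_sets):
            hits += 1
    return hits / spins


estimate = estimate_top(MINIMAL_CUT_SETS, P, spins=20_000)
```

For these assumed inputs the exact TOP probability works out to 0.0469, and the simulated estimate should land close to it. With realistic probabilities such as 3 x 10^-3, far more spins would be needed.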

Use Sensitivity Tests


Gaging the "nastiness" of untrustworthy initiators

[Figure: the generic fault tree with TOP (PT) and one initiator of unknown probability, P10 = ?]

Embedded within the tree, there's a bothersome initiator with an uncertain Pe. Perform a crude sensitivity test to obtain quick relief from worry, or to justify the urgency of need for more exact input data:

1. Compute PT for a nominal value of Pe. Then recompute PT for a new $P_e' = P_e + \Delta P_e$. Now compute the Sensitivity of Pe $= \Delta P_T / \Delta P_e$. If this sensitivity exceeds about 0.1 in a large tree, work to find a value for Pe having less uncertainty, or
2. Compute PT for a value of Pe at its upper credible limit. Is the corresponding PT acceptable? If not, get a better Pe.
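The first sensitivity test can be sketched as a finite difference. The Python below assumes the six-initiator example tree from earlier slides (not the generic tree shown here) and hypothetical probabilities; `sensitivity` returns the change in PT divided by the change in Pe.

```python
from math import prod

MINIMAL_CUT_SETS = [(1, 2), (1, 3), (1, 4), (3, 4, 5, 6)]

# Hypothetical initiator probabilities (assumed for illustration only).
P = {1: 1e-3, 2: 2e-3, 3: 5e-3, 4: 1e-2, 5: 2e-2, 6: 5e-2}


def top_probability(p):
    """P_T approximated as the sum of the minimal cut set probabilities."""
    return sum(prod(p[e] for e in cs) for cs in MINIMAL_CUT_SETS)


def sensitivity(item, p, delta):
    """Delta P_T / Delta P_e for one initiator, by recomputing P_T."""
    bumped = dict(p)
    bumped[item] = p[item] + delta
    return (top_probability(bumped) - top_probability(p)) / delta


s1 = sensitivity(1, P, delta=1e-4)
s5 = sensitivity(5, P, delta=1e-4)
```

Here the sensitivity of item 1 equals P2 + P3 + P4 = 0.017, well below the 0.1 threshold on the slide, so under these assumptions a better estimate of P1 would not be urgent.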

Find a Max PT Limit Quickly


The parts-count approach gives a sometimes-useful early estimate of PT
[Figure: the generic fault tree with TOP (PT) above numbered initiators]

PT cannot exceed an upper bound given by:

$P_{T(max)} = \sum P_e = P_1 + P_2 + P_3 + \dots + P_n$

How Far Down Should a Fault Tree Grow?


[Figure: the generic fault tree; the TOP statement is labeled Severity and PT is labeled Probability]

Where do you stop the analysis? The analysis is a Risk Management enterprise. The TOP statement gives severity. The tree analysis provides probability.

ANALYZE NO FURTHER DOWN THAN IS NECESSARY TO ENTER PROBABILITY DATA WITH CONFIDENCE.

Is risk acceptable? If YES, stop. If NO, use the tree to guide risk reduction.

SOME EXCEPTIONS
1.) An event within the tree has alarmingly high probability. Dig deeper beneath it to find the source(s) of the high probability.
2.) Mishap autopsies must sometimes analyze down to the cotter-pin level to produce a credible cause list.

Initiators / leaves / basics define the LIMIT OF RESOLUTION of the analysis.

State-of-Component Method
WHEN: Analysis has proceeded to the device level, i.e., valves, pumps, switches, relays, etc.

HOW: Show device fault/failure in the mode needed for upward propagation (example: Relay K-28 Contacts Fail Closed). Install an OR gate. Place these three events beneath the OR.

Basic Failure/Relay K-28: This represents internal self failures under normal environmental and service stresses, e.g., coil burnout, spring failure, contacts drop off.

Relay K-28 Secondary Fault: This represents faults from environmental and service stresses for which the device is not qualified, e.g., component struck by foreign object, wrong component selection/installation. (Omit, if negligible.)

Relay K-28 Command Fault: Analyze further to find the source of the fault condition, induced by presence/absence of external command signals. (Omit for most passive devices, e.g., piping.)

The Fault Tree Analysis Report


Title
Company, Author, Date, etc.

Executive Summary (Abstract of complete report)

Scope of the analysis (Say what is analyzed and what is not analyzed.)
Brief system description
TOP Description/Severity Bounding
Analysis Boundaries: Physical Boundaries, Operational Boundaries, Operational Phases, Human Operator In/out, Interfaces Treated, Resolution Limit, Exposure Interval, Others

The Analysis
Discussion of Method (Cite Refs.)
Software Used
Presentation/Discussion of the Tree (Show Tree as Figure. Include Data Sources, Cut Sets, Path Sets, etc. as Tables.)
Source(s) of Probability Data (If quantified)
Common Cause Search (If done)
Sensitivity Test(s) (If conducted)
Cut Sets (Structural and/or Quantitative Importance, if analyzed)
Path Sets (If analyzed)
Trade Studies (If done)

Findings
TOP Probability (Give Confidence Limits)
Comments on System Vulnerability
Chief Contributors
Candidate Reduction Approaches (If appropriate)
Risk Comparisons ("Bootstrapping" data, if appropriate)
Is further analysis needed? By what method(s)?

Conclusions and Recommendations

FTA vs. FMECA Selection Criteria*


Selection Characteristic (Preferred technique: FTA or FMECA)
Safety of public/operating/maintenance personnel
Small number/clearly defined TOP events
Indistinctly defined TOP events
Full-Mission completion critically important
Many, potentially successful missions possible
"All possible" failure modes are of concern
High potential for "human error" contributions
High potential for "software error" contributions
Numerical "risk evaluation" needed
Very complex system architecture/many functional parts
Linear system architecture with little human/software influence
System irreparable after mission starts

*Adapted from Fault Tree Analysis Application Guide, Reliability Analysis Center, Rome Air Development Center.

Fault Tree Constraints and Shortcomings


Undesirable events must be foreseen and are only analyzed singly.
All significant contributors to fault/failure must be anticipated.
Each fault/failure initiator must be constrained to two conditional modes when modeled in the tree.
Initiators at a given analysis level beneath a common gate must be independent of each other.
Events/conditions at any analysis level must be true, immediate contributors to next-level events/conditions.
Each Initiator's failure rate must be a predictable constant.

Common Fault Tree Abuses


Over-analysis: "Fault Kudzu"
Unjustified confidence in numerical results: $6.0232 \times 10^{-5}$ ... +/- ?
Credence in preposterously low probabilities: $1.666 \times 10^{-24}$/hour
Unpreparedness to deal with results (particularly quantitative): Is $4.3 \times 10^{-7}$/hour acceptable for a catastrophe?
Overlooking common causes: Will a roof leak or a shaking floor wipe you out?
Misapplication: Would Event Tree Analysis (or another technique) serve better?
Scoping changes in mid-tree

Fault Tree Payoffs


Gaging/quantifying system failure probability.
Assessing system Common Cause vulnerability.
Optimizing resource deployment to control vulnerability.
Guiding system reconfiguration to reduce vulnerability.
Identifying Man Paths to disaster.
Identifying potential single point failures.
Supporting trade studies with differential analyses.

FAULT TREE ANALYSIS is a risk assessment enterprise. Risk Severity is defined by the TOP event. Risk Probability is the result of the tree analysis.

Closing Caveats
Be wary of the ILLUSION of SAFETY. Low probability does not mean that a mishap won't happen!
THERE IS NO ABSOLUTE SAFETY! An enterprise is safe only to the degree that its risks are tolerable!
Apply broad confidence limits to probabilities representing human performance!
A large number of systems having low probabilities of failure means that A MISHAP WILL HAPPEN somewhere among them!

$P_1 + P_2 + P_3 + P_4 + \dots + P_n \approx 1$

More Caveats

Do you REALLY have enough data to justify QUANTITATIVE ANALYSIS?

For 95% confidence:

We must have no failures in    to give PF ≈            and reliability R ≈
1,000 tests                    $3 \times 10^{-3}$      0.997
300 tests                      $10^{-2}$               0.99
100 tests                      $3 \times 10^{-2}$      0.97
30 tests                       $10^{-1}$               0.9
10 tests                       $3 \times 10^{-1}$      0.7

Assumptions:
Stochastic System Behavior
Constant System Properties
Constant Service Stresses
Constant Environmental Stresses

Don't drive the numbers into the ground!
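The table values are consistent with the standard zero-failure bound: if N independent tests all succeed, the largest failure probability PF still compatible with that outcome at confidence C satisfies (1 - PF)^N = 1 - C. The sketch below assumes that this is how the slide's figures were obtained; the slide does not state its derivation, and its entries are rounded.

```python
def pf_upper_bound(n_tests, confidence=0.95):
    """Largest failure probability consistent with zero failures in n_tests."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)


# Bounds for the test counts listed on the slide.
bounds = {n: pf_upper_bound(n) for n in (1000, 300, 100, 30, 10)}
```

For 1,000 failure-free tests this gives a bound just under 3 x 10^-3, which is the familiar "rule of three" (PF is roughly 3/N at 95% confidence). The rule becomes looser for small N, which is where the slide's rounded entries deviate most from the computed values.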

Analyze Only to Turn Results Into Decisions


Perform an analysis only to reach a decision. Do not perform an analysis if that decision can be reached without it. It is not effective to do so. It is a waste of resources.
Dr. V.L. Grose George Washington University


Event Tree Analysis

Event Tree Analysis Is


A "bottom-up", inductive, system safety analytical technique
Applicable to:
Physical systems, with or without human operators
Decision-making/management systems
Complementary to other techniques, e.g.
Fault Tree Analysis
Failure Modes and Effects Analysis

Event Tree Analysis


Explores system Responses to initiating "Challenges" and enables Probability Assessment of Success/Failure.

Example Challenges:
Utility system failure
Pipe or vessel burst
Heightened business competition
Ignition of stored combustibles
Outbreak of epidemic
Technology need
"Normal" system operating command

Event Tree Analysis (General Case)


Portray all credible system operating permutations. Trace each path to eventual success or failure.

[Figure: general event tree. From Initiation, paths branch through Operation/Outcome nodes under the headings Decision/Action A, Decision/Action B, Decision/Action C, ..., D/A N, and each path ends in Success or Failure.]

Event Tree Analysis (Bernoulli Model)


Reduce tree to simplified representation of system behavior. Use binary branching. Lead unrecoverable failures and undefeatable successes directly to final outcomes.

A fault tree or other analysis may be necessary to determine probability of the initiating event or condition. (Unity probability may be assumed.)

[Figure: binary event tree. From Initiation, each node branches into Success and Failure, with terminal failures and successes led directly to final outcomes.]

An Example Problem

[Figure: dewatering system with float switch S, pump P, and klaxon K]

Background/Problem: A subgrade compartment containing important control equipment is protected against flooding by the system shown. Rising flood waters close float switch S, powering pump P from an uninterruptible power supply. A klaxon K is also sounded, alerting operators to perform manual bailing, B, should the pump fail. Either pumping or bailing will dewater the compartment effectively. Assume flooding has commenced, and analyze responses available to the dewatering system.

Develop an event tree representing system responses.
Develop a reliability block diagram for the system.
Develop a fault tree for the TOP event Failure to Dewater.

Simplifying Assumptions:
Power is available full time.
Treat only the four system components S, P, K, and B.
Consider operator error as included within the bailing function, B.

Example Problem
Event Tree

Water Rises (1.0)
  Float Switch Succeeds ($1 - P_S$)
    Pump Succeeds ($1 - P_P$): Success, $[1 - P_S - P_P + P_P P_S]$
    Pump Fails ($P_P$): $[P_P - P_P P_S]$
      Klaxon Succeeds ($1 - P_K$): $[P_P - P_P P_S - P_K P_P + P_K P_P P_S]$
        Bailing Succeeds ($1 - P_B$): Success, $[P_P - P_P P_S - P_K P_P + P_K P_P P_S - P_B P_P + P_B P_P P_S + P_B P_K P_P - P_B P_K P_P P_S]$
        Bailing Fails ($P_B$): Failure, $[P_B P_P - P_B P_P P_S - P_B P_K P_P + P_B P_K P_P P_S]$
      Klaxon Fails ($P_K$): Failure, $[P_K P_P - P_K P_P P_S]$
  Float Switch Fails ($P_S$): Failure, $[P_S]$

$P_{Success} = 1 - P_S - P_K P_P + P_K P_P P_S - P_B P_P + P_B P_P P_S + P_B P_K P_P - P_B P_K P_P P_S$

$P_{Failure} = P_S + P_K P_P - P_K P_P P_S + P_B P_P - P_B P_P P_S - P_B P_K P_P + P_B P_K P_P P_S$

$P_{Success} + P_{Failure} = 1$
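The event tree can be checked numerically by walking its branches and multiplying conditional probabilities along each path. The Python below is a small sketch of that walk; the component failure probabilities passed in at the end are hypothetical values, not data from the lecture.

```python
def dewatering_event_tree(ps, pp, pk, pb):
    """Return (P_success, P_failure) for the switch/pump/klaxon/bailing tree."""
    failure = ps                               # float switch fails: nothing runs
    switch_ok = 1.0 - ps
    success = switch_ok * (1.0 - pp)           # pump succeeds
    pump_fails = switch_ok * pp
    failure += pump_fails * pk                 # klaxon fails: nobody bails
    klaxon_ok = pump_fails * (1.0 - pk)
    success += klaxon_ok * (1.0 - pb)          # bailing succeeds
    failure += klaxon_ok * pb                  # bailing fails
    return success, failure


# Hypothetical component failure probabilities (assumed for illustration).
PS, PP, PK, PB = 0.01, 0.05, 0.02, 0.10
p_success, p_failure = dewatering_event_tree(PS, PP, PK, PB)

# Closed-form failure probability, term by term as in the event tree slide.
closed_form = (PS + PK * PP - PK * PP * PS + PB * PP - PB * PP * PS
               - PB * PK * PP + PB * PK * PP * PS)
```

Because every path ends in either Success or Failure, the two totals sum to one, and the path-by-path failure total agrees with the expanded polynomial.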

Reliability Block Diagram


[Figure: Reliability Block Diagram. Float Switch S in series with two parallel branches: Pump P, and Klaxon K followed by Bailing B.]

Path Sets: S/P, S/K/B
Cut Sets: S, P/K, P/B

Fault Tree
[Figure: fault tree for the TOP event Failure To Dewater, with the events Command Failure, Response Failure, Float Switch Fails Open, Water Removal Fails, Manual Removal Fails, Klaxon Fails (K), and Bailing Fails]

Exact solution: $P_{TOP} = P_S + P_P P_K - P_P P_K P_S + P_B P_P - P_B P_P P_S - P_B P_K P_P + P_B P_K P_P P_S$

Rare event approximation: $P_{TOP} \approx P_S + P_P P_K + P_P P_B$

Path Sets: S/P, S/K/B
Cut Sets: S, P/K, P/B
Event Tree Fault Tree Transformation


[Figure: an event tree with numbered events and Success/Failure branches for A1, B1, B2, C, B3, A2, and D, shown with the corresponding fault trees for Failure A1, Failure A2, and Failure A1-2. Events marked with an asterisk (1*, 3*, 5*, 7*) appear in the fault trees.]

*Note that not all events represented here are failures.

Assess Risk and Judge Tolerability


Failure statements express Severity.
Event tree analysis explores Outcomes/Assesses Probability.
Probability and Severity establish Risk.

IS THE RISK ACCEPTABLE?
If not, develop intervenors.
Select intervenor(s) on the basis of:
Effectiveness
Cost
Feasibility (including schedule)

Event Tree Shortcomings and Advantages


Shortcomings:
Operating pathways must be anticipated.
Partial successes/failures are not distinguishable.

Advantages:
End events need not be foreseen.
Multiple failures can be analyzed.
Potential single-point failures can be identified.
System weaknesses can be identified.
Zero-payoff system elements/options can be discarded.

Bibliography
Selected references for further study:
Center for Chemical Process Safety; Guidelines for Hazard Evaluation Procedures; 2nd Edition with Worked Examples; 1992 (461 pp); American Institute of Chemical Engineers
Lees, Frank P.; Loss Prevention in the Process Industries; 2nd Edition; 1996 (1,316 pp; three volumes)
Henley, Ernest J. and Hiromitsu Kumamoto; Reliability Engineering and Risk Assessment; 1981 (568 pp)

4. EVENT TREE ANALYSIS


4.1. Introduction

Event trees are inductive logic methods for identifying the various accident sequences which can be generated from a single initiating event. The approach is based on the discretization of the real accident evolution into a few macroscopic events; the resulting accident sequences are then quantified in terms of their probability of occurrence.

The events delineating the accident sequences are usually characterized in terms of:
i) the intervention (or not) of protection systems which are supposed to take action for the mitigation of the accident (system event tree);
ii) the fulfillment (or not) of safety functions (functional event tree);
iii) the occurrence (or not) of physical phenomena (phenomenological event tree).

Typically the functional event trees are an intermediate step in the construction of system event trees: following the accident-initiating event, the safety functions which need to be fulfilled are identified; these are later replaced by the corresponding safety and protection systems. The system event trees are used to identify the accident sequences developing within the plant and involving the protection and safety systems. The phenomenological event trees describe the phenomenological evolution of the accident outside the plant (fire, contaminant dispersion, etc.).

4.2. Event tree construction

An event tree begins with a defined accident-initiating event, which could be a component failure or an external failure. It follows that there is one event tree for each accident-initiating event considered. This obviously limits the number of initiating events which can be analyzed in detail. For this reason, the analyst groups similar initiating events into classes, and only one representative initiating event for each class is investigated in detail. Initiating events grouped in the same class usually require the intervention of the same safety functions and lead to similar consequences.

Once an initiating event is defined, all the safety functions that are required to mitigate the accident must be defined and organized according to their time of intervention. For example (Figure 4.1), if the initiating event is the rupture of a tube with release of flammable liquid and the ignition of a jet fire, the first function required would be the interception of the released flow, followed by the cooling of adjacent tanks and finally the quenching of the jet. These functions are structured in the form of headings in the functional event tree. For each function, the set of possible success and failure states must be defined and enumerated. Each state gives rise to a branching of the tree (Figure 4.1). For example, in a binary success/failure logic it is customary to associate the top branch with the success of the function and the bottom branch with its failure.

Besides the time order, the logical order of the required functions must also be accounted for. In other words, if the successful fulfillment of a given function depends on the fulfillment of another one, the tree needs to be reordered so that the dependent functions follow those upon which they depend. This allows some sequences to be pruned: consider a function S1 whose fulfillment depends on the success of a function S2; the branch following the failure of S2 need not be further decomposed into two branches for S1 successful or not, because failure of S2 implies that S1 is not fulfilled (Figure 4.2).

The functions in the tree are then replaced by the safety systems which must perform them, again respecting the logical dependencies, which may lead to additional pruning. System dependencies can be functional, if the failure of one system to intervene renders the intervention of the next one useless, or structural, if the systems share some common part or flow so that a malfunction of that part makes them both fail.

Once the system failure and success states have been properly defined, the states are combined through the tree branching logic to obtain the various accident sequences associated with the given initiating event. Figure 4.3 shows a graphical example of a system event tree: the initiating event is depicted by the initial horizontal line and the system states are then connected in a stepwise, branching fashion; system success and failure states are denoted by S and F, respectively. The accident sequences that result from the tree structure are shown in the last column. Each branch yields one particular accident sequence; for example, IS1F2 denotes the accident sequence in which the initiating event (I) occurs, system 1 is called upon and succeeds (S1), and system 2 is called upon but fails to perform its defined function (F2). For larger event trees, this stepwise branching would simply be continued.

Note that the system states on a given branch of the event tree are conditional on the previous system states having occurred. With reference to the previous example, the success and failure of system 1 must be defined under the condition that the initiating event has occurred; likewise, in the upper branch of the tree corresponding to system 1 success, the success and failure of system 2 must be defined under the conditions that the initiating event has occurred and system 1 has succeeded.

4.3. Event tree evaluation

Once the final event tree has been constructed, the final task is to compute the probabilities of system failure. Each event (branch) in the tree can be interpreted as the top event of a fault tree, which allows the probability of occurrence of that event to be evaluated; the value thus computed represents the conditional probability of occurrence of the event, given that the events which precede it on that sequence have occurred.
In the case of independent events, multiplication of the conditional probabilities for each branch in a sequence gives the probability of that sequence (Figure 4.4). In the case of structural dependencies, two approaches to accident sequence modelling are available.

The first approach is called event tree with boundary conditions and consists in decomposing the system so as to identify the supporting parts or functions upon which several components and systems simultaneously depend. The supporting parts thereby identified appear explicitly as system event tree headings, preceding the dependent protection systems and components. Since dependent parts are extracted and explicitly treated as boundary conditions in the event tree, this approach leads to relatively large event trees and small fault trees. For example, consider an initiating event which requires two systems, S1 and S2, to intervene, and suppose that S1 needs the pumps of S2 to operate. Then one could extract the common part and consider three systems: S1; S2*, which is the S2 system without the pumps common to S1; and S3, which consists of the pumps used by both S1 and S2 (Figure 4.5). The dependencies are then explicitly represented in the tree, and the branching associated with S1 and S2* is eliminated when S3 is not functioning. Thus, all the conditional probabilities are independent and the probability of the accident sequences can be computed by simple multiplication. This way of proceeding simplifies the computations considerably, but it requires a great deal of expertise from the analyst. In fact, since system interactions and dependencies are treated primarily within the inductive logic of the event tree, those dependencies not recognized by the analyst may not be incorporated into the analysis.

The second approach is called fault tree linking.
In this method, the dependencies on support systems or common parts are modeled in the fault trees; thus, at the level of the event trees, the systems are inserted without regard to their structural dependencies. For each sequence of the event tree, the fault trees of the component events are then linked into one large fault tree which follows the logic depicted in the event tree, and the large fault tree is solved with the usual techniques to compute the probability of occurrence of that sequence. Figure 4.6 shows the previous example of Figure 4.5. Only systems S1 and S2 are shown explicitly on the event tree, without particular regard to their dependence. If we now want to evaluate the probability of the sequence $I\bar{S}_1\bar{S}_2$, we build a fault tree whose top event occurs when the initiating event I and the failure of both systems S1 and S2 occur. In place of the events $\bar{S}_1$ and $\bar{S}_2$ we can substitute their corresponding system fault trees, thus obtaining a large fault tree which can be logically simplified (accounting for the existing dependencies) and evaluated so as to give the probability of the top event, i.e., the probability of the sequence of interest. With this method, the dependencies are properly treated even if the analyst was, a priori, unaware that the dependency existed. On the other hand, the resulting fault tree for an accident sequence may be rather large.

In summary, in the event trees with boundary conditions all the significant dependencies among systems are explicitly represented in the event tree; the fault trees for the individual events are then simple and independent, but the analyst must take great care in identifying all the existing dependencies. In the fault tree linking approach, dependencies are included in the fault trees for the various systems, which are therefore not independent; the linked fault tree for an accident sequence is rather large and complex, but all dependencies are treated automatically.

Finally, Figures 4.7 and 4.8 report a simplified version of the functional and system event trees for the case of a large break of a pipe in the primary cooling circuit of a nuclear reactor: it can easily be seen that for realistic systems the trees can become quite complicated.
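The difference between multiplying branch probabilities naively and solving a linked fault tree can be illustrated with a small sketch. It assumes a hypothetical version of the S1/S2 example in which each system fails if a shared pump fails or if its own valve fails; the component names and probabilities are invented for illustration and are not taken from Figure 4.6. The exact probability is obtained by brute-force enumeration of basic event states, which is workable only for very small trees.

```python
from itertools import product


def probability(top, p):
    """Exact P(top) by enumerating every state of independent basic events."""
    names = sorted(p)
    total = 0.0
    for states in product((False, True), repeat=len(names)):
        failed = dict(zip(names, states))
        if top(failed):
            weight = 1.0
            for name in names:
                weight *= p[name] if failed[name] else 1.0 - p[name]
            total += weight
    return total


# Hypothetical basic event probabilities (assumed for illustration only).
P = {"pump": 1e-2, "valve1": 1e-3, "valve2": 1e-3}


def s1_fails(f):
    return f["pump"] or f["valve1"]      # S1 needs the shared pump


def s2_fails(f):
    return f["pump"] or f["valve2"]      # S2 uses the same pump


# Linked fault tree: the top event is "S1 fails AND S2 fails".
linked = probability(lambda f: s1_fails(f) and s2_fails(f), P)

# Naive multiplication treats the two system failures as independent.
naive = probability(s1_fails, P) * probability(s2_fails, P)
```

With these assumed numbers the linked result is dominated by the shared pump (about 1e-2), whereas the naive product is roughly two orders of magnitude lower, which is the kind of underestimate the linking approach is meant to avoid.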

Figure 4.1: Example of functional event tree [initiating event: tube rupture with release of burnable liquid; headings: flow interception, tanks cooling, jet fire quenching]

Figure 4.2: Functional dependences [event tree with headings S1, S2 and sequences Seq 1 to Seq 4, redrawn with headings S2, S1 and pruned sequences]

Figure 4.3: Illustration of event tree branching [From Reactor Safety Study. U.S. Nuclear Regulatory Commission Rep. WASH-1400, NUREG 75/014 (October 1975)].

Figure 4.4: Schematic of event tree shown with fault trees used to evaluate probabilities of different events [initiating event; success state 1-F1; failure state F1]

Figure 4.5: Event tree with boundary conditions [initiating event EI; headings S3, S1, S2*]
Freq(Seq1) = f(EI) Pr(S3) Pr(S1) Pr(S2*)

Figure 4.6: Fault tree linking [event tree with headings S1, S2; the fault tree for Seq4 joins the fault trees of S1 and S2 through an AND gate, with basic events Pump 1 fault, Pump 2 fault, Valve 1 fault, Valve 2 fault, Human error]

Figure 4.7: Function event tree for a large break LOCA (Loss of Coolant Accident)
Headings: RS, COI, ECI, COR, ECR (f = function failure; NA = not applicable)
Seq. No.    Remarks
1           Core cooled
2           Slow melt
3           Core cooled
4           Slow melt
5           Melt
6           Core cooled
7           Slow melt
8           Melt
9           Melt
10          Melt

Figure 4.8: System event tree for a large LOCA (Loss of Coolant Accident)
