
Root Cause Analysis –

Quality of Process?
By Robert J. Latino,
Sr. VP, Reliability Center, Inc.

Abstract: We have all heard the term Root Cause Analysis (RCA) and we
likely all interpret its meaning differently. This is the single biggest
reason we see for the ineffective use of RCA: lack of communication or
miscommunication amongst the users. If we are all using various forms
of RCA, then when we compare our results we are not comparing apples
with apples. We will explore the discipline necessary to bring
consistency to our RCA application, thus greatly improving the
credibility and communication of the results.

Since the evolution of Total Productive Maintenance (TPM) in the United
States there has been a consistent movement towards exploring the quality
of the process versus the quality of the product. Before the advent of TPM,
quality organizations were typically content with testing the quality of the
product as it came off the line as a finished product. While an admirable
concept at the time, we learned that by then it was too late if we found
quality defects. The entire product and/or lot would have to be reworked at
great expense to the organization.

Then the TPM concepts of W. Edwards Deming were introduced and they
pushed the “quality of process” concept. In short, this meant that we would
measure key variables within the process stages and monitor for any
unacceptable variances. In this manner, we can correct the process
variation and prevent the production of off-spec products. This era has
continued into the 21st century with the introduction of Six Sigma.
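
As a rough illustration of monitoring a key process variable for unacceptable variance, the sketch below computes control limits from a baseline and flags later readings that fall outside them. The three-sigma rule, the function names and the sample numbers are assumptions chosen for illustration; the article does not prescribe a particular control chart.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) control limits from in-control baseline readings."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - sigmas * spread, center + sigmas * spread


def out_of_control(readings, limits):
    """Return (index, value) pairs for readings outside the control limits."""
    lower, upper = limits
    return [(i, x) for i, x in enumerate(readings) if x < lower or x > upper]


# Hypothetical fill weights (grams) measured mid-process, not at final inspection.
baseline = [100.1, 99.8, 100.0, 100.3, 99.9, 100.2, 99.7, 100.0]
limits = control_limits(baseline)
flagged = out_of_control([100.1, 99.9, 101.5, 100.0], limits)
print(limits, flagged)
```

A flagged reading is a prompt to correct the process at that stage, before off-spec product reaches the end of the line.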

Now take the above summaries of the application of TPM and let's apply
them to a non-manufacturing process such as RCA. As we discussed
earlier, RCA means different things to different people. Many consider
undisciplined efforts such as “trial and error” as their RCA approach. This
means that we perceive a problem to exist and we go right to what the most
obvious cause is TO US! This is the “finished product” approach. We do
not validate any of our assumptions, we just assume a cause and spend
money to implement a fix and hope it works. Experience shows this
approach to be ineffective and very expensive.

Now let's apply the TPM concept to a disciplined method of RCA such as a
Logic Tree used in the PROACT® process. A Logic Tree strives to
graphically represent the cause and effect relationships that lead to the
surfacing of the undesirable event.

In this approach, we must clearly identify the undesirable event and its
associated modes with supporting facts. Facts are supported by some
essence of science, direct observation and documentation. They cannot be
hearsay or assumptions!

In the example below, most people would insist that we start with a bearing
failure. However, when the event occurred, why was it brought to our
attention? It was not brought to our attention because the bearing failed. It
was brought to our attention because the failed bearing caused a pump to
stop pumping something. Therefore the last effect that caused us concern
was the pump failure. One reason (or mode) that the pump failed was due
to the bearing failing. This is clearly evidenced by the failed bearing
(physical evidence). The top of the Logic Tree may look like this:

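[Figure: top of the Logic Tree. The Event block is supported by a fact (DCS verification); the Mode block beneath it is supported by a fact (the physical bearing).]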

Figure 1.0 – Event and Mode Supported by Facts

Continuing our search backwards for the cause and effect relationships, we
would then ask “How can a bearing fail?” Our hypotheses may be
erosion, corrosion, fatigue or overload. How can we prove which are true?
We would simply have a metallurgical lab analyze the failed bearing for us
and produce an analysis report.

For the sake of this example, let's say our metallurgical report indicated
fatigue patterns only. Now our Logic Tree would advance a level and look
like the following:

Figure 2.0 – Additional Hypotheses

You can now see that as we develop new sets of hypotheses, we are
proving what we say at each level of the process. This is demonstrating
quality of process. When we continue this iterative process, we are
validating our conclusions each step of the way. This way, when we draw
our conclusions about root causes, they will be right because we have
supported them based on fact and not merely assumption. This also means
that when we agree to spend money to overcome the identified causes,
the money will be well spent because the problem will not recur.
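
One way to picture proving each level is to record, for every block of the Logic Tree, the evidence that supports it and whether it has been tested. The sketch below is a hypothetical data structure, not part of the PROACT® process itself; the class, field and function names are assumptions, and the Overload block is deliberately left untested to show how an unproven assumption would be caught.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One block of a logic tree: an event, mode, or hypothesis."""
    label: str
    evidence: Optional[str] = None   # fact used to test the block, if any
    verified: Optional[bool] = None  # True/False once tested, None if untested
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


def unproven(node: Node) -> List[str]:
    """List labels of blocks that were never tested; each is still an assumption."""
    found = [node.label] if node.verified is None else []
    for child in node.children:
        found.extend(unproven(child))
    return found


event = Node("Pump failure", evidence="DCS verification", verified=True)
mode = event.add(Node("Bearing failure", evidence="Failed bearing", verified=True))
mode.add(Node("Erosion", evidence="Metallurgical report", verified=False))
mode.add(Node("Corrosion", evidence="Metallurgical report", verified=False))
mode.add(Node("Fatigue", evidence="Metallurgical report", verified=True))
mode.add(Node("Overload"))  # left untested on purpose for this illustration

print(unproven(event))
```

An empty list from `unproven` would mean every block on the tree has been tested against a fact rather than assumed.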

Applying TPM to thought processes is not a new concept, however. When
you think of scientific experimentation, it follows the same premise. When
conducting scientific experiments, we must first develop a hypothesis and
then an appropriate test method to draw valid conclusions. If you think
about it, any investigative occupation must follow this “quality of process”
approach. Think about detectives, NTSB investigators, doctors and fire
investigators: they must all develop hypotheses and prove what they
say.

In an effort to move our cultures toward precision, we must treat our
administrative processes with the TPM concepts in mind as well. The
TPM approach is applicable to Equipment, Process and Human situations.
We must not limit ourselves by applying the concepts narrowly.

Root Cause Analysis:


Quality of Process (2)
By Robert J. Latino, Sr. VP Strategic Development, Reliability Center, Inc.


Abstract: If we think back twenty-plus years, we will recall that most of our quality efforts were directed at
checking the quality of the final product in the finishing and packaging stages. By that point, if something was
found defective, we would have to scrap the entire run or lot of products produced. Then came the TPM
initiatives that stressed “quality of process” and we started to implement Statistical Process Controls (SPC) and
Statistical Quality Control (SQC). We started to look at quality “during” the manufacturing process ensuring
that when the finished product came off the line, it was a quality product. Can we do the same with Root Cause
Analysis (RCA)?

Taking the TPM parallel described in the abstract above, let’s see if it applies to non-manufacturing processes
such as RCA. If asked, almost everyone will say they are doing Root Cause Analysis (RCA). And to a large extent,
they will be correct in their own minds. This is because of how they define RCA versus how the person asking
the question defines RCA. It is as if we asked a sample population, “Do you live a healthy life?” The
majority would reply with an emphatic YES. However, what does healthy mean to these people? To some it
means they are alive, to others it means that they eat right and exercise, to others it means that they are
emotionally sound and to others it may mean that they are content in their religious beliefs.

So how many ways can someone interpret RCA? Some believe it is 1) having the local expert provide us a
solution, some believe it is 2) brainstorming in a room and drawing conclusions from hearsay and some believe
in 3) the use of a disciplined thought process to seek true root causes.

1) When the perceived expert provides a solution as an individual, we are more apt to trust their instincts,
spend the money on their solution and see if it works. Sometimes it works, but more often than not, it
does not work. Checking to see if the solution works is like checking quality only at the finished product
stage. It is too late if there is a defect found!

2) When teams are used to brainstorm using quality techniques such as fishbone and/or 5 WHYS, they will
usually draw conclusions based on majority opinion. This means that solutions tend to be implemented
based on the consensus of the group’s opinions, not on any factual basis where tests prove that these
opinions are correct. Again, we are checking quality of the final product and not of the thought process
that drew the conclusions.

3) When teams use a disciplined RCA process that requires hypotheses to be developed as to how
something could occur, and then REQUIRE verification with some essence of science as to whether it is
true or not, then we are employing quality of process! This is because we are proving our hypotheses
with facts rather than relying on hearsay, assumptions and ignorance.

To demonstrate these points, look at the following abbreviated example:

Figure 1.0 - PROACT® RCA Disciplined Logic Tree

The above depicts a disciplined thought process called PROACT®. Let’s think back to our RCA scenarios. If a
critical pump were to fail, in some cases we would get our best engineers to take a look at it. They would do
their engineering magic themselves and may conclude that a different type of bearing (perhaps more heavy
duty) should be in this service. We would change out the bearings with the newly designed ones. Given the
above scenario, would the problem go away?

What if we get our brainstorming teams together and everyone looks at the past performance of the
pump and its maintenance history and concludes that it is a new lubricant they are using and that it should be
changed. Under the above scenario, would the problem go away?

Utilizing the disciplined approach above, we are going to have to have the bearing reviewed by metallurgists.
They will send back a report concluding (with science) that there is evidence to support the presence of fatigue.
We ask ourselves, “How can fatigue occur on the bearing?” We hypothesize that it can come from high vibration.
We check our vibration monitoring records and conclude that there is evidence of excessive vibration. How can
we have excessive vibration? We hypothesize that it can come from imbalance, resonance or misalignment.
We check our balance certifications and our vibration records for resonance, and find no evidence to support
that they are contributors. We ask the mechanic who aligned the pump to align it again and observe his
practices. From the observation, we can conclude that he does not know how to properly align.

When we ask, “Why would he not align it properly?”, we find that he was never trained in how to align, he was
using worn alignment tools and no procedure existed to follow. Now we know the REAL root causes, so we can
develop solutions, that when implemented, WILL WORK!!

Using the PROACT® disciplined process, we are utilizing quality of process versus quality of product. The facts
are leading us to our conclusions, not hearsay. We are not using “trial and error” solutions to see if they work.
By the time we get to solutions, we know they will work because we have maintained quality of the RCA
process.

While the undisciplined RCA approaches are attractive to organizations because they produce a quick answer, that
does not mean that the answer is correct. They are quick approaches because they lack proof that they are
correct. True RCA involves taking the time to prove what we say, before we spend money to prove we are
wrong!!

Root Cause Analysis:


Quality of Process (3)
By Robert J. Latino, Sr. VP Strategic Development,
Reliability Center, Inc.

Where Does Root Cause Analysis Stop, At the HOW or the WHY?

Abstract: When most people conduct their version of a Root Cause Analysis (RCA), where do they usually stop?
How do they know when they are done? How do they know that the problem will not recur? These questions
represent reality when we are the ones in the field working on a pressing problem with management on our
backs. If we consider ourselves manufacturing detectives, are we content with stopping at the “HOWS” or
the “WHYS”?

I was watching a TV series the other night, my favorite by the way, called CSI: Crime Scene Investigation. It
is a series about forensic specialists who use high-tech tools to prove and disprove hypotheses, mainly for
prosecutors and detectives. The entire show revolves around various crime scenes and how the cases are built
to prepare a “solid case” for court.

Putting this perspective into our world as RCA analysts, we too must build a “solid case.” However our court is
not likely going to be a judge and/or jury, but rather a select number of managers that we are going to request
money from to implement RCA recommendations. While the objectives may be different, the means to attain
them are similar. In both instances, we must prove a solid case in order to obtain desired ends. In the criminal
detective’s instance the goal is a conviction. In the analyst’s case, the goal is to implement recommendations to
prevent recurrence of the undesirable event.

Looking at it this way, when we typically conduct analyses, are we more like the forensic engineer or the
prosecutor and detective looking to win his case? What is the difference between the two roles?

The forensic engineer’s role is simply to determine with science HOW the event occurred. This means that a
certain sequence of cause and effect relationships linked up and resulted in the undesirable event. Their role is
to prove that each hypothesis did or did not occur. They in essence will map out HOW the crime occurred and
be able to prove that it happened just that way.

Now let’s look at the role of the prosecutor and the detectives. How do they fit into the big picture? Their role
is typically to determine the WHY? The forensic engineers provided them the HOW pieces of the puzzle, now the
detectives and the prosecutors must determine WHY the crime was committed. In other words, they must
identify the motive of the person that triggered the HOW (the sequence of events that lead to the outcome or
the crime) to occur.

This is the same for us in industry. We use our technology (e.g., vibration monitoring, infrared imaging,
electron microscopy, stress analysis, etc.) to prove and disprove our hypotheses, but our analysts must explore
WHY people make decisions that result in undesirable outcomes or failure. Take, for instance, the Logic Tree
example below that we used in Part II of this series.

[Figure: Logic Tree ending in three latent roots: Inadequate Training (LATENT), Improper Tools (LATENT), No Procedures (LATENT).]

Picture 1.0 - PROACT® RCA Disciplined Logic Tree

The undesirable outcome is that a pump failed to perform its intended function. In an effort to prove our
“solid case” we must understand the cause and effect relationships that lead up to the event. This will involve
using science to prove our hypotheses. In the above case let’s explore HOW the pump could have failed and use
science to prove our case:

HYPOTHESIS                                VERIFICATION TECHNIQUE

Erosion, Corrosion, Fatigue & Overload    Metallurgical Analysis

High Vibration                            Vibration Monitoring Instruments

Misalignment                              Laser Alignment Technology

These questions answer the HOW, but what about the WHY? In this case someone misaligned a pump and that
decision resulted in a sequence of cause and effect relationships that caused the pump to fail prematurely. The
“forensics” confirmed for us the HOW, but WHY would a person choose to align in that fashion? This is where we
need to understand the motive of WHY people make decisions that are in error. As an analyst, if we were to go
deeper and understand the thought process or the rationale for such a decision (Latent Root), we would uncover
the real ROOT CAUSES of WHY physical failure occurs. People often misalign because they were never trained in
proper alignment practices, no procedure exists outlining alignment as a required practice with specifications
and/or the current alignment equipment we are using is worn or inadequate for the application.

If we do not explore the WHY, then the HOW is likely to recur. In this example, if we merely change out the
failed bearing, does the problem go away for good? Even if we identify an excessive vibration and take
measures to identify it sooner so that we can better predict impending failure, does that make the problem go
away? If we discipline the mechanic for not aligning properly, does that make the issue go away?

As you can tell, none of these commonly applied solutions will totally prevent the recurrence of the pump
failure. Only identifying the WHY that triggers the physical root will prevent recurrence.
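
The distinction can be made concrete by tagging each cause in the pump example with a level and checking how deep a set of recommendations actually reaches. This is a minimal sketch under stated assumptions: the three-level scheme (with an intermediate HUMAN level), the level assigned to each cause and the function name are illustrative choices, not taken from the article, which names only physical and latent roots.

```python
from enum import IntEnum


class Level(IntEnum):
    PHYSICAL = 1  # the HOW: a component-level cause proven by forensics
    HUMAN = 2     # the decision or action that triggered the physical cause
    LATENT = 3    # the WHY: the system-level reason behind the decision


# Cause chain from the pump example; level assignments are illustrative.
chain = [
    ("Fatigue", Level.PHYSICAL),
    ("High vibration", Level.PHYSICAL),
    ("Misalignment", Level.PHYSICAL),
    ("Mechanic aligned improperly", Level.HUMAN),
    ("Inadequate training", Level.LATENT),
    ("Improper tools", Level.LATENT),
    ("No procedures", Level.LATENT),
]


def stopping_point(causes_addressed, chain):
    """Return 'WHY' if any addressed cause is latent; anything less is 'HOW'."""
    levels = dict(chain)
    deepest = max((levels[c] for c in causes_addressed), default=Level.PHYSICAL)
    return "WHY" if deepest == Level.LATENT else "HOW"


print(stopping_point(["Fatigue", "High vibration"], chain))
print(stopping_point(["Misalignment", "Inadequate training"], chain))
```

Under this scheme, replacing the bearing or disciplining the mechanic both count as stopping at the HOW, because neither reaches a latent root.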

If you now reflect on your current RCA efforts, do you stop at the HOW (forensics level) or at
the WHY (detective level)?


Total Productive Maintenance (TPM)


• Introduction
• Method and Implementation Approach
• Implications for Environmental Performance
• Useful Resources

Introduction

Total Productive Maintenance (TPM) seeks to engage all levels and functions in an organization to maximize the
overall effectiveness of production equipment. This method further tunes up existing processes and equipment by
reducing mistakes and accidents. Whereas maintenance departments are the traditional center of preventive
maintenance programs, TPM seeks to involve workers in all departments and levels, from the plant-floor to senior
executives, to ensure effective equipment operation.

Autonomous maintenance, a key aspect of TPM, trains and focuses workers to take care of the equipment and
machines with which they work. TPM addresses the entire production system lifecycle and builds a solid, plant-floor
based system to prevent accidents, defects, and breakdowns. TPM focuses on preventing breakdowns (preventive
maintenance), "mistake-proofing" equipment (or poka-yoke) to eliminate product defects and non-de, or to make
maintenance easier (corrective maintenance), designing and installing equipment that needs little or no maintenance
(maintenance prevention), and quickly repairing equipment after breakdowns occur (breakdown maintenance).

The goal is the total elimination of all losses, including breakdowns, equipment setup and adjustment losses, idling
and minor stoppages, reduced speed, defects and rework, spills and process upset conditions, and startup and yield
losses. The ultimate goals of TPM are zero equipment breakdowns and zero product defects, which lead to improved
utilization of production assets and plant capacity.


Method and Implementation Approach

TPM is focused primarily on keeping machinery functioning optimally and minimizing equipment breakdowns and
associated waste by making equipment more efficient, conducting preventative, corrective, and autonomous
maintenance, mistake-proofing equipment, and effectively managing safety and environmental issues. TPM seeks to
eliminate six major losses that can result from faulty equipment or operation, as summarized below.

Six major losses that can result from faulty equipment or operation
(Poor Maintenance Loss Category: Costs to Organization)

• Unexpected breakdown losses: Results in equipment downtime for repairs. Costs can include downtime (and lost
production opportunity or yields), labor, and spare parts.
• Set-up and adjustment losses: Results in lost production opportunity (yields) that occurs during product
changeovers, shift change or other changes in operating conditions.
• Stoppage losses: Results in frequent production downtime from zero to 10 minutes in length and that are
difficult to record manually. As a result, these losses are usually hidden from efficiency reports and are built
into machine capabilities but can cause substantial equipment downtime and lost production opportunity.
• Speed losses: Results in productivity losses when equipment must be slowed down to prevent quality defects
or minor stoppages. In most cases, this loss is not recorded because the equipment continues to operate.
• Quality defect losses: Results in off-spec production and defects due to equipment malfunction or poor
performance, leading to output which must be reworked or scrapped as waste.
• Equipment and capital investment losses: Results in wear and tear on equipment that reduces its durability
and productive life span, leading to more frequent capital investment in replacement equipment.

Organizations typically pursue the four techniques below to implement TPM. Kaizen events can be used to focus
organizational attention on implementing these techniques (see profile of the Kaizen lean method).

1. Efficient Equipment: The best way to increase equipment efficiency is to identify the losses, among the six
described above, that are hindering performance. To measure overall equipment effectiveness, a TPM index,
Overall Equipment Effectiveness (OEE), is used. OEE is calculated by multiplying overall equipment
availability, performance and product quality rate, each expressed as a percentage. With these figures, the amount of time
spent on each of the six big losses, and where most attention needs to be focused, can be determined. It is
estimated that most companies can realize a 15-25 percent increase in equipment efficiency rates within three
years of adopting TPM.
2. Effective Maintenance: Thorough and routine maintenance is a critical aspect of TPM. First and foremost,
TPM trains equipment operators to play a key role in preventive maintenance by carrying out "autonomous
maintenance" on a daily basis. Typical daily activities include precision checks, lubrication, parts replacement,
simple repairs, and abnormality detection. Workers are also encouraged to conduct corrective maintenance,
designed to further keep equipment from breaking down, and to facilitate inspection, repair and use. Corrective
maintenance includes recording the results of daily inspections, and regularly considering and submitting
maintenance improvement ideas.
3. Mistake-Proofing: Known as poka-yoke in lean manufacturing contexts, mistake-proofing is the application
of simple "fail-safing" mechanisms designed to make mistakes impossible or at least easy to detect and correct.
Poka-yoke devices fall into two major categories: prevention and detection.
o A prevention device is one that makes it impossible for a machine or machine operator to make a
mistake. For example, many automobiles have "shift locks" that prevent a driver from shifting into
reverse unless their foot is on the brake.
o A detection device signals the user when a mistake has been made, so that the user can quickly correct
the problem. In automobiles, a detection device might be a warning buzzer indicating that keys have
been inadvertently left in the ignition.
4. Safety Management: The fundamental principle behind TPM safety and environmental management activities
is addressing potentially dangerous conditions and activities before they cause accidents, damage, and
unanticipated costs. Like maintenance, safety activities under TPM are to be carried out continuously and
systematically. Focus areas include
o the development of safety checklists (e.g., to detect leaks, unusual equipment vibration, or static
electricity)
o the standardization of operations (e.g., materials handling and transport, use of protective clothing)
o the coordination of nonrepetitive maintenance tasks (especially those involving electrical hazards,
toxic substances, or open flames).

In many cases, equipment can be modified (see mistake-proofing) to minimize the likelihood of equipment
malfunction and upset conditions.
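
The OEE index described under Efficient Equipment above is the product of three rates. The following is a minimal sketch, assuming each rate is supplied as a fraction between 0 and 1; the function name and the sample values are hypothetical.

```python
def oee(availability, performance, quality):
    """Overall Equipment Effectiveness as the product of three rates (0 to 1)."""
    for name, rate in (("availability", availability),
                       ("performance", performance),
                       ("quality", quality)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {rate}")
    return availability * performance * quality


# Hypothetical line: 90% available, 95% of rated speed, 98% good product.
score = oee(0.90, 0.95, 0.98)
print(f"OEE = {score:.1%}")  # prints OEE = 83.8%
```

Because the three rates multiply, a modest shortfall in each compounds: here no single rate is below 90 percent, yet the combined figure is under 84 percent.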

Implications for Environmental Performance

Potential Benefits:
Properly maintaining equipment and systems helps reduce defects that result from a process. A reduction in
defects can, in turn, help eliminate waste from processes in three fundamental ways:

1. fewer defects decrease the number of products that must be scrapped;
2. fewer defects also mean that the raw materials, energy, and resulting waste associated with the
scrap are eliminated;
3. fewer defects decrease the amount of energy, raw material, and wastes that are used or
generated to fix defective products that can be re-worked.

TPM can increase the longevity of equipment, thereby decreasing the need to purchase and/or make
replacement equipment. This, in turn, reduces the environmental impacts associated with raw materials and
manufacturing processes needed to produce new equipment.
TPM often attempts to decrease the number and severity of equipment spills, leaks, and upset conditions. This
typically reduces the solid and hazardous wastes (e.g., contaminated rags and adsorbent pads) resulting from
spills and leaks and their clean-up.
Potential Shortcomings:
Failure to consider the environmental aspects or impacts associated with equipment during mistake-proofing
and equipment efficiency improvement can leave potential waste minimization and pollution prevention
opportunities on the table. For example, equipment can often be modified to reduce or eliminate spills, leaks,
overspray, and misting that increase clean-up needs.
TPM can result in increased use of cleaning supplies, particularly if the root cause of unclean conditions is
not addressed. Cleaning supplies may contain solvents and/or chemicals that can result in air emissions or
increased waste generation.

Useful Resources

Campbell, John Dixon. Uptime: Strategies for Excellence in Maintenance Management (Portland, Oregon:
Productivity Press, 1995).

The Japan Institute of Plant Maintenance, ed. TPM for Every Operator (Portland, Oregon: Productivity Press, 1996).

Leflar, James. Practical TPM: Successful Equipment Management at Agilent Technologies (Portland, Oregon:
Productivity Press, 2001).

Robinson, Charles and Andrew Ginder. Introduction to Implementing TPM: The North American Experience
(Portland, Oregon: Productivity Press, 1995).

Suzuki, Tokutaro, ed. TPM in Process Industries (Portland, Oregon: Productivity Press, 1994).
