The concept and methodology of root cause analysis (RCA) are designed to provide a cost-effective means to isolate all factors that directly or indirectly result in the myriad problems that we face in our plants and facilities. This process is not limited to equipment or system failures; it can be used effectively to resolve any problem that has a serious, negative impact on the effective management, operation, maintenance, and support of our plants and facilities.
It can identify the initial problem, isolate the actual cause or forcing function that directly resulted in the problem, and identify all factors that directly or indirectly contributed to it.
A3 PROBLEM SOLVING REPORT. Uses 10 steps to proceed from problem identification to resolution in a fashion that fosters learning, collaboration, and personal development.
Toyota Motor Corporation is famed for its ability to relentlessly improve operational performance. One tool Toyota uses is the A3 report. The term A3 derives from the paper size used for the report, which is the metric equivalent of 11 x 17 inch (or B-sized) paper.
Toyota uses A3 reports for solving problems, reporting project status, and proposing policy changes. Most problems that arise in organizations are addressed in superficial ways, what some call first-order problem solving. That is, we work around the problem to accomplish our immediate objective, but do not address the root causes of the problem so as to prevent its recurrence. By not addressing the root cause, we encounter the same problem or same type of problem again and again, and operational performance does not improve. The A3 report helps people engage in collaborative, in-depth problem solving.
It drives the problem solver to address the root causes of problems.
It can be used in almost any situation.
Whenever the way work happens is not ideal, or when a goal or objective is not being met, you have a problem, or, if you prefer, a need.
The preferred source of problem identification is statistical analysis that tracks the actual versus design performance of the plant and all of its functions, for example, sales, production, procurement, maintenance, and so on.
Observe the work process firsthand, and document one's observations. Create a diagram that shows how the work is currently done. Any number of formal process charting or mapping tools can be used, but often simple stick figures and arrows are sufficient.
Quantify the magnitude of the problem. Examples: percentage of customer deliveries that are late, number of stock-outs in a month, number of errors reported per quarter, percentage of work time. Represent the data graphically.
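As one illustration of quantifying the magnitude of a problem, the late-delivery percentage mentioned above could be computed from delivery records along the following lines. This is only a sketch: the `Delivery` record, the `percent_late` helper, and the sample dates are hypothetical and do not come from any particular plant system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Delivery:
    """Hypothetical delivery record: promised date versus actual date."""
    promised: date
    delivered: date


def percent_late(deliveries):
    """Return the percentage of deliveries that arrived after the promised date."""
    if not deliveries:
        return 0.0
    late = sum(1 for d in deliveries if d.delivered > d.promised)
    return 100.0 * late / len(deliveries)


# Illustrative data only (assumed, not measured).
sample = [
    Delivery(date(2024, 3, 1), date(2024, 3, 1)),
    Delivery(date(2024, 3, 4), date(2024, 3, 6)),
    Delivery(date(2024, 3, 5), date(2024, 3, 5)),
    Delivery(date(2024, 3, 8), date(2024, 3, 11)),
]

print(f"{percent_late(sample):.1f}% of deliveries were late")
```

The same pattern applies to the other example measures, such as stock-outs per month or errors per quarter: count the occurrences, divide by the relevant base, and chart the result.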
There are multiple diagnostic and analysis tools that can be used to conduct the RCA. The guidelines provided will help the investigator select the most effective tool or tools for each classification of problem that will be encountered.
Example: the 5 Whys as a tool for analysis.
Countermeasures are the changes to be made to the work processes that will move the organization closer to ideal, or make the process more efficient, by addressing root causes.
Recommended countermeasures help the process conform to three rules:
1. Specify the outcome, content, sequence, and task of work activities.
2. Create clear, direct connections between requestors and suppliers of goods and services.
3. Eliminate loops, workarounds, and delays.
The countermeasures addressing the root causes of the problem will lead to new ways of getting the work done, what is called the target condition or target state. It describes how the work will get done with the proposed countermeasures in place. In the A3 report, the target condition should be a diagram, similar to the current condition, that illustrates how the new proposed process will work. The specific countermeasures should be noted or listed, and the expected improvement should be predicted specifically and quantitatively.
In order to reach the target state, one needs a well-thought-out and workable implementation plan. The implementation plan should include a list of the actions that need to be done to get the countermeasures in place and realize the target condition, along with the individual responsible for each task and a due date. Other relevant items, such as cost, may also be added.
A critical step in the learning process of problem solvers is to verify whether they truly understood the current condition well enough to improve it. Therefore, a follow-up plan becomes a critical step in process improvement to make sure the implementation plan was executed, the target condition realized, and the expected results achieved. You can state the predicted outcome here rather than in the target condition, if you prefer.
It is vitally important to communicate with all parties affected by the implementation or target condition, and to try to build consensus throughout the process. Concerns raised should be addressed insofar as possible, and this may involve studying the problem further or reworking the countermeasures, target condition, or implementation plan.
The goal is to have everyone affected by the change aware of it and, ideally, in agreement that the organization is best served by the change. If the person conducting the A3 process is not a manager, it is imperative to remember the importance of obtaining approval from an authority figure to carry out the proposed plan. The authority figure should verify that the problem has been sufficiently studied and that all affected parties are on board with the proposal. The authority figure may then approve the change and allow implementation. Without implementation, no change occurs.
The next step is to execute the implementation plan. Process improvement should not end with implementation. It is very important to measure the actual results and compare them with the predicted results. If the actual results differ from the predicted ones, research needs to be conducted to figure out why, modify the process, and repeat implementation and follow-up (i.e., repeat the A3 process) until the goal is met.
Steps in Form Use. Use of the A3 format should begin as soon as the potential need for RCA is identified.
Business Case. This section should be used to clearly and concisely define the problem that is to be investigated.
This definition, as well as the business case (for example, a cost-benefit analysis), will be completed after the investigation is complete.
Current Conditions. This section includes a concise definition of the current conditions surrounding, or resulting from, the problem to be investigated.
Graphs, charts, and other illustrations should be used to clearly convey the message.
Target Conditions. This section defines the expected result of the proposed corrective actions identified by the RCA.
The use of graphs, charts, and other illustrative materials will permit the inclusion of much more data as well as provide a more professional report.
Action Plan. This section is your management tool during the RCA and becomes your next steps following management approval to implement the corrective actions.
A Gantt chart, PERT chart, or other type of project schedule or timeline is ideal for this section. Timelines can be created in MS Visio and inserted directly into this section.
Metrics. This section should be used to define the specific return on investment (ROI) or change that is expected from the recommended changes.
It should include both the actual values that represent the change and the source of the data that will be used.
Root cause analysis takes many forms. It can range from a simple visual inspection of failed parts to a comprehensive process designed to identify problems, quantify their impact, develop cost-effective solutions, and implement corrective actions for complex capacity, quality, cost, and reliability problems. RCA is a systematic process that is based on factual data that is free of prejudice, opinions, or political pressure.
It is a logical, practical process that can be used by anyone who is willing to follow it.
Many of the equipment-related problems that plague industrial plants and facilities can be resolved by visual inspection of the failed parts.
For example, premature failure of rolling element bearings is a common problem in most plants and facilities. Too many plants simply replace the bearing and throw the failed bearing in the nearest trash bin. This approach does little to eliminate the real reason that the bearing failed, and there is a high probability that the failure will recur. A simple visual inspection of the failed bearing will, in most cases, permit plant personnel to identify the underlying reason, that is, the root cause, of the premature failure.
As its name implies, the 5 Whys is a simple root cause tool: an interview process that works best in a cross-functional group of personnel who have direct knowledge of the problem that is being investigated. The process should be repeated as often as necessary to arrive at the true root cause of the problem that is being investigated.
Why is Module A failing to meet its production goals? Answer: We're forced to use relief operators most of the time.
Why are you forced to use relief operators? Answer: The regular operators have been in training for the past month.
Why are the regular operators in training? Answer: It's mandated training and everyone has to attend before the end of the year.
Why are all operators being trained at the same time? Answer: That's how it was scheduled.
Why was it scheduled that way? Answer: We were really being pushed earlier in the year and management decided to postpone training until demand dropped.
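The question-and-answer chain above can be recorded in a simple structure so that the line of reasoning is preserved for the rest of the team. The following is a minimal sketch; the `Why` record and both helper functions are hypothetical names chosen for illustration, and the deepest answer is treated only as a candidate root cause that still has to be confirmed.

```python
from dataclasses import dataclass


@dataclass
class Why:
    """One step of a 5 Whys interview (hypothetical record layout)."""
    question: str
    answer: str


def five_whys_summary(chain):
    """Format the chain as numbered question/answer lines."""
    lines = []
    for depth, step in enumerate(chain, start=1):
        lines.append(f"Why {depth}: {step.question}")
        lines.append(f"  -> {step.answer}")
    return "\n".join(lines)


def candidate_root_cause(chain):
    """Return the deepest answer reached so far; a candidate, not a verdict."""
    if not chain:
        raise ValueError("the chain is empty; ask at least one 'why'")
    return chain[-1].answer


chain = [
    Why("Why is Module A failing to meet its production goals?",
        "We are forced to use relief operators most of the time."),
    Why("Why are you forced to use relief operators?",
        "The regular operators have been in training for the past month."),
    Why("Why are the regular operators in training?",
        "It is mandated training that everyone must attend before year end."),
    Why("Why are all operators being trained at the same time?",
        "That is how it was scheduled."),
    Why("Why was it scheduled that way?",
        "Management postponed training until demand dropped."),
]

print(five_whys_summary(chain))
print("Candidate root cause:", candidate_root_cause(chain))
```

Nothing in the structure limits the chain to five entries, which matches the guidance that the process should be repeated as often as necessary.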
What is the root cause of the problem? The forcing function was management's decision to wait on mandated training until it was too late to do it efficiently, but is that the real root cause? How would you correct this problem? Remember, the objective is to prevent a similar situation from recurring at some point in the future.
In formal RCA, the investigating team will need input from all plant personnel who may have direct or indirect knowledge of the deviation, event, or problem that is being investigated.
This information input activity may be limited to interviews, either individually or in groups; but could entail additional support gathering data, records, and other pertinent information.
Obviously, the actual level of effort will depend on the complexity of the problem and the team's ability to determine the root cause or causes.
The purpose of RCA is to resolve problems that negatively impact safety, environmental compliance, asset reliability, and plant performance, not to fix blame.
Fixing blame results in lost morale and will condition the workforce to withhold information that is critical to the root cause process and to effective plant operation and maintenance.
Root cause analysis cannot be performed sitting in a conference room, office, or in front of a computer. While the RCA process does require working group meetings, as well as individual and group interviews, the heart of the process is gathering factual data that can be used to isolate, identify, and quantify the real reason or reasons that resulted in the abnormal behavior that is being investigated.
The RCA process requires hands-on interviews, inspections, testing, and evaluations that can only be done in the plant or field. Theoretical evaluations have their place, but to use the RCA process effectively, the investigators must clearly understand the operating dynamics of the investigated system and confirm any and all factors, assumptions, or hypotheses that may be offered.
The number of people required depends on the complexity of the specific event, deviation, or failure that is being investigated. In rare cases, the personnel required to properly perform an RCA can be substantial, but most cases require a three- to four-person, multidisciplinary team.
There are two primary sources of potential problems:
1. Key performance indicators (KPIs) and asset history, which are used to detect deviations from normal conditions.
2. Requests for analysis from one or more members of the plant's workforce.
Therefore, any employee is expected to identify problems or events that may warrant an analysis.
The investigator is seldom present when an incident or problem occurs. Therefore, the first step is the initial notification that an incident or problem has taken place. Typically, this report will be verbal, a brief written note, or a notation in the production logbook. In most cases, the communication will not contain a complete description of the problem. Rather, it will be a very brief description of the perceived symptoms observed by the person reporting the problem.
The most effective means of problem or event definition is to determine its real symptoms and establish limits that bound the event. At this stage of the investigation, the task can be accomplished by an interview with the person who first observed the problem. At this point, each person interviewed will have a definite opinion about the incident, and will have his or her own description of the event and an absolute reason for the occurrence. Some perceptions are totally wrong, but they cannot be discounted. Even though many of the opinions expressed by the people involved with or reporting an event may be invalid, do not disregard them without investigation.
The use of a format that completely bounds the potential problem or event greatly reduces the level of effort required to complete an analysis.
The investigator or team must first clarify the problem with sufficient definition to:
(1) verify that a problem truly exists and
(2) confirm that the severity of the problem warrants an analysis.
1. What happened?
2. Where did it happen?
3. When did it happen?
4. What changed?
5. Who was involved?
6. Why did it happen?
7. What is the impact?
8. Will it happen again?
9. How can recurrence be prevented?
RCA should not be based on opinions or assumptions. Before starting an analysis, the investigators must confirm that a problem truly exists and that it warrants a formal investigation. If a problem exists, there should be data in the computerized maintenance management system (CMMS) or other record-keeping system that supports it.
The first priority when investigating a problem, a deviation from an acceptable norm, or an event involving equipment damage or failure is to preserve physical evidence.
If possible, the failed machine and its installed system should be isolated from service until a full investigation can be conducted.
Upon removal from service, the failed machine and all of its components should be stored in a secure area until they can be fully inspected and appropriate tests conducted. If this approach is not practical, the scene of the failure should be fully documented before the machine is removed from its installation.
Photographs, sketches, and the instrumentation and control settings should be fully documented to ensure that all data are preserved for the investigating team.
All automatic reports, such as those generated by computer-monitoring and control systems, should be obtained and preserved.
1. Currently approved Standard Operating (SOP) and Maintenance (SMP) Procedures for the machine or area where the event occurred
2. Company policies that govern activities performed during the event
3. Operating and process data, such as strip charts, computer output, and data-recorder information
4. Appropriate maintenance records for the machinery or area involved in the event
5. Copies of logbooks, work packages, work orders, work permits, and maintenance records; equipment-test results; quality-control reports; oil and lubrication analysis results; vibration signatures; and other records
6. Diagrams, schematics, drawings, vendor manuals, and technical specifications, including pertinent design data for the system or area involved in the incident
7. Training records, copies of training courses, and other information that shows skill levels of personnel involved in the event
8. Photographs, videotape, and/or diagrams of the incident scene
9. Broken hardware, such as ruptured gaskets, burned leads, blown fuses, failed bearings, etc.
10. Environmental conditions when the event occurred. These data should be as complete and accurate as possible
11. Copies of incident reports for similar prior events and history/trend information for the area involved in the current incident
Not all problems, whether real or perceived, justify a formal RCA. Therefore, the clarified and confirmed problem should be evaluated to determine if its impact is sufficient to warrant further investigation. If the initial steps appear to justify an RCA, the next step in the process is to perform a top-level cost-benefit analysis. The intent of this analysis is to verify that the potential benefits generated by resolving the reported problem are greater than the incurred cost associated with the problem.
For example, consider the incremental, or elevated, cost of repairing a machine with a normal mean time between repairs (MTBR) of 12 months but an actual MTBR of 3 months. The incremental cost is the difference between the actual and the normal rebuild cost over the same period. In this case, the machine is being rebuilt four times a year instead of once, so the incremental cost is that of three additional rebuilds per year.
If the cost-benefit analysis indicates that the reported event or problem does not warrant further analysis, the investigator should notify the person or persons who initiated the request.
The objective of the design review is to establish the specific operating characteristics of the machine or production system involved in the incident.
The data obtained from a design review provide a baseline or reference, which is needed to fully investigate and resolve plant problems. The evaluation should clearly define the specific function or functions that each machine and system was designed to perform.
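The MTBR example in the cost-benefit discussion above can be worked through numerically. A machine with a normal MTBR of 12 months is rebuilt once a year, while an actual MTBR of 3 months means four rebuilds a year, so the excess is three rebuilds. The sketch below assumes a rebuild cost of 5,000 purely for illustration; the function names are hypothetical as well.

```python
def rebuilds_per_year(mtbr_months):
    """Number of rebuilds per year implied by a mean time between repairs."""
    if mtbr_months <= 0:
        raise ValueError("MTBR must be positive")
    return 12.0 / mtbr_months


def incremental_annual_cost(normal_mtbr_months, actual_mtbr_months, rebuild_cost):
    """Cost of the rebuilds performed beyond the normal number in one year."""
    extra = rebuilds_per_year(actual_mtbr_months) - rebuilds_per_year(normal_mtbr_months)
    return max(extra, 0.0) * rebuild_cost


# Normal MTBR of 12 months versus an actual MTBR of 3 months (from the text);
# the 5,000 rebuild cost is an assumed figure.
cost = incremental_annual_cost(12, 3, rebuild_cost=5_000)
print(f"Incremental rebuild cost per year: {cost:,.0f}")
```

A figure like this is one input to the top-level cost-benefit screen; comparing it with the expected cost of the analysis itself indicates whether the problem is worth pursuing.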
The design review is similar to simplified failure modes and effects analysis (SFMEA) and fault-tree analysis (FTA) in that it is intended to identify the variables or failure modes that could contribute to a problem or failure. The technique is based on readily available, application-specific data to determine the variables that may cause or contribute to an incident. In some instances, the process may be limited to a cursory review of the vendor's operating and maintenance (O&M) manual and performance specifications.
In others, a full evaluation that includes all procurement, design, and operations data may be required. The information required can be obtained from four sources: equipment nameplates, procurement specifications, vendor specifications, and the O&M manuals provided by the vendors.
Most of the machinery, equipment, and systems used in process plants have a permanently affixed nameplate that defines their operating envelope.
For example, a centrifugal pump's nameplate typically includes flow rate, total discharge pressure, specific gravity, impeller diameter, and other data that define its design operating characteristics.
These data can be used to determine if the equipment is suitable for the application and if it is operating within its design envelope.
Procurement specifications are normally prepared for all capital equipment as part of the purchasing process. These documents define the specific characteristics and operating envelope requested by the plant-engineering group. These specifications provide information that is useful for evaluating the equipment or system during an investigation. When procurement specifications are not available, purchasing records should describe the equipment and provide the system envelope. Although these data may be limited to a specific type or model of machine, they are generally useful.
For most equipment procured as part of capital projects, a detailed set of vendor specifications should be available. Generally, these specifications were included in the vendor's proposal and confirmed as part of the deliverables for the project. Normally, these records are on file in two different departments: purchasing and plant engineering. As part of the design review, the vendor and procurement specifications should be carefully compared.
Many of the chronic problems that plague plants are a direct result of vendor deviations from procurement specifications.
Carefully comparing these two documents may uncover the root cause of chronic problems.
O&M manuals are one of the best sources of information. In most cases, these documents provide specific recommendations for proper operation and maintenance of the machine, equipment, or system. In addition, most of these manuals provide specific troubleshooting guides that point out many of the common problems that may occur. A thorough review of these documents is essential before beginning the RCA. The information provided in these manuals is essential to effective resolution of plant problems.
The objective of the design review is to determine design limitations, acceptable operating envelope, probable failure modes, and specific indices that quantify the actual operating condition of the machine, equipment, or process system being investigated. At a minimum, the evaluation should determine the design function, that is, specifically what the machine or system was designed to do. The review should clearly define the specific functions of the system and its components.
To fully define machinery, equipment, or system functions, a description should include incoming and output product specifications, work to be performed, and acceptable operating envelopes. For example, a centrifugal pump may be designed to deliver 1000 gal/min of water having a temperature of 100°F and a discharge pressure of 100 lb/in².
Machine and system functions depend on the incoming product to be handled. Therefore, the design review must establish the incoming product boundary conditions used in the design process. In most cases, these boundaries include temperature range, density or specific gravity, volume, pressure, and other measurable parameters. These boundaries determine the amount of work the machine or system must provide.
Assuming the incoming product boundary conditions are met, the investigation should determine what output the system was designed to deliver. As with the incoming product, the output from the machine or system can be bounded by specific, measurable parameters.
Flow, pressure, density, and temperature are the common measures of output product. However, depending on the process, there may be others.
This part of the design review should determine the measurable work to be performed by the machine or system. Efficiency, power usage, product loss, and similar parameters are used to define this part of the review. The actual parameters will vary depending on the machine or system. In most cases, the original design specifications will provide the proper parameters for the system under investigation.
The final part of the design review is to define the acceptable operating envelope of the machine or system. Each machine or system is designed to operate within a specific range, or operating envelope. This envelope includes the maximum variation in incoming product, startup ramp rates and shutdown speeds, ambient environment, and a variety of other parameters.
Many of the chronic problems that negatively affect critical production systems are caused by inherent design deficiencies. Therefore, the investigator should evaluate the confirmed data developed before and during the design review to determine whether the root cause of the problem can be accurately isolated without continuing the RCA process.
The obvious next step in the RCA process is to review the application to ensure that the machine or system is being used in the proper application and that the mode of operation and maintenance are within the operating envelope, as defined in the design review. The data gathered during the design review should be used to verify the application, as well as operating and maintenance records associated with the appropriate system or asset. Factors to evaluate in an application review include installation; operating envelope; operating procedures and practices, such as standard procedures versus actual practices; maintenance history; and maintenance procedures and practices.
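The comparison of actual operation against the design operating envelope can be sketched as a simple range check. The parameter names and the limit values below are assumptions made for illustration (loosely built around the 1000 gal/min, 100°F, 100 lb/in² pump example); real limits would come from the nameplate, specifications, and O&M manual.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    """Acceptable range for one operating parameter."""
    low: float
    high: float

    def contains(self, value):
        return self.low <= value <= self.high


def envelope_deviations(envelope, reading):
    """Return the parameters in `reading` that fall outside `envelope`."""
    deviations = {}
    for name, limit in envelope.items():
        value = reading.get(name)
        if value is None:
            continue  # parameter not logged; nothing to compare
        if not limit.contains(value):
            deviations[name] = value
    return deviations


# Hypothetical design envelope for a centrifugal pump.
design_envelope = {
    "flow_gpm": Limit(800.0, 1100.0),
    "inlet_temp_f": Limit(60.0, 120.0),
    "discharge_psi": Limit(90.0, 110.0),
}

# Hypothetical logged operating point.
logged = {"flow_gpm": 1250.0, "inlet_temp_f": 98.0, "discharge_psi": 82.0}

print(envelope_deviations(design_envelope, logged))
```

Running the check over a period of logged data, rather than a single reading, would show how often and for how long the system operates outside its envelope.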
Each machine and system has specific installation criteria that must be met before acceptable levels of reliability can be achieved and sustained. These criteria vary with the type of machine or system, and should be verified as part of the RCA. Using the information developed as part of the design review, the investigator or other qualified individuals should evaluate the actual installation of the machine or system that is being investigated. As a minimum, a thorough visual inspection of the machine and its related system should be conducted to determine if improper installation is contributing to the problem. Photographs, sketches, or drawings of the actual installation should be prepared as part of the evaluation. They should point out any deviations from acceptable or recommended installation practices as defined in the reference documents and good engineering practices. This data can be used later in the RCA when potential corrective actions are considered. Evaluating the actual operating envelope of the production system associated with the investigated event is more difficult. The best approach is to determine all variables and limits used in normal production.
For example, define the full range of operating speeds, flow rates, incoming product variations, and so on, which are normally associated with the system. In variable-speed applications, determine the minimum and maximum ramp rates used by the operators.
This part of the application review consists of evaluating the standard operating procedures as well as the actual operating practices. Most production areas maintain some historical data that track their performance and practices. These records may consist of logbooks, reports, or computer data. These data should be reviewed to determine the actual production practices that are used to operate the machine or system being investigated.
Evaluate the standard operating procedures (SOPs) for the affected area or system to determine if they are consistent and adequate for the application. Two reference sources, the design review report and the vendor's O&M manuals, are required to complete this task. In addition, evaluate SOPs to determine if they are usable by the operators. Review organization, content, and syntax to determine if the procedure is correct and understandable.
Special attention should be given to the setup procedures for each product produced by a machine or process system. Improper or inconsistent system setup is a leading cause of poor product quality, capacity restrictions, and equipment unreliability.
The procedures should provide clear, easy to understand instructions that ensure accurate, repeatable setup for each product type.
If they do not, the deviations should be noted for further evaluation. Transient procedures, such as start-up, speed change, and shutdown, also should be carefully evaluated.
These are the predominant transients that cause deviations in quality and capacity, and that have a direct impact on equipment reliability. These procedures should be evaluated to ensure that they do not violate the operating envelope or the vendor's recommendations.
All deviations must be clearly defined for further evaluation.
This part of the evaluation should determine if the SOPs were understood and followed before and during the incident or event. The normal tendency of operators is to shortcut procedures, which is a common reason for many problems. In addition, unclear procedures lead to misunderstandings and misuse. Therefore, the investigation must fully evaluate the actual practices that the production team uses to operate the machine or system.
The best way to determine compliance with SOPs is to have the operator(s) list the steps used to run the system or machine being investigated. This task should be performed without referring to the SOP manual. The investigator should lead the operator(s) through the process and use their input to develop a sequence diagram. After the diagram is complete, compare it to the SOPs. If the operators' actual practices are not the same as those described in the SOPs, the procedures may need to be upgraded or the operators may need to be retrained.
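Once the operators' recalled sequence has been written down, comparing it with the SOP is essentially a sequence comparison. One possible way to sketch this uses Python's standard `difflib.SequenceMatcher`; the step names and the `compare_to_sop` helper are hypothetical, and a real comparison would first need the steps expressed in consistent wording.

```python
import difflib


def compare_to_sop(sop_steps, recalled_steps):
    """List where the recalled sequence departs from the SOP.

    Each entry is (kind, sop_part, recalled_part), where kind is
    'delete' (SOP step skipped), 'insert' (extra step performed),
    or 'replace' (different step performed).
    """
    matcher = difflib.SequenceMatcher(a=sop_steps, b=recalled_steps, autojunk=False)
    deviations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        deviations.append((tag, sop_steps[i1:i2], recalled_steps[j1:j2]))
    return deviations


# Hypothetical start-up procedure and an operator's recalled version of it.
sop = [
    "verify suction valve open",
    "prime pump",
    "start motor",
    "open discharge valve slowly",
    "confirm discharge pressure",
]
recalled = [
    "verify suction valve open",
    "start motor",
    "open discharge valve slowly",
]

for kind, sop_part, recalled_part in compare_to_sop(sop, recalled):
    print(kind, sop_part, recalled_part)
```

Each reported deviation is a prompt for a follow-up question, not a finding in itself: the step may have been skipped, or the SOP may be out of date.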
A thorough review of the maintenance history associated with the machine or system is essential to the RCA process. One of the questions that must be answered is: will this happen again? A review of the maintenance history may help answer this question. The level of accurate maintenance data that are available will vary greatly from plant to plant. This may hamper the evaluation, but it is necessary to develop as clear a picture as possible of the system's maintenance history.
A complete history of the scheduled and actual maintenance, including inspections and lubrication, should be developed for the affected machine, system, or area. The primary details that are needed include: frequency of repair and types of repair, frequency and types of preventive maintenance, failure history, and any other facts that will help in the investigation.
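The frequency and types of repair can be summarized directly from work-order dates. The sketch below is illustrative only: the record layout (a date paired with a repair type) and the sample entries are assumed, and an actual history would be pulled from the plant's maintenance records.

```python
from collections import Counter
from datetime import date


def mean_days_between_repairs(repair_dates):
    """Average gap in days between consecutive repairs, or None if fewer than two."""
    ordered = sorted(repair_dates)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def repair_type_counts(records):
    """Count how many times each type of repair appears in the history."""
    return Counter(kind for _, kind in records)


# Hypothetical repair history for one machine: (date, type of repair).
history = [
    (date(2023, 1, 10), "bearing"),
    (date(2023, 4, 10), "bearing"),
    (date(2023, 7, 9), "seal"),
    (date(2023, 10, 7), "bearing"),
]

print(mean_days_between_repairs([when for when, _ in history]))
print(repair_type_counts(history).most_common())
```

A short interval dominated by one repair type, as in this made-up history, is the kind of pattern that helps answer whether the problem is likely to happen again.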
A complete evaluation of the Standard Maintenance Procedures (SMPs) and actual practices should be conducted.
The procedures should be compared with maintenance requirements defined by both the design review and the vendor's O&M manuals. Actual maintenance practices can be determined in the same manner as described earlier for operating practices, or by visual observation of similar repairs. This task should determine if all maintenance personnel assigned to or involved with the area that is being investigated consistently follow the SMPs.
Special attention should be given to the routine tasks, such as lubrication, adjustments, and other preventive tasks. Determine if these procedures are being performed in a timely manner and if proper techniques are being used.
More than 27 percent of all reliability problems are caused by misapplication. While the initial design and operations of the system may have been compatible, the myriad of modifications, upgrades, and other changes have historically resulted in operating conditions that are outside the acceptable operating envelope. If the preceding steps do not provide a clear understanding of the more probable reasons for the problem, the investigator or team must organize all of the data, assumptions, and hypotheses into a form that can be used for further analysis. The most effective method involves plotting the accumulated facts, assumptions, and hypotheses into a graphical format that facilitates understanding the cause and effect and interactions of all identified variables. Common problem classifications are equipment damage or failure, operating performance, economic performance, safety, and regulatory compliance. Classifying the event as a particular problem type allows the analyst to determine the best method to resolve the problem. Each of the major classifications requires a slightly different RCA approach. One of the major classifications of problems that often warrant RCA is an event associated with failure of critical production equipment, machinery, or systems.
Typically, any incident that results in partial or complete failure of a machine or process system warrants an RCA. This type of incident can have a severe, negative impact on plant performance. Therefore, it often justifies the effort required to fully evaluate the event and to determine its root cause.
The most effective methods of resolving an equipment or system failure problem are sequence-of-events analysis or SFMEA.
Product Quality. Deviations in first-time-through product quality are prime candidates for RCA, which can be used to resolve most quality-related problems. However, the analysis should not be used for all quality problems. Nonrecurring deviations or those that do not have a significant impact on capacity or costs are not cost-effective applications.
Many of the problems or events that occur affect a plant's ability to consistently meet expected production or capacity rates. These problems may be suitable for RCA, but further evaluation is recommended before beginning an analysis. After the initial investigation, if the event can be fully qualified and a cost-effective solution found, then a full analysis should be considered. Note that an analysis is not normally performed on random, nonrecurring events or equipment failures. The preferred analytical tool for these potentially complex problems is cause-and-effect analysis. In some cases where the exact time the problem first began is known, sequence-of-events analysis can also be used effectively.
Deviations in economic performance, such as high production or maintenance costs, often warrant the use of RCA. The decision tree and specific steps required to resolve these problems vary depending on the type of problem and its forcing functions or causes. Because of the complexity of economic deviations, the preferred analytical tool is again cause-and-effect analysis.
Any event that has a potential for causing personal injury should be investigated immediately. While events in this classification may not warrant a full RCA, they must be resolved as quickly as possible.
Isolating the root cause of injury-causing accidents or events is generally more difficult than for equipment failures and requires a different problem-solving approach. The primary reason for this increased difficulty is that the cause is often subjective.
In most cases, regulatory requirements necessitate using all of the analytical tools, but the primary tool should be cause-and-effect analysis.
Any regulatory compliance event can potentially impact the safety of workers, the environment, and the continued operation of the plant.
Therefore, any event that results in a violation of environmental permits or other regulatory-compliance guidelines, such as Occupational Safety and Health Administration, Environmental Protection Agency, and state regulations, must be investigated and resolved as quickly as possible.
Since all releases and violations must be reported, and they have a potential for curtailed production and/or fines, this type of problem must receive a high priority.
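The pairing of problem classifications with preferred analytical tools described in this section can be captured as a small lookup table. The sketch below includes only the pairings stated in the text; product quality and regulatory compliance are left out because no single preferred tool is named for them. The dictionary and function names are hypothetical.

```python
# Preferred tools per problem classification, as stated in the text.
PREFERRED_TOOLS = {
    "equipment failure": ["sequence-of-events analysis", "SFMEA"],
    "operating performance": ["cause-and-effect analysis", "sequence-of-events analysis"],
    "economic performance": ["cause-and-effect analysis"],
    "safety": ["cause-and-effect analysis"],
}


def preferred_tools(classification):
    """Return the preferred analytical tools for a problem classification."""
    key = classification.strip().lower()
    if key not in PREFERRED_TOOLS:
        raise KeyError(f"no preferred tool recorded for {classification!r}")
    return PREFERRED_TOOLS[key]


print(preferred_tools("Equipment failure"))
print(preferred_tools("safety"))
```

For operating performance problems, the second entry applies only when the exact time the problem first began is known.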