Plant machinery working life prediction method utilizing reliability and condition-monitoring data
K B Goode, J Moore and B J Roylance
Proceedings of the Institution of Mechanical Engineers, Part E: Journal of Process Mechanical Engineering 2000 214: 109
DOI: 10.1243/0954408001530146
Published by:
http://www.sagepublications.com
Keywords: condition-based maintenance model, condition monitoring, hot strip mill, steelworks,
failure prediction, life prediction, reliability, statistical process control
problems imposed by reliance on these methods is that, if the alarm limits are set too high, the machine may fail without sufficient advance warning. If the limits are set too low, the machine will generate false alarms that can obscure a true warning until it is too late. Experienced machine operators and maintenance personnel learn, from experience, how to distinguish between false and true alarms. However, such people are not always present and, hence, a need exists to try to mimic such experiences. If this is attempted through the development of a failure prediction model, a number of difficulties are encountered.

Some of these difficulties may be addressed by employing a statistical process control (SPC) approach. SPC theory assumes that operating measurements obtained from a correctly functioning machine will normally vary around an average value [3]. If the machine malfunctions, however, this natural measurement pattern may change, indicating the source of the problem. By setting suitable alarm limits, SPC can be utilized to distinguish such measurements in terms of stable or unstable regions [4]. However, the setting of such limits is both machine and process dependent and needs to be conducted for each individual situation. It is, therefore, a time-consuming and failure data-intensive activity which may be alleviated to some extent by comparing measurements from a group of otherwise similar machines, thereby providing a larger population of failures from which to extract data and establish realistic alarm limits.

In predicting the remaining useful life of a machine, previously developed models were based on the extensive use of reliability data coupled to a number of simplifying assumptions [5–7]. Although the prediction obtained from such models is seldom precise enough for predicting the remaining life of individual machinery, they have been found to be useful for optimizing maintenance strategies.

A commonly encountered reliability model applied to repairable systems is the renewal process. It assumes that, when a machine fails, it is repaired perfectly, i.e. as good as new, and that times between failures are independent and identically distributed. When these assumptions hold true, the process is said to be stationary. A special case of the renewal process is when the times between failures are independently and exponentially distributed with a constant failure rate. This is known as the homogeneous Poisson process. It is known that the probability of some arbitrary number of failures exhibits a Poisson distribution. From such reliability-based models, hazard rate functions can be constructed and used to optimize scheduled maintenance periods. However, in practice, their construction is generally a function of many variables, including design, operating conditions, environment and quality of repairs [8].

By conducting a Weibull analysis on the failure data, an insight into the dominant failure regime can often be gained [9]. Whether this regime is identified as infant mortality, random failure or wear-out, corrective maintenance actions can be implemented to improve machine reliability further.

Proportional hazard modelling (PHM), developed by Cox [10], combines a baseline hazard rate, stated as a function of time, and a hazard function based on the machine condition variables. The PHM technique was applied, by Jardine [11, 12], to aircraft and marine engine failures in which the metal concentration measurements of the engine oil were used as the condition variables. Jardine was able to show that the PHM approach was superior to the equivalent time-dependent hazard rate modelling and that, through estimating failure rates, more effective maintenance decisions could be made.

In practice, a regular inspection strategy is often employed to monitor critical machinery. Christer and Waller [13, 14] developed a technique called delay time analysis for modelling the consequences of such a policy. The delay time is defined as the time period from the point at which a defect is noticeable to the point at which the defect causes a failure. A repair is therefore possible at any time within this period. Christer and Waller were able to obtain a subjective estimate, through the use of questionnaires, of the probability density function f(h) of the delay time. Once f(h) is established, it is possible to develop subsequent models describing the relationships between time and other relevant variables such as expected downtime and operating costs.

Using historical reliability data to predict the future performance of similar machines requires an assumption that the historical and current performance are highly correlated. In reality, this is rarely the case and, instead, individual assessments of a machine's health are more effective in identifying when a problem exists. This can be achieved through the use of strategic machinery health-monitoring techniques and the intelligent use of predetermined alarm limits. Fitch [15] highlighted four methodologies in setting such alarm limits: goal-based, ageing, rate of change and statistical limits.

Moubray [16] examined condition-monitoring failure patterns and highlighted the use of the PF interval. A PF interval is the time taken for a machine to reach functional failure F from the point P at which condition monitoring could have identified a potential problem with the machine. Provided that the condition-monitoring interval is no greater than the PF interval, a functional failure should never occur without warning. In practice, however, the PF interval is difficult to quantify and for critical machinery, with a short PF interval, continuous monitoring may be justified. From the above review of some relevant developments, it is evident that two models are required: one to describe when the component deteriorates, and the other to relate the degree of deterioration to the condition of the equipment by the use of monitoring procedures.
Proc Instn Mech Engrs Vol 214 Part E E00899 IMechE 2000
2.2 Prediction model theory

A theoretical solution is presented in the following section to three problems commonly encountered while practising condition monitoring: setting appropriate limits, identifying when next to monitor the machinery, and predicting when to change or repair the machine after a problem has been identified. The theory is based on the assumption that the life of a machine can be divided into two distinct regions, namely a stable zone and a failure zone, which can be distinguished by the observed condition-monitored measurements.

Figure 1 shows typical vibration analysis data as a function of time, taken from a hot strip mill condition-monitoring programme for a hydraulic pump system which subsequently failed. Failure in this context represents a change in the machine's state such that it can no longer be relied upon to perform the function for which it was originally intended.

A general failure condition is self-evident, which may be represented schematically as shown in Fig. 2. In the stable zone, the machinery is functioning correctly and condition-monitoring measurements, assuming that they reflect the true health of the machine, are varying randomly about an average value. The variance is probably due to process changes between successive measurements and/or condition measurement error. When the condition measurements start to deviate significantly from these values, it soon becomes apparent that a problem has occurred, and it can be said that the machine has entered the failure zone. The function of the alarm limits is to identify when this deviation becomes significant and, hence, when the failure zone is entered. Observations made by the present authors on numerous sets of failure data indicate that, while in the failure zone, it is reasonable to assume that the machine's health degrades as an exponential function. It should be noted, however, that not all machines exhibit this characteristic but, in the first instance, the model will be developed on the basis of this assumption.

Fig. 2 General failure pattern

2.2.1 Setting alarm limits

The first part of the model is concerned with ensuring the earliest identification of a problem through setting appropriate alarm limits. As stated previously, current condition-monitoring programmes rely mainly on alarm limits being pre-set using manufacturers' recommendations, British Standards, e.g. BS 7854 [17], and operators' personal experience. Unfortunately, based on observations arising from the current research programme, it is evident that these alarm limits are often set either too high or too low. This results in either too frequent alarm reports, which ultimately are ignored due to complacency, or an even worse situation whereby a machine fails without any warning, because the alarm limit was not exceeded. Clearly, there is a balance between the number of false alarms and the earliest detection of a problem, which is also influenced by the amount of time required to maintain the malfunctioning machine in relation to the rate of machine deterioration.

Based on these observations, it became clear that the problem of setting appropriate alarm limits may be better treated through the use of SPC. Provided that sufficient stable zone data are available, and that the measurements may be assumed to follow a normal distribution, the average and standard deviation of the stable zone condition measurements can be calculated.

If the alarm limits are defined as three standard deviations either side of the average stable zone value, SPC theory states that 99.73 per cent of subsequent measurements
should fall within this band. If the condition measurements lie outside these limits, then the measurement has become unstable and it may be assumed that the machine condition has entered the failure zone.

In order to develop any form of prediction model, a machine life model must be described. It is assumed that a machine begins its life at the installation point I and is in a good functional state. Given that there is sufficient time, during which the machine's condition is relatively constant, the machine eventually experiences a problem and begins to deteriorate until functional failure point F is reached, the point at which the machine can be said no longer to fulfil its specified requirements.

If it is assumed that the condition-monitoring measurements reflect the machine's health, then, when the machine is operating correctly, the measurement is equal to an average value, defined as the lower limit (LL). A functional failure is defined by the condition measurements reaching a specific value, the upper limit (UL), above which the machine no longer operates. Once the alarm limit (AL), the point at which the condition-monitoring measurements identify a problem, has been defined, the potential failure point P can also be introduced to the model, as illustrated in Fig. 3. This description of machine deterioration is similar to that used by Moubray [16]. However, the plots have been developed to resemble more closely the overall vibration monitoring deterioration profiles commonly found in the steel industry.

The potential failure point is used to define the transition from the stable to the failure zone. The stable zone time is defined as IP, the time from machine installation to the potential failure point. This contains largely condition-monitoring measurements which are randomly varying around the lower limit. When the condition measurements exceed the alarm limit, it is assumed that the machine has entered the failure zone and will deteriorate, at an exponential rate, towards a functional failure after time PF. For comparison, the PF notation is similar to that used by Moubray [16] in describing the time from the point of a problem being detected to failure.

If the IP and PF time intervals are shown to be constants, then the total time to failure (TTF) of the machine, at any time t, is given by

However, it is generally recognized that estimating the IP and PF intervals is difficult since they are often not time dependent and rarely accurately approximated by a constant value. Instead, it is more realistic to represent these intervals using a suitable distribution. The following Weibull distribution offers the most benefits due to its flexibility in describing many types of distribution, and its mathematical simplicity:

\[ F(t) = 1 - \exp\left\{ -\left[ \frac{\mathrm{Int} - \gamma}{\eta} \right]^{\beta} \right\} \qquad (2) \]

where

F(t) = cumulative probability
t = elapsed time since machine installation
$\beta$ = shape parameter
$\gamma$ = location parameter
$\eta$ = characteristic life parameter
Int = interval time (either IP or PF)

$\beta$, $\gamma$ and $\eta$ are constants derived from an analysis of the machine's historical failures.

Rearranging equation (2) with respect to the interval time, the following expression is obtained:

\[ \mathrm{Int} = \gamma + \eta \{ -\ln[1 - F(t)] \}^{1/\beta} \qquad (3) \]

By using this expression to define the time intervals IP and PF, TTF now becomes

\[ TTF = \gamma_{IP} + \eta_{IP} \{ -\ln[1 - F(t)] \}^{1/\beta_{IP}} + \gamma_{PF} + \eta_{PF} \{ -\ln[1 - F(t)] \}^{1/\beta_{PF}} \qquad (4) \]
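The Weibull relations (2) to (4) can be evaluated numerically with a few lines of Python. The sketch below assumes the three-parameter form F = 1 − exp{−[(Int − γ)/η]^β} and takes the total time to failure as the sum of the IP and PF intervals evaluated at the same cumulative probability. The function names and the parameter values are hypothetical and are not the fitted values from the study.

```python
import math


def weibull_cdf(interval, beta, eta, gamma=0.0):
    """Equation (2): cumulative probability for a given interval time."""
    if interval <= gamma:
        return 0.0
    return 1.0 - math.exp(-(((interval - gamma) / eta) ** beta))


def weibull_interval(prob, beta, eta, gamma=0.0):
    """Equation (3): interval time at which the cumulative probability is prob."""
    if not 0.0 <= prob < 1.0:
        raise ValueError("prob must lie in [0, 1)")
    return gamma + eta * (-math.log(1.0 - prob)) ** (1.0 / beta)


def total_time_to_failure(prob, ip_params, pf_params):
    """Equation (4): IP and PF intervals at the same cumulative probability."""
    return weibull_interval(prob, **ip_params) + weibull_interval(prob, **pf_params)


# Hypothetical Weibull parameters in days (assumptions for illustration only).
ip_params = {"beta": 2.0, "eta": 500.0, "gamma": 0.0}
pf_params = {"beta": 1.5, "eta": 100.0, "gamma": 0.0}

for prob in (0.1, 0.5, 0.9):
    ttf = total_time_to_failure(prob, ip_params, pf_params)
    print(f"F(t) = {prob:.1f}: TTF = {ttf:.0f} days")
```

Evaluating the expression at several values of F(t), as in the loop, gives a range of failure times rather than a single point estimate.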
and, if t > IP, the following equation is derived:

\[ TTF = \gamma_{PF} + \eta_{PF} \{ -\ln[1 - F(t)] \}^{1/\beta_{PF}} \qquad (6) \]

The condition-monitoring measurements may be thought of as a switch, which moves the TTF prediction from the stable to the failure zone calculations. It is argued that this switch approach would give a better indication of the remaining life of the machine than relying solely on a reliability analysis of functional failure times.

However, it is possible to obtain an even better prediction by using more of the available data. In the stable zone the machine age has yet to be incorporated. Similarly, the condition-monitored measurements have so far only been utilized to trigger a switch. The development of a failure zone model could result in a condition-based prediction model which employs the condition measurement data more efficiently. In the next two sections the means whereby these additional data can be incorporated to achieve a better prediction of the failure time are described.

2.2.3 Failure prediction in the stable zone

In the stable zone, condition-monitoring measurements provide little information except to confirm that the machine's health is fine. Therefore, the only data on which to base a prediction of the time to failure are its historical reliability. Equation (5) gave one solution, which predicted the ultimate functional failure from the installation point I. However, it did not account for the machine surviving to the current time. In this section, another approach will be developed which predicts the likelihood of failure over a period of time, given any time in the stable zone.

While in the stable zone, the chance of a functional failure in a specified time period, $(n - 1)\,dt$ to $n\,dt$, is simply a function of the probability of reaching the potential failure point, P(reaching potential point in time Y), and the probability of reaching the functional failure point in the remaining time, P(reaching functional failure in time Z), as illustrated in Fig. 5. Therefore, to calculate the probability of failure in the increment, all individual failure combination probabilities are summed, as follows:

\[ P(\text{functional failure}) = \sum P(\text{potential failure in } Y) \, P(\text{functional failure in } Z) \qquad (7) \]

In theory, there are an infinite number of possible failure combinations and hence, to reduce this number, it
is assumed that it takes a small amount of time, dt, to move from the stable to the failure zone. The failure combination probabilities for the first four time increments are shown in Fig. 6.

It is noticed that, for each time increment, another failure combination is possible and, due to the assumption that it takes one increment to move from the stable to the failure zone, there is no chance of a failure in the first increment. Hence, to calculate the probability of functional failure in the interval, the probability of functional failure during each increment is calculated and summed as follows:

\[ P(\text{functional failure in interval}) = \sum_{n=2}^{n = \text{interval}/dt} P(\text{functional failure in increment}) \qquad (8) \]

To find the probability of reaching a potential failure point, the Weibull distribution hazard rate of the IP interval is used. The hazard rate h(t) is defined as the probability of failure over the next time increment dt, assuming that no failure has occurred up to current time t. This is expressed as

\[ h_{IP}(t) = \frac{F_{IP}(t + dt) - F_{IP}(t)}{1 - F_{IP}(t)} \qquad (9) \]

To calculate the probability of functional failure in the time remaining from the potential failure point, the cumulative distribution function for the PF Weibull distribution is used:

\[ P(\text{functional failure in } Z) = F(z\,dt) - F[(z - 1)\,dt] \qquad (10) \]

Hence, the probability of an overall functional failure between 0 and 1 dt is zero.

The probability of functional failure between 1 dt and 2 dt is

\[ \frac{F_{IP}(t + dt) - F_{IP}(t)}{1 - F_{IP}(t)} \, F_{PF}(dt) \]

The probability of functional failure between 2 dt and 3 dt is

\[ \frac{F_{IP}(t + dt) - F_{IP}(t)}{1 - F_{IP}(t)} \, [F_{PF}(2\,dt) - F_{PF}(dt)] + \frac{F_{IP}(t + 2\,dt) - F_{IP}(t + dt)}{1 - F_{IP}(t)} \, F_{PF}(dt) \]

The probability of functional failure between 3 dt and 4 dt
\[ \frac{F_{PF}(t + dt) - F_{PF}(t)}{1 - F_{PF}(t)} \]

Fig. 7 Failure zone model
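The stable-zone summation described in section 2.2.3 can be sketched in Python as follows. The sketch assumes that a potential failure occurring in increment y of the stable zone is followed by a functional failure z = n − y increments later, so that the probability for increment n is the sum over y of the conditional IP term and the PF term of equation (10), and that these per-increment values are then summed from n = 2 as in equation (8). The function names and the Weibull parameters are hypothetical.

```python
import math


def weibull_cdf(x, beta, eta, gamma=0.0):
    """Three-parameter Weibull cumulative probability (assumed form)."""
    if x <= gamma:
        return 0.0
    return 1.0 - math.exp(-(((x - gamma) / eta) ** beta))


def increment_failure_probability(n, t, dt, ip, pf):
    """Probability of functional failure between (n - 1) dt and n dt from age t."""
    survival = 1.0 - weibull_cdf(t, **ip)  # machine still in stable zone at age t
    total = 0.0
    for y in range(1, n):  # potential failure occurs during increment y
        p_potential = (
            weibull_cdf(t + y * dt, **ip) - weibull_cdf(t + (y - 1) * dt, **ip)
        ) / survival
        z = n - y  # increments left for the failure zone to run its course
        p_functional = weibull_cdf(z * dt, **pf) - weibull_cdf((z - 1) * dt, **pf)
        total += p_potential * p_functional
    return total


def interval_failure_probability(interval, t, dt, ip, pf):
    """Equation (8): sum the increment probabilities from n = 2 to interval/dt."""
    n_max = int(round(interval / dt))
    return sum(
        increment_failure_probability(n, t, dt, ip, pf) for n in range(2, n_max + 1)
    )


# Hypothetical parameters in days (assumptions for illustration only).
ip = {"beta": 2.0, "eta": 500.0, "gamma": 0.0}
pf = {"beta": 1.5, "eta": 100.0, "gamma": 0.0}

p_28 = interval_failure_probability(28.0, t=100.0, dt=7.0, ip=ip, pf=pf)
print(f"P(functional failure within 28 days of day 100) = {p_28:.4f}")
```

With n = 1 the inner loop is empty, which reproduces the statement that no functional failure is possible in the first increment.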
ability of functional failure. However, in the failure zone, equation (14) is used with F(t) being equal to the acceptable risk:

\[ \mathrm{Interval} = \left[ \gamma_{PF} + \eta_{PF} \{ -\ln(1 - \mathrm{Risk}) \}^{1/\beta_{PF}} \right] \left[ 1 - \frac{\ln\{ [X(t) - LL]/(AL - LL) \}}{\ln[ (UL - LL)/(AL - LL) ]} \right] \qquad (15) \]

Alternatively, given a pre-set condition-monitoring interval, equation (15) can be rearranged to indicate the probability of functional failure before the next monitoring time is reached. This could provide useful information in making a decision either to maintain a machine immediately, or to allow it to run until the next planned maintenance period.

of the next condition-monitoring measurement will be obtained and a further assessment of the machine's health is conducted.

4 PREDICTION MODEL TRIALS

To test and validate the maintenance model the following studies were undertaken:

(a) a simulated assessment of the model where the data inputs are always known;
(b) specific case studies of real machines used within British Steel, to examine the performance of the model with historical failures.
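Returning to the monitoring interval of equation (15), a minimal Python sketch of the calculation is given below. It assumes that the Weibull PF term, evaluated at the acceptable risk, multiplies the bracketed fraction of the failure zone still to run, which is what an exponential rise of the measurement X(t) from AL to UL implies. The function name and every numerical value are hypothetical.

```python
import math


def monitoring_interval(x_t, risk, beta_pf, eta_pf, gamma_pf, ll, al, ul):
    """Time to the next measurement for an acceptable risk of functional failure."""
    if not ll < al < ul:
        raise ValueError("limits must satisfy LL < AL < UL")
    if not 0.0 < risk < 1.0:
        raise ValueError("risk must lie strictly between 0 and 1")
    if x_t < al:
        raise ValueError("expression applies only once the alarm limit is reached")
    # PF interval at the acceptable risk, from the PF Weibull distribution.
    pf_at_risk = gamma_pf + eta_pf * (-math.log(1.0 - risk)) ** (1.0 / beta_pf)
    # Fraction of the failure zone already used up, on a logarithmic scale.
    progress = math.log((x_t - ll) / (al - ll)) / math.log((ul - ll) / (al - ll))
    return pf_at_risk * max(0.0, 1.0 - progress)


# Hypothetical limits in mm/s and PF parameters in days (illustration only).
limits = {"ll": 2.0, "al": 2.4, "ul": 10.0}
pf_weibull = {"beta_pf": 1.5, "eta_pf": 100.0, "gamma_pf": 0.0}

for reading in (2.4, 4.0, 8.0):
    days = monitoring_interval(reading, risk=0.1, **pf_weibull, **limits)
    print(f"X(t) = {reading:.1f} mm/s: next measurement within {days:.1f} days")
```

The interval shrinks as the reading approaches the upper limit, which mirrors the narrowing failure zone predictions discussed in the case studies.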
$\beta$    2.91        1.03
$\eta$     526 days    222 days
$\gamma$   0 days      0 days

Fig. 13 Probability of failure between monitoring intervals
Subsequent measurements were far lower but no maintenance action was reported.

4.2.1 Hydraulic system 3, pump 2

In studying the condition-monitoring measurements of pump 2, shown in Fig. 14, it is observed that a maintenance action, conducted when the machine was 480 days old, improved the health of the pump and thereby prolonged its life. Clearly, this type of operational maintenance has not been incorporated into the prediction model. To do this would be very complex and require substantial quantities of maintenance data. However, even with these restrictions, the model is still able to adapt and perform reasonably well in predicting the estimated pump demise at 532 days. Examining the results presented in Fig. 15, the familiar broad stable zone predictions are clearly distinguishable from the more focused failure zone predictions. The very narrow prediction at 430 days is due to the high recorded measurement.

It could be argued that pump 2 was replaced too early and could have continued operation for a little while longer, possibly with additional condition measurements to track the impending failure. Figure 16 shows that the last condition measurement resulted in a predicted chance of functional failure during the next monitoring interval, 14 days, as 12 per cent. An 88 per cent chance of survival to the next monitoring time is arguably a worthwhile risk. However, since these pumps are critical to the HSM
operation, a 12 per cent chance of operational disruption is undesirable. It may also be the case that maintenance scheduling played an important role in deciding when pump 2 was changed. Clearly, the prediction model enables a more scientific approach to be used in assessing the risk of machine failure. However, the final maintenance action decision will also be strongly influenced by other external factors such as scheduling, criticality, cost, spares, environmental and safety impact. Therefore, the need for a maintenance planner will still continue.

4.2.2 Hydraulic system 3, pump 3

In examining the condition-monitoring measurements of pump 3, in Fig. 17, it is noticed that a number of abnormalities exist. A significant time period with no recorded measurements is observed at machine age 1036–1216 days, a consequence of the fact that the pump is on standby. A high measurement, in excess of 20 mm/s, is present, which would normally indicate that the machine had functionally failed. However, as the latter measurements indicate, this high measurement was not normal and probably occurred as a result of process and/or measurement error.

The abnormally high reading cannot be processed by the prediction model, except to indicate that the pump has functionally failed and, hence, does not appear in the results given in Fig. 18. The predicted time range of functional failure only narrows when the condition
stable zone and failure zone. While in the stable zone, condition measurements are normal and, hence, a reliability-based model is employed. When condition measurements indicate the existence of a problem, both reliability and condition-monitoring information are combined to predict the remaining machine life.

Both simulated and real case studies were investigated to test the model's performance and highlight some of its implementation difficulties. Arising from these studies it is evident that the prediction model is dependent on the quality and accuracy of the condition-monitored measurements.

It is anticipated that the model will enable a more systematic approach to assessing the risk of machine failure and be applicable to most condition-monitored situations, in which the failure lead time is sufficient and the condition-monitoring measurements reflect the machine's true health. However, the final maintenance action decision will inevitably depend on other external factors such as scheduling, criticality, cost, spares, environmental and safety impact.

ACKNOWLEDGEMENTS

The authors would like to thank Dr B. J. Hewitt, Director, Technical, and Mr E. F. Walker, Manager, Technical Co-ordinator, Welsh Technology Centre, British Steel Strip Products, for permission to publish this paper and acknowledge the support of the Engineering and Physical Sciences Research Council. Thanks are also due to Port Talbot hot strip mill PCM department and the Llanwern FMMS department for their help and contribution to this project.

REFERENCES

1 Goode, K. B., Roylance, B. J. and Moore, J. Development of predictive model for monitoring of hot strip mill. Iron Steelmaking, 1998, 25(1), 42–47.
2 Goode, K. B., Roylance, B. J. and Moore, J. The development of a predictive model for condition-based maintenance in a steel works hot strip mill. In Proceedings of the JOAP International Condition Monitoring Conference, Mobile, Alabama, 1998, pp. 203–218.
3 Wheeler, D. J. and Chambers, D. S. Understanding Statistical Process Control, 1990, pp. 577 (SPC Press).
4 Weatherill, G. B. and Brown, D. W. Statistical Process Control: Theory and Practice, 1990 (Chapman and Hall, London).
5 Asher, H. and Feingold, H. Repairable Systems Reliability, 1984 (Marcel Dekker, New York).
6 Van Alven, W. H. Reliability Engineering, 1964 (Prentice-Hall, Englewood Cliffs, New Jersey).
7 Davidson, J. F. Reliability of Mechanical Systems, IMechE Guides for the Process Industry, 1988 (Mechanical Engineering Publications, London).
8 Sherwin, D. J. Improved schedules by using data collection under preventative maintenance. IEEE Trans. Reliability, 1984, R33(4), 315–320.
9 Bloch, H. P. and Geitner, F. K. An Introduction to Machinery Reliability Assessment, 1990, pp. 33–34 (Van Nostrand Reinhold, New York).
10 Cox, D. R. and Lewis, P. A. W. The Statistical Analysis of Series of Events, 1966 (John Wiley, New York).
11 Jardine, A. K. S. and Anderson, P. M. Use of concomitant variables for reliability estimation. Maintenance Managmt Int., 1985, 5, 135–140.
12 Jardine, A. K. S., Anderson, P. M. and Mann, D. S. Application of the Weibull proportional hazard model to aircraft and marine engine failure data. Qual. Reliability Engng Int., 1987, 3, 77–82.
13 Christer, A. H. and Waller, W. M. Delay time models of industrial inspection maintenance problems. J. Opl. Res. Soc., 1984, 35(5), 401–406.
14 Christer, A. H. and Waller, W. M. Reducing production downtime using delay time analysis. J. Opl. Res. Soc., 1984, 35(6), 499–512.
15 Fitch, J. C. Proactive and predictive strategies for setting oil analysis alarms and limits. In Proceedings of the JOAP International Condition Monitoring Conference, Mobile, Alabama, 1998, pp. 370–378.
16 Moubray, J. RCM II, 1991 (Butterworth-Heinemann, Oxford).
17 BS 7854 Mechanical Vibration: Evaluation of Measurements on Non-rotating Parts (British Standards Institution, London).