
Neural networks in multivariate calibration

Tutorial Review

Frédéric Despagne and D. Luc Massart*


ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090
Brussels, Belgium
Received 17th July 1998, Accepted 28th August 1998

1 Introduction
2 Principle of neural networks
3 Neural networks in multivariate calibration
3.1 When to use neural networks
3.2 Alternative methods
3.2.1 Linear methods
3.2.2 Non-linear methods
3.3 Advantages and limitations of neural networks
3.3.1 Flexibility of neural networks
3.3.2 Neural networks and linear models
3.3.3 Robustness of neural networks
3.3.4 Black-box aspect of neural networks
4 Development of calibration models
4.1 Data pre-processing
4.1.1 Detection of non-linearity
4.1.2 Detection of outliers
4.1.3 Number of samples
4.1.4 Data splitting and validation
4.1.5 Data compression
4.1.6 Data scaling
4.2 Determination of network topology
4.2.1 Number of layers
4.2.2 Number of input and output nodes
4.2.3 Number of hidden nodes
4.2.4 Transfer function
4.3 Training of the network
4.3.1 Learning algorithms
4.3.2 When to stop training
4.3.3 Model interpretation
5 Conclusion
6 Acknowledgements
7 References

Frédéric Despagne obtained an engineering degree from the École Nationale Supérieure de Chimie et Physique de Bordeaux and a postgraduate diploma in Materials Science from the Université de Bordeaux in 1994. He was then sponsored by Elf Aquitaine to do research in the Chemometrics group of Professor Brown at the University of Delaware. In 1996 he joined the research group of Professor Massart at the Vrije Universiteit Brussel, where he is currently studying for a PhD. His research interests are in multivariate calibration and artificial intelligence.

1 Introduction

Artificial neural networks (NNs) have now gained acceptance in numerous areas of chemistry, as illustrated by the number of applications mentioned by Zupan and Gasteiger1 in their review. In the 1996 Chemometrics fundamental review,2 NN
applications were reported in sections concerning signal
processing, curve resolution, calibration, parameter estimation,
QSAR, pattern recognition and of course artificial intelligence.
Tutorials on NNs in chemistry were proposed by Smits et al.3
and Svozil et al.4 (the latter contains an extensive list of Internet
resources for NNs) and different types of applications of NNs to
spectroscopy were reviewed by Cirovic.5
This tutorial is restricted to the application of NNs for
multivariate calibration with chemical data, which is an
important source of publications in chemometrics. The potential
of NNs as modelling tools for multivariate calibration is well
established, and efforts must now focus on developing proper
methodologies to ensure that NNs are always used in ideal
conditions; this is the goal of this tutorial. Bos et al.6 presented
an excellent overview of practical aspects of NNs in quantitative analysis. Most of these aspects will be presented again here,
in particular in order to establish the terminology, and we will
include some recommendations according to recent results
obtained in NN research. We will restrict ourselves to NNs of
the multi-layer feed-forward type (also called multi-layer
perceptron, MLP) with the error back-propagation learning rule
that is the most popular.
The tutorial is organised as follows. In Section 2, we remind
readers how NNs came on to the scene and explain their basic
principles. Section 3 is dedicated to the possibilities offered by
NNs to analytical chemists. We present some of the most
general aspects of NNs (flexibility, black-box aspect) and
emphasise their main limitations. In Section 4 we consider more
technical aspects and propose a methodology for the development of calibration models with NNs. A non-negligible part of
this methodology is dedicated to data handling, and we try to
outline the pitfalls specific to NN modelling. We consider
topology optimisation and introduce techniques that can help in
developing and interpreting NN models. The different aspects
discussed in the tutorial are illustrated with examples of
applications from the literature.
Professor D. Luc Massart teaches analytical chemistry at the Pharmaceutical Institute of the Vrije Universiteit Brussel, where he was appointed in 1968. He is the author of several books on chemometrics.

Analyst, 1998, 123, 157R–178R

2 Principle of neural networks


NNs stem from the field of artificial intelligence. An early
motivation for developing NNs was to mimic some unique
characteristics of the human brain, such as the ability to learn
general mechanisms from presentation of a reduced set of
examples, or to retrieve correct information from missing or
distorted input data. NNs currently used in applied sciences
have little in common with their human counterparts and the
scope of their possible applications is more restricted. Research
is still being carried out to establish links between neurobiology
and artificial intelligence, but a description of NNs by analogy
with biological concepts, although fascinating, can lead to an
erroneous perception of NNs as mysterious intelligent machines. In the framework of multivariate calibration, we will
consider NNs in a more pragmatic way and in a first
approximation define them as non-parametric, non-linear regression estimators.7 Non-parametric methods are those that do not rely on an a priori assumption of a specific model form.
NNs allow one to estimate relationships between one or
several input variables called independent variables or descriptors and one or several output variables called dependent
variables or responses. Information in an NN is distributed
among multiple cells (nodes) and connections between the cells
(weights). An example of an MLP is displayed in Fig. 1, for a
model with four descriptors x1, x2, x3, x4 and a single response
y.
The descriptors are presented to the NN at the input layer and then weighted by the connections wAij between the input and hidden layers. Hidden layer nodes receive the weighted signals from the input nodes simultaneously and perform two tasks: a summation of the weighted inputs, followed by a projection of this sum on a transfer function fh, to produce an activation. In turn, hidden node activations are weighted by the connections wBj between the hidden and output layers and forwarded towards the nodes of the output layer. Similarly to hidden nodes, output nodes perform a summation of incoming weighted signals and project the sum on their specific transfer function fo. In Fig. 1 a single response y is modelled and the output layer contains only one node. The output of this node is the estimated response ŷ, which can be expressed as
ŷ = fo( θB + Σ_{j=1..nh} wBj fh( θAj + Σ_{i=1..nd} wAij xi ) )          (1)

where nd and nh are the number of input variables and hidden nodes, respectively.
Although NNs can be considered as non-parametric tools, the
models that they yield are defined by sets of adjustable
parameters determined by an algorithm, not a priori by the user.
Adjustable parameters are the weights wAij, wBj and the biases θA, θB that act as offset terms by shifting the transfer functions horizontally. They are determined with an iterative procedure
called training or learning. The adjustable parameters are first
ascribed initial random values, then training starts and proceeds
in two steps. First, a forward pass [Fig. 1(a)] is performed
through the NN with a set of training samples with known
experimental response y. At the end of the pass, the magnitude
of the error between experimental and predicted responses is
calculated and used to adjust all weights of the NN, in a backpropagation step [Fig. 1(b)]. These two steps constitute an
iteration or epoch. A new forward pass is then performed with
the training samples and the optimised parameters. The whole
procedure is repeated until convergence is reached. This means
that a pre-specified or acceptably low error level is reached.
Training an NN is an optimisation problem, where one seeks
the minimum of an error surface in a multi-dimensional space
defined by the adjustable parameters. Such surfaces are
characterised by the presence of several local minima, saddle
points or canyons. It must be accepted that the NN will probably
not find the absolute minimum of the error surface, but a local
minimum relatively close to the absolute minimum and
acceptable for the problem considered. The most popular
algorithm to adjust weights during training is the gradient
descent algorithm based on the estimation of the first derivative
of the error with respect to each weight.8
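The training loop described above (a forward pass plus a back-propagation weight update, repeated over epochs) can be sketched as follows for a one-input network; the topology, learning rate and toy sine data are illustrative assumptions, not values from the tutorial.

```python
import numpy as np

# Sketch of gradient-descent training with error back-propagation for a
# 1-input network with nh hidden tanh nodes and a linear output node.
rng = np.random.default_rng(1)
X = np.linspace(-2.0, 2.0, 40)
Y = np.sin(X)                          # toy non-linear relationship
nh, lr = 5, 0.05                       # illustrative topology and rate
wA, thA = rng.normal(size=nh), rng.normal(size=nh)
wB, thB = rng.normal(size=nh), 0.0

for epoch in range(2000):              # one epoch = pass over all samples
    for x, y in zip(X, Y):
        a = np.tanh(wA * x + thA)      # forward pass: hidden activations
        err = (wB @ a + thB) - y       # prediction error
        da = err * wB * (1.0 - a**2)   # back-propagated through tanh
        wB, thB = wB - lr * err * a, thB - lr * err
        wA, thA = wA - lr * da * x, thA - lr * da

pred = lambda x: wB @ np.tanh(wA * x + thA) + thB
rmse = np.sqrt(np.mean([(pred(x) - y) ** 2 for x, y in zip(X, Y)]))
```

The final rmse measures how closely the iterative optimisation has approached a (possibly local) minimum of the error surface.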
The most important feature of NNs applied to regression is
that they are universal approximators: they can fit any
continuous function defined on a compact domain (a domain
defined by bounded inputs) to a pre-defined arbitrary degree of
accuracy.9 We will now see why this characteristic can be
particularly attractive in analytical chemistry.

3 Neural networks in multivariate calibration


3.1 When to use neural networks

Fig. 1 Feed-forward NN training: a, forward pass; b, error back-propagation.


For analytical chemists, a calibration model relates a series of instrumental measurements to the concentration or some
physico-chemical properties of one or several target analytes.10
NNs can be used to build empirical multivariate calibration
models of the form: Y = F(X) + e. We will only consider inverse
calibration models, for which X designates a matrix of analytical
measurements performed on a series of n samples. For a given
sample, measurements are described by a set of descriptors, xi,
for instance, absorbance values at a given set of wavelengths. Y
is a vector or a matrix containing sample responses, for instance,
the concentrations of a target analyte in a set of mixtures.
Calibration sample responses are often determined experimentally with reference methods such as the wet chemistry Kjeldahl method for the determination of protein content in wheat, or research engines for the determination of gasoline octane number.
NNs should be primarily used when a data set is known or
suspected to be non-linear. (From the mathematical point of
view, a truly non-linear model is non-linear with respect to its
parameters. We will also consider as non-linear the models
where the non-linearity appears in the relationship between the
response and the descriptors. In analytical chemistry, the
distinction between true and apparent non-linearity cannot
always be performed since most non-linearities are detected
visually on calibration lines or in model residuals.) Several
types of non-linearity can be observed with sensor or spectroscopic data.11,12 For instance, the Beer–Lambert law that
linearly relates the absorbance of a species in a mixture to its
concentration is an approximation that is only valid for dilute
and non-saturated systems. Deviations from linearity can occur
if a sample is highly absorbing or non-homogeneous, if the
particle size is not constant in all samples (for crystalline
species) or if some signals are overlapping. A non-linear
detector response (especially with photoconductive detectors)
or the presence of stray light (due to imperfections in the optics
of a spectrometer) introduces curvature in the concentration
response function. These non-linear effects can sometimes be
corrected with appropriate pre-processing such as first or
second derivative, multiplicative scatter correction (MSC) or
standard normal variate correction (SNV) [the last two are pre-processing techniques generally used to remove particle size or scatter effects from near-infrared (NIR) spectra13]. Pre-processing has limitations, however: derivatives reduce the signal-to-noise ratio; in some situations the spectra are apparently
corrected by a mathematical pre-treatment but non-linearity is
introduced in the wavelength space (for instance, SNV is not a
linear transformation). Other non-linear effects can be observed
as a result of chemical factors such as non-symmetrical
chemical equilibrium, intermolecular reactions, intermetallic
reactions in electrochemistry, presence of humidity inducing
hydrogen bonding, changes in temperature or solvent composition. These effects result in a shift and broadening of the
absorption bands.
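SNV, mentioned above, is simple to express in code: each spectrum is centred and scaled by its own mean and standard deviation. A minimal sketch, using made-up three-point "spectra" where the second row is the first with doubled multiplicative scatter:

```python
import numpy as np

# Standard normal variate (SNV) correction: each spectrum is centred
# and scaled by its own mean and standard deviation.
def snv(spectra):
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

X = np.array([[0.10, 0.20, 0.30],    # illustrative spectrum
              [0.20, 0.40, 0.60]])   # same shape, 2x scatter
Xc = snv(X)
# After SNV both rows coincide (the multiplicative effect is removed),
# but the transformation is not linear in the original variables.
```

This also illustrates the caveat in the text: because each row is divided by its own standard deviation, SNV is not a linear transformation of the wavelength variables.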
In some situations, the X–Y relationship is known to be
intrinsically non-linear even if it cannot be explicitly derived.
This is the case, for instance, with the relationship between the
NIR spectrum and flex modulus of an elastomer,11 or the
relationship between octane number and NIR spectra or
chromatograms of gasolines, most gasolines containing hundreds of different hydrocarbons with non-linear blending
characteristics.14 NNs are ideal tools for such problems since
they can theoretically map any measurable linear or non-linear
function.
NNs can also be applied when no a priori indication
concerning the nature of the relationship to model is available
and a model is needed rapidly. This is a situation where non-parametric statistical inference can help to delve into complex
multivariate problems. However, we recommend that one
always starts by modelling new data with one of the well known
linear regression methods that give satisfactory results on most
calibration data [multiple linear regression (MLR), principal
component regression (PCR) or partial least squares (PLS)]. A
decision as to whether an NN model is necessary can be based
on the examination of linear model residuals: a curvature or a
trend in the residuals is indicative of an un-modelled source of
variance, generally a non-linearity. (With PCR and PLS, one
must look for non-linearity in residuals in the first few
dimensions only since the inclusion of a large number of
components will generally mask the presence of non-linearity.)
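The residual check recommended above can be illustrated with a toy example: fitting a straight line by least squares to data containing a mild quadratic term leaves a systematic curvature pattern in the residuals instead of random scatter. The data below are invented for illustration.

```python
import numpy as np

# Examining linear-model residuals for curvature. A straight line is
# fitted to toy data with a mild quadratic component; the structure
# left in the residuals signals an un-modelled non-linearity.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 0.5 * x**2                  # response with quadratic part
A = np.column_stack([np.ones_like(x), x]) # design matrix: intercept + x
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef
# Curvature pattern: residuals are positive at both ends and negative
# in the middle, rather than scattering randomly around zero.
```

Plotting these residuals against x (or against the fitted values) makes the trend obvious and supports the decision to move to a non-linear model.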
Finally, NNs can be recommended for monitoring on-line
processes, where measured variables are likely to be blurred
with noise and perturbations such as temperature effects15 that

introduce non-linearity into the model. In such situations, the flexibility of NNs and their ability to maintain a decent
performance even in the presence of significant amounts of
noise in the input data are highly desirable. Several workers
have obtained good results with NNs in the presence of noise in
the analytical measurements or in the responses from the
calibration data.16–18 Long et al.19 found that the addition of
increasing levels of random noise to the training data did not
significantly affect the model. In contrast to most techniques, the performance of NNs degrades gradually with increasing levels of random noise in the training data or with deletion or perturbation of an increasingly large number of weights.4,20
This remarkable property can be attributed first to the signal-averaging effect of small random deviations in the two summation terms in eqn. (1).19,21 In addition, the dense
interconnectivity between nodes is a non-localised form of
information storage that acts as a security against component
damage.22,23 It also implies that adding more nodes to an NN
should make it more robust with respect to random noise. This
temptation must be resisted because the data sets are seldom
large enough to avoid the under-determination problem encountered with an oversized NN: too few data points are available
compared with the number of adjustable parameters. Even with
a limited number of connections, the distribution of information
in an NN allows one to reduce the influence of random noise.
Another reason why NNs are more robust than several other
techniques in the presence of random noise in training data is
that they can build non-linear models from a limited number of
descriptors whereas techniques such as PLS or PCR accommodate the non-linearities by including higher order components that are likely to be blurred with noise.24
Table 1 contains a list of references that illustrate the
broadness of the application field of MLP NNs in multivariate
calibration.

3.2 Alternative techniques


3.2.1 Linear methods. In order to assess the usefulness and
main limitations of NNs, it is interesting to have an overview of
alternative tools often applied in multivariate calibration. The
most popular methods remain MLR, PCR and PLS. The
attraction of MLR lies in the ease of model interpretation since
the estimated parameters relate the property of interest linearly
to a set of original variables. In PLS or PCR, components used
for modelling are linear combinations of original variables. The
projection of samples on the reduced subspace spanned by the
first components allows the visualisation of outliers, atypical
samples and clusters among the objects. These methods are
based on the minimisation of a least squares criterion, similarly
to NNs. In fact, if linear transfer functions are used in both
hidden and output layers, the two successive linear combinations performed in the NN are equivalent to a single MLR
regression. If a linear data set is modelled with an NN using
linear transfer functions, it will converge to the MLR solution.
The difference between MLR and NNs lies in the way the model
parameters are estimated: by matrix inversion in MLR and by
iterative optimisation in NNs.
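The collapse of two successive linear combinations into a single linear model can be verified numerically. In the sketch below (with arbitrary random weights), the composed network reproduces an MLR-style model whose coefficients and intercept follow directly from the layer weights.

```python
import numpy as np

# With linear (identity) transfer functions in both layers, the two
# successive weighted sums collapse into one linear model.
rng = np.random.default_rng(2)
nd, nh = 4, 3
wA, thA = rng.normal(size=(nh, nd)), rng.normal(size=nh)
wB, thB = rng.normal(size=nh), rng.normal()

def nn_linear(x):                 # NN with identity transfer functions
    return wB @ (wA @ x + thA) + thB

b = wB @ wA                       # equivalent MLR-style coefficients
b0 = wB @ thA + thB               # equivalent intercept

x = rng.normal(size=nd)
same = np.isclose(nn_linear(x), b @ x + b0)   # agree up to round-off
```

The only remaining difference from MLR is, as stated in the text, how the parameters are estimated (matrix inversion versus iterative optimisation), not the functional form of the model.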
Eqn. (1) can also represent a PCR or PLS model when the
transfer functions are linear.25 The weights between input and
hidden layer are equivalent to the X-data loadings on the
different factors and the activation produced by the hidden
nodes can be compared with PCR or PLS scores. Again, the
difference lies in the way the parameters are optimised. In NNs,
adjustable parameters are fitted without restriction to minimise
the calibration samples squared residuals. In PCR or PLS,
constraints such as score orthogonality, maximisation of X-data variance (PCR) or X–Y covariance (PLS) are also taken into
account, and therefore the parameters obtained are different.
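For comparison, a minimal PCR sketch (on random illustrative data) shows the constrained route: the response is regressed on a few orthogonal component scores of X rather than on freely fitted parameters as in an NN.

```python
import numpy as np

# Principal component regression sketch: regress y on the first k
# orthogonal scores of X. X and y are random, illustrative data.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=30)

Xc, yc = X - X.mean(axis=0), y - y.mean()      # column-centre the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
T = Xc @ Vt[:k].T                              # scores on k components
b = np.linalg.lstsq(T, yc, rcond=None)[0]      # regression on scores

G = T.T @ T   # score orthogonality: off-diagonal elements are ~0
```

The rows of Vt play the role of the loadings mentioned above; the orthogonality of the scores is exactly the kind of constraint that NN training does not impose.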

Although they are linear methods, MLR, PCR or PLS can be used for the modelling of some specific types of non-linear data.
If the form of the non-linear relationship between the response
and the descriptors is known, a model can be linearised by
taking the appropriate transform of the original variables,26 or
by adding higher order and cross-terms to the regression
equation. In practice, the number of situations where these
approaches are successful is limited, mainly because the exact
form of the non-linear relationships is not known a priori and
the number of calibration samples available is not sufficient to
fit a complex model with a large number of cross-terms. It is
also known to PCR and PLS practitioners that in some cases
these methods can accommodate non-linear relationships by
using higher order components to correct for partial non-linearities.27 However, there is a risk of introducing a significant
amount of irrelevant information in the model.
3.2.2 Non-linear methods. Non-linear variants of PCR or
PLS also exist (polynomial PCR,28 quadratic PLS29). Their
main limitation is that they are based on the assumption that a
simple (e.g., quadratic) relationship exists between the response
modelled and the components. This assumption is sometimes
violated since components are already linear combinations of
original variables.30 Locally weighted regression (LWR) is
based on the decomposition of a global non-linear model in a
series of local linear PLS or PCR models. It was found to
perform well in multivariate calibration, especially on clustered data sets.31,32 However, in LWR a data set cannot be described with a unique set of components and loadings, since each sample is fitted with a local model built with its nearest neighbours only. One must also accept the risk that local model parameters are less stable than global model parameters, since they are estimated with a reduced set of objects.
Other techniques exist for non-linear regression but they are
not yet as popular as NNs and the above-mentioned techniques.
A review of non-parametric non-linear regression methods
[alternating conditional expectations (ACE), smooth multiple
additive regression technique (SMART), non-linear partial least
squares (NLPLS), classification and regression trees (CART),
multivariate adaptive regression splines (MARS) and spline
partial least squares (SPL-PLS)] can be found in the report of
Sekulic et al.33 and in Frank's tutorial.34 These methods can
perform well on non-linear data but are computationally more
complex than linear methods and share with NNs the limitation
of being prone to overfitting. Their performance also depends
heavily on the amount and quality of data available.34
3.3 Advantages and limitations of neural networks
3.3.1 Flexibility of neural networks. We have seen that NNs
are not the only tools to handle non-linear multivariate data.
However, their flexibility is often a decisive asset compared
with parametric techniques that require the assumption of a
specific hard model form. Hard models cannot be developed
with NIR data owing to the significant overlap of combination
and overtone bands in the spectra. Other types of analytical data

Table 1 Examples of application of NNs to multivariate calibration (columns: property modelled; descriptors; ref. Row alignment was lost in extraction; the recovered column contents are listed below.)

Properties modelled: alditols in binary mixtures (%); apparent metabolic energy of barley; components in simulated binary and ternary mixtures; active ingredients in drugs; components in simulated binary and ternary mixtures; components in rhodamine mixtures; components in simulated binary mixtures; active ingredients in drugs; protein in wheat; [H2O] in meat; flex modulus of polymers; ethanol in mixtures containing latex; fat in pork meat; mineral charge in polymer; gasoline octane number; [KOH] in polyether polyols; constituents in paper coatings; [OH], [NH] and grind size in cereals; methanol in water mixtures; composition of organic extract; hydroxyl in cellulose esters (%); property of polymer pellets; solvents in aqueous process stream; aromaticity of brown coals; colour change in emulsion paints; RNA, DNA or lysozyme in binary mixtures containing glycogen; bacteria in ternary mixtures; adulteration of cow's milk with goat's or ewe's milk; penicillin and buffer ion concentrations in solutions with different buffer ion concentrations; urea and glucose in solutions at different pH; [SO2] and relative humidity of water vapour in sample gas; Cu/Zn in a simulated two-component system with formation of intermetallic compounds; Cu/Pb/Cd/Zn in an experimental four-component system; ionic concentrations in mixtures; characteristics of the physical structure of polymer yarns; metals in Fe/Ni/Cr systems; Fe/Ni in thin films; gasoline octane number.

Descriptors: 1H NMR spectra; measured physical and chemical characteristics of barley; UV/VIS spectra (in some cases with simulated instrumental perturbations or non-linear effects); near-infrared spectra (measured or simulated); Fourier transform infrared spectra; measured concentrations of oxide ingredients; pyrolysis mass spectra; measurement signals of enzyme field effect transistor flow injection analysis; frequency response of a piezoelectric crystal gas sensor; anodic stripping voltammograms; measurements from ion-selective electrode arrays; parameters describing mechanical properties of the yarns; X-ray fluorescence spectra; gas chromatograms.

References cited in the table: 12, 14, 16–19, 23–25, 30, 35, 38, 41, 60, 61, 64, 65, 67, 70, 72, 86, 96–99.

(e.g., UV/VIS spectra) are more easily interpretable from the spectroscopic point of view, but the a priori specification of a
hard model rarely incorporates the non-linear effects that may
occur in practice. Non-linearity in a data set can be detected
with graphical methods but identification of its source is more
challenging and sometimes impossible. Thanks to their ability
to learn and derive X–Y relationships from the presentation of a
set of training samples, NNs avoid the time-consuming and
possibly expensive task of hard model identification. In
addition, the fundamental principle of distributing information
among several weights and nodes renders the NN model robust
with respect to random noise in the input data (as already
explained) and allows one to have several NNs with different
topologies converging to qualitatively equivalent results.
If one is not careful, however, a drawback of the flexibility of
NNs is their tendency to overfit calibration data and the
resulting lack of generalisation ability, that is, the capability of
a model to produce a valid estimate of the correct output when
a new input is presented to the NN. Also, the flexibility of NNs
can lead to unreliable results in situations of extrapolation.
Although NNs proved to perform better than PLS on extrapolated non-linear data in some applications,24 they were found to
be equivalent to or less reliable than methods such as MLR,
PCR, PLS or LWR in comparative studies of calibration
methods where extrapolations occurred.35,36 The dangers of
strong extrapolation with NNs are illustrated in Fig. 2(a)–(c),
which show results obtained for the modelling of a cosine
function with different numbers of hidden nodes and test points
(+) outside the calibration domain. The calibration domain
contains the X-values in the range [−2, +2].
The NN builds an empirical model to fit objects in the
calibration space only and test points are badly predicted. With
analytical data, such strong extrapolations rarely occur and one

generally has the situation represented in Fig. 2(d), where the prediction error is less dramatic. It is possible to use NNs to
perform small or mild extrapolations on such non-linear data
but NNs, like any other chemometric technique, should not be considered generally suitable for extrapolation.
3.3.2 Neural networks and linear models. One may wonder
what happens if an NN is used to model a linear data set. For
instance, a model may be wrongly considered as non-linear
owing to an incorrect estimation of linear PLS or PCR model
complexity. It is also tempting to take advantage of the
flexibility of NNs and let them do the work with any kind of
data, even when they should be linear.
From the point of view of prediction, if the data are linear an
NN with non-linear transfer functions should nevertheless
converge to a solution that approximates a linear model
solution, since the linear portion of the transfer functions can be
activated in that case (see Fig. 3).
This was confirmed by the results of a recent comparative
study carried out to evaluate the performance of several linear
and non-linear modelling methods on real industrial data.32
Each of the four industrial data sets consisted of a series of NIR
spectra (X-data) and a specific property to be predicted (Y-data).
Some results of this comparative study are listed in Table 2,
which contains the root mean square error of prediction
(RMSEP) values obtained with stepwise MLR, PCR, PLS and
NN.
NNs outperform linear methods for the strongly non-linear
data set, which is not surprising, but their performance on
slightly non-linear and linear data is comparable to the
performance of linear methods such as PLS or PCR. This is in
agreement with the observations of Gemperline et al.,12 who stated that 'artificial neural networks having the appropriate architecture can be used to develop linear calibration models that perform as well as linear calibration models developed by PCR or PLS'.

Fig. 2 NN predictions within and outside calibration space: a–c, cosine function; d, quadratic function. Model with a, three; b, six; c, nine; and d, two hidden nodes. o, actual training; +, actual test; *, predicted.
It was said that when NNs are used to model linear relationships, they require a long training time, since a non-linear technique is applied to linear data.33 In theory this is true
in the sense that the apparently linear portion of the non-linear
transfer functions is not perfectly linear, and therefore the
learning algorithm must perform continuous adjustments to
correct for this slight deviation. For a perfectly linear and noise-free data set, the NN performance tends asymptotically towards
the linear model performance and it generally converges to the
intrinsic precision of the computer. However, in this case the
curve of NN error as a function of the number of iterations is
almost perfectly flat and an acceptable solution can be reached
relatively early during the training. Moreover, perfectly linear
and noise-free data sets are seldom available so that in practice
NNs can reach a performance qualitatively similar to that of
linear methods in a reasonably short training time.
In spite of these reassuring results, it does not make sense
intuitively to apply a complex and possibly time-consuming
method when simpler tools are likely to perform as well. MLR
with stepwise variable selection can give excellent prediction
results on linear data sets (see Table 2) and its interpretation
properties for the analyst are optimal compared with all other
methods. In practice, using a highly flexible tool to model linear
phenomena can lead to rapid overfitting of the measurement
noise. Artefacts can also occur if the topology of the NN is not
carefully designed. As an illustration, Fig. 4 shows distortions
appearing when a perfectly linear and noise-free model is fitted
with an NN containing too many hidden nodes and a non-linear
node instead of a linear node in the output layer.
3.3.3 Robustness of the models. NNs are sometimes
recommended for their robustness,37 but this term is rarely
defined with precision. Unlike analytical procedures for which
official definitions of the term exist, there is not a unique
definition of the robustness of a multivariate calibration model,
as illustrated by some controversial statements.38,39 It seems
reasonable to follow Frank and Todeschini's40 definition of
robustness in the framework of regression analysis: robust
methods are those methods that are insensitive to small
deviations from the distributional assumptions. This definition

applies in particular to methods designed to cope with outliers present in the calibration set. Methods to detect or handle
outliers are presented in Section 4.1.2. Robustness of an NN is
also challenged when predictions are performed on new
samples outside the calibration domain in the X-space or in the
Y-space. We underlined in Section 3.3.1 that NNs often perform
relatively poorly in situations of extrapolation.
In all these situations, deviations from a priori assumptions
(data set free of outliers and of systematic noise) affect the
training samples. Some authors consider robustness from a
different perspective, in situations where a model has been
developed with training data that fulfil initial assumptions but
perturbations affect new objects to be predicted.38,41 Different
types of perturbations must be considered. The appearance of
higher levels of random noise in the test samples is usually not
catastrophic.42 Derks et al.38 related quantitatively the variance
of predicted responses to the variance of random noise added to
the input variables. The influence of instrumental perturbations
that have a more systematic effect than random noise (e.g.,
baseline or wavelength shift) can be more catastrophic and is
difficult to anticipate. Indeed, it depends on a number of
parameters: the curvature of the relationship between each descriptor and the response, and the position of the perturbed samples on the descriptor axes. When a strongly non-linear
relationship is being modelled, the NN can have either an
attenuating effect with respect to perturbations (compared with
linear models) because of the squashing effect of the non-linearity, or a catastrophic effect on high leverage points, as
illustrated in Fig. 5.
Since the exact shape of the model and the position of future
samples in input space cannot always be known, a solution
consists of identifying possible sources of degradation and
including them either in the training set42 or in the monitoring
set.41 This allows one to avoid large prediction errors after the
appearance of small perturbations that can be expected in
practice.
3.3.4 Black-box aspect of neural networks. NNs can
perform at least as well as any other technique in terms of
prediction, but a major criticism remains their black-box aspect.
To be fair, it should be pointed out that this limitation is not
peculiar to NNs only. For instance, it is often impossible to
visualise clusters and outliers by projecting scores on component axes in LWR since the samples belong to local models

Fig. 3 Usual non-linear transfer functions: hyperbolic tangent; sigmoid.

Table 2 RMSEP of different multivariate calibration methods applied to industrial data

Property y                            Nature of data        MLR      PCR      PLS      NN
Moisture in wheat                     Linear                0.1860   0.2147   0.2150   0.1981
Hydroxyl number of polyether polyol   Linear                0.90     1.15     1.31     0.88
Octane number of gasoline             Slightly non-linear   0.1355   0.1426   0.1461   0.1459
Mineral charge in a polymer           Strongly non-linear   0.0797   0.0477   0.0445   0.0096
Analyst, 1998, 123, 157R–178R

defined with different objects. However, model interpretation with an NN is still considered much more complex than with,
e.g., PLS or PCR. This is due to the operations (summation and
projection on transfer function) performed successively in the
hidden and output layer, that prevent one from deriving simple
analytical expressions between input and output variables [see
eqn. (1)]. In addition, unlike QSAR applications, where input
variables are heterogeneous original variables, the input
variables used in multivariate calibration are often scores
compressing spectral information, which complicates even
further model interpretation. Methods to ease model interpretation will be presented in Section 4.3.3, but it is clear that
model interpretability remains an active research area for the
NN community and the danger of incorrect inference (common
to all non-parametric techniques) must not be overlooked.

4 Development of calibration models


We will now examine in more detail the way in which an NN
model should be developed, according to our experience. The
different steps in method development are summarised in the
flow chart in Fig. 6.
It will come as no surprise that data pre-processing (Fig. 6,
left) governs closely the quality of results that can be expected.
We propose some tools to help in optimising parameters such as
the number of input variables or the number of hidden nodes.
NN construction (Fig. 6, right) is based on alternating removal
of input and hidden nodes, starting from a large NN. The
procedure described in this flow chart is very general and of
course other strategies are applicable. Short cuts can be made
through the flow chart by including a priori knowledge, or as
the user acquires more experience with topology optimisation.

Fig. 4 Predictions for linear model with incorrect NN topology.

Fig. 5 Attenuation or amplification of Y-prediction error in a non-linear model compared with a linear model, depending on the sign of the error in X.

4.1 Data pre-processing


4.1.1 Detection of non-linearity. As a general rule, one
should not try to build an NN model unless the situation is one
of those mentioned in Section 3.1. Therefore, some diagnostic
tools are necessary to detect the presence of non-linearity in a
data set. The simplest approach, which in many cases is sufficient to detect the presence of non-linearity, is to plot the
property of interest versus the different measurement variables,
or combinations of these variables such as PC scores. If these
plots are inconclusive then one should build a linear model with
MLR, PCR or PLS. Visual inspection of the residuals (y − ŷ) of the linear model versus each descriptor xi retained in the model, versus the experimental response y and versus the estimated response ŷ should then be performed to detect non-linearities.
Recently, Centner et al.43 reviewed a number of more sophisticated graphical and numerical methods to detect non-linearities. They cited the Mallows' augmented partial residual plot (MAPRP) combined with a runs test as the most promising approach for non-linearity detection. The MAPRP is the plot of the term (e + bixi + biixi²), called augmented residuals, versus xi. The e are residuals of the linear regression y = f(x1, …, xi, …, xn, xi²). The regression should be performed on all variables xi in
the model (original variables or principal component scores).
Curvature in the MAPRP plot indicates that higher variables xj
(j > i) correct for the non-linear (quadratic) nature of the
relationship between y and the variable xi. In that case the
variable xj is undesirable because it makes the model less robust.
The runs test is used to detect series of residuals with the same
sign, called runs. Long runs indicate the presence of a trend in
residuals that may be a systematic bias or non-linearity. From
the total number of positive and negative residuals, one
calculates a z-value that is compared with a tabulated value. A
significant value of |z| indicates a trend in the residuals. As an
illustration, we performed the detection of non-linearity between the NIR spectra of a series of 104 diesel oil samples and their viscosity. We built a 10-component PCR model, and for each principal component (PC) we looked at the MAPRP plot
combined with the runs test. For PC2, PC3 and PC4, the |z|
values indicate a non-linearity between the augmented residuals
and the variable (Fig. 7).
A limitation of the MAPRP plot is that it allows only the detection of non-linearities that can be described or approximated by a quadratic term.
Centner et al.43 emphasised the need for careful outlier
detection before drawing conclusions about the presence of
non-linearity in a data set. Outliers with high leverage can pull
the regression line and lead to an incorrect estimation of the
number of runs. Conversely, some outlier detection methods
can wrongly flag as outliers samples that are high leverage
points responsible for non-linearity in the data.33 This will be
illustrated in the next section.
4.1.2 Detection of outliers. Actually, the term 'outlier detection' encompasses two steps: first, atypical object detection, followed by outlier identification. Although numerical
methods allow flagging of samples that are outliers on statistical
grounds, the positive identification of an atypical object as a true
outlier requires knowledge of the process or data acquisition
procedure, or interaction with the person in charge of this
acquisition. It is recommended to keep all flagged samples
unless they are positively identified as outliers on experimental
grounds.
It is beyond the scope of this paper to review all methods for
outlier detection proposed in the literature, but we will suggest
a few guidelines. One must make a distinction between different
types of outliers. Outliers in X can be due to accidental process
upsets, experimental errors during acquisition of spectra or
transcription errors during the labelling of samples or file
manipulation. Outliers in Y are due to incorrect measurements of reference values or transcription errors also. Atypical objects,
i.e., possible outliers in X or in Y, can be flagged before
performing any modelling. By contrast, outliers in the XY
relationship can only be detected after building a complete
model.
The simplest tool to flag atypical objects before modelling is
the visual observation of the X and Y data available. One should
look at the original set of sample spectra, the vector of responses


and score plots on the first PCs. To detect outliers in the X space,
it is recommended to examine the leverage of each sample to
detect possible outliers. The leverage of a sample is a measure
of its spatial distance to the main body of the samples in X.44 For
a given data matrix X, the leverage of sample i is given by the
diagonal term pii of the prediction matrix P, also called Hat
matrix:
P = X(XᵀX)⁻¹Xᵀ   (2)

Fig. 6 Strategy for construction of NN model: left, data handling; right, network construction.
When there are more variables than objects in X, the prediction matrix must be calculated with the matrix TA of sample scores
on the A first significant PCs:
P = TA(TAᵀ TA)⁻¹ TAᵀ   (3)

High leverage points have large values of pii (diagonal elements of the P matrix) and special attention should be paid to these
points. They have a strong influence on parameter estimation
and can alter the model dramatically if they happen to be true
outliers. The limitation of this approach is that it is not

Fig. 6 Continued

straightforward to determine A. Several methods (see Section 4.2.2) can be applied to perform this determination.45–49
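The leverages of eqn. (3) are easily computed from the scores on the first A PCs. The following is a small sketch using SVD; the function name and toy data are ours:

```python
import numpy as np

def leverages(X, a):
    """Leverage p_ii of each sample from the scores on the first `a` PCs.

    X is mean-centred, scores are obtained by SVD, and the leverage is the
    diagonal of P = TA (TA' TA)^-1 TA'  [eqn. (3)].
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :a] * s[:a]                       # scores on the first a PCs
    P = T @ np.linalg.inv(T.T @ T) @ T.T       # prediction (Hat) matrix
    return np.diag(P)

# toy data: the last sample lies far from the main body of the samples
X = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0], [5.0, 5.2]])
h = leverages(X, a=1)
```

The leverages sum to the number of retained PCs, so a sample whose leverage approaches 1 on a one-PC model dominates that direction entirely.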
An alternative approach for the a priori detection of atypical
objects in X is to apply Grubbs' test on Rao's statistic.50 Rao's statistic D²(k)(yi) is a value calculated for each sample i and for each PC k. It accumulates all variations described by PCs k + 1 to p. For each k, Rao's statistic is used as input to flag possible outliers in X with the univariate Grubbs' test.
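This flagging step can be sketched compactly, assuming the PC scores are already available. The function names are ours, and the comparison of G with the tabulated Grubbs' critical value for the chosen significance level is left to the user:

```python
import numpy as np

def rao_statistic(scores, k):
    """Rao's D2(k): for each sample, the variation accumulated on
    PCs k+1 .. p (sum of squared scores on the discarded components)."""
    return np.sum(scores[:, k:] ** 2, axis=1)

def grubbs_statistic(x):
    """One-sided Grubbs statistic G = (max(x) - mean(x)) / s; G is then
    compared with a tabulated critical value for the chosen alpha."""
    return (np.max(x) - np.mean(x)) / np.std(x, ddof=1)

# toy scores: sample 0 carries a large variation on the discarded PCs
scores = np.array([[5.0,  0.1,  4.0],
                   [4.8,  0.2,  0.1],
                   [5.1, -0.1,  0.2],
                   [4.9,  0.0, -0.1]])
d2 = rao_statistic(scores, k=1)   # keep PC1, accumulate PC2 and PC3
G = grubbs_statistic(d2)
```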
After flagging possible outliers in X or in Y, one must check
if these samples are outliers in the XY relationship. Centner
et al.50 proposed a procedure based on the development of PLS
leave-one-out cross-validation models after flagging possible
outliers with a Grubbs' test performed on Rao's statistic.
The goal of the cross-validation is to discriminate situations
where a true outlier alters the models, resulting in a large
cumulative cross-validation error, from situations where the
large value of the cross-validation error is simply due to the
incorrect prediction of a high leverage point that is not an
outlier. A limitation of this approach is that the identification is
based on linear cross-validation models (it will be explained in
Section 4.1.4 why cross-validation should not be performed
with NNs). A sample that is an outlier to a linear model might
not be an outlier to a non-linear model.33 The final decision
should be made on the basis of a comparison of prediction
results for NN models with and without the flagged samples in
the training set.
To illustrate the difficulty of outlier detection in non-linear
models, we report in Fig. 8 a PC scores plot for the NIR data set
used to model viscosity of diesel oil. Applying Grubbs' test on Rao's statistic, the sample marked with an asterisk was identified as an atypical object. Using leave-one-out cross-validation on PLS models, the flagged sample (which has the
highest Y-value in the data set) is positively identified as an
outlier to the PLS model. If we compare the PLS and NN test
results depending on the inclusion or not of this flagged sample
in the training set, we obtain the RMSEP values reported in
Table 3.


When the flagged sample is included in the training set, the NN performance in prediction improves whereas the PLS
performance degrades. This illustrates how non-linear information can be extracted by the NN from a high leverage sample
that is not an outlier.
Since outlier detection is not always successful, it is possible
to design NNs that can handle outliers present in the training set.
For instance, Walczak51 proposed to use error thresholding
functions adjusted iteratively during training with respect to the
median of residuals. Wang et al.52 also applied a thresholding
function adjusted with respect to the assumed proportion of
outliers among the ranked residuals. In both approaches, the
idea is to prevent outlier residuals from influencing weight
estimations during training.
4.1.3 Number of samples. The number of samples available
is often a limiting factor when using NNs. Like other regression
methods, there are constraints concerning the number of
samples required to develop an NN model. The number of
adjustable parameters is usually such that the training set is
rapidly overfitted if too few samples are available. We consider
that when this number is less than 30, an alternative modelling
technique should be considered. Unfortunately, this is not
always obvious for inexperienced users, who can be deceived
by the extreme flexibility of NNs since they can fit the training
data with arbitrary precision. It is possible to obtain excellent
training results for the modelling of data sets with less than 15
samples. However, if these models are validated on new
independent samples, a significant degradation of the results is
observed due to a lack of generalisation ability.
To estimate the minimum number of training samples
allowing theoretical generalisation, one can use a parameter
called the Vapnik–Chervonenkis dimension (VCDim). For an
MLP with one hidden layer, the lower bound of the VCDim is
approximated as twice the total number of weights in the NN.53
It is possible to reach good generalisation if the number of

Fig. 7 Mallows' augmented partial residual plots for PCR models of diesel oil viscosity: PC1; PC2; PC3; PC4.

training samples is at least equal to this lower bound. When the number of samples available does not fulfil this requirement,
NN can still be used to find an acceptable local minimum close
enough to the absolute minimum of the error function.
However, the ratio of the number of samples to the number of
adjustable parameters should be kept as high as possible, in
order to avoid under-determination of the problem. The number
of samples is generally imposed or limited by practical
constraints, but one can partly solve the under-determination
problem by reducing the number of weights in the NN as much
as possible, as will be explained in Section 4.1.5.
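The rule of thumb above is easy to put into a small helper. Here we count the biases among the adjustable parameters, which is an assumption on our part:

```python
def n_weights(n_in, n_hidden, n_out=1):
    """Adjustable parameters (weights and biases) of a one-hidden-layer MLP."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

def min_training_samples(n_in, n_hidden, n_out=1):
    """Approximate lower bound of the VCDim, taken as twice the number of
    weights (see text), used as a rough minimum training-set size."""
    return 2 * n_weights(n_in, n_hidden, n_out)

# the 4-3-1 NN of Fig. 1 has (4+1)*3 + (3+1)*1 = 19 adjustable parameters
```

Even this small network would thus call for several tens of training samples, which illustrates why the sample-to-parameter ratio degrades so quickly.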
4.1.4 Data splitting and validation. An important step in the
development of any calibration model is the splitting of the
available data into two subsets: a training set (used to estimate
model parameters) and a validation set or test set (used to check
the generalisation ability of the model on new samples). For
NNs the problem is more complex because they fit to arbitrary
precision the training data, provided that the number of hidden
nodes is sufficient and the training time long enough. Therefore,
an additional monitoring set is necessary to stop the training
before the NN learns idiosyncrasies present in the training
data.4,54,55 The monitoring set must be representative of the
population under study in order to avoid NN overtraining that
leads to overfitting (see Fig. 9).
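The monitoring-set logic can be written as a generic early-stopping loop. The sketch below is not tied to any particular NN library: `step` and `error` are placeholder callables supplied by the user, and the toy single-weight "network" at the end exists only to exercise the loop:

```python
def train_with_monitoring(net, train_data, monitor_data, step, error,
                          max_iter=5000, patience=50):
    """Generic early stopping: keep the state with the lowest monitoring
    error and stop once it has not improved for `patience` iterations."""
    best_err, best_state, since_best = float("inf"), None, 0
    for _ in range(max_iter):
        step(net, train_data)              # one training iteration, in place
        err = error(net, monitor_data)     # error on the monitoring set
        if err < best_err:
            best_err, best_state, since_best = err, net.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:     # monitoring error keeps rising
                break
    return best_state, best_err

def _toy_step(n, data):     # "training" drives the weight steadily downwards
    n[0] -= 0.05

def _toy_error(n, data):    # monitoring error is minimised at w = 0.5
    return (n[0] - 0.5) ** 2

net = [2.0]
best, err = train_with_monitoring(net, None, None, _toy_step, _toy_error,
                                  patience=10)
```

Training overshoots the monitoring optimum, but the returned state is the snapshot taken at the lowest monitoring error, which is the behaviour sketched in Fig. 9.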
Ideally, for a number nt of training samples, the monitoring
set and the test set (if it is available) should contain between nt/2
and nt samples each. The repartition of samples between these
sets and the terminology used in several papers are the source of
many confusions. When prediction errors are reported in the
literature concerning NNs, are the authors referring to training
error, monitoring error or validation error? The performance of
an NN should not be judged by its performance on training data
that can always be fitted perfectly. Often, the problem is to
know whether the reported results have been obtained on a
monitoring set or a validation set. Data sets are seldom large
enough to be split into three subsets, so that authors often report
results on a monitoring set that they call the 'validation set' or 'test set'. There is no reason why results obtained on a
monitoring set could not be reported, as long as it is made clear
that these results were obtained on the data set used to evaluate


the training end-point. One must be aware of the limitations of this approach: a true validation error is a better estimator of the
NN generalisation ability than a monitoring error.4 If one
decides to favour the modelling power of the NN by using only
two subsets (training and monitoring) instead of three subsets of
smaller size (training, monitoring and validation), very good
results may be obtained on the monitoring set but the model has
not been truly validated in the sense that the monitoring data
were used to optimise one of the model parameters (number of
iterations for training). However, the monitoring results can be
considered as indicative of the modelling power to expect from
the NN model, and they can be compared with, e.g., PLS results
with cross-validation. We summarise the comparison between
different situations for PLS and NNs in Fig. 10.
Some authors mention leave-k-out (often k = 1) cross-validation as a way of estimating the generalisation ability of the
NN, for instance, when only few calibration samples are
available.3,6,56 We believe that this approach is not adapted to
NNs37,41,54 and we do not recommend it. The procedure can be
suitable for parametric linear models characterised by a
quadratic bowl-shaped smooth error surface. With such models,
the perturbation caused by the removal of one or a few samples
from the training set has little influence on the model
parameters, and therefore the cumulative cross-validation error
obtained is a reliable validation error estimate for the model
constructed with all samples. The situation changes for NNs
applied to non-linear problems characterised by complex error
surfaces.53 Unlike PLS or PCR, which are constrained to
produce orthogonal components, no constraint is imposed on NN adjustable parameters and the NN tends to perform a point-by-point fit of all training samples. Solutions obtained when two
different samples are removed from the training set can differ
significantly from each other.4 In this case one cannot consider
that the global model is validated, and it is even possible that
none of the models developed during cross-validation describe
the same region of the error surface as the global model.
Therefore, if too few calibration samples are available to create
a monitoring set, it is better to consider an alternative method to
NNs.
Ideally, the monitoring and validation set should be independent of each other and of the training set. This can only be
achieved if the samples in each of these subsets are selected
randomly. However, it is important to include as many sources
of variance as possible in the training set. If not, extrapolation
may occur in the prediction phase and this should be avoided
with any modelling method. Specific algorithms can be used to
select training samples that are representative of the total
population and contain high leverage points that carry information about the main sources of variance. A limitation of this
approach is that the subsets selected are no longer independent
since mathematical criteria are applied to discriminate training
samples from the other samples. It is important to keep

Score plot of diesel oil samples.

Table 3 Influence of the presence of a single training sample on RMSEP obtained with PLS and NN models

Method   RMSEP (flagged sample          RMSEP (flagged sample
         not in training set)           in training set)
PLS      0.31                           0.39
NN       0.28                           0.23

Fig. 9 Typical evolution of training and monitoring errors as a function of number of iterations.


this restriction in mind when results are reported. We will now present some algorithms to perform automatic subset selection.
The D-optimality criterion selects the n calibration samples
that provide regression coefficients with the lowest variance of
all the subsets Xn of n samples. Selection is performed by
maximising the determinant of the information matrix (XnTXn).
When the number of samples available is large, Ferré and Rius57 proposed the use of Fedorov's exchange algorithm to select the
D-optimal subset. Samples selected with this criterion are
located at the border of the calibration domain. If a small
number of samples are retained, the interior of the calibration
domain is not appropriately sampled and the set obtained is not
representative of the whole population.
The Kennard–Stone algorithm58 is an alternative method that
allows the selection of a subset of representative samples.
Samples are selected iteratively by maximising the Euclidean
distance between the last selected point and its nearest
previously selected neighbour. The first samples selected with
this method are generally the same as with the D-optimality
criterion and they describe the border of the calibration domain.
As the number of selected samples increases, their repartition
becomes more homogeneous and the subset selected is more
representative of the global population.
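A straightforward (not speed-optimised) sketch of the Kennard–Stone selection follows; the function name and toy data are ours:

```python
import numpy as np

def kennard_stone(X, n_select):
    """Kennard-Stone selection: start from the two mutually most distant
    samples, then repeatedly add the sample whose distance to its nearest
    already-selected neighbour is largest (max-min criterion)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    while len(selected) < n_select:
        remaining = [k for k in range(len(X)) if k not in selected]
        # distance of each remaining sample to its nearest selected neighbour
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining[int(np.argmax(d_min))])
    return selected

# toy data: two close points near the origin, plus three spread-out points
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.5, 0.5], [0.0, 1.0]])
picked = kennard_stone(X, 3)
```

The first picks are the extremes of the domain, in line with the behaviour described above; near-duplicate samples (here sample 1) are selected last.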
These two algorithms ensure that monitoring and/or validation samples are within the domain covered by the training
samples, so that the model does not extrapolate. This type of
sample selection does not match the not-so-ideal situation
sometimes encountered in practice, where it is not guaranteed
that all new samples fall within the calibration domain. The
duplex algorithm59 allows a more realistic repartition of
samples than the two previous methods. Samples are selected in
the same way as with the Kennard–Stone method, but they are alternately assigned to the training set and the validation (or


monitoring) set. Thus, not all samples located at the border of the calibration domain are placed in the training set; some are
found in the validation set. However, if some samples at the
border of the domain are very close to each other, duplex
splitting can be misleading because each training sample will
have its nearest neighbour in the validation set. This can lead to
overfitting and over-optimistic estimation of the validation
error. For the same reason, with any splitting method all
replicates of a sample should be assigned to the same subset.
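A simplified duplex sketch is given below. Note that Snee's original algorithm seeds each subset with its own pair of mutually distant points; for brevity we seed each with a single extreme point, so this is an approximation with our own function name:

```python
import numpy as np

def duplex_split(X):
    """Simplified duplex splitting: samples are picked by the same max-min
    distance rule as Kennard-Stone but assigned alternately to two subsets,
    so that border samples end up in both the training and validation set."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    sets = [[int(i)], [int(j)]]            # seed each subset with one extreme
    remaining = [k for k in range(n) if k not in (i, j)]
    turn = 0
    while remaining:
        sel = sets[turn]
        d_min = dist[np.ix_(remaining, sel)].min(axis=1)
        sel.append(remaining.pop(int(np.argmax(d_min))))
        turn = 1 - turn                    # alternate training / validation
    return sets[0], sets[1]

rng = np.random.default_rng(0)
train_idx, valid_idx = duplex_split(rng.normal(size=(10, 2)))
```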
Sample selection is often performed in the PC space on the
scores matrix T instead of on the original matrix X, which allows
one to reduce the computational burden. To illustrate the
principle of the three selection methods (Kennard–Stone, D-optimal and duplex), we represented the sets of 30 training
samples from a non-linear data set (prediction of viscosity of
diesel oil samples from their NIR spectra) selected with each
method. We first performed a PCA decomposition of the
original X matrix (104 × 795), then the 30 training samples
were selected in the subspace spanned by the first ten PCs. Fig.
11 represents the position of the training samples (asterisks)
selected in the PC1–PC2 plane.
If one wants to compare the efficiency of several modelling
methods, samples can be selected with D-optimal or Kennard–Stone designs. If a model has to be developed for an application
for which there is no guarantee that only interpolation will be
performed, then duplex design will lead to more pessimistic but
reliable results.
It is also possible to perform the splitting after projecting the
samples on a two-dimensional map with a Kohonen NN.60,61
The advantage of such a projection is that an estimation of the
relevant number of dimensions is not required and the essential
topological features of the data set are preserved in two
dimensions, which allows rapid visualisation of the data
structure.

Fig. 10 Repartition of samples for internal and external validation with PLS and NN.


With strongly clustered data, subset selection should be performed on each cluster separately in order to ensure good
representativity between the training and test data. After data
splitting, one can apply the methods presented by Jouan-Rimbaud et al.62,63 for estimating numerically the representativity of two data sets. These methods provide indices varying
between 0 and 1 to compare direction, covariance and centroids
of two data sets.
4.1.5 Data compression. As pointed out earlier, the ratio of
the number of samples to the number of adjustable parameters
in the NN should be kept as large as possible. One way of over-determining the problem is to compress input data, especially
when they consist of absorbances recorded at several hundred
wavelengths. In addition to reducing the size of input data,
compression allows one to eliminate irrelevant information

Fig. 11 Data splitting: selection of calibration samples (*) in PC space: a, D-optimum design; b, Kennard–Stone design; c, duplex design.

such as noise or redundancies present in a data matrix.


Successful data compression can result in increased training
speed, a reduction of memory storage, better generalisation
ability of the model, enhanced robustness with respect to noise
in the measurements and simpler model representation.
The latent variables calculated with the PLS algorithm are
designed to project data points on a lower dimensional subspace
describing all relevant sources of variance. While PCs are
designed to maximise the explained variance in the X-space
only, PLS latent variables are built so as to maximise the
covariance between X and Y. Some authors have used PLS to
calculate input scores for NN training.64 However, the latent
variables are designed to conserve information linearly correlated with the response and some relevant non-linear information might be rejected in higher order latent variables that are
not retained in the model.24,65 For this reason, we do not
recommend pre-processing data with PLS before NN modelling.
The most popular method for data compression in chemometrics is principal component analysis (PCA). In addition to
summarising almost all variance in the X-matrix on a few axes
only (the PCs), it has the property that these axes are mutually
orthogonal, which allows inversion of the variancecovariance
matrix in linear regression models (PCR). Orthogonality of
input variables is not so critical for NNs that can handle
collinear input data. However, most NN applications in
quantitative analysis with spectral data use PC scores as input
variables.24,30,41,6670 For the determination of the optimum
number A of input PCs to retain, one can use the same PC
selection procedures as for PCR, although the choice is not so
critical since NN models are built iteratively by successive
optimisations of the NN topology. One possible approach
consists in performing initial calculations with a deliberately
large number of PCs and progressively reducing this number.
This point will be detailed in Section 4.2.
When compressing data with PCA, one must be aware of
some theoretical limitations. PCA is a linear projection method
that fails to preserve the structure of a non-linear data set. If
there is some non-linearity in X (or between X and Y), this non-linearity can appear as a small perturbation on a linear solution
and will not be described by the first PCs as in a linear case. A
non-linear transformation of the X-matrix or PC scores matrix
can be performed to restore the least-squares approximation
property, but the resulting non-linear PCs are strongly dependent upon the pre-selected non-linear form and may not ensure
the best representation of distances between points in the
original space.71 In practice, PC scores are often successfully
used as inputs without transformation because all relevant
information about X is usually contained in the first 15 PCs.
Alternatively, it is possible to use Fourier analysis,35,41
Hadamard transform72 or wavelet analysis73 to pre-process
spectral data before NN modelling. An attractive feature of
wavelets is their ability to describe optimally local information
from the spectrum, whereas Fourier decomposition is global. If
this localised information is related to the non-linearity present
in the data, an improvement can be expected if the input matrix
is described with wavelet coefficients instead of PC scores or
Fourier coefficients. A difficulty lies in the selection of one of
the numerous wavelet bases for spectral decomposition. A
scheme based on the optimisation of the minimum description
length (MDL) criterion in multivariate calibration was explained by Walczak and Massart.74
Whatever the compression method retained, the new subspace (PCs, Fourier coefficients, wavelet coefficients) for
sample description must be determined on the training set only.
Then the monitoring and test samples can be projected in this
subspace to calculate their scores or coefficients.
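For PCA this train-only rule looks as follows. This is a sketch via SVD with our own function names; the random matrices merely stand in for spectral data:

```python
import numpy as np

def fit_pca(X_train, n_pc):
    """Determine the compression subspace on the TRAINING set only:
    mean-centre, SVD, keep the loadings of the first n_pc PCs."""
    mean = X_train.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_pc].T                # loadings, one column per PC

def project(X, mean, loadings):
    """Project (monitoring or test) samples into the training subspace."""
    return (X - mean) @ loadings

rng = np.random.default_rng(1)
X_train = rng.normal(size=(30, 8))          # stand-in for training spectra
X_test = rng.normal(size=(5, 8))            # stand-in for test spectra
mean, P = fit_pca(X_train, n_pc=3)
T_train = project(X_train, mean, P)
T_test = project(X_test, mean, P)
```

The centring vector and the loadings are both frozen from the training set; the test scores are obtained purely by projection, never by refitting.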
4.1.6 Data scaling. Once the input variables have been
selected or calculated, one must ensure that they can be used for

efficiently estimating NN parameters. It is not necessary to mean-center input variables before training since the biases act
as offsets in the model. NN training is not based on variance
covariance maximisation, and therefore it is not necessary to
scale the different variables to unit variance, even when they are
heterogeneous. This is an advantage over methods such as PCR
or PLS that require auto-scaling when variables are of different
nature. For instance, in process control applications where some
variables are continuous and others are binary, the binary
variables can be artificially given more weight than the
continuous variables because of auto-scaling, and the model
interpretation is incorrect.
The only constraint for NNs is to scale each input variable so
that training starts within the active range of the non-linear
transfer functions. Usually, samples are range-scaled with a
linear mapping called min–max scaling. Scaling parameters must be determined on the training samples. All samples must be scaled with respect to these parameters. Let Xmin^train and Xmax^train be the extreme values of variable X in the training set, and let rmin and rmax define the limits of the range where we want to scale variable X. Any sample Xi (from the training, monitoring or test set) must be scaled to a new value Ai as follows:

Ai = [(Xi − Xmin^train)/(Xmax^train − Xmin^train)](rmax − rmin) + rmin   (4)

For NNs with sigmoid or hyperbolic tangent transfer functions, rmin and rmax are set to −1 and 1, respectively. One must also ensure that the initial weights wi0 are reasonably small to avoid saturating the transfer functions in the first iterations. We suggest setting them so that 0 < |wi0| < 0.1.
If non-linear transfer functions are used in the output layer,
the Y-values must also be range-scaled so that outputs produced
by the NN are not in the flat regions of the transfer function (see
Fig. 2). In these regions, the derivatives used for weight
adjustment are almost zero and learning stops. For a sigmoid
transfer function, range-scaling Y to [0.2, 0.8] is recommended, whereas for hyperbolic tangents range-scaling must be performed in the range [−0.8, 0.8]. In theory, when linear transfer
functions are used no range-scaling is needed since they are not
bounded. In practice, we found that in the early steps of learning
there is a risk that unscaled responses lead to divergent wild
steps for weight adjustments that can only be slowly recovered,
especially with noisy data. Therefore, we suggest also range-scaling responses to an arbitrary small range. To calculate
training, monitoring or test error one must perform an inverse
range-scaling to return the predicted responses to their original
scale and compare them with experimental responses.
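Eqn. (4) and its inverse can be sketched directly, with the scaling parameters taken from the training set only (function names are ours):

```python
import numpy as np

def fit_minmax(X_train, r_min=-1.0, r_max=1.0):
    """Scaling parameters determined on the TRAINING set only [eqn. (4)]."""
    return X_train.min(axis=0), X_train.max(axis=0), r_min, r_max

def scale(X, x_min, x_max, r_min, r_max):
    """Min-max scaling of any sample (training, monitoring or test)."""
    return (X - x_min) / (x_max - x_min) * (r_max - r_min) + r_min

def unscale(A, x_min, x_max, r_min, r_max):
    """Inverse range-scaling, to return predictions to the original units."""
    return (A - r_min) / (r_max - r_min) * (x_max - x_min) + x_min

X_train = np.array([[1.0], [3.0], [5.0]])
p = fit_minmax(X_train)
A = scale(X_train, *p)          # training data mapped into [-1, 1]
```

Note that a new sample outside the training range ends up outside [rmin, rmax]: with the toy parameters above, a value of 7.0 is scaled to 2.0, which is one way extrapolation manifests itself.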

4.2 Determination of network topology


The topology of an NN is determined by the number of layers in
the NN, the number of nodes in each layer and the nature of the
transfer functions.
Optimisation of NN topology is probably the most tedious
step in the development of a model. To understand the difficulty
of topology optimisation, let us first consider the well known
bias/variance decomposition of the mean squared error (MSE)
for regression problems. It can be demonstrated that

E(ŷ − y)² = E[ŷ − E(ŷ)]² + E[E(ŷ) − y]²    (5)

where E( ) denotes expectation with respect to the distribution
function of a pair (x, y). The first term on the right-hand side of
eqn. (5) is related to the variance in the model, whereas the
second term describes the bias introduced to counter-balance
model flexibility and avoid overfitting.7 The composite
contribution of bias and variance to the MSE in a regression model
can be represented as a function of model complexity, as in
Fig. 12.

NNs can perform unbiased estimation of the training set to
arbitrary precision and achieve asymptotic consistency. Universal
approximation has a cost, however: a truly unbiased NN
model (for instance, an NN with an infinite number of hidden
nodes) would exhibit a very large variance, would be extremely
sensitive to the idiosyncrasies in the training set and could only
perform well on noise-free data.7 To attenuate the influence of
noise that affects real analytical measurements, one has to
constrain NN topology and allow some bias in the model. This
can be done by the following means: reducing the number of
layers, nodes and connections in the NN, constraining the form
of the transfer functions or using a monitoring set to stop
training.

We have said earlier that, in a first approximation, NNs could
be defined as non-parametric models.7,38 This definition is
ambiguous, and some authors consider NNs as parametric
models.25 We can refine the definition now that we have
presented the concept of bias in NNs. A model with one non-linear
hidden node is strongly biased and reduces to a parametric
sigmoidal regression model. A model with two non-linear hidden
nodes is also strongly biased and will only fit the class of
functions that can be modelled by combining the two non-linear
transfer functions. As more hidden nodes are added to an NN,
the bias is reduced and the number of functions that can be
fitted increases exponentially. However, the term semi-parametric
seems more adapted to NNs used in multivariate calibration,
where one tries to build models as parsimonious as possible.

Fig. 12 Evolution of mean squared error as a function of the complexity of a model.

Analyst, 1998, 123, 157R–178R

4.2.1 Number of layers. The terminology used to describe
NN topology can vary according to the authors, some of them
considering the input layer as a simple buffer. We designate the
NN represented in Fig. 1 as a three-layer NN, with a 4–3–1
architecture (four input nodes, three hidden nodes, one output
node).

The theoretical property of universal approximation has been
proved for NNs with only one hidden layer.9 In practice, we
never obtained better results on calibration problems by using
two hidden layers instead of one, even if learning is sometimes
faster. A similar observation was made by other authors,3,4 and
it is therefore recommended that one uses only one hidden layer
in multivariate calibration, unless the relationship to be modelled
seems to be discontinuous.37 In this case an additional hidden
layer is necessary.

It is possible to add direct connections between the input and
output layer of an NN,24 as illustrated in Fig. 13. When the input
variables have a mixed contribution to the response (some linear
and some non-linear), direct connections can handle the linear
part and the classical NN builds the non-linear part of the model.
This approach can be interesting with NIR spectroscopic data,
where the non-linear effects observed generally correspond to
small deviations from a linear solution.24 Direct connections may
speed up the learning process and ease model interpretation in
situations where descriptors are heterogeneous. Blank and Brown30
compared the performances of NNs with and without direct
connections for the development of multivariate calibration
models with non-linear simulated data. They found that directly
connected NNs learned more quickly in the initial and intermediate
training phases, but NNs without direct connections converged to
lower calibration and prediction errors. Dolmotova et al.65 recently
compared NNs with and without direct connections for the
simultaneous determination of the concentration of three main
components in paper coating. The results obtained with both
methods were approximately similar. In theory, an NN without
direct connections can achieve the same prediction performance
as an NN with direct connections, and we therefore prefer NNs
without direct connections to reduce the number of adjustable
parameters.
4.2.2 Number of input and output nodes. Although NNs
have the ability to model multiple responses simultaneously,
it is recommended that one model only one response at a time
and therefore use a single output node.
this rule is for situations where one wants to predict several
correlated responses, such as the concentrations of different
constituents of a mixture in a closed system. In that case, all
responses can be modelled simultaneously with an NN having
one output node per response.
To set the initial number of input nodes, two approaches are
possible: the stepwise addition approach consists of starting
with a deliberately small number of input variables and adding
new variables one at a time until the monitoring and/or
prediction performance of the NN does not improve any more;
the stepwise elimination approach consists of starting with a
deliberately large number of input scores and gradually
removing (pruning) some of them until the monitoring and/or
prediction performance of the NN stops improving. Both
approaches are used in practice and no definite recommendation
can be given as to which one is better, since they both have
advantages and limitations. If PCs are selected according to
eigenvalues and the scores used as inputs, the stepwise addition
method often leads to quick and satisfactory results, because all
necessary information is usually contained in the first few PCs.
However, it can happen that most information is contained in,
e.g., PC1 to PC5, but some important additional information is
also contained in PC10. During stepwise addition, the NN
performance will stagnate or degrade between PC6 and PC9 and
there are few chances that PC10 is included in the final
model.
When stepwise elimination is performed, one must include a
deliberately large number of input variables in the initial set.
Irrelevant variables can be eliminated later, but relevant
variables that have not been included in the initial model will
not be tested subsequently. Here again, working with PC scores
as inputs is advantageous. Using classical techniques (e.g.,
Malinowski's factor indication function and reduced eigenvalue
test45 or cross-validation75), one can estimate the pseudo-rank
of the input data matrix. Then, one selects a few additional PCs

(five or six) that may account for possible non-linearity, and the
NN training can be started with this initial training set. For
calibration problems, the size of the initial set should typically
vary between 10 and 15 PCs. The drawback of the stepwise
elimination approach is that it can be extremely time consuming, if input variables are tentatively removed by trial and error,
because of the large number of possible combinations.60
In neural computation, the relevance of a variable to a model
is called its sensitivity. The optimisation of the set of input
variables can be accelerated if a method to estimate the
sensitivity of each variable is implemented. Several methods
have been proposed. The most common is often referred to as
Hinton diagrams. It consists of ascribing to each input variable
a sensitivity proportional to the average magnitude of its
associated connections in the NN, represented on a two-dimensional map by square boxes of varying size. Candidate
variables to be deleted are those with the lowest sensitivity. In
spite of its popularity, this method exhibits severe theoretical
and practical limitations.70,76 It is based on an analogy with the
classical MLR approach, where the magnitude of a regression
coefficient reflects the importance of the relationship between
the associated descriptor and the response. In an NN model,
input variables that have a linear contribution to the response
will be modelled in the linear portion of the sigmoidal transfer
function associated with small or medium magnitude weights,
whereas the non-linear variables will be modelled in the
concave portion of the transfer function associated with large
magnitude weights. Therefore, the Hinton diagram ranking
method is not based on the intrinsic relevance of a variable to a
model, but simply on the nature of its contribution to the
response. Linear input variables are systematically flagged as
unimportant even when they explicitly contribute to the model.
This approach can only give reliable results when the data set is
entirely linear, in which case there is no point in using an NN.
For the same reason, we are not in favour of training methods
based on the principle of weight decay,4 which consists of adding
to the cost function a term penalising large weights.
The approach based on estimation of saliencies is theoretically more stringent.76 The saliency of a weight is the measure
of the increase in the NN cost function caused by the deletion of
this weight. It is estimated at the end of the training. Deletion of
an individual weight wi in an NN can generally be considered as
a small perturbation. First, the change in cost function caused by
this small perturbation to the weight matrix is approximated by
a second-order Taylor series expansion. Ideally, the training is
stopped when the NN has converged to a minimum, and
therefore the change in cost function can be described using
only Hessian terms (second partial derivatives of the error
function with respect to weights) in the approximation of the
change in error. Hassibi and Stork77 proposed calculating the
saliency of a weight k as

s_k = w_k² / (2 [H⁻¹]_kk)    (6)

where H⁻¹ is the inverse of the Hessian matrix. Once the
saliency of each weight in the NN is obtained, we use the sum
of the saliencies of weights connected to input variable i to
determine the sensitivity S_i of this variable:76

S_i = Σ_k s_k    (7)

Fig. 13 Example of three-layer 4–3–1 NN with direct connections.

The saliency estimation method has already been used to


optimise NN topology in multivariate calibration.68 It can lead
to unstable results in situations where the assumptions made for
saliency estimation (small magnitude of weights, training
stopped when training error is at a minimum) are not
fulfilled.70
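The saliency computation of eqn. (6) is easy to sketch once an estimate of the Hessian of the cost function is available. The following toy example is our own code, not taken from the cited work, and uses a small diagonal Hessian for clarity:

```python
import numpy as np

def saliencies(weights, hessian):
    """Saliencies of eqn. (6): s_k = w_k^2 / (2 [H^-1]_kk)."""
    h_inv = np.linalg.inv(hessian)
    return weights ** 2 / (2.0 * np.diag(h_inv))

# Toy example: two weights and the (diagonal) curvature of the error
# surface at the minimum where training was stopped
w = np.array([0.5, 2.0])
H = np.diag([4.0, 1.0])
s = saliencies(w, H)     # the weight with the lowest saliency is the
                         # candidate for deletion (pruning)
```

With these invented numbers, deleting the first weight would increase the cost function least, so it would be pruned first.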
Two variance-based approaches for input variable sensitivity
determination were proposed recently.70 They are designed for
situations where input variables are orthogonal, which is the
case with PC scores. The methods are based on the estimation of


the individual contribution of each input variable to the variance
of the predicted response. In the first approach, this contribution
is determined by partial modelling. First, the NN is trained to
estimate the parameters of the model:
y = f(x1, x2, …, xn)    (8)

After training, the sensitivity of each input variable x_i is
calculated as the variance of the response ŷ(x_i) predicted with
the trained NN when all input variables except x_i are set to
zero:

ŷ(x_i) = f(x_i)    (9)

S_i = s²_ŷ(x_i)    (10)

In the second approach, the separate contribution of each input
variable to the variance of the estimated response is derived
from a variance propagation equation for non-linear combinations
of variables. In the case of a two-variable model (x1, x2), this
equation is

s²_y = (∂y/∂x1)² s²_x1 + (∂y/∂x2)² s²_x2 + 2 (∂y/∂x1)(∂y/∂x2) COV(x1, x2)    (11)

Since PC scores are orthogonal, the covariance term can be
neglected and the sensitivity of input variable x_i is calculated
as

S_i = (∂y/∂x_i)² s²_xi    (12)
Applying the chain rule several times, one obtains an
analytical expression that allows one to determine Si at the end
of training. The most interesting characteristic of these two
variance-based methods (partial modelling and variance propagation) is that they give extremely stable results. When NNs
with the same topology are trained with different sets of initial
random weights, they can converge to different local minima on
the error surface that are qualitatively equally good and close to
each other. In that case the two variance-based methods give
similar results, which is not always the case with Hinton
diagrams or with the saliency estimation method.
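As a minimal illustration of the partial-modelling approach of eqns. (8)–(10), the sketch below treats a trained NN as a black-box `predict` function; the function names and the toy model are our own assumptions, not the cited implementation:

```python
import numpy as np

def partial_modelling_sensitivities(predict, X):
    """Sensitivity of each input variable: variance of the predicted
    response when all other input variables are set to zero (eqns. 8-10)."""
    n_samples, n_vars = X.shape
    sens = np.zeros(n_vars)
    for i in range(n_vars):
        Xi = np.zeros_like(X)
        Xi[:, i] = X[:, i]             # keep only variable i
        sens[i] = np.var(predict(Xi))  # S_i = variance of y_hat(x_i)
    return sens

# Hypothetical trained model: strong non-linear dependence on the first
# score, weak linear dependence on the second
predict = lambda X: np.tanh(2.0 * X[:, 0]) + 0.1 * X[:, 1]
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))      # orthogonal, standardised scores
S = partial_modelling_sensitivities(predict, X)
```

The first input receives a much larger sensitivity than the second, regardless of whether its contribution is linear or non-linear, which is the advantage over Hinton diagrams discussed above.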
Once the sensitivity of each input variable has been
estimated, we recommend that one should first try to remove the
variable with the lowest sensitivity, and retrain the NN. If the
monitoring error decreases after removing the flagged variable,
it can be considered as irrelevant for the model and permanently
removed, otherwise it must be replaced and another flagged
variable must be tentatively removed. Since parsimonious
models should be preferred in multivariate calibration, we
propose the following methodology for the stepwise elimination
of input variables. Let ME(k) be the monitoring error at the kth
trial and ME(k + 1) the monitoring error at the next trial after
removal of a flagged input variable. Then,
If ME(k + 1) ≤ t × ME(k), then remove the flagged
variable;
else, replace the flagged variable and try to remove the
next variable with lowest sensitivity.
Here t is a tolerance factor that can be adjusted to different
values; we suggest t = 1.1. Increasing this factor will result in
removing more input variables from the model, at the risk of
losing some relevant sources of variance; t should not be lower
than 1, otherwise the NN could have a poor generalisation
ability.
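The elimination rule above can be sketched as a simple loop. Here `monitoring_error` stands for a full NN retraining followed by a monitoring-error evaluation; in this self-contained example it is replaced by a hypothetical look-up table of invented errors:

```python
def stepwise_elimination(variables, sensitivities, monitoring_error, t=1.1):
    """Try to remove variables in order of increasing sensitivity; a removal
    is accepted if the retrained error satisfies ME(k+1) <= t * ME(k)."""
    kept = list(variables)
    me = monitoring_error(kept)
    for v in sorted(variables, key=lambda v: sensitivities[v]):
        trial = [u for u in kept if u != v]
        if not trial:
            break
        me_trial = monitoring_error(trial)   # retrain the NN without v
        if me_trial <= t * me:               # tolerance factor t >= 1
            kept, me = trial, me_trial       # permanently remove v
    return kept

# Hypothetical monitoring errors: variable 'c' is irrelevant to the model
errors = {('a', 'b', 'c'): 1.00, ('a', 'b'): 1.02, ('a', 'c'): 1.5,
          ('b', 'c'): 1.6, ('a',): 1.9, ('b',): 2.0}
me = lambda vs: errors[tuple(sorted(vs))]
sens = {'a': 0.9, 'b': 0.8, 'c': 0.1}
final = stepwise_elimination(['a', 'b', 'c'], sens, me, t=1.1)
```

With t = 1.1, removing 'c' is accepted (the error rises only from 1.00 to 1.02), whereas removing 'a' or 'b' is rejected.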
For a given set of input variables, the NN performance will
also vary with the number of hidden nodes. Therefore,
optimisation of the number of input variables and of the number
of hidden nodes should be performed in conjunction: at each
step, one should optimise the number of input variables, then the
number of hidden nodes, then optimise again the number of
input variables and proceed so until the monitoring error stops
decreasing.
4.2.3 Number of hidden nodes. A study performed by Tetko
et al.55 suggested a fairly wide tolerance of NNs to the number
of hidden nodes, provided that overtraining be avoided with an


external validation set. However, an upper bound on the number
of hidden nodes is of the order of the number of training samples
used.53 It was further proved that an NN with n sigmoidal
hidden nodes could approximate the response of 2n − 1
samples.78 These results support the idea that it is not necessary
to use large numbers of hidden nodes to fit complex multivariate
relationships. On the contrary, large numbers of hidden nodes
often accentuate the risk of overfitting.79
To circumvent the problems of overfitting and local minima
trapping characteristic of complex networks, Jiang et al.66
proposed a recursive algorithm to add a reasonable number of
hidden nodes to an already trained NN. The idea is that an
augmented NN is capable of the same approximation as a
smaller one, and convergence can be improved with additional
hidden nodes. The augmented NN is trained with a modified
genetic algorithm (MGA) instead of the usual back-propagation
algorithm to avoid local minima. However, the initial topology
to be augmented remains to be determined.
Conversely, Kanjilal and Banerjee80 presented a strategy for
reducing the number of hidden nodes in an NN. The method is
based on orthogonalisation of the hidden layer output matrix
with singular value decomposition (SVD), after a crude
convergence has been reached. Zhang et al.69 presented an
algorithm based on a similar concept that allows one to use all
calibration samples for NN training without need for a
monitoring set. The initial postulate is that NNs with large
numbers of hidden nodes are relatively insensitive to initial
conditions, but their generalisation ability is worse than NNs
with a hidden layer of reduced size. The proposed scheme
consists of starting NN training with a deliberately large hidden
layer until an arbitrarily low error is reached, then perform SVD
on the hidden layer output matrix H:
H_(k×h) = U_(k×k) S_(k×h) V^T_(h×h)    (13)

where h is the number of hidden nodes and k the number of


training samples. The number r of dominant singular values in
the diagonal S matrix (determined by a variance ratio criterion)
is considered as the number of hidden nodes necessary for the
NN. A new NN is built, with only r < h hidden nodes, and the
new initial weight matrices are determined by least squares fit
so that the hidden layer output matrix is
H′ = [U1 U2 … Ur]    (14)

Training is then resumed on this pruned NN with improved


generalisation ability.
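A sketch of the SVD-based pruning step of eqns. (13) and (14), applied to a synthetic hidden-layer output matrix of our own construction (the 0.999 variance-ratio threshold is an arbitrary assumption, not the criterion of the cited work):

```python
import numpy as np

# Hidden-layer output matrix H (k samples x h hidden nodes) for an NN
# whose four hidden nodes carry only two independent activation patterns
t = np.arange(50)
u = np.sin(2 * np.pi * t / 50)
v = np.cos(2 * np.pi * t / 50)
H = np.column_stack([u, v, u + v, u - v])         # rank 2 by construction

U, s, Vt = np.linalg.svd(H, full_matrices=False)  # eqn. (13)

# Variance-ratio criterion: number of dominant singular values
ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
r = int(np.searchsorted(ratio, 0.999)) + 1        # hidden nodes to retain

H_new = U[:, :r]   # eqn. (14): target hidden outputs for the pruned NN
```

The pruned NN would then be re-initialised by least-squares fitting of its weight matrices so that its hidden layer reproduces `H_new`, and training resumed.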
We have studied the influence of the number of hidden nodes
on the NN error on four non-linear NIR data sets, for which the
optimum set of input variables (PC scores) had previously been
identified. The first two data sets consist of diesel oil spectra
with their corresponding values of viscosity and pour point
(eight and four input variables, respectively). The third data set
contains spectra of a polymer and the concentration of a mineral
charge in this polymer as dependent variable (three input
variables). The fourth data set contains spectra of gasoline
samples and their corresponding octane numbers (thirteen input
variables). The first three sets can be considered as strongly
non-linear, whereas the last one is only slightly non-linear.32
For each set, models with different numbers of hidden nodes
have been designed. Each model was repeated five times to
avoid chance correlations due to the random initialisation of the
weights. Fig. 14 shows the evolution of average calibration
error (CE), monitoring error (ME) and test error (TE) as a
function of the number of hidden nodes in the NN, for each of
the four data sets.
For the three highly non-linear data sets [Fig. 14(a)–(c)],
there is first a sharp decrease in error as the second and/or third
hidden node are added to the model, whereas for the modelling
of octane number [Fig. 14(d), slightly non-linear], the error

curves remain relatively flat between 1 and 20 hidden nodes.


The high initial error values observed in Fig. 14(a)–(c) for one
hidden node indicate a situation where the NN is not flexible
enough to model highly non-linear relationships. The situation
is equivalent to fitting a second- or third-order polynomial with
a first-order model. One could think of simply selecting an
arbitrarily large number of hidden nodes and keeping it constant,
since the error curves in Fig. 14 remain stable for high numbers
of hidden nodes. However, the test samples in these examples

are all within the calibration domain. The situation changes


significantly when NNs are used in extrapolation. For instance,
in Fig. 15(a) the CE, ME and TE values are reported for the
modelling of diesel oil viscosity, when the test set contains
samples with extreme X values.
The monitoring and test errors increase as more hidden nodes
are added, in contrast to what was observed in Fig. 14(a). The
main reason is that several samples that describe the non-linearity
are now in the test set, and the calibration samples mainly
describe the linear portion of the viscosity range.

Fig. 14 Evolution of NN calibration, monitoring and test error as a function of the number of hidden nodes: a, viscosity data; b, pour point data; c, polymer data; d, gasoline data.

Fig. 15 Evolution of calibration, monitoring and test errors as a function of the number of hidden nodes for viscosity data, when some test samples are outside calibration space: a, error; b, standard deviation of error.

One hidden node is sufficient to fit the mild non-linearity present in
the calibration set. The fit is slightly better if a second hidden
node is added (lower CE), but we already start to overfit the
training data, which leads to higher ME and TE values. The
situation is now equivalent to fitting a first-order polynomial
with a second- or third-order model. If we consider only the TE
values, models with one or six hidden nodes give equivalent
results, but the one hidden node model has the advantage of
producing very stable results: Fig. 15(b) represents the standard
deviation of errors on five trials with different initial sets of
random weights. A model obtained with one hidden node is
quasi-independent from the set of initial weights (standard
deviation almost zero). As more hidden nodes are added,
different sets of initial random weights can lead to different
combinations of transfer functions to build empirical models.81
These models are generally equivalent within the calibration
domain, but can lead to different results in extrapolation, as was
seen in Fig. 4: when the number of hidden nodes was increased
to six or nine, the calibration fit improved slightly but the
performance in prediction degraded.
We therefore recommend systematically reducing the number of hidden nodes as much as possible, in order to achieve
simpler and more robust models. It is always a good idea to
compare the performance of a one hidden node model with the
performance of a more complex model since many data sets in
multivariate calibration are only slightly non-linear. The
advantage of models with one hidden node is that the results
they produce are stable and independent of the set of initial
random weights.81 Moreover, a model with one hidden node
reduces to a sigmoidal regression that can be easily interpreted.
In an extrapolation calibration study,36 the prediction error of
the NN on one data set was reduced by 50% by using one hidden
node only.
4.2.4 Transfer function. Kolmogorov's theorem states that
an NN with linear combinations of n × (2n + 1) monotonically
increasing non-linear functions of only one variable is able to fit
any continuous function of n variables.82 The most commonly
used non-linear transfer functions in the hidden layer are the
sigmoid or hyperbolic tangent functions that are bounded, easily
differentiable and exhibit a linear-like portion in their centre, so
that data sets that are only slightly non-linear can also be
modelled (see Fig. 2). These two functions are popular because
they allow one to fit a large number of non-linearities, but other
functions can be tried. For instance, Gemperline et al.12
performed multivariate calibration with NNs on UV/VIS data
using in their hidden layer combinations of linear, sigmoid,
hyperbolic tangent and square functions, to accommodate

different types of non-linear response in different spectral


regions.
The transfer function(s) in the output layer can be linear or
non-linear. In many situations, if the number of hidden nodes is
sufficient, all modelling is done in the hidden layer. It was
observed that in some situations where data were mainly linear,
non-linear output transfer functions could introduce distortion
in the predicted responses,16 as illustrated in Fig. 3(a). If a linear
output transfer function is used, any linear node in the hidden
layer can be replaced with a direct connection between input
and hidden layer (because two successive linear transformations
can be reduced to a single one), which reduces the number of
adjustable parameters in the NN.
The safest procedure is to try both types of output transfer
functions (linear and non-linear) during topology optimisation
and to base the decision on the shape of residuals for models
constructed with the same input variables.
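For reference, the two usual transfer functions and the derivatives used in weight adjustment can be written as below (a standalone sketch; the vanishing of the derivative away from the centre is what motivates the range-scaling of Section 4.1.6):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid transfer function, bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    """Derivative used in the weight update; near the bounds it vanishes,
    hence targets are scaled to [0.2, 0.8] for sigmoid outputs."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_deriv(x):
    """Derivative of the hyperbolic tangent (bounded in (-1, 1)); targets
    are scaled to [-0.8, 0.8] for tanh outputs."""
    return 1.0 - np.tanh(x) ** 2

centre = sigmoid_deriv(0.0)      # maximal gradient in the linear-like portion
saturated = sigmoid_deriv(8.0)   # almost zero gradient: learning stalls here
```

Both functions have a quasi-linear central portion, which is why slightly non-linear data sets can also be modelled with them.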

4.3 Training of the network


4.3.1 Learning algorithms. Two general modes of learning
can be distinguished: incremental learning and batch learning.
Incremental learning consists of successively updating the
weights in the NN after estimating the error associated with the
response predicted for each sample presented in a random order.
In the batch learning mode the errors of all training samples over
each iteration are first summed and the parameters are adjusted
with respect to this sum. The former approach has the advantage
that it superimposes a stochastic component on the weight
update. This can help the NN escape from local minima on the
error surface in the hyperspace of the weights. A drawback is
that the method is prone to the phenomenon of thrashing: the
NN can take successive steps in opposite directions that may
slow learning. Batch learning provides a more accurate estimate
of the gradient vector4 and faster convergence, but it also
requires more memory storage capacity. The relative efficiency
of both approaches is usually data set dependent. The
incremental approach seems particularly suited for very homogeneous training sets21 or for on-line process control applications4 where the composition of the training set is constantly
modified.
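The two learning modes can be contrasted on a toy linear neuron (our own sketch with invented data; a real NN would of course propagate errors through hidden layers as well):

```python
import numpy as np

def incremental_epoch(w, X, y, lr, rng):
    """Update the weights after each sample, presented in random order."""
    for i in rng.permutation(len(X)):
        err = X[i] @ w - y[i]
        w = w - lr * err * X[i]          # per-sample gradient step
    return w

def batch_epoch(w, X, y, lr):
    """Sum the errors of all training samples, then take one step."""
    grad = X.T @ (X @ w - y) / len(X)    # average gradient over the epoch
    return w - lr * grad

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noise-free toy responses

w_inc = np.zeros(3)
w_bat = np.zeros(3)
for _ in range(200):
    w_inc = incremental_epoch(w_inc, X, y, 0.01, rng)
    w_bat = batch_epoch(w_bat, X, y, 0.1)
```

The random presentation order in the incremental mode is the stochastic component mentioned above; the batch mode computes one accurate gradient per epoch at the cost of storing all sample errors.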
Training an NN is an optimisation problem, and several
methods are available for this task. It is not possible to review
in detail all algorithms available, but the main types of
algorithms will be summarised and their particularities outlined.
The gradient descent algorithm performs a steepest-descent
minimisation on the error surface in the adjustable parameters

Fig. 16 Detection of representativity problems between training and monitoring set on r.m.s. error curves: a, lack of representativity; b, chance correlation
with initial set of weights.

hyperspace. This algorithm was described and popularised by


Rumelhart and McClelland83 in 1986. The excessively slow
convergence of the basic algorithm and its tendency to become
trapped in the numerous local minima of the error surface
triggered the need for improvements such as the addition of a
momentum term in the weight update, that allows one to smooth
the error surface and to attenuate oscillations in the bottom of
steep valleys. The speed of the algorithm can be significantly
enhanced by using adaptive parameters (learning rate and
momentum rate) for each weight in the NN. This is the basis of
the delta-bar-delta84 and extended delta-bar-delta85 algorithms,
that have been successfully applied in multivariate calibration.30

Fig. 17 Visualisation of sample repartitions on hidden node (hn) output maps for ICP data: a, hn1–hn2; b, hn1–hn3; c, hn2–hn3.

Faster convergence can be reached with second-order


optimisation methods, based on the determination or approximation of the Hessian matrix of partial second derivatives of
the cost function: these methods typically have a convergence
time one order of magnitude smaller than the gradient method or
its derivatives. In the Newton–Raphson method, the Hessian
matrix is used to adjust the descent direction at each step, and
convergence is reached in a single step if the error surface is
quadratic, with ellipsoidal contours. Currently, one of the most
popular and efficient second-order methods for NN training is
the Levenberg–Marquardt algorithm,8 which is a compromise
between gradient descent and Newton–Raphson optimisation.
At each step, an adaptive parameter allows the algorithm to
transit smoothly between the gradient direction and the
Newton–Raphson direction. The inverse Hessian matrix is only
estimated and iteratively updated to avoid tedious calculations.
Applications of this algorithm for NN training in multivariate
calibration have recently been reported.32,68,70,79 Conjugate
gradient optimisation is an alternative second-order technique
that also uses the Hessian matrix, but the algorithm is
formulated in such a way that the estimation and storage of the
Hessian matrix are completely avoided.8 With conjugate
gradient optimisation, each new search direction is chosen so as
to spoil as little as possible the minimisation achieved by the
previous one, in contrast to the winding trajectory observed with
the gradient method. This method is guaranteed to locate the
minimum of any quadratic function of n variables in at most n
steps.
Genetic algorithms (GA) have been used for NN training.66,86
This global search method allows one to overcome the problem
of becoming trapped in local minima, but at the expense of a
long computing time because each individual in the population
represents a different NN model. In addition, a number of
parameters must be set to define the population size and
evolution mode, and therefore this approach cannot be easily
implemented.
Random optimisation consists of taking successive random
steps in the weight space and discarding all steps that do not
reduce the cost function. In contrast to the classical back-propagation algorithm, random search is guaranteed to find a
global minimum,87 but the computation time is so high that the
method is never used in practice. Instead, GA or random
optimisation can be used as preliminary techniques to optimise
the initial set of weights in the NN, then the training is continued
with a back-propagation-based method.
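Random optimisation itself is simple to state; the sketch below (our own code, minimising a toy quadratic cost) shows how it could be used to pre-optimise an initial set of weights:

```python
import numpy as np

def random_optimisation(cost, w, step=0.1, n_steps=2000, seed=0):
    """Take random steps in weight space and keep a step only if it
    lowers the cost function."""
    rng = np.random.default_rng(seed)
    best_cost = cost(w)
    for _ in range(n_steps):
        trial = w + step * rng.standard_normal(w.shape)
        c = cost(trial)
        if c < best_cost:            # discard steps that do not reduce the cost
            w, best_cost = trial, c
    return w, best_cost

# Hypothetical cost surface with a single minimum at (1.0, -0.5)
cost = lambda w: float(np.sum((w - np.array([1.0, -0.5])) ** 2))
w0 = np.zeros(2)
w_opt, c_opt = random_optimisation(cost, w0)
```

After this preliminary search, `w_opt` would be handed over to a back-propagation-based method for the actual training.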
4.3.2 When to stop training. As mentioned previously, a
monitoring set has to be used in order to reduce the tendency of
NN to overtrain and therefore overfit the training data. The
evolution of the monitoring error must be followed during
training. The frequency of monitoring error estimation has to be
determined by the user; ideally it should be performed after each
iteration. Consecutive monitoring error values are stored in a
vector, and several criteria can be applied to retain the optimum
set of weights: train the NN for a pre-defined large number of
iterations and retain the set of weights corresponding to the
minimum of the monitoring error curve; stop training and retain
the last set of weights as soon as the monitoring error is below
a pre-specified threshold; or stop training and retain the last set
of weights as soon as the decrement between two successive
monitoring errors is below a pre-specified threshold.
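The first criterion (train for a pre-defined number of iterations and retain the weights at the minimum of the monitoring curve) can be sketched as follows; the `update` and `monitor_error` callables are placeholders for a real training iteration and monitoring-set evaluation:

```python
import copy

def train_with_monitoring(model, update, monitor_error, max_iter):
    """Train for a fixed number of iterations and retain the set of
    weights corresponding to the minimum of the monitoring error curve."""
    best_model, best_me = copy.deepcopy(model), float('inf')
    for _ in range(max_iter):
        update(model)                 # one training iteration
        me = monitor_error(model)     # monitoring error after each iteration
        if me < best_me:
            best_me, best_model = me, copy.deepcopy(model)
    return best_model, best_me

# Hypothetical training run whose monitoring error first falls, then
# rises again as the NN starts to overtrain
model = {'step': 0}
update = lambda m: m.update(step=m['step'] + 1)
monitor_error = lambda m: (m['step'] - 100) ** 2 / 1e4 + 0.05
best, best_me = train_with_monitoring(model, update, monitor_error, 300)
```

Storing a copy of the best weights is essential: by the time overtraining is detected, the current weights are already past the optimum.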
One must also check that the training error is reasonably low
at the number of iterations retained, and that the representativity
between the training and the monitoring set is ensured. Lack of
representativity between training and monitoring sets can be
detected when the r.m.s. error curves for the two sets are
separated by a large gap in the region where they flatten, as
shown in Fig. 16(a).3,88
Alternatively, it is possible that the optimum monitoring error
is reached while the training error is still relatively high [Fig.
16(b)]. This can be due to chance correlation, for instance when


the initial set of random weights brings the model near a local
minimum on the monitoring error surface. Chauvin89 demonstrated that in NNs with complex architectures, late validation
minima could sometimes be deeper than the first local
minimum. In both cases (large gap between monitoring and
training error curves, or early minimum for monitoring), a
different splitting of data between the two subsets should be
considered.
The sensitivity of the NN solution to initial conditions is a
well known issue that was discussed by Kolen and Pollack.81 To
overcome effects due to chance correlation, several trials must
be performed with different sets of initial random weights.55 At
least five trials are recommended. The topology corresponding
to the lowest average monitoring error should be retained,
provided that the variability of predictions is not significantly
higher than with other topologies. Once the topology has been
established, any set of weights leading to an acceptable
monitoring error can be retained for the final model. It is
recommended, however, to test it against a validation set, if
available, before performing predictions on unknown samples.
Some approaches have been presented that avoid the need for a monitoring set, such as the method based on hidden node pruning presented in Section 4.2.3.69,80 Since no overfitting is observed in the later stage of training with this approach, it is claimed that no monitoring set is necessary. This seems particularly attractive for situations where the number of calibration samples is low. In practice, we found that the method gave very good results when a classical NN showed no particular overfitting problem, but in situations where a classical NN caused difficulties, a monitoring set was also necessary with the hidden node pruning approach.
4.3.3 Model interpretation. NNs have more to offer than a
simple empirical model. The sensitivity plots that we have
presented earlier describe the relative influence of the different
input variables in the final model. In addition, examination of
the projection of the samples on the hidden nodes of the NN is
often informative.37 We developed a calibration model for the
quantitative analysis of traces of lead in water, using inductively
coupled plasma atomic emission spectrometry (ICP-AES) data
as input (14 descriptors). At the end of training, if we display the
activation of hidden nodes versus each other, we obtain plots
comparable to score plots (Fig. 17). The five measurement
replicates marked with asterisks are easily identified as probable
outliers. Such plots are instructive and also allow visualisation
of clusters present in the data, but they are rarely used. When
data must first be compressed, visualisation is performed on the
scores before modelling instead.
Fig. 18(a)–(c) displays the activation of the three hidden nodes at the end of training for the ICP-AES data NN model. Fig. 18(d) and (e) show the activation of the two hidden nodes in the non-linear model for polymer charge concentration. To estimate the relative importance of each hidden node in the final model, we report in parentheses the magnitude of the weight between this hidden node and the output node. This is possible because all hidden nodes are connected to one output node only. Therefore, the magnitudes of the connecting weights can be compared directly, which is not the case for weights connected to input nodes.
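A minimal sketch of this kind of inspection is given below; the input-to-hidden weights, biases and samples are invented for illustration, while the hidden-to-output weights reuse the magnitudes quoted for the polymer model (−0.12 and 0.33):

```python
import numpy as np

# Hypothetical trained parameters of a two-hidden-node network; only the
# hidden -> output weights match values quoted for the polymer model.
W1 = np.array([[0.8, -1.5], [0.3, 2.0]])   # input -> hidden weights (assumed)
b1 = np.array([0.1, -0.2])                 # hidden biases (assumed)
w2 = np.array([-0.12, 0.33])               # hidden -> output weights

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (30, 2))            # autoscaled input samples (made up)

# One column per hidden node: these "scores" can be plotted against each
# other like a PCA score plot to reveal clusters or probable outliers.
H = np.tanh(X @ W1 + b1)

# All hidden nodes feed a single output node, so |w2| directly ranks their
# relative importance (not true for weights attached to input nodes).
importance = np.abs(w2)

# Activations near +/-1 indicate the saturated, strongly non-linear part of
# tanh; activations near 0 indicate an essentially linear regime.
saturation = np.mean(np.abs(H) > 0.9, axis=0)
```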
The activation of hidden nodes for ICP-AES data indicates that this data set is mainly linear, whereas the transfer functions in the polymer model are activated in their strongly non-linear portion. Thus we obtain information on the degree of non-linearity of a given data set, even when the exact form of the model is unknown.

Fig. 18 Visualisation of hidden node activations: a, ICP data, hn1, w = −0.36; b, ICP data, hn2, w = −0.54; c, ICP data, hn3, w = 0.60; d, polymer data, hn1, w = −0.12; e, polymer data, hn2, w = 0.33.
Recently, several groups have investigated the assessment of
statistical confidence intervals for predictions with NNs. Dathe
and Otto72 derived confidence intervals using the bootstrap
method. After finding the optimum topology of the NN, they
erase a portion of the calibration matrix and randomly fill it with
replicate samples from the remaining portion. An arbitrary
number nsets of calibration matrices is created, and nsets models
are built with the pre-defined topology. An external test set is
used to predict the responses with each of the bootstrapped NN
models, and standard deviations of predicted responses can be
calculated. Derks and Buydens90 also worked on the calculation of confidence intervals and compared three forms of bootstrapping. The advantage of the bootstrap approach is that the derived confidence intervals contain all sources of variability (experimental noise, model errors, effect of different sets of random weights), thus yielding a worst-case estimation. The drawback is that the derived confidence intervals correspond to an NN topology, not to a single model with a fixed set of weights.
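A plain row-resampling variant of such a bootstrap scheme can be sketched as follows; for brevity, an ordinary least squares model stands in for retraining the NN with the pre-defined topology, and all data and names are illustrative:

```python
import numpy as np

def bootstrap_prediction_sd(X, y, X_test, fit_predict, nsets=200, seed=0):
    """Build `nsets` models on resampled calibration matrices and return
    the standard deviation of the test-set predictions across models."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(nsets):
        idx = rng.integers(0, len(y), len(y))   # sample rows with replacement
        preds.append(fit_predict(X[idx], y[idx], X_test))
    return np.std(preds, axis=0)

# Stand-in for "rebuild the model with the pre-defined topology": ordinary
# least squares with an intercept, so the sketch stays short.
def ols_fit_predict(Xc, yc, Xt):
    Xc1 = np.c_[np.ones(len(Xc)), Xc]
    Xt1 = np.c_[np.ones(len(Xt)), Xt]
    beta, *_ = np.linalg.lstsq(Xc1, yc, rcond=None)
    return Xt1 @ beta

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.1, 30)
X_test = rng.normal(size=(5, 3))
sd = bootstrap_prediction_sd(X, y, X_test, ols_fit_predict)  # one sd per test sample
```

With an NN in place of `ols_fit_predict`, the spread of predictions also absorbs the effect of different sets of initial random weights, as noted above.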

5 Conclusions
As is often the case in chemometrics, data pre-treatment and presentation (number of samples, detection of outliers, data compression and splitting) are critical issues that should not be overlooked. Experience has shown that several failures of NNs in modelling were in fact due to inappropriate problem formulation. Such issues can be circumvented by focusing on prior model identification, in particular the detection of non-linearity. Proper a priori non-linearity detection remains one of the major difficulties, and the methods available so far often fail in the presence of outliers.
NNs should become part of the standard toolkit of analytical
chemists concerned with multivariate calibration, but it is
important to have a clear understanding of their capabilities and
limitations. One should not consider NNs as black boxes, but as
regression models whose flexibility will depend on the topology
defined by the user. In recent years, numerous research efforts
have been focused on improving the speed of algorithms used
for NN training. With the availability of faster personal
computers, the emphasis is no longer on the speed of algorithms
but rather on the development of tools to ease topology
optimisation, visualisation and model interpretation.
The design of an optimum topology is certainly critical and time-consuming, but this is also true of the optimisation of parameters for other methods (form of the model in polynomial PCR or PLS, complexity of soft models, number of nearest neighbours in LWR, variables to retain or eliminate in methods based on feature selection/elimination), although it is less often emphasised. Moreover, the comment that NNs do not allow
inference is somewhat unfair. Some simple plots can provide
information on the nature and form of the problem tackled and
on the presence of possible clusters or outliers.
Several recent research efforts have aimed at combining the flexibility and auto-adaptive ability of NNs with the superior interpretability and inference capability of PLS models.91–94 So far, it seems that these methods also combine the pitfalls of both approaches, and their application generally requires the optimisation of a large number of parameters. Radial basis function (RBF) networks offer an interesting alternative to MLPs in the sense that they allow local training, and the final models can be interpreted in terms of logical rules.38,53,95 Another approach to gain insight into a complex problem is to combine the use of classical MLPs (for prediction) with counter-propagation NNs to obtain contour plots of the input and output variables.60,61

6 Acknowledgements
The authors are grateful to Vita Centner and Frederic Estienne
for fruitful discussions. This work received financial support
from the European Commission (SMT Programme contract
SMT4-CT95-2031) and the Fonds voor Wetenschappelijk
Onderzoek (FWO, Fund for Scientific Research).

7 References
1 J. Zupan and J. Gasteiger, Anal. Chim. Acta, 1991, 248, 1.
2 S. D. Brown, S. T. Sum, F. Despagne and B. K. Lavine, Anal. Chem., 1996, 68, 21R.
3 J. R. M. Smits, W. J. Melssen, L. M. C. Buydens and G. Kateman, Chemom. Intell. Lab. Syst., 1992, 22, 165.
4 D. Svozil, V. Kvasnicka and J. Pospichal, Chemom. Intell. Lab. Syst., 1997, 39, 43.
5 D. A. Cirovic, Trends Anal. Chem., 1997, 16, 148.
6 M. Bos, A. Bos and W. E. van der Linden, Analyst, 1993, 118, 323.
7 S. Geman, E. Bienenstock and R. Doursat, Neural Comput., 1992, 4, 1.
8 R. Fletcher, Practical Methods of Optimisation, Vol. 1: Unconstrained Optimisation, Wiley, New York, 1980.
9 K. Hornik, M. Stinchcombe and H. White, Neural Networks, 1989, 2, 359.
10 E. Thomas, Anal. Chem., 1994, 66, 795A.
11 C. E. Miller, NIR News, 1993, 4, 3.
12 P. J. Gemperline, J. R. Long and V. G. Gregoriou, Anal. Chem., 1991, 63, 2313.
13 M. S. Dhanoa, S. J. Lister, R. Sanderson and R. J. Barnes, J. Near Infrared Spectrosc., 1994, 2, 43.
14 J. A. van Leeuwen, R. J. Jonker and R. Gill, Chemom. Intell. Lab. Syst., 1994, 25, 325.
15 F. Wulfert, W. T. Kok and A. K. Smilde, Anal. Chem., 1998, 70, 1761.
16 R. Goodacre, M. J. Neal and D. B. Kell, Anal. Chem., 1994, 66, 1070.
17 R. Goodacre, Appl. Spectrosc., 1997, 51, 1144.
18 S. R. Amendolia, A. Doppiu, M. L. Ganadu and G. Lubinu, Anal. Chem., 1998, 70, 1249.
19 J. R. Long, V. G. Gregoriou and P. J. Gemperline, Anal. Chem., 1990, 62, 1791.
20 T. J. Sejnowski and C. R. Rosenberg, Complex Syst., 1987, 1, 145.
21 J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, Redwood City, CA, 1991.
22 S. Biswas and S. Venkatesh, in Advances in Neural Information Processing Systems, ed. R. P. Lippmann, J. E. Moody and D. S. Touretzky, Morgan Kaufmann, San Mateo, CA, 1991, vol. III.
23 B. Hitzmann, A. Ritzka, R. Ulber, T. Scheper and K. Schugerl, Anal. Chim. Acta, 1997, 348, 135.
24 C. Borggaard and H. H. Thodberg, Anal. Chem., 1992, 64, 545.
25 T. Naes, K. Kvaal, T. Isaksson and C. Miller, J. Near Infrared Spectrosc., 1993, 1, 1.
26 J. Verdu-Andres, D. L. Massart, C. Menardo and C. Sterna, Anal. Chim. Acta, 1997, 349, 271.
27 H. Martens and T. Naes, Multivariate Calibration, Wiley, Chichester, 1989.
28 N. B. Vogt, Chemom. Intell. Lab. Syst., 1989, 7, 119.
29 S. Wold, N. Kettaneh-Wold and B. Skagerberg, Chemom. Intell. Lab. Syst., 1989, 7, 53.
30 T. B. Blank and S. D. Brown, Anal. Chem., 1993, 65, 3081.
31 T. Naes and T. Isaksson, NIR News, 1994, 5, 7.
32 V. Centner, J. Verdu-Andres, B. Walczak, D. Jouan-Rimbaud, F. Despagne, L. Pasti, R. Poppi, D. L. Massart and O. E. de Noord, submitted for publication.
33 S. Sekulic, M. B. Seasholtz, Z. Wang, B. R. Kowalski, S. E. Lee and B. R. Holt, Anal. Chem., 1993, 65, 835A.
34 I. E. Frank, Chemom. Intell. Lab. Syst., 1995, 27, 1.
35 P. H. Hindle and C. R. R. Smith, J. Near Infrared Spectrosc., 1996, 4, 119.
36 L. Pasti, B. Walczak, F. Despagne, D. Jouan-Rimbaud, D. L. Massart and O. E. de Noord, submitted for publication.

37 T. Masters, Practical Neural Network Recipes in C++, Academic Press, Boston, 1993.
38 E. P. P. A. Derks, M. S. Sanchez Pastor and L. M. C. Buydens, Chemom. Intell. Lab. Syst., 1995, 28, 49.
39 K. Faber and B. R. Kowalski, Chemom. Intell. Lab. Syst., 1996, 34, 293.
40 I. E. Frank and R. Todeschini, The Data Analysis Handbook, Elsevier, Amsterdam, 1994.
41 P. J. Gemperline, Chemom. Intell. Lab. Syst., 1997, 39, 29.
42 M. Hartnett, D. Diamond and P. G. Barker, Analyst, 1993, 118, 347.
43 V. Centner, D. L. Massart and O. E. de Noord, Anal. Chim. Acta, in the press.
44 S. Chatterjee and A. S. Hadi, Stat. Sci., 1986, 1, 379.
45 E. R. Malinowski, Factor Analysis in Chemistry, Wiley, New York, 2nd edn., 1991.
46 J. E. Jackson, A User's Guide to Principal Components, Wiley, New York, 1991.
47 A. Hoskuldsson, Prediction Methods in Science and Technology, Vol. 1: Basic Theory, Thor Publishing, 1996.
48 G. B. Dijksterhuis and W. J. Heiser, Food Qual. Pref., 1995, 6, 263.
49 A. G. Gonzalez and D. Gonzalez Arjona, Anal. Chim. Acta, 1995, 314, 251.
50 V. Centner, D. L. Massart and O. E. de Noord, Anal. Chim. Acta, 1996, 330, 1.
51 B. Walczak, Anal. Chim. Acta, 1996, 322, 21.
52 J. H. Wang, J. H. Jiang and R. Q. Yu, Chemom. Intell. Lab. Syst., 1996, 34, 109.
53 D. R. Hush and B. G. Horne, IEEE Signal Process. Mag., 1993, 1, 8.
54 Neural Networks in QSAR and Drug Design, ed. J. Devillers, Academic Press, London, 1996.
55 I. V. Tetko, D. J. Livingstone and A. I. Luik, J. Chem. Inf. Comput. Sci., 1995, 35, 826.
56 J. H. Wikel and E. R. Dow, Bioorg. Med. Chem. Lett., 1993, 3, 645.
57 J. Ferre and F. X. Rius, Anal. Chem., 1996, 68, 1565.
58 R. W. Kennard and L. A. Stone, Technometrics, 1969, 11, 137.
59 R. D. Snee, Technometrics, 1977, 19, 415.
60 J. Lozano, M. Novic, F. X. Rius and J. Zupan, Chemom. Intell. Lab. Syst., 1995, 28, 61.
61 N. Majcen, K. Rajer-Kanduc, M. Novic and J. Zupan, Anal. Chem., 1995, 67, 2154.
62 D. Jouan-Rimbaud, D. L. Massart, C. A. Saby and C. Puel, Anal. Chim. Acta, 1997, 350, 149.
63 D. Jouan-Rimbaud, D. L. Massart, C. A. Saby and C. Puel, Chemom. Intell. Lab. Syst., 1998, 40, 129.
64 N. Dupuy, C. Ruckebusch, L. Duponchel, P. Beurdeley-Saudou, B. Amram, J. P. Huvenne and P. Legrand, Anal. Chim. Acta, 1996, 335, 79.
65 L. Dolmotova, C. Ruckebusch, N. Dupuy, J. P. Huvenne and P. Legrand, Chemom. Intell. Lab. Syst., 1997, 36, 125.
66 J. H. Jiang, J. H. Wang, X. H. Song and R. Q. Yu, J. Chemom., 1996, 10, 253.
67 T. B. Blank and S. D. Brown, Anal. Chim. Acta, 1993, 277, 273.
68 R. J. Poppi and D. L. Massart, Anal. Chim. Acta, in the press.
69 L. Zhang, J. H. Jiang, P. Liu, Y. Z. Liang and R. Q. Yu, Anal. Chim. Acta, 1997, 344, 29.
70 F. Despagne and D. L. Massart, Chemom. Intell. Lab. Syst., 1998, 40, 145.
71 J. H. Jiang, J. H. Wang, X. Chu and R. Q. Yu, Anal. Chim. Acta, 1996, 336, 209.
72 M. Dathe and M. Otto, Fresenius' J. Anal. Chem., 1996, 356, 17.
73 E. R. Collantes, R. Duta, W. J. Welsh, W. L. Zielinski and J. Brower, Anal. Chem., 1997, 69, 1392.
74 B. Walczak and D. L. Massart, Chemom. Intell. Lab. Syst., 1997, 36, 81.
75 S. Wold, Technometrics, 1978, 20, 397.
76 I. V. Tetko, A. E. P. Villa and D. J. Livingstone, J. Chem. Inf. Comput. Sci., 1996, 36, 794.
77 B. Hassibi and D. G. Stork, in Advances in Neural Information Processing Systems, ed. J. D. Cowan and C. L. Giles, Morgan Kaufmann, San Mateo, CA, 1993, vol. V.
78 E. D. Sontag, in Advances in Neural Information Processing Systems, ed. R. P. Lippmann, J. E. Moody and D. S. Touretzky, Morgan Kaufmann, San Mateo, CA, 1991, vol. III.
79 E. P. P. A. Derks and L. M. C. Buydens, Chemom. Intell. Lab. Syst., 1998, 41, 171.
80 P. P. Kanjilal and D. N. Banerjee, IEEE Signal Process. Mag., 1993, 1, 8.
81 J. F. Kolen and J. B. Pollack, in Advances in Neural Information Processing Systems, ed. R. P. Lippmann, J. E. Moody and D. S. Touretzky, Morgan Kaufmann, San Mateo, CA, 1991, vol. III.
82 R. P. Lippmann, IEEE Trans. Neural Networks, 1995, 5, 1061.
83 D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing, MIT Press, Cambridge, MA, 1986, vol. 1.
84 R. A. Jacobs, Neural Networks, 1988, 1, 226.
85 A. A. Minai and R. D. Williams, in International Joint Conference on Neural Networks, 1990, vol. 3, p. 676.
86 M. Bos and H. T. Weber, Anal. Chim. Acta, 1991, 247, 97.
87 C. G. Looney, Pattern Recognition Using Neural Networks: Theory and Algorithms for Engineers and Scientists, Oxford University Press, New York, 1997.
88 G. Kateman and J. R. M. Smits, Anal. Chim. Acta, 1993, 277, 179.
89 Y. Chauvin, in Advances in Neural Information Processing Systems, ed. R. P. Lippmann, J. E. Moody and D. S. Touretzky, Morgan Kaufmann, San Mateo, CA, 1991, vol. III.
90 E. P. P. A. Derks and L. M. C. Buydens, Chemom. Intell. Lab. Syst., 1998, 41, 185.
91 R. Bro, J. Chemom., 1995, 9, 423.
92 G. Andersson, P. Kaufmann and L. Renberg, J. Chemom., 1996, 10, 605.
93 S. J. Qin and T. J. McAvoy, Comput. Chem. Engng., 1992, 16, 379.
94 T. R. Holcomb and M. Morari, Comput. Chem. Engng., 1992, 16, 393.
95 B. Walczak and D. L. Massart, Anal. Chim. Acta, 1996, 331, 177.
96 B. Hitzmann and T. Kullick, Anal. Chim. Acta, 1994, 294, 243.
97 H. Wei, L. Wang, W. Xing, B. Zhang, C. Liu and J. Feng, Anal. Chem., 1997, 69, 699.
98 H. Chan, A. Butler, D. M. Falck and M. S. Freund, Anal. Chem., 1997, 69, 2373.
99 M. Bos, A. Bos and W. E. van der Linden, Anal. Chim. Acta, 1990, 233, 31.

Analyst, 1998, 123, 157R–178R

Paper 8/05562I
