Tutorial Review
Contents

1 Introduction
2 Principle of neural networks
3 Neural networks in multivariate calibration
3.1 When to use neural networks
3.2 Alternative methods
3.2.1 Linear methods
3.2.2 Non-linear methods
3.3 Advantages and limitations of neural networks
3.3.1 Flexibility of neural networks
3.3.2 Neural networks and linear models
3.3.3 Robustness of neural networks
3.3.4 Black-box aspect of neural networks
4 Development of calibration models
4.1 Data pre-processing
4.1.1 Detection of non-linearity
4.1.2 Detection of outliers
4.1.3 Number of samples
4.1.4 Data splitting and validation
4.1.5 Data compression
4.1.6 Data scaling
4.2 Determination of network topology
4.2.1 Number of layers
4.2.2 Number of input and output nodes
4.2.3 Number of hidden nodes
4.2.4 Transfer function
4.3 Training of the network
4.3.1 Learning algorithms
4.3.2 When to stop training
4.3.3 Model interpretation
5 Conclusion
6 Acknowledgements
7 References
1 Introduction
Artificial neural networks (NNs) have now gained acceptance in numerous areas of chemistry, as illustrated by the number of …
Frédéric Despagne obtained an engineering degree from the École Nationale Supérieure de Chimie et Physique de Bordeaux and a postgraduate diploma in Materials Science from the Université de Bordeaux in 1994. He was then sponsored by Elf Aquitaine to do research in the chemometrics group of Professor Brown at the University of Delaware. In 1996 he joined the research group of Professor Massart at the Vrije Universiteit Brussel, where he is currently studying for a PhD. His research interests are in multivariate calibration and artificial intelligence.
$$y = f_o\Bigl(\theta + \sum_{j=1}^{n_h} w_j\, f_h\Bigl(\sum_{i=1}^{n_d} w_{ij}\, x_i + \theta_j\Bigr)\Bigr) \qquad (1)$$

where the x_i are the n_d input descriptors, w_ij and w_j the connection weights, θ and θ_j bias terms, and f_h and f_o the hidden- and output-layer transfer functions.
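As an illustrative sketch of eqn. (1), the forward pass of such a one-hidden-layer network can be written in a few lines; the layer sizes and random weights below are toy assumptions, and the output transfer function f_o is taken as the identity, a common choice in calibration:

import numpy as np

def sigmoid(z):
    # hidden-layer transfer function f_h (a common sigmoidal choice)
    return 1.0 / (1.0 + np.exp(-z))

def mlp_output(x, W_in, theta_hidden, w_out, theta_out):
    # x: (n_d,) descriptors; W_in: (n_d, n_h) weights w_ij;
    # theta_hidden: (n_h,) biases theta_j; w_out: (n_h,) weights w_j
    hidden = sigmoid(x @ W_in + theta_hidden)  # f_h(sum_i w_ij x_i + theta_j)
    return theta_out + hidden @ w_out          # f_o taken as the identity

rng = np.random.default_rng(0)
n_d, n_h = 5, 3                                # toy layer sizes
x = rng.normal(size=n_d)
print(mlp_output(x, rng.normal(size=(n_d, n_h)), rng.normal(size=n_h),
                 rng.normal(size=n_h), 0.1))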
[Table: applications of NNs in multivariate calibration (columns: property modelled, descriptors, ref.). The property column was not recovered; the descriptors reported are: 1H NMR spectra; measured physical and chemical characteristics of barley; UV/VIS spectra (plain and with simulated instrumental perturbations); absorption spectra with simulated non-linear effects; near-infrared spectra (measured and simulated); Fourier transform infrared spectra; measured concentration of oxide ingredients; pyrolysis mass spectra; measurement signals of enzyme field effect transistor flow injection analysis; frequency response of a piezoelectric crystal gas sensor; anodic stripping voltammograms; measurements from ion-selective electrode arrays; parameters describing mechanical properties of the yarns; X-ray fluorescence spectra; and gas chromatograms (refs. 12, 14, 16–19, 23–25, 30, 35, 38, 41, 60, 61, 64, 65, 67, 70, 72, 86, 96–99).]
Fig. 2 NN predictions within and outside calibration space: a–c, cosine function; d, quadratic function. Model with a, three; b, six; c, nine; and d, two hidden nodes. ○, actual training; +, actual test; *, predicted.
Fig. 3 Strategy for construction of NN model: left, data handling; right, network construction.
Property y                            Nature of data        MLR     PCR     PLS     NN
Moisture in wheat                     Linear                0.1860  0.2147  0.2150  0.1981
Hydroxyl number of polyether polyol   Linear                0.90    1.15    1.31    0.88
Octane number of gasoline             Slightly non-linear   0.1355  0.1426  0.1461  0.1459
Mineral charge in a polymer           Strongly non-linear   0.0797  0.0477  0.0445  0.0096
and score plots on the first PCs. To detect outliers in the X space, it is recommended to examine the leverage of each sample. The leverage of a sample is a measure of its spatial distance to the main body of the samples in X.44 For a given data matrix X, the leverage of sample i is given by the diagonal term p_ii of the prediction matrix P, also called the Hat matrix:

$$\mathbf{P} = \mathbf{X}\,(\mathbf{X}^{\mathrm{T}}\mathbf{X})^{-1}\,\mathbf{X}^{\mathrm{T}} \qquad (2)$$
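As a sketch, the leverages p_ii can be computed directly from eqn. (2) without forming the full n × n matrix; the toy data below are assumptions for illustration:

import numpy as np

def leverages(X):
    # Diagonal p_ii of P = X (X^T X)^-1 X^T, eqn. (2); solve() avoids
    # an explicit matrix inverse for numerical stability.
    G = np.linalg.solve(X.T @ X, X.T)     # (X^T X)^-1 X^T
    return np.einsum('ij,ji->i', X, G)    # p_ii only, not the full matrix

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))              # 20 samples, 3 variables (toy data)
X[0] *= 5.0                               # one deliberately extreme sample
h = leverages(X)
print(h.argmax(), h.max())                # the extreme sample shows the largest leverage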
Fig. 7 Mallows augmented partial residual plots for PCR models of diesel oil viscosity: PC1; PC2; PC3; PC4.
[Table fragment accompanying Fig. 8: comparison of PLS and NN prediction errors (values 0.31, 0.28, 0.39, 0.23; row labels not recovered).]
Fig. 10 Partitioning of samples for internal and external validation with PLS and NN.
$$A_i = \frac{X_i - X^{\mathrm{train}}_{\min}}{X^{\mathrm{train}}_{\max} - X^{\mathrm{train}}_{\min}}\,(r_{\max} - r_{\min}) + r_{\min} \qquad (4)$$
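A minimal sketch of eqn. (4); the target range [−1, 1], a common choice for sigmoidal transfer functions, is an assumption here, and the training-set extrema are deliberately reused when scaling new data:

import numpy as np

def range_scale(X, X_train, r_min=-1.0, r_max=1.0):
    # Eqn. (4): scale each column to [r_min, r_max] using the extrema of
    # the training set, so that new samples stay on the calibration scale.
    x_min = X_train.min(axis=0)
    x_max = X_train.max(axis=0)
    return (X - x_min) / (x_max - x_min) * (r_max - r_min) + r_min

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 10, size=(30, 4))
X_test = rng.uniform(0, 12, size=(5, 4))   # may fall slightly outside [-1, 1]
print(range_scale(X_test, X_train).round(2))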
(five or six) that may account for possible non-linearity, and the
NN training can be started with this initial training set. For
calibration problems, the size of the initial set should typically
vary between 10 and 15 PCs. The drawback of the stepwise elimination approach is that it can be extremely time consuming if input variables are tentatively removed by trial and error, because of the large number of possible combinations.60
In neural computation, the relevance of a variable to a model
is called its sensitivity. The optimisation of the set of input
variables can be accelerated if a method to estimate the
sensitivity of each variable is implemented. Several methods
have been proposed. The most common is often referred to as
Hinton diagrams. It consists of ascribing to each input variable
a sensitivity proportional to the average magnitude of its
associated connections in the NN, represented on a twodimensional map by square boxes of varying size. Candidate
variables to be deleted are those with the lowest sensitivity. In
spite of its popularity, this method exhibits severe theoretical
and practical limitations.70,76 It is based on an analogy with the
classical MLR approach, where the magnitude of a regression
coefficient reflects the importance of the relationship between
the associated descriptor and the response. In an NN model,
input variables that have a linear contribution to the response
will be modelled in the linear portion of the sigmoidal transfer
function associated with small or medium magnitude weights,
whereas the non-linear variables will be modelled in the
concave portion of the transfer function associated with large
magnitude weights. Therefore, the Hinton diagram ranking
method is not based on the intrinsic relevance of a variable to a
model, but simply on the nature of its contribution to the
response. Linear input variables are systematically flagged as
unimportant even when they explicitly contribute to the model.
This approach can only give reliable results when the data set is
entirely linear, in which case there is no point in using an NN.
For the same reason, we are not in favour of training methods
based on the principle of weight decay4 that consists of adding
to the cost function a term penalising large weights.
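For illustration, the Hinton-diagram ranking just described reduces to a one-line computation on the input-to-hidden weight matrix (toy values assumed below), which makes its limitation easy to see:

import numpy as np

# Toy input-to-hidden weight matrix: W_in[i, j] connects input i to hidden node j.
W_in = np.array([[0.1, -0.2, 0.05],    # small weights: linear contribution
                 [2.5, -3.1, 2.8],     # large weights: non-linear contribution
                 [0.4, 0.3, -0.5]])

# Hinton-style sensitivity: average magnitude of the connections of each input.
sensitivity = np.abs(W_in).mean(axis=1)
print(sensitivity)  # input 0 is ranked lowest regardless of its actual relevance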
The approach based on estimation of saliencies is theoretically more stringent.76 The saliency of a weight is the measure
of the increase in the NN cost function caused by the deletion of
this weight. It is estimated at the end of the training. Deletion of
an individual weight wi in an NN can generally be considered as
a small perturbation. First, the change in cost function caused by
this small perturbation to the weight matrix is approximated by
a second-order Taylor series expansion. Ideally, the training is
stopped when the NN has converged to a minimum, and
therefore the change in cost function can be described using
only Hessian terms (second partial derivatives of the error
function with respect to weights) in the approximation of the
change in error. Hassibi and Stork77 proposed calculating the
saliency of a weight k as
$$s_k = \frac{1}{2}\,\frac{w_k^2}{\left[\mathbf{H}^{-1}\right]_{kk}} \qquad (6)$$
The sensitivity S_i of input variable i is then obtained by summing the saliencies of the weights k connected to input node i:

$$S_i = \sum_k s_k \qquad (7)$$
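A minimal sketch of eqns. (6) and (7), assuming the inverse Hessian of the cost function is already available (the text does not prescribe how it is obtained) and assuming a simple bookkeeping convention mapping each input node to the positions of its weights:

import numpy as np

def weight_saliencies(w, H_inv):
    # Eqn. (6): s_k = w_k^2 / (2 [H^-1]_kk)
    return 0.5 * w**2 / np.diag(H_inv)

def input_sensitivities(w, H_inv, weights_per_input):
    # Eqn. (7): sum the saliencies of the weights fanning out of each input
    # node; weights_per_input is an assumed index-bookkeeping convention.
    s = weight_saliencies(w, H_inv)
    return {i: s[idx].sum() for i, idx in weights_per_input.items()}

# Toy example: 4 weights, 2 input nodes with 2 weights each
w = np.array([0.8, -1.2, 0.05, 0.1])
H_inv = np.diag([0.5, 0.4, 0.6, 0.7])
print(input_sensitivities(w, H_inv, {0: [0, 1], 1: [2, 3]}))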
$$\sigma_y^2 = \left(\frac{\partial y}{\partial x_1}\right)^{\!2}\sigma_{x_1}^2 + \left(\frac{\partial y}{\partial x_2}\right)^{\!2}\sigma_{x_2}^2 + 2\,\frac{\partial y}{\partial x_1}\,\frac{\partial y}{\partial x_2}\,\mathrm{COV}(x_1, x_2) \qquad (11)$$

Since PC scores are orthogonal, the covariance term can be neglected and the sensitivity of input variable x_i is calculated as

$$S_i = \left(\frac{\partial y}{\partial x_i}\right)^{\!2}\sigma_{x_i}^2 \qquad (12)$$
Applying the chain rule several times, one obtains an
analytical expression that allows one to determine Si at the end
of training. The most interesting characteristic of these two
variance-based methods (partial modelling and variance propagation) is that they give extremely stable results. When NNs
with the same topology are trained with different sets of initial
random weights, they can converge to different local minima on
the error surface that are qualitatively equally good and close to
each other. In that case the two variance-based methods give
similar results, which is not always the case with Hinton
diagrams or with the saliency estimation method.
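For illustration, the derivative in eqn. (12) can also be approximated numerically when the analytical chain-rule expression is not implemented; the toy model and data below are assumptions, not values from the text:

import numpy as np

def variance_sensitivities(f, X, eps=1e-5):
    # Eqn. (12): S_i = (dy/dx_i)^2 * var(x_i), with the derivative taken
    # by central differences around the mean of the calibration data.
    x0 = X.mean(axis=0)
    var = X.var(axis=0)
    grad = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        step = np.zeros_like(x0)
        step[i] = eps
        grad[i] = (f(x0 + step) - f(x0 - step)) / (2 * eps)
    return grad**2 * var

f = lambda x: np.tanh(2.0 * x[0]) + 0.1 * x[1]   # toy model: strong x0, weak x1
X = np.random.default_rng(3).normal(size=(50, 2))
print(variance_sensitivities(f, X))              # S_0 dominates S_1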
Once the sensitivity of each input variable has been
estimated, we recommend that one should first try to remove the
variable with the lowest sensitivity, and retrain the NN. If the
monitoring error decreases after removing the flagged variable,
it can be considered as irrelevant for the model and permanently
removed, otherwise it must be replaced and another flagged
variable must be tentatively removed. Since parsimonious
models should be preferred in multivariate calibration, we
propose the following methodology for the stepwise elimination
of input variables. Let ME(k) be the monitoring error at the kth
trial and ME(k + 1) the monitoring error at the next trial after
removal of a flagged input variable. Then,
If ME(k + 1) ≤ t × ME(k), then remove the flagged variable;
else, replace the flagged variable and try to remove the next variable with lowest sensitivity.
Here t is a tolerance factor that can be adjusted to different
values; we suggest t = 1.1. Increasing this factor will result in
removing more input variables from the model, at the risk of
losing some relevant sources of variance; t should not be lower
than 1, otherwise the NN could have a poor generalisation
ability.
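The elimination rule can be expressed as a short loop; train_and_monitor is a hypothetical callback standing in for retraining the NN on a variable subset and returning its monitoring error, and t = 1.1 follows the suggestion above:

def stepwise_eliminate(variables, sensitivities, train_and_monitor, t=1.1):
    # Try variables from least to most sensitive; keep a removal only if
    # ME(k + 1) <= t * ME(k), otherwise put the variable back.
    kept = list(variables)
    me_current = train_and_monitor(kept)
    for v in sorted(kept, key=lambda u: sensitivities[u]):
        trial = [u for u in kept if u != v]
        me_trial = train_and_monitor(trial)
        if me_trial <= t * me_current:
            kept, me_current = trial, me_trial   # removal accepted
        # else: trial subset is discarded, i.e. the variable is replaced
    return kept, me_current

# Toy usage with a fake monitoring-error callback (illustrative only)
fake_me = lambda subset: 1.0 + 0.05 * len(subset) + (0.5 if 0 not in subset else 0.0)
print(stepwise_eliminate([0, 1, 2, 3], {0: 9.0, 1: 0.1, 2: 0.2, 3: 0.3}, fake_me))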
For a given set of input variables, the NN performance will
also vary with the number of hidden nodes. Therefore,
optimisation of the number of input variables and of the number
of hidden nodes should be performed in conjunction: at each
step, one should optimise the number of input variables, then the
number of hidden nodes, then optimise again the number of
input variables, and so on until the monitoring error stops decreasing.
4.2.3 Number of hidden nodes. A study performed by Tetko et al.55 suggested a fairly wide tolerance of NNs to the number of hidden nodes.
Fig. 14 Evolution of NN calibration, monitoring and test error as a function of the number of hidden nodes: a, viscosity data; b, pour point data; c, polymer
data; d, gasoline data.
Fig. 15 Evolution of calibration, monitoring and test errors as a function of the number of hidden nodes for viscosity data, when some test samples are
outside calibration space: a, error; b, standard deviation of error.
Fig. 16 Detection of representativity problems between training and monitoring set on r.m.s. error curves: a, lack of representativity; b, chance correlation
with initial set of weights.
monitoring set was also necessary with the hidden node pruning
approach.
4.3.3 Model interpretation. NNs have more to offer than a
simple empirical model. The sensitivity plots that we have
presented earlier describe the relative influence of the different
input variables in the final model. In addition, examination of
the projection of the samples on the hidden nodes of the NN is
often informative.37 We built a calibration model for the
quantitative analysis of traces of lead in water, using inductively
coupled plasma atomic emission spectrometry (ICP-AES) data
as input (14 descriptors). At the end of training, if we display the
activation of hidden nodes versus each other, we obtain plots
comparable to score plots (Fig. 17). The five measurement
replicates marked with asterisks are easily identified as probable
outliers. Such plots are instructive and also allow visualisation
of clusters present in the data, but they are rarely used. When
data must first be compressed, visualisation is performed on the
scores before modelling instead.
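A sketch of how such projections can be produced from a trained network; the sigmoid transfer function, the 14-descriptor layout and the placeholder weights are assumptions for illustration:

import numpy as np
import matplotlib.pyplot as plt

def hidden_activations(X, W_in, theta_hidden):
    # Sigmoid activations of the hidden nodes for all samples
    return 1.0 / (1.0 + np.exp(-(X @ W_in + theta_hidden)))

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 14))                 # 40 samples, 14 descriptors (toy)
W_in = rng.normal(size=(14, 3))               # placeholder trained weights
H = hidden_activations(X, W_in, rng.normal(size=3))

# Score-plot-like projection of the samples on hidden nodes 1 and 2
plt.scatter(H[:, 0], H[:, 1])
plt.xlabel('hidden node 1 activation')
plt.ylabel('hidden node 2 activation')
plt.show()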
We display in Fig. 18(a)–(c) the activation of the three
hidden nodes at the end of training for the ICP-AES data NN
model. Fig. 18(d) and (e) show the activation of the two hidden
nodes in the non-linear model for polymer charge concentration. To estimate the relative importance of each hidden node in
the final model, we have reported the value of the magnitude of
the weight between this hidden node and the output node in
parentheses. This is possible because all hidden nodes are
connected to one output node only. Therefore, the magnitude of
the connecting weights can directly be compared, which is not
the case for weights connected to input nodes.
The activation of hidden nodes for ICP-AES data indicates that this data set is mainly linear, whereas the transfer functions …
Fig. 18 Visualisation of hidden nodes activation: a, ICP data, hn1, w = −0.36; b, ICP data, hn2, w = −0.54; c, ICP data, hn3, w = 0.60; d, polymer data, hn1, w = −0.12; e, polymer data, hn2, w = 0.33.
5 Conclusions
As is often the case in chemometrics, data pre-treatment and
presentation (number of samples, detection of outliers, data
compression and splitting) are critical issues that should not be
overlooked. Experience has shown that several reported failures of NNs in modelling were in fact due to inappropriate problem formulation. Such issues can be circumvented by focusing on prior model identification, in particular the detection of non-linearity. Proper a priori detection of non-linearity is one of the major difficulties, and existing methods often fail in the presence of outliers.
NNs should become part of the standard toolkit of analytical
chemists concerned with multivariate calibration, but it is
important to have a clear understanding of their capabilities and
limitations. One should not consider NNs as black boxes, but as
regression models whose flexibility will depend on the topology
defined by the user. In recent years, numerous research efforts
have been focused on improving the speed of algorithms used
for NN training. With the availability of faster personal
computers, the emphasis is no longer on the speed of algorithms
but rather on the development of tools to ease topology
optimisation, visualisation and model interpretation.
The design of an optimum topology is certainly critical and
time consuming, but this is true also for the optimisation of
parameters for other methods (form of the model in polynomial
PCR or PLS, complexity of soft models, number of nearest
neighbours in LWR, variables to retain/eliminate in methods
based on feature selection/elimination), although it is less
emphasised. Moreover, the comment that NNs do not allow
inference is somewhat unfair. Some simple plots can provide
information on the nature and form of the problem tackled and
on the presence of possible clusters or outliers.
Several recent research efforts aimed at combining the
flexibility and auto-adaptive ability of NNs with the superior
interpretability and inference capability of PLS models.91–94 So
far, it seems that these methods also combine the pitfalls of both
approaches and their application generally requires an optimisation of a large number of parameters. Radial basis function
(RBF) networks offer interesting alternatives to MLP in the
sense that they allow local training and the final models can be
interpreted in terms of logical rules.38,53,95 Another approach to
gain insight into a complex problem is to combine the use of …
6 Acknowledgements
The authors are grateful to Vita Centner and Frederic Estienne
for fruitful discussions. This work received financial support
from the European Commission (SMT Programme contract
SMT4-CT95-2031) and the Fonds voor Wetenschappelijk
Onderzoek (FWO, Fund for Scientific Research).
7 References
1 J. Zupan and J. Gasteiger, Anal. Chim. Acta, 1991, 248, 1.
2 S. D. Brown, S. T. Sum, F. Despagne and B. K. Lavine, Anal. Chem.,
1996, 68, 21R.
3 J. R. M. Smits, W. J. Melssen, L. M. C. Buydens and G. Kateman,
Chemom. Intell. Lab. Syst., 1992, 22, 165.
4 D. Svozil, V. Kvasnicka and J. Pospíchal, Chemom. Intell. Lab. Syst.,
1997, 39, 43.
5 D. A. Cirovic, Trends Anal. Chem., 1997, 16, 148.
6 M. Bos, A. Bos and W. E. van der Linden, Analyst, 1993, 118,
323.
7 S. Geman, E. Bienenstock and R. Doursat, Neural Comput., 1992, 4,
1.
8 R. Fletcher, Practical Methods of Optimisation, Vol. 1: Unconstrained Optimisation, Wiley, New York, 1980.
9 K. Hornik, M. Stinchcombe and H. White, Neural Networks, 1989, 2,
359.
10 E. Thomas, Anal. Chem., 1994, 66, 795A.
11 C. E. Miller, NIR News, 1993, 4, 3.
12 P. J. Gemperline, J. R. Long and V. G. Gregoriou, Anal. Chem., 1991,
63, 2313.
13 M. S. Dhanoa, S. J. Lister, R. Sanderson and R. J. Barnes, J. Near Infrared Spectrosc., 1994, 2, 43.
14 J. A. van Leeuwen, R. J. Jonker and R. Gill, Chemom. Intell. Lab.
Syst., 1994, 25, 325.
15 F. Wulfert, W. T. Kok and A. K. Smilde, Anal. Chem., 1998, 70,
1761.
16 R. Goodacre, M. J. Neal and D. B. Kell, Anal. Chem., 1994, 66,
1070.
17 R. Goodacre, Appl. Spectrosc., 1997, 51, 1144.
18 S. R. Amendolia, A. Doppiu, M. L. Ganadu and G. Lubinu, Anal.
Chem., 1998, 70, 1249.
19 J. R. Long, V. G. Gregoriou and P. J. Gemperline, Anal. Chem., 1990,
62, 1791.
20 T. J. Sejnowski and C. R. Rosenberg, Complex Syst., 1987, 1, 145.
21 J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of
Neural Computation, Addison Wesley, Redwood City, CA, 1991.
22 S. Biswas and S. Venkatesh, in Advances in Neural Information
Processing Systems, ed. R. P. Lippmann, J. E. Moody and D. S.
Touretzky, Morgan Kaufmann, San Mateo, CA, 1991, Vol. III.
23 B. Hitzmann, A. Ritzka, R. Ulber, T. Scheper and K. Schugerl, Anal.
Chim. Acta, 1997, 348, 135.
24 C. Borggaard and H. H. Thodberg, Anal. Chem., 1992, 64, 545.
25 T. Naes, K. Kvaal, T. Isaksson and C. Miller, J. Near Infrared
Spectrosc., 1993, 1, 1.
26 J. Verdu-Andres, D. L. Massart, C. Menardo and C. Sterna, Anal.
Chim. Acta, 1997, 349, 271.
27 H. Martens and T. Naes, Multivariate Calibration, Wiley, Chichester,
1989.
28 N. B. Vogt, Chemom. Intell. Lab. Syst., 1989, 7, 119.
29 S. Wold, N. Kettaneh-Wold and B. Skagerberg, Chemom. Intell. Lab. Syst., 1989, 7, 53.
30 T. B. Blank and S. D. Brown, Anal. Chem., 1993, 65, 3081.
31 T. Naes and T. Isaksson, NIR News, 1994, 5, 7.
32 V. Centner, J. Verdu-Andres, B. Walczak, D. Jouan-Rimbaud, F.
Despagne, L. Pasti, R. Poppi, D. L. Massart and O. E. de Noord,
submitted for publication.
33 S. Sekulic, M. B. Seasholtz, Z. Wang, B. R. Kowalski, S. E. Lee and
B. R. Holt, Anal. Chem., 1993, 65, 835A.
34 I. E. Frank, Chemom. Intell. Lab. Syst., 1995, 27, 1.
35 P. H. Hindle and C. R. R. Smith, J. Near Infrared Spectrosc., 1996,
4, 119.
36 L. Pasti, B. Walczak, F. Despagne, D. Jouan-Rimbaud, D. L. Massart
and O. E. de Noord, submitted for publication.
[References 37–99: entries not recovered in this copy.]
Paper 8/05562I