
Selection of Samples for Calibration in Near-Infrared Spectroscopy. Part II: Selection Based on Spectral Measurements

TOMAS ISAKSSON* and TORMOD NÆS

MATFORSK, Norwegian Food Research Institute, Oslovegen 1, 1430 Ås, Norway

Two strategies for selection of samples based on spectral measurements on a large set of samples are tested and compared. A method based on cluster analysis appears to be the best. The same prediction results achieved with the whole calibration set of 114 samples were obtained with only 20 samples selected by this algorithm.

Index Headings: Infrared; Analytical methods; NIR; Reflectance spectroscopy.

Received 27 November 1989; revision received 29 January 1990.
* Author to whom correspondence should be sent.

INTRODUCTION

In Near-Infrared Reflectance (NIR) analysis, good multivariate calibration is critical for accurate chemical determinations. To obtain a good calibration, one needs an effective calibration method and good calibration data. Several calibration methods exist; we refer the reader to Williams and Norris¹ and Martens and Næs² for an overview and some comparisons.

This paper deals with the selection of samples for calibration. It is the second in a series of two papers on this problem. In the first paper,³ general concepts and suggestions of design strategies were considered and illustrated by examples. One of the main conclusions in that paper was that design strategies which try to spread the samples evenly (or uniformly) over the whole region of interest (target population) have several attractive features from both a prediction and a model-fitting point of view. They are not necessarily optimal in prediction,³ but they are robust with respect to the underlying data structure, which means that they behave well in many cases.

This second paper is on selection of samples where the selection is based on spectral NIR values. The idea behind the approach is that NIR analysis is fast and simple, and spectra for a large set of samples can quite easily be obtained. Provided that the samples can be stored without change in the chemical composition, a small number of candidates can be selected and submitted to the more time-consuming chemical analysis and finally to multivariate calibration.

This paper is devoted to a comparison of two different techniques. One is based on cluster analysis, as presented in the research of Næs,⁴ and aims at applying the philosophy mentioned above, i.e., spreading the samples evenly over the whole region of interest. The second method is that described by Honigs et al.⁵ (here denoted HHMH, with reference to the first initials of the authors). In addition to discussing the difference between the two techniques and evaluating their performance on an NIR data set, the present paper suggests how to use samples not selected by the algorithm as additional information in the calibration procedure. Finally, the paper discusses how many samples are actually needed in a good calibration.

THEORY

Model and Calibration Method. The model used in NIR analysis is usually the multiple linear regression model

y = b_0 + \sum_{k=1}^{K} x_k b_k + e    (1)

where the purpose is to estimate the regression parameters b_k, k = 0, ..., K, in order to obtain good prediction results

\hat{y} = \hat{b}_0 + \sum_{k=1}^{K} x_k \hat{b}_k    (2)

for as many future samples as possible or for a particular region.²

In NIR analysis, the spectral measurements are usually highly collinear,² and a mathematical procedure different from the usual multiple linear regression is needed for the estimation of b_k, k = 0, ..., K. In this paper we use Principal Component Regression (PCR) and select the principal components (PCs)⁶ from the ones with the largest variance. We refer the reader to Næs and Martens⁷ for a justification and discussion of this method.

In our experience with NIR data,⁸ we have found that the Multiplicative Scatter Correction⁹ (MSC) of log(1/R) spectral data usually leads to better or similar prediction ability, in comparison to results based on uncorrected spectra. We performed MSC on all the samples involved in this study (see below for details).
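To make the PCR calibration step concrete, here is a minimal Python sketch of a calibration of the form of Eqs. 1 and 2, assuming centered data and components taken in order of decreasing variance. It is our illustration only, not code from the original work, and the names X_cal, y_cal, and n_components are ours.

```python
import numpy as np

def pcr_fit(X_cal, y_cal, n_components):
    """Principal Component Regression: regress y on the scores of the
    principal components with the largest variance (cf. Eqs. 1 and 2)."""
    x_mean = X_cal.mean(axis=0)
    y_mean = y_cal.mean()
    Xc = X_cal - x_mean                       # center the spectra
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                   # loadings (K x A), largest variance first
    T = Xc @ P                                # scores (N x A)
    q, *_ = np.linalg.lstsq(T, y_cal - y_mean, rcond=None)
    b = P @ q                                 # regression coefficients in wavelength space
    b0 = y_mean - x_mean @ b                  # intercept of Eq. 2
    return b0, b

def pcr_predict(X_new, b0, b):
    return b0 + X_new @ b
```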

Selection of Calibration Samples--General Principles. When one is selecting samples for calibration, two different aspects have to be given attention. First, we must consider the statistical precision of the estimation results (b_k values), which have to do with the absorbance variation along the different wavelength axes (x-axes). Second, we have to be concerned about the fit of the model to the data (the size of the error e in the model in Eq. 1), or the degree of nonlinearity and noise, which is usually better the smaller the region is. Consequently there may be a trade-off between these two aspects. In the Næs and Isaksson³ paper, important aspects of design are discussed, as listed in the following three points:

1. All combinations of variables or directions representing the relevant (those with the largest eigenvalues) population eigenvectors must be present.
2. The span of all directions described in point 1 should be as large as possible. With reference to the trade-off mentioned above, one should take care not to use a region larger than that covered by the anticipated prediction samples (target population²).
3. The calibration samples should be evenly spread over the whole subspace generated by the eigenvectors described in point 1 (and of course limited by the outer limits in point 2).

Points 1 and 2 are usually very important in order to obtain estimates of the regression coefficients (b_k's) that are good enough for prediction over the whole region of interest. Point 3 is not as important as 1 and 2, but it ensures that nonlinearities are smoothed.³ Also, point 3 offers the potential for detecting lack of fit between model and data.¹⁰ It is often very close to optimum even when the model holds exactly.¹⁰ Finally, it is absolutely necessary when local calibration methods such as those presented by, for example, Robert et al.¹¹ and Næs et al.¹² are used.

In the paper by Næs and Isaksson³ the importance of points 1 and 2 was demonstrated in an experiment where the chemical values were selected according to a design scheme. In this case, the best overall prediction results were obtained by putting more emphasis on the end-points than on the even spread of points in point 3. The results when point 3 was used in a more strict sense were only slightly less precise and gave better prediction ability near the center of the population, where more samples are usually clustered. In addition, it has the advantages mentioned above.

FIG. 1. Illustration of the calibration and prediction samples as points in the design triangle with starch (S), protein (P), and water (W) as corners. The horizontal and the diagonal lines in the figure represent level curves (10% difference). Due to a small amount of ash in the fish meal, protein + water + starch < 100%. Consequently, the triangles in the figure illustrate planes in the tetrahedron, with ash, starch, water, and protein as corners. Therefore, the level curves do not intersect in the same points. (A) Full calibration set, 114 samples; (B) test set, 25 samples.

Cluster Analysis. Næs⁴ suggested a procedure based on cluster analysis for selection of samples from a large sample set that is developed to handle all three design aspects mentioned above. The idea is that the clusters together cover the whole space of interest and that, when samples are in the same cluster, they contain similar information. It is therefore better to select only one sample from each cluster, as opposed to using several samples from only a few of the clusters or from a limited region. In this way, all three points above are satisfied reasonably well.

For a clustering method, Næs⁴ used complete linkage¹³ (furthest neighbor) based on the standardized principal components with the largest eigenvalues. Complete linkage is used because of its ability to produce small globular clusters, which are relatively equal in size. The first few principal components are used in order to avoid contributions from the PCs with small eigenvalues, which have minor predictive ability (see point 1 above). The standardization (or weighting to variance 1) is used in order to give all the PCs that are used equal weight in the selection of samples. From each cluster, the sample farthest away from the center was selected, in order to satisfy the idea of spanning the variation as much as possible under the restriction imposed by the cluster philosophy.

The number of PCs to be used in this clustering method is a matter of choice and can be based either on experience concerning how many PCs have the main predictive relevance (variance) or on a two-step procedure that will be mentioned in the discussion. The results of Næs⁴ indicated that it is not too important how many PCs are used, as long as the main predictive information is involved.
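As an illustration of this selection principle, the following sketch (ours; it uses SciPy's hierarchical clustering as a stand-in for the complete-linkage implementation used by Næs,⁴ and the function and argument names are ours) standardizes the leading PC scores, clusters them with complete linkage, and picks from each cluster the sample farthest from the cluster center:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def select_by_clusters(X, n_pcs=4, n_samples=20):
    """Select one sample per cluster: complete linkage on standardized
    PC scores, then take the sample farthest from its cluster center."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_pcs].T                # leading PC scores
    scores /= scores.std(axis=0, ddof=1)      # standardize to variance 1
    Z = linkage(scores, method='complete')    # furthest-neighbor clustering
    labels = fcluster(Z, t=n_samples, criterion='maxclust')
    selected = []
    for c in np.unique(labels):
        members = np.where(labels == c)[0]
        center = scores[members].mean(axis=0)
        dist = np.linalg.norm(scores[members] - center, axis=1)
        selected.append(members[np.argmax(dist)])   # farthest from the center
    return np.array(selected)
```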

The HHMH Algorithm. This algorithm⁵ is based on the idea of spanning the spectral variation as much as possible. It is more difficult to visualize and is not as directly related to statistical principles as the cluster method. Therefore, it is quite complicated to compare it with other methods, both conceptually and theoretically.

The HHMH method starts by identifying the spectrum (sample) having the largest absolute absorbance value. This particular sample is selected for calibration, and the wavelength is deleted from further consideration in the selection process. Next, new spectra are obtained by subtracting the selected spectrum (multiplied by a constant) from each of the others. The procedure is repeated on these residual spectra. The method continues until the desired number of spectra is selected. Further details are found in the original publication.⁵
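The sketch below is our reading of this selection step, for illustration only; in particular, the scaling constant is chosen here so that the selected wavelength is exactly cancelled in the remaining spectra, which is our interpretation. The original publication⁵ should be consulted for the exact algorithm.

```python
import numpy as np

def hhmh_select(X, n_select):
    """Repeatedly pick the spectrum with the largest absolute absorbance,
    then subtract it (scaled) from the remaining residual spectra."""
    R = X.astype(float).copy()
    available = list(range(X.shape[0]))       # samples still selectable
    wavelengths = list(range(X.shape[1]))     # wavelengths still in play
    selected = []
    for _ in range(min(n_select, X.shape[1])):
        sub = R[np.ix_(available, wavelengths)]
        i_loc, j_loc = np.unravel_index(np.argmax(np.abs(sub)), sub.shape)
        i, j = available[i_loc], wavelengths[j_loc]
        selected.append(i)
        # subtract the chosen spectrum, scaled so that column j is cancelled
        for r in available:
            if r != i:
                R[r] -= (R[r, j] / R[i, j]) * R[i]
        available.remove(i)
        wavelengths.remove(j)                 # wavelength deleted from further use
    return selected
```

The restriction to at most one selected sample per wavelength is what motivates the batch-wise modification described under Experimental below.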
Utilizing all Spectra in the Calibration. When one is selecting only a few of the samples for calibration, as with the two algorithms above, there will always be samples left which are not used in the calibration process. Is it possible to apply these samples in the calibration as well? The answer is yes, provided that, for example, principal component regression is used.

The idea behind the method is to perform the regression of the selected samples in the PC subspace created by the full set. In this way, all the samples are used for stabilizing the principal component subspace, or the space of main variability, but only the selected calibration samples are used in the regression.
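A minimal sketch of this idea (ours, written in the style of the hypothetical pcr_fit example above rather than taken from the authors' software): the loadings are computed from every available spectrum, while the regression on the scores uses only the selected, chemically analyzed samples.

```python
import numpy as np

def pcr_fit_full_subspace(X_all, selected_idx, y_selected, n_components):
    """PCA loadings from all spectra; score regression from the selected
    samples only (the only ones with chemical reference values)."""
    x_mean = X_all.mean(axis=0)
    Xc = X_all - x_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T                   # subspace stabilized by the full set
    T_sel = Xc[selected_idx] @ P              # scores of the selected samples
    y_mean = y_selected.mean()
    q, *_ = np.linalg.lstsq(T_sel, y_selected - y_mean, rcond=None)
    b = P @ q
    b0 = y_mean - x_mean @ b
    return b0, b                              # use as in pcr_predict above
```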
EXPERIMENTAL

The Data. The data set is based on the same experiment as described by Næs and Isaksson³ and consists of mixtures of fish meal (as protein source), potato starch, and water. Figure 1A is a graphical illustration of the set of 114 samples from which calibration sets are selected. The independent prediction set of 25 samples is shown in Fig. 1B. The spectra are generated from a Technicon InfraAlyzer 400 (Tarrytown, NY) with 19 standard fixed-filter wavelengths.
Multiplicative Scatter Correction (MSC). The MSC⁹ of log(1/R) spectra was performed for all 114 calibration samples. The prediction set was scatter corrected according to the averages computed from the 114 samples. Notice that in this way information from the full set of samples is used in the correction of prediction data. Both the cluster method and the HHMH method were performed on scatter-corrected log(1/R) data.
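For readers unfamiliar with MSC,⁹ the following sketch shows the usual formulation as we understand it: each spectrum is regressed on the mean calibration spectrum, and the fitted offset and slope are removed. The function name and the commented usage are ours; prediction spectra are corrected against the calibration average, as described above.

```python
import numpy as np

def msc(X, reference=None):
    """Multiplicative Scatter Correction: regress each spectrum on a
    reference (mean) spectrum and remove the additive/multiplicative part."""
    ref = X.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        slope, offset = np.polyfit(ref, x, 1)   # x is approx. offset + slope * ref
        corrected[i] = (x - offset) / slope
    return corrected, ref

# Calibration spectra corrected against their own mean,
# prediction spectra corrected against the calibration mean:
# X_cal_msc, ref = msc(X_cal)
# X_test_msc, _ = msc(X_test, reference=ref)
```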
Cluster Method. The cluster analysis was performed with complete linkage based on standardized PCs as described above. For comparison, the cluster analysis was first based on 4 and then on 8 principal components.
HHMH. The HHMH algorithm as described above can maximally select the same number of samples as the number of wavelengths (in this case 19). A modification of the method, as also proposed in Honigs et al.,⁵ is therefore used. This strategy first selects n (< 19) samples, then starts the procedure again on the remaining 114 - n samples with the full set of wavelengths. The procedure continues until the desired number of samples is selected. In our example, n = 10 is used, as in the original publication.⁵

FIG. 2. Illustration of the calibration samples selected by the cluster algorithm in the 4 and 8 principal components subspace. The largest points indicate the selected samples in the calibration set: (a) 10, (b) 15, (c) 20, (d) 25, (e) 30, (f) 40 samples selected by using 4 principal components; (g) 10, (h) 15, (i) 20, (j) 25, (k) 30, and (l) 40 samples selected by using 8 principal components.

Computations. The results are reported as the root-mean-square error of prediction (RMSEP), as defined by Næs and Isaksson.² In each calibration, only the RMSEP corresponding to the optimal number of principal components is reported; i.e., the lowest prediction error in the test set is reported. The calibrations were computed in the Unscrambler (Version 2.2, Camo AS, Trondheim, Norway), and the clustering as well as the HHMH algorithm were done in the SAS system (Version 6, SAS Institute, Raleigh, NC).
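For reference, the RMSEP used throughout is of the standard form (our notation; see Næs and Isaksson² for the exact definition used), with predicted values compared against the reference values over the n_p prediction samples:

```latex
\mathrm{RMSEP} = \sqrt{\frac{1}{n_p}\sum_{i=1}^{n_p}\left(\hat{y}_i - y_i\right)^2}
```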

FIG. 3. The 10 clusters and the 10 selected samples in the case of 4 PCs in the selection process. This shows the same selection as in Fig. 2a.

FIG. 4. Prediction results presented as root-mean-square error of prediction (RMSEP) for cluster analysis [squares (□) indicate selection based on 4 principal components, and summation signs (+) indicate 8 principal components] and the HHMH algorithm [diamonds (◇)] as a function of the number of selected samples in the calibration. In each case the optimal PCR result is reported. (A) Starch; (B) protein; (C) water.

RESULTS AND DISCUSSION

Cluster Algorithm. The samples selected by the cluster algorithm are presented as filled circles in the design scheme in Fig. 2 for 10, 15, 20, 25, 30, and 40 selected samples. Remember that each sample represents one cluster and is selected as the one that is farthest away from the center. Clustering by using 4 as well as 8 principal components was performed. As we see, the two choices of PCs give similar results, with a slight tendency toward more uniform spread when only 4 components are used. Notice also the tendency (when few samples are selected) to select samples from the border of the region--which is mainly due to the fact that we select the sample within each cluster that is farthest away from the center. This pattern becomes clear when it is compared with an illustration of the clusters themselves (Fig. 3). This result indicates that point 3 does not pertain as much as might be expected from the philosophy behind the method. In other words, it seems that the cluster method with the selection of samples farthest away from the center is merely a compromise between an even spread of points and an "end-point" design with all weight on the samples close to the edge. For larger calibration sets, however, we have an almost even distribution of samples.

The prediction results are shown in Fig. 4; as we can see, the results based on 4 PCs gave better results for starch and protein, but 8 principal components were better for water. The improvements in this latter case are, however, very small compared with the improvements in the other two cases. This observation indicates that, in this case, fewer factors in the selection are better than many. The reason for this outcome is probably the structure of the principal components of all calibration samples.² These show that the first few PCs contain the main predictive information about the chemical constituents, although the relationship between the PCs and the concentrations is slightly nonlinear. The prediction results of Fig. 4 indicate that using too many of these nonlinear factors is bad for the selection. The PC plot in the paper by Isaksson and Næs³ also showed a stronger nonlinearity for samples with a high amount of water, indicating the reason why 8 principal components gave the best results for water.

FIG. 5. Illustration of the samples selected by the HHMH algorithm: (a) 10, (b) 15, (c) 20, (d) 25, (e) 30, and (f) 40 samples in the selected calibration set.

FIG. 6. Results from the method in the text [diamonds (◇)] that utilizes all spectra in the calibration process, compared to the results from only the samples selected by the cluster method based on 4 principal components [squares (□)]. (A) Starch; (B) protein; (C) water.

In addition to this, we see that the prediction results are quite good with only 20 samples in the calibration set. Note also that calibration based on all 114 samples gives a slightly less accurate prediction than the calibration based on, e.g., 20 samples. This result clearly indicates that the cluster algorithm gives reasonable results with very few samples in the calibration set and that the number of samples is of relatively little importance in comparison to the way in which the samples are selected. Of course, the calibration sets with more samples contain more information than the smaller ones, and this information can perhaps be used for better predictions in subregions (for instance, near the center), but in this case they certainly gave rise to nonlinearities or other effects (for instance, the least-squares effect²,³) that are disadvantageous for the average prediction ability of the whole prediction set.

The HHMH Algorithm. The results from the HHMH algorithm are given in Fig. 5. As we see, this algorithm has a tendency to select more samples near the protein corner--as opposed to the cluster algorithm's ability to pick samples from a larger area. The reason for this is probably that several of the 19 wavelengths are related to protein variation in the 2100-2350 nm region, and since for each selected sample one wavelength is deleted, such a "bias" can be introduced. The prediction results are given in Fig. 4. Again, we see that the prediction results were good for a quite small calibration set.

For the calibration set of 60 samples we detected a quite large residual in two of the samples (near the center) for water and starch. This result indicates two samples with quite poor fit to the linear model fitted by the rest of the samples. These were deleted from the calibration, and the results reported for "60 samples" are based on computations on the remaining 58 samples.

Comparison between the Methods. The most striking feature in a comparison of the two algorithms is that in most cases the cluster method gave better average prediction results for the 25 prediction samples than the HHMH method, especially when few samples are selected. In all cases the HHMH algorithm needed 30-40 samples to come down to a level similar to that obtained by the cluster method with 20 samples. Again, this result probably has to do with the bias in the selection of samples by the HHMH algorithm that was mentioned above.

Utilizing Information from all Spectra. Results based on the method that uses all spectra for estimation of the principal components (see above) are given in Fig. 6 for the cases of 10, 15, and 20 samples in the calibration set, for all 3 constituents, and for selection based on 4 PCs. (Since this method is supposed to be of highest importance for cases with few samples in the calibration set, the computations for more than 20 samples in the calibration set are not included.) As we can see, the results based on this new method, in comparison to the usual PC-regression, are very similar for water; the new method is substantially better for protein and poorer for starch when 10 samples are used in the calibration set. Notice in particular that, by this new method, the prediction results for protein based on 10 calibration samples are comparable to the calibration based on 114 samples. This result shows that it is not possible to draw any definite conclusion. The protein results are, however, promising enough to justify further investigation of the approach. We have no explanation for the fact that only protein gave better results.

Directions for Future Research. The results in this paper indicate that the cluster algorithm performs well and has a number of properties for prediction that are better than those offered by the HHMH algorithm. This points to optimization of the technique as an interesting area of research. First of all, further studies on how to select an optimal number of PCs should be performed. Second, other weighting schemes for the factors should be pursued, for instance weighting them according to estimates of their prediction relevance. Such estimates could be obtained from a calibration with few samples, and more samples could be selected according to a criterion where these weights are used. This approach would, however, require a constituent-dependent selection, which could be a disadvantage in some cases. Third, the selection of samples from each cluster could be modified. Is it, for instance, better to use some center point from each cluster in order to satisfy requirement 3 in a better way? Different cluster algorithms could also be tested, along with strategies selecting more samples from large clusters. Residuals from preliminary predictions could also be used to guide the search for new objects. Investigations in the direction indicated by Puchwein¹⁴ are also of interest.

CONCLUSIONS

The main conclusion of this paper is that good calibrations can be done with quite few samples (20) in the calibration set, as long as we use cluster analysis based on 4 principal components in the selection. This indicates that the important aspect of a selection process is not the number of samples, but rather the way in which they are selected. The cluster method emphasizes samples near the edge of the area. This is due to the selection of the sample farthest away from the center from each cluster.

Another important conclusion is that the cluster method generally performs better for 4 PCs than for 8. Both cluster methods gave better average prediction results than the HHMH algorithm. The reason for this is probably that the HHMH algorithm emphasizes one particular part of the whole calibration region, rather than spanning the variation properly. The results were, however, also quite reasonable for this method.

A method for incorporating information from all samples, including those that are not selected for calibration, was tested. This method gave quite substantial improvement for one of the constituents. This result is promising enough to allow us to propose that the method be investigated further.

ACKNOWLEDGMENT

We would like to thank Bjørg Narum Nilsen for help with the computations and the graphics.

1. P. C. Williams and K. H. Norris, Near Infrared Technology in the Agricultural and Food Industries (American Association of Cereal Chemists, St. Paul, Minnesota, 1987).

2. H. Martens and T. Næs, Multivariate Calibration (John Wiley and Sons, Chichester, England, 1989).
3. T. Næs and T. Isaksson, Appl. Spectrosc. 43, 328 (1989).
4. T. Næs, J. of Chemometrics 1, 121 (1987).
5. D. E. Honigs, G. M. Hieftje, H. C. Mark, and T. B. Hirschfeld, Anal. Chem. 57, 2299 (1985).
6. I. T. Jolliffe, Principal Component Analysis (Springer-Verlag, New York, 1986).
7. T. Næs and H. Martens, J. of Chemometrics 2, 155 (1988).
8. T. Isaksson and T. Næs, Appl. Spectrosc. 42, 1273 (1988).
9. P. Geladi, D. MacDougall, and H. Martens, Appl. Spectrosc. 39, 491 (1985).
10. P. J. Zemroch, Technometrics 28, 39 (1986).
11. P. Robert, D. Bertrand, M. Crochon, and J. Sabino, Appl. Spectrosc. 43, 1045 (1989).
12. T. Næs, T. Isaksson, and B. Kowalski, Anal. Chem. 62, 664 (1990).
13. K. V. Mardia, J. T. Kent, and J. M. Bibby, Multivariate Analysis (Academic Press, London, 1979).
14. G. Puchwein, Anal. Chem. 60, 569 (1988).
