
Universiteit van Amsterdam

Multivariate Data Analysis


Gooitzen Zwanenburg
- 2009 -
Chapter 1
Introduction
This course is about analyzing data sets with a large number of variables. Two variables
we can still visualize: for example, if we measure temperature and density we can plot the
data in a graph with the temperature on the x-axis and the density on the y-axis. Visualizing
measurements of three variables is already difficult but still possible in a three-dimensional
plot, but that's it.
In the following you'll learn techniques to handle data sets with a large number of
variables. Think of soil samples from different sites where the concentrations of ten different
heavy metals have been measured, or UV/VIS intensities measured at eighty different
wavelengths. We will find ways to reduce the number of variables and classify the data
into groups. In a sense we will build a model of a large data set that has a reduced
number of variables and can be used to classify unknown samples.
The workhorse of multivariate (so called because there are many variables) data analysis
is PCA, Principal Component Analysis. It is a mathematical technique to extract the
information that matters from a large data set. The idea is that you combine variables
in such a way that correlated variables are counted as one. For example, if you measure
the temperature and density of a liquid, you might as well measure only one of the two because
the two variables are correlated. Of course, you still have to know how they are correlated.
Another example are peaks in the spectra of polycyclic aromatics that always occur
together. Measuring these peaks does not differentiate between the different polyaromatics,
and one might just as well measure only one of the peaks. PCA is a technique that can
filter out this redundant information and leave us with what really matters.
No pain, no gain: we will have to go through some math and we will need a computer
to work out some examples. For the latter we will use the program Matlab. Because we
shouldn't waste too much time on learning to work with Matlab, a very short survival kit
with basic Matlab commands is included in this introduction.
However, because matrices are the tools of the trade in multivariate data analysis we
start with a few notes on matrices and notation.
1.1 Matrices and notation
Matrices are rectangular arrays of numbers indicated by bold capital letters; thus an m×n
matrix is an array with m rows and n columns, written as:
A = [ a_11  a_12  a_13  ...  a_1n
      a_21  a_22  a_23  ...  a_2n
       ...   ...   ...        ...
      a_m1  a_m2  a_m3  ...  a_mn ]                              (1.1)
Notice that the elements of A, a_ij, are given by small letters with two indices. The first
index gives the row number of the matrix element, the second index gives the column
number of the matrix element.
A matrix with only one row or one column is called a vector and is indicated by a small
bold letter. The m×1 column vector a is:
a = [ a_1
      a_2
      ...
      a_m ]                                                      (1.2)
The transpose of an m×1 column vector is a 1×m row vector:

a^T = [ a_1  a_2  ...  a_m ]                                     (1.3)
Data can be collected in a matrix in such a way that each column of the matrix corre-
sponds to a variable and each row corresponds to a measurement. For example, we collect
data on the height, weight and shoe size from different people and obtain the following
table:
height (m)   weight (kg)   shoe size
1.80         75            42
1.65         55            37
1.78         83            41
The results can now be put in a matrix where each column corresponds to a variable
and each row to a measurement:
X = [ 1.80  75  42
      1.65  55  37
      1.78  83  41 ]                                             (1.4)
The length l of a vector a (also called the Euclidean norm) is the square root of the sum
of the squares of its elements:

l = ||a|| = sqrt( sum_{i=1}^{m} a_i^2 )                          (1.5)
1.2 Matrix multiplication
Multiplying two matrices goes as follows. Let A be an i×k matrix and B a k×j matrix.
The product C = AB is an i×j matrix that is obtained by multiplying the rows of A
with the columns of B:

c_ij = sum_{k} a_ik b_kj                                         (1.6)

This means that to be able to multiply two matrices, the number of columns of the first
matrix in the product and the number of rows of the second matrix in the product must
be the same. Note that for matrices in general AB ≠ BA.
We can write the length of a vector in matrix form as the product of a row vector and
a column vector:

l = sqrt( sum_{i=1}^{m} a_i^2 ) = sqrt( a^T a )                  (1.7)
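As a small worked example of equation (1.6), take A = [1 2; 3 4] and B = [0 1; 1 0]. Then
AB = [2 1; 4 3] while BA = [3 4; 1 2], so indeed AB ≠ BA in general. Similarly, for the
vector a = (3, 4)^T we have a^T a = 3^2 + 4^2 = 25, so equation (1.7) gives l = sqrt(25) = 5,
the same length as before.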
1.3 Matlab Primer
The program Matlab will be used for the exercises in this part of the course. Matlab is a
widely used program that allows the user to write programs but also to make use of the
many pre-programmed functions. Matlab stands for Matrix Laboratory and, not surprisingly
with a name like that, is very much based on working with matrices (also called arrays).
This makes it possible to use many built-in matrix functions, a definite advantage, but it
also means that one has to be careful: Matlab considers most operations to be operations
on matrices, and this can at times be confusing.
One of the more important functions is help; if you type this function followed by the name
of the function you need help on, Matlab displays a short description of that function and its use. For instance
help sum
gives information on the Matlab function sum.
Here is a very short overview of Matlab functions and operations that you may find
helpful when you do the exercises.
In Matlab there are different ways to assign a matrix to a variable:
A = [1 2 3 4; 5 6 7 8]
A = [1:4; 5:8]
both these assignments assign the matrix
1 2 3 4
5 6 7 8
to the variable A. The elements within a row are separated by spaces (or commas), the rows of the
matrix are separated by a semicolon, ;. The : notation in the second assignment indicates
a range: 1:4 stands for the numbers 1 through 4, 5:8 stands for the numbers 5 through 8.
The function ones(i,j) creates an i×j matrix in which the elements are all 1:
>> One = ones (3,1)
One =
1
1
1
>> Een = ones (1,3)
Een =
1 1 1
The transpose of a matrix A is the matrix A' with the rows and columns of A interchanged:

>> A = [1 2 3 ; 4 5 6 ; 7 8 9]
A =
1 2 3
4 5 6
7 8 9
>> A'
ans =
1 4 7
2 5 8
3 6 9

Notice that we have indicated the transpose of a matrix with a superscript T earlier. This
is what is usually found in the literature; Matlab uses ' to indicate the transpose.
The functions sum and mean operate on the columns of a matrix. sum(A) returns a
vector with the sums of the columns of A; mean(A) returns a vector with the averages of
the column values of A:
>> A = [ 4 5 6 ; 1 2 3 ; 7 8 9]
A =
4 5 6
1 2 3
7 8 9
>> sum(A)
ans =
12 15 18
>> mean(A)
ans =
4 5 6
You can also assign the result of mean to a variable averg_A:
>> averg_A = mean(A)
averg_A =
4 5 6
When you apply sum or mean to a 1-dimensional array, the sum or average of the array
elements is given:
>> sum(sum(A))
ans =
45
The function std calculates the standard deviation of the columns of a matrix. The
default behavior is to normalize with N−1, where N is the number of elements of the vector
(the number of samples). If a second argument of 1 is given, std normalizes
with N:
>> A = [ 4 2 6 ; 1 2 3 ; 7 3 10]
A =
4 2 6
1 2 3
7 3 10
>> std(A)
ans =
3.0000 0.5774 3.5119
>> std(A,1)
ans =
2.4495 0.4714 2.8674
diag(A) returns a vector with the diagonal elements of A:
>> A = [ 4 5 6 ; 1 2 3 ; 7 8 9]
A =
4 5 6
1 2 3
7 8 9
>> d = diag(A)
d =
4
2
9
Mathematically more advanced functions include cov, inv and svd. The first calculates
the covariance matrix of a matrix, the second returns the inverse of a square
matrix. You will have plenty of opportunity to use svd (singular value decomposition). The
function svd(X) decomposes the matrix X into the product of three matrices U, S and V'
such that X = U*S*V', where S is a diagonal matrix with elements in decreasing order:
>> A = [ 4 5 6 ; 1 2 3 ; 7 8 9]
A =
4 5 6
1 2 3
7 8 9
>> svd(A)
ans =
16.8481
1.0684
0.0000
>> [U,S,V] = svd(A)
If svd is called without assigning the result to matrices, a column vector with the diagonal
elements of S is returned, as demonstrated above. The second way of calling svd, shown above,
assigns the matrices U, S and V. Try this yourself.
To visualize your data Matlab has the plot function. The basic call to this function
is of the form plot(x1,y1,'mmm',x2,y2,'nnn',...) where xi and yi are arrays with x-
and y-coordinates, and 'mmm' and 'nnn' are strings indicating line style, color and
symbol. As an example consider

plot(x(1:20,1),y(1:20,1),'r*',x(21:33,1),y(21:33,1),'bo')
xlabel('First column of x')
ylabel('First column of y')

In this example we plot the first column of a matrix x along the x-axis and the first column
of the matrix y along the y-axis. We do this in two parts: data points in rows 1 through 20
(x(1:20,1)) are marked with a red asterisk ('r*') and data points from rows 21 through
33 are marked with a blue circle ('bo'). The xlabel and ylabel functions provide text
along the x- and y-axes.
With the text function you can place text strings in a plot:

text(x,y,samples)

Here x and y are two arrays with x- and y-coordinates respectively; samples is an array of
the same length as x (and y) with text strings (for example identification tags of samples)
that are plotted at the coordinates (xi,yi). Two more functions related to plotting may
come in handy: hold and figure. The first holds the current plot (Matlab starts a new
plot if you issue another plot command), the second creates a new figure window. More
information can be obtained from the on-line help function with help plot.
As mentioned earlier, Matlab does everything with matrices; if you want to square all
elements of a matrix A you cannot just write
A^2
because this multiplies A with itself and does not give a matrix with all elements of
A squared (check this for yourself). For these situations Matlab has the period operator:
A.^2
Now each individual element of the matrix is squared.
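For example, with a small 2×2 matrix (hypothetical values):

>> A = [1 2; 3 4];
>> A^2      % the matrix product A*A
ans =
7 10
15 22
>> A.^2     % each element squared
ans =
1 4
9 16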
Sometimes lines are too long to fit on the screen. You can continue on the next line by
entering three periods:

plot(datamatrix(1:7,3),datamatrix(1:7,5),'r*',datamatrix(8:14,3), ...
     datamatrix(8:14,5),'b*')
Chapter 2
Principal Components Analysis
Modern analytical techniques can generate bewildering amounts of data. Consider for
example the difference between the following two hypothetical experiments. In the first
experiment we measure the refractive index of a liquid as a function of time to determine
the progress of a chemical reaction for, say, 30 minutes. This is a so-called first order
measurement; the data can be stored in a one-dimensional array (a vector). Each entry is a
data point for a single variable. If we determine the refractive index every 10 seconds we
have obtained 180 data points after half an hour of measuring. Now, we want to follow the
same reaction by measuring the UV/VIS spectrum between 250 and 500 nm in wavelength
intervals of 5 nm. Every 10 seconds we now measure 50 variables (the intensity in each
wavelength interval counts as one variable). To store the data a column vector no longer
suffices; we need a 2-dimensional matrix. This is called a second order
experiment. After 30 minutes of measuring we have obtained 180 × 50 = 9000 data points.
Obviously, it is difficult or even impossible to work directly with such a large amount
of data. We cannot possibly take in or discover patterns in measurements of more than
two or three variables. Enter PCA. PCA is a technique to reduce the number of variables
without losing important patterns and features in the data.
2.1 Principal Component Analysis: the idea
As mentioned in the introduction, to get a hold on the data we gather the data in a matrix
X; each column of the matrix corresponds to one variable, each row corresponds to a
measurement. As a simple example, suppose we have a number of measurements of shoe
size versus body height, see table 2.1.
height (cm) shoe size
187 44
175 39
192 45
165 37
164 38
173 40
183 44
178 40
187 45
195 47
178 40
Table 2.1: Shoe size versus height.
From these data we construct a data matrix X:
X = [ 187  44
      175  39
      192  45
      165  37
      164  38
      173  40
      183  44
      178  40
      187  45
      195  47
      178  40 ]                                                  (2.1)
with each column corresponding to one of the variables (height, shoe size) and each row
representing a measurement.
More generally, we can construct a data matrix from the data points we obtain with a
second order instrument: m measurements of n variables give an m×n matrix. Thus, the
measurement of the UV/VIS spectra referred to earlier leads to a 180×50 data matrix X.
With the data matrix we have our data in a form that makes them amenable to mathematical
analysis. We consider each variable a dimension in an n-dimensional space and
each row the coordinates of a point in this space. For one, two and three variables it is
possible to visualize this. In figure 2.1 the rows of the data matrix X are shown as points
in the 2-dimensional height–shoe size space.
[Figure 2.1: Shoe size versus height.]

Looking at the data in figure 2.1, there seems to be a relation between the data points:
the taller the person, the larger the shoes he or she wears. No surprises there. Instead of
measuring the height and the shoe size of a person, we could also measure just the height
and derive the shoe size from the plot. We could do this by determining the best straight
line for the data points and reading off the shoe size corresponding to the height of the person.
By fitting a straight line through the data points we have made a model for the data points:
the straight line represents the data points as well as possible and reflects the fact that
the larger the height of the person, the larger the shoe size.
Instead of measuring two variables, height and shoe size, it would be great if we could
combine the two variables into one single variable, represented by the best-fit line through
the data points, and just measure this new variable as the position along the best-fit line.
Unfortunately this is not possible in real life; there is no physical object that corresponds
to a combination of height and shoe size. There is no reason, however, why we should
not create such a variable on paper. This is exactly what we do in principal component
analysis.
[Figure 2.2: Relation between variables X, Y and principal components PC1, PC2.]

Take a look at figure 2.2. We again have two variables, X and Y, that we depict in a
2-dimensional space. As with the shoe size data, the data points appear to be correlated. On
paper we can now construct a new variable, suggestively called PC1 (Principal Component
1), that is a linear combination of the original two variables X and Y. The contributions
of X and Y to each of the principal components are called the loadings. The data points
in the 2-dimensional space can now be measured in terms of the new variables. The values
of the data points for the new variables are called the scores of the data point.
In figure 2.2 most of the data points are not exactly located on PC1, just as the shoe
size data were not exactly on a straight line. To completely describe their position in
2-dimensional space we would need two coordinates. From the figure we see that all points,
except the one right on PC1, also have a non-zero score for PC2. However, this score is
very small compared to the score for PC1, so we can say that by only using PC1 we make
a small error but still have a pretty good model for our data.
What can we say about the error we make if we ignore the scores on PC2? In figure
2.3 we have drawn a single data point A with scores t_1 and t_2 with respect to principal
components PC1 and PC2. The length of the vector a is

||a|| = sqrt( t_1^2 + t_2^2 )                                    (2.2)

The lengths of the projections on PC1 and PC2 are t_1 and t_2, respectively. We can use
the length of the projection as a measure of how much the data point is removed from the
principal component: the closer the length of the projection is to the length of the vector,
the smaller the error with respect to that principal component. Thus, referring to the
figure: the closer t_1 is to ||a||, the smaller the length of t_2. If we were to use only PC1 in
describing the data point we would make an error in the length of the vector a of size
t_2. Because the absolute value of the error doesn't mean much if we don't know the
length of the vector, we prefer to speak of the relative error e:

e = t_2^2 / ( t_1^2 + t_2^2 )                                    (2.3)
[Figure 2.3: Error in data point A. Note that the principal components are orthogonal to
each other and that the axes have been rotated with respect to figure 2.2.]
Conversely, we can say that PC1 can explain a fraction

V = t_1^2 / ( t_1^2 + t_2^2 )                                    (2.4)

of the measurements. In other words, this equation gives us a measure of how well PC1 fits
the data.
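For example, if a data point has scores t_1 = 4 and t_2 = 1, then PC1 explains a fraction
V = 16/17 ≈ 0.94 of that point, and the relative error we make by ignoring PC2 is
e = 1/17 ≈ 0.06.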
2.2 Principal Component Analysis: the computation
So far we have talked about principal components in a fairly qualitative way. We will now
make the discussion more quantitative and see how we can calculate principal components
with Matlab.
In our simple 2-dimensional example we saw that we could successfully describe the
data with a single (constructed) variable by taking a linear combination of the variables
we measured. Although with more than three variables it becomes hard to visualize the
linear combinations of measured variables, this idea can easily be extended to more than
two variables.
Mathematically, what we did when we switched to the principal components to describe
our data was to change the basis of the 2-dimensional space spanned by the two variables
we measured in such a way that the new basis vectors were in the direction of the principal
components PC1 and PC2. We can extend this idea to more dimensions (variables). If we
have n variables we are looking for a linear combination of the variables such that the first
new variable explains most of the variation in the measurements; this is PC1. The second
new variable explains most of the remaining variation in the measurements; this is PC2,
and so on. If we continue we would find n principal components, each a different linear
combination of the n variables. To go back to the shoe size versus height plot: the height
and shoe size variables gave pretty much the same information; you could almost deduce
one from the other. These two variables are correlated. On the other hand, knowledge
about the score for PC1 teaches us nothing about the score for PC2; the principal components
are uncorrelated. Our hope is, of course, that we can build a good model of our data with
only a few principal components.
In mathematical terms, we would like to find two matrices, the scores matrix T and
the loadings matrix P, such that

X = T P^T                                                        (2.5)

where each row of T contains the scores of a data point with respect to the principal
components, and each row of P^T contains the loadings of a principal component.
The data matrix X can thus be written as the product of a scores matrix and a loadings
matrix. If we want to build a model for our data we only include the principal components
that best describe the data. To determine how many principal components we need to take
into account, we need to quantify the error we make when we omit principal components.
Going back to equation (2.5), if we include only the first a principal components we will find that
part of the data remains unexplained by the model:
X = T P^T + E                                                    (2.6)
The matrix E contains the residuals. In terms of principal components and loadings
equation (2.6) can be written as:
X = t_1 p_1^T + t_2 p_2^T + ... + t_a p_a^T + E                  (2.7)
where t_1 and p_1 are the scores and loadings, respectively, for the first principal component,
and so on. The sum of the squares of all elements of E is known as the RSS (Residual Sum of Squares)
error:
RSS = sum_{i=1}^{m} sum_{j=1}^{n} e_ij^2                         (2.8)
When the RSS is of the same order of magnitude as the sum of squares of the errors in
the measurements themselves there is no point in including more principal components. All
they do is improve the fitting of noise!
In Matlab we can find the scores and loadings matrices with a procedure called singular
value decomposition. With this procedure we can write a matrix X as the product of three
matrices U, S and V^T:

X = U S V^T                                                      (2.9)
The matrix S is a diagonal matrix with, on the diagonal, the square roots of the sums of the
squares of the scores for each of the principal components. Hence, the first diagonal element
of S gives (the index i runs over all data points, i.e. all rows of the scores matrix):

g_1 = S_11^2 = sum_{i=1}^{m} t_i1^2                              (2.10)
the second is

g_2 = S_22^2 = sum_{i=1}^{m} t_i2^2                              (2.11)
and so on. In terms of projections, g_1 is the sum of the squares of the projections of all data
points on PC1, and likewise for g_2, etc. The function svd arranges the diagonal elements of
S in such a way that the largest element appears in the upper left-hand corner, the second
largest one position lower on the diagonal, and so on. We can use the diagonal elements of
S to write the RSS as:
RSS = sum_{i=1}^{m} sum_{j=1}^{n} e_ij^2 = sum_{i=1}^{m} sum_{j=1}^{n} x_ij^2 - sum_{i=1}^{a} g_i      (2.12)
We can construct the matrices T and P from the matrices U, S and V^T as follows:

T = U S                                                          (2.13)

and

P^T = V^T                                                        (2.14)
To do 2.1
On Blackboard you can find the file shoe.dat with the data matrix from equation (2.1),
containing the shoe size versus height data. Put the data file in your course directory
and add this directory to the Matlab path. To add the course directory to the path
do the following: under the File menu select Set Path. In the dialog that comes up,
select Add Folder and look for your course directory. Click OK. Save and close the
Set Path dialog.
You can now load the shoe data with the load command:
load('shoe.dat')
Nothing seems to happen, but when you type shoe you'll see that Matlab prints the matrix
from equation (2.1).
We first have a look at the shoe data. Use the plot command to plot the data:
plot(shoe(:,1),shoe(:,2),'b*')
Determine the principal components for this data set. We use the Matlab svd function:
[U,S,V] = svd(shoe)
Matlab returns three matrices, U, S and V. The matrix U represents the scores, S
the size of the scores and V^T the loadings, as described above.
Form the matrix T from equation (2.13)
T = U*S
and verify that the data matrix shoe is the product of the scores and loadings matrices
T and P^T = V^T.
From the matrix T we see that the scores for the second principal component PC2
are fairly small compared to the scores for the first principal component, just like
we saw for the data points in figure 2.2. This suggests that we may obtain a good
description of the data by leaving out the second principal component.
Make a new matrix T1 with only the scores for the first principal component and a
new matrix P1 with only the loadings for the first principal component (recall that
the loadings matrix P equals V):
T1 = T(:,1)
P1 = V(:,1)
and calculate the data matrix X1 from T1 and P1.
With X1 we have recreated the data matrix, taking only PC1 into account. Plot the
data in shoe and X1 in a single figure and compare the two.
plot(shoe(:,1),shoe(:,2),'b*',X1(:,1),X1(:,2),'r*')
The result of the last plot command should come as no surprise. The red stars are
all on a straight line representing PC1.
With equation (2.4) we gave a measure of how well a principal component can
explain the data. Calculate the fraction V_1 of the data that is explained by the first
principal component.
So far we have discussed a very simple case of just two variables, but at the beginning of the
chapter we also mentioned a significantly larger system with 50 variables. The procedure
with many variables is pretty much the same as in the two-variable case discussed so far.
The question is of course: how many principal components should we use to describe our
data? If we use all principal components, all data points will be perfectly reproduced;
if we omit principal components we will get an approximation to our original data matrix
X, with an error. If we include the first a principal components to describe our data we
will find a matrix

X_a = T_a P_a^T                                                  (2.15)

where T_a and P_a are the scores and loadings matrices with only the first a principal components
included.
[Figure 2.4: Scree plot: the fraction of the data explained, according to equation (2.16),
as a function of the number of principal components.]
A way to determine how many principal components to include is to plot the cumulative
fraction (or percentage) V_a of the data that is explained as a function of the number of
principal components we use to describe the data:

V_a = 100 * ( sum_{i=1}^{a} g_i ) / ( sum_{i=1}^{N} g_i )        (2.16)
with N the total number of principal components. From this so-called scree plot (see
figure 2.4) we can determine when adding further principal components does not lead to
a significant improvement in the description. In general we'll see that the first few principal
components explain a large fraction of the data, but at some point there will be an elbow
in the graph where adding further principal components does not lead to a better fit
(frequently scree plots are drawn in which the RSS or the percentage of the data not fitted
is placed along the vertical axis). From figure 2.4 one can see that the first principal
component explains about 40% of the data; adding the second principal component raises
this figure to about 70% of the data. With four principal components more than 95% of
the data can be explained.
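As a minimal sketch of how such a scree plot can be made in Matlab (assuming the matrix
S from the singular value decomposition of the data matrix is available; the names g and
Va are only illustrative):

g = diag(S).^2;              % the g_i values of equations (2.10) and (2.11)
Va = 100*cumsum(g)/sum(g);   % cumulative percentage explained, equation (2.16)
plot(1:length(g),Va,'b*-')
xlabel('Number of principal components')
ylabel('Percentage of data explained')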
To do 2.2
Download the file hplc_dad.xls from Blackboard. The file contains a data set
obtained from high performance liquid chromatography with diode array detection,
HPLC-DAD. The data points were acquired each second at 28 wavelengths. The table
corresponds to table 4.1 in Brereton (see the copy of chapter 4 on Blackboard). To
get the data into Matlab we use the import function. Under File select Import
Data and follow the instructions of the import wizard. The data will be stored in a
matrix, probably called Sheet1.
Use singular value decomposition to find the scores, loadings and eigenvectors of the
data matrix. How many principal components are there? Verify that you can write
the data matrix as the product of the scores matrix T and the transpose of the loadings
matrix, P^T.
Calculate the function V_a from equation (2.16) for the first few principal components
and make a scree plot. How many principal components would you use to describe
the data if you want the error of the model to be less than 5%?
2.2.1 Centering and scaling
So far we have been working with raw data. Usually, however, raw data are not suitable for
principal component analysis. If, for example, the HPLC-DAD data from the previous To
do arise from two compounds but for some reason there is a large background signal for
part of the spectrum, we need to pre-process the data to remove this bias. Otherwise, the
size of the scores of the first principal component will be largely determined by the bias.
Common pre-processing techniques are centering and scaling. In centering, the column
average for each column is subtracted from each data point in that column:
x_ij^c = x_ij - x̄_j                                              (2.17)

where x_ij^c is the data point after centering and x̄_j is the average of the data points in
column j. In effect, centering sets the average value for each variable equal to zero. As
you can imagine, centering can have a large effect on the relative size of the principal
components because most of the bias ends up in the first principal component.
When many variables are measured, it may well be that not all variables are measured
on the same scale or even in the same units. For example, in the height–shoe size experiment
we saw earlier, the shoe size had values between 37 and 47 whereas the height varied
between 164 and 195. Had we measured the height in meters, the variation would have
been a lot smaller, between 1.64 and 1.95. Of course we do not want our result to depend
on the measurement scale. We can correct for differences in scale by dividing by the
standard deviation of the data points in each column; the measurements are then all on the
same scale. The combination of mean centering and scaling is known as standardization. Written as a
formula, standardization looks like this:
x_ij^s = ( x_ij - x̄_j ) / sqrt( (1/(m-1)) sum_{i=1}^{m} ( x_ij - x̄_j )^2 )      (2.18)

where x_ij is an element of the data matrix, x̄_j is the column average of column j and m is
the number of rows.
To standardize a data matrix X in Matlab we use the functions mean and std:
[m,n] = size(X);
col_avg = mean(X);
X_c = X - ones(m,1)* col_avg;
col_std = std(X_c);
X_s = X_c./( ones(m,1)* col_std );
We discuss the above Matlab code in a bit more detail. In the first line the size function
is used to obtain the number of rows and columns of the matrix X. The result is put in a
row vector with the number of rows, m, as the first element and the number of columns,
n, as the second element. We need these (at least the number of rows) in the subsequent code.
In the second line the function mean is used to collect the mean values of each column in a
row vector col_avg. In the next line we encounter the function ones(m,n). As we saw in
the introduction, this function generates an m×n matrix with a 1 at each position. Here,
a column vector of 1s of length m is made and multiplied by the row vector col_avg
to give an m×n matrix with in the first column the average of the first column of X, in
the second column the average of the second column of X, and so on. When we subtract
this matrix from the data matrix X, we have mean centered the data matrix.
Next we calculate the standard deviation for each column of the mean centered matrix
X_c and put the result in a row vector col_std. In the last line we apply the
same technique as in the third line to obtain a matrix with the standard deviations:
ones(m,1)*col_std. Because we need to divide each element of X_c by the corresponding
element of ones(m,1)*col_std we have to use the period operator in the division.
The end result of these five lines of Matlab code is a standardized data matrix X_s.
To do 2.3
Standardize the data in shoe.
Calculate the principal components for the standardized shoe data and calculate how much
each of the principal components contributes to the explanation of the data. Compare
this with the result you obtained earlier. Reflect on the difference between the two
results in the light of standardizing the data.
Plot the standardized shoe data and compare the plot with the plot of the original
shoe data. Comment on the differences and similarities.
To do 2.4
Standardize the data in hplc_dad.
Calculate the scores and loadings for the standardized matrix.
Calculate again the function V_a, but now for the standardized data. How many
principal components would you now use to model the standardized data with an
error of less than 5%?
To do 2.5
We can learn quite a bit about our data by looking at it plotted in the plane of
the first two principal components. In this exercise we will look at the elemental
composition of 58 samples of pottery from southern Italy, divided into two groups,
A and B. The data in the file pottery.xls can be downloaded from the web site
(Brereton exercise 4.8).
The data is in the form of an Excel file. Import the data into Matlab using the import
wizard as we did before. After the import wizard finishes you'll find a 58×11 matrix
data with the data and a separate matrix textdata with row and column headings. You
can look at the Excel file to get an idea about the data and compare it with the way
Matlab stores the data in two separate matrices, a data matrix and a matrix
with column headings.
As a first step, the matrix needs to be standardized. Can you say why? Standardize
the matrix using the procedure described in section 2.2.1. Put the standardized data
in a matrix X:
X = (data - ones(size(data,1),1)*mean(data))./ ...
    (ones(size(data,1),1)*std(data))
For the remainder of this To do we will use this standardized data matrix X.
We want to collect the class assignment of the samples in a separate array. The class
assignment is found in the last column of textdata. We put these in an array class
as follows:
class = textdata(2:size(textdata,1),size(textdata,2));
class = char(class);
A = find(class=='A');
B = find(class=='B');
What we did here is extract the last column of the matrix textdata. We need to
convert the array class to the proper type of array because it starts out as a so-called
cell array; the char function converts it to a regular array. Then we made two new
arrays with the indices of the samples that belong to class A and the indices of the
samples that belong to class B, respectively.
Chapter 3
Supervised pattern recognition
In this chapter we will look at data sets in which the data can be divided into different groups,
often called classes. With such a data set we can build a model for each class and attempt
to assign an unknown sample to one of the classes in the model. This is known as supervised
pattern recognition: we want to assign an unknown sample to one or more known classes of
data. Examples include recognizing skeletal remains as male or female or assigning paint
samples to different manufacturers.
Suppose we have data from known groups or classes, for example soil samples from
different locations; these form what is known as the training set. We also have samples of
unknown origin that we want to assign to one of the known groups. To assign a sample
to one or more known classes we need some discriminating criterion. We will consider
some classification methods and discriminating criteria below.
3.1 kNN classification
In kNN (k-Nearest Neighbors) classification we look at the distance of an unknown sample
to the samples of the different classes we have. The number k determines the number of
nearest neighbors of the unknown sample we consider. The unknown sample is assigned
to the class with the largest number of the k closest samples. In the simplest case we only
consider one nearest neighbor, k = 1. The unknown sample is then assigned to the class
that has the measurement with the smallest distance to the unknown sample. In figure 3.1 you
see two classes A and B and an unknown sample indicated by a separate symbol. According to the 1-NN
classification the sample will be classified in class B because the measurement closest to
the unknown sample belongs to class B.
An easy next step would be to consider not the nearest neighbor but the three nearest
neighbors of the unknown sample and assign the sample to the class with the most nearest
neighbors.
We skimmed over an important aspect: how do we calculate the distance between
measurements? A common way to calculate the distance is to use the Euclidean
distance, well known from Pythagoras' theorem (see also equation (1.7)):

d_E(x, y) = sqrt( sum_{i=1}^{n} ( x_i - y_i )^2 ) = sqrt( (x - y)^T (x - y) )      (3.1)
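In Matlab, equation (3.1) is a one-liner. A minimal sketch with two hypothetical row
vectors x and y (for row vectors the transpose goes on the second factor):

x = [1 2 3];
y = [4 6 3];
d = sqrt((x-y)*(x-y)')    % Euclidean distance, here sqrt(9 + 16 + 0) = 5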
In the following exercise, taken from Brereton (2), we illustrate the k-NN method with
a data set of 17 milk samples. The first seven are samples of pure cow milk (class C1), the
next seven are samples of cow milk mixed with 20% goat milk (class C2). The last three
samples are of unknown origin. The variables are the concentrations (%) of three fatty
acids, FA1, FA2 and FA3.
To do 3.1
Use the import wizard to import the file milk.xls. Matlab puts the data in a matrix
data and the text (first row) in a matrix textdata.
Plot the variable FA1 versus FA2. Indicate the different classes by different colors:
plot(data(1:7,3),data(1:7,4),'r*',data(8:14,3), ...
     data(8:14,4),'b*',data(15:17,3),data(15:17,4),'k*')
Comment on the options you have to distinguish between the two classes. Based on
this plot, how would you assign the unknown samples? Plot also FA1 versus FA3.
Does this plot make classification easier? Clearly, we are looking for a more objective
way to separate the two classes and assign the unknown samples.
As a first attempt, we will classify sample 15 with the 1-NN method. The discrimination
criterion will be the smallest distance: if sample 15 is closest to a point
of class C1 we assign sample 15 to class C1, otherwise to class C2.
Split the matrix data into three matrices: one for each of the two classes C1 and C2,
called X1 and X2, and one, X0, for the unknown samples. For example, for class C1:
X1 = data(1:7,3:5)
To determine the nearest neighbor, we need to find the Euclidean distance of sample
15 to the data points in classes C1 and C2. We isolate sample 15 from matrix X0:
s15 = X0(1,:)
To find the distance of this sample to the samples in classes C1 and C2, subtract this
vector from each row of the matrices X1 and X2. For class C1:
D1 = X1 - ones(7,1)*s15
Do the same for X2.
We now have the vectors from sample 15 to all other samples. To determine the
Euclidean distances we need to determine the lengths of all these vectors, so we square
all elements, sum over the rows, and take the square root:
sqrt(sum((D1.^2)'))
This last Matlab expression is very compact: we square all elements of the matrix,
then we take the transpose of the result and sum the columns of the transpose (we
had to sum over the rows to get the lengths of the distance vectors). Finally, we take
the square root to get the Euclidean distances. These are put in a row vector: the
first element is the distance of sample 15 to the first sample of class C1, the second
the distance of sample 15 to the second sample of class C1, and so on.
Do the same for the distances of sample 15 to the samples of class C2. Find the
smallest distance and classify sample 15.
Classify also sample 17.
Classify samples 15 and 17 using 3-NN classification.
When we discussed principal component analysis we noticed that in the analysis some
variables had a relatively large impact on the final results because the absolute values of
the measurements were much larger than those of the other variables. A similar problem
can arise here. The Euclidean distance to a class can be affected by large scatter in the data
points. To account for such differences in variation among classes we can again standardize
our data before classification. In general, standardization over the whole training set is used.
To see the effect of standardization on the classification with the kNN classifier, we classify
the two samples 15 and 17 from the previous To do again, but now after standardization
of the data.
To do 3.2
Standardize the data using the whole training set, the first 14 samples:
X = data(1:14,3:5)
XS = (X-ones(size(X,1),1)*mean(X))./ ...
     (ones(size(X,1),1)*std(X))
This is a condensed version of the standardization we did in the section on principal
components. You may want to compare the procedure there with this one.
Standardize also the unknown samples (samples 15, 16 and 17) using the mean and
standard deviation you just calculated:
S0 = (X0 - ones(size(X0,1),1)*mean(X))./ ...
     (ones(size(X0,1),1)*std(X))
Classify the standardized samples 15 and 17 using 1-NN and 3-NN classification:
calculate the distance of the standardized samples 15 and 17 to the standardized
points of the two classes. For sample 15 we would calculate the distances to the
fourteen samples in classes C1 and C2 as follows:
S15 = S0(1,:)
DS = XS - ones(size(XS,1),1)*S15
sum((DS.^2)')
Select the shortest distance to find the 1-NN classification and the 3 closest distances
to find the 3-NN classification. Are the classifications the same for the raw and
the standardized data?
Mahalanobis distance
In the k-NN classification we calculated the distance of an unknown sample to all samples
in the training set and selected the k shortest Euclidean distances. Instead of calculating
the distance to all samples in a class and selecting the ones that are closest, we can also
calculate the average position of all samples in a class (the so-called centroid) and determine
the distance of an unknown sample to the centroids of the different classes. The sample is
then assigned to the class with the nearest centroid.
What if there is some structure in the measurements in the different classes? For
example, data can have a large variation along one variable, but only slight variation along
another variable. Also, variables can be correlated, giving unwanted bias to distances along
the correlated variables. In such cases the Euclidean distance may not be the appropriate
way to measure distance. Brereton gives the following example that nicely illustrates the
issue. Suppose we have measured the weight of mammals and have obtained a class for
elephants and a class for mice. The average for the elephant class is perhaps 2000 kg, the
average for mice 2 grams. What happens if we want to classify a baby elephant weighing
200 kg? If we use the Euclidean distance the baby elephant would be classified as a mouse,
because the distance to the average of the mouse class is smaller than the distance to the
average of the elephant class.
To correct for such bias we use the Mahalanobis distance. It corrects for correlation
between variables and large differences in measurement size by weighting the class distance
with the inverse of the variance-covariance matrix of the class, not unlike the standardization of data
we did in the previous section. The main difference is that now we have more than one
variable and cross-correlations between the variables. In matrix form the Mahalanobis distance
is:
is:
d
M
(x, y) =
_
(x y)
T
C
1
(x y) (3.2)
Where C
1
is the inverse of the variance covariance matrix of the class under consideration.
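As a minimal sketch of equation (3.2) in Matlab (the names are only illustrative: XA is
assumed to hold the measurements of one class, one row per sample, and s an unknown
sample as a row vector):

CA = mean(XA);                    % centroid of the class
CinvA = inv(cov(XA));             % inverse of the variance-covariance matrix of the class
dM = sqrt((s-CA)*CinvA*(s-CA)')   % Mahalanobis distance of s to the class, equation (3.2)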
In the following To do the use of Mahalanobis distance and centroids will be illustrated
with an exercise from Brereton.
[Figure 3.1: Distances d_A and d_B of an unknown object to the centers of classes A and B.]
To do 3.3
In this exercise from Brereton (exercise 4.8) we will use the Euclidean and Mahalanobis
distances to classify pottery from pre-classical sites in Italy. Elemental
composition was measured on 58 samples of pottery from southern Italy, divided
into two groups, A and B. The data in the file pottery.xls can be downloaded from
the web site.
The data is in the form of an Excel file. Import the data into Matlab using the import
wizard as we did before: under the File menu, select Import Data... and locate
the file pottery.xls. After the import wizard finishes you'll find a 58×11 matrix
data with the data and a separate matrix textdata with row and column headings.
As a first step, the matrix needs to be standardized. Can you say why? Standardize
the matrix using the procedure described in section 2.2.1. Put the standardized data
in a matrix X:
X = (data - ones(size(data,1),1)*mean(data))./ ...
    (ones(size(data,1),1)*std(data))
For the remainder of this To do we will use this standardized data matrix X.
This is a good time to make some other arrays that we can use later on. First we
need a vector with the sample codes so we can identify the samples in a plot. The
sample codes ended up in the first column of the array textdata. To extract them
we use:
samples = textdata(2:size(textdata,1),1)
We also want the class assignment of the samples in an array. This is the last column
of textdata and we put these in an array class:
class = textdata(2:size(textdata,1),size(textdata,2));
class = char(class);
A = find(class=='A');
B = find(class=='B');
What we did here is extract the last column of the matrix textdata. Then we need
to convert class to the proper type of array: class starts out as a so-called cell
array, which we convert to a regular array with
class = char(class);
Next we made two new arrays with the indices of the samples that belong to class A
and the indices of the samples that belong to class B, respectively.
From the table of data we have no idea whether we can say anything about the discrimination
between the two classes. Also, we have no idea whether there are outliers that could affect the
classification of the data. Therefore, we start with PCA to explore the data set.
To do 3.4
Do a principal component analysis on the data using Matlab's svd function. As you
remember, this function returns three matrices that we can use to find the scores,
the loadings and the eigenvalues:
[U,S,V] = svd(X);
T = U*S;
P = V;
Plot the scores of the first principal component versus the second. Use different
symbols for classes A and B. The first 23 samples belong to class A. You can use
plot:
plot(T(A,1),T(A,2),'r*',T(B,1),T(B,2),'b*')
In the plot you see a two-dimensional representation of the data points along the first
two principal components. You notice there is an outlier; to find out which point
it is, we need to label the points. This is where the array samples we made earlier
comes in handy.
hold
text(T(:,1),T(:,2),samples)
With hold we keep the current plot and with the text command we plot the contents
of samples at the correct positions in the principal components plot. Which sample
is the outlier?
Outliers can give a strong bias when we calculate the distance to a class: if the outlier
happens to lie much closer to the unknown sample than the other points of the class,
the unknown sample can be assigned to the class based on the distance to the outlier.
If we first calculate the centroid, the effect of outliers will be smaller, but it could still
be significant. Therefore it is good practice to remove outliers before classifying.
Remove the outlier and calculate the centroids for class A and class B. We know
the outlier is sample number 22, so we can make a data matrix for class A without
sample 22 and a data matrix for class B:
XA = [X(1:21,:); X(23,:)]
XB = X(24:size(X,1),:)
To make XA we construct a new matrix with the first 21 rows of X and a last row
comprising row 23 of X. XB is the same as the last part of X, starting at row 24 and
ending at row size(X,1).
Use the Matlab function mean to calculate the centroids for both classes (without the
outlier) and put them in arrays CA and CB:
CA = mean(XA)
and likewise for CB.
Of course the points of class A and class B are spread out, so it could well be that
some points of, say, class A are closer to the centroid of class B and vice versa. Since
we know to which class the points really belong, we can calculate the fraction that
is classified correctly.
To obtain the distance d(x_i, CA) of sample i to the centroid CA of class A, we need
to calculate:

d(x_i, CA) = sqrt( sum_{j} ( x_ij - CA_j )^2 ) = sqrt( (x_i - CA)^T (x_i - CA) )      (3.3)
where the sum goes over all variables (columns of X), and similarly for the distance to
the centroid of class B. We have already done a similar procedure when we did kNN
classification. We calculate the distances of all samples to the centroid of class A:
DA = sqrt(sum(((X - ones(size(X,1),1)*CA).^2)'))
This is rather condensed; see if you can work out all the steps and do the same for
the distances of all samples to the centroid of class B. Put these in an array DB.
With the arrays you found, make a class distance plot in which you plot for each
sample the distance to class A against the distance to class B. You do this by plotting the two arrays
you found in the previous part of this To do. Use different colors for each class:
plot(DA(1:23),DB(1:23),'r*',DA(24:size(X,1)),DB(24:size(X,1)),'b*')
If you classify the samples using the shortest distance to a centroid as the
classification criterion, how many samples would be classified correctly?
As you saw, not all samples were classified correctly. We can try to do better by
accounting for the difference in variation within the two classes. We can do this by using
the Mahalanobis distance instead of the Euclidean distance. The distance function
is now given by equation (3.2).
Use the Matlab function cov to calculate the covariance matrix for each of the classes
(without the outlier):
COVA = cov(XA)
COVB = cov(XB)
You should obtain two 11×11 matrices, one for class A and one for class B.
Next, we use equation (3.2) to calculate the Mahalanobis distance of each sample to
the classes A and B. Because this is a rather involved set of operations, we do it
step by step. First we calculate the difference vectors between the samples and the
centroids:
DIFFA = X - ones(size(X,1),1)*CA
DIFFB = X - ones(size(X,1),1)*CB
and then obtain the distances:
MA = sqrt(diag(DIFFA*inv(COVA)*DIFFA'))
MB = sqrt(diag(DIFFB*inv(COVB)*DIFFB'))
Make a distance plot with the Mahalanobis distances and compare the result with
the plot of the Euclidean distances. How many samples are classified correctly now?
Literature
Richard G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant,
John Wiley & Sons, West Sussex, 2003.
Richard G. Brereton (editor), Multivariate Pattern Recognition in Chemometrics, Illustrated
by Case Studies, Elsevier, Amsterdam, 1992.