JOURNAL OF CHEMOMETRICS, VOL. 11, 311-338 (1997)

THE GEOMETRY OF PARTIAL LEAST SQUARES


ALOKE PHATAK1* AND SIJMEN DE JONG2

1 CSIRO Mathematical and Information Sciences, Locked Bag 17, North Ryde, N.S.W. 2113, Australia
2 Unilever Research Laboratorium Vlaardingen, PO Box 114, NL-3130 AC Vlaardingen, Netherlands

SUMMARY
Our objective in this article is to clarify partial least squares (PLS) regression by illustrating the geometry of
NIPALS and SIMPLS, two algorithms for carrying out PLS, in both object and variable space. We introduce the
notion of the tangent rotation of a vector on an ellipsoid and show how it is intimately related to the power method
of finding the eigenvalues and eigenvectors of a symmetric matrix. We also show that the PLS estimate of the
vector of coefficients in the linear model turns out to be an oblique projection of the ordinary least squares
estimate. With two simple building blocks, tangent rotations and orthogonal and oblique projections, it
becomes possible to visualize precisely how PLS functions. © 1997 by John Wiley & Sons, Ltd.
Journal of Chemometrics, Vol. 11, 311-338 (1997)
KEY WORDS: partial least squares; geometry; power method; oblique projection
(No. of Figures: 13; No. of Tables: 2; No. of References: 25)

1. INTRODUCTION
Since its introduction into chemometrics as a tool for solving regression problems with highly
collinear predictor variables, partial least squares (PLS) [1] has become a common, if not standard,
regression method. Much work has gone into elucidating its mathematical and statistical properties [2-8]
and several alternative algorithms have been proposed, especially recently [9-14]. All these efforts have
helped to clarify the diverse aspects of PLS and shed light on what was once, by definition, only an
algorithm.
In this article our objective is to illustrate partial least squares by presenting what amounts to two- and
three-dimensional pictures of how it works. We extend the work of Phatak et al. [15], who considered
the geometry of PLS in object space or, as it is sometimes known, the space spanned by the columns
of the matrix of regressors. Here, by contrast, we will work in variable space or the space spanned by
the rows of the matrix of predictor variables. Looking at PLS in variable space allows us to do two
things that cannot be done in object space: first, by illustrating the SIMPLS algorithm of de Jong [9], we
can more easily visualize the extraction of successive PLS dimensions; second, we can illustrate the
algebraic result that the PLS estimator of the vector of coefficients in the standard linear model is an
oblique projection of the ordinary least squares (OLS) estimator. This is a result which, to our
knowledge, has not appeared in the literature.
We begin in Section 2 by establishing notational conventions and then briefly outline the algebra
of PLS and related regression methods in Section 3, which also introduces oblique projections. Section
4 briefly discusses some properties of ellipses and ellipsoids and planes tangent to them. A good
understanding of this material is essential for grasping the geometric interpretations of principal
component regression (PCR) and univariate and multivariate PLS that are presented in Sections 5 and
6. Finally, Section 7 summarizes some of the more important aspects of the geometry presented in the
paper.
Correspondence to: Aloke Phatak.

CCC 0886-9383/97/040311-28 $17.50
© 1997 by John Wiley & Sons, Ltd.

Received 31 May 1996
Accepted 10 December 1996

2. NOTATION
Throughout this paper, matrices will be denoted by boldface uppercase letters (A, G), column vectors
by boldface lowercase letters (a, g) and scalars by lowercase italic letters (a, g). Transposition of a
matrix or vector is shown by the superscript T (A^T, a^T); where the inverse of a square matrix exists,
it will be denoted by, for example, A^{-1}. The Moore-Penrose inverse [16], on the other hand, is denoted
by A^-. Some special matrices and vectors that we will come across include the identity matrix of
order p (I_p) and a vector consisting solely of ones (1).
The operator diag will occur occasionally and has two roles. When applied to a square matrix, e.g.
diag(A), it extracts the diagonal elements of A and yields a vector whose jth element is the jth diagonal
element of A. When applied to a vector of length m, however, it constructs an m × m diagonal matrix
whose jth diagonal element is the jth element of that vector. Because we differentiate typographically
between matrices and vectors, there should be no confusion about which of the two operations is
meant.
In many of the derivations that follow, it will be convenient to express vectors (and columns of
matrices) in mean-centered form. Mean centering implies the operation

x_i ← x_i − x̄

where x_i is the ith element of the vector x and x̄ is the arithmetic mean of its elements. Notationally,
we will not distinguish between a vector (and hence a matrix of mean-centered columns) and its mean-
centered form. It will be clear from the context which one is intended.
Given two mean-centered vectors x_j and x_k, we will dispense with the scalar n − 1 when expressing
their sample covariance, i.e. we write

cov(x_j, x_k) = x_j^T x_k

where the factor n − 1 is implied.
When we refer to a linear combination of a matrix, we mean a linear combination of the columns
of that matrix. For example, if A is an n × p matrix, then a linear combination of A is

c = Ab = b_1 a_1 + b_2 a_2 + ... + b_p a_p

where a_1, a_2, ..., a_p represent the p columns of A and b = (b_1, b_2, ..., b_p)^T is a p × 1 vector of weights.
The result is an n × 1 vector c.
3. OLS, PCR AND PLS
In this section we will briefly outline ordinary least squares, principal component regression and
partial least squares regression. The emphasis here will be on the algebraic derivation of the vector (or
matrix) of coefficients in the linear regression model, for we shall see in later sections how they are
constructed in variable space. More rigorous and complete treatments of the material presented here
may be found elsewhere [2, 17, 18], but we include it for the sake of completeness.
The starting point for these three methods is the standard regression model defined by the
equation

y = 1b_0 + Xb + ε                                   (1)

where y is an n × 1 vector of observations on the response variable, b_0 is an unknown constant, X is
an n × p matrix consisting of n observations on p variables, b is a p × 1 vector of parameters and ε is
an n × 1 vector of errors identically and independently distributed with mean zero and variance σ^2. For
the sake of simplicity we assume that n > p. In equation (1) the elements of X are measured about their
column means, i.e. X satisfies X^T 1 = 0, but it will also be convenient in the derivations that follow to
express y in mean-centered form. Hence we will adopt the model

y = Xb + ε                                          (2)

where y now has also been mean centered, i.e. y^T 1 = 0. To simplify the discussion that follows, we shall
neglect the constant term in equation (1) and instead focus on the estimation of the vector of
coefficients b and on the properties of its different estimates.
For multiresponse data the analog of equation (2) can be written as

Y = XB + E                                          (3)

where Y is an n × q matrix of observations on q response variables y_1, y_2, ..., y_q and E is an n × q
matrix of errors whose rows are independently and identically distributed. The matrix B is a p × q
matrix of parameters to be estimated.
3.1. Ordinary least squares
If X is of full column rank, then the ordinary least squares estimate of b is given by

b̂_OLS = (X^T X)^{-1} X^T y                              (4)

and the corresponding vector of fitted values is

ŷ_OLS = X b̂_OLS = X(X^T X)^{-1} X^T y = 𝒫_X y            (5)

The matrix 𝒫_X is both symmetric and idempotent (i.e. 𝒫_X 𝒫_X = 𝒫_X) and hence is an orthogonal
projection matrix [19]. Thus we can interpret ŷ_OLS as the orthogonal projection of y onto the space
spanned by the columns of X. The notion of projection onto a subspace is an important one and will
reoccur throughout this paper.
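
As a concrete illustration (ours, not part of the original paper), the following NumPy fragment builds the projection matrix of equation (5) for random mean-centered data and confirms that it is symmetric, idempotent and reproduces the OLS fitted values; all variable names are our own.

```python
# Illustrative sketch of equations (4)-(5): OLS as an orthogonal projection.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                      # mean-centre the columns
y = rng.standard_normal(n)
y -= y.mean()

b_ols = np.linalg.solve(X.T @ X, X.T @ y)          # equation (4)
P_X = X @ np.linalg.inv(X.T @ X) @ X.T             # projector in equation (5)

assert np.allclose(P_X, P_X.T)                     # symmetric
assert np.allclose(P_X @ P_X, P_X)                 # idempotent
assert np.allclose(P_X @ y, X @ b_ols)             # fitted values = projection of y
```
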
The least squares estimator in equation (4) turns out to be an optimal estimator in many senses; for
a discussion of some of these properties, see e.g. Reference 17. In particular, it is BLUE: the best linear
unbiased estimator of b, best in the sense that of all possible linear unbiased estimates of its elements
b_j the least squares estimates have the smallest variance. In many chemometric problems, however,
the predictor variables may be highly collinear; as a result, the variance of the least squares estimator
may be very large and subsequent predictions rather imprecise. For this reason, biased methods such
as ridge regression, PCR and PLS are used, with the consequent trade-off between, on one hand,
increased bias and, on the other hand, decreased variance. Indeed, as we shall see below, not only is
the PLS estimator biased, it is also non-linear. When n < p, the inverse in equation (4) does not exist
and there is no unique unbiased least squares estimator [20].
By straightforward extension of equation (4) the least squares estimate of B in equation (3) is

B̂_OLS = (X^T X)^{-1} X^T Y                              (6)

which is equivalent to regressing each response separately on the explanatory variables.


3.2. Principal component regression
The idea in principal component regression is to replace the original regressors by a subset of the
principal components (PCs) of X. The principal components of X are successive linear combinations
of X that account for as much of the total variation as possible subject to orthogonality constraints and
to restrictions on the length of the weight vector. It turns out that the ith principal component, which
we shall write as ξ_i, is equal to Xg_i, where the weight vectors g_i, i = 1, 2, ..., p, are the unit-norm
eigenvectors of X^T X, i.e. they satisfy

X^T X g_i = λ_i g_i                                   (7)

with

g_i^T g_j = 1 if i = j and 0 if i ≠ j                 (8)

The corresponding eigenvalues are denoted by λ_i and we shall adopt the ordering convention that
λ_1 ≥ λ_2 ≥ ... ≥ λ_p. If we premultiply both sides of equation (7) by g_i^T, it is easy to see that

g_i^T X^T X g_i = ξ_i^T ξ_i = λ_i                      (9)

and hence that the variance of a principal component is proportional to its corresponding eigenvalue.
Furthermore, because of the orthogonality restrictions placed on the PCs,

g_i^T X^T X g_j = ξ_i^T ξ_j = 0,   i ≠ j              (10)

i.e. a given PC is orthogonal to all others. Principal component analysis (PCA) affords a bilinear
decomposition of X, i.e.

X = ξ_1 g_1^T + ξ_2 g_2^T + ... + ξ_p g_p^T = Σ_{i=1}^p ξ_i g_i^T = Ξ G^T        (11)

Inserting λ_i^{-1/2} λ_i^{1/2} between ξ_i and g_i^T yields

X = Σ_{i=1}^p (ξ_i λ_i^{-1/2}) λ_i^{1/2} g_i^T = Σ_{i=1}^p u_i s_i v_i^T = U S V^T        (12)

which shows the equivalence between principal component analysis and the singular value
decomposition [21] of X. In equation (12), u_i = ξ_i λ_i^{-1/2} are the unit-norm left singular vectors, s_i = λ_i^{1/2} are
the singular values, with S = diag(s_1, s_2, ..., s_p), and v_i = g_i are the right singular vectors of X.
In most spectroscopic problems n ≪ p and hence the number of non-zero singular values is n − 1.
Consequently, at most n − 1 principal components can be extracted.
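
The equivalence in equations (11) and (12) is easy to confirm numerically. The NumPy fragment below is ours, not part of the original paper; the variable names and random test data are our own. It checks that the eigenvectors of X^T X coincide with the right singular vectors of X and that the eigenvalues are the squared singular values.

```python
# Small numerical check of the PCA / SVD equivalence in equations (11)-(12).
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
X -= X.mean(axis=0)

lam, G = np.linalg.eigh(X.T @ X)        # eigenpairs of X^T X
order = np.argsort(lam)[::-1]           # sort so that lambda_1 >= ... >= lambda_p
lam, G = lam[order], G[:, order]

U, s, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(lam, s**2)                       # lambda_i = s_i^2
assert np.allclose(np.abs(G), np.abs(Vt.T))         # eigenvectors = right singular vectors (up to sign)
Xi = X @ G                                          # principal components xi_i = X g_i
assert np.allclose(Xi.T @ Xi, np.diag(lam))         # equations (9)-(10)
```
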
A straightforward way to derive the principal component estimator of b in equation (2) is to first
obtain the vector of fitted values and then rearrange it so it appears as a linear combination of X. Then
the estimator is simply given by the vector of weights.
The PCR vector of fitted values, which we denote by ŷ_PCR^m, can be obtained by projecting y onto the
first m principal components Ξ_m. Thus

ŷ_PCR^m = Ξ_m(Ξ_m^T Ξ_m)^{-1} Ξ_m^T y = 𝒫_Ξ y           (13)

where 𝒫_Ξ is an orthogonal projection matrix that can also be rewritten as U_m U_m^T. Recall, however, that
the PCs are linear combinations of X, i.e. Ξ_m = X G_m, where the first m eigenvectors of X^T X form the
columns of G_m. Furthermore, because of equations (9) and (10), Ξ_m^T Ξ_m = Λ_m. Making these two
substitutions into equation (13) yields

ŷ_PCR^m = X G_m Λ_m^{-1} G_m^T X^T y                     (14)

with Λ_m = diag(λ_1, λ_2, ..., λ_m). Thus the PCR estimator of b is simply



b̂_PCR^m = G_m Λ_m^{-1} G_m^T X^T y                       (15)

What is the geometric interpretation of b̂_PCR^m? If we insert X^T X(X^T X)^{-1} and then simplify, equation
(15) becomes

b̂_PCR^m = G_m Λ_m^{-1} G_m^T X^T X(X^T X)^{-1} X^T y = G_m G_m^T b̂_OLS        (16)

Now, because the eigenvectors are orthonormal, we can write the matrix G_m G_m^T as G_m(G_m^T G_m)^{-1} G_m^T,
which is of course an orthogonal projection matrix. Thus, as expressed in equation (16), b̂_PCR^m is simply
the orthogonal projection of b̂_OLS onto the space spanned by the first m eigenvectors of X^T X. As we
shall see in the next subsection, the expression for the PLS estimator is similar, but it turns out to be
an oblique projection.
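
A short NumPy sketch (ours; the variable names and random test data are not from the paper) makes equations (15) and (16) concrete: the PCR estimator built from the first m eigenpairs coincides with the orthogonal projection of b̂_OLS onto the span of those eigenvectors, and is never longer than b̂_OLS.

```python
# Illustrative check of equations (15)-(16) for PCR with m retained components.
import numpy as np

rng = np.random.default_rng(2)
n, p, m = 25, 4, 2
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()

lam, G = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
lam, G = lam[order], G[:, order]
Gm, Lm = G[:, :m], np.diag(lam[:m])

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_pcr = Gm @ np.linalg.inv(Lm) @ Gm.T @ (X.T @ y)      # equation (15)

assert np.allclose(b_pcr, Gm @ Gm.T @ b_ols)           # equation (16): projection of b_OLS
assert np.linalg.norm(b_pcr) <= np.linalg.norm(b_ols)  # shrinkage noted in Section 5.2
```
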
Multiresponse principal component regression is a far less commonly used technique, but the
corresponding estimator can be derived in an analogous fashion by projecting each column of B̂_OLS
onto the m eigenvectors retained.
3.3. Partial least squares
Partial least squares is similar to principal component regression in that the matrix of regressors is
replaced by a reduced set of linear combinations. There is, however, one essential difference: in PCR
the linear combinations or principal components are derived without reference to the response
variables, but in PLS the observed responses play an essential role. Furthermore, principal components
are optimal in the sense of maximizing the amount of explained variation in X. In PLS, on the other
hand, two sets of linear combinations are derived, one for the explanatory variables and one for the
response variables, and they are optimally related in yet another sense (see Section 6.1.1).
In this subsection we will briefly outline the derivation of the PLS estimators of b and B in
equations (2) and (3) respectively. Since the derivations of the PLS linear combinations and associated
weights and loadings are inseparable from the algorithms for PLS, we shall defer discussing how they
are derived until Section 6.1.2, where the algorithms and their geometry are explained. We note here,
however, that explicit calculation of the Y-block variates is not required to derive the estimators.
In the same way that PCA yields a bilinear decomposition of X (see equation (11)), so too does PLS.
It can be written as

X = t_1 p_1^T + t_2 p_2^T + ... + t_p p_p^T = Σ_{i=1}^p t_i p_i^T = T P^T        (17)

Here the t_i are linear combinations of X, which we shall write as Xr_i. The p × 1 vectors p_i are often
called loadings, although they are true loadings in the strict sense of the word only when the data
matrices are appropriately scaled. Unlike the weights in PCR (i.e. the eigenvectors g_i), the r_i are not
orthonormal. The t_i, however, are, like the principal components ξ_i, orthogonal. In the traditional
NIPALS algorithm this orthogonality is imposed by computing the t_i as linear combinations of residual
matrices E_i, i.e. as

t_i = E_{i-1} w_i,   E_i = X − Σ_{j=1}^i t_j p_j^T,   E_0 = X        (18)

in which the w_i satisfy the conditions for orthonormality. They are also known as Lanczos vectors,
which is a consequence of the close connection of NIPALS to the Lanczos method [21], a method for
approximating the extremal eigenvalues and eigenvectors of a symmetric matrix.

The two sets of weight vectors w_i and r_i, i = 1, 2, ..., m, span the same space. In most algorithms
for both multivariate and univariate PLS the initial sequence of steps aims to derive either w_i or r_i in
order to be able to calculate the linear combination t_i. This is followed by calculation of p_i by simply
regressing X onto t_i. After m dimensions have been extracted, we can write the following
relationships:

T_m = X R_m                                          (19a)

P_m = X^T T_m(T_m^T T_m)^{-1}                         (19b)

R_m = W_m(P_m^T W_m)^{-1}                             (19c)

Here the subscript m indicates that the matrix comprises the first m of the sequence of corresponding
vectors, e.g. T_3 = [t_1, t_2, t_3]. Equation (19c) links the two sets of weight vectors by a linear
transformation. The existence of such a transformation is obvious when we realize that P_m^T R_m = I_m,
which follows from equations (19b) and (19a): R_m^T P_m = R_m^T X^T T_m(T_m^T T_m)^{-1} = T_m^T T_m(T_m^T T_m)^{-1} = I_m.
To derive the PLS estimator, we will follow the same strategy as in the previous subsection.
Although we shall consider univariate PLS only, the expression for the multivariate estimator is
identical, except that b̂_OLS is replaced by B̂_OLS.
After m dimensions have been extracted, the vector of fitted values from PLS is simply the
projection of the observed responses onto the first m PLS linear combinations T_m. Thus we can write

ŷ_PLS^m = T_m(T_m^T T_m)^{-1} T_m^T y                     (20)

We note in passing that equation (20) is more general than it first appears: it applies to any method
which projects onto a subspace T_m = X R_m. Substituting X R_m for T_m and X^T X b̂_OLS for X^T y leads to

ŷ_PLS^m = X R_m(R_m^T X^T X R_m)^{-1} R_m^T X^T X b̂_OLS     (21)

from which it is clear that

b̂_PLS^m = R_m(R_m^T X^T X R_m)^{-1} R_m^T X^T X b̂_OLS       (22)

In equations (21) and (22), R_m may be replaced by any non-singular transformation of R_m without
affecting the result, because the transformations inside and outside the parentheses will cancel. In
particular W_m could be used instead of R_m. This reflects the fact that only the (sub)space spanned by
these vectors is of importance, not a particular set of basis vectors.
A somewhat simpler expression in terms of b̂_OLS can be obtained by first substituting equation (19a) into
(19b), which yields P_m = X^T X R_m(R_m^T X^T X R_m)^{-1}. Then using this result in equation (22) gives

b̂_PLS^m = R_m P_m^T b̂_OLS = W_m(P_m^T W_m)^{-1} P_m^T b̂_OLS       (23)

The matrix W_m(P_m^T W_m)^{-1} P_m^T is idempotent and hence is a projection matrix. It is not symmetric,
however, and is therefore an oblique projector [22], one which projects b̂_OLS onto the space spanned by
the columns of W_m along some direction. As Appendix I shows, projection is along the space
orthogonal to P_m. Following the notation introduced in Appendix I, we shall write W_m(P_m^T W_m)^{-1} P_m^T as
𝒫_{W_m·P_m^⊥}, which is the projector onto W_m along P_m^⊥. Since R_m and W_m span the same space (see equation
(19c)), the oblique projectors 𝒫_{W_m·P_m^⊥} and 𝒫_{R_m·P_m^⊥} = R_m(P_m^T R_m)^{-1} P_m^T = R_m P_m^T are identical. Although
projection along a plane (or hyperplane) is more difficult to visualize than projection along a vector,
we shall see in Section 6.3.2 that it is possible to express successive contributions to b̂_PLS^m as oblique
projections along a vector.
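
The following NumPy sketch (ours, not the authors' code; the loop is a bare-bones univariate NIPALS and all names are our own) verifies equation (23) on random data: the matrix W_m(P_m^T W_m)^{-1} P_m^T is idempotent but not symmetric, and applying it to b̂_OLS reproduces the PLS fit of equation (20).

```python
# Illustrative check that the PLS estimator is an oblique projection of b_OLS.
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 30, 5, 2
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()

E, W, P, T = X.copy(), [], [], []
for _ in range(m):                       # NIPALS-style loop for a single response
    w = E.T @ y; w /= np.linalg.norm(w)  # weight
    t = E @ w                            # score
    pl = X.T @ t / (t @ t)               # loading
    E = E - np.outer(t, pl)              # deflate the residual matrix
    W.append(w); P.append(pl); T.append(t)
W, P, T = np.array(W).T, np.array(P).T, np.array(T).T

b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_pls = W @ np.linalg.solve(P.T @ W, P.T @ b_ols)     # equation (23)

Proj = W @ np.linalg.solve(P.T @ W, P.T)              # the projector in equation (23)
assert np.allclose(Proj @ Proj, Proj)                 # idempotent ...
assert not np.allclose(Proj, Proj.T)                  # ... but not symmetric: oblique
y_fit = T @ np.linalg.solve(T.T @ T, T.T @ y)         # equation (20)
assert np.allclose(X @ b_pls, y_fit)                  # same fitted values
```
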

4. ELLIPSOIDS AND PLANES OF TANGENCY


In illustrating the geometry of PLS and PCR, we will need to grasp the notion of (n − 1)-dimensional
hyperplanes that are tangent to n-dimensional ellipsoids. Appendix II contains an extended discussion
of the geometry of ellipsoids and tangent hyperplanes, but here we simply state some definitions and
the main result.
If A (p × p) is a symmetric matrix of rank p, then the set of p-vectors z satisfying z^T A z = c forms an
ellipsoidal hypersurface in p dimensions. We denote this ellipsoid by A_c^-. Similarly, A_c denotes the set
of vectors z satisfying z^T A^{-1} z = c.
As we will see in the next section, the basis of a geometric description of PCR and PLS is a
depiction of the algebraic operation b = Aa, where a is a p-vector. We denote this operation by the term
tangent rotation and it can be described as follows: to construct a vector that is a scalar multiple of
b (= Aa), we find the point of tangency between the ellipsoid A_c and the hyperplane orthogonal to a.
A reverse tangent rotation corresponds to simply reversing the procedure. Starting with b, we first
construct a hyperplane tangent to the point of intersection between A_c and b; then the direction of a
is given by the vector orthogonal to this hyperplane. Appendix II contains illustrations that show the
relationship between A_c^- and A_c as well as a picture of a tangent rotation.
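
As a small numerical illustration (ours, not from the paper), the fragment below constructs the tangency point for a given symmetric positive definite A and direction a, assuming the ellipsoid A_c = {z | z^T A^{-1} z = c}; it checks that the point lies on the ellipsoid, that the tangent hyperplane there is orthogonal to a and, by construction, that the point is a scalar multiple of b = Aa.

```python
# Illustrative sketch of a tangent rotation: the tangency point is proportional to A a.
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((3, 3))
A = M @ M.T + 3 * np.eye(3)              # a symmetric positive definite matrix
a = rng.standard_normal(3)
c = 1.0

# Maximising a^T z subject to z^T A^{-1} z = c gives the tangency point:
z_star = np.sqrt(c / (a @ A @ a)) * (A @ a)

assert np.allclose(z_star @ np.linalg.solve(A, z_star), c)   # lies on the ellipsoid A_c
# the ellipsoid's normal at z_star is proportional to A^{-1} z_star, which is parallel to a,
# so the tangent hyperplane at z_star is orthogonal to a
normal = np.linalg.solve(A, z_star)
assert np.allclose(np.cross(normal, a), 0)
```
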

5. GEOMETRY OF PCA AND PCR

5.1. The power method


The power method [21] is an old and well-known method of finding the eigenvalues and eigenvectors of
a symmetric matrix A. It is not very efficient but is very easy to understand and implement. Moreover,
its intimate connection with tangent rotation makes it easy to visualize.
The basic idea in the power method is to construct, given a starting vector z^(0), the sequence

z^(1), z^(2), z^(3), ... = k_1 A z^(0), k_2 A^2 z^(0), k_3 A^3 z^(0), ... = k_1 A z^(0), k'_2 A z^(1), k'_3 A z^(2), ...      (24)

In practice the scalars k_i (and k'_i) are chosen so that ||z^(i)|| = 1. (For our purposes we shall suppose that
they are chosen so that z^(i)T A^{-1} z^(i) = 1; in this way the result of a tangent rotation lies on the ellipsoid.)
Except in rare circumstances, the sequence above will converge to the first eigenpair (λ_1, v_1) and
Figure 1 shows why. The right-hand side of equation (24) consists of nothing more than a sequence
of tangent rotations where the result of one tangent rotation becomes the starting point of the next. As
the tangent rotations are continued indefinitely, we can see that the sequence converges to the major
axis of the ellipse A and hence to the first eigenvector of A. The reverse tangent rotation, on the other
hand, converges to the minor axis. Subsequent eigenvectors and eigenvalues can be obtained by
carrying out the power method on the matrices A^(i) = A − Σ_{j=1}^i λ_j v_j v_j^T, i = 1, 2, ..., p − 1, a technique
known as deflation. Geometrically, deflation after i eigenvectors have been obtained implies slicing
a (p − i + 1)-dimensional ellipsoid along the (p − i)-dimensional hyperplane orthogonal to the ith
eigenvector.
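
A minimal sketch (ours) of the power method with deflation, written in NumPy with our own function and variable names; here the iterates are simply rescaled to unit norm, whereas the text rescales them onto the ellipsoid.

```python
# Illustrative power method with deflation for a symmetric matrix.
import numpy as np

def power_method(A, n_iter=200):
    """Dominant eigenpair of a symmetric matrix by repeated tangent rotation."""
    z = np.ones(A.shape[0])
    for _ in range(n_iter):
        z = A @ z                 # one tangent rotation, as in equation (24)
        z /= np.linalg.norm(z)    # rescale (the text rescales onto the ellipsoid instead)
    return z @ A @ z, z

rng = np.random.default_rng(5)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
A = Q @ np.diag([9.0, 5.0, 2.0, 1.0]) @ Q.T          # known spectrum for checking

eigvals = []
Ai = A.copy()
for _ in range(4):
    lam, v = power_method(Ai)
    eigvals.append(lam)
    Ai = Ai - lam * np.outer(v, v)                   # deflation

assert np.allclose(eigvals, [9.0, 5.0, 2.0, 1.0])
```
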
From the geometry of the power method it is easy to see when convergence will be fast or slow.
If λ_1 ≫ λ_2, the ellipse will be elongated and tangent rotations will converge quickly; conversely, if
λ_1 ≈ λ_2, the ellipse will nearly be a circle and the tangent rotations will take only small steps, and
therefore many steps, towards the major axis. Figures 1(a) and 1(b) illustrate both slow and fast
convergence.
In the context of the principal component analysis of the matrix X (n × p) the power method can be
applied to the cross-product matrix S = X^T X to obtain the eigenvectors g_i. Similarly, it is easy to show
that the principal components ξ_i can be obtained by carrying out the power method on the matrix
D = XX^T, which has the same non-zero eigenvalues as the matrix S. Geometrically, the construction of
the ξ_i is analogous to the construction of the eigenvectors of S from the ellipse S. There is, however,
one very important difference: if X is of rank p, say, D will be less than full rank, because p < n, and
hence its inverse does not exist. Consequently, we must use the Moore-Penrose inverse D^-, because
its eigenvectors are the same as those of D, with eigenvalues the inverse of the eigenvalues of D.
Geometrically, this corresponds to working in the p-dimensional subspace of the n-dimensional object
space. Note that if p > n, as is often the case in spectroscopic applications, ellipses constructed from
both S^- and D^- will be required.
To accommodate the possibility of rank-deficient symmetric matrices, we need to slightly amend
the definition of an ellipsoid presented in Section 4. Thus for positive semidefinite A we write

A_c ≡ {z | z^T A^- z = c}                               (25a)

A_c^- ≡ {z | z^T A z = c}                               (25b)

Figure 1. (a) Illustration of power method when λ_2/λ_1 = 4^2/5^2. Note that convergence is relatively slow. (b)
Illustration of power method when λ_2/λ_1 = 5^2/25^2. By contrast with Figure 1(a), convergence to the principal axis
of the ellipse is fast


Figure 2. Construction of PCR estimate of b when only the first principal component is retained

with the important stipulation that z lie in the column space of A, i.e. z ∈ range(A). Then we can
construct both A_c and A_c^- without further restriction.
5.2. PCR
The geometric interpretation of principal component regression is quite straightforward. As equation
(16) showed, b̂_PCR^m is simply the orthogonal projection of b̂_OLS onto the subspace spanned by the
eigenvectors corresponding to the principal components retained in the regression. Figure 2 shows the
construction of b̂_PCR^m for a simple example where there are three variables. Here only the first principal
component has been retained and hence projection of b̂_OLS occurs onto g_1. It is worth noting here that
as a consequence of the way in which b̂_PCR^m is constructed, it follows that ||b̂_PCR^m|| ≤ ||b̂_OLS||, with
equality for m = p.
From equation (13) the PCR vector of fitted values ŷ_PCR^m is the orthogonal projection of y onto the
space spanned by the m PCs retained. Figure 3 illustrates a three-dimensional example in which
projection takes place onto the first two PCs.

Figure 3. Construction of PCR vector of fitted values when the first two principal components are retained


6. GEOMETRY OF PLS
In this section we will present several pictures of the geometry of partial least squares. In Section 6.2
we begin with the simplest possible case, a single response variable and two regressors. This picture
is then extended to three regressors (Section 6.3), which allows us to depict the non-trivial
construction of the second PLS dimension. Section 6.4 presents the simplest possible multivariate PLS
regression, with two responses and two explanatory variables. Finally, we present in Section 6.5 a
description of PLS regression with two responses and three regressors. In all instances the construction
of the vectors of scores, weights and loadings will be depicted in object and variable space. In the
latter we will also illustrate the construction of b̂_PLS^m.
We begin this section, however, by extending the discussion of PLS begun in Section 3.3 to the
algorithms used to determine scores, weights and loadings. In particular, we will consider what might
be considered the standard algorithm [23] and the SIMPLS algorithm of de Jong [9].
6.1. Two algorithms for PLS
6.1.1. Covariance maximization
As we saw in Section 3.3, PLS effects a bilinear decomposition of X in terms of a set of orthogonal
linear combinations t_i and an associated set of loading vectors p_i. Unlike PCR, however, the derivation
of these linear combinations is not motivated by variance maximization, but by covariance
maximization [24]. Thus, to derive the first PLS X-variate, we search for the linear combination t = Xw
and the corresponding Y-variate u = Yq that have maximum squared covariance subject to constraints
on the norms of the respective weight vectors. Algebraically, the objective function can be written as

Φ = (q^T Y^T X w)^2 − φ_y(q^T q − 1) − φ_x(w^T w − 1)          (26)

where φ_y and φ_x are Lagrange multipliers. It is quite straightforward to show [24] that the first set of
weight vectors (q_1, w_1) is given by the dominant right and left singular vectors respectively of the
matrix X^T Y. Equivalently, q_1 is the dominant eigenvector of Y^T X X^T Y and w_1 the dominant eigenvector
of X^T Y Y^T X. Hence, by analogy with the derivation of the PCs of a symmetric matrix, the criterion in
equation (26) can be reformulated (for w) as

Φ = w^T X^T Y Y^T X w − φ(w^T w − 1)                           (27)

which can be interpreted as the sum of the squared covariances of Xw with each response variable.
Note that linear combinations of Y do not appear in this reformulation. In many early presentations
of multivariate PLS, solutions to the two eigenproblems are presented in terms of the pair of
alternating least squares (ALS) equations given by

X^T Y q = φ^{1/2} w                                    (28a)

Y^T X w = φ^{1/2} q                                    (28b)

where φ = φ_x = φ_y. Substituting one into the other shows that, at convergence, q and w yield the right
singular vectors of X^T Y and Y^T X respectively.
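
These statements are easy to confirm numerically. The NumPy fragment below is ours, with arbitrary random data; it extracts (w_1, q_1) from the SVD of X^T Y and checks that they are the dominant eigenvectors of the two cross-product matrices and that the attained covariance equals the largest singular value.

```python
# Illustrative check of the covariance-maximizing weights behind equations (26)-(28).
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 6)); X -= X.mean(axis=0)
Y = rng.standard_normal((40, 3)); Y -= Y.mean(axis=0)

U, s, Vt = np.linalg.svd(X.T @ Y)
w1, q1 = U[:, 0], Vt[0]                               # dominant left / right singular vectors

lam_w, Vw = np.linalg.eigh(X.T @ Y @ Y.T @ X)
lam_q, Vq = np.linalg.eigh(Y.T @ X @ X.T @ Y)
assert np.allclose(np.abs(w1), np.abs(Vw[:, np.argmax(lam_w)]))
assert np.allclose(np.abs(q1), np.abs(Vq[:, np.argmax(lam_q)]))
# the attained covariance between Xw1 and Yq1 equals the largest singular value
assert np.isclose(abs(q1 @ Y.T @ X @ w1), s[0])
```
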
Once the weight vector w_1 has been obtained, the first X-block variate t_1 is calculated as Xw_1 and
the associated loading vector by regressing X on t_1, i.e. p_1 = X^T t_1/||t_1||^2. Successive X-block variates
are derived so that they are orthogonal to one another, i.e. t_i^T t_j = 0, i ≠ j. The way in which the
orthogonality constraint is imposed in NIPALS and SIMPLS differs, and although it makes no
difference when there is only one response variable, the space spanned by the first m t_i, 1 < m < p, will
not be the same when many responses are available.

6.1.2. NIPALS and SIMPLS


In the standard or NIPALS [3] algorithm, orthogonality of the X-block variates is enforced by projecting
the columns of the matrix X onto the space orthogonal to the t_i, i = 1, 2, ..., m, already extracted,
resulting in a matrix of residuals E_m. Successive variates are then defined to be linear combinations
of E_m that maximize the functions in either equation (26) or (27), with E_m in place of X. This way of
ensuring orthogonality is analogous to the deflation carried out to determine successive eigenvectors
by the power method (Section 5.1) and arises out of the bilinear decomposition of X by PLS (equation
(17)). Hence the mth residual matrix can be written as

E_m = E_0 − Σ_{i=1}^m t_i p_i^T,   E_0 = X                          (29a)
    = X − T_m P_m^T                                                 (29b)
    = [I_n − T_m(T_m^T T_m)^{-1} T_m^T] X = 𝒫_{T_m^⊥} X             (29c)

where we have made the substitution P_m = X^T T_m(T_m^T T_m)^{-1} to arrive at the last expression. Equation
(29c) shows clearly the projection of the columns of X onto the space orthogonal to the t_i already
extracted. After E_m has been calculated, the (m + 1)th weight vector w_{m+1} can be obtained as the
dominant left singular vector of the matrix E_m^T Y. The steps in the NIPALS algorithm for deriving
weights, scores and loadings associated with the regressor variables are summarized in Table 1.
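
A minimal NumPy sketch (ours, not the authors' code) of the steps in Table 1; the function name nipals_pls and the random test data are our own. The final checks confirm the orthogonality of the scores and the relationship T_m = X R_m with R_m = W_m(P_m^T W_m)^{-1} noted in equation (19a).

```python
# Illustrative implementation of the NIPALS steps in Table 1.
import numpy as np

def nipals_pls(X, Y, m):
    """Return weights W, scores T and loadings P for m PLS dimensions."""
    E = X.copy()
    W, T, P = [], [], []
    for _ in range(m):
        U, s, Vt = np.linalg.svd(E.T @ Y, full_matrices=False)
        w = U[:, 0]                      # dominant left singular vector of E^T Y
        t = E @ w                        # score
        pl = X.T @ t / (t @ t)           # loading
        E = E - np.outer(t, pl)          # deflate the residual matrix
        W.append(w); T.append(t); P.append(pl)
    return np.array(W).T, np.array(T).T, np.array(P).T

rng = np.random.default_rng(7)
X = rng.standard_normal((30, 5)); X -= X.mean(axis=0)
Y = rng.standard_normal((30, 2)); Y -= Y.mean(axis=0)

W, T, P = nipals_pls(X, Y, m=3)
assert np.allclose(T.T @ T, np.diag(np.diag(T.T @ T)))   # scores are mutually orthogonal
R = W @ np.linalg.inv(P.T @ W)
assert np.allclose(T, X @ R)                             # T_m = X R_m, equation (19a)
```
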
Although NIPALS does not express the t_i explicitly as linear combinations of X, they must be
expressible as such, because the space spanned by the columns of E_i is contained in the column space
of X. Several authors [2, 9] have shown that T_m = X R_m, where R_m = W_m(P_m^T W_m)^{-1}. Clearly, r_1 = w_1, but it is
also true that the r_i calculated in this way are not orthonormal. It is interesting to note that if we
substitute T_m = X W_m(P_m^T W_m)^{-1} into equation (29b) and then rearrange, it is possible to write E_m as

E_m = X[I − W_m(P_m^T W_m)^{-1} P_m^T] = X(I − 𝒫_{W_m·P_m^⊥}) = X 𝒫_{P_m^⊥·W_m}      (29d)

or, alternatively, its transpose as E_m^T = 𝒫_{W_m^⊥·P_m} X^T. Hence, although the columns of X are projected
orthogonally, the rows are projected obliquely onto the subspace orthogonal to W_m and along P_m.
The SIMPLS algorithm, whose main steps are shown in Table 2, is different from NIPALS in two
important ways: first, successive t_i are calculated explicitly as linear combinations of X; second, X is
not deflated in the same way. Now orthogonality of the t_i means that t_i^T t_j = 0. If we denote the weights
by r, then we require that

t_i^T t_j = 0 = t_i^T X r_j = (t_i^T t_i) p_i^T r_j          (30)

for j > i. Here p_i is the ith loading vector calculated in the usual way. Equation (30) implies that any
new weight vector r_{m+1} (m > 0) must be orthogonal to all preceding loading vectors.
Table 1. NIPALS algorithm

Set E_0 = X. Then for i = 1 to m,
1. w_i = dominant left singular vector of E_{i-1}^T Y
2. t_i = E_{i-1} w_i
3. p_i = X^T t_i/(t_i^T t_i)
4. E_i = E_{i-1} − t_i p_i^T
5. i ← i + 1, go to step 1.
End


Table 2. SIMPLS algorithm

Set C_0 = X^T Y. Then for i = 1 to m,
1. r_i = dominant left singular vector of C_{i-1}
2. t_i = X r_i
3. p_i = X^T t_i/(t_i^T t_i)
4. C_i = 𝒫_{P_i^⊥} C_{i-1}
5. i ← i + 1, go to step 1.
End

Algebraically, this requirement can be expressed as

r_{m+1}^T 𝒫_{P_m} = 0                                   (31)

or as

r_{m+1} = (I − 𝒫_{P_m}) r_{m+1} = 𝒫_{P_m^⊥} r_{m+1},    m > 0         (32)

The second expression shows clearly that r_{m+1} must lie in the space orthogonal to the first m loading
vectors. Hence the solution for r_{m+1} is now given by the dominant left singular vector of X^T Y projected
onto P_m^⊥, i.e. of the matrix 𝒫_{P_m^⊥} X^T Y, which implies that E_m can be written as X 𝒫_{P_m^⊥}. Thus the rows
are projected orthogonally onto P_m^⊥, but what of the columns? It is easy to show that the corresponding
deflation in column space is [I − X P_m(T_m^T X P_m)^{-1} T_m^T] X, which implies projection of the columns of X
onto T_m^⊥ (which guarantees orthogonality of the t_i) along the space spanned by the columns of X P_m.
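
A corresponding NumPy sketch (ours) of the SIMPLS steps in Table 2; again the function and variable names are our own. The checks confirm that the scores are mutually orthogonal and that each new weight vector is orthogonal to all preceding loading vectors, as required by equation (30).

```python
# Illustrative implementation of the SIMPLS steps in Table 2.
import numpy as np

def simpls(X, Y, m):
    """Return weights R, scores T and loadings P for m SIMPLS dimensions."""
    C = X.T @ Y                              # C_0
    R, T, P = [], [], []
    for _ in range(m):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = U[:, 0]                          # dominant left singular vector of C_{i-1}
        t = X @ r                            # score: an explicit linear combination of X
        pl = X.T @ t / (t @ t)               # loading
        R.append(r); T.append(t); P.append(pl)
        Pm = np.column_stack(P)
        # C_i = projection of C_{i-1} onto the complement of the loadings extracted so far
        C = C - Pm @ np.linalg.lstsq(Pm, C, rcond=None)[0]
    return np.array(R).T, np.array(T).T, np.array(P).T

rng = np.random.default_rng(8)
X = rng.standard_normal((30, 5)); X -= X.mean(axis=0)
Y = rng.standard_normal((30, 2)); Y -= Y.mean(axis=0)

R, T, P = simpls(X, Y, m=3)
offdiag = T.T @ T - np.diag(np.diag(T.T @ T))
assert np.allclose(offdiag, 0)                        # scores are mutually orthogonal
G = R.T @ P                                           # G[j, i] = r_{j+1}^T p_{i+1}
assert np.allclose(np.tril(G, k=-1), 0)               # new weights orthogonal to old loadings
```
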
Despite the different deflations that are carried out in NIPALS and SIMPLS, de Jong [9] showed that
they yield exactly the same score, loading and weight vectors when there is only one response
variable. In multivariate PLS, however, the results are different, and apart from the first set of vectors
(r_1, t_1, p_1), none of the other sets characterizing subsequent dimensions will be the same. Only in
pathological cases will the difference be of any practical consequence.
When discussing the geometry of PLS, however, the SIMPLS approach has a definite advantage:
for all dimensions we can express the t_i as explicit linear combinations of X. Thus we can stick with the
basic ellipsoids S and D throughout. It also implies that we can, in principle, sketch the geometry of
PLS in variable space of any dimension.
6.2. Single response, two regressors
When there is only a single response variable, the situation is quite simple, because the criterion to be
maximized is the squared covariance of t_i with y itself. Consequently, weights can be determined
explicitly.
6.2.1. Object space
In object space we need only work with ŷ_OLS, because

r_i^T X^T ŷ_OLS = r_i^T X^T X(X^T X)^- X^T y = r_i^T X^T y

For illustrating the geometry of PLS, this is a convenient result, because it allows us to view the
construction of variates in the space of the ellipsoid D.
Now the weights r_i satisfy r_i^T r_i = 1, which in this instance is simply the equation of a circle of radius
one in two-dimensional variable space. The image of this circle in n-dimensional object space is an
ellipse, i.e.

Figure 4. Geometry of univariate PLS in object space with two regressors. The space is spanned by x_1 and x_2 (not
shown) and the principal axes of the ellipse are ξ_1 and ξ_2

r_i^T r_i = r_i^T X^T(XX^T)^- X r_i = t_i^T D^- t_i = 1           (33)

Thus the score vectors t_1 and t_2 lie on the ellipse D_1, which is the ellipse lying in the two-dimensional
subspace of n-dimensional object space spanned by the vectors x_1 and x_2 and having ξ_1 and ξ_2 as its
principal axes (Figure 4). The OLS vector of fitted values ŷ_OLS lies in this subspace and our objective
now is to construct t_1 and t_2, which we can do in one of two ways for this simple example.
First, the optimal weight vector r_1 that maximizes (ŷ_OLS^T X r_1)^2 is collinear with X^T ŷ_OLS. This yields
t_1 = X r_1 = k X X^T ŷ_OLS = k D ŷ_OLS, which can be obtained by a tangent rotation of ŷ_OLS on D. Thus, having
constructed t_1, the next dimension t_2 is found immediately as the vector in the direction orthogonal to
t_1 and lying on the ellipse.
A second, more intuitive way is to rotate a vector on the ellipse around until it has maximum
projection on ŷ_OLS, thus maximizing the inner product ŷ_OLS^T t_1. By trial and error we can see that the
vector obtained by tangent rotation of ŷ_OLS yields the desired result.
The construction of PLS variates (and indeed of weights in variable space) by tangent rotation
illustrates that PLS is intermediate between OLS and PCR and that the PLS subspace depends both
on the eigenstructure of X^T X and on the position of ŷ_OLS. When the eigenvalues of X^T X are nearly all
the same, a PLS solution will yield results similar to OLS. When the first few eigenvalues are large
compared with the rest, however, PLS is closer to PCR.
6.2.2. Variable space
Figure 5 shows the geometry of PLS in variable space. Here the information in y is contained in the
direction of b̂_OLS. The OLS estimator of b is also the direction of steepest descent with respect to y,
perpendicular to the contour lines of y as a linear function of x_1 and x_2.
The first PLS weight vector r_1 (= w_1) can be easily constructed by a tangent rotation of b̂_OLS on S,
because r_1 = k X^T ŷ_OLS = k X^T X b̂_OLS. Furthermore, if we specify the weight vectors to have unit norm,
they will lie on a circle of radius one. Given r_1, it is easy to construct the corresponding loading vector
p_1. We simply apply a tangent rotation to r_1, because p_1 (= X^T t_1/t_1^T t_1) = k X^T X r_1. The length of p_1 is
given by dropping a perpendicular on r_1, since p_1^T r_1 = 1.
Once the first weight vector has been constructed, the second weight vector w_2 in the NIPALS
algorithm is simply the unit-norm vector orthogonal to w_1. In SIMPLS, r_2 lies in the space orthogonal

Figure 5. Geometry of univariate PLS in variable space with two regressors. The vectors w_1 and w_2 are
perpendicular, as are p_1 and r_2. Not shown is the construction of p_2, whose direction is given by a tangent rotation
of r_2 on S

to p_1 and in this simple example it is simply the unit-norm vector orthogonal to the first loading
vector.
The construction of the PLS model is complete when we can also plot the regression vector b̂_PLS^m.
Here only m = 1 is of interest, since retaining all (two) dimensions reproduces b̂_OLS. Clearly, for a one-
dimensional solution, b̂_PLS^{m=1} must be collinear with r_1 (and hence with w_1), because

b̂_PLS^{m=1} = w_1(p_1^T w_1)^{-1} p_1^T b̂_OLS = w_1(p_1^T b̂_OLS) = r_1(p_1^T b̂_OLS)

The precise position of b̂_PLS^{m=1} follows from an oblique projection of b̂_OLS onto r_1, i.e. onto r_1 along a
direction orthogonal to p_1, hence along r_2. Another, though slightly more complicated, way of
illustrating the construction of b̂_PLS^{m=1} from b̂_OLS follows from a reinterpretation of oblique projections.
Without proof we state that equation (23) can be rewritten as

b̂_PLS^m = S^{-1/2} 𝒫_{R̃_m} S^{1/2} b̂_OLS                       (34)

where R̃_m = S^{1/2} R_m and S = X^T X. As a consequence, b̂_PLS^m can be constructed by first carrying out (a)
half a tangent rotation of b̂_OLS, (b) an orthogonal projection of the result onto the subspace spanned
by the vectors arising from half a tangent rotation of r_1, r_2, ..., r_m and finally (c) half a reverse
rotation of the result of the orthogonal projection in (b).
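
Equation (34) rests only on the algebra of equation (22), so it can be checked numerically with any weight matrix of full column rank; in the NumPy fragment below (ours, with a random stand-in for R_m and the symmetric square root of S computed by eigendecomposition) the two expressions coincide.

```python
# Illustrative check that equation (22) and equation (34) give the same vector.
import numpy as np

rng = np.random.default_rng(9)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
y = rng.standard_normal(30); y -= y.mean()
S = X.T @ X
b_ols = np.linalg.solve(S, X.T @ y)

R = rng.standard_normal((4, 2))                      # stand-in weight matrix, full column rank
b_1 = R @ np.linalg.solve(R.T @ S @ R, R.T @ S @ b_ols)          # equation (22)

lam, V = np.linalg.eigh(S)
S_half = V @ np.diag(np.sqrt(lam)) @ V.T             # symmetric square root of S
R_t = S_half @ R                                     # "half a tangent rotation" of the weights
P_Rt = R_t @ np.linalg.solve(R_t.T @ R_t, R_t.T)     # orthogonal projector onto span(R_t)
b_2 = np.linalg.solve(S_half, P_Rt @ (S_half @ b_ols))           # equation (34)

assert np.allclose(b_1, b_2)
```
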
6.3. Single response, three regressors
In going from p = 2 to p = 3, we can now visualize the non-trivial construction of the second PLS linear
combination and associated weight vectors. Visualization is a little more difficult, however: circles
become spheres and ellipses become ellipsoids.
6.3.1. Object space
Figures 6(a)-(c) illustrate the geometry of PLS in object space when there are three regressors. The
weights r_i satisfy r_i^T r_i = 1 and hence lie on a unit sphere; the image of this sphere in object space,

Figure 6. (a) Construction of t_1 by tangent rotation of ŷ_OLS on the ellipsoid D_1. Viewpoint is (0.3, 1, 0.2). (b)
Construction of PLS vector of fitted values by orthogonal projection of ŷ_OLS on the first two PLS dimensions. As
shown here, t_2 and t_3 correspond to dimensions from SIMPLS. Viewpoint is (1, -0.7, 0.3). (c) Construction of
t_2: tangent rotation on either the SIMPLS or NIPALS ellipse yields vectors which are scalar multiples of one
another


however, is a three-dimensional ellipsoid in the n-dimensional object space. Thus the score vectors t_1,
t_2 and t_3 lie on the ellipsoid D_1, which is the ellipsoid lying in the space spanned by x_1, x_2 and x_3, the
columns of X. The directions of the principal axes of D_1 are defined by the principal components ξ_1,
ξ_2 and ξ_3.
The first score vector is found by a tangent rotation of ŷ_OLS (Figure 6(a)). The second score vector
lies in the plane orthogonal to t_1 and it too can be constructed by a tangent rotation of ŷ_OLS projected
onto this plane. Depending on whether SIMPLS or NIPALS is used, however, the ellipse on which the
tangent rotation takes place will be slightly different, although the result will still be the same. In both
algorithms the ellipses are defined by (E_1 E_1^T)^-. As we showed in Section 6.1.2, however, the residual
matrices E_i differ, because deflation is carried out slightly differently in the two algorithms.
In their illustration of the geometry of NIPALS, Phatak et al. [15] pointed out that the ellipse defined
by the NIPALS residual matrix is not bounded by D and hence is somewhat awkward to construct
geometrically. It turns out, however, that the ellipse defined by the SIMPLS residual matrices is given
by the intersection of the (hyper)plane orthogonal to the score vectors already extracted and the
ellipsoid D. Figure 6(b) shows the two residual ellipses together. We see that in using SIMPLS, we
can stick with the original ellipsoid D.
The construction of t_2 from a tangent rotation on (E_1 E_1^T)^- for both NIPALS and SIMPLS is
pictured in Figure 6(c). Because t_2^T y = t_2^T ŷ_OLS = t_2^T 𝒫_{t_1^⊥} y, we can carry out the tangent rotation on the
projection of y (or of ŷ_OLS) onto the subspace of range(X) that is orthogonal to t_1. Note that t_2 from
SIMPLS and NIPALS lie on their respective ellipses when they are constructed as linear combinations
with unit-norm weight vectors. Although their resultant lengths are different, their directions are
identical.
Once t_2 has been found, t_3 follows immediately as the vector orthogonal to the plane formed by t_1
and t_2. As Figure 6(b) shows, projecting ŷ_OLS onto this plane yields the PLS vector of fitted values.
6.3.2. Variable space
In variable space the first weight vector (r_1 or w_1) is obtained from a tangent rotation of b̂_OLS on the
ellipsoid S (Figure 7(a)). By applying a second tangent rotation, the loading vector p_1 is found (not
shown). To obtain r_2, one of two approaches can be adopted. Beyond the first weight vector, r_i can be
written as 𝒫_{P_m^⊥} X^T X b̂_OLS, which has two interpretations. First, it can be seen as a tangent rotation of
b̂_OLS on the ellipsoid formed by the projection of S onto the (hyper)plane orthogonal to the loading
vectors already extracted. In this simple example it is the ellipse that results from the projection of S
onto the plane orthogonal to p_1. The tangent rotation of the projection of b̂_OLS onto this plane yields
r_2. Alternatively, we can interpret the direction of r_i as the orthogonal projection of X^T X b̂_OLS, hence
of r_1, onto the space orthogonal to the loading vectors extracted to that point. In Figure 7(b) the
direction of r_2 is given by the orthogonal projection of r_1 onto the plane orthogonal to p_1. Only the
portion of this plane that intersects S is shown.
In much the same way the construction of the weights in NIPALS can be carried out in one of two
ways. Here

w_{m+1} ∝ E_m^T ŷ_OLS = 𝒫_{W_m^⊥·P_m} X^T X b̂_OLS = 𝒫_{W_m^⊥·P_m} X^T X 𝒫_{P_m^⊥·W_m} b̂_OLS = E_m^T E_m b̂_OLS

and hence successive weights can be derived by either projecting the first weight vector onto the space
orthogonal to W_m along P_m or as a tangent rotation of b̂_OLS (or its projection) on the ellipse formed
by (E_m^T E_m)^-. In Figure 7(c) we show w_2 as an oblique projection onto the plane orthogonal to w_1 along
p_1. We show only that portion of the plane that intersects S.
The regression vector b̂_PLS^{m=1} for the one-dimensional PLS solution is obtained by an oblique
projection of b̂_OLS onto r_1 in a direction parallel to the plane formed by r_2 and r_3, i.e. orthogonal to

Figure 7. (a) Construction of w_1 and r_1 by tangent rotation of b̂_OLS on the ellipsoid S_1. The extension of the first
weight vector to the surface of the ellipsoid is also shown. Viewpoint is (1, 1.5, 0.3). (b) Geometry of SIMPLS
in variable space: (i) tangent rotation of r_1 on S_1 yields p_1; (ii) r_2 obtained by projecting r_1 onto space orthogonal
to p_1 and then scaling to desired norm; (iii) b̂_PLS^{m=2} can then be drawn by projecting b̂_OLS onto (r_1, r_2) along r_3.
Viewpoint is (1, -1.2, 0.2). (c) Geometry of NIPALS in variable space: (i) projecting w_1 onto its orthogonal
complement along p_1 yields a vector that is a scalar multiple of w_2; (ii) b̂_PLS^{m=2} can then be obtained by projecting
b̂_OLS onto (w_1, w_2) along r_3. Viewpoint is (1, -1.1, 0.2)


p_1. The contribution of the second PLS dimension to the regression vector is found in a similar way
by an oblique projection of b̂_OLS onto r_2 in a direction orthogonal to p_2. Vectorial addition of the
contributions from the first two dimensions yields b̂_PLS^{m=2}. Alternatively, the regression vector for PLS
with two dimensions can be obtained by an oblique projection of b̂_OLS onto the (r_1, r_2) (or (w_1, w_2))
plane along r_3, i.e. orthogonal to (p_1, p_2). Of course, a full three-dimensional model coincides with the
ordinary least squares solution b̂_OLS.
6.4. Two responses, two regressors
When more than one response is measured (q ≥ 2), estimation in PLS is intrinsically more complex.
We must now consider linear combinations of Y either explicitly, as in the alternating least squares
solution where optimal linear combinations of X and Y are derived simultaneously, or implicitly, by
obtaining weights as eigenvectors of a symmetric matrix. The fact that there are two algebraic
solutions means that there will also be two geometric interpretations in both object and variable space.
Also, although our primary interest is in deriving the t_i, illustrating the construction of Y-variates in
either the object or variable space of Y will proceed in an analogous fashion.
6.4.1. Object space
In an earlier paper, Phatak et al. [15] described a geometric interpretation of the t_i in multivariate PLS
which stems essentially from the geometric interpretation of the eigenvalue/eigenvector problem
defined in equation (27). When w_1 is defined as the largest eigenvector of X^T Y Y^T X, then the variate
constructed as Xw_1 can be obtained geometrically as follows.

1. Project Y orthogonally onto the column space of X to obtain Ŷ_OLS.
2. Carry out half a tangent rotation of each column of Ŷ_OLS on the ellipsoid D.
3. Find the largest principal component of these rotated columns.
4. Carry out half a tangent rotation of this largest principal component on D to obtain t_1.

To obtain successive variates, the procedure is repeated with E_i in place of X. Note that this
description extends to both NIPALS and SIMPLS: only the subspace in which construction takes place
will differ.
Another interpretation [25] of multivariate PLS stems from the alternating least squares equations in
equation (28). Replacing X^T Y by X^T Ŷ_OLS allows us to work in the column space of X. If we then
premultiply by X and Ŷ_OLS respectively, we can write

Xw (= t) = φ^{-1/2}(XX^T)ū                              (35a)

Ŷ_OLS q (= ū) = φ^{-1/2}(Ŷ_OLS Ŷ_OLS^T)t                 (35b)

Note that ū is not a linear combination of Y but of Ŷ_OLS. At any step in the ALS iterations it will be
the orthogonal projection of u = Yq onto the column space of X.
In the course of the ALS procedure we will have at hand a pair of vectors (t, ū). To obtain the value
of t in the next iteration, we premultiply ū by XX^T, which is equivalent in geometric terms to carrying
out a tangent rotation on the ellipsoid D. Similarly, the next value of ū is the result of a tangent rotation
of t on the q-dimensional ellipsoid G_c, where

G_c ≡ {z | z^T(Ŷ_OLS^T Ŷ_OLS)^- z = c}

Thus equations (35a) and (35b) can be interpreted geometrically as a sequence of alternating tangent
rotations of t and ū, each with respect to the other ellipsoid. Each tangent rotation maximizes the
inner product between t and ū given one of the vectors. For the simple case of two responses and two

Figure 8. Geometry of multivariate PLS in object space with two responses and two regressors. Depicted here
is the (converged) alternating least squares procedure

regressors the end point of this sequence of alternating rotations is illustrated in Figure 8. It converges
to the invariant pair of vectors (t_1, ū_1), where t_1 is the optimal X-variate and ū_1 is the orthogonal
projection of the optimal Y-variate u_1 onto X-space. At convergence the tangent at t_1 is orthogonal to
ū_1 and the tangent at ū_1 is normal to t_1. Once t_1 has been extracted, the second score vector can be
found immediately as the vector orthogonal to it and on the ellipse D. The variate ū_2 can then be found
by a tangent rotation of t_2 on G. Note again that ū_2 is the projection of the second Y-variate.
An alternative way of constructing the pair (t_1, ū_1) is to interpret directly the PLS optimization
criterion. Recall that we are trying to maximize

r^T X^T Y Y^T X r = r^T X^T Ŷ_OLS Ŷ_OLS^T X r = t^T Ŷ_OLS Ŷ_OLS^T t

subject to the constraint

r^T r = r^T X^T(XX^T)^- X r = t^T(XX^T)^- t = 1

In geometric terms, therefore, the problem is to find a vector t that lies on an ellipse G^- which is as
large as possible subject to the constraint that t lie on the ellipsoid D_1. Thus the solution is found as
the tangent point of two ellipsoids: D_1 and a circumscribing ellipse G^-. Figure 9 illustrates this
interpretation. It is important to keep in mind that the ellipse G^- has the same shape as the ellipse G
used in constructing t_1 by alternating tangent rotations but a different (90°) orientation. In much the
same way, ū_1 can be constructed by finding the point of tangency between the ellipse G and the
circumscribing ellipse D_1^-. However, a much simpler way of constructing ū_1 given t_1 is by a tangent
rotation of the latter on G.
6.4.2. Variable space
In the variable space of X each response is represented by its OLS regression coefficient vector. These
are the images of the y_i under the transformation X^-. To see this, we need only rewrite the matrix

Figure 9. Alternative method of constructing (t_1, ū_1). The point of tangency between D_1 and the circumscribing
ellipse G^- yields t_1, and the construction of ū_1 follows as before
X^T Y Y^T X, whose principal eigenvector yields r_1, as S B̂_OLS B̂_OLS^T S. Geometrically, S B̂_OLS represents a
tangent rotation of the individual coefficient vectors on the ellipsoid S (Figure 10(a)) and w_1 (or r_1)
is simply the unit-norm vector that lies in the direction of the largest principal component of these
rotated vectors (Figure 10(b)). It is worth pointing out that the construction of r_1 is somewhat different
from that of its counterpart in object space, t_1. Whereas r_1 is the largest PC of the full tangent rotation
of B̂_OLS, t_1 is found by half a tangent rotation of Ŷ_OLS, extraction of the largest PC and then another
half a tangent rotation.
From r_1 we can construct, as before (Section 6.2.2), the loading vectors p_1 and p_2, the second set
of weight vectors r_2 and w_2 as well as the PLS regression matrix B̂_PLS^{m=1} for a one-dimensional solution
(Figure 10(b)). The columns of B̂_PLS^{m=1} are of course collinear.

6.5. Two responses, three regressors


The extension of the geometry of multivariate PLS to three regressors follows closely from the
description in the previous subsection. An important point to note is that in this case, unlike univariate
PLS, NIPALS and SIMPLS will not yield the same sets of vectors beyond the first dimension. We can,
in principle, illustrate the non-trivial construction of variates and weights beyond the first set, but as
de Jong [9] points out, the difference between SIMPLS and NIPALS is practically very small. Hence it
is difficult for any reasonable geometry of X and Y to illustrate the difference pictorially. We shall
have to be content, therefore, with verbal descriptions, but given the now familiar operations of
tangent rotations and projecting vectors orthogonally or obliquely, it should not be too difficult to
visualize the geometry of multivariate PLS when there are three regressors.
6.5.1. Object space
The alternating least squares formulation of multivariate PLS depicted in Figure 8 can also be used
to visualize the construction of t1 here for both SIMPLS and NIPALS. Because there are three
regressors, however, the geometry now consists of an ellipsoid D and an ellipse G, both centered at

Figure 10. (a) Tangent rotation of the matrix of OLS estimates B̂_OLS on the ellipse S_1. Each column of B̂_OLS, here
denoted by (b̂_OLS)_j, is rotated separately. (b) Construction of B̂_PLS^{m=1} (= [(b̂_PLS^{m=1})_1, (b̂_PLS^{m=1})_2]) for the simple case of two
responses and two explanatory variables. The columns of B̂_OLS are projected obliquely onto w_1 (or r_1) along r_2.
Unfortunately, because (b̂_OLS)_2 is almost collinear with w_1, we cannot depict the construction of (b̂_PLS^{m=1})_2. It is,
however, easy to visualize


the origin. At convergence a tangent rotation of ū_1 on D yields t_1; conversely, a tangent rotation of t_1
on the ellipse G (or a tangent rotation of the projection of t_1 onto G) gives ū_1.
The second dimension t_2 lies in the plane orthogonal to t_1, but its length and direction depend on
whether SIMPLS or NIPALS is used. Because of the different deflation in the two algorithms, the
shape and orientation of the ellipse defined by (E_i E_i^T)^- will not be the same. As a consequence, tangent
rotation of ŷ_OLS (or its projection onto the space orthogonal to t_1) on the SIMPLS and NIPALS ellipses
will not yield, in contrast with Figure 6(c), dimensions that are scalar multiples of one another. Instead,
they will be oriented differently and hence the subspaces defined by the first two SIMPLS and
NIPALS dimensions will be different.
6.5.2. Variable space
As we showed in Section 6.4.2, the first weight vector in multivariate PLS is the principal eigenvector
T
OLS
OLSB
S. In much the same way it is easy to show that subsequent weights are the
of the matrix SB
principal eigenvectors of matrices of the form
T
OLSB
OLS
(ETi Ei)B
(ETi Ei)

Consequently, the geometric interpretation of weights beyond the first is directly analogous. The
OLS represent a tangent rotation of the columns of B
OLS on the ellipsoid
columns of the matrix (ETi Ei)B
(ETi Ei) 2 and the largest principal component of these rotated vectors yields the required weight vector
ri + 1 or wi + 1 depending on whether SIMPLS or NIPALS is used. It is useful to recall here that in
multivariate PLS the first m wi (NIPALS) and ri (SIMPLS) no longer span the same space.
In variable space the triple (w1, r1, p1) is the same in NIPALS and SIMPLS. Moreover, for all
dimensions the SIMPLS loading vector pi is a tangent rotation on S of the corresponding weight ri.
The second SIMPLS weight vector lies in the space orthogonal to p1 and it is in this plane that the
OLS projected onto the plane orthogonal to p1
ellipse (ET1E1) 2 lies. Carrying out a tangent rotation of B
and then finding the largest PC gives a vector in the direction of r2. Again the loading vector p2 can
be constructed in a straightforward manner by a tangent rotation of r2 on S. The third weight vector
r3 is orthogonal to p1 and p2, while p3 is constructed in the same way as the first two loading vectors.
We can now construct a PLS estimator for one or two dimensions as oblique projections of the
OLS onto r1 or onto the plane formed by r1 and r2 respectively.
columns of B
The second NIPALS weight vector $w_2$ is obtained in the same way as the SIMPLS weight vector, except that the ellipse on which the tangent rotation of the columns of $\hat{B}_{\mathrm{OLS}}$ is carried out lies in a different subspace. The residual ellipse $(E_1^T E_1)^-$ lies in the space orthogonal to $w_1$. Once the first two weight vectors have been constructed, the third follows immediately as the vector orthogonal to the plane formed by the first two. The second loading vector $p_2$ is simply the tangent rotation of $w_2$ on $(E_1^T E_1)^-$, while the third will be identical with $w_3$. We should note here that the $t_i$ in NIPALS can also be constructed as linear combinations of the columns of X, but the weights (which we shall again denote by $r_i$) are different from those in SIMPLS. We can write, for example, that

$t_2 = E_1 w_2 = X(I - \mathcal{P}_{w_1 p_1^\perp})w_2 = Xr_2$

where $\mathcal{P}_{w_1 p_1^\perp} = w_1 p_1^T$ because $p_1^T w_1 = 1$, which implies that $r_2$ is simply $w_2$ projected onto the space orthogonal to $p_1$ along $w_1$. It is easy to generalize this result to subsequent $r_i$. Both the $w_i$ and the $r_i$ span the same space, and the PLS estimator for two dimensions can be constructed by projecting $\hat{B}_{\mathrm{OLS}}$ along $p_3$ onto either $(w_1, w_2)$ or $(r_1, r_2)$. Note that the estimator thus formed will in general be different from the SIMPLS estimator, because the vectors $(r_1, r_2)$ from SIMPLS do not span the same space as $(w_1, w_2)$ or $(r_1, r_2)$ from NIPALS.
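A small numpy sketch (our own construction; the NIPALS weights are computed here as dominant singular vectors of the deflated cross-product matrix, which is equivalent to the usual iterative NIPALS loop at convergence) illustrates the identity above: the second score $t_2 = E_1 w_2$ is reproduced exactly as $Xr_2$ with $r_2 = (I - w_1 p_1^T)w_2$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((60, 4)); X -= X.mean(axis=0)
Y = rng.standard_normal((60, 2)); Y -= Y.mean(axis=0)

# First NIPALS dimension: weight, score, loading
w1 = np.linalg.svd(X.T @ Y)[0][:, 0]
t1 = X @ w1
p1 = X.T @ t1 / (t1 @ t1)

# Deflate both blocks as NIPALS does
E1 = X - np.outer(t1, p1)
F1 = Y - np.outer(t1, Y.T @ t1 / (t1 @ t1))

# Second NIPALS dimension from the deflated matrices
w2 = np.linalg.svd(E1.T @ F1)[0][:, 0]
t2 = E1 @ w2

# The same score expressed as a linear combination of the original X:
# r2 = (I - w1 p1^T) w2, i.e. w2 projected onto p1-perp along w1
r2 = w2 - w1 * (p1 @ w2)
print(np.allclose(t2, X @ r2))   # True
print(np.isclose(p1 @ r2, 0))    # r2 is orthogonal to p1
```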
7. CONCLUSIONS
The geometric interpretation of partial least squares regression presented here illustrates the
mechanics of univariate and multivariate PLS in both variable and object space. Using simple building
blocks such as tangent rotations and orthogonal and oblique projections, we can visualize the steps in
the NIPALS and SIMPLS algorithms.
There are two aspects of the work presented here that are worth re-emphasizing.
1. Premultiplying a vector by a symmetric matrix A is nothing more than a tangent rotation on the ellipsoid $A_c \equiv \{z \mid z^T A^{-1} z = c\}$. This simple operation is at the heart of the power method of finding eigenvalues and eigenvectors of a symmetric matrix and allows us to interpret a Krylov sequence as a sequence of tangent rotations. It also provides a nice illustration of why the power method converges to the principal axes of an ellipsoid, albeit sometimes very slowly! (A short numerical sketch of this convergence follows the list.)
2. The PLS estimator of the vector of coefficients in the standard univariate or multivariate linear
regression model is an oblique projection of the OLS estimate. Oblique projections play a large
part in the PLS algorithms, especially in the process of deflation. We showed that the apparently
orthogonal deflation of the columns of X in NIPALS corresponds to oblique projection of its
rows. In SIMPLS the opposite is true. It is interesting to note that an orthogonal projection
decreases the length of the vector being projected, whereas an oblique projection can decrease
or increase it depending on the direction of projection. Such a result may be useful in elucidating
the properties of the PLS estimator. It recalls, for example, the empirical observation of Frank
and Friedman7 that the PLS estimator shrinks in some directions and expands in others.
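As an illustration of point 1, the following sketch (an arbitrary symmetric test matrix of our own choosing) runs the power method, i.e. repeated premultiplication by A with renormalization, and shows that the iterate lines up with the principal eigenvector, slowly when the two largest eigenvalues are close.

```python
import numpy as np

rng = np.random.default_rng(3)

def power_method(A, z, n_iter):
    """Repeated premultiplication by A with renormalization:
    geometrically, a sequence of tangent rotations on the ellipsoid A_c."""
    z = z / np.linalg.norm(z)
    for _ in range(n_iter):
        z = A @ z
        z /= np.linalg.norm(z)
    return z

# Symmetric test matrices A = Q diag(lambda) Q^T with known eigenvectors (columns of Q)
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
z0 = rng.standard_normal(3)

for lam2 in (2.0, 9.9):
    A = Q @ np.diag([10.0, lam2, 1.0]) @ Q.T
    z = power_method(A, z0, 25)
    # |cosine| of the angle between the iterate and the principal eigenvector;
    # the error decays roughly like (lam2/10)^k, so it shrinks much more slowly
    # when lam2 is close to 10
    print(lam2, abs(z @ Q[:, 0]))
```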
Our emphasis has been on visualizing the individual steps in the PLS algorithms, steps which are
composed of the handful of building blocks listed above. One drawing is sufficient to condense many
operations in NIPALS and SIMPLS. We hope, therefore, that the reader, after completing this article,
will accept a slight twist on the adage that a picture is worth a thousand words: it is worth, we think, if not a thousand, then at least several steps in an algorithm!
APPENDIX I: OBLIQUE PROJECTIONS
Oblique projections require us to specify the direction along which projection is to take place. Figure 11 shows two illustrative examples. In the two-dimensional example the vector a is projected onto c along e and the result is given by b. In the same way, d is the result of projecting a onto e along c. In the three-dimensional example, a is projected onto the plane $\mathcal{Q}$ along e, resulting in b. Here it is important to realize that it is not possible to project a along e onto some other vector, e.g. onto c, unless these three vectors are linearly related. The projection direction and the subspace onto which
projection takes place should span the entire space. Thus we can project a onto e along any plane, e.g. along $\mathcal{Q}$, as long as the vector e and the plane span the entire space. The projection direction may not be contained in the projection subspace or vice versa. For example, projection of a along c onto $\mathcal{Q}$ is not possible.

Figure 11. Oblique projection of the vector a onto the vector c in two dimensions and onto the plane $\mathcal{Q}$ in three dimensions
In a more general setting, if $\mathcal{R}^n$ denotes the n-dimensional space in which we are working, then projection onto a p-dimensional subspace $\mathcal{S}_p$ along a q-dimensional subspace $\mathcal{Q}_q$ is possible only if $p + q = n$ and $\mathcal{S}_p$ and $\mathcal{Q}_q$ are disjoint, i.e. they contain no common vectors except the null vector. Another way of stating the same requirement is that $\mathcal{S}_p$ and the orthogonal complement of the direction along which projection takes place, $\mathcal{Q}_q^\perp$, should have the same dimension p.

Let us consider the case of projection onto $\mathcal{S}_p$ along a q-dimensional subspace $\mathcal{V}^\perp$ such that $\mathcal{S}_p$ and $\mathcal{V}^\perp$ are disjoint and $p + q = n$. Now, if A ($n \times p$) and B ($n \times p$) span $\mathcal{S}_p$ and $\mathcal{V}$ respectively, then an explicit expression for the projector onto $\mathcal{S}_p$ along $\mathcal{V}^\perp$ is given by22

$\mathcal{P}_{A B^\perp} = A(B^T A)^{-1} B^T$   (36)
It is easy to verify that the above expression is indeed correct. First we note that $\mathcal{P}_{A B^\perp}$ is idempotent, i.e. $\mathcal{P}_{A B^\perp}\mathcal{P}_{A B^\perp} = \mathcal{P}_{A B^\perp}$, and hence it is a projector. Consequently, any vector c in the range of A, i.e. any $c = Ax$, is left unchanged:

$\mathcal{P}_{A B^\perp} c = \mathcal{P}_{A B^\perp} Ax = A(B^T A)^{-1} B^T Ax = Ax = c$

Thus the column space of A ($\mathcal{S}_p$) constitutes the subspace onto which projection takes place. Second, any vector $c \in \mathcal{V}^\perp$, i.e. any c orthogonal to the columns of B, is annihilated: $\mathcal{P}_{A B^\perp} c = 0$. Hence $\mathcal{V}^\perp$ is the subspace along which projection takes place. Finally, a special case arises when A and B span the same subspace. Then the oblique projector in equation (36) reduces to $\mathcal{P}_{A A^\perp} = A(A^T A)^{-1} A^T$, the orthogonal projector onto $\mathcal{S}_p$, which is written simply as $\mathcal{P}_A$. For a more rigorous treatment of oblique projectors, see Reference 22.
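These three properties are easy to reproduce numerically. The sketch below (random A and B of our own choosing, with $B^T A$ assumed invertible) builds $\mathcal{P}_{A B^\perp} = A(B^T A)^{-1} B^T$ and checks idempotence, invariance of the range of A, and annihilation of vectors orthogonal to the columns of B.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 5, 2
A = rng.standard_normal((n, p))
B = rng.standard_normal((n, p))

# Oblique projector onto col(A) along the orthogonal complement of col(B), eq. (36)
P_obl = A @ np.linalg.solve(B.T @ A, B.T)

# Idempotent, hence a projector
print(np.allclose(P_obl @ P_obl, P_obl))

# Vectors in the range of A are left unchanged
x = rng.standard_normal(p)
print(np.allclose(P_obl @ (A @ x), A @ x))

# Vectors orthogonal to the columns of B are annihilated
c = rng.standard_normal(n)
c_perp = c - B @ np.linalg.solve(B.T @ B, B.T @ c)   # component of c orthogonal to col(B)
print(np.allclose(P_obl @ c_perp, 0))

# When A and B coincide, eq. (36) reduces to the orthogonal projector onto col(A)
P_orth = A @ np.linalg.solve(A.T @ A, A.T)
print(np.allclose(P_orth, P_orth.T))                 # orthogonal projectors are symmetric
```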
Returning again to equation (23), Section 3.3, in which the PLS estimator was expressed as $W_m(P_m^T W_m)^{-1} P_m^T \hat{b}_{\mathrm{OLS}}$, we can now immediately recognize this as an oblique projection of $\hat{b}_{\mathrm{OLS}}$ onto $W_m$ along the space orthogonal to $P_m$. Thus $\hat{b}_{m\mathrm{PLS}} = \mathcal{P}_{W_m P_m^\perp}\hat{b}_{\mathrm{OLS}}$, with

$\mathcal{P}_{W_m P_m^\perp} = W_m(P_m^T W_m)^{-1} P_m^T$   (37)
Since $R_m$ and $W_m$ span the same space (see equation (19c), Section 3.3), the oblique projectors $\mathcal{P}_{W_m P_m^\perp}$ and $\mathcal{P}_{R_m P_m^\perp} = R_m(P_m^T R_m)^{-1} P_m^T = R_m P_m^T$ are identical. Although projection along a plane (or hyperplane) is more difficult to visualize than projection along a vector, Section 6.3.2 shows that it is possible to express successive contributions to $\hat{b}_{m\mathrm{PLS}}$ as oblique projections along a vector.
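The oblique-projection expression for the PLS estimator can be verified directly. The sketch below (univariate y, two components, synthetic data; a bare-bones NIPALS written only for this illustration, not the authors' code) computes $\hat{b}_{2\mathrm{PLS}}$ in the usual way and checks that it equals $W_2(P_2^T W_2)^{-1} P_2^T \hat{b}_{\mathrm{OLS}}$.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((30, 4)); X -= X.mean(axis=0)
y = rng.standard_normal(30);      y -= y.mean()

# Bare-bones NIPALS PLS1 for m = 2 components
E, f = X.copy(), y.copy()
W, P, T = [], [], []
for _ in range(2):
    w = E.T @ f; w /= np.linalg.norm(w)   # weight
    t = E @ w                             # score
    p = E.T @ t / (t @ t)                 # X-loading
    q = f @ t / (t @ t)                   # y-loading
    E = E - np.outer(t, p)                # deflate the X-block
    f = f - q * t                         # deflate y
    W.append(w); P.append(p); T.append(t)
W, P, T = np.column_stack(W), np.column_stack(P), np.column_stack(T)

# Usual PLS coefficient vector: b = W (P^T W)^{-1} (T^T T)^{-1} T^T y
b_pls = W @ np.linalg.solve(P.T @ W, np.linalg.solve(T.T @ T, T.T @ y))

# OLS coefficients and the oblique projection of equation (37)
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
b_proj = W @ np.linalg.solve(P.T @ W, P.T @ b_ols)

print(np.allclose(b_pls, b_proj))   # True: b_PLS is an oblique projection of b_OLS
```

The agreement is exact (up to rounding) whenever $X^T X$ is invertible, since the scores satisfy $T_m = X W_m(P_m^T W_m)^{-1}$ and are mutually orthogonal.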

APPENDIX II: GEOMETRY OF ELLIPSOIDS
II.1. Ellipsoids
If A ($p \times p$) is a symmetric matrix of rank p, then the set of p-vectors z satisfying

$z^T A z = c$   (38)

where c is a scalar constant, forms an ellipsoidal hypersurface in p-dimensional space. For example, when $p = 2$, $c = 1$ and A is a diagonal matrix with elements $1/a_1^2$ and $1/a_2^2$, equation (38) reduces to the familiar equation of an ellipse whose principal axes correspond to the co-ordinate axes and which are of length $a_1$ and $a_2$. To simplify the terminology, surfaces formed when $p > 3$ will also be called ellipsoids. In this paper we will be treating ellipses ($p = 2$) and ellipsoids ($p = 3$), although the results apply to
ellipsoids of any dimension.

Ellipsoids defined by equation (38) will be denoted by $A_c^-$ (the reason for the superscript notation, which at first seems counter-intuitive, will become apparent in this subsection and the next). Thus

$A_c^- \equiv \{z \mid z^T A z = c\}$   (39)
Any ellipsoid such as $A_c^-$ is characterized by its orientation, shape, size and the location of its centre. We need not consider the latter, however, because we will be working with mean-centered data in this paper. The shape of the ellipsoid is determined by the relative lengths of its principal axes, and it is straightforward to show that these are related to the eigenvalues of A. Using either equation (11) or (12) in Section 3.2, it can be shown (by considering $X^T X$, for example) that any $p \times p$ symmetric matrix A can be decomposed as

$A = V \Lambda V^T$   (40)

where the p eigenvectors of A comprise the columns of V and the diagonal matrix $\Lambda$ contains the corresponding eigenvalues. Equation (40) is known as the spectral decomposition,16 and if we substitute for A in equation (38), we can write

$(z^T V)\Lambda(V^T z) = c$   (41)

which shows that the principal axes have lengths $(c/\lambda_i)^{1/2}$. Furthermore, if we plot the ellipsoid in the co-ordinate system of the variate $\tilde{z} = V^T z$, the principal axes will coincide with the co-ordinate axes. Consequently, switching to the original co-ordinates $z$ ($= V\tilde{z}$) amounts to rotating the ellipsoid from its canonical position, with the direction of rotation of each axis determined by the corresponding eigenvector. When all the eigenvalues are equal, the ellipsoid becomes a (hyper)sphere.
For a given matrix A the size of the ellipsoid depends on the constant c. For our purposes, however, only the shape and orientation are important and we will sometimes drop the subscript c from the symbol $A_c^-$ to indicate an ellipse with orientation and shape determined by the eigenvectors and eigenvalues of A but of unspecified size.
From the spectral decomposition of A it is easy to show that A raised to the power m can be written as

$A^m = V \Lambda^m V^T$   (42)

which shows that the eigenvectors of A and $A^m$ are the same and that the eigenvalues of $A^m$ are simply those of A raised to the power m. Of particular interest is $m = -1$, which gives the inverse of A. The corresponding ellipsoid, which we shall denote by $A_c$, is, by analogy with equation (39), $A_c \equiv \{z \mid z^T A^{-1} z = c\}$. What does $A_c$ look like? Because the eigenvectors are the same, $A_c$ has the same principal axes as $A_c^-$, except of course that their indices will be reversed: for example, the first eigenpair of A is $(\lambda_1, g_1)$, but the eigenvalue of $A^{-1}$ associated with $g_1$ is $\lambda_1^{-1}$, the smallest one.
Moreover, the shapes of $A_c^-$ and $A_c$ (which we can think of as being related to the relative lengths of the principal axes) will be different, because in general

$\lambda_1 : \lambda_2 : \cdots : \lambda_{p-1} : \lambda_p \neq \lambda_p^{-1} : \lambda_{p-1}^{-1} : \cdots : \lambda_2^{-1} : \lambda_1^{-1}$

In two dimensions, however, the ellipses $A_c^-$ and $A_c$ do have the same shape, because $\lambda_1/\lambda_2 = \lambda_2^{-1}/\lambda_1^{-1}$: they are merely 90° rotations of one another.
A simple, constructed example, shown in Figure 12, illustrates the differences between $A_1^-$ and $A_1$ for $A = \mathrm{diag}(15^2, 1)$. Because A is diagonal, the unit-norm eigenvectors $g_1$ and $g_2$ are $(1, 0)^T$ and $(0, 1)^T$ respectively. Note that the lengths of the principal axes of the ellipse $A_1^-$ are equal to the inverse of the square root of the associated eigenvalue and not to the square roots of the eigenvalues themselves. Thus the lengths of the principal axes of $A_1^-$ do not visually reflect the eigenvalues that they are associated with. For this reason we prefer to work with the ellipsoid $A_c$ to illustrate transformations carried out using the matrix A. In Figure 12, for example, the major axis of $A_1$ has length 15 and is collinear with $g_1$, the eigenvector of A associated with the larger of the two eigenvalues. Similarly, the minor axis has length 1 and is collinear with $g_2$, the eigenvector associated with the smaller eigenvalue. Thus it seems to us more natural to illustrate the geometry of PCR and PLS in an ellipsoidal manifold whose dimensions reflect what we might intuitively expect. Our choice is an arbitrary one, however, and the geometry could be illustrated just as well (though with a little more confusion, we think) by using $A_c^-$.

Figure 12. Ellipsoids $A_1$ and $A_1^-$ for $A = \mathrm{diag}(15^2, 1)$
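The Figure 12 example can be reproduced in a few lines (our own check): the semi-axes of $A_1^-$ are $\lambda_i^{-1/2}$, those of $A_1$ are $\lambda_i^{1/2}$, and equation (42) gives the powers of A directly from the spectral decomposition.

```python
import numpy as np

A = np.diag([15.0**2, 1.0])
lam, V = np.linalg.eigh(A)          # eigenvalues (ascending) and eigenvectors
lam, V = lam[::-1], V[:, ::-1]      # reorder so that lambda_1 >= lambda_2

# Semi-axis lengths for c = 1: (1/lambda_i)^{1/2} for A_1^- and lambda_i^{1/2} for A_1
print(1 / np.sqrt(lam))             # axes of A_1^-: [1/15, 1]
print(np.sqrt(lam))                 # axes of A_1:   [15, 1]

# Equation (42): A^m = V Lambda^m V^T, illustrated for m = -1
print(np.allclose(V @ np.diag(lam ** -1.0) @ V.T, np.linalg.inv(A)))
```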
II.2. Planes of tangency
We need to describe geometrically the algebraic operations $b = Aa$ and $a = A^{-1}b$, where A is a $p \times p$ symmetric matrix and b and a are both p-vectors. Phatak et al.15 have already discussed the geometric interpretation of such transformations, but we offer here a different, more intuitive proof of the result. Note that we shall be working with the ellipsoid $A_c$.

Without loss of generality we can assume that the vector b lies on the surface of $A_c$. Consequently, we can write

$b^T A^{-1} b = c$   (43)
Now let us consider the set of vectors given by $b + d$ and which lie on the surface of the ellipsoid. As $d \to 0$, the tips of these vectors will form an infinitesimally small surface element on $A_c$ around b. Furthermore, such vectors satisfy

$(b + d)^T A^{-1}(b + d) = c$   (44)

Expanding yields

$b^T A^{-1} b + d^T A^{-1} b + b^T A^{-1} d + d^T A^{-1} d = c = b^T A^{-1} b$   (45)

and then discarding the second-order term in d and simplifying gives

$d^T A^{-1} b = 0$   (46)
which implies that d is orthogonal to $A^{-1}b$. Now the set of vectors d as $d \to 0$ constitutes a surface element around the origin whose normal (orthogonal) vector is $A^{-1}b$. Thus adding b to the vectors d merely translates the element to the surface of the ellipsoid without changing its orientation, i.e. the direction of the normal is still $A^{-1}b$. This implies that the hyperplane tangent to the ellipsoid at b has $a = A^{-1}b$ as its normal. If we turn the interpretation around, we can now interpret the transformation $b = Aa$ as follows. Given A ($p \times p$) and a ($p \times 1$), the point of tangency between the ellipsoid $A_c$ and the $(p-1)$-dimensional hyperplane that is orthogonal to a gives a vector that is a scalar multiple of $Aa$. We will denote such a transformation of the vector a as a tangent rotation of a on the ellipsoid A. For a simple example where $p = 2$, we can illustrate the construction of $Aa$ very easily. In Figure 13, a is an arbitrary vector and we also show the family of lines orthogonal to it. The one that is tangent to the ellipse yields $kAa$ at the point of tangency. Such a method of construction extends easily to $p > 2$. In three dimensions the idea would be to construct planes orthogonal to a and tangent to the ellipsoid; we cannot, of course, view the construction in greater than three dimensions, but it amounts to nothing more than the point of tangency between a p-dimensional ellipsoid and a $(p-1)$-dimensional hyperplane.

Figure 13. Simple illustration of the transformation of a vector a by the matrix A via the algebraic operation $Aa$
The reverse transformation (reverse tangent rotation), going from $b = Aa$ to $a = A^{-1}b$, is equally straightforward to illustrate. Given an arbitrary vector b in two dimensions, say, we first construct a line that is tangent to the ellipse at the point where b (or its extension) intersects the ellipse. Then the vector that is orthogonal to this tangent line is a scalar multiple of $A^{-1}b$. Extensions to $p > 2$ are obvious.

In considering the geometric interpretation of the transformation $Aa$ and its reverse, we have ignored the issue of the norms of the resulting vectors. As we shall see throughout Sections 5 and 6, however, we will be using the method of construction outlined above to construct subspaces onto which projections will be made. Hence getting the direction right is what matters, not the absolute length.
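A quick numerical check of the tangent-rotation construction for $p = 2$ (our own sketch, with an arbitrary positive definite A): among the points z on $A_c$, the point of tangency with the supporting line orthogonal to a is the one furthest in the direction of a, and it is a scalar multiple of $Aa$, namely $\sqrt{c/(a^T A a)}\,Aa$.

```python
import numpy as np

rng = np.random.default_rng(6)

# A symmetric positive definite 2 x 2 matrix and an arbitrary direction a
M = rng.standard_normal((2, 2))
A = M @ M.T + np.eye(2)
a = rng.standard_normal(2)
c = 1.0

# Parametrize the ellipse A_c = {z : z^T A^{-1} z = c}: z = sqrt(c) V Lambda^{1/2} u(theta)
lam, V = np.linalg.eigh(A)
theta = np.linspace(0, 2 * np.pi, 200001)
Z = np.sqrt(c) * V @ (np.sqrt(lam)[:, None] * np.vstack([np.cos(theta), np.sin(theta)]))

# Point of tangency with the line orthogonal to a: the point furthest in the direction a
z_star = Z[:, np.argmax(a @ Z)]

# It should be (numerically close to) the positive multiple k A a, k = sqrt(c / (a^T A a))
k = np.sqrt(c / (a @ A @ a))
print(np.allclose(z_star, k * (A @ a), atol=1e-3))
```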
REFERENCES
1. S. Wold, A. Ruhe, H. Wold and W. Dunn, SIAM J. Sci. Stat. Comput. 5, 735–743 (1984).
2. I. Helland, Commun. Stat.-Simul. 17, 581–607 (1988).
3. R. Manne, Chemometrics Intell. Lab. Syst. 2, 187–197 (1987).
4. M. Stone and R. Brooks, J. R. Stat. Soc. B, 52, 237–269 (1990).
5. A. Phatak, A. Penlidis and P. Reilly, Anal. Chim. Acta, 277, 495–501 (1992).
6. S. de Jong, J. Chemometrics, 7, 551–557 (1993).
7. I. Frank and J. Friedman, Technometrics, 35, 109–135 (1993).
8. C. Goutis, Ann. Stat. 24, 816–824 (1996).
9. S. de Jong, Chemometrics Intell. Lab. Syst. 18, 251–263 (1993).
10. F. Lindgren, P. Geladi and S. Wold, J. Chemometrics, 7, 45–59 (1993).
11. M. Kaspar and W. Ray, Comput. Chem. Eng. 17, 985–989 (1993).
12. B. Bush and R. Nachbar Jr., J. Comput.-Aided Mol. Design, 7, 587–619 (1993).
13. S. Rännar, F. Lindgren, P. Geladi and S. Wold, J. Chemometrics, 8, 111–125 (1994).
14. P. Young, SIAM J. Sci. Comput. 15, 225–230 (1994).
15. A. Phatak, A. Penlidis and P. Reilly, Commun. Stat.-Theory Methods, 21, 1517–1533 (1992).
16. J. Magnus and H. Neudecker, Matrix Differential Calculus with Applications in Statistics and Econometrics, Wiley, Chichester (1988).
17. G. Seber, Linear Regression Analysis, Wiley, New York (1977).
18. I. Jolliffe, Principal Component Analysis, Springer, New York (1986).
19. A. Basilevsky, Applied Matrix Algebra in the Statistical Sciences, Elsevier, New York (1983).
20. P. Brown, Measurement, Regression, and Calibration, Oxford University Press, Oxford (1993).
21. G. Golub and C. Van Loan, Matrix Computations, 2nd edn, Johns Hopkins University Press, Baltimore, MD (1989).
22. K. Takeuchi, H. Yanai and B. Mukherjee, The Foundations of Multivariate Analysis, Wiley Eastern, New Delhi (1982).
23. S. Wold, H. Martens and H. Wold, in Lecture Notes in Mathematics: Matrix Pencils, ed. by A. Ruhe and B. Kågström, pp. 286–293, Springer, Heidelberg (1983).
24. A. Höskuldsson, J. Chemometrics, 2, 211–228 (1988).
25. A. Phatak, Evaluation of some multivariate methods and their applications in chemical engineering, Ph.D. Thesis, University of Waterloo (1993).