
Principal Component Analysis

Frank Wood

December 8, 2009

This lecture borrows and quotes from Jolliffe's Principal Component Analysis book. Go buy it!

Principal Component Analysis


The central idea of principal component analysis (PCA) is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. [Jolliffe, Principal Component Analysis, 2nd edition]

Data distribution (inputs in regression analysis)

Figure: Gaussian PDF

Uncorrelated projections of principal variation

Figure: Gaussian PDF with PC eigenvectors

PCA rotation

Figure: PCA Projected Gaussian PDF

PCA in a nutshell
Notation
x is a vector of p random variables
α_k is a vector of p constants
α_k'x = ∑_{j=1}^p α_{kj} x_j

Procedural description
Find a linear function of x, α_1'x, with maximum variance. Next find another linear function of x, α_2'x, uncorrelated with α_1'x and having maximum variance. Iterate.

Goal
It is hoped, in general, that most of the variation in x will be accounted for by m PCs where m << p .

Derivation of PCA
Assumption and More Notation
Σ is the known covariance matrix for the random variable x.
Foreshadowing: Σ will be replaced with S, the sample covariance matrix, when Σ is unknown.

Shortcut to solution
For k = 1, 2, . . . , p the kth PC is given by z_k = α_k'x, where α_k is an eigenvector of Σ corresponding to its kth largest eigenvalue λ_k. If α_k is chosen to have unit length (i.e. α_k'α_k = 1) then Var(z_k) = λ_k.
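As a quick numerical check of this shortcut, the sketch below (Python/numpy; the covariance values are illustrative assumptions, borrowed from the regression example later in the lecture) computes the eigendecomposition of a known Σ and confirms that the unit-length leading eigenvector has Var(z_1) = λ_1.

```python
import numpy as np

# Assumed known covariance matrix for x (illustrative values only).
Sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])

# eigh handles symmetric matrices; eigenvalues come back in ascending order,
# so reorder them (and the eigenvectors) so that lambda_1 is the largest.
lam, alphas = np.linalg.eigh(Sigma)
order = np.argsort(lam)[::-1]
lam, alphas = lam[order], alphas[:, order]

alpha_1 = alphas[:, 0]                    # unit-length first PC direction
print(alpha_1 @ alpha_1)                  # ~1.0: the normalization constraint
print(alpha_1 @ Sigma @ alpha_1, lam[0])  # Var(z_1) equals the largest eigenvalue
```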

Derivation of PCA
First Step
Find α_k'x that maximizes Var(α_k'x) = α_k'Σα_k. Without a constraint we could pick a very big α_k, so choose the normalization constraint α_k'α_k = 1 (unit length vector).

Constrained maximization - method of Lagrange multipliers


To maximize α_k'Σα_k subject to α_k'α_k = 1 we use the technique of Lagrange multipliers. We maximize the function

α_k'Σα_k − λ(α_k'α_k − 1)

w.r.t. α_k by differentiating w.r.t. α_k.

Derivation of PCA
Constrained maximization - method of Lagrange multipliers
This results in

d/dα_k [ α_k'Σα_k − λ(α_k'α_k − 1) ] = 0
Σα_k − λα_k = 0
Σα_k = λα_k

This should be recognizable as an eigenvector equation, where α_k is an eigenvector of Σ and λ is the associated eigenvalue. Which eigenvector should we choose?

Derivation of PCA
Constrained maximization - method of Lagrange multipliers
If we recognize that the quantity to be maximized is

α_k'Σα_k = α_k'λα_k = λα_k'α_k = λ

then we should choose λ to be as big as possible. So, calling λ_1 the largest eigenvalue of Σ and α_1 the corresponding eigenvector, the solution to

Σα_1 = λ_1α_1

is the 1st principal component of x. In general α_k'x will be the kth PC of x and Var(α_k'x) = λ_k. We will demonstrate this for k = 2; the case k > 2 is more involved but similar.

Derivation of PCA
Constrained maximization - more constraints
The second PC, α_2'x, maximizes α_2'Σα_2 subject to being uncorrelated with α_1'x. The uncorrelatedness constraint can be expressed using any of these equations:

cov(α_1'x, α_2'x) = α_1'Σα_2 = α_2'Σα_1 = α_2'λ_1α_1 = λ_1α_2'α_1 = λ_1α_1'α_2 = 0

Of these, if we choose the last we can write a Lagrangian to maximize α_2'Σα_2:

α_2'Σα_2 − λ(α_2'α_2 − 1) − φ α_2'α_1

Derivation of PCA
Constrained maximization - more constraints
Differentiation of this quantity w.r.t. α_2 (and setting the result equal to zero) yields

d/dα_2 [ α_2'Σα_2 − λ(α_2'α_2 − 1) − φ α_2'α_1 ] = 0
Σα_2 − λα_2 − φα_1 = 0

If we left multiply α_1' into this expression,

α_1'Σα_2 − λα_1'α_2 − φα_1'α_1 = 0
0 − 0 − φ · 1 = 0

then we can see that φ must be zero, and that when this is true we are left with

Σα_2 − λ_2α_2 = 0

Derivation of PCA
Clearly

Σα_2 − λ_2α_2 = 0

is another eigenvalue equation, and the same strategy of choosing α_2 to be the eigenvector associated with the second largest eigenvalue yields the second PC of x, namely α_2'x. This process can be repeated for k = 1, . . . , p, yielding up to p different eigenvectors of Σ along with the corresponding eigenvalues λ_1, . . . , λ_p. Furthermore, the variance of each of the PCs is given by

Var[α_k'x] = λ_k,   k = 1, 2, . . . , p
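To see the uncorrelatedness numerically, here is a small hedged sketch (the sample size and seed are arbitrary choices, and the covariance values are the same illustrative assumptions as before): projecting Gaussian samples onto the eigenvectors of an assumed Σ gives components whose sample covariance is approximately diag(λ_1, λ_2).

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])            # assumed covariance, as above

lam, A = np.linalg.eigh(Sigma)
lam, A = lam[::-1], A[:, ::-1]            # decreasing eigenvalue order

x = rng.multivariate_normal(np.zeros(2), Sigma, size=50000)
z = x @ A                                 # z_k = alpha_k' x for every sample
print(np.cov(z, rowvar=False))            # ~ diag(lambda_1, lambda_2): uncorrelated PCs
```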

Properties of PCA
For any integer q, 1 ≤ q ≤ p, consider the orthonormal linear transformation

y = B'x

where y is a q-element vector and B' is a (q × p) matrix, and let Σ_y = B'ΣB be the variance-covariance matrix for y. Then the trace of Σ_y, denoted tr(Σ_y), is maximized by taking B = A_q, where A_q consists of the first q columns of A (the matrix whose kth column is the eigenvector α_k).

What this means is that if you want to choose a lower dimensional projection of x, the choice of B described here is probably a good one: it maximizes the (retained) variance of the resulting variables. In fact, since the projections are uncorrelated, the percentage of variance accounted for by retaining the first q PCs is given by

(∑_{k=1}^q λ_k / ∑_{k=1}^p λ_k) × 100
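A minimal sketch of this retained-variance percentage, assuming an illustrative 2×2 covariance matrix; the helper function name is hypothetical.

```python
import numpy as np

def retained_variance_pct(Sigma, q):
    """Percentage of variance captured by the first q PCs of covariance Sigma."""
    lam = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # eigenvalues, largest first
    return 100.0 * lam[:q].sum() / lam.sum()

# Illustrative covariance (same assumed values as the earlier sketches).
Sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])
print(retained_variance_pct(Sigma, q=1))
```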

PCA using the sample covariance matrix


If we recall that the sample covariance matrix (an unbiased estimator for the covariance matrix of x) is given by

S = (1 / (n − 1)) X'X

where X is an (n × p) matrix with (i, j)th element (x_ij − x̄_j) (in other words, X is a zero-mean design matrix), then we can construct the matrix A by combining the p eigenvectors of S (or the eigenvectors of X'X; they're the same) and define a matrix of PC scores

Z = XA

Of course, if we instead form A from only the q eigenvectors corresponding to the q largest eigenvalues of S, then we can achieve an optimal (in some senses) q-dimensional projection of x.
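The sketch below illustrates these steps on simulated data (the Gaussian parameters, sample size, and seed are assumptions for illustration): center X, form S, take its eigenvectors as A, and compute the PC scores Z = XA, whose sample covariance should be approximately diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([2.0, 5.0], [[4.5, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)                    # zero-mean design matrix

S = X.T @ X / (X.shape[0] - 1)            # sample covariance matrix

lam, A = np.linalg.eigh(S)                # eigenvectors of S (same as those of X'X)
lam, A = lam[::-1], A[:, ::-1]            # largest eigenvalue first

Z = X @ A                                 # matrix of PC scores
print(np.allclose(np.cov(Z, rowvar=False), np.diag(lam)))   # True (up to rounding)
```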

Computing the PCA loading matrix


Given the sample covariance matrix

S = (1 / (n − 1)) X'X

the most straightforward way of computing the PCA loading matrix is to utilize the singular value decomposition of S,

S = AΛA'

where A is a matrix consisting of the eigenvectors of S and Λ is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to each eigenvector. Creating a reduced-dimensionality projection of X is accomplished by selecting the q largest eigenvalues in Λ and retaining the q corresponding eigenvectors from A.
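A minimal sketch of this reduced-dimensionality projection, again on simulated data with assumed parameters: decompose S, keep the q leading eigenvectors as the loading matrix A_q, and project.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([2.0, 5.0], [[4.5, 1.5], [1.5, 1.0]], size=500)
X = X - X.mean(axis=0)

S = X.T @ X / (X.shape[0] - 1)
lam, A = np.linalg.eigh(S)                # S = A diag(lam) A'
order = np.argsort(lam)[::-1]
lam, A = lam[order], A[:, order]

q = 1
A_q = A[:, :q]                            # reduced loading matrix (q largest eigenvalues)
Z_q = X @ A_q                             # q-dimensional projection of X
print(Z_q.shape)                          # (500, 1)
```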

Sample Covariance Matrix PCA

Figure: Gaussian Samples

Sample Covariance Matrix PCA

Figure: Gaussian Samples with eigenvectors of sample covariance matrix

Sample Covariance Matrix PCA

Figure: PC projected samples

Sample Covariance Matrix PCA

Figure: PC dimensionality reduction step

Sample Covariance Matrix PCA

Figure: PC dimensionality reduction step

PCA in linear regression


PCA is useful in linear regression in several ways:
Identification and elimination of multicolinearities in the data.
Reduction in the dimension of the input space, leading to fewer parameters and easier regression.
Related to the last point, the variance of the regression coefficient estimator is minimized by the PCA choice of basis.

We will consider the following example:

x ~ N([2 5]', [4.5 1.5; 1.5 1.0])
y = X[1 2]' when no colinearities are present (no noise)
x_i3 = .8 x_i1 + .5 x_i2 imposed colinearity
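The sketch below generates data following this example (the sample size, seed, and the +5 intercept taken from the figure captions are assumptions for illustration), including the imposed colinear third column.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
X2 = rng.multivariate_normal([2.0, 5.0], [[4.5, 1.5], [1.5, 1.0]], size=n)
y = X2 @ np.array([1.0, 2.0]) + 5.0       # noiseless linear response (intercept of 5
                                          # follows the figure captions)

x3 = 0.8 * X2[:, 0] + 0.5 * X2[:, 1]      # imposed colinearity
X3 = np.column_stack([X2, x3])            # three-column design matrix of rank 2
```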

Noiseless Linear Relationship with No Colinearity

Figure: y = x[1 2]' + 5, x ~ N([2 5]', [4.5 1.5; 1.5 1.0])

Noiseless Planar Relationship

Figure: y = x[1 2]' + 5, x ~ N([2 5]', [4.5 1.5; 1.5 1.0])

Projection of colinear data


The figures before showed the data without the third colinear design matrix column. Plotting such data is not possible, but its colinearity is obvious by design. When PCA is applied to a design matrix of rank q less than p, the number of positive eigenvalues discovered is equal to q, the true rank of the design matrix. If the number of PCs retained is at least q (and the data is perfectly colinear, etc.) then all of the variance of the data is retained in the low dimensional projection. In this example, when PCA is run on the design matrix of rank 2, the resulting projection back into two dimensions has exactly the same distribution as before.
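The following sketch (with assumed sample size and seed) reproduces this rank argument numerically: the sample covariance of the perfectly colinear three-column design has only two meaningfully positive eigenvalues, and the first two PCs retain essentially all of the variance.

```python
import numpy as np

rng = np.random.default_rng(3)
X2 = rng.multivariate_normal([2.0, 5.0], [[4.5, 1.5], [1.5, 1.0]], size=200)
X3 = np.column_stack([X2, 0.8 * X2[:, 0] + 0.5 * X2[:, 1]])   # rank-2 design matrix

Xc = X3 - X3.mean(axis=0)
lam = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / (len(Xc) - 1)))[::-1]
print(lam)                                 # third eigenvalue is ~0: true rank is 2
print(100 * lam[:2].sum() / lam.sum())     # ~100% of the variance in the first two PCs
```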

Projection of colinear data

Figure: Projection of multi-colinear data onto first two PCs

Reduction in regression coefficient estimator variance


If we take the standard regression model

y = Xβ + ε

and consider instead the PCA rotation of X given by

Z = XA

then we can rewrite the regression model in terms of the PCs,

y = Zγ + ε.

We can also consider the reduced model

y = Z_q γ_q + ε_q

where only the first q PCs are retained.
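A sketch of this reduced model under assumed, illustrative data (the true β, noise level, and q are arbitrary choices): rotate the design matrix by the eigenvectors of X'X, keep the first q PC scores, and fit γ_q by least squares.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, q = 200, 3, 2
X = rng.normal(size=(n, p))                     # illustrative design matrix
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(scale=0.1, size=n)    # assumed noise level

_, A = np.linalg.eigh(X.T @ X)                  # columns of A: eigenvectors of X'X
A = A[:, ::-1]                                  # largest eigenvalue first

Z = X @ A                                       # full PC scores
Z_q = Z[:, :q]                                  # keep only the first q PCs

gamma_q, *_ = np.linalg.lstsq(Z_q, y, rcond=None)   # least-squares fit of gamma_q
print(gamma_q)
```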

Reduction in regression coefficient estimator variance


If we rewrite the regression relation as

y = Zγ + ε

then we can, because A is orthogonal, rewrite

Xβ = XAA'β = Zγ   where γ = A'β.

Clearly, using least squares (or ML) to learn γ is equivalent to learning β directly. And, as usual,

γ̂ = (Z'Z)⁻¹ Z'y

so

β̂ = A γ̂ = A (Z'Z)⁻¹ Z'y
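The following sketch checks this equivalence numerically on assumed data: fitting γ on Z = XA and mapping back with β̂ = Aγ̂ reproduces the ordinary least squares estimate.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.1, size=100)

_, A = np.linalg.eigh(X.T @ X)                  # orthogonal matrix of eigenvectors
Z = X @ A

gamma_hat = np.linalg.solve(Z.T @ Z, Z.T @ y)   # regression on the PCs
beta_hat_pca = A @ gamma_hat                    # map back: beta_hat = A gamma_hat
beta_hat_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_hat_pca, beta_hat_ols))  # True
```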

Reduction in regression coefficient estimator variance


Without derivation we note that the variance-covariance matrix of β̂ is given by

Var(β̂) = σ² ∑_{k=1}^p l_k⁻¹ a_k a_k'

where l_k is the kth largest eigenvalue of X'X, a_k is the kth column of A, and σ² is the observation noise variance, i.e. ε ~ N(0, σ²I).

This sheds light on how multicolinearities produce large variances for the elements of β̂: if an eigenvalue l_k is small then the resulting variance of the estimator will be large.
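As a sanity check, the sketch below (with an assumed noise variance σ² and an illustrative X) verifies that this spectral form σ² ∑ l_k⁻¹ a_k a_k' agrees with the familiar σ²(X'X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 3))
sigma2 = 0.25                                   # assumed observation noise variance

l, A = np.linalg.eigh(X.T @ X)                  # eigenvalues l_k, eigenvectors a_k
V_spectral = sigma2 * sum((1.0 / l[k]) * np.outer(A[:, k], A[:, k])
                          for k in range(X.shape[1]))
V_direct = sigma2 * np.linalg.inv(X.T @ X)
print(np.allclose(V_spectral, V_direct))        # True: the two forms agree
```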

Reduction in regression coefficient estimator variance


One way to avoid this is to ignore those PCs that are associated with small eigenvalues, namely, to use the biased estimator

β̃ = ∑_{k=1}^m l_k⁻¹ a_k a_k' X'y

where l_1, . . . , l_m are the large eigenvalues of X'X and l_{m+1}, . . . , l_p are the small ones. Its variance is

Var(β̃) = σ² ∑_{k=1}^m l_k⁻¹ a_k a_k'

This is a biased estimator, but since its variance is smaller it is possible that this could be an advantage.

Homework: find the bias of this estimator. Hint: use the spectral decomposition of X'X.
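A sketch of this biased estimator on assumed, nearly colinear data (the number of retained PCs m and all data parameters are illustrative): form β̃ from only the m components of X'X with large eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(7)
X2 = rng.normal(size=(100, 2))
x3 = X2[:, 0] + 1e-3 * rng.normal(size=100)     # nearly colinear third column
X = np.column_stack([X2, x3])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.1, size=100)

l, A = np.linalg.eigh(X.T @ X)
order = np.argsort(l)[::-1]                     # largest eigenvalues first
l, A = l[order], A[:, order]

m = 2                                           # drop the PC with the tiny eigenvalue
P_m = sum((1.0 / l[k]) * np.outer(A[:, k], A[:, k]) for k in range(m))
beta_tilde = P_m @ X.T @ y                      # biased, lower-variance estimator
print(beta_tilde)
```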

Problems with PCA


PCA is not without its problems and limitations:

PCA assumes approximate normality of the input space distribution, although it may still produce a good low dimensional projection of the data even if the data isn't normally distributed.
PCA may fail if the data lies on a complicated manifold.
PCA assumes that the input data is real and continuous.

Extensions to consider:

Collins et al., A generalization of principal components analysis to the exponential family.
Hyvärinen, A. and Oja, E., Independent component analysis: algorithms and applications.
ISOMAP, LLE, maximum variance unfolding, etc.

Non-normal data

Figure: 2d Beta(.1, .1) Samples with PCs

Non-normal data

Figure: PCA Projected
