Principal Component Analysis
Frank Wood
December 8, 2009
This lecture borrows and quotes from Jolliffe's Principal Component Analysis book. Go buy it!
PCA rotation
PCA in a nutshell
Notation
$\mathbf{x}$ is a vector of $p$ random variables; $\alpha_k$ is a vector of $p$ constants; and
$$\alpha_k'\mathbf{x} = \sum_{j=1}^{p} \alpha_{kj} x_j$$
Procedural description
Find a linear function of $\mathbf{x}$, $\alpha_1'\mathbf{x}$, with maximum variance. Next find another linear function of $\mathbf{x}$, $\alpha_2'\mathbf{x}$, uncorrelated with $\alpha_1'\mathbf{x}$ and having maximum variance. Iterate.
Goal
It is hoped, in general, that most of the variation in $\mathbf{x}$ will be accounted for by $m$ PCs, where $m \ll p$.
Derivation of PCA
Assumption and More Notation
$\Sigma$ is the known covariance matrix for the random variable $\mathbf{x}$. Foreshadowing: $\Sigma$ will be replaced with $S$, the sample covariance matrix, when $\Sigma$ is unknown.
Shortcut to solution
For $k = 1, 2, \ldots, p$ the $k$th PC is given by $z_k = \alpha_k'\mathbf{x}$, where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k$th largest eigenvalue $\lambda_k$. If $\alpha_k$ is chosen to have unit length (i.e. $\alpha_k'\alpha_k = 1$) then $\text{Var}(z_k) = \lambda_k$.
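As a minimal sketch of this statement (assuming numpy; the particular covariance matrix is just the small example used later in these notes), the PCs of a known covariance matrix can be read off its eigendecomposition:

```python
import numpy as np

# A known covariance matrix Sigma for p = 2 variables (illustrative values,
# matching the small example later in these notes).
Sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])

# eigh returns eigenvalues of a symmetric matrix in ascending order,
# so reverse to put the largest eigenvalue (and its eigenvector) first.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
lambdas = eigvals[order]      # lambda_1 >= lambda_2 >= ...
alphas = eigvecs[:, order]    # alpha_k is the k-th column, unit length

# The k-th PC is z_k = alpha_k' x, with Var(z_k) = lambda_k.
print("lambda_1 =", lambdas[0], "alpha_1 =", alphas[:, 0])
```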
Derivation of PCA
First Step
Find $\alpha_k'\mathbf{x}$ that maximizes $\text{Var}(\alpha_k'\mathbf{x}) = \alpha_k'\Sigma\alpha_k$. Without a constraint we could pick a very big $\alpha_k$. Choose a normalization constraint, namely $\alpha_k'\alpha_k = 1$ (unit length vector).
Derivation of PCA
Constrained maximization - method of Lagrange multipliers
Maximizing $\alpha_k'\Sigma\alpha_k$ subject to $\alpha_k'\alpha_k = 1$ with a Lagrange multiplier $\lambda_k$ results in
$$\frac{d}{d\alpha_k}\left[\alpha_k'\Sigma\alpha_k - \lambda_k(\alpha_k'\alpha_k - 1)\right] = 0$$
$$\Sigma\alpha_k - \lambda_k\alpha_k = 0$$
$$\Sigma\alpha_k = \lambda_k\alpha_k$$
This should be recognizable as an eigenvector equation, where $\alpha_k$ is an eigenvector of $\Sigma$ and $\lambda_k$ is the associated eigenvalue. Which eigenvector should we choose?
Derivation of PCA
Constrained maximization - method of Lagrange multipliers
If we recognize that the quantity to be maximized is
$$\alpha_k'\Sigma\alpha_k = \alpha_k'\lambda_k\alpha_k = \lambda_k\alpha_k'\alpha_k = \lambda_k$$
then we should choose $\lambda_k$ to be as big as possible. So, calling $\lambda_1$ the largest eigenvalue of $\Sigma$ and $\alpha_1$ the corresponding eigenvector, the solution $\alpha_1$ to
$$\Sigma\alpha_1 = \lambda_1\alpha_1$$
yields the 1st principal component of $\mathbf{x}$, namely $\alpha_1'\mathbf{x}$. In general $\alpha_k'\mathbf{x}$ will be the $k$th PC of $\mathbf{x}$ and $\text{Var}(\alpha_k'\mathbf{x}) = \lambda_k$. We will demonstrate this for $k = 2$; the case $k > 2$ is more involved but similar.
Derivation of PCA
Constrained maximization - more constraints
The second PC, $\alpha_2'\mathbf{x}$, maximizes $\alpha_2'\Sigma\alpha_2$ subject to being uncorrelated with $\alpha_1'\mathbf{x}$. The zero-correlation constraint can be expressed using any of these equations:
$$\text{cov}(\alpha_1'\mathbf{x}, \alpha_2'\mathbf{x}) = \alpha_1'\Sigma\alpha_2 = \alpha_2'\Sigma\alpha_1 = \alpha_2'\lambda_1\alpha_1 = \lambda_1\alpha_2'\alpha_1 = \lambda_1\alpha_1'\alpha_2 = 0$$
Derivation of PCA
Constrained maximization - more constraints
Choosing the last of these constraint equations, with Lagrange multipliers $\lambda_2$ and $\phi$, the quantity to maximize is $\alpha_2'\Sigma\alpha_2 - \lambda_2(\alpha_2'\alpha_2 - 1) - \phi\,\alpha_2'\alpha_1$. Differentiation of this quantity w.r.t. $\alpha_2$ (and setting the result equal to zero) yields
$$\frac{d}{d\alpha_2}\left[\alpha_2'\Sigma\alpha_2 - \lambda_2(\alpha_2'\alpha_2 - 1) - \phi\,\alpha_2'\alpha_1\right] = 0$$
$$\Sigma\alpha_2 - \lambda_2\alpha_2 - \phi\alpha_1 = 0$$
If we left-multiply $\alpha_1'$ into this expression,
$$\alpha_1'\Sigma\alpha_2 - \lambda_2\alpha_1'\alpha_2 - \phi\alpha_1'\alpha_1 = 0$$
$$0 - 0 - \phi \cdot 1 = 0$$
then we can see that $\phi$ must be zero, and when this is true we are left with
$$\Sigma\alpha_2 - \lambda_2\alpha_2 = 0$$
Derivation of PCA
Clearly $\Sigma\alpha_2 - \lambda_2\alpha_2 = 0$ is another eigenvalue equation, and the same strategy of choosing $\alpha_2$ to be the eigenvector associated with the second largest eigenvalue yields the second PC of $\mathbf{x}$, namely $\alpha_2'\mathbf{x}$. This process can be repeated for $k = 1, \ldots, p$, yielding up to $p$ different eigenvectors of $\Sigma$ along with the corresponding eigenvalues $\lambda_1, \ldots, \lambda_p$. Furthermore, the variance of each of the PCs is given by
$$\text{Var}[\alpha_k'\mathbf{x}] = \lambda_k, \quad k = 1, 2, \ldots, p$$
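A small numerical check of the result (a sketch assuming numpy; the covariance matrix is the same illustrative one used above): sampling data with a known covariance and projecting onto its eigenvectors should give approximately uncorrelated PCs whose variances match the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[4.5, 1.5],
                  [1.5, 1.0]])
x = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=100_000)

# Columns of A are the eigenvectors alpha_1, ..., alpha_p (largest eigenvalue first).
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
A = eigvecs[:, order]

z = x @ A                              # PC scores z_k = alpha_k' x
print(np.cov(z, rowvar=False))         # ~ diag(lambda_1, lambda_2), off-diagonals ~ 0
print(eigvals[order])                  # eigenvalues for comparison
```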
Properties of PCA
For any integer $q$, $1 \le q \le p$, consider the orthonormal linear transformation
$$\mathbf{y} = B'\mathbf{x}$$
where $\mathbf{y}$ is a $q$-element vector and $B'$ is a $(q \times p)$ matrix, and let $\Sigma_y = B'\Sigma B$ be the variance-covariance matrix for $\mathbf{y}$. Then the trace of $\Sigma_y$, denoted $\text{tr}(\Sigma_y)$, is maximized by taking $B = A_q$, where $A_q$ consists of the first $q$ columns of $A$ (the matrix whose $k$th column is the eigenvector $\alpha_k$). What this means is that if you want to choose a lower-dimensional projection of $\mathbf{x}$, the choice of $B$ described here is probably a good one: it maximizes the (retained) variance of the resulting variables. In fact, since the projections are uncorrelated, the percentage of variance accounted for by retaining the first $q$ PCs is given by
$$\frac{\sum_{k=1}^{q}\lambda_k}{\sum_{k=1}^{p}\lambda_k} \times 100$$
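A small sketch of this calculation (assuming numpy; the eigenvalues below are made up for illustration):

```python
import numpy as np

# Percentage of total variance retained by the first q PCs.
def retained_variance_pct(eigenvalues, q):
    lambdas = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return 100.0 * lambdas[:q].sum() / lambdas.sum()

# Illustrative eigenvalues only: the first two PCs capture ~91.7% of the variance.
print(retained_variance_pct([4.0, 1.5, 0.5], q=2))
```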
When $\Sigma$ is unknown we use the sample covariance matrix $S = \frac{1}{n-1}X'X$, where $X$ is an $(n \times p)$ matrix with $(i,j)$th element $(x_{ij} - \bar{x}_j)$ (in other words, $X$ is a zero-mean design matrix). We construct the matrix $A$ by combining the $p$ eigenvectors of $S$ (or the eigenvectors of $X'X$; they're the same), and then we can define a matrix of PC scores
$$Z = XA$$
Of course, if we instead form $Z$ by selecting the $q$ eigenvectors corresponding to the $q$ largest eigenvalues of $S$ when forming $A$, then we can achieve an optimal (in some senses) $q$-dimensional projection of $\mathbf{x}$.
The most straightforward way of computing the PCA loading matrix is to utilize the singular value decomposition of
$$S = A \Lambda A'$$
where $A$ is a matrix consisting of the eigenvectors of $S$ and $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to each eigenvector. Creating a reduced-dimensionality projection of $X$ is accomplished by selecting the $q$ largest eigenvalues in $\Lambda$ and retaining the $q$ corresponding eigenvectors from $A$.
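A minimal sketch of this procedure (assuming numpy; the data matrix is random filler, and names like `A_q` are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.normal(size=(200, 5))       # placeholder data: n = 200 observations, p = 5

X = raw - raw.mean(axis=0)            # zero-mean design matrix
S = X.T @ X / (X.shape[0] - 1)        # sample covariance matrix

# Eigendecomposition S = A Lambda A', sorted so the largest eigenvalue comes first.
eigvals, A = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

q = 2
A_q = A[:, :q]                        # loadings for the q largest eigenvalues
Z = X @ A_q                           # (n x q) matrix of PC scores
```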
Example: sample covariance matrix
$$S = \begin{pmatrix} 4.5 & 1.5 \\ 1.5 & 1.0 \end{pmatrix}$$
(accompanying example figures omitted).
PCA in linear regression
Consider the standard linear regression model $\mathbf{y} = X\beta + \epsilon$ with least-squares estimator $\hat{\beta} = (X'X)^{-1}X'\mathbf{y}$. Its variance can be written in spectral form as
$$\text{Var}(\hat{\beta}) = \sigma^2 \sum_{k=1}^{p} l_k^{-1} \mathbf{a}_k \mathbf{a}_k'$$
where $l_k$ is the $k$th largest eigenvalue of $X'X$, $\mathbf{a}_k$ is the $k$th column of $A$, and $\sigma^2$ is the observation noise variance, i.e. $\epsilon \sim N(0, \sigma^2 I)$. This sheds light on how multicollinearities produce large variances for the elements of $\hat{\beta}$: if an eigenvalue $l_k$ is small then the resulting variance of the estimator will be large.
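As a quick numerical check of this spectral form (a sketch assuming numpy and the standard least-squares setting; the design matrix and noise variance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))           # arbitrary design matrix
sigma2 = 0.25                          # assumed observation noise variance

# Spectral form: sigma^2 * sum_k l_k^{-1} a_k a_k', with l_k, a_k from X'X.
l, A = np.linalg.eigh(X.T @ X)
spectral = sigma2 * sum((1.0 / l[k]) * np.outer(A[:, k], A[:, k])
                        for k in range(len(l)))

# Direct form of the least-squares estimator's variance: sigma^2 (X'X)^{-1}.
direct = sigma2 * np.linalg.inv(X.T @ X)
print(np.allclose(spectral, direct))   # True: the two expressions agree
```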
One way to reduce this variance is to discard the directions of $X'X$ with small eigenvalues, giving the principal components regression estimator
$$\tilde{\beta} = \sum_{k=1}^{m} l_k^{-1} \mathbf{a}_k \mathbf{a}_k' X'\mathbf{y}$$
where $l_{1:m}$ are the large eigenvalues of $X'X$ and $l_{m+1:p}$ are the small ones.
The variance of this estimator is
$$\text{Var}(\tilde{\beta}) = \sigma^2 \sum_{k=1}^{m} l_k^{-1} \mathbf{a}_k \mathbf{a}_k'$$
This is a biased estimator, but, since its variance is smaller, it is possible that this trade-off is an advantage. Homework: find the bias of this estimator. Hint: use the spectral decomposition of $X'X$.
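A minimal sketch of this truncated estimator (assuming numpy; the function name `pcr_estimator` and the simulated data are illustrative only):

```python
import numpy as np

# Truncated (principal components) estimator: keep only the m directions of X'X
# with the largest eigenvalues; m = p recovers ordinary least squares.
def pcr_estimator(X, y, m):
    l, A = np.linalg.eigh(X.T @ X)
    order = np.argsort(l)[::-1]
    l, A = l[order], A[:, order]
    beta = np.zeros(X.shape[1])
    for k in range(m):                      # truncated sum over large eigenvalues
        a_k = A[:, k]
        beta += (1.0 / l[k]) * a_k * (a_k @ X.T @ y)
    return beta

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.5, 0.0, -0.5]) + 0.1 * rng.normal(size=100)
print(pcr_estimator(X, y, m=2))             # compare with m=4 (full least squares)
```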
Non-normal data
PCA may fail if the data lie on a complicated manifold. PCA also assumes that the input data are real-valued and continuous. Extensions to consider:
Collins et al., A generalization of principal components analysis to the exponential family.
Hyvärinen, A. and Oja, E., Independent component analysis: algorithms and applications.
ISOMAP, LLE, maximum variance unfolding, etc.