
Principal Component Analysis

Violeta I. Bartolome
Senior Associate Scientist
PBGB-CRIL
v.bartolome@cgiar.org

• PCA is a statistical method capable of reducing the dimensionality of multivariate data.
• Multivariate data are:
  – Obtained by taking more than one response variable from each experimental unit.
  – These response variables are correlated.

Example

• An agronomist might be interested in:
  – Seedling height (X1)
  – Leaf length (X2)
  – Leaf width (X3)
  – Ligule length (X4)
• Data were collected from 5 plants.

Data from such studies are usually stored in a 'data matrix', D:

            Seed Ht   Leaf Ln   Leaf Wd   Lig Ln
  Plant 1     x11       x12       x13       x14
  Plant 2      :         :         :         :
  Plant 3      :         :         :         :
  Plant 4      :         :         :         :
  Plant 5     x51       x52       x53       x54
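In R, such a data matrix is conveniently stored as a data frame with one row per plant and one column per response variable. A minimal sketch with hypothetical measurements (the original slides do not give the actual values):

  # Hypothetical 5-plant x 4-trait data matrix D
  D <- data.frame(
    SeedHt = c(17.2, 18.1, 16.5, 17.8, 18.4),  # seedling height, X1
    LeafLn = c(25.3, 27.0, 24.1, 26.2, 27.5),  # leaf length, X2
    LeafWd = c( 1.1,  1.3,  1.0,  1.2,  1.3),  # leaf width, X3
    LigLn  = c( 0.8,  0.9,  0.7,  0.8,  0.9),  # ligule length, X4
    row.names = paste("Plant", 1:5)
  )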


The Problem of Data Reduction

Scenario: Usually the number of rows (experimental units/objects) or the number of columns (response variables/attributes) is large.

Goal: Data reduction on either the rows or the columns of D.

Solution: Perform data reduction techniques -- PCA is one!

Case 1: No variability in X1

[Scatter plot of X1 (vertical axis) against X2 (horizontal axis): the points spread out along X2 but show no variation in X1.]

The amount of information from Xi is shown by its variability. X1 gives very little information, so it can be ignored, reducing the data to X2 alone with no loss of information.
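This can be seen numerically in R (using the hypothetical D above, or any data matrix) by comparing the variance of each column; a variable whose variance is near zero carries almost no information:

  sapply(D, var)   # per-variable variances: small variance = little information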

Case 2: Small variability in X1

[Scatter plot of X1 against X2: the points vary mainly along X2, with only a small spread in X1.]

Ignoring X1 gives little loss of information.

Case 3: Significant variation in both variables

[Scatter plot of X1 against X2: the points spread out substantially along both axes.]

We cannot simply discard any one variable.
To reduce the dimensionality of the data in the last case, we translate the (X1, X2) axes into (Y1, Y2) axes and rotate the new axes.

[Scatter plot of X1 against X2 with the new, rotated (Y1, Y2) axes drawn through the cloud of points.]

What PCA does

• Find a smaller set of new variables to represent the data with as little loss of information as possible.
  – The new set of variables are called principal components and they are uncorrelated.

  Y1 = A11X1 + A21X2 + … + Ap1Xp
   .
   .
   .
  Yp = A1pX1 + A2pX2 + … + AppXp

The Yi's are linear combinations of the original variables.
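A minimal sketch of this idea in R (using the hypothetical D above as the data matrix of X's): the columns of the eigenvector matrix hold the coefficients Aij, and each principal component is the corresponding linear combination of the centered X's. In practice one would simply call prcomp(), as shown later.

  Xc <- scale(as.matrix(D), center = TRUE, scale = FALSE)  # center each variable
  A  <- eigen(cov(Xc))$vectors                             # columns = eigenvectors (the Aij's)
  Y  <- Xc %*% A                                           # Y[, j] = A1j*X1 + A2j*X2 + ... = jth PC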

Y1 = A11X1 + A21X2
Y2 = A12X1 + A22X2

[Rotated-axes plot: nearly all of the variability now lies along one of the new axes, with very little along the other.]

We can now ignore the new axis with little variability. Hence, PCA is used to construct Y1 and Y2, which are called the "principal components" of (X1, X2).

Example

Let X be defined as:
  X1  Plant height (cm)
  X2  Leaf area (cm2)
  X3  Days to maturity
  X4  Dry matter weight
  X5  Panicle length
  X6  Grain dry weight (g/hill)
  X7  % filled grains
• Then new variables may be formed by some 'linear combinations' of the X variables:

  Y1 = A11X1 + A21X2 + A31X3 + A41X4 + A51X5 + A61X6 + A71X7   (big contributors to Y1 among the p variables)
  Y2 = A12X1 + A22X2 + A32X3 + A42X4 + A52X5 + A62X6 + A72X7   (big contributors to Y2)

• We may call Y1 the "growth component" and Y2 the "yield component".
• PCA is used to find the values of the Aij's (the eigenvectors) so that the information from the X's is regained by Y1 and Y2.

TOTAL SYSTEM VARIABILITY

The total variability of the p original variables is reproduced by the p principal components: the first k components account for most of it, leaving little in the remaining p − k. There is almost as much information in the k components as there is in the original p variables.

Properties of Principal Components

• The variance of each principal component is its respective eigenvalue (characteristic root), and distinct components are uncorrelated:

  Var(Yi) = λi
  Cov(Yi, Yk) = 0   for i ≠ k

  where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

• The total variance of the X's equals the total of the eigenvalues:

  Σ Var(Xi) = σ11 + σ22 + … + σpp = Σ Var(Yi) = λ1 + λ2 + … + λp

• Proportion of total variance due to the kth PC = λk / (λ1 + λ2 + … + λp)

If most (≥ 80%) of the total variance can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.
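These properties can be checked numerically in R (an illustrative sketch, assuming S is a p × p covariance matrix, e.g. the one computed from the hypothetical D above):

  S   <- cov(as.matrix(D))           # sample covariance matrix of the X's
  lam <- eigen(S)$values             # eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp
  all.equal(sum(lam), sum(diag(S)))  # TRUE: eigenvalues reproduce the total variance
  lam / sum(lam)                     # proportion of total variance due to each PC
  cumsum(lam) / sum(lam)             # cumulative proportion; look for the first value ≥ 0.80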
Eigenvector

• The set of corresponding coefficients of the X's in the kth PC, [A1k A2k … Aik … Apk], is the eigenvector of the kth PC.
• The magnitude of Aij measures the importance of Xi to Yj.
• The sign of Aij denotes the direction of the contribution of Xi to Yj.

Principal components may be obtained using either the variance-covariance matrix Σ or the correlation matrix ρ:
• PCA using Σ considers the relative weights of the variables (larger variance ⇒ larger weight).
• PCA using ρ considers all variables with equal weights.

The correlation matrix is recommended when:
• attributes have large variances
• scales have highly varied ranges
• variables are measured in different units
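In R, this choice corresponds to the scale. argument of prcomp() (a brief sketch; X stands for any numeric data matrix such as the hypothetical D above):

  pc_cov <- prcomp(X)                 # covariance-based PCA: original weights kept
  pc_cor <- prcomp(X, scale. = TRUE)  # correlation-based PCA: all variables standardized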

Example: Random variables X1, X2, and X3 with the covariance matrix

        |  1  -2   0 |
  Σ  =  | -2   5   0 |
        |  0   0   2 |

Eigenvalue-eigenvector pairs:
  λ1 = 5.83   e1' = [0.383  -0.924  0]
  λ2 = 2.00   e2' = [0       0      1]
  λ3 = 0.17   e3' = [0.924   0.383  0]

Principal components:
  Y1 = 0.383 X1 - 0.924 X2
  Y2 = X3
  Y3 = 0.924 X1 + 0.383 X2

Proportion of total variance accounted for by the PCs:
  λ1 / (λ1 + λ2 + λ3) = 5.83 / 8 = 0.73
  λ2 / (λ1 + λ2 + λ3) = 2.00 / 8 = 0.25

Components Y1 and Y2 explain 98% of the total variation. Hence they could replace the 3 original variables with little loss of information.
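This worked example can be reproduced in R (a quick check, not part of the original slides; note that eigen() may return eigenvectors with flipped signs, which does not change the meaning of the components):

  Sigma <- matrix(c( 1, -2, 0,
                    -2,  5, 0,
                     0,  0, 2), nrow = 3, byrow = TRUE)
  e <- eigen(Sigma)
  e$values                  # 5.83, 2.00, 0.17 (rounded)
  e$vectors                 # columns are the eigenvectors e1, e2, e3
  e$values / sum(e$values)  # 0.73, 0.25, 0.02 -> first two PCs explain ~98%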
PCA in R

[Screenshots of the sample data, the R commands, and their output.]

• Read the sample data into R and run prcomp().
• Setting scale. = TRUE causes the data to be standardized to have unit variance before the analysis (i.e., PCA on the correlation matrix).
• In the sample output, 80% of the total variance is explained by the first 3 principal components.
• The rotation matrix of the output contains the eigenvectors; the scores give the value of each principal component for every observation.
• biplot() draws the PCA plot, showing the scores and the variable loadings together.

Another Example

[A second dataset analyzed with the same steps.]
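A minimal end-to-end sketch of this workflow (the file name and object names are placeholders, not taken from the original slides):

  dat <- read.csv("sample_data.csv")   # hypothetical sample data file
  pca <- prcomp(dat, scale. = TRUE)    # standardize to unit variance before the analysis

  summary(pca)     # proportion and cumulative proportion of variance per PC
  pca$rotation     # eigenvectors (loadings)
  head(pca$x)      # scores of each observation on the PCs

  biplot(pca)      # PCA plot of scores and loadings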

Thank you!
