
Principal Component Analysis

Violeta I. Bartolome
Senior Associate Scientist
PBGB-CRIL
v.bartolome@cgiar.org

• PCA is a statistical method capable of reducing the dimensionality of multivariate data.
• Multivariate data are:
  – Obtained by taking more than one response variable from each experimental unit.
  – These response variables are correlated.

Example

• An agronomist might be interested in:
  – Seedling height (X1)
  – Leaf length (X2)
  – Leaf width (X3)
  – Ligule length (X4)
• Data were collected from 5 plants.

Data from such studies are usually stored in a 'data matrix', D:

            Seed Ht   Leaf Ln   Leaf Wd   Lig Ln
  Plant 1     x11       x12       x13       x14
  Plant 2      :         :         :         :
  Plant 3      :         :         :         :
  Plant 4      :         :         :         :
  Plant 5     x51       x52       x53       x54
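In R, such a data matrix is conveniently stored as a data frame with one row per plant and one column per response variable. A minimal sketch with hypothetical measurements (the original slides do not give the actual values):

  # Hypothetical 5-plant x 4-trait data matrix D
  D <- data.frame(
    SeedHt = c(17.2, 18.1, 16.5, 17.8, 18.4),  # seedling height, X1
    LeafLn = c(25.3, 27.0, 24.1, 26.2, 27.5),  # leaf length, X2
    LeafWd = c( 1.1,  1.3,  1.0,  1.2,  1.3),  # leaf width, X3
    LigLn  = c( 0.8,  0.9,  0.7,  0.8,  0.9),  # ligule length, X4
    row.names = paste("Plant", 1:5)
  )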


The Problem of Data Reduction

Scenario: Usually the number of rows (experimental units/objects) or the number of columns (response variables/attributes) is large.

Goal: Data reduction on either the rows or the columns of D.

Solution: Perform data reduction techniques -- PCA is one!

Case 1: No variability in X1

[Scatter plot of X1 (vertical axis) against X2 (horizontal axis): the points spread out along X2 but show no variation in X1.]

The amount of information from Xi is shown by its variability. X1 gives very little information, so it can be ignored, reducing the data to X2 alone with no loss of information.
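This can be seen numerically in R (using the hypothetical D above, or any data matrix) by comparing the variance of each column; a variable whose variance is near zero carries almost no information:

  sapply(D, var)   # per-variable variances: small variance = little information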

Case 2: Small variability in X1

[Scatter plot of X1 against X2: the points vary mainly along X2, with only a small spread in X1.]

Ignoring X1 gives little loss of information.

Case 3: Significant variation in both variables

[Scatter plot of X1 against X2: the points spread out substantially along both axes.]

We cannot simply discard any one variable.
To reduce the dimensionality of the data in the last case, we translate the (X1, X2) axes into (Y1, Y2) axes and rotate the new axes.

[Scatter plot of X1 against X2 with the new, rotated (Y1, Y2) axes drawn through the cloud of points.]

What PCA does

• Find a smaller set of new variables to represent the data with as little loss of information as possible.
  – The new set of variables are called principal components and they are uncorrelated.

  Y1 = A11X1 + A21X2 + … + Ap1Xp
   .
   .
   .
  Yp = A1pX1 + A2pX2 + … + AppXp

The Yi's are linear combinations of the original variables.
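A minimal sketch of this idea in R (using the hypothetical D above as the data matrix of X's): the columns of the eigenvector matrix hold the coefficients Aij, and each principal component is the corresponding linear combination of the centered X's. In practice one would simply call prcomp(), as shown later.

  Xc <- scale(as.matrix(D), center = TRUE, scale = FALSE)  # center each variable
  A  <- eigen(cov(Xc))$vectors                             # columns = eigenvectors (the Aij's)
  Y  <- Xc %*% A                                           # Y[, j] = A1j*X1 + A2j*X2 + ... = jth PC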

Y1 = A11X1 + A21X2
Y2 = A12X1 + A22X2

[Rotated-axes plot: nearly all of the variability now lies along one of the new axes, with very little along the other.]

We can now ignore the new axis with little variability. Hence, PCA is used to construct Y1 and Y2, which are called the "principal components" of (X1, X2).

Example

Let X be defined as:
  X1  Plant height (cm)
  X2  Leaf area (cm2)
  X3  Days to maturity
  X4  Dry matter weight
  X5  Panicle length
  X6  Grain dry weight (g/hill)
  X7  % filled grains
• Then new variables may be formed by some 'linear combinations' of the X variables:

  Y1 = A11X1 + A21X2 + A31X3 + A41X4 + A51X5 + A61X6 + A71X7   (big contributors to Y1 among the p variables)
  Y2 = A12X1 + A22X2 + A32X3 + A42X4 + A52X5 + A62X6 + A72X7   (big contributors to Y2)

• We may call Y1 the "growth component" and Y2 the "yield component".
• PCA is used to find the values of the Aij's (the eigenvectors) so that the information from the X's is regained by Y1 and Y2.

TOTAL SYSTEM VARIABILITY

The total variability of the p original variables is reproduced by the p principal components: the first k components account for most of it, leaving little in the remaining p − k. There is almost as much information in the k components as there is in the original p variables.

Properties of Principal Components

• The variance of each principal component is its respective eigenvalue (characteristic root), and distinct components are uncorrelated:

  Var(Yi) = λi
  Cov(Yi, Yk) = 0   for i ≠ k

  where λ1 ≥ λ2 ≥ … ≥ λp ≥ 0.

• The total variance of the X's equals the total of the eigenvalues:

  Σ Var(Xi) = σ11 + σ22 + … + σpp = Σ Var(Yi) = λ1 + λ2 + … + λp

• Proportion of total variance due to the kth PC = λk / (λ1 + λ2 + … + λp)

If most (≥ 80%) of the total variance can be attributed to the first one, two, or three components, then these components can "replace" the original p variables without much loss of information.
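These properties can be checked numerically in R (an illustrative sketch, assuming S is a p × p covariance matrix, e.g. the one computed from the hypothetical D above):

  S   <- cov(as.matrix(D))           # sample covariance matrix of the X's
  lam <- eigen(S)$values             # eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp
  all.equal(sum(lam), sum(diag(S)))  # TRUE: eigenvalues reproduce the total variance
  lam / sum(lam)                     # proportion of total variance due to each PC
  cumsum(lam) / sum(lam)             # cumulative proportion; look for the first value ≥ 0.80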
Eigenvector

• The set of corresponding coefficients of the X's in the kth PC, [A1k A2k … Aik … Apk], is the eigenvector of the kth PC.
• The magnitude of Aij measures the importance of Xi to Yj.
• The sign of Aij denotes the direction of the contribution of Xi to Yj.

Principal components may be obtained using either the variance-covariance matrix Σ or the correlation matrix ρ:
• PCA using Σ considers the relative weights of the variables (larger variance ⇒ larger weight).
• PCA using ρ considers all variables with equal weights.

The correlation matrix is recommended when:
• attributes have large variances
• scales have highly varied ranges
• variables are measured in different units
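In R, this choice corresponds to the scale. argument of prcomp() (a brief sketch; X stands for any numeric data matrix such as the hypothetical D above):

  pc_cov <- prcomp(X)                 # covariance-based PCA: original weights kept
  pc_cor <- prcomp(X, scale. = TRUE)  # correlation-based PCA: all variables standardized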

Example: Random variables X1, X2, and X3 with the covariance matrix

        |  1  -2   0 |
  Σ  =  | -2   5   0 |
        |  0   0   2 |

Eigenvalue-eigenvector pairs:
  λ1 = 5.83   e1' = [0.383  -0.924  0]
  λ2 = 2.00   e2' = [0       0      1]
  λ3 = 0.17   e3' = [0.924   0.383  0]

Principal components:
  Y1 = 0.383 X1 - 0.924 X2
  Y2 = X3
  Y3 = 0.924 X1 + 0.383 X2

Proportion of total variance accounted for by the PCs:
  λ1 / (λ1 + λ2 + λ3) = 5.83 / 8 = 0.73
  λ2 / (λ1 + λ2 + λ3) = 2.00 / 8 = 0.25

Components Y1 and Y2 explain 98% of the total variation. Hence they could replace the 3 original variables with little loss of information.
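This worked example can be reproduced in R (a quick check, not part of the original slides; note that eigen() may return eigenvectors with flipped signs, which does not change the meaning of the components):

  Sigma <- matrix(c( 1, -2, 0,
                    -2,  5, 0,
                     0,  0, 2), nrow = 3, byrow = TRUE)
  e <- eigen(Sigma)
  e$values                  # 5.83, 2.00, 0.17 (rounded)
  e$vectors                 # columns are the eigenvectors e1, e2, e3
  e$values / sum(e$values)  # 0.73, 0.25, 0.02 -> first two PCs explain ~98%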
PCA in R

[Screenshots of the sample data, the R commands, and their output.]

• Read the sample data into R and run prcomp().
• Setting scale. = TRUE causes the data to be standardized to have unit variance before the analysis (i.e., PCA on the correlation matrix).
• In the sample output, 80% of the total variance is explained by the first 3 principal components.
• The rotation matrix of the output contains the eigenvectors; the scores give the value of each principal component for every observation.
• biplot() draws the PCA plot, showing the scores and the variable loadings together.

Another Example

[A second dataset analyzed with the same steps.]
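A minimal end-to-end sketch of this workflow (the file name and object names are placeholders, not taken from the original slides):

  dat <- read.csv("sample_data.csv")   # hypothetical sample data file
  pca <- prcomp(dat, scale. = TRUE)    # standardize to unit variance before the analysis

  summary(pca)     # proportion and cumulative proportion of variance per PC
  pca$rotation     # eigenvectors (loadings)
  head(pca$x)      # scores of each observation on the PCs

  biplot(pca)      # PCA plot of scores and loadings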

Thank you!
