
Principal Components Analysis (PCA)

Fundamentals of Predictive Analytics with JMP (Klimberg & McCullough)
Chapter 6 – pp. 135-149
Why use Principal Components Analysis (PCA)?

PCA provides a method for understanding data by:
1. Extracting a smaller number of important components
2. Accounting for/explaining the variability in the data, where each principal component considers a subset of variables to be important
3. Providing DATA SET INSIGHT, which may help uncover factors underlying the data
4. Reducing the number of variables in the model

Applications

• Uses:
– Data Visualization
– Data Reduction
– Data Classification
– Trend Analysis
– Noise Reduction

• Examples:
– How many unique “sub-sets” are in the sample?
– How are they similar / different?
– What are the underlying factors that influence the samples?
– Which time / temporal trends are (anti)correlated?
– Which measurements are needed to differentiate?
– How to best present what is “interesting”?
How does Principal Components Analysis Differ from Cluster Analysis?

Clustering segments the data table horizontally, by grouping cases or records, while Principal Components Analysis (and Factor Analysis) segments the data table vertically, reducing the number of variables to be analyzed.

How does PCA Work?

• PCA transforms a set of k correlated variables into a new set of uncorrelated variables called principal components.

• The first principal component (Prin1) is a linear combination of the original variables and accounts for the most variation in the data set:

$\mathrm{Prin}_1 = w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + \cdots + w_{1k}x_k$

• Prin2 is not correlated with the first principal component and accounts for the second-most variation in the data set, and so on.

• The general form of the ith principal component is

$\mathrm{Prin}_i = w_{i1}x_1 + w_{i2}x_2 + w_{i3}x_3 + \cdots + w_{ik}x_k$

where $w_{i1}$ is the weight of the first variable in the ith principal component.

• Principal components are orthogonal (not correlated with each other).
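Stated compactly (standard PCA theory, summarizing the bullets above rather than adding anything from the book): the weight vectors $\mathbf{w}_i$ are the eigenvectors of the correlation (or covariance) matrix $\mathbf{R}$ of the data, so that

$$\mathbf{R}\,\mathbf{w}_i = \lambda_i \mathbf{w}_i, \qquad \operatorname{Var}(\mathrm{Prin}_i) = \lambda_i, \qquad \mathbf{w}_i^{\top}\mathbf{w}_j = 0 \;\; (i \neq j),$$

with eigenvalues ordered $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$, which is why Prin1 explains the most variation and each later component explains less.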

Steps in PCA

1. Start with a data table with multiple observations and variables
2. Construct a correlation matrix or covariance matrix
3. Extract the principal components from the matrix, along with the amount of variation each explains:
a) Eigenvectors (a vector of weights or loadings)
b) Eigenvalues (the amount of variation explained by each component)
4. Sort the principal components by the amount of variation each explains
5. Select the number of principal components to use (see the sketch below)
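A minimal numpy sketch of these steps (my illustration; JMP performs the equivalent computation internally):

```python
# Steps 1-5 of PCA via eigendecomposition of the correlation matrix (illustrative).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # step 1: 100 observations, 5 variables
R = np.corrcoef(X, rowvar=False)         # step 2: 5 x 5 correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)     # step 3: eigenvalues and eigenvectors
order = np.argsort(eigvals)[::-1]        # step 4: sort by variation explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals / eigvals.sum())           # proportion of variation per component,
                                         # the basis for step 5 (see methods below)
```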

Principal Components Analysis on:

• Covariance Matrix:
– Variables must be in same units
– Emphasizes variables with most variance
– Mean eigenvalue ≠ 1.0

• Correlation Matrix:
– Variables are standardized (mean 0.0, SD 1.0)
– Variables can be in different units
– All variables have same impact on analysis
– Mean eigenvalue = 1.0
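The last two bullets follow from the trace: a k × k correlation matrix has trace k, so its eigenvalues average exactly 1.0. A small sketch of the contrast (my illustration, not from the book):

```python
# PCA on covariance vs. correlation: the high-variance column dominates the
# covariance version; correlation eigenvalues always average 1.0 (illustrative).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) * np.array([1.0, 1.0, 1.0, 50.0])

cov_vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
cor_vals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))

print(cov_vals.mean())   # far from 1.0; driven by the large-variance variable
print(cor_vals.mean())   # exactly 1.0: trace of a 4 x 4 correlation matrix is 4
```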
Principal Components Analysis in JMP

Example: princomp.jmp (page 142)

1. Select Analyze → Multivariate Methods → Principal Components
2. Select all of the INDEPENDENT variables and enter them into Y, Columns
3. Click on the LRT (little red triangle) and select Eigenvalues and Scree Plot
4. Select Save Principal Components and enter the number of components you wish to use
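For readers without JMP, a rough Python equivalent of this workflow (my sketch; the file name princomp.csv and the column handling are hypothetical stand-ins for princomp.jmp):

```python
# Approximate Python version of the JMP steps above (illustrative).
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("princomp.csv")            # stand-in: independent variables only
X = StandardScaler().fit_transform(df)      # PCA on correlations = standardized data

pca = PCA().fit(X)
print(pca.explained_variance_)              # ~ eigenvalues, for the scree plot

scores = pca.transform(X)[:, :3]            # "Save Principal Components", keeping 3
df[["Prin1", "Prin2", "Prin3"]] = scores    # saved back as new columns
```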

Selecting the Best Number of Principal Components – Method 1
(using princomp.jmp from Klimberg & McCullough, pg. 143)

• Look for the “elbow” in the scree plot:
a) The scree plot shows the variance explained by each principal component
b) The ideal number of components is the number right before the tail levels off
c) Selecting principal components after this point adds little
d) This suggests keeping the first three components here
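A scree plot is easy to draw outside JMP as well; a minimal sketch with hypothetical eigenvalues (my illustration):

```python
# Scree plot: eigenvalues vs. component number; look for the elbow (illustrative).
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([3.2, 1.8, 1.1, 0.4, 0.3, 0.2])   # hypothetical, already sorted

plt.plot(np.arange(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue (variance explained)")
plt.title("Scree plot")
plt.show()
```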

Selecting the Best Number of Principal Components – Methods 2 & 3

• Method 2 – How many eigenvalues are greater than 1?
a) Eigenvalues over 1 are contributing “more than their share” to explaining variation
b) Do not be too strict with this rule
c) In this example we would use 4 components

• Method 3 – Account for a specified proportion of variation (e.g., if 75%, then choose 5 components here)
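Both rules are one-liners in code; a sketch with hypothetical eigenvalues (my illustration):

```python
# Method 2 (eigenvalues > 1) and method 3 (reach a target proportion of variation).
import numpy as np

eigvals = np.array([3.2, 1.8, 1.1, 1.05, 0.4, 0.25, 0.2])   # hypothetical, sorted

n_kaiser = int((eigvals > 1).sum())                  # method 2: 4 components here
cum_prop = np.cumsum(eigvals) / eigvals.sum()
n_prop = int(np.searchsorted(cum_prop, 0.75) + 1)    # method 3: first component count
                                                     # whose cumulative share >= 75%
print(n_kaiser, n_prop)
```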

Another Example – MassHousing.jmp

• We are trying to predict Median Value of owner-occupied homes, based on a number of variables.
• Note that some of the variables in the data file have been transformed.

Variables:
– CRIM: per capita crime rate by town
– ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
– INDUS: proportion of non-retail business acres per town
– NOX: nitric oxides concentration (parts per 10 million)
– ROOMS: average number of rooms per dwelling
– AGE: proportion of owner-occupied units built prior to 1940
– DISTANCE: weighted distances to five Boston employment centres
– RADIAL: index of accessibility to radial highways
– TAX: full-value property-tax rate per $10,000
– PT: pupil-teacher ratio by town
– B: 1000(Bk – 0.63)² where Bk is the proportion of blacks by town
– LSTAT: % lower status of the population
– MVALUE: median value of owner-occupied homes in $1000s
Results of Multiple Regression

The results of a Multiple Regression Analysis show that we have 11 significant independent variables (of the 13 we have tried).

Principal Components Analysis – Mass Housing Data

The Scree Plot shows that after the first Principal Component, the explanation of variability drops off quite a bit.

Principal Components Analysis – Mass Housing Data (continued)

We can see that the first principal component explains over 51% of the variability. We can account for about 72% of the variability by using three principal components.

Loading Matrix from the Mass Housing PCA

By looking at the loading matrix we can see the relative importance of each variable within each principal component.
Loading Plot for Components 1 & 2

PCA can also allow us to see the structure in the data, and we may be able to see variables that are near each other in the principal components space.

Note how the variables tend to cluster in the four different quadrants.

Comparing Multiple Regression Results – Principal Components vs. Original Independent Variables

[Shown: results from Multiple Regression using 3 principal components, and results from Multiple Regression using 11 independent variables.]

Note that while we have gone from 11 variables down to 3 principal components, our Adjusted R-squared has decreased by only about 9%, while RMSE has increased slightly from 4.73 to 5.49.
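The numbers above come from JMP; the same comparison can be sketched in Python on synthetic data (my illustration; the data and coefficients are hypothetical):

```python
# Regression on 3 PC scores vs. all 13 original predictors (illustrative).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def adj_r2(model, X, y):
    n, p = X.shape
    return 1 - (1 - model.score(X, y)) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(3)
Z = rng.normal(size=(500, 3))                        # three latent factors
X = Z @ rng.normal(size=(3, 13)) + 0.3 * rng.normal(size=(500, 13))
y = Z @ np.array([3.0, -2.0, 1.0]) + rng.normal(size=500)

scores = PCA(n_components=3).fit_transform(StandardScaler().fit_transform(X))

full = LinearRegression().fit(X, y)
pcr = LinearRegression().fit(scores, y)
print(adj_r2(full, X, y), adj_r2(pcr, scores, y))    # PCR stays close to the full fit
```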

Summary and key points

• PCA can help us gain data insight by generating and interpreting results
• PCA may help us to reduce the number of variables in the model
• New variables can be generated and used directly in future modeling
• The first principal component accounts for the largest amount of variation in the data
• The second principal component is orthogonal to (not correlated with) the first and explains the second-largest amount of variation, and so on

