
Principal Components and Factor Analysis: An Example

36-350, Data Mining

1 October 2008

Data: The United States circa 1977

The state.x77 data set is available by default in R; it's a compilation of data
about the US states put together from the 1977 Statistical Abstract of the United
States, with the actual measurements mostly made a few years before.¹ It's a
little long in the tooth now, but handy for teaching purposes.
The variables are:
Population: in thousands
Income: dollars per capita
Illiteracy: percent of the population (that is illiterate)
Life Exp: years of life expectancy at birth
Murder: number of murders and non-negligent manslaughters per 100,000 people
HS Grad: percent of adults who were high-school graduates
Frost: mean number of days per year with low temperatures below freezing
Area: in square miles
You can do help(state.x77) for a little more detail. There is also a state.center
data set, giving the longitude and latitude of the geographic center of each state
(except for Alaska and Hawaii, which are artificially put somewhere off the west
coast); see Figures 1 and 2.

¹ The Statistical Abstract is "the best book published in America" (P. Krugman), an immensely valuable compilation of data about a huge range of aspects of American life, put out every year by the Census Bureau. It's available for free online at http://www.census.gov/compendia/statab/.


Figure 1: Geographic centers of the US states, per R's built-in state.center
data set. N.B. Alaska and Hawaii are artificially brought close to the lower 48.

plot.new() # Start up a new plot
# How big is the plotting window? Set ranges from the data we'll be graphing
plot.window(xlim=range(state.center$x),ylim=range(state.center$y))
# Put the name of each state at its center. Shrink text 25% so there's less
# overlap between the names.
text(state.center,state.name,cex=0.75)
# Add the horizontal axis at the bottom and the vertical at the left.
axis(1)
axis(2)
# Draw a box around the plot.
box()
# Add the titles
title(main="Where R Thinks the US States Are",xlab="Longitude",ylab="Latitude")
Figure 2: R code for drawing Figure 1.

Principal Components

The command prcomp is the preferred one for principal component analysis in R. It performs a singular value decomposition directly on the data matrix;
this is slightly more accurate than doing an eigendecomposition of the
covariance matrix, but conceptually equivalent. Running it naively on the raw data, however, is unrewarding.
> prcomp(state.x77)
Standard deviations:
[1] 8.532765e+04 4.465107e+03 5.591454e+02 4.640117e+01
[5] 6.043099e+00 2.461524e+00 6.580129e-01 2.899911e-01
. . . followed by more output specifying the principal components. The standard
deviations here are the square roots of the eigenvalues. (Exercise: why are
the square roots standard deviations?) To see this another way, use the plot
command on the output of prcomp:
plot(prcomp(state.x77),type="l")
The output is Figure 3.
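As a quick check (my addition, not part of the original notes), the squared standard deviations reported by prcomp should match the eigenvalues of the sample covariance matrix, since both use the divisor n-1:

# Compare prcomp's squared standard deviations with the covariance eigenvalues
state.pca <- prcomp(state.x77)
cov.evals <- eigen(cov(state.x77), symmetric=TRUE)$values
all.equal(state.pca$sdev^2, cov.evals)  # should be TRUE, up to rounding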
The thing to notice is that the first eigenvalue is immensely larger than the
others. Naively, one might think this means that there's just one component
here and our job is done. In reality, this is because we've chosen units in which
one variable is immensely larger than the others, so it varies much more.
> apply(state.x77,2,sd)
  Population       Income   Illiteracy     Life Exp
4.464491e+03 6.144699e+02 6.095331e-01 1.342394e+00
      Murder      HS Grad        Frost         Area
3.691540e+00 8.076998e+00 5.198085e+01 8.532730e+04
In other words, typical variations in area from state to state are orders of magnitude larger than anything else. And in fact, if we look at the principal components, the first one is just giving us area, and everything else is noise. We can see
this by doing a biplot of the data and the original features together (Figure 4).
biplot(prcomp(state.x77),cex=c(0.75,1))
# Plot original data points and feature vectors against principal
# components; shrink data-point names by 25% for contrast/legibility
Things are a lot better if we standardize the variables to have variance 1.
prcomp will do this for us. (See Figure 5.)
biplot(prcomp(state.x77,scale.=TRUE),cex=c(0.5,0.75))
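As an aside (my sketch, not in the original notes), standardizing the columns by hand with scale() and then running prcomp gives the same standard deviations as using the scale. argument:

# Standardizing by hand should agree with prcomp(..., scale.=TRUE)
pca.scaled <- prcomp(state.x77, scale.=TRUE)
pca.byhand <- prcomp(scale(state.x77))
all.equal(pca.scaled$sdev, pca.byhand$sdev)  # should be TRUE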
Looking at the plot, several things are striking.


Figure 3: Scree plot for PCA of the unscaled state.x77 data, created with plot(prcomp(state.x77),type="l").


Figure 4: PCA plot of the state.x77 data set without scaling. Note how the first
principal component is basically just area; in the units chosen, this is orders of
magnitude more variable than anything else.


Figure 5: PCA plot of the state.x77 data, rescaling features to have variance 1. The
first principal component distinguishes cold, educated, long-lived states
with little violence from warm, ill-educated, short-lived, murderous states,
which also tend to be poorer. The second principal component mostly distinguishes big, populous, rich states from small, poor ones, which have some tendency to be less murderous and a very weak tendency to be warmer.

1. The first principal component, the one which summarizes the most variance, is very nearly the same as the illiteracy rate (in the positive direction) or life expectancy (going the other way), and almost the same as
the number of frost days per year. The murder rate projects very strongly
on to the positive direction of the first principal component. Put a bit
crudely, PC1 distinguishes between cold states with educated, harmless,
long-lived populations, and warm, ill-educated, short-lived, violent states.
(The loadings behind this description are printed by the short code sketch after this list.)
2. The second PC correlates extremely closely with area: states really do
vary hugely in their area, even when you standardize. It also strongly
correlates with population and income, and more weakly with secondary
education and murder. More weakly still, it negatively correlates with life
expectancy and frost. The second PC distinguishes big rich educated
states from small poor ignorant states, which tend to be a bit warmer
and less murderous.
3. The two leading components are not independent: high values of PC2
imply a very narrow range of PC1 values.
4. Alaska was something of an outlier; Hawaii, on the other hand, was as
close to being typically American as anywhere. (Make of this what you
will.)
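The loadings referred to above can be printed directly; a one-line sketch (my addition, not from the original notes):

# Loadings of each standardized feature on the first two principal components
round(prcomp(state.x77, scale.=TRUE)$rotation[,1:2], 2)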
Looking at the scree plot (Figure 6), there is no obvious point at which it levels off or
otherwise breaks; you could argue for one principal component or for five.
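To put numbers on that judgment (again a sketch I am adding, not from the original notes), summary reports the proportion of variance captured by each component:

# Standard deviations, proportions of variance, and cumulative proportions
summary(prcomp(state.x77, scale.=TRUE))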
The fact that PC1 is so close to the frost variable suggests that part of
what's being picked up here is the difference between the South and the rest of
the country.² There are two ways we could try to follow up on that. One would
be to include a categorical variable for the region. Unfortunately, PCA only
works with numerical, metrical values. We could try to finesse that by assigning
numbers to the categories, and people do that sometimes, but the results become
totally hostage to which numbers we use, i.e., they are under our control, and not in a
good way. There are models which will let us combine both variable types, but
it will take us a while to develop them properly.
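One thing we can do without committing to any numerical coding (a sketch of my own, not from the notes) is simply color the scores by the built-in state.region factor:

# Plot PC scores, coloring each state's abbreviation by its census region
pc.scores <- prcomp(state.x77, scale.=TRUE)$x
plot(pc.scores[,1:2], type="n", xlab="PC1", ylab="PC2")
text(pc.scores[,1:2], labels=state.abb, col=as.numeric(state.region), cex=0.75)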
The other approach is to throw in latitude and longitude as variables. We'd
need both, because, e.g., New Mexico isn't part of the South, while Kentucky
is. Let's make that data.
> state.x77.centers = cbind(state.x77,state.center$x,state.center$y)
> colnames(state.x77.centers) = c(colnames(state.x77),"Longitude","Latitude")
The first line creates a new array by adding on the longitudes and latitudes as
columns. The second names the columns, copying over the column names from
state.x77. Let's run this through PCA (Figure 7).
> biplot(prcomp(state.x77.centers,scale.=TRUE),cex=c(0.5,0.8))


Figure 6: Scree plot for the state.x77 data, with features scaled to variance
1. Notice the absence of a really sharp break; arguably there is one at 5.


Figure 7: PCA plot when latitude and longitude are added as variables.


The first principal component is still nearly (but not quite) the life-expectancy-and-frost/illiteracy axis. But it's also nearly the latitude/illiteracy axis. (The
correlation between frost days and latitude is, not surprisingly, stronger than
the correlation between latitude and life expectancy, but the latter two have more
similar projections on to the first two components!) The murder rate still strongly
projects on to this axis, with the opposite sign from latitude, as one might expect. Longitude is measured in degrees east of the prime meridian, so for the US it is negative and
larger (less negative) values are more easterly; we see that there is a negative relationship
between longitude and area: eastern states tend to be smaller than western
states. This is now the second principal component.
One last tweak before setting PCA down. We might suspect that some of
the features here aren't related to population or area on their own, but rather
to population density, people per square mile. Population density is a
nonlinear transformation of two of the features, so it can't really be faked with
PCA. (Exercise: Why not?) However, nothing stops us from adding it on our own
(Figure 8).
> states.density = cbind(state.x77,
    state.x77[,"Population"]/state.x77[,"Area"],
    state.center$x,state.center$y)
> colnames(states.density) = c(colnames(state.x77),"Density","Longitude",
    "Latitude")
> biplot(prcomp(states.density,scale.=TRUE),cex=c(0.5,0.75))
The first component hasn't changed very much, but now the second component
is very nearly density! In fact, because the first component is very nearly latitude
and the second longitude, this is very much like a rotated version of the state
map we started with. (Give the PCA plot a quarter-turn clockwise.) It's not
quite the same (for example, Hawaii is closer to the center than California or
Texas!), but it's actually a pretty good match.
Notice that in all our principal component plots, the frost and murder
features have pointed in more or less opposite directions. When we add
in latitude and longitude, murder is very strongly negatively related to latitude, i.e., more southerly states tend to have a higher murder rate. This is
interesting, but it would be more interesting if we could say something about
why.
Presumably murders do not cause an absence of frost,³ but maybe hot
weather makes people lose their tempers? If so, we might expect murder rates
to rise across the country with climate change. On the other hand, we can
randomize the frost values and see what happens.
> states.density.randfrost = states.density
> states.density.randfrost[,"Frost"] = sample(states.density[,"Frost"])
> biplot(prcomp(states.density.randfrost,scale.=TRUE),cex=c(0.5,0.8))
² It may especially suggest this if you were brought up just below the Mason-Dixon line,
while being taught that the Civil War was treason in defense of slavery.
³ But see Frazer (1922).



Figure 8: PCA plot including the additional variable of density (inhabitants
per square mile). Compare this to the map. Note that the placement of
points here looks fairly random, i.e., the two components are closer to being
independent (though they're still not).


This makes a copy of the states.density array, randomly permutes
its Frost column, and repeats the PCA. The R function for sampling from a
vector is sample; by default it samples without replacement, with
the sample size as large as the original vector, which gives a random permutation of
the vector. (See help(sample).) We could also randomize by, say, putting in
Gaussian values with the same mean and variance, but this is just as easy and
does not commit us to distributional assumptions.
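For concreteness, the Gaussian alternative might look like the following sketch (my own illustration, not code from the original notes):

# Replace Frost with Gaussian noise having the same mean and variance
states.density.gaussfrost <- states.density
states.density.gaussfrost[,"Frost"] <- rnorm(nrow(states.density),
  mean=mean(states.density[,"Frost"]), sd=sd(states.density[,"Frost"]))
biplot(prcomp(states.density.gaussfrost,scale.=TRUE),cex=c(0.5,0.8))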
What the biplot shows is that randomizing frost doesn't actually have much
effect on the first two components, and in particular the homicide rate remains
very close to PC1.
Maybe currently-hot places have some other characteristic, not so obvious
from the data set, which actually affects homicide?⁴ These are the kinds of
questions which may be suggested by a method like PCA, but which it can't
actually answer.

⁴ For an interesting and largely-persuasive conjecture about why murder rates are higher in
the South, see Nisbett and Cohen (1996).


Figure 9: PCA plots with the Frost variable randomly permuted, for three different
permutations. Note that it typically has a substantial projection on to PC1 and PC2;
it's bimodal, so it's easy for fluctuations to make it look reasonably correlated with
other variables.


Factor Analysis

Now let's do some factor analyses. The R command for maximum-likelihood
factor estimation is factanal.
We'd like to make plots similar to the ones we've made for PCA. Unfortunately, the output of factanal is not structured in the way that biplot likes.
Fortunately, it's not too hard to bridge them (code in Figure 10).
We'd also like to make scree plots (Figure 11).
Now let's let rip. We'll start with a single factor.
> factanal(states.density,factors=1)
Call:
factanal(x = states.density, factors = 1)
Uniquenesses:
Population     Income Illiteracy   Life Exp     Murder    HS Grad      Frost
     0.959      0.768      0.186      0.545      0.358      0.505      0.512
      Area    Density  Longitude   Latitude
     0.999      0.998      0.976      0.347

Loadings:
           Factor1
Population -0.202
Income      0.481
Illiteracy -0.902
Life Exp    0.674
Murder     -0.801
HS Grad     0.704
Frost       0.699
Area
Density
Longitude  -0.153
Latitude    0.808

               Factor1
SS loadings      3.847
Proportion Var   0.350

Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is 216.17 on 44 degrees of freedom.
The p-value is 1.42e-24
Let's break this down. We're looking for a single factor which summarizes
as much of the correlation among the variables as possible.

# Make a biplot from the output of factanal
# Presumes: fa.fit is a fitted factor model of the
#           type returned by factanal
#           fa.fit contains a scores object
#           fa.fit has at least two factors!
# Inputs: fitted factor analysis model, additional
#         parameters to pass to biplot()
# Side-effects: Makes biplot
# Outputs: None
biplot.factanal <- function (fa.fit,...)
{
  # Get the first two columns of scores, i.e.,
  # scores on first two factors
  x = fa.fit$scores[,1:2]
  # Get the loadings on the first two factors
  y = fa.fit$loadings[,1:2]
  biplot(x,y,...)
}
Figure 10: Function for making biplots from the output of factanal. If I were
a real programmer, I'd add in tests for the presumptions, and tell R to run this
function when biplot is called with an input of class factanal. But I'm not.
# Make scree (eigenvalue-magnitude) plots from
# the output of factanal()
# Input: fitted model of class factanal,
#        x-axis label (default "factor"),
#        y-axis label (default "eigenvalue"),
#        graphical parameters to pass to plot()
# Side-effects: Plots eigenvalues vs. factor number
# Output: None
screeplot.factanal <- function(fa.fit,xlab="factor",ylab="eigenvalue",...) {
  # sum-of-squares function for repeated application
  sosq <- function(v) {sum(v^2)}
  # Get the matrix of loadings
  my.loadings <- as.matrix(fa.fit$loadings)
  # Eigenvalues can be recovered as sum of
  # squares of each column
  evalues <- apply(my.loadings,2,sosq)
  plot(evalues,xlab=xlab,ylab=ylab,...)
}
Figure 11: Scree-plot function for objects of class factanal.



Figure 12: Plot of the factor scores for a one-factor analysis by geographic
location.
What we find is something with a strong positive relationship with latitude: specifically, a one-standard-deviation increase in the factor goes with a 0.808 standard-deviation increase in latitude. There is
also a positive relationship with high-school graduation rates, income, and frost,
though those are all weaker. It has a strong negative relationship with illiteracy
(-0.902) and homicide (-0.801), and a weaker negative one with population
and longitude. (Geographically, there is less of this factor in the southeast than
elsewhere; see Figure 12.)
The size of the eigenvalue for this factor is 3.847. This is the amount of
variance in the (standardized) features which is summarized by the factor. The
total variance is just equal to the number of features (Exercise: why?), so
we're capturing 3.847/11 = 0.350 of the variance.
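A quick sanity check (my addition, not in the original notes): since the features are standardized, each variable's squared loading plus its uniqueness should come out to approximately 1:

# Squared loading + uniqueness should be approximately 1 for every variable
fa1 <- factanal(states.density, factors=1)
round(fa1$loadings[,1]^2 + fa1$uniquenesses, 3)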
Finally, the last part of the output is a simple goodness-of-fit test. (The lattice code below plots the factor scores by geographic location, as in Figure 12.)
17

library(lattice) # For more graphics commands!
state.g = cbind(state.center$x,
                state.center$y,
                factanal(states.density,factors=1,scores="regression")$scores)
colnames(state.g) = c("Longitude","Latitude","G")
levelplot(state.g[,"G"]~state.g[,"Longitude"]*state.g[,"Latitude"],
          xlab="Longitude",ylab="Latitude")
# See help(levelplot) for more

The test is only reliable if we assume that everything is Gaussian and that the
parameters are estimated by maximum likelihood. The factor model is then a
restriction of the general multivariate Gaussian model, for which the maximum
likelihood estimate is just the empirical covariance matrix. What we know
from general statistical theory is that twice the log-likelihood ratio between an
unrestricted model and a restricted model,

    R = 2L(θ̂_unrestricted) - 2L(θ̂_restricted)                    (1)

has a χ² distribution if the restricted model is right and both are estimated
by maximum likelihood. The number of degrees of freedom is the number of
extra parameters which are free in the unrestricted model. The actual calculations here are tedious, so it's better to have the computer do them. Basically,
it amounts to comparing the correlation matrix we'd expect under the fitted
model with the one we actually observe.
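The pieces of the test quoted above can also be pulled straight out of the fitted object; a small sketch (my addition, not from the notes):

# Chi-square statistic, degrees of freedom, and p-value from the fitted model
fa1 <- factanal(states.density, factors=1)
fa1$STATISTIC  # chi-square statistic
fa1$dof        # degrees of freedom
fa1$PVAL       # p-value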
The upshot in this case is that the fit is horrible. So much, then, for the one-factor model. How about two?
The plotting command is
biplot.factanal(factanal(states.density,factors=2,scores="regression"),
cex=c(0.5,0.8))
which delivers Figure 13.
This isn't too far from the principal components plot, but it's also not the
same. How are we doing on model fitting?
               Factor1 Factor2
SS loadings      3.886   1.962
Proportion Var   0.353   0.178
Cumulative Var   0.353   0.532

Test of the hypothesis that 2 factors are sufficient.


The chi square statistic is 144.43 on 34 degrees of freedom.
The p-value is 1.45e-15
Again, horrible, though we're now accounting for over half the variance.
What about three?
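The call for the three-factor fit is not shown in the original; presumably it is the obvious analogue of the two-factor command, something like:

# Presumed three-factor analogue of the earlier plotting command
biplot.factanal(factanal(states.density,factors=3,scores="regression"),
  cex=c(0.5,0.8))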



Figure 13: Biplot of the two-factor model.



Figure 14: Biplot for the three-factor model, showing scores for the two largest
factors.


               Factor1 Factor2 Factor3
SS loadings      3.819   2.098   1.265
Proportion Var   0.347   0.191   0.115
Cumulative Var   0.347   0.538   0.653

Test of the hypothesis that 3 factors are sufficient.


The chi square statistic is 99.02 on 25 degrees of freedom.
The p-value is 9.17e-11
Notice that adding more factors causes the loadings on the first two
factors, and hence the arrows in the biplot, to move around. This doesn't
happen with PCA.
Let's go all the way to the largest number of factors that we can fit (Figures
15 and 16).
> states.density.fa6 <- factanal(states.density,factors=6,scores="regression")
> biplot.factanal(states.density.fa6,cex=c(0.5,0.75))
> screeplot.factanal(states.density.fa6,type="b")
The fit is finally not bad:
               Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
SS loadings      2.411   1.920   1.603   1.501   1.074   0.986
Proportion Var   0.219   0.175   0.146   0.136   0.098   0.090
Cumulative Var   0.219   0.394   0.539   0.676   0.774   0.863

Test of the hypothesis that 6 factors are sufficient.


The chi square statistic is 5.11 on 4 degrees of freedom.
The p-value is 0.276
This said, it's not right to just take this p-value naively. We have done a whole
bunch of hypothesis tests, and the more tests you do at a fixed size, the more
likely you are to get a false positive. So we should really do some kind of
control for the multiple tests. We will return to this issue later.
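As a rough illustration of what such a control might look like (my own sketch, not part of the notes), one could Bonferroni-adjust the goodness-of-fit p-values from all six fits:

# Goodness-of-fit p-values for 1 through 6 factors, Bonferroni-adjusted
pvals <- sapply(1:6, function(k) factanal(states.density, factors=k)$PVAL)
p.adjust(pvals, method="bonferroni")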
Notice also that this goodness-of-fit test is not checking many of the background assumptions of the factor-analysis model: that the data have a Gaussian
distribution, for example, or that the observations are IID samples. (Just how
plausible is it that Oklahoma is completely independent of Texas, or Connecticut
of New York?)
Notice also that we haven't even begun to address the issue of rotations, i.e.,
of applying some transformation to the factors which doesn't change their observable implications, but might make things like Figure 15 easier to understand.
(This is controlled by the rotation argument to factanal.)
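For instance (a sketch I am adding, not from the notes), one can compare the default varimax rotation with no rotation at all and watch the loadings change:

# Same two-factor model, with and without the default varimax rotation
factanal(states.density, factors=2, rotation="varimax")$loadings
factanal(states.density, factors=2, rotation="none")$loadings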



Figure 15: Biplot with six factors.


Figure 16: Scree plot for six factors.


References

Frazer, James George (1922). The Golden Bough: A Study in Magic and Religion. London: Macmillan, abridged edn. URL http://www.gutenberg.org/etext/3623.

Nisbett, Richard E. and Dov Cohen (1996). Culture of Honor: The Psychology of Violence in the South. Boulder, Colorado: Westview Press.

