
Sparse Principal Component Analysis with the R package spca

Giovanni Maria Merola
RMIT International University Vietnam
E-mail: lsspca@gmail.com

Abstract
Sparse Principal Component Analysis (SPCA) aims to find principal components with few non-zero loadings. The R package spca implements our Least Squares Sparse Principal Component Analysis method (LS SPCA; Merola 2014). The resulting sparse solutions are uncorrelated and minimise the Least Squares criterion subject to sparsity requirements. Both of these features are enhancements over existing SPCA methods. Locally optimal solutions can be found by a branch-and-bound (BB) algorithm. Recognising that sparsity is not the only requirement for simplicity of interpretation, we also implemented a backward elimination algorithm (BE) that computes sparse solutions with large loadings. This algorithm can be run without specifying the number of non-zero loadings in advance. It can also be required that the components explain a minimum amount of variance and that they are combinations of only a subset of the variables. Hence, the BE algorithm can be used as an exploratory data analysis tool. The package also contains utilities for printing the solutions and their summary statistics, and for plotting and comparing the results. We give a thorough example of an application of SPCA to a small dataset.

Keywords: SPCA, Uncorrelated Components, branch-and-bound, Backward Elimination.

1. Introduction
Sparse Principal Component Analysis (SPCA) is a variant of Principal Component Analysis (PCA) in which the components computed are restricted to be combinations of only a few of the variables. In other words, the majority of the weights that define the combinations, called loadings, are zero. The number of non-zero loadings of a component is called its cardinality or $L_0$ norm. The advantage of sparse components is that they are more readily interpretable. For example, a component from a set of IQ test scores equal to ($\frac{1}{2}$ Inductive reasoning $+ \frac{1}{2}$ Deductive reasoning) can easily be interpreted as the Logic skills of a person.
Traditionally, loadings are simplified by thresholding, that is, by discarding those smaller than a threshold (hereafter, for simplicity, we speak of the size of the loadings meaning the absolute value of the non-zero ones). Since this practice is potentially misleading (Cadima and Jolliffe 1995), a number of algorithmic SPCA methods have been proposed.
SPCA methods determine the sparse components by maximising their variance subject to cardinality constraints (Moghaddam, Weiss, and Avidan 2006). This objective function is taken
from a property of the PCs which is irrelevant for summarising the information contained in
the data (ten Berge 1993). In some of the methods the objective function is derived from

Sparse Principal Component Analysis with the R package spca

a Lasso Regression type LS optimisation (e.g. Zou) but in others the derivations given are
not convincing. Furthermore, most of the SPCA methods produce correlated components. Correlated components are more difficult to interpret, and the variances that they explain do not sum to the total variance explained.
In a recent paper (Merola 2014) we suggest improving over existing SPCA methods by deriving
sparse components that are uncorrelated and minimise the LS criterion, like the PCs. We
refer to this approach as LS SPCA. The solutions are obtained from constrained Reduced
Rank Regression (Izenman 1975).
The problem common to all SPCA methods is finding the set of indices that optimises the objective function. This mixed integer problem is non-convex and NP-hard, hence computationally intractable (Moghaddam et al. 2006). We do not explore this computational aspect and suggest a simple greedy branch-and-bound search, LS SPCA(BB), adapted from Farcomeni (2009), which is useful for finding solutions for medium size datasets.
Originally SPCA was designed to replace thresholding, that is, to compute components with only a few large loadings. However, this is not guaranteed by the cardinality constraints. For this reason we developed a backward elimination algorithm (BE) that iteratively eliminates small loadings from a solution until only loadings larger than a threshold are left. Considering that the size of the loadings is not the only feature relevant to interpretability, the elimination process can also be stopped when a minimal cardinality or a maximal variance loss is reached.
The BE algorithm represents a departure from the LS optimality path, but this should be compensated by the user needs it satisfies. In fact, BE breaks away from the inflexibility of the LS criterion by finding sub-optimal solutions with defined characteristics, foremost the size of the loadings. In our studies we found that the BE solutions compare well with the BB ones in terms of variance explained. In our analyses we often use LS SPCA(BE) as an exploratory data analysis tool with which we compare solutions with different features.
In this paper we present the R (R Core Team 2013) package spca, which contains functions for
computing LS SPCA solutions. In the package we implemented the BB and BE algorithms.
BE is computationally much simpler and faster than BB. We solved problems with up to 700
variables in reasonable time, as we show in Merola (2014). This should be enough to satisfy
a large community of researchers.
The spca package also contains utilities for plotting, printing, summarising and comparing
different solutions. This paper illustrates its usage through examples. Computational details will be discussed sparingly.
The original paper proposing LS SPCA is still unpublished because reviewers seem to ignore that the original objective of PCA is the maximisation of the variance explained. Journal editors also seem to share the same lack of knowledge and to approve biased reviews, likely from authors of different SPCA methods. For example, Dr. Qiu, the chief editor of Technometrics, accepted a report which did not discuss the content at all but rejected the paper only because I had allegedly ignored an article (by A. d'Aspremont et al.) which, instead, was cited in the references and whose results were compared with mine in the paper. He also accepted another report from a referee (who could barely write in English) who called the LS criterion "a new measure used ad hoc". Dr. Qiu rejected the paper on those grounds, adding the reason that the algorithm is not scalable. Since computational efficiency is not in the journal's aims and scope, this also was unfair. The (kind) email of complaint I sent to Dr. Qiu went unanswered. This is just to testify how frustrating publishing can sometimes be, thanks to unfair resistance from other fellow academics.


The paper is organised as follows: in the next section we briefly outline the LS SPCA solutions; in Section 3 we outline the BB and BE algorithms; in Section 4 we describe the methods available for spca objects; in Section 5 we illustrate the different options available in spca on a public dataset of baseball statistics, which is also included in the package; in Section 6 we give some concluding remarks.

2. Optimal solutions
Given a matrix of $n$ observations on $p$ mean-centred variables, $X$ ($n \times p$), PCA finds the rank-$d$ ($d \leq p$) approximation of the data, $\hat{X}_{[d]}$, that minimises the LS criterion. It is easy to show that the approximation must be of the form $\hat{X}_{[d]} = XAP'$, where $A$ ($p \times d$) is the matrix of loadings and $P$ ($p \times d$) a matrix of coefficients. The matrix $T = XA$ ($n \times d$) defines the PCs.

The PCs can be constrained to be uncorrelated without loss of optimality because of the extra sum of squares principle. Hence, the solutions are the solutions to:

$$A = \mathop{\arg\min}_{\substack{A \in \Re^{p \times d} \\ \mathrm{Rank}(\hat{X}_{[d]}) = d}} \|X - \hat{X}_{[d]}\|^2 = \arg\max \|\hat{X}_{[d]}\|^2 = \mathop{\arg\max}_{a_j \in \Re^{p}} \sum_{j=1}^{d} \frac{a_j' S S a_j}{a_j' S a_j} \qquad (1)$$

$$\text{subject to } a_j' S a_k = 0, \quad j \neq k,$$

where $S = \frac{1}{n} X'X$ ($p \times p$) denotes the sample covariance matrix and the summation in the last term derives from the uncorrelatedness of the components. The problem is completely identified by the loadings because $P' = (T'T)^{-1} T'X$ and the rank-$d$ approximation is equal to $\hat{X}_{[d]} = XA (A'X'XA)^{-1} A'X'X$.
It follows that the total variance explained, $\|\hat{X}_{[d]}\|^2$, can be broken down into the sum of the variances explained by each component, $Vexp(t_j) = a_j' S S a_j / a_j' S a_j$. Hence, the PCA problem can be equivalently solved by maximising the individual $Vexp(t_j)$, that is as:

$$a_j = \mathop{\arg\max}_{a_j \in \Re^{p}} \frac{a_j' S S a_j}{a_j' S a_j}, \quad j = 1, \ldots, d \qquad (2)$$

$$\text{subject to } a_j' S a_k = 0, \quad j \neq k.$$

Note that this problem is invariant to the scale of the loadings.
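This scale invariance is easy to verify numerically. The following short R check is illustrative only: the simulated data and the helper function vexp are ours and not part of the package.

# illustrative check of the scale invariance of the objective in Equation (2)
set.seed(1)
X <- scale(matrix(rnorm(100 * 4), 100, 4), scale = FALSE)  # mean-centred data
S <- crossprod(X) / nrow(X)                                # S = X'X / n
vexp <- function(a, S) drop((t(a) %*% S %*% S %*% a) / (t(a) %*% S %*% a))
a <- c(1, -1, 0.5, 0)
all.equal(vexp(a, S), vexp(10 * a, S))                     # TRUE: rescaling a changes nothing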

2.1. PCA solutions

It is well known that the optimal PCA loadings are proportional to the eigenvectors of $S$, $\{v_j,\ j = 1, \ldots, d\}$, corresponding to the $d$ largest eigenvalues taken in non-increasing order, $\{\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_d\}$. Therefore the variance explained simplifies to:

$$Vexp(t_j) = \max Vexp(t_j) = \frac{v_j' S S v_j}{v_j' S v_j} = \frac{v_j' S v_j}{v_j' v_j} = \lambda_j.$$

It follows that the variance explained by each PC is equal to the corresponding eigenvalue and that the loadings can be taken to be orthonormal, so that the variance explained is also equal to the variance of the PCs. Consequently, PCA has been popularised as the solution to this simpler problem:

$$a_j = \mathop{\arg\max}_{a_j \in \Re^{p}} a_j' S a_j, \quad j = 1, \ldots, d \qquad (3)$$

$$\text{subject to } a_j' a_k = \delta_{jk},$$

where $\delta_{jk}$ is the Kronecker delta. ten Berge (1993, page 87) warns about this formulation of the PCA problem by stating: "Nevertheless, it is undesirable to maximize the variance of the components rather than the variance explained by the components, because only the latter is relevant for the purpose of finding components that summarize the information contained in the variables."
PCA can be computed in R with different functions. We implemented the function pca, which returns an spca object using a simple eigendecomposition:

pca(S, nd, only.values = FALSE, screeplot = FALSE, kaiser.print = FALSE)

This function returns the first nd eigenvectors of the correlation matrix S. Setting the flag only.values = TRUE returns only the eigenvalues (useful, and much faster, if only the variance explained is needed). Setting screeplot = TRUE produces a screeplot of the eigenvalues, and if kaiser.print = TRUE the number of eigenvalues larger than one is printed and returned. This is the Kaiser rule for deciding how many PCs to include in the model.

2.2. Least Squares Sparse PCA solutions

The LS SPCA problem is obtained by constraining the cardinality of the loadings in Problem (1), which gives:

$$A = \mathop{\arg\min}_{A \in \Re^{p \times d}} \|X - XAP'\|^2 = \mathop{\arg\max}_{A \in \Re^{p \times d}} \sum_{j=1}^{d} \frac{a_j' S S a_j}{a_j' S a_j} \qquad (4)$$

$$\text{subject to } L_0(a_j) \leq c_j \text{ and } a_j' S a_k = 0, \ j \neq k,$$

where $c_j < p$ are the maximal cardinalities allowed.
Under cardinality constraints $Vexp(t_j)$ no longer simplifies to $(a_j' S a_j)/(a_j' a_j)$ and the solutions must be obtained by maximising the individual $Vexp$ in Equation (2) directly. Details of the derivation of these solutions can be found in Merola (2014).

The sparse components are combinations of the variables with indices in $ind_j$, $\{j = 1, \ldots, d\}$, which we denote with the matrices $W_j$ ($n \times c_j$). So, we can write the sparse components as $t_j = W_j \tilde{a}_j$, where the $\tilde{a}_j$ are the vectors of dimension $c_j$ containing only the non-zero loadings. Then Problem (4) becomes:

$$\tilde{a}_j = \mathop{\arg\min}_{\tilde{a}_j \in \Re^{c_j}} \|X - W_j \tilde{a}_j p_j'\|, \quad j = 1, \ldots, d \qquad (5)$$

$$\text{subject to } T_{(j)}' W_j \tilde{a}_j = 0, \ j > 1,$$
where $T_{(j)}$ is the matrix containing the first $(j-1)$ components ($T_{(1)} = 0$). Let $J_j$ ($p \times c_j$) be the matrices formed by the columns of the $p$-dimensional identity matrix with indices in $ind_j$; then we can write $W_j = X J_j$ and the full sparse loadings as $a_j = J_j \tilde{a}_j$. With this notation the SPCA Problem (5) can be written as

$$\mathop{\arg\min}_{\tilde{a}_j \in \Re^{c_j}} \|X - W_j \tilde{a}_j p_j'\| = \mathop{\arg\max}_{\tilde{a}_j \in \Re^{c_j}} \frac{\tilde{a}_j' J_j' S S J_j \tilde{a}_j}{\tilde{a}_j' J_j' S J_j \tilde{a}_j} = \mathop{\arg\max}_{a_j = J_j \tilde{a}_j} \frac{a_j' S S a_j}{a_j' S a_j}, \quad j = 1, \ldots, d \qquad (6)$$

$$\text{subject to } R_j \tilde{a}_j = 0, \text{ for } j > 1,$$

where $R_j = A_{(j)}' S J_j$ defines the uncorrelatedness constraints, with $A_{(j)}$ being the first $(j-1)$ loadings. Hence the variance explained by the sparse PCs is the same $Vexp$ defined in Equation (2).
Problem (6) can be seen as a series of constrained rank-one Reduced-Rank Regression problems (Izenman 1975) where the regressors are the columns of the $W_j$ matrices. It is well known that the first solution is the eigenvector $\tilde{a}_1$ satisfying:

$$(W_1' W_1)^{-1} W_1' X X' W_1 \tilde{a}_1 = \lambda_{max} \tilde{a}_1, \qquad (7)$$

where $\lambda_{max}$ is the largest eigenvalue. This solution is unique as long as the variables in $W_1$ are not multicollinear. Hereafter we exclude the possibility that a matrix $W_j$ is not of full column rank, because that set of variables should be discarded and a full rank one sought. The sparse loadings can be computed also if only the covariance matrix $S$ is known. In fact, Equation (7) can be written as:

$$D_1^{-1} J_1' S S J_1 \tilde{a}_1 = \lambda_{max} \tilde{a}_1, \qquad (8)$$

where $D_j = W_j' W_j = J_j' S J_j$ denotes the covariance matrices of the variables with indices in $ind_j$, which are invertible by the full rank assumption.
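As a concrete illustration of Equation (8), the first sparse loading can be computed from S alone with a few lines of R. This is only a minimal sketch under our own naming (the function first_sparse_loading and the unit-norm scaling are not part of the package):

# sketch of Equation (8): leading eigenvector of D1^{-1} J1' S S J1, with D1 = J1' S J1
first_sparse_loading <- function(S, ind1) {
  J1 <- diag(nrow(S))[, ind1, drop = FALSE]   # columns of the p x p identity matrix
  D1 <- crossprod(J1, S %*% J1)               # D1 = J1' S J1
  M  <- solve(D1, crossprod(J1, S %*% S %*% J1))
  a1 <- Re(eigen(M)$vectors[, 1])             # the eigenvalues here are real and non-negative
  a1 / sqrt(sum(a1^2))                        # the scale is arbitrary: normalise to unit L2 norm
}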
The following solutions can be found by applying the uncorrelatedness constraints to the RRR Problems (6) as in constrained multiple regression (e.g., see Rao and Toutenburg 1999, or Magnus and Neudecker 1999, Th. 13.5, for a more rigorous proof). The solutions are the eigenvectors satisfying:

$$C_j D_j^{-1} J_j' S S J_j \tilde{a}_j = \lambda_{max} \tilde{a}_j, \qquad (9)$$

where $C_j = I_{c_j} - D_j^{-1} R_j' (R_j D_j^{-1} R_j')^{+} R_j$, with $C_1 = I_{c_1}$, and the superscript $+$ denotes a generalized inverse. The solutions exist because $R_j$ spans the space of $W_j$. In this derivation we assume that $R_j' R_j$ is singular, since otherwise $R_j \tilde{a}_j = 0$ could never be satisfied. This means that uncorrelatedness can only be achieved if the cardinalities satisfy $c_j \geq j$. The LS SPCA solutions can be computed from the leftmost eigenvector, $b_j$, of the symmetric matrices $(C_j D_j^{-1})^{\frac{1}{2}} J_j' S S J_j (C_j D_j^{-1})^{\frac{1}{2}}$, as $\tilde{a}_j = (C_j D_j^{-1})^{\frac{1}{2}} b_j$.

The above derivation shows that the loadings of the sparse components that explain the most variance are not eigenvectors of submatrices of the covariance matrix, and that the variance of these components is no longer equal to the variance that they explain.
As mentioned above, the uncorrelatedness constraints require that the cardinality of a component is not less than its order. In some situations it may be desirable to compute correlated components with lower cardinality. In this case the objective is the maximisation of the extra variance explained by the $j$-th component, that is, the variance explained by its complement orthogonal to the previous components, defined by $t^{\perp}_j = Q_j t_j = X_j b_j$, where $Q_j = (I - T_{(j)} (T_{(j)}' T_{(j)})^{-1} T_{(j)}')$, $X_j = Q_j X$ are the residuals of $X$ orthogonal to the first $(j-1)$ components and $b_j$ are the loadings relative to the residual matrix. On substituting $t^{\perp}_j$ into Problem (6), it is simple to show that the extra variance explained by this component is:

$$\frac{b_j' S_j S_j b_j}{b_j' S_j b_j}, \qquad (10)$$

where $S_j = X_j' X_j$ is the covariance matrix of the residuals. However, in this formulation of the problem it is not possible to impose cardinality constraints on the loadings of the original variables, $a_j$. One possible way around this problem is to require that the correlated components explain as much as possible of the variance of the residuals $X_j$. That is, the loadings can be obtained by solving the problem:

$$a_j = \mathop{\arg\min}_{\tilde{a}_j \in \Re^{c_j}} \|X_j - W_j \tilde{a}_j p_j'\|, \quad j = 1, \ldots, d$$

While correctly used in iterative PCA algorithms, such as the Power Method and NIPALS (Wold 1966), without uncorrelatedness constraints this approach does not maximise the extra variance explained in Equation (10), but the approximation

$$\frac{a_j' S_j S_j a_j}{a_j' S a_j}. \qquad (11)$$

The loadings of the correlated components are given by the eigenvectors satisfying:

$$D_j^{-1} J_j' S_j S_j J_j \tilde{a}_j = \lambda_{max} \tilde{a}_j. \qquad (12)$$

When needed, we will refer to these solutions as Least Squares Correlated Sparse Principal Component Analysis (LS CSPCA).

The LS SPCA solutions can be computed with the function:

spca(S, ind, unc = TRUE)

This function takes as arguments a covariance or correlation matrix S, a list of indices ind of the sparse loadings, and the flag unc indicating whether each component should be uncorrelated with the preceding ones or not. The number of components to compute is determined from the length of ind. The function returns an spca object with the loadings, the variance explained and other items. See the documentation for details.
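For example, a hypothetical call requiring two uncorrelated components, the first combining variables 1, 3 and 5 and the second variables 2 and 4 (the index sets are chosen purely for illustration), would be:

smpc <- spca(S, ind = list(c(1, 3, 5), c(2, 4)), unc = TRUE)
smpc$vexp  # variance explained by each sparse component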

3. Finding a solution
Finding the optimal set of indices for the SPCA problem is a nonconvex mixed integer linear
program, which is NP-hard, hence computationally intractable (Moghaddam et al. 2006).
We consider two greedy approaches for its solution: a branch-and-bound algorithm (BB)
analogous to that proposed in Farcomeni (2009) for the maximisation of the variance and the
Backward Elimination (BE), which departs from the optimisation of the variance explained
to achieve large loadings.
In the following we illustrate the basic functioning of these algorithms and the different options available. In our implementation we allow some arguments to take fewer values than necessary by assigning the last value passed to the missing ones. So, for example, passing thresh = c(0.2, 0.1) will assign the value 0.2 to the first component and 0.1 to all the others.
Further details can be found in the package documentation.

3.1. Branch-and-bound
Branch-and-bound is a well-known method for speeding up the exhaustive search of all possible solutions to a problem by excluding subsets of variables that will certainly give worse results than the current one. In our case, sets of variables that explain less variance than the current optimum are eliminated altogether from the search. The convergence to the optimum is certain because eliminating a variable from a set of regressors cannot yield a larger regression sum of squares. Details on the implementation can be found in Farcomeni (2009), who uses a different objective function. The solutions are found sequentially for each component; therefore the solution may not be a global optimum for a given number of components.
In our implementation of the BB algorithm we added the possibility of constraining the
variables that can enter the solutions and of excluding variables that have been used for
previous solutions. The BB algorithm can be called with the function spcabb:
spcabb(S, card, startind, unc = TRUE, excludeload = FALSE, nvexp = FALSE)

where S is a correlation matrix and card a vector with the cardinalities of the solutions. The number of components to compute is determined by the length of card. A list of starting indices for each component can be passed as a list argument [ startind = list() ]. The variables used for previous components can be excluded from the following ones [ excludeload = TRUE ]. The objective function can be set to be the actual variance explained, Equation (10) [ nvexp = TRUE ], or the approximated one, Equation (11). The usage of the actual vexp is not advised.
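For instance, a hypothetical call computing three components of cardinality 4, 4 and 3, with each component built only from variables not used by the previous ones, would be:

bb <- spcabb(S, card = c(4, 4, 3), excludeload = TRUE)
summary(bb)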

3.2. Backward Elimination algorithm


BE iteratively eliminates the smallest loading from a solution and recomputes the component without that variable until a stop rule is met. We call this procedure trimming. We
implemented BE in the package spca adding options that accommodate different aspects of
interpretability. BE is called with the function spcabe:
spcabe(S, nd = FALSE, thresh = FALSE, ndbyvexp = FALSE, perc = TRUE,
unc = TRUE, startind = NULL, mincard = NULL, excludeload = FALSE,
threshvar = FALSE, threshvaronPC = FALSE, trim = 1, reducetrim = TRUE,
diag = FALSE, eps = 1e-04)
The function takes a covariance or correlation matrix [ S ] as first argument. The number of components to compute [ nd ] can be either specified directly or decided by a stop rule defined by the percentage of cumulative variance explained (PCVE) reached [ ndbyvexp = real in [0, 1] ]. If no stopping rule is specified, all components are computed.
There are three stop rules for trimming applicable to each component:
cardinality The minimal cardinality of the loadings [ mincard ];


loss of variance explained The maximum acceptable loss of variance explained. This can be computed with respect to the cumulated variance explained either by the same number of PCs [ threshvaronPC ] or by the component before trimming [ threshvar ] (both arguments must be real in [0, 1]);

threshold This is the minimal absolute value required for a loading [ thresh ]. The threshold can be specified either with respect to the loadings scaled to unit L2 norm or with respect to the percentage contributions (scaled to unit L1 norm) [ perc = TRUE ].

The stop rules for trimming are given in order of precedence and can take a different value for each component. The stop rules are all optional; if none is given, the minimal cardinality is set equal to the order of the component if the components have to be uncorrelated, and to 1 otherwise.
In problems with a large number of variables the computation can be sped up by trimming more than one loading at a time [ trim = integer > 1 ]. When the number of loadings left is less than the number of loadings to trim, trimming stops. However, more accurate solutions can be obtained by finishing off the elimination by trimming the remaining loadings one by one [ reducetrim = TRUE ].
The algorithm by default computes components that are uncorrelated with the previous ones. However, one or more components can be computed without this requirement, as given in Equation (12) [ unc = FALSE ].
The components can be constrained to be combinations of only a subset of the variables with
two options:

starting indices A list containing the indices from which trimming must start for each
component [ startind = list() ];

exclude indices used With this flag the next components are trimmed starting from only the indices of the variables that were not included in the previous loadings [ excludeload = TRUE ].

The standard output is an spca object containing the loadings, the variance explained and a few other items. A richer output containing diagnostic information can be obtained by setting [ diag = TRUE ]. The value under which a loading is considered to be zero can be changed from the default $10^{-4}$ with the argument [ eps = real > 0 ].
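For example, a hypothetical call requiring percentage contributions of at least 25% for the first component and 15% for the following ones, and computing components until 80% of the total variance is explained, would be:

be <- spcabe(S, thresh = c(0.25, 0.15), perc = TRUE, ndbyvexp = 0.8)
summary(be)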
The BE algorithm is outlined in Algorithm 1. Not all options are shown.


Algorithm 1 LS SPCA(BE)
initialize
    Stopping rules for the number of components:
        nd {the number of components to compute}
        ndbyvexp {optional, minimum cumulated variance explained}
    Select which variables can enter the solutions:
        startind_j {optional, the starting indices for trimming}
    Stopping rules for elimination (can be different for each component):
        thresh_j {minimum absolute value of the sparse loadings}
        mincard_j >= 1 {minimum cardinality of the sparse loadings}
        threshvar_j {optional, maximum relative loss of variance explained}
end initialize
for j = 1 to nd do
    Compute a_j as the j-th LS SPCA solution for startind_j
    Vexpfull_j = Vexp(a_j)
    while min_{i in startind_j} |a_ij| < thresh_j and length(startind_j) > mincard_j do
        indold_j = startind_j, aold_j = a_j
        find k such that |a_kj| <= |a_ij| for all i in startind_j
        startind_j = startind_j \ k
        Compute a_j as the j-th LS SPCA solution for startind_j
        if 1 - Vexp(a_j)/Vexpfull_j > threshvar_j then
            startind_j = indold_j, a_j = aold_j
            break
        end if
    end while
    if sum_{i=1}^{j} Vexp(a_i) >= ndbyvexp then
        nd = j
        break
    end if
end for
There is no obvious rule for choosing the thresholds, thresh. However, if the loadings are computed to have unit $L_2$ norm, specifying a threshold larger than $1/\sqrt{c}$ will ensure a cardinality lower than $c$: if $k$ loadings all exceeded $1/\sqrt{c}$ in absolute value, their squares would sum to more than $k/c$, which is incompatible with unit norm unless $k < c$. For percentage contributions standardised to unit $L_1$ norm, specifying a threshold larger than $1/c$ will ensure a cardinality lower than $c$, by the same argument. For this reason, the choice of the minimum cardinality and of the threshold must be considered together, and later components require a lower threshold than the first ones. Note that trimming is designed for components computed from correlation matrices. If a covariance matrix is used, different thresholds for each variable should be used.

4. Methods for spca objects


The spca package contains different S3 methods for printing, plotting and comparing sparse solutions. print and plot are generic functions, so they take an spca object as first argument and can be called without the suffix .spca.


print
print.spca prints a formatted matrix of loadings and is called as:
print(smpc, cols, digits = 3, rows, noprint = 0.001, rtn = FALSE,
perc = TRUE, namcomp = NULL)
cols is the number of loadings to print; the number of digits shown is set by the argument digits (this is set to 1 if perc = TRUE); noprint sets the threshold below which a loading is considered zero and is not printed; rtn controls whether the matrix is returned; perc whether the loadings should be printed as percentage contributions; and namcomp is an optional vector of names to show in the header.

showload
Especially when the number of variables is large, loadings can be more readily read using the
showload function which prints the nonzero loadings one by one.
showload(smpc, cols, digits = 3, onlynonzero = TRUE, noprint = 0.001,
    perc = TRUE, rtn = FALSE)
The function takes an spca object as first argument, the number of solutions to print [ cols ],
the number of digits to print [ digits ], a flag indicating whether only the nonzero loadings
should be printed [ onlynonzero ], the value below which a loading is considered zero [ noprint ]
and the flags [ perc ], indicating whether the percentage contributions should be printed and
[ rtn ], indicating whether the output should be returned.

summary
The method summary.spca prints the summary statistics for an spca solution and it is called
as
summary(smpc, cols, rtn = FALSE, prn = TRUE, threshcard = 0.001, perc = TRUE)
The method takes an spca object as first argument, the number of solutions to print [ cols ],
a flag indicating whether to return the matrix of formatted summaries or not [ rtn ], a flag
indicating whether to print the summaries or not [ prn ], the value below which a loading
is considered zero and is not counted in the cardinality [ threshcard ] and a flag indicating
whether to return the percentage contributions [ perc ].

plot
The method plot.spca plots the percentage cumulative variance explained by the sparse
components together with that explained by the corresponding number of PCs. It can also
plot the loadings. It is called as:


plot(smpc, cols, plotvexp = TRUE, plotload = FALSE, perc = TRUE, bnw = FALSE,
    nam, varnam = FALSE, onlynonzero = TRUE, plotloadvsPC = FALSE,
    pcs = NULL, addlabels = TRUE, mfrowload = 1, mfcolload = 1,
    loadhoriz = FALSE)

This method takes an spca object as first argument. cols denotes the number of dimensions to plot; plotvexp and plotload control whether the PCVE and the loadings are plotted; perc whether the loadings should be plotted as percentage contributions. bnw controls whether the PCVE plot is in colour or black and white. nam assigns a name to the results and varnam assigns names to the variables (useful when the names are long). onlynonzero specifies whether only the non-zero loadings must be included in the plot or not. mfrowload and mfcolload set the layout of the plot of the loadings by arranging them in a grid. loadhoriz specifies whether the loadings should be plotted as vertical or horizontal bars. Setting plotloadvsPC = TRUE, the sparse loadings are plotted against the corresponding PCA loadings; a line showing equality with the PC loadings is also shown. The argument addlabels determines whether the nonzero loadings are to be labelled (if not FALSE) and whether short labels (addlabels = TRUE) or the original ones (addlabels = "orig") should be used. This plot inherits the arguments set for the others.

spca.comp
The function spca.comp can be used to compare two or more spca solutions. It produces
comparative plots of the PCVE and of the loadings. It also prints the summaries side by side
and, optionally, the loadings. The function is not implemented as a method to make it easier
to use and it is called as

spca.comp(smpc, nd, nam = FALSE, perc = TRUE, plotvar = TRUE, plotload = FALSE,
prnload = TRUE, shortnamcomp = TRUE, rtn = FALSE, prn = TRUE, bnw = FALSE)

It takes as first argument a list of spca objects, the number of dimensions to compare [ nd ], a vector of names for the different solutions [ nam ] and a flag indicating whether the loadings should be printed as percentage contributions [ perc ]. The flag plotvar indicates whether the PCVE should be plotted with the corresponding PCA one, and the flag plotload whether the loadings should be plotted together; prnload indicates whether the loadings of the different solutions should be printed side by side. The flag shortnamcomp indicates whether the components should be printed using short names (Cx.y), which is useful when there are several components to print; the flag rtn indicates whether a text matrix with the formatted summaries should be returned; prn indicates whether the summaries should be printed; and bnw whether the plots should be in black and white.
The spca.comp function produces different plots and this makes the customisation of each
of them quite cumbersome. Since the primary objective of this package is the computation
of the sparse solutions and the plots are simple to produce, we did not include the possibility
of adding extra graphical parameters in this function.


5. Sparse Principal Components Analysis Example

In this section we obtain different sparse solutions by changing the settings of the BB and BE algorithms. For this we use the dataset baseball (available at StatLib, http://lib.stat.cmu.edu/datasets/baseball.data), which contains observations on 16 performance statistics of 263 US Major League baseball hitters, taken over their careers and the 1986 season. We chose this dataset because it is small but gives varied results. The demonstration is not meant to be a proper analysis of the data (in any case, we are not baseball experts) but a simple example in which we illustrate some of the spca package features.

The results shown are obtained with spca version 1.6.6, which includes this dataset. I firstly developed this package for my own needs, and then thought that I might as well share it with others. Therefore, it is not intended to be applied to large problems, where computational needs are more important than optimality. In Merola (2014) we applied the BE algorithm to problems with more than 700 variables.
First we load the spca package and extract the baseball data. The names of the 16 variables
in the set are also shown.

library("spca")
data(baseball, package = "spca")
colnames(baseball)
 [1] "times at bat 1986"   "hits 1986"
 [3] "home runs 1986"      "runs 1986"
 [5] "runs batted 1986"    "walks 1986"
 [7] "years career"        "times at bat career"
 [9] "hits career"         "home runs career"
[11] "runs career"         "runs batted career"
[13] "walks career"        "put outs 1986"
[15] "assists 1986"        "errors 1986"

PCA

The full PCA solution is computed with the function pca, which produces PCA output of class spca. We request the screeplot and the Kaiser rule (Kaiser 1960), useful for determining the number of components to keep in the model. We also show the first five percentage contributions by calling the S3 spca method print. This method is automatically applied to spca objects with the following options:

b.pca = pca(baseball, nd = 5, screeplot = TRUE, kaiser.print = TRUE)

[Figure 1: Screeplot of the baseball data]


[1] "number of eigenvalues larger than 1 is 3"
# PCA with screeplot and Kaiser rule
print(b.pca, cols = 1:5)
Percentage Contributions
                    Comp1 Comp2 Comp3 Comp4 Comp5
times at bat 1986     5.5  10.4   2.5   3.1   3.5
hits 1986             5.4  10.2   1.8   3.8   4.5
home runs 1986        5.6   6.3 -12.3   9.1 -16.7
runs 1986             5.4  10.1  -2.3   6.9   7.2
runs batted 1986      6.5   8.5  -6.1   6.2 -10.3
walks 1986            5.8   6.3  -2.4  -3.3  19.5
years career          7.9  -6.9   3.5  -0.7  -0.7
times at bat career   9.3    -5   4.6  -1.2   0.2
hits career           9.3  -4.7   4.5  -1.6   0.5
home runs career      8.9  -3.3  -4.3   2.8  -7.6
runs career           9.5  -4.4   3.2  -0.2   2.1
runs batted career    9.5  -4.3   0.6  -0.5  -4.1
walks career          8.9    -5     2  -2.5   6.1
put outs 1986         2.2   4.3  -5.5 -51.6  -3.1
assists 1986          0.6   4.7  23.7  -0.9  -0.2
errors 1986                 5.6  20.7  -5.9   -13
                    ----- ----- ----- ----- -----
PCVE                 45.3    71  81.8  87.2  91.6

The screeplot and Kaiser rule indicate that three components are sufficient. These explain about 82% of the total variance.

Next we plot the first four [cols = 4, plotload = TRUE] PCA percentage contributions. The cumulative variance explained is not plotted [plotvexp = FALSE] (because it would plot the PCA variance twice). We require that zero contributions are also plotted, for ease of comparison [onlynonzero = FALSE]. The plots will be in a 2 x 2 grid [mfrowload = 2, mfcolload = 2] with the variables labelled as V1, V2, ... [varnam = FALSE] (because the names are quite long). These and other parameters can be set as explained in the help documentation.
plot(b.pca, cols = 4, plotvexp = FALSE, plotload = TRUE, addlabel = TRUE,
onlynonzero = FALSE, mfrow = 2, mfcol = 2)

[Figure 2: Percentage contributions of the first four PCs]


The PCs loadings show that the first components is made of the sum of career and season
offensive play, with higher weight for the career statistics. The second component is the
difference of the previous season results from the career ones. Season offensive play has larger

Giovanni Maria Merola

15

weight in this component, hence it should characterise young players who had a good 1986
season. Defensive play has large weight in the third and fourth components.

5.1. Choice of cardinality

The function choosecard helps choosing the cardinality of each component interactively: for each component it plots summary statistics of the sparse solutions, such as the minimal contribution, the variance explained and the entropy, against the cardinality, and then prompts for the preferred cardinality.

b.cc.be = choosecard(baseball, nd = 3, method = "BE")

[Figure 3: Cardinality Plots. Min contr vs card top-left, etc]

[1] "Comp 1 , reached min cardinality of 5 , smallest loading is 0.237"


[1] "Comp 2 , reached min cardinality of 2 , smallest loading is 0.493"



[Figure 4: Cardinality Plots. Min contr vs card top-left, etc]

[1] "Comp 1 , reached min cardinality of 5 , smallest loading is 0.237"


[1] "Comp 2 , reached min cardinality of 5 , smallest loading is 0.322"
[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.024"


[Figure 5: Cardinality Plots. Min contr vs card top-left, etc]


# choosecard(S, nd, method = c('BE', 'BB', 'PCA'), perc = TRUE,
#     unc = TRUE, trim = 1, reducetrim = TRUE, prntrace = FALSE,
#     rtntrace = TRUE, plotminload = TRUE, plotcvexp = TRUE,
#     plotlovsvexp = TRUE, plotentropy = TRUE, plotfarcomeni = FALSE,
#     mfrowplot = 2, mfcolplot = 2, plotall = TRUE, ce = 1)

At the prompt "enter the preferred cardinality for component 1 (add a decimal to stop):" we entered the values 5, 5 and 4.1.

5.2. SPCA(BB)

As a first attempt to get some insight into the solutions, we run LS SPCA(BB) requiring four components of cardinality 5. The output gives the summary statistics, then prints the percentage contributions and then plots them together with the PRCVE.
b.bb1 = spcabb(baseball, card = rep(5, 4))
done comp 1
done comp 2
done comp 3
done comp 4


# print the summaries
summary(b.bb1, cols = 4)

           Comp1 Comp2 Comp3 Comp4
PVE        45.1% 25.3% 10.7%  5.6%
PCVE       45.1% 70.4% 81.1% 86.7%
PRCVE      99.5% 99.1% 99.1% 99.3%
Card           5     5     5     5
Ccard          5    10    15    20
PVE/Card      9%  5.1%  2.1%  1.1%
PCVE/Ccard    9%    7%  5.4%  4.3%
MinCont     8.7%  9.4%  7.8%  4.7%

# print the contributions
b.bb1

Percentage Contributions
                    Comp1 Comp2 Comp3 Comp4
times at bat 1986          25.9
hits 1986                                12
home runs 1986                   23.8  15.9
runs 1986            12.2  17.3
runs batted 1986     13.7  13.2
walks 1986            8.7
years career
times at bat career  49.1 -34.2 -15.9
hits career
home runs career     16.4
runs career
runs batted career
walks career                           -4.7
put outs 1986                     7.8 -59.6
assists 1986                      -28
errors 1986                 9.4 -24.5  -7.7
                    ----- ----- ----- -----
PCVE                 45.1  70.4  81.1  86.7

# plot PCVE and contributions


plot(b.bb1, plotload = TRUE, nam = "BB 1", varnam = TRUE, mfrowload = 2,
mfcolload = 2)


[Figure 6: Percentage contributions of the first four LS SPCA(BB) solutions]

[Figure 7: Cumulative variance explained by LS SPCA(BB) compared with PCA]

A different view of the BB contributions can be obtained by plotting them against those of the PCs, using plotloadvsPC = TRUE in the plot method, as shown in Figure 8 below.

plot(smpc = b.bb1, cols = 1:4, plotvexp = FALSE, addlabels = TRUE,
    varnam = TRUE, plotloadvsPC = TRUE, pcs = b.pca, mfrow = 2,
    mfcol = 2, onlynonzero = FALSE)


[Figure 8: Sparse loadings plotted against the corresponding PCA ones]

5.3. SPCA(BE)
BE solutions with a PRCVE of at least 99% [ threshvaronPC = 0.01] are shown below.
b.be1 = spcabe(baseball, nd = 4, threshvaronPC = 0.01)
[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.181"
summary(b.be1)

           Comp1 Comp2 Comp3 Comp4
PVE          45% 25.4% 10.6%  5.7%
PCVE         45% 70.4%   81% 86.6%
PRCVE      99.3% 99.1%   99% 99.3%
Card           5     7     4     4
Ccard          5    12    16    20
PVE/Card      9%  3.6%  2.7%  1.4%
PCVE/Ccard    9%  5.9%  5.1%  4.3%
Converged      2     2     2     1
MinCont    11.1%  8.4% 17.4% 11.3%

b.be1

Percentage Contributions
                    Comp1 Comp2 Comp3 Comp4
times at bat 1986    21.3  21.2        16.3
hits 1986
home runs 1986       11.1        29.1  14.8
runs 1986                  19.4
runs batted 1986           16.2
walks 1986                            -11.3
years career               -9.3
times at bat career
hits career                     -17.4
home runs career
runs career          21.3 -15.6
runs batted career   29.3   -10
walks career           17
put outs 1986                         -57.5
assists 1986                    -29.3
errors 1986                 8.4 -24.2
                    ----- ----- ----- -----
PCVE                   45  70.4    81  86.6

In some cases it can be more convenient to see just the nonzero loadings and to plot them.
This is shown below.
# print the nonzero contributions separately
showload(b.be1)
[1] "Component 1"
times at bat 1986
home runs 1986
runs career
21.3%
11.1%
21.3%
runs batted career
walks career
29.3%
17%
[1]
[1] "Component 2"
times at bat 1986
runs 1986
runs batted 1986
21.2%
19.4%
16.2%
years career
runs career runs batted career
-9.3%
-15.6%
-10%
errors 1986
8.4%
[1]
[1] "Component 3"
home runs 1986
hits career
assists 1986
errors 1986
29.1%
-17.4%
-29.3%
-24.2%

Giovanni Maria Merola

[1]
[1] "Component 4"
times at bat 1986
16.3%
put outs 1986
-57.5%
[1]

home runs 1986


14.8%

23

walks 1986
-11.3%

# plot the contributions


plot(b.be1, plotload = TRUE, mfrowload = 2, mfcol = 2, varnam = TRUE,
addlabel = TRUE, nam = "BEB1")

[Figure 9: Percentage contributions of the first four BE solutions]

[Figure 10: Cumulative variance explained by BE compared with PCA]


The BE output can be compared with the BB one using spca.comp. This function prints the loadings side by side and produces comparative plots of the variance explained and of the contributions (setting plotload = TRUE). In the plots of the loadings the variables' names are replaced by their order.
spca.comp(list(b.bb1, b.be1), nd = 4, nam = c("BB 5", "BE 99"),
plotvar = TRUE, plotload = TRUE)
[1] "Loadings"
C1.1
times at bat 1986
hits 1986
home runs 1986
runs 1986
runs batted 1986
walks 1986
years career
times at bat career
hits career
home runs career
runs career
runs batted career
walks career
put outs 1986
assists 1986
errors 1986
PCVE

C1.2
C2.1
C2.2
C3.1
0.213 0.259 0.212
0.111

0.238

0.122
0.137
0.087

0.173
0.132

0.491

-0.342

0.194
0.162
-0.093
-0.159

0.164
0.213
0.293
0.17

----45.1

----45

-0.156
-0.1
0.078
-0.28
0.094 0.084 -0.245
----- ----- ----70.4
70.4
81.1

Giovanni Maria Merola

C3.2
C4.1
C4.2
times at bat 1986
0.163
hits 1986
0.12
home runs 1986
0.291 0.159 0.148
runs 1986
runs batted 1986
walks 1986
-0.113
years career
times at bat career
hits career
-0.174
home runs career
runs career
runs batted career
walks career
-0.047
put outs 1986
-0.596 -0.575
assists 1986
-0.293
errors 1986
-0.242 -0.077
----- ----- ----PCVE
81
86.7
86.6
[1]
[1] Summary statistics
C1.1 C1.2 C2.1 C2.2 C3.1 C3.2 C4.1 C4.2
PVE
45.1 45.0 25.3 25.4 10.7 10.6
5.6
5.7
PCVE
45.1 45.0 70.4 70.4 81.1 81.0 86.7 86.6
PRCVE
99.5 99.3 99.1 99.1 99.1 99.0 99.3 99.3
Card
5
5
5
7
5
4
5
4
Ccard
5
5
10
12
15
16
20
20
PVE/Card
9.0
9.0
5.1
3.6
2.1
2.7
1.1
1.4
PCVE/Ccard
9.0
9.0
7.0
5.9
5.4
5.1
4.3
4.3
Min %Cont
8.7% 11.1% 9.4% 8.4% 7.8% 17.4% 4.7% 11.3%


[Comparative plots of the loadings of BB 5 and BE 99 for Components 1 to 4, and of their cumulative variance explained together with PCA]

The output shows that the PCVE values are almost the same as the PCA ones, but that the contributions of the two solutions are different. For the second component the smallest contribution produced by BE is smaller than the corresponding BB one.

5.4. More BE solutions using different options


Since we ran BE without specifying a threshold for the loadings, some of the contributions are less than 10%. The next example shows the output of BE run requiring that the contributions are not less than 20% and that enough components be computed to explain at least 80% of the total variance [ndbyvexp = 0.80].
b.be2 = spcabe(baseball, thresh = 0.2, ndbyvexp = 0.8)
[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.062"
[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.128"
[1] "reached PCVE 82.2% with 4 components"
summary(b.be2)

           Comp1 Comp2 Comp3 Comp4
PVE        44.5% 24.6%  3.6%  9.6%
PCVE       44.5% 69.1% 72.6% 82.2%
PRCVE      98.2% 97.3% 88.8% 94.3%
Card           3     3     3     4
Ccard          3     6     9    13
PVE/Card   14.8%  8.2%  1.2%  2.4%
PCVE/Ccard 14.8% 11.5%  8.1%  6.3%
Converged      0     0     1     1
MinCont    24.8% 27.5%  4.2%  8.1%

print(b.be2, namcomp = NULL)

Percentage Contributions
                    Comp1 Comp2 Comp3 Comp4
times at bat 1986      28  37.3
hits 1986                         4.2
home runs 1986                         19.1
runs 1986
runs batted 1986           27.5         8.1
walks 1986
years career
times at bat career                   -14.6
hits career
home runs career
runs career               -35.3
runs batted career   47.2
walks career         24.8
put outs 1986                     -53
assists 1986                     42.8
errors 1986                           -58.1
                    ----- ----- ----- -----
PCVE                 44.5  69.1  72.6  82.2

The algorithm stopped after four components, but the last two did not converge because the minimum cardinality (required for uncorrelatedness) was reached. Note the warnings printed and the value 1 corresponding to Converged in the summary table; in fact, the minimum contributions are smaller than 10%.

In some cases it may be necessary to require a cardinality smaller than the order of the component. This can be achieved by removing the uncorrelatedness restrictions on the last two components. The output for this setup is shown next.
b.be3 = spcabe(baseball, thresh = 0.2, ndbyvexp = 0.8, unc = c(TRUE,
    TRUE, FALSE, FALSE))
[1] "less than 16 unc flags given, last 12 set to FALSE"
[1] "reached PCVE 85.5% with 4 components"
summary(b.be3)

           Comp1 Comp2 Comp3 Comp4
PVE        44.5% 24.6% 10.8%  5.6%
PCVE       44.5% 69.1% 79.9% 85.5%
PRCVE      98.2% 97.3% 97.7% 98.1%
Card           3     3     3     2
Ccard          3     6     9    11
PVE/Card   14.8%  8.2%  3.6%  2.8%
PCVE/Ccard 14.8% 11.5%  8.9%  7.8%
Converged      0     0     0     0
MinCont    24.8% 27.5% 28.2% 23.6%

print(b.be3, namcomp = paste("Index", 1:4))

Percentage Contributions
                    Index 1 Index 2 Index 3 Index 4
times at bat 1986        28    37.3            23.6
hits 1986
home runs 1986                         28.2
runs 1986
runs batted 1986               27.5
walks 1986
years career
times at bat career
hits career
home runs career
runs career                   -35.3
runs batted career     47.2
walks career           24.8
put outs 1986                                 -76.4
assists 1986                          -41.2
errors 1986                           -30.6
                      -----   -----   -----   -----
PCVE                   44.5    69.1    79.9    85.5

names(b.be3)

 [1] "loadings"      "contributions" "vexp"
 [4] "vexpPC"        "cardinality"   "ind"
 [7] "unc"           "converged"     "corComp"
[10] "Uncloadings"

round(b.be3$corComp, 2)

      [,1]  [,2]  [,3]  [,4]
[1,]  1.00  0.00  0.15 -0.02
[2,]  0.00  1.00 -0.12 -0.01
[3,]  0.15 -0.12  1.00 -0.13
[4,] -0.02 -0.01 -0.13  1.00

The last command shows the correlations between the components: the third one is correlated with all the others, while the fourth one is correlated only with the third one.

The components can be constrained to be combinations of only a subset of the variables. For example, it could be reasonable to separate career and season statistics, as also suggested by the previous solutions.
indoffs = 1:6
indoffc = 7:13
inddef = 14:16
inds = c(indoffs, inddef)

We could ask which single variable in each of these groups is most useful for explaining the variability. We can obtain an answer by requiring correlated components (which are really single variables) of cardinality one for each group. The output is shown below, with the components named after the indices selected via the argument namcomp.


b.bbs2c = spcabb(baseball, card = rep(1, 3), unc = FALSE,
    startind = list(indoffc, indoffs, inddef))
done comp 1
done comp 2
done comp 3

summary(b.bbs2c)

           Comp1 Comp2 Comp3
PVE        41.5% 25.6%  9.1%
PCVE       41.5%   67% 76.1%
PRCVE      91.5% 94.4%   93%
Card           1     1     1
Ccard          1     2     3
PVE/Card   41.5% 25.6%  9.1%
PCVE/Ccard 41.5% 33.5% 25.4%
MinCont     100%  100%  100%

print(b.bbs2c, namcomp = c("off carr.", "off 1986", "defens. 1986"))

Percentage Contributions
                    off carr. off 1986 defens. 1986
times at bat 1986                  100
hits 1986
home runs 1986
runs 1986
runs batted 1986
walks 1986
years career
times at bat career
hits career
home runs career
runs career
runs batted career        100
walks career
put outs 1986
assists 1986
errors 1986                                     100
                        -----    -----        -----
PCVE                     41.5       67         76.1

names(b.bbs2c)

 [1] "loadings"    "vexp"        "vexpPC"      "ind"
 [5] "niter"       "unc"         "nvexp"       "corComp"
 [9] "Uncloadings"

round(b.bbs2c$corComp, 2)

      [,1] [,2]  [,3]
[1,]  1.00 0.22 -0.10
[2,]  0.22 1.00  0.34
[3,] -0.10 0.34  1.00

When the number of variables is large, more than one loading can be trimmed at each iteration [trim = x, where x is the number of loadings to trim]. Since there could be fewer loadings left than the number to trim, setting reducetrim = TRUE makes these last loadings trimmed one at a time. Below we compare the two solutions.

b.be3t = spcabe(baseball, thresh = 0.15, perc = TRUE, ndbyvexp = 0.8,
    trim = 3, reducetrim = TRUE)
[1] "reduced trim to TRUE remaining 2 iterations. comp 2"
[1] "reduced trim to TRUE remaining 1 iterations. comp 3"
[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.123"
[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.203"
[1] "reached PCVE 81.7% with 4 components"

b.be3nt = spcabe(baseball, thresh = 0.15, perc = TRUE, ndbyvexp = 0.8,
    trim = 1)
[1] "Comp 3 , reached min cardinality of 3 , smallest loading is 0.062"
[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.128"
[1] "reached PCVE 82.2% with 4 components"
summary(b.be2, perc = TRUE)

           Comp1 Comp2 Comp3 Comp4
PVE        44.5% 24.6%  3.6%  9.6%
PCVE       44.5% 69.1% 72.6% 82.2%
PRCVE      98.2% 97.3% 88.8% 94.3%
Card           3     3     3     4
Ccard          3     6     9    13
PVE/Card   14.8%  8.2%  1.2%  2.4%
PCVE/Ccard 14.8% 11.5%  8.1%  6.3%
Converged      0     0     1     1
MinCont    24.8% 27.5%  4.2%  8.1%

summary(b.be3t, perc = TRUE)

           Comp1 Comp2 Comp3 Comp4
PVE        41.2% 26.9%  7.4%  6.2%
PCVE       41.2% 68.1% 75.5% 81.7%
PRCVE        91% 95.9% 92.3% 93.7%
Card           1     3     3     4
Ccard          1     4     7    11
PVE/Card   41.2%    9%  2.5%  1.5%
PCVE/Ccard 41.2%   17% 10.8%  7.4%
Converged      0     0     1     1
MinCont     100% 20.8%  8.1% 10.8%

spca.comp(list(b.be3nt, b.be3t), nam = c("trim 1", "trim 3"))

[Comparative plot of the cumulative variance explained by the trim 1 and trim 3 solutions and by PCA]

[1] "Loadings"
times at bat 1986
hits 1986
home runs 1986
runs 1986
runs batted 1986
walks 1986
years career
times at bat career
hits career
home runs career
runs career
runs batted career
walks career
put outs 1986
assists 1986
errors 1986
PCVE

C1.1
C1.2
0.28

C2.1
C2.2
C3.1
0.373 0.497
0.042
0.275
0.294

1 -0.353 -0.208
0.472
0.248

----44.5
C3.2

----41.2
C4.1

----69.1
C4.2

----68.1

-0.53
0.428
----72.6

times at bat 1986


hits 1986
home runs 1986
0.191
runs 1986
0.108
runs batted 1986
0.081
walks 1986
years career
times at bat career
-0.146
hits career
0.452
0.29
home runs career
runs career
runs batted career -0.467
-0.354
walks career
put outs 1986
assists 1986
0.081
errors 1986
-0.581 -0.248
----- ----- ----PCVE
75.5
82.2
81.7
[1]
[1] Summary statistics
            C1.1  C1.2  C2.1  C2.2  C3.1  C3.2  C4.1  C4.2
PVE         44.5  41.2  24.6  26.9   3.6   7.4   9.6   6.2
PCVE        44.5  41.2  69.1  68.1  72.6  75.5  82.2  81.7
PRCVE       98.2  91.0  97.3  95.9  88.8  92.3  94.3  93.7
Card           3     1     3     3     3     3     4     4
Ccard          3     1     6     4     9     7    13    11
PVE/Card    14.8  41.2   8.2   9.0   1.2   2.5   2.4   1.5
PCVE/Ccard  14.8  41.2  11.5  17.0   8.1  10.8   6.3   7.4
Converged      0     0     0     0     1     1     1     1
Min %Cont  24.8%  100% 27.5% 20.8%  4.2%  8.1%  8.1% 10.8%

In case diagnostic information about the solution and the elimination process is required, setting the argument diag = TRUE produces several pieces of information, as exemplified below. Details on the diagnostic output can be found in the help.
b.be1 = spcabe(baseball, nd = 4, threshvaronPC = 0.01, diag = TRUE)
[1] "Comp 4 , reached min cardinality of 4 , smallest loading is 0.181"
# the output now contains:
names(b.be1)

 [1] "loadings"      "contributions" "vexp"
 [4] "vexpPC"        "cardinality"   "ind"
 [7] "unc"           "converged"     "vexpo"
[10] "totvcloss"     "vlossbe"       "niter"
[13] "eliminated"    "thresh"        "threshvar"
[16] "threshvaronPC" "ndbyvexp"      "stopbyvar"
[19] "mincard"       "tracing"

# for example, the sequence of variables eliminated is
# stored in eliminated
b.be1$elim

$`Comp 1`
       assists 1986         errors 1986
                 15                  16
          runs 1986         hits career
                  4                   9
   home runs career times at bat career
                 10                   8
       years career           hits 1986
                  7                   2
      put outs 1986    runs batted 1986
                 14                   5
         walks 1986
                  6

$`Comp 2`
   home runs career       put outs 1986
                 10                  14
     home runs 1986         hits career
                  3                   9
times at bat career        assists 1986
                  8                  15
       walks career          walks 1986
                 13                   6
          hits 1986
                  2

$`Comp 3`
 runs batted career           runs 1986
                 12                   4
         walks 1986 times at bat career
                  6                   8
          hits 1986   times at bat 1986
                  2                   1
       years career       put outs 1986
                  7                  14
        runs career        walks career
                 11                  13
   home runs career    runs batted 1986
                 10                   5

$`Comp 4`
       walks career times at bat career
                 13                   8
          hits 1986        years career
                  2                   7
       assists 1986  runs batted career
                 15                  12
   home runs career    runs batted 1986
                 10                   5
        runs career         errors 1986
                 11                  16
        hits career           runs 1986
                  9                   4

6. Concluding Remarks
The package spca was initially developed for my own needs. The literature on SPCA does not yet present many applications, and plots typical of PCA, such as loadings plots, are not very useful for sparse solutions. Therefore, I propose different ways of plotting and printing the solutions that I found useful. I hope that they will be useful for other researchers as well. The package for now only takes correlation matrices as input, so utilities for the analysis of the component scores are not present yet.
The plotting utilities are designed to provide a first graphical description of the output; they are not the main feature of the package. Since it is extremely difficult to set graphical parameters that produce perfect plots under all possible circumstances (also because these can be device dependent), I chose settings that should produce good plots in most situations. The plots are simple to reproduce and better ones can easily be obtained from the spca objects.
The functions spcabb and spcabe are written entirely in R and cannot deal with large problems. I am working on rewriting them in C.

References

Cadima J, Jolliffe IT (1995). "Loadings and correlations in the interpretation of principal components." Journal of Applied Statistics, 22(2), 203-214.

Farcomeni A (2009). "An exact approach to sparse principal component analysis." Computational Statistics, 24(4), 583-604.

Izenman AJ (1975). "Reduced-rank regression for the multivariate linear model." Journal of Multivariate Analysis, 5(2), 248-264.

Kaiser H (1960). "The application of electronic computers to factor analysis." Educational and Psychological Measurement, 20, 141-151.

Magnus J, Neudecker H (1999). Matrix Differential Calculus With Applications in Statistics and Econometrics. John Wiley.

Merola G (2014). "Least Squares Sparse Principal Component Analysis: a Backward Elimination approach to attain large loadings." Submitted for publication. Preprint available at http://arxiv.org/abs/1406.1381.

Moghaddam B, Weiss Y, Avidan S (2006). "Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms." In Advances in Neural Information Processing Systems, pp. 915-922. MIT Press.

R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org.

Rao C, Toutenburg H (1999). Linear Models: LS and Alternatives. Springer Series in Statistics, second edition. Springer Verlag.

ten Berge J (1993). Least Squares Optimization in Multivariate Analysis. DSWO Press, Leiden University.

Wold H (1966). "Estimation of principal components and related models by iterative least squares." In P Krishnaiah (ed.), Multivariate Analysis, volume 59, pp. 391-420. Academic Press, NY.

Affiliation:
Giovanni Maria Merola
Department of Economics, Finance and Marketing
RMIT International University Vietnam
702 Nguyen Van Linh Blvd
PMH, D 7, HCMC, Viet Nam
E-mail: lsspca@gmail.com
