
Dimension Reduction: Principal components, Multidimensional scaling

Credits today’s materials: Rafael Irizarry

Alter, Botstein, Brown. Proc. Natl. Acad. Sci. USA, Vol. 97, Issue 18, 10101-10106, August 29, 2000
..\688-2002\webs\PNAS -- Alter et al_ 97 (18) 10101.htm

Why Reduce Dimension?

—We often have the intuition that a data set is fundamentally low dimensional even if it comes in a high dimensional form.
—Predictors and inferential tools can waste much of their effort in irrelevant corners of high dimensional space

Dimension Reduction Methods

Dimensions can be either genes or samples

—Variable Pruning
—Principal Component analysis / Singular Value Decomposition
—Multidimensional Scaling

PCA: 2-D example

[Figure: 2-D scatter plot, both axes running from -4 to 4]

Identify direction of greatest variation.

[Figure: the same scatter plot with the 1st principal component drawn in]

Reexpress coordinates in terms of principal component and rest. Remove rest for dimension reduction.

[Figure: the data after projection]
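The 2-D example can be tried numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the function names `first_principal_component` and `project_onto` are hypothetical and not from the slides. It finds the direction of greatest variation as the top eigenvector of the sample covariance matrix, then keeps only the coordinate along that direction.

```python
import numpy as np

def first_principal_component(points):
    """Return (mean, unit direction of greatest variance) for an (n, 2) array."""
    mean = points.mean(axis=0)
    centered = points - mean
    cov = centered.T @ centered / (len(points) - 1)   # 2 x 2 sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)            # eigenvalues in ascending order
    return mean, eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

def project_onto(points, mean, direction):
    """Keep the coordinate along `direction`, drop the rest."""
    scores = (points - mean) @ direction              # 1-D coordinate per point
    return scores, mean + np.outer(scores, direction) # projected points back in 2-D

# Synthetic cloud stretched along the diagonal (made-up data for illustration).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
points = np.column_stack([t, t]) * 2.0 + rng.normal(scale=0.3, size=(200, 2))

mean, direction = first_principal_component(points)
scores, projected = project_onto(points, mean, direction)
```

Here `scores` is the reduced, one-dimensional representation, and `projected` is where each point lands on the principal axis.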
Principal Component Analysis (PCA)

—PCA is a way of expressing a high dimensional data set via an alternative set of dimensions that are
– Orthogonal to each other
– Easy to rank by how “representative” they are
– Not always easy to interpret
—It is useful because it allows visualization of the data in most representative dimension
—It may (or not) provide useful predictions for clustering and phenotype prediction

Principal Component Analysis

—Rotate data so that the new coordinates are based on correlation structure of the data
—Select a subset of the new coordinates with high variability, and use for data visualization, summarization and clustering
—E.G. Clustering tumor samples using PCA on expression profiles

Singular Value Decomposition

—SVD is identical to PCA if the data is already centered
—Otherwise the idea is the same but the rotation is driven by mean information as well as covariance structure

PCA COMPUTATIONS

[Equation: projection of x to u]

Where u is the m-dimensional projected vector and x the original, d-dimensional data vector.
The m projection vectors that maximize the variance of u, i.e. the principal axes, are given by the eigenvectors of the data set's covariance matrix S associated with the largest m eigenvalues.
The observed data covariance matrix is:

[Equation: covariance matrix S]

and the eigenvectors and eigenvalues can be found by solving the set of equations:

[Equation: eigenvalue equations for S]

Percent variance retained:

[Equation: percent variance retained]
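The formulas of the PCA COMPUTATIONS slide are not present in the text. The block below gives them in the usual textbook form as a hedged reconstruction, not the slide's own notation. The symbols W, e_k, \lambda_k, \bar{x} and n, the centering of x, and the 1/(n-1) normalization are all assumptions.

```latex
% Projection of a d-dimensional data vector x onto m principal axes
u = W^{\mathsf{T}} (x - \bar{x}), \qquad W = [\, e_1, \dots, e_m \,]

% Observed (sample) covariance matrix of n data vectors x_1, ..., x_n
S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\mathsf{T}}

% Eigenvectors e_k and eigenvalues \lambda_k of S
S\, e_k = \lambda_k\, e_k, \qquad k = 1, \dots, d, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d

% Percent variance retained by the first m principal axes
100 \times \frac{\sum_{k=1}^{m} \lambda_k}{\sum_{k=1}^{d} \lambda_k}
```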


PCA computation

Requires a matrix eigendecomposition (~N^3 operations, where N is the dimension)
Storing the variance-covariance matrix (~N^2 entries, where N is the dimension)
If data has only a few dimensions – very fast (one vector consists of all genes at one time point)
If data has many dimensions – very slow (one vector for each gene at multiple time points)

[Figure: an example]
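The cost notes can be made concrete with a small experiment. This is a sketch assuming NumPy and made-up matrix sizes; the function names are hypothetical. It computes the principal axes twice: once from the N x N covariance matrix, and once from an SVD of the centered data, which never forms that matrix. On centered data the two agree, and the percent of variance retained follows from the eigenvalues.

```python
import numpy as np

def pca_via_covariance(X):
    """X: (samples, dimensions). Eigendecompose the N x N covariance matrix."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (X.shape[0] - 1)          # N x N entries must be stored
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    return eigvals[order], eigvecs[:, order]

def pca_via_svd(X):
    """Same principal axes from the SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return s**2 / (X.shape[0] - 1), Vt.T      # variances, axes as columns

def percent_variance_retained(variances, m):
    return 100.0 * variances[:m].sum() / variances.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(14, 300))                # 14 samples, 300 dimensions (hypothetical sizes)

var_cov, axes_cov = pca_via_covariance(X)
var_svd, axes_svd = pca_via_svd(X)
covariance_entries = X.shape[1] ** 2          # grows as N^2
```

With 14 samples, at most 13 components carry variance after centering, so `percent_variance_retained(var_svd, 13)` is 100 up to rounding.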

The read my lips example

Kirby, Weisser, Dangelmayer 1993

[Figure: Data]

Please detach along the perforated line….

[Figure: Data, PCA, New Basis Vectors]

[Figure: Data, PCA, EigenLips]

Alter Botstein and Brown

webs\PNAS -- Alter et al_ 97 (18) 10101.htm

PCA: Caveats

—The first principal component may have little or no biological signal; for example, if you don’t normalize intensity, the first component will capture effects that should have been normalized.
—The top components may not be the best ones for clustering or classification
—Nonlinear trends across dimensions are not considered
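The first caveat can be illustrated on simulated data. This is a sketch under stated assumptions: NumPy, an invented two-group signal in 50 of 500 genes, and a per-array intensity offset. None of these numbers come from the slides. Without normalization the first component tracks array intensity; after subtracting each array's mean it tracks the group signal.

```python
import numpy as np

def first_pc_scores(X):
    """X: (arrays, genes). Scores of each array on the first principal component."""
    Xc = X - X.mean(axis=0)                   # center each gene across arrays
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * s[0]

rng = np.random.default_rng(2)
n_arrays, n_genes = 20, 500
group = np.repeat([0.0, 1.0], n_arrays // 2)              # hypothetical two-class design
intensity = rng.normal(scale=3.0, size=n_arrays)          # per-array brightness offset

signal = np.zeros((n_arrays, n_genes))
signal[:, :50] = group[:, None] * 2.0                     # 50 genes respond to the group
noise = rng.normal(scale=0.5, size=(n_arrays, n_genes))
X = signal + intensity[:, None] + noise

raw_scores = first_pc_scores(X)
normalized = X - X.mean(axis=1, keepdims=True)            # remove each array's mean intensity
norm_scores = first_pc_scores(normalized)

corr_raw_intensity = abs(np.corrcoef(raw_scores, intensity)[0, 1])
corr_norm_group = abs(np.corrcoef(norm_scores, group)[0, 1])
```

Subtracting the array mean is only one simple normalization, chosen here for brevity.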

Cell Cycle

[Figure: the stages of the cell cycle]

The cell cycle is an ordered set of events, culminating in cell growth and division into two daughter cells. Non-dividing cells are not considered to be in the cell cycle. The stages, pictured to the left, are G1-S-G2-M. The G1 stage stands for "GAP 1". The S stage stands for "Synthesis". This is the stage when DNA replication occurs. The G2 stage stands for "GAP 2". The M stage stands for "mitosis", and is when nuclear (chromosomes separate) and cytoplasmic (cytokinesis) division occur. Mitosis is further divided into 4 phases.

Experiments

•Elutriation-synchronized cell cycle
•α-Factor-synchronized cell cycle
•Two additional independent experiments measured mRNA levels of strain cultures with overactivated CLN3, which encodes a G1/S cyclin, at t = 30 and 40 min relative to their levels at the start of overactivation at t = 0. The dataset for the α factor, CLB2, and CLN3 experiments we analyze

Fig. 1. Normalized elutriation eigengenes. (a) Raster display of the expression of 14 eigengenes in 14 arrays. (b) Bar chart of the fractions of eigenexpression, showing that two eigengenes capture about 20% of the overall normalized expression each, and a high entropy d = 0.88. (c) Line-joined graphs of the expression levels of the two eigengenes (red and blue) in the 14 arrays fit dashed graphs of normalized sine (red) and cosine (blue) of period T = 390 min and phase 2π/13, respectively.

Fig. 2. Normalized elutriation expression in the subspace associated with the cell cycle. (a) Array correlation with one eigenarray along the y-axis vs. that with a second eigenarray along the x-axis, color-coded according to the classification of the arrays into the five cell cycle stages, M/G1 (yellow), G1 (green), S (blue), S/G2 (red), and G2/M (orange). The dashed unit and half-unit circles outline 100% and 25% of overall normalized array expression in the subspace of the two eigenarrays. (b) Correlation of each gene with one eigengene vs. that with a second eigengene, for 784 cell cycle regulated genes, color-coded according to the classification by Spellman et al. (3).

Fig. 3. Genes sorted by relative correlation with two eigengenes of normalized elutriation. (a) Normalized elutriation expression of the sorted 5,981 genes in the 14 arrays, showing traveling wave of expression. (b) Eigenarrays expression; the expression of the two eigenarrays corresponding to these eigengenes displays the sorting. (c) Expression levels of the two eigenarrays (red and green) fit normalized sine and cosine functions of period Z = N-1 = 5,980 and phase 2π/13 (blue), respectively.
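The sorting described in Fig. 3 can be sketched on synthetic data. This is not the authors' code. It assumes NumPy, takes the eigengenes to be the right singular vectors of a genes x arrays matrix, uses the normalized projection (cosine) as the "correlation", and skips the paper's normalization steps. The function name and the simulated sinusoidal genes are hypothetical.

```python
import numpy as np

def sort_genes_by_phase(E):
    """E: (genes, arrays). Order genes by their angle in the plane of two eigengenes."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    g1, g2 = Vt[0], Vt[1]                     # two leading eigengenes (unit vectors)
    norms = np.linalg.norm(E, axis=1)
    norms[norms == 0] = 1.0                   # avoid dividing by zero for empty rows
    c1 = E @ g1 / norms                       # cosine of each gene with eigengene 1
    c2 = E @ g2 / norms                       # cosine of each gene with eigengene 2
    phase = np.arctan2(c1, c2)                # relative correlation expressed as an angle
    fractions = s**2 / np.sum(s**2)           # fraction of expression per eigengene
    return np.argsort(phase), phase, fractions

# Simulated periodic genes sampled over one full period (made-up data).
rng = np.random.default_rng(4)
n_genes, n_arrays = 200, 14
t = 2 * np.pi * np.arange(n_arrays) / n_arrays
true_phase = rng.uniform(0, 2 * np.pi, size=n_genes)
E = np.cos(t[None, :] - true_phase[:, None])
E = E + rng.normal(scale=0.05, size=(n_genes, n_arrays))

order, phase, fractions = sort_genes_by_phase(E)
```

Displaying `E[order]` as a raster would show the traveling wave of expression the caption mentions. The recovered `phase` matches `true_phase` only up to a constant shift and a possible reversal of direction, because the signs of singular vectors are arbitrary.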

Fig. 4. Rotated normalized α factor, CLB2, and CLN3 eigengenes. (a) Raster display of the rotated eigengenes. (b) Two eigengenes capture 20% of the overall normalized expression each. (c) Expression levels of two eigengenes (red and blue) fit dashed graphs of normalized sine (red) and cosine (blue) of period T/2 = 66 min and phase π/4, respectively, and a third eigengene (green) fits dashed graph of normalized sine of period T = 112 min and phase -π/8, from t = 7 to t = 119 min during the cell cycle.

Fig. 5. Rotated normalized α factor, CLB2, and CLN3 expression in the subspace associated with the cell cycle. (a) Array correlation with one eigenarray along the y-axis vs. that with a second eigenarray along the x-axis, color-coded according to the classification of the arrays into the five cell cycle stages, M/G1 (yellow), G1 (green), S (blue), S/G2 (red), and G2/M (orange). The dashed unit and half-unit circles outline 100% and 25% of overall normalized array expression in the subspace of the two eigenarrays. (b) Correlation of each gene with one eigengene vs. that with a second eigengene, for 638 cell cycle regulated genes, color-coded according to the classification by Spellman et al. (3).

Fig. 6. Genes sorted by relative correlation with two eigengenes of rotated normalized α factor, CLB2, and CLN3. (a) Normalized expression of the sorted 4,579 genes in the 22 arrays, showing traveling wave of expression from t = 0 to 119 min during the cell cycle and standing waves of expression in the CLB2- and CLN3-overactive arrays. (b) Eigenarrays expression; the expression of the two eigenarrays corresponding to these eigengenes displays the sorting. (c) Expression levels of the two eigenarrays (red and green) fit normalized sine and cosine functions of period Z = N-1 = 4,578 and phase π/8 (blue), respectively.
MDS: Multidimensional Scaling

—PCA requires vector representation
—It can be interesting to start from pairwise distances between n points.
—How can we do dimension reduction and visualization in this case?
—Find coordinates for points in d-dimensional space such that distances are preserved as well as possible
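One concrete way to find such coordinates is classical (Torgerson) MDS, sketched below with NumPy. The slides do not say which MDS variant is meant, so this choice is an assumption, and the function names and test data are hypothetical. The method double-centers the squared distances to obtain a Gram matrix, then uses its top d eigenvectors as coordinates.

```python
import numpy as np

def classical_mds(D, d=2):
    """Embed n points in d dimensions from an (n, n) matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:d]       # d largest eigenvalues
    lam = np.clip(eigvals[idx], 0.0, None)    # guard against small negative values
    return eigvecs[:, idx] * np.sqrt(lam)

def pairwise_distances(X):
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Points that are truly 2-D but presented in 10 dimensions (made-up data).
rng = np.random.default_rng(3)
low = rng.normal(size=(30, 2))
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))            # random orthogonal rotation
high = np.hstack([low, np.zeros((30, 8))]) @ Q

D = pairwise_distances(high)                  # only the distances are passed on
Y = classical_mds(D, d=2)
max_error = np.abs(pairwise_distances(Y) - D).max()
```

Because the points here really lie in a plane, the 2-D embedding reproduces the distances almost exactly. For general data the distances are only approximated, and non-metric or non-linear MDS variants use a different, iterative fit.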

Multidimensional Scaling

—Multi-Dimensional Scaling (MDS) is a general technique for displaying n-dimensional data in 2D.
—It preserves the notion of “nearness”, and therefore clusters of items in n-dimensions still look like clusters on a plot.

Motivations

MDS attempts to
—Identify abstract variables which have generated the inter-object similarity measures
—Reduce the dimension of the data in a non-linear fashion
—Reproduce non-linear higher-dimensional structures on a lower-dimensional display

Multidimensional Scaling

[Figure: Bittner et al. Nature 2000]

Backing out genes that are associated with MDS dimensions

References

—Multivariate Analysis, by K. Mardia
—Multivariate Statistical Analysis: A Conceptual Introduction, by S.K. Kachigan
—Yeung and Ruzzo, Principal component analysis for clustering gene expression data, Bioinformatics 2001
—Matrix Algebra Useful for Statistics, by Searle
