
We study phenomena that cannot be directly observed

E.g. ego, personality, and intelligence in psychology: underlying factors that govern the observed data

We want to identify and operate with underlying latent factors rather than the observed data

E.g. topics in news articles, or transcription factors in genomics

"Beautiful car" and "gorgeous automobile" are closely related; so are "driver" and "automobile". But does your search engine know this? Working with the underlying factors reduces noise and error in results

We have too many observations and dimensions

Too many to reason about or obtain insights from, or to visualize; too much noise in the data. We need to reduce them to a smaller set of factors: a better representation of the data without losing much information. We can then build more effective data analyses on the reduced-dimensional space: classification, clustering, pattern recognition

Combinations of observed variables may be more effective bases for insights, even if physical meaning is obscure

Discover a new set of factors/dimensions/axes against which to represent, describe or evaluate the data

For more effective reasoning, insights, or better visualization, and to reduce noise in the data. Typically a smaller set of factors: dimension reduction. A better representation of the data without losing much information; more effective data analyses can be built on the reduced-dimensional space: classification, clustering, pattern recognition

These may be more effective bases for insights, even if their physical meaning is obscure. Observed data are then described in terms of these factors rather than in terms of the original variables/dimensions

Basic Concept

Areas of variance in data are where items can be best discriminated and key underlying phenomena observed

Areas of greatest signal in the data

If two variables are highly correlated, they are likely to represent highly related phenomena. If they tell us about the same underlying variance in the data, combining them to form a single measure is reasonable

Parsimony; reduction in error

So we want to combine related variables, and focus on uncorrelated or independent ones, especially those along which the observations have high variance. We want a smaller set of variables that explains most of the variance in the original data, in a more compact and insightful form

Basic Concept

What if the dependencies and correlations are not so strong or direct? And suppose you have 3 variables, or 4, or 5, or 10000? Look for the phenomena underlying the observed covariance/co-dependence in a set of variables

Once again, phenomena that are uncorrelated or independent, and especially those along which the data show high variance

These phenomena are called factors or principal components or independent components, depending on the methods used

Factor analysis: based on variance/covariance/correlation. Independent Component Analysis: based on independence

PCA is the most common form of factor analysis. The new variables/dimensions:

- are linear combinations of the original ones
- are uncorrelated with one another (orthogonal in the original dimension space)
- capture as much of the original variance in the data as possible
- are called Principal Components

http://www.cs.mcgill.ca/~sqrt/dimr/dimreductio

[Figure: data plotted against Original Variable A and Original Variable B, with the orthogonal directions PC 1 and PC 2 overlaid]

Orthogonal directions of greatest variance in data Projections along PC1 discriminate the data most along any one axis

Principal Components

First principal component is the direction of greatest variability (covariance) in the data Second is the next orthogonal (uncorrelated) direction of greatest variability

So first remove all the variability along the first component, and then find the next direction of greatest variability

And so on

Principle

A linear projection method to reduce the number of parameters. It transforms a set of correlated variables into a new set of uncorrelated variables and maps the data into a space of lower dimensionality. A form of unsupervised learning

Properties

It can be viewed as a rotation of the existing axes to new positions in the space defined by the original variables. The new axes are orthogonal and represent the directions of maximum variability

Data points are vectors in a multidimensional space. The projection of a vector x onto an axis (dimension) u is u·x. The direction of greatest variability is the one along which the average squared projection is greatest,

i.e. the u such that E((u·x)²) over all x is maximized (for simplicity we subtract the mean along each dimension, centering the original axis system at the centroid of all data points). This direction of u is the direction of the first Principal Component.

E((u·x)²) = E((u·x)(u·x)ᵀ) = E(u·x·xᵀ·uᵀ). The matrix C = x·xᵀ contains the correlations (similarities) of the original axes, based on how the data values project onto them. So we are looking for the u that maximizes u C uᵀ subject to u being unit-length. This is maximized when u is the principal eigenvector of the matrix C, in which case

u C uᵀ = λ u uᵀ = λ if u is unit-length, where λ is the principal eigenvalue of the correlation matrix C.

The eigenvalue λ denotes the amount of variability captured along that dimension.

Derivation: maximize uᵀ x xᵀ u subject to uᵀu = 1. Construct the Lagrangian uᵀ x xᵀ u − λ uᵀu and set the vector of partial derivatives to zero: x xᵀ u − λ u = (x xᵀ − λI) u = 0. Since u ≠ 0, u must be an eigenvector of x xᵀ with eigenvalue λ.

The first root is called the principal eigenvalue, which has an associated orthonormal (uᵀu = 1) eigenvector u. Subsequent roots are ordered such that λ₁ > λ₂ > … > λ_M, with rank(D) non-zero values. The eigenvectors form an orthonormal basis, i.e. uᵢᵀuⱼ = δᵢⱼ. The eigenvalue decomposition of x xᵀ is U Λ Uᵀ, where U = [u₁, u₂, …, u_M] and Λ = diag[λ₁, λ₂, …, λ_M]. Similarly, the eigenvalue decomposition of xᵀx is V Λ Vᵀ. The SVD is closely related to the above: x = U Λ^(1/2) Vᵀ, where U holds the left eigenvectors, V the right eigenvectors, and the singular values are the square roots of the eigenvalues.
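As a quick sanity check of this derivation, the following sketch (Python/NumPy, on made-up random data) verifies that the leading eigenvector of x·xᵀ carries the largest projected variance and that the singular values of the data matrix are the square roots of those eigenvalues. The variable names and the random data are illustrative assumptions, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))            # 5 dimensions, 200 hypothetical observations
X -= X.mean(axis=1, keepdims=True)       # center each dimension

C = X @ X.T                              # 5x5 scatter matrix (x x^T summed over the data)
eigvals, eigvecs = np.linalg.eigh(C)     # eigenvalues in ascending order
u = eigvecs[:, -1]                       # principal eigenvector

# variance of the projections along u equals the principal eigenvalue
proj_var = np.sum((u @ X) ** 2)
print(np.isclose(proj_var, eigvals[-1]))              # True

# singular values of X are the square roots of the eigenvalues of X X^T
s = np.linalg.svd(X, compute_uv=False)
print(np.allclose(np.sort(s ** 2), np.sort(eigvals)))  # True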

Similarly for the next axis, etc. So, the new axes are the eigenvectors of the matrix of correlations of the original variables, which captures the similarities of the original variables based on how data samples project to them

Linear transformation

The first PC retains the greatest amount of variation in the sample The kth PC retains the kth greatest fraction of the variation in the sample The kth largest eigenvalue of the correlation matrix C is the variance in the sample along the kth PC The least-squares view: PCs are a series of linear least squares fits to a sample, each orthogonal to all previous ones

For n original dimensions, correlation matrix is nxn, and has up to n eigenvectors. So n PCs. Where does dimensionality reduction come from?

Dimensionality Reduction

Can ignore the components of lesser significance.

[Scree plot: variance (%) captured by PC1 through PC10, decreasing from left to right]

You do lose some information, but if the eigenvalues are small, you don't lose much

For n dimensions in the original data: calculate the n eigenvectors and eigenvalues; choose only the first p eigenvectors, based on their eigenvalues; the final data set then has only p dimensions
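A minimal sketch of this recipe in Python/NumPy. The function name, the random example data, and the choice of p are assumptions for illustration only.

import numpy as np

def pca_reduce(X, p):
    """Project n-dimensional rows of X onto the first p principal components."""
    Xc = X - X.mean(axis=0)                     # center each original dimension
    C = np.cov(Xc, rowvar=False)                # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenvalues ascending
    order = np.argsort(eigvals)[::-1]           # reorder descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()         # fraction of variance per PC (scree info)
    return Xc @ eigvecs[:, :p], explained

# Example: 100 samples in 10 dimensions reduced to 3
X = np.random.default_rng(1).normal(size=(100, 10))
scores, explained = pca_reduce(X, p=3)
print(scores.shape, explained[:3].sum())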

PCA/FA

Principal Components Analysis

Extracts all the factors underlying a set of variables. The number of factors equals the number of variables, and together they completely explain the variance in each variable

Parsimony?!?!?!

Factor Analysis

Also known as Principal Axis Analysis; analyses only the shared variance

NO ERROR!

Documents are vectors in multi-dim Euclidean space

Each term is a dimension in the space: LOTS of them. The coordinate of doc d in dimension t is proportional to TF(d,t)

TF(d,t) is term frequency of t in document d

This gives a t×d matrix, for t terms and d documents

Queries are treated just like documents

Rank documents based on similarity of their vectors with the query vector

Using cosine distance of vectors
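A toy sketch of this vector-space ranking in Python/NumPy; the vocabulary, documents, and query are made up for illustration.

import numpy as np

vocab = ["car", "automobile", "driver", "elephant"]
docs = ["car driver car", "automobile driver", "elephant elephant"]
query = "car"

def tf_vector(text):
    """Term-frequency vector of a text over the fixed vocabulary."""
    words = text.split()
    return np.array([words.count(t) for t in vocab], dtype=float)

D = np.array([tf_vector(d) for d in docs])   # rows = documents, columns = terms
q = tf_vector(query)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 0.0 if denom == 0 else float(a @ b) / denom

scores = [cosine(q, d) for d in D]
print(sorted(range(len(docs)), key=lambda i: -scores[i]))  # documents ranked by similarity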

Problems

Looks for literal term matches

Terms in queries (especially short ones) don't always capture the user's information need well

Problems:

Synonymy: different words with the same meaning

Car and automobile

Polysemy: the same word with multiple meanings

Apple (fruit and company)

What if we could match against concepts that represent related words, rather than against the words themselves?

Example of Problems

-- Relevant docs may not have the query terms but may have many related terms -- Irrelevant docs may have the query terms but may not have any related terms

LSI uses statistically derived conceptual indices instead of individual words for retrieval. It assumes that there is some underlying or latent structure in word usage that is obscured by variability in word choice. Key idea: instead of representing documents and queries as vectors in a t-dimensional space of terms,

represent them (and the terms themselves) as vectors in a lower-dimensional space whose axes are concepts that effectively group together similar words. These axes are the Principal Components from PCA

Example

Suppose we have keywords

Car, automobile, driver, elephant

We want queries on car to also get docs about drivers and automobiles, but not about elephants

What if we could discover that the car, automobile, and driver axes are strongly correlated, but the elephant axis is not? How? Via correlations observed through documents. If docs A and B don't share any words with each other, but both share lots of words with doc C, then A and B will be considered similar. E.g. A has "car" and "driver", B has "automobile" and "driver"

When you scrunch down dimensions, small differences (noise) get glossed over, and you get the desired behavior


Take the vector representation of the query in the original term space and transform it to the new space.

It turns out that d_q = x_qᵀ U S⁻¹ places the query pseudo-document at the centroid of its corresponding terms' locations in the new space.

Similarity with existing docs is then computed by taking dot products with the rows of the document matrix in the reduced space

Definitions

Let t be the total number of index terms Let N be the number of documents Let (Mij) be a term-document matrix with t rows and N columns

Entries are TF-based weights

The matrix (Mij) can be decomposed into 3 matrices (SVD) as follows: (Mij) = (U) (S) (V)t

(U) is the matrix of eigenvectors derived from (M)(M)t (V)t is the matrix of eigenvectors derived from (M)t(M) (S) is an r x r diagonal matrix of singular values

r = min(t, N), that is, the rank of (Mij). The singular values are the positive square roots of the eigenvalues of (M)(M)ᵗ (equivalently, of (M)ᵗ(M))

Example

term               ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9
controllability     1   1   0   0   1   0   0   1
observability       1   0   0   0   1   1   0   1
realization         1   0   1   0   1   0   1   0
feedback            0   1   0   0   0   1   0   0
controller          0   1   0   0   1   1   0   0
observer            0   1   1   0   1   1   0   0
transfer function   0   0   0   0   1   1   0   0
polynomial          0   0   0   0   1   0   1   0
matrices            0   0   0   0   1   0   1   1

U (9x7) =
 0.3996  -0.1037   0.5606  -0.3717  -0.3919  -0.3482   0.1029
 0.4180  -0.0641   0.4878   0.1566   0.5771   0.1981  -0.1094
 0.3464  -0.4422  -0.3997  -0.5142   0.2787   0.0102  -0.2857
 0.1888   0.4615   0.0049  -0.0279  -0.2087   0.4193  -0.6629
 0.3602   0.3776  -0.0914   0.1596  -0.2045  -0.3701  -0.1023
 0.4075   0.3622  -0.3657  -0.2684  -0.0174   0.2711   0.5676
 0.2750   0.1667  -0.1303   0.4376   0.3844  -0.3066   0.1230
 0.2259  -0.3096  -0.3579   0.3127  -0.2406  -0.3122  -0.2611
 0.2958  -0.4232   0.0277   0.4305  -0.3800   0.5114   0.2010

S (7x7) = diag(3.9901, 2.2813, 1.6705, 1.3522, 1.1818, 0.6623, 0.6487)

V (8x7) =  (rows correspond to documents ch2–ch9)
 0.2917  -0.2674   0.3883  -0.5393   0.3926  -0.2112  -0.4505
 0.3399   0.4811   0.0649  -0.3760  -0.6959  -0.0421  -0.1462
 0.1889  -0.0351  -0.4582  -0.5788   0.2211   0.4247   0.4346
-0.0000  -0.0000  -0.0000  -0.0000   0.0000  -0.0000   0.0000
 0.6838  -0.1913  -0.1609   0.2535   0.0050  -0.5229   0.3636
 0.4134   0.5716  -0.0566   0.3383   0.4493   0.3198  -0.2839
 0.2176  -0.5151  -0.4369   0.1694  -0.2893   0.3161  -0.5330
 0.2791  -0.2591   0.6442   0.1593  -0.1648   0.5455   0.2998

This happens to be a rank-7 matrix, so only 7 dimensions are required. Singular values = square roots of the eigenvalues of A Aᵀ

The key idea is to map documents and queries into a lower dimensional space (i.e., composed of higher level concepts which are in fewer number than the index terms) Retrieval in this reduced concept space might be superior to retrieval in the space of index terms

In the matrix (S), select only the k largest values and keep the corresponding columns in (U) and (V)ᵗ. The resultant matrix (M)k is given by

(M)k = (U)k (S)k (V)tk where k, k < r, is the dimensionality of the concept space

k should be large enough to allow fitting the characteristics of the data, yet small enough to filter out the non-relevant representational detail
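A hedged sketch of the rank-k truncation just described, using NumPy's SVD. The helper name lsi_truncate and the random 0/1 matrix are illustrative assumptions.

import numpy as np

def lsi_truncate(M, k):
    """Return U_k, S_k, Vt_k and the rank-k approximation of a term-document matrix M."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    Mk = Uk @ Sk @ Vtk          # best rank-k approximation in the matrix-norm sense
    return Uk, Sk, Vtk, Mk

# Example with a small random 0/1 term-document matrix
M = (np.random.default_rng(2).random((9, 8)) > 0.6).astype(float)
Uk, Sk, Vtk, Mk = lsi_truncate(M, k=2)
print(Mk.shape)                 # same shape as M, but only rank 2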

In the sense of having to find quantities that are not observable directly. Similarly, transcription factors in biology act as unobservable causal bridges between experimental conditions and gene expression

[Diagram: SVD of the m×n term-document matrix A (rows = terms, columns = documents) into U (m×r), S (r×r), and Vᵀ (r×n). The truncated version keeps Uk (m×k), Sk (k×k), and Vkᵀ (k×n), so that A ≈ Uk Sk Vkᵀ, still m×n.]

Singular Value Decomposition (SVD): convert the term-document matrix into 3 matrices U, S and V

Recreate Matrix: multiply to produce an approximate term-document matrix. Use the new matrix to process queries OR, better, map the query to the reduced space


Keeping only the first two columns of U and V and the top-left 2x2 block of S gives the rank-2 approximation. Formally, this is the rank-k (here k = 2) matrix that is closest to M in the matrix-norm sense.

U2 (9x2) =
 0.3996  -0.1037
 0.4180  -0.0641
 0.3464  -0.4422
 0.1888   0.4615
 0.3602   0.3776
 0.4075   0.3622
 0.2750   0.1667
 0.2259  -0.3096
 0.2958  -0.4232

S2 (2x2) = diag(3.9901, 2.2813)

V2 (8x2) =
 0.2917  -0.2674
 0.3399   0.4811
 0.1889  -0.0351
-0.0000  -0.0000
 0.6838  -0.1913
 0.4134   0.5716
 0.2176  -0.5151
 0.2791  -0.2591

Rank-2 reconstruction U2 S2 V2ᵀ (K = 2, 5 components ignored); rows are the terms in the order above (controllability … matrices), columns are ch2–ch9:

0.52835834 0.42813724 0.30949408 0.0 1.1355368 0.5239192 0.46880865 0.5063048
0.5256176 0.49655432 0.3201918 0.0 1.1684579 0.6059082 0.4382505 0.50338876
0.6729299 -0.015529543 0.29650056 0.0 1.1381099 -0.0052356124 0.82038856 0.6471
-0.0617774 0.76256883 0.10535021 0.0 0.3137232 0.9132189 -0.37838274 -0.06253
0.18889774 0.90294445 0.24125765 0.0 0.81799114 1.0865396 -0.1309748 0.17793834
0.25334513 0.95019233 0.27814224 0.0 0.9537667 1.1444798 -0.071810216 0.2397161
0.21838559 0.55592346 0.19392742 0.0 0.6775683 0.6709899 0.042878807 0.2077163
0.4517898 -0.033422917 0.19505836 0.0 0.75146574 -0.031091988 0.55994695 0.4345
0.60244554 -0.06330189 0.25684044 0.0 0.99175954 -0.06392482 0.75412846 0.5795

Compare with the original 0/1 term-document matrix above.

Rank-4 reconstruction U4 S4 V4ᵀ (K = 4, 3 components ignored):

1.1630535 0.67789733 0.17131016 0.0 0.85744447 0.30088043 -0.025483057 1.0295205
0.7278324 0.46981966 -0.1757451 0.0 1.0910251 0.6314231 0.11810507 1.0620605
0.78863835 0.20257005 1.0048805 0.0 1.0692837 -0.20266426 0.9943222 0.106248446
-0.03825318 0.7772852 0.12343567 0.0 0.30284256 0.89999276 -0.3883498 -0.06326774
0.013223715 0.8118903 0.18630582 0.0 0.8972661 1.1681904 -0.027708884 0.11395822
0.21186034 1.0470067 0.76812166 0.0 0.960058 1.0562774 0.1336124 -0.2116417
-0.18525022 0.31930918 -0.048827052 0.0 0.8625925 0.8834896 0.23821498 0.1617572
-0.008397698 -0.23121 0.2242676 0.0 0.9548515 0.14579195 0.89278513 0.1167786
0.30647483 -0.27917668 -0.101294056 0.0 1.1318822 0.13038804 0.83252335 0.70210195

Rank-6 reconstruction U6 S6 V6ᵀ (K = 6, one component ignored):

1.0299273 1.0099105 -0.029033005 0.0 0.9757162 0.019038305 0.035608776 0.98004794
0.96788234 -0.010319378 0.030770123 0.0 1.0258299 0.9798115 -0.03772955 1.0212346
0.9165214 -0.026921304 1.0805727 0.0 1.0673982 -0.052518982 0.9011715 0.055653755
-0.19373542 0.9372319 0.1868434 0.0 0.15639876 0.87798584 -0.22921464 0.12886547
-0.029890355 0.9903935 0.028769515 0.0 1.0242295 0.98121595 -0.03527296 0.020075336
0.16586632 1.0537577 0.8398298 0.0 0.8660687 1.1044582 0.19631699 -0.11030859
0.035988174 0.01172187 -0.03462495 0.0 0.9710446 1.0226605 0.04260301 -0.023878671
-0.07636017 -0.024632007 0.07358454 0.0 1.0615499 -0.048087567 0.909685 0.050844945
0.05863098 0.019081593 -0.056740552 0.0 0.95253044 0.03693092 1.0695065 0.96087193

Querying

To query for "feedback controller", the query vector would be q = [0 0 0 1 1 0 0 0 0]' (' indicates transpose). The document-space vector corresponding to q is given by Dq = q'*U2*inv(S2): a point at the centroid of the query terms' positions in the new space. For the "feedback controller" query vector, the result is Dq = [0.1376 0.3678]. To find the best document match, we compare the Dq vector against all the document vectors in the 2-dimensional V2 space; the document vector that is nearest in direction to Dq is the best match. The cosine values for the eight document vectors (ch2–ch9) against the query vector are:

-0.3747 0.9671 0.1735 -0.9413 0.0851 0.9642 -0.7265 -0.3805
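A sketch of this query-folding step in Python/NumPy, assuming Uk, Sk, and Vk come from a truncated SVD of the term-document matrix as above (Vk has one row per document). The function names are illustrative, not a prescribed API.

import numpy as np

def fold_in_query(q, Uk, Sk):
    """Project a query term vector q (length = number of terms) into the k-dim concept space."""
    return q @ Uk @ np.linalg.inv(Sk)

def rank_documents(dq, Vk):
    """Cosine similarity of the query pseudo-document against each document row of Vk."""
    sims = []
    for dv in Vk:
        denom = np.linalg.norm(dq) * np.linalg.norm(dv)
        sims.append(0.0 if denom == 0 else float(dq @ dv) / denom)
    return np.argsort(sims)[::-1], sims

# e.g. q = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0], float)  # "feedback controller"
# dq = fold_in_query(q, Uk, Sk); order, sims = rank_documents(dq, Vtk.T)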




[Plot: retrieval performance on Medline data as a function of K, the number of singular values used]

LSI analysis effectively does

Dimensionality reduction; noise reduction; exploitation of redundant data; correlation analysis; and query expansion (with related words)

Some of the individual effects can be achieved with simpler techniques (e.g. thesaurus construction); LSI does them together. LSI handles synonymy well, but not so much polysemy. Challenge: SVD is complex to compute (O(n³))

Needs to be updated as new documents are found/updated

SVD Properties

There is an implicit assumption that the observed data distribution is multivariate Gaussian. PCA/SVD can be considered a probabilistic generative model in which the latent variables are Gaussian; it is sub-optimal in likelihood terms for non-Gaussian distributions. SVD is employed in signal processing for noise filtering: the dominant subspace contains the majority of the information-bearing part of the signal. A similar rationale applies when using SVD for LSI

LSI Conclusions

SVD-defined bases provide precision/recall improvements over term matching

Interpretation is difficult; the optimal dimension is an open question; performance is variable on LARGE collections; supercomputing muscle is required

Clear interpretation of the decomposition; the optimal dimension is an open question; high variability of results due to nonlinear optimisation over a HUGE parameter space

Factor Analysis (e.g. PCA) is not the most sophisticated dimensionality reduction technique

Dimensionality reduction is a useful technique for any classification/regression problem

Text retrieval can be seen as a classification problem

Neural nets, support vector machines etc.

It cannot capture non-linear dependencies between original dimensions (e.g. data that are circularly distributed).

Limitations of PCA

Are the maximal-variance dimensions the relevant dimensions to preserve? Alternatives: Relevant Component Analysis (RCA), Fisher Discriminant Analysis (FDA)

Limitations of PCA

Should the goal be finding independent rather than pairwise-uncorrelated dimensions? Independent Component Analysis (ICA)

PCA vs ICA

Limitations of PCA

The reduction of dimensions for complex distributions may need non-linear processing: Curvilinear Component Analysis (CCA)

A non-linear extension of PCA: it preserves the proximity between points in the input space, i.e. the local topology of the distribution, and enables unfolding of some manifolds in the input data while keeping the local topology

Eigenfaces are the eigenvectors of the covariance matrix of the probability distribution of the vector space of human faces. Eigenfaces are the standardized face ingredients derived from the statistical analysis of many pictures of human faces. A human face may be considered to be a combination of these standard faces

To generate a set of eigenfaces:
1. A large set of digitized images of human faces is taken under the same lighting conditions.
2. The images are normalized to line up the eyes and mouths.
3. The eigenvectors of the covariance matrix of the statistical distribution of face image vectors are then extracted.
4. These eigenvectors are called eigenfaces.

The principal eigenface looks like a bland, androgynous average human face

http://en.wikipedia.org/wiki/Image:Eigenfaces.png

When properly weighted, eigenfaces can be summed together to create an approximate grayscale rendering of a human face. Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces. Hence eigenfaces provide a means of applying data compression to faces for identification purposes. Similarly: Expert Object Recognition in Video
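A hedged sketch of the eigenface computation outlined above, assuming the face images have already been aligned and flattened into equal-length vectors; the function names and the SVD-based shortcut (avoiding the full pixel-by-pixel covariance matrix) are illustrative choices.

import numpy as np

def eigenfaces(face_vectors, k):
    """face_vectors: (num_faces, num_pixels) array. Returns the mean face and the top-k eigenfaces."""
    mean_face = face_vectors.mean(axis=0)
    A = face_vectors - mean_face                 # centered face data
    # SVD of the centered data gives the eigenvectors of the pixel-space covariance
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return mean_face, Vt[:k]                     # each row of Vt[:k] is an eigenface

def face_coefficients(face, mean_face, basis):
    """Weights of a face in the eigenface basis (used for recognition by nearest average)."""
    return basis @ (face - mean_face)

# faces = np.stack([img.ravel() for img in images])   # hypothetical aligned image stack
# mean_face, basis = eigenfaces(faces, k=15)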

Eigenfaces

Experiment and Results: data used here are from the ORL database of faces. Facial images of 16 persons, each with 10 views, are used. Training set contains 167 images; test set contains 163 images. First three eigenfaces:

Save average coefficients for each person. Classify new face as the person with the closest average. Recognition accuracy increases with number of eigenfaces till 15. Later eigenfaces do not help much with recognition.

[Plot: recognition accuracy vs. number of eigenfaces (0–150), for the training and validation sets]

34 patients, dimension of 8973 genes reduced to 2

Plot of 8973 genes, dimension of 34 patients reduced to 2

Data: a subset of sporulation data (477 genes) was classified into seven temporal patterns (Chu et al., 1998). The first 2 PCs contain 85.9% of the variation in the data (Figure 1a). The first 3 PCs contain 93.2% of the variation in the data (Figure 1b)

Sporulation Data

The patterns overlap around the origin in (1a). The patterns are much more separated in (1b).

Variation in processes: Chance

Natural variation inherent in a process. Cumulative effect of many small, unavoidable causes.

Assignable

Variations in raw material, machine tools, mechanical failure and human error. These are accountable circumstances and are normally larger.

Latent variables can sometimes be interpreted as measures of physical characteristics of a process, e.g. temperature or pressure. Variable reduction can increase the sensitivity of a control scheme to assignable causes. Application of PCA to monitoring is increasing

Start with a reference set defining normal operating conditions and look for assignable causes. Generate a set of indicator variables that best describe the dynamics of the process. Note that PCA is sensitive to data types
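One possible sketch of PCA-based monitoring along these lines, in Python/NumPy: fit PCA on the reference (in-control) data, then flag new samples whose squared prediction error is unusually large. The residual statistic and the threshold choice are illustrative assumptions, not a prescribed control limit.

import numpy as np

def fit_reference(X_ref, p):
    """Fit a p-component PCA model to reference (normal-operation) data."""
    mean = X_ref.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_ref - mean, full_matrices=False)
    return mean, Vt[:p]                          # retained loadings (p x n_vars)

def spe(x, mean, loadings):
    """Squared prediction error of one sample with respect to the PCA model."""
    xc = x - mean
    recon = loadings.T @ (loadings @ xc)         # project onto the PCs and back
    return float(np.sum((xc - recon) ** 2))

# X_ref = ... reference data, shape (n_samples, n_vars)
# mean, loadings = fit_reference(X_ref, p=3)
# alarm = spe(new_sample, mean, loadings) > threshold   # threshold chosen from X_ref residuals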

Cocktail-party Problem

Multiple (independent) sound sources in a room; multiple sensors receive signals that are mixtures of the original signals. The task is to estimate the original source signals from the mixtures of received signals. This can be viewed as Blind Source Separation, since the mixing parameters are not known

Cocktail party or Blind Source Separation (BSS) problem

An ill-posed problem, unless assumptions are made!

The most common assumption is that the source signals are statistically independent: knowing the value of one of them gives no information about the others. Methods based on this assumption are called Independent Component Analysis methods

statistical techniques for decomposing a complex data set into independent parts.

It can be shown that under some reasonable conditions, if the ICA assumption holds, then the source signals can be recovered up to permutation and scaling.

[Diagram: the signals from Microphone 1 and Microphone 2 are combined through unmixing weights W11, W12, W21, W22 to produce Separation 1 and Separation 2]

Four sources s1, s2, s3, s4; microphone i records xi(t) = ai1*s1(t) + ai2*s2(t) + ai3*s3(t) + ai4*s4(t), for i = 1..4. In vector-matrix notation, and dropping the index t, this is x = A*s

[Diagram: the sources s1–s4 are mixed through coefficients a11, a12, a13, a14 (and similarly for the other microphones) to give the recorded signals x1–x4.]

This is what the microphones record: a linear mixture of the sources, xi(t) = ai1*s1(t) + ai2*s2(t) + ai3*s3(t) + ai4*s4(t)

Recovered signals

BSS

Problem: Determine the source signals s, given only the mixtures x.

If we knew the mixing parameters aij, then we would just need to solve a linear system of equations. But we know neither aij nor si. ICA was initially developed to deal with problems closely related to the cocktail party problem; later it became evident that ICA has many other applications
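A hedged sketch of blind source separation for this setting, using scikit-learn's FastICA (one standard ICA implementation among several); the simulated sources and mixing matrix are made up for illustration.

import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                         # source 1: sinusoid
s2 = np.sign(np.sin(3 * t))                # source 2: square wave (non-Gaussian, independent)
S = np.c_[s1, s2]                          # true sources, shape (2000, 2)

A = np.array([[1.0, 0.5],                  # "unknown" mixing matrix, used only to simulate
              [0.4, 1.0]])
X = S @ A.T                                # what the microphones record: x = A s

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)               # recovered sources (up to permutation and scaling)
print(S_est.shape, ica.mixing_.shape)      # (2000, 2), (2, 2)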

E.g. from electrical recordings of brain activity at different locations on the scalp (EEG signals), recover the underlying components of brain activity

ICA is a statistical method whose goal is to decompose given multivariate data into a linear sum of statistically independent components. For example, given a two-dimensional vector x = [x1 x2]ᵀ, ICA aims at finding the decomposition

  [x1]   [a11]        [a12]
  [x2] = [a21] * s1 + [a22] * s2

i.e. x = a1*s1 + a2*s2, where a1, a2 are basis vectors and s1, s2 are basis coefficients. Constraint: the basis coefficients s1 and s2 are statistically independent.

Blind source separation; image denoising; medical signal processing (fMRI, ECG, EEG); modelling of the hippocampus and visual cortex; feature extraction, face recognition; compression, redundancy reduction; watermarking; clustering; time series analysis (stock market, microarray data); topic extraction; econometrics: finding hidden factors in financial data

Image denoising

[Figure: image denoising comparison showing the original image, the noisy image, the result of Wiener filtering, and the result of ICA filtering]

Approaches to ICA

Unsupervised Learning

Factorial coding (minimum entropy coding, redundancy reduction); maximum likelihood learning; nonlinear information maximization (entropy maximization); negentropy maximization; Bayesian learning

Higher-order moments or cumulants; joint approximate diagonalization; maximum likelihood estimation

In the heart, PET can:

quantify the extent of heart disease, and calculate myocardial blood flow or metabolism quantitatively

Static image

accumulating the data during the acquisition

Dynamic image

for quantitative analysis (e.g. blood flow, metabolism): the data are acquired sequentially at time intervals

[Diagram: a dynamic PET study as a sequence of frames (frame 1, 2, 3, …, n) stacked along the temporal axis, each frame covering the spatial dimensions]

[Diagram: the dynamic frames y1, y2, y3, …, yN (plus noise) are passed through an unmixing stage g(u) to yield components u1, u2, u3, …, uN representing elementary activities. The extracted independent components correspond to the right ventricle, the left ventricle, and tissue]

In cancer, PET can:

distinguish benign from malignant tumors, and stage cancer by showing metastases anywhere in the body

In the brain, PET can: calculate cerebral blood flow or metabolism quantitatively; positively diagnose Alzheimer's disease for early intervention; locate tumors in the brain and distinguish tumor from scar tissue; locate the focus of seizures for some patients with epilepsy; and find regions related to specific tasks like memory and behavior

Eigenfaces (PCA/FA): finds a linear data representation that best models the covariance structure of face image data (second-order relations). Factorial Faces (Factorial Coding/ICA): finds a linear data representation that best models the probability distribution of the face image data

[Illustration: a face image expressed as a weighted sum of basis images. With eigenfaces (PCA), face ≈ a1*(basis 1) + a2*(basis 2) + a3*(basis 3) + a4*(basis 4) + … + an*(basis n); with factorial-code/ICA bases, face ≈ b1*(basis 1) + b2*(basis 2) + … + bn*(basis n)]

Experimental Results

Figure 1. Sample images in the training set. (neutral expression, anger, and right-light-on from first session; smile and left-light-on from second session)

Figure 2. Sample images in the test set. (smile and left-light-on from first session; neutral expression, anger, and right-light-on from second session)

(a)

(b)

Figure 3. First 20 basis images: (a) in the eigenface method; (b) in the factorial code. They are ordered by column, then by row.

Experimental Results

Figure 1. The comparison of recognition performance using Nearest Neighbor: the eigenface method and Factorial Code Representation using 20 or 30 principal components in the snap-shot method;

Experimental Results

Figure 2. The comparison of recognition performance using MLP Classifier: the eigenface method and Factorial Code Representation using 20 or 50 principal components in the snap-shot method;

PCA vs ICA

Both are linear transforms, used for compression and classification.

PCA: focuses on uncorrelated and Gaussian components; second-order statistics; orthogonal transformation

ICA: focuses on independent and non-Gaussian components; higher-order statistics; non-orthogonal transformation

What if some components are Gaussian and some are non-Gaussian?

All non-Gaussian components can be estimated; for the Gaussian components, only a linear combination can be estimated. If there is only one Gaussian component, the model can be estimated
