
INTERNATIONAL JOURNAL FOR TRENDS IN ENGINEERING & TECHNOLOGY
VOLUME 3 ISSUE 3 MARCH 2015 ISSN: 2349-9303

Dimensionality Reduction Techniques for Document Clustering - A Survey

K.Aravi
Thiyagarajar College of Engineering,
Department of Computer Science and Engineering
vanigk4@gmail.com

G.Kalaivani
Thiyagarajar College of Engineering,
Department of Computer Science and Engineering
kalaitce22@gmail.com

Abstract - Dimensionality reduction techniques are applied to get rid of inessential terms, such as redundant and noisy terms, in
documents. In this paper a systematic study is conducted of seven dimensionality reduction methods: Latent Semantic
Indexing (LSI), Random Projection (RP), Principal Component Analysis (PCA), CUR Decomposition, Latent Dirichlet
Allocation (LDA), Singular Value Decomposition (SVD), and Linear Discriminant Analysis (LDA).

Index Terms - Document clustering, CUR decomposition, Latent Dirichlet Allocation, Latent Semantic Indexing, Principal
Component Analysis, Random Projection, Singular Value Decomposition.

1 INTRODUCTION
Text documents are stored in large databases; if a user needs to
search or retrieve a specific document, this task is difficult to do
directly. Clustering the text documents therefore makes searching and
retrieving the data easier. But high-dimensional datasets are not
efficient for clustering and information retrieval. High dimensionality
means that a document collection contains more than ten thousand
terms. It slows down computing the distance between documents and
thereby slows down the clustering process as well. Each term in the
documents is represented as a dimension in a vector space.
Preprocessing steps such as stop-word removal and stemming are
performed before dimensionality reduction. Redundant and noisy data
should be reduced for efficient document clustering, since this
improves the performance of clustering.

2. DIMENSIONALITY REDUCTION TECHNIQUES


The curse of dimensionality refers to the fact that as the number of
dimensions increases, the size of the space grows, which leads to
sparseness and increases running time. Dimensionality reduction is the
transformation of high-dimensional data into a low-dimensional
space. The main idea of this technique is to reduce the number of
dimensions without much loss of information.

2.1 Principal Component Analysis


PCA aims to find the correlation between the documents. The
correlation value ranges from -1 to 1. It reduces the sparseness in
documents. The objective is to maximize the variance of the
projected data.

Step 1: The original matrix holds the documents projected in vector
space. Taking X1, ..., XN as column vectors, each of which has M
rows, place the column vectors into a single matrix X of dimensions
M x N.

Step 2: Calculate the mean of the documents:

mean = (1/N) (X1 + X2 + ... + XN) ..........(1)

Step 3: Calculate the covariance matrix:

cov(X, Y) = sum_i (Xi - mean(X)) (Yi - mean(Y)) / (N - 1) ..........(2)

where
mean(X) = mean of the 1st dataset,
Xi = 1st dataset raw score,
mean(Y) = mean of the 2nd dataset,
Yi = 2nd dataset raw score.

Step 4: Calculate the eigenvectors and eigenvalues of the covariance
matrix (2).

Step 5: Finally, dimensions are reduced by multiplying matrix X
with the eigenvector matrix Ek:

X_PCA = Ek^T X ..........(3)

The number of terms of the original data matrix X is reduced by
multiplying it with a d x k matrix Ek, which holds the k eigenvectors
corresponding to the k largest eigenvalues. The resulting matrix is
X_PCA.
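A minimal NumPy sketch of Steps 1-5 is given below. The matrix
sizes and the random data are invented for illustration; only the
projection X_PCA = Ek^T X of equation (3) comes from the text above.

```python
import numpy as np

# Toy term-document matrix X: M terms (rows) x N documents (columns).
rng = np.random.default_rng(0)
M_dim, N_docs, k = 100, 20, 5
X = rng.random((M_dim, N_docs))

# Step 2: mean over the documents (columns), then center the data.
mean = X.mean(axis=1, keepdims=True)
Xc = X - mean

# Step 3: covariance matrix of the centered data (equation (2)).
C = (Xc @ Xc.T) / (N_docs - 1)

# Step 4: eigenvalues/eigenvectors of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)

# Step 5: keep the k eigenvectors with the largest eigenvalues and
# project: X_PCA = Ek^T X (equation (3)).
Ek = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
X_pca = Ek.T @ Xc
print(X_pca.shape)   # (5, 20): k rows, one column per document
```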

PROS and CONS
(+) It works well for sparse data.
(-) It can be applied only to linear datasets.

2.2 Random Projection


Random projection is a technique which projects the documents onto
a randomly chosen low-dimensional space. Its equation is as follows:

X_RP = R X ..........(4)

where
N = total number of documents,
d = original dimension of the documents,
k = desired dimension,
R = a k x d random matrix,
X = the d x N matrix of the documents projected in vector space.

We reduce the number of features by multiplying the random matrix
R with the original document matrix X.
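A minimal NumPy sketch of equation (4). The paper does not specify
the distribution of R, so the scaled Gaussian matrix below is an
assumption of this sketch:

```python
import numpy as np

# Random projection X_RP = R X (equation (4)) on toy data.
rng = np.random.default_rng(1)
d, N, k = 10000, 50, 100            # original dim, documents, target dim

X = rng.random((d, N))              # d x N document matrix (toy data)
# Assumed Gaussian R, scaled so distances are roughly preserved.
R = rng.standard_normal((k, d)) / np.sqrt(k)

X_rp = R @ X                        # k x N reduced matrix
print(X_rp.shape)                   # (100, 50)
```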

2.3 Singular Value Decomposition


SVD is a technique for matrix dimensionality reduction, based on a
theorem of linear algebra which says that a rectangular matrix M can
be decomposed into the product of three matrices: an orthogonal
matrix U, a diagonal matrix S, and the transpose of an orthogonal
matrix V.

M = U S V^T ..........(5)

M is an m x n matrix whose rows and columns represent the
documents projected in vector space. U, the left singular matrix, is
obtained from M M^T by taking its eigenvalues and eigenvectors and
orthonormalizing them. S is a diagonal matrix of singular values. V,
the right singular matrix, is obtained in the same way from M^T M,
and its transpose V^T is taken. At the end, dimensions are reduced by
discarding the singular values that are zero or close to zero.
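A minimal NumPy sketch of equation (5) on toy data, truncating the
decomposition to the r largest singular values (the exact truncation
rule is an assumption of this sketch):

```python
import numpy as np

# Truncated SVD of a toy document matrix: M = U S V^T (equation (5)).
rng = np.random.default_rng(2)
M = rng.random((80, 40))            # m x n document matrix (toy data)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Keep the r largest singular values; the discarded ones are the
# small values near zero mentioned above.
r = 10
M_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print(np.linalg.norm(M - M_r))      # reconstruction error of rank-r matrix
```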

PROS and CONS
(+) Reduces the sparseness in the data.

2.4 Latent Semantic Indexing

LSI is one of the standard dimensionality reduction techniques in
information retrieval. A collection of documents can be represented
as a huge term-document matrix, from which various things can be
computed, such as how close two documents are to a user-issued
query. In this matrix, each word is a row and each document is a
column, and each cell contains the number of times that word occurs
in that document. LSI transforms the original data into a different
space so that two documents/words about the same concept are
mapped close to each other (so that they have higher cosine
similarity). LSI achieves this by a Singular Value Decomposition
(SVD) of the term-document matrix, which embeds the original
high-dimensional space into a lower-dimensional space with minimal
distance distortion. A user query q is projected into the reduced
space as

Q = q^T U S^-1 ..........(6)

PROS and CONS
(+) Finds relationships between the terms in a document.
(-) Each and every user query must be projected into the vector
space.
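A minimal NumPy sketch of LSI with the query fold-in of equation
(6). The corpus is a random toy matrix, and ranking documents by
cosine similarity in the concept space is an assumed usage of this
sketch, not spelled out above:

```python
import numpy as np

# LSI: SVD of a toy term-document matrix A, then fold a query into
# the k-dimensional concept space with Q = q^T U S^-1 (equation (6)).
rng = np.random.default_rng(3)
A = rng.integers(0, 5, size=(200, 30)).astype(float)  # terms x docs

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

q = rng.integers(0, 2, size=200).astype(float)  # query term vector
Q = q @ Uk @ np.diag(1.0 / sk)                  # folded-in query

# Documents in concept space; rank them by cosine similarity to Q.
docs = Vtk.T * sk
sims = (docs @ Q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(Q))
print(np.argsort(sims)[::-1][:5])               # top-5 document indices
```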

2.5 Latent Dirichlet Allocation

LDA is a way of automatically discovering the topics in documents.
It represents documents as mixtures of topics that emit words with
certain probabilities. Topic modeling is a classic problem in
information retrieval. LDA discovers the different topics used,
discovers the mixture of topics in each document, discovers which
words belong to each topic, and classifies unseen documents into
those topics. Topic modeling associates with each document a
probability distribution over "topics", which are in turn distributions
over words. The task of parameter estimation in these models is to
learn both what the topics are and which documents use them in
what proportions. The objective of LDA is to perform dimensionality
reduction while preserving as much of the document discriminatory
information as possible. It seeks to find directions along which the
topics are best separated.
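A minimal sketch using scikit-learn's LatentDirichletAllocation; the
tiny corpus and the choice of two topics are invented for
illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus: two rough themes (text mining vs. sports).
corpus = [
    "machine learning improves document clustering",
    "dimensionality reduction speeds up document clustering",
    "the football team scored in the final match",
    "the team won the championship match",
]

counts = CountVectorizer().fit_transform(corpus)    # doc-term counts

# Each document becomes a 2-dimensional topic-mixture vector, a
# reduced representation that a clustering algorithm can consume.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.round(2))
```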

2.6 Linear Discriminant Analysis

LDA computes a transformation that maximizes the between-class
scatter while minimizing the within-class scatter. It does so by
taking into consideration not only the scatter within document
classes but also the scatter between document classes. The
methodology is as follows.

Suppose there are n document classes.
Let mu_i be the mean vector of class i, i = 1, 2, ..., n.
Let M_i be the number of terms within class i, i = 1, 2, ..., n.
Let M = M_1 + M_2 + ... + M_n be the total number of terms.

Within-class scatter matrix:

Sw = sum_{i=1..n} sum_{x in class i} (x - mu_i)(x - mu_i)^T

Between-class scatter matrix:

Sb = sum_{i=1..n} M_i (mu_i - mu)(mu_i - mu)^T

where mu is the overall mean. LDA then finds the projection W that
maximizes the ratio |W^T Sb W| / |W^T Sw W|. Such a transformation
should retain class separability while reducing the variation due to
sources other than class identity.

PROS and CONS
(+) Maximizes topic (class) separability.
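A minimal NumPy sketch of the construction above, computing Sw and
Sb directly on toy labelled document vectors and taking the
projection from the leading eigenvectors of Sw^-1 Sb (the
pseudo-inverse below is a safeguard chosen for this sketch):

```python
import numpy as np

# Toy data: two classes of 50-dimensional document vectors.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (20, 50)),       # class 0
               rng.normal(3, 1, (20, 50))])      # class 1
y = np.array([0] * 20 + [1] * 20)

mu = X.mean(axis=0)                              # overall mean
Sw = np.zeros((50, 50))                          # within-class scatter
Sb = np.zeros((50, 50))                          # between-class scatter
for c in np.unique(y):
    Xc = X[y == c]
    mu_c = Xc.mean(axis=0)
    Sw += (Xc - mu_c).T @ (Xc - mu_c)
    Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)

# Maximize |W^T Sb W| / |W^T Sw W| via eigenvectors of Sw^-1 Sb.
eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
W = eigvecs[:, np.argsort(eigvals.real)[::-1][:1]].real

X_lda = X @ W                                    # reduced to 1 dimension
print(X_lda.shape)                               # (40, 1)
```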

2.7 CUR Decomposition


A CUR decomposition expresses a matrix as the product of three
matrices. A CUR approximation can be used in the same way as the
low-rank approximation of the Singular Value Decomposition
(SVD). CUR approximations are less accurate than the SVD, but the
rows and columns come from the original matrix.

A = C U R ..........(7)

Matrix A holds the documents projected in vector space. Matrix C
consists of columns of matrix A, matrix R consists of rows of matrix
A, and matrix U is the pseudo-inverse of the intersection of C and R.

PROS and CONS
(+) The vectors are original columns and rows of the matrix.
(-) Columns and rows may be duplicated, so the matrix can be very
large and take more time to compute.
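A naive NumPy sketch of equation (7). Practical CUR algorithms
sample columns and rows by leverage scores; the uniform sampling
below is a simplifying assumption for illustration:

```python
import numpy as np

# CUR approximation A ~ C U R (equation (7)) on a toy matrix.
rng = np.random.default_rng(5)
A = rng.random((60, 40))

c_idx = rng.choice(A.shape[1], size=15, replace=False)  # sampled columns
r_idx = rng.choice(A.shape[0], size=15, replace=False)  # sampled rows

C = A[:, c_idx]                                 # columns of A
R = A[r_idx, :]                                 # rows of A
U = np.linalg.pinv(A[np.ix_(r_idx, c_idx)])     # pinv of the intersection

print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))  # relative error
```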

3. CONCLUSION
We conclude that dimensionality reduction is an efficient technique
to cut back the inessential terms in documents. In this paper we
surveyed the major dimensionality reduction approaches that can be
applied to document clustering. These techniques can be applied to
linear datasets, and they mainly reduce the sparseness in the data.

4. ACKNOWLEDGMENTS
We would like to thank UCI repository of machine learning
databases for the use of several of their public datasets.
