
ISSN 2319 – 1953

International Journal of Scientific Research in Computer Science Applications and Management Studies

DOCUMENT CLUSTERING BASED ON CORRELATION PRESERVING INDEXING

Prajna Bodapati #1, Sirisha Pativada *2
#1 Dept. of CSE, Andhra University, Visakhapatnam, India
*2 Dept. of CSE, Andhra University, Visakhapatnam, India
1 prajna.mail@gmail.com
2 siriengg87@gmail.com

Abstract— This paper presents a new spectral clustering technique referred to as correlation preserving indexing (CPI), which is performed in the correlation similarity measure space. In this framework, the documents are projected into a low-dimensional semantic space in which the correlations between the documents within the local patches are maximized while the correlations between the documents outside these patches are minimized simultaneously. Since the intrinsic geometrical structure of the document space is usually embedded in the similarities between the documents, correlation as a similarity measure is more suitable for detecting the intrinsic geometrical structure of the document space than Euclidean distance. Consequently, the proposed CPI technique can effectively discover the intrinsic structures embedded in the high-dimensional document space. The effectiveness of the new technique is demonstrated by extensive experiments conducted on various data sets and by comparison with existing document clustering methods.

Keywords— Document clustering, correlation measure, latent semantic indexing, dimensionality reduction.
I. INTRODUCTION

Document clustering aims to automatically group related documents into clusters. It is one of the most important tasks in machine learning and artificial intelligence and has received much attention in recent years [1]-[3]. Based on various distance measures, a number of methods have been proposed to handle document clustering [4]-[10]. A typical and widely used distance measure is the Euclidean distance. The k-means method [4] is one of the methods that use the Euclidean distance; it minimizes the sum of the squared Euclidean distances between the data points and their corresponding cluster centres. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce the computational complexity.

Low computation cost is achieved in spectral clustering methods, in which the documents are first projected into a low-dimensional semantic space and then a traditional clustering algorithm is applied to find the document clusters. Latent semantic indexing (LSI) [7] is one of the effective spectral clustering methods; it aims at finding the best subspace approximation to the original document space by minimizing the global reconstruction error (Euclidean distance). However, because of the high dimensionality of the document space, a certain representation of the documents usually resides on a nonlinear manifold embedded in the similarities between the documents. Thus, LSI is not able to effectively capture the nonlinear manifold structure embedded in the similarities between them [12]. An effective document clustering method must be able to find a low-dimensional representation of the documents that can best preserve the similarities between the data points. The locality preserving indexing (LPI) method is a different spectral clustering method based on graph partitioning theory [8]. The LPI method applies a weighting function to each pair-wise distance, attempting to focus on capturing the similarity structure, rather than the dissimilarity structure, of the documents. However, it does not overcome the essential limitation of Euclidean distance. Furthermore, the selection of the weighting functions is often a difficult task.


In recent years, some studies [13] suggest that correlation as a similarity measure can capture the intrinsic structure embedded in high-dimensional data, especially when the input data are sparse [19]-[20]. In probability theory and statistics, correlation indicates the strength and direction of a linear relationship between two random variables, and it reveals the nature of the data through the classical geometric concept of an "angle". It is a scale-invariant association measure usually used to calculate the similarity between two vectors. In many cases, correlation can effectively represent the distributional structure of the input data, which conventional Euclidean distance cannot explain. The usage of correlation as a similarity measure can be found in the canonical correlation analysis (CCA) method [21]. The CCA method finds projections for paired data sets such that the correlations between their low-dimensional representatives in the projected spaces are mutually maximized. Specifically, given a paired data set consisting of matrices X = {x_1, x_2, ..., x_n} and Y = {y_1, y_2, ..., y_n}, we would like to find directions w_x for X and w_y for Y that maximize the correlation between the projections of X on w_x and the projections of Y on w_y. This can be expressed as

    \max_{w_x, w_y} \frac{\langle X w_x, Y w_y \rangle}{\| X w_x \| \cdot \| Y w_y \|}        (1)

where \langle \cdot , \cdot \rangle and \| \cdot \| denote the inner product and norm operators, respectively.
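To make objective (1) concrete, the short sketch below (not part of the original paper; NumPy, the toy data, and the function name are our assumptions) evaluates the correlation between the projections X w_x and Y w_y for given directions.

    import numpy as np

    def cca_objective(X, Y, wx, wy):
        """Correlation between the projections X @ wx and Y @ wy, as in (1)."""
        u = X @ wx                      # projection of the first view
        v = Y @ wy                      # projection of the second view
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Toy paired data: 50 samples observed in two views (sizes are illustrative).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 10))
    Y = X @ rng.normal(size=(10, 8)) + 0.1 * rng.normal(size=(50, 8))
    print(cca_objective(X, Y, rng.normal(size=10), rng.normal(size=8)))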
As a powerful statistical technique, the CCA method has been applied in the fields of pattern recognition and machine learning [20]-[21]. Rather than finding a projection of one set of data, CCA finds projections for two sets of corresponding data X and Y into a single latent space in which the corresponding points of the two data sets are projected as close to each other as possible. In the application of document clustering, however, while the document matrix X is available, the cluster label Y is not. So the CCA method cannot be directly used for clustering.
In this paper, we propose a new document clustering method based on correlation preserving indexing (CPI), which explicitly considers the manifold structure embedded in the similarities between the documents. It aims to find an optimal semantic subspace by simultaneously maximizing the correlations between the documents in the local patches and minimizing the correlations between the documents outside these patches. The method is implemented on Reuters sample data, and correlation measurements and comparisons between CPI and k-means are presented.
II. DOCUMENT CLUSTERING BASED ON CORRELATION PRESERVING INDEXING

In the high-dimensional document space, the semantic structure is usually implicit. It is desirable to find a low-dimensional semantic subspace in which the semantic structure becomes clear. Hence, discovering the intrinsic structure of the document space is often a primary concern of document clustering. Since the manifold structure is usually embedded in the similarities between the documents, correlation as a similarity measure is suitable for capturing the manifold structure embedded in the high-dimensional document space. Mathematically, the correlation between two vectors u and v is defined as

    corr(u, v) = \frac{u^T v}{\sqrt{u^T u} \, \sqrt{v^T v}}        (2)

Note that the correlation corresponds to an angle θ such that cos θ = corr(u, v). The larger the value of corr(u, v), the stronger the association between the two vectors u and v.

Document clustering aims to group the documents into clusters, which belongs to unsupervised learning. However, it can be transformed into semi-supervised learning by using the following information:

A1) If two documents are close to each other in the original document space, then they tend to be grouped into the same cluster [8].

A2) If two documents are far away from each other in the original document space, they tend to be grouped into different clusters.

Based on these assumptions, we can propose a spectral clustering in the correlation similarity measure space through nearest-neighbours graph learning.
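The correlation of (2) is straightforward to compute. The following sketch is our illustration rather than the authors' code; it follows (2) exactly (no mean-centering), so the value coincides with the cosine of the angle θ.

    import numpy as np

    def corr(u, v):
        """Correlation similarity of (2); equals cos(theta) for the angle between u and v."""
        return float(u @ v / (np.sqrt(u @ u) * np.sqrt(v @ v)))

    u = np.array([1.0, 2.0, 0.0, 3.0])
    v = np.array([2.0, 1.0, 1.0, 2.5])
    c = corr(u, v)
    print(c, np.degrees(np.arccos(c)))   # similarity and the corresponding angle theta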


A. K-Means on Document Sets

The k-means method is one of the methods that use the Euclidean distance; it minimizes the sum of the squared Euclidean distances between the data points and their corresponding cluster centres. Since the document space is always of high dimensionality, it is preferable to find a low-dimensional representation of the documents to reduce the computational complexity.
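As an illustration of this baseline, the sketch below runs k-means on TF-IDF document vectors using scikit-learn; the library, the toy documents, and the parameter choices are our assumptions rather than the paper's experimental setup.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "stock markets fell sharply on monday",
        "shares and bonds rallied after the report",
        "the team won the cup final",
        "the match ended in a goalless draw",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # n_docs x n_terms
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)   # cluster index assigned to each document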
B. Correlation-Based Clustering Criteria

Suppose y_i ∈ Y is the low-dimensional representation of the i-th document x_i ∈ X in the semantic subspace, where i = 1, 2, ..., n. Then the above assumptions A1) and A2) can be expressed as

    \max \sum_{i} \sum_{x_j \in N(x_i)} corr(y_i, y_j)        (3)

    \min \sum_{i} \sum_{x_j \notin N(x_i)} corr(y_i, y_j)        (4)

respectively, where N(x_i) denotes the set of nearest neighbours of x_i. The optimization of (3) and (4) is equivalent to a metric learning of the form d(x, y) ∝ cos(x, y), where d(x, y) denotes the similarity between the documents x and y and the weighting corresponds to whether x and y are nearest neighbours of each other.

The maximization problem (3) is an attempt to ensure that if x_i and x_j are close, then y_i and y_j are close as well. Similarly, the minimization problem (4) is an attempt to ensure that if x_i and x_j are far away, then y_i and y_j are also far away.
Since the following equality always holds,

    \sum_{i} \sum_{y_j \in N(y_i)} corr(y_i, y_j) + \sum_{i} \sum_{y_j \notin N(y_i)} corr(y_i, y_j) = \sum_{i, j} corr(y_i, y_j)        (5)

the simultaneous optimization of (3) and (4) can be achieved by maximizing the following objective function:

    \frac{\sum_{i} \sum_{x_j \in N(x_i)} corr(y_i, y_j)}{\sum_{i, j} corr(y_i, y_j)}        (6)

Without loss of generality, we denote the mapping between the original document space and the low-dimensional semantic subspace by W, i.e., W^T x_i = y_i. Following some algebraic manipulations, we have

    \frac{\sum_{i} \sum_{x_j \in N(x_i)} corr(y_i, y_j)}{\sum_{i, j} corr(y_i, y_j)}
    = \frac{\sum_{i} \sum_{x_j \in N(x_i)} \frac{y_i^T y_j}{\sqrt{y_i^T y_i}\,\sqrt{y_j^T y_j}}}{\sum_{i, j} \frac{y_i^T y_j}{\sqrt{y_i^T y_i}\,\sqrt{y_j^T y_j}}}
    = \frac{\sum_{i} \sum_{x_j \in N(x_i)} \frac{tr(W^T x_i x_j^T W)}{\sqrt{tr(W^T x_i x_i^T W)\, tr(W^T x_j x_j^T W)}}}{\sum_{i, j} \frac{tr(W^T x_i x_j^T W)}{\sqrt{tr(W^T x_i x_i^T W)\, tr(W^T x_j x_j^T W)}}}        (7)
where tr(·) is the trace operator. Based on optimization theory, the maximization of (7) can be written as

    \arg\max_{W} \frac{\sum_{i} \sum_{x_j \in N(x_i)} tr(W^T x_i x_j^T W)}{\sum_{i, j} tr(W^T x_i x_j^T W)}
    = \arg\max_{W} \frac{tr\left(W^T \left(\sum_{i} \sum_{x_j \in N(x_i)} x_i x_j^T\right) W\right)}{tr\left(W^T \left(\sum_{i, j} x_i x_j^T\right) W\right)}        (8)

with the constraint

    tr(W^T x_i x_i^T W) = 1, \quad i = 1, \ldots, n        (9)
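The ratio in (8) can be checked numerically for a candidate projection W. The following sketch is our own illustration under the definitions above; the neighbour lists and the matrix layout (documents as columns) are assumptions of the sketch.

    import numpy as np

    def objective_8(W, X, neighbors):
        """Evaluate the ratio of (8) for a candidate projection W (d x m).

        X is a d x n matrix whose columns are document vectors; neighbors[i]
        lists the indices in N(x_i).  The constraint (9) additionally asks for
        tr(W^T x_i x_i^T W) = 1, i.e. each projected document has unit length.
        """
        d, n = X.shape
        num = 0.0   # sum of tr(W^T x_i x_j^T W) over the local patches
        den = 0.0   # the same sum over all pairs (i, j)
        for i in range(n):
            for j in range(n):
                t = np.trace(W.T @ np.outer(X[:, i], X[:, j]) @ W)
                den += t
                if j in neighbors[i]:
                    num += t
        return num / den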
Consider a mapping W ∈ R^{m×d}, where m and d are the dimensions of the original document space and the semantic subspace, respectively. We need to solve the following constrained optimization:

    \arg\max_{W} \frac{\sum_{i=1}^{d} w_i^T M_S w_i}{\sum_{i=1}^{d} w_i^T M_T w_i}        (10)

subject to

    \sum_{i=1}^{d} w_i^T x_j x_j^T w_i = 1, \quad j = 1, 2, \ldots, n        (11)
Here the matrices M_T and M_S are defined as

    M_T = \sum_{i, j} x_i x_j^T, \qquad M_S = \sum_{i} \sum_{x_j \in N(x_i)} x_i x_j^T
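A direct way to assemble M_S and M_T from a term-document matrix is sketched below. The choice of k-nearest-neighbour patches ranked by the correlation (2) is our assumption; the paper does not fix a particular neighbourhood-construction routine at this point.

    import numpy as np

    def build_ms_mt(X, k=5):
        """Assemble M_S and M_T from a d x n document matrix X (columns are documents).

        N(x_i) is taken here as the k documents with the highest correlation (2)
        to x_i; this neighbourhood rule is an assumption of the sketch.
        """
        d, n = X.shape
        Xn = X / np.linalg.norm(X, axis=0, keepdims=True)   # unit-length columns
        C = Xn.T @ Xn                                       # pairwise correlations
        np.fill_diagonal(C, -np.inf)                        # exclude self-matches
        nbrs = np.argsort(-C, axis=1)[:, :k]                # indices of N(x_i)

        M_S = np.zeros((d, d))
        M_T = np.zeros((d, d))
        for i in range(n):
            for j in range(n):
                outer = np.outer(X[:, i], X[:, j])
                M_T += outer                                # sum over all pairs
                if j in nbrs[i]:
                    M_S += outer                            # sum over the local patches
        return M_S, M_T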
It is easy to validate that the matrix M_T is semi-positive definite. Since the documents are projected into the low-dimensional semantic subspace in which the correlations between the document points among the nearest neighbours are preserved, we call this criterion "correlation preserving indexing" (CPI).

Physically, this model may be interpreted as follows. All documents are projected onto the unit hyper-sphere (a circle in 2D). The global angles between the points within the local neighbourhoods, β_i, are minimized and the global angles between the points outside the local neighbourhoods, α_i, are maximized simultaneously, as illustrated in Fig. 2.1. On the unit hyper-sphere, a global angle can be measured by the spherical arc, that is, the geodesic distance. The geodesic distance between Z and Z' on the unit hyper-sphere can be expressed as

    d_G(Z, Z') = \arccos(Z^T Z') = \arccos(corr(Z, Z'))        (12)

Fig. 2.1: Projections of CPI.

Since a strong correlation between Z and Z' means a small geodesic distance between Z and Z', CPI is equivalent to simultaneously minimizing the geodesic distances between the points in the local patches and maximizing the geodesic distances between the points outside these patches. The geodesic distance is superior to the traditional Euclidean distance in capturing the latent manifold [14]. Based on this conclusion, CPI can effectively capture the intrinsic structures embedded in the high-dimensional document space.

It is worth noting that semi-supervised learning using the nearest-neighbours graph approach in the Euclidean distance space was originally proposed in [15] and [16], and LPI is also based on this idea. In contrast, CPI is a semi-supervised learning approach that uses the nearest-neighbours graph in the correlation measure space. Zhong and Ghosh showed that Euclidean distance is not appropriate for clustering high-dimensional normalized data such as text, and that a better metric for text clustering is the cosine similarity [12]-[13]. Lebanon [17] proposed a distance metric for text documents, which was defined as

    d_{F^*}(x, y) = \arccos\left( \sum_{i=1}^{n+1} \sqrt{ \frac{x_i \lambda_i \, y_i \lambda_i}{\langle x, \lambda \rangle \langle y, \lambda \rangle} } \right)

where λ is a parameter vector obtained from training data. This distance is very similar to the distance defined by (12). Since the distance d_{F^*}(x, y) is local and is defined on the entire embedding space [17], correlation might be a

suitable distance measure for capturing the intrinsic structure embedded in the document space. That is why the proposed CPI method is expected to outperform the LPI method. Note that the distance d_{F^*}(x, y) can be obtained based on the training data, and it can be used for classification rather than clustering.
III. RELATED WORK

A. Clustering Algorithm Based on CPI

Given a set of documents x_1, x_2, ..., x_n ∈ R^d, let X denote the document matrix. The algorithm for document clustering based on CPI can be summarized as follows:

a. Construct the local neighbour patches, and compute the matrices M_S and M_T.

b. Project the document vectors into the SVD subspace by throwing away the zero singular values. The singular value decomposition of X can be written as X = UΣV^T. Here all zero singular values in Σ have been removed; accordingly, the vectors in U and V that correspond to these zero singular values have been removed as well. Thus the document vectors in the SVD subspace can be obtained as U^T X.
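Step b can be realized with a standard SVD routine. The sketch below is a minimal illustration (NumPy-based, with a numerical tolerance of our choosing) of discarding zero singular values and projecting the documents.

    import numpy as np

    def svd_project(X, tol=1e-10):
        """Step b: express the documents (columns of X) in the SVD subspace of X,
        discarding singular values that are numerically zero."""
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        U = U[:, s > tol]          # drop directions belonging to zero singular values
        return U.T @ X             # document vectors in the SVD subspace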
IV. COMPLEXITY ANALYSIS

The time complexity of the CPI clustering algorithm can be analysed as follows. Consider n documents in the d-dimensional space (d ≫ n). In step a, we need to compute the pair-wise distances, which needs O(n²d) operations. Secondly, we need to find the k nearest neighbours for each data point, which needs O(kn²) operations. Thirdly, computing the matrices M_S and M_T requires O(n²d) operations and O(n(n − k)d) operations, respectively. Thus, the computation cost of step a is O(2n²d + kn² + n(n − k)d). In step b, the SVD decomposition of the matrix X needs O(d³) operations and projecting the documents into the n-dimensional SVD subspace takes O(mn²) operations. As a result, step b costs O(d³ + n²d). Then, transforming the documents into the m-dimensional semantic subspace requires O(mn²) operations. In step c, it takes O(lcmn) operations to find the final document clusters, where l is the number of iterations and c is the number of clusters. Since k ≪ n, l ≪ n and m, n ≪ d in document clustering applications, step b will dominate the computation. To reduce the computation cost of step b, one can apply the iterative SVD algorithm [18] rather than a matrix decomposition algorithm, or apply a feature selection method to first reduce the dimension.

V. DOCUMENT REPRESENTATION

In all experiments, each document is represented as a term frequency vector. The term frequency vector can be computed as follows:

1) Transform the documents into a list of terms after word stemming operations.

2) Remove stop words. Stop words are common words that contain no semantic content.

3) Compute the term frequency vector using the TF/IDF weighting scheme. The TF/IDF weight assigned to the term t_i in document d_j is given by

    (tf\text{-}idf)_{i,j} = tf_{i,j} \times idf_i

Here tf_{i,j} = n_{i,j} / \sum_{k} n_{k,j} is the term frequency of the term t_i in document d_j, where n_{i,j} is the number of occurrences of the considered term t_i in document d_j, and idf_i = \log \left( |D| / |\{d : t_i \in d\}| \right) is the inverse document frequency, which is a measure of the general importance of the term t_i; |D| is the total number of documents in the corpus and |{d : t_i ∈ d}| is the number of documents in which the term t_i appears. Let v = {t_1, t_2, ..., t_m} be the list of terms after the stop-word removal and word stemming operations. The term frequency vector X_j of document d_j is defined as

    X_j = [x_{1j}, x_{2j}, \ldots, x_{mj}], \quad x_{ij} = (tf\text{-}idf)_{i,j}

Using n documents from the corpus, we construct an m × n term-document matrix X. The above process can be completed by using the Text to Matrix Generator (TMG) code.
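A minimal sketch of steps 1)-3) is given below, assuming the documents have already been stemmed and stop-word filtered into token lists; it builds the m × n term-document matrix with the tf-idf weights defined above (the paper itself uses the TMG toolbox for this step, so the NumPy version is only our illustration).

    import numpy as np
    from collections import Counter

    def tfidf_matrix(tokenized_docs):
        """Build the m x n term-document matrix with the tf-idf weights defined above."""
        terms = sorted({t for doc in tokenized_docs for t in doc})
        index = {t: i for i, t in enumerate(terms)}
        n = len(tokenized_docs)
        X = np.zeros((len(terms), n))
        df = Counter(t for doc in tokenized_docs for t in set(doc))   # |{d : t in d}|
        for j, doc in enumerate(tokenized_docs):
            counts = Counter(doc)
            total = sum(counts.values())                              # sum_k n_{k,j}
            for t, cnt in counts.items():
                tf = cnt / total                                      # tf_{i,j}
                idf = np.log(n / df[t])                               # idf_i
                X[index[t], j] = tf * idf
        return X, terms

    docs = [["stock", "market", "fall"], ["team", "win", "cup"], ["stock", "rally"]]
    X, vocab = tfidf_matrix(docs)
    print(X.shape)   # (m terms, n documents)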
VI. CLUSTERING RESULTS

Experiments were performed on the Reuters and OHSUMED data sets. We compared the proposed algorithm with other competing algorithms under the same experimental setting. In all experiments, our algorithm performs better than or competitively with the other algorithms. A sample of the pairwise correlation measurements computed for the Reuters documents is listed below:

Cor 0:1: 0.04395113068664207
...
Cor 26:27: 0.051122792602862815
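The exact format of the listing above comes from the authors' implementation, which is not reproduced here. As a rough illustration only, the sketch below computes and prints pairwise correlations between consecutive projected documents in a similar style; the format string and the consecutive pairing are assumptions.

    import numpy as np

    def print_pairwise_correlations(Y):
        """Print corr(y_i, y_j) for consecutive pairs of (projected) documents.

        Y is an m x n matrix whose columns are document vectors; the
        "Cor i:j: value" layout merely mimics the listing above."""
        Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
        for i in range(Y.shape[1] - 1):
            print(f"Cor {i}:{i + 1}: {float(Yn[:, i] @ Yn[:, i + 1])}")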

Fig. 5.1: Performance comparisons of different clustering methods using the Reuters dataset.

VII. CONCLUSIONS

We presented a new document clustering method based on correlation preserving indexing. It simultaneously maximizes the correlation between the documents in the local patches and minimizes the correlation between the documents outside these patches. Consequently, a low-dimensional semantic subspace is derived in which the documents corresponding to the same semantics are close to each other. Extensive experiments on the NG20, Reuters and OHSUMED corpora show that the proposed CPI method outperforms other classical clustering methods. Furthermore, the CPI method has good generalization capability and thus can effectively deal with data of very large size.

REFERENCES

[1] R. T. Ng and J. Han, "Efficient and effective clustering methods for spatial data mining," in Proceedings of the 20th VLDB Conference, Santiago. New York, NY, USA: ACM, 1994, pp. 144-155.
[2] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, 1999.
[3] S. Kotsiantis and P. Pintelas, "Recent advances in clustering: A brief survey," WSEAS Transactions on Information Science and Applications, vol. 1, no. 1, pp. 73-81, 2004.
[4] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. University of California Press, 1967, pp. 281-297.
[5] L. D. Baker and A. K. McCallum, "Distributional clustering of words for text classification," in Proceedings of SIGIR-98, 21st ACM International Conference on Research and Development in Information Retrieval, Melbourne, AU. New York, NY, USA: ACM, 1998, pp. 96-103.
[6] X. Liu, Y. Gong, W. Xu, and S. Zhu, "Document clustering with cluster refinement and model selection capabilities," in SIGIR '02: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2002, pp. 191-198.
[7] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391-407, 1990.
[8] D. Cai, X. He, and J. Han, "Document clustering using locality preserving indexing," IEEE Trans. on Knowledge and Data Engineering, vol. 17, no. 12, pp. 1624-1637, 2005.
[9] W. Xu, X. Liu, and Y. Gong, "Document clustering based on non-negative matrix factorization," in SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2003, pp. 267-273.
[10] S. Zhong and J. Ghosh, "Generative model-based document clustering: a comparative study," Knowl. Inf. Syst., vol. 8, no. 3, pp. 374-384, 2005.
[11] D. K. Agrafiotis and H. Xu, "A self-organizing principle for learning nonlinear manifolds," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 25, pp. 15869-15872, 2002.
[12] S. Zhong and J. Ghosh, "Scalable, balanced model-based clustering," in Proc. 3rd SIAM Int. Conf. Data Mining. New York, NY, USA: ACM, 2003, pp. 71-82.
[13] Y. Fu, S. Yan, and T. S. Huang, "Correlation metric for generalized feature extraction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 12, pp. 2229-2235, Dec. 2008.
[14] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319-2323, December 2000.
[15] X. Zhu, Z. Ghahramani, and J. Lafferty, "Semi-supervised learning using Gaussian fields and harmonic functions," in ICML-03, 20th International Conference on Machine Learning, 2003.
[16] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison, Tech. Rep., 2005.
[17] G. Lebanon, "Metric learning for text documents," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 497-507, 2006.
[18] P. Strobach, "Bi-iteration SVD subspace tracking algorithms," IEEE Transactions on Signal Processing, vol. 45, no. 5, pp. 1222-1240, May 1997.
[19] Y. Ma, S. Lao, E. Takikawa, and M. Kawade, "Discriminant analysis in correlation similarity measure space," in ICML '07: Proceedings of the 24th International Conference on Machine Learning, 2007, pp. 577-584.
[20] R. D. Juday, B. V. K. Kumar, and A. Mahalanobis, Correlation Pattern Recognition. Cambridge University Press, 2005.
[21] D. R. Hardoon, S. R. Szedmak, and J. R. Shawe-Taylor, "Canonical correlation analysis: An overview with application to learning methods," Neural Comput., vol. 16, no. 12, pp. 2639-2664, 2004.
