You are on page 1of 1

Proposed topic of research work

Speeding-up the k-means clustering method: a prototype based approach

Clustering is a process of grouping a set of input patterns into a few groups where
patterns in a group are more similar than patterns in different groups. The k-means
clustering method is a popular clustering method which derives a set of patterns into k-
groups. The k-means clustering method works as, for a given dataset, initially it selects k
random patterns as k-centers of the clusters then it iteratively assigns each pattern in the
data set to one of these k-centers by minimizing the sum of the squared distances. It is an
iterative procedure requires multiple scans of the data set and it has O(n) time and space
complexity where n is the number of patterns in the data set. This method does not work
efficiently for large data sets like data mining applications. There are several
approximate methods like single pass k-means method which scans the data set only once
to produce the clustering result. There are other algorithmic approaches which speeds-up
the k-means method by building an index like data structure over the data set which
efficiently finds the nearest neighbor of a pattern to one of its k-means and produces
similar result of the k-means clustering method with reduced time complexities. For
example, using a data structure similar to kd-tree to speed-up the process was given by
filtering algorithm to reduce number of centers to be searched for a set of points which
are enclosed in hyper rectangle. This filtering approach is not a good choice for high
dimensional data sets since kd-trees suffer from the curse of dimensionality problem.

This thesis proposes a few approximate methods for speeding-up the k-means clustering
method and its applications to large web and text clustering methods. Approximate
methods means where the clustering results of proposed methods are approximately equal
to the clustering result of the k-means clustering method. But, these methods will reduce
the computational burden and hence can be applied to large data sets like data mining
applications. These methods process the data in a faster pace, but produce the
approximate clustering result as the k-means method. A prototype based methods are
going to propose for this, where prototypes are derived using the leaders clustering
method. Along with prototypes called leader some additional information is also
preserved which enables in deriving the k-means clustering method. The proposed
methods use only a few selected prototypes from the data set along with some additional
information and hence are called approximate methods. These proposed methods are
analyzed using rough set theory and are applied to some large text/web mining
techniques.

Finally, these proposed methods are experimentally compared against k-means clustering
method with clustering results and computational burden which will show that proposed
methods may be better used in some situations where quality of clustering method is not
so important but the time required to generate clusters is important.

You might also like