
International Journal of Computer Trends and Technology (IJCTT), Volume 4, Issue 9, September 2013. ISSN: 2231-2803. http://www.ijcttjournal.org


Comparison of Graph Clustering Algorithms
Aditya Dubey #1, Sanjiv Sharma #2
Department of CSE/IT
Madhav Institute of Technology & Science
Gwalior, India


Abstract: Clustering algorithms are one of the ways of extracting valuable information from a large database by partitioning it. All clustering algorithms share the same main goal: to find clusters that maximize the similarity within each cluster while reducing the similarity between different clusters. Beyond this shared goal, each algorithm targets a different problem domain. In this paper, two algorithms, K-means and spectral clustering, are described. Both algorithms are tested and evaluated on datasets drawn from different applications. The silhouette index is used to measure the efficiency of each clustering algorithm, and the performance and accuracy of both algorithms are presented and compared using this validity index.

Keywords: Silhouette index, Clustering, Spectral clustering, Normalized Laplacian matrix, Datum.
I. INTRODUCTION
Clustering is one of the most extensively used techniques for exploratory data analysis, with applications ranging from statistics, pattern recognition, and image segmentation to biology and the social sciences. In virtually every scientific field dealing with experimental data, people want to get a first impression of their data by trying to identify groups of similar behavior. The idea of data clustering is simple in nature and mirrors human thinking: whenever we face a large amount of data, we tend to summarize it into a small number of groups or categories in order to facilitate further analysis.
When data objects are distributed as compact, well-separated groups, the clusters are said to be well defined, and such clusters are referred to as natural clusters. When the groups are not compact, or when they overlap, the clusters are not well defined, and producing a clear and meaningful description of them becomes a vital task. Moreover, most data obtained from real applications have inherent properties that lend themselves to natural groupings. Still, categorizing data or finding groupings within it is not a simple task for humans unless the data is of low dimensionality. For this reason, several methods in soft computing have been proposed to solve this kind of problem; these methods are referred to as data clustering methods, and they are the subject of this paper.
Clustering methods are used not only to categorize data but also for data compression and model construction. For example, by finding similarity in the data, one can represent similar data behavior using fewer symbols; and based on the groupings of the data, one can build a model of the problem. By clustering the data, relevant knowledge hidden in it can be found. Francisco Azuaje [2] proposed a Case-Based Reasoning (CBR) system based on the Growing Cell Structure (GCS) model. In CBR, data is stored in a knowledge base that is indexed or grouped by cases, and each group of cases is assigned to a particular category. In GCS, data can be added to or removed from the dataset according to the learning scheme used. When a query for information is given as input to the model, the system selects the most suitable cases from the case base according to how close those cases are to the query.
Recently, a variety of clustering algorithms have been proposed to handle data that is not linearly separable, so that the clusters still give a good representation of the underlying facts or characteristics. The main goal of all these algorithms remains to maximize intra-cluster similarity while reducing inter-cluster similarity [1]; beyond this goal, each algorithm is built on a different idea of clustering. In this paper, two efficient and popular graph clustering algorithms, spectral clustering and K-means clustering, are discussed and compared on datasets taken from different applications.
The efficiency of these clustering algorithms depends on how well each data point satisfies the common properties of its cluster and how dissimilar it is to the other clusters. In this paper, the two clustering techniques are implemented and tested against different problem datasets. To compare their performance on data obtained from different applications, the Silhouette Index validation method is used; the Silhouette Index provides a measure of how good a partition is.
Section II presents an overview of graph clustering techniques. Section III describes the spectral and K-means clustering algorithms in detail. The silhouette validity index, which describes the efficiency of a clustering algorithm, is covered in Section IV. Section V presents the implementation of these algorithms on datasets obtained from different applications and discusses the results of each technique. Section VI gives a brief conclusion, and references are listed in Section VII.
II. GRAPH CLUSTERING OVERVIEW
As mentioned earlier, the goal of a graph clustering algorithm is to partition a graph into k sub-graphs. Many graph partitioning algorithms have been proposed recently. The efficiency of a graph clustering algorithm depends on the degree to which data points satisfy the properties of their own cluster and do not satisfy the properties of any other cluster. Problems therefore arise when clusters overlap: the more the clusters overlap, the lower the efficiency of the clustering algorithm.
Some clustering techniques use cluster centers to represent each cluster; the K-means algorithm works on this principle. Other clustering algorithms use either the similarity or the dissimilarity between pairs of data points by constructing a similarity or dissimilarity matrix. Partitioning Around Medoids (PAM) is one of the algorithms that uses a dissimilarity matrix for its clustering calculation. Spectral clustering, by contrast, uses a similarity graph.
In some clustering algorithms, the number of clusters to construct must be known in advance; such algorithms fail when the number of partitions is not known, which can be a drawback. Both clustering algorithms described in this paper require the number of clusters to be given. However, there are clustering algorithms that do not require the user to specify the number of clusters; examples include mountain clustering and subtractive clustering. Each of these algorithms has its advantages and disadvantages relative to the others. To partition a graph, the following questions must be considered:

- What is the specific measure of a good partition?
- How can such partitions be computed efficiently?
- How many partitions are required for a good representation of the data?
- How can a clustering algorithm be used to detect outliers?
The first technique for data clustering is K-means clustering [3]. It has been applied in many fields, including statistics, pattern recognition, image segmentation, biology, and the social sciences. The algorithm works by finding cluster centers and then using these centers to assign each data point to a cluster. K-means is simple to implement and remains an extremely fast clustering algorithm, which is one of its advantages. A detailed discussion follows in the next section.
The second technique for clustering the data is spectral clustering [5]. This algorithm has been tested against a variety of datasets derived from different applications, and it has been found to be more effective at finding clusters. It uses the eigenvalues of a graph Laplacian to find the clusters. Spectral clustering has become one of the most popular algorithms: it is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms other clustering algorithms. It can also be used for outlier detection in a clustered graph.
III. GRAPH CLUSTERING TECHNIQUES
Consider a dataset consisting of N data points (rows), where D is the number of dimensions (columns) used to characterize a single datum. Let G be the number of clusters or sub-graphs to construct.
A. K-means clustering algorithm
Step 1: Input a dataset and the number of clusters to construct.

Step 2: Construct a graph (V, E) in which the vertices represent the data points and the edges represent the relationship or similarity between them, derived from the attributes (columns) of the dataset.

Step 3: Randomly initialize G cluster centers C = c_1, c_2, ..., c_G.

Step 4: Assign each data point in the graph to the cluster with the closest centroid. To do this, determine the membership matrix M of dimensions G x N: the element m_{i,j} of M is 1 if the distance of data point x_j from cluster center c_i is less than its distance from every other centroid, and 0 otherwise.

Step 5: Calculate the cost function F as the sum, over every cluster, of the distances from each data point to its cluster centroid. Exit from these steps if the cost function falls below a certain threshold or if its improvement over the previous iteration becomes very small.
Step 6: Update the center of each cluster to the mean of all the data points lying in that cluster, then return to Step 4.
Since the cluster centers are chosen randomly in the initialization step, the performance of the algorithm depends heavily on this selection. It is therefore preferable to run the algorithm several times so that an efficient clustering of the dataset can be found. Which distance measure to use depends on the particular application dataset; in most cases the Euclidean distance is used. Since K-means is an iterative algorithm, a suitable exit condition is required, which can be a maximum number of iterations given as an input.
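As an illustration of Steps 3-6, the following Python sketch implements the algorithm with Euclidean distances. It is a minimal version written for this description, not the authors' original code; the iteration cap, tolerance, and random initialization scheme are assumptions.

```python
import numpy as np

def kmeans(X, G, n_iter=100, tol=1e-6, seed=0):
    """Cluster the N x D data matrix X into G clusters (minimal sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 3: initialize the G centers with randomly chosen data points.
    centers = X[rng.choice(len(X), size=G, replace=False)]
    prev_cost = np.inf
    for _ in range(n_iter):                      # assumed iteration cap
        # Step 4: assign each point to its nearest center (Euclidean distance).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 5: cost F = sum of distances from each point to its centroid.
        cost = dist[np.arange(len(X)), labels].sum()
        if prev_cost - cost < tol:               # improvement has become very small
            break
        prev_cost = cost
        # Step 6: move each center to the mean of the points assigned to it.
        for g in range(G):
            if np.any(labels == g):
                centers[g] = X[labels == g].mean(axis=0)
    return labels, centers
```

Because of the random initialization, running this several times with different seeds and keeping the lowest-cost result is the usual remedy, exactly as noted above.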
B. Spectral Clustering Algorithm
Step 1: Input a dataset and the number of clusters to construct.

Step 2: Construct a graph (V, E) in which the vertices represent the data points and the edges are derived from the attributes (columns) of the dataset. Let W be the weighted adjacency matrix constructed from the dataset.

Step 3: Construct a similarity graph using the weighted adjacency matrix W.

Step 4: Compute the unnormalized graph Laplacian matrix L = D - W, where D is the diagonal degree matrix.

Step 5: Compute the first G eigenvectors e_1, e_2, ..., e_G of the normalized graph Laplacian L_rw = D^{-1} L.

Step 6: Let E be the matrix containing the eigenvectors e_1, e_2, ..., e_G as columns.

Step 7: Let y_i be the vector corresponding to the i-th row of E, for i = 1, 2, ..., N.

Step 8: Using the K-means algorithm, cluster the points y_i into clusters c_1, ..., c_G.
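A compact Python sketch of Steps 2-8 is given below. The Gaussian similarity function and its width sigma are assumptions (the paper does not state how W is built), and the eigenvectors of L_rw = D^{-1} L are obtained from the equivalent generalized symmetric problem L v = lambda D v.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, G, sigma=1.0, seed=0):
    """Random-walk spectral clustering of the N x D matrix X (minimal sketch)."""
    # Steps 2-3: weighted adjacency matrix W from an assumed Gaussian kernel.
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 4: unnormalized Laplacian L = D - W (D = diagonal degree matrix).
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Step 5: first G eigenvectors of L_rw = D^-1 L, computed through the
    # generalized symmetric eigenproblem L v = lambda D v.
    _, E = eigh(L, D, subset_by_index=[0, G - 1])
    # Steps 6-8: the rows y_i of E embed the data points; cluster them with k-means.
    return KMeans(n_clusters=G, n_init=10, random_state=seed).fit_predict(E)
```

For large graphs, a sparse k-nearest-neighbor similarity and an iterative eigensolver would replace the dense eigh call; a snippet follows in the discussion below.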
The algorithm presented in Steps 1-8 uses the normalized graph Laplacian L_rw = D^{-1} L. Ng, Jordan, and Weiss [6] proposed an algorithm that uses a different normalized graph Laplacian, L_sym = D^{-1/2} W D^{-1/2}. As can be noted, the normalized Laplacian in the Ng-Jordan-Weiss algorithm requires an additional matrix multiplication, which adds unnecessary complexity. The unnormalized graph Laplacian can also be used for the eigenvector calculation. An important question in spectral clustering is which of the three graph Laplacians should be used. To answer this [8], the degree distribution of the graph should be considered. If the graph is very regular and most vertices have approximately the same degree, then all the graph Laplacians are very similar to each other and give equivalent results. If, on the other hand, the vertex degrees are broadly distributed, the different Laplacians lead to different results. Several arguments advocate using normalized rather than unnormalized spectral clustering. Second, the algorithm needs to compute the eigenvectors corresponding to the smallest eigenvalues, and many eigensolvers exist for this task; the Lanczos algorithm is used in this paper. The normalizations in the algorithm serve to improve its performance.
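The paper does not show its eigensolver setup, but SciPy's eigsh routine is a commonly used Lanczos-type implementation; the toy adjacency matrix below is purely illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh   # Lanczos-type symmetric eigensolver

# Toy weighted adjacency matrix with two loosely connected groups (assumed data).
W = csr_matrix(np.array([[0, 1, 1, 0, 0],
                         [1, 0, 1, 0, 0],
                         [1, 1, 0, 1, 0],
                         [0, 0, 1, 0, 1],
                         [0, 0, 0, 1, 0]], dtype=float))
L = laplacian(W)                        # unnormalized Laplacian L = D - W
# The two eigenvectors belonging to the smallest eigenvalues; for large graphs,
# shift-invert mode (sigma near 0) is the faster way to obtain them.
vals, vecs = eigsh(L, k=2, which='SM')
print(vals)                             # first value is ~0 for a connected graph
```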
There are interesting similarities between spectral clustering methods and kernel PCA, which has also been observed to perform clustering [26], [27]. The primary difference between the first steps of the spectral algorithm and kernel PCA with a Gaussian kernel is the normalization of the affinity matrix used to form the eigenvector matrix E; this normalization improves the performance of the algorithm.
IV. SILHOUETTE INDEX
To describe how efficiently an algorithm partitions data into clusters, validation indices [23] are used. A clustering algorithm assigns each data point to the most suitable cluster. For each data point j, let n_00 denote the average dissimilarity of j to all the partner data points sharing its cluster; that is, n_00 measures how closely point j matches the characteristics of the other points in the same cluster, and in the ideal case this value is as small as possible. Next, calculate the average dissimilarity of point j to the data points of each of the other clusters, and let n_11 denote the smallest of these averages, i.e., the dissimilarity of datum j to its nearest neighboring cluster. The silhouette of data point j, denoted S(j), is then defined as:
S(j) = (n_11 - n_00) / max{n_00, n_11}

which can be written piecewise as

S(j) = 1 - n_00/n_11    if n_00 < n_11,
S(j) = 0                if n_00 = n_11,
S(j) = n_11/n_00 - 1    otherwise.
Clearly, S(j) ∈ [-1, 1]. A value of S(j) close to +1 indicates that the data points are well clustered by the algorithm; for this, n_00(j) should be as small as possible and n_11(j) as large as possible. In the opposite case, if S(j) approaches -1, the data point j would be better assigned to its nearest neighboring cluster, indicating that the data points are not well partitioned. The average silhouette over the complete dataset thus indicates how well the data points are clustered by the algorithm. Additionally,
the silhouette index also indicates how many clusters are suitable for partitioning the dataset so that the partitioned data give a good representation of the facts and information.
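The definition above translates directly into code. The sketch below computes the average silhouette with Euclidean dissimilarities (an assumption; any dissimilarity measure could be substituted), treating n_00 and n_11 exactly as defined.

```python
import numpy as np

def average_silhouette(X, labels):
    """Mean of S(j) over all points; singleton clusters get S(j) = 0 by convention."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    s = np.zeros(len(X))
    for j in range(len(X)):
        same = labels == labels[j]
        same[j] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster
            continue
        n00 = D[j, same].mean()              # mean dissimilarity inside own cluster
        n11 = min(D[j, labels == c].mean()   # nearest other cluster (needs >= 2 clusters)
                  for c in np.unique(labels) if c != labels[j])
        s[j] = (n11 - n00) / max(n00, n11)
    return s.mean()
```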
V. IMPLEMENTATION AND RESULTS
In this paper, three datasets taken from three different applications are tested under both clustering algorithms. The silhouette index describes the efficiency of each algorithm; it also indicates which number of clusters yields sub-graphs whose clustered data give a good representation of the facts.
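A sketch of the experimental loop follows, using scikit-learn's implementations for convenience. The affinity choice, seeds, and timing method are assumptions, since the paper does not specify its implementation details, so the printed numbers will not match Tables 1-3 exactly.

```python
import time
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import silhouette_score

def compare_algorithms(X, k_values=range(2, 8), seed=0):
    """Cluster X for k = 2..7 with both algorithms; report silhouette and time."""
    for k in k_values:
        models = [
            ('spectral', SpectralClustering(n_clusters=k, affinity='rbf',
                                            random_state=seed)),
            ('k-means', KMeans(n_clusters=k, n_init=10, random_state=seed)),
        ]
        for name, model in models:
            t0 = time.perf_counter()
            labels = model.fit_predict(X)
            elapsed = time.perf_counter() - t0
            print(f'k={k}  {name:9s} silhouette={silhouette_score(X, labels):.2f}'
                  f'  time={elapsed:.4f}s')
```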
A. Abalone dataset
The abalone dataset describes measurements of sea snails; it was collected by the Department of Primary Industry and Fisheries, Tasmania. There are 9 attributes (dimensions) and 4177 data entries (rows). In this dataset the data points are highly overlapped, so finding an efficient clustering is a challenging task. The spectral and K-means clustering algorithms are used to partition the graph into between 2 and 7 sub-graphs.
Fig. 1. Graph of the Abalone dataset
After clustering, the silhouette index is calculated for both algorithms against the number of clusters. Table 1 shows that the silhouette index for spectral clustering is higher than that for K-means clustering, but the time taken by spectral clustering is also greater. Spectral clustering therefore produces more efficient results at a higher time cost. The highest silhouette value also indicates that partitioning the graph into 3 sub-graphs gives the best representation of the clustered data.
TABLE 1. PERFORMANCE OF ALGORITHMS ON THE ABALONE DATASET

                     Silhouette Index
No. of clusters      Spectral clustering    K-means clustering
2                    0.39                   0.17
3                    0.42                   0.21
4                    0.35                   0.17
5                    0.34                   0.12
6                    0.39                   0.16
7                    0.32                   0.16
Time required        11.9517 sec            11.1785 sec
B. Banknotes dataset
The banknotes dataset consists of descriptions of 200 Swiss bank notes; alongside genuine notes, it contains entries for counterfeit ones. By clustering the dataset, it can easily be identified whether a bank note is genuine or counterfeit. Six dimensions are used to describe each bank note. Clustering is performed on this graph for numbers of clusters from 2 to 7.
Fig. 2. Graph of the Banknotes dataset
Table 2 shows that the silhouette index for spectral clustering is higher than that for the K-means algorithm. It is also clear that partitioning the graph into 2 sub-graphs gives the best representation of the clustered graph, since the silhouette index is highest for 2 clusters. Despite its efficiency,
the spectral clustering algorithm takes much more time than K-means clustering; this higher time requirement is its drawback relative to the K-means algorithm.
TABLE 2. PERFORMANCE OF ALGORITHMS ON THE BANKNOTES DATASET

                     Silhouette Index
No. of clusters      Spectral clustering    K-means clustering
2                    0.52                   0.42
3                    0.43                   0.32
4                    0.39                   0.27
5                    0.22                   0.22
6                    0.17                   0.19
7                    0.19                   0.19
Time required        0.15717 sec            0.01520 sec
C. Parkinson dataset
This dataset comes from a biomedical application. It is composed of a range of biomedical voice measurements from 31 people, 23 of whom have Parkinson's disease (PD). Each column is a particular voice measure, and each row corresponds to one of 195 voice recordings from these individuals. The main aim of the data is to discriminate healthy people from those with Parkinson's disease.
Fig. 3. Graph of the Parkinson dataset
In Table 3, the silhouette index indicates the efficiency of the graph clustering algorithms. The silhouette index for spectral clustering is higher than that of K-means clustering, while K-means takes less time to cluster the graph. The higher time requirement of spectral clustering comes from its expensive eigenvalue and eigenvector calculations.
TABLE 3. PERFORMANCE OF ALGORITHMS ON THE PARKINSON'S DISEASE DATASET

                     Silhouette Index
No. of clusters      Spectral clustering    K-means clustering
2                    0.40                   0.3
3                    0.1                    0.001
4                    0.2                    0.12
5                    0.1                    0.01
6                    0.2                    0.20
7                    0.2                    0.08
Time required        0.145186 sec           0.12053 sec
VI. CONCLUSION
K-means is a simple and fast algorithm. As Tables 1, 2, and 3 show, the spectral clustering algorithm outperforms the K-means algorithm. K-means can also perform well if suitable conditions or constraints are given as input. However, the clustering produced by K-means may vary each time the algorithm is run on the same dataset, because of the initialization step in which the cluster centers are chosen randomly. This drawback can be mitigated by running the algorithm several times on the same dataset. K-means has the advantage of taking much less time for clustering than the other algorithm, suggesting a lower computational complexity. Nevertheless, finding an optimal clustering with the K-means algorithm is an NP-hard problem [25].
An algorithm such as K-means cannot recover clusters that form non-convex regions in the data; it simply uses a local-optimum principle for the cluster membership of data points. A promising alternative that has recently emerged in a number of applications is to use spectral methods for clustering, which can handle clusters that are not compact spherical regions of the data. In the last step of spectral clustering, the K-means algorithm is used, which raises the question of why K-means is not applied to the dataset directly. The answer is that K-means alone is unable to cluster data points lying in non-convex regions, and running it directly yields unsatisfactory results. Spectral clustering, on the other hand, has a higher time complexity and computational cost. If future work can reduce the complexity and computational time of spectral clustering, it will become the best algorithm for clustering.
VII. REFERENCES
[1] Jang, J.-S. R., Sun, C.-T., Mizutani, E., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall.
[2] Azuaje, F., Dubitzky, W., Black, N., Adamson, K. (June 2000), Discovering Relevance Knowledge in Data: A Growing Cell Structures Approach, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 30, No. 3, p. 448.
[3] J. A. Hartigan and M. A. Wong (1979), A k-means clustering algorithm, Applied Statistics, 28:100-108.
[4] The MathWorks, Inc. (1999), Fuzzy Logic Toolbox for Use with MATLAB, The MathWorks, Inc.
[5] Shi, J. and Malik, J. (2000), Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905.
[6] Ng, A., Jordan, M., and Weiss, Y. (2002), On spectral clustering: Analysis and an algorithm, in T. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14, pp. 849-856.
[7] Peng Yang, Biao Huang, A Spectral Clustering Algorithm for Outlier Detection, International Conference on Future Information Technology and Management Engineering.
[8] Deepak Verma (2003), A Comparison of Spectral Clustering Algorithms, UW CSE Technical Report.
[9] Chris Ding, Xiaofeng He, Hongyuan Zha, Ming Gu, Horst D. Simon, A MinMaxCut Spectral Method for Data Clustering and Graph Partitioning.
[10] Inderjit Dhillon, Yuqiang Guan, and Brian Kulis, A Unified View of Kernel k-means, Spectral Clustering and Graph Cuts.
[11] P. Domingos and M. Richardson (2001), Mining the Network Value of Customers, Proc. 7th ACM SIGKDD, pp. 57-66.
[12] Y. Wang, D. Chakrabarti, C. Wang, and C. Faloutsos (2003), Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint, SRDS, pp. 25-34.
[13] S. Wasserman and K. Faust (1994), Social Network Analysis, Cambridge University Press, Cambridge.
[14] C. Ding, X. He, H. Zha, M. Gu, and H. Simon (2001), A min-max cut algorithm for graph partitioning and data clustering, Proc. of ICDM.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, in Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, pages 291-316, AAAI Press.
[16] Y. Weiss (1999), Segmentation using eigenvectors: A unifying view, in International Conference on Computer Vision.
[17] G. Scott and H. Longuet-Higgins (1990), Feature grouping by relocalisation of eigenvectors of the proximity matrix, in Proc. British Machine Vision Conference.
[18] N. Cristianini, J. Shawe-Taylor, and J. Kandola (2002), Spectral kernel methods for clustering, in Neural Information Processing.
[19] B. Scholkopf, A. Smola, and K.-R. Muller (1998), Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation.
[20] Z. Wu and R. Leahy (1993), An Optimal Graph Theoretic Approach to Data Clustering: Theory and Its Application to Image Segmentation, IEEE Trans. Pattern Analysis and Machine Intelligence.
[21] Khaled Hammouda, A Comparative Study of Data Clustering Techniques.
[22] Ingo Burk (2012), Thesis on Spectral Clustering.
[23] P. J. Rousseeuw (1987), Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis, Journal of Computational and Applied Mathematics, 20, pp. 53-65.
[24] Ulrike von Luxburg (2007), A Tutorial on Spectral Clustering, Statistics and Computing.
[25] M. Mahajan, P. Nimbhorkar, and K. R. Varadarajan (2009), The Planar k-Means Problem is NP-Hard, WALCOM.
[26] N. Cristianini, J. Shawe-Taylor, and J. Kandola (2002), Spectral kernel methods for clustering, in Neural Information Processing.
[27] B. Scholkopf, A. Smola, and K.-R. Muller (1998), Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation.
