journal homepage: www.elsevier.com/locate/patrec
Article history:
Received 12 April 2013
Available online 20 September 2013
Communicated by M.A. Girolami

Keywords:
Hierarchical clustering
Community selection
Graph density
Density drop

The determination of community structures within social networks is a significant problem in the area of data mining. A proper community is usually defined as a subgraph with a higher internal density and a lower crossing density with other subgraphs. Hierarchical clustering algorithms produce a set of nested clusters, sometimes called dense subgraphs, organized as a hierarchical system, and the output is referred to as a dendrogram. However, determining which of the clusters in the dendrogram should be selected to form communities in the final output is a difficult problem. Most implementations of data mining algorithms require expert guidance to establish the appropriate selection of such communities, and ultimately the output may not be optimal, as with fixed height tree-cutting algorithms. In this paper, a novel algorithm for community selection is proposed. The intuition of our approach is based on drops of densities between each pair of parent and child nodes on the dendrogram: the higher the drop in density, the higher the probability that the child should form an independent community. Based on the Max-Flow Min-Cut theorem, we propose a novel algorithm which can output an optimal set of local communities automatically. In addition, a faster algorithm running in linear time is also presented for the case that the dendrogram is a tree. Finally, we validate this approach through a variety of data sets ranging from synthetic graphs to real world benchmark data sets.
1. Introduction

Networks can be used to describe the pairwise relationships between nodes. Thinking of these nodes as vertices, we can in turn view such networks as graphs where the edges are defined by these same pairwise relationships. Sociologists use networks to describe the relationships among n persons in terms of their connection strength, reflecting how much of a connection exists between a pair, common behaviors, or the level of collaboration. A subgraph with denser connections inside and sparser connections to other subgraphs can provide invaluable insight into the structure of the whole network or into data visualization. Detecting such communities or clusters of closely related objects remains one of the most interesting problems in the fields of bioinformatics, social networks, epidemiology and data mining. Many clustering algorithms have been proposed in the literature (Girvan and Newman, 2001; Newman, 2004, 2006; Hastie et al., 2001; Kaufman and Rousseeuw, 1990; Scott, 2000).

Hierarchical clustering (see Scott, 2000) is one of the most popular approaches to clustering problems, and its output is called a dendrogram. A dendrogram is a diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering (see Fig. 1(a)).

The dendrogram has a root, say v0, at the top of the diagram representing the whole network, and leaves at the bottom representing each individual. Every node at the internal levels represents a subgraph of the original network. A node on the lower level is called a child, and one on the higher level is called a parent. Each edge connecting two nodes in the dendrogram forms a parent–child relationship. The subgraphs induced by these parent–child relationships, specifically the hierarchy of sets of nodes in the original graph, are called clusters and form the candidate pool of communities for output. However, which of those clusters in the dendrogram should be selected to form communities in the final output?
‘‘There are no completely satisfactory algorithms that can be used for determining the number of population clusters for many type of cluster analysis’’, as stated in the SAS/STAT 9.2 User’s Guide. Finding an optimal community partition is much harder than determining the number of communities, and it therefore remains one of the most challenging problems in current data mining research.

⇑ Corresponding author. Fax: +1 304 293 3982.
E-mail addresses: cqzhang@math.wvu.edu, cqzhang@mail.wvu.edu (C.-Q. Zhang).
1 This research is partially supported by an NSA Grant H98230-12-1-0233 and an NSF Grant DMS-126480.
0167-8655/$ - see front matter 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.patrec.2013.09.008
X. Qi et al. / Pattern Recognition Letters 36 (2014) 46–53 47
[Fig. 1: panels (a) and (b); labels 1 ... 130, 1 ... 100, A, B, C, D]
A subgraph with denser connections indicates that its members are more similar to each other than they are to portions of the graph outside the subgraph. If we use the ‘‘density’’ of a subgraph, defined to be the average number of edges between nodes in the vertex set of the subgraph, to describe its global connection strength (see Section 2.2 for a more detailed definition), a good community detection algorithm should detect a partition of the input network into subgraphs satisfying:

1. Higher internal connection density.
2. Lower external connection density.

The traditional approach to identifying communities in a dendrogram is referred to as tree cutting, branch cutting or branch pruning. One kind of tree cutting algorithm needs the number of communities as an input beforehand, but the problem of determining the number of clusters is itself hard in most cases. Another widely used tree cutting algorithm is called fixed height cutting: the user chooses a fixed height on the dendrogram, and all nodes in the branches immediately below the height of the cut form the family of communities. Fixed height tree cutting is simple and rather naive, and its output sometimes does not make sense, especially in complicated cases. The following example reveals the downside of fixed height cutting.

Let G be a network consisting of 4 giant clusters A–D (see Fig. 1(b)). The subgraph induced by A is a complete subgraph of order 100 in which each edge has weight 3, and the ones induced by B–D are also complete, each of order 10. Each edge in those three clusters has weight 4, and all remaining crossing edges have weight 1. Fig. 1(a) is the dendrogram generated by a density driven clustering algorithm. By applying the traditional fixed height cut algorithm, one may produce an output consisting of two communities of orders 100 and 30, respectively (see Fig. 1(a)), while the output of 4 communities consisting of A–D, respectively, should be the true clustering result.

Fixed height cutting is a simple and naive technique with many desirable properties, but it is unreliable when the dendrogram is large and complicated, as we have seen from the above example. Another community selection technique, called ‘‘dynamic tree cut’’, was discussed in Carlson et al. (2006), Dong and Horvath (2007), Ghazalpour et al. (2006) and Gargalovic et al. (2006). It is mainly based on the shape of the branches of the dendrogram, and the inner structure of each node is not reflected in the process of community detection.

In this paper, we propose a new community detection algorithm (Algorithm 1) that differs from traditional algorithms (fixed height cutting, where all edges to be cut are at the same height level, or pre-defined community number algorithms, where the number of communities must be input beforehand): the community result is output automatically, and the edges to be cut may be located at any level of the dendrogram. To be more specific, our algorithm finds an edge cut of a given dendrogram, separating the root from all leaves, where the edges in the edge cut may be located at any level. The family of all nodes (children) immediately below the edge cut is the output of our algorithm and forms all desired communities automatically.

The basic idea of our algorithm is as follows. Each node v in the dendrogram is a candidate for an optimal local community, and the subgraph Gv induced by its members has relatively higher intra-connection than extra-connection. Arcs in the dendrogram with a larger density drop indicate improper agglomeration and hence are candidates for edge cutting. Based on these observations, we assign weights to the arcs of the dendrogram T based on the density drop between child and parent. Our community detection algorithm can catch all arcs with a larger density drop and automatically generate a proper community partition. When we test our algorithm on the above example (see Fig. 1(b)), on which the traditional community detection algorithm fails, 4 communities consisting of A–D, respectively, are obtained, as expected.

The outline of this paper is as follows. In Section 2 we describe classical algorithms and review the Quasi Clique Merger (QCM) algorithm, whose output dendrogram is the starting point of our new community detection algorithm. In Section 3, the new algorithm (Algorithm 1) is described in detail. In addition, a faster algorithm (Algorithm 2) running in linear time is presented in Section 3 for the special case when the dendrogram is a tree. In Section 4 we apply our algorithm to some classic social networks and compare its results with known clusters, which verifies the utility of our new algorithm.
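To make the A–D example concrete, the following minimal sketch computes the weighted density d(C) = 2 Σ w(e) / (|V(C)|(|V(C)| − 1)), the definition given in Section 2, for a few vertex sets of that network. The vertex numbering, the helper names `weight` and `density`, and the reading that every cross-cluster pair is joined by a weight-1 edge are assumptions made for illustration; this is not the paper's implementation.

```python
from itertools import combinations

# Assumed vertex labels: 0..99 form A, then 10 vertices each for B, C and D.
CLUSTERS = {
    "A": range(0, 100),
    "B": range(100, 110),
    "C": range(110, 120),
    "D": range(120, 130),
}
CLUSTER_OF = {v: name for name, members in CLUSTERS.items() for v in members}


def weight(u, v):
    """Edge weight in the example: 3 inside A, 4 inside B, C, D, 1 across clusters."""
    cu, cv = CLUSTER_OF[u], CLUSTER_OF[v]
    if cu != cv:
        return 1  # assumption: every cross-cluster pair is joined by a weight-1 edge
    return 3 if cu == "A" else 4


def density(vertices):
    """d(C) = 2 * (sum of edge weights inside C) / (|V(C)| * (|V(C)| - 1))."""
    vertices = list(vertices)
    n = len(vertices)
    if n < 2:
        return 0.0  # convention chosen here; the formula is undefined for n < 2
    total = sum(weight(u, v) for u, v in combinations(vertices, 2))
    return 2.0 * total / (n * (n - 1))


# Union of the three small clusters, i.e. the community of order 30.
bcd = [v for name in "BCD" for v in CLUSTERS[name]]

print("d(A)         =", density(CLUSTERS["A"]))
print("d(B)         =", density(CLUSTERS["B"]))
print("d(B u C u D) =", density(bcd))
print("d(G)         =", density(range(130)))
```

Under these assumptions the density of B ∪ C ∪ D is far below the density of B alone, which is the kind of parent-child density drop the approach relies on to keep B, C and D as separate communities.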
2. Hierarchical clustering

Clustering is a process of grouping all members of a whole network into a family of subgraphs, called communities. Communities are of interest because they often correspond to functional units for a particular research purpose. A cluster consists of a number of objects with similar characteristics. Any cluster is a subset (or superset) of some community and hence a candidate for inclusion into any containing community of interest. The goal of clustering is to find an optimal output consisting of a set of communities.

2.1. Traditional approach

K-means clustering (Hartigan, 1975) is a well-known partition algorithm which aims to partition the whole network into exactly k communities, in which each member belongs to the community with the nearest mean or center. The main drawback of this algorithm is that the number of communities has to be pre-assigned at the beginning of the process.

Hierarchical clustering is another kind of clustering analysis, which describes a partitioning of a graph varying either from a single cluster containing all members of the whole network to n clusters each containing a single member, or from individual members to the single whole graph cluster (Defays, 1977; Sibson, 1973). As a result, strategies for hierarchical clustering generally fall into two types: agglomerative algorithms, which proceed by a series of fusions, and divisive algorithms, which proceed by a series of partitions. The output of a hierarchical clustering algorithm can be described by Fig. 1(b). In either case, the desired family of clusters is heavily dependent on the horizontal cut line with constant height, which will exhibit suboptimal or even awkward performance for complicated dendrograms.

The Quasi Clique Merger (QCM) algorithm (Ou and Zhang, 2007) is a hierarchical clustering algorithm used to detect dense subgraphs (clusters) of weighted graphs. In this paper, we use the output of QCM as the starting point of our new community detection algorithm. This is because the combination of the QCM algorithm and our new community detection algorithm (Algorithm 1) meets the main requirement of an optimal community detection: higher internal cluster density and lower inter-cluster density. Furthermore, QCM also has two other major features: (1) its output dendrogram is smaller, which clearly highlights meaningful clusters, while most existing algorithms produce a larger binary hierarchical tree (Fiedler, 1973; Girvan and Newman, 2001; Pothen et al., 1990); (2) it allows overlapping clustering or multi-membership, which is a concept that has recently received increased attention (Palla et al., 2005).

A subgraph H in an un-weighted graph is defined as a clique if every pair of members of H is joined by one edge. It is well known that the search for cliques with the maximum number of vertices in graphs is an NP-complete problem. For a subgraph C in a graph with weights w on its edges, we can define the density of C by

$$d(C) = \frac{2 \sum_{e \in E(C)} w(e)}{|V(C)|\,(|V(C)| - 1)},$$

where $E(C)$ is the set of edges connecting members in C. As seen above, if C is a subgraph of an un-weighted graph G (or of a weighted graph with $w = 1$ for all edges), then $d(C) = 1$ implies that C induces a clique in G. For a weighted graph, a subgraph C is called a $\Delta$-quasi-clique if $d(C) \geq \Delta$ for some positive real number $\Delta$.

The core of the QCM algorithm focuses on deciding whether or not to add a member to an already selected dense subgraph C. For a member $v \notin V(C)$, we define the quantity

$$c(v, C) = \frac{\sum_{u \in V(C)} w(uv)}{|V(C)|}.$$

A member $v$ is added to $C$ if $c(v, C)$ is large enough compared with $\alpha_n d(C)$, where $n = |V(C)|$ and $\alpha_n = 1 - \ldots$ is determined by a user specified parameter $\lambda$ ($> 1$) and serves as a coefficient that controls the density during the growing of a cluster C.

The QCM algorithm consists of three main steps: Growing, Merging and Contracting. See Ou and Zhang (2007) and Zhao and Zhang (2011) for more details of the algorithm.

By Ou and Zhang (2007), the complexity of constructing the hierarchically nested system is $O(|V|^2 \log^2(|V|))$ on average, while the number of levels of the dendrogram is $O(\log(|V|))$, instead of the worst case $O(|V|^3 \log(|V|))$.

As noted previously, an important feature of the output generated by the QCM algorithm is overlapping clustering or multi-membership, which means that one object may belong to more than one cluster, while traditional algorithms force one object to belong to exactly one cluster. Hence the dendrograms generated by traditional algorithms are trees, while circuits may exist in the dendrogram obtained by the QCM algorithm because of multi-membership.

3. Detecting optimal communities

In the dendrogram T, each node represents a candidate community for selection, which is a dense subgraph (possibly consisting of only one vertex) in the original input network G.

Our new community detection algorithm starts with the output dendrogram T of the QCM algorithm. Let $j_0$ be a non-leaf node in the dendrogram T with children $j_1, \ldots, j_t$, and let $G_{j_i}$ ($i = 0, 1, \ldots, t$) be the corresponding subgraphs in the input network G. The hierarchical density of $G_{j_0}$ is calculated as follows. Construct a graph $H_{j_0}$ with vertex set $\{j_1, \ldots, j_t\}$. The weight $w_T(j_a, j_b)$ of the edge between $j_a$ and $j_b$ is the minimum of two sums of edge weights $w(e)$, one for each of $a$ and $b$: the sum for $a$ is taken over the edges between $V(G_{j_a}) - V(G_{j_b})$ and $V(G_{j_b})$ and is normalized using the number of vertices in $G_{j_a}$ and not in $G_{j_b}$, and the sum for $b$ is defined symmetrically. The hierarchical density $w_T$ of the parent $j_0$ is then

$$w_T(G_{j_0}) = \frac{\sum_{a, b \in \{1, \ldots, t\}} w_T(j_a, j_b)}{\binom{t}{2}}.$$

For example, let $G_{j_1} = K_4$ and $G_{j_2} = K_4$ be two children of some node $j_0$ with $G_{j_1} \cap G_{j_2} = K_2$. Then $H_{j_0}$ is $K_2$, weighted by $4/(2 \cdot 2) = 1$.

The node h in Fig. 1(a), obtained by merging the three clusters B–D, has a corresponding hierarchical subgraph $H_h = K_3$, a weighted triangle in which each edge is assigned $2/(30 \cdot 30) = 2/900$.

This hierarchical density is used to measure the extra connection among clusters. There is a large hierarchical density drop when intra density and extra density differ significantly. Thus the information carried by the inner structure of any cluster is detected through the hierarchical density. In the following we propose a novel algorithm (Algorithm 1) to find a set of edges based on the change of hierarchical density from level to level. Algorithm 2 is modified from Algorithm 1 for the special case when the dendrogram is a tree.

Algorithm 1 presented here finds an optimal edge cut in a dendrogram that will generate clusters with multi-membership.

Instance: Let T be the dendrogram and the unique vertex v0 on the highest level. All arcs of T are oriented downward from vertices
community structure, but are
Algorithm 1. S uv v , which will be the output foressentially
communitiesrandom in other respects.
The main cost of the algorithm !
ð Þ¼f g
after the processing, and @
add a new vertex t as the Therefore the total runtime of our and S uv S S vz ; rithm. This fraction depends on z
v ! out þ !
Add arcs t weights are integers, the run-time of from communities other than its
Step 3. If l > 1 then l l 1 and back to
v
!
for every leaf v and assign weight
finding the max-flow is bounded by own community. The performance
! with Otherwise, W Gv : v S
the weig ht for the arc v0 v0 0
c jVðGv0 Þj
flow in the network. (where v corresponds the Each point in the above figure
v0v00
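As an illustration only of the notion of an edge cut that separates the root from all leaves at arbitrary levels, the sketch below handles the tree case with a bottom-up pass followed by a top-down pass. It does not reproduce Algorithm 1 or Algorithm 2: the function name `cheapest_root_leaf_cut`, the dictionary representation, the toy dendrogram, and the arc costs (a low cost standing in for a large density drop) are all hypothetical.

```python
def cheapest_root_leaf_cut(children, cost, root):
    """Cheapest set of arcs whose removal separates `root` from every leaf.

    children: dict mapping a node to the list of its children (a tree)
    cost:     dict mapping an arc (parent, child) to a non-negative cut cost
    Returns (total cost, nodes immediately below the cut).
    """
    # Pre-order traversal; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, ()))

    # below[v]: cheapest cost of separating v from all leaves beneath it.
    below = {}
    for v in reversed(order):
        below[v] = sum(
            cost[(v, c)] if not children.get(c) else min(cost[(v, c)], below[c])
            for c in children.get(v, ())
        )

    # Top-down pass: cut the arc above c when that is no worse than cutting deeper.
    communities, stack = [], [root]
    while stack:
        v = stack.pop()
        for c in children.get(v, ()):
            if not children.get(c) or cost[(v, c)] <= below[c]:
                communities.append(c)
            else:
                stack.append(c)
    return below[root], communities


# Hypothetical toy dendrogram: root r, with h merging B, C and D.
children = {
    "r": ["A", "h"], "h": ["B", "C", "D"],
    "A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"], "D": ["d1", "d2"],
}
cost = {("r", "A"): 1, ("r", "h"): 5, ("h", "B"): 1, ("h", "C"): 1, ("h", "D"): 1}
for parent in "ABCD":
    for leaf in children[parent]:
        cost[(parent, leaf)] = 4  # made-up cost: splitting a tight cluster is expensive

total, communities = cheapest_root_leaf_cut(children, cost, "r")
print(total, sorted(communities))
```

With these made-up costs the cut passes above A at the top level but above B, C and D one level lower, so the selected arcs do not share a common height, unlike a fixed height cut.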
[Fig. 2 plot. Title: Computer-generated graphs. Vertical axis: fraction of vertices classified correctly (%).]
Fig. 2. Numbers in dark green are the average out-degree, i.e., the number of inter-community edges over 15 random samples. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 3. The network of friendships in the Karate club study in Zachary as described in the text. The administrator and instructor are represented by nodes 1 and 33, respectively. Nodes associated with
the club administrator’s faction are drawn as green squares, those associated with the instructor’s faction are drawn as red circles. (For interpretation of the references to color in this figure legend, the
reader is referred to the web version of this article.)
performs nearly perfectly when $z_{out} < 6$, classifying 92% or more of the total vertices correctly, and 97% or more if $z_{out} < 5$. In fact, when $z_{out}$, the number of inter-community edges, is larger than this, a reliable decision can hardly be made by any method, given that the average degree is 16.

Thus our algorithm performs very well as long as each vertex has more connections within its community than connections to other communities. Our conclusions match the result obtained by Newman by calculating betweenness (Girvan and Newman, 2001).

4.2. Zachary’s Karate club study

We now turn to applications to real world network data. The first real-world social network for which the community structure is already known from other sources is the well known ‘‘Karate club’’ of Zachary (1977). In this study, Zachary observed 34 members of a karate club over a period of two years at an American university. During the course of observation, a disagreement developed between the administrator of the club and the club’s instructor, which ultimately resulted in the instructor leaving and starting a new club, taking about half of the original club’s members with him.

Zachary constructed a simple un-weighted graph to show the friendships between members of the club: each member of the club is represented by a node, and an edge is drawn if the two members are friends outside the club activities. Fig. 3 shows the network, with the administrator and instructor represented by nodes 1 and 34, respectively. Green squares represent individuals associated with the administrator and red circles represent those associated with the instructor.
Fig. 4. Hierarchical tree showing the complete community structure for the network calculated by using Girvan and Newman’s algorithm. Only the leftmost object, 3, was wrongly classified.
Fig. 5. Hierarchical tree of Karate club calculated by using our algorithm. Multi-membership can be observed in the first agglomerative step.
Fig. 6. The dolphin social network of Lusseau et al. The dashed curve represents the division into two equally sized parts found by a standard spectral partitioning calculation. The solid curve represents the division found by the modularity-based algorithm of this section.

Among many algorithms suggested for this data set, two of them have dominated the literature: the spectral bisection algorithm (Fiedler, 1973; Pothen et al., 1990), which is based on the

These data were collected by Davis et al. in the 1930s. They represent observed attendance at 14 social events by 18 Southern women. The result is a person-by-event matrix: cell (i, j) is 1 if person i attended social event j, and 0 otherwise. The goal of this study is to determine the social structure of this club according to the members’ attendance at all social events. The first reported result on this data set was given by Homans (1950) as follows:

Group 1: 1, 2, 7, 8, 14, 15, 16;
Group 2: 11, 12, 13, 17, 18;
Fig. 7. The left figure is the dendrogram of the clustering analysis. The right figure is the weighted network constructed by our algorithm with hierarchical densities and weights.
Members 3, 4, 5, 6, 9 and 10 are not clearly assigned to either group. By feeding the club data to our algorithm, two main clusters are obtained as follows:

Group 1: 1, 2, 7, 8, 9, 10, 14, 15, 16;
Group 2: 3, 4, 11, 12, 13, 17, 18;

Note that members 5 and 6 are still not grouped into either Group 1 or Group 2. Those two members attended only two out of the 14 events and are hard to classify into either group based only on their attendance records. Member 9 attended three times, and each time could meet members 8 and 14. Member 10 has a similar preference to the majority of members from Group 1. Hence it is reasonable to assign members 9 and 10 to the first group. The same kind of information can be dug out of the attendance record to support our result. This result improves an earlier result obtained by Zhao and Zhang (2011), which uses a fixed height cut and adds member 4 to Group 1 and member 10 to Group 2. Their result is also identical to the one obtained by Ronald Breiger using duality analysis (Breiger, 1974).

4.4. Political books network

4.5. Bottlenose dolphins network

Fig. 6 represents the social network of a community of 62 bottlenose dolphins living in Doubtful Sound, New Zealand. The network was compiled by Lusseau et al. (2005) from seven years of field studies of the dolphins, with ties between dolphin pairs being established by observation of statistically significant frequent association. This network is of interest because, during the course of the study, the dolphin group split into two smaller subgroups, represented by the circles and squares, following the departure of a key member of the population, represented by a triangle in the figure.

Our algorithm gives the perfect division when processing the whole network of 62 nodes. As shown, one community consists of all squares, and the other consists of all circles and the triangle. If we remove the key member, represented by the yellow triangle in the above figure, then our algorithm outputs almost the same result, except that one member is wrongly grouped. In fact, this member has only two neighbors in the network: one is the key member and the other belongs to the group consisting of all red squares in the figure.

The application of our algorithm to the dolphins data set also demonstrates, to some extent, the robustness of our algorithm to node removal.