
VQ-agglomeration: a novel approach to clustering

J.-H. Wang and J.-D. Rau

Abstract: A novel approach called VQ-agglomeration, capable of performing fast and autonomous clustering, is presented. The approach involves a vector quantisation (VQ) process followed by an agglomeration algorithm that treats codewords as initial prototypes. Each codeword is associated with a gravisphere that has a well defined attraction radius. The agglomeration algorithm requires that each codeword be moved directly to the centroid of its neighbouring codewords. The movements of codewords in the feature space are synchronous, and converge quickly to certain sets of concentric circles whose centroids identify the resulting clusters. Unlike other techniques, such as the k-means and the fuzzy C-means, the proposed approach is free of the initial prototype problem and does not need pre-specification of the number of clusters. Properties of the agglomeration algorithm are characterised and its convergence is proved.

1 Introduction

Both clustering and classification play important roles in statistical pattern recognition [1]. Classification can be viewed as one special type of supervised training in the sense that it is used to determine optimal or near-optimal decision boundaries from a set of labelled patterns for classifying similar (but unseen) samples in the future. Compared with classification, clustering is more difficult, because clustering is an attempt to find structure in a set of observations about which we know very little [2]. Low-dimensional clustering by human eyes is fast and efficient. However, clustering high-dimensional input data via human eyes is a very different story. The development of algorithms for performing high-dimensional clustering is by no means a trivial task; the main difficulty stems from the fact that the input distribution function is often not available, and neither is the number of clusters. Often, the best one can do is to apply several clustering techniques and check for common clusters instead of searching for a technical measure of validity for an individual clustering [2]. The price to pay, obviously, is an enormous amount of computation time. Traditional clustering algorithms can be classified into two main categories: hierarchical and partitional. In hierarchical clustering methods, the sequence of forming groups proceeds such that whenever two data points belong to the same cluster at some level, they remain together at all higher levels. The rationale of partitional clustering is to choose k initial partitions (prototypes) of the data set and then alter cluster memberships so as to obtain better partitions according to some objective function. Two major drawbacks of partitional clustering are the difficulty in determining the number of clusters k and

© IEE, 2001
IEE Proceedings online no. 20010139
DOI: 10.1049/ip-vis:20010139
Paper first received 28th March and in revised form 17th July 2000
The authors are with the Department of Electrical Engineering, National Taiwan Ocean University, 2 Peining Rd., Keelung, Taiwan

the sensitivity to the initial prototypes. Often, two different choices of initial prototypes may yield quite different clustering results in partitional clustering methods. Traditional approaches to determining the number of clusters k can be divided into three main categories [3]. The first category evaluates a certain global validity measure of the k-partition for a range of k values, and then picks the value of k that optimises the validity measure [4, 5]. The second category performs progressive clustering [3, 6, 7], where clustering is initially performed with an overspecified number of clusters: the clustering algorithm starts by partitioning the data set into a large number of small clusters; after convergence, spurious clusters are eliminated, compatible clusters are merged, and good clusters are identified. The performance of this split-and-merge approach depends greatly on the cluster-merging criterion, which is usually difficult to determine. The third category extracts one cluster at a time [8] when k is unknown. It treats the data as a mixture of components, and uses an estimator to acquire the parameters of each component. Its performance is quite sensitive to the form of the underlying input distributions. Much worse, as noted in [3], the idea of extracting clusters in a serial fashion becomes infeasible when the input contains overlapping clusters, because removing one cluster may partially destroy the structure of other clusters. In this paper, we present a fast clustering approach that does not need the number of clusters to be specified; the clustering result is autonomously determined by the nature of the input. The input data set first undergoes a vector quantisation process and is divided into M_f partitions. Then, the resulting M_f codewords are taken as good initial prototypes for an agglomeration process that has its origin in classical physics. Each codeword c_j is regarded as a neutral particle with mass m_j, defined as the number of input vectors represented by c_j. Also associated with each codeword c_j is a gravisphere with a well defined attraction radius. In the course of the agglomeration process, each codeword moves directly to the centroid of its neighbouring codewords. The movements of codewords are synchronous, and codewords located at a denser input area will simultaneously attract codewords from a sparser area.

Therefore, this approach differs from the method of [8] in that it extracts clusters in a parallel fashion. After a few synchronous steps, codewords stop moving and stay at a few positions called cluster centroids. After convergence, codewords that moved to the same centroid are grouped as a cluster, as are their associated input vectors. One should note that, unlike in [3, 6, 7], the proposed agglomeration algorithm does not actually merge prototypes in the progress of agglomeration. One thing that VQ and clustering have in common is that they both group an unlabelled data set into a certain number of clusters such that data within the same cluster have a high degree of similarity. However, clustering is often a more difficult task, as it does not have specific criteria, such as the MSE or the information entropy commonly adopted in vector quantisation. Practical aspects of a clustering algorithm, such as its effectiveness in revealing natural groups of data [9], are often considered more important than mathematical rigour. In contrast, communications experts tend to develop optimal vector quantisers with respect to some abstract criteria from information theory. This leads to broadly applicable methods [10] that may have some assured minimum level of performance, but do not take advantage of the structural characteristics of the input data for a particular application. In this paper, the rationale for adopting VQ as the pre-process of an agglomeration algorithm is twofold. First, through VQ, adjacent input feature vectors are encoded as the same codeword; because the number of codewords is much less than the number of input vectors, subsequent agglomeration processing on codewords, rather than directly on individual input items, requires much less computation time. Secondly, because VQ can approximate the input distribution, using the codewords as good initial prototypes, instead of random selections from the raw input, can effectively reduce the sensitivity to initial prototypes. To begin with, the paper proposes the VQ-agglomeration approach, which not only saves much computation time but is also free of the initial prototype problem that plagues the conventional k-means algorithm [1]. We then present a novel algorithm that employs the get-right-to-centroid strategy to agglomerate codewords. The denser input area, which acts as a black hole, attracts nearby codewords to it. The agglomeration process iterates until no more codewords can move; by then, the final resulting clusters as well as the number of clusters are obtained. Thus, the main contributions of the paper are the proposal of a novel agglomeration algorithm incorporating a VQ pre-process to facilitate fast and autonomous clustering, the characterisation of properties that closely relate to the efficiency of the agglomeration algorithm, the proof of stability of the agglomeration process, a comparison of the performance of the VQ-agglomeration approach with other clustering techniques, and further validation through simulation studies. In the remainder of this paper, the roles played, as well as the advantages provided, by a VQ pre-process in the task of clustering are explained. The codeword-agglomeration (CA) algorithm essential to the success of the VQ-agglomeration approach is presented. Important properties of the CA algorithm are characterised, and its convergence stability is also proved. Simulation results are provided to verify that the proposed VQ-agglomeration approach is fast in computation time and free of the initial prototype problem. We also present a recursive version of VQ-agglomeration, and show its usefulness in facilitating cluster validation.

2 Advantages of the VQ pre-process

The advantages of conducting the pre-process of vector quantisation are: (i) because the number of codewords is generally much less than the number of input vectors, much computation time can be saved; (ii) if the codewords are used as initial prototypes, the initialisation problem in prototype-based clustering techniques can be alleviated to some extent. Unfortunately, conventional prototype-based techniques do not allow a set of codewords to be used as input data and as initial prototypes at the same time. That is, the two advantages in general cannot be achieved simultaneously. In this paper, we incorporate the CA algorithm and a VQ pre-process, naming the process VQ-agglomeration, to successfully acquire both advantages. The CA algorithm treats each codeword as an initial prototype and as an input data item in the course of the agglomeration process. To differentiate, any approach that incorporates VQ and a conventional clustering technique (such as k-means [1], the minimum spanning tree (MST) algorithm [11, 12], etc.) is referred to as VQ-clustering.

2.1 Saving computation time via VQ

Two conventional clustering methods were tested, namely the MST algorithm [11, 12] and fuzzy C-means (FCM) [13]. The MST algorithm is a well known graph-theoretical technique for discovering clusters in a data set. The method first finds a minimum spanning tree for the input data; then, by removing the K longest edges, K + 1 clusters are obtained. A simple example is shown in Fig. 1, where the input contains seven separate Gaussian distributions with 1830 points. Fig. 1a shows the spanning tree and Fig. 1b is the clustering result. By removing the six longest edges, the data set is correctly separated into seven clusters. However, as a graph-theoretical approach, the MST algorithm is rather computationally intensive. In this case, the calculation time for obtaining the minimum spanning tree is 300.72 s on a Pentium-III 500. To reduce the computation time, we applied the LBG method [14] to quantise the input data (codebook size M_f = 64), and the result is shown in Fig. 1c. The computation time for LBG is 0.37 s. Fig. 1d illustrates the corresponding spanning tree for the codewords of Fig. 1c. Note that the computation time for generating the spanning tree when the codewords, instead of the original data, were used as inputs to the MST algorithm is only 0.01 s. Thus, the VQ-clustering approach can indeed save computation time. More interestingly, we note that the VQ-clustering approach achieves almost the same clustering results, but with much less computation time. Table 1 shows the experimental results of the VQ-clustering approach that incorporates the MST algorithm and various neural network quantisers.

Table 1: Simulation results of the VQ-clustering approach that incorporates neural VQ and MST clustering

Quantiser           | BNN   | LBG   | FSCL  | SCONN2 | Neural-Gas | GCS
VQ time (s)         | 0.17  | 0.37  | 0.32  | 0.71   | 7.14       | 0.50
MSE                 | 30.18 | 34.51 | 33.81 | 31.07  | 34.14      | 32.76
Correct clustering? | yes   | yes   | yes   | yes    | yes        | yes

M_f ≈ 64

Fig. 1 Example of use of MST algorithm
a Minimum spanning tree for seven clusters with 1830 data points
b By removing the six longest edges (the six bold lines in a), the correct result is obtained
c Codewords obtained by using LBG with M_f = 64
d Corresponding minimum spanning tree of the codewords in c

The quantisers tested include the bi-criterion neural network (BNN) [15], frequency sensitive competitive learning (FSCL) [16], self-creating and organising neural networks (SCONN2) [17], neural-gas [18], and growing cell structures (GCS) [19]. The input is the same as in Fig. 1a. As can be seen in Table 1, the clustering results are almost identical for these quantisers. The slight differences in mean square error (MSE) are mainly due to the use of different quantisers. After the training process, if there exist dead nodes (to which no input data are referred), these dead nodes act as noise or outliers and can incur erroneous clustering results; for example, some clusters may have no codewords to represent them. Consequently, a fast and efficient quantiser without the dead-node problem is a desirable feature in the VQ-clustering approach, if a neural quantiser is to be used. Hereafter, we will use the BNN as the default neural quantiser due to its economy in computation time and freedom from dead nodes [15]. We also tested the performance of the VQ-clustering approach with different clustering algorithms. The test input, as shown in Fig. 2a, contains 1536 data points comprising four Gaussian clusters, with the two on the right-hand side slightly overlapping. The computation time for the BNN is 0.55 s with M_f = 256. Given the original input data, it required 126.77 s to obtain the minimum spanning tree. However, Fig. 2b shows the result of a spanning tree for which the input used the 256 codewords, and it took only 0.68 s. Note that in Fig. 2b, removing the three longest edges cannot correctly separate the two overlapping clusters.

Fig. 2c shows the clustering result of FCM without VQ, and Fig. 2d shows the result of FCM for which the input used the 256 codewords. It took 2.06 and 0.27 s to obtain Figs. 2c and d, respectively. As can be seen in Figs. 2c and d, the two overlapping clusters can be correctly separated, and the resulting centroids are almost the same. The above experiments reveal an important property of the VQ-clustering approach: once a clustering technique is chosen to be incorporated with a quantisation pre-process, savings in computation time are easily obtained through the use of vector quantisation. However, as far as the final clustering result is concerned, the validity of clustering relies mainly on which specific clustering technique is used, not on the quantisation technique. In this context, a powerful clustering algorithm is the key to the success of the VQ-clustering approach; quantisation can be conducted either via conventional methods or via soft computing techniques such as neural networks.
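To make the VQ-clustering pipeline concrete, the following is a minimal sketch in Python. It uses scikit-learn's k-means as a stand-in for the LBG/BNN quantisers and SciPy's minimum spanning tree; all function and parameter names here are ours, not the paper's.

import numpy as np
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def vq_mst_clusters(data, codebook_size=64, n_clusters=7):
    # VQ pre-process: the codewords approximate the input distribution
    # (k-means here stands in for the paper's LBG/BNN quantisers).
    vq = KMeans(n_clusters=codebook_size, n_init=5).fit(data)
    codewords = vq.cluster_centers_
    # Build the MST on the few codewords instead of the many raw points.
    dist = squareform(pdist(codewords))
    mst = minimum_spanning_tree(dist).toarray()
    # Removing the K longest edges leaves K + 1 connected components.
    edges = np.argwhere(mst > 0)
    longest = np.argsort(mst[edges[:, 0], edges[:, 1]])[::-1]
    for i, j in edges[longest[:n_clusters - 1]]:
        mst[i, j] = 0.0
    _, cw_labels = connected_components(mst > 0, directed=False)
    # Each input point inherits the cluster of its nearest codeword.
    return cw_labels[vq.predict(data)]

The computational saving comes from running the quadratic-cost graph construction on the M_f codewords only; the raw points are touched just once for quantisation and once for the final label lookup.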
2.2 Reducing the initialisation sensitivity

As mentioned, in some prototype-based clustering techniques bad initial prototypes can easily lead to local minima and invalid clustering results. The k-means algorithm, for example, is well known for its sensitivity to the choice of the initial centroids (i.e. prototypes). The VQ-clustering approach, however, can easily alleviate this initial prototype problem.

Fig. 2 Example of use of variant clustering algorithms
a Input data and codewords are plotted. BNN consumed 0.55 s with M_f = 256
b Minimum spanning tree of the codewords in a
c Result of FCM without VQ. + denotes the true cluster centroids
d Result of FCM after quantisation; the four initial prototypes for FCM were randomly chosen from the 256 codewords

As in Section 2.1, we first specified a codebook size of k to train the BNN, and used the resulting codewords as initial centroids (prototypes) for the k-means algorithm. Fig. 3a shows the initial centroids as well as the converged centroids after applying the k-means algorithm to the original input items. As shown in Fig. 3b, the original input data were correctly grouped into seven clusters by using the good initial prototypes provided by the BNN. In fact, for some clustering algorithms based on an iterative scheme (e.g. FCM), provision of a set of good initial prototypes can even increase the convergence speed. As shown in Fig. 4a, where the initial prototypes were selected randomly from the input data, it took the FCM 0.49 s to converge. But when the initial prototypes were obtained via the BNN with M_f = 4, as shown in Fig. 4b, it took the BNN and the FCM 0.02 and 0.22 s, respectively. It is worth noting that the computation time for the BNN accounts for only a small portion of the overall computation time. This result implies that the provision of good initial prototypes by vector quantisation can lead to faster convergence of the subsequent clustering algorithm.
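The seeding scheme just described can be sketched as follows. Scikit-learn's KMeans is used here as a stand-in for both the BNN quantiser and the clustering stage, so treat the sketch as illustrative only.

from sklearn.cluster import KMeans

def kmeans_with_vq_init(data, k):
    # Stage 1: a quick quantiser pass provides k codewords (the paper
    # trains a BNN; any dead-node-free quantiser would do here).
    quantiser = KMeans(n_clusters=k, n_init=3, max_iter=20).fit(data)
    prototypes = quantiser.cluster_centers_
    # Stage 2: k-means starts from the codewords rather than random
    # picks, which typically converges in fewer iterations.
    km = KMeans(n_clusters=k, init=prototypes, n_init=1).fit(data)
    return km.labels_, km.cluster_centers_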

2.3 From VQ-clustering to VQ-agglomeration

Having shown the various advantages that can be gained from adopting a VQ pre-process, we note that because most prototype-based algorithms require the number of input data to be much larger than the number of initial prototypes, the VQ-clustering approach in general cannot acquire all the advantages simultaneously. In the following section, we present the VQ-agglomeration approach, which is free of this limitation and can perform fast and autonomous clustering. By autonomous we mean that the number of clusters need not be pre-specified and the convergence of the clustering is guaranteed.
3 VQ-agglomeration approach

3.1 CA algorithm

Initially, assume that a VQ process has divided N input vectors into M_f partitions, each represented by a codeword. The CA algorithm is applied to agglomerate these codewords into clusters. Codewords located at a denser input area will attract codewords from a sparser area, and the movements of codewords in the feature space are synchronous (i.e. simultaneous). After convergence, codewords that moved to the same centroid are labelled as a cluster. In pseudo-code, the algorithm is listed in Table 2. The underlying principle of moving codewords has its origin in classical physics. Each codeword c_j, j = 1, 2, ..., M_f, is regarded as a neutral particle with mass m_j, defined as the number of input vectors represented by c_j. If a codeword is located near a denser area, its mass would be larger, and vice versa.
Table 2: Pseudo-code of the CA algorithm

Given M_f codewords from the VQ pre-process
Do
    Move simultaneously (using eqn. 3) codewords c_j, j = 1 to M_f, to the centroid of S_D(j, t).
Until convergence (namely, no codeword is moving)
Group codewords that converge to the same cluster centroid into a cluster.


Fig. 3 k-means algorithm and VQ-clustering
a Initial centroids are obtained by applying BNN to perform quantisation with M_f = 7. Also shown are the converged centroids after applying the k-means algorithm to the original input data, given the 7 initial centroids
b Test data of 7 Gaussian distributions are correctly grouped into 7 clusters by using the VQ-clustering framework. □ denotes an initial prototype, and + denotes the converged cluster centroid

Fig. 4 Use of BNN to select initial prototypes
a Initial prototypes selected randomly from input points; FCM takes 0.49 s to converge
b Initial prototypes provided by BNN, M_f = 4; BNN and FCM take 0.02 and 0.22 s, respectively, to converge

Also, each particle c_j is associated with a gravisphere with radius given by

r_j = θ · μ_j    (1)
where μ_j is a mean distance measure, say Euclidean distance, from codeword c_j to the input vectors represented by c_j. The parameter θ is presumed to be a function of M_f and is used to modulate r_j. To proceed, define a direct neighbour of c_j as a codeword whose gravisphere directly intersects that of c_j; namely, if the distance between c_k and c_j is less than (μ_j + μ_k) × θ, then c_k is a direct neighbour of c_j. Furthermore, denote by S_D(j, t) the subset that contains the direct neighbours of an arbitrary codeword c_j at the tth synchronous step. The centroid of S_D(j, t) at the tth synchronous step is defined as

w̄(j, t) = [ Σ_{c_k ∈ S_D(j, t)} m_k · w(k, t) ] / [ Σ_{c_k ∈ S_D(j, t)} m_k ]    (2)

where w(k, t) is the location of c_k in the feature space. In the course of a synchronous update, each codeword c_j moves directly toward its corresponding centroid w̄(j, t), i.e.

w(j, t + 1) = w̄(j, t)    (3)

As will be seen, this get-right-to-centroid strategy generates some interesting and desirable properties. Fig. 5 shows simulation results of three successive synchronous steps, where the input is the same as in Fig. 2a and M_f = 32. After the third synchronous step, the 32 codewords with variable sizes of gravispheres have converged (agglomerated) into four separate concentric circles, correctly representing the four Gaussian clusters. The significance of the gravisphere is its usefulness in quantitatively and qualitatively representing the neighbourhood of an arbitrary codeword c_j. When the gravisphere of c_j directly intersects that of c_k, these two codewords are considered direct neighbours of each other. Note that codewords in the neighbourhood of c_j are likely to be grouped into the same cluster. On one hand, the gravisphere size of c_j qualitatively represents the region of codewords that could belong to the same cluster as c_j; on the other hand, when the size of the gravisphere of c_j increases, the dissimilarity among the codewords that intersect c_j could also increase, and the chance of erroneously agglomerating different clusters into one cluster increases. Clearly, these two-sided constraints require that the gravisphere of a codeword c_j be large enough to directly intersect the neighbouring codewords around c_j, yet not so large as to intersect irrelevant codewords. To satisfy these two constraints, the size of the gravisphere of c_j should be quantitatively related to the local distribution around c_j, rather than to mass as in classical physics. That is, if c_j is located at a sparser input area, i.e. the input vectors represented by c_j distribute widely over the feature space, then its associated gravisphere should be larger, and vice versa. Because μ_j is a mean distance from c_j to the input vectors represented by c_j, it is appropriate to use eqn. 1 to compute the gravisphere radius of c_j. In this aspect, the concept of the gravisphere is quite analogous to that of the Voronoi region [20].
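Putting eqns. 1-3 together, a compact sketch of the CA algorithm of Table 2 might look as follows. It assumes Euclidean distance and dead-node-free codewords (every mass positive); all variable names are ours.

import numpy as np

def ca_agglomerate(codewords, masses, mu, theta, max_steps=100, tol=1e-9):
    # w holds w(j, t), one row per codeword; radii are fixed throughout.
    w = np.asarray(codewords, dtype=float)
    m = np.asarray(masses, dtype=float)          # assumes no dead nodes
    r = theta * np.asarray(mu, dtype=float)      # eqn. 1: r_j = theta * mu_j
    for _ in range(max_steps):
        d = np.linalg.norm(w[:, None, :] - w[None, :, :], axis=-1)
        # c_k is a direct neighbour of c_j if their gravispheres intersect,
        # i.e. d(j, k) < r_j + r_k; every codeword neighbours itself.
        direct = d < (r[:, None] + r[None, :])
        # eqn. 2: mass-weighted centroid of S_D(j, t), for all j at once.
        wm = direct * m[None, :]
        centroids = (wm @ w) / wm.sum(axis=1, keepdims=True)
        if np.abs(centroids - w).max() < tol:    # no codeword moved
            break
        w = centroids                            # eqn. 3: synchronous move
    # Group codewords that converged to the same centroid.
    _, labels = np.unique(np.round(w, decimals=6), axis=0, return_inverse=True)
    return labels, w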

Fig. 5 Progress of agglomerating codewords using gravispheres
Test input data used 4 Gaussian distributions with slight overlap and M_f = 32. Darker points denote codewords and their corresponding gravispheres. Vector quantisation was conducted using BNN
a t = 0
b t = 1
c t = 2
d t = 3

The sensitivity of the gravisphere size to different values of M_f is mild, especially for input data where every cluster is distinctly separate from the other clusters. For example, when the input data in Fig. 1a are used and M_f = 64, values of θ ranging from 1.5 to 3.5 work well. In fact, the gravisphere size specified by eqn. 1 rarely results in intersecting codewords that belong to different clusters, unless the input data contain clusters that are very close to each other or even overlap. However, if M_f is very large and θ is a constant, then the number of input vectors represented by each codeword, the value of μ_j and the gravispheres would be improperly small. In that case, codewords in a sparse input area that should be grouped together might erroneously end up in several clusters. Reasoned as above, θ should be a function of M_f. In principle, for an arbitrary input distribution and a given value of M_f there exists an optimal θ that yields an optimal partition. Although it is virtually impossible to obtain the optimal θ analytically, a heuristic approach is nevertheless feasible. To see this, we used the input data in Fig. 2a, where the two clusters on the right-hand side slightly overlap, and employed a curve-fitting function to figure out an optimal value of θ as a function of M_f. Empirical results have shown that θ = (7.967 × 10⁻⁸)M_f³ − (8.128 × 10⁻⁵)M_f² + (0.029)M_f + 1.05 works well for general Gaussian distributions. Note that the gravisphere size for each codeword will not change throughout the agglomeration process.
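For reference, the fitted polynomial can be wrapped as a small helper. The coefficients are as printed above; the fit is specific to the authors' Gaussian test data, so they should be treated as illustrative rather than general.

def theta(m_f: float) -> float:
    # Empirical theta(M_f) fit quoted above (illustrative only).
    return (7.967e-8 * m_f**3
            - 8.128e-5 * m_f**2
            + 0.029 * m_f
            + 1.05)

# e.g. theta(64) is about 2.6, consistent with the theta row of Table 3.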

3.2 Properties of the CA algorithm

The CA algorithm differs from the gravitational approach [21] proposed by Wright in many ways. First, the CA algorithm works directly on codewords (prototypes), rather than on the input data points; Wright's approach treats each data point as a particle and agglomerates data points by gravitational attraction. Secondly, the algorithm employs the get-right-to-centroid strategy to agglomerate the prototypes, and does not actually merge prototypes in the progress of agglomeration. Finally, the algorithm need not pre-specify the number of clusters, the value of which is autonomously determined by the nature of the input. In the following, we characterise more interesting properties of the CA algorithm, including its stability and convergence:
(i) Gravitational property describes how an arbitrary codeword c_j behaves around its neighbourhood S_D(j, t). By eqn. 2, the centroid of the codewords in S_D(j, t) must be closer to codewords with larger mass than to those with smaller mass. Hence, by eqn. 3, an arbitrary codeword c_j will move towards a codeword that has larger mass in S_D(j, t). This gravitational property has a natural analogy in classical physics, namely the law of universal gravity, whereby a particle with larger mass attracts neighbouring particles with smaller mass. The property not only helps to discover local dense areas quickly, it also effectively separates overlapping clusters.
(ii) Relative-velocity property concerns the interactions among neighbouring codewords. According to eqn. 2, a codeword with smaller mass will move farther than one with larger mass. That is, during each synchronous step, a codeword in a sparser input area will move faster, in a relative sense, than one in a denser area. This creates a desirable property: codewords in a denser area seem to wait for codewords in a sparser area, preventing the latter from being left behind and erroneously becoming several separate clusters. The relative-velocity property ensures codewords always move in the desired direction (i.e. towards a concentrated area) and prevents codewords in a sparser input area from being trapped in local minima. A small worked example of both properties is given after this paragraph.
More significantly, the combination of the above two properties guarantees that, as the iterative synchronous operation proceeds, all codewords move to denser areas, and more and more gravispheres of codewords that (by input nature) belong to the same cluster intersect each other and ultimately get agglomerated into some concentric circles (gravispheres). In fact, the denser area can be perceived as a black hole that attracts codewords to it.
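As a simple illustration of these two properties (the numbers here are ours, not from the paper), consider two mutually direct-neighbouring codewords in one dimension: c_1 at w = 0 with mass m_1 = 3 and c_2 at w = 4 with mass m_2 = 1. By eqn. 2 their shared centroid is (3 × 0 + 1 × 4)/(3 + 1) = 1, so after one synchronous step (eqn. 3) the lighter codeword has moved a distance of 3 towards the heavier one, while the heavier codeword has moved only 1: the codeword from the sparser area travels farther per step, and both agglomerate towards the denser area.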
3.3 Convergence of the CA algorithm

We start by defining S_D(j, t) as the set of codewords whose gravispheres have direct intersections (including the special case when one gravisphere is fully covered by another) with that of the codeword c_j at the tth synchronous step. Furthermore, S_I(j, t) is defined as the set of codewords whose gravispheres have direct or indirect contacts with that of c_j at the tth synchronous step. Fig. 6 is a zoom-in of the upper-left corner of Fig. 5a. Considering the codeword c_1, the direct neighbours of c_1 are c_2, c_3 and c_4, that is, S_D(1, 0) = {c_1, c_2, c_3, c_4}. Although the gravispheres of c_5 and c_8 directly intersect with that of c_2, they have only indirect contacts (through c_2) with that of c_1. Thus, codewords c_5 and c_8 are indirect neighbours of c_1. Accordingly, S_I(1, 0) = {c_j | j = 1, 2, ..., 9}. Moreover, in Fig. 6 we see that S_I(1, 0) = S_I(2, 0) = ... = S_I(9, 0).

Lemma 1: In the progress of synchronous movements, all codewords in S_I(k, t) are bound to move inwardly to denser input areas.

Proof: Consider an arbitrary set S_I(k, t) whose members' gravispheres have direct or indirect contacts with that of the codeword c_k at the tth synchronous step. The direct neighbour sets S_D(j, t) of the codewords c_j in S_I(k, t) may differ from each other; thus, according to eqn. 2, their corresponding centroids w̄(j, t) may not be the same. Moreover, according to the gravitational property, the centroid w̄(j, t) of each codeword c_j in S_I(k, t) would be near the codeword with larger mass, and c_j will move to it. Hence, in the course of synchronous moving, all codewords in S_I(k, t) will gradually move to the codeword in S_I(k, t) that has the largest mass. Therefore, all codewords in S_I(k, t) are bound to move inwardly to some denser areas. □

Although during the synchronous update one or a few codewords may depart from S_I(k, t), they will join neighbouring codewords to form another new set at the next synchronous step. Afterwards, the gravitational property will force codewords in S_I(k, t + 1) to unanimously move to the denser area, i.e. the inward agglomeration is still bound to occur. This is illustrated in Fig. 5, where the set S_I(k, t = 0) representing the initially overlapping Gaussians split into two separate sets at t = 2.

Lemma 2: After finite synchronous moving steps, the locations w(j, t) of all codewords in S_I(k, t) will eventually become equal.
Proof: By lemma 1, codewords in S_I(k, t) are bound to move inwardly to some denser areas; that is to say, more and more gravispheres of codewords in S_I(k, t) will intersect each other. Consequently, the direct neighbour sets S_D(j, t) of the codewords c_j in S_I(k, t) will eventually become identical. By eqn. 2, it follows that w(j, t) for codewords in S_I(k, t) will become equal after finite synchronous steps. □

Consider the four codewords at the lower-left corner in Fig. 5a, each with a different size of gravisphere. At t = 0, the S_D(j, t) of these four codewords are not the same. After one synchronous step, any pair of these four gravispheres intersect, i.e. the S_D(j, t = 1) of the four codewords are identical, and their locations w(j, t) are equal at t = 2. After this, we see no more codewords moving in these concentric circles.

Convergence theorem: After finite synchronous moving steps, all codewords as well as their gravispheres will form several sets of concentric circles and will not move any more.
Proof: By lemma 2, after finite synchronous moving steps, the gravispheres of the codewords in an arbitrary S_I(k, t) will form a set of concentric circles. By induction, when all codewords as well as their gravispheres form several sets of concentric circles, all codewords will not move any more. □
4 Experimental results


4.1 Variant Gaussian distributions

This experiment shows that the proposed VQ-agglomeration clustering approach can easily separate distinct clusters, regardless of the size of each cluster and the number of data points in it. For this purpose, the test input data contain six 2-D Gaussian clusters. Some have different deviations on the x and y co-ordinates, forming ellipsoids in the 2-D feature plane. As shown in Figs. 7a and b, the VQ-agglomeration approach converged after five synchronous movements, and successfully identified the six valid clusters.

Fig. 6 Zoom-in of the upper-left cluster in Fig. 5a, showing the relationship of neighbouring codewords

4.2 Using VQ versus not using VQ

In the proposed clustering approach, codewords resulting from the VQ pre-process are taken as input data to the CA algorithm; agglomerating codewords rather than individual input vectors requires much less computation time. Another unique feature of the CA algorithm is that the codewords act as good initial prototypes in conducting agglomeration.

Fig. 7 VQ-agglomeration approach
a 32 initial prototypes resulting from the VQ pre-process and their gravispheres
b After five synchronous steps, codewords converged into six sets of concentric circles identifying the six clusters

Fig. 8 Results obtained without VQ pre-process
a 64 initial prototypes (and their gravispheres) selected randomly from the original input data
b After seven synchronous steps, these prototypes erroneously converged into three sets of concentric circles identifying three clusters

To see what improvements this feature can bring to the clustering performance, we randomly picked initial prototypes from the original input data, without the VQ pre-process; the result is shown in Fig. 8. The input is the same as for Fig. 2 and M_f = 64. As shown in Fig. 8a, because the initial prototypes were randomly selected, some codewords have too-large gravispheres. As a result, the two overlapping clusters were erroneously grouped together after convergence, as shown in Fig. 8b. Comparing Figs. 5a and 8a verifies that a VQ pre-process indeed can provide good initial prototypes for the CA algorithm. Therefore, this experiment and the results in Fig. 5 have empirically shown that the proposed clustering approach can simultaneously acquire two benefits from using codewords, namely freedom from the initial prototype problem and fast computation time.
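For contrast with the VQ pre-process, the random initialisation used for Fig. 8 might be sketched as follows; the masses and mean distances are derived from a nearest-prototype assignment, and all names here are ours. The output triple can be fed to the ca_agglomerate sketch of Section 3.

import numpy as np

def random_prototypes(data, m_f, rng=np.random.default_rng(0)):
    # Prototypes drawn at random from the raw input, as in Fig. 8a.
    idx = rng.choice(len(data), size=m_f, replace=False)
    protos = data[idx]
    # A nearest-prototype assignment yields the masses m_j and the
    # mean distances mu_j that set the gravisphere radii (eqn. 1).
    d = np.linalg.norm(data[:, None, :] - protos[None, :, :], axis=-1)
    owner = d.argmin(axis=1)
    masses = np.bincount(owner, minlength=m_f)
    mu = np.array([d[owner == j, j].mean() if masses[j] > 0 else 0.0
                   for j in range(m_f)])
    return protos, masses, mu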
4.3 Effect of changing the codebook size

In general, if each cluster is distinctly separate from the others, the CA algorithm can correctly identify the clusters. In addition, the tolerance to changes in M_f is good, except when M_f is too large and some clusters are very close to each other, or even overlap. To see this, we ran the algorithm 100 times for each value of M_f ranging from 8 to 400. In performing the quantisation, the input sequences of feature vectors were made different for each run.

Table 3: Insensitivity to the change of M_f

M_f | VQ time (s) | Agglomeration time (s) | Avg. no. of sync. steps | θ    | Avg. no. of clusters | Accuracy (%)
8   | 0.0225      | 0.0040                 | 2.07                    | 1.28 | 4.00                 | 100
20  | 0.0311      | 0.0102                 | 3.40                    | 1.60 | 4.00                 | 100
32  | 0.0508      | 0.0184                 | 4.10                    | 1.90 | 4.00                 | 100
64  | 0.0917      | 0.0357                 | 4.21                    | 2.60 | 4.00                 | 100
100 | 0.1817      | 0.0469                 | 4.82                    | 3.23 | 4.03                 | 97
200 | 0.3136      | 0.1866                 | 5.79                    | 4.25 | 4.03                 | 97
300 | 0.4416      | 0.3475                 | 6.74                    | 4.61 | 4.06                 | 94
400 | 0.9011      | 0.6168                 | 7.38                    | 4.78 | 4.15                 | 86

All data were averaged over 100 different runs. The accuracy denotes the percentage of the 100 runs that obtained the correct number of clusters.


Using the same 4-Gaussian data as in Fig. 5, Table 3 shows that when M_f increases, the average number of resulting clusters N_c is always nearly 4. The chance of achieving N_c = 4 is still very large (i.e. the accuracy is 86%) even when M_f = 400. Hence, the resulting number of clusters is not sensitive to the change of M_f. It is worth noting that the CA algorithm still converges very quickly even when M_f = 400. Although the CA algorithm is robust to the change of M_f, the validity of clustering in the case of a very large M_f can be improved. In fact, an overly large M_f not only incurs more computation time, it may also cause erroneous clusters. Thus, we present a recursive scheme to avoid guessing M_f blindly. The idea is to run the VQ-agglomeration recursively, with each cycle using a better trial of M_f than its preceding cycle. For example, assume an initial value M_f(Cyl = 1). After convergence, N_c(Cyl = 1) is obtained. Then set M_f(Cyl + 1) = 2 × N_c(Cyl) and rerun the VQ-agglomeration. We check whether N_c(Cyl + 1) = N_c(Cyl); if not, we continue to the next cycle, and so on. The feasibility of this scheme stems from the fact that the probability that two successive cycles using different values of M_f result in the same erroneous N_c is extremely small. Equivalently speaking, two successive cycles that use different M_f and converge to the same N_c indicate that the resulting N_c is reliable (see Table 3). When M_f = 400, the chance of identifying 4 clusters is 86%. Suppose N_c(1) ≠ 4, say N_c(1) = 10 after the first cycle; we then use M_f = 2 × 10 = 20 in the second cycle. In Table 3, when M_f = 20, N_c(2) = 4 would be obtained after convergence. As N_c(2) ≠ N_c(1), we continue to the third cycle. With M_f = 2 × 4 = 8, N_c(3) = 4 is obtained after finishing the third cycle. Thus, we have obtained N_c(Cyl + 1) = N_c(Cyl) after three cycles. Comparing the computation times of the three cycles, the second and third cycles consumed only a small portion of the total, implying that the recursive scheme is computationally efficient. More significantly, even if the initial guess of M_f is smaller than the valid number of clusters k, the recursive scheme can quickly identify the valid clusters by trying another M_f(Cyl + 1) = 2 × N_c(Cyl).
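The recursive scheme can be sketched as a short driver loop; quantise and ca_agglomerate stand for the hypothetical helpers from the earlier sketches, and theta is the heuristic of Section 3.1.

def recursive_vq_agglomeration(data, m_f_initial, quantise, ca_agglomerate, max_cycles=10):
    # Stop once two successive cycles, run with different M_f, agree on N_c.
    m_f, prev_n_c = m_f_initial, None
    for _ in range(max_cycles):
        codewords, masses, mu = quantise(data, m_f)
        labels, _ = ca_agglomerate(codewords, masses, mu, theta(m_f))
        n_c = len(set(labels))
        if n_c == prev_n_c:
            return n_c                 # N_c(Cyl + 1) == N_c(Cyl): reliable
        prev_n_c, m_f = n_c, 2 * n_c   # next trial: M_f = 2 * N_c(Cyl)
    return prev_n_c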

5 Conclusions and discussions

The presented CA algorithm can quickly agglomerate codewords into a valid number of clusters determined by the nature of the input, regardless of the input distribution function and of whether the input clusters overlap. The beauty of the proposed VQ-agglomeration approach is threefold. First, it is free of the initial prototype problem. Secondly, the clustering process is fully autonomous, because pre-specifying the number of clusters is not necessary. Thirdly, it is flexible in implementation, because the approach permits designers greater flexibility in the selection of the quantiser, e.g. quantisation can be conducted either via conventional methods or via neural network techniques. Finally, the characterisations of the CA algorithm have shown that it nicely fits the proposed VQ-agglomeration approach in achieving the goal of fast and autonomous clustering. In the future, one may consider incorporating the recursive VQ-agglomeration and a defined global/local validity measure to verify a correct partition as well as the valid number of clusters.

6 Acknowledgments
The authors would like to thank the National Science Council of Taiwan (Grant No. NSC 89-2213-E-019-020) for support of this research.
7 References

1 GOSE, E., JOHNSONBAUGH, R., and JOST, S.: 'Pattern recognition and image analysis' (Prentice-Hall, 1996)
2 NADLER, M., and SMITH, E.P.: 'Pattern recognition engineering' (Wiley Interscience, 1993)
3 FRIGUI, H., and KRISHNAPURAM, R.: 'A robust competitive clustering algorithm with applications in computer vision', IEEE Trans. Pattern Anal. Mach. Intell., 1999, 21, (5), pp. 450-465
4 KWON, S.H.: 'Cluster validity index for fuzzy clustering', Electron. Lett., 1998, 34, (22), pp. 2176-2177
5 BOUDRAA, A.O.: 'Dynamic estimation of number of clusters in data sets', Electron. Lett., 1999, 35, (19), pp. 1606-1608
6 SUH, H., KIM, J.H., and RHEE, C.H.: 'Convex-set-based fuzzy clustering', IEEE Trans. Fuzzy Syst., 1999, 7, (3), pp. 271-285
7 RAVI, T.V., and GOWDA, K.C.: 'Clustering of symbolic objects using gravitational approach', IEEE Trans. Syst. Man Cybern. B, Cybern., 1999, 29, (6), pp. 888-894
8 ZHUANG, X., HUANG, Y., PALANIAPPAN, K., and LEE, J.S.: 'Gaussian mixture density modeling, decomposition and applications', IEEE Trans. Image Process., 1996, 5, pp. 1293-1302
9 DUBES, R.C.: 'How many clusters are best? An experiment', Pattern Recognit., 1987, 20, (6), pp. 645-663
10 CHINRUNGRUENG, C., and SEQUIN, C.H.: 'Optimal adaptive k-means algorithm with dynamic adjustment of learning rate', IEEE Trans. Neural Netw., 1995, 6, (1), pp. 157-169
11 ZAHN, C.T.: 'Graph-theoretical methods for detecting and describing gestalt clusters', IEEE Trans. Comput., 1971, C-20, pp. 68-86
12 WU, Z., and LEAHY, R.: 'An optimal graph theoretic approach to data clustering: theory and its application to image segmentation', IEEE Trans. Pattern Anal. Mach. Intell., 1993, 15, (11), pp. 1101-1113
13 BEZDEK, J.C.: 'Pattern recognition with fuzzy objective function algorithms' (Plenum Press, New York, 1981)
14 LINDE, Y., BUZO, A., and GRAY, R.M.: 'An algorithm for vector quantizer design', IEEE Trans. Commun., 1980, COM-28, (1), pp. 84-95
15 WANG, J.H., and PENG, C.Y.: 'Competitive neural network scheme for learning vector quantization', Electron. Lett., 1999, 35, (9), pp. 725-726
16 GALANOPOULOS, S.A., and AHALT, S.C.: 'Codeword distribution for frequency sensitive competitive learning with one-dimensional input data', IEEE Trans. Neural Netw., 1996, 7, (3), pp. 752-756
17 CHOI, D.-I., and PARK, S.H.: 'Self-creating and organizing neural networks', IEEE Trans. Neural Netw., 1994, 5, (4), pp. 561-575
18 MARTINETZ, T.M., BERKOVICH, S.G., and SCHULTEN, K.J.: '"Neural-gas" network for vector quantization and its application to time-series prediction', IEEE Trans. Neural Netw., 1993, 4, (4), pp. 558-569
19 FRITZKE, B.: 'Growing cell structures - a self-organizing network for unsupervised and supervised learning', Neural Netw., 1994, 7, (9), pp. 1441-1460
20 HAYKIN, S.: 'Neural networks' (Prentice-Hall, 1998, 2nd edn.)
21 WRIGHT, W.E.: 'Gravitational clustering', Pattern Recognit., 1977, 9, (3), pp. 151-166



