
Scale-based Clustering using the Radial Basis Function Network


Srinivasa V. Chakravarthy and Joydeep Ghosh,
Department of Electrical and Computer Engineering,
The University of Texas, Austin, TX 78712
Abstract - Adaptive learning dynamics of the Radial Basis Function Network (RBFN) are compared with a scale-based clustering technique [Won93] and a relationship between the two is pointed out. Using this link, it is shown how scale-based clustering can be done using the RBFN, with the Radial Basis Function (RBF) width as the scale parameter. The technique suggests the right scale at which the given data set must be clustered and obviates the need for knowing the number of clusters beforehand. We show how this method solves the problem of determining the number of RBF units and the widths required to get a good network solution.
I. INTRODUCTION
Clustering aims at partitioning data into more or less homogeneous subsets when the a priori distribution of the data is not known. The clustering problem arises in various disciplines and the existing literature is abundant. Traditional approaches to this problem define a cost function which, when minimized, yields desirable clusters. Hence the final configuration depends heavily on the cost function chosen. Moreover, these cost functions are usually non-convex and present the problem of local minima. Well-known clustering methods like the k-means algorithm are of this type [DH73]. Due to the presence of many local minima, effective clustering depends on the initial configuration chosen. To avoid this problem, stochastic search techniques have been studied [KGV83].

Two key problems that clustering algorithms need to address are: 1) how many clusters are present, and 2) how to initialize the cluster centers. An important approach to clustering, which is also the one taken in the present work, is hierarchical clustering, which involves merging or splitting of clusters in scale-space. In this paper we show how this approach solves the two aforementioned problems.
Supported in part by DARPA/ONR contract N00014-92-C-0232, AFOSR contract F49620-93-1-0307, and by a Faculty Development Grant from the TRW Foundation.


The question of scale naturally arises in clustering. At a fine scale each data point can be viewed as a cluster, and at a coarse scale the entire data set can be seen as a single cluster. Scale has not been given sufficient attention in the context of clustering, even though the notion of scale is long familiar in various fields and is fundamental to two new topics in mathematics, viz. fractals and wavelet theory [RV91].

Scale-based clustering has been studied by several authors in recent years ([RGF90]; [Won93]). Rose et al. have approached the clustering problem from a statistical mechanics perspective. In their analysis, temperature acts as the scale parameter. At a low temperature each data point is a cluster center, while the entire data set is treated as a single cluster at a sufficiently high temperature. It is natural that the number of clusters obtained using their technique depends on the temperature at which the data is clustered. But the authors suggest no definite criteria to decide what is the appropriate or natural scale for a given problem. Wong's study, which is also motivated by statistical mechanics, addresses this problem and suggests a way of determining the right scale based on some stability criteria ([Won93]). Further, he points out that the phenomenon of splitting/merging of clusters at critical values of scale is due to bifurcation, where the scale parameter is the bifurcation parameter.
In this paper, motivated by the analysis presented in [Won93], we show how scale-based clustering can be done using the Radial Basis Function Network (RBFN). These networks approximate an unknown function from sample data by positioning localized receptive fields over portions of the input space that contain the sample data. It will be seen that the width of the receptive fields is the scale parameter in our scheme. Existing solutions to the determination of network parameters do not give satisfactory answers to several crucial questions, such as how many receptive fields are required for a good fit and what the width of the receptive fields should be. Viewing width as a scale parameter appears to throw new light on these questions.


The paper is organized as follows: Section 2 briefly describes the scale-based clustering scheme presented in [Won93]. In Section 3 it is shown how the analysis of Section 2 can be applied to the RBFN case. Accordingly, a way of performing scale-based clustering using the RBFN is suggested. An experimental study of this technique is presented in Section 4. A detailed discussion of the present technique, and its possible extension to approximation tasks using the RBFN, is given in the final section.
II. SCALE-BASED CLUSTERING BY MELTING

Wong took a new approach to multi-scale clustering based on the theory of statistical mechanics and bifurcation theory [Won93]. This approach allows one to consider one cluster at a time. The cost e(x), for associating a datum x with a cluster C whose center is y, is defined as

    e(x) = \|x - y\|^2.    (1)

Let P(x) denote the Gibbs distribution for associating a datum x with C. Now, the entropy,

    S = -\sum_x P(x) \log P(x),    (2)

is maximized subject to the constraint that the average cost \sum_x e(x) P(x) is held fixed. We obtain,

    P(x) = \frac{\exp(-\beta e(x))}{Z},    (3)

where,

    Z = \sum_x \exp(-\beta e(x)).    (4)

Similar to the free energy of thermodynamics, a function,

    F = -\frac{1}{\beta} \log Z,    (5)

is defined which is minimized as the system settles into a stable configuration. Solving \partial F / \partial y = 0 we get,

    y = \sum_x x P(x).    (6)

Since (6) cannot be solved directly for y, it is evaluated iteratively by the following rule,

    y_{n+1} = f(y_n) = \frac{1}{Z} \sum_x x \exp(-\beta \|x - y_n\|^2).    (7)

The final y's are the fixed points of the above one-parameter map, for which F acts as a Lyapunov function.

It can be seen intuitively that β acts as a scale parameter. The number of clusters obtained by this procedure therefore depends on the value of β and the data set at hand. As β is decreased gradually from a large value, the positions of the cluster centers vary smoothly, and at certain critical values of β the number of clusters suddenly changes. This is due to a bifurcation in y, with β as the bifurcation parameter. The necessary condition for bifurcation to occur is

    \frac{\partial f}{\partial y} = 1.    (8)

Two kinds of bifurcation occur in this process: a pitchfork bifurcation, where two clusters continuously merge into a single cluster, and a saddle-node bifurcation, where a cluster becomes unstable and is siphoned into another.

An important contribution of [Won93] is a criterion to decide how good a cluster is. The quantity Fractional Free Energy (FFE) of a nominal cluster Q,

    M_Q(\beta) = \frac{1}{Z} \sum_{x \in Q} \exp(-\beta e(x)),    (9)

is a measure of how good/stable a cluster is. A good cluster must have a large FFE (over a threshold M_T) over a large range of β on a logarithmic scale.
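To make the procedure concrete, the following sketch iterates the center update of equation (7) for one nominal cluster and evaluates the FFE of equation (9). It assumes the squared-distance cost of equation (1); the function names, loop structure, and convergence tolerance are our own choices, not part of Wong's prescription.

    import numpy as np

    def melt_center(x, y0, beta, n_iter=500, tol=1e-9):
        """Fixed-point iteration of eq. (7): y <- (1/Z) sum_x x exp(-beta ||x - y||^2)."""
        y = np.asarray(y0, dtype=float)
        for _ in range(n_iter):
            g = np.exp(-beta * np.sum((x - y) ** 2, axis=1))   # Gibbs weights, eq. (3)
            y_new = (g[:, None] * x).sum(axis=0) / g.sum()     # normalization by Z, eq. (4)
            if np.linalg.norm(y_new - y) < tol:
                return y_new
            y = y_new
        return y

    def fractional_free_energy(x, y, beta, members):
        """Eq. (9): fraction of the total Gibbs weight carried by the points in Q (boolean mask)."""
        g = np.exp(-beta * np.sum((x - y) ** 2, axis=1))
        return g[members].sum() / g.sum()

A center can be started at each data point and tracked as β is lowered on a logarithmic grid; by Wong's criterion, a cluster whose FFE stays above the threshold M_T over a wide range of β is a good cluster.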
III. WIDTH AS A SCALE PARAMETER IN THE RADIAL BASIS FUNCTION NETWORK

In this section, the analysis presented in the previous section will be applied to the learning dynamics of the RBFN. By pointing out a relation between the RBFN learning dynamics and Wong's clustering procedure, it will be shown how multi-scale clustering can be done using the RBFN, with the width parameter playing the role of a scale parameter.

The RBFN belongs to the general class of three-layered feedforward networks. The output of the i-th output node, f_i(x), when input vector x is presented, is given by:

    f_i(x) = \sum_j w_{ij} R_j(x),    (10)

where R_j(x) = R(\|x - x_j\|/\sigma_j) is a suitable radially symmetric function that defines the output of the j-th hidden node. Often R(·) is chosen to be the Gaussian function, in which case the width parameter σ_j is the standard deviation. In equation (10), x_j is the location of the j-th centroid, where each centroid is represented by a kernel/hidden node, and w_ij is the weight connecting the j-th kernel/hidden node to the i-th output node.
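As a point of reference for what follows, here is a minimal sketch of the forward computation in equation (10) with Gaussian basis functions R_j(x) = exp(-\|x - x_j\|^2/\sigma_j^2); the array shapes and function names are our own choices.

    import numpy as np

    def rbf_activations(x, centers, sigmas):
        """R_j(x) = exp(-||x - x_j||^2 / sigma_j^2), one value per hidden node."""
        d2 = np.sum((centers - x) ** 2, axis=1)      # squared distance to every centroid
        return np.exp(-d2 / sigmas ** 2)

    def rbfn_output(x, centers, sigmas, weights):
        """Eq. (10): f_i(x) = sum_j w_ij R_j(x); 'weights' has one row per output node."""
        return weights @ rbf_activations(x, centers, sigmas)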


Learning involves adapting some or all of the three sets of parameters, viz., w_ij, x_j, and σ_j. Some fast learning procedures exist in which the centroids are calculated using clustering methods like the k-means algorithm, the width parameters by various heuristics, and the weights w_ij by pseudoinversion techniques like the Singular Value Decomposition.

Alternatively, these parameters can be calculated by minimizing the error in the network performance. Consider a quadratic error function, E = \sum_p E_p, where E_p = \sum_i (t_i^p - f_i(x^p))^2. Here t_i^p is the target output for input x^p and f_i is as defined in equation (10). The mean square error is the expected value of E_p over all patterns. The parameters can be changed adaptively by performing gradient descent on E_p, as given by the following equations:

    \Delta w_{ij} = \eta_1 (t_i^p - f_i(x^p)) R_j(x^p),    (11)

    \Delta x_j = -\eta_2 R'_j(x^p) \frac{x^p - x_j}{\sigma_j \|x^p - x_j\|} \left( \sum_i (t_i^p - f_i(x^p)) w_{ij} \right),    (12)

    \Delta \sigma_j = -\eta_3 R'_j(x^p) \frac{\|x^p - x_j\|}{\sigma_j^2} \left( \sum_i (t_i^p - f_i(x^p)) w_{ij} \right),    (13)

where R'_j denotes the derivative of R evaluated at \|x^p - x_j\|/\sigma_j and the η's are learning rates.

Now, assume w_ij, t_i^p, and σ_j are assigned constant values w, t, and σ respectively, with the additional condition that w/t << 1. The network now has a single output with a fixed value, w, assigned to all the weights connected to the output, and a constant target output, t, assigned to all input patterns. The widths of all the RBF units are the same. Then, for Gaussian basis functions, (12) becomes:

    \Delta x_j = \eta \exp\left( -\frac{\|x^p - x_j\|^2}{\sigma^2} \right) \frac{(x^p - x_j)}{\sigma^2} (t - f(x^p)) w.    (14)

Since w/t << 1, and R(·) is bounded both above and below, for sufficiently small w the term (t - f(x^p))w ≈ tw, a constant. Thus, from (14), the iterative update rule is

    \Delta x_j = \eta t w \exp\left( -\frac{\|x^p - x_j\|^2}{\sigma^2} \right) \frac{(x^p - x_j)}{\sigma^2},    (15)

where the parameter σ plays a role similar to β in Wong's procedure. Note that (15) closely resembles the iterative rule (7) for cluster center calculation in Wong's procedure; the only additional term which appears in (7) is Z, which is a normalizing factor. Similar to (7), (14) can be viewed as a one-parameter map. The error function E is naturally the Lyapunov function of the above map dynamics.

Now the central idea of the present work is to find clusters in a given data set using equation (14). The network is trained on various data sets with a constant target output, t. Fixed points of equation (14), which are the centroid positions, are computed at various values of σ. These centroids cluster the input data at a scale determined by σ.
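A sketch of this idea is given below: at a fixed width σ, the centroids are repeatedly updated with the simplified rule (15) (the constant factor tw is absorbed into the learning rate) until they stop moving, which yields the fixed points of the map. The learning rate, epoch limit, and stopping tolerance are our own choices.

    import numpy as np

    def centroids_at_scale(data, centroids, sigma, lr=0.05, n_epochs=1000, tol=1e-6):
        """Iterate eq. (15): each centroid x_j drifts toward data points weighted by
        exp(-||x_p - x_j||^2 / sigma^2); sigma fixes the scale of the clustering."""
        c = np.array(centroids, dtype=float)
        for _ in range(n_epochs):
            prev = c.copy()
            for xp in data:                                              # one pass over the patterns
                g = np.exp(-np.sum((xp - c) ** 2, axis=1) / sigma ** 2)  # R_j(x_p)
                c += lr * (g / sigma ** 2)[:, None] * (xp - c)           # Delta x_j of eq. (15)
            if np.max(np.abs(c - prev)) < tol:                           # fixed point reached
                break
        return c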
IV. SIMULATION RESULTS

Two simple one-dimensional data sets and one 2-D data set are chosen for our simulations. The three data sets have 2, 4, and 3 clusters respectively. Histograms of data sets I and II are shown in Figs. 1 and 3. Data set II is simply a shifted version of data set I added to itself. Even though, at first glance, data set II appears to have 4 clusters, it can be seen that at a larger scale only 2 clusters are present. Data set III consists of 3 well-separated clusters in a plane.

The procedure followed in performing the clustering at different scales is briefly described. We start with a large number of RBF nodes and initialize the centroids by picking at random from the data set on which clustering is to be performed. Then the following two steps are executed (a sketch in code follows):

1) Find the correct centroids by iterating equation (14).
2) Increase σ by a constant factor and repeat the previous step.
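The sketch below implements this sweep, building on the centroids_at_scale routine given earlier. The initial width, the multiplicative factor, the number of scales, and the random initialization from the data are placeholder choices of ours, not values prescribed by the paper.

    import numpy as np

    def scale_sweep(data, n_nodes=10, sigma0=0.005, factor=1.25, n_scales=30, seed=0):
        """Track centroid positions as sigma is raised by a constant factor; plotting the
        recorded positions against sigma gives tree diagrams like those of Figs. 2 and 4."""
        data = np.asarray(data, dtype=float)
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(data), size=n_nodes, replace=False)
        centroids = data[idx].copy()                       # initialize at randomly chosen data points
        tree, sigma = [], sigma0
        for _ in range(n_scales):
            centroids = centroids_at_scale(data, centroids, sigma)   # step 1: fixed points of (14)
            tree.append((sigma, centroids.copy()))
            sigma *= factor                                           # step 2: raise the scale
        return tree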

For the simulation with data set I, 10 RBF nodes are used. The change in location of the centroids as σ is varied is depicted as a tree diagram in Fig. 2. For small σ there is no significant variation in centroid positions. As σ increases, for 0.01 < σ < 0.02, pairs of centroids merge, leaving only 8 distinct centroids. Further merging takes place at σ = 0.02 and σ = 0.05. There are two cases where 3 centroids merge into one; it can be seen that in each of these two cases, 2 of the branches are unstable ones. At σ = 0.06 only 2 branches remain. It can be seen that these two branches vary little over a large range of σ, showing the viability of the corresponding clusters. Here, it must be mentioned that the tree diagram is biased on the lower side of σ because of the finite number of nodes chosen; if more nodes are chosen, more branches can be expected at lower scales. Finally, at σ = 0.2 the two stable branches also merge into a single branch.
The case of data set II is similar to the previous case in several respects. The tree diagram for this case is given in Fig. 4. The RBFN has 14 hidden nodes. At σ = 0.02 only 4 branches can be seen; every other one of the initial branches merged into one of these 4 at some lower value of σ. Thereafter these branches do not vary much over a large range of σ, until σ = 0.1598, when the 4 branches merge into 2 branches. From the tree diagram it appears that the 4 branches are stable over the greatest range of σ. Hence these 4 clusters provide the most natural clustering for the data set.

Data set III, with the initial positions of the 30 centroids, can be seen in Fig. 5. As σ is increased, at σ = 0.131 each of the centroids occupied one of 3 points in the plane after extensive merging. This indicates that the network discovered three clusters in the data at the aforementioned scale. The cluster centers found by the network are (-0.004, -0.008), (0.995, 0.991) and (1.995, -0.008), where the corresponding true centers are (0, 0), (1, 1) and (2, 0) respectively (see Fig. 6).

V. DISCUSSION
We have investigated a way of performing scale-based clustering using the RBFN. To our knowledge there is no previous instance in which this network has been used for clustering data. This work can be viewed as an extension of our earlier study linking Kohonen's Self-Organizing Feature Map [Koh89], which is a clustering technique, and the RBFN. Further, in the present work, it is shown how the width parameter controls the scale of the input space and how multi-scale clustering can be done.

Even though the RBFN is a feedforward network trained by supervised learning procedures, our method uses dummy target outputs, and the network dynamics seem to belong to the unsupervised learning category. A distinct feature of this technique is that the network is not used for its originally intended purpose, viz., mapping multi-dimensional inputs onto multi-dimensional outputs. The architecture of the RBFN is used in a novel way to accomplish clustering tasks which a supervised feedforward network is not designed to do. Also, no a priori information is given regarding the number of clusters present in the data.
It will be noted that the width parameter acquires a new significance in our work. This parameter has been largely neglected by RBFN studies in the past for several reasons. In some of the early work which popularized RBFNs, the label "localized receptive fields" was attached to these networks [MD89]. The intuitive idea that RBFNs approximate a function by combining several overlapping local fits to the target function gained ground. Therefore, the width is usually only allowed to be of the same scale as the distance between centroids of neighboring receptive fields. Advantages of such a choice of width are robustness of the model, and stable and fast learning. On the other hand, the same feature requires large-sized networks for approximating even simple polynomial functions over a considerable measure of the input space. Therefore, RBFNs are often better suited for interpolation than for extrapolation. A study by Hoskins et al. [HLC93] shows that exploiting the scale-like characteristic of the width parameter overcomes this difficulty to some extent.
Our model not only shows how to perform clustering at different scales but, in the manner of [Won93], prescribes a way of determining the appropriate scale for clustering. Choosing the right scale must take into consideration not only low-cost criteria but also issues pertaining to the stability of the network model. The question of stability exposes neural network modeling to structural stability and related issues. A task for the future is to achieve better generalization on approximation tasks using the RBFN.
The RBFN has several desirable features which render work with this model challenging. In contrast to the popular Multi-Layered Perceptron (MLP) trained by backpropagation, the RBFN is not conceptually opaque. For this reason, training one layer at a time is possible without having to deal directly with an all-containing cost function, as in the case of backpropagation. In another study, the present authors describe how Self-Organized Feature Maps can be constructed using the RBFN for 1-D data [GC93]. The same study analytically proves that the map generated by the RBFN is identical to that obtained by Kohonen's procedure in a limiting situation. Work is underway to extend this technique for generating maps of higher dimensions. Relating the RBFN to other prominent neural network architectures has obvious theoretical and practical interest. Unifying the underlying features of various models, as brought out by the above-mentioned studies, is of theoretical importance. On the practical side, incorporating several models in a single architecture is beneficial for effective and efficient VLSI implementation.

REFERENCES

[DH73] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.

[GC93] J. Ghosh and S. V. Chakravarthy. Rapid kernel classifier: A link between the self-organizing feature map and the radial basis function network. Journal of Intelligent Material Systems and Structures (Special Issue on Neural Networks), October 1993.

[HLC93] J. C. Hoskins, P. Lee, and S. V. Chakravarthy. Polynomial modeling behavior in radial basis function networks. In Proc. of World Conference on Neural Networks, Portland, OR, pages 693-699, July 1993.

[KGV83] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, May 1983.

[Koh89] T. Kohonen. Self-Organization and Associative Memory. Springer-Verlag, Berlin, 3rd ed., 1989.

[MD89] J. Moody and C. J. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.

[RGF90] K. Rose, E. Gurewitz, and G. C. Fox. A deterministic annealing approach to clustering. Pattern Recognition Letters, 11:589-594, 1990.

[RV91] O. Rioul and M. Vetterli. Wavelets and signal processing. IEEE Signal Processing Magazine, pages 14-38, October 1991.

[Won93] Y. F. Wong. Clustering data by melting. Neural Computation, 5(1):89-104, 1993.

Figure 1: Histogram of Data with 2 Clusters.

Figure 2: Tree diagram for 2-cluster Data with 10 RBF nodes.

Figure 3: Histogram of Data with 4 Clusters.

Figure 4: Tree diagram for 4-cluster Data with 14 RBF nodes.

Figure 5: Data with 3 Clusters showing initial Cluster Centers.

Figure 6: Data with 3 Clusters showing Cluster Centers at σ = 0.131.
