"Neural-Gas" Network for Vector Quantization and its Application to Time-Series Prediction

Thomas M. Martinetz, Member, IEEE, Stanislav G. Berkovich, and Klaus J. Schulten

Abstract: A neural network algorithm based on a "soft-max" adaptation rule is presented that exhibits good performance in reaching the optimum, or at least coming close to it, when minimizing the vector quantization distortion error, which, in general, has many local minima. The soft-max rule employed is an extension of the standard K-means clustering procedure and takes into account a "neighborhood ranking" of the reference (weight) vectors. It is shown that the dynamics of the reference (weight) vectors during the input-driven adaptation procedure 1) is determined by the gradient of an energy function whose shape can be modulated through a neighborhood-determining parameter, and 2) resembles the dynamics of Brownian particles moving in a potential determined by the data point density. The network is employed to represent the attractor of the Mackey-Glass equation and to predict the corresponding time series, with local linear mappings generating the output values. The results obtained for the time-series prediction compare very favorably with the results achieved by back-propagation and radial basis function networks.

I. INTRODUCTION

ARTIFICIAL as well as biological information processing systems that have to store or transmit large amounts of data often require the compression of this data into a more efficient representation. In many cases, e.g., in applications involving speech and image processing, this compression relies on "vector quantization" techniques (for a review see, e.g., [1]). Through vector quantization a data manifold, e.g., a submanifold V of R^D, is encoded by a finite set w = (w_1, ..., w_N) of reference or "codebook" vectors (also called cluster centers) w_i in R^D, i = 1, ..., N. A data vector v in V is described by the best-matching or "winning" reference vector w_{i(v)} of w, for which the distortion error, e.g., the squared error ||v − w_{i(v)}||², is minimal. This procedure divides the manifold V into a number of subregions

V_i = { v in V : ||v − w_i|| ≤ ||v − w_j|| for all j },    (1)

called Voronoi polygons or Voronoi polyhedra, within which each data vector v is described by the corresponding reference vector w_i. If the data points v over the manifold V are given through a density distribution P(v), the average distortion or reconstruction error is determined by the expression

E = ∫ d^D v P(v) (v − w_{i(v)})²    (2)

and has to be minimized through an optimal choice of reference vectors w_i.

The difficulty of this minimization lies in the fact that the distortion error E is a nonconvex function of the reference vectors and, in general, has many local minima. The standard procedure for minimizing E is K-means clustering [2], [3]: with each adaptation step a data point v in V is presented, the "winning" reference vector closest to v is determined, and the reference vectors are adjusted according to

Δw_i = ε δ_{i, i₀(v)} (v − w_i),    (3)

with δ_{ij} the Kronecker delta and i₀(v) the index of the winning reference vector. This "winner-take-all" rule corresponds to a stochastic gradient descent on (2) and, therefore, in general becomes trapped in a local minimum close to the initial configuration of reference vectors.

One approach to avoiding poor local minima is stochastic optimization akin to simulated annealing [5], e.g., "maximum-entropy" clustering [4], [6]. There, each data point v is assigned to every reference vector with a probability determined by a Gibbs distribution, and an adaptation step takes the "soft-max" form

Δw_i = ε (e^{−β(v − w_i)²} / Σ_j e^{−β(v − w_j)²}) (v − w_i),    (4)

where β is an inverse temperature parameter. Rule (4) corresponds to a stochastic gradient descent on the cost function

E(w, β) = −(1/(2β)) ∫ d^D v P(v) ln Σ_j e^{−β(v − w_j)²}.    (5)

During the adaptation procedure β is increased from small to large values. For β → ∞ the assignment becomes "hard" again and (5) becomes equivalent to the distortion error (2); with a sufficiently slow annealing of β good solutions can, in principle, be reached, but the required annealing schedules make the convergence very slow for practically feasible numbers of adaptation steps.

A third "soft-max" procedure is Kohonen's topology-conserving feature map [7]-[10], which has been applied to vector quantization problems in speech and image coding as well as to robot control tasks [10]-[18]. In Kohonen's algorithm each reference vector w_i is assigned to a site i of a lattice A. With each presentation of a data point v, the reference vectors are adjusted according to

Δw_i = ε h_σ(i, i₀(v)) (v − w_i),    (6)

where i₀(v) denotes the site of the lattice at which the reference vector closest to v is located, and h_σ(i, i₀) is a neighborhood function that decays with the lattice distance between i and i₀ over a range σ. The additional coupling introduced by the lattice is useful if the mapping of V onto A, i.e., the preservation of the topology of the data manifold, is itself of interest. For vector quantization, however, the lattice is advantageous only if its topology matches the topology of the data manifold V. If the structure or dimensionality of V is not known a priori, or if the data manifold is intricately structured, a matching lattice cannot be specified, and the coupling to the lattice leads to suboptimal distortion errors. In addition, for σ > 0 one cannot specify a cost function that is minimized by (6).

II. THE "NEURAL-GAS" ALGORITHM

In this paper we present a neural network model which, applied to the task of vector quantization, 1) converges quickly to low distortion errors, 2) reaches a distortion error E lower than that resulting from K-means clustering, maximum-entropy clustering (for practically feasible numbers of iteration steps), and Kohonen's feature map, and 3) at the same time obeys a gradient descent on an energy surface (like the maximum-entropy clustering, in contrast to Kohonen's feature map algorithm). For reasons we will give later, we call this network model the "neural-gas" network. Similar to the maximum-entropy clustering and Kohonen's feature map, the neural-gas network also uses a "soft-max" adaptation rule. However, instead of the distance ||v − w_i|| or of the arrangement of the w_i's within an external lattice, it utilizes a "neighborhood ranking" of the reference vectors w_i for the given data vector v.

Each time a data vector v is presented, we determine the "neighborhood ranking" (w_{i₀}, w_{i₁}, ..., w_{i_{N−1}}) of the reference vectors, with w_{i₀} being closest to v, w_{i₁} being second closest to v, and w_{i_k}, k = 0, ..., N − 1, being the reference vector for which there are k vectors w_j with ||v − w_j|| < ||v − w_{i_k}||. If we denote the number k associated with each vector w_i by k_i(v, w), which depends on v and the whole set w = (w_1, ..., w_N) of reference vectors, then the adaptation step we employ for adjusting the w_i's is given by

Δw_i = ε h_λ(k_i(v, w)) (v − w_i),    i = 1, ..., N.    (7)

The step size ε in [0, 1] describes the overall extent of the modification, and h_λ(k_i(v, w)) is unity for k_i = 0 and decays to zero for increasing k_i with a characteristic decay constant λ.
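For illustration, a single adaptation step of (7) can be sketched in a few lines of NumPy. The sketch below assumes the exponential neighborhood function h_λ(k) = exp(−k/λ) used in the simulations described in the next section; the function name neural_gas_step and all variable names are ours and not part of the original presentation.

```python
import numpy as np

def neural_gas_step(w, v, eps, lam):
    """One adaptation step of the neural-gas rule (7).

    w   : (N, D) array of reference vectors w_i
    v   : (D,) data vector drawn from P(v)
    eps : step size in [0, 1]
    lam : decay constant lambda of the neighborhood function
    """
    # Squared distances ||v - w_i||^2 for all reference vectors.
    dist2 = np.sum((w - v) ** 2, axis=1)
    # k_i(v, w): number of reference vectors closer to v than w_i,
    # obtained from the neighborhood ranking (argsort of the distances).
    ranking = np.argsort(dist2)           # indices ordered from closest to farthest
    k = np.empty(len(w), dtype=int)
    k[ranking] = np.arange(len(w))        # rank 0 for the "winner", 1 for the runner-up, ...
    # h_lambda(k_i) = exp(-k_i / lambda): unity for the winner, decaying with rank.
    h = np.exp(-k / lam)
    # Move every reference vector toward v, weighted by its rank (eq. (7)).
    w += eps * h[:, None] * (v - w)
    return w
```

For λ → 0 only the winner receives a non-negligible update, so the sketch reduces to the K-means rule (3); for large λ all reference vectors are pulled toward v.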
In the simulations described below we chose h_λ(k_i(v, w)) = e^{−k_i(v, w)/λ}. For λ → 0, (7) becomes equivalent to the K-means adaptation rule (3), whereas for λ ≠ 0 not only the "winner" w_{i₀} but also the second closest reference vector w_{i₁}, the third closest reference vector w_{i₂}, etc., are updated.

As we show in Appendix I, the dynamics of the w_i's obeys a stochastic gradient descent on the cost function

E_ng(w, λ) = (1/(2 C(λ))) Σ_{i=1}^{N} ∫ d^D v P(v) h_λ(k_i(v, w)) (v − w_i)²    (8)

with

C(λ) = Σ_{k=0}^{N−1} h_λ(k)

as a normalization factor that depends only on λ. E_ng is related to the framework of fuzzy clustering [24], [25]. In contrast to hard clustering, where each data point v is deterministically assigned to its closest reference vector w_{i₀}, fuzzy clustering associates v to a reference vector w_i with a certain degree p_i(v), the so-called fuzzy membership of v to cluster i. In the case of hard clustering, p_{i₀}(v) = 1 and p_i(v) = 0 for i ≠ i₀ holds. If we choose a "fuzzy" assignment of data point v to reference vector w_i that depends on whether w_i is the nearest, next-nearest, next-next-nearest, etc., neighbor of v, i.e., if we choose p_i(v) = h_λ(k_i(v, w))/C(λ), then the average distortion error we obtain, and which has to be minimized, is given by E_ng, and the corresponding gradient descent is given by adaptation rule (7).

Through the decay constant λ we can modulate the shape of the cost function E_ng. For λ → ∞ the cost function E_ng becomes parabolic, whereas for λ → 0 it becomes equivalent to the cost function E in (2), i.e., the cost function we ultimately want to minimize, but which has many local minima. Therefore, to obtain a good set of reference vectors, we start the adaptation process determined by (7) with a large decay constant λ and decrease λ with each adaptation step. By gradually decreasing the parameter λ we expect the local minima of E to emerge slowly, thereby preventing the set w of reference vectors from getting trapped in suboptimal states.

III. THE NETWORK'S PERFORMANCE ON A MODEL PROBLEM

To test the performance of the neural-gas algorithm in minimizing E and to compare it with the three other approaches we described (K-means clustering, maximum-entropy clustering, and Kohonen's topology-conserving map), we chose a data distribution P(v) for which 1) the global minimum of E is known for large numbers of reference vectors and 2) which reflects, at least schematically, essential features of data distributions that are typical in applications. Data distributions that arise in applications often consist of several, possibly separated, clusters of data points. Therefore, also for our test we chose a model data distribution that is clustered. To be able to determine the global minimum, in our model data distribution the clusters are of square shape within a two-dimensional input space. Since we choose N = 4 × (number of clusters) and separate the clusters far enough from each other, the optimal set of w_i's is given when each of the square clusters is represented by four reference vectors, and when the four reference vectors within each cluster are arranged in the known optimal configuration for a single square.
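For concreteness, the clustered model distribution just described might be sampled along the following lines; the cluster spacing, square side length, and helper names are illustrative choices of ours, not the values used in the original experiments.

```python
import numpy as np

def make_square_clusters(n_clusters=15, side=0.1, spacing=0.25, seed=0):
    """Centers of well-separated square clusters on a 2-D grid."""
    rng = np.random.default_rng(seed)
    grid = int(np.ceil(np.sqrt(n_clusters)))
    centers = np.array([(i * spacing, j * spacing)
                        for i in range(grid) for j in range(grid)])[:n_clusters]
    return rng, centers, side

def sample_data_point(rng, centers, side):
    """Draw one data point: pick a cluster with equal probability,
    then a point uniformly from the corresponding square."""
    c = centers[rng.integers(len(centers))]
    return c + rng.uniform(-side / 2, side / 2, size=2)
```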
In Fig. 1 we see the neural-gas network adapting to a representation of our model data distribution with 15 clusters and N = 60 reference vectors. With each adaptation step, a data point within one of the squares is stochastically chosen, with equal probability over each square. Subsequently, adjustments of the w_i's according to (7) are performed. We show the initial state, the states after 5000 and 15,000 steps, and finally the state after 80,000 adaptation steps.

Fig. 1. The neural-gas network representing a data distribution in a two-dimensional input space that consists of 15 separated clusters of square shape. On each cluster the data point density is homogeneous. The reference vectors w_i are depicted as points. The initial values for the w_i's are chosen randomly, as shown in the top left picture. We also show the state after 5000 (top right), 15,000 (bottom left), and 80,000 adaptation steps (bottom right). At the end of the adaptation procedure the set of reference vectors has converged to the optimal configuration, with each cluster represented by four reference vectors.

In the simulation run depicted in Fig. 1 the neural-gas algorithm was able to find the optimal representation of the data distribution. However, depending on the initial choice of the w_i's (chosen randomly) and depending on the speed with which the parameter λ is decreased, i.e., depending on the total number of adaptation steps t_max employed, it might happen that the reference vectors converge to a configuration that is only close to, but not exactly at, the optimum. Therefore, to demonstrate the average performance of the neural-gas algorithm, we show in Fig. 2 the mean distortion error for different total numbers of adaptation steps t_max. For each of the different total numbers of adaptation steps we averaged over 50 simulation runs, for each of which not only the initialization of the w_i's was chosen randomly but also the 15 clusters of our model data distribution were placed randomly. Since we know the minimal distortion error E_0 that can optimally be achieved for our model data distribution and the number of reference vectors we employ, we choose

α = (E(t_max) − E_0) / E_0

as a performance measure, with E(t_max) as the final distortion error reached. α = 0 corresponds to a simulation run that reached the global minimum, whereas, e.g., α = 1 corresponds to a very large distortion error, namely one that is twice as large as the optimum. As we can see in Fig. 2, for t_max = 100,000 the average performance of the neural-gas network is α = 0.09, which means that the average distortion error E for t_max = 100,000 is 9% larger than what can optimally be achieved.
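On a finite sample of data points, the distortion error (2) and the performance measure α can be estimated by a Monte Carlo average. A minimal sketch, assuming E_0 is known as in the model problem, could look as follows; the function names are ours.

```python
import numpy as np

def distortion_error(w, data):
    """Monte Carlo estimate of E in (2): mean squared distance of each
    data point to its closest ("winning") reference vector."""
    d2 = np.sum((data[:, None, :] - w[None, :, :]) ** 2, axis=2)
    return np.mean(np.min(d2, axis=1))

def performance_measure(w, data, e0):
    """alpha = (E(t_max) - E_0) / E_0; zero at the global optimum."""
    return (distortion_error(w, data) - e0) / e0
```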
For comparison, we also show in Fig. 2 the results achieved by K-means clustering, maximum-entropy clustering, and Kohonen's feature map algorithm. Up to t_max = 8000, only the distortion error of the K-means clustering is slightly smaller than the distortion error of the neural-gas algorithm. For t_max > 8000, all three procedures perform worse than the neural-gas algorithm. For a total number of 100,000 adaptation steps the distortion error of the maximum-entropy clustering is more than twice as large as the distortion error achieved by the neural-gas algorithm. Theoretically, for the maximum-entropy approach the performance measure α should converge to zero for t_max → ∞. However, as mentioned already in the introduction, the convergence might be extremely slow. Indeed, all four clustering procedures, including the maximum-entropy approach and the neural-gas algorithm, do not improve their final distortion error significantly further within the range 100,000 < t_max < 500,000, which is the limit up to which the tests were made.

Fig. 2. The performance of the neural-gas algorithm in minimizing the distortion error E for the model distribution of data points that is described in the text and an example of which is shown in Fig. 1. Depicted is the final performance α = (E − E_0)/E_0, with E_0 the minimal distortion error that can be achieved for the given data distribution and the number of reference vectors employed, plotted against the total number of adaptation steps t_max. For comparison, the same quantity is shown for the standard K-means clustering, the maximum-entropy clustering, and Kohonen's feature map algorithm. Up to t_max = 8000 only the distortion error of the K-means clustering is slightly smaller than that of the neural-gas algorithm. For t_max > 8000 the three other procedures perform worse than the neural-gas model. For a total number of 100,000 adaptation steps the distortion error achieved by the maximum-entropy procedure is larger by more than a factor of two than the distortion error achieved by the neural-gas algorithm.

Fig. 2 demonstrates that the convergence of the neural-gas algorithm is faster than the convergence of the three other approaches. This is important for practical applications in which adaptation steps are "expensive," e.g., in robot control, where each adaptation step corresponds to a trial movement of the robot arm. In applications that require the learning of input-output relations, vector quantization networks establish a representation of the input space that can then be used for generating output values, either through discrete output values [26], local linear mappings [12], or radial basis functions [27]. Kohonen's topology-conserving map as a vector quantizer, together with local linear mappings for generating output values, has been studied for a number of robot control tasks [12], [16]-[18]. However, because of its faster convergence, we took the neural-gas algorithm for an implementation of the learning algorithms [16]-[18] on an industrial robot arm [28]. Compared to the versions that are based on Kohonen's feature map and require about 6000 adaptation steps (trial movements of the robot arm) to reach the minimal positioning error [18], only 3000 steps are sufficient when the neural-gas network is employed [28].

For the simulations of the neural-gas network as presented in Fig. 2 we chose h_λ(k) = exp(−k/λ), with λ decreasing exponentially with the number of adaptation steps t, i.e., λ(t) = λ_i (λ_f/λ_i)^{t/t_max}, with λ_i = 10, λ_f = 0.01, and t_max covering the range shown in Fig. 2. Compared to other choices for the neighborhood function h_λ(k), e.g., Gaussians, h_λ(k) = exp(−k/λ) provided the best results. The step size ε has the same time dependence as λ, i.e., ε(t) = ε_i (ε_f/ε_i)^{t/t_max}, with ε_f = 0.005. The similarity of the neural-gas network and the Kohonen algorithm motivated this time dependence x(t) = x_i (x_f/x_i)^{t/t_max} for ε and λ; it has provided good results in applications of the Kohonen network [16]-[18]. The particular choice of λ_i, λ_f, ε_i, and ε_f is not very critical and was optimized by trial and error.
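Putting the pieces together, a complete simulation run with the exponentially decaying schedules for λ and ε could be sketched as follows, reusing neural_gas_step from the sketch in Section II. The schedule endpoints λ_i = 10, λ_f = 0.01, and ε_f = 0.005 mirror the values quoted above; the remaining defaults, and the generic data_sampler callable, are illustrative assumptions of ours.

```python
import numpy as np

def schedule(x_i, x_f, t, t_max):
    """Exponential interpolation x(t) = x_i * (x_f / x_i) ** (t / t_max)."""
    return x_i * (x_f / x_i) ** (t / t_max)

def train_neural_gas(data_sampler, n_ref=60, dim=2, t_max=80000,
                     lam_i=10.0, lam_f=0.01, eps_i=0.5, eps_f=0.005, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.uniform(0.0, 1.0, size=(n_ref, dim))   # random initial reference vectors
    for t in range(t_max):
        lam = schedule(lam_i, lam_f, t, t_max)     # annealed decay constant lambda(t)
        eps = schedule(eps_i, eps_f, t, t_max)     # annealed step size epsilon(t)
        v = data_sampler(rng)                      # draw one data point from P(v)
        w = neural_gas_step(w, v, eps, lam)        # adaptation step (7), see Section II
    return w
```

For the model problem, data_sampler could be bound to the square-cluster sampler above, e.g., lambda rng: sample_data_point(rng, centers, side).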
The only simulation parameter of the adaptive K-means clustering is the step size ε, which was chosen in our simulations to be identical to that of the neural-gas algorithm. In contrast to the three other vector quantization procedures, the final result of the K-means clustering depends very much on the quality of the initial distribution of the reference vectors w_i. Therefore, to avoid a comparison biased in favor of the neural-gas network, we initialized the K-means algorithm in a more prestructured way by exploiting a priori knowledge about the data distribution. Rather than initializing the w_i's totally at random, they were randomly assigned to data vectors lying within the 15 clusters. This choice prevents some of the codebook vectors from remaining unused.

For the Monte Carlo simulations of the maximum-entropy clustering the step size ε was also chosen as for the neural-gas algorithm. The inverse temperature β had the time dependence β(t) = β_i (β_f/β_i)^{t/t_max}, with β_i = 1 and β_f = 10,000. This scheduling of β provided the best results for the range of total numbers of adaptation steps t_max that was investigated.

Also for Kohonen's feature map algorithm the step size ε was chosen as for the neural-gas algorithm and the other two clustering procedures. The function h_σ(i, j) that determines the neighborhood relation between site i and site j of the lattice A of Kohonen's feature map algorithm was chosen to be a Gaussian of the form h_σ(i, j) = exp(−||i − j||²/2σ²) [9], [16]-[18]. The decay constant σ, like λ, decreased with the adaptation steps t according to σ(t) = σ_i (σ_f/σ_i)^{t/t_max}, with σ_f = 0.01. The values of σ_i and σ_f were optimized.

IV. "GAS-LIKE" DYNAMICS OF THE REFERENCE VECTORS

In this section we explain the name "neural gas" and give a quantitative expression for the density distribution of the reference vectors. We define the density ρ(u) of reference vectors at location u of V ⊆ R^D through ρ(u) = 1/F_{i(u)}, with F_{i(u)} being the volume of the Voronoi polygon V_{i(u)} around the reference vector closest to u. According to the definition of V_{i(u)}, which was given in (1), u ∈ V_{i(u)} is valid. Hence, ρ(u) is a step function that is constant on each Voronoi polygon V_i. In the following we study the case where the Voronoi polygons change their size F_i only slowly from one Voronoi polygon to the next. Then we can regard ρ(u) as being continuous, which allows us to derive an expression for the dependence of ρ(u) on the density distribution of data points P(u).

For a given v, the density distribution ρ(u) determines the numbers k_i(v, w), i = 1, ..., N, which are necessary for an adjustment of the reference vectors w_i. k_i(v, w) is the number of reference vectors within a sphere centered at v with radius ||v − w_i||, i.e.,

k_i(v, w) = ∫_{||u−v|| ≤ ||v−w_i||} d^D u ρ(u).    (9)

In the following we look at the average change ⟨Δw_i⟩ of a reference vector with an adaptation step (7), given through

⟨Δw_i⟩ = ε ∫ d^D v P(v) h_λ(k_i(v, w)) (v − w_i).    (10)

In the case of a small decay constant λ, i.e., a λ for which the range of h_λ(k_i(v, w)) is small compared to the curvature of P(u) and ρ(u), we may expand the integrand of (10) around w_i, since only the data points v for which ||v − w_i|| is small contribute to the integral. If, as in the simulations described previously, λ decreases toward zero with the number of adaptation steps, the range of h_λ(k_i(v, w)) will always become that small at some point during the adaptation process. The expansion of the integrand together with (9) yields, to leading order in λ,

⟨Δw(u)⟩ ∝ ρ(u)^{−(1+2/D)} [ ∂_u P(u) − (1 + 2/D) (P(u)/ρ(u)) ∂_u ρ(u) ],    (11)

where ∂_u denotes the gradient with respect to the coordinates of the data space. Equation (11) states that the average change of a reference vector w_i at location u is determined by two terms, one of which is proportional to ∂_u P(u) at u and one of which is proportional to ∂_u ρ(u) at u. The derivation of (11) is provided in Appendix II.

Equation (11) suggests the name "neural gas" for the algorithm introduced here. The average change of the reference vectors corresponds to an overdamped motion of particles in a potential V(u) that is given by the negative data point density, i.e., V(u) = −P(u). Superimposed on the gradient of this potential is a "force" proportional to −∂_u ρ(u), which points toward the regions of the space where the particle density ρ(u) is low.
This "force" is the result of a repulsive coupling between the particles (reference vectors). In its form it resembles an entropic force and tends to distribute the particles (reference vectors) homogeneously over the input space, as in the case of a diffusing gas.

The stationary solution of (11), i.e., the solution of ⟨Δw⟩ = 0, is given by

ρ(u) ∝ P(u)^γ    (12)

with

γ = D / (D + 2).    (13)

This relation describes the asymptotic density distribution of the reference vectors w_i and states that the density ρ(u) of reference vectors at location u is nonlinearly proportional to the density of data points P(u). An asymptotic density distribution of the reference vectors that is proportional to P(u)^γ with γ = D/(D + 2) is optimal for the task of minimizing the average quadratic distortion error (2) [29].

We tested (13) for a one-dimensional data distribution, i.e., for D = 1. For this purpose we chose a data density distribution of the form P(u) = 2u, u ∈ [0, 1], and N = 50 reference vectors. The initial values for the w_i ∈ R were drawn randomly from the interval [0, 1]. For the parameters ε and λ we chose the small but finite values ε = 0.01 and λ = 2, which were kept constant during a subsequent run of 5,000,000 adaptation steps. A double-logarithmic fit of the final result, i.e., of the 50 local densities ρ(w_i) = 2/(w_{i+1} − w_{i−1}), i = 1, ..., 50, against P(w_i), yields an exponent of 0.323, which compares well with the theoretical value γ = 1/3 given by (13).
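The exponent in (12) and (13) can be checked numerically along the lines of the one-dimensional experiment just described. The following sketch runs the neural gas with fixed ε = 0.01 and λ = 2 on P(u) = 2u and estimates the exponent from a double-logarithmic fit of ρ(w_i) against P(w_i); these parameter values follow the text, while the sampling and fitting details are our own choices (the step count may be reduced for a quick check, since the loop is slow in pure Python).

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, lam, steps = 50, 0.01, 2.0, 5_000_000      # values from the text; reduce steps to test quickly
w = np.sort(rng.uniform(0.0, 1.0, N))              # 1-D reference vectors, random initial values

for _ in range(steps):
    v = np.sqrt(rng.uniform())                     # inverse-CDF sampling of P(u) = 2u on [0, 1]
    k = np.empty(N, dtype=int)
    k[np.argsort(np.abs(w - v))] = np.arange(N)    # neighborhood ranks k_i(v, w)
    w += eps * np.exp(-k / lam) * (v - w)          # adaptation step (7)

w = np.sort(w)
rho = 2.0 / (w[2:] - w[:-2])                       # local density rho(w_i) = 2 / (w_{i+1} - w_{i-1})
p = 2.0 * w[1:-1]                                  # data density P(w_i) = 2 w_i
gamma = np.polyfit(np.log(p), np.log(rho), 1)[0]   # slope of the double-logarithmic fit
print(gamma)                                       # should come out close to D/(D+2) = 1/3
```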
V. ON THE COMPLEXITY OF THE NEURAL-GAS NETWORK

The computationally expensive part of an adaptation step of the neural-gas algorithm is the determination of the "neighborhood ranking," i.e., of the k_i, i = 1, ..., N. In a parallel implementation of the neural-gas network, each reference vector w_i can be assigned to a computational unit i. To determine its k_i, each unit i has to compare the distance ||v − w_i|| of its reference vector to the input v with the distances ||v − w_j|| of all the other units j, j = 1, ..., N. If each unit performs this comparison in a parallelized way, each unit i needs O(log N) time steps to determine its "neighborhood rank" k_i. In a subsequent time step, each computational unit i adjusts its w_i according to (7). Hence, in a parallel implementation the computation time required for an adaptation step of the neural-gas algorithm increases like log N with the number N of reference vectors.

A scaling like log N is an interesting result, since the computation time for an adaptation step of a "winner-take-all" network like the K-means clustering algorithm, which requires much less computation because only the "winning" unit has to be determined, also scales like log N in a parallel implementation. In a serial implementation, of course, the computation time required for an adaptation step of the neural-gas algorithm increases faster with N than the corresponding computation time for a step of the K-means clustering. Determining the k_i, i = 1, ..., N, in a serial implementation corresponds to sorting the distances ||v − w_i||, i = 1, ..., N, which scales like N log N. Searching for the smallest distance ||v − w_i|| in order to perform a step of the K-means clustering scales only linearly with the number of reference vectors.

VI. APPLICATION TO TIME-SERIES PREDICTION

A very interesting learning problem is the prediction of deterministic but chaotic time series, which we take as an application example of the neural-gas network. The particular time series we choose is the one generated by the Mackey-Glass equation [30]. The prediction task requires learning an input-output mapping y = f(v) from a current state v of the time series (a vector of D consecutive time-series values) to a prediction of a future time-series value y.

If one chooses D large enough, e.g., D = 4 in the case of the Mackey-Glass equation, the D-dimensional state vectors v all lie within a limited part of the D-dimensional space and form the attractor V of the Mackey-Glass equation. In order to approximate the input-output relation y = f(v), we partition the attractor's domain into N smaller subregions V_i, i = 1, ..., N, and complete the mapping by choosing local linear mappings to generate the output values on each subregion V_i. To achieve optimal results with this approach, the partitioning of V into subregions has to be optimized by choosing V_i's whose overall size is as small as possible.

A way of partitioning V is to employ a vector quantization procedure and take the resulting Voronoi polygons as subregions V_i. Breaking up the domain region of an input-output relation by employing a vector quantization procedure and approximating the input-output relation by local linear mappings was suggested in [12]. Based on Kohonen's feature map algorithm, this approach has been applied successfully to various robot control tasks [12], [16]-[18] and has also been applied to the task of predicting time series [31]. However, as we have shown in Section III, for intricately structured input manifolds the Kohonen algorithm leads to suboptimal partitionings and, therefore, provides only suboptimal approximations of the input-output relation y = f(v). The attractor of the Mackey-Glass equation forms such a manifold. Its topology and its fractal dimension of 2.1 for the parameters chosen make it impossible to specify a corresponding lattice structure. For this reason we employ the neural-gas network for partitioning the input space V, which allows us to achieve good or even optimal subregions V_i also in the case of topologically intricately structured input spaces.

A hybrid approximation procedure that also uses a vector quantization technique for preprocessing the input domain, in order to obtain a convenient coding for generating output values, has been suggested by Moody and Darken [27]. In their approach, preprocessing the input signals, for which they used K-means clustering, serves the task of distributing the centers w_i of a set of radial basis functions, i.e., Gaussians, over the input domain. The approximation of the input-output relation is then achieved through superpositions of the Gaussians. Moody and Darken demonstrated the performance of their approach also for the problem of predicting the Mackey-Glass time series. A comparison of their result with the performance we achieve with the neural-gas network combined with local linear mappings is given below.

A. Adaptive Local Linear Mappings

The task is to adaptively approximate the function y = f(v), with v ∈ V ⊆ R^D and y ∈ R, where V denotes the function's domain region. Our network consists of N computational units, each containing a reference or weight vector w_i (for the neural-gas algorithm) together with a constant y_i and a D-dimensional vector a_i. The neural-gas procedure assigns each unit i to a subregion V_i as defined in (1), and the coefficients y_i and a_i define a linear mapping

ỹ = y_i + a_i · (v − w_i)    (14)

from R^D to R over each of the Voronoi polyhedra V_i. Hence, the function y = f(v) is approximated by ỹ = f̃(v) with

f̃(v) = y_{i(v)} + a_{i(v)} · (v − w_{i(v)}),    (15)

where i(v) denotes the computational unit i with its w_i closest to v.
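A minimal sketch of the resulting predictor (15): each unit stores (w_i, y_i, a_i), and the output for an input v is the linear expansion of the winning unit. The function name and array layout are ours, not part of the original presentation.

```python
import numpy as np

def predict(v, w, y, a):
    """Local linear mapping (15): output of the unit whose w_i is closest to v.

    w : (N, D) reference vectors, y : (N,) offsets, a : (N, D) slopes
    """
    i = np.argmin(np.sum((w - v) ** 2, axis=1))   # winning unit i(v)
    return y[i] + a[i] @ (v - w[i])               # y_i + a_i . (v - w_i), eq. (14)
```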
To learn the input-output mapping, we perform a series of training steps by presenting input-output pairs (v, y) with y = f(v). The reference vectors w_i are adjusted according to adaptation step (7) of the neural-gas algorithm. To obtain adaptation rules for the output coefficients y_i and a_i, for each i we require the mean squared error ∫_{V_i} d^D v P(v) (f(v) − f̃(v))² between the actual and the required output, averaged over the subregion V_i, to be minimal. A gradient descent with respect to y_i and a_i yields

Δy_i = ε' ∫_{V_i} d^D v P(v) (y − y_i − a_i · (v − w_i))    (16)

and

Δa_i = ε' ∫_{V_i} d^D v P(v) (y − y_i − a_i · (v − w_i)) (v − w_i).    (17)

For λ → 0 in adaptation step (7), the neural-gas algorithm provides an equilibrium distribution of the w_i's for which ∫_{V_i} d^D v P(v)(v − w_i) = 0 holds for each i, i.e., w_i is the center of gravity of the subregion V_i. Hence, the term involving a_i · (v − w_i) in (16) vanishes and the adaptation step for the y_i's takes on a form similar to the adaptation step of the w_i's, except that only the output of the "winner" unit i(v) is adjusted in a training step. To obtain a significantly faster convergence of the output weights y_i and a_i, we replace the restriction to V_i in (16) and (17) by the factor h_{λ'}(k_i(v, w)), which has the same form as h_λ(k_i(v, w)) in adaptation step (7), except that the decay constant λ' may have a value different from λ. By this replacement we achieve that the y_i and a_i of each unit i are updated in every training step, with a step size that decreases with the unit's "neighborhood rank" relative to the current input v. In the beginning of the training procedure, λ' is large and the range of input signals that affect the weights of a unit i is large. As the number of training steps increases, λ' decreases to zero and the fine tuning of the output weights to the local shape of f(v) takes place. Hence, in the on-line formulation, the adaptation steps we use for adjusting y_i and a_i are given by

Δy_i = ε' h_{λ'}(k_i(v, w)) (y − y_i − a_i · (v − w_i)),
Δa_i = ε' h_{λ'}(k_i(v, w)) (y − y_i − a_i · (v − w_i)) (v − w_i).    (18)

B. Prediction of the Mackey-Glass Time Series

The time series we want to predict with our network algorithm is generated by the Mackey-Glass equation

ẋ(t) = b x(t) + a x(t − τ) / (1 + x(t − τ)^{10})

with parameters a = 0.2, b = −0.1, and τ = 17 [30]. x(t) is quasi-periodic and chaotic, with a fractal attractor dimension of 2.1 for the parameter values we chose. The characteristic time constant of x(t) is t_char = 50, which makes it particularly difficult to forecast x(t + Δt) with Δt ≥ 50.

Input v of our network algorithm consists of four past values of x(t), i.e., v = (x(t), x(t − 6), x(t − 12), x(t − 18)). Embedding a set of time-series values in a state vector in this way is common to several approaches, including those of Moody and Darken [27], Lapedes and Farber [32], and Sugihara and May [33]. The time span we want to forecast into the future is Δt = 90. For that purpose we iteratively predict x(t + 6), x(t + 12), etc., until we have reached x(t + 90) after 15 such iterations. Because of this iterative forecasting, the output y that corresponds to v and that is used for training the network is the true value of x(t + 6).

We studied several different training procedures. First, we trained several networks of different sizes using 100,000 to 200,000 training steps and 100,000 training pairs v = (x(t), x(t − 6), x(t − 12), x(t − 18)), y = x(t + 6). One could deem this training "on-line" because of the abundant supply of data.
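For reference, one combined supervised training step on a pair (v, y), i.e., the neural-gas update (7) for the w_i's together with the output-coefficient updates (18), might be sketched as follows, computing the neighborhood ranking once per input. The step sizes and decay constants ε, λ, ε', λ' are assumed to follow the schedules described below; the function name and in-place array updates are our own choices.

```python
import numpy as np

def training_step(v, y_target, w, y, a, eps, lam, eps_p, lam_p):
    """One supervised step: eq. (7) for w_i and eq. (18) for y_i, a_i."""
    k = np.empty(len(w), dtype=int)
    k[np.argsort(np.sum((w - v) ** 2, axis=1))] = np.arange(len(w))
    h = np.exp(-k / lam)                              # h_lambda(k_i) for the reference vectors
    h_p = np.exp(-k / lam_p)                          # h_lambda'(k_i) for the output coefficients
    diff = v - w                                      # (N, D): v - w_i for every unit
    err = y_target - y - np.sum(a * diff, axis=1)     # y - y_i - a_i . (v - w_i)
    w += eps * h[:, None] * diff                      # eq. (7)
    y += eps_p * h_p * err                            # eq. (18), offset update
    a += eps_p * (h_p * err)[:, None] * diff          # eq. (18), slope update
    return w, y, a
```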
Fig. 3 presents the temporal evolution of the neural-gas network adapting to a representation of the Mackey-Glass attractor; we show a three-dimensional projection of the four-dimensional input space. The initialization of the reference vectors, presented in the top left part of Fig. 3, is random. After 500 training steps, the reference vectors have "contracted" coarsely to the relevant part of the input space (top right). With further training steps (20,000, bottom left), the network assumes the general shape of the attractor, and at the end of the adaptation procedure (100,000 training steps, bottom right) the reference vectors are distributed homogeneously over the Mackey-Glass attractor. The small dots depict already presented inputs v, whose distribution is given by the shape of the attractor.

Fig. 4. The normalized prediction error versus the size of the network (total number of weights) for the neural-gas approach combined with local linear mappings (1), for Moody and Darken's K-means/RBF method (2) [27], and for back-propagation (3) [32]. To reach the same prediction error, the K-means/RBF method requires about 10 times more weights than the neural-gas algorithm.

Fig. 4 shows the normalized prediction error as a function of network size. The size of a network is the number of its weights, with nine weights per computational unit (four each for w_i and a_i, plus one for y_i). The prediction error is determined by the rms value of the absolute prediction error for Δt = 90, divided by the standard deviation of x(t). As we can see in Fig. 4, compared to the results Moody and Darken obtained with K-means clustering plus radial basis functions [27], the neural-gas network combined with local linear mappings requires about 10 times fewer weights to achieve the same prediction error. The horizontal line in Fig. 4 shows the prediction error obtained by Lapedes and Farber with the back-propagation algorithm [32]. Lapedes and Farber tested only one network size. On a set of only 500 data points they achieved a normalized prediction error of about 0.05. However, their learning time was on the order of an hour on a Cray X-MP. For comparison, we trained a network using 1000 data points and obtained the same prediction error of 0.05; however, training took only 90 s on a Silicon Graphics IRIS, which achieves 4 MFlops on LINPACK benchmarks. To achieve comparable results, Moody and Darken employed about 13,000 data points, which required 1800 s at 90 kFlops. Hence, compared to our learning procedure, Moody and Darken's approach requires a much larger data set but is about twice as fast. However, because of possible variations in operating systems and other conditions, both speeds can be considered comparable.

Fig. 5 shows the results of a study of our algorithm's performance in an "off-line," or scarce-data, environment. We trained networks of various sizes through 200 epochs (or 200,000 steps, whichever is smaller) on training sets of different sizes. Due to the effect of overfitting, small networks achieve a better performance than large networks if the training set of data points is small. With increasingly large amounts of data the prediction error for the different network sizes saturates and approaches its lower bound.

Fig. 5. The normalized prediction error versus the size of the training data set for networks of various sizes. Due to the effect of overfitting, small networks achieve a better performance than large networks if the training set of data points is small. With increasingly large amounts of data the prediction error for the different network sizes approaches its lower bound.

As in the simulations described in Section III, the parameters ε, λ, ε', and λ' had the time dependence x(t) = x_i (x_f/x_i)^{t/t_max}, with t the current and t_max the total number of training steps.
The initial and final values for the simulation parameters were ε_i = 0.99, ε_f = 0.001; λ_i = N/3, λ_f = 0.0001; ε'_i = 0.5, ε'_f = 0.05; λ'_i = N/6, and λ'_f = 0.05. As in the simulations of Section III, the particular choice of these parameter values is not very critical and was optimized by trial and error. Both h_λ(k) and h_{λ'}(k) decreased exponentially with k.

VII. DISCUSSION

In this paper we presented a novel approach to the task of minimizing the distortion error of vector quantization coding. The goal was to present an approach that does not require any prior knowledge about the set of data points and, at the same time, converges quickly to optimal or at least near-optimal distortion errors. The adaptation rule we employed is a soft-max version of the K-means clustering algorithm and resembles, to a certain extent, both the adaptation rule of maximum-entropy clustering and Kohonen's feature map algorithm. However, compared to maximum-entropy clustering, it is the distance ranking, rather than the absolute distance, of the reference vectors to the current data vector that determines the adaptation step. Compared to Kohonen's feature map algorithm, it is not the neighborhood ranking of the reference vectors within an external lattice but the neighborhood ranking within the input space that is taken into account.

We compared the performance of the neural-gas approach with K-means clustering, maximum-entropy clustering, and Kohonen's feature map algorithm on a model data distribution that consisted of a number of separated data clusters. On the model data distribution we found that the neural-gas algorithm 1) converges faster and 2) reaches smaller distortion errors than the three other clustering procedures. The price for the faster convergence to smaller distortion errors, however, is a higher computational effort. In a serial implementation the computation time of the neural-gas algorithm scales like N log N with the number N of reference vectors, whereas the three other clustering procedures all scale only linearly with N. Nonetheless, in a highly parallel implementation the computation time required for the neural-gas algorithm becomes the same as for the three other approaches, namely O(log N).

We showed that, in contrast to Kohonen's feature map algorithm, the neural-gas algorithm minimizes a global cost function. The shape of the cost function depends on the neighborhood parameter λ, which determines the range of the global adaptation of the reference vectors. The form of the cost function relates the neural-gas algorithm to fuzzy clustering, with an assignment degree of a data point to a reference vector that depends on the reference vector's neighborhood rank relative to this data point. Through an analysis of the average change of the reference vectors for small but finite λ, we could demonstrate that the dynamics of the neural-gas network resembles the dynamics of a set of particles diffusing in a potential. The potential is given by the negative density distribution of the data points, which leads to a higher density of reference vectors in those regions where the data point density is high. A quantitative relation between the density of reference vectors and the density of data points could be derived.

To demonstrate the performance of the neural-gas algorithm, we chose the problem of predicting the chaotic time series generated by the Mackey-Glass equation. The neural-gas network had to form an efficient representation of the underlying attractor, which has a fractal dimension of 2.1.
The representation (a discretization of the relevant parts of the input space) was utilized to learn the required output, i.e., a forecast of the time series, by using local linear mappings. A comparison with the performance of K-means clustering combined with radial basis functions showed that the neural-gas network requires an order of magnitude fewer weights to achieve the same prediction error. Also the generalization capabilities of the neural-gas algorithm combined with local linear mappings compared favorably with the generalization capabilities of the RBF approach: to achieve identical accuracy, the RBF approach requires a training data set that is larger by an order of magnitude than the training data set which is sufficient for a neural-gas network with local linear mappings.

APPENDIX I

We show that the dynamics of the neural-gas network, described by adaptation step (7), corresponds to a stochastic gradient descent on a potential function. For this purpose we prove the following:

Theorem: For a set of reference vectors w = (w_1, ..., w_N), w_i ∈ R^D, and a density distribution P(v) of data points v ∈ R^D over the input space V ⊆ R^D, the relation

∂E_ng/∂w_i = −(1/C(λ)) ∫ d^D v P(v) h_λ(k_i(v, w)) (v − w_i)    (19)

with

E_ng = (1/(2 C(λ))) Σ_{j=1}^{N} ∫ d^D v P(v) h_λ(k_j(v, w)) (v − w_j)²    (20)

is valid. k_j(v, w) denotes the number of reference vectors w_l with ||v − w_l|| < ||v − w_j||.

Proof: For notational convenience we define d_j ≡ (v − w_j)². Differentiating (20) with respect to w_i yields

∂E_ng/∂w_i = −(1/C(λ)) ∫ d^D v P(v) h_λ(k_i(v, w)) (v − w_i) + R_i    (21)

with

R_i = (1/(2 C(λ))) Σ_{j=1}^{N} ∫ d^D v P(v) h'_λ(k_j(v, w)) (∂k_j(v, w)/∂w_i) d_j.    (22)

h'_λ(·) denotes the derivative of h_λ(·). We have to show that R_i vanishes for each i = 1, ..., N.

For k_j(v, w) the relation

k_j(v, w) = Σ_l θ(d_j − d_l)    (23)

is valid, with θ(·) as the Heaviside step function, θ(x) = 1 for x > 0 and θ(x) = 0 for x < 0. The derivative of the Heaviside step function θ(x) is the delta distribution δ(x), with δ(x) = 0 for x ≠ 0 and ∫ dx δ(x) = 1. Since ∂d_i/∂w_i = −2(v − w_i) and ∂d_j/∂w_i = 0 for j ≠ i, this yields

R_i = −(1/C(λ)) ∫ d^D v P(v) h'_λ(k_i(v, w)) d_i (v − w_i) Σ_{l≠i} δ(d_i − d_l)
    + (1/C(λ)) Σ_{j≠i} ∫ d^D v P(v) h'_λ(k_j(v, w)) d_j (v − w_i) δ(d_j − d_i).    (24)

Each of the integrands in the second term of (24) is nonvanishing only for those v for which d_j = d_i holds. For these v we can write

k_j(v, w) = Σ_l θ(d_j − d_l) = Σ_l θ(d_i − d_l) = k_i(v, w),    d_j = d_i,    (25)

and, hence, we obtain

R_i = −(1/C(λ)) ∫ d^D v P(v) h'_λ(k_i(v, w)) d_i (v − w_i) Σ_{l≠i} δ(d_i − d_l)
    + (1/C(λ)) ∫ d^D v P(v) h'_λ(k_i(v, w)) d_i (v − w_i) Σ_{j≠i} δ(d_j − d_i).    (26)

Since δ(x) = δ(−x) is valid, the two terms cancel and R_i vanishes for each i = 1, ..., N.

APPENDIX II

In the following we provide a derivation of (11). The average change ⟨Δw_i⟩ of a reference vector with adaptation step (7) is given by

⟨Δw_i⟩ = ε ∫ d^D v P(v) h_λ(k_i(v, w)) (v − w_i)    (27)

with h_λ(k_i(v, w)) being a factor that determines the size of the adaptation step and that depends on the number k_i of reference vectors w_j being closer to v than w_i. We assume h_λ(k_i) to be unity for k_i = 0 and to decrease to zero for increasing k_i with a characteristic decay constant λ. We can express k_i(v, w) by

k_i(v, w) = ∫_{||u−v|| ≤ ||w_i−v||} d^D u ρ(u)    (28)

with ρ(u) as the density distribution of the reference vectors in the input space V ⊆ R^D. For a given set w = (w_1, ..., w_N) of reference vectors, k_i(v, w) depends only on r = v − w_i and, therefore, we introduce

x(r) ≡ k_i(v, w).    (29)

In the following we assume the limit of a continuous distribution of reference vectors w_i, i.e., we assume ρ(u) to be analytical and nonzero over the input space V. Then, since ρ(u) > 0, for a fixed direction of r the quantity x(r) increases strictly monotonically with ||r|| and, therefore, the inverse of x(r), denoted by r(x), exists. This yields for (27)

⟨Δw_i⟩ = ε ∫ d^D x P(w_i + r(x)) h_λ(x) r(x) J(x)    (30)

with J(x) = det(∂r_μ/∂x_ν), μ, ν = 1, ..., D, and x = ||x||. We assume h_λ(k_i(v, w)) to decrease rapidly to zero with increasing ||v − w_i||, i.e., we assume the range of h_λ within V to be small enough that we may neglect any higher derivatives of P(u) and ρ(u) within this range.
Then we may replace P(w_i + r(x)) and J(x) in (30) by the first terms of their Taylor expansions around x = 0, yielding

⟨Δw_i⟩ = ε ∫ d^D x h_λ(x) [P(w_i) + r(x) · ∂P(w_i) + ...] r(x) [J(0) + ...].    (31)

The Taylor expansion of r(x) can be obtained by determining the inverse of x(r) in (29) for small ||r||. For small ||r|| (in (28) it holds that ||u − v|| ≤ ||r||), the density ρ around v is given by

ρ(v + u') = ρ(w_i + r + u') = ρ(w_i) + (r + u') · ∂ρ(w_i) + O(r²)    (32)

with ∂ ≡ ∂/∂u. Together with (29), this yields

x(r) = B_D r^D ρ(w_i) (1 + r · ∂ρ(w_i)/ρ(w_i) + O(r²)),    (33)

where r = ||r|| and

B_D = π^{D/2} / Γ(D/2 + 1)    (34)

is the volume of a sphere with radius one in dimension D. As the inverse of (33) we obtain, for small x,

r(x) = (x / (B_D ρ(w_i)))^{1/D} (1 − (1/D) r · ∂ρ(w_i)/ρ(w_i) + ...),    (35)

which provides the first terms of the Taylor expansion of r(x) around x = 0. Differentiating (35) gives ∂r_μ/∂x_ν; the off-diagonal terms contribute only in quadratic order to the Jacobian J(x) = det(∂r_μ/∂x_ν), so that to linear order only the diagonal terms, and hence the gradients of ρ, enter J(x). Inserting (35) and the resulting expansion of J(x) into (31), and noting that the integrals over terms of odd order in x vanish because of the rotational symmetry of h_λ, one is left with two contributions: one proportional to ∂_u P(w_i) and one proportional to P(w_i) ∂_u ρ(w_i)/ρ(w_i). Considering only the leading nonvanishing term (leading in λ), we finally obtain

⟨Δw_i⟩ ∝ ρ(w_i)^{−(1+2/D)} [ ∂_u P(u) − (1 + 2/D) (P(u)/ρ(u)) ∂_u ρ(u) ]|_{u = w_i},    (41)

which is the result quoted as (11) in Section IV; the positive proportionality factor depends only on ε, λ, and D.

ACKNOWLEDGMENT

The authors would like to thank J. Walter for many discussions on the problem of predicting chaotic time series and for providing parts of the prediction software. They also thank R. Der and T. Villmann for valuable remarks on the theorem in Appendix I.

REFERENCES

[1] R. M. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, no. 2, pp. 4-29, 1984.
[2] S. P. Lloyd, "Least squares quantization in PCM," IEEE Trans. Inform. Theory, vol. IT-28, pp. 129-137, 1982.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, L. M. LeCam and J. Neyman, Eds., 1967.
[4] K. Rose, E. Gurewitz, and G. Fox, "Statistical mechanics and phase transitions in clustering," Physical Review Letters, vol. 65, no. 8, pp. 945-948, 1990.
[5] S. Geman and D. Geman, "Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, pp. 721-741, 1984.
[6] S. J. Nowlan, "Maximum likelihood competitive learning," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990.
[7] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, vol. 43, pp. 59-69, 1982.
[8] T. Kohonen, "Analysis of a simple self-organizing process," Biological Cybernetics, vol. 44, pp. 135-140, 1982.
[9] T. Kohonen, Self-Organization and Associative Memory (Springer Series in Information Sciences 8). Heidelberg: Springer, 1984.
[10] T. Kohonen, K. Makisara, and T. Saramaki, "Phonotopic maps: insightful representation of phonological features for speech recognition," in Proc. 7th Int. Conf. on Pattern Recognition, Montreal, Canada, 1984.
[11] J. Makhoul, S. Roucos, and H. Gish, "Vector quantization in speech coding," Proc. IEEE, vol. 73, pp. 1551-1588, 1985.
[12] H. Ritter and K. Schulten, "Topology conserving mappings for learning motor tasks," in Neural Networks for Computing, J. S. Denker, Ed., AIP Conf. Proc. 151, Snowbird, UT, 1986.
[13] N. M. Nasrabadi and R. A. King, "Image coding using vector quantization: A review," IEEE Trans. Commun., vol. 36, 1988.
[14] N. M. Nasrabadi and Y. Feng, "Vector quantization of images based upon the Kohonen self-organizing feature maps," in Proc. IEEE Int. Conf. Neural Networks, San Diego, CA, 1988, pp. 101-108.
[15] J. Naylor and K. P. Li, "Analysis of a neural network algorithm for vector quantization of speech parameters," in Proc. First Ann. INNS Meeting, 1988.
[16] H. Ritter, T. Martinetz, and K. Schulten, "Topology-
