
Journal of Advanced Computer Science and Technology Research 1 (2011) 110-125

A Hybrid Algorithm for Data Clustering Using Honey Bee Algorithm, Genetic Algorithm and K-Means Method

Mohammad Ali Shafia 1,a, Mohammad Rahimi Moghaddam 1,b, Rozita Tavakolian 2,c

1 Department of Industrial Engineering, Iran University of Science and Technology, Tehran, Iran
2 Department of Information Technology Engineering, Tarbiat Modares University, Tehran, Iran
a omidshafia@iust.ac.ir, b mrahimim@iust.ac.ir, c rozi_tak63@yahoo.com

ISSN: 2231-8852

Article Info
Received: 29th September 2011
Accepted: 10th November 2011
Published online: 1st December 2011

© 2011 Design for Scientific Renaissance. All rights reserved.

ABSTRACT
In this article, a novel population-based hybrid algorithm called the Genetic Bee Tabu K-Means Clustering Algorithm (GBTKC) is developed. It is based on the basic Honey Bee Algorithm, and the benefits of the K-Means method are used to improve its efficiency. GBTKC also combines the simplicity of K-Means, the diversity of the Genetic Algorithm in finding the global optimum, and the advantages of Tabu Search. Thanks to the Honey Bee Algorithm, the hybrid has a greater ability to search for globally optimal solutions, to escape local optima, and to generate efficient near-optimal solutions. GBTKC is run on three well-known data sets from the UCI Machine Learning Repository, and its clustering results are compared with those of other algorithms studied in the literature. The experiments show that GBTKC converges reliably and that the quality of the solutions it provides is better than that of the other algorithms.

Keywords: Clustering, Hybrid Algorithm, K-Means Method, Honey Bee Algorithm (HBA), Genetic Algorithm (GA), Tabu Search Algorithm (TS).

1. Introduction

With the rapid increase of information on the web, clustering related data and documents to extract useful information has become increasingly important for information retrieval systems. Various methods exist for tackling the clustering problem. The most important is K-Means, which classifies a data set into a number of homogeneous groups based on similarity. The main problem with K-Means is its tendency to converge to a local optimum. Meta-heuristic algorithms are widely used to improve the results of K-Means, but achieving high clustering quality and a near-optimal solution is still a challenging task. The Honey Bee Algorithm, a meta-heuristic algorithm, introduces a novel way to search the solution space, modeled on the foraging behavior of honey bees, that improves solution quality.


Intelligent methods are widely used in designing modern, professional information systems for solving complex problems. One approach to handling complex sets of data is inspired by nature: the structure of organisms and the collective behavior of animals are observed and studied. From such studies, Particle Swarm Optimization (Cura, 2009; Kennedy & Eberhart, 1995) was derived from the behavior of bird flocks and fish schools. Likewise, the Honey Bee Algorithm (HBA) is a novel optimization method inspired by the behavior of honey bees in foraging, collecting nectar and pollen, and producing honey. HBA is a population-based algorithm originally proposed by Pham et al. (2006) that simulates the foraging behavior displayed by a swarm of bees. It falls in the category of swarm-based optimization algorithms (SOAs). SOAs use mechanisms observed in nature to search the solution space and approach the optimum solution. The key difference between SOAs and direct algorithms such as hill climbing is that SOAs work with a population of solutions instead of a single one (Pham et al., 2007). HBA is applicable to many problems: because it simulates honey bee foraging, it has proved helpful in a variety of settings, some of which are mentioned in the literature review. In this research, HBA is used as a novel SOA to solve the clustering problem. HBA employs mechanisms such as the waggle dance to find the best food-source site and to search for the next one.

Clustering is an unsupervised, learning-based technique for partitioning similar data points. It is defined as classifying homogeneous sets of data points into several clusters, generally with no background knowledge about the subject (Murthy & Chowdhury, 1996). Many clustering algorithms exist, each of which uses certain steps to categorize a large number of data points into a smaller number of groups such that data points within a group are maximally similar in their characteristics and features, while data points in different groups are minimally similar (Pham et al., 2007). One of the best-known and most useful clustering algorithms is K-Means (Jain & Dubes, 1988). The algorithm is efficient for clustering large data sets because its computational complexity grows only linearly with the number of data points. The important point about K-Means is that it does not guarantee an optimal solution, although it converges to good solutions (Bottou & Bengio, 1995). In this work, the framework of the K-Means method is used in hybridization with HBA in order to generate optimal solutions.

This article introduces a new optimization algorithm called the Genetic Bee Tabu K-Means Clustering Algorithm (GBTKC). It is a hybrid algorithm developed on the basis of HBA (Pham et al., 2006) that also utilizes the benefits of the Genetic Algorithm (GA) (Goldberg, 1989), Tabu Search (TS) (Glover, 1989a; Glover, 1989b) and the K-Means method (Selim & Ismail, 1984). The HBA performs a kind of neighborhood search combined with a random search, in a way that is reminiscent of the food-foraging behavior of swarms of honey bees (Pham et al., 2006). As mentioned before, a major drawback of previous clustering-based approaches is getting stuck at a local optimum. In this research, we utilize different features of four meta-heuristic algorithms and present a novel hybrid algorithm able to improve on the results of previous approaches.


For this purpose, our approach tries to answer the following questions:
a. How does GBTKC combine the different features of four meta-heuristic algorithms to achieve a more precise solution?
b. How much improvement can be achieved by our approach in comparison with others?

The paper is organized as follows. Section two briefly reviews clustering concepts and their different methods. Section three briefly introduces the classical GA. The next section deals with the foraging behavior of bees and the initial ideas behind the proposed clustering method. The hybrid algorithm proposed for clustering is presented in section five. Results of the clustering experiments are reported in section six, and finally section seven concludes the paper and provides suggestions for future studies.

1.1 Clustering Methods

Data clustering, a well-known NP-complete problem, aims to find groups in heterogeneous data by minimizing a criterion of dissimilarity. Solving this problem is useful in data mining, machine learning and pattern classification (Liu et al., 2008). Clustering methods identify groups, or clusters, in a data set using a step-by-step approach, in the sense that within each cluster the objects are similar to each other yet different from those in other clusters (Lee & Yang, 2009; Rokach & Maimon, 2005; Xu & Wunsch II, 2005). Many clustering methods have been published in the literature (Grabmeier & Rudolph, 2002; Lee & Yang, 2009). They can be broadly classified into four categories (Han & Kamber, 2001): partitioning methods, hierarchical methods, density-based methods and grid-based methods. Another taxonomy of clustering approaches is shown in Fig. 1. Clustering techniques that do not fit these categories have also been developed, such as fuzzy clustering, artificial neural networks and GA. Discussions of different clustering algorithms can be found in references by various authors (Han & Kamber, 2001; Pham & Afify, 2006).

K-Means is among the simplest and most commonly used clustering methods in the partitioning category (McQueen, 1967). Each cluster in K-Means is represented by the mean value of the data points within the cluster. The method tries to divide a data set S into k clusters such that the sum of the squared Euclidean distances between data points and their closest cluster centers is minimized. This criterion is called the Total Within-Cluster Variance (TWCV). It is defined formally by Eq. 1, where x_i^{(j)} is the ith data point belonging to the jth cluster, c_j is the centre of the jth cluster, k is the number of clusters and n_j is the number of data points in cluster j:

TWCV = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2    (1)

As mentioned above, the implementation of K-Means clustering involves optimization. First, the algorithm takes k randomly selected data points and makes them the initial centers of the k clusters being formed. The algorithm then assigns each data point to the cluster whose centre is closest to it. In the second step, the centers of the k clusters are recomputed and the data points are redistributed.

This step is repeated for a specified number of iterations or until the cluster memberships do not change over two successive iterations. It is known that the K-Means algorithm may become trapped at locally optimal solutions, depending on the choice of the initial cluster centers.
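As a concrete illustration, the following minimal Python sketch implements the procedure just described together with the TWCV criterion of Eq. 1. It is our own illustration, not the authors' code, and it assumes no cluster becomes empty during the iterations:

```python
import numpy as np

def twcv(data, labels, centers):
    # Total Within-Cluster Variance (Eq. 1): sum of squared Euclidean
    # distances between each data point and its cluster centre.
    return sum(np.sum((data[labels == j] - centers[j]) ** 2)
               for j in range(len(centers)))

def kmeans(data, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: k randomly selected data points become the initial centres.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to the cluster with the closest centre.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centres (assumes no cluster becomes empty).
        new_centers = np.array([data[labels == j].mean(axis=0)
                                for j in range(k)])
        # Stop when the centres (and hence memberships) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

For example, calling `kmeans(data, k=3)` and then `twcv(data, labels, centers)` computes the same criterion that the experiments in section 3 report.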

Fig. 1. A taxonomy of clustering approaches (Lee & Yang, 2009)

HBA and GA have a potentially greater ability to avoid local optima than the localized search employed by most clustering techniques. One study proposed a genetic-algorithm-based clustering technique, called GA-clustering, which has proven effective in providing optimal clusters (Maulik & Bandyopadhyay, 2000). In this algorithm, solutions (typically, cluster centroids) are represented by bit strings. The search for an appropriate solution begins with a population, or collection, of initial solutions. Members of the current population are used to create the next-generation population by applying operations such as random mutation and crossover. At each step, the solutions in the current population are evaluated against some measure of fitness (typically inversely proportional to the clustering metric E), and the fittest solutions are selected probabilistically as seeds for producing the next generation. The process performs a generate-and-test beam search of the solution space, in which variants of the best current solutions are most likely to be considered next. Another study presented a clustering method based on the classic HBA (Pham et al., 2007). The method employs the Bee Algorithm to search for the set of cluster centers that minimizes a given clustering metric. One advantage of that method is that it does not become trapped at locally optimal solutions. In the present report, it is shown that the proposed method performs better than both K-Means and the GA-clustering algorithm. In the next sections, an alternative clustering method designed to solve the local-optimum problem is described. The new method adopts HBA and GA, as they have proved to give more robust performance than other intelligent optimization methods for clustering problems (Pham et al., 2006).

1.2 The Classical GA

GA uses a stochastic search procedure providing adaptive and robust search over a wide range of search spaces. The procedure is inspired by the Darwinian principle of natural selection and survival of the fittest. The technique was first introduced for use in adaptive systems (Holland, 1975).

It was then employed by several researchers to solve various optimization problems effectively and efficiently. The search procedure starts with the initialization of a few parameters, which may or may not be modified in the course of the search. The algorithm iterates through three basic phases: reproduction, crossover and mutation. The detailed operation of each phase is lucidly described in the literature (Goldberg, 1989; Xiao et al., 2010). The classical GA can be described as shown in Fig. 2.
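As an illustration of the loop summarized in Fig. 2, here is a minimal generic GA sketch in Python. The bit-string encoding, the parameter values and the example fitness function are our own assumptions for the sketch, not taken from the paper:

```python
import random

def genetic_algorithm(fitness, new_individual, pop_size=50,
                      generations=100, pc=0.8, pm=0.02):
    # Initial population of random bit-string individuals.
    pop = [new_individual() for _ in range(pop_size)]
    for _ in range(generations):
        # Reproduction: fitness-proportional (roulette-wheel) selection.
        weights = [fitness(ind) for ind in pop]
        parents = random.choices(pop, weights=weights, k=pop_size)
        offspring = []
        for a, b in zip(parents[::2], parents[1::2]):
            # Crossover: single-point, applied with probability pc.
            if random.random() < pc:
                cut = random.randrange(1, len(a))
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            offspring += [a, b]
        # Mutation: flip each bit independently with probability pm.
        pop = [[g ^ 1 if random.random() < pm else g for g in ind]
               for ind in offspring]
    return max(pop, key=fitness)

# Example: maximize the number of 1s in a 20-bit string.
best = genetic_algorithm(fitness=sum,
                         new_individual=lambda: [random.randint(0, 1)
                                                 for _ in range(20)])
```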

Fig. 2. Pseudo code of the basic GA

1.3 The Classical HBA

1.3.1 Bees in Nature

A colony of honey bees can fly in different directions over rather long distances in order to forage. Honey bees forage on flower patches containing plenty of nectar or pollen. Where the pollen or nectar is easy to gather, more honey bees visit the patch, and vice versa (Seeley, 1996; Von-Frisch, 1976). The foraging process begins with scout bees from the colony searching randomly from one patch to another. A scout bee (see Fig. 3) is a type of unemployed forager that starts searching spontaneously, without any prior knowledge (Özbakir, Baykasoglu, & Tapkan, 2010). During the harvesting season, a colony continues its exploration, keeping a percentage of the population as scout bees (Seeley, 1996). Bees returning to the hive have evaluated the different patches based on their quality, which depends on parameters such as the proportion of sugar in the nectar or pollen of the patch (Camazine et al., 2003). They deposit their nectar or pollen and go to the dance floor to perform a dance known as the waggle dance (Von-Frisch, 1976). This mysterious dance is essential for colony communication and contains three pieces of information regarding a flower patch (Camazine et al., 2003). This information helps the colony send bees to the flower patches precisely, without any guide, instruction or map. Moreover, a patch's value depends on the amount of food available as well as the energy needed to harvest it (Camazine et al., 2003). After performing the waggle dance on the dance floor, scout bees return to the patch, taking with them follower bees that were waiting in the hive. More follower bees are sent to patches with a higher probability of containing food.

Continuing in this manner enables the colony to gather food in the fastest, most efficient way possible. After returning to the hive, deciding about the next waggle dance is crucial (Camazine et al., 2003). Naturally, when there is still enough nectar in a patch for it to be considered a source, the waggle dance advertises it and recruit bees are sent to the source. A recruit (see Fig. 3), another type of unemployed forager, attends a waggle dance performed by other bees and then starts searching using the knowledge gained from the dance (Özbakir et al., 2010).

Fig. 3. Typical behavior of honey bee foraging (Häckel & Dippold, 2009; Özbakir et al., 2010)

1.3.2 The Basic HBA

Many SOAs have been developed from bee behavior. These algorithms are classified into two categories according to the behavior they model in nature: foraging behavior and mating behavior (Marinakis, Marinaki, & Matsatsinis, 2009). The most important approaches that simulate the foraging behavior of bees are the Artificial Bee Colony (ABC) Algorithm proposed by Karaboga and Basturk (2007, 2008), the Bee Colony Optimization Algorithm proposed by Teodorovic and Dell'Orco (2005), the BeeHive Algorithm published by Wedde, Farooq, and Zhang (2004), and the Virtual Bee Algorithm proposed by Yang (2005), which is applied to continuous optimization problems. Among these, we focus on the Honey Bee Algorithm. HBA is an optimization algorithm inspired by the natural foraging behavior of honey bees to find the optimal solution (Pham, Castellani, & Ghanbarzadeh, 2007). The algorithm has been successfully applied to optimization problems, including a well-known data set (Pham et al., 2006). The algorithm starts with n scout bees placed randomly in the search space. The fitness of the sites visited by the scout bees is then evaluated on their return, and the best m sites are selected out of the n. Of these m selected sites, e sites are designated as good (elite) selected sites and the other (m-e) sites as lower-ranked selected sites.

A neighborhood of size ngh is then defined around each of the m selected sites; this neighborhood is used to update the bees assigned to those sites. A number n2 of bees is recruited to search around each of the e elite sites, and a smaller number n1 < n2 of bees is recruited for each of the other (m-e) sites. The recruited bees search the selected sites and the fitness of the visited sites is evaluated. Finally, the bee with the highest fitness from each site is chosen to form part of the next bee population, and the remaining bees of the new population are assigned randomly around the search space. The algorithm is repeated until the stopping criterion is met; usually the stopping criterion is a maximum number of iterations, imax (Pham, Castellani et al., 2007). The steps of the basic Bee Algorithm are described in detail in Fig. 4, its flowchart is illustrated in Fig. 5, and the parameters it requires are shown in Table 1.

0. Begin
1. Initialize population with random solutions.
2. Evaluate fitness of the population.
3. While (stopping criterion not met):
   3.1. Form new population.
   3.2. Select sites for neighborhood search.
   3.3. Recruit bees for selected sites (more bees for best e sites) and evaluate fitness.
   3.4. Select the fittest bee from each patch.
   3.5. Assign remaining bees to search randomly and evaluate their fitness.
4. End While.
5. End.

Fig. 4. Pseudo code of the basic Bee Algorithm (Idris et al., 2009)

Fig. 5. Flowchart of the basic Bee Algorithm


Table 1: Parameters of the basic HBA

Parameter   Description
n           Number of scout bees
m           Number of sites selected out of n visited sites
e           Number of best sites out of m selected sites
n2          Number of bees recruited for the best e sites
n1          Number of bees recruited for the other (m-e) selected sites
ngh         Neighborhood size
imax        Number of algorithm iterations (stopping criterion)
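Using the parameter names of Table 1, the basic HBA loop might be sketched as follows. This is our own Python illustration, not the authors' implementation; the box-bounded continuous encoding is an assumption, the defaults echo the basic-HBA column of Table 3, and the objective f is minimized (the paper maximizes fitness, which is the inverse view):

```python
import numpy as np

def bees_algorithm(f, dim, bounds, n=90, m=60, e=50, n2=50, n1=30,
                   ngh=0.8, imax=40, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    # n scout bees placed randomly in the search space.
    sites = rng.uniform(lo, hi, size=(n, dim))
    for _ in range(imax):
        # Evaluate the sites and keep the m fittest (lowest f).
        sites = sites[np.argsort([f(s) for s in sites])][:m]
        best_of_patch = []
        for i, site in enumerate(sites):
            # Differential recruitment: n2 bees for the e elite sites,
            # n1 bees for the other (m - e) selected sites.
            recruits = n2 if i < e else n1
            patch = site + rng.uniform(-ngh, ngh, size=(recruits, dim))
            patch = np.clip(patch, lo, hi)
            # Keep only the fittest bee from each patch.
            best_of_patch.append(min(patch, key=f))
        # Remaining (n - m) bees scout randomly to refill the population.
        scouts = rng.uniform(lo, hi, size=(n - m, dim))
        sites = np.vstack([best_of_patch, scouts])
    return min(sites, key=f)
```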

2. Methodology

2.1 The Proposed Hybrid Algorithm for Clustering

GBTKC exploits the search capability of the HBA and GA to overcome the local-optimum problem of the K-Means algorithm. More specifically, the task is to search for appropriate cluster centers c_i (1 <= i <= k) such that the TWCV (Eq. 1) is minimized, where k denotes the number of clusters. Pseudo code for the algorithm is shown in Fig. 6 and its flowchart in Fig. 7. The steps of the proposed algorithm are described below. The algorithm requires a number of parameters to be set, namely: the number of scout bees (n), the number of sites selected for neighborhood search out of the n visited sites (m), the number of top-rated or elite sites among the m selected sites (e), the number of bees recruited for the best e sites (ne), the number of bees recruited for the other (m-e) selected sites (no), the stopping criterion, the mutation probability in GA (Pm), and the length of the Tabu list (L).

The algorithm starts with an initial population of n scout bees. Each bee represents a potential clustering solution: each of the n scout bees encodes a set of k cluster centers. The initial locations of the centers are assigned randomly. The Euclidean distances between each data object and all centers are calculated to determine the cluster to which the data object belongs (i.e. the cluster with the centre closest to the object). In this way, the initial clusters are constructed. The most popular metric, especially for continuous features, is the Euclidean distance, given in Eq. 2 below, where D is the number of features:

d(x, c_j) = \sqrt{ \sum_{l=1}^{D} (x_l - c_{jl})^2 }    (2)

After the clusters have been formed, the original cluster centers are replaced by the actual centroids of the clusters to define a particular clustering solution (i.e. a bee). This initialization process is applied each time new bees are to be created. In step 3.1, the fitness computation is carried out for each site visited by a bee, by calculating the TWCV (Eq. 1), which is inversely related to fitness. In step 3.2, the m fittest sites are selected for neighborhood search. In step 3.3, these m sites with the highest fitness are designated as selected sites and chosen for neighborhood search. In step 3.4, the algorithm conducts searches around the selected sites, assigning more bees to search in the vicinity of the best e sites. Selection of the best sites can be made directly according to their associated fitness; alternatively, the fitness values can be used to determine the probability of a site being selected. As already mentioned, this is done by recruiting more bees for the best e sites than for the other selected ones.

Together with scouting, this differential recruitment is a key operation of the Bee Algorithm. Searches in the neighborhood of the best e sites can be conducted using different formulas, such as the one below. If X = (x1, x2, ..., xk) is the current set of cluster centers and Y = (y1, y2, ..., yk) is another point in the ngh-neighborhood of X, then each component can be generated as

y_i = x_i + ngh \cdot (2r - 1), \quad r \sim U(0, 1)    (3)

In step 3.5, the fitness of the recruited bees is evaluated. In step 3.6, the bee with the highest fitness in each patch is selected to form part of the next bee population; the fittest bee of each patch thus becomes part of the next generation. In step 3.7, the remaining bees of the new population are generated using GA concepts: two bees are selected at random from the (n-m) remaining bees, a crossover operator is applied to generate two offspring, and a mutation operator is then applied to the resulting offspring. If an offspring is not in the Tabu list, its fitness is evaluated, it is selected for the next generation, and it is added to the Tabu list; if the Tabu list is full, the oldest item is removed and the new item is placed at the top of the list. This step uses single-point crossover and a mutation operator with probability Pm, and the Tabu list has a constant length L. These steps are repeated until all of the remaining (n-m) bees of the next generation have been generated; a sketch of this step follows.
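The following Python sketch shows one possible reading of Eq. 3 and of step 3.7 (crossover, mutation and the FIFO Tabu list). The function names, the flattened-list encoding and the Gaussian mutation are our assumptions for illustration, not the authors' implementation:

```python
import random
from collections import deque

def neighborhood_move(x, ngh):
    # Eq. 3: jitter every coordinate of every centre within +/- ngh.
    return [[c + ngh * (2 * random.random() - 1) for c in centre]
            for centre in x]

def fill_remaining(parents, n_needed, fitness, pm, L):
    # Step 3.7: generate the remaining (n - m) bees with GA + Tabu Search.
    tabu = deque(maxlen=L)   # FIFO Tabu list; oldest entry drops when full
    next_gen = []
    while len(next_gen) < n_needed:
        a, b = random.sample(parents, 2)
        # Single-point crossover on the flattened list of centre coordinates.
        cut = random.randrange(1, len(a))
        for child in (a[:cut] + b[cut:], b[:cut] + a[cut:]):
            # Mutation with probability pm (here: jitter one random gene).
            if random.random() < pm:
                i = random.randrange(len(child))
                child[i] += random.gauss(0.0, 0.1)
            key = tuple(round(g, 6) for g in child)
            if key not in tabu:           # skip recently visited solutions
                tabu.append(key)
                next_gen.append((child, fitness(child)))  # bee + its fitness
    return next_gen[:n_needed]
```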

Fig. 6. Pseudo code of the GBTKC Algorithm



Fig. 7. Flowchart of the GBTKC Algorithm

At the end of each iteration, the new population of the colony has two parts: one part is formed from the fittest bees of the selected patches, and the other part is generated using GA and TS concepts. These steps are repeated until a stopping criterion is met. The power of GBTKC is due to the ability of the HBA and GA to perform local and global search simultaneously (Pham, Koc et al., 2007). The power of GA arises from its crossover and mutation operators (Srinivas & Patnaik, 1994): crossover causes a structured but randomized exchange between solutions, with the possibility that good solutions generate better ones, while mutation maintains the diversity of the population. GBTKC thus allows HBA to utilize the crossover and mutation operators of GA to increase the diversity of the population in each iteration and to prevent premature convergence.

3. Results & Discussion

In this section, the results of implementing and testing the GBTKC algorithm are presented, along with a comparison with the results of four other algorithms from the literature: Basic K-Means (McQueen, 1967), the simplest K-Means method for clustering; GA K-Means (Krishna & Murty, 1999), a hybrid algorithm of GA and the K-Means method; the Basic Bee Algorithm (Karaboga & Ozturk, 2011; Pham, Castellani et al., 2007), the simplest algorithm derived from bee behavior; and SOM K-Means (Vesanto & Alhoniemi, 2000), which combines the K-Means method with neural-network techniques for solving the clustering problem.

Table 2: Features of the Reuters, Wine and IRIS data sets used in the experiments

Data Set        Number of Objects (documents/records)   Number of Features   Number of Classes
Reuters-21578   300                                      1206                 6
Wine            178                                      13                   3
IRIS            150                                      4                    3

Table 3: Parameters used in the clustering experiments

Algorithm        Parameter                                                 Value
Basic K-Means    Total number of iterations with no change in fitness     2
Genetic K-Means  Crossover probability, Pc                                 1
                 Mutation probability, Pm                                  0.5
                 Population size, P                                        300
Basic HBA        Number of scout bees, n                                   90
                 Number of selected sites, m                               60
                 Neighborhood size, ngh                                    0.8
                 Number of best (elite) sites out of m, e                  50
                 Number of bees recruited for best e sites, ne             50
                 Number of bees recruited for the other (m-e) sites, no    30
                 Stopping criterion, imax                                  40
SOM K-Means      Initial neighborhood size, IN                             3
                 Topology function, TFCN                                   'hextop'
                 Distance function, DFCN                                   'dist'
                 Steps for neighborhood to shrink to 1, STEPS              100
GBTKC            Number of scout bees, n                                   90
                 Number of selected sites, m                               60
                 Neighborhood size, ngh                                    0.7
                 Number of best (elite) sites out of m, e                  40
                 Number of bees recruited for best e sites, ne             50
                 Number of bees recruited for the other (m-e) sites, no    30
                 Mutation probability in GA, Pm                            0.3
                 Stopping criterion, imax                                  50
                 Length of Tabu list, L                                    10

In order to compare and evaluate these algorithms, well-known real data sets from the UCI Machine Learning Repository (Blake & Merz, 1998) were used: Reuters-21578 (Lewis, 1997) as the largest data set, with 300 documents; Wine (Murphy & Aha, 1992) as a smaller data set, with 178 observations; and IRIS (Grabmeier & Rudolph, 2002) as the smallest data set in this article, with 150 observations. In accordance with the data requirements of each algorithm, all of the Reuters, Wine and IRIS data sets were cleaned using primary and heuristic methods.


Table 2 illustrates the features and general aspects of these three data sets. A noticeable point in running and testing the SOM K-Means algorithm is that it was first run 50 times with the standard SOM Neural Network toolbox in Matlab (R2009); the outputs were then used as cluster centers, and K-Means was run on these centers to produce the final results. Default parameters were used for the first phase in Matlab. The clustering criterion TWCV was used to assess the performance of the tested algorithms: the smaller its value, the better the clustering result, and vice versa. Table 3 displays the parameter settings of each algorithm in this test. Each algorithm was run 15 times, and the mean, minimum and maximum TWCV were calculated. Table 4 shows the results of running each algorithm; as can be seen, the proposed clustering method outperforms the other four algorithms on all three data sets.

Table 4: Results for the tested clustering algorithms

Data Set  Algorithm                                Mean TWCV      Min. TWCV      Max. TWCV
Reuters   Basic K-Means Algorithm                  1434.815940    1452.990012    1486.418508
Reuters   Genetic K-Means Algorithm                1423.873518    1435.549627    1441.937816
Reuters   Basic Bee Algorithm                      1425.135698    1434.035237    1446.418007
Reuters   SOM K-Means Algorithm                    1422.523106    1430.892543    1440.308472
Reuters   Genetic Bee K-Means Algorithm (GBTKC)    1419.420171    1430.191211    1440.782632
Wine      Basic K-Means Algorithm                  18134.905339   18701.956158   18663.266267
Wine      Genetic K-Means Algorithm                16242.673531   16241.361841   16350.143577
Wine      Basic Bee Algorithm                      16345.956243   16328.918898   16366.847007
Wine      SOM K-Means Algorithm                    16249.969845   16245.641182   16253.840755
Wine      Genetic Bee K-Means Algorithm (GBTKC)    16235.406237   16229.165709   16248.729222
IRIS      Basic K-Means Algorithm                  100.406274     78.940841      145.279322
IRIS      Genetic K-Means Algorithm                78.969404      78.940841      78.999457
IRIS      Basic Bee Algorithm                      78.979825      78.940841      79.326450
IRIS      SOM K-Means Algorithm                    78.940841      78.940841      78.940841
IRIS      Genetic Bee K-Means Algorithm (GBTKC)    78.941603      78.940841      78.191357
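For reproducibility, the evaluation protocol (15 independent runs per algorithm, reporting mean, minimum and maximum TWCV) can be expressed with a small harness like the sketch below; the algorithm call signature is a placeholder of our own, and `twcv` is the function from the K-Means sketch in section 1.1:

```python
import numpy as np

def evaluate(algorithm, data, k, runs=15):
    # Run a clustering algorithm `runs` times and summarize its TWCV.
    scores = []
    for seed in range(runs):
        centers, labels = algorithm(data, k, seed=seed)  # placeholder signature
        scores.append(twcv(data, labels, centers))
    return np.mean(scores), np.min(scores), np.max(scores)
```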

4. Conclusions

In this article, an application of HBA, one of the new members of the swarm-based meta-heuristic family, has been evaluated. Recent studies on algorithms derived from bee behavior have focused on optimization problems; accordingly, this article has studied the application of HBA to an NP-hard problem, clustering. Clustering is a very important issue both theoretically and practically, and it attracts many researchers. K-Means is one of the simplest and most efficient clustering methods. Naturally, the method has disadvantages as well as notable advantages.

The proposed algorithm for solving the clustering problem is a modified version of the basic Bee Algorithm first presented by Pham and his colleagues. The algorithm, named GBTKC, is a novel hybrid in which the benefits of the K-Means method are used to improve efficiency. GBTKC is based on the basic HBA and is designed using a mixture of GA, TS and the K-Means method, so that the advantages of the different algorithms are combined. GBTKC uses two of the algorithms, HBA and GA, to search among cluster centers and minimize the objective function, and it uses GA and TS to diversify the solution space when generating the new population. One of the key benefits of this algorithm is that it does not get stuck at locally optimal solutions, and the quality of the answers it provides is much better than that of the previously studied algorithms in the subject literature. This improved quality results from using both the HBA and GA, which perform local and global search simultaneously. To test the performance of GBTKC, the algorithm was implemented and run on three well-known data sets, and the computational experience is very encouraging: in these experiments the algorithm consistently converged to high-quality solutions in all runs. The experimental findings on the Reuters-21578, Wine and IRIS data sets show that GBTKC performed better than any of the other algorithms considered in this study. One drawback of GBTKC is the number of parameters that must be tuned; another is its long CPU time. Consequently, work on helping users choose appropriate parameter values, and on solutions that dramatically decrease the algorithm's CPU time, would be of great value. These subjects can be scheduled as future work.

References

Blake, C. L., & Merz, C. J. (1998). UCI Repository of Machine Learning Databases. University of California at Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html

Bottou, L., & Bengio, Y. (1995). Convergence properties of the k-means algorithm. Advances in Neural Information Processing Systems, 7, 585-592.

Camazine, S., Deneubourg, J., Franks, N. R., Sneyd, J., Theraula, G., & Bonabeau, E. (2003). Self-Organization in Biological Systems. Princeton: Princeton University Press.

Cura, T. (2009). Particle swarm optimization approach to portfolio optimization. Nonlinear Analysis: Real World Applications, 10(4), 2396-2406.

Glover, F. (1989a). Tabu Search, Part I. ORSA Journal on Computing, 1(3), 190-206.

Glover, F. (1989b). Tabu Search, Part II. ORSA Journal on Computing, 2(1), 4-32.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley Longman.

Grabmeier, J., & Rudolph, A. (2002). Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery, 6, 303-360.

Häckel, S., & Dippold, P. (2009). The Bee Colony-inspired Algorithm (BCiA): a two-stage approach for solving the vehicle routing problem with time windows. Paper presented at GECCO '09, Montréal, Québec, Canada.

Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Diego, California, USA: Academic Press.

Holland, J. (1975). Adaptation in Natural and Artificial Systems. Ann Arbor, MI: University of Michigan Press.

Idris, R. M., Khairuddin, A., & Mustafa, M. W. (2009). Optimal allocation of FACTS devices for ATC enhancement using Bees Algorithm. World Academy of Science, Engineering and Technology, 54.

Jain, A. K., & Dubes, R. C. (1988). Algorithms for Clustering Data. Englewood Cliffs, New Jersey, USA: Prentice Hall.

Karaboga, D., & Basturk, B. (2007). A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm. Journal of Global Optimization. DOI 10.1007/s10898-007-9149-x.

Karaboga, D., & Basturk, B. (2008). On the performance of artificial bee colony (ABC) algorithm. Applied Soft Computing, 8, 687-697.

Karaboga, D., & Ozturk, C. (2011). A novel clustering approach: Artificial Bee Colony (ABC) algorithm. Applied Soft Computing, 11, 652-657.

Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. Paper presented at the Proceedings of the 1995 IEEE International Conference on Neural Networks.

Krishna, K., & Murty, M. N. (1999). Genetic K-Means Algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 29(3), 433-439.

Lee, I., & Yang, J. (2009). Common clustering algorithms. In Comprehensive Chemometrics (Vol. 2, Chapter 2.27, pp. 577-618). University of Western Sydney, Campbelltown, NSW.

Lewis, D. (1997). Reuters-21578 text categorization test collection. Available at: http://www.research.att.com/~lewis/reuters21578.html

Liu, Y., Yi, Z., Wu, H., Ye, M., & Chen, K. (2008). A tabu search approach for the minimum sum-of-squares clustering problem. Information Sciences, 178(12), 2680-2704.

Marinakis, Y., Marinaki, M., & Matsatsinis, N. (2009). A hybrid discrete Artificial Bee Colony - GRASP algorithm for clustering. IEEE, 548-553.

Maulik, U., & Bandyopadhyay, S. (2000). Genetic algorithm-based clustering technique. Pattern Recognition, 33(9), 1455-1465.

McQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Paper presented at the Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability.

Murphy, P. M., & Aha, D. W. (1992). UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html

Murthy, C. A., & Chowdhury, N. (1996). In search of optimal clusters using genetic algorithm. Pattern Recognition Letters, 17, 825-832.

Özbakir, L., Baykasoglu, A., & Tapkan, P. (2010). Bees algorithm for generalized assignment problem. Applied Mathematics and Computation, 215, 3782-3795.


Pham, D. T., & Afify, A. A. (2006). Clustering techniques and their applications in engineering. Submitted to Proceedings of the Institution of Mechanical Engineers, Journal of Mechanical Engineering Science.

Pham, D. T., Castellani, M., & Ghanbarzadeh, A. (2007). Preliminary design using the Bees Algorithm. Paper presented at the Eighth International Conference on Laser Metrology, CMM and Machine Tool Performance (LAMDAMAP), Euspen, Cardiff, UK.

Pham, D. T., Ghanbarzadeh, A., Koc, E., & Otri, S. (2006). Application of the Bees Algorithm to the training of radial basis function networks for control chart pattern recognition. Paper presented at the 5th CIRP International Seminar on Intelligent Computation in Manufacturing Engineering (ICME-06), Ischia, Italy.

Pham, D. T., Ghanbarzadeh, A., Koc, E., Otri, S., Rahim, S., & Zaid, M. (2006). The Bees Algorithm - a novel tool for complex optimization problems. Paper presented at IPROMS 2006, Proceedings of the 2nd Virtual International Conference on Intelligent Production Machines and Systems, Cardiff, UK.

Pham, D. T., Koc, E., Lee, J. Y., & Phrueksanant, J. (2007). Using the Bees Algorithm to schedule jobs for a machine. Paper presented at the Eighth International Conference on Laser Metrology, CMM and Machine Tool Performance (LAMDAMAP), Euspen, Cardiff, UK.

Pham, D. T., Otri, S., Afify, A., Mahmuddin, M., & Al-Jabbouli, H. (2007). Data clustering using the Bees Algorithm. Paper presented at the 40th CIRP International Manufacturing Systems Conference, Manufacturing Engineering Centre, Cardiff University, Cardiff, UK.

Rokach, L., & Maimon, O. (2005). Clustering methods. In O. Maimon & L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook (pp. 321-352). New York: Springer.

Seeley, T. D. (1996). The Wisdom of the Hive: The Social Physiology of Honey Bee Colonies. Cambridge: Harvard University Press.

Selim, S. Z., & Ismail, M. A. (1984). K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 81-87.

Srinivas, M., & Patnaik, L. M. (1994). Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Transactions on Systems, Man and Cybernetics, 24, 656-667.

Teodorovic, D., & Dell'Orco, M. (2005). Bee colony optimization - a cooperative learning approach to complex transportation problems. Advanced OR and AI Methods in Transportation, 51-60.

Vesanto, J., & Alhoniemi, E. (2000). Clustering of the self-organizing map. IEEE Transactions on Neural Networks, 11(3), 586-600.

Von-Frisch, K. (1976). Bees: Their Vision, Chemical Senses and Language. Ithaca: Cornell University Press.

Wedde, H. F., Farooq, M., & Zhang, Y. (2004). BeeHive: an efficient fault-tolerant routing algorithm inspired by honey bee behavior. In M. Dorigo (Ed.), Ant Colony Optimization and Swarm Intelligence (pp. 83-94). Berlin: Springer (LNCS).

Xiao, J., Yan, Y. P., Zhang, J., & Tang, Y. (2010). A quantum-inspired genetic algorithm for k-means clustering. Expert Systems with Applications, 37, 4966-4973.

Xu, R., & Wunsch II, D. (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3), 645-678.

Yang, X. S. (2005). Engineering optimizations via nature-inspired virtual bee algorithms. In J. Mira & J. R. Álvarez (Eds.), IWINAC 2005 (LNCS Vol. 3562, pp. 317-323). Berlin Heidelberg: Springer-Verlag.
