IPASJ International Journal of Computer Science (IIJCS)
A Publisher for Research Motivation ........
Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm
Email: editoriijcs@ipasj.org
ISSN 2321-5992
Volume 2, Issue 10, October 2014

An Efficient Data Pre-processing for Support Vector Clustering Using Genetic Algorithm
Mr. KHARADE SACHIN NAVNATH 1, Mr. AVHAD PRAVIN SIDU 2 and Prof. SANDEEP KUMAR 3

1 M.Tech CSE Student, IES, Bhopal
2 Lecturer, IF Dept., HSBPVTs Parikrama Polytechnic, Kashti
3 Asst. Professor, CSE Dept., IES, Bhopal

ABSTRACT
The performance of support vector clustering suffers due to noisy data, so the pre-processing of data plays an important role in support vector clustering. When support vector clustering maps data from data space to feature space, some data points exhibit unwanted behaviour: boundary points, core points, and outliers. These data points degrade the performance and efficiency of support vector clustering. This paper presents an efficient data pre-processing procedure for support vector clustering (SVC) that reduces the size of the training dataset. Solving the optimization problem and labeling the data points with cluster labels are the time-consuming parts of SVC training. The pre-processing procedures previously used to reduce the SVC training set, Heuristics for Redundant-point Elimination (HRE) and the Shared Nearest Neighbour (SNN) technique, result in a loss of data. The main objective of this research is to improve the efficiency of the SVC training procedure in dealing with classification problems. The proposed genetic algorithm is used to optimize the SVM classifier. The proposed method proves effective on different datasets: the results obtained using the genetic algorithm approach are better than those of other methods.

Keywords: SVC, SVM, Genetic Algorithm, UCI machine learning datasets.

1. INTRODUCTION
Clustering has always been a tricky task in pattern classification, and many clustering algorithms have been proposed over the years. Division of patterns, data items, or feature vectors into groups (clusters) is a complicated task, since clustering does not presume any prior knowledge about which clusters to search for: there are no class-label attributes that would tell which classes exist. Some of the traditional clustering techniques are hierarchical clustering algorithms, partitional clustering algorithms, nearest-neighbour clustering, and fuzzy clustering [5]. Clustering algorithms are capable of finding clusters with different shapes, sizes, and densities, even in the presence of noise and outliers in the datasets. Although these algorithms can handle clusters with different shapes, they still cannot produce the arbitrary cluster boundaries needed to adequately capture or represent the characteristics of the clusters in a dataset [1], [9]. Support Vector Clustering (SVC), which is inspired by support vector machines, can overcome this limitation. The SVC algorithm has two main steps: a) SVM training and b) cluster labeling [12]. The SVM training step constructs the cluster boundaries, and the cluster-labeling step assigns a cluster label to each data point. Solving the optimization problem and cluster labeling are time-consuming in the SVC training procedure [4]. Many research efforts have been made to improve the efficiency of the cluster-labeling step, but only little work has been done on the accuracy and efficiency of the SVC training procedure itself. In recent times, researchers have made use of different cluster-labeling techniques and different pre-processing procedures to improve the efficiency of the SVC
procedure. The pre-processing procedures used for SVC to reduce the SVC training set, Heuristics for Redundant-point Elimination (HRE) and the Shared Nearest Neighbour (SNN) technique, result in a loss of data. Because few efforts have been made to reduce the execution time and improve the accuracy of the SVC training procedure, the main objective of this research is to reduce the execution time of the SVC procedure as well as to improve the ability of the proposed SVC scheme in dealing with classification problems. For this purpose, a Genetic Algorithm based data pre-processing procedure is used to eliminate irrelevant, redundant, or noisy data from the dataset and to pass quality (reduced) data to the SVC training procedure. The SVC algorithm performs two main steps: a) SVC Training, which forms the cluster contours of the dataset, and b) Cluster Labeling, which assigns a cluster label to each data point.

Figure 1: Flowchart of SVC Algorithm [4]


1.1 Existing Support Vector Clustering Techniques:
Support Vector Clustering (SVC) involves the following steps [2]:
1. Data Pre-processing: eliminates insignificant points and gives a reduced training set.
2. Kernel-parameter Tuning: gives the value of (C, q).
3. Optimization using the SMO Algorithm: solves the dual problem for the Lagrange multipliers.
4. Cluster Labeling: labels the data points with cluster labels.
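The four steps above can be sketched as a pipeline skeleton. All helper names below are illustrative placeholders for this paper's procedure, not a real library API, and each body is a stub standing in for the actual step:

```python
# A minimal sketch of the four-step SVC pipeline; every helper is a
# hypothetical placeholder whose body stands in for the real computation.

def preprocess(data):
    # Step 1: eliminate insignificant points (placeholder: keep everything).
    return data

def tune_kernel_parameters(data):
    # Step 2: choose the soft-margin constant C and Gaussian kernel width q.
    return 1.0, 1.0

def smo_optimize(data, C, q):
    # Step 3: solve the dual problem for the Lagrange multipliers beta.
    # Placeholder: uniform multipliers summing to 1, which is a feasible
    # dual point (0 <= beta_i <= C and sum(beta) == 1).
    n = len(data)
    return [1.0 / n] * n

def label_clusters(data, beta):
    # Step 4: assign a cluster label to every data point
    # (placeholder: a single cluster).
    return [0 for _ in data]

def svc(data):
    reduced = preprocess(data)
    C, q = tune_kernel_parameters(reduced)
    beta = smo_optimize(reduced, C, q)
    return label_clusters(reduced, beta)
```

The value of the skeleton is that each stub can be replaced independently, which is exactly where a pre-processing procedure such as the one proposed here plugs in.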

Figure 2: Flowchart of the SVC Procedure


1.2 Existing Data Pre-processing Procedures for SVC:


Currently, two data pre-processing procedures are available in the literature for support vector clustering (SVC). These pre-processing techniques remove noise points, outliers, and insignificant points that are not important for clustering, and thereby reduce the size of the training dataset. After pre-processing, the Sequential Minimal Optimization (SMO) algorithm is applied to the reduced dataset to solve the optimization problem. Next, each data point is labeled with the appropriate cluster label using a cluster-labeling method [2], [4].

2. LITERATURE REVIEW
The SVC algorithm, first proposed by Ben-Hur et al. [1], identifies the cluster contours with arbitrary geometric
representations, and automatically determines the number of clusters for a given dataset by a unified framework. The
SVC algorithm has been widely researched in both theoretical developments and practical applications due to its
outstanding features [1]. Wang and Chiang [3] developed an effective parameter-search algorithm to automatically find suitable parameters for the SVC algorithm. However, there is common agreement in the SVC research community that solving the optimization problem and labeling the data points with cluster labels are time-consuming
in the SVC training procedure. Yang et al. [11] used proximity graphs to model the proximity structure of datasets. Their approach constructed appropriate proximity graphs to model proximity and adjacency. After the SVC training process, they employed cutoff criteria to estimate the edges of a proximity graph. This method avoids redundant checks in a complete graph, and also avoids the loss of neighbourhood information that can occur when only the adjacencies of support vectors are estimated. Lee and Lee [7] created a new cluster-labeling method based on invariant topological properties of a trained kernel radius function. The method they proposed consists of two phases. The first phase decomposes a given dataset into a small number of disjoint groups, where each group is represented by a candidate point and all of its member points belong to the same cluster. The second phase then labels the candidate points. In 2008, Wang and Chiang [10] proposed an efficient pre-processing procedure for SVC that reduces the size of the training dataset.

3. PROBLEM STATEMENT
Division of patterns, data items, or feature vectors into groups (clusters) is a complicated task, since clustering does not assume any prior knowledge about which clusters to search for: there are no class-label attributes that would tell which classes exist. The Support Vector Clustering (SVC) algorithm has two main steps: 1) SVC training and 2) cluster labeling [14]. Solving the optimization problem and labeling the data points with cluster labels are time-consuming in the SVC algorithm. This limitation makes SVC inefficient for large datasets. Many techniques exist in the literature to reduce the time complexity of the cluster-labeling step, such as the complete graph (CG) strategy [1], the modified complete graph (SVG) strategy [9], proximity graph modeling [11], and the 2-phase cluster-labeling strategy [7]. Only little effort has been made to improve the efficiency of the SVC training step. Noisy datasets decrease the accuracy and efficiency of clustering algorithms. The data pre-processing procedures that exist for SVC are: 1) Heuristics for Redundant-point Elimination (HRE) [2], and 2) data pre-processing based on the Shared Nearest Neighbour algorithm [4]. Noise reduction and outlier detection based on the SNN technique is an efficient process, but this SNN-based pre-processing procedure generates its results at the cost of a loss of data. The proposed procedure improves the accuracy and efficiency of the SVC procedure without significantly altering the final cluster configuration. It overcomes the drawback of the SVC algorithm in dealing with large datasets, and it also overcomes the drawback of the SNN pre-processing procedure by reducing the loss of significant data.

4. PROPOSED METHODOLOGY
In the proposed method, a Genetic Algorithm is used to improve the accuracy and efficiency of the SVC procedure without significantly altering the final cluster configuration. It overcomes the drawback of the SVC algorithm in dealing with large datasets, and it also overcomes the drawback of the SNN pre-processing procedure by reducing the loss of significant
data. The Genetic Algorithm (GA), first introduced by John Holland in the early seventies, is a powerful stochastic algorithm based on the principles of natural selection and natural genetics, which has been successfully applied to machine learning and optimization problems. To solve a problem, a GA maintains a population of individuals (also called strings or chromosomes) and probabilistically modifies the population by genetic operators such as selection, crossover, and mutation, with the intent of seeking a near-optimal solution to the problem.

Coding of strings in a GA [5], [6]: each individual in a population is usually coded as a fixed-length binary string. The length of the string depends on the domain of the parameters and the required precision. For example, if the domain of the parameter x is [-2, 5] and the precision requirement is six places after the decimal point, then the domain [-2, 5] should be divided into 7,000,000 equal-size ranges. This implies that the length of the string must be 23, because 4194304 = 2^22 < 7,000,000 < 2^23 = 8388608. The decoding from a binary string <b22 b21 ... b0> into a real number x is straightforward and is completed in two steps:

(1) Convert the binary string <b22 b21 ... b0> from base 2 to base 10:

x' = ( Σ_{i=0}^{22} b_i · 2^i )_10

(2) Calculate the corresponding real number x:

x = -2 + x' · 7 / (2^23 - 1)
The proposed algorithm includes three steps, as follows.
A. Initial Population
The initialization process is quite simple. We create a population of individuals, where each individual is a fixed-length binary string and every bit of the string is initialized randomly.
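The initialization step can be sketched as follows; `init_population` is an illustrative name, not part of any library, and the `seed` parameter is added only to make runs reproducible:

```python
import random

def init_population(pop_size, string_length, seed=None):
    # Create a population of individuals; each individual is a
    # fixed-length binary string whose bits are initialized randomly.
    rng = random.Random(seed)
    return ["".join(rng.choice("01") for _ in range(string_length))
            for _ in range(pop_size)]
```

With the 23-bit encoding from the example above, `init_population(50, 23)` would produce a population of 50 random chromosomes.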
B. Evaluation
In each generation for which the GA is run, each individual in the population is evaluated against the unknown
environment. The fitness values are associated with the values of objective function.
C. Genetic Operators
Genetic operators drive the evolutionary process of a population in a GA, following the Darwinian principle of survival of the fittest and naturally occurring genetic operations. The most widely used genetic operators are reproduction, crossover, and mutation. To perform genetic operations, one must select individuals in the population to be operated on. The selection strategy is chiefly based on the fitness of the individuals present in the population. There are many different fitness-based selection strategies; the most popular is fitness-proportionate selection. After a new population is formed by the selection process, some members of the new population undergo transformations by means of genetic operators to form new solutions (a recombination step). During the recombination phase of the GA we employ three basic operators: reproduction, crossover, and mutation, which are controlled by the parameters pr, pc, and pm (reproduction probability, crossover probability, and mutation probability), respectively. Let us illustrate these three genetic operators. When an individual is selected, the reproduction operator simply copies it from the current population into the new population (i.e., the new generation) without alteration. The crossover operator starts with two selected individuals, and then a crossover point (an integer between 1 and L-1, where L is the length of the strings) is selected randomly. Assume the two parental individuals are X1 and X2, and the crossover point is 5 (L = 20). If X1 = (01001|101100001000101) and X2 = (11010|011100000010000), then the two resulting offspring are X1' = (01001|011100000010000) and X2' = (11010|101100001000101). The third genetic operator, mutation, introduces random changes into the structures in the population, and it may occasionally have beneficial results, such as escaping from a local optimum. In our GA, mutation flips each bit of a string (i.e., changes a 1 to a 0 and vice versa) with probability pm.
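The crossover and mutation operators above can be sketched directly; the parent strings are the X1 and X2 of the worked example (L = 20, crossover point 5), and the function names are illustrative:

```python
import random

def crossover(x1, x2, point):
    # One-point crossover: swap the tails of the two parents after the
    # crossover point, producing two offspring.
    return x1[:point] + x2[point:], x2[:point] + x1[point:]

def mutate(individual, pm, rng=None):
    # Flip each bit (1 -> 0, 0 -> 1) independently with probability pm.
    rng = rng or random.Random()
    return "".join(("1" if b == "0" else "0") if rng.random() < pm else b
                   for b in individual)

# The worked example: crossover point 5 on two length-20 parents.
x1 = "01001101100001000101"
x2 = "11010011100000010000"
o1, o2 = crossover(x1, x2, 5)  # o1 = "01001" + tail of x2, o2 = "11010" + tail of x1
```

Note that mutation here is per-bit: each position is tested against pm separately, so a low pm typically flips only a few bits per chromosome.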

5. EXPECTED OUTCOME
This proposed research work addresses the drawbacks of SVC in dealing with large datasets. The SNN-based data pre-processing procedure for SVC results in a loss of data. To overcome these drawbacks, a Genetic Algorithm based data pre-processing procedure is used to eliminate noise and irrelevant data from the dataset. To measure the performance of the proposed algorithm, six datasets are used, namely the iris, glass identification, Wisconsin breast cancer, yeast, wine quality, and page-blocks classification datasets. From the experimental results, it is observed that the proposed algorithm improves the accuracy and efficiency of SVC on all six datasets without altering the final cluster configurations. The proposed data pre-processing procedure reduces the size of the dataset significantly, which decreases the execution time required by SVC. That is, the proposed pre-processing procedure removes noise and irrelevant data from the dataset.

6. CONCLUSION
In this paper, a data pre-processing procedure using a genetic algorithm is proposed to improve efficiency and accuracy. Differing from the traditional cluster-labeling algorithms, we aim to improve the accuracy as well as to reduce the time complexity by decreasing the number of sampled point pairs. With the proposed method, the classification accuracy on all six datasets is better than that of the SNN-SVC method. The experimental results on the six datasets show that the proposed algorithm can serve as an effective tool for dealing with classification problems. The proposed pre-processing procedure passes quality data to the SVC training procedure, which increases the accuracy of SVC. Hence, it improves the ability of SVC to deal with classification problems.

REFERENCES
[1] A. Ben-Hur, D. Horn, H.T. Siegelmann, V. Vapnik, A Support Vector Clustering Method, In Proc. of Int. Conf.
on Pattern Recognition, 2000, pp. 724-727.
[2] J. Saketha Nath, S.K. Shevade, An Efficient Clustering Scheme Using Support Vector Methods, Pattern
Recognition, 2006, 1473-1480.
[3] J. S. Wang, J. C. Chiang, A Cluster Validity Measure with a Hybrid Parameter Search Method for Support Vector
Clustering Algorithm, PatternRecognition, 2008, pp. 506-520.
[4] J. S. Wang, J. C. Chiang, An Efficient Data Preprocessing Procedure for Support Vector Clustering, Journal of
Universal Computer Science, 2009, pp. 705-721.
[5] A. Jain, M. Murty, P. Flynn, Data Clustering: A Review, ACM Computing Surveys,1999, pp. 264-323.
[6] J. C. Platt, Fast training of support vector machines using sequential minimum optimization, Advances in Kernel
Methods Support Vector Learning, 1998, pp. 185-208.
[7] J. Lee, D. Lee, An Improved Cluster Labeling Method for Support Vector Clustering, IEEE Trans. Pattern
Analysis and Machine Intelligence, 2005, pp. 461-464.
[8] K. Jong, E. Marchiori, and van der Vaart, Finding Clusters using Support Vector Classifiers, ESANN
Proceedings. Europian Symposium on ANN, 2003, pp. 223-228.
[9] A. Ben-Hur, D. Horn, H.T. Siegelmann, V. Vapnik, Support Vector Clustering, Journal of Machine Learning Research 2, 2001, pp. 125-137.
[10] J. S. Wang, J. C. Chiang, A Cluster Validity Measure with Outlier Detection for Support Vector Clustering,
IEEE Trans. Systems, Man, and Cybernetics-Part B, 38, 1, 2008, pp. 78-89.
[11] J. Yang, V. E. Castro, S. K. Chalup, Support Vector Clustering Through Proximity Graph Modeling, In Proc.
of 9th Int. Conf. on Neural Information Processing, 2002, pp. 898-903.
[12] R. X. Donald, C. Wunsch, Clustering, IEEE Press Series on Computational Intelligence, 2009, pp. 172-187.
