
IPASJ International Journal of Computer Science (IIJCS)

Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm


A Publisher for Research Motivation ........ Email:editoriijcs@ipasj.org
Volume 6, Issue 7, July 2018 ISSN 2321-5992

AN INTEGRATED TECHNIQUE TO ENHANCE THE PERFORMANCE OF THE CLASSIFIERS

K.Swetha1, Dr.S.Babu2

1 Research Scholar, SCSVMV University, Kanchipuram, India

2 Assistant Professor, SCSVMV University, Kanchipuram, India

ABSTRACT
Data mining is a process by which data can be analyzed to generate useful knowledge. In data mining, classifiers are a widely accepted and effective technique for prediction: a classifier predicts class membership for data instances, and its goal is to predict the target class accurately for each case in the data. Although classifiers achieve good prediction accuracy in most cases, they fail to do so in some cases, particularly on large datasets. The key idea of the proposed technique is to cluster the data first and then apply a classification algorithm, thereby improving the performance of the classifier. The major datasets used in this experiment are collected from the UCI Machine Learning Repository, and a dataset named telecom was collected by a survey. For each dataset it is important to choose a clustering method carefully; here the K-Means and Filtered clustering algorithms are used in the WEKA tool. The performance measures accuracy and ROC are computed and compared to highlight the performance of the model. Based on the experiments, it is concluded that the proposed hybrid model performs better than the unitary model.
Keywords: Classification, Clustering, K-Means, ROC.

I INTRODUCTION
Data mining is a wide area that integrates techniques from numerous fields, including machine learning, artificial intelligence, statistics, and pattern recognition, for the analysis of massive volumes of data. A large variety of data mining algorithms embedded in these fields perform different data analysis tasks. Data mining has attracted a great deal of attention in the research community and in society as a whole in recent years, owing to the wide availability of enormous quantities of data and the consequent need to turn such data into useful information and knowledge.

Classification techniques are capable of processing a wider variety of data than regression and are growing in popularity. Classification is the process of finding a model that describes and distinguishes classes and concepts. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory resident, typically assuming small datasets.

Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis. Data classification consists of a learning step (where a classification model is constructed) and a classification step (where the model is employed to predict the class label for the given data).

II REVIEW OF LITERATURE
Yaswanth Kumar Alapati carried out research on "Combining Clustering with Classification: A Technique to Improve Classification Accuracy". The paper shows that feature selection algorithms can be used to find useful patterns in high-dimensional data. The feature selection algorithms employed in the proposed framework are Correlation-based Feature Selection (CFS) and Relief-F. The results show that it is better to use feature selection algorithms for dimensionality reduction.

Mehdi Naseriparsa and Touraj Varaee carried out research on "Improving Performance of a Group of Classification Algorithms Using Re-sampling and Feature Selection". In this paper they proposed a new hybrid method that combines re-sampling, filtering of the sample domain, and a wrapper subset evaluation method with genetic search to reduce the dimensions of the Lung-Cancer dataset obtained from the UCI Repository of Machine Learning databases. Finally, they apply several well-known classification algorithms to the resulting dataset and compare the results and prediction rates before and after the application of their feature selection method. The results show that this method outperforms alternative feature selection methods at a lower cost.

S.Revathy and Dr.T.Nalini carried out research on "Performance Comparison of Various Clustering Algorithm". In this paper a comparative study of clustering algorithms across two different data items was performed. The performance of the various clustering algorithms is compared based on the time taken to form the estimated clusters, and the results are represented as a graph. It can therefore be concluded that the time taken to form the clusters increases as the number of clusters increases. The Farthest First clustering algorithm takes only a few seconds to cluster the data items, whereas the other clustering algorithms take considerably longer.

Pao, Y. and Sobajic, D.J. developed the research work "Combined Use of Unsupervised and Supervised Learning for Dynamic Security Assessment". In this paper they proposed the use of an unsupervised learning system as a mechanism for fast screening of system disturbances. Input patterns are clustered according to similarities discovered among the input features, and a supervised learning paradigm is then applied for accurate estimation of the CCT parameter. The combined use of learning techniques thus helps to handle large bodies of data.

III PROPOSED WORK

In this section, an integrated technique, i.e., one combining supervised and unsupervised learning, to enhance the performance of the classifier is presented. The key idea of the proposed technique is to cluster the data first and then apply a classification algorithm, thereby improving the performance of the classifier.

The process starts with classification. The collected datasets are classified using the J48, Naïve Bayes, NB Tree, and Random Tree algorithms, and the accuracy of the classifiers on the different datasets is obtained and recorded. The datasets are then clustered using the K-Means and Filtered clustering algorithms and the results are stored. The cluster id is then added to the dataset as an attribute. After adding the cluster id, the different classifiers are executed again and their accuracy is obtained; the results are recorded for comparison. Finally, the accuracy of the different classifiers on the different datasets is compared with the accuracy of the same classification algorithms combined with clustering.
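The cluster-then-classify pipeline above can be sketched outside WEKA. The snippet below is an illustrative stand-in using scikit-learn: DecisionTreeClassifier approximates J48, KMeans plays the role of the clustering step, and the data is synthetic; none of these library names or values come from the paper.

```python
# Sketch of the proposed pipeline: cluster first, append the cluster id
# as a new attribute, then run the classifier on the augmented data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=7, random_state=42)

# Baseline: classifier on the raw attributes.
base = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10).mean()

# Proposed: run K-Means first and append the cluster id as a new attribute.
cluster_id = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
X_aug = np.column_stack([X, cluster_id])
hybrid = cross_val_score(DecisionTreeClassifier(random_state=42), X_aug, y, cv=10).mean()

print(f"baseline accuracy: {base:.3f}, with cluster id: {hybrid:.3f}")
```

Whether the cluster id actually helps depends on how well the cluster structure aligns with the class labels, which is why the paper stresses choosing the clustering method per dataset.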

IV EXPERIMENTS AND RESULTS

To evaluate the efficiency of the proposed method, the WEKA tool was used. Five datasets from the UCI Machine Learning Repository, namely Bupa, German Credit, Haberman's Survival, Pima Indian Diabetes, and Spect (Heart), together with a dataset named telecom collected by our survey, are used in the experiment. The description of these six datasets is shown in Table I.
TABLE I: DESCRIPTION OF DATA SETS EMPLOYED IN THE EXPERIMENTS

Datasets        No. of Instances   No. of Attributes   No. of Classes
Bupa            344                7                   2
Diabetes        768                9                   2
German credit   1000               21                  2
Habermans       306                4                   2
Spect           80                 23                  2
Telecom         264                16                  2


Ten-fold cross-validation is used, in which 90% of the data is taken as the training set and the remaining 10% as the test set in each fold. To assess the competence of the proposed method, different types of tests and comparisons with existing methods have been carried out. They are:

 Performance evaluation of J48 with various datasets
 Performance evaluation of various classification algorithms with the Bupa dataset
 Accuracy of different classifiers with different datasets
 Accuracy of different classification algorithms with clustering
 ROC comparison on the Bupa dataset
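The ten-fold protocol can be made concrete with a short sketch. The snippet below uses scikit-learn's KFold on placeholder data; the library choice and the sample count of 300 are illustrative, not taken from the paper.

```python
# Ten-fold cross-validation: each fold holds out 10% of the data for
# testing and trains on the remaining 90%.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(300).reshape(-1, 1)  # placeholder feature matrix
kf = KFold(n_splits=10, shuffle=True, random_state=42)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(sizes[0])  # (270, 30): 90% train, 10% test in every fold
```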

A. PERFORMANCE EVALUATION OF J48 WITH VARIOUS DATASETS

As a first step, the J48 algorithm was executed on each actual dataset using the WEKA tool and the results were recorded. Next, the clustering method was executed on the actual dataset and J48 was executed again on the clustered dataset. The results of this process are recorded and presented in Table II.
Table II: PERFORMANCE EVALUATION OF J48 WITH VARIOUS DATASETS

Data sets       Accuracy of the   Accuracy with            Lifts By %
                Classifier %      Filtered Clustering %
Bupa            65.99             93.90                    27.91
Diabetes        73.83             95.96                    22.13
German credit   70.5              86.9                     16.4
Habermans       71.90             99                       27.1
Spect           71.25             91.25                    20
Telecom         55.30             89.39                    34.09

The experimental results in Table II illustrate that the performance of the classifier improves on the clustered dataset compared with the actual dataset. The table also reports the Lifts By value, which is the amount of improvement in classifier performance between the actual and clustered datasets. It further shows that the proposed method significantly lifts the accuracy over the actual dataset, by a minimum of 16% up to a maximum of 34% across the different datasets.
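The Lifts By column is simply the difference between the two accuracy columns. A quick sanity check, with values copied from Table II:

```python
# Lift = accuracy with Filtered clustering - baseline accuracy (percentage points).
rows = {
    "Bupa": (65.99, 93.90),
    "Diabetes": (73.83, 95.96),
    "German credit": (70.5, 86.9),
    "Habermans": (71.90, 99.0),
    "Spect": (71.25, 91.25),
    "Telecom": (55.30, 89.39),
}
lifts = {name: round(clustered - base, 2) for name, (base, clustered) in rows.items()}
for name, lift in lifts.items():
    print(f"{name}: lift = {lift}%")
# Reproduces the Lifts By column: 27.91, 22.13, 16.4, 27.1, 20.0, 34.09
```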

B. PERFORMANCE EVALUATION OF VARIOUS CLASSIFICATION ALGORITHMS WITH THE BUPA DATASET

In contrast to the above comparison, various classification algorithms, namely J48, Naïve Bayes, NB Tree, and Random Tree, are executed on the actual Bupa dataset and the results are recorded. The Filtered clustering algorithm is then applied to the Bupa dataset, and the above-mentioned algorithms are executed again on the resulting dataset. Table III shows the results of the experiments.
Table III: PERFORMANCE EVALUATION OF VARIOUS CLASSIFICATION ALGORITHMS WITH THE BUPA DATASET

Classifier      Accuracy of the   Accuracy with            Lifts By %
                Classifier %      Filtered Clustering %
J48             65.99             93.90                    27.91
Naïve Bayes     55.23             93.60                    38.37
NB Tree         64.53             91.28                    26.75
Random Tree     68.90             91.57                    22.67

The results of the various classification algorithms illustrate that the proposed method performs better: the accuracy of the various classifiers is considerably lifted, by 22% to 38% over the actual dataset. Naive Bayes in particular gains more than the other classifiers on the Bupa dataset, showing that the proposed method clusters the dataset well enough to enhance the performance of the classifier.

C. ACCURACY OF DIFFERENT CLASSIFIERS WITH DIFFERENT DATASETS


As a first step, several classification algorithms, namely J48, Naïve Bayes, NB Tree, and Random Tree, were executed on the actual datasets using the WEKA tool. The results of this process are recorded and presented in Table IV.

Table IV: ACCURACY OF DIFFERENT CLASSIFIERS WITH DIFFERENT DATASETS (Accuracy %)

S.no  Datasets       J48     Naive Bayes   NB Tree   Random Tree
1.    Bupa           65.99   55.23         64.53     68.90
2.    Diabetes       73.83   76.30         73.57     70.96
3.    German credit  70.5    75.4          75.3      67.1
4.    Habermans      71.90   74.84         72.55     68.30
5.    Spect          71.25   70            61.25     58.75
6.    Telecom        55.30   50.38         58.33     64.77

D. ACCURACY OF DIFFERENT CLASSIFICATION ALGORITHMS WITH CLUSTERING

As a next step, the clustering methods K-Means and Filtered clustering were executed on the actual datasets, and the classifiers were executed again on the clustered datasets. The results of this process are recorded and presented in Table V.

Table V: ACCURACY OF DIFFERENT CLASSIFICATION ALGORITHMS WITH CLUSTERING (Accuracy %)

                J48                Naive Bayes        NB Tree            Random Tree
DataSet         K-Means  Filtered  K-Means  Filtered  K-Means  Filtered  K-Means  Filtered
Bupa            89.24    93.90     91.86    93.60     91.86    91.28     90.12    91.57
Diabetes        89.19    95.96     92.19    95.83     86.85    94.79     87.11    94.92
German Credit   75.6     86.9      82.1     83.3      82.1     88.2      69.4     79.1
Habermans       95.42    99        94.12    99.03     90.20    99.3      95.10    99.67
Spect           75       91.25     88.75    93.75     86.25    92.5      86.25    87.5
Telecom         84.47    89.39     93.94    95.08     87.5     99.24     89.39    91.29

The experimental results in Table V illustrate that the performance of the classifiers improves on the clustered datasets compared with the actual datasets, for both K-Means and Filtered clustering. The proposed method thus significantly lifts the accuracy obtained on the actual dataset.
The experimental results show that applying a clustering technique prior to the classification algorithm is beneficial. Thus, the method is proposed to improve the accuracy of a classification algorithm.

E. ROC COMPARISON ON THE BUPA DATASET

The ROC curves are plotted using the various classifiers, J48, Naïve Bayes, NB Tree, and Random Tree, as base classifiers on the Bupa dataset. The ROC curves indicate a vast difference in classifier performance between plain classification and clustering combined with classification. With the plain classification method, the number of correctly classified positive instances is low and the number of misclassified negative instances is high, whereas with clustering followed by classification the classifiers produce very few misclassifications on both positive and negative instances. The step of the curve toward the point (0, 1) for the proposed method confirms this. This is strong evidence that the proposed model extensively enhances the performance of the classifier.
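A ROC comparison of this kind could be reproduced outside WEKA along the following lines. The snippet is a hedged sketch: scikit-learn's GaussianNB stands in for Naïve Bayes, the data is synthetic rather than the Bupa dataset, and all names are illustrative.

```python
# Compare ROC AUC of a classifier with and without a K-Means cluster id attribute.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline classifier: positive-class probabilities on the test set.
p_base = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Hybrid: fit K-Means on the training set and append the cluster id.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)
X_tr_aug = np.column_stack([X_tr, km.labels_])
X_te_aug = np.column_stack([X_te, km.predict(X_te)])
p_hyb = GaussianNB().fit(X_tr_aug, y_tr).predict_proba(X_te_aug)[:, 1]

auc_base = roc_auc_score(y_te, p_base)
auc_hyb = roc_auc_score(y_te, p_hyb)
print(f"baseline AUC: {auc_base:.3f}, with clustering: {auc_hyb:.3f}")
```

A curve that rises steeply toward (0, 1), as described above, corresponds to an AUC close to 1.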

Figure 1: ROC analysis of various classifiers on the Bupa dataset

Figure 2: ROC analysis of various classifiers with K-Means clustering

All of the above levels of comparison with well-known existing methods confirm that the proposed method performs best on the datasets in improving the accuracy of the classifier.


V CONCLUSION
It is not always feasible to apply classification algorithms directly to a dataset: classifying data over different classes can be complicated and yield low accuracy. Hence, a method is used to improve the accuracy by integrating supervised and unsupervised learning techniques (classification and clustering). The dataset is clustered using clustering algorithms and the clustered data is then passed on for classification. The major datasets are taken from the UCI Machine Learning Repository, and a dataset named telecom, collected by a survey, is also used in the experiment. The clustering algorithms K-Means and Filtered clustering are used, while the classification algorithms J48, Naïve Bayes, NB Tree, and Random Tree are used in the WEKA tool.

Different types of experiments were carried out, the results recorded, and a comparative study performed with them. The results show that the proposed method performs well when compared with the other existing methods. Based on the above, it is concluded that the proposed hybrid model performs better than the unitary model. In addition, it is concluded that applying a clustering technique prior to the classification algorithm is beneficial for improving the accuracy of the classifier.

References

1. Yaswanth Kumar Alapati (2016), "Combining Clustering with Classification: A Technique to Improve Classification Accuracy", International Journal of Computer Science Engineering, Volume 5.
2. Mehdi Naseriparsa and Touraj Varaee (2013), "Improving Performance of a Group of Classification Algorithms Using Re-sampling and Feature Selection", World of Computer Science and Information Technology Journal, Volume 3.
3. S.Revathy and Dr.T.Nalini (2013), "Performance Comparison of Various Clustering Algorithm", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3.
4. Pao, Y. and Sobajic, D.J., "Combined Use of Unsupervised and Supervised Learning for Dynamic Security Assessment", Transactions on Power Systems, Volume 7.
5. Sharareh R. Niakan Kalhori and Xiao-Jun Zeng (2014), "Improvement the Accuracy of Six Applied Classification Algorithms through Integrated Supervised and Unsupervised Learning Approach", Journal of Computer and Communications.
