ABSTRACT
Data mining is a process by which data can be analyzed to generate useful knowledge. Within data mining, classifiers are a widely accepted and effective technique for prediction: a classifier predicts class membership for data instances, with the goal of predicting the target class accurately for each case in the data. Although classifiers achieve good prediction accuracy in most cases, they fail to do so in some cases, particularly on large datasets. The key idea of the proposed technique is to cluster the data first and then apply a classification algorithm, thereby improving the performance of the classifier. The major datasets used in this experiment were collected from the UCI Machine Learning Repository, and a dataset named Telecom was collected through a survey. For each dataset it is important to choose the clustering method carefully; here we used the K-Means and Filtered Clustering algorithms in the WEKA tool. The performance measures accuracy and ROC are computed and compared to highlight the performance of the model. Based on the experiments, it is concluded that the proposed hybrid model performs better than the unitary model.
Keywords: Classification, Clustering, K-Means, ROC.
I INTRODUCTION
Data mining is a broad area that integrates techniques from several fields, including machine learning, artificial intelligence, statistics and pattern recognition, for the analysis of large volumes of data. A large number of data mining algorithms embedded in these fields perform different data analysis tasks. Data mining has attracted a great deal of attention in the research community and in society as a whole in recent years, owing to the wide availability of enormous amounts of data and the consequent need for turning such data into useful information and knowledge.
Classification is capable of processing a wider variety of data than regression and is growing in popularity. Classification is the process of finding a model that describes and distinguishes classes and concepts. Several classification methods have been proposed by researchers in machine learning, pattern recognition and statistics. Most algorithms are memory resident and typically assume small datasets.
Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing and medical diagnosis. Data classification consists of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict the class label for given data).
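The two steps above can be sketched in code. This is an illustrative example using scikit-learn and its bundled Iris data, not the paper's WEKA setup; the decision tree here (CART) merely stands in for a classifier such as J48:

```python
# Hypothetical sketch of the two-step classification process.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# Learning step: a classification model is constructed from labelled data.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Classification step: the model predicts class labels for unseen data.
predictions = model.predict(X_test)
accuracy = model.score(X_test, y_test)
```

Any classifier exposing the same fit/predict interface could be substituted in the classification step.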
II REVIEW OF LITERATURE
Yaswanth Kumar Alapati carried out research on “Combining Clustering with Classification: A Technique to Improve Classification Accuracy”. From this paper it is clear that feature selection algorithms can be used to find useful patterns in high-dimensional data. The feature selection algorithms employed in the proposed framework are Correlation-based Feature Selection (CFS) and Relief-F. The results show that it is better to use feature selection algorithms for dimensionality reduction.
Mehdi Naseriparsa and Touraj Varaee carried out research on “Improving Performance of a Group of Classification Algorithms Using Re-sampling and Feature Selection”. In this paper they proposed a new hybrid method combining re-sampling, filtering of the sample domain, and wrapper subset evaluation with genetic search to reduce the dimensionality of the Lung-Cancer dataset obtained from the UCI Repository of Machine Learning databases. Finally, they applied several well-known classification algorithms to the resulting dataset and compared the results and prediction rates before and after applying their feature selection method. The results show that this method outperforms other feature selection methods at a lower cost.
S. Revathy and Dr. T. Nalini carried out work on “Performance Comparison of Various Clustering Algorithms”. In this paper a comparative study of clustering algorithms across two different data items was performed. The performance of the various clustering algorithms was compared based on the time taken to form the estimated clusters, and the results were represented as a graph. It can therefore be concluded that the time taken to form the clusters increases as the number of clusters increases. The farthest-first clustering algorithm takes only a few seconds to cluster the data items, whereas the other clustering algorithms take considerably longer.
Pao, Y. and Sobajic, D.J. developed research on “Combined Use of Unsupervised and Supervised Learning for Dynamic Security Assessment”. In this paper they proposed the use of an unsupervised learning system as a mechanism for fast screening of system disturbances. Input patterns are clustered according to similarities discovered among the input features, after which a supervised learning paradigm is applied for accurate estimation of the CCT parameter. The combined use of learning techniques therefore helps to handle large bodies of data.
The process starts with classification. The collected datasets are classified using the J48, Naïve Bayes, NB Tree and Random Tree algorithms, and the accuracy of each classifier on each dataset is obtained and recorded. The datasets are then clustered using the K-Means and Filtered Clustering algorithms, and the results are stored. The cluster id is then added to each dataset as an additional attribute. After adding the cluster id, the classifiers are executed again and their accuracy is obtained; these results are recorded for comparison. The accuracy of the different classifiers on the original datasets is then compared with the accuracy of the same classifiers on the clustered datasets.
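The cluster-then-classify idea can be sketched as follows. This is a minimal illustration in scikit-learn, not the authors' WEKA workflow; the breast-cancer dataset and Gaussian Naive Bayes classifier are stand-ins for the paper's datasets and classifiers:

```python
# Sketch: run K-Means, append the cluster id as an extra attribute,
# then compare classifier accuracy with and without it.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)

# Unsupervised step: assign each instance a cluster id.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_id = kmeans.fit_predict(X)

# Append the cluster id to the dataset as an additional attribute.
X_clustered = np.column_stack([X, cluster_id])

# Supervised step: evaluate the classifier on both versions of the data.
acc_actual = cross_val_score(GaussianNB(), X, y, cv=10).mean()
acc_clustered = cross_val_score(GaussianNB(), X_clustered, y, cv=10).mean()
```

In WEKA the same effect can be achieved with the AddCluster filter, which appends the cluster assignment as a new attribute before classification.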
Dataset      Instances  Attributes  Classes
Bupa         344        7           2
Diabetes     768        9           2
Habermans    306        4           2
Spect        80         23          2
Telecom      264        16          2
Ten-fold cross validation is used, in which 90% of the data is taken as the training set and the remaining 10% as the testing set in each fold. To assess the competence of the proposed method, different types of tests and comparisons with existing methods have been carried out.
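Ten-fold cross validation, together with the accuracy and ROC measures reported in the paper, can be sketched as below. The dataset and classifier are illustrative assumptions, not the paper's exact setup:

```python
# Each of the 10 folds holds out 10% of the data for testing and
# trains on the remaining 90%.
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
accs, aucs = [], []

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    accs.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))
    # ROC area under the curve, reported alongside accuracy.
    aucs.append(roc_auc_score(y[test_idx],
                              model.predict_proba(X[test_idx])[:, 1]))

mean_acc = sum(accs) / len(accs)
mean_auc = sum(aucs) / len(aucs)
```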
The experimental results in Table II illustrate that the performance of the classifiers is better on the clustered dataset than on the actual dataset. The table also reports the Lift by Value, which is the amount of improvement in classifier performance between the actual and clustered datasets. The table further shows that the proposed method significantly lifts the accuracy over the actual dataset by a minimum of 16% to a maximum of 34% across the different datasets.
The results of the various classification algorithms illustrate that the proposed method performs better. The accuracy of the various classifiers was lifted considerably, by 22% to 38%, over the actual dataset. In particular, Naive Bayes performs better than the other classifiers on the Bupa dataset. Naive Bayes thus demonstrates that the proposed method clusters the dataset well enough to enhance the performance of the classifier.
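The Lift by Value described above amounts to the difference in accuracy between the two runs. A tiny illustration, using made-up placeholder figures rather than the paper's reported numbers:

```python
def lift(acc_actual, acc_clustered):
    """Improvement in accuracy (percentage points) from clustering first."""
    return acc_clustered - acc_actual

# Placeholder figures: 62.0% accuracy on the actual dataset,
# 84.0% on the clustered dataset, giving a lift of 22.0 points.
print(lift(62.0, 84.0))
```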
As a first step, the classification algorithms J48, Naïve Bayes, NB Tree and Random Tree were executed on the actual dataset using the WEKA tool, and the results were recorded and presented in Table IV.
German Credit 75.6 86.9 82.1 83.3 82.1 88.2 69.4 79.1
The experimental results in Table V illustrate that the performance of the classifiers is better on the clustered dataset than on the actual dataset; the table also reports the Lift by Value, the amount of improvement in classifier performance. All the above levels of comparison with well-known existing methods confirm that the proposed method performs best on the datasets and improves the accuracy of the classifier.
V CONCLUSION
It is not always feasible to apply classification algorithms directly to a dataset. Classifying data over different classes can be complicated and can yield low accuracy. Hence the proposed method improves accuracy by integrating supervised and unsupervised learning techniques (classification and clustering). The dataset is first clustered using clustering algorithms, and the clustered data is then passed to classification. The major datasets were taken from the UCI Machine Learning Repository, and a dataset named Telecom was collected through a survey. The clustering algorithms K-Means and Filtered Clustering, and the classification algorithms J48, Naïve Bayes, NB Tree and Random Tree, were used in the WEKA tool.
Different types of experiments were performed and the results recorded, and a comparative study was carried out on the results. The results prove that the proposed method performs well compared with the other existing methods. Based on the above, it is concluded that the proposed hybrid model performs better than the unitary model. In addition, it is concluded that applying a clustering technique before the classification algorithm is beneficial for improving the accuracy of the classifier.
References