
2012 International Conference on Information and Computer Networks (ICICN 2012), IPCSIT vol. 27 (2012), IACSIT Press, Singapore

A Comparative Study for Various Methods of Classification


Peiman Mamani Barnaghi +, Vahid Alizadeh Sahzabi and Azuraliza Abu Bakar
Department of Artificial Intelligence, National University of Malaysia (UKM), Bangi, Malaysia
{peiman.barnaghi,vahid.alizadeh,aab}@ftsm.ukm.my

Abstract. This paper applies data mining techniques to a medical dataset in order to identify the relationship between liver disorders and alcohol consumption by classifying blood test data. We used four families of classification methods: decision trees, Bayesian algorithms (Naive Bayes and Bayesian networks), Neural Networks and Rough Sets. To evaluate the methods we used the Waikato Environment for Knowledge Analysis (WEKA), an open source collection of machine learning algorithms for tasks such as classification and clustering. The decision tree, Bayesian and Neural Network methods are implemented in WEKA. Because WEKA does not support methods based on Rough Sets, we used Rosetta for those. We evaluated the methods by applying them to our dataset and measuring the accuracy on the test data. The evaluation results show that Neural Networks obtain the best result among the methods compared.

Keywords: Classification, Decision tree methods, Bayesian algorithms, Neural Network, Rough Sets.

1. Introduction
In this paper we process a blood test dataset and use different classification methods to learn from it, with the aim of developing a system that can identify the existence of a liver disorder from blood test data. The Liver disorder dataset consists of blood test results that are used to describe the liver disorders that might arise from excessive alcohol consumption and to find any relationship between alcohol consumption and liver disease. Its attributes include: red blood cell volume (MCV); a hydrolase enzyme (ALK); serum glutamate pyruvate transaminase (SGPT), an enzyme whose high level in the blood may be a sign of liver damage; aspartate transaminase (SGOT), another enzyme associated with liver parenchymal cells; and gamma-glutamyl transpeptidase, an enzyme that catalyzes the transfer of a γ-glutamyl group from glutathione or a γ-glutamyl peptide to another peptide or amino acid, and whose high level is the best single screening assay for detecting latent or chronic liver disease.

The Waikato Environment for Knowledge Analysis (WEKA) is a collection of machine learning algorithms that can be used for processing tasks such as classification and clustering. In this paper we use two decision tree methods: J48, which is the WEKA implementation of C4.5, and LMT. The Bayesian methods (Naive Bayes and Bayesian networks) and the Neural Network methods (MLP and RBF) are also implemented in WEKA. We used Rosetta for the Rough Set methods. We split our dataset into a training set (66%) and a test set (34%); all methods were tested and compared on the 34% test portion.
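The 66%/34% evaluation protocol described above can be illustrated with a minimal sketch. It is not the WEKA procedure itself: the function names `percentage_split` and `accuracy`, the toy rows and the majority-class baseline are all hypothetical stand-ins chosen for illustration.

```python
import random

def percentage_split(rows, train_fraction=0.66, seed=0):
    """Shuffle the rows and cut them into (train, test) parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Share of predictions that equal the true class labels."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equally long")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical toy rows: (attribute values, class label). Not the real data.
rows = [([float(i), float(i % 7)], i % 2) for i in range(100)]
train, test = percentage_split(rows)

# Majority-class baseline standing in for a real classifier (assumption).
train_labels = [label for _, label in train]
majority = max(sorted(set(train_labels)), key=train_labels.count)
predictions = [majority] * len(test)
print(len(train), len(test), accuracy(predictions, [label for _, label in test]))
```

Any classifier can replace the baseline, as long as it is fitted on `train` only and scored on `test` only.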

2. The classification methods


+ Corresponding author. Tel.: (+6)012 241 31 68. E-mail address: peiman.barnaghi@gmail.com

This section describes the classification methods used in this paper. We discuss each method and explain how it was used in our experiment.

2.1. Decision tree
Decision Tree (DT) learning algorithms work by processing and deciding upon attributes of the data. The internal nodes of a DT test attributes, and each leaf node represents a classification. Two algorithms, J48 and LMT, were used in our experiments [1-2]. LMT is a DT-based method in which the leaf nodes are replaced with logistic regression functions. J48 is the WEKA implementation of C4.5, which uses the concept of information entropy [3]. After modelling, a comparison of the results shows which approach provides the better performance. The disadvantages of DT are its handling of continuous attributes and its computational efficiency as the tree grows. According to a comparison of classification methods for emotion recognition [4], DT was the best classifier in that group, with 15 attributes.

2.2. Artificial Neural Networks
Artificial Neural Networks (ANN) are one of the common classification methods in data mining. As Neural Network based classifiers, Multi Layer Perceptron (MLP) and Radial Basis Function (RBF) networks were used in this work. MLP is a feed-forward network that builds a model mapping input data to output data. An MLP can include several hidden layers between input and output. The structure of MLP is shown in Figure 1. RBF is another type of ANN. In an RBF network the hidden layer applies nonlinear radial basis functions to the input, and the output is a weighted sum of the hidden layer outputs. RBF networks consist of two feed-forward layers. Figure 2 illustrates the structure of this network (the figures are adapted from [5]).

Fig 1. Multi Layer Perceptron schema

Fig 2. Radial Basis Function network
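To make the two network types concrete, the following is a minimal forward-pass sketch. It assumes sigmoid units for the MLP and Gaussian basis functions with a single shared width for the RBF network; these choices, the function names and the toy weights are illustrative assumptions and do not reproduce the WEKA configuration used in the experiments. Training of the weights is not shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(inputs, layers):
    """Feed inputs through a list of (weights, biases) layers.

    weights[j] holds the incoming weights of unit j in that layer.
    """
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(unit_weights, activations)) + bias)
            for unit_weights, bias in zip(weights, biases)
        ]
    return activations

def rbf_forward(inputs, centers, width, out_weights, out_bias):
    """Gaussian hidden units followed by a weighted (linear) output sum."""
    hidden = [
        math.exp(-sum((x - c) ** 2 for x, c in zip(inputs, center)) / (2 * width ** 2))
        for center in centers
    ]
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias

# Hypothetical toy parameters: 2 inputs, one hidden layer of 2 units, 1 output.
toy_layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                   # output layer
]
print(mlp_forward([1.0, 2.0], toy_layers))
print(rbf_forward([1.0, 2.0], [[1.0, 2.0], [0.0, 0.0]], 1.0, [2.0, -1.0], 0.5))
```

The MLP output is squashed into (0, 1) by the final sigmoid, while the RBF output is an unbounded weighted sum, which is the structural difference described in the text.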

In [5], Avellaneda et al. compared four types of Neural Network classifiers. The experimental data included 700 records with nine attributes. Avellaneda et al. reported that MLP obtained the best result in their experiment.

2.3. Rough Sets
The Rough Set method is another classification technique in data mining. This section describes some basic concepts and features of rough set theory that are important for describing the classification method. Rough set theory provides a suite of methodologies that is useful for characterizing data that is not accurate. One of the main aspects of rough set methods is that they provide a formal framework for the automated transformation of data into knowledge (applied in [6] to the classification of software changes). Rough set theory offers mathematical tools to discover hidden patterns in data that can be used in data mining. The main goal of rough set analysis is to provide an approximation of the data [7]. Rough set theory has been used for several types of data of various sizes. In [7] a Rough Set method is tested on small, medium and large datasets. The results show that Rough Sets are useful for small and medium size datasets [7]. As the dataset that we used for liver diseases is small, we also consider Rough Set based classification in our evaluations.

2.4. Bayesian methods
Bayesian methods are also used as one of the classification solutions in data mining. In our work we use two main Bayesian methods, Naive Bayes and Bayesian networks, as implemented in the WEKA software [8].


A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong independence assumptions between the features; it can be described as an independent feature model. There are several models that make different assumptions suited to Naive Bayes [9] [10].
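The independence assumption can be seen in a minimal sketch of a Naive Bayes classifier for numeric attributes. It assumes each attribute follows a Gaussian distribution within each class, which is one common modelling choice and not necessarily the configuration used in our experiments; the function names and toy values are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(features, labels):
    """Estimate a prior and per-attribute (mean, variance) for every class."""
    by_class = defaultdict(list)
    for row, label in zip(features, labels):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        prior = len(rows) / len(labels)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant avoids a zero variance on tiny samples.
            variance = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, variance))
        model[label] = (prior, stats)
    return model

def predict_gaussian_nb(model, row):
    """Pick the class with the highest log posterior."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        # Independence assumption: per-attribute log likelihoods simply add up.
        for value, (mean, variance) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * variance)
            total -= (value - mean) ** 2 / (2 * variance)
        return total
    return max(model, key=log_posterior)

# Hypothetical toy data: two numeric attributes, two classes.
toy_features = [[1.0, 20.0], [1.2, 22.0], [0.9, 19.0],
                [3.0, 40.0], [3.2, 41.0], [2.9, 39.0]]
toy_labels = [0, 0, 0, 1, 1, 1]
nb_model = fit_gaussian_nb(toy_features, toy_labels)
print(predict_gaussian_nb(nb_model, [1.1, 21.0]), predict_gaussian_nb(nb_model, [3.1, 40.5]))
```

Working with log probabilities keeps the product of many small likelihoods numerically stable.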

3. Result and discussion


In this work four families of classification methods were evaluated: ANN (MLP, RBF), DT (J48, LMT), Bayes (Naive Bayes, Bayesian network) and Rough Sets. We trained each method on the training part of our dataset, then applied it to the test part and measured the accuracy of the classifications. The Liver disorder dataset contains 340 instances with 7 attributes, divided into two classes. As Table 1 shows, for almost all methods accuracy improved as the training size increased. However, the high accuracy for large training sets may also result from overfitting. J48, MLP and RBF, with 79.41% at the 90-10 split, have higher accuracy than the other methods. Figure 3 shows how the accuracy of the different methods changes with the size of the training set.
Table 1. Accuracy of all methods in different training sizes

Training Size         10-90   20-80   30-70   40-60   50-50   60-40   70-30   80-20   90-10
J48                   53.28%  57.03%  54.85%  57.14%  64.49%  60.74%  66.33%  63.23%  79.41%
LMT                   55.59%  59.62%  56.54%  69.45%  71.00%  70.37%  59.40%  69.17%  74.47%
Bayes Net             58.55%  61.11%  59.91%  65.02%  65.08%  61.48%  67.32%  66.17%  76.47%
Naive Bayes           57.89%  59.62%  59.91%  64.03%  65.58%  61.48%  67.32%  66.17%  76.47%
MLP (2 hidden layer)  57.23%  61.85%  62.02%  69.95%  65.68%  66.66%  69.30%  70.58%  79.41%
RBF                   61.84%  62.96%  64.13%  67.98%  65.68%  62.96%  67.32%  67.64%  79.41%
Rough Set             55.92%  60.37%  67.51%  67.98%  65.08%  65.18%  68.31%  54.70%  76.47%

Fig 3. Accuracy increases, with fluctuations, as the training set size grows
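The trend in Table 1 can be checked directly from the reported numbers. The sketch below copies the Table 1 accuracies and, for each method, computes the overall gain from the 10-90 split to the 90-10 split and the number of times accuracy drops between consecutive splits. The helper name `summarize` is hypothetical and the script is only an illustration of how the table may be read.

```python
SPLITS = ["10-90", "20-80", "30-70", "40-60", "50-50",
          "60-40", "70-30", "80-20", "90-10"]

# Accuracies (%) copied from Table 1, in the order of SPLITS.
TABLE_1 = {
    "J48":         [53.28, 57.03, 54.85, 57.14, 64.49, 60.74, 66.33, 63.23, 79.41],
    "LMT":         [55.59, 59.62, 56.54, 69.45, 71.00, 70.37, 59.40, 69.17, 74.47],
    "Bayes Net":   [58.55, 61.11, 59.91, 65.02, 65.08, 61.48, 67.32, 66.17, 76.47],
    "Naive Bayes": [57.89, 59.62, 59.91, 64.03, 65.58, 61.48, 67.32, 66.17, 76.47],
    "MLP":         [57.23, 61.85, 62.02, 69.95, 65.68, 66.66, 69.30, 70.58, 79.41],
    "RBF":         [61.84, 62.96, 64.13, 67.98, 65.68, 62.96, 67.32, 67.64, 79.41],
    "Rough Set":   [55.92, 60.37, 67.51, 67.98, 65.08, 65.18, 68.31, 54.70, 76.47],
}

def summarize(accuracies):
    """Return (gain from first to last split, number of consecutive drops)."""
    gain = accuracies[-1] - accuracies[0]
    drops = sum(later < earlier for earlier, later in zip(accuracies, accuracies[1:]))
    return gain, drops

for method, accuracies in TABLE_1.items():
    gain, drops = summarize(accuracies)
    print(f"{method:12s} gain = {gain:5.2f} points, drops = {drops}")
```

Every method ends higher than it starts, but none increases monotonically, which matches the fluctuating curves of Fig 3.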

We have also considered different attributes and their effect on the classification of the liver disorder dataset. Using attribute selection methods such as the best first and ranker methods, some attributes, including SGOT, ALK, MCV and SGPT, are removed and then the classification methods are applied. The result of this experiment is shown in Table 2.

Table 2. Accuracy of all methods in different feature sizes

Number of features    3       4       5       6       7
J48                   76.47%  82.35%  79.41%  79.41%  79.41%
LMT                   76.47%  59.40%  82.35%  73.52%  74.47%
Bayes Net             76.47%  82.35%  76.47%  76.47%  76.47%
Naive Bayes           76.47%  82.35%  76.47%  76.47%  76.47%
MLP (5 hidden layer)  76.47%  76.47%  88.23%  91.17%  79.41%
RBF                   73.52%  79.41%  79.41%  82.35%  79.41%
Rough Set             58.82%  61.76%  64.70%  73.52%  76.47%
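A ranker-style attribute selection can be sketched as follows. The text does not state which attribute evaluator was paired with the ranker, so this sketch assumes information gain, computed after splitting each numeric attribute at its median; both choices, the function names and the toy columns are assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Gain from splitting a numeric attribute at its median (assumed discretization)."""
    ordered = sorted(values)
    threshold = ordered[len(ordered) // 2]
    low = [y for v, y in zip(values, labels) if v < threshold]
    high = [y for v, y in zip(values, labels) if v >= threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (low, high) if part)
    return entropy(labels) - remainder

def rank_attributes(columns, labels):
    """Return attribute names ordered from most to least informative."""
    gains = {name: information_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)

# Hypothetical toy columns, not the liver disorder attributes.
toy_class = [0, 0, 0, 0, 1, 1, 1, 1]
toy_columns = {
    "noisy":       [5, 1, 7, 3, 6, 2, 8, 4],
    "informative": [1, 2, 3, 4, 10, 11, 12, 13],
}
print(rank_attributes(toy_columns, toy_class))
```

Keeping only the top-ranked names from `rank_attributes` gives the reduced feature sets; the lowest-ranked attributes are the candidates for removal.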


As shown in Fig 4, increasing the number of features had no clear effect on the accuracy of some of the methods; however, the performance of MLP rose sharply, reaching its highest accuracy of 91.17% with only 6 features. RBF also showed a slight improvement with 6 features. There is no significant change in the Naive Bayes, Bayes Net, J48 and LMT methods as the number of features increases.

Fig 4. Relationship between the number of features and accuracy

Figure 4 shows the accuracy of the evaluations for different numbers of attributes. This indicates that the number of features in the dataset plays an important role in accuracy and, in particular, significantly affects the performance of the Rough Set method.
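When reading these differences it may help to keep the size of the test set in mind. The sketch below assumes that Table 2 was produced with the 90-10 split, an inference from the fact that its 7-feature column matches the 90-10 row of Table 1. Under that assumption the 340 instances leave 34 test instances, so a single instance corresponds to roughly 2.94 percentage points. The helper names are hypothetical.

```python
N_INSTANCES = 340     # dataset size reported in the text
TEST_FRACTION = 0.10  # assumption: Table 2 uses the 90-10 split

def holdout_size(n_instances, test_fraction):
    """Number of instances left for testing."""
    return round(n_instances * test_fraction)

def implied_correct(accuracy_percent, n_test):
    """Count of correct test instances closest to a reported accuracy."""
    return round(accuracy_percent / 100 * n_test)

n_test = holdout_size(N_INSTANCES, TEST_FRACTION)
step = 100 / n_test  # accuracy change caused by one test instance

for reported in (76.47, 79.41, 88.23, 91.17):
    print(f"{reported}% is about {implied_correct(reported, n_test)} of {n_test} correct")
print(f"one test instance = {step:.2f} percentage points")
```

Under this assumption the gap between 79.41% and 76.47% is a single test instance, so small differences between methods should be interpreted with caution.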

4. Conclusions and future work


In this paper, seven classifiers from four families of classification methods (MLP and RBF for Neural Networks, Naive Bayes and Bayesian Net for Bayesian methods, J48 and LMT for decision trees, and Rough Sets) are applied to a Liver disorder dataset [11]. All blood tests are to be classified into two classes: Class 0 and Class 1. Compared to the Bayesian and Rough Set methods, the Neural Network classifiers obtain good results. MLP obtains higher results than RBF, and J48 also shows good results, but Rough Sets did not perform well in classifying the experimental dataset compared to the other methods. The results suggest that, for the liver disorder dataset, increasing the size of the training set produces better results; MLP in particular provides better results with a larger training set. Future work will focus on attribute selection techniques. We will also study tuning and optimization techniques for the classifiers and make sure that a large training set does not cause overfitting.

5. References
[1] K. Golnabi, et al., "Analysis of firewall policy rules using data mining techniques," 2006, pp. 305-315.
[2] H. Li, et al., "Data Mining Techniques for Complex Formation Evaluation in Petroleum Exploration and Production: A Comparison of Feature Selection and Classification Methods," 2008, pp. 37-43.
[3] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set theory," International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, vol. 12, pp. 37-46, 2004.
[4] T. Justin, et al., "Comparison of different classification methods for emotion recognition," pp. 700-703.
[5] D. A. Avellaneda, et al., "Natural Texture Classification: A Neural Network Models Benchmark," 2009, pp. 325-329.
[6] J. F. Peters and S. Ramanna, "Towards a software change classification system: A rough set approach," Software Quality Journal, vol. 11, pp. 121-147, 2003.
[7] A. Butalia, et al., "Applications of Rough Sets in the Field of Data Mining," 2008, pp. 498-503.
[8] J. Nicholson, et al., "Emotion recognition in speech using neural networks," Neural computing & applications, vol. 9, pp. 290-296, 2000.
[9] G. Qiang, "An Effective Algorithm for Improving the Performance of Naive Bayes for Text Classification," pp. 699-701.
[10] Y. Herdiyeni, et al., "A Bayesian network approach for image similarity," pp. 1-6.
[11] Liver disorder data set-Bupa medical research category. Available at: http://archive.ics.uci.edu/ml/datasets/Liver+Disorders
