graphical plot which illustrates the performance of a binary classifier system as its discrimination threshold is varied. It is created by plotting the fraction of true positives out of the positives (the true positive rate, TPR) against the fraction of false positives out of the negatives (the false positive rate, FPR) at various threshold settings. TPR is also known as sensitivity, and FPR is one minus the specificity, or true negative rate. ROC graphs are two-dimensional graphs in which the true positive rate is plotted on the Y axis and the false positive rate is plotted on the X axis. An ROC graph depicts the relative tradeoff between benefits (true positives) and costs (false positives).
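The construction described above can be sketched in a few lines of Python; the scores and labels below are illustrative values, not drawn from the thesis dataset.

```python
def roc_points(scores, labels, thresholds):
    """For each threshold t, classify score >= t as positive and
    return (FPR, TPR) pairs: TPR = TP/P, FPR = FP/N."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]   # classifier confidences (illustrative)
labels = [1, 1, 0, 1, 0, 0]               # 1 = positive, 0 = negative
points = roc_points(scores, labels, thresholds=[0.0, 0.5, 1.0])
```

Sweeping the threshold from 1 down to 0 moves the operating point from (0, 0) to (1, 1), tracing the ROC curve.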
According to John Peter et al. (2012), a confusion matrix displays the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data. The matrix is n-by-n, where n is the number of classes, and from it the accuracy of each classification algorithm is calculated.
Table 5.1 A simple confusion matrix table

                    Predicted class
Actual class        C1                 C2
C1                  True positives     False negatives
C2                  False positives    True negatives
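As a minimal sketch, the cell counts of Table 5.1 can be computed directly from paired lists of actual and predicted labels; the label lists below are illustrative.

```python
def confusion_matrix(actual, predicted, positive="C1"):
    """Return (TP, FN, FP, TN) in the layout of Table 5.1:
    rows are the actual class, columns the predicted class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    fn = sum(1 for a, p in zip(actual, predicted) if a == positive and p != positive)
    fp = sum(1 for a, p in zip(actual, predicted) if a != positive and p == positive)
    tn = sum(1 for a, p in zip(actual, predicted) if a != positive and p != positive)
    return tp, fn, fp, tn

actual    = ["C1", "C1", "C2", "C2", "C1"]
predicted = ["C1", "C2", "C2", "C1", "C1"]
tp, fn, fp, tn = confusion_matrix(actual, predicted)
```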
The confusion matrix indicating the accuracy of the SVM classifier for the
given data set is shown in Table 5.2.
Table 5.2 The confusion matrix of the SVM classifier

                True low    True high    Class precision
pred. low       355         24           93.67%
pred. high      3           118          97.52%
class recall    99.16%      83.10%
From the results obtained, it can be seen that the classifier exhibits a very high classification accuracy, i.e., 94.60% overall. It also shows a very high precision for the positive class (97.52%), and the recall of the positive class is quite good (83.10%). In the case of the negative class, the classifier exhibits high precision (93.67%) as well as high recall (99.16%).
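The figures quoted above can be reproduced from the cell counts of Table 5.2; note that the pred. high / true low count of 3 used below is inferred from the reported precision and recall rather than read from the table.

```python
# (predicted, true) -> count, following Table 5.2;
# the ("high", "low") cell of 3 is inferred from the reported 97.52% precision.
matrix = {("low", "low"): 355, ("low", "high"): 24,
          ("high", "low"): 3, ("high", "high"): 118}

total = sum(matrix.values())
correct = matrix[("low", "low")] + matrix[("high", "high")]
accuracy = correct / total                       # 473 / 500 = 0.946

# precision over a predicted row, recall over a true column
precision_high = matrix[("high", "high")] / (matrix[("high", "low")] + matrix[("high", "high")])
recall_high = matrix[("high", "high")] / (matrix[("low", "high")] + matrix[("high", "high")])
```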
The possibility of diagnosing heart disease vulnerability in diabetic patients with reasonable accuracy has been shown. Classifiers of this kind can help in early detection of the vulnerability of a diabetic patient to heart disease, so that patients can be forewarned to change their lifestyles. This results in preventing diabetic patients from being affected by heart disease, in turn resulting in low mortality rates as well as reduced cost on health for the state. SVMs have proved to be a classification technique with excellent predictive performance and have also been investigated with the help of ROC curves for both training and testing data. Hence, this SVM model can be recommended for the classification of the diabetic dataset.
5.3 COMPARING SUPPORT VECTOR MACHINE AND DECISION TREE
EXPERIMENTAL RESULTS
In the third experiment, RapidMiner has been used as a tool due to its learning operators and operator framework, which allows forming nearly arbitrary processes. Apart from accuracy, the ROC curve, AUC and lift chart are determined for measuring the performance.
5.3.1 ROC / AUC and Performance for the SVM classifier
In data mining and association rule learning, lift is a measure of the performance of a targeting model at predicting or classifying cases as having an enhanced response, measured against a random-choice targeting model. Lift is simply the ratio of the target response divided by the average response. This operator creates a lift chart based on a plot of the discretized confidence values for the given example set and model.
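As a small illustration of the definition above; the response counts used here are hypothetical.

```python
def lift(targeted_responses, targeted_size, total_responses, total_size):
    """Lift = response rate in the targeted segment / average response rate."""
    target_rate = targeted_responses / targeted_size
    average_rate = total_responses / total_size
    return target_rate / average_rate

# Hypothetical top segment: 50 responders out of 100 targeted,
# against 200 responders in the full population of 1000.
value = lift(50, 100, 200, 1000)   # 0.5 / 0.2 = 2.5
```

A lift of 2.5 means the model's targeted segment responds 2.5 times more often than a random selection of the same size.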
The AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance, and it is also used for comparing classifiers. However, ROC and AUC here use a single training and testing pair, so the computed value is an estimate of this probability.
We have set up the value of C as 5 and gamma as 1 for the SVM to operate with the RBF (Radial Basis Function) kernel type, and used the type C-SVC, which is the standard regularized algorithm. A common method is to calculate the area under the ROC curve, abbreviated AUC. Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1. However, because random guessing produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5. The AUC measures the discriminating ability of a binary classification model: the larger the AUC, the higher the likelihood that an actual positive case will be assigned a higher probability of being positive than an actual negative case.
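The probabilistic interpretation of AUC stated above can be computed directly as a rank statistic over all positive-negative pairs; the score lists here are illustrative.

```python
def auc_rank(pos_scores, neg_scores):
    """AUC as the probability that a randomly chosen positive instance
    outranks a randomly chosen negative one; ties count half
    (the normalized Mann-Whitney U statistic)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

auc = auc_rank([0.9, 0.8, 0.4], [0.7, 0.3, 0.1])   # 8 of 9 pairs ranked correctly
```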
The result of the area under the curve for the SVM classifier used in the RapidMiner tool is shown in Figure 5.3. The red color indicates the ROC and the blue color indicates the ROC threshold.
The performance of the SVM classifier indicating the accuracy with two classes, high and low, for the given data set is shown in Table 5.3.

Table 5.3 Performance of SVM with an accuracy of 79.8 % (High and Low)

                True low    True high    Class precision
pred. low       755         209          78.32%
pred. high      0           36           100.00%
class recall    100.00%     14.69%
The performance of the decision tree indicating the accuracy with two classes, high and low, using information gain as the split parameter for the given data set is shown in Table 5.4.

Table 5.4 Performance of Decision tree with an accuracy of 89.2 % (High and Low) using Information gain as split parameter

                True low    True high    Class precision
pred. low       654         7            98.94%
pred. high      101         238          70.21%
class recall    86.62%      97.14%
As shown in Table 5.5, of the two models, the decision tree appears to be the more effective, as it has the higher percentage of correct predictions (89.2%) for patients with heart diseases, followed by the support vector machine.
Table 5.5 Accuracy of SVM and Decision Tree (High and Low)

Technique                 Accuracy in percentage
Decision tree             89.20
Support Vector Machine    79.80
The use of RapidMiner has simplified the effort of k-fold validation and the generation of AUC and ROC, which has helped in the proper evaluation of the performance of the learning models, so that the best classifier can be identified and further refined for better prediction.
5.4 COMPARING NAIVE BAYES, SUPPORT VECTOR MACHINE AND
DECISION TREE EXPERIMENTAL RESULTS
In the final experiment, RapidMiner has again been used as a tool for evaluating and comparing three classification techniques, using the three classes high, medium and low with the diabetic patient dataset, to determine possible ways to predict the risk of heart disease for diabetic patients.
In general, the Bayes theorem formula is P(h|D) = P(D|h) P(h) / P(D), where P(h) is the prior probability of hypothesis h, P(D) is the prior probability of the data D, P(D|h) is the probability of observing D given h, and P(h|D) is the posterior probability of h given D. The Naive Bayes algorithm uses the Bayes formula to calculate the probability of a patient record Y having the class label Cj. The label could be High, Medium or Low.
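As a worked example of the formula; the probabilities below are hypothetical, not estimated from the dataset.

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes theorem: P(h|D) = P(D|h) * P(h) / P(D)."""
    return p_d_given_h * p_h / p_d

# Hypothetical values: prior risk P(h) = 0.3, likelihood of the observed
# record under that class P(D|h) = 0.5, evidence P(D) = 0.4.
p = posterior(0.5, 0.3, 0.4)   # 0.375
```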
We have taken age, sex, smoking, alcohol and cholesterol HDL as the prime attributes to evaluate Naive Bayes, with the plots shown respectively in Figures 5.5, 5.6, 5.7, 5.8 and 5.9. In Figure 5.5, the X axis denotes age and the Y axis denotes the density. At the age of 52 the risk is high, at the age of 54 the risk is medium, and at the age of 55 the risk is low.
Table 5.6 Naive Bayes distribution table for age attribute

Attribute    Parameter             Low      Medium    High
Age          Mean                  54.92    53.62     51.24
Age          Standard deviation    12.03    11.90     11.33
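Using the means and standard deviations in Table 5.6, the Gaussian density that Naive Bayes assigns to the age attribute can be evaluated per class. Equal class priors are assumed here for simplicity; the actual priors from the data would shift the comparison, which is why the reported risk at ages 54 and 55 differs from a pure density comparison.

```python
import math

def gaussian_density(x, mean, std):
    """Normal density used by Naive Bayes for continuous attributes."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

# Means and standard deviations for the age attribute from Table 5.6
params = {"Low": (54.92, 12.03), "Medium": (53.62, 11.90), "High": (51.24, 11.33)}

# With equal priors, the predicted class at age 52 is the one with max density
densities = {c: gaussian_density(52, m, s) for c, (m, s) in params.items()}
best = max(densities, key=densities.get)   # "High", consistent with Figure 5.5
```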
Similarly, the distribution tables for the above five attributes are shown in Tables 5.6, 5.7, 5.8, 5.9 and 5.10. In Figure 5.6, the X axis denotes sex and the Y axis denotes the density.
Table 5.7 Naive Bayes distribution table for sex attribute

Attribute    Parameter    Low     Medium    High
Sex          value=M      0.61    0.55      0.49
Sex          value=F      0.40    0.45      0.50
Table 5.8 Naive Bayes distribution table for smoking attribute

Attribute    Parameter         Low      Medium    High
Smoking      value=yes         0.163    0.151     0.173
Smoking      value=- (no)      0.485    0.459     0.534
Smoking      value=unknown     0.350    0.389     0.293
Table 5.9 Naive Bayes distribution table for alcohol attribute

Attribute    Parameter         Low     Medium    High
Alcohol      value=yes         0.09    0.08      0.04
Alcohol      value=- (no)      0.56    0.48      0.62
Alcohol      value=unknown     0.35    0.43      0.35
Table 5.10 Naive Bayes distribution table for cholesterol HDL attribute

Attribute      Parameter             Low     Medium    High
Cholesterol    Mean                  3.92    4.38      4.89
Cholesterol    Standard deviation    0.78    0.86      0.97
Table 5.11 Performance of Naive Bayes with an accuracy of 81.58 % (High, Medium and Low)

                True low    True medium    True high    Class precision
pred. low       631         62             21           88.38%
pred. medium    39          98             26           60.12%
pred. high      11          25             86           70.49%
class recall    92.66%      52.97%         64.66%
Table 5.12 Performance of SVM with an accuracy of 61.26 % (High, Medium and Low)

                True low    True medium    True high    Class precision
pred. low       398         25             15           90.87%
pred. medium    283         155            59           31.19%
pred. high      0           5              59           92.19%
class recall    58.44%      83.78%         44.36%
The decision tree has been tried using various split methods, namely Gain ratio, Information gain and Gini index, which give different levels of accuracy as shown in Table 5.13.
Table 5.13 Accuracy of the decision tree for various split methods

Split method criteria    Accuracy in percentage    Classification error in percentage
Gain ratio               88.19                     11.81
Information gain         90.79                     9.21
Gini Index               87.69                     12.31
Table 5.14 Performance of Decision tree with an accuracy of 90.79 % (High, Medium and Low) using Information gain as split parameter

                   pred. low    pred. medium    pred. high    class recall
True low           657          18              6             96.48%
True medium        25           148             12            80.00%
True high          11           20              102           76.69%
class precision    94.81%       79.57%          85.00%
Niyati Gupta et al. (2013) have defined accuracy as the proportion of instances that are correctly classified. It is calculated as the total number of correctly predicted high risk (true positive) and correctly predicted low risk (true negative) instances over the total number of classifications:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

For a multiclass classification problem, TP, FP, TN and FN are defined per class, denoted TPi, FPi, TNi and FNi for class i. Certain parameters can then be calculated to evaluate the multiclass classification results accordingly: for example, the true positive rate (TPR), precision and F-measure value for each class, and the overall accuracy.
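For instance, the per-class precision and recall and the overall accuracy can be computed from a three-class confusion matrix; the counts below are those of the first three-class confusion matrix above (rows are predicted classes, columns are true classes, in the order low, medium, high).

```python
# 3-class confusion matrix: rows = predicted class, columns = true class
M = [[631, 62, 21],
     [39, 98, 26],
     [11, 25, 86]]

classes = range(3)
total = sum(sum(row) for row in M)
accuracy = sum(M[i][i] for i in classes) / total                     # diagonal / total

precision = [M[i][i] / sum(M[i]) for i in classes]                   # over each predicted row
recall = [M[i][i] / sum(M[r][i] for r in classes) for i in classes]  # over each true column
```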
Table 5.15 Accuracy of the three classification models (High, Medium and Low)

Technique                 Accuracy in percentage
Decision tree             90.79
Naive Bayes               81.58
Support Vector Machine    61.26
As shown in Table 5.15, of the three models, the decision tree appears to be the most effective, as it has the highest percentage of correct predictions (90.79%) for patients with heart diseases, followed by Naive Bayes and the support vector machine. The performance in terms of graphs for accuracy, precision, sensitivity, specificity and F-score is shown in Figures 5.10, 5.11, 5.12, 5.13 and 5.14 respectively.
When more than two classes are dealt with, accuracy alone might not be sufficient. So precision, sensitivity, specificity and F-score have been evaluated along with accuracy to determine the right classifier.
According to Sheik Abdullah et al. (2012) precision is the fraction of
retrieved instances that are relevant and recall is the fraction of relevant instances
that are retrieved.
The precision can be calculated as
Precision = TP / (TP + FP)
However, TP rate alone is not sufficient to fully measure performance of the
classifier in a single class. Therefore we compute Precision for class i as,
Precision i = TPi / (TPi + FPi)
The sensitivity is the proportion of positive instances that are correctly classified as positive (e.g. the proportion of sick people that are classified as sick). It is also called recall.
It can be calculated as
Sensitivity = TP / (TP + FN)
The specificity is the proportion of negative instances that are correctly
classified as negative (e.g. the proportion of healthy people that are classified as
healthy). It can be calculated as
Specificity = TN / (TN + FP)
F-score or F-measure is a measure of a test's accuracy and it is the harmonic
mean of precision and recall which can be calculated as
F-score = 2 * (Precision * Recall) / (Precision + Recall)
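The four definitions above can be combined in a short function; the counts in the example call are hypothetical.

```python
def binary_metrics(tp, tn, fp, fn):
    """Sensitivity (recall), specificity, precision and F-score
    from the four confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, f_score

# Hypothetical counts: 80 true positives, 90 true negatives,
# 10 false positives, 20 false negatives
sens, spec, prec, f = binary_metrics(80, 90, 10, 20)
```

As the harmonic mean, the F-score is pulled toward the smaller of precision and recall, penalizing classifiers that trade one off heavily against the other.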
Table 5.16 Performance of the classification models in terms of accuracy, precision, sensitivity, specificity and F-score

Technique                 Accuracy    Precision    Sensitivity    Specificity    F-Score
Decision tree             90.79       0.86         0.84           0.93           0.84
Naive Bayes               81.58       0.72         0.69           0.85           0.70
Support Vector Machine    61.26       0.71         0.61           0.80           0.65
[Table: class-wise precision and recall (sensitivity) for the Low, Med and High classes of the Decision tree, Naive Bayes and Support Vector Machine models]
5.5 SUMMARY
The main focus in this chapter is on the application of three different data mining algorithms, namely Naive Bayes, support vector machine and decision tree, to the diabetes dataset to predict the risk of heart disease based on their predictive accuracy. A comparison of the outcomes of the various classification techniques has been made, and a higher degree of accuracy is found for the decision tree. The performances are compared through accuracy, precision, sensitivity, specificity and F-score. For future research, stacking techniques can be used to increase the accuracy of decision trees and reduce the number of leaf nodes.