Chapter 5
This chapter explains the theory and practice of various model evaluation
mechanisms in data mining. Performance analysis is based mainly on the
confusion matrix. Various statistical measures used to analyse performance,
such as accuracy and the area under the ROC curve, are also described here.
Table 5.1 shows a confusion matrix, for which the various values and
related equations are described below. A few of these equations are
especially relevant for performance analysis.
Confusion matrix            Predicted
                     Negative    Positive
Actual   Negative       a           b
         Positive       c           d
The entries in the confusion matrix have the following meaning in the
context of a data mining problem:
a is the number of correct predictions that an instance is negative,
b is the number of incorrect predictions that an instance is positive,
c is the number of incorrect predictions that an instance is negative, and
d is the number of correct predictions that an instance is positive.
The accuracy (AC) is the proportion of the total number of predictions that
were correct. It is determined using the equation:
AC = (a + d) / (a + b + c + d)                                  (5.1)
The recall or true positive (TP) rate is the proportion of positive cases that
were correctly identified, as calculated using the equation:
TP = d / (c + d)                                                (5.2)
The false positive (FP) rate is the proportion of negative cases that were
incorrectly classified as positive, as calculated using the equation:
FP = b / (a + b)                                                (5.3)
The true negative (TN) rate is defined as the proportion of negative cases
that were classified correctly, as calculated using the equation:
TN = a / (a + b)                                                (5.4)
The false negative (FN) rate is the proportion of positive cases that were
incorrectly classified as negative, as calculated using the equation:
FN = c / (c + d)                                                (5.5)
Finally, precision (P) is the proportion of the predicted positive cases that
were correct, as calculated using the equation:
P = d / (b + d)                                                 (5.6)
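Equations (5.1)-(5.6) can be checked with a minimal Python sketch. The function name is illustrative; the entries a, b, c, d follow table 5.1 (a = true negatives, b = false positives, c = false negatives, d = true positives).

```python
def binary_metrics(a, b, c, d):
    """Return the six measures defined by equations (5.1)-(5.6).

    a = true negatives, b = false positives,
    c = false negatives, d = true positives (as in table 5.1).
    """
    return {
        "accuracy":  (a + d) / (a + b + c + d),   # (5.1)
        "tp_rate":   d / (c + d),                 # (5.2), also called recall
        "fp_rate":   b / (a + b),                 # (5.3)
        "tn_rate":   a / (a + b),                 # (5.4)
        "fn_rate":   c / (c + d),                 # (5.5)
        "precision": d / (b + d),                 # (5.6)
    }
```

Note that the TN and FP rates share the denominator a + b and sum to 1, as do the TP and FN rates over c + d.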
Figure 5.1 shows a sample plot of three ROC curves, representing excellent,
good, and worthless tests, on the same graph. The accuracy of a test depends
on how well it separates the group being tested into positive and negative
cases, and is measured by the area under the ROC curve: an area of 1
represents a perfect test, while an area of 0.5 represents a worthless test.
A rough guide for classifying the accuracy of a diagnostic test is the
traditional academic point system.
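The area under the ROC curve can be estimated directly from classifier scores and true labels. The sketch below (function name and data are illustrative) uses the rank-based formulation, which coincides with the trapezoidal area under the empirical ROC curve: it is the probability that a randomly chosen positive instance is scored above a randomly chosen negative one, with ties counting half.

```python
def roc_auc(labels, scores):
    """Rank-based AUC estimate: fraction of positive/negative pairs in
    which the positive instance receives the higher score (ties = 0.5).
    Equals the trapezoidal area under the empirical ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfectly separating scorer yields an area of 1, while a scorer that cannot distinguish the classes at all yields 0.5, matching the interpretation above.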
Concept           Decision tree    Neural network     Naive Bayes
Supervised        Yes              Yes                Yes
Basic algorithm   C4.5/CART        Back propagation   Probability based (Bayes theorem)
Accuracy          Very good        Very good          Good
Table 5.2: Comparison of classifier performances against various data mining concepts
Confusion matrix            Predicted
                     E      G      A      P
Actual     E        401     4     26     72
           G          6     6      1      1
           A          9     2    410      5
           P         84     1      4     31
Class    Precision    Recall
E        0.80         0.79
G        0.46         0.43
A        0.93         0.96
P        0.28         0.26
Table 5.4: Class-wise precision/recall values for the neural networks model
Table 5.4 shows the precision and recall values for each of the classes
observed for the neural network model. The average of these respective values
is used for the final performance comparison.
The accuracy is given by (5.1):
AC = 848/1063 = 0.797
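The accuracy and the class-wise precision/recall values can be recomputed directly from the confusion matrix. The following Python sketch (variable names are illustrative) uses the neural network confusion matrix reported above, where rows are actual classes and columns are predicted classes; the macro averages at the end correspond to the per-class averages used for the final comparison.

```python
CLASSES = ["E", "G", "A", "P"]
# Neural network confusion matrix: rows = actual, columns = predicted
CM = [
    [401, 4, 26, 72],
    [6, 6, 1, 1],
    [9, 2, 410, 5],
    [84, 1, 4, 31],
]

n = len(CLASSES)
total = sum(sum(row) for row in CM)
correct = sum(CM[i][i] for i in range(n))
accuracy = correct / total  # equation (5.1): 848/1063

# Per-class precision: diagonal entry over its column sum.
precision = {c: CM[i][i] / sum(CM[r][i] for r in range(n))
             for i, c in enumerate(CLASSES)}
# Per-class recall: diagonal entry over its row sum.
recall = {c: CM[i][i] / sum(CM[i]) for i, c in enumerate(CLASSES)}

# Macro averages used for the final performance comparison.
macro_p = sum(precision.values()) / n
macro_r = sum(recall.values()) / n
```

Rounding these values to two decimals reproduces the entries of table 5.4.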
The experimental background for each of these models is described in
detail in chapter 4. To further verify the effectiveness of this modelling
approach, the model was tested using data of year 2008. It was observed that
the model gives comparable results when trained on data from its three
immediately preceding consecutive years (2005-2007). But when the test data
comes from a year far removed from the years of the training data, the model
may not give optimum results, considering the natural fluctuations of
employment opportunities for a branch. Hence it is recommended that, for the
best performance of this modelling approach in predicting any current year n,
training data be drawn from at most its three immediately preceding
consecutive years (n-3 to n-1).
Confusion matrix            Predicted
                     E      G      A      P
Actual     E        416     7      6     74
           G          7     4      1      2
           A         11     4    404      7
           P         86     3      1     30
Table 5.5: Confusion matrix for the decision tree model
The confusion matrix obtained for the decision tree model is shown in
table 5.5. Table 5.6 shows the class-wise precision/recall values observed
for the decision tree model.
Class    Precision    Recall
E        0.80         0.83
G        0.22         0.29
A        0.98         0.95
P        0.27         0.25
Table 5.6: Class-wise precision/recall values for the decision tree model
Confusion matrix            Predicted
                     E      G      A      P
Actual     E        496     0      2      5
           G          5     3      4      2
           A         80    30    248     68
           P         20     1      2     97
Table 5.7: Confusion matrix for the Naive Bayes classifier model
Class    Precision    Recall
E        0.82         0.99
G        0.10         0.21
A        0.97         0.58
P        0.56         0.80
Table 5.8: Class-wise precision/recall values for the Naive Bayes classifier model
The confusion matrix obtained is shown in table 5.7. Table 5.8 shows the
class-wise precision/recall values for the Naive Bayes classifier model. The
accuracy is given by (5.1):
AC = 844/1063 = 0.794
5.4 Comparison