Keywords
Classification methods, Area Under Curve (AUC) metric, Sample size, Attribute types.
1. Introduction
Data mining algorithms that assign objects to their related classes are called classifiers. Classification algorithms consist of two main phases: in the first phase, they build a model for the class attribute as a function of the other variables of the dataset, and in the second phase, they apply the previously built model to new, unseen datasets to determine the related class of each record [1]. There are different methods for data classification, such as Decision Trees (DT), Rule Based Methods, Logistic Regression (LogR), Linear Regression (LR), Naïve Bayes (NB), Support Vector Machine (SVM), k-Nearest Neighbor (k-NN), and Artificial Neural Networks (ANN).
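As a minimal illustration of these two phases, the following sketch trains a decision tree on labeled records and then applies it to unseen ones. Python with scikit-learn is assumed purely for illustration; the paper itself does not name a toolkit.

# Minimal sketch of the two phases of classification (assumption:
# Python with scikit-learn; not part of the original study).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)      # phase 1: learn a model of the class attribute
y_pred = model.predict(X_test)   # phase 2: apply the model to unseen records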
2. Related Works
Many works on the comparison of classification methods have been done. Each of these works has compared various classifiers with one another and has reported the results with regard to the test data and evaluation criteria used.
The RMSE efficiency criterion has been used by Kim in [4] for comparing DT, ANN and LR. In that article, the effect of the type of attributes and the size of the dataset on the performance of these methods has been studied.
Y_i = 1 + 3\sum_{j=1}^{m} x_j + 2\sum_{j=1}^{m} c_j    (1)
Samples of size 200, 500, 1000, 3000 and 5000 records have been drawn from the datasets. These numbers have been selected to generate datasets with small, medium and large sizes.
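As an illustration of the data generation, the following is a minimal sketch in the spirit of Eq. (1). The uniform distribution for the x_j, the binary coding for the c_j, and the median threshold used to binarize Y into two classes are assumptions of the sketch, not details given in the paper.

import numpy as np

def make_dataset(n_records, n_cont, n_disc, seed=0):
    # Hypothetical generator following Eq. (1): continuous variables
    # x_j and binary discrete variables c_j; distributions are assumed.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_records, n_cont))
    C = rng.integers(0, 2, size=(n_records, n_disc))
    y_raw = 1 + 3 * X.sum(axis=1) + 2 * C.sum(axis=1)
    y = (y_raw > np.median(y_raw)).astype(int)  # binarize into two classes
    return np.hstack([X, C]), y

# e.g. DS2 (2 continuous, 1 discrete variable) with 500 records:
X, y = make_dataset(500, 2, 1)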
Table 1. Properties of datasets with 3 variables

ID     Number of Continuous Variables    Number of Discrete Variables
DS1    3                                 0
DS2    2                                 1
DS3    1                                 2
DS4    0                                 3
Table 2. Properties of datasets with 5 variables

ID     Number of Continuous Variables    Number of Discrete Variables
DS5    5                                 0
DS6    4                                 1
DS7    3                                 2
DS8    2                                 3
DS9    1                                 4
DS10   0                                 5
Table 3. Properties of datasets with 7 variables

ID     Number of Continuous Variables    Number of Discrete Variables
DS11   7                                 0
DS12   6                                 1
DS13   5                                 2
DS14   4                                 3
DS15   3                                 4
DS16   2                                 5
DS17   1                                 6
DS18   0                                 7
Table 4. Properties of datasets with 10 variables

ID     Number of Continuous Variables    Number of Discrete Variables
DS19   10                                0
DS20   9                                 1
DS21   8                                 2
DS22   7                                 3
DS23   6                                 4
DS24   5                                 5
DS25   4                                 6
DS26   3                                 7
DS27   2                                 8
DS28   1                                 9
DS29   0                                 10
For applying classifiers on datasets, two distinct datasets should be used: the first for training (the training set) and the second for testing (the test set). In this article, the cross validation method with a fold value of 10 has been used, as in the following sketch.
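A minimal sketch of this protocol, assuming Python with scikit-learn (the paper does not name its tooling); X and y stand in for any of the generated datasets:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data; in the study, X and y come from the generated datasets.
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# 10-fold cross validation: each record is used nine times for
# training and once for testing.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())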
C4.5 considers three types of tests when growing a tree:
- The standard test on a discrete attribute, with one outcome and one branch for each possible value of that attribute.
- A more complex test based on a discrete attribute, in which the possible values are allocated to a variable number of groups, with one outcome for each group rather than for each value.
- If attribute A has continuous numeric values, a binary test with outcomes A ≤ Z and A > Z, based on comparing the value of A against a threshold value Z.
All these tests are evaluated in the same way, by looking at the gain ratio arising from the division of training cases that they produce. Two modifications to C4.5 for improving the use of continuous attributes are presented in [23].
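Since the gain ratio drives the choice among these tests, a generic sketch of its computation may help. This uses the standard entropy-based definitions and assumes numpy; it is an illustration, not the paper's code.

import numpy as np

def entropy(y):
    # Shannon entropy of a label vector.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(a, y):
    # Information gain of splitting y by discrete attribute a,
    # divided by the split information of that division.
    n = len(y)
    gain, split_info = entropy(y), 0.0
    for v in np.unique(a):
        sub = y[a == v]
        w = len(sub) / n
        gain -= w * entropy(sub)
        split_info -= w * np.log2(w)
    return gain / split_info if split_info > 0 else 0.0

a = np.array([0, 0, 1, 1])
y = np.array([0, 0, 1, 1])
print(gain_ratio(a, y))  # 1.0: this split separates the classes perfectly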
dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}    (3)
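Eq. (3) is the usual Euclidean distance between two records, presumably as used by k-NN to find the nearest neighbors; a direct rendering, assuming numpy:

import numpy as np

def dist(x1, x2):
    # Euclidean distance between two records, per Eq. (3).
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    return float(np.sqrt(((x1 - x2) ** 2).sum()))

print(dist([0, 0], [3, 4]))  # 5.0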
[Figure: two SVM decision boundaries B1 and B2 with their margins, bounded by b11, b12 and b21, b22.]
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \dots + \beta_n X_n    (4)
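Eq. (4) is a linear combination of the input variables, which presumably underlies the regression-style classifiers (LR, LogR and LC). As a sketch of estimating its coefficients, assuming scikit-learn:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

# Fit a linear model of the form of Eq. (4); the betas are estimated
# as intercept_ (beta_0) and coef_ (beta_1 .. beta_n).
logr = LogisticRegression(max_iter=1000).fit(X, y)
print(logr.intercept_, logr.coef_)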
4. Experimental Results
The classifiers DT, k-NN, LogR, NB, C4.5, SVM and LC have been applied to datasets DS1 to DS29 with sample sizes of 200, 500, 1000, 3000 and 5000 records. To reduce the number of plots in this paper, the obtained results are reported in Table 5, and only the plots for the datasets with 10 variables (DS19 to DS29) are depicted. Table 5 shows the results of applying the classifiers to these datasets for each sample size.
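The following sketch shows how one block of Table 5 could be reproduced: cross-validated AUC for several of the classifiers on a stand-in dataset. The scikit-learn implementations and their default settings are assumptions, since the paper does not specify its tooling (and C4.5 and LC have no direct scikit-learn equivalents).

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in for one generated dataset at one sample size.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "k-NN": KNeighborsClassifier(),
    "LogR": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "SVM": SVC(),  # AUC is computed from decision_function scores
}
for name, clf in classifiers.items():
    auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name}: {auc:.4f}")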
Table 5. AUC of the classifiers on datasets DS19 to DS29 for each sample size

ID     Sample size  DT      k-NN    LogR    NB      C4.5    SVM     LC
DS19   200          0.7941  0.7683  0.6328  0.7126  0.7452  0.7448  0.6408
DS19   500          0.8914  0.8347  0.6117  0.6650  0.8743  0.8589  0.6075
DS19   1000         0.9141  0.8451  0.5702  0.6607  0.9176  0.9078  0.5690
DS19   3000         0.9865  0.9220  0.5800  0.6520  0.9917  0.9841  0.5780
DS19   5000         0.9942  0.9453  0.5844  0.6499  0.9956  0.9927  0.5830
DS20   200          0.6343  0.5848  0.5828  0.6264  0.5872  0.5418  0.5727
DS20   500          0.7984  0.7565  0.6318  0.6359  0.8034  0.7442  0.6236
DS20   1000         0.9139  0.8208  0.5989  0.6371  0.9099  0.8609  0.5958
DS20   3000         0.9892  0.9297  0.6241  0.6574  0.9917  0.9720  0.6211
DS20   5000         0.9972  0.9579  0.6268  0.6517  0.9984  0.9860  0.6214
DS21   200          0.7206  0.7074  0.4667  0.5737  0.6547  0.7728  0.4743
DS21   500          0.8836  0.8273  0.5468  0.5947  0.9114  0.9569  0.5095
DS21   1000         0.9684  0.8859  0.5646  0.6074  0.9731  0.9810  0.5496
DS21   3000         0.9978  0.9668  0.5745  0.5955  0.9984  0.9973  0.5492
DS21   5000         0.9965  0.9924  0.5642  0.5987  0.9963  0.9951  0.5467
DS22   200          0.6307  0.6473  0.4789  0.4473  0.6334  0.6997  0.4731
DS22   500          0.8869  0.8286  0.5477  0.5600  0.9079  0.9109  0.5478
DS22   1000         0.9545  0.9221  0.6012  0.6366  0.9574  0.9577  0.6016
DS22   3000         0.9929  0.9785  0.5559  0.5906  0.9906  0.9897  0.5495
DS22   5000         0.9823  0.9722  0.5393  0.5743  0.9802  0.9666  0.5176
DS23   200          0.8670  0.8829  0.6771  0.7060  0.8698  0.8044  0.6577
DS23   500          0.9004  0.8938  0.6249  0.6747  0.8812  0.8143  0.5798
DS23   1000         0.9389  0.9060  0.5612  0.6347  0.9337  0.8465  0.5272
DS23   3000         0.9907  0.9709  0.6082  0.6353  0.9913  0.9386  0.5916
DS23   5000         0.9977  0.9840  0.5931  0.6325  0.9981  0.9766  0.5598
DS24   200          0.6700  0.7458  0.4791  0.5507  0.7521  0.6616  0.4500
DS24   500          0.8269  0.8759  0.6047  0.6170  0.8745  0.7433  0.5935
DS24   1000         0.9366  0.9294  0.6207  0.6629  0.9393  0.8467  0.6151
DS24   3000         0.9950  0.9729  0.6117  0.6518  0.9964  0.9020  0.5993
DS24   5000         0.9975  0.9859  0.6038  0.6539  0.9980  0.9634  0.5942
DS25   200          0.6422  0.8172  0.5387  0.5318  0.7597  0.4948  0.5561
DS25   500          0.8419  0.8468  0.4988  0.5636  0.8996  0.8714  0.5154
DS25   1000         0.9840  0.9132  0.5546  0.6172  0.9826  0.9898  0.5438
DS25   3000         0.9947  0.9838  0.5689  0.6008  0.9941  0.9940  0.5468
DS25   5000         0.9944  0.9978  0.5627  0.6086  0.9952  0.9937  0.5482
DS26   200          0.8069  0.9063  0.6679  0.6127  0.8760  0.7238  0.6564
DS26   500          0.8901  0.9387  0.6326  0.6434  0.9312  0.7754  0.6080
DS26   1000         0.9599  0.9584  0.6082  0.6083  0.9755  0.8081  0.5935
DS26   3000         0.9946  0.9864  0.5608  0.6038  0.9956  0.9018  0.5321
DS26   5000         0.9977  0.9945  0.5684  0.6187  0.9979  0.9675  0.5512
DS27   200          0.6746  0.8919  0.6081  0.5949  0.8567  0.5961  0.5945
DS27   500          0.8766  0.9436  0.4900  0.5789  0.9511  0.6865  0.4917
DS27   1000         0.9587  0.9726  0.5094  0.5822  0.9811  0.7169  0.5142
DS27   3000         0.9981  0.9960  0.5722  0.6094  0.9981  0.9120  0.5733
DS27   5000         0.9992  0.9986  0.5303  0.6017  0.9992  0.9634  0.5331
DS28   200          0.8622  0.9147  0.7635  0.6148  0.9190  0.6382  0.7384
DS28   500          0.9185  0.9651  0.4849  0.5163  0.9639  0.6309  0.4896
DS28   1000         0.9712  0.9782  0.5642  0.5773  0.9788  0.7655  0.5701
DS28   3000         0.9982  0.9969  0.5339  0.5551  0.9990  0.8861  0.5260
DS28   5000         0.9986  0.9983  0.5430  0.5620  0.9984  0.9347  0.5381
DS29   200          0.9669  0.9508  0.6425  0.5886  0.9661  0.6524  0.6278
DS29   500          0.9949  0.9828  0.5558  0.5492  0.9949  0.7645  0.5570
DS29   1000         0.9975  0.9945  0.5278  0.5141  0.9975  0.7676  0.4999
DS29   3000         0.9985  0.9987  0.5640  0.5364  0.9983  0.8647  0.5646
DS29   5000         0.9983  0.9993  0.5417  0.5253  0.9983  0.8796  0.5314
5. Results Evaluation

[Figures: AUC of the classifiers DT, k-NN, LogR, NB, C4.5, SVM and LC on datasets DS19 to DS29, one plot per sample size; y-axis: AUC (0.4 to 1.1), x-axis: datasets.]
7. Acknowledgment
The authors would like to express their gratitude to the Iranian National Elite Foundation for its support of this work, which was also partially supported by the Noor data mining research group.
8. References
[1] P-N. Tan, M. Steinbach, V. Kumar, Introduction to
Data Mining, Addison-Wesley Publishing, 2006.
[2] M. Kantardzic, Data Mining: Concepts, Models,
Methods, and Algorithms, John Wiley & Sons
Publishing, 2003.
[3] I.H. Witten, E. Frank, Data Mining: Practical Machine
Learning Tools and Techniques, Morgan Kaufmann
Publishing, Second Edition, 2005.
[4] Y.S. Kim, Comparison of the decision tree, artificial neural network, and linear regression methods based on the number and types of independent variables and sample size, Expert Systems with Applications, Elsevier, 2008, pp. 1227-1234.
[5] A. Fadlalla, An experimental investigation of the impact of aggregation on the performance of data mining with logistic regression, Information & Management, Elsevier, 2005, pp. 695-707.
[6] J. Huang, J. Lu, C.X. Ling, Comparing Naïve Bayes, Decision Trees, and SVM with AUC and Accuracy, Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003), 2003.