Abstract
In traditional K-Nearest Neighbor (TR_KNN), the Euclidean distance is usually used as the distance
metric between different samples, which leads to poor classification performance. This paper
presents a novel KNN classification algorithm based on Maximum Entropy (ME_KNN), which improves the
distance metric without introducing any subjectivity. The proposed method is tested on 4 UCI datasets and
8 artificial Toy datasets. The experimental results show that the proposed algorithm achieves
significant improvements in recall, precision and accuracy over TR_KNN.
estimation in linear and nonlinear problems [11]. This fact reflects that ME is a similarity metric, namely
a distance metric, between observed values and actual values. Therefore, we use ME as the distance metric
in KNN instead of the Euclidean distance. The experimental results validate that the method is able to
classify the sample datasets effectively.
In this paper, building on TR_KNN with the Euclidean metric, we propose a k-nearest neighbor algorithm based
on Maximum Entropy (ME_KNN) that combines TR_KNN and ME; see Section 2.
Experimental results on real data and artificial data are presented in Section 3. The conclusion is given
in Section 4.
such that $\{(x_i, c_i)\}$, $i = 1, 2, \dots, n$, where $x_i = (x_i^1, x_i^2, \dots, x_i^l)$ is an $l$-dimensional vector. The Euclidean distance between training sample $x_i$ and test sample $y_j$ is

$d(x_i, y_j) = \sqrt{\sum_{k=1}^{l} (x_i^k - y_j^k)^2}$   (1)
Step 4: Determine the k nearest neighbors. Sort the distances in ascending order and take the k samples with
the smallest distances;
Step 5: Find the dominant class. Let the k nearest neighbors be $x_1, x_2, \dots, x_k$ and the corresponding
class labels be $c_1, c_2, \dots, c_t$, which belong to the label set $C$. The queried test sample is classified
according to the classes of its k nearest neighbors by means of maximum probability. The probability, which
means the percentage of each class appearing among the k nearest neighbors, is calculated as the number of
neighbors belonging to that class divided by k, and the class with the maximum probability is the
dominant class. Let $S = \{s_1, s_2, s_3, \dots, s_t\}$ be the set of counts of each class among the k nearest neighbors.
The details are described as follows:
$\arg\max_{1 \le k \le t} s_k$   (2)
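To make the procedure concrete, the following is a minimal sketch of TR_KNN in Python/NumPy. The function name tr_knn_predict, the array-based interface, and the tie-breaking behavior are conveniences of this sketch and are not specified in the paper.

```python
import numpy as np
from collections import Counter

def tr_knn_predict(train_x, train_c, test_y, k=5):
    """Classify one test sample y_j with traditional KNN (Eqs. (1)-(2)).

    train_x: (n, l) array of training samples, train_c: length-n label sequence,
    test_y: length-l query vector. Names and types are illustrative only.
    """
    # Euclidean distance between the test sample and every training sample, Eq. (1)
    dists = np.sqrt(np.sum((np.asarray(train_x) - np.asarray(test_y)) ** 2, axis=1))
    # Step 4: sort the distances in ascending order and keep the k nearest samples
    nn_idx = np.argsort(dists)[:k]
    # Step 5: count class occurrences among the k neighbors (the set S) and
    # return the class with the maximum count, i.e. arg max s_k of Eq. (2)
    counts = Counter(train_c[i] for i in nn_idx)
    return counts.most_common(1)[0][0]
```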
lead to all feature components having the same weight and making the same contribution to the classification result.
But this does not hold universally, so many scholars have introduced weight coefficients to overcome this
shortcoming [8-10]. However, classification accuracy can be affected by subjective factors when a
weight coefficient is introduced into KNN. Therefore, we introduce ME as the distance metric in KNN.
In ME, different features can be merged into one probability model without any independence requirement,
which is a significant characteristic of ME. Furthermore, the ME model has the advantages of short
training time and low classification complexity. Therefore, we use ME as the distance metric between
training sample $x_i$ and test sample $y_j$. The details are described as follows [11]:
$d_{ME}(x_i, y_j) = \sum_{k=1}^{l} y_j^k \log \frac{y_j^k}{x_i^k}$   (3)
Eq. (3) is used as the distance metric instead of the Euclidean distance used in TR_KNN. ME stays very close to
the natural state of things because it keeps all the uncertainty, and it does not involve any weighting problem, so Eq.
(3) is not affected by subjectivity and overcomes this significant defect of TR_KNN.
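As a rough illustration, the sketch below plugs Eq. (3) into the same neighbor search as before. Because Eq. (3) takes logarithms of feature ratios, the small epsilon added to guard against zero-valued features is an assumption of this sketch; the paper does not state how such values are handled.

```python
import numpy as np
from collections import Counter

def me_distance(x_i, y_j, eps=1e-12):
    """ME distance of Eq. (3): sum_k y_j^k * log(y_j^k / x_i^k).

    The eps smoothing for zero-valued features is an assumption of this sketch.
    """
    x = np.asarray(x_i, dtype=float) + eps
    y = np.asarray(y_j, dtype=float) + eps
    return float(np.sum(y * np.log(y / x)))

def me_knn_predict(train_x, train_c, test_y, k=5):
    """ME_KNN: identical to the TR_KNN sketch except Eq. (3) replaces Eq. (1)."""
    dists = np.array([me_distance(x_i, test_y) for x_i in train_x])
    nn_idx = np.argsort(dists)[:k]
    return Counter(train_c[i] for i in nn_idx).most_common(1)[0][0]
```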
3. Experimental results
In this section, the datasets and evaluation indexes are given in Section 3.1. The experiments on the two
decisive parameters for classification performance, the k value and the percentage of training samples in a dataset,
are presented in Section 3.2. The experiments on real datasets and artificial datasets comparing the
performance of TR_KNN and ME_KNN are shown in Section 3.3 and Section 3.4, respectively.
Table 1. Description of the four real UCI datasets.

Dataset    Property type    Sample number    Feature dimension
Iris       Real             150              –
Wine       Integer, Real    178              13
Abalone    Real             300              –
Balance    Categorical      910              –
In order to compare the classification performance of TR_KNN with that of ME_KNN, the macro average recall
(macro-r), the macro average precision (macro-p), and the macro-F1 measure are chosen here [13].
The macro-r, macro-p, and macro-F1 of ME_KNN are denoted by rME, pME, and F1ME, respectively,
and the macro-r, macro-p, and macro-F1 of TR_KNN are denoted by rTR, pTR, and F1TR, respectively.
The macro-r, macro-p, and macro-F1 are calculated as follows:
$\text{macro-}r = \frac{1}{t} \sum_{k=1}^{t} r_k$   (4)

$\text{macro-}p = \frac{1}{t} \sum_{k=1}^{t} p_k$   (5)

$\text{macro-}F_1 = \frac{1}{t} \sum_{k=1}^{t} F_{1k}$   (6)
In Eq. (4), the recall $r_k$ is defined as $r_k = a_k / b_k$, where $a_k$ denotes the number of kth-class test samples
predicted correctly and $b_k$ denotes the number of kth-class test samples. In Eq. (5), the precision $p_k$ is $a_k$
divided by the number of test samples predicted to be the kth class. In Eq. (6), $F_{1k}$ is defined as
$F_{1k} = 2 r_k p_k / (r_k + p_k)$, which combines the recall ($r_k$) and the precision ($p_k$) into a single measure.
In addition, the accuracy [14] is also used in Section 3.2.1. It is the proportion of test samples classified correctly:

$\mathrm{ACC} = \frac{\sum_{k=1}^{t} a_k}{\sum_{k=1}^{t} b_k}$   (7)
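For completeness, a small sketch of Eqs. (4)-(7) in Python/NumPy follows; the function name macro_metrics and the array-based interface are illustrative conveniences, not definitions from the paper.

```python
import numpy as np

def macro_metrics(y_true, y_pred, labels):
    """Return macro-r, macro-p, macro-F1 (Eqs. (4)-(6)) and accuracy (Eq. (7))."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    r_list, p_list, f1_list = [], [], []
    for c in labels:
        a_k = np.sum((y_pred == c) & (y_true == c))  # kth-class samples predicted correctly
        b_k = np.sum(y_true == c)                    # kth-class test samples
        n_pred_k = np.sum(y_pred == c)               # samples predicted to be the kth class
        r_k = a_k / b_k if b_k else 0.0
        p_k = a_k / n_pred_k if n_pred_k else 0.0
        f1_k = 2 * r_k * p_k / (r_k + p_k) if (r_k + p_k) else 0.0
        r_list.append(r_k), p_list.append(p_k), f1_list.append(f1_k)
    acc = float(np.mean(y_true == y_pred))           # Eq. (7)
    return float(np.mean(r_list)), float(np.mean(p_list)), float(np.mean(f1_list)), acc
```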
3.2.1. The classification results with different k values
Although k is a very important parameter for KNN, the only way to determine its value is to adjust it repeatedly.
In this section, in order to show the effect of different k values on classification
performance, we set k to 5, 10, 15, and 20 in the experiments, with the training set fixed at
two-thirds of the whole dataset. The results show how the accuracy changes with different k values on the Iris,
Wine, Abalone, and Balance datasets. The experimental results are described in Figure 1.
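A rough sketch of this parameter sweep is given below. The 2/3 training split and the candidate k values come from the text; the random seed, the helper name k_sweep, and the reuse of the classifier and metric sketches from earlier are assumptions of this illustration.

```python
import numpy as np

def k_sweep(features, labels, predict_fn, ks=(5, 10, 15, 20), train_frac=2/3, seed=0):
    """Accuracy of one KNN variant for several k values on a fixed random split.

    predict_fn is either tr_knn_predict or me_knn_predict from the sketches above;
    features and labels are assumed to be NumPy arrays.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(features))
    n_train = int(train_frac * len(features))
    tr, te = idx[:n_train], idx[n_train:]
    results = {}
    for k in ks:
        preds = [predict_fn(features[tr], labels[tr], features[i], k=k) for i in te]
        results[k] = float(np.mean(np.asarray(preds) == labels[te]))
    return results
```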
In Figure 1, the classification accuracy of both ME_KNN and TR_KNN reaches its peak at k = 5 and k = 15
on the Iris dataset, while the accuracy of the two KNN algorithms peaks at k = 15 on the Wine dataset.
On the Abalone dataset, the accuracy of the two algorithms peaks at k = 20, whereas on the Balance dataset
the accuracy is optimal at k = 10. In the four subfigures, the trends of the curves are quite different,
and the best-performing k values differ [12]. Therefore, we still have no deterministic method for
choosing the k value other than adjusting it repeatedly by experiment. It should be pointed out that the k
value must not exceed the number of samples of the smallest class. Although there is no deterministic
method for choosing k, it is very clear that the accuracy of ME_KNN is much higher than that of
TR_KNN for the same k value.
Figure 1. The change of accuracy with different k values on the four real datasets: (a) Iris, (b) Wine, (c) Abalone, (d) Balance
3.2.2. The classification results with different percentages of training samples
The percentage of training samples is a critical parameter for classification, which directly affects
the classification result. In order to show the influence of different percentages of training samples
on classification performance, we set the percentage of training samples in each dataset to 1/3, 1/2, and
2/3, respectively. In this experiment, we set k to 5. Table 2 demonstrates the effect of the percentage
of training samples on classification performance.
In Table 2, Ptrs denotes the percentage of training samples in each dataset, and the number of samples is
given for each dataset. On the Iris dataset, when the percentage of training samples is 2/3, rME, pME, and
F1ME of ME_KNN reach their optimal values, which are 97.92%, 98.04%, and 0.9798, respectively,
while the best performance of TR_KNN is obtained under the same condition, with rTR, pTR, and F1TR
reaching 95.83%, 97.92%, and 0.9686, respectively. On the Balance dataset, the improvements of each
evaluation index from TR_KNN to ME_KNN are much larger than on the other three datasets: Table 2
shows that rME reaches 76.17% when the percentage of training samples is 1/2, while pME reaches
76.65% and F1ME reaches 0.7518 when the percentage of training samples is 2/3. All evaluation indexes
improve gradually as the number of training samples increases for every single dataset, which illustrates
that the classification performance of both ME_KNN and TR_KNN improves with the percentage of the
training samples [15]. In summary, the classification performance of ME_KNN is better than that of TR_KNN.
Table 2. Classification results of TR_KNN and ME_KNN with different percentages of training samples (k = 5).

Dataset    Ptrs   rTR/%   rME/%   pTR/%   pME/%   F1TR     F1ME
Iris       1/3    92.93   94.95   93.33   94.98   0.9313   0.9496
Iris       1/2    93.33   94.67   93.33   94.67   0.9333   0.9467
Iris       2/3    95.83   97.92   97.92   98.04   0.9686   0.9798
Wine       1/3    90.87   91.74   92.87   93.01   0.9186   0.9237
Wine       1/2    93.85   94.45   92.69   94.73   0.9326   0.9459
Wine       2/3    92.14   96.28   94.93   95.14   0.9351   0.9571
Abalone    1/3    86.36   88.89   79.89   81.85   0.8300   0.8523
Abalone    1/2    86.50   88.00   81.24   84.17   0.8379   0.8604
Abalone    2/3    87.88   90.00   80.76   83.54   0.8417   0.8660
Balance    1/3    64.62   73.49   65.87   75.57   0.6524   0.7451
Balance    1/2    67.09   76.17   67.94   73.06   0.6751   0.7458
Balance    2/3    66.56   73.76   67.69   76.65   0.6712   0.7518
Table 3. Comparison of the classification performance of TR_KNN and ME_KNN on the four real datasets.

Dataset    rTR/%   rME/%   pTR/%   pME/%   F1TR     F1ME
Iris       95.83   97.92   97.92   98.04   0.9686   0.9798
Wine       96.66   96.80   94.71   97.22   0.9567   0.9700
Abalone    86.87   89.74   78.04   78.89   0.8222   0.8397
Balance    63.15   75.80   69.30   77.50   0.6608   0.7664
The experimental results on the real datasets are shown in Table 3. Every evaluation index on the four
datasets is higher for ME_KNN than for TR_KNN. The improvement of macro-r from
TR_KNN to ME_KNN is smallest on the Wine dataset, but it still reaches 0.14%, while every
index on the Balance dataset improves greatly and the classification performance of ME_KNN is significant there.
On the Balance dataset, macro-r, macro-p, and macro-F1 increase by about 12%, 8%, and 0.1, respectively.
Therefore, the classification performance of ME_KNN on the four real datasets is superior to that of
TR_KNN.
Table 4. Description of the eight artificial Toy datasets.

Dataset    Deviation   Variance   Number of samples   Number of classes   Feature dimension
Toy1       0.5         0.5        600                 –                   –
Toy2       0.5         –          600                 –                   –
Toy3       –           –          600                 –                   –
Toy4       –           –          600                 –                   –
Toy5       0.8         1.5        600                 –                   –
Toy6       –           –          600                 –                   –
Toy7       –           –          600                 –                   10
Toy8       –           –          600                 –                   11
Table 5. Comparison of the classification performance of TR_KNN and ME_KNN on the artificial Toy datasets.

Dataset    Feature dimension   rTR/%   rME/%    pTR/%   pME/%    F1TR     F1ME
Toy1       –                   99.01   100.00   99.01   100.00   0.9900   1.0000
Toy2       –                   98.50   99.50    98.54   99.51    0.9852   0.9950
Toy3       –                   98.50   98.50    98.50   98.50    0.9850   0.9850
Toy4       –                   97.00   97.50    97.08   97.57    0.9705   0.9747
Toy5       –                   93.95   96.94    94.14   96.94    0.9411   0.9697
Toy6       –                   84.62   86.00    86.13   86.95    0.8421   0.8563
Toy7       10                  82.89   85.11    83.42   85.12    0.8395   0.8535
Toy8       11                  81.00   81.70    85.05   81.56    0.8280   0.8097
4. Conclusions
In this paper, a novel KNN classification algorithm based on Maximum Entropy is proposed by
combining KNN and Maximum Entropy. The methods for determining the k value and the percentage of
training samples are presented in the two experiments on optimal parameter selection. ME_KNN
shows superiority over TR_KNN in recall, precision, accuracy, and stability in the experiments on
both real and artificial datasets.
5. Acknowledgments
This work was supported in part by the Natural Science Foundation of Northeast Agricultural
University under contract no. 2011RCA01.
6. References
[1] Sarabjot S. Anand, David A. Bell, John G. Hughes, A General Framework for Data Mining Based
on Evidence Theory, Data & Knowledge Engineering, Elsevier, vol. 18, no. 3, pp.189-223, 1996.
[2] Huawen Liu, Shichao Zhang, Noisy Data Elimination Using Mutual k-Nearest Neighbor for
Classification Mining, Journal of Systems and Software, Elsevier, vol. 85, no. 5, pp.1067-1074,
2012.
[3] Taeho Jo, Malrey Lee, Yigon Kim, String Vectors as a Representation of Documents with
Numerical Vectors in Text Categorization, Journal of Convergence Information Technology,
AICIT, vol. 2, no. 1, pp.66-73, 2007.
[4] Jun Toyama, Mineichi Kudo, Hideyuki Imai, Probably Correct K-Nearest Neighbor Search in
High Dimensions, Pattern Recognition, Elsevier, vol. 43, no. 4, pp.1361-1372, 2012.
[5] Richard Nock, Paolo Piro, Frank Nielsen, Wafa Bel Haj Ali, Michel Barlaud, Boosting k-NN for
Categorization of Natural Scenes, International Journal of Computer Vision, Springer US, vol.
100, no. 3, pp.294-314, 2012.
[6] T. M. Cover, P. E. Hart, Nearest Neighbor Pattern Classification, IEEE Transactions on Information
Theory, IEEE, vol. 13, no. 1, pp.21-27, 1967.
[7] Mohammad Ashraf, Girija. Chetty, Dat Tran, Dharmendra Sharma, A New Approach for
Constructing Missing Features Values, IJIIP, AICIT, vol. 3, no. 1, pp.110- 118, 2012.
[8] Eui-Hong Han, George Karypis, Vipin Kumar, Text Categorization Using Weight Adjusted
k-Nearest Neighbor Classification, In Proceedings of the 5th Pacific-Asia Conference on
Knowledge Discovery and Data Mining, pp.53-65, 2001.
[9] Mansoor Zolghadri Jahromi, Elham Parvinnia, Robert John, A Method of Learning Weighted
Similarity Function to Improve the Performance of Nearest Neighbor, Information Sciences,