$$I(p_i, n_i) \approx \frac{2\,p_i\,n_i}{(p_i+n_i)^2\ln 2}$$
Substituting the above formula into E(A) then gives:

$$E(A) = \sum_{i=1}^{v}\frac{p_i+n_i}{P+N}\,I(p_i,n_i) = \sum_{i=1}^{v}\frac{p_i+n_i}{P+N}\left(-\frac{p_i}{p_i+n_i}\log_2\frac{p_i}{p_i+n_i}-\frac{n_i}{p_i+n_i}\log_2\frac{n_i}{p_i+n_i}\right) \approx \frac{2}{(P+N)\ln 2}\sum_{i=1}^{v}\frac{p_i n_i}{p_i+n_i}$$
Because $(P+N)\ln 2$ is a constant for a given training data set, we can define a function S(A) that satisfies the following formula:
$$S(A) = \sum_{i=1}^{v}\frac{p_i\,n_i}{p_i+n_i}$$
S(A) contains only addition, multiplication, and division operations, so its computation time is certainly shorter than that of E(A), which contains multiple logarithmic terms. Therefore, the function S(A) is selected to compute the entropy of each property for comparison, and the property with the biggest information gain, that is, the smallest entropy, is selected as the splitting node. In this article, S(A) is named the simplified information entropy. However, the derivation of S(A) rests on the approximation ln(1 + x) ≈ x, so for a given number N of property values the classification accuracy of the decision tree classifier is necessarily affected to some degree. Nevertheless, the experimental results show that this influence on the overall classification performance of the classifier is very small.
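The two-class comparison above can be sketched in code. The following is a minimal illustration (the function names and the sample counts are our own, not from the article), assuming each value i of an attribute contributes a pair (p_i, n_i) of positive/negative sample counts:

```python
import math

def entropy_E(counts, P, N):
    """Exact ID3 conditional entropy E(A) for the two-class case.

    counts: one (p_i, n_i) pair per attribute value; P, N: class totals.
    """
    e = 0.0
    for p, n in counts:
        s = p + n
        for c in (p, n):
            if c:  # the term 0 * log2(0) is taken as 0
                e -= (s / (P + N)) * (c / s) * math.log2(c / s)
    return e

def entropy_S(counts):
    """Simplified entropy S(A) = sum_i p_i * n_i / (p_i + n_i)."""
    return sum(p * n / (p + n) for p, n in counts if p + n)

# A hypothetical two-valued attribute: one value with (3 P, 0 N),
# the other with (6 P, 6 N), out of 9 P / 6 N samples in total.
counts = [(3, 0), (6, 6)]
print(entropy_E(counts, 9, 6))  # exact: (12/15) * 1 bit = 0.8
print(entropy_S(counts))        # simplified: 0 + 36/12 = 3.0
```

Both quantities rank attributes in the same direction (smaller means larger information gain), but S(A) avoids every logarithm.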
B. Analysis of algorithm
The simplified operation is easily extended to multiple classes. Suppose the training sample set E has C classifications in total, with the quantity of each class being p_i, i = 1, 2, ..., C. If property A is considered as the root of the decision tree and A has V values v_1, v_2, ..., v_V, it divides E into V subsets E_1, E_2, ..., E_V. Suppose the quantity of the j-th class contained in E_i is P_ij, j = 1, 2, ..., C; then the entropy of subset E_i is:

$$\mathrm{Entropy}(E_i) = -\sum_{j=1}^{C}\frac{P_{ij}}{|E_i|}\log_2\frac{P_{ij}}{|E_i|}$$
Because $|E_i|\ln 2$ is a constant, making the same substitution as in the two-class case shows that Entropy(E|A) is proportional to

$$\sum_{i=1}^{V}\frac{1}{|E_i|}\sum_{j=1}^{C}\sum_{k=j+1}^{C}P_{ij}P_{ik}$$

This polynomial has the same form as in the two-class situation, only with more classes of knowledge classification. So, in the same situation, computing the information entropy with this polynomial is much faster than with the E(A) function, and it can improve the efficiency of the decision tree classifier.
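The multi-class criterion can be sketched the same way. The implementation below is our own illustration (names assumed): it computes the sum over subsets E_i of (1/|E_i|) times the pairwise products of class counts, using only addition, multiplication, and division. With C = 2 it reduces exactly to S(A) above, since the pairwise product sum for counts (p, n) is p·n:

```python
def simplified_entropy_multiclass(subsets):
    """Simplified multi-class splitting criterion.

    subsets: for each value of the attribute, a list [P_i1, ..., P_iC]
    of class counts in subset E_i.
    Returns sum_i (1/|E_i|) * sum_{j<k} P_ij * P_ik.
    """
    total = 0.0
    for counts in subsets:
        n_i = sum(counts)
        if n_i == 0:
            continue
        # sum_{j<k} P_ij * P_ik = (n_i^2 - sum_j P_ij^2) / 2
        pairs = (n_i * n_i - sum(c * c for c in counts)) / 2
        total += pairs / n_i
    return total

# Two-class sanity check: each subset contributes p*n/(p+n)
print(simplified_entropy_multiclass([[3, 0], [6, 6]]))  # 0 + 36/12 = 3.0
```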
IV. SAMPLE VALIDATION
A specific example is used to compare the advantages and disadvantages of the improved method mentioned above and the ID3 algorithm. The training sample is described in Table 1. Figure 1 is the decision tree generated by the ID3 algorithm. From the figure, we can clearly see the multiple-valued preference in property selection, so we reconstruct a decision tree with the optimization method offered in this article: firstly, reconstruct the decision tree according to the method put forward to overcome multiple-valued property selection. Seeing Figure 2, we can find that this method better overcomes the multiple-valued preference problem of the ID3 algorithm; the decision tree is simpler after optimization, and it maintains the precision of the original algorithm well. The decision tree in Figure 3 can be obtained by selecting physical condition as the root node of the decision tree and calling this method recursively to construct the various sub-trees.
Table 1: training sample

Sample     Monthly income   Physical condition         Gender   Classification
one        no               healthy                    male     N
two        no               suffering three diseases   female   P
three      no               healthy                    male     N
four       yes              suffering two diseases     male     P
five       no               suffering three diseases   female   P
six        no               suffering one disease      female   N
seven      yes              suffering two diseases     male     P
eight      no               suffering one disease      male     N
nine       no               suffering two diseases     female   N
ten        no               suffering two diseases     female   P
eleven     no               healthy                    male     P
twelve     no               suffering one disease      female   P
thirteen   no               suffering three diseases   female   P
fourteen   no               healthy                    male     N
fifteen    yes              suffering one disease      female   P
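As an illustration, the root-node selection on Table 1 can be reproduced with the simplified entropy. The per-value (p_i, n_i) counts below are tallied by us from the table's 15 samples (9 P, 6 N); the code itself is our own sketch, not the article's implementation:

```python
def S(counts):
    """Simplified information entropy: sum_i p_i * n_i / (p_i + n_i)."""
    return sum(p * n / (p + n) for p, n in counts if p + n)

# (positives, negatives) per attribute value, tallied from Table 1
attributes = {
    "monthly income":     [(3, 0),    # yes
                           (6, 6)],   # no
    "physical condition": [(1, 3),    # healthy
                           (2, 2),    # one disease
                           (3, 1),    # two diseases
                           (3, 0)],   # three diseases
    "gender":             [(3, 4),    # male
                           (6, 2)],   # female
}

# Smallest S(A) <=> largest information gain => splitting attribute
root = min(attributes, key=lambda a: S(attributes[a]))
for name, counts in attributes.items():
    print(f"S({name}) = {S(counts):.3f}")
print("root:", root)  # physical condition, the root chosen in Figure 3
```

S(physical condition) = 2.5 is the smallest of the three, which agrees with its selection as the root node.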
2010 International Conference on Computer Design and Applications (ICCDA 2010)
Figure 1: the decision tree constructed with the ID3 algorithm
Figure 2: the decision tree constructed under the condition of avoiding multi-value preference
Figure 3: the decision tree constructed with the new property selection criteria
From the construction of the decision trees, we can see that although the property selection criterion changes, the form of the constructed decision tree is basically consistent with that of the ID3 algorithm, while the computation is simpler and the computational speed is greatly improved. The method is therefore suitable for mining in large-scale databases or data warehouses.
V. CONCLUSION
Through analysis of the common problems in the construction process of decision trees, this article focuses on the problems of the decision tree algorithm in the steps of property vacancy and property selection. It puts forward the concept of Weighted Simplification Entropy, uses the size of the Weighted Simplification Entropy to determine the splitting property, proposes an optimization strategy based on the ID3 algorithm, and then constructs the decision tree classifier. The example verifies that the method put forward in this article outperforms the ID3 algorithm, and can effectively improve the efficiency, predictive precision, and rule conciseness of the algorithm.