
Moving towards efficient decision tree construction

B. Chandra*, P. Paul Varghese
Department of Mathematics, Indian Institute of Technology, New Delhi 110016, India
* Corresponding author. Tel.: +91 11 26591493; fax: +91 11 26581005. E-mail addresses: bchandra104@yahoo.co.in (B. Chandra), pallathpv@yahoo.com (P. Paul Varghese).
Article info
Article history:
Received 5 November 2007
Received in revised form 2 December 2008
Accepted 8 December 2008
Keywords:
Decision trees
Gini Index
Gain Ratio
Split measure
Abstract
Motivated by the desire to construct compact (in terms of the expected length to be traversed to reach a decision) decision trees, we propose a new node splitting measure for decision tree construction. We show that the proposed measure is convex and cumulative and utilize this in the construction of decision trees for classification. Results obtained from several datasets from the UCI repository show that the proposed measure results in decision trees that are more compact, with classification accuracy comparable to that obtained using popular node splitting measures such as Gain Ratio and the Gini Index.
© 2008 Published by Elsevier Inc.
1. Introduction
Top down induction of decision trees is a powerful and popular method of pattern classification [15,22,23]. The decision tree is generated by recursively partitioning the training data using a splitting attribute until all the records in a partition belong to the same class. The splitting attribute is chosen based on the value of the node splitting measure. As with other pattern classification paradigms, more complex models (larger decision trees) tend to produce poorer generalization performance. Not surprisingly then, a large amount of effort has gone into producing decision trees of smaller size.
The techniques for producing decision trees of smaller size can be viewed as those that are applied during the construction of the tree (such as a new node splitting measure or new stopping criteria) and those that are applied after the construction of the tree (pruning). The distinction is subtle though important, since the decision made at each node of the tree is based on a greedy search. It is therefore entirely possible that a partitioning which substantially optimizes the node splitting criterion at a node results in sub-partitions where patterns of different classes are distributed in a way that requires many more partitions within the sub-partition. Since it is not possible to look ahead [9,19,21,29] beyond one or a few levels, the impact of techniques applied during the construction phase is limited. A promising alternative is to use an estimate of the complexity of classifying patterns in the resulting sub-partitions and to favor partitions that optimize the node splitting measure while resulting in sub-partitions which are amenable to easier classification [10,12,13,16].
The techniques used during the construction phase of the tree are nonetheless an important area of research, since they can help in producing decision trees of smaller size when used in combination with techniques applied after the construction of the tree. The node splitting measure is primary among the techniques that can be applied during the construction of the decision tree. Though there have been proposals for new node splitting measures, the most popular ones remain the information theoretic variants [20,23] and the Gini Index [2].
From another perspective, the comprehensibility of decision trees has also contributed to their popularity. Within this perspective, smaller decision trees lead to more compact (and more general) rules. The notion of the size of a decision tree and how it is measured is thus important enough to merit additional comment. More specifically, here and elsewhere in this paper, our notion of the size or compactness of a decision tree is not based on a count of the total number of nodes. To explain this departure from the more common practice of using the number of nodes as the size of the tree, consider the root of a decision tree which has a single left child and, to make the point, a right sub-tree with 20 nodes. The size of the decision tree in this case is 22 nodes. However, if the data distribution is such that 80% of the data is classified by the left child of the root node, then the effective size of the tree is much less. Indeed it would be comparable to a tree with the left sub-tree of the root having, say, five nodes and the right sub-tree having five nodes (i.e. a tree with a total of 11 nodes). We thus utilize the more meaningful measure of the expected length traversed to reach a decision (a leaf node) as the size of the tree.
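The expected length is simply the leaf depth averaged over the training records. The following minimal Python sketch (our own illustration, not code from the paper; the depths and record counts are made up) shows the computation for the two hypothetical trees discussed above.

```python
# Minimal sketch: the expected length traversed to reach a decision is the leaf
# depth weighted by the fraction of records that end up at each leaf.

def expected_decision_length(leaves):
    """leaves: list of (depth, record_count) pairs, one per leaf."""
    total = sum(count for _, count in leaves)
    return sum(depth * count / total for depth, count in leaves)

# The skewed tree from the text: the left child (depth 1) classifies 80% of,
# say, 1000 records; the remaining 20% reach leaves of the 20-node right
# sub-tree at an assumed average depth of 5.
print(expected_decision_length([(1, 800), (5, 200)]))  # 1.8
# A balanced 11-node tree whose leaves all sit at depth 3.
print(expected_decision_length([(3, 500), (3, 500)]))  # 3.0
```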
Motivated by performance and comprehensibility considerations, we propose a new node splitting measure in this paper. We show that the proposed measure is convex and well behaved. Our results over a large number of problems indicate that the measure results in smaller trees in a large number of cases without any loss in classification accuracy.
The rest of the paper is laid out as follows. In Section 2, we recall two popular node splitting measures and highlight some of the induction algorithms that utilize them. Our intent is not to provide a comprehensive review but to provide details on the most popular measures that are also relevant for the rest of the paper. In Section 3, we introduce the proposed node splitting measure and derive some of its properties. In Section 4, we provide an algorithm to construct decision trees utilizing the proposed measure. In Section 5, we provide results obtained with the proposed measure and compare it to popular decision tree induction algorithms. In Section 6, we present our conclusions.
2. Some existing splitting measures and algorithms
In this section we describe the more popular split measures and the algorithms that utilize them. Our intent is not to provide an exhaustive review but rather to give an overview of those measures that are required to make the paper self-contained. More extensive reviews appear in [18,28]. The commonly used standard splitting measures, Gain Ratio [24] and the Gini Index [2], which are used in a variety of classification algorithms, are described in this section.
2.1. Gain Ratio
Entropy based Information Gain is used as the split measure in the ID3 algorithm [23]:

$$\mathrm{Entropy}(S) = -\sum_{I} p(I)\,\log_2 p(I), \qquad (1)$$

where p(I) is the proportion of S belonging to class I. Entropy has also been used for incidence pattern classification [8], where the important attributes are chosen for classification based on the information incidence degree. The information gain Gain(S, A) of an example set S on an attribute A is defined as

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v), \qquad (2)$$

where S_v is the subset of S for which attribute A has value v and |S_v| denotes the size of the subset S_v.
The notion of Gain [22] used in ID3 [23] tends to favor attributes that have a large number of values over those with few values. Such an attribute will have the highest Gain compared to all other attributes [18]. To compensate for this, Quinlan suggested using the following ratio instead of Gain:

$$\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}, \qquad (3)$$

where SplitInfo(S, A) = I(|S_1|/|S|, |S_2|/|S|, ..., |S_m|/|S|) is the information due to the split of S on the basis of the value of the categorical attribute A. In C4.5 [24,25] the attribute that maximizes the Gain Ratio is chosen as the splitting attribute.
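As a concrete illustration of Eqs. (1)-(3), the following Python sketch computes entropy, information gain and Gain Ratio for a single categorical attribute. It is our own minimal illustration, not code from ID3 or C4.5; the dataset representation (a list of (attribute value, class label) pairs) is an assumption made for brevity.

```python
# Minimal sketch of Eqs. (1)-(3) for one categorical attribute.
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_I p(I) log2 p(I), Eq. (1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(records):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), Eq. (2)."""
    by_value = {}
    for value, label in records:
        by_value.setdefault(value, []).append(label)
    labels = [label for _, label in records]
    return entropy(labels) - sum(
        len(subset) / len(records) * entropy(subset) for subset in by_value.values())

def gain_ratio(records):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), Eq. (3)."""
    split_info = entropy([value for value, _ in records])  # I(|S_1|/|S|, ..., |S_m|/|S|)
    return gain(records) / split_info if split_info > 0 else 0.0

# Example: an attribute with values 'x'/'y' over a two-class sample.
data = [('x', 0), ('x', 0), ('x', 1), ('y', 1), ('y', 1), ('y', 1)]
print(gain(data), gain_ratio(data))
```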
Variations of Gain Ratio have also been reported in the literature, but they have several limitations. Normalized Gain [14] is a minor modification of the Gain Ratio measure; the authors state several assumptions under which Normalized Gain performs better than Gain Ratio, but such cases do not occur very frequently. The Average Gain measure proposed by Dianhong and Liangxiao [7] is also a small variation of the Gain Ratio measure. It aims at overcoming the drawback of Gain Ratio that the split information (the denominator of the Gain Ratio measure) sometimes becomes zero or very small, and it divides the Information Gain by the number of values of the attribute instead of by the split information. The problem with this measure is that it is unable to handle numeric attributes.
2.2. Gini Index
The Gini Index [2] is used as the split measure in the SLIQ algorithm developed by the Quest team at IBM [17]. The Gini Index is minimized at each split, so that the tree becomes less diverse as we progress. Class histograms are built for each successive pair of values of an attribute. At any particular node, after obtaining the histograms for all attributes, the Gini Index for each histogram is computed. The histogram that gives the least Gini Index gives us the splitting point for the node under consideration. The Gini Index for a sample histogram (refer to Table 1) with two classes, A and B, is defined as follows:
$$\mathrm{Gini\ Index} = \frac{a_1 + a_2}{n}\left[1 - \left(\frac{a_1}{a_1 + a_2}\right)^2 - \left(\frac{a_2}{a_1 + a_2}\right)^2\right] + \frac{b_1 + b_2}{n}\left[1 - \left(\frac{b_1}{b_1 + b_2}\right)^2 - \left(\frac{b_2}{b_1 + b_2}\right)^2\right], \qquad (4)$$

where n = total number of records = a1 + a2 + b1 + b2.
The Gini Index is computed for each histogram and for each attribute. Once the Gini Index for each histogram is known, the split attribute is chosen to be the one whose class histogram gives the least Gini Index, and the split value equals the splitting point h for that histogram. The stopping criterion is met when the Gini Index at a node becomes zero, as this implies that all data records at the node have been classified completely.
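The following short Python sketch evaluates Eq. (4) for one candidate split value h from a two-class histogram of the form shown in Table 1. It is our own illustration rather than SLIQ source code; the example counts are assumptions.

```python
# Minimal sketch of Eq. (4): Gini Index of a candidate split from a two-class histogram.

def gini_from_histogram(a1, a2, b1, b2):
    """a1, a2: class A/B counts below h; b1, b2: class A/B counts on the other side of h."""
    n = a1 + a2 + b1 + b2
    left, right = a1 + a2, b1 + b2
    gini_left = 1 - (a1 / left) ** 2 - (a2 / left) ** 2 if left else 0.0
    gini_right = 1 - (b1 / right) ** 2 - (b2 / right) ** 2 if right else 0.0
    return (left / n) * gini_left + (right / n) * gini_right

# A pure split (each side holds a single class) attains the minimum value 0,
# while a completely mixed split gives 0.5.
print(gini_from_histogram(10, 0, 0, 10))  # 0.0
print(gini_from_histogram(5, 5, 5, 5))    # 0.5
```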
2.3. Existing algorithms
Efficient C4.5 [27] is an improvement over C4.5 in which various strategies are suggested. One of the strategies proposed is to sort the data using quick sort or counting sort, and another is to compute the local threshold of C4.5 using a main-memory version of the RainForest algorithm, which does not need sorting. In the Robust C4.5 [33] algorithm, Gain Ratio is computed using only attributes whose Gain is greater than the Average Gain, which takes care of overfitting. In SLIQ [17], the Gini Index is computed at successive midpoints of each attribute and the attribute with the minimum Gini Index is chosen as the splitting attribute. The SPRINT [30] algorithm aims at parallelizing SLIQ. In PUBLIC [26] the approach used for tree generation is the same as in SPRINT, but Entropy is used as the split measure and pruning is carried out while building the decision tree. In the CLOUDS algorithm [1], instead of computing the Gini Index for each distinct value of a numeric attribute, the value range of the attribute is discretized into intervals containing approximately the same number of points using a quantization-based technique; for each numeric attribute, the Gini Index is then evaluated at the points that separate two intervals (i.e., at the interval boundaries) and the point with the globally least value is used for splitting. This ensures that the best split point is chosen, but unfortunately it requires another pass over the entire dataset. CMP-S [31] uses the same discretization technique as CLOUDS to reduce the computational complexity; a major improvement of CMP-S over CLOUDS is that it manages to avoid scanning the dataset a second time to determine the exact split point. The Elegant Decision Tree algorithm [4] suggests an improvement over SLIQ wherein the Gini Index is computed not for every successive pair of values of an attribute, as in SLIQ, but over different ranges of attribute values. The Robust Decision Tree algorithm [5] is an improvement over SLIQ where the Gini Index is evaluated at class boundaries.
3. Proposed measure
The proposed measure is designed to reduce the number of distinct classes that would result in each sub-tree after a split. After presorting the attribute values (along with their class labels), the measure is computed for each attribute at every successive midpoint of distinct attribute values. The attribute that has the minimum measure value is chosen as the splitting attribute, with the corresponding split value.
3.1. Split measure
Let the set of records before the split be denoted by R. The records in R are described using a set of attributes A_1, A_2, ..., A_m and a class label C. The set of attributes is treated as a single attribute A_{1,...,m} whose domain is the Cartesian product of the domains of A_1, A_2, ..., A_m, i.e. dom(A_{1,...,m}) = dom(A_1) × dom(A_2) × dom(A_3) × ... × dom(A_m). Let dom(C) = {c_1, c_2, ..., c_k} be the domain of the class label C and k be the class domain size (i.e. the number of distinct values in dom(C)). Let n_i(S) denote the number of records in S ⊆ R for which the class label is c_i (1 ≤ i ≤ k). Thus, each subset S of R can be mapped to a point n(S) = (n_1(S), n_2(S), n_3(S), ..., n_k(S)) in the k-dimensional Euclidean space. The total number of records in the subset S is |S| = Σ_{i=1}^{k} n_i(S). The measure value at split point p is given as follows:
Table 1
Class histogram (h is the splitting value for an attribute).

Attribute value < h    Class A    Class B
L                      a1         a2
R                      b1         b2

a1 denotes the number of records with attribute value less than h that belong to class A, and a2 the number of such records that belong to class B. b1 denotes the number of records with attribute value greater than h that belong to class A, and b2 the number of such records that belong to class B.
$$\mathrm{Measure}(\text{split point } p) = \frac{|S_1|}{|R|}\cdot\frac{C_{S_1}}{C_R}\cdot\sum_{i=1}^{C_{S_1}}\frac{n_i(S_1)}{n_i(R)} \;+\; \frac{|S_2|}{|R|}\cdot\frac{C_{S_2}}{C_R}\cdot\sum_{i=1}^{C_{S_2}}\frac{n_i(S_2)}{n_i(R)}, \qquad (5)$$

where R is the entire dataset under consideration, S_1 is the dataset above the split point p, S_2 is the dataset below the split point p, C_{S_1} is the number of distinct classes in S_1, C_{S_2} is the number of distinct classes in S_2, C_R is the number of distinct classes in R, n_i(S_1) is the number of records in S_1 ⊆ R having class c_i, n_i(S_2) is the number of records in S_2 ⊆ R having class c_i, and n_i(R) is the number of records in R having class c_i.
In the case of numeric attributes, the measure has two terms as shown in (5); for categorical attributes, however, the number of terms in the equation depends on the number of distinct values that the attribute can take. Properties of the proposed split measure are discussed in the following subsections.
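For a numeric attribute and a binary split, Eq. (5) can be computed directly from the class labels that fall on either side of the candidate split point. The sketch below is our own minimal Python illustration of that computation (the function and argument names are assumptions, not the paper's code).

```python
# Minimal sketch of Eq. (5): the proposed measure at a split point p of a numeric attribute.
from collections import Counter

def proposed_measure(labels_s1, labels_s2):
    """labels_s1 / labels_s2: class labels of the records above / below the split point p."""
    n_R = Counter(labels_s1) + Counter(labels_s2)   # n_i(R): class counts in R
    C_R = len(n_R)                                  # number of distinct classes in R
    size_R = sum(n_R.values())                      # |R|

    def term(labels_s):
        n_S = Counter(labels_s)                     # n_i(S_j)
        C_S = len(n_S)                              # distinct classes in the subset
        ratio_sum = sum(n_S[c] / n_R[c] for c in n_S)
        return (len(labels_s) / size_R) * (C_S / C_R) * ratio_sum

    return term(labels_s1) + term(labels_s2)

# A pure binary split reaches the lower bound 0.5 (see Section 3.5.1) ...
print(proposed_measure(['a'] * 6, ['b'] * 4))            # 0.5
# ... while a split that mixes both classes on each side scores higher.
print(proposed_measure(['a', 'b'] * 3, ['a', 'b'] * 2))  # ~1.04
```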
3.2. Properties of the proposed split measure
Convexity, cumulative behavior and well-behavedness are the important properties any good split measure should possess. Average Class Entropy and Information Gain [20] have been shown to be convex [12,13] and the well-behavedness [10] property of these functions has been proved. It has been shown that Gain Ratio is not a convex function but is still well behaved [10,11]. The Gini Index has been proved to be strictly convex [2,3,32], well behaved and also cumulative [12]. The well-behavedness and cumulative property of the Gini Index make it suitable for multisplitting. It has been shown that this property holds good not only for the entropy based measures, but for all strictly convex measures [6]. Properties of the proposed split measure are given in the following subsections.
3.3. Convexity of proposed split measure
The proof of convexity of the proposed measure is given below.

Lemma 1. The proposed measure is convex on n.

Proof. The split measure is evaluated on R to split the dataset into two disjoint subsets S and (R − S) at the split point which has the minimum value of the split measure. To establish convexity we show that, for any vector D ≠ 0, the second derivative of the function along D is nonnegative.
Let r = (r_1, r_2, ..., r_k) represent the set of all records to be split, with ||r|| = Σ_{i=1}^{k} r_i, which is also equal to Σ_{i=1}^{k} n_i(R), i.e. |R|, and let ||n|| = Σ_{i=1}^{k} n_i, i.e. |S|. The equation of the proposed measure can be transformed as follows:
$$\mathrm{Measure} = \frac{|S|}{|R|}\cdot\frac{C_S}{C_R}\cdot\sum_{i=1}^{k}\frac{n_i(S)}{n_i(R)} \;+\; \frac{|R - S|}{|R|}\cdot\frac{C_{R-S}}{C_R}\cdot\sum_{i=1}^{k}\frac{n_i(R) - n_i(S)}{n_i(R)}. \qquad (6)$$

Here the summation runs from 1 to k instead of 1 to C_S since, for all other values of i, n_i(S) is equal to zero.
$$\mathrm{Measure} = \frac{1}{C_R\,|R|}\,M(n(S)) + \frac{1}{C_R\,|R|}\,M\big(n(R) - n(S)\big),$$

$$M(n(S)) = |S|\; C_S \sum_{i=1}^{k}\frac{n_i(S)}{n_i(R)}, \qquad M(n(R-S)) = |R - S|\; C_{R-S} \sum_{i=1}^{k}\frac{n_i(R-S)}{n_i(R)}.$$

Since n_i(S) is the number of records in S having class c_i, n(R − S) can be written as n(R) − n(S) and n_i(R − S) as n_i(R) − n_i(S), so that

$$M\big(n(R) - n(S)\big) = |R - S|\; C_{R-S} \sum_{i=1}^{k}\frac{n_i(R) - n_i(S)}{n_i(R)}.$$

Now

$$|R - S| = \sum_{i=1}^{k} n_i(R - S) = \sum_{i=1}^{k}\big(n_i(R) - n_i(S)\big) = \sum_{i=1}^{k} n_i(R) - \sum_{i=1}^{k} n_i(S) = |R| - |S|,$$

$$M\big(n(R) - n(S)\big) = \big(|R| - |S|\big)\; C_{R-S} \sum_{i=1}^{k}\frac{n_i(R) - n_i(S)}{n_i(R)}, \qquad M(n(S)) = |S|\; C_S \sum_{i=1}^{k}\frac{n_i(S)}{n_i(R)}. \qquad (7)$$
So, let D = (d_1, d_2, d_3, ..., d_k) and let Y = ⟨D, n⟩ = Σ_{i=1}^{k} d_i n_i. Here n_i is the number of records of class c_i and Y is a linear combination of the n_i's. The d_i can take only positive values; if d_i were zero then Y would not involve records of class c_i. The first derivative of M(n) is
$$M'(n) = \frac{dM(n)}{dY} = \sum_{i=1}^{k}\frac{1}{d_i}\,\frac{\partial M(n)}{\partial n_i} = C_S\left(\sum_{i=1}^{k} n_i(S)\right)\left(\sum_{i=1}^{k}\frac{1}{d_i\, n_i(R)}\right) + C_S\left(\sum_{i=1}^{k}\frac{1}{d_i}\right)\left(\sum_{i=1}^{k}\frac{n_i(S)}{n_i(R)}\right) = C_S\left(\sum_{i=1}^{k} n_i(S)\right)\left(\sum_{i=1}^{k}\frac{1}{d_i\, n_i(R)}\right) + C_S\,\|t\|\left(\sum_{i=1}^{k}\frac{n_i(S)}{n_i(R)}\right),$$

where t = (1/d_1, 1/d_2, 1/d_3, ..., 1/d_k), and the second derivative is
$$M''(n) = C_S\left(\sum_{i=1}^{k}\frac{1}{d_i}\right)\left(\sum_{i=1}^{k}\frac{1}{d_i\, n_i(R)}\right) + C_S\,\|t\|\left(\sum_{i=1}^{k}\frac{1}{d_i\, n_i(R)}\right) = C_S\,\|t\|\left(\sum_{i=1}^{k}\frac{1}{d_i\, n_i(R)}\right) + C_S\,\|t\|\left(\sum_{i=1}^{k}\frac{1}{d_i\, n_i(R)}\right) = 2\, C_S\,\|t\|\sum_{i=1}^{k}\frac{1}{d_i\, n_i(R)}. \qquad (8)$$

Now, since C_S ≥ 1, n_i(R) ≥ 1, 1 ≥ d_i ≥ 0 and ||t|| ≥ 1, it follows that M''(n) > 0.
Hence the measure is strictly convex. A strictly convex function achieves its minimum value at a boundary point. For selecting the best split point of an attribute while generating a multiway decision tree, the split measure used must be well behaved and cumulative [11,12]. The following subsection shows the well-behavedness of the split measure.
3.4. Well-behavedness of the split measure
It is important for a split measure to favor partitions that keep records of the same class together. Such an evaluation function is termed useful. Convex functions possess this property and hence form a subclass of useful evaluation functions. If a useful evaluation function is cumulative, then it is also well behaved [10,11]. It has already been shown that the proposed measure is convex and hence useful. We now show that the proposed measure is cumulative. An attribute evaluation function (i.e. the split measure) is said to be cumulative if the value of the split measure is obtained by (weighted) summation over the impurities of its subsets [10,11]. If the subsets S_i ⊆ R form a partition of the dataset R, then an evaluation function F is cumulative if there exists a function f such that F(R) = constant · Σ_i f(S_i), where the constant may depend on the entire dataset R but not on its partitions. It is shown below that the proposed measure is cumulative:
$$\mathrm{Measure} = \frac{|S|}{|R|}\cdot\frac{C_S}{C_R}\cdot\sum_{i=1}^{k}\frac{n_i(S)}{n_i(R)} + \frac{|R| - |S|}{|R|}\cdot\frac{C_{R-S}}{C_R}\cdot\sum_{i=1}^{k}\frac{n_i(R) - n_i(S)}{n_i(R)}$$

$$= \frac{1}{|R|\,C_R}\sum_{i=1}^{k}\frac{n_i(S)\,|S|\,C_S}{n_i(R)} + \frac{1}{|R|\,C_R}\sum_{i=1}^{k}\frac{\big(n_i(R) - n_i(S)\big)\,\big(|R| - |S|\big)\,C_{R-S}}{n_i(R)} = F(R),$$

$$F(R) = \frac{1}{|R|\,C_R}\sum_{i=1}^{k}\frac{n_i(S_1)\,|S_1|\,C_{S_1}}{n_i(R)} + \frac{1}{|R|\,C_R}\sum_{i=1}^{k}\frac{n_i(S_2)\,|S_2|\,C_{S_2}}{n_i(R)}, \quad \text{where } S_2 = R - S_1,$$

$$= \frac{1}{|R|\,C_R}\,f(S_1) + \frac{1}{|R|\,C_R}\,f(S_2) = \frac{1}{|R|\,C_R}\sum_{j} f(S_j). \qquad (9)$$
Here j = 1, 2 because of binary splitting. Hence the proposed measure is cumulative. The proposed measure is convex and cumulative, and hence it is well behaved.
3.5. Bounds of the proposed split measure
3.5.1. Lower bound
The proposed measure attains its minimum value when each subset resulting from the partition contains records belonging to a single class. For a binary decision tree there will be two segments. In such a case the various parameters of the proposed measure take the following values:
$$C_{S_1} = 1,\quad C_{S_2} = 1,\quad C_R = 2,\quad |R| = |S_1| + |S_2|,$$

$$n_i(S_1) = \begin{cases}|S_1| & \text{if } i = 1,\\ 0 & \text{otherwise},\end{cases}\qquad n_i(S_2) = \begin{cases}|S_2| & \text{if } i = 2,\\ 0 & \text{otherwise},\end{cases}\qquad n_i(R) = \begin{cases}|S_1| & \text{if } i = 1,\\ |S_2| & \text{if } i = 2,\end{cases}$$

$$\mathrm{Measure} = \frac{1}{2}\cdot\frac{|S_1| + |S_2|}{|R|} = 0.5. \qquad (10)$$
For binary splitting, the minimum value the measure can take is therefore 0.5. It can further be shown that the minimum value is the same for an m-way split as well.
3.5.2. Upper bound
The proposed measure will have its maximum value when the number of distinct classes in any of the partitions S_i is equal to the number of distinct classes in the dataset R before the split. The upper bound of the split measure is discussed below:
$$\mathrm{Measure}(\text{split point } p) = \frac{|S_1|}{|R|}\cdot\frac{C_{S_1}}{C_R}\cdot\sum_{i=1}^{C_{S_1}}\frac{n_i(S_1)}{n_i(R)} + \frac{|S_2|}{|R|}\cdot\frac{C_{S_2}}{C_R}\cdot\sum_{i=1}^{C_{S_2}}\frac{n_i(S_2)}{n_i(R)}.$$
In the above equation, each of the ratios |S_1|/|R|, C_{S_1}/C_R, |S_2|/|R| and C_{S_2}/C_R lies in the range (0, 1]. The term Σ_{i=1}^{C_{S_1}} n_i(S_1)/n_i(R) attains its maximum value only when n_i(S_1) = n_i(R) for all i; in that case the term equals C_{S_1}, with |S_1| = |R|, C_{S_1} = C_R, |S_2| = 0 and C_{S_2} = 0. Thus,

$$\mathrm{Measure}(\text{split point } p) = 1 \cdot 1 \cdot C_R + 0 = C_R. \qquad (11)$$
But in any actual split |S_2| > 0 and C_{S_2} > 0, and hence Max(Measure) < C_R. This is also true for an m-way split.
To give a better understanding of the behavior of the proposed measure for binary classification, Fig. 1 plots the measure value against the probability of class 1.

Fig. 1. Plot of the measure value as the probability of class 1 is varied.
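As a quick numerical illustration of these bounds, the sketch below re-implements Eq. (5) compactly (our own code, not the paper's) and shows the measure approaching, but not reaching, C_R for a highly unbalanced split.

```python
# Small numeric check of the upper bound of the proposed measure (Eq. (5)).
from collections import Counter

def measure(s1, s2):
    n_R = Counter(s1) + Counter(s2)                  # n_i(R)
    size_R = sum(n_R.values())                       # |R|
    def term(s):
        cnt = Counter(s)
        return (len(s) / size_R) * (len(cnt) / len(n_R)) * sum(
            cnt[c] / n_R[c] for c in cnt)
    return term(s1) + term(s2)

labels = ['a', 'b', 'c'] * 100                       # 300 records, C_R = 3 classes
# A degenerate split that leaves a single record in S_2: the measure is close
# to, but strictly below, C_R = 3.
print(measure(labels[:-1], labels[-1:]))             # ~2.98
```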
4. Tree generation algorithm
In this algorithm, the decision tree is grown by repeatedly partitioning the training data. The algorithm is given below in pseudocode.
Buildtree(data S) {
  if (all points in S are in the same class) then return;
  construct presorted attribute lists attr[i][j] along with class information;
  // i = number of attributes and j = number of patterns
  for k = 1 to i
    evaluate_splits(attr[k][ ], measure[k], split_point[k])
  end
  split_attribute = attribute with minimum measure[k] over all k
  splitval = value of split_attribute where measure[ ] is minimum
  add a new node to the decision tree with (split_attribute, splitval)
  put all records with attr[split_attribute][ ] < splitval into L
  put all records with attr[split_attribute][ ] >= splitval into R
  Buildtree(L)
  Buildtree(R)
}
Evaluate_Splits(attribute attr[ ], measure m[ ], split_point sp[ ])
{
  splitpoint[ ] = means of successive distinct values in attr[ ]
  for k = 1 to sizeof(splitpoint[ ])
    S1 = dataset above splitpoint[k] for attribute attr
    C_S1 = number of distinct classes in S1
    S2 = dataset below splitpoint[k] for attribute attr
    C_S2 = number of distinct classes in S2
    m[k] = (|S1|/|R|) * (C_S1/C_R) * sum_{i=1..C_S1} n_i(S1)/n_i(R)
         + (|S2|/|R|) * (C_S2/C_R) * sum_{i=1..C_S2} n_i(S2)/n_i(R)   // Eq. (5)
    sp[k] = splitpoint[k]
  end
}
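To make the pseudocode concrete, the following is a compact, runnable Python sketch of binary tree construction with the proposed measure. It mirrors the structure of Buildtree and Evaluate_Splits, but the data structures, names and the tiny example dataset are our own assumptions, and only the class-purity stopping rule is implemented.

```python
# Runnable sketch of the tree generation algorithm using the proposed measure (Eq. (5)).
from collections import Counter

def split_measure(y_left, y_right, n_R, C_R):
    """Eq. (5) for a binary split; n_R holds the class counts of the node being split."""
    n = sum(n_R.values())
    def term(y):
        cnt = Counter(y)
        return (len(y) / n) * (len(cnt) / C_R) * sum(cnt[c] / n_R[c] for c in cnt)
    return term(y_left) + term(y_right)

def build_tree(X, y):
    """X: list of numeric feature vectors, y: list of class labels. Returns a nested dict."""
    if len(set(y)) == 1:                            # all points in the same class
        return {'leaf': y[0]}
    n_R, C_R = Counter(y), len(set(y))
    best = None                                     # (measure, attribute, split value)
    for a in range(len(X[0])):
        values = sorted(set(row[a] for row in X))
        for lo, hi in zip(values, values[1:]):      # midpoints of successive distinct values
            split = (lo + hi) / 2.0
            y_l = [c for row, c in zip(X, y) if row[a] < split]
            y_r = [c for row, c in zip(X, y) if row[a] >= split]
            m = split_measure(y_l, y_r, n_R, C_R)
            if best is None or m < best[0]:
                best = (m, a, split)
    if best is None:                                # identical records with mixed labels
        return {'leaf': Counter(y).most_common(1)[0][0]}
    _, a, split = best
    left = [(row, c) for row, c in zip(X, y) if row[a] < split]
    right = [(row, c) for row, c in zip(X, y) if row[a] >= split]
    return {'attr': a, 'split': split,
            'left': build_tree([r for r, _ in left], [c for _, c in left]),
            'right': build_tree([r for r, _ in right], [c for _, c in right])}

# Tiny example: attribute 0 separates the classes, attribute 1 is noise.
X = [[1.0, 7.0], [1.2, 3.0], [3.5, 6.0], [3.9, 2.0]]
y = ['a', 'a', 'b', 'b']
print(build_tree(X, y))   # splits on attribute 0 at 2.35
```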
The algorithm suggested for building the decision tree using the proposed measure results in a binary tree, which is also the case for the SLIQ algorithm (which uses the Gini Index). The C4.5 algorithm (which uses Gain Ratio) handles continuous and discrete attributes in a different manner: for a discrete valued attribute, multiway partitioning occurs at the splitting node, whereas for a continuous valued attribute it results in binary splits. The time required to classify new instances depends on the type of the tree. If h is the height and dmax is the maximum number of children of any node in a multiway decision tree, then the total search time complexity is O(h·dmax), whereas the search time complexity of a binary tree is O(h). The depth of the decision tree constructed using the proposed measure in the testing phase has thus been compared only with that of the Gini Index, whereas the classification accuracy is compared with both the Gini Index and Gain Ratio.
5. Results
The comparative performance of the proposed measure, the Gini Index and Gain Ratio was evaluated using 19 different datasets (refer to Table 2) from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).
The expected length traversed to reach a decision (i.e. a leaf node) and the classification accuracy are reported. Comparisons of the size and depth of the decision tree and of the time taken to construct it are also included for the curious reader, although the depth of the decision tree is not directly relevant in light of our earlier comments. All results reported subsequently are based on 10-fold cross-validation. The comparison of classification accuracy and the results of t-tests are displayed in Table 3. The last two columns in Table 3 show the mean difference in error between the proposed and the existing measures and its statistical significance. The notation used is e_d (p_value), where e_d denotes the average difference in errors between the proposed measure and Gain Ratio or the Gini Index, respectively, and p_value denotes the significance of this difference. A negative value of the mean difference e_d implies that the average error is lower for the proposed measure. If the p_value is less than α = 0.05 then the difference is statistically significant.
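The paper does not spell out the exact test procedure; one plausible reading, sketched below under that assumption, is a paired t-test over the per-fold error differences from 10-fold cross-validation, which is what scipy.stats.ttest_rel computes. The per-fold error values used here are hypothetical, purely for illustration.

```python
# Hedged sketch: how e_d (p_value) entries like those in Table 3 could be obtained,
# assuming a paired t-test over the per-fold errors of 10-fold cross-validation.
from scipy import stats

# Hypothetical per-fold error rates (%) for the proposed measure and Gain Ratio.
errors_proposed  = [35.2, 34.0, 36.1, 35.8, 33.9, 36.5, 34.7, 35.5, 36.0, 34.3]
errors_gainratio = [26.0, 27.1, 25.8, 26.5, 27.0, 25.5, 26.8, 26.2, 25.9, 27.2]

e_d = sum(p - g for p, g in zip(errors_proposed, errors_gainratio)) / len(errors_proposed)
t_stat, p_value = stats.ttest_rel(errors_proposed, errors_gainratio)
print(f"e_d = {e_d:.3f}, p_value = {p_value:.3f}")   # significant if p_value < 0.05
```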
From Table 3, it is observed that the average difference in error between the Gini Index and the proposed measure is not statistically significant for any of the datasets. However, the average difference in error between Gain Ratio and the proposed measure is statistically significant for Haberman, Wisconsin Breast Cancer, Ionosphere, Image and Credit. Gain Ratio gives less error for the Haberman and Credit datasets, whereas the proposed measure gives less error for the Wisconsin Breast Cancer, Ionosphere and Image datasets. Overall, the proposed measure gave the best classification accuracy for 7 of the 19 datasets and the second best for four more datasets.
Comparisons based on the expected length traversed to reach a decision (i.e. a leaf node) are shown in Fig. 2. It is clear that the expected length traversed to make a decision using the proposed measure is less than that obtained using the Gini Index.
Table 2
List of datasets used.
Dataset number Dataset name No. of attributes No. of classes
1 Haberman 3 2
2 Iris 4 3
3 Balanced Scale 4 3
4 Liver 6 2
5 Pima Indian Diabetes 8 2
6 Wisconsin BC 9 2
7 Echocardiogram 9 2
8 Wine 13 3
9 Zoo 16 7
10 Mushroom 21 2
11 Ionosphere 34 2
12 Lung Cancer 54 3
13 Image 19 7
14 Glass 9 6
15 Credit 14 2
16 Vehicle Silhouette 18 4
17 Voting 16 2
18 Heart Statlog 13 2
19 Lymphography 18 4
Table 3
Average classification accuracy using Gain Ratio, Gini Index and the proposed measure.

Dataset number   Gain Ratio (%)   Gini Index (%)   Proposed measure (%)   Proposed vs Gain Ratio e_d (p_value)   Proposed vs Gini Index e_d (p_value)
1 73.87 65.81 64.84 9.032 (0.014) 0.968 (0.752)
2 93.33 98.00 98.00 3.967 (0.139) 0.000 (1.000)
3 74.52 77.74 74.35 0.161 (0.916) 3.387 (0.051)
4 62.86 57.14 58.10 4.762 (0.409) 0.952 (0.662)
5 65.06 67.92 68.57 3.506 (0.193) 0.649 (0.736)
6 92.17 93.33 95.22 3.043 (0.019) 1.884 (0.064)
7 86.00 84.00 88.00 2.000 (0.555) 4.000 (0.168)
8 93.33 88.33 93.33 0.000 (1.000) 5.000 (0.121)
9 97.00 90.00 88.00 9.000 (0.171) 2.000 (0.443)
10 100.00 98.13 91.96 8.044 (0.109) 6.175 (0.237)
11 74.00 88.57 86.00 10.571 (0.037) 4.000 (0.122)
12 45.00 57.50 60.00 15.000 (0.140) 2.500 (0.758)
13 55.28 94.00 92.00 36.840 (0.000) 1.688 (0.104)
14 66.19 65.00 60.00 6.667 (0.083) 5.714 (0.260)
15 87.14 81.25 77.14 10 (0.02) 4.107 (0.067)
16 57.41 70.24 70.82 13.412 (1.000) 0.588 (0.677)
17 95.65 93.48 93.48 2.174 (0.236) 0.000 (0.500)
18 70.67 74.33 70.33 0.333 (0.459) 4.000 (0.119)
19 72.86 78.57 72.14 1.429 (0.427) 6.429 (0.061)
Fig. 2. Comparison of the expected length traversed to reach a decision (number of nodes traversed for decision making, by dataset; Gini Index vs. proposed measure).
Fig. 3 compares the size of the decision tree (number of nodes), while Fig. 4 compares the proposed measure in terms of the depth of the decision tree. While the number of nodes resulting from the use of the proposed measure is larger, the depth of the tree is significantly smaller. However, as stated earlier, the size of the decision tree is more meaningfully compared using the expected length traversed to reach a decision, and by that measure it is clear that the use of the proposed measure results in more compact decision trees.
Finally, Fig. 5 shows a comparison of the time taken to construct the decision tree. We observe that the time taken to construct the decision tree using the proposed measure is comparable to the time taken with the Gini Index.
Fig. 3. Comparison of the size (number of nodes) of the decision tree (by dataset; Gini Index vs. proposed measure).
Fig. 4. Comparison of the depth of the decision tree (by dataset; Gini Index vs. proposed measure).
Fig. 5. Comparison of the time (in seconds) taken for constructing the decision tree (by dataset; Gini Index vs. proposed measure).
5.1. Results after pruning
MDL pruning [17] was employed on decision trees built using all three measures. The accuracy of pruned decision trees constructed using the different node splitting measures is shown in Table 4. The classification accuracies are based on 10-fold cross-validation and reported as the average accuracy over those runs ± the standard deviation of the accuracy obtained in those runs. The pruned decision tree built using the proposed measure has better accuracy than the corresponding unpruned decision tree on 11 out of the 19 datasets. The size of the pruned decision tree constructed using the proposed measure is smaller on 13 out of the 19 datasets compared to those built using Gain Ratio and the Gini Index. The comparison of the size of the pruned decision trees built using the different measures is shown in Fig. 6.
6. Conclusions
A new node splitting measure has been proposed for decision tree construction. The proposed measure possesses the convexity and cumulative properties, which are important properties for any split measure. Results obtained on several datasets from the UCI repository indicate that the proposed measure results in decision trees that are more compact (in terms of the expected length traversed to reach a decision, i.e. a leaf node), without compromising accuracy. As mentioned in Section 1, node splitting measures represent one aspect of decision tree construction and need to be combined with other approaches that optimize the tree after its construction. Even after applying pruning, the accuracy and size of the decision tree constructed using the proposed measure are better compared to those obtained using Gain Ratio and the Gini Index.
Table 4
Average classification accuracy (± standard deviation) of pruned decision trees built using Gain Ratio, Gini Index and the proposed measure.

No.  Dataset name           Gain Ratio      Gini Index      Proposed measure
1    Haberman               72.9 ± 9.4      71.94 ± 9.26    73.55 ± 9.35
2    Iris                   98 ± 3.22       98 ± 3.22       98 ± 3.22
3    Balanced Scale         65.65 ± 12.36   67.42 ± 8.08    66.77 ± 14.23
4    Liver                  64.29 ± 11.93   55.71 ± 8.99    62.86 ± 12.66
5    Pima Indian Diabetes   72.08 ± 5.75    74.81 ± 7.92    69.74 ± 6.09
6    Wisconsin BC           94.49 ± 4.26    93.19 ± 4.21    95.07 ± 5.08
7    Echocardiogram         89 ± 11.01      84 ± 13.5       89 ± 9.94
8    Wine                   96.67 ± 3.88    88.33 ± 8.05    93.89 ± 4.86
9    Zoo                    91 ± 15.95      89 ± 14.49      82 ± 19.32
10   Mushroom               99.56 ± 1.4     98.13 ± 3.76    92.2 ± 13.37
11   Ionosphere             93.43 ± 4.68    88.29 ± 7.67    84.57 ± 10.36
12   Lung Cancer            40 ± 17.48      50 ± 23.57      52.5 ± 18.45
13   Image                  91.26 ± 10.15   93.85 ± 3.05    93.64 ± 3.43
14   Glass                  64.29 ± 16.69   63.81 ± 15.75   55.71 ± 14.38
15   Credit                 89.11 ± 14.35   87.14 ± 12.45   78.04 ± 12.32
16   Vehicle Silhouette     61.76 ± 13.17   69.53 ± 5.75    63.41 ± 3.06
17   Voting                 96.52 ± 3.43    96.52 ± 3.43    94.78 ± 6.74
18   Heart Statlog          72.67 ± 6.05    75.67 ± 4.98    69.67 ± 6.56
19   Lymphography           72.86 ± 13.8    66.43 ± 9.55    61.43 ± 10.21
Fig. 6. Comparison of the size of the pruned decision tree (by dataset; Gain Ratio, Gini Index and proposed measure).
References
[1] K. Alsabti, S. Ranka, V. Singh, CLOUDS: a decision tree classifier for large datasets, in: KDD, 1998, pp. 2–8.
[2] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Wadsworth International, 1984.
[3] L. Breiman, Some properties of splitting criteria, Machine Learning 24 (1996) 41–47.
[4] B. Chandra, S. Mazumdar, V. Arena, N. Parimi, Elegant decision tree algorithm, in: Proceedings of the Third International Conference on Information Systems Engineering (Workshops), IEEE CS, 2002, pp. 160–169.
[5] B. Chandra, P. Paul, On improving the efficiency of the SLIQ decision tree algorithm, in: IJCNN-2007, IEEE, 2007, pp. 66–71.
[6] C.W. Codrington, C.E. Brodley, On the Qualitative Behavior of Impurity Based Splitting Rules I: The Minima-Free Property, Technical Report 199705, Purdue University, 1997.
[7] W. Dianhong, J. Liangxiao, An improved attribute selection measure for decision tree induction, in: Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), vol. 4, 2007, pp. 654–658.
[8] Shi-fei Ding, Zhong-zhi Shi, Studies on incidence pattern recognition based on information entropy, Journal of Information Science 31 (6) (2005) 497–502.
[9] M. Dong, R. Kothari, Look-ahead based fuzzy decision tree induction, IEEE Transactions on Fuzzy Systems 9 (3) (2001) 461–468.
[10] T. Elomaa, J. Rousu, On the well-behavedness of important attribute evaluation functions, in: G. Grahne (Ed.), Proceedings of the Sixth Scandinavian Conference on Artificial Intelligence, IOS Press, Amsterdam, 1997.
[11] T. Elomaa, J. Rousu, General and efficient multisplitting of numerical attributes, Machine Learning (1999) 149.
[12] U.M. Fayyad, K.B. Irani, On the handling of continuous-valued attributes in decision tree generation, Machine Learning 8 (1992) 87–102.
[13] U.M. Fayyad, K.B. Irani, Multi-interval discretization of continuous valued attributes for classification learning, in: Proceedings of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Francisco, CA, 1993, pp. 1022–1027.
[14] B.H. Jun, C.S. Kim, H. Song, A new criterion in selection and discretization of attributes for the generation of decision trees, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (12) (1997) 1371–1375.
[15] R. Kothari, M. Dong, Decision trees for classification: a review and some new results, in: Lecture Notes in Pattern Recognition, World Scientific Publishing Company, Singapore, 2001.
[16] Y. Li, Y. Dong, R. Kothari, Classifiability based omnivariate decision trees, IEEE Transactions on Neural Networks 16 (6) (2005) 1547–1560.
[17] M. Mehta, R. Agrawal, J. Rissanen, SLIQ: a fast scalable classifier for data mining, in: Extending Database Technology, Avignon, France, 1996.
[18] T.M. Mitchell, Machine Learning, McGraw-Hill International, 1997.
[19] J. Mingers, Inducing rules for expert systems – statistical aspects, The Professional Statistician 5 (1986) 19–24.
[20] J. Mingers, An empirical comparison of selection measures for decision tree induction, Machine Learning 3 (1989) 319–342.
[21] S.K. Murthy, S. Salzberg, Lookahead and pathology in decision tree induction, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, 1995, pp. 1025–1031.
[22] J.R. Quinlan, Learning efficient classification procedures and their application to chess end games, in: R.S. Michalski, J.G. Carbonell, T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, 1983.
[23] J.R. Quinlan, Induction of decision trees, Machine Learning (1986) 81–106.
[24] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.
[25] J.R. Quinlan, Improved use of continuous attributes in C4.5, Journal of Artificial Intelligence Research 4 (1996) 77–90.
[26] R. Rastogi, K. Shim, PUBLIC: a decision tree classifier that integrates building and pruning, in: Proceedings of the 24th International Conference on Very Large Data Bases, 1998, pp. 404–415.
[27] S. Ruggieri, Efficient C4.5 [classification algorithm], IEEE Transactions on Knowledge and Data Engineering 14 (2) (2002) 438–444.
[28] S.R. Safavian, D. Landgrebe, A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man, and Cybernetics 21 (3) (1991) 660–674.
[29] U.K. Sarkar, P.P. Chakrabarti, S. Ghose, S.C. DeSarkar, Improving greedy algorithms by lookahead-search, Journal of Algorithms 16 (1) (1994) 1–23.
[30] J.C. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in: Proceedings of the International Conference on Very Large Databases, 1996.
[31] H. Wang, C. Zaniolo, CMP: a fast decision tree classifier using multivariate predictions, in: Proceedings of the 16th International Conference on Data Engineering, IEEE, 2000, pp. 449–460.
[32] Yasuhiko Morimoto, Algorithms for finding attribute value group for binary segmentation of categorical databases, IEEE Transactions on Knowledge and Data Engineering 14 (6) (2002) 1269–1279.
[33] Z. Yao, P. Liu, L. Lei, J. Yin, R_C4.5 decision tree model and its applications to health care dataset, in: Proceedings of the International Conference on Services Systems and Services Management (ICSSSM05), vol. 2, IEEE, 2005, pp. 1099–1103.
