C4.5 Algorithm
A LITERATURE REVIEW
Afnaria; Dame Iffa Saragih; Siswadi | Data Mining | October 7th, 2016
Decision Tree
(Mantas & Abellán, 2014) Decision trees (DTs), also known as Classification Trees or
hierarchical classifiers, started to play an important role in machine learning with the
publication of Quinlan's ID3 (Iterative Dichotomiser 3) (Quinlan, 1986). Subsequently,
Quinlan also presented the C4.5 algorithm (Classifier 4.5) (Quinlan, 1993), which is an
advanced version of ID3. Since then, C4.5 has been considered a standard model in
supervised classification. It has also been widely applied as a data analysis tool in very
different fields, such as astronomy, biology, and medicine.
Decision trees are models based on a recursive partitioning method, the aim of which
is to divide the data set using a single variable at each level. This variable is selected with a
given criterion. Ideally, the resulting partitions define sets of cases in which all the cases
belong to the same class.
Their knowledge representation has a simple tree structure. It can be interpreted as
a compact set of rules in which each tree node is labeled with an attribute variable that
produces branches for each value. The leaf nodes are labeled with a class label.
The process for inferring a decision tree is mainly determined by the following aspects:
a. The criteria used to select the attribute to insert in a node and branching (split
criteria).
b. The criteria to stop the tree from branching.
c. The method for assigning a class label or a probability distribution at the leaf nodes.
d. The post-pruning process used to simplify the tree structure.
Many different approaches for inferring decision trees, which depend upon the
aforementioned factors, have been published. Quinlan's ID3 (Quinlan, 1986) and C4.5
(Quinlan, 1993) stand out among all of these. Decision trees are built using a set of data
referred to as the training data set. A different set, called the test data set, is used to check
the model. When we obtain a new sample or instance of the test data set, we can make a
decision or prediction on the state of the class variable by following the path in the tree
from the root to a leaf node, using the sample values and tree structure.
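The classification step described above can be sketched in a few lines. This is an illustrative model only: the nested-dict tree representation and the attribute names are assumptions for the example, not part of C4.5 itself.

```python
# Minimal sketch of classifying a new instance with a learned decision tree:
# follow the branch matching each attribute value until a leaf is reached.

def predict(tree, instance):
    """Walk from the root to a leaf and return the leaf's class label."""
    while isinstance(tree, dict):          # internal node: {'attr': ..., 'branches': {...}}
        value = instance[tree["attr"]]     # read the attribute tested at this node
        tree = tree["branches"][value]     # descend along the matching branch
    return tree                            # a leaf is just a class label

# Toy tree: test 'outlook' first, then 'windy' on the 'rain' branch.
toy_tree = {
    "attr": "outlook",
    "branches": {
        "sunny": "no",
        "overcast": "yes",
        "rain": {"attr": "windy", "branches": {True: "no", False: "yes"}},
    },
}

print(predict(toy_tree, {"outlook": "overcast"}))             # yes
print(predict(toy_tree, {"outlook": "rain", "windy": True}))  # no
```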
Stopping criteria: The branching of the decision tree is stopped when there is no attribute
with a positive Info-Gain Ratio score, or when the number of instances per leaf falls below
a minimum, which is usually set to 2. In addition, under the aforementioned condition on
valid split attributes in the split criteria, the branching of a decision tree is also stopped
when there is no valid split attribute.
Handling numeric attributes: This tree inducer handles numeric attributes with a
very simple approach. Within this method, only binary split attributes are considered and
each possible split point is evaluated. Finally, the point that induces the partition of the
samples with the highest Information Gain based split score is selected.
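This threshold search can be illustrated concretely. In the sketch below (a simplification, with information gain rather than the gain ratio, and helper names that are assumptions), every midpoint between consecutive distinct sorted values is a candidate threshold, and the one with the highest gain wins:

```python
# Evaluate every candidate binary split "value <= t" on a numeric attribute
# and keep the threshold with the highest information gain.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a class-label multiset."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Return (threshold, gain) of the best binary split value <= threshold."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2     # midpoint candidate threshold
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        # weighted entropy of the two partitions induced by t
        rem = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if base - rem > best_gain:
            best_t, best_gain = t, base - rem
    return best_t, best_gain

t, g = best_numeric_split([64, 65, 68, 69, 70, 71],
                          ["yes", "no", "yes", "yes", "yes", "no"])
print(t)   # 70.5
```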
Dealing with missing values: It is assumed that missing values are randomly distributed
(Missing at Random hypothesis). In order to compute the scores, the instances are split
into pieces. The initial weight of an instance is equal to one, but when it goes down a
branch it receives a weight equal to the proportion of instances that belong to this branch
(weights sum to 1). Information Gain based scores can work with these fractional instances
by using sums of weights instead of sums of counts. When making predictions, C4.5
marginalizes out the missing variable by merging the predictions of all the branches that
are consistent with the instance (there are several such branches because of the missing
value), using their previously computed weights.
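The fractional-instance idea can be sketched as follows. This is an illustrative partition step only (the function and data layout are assumptions): an instance whose split attribute is missing is sent down every branch, its weight multiplied by each branch's share of the known instances.

```python
# Partition weighted instances on an attribute; instances with a missing value
# (None) are distributed across all branches in proportion to the known weights.

def split_with_missing(instances, attr):
    """instances: list of (attribute_dict, weight). Returns {value: [(x, w), ...]}."""
    known = [(x, w) for x, w in instances if x[attr] is not None]
    missing = [(x, w) for x, w in instances if x[attr] is None]
    total = sum(w for _, w in known)
    branches = {}
    for x, w in known:
        branches.setdefault(x[attr], []).append((x, w))
    for value, members in branches.items():
        share = sum(w for _, w in members) / total   # this branch's proportion
        for x, w in missing:
            members.append((x, w * share))           # fractional copy of the instance
    return branches

data = [({"wind": "strong"}, 1.0), ({"wind": "strong"}, 1.0),
        ({"wind": "weak"}, 1.0), ({"wind": None}, 1.0)]
b = split_with_missing(data, "wind")
# the missing instance contributes weight 2/3 to 'strong' and 1/3 to 'weak'
```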
Post-pruning process: Although there are many different proposals for carrying out a
post-pruning process on a decision tree (see Rokach & Maimon (2010)), the technique
employed by C4.5 is called Pessimistic Error Pruning. This method computes an upper
bound of the estimated error rate of a given subtree employing a continuity correction of
the Binomial distribution. When the upper bound for a subtree hanging from a given node
is greater than the upper bound of the errors produced by the estimations of this node
supposing it acts as a leaf, then this subtree is pruned.
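One common formulation of this pruning test can be sketched as follows. This is an assumption-laden illustration, not C4.5's exact implementation: the bound here is the continuity-corrected error count plus one Binomial standard error, with a 1/2 correction per leaf.

```python
# Illustrative Pessimistic Error Pruning test: compare the continuity-corrected
# error upper bound of a subtree against that of the node collapsed to a leaf.
from math import sqrt

def error_upper_bound(errors, n, leaves=1):
    """Corrected error count (+1/2 per leaf) plus one Binomial standard error."""
    corrected = errors + leaves / 2.0
    se = sqrt(corrected * (n - corrected) / n)
    return corrected + se

def should_prune(subtree_errors, subtree_leaves, leaf_errors, n):
    """Prune when the subtree's bound exceeds the bound of the node as a leaf."""
    return error_upper_bound(subtree_errors, n, subtree_leaves) > \
           error_upper_bound(leaf_errors, n, 1)

# A 10-leaf subtree with 1 training error is pruned in favour of a leaf
# making 2 errors, because the per-leaf correction dominates.
print(should_prune(1, 10, 2, 100))   # True
```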
C4.5 Algorithm
(Mazid, Ali, & Tickle, 2010) C4.5 is a popular decision tree based algorithm for solving
data mining tasks. Professor Ross Quinlan of the University of Sydney developed C4.5 in
1993. Basically, it is the advanced version of the ID3 algorithm, which was also proposed
by Ross Quinlan, in 1986. C4.5 has additional features such as handling missing values,
categorization of continuous attributes, pruning of decision trees, rule derivation, and
others. C4.5 constructs a very big tree by considering all attribute values and finalizes the
decision rule by pruning. It uses a heuristic approach to pruning based on the statistical
significance of splits. The basic construction of a C4.5 decision tree is as follows:
1. The root node is the top node of the tree. It considers all samples and selects the
attributes that are most significant.
2. The sample information is passed to subsequent nodes, called branch nodes, which
eventually terminate in leaf nodes that give decisions.
3. Rules are generated by tracing the path from the root node to each leaf node.
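Step 3 can be sketched directly: each root-to-leaf path becomes one IF-THEN rule. The nested-dict tree format below is an illustrative assumption, not C4.5's own data structure.

```python
# Enumerate all root-to-leaf paths of a decision tree as (conditions, label)
# rules, where conditions is a list of (attribute, value) tests.

def extract_rules(tree, conditions=()):
    """Yield one (conditions, class_label) rule per root-to-leaf path."""
    if not isinstance(tree, dict):                  # leaf node: emit the rule
        yield list(conditions), tree
        return
    for value, subtree in tree["branches"].items(): # recurse into each branch
        yield from extract_rules(subtree, conditions + ((tree["attr"], value),))

toy_tree = {"attr": "outlook",
            "branches": {"sunny": "no", "overcast": "yes"}}
rules = list(extract_rules(toy_tree))
# e.g. IF outlook = sunny THEN no; IF outlook = overcast THEN yes
```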
Dealing with huge data with computational efficiency is one of the major challenges for
C4.5 users. Most of the time, it is very difficult to handle the data file when its
dimensionality expands enormously during rule generation. As C4.5 uses a decision tree,
it also needs to consider other issues such as the depth of the decision tree, the handling
of continuous attributes, the selection measure used to adopt significant attributes, the
treatment of missing values, etc. The following section illustrates some features of the
C4.5 algorithm.
There are several features of C4.5, as follows.
1. Information gain. For a set of samples P with classes C_1, ..., C_n, let freq(C_i, P) be
the number of samples in P that belong to class C_i. The entropy (information content)
of P is

Info(P) = -\sum_{i=1}^{n} \frac{freq(C_i, P)}{|P|} \log_2\left(\frac{freq(C_i, P)}{|P|}\right)   (1)
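Equation (1) can be computed directly from the class counts; a minimal sketch:

```python
# Equation (1): freq(C_i, P) is obtained by counting class labels,
# and |P| is the number of samples.
from collections import Counter
from math import log2

def info(labels):
    """Entropy of the class distribution in a sample set P."""
    n = len(labels)                      # |P|
    freq = Counter(labels)               # freq(C_i, P) for each class C_i
    return -sum(f / n * log2(f / n) for f in freq.values())

# A balanced two-class set carries exactly 1 bit of information.
print(info(["yes", "no", "yes", "no"]))   # 1.0
```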
The information content of a set L can be measured by computing Info(L) as in equation
(1). Once L is divided with respect to the outcomes of a selected attribute, say z, the total
information content of L can be computed as the weighted sum of the entropies of the
subsets:
Info_z(L) = \sum_{i=1}^{k} \frac{|L_i|}{|L|} \, Info(L_i)   (2)

where L_1, ..., L_k are the subsets of L induced by the k outcomes of z, and

Gain(L, z) = Info(L) - Info_z(L)   (3)
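Equations (1)-(3) together can be sketched as a self-contained gain computation (the data layout, a list of attribute dicts plus a label list, is an assumption for the example):

```python
# Equations (1)-(3): entropy of a labelled set, the weighted entropy after
# splitting on attribute z, and the resulting information gain.
from collections import Counter
from math import log2

def info(labels):
    """Equation (1): entropy of the class distribution."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_z(rows, labels, z):
    """Equation (2): weighted sum of subset entropies after splitting on z."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[z], []).append(label)
    n = len(labels)
    return sum(len(ls) / n * info(ls) for ls in subsets.values())

def gain(rows, labels, z):
    """Equation (3): information gained by the test on attribute z."""
    return info(labels) - info_z(rows, labels, z)

rows = [{"windy": False}, {"windy": False}, {"windy": True}, {"windy": True}]
labels = ["yes", "yes", "yes", "no"]
print(gain(rows, labels, "windy"))   # about 0.311
```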
Equation (3) gives the information gained by dividing L with respect to the test on z; the
attribute z with the highest information gain is selected for the split. Three base cases
are considered by the C4.5 algorithm: (1) if all samples in the dataset belong to the same
class, a leaf node is created for the decision tree, choosing that class; (2) if no
feature/attribute provides any information gain, a decision node is created higher up the
tree using the expected value of the class; (3) if an instance of a previously unseen class
is encountered, a decision node is created higher up the tree using the expected value of
the class.
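The base cases sit at the top of the recursive construction, as in the skeleton below. This is an illustrative sketch, not Quinlan's code: only the base cases are implemented, the recursive split step is elided, and the helper names are assumptions.

```python
# Skeleton of the recursive tree builder showing where the base cases apply.
from collections import Counter

def majority_class(labels):
    """Expected value of the class: the most frequent label."""
    return Counter(labels).most_common(1)[0][0]

def build(labels, attributes):
    if len(set(labels)) == 1:           # base case 1: all samples in one class
        return labels[0]                # -> leaf with that class
    if not attributes:                  # base cases 2/3: nothing left to split on
        return majority_class(labels)   # -> leaf with the expected (majority) class
    raise NotImplementedError("recursive split elided in this sketch")

print(build(["yes", "yes", "yes"], ["windy"]))   # yes (base case 1)
print(build(["yes", "no", "yes"], []))           # yes (majority class)
```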
Two strategies address overfitting: (a) if the tree grows very large, stop it before it
reaches the point of perfectly classifying the training data; or (b) allow the tree to
over-fit the training data and then post-prune the tree.
References
Lakshmi, B. N., Indumathi, T. S., & Ravi, N. (2016). A Study on C.5 Decision Tree
Classification Algorithm for Risk Predictions During Pregnancy. Procedia Technology,
24, 1542–1549. http://doi.org/10.1016/j.protcy.2016.05.128
Mantas, C. J., & Abellán, J. (2014). Credal-C4.5: Decision tree based on imprecise
probabilities to classify noisy data. Expert Systems with Applications, 41(10),
4625–4637. http://doi.org/10.1016/j.eswa.2014.01.017
Mazid, M., Ali, S., & Tickle, K. (2010). Improved C4.5 algorithm for rule based
classification. Proceedings of the 9th WSEAS International Conference on Artificial
Intelligence, Knowledge Engineering and Data Bases, 296–301. Retrieved from
http://www.researchgate.net/publication/228579114_Improved_C_4._5_Algorithm_for_Rule_Based_Classification/file/3deec520b1a84f41f8.pdf