
The C4.5 Algorithm
A LITERATURE REVIEW
Afnaria; Dame Iffa Saragih; Siswadi | Data Mining | October 7th, 2016

Decision Tree
(Mantas & Abellán, 2014) Decision trees (DTs), also known as Classification Trees or
hierarchical classifiers, started to play an important role in machine learning with the
publication of Quinlan's ID3 (Iterative Dichotomiser 3) (Quinlan, 1986). Subsequently,
Quinlan also presented the C4.5 algorithm (Classifier 4.5) (Quinlan, 1993), which is an
advanced version of ID3. Since then, C4.5 has been considered a standard model in
supervised classification. It has also been widely applied as a data analysis tool in very
different fields, such as astronomy, biology, and medicine.
Decision trees are models based on a recursive partition method, the aim of which
is to divide the data set using a single variable at each level. This variable is selected with a
given criterion. Ideally, each leaf defines a set of cases all of which belong to the same
class.
Their knowledge representation has a simple tree structure. It can be interpreted as
a compact set of rules in which each internal tree node is labeled with an attribute variable
that produces one branch for each of its values. The leaf nodes are labeled with a class label.
The process for inferring a decision tree is mainly determined by the following aspects (a minimal code sketch after this list shows where each one fits in):
a. The criteria used to select the attribute to insert in a node and branching (split
criteria).
b. The criteria to stop the tree from branching.
c. The method for assigning a class label or a probability distribution at the leaf nodes.
d. The post-pruning process used to simplify the tree structure.
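
To show where these four aspects fit, the following minimal Python sketch (our own illustrative code over a list of dictionaries, using plain information gain rather than C4.5's exact criteria) builds a tree recursively:

```python
# Minimal sketch of recursive decision-tree induction; illustrative code only,
# not taken from the reviewed papers.
from collections import Counter
import math

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def induce_tree(rows, attributes, target, min_cases=2):
    classes = {r[target] for r in rows}
    # (b) stopping criteria: pure node, no attributes left, or too few cases
    if len(classes) == 1 or not attributes or len(rows) < min_cases:
        # (c) the leaf is labeled with the majority class
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    # (a) split criterion: plain information gain here; C4.5 uses gain ratio
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    node = {}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        node[(best, value)] = induce_tree(subset, remaining, target, min_cases)
    # (d) a post-pruning pass would simplify the finished tree afterwards
    return node
```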
Many different approaches for inferring decision trees, which depend upon the
aforementioned factors, have been published. Quinlan's ID3 (Quinlan, 1986) and C4.5
(Quinlan, 1993) stand out among all of these. Decision trees are built using a set of data
referred to as the training data set. A different set, called the test data set, is used to check
the model. When we obtain a new sample or instance of the test data set, we can make a
decision or prediction on the state of the class variable by following the path in the tree
from the root to a leaf node, using the sample values and tree structure.

C4.5 Tree Inducer


The main ideas that were introduced in Quinlan (1993) are:
Split criteria: Information Gain (Quinlan, 1986) (see Eq. (3)) was first employed to select
the split attribute at each branching node. However, this measure is strongly affected by the
number of states of the split attribute: attributes with a higher number of states were usually
preferred. Quinlan introduced the Information Gain Ratio (IGR) criterion for this new tree
inducer, which penalizes variables with many states. This score normalizes the information
gain of an attribute X by its own entropy. The inducer selects the attribute with the highest
Info-Gain Ratio score among those whose Info-Gain score is higher than the average
Info-Gain score of the valid split attributes. The valid split attributes are those which are
numeric or whose number of values is smaller than thirty percent of the number of
instances in the branch.
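
As a rough illustration of this selection rule (our own minimal Python, not Quinlan's code; data is assumed to be a list of dictionaries whose class attribute is named by target), the gain ratio divides the information gain by the split information of the attribute, and the chosen attribute is the one with the highest ratio among those whose plain gain is at least the average gain:

```python
# Sketch of C4.5-style split-attribute selection: gain ratio = gain / split
# information, chosen among attributes whose gain is at least the average.
from collections import Counter
import math

def entropy(values):
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_and_ratio(rows, attr, target):
    total = len(rows)
    base = entropy([r[target] for r in rows])
    remainder, split_info = 0.0, 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        p = len(subset) / total
        remainder += p * entropy([r[target] for r in subset])
        split_info -= p * math.log2(p)
    gain = base - remainder
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

def select_split_attribute(rows, attributes, target):
    scores = {a: gain_and_ratio(rows, a, target) for a in attributes}
    avg_gain = sum(g for g, _ in scores.values()) / len(scores)
    candidates = [a for a, (g, _) in scores.items() if g >= avg_gain]
    return max(candidates, key=lambda a: scores[a][1]) if candidates else None
```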


Stopping criteria: The branching of the decision tree is stopped when no attribute has a
positive Info-Gain Ratio score or when a branch reaches the minimum number of instances
per leaf, which is usually set to 2. In addition, because of the aforementioned condition on
valid split attributes in the split criteria, the branching of a decision tree is also stopped
when there is no valid split attribute.
Handling numeric attributes: This tree inducer handles numeric attributes with a very
simple approach. Only binary splits are considered and each possible split point is
evaluated. Finally, the point that induces the partition of the samples with the highest
Information Gain based split score is selected.
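
A rough sketch of that binary split search (illustrative code of ours; candidate thresholds are taken as midpoints between consecutive distinct values) might look as follows:

```python
# Sketch of numeric-attribute handling: evaluate every candidate binary split
# point and keep the one with the highest information gain. Illustrative only.
from collections import Counter
import math

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain
```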
Dealing with missing values: It is assumed that missing values are randomly distributed
(Missing at Random hypothesis). In order to compute the scores, the instances are split
into pieces. The initial weight of an instance is equal to one, but when it goes down a
branch it receives a weight equal to the proportion of instances that belong to this branch
(the weights sum to 1). Information Gain based scores can work with these fractional
instances by using sums of weights instead of sums of counts. When making predictions,
C4.5 marginalizes the missing variable by merging the predictions of all the branches that
are consistent with the instance (there are several such branches because of the missing
value) using their previously computed weights.
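
The following sketch (our own simplified tree representation, not C4.5's data structures) illustrates how a prediction for an instance with a missing split value merges the class distributions of all branches using the stored branch weights:

```python
# Sketch of prediction with a missing split value: the class distributions of
# all consistent branches are merged, weighted by the fraction of training
# cases that went down each branch. Illustrative node format of our own.

def predict_distribution(node, instance):
    if node["kind"] == "leaf":            # leaves carry a class distribution
        return dict(node["dist"])
    attr = node["attribute"]
    value = instance.get(attr)
    if value is not None:
        return predict_distribution(node["children"][value], instance)
    # Missing value: merge all branches, weighted by their training fractions
    merged = {}
    for branch_value, child in node["children"].items():
        weight = node["branch_weight"][branch_value]
        for cls, p in predict_distribution(child, instance).items():
            merged[cls] = merged.get(cls, 0.0) + weight * p
    return merged

toy = {"kind": "node", "attribute": "A",
       "branch_weight": {1: 0.6, 0: 0.4},
       "children": {1: {"kind": "leaf", "dist": {"yes": 0.9, "no": 0.1}},
                    0: {"kind": "leaf", "dist": {"yes": 0.2, "no": 0.8}}}}
print(predict_distribution(toy, {"A": None}))  # approx {'yes': 0.62, 'no': 0.38}
```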
Post-pruning process: Although there are many different proposals for carrying out the
post-pruning of a decision tree (see Rokach & Maimon (2010)), the technique employed
by C4.5 is called Pessimistic Error Pruning. This method computes an upper bound of the
estimated error rate of a given subtree by employing a continuity correction of the Binomial
distribution. When the upper bound of the errors of a subtree hanging from a given node is
greater than the upper bound of the errors produced when this node is treated as a leaf,
the subtree is pruned.
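
As a rough numerical sketch of this comparison (a simplification of ours: the continuity correction adds 1/2 per leaf, and the standard-error term used by some pessimistic-pruning variants is omitted; C4.5's actual error-based pruning differs in its confidence-limit details):

```python
# Simplified pessimistic-pruning comparison; our own sketch, not C4.5's code.

def corrected_errors(misclassified, num_leaves):
    # Continuity-corrected error estimate: 1/2 is added per leaf
    return misclassified + 0.5 * num_leaves

def should_prune(subtree_errors, subtree_leaves, node_as_leaf_errors):
    # Prune when the node treated as a single leaf is estimated to be no
    # worse than the whole subtree hanging from it.
    return corrected_errors(node_as_leaf_errors, 1) <= corrected_errors(
        subtree_errors, subtree_leaves)

# A subtree with 5 leaves and 3 training errors vs. collapsing it to one leaf
# that would make 4 errors: 4.5 <= 5.5, so the subtree is pruned.
print(should_prune(subtree_errors=3, subtree_leaves=5, node_as_leaf_errors=4))  # True
```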

C4.5 Algorithm
(Mazid, Ali, & Tickle, 2010) C4.5 is a popular decision tree based algorithm to solve
data mining task. Professor Ross Quinlan from University of Sydney has developed C4.5 in
1993. Basically it is the advance version of ID3 algorithm, which is also proposed by Ross
Quinlan in 1986. C4.5 has additional features such as handling missing values,
categorization of continuous attributes, pruning of decision trees, rule derivation and
others. C4.5 constructs a very big tree by considering all attribute values and finalizes the
decision rule by pruning. It uses a heuristic approach for pruning based on the statistical
significance of splits. Basic construction of C4.5 decision tree is
1. The root node is the top node of the tree. It considers all samples and selects the
attributes that are most significant.
2. The sample information is passed to subsequent nodes, called branch nodes, which
eventually terminate in leaf nodes that give decisions.
3. Rules are generated by following the path from the root node to a leaf node, as sketched
in the code below.
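
A small sketch of that rule generation step (the nested-dictionary tree is our own illustrative format, matching the earlier induction sketch, not C4.5's representation): every root-to-leaf path becomes one IF-THEN rule.

```python
# Sketch of rule extraction: each root-to-leaf path becomes one rule.

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):          # leaf: node is a class label
        conds = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
        return [f"IF {conds} THEN class = {node}"]
    rules = []
    for (attr, value), child in node.items():
        rules.extend(extract_rules(child, conditions + ((attr, value),)))
    return rules

toy_tree = {("outlook", "sunny"): {("humidity", "high"): "no",
                                   ("humidity", "normal"): "yes"},
            ("outlook", "rain"): "yes"}
for rule in extract_rules(toy_tree):
    print(rule)
```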
Dealing with huge data in a computationally efficient way is one of the major challenges for
C4.5 users. Most of the time, it is very difficult to handle the data file when its
dimensionality expands enormously during the rule generation process. As C4.5 uses a
decision tree, it also needs to consider other issues such as the depth of the decision tree,
the handling of continuous attributes, the selection measure used to adopt significant
attributes, the treatment of missing values, etc. The following section illustrates some
features of the C4.5 algorithm.
There are several features of C4.5, as follows.

1. Continuous Attributes Categorization


Earlier versions of decision tree algorithms were unable to deal with continuous
attributes. One of the preconditions for decision trees was that attributes must have
categorical values; another was that the decision nodes of the tree must be categorical
as well. The C4.5 decision tree eliminates this problem by partitioning the continuous
attribute values into a discrete set of intervals, which is widely known as discretization.
For instance, if a continuous attribute C needs to be processed by the C4.5 algorithm,
the algorithm creates a new Boolean attribute that is true if C < b and false otherwise,
and then picks a best suitable threshold b, as illustrated below.
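
A tiny illustration of this discretization step (the threshold and values are hypothetical; in practice b would come from the split-score search described earlier):

```python
# Replace a continuous attribute C by a Boolean attribute that is true when
# C < b, for an illustrative threshold b.
b = 37.5                              # hypothetical threshold
c_values = [36.1, 38.4, 37.0, 39.2]   # hypothetical continuous values
boolean_attribute = [c < b for c in c_values]
print(boolean_attribute)              # [True, False, True, False]
```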

2. Handling Missing Values


Dealing with missing attribute values is another feature of the C4.5 algorithm.
There are several ways to handle missing attributes; some of these are Case
Substitution, Mean Substitution, Hot Deck Imputation, Cold Deck Imputation, and
Nearest Neighbour Imputation [6]. However, C4.5 uses probability values for missing
values rather than assigning the most common existing value of that attribute. These
probability values are calculated from the observed frequencies for that attribute. For
example, let A be a Boolean attribute. If this attribute has six values with A=1 and four
with A=0, then, in accordance with probability theory, the probability of A=1 is 0.6 and
the probability of A=0 is 0.4. At this point, an instance with A missing is divided into
two fractions: the 0.6 fraction of the instance is distributed down the branch for A=1
and the remaining 0.4 fraction is distributed down the other branch of the tree, as in
the worked sketch below. As C4.5 splits the dataset into training and testing sets, the
above method is applied to both of them. In a sentence, we can say that C4.5 uses the
most probable classification, which is computed by summing the weights of the
attribute value frequencies.
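
A worked version of the example above (illustrative code of ours): the observed frequencies give the branch probabilities, and an instance with A missing is split into fractional instances carrying those weights.

```python
# Branch probabilities from observed frequencies, and fractional instances
# for a case whose value of A is missing. Illustrative only.
from collections import Counter

observed = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # six cases with A=1, four with A=0
counts = Counter(observed)
total = len(observed)
branch_weight = {value: count / total for value, count in counts.items()}
print(branch_weight)                         # {1: 0.6, 0: 0.4}

instance_weight = 1.0                        # the instance starts with weight 1
fractions = {value: instance_weight * w for value, w in branch_weight.items()}
print(fractions)                             # {1: 0.6, 0: 0.4}
```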
(Lakshmi, Indumathi, & Ravi, 2016) The C4.5 decision tree classification algorithm was
developed by Ross Quinlan as an extension of the ID3 algorithm, which he also developed.
These classifiers construct a decision tree as a learning model from the data samples. A
divide and conquer approach is followed for the construction of decision tree models, using
a measure called information gain to select the attribute from the dataset for the tree.
Consider a possible test with n outcomes that divides the data set L of training samples into
subsets {L1, L2, L3, ..., Ln}. The distribution of classes in L and its subsets Li is the only
information available for tree construction. Taking P to be any set of samples, freq(Ci, P) is
the number of samples in P belonging to class Ci and |P| denotes the number of samples in
P. The entropy of the set P is given by
Info(P) = -\sum_{i=1}^{k} \frac{freq(C_i, P)}{|P|} \log_2\left(\frac{freq(C_i, P)}{|P|}\right)   (1)

where k is the number of classes.


The information content of L can be measured by computing Info(L). The total information
content of L once it has been divided with respect to the outcomes of a selected attribute,
say z, is given by the weighted sum of the entropies of its subsets:

Info_z(L) = \sum_{i=1}^{n} \frac{|L_i|}{|L|} \, Info(L_i)   (2)

The gain is given by

Gain(z) = Info(L) - Info_z(L)   (3)

This is the information gained by dividing L with respect to the test on z, and the attribute z
with the highest information gain is selected (a worked example is given below). Three base
cases are considered by the C4.5 algorithm: (1) if all samples in the dataset belong to the
same class, a leaf node is created for the decision tree choosing that class; (2) if no
information gain is provided by any feature/attribute, a decision node is created higher up
the tree with the expected value of the class; (3) if an instance of a previously unseen class
is encountered, a decision node is created higher up the tree with the expected value of the
class.
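
As a worked illustration of Eqs. (1)-(3) (toy numbers of ours, not data from the cited paper), consider 14 samples with class distribution [9, 5] divided by a three-valued test z into subsets with class distributions [2, 3], [4, 0], and [3, 2]:

```python
# Worked example of Eqs. (1)-(3) on a toy data set (illustrative numbers).
import math

def info(class_counts):
    n = sum(class_counts)
    return -sum(c / n * math.log2(c / n) for c in class_counts if c > 0)

L = [9, 5]                          # class distribution of the whole set L
subsets = [[2, 3], [4, 0], [3, 2]]  # class distributions of L1, L2, L3

info_L = info(L)                                             # Eq. (1)
info_z = sum(sum(s) / sum(L) * info(s) for s in subsets)     # Eq. (2)
gain_z = info_L - info_z                                     # Eq. (3)
print(round(info_L, 3), round(info_z, 3), round(gain_z, 3))  # 0.94 0.694 0.247
```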

Limitations of C4.5 Algorithm


Some limitations of C4.5 are as follows.
1. Empty branches
Constructing a tree with meaningful values is one of the crucial steps for rule generation
by the C4.5 algorithm. In our experiment, we found many nodes with zero values or
values close to zero. These values neither contribute to generating rules nor help to
construct any class for the classification task; rather, they make the tree bigger and more
complex.
2. Insignificant branches
The number of selected discrete attributes creates an equal number of potential branches
to build a decision tree, but not all of them are significant for the classification task. These
insignificant branches not only reduce the usability of decision trees but also bring on
the problem of overfitting.
3. Overfitting
Overfitting happens when the algorithm's model picks up data with uncommon
characteristics. This causes many fragmentations in the process distribution;
statistically insignificant nodes with very few samples are known as fragmentations.
Generally, the C4.5 algorithm constructs trees and grows their branches just deep enough to
perfectly classify the training examples. This strategy performs well with noise-free
data, but most of the time it overfits the training examples when the data are noisy.
Currently, two approaches are widely used to avoid this overfitting in decision tree
learning. These are:


1. If the tree grows very large, stop it before it reaches the point of perfectly classifying
the training data.
2. Allow the tree to overfit the training data and then post-prune the tree.

References
Lakshmi, B. N., Indumathi, T. S., & Ravi, N. (2016). A Study on C.5 Decision Tree
Classification Algorithm for Risk Predictions During Pregnancy. Procedia Technology,
24, 1542-1549. http://doi.org/10.1016/j.protcy.2016.05.128
Mantas, C. J., & Abellán, J. (2014). Credal-C4.5: Decision tree based on imprecise
probabilities to classify noisy data. Expert Systems with Applications, 41(10),
4625-4637. http://doi.org/10.1016/j.eswa.2014.01.017
Mazid, M., Ali, S., & Tickle, K. (2010). Improved C4.5 algorithm for rule based
classification. Proceedings of the 9th WSEAS International Conference on Artificial
Intelligence, Knowledge Engineering and Data Bases, 296-301. Retrieved from
http://www.researchgate.net/publication/228579114_Improved_C_4._5_Algorithm_for_Rule_Based_Classification/file/3deec520b1a84f41f8.pdf

