
Constructing Decision Trees

A Decision Tree Example


The weather data example.

ID code  Outlook   Temperature  Humidity  Windy  Play
a        Sunny     Hot          High      False  No
b        Sunny     Hot          High      True   No
c        Overcast  Hot          High      False  Yes
d        Rainy     Mild         High      False  Yes
e        Rainy     Cool         Normal    False  Yes
f        Rainy     Cool         Normal    True   No
g        Overcast  Cool         Normal    True   Yes
h        Sunny     Mild         High      False  No
i        Sunny     Cool         Normal    False  Yes
j        Rainy     Mild         Normal    False  Yes
k        Sunny     Mild         Normal    True   Yes
l        Overcast  Mild         High      True   Yes
m        Overcast  Hot          Normal    False  Yes
n        Rainy     Mild         High      True   No
[Figure: Decision tree for the weather data. Outlook is at the root with branches sunny, overcast, and rainy; the sunny branch tests humidity (high -> no, normal -> yes), the overcast branch predicts yes, and the rainy branch tests windy (false -> yes, true -> no).]
The Process of Constructing a Decision Tree
Select an attribute to place at the root of the decision tree and make one branch for every possible value.
Repeat the process recursively for each branch.
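A minimal sketch of this procedure in Python is shown below. It assumes the data set is given as a list of dictionaries with a designated class attribute, and it selects attributes with the information-gain criterion developed on the following slides; the function names are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    before = entropy([r[target] for r in rows])
    after = 0.0
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def build_tree(rows, attributes, target):
    """Place the best attribute at this node, then recurse on every branch."""
    labels = [r[target] for r in rows]
    # Stop when the node is pure or no attributes remain; predict the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute with the largest information gain at this node.
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # One branch for every possible value of the chosen attribute.
    return {best: {value: build_tree([r for r in rows if r[best] == value],
                                     remaining, target)
                   for value in set(r[best] for r in rows)}}
```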
Which Attribute Should Be Placed at a Certain Node
One common approach is based on the
information gained by placing a certain
attribute at this node.
Information Gained by Knowing the Result of a Decision
In the weather data example, there are 9 instances for which the decision to play is yes and 5 instances for which the decision to play is no. Then, the information gained by knowing the result of the decision is
$$-\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits.}$$
The General Form for Calculating the Information Gain
Entropy of a decision:
$$-P_1 \log_2 P_1 - P_2 \log_2 P_2 - \cdots - P_n \log_2 P_n$$
where $P_1, P_2, \ldots, P_n$ are the probabilities of the $n$ possible outcomes.
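As a quick check, a short Python version of this formula reproduces the 0.940 bits computed above:

```python
from math import log2

def entropy(probabilities):
    """-sum(P_i * log2(P_i)) in bits, ignoring zero probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(f"{entropy([9/14, 5/14]):.3f}")   # 0.940
```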

Information Further Required If Outlook Is Placed at the Root

[Figure: The Outlook split. The sunny branch contains 2 yes and 3 no instances, the overcast branch contains 4 yes instances, and the rainy branch contains 3 yes and 2 no instances.]
$$\text{Information further required} = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) = 0.693 \text{ bits.}$$
Information Gained by Placing Each of the 4 Attributes
Gain(outlook) = 0.940 bits - 0.693 bits = 0.247 bits.
Gain(temperature) = 0.029 bits.
Gain(humidity) = 0.152 bits.
Gain(windy) = 0.048 bits.
The Strategy for Selecting an Attribute to Place at a Node
Select the attribute that gives us the largest
information gain.
In this example, it is the attribute Outlook.
[Figure: Outlook placed at the root: the sunny branch has 2 yes and 3 no, the overcast branch has 4 yes, and the rainy branch has 3 yes and 2 no.]
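The four gains above can be reproduced from the weather table with a short script; this sketch assumes the instances are encoded as (Outlook, Temperature, Humidity, Windy, Play) tuples.

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Windy, Play) for instances a-n.
data = [
    ("Sunny", "Hot", "High", False, "No"),      ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),  ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),  ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),  ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),   ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),("Rainy", "Mild", "High", True, "No"),
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    split = 0.0
    for v in set(r[col] for r in rows):
        subset = [r[-1] for r in rows if r[col] == v]
        split += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - split

for name, col in ATTRS.items():
    print(f"Gain({name}) = {gain(data, col):.3f} bits")
# Gain(Outlook) = 0.247, Gain(Temperature) = 0.029,
# Gain(Humidity) = 0.152, Gain(Windy) = 0.048
```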
The Recursive Procedure for Constructing a Decision Tree
The operation discussed above is applied to each
branch recursively to construct the decision tree.
For example, for the branch Outlook = Sunny,
we evaluate the information gained by applying
each of the remaining 3 attributes.
Gain(Outlook=sunny; Temperature) = 0.971 - 0.4 = 0.571
Gain(Outlook=sunny; Humidity) = 0.971 - 0 = 0.971
Gain(Outlook=sunny; Windy) = 0.971 - 0.951 = 0.02
Similarly, we also evaluate the information gained by applying each of the remaining 3 attributes for the branch Outlook = rainy.
Gain(Outlook=rainy; Temperature) = 0.971 - 0.951 = 0.02
Gain(Outlook=rainy; Humidity) = 0.971 - 0.951 = 0.02
Gain(Outlook=rainy; Windy) = 0.971 - 0 = 0.971
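These branch computations can be checked the same way; the sketch below handles the Outlook = Sunny branch, and the Rainy branch is treated analogously.

```python
from collections import Counter
from math import log2

# The five Outlook = Sunny instances: (Temperature, Humidity, Windy, Play).
sunny = [("Hot", "High", False, "No"), ("Hot", "High", True, "No"),
         ("Mild", "High", False, "No"), ("Cool", "Normal", False, "Yes"),
         ("Mild", "Normal", True, "Yes")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    split = sum(
        len([r for r in rows if r[col] == v]) / len(rows)
        * entropy([r[-1] for r in rows if r[col] == v])
        for v in set(r[col] for r in rows))
    return entropy([r[-1] for r in rows]) - split

for name, col in [("Temperature", 0), ("Humidity", 1), ("Windy", 2)]:
    print(f"Gain(Outlook=sunny; {name}) = {gain(sunny, col):.3f}")
# about 0.571, 0.971, and 0.020 respectively
```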
The Over-fitting Issue
Over-fitting is caused by creating decision rules that fit the training set accurately but are based on an insufficient quantity of samples. As a result, these decision rules may not work well in more general cases.
Example of the Over-fitting Problem in Decision Tree Construction
[Figure: A subroot with 11 Yes and 9 No samples (prediction = Yes) is split on a binary attribute A_i into two children: A_i = 0 with 3 Yes and 0 No samples (prediction = Yes), and A_i = 1 with 8 Yes and 9 No samples (prediction = No).]

$$\text{Entropy at the subroot} = -\frac{11}{20}\log_2\frac{11}{20} - \frac{9}{20}\log_2\frac{9}{20} = 0.993 \text{ bits}$$

$$\text{Average entropy at the children} = \frac{3}{20}(0) + \frac{17}{20}\left(-\frac{8}{17}\log_2\frac{8}{17} - \frac{9}{17}\log_2\frac{9}{17}\right) = 0.848 \text{ bits}$$
Hence, with the binary split, we gain more
information.
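A small sketch to reproduce the two entropies above:

```python
from math import log2

def entropy2(p):
    """Binary entropy in bits for a class probability p."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

subroot = entropy2(11 / 20)                                        # ~0.993 bits
children = 3 / 20 * entropy2(3 / 3) + 17 / 20 * entropy2(8 / 17)   # ~0.848 bits
print(f"{subroot:.3f} {children:.3f}")
```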
However, if we look at the pessimistic error rate, i.e. the upper bound of the confidence interval of the error rate, we may reach a different conclusion.
The formula for the pessimistic error rate is
$$e_u = \frac{r + \dfrac{z^2}{2n} + z\sqrt{\dfrac{r}{n} - \dfrac{r^2}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}$$
where r is the observed error rate, n is the number of samples, and z = z_c is determined by the confidence level c specified by the user (z = 1.645 for c = 95%).
Note that the pessimistic error rate is a function of the confidence level used.
The pessimistic error rates under 95% confidence (z = 1.645) are
$$e_{9/20} = \frac{0.45 + \frac{1.645^2}{40} + 1.645\sqrt{\frac{0.45}{20} - \frac{0.45^2}{20} + \frac{2.706}{1600}}}{1 + \frac{1.645^2}{20}} = 0.6278$$
$$e_{0/3} = \frac{\frac{1.645^2}{6} + 1.645\sqrt{\frac{1.645^2}{36}}}{1 + \frac{1.645^2}{3}} = 0.4742$$
$$e_{8/17} = \frac{\frac{8}{17} + \frac{1.645^2}{34} + 1.645\sqrt{\frac{8/17}{17} - \frac{(8/17)^2}{17} + \frac{2.706}{1156}}}{1 + \frac{1.645^2}{17}} = 0.6598$$
Therefore, the average pessimistic error rate at the children is
$$\frac{3}{20}(0.4742) + \frac{17}{20}(0.6598) = 0.632 > 0.6278.$$
Since the pessimistic error rate increases with the split, we do not want to keep the children. This practice is called tree pruning.
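The pruning decision can be reproduced with a short sketch of the pessimistic-error-rate formula; the function name is illustrative and z = 1.645 is the 95% value used above.

```python
from math import sqrt

def pessimistic_error_rate(r, n, z=1.645):
    """Upper confidence bound on the error rate, given observed rate r over n samples."""
    numerator = r + z * z / (2 * n) + z * sqrt(r / n - r * r / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)

e_subroot = pessimistic_error_rate(9 / 20, 20)                 # ~0.628
e_children = (3 / 20) * pessimistic_error_rate(0 / 3, 3) \
           + (17 / 20) * pessimistic_error_rate(8 / 17, 17)    # ~0.632
print(e_children > e_subroot)   # True, so the split is pruned
```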
Tree Pruning based on the χ² Test of Independence
We construct the corresponding contingency table for the split of the subroot (11 Yes and 9 No samples) into A_i = 0 (3 Yes, 0 No) and A_i = 1 (8 Yes, 9 No):

        A_i = 0   A_i = 1   Total
Yes     3         8         11
No      0         9         9
Total   3         17        20
The χ² statistic is
$$\chi^2 = \frac{\left(3 - \frac{11 \cdot 3}{20}\right)^2}{\frac{11 \cdot 3}{20}} + \frac{\left(8 - \frac{11 \cdot 17}{20}\right)^2}{\frac{11 \cdot 17}{20}} + \frac{\left(0 - \frac{9 \cdot 3}{20}\right)^2}{\frac{9 \cdot 3}{20}} + \frac{\left(9 - \frac{9 \cdot 17}{20}\right)^2}{\frac{9 \cdot 17}{20}} = 2.89.$$
Therefore, we should not split the subroot node if we require that the χ² statistic be larger than χ²_{k,0.05}, where k is the number of degrees of freedom of the corresponding contingency table.
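A sketch of this test applied to the contingency table above; SciPy is assumed to be available for the critical value χ²_{1,0.05} ≈ 3.841, and the helper name is illustrative.

```python
from scipy.stats import chi2

def chi_square_statistic(table):
    """Pearson chi-square statistic of a 2-D contingency table (list of rows)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

table = [[3, 8],    # Yes row: A_i = 0, A_i = 1
         [0, 9]]    # No row
dof = (len(table) - 1) * (len(table[0]) - 1)        # 1 degree of freedom
stat = chi_square_statistic(table)
critical = chi2.ppf(0.95, dof)                      # chi^2_{1,0.05} ~= 3.841
print(f"chi2 = {stat:.2f}, critical value = {critical:.3f}, split: {stat > critical}")
# chi2 = 2.89, critical value = 3.841, split: False
```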
Constructing Decision Trees based on the χ² Test of Independence
Using the following example, we can
construct a contingency table accordingly.
[Figure: A node containing 75 Yes out of 100 samples (prediction = Yes) is split on attribute A_i into three children: A_i = 0 with 45 Yes out of 50 samples, A_i = 1 with 20 Yes out of 25 samples, and A_i = 2 with 10 Yes out of 25 samples.]

        A_i = 0   A_i = 1   A_i = 2   Total
Yes     45        20        10        75
No      5         5         15        25
Total   50        25        25        100








$$\chi^2 = \frac{\left(45 - 100 \cdot \frac{3}{4} \cdot \frac{1}{2}\right)^2}{100 \cdot \frac{3}{4} \cdot \frac{1}{2}} + \frac{\left(20 - 100 \cdot \frac{3}{4} \cdot \frac{1}{4}\right)^2}{100 \cdot \frac{3}{4} \cdot \frac{1}{4}} + \frac{\left(10 - 100 \cdot \frac{3}{4} \cdot \frac{1}{4}\right)^2}{100 \cdot \frac{3}{4} \cdot \frac{1}{4}} + \frac{\left(5 - 100 \cdot \frac{1}{4} \cdot \frac{1}{2}\right)^2}{100 \cdot \frac{1}{4} \cdot \frac{1}{2}} + \frac{\left(5 - 100 \cdot \frac{1}{4} \cdot \frac{1}{4}\right)^2}{100 \cdot \frac{1}{4} \cdot \frac{1}{4}} + \frac{\left(15 - 100 \cdot \frac{1}{4} \cdot \frac{1}{4}\right)^2}{100 \cdot \frac{1}{4} \cdot \frac{1}{4}} = 22.67 > \chi^2_{2,0.05} = 5.991$$

Therefore, we may say that the split is statistically robust.
Assume that we have another attribute A_j to consider.

[Figure: The same node (75 Yes out of 100 samples) split on A_j: A_j = 0 with 25 Yes out of 25 samples, and A_j = 1 with 50 Yes out of 75 samples.]

        A_j = 0   A_j = 1   Total
Yes     25        50        75
No      0         25        25
Total   25        75        100
$$\chi^2 = \frac{\left(25 - \frac{25 \cdot 75}{100}\right)^2}{\frac{25 \cdot 75}{100}} + \frac{\left(50 - \frac{75 \cdot 75}{100}\right)^2}{\frac{75 \cdot 75}{100}} + \frac{\left(0 - \frac{25 \cdot 25}{100}\right)^2}{\frac{25 \cdot 25}{100}} + \frac{\left(25 - \frac{25 \cdot 75}{100}\right)^2}{\frac{25 \cdot 75}{100}} = 11.11 > \chi^2_{1,0.05} = 3.841$$
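Both statistics can be reproduced with SciPy's chi2_contingency (a sketch; correction=False disables Yates' continuity correction so that the plain (O - E)²/E form used on these slides is applied).

```python
from scipy.stats import chi2, chi2_contingency

a_i = [[45, 20, 10],   # Yes counts for A_i = 0, 1, 2
       [ 5,  5, 15]]   # No counts
a_j = [[25, 50],       # Yes counts for A_j = 0, 1
       [ 0, 25]]       # No counts

for name, table in [("A_i", a_i), ("A_j", a_j)]:
    stat, p, dof, expected = chi2_contingency(table, correction=False)
    critical = chi2.ppf(0.95, dof)
    print(f"{name}: chi2 = {stat:.2f} (df = {dof}), critical = {critical:.3f}")
# A_i: chi2 = 22.67 (df = 2), critical = 5.991
# A_j: chi2 = 11.11 (df = 1), critical = 3.841
```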
Now, both A_i and A_j pass our criterion. How should we make our selection?
We can make our selection based on the significance levels of the two contingency tables.

For A_j, χ²_{1,α'} = 11.11, so the significance level is
$$\alpha' = 1 - F_{\chi^2_1}(11.11) = \Pr\!\left(N(0,1)^2 > 11.11\right) = \Pr\!\left(N(0,1) > 3.33\right) + \Pr\!\left(N(0,1) < -3.33\right) = 2\Pr\!\left(N(0,1) > 3.33\right) \approx 0.0008 = 8 \times 10^{-4}.$$

For A_i, χ²_{2,α''} = 22.67, so the significance level is
$$\alpha'' = 1 - F_{\chi^2_2}(22.67) = e^{-\frac{1}{2}(22.67)} \approx 1.19 \times 10^{-5}.$$

Therefore, A_i is preferred over A_j.
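These significance levels are the chi-square survival function values, e.g. with SciPy (a sketch):

```python
from scipy.stats import chi2, norm

alpha_i = chi2.sf(22.67, df=2)            # ~1.2e-05
alpha_j = chi2.sf(11.11, df=1)            # ~8.6e-04
# With 1 degree of freedom this equals the two-sided normal tail probability:
alpha_j_check = 2 * norm.sf(11.11 ** 0.5)
print(alpha_i, alpha_j, alpha_j_check)
print("prefer A_i" if alpha_i < alpha_j else "prefer A_j")
```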
Termination of Split due to Low Significance Level

If a subtree is as follows:

[Figure: A node with 15 Yes out of 20 samples split into three children: 9 Yes out of 10 samples, 4 Yes out of 5 samples, and 2 Yes out of 5 samples.]

$$\chi^2 = 4.543 < 5.991 = \chi^2_{2,0.05}$$

In this case, we do not want to carry out the split.
A More Realistic Example and Some Remarks
In the following example, a bank wants to
derive a credit evaluation tree for future use
based on the records of existing customers.
As the data set shows, it is highly likely that
the training data set contains inconsistencies.
Furthermore, some values may be missing.
Therefore, in most cases, it is impossible to derive perfect decision trees, i.e. decision trees with 100% accuracy.
Attributes (Education, Annual Income, Age, Own House, Sex) and class (Credit ranking):

Education    Annual Income  Age     Own House  Sex     Credit ranking
College      High           Old     Yes        Male    Good
High school  -----          Middle  Yes        Male    Good
High school  Middle         Young   No         Female  Good
College      High           Old     Yes        Male    Poor
College      High           Old     Yes        Male    Good
College      Middle         Young   No         Female  Good
High school  High           Old     Yes        Male    Poor
College      Middle         Middle  -----      Female  Good
High school  Middle         Young   No         Male    Poor
A quality measure of decision trees can be based on accuracy. There are alternative measures, depending on the nature of the application.
Overfitting is a problem caused by making the derived decision tree fit the training set too closely. As a result, the decision tree may work less accurately in the real world.
There are two situations in which overfitting may occur:
an insufficient number of samples at the subroot;
some attributes are highly branched.
A conventional practice for handling missing values is to treat them as possible attribute values. That is, each attribute has one additional attribute value corresponding to the missing value.
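A sketch of this convention, replacing each missing entry with an explicit extra attribute value before the tree is built; the sentinel name is arbitrary.

```python
MISSING = "-----"          # marker used in the table above
SENTINEL = "(missing)"     # extra attribute value standing for "value not recorded"

def fill_missing(rows):
    """Treat a missing entry as one additional attribute value."""
    return [{k: (SENTINEL if v == MISSING else v) for k, v in row.items()}
            for row in rows]

rows = [{"Education": "High school", "Annual Income": "-----", "Age": "Middle"},
        {"Education": "College", "Annual Income": "High", "Age": "Old"}]
print(fill_missing(rows))
```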

Alternative Measures of Quality of Decision Trees
The recall rate and precision are two widely used
measures.




$$\text{Recall rate} = \frac{|C \cap C'|}{|C|}, \qquad \text{Precision} = \frac{|C \cap C'|}{|C'|}$$

where C is the set of samples in the class and C' is the set of samples which the decision tree puts into the class.
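A sketch of these two measures for a single class, with C and C' represented as sets of sample identifiers:

```python
def recall_and_precision(C, C_pred):
    """C: samples truly in the class; C_pred: samples the tree assigns to the class."""
    hits = len(C & C_pred)
    return hits / len(C), hits / len(C_pred)

# Toy example: 10 true class members, 8 predicted, 6 of them correct.
C = set(range(10))
C_pred = {0, 1, 2, 3, 4, 5, 10, 11}
print(recall_and_precision(C, C_pred))   # (0.6, 0.75)
```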
A situation in which the recall rate is the
main concern:
A bank wants to find all the potential credit
card customers.
A situation in which precision is the main
concern:
A bank wants to find a decision tree for credit
approval.
