
COMP527: Data Mining
Classification: Trees
January 18, 2008

Dr Robert Sanderson
(azaroth@liv.ac.uk)

Dept. of Computer Science
University of Liverpool


Module Outline

Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Today's Topics

Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio

Trees

Anything can be made better by storing it in a tree structure! (Not really!)

Instead of having lists or sets of rules, why not have a tree of rules? Then there's no problem with ordering, or with repeating the same test over and over again in different conjunctive rules.

Each node in the tree is an attribute test, and the branches from that node are the different outcomes of the test.

Instead of 'separate and conquer', decision trees take the more typical 'divide and conquer' approach. Once the tree is built, new instances can be classified by simply stepping through the tests from the root down to a leaf.

Example Data Again

Here's our example data again:
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
overcast  cool         normal    true   yes
sunny     mild         high      false  no
sunny     cool         normal    false  yes
rainy     mild         normal    false  yes
sunny     mild         normal    true   yes
overcast  mild         high      true   yes
overcast  hot          normal    false  yes
rainy     mild         high      true   no

How to construct a tree from it, instead of rules?
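For the code sketches on the following slides it helps to have this table in machine-readable form. Below is one possible encoding (a Python list of dictionaries with a 'Play?' class key); the representation is an illustrative assumption, not something the slides prescribe.

# The weather data from the table above, encoded as a list of dictionaries.
# Attribute names and the 'Play?' class key mirror the table headings.
weather = [
    {'Outlook': 'sunny',    'Temperature': 'hot',  'Humidity': 'high',   'Windy': 'false', 'Play?': 'no'},
    {'Outlook': 'sunny',    'Temperature': 'hot',  'Humidity': 'high',   'Windy': 'true',  'Play?': 'no'},
    {'Outlook': 'overcast', 'Temperature': 'hot',  'Humidity': 'high',   'Windy': 'false', 'Play?': 'yes'},
    {'Outlook': 'rainy',    'Temperature': 'mild', 'Humidity': 'high',   'Windy': 'false', 'Play?': 'yes'},
    {'Outlook': 'rainy',    'Temperature': 'cool', 'Humidity': 'normal', 'Windy': 'false', 'Play?': 'yes'},
    {'Outlook': 'rainy',    'Temperature': 'cool', 'Humidity': 'normal', 'Windy': 'true',  'Play?': 'no'},
    {'Outlook': 'overcast', 'Temperature': 'cool', 'Humidity': 'normal', 'Windy': 'true',  'Play?': 'yes'},
    {'Outlook': 'sunny',    'Temperature': 'mild', 'Humidity': 'high',   'Windy': 'false', 'Play?': 'no'},
    {'Outlook': 'sunny',    'Temperature': 'cool', 'Humidity': 'normal', 'Windy': 'false', 'Play?': 'yes'},
    {'Outlook': 'rainy',    'Temperature': 'mild', 'Humidity': 'normal', 'Windy': 'false', 'Play?': 'yes'},
    {'Outlook': 'sunny',    'Temperature': 'mild', 'Humidity': 'normal', 'Windy': 'true',  'Play?': 'yes'},
    {'Outlook': 'overcast', 'Temperature': 'mild', 'Humidity': 'high',   'Windy': 'true',  'Play?': 'yes'},
    {'Outlook': 'overcast', 'Temperature': 'hot',  'Humidity': 'normal', 'Windy': 'false', 'Play?': 'yes'},
    {'Outlook': 'rainy',    'Temperature': 'mild', 'Humidity': 'high',   'Windy': 'true',  'Play?': 'no'},
]
attributes = ['Outlook', 'Temperature', 'Humidity', 'Windy']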

Tree Learning Algorithm

Trivial Tree Learner:

1. create empty tree T
2. select attribute A
3. create branches in T for each value v of A
4. for each branch:
5.     recurse with the instances where A = v
6.     add the resulting tree as the branch node

The most interesting part of this algorithm is line 2, the attribute selection. Let's start with a random selection, then look at how it might be improved.
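As a concrete illustration, here is a minimal Python sketch of this trivial learner, reusing the weather list and attributes from the previous slide's sketch. The function name build_tree, the nested-dictionary tree representation and the random attribute choice are illustrative assumptions only.

import random

def build_tree(instances, attributes, class_key='Play?'):
    # Trivial tree learner: pick an attribute (randomly, for now), branch on
    # each of its values, and recurse until a node is pure or no attributes remain.
    classes = [inst[class_key] for inst in instances]
    if len(set(classes)) == 1 or not attributes:
        return max(set(classes), key=classes.count)    # leaf: majority class
    attr = random.choice(attributes)                   # 'line 2': select attribute A
    rest = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in set(inst[attr] for inst in instances):
        subset = [inst for inst in instances if inst[attr] == value]
        tree[attr][value] = build_tree(subset, rest, class_key)
    return tree

Calling build_tree(weather, attributes) returns a nested dictionary such as {'Windy': {'false': {...}, 'true': {...}}}; the shape differs from run to run because the attribute choice is random.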

Tree Learning Algorithm

Random method:  Let's pick 'windy'

Windy
  false: 6 yes, 2 no
  true:  3 yes, 3 no

Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes    | overcast: yes, yes    | rainy: yes, yes, yes

As all instances of overcast and rainy are yes, those branches stop; sunny continues.

Attribute Selection

As we may have thousands of attributes and/or values to test, we 
want to construct small decision trees.  Think back to RIPPER's 
description length ... the smallest decision tree will have the 
smallest description length.  So how can we reduce the number 
of nodes in the tree?

We want all paths through the tree to be as short as possible. Nodes with only one class stop a path, so we want those to appear early in the tree, otherwise they'll occur in multiple branches.

Think back: the first rule we generated was outlook=overcast, because it was pure.

Attribute Selection: Purity

'Purity' count:

Outlook
  sunny:    2 yes, 3 no
  overcast: 4 yes, 0 no
  rainy:    3 yes, 2 no

Select the attribute that has the most 'pure' nodes, randomising equal counts.
Still mediocre: most data sets won't have any pure nodes for several levels. We need a measure of purity rather than a simple count.
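A rough Python sketch of the purity-count heuristic, reusing the weather data sketch from earlier; count_pure and select_by_purity are assumed names.

import random

def count_pure(instances, attr, class_key='Play?'):
    # Number of values of attr whose instances all belong to a single class.
    pure = 0
    for value in set(inst[attr] for inst in instances):
        classes = {inst[class_key] for inst in instances if inst[attr] == value}
        if len(classes) == 1:
            pure += 1
    return pure

def select_by_purity(instances, attributes):
    # Attribute with the most pure branches, randomising equal counts.
    counts = {a: count_pure(instances, a) for a in attributes}
    best = max(counts.values())
    return random.choice([a for a, c in counts.items() if c == best])

On the weather data only outlook has a pure branch (overcast), so select_by_purity(weather, attributes) picks it; every other attribute scores zero.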

Attribute Selection: Entropy

For each test:
Maximal purity:  all instances have the same class
Minimal purity:  equal numbers of each class

Find a scale between maximal and minimal purity, and then combine it across all of the attribute's tests.

One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn)
  = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)

p1 ... pn are the number of instances of each class, expressed as a fraction of the total number of instances at that point in the tree. log is base 2.

Attribute Selection: Entropy

entropy(p1, p2, ..., pn)
  = -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)

This is the calculation for one test. For outlook there are three tests:
sunny:  info(2,3)
  = -2/5 log(2/5) - 3/5 log(3/5)
  = 0.5288 + 0.4422
  = 0.971

overcast:  info(4,0) = -(4/4 * log(4/4)) - (0 * log(0))

Uh oh!  log(0) is undefined. But note that we're multiplying it by 0, so whatever it is, that term contributes 0 (the convention is 0*log(0) = 0).
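A direct transcription of the entropy calculation into Python, taking the class counts at a node (so entropy_counts(2, 3) is info(2,3)); the guard on zero counts implements the 0*log(0) = 0 convention. The helper name is an assumption.

from math import log2

def entropy_counts(*counts):
    # Entropy, in bits, of a node with the given class counts.
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:                    # 0 * log(0) is treated as 0
            p = c / total
            result -= p * log2(p)
    return result

entropy_counts(2, 3) gives approximately 0.971 and entropy_counts(4, 0) gives 0.0, matching the figures above.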

Attribute Selection: Entropy

sunny:   info(2,3) = 0.971
overcast:  info(4,0) = 0.0
rainy:  info(3,2) = 0.971

But we have 14 instances to divide down those paths...
So the total for outlook is:
(5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693

Now to calculate the gain, we work out the entropy for the top node 
and subtract the entropy for outlook:
info(9,5) = 0.940 
gain(outlook) = 0.940 - 0.693 = 0.247
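The same calculation in code, reusing the weather list and entropy_counts from the earlier sketches; class_counts and info_gain are assumed helper names.

from collections import Counter

def class_counts(instances, class_key='Play?'):
    # e.g. Counter({'yes': 9, 'no': 5}) for the full weather data.
    return Counter(inst[class_key] for inst in instances)

def info_gain(instances, attr, class_key='Play?'):
    # Entropy at the node minus the size-weighted entropy after splitting on attr.
    before = entropy_counts(*class_counts(instances, class_key).values())
    total = len(instances)
    after = 0.0
    for value in set(inst[attr] for inst in instances):
        subset = [inst for inst in instances if inst[attr] == value]
        after += (len(subset) / total) * entropy_counts(*class_counts(subset, class_key).values())
    return before - after

print(info_gain(weather, 'Outlook'))   # ≈ 0.247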

Attribute Selection: Entropy

Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029

And select the maximum ... which is outlook.
This is (also!) called information gain. The total is the information, measured in 'bits'.
Equally, we could select the attribute that leaves the minimum amount of information still needed -- the minimum description length idea we saw with RIPPER.
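A quick check of those numbers, reusing the weather list and the info_gain sketch from the previous slide:

gains = {a: round(info_gain(weather, a), 3) for a in attributes}
print(gains)   # ≈ {'Outlook': 0.247, 'Temperature': 0.029, 'Humidity': 0.152, 'Windy': 0.048}
best = max(gains, key=gains.get)   # 'Outlook'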

Let's do the next level, where outlook=sunny.

Attribute Selection: Entropy

Now calculate the gains for the five outlook=sunny instances:
Outlook  Temperature  Humidity  Windy  Play?
sunny    hot          high      false  no
sunny    hot          high      true   no
sunny    mild         high      false  no
sunny    cool         normal    false  yes
sunny    mild         normal    true   yes

Temp:    hot   info(0,2)  mild info(1,1)   cool info(1,0)
Humidity: high  info(0,3)  normal info(2,0)
Windy:    false info(1,2)  true info(1,1)

Don't even need to do the math.  Humidity is the obvious choice as 
it predicts all 5 instances correctly.  Thus the information will be 
0, and the gain will be maximal. 
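The same check for the outlook=sunny subset, again reusing the earlier info_gain sketch:

sunny = [inst for inst in weather if inst['Outlook'] == 'sunny']
for a in ['Temperature', 'Humidity', 'Windy']:
    print(a, info_gain(sunny, a))
# Temperature ≈ 0.571, Humidity ≈ 0.971, Windy ≈ 0.02
# Humidity's gain equals the entropy of the subset, i.e. it is maximal.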

Attribute Selection: Entropy

Now our tree looks like:
Outlook
  sunny:    Humidity
              normal: yes
              high:   no
  overcast: yes
  rainy:    ?

This algorithm is called ID3, developed by Quinlan.

Entropy: Issues

Nasty side effect of entropy: it prefers attributes with a large number of branches.
E.g., if there were an 'identifier' attribute with a unique value for every instance, it would uniquely determine the class of each training instance, but be useless for classification. (Over-fitting!)

E.g.:  info(0,1) info(0,1) info(1,0) ...

The values don't even need to be unique. If we assign 1 to the first two instances, 2 to the next two, and so forth, we still get a 'better' split.

Entropy: Issues

Half-identifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
= 0  0  1  1  0  0  1

2/14 of the instances go down each branch, so:
= 0*2/14 + 0*2/14 + 1*2/14 + 1*2/14 + ...
= 3 * (2/14 * 1)
= 6/14
= 0.429
Gain is:
0.940 - 0.429 = 0.511

Remember that the gain for Outlook was only 0.247!
Urgh.  Once more we run into over-fitting.
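This bias is easy to reproduce with the earlier sketches; the 'ID' and 'HalfID' attribute names below are invented purely for the demonstration.

# Add an identifier (unique per instance) and a half-identifier
# (shared by each consecutive pair of instances), then compare gains.
for i, inst in enumerate(weather):
    inst['ID'] = i
    inst['HalfID'] = i // 2

print(info_gain(weather, 'ID'))        # ≈ 0.94, the maximum possible here
print(info_gain(weather, 'HalfID'))    # ≈ 0.51
print(info_gain(weather, 'Outlook'))   # ≈ 0.247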

Gain Ratio

Solution: use a gain ratio. Divide the gain by the entropy of the split itself, i.e. the entropy of the daughter node sizes, disregarding the classes:

e.g.  info(2,2,2,2,2,2,2) for the half-identifier
and   info(5,4,5) for outlook

identifier = -1/14 * log(1/14) * 14 = 3.807
half-identifier = -1/7 * log(1/7) * 7 = 2.807
outlook = 1.577

Ratios:
identifier = 0.940 / 3.807 = 0.247
half-identifier = 0.511 / 2.807 = 0.182
outlook = 0.247 / 1.577 = 0.157
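A sketch of the gain ratio built on the earlier helpers; split_info and gain_ratio are assumed names, and the ID/HalfID attributes are the ones added in the previous sketch.

def split_info(instances, attr):
    # Entropy of the split itself: branch sizes only, classes ignored.
    values = [inst[attr] for inst in instances]
    sizes = [values.count(v) for v in set(values)]
    return entropy_counts(*sizes)

def gain_ratio(instances, attr):
    return info_gain(instances, attr) / split_info(instances, attr)

print(gain_ratio(weather, 'ID'))        # ≈ 0.247
print(gain_ratio(weather, 'HalfID'))    # ≈ 0.18
print(gain_ratio(weather, 'Outlook'))   # ≈ 0.156 (0.157 with the rounded figures above)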

Gain Ratio

Only a partial success: the identifier (accurate in all 14 branches on the training data, but useless for prediction) still has the highest ratio, and the half-identifier (accurate in only 4/7 of its branches) still beats outlook!

identifier = 0.247
half-identifier = 0.182
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019

Humidity is now also very close to outlook, whereas before they 
were separated.

Gain Ratio

We could simply check for identifier-like attributes and ignore them. Really, they should be removed from the data before the data mining begins.

However, the ratio can also over-compensate: it might pick an attribute just because its split entropy is low. Note how close humidity and outlook became... maybe that's not such a good thing?

Possible fix: first generate the information gain for each attribute. Throw away any attribute with less than the average gain. Then compare the rest using the ratio.
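A sketch of that possible fix, reusing info_gain and gain_ratio from the earlier sketches (and ignoring corner cases such as attributes that do not split the data at all):

def select_attribute(instances, attributes):
    # Keep only attributes with at least average information gain,
    # then take the best gain ratio among the survivors.
    gains = {a: info_gain(instances, a) for a in attributes}
    average = sum(gains.values()) / len(gains)
    survivors = [a for a, g in gains.items() if g >= average] or list(attributes)
    return max(survivors, key=lambda a: gain_ratio(instances, a))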

Alternative: Gini

An alternative to Information Gain is the Gini Index.

The Gini Index for a node D is:
gini(D) = 1 - (p1² + p2² + ... + pn²)
where p1 ... pn are the relative frequencies of classes 1 ... n in D.

So the Gini Index for the entire set:
= 1 - ((9/14)² + (5/14)²)
= 1 - (0.413 + 0.128)
= 0.459
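A sketch of the Gini index in Python; gini_counts is an assumed name and, like entropy_counts earlier, it takes the class counts at a node.

def gini_counts(*counts):
    # Gini index of a node with the given class counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_counts(9, 5))   # ≈ 0.459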

Gini

The gini value of a split of D into subsets D1 ... Dn is:

Split(D) = N1/N gini(D1) + N2/N gini(D2) + ... + Nn/N gini(Dn)

where Ni is the size of subset Di, and N is the size of D.

e.g.  Outlook splits into 5, 4, 5:
split    = 5/14 gini(sunny) + 4/14 gini(overcast) + 5/14 gini(rainy)
sunny    = 1 - ((2/5)² + (3/5)²) = 1 - 0.52 = 0.48
overcast = 1 - ((4/4)² + (0/4)²) = 0.0
rainy    = sunny
split    = (5/14 * 0.48) * 2
         = 0.343
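The corresponding split value in code, reusing the weather list, class_counts and gini_counts from the earlier sketches (gini_split is an assumed name):

def gini_split(instances, attr, class_key='Play?'):
    # Size-weighted Gini index of the subsets produced by splitting on attr.
    total = len(instances)
    value = 0.0
    for v in set(inst[attr] for inst in instances):
        subset = [inst for inst in instances if inst[attr] == v]
        value += (len(subset) / total) * gini_counts(*class_counts(subset, class_key).values())
    return value

print(gini_split(weather, 'Outlook'))   # ≈ 0.343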

Gini

The attribute that generates the smallest gini split value is chosen 
to split the node on.

(Left as an exercise for you to do!)

Gini is used in CART (Classification and Regression Trees), IBM's IntelligentMiner system, and SPRINT (Scalable PaRallelizable INduction of decision Trees). It is named after the Italian statistician Corrado Gini, who used it to measure income inequality.

Decision Tree Issues

The various problems that a good DT builder needs to address:

– Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
– Numeric/Missing Data
Dividing numeric data is more complicated. How?
– Tree Structure
A balanced tree with the fewest levels is preferable.
– Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
– Pruning
It may be beneficial to prune the tree once it has been created, or even incrementally as it is built. How?

Further Reading


Introductory statistical text books
Witten, 3.2, 4.3
Dunham, 4.4
Han, 6.3
Berry and Browne, Chapter 4
Berry and Linoff, Chapter 6
