Data Mining
COMP527: Data Mining
Dr Robert Sanderson
(azaroth@liv.ac.uk)
Dept. of Computer Science
University of Liverpool
2008
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text as Data
Text Mining: Text as Language
Revision for Exam
Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio
Anything can be made better by storing it in a tree structure! (Not really!)
Here's our example data again:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
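For the sketches that follow, here is one way this table might be encoded in Python. The 'weather' name and the dict-per-row layout are illustrative choices, not part of the course material.

from collections import Counter

# The 14-instance weather data set, one dict per row.
weather = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": True,  "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": True,  "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": True,  "play": "no"},
]

print(Counter(row["play"] for row in weather))   # Counter({'yes': 9, 'no': 5})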
How to construct a tree from it, instead of rules?
Trivial Tree Learner:
create empty tree T
select attribute A
create branches in T for each value v of A
for each branch:
    recurse with the instances where A=v
    add the resulting subtree at that branch
The most interesting part of this algorithm is line 2, the attribute
selection. Let's start with a random selection, then look at how it
might be improved.
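Below is a minimal runnable sketch of this learner, assuming the 'weather' list from the earlier sketch; the function names and the nested-dict tree representation are my own illustration, not part of the course. The selection function is passed in as a parameter, so the random choice used here can later be swapped for an entropy-based one.

import random

def class_counts(instances, target="play"):
    # Tally the class labels of the instances at this node.
    counts = {}
    for row in instances:
        counts[row[target]] = counts.get(row[target], 0) + 1
    return counts

def build_tree(instances, attributes, select, target="play"):
    counts = class_counts(instances, target)
    if len(counts) == 1 or not attributes:       # pure node, or nothing left to split on
        return max(counts, key=counts.get)       # leaf labelled with the majority class
    a = select(instances, attributes)            # "line 2": the attribute selection decision
    rest = [x for x in attributes if x != a]
    tree = {a: {}}
    for v in set(row[a] for row in instances):   # one branch per observed value of a
        subset = [row for row in instances if row[a] == v]
        tree[a][v] = build_tree(subset, rest, select, target)
    return tree

def random_select(instances, attributes):
    return random.choice(attributes)             # the 'Random' strategy from the slides

print(build_tree(weather, ["outlook", "temperature", "humidity", "windy"], random_select))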
Random method: let's pick 'windy'.
windy = false: 6 yes, 2 no
windy = true: 3 yes, 3 no
Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes
As all instances of overcast and rainy are yes, they stop, sunny continues.
As we may have thousands of attributes and/or values to test, we
want to construct small decision trees. Think back to RIPPER's
description length ... the smallest decision tree will have the
smallest description length. So how can we reduce the number
of nodes in the tree?
'Purity' count: consider outlook.
outlook = sunny: 2 yes, 3 no
outlook = overcast: 4 yes, 0 no
outlook = rainy: 3 yes, 2 no
Select the attribute that has the most 'pure' nodes, breaking ties
between equal counts randomly.
Still mediocre. Most data sets won't have pure nodes for several
levels. Need a measure of the purity instead of the simple count.
For each test:
Maximal purity: All values are the same
Minimal purity: Equal number of each value
Find a scale between maximal and minimal, and then merge across all of the
attribute tests.
One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn)
= -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
p1 ... pn are the number of instances of each class, expressed as a fraction of
the total number of instances at that point in the tree. log is base 2.
This is to calculate one test. For outlook there are three tests:
sunny: info(2,3)
= -(2/5)*log(2/5) - (3/5)*log(3/5)
= 0.5287 + 0.4421
= 0.971
overcast: info(4,0) = -(4/4)*log(4/4) - (0/4)*log(0/4)
Uh oh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it
is, that term comes out as 0.
sunny: info(2,3) = 0.971
overcast: info(4,0) = 0.0
rainy: info(3,2) = 0.971
But we have 14 instances to divide down those paths...
So the total for outlook is:
(5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
Now to calculate the gain, we work out the entropy for the top node
and subtract the entropy for outlook:
info(9,5) = 0.940
gain(outlook) = 0.940 - 0.693 = 0.247
Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029
And select the maximum ... which is outlook.
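A self-contained sketch of these calculations; the helper names are mine, and the per-branch (yes, no) counts are read straight off the table.

from math import log2

def info(*counts):
    # Entropy in bits: -sum(p_i * log2(p_i)) over the non-zero class counts.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def weighted_info(branches):
    # Average entropy of the branches, weighted by the fraction of instances in each.
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * info(*b) for b in branches)

root = info(9, 5)                                 # 0.940 bits at the top node

splits = {                                        # (yes, no) counts down each branch
    "outlook":     [(2, 3), (4, 0), (3, 2)],      # sunny, overcast, rainy
    "temperature": [(2, 2), (4, 2), (3, 1)],      # hot, mild, cool
    "humidity":    [(3, 4), (6, 1)],              # high, normal
    "windy":       [(6, 2), (3, 3)],              # false, true
}
for attribute, branches in splits.items():
    print(attribute, round(root - weighted_info(branches), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048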
This is (also!) called information gain. The total is the information,
measured in 'bits'.
Equally, we could select the attribute that needs the minimum amount of information:
the minimum description length idea again, as seen with RIPPER.
Let's do the next level, where outlook=sunny.
Now to calculate the gain for all of the attributes:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes
Temp: hot info(0,2) mild info(1,1) cool info(1,0)
Humidity: high info(0,3) normal info(2,0)
Windy: false info(1,2) true info(1,1)
Don't even need to do the math. Humidity is the obvious choice as
it predicts all 5 instances correctly. Thus the information will be
0, and the gain will be maximal.
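Checking that claim with a short self-contained sketch over the five sunny instances (counts taken from the table above; the 'gain' helper is mine):

from math import log2

def info(*counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain(parent, branches):
    # Information gain of a split, given (yes, no) counts at the parent and in each branch.
    n = sum(sum(b) for b in branches)
    return info(*parent) - sum(sum(b) / n * info(*b) for b in branches)

sunny = (2, 3)                                           # 2 yes, 3 no at the outlook=sunny node
print(round(gain(sunny, [(0, 2), (1, 1), (1, 0)]), 3))   # temperature: 0.571
print(round(gain(sunny, [(0, 3), (2, 0)]), 3))           # humidity:    0.971 (maximal)
print(round(gain(sunny, [(1, 2), (1, 1)]), 3))           # windy:       0.020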
Now our tree looks like:
Outlook:
  sunny -> Humidity (normal -> yes, high -> no)
  overcast -> yes
  rainy -> ? (still to be split)
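In the nested-dict representation used in the tree-learner sketch above, the tree built so far could be written as:

tree_so_far = {"outlook": {
    "sunny":    {"humidity": {"normal": "yes", "high": "no"}},
    "overcast": "yes",
    "rainy":    "?",   # this branch still needs another split
}}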
This algorithm is called ID3, developed by Quinlan.
Nasty side effect of Entropy: It prefers attributes with a large
number of branches.
Eg, if there were an 'identifier' attribute with a unique value per instance, it
would uniquely determine the class, but be useless for classifying new
instances. (Overfitting!)
Eg: info(0,1) info(0,1) info(1,0) ...
It doesn't need to be unique. If we assign 1 to the first two instances,
2 to the next two, and so forth, we still get a 'better' split.
HalfIdentifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
= 0, 0, 1, 1, 0, 0, 1 (info(1,1) is a full bit)
2/14 of the instances go down each branch, so:
= 0*2/14 + 0*2/14 + 1*2/14 + 1*2/14 + ...
= 3 * (2/14 * 1)
= 6/14
= 0.429
Gain is:
0.940 - 0.429 = 0.511
Remember that the gain for Outlook was only 0.247!
Urgh. Once more we run into overfitting.
Solution: Use a gain ratio. Calculate the entropy disregarding
classes for all of the daughter nodes:
eg info(2,2,2,2,2,2,2) for halfidentifier
and info(5,4,5) for outlook
identifier = -(1/14)*log(1/14) * 14 = log(14) = 3.807
halfidentifier = -(1/7)*log(1/7) * 7 = log(7) = 2.807
outlook = 1.577
Ratios:
identifier = 0.940 / 3.807 = 0.247
halfidentifier = 0.511 / 2.807 = 0.182
outlook = 0.247 / 1.577 = 0.157
Not quite a success: identifier (which is 'accurate' in all 14 branches, but
useless) still comes out on top, and halfidentifier (only pure in 4 of its 7
branches) still beats outlook!
identifier = 0.247
halfidentifier = 0.182
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019
Humidity is now also very close to outlook, whereas before they
were separated.
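A self-contained sketch of the gain ratio calculation (helper names mine; each branch is again a (yes, no) count pair, and the split info is just the entropy of the branch sizes):

from math import log2

def info(*counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain_ratio(root, branches):
    n = sum(sum(b) for b in branches)
    gain = info(*root) - sum(sum(b) / n * info(*b) for b in branches)
    split_info = info(*[sum(b) for b in branches])   # entropy of the branch sizes, ignoring class
    return gain / split_info

root = (9, 5)
print(round(gain_ratio(root, [(1, 0)] * 9 + [(0, 1)] * 5), 3))                  # identifier: 0.247
print(round(gain_ratio(root, [(0, 2), (2, 0), (1, 1), (1, 1),
                              (2, 0), (2, 0), (1, 1)]), 3))                     # halfidentifier: 0.182
print(round(gain_ratio(root, [(2, 3), (4, 0), (3, 2)]), 3))                     # outlook: 0.156 (0.157 if the rounded figures above are divided)
print(round(gain_ratio(root, [(3, 4), (6, 1)]), 3))                             # humidity: 0.152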
We can simply check for identifier-like attributes and ignore them.
Actually, they should be removed from the data before the data
mining begins.
However, the ratio can also overcompensate: it might pick an attribute
just because its split entropy is low. Note how close humidity
and outlook became... maybe that's not such a good thing?
Possible fix: first compute the information gain, throw away any
attributes whose gain is below the average, and then compare the rest
using the ratio, as sketched below.
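A small sketch of that heuristic, using the gain and gain ratio values computed above (identifier-like attributes assumed already removed, as suggested earlier):

gains  = {"outlook": 0.247, "humidity": 0.152, "windy": 0.048, "temperature": 0.029}
ratios = {"outlook": 0.156, "humidity": 0.152, "windy": 0.049, "temperature": 0.019}

average_gain = sum(gains.values()) / len(gains)                 # 0.119
candidates = [a for a in gains if gains[a] >= average_gain]     # ['outlook', 'humidity']
best = max(candidates, key=lambda a: ratios[a])                 # 'outlook'
print(best)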
An alternative method to Information Gain is the Gini Index.
The Gini index for a node D is:
gini(D) = 1 - (p1^2 + p2^2 + ... + pn^2)
where p1 ... pn are the frequency ratios of classes 1 ... n in D.
So the Gini index for the entire set:
= 1 - ((9/14)^2 + (5/14)^2)
= 1 - (0.413 + 0.127)
= 0.459
The Gini value of a split of D into subsets D1 ... Dn is:
gini_split(D) = (N1/N)*gini(D1) + (N2/N)*gini(D2) + ... + (Nn/N)*gini(Dn)
where Ni is the size of subset Di, and N is the size of D.
eg: Outlook splits the 14 instances 5, 4, 5:
gini_split = 5/14 * gini(sunny) + 4/14 * gini(overcast) + 5/14 * gini(rainy)
gini(sunny) = 1 - ((2/5)^2 + (3/5)^2) = 1 - 0.52 = 0.48
gini(overcast) = 1 - ((4/4)^2 + (0/4)^2) = 0.0
gini(rainy) = gini(sunny) = 0.48
gini_split = (5/14 * 0.48) * 2 = 0.343
The attribute that generates the smallest gini split value is chosen
to split the node on.
(Left as an exercise for you to do!)
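A sketch you could start the exercise from (function names mine; the outlook figures match the worked example above):

def gini(*counts):
    # gini(D) = 1 - sum of squared class frequency ratios
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(branches):
    # Weighted average of the gini values of the subsets produced by a split.
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * gini(*b) for b in branches)

print(round(gini(9, 5), 3))                              # whole data set: 0.459
print(round(gini_split([(2, 3), (4, 0), (3, 2)]), 3))    # outlook: 0.343
# humidity, windy and temperature are left for the exercise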
Gini is used in CART (Classification and Regression Trees), IBM's
Intelligent Miner system, and SPRINT (Scalable PaRallelizable
INduction of decision Trees). The index is named after the Italian
statistician Corrado Gini, who used it to measure income inequality.
The various problems that a good DT builder needs to address:
– Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
– Numeric/Missing Data
Dividing numeric data is more complicated. How?
– Tree Structure
A balanced tree with the fewest levels is preferable.
– Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
– Pruning
It may be beneficial to prune the tree once it has been created. Or should pruning happen incrementally, as the tree is built?
Further Reading:
● Introductory statistical text books
● Witten, 3.2, 4.3
● Dunham, 4.4
● Han, 6.3
● Berry and Browne, Chapter 4
● Berry and Linoff, Chapter 6