Data Mining
COMP527: Data Mining
Dr Robert Sanderson
(azaroth@liv.ac.uk)
Dept. of Computer Science
University of Liverpool
2008
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text as Data
Text Mining: Text as Language
Revision for Exam
Trees
Tree Learning Algorithm
Attribute Splitting Decisions
Random
'Purity Count'
Entropy (aka ID3)
Information Gain Ratio
Anything can be made better by storing it in a tree structure! (Not really!)
Here's our example data again:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
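For the sketches that follow, here is one way this table might be encoded in Python. The 'weather' name and the dict-per-row layout are illustrative choices, not part of the course material.

from collections import Counter

# The 14-instance weather data set, one dict per row.
weather = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": True,  "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": True,  "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": True,  "play": "no"},
]

print(Counter(row["play"] for row in weather))   # Counter({'yes': 9, 'no': 5})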
How to construct a tree from it, instead of rules?
Trivial Tree Learner:
create empty tree T
select attribute A
create branches in T for each value v of A
for each branch:
    recurse with the instances where A=v
    add the resulting subtree at that branch
The most interesting part of this algorithm is line 2, the attribute
selection. Let's start with a random selection, then look at how it
might be improved.
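Below is a minimal runnable sketch of this learner, assuming the 'weather' list from the earlier sketch; the function names and the nested-dict tree representation are my own illustration, not part of the course. The selection function is passed in as a parameter, so the random choice used here can later be swapped for an entropy-based one.

import random

def class_counts(instances, target="play"):
    # Tally the class labels of the instances at this node.
    counts = {}
    for row in instances:
        counts[row[target]] = counts.get(row[target], 0) + 1
    return counts

def build_tree(instances, attributes, select, target="play"):
    counts = class_counts(instances, target)
    if len(counts) == 1 or not attributes:       # pure node, or nothing left to split on
        return max(counts, key=counts.get)       # leaf labelled with the majority class
    a = select(instances, attributes)            # "line 2": the attribute selection decision
    rest = [x for x in attributes if x != a]
    tree = {a: {}}
    for v in set(row[a] for row in instances):   # one branch per observed value of a
        subset = [row for row in instances if row[a] == v]
        tree[a][v] = build_tree(subset, rest, select, target)
    return tree

def random_select(instances, attributes):
    return random.choice(attributes)             # the 'Random' strategy from the slides

print(build_tree(weather, ["outlook", "temperature", "humidity", "windy"], random_select))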
Random method: let's pick 'windy'.
windy = false: 6 yes, 2 no
windy = true: 3 yes, 3 no
Need to split again, looking at only the 8 and 6 instances respectively.
For windy=false, we'll randomly select outlook:
sunny: no, no, yes | overcast: yes, yes | rainy: yes, yes, yes
As all instances of overcast and rainy are yes, they stop, sunny continues.
As we may have thousands of attributes and/or values to test, we
want to construct small decision trees. Think back to RIPPER's
description length ... the smallest decision tree will have the
smallest description length. So how can we reduce the number
of nodes in the tree?
'Purity' count: consider outlook.
outlook = sunny: 2 yes, 3 no
outlook = overcast: 4 yes, 0 no
outlook = rainy: 3 yes, 2 no
Select the attribute that has the most 'pure' nodes, breaking ties
between equal counts randomly.
Still mediocre. Most data sets won't have pure nodes for several
levels. Need a measure of the purity instead of the simple count.
For each test:
Maximal purity: All values are the same
Minimal purity: Equal number of each value
Find a scale between maximal and minimal, and then merge across all of the
attribute tests.
One function that calculates this is the Entropy function:
entropy(p1, p2, ..., pn)
= -p1*log(p1) - p2*log(p2) - ... - pn*log(pn)
p1 ... pn are the number of instances of each class, expressed as a fraction of
the total number of instances at that point in the tree. log is base 2.
This is to calculate one test. For outlook there are three tests:
sunny: info(2,3)
= -(2/5)*log(2/5) - (3/5)*log(3/5)
= 0.5287 + 0.4421
= 0.971
overcast: info(4,0) = -(4/4)*log(4/4) - (0/4)*log(0/4)
Uh oh! log(0) is undefined. But note that we're multiplying it by 0, so whatever it
is, that term comes out as 0.
sunny: info(2,3) = 0.971
overcast: info(4,0) = 0.0
rainy: info(3,2) = 0.971
But we have 14 instances to divide down those paths...
So the total for outlook is:
(5/14 * 0.971) + (4/14 * 0.0) + (5/14 * 0.971) = 0.693
Now to calculate the gain, we work out the entropy for the top node
and subtract the entropy for outlook:
info(9,5) = 0.940
gain(outlook) = 0.940 - 0.693 = 0.247
Now to calculate the gain for all of the attributes:
gain(outlook) = 0.247
gain(humidity) = 0.152
gain(windy) = 0.048
gain(temperature) = 0.029
And select the maximum ... which is outlook.
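A self-contained sketch of these calculations; the helper names are mine, and the per-branch (yes, no) counts are read straight off the table.

from math import log2

def info(*counts):
    # Entropy in bits: -sum(p_i * log2(p_i)) over the non-zero class counts.
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def weighted_info(branches):
    # Average entropy of the branches, weighted by the fraction of instances in each.
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * info(*b) for b in branches)

root = info(9, 5)                                 # 0.940 bits at the top node

splits = {                                        # (yes, no) counts down each branch
    "outlook":     [(2, 3), (4, 0), (3, 2)],      # sunny, overcast, rainy
    "temperature": [(2, 2), (4, 2), (3, 1)],      # hot, mild, cool
    "humidity":    [(3, 4), (6, 1)],              # high, normal
    "windy":       [(6, 2), (3, 3)],              # false, true
}
for attribute, branches in splits.items():
    print(attribute, round(root - weighted_info(branches), 3))
# outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048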
This is (also!) called information gain. The total is the information,
measured in 'bits'.
Equally, we could select the attribute that needs the minimum amount of information:
the minimum description length idea again, as seen with RIPPER.
Let's do the next level, where outlook=sunny.
Now to calculate the gain for all of the attributes:
Outlook Temperature Humidity Windy Play?
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes
Temp: hot info(0,2) mild info(1,1) cool info(1,0)
Humidity: high info(0,3) normal info(2,0)
Windy: false info(1,2) true info(1,1)
Don't even need to do the math. Humidity is the obvious choice as
it predicts all 5 instances correctly. Thus the information will be
0, and the gain will be maximal.
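Checking that claim with a short self-contained sketch over the five sunny instances (counts taken from the table above; the 'gain' helper is mine):

from math import log2

def info(*counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain(parent, branches):
    # Information gain of a split, given (yes, no) counts at the parent and in each branch.
    n = sum(sum(b) for b in branches)
    return info(*parent) - sum(sum(b) / n * info(*b) for b in branches)

sunny = (2, 3)                                           # 2 yes, 3 no at the outlook=sunny node
print(round(gain(sunny, [(0, 2), (1, 1), (1, 0)]), 3))   # temperature: 0.571
print(round(gain(sunny, [(0, 3), (2, 0)]), 3))           # humidity:    0.971 (maximal)
print(round(gain(sunny, [(1, 2), (1, 1)]), 3))           # windy:       0.020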
Now our tree looks like:
Outlook:
  sunny -> Humidity (normal -> yes, high -> no)
  overcast -> yes
  rainy -> ? (still to be split)
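In the nested-dict representation used in the tree-learner sketch above, the tree built so far could be written as:

tree_so_far = {"outlook": {
    "sunny":    {"humidity": {"normal": "yes", "high": "no"}},
    "overcast": "yes",
    "rainy":    "?",   # this branch still needs another split
}}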
This algorithm is called ID3, developed by Quinlan.
Nasty side effect of Entropy: It prefers attributes with a large
number of branches.
Eg, if there were an 'identifier' attribute with a unique value per instance, it
would uniquely determine the class, but be useless for classifying new
instances. (Overfitting!)
Eg: info(0,1) info(0,1) info(1,0) ...
It doesn't need to be unique. If we assign 1 to the first two instances,
2 to the next two, and so forth, we still get a 'better' split.
HalfIdentifier 'attribute':
info(0,2) info(2,0) info(1,1) info(1,1) info(2,0) info(2,0) info(1,1)
= 0, 0, 1, 1, 0, 0, 1 (info(1,1) is a full bit)
2/14 of the instances go down each branch, so:
= 0*2/14 + 0*2/14 + 1*2/14 + 1*2/14 + ...
= 3 * (2/14 * 1)
= 6/14
= 0.429
Gain is:
0.940 - 0.429 = 0.511
Remember that the gain for Outlook was only 0.247!
Urgh. Once more we run into overfitting.
Solution: Use a gain ratio. Calculate the entropy disregarding
classes for all of the daughter nodes:
eg info(2,2,2,2,2,2,2) for halfidentifier
and info(5,4,5) for outlook
identifier = -(1/14)*log(1/14) * 14 = log(14) = 3.807
halfidentifier = -(1/7)*log(1/7) * 7 = log(7) = 2.807
outlook = 1.577
Ratios:
identifier = 0.940 / 3.807 = 0.247
halfidentifier = 0.511 / 2.807 = 0.182
outlook = 0.247 / 1.577 = 0.157
Not quite a success: identifier (which is 'accurate' in all 14 branches, but
useless) still comes out on top, and halfidentifier (only pure in 4 of its 7
branches) still beats outlook!
identifier = 0.247
halfidentifier = 0.182
outlook = 0.157
humidity = 0.152
windy = 0.049
temperature = 0.019
Humidity is now also very close to outlook, whereas before they
were separated.
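A self-contained sketch of the gain ratio calculation (helper names mine; each branch is again a (yes, no) count pair, and the split info is just the entropy of the branch sizes):

from math import log2

def info(*counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def gain_ratio(root, branches):
    n = sum(sum(b) for b in branches)
    gain = info(*root) - sum(sum(b) / n * info(*b) for b in branches)
    split_info = info(*[sum(b) for b in branches])   # entropy of the branch sizes, ignoring class
    return gain / split_info

root = (9, 5)
print(round(gain_ratio(root, [(1, 0)] * 9 + [(0, 1)] * 5), 3))                  # identifier: 0.247
print(round(gain_ratio(root, [(0, 2), (2, 0), (1, 1), (1, 1),
                              (2, 0), (2, 0), (1, 1)]), 3))                     # halfidentifier: 0.182
print(round(gain_ratio(root, [(2, 3), (4, 0), (3, 2)]), 3))                     # outlook: 0.156 (0.157 if the rounded figures above are divided)
print(round(gain_ratio(root, [(3, 4), (6, 1)]), 3))                             # humidity: 0.152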
We can simply check for identifier-like attributes and ignore them.
Actually, they should be removed from the data before the data
mining begins.
However, the ratio can also overcompensate: it might pick an attribute
just because its split entropy is low. Note how close humidity
and outlook became... maybe that's not such a good thing?
Possible fix: first compute the information gain, throw away any
attributes whose gain is below the average, and then compare the rest
using the ratio, as sketched below.
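A small sketch of that heuristic, using the gain and gain ratio values computed above (identifier-like attributes assumed already removed, as suggested earlier):

gains  = {"outlook": 0.247, "humidity": 0.152, "windy": 0.048, "temperature": 0.029}
ratios = {"outlook": 0.156, "humidity": 0.152, "windy": 0.049, "temperature": 0.019}

average_gain = sum(gains.values()) / len(gains)                 # 0.119
candidates = [a for a in gains if gains[a] >= average_gain]     # ['outlook', 'humidity']
best = max(candidates, key=lambda a: ratios[a])                 # 'outlook'
print(best)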
An alternative method to Information Gain is the Gini Index.
The Gini index for a node D is:
gini(D) = 1 - (p1^2 + p2^2 + ... + pn^2)
where p1 ... pn are the frequency ratios of classes 1 ... n in D.
So the Gini index for the entire set:
= 1 - ((9/14)^2 + (5/14)^2)
= 1 - (0.413 + 0.127)
= 0.459
The Gini value of a split of D into subsets D1 ... Dn is:
gini_split(D) = (N1/N)*gini(D1) + (N2/N)*gini(D2) + ... + (Nn/N)*gini(Dn)
where Ni is the size of subset Di, and N is the size of D.
eg: Outlook splits the 14 instances 5, 4, 5:
gini_split = 5/14 * gini(sunny) + 4/14 * gini(overcast) + 5/14 * gini(rainy)
gini(sunny) = 1 - ((2/5)^2 + (3/5)^2) = 1 - 0.52 = 0.48
gini(overcast) = 1 - ((4/4)^2 + (0/4)^2) = 0.0
gini(rainy) = gini(sunny) = 0.48
gini_split = (5/14 * 0.48) * 2 = 0.343
The attribute that generates the smallest gini split value is chosen
to split the node on.
(Left as an exercise for you to do!)
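A sketch you could start the exercise from (function names mine; the outlook figures match the worked example above):

def gini(*counts):
    # gini(D) = 1 - sum of squared class frequency ratios
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(branches):
    # Weighted average of the gini values of the subsets produced by a split.
    n = sum(sum(b) for b in branches)
    return sum(sum(b) / n * gini(*b) for b in branches)

print(round(gini(9, 5), 3))                              # whole data set: 0.459
print(round(gini_split([(2, 3), (4, 0), (3, 2)]), 3))    # outlook: 0.343
# humidity, windy and temperature are left for the exercise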
Gini is used in CART (Classification and Regression Trees), IBM's
Intelligent Miner system, and SPRINT (Scalable PaRallelizable
INduction of decision Trees). The index is named after the Italian
statistician Corrado Gini, who used it to measure income inequality.
The various problems that a good DT builder needs to address:
– Ordering of Attribute Splits
As seen, we need to build the tree picking the best attribute to split on first.
– Numeric/Missing Data
Dividing numeric data is more complicated. How?
– Tree Structure
A balanced tree with the fewest levels is preferable.
– Stopping Criteria
Like with rules, we need to stop adding nodes at some point. When?
– Pruning
It may be beneficial to prune the tree once it has been created. Or should pruning happen incrementally, as the tree is built?
Further Reading:
● Introductory statistical text books
● Witten, 3.2, 4.3
● Dunham, 4.4
● Han, 6.3
● Berry and Browne, Chapter 4
● Berry and Linoff, Chapter 6