Tan, Steinbach, Kumar: Introduction to Data Mining (4/18/2004)
Classification: Definition
- Given a collection of records (the training set): each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible; a test set is used to estimate the accuracy of the model.
[Figure: illustrating the classification task]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

A learning algorithm performs induction on the Training Set to learn a model. The model is then applied (deduction) to records whose class is unknown.

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Classification Techniques
- Decision Tree-based Methods
- Rule-based Methods
- Memory-based Reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines
Example of a Decision Tree

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (splitting attributes: Refund, MarSt, TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO
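The tree above is just a chain of attribute tests. A minimal sketch in Python, hard-coding the tree from this slide (the attribute names and the 80K threshold come from the figure; treating income as a number in thousands is an assumption of this sketch):

```python
def classify(refund, marital_status, taxable_income_k):
    """Walk the decision tree from the slide: Refund, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"                      # Refund = Yes -> leaf NO
    if marital_status == "Married":
        return "No"                      # Married -> leaf NO
    # Single or Divorced: test Taxable Income against 80K
    return "Yes" if taxable_income_k > 80 else "No"

# The tree reproduces the Cheat label of every training record (Tid 1-10):
training = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes", "Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]
assert all(classify(r, m, t) == cheat for r, m, t, cheat in training)
```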
Another Example of Decision Tree (fit to the same training data, Tid 1-10, as above):

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There can be more than one tree that fits the same data.
Decision Tree Classification Task

[Figure: the same flow as above. A Tree Induction algorithm learns a Decision Tree model from the Training Set (Tid 1-10: Attrib1, Attrib2, Attrib3, Class); the model is then applied (deduction) to the Test Set (Tid 11-15) to assign Class labels.]
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and walk the record down:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

Refund = No, then MarSt = Married, so the record reaches the Married leaf: assign Cheat to No.
Many Algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t (training data: the Tid 1-10 table above).

General procedure:
- If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
- If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets; recursively apply the procedure to each subset.
Hunt's Algorithm (applied to the training data above)

Step 1: a single node labeled Don't Cheat.
Step 2: split on Refund: Yes -> Don't Cheat; No -> Don't Cheat.
Step 3: under Refund = No, split on Marital Status: Married -> Don't Cheat; Single, Divorced -> Cheat.
Step 4: under Single, Divorced, split on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.
Tree Induction

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

Issues:
- Determine how to split the records
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting
Splitting Based on Nominal Attributes

Multi-way split: use as many partitions as distinct values.
  CarType: Family | Sports | Luxury

Binary split: divide the values into two subsets; need to find an optimal partitioning.
  CarType: {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes

Multi-way split: Size: Small | Medium | Large

Binary split: {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}

What about {Small, Large} vs {Medium}? (It violates the order of the attribute values.)
Splitting Based on Continuous Attributes

Binary decision: Taxable Income > 80K? (Yes / No)

Multi-way split: Taxable Income? -> < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
How to determine the best split? Before splitting: 10 records of class C0 and 10 records of class C1.

Own Car?    Yes: C0: 6, C1: 4             No: C0: 4, C1: 6
Car Type?   Family: C0: 1, C1: 3          Sports: C0: 8, C1: 0          Luxury: C0: 1, C1: 7
Student ID? c1: C0: 1, C1: 0 ... c10: C0: 1, C1: 0; c11: C0: 0, C1: 1 ... c20: C0: 0, C1: 1

Which test condition is the best?
Greedy approach: nodes with a homogeneous class distribution are preferred.

C0: 5, C1: 5: non-homogeneous, high degree of impurity
C0: 9, C1: 1: homogeneous, low degree of impurity
Measures of Node Impurity:
- Gini Index
- Entropy
- Misclassification error
How to find the best split: before splitting, the node has counts C0: N00, C1: N01 and impurity M0. Candidate split A? yields node N1 (C0: N10, C1: N11, impurity M1) and node N2 (C0: N20, C1: N21, impurity M2), with weighted children impurity M12. Candidate split B? yields nodes N3 (N30, N31, impurity M3) and N4 (N40, N41, impurity M4), with weighted children impurity M34.

Gain = M0 - M12 vs. M0 - M34: choose the split with the higher gain (the lower children impurity).
Measure of Impurity: GINI

Gini index for a node t:

  GINI(t) = 1 - Σ_j [p(j | t)]²

where p(j | t) is the relative frequency of class j at node t.

Examples:
C1 = 0, C2 = 6: Gini = 0.000
C1 = 1, C2 = 5: Gini = 0.278
C1 = 2, C2 = 4: Gini = 0.444
C1 = 3, C2 = 3: Gini = 0.500
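The formula and the four example nodes can be checked directly; a minimal sketch (pure Python, no dependencies):

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) are class frequencies at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The four example nodes from the slide:
assert round(gini([0, 6]), 3) == 0.000
assert round(gini([1, 5]), 3) == 0.278
assert round(gini([2, 4]), 3) == 0.444
assert round(gini([3, 3]), 3) == 0.500
```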
Examples for computing GINI:

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - 0² - 1² = 0.000
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)² - (5/6)² = 0.278
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)² - (4/6)² = 0.444
Splitting Based on GINI: when a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
Binary Attributes: Computing the GINI Index

Splits into two partitions; the effect of weighting the partitions is that larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6; Gini = 0.500

Split B? into Node N1 (C1 = 5, C2 = 2) and Node N2 (C1 = 1, C2 = 4):
Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
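Weighting the children by size, as GINI_split prescribes, can be sketched as follows (pure Python; the counts are the N1/N2 example above, each child's Gini computed with that child's own size as denominator):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(i) over the child nodes."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Parent (6, 6) has Gini 0.5; splitting into (5, 2) and (1, 4) lowers it:
assert round(gini([6, 6]), 3) == 0.5
assert round(gini([5, 2]), 3) == 0.408
assert round(gini([1, 4]), 3) == 0.320
assert round(gini_split([[5, 2], [1, 4]]), 3) == 0.371
```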
Categorical Attributes: Computing the Gini Index

For each distinct value, gather counts for each class in the dataset.

Multi-way split:
CarType   Family  Sports  Luxury
C1        1       2       1
C2        4       1       1
Gini = 0.393

Two-way split (find the best partition of values):
CarType   {Sports, Luxury}  {Family}
C1        3                 1
C2        2                 4
Gini = 0.400

CarType   {Family, Luxury}  {Sports}
C1        2                 2
C2        5                 1
Gini = 0.419
Continuous Attributes: Computing the Gini Index

(Training data Tid 1-10 as above; binary test: Taxable Income > 80K?)

For efficient computation: sort the attribute values, then linearly scan them, updating the count matrix and computing the Gini index at each candidate split position; choose the position with the smallest Gini.

Sorted values (Cheat label): 60 (No), 70 (No), 75 (No), 85 (Yes), 90 (Yes), 95 (Yes), 100 (No), 120 (No), 125 (No), 220 (No)

Split positions:  55     65     72     80     87     92     97     110    122    172    230
Gini (<= vs >):   0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The best split is Taxable Income <= 97, with Gini = 0.300.
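The sorted-scan procedure can be sketched directly from the table (pure Python; midpoints between adjacent sorted values are used as candidate thresholds, so the best split comes out as 97.5 where the slide rounds to 97):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_threshold(values, labels):
    """Sort by value, scan candidate midpoints, keep the lowest weighted Gini."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        t = (pairs[i][0] + pairs[i + 1][0]) / 2          # candidate split position
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        w = (len(left) * gini([left.count("Yes"), left.count("No")]) +
             len(right) * gini([right.count("Yes"), right.count("No")])) / len(pairs)
        if w < best[1]:
            best = (t, w)
    return best

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]     # Taxable Income (K)
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
t, g = best_threshold(incomes, cheat)
assert t == 97.5 and round(g, 3) == 0.300
```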
Alternative Splitting Criteria based on INFO: entropy at a node t:

  Entropy(t) = - Σ_j p(j | t) log₂ p(j | t)
Examples for computing Entropy:

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Entropy = -0 log 0 - 1 log 1 = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Entropy = -(1/6) log₂(1/6) - (5/6) log₂(5/6) = 0.65
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Entropy = -(2/6) log₂(2/6) - (4/6) log₂(4/6) = 0.92
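The three example nodes can be checked with a small sketch (pure Python; the 0 log 0 = 0 convention is made explicit):

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t); 0 log 0 is taken as 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

assert round(entropy([0, 6]), 2) == 0.00
assert round(entropy([1, 5]), 2) == 0.65
assert round(entropy([2, 4]), 2) == 0.92
```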
Splitting Based on INFO. Information Gain:

  GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i. This measures the reduction in entropy achieved by the split; choose the split that achieves the most reduction (maximizes GAIN). Disadvantage: it tends to prefer splits that result in a large number of partitions, each small but pure.
Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = - Σ_{i=1..k} (n_i / n) log(n_i / n)

Parent node p is split into k partitions; n_i is the number of records in partition i. SplitINFO adjusts the information gain by the entropy of the partitioning, penalizing large numbers of small partitions. Used in C4.5.
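Information gain and its SplitINFO correction can be sketched together (pure Python; log base 2; the (3, 3) parent split into pure halves is an illustrative example, not from the slides):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gain_ratio(parent, children):
    """GainRATIO_split = GAIN_split / SplitINFO."""
    n = sum(parent)
    gain = entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)
    split_info = -sum(sum(c) / n * log2(sum(c) / n) for c in children)
    return gain / split_info

# A 2-way split of a (3, 3) parent into pure halves: gain = 1 bit, SplitINFO = 1 bit.
assert round(gain_ratio([3, 3], [[3, 0], [0, 3]]), 3) == 1.0
```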
Splitting Criteria based on Classification Error. Classification error at a node t:

  Error(t) = 1 - max_i P(i | t)

Maximum (1 - 1/n_c) when records are equally distributed among all classes; minimum (0.0) when all records belong to one class.
Examples for computing Error:

C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Error = 1 - max(0, 1) = 0
C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Error = 1 - 5/6 = 1/6
C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Error = 1 - 4/6 = 1/3
Misclassification Error vs Gini

Parent: C1 = 7, C2 = 3; Gini = 0.42

Split A? into Node N1 (C1 = 3, C2 = 0) and Node N2 (C1 = 4, C2 = 3):
Gini(N1) = 1 - (3/3)² - (0/3)² = 0
Gini(N2) = 1 - (4/7)² - (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342 (Gini improves!)

By contrast, the misclassification error stays at 0.3 for both the parent and the split.
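The contrast on this slide can be verified numerically; a sketch comparing the two impurity measures on the same split (pure Python):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    """Error(t) = 1 - max_i P(i|t)."""
    return 1.0 - max(counts) / sum(counts)

def weighted(measure, children):
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)

parent, children = [7, 3], [[3, 0], [4, 3]]
assert round(gini(parent), 2) == 0.42
assert round(weighted(gini, children), 2) == 0.34               # Gini improves
assert round(error(parent), 9) == round(weighted(error, children), 9) == 0.3  # error does not
```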
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5
- Simple depth-first construction
- Uses Information Gain
- Sorts continuous attributes at each node
- Needs the entire data to fit in memory
- Unsuitable for large datasets; needs out-of-core sorting
Practical Issues of Classification:
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
Underfitting and Overfitting (Example)

Circular points: 0.5 <= sqrt(x1² + x2²) <= 1
Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5
Underfitting: when model is too simple, both training and test errors are large
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region: an insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
Occam's Razor
[Figure: Minimum Description Length (MDL). A table X with attributes X1 ... Xn: the sender knows the labels y (1, 0, 0, 1, ...), the receiver sees the same records with unknown labels (?). Instead of sending the labels, the sender can transmit a decision tree (tests A?, B?, C? with outcomes Yes/No, B1/B2, C1/C2) that encodes them.]
Post-pruning
- Grow the decision tree to its entirety
- Trim the nodes of the decision tree in a bottom-up fashion
- If the generalization error improves after trimming, replace the sub-tree by a leaf node
- The class label of the leaf node is determined from the majority class of instances in the sub-tree
- Can use MDL for post-pruning
Example of Post-Pruning

Node counts: Class = Yes: 20, Class = No: 10.

Training error (before splitting) = 10/30; pessimistic error = (10 + 0.5)/30 = 10.5/30.

Split on A? into A1, A2, A3, A4, each leaf labeled by its majority class (Class = Yes or Class = No).

Training error (after splitting) = 9/30; pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30.

Since 11/30 > 10.5/30: PRUNE!
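The pruning decision reduces to comparing pessimistic error estimates with a penalty of 0.5 per leaf; a sketch of that arithmetic:

```python
def pessimistic_error(training_errors, num_leaves, n=30, penalty=0.5):
    """(training errors + penalty * number of leaves) / number of records."""
    return (training_errors + penalty * num_leaves) / n

before = pessimistic_error(10, 1)   # single node: (10 + 0.5)/30
after = pessimistic_error(9, 4)     # after splitting on A: (9 + 4 * 0.5)/30
assert round(before, 3) == 0.350
assert round(after, 3) == round(11 / 30, 3)
assert after > before               # pessimistic error grows, so prune the split
```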
Examples of Post-pruning

Case 1: children with C0: 11, C1: 3 and C0: 2, C1: 4
Case 2: children with C0: 14, C1: 3 and C0: 2, C1: 2

For each case: should the sub-tree be pruned under the optimistic error estimate? Under the pessimistic error estimate?
Computing the Impurity Measure with a Missing Value

Training data (Tid 1-10) as above, except that Tid 10's Refund is missing (Refund = ?, Single, 90K, Class = Yes).

Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

Split on Refund (class counts):

             Class = Yes   Class = No
Refund=Yes   0             3
Refund=No    2             4
Refund=?     1             0

Entropy(Refund=Yes) = 0
Entropy(Refund=No) = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
Entropy(Children) = 0.3 × 0 + 0.6 × 0.9183 = 0.551

Entropy difference = 0.8813 - 0.551 = 0.3303; Gain = 0.9 × 0.3303 = 0.2973 (the factor 0.9 accounts for the 9 of 10 records with Refund observed).
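The gain computation on this slide, where records with a known Refund contribute the weight 9/10, can be checked in a few lines (pure Python; log base 2):

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

parent = [3, 7]                 # Cheat: 3 Yes, 7 No over all 10 records
refund_yes = [0, 3]             # records with known Refund = Yes
refund_no = [2, 4]              # records with known Refund = No

children = 0.3 * entropy(refund_yes) + 0.6 * entropy(refund_no)
gain = 0.9 * (entropy(parent) - children)   # 9 of 10 records have Refund observed

assert round(entropy(parent), 4) == 0.8813
assert round(entropy(refund_no), 4) == 0.9183
assert round(children, 3) == 0.551
assert round(gain, 4) == 0.2973
```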
Distribute Instances

Training data as above: Tids 1-9 have known Refund values; Tid 10 (Refund = ?, Single, 90K, Class = Yes) has a missing value.

The probability that Refund = Yes is 3/9 and that Refund = No is 6/9, so send Tid 10 down both branches with those weights:

Refund = Yes:  Class=Yes: 0 + 3/9,  Class=No: 3
Refund = No:   Class=Yes: 2 + 6/9,  Class=No: 4
Classify Instances

New record: Tid 11, Refund = No, Marital Status = Married, Taxable Income = 85K, Class = ?

(Tree as before: Refund? Yes -> NO; No -> MarSt?: Single, Divorced -> TaxInc? (< 80K -> NO, > 80K -> YES); Married -> NO)

Weighted class counts at the MarSt node:

            Married   Single   Divorced   Total
Class=No    3         1        0          4
Class=Yes   6/9       1        1          2.67
Total       3.67      2        1          6.67

The probability that Marital Status = Married is 3.67/6.67.
Other Issues
- Data Fragmentation
- Search Strategy
- Expressiveness
- Tree Replication
Data Fragmentation
- The number of instances gets smaller as you traverse down the tree
- The number of instances at the leaf nodes may be too small to make any statistically significant decision
Search Strategy
- Finding an optimal decision tree is NP-hard
- The algorithms presented so far use a greedy, top-down, recursive partitioning strategy
- Other strategies? Bottom-up; bi-directional
Expressiveness
Decision Boundary

[Figure: points of two classes in the unit square (axes 0 to 1), partitioned by axis-parallel lines at x = 0.43, y = 0.33, and y = 0.47]

x < 0.43?
  Yes -> y < 0.33?
           Yes -> class counts 4 : 0
           No  -> class counts 0 : 4
  No  -> y < 0.47?
           Yes -> class counts 0 : 3
           No  -> class counts 4 : 0

The border between two neighboring regions of different classes is known as the decision boundary. The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique Decision Trees

A test condition may involve multiple attributes, e.g. x + y < 1 (Class = + on one side of the line, Class = - on the other). This is a more expressive representation, but finding the optimal test condition is computationally expensive.
Tree Replication

[Figure: the same subtree, rooted at a test P, appears in multiple branches of the decision tree.]
Model Evaluation
- Metrics for performance evaluation: how to evaluate the performance of a model?
- Methods for performance evaluation: how to obtain reliable estimates?
- Methods for model comparison: how to compare the relative performance of competing models?
Confusion Matrix:

                       PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
Limitation of Accuracy
- Consider a 2-class problem with 9990 examples of Class 0 and 10 examples of Class 1
- A model that predicts everything to be Class 0 has accuracy 9990/10000 = 99.9%: accuracy is misleading because the model does not detect any Class 1 example
Cost Matrix

                       PREDICTED CLASS
C(i|j)                 Class=Yes    Class=No
ACTUAL    Class=Yes    C(Yes|Yes)   C(No|Yes)
CLASS     Class=No     C(Yes|No)    C(No|No)

C(i|j): the cost of misclassifying a class j example as class i.
Computing the Cost of Classification

Cost matrix:
                   PREDICTED CLASS
C(i|j)             +      -
ACTUAL    +        -1     100
CLASS     -        1      0

Model M1:
                   PREDICTED CLASS
                   +      -
ACTUAL    +        150    40
CLASS     -        60     250
Accuracy = 80%, Cost = 3910

Model M2:
                   PREDICTED CLASS
                   +      -
ACTUAL    +        250    45
CLASS     -        5      200
Accuracy = 90%, Cost = 4255
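Both accuracy and total cost follow from the confusion matrix and the cost matrix; a sketch reproducing the M1 and M2 numbers (pure Python):

```python
def evaluate(confusion, cost_matrix):
    """confusion and cost_matrix are 2x2: rows = actual class, cols = predicted class."""
    n = sum(sum(row) for row in confusion)
    accuracy = (confusion[0][0] + confusion[1][1]) / n
    cost = sum(confusion[i][j] * cost_matrix[i][j] for i in range(2) for j in range(2))
    return accuracy, cost

cost_matrix = [[-1, 100], [1, 0]]            # C(i|j) for classes +, -
m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]

assert evaluate(m1, cost_matrix) == (0.8, 3910)
assert evaluate(m2, cost_matrix) == (0.9, 4255)
```

M2 is more accurate yet more costly: the cost matrix penalizes false negatives (100) far more than false positives (1).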
Cost vs Accuracy

Count:                 PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    a           b
CLASS     Class=No     c           d
N = a + b + c + d; Accuracy = (a + d) / N

Cost:                  PREDICTED CLASS
                       Class=Yes   Class=No
ACTUAL    Class=Yes    p           q
CLASS     Class=No     q           p

Accuracy is proportional to cost if C(Yes|No) = C(No|Yes) = q and C(Yes|Yes) = C(No|No) = p:

Cost = p(a + d) + q(b + c)
     = p(a + d) + q(N - a - d)
     = qN - (q - p)(a + d)
     = N [q - (q - p) × Accuracy]
Cost-Sensitive Measures

  Precision (p) = a / (a + c)
  Recall (r)    = a / (a + b)
  F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)

  Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
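These follow directly from the confusion-matrix cells; a sketch (pure Python, with an illustrative count set TP = 8, FN = 2, FP = 2 that is not from the slides):

```python
def precision(a, c):   # a = TP, c = FP
    return a / (a + c)

def recall(a, b):      # b = FN
    return a / (a + b)

def f_measure(a, b, c):
    """F = 2rp / (r + p), equivalently 2a / (2a + b + c)."""
    return 2 * a / (2 * a + b + c)

assert precision(8, 2) == 0.8
assert recall(8, 2) == 0.8
assert f_measure(8, 2, 2) == 0.8

# The two forms of F agree:
p, r = precision(8, 2), recall(8, 2)
assert abs(2 * r * p / (r + p) - f_measure(8, 2, 2)) < 1e-12
```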
Learning Curve

[Figure: accuracy vs. sample size] A learning curve shows how accuracy changes with varying sample size. It requires a sampling schedule for creating the curve: arithmetic sampling (Langley et al.) or geometric sampling (Provost et al.). Effect of small sample size: bias and variance of the estimate.
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation: partition data into k disjoint subsets; k-fold: train on k-1 partitions, test on the remaining one; leave-one-out: k = n
- Stratified sampling: oversampling vs. undersampling
- Bootstrap: sampling with replacement
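The k-fold scheme partitions the data into k disjoint subsets and rotates which one is the test fold; a sketch of the index bookkeeping (pure Python; no shuffling, which real implementations would add):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists: test on fold i, train on the other k-1 folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(10, 5))
assert len(splits) == 5
for train, test in splits:
    assert len(test) == 2 and len(train) == 8
    assert sorted(train + test) == list(range(10))   # disjoint folds cover everything
```

Leave-one-out is the special case k = n, where each test fold holds a single record.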
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive

At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve

(TP, FP):
- (0, 0): declare everything to be the negative class
- (1, 1): declare everything to be the positive class
- (1, 0): ideal

Diagonal line: random guessing. Below the diagonal line: the prediction is the opposite of the true class.
Using ROC for Model Comparison
- No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR
- Area under the ROC curve: ideal, area = 1; random guess, area = 0.5
How to Construct an ROC Curve

- Use a classifier that produces a posterior probability P(+|A) for each test instance A
- Sort the instances according to P(+|A) in decreasing order
- Apply a threshold at each value of P(+|A); count TP, FP, TN, FN at each threshold
- TPR = TP / (TP + FN), FPR = FP / (FP + TN)

Instance  P(+|A)  True Class
1         0.95    +
2         0.93    +
3         0.87    -
4         0.85    -
5         0.85    -
6         0.85    +
7         0.76    -
8         0.53    +
9         0.43    -
10        0.25    +

Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
TP            5     4     4     3     3     3     3     2     2     1     0
FP            5     5     4     4     3     2     1     1     0     0     0
TN            0     0     1     1     2     3     4     4     5     5     5
FN            0     1     1     2     2     2     2     3     3     4     5
TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

ROC Curve: plot TPR against FPR.
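The table can be regenerated by sweeping the threshold over the sorted scores; a sketch using the ten instances above (pure Python; thresholds are the unique score values, so the repeated 0.85 rows of the slide collapse into a single point here):

```python
def roc_points(scores, labels):
    """For each threshold t (classify positive when score >= t), return (FPR, TPR)."""
    pos = labels.count("+")
    neg = labels.count("-")
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "+")
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == "-")
        points.append((fp / neg, tp / pos))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
pts = roc_points(scores, labels)
assert (0.6, 0.6) in pts      # threshold 0.85: FPR = 0.6, TPR = 0.6
assert (0.2, 0.4) in pts      # threshold 0.87: FPR = 0.2, TPR = 0.4
assert pts[-1] == (1.0, 1.0)  # threshold 0.25: everything classified positive
```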
Test of Significance
- Given two models: M1 with accuracy 85% tested on 30 instances, and M2 with accuracy 75% tested on 5000 instances
- Can we say M1 is better than M2? How much confidence can we place in each accuracy? Can the difference be explained by random fluctuations in the test sets?
Confidence Interval for Accuracy

Each prediction can be regarded as a Bernoulli trial with success probability p, so the number of correct predictions x follows a binomial distribution: x ~ Bin(N, p).

Example: toss a fair coin 50 times; how many heads turn up? Expected number of heads = N × p = 50 × 0.5 = 25.

Given x (or equivalently acc = x/N) and N, can we estimate p, the true accuracy of the model?
For large N, acc has a normal distribution with mean p and variance p(1 - p)/N:

  P( -Z_{α/2} <= (acc - p) / sqrt(p(1 - p)/N) <= Z_{1-α/2} ) = 1 - α

(the area between -Z_{α/2} and Z_{1-α/2} is 1 - α). Solving for p gives the confidence interval

  p = ( 2 N acc + Z² ± Z sqrt(Z² + 4 N acc - 4 N acc²) ) / ( 2 (N + Z²) ),  Z = Z_{α/2}
Example: a model produces accuracy acc = 0.8. Standard normal quantiles:

1 - α    Z_{α/2}
0.99     2.58
0.98     2.33
0.95     1.96
0.90     1.65

95% confidence intervals for p as N varies:

N         50     100    500    1000   5000
p(lower)  0.670  0.711  0.763  0.774  0.789
p(upper)  0.888  0.866  0.833  0.824  0.811
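The p(lower) and p(upper) rows come from solving the normal approximation for p; a sketch (pure Python; Z = 1.96 for 95% confidence):

```python
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    """Solve P(-z <= (acc - p)/sqrt(p(1-p)/n) <= z) for p (95% by default)."""
    center = 2 * n * acc + z * z
    spread = z * sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

# Model with acc = 0.8: the interval tightens as the test set grows.
lo, hi = accuracy_interval(0.8, 100)
assert abs(lo - 0.711) < 1e-3 and abs(hi - 0.866) < 1e-3
lo, hi = accuracy_interval(0.8, 5000)
assert (round(lo, 3), round(hi, 3)) == (0.789, 0.811)
```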
Comparing the Performance of Two Models

Given model M1 with error rate e1 tested on D1 (n1 records), and model M2 with error rate e2 tested on D2 (n2 records). For large n1 and n2, the error rates are approximately normal:

  e1 ~ N(μ1, σ1),  e2 ~ N(μ2, σ2)

Approximate each variance by:

  σ̂_i² = e_i (1 - e_i) / n_i
To test whether the performance difference is statistically significant, let d = e1 - e2, with variance

  σ_d² = σ1² + σ2² ≈ e1(1 - e1)/n1 + e2(1 - e2)/n2

At the (1 - α) confidence level,

  d_t = d ± Z_{α/2} σ̂_d
An Illustrative Example

Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
d = |e2 - e1| = 0.10 (2-sided test)

  σ̂_d² = 0.15(1 - 0.15)/30 + 0.25(1 - 0.25)/5000 = 0.0043

At the 95% confidence level, Z_{α/2} = 1.96:

  d_t = 0.100 ± 1.96 × sqrt(0.0043) = 0.100 ± 0.128

The interval contains 0, so the difference may not be statistically significant.
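The variance estimate and the resulting interval can be reproduced in a few lines (pure Python; Z = 1.96 at 95% confidence):

```python
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    """d = |e1 - e2| with variance e1(1-e1)/n1 + e2(1-e2)/n2."""
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2
    d = abs(e1 - e2)
    return d - z * sqrt(var), d + z * sqrt(var), var

lo, hi, var = difference_interval(0.15, 30, 0.25, 5000)
assert round(var, 4) == 0.0043
assert round(lo, 3) == -0.028 and round(hi, 3) == 0.228
assert lo < 0 < hi   # the interval contains 0: the difference may not be significant
```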
Comparing the Performance of Two Algorithms (k-fold cross-validation, paired t-test): for each fold j, let d_j = e1j - e2j, with mean d̄ over the k folds. Estimate

  σ̂²_{d̄} = Σ_{j=1..k} (d_j - d̄)² / (k(k - 1))

At the (1 - α) confidence level,

  d_t = d̄ ± t_{1-α, k-1} σ̂_{d̄}