
Learning Classifiers

Initial Model (Assumptions)

• Instances xi in dataset D are mapped to a feature space
• Classes associated with instances: X, ◦, separated by a decision boundary
• Classification:
  f(xi) = c ∈ {X, ◦}
  – with xi,j ∈ {⊤, ⊥}, and f the classifier
  – dataset D is a multiset
• Objective: learn f (supervised learning)

Performance

[Figure: Training Data → Training Process → Trained Model → Testing Process (on Test Data) → Tested Model, each stage with an associated performance]

Performance measures:
• TP: True Positive
• TN: True Negative
• FP: False Positive
• FN: False Negative

Note: if N = |D|, then N = TP + TN + FP + FN

Confusion matrix:

                     Predicted class
                     yes              no
Actual class  yes    true positive    false negative
              no     false positive   true negative

Performance

Performance measures:
• Success rate σ: σ = (TP + TN)/N
• Error rate ε: ε = 1 − σ
• TPR (= recall ρ), True Positive Rate: TPR = TP/(TP + FN)
• FNR, False Negative Rate: FNR = 1 − TPR
• FPR, False Positive Rate: FPR = FP/(FP + TN)
• TNR, True Negative Rate: TNR = 1 − FPR
• Precision π: π = TP/(TP + FP)
• F-measure: F = 2·ρ·π/(ρ + π) = 2TP/(2TP + FP + FN)

Example: choosing contact lenses

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   no   reduced               none
young           myope                   no   normal                soft
young           myope                   yes  reduced               none
young           myope                   yes  normal                hard
young           hypermetrope            no   reduced               none
young           hypermetrope            no   normal                soft
young           hypermetrope            yes  reduced               none
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   no   reduced               none
pre-presbyopic  myope                   no   normal                soft
pre-presbyopic  myope                   yes  reduced               none
pre-presbyopic  myope                   yes  normal                hard
pre-presbyopic  hypermetrope            no   reduced               none
pre-presbyopic  hypermetrope            no   normal                soft
pre-presbyopic  hypermetrope            yes  reduced               none
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   no   reduced               none
presbyopic      myope                   no   normal                none
presbyopic      myope                   yes  reduced               none
presbyopic      myope                   yes  normal                hard
presbyopic      hypermetrope            no   reduced               none
presbyopic      hypermetrope            no   normal                soft
presbyopic      hypermetrope            yes  reduced               none
presbyopic      hypermetrope            yes  normal                none

Ast = Astigmatism
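All of these measures follow directly from the four counts TP, TN, FP and FN. A minimal Python sketch (illustrative only; the function name and output layout are my own, not from the slides):

# Performance measures derived from the four counts of a binary confusion matrix.
def performance(tp: int, tn: int, fp: int, fn: int) -> dict:
    n = tp + tn + fp + fn                  # N = |D|
    sigma = (tp + tn) / n                  # success rate
    tpr = tp / (tp + fn)                   # recall / true positive rate
    fpr = fp / (fp + tn)                   # false positive rate
    pi = tp / (tp + fp)                    # precision
    f = 2 * tpr * pi / (tpr + pi)          # F-measure = 2TP / (2TP + FP + FN)
    return {"success rate": sigma, "error rate": 1 - sigma,
            "TPR": tpr, "FNR": 1 - tpr, "FPR": fpr, "TNR": 1 - fpr,
            "precision": pi, "F-measure": f}

# Example: treating Lens = none as the positive class in the ZeroR output further
# below (TP = 15, FN = 0, FP = 9, TN = 0) gives success rate 0.625, precision 0.625
# and F-measure ≈ 0.769, matching the WEKA figures.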
Rule representation and reasoning

Rule representation:
• Logical implication (= rules): LHS → RHS
  (LHS = left-hand side = antecedent; RHS = right-hand side = consequent)
• Literals in LHS and RHS are of the form:
  Variable ◦ value (or Attribute ◦ value)
  where ◦ ∈ {<, ≤, =, >, ≥}

Rule-based reasoning:
R ∪ F ⊢ C
where
• R is a set of rules r ∈ R (rule base)
• F is a set of facts of the form Variable = value
• C is a set of conclusions of the same form as facts

ZeroR

Basic ideas:
• Construct a rule that predicts the majority class
• Used as baseline performance

Example
Contact lenses recommendation rule:
→ Lens = none
• Total number of instances: 24
• Correctly classified instances: 15 (62.5%)
• Incorrectly classified instances: 9 (37.5%)

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0        0        0          0       0          soft
0        0        0          0       0          hard
1        1        0.625      1       0.769      none

=== Confusion Matrix ===
 a  b  c    <-- classified as
 0  0  5  | a = soft
 0  0  4  | b = hard
 0  0 15  | c = none
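A minimal ZeroR sketch in Python (illustrative; representing the training data as a plain list of class labels is an assumption, not from the slides):

# ZeroR: predict the majority class of the training data for every instance.
from collections import Counter

def zero_r(class_values):
    majority, _ = Counter(class_values).most_common(1)[0]
    return lambda instance: majority       # the learned classifier ignores its input

# On the 24-instance contact-lens data the majority class is "none" (15/24),
# so the learned rule is:  → Lens = none, with success rate 15/24 = 62.5%.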

OneR

• Construct a single-condition rule for each variable–value pair
• Select the rules defined for a single variable (in the condition) which perform best

OneR(classvar)
{
  R ← ∅
  for each var ∈ Vars do
    for each value ∈ Domain(var) do
      classvar.most-freq-value ← MostFreq(var.value, classvar)
      rule ← MakeRule(var.value, classvar.most-freq-value)
      R ← R ∪ {rule}
  for each r ∈ R do
    CalculateErrorRate(r)
  R ← SelectBestRulesForSingleVar(R)
}

OneR: Example

Rules for contact-lenses recommendation:
Tears = reduced → Lens = none
Tears = normal → Lens = soft

• 17/24 instances correct
• Correctly classified instances: 17 (70.83%)
• Incorrectly classified instances: 7 (29.17%)

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0.368    0.417      1       0.588      soft
0        0        0          0       0          hard
0.8      0        1          0.8     0.889      none

=== Confusion Matrix ===
 a  b  c    <-- classified as
 5  0  0  | a = soft
 4  0  0  | b = hard
 3  0 12  | c = none
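The pseudocode above can be sketched compactly in Python (illustrative; representing instances as dictionaries is an assumption, not from the slides):

# OneR: for every variable, build one rule per value (predicting the most frequent
# class among instances with that value), then keep the variable whose rule set
# classifies the most training instances correctly.
from collections import Counter, defaultdict

def one_r(instances, class_attr):
    best_var, best_rules, best_correct = None, None, -1
    variables = [a for a in instances[0] if a != class_attr]
    for var in variables:
        by_value = defaultdict(Counter)
        for inst in instances:
            by_value[inst[var]][inst[class_attr]] += 1
        rules = {val: counts.most_common(1)[0][0] for val, counts in by_value.items()}
        correct = sum(counts[rules[val]] for val, counts in by_value.items())
        if correct > best_correct:
            best_var, best_rules, best_correct = var, rules, correct
    return best_var, best_rules, best_correct

# On the contact-lens data this selects the variable Tears with the rules
# reduced → none and normal → soft, classifying 17/24 instances correctly.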
Generalisation: separate-and-cover

General principles:
• Rule-set generation for each class value separately
• Peeling: box compression – instances are peeled off (fall outside the box) one face at a time
• PRISM algorithm

Example: choosing contact lenses

Recommended contact lenses: none, soft, hard
1. Choose class value, e.g. hard
2. Construct rule Condition → Lens = hard
3. Determine accuracy α = p/t for all possible conditions (see the sketch after this list), where
   – t: total number of instances covered by the rule
   – p: covered instances with the right (positive) class value

Condition                   α = p/t
Age = young                 2/8
Age = pre-presbyopic        1/8
Age = presbyopic            1/8
Spectacles = myope          3/12
Spectacles = hypermetrope   3/12
Astigmatism = no            0/12
Astigmatism = yes           4/12
Tears = reduced             0/12
Tears = normal              4/12

4. Select best condition (4/12): here Astigmatism = yes
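A small Python sketch of step 3 (illustrative; the dictionary representation of instances and the function name rank_conditions are assumptions):

# Score every candidate condition (var = value) by alpha = p/t for a chosen
# class value (here Lens = hard).
from collections import defaultdict

def rank_conditions(instances, class_attr, class_val):
    counts = defaultdict(lambda: [0, 0])          # (var, value) -> [p, t]
    for inst in instances:
        for var, value in inst.items():
            if var == class_attr:
                continue
            counts[(var, value)][1] += 1          # t: instances covered by the condition
            if inst[class_attr] == class_val:
                counts[(var, value)][0] += 1      # p: covered and positive
    return sorted(counts.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)

# On the contact-lens data with class value "hard", the top conditions are
# Astigmatism = yes and Tears = normal, both with alpha = 4/12.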

Separate-and-cover algorithm

SC(classvar, D)
{
  R ← ∅
  for each val ∈ Domain(classvar) do
    E ← D
    while E contains instances with val do
      rule ← MakeRule(rhs(classvar.val), lhs(∅))
      IR ← ∅
      until rule is perfect do
        for each var ∈ Vars, ∀rule ∈ IR : var ∉ rule do
          for each value ∈ Domain(var) do
            inter-rule ← Add(rule, lhs(var.value))
            IR ← IR ∪ {inter-rule}
        rule ← SelectRule(IR)
      R ← R ∪ {rule}
      RC ← InstancesCoveredBy(rule, E)
      E ← E\RC
}

SelectRule: based on accuracy α = p/t; if α = α′ for two rules, select the one with the highest p

SelectRule example

Rule:
• RHS: Lens = hard
• LHS: Astigmatism = yes, with α = 4/12

Not very accurate; expanded rule:
(Astigmatism = yes ∧ New-condition) → Lens = hard

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   yes  reduced               none
young           myope                   yes  normal                hard
young           hypermetrope            yes  reduced               none
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   yes  reduced               none
pre-presbyopic  myope                   yes  normal                hard
pre-presbyopic  hypermetrope            yes  reduced               none
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   yes  reduced               none
presbyopic      myope                   yes  normal                hard
presbyopic      hypermetrope            yes  reduced               none
presbyopic      hypermetrope            yes  normal                none

• Age = young (2/4); Age = pre-presbyopic (1/4); Age = presbyopic (1/4)
• Spectacles = myope (3/6); Spectacles = hypermetrope (1/6)
• Tears = reduced (0/6); Tears = normal (4/6)
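An illustrative Python sketch of the SC loop above, including the SelectRule tie-break on the highest p (the dictionary representation of instances and rules is an assumption, not part of the slides):

def covers(lhs, inst):
    # an instance is covered if it satisfies every (variable = value) test in the LHS
    return all(inst.get(var) == val for var, val in lhs.items())

def accuracy(lhs, instances, class_attr, class_val):
    covered = [i for i in instances if covers(lhs, i)]
    p = sum(1 for i in covered if i[class_attr] == class_val)
    return p, len(covered)                            # (p, t)

def separate_and_cover(instances, class_attr):
    rules = []
    for class_val in {i[class_attr] for i in instances}:
        E = list(instances)
        while any(i[class_attr] == class_val for i in E):
            lhs = {}
            p, t = accuracy(lhs, E, class_attr, class_val)
            while p != t:                             # until the rule is perfect
                candidates = []
                for inst in (i for i in E if covers(lhs, i)):
                    for var, val in inst.items():     # only conditions keeping coverage non-empty
                        if var != class_attr and var not in lhs:
                            cand = dict(lhs, **{var: val})
                            candidates.append((accuracy(cand, E, class_attr, class_val), cand))
                if not candidates:                    # no further refinement possible
                    break
                # SelectRule: highest alpha = p/t, ties broken by the highest p
                (p, t), lhs = max(candidates, key=lambda c: (c[0][0] / c[0][1], c[0][0]))
            rules.append((dict(lhs), class_val))      # rule: lhs -> class_attr = class_val
            E = [i for i in E if not covers(lhs, i)]  # "separate": remove covered instances
    return rules

# On the contact-lens data the first rule found for class value "hard" is equivalent to
# (Astigmatism = yes ∧ Tears = normal ∧ Spectacles = myope) → Lens = hard, as derived
# below (the order in which the three conditions are added may differ).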
SelectRule example (continued)

Rule:
(Astigmatism = yes ∧ Tears = normal) → Lens = hard

Expanded rule:
(Astigmatism = yes ∧ Tears = normal ∧ New-condition) → Lens = hard

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   yes  normal                hard
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   yes  normal                hard
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   yes  normal                hard
presbyopic      hypermetrope            yes  normal                none

• Age = young (2/2); Age = pre-presbyopic (1/2); Age = presbyopic (1/2)
• Spectacles = myope (3/3); Spectacles = hypermetrope (1/3)

⇒ (Astigmatism = yes ∧ Tears = normal ∧ Spectacles = myope) → Lens = hard

SC example (continued)

Delete the 3 instances covered by this rule from E; new rule:
New-condition → Lens = hard

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   no   reduced               none
young           myope                   no   normal                soft
young           myope                   yes  reduced               none
young           hypermetrope            no   reduced               none
young           hypermetrope            no   normal                soft
young           hypermetrope            yes  reduced               none
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   no   reduced               none
pre-presbyopic  myope                   no   normal                soft
pre-presbyopic  myope                   yes  reduced               none
pre-presbyopic  hypermetrope            no   reduced               none
pre-presbyopic  hypermetrope            no   normal                soft
pre-presbyopic  hypermetrope            yes  reduced               none
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   no   reduced               none
presbyopic      myope                   no   normal                none
presbyopic      myope                   yes  reduced               none
presbyopic      hypermetrope            no   reduced               none
presbyopic      hypermetrope            no   normal                soft
presbyopic      hypermetrope            yes  reduced               none
presbyopic      hypermetrope            yes  normal                none

WEKA Results

Rules:
If Astigmatism = no and Tears = normal and Spectacles = hypermetrope then Lens = soft
If Astigmatism = no and Tears = normal and Age = young then Lens = soft
If Age = pre-presbyopic and Astigmatism = no and Tears = normal then Lens = soft
If Astigmatism = yes and Tears = normal and Spectacles = myope then Lens = hard
If Age = young and Astigmatism = yes and Tears = normal then Lens = hard
If Tears = reduced then Lens = none
If Age = presbyopic and Tears = normal and Spectacles = myope and Astigmatism = no then Lens = none
If Spectacles = hypermetrope and Astigmatism = yes and Age = pre-presbyopic then Lens = none
If Age = presbyopic and Spectacles = hypermetrope and Astigmatism = yes then Lens = none

=== Confusion Matrix ===
 a  b  c    <-- classified as
 5  0  0  | a = soft
 0  4  0  | b = hard
 0  0 15  | c = none

Correctly classified instances: 24 (100%)

Limitations

• Adding one condition at a time is greedy search ('optimal' state may be missed)
• Accuracy α = p/t promotes overfitting: the more 'correct' a rule is (the higher p compared to t), the higher α
• Resulting rules cover all instances perfectly

Example
Consider rule r1 with accuracy α1 = 1/1 and rule r2 with accuracy α2 = 19/20; then r1 is considered superior to r2.

Alternative 1: information gain
Alternative 2: probabilistic measure
Information gain

I_D(r) = p′ · [ log(p′/t′) − log(p/t) ]

where
• α = p/t is the accuracy before adding a condition to r
• α′ = p′/t′ is the accuracy after a condition has been added to r

Example
Consider rule r′ with α′ = 1/1 and rule r″ with accuracy α″ = 19/20, both modifications of r with α = 20/200. Then r′ is considered superior to r″ according to accuracy, but

I_D(r′) = 1 · [log(1/1) − log(20/200)] = 1
I_D(r″) = 19 · [log(19/20) − log(20/200)] ≈ 18.6

hence r′ is inferior to r″ according to information gain.

Comparison: accuracy versus information gain

Information gain I:
• Emphasis is on a large number of positive instances
• High-coverage cases first, special cases later
• Resulting rules cover all instances perfectly

Accuracy α:
• Takes the number of positive instances into account only to break ties
• Special cases first, high-coverage cases later
• Resulting rules cover all instances perfectly
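A small Python check of this example (illustrative; it assumes a base-10 logarithm, which reproduces the numbers on the slide):

# Two refinements r' and r'' of a rule r with alpha = 20/200, compared by
# accuracy and by information gain.
from math import log10

def info_gain(p_new, t_new, p_old, t_old):
    # I_D(r) = p' * [log(p'/t') - log(p/t)]
    return p_new * (log10(p_new / t_new) - log10(p_old / t_old))

print(info_gain(1, 1, 20, 200))     # r'  : 1 * [log(1/1) - log(0.1)]    = 1.0
print(info_gain(19, 20, 20, 200))   # r'' : 19 * [log(0.95) - log(0.1)]  ≈ 18.6
# By accuracy r' (1/1) beats r'' (19/20); by information gain r'' wins.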

Probabilistic measure

[Figure: dataset D with N instances, of which M are positive for the class; rule r selects t instances, of which p concern the class]

• N = |D|: # instances in dataset D
• M: # instances in D concerning a class
• t: # instances in D on which rule r succeeds
• p: # positive instances among the t selected by rule r

Hypergeometric distribution (sampling without replacement: the probability that k of the t selected instances belong to the class):

f(k) = C(M, k) · C(N − M, t − k) / C(N, t)

where C(n, k) denotes the binomial coefficient.

• Rule r selects t instances, of which p are positive
• Probability that a randomly chosen rule r′ does as well as or better than r:

P(r′) = Σ_{k=p}^{min{t,M}} f(k) = Σ_{k=p}^{min{t,M}} C(M, k) · C(N − M, t − k) / C(N, t)
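P(r′) can be evaluated with SciPy's hypergeometric distribution; a minimal sketch (illustrative), applied to the running-example rule (Astigmatism = yes ∧ Tears = normal) → Lens = hard, for which N = 24, M = 4, t = 6 and p = 4:

from scipy.stats import hypergeom

def prob_as_good_or_better(n_total, m_pos, t, p):
    # P(r') = P(K >= p) for K ~ Hypergeom(population n_total, m_pos positives, sample size t)
    return hypergeom(M=n_total, n=m_pos, N=t).sf(p - 1)

# N = 24 instances, M = 4 "hard" instances; the rule covers t = 6, of which p = 4 are positive.
print(prob_as_good_or_better(24, 4, 6, 4))   # ≈ 0.0014: unlikely to be matched by a random rule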
Approximation

P(r′) = Σ_{k=p}^{min{t,M}} C(M, k) · C(N − M, t − k) / C(N, t)
      ≈ Σ_{k=p}^{min{t,M}} C(t, k) · (M/N)^k · (1 − M/N)^{t−k}

i.e. the hypergeometric distribution is approximated by a binomial distribution, whose tail sum equals

I_{M/N}(p, t − p + 1)

where I_x(α, β) is the (regularized) incomplete beta function:

I_x(α, β) = (1 / B(α, β)) ∫_0^x z^{α−1} (1 − z)^{β−1} dz

and B(α, β) is the beta function, defined as

B(α, β) = ∫_0^1 z^{α−1} (1 − z)^{β−1} dz

Reduced-error pruning

The danger of overfitting to the training set can be reduced by splitting it into:
• a growing set (GS) (2/3 of the training set)
• a pruning set (PS) (1/3 of the training set)

REP(classvar, D)
{
  R ← ∅; E ← D
  (GS, PS) ← Split(E)
  while E ≠ ∅ do
    IR ← ∅
    for each val ∈ Domain(classvar) do
      if GS and PS contain a val-instance then
        rule ← BSC(classvar.val, GS)
        while P(rule | PS) > P(rule⁻ | PS) do
          rule ← rule⁻
        IR ← IR ∪ {rule}
    rule ← SelectRule(IR); R ← R ∪ {rule}
    RC ← InstancesCoveredBy(rule, E)
    E ← E\RC
    (GS, PS) ← Split(E)
}

BSC is the basic separate-and-cover algorithm, and rule⁻ is the rule with its last condition removed.
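A small SciPy sketch (illustrative, with assumed example numbers) checking that the binomial approximation of the hypergeometric tail equals the regularized incomplete beta function I_{M/N}(p, t − p + 1), as stated on the Approximation slide above:

from scipy.stats import hypergeom, binom
from scipy.special import betainc

n_total, m_pos, t, p = 1000, 300, 20, 10                  # assumed example values for N, M, t, p
exact  = hypergeom(M=n_total, n=m_pos, N=t).sf(p - 1)     # exact hypergeometric tail
approx = binom(t, m_pos / n_total).sf(p - 1)              # binomial approximation
beta   = betainc(p, t - p + 1, m_pos / n_total)           # I_{M/N}(p, t - p + 1)
print(exact, approx, beta)     # approx and beta coincide; both are close to the exact value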

Divide-and-conquer: decision trees

[Figure: decision tree for the contact-lens data — root split on Tears (reduced → none; normal → split on Astigmatism), further splits on Spectacle prescription and Age, with leaves none/soft/hard]

Learning decision trees:
• R. Quinlan: ID3, C4.5 and C5.0
• L. Breiman: CART (Classification and Regression Trees)

Which variable/attribute is best?

[Figure: dataset D split on a binary variable X into the subsets D_{X=yes} and D_{X=no}, each with its own class-value distribution]

Dataset D:
D = D_{X=yes} ∪ D_{X=no}

Entropy:
H_C(X = x) = − Σ_c P(C = c | X = x) ln P(C = c | X = x)

Expected entropy:
EH_C(X) = Σ_x P(X = x) H_C(X = x)
Information gain (again)

Dataset D:
D = D_{X=yes} ∪ D_{X=no}

Entropy:
H_C(X = x) = − Σ_c P(C = c | X = x) ln P(C = c | X = x)

Expected entropy:
EH_C(X) = Σ_x P(X = x) H_C(X = x)

Without the split of the dataset D on variable X, the entropy is:
H_C(⊤) = − Σ_c P(C = c) ln P(C = c)

Information gain G_C from X:
G_C(X) = H_C(⊤) − EH_C(X)

Example: contact lenses recommendation

Class variable is Lens:
P(Lens) = 5/24 if Lens = soft; 4/24 if Lens = hard; 15/24 if Lens = none

H(⊤) = −(5/24) ln(5/24) − (4/24) ln(4/24) − (15/24) ln(15/24) ≈ 0.92

For variable Ast (Astigmatism):
P(Lens | Ast = no) = 5/12 if Lens = soft; 0/12 if Lens = hard; 7/12 if Lens = none

Therefore:
H(Ast = no) = −(5/12) ln(5/12) − (0/12) ln(0/12) − (7/12) ln(7/12) ≈ 0.68

Example (continued)

For variable Ast (Astigmatism):
P(Lens | Ast = yes) = 0/12 if Lens = soft; 4/12 if Lens = hard; 8/12 if Lens = none

Therefore:
H(Ast = yes) = −(0/12) ln(0/12) − (4/12) ln(4/12) − (8/12) ln(8/12) ≈ 0.64

⇒ EH(Ast) = ½ H(Ast = no) + ½ H(Ast = yes) = ½ (0.68 + 0.64) = 0.66

Information gain:
⇒ G(Ast) = H(⊤) − EH(Ast) = 0.92 − 0.66 = 0.26

For variable Tears:
EH(Tears) = ½ H(Tears = red) + ½ H(Tears = norm) ≈ ½ (0.0 + 1.1) = 0.55

Information gain:
⇒ G(Tears) = H(⊤) − EH(Tears) = 0.92 − 0.55 = 0.37

Comparison:
G(Tears) > G(Ast)

⇒ Select Tears as the first splitting variable
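A short Python sketch (illustrative) that reproduces these entropy and information-gain values from the class counts of the contact-lens data (natural log, as in the formulas above):

from math import log

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)

def expected_entropy(partition):
    # partition: one list of class counts per value of the splitting variable
    n = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / n * entropy(counts) for counts in partition)

h_top    = entropy([5, 4, 15])                          # H(T)        ≈ 0.92
eh_ast   = expected_entropy([[5, 0, 7], [0, 4, 8]])     # EH(Ast)     ≈ 0.66
eh_tears = expected_entropy([[0, 0, 12], [5, 4, 3]])    # EH(Tears)   ≈ 0.55
print(h_top - eh_ast, h_top - eh_tears)                 # G(Ast) ≈ 0.26, G(Tears) ≈ 0.37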
Final remarks I

A node with too many branches causes the information gain measure to break down.

Example
Suppose that with each branch of a node a dataset with exactly one instance is associated:
EH_C(X) = n · (1/n) · (1 log 1 + 0 log 0) = 0
if X has n values. Hence, G_C(X) = H_C(⊤) − 0 = H_C(⊤) attains a maximum.

Solution: gain ratio R_C:
• Split information:
  H_X(⊤) = − Σ_x P(X = x) ln P(X = x)
• Gain ratio:
  R_C(X) = G_C(X) / H_X(⊤)

Final remarks II

• Variable selection is myopic: it does not look beyond the effects of its own values; a resulting decision tree is therefore likely to be suboptimal
• Decision trees may grow unwieldy, and may need to be pruned (ID3 ⇒ C4.5)
  – subtree replacement
  – subtree raising
• Decision trees can also be used for numerical variables: regression trees
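The gain ratio can be computed along the same lines as the earlier entropy sketch (illustrative; the partition/class-count representation is an assumption):

from math import log

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)

def gain_ratio(partition, class_counts):
    # G_C(X) = H_C(T) - EH_C(X), divided by the split information H_X(T)
    n = sum(sum(counts) for counts in partition)
    expected = sum(sum(counts) / n * entropy(counts) for counts in partition)
    gain = entropy(class_counts) - expected
    split_info = entropy([sum(counts) for counts in partition])
    return gain / split_info

# Tears splits the 24 instances 12/12, so the split information is ln 2 ≈ 0.69
# and R_C(Tears) ≈ 0.38 / 0.69 ≈ 0.55.
print(gain_ratio([[0, 0, 12], [5, 4, 3]], [5, 4, 15]))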
