
Learning Classifiers

Initial Model (Assumptions)

• Instances xi in dataset D are mapped to a feature space
• Classes associated with instances: X, ◦, separated by a decision boundary
• Classification:
  f(xi) = c ∈ {X, ◦}
  – with xi,j ∈ {⊤, ⊥}, and f the classifier
  – dataset D is a multiset
• Objective: learn f (supervised learning)

Performance

[Figure: Training Data → Training Process → Trained Model → Testing Process (on Test Data) → Tested Model, each stage with an associated performance]

Performance measures:
• TP: True Positive
• TN: True Negative
• FP: False Positive
• FN: False Negative

Note: if N = |D|, then N = TP + TN + FP + FN

Confusion matrix:

                     Predicted class
                     yes              no
Actual class  yes    true positive    false negative
              no     false positive   true negative

Performance

Performance measures:
• Success rate σ: σ = (TP + TN)/N
• Error rate ε: ε = 1 − σ
• TPR (= recall ρ), True Positive Rate: TPR = TP/(TP + FN)
• FNR, False Negative Rate: FNR = 1 − TPR
• FPR, False Positive Rate: FPR = FP/(FP + TN)
• TNR, True Negative Rate: TNR = 1 − FPR
• Precision π: π = TP/(TP + FP)
• F-measure: F = 2·ρ·π/(ρ + π) = 2TP/(2TP + FP + FN)

Example: choosing contact lenses

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   no   reduced               none
young           myope                   no   normal                soft
young           myope                   yes  reduced               none
young           myope                   yes  normal                hard
young           hypermetrope            no   reduced               none
young           hypermetrope            no   normal                soft
young           hypermetrope            yes  reduced               none
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   no   reduced               none
pre-presbyopic  myope                   no   normal                soft
pre-presbyopic  myope                   yes  reduced               none
pre-presbyopic  myope                   yes  normal                hard
pre-presbyopic  hypermetrope            no   reduced               none
pre-presbyopic  hypermetrope            no   normal                soft
pre-presbyopic  hypermetrope            yes  reduced               none
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   no   reduced               none
presbyopic      myope                   no   normal                none
presbyopic      myope                   yes  reduced               none
presbyopic      myope                   yes  normal                hard
presbyopic      hypermetrope            no   reduced               none
presbyopic      hypermetrope            no   normal                soft
presbyopic      hypermetrope            yes  reduced               none
presbyopic      hypermetrope            yes  normal                none

Ast = Astigmatism
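All of these measures follow directly from the four counts TP, TN, FP and FN. A minimal Python sketch (illustrative only; the function name and output layout are my own, not from the slides):

# Performance measures derived from the four counts of a binary confusion matrix.
def performance(tp: int, tn: int, fp: int, fn: int) -> dict:
    n = tp + tn + fp + fn                  # N = |D|
    sigma = (tp + tn) / n                  # success rate
    tpr = tp / (tp + fn)                   # recall / true positive rate
    fpr = fp / (fp + tn)                   # false positive rate
    pi = tp / (tp + fp)                    # precision
    f = 2 * tpr * pi / (tpr + pi)          # F-measure = 2TP / (2TP + FP + FN)
    return {"success rate": sigma, "error rate": 1 - sigma,
            "TPR": tpr, "FNR": 1 - tpr, "FPR": fpr, "TNR": 1 - fpr,
            "precision": pi, "F-measure": f}

# Example: treating Lens = none as the positive class in the ZeroR output further
# below (TP = 15, FN = 0, FP = 9, TN = 0) gives success rate 0.625, precision 0.625
# and F-measure ≈ 0.769, matching the WEKA figures.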
Rule representation and reasoning

Rule representation:
• Logical implication (= rules): LHS → RHS
  (LHS = left-hand side = antecedent; RHS = right-hand side = consequent)
• Literals in LHS and RHS are of the form:
  Variable ◦ value (or Attribute ◦ value)
  where ◦ ∈ {<, ≤, =, >, ≥}

Rule-based reasoning:
R ∪ F ⊢ C
where
• R is a set of rules r ∈ R (rule base)
• F is a set of facts of the form Variable = value
• C is a set of conclusions of the same form as facts

ZeroR

Basic ideas:
• Construct a rule that predicts the majority class
• Used as baseline performance

Example
Contact lenses recommendation rule:
→ Lens = none
• Total number of instances: 24
• Correctly classified instances: 15 (62.5%)
• Incorrectly classified instances: 9 (37.5%)

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
0        0        0          0       0          soft
0        0        0          0       0          hard
1        1        0.625      1       0.769      none

=== Confusion Matrix ===
 a  b  c    <-- classified as
 0  0  5  | a = soft
 0  0  4  | b = hard
 0  0 15  | c = none
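A minimal ZeroR sketch in Python (illustrative; representing the training data as a plain list of class labels is an assumption, not from the slides):

# ZeroR: predict the majority class of the training data for every instance.
from collections import Counter

def zero_r(class_values):
    majority, _ = Counter(class_values).most_common(1)[0]
    return lambda instance: majority       # the learned classifier ignores its input

# On the 24-instance contact-lens data the majority class is "none" (15/24),
# so the learned rule is:  → Lens = none, with success rate 15/24 = 62.5%.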

OneR

• Construct a single-condition rule for each variable–value pair
• Select the rules defined for a single variable (in the condition) which perform best

OneR(classvar)
{
  R ← ∅
  for each var ∈ Vars do
    for each value ∈ Domain(var) do
      classvar.most-freq-value ← MostFreq(var.value, classvar)
      rule ← MakeRule(var.value, classvar.most-freq-value)
      R ← R ∪ {rule}
  for each r ∈ R do
    CalculateErrorRate(r)
  R ← SelectBestRulesForSingleVar(R)
}

OneR: Example

Rules for contact-lenses recommendation:
Tears = reduced → Lens = none
Tears = normal → Lens = soft

• 17/24 instances correct
• Correctly classified instances: 17 (70.83%)
• Incorrectly classified instances: 7 (29.17%)

=== Detailed Accuracy By Class ===
TP Rate  FP Rate  Precision  Recall  F-Measure  Class
1        0.368    0.417      1       0.588      soft
0        0        0          0       0          hard
0.8      0        1          0.8     0.889      none

=== Confusion Matrix ===
 a  b  c    <-- classified as
 5  0  0  | a = soft
 4  0  0  | b = hard
 3  0 12  | c = none
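The pseudocode above can be sketched compactly in Python (illustrative; representing instances as dictionaries is an assumption, not from the slides):

# OneR: for every variable, build one rule per value (predicting the most frequent
# class among instances with that value), then keep the variable whose rule set
# classifies the most training instances correctly.
from collections import Counter, defaultdict

def one_r(instances, class_attr):
    best_var, best_rules, best_correct = None, None, -1
    variables = [a for a in instances[0] if a != class_attr]
    for var in variables:
        by_value = defaultdict(Counter)
        for inst in instances:
            by_value[inst[var]][inst[class_attr]] += 1
        rules = {val: counts.most_common(1)[0][0] for val, counts in by_value.items()}
        correct = sum(counts[rules[val]] for val, counts in by_value.items())
        if correct > best_correct:
            best_var, best_rules, best_correct = var, rules, correct
    return best_var, best_rules, best_correct

# On the contact-lens data this selects the variable Tears with the rules
# reduced → none and normal → soft, classifying 17/24 instances correctly.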
Generalisation: separate-and-cover

General principles:
• Rule-set generation for each class value separately
• Peeling: box compression – instances are peeled off (fall outside the box) one face at a time
• PRISM algorithm

Example: choosing contact lenses

Recommended contact lenses: none, soft, hard
1. Choose class value, e.g. hard
2. Construct rule Condition → Lens = hard
3. Determine accuracy α = p/t for all possible conditions (see the sketch after this list), where
   – t: total number of instances covered by the rule
   – p: covered instances with the right (positive) class value

Condition                   α = p/t
Age = young                 2/8
Age = pre-presbyopic        1/8
Age = presbyopic            1/8
Spectacles = myope          3/12
Spectacles = hypermetrope   3/12
Astigmatism = no            0/12
Astigmatism = yes           4/12
Tears = reduced             0/12
Tears = normal              4/12

4. Select best condition (4/12): here Astigmatism = yes
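A small Python sketch of step 3 (illustrative; the dictionary representation of instances and the function name rank_conditions are assumptions):

# Score every candidate condition (var = value) by alpha = p/t for a chosen
# class value (here Lens = hard).
from collections import defaultdict

def rank_conditions(instances, class_attr, class_val):
    counts = defaultdict(lambda: [0, 0])          # (var, value) -> [p, t]
    for inst in instances:
        for var, value in inst.items():
            if var == class_attr:
                continue
            counts[(var, value)][1] += 1          # t: instances covered by the condition
            if inst[class_attr] == class_val:
                counts[(var, value)][0] += 1      # p: covered and positive
    return sorted(counts.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True)

# On the contact-lens data with class value "hard", the top conditions are
# Astigmatism = yes and Tears = normal, both with alpha = 4/12.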

Separate-and-cover algorithm

SC(classvar, D)
{
  R ← ∅
  for each val ∈ Domain(classvar) do
    E ← D
    while E contains instances with val do
      rule ← MakeRule(rhs(classvar.val), lhs(∅))
      IR ← ∅
      until rule is perfect do
        for each var ∈ Vars, ∀rule ∈ IR : var ∉ rule do
          for each value ∈ Domain(var) do
            inter-rule ← Add(rule, lhs(var.value))
            IR ← IR ∪ {inter-rule}
        rule ← SelectRule(IR)
      R ← R ∪ {rule}
      RC ← InstancesCoveredBy(rule, E)
      E ← E\RC
}

SelectRule: based on accuracy α = p/t; if α = α′ for two rules, select the one with the highest p

SelectRule example

Rule:
• RHS: Lens = hard
• LHS: Astigmatism = yes, with α = 4/12

Not very accurate; expanded rule:
(Astigmatism = yes ∧ New-condition) → Lens = hard

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   yes  reduced               none
young           myope                   yes  normal                hard
young           hypermetrope            yes  reduced               none
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   yes  reduced               none
pre-presbyopic  myope                   yes  normal                hard
pre-presbyopic  hypermetrope            yes  reduced               none
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   yes  reduced               none
presbyopic      myope                   yes  normal                hard
presbyopic      hypermetrope            yes  reduced               none
presbyopic      hypermetrope            yes  normal                none

• Age = young (2/4); Age = pre-presbyopic (1/4); Age = presbyopic (1/4)
• Spectacles = myope (3/6); Spectacles = hypermetrope (1/6)
• Tears = reduced (0/6); Tears = normal (4/6)
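An illustrative Python sketch of the SC loop above, including the SelectRule tie-break on the highest p (the dictionary representation of instances and rules is an assumption, not part of the slides):

def covers(lhs, inst):
    # an instance is covered if it satisfies every (variable = value) test in the LHS
    return all(inst.get(var) == val for var, val in lhs.items())

def accuracy(lhs, instances, class_attr, class_val):
    covered = [i for i in instances if covers(lhs, i)]
    p = sum(1 for i in covered if i[class_attr] == class_val)
    return p, len(covered)                            # (p, t)

def separate_and_cover(instances, class_attr):
    rules = []
    for class_val in {i[class_attr] for i in instances}:
        E = list(instances)
        while any(i[class_attr] == class_val for i in E):
            lhs = {}
            p, t = accuracy(lhs, E, class_attr, class_val)
            while p != t:                             # until the rule is perfect
                candidates = []
                for inst in (i for i in E if covers(lhs, i)):
                    for var, val in inst.items():     # only conditions keeping coverage non-empty
                        if var != class_attr and var not in lhs:
                            cand = dict(lhs, **{var: val})
                            candidates.append((accuracy(cand, E, class_attr, class_val), cand))
                if not candidates:                    # no further refinement possible
                    break
                # SelectRule: highest alpha = p/t, ties broken by the highest p
                (p, t), lhs = max(candidates, key=lambda c: (c[0][0] / c[0][1], c[0][0]))
            rules.append((dict(lhs), class_val))      # rule: lhs -> class_attr = class_val
            E = [i for i in E if not covers(lhs, i)]  # "separate": remove covered instances
    return rules

# On the contact-lens data the first rule found for class value "hard" is equivalent to
# (Astigmatism = yes ∧ Tears = normal ∧ Spectacles = myope) → Lens = hard, as derived
# below (the order in which the three conditions are added may differ).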
SelectRule example (continued)

Rule:
(Astigmatism = yes ∧ Tears = normal) → Lens = hard

Expanded rule:
(Astigmatism = yes ∧ Tears = normal ∧ New-condition) → Lens = hard

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   yes  normal                hard
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   yes  normal                hard
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   yes  normal                hard
presbyopic      hypermetrope            yes  normal                none

• Age = young (2/2); Age = pre-presbyopic (1/2); Age = presbyopic (1/2)
• Spectacles = myope (3/3); Spectacles = hypermetrope (1/3)

⇒ (Astigmatism = yes ∧ Tears = normal ∧ Spectacles = myope) → Lens = hard

SC example (continued)

Delete the 3 instances covered by this rule from E; new rule:
New-condition → Lens = hard

Age             Spectacle prescription  Ast  Tear production rate  Lens
young           myope                   no   reduced               none
young           myope                   no   normal                soft
young           myope                   yes  reduced               none
young           hypermetrope            no   reduced               none
young           hypermetrope            no   normal                soft
young           hypermetrope            yes  reduced               none
young           hypermetrope            yes  normal                hard
pre-presbyopic  myope                   no   reduced               none
pre-presbyopic  myope                   no   normal                soft
pre-presbyopic  myope                   yes  reduced               none
pre-presbyopic  hypermetrope            no   reduced               none
pre-presbyopic  hypermetrope            no   normal                soft
pre-presbyopic  hypermetrope            yes  reduced               none
pre-presbyopic  hypermetrope            yes  normal                none
presbyopic      myope                   no   reduced               none
presbyopic      myope                   no   normal                none
presbyopic      myope                   yes  reduced               none
presbyopic      hypermetrope            no   reduced               none
presbyopic      hypermetrope            no   normal                soft
presbyopic      hypermetrope            yes  reduced               none
presbyopic      hypermetrope            yes  normal                none

WEKA Results

Rules:
If Astigmatism = no and Tears = normal and Spectacles = hypermetrope then Lens = soft
If Astigmatism = no and Tears = normal and Age = young then Lens = soft
If Age = pre-presbyopic and Astigmatism = no and Tears = normal then Lens = soft
If Astigmatism = yes and Tears = normal and Spectacles = myope then Lens = hard
If Age = young and Astigmatism = yes and Tears = normal then Lens = hard
If Tears = reduced then Lens = none
If Age = presbyopic and Tears = normal and Spectacles = myope and Astigmatism = no then Lens = none
If Spectacles = hypermetrope and Astigmatism = yes and Age = pre-presbyopic then Lens = none
If Age = presbyopic and Spectacles = hypermetrope and Astigmatism = yes then Lens = none

=== Confusion Matrix ===
 a  b  c    <-- classified as
 5  0  0  | a = soft
 0  4  0  | b = hard
 0  0 15  | c = none

Correctly classified instances: 24 (100%)

Limitations

• Adding one condition at a time is greedy search ('optimal' state may be missed)
• Accuracy α = p/t promotes overfitting: the more 'correct' a rule is (the higher p compared to t), the higher α
• Resulting rules cover all instances perfectly

Example
Consider rule r1 with accuracy α1 = 1/1 and rule r2 with accuracy α2 = 19/20; then r1 is considered superior to r2.

Alternative 1: information gain
Alternative 2: probabilistic measure
Information gain

I_D(r) = p′ · [ log(p′/t′) − log(p/t) ]

where
• α = p/t is the accuracy before adding a condition to r
• α′ = p′/t′ is the accuracy after a condition has been added to r

Example
Consider rule r′ with α′ = 1/1 and rule r″ with accuracy α″ = 19/20, both modifications of r with α = 20/200. Then r′ is considered superior to r″ according to accuracy, but

I_D(r′) = 1 · [log(1/1) − log(20/200)] = 1
I_D(r″) = 19 · [log(19/20) − log(20/200)] ≈ 18.6

hence r′ is inferior to r″ according to information gain.

Comparison: accuracy versus information gain

Information gain I:
• Emphasis is on a large number of positive instances
• High-coverage cases first, special cases later
• Resulting rules cover all instances perfectly

Accuracy α:
• Takes the number of positive instances into account only to break ties
• Special cases first, high-coverage cases later
• Resulting rules cover all instances perfectly
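A small Python check of this example (illustrative; it assumes a base-10 logarithm, which reproduces the numbers on the slide):

# Two refinements r' and r'' of a rule r with alpha = 20/200, compared by
# accuracy and by information gain.
from math import log10

def info_gain(p_new, t_new, p_old, t_old):
    # I_D(r) = p' * [log(p'/t') - log(p/t)]
    return p_new * (log10(p_new / t_new) - log10(p_old / t_old))

print(info_gain(1, 1, 20, 200))     # r'  : 1 * [log(1/1) - log(0.1)]    = 1.0
print(info_gain(19, 20, 20, 200))   # r'' : 19 * [log(0.95) - log(0.1)]  ≈ 18.6
# By accuracy r' (1/1) beats r'' (19/20); by information gain r'' wins.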

Probabilistic measure

[Figure: dataset D with N instances, of which M are positive for the class; rule r selects t instances, of which p concern the class]

• N = |D|: # instances in dataset D
• M: # instances in D concerning a class
• t: # instances in D on which rule r succeeds
• p: # positive instances among the t selected by rule r

Hypergeometric distribution (sampling without replacement: the probability that k of the t selected instances belong to the class):

f(k) = C(M, k) · C(N − M, t − k) / C(N, t)

where C(n, k) denotes the binomial coefficient.

• Rule r selects t instances, of which p are positive
• Probability that a randomly chosen rule r′ does as well as or better than r:

P(r′) = Σ_{k=p}^{min{t,M}} f(k) = Σ_{k=p}^{min{t,M}} C(M, k) · C(N − M, t − k) / C(N, t)
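P(r′) can be evaluated with SciPy's hypergeometric distribution; a minimal sketch (illustrative), applied to the running-example rule (Astigmatism = yes ∧ Tears = normal) → Lens = hard, for which N = 24, M = 4, t = 6 and p = 4:

from scipy.stats import hypergeom

def prob_as_good_or_better(n_total, m_pos, t, p):
    # P(r') = P(K >= p) for K ~ Hypergeom(population n_total, m_pos positives, sample size t)
    return hypergeom(M=n_total, n=m_pos, N=t).sf(p - 1)

# N = 24 instances, M = 4 "hard" instances; the rule covers t = 6, of which p = 4 are positive.
print(prob_as_good_or_better(24, 4, 6, 4))   # ≈ 0.0014: unlikely to be matched by a random rule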
Approximation

P(r′) = Σ_{k=p}^{min{t,M}} C(M, k) · C(N − M, t − k) / C(N, t)
      ≈ Σ_{k=p}^{min{t,M}} C(t, k) · (M/N)^k · (1 − M/N)^{t−k}

i.e. the hypergeometric distribution is approximated by a binomial distribution, whose tail sum equals

I_{M/N}(p, t − p + 1)

where I_x(α, β) is the (regularized) incomplete beta function:

I_x(α, β) = (1 / B(α, β)) ∫_0^x z^{α−1} (1 − z)^{β−1} dz

and B(α, β) is the beta function, defined as

B(α, β) = ∫_0^1 z^{α−1} (1 − z)^{β−1} dz

Reduced-error pruning

The danger of overfitting to the training set can be reduced by splitting it into:
• a growing set (GS) (2/3 of the training set)
• a pruning set (PS) (1/3 of the training set)

REP(classvar, D)
{
  R ← ∅; E ← D
  (GS, PS) ← Split(E)
  while E ≠ ∅ do
    IR ← ∅
    for each val ∈ Domain(classvar) do
      if GS and PS contain a val-instance then
        rule ← BSC(classvar.val, GS)
        while P(rule | PS) > P(rule⁻ | PS) do
          rule ← rule⁻
        IR ← IR ∪ {rule}
    rule ← SelectRule(IR); R ← R ∪ {rule}
    RC ← InstancesCoveredBy(rule, E)
    E ← E\RC
    (GS, PS) ← Split(E)
}

BSC is the basic separate-and-cover algorithm, and rule⁻ is the rule with its last condition removed.
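A small SciPy sketch (illustrative, with assumed example numbers) checking that the binomial approximation of the hypergeometric tail equals the regularized incomplete beta function I_{M/N}(p, t − p + 1), as stated on the Approximation slide above:

from scipy.stats import hypergeom, binom
from scipy.special import betainc

n_total, m_pos, t, p = 1000, 300, 20, 10                  # assumed example values for N, M, t, p
exact  = hypergeom(M=n_total, n=m_pos, N=t).sf(p - 1)     # exact hypergeometric tail
approx = binom(t, m_pos / n_total).sf(p - 1)              # binomial approximation
beta   = betainc(p, t - p + 1, m_pos / n_total)           # I_{M/N}(p, t - p + 1)
print(exact, approx, beta)     # approx and beta coincide; both are close to the exact value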

Divide-and-conquer: decision trees

[Figure: decision tree for the contact-lens data — root split on Tears (reduced → none; normal → split on Astigmatism), further splits on Spectacle prescription and Age, with leaves none/soft/hard]

Learning decision trees:
• R. Quinlan: ID3, C4.5 and C5.0
• L. Breiman: CART (Classification and Regression Trees)

Which variable/attribute is best?

[Figure: dataset D split on a binary variable X into the subsets D_{X=yes} and D_{X=no}, each with its own class-value distribution]

Dataset D:
D = D_{X=yes} ∪ D_{X=no}

Entropy:
H_C(X = x) = − Σ_c P(C = c | X = x) ln P(C = c | X = x)

Expected entropy:
EH_C(X) = Σ_x P(X = x) H_C(X = x)
Information gain (again)

Dataset D:
D = D_{X=yes} ∪ D_{X=no}

Entropy:
H_C(X = x) = − Σ_c P(C = c | X = x) ln P(C = c | X = x)

Expected entropy:
EH_C(X) = Σ_x P(X = x) H_C(X = x)

Without the split of the dataset D on variable X, the entropy is:
H_C(⊤) = − Σ_c P(C = c) ln P(C = c)

Information gain G_C from X:
G_C(X) = H_C(⊤) − EH_C(X)

Example: contact lenses recommendation

Class variable is Lens:
P(Lens) = 5/24 if Lens = soft; 4/24 if Lens = hard; 15/24 if Lens = none

H(⊤) = −(5/24) ln(5/24) − (4/24) ln(4/24) − (15/24) ln(15/24) ≈ 0.92

For variable Ast (Astigmatism):
P(Lens | Ast = no) = 5/12 if Lens = soft; 0/12 if Lens = hard; 7/12 if Lens = none

Therefore:
H(Ast = no) = −(5/12) ln(5/12) − (0/12) ln(0/12) − (7/12) ln(7/12) ≈ 0.68

Example (continued)

For variable Ast (Astigmatism):
P(Lens | Ast = yes) = 0/12 if Lens = soft; 4/12 if Lens = hard; 8/12 if Lens = none

Therefore:
H(Ast = yes) = −(0/12) ln(0/12) − (4/12) ln(4/12) − (8/12) ln(8/12) ≈ 0.64

⇒ EH(Ast) = ½ H(Ast = no) + ½ H(Ast = yes) = ½ (0.68 + 0.64) = 0.66

Information gain:
⇒ G(Ast) = H(⊤) − EH(Ast) = 0.92 − 0.66 = 0.26

For variable Tears:
EH(Tears) = ½ H(Tears = red) + ½ H(Tears = norm) ≈ ½ (0.0 + 1.1) = 0.55

Information gain:
⇒ G(Tears) = H(⊤) − EH(Tears) = 0.92 − 0.55 = 0.37

Comparison:
G(Tears) > G(Ast)

⇒ Select Tears as the first splitting variable
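A short Python sketch (illustrative) that reproduces these entropy and information-gain values from the class counts of the contact-lens data (natural log, as in the formulas above):

from math import log

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)

def expected_entropy(partition):
    # partition: one list of class counts per value of the splitting variable
    n = sum(sum(counts) for counts in partition)
    return sum(sum(counts) / n * entropy(counts) for counts in partition)

h_top    = entropy([5, 4, 15])                          # H(T)        ≈ 0.92
eh_ast   = expected_entropy([[5, 0, 7], [0, 4, 8]])     # EH(Ast)     ≈ 0.66
eh_tears = expected_entropy([[0, 0, 12], [5, 4, 3]])    # EH(Tears)   ≈ 0.55
print(h_top - eh_ast, h_top - eh_tears)                 # G(Ast) ≈ 0.26, G(Tears) ≈ 0.37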
Final remarks I

A node with too many branches causes the information gain measure to break down.

Example
Suppose that with each branch of a node a dataset with exactly one instance is associated:
EH_C(X) = n · (1/n) · (1 log 1 + 0 log 0) = 0
if X has n values. Hence, G_C(X) = H_C(⊤) − 0 = H_C(⊤) attains a maximum.

Solution: gain ratio R_C:
• Split information:
  H_X(⊤) = − Σ_x P(X = x) ln P(X = x)
• Gain ratio:
  R_C(X) = G_C(X) / H_X(⊤)

Final remarks II

• Variable selection is myopic: it does not look beyond the effects of its own values; a resulting decision tree is therefore likely to be suboptimal
• Decision trees may grow unwieldy, and may need to be pruned (ID3 ⇒ C4.5)
  – subtree replacement
  – subtree raising
• Decision trees can also be used for numerical variables: regression trees
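The gain ratio can be computed along the same lines as the earlier entropy sketch (illustrative; the partition/class-count representation is an assumption):

from math import log

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log(c / total) for c in counts if c > 0)

def gain_ratio(partition, class_counts):
    # G_C(X) = H_C(T) - EH_C(X), divided by the split information H_X(T)
    n = sum(sum(counts) for counts in partition)
    expected = sum(sum(counts) / n * entropy(counts) for counts in partition)
    gain = entropy(class_counts) - expected
    split_info = entropy([sum(counts) for counts in partition])
    return gain / split_info

# Tears splits the 24 instances 12/12, so the split information is ln 2 ≈ 0.69
# and R_C(Tears) ≈ 0.38 / 0.69 ≈ 0.55.
print(gain_ratio([[0, 0, 12], [5, 4, 3]], [5, 4, 15]))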
