
Data Mining Techniques: Classification and Prediction

Mirek Riedewald

Some slides based on presentations by Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore

Classification and Prediction Overview

- Introduction
- Decision Trees
- Statistical Decision Theory
- Nearest Neighbor
- Bayesian Classification
- Artificial Neural Networks
- Support Vector Machines (SVMs)
- Prediction
- Accuracy and Error Measures
- Ensemble Methods

Classification vs. Prediction

- Assumption: after data preparation, we have a single data set where each record has attributes X1,...,Xn, and Y.
- Goal: learn a function f:(X1,...,Xn)->Y, then use this function to predict y for a given input record (x1,...,xn).
  - Classification: Y is a discrete attribute, called the class label; usually a categorical attribute with a small domain.
  - Prediction: Y is a continuous attribute.
- Called supervised learning, because true labels (Y-values) are known for the initially provided data.
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection.

Induction: Model Construction

- Training data is fed into a classification algorithm, which produces a model (function).
- Example training data:

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

- Learned model: IF rank = professor OR years > 6 THEN tenured = yes

Deduction: Using the Model

- The model (function) is applied first to test data, then to unseen data.
- Test data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

- Unseen data: (Jeff, Professor, 4) -> Tenured?

Classification and Prediction Overview (section outline): Introduction, Decision Trees, Statistical Decision Theory, Nearest Neighbor, Bayesian Classification, Artificial Neural Networks, Support Vector Machines (SVMs), Prediction, Accuracy and Error Measures, Ensemble Methods
Example of a Decision Tree

- Training data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

- Model: decision tree with splitting attributes
  - Refund? Yes -> NO; No -> MarSt
  - MarSt = Married -> NO; MarSt = Single or Divorced -> TaxInc
  - TaxInc < 80K -> NO; TaxInc > 80K -> YES

Another Example of Decision Tree

- A different tree that fits the same training data:
  - MarSt = Married -> NO; MarSt = Single or Divorced -> Refund
  - Refund = Yes -> NO; Refund = No -> TaxInc
  - TaxInc < 80K -> NO; TaxInc > 80K -> YES
- There could be more than one tree that fits the same data!

Apply Model to Test Data

- Test data record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
- Start from the root of the tree and follow the branch matching each split condition:
  - Refund = No, so take the "No" branch to the MarSt node.
  - Marital Status = Married, so take the "Married" branch to a leaf.
  - The leaf predicts NO, so assign Cheat = No.

Decision Tree Induction

- Basic greedy algorithm (a code sketch follows after the next slide)
  - Top-down, recursive divide-and-conquer
  - At the start, all training records are at the root
  - Training records are partitioned recursively based on split attributes
  - Split attributes are selected based on a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - Pure node (all records belong to the same class)
  - No remaining attributes for further partitioning: use majority voting for classifying the leaf
  - No cases left

Decision Boundary

- Example: two numeric attributes x1, x2 in [0,1]; the tree splits on X1 < 0.43?, then on X2 < 0.47? (yes branch) and X2 < 0.33? (no branch), each leaf being pure for one class.
- Decision boundary = border between two neighboring regions of different classes.
- For trees that split on a single attribute at a time, the decision boundary is parallel to the axes.
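To make the greedy induction algorithm above concrete, here is a minimal Python sketch (my own illustration, not from the slides). It assumes categorical attributes with multi-way splits, uses entropy-based information gain as the selection measure, and applies majority voting at leaves with no remaining attributes; all function and variable names are mine.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(records, labels, attr):
    """Information gain of splitting (records, labels) on attribute attr."""
    n = len(records)
    by_value = {}
    for rec, y in zip(records, labels):
        by_value.setdefault(rec[attr], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in by_value.values())
    return entropy(labels) - remainder

def build_tree(records, labels, attributes):
    # Stopping conditions from the slide: pure node, or no remaining attributes.
    if len(set(labels)) == 1:
        return labels[0]                                # pure leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]     # majority vote
    # Greedy choice: split on the attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(records, labels, a))
    remaining = [a for a in attributes if a != best]
    partitions = {}
    for rec, y in zip(records, labels):
        partitions.setdefault(rec[best], ([], []))
        partitions[rec[best]][0].append(rec)
        partitions[rec[best]][1].append(y)
    # Only non-empty partitions are created, so "no cases left" cannot arise here.
    return {"split": best,
            "children": {v: build_tree(r, l, remaining)
                         for v, (r, l) in partitions.items()}}

# Tiny usage example: records are dicts of categorical attributes.
data = [({"refund": "yes", "marital": "single"}, "no"),
        ({"refund": "no",  "marital": "married"}, "no"),
        ({"refund": "no",  "marital": "single"}, "yes")]
records, labels = zip(*data)
print(build_tree(list(records), list(labels), ["refund", "marital"]))
```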

How to Specify Split Condition?

- Depends on the attribute type:
  - Nominal
  - Ordinal
  - Numeric (continuous)
- Depends on the number of ways to split:
  - 2-way split
  - Multi-way split

Splitting Nominal Attributes

- Multi-way split: use as many partitions as distinct values, e.g., CarType -> {Family}, {Sports}, {Luxury}.
- Binary split: divide the values into two subsets and find the optimal partitioning, e.g., {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting Ordinal Attributes

- Multi-way split: Size -> {Small}, {Medium}, {Large}.
- Binary split: {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
- What about the split {Small, Large} vs. {Medium}? (It ignores the ordering of the attribute values.)

Splitting Continuous Attributes

- Different options:
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and choose the best one

Splitting Continuous Attributes (cont.)

- (i) Binary split: e.g., Taxable Income > 80K? Yes / No
- (ii) Multi-way split: e.g., Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

How to Determine Best Split

- Before splitting: 10 records of class 0, 10 records of class 1.
- Candidate splits:
  - Own Car? Yes: C0:6, C1:4; No: C0:4, C1:6
  - Car Type? Family: C0:1, C1:3; Sports: C0:8, C1:0; Luxury: C0:1, C1:7
  - Student ID? Each of c1,...,c20 contains a single record (c1-c10: C0:1, C1:0; c11-c20: C0:0, C1:1)
- Which test condition is the best?

How to Determine Best Split (cont.)

- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity:
  - C0: 5, C1: 5: non-homogeneous, high degree of impurity
  - C0: 9, C1: 1: homogeneous, low degree of impurity

Attribute Selection Measure: Information Gain

- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary record in D belongs to class C_i, i = 1,...,m.
- Expected information (entropy) needed to classify a record in D:

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

- Information needed after using attribute A to split D into v partitions D_1,..., D_v:

  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} Info(D_j)

- Information gained by splitting on attribute A:

  Gain_A(D) = Info(D) - Info_A(D)
Example

- Predict if somebody will buy a computer.
- Class P: buys_computer = yes; class N: buys_computer = no.
- Given data set:

  Age    Income  Student  Credit_rating  Buys_computer
  <=30   High    No       Bad            No
  <=30   High    No       Good           No
  31-40  High    No       Bad            Yes
  >40    Medium  No       Bad            Yes
  >40    Low     Yes      Bad            Yes
  >40    Low     Yes      Good           No
  31-40  Low     Yes      Good           Yes
  <=30   Medium  No       Bad            No
  <=30   Low     Yes      Bad            Yes
  >40    Medium  Yes      Bad            Yes
  <=30   Medium  Yes      Good           Yes
  31-40  Medium  No       Good           Yes
  31-40  High    Yes      Bad            Yes
  >40    Medium  No       Good           No

Information Gain Example

- Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940
- Class distribution per age group:

  Age    #yes  #no  I(#yes, #no)
  <=30   2     3    0.971
  31-40  4     0    0
  >40    3     2    0.971

- Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694
  (The term \frac{5}{14} I(2,3) means that age <= 30 covers 5 of the 14 samples, with 2 yeses and 3 nos; the other terms are computed similarly.)
- Hence Gain_age(D) = Info(D) - Info_age(D) = 0.246
- Similarly, Gain_income(D) = 0.029, Gain_student(D) = 0.151, Gain_credit_rating(D) = 0.048
- Therefore we choose age as the splitting attribute.
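As a quick sanity check of the numbers above, the following short Python snippet (my own, not part of the lecture) recomputes Info(D), Info_age(D), and Gain_age(D):

```python
import math

def I(*counts):                      # I(#yes, #no) from the slide
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

info_D   = I(9, 5)                                            # 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)   # 0.694
gain_age = info_D - info_age                                  # 0.246
print(round(info_D, 3), round(info_age, 3), round(gain_age, 3))
```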

Gain Ratio for Attribute Selection

- Information gain is biased towards attributes with a large number of values.
- Use the gain ratio to normalize information gain: GainRatio_A(D) = Gain_A(D) / SplitInfo_A(D), where

  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\frac{|D_j|}{|D|}

- E.g., SplitInfo_income(D) = -\frac{4}{14}\log_2\frac{4}{14} - \frac{6}{14}\log_2\frac{6}{14} - \frac{4}{14}\log_2\frac{4}{14} = 1.557, hence GainRatio_income(D) = 0.029 / 1.557 = 0.019.
- The attribute with the maximum gain ratio is selected as the splitting attribute.

Gini Index

- The Gini index of a data set D is defined as gini(D) = 1 - \sum_{i=1}^{m} p_i^2.
- If data set D is split on A into v subsets D_1,..., D_v, the Gini index gini_A(D) is defined as

  gini_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} gini(D_j)

- Reduction in impurity: \Delta gini_A(D) = gini(D) - gini_A(D).
- The attribute that provides the smallest gini_split(D) (= largest reduction in impurity) is chosen to split the node.
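A minimal sketch of the Gini computations defined above (helper names are mine, not from the slides):

```python
def gini(class_counts):
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

def gini_split(partitions):
    """partitions: one list of class counts per child D_j."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Example: parent with 10/10 records split into (6,4) and (4,6),
# as in the "Own Car?" split on the earlier "How to Determine Best Split" slide.
reduction = gini([10, 10]) - gini_split([[6, 4], [4, 6]])
print(round(reduction, 3))   # 0.02
```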

Comparing Attribute Selection Measures

- No clear winner (and there are many more measures).
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits where one partition is much smaller than the others.
- Gini index: biased towards multivalued attributes; tends to favor tests that result in equal-sized partitions with purity in both partitions.

Practical Issues of Classification

- Underfitting and overfitting
- Missing values
- Computational cost
- Expressiveness
How Good is the Model?

- Training set error: compare the prediction for each training record with its true value.
  - Not a good measure for the error on unseen data. (Discussed soon.)
- Test set error: for records that were not used for training, compare the model prediction and the true value.
  - Use holdout data from the available data set.

Training versus Test Set Error

- We'll create a training data set: five inputs, all bits, generated in all 32 possible combinations (32 records).
- Output y = copy of e, except a random 25% of the records have y set to the opposite of e.

  a b c d e  y
  0 0 0 0 0  0
  0 0 0 0 1  0
  0 0 0 1 0  0
  0 0 0 1 1  1
  0 0 1 0 0  1
  : : : : :  :
  1 1 1 1 1  1

Test Data

- Generate test data using the same method: y = copy of e, but 25% inverted.
- Some y's that were corrupted in the training set will be uncorrupted in the test set, and some y's that were uncorrupted in the training set will be corrupted in the test set.

  a b c d e  y (training data)  y (test data)
  0 0 0 0 0  0                  0
  0 0 0 0 1  0                  1
  0 0 0 1 0  0                  1
  0 0 0 1 1  1                  1
  0 0 1 0 0  1                  1
  : : : : :  :                  :
  1 1 1 1 1  1                  1

Full Tree for The Training Data

- The full tree splits on e at the root, then on a, and so on, until each leaf contains exactly one record; hence there is no error in predicting the training data.
- 25% of these leaf node labels will be corrupted.

Testing The Tree with The Test Set

- 1/4 of the tree's leaf nodes are corrupted and 3/4 are fine; independently, 1/4 of the test set records are corrupted and 3/4 are fine.
  - Corrupted leaf, corrupted test record: 1/16 of the test set will be correctly predicted for the wrong reasons.
  - Corrupted leaf, clean test record: 3/16 of the test predictions will be wrong because the tree node is corrupted.
  - Clean leaf, corrupted test record: 3/16 of the test set will be wrongly predicted because the test record is corrupted.
  - Clean leaf, clean test record: 9/16 of the test predictions will be fine.
- In total, we expect to be wrong on 3/8 of the test set predictions.

What's This Example Shown Us?

- There is a discrepancy between training and test set error.
- But more importantly, it indicates that there is something we should do about it if we want to predict well on future data.
Suppose We Had Less Data

- Same setup, but the bits a, b, c, d are hidden; only e is visible (32 records).
- Output y = copy of e, except a random 25% of the records have y set to the opposite of e.

Tree Learned Without Access to The Irrelevant Bits

- The tree can only split on e: Root with children e = 0 and e = 1. These nodes are unexpandable.

Tree Learned Without Access to The Irrelevant Bits (cont.)

- In about 12 of the 16 records in the e = 0 node the output will be 0, so this leaf will almost certainly predict 0; likewise, the e = 1 leaf will almost certainly predict 1.
- Hence almost certainly none of the tree nodes are corrupted; almost certainly all are fine.
  - 1/4 of the test set records are corrupted: 1/4 of the test set will be wrongly predicted because the test record is corrupted.
  - 3/4 are fine: 3/4 of the test predictions will be fine.
- In total, we expect to be wrong on only 1/4 of the test set predictions.

Typical Observation

- Overfitting: model M overfits the training data if another model M' exists such that M has smaller error than M' over the training examples, but M' has smaller error than M over the entire distribution of instances.
- Underfitting: when the model is too simple, both training and test errors are large.
- (Figure: training and test error as a function of model complexity.)

Reasons for Overfitting

- Noise
  - Fitting the training data too closely means the model's predictions reflect the noise as well.
- Insufficient training data
  - Not enough data to enable the model to generalize beyond idiosyncrasies of the training records.
- Data fragmentation (a special problem for trees)
  - The number of instances gets smaller as you traverse down the tree.
  - The number of instances at a leaf node could be too small to make any confident decision about the class.
Avoiding Overfitting

- General idea: make the tree smaller. This addresses all three reasons for overfitting.
- Prepruning: halt tree construction early.
  - Do not split a node if this would result in the goodness measure falling below a threshold.
  - Difficult to choose an appropriate threshold, e.g., for a tree encoding XOR.
- Postpruning: remove branches from a fully grown tree.
  - Use a set of data different from the training data to decide when to stop pruning.
  - Validation data: train the tree on training data, prune on validation data, then test on test data.

Minimum Description Length (MDL)

- Alternative to using validation data.
- Motivation: data mining is about finding regular patterns in data; regularity can be used to compress the data; the method that achieves the greatest compression found the most regularity and hence is best.
- Minimize Cost(Model, Data) = Cost(Model) + Cost(Data | Model), where cost is the number of bits needed for encoding.
  - Cost(Data | Model) encodes the misclassification errors.
  - Cost(Model) uses node encoding plus splitting-condition encoding.
- (Figure: a sender who knows the labels y for records X1,...,Xn transmits a tree A?, B?, C? plus the exceptions, instead of transmitting every label explicitly.)

MDL-Based Pruning Intuition

- (Figure: cost versus tree size.) Cost(Model) = model size grows with tree size, while Cost(Data | Model) = model errors shrinks; their sum Cost(Model, Data) is lowest at the best tree size, between "small" and "large".

Handling Missing Attribute Values

- Missing values affect decision tree construction in three different ways:
  - How impurity measures are computed
  - How to distribute an instance with a missing value to child nodes
  - How a test instance with a missing value is classified

Distribute Instances

- Training records 1-9 have Refund values; record 10 (Refund = ?, Single, 90K, Cheat = Yes) is missing Refund.
- Based on records 1-9, the probability that Refund = Yes is 3/9 and that Refund = No is 6/9 (Refund = Yes: 0 Cheat=Yes, 3 Cheat=No; Refund = No: 2 Cheat=Yes, 4 Cheat=No).
- Assign record 10 to the Refund = Yes child with weight 3/9 and to the Refund = No child with weight 6/9, giving class counts (Class=Yes: 0 + 3/9, Class=No: 3) and (Class=Yes: 2 + 6/9, Class=No: 4).

Computing Impurity Measure

- Split on Refund, assuming records with missing values are distributed as discussed before: 3/9 of record 10 goes to Refund = Yes and 6/9 to Refund = No.
- Entropy(Refund = Yes) = -((1/3)/(10/3)) log((1/3)/(10/3)) - (3/(10/3)) log(3/(10/3)) = 0.469
- Entropy(Refund = No) = -((8/3)/(20/3)) log((8/3)/(20/3)) - (4/(20/3)) log(4/(20/3)) = 0.971
- Entropy(Children) = 1/3 * 0.469 + 2/3 * 0.971 = 0.804
- Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.881
- Gain = 0.881 - 0.804 = 0.077
Classify Instances

- New record: Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?
- It reaches the MarSt node; distribute it over the branches according to the weighted training counts at that node:

  Class      Married  Single  Divorced  Total
  Class=No   3        1       0         4
  Class=Yes  6/9      1       1         2.67
  Total      3.67     2       1         6.67

- Probability that Marital Status = Married is 3.67/6.67; probability that Marital Status is Single or Divorced is 3/6.67.

Tree Cost Analysis

- Finding an optimal decision tree is NP-complete (optimization goal: minimize the expected number of binary tests needed to uniquely identify any record from a given finite set).
- Greedy algorithm: O(#attributes * #training_instances * log(#training_instances))
  - At each tree depth, all instances are considered.
  - Assume the tree depth is logarithmic (fairly balanced splits).
  - Need to test each attribute at each node.
  - What about binary splits? Sort the data once on each attribute to avoid re-sorting subsets, and incrementally maintain class-distribution counts as different split points are explored.
- In practice, trees are considered fast both for training (when using the greedy algorithm) and for making predictions.

Tree Expressiveness

- A tree can represent any finite discrete-valued function, but it might not do so very efficiently.
- Example: parity function (class = 1 if there is an even number of Boolean attributes with truth value True, class = 0 if there is an odd number). For accurate modeling, the tree must be complete.
- Not expressive enough for modeling continuous attributes, but we can still use a tree for them in practice; it just cannot represent the true function exactly.

Rule Extraction from a Decision Tree

- One rule is created for each path from the root to a leaf.
  - Precondition: conjunction of all split predicates of the nodes on the path.
  - Consequent: class prediction from the leaf.
- Rules are mutually exclusive and exhaustive.
- Example: rule extraction from the buys_computer decision tree (age? <=30 -> student?; 31..40 -> yes; >40 -> credit rating?):
  - IF age = young AND student = no THEN buys_computer = no
  - IF age = young AND student = yes THEN buys_computer = yes
  - IF age = mid-age THEN buys_computer = yes
  - IF age = old AND credit_rating = excellent THEN buys_computer = yes
  - IF age = old AND credit_rating = fair THEN buys_computer = no

Classification in Large Databases

- Scalability: classify data sets with millions of examples and hundreds of attributes with reasonable speed.
- Why use decision trees for data mining?
  - Relatively fast learning speed
  - Can handle all attribute types
  - Convertible to simple and easy-to-understand classification rules
  - Good classification accuracy, but not as good as newer methods (tree ensembles, however, are top!)

Scalable Tree Induction

- High cost when the training data at a node does not fit in memory.
- Solution 1: special I/O-aware algorithm.
  - Keep only the class list in memory, access attribute values on disk.
  - Maintain a separate list for each attribute; use a count matrix for each attribute.
- Solution 2: sampling.
  - Common solution: train the tree on a sample that fits in memory.
  - More sophisticated versions of this idea exist, e.g., Rainforest: build a tree on a sample, but do this for many bootstrap samples, then combine all of them into a single new tree that is guaranteed to be almost identical to the one trained from the entire data set; can be computed with two data scans.
Tree Conclusions

- Very popular data mining tool
  - Easy to understand
  - Easy to implement
  - Easy to use: little tuning, handles all attribute types and missing values
  - Computationally cheap
- Overfitting problem
- Focused on classification, but easy to extend to prediction (future lecture)

Classification and Prediction Overview (section outline): Introduction, Decision Trees, Statistical Decision Theory, Nearest Neighbor, Bayesian Classification, Artificial Neural Networks, Support Vector Machines (SVMs), Prediction, Accuracy and Error Measures, Ensemble Methods

Theoretical Results

- Trees make sense intuitively, but can we get some hard evidence and a deeper understanding of their properties?
- Statistical decision theory can give some answers; we need some probability concepts first.

Random Variables

- Intuitive version of the definition: a random variable can take on one of possibly many values, each with a certain probability (discrete versus continuous). These probabilities define the probability distribution of the random variable.
- E.g., let X be the outcome of a coin toss; then Pr(X = heads) = 0.5 and Pr(X = tails) = 0.5, and the distribution is uniform.
- Consider a discrete random variable X with numeric values x1,...,xk:
  - Expectation: E[X] = \sum_i x_i \Pr(X = x_i)
  - Variance: Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2
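As a quick numeric illustration of these two definitions (a fair six-sided die; my own example, not from the slides):

```python
values = [1, 2, 3, 4, 5, 6]
p = 1 / 6                                   # uniform distribution
E_X  = sum(x * p for x in values)           # E[X] = 3.5
E_X2 = sum(x * x * p for x in values)       # E[X^2] = 15.1667
var_X = E_X2 - E_X ** 2                     # Var(X) = 35/12 = 2.9167
print(E_X, round(var_X, 4))
```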

Working with Random Variables

- E[X + Y] = E[X] + E[Y]
- Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
- For constants a, b: E[aX + b] = a E[X] + b and Var(aX + b) = Var(aX) = a^2 Var(X)
- Iterated expectation: E[Y] = E_X[ E_Y[Y | X] ], where E_Y[Y | X] = \sum_i y_i \Pr(Y = y_i | X = x) is the expectation of Y for a given value of X, i.e., a function of X.
- In general, for any function f(X, Y): E_{X,Y}[f(X,Y)] = E_X[ E_Y[f(X,Y) | X] ]

What is the Optimal Model f(X)?

- Let X denote a real-valued random input variable and Y a real-valued random output variable. The squared error of a trained model f(X) is E_{X,Y}[(Y - f(X))^2]. Which function f(X) will minimize the squared error?
- Consider the error for a specific value of X and let \bar{Y} = E_Y[Y | X]:

  E_Y[(Y - f(X))^2 | X] = E_Y[(Y - \bar{Y} + \bar{Y} - f(X))^2 | X]
    = E_Y[(Y - \bar{Y})^2 | X] + E_Y[(\bar{Y} - f(X))^2 | X] + 2 E_Y[(Y - \bar{Y})(\bar{Y} - f(X)) | X]
    = E_Y[(Y - \bar{Y})^2 | X] + (\bar{Y} - f(X))^2 + 2 (\bar{Y} - f(X)) E_Y[(Y - \bar{Y}) | X]
    = E_Y[(Y - \bar{Y})^2 | X] + (\bar{Y} - f(X))^2

  (Notice: E_Y[(Y - \bar{Y}) | X] = E_Y[Y | X] - E_Y[\bar{Y} | X] = \bar{Y} - \bar{Y} = 0.)
Optimal Model f(X) (cont.)

- The choice of f(X) does not affect E_Y[(Y - \bar{Y})^2 | X], but (\bar{Y} - f(X))^2 is minimized for f(X) = \bar{Y} = E_Y[Y | X].
- Note that E_{X,Y}[(Y - f(X))^2] = E_X[ E_Y[(Y - f(X))^2 | X] ]. Hence

  E_{X,Y}[(Y - f(X))^2] = E_X[ E_Y[(Y - \bar{Y})^2 | X] + (\bar{Y} - f(X))^2 ]

- Hence the squared error is minimized by choosing f(X) = E_Y[Y | X] for every X.
- (Notice that for minimizing the absolute error E_{X,Y}[|Y - f(X)|], one can show that the best model is f(X) = median(Y | X).)

Implications for Trees

- The best prediction for input X = x is the mean of the Y-values of all records (x(i), y(i)) with x(i) = x.
- What about classification?
  - Two classes: encode them as 0 and 1 and use squared error as before. We get f(x) = E[Y | X=x] = 1*Pr(Y=1 | X=x) + 0*Pr(Y=0 | X=x) = Pr(Y=1 | X=x).
  - K classes: one can show that for 0-1 loss (error = 0 if the correct class is predicted, error = 1 otherwise) the optimal choice is to return the majority class for a given input X = x. This is called the Bayes classifier.
- Problem: how can we estimate E[Y | X=x], or the majority class for X = x, from the training data? Often there is just one or no training record for a given X = x.
- Solution: approximate it using the Y-values of training records in a neighborhood around X = x. For a tree, a leaf defines the neighborhood in the data space; make sure there are enough records in the leaf to obtain a reliable estimate of the correct answer.

Bias-Variance Tradeoff

- Let's take this one step further and see if we can understand overfitting through statistical decision theory.
- As before, consider two random variables X and Y. From a training set D with n records, we want to construct a function f(X) that returns good approximations of Y for future inputs X.
- Make the dependence of f on D explicit by writing f(X; D).
- Goal: minimize the mean squared error over all X, Y, and D, i.e., E_{X,D,Y}[(Y - f(X; D))^2].

Bias-Variance Tradeoff Derivation

- E_{X,D,Y}[(Y - f(X; D))^2] = E_X[ E_D[ E_Y[(Y - f(X; D))^2 | X, D] ] ]. Now consider the inner term:

  E_D[ E_Y[(Y - f(X; D))^2 | X, D] ]
    = E_D[ E_Y[(Y - E[Y|X])^2 | X, D] + (f(X; D) - E[Y|X])^2 ]      (same derivation as before, for the optimal f(X))
    = E_Y[(Y - E[Y|X])^2 | X] + E_D[(f(X; D) - E[Y|X])^2]

  (The first term does not depend on D, hence E_D[ E_Y[(Y - E[Y|X])^2 | X, D] ] = E_Y[(Y - E[Y|X])^2 | X].)

- Consider the second term:

  E_D[(f(X; D) - E[Y|X])^2]
    = E_D[(f(X; D) - E_D[f(X; D)] + E_D[f(X; D)] - E[Y|X])^2]
    = E_D[(f(X; D) - E_D[f(X; D)])^2] + (E_D[f(X; D)] - E[Y|X])^2 + 2 E_D[f(X; D) - E_D[f(X; D)]] (E_D[f(X; D)] - E[Y|X])
    = E_D[(f(X; D) - E_D[f(X; D)])^2] + (E_D[f(X; D)] - E[Y|X])^2

  (The third term is zero, because E_D[f(X; D) - E_D[f(X; D)]] = E_D[f(X; D)] - E_D[f(X; D)] = 0.)

- Overall we therefore obtain:

  E_{X,D,Y}[(Y - f(X; D))^2]
    = E_X[ (E_D[f(X; D)] - E[Y|X])^2 + E_D[(f(X; D) - E_D[f(X; D)])^2] + E_Y[(Y - E[Y|X])^2 | X] ]

Bias-Variance Tradeoff and Overfitting

- (E_D[f(X; D)] - E[Y|X])^2: bias
- E_D[(f(X; D) - E_D[f(X; D)])^2]: variance
- E_Y[(Y - E[Y|X])^2 | X]: irreducible error (does not depend on f; it is simply the variance of Y given X)
- Option 1: f(X; D) = E[Y | X, D]
  - Bias: since E_D[E[Y | X, D]] = E[Y | X], the bias is zero.
  - Variance: (E[Y | X, D] - E_D[E[Y | X, D]])^2 = (E[Y | X, D] - E[Y | X])^2 can be very large, since E[Y | X, D] depends heavily on D. Might overfit!
- Option 2: f(X; D) = X (or some other function independent of D)
  - Variance: (X - E_D[X])^2 = (X - X)^2 = 0.
  - Bias: (E_D[X] - E[Y | X])^2 = (X - E[Y | X])^2 can be large, because E[Y | X] might be completely different from X. Might underfit!
- Find the best compromise between fitting the training data too closely (option 1) and completely ignoring it (option 2).

Implications for Trees

- Bias decreases as the tree becomes larger: a larger tree can fit the training data better.
- Variance increases as the tree becomes larger: sample variance affects the predictions of a larger tree more.
- Find the right tradeoff as discussed earlier: validation data to find the best pruned tree, or the MDL principle.
Classification and Prediction Overview (section outline): Introduction, Decision Trees, Statistical Decision Theory, Nearest Neighbor, Bayesian Classification, Artificial Neural Networks, Support Vector Machines (SVMs), Prediction, Accuracy and Error Measures, Ensemble Methods

Lazy vs. Eager Learning

- Lazy learning: simply stores the training data (or does only minor processing) and waits until it is given a test record.
- Eager learning: given a training set, constructs a classification model before receiving new (test) data to classify.
- General trend: lazy = faster training, slower predictions.
- Accuracy: not clear which one is better!
  - Lazy methods: typically driven by local decisions.
  - Eager methods: driven by global and local decisions.

Nearest-Neighbor

- Recall our statistical decision theory analysis: the best prediction for input X = x is the mean of the Y-values of all records (x(i), y(i)) with x(i) = x (the majority class for classification).
- The problem was to estimate E[Y | X=x], or the majority class for X = x, from the training data.
- The solution was to approximate it: use the Y-values of training records in a neighborhood around X = x.

Nearest-Neighbor Classifiers

- Requires:
  - A set of stored records
  - A distance metric for pairs of records; common choice: Euclidean, d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}
  - The parameter k, the number of nearest neighbors to retrieve
- To classify a record (the unknown tuple):
  - Find its k nearest neighbors
  - Determine the output based on the (distance-weighted) average of the neighbors' outputs
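A minimal k-NN classifier along the lines of the slide above (Euclidean distance, unweighted majority vote; all names and the toy data are my own):

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, test_record, k):
    """train: list of (attribute_vector, class_label) pairs."""
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], test_record))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((3.0, 3.2), "-"), ((2.9, 3.1), "-")]
print(knn_classify(train, (1.1, 0.9), k=3))   # '+'
```

For the distance-weighted variant mentioned on the slide, each neighbor's vote would be weighted, e.g., by 1/distance, instead of counting all k neighbors equally.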

Definition of Nearest Neighbor

- The k nearest neighbors of a record x are the data points that have the k smallest distances to x.
- (Figure: the 1-, 2-, and 3-nearest neighbors of a point.)

1-Nearest Neighbor

- (Figure: Voronoi diagram induced by the training records.)
Nearest Neighbor Classification

- Choosing the value of k:
  - k too small: sensitive to noise points.
  - k too large: the neighborhood may include points from other classes.

Effect of Changing k

- (Figure comparing decision boundaries for different values of k. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

Explaining the Effect of k

- Recall the bias-variance tradeoff.
- Small k, i.e., predictions based on few neighbors: high variance, low bias.
- Large k, e.g., averaging over the entire data set: low variance, but high bias.
- Need to find the k that achieves the best tradeoff; can do that using validation data.

Scaling Issues

- Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes.
- Example: the height of a person may vary from 1.5m to 1.8m, weight from 90lb to 300lb, and income from $10K to $1M. The income difference would dominate the record distance.
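One common fix, shown as a small sketch (the attribute ranges and names here are assumptions for illustration): min-max scale each attribute to [0, 1] before computing distances.

```python
def min_max_scale(value, lo, hi):
    return (value - lo) / (hi - lo)          # maps [lo, hi] to [0, 1]

# person = (height_m, weight_lb, income_usd)
a = (1.6, 150, 50_000)
b = (1.7, 160, 900_000)
ranges = [(1.5, 1.8), (90, 300), (10_000, 1_000_000)]
scaled_a = [min_max_scale(v, lo, hi) for v, (lo, hi) in zip(a, ranges)]
scaled_b = [min_max_scale(v, lo, hi) for v, (lo, hi) in zip(b, ranges)]
print(scaled_a, scaled_b)   # all attributes now contribute on a comparable scale
```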

Other Problems

- Problems with the Euclidean measure:
  - High-dimensional data: curse of dimensionality.
  - Can produce counter-intuitive results, e.g., 111111111110 vs. 100000000000 and 011111111111 vs. 000000000001 both have distance d = 1.4142.
  - Solution: normalize the vectors to unit length.
- Irrelevant attributes might dominate the distance. Solution: eliminate them.

Computational Cost

- Brute force: O(#trainingRecords). For each training record, compute the distance to the test record and keep it if it is among the top k.
- Pre-compute the Voronoi diagram (expensive), then search a spatial index of Voronoi cells: if lucky, O(log(#trainingRecords)).
- Store training records in a multi-dimensional search tree, e.g., an R-tree: if lucky, O(log(#trainingRecords)).
- Bulk-compute predictions for many test records using a spatial join between training and test set: same worst-case cost as one-by-one predictions, but usually much faster in practice.
Classification and Prediction Overview (section outline): Introduction, Decision Trees, Statistical Decision Theory, Nearest Neighbor, Bayesian Classification, Artificial Neural Networks, Support Vector Machines (SVMs), Prediction, Accuracy and Error Measures, Ensemble Methods

Bayesian Classification

- Performs probabilistic prediction, i.e., predicts class membership probabilities.
- Based on Bayes' Theorem.
- Incremental training: update probabilities as new training records arrive; can combine prior knowledge with observed data.
- Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem: Basics

- X = random variable for data records (the "evidence"); H = hypothesis that a specific record X = x belongs to class C.
- Goal: determine P(H | X=x), the probability that the hypothesis holds given record x.
- P(H) = prior probability: the initial probability of the hypothesis, e.g., that person x will buy a computer, regardless of age, income, etc.
- P(X=x) = probability that data record x is observed.
- P(X=x | H) = probability of observing record x given that the hypothesis holds, e.g., given that x will buy a computer, what is the probability that x is in age group 31...40, has medium income, etc.?

Bayes Theorem

- Given data record x, the posterior probability of a hypothesis H, P(H | X=x), follows from Bayes' theorem:

  P(H | X=x) = P(X=x | H) P(H) / P(X=x)

- Informally: posterior = likelihood * prior / evidence.
- Among all candidate hypotheses H, find the maximally probable one, called the maximum a posteriori (MAP) hypothesis. Note: P(X=x) is the same for all hypotheses.
- If all hypotheses are equally probable a priori, we only need to compare P(X=x | H); the winning hypothesis is then called the maximum likelihood (ML) hypothesis.
- Practical difficulties: requires initial knowledge of many probabilities and has high computational cost.

Towards Naïve Bayes Classifier

- Suppose there are m classes C1, C2,..., Cm.
- Classification goal: for record x, find the class Ci with the maximum posterior probability P(Ci | X=x).
- Bayes' theorem: P(Ci | X=x) = P(X=x | Ci) P(Ci) / P(X=x).
- Since P(X=x) is the same for all classes, we only need to find the maximum of P(X=x | Ci) P(Ci).

Computing P(X=x|Ci) and P(Ci)

- Estimate P(Ci) by counting the frequency of class Ci in the training data.
- Can we do the same for P(X=x | Ci)? This would need a very large training set: there are |X1| * |X2| * ... * |Xd| * m different combinations of possible values for X and Ci, and we would need to see every instance x many times to obtain reliable estimates.
- Solution: decompose into lower-dimensional problems.
Example: Computing P(X=x|Ci) and P(Ci)

- From the buys_computer training data shown earlier (14 records): P(buys_computer = yes) = 9/14, P(buys_computer = no) = 5/14.
- But P(age > 40, income = low, student = no, credit_rating = bad | buys_computer = yes) = 0, because no training record matches this exact combination, even though the combination is clearly possible.

Conditional Independence

- X, Y, Z are random variables. X is conditionally independent of Y, given Z, if P(X | Y, Z) = P(X | Z).
- Equivalent to: P(X, Y | Z) = P(X | Z) * P(Y | Z).
- Example: people with longer arms read better. Confounding factor: age. A young child has shorter arms and lacks the reading skills of an adult; if age is fixed, the observed relationship between arm length and reading skills disappears.

Derivation of Naïve Bayes Classifier

- Simplifying assumption: all input attributes are conditionally independent, given the class:

  P(X = (x_1,...,x_d) | C_i) = \prod_{k=1}^{d} P(X_k = x_k | C_i) = P(X_1 = x_1 | C_i) * P(X_2 = x_2 | C_i) * ... * P(X_d = x_d | C_i)

- Each P(X_k = x_k | C_i) can be estimated robustly:
  - If X_k is a categorical attribute: P(X_k = x_k | C_i) = (#records in C_i that have value x_k for X_k) / (#records of class C_i in the training data).
  - If X_k is continuous, we could discretize it. Problem: interval selection. Too many intervals: too few training cases per interval; too few intervals: limited choices for the decision boundary.

Estimating P(Xk=xk|Ci) for Continuous Attributes without Discretization

- P(X_k = x_k | C_i) is computed from a Gaussian distribution with mean \mu and standard deviation \sigma:

  g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}},   P(X_k = x_k | C_i) = g(x_k, \mu_{k,C_i}, \sigma_{k,C_i})

- Estimate \mu_{k,C_i} from the sample mean of attribute X_k over all training records of class C_i; estimate \sigma_{k,C_i} similarly from the sample standard deviation.

Naïve Bayes Example

- Classes: C1: buys_computer = yes; C2: buys_computer = no.
- Data sample x = (age <= 30, income = medium, student = yes, credit_rating = fair); the training data is the 14-record buys_computer table shown earlier (the counts below treat the table's "Bad" credit rating as "fair").

Naïve Bayesian Computation

- Compute P(Ci) for each class:
  - P(buys_computer = yes) = 9/14 = 0.643
  - P(buys_computer = no) = 5/14 = 0.357
- Compute P(Xk = xk | Ci) for each class:
  - P(age <= 30 | yes) = 2/9 = 0.222;  P(age <= 30 | no) = 3/5 = 0.6
  - P(income = medium | yes) = 4/9 = 0.444;  P(income = medium | no) = 2/5 = 0.4
  - P(student = yes | yes) = 6/9 = 0.667;  P(student = yes | no) = 1/5 = 0.2
  - P(credit_rating = fair | yes) = 6/9 = 0.667;  P(credit_rating = fair | no) = 2/5 = 0.4
- Compute P(X=x | Ci) using the Naïve Bayes assumption:
  - P(<=30, medium, yes, fair | yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
  - P(<=30, medium, yes, fair | no) = 0.6 * 0.4 * 0.2 * 0.4 = 0.019
- Compute the final result P(X=x | Ci) * P(Ci):
  - P(X=x | yes) * P(yes) = 0.028
  - P(X=x | no) * P(no) = 0.007
- Therefore we predict buys_computer = yes for input x = (age <= 30, income = medium, student = yes, credit_rating = fair).
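Here is a compact sketch of this computation for categorical attributes (no smoothing yet; helper names are mine, not from the lecture). Trained on the 14-record buys_computer table, it should reproduce the scores 0.028 (yes) and 0.007 (no) for the input above.

```python
from collections import Counter, defaultdict

def train_nb(records, labels):
    n = len(labels)
    priors = {c: cnt / n for c, cnt in Counter(labels).items()}
    cond = defaultdict(Counter)        # cond[(class, attr_index)][value] = count
    for rec, c in zip(records, labels):
        for k, v in enumerate(rec):
            cond[(c, k)][v] += 1
    return priors, cond, Counter(labels)

def predict_nb(x, priors, cond, class_counts):
    scores = {}
    for c, prior in priors.items():
        p = prior
        for k, v in enumerate(x):
            p *= cond[(c, k)][v] / class_counts[c]   # P(X_k = v | C_i)
        scores[c] = p                                # P(X=x | C_i) * P(C_i)
    return max(scores, key=scores.get), scores
```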
Zero-Probability Problem

- Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise a single zero factor forces the entire product P(X = (x_1,...,x_d) | C_i) = \prod_{k=1}^{d} P(X_k = x_k | C_i) to zero, no matter how strong the other evidence is.
- Example: 1000 records for buys_computer = yes with income = low (0 records), income = medium (990), and income = high (10). For an input with income = low, the conditional probability is zero.
- Use the Laplacian correction (or Laplace estimator): add 1 dummy record to each income level.
  - Prob(income = low) = 1/1003
  - Prob(income = medium) = 991/1003
  - Prob(income = high) = 11/1003
- The corrected probability estimates are close to their uncorrected counterparts, but none of them is zero.

Naïve Bayesian Classifier: Comments

- Easy to implement.
- Good results obtained in many cases:
  - Robust to isolated noise points
  - Handles missing values by ignoring the instance during probability estimate calculations
  - Robust to irrelevant attributes
- Disadvantage: the assumption of class conditional independence causes a loss of accuracy, because in practice dependencies do exist among variables.
- How to deal with these dependencies?
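The Laplacian correction from the slide, written as a one-line helper (my naming) and applied to the income example:

```python
def laplace(count, total, num_values):
    return (count + 1) / (total + num_values)   # add one dummy record per value

counts = {"low": 0, "medium": 990, "high": 10}   # 1000 records, 3 income levels
corrected = {v: laplace(c, 1000, 3) for v, c in counts.items()}
print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003
```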

Probabilities

- Summary of elementary probability facts we have used already and/or will need soon.
- Let X be a random variable as usual, and let A be some predicate over its possible values: A is true for some values of X and false for others. E.g., if X is the outcome of a die throw, A could be "value is greater than 4".
- P(A) is the fraction of possible worlds in which A is true: P(die value is greater than 4) = 2/6 = 1/3.

Axioms

- 0 <= P(A) <= 1
- P(True) = 1, P(False) = 0
- P(A or B) = P(A) + P(B) - P(A and B)

Theorems from the Axioms

- From the axioms above we can prove:
  - P(not A) = P(~A) = 1 - P(A)
  - P(A) = P(A and B) + P(A and ~B)

Conditional Probability

- P(A|B) = fraction of worlds in which B is true that also have A true.
- Example: H = have a headache, F = coming down with flu; P(H) = 1/10, P(F) = 1/40, P(H|F) = 1/2.
- Headaches are rare and flu is rarer, but if you're coming down with flu there's a 50-50 chance you'll have a headache.
Definition of Conditional Probability

- P(A|B) = P(A and B) / P(B)
- Corollary, the Chain Rule: P(A and B) = P(A|B) P(B)

Multivalued Random Variables

- Suppose X can take on more than 2 values: X is a random variable with arity k if it can take on exactly one value out of {v1, v2,..., vk}.
- Thus P(X = v_i and X = v_j) = 0 if i != j, and P(X = v_1 or X = v_2 or ... or X = v_k) = 1.

Easy Fact about Multivalued Random Variables

- Using the axioms of probability and assuming that X obeys the two conditions above, we can prove that

  P(X = v_1 or X = v_2 or ... or X = v_i) = \sum_{j=1}^{i} P(X = v_j)

- and therefore \sum_{j=1}^{k} P(X = v_j) = 1.

Useful Easy-to-Prove Facts

- P(A|B) + P(~A|B) = 1
- \sum_{j=1}^{k} P(X = v_j | B) = 1

The Joint Distribution (Example: Boolean variables A, B, C)

- Recipe for making a joint distribution of d variables:
  1. Make a truth table listing all combinations of values of your variables (2^d rows for d Boolean variables), e.g., rows (A,B,C) = 000, 001, 010, 011, 100, 101, 110, 111.
The Joint Distribution (cont.)

- 2. For each combination of values, say how probable it is.
- 3. If you subscribe to the axioms of probability, those numbers must sum to 1.

  A B C  Prob
  0 0 0  0.30
  0 0 1  0.05
  0 1 0  0.10
  0 1 1  0.05
  1 0 0  0.05
  1 0 1  0.10
  1 1 0  0.25
  1 1 1  0.10

- (Figure: the same distribution drawn as a Venn-style diagram over the events A, B, C.)
Using the Joint Distribution

- Once you have the joint distribution, you can ask for the probability of any logical expression E involving your attributes:

  P(E) = \sum_{rows matching E} P(row)

- Example (joint distribution of gender, hours worked, and wealth): P(Poor and Male) = 0.4654, P(Poor) = 0.7604.

Inference with the Joint Distribution

- Conditional probabilities follow directly:

  P(E_1 | E_2) = P(E_1 and E_2) / P(E_2) = \sum_{rows matching E_1 and E_2} P(row) / \sum_{rows matching E_2} P(row)

- Example: P(Male | Poor) = 0.4654 / 0.7604 = 0.612.

Joint Distributions

- Good news: once you have a joint distribution, you can answer important questions that involve uncertainty.
- Bad news: it is impossible to create a joint distribution for more than about ten attributes, because there are so many numbers needed when you build it.

What Would Help?

- Full independence: P(gender=g and hours_worked=h and wealth=w) = P(gender=g) * P(hours_worked=h) * P(wealth=w); then the full joint distribution can be reconstructed from a few marginals.
- Full conditional independence given the class value: Naïve Bayes.
- What about something between Naïve Bayes and the general joint distribution?

Bayesian Belief Networks

- A subset of the variables is conditionally independent.
- Graphical model of causal relationships: represents dependencies among the variables and gives a specification of the joint probability distribution.
- Nodes: random variables; links: dependencies. E.g., if X and Y are the parents of Z, and Y is the parent of P, then given Y, Z and P are independent.
- The graph has no loops or cycles.

Bayesian Network Properties

- Each variable is conditionally independent of its non-descendants in the graph, given its parents.
- Naïve Bayes as a Bayesian network: the class Y is the single parent of X1, X2,..., Xn.

Bayesian Belief Network Example

- Network over FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.
- Conditional probability table (CPT) for variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
  LC    0.8      0.5       0.7       0.1
  ~LC   0.2      0.5       0.3       0.9

- The CPT shows the conditional probability for each possible combination of the node's parents.
- The joint distribution over all attributes X1,...,Xd is easy to compute from the CPTs:

  P(x_1,...,x_d) = \prod_{i=1}^{d} P(X_i = x_i | parents(X_i))
Creating a Bayes Network

- Variables: T (the lecture started on time), L (the lecturer arrives late), R (the lecture concerns data mining), M (the lecturer is Mike), S (it is snowing).
- Structure: M and S are parents of L; M is the parent of R; L is the parent of T.
- CPTs: P(M) = 0.6, P(S) = 0.3; P(L | M, S) = 0.05, P(L | M, ~S) = 0.1, P(L | ~M, S) = 0.1, P(L | ~M, ~S) = 0.2; P(R | M) = 0.3, P(R | ~M) = 0.6; P(T | L) = 0.3, P(T | ~L) = 0.8.

Computing with Bayes Net

- P(T and ~R and L and ~M and S)
  = P(T | ~R, L, ~M, S) * P(~R and L and ~M and S)
  = P(T | L) * P(~R and L and ~M and S)
  = P(T | L) * P(~R | L, ~M, S) * P(L and ~M and S)
  = P(T | L) * P(~R | ~M) * P(L and ~M and S)
  = P(T | L) * P(~R | ~M) * P(L | ~M, S) * P(~M and S)
  = P(T | L) * P(~R | ~M) * P(L | ~M, S) * P(~M | S) * P(S)
  = P(T | L) * P(~R | ~M) * P(L | ~M, S) * P(~M) * P(S)
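A small Python sketch of this computation, hard-coding the CPT values from the slide (the data structures and names are my own):

```python
P_M, P_S = 0.6, 0.3
P_R_given_M = {True: 0.3, False: 0.6}
P_L_given_MS = {(True, True): 0.05, (True, False): 0.1,
                (False, True): 0.1, (False, False): 0.2}
P_T_given_L = {True: 0.3, False: 0.8}

def bernoulli(p, value):
    return p if value else 1 - p

def joint(T, R, L, M, S):
    # P(T,R,L,M,S) = P(T|L) * P(R|M) * P(L|M,S) * P(M) * P(S)
    return (bernoulli(P_T_given_L[L], T) * bernoulli(P_R_given_M[M], R) *
            bernoulli(P_L_given_MS[(M, S)], L) *
            bernoulli(P_M, M) * bernoulli(P_S, S))

# P(T and ~R and L and ~M and S) from the slide:
print(joint(T=True, R=False, L=True, M=False, S=True))   # 0.3*0.4*0.1*0.4*0.3 = 0.00144
```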

Computing with Bayes Net (cont.)

- P(R | T and ~S) = P(R and T and ~S) / P(T and ~S) = P(R and T and ~S) / ( P(R and T and ~S) + P(~R and T and ~S) )
- Compute P(R and T and ~S) by summing out the remaining variables L and M:
  P(R and T and ~S) = P(L and M and R and T and ~S) + P(~L and M and R and T and ~S) + P(L and ~M and R and T and ~S) + P(~L and ~M and R and T and ~S)
- Compute P(~R and T and ~S) similarly.
- Any problem here? Yes, possibly many terms to be computed...

Inference with Bayesian Networks

- We want to compute P(Ci | X=x).
- Assume the output attribute Y's node has only input attribute nodes as parents and all of these input values are given. Then P(Ci | X=x) = P(Ci | parents(Y)), i.e., we can read it directly from the CPT.
- What if values are given only for a subset of the attributes? We can still compute the answer from the Bayesian network, but exact inference of probabilities for an arbitrary Bayesian network is NP-hard.
- Solutions: approximate probabilistic inference, trading precision for efficiency.

Training Bayesian Networks

- Several scenarios:
  - Network structure given and all variables observable: learn only the CPTs.
  - Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning.
  - Network structure unknown, all variables observable: search through the model space to reconstruct the network topology.
  - Unknown structure, all hidden variables: no good algorithms known for this purpose.
- Ref.: D. Heckerman: Bayesian networks for data mining

Classification and Prediction Overview (section outline): Introduction, Decision Trees, Statistical Decision Theory, Nearest Neighbor, Bayesian Classification, Artificial Neural Networks, Support Vector Machines (SVMs), Prediction, Accuracy and Error Measures, Ensemble Methods
Basic Building Block: Perceptron

- Inputs x_1,...,x_d with weight vector w = (w_1,...,w_d), a bias b, a weighted sum, and an activation function producing output y:

  f(x) = sign( b + \sum_{i=1}^{d} w_i x_i )

- b is called the bias.

Perceptron Decision Hyperplane

- Input: {(x1, x2, y),...}; output: a classification function f(x) with f(x) > 0: return +1, f(x) <= 0: return -1.
- Decision hyperplane: b + w \cdot x = 0 (for two inputs: b + w_1 x_1 + w_2 x_2 = 0).
- Note: b + w \cdot x > 0 if and only if \sum_{i=1}^{d} w_i x_i > -b, so b represents a threshold for when the perceptron fires.

Representing Boolean Functions

- AND with a two-input perceptron: b = -0.8, w1 = w2 = 0.5.
- OR with a two-input perceptron: b = -0.3, w1 = w2 = 0.5.
- m-of-n function (true if at least m out of n inputs are true): all input weights 0.5, threshold weight b set according to m and n.
- Can also represent NAND, NOR. What about XOR?

Perceptron Training Rule

- Goal: correct +1/-1 output for each training record.
- Start with random weights and select a constant η (learning rate).
- For each training record (x, y):
  - Let f_old(x) be the output of the current perceptron for x.
  - Set b := b + Δb, where Δb = η ( y - f_old(x) ).
  - For all i, set w_i := w_i + Δw_i, where Δw_i = η ( y - f_old(x) ) x_i.
- Keep iterating over the training records until all are correctly classified.
- Converges to a correct decision boundary if the classes are linearly separable and a small enough η is used. Why?
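A minimal implementation of the perceptron training rule above; the learning rate, the initialization range, and the AND data set are my choices, not from the slides.

```python
import random

def train_perceptron(data, eta=0.1, max_epochs=100):
    d = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(d)]
    b = random.uniform(-0.05, 0.05)
    f = lambda x: 1 if b + sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
    for _ in range(max_epochs):
        errors = 0
        for x, y in data:
            delta = eta * (y - f(x))        # zero if already correctly classified
            if delta != 0:
                b += delta
                w = [wi + delta * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                     # all records correctly classified
            break
    return w, b

# AND function with +1/-1 labels (linearly separable, so this converges):
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(and_data))
```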

Gradient Descent

- If the training records are not linearly separable, find a best-fit approximation.
- Gradient descent searches the space of possible weight vectors; it is the basis for the Backpropagation algorithm.
- Consider the un-thresholded perceptron (no sign function applied), i.e., u(x) = b + w \cdot x, and measure the training error by the squared error

  E(b, w) = \frac{1}{2} \sum_{(x,y) \in D} (y - u(x))^2

  where D is the training data. (Let w_0 := b, so the bias can be treated like any other weight.)

Gradient Descent Rule

- Find the weight vector that minimizes E(b, w) by altering it in the direction of steepest descent: set (b, w) := (b, w) + Δ(b, w), where Δ(b, w) = -η ∇E(b, w).
- ∇E(b, w) = [ ∂E/∂b, ∂E/∂w_1,..., ∂E/∂w_n ] is the gradient, hence

  b := b + η \sum_{(x,y) \in D} (y - u(x))
  w_i := w_i + η \sum_{(x,y) \in D} (y - u(x)) x_i

- Start with random weights and iterate until convergence; this will converge to the global minimum if η is small enough.
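A sketch of batch (epoch) gradient descent for the un-thresholded perceptron u(x) = b + w·x, minimizing the squared error E(b, w) defined above; the learning rate and the toy data are my own.

```python
def gradient_descent(data, eta=0.05, epochs=500):
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Epoch updating: accumulate the gradient over the entire training set.
        grad_b = sum(y - (b + sum(wi * xi for wi, xi in zip(w, x)))
                     for x, y in data)
        grad_w = [sum((y - (b + sum(wi * xi for wi, xi in zip(w, x)))) * x[i]
                      for x, y in data) for i in range(d)]
        b += eta * grad_b
        w = [wi + eta * gwi for wi, gwi in zip(w, grad_w)]
    return w, b

# Data generated from y = 2*x1 - 1, so the best fit is w = [2.0], b = -1.0:
data = [((0.0,), -1.0), ((0.5,), 0.0), ((1.0,), 1.0)]
print(gradient_descent(data))   # w close to [2.0], b close to -1.0
```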
Gradient Descent Summary

- Epoch updating (aka batch mode): do until satisfied with the model; compute the gradient over the entire training set, then update all weights based on that gradient.
- Case updating (aka incremental mode, stochastic gradient descent): do until satisfied with the model; for each training record, compute the gradient for this single record and update all weights based on it.
- Case updating can approximate epoch updating arbitrarily closely if η is small enough.
- The perceptron training rule and case updating might seem identical; the difference is that the error is computed on the thresholded vs. the unthresholded output.

Multilayer Feedforward Networks

- Use another perceptron to combine the output of a lower layer (input layer, hidden layer, output layer).
- What about linear units only? They can only construct linear functions! We need a nonlinear component.
- The sign function is not differentiable (a problem for gradient descent!), so use the sigmoid: σ(x) = 1 / (1 + e^{-x}).
- Perceptron function: y = 1 / (1 + e^{-(b + w \cdot x)})
- (Figure: plot of the sigmoid function 1/(1+exp(-x)).)

1-Hidden Layer Net Example

- N_INP = 2 input units, N_HID = 3 hidden units, one output unit; g is usually the sigmoid function.

  v_j = g( \sum_{k=1}^{N_INP} w_{jk} x_k )   for j = 1, 2, 3
  Out = g( \sum_{k=1}^{N_HID} W_k v_k )

Making Predictions

- Inputs: all input data attributes; the record is fed simultaneously into the units of the input layer, then weighted and fed simultaneously to a hidden layer.
- The number of hidden layers is arbitrary, although usually only one is used.
- The weighted outputs of the last hidden layer are the input to the units in the output layer, which emits the network's prediction.
- The network is feed-forward: none of the weights cycles back to an input unit or to an output unit of a previous layer.
- Statistical point of view: neural networks perform nonlinear regression.
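A forward pass for this 2-3-1 network as a Python sketch (the weight values are arbitrary placeholders; biases are omitted, as in the formulas above):

```python
import math

def g(z):                       # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hidden, W_out):
    # v_j = g(sum_k w_jk * x_k);  Out = g(sum_j W_j * v_j)
    v = [g(sum(w_jk * x_k for w_jk, x_k in zip(row, x))) for row in W_hidden]
    return g(sum(W_j * v_j for W_j, v_j in zip(W_out, v)))

W_hidden = [[0.5, -0.2],        # weights into v1
            [0.1,  0.4],        # weights into v2
            [-0.3, 0.8]]        # weights into v3
W_out = [0.7, -0.5, 0.2]        # weights into the output unit
print(forward([1.0, 2.0], W_hidden, W_out))
```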

Backpropagation Algorithm

- We discussed gradient descent to find the best weights for a single perceptron with a simple un-thresholded function. If a sigmoid (or other differentiable) function is applied to the weighted sum, use the complete function for gradient descent.
- Multiple perceptrons: optimize over all weights of all perceptrons. Problems: huge search space, local minima.
- Backpropagation:
  - Initialize all weights with small random values.
  - Iterate many times:
    - Compute the gradient, starting at the output and working back. (Error of a hidden unit h: how do we get its true output value? Use the weighted sum of the errors of each unit influenced by h.)
    - Update all weights in the network.

Overfitting

- When do we stop updating the weights? We might overfit the training data; overfitting tends to happen in later iterations.
- Weights are initially small random values; while the weights are all similar the decision surface is smooth, and surface complexity increases as the weights diverge.
- Preventing overfitting: weight decay (decrease each weight by a small factor during each iteration), or use validation data to decide when to stop iterating.
Neural Network Decision Boundary

- (Figure: neural network decision boundaries. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

Backpropagation Remarks

- Computational cost: each iteration costs O(|D| * |w|), with |D| training records and |w| weights. The number of iterations can be exponential in n, the number of inputs (in practice often tens of thousands).
- Local minima can trap the gradient descent algorithm: convergence is guaranteed only to a local minimum, not the global one.
- Backpropagation is highly effective in practice; many variants deal with the local-minima issue, e.g., case updating might avoid a local minimum.

Defining a Network

1. Decide the network topology: # input units, # hidden layers, # units in each hidden layer, # output units.
2. Normalize the input values for each attribute to [0.0, 1.0]. Transform nominal and ordinal attributes: one input unit per domain value, each initialized to 0. (Why not map the attribute to a single input with domain [0.0, 1.0]?)
3. Output for a classification task with more than 2 classes: one output unit per class.
4. Choose the learning rate η. Too small: can take days instead of minutes to converge; too large: diverges (the MSE gets larger while the weights increase and usually oscillate). Heuristic: set it to 1 / (#training iterations).
5. If the model accuracy is unacceptable, re-train with a different network topology, a different set of initial weights, or a different learning rate. Might need a lot of trial-and-error.

Representational Power

- Boolean functions: each can be represented by a 2-layer network, but the number of hidden units can grow exponentially with the number of inputs (create a hidden unit for each input record, set its weights to activate only for that input, and implement the output unit as an OR gate that activates only for the desired output patterns).
- Continuous functions: every bounded continuous function can be approximated arbitrarily closely by a 2-layer network.
- Any function can be approximated arbitrarily closely by a 3-layer network.

Neural Network as a Classifier

- Weaknesses:
  - Long training time
  - Many non-trivial parameters, e.g., network topology
  - Poor interpretability: what is the meaning behind the learned weights and hidden units? (Note: hidden units are an alternative representation of the input values, capturing their relevant features.)
- Strengths:
  - High tolerance to noisy data
  - Well-suited for continuous-valued inputs and outputs
  - Successful on a wide array of real-world data
  - Techniques exist for extracting rules from neural networks

Classification and Prediction Overview (section outline): Introduction, Decision Trees, Statistical Decision Theory, Nearest Neighbor, Bayesian Classification, Artificial Neural Networks, Support Vector Machines (SVMs), Prediction, Accuracy and Error Measures, Ensemble Methods
SVM: Support Vector Machines

- A newer and very popular classification method.
- Uses a nonlinear mapping to transform the original training data into a higher dimension.
- Searches for the optimal separating hyperplane (i.e., decision boundary) in the new dimension.
- The SVM finds this hyperplane using support vectors ("essential" training records) and margins (defined by the support vectors).

SVM: History and Applications

- Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s.
- Training can be slow but accuracy is high: able to model complex nonlinear decision boundaries (margin maximization).
- Used both for classification and prediction.
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests.

Linear Classifiers

- A linear classifier has the form f(x, w, b) = sign(w \cdot x + b), predicting +1 on one side of the hyperplane and -1 on the other.
- Given linearly separable training data (classes denoted +1 and -1), how would you classify it? Many different separating lines (choices of w and b) classify the training data perfectly.
Linear Classifiers (cont.)

- Any of these separating hyperplanes would be fine.. ..but which is best?

Classifier Margin

- Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data record.

Maximum Margin

- Find the maximum margin linear classifier. This is the simplest kind of SVM, called linear SVM or LSVM.
- Support vectors are those data points that the margin pushes up against.

Why Maximum Margin?

- If we made a small error in the location of the boundary, this gives us the least chance of causing a misclassification.
- The model is immune to removal of any non-support-vector data records.
- There is some theory (using VC dimension) that is related to (but not the same as) the proposition that this is a good thing.
- Empirically it works very well.

Specifying a Line and Margin

- Plus-plane = { x : w \cdot x + b = +1 }
- Minus-plane = { x : w \cdot x + b = -1 }
- The classifier boundary lies between them. Classify as +1 if w \cdot x + b >= 1 and as -1 if w \cdot x + b <= -1. What if -1 < w \cdot x + b < 1?
Computing Margin Width

- M = margin width, the distance between the plus-plane { x : w \cdot x + b = +1 } and the minus-plane { x : w \cdot x + b = -1 }.
- Goal: compute M in terms of w and b.
- Note: the vector w is perpendicular to the plus-plane (consider two vectors u and v on the plus-plane and show that w \cdot (u - v) = 0); hence it is also perpendicular to the minus-plane.
- Choose an arbitrary point x- on the minus-plane and let x+ be the point on the plus-plane closest to x-. Since w is perpendicular to both planes, x+ = x- + λw for some value of λ.

Putting It All Together

- We have so far: w \cdot x+ + b = +1 and w \cdot x- + b = -1; x+ = x- + λw; |x+ - x-| = M.
- Derivation: w \cdot (x- + λw) + b = +1, hence w \cdot x- + b + λ (w \cdot w) = 1. This implies λ (w \cdot w) = 2, i.e., λ = 2 / (w \cdot w).
- Since M = |x+ - x-| = |λw| = λ|w| = λ (w \cdot w)^{1/2}, we obtain M = 2 (w \cdot w)^{1/2} / (w \cdot w) = 2 / (w \cdot w)^{1/2}.

Finding the Maximum Margin

- How do we find w and b such that the margin is maximized and all training records are in the correct zone for their class?
- Solution: Quadratic Programming (QP).
- QP is a well-studied class of optimization algorithms for maximizing a quadratic function of some real-valued variables subject to linear constraints. There exist algorithms for finding such constrained quadratic optima efficiently and reliably.

Quadratic Programming

- Find arg max_u  c + d^T u + \frac{u^T R u}{2}   (quadratic criterion)
- Subject to n additional linear inequality constraints:

  a_{i1} u_1 + a_{i2} u_2 + ... + a_{im} u_m <= b_i   for i = 1,...,n

- And subject to e additional linear equality constraints:

  a_{(n+j)1} u_1 + a_{(n+j)2} u_2 + ... + a_{(n+j)m} u_m = b_{(n+j)}   for j = 1,...,e

What Are the SVM Constraints?

- The margin is M = 2 / \sqrt{w \cdot w}. Consider n training records (x(k), y(k)), where y(k) = +/- 1.
- What is the quadratic optimization criterion? How many constraints will we have? What should they be?
What Are the SVM Constraints? (cont.)

- Quadratic optimization criterion: minimize w \cdot w.
- There are n constraints, one per training record: for each 1 <= k <= n,
  w \cdot x(k) + b >= 1 if y(k) = 1, and w \cdot x(k) + b <= -1 if y(k) = -1.

Problem: Classes Not Linearly Separable

- If the classes overlap, the inequalities for the training records are not satisfiable by any w and b.

Solution 1?

- Find the minimum w \cdot w while also minimizing the number of training set errors.
- Not a well-defined optimization problem (cannot optimize two things at the same time).

Solution 2?

- Minimize w \cdot w + C * (#trainSetErrors), where C is a tradeoff parameter.
- Problems: cannot be expressed as a QP, hence finding a solution might be slow; and it does not distinguish between disastrous errors and near misses.

Solution 3 What Are the SVM Constraints?


Minimize w·w + C·(distance of error records to their correct place)
This works! (figure: error records at slack distances from their correct side, e.g., ε₂, ε₇, ε₁₁; margin M = 2/√(w·w))
But still need to do something about the unsatisfiable set of inequalities
Consider n training records (x(k), y(k)), where y(k) = +/- 1
How many constraints will we have? n. What should they be?
For each 1 ≤ k ≤ n:
  w·x(k) + b ≥ +1 - εₖ, if y(k) = 1
  w·x(k) + b ≤ -1 + εₖ, if y(k) = -1
  εₖ ≥ 0
What is the quadratic optimization criterion? Minimize $\frac{1}{2} w \cdot w + C \sum_{k=1}^{n} \varepsilon_k$
183 184

27
Facts About the New Problem
Effect of Parameter C
Formulation
Original QP formulation had d+1 variables
w1, w2,..., wd and b
New QP formulation has d+1+n variables
w1, w2,..., wd and b
ε1, ε2,..., εn
C is a new parameter that needs to be set for
the SVM
Controls tradeoff between paying attention to
margin size versus misclassifications
Source: Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning
185 186
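A small sketch of the tradeoff controlled by C, assuming scikit-learn is available and using a synthetic two-class data set; smaller C tolerates more margin violations, which typically shows up as more support vectors and a wider margin.

```python
# Sketch: effect of the C parameter on a soft-margin linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin = 2.0 / np.sqrt(w @ w)   # margin width M = 2 / sqrt(w·w)
    print(f"C={C:>6}: support vectors={clf.n_support_.sum():3d}, margin width={margin:.3f}")
```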

An Equivalent QP (The Dual) Important Facts


Maximize $\sum_{k=1}^{n} \alpha_k - \frac{1}{2} \sum_{k=1}^{n} \sum_{l=1}^{n} \alpha_k \alpha_l \, y(k)\, y(l)\, x(k) \cdot x(l)$
Subject to these constraints: $\forall k: 0 \le \alpha_k \le C$ and $\sum_{k=1}^{n} \alpha_k y(k) = 0$
Then define: $w = \sum_{k=1}^{n} \alpha_k y(k)\, x(k)$ and $b = \mathrm{AVG}_{k:\, 0 < \alpha_k < C} \left( \frac{1}{y(k)} - x(k) \cdot w \right)$
Then classify with: f(x,w,b) = sign(w·x + b)
Dual formulation of QP can be optimized more quickly, but result is equivalent
Data records with αₖ > 0 are the support vectors
  Those with 0 < αₖ < C lie on the plus- or minus-plane
  Those with αₖ = C are on the wrong side of the classifier boundary (have εₖ > 0)
Computation for w and b only depends on those records with αₖ > 0, i.e., the support vectors
Alternative QP has another major advantage, as we will see now...

187 188
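A short sketch, assuming scikit-learn, of how w follows from the dual solution: the library stores α_k·y(k) for the support vectors in dual_coef_, so w = Σ α_k y(k) x(k) reduces to one matrix product over the support vectors.

```python
# Sketch: recover w and b from the dual solution of a linear SVM.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=1)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.dual_coef_[0] @ clf.support_vectors_   # sum of alpha_k * y(k) * x(k) over support vectors only
b = clf.intercept_[0]

print(np.allclose(w, clf.coef_[0]))   # True: same w the library reports for a linear kernel
print(np.sign(X[:5] @ w + b))         # sign(w·x + b) gives the predicted side for the first records
```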

Easy To Separate Easy To Separate


What would Not a big surprise
SVMs do with
this data?

Positive plane Negative plane

189 190

28
Harder To Separate Harder To Separate
(figures: the same 1-dimensional data plotted on the X axis, and on the axes X and X' = X²)
What can be done about this?
Non-linear basis functions:
  Original data: (X, Y)
  Transformed: (X, X², Y)
  Think of X² as a new attribute, e.g., X'

191 192

Now Separation Is Easy Again / Corresponding Planes in Original Space
(figures: in the transformed (X, X') space the classes are linearly separable; mapped back to the original X axis, the plus-plane and minus-plane become the region above the plus-plane and the region below the minus-plane)

193 194

Common SVM Basis Functions / Quadratic Basis Functions
Polynomial of attributes X₁,..., X_d of certain max degree, e.g., X₂ + X₁X₃ + X₄²
Radial basis function
  Symmetric around center, i.e., KernelFunction(|X - c| / kernelWidth)
Sigmoid function of X, e.g., hyperbolic tangent
Let Φ(x) be the transformed input record
  Previous example: Φ(x) = (x, x²)
Quadratic basis functions:
$\Phi(x) = \left(1,\; \sqrt{2}x_1, \sqrt{2}x_2, \ldots, \sqrt{2}x_d,\; x_1^2, x_2^2, \ldots, x_d^2,\; \sqrt{2}x_1x_2, \sqrt{2}x_1x_3, \ldots, \sqrt{2}x_1x_d,\; \sqrt{2}x_2x_3, \ldots, \sqrt{2}x_{d-1}x_d\right)$
  Constant term, linear terms, pure quadratic terms, and quadratic cross-terms
Number of terms (assuming d input attributes): (d+2)-choose-2 = (d+2)(d+1)/2 ≈ d²/2
Why did we choose this specific transformation?
195 196

29
Dual QP With Basis Functions Computation Challenge
Maximize $\sum_{k=1}^{n} \alpha_k - \frac{1}{2} \sum_{k=1}^{n} \sum_{l=1}^{n} \alpha_k \alpha_l \, y(k)\, y(l)\, \Phi(x(k)) \cdot \Phi(x(l))$
Subject to these constraints: $\forall k: 0 \le \alpha_k \le C$ and $\sum_{k=1}^{n} \alpha_k y(k) = 0$
Then define: $w = \sum_{k=1}^{n} \alpha_k y(k)\, \Phi(x(k))$ and $b = \mathrm{AVG}_{k:\, 0 < \alpha_k < C} \left( \frac{1}{y(k)} - \Phi(x(k)) \cdot w \right)$
Then classify with: f(x,w,b) = sign(w·Φ(x) + b)
Input vector x has d components (its d attribute values)
The transformed input vector Φ(x) has about d²/2 components
Hence computing Φ(x(k))·Φ(x(l)) now costs order d²/2 instead of order d operations (additions, multiplications)
...or is there a better way to do this?
Take advantage of properties of certain transformations

197 198



Quadratic Dot Products
With the quadratic basis functions,
$\Phi(a) \cdot \Phi(b) = 1 + 2\sum_{i=1}^{d} a_i b_i + \sum_{i=1}^{d} a_i^2 b_i^2 + 2\sum_{i=1}^{d} \sum_{j=i+1}^{d} a_i a_j b_i b_j$
(contributions of the constant term, the linear terms, the pure quadratic terms, and the cross-terms, summed component by component)
Now consider another function of a and b:
$(a \cdot b + 1)^2 = (a \cdot b)^2 + 2\, a \cdot b + 1$
$= \left(\sum_{i=1}^{d} a_i b_i\right)^2 + 2\sum_{i=1}^{d} a_i b_i + 1$
$= \sum_{i=1}^{d}\sum_{j=1}^{d} a_i b_i a_j b_j + 2\sum_{i=1}^{d} a_i b_i + 1$
$= \sum_{i=1}^{d} (a_i b_i)^2 + 2\sum_{i=1}^{d}\sum_{j=i+1}^{d} a_i b_i a_j b_j + 2\sum_{i=1}^{d} a_i b_i + 1$
199 200

Quadratic Dot Products Any Other Computation Problems?


The results of Φ(a)·Φ(b) and of (a·b+1)² are identical
Computing Φ(a)·Φ(b) costs about d²/2 operations, while computing (a·b+1)² costs only about d+2 operations
This means that we can work in the high-dimensional space (d²/2 dimensions) where the training records are more easily separable, but pay about the same cost as working in the original space (d dimensions)
Savings are even greater when dealing with higher-degree polynomials, i.e., degree q > 2, that can be computed as (a·b+1)^q
Recall: $w = \sum_{k=1}^{n} \alpha_k y(k)\, \Phi(x(k))$ and $b = \mathrm{AVG}_{k:\, 0 < \alpha_k < C}\left(\frac{1}{y(k)} - \Phi(x(k)) \cdot w\right)$
What about computing w?
Finally need f(x,w,b) = sign(w·Φ(x) + b):
  $w \cdot \Phi(x) = \sum_{k=1}^{n} \alpha_k y(k) \, \Phi(x(k)) \cdot \Phi(x)$
  Can be computed using the same trick as before
Can apply the same trick again to b, because $\Phi(x(k)) \cdot w = \sum_{j=1}^{n} \alpha_j y(j) \, \Phi(x(k)) \cdot \Phi(x(j))$

201 202
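The identity above is easy to verify numerically. A small numpy sketch with the quadratic basis expansion written out exactly as on the previous slides:

```python
# Numeric check of the kernel trick: phi(a)·phi(b) equals (a·b + 1)^2 for the quadratic basis,
# so the ~d^2/2-dimensional dot product never has to be formed explicitly.
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic basis expansion: constant, linear, pure quadratic, and cross-terms."""
    sqrt2 = np.sqrt(2.0)
    cross = [sqrt2 * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], sqrt2 * x, x**2, cross))

rng = np.random.default_rng(0)
a, b = rng.normal(size=5), rng.normal(size=5)

print(phi(a) @ phi(b))      # explicit transformation: ~d^2/2 components
print((a @ b + 1.0) ** 2)   # kernel evaluation: ~d+2 operations, same value
```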

30
SVM Kernel Functions Overfitting
For which transformations, called kernels, does the same trick work?
Polynomial: K(a,b) = (a·b + 1)^q
Radial-Basis-style (RBF): $K(a,b) = \exp\left(-\frac{(a-b)^2}{2\sigma^2}\right)$
Neural-net-style sigmoidal: $K(a,b) = \tanh(\kappa\, a \cdot b - \delta)$
  q, σ, κ, and δ are magic parameters that must be chosen by a model selection method.
With the right kernel function, computation in high dimensional transformed space is no problem
But what about overfitting? There are so many parameters...
Usually not a problem, due to maximum margin approach
  Only the support vectors determine the model, hence SVM complexity depends on number of support vectors, not dimensions (still, in higher dimensions there might be more support vectors)
  Minimizing w·w discourages extremely large weights, which smoothes the function (recall weight decay for neural networks!)
203 204

Different Kernels Multi-Class Classification


SVMs can only handle two-class outputs (i.e. a
categorical output variable with arity 2).
What can be done?
Answer: with output arity N, learn N SVMs
SVM 1 learns Output==1 vs Output != 1
SVM 2 learns Output==2 vs Output != 2
:
SVM N learns Output==N vs Output != N
To predict the output for a new input, just predict
with each SVM and find out which one puts the
Source: Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning
prediction the furthest into the positive region.
205 206
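A minimal one-vs-rest sketch, assuming scikit-learn and using the Iris data purely as an example of an output with arity 3:

```python
# Sketch of one-vs-rest: train one SVM per class, predict with the largest decision value.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # 3 classes
classes = np.unique(y)

# One SVM per class: class c vs. everything else
svms = [SVC(kernel="rbf", C=1.0).fit(X, (y == c).astype(int)) for c in classes]

# Prediction: the class whose SVM pushes the record furthest into its positive region
scores = np.column_stack([clf.decision_function(X) for clf in svms])
pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", np.mean(pred == y))
```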

Why Is SVM Effective on High


SVM vs. Neural Network
Dimensional Data?
Complexity of trained classifier is characterized by the number of support vectors, not dimensionality of the data
If all other training records are removed and training is repeated, the same separating hyperplane would be found
The number of support vectors can be used to compute an upper bound on the expected error rate of the SVM, which is independent of data dimensionality
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
SVM: relatively new concept; deterministic algorithm; nice generalization properties; hard to train, learned in batch mode using quadratic programming techniques; using kernels can learn very complex functions
Neural Network: relatively old; nondeterministic algorithm; generalizes well but doesn't have a strong mathematical foundation; can easily be learned in incremental fashion; to learn complex functions, use multilayer perceptron (not that trivial)

207 209

31
Classification and Prediction Overview What Is Prediction?
Introduction Essentially the same as classification, but output is
Decision Trees continuous, not discrete
Statistical Decision Theory Construct a model
Use model to predict continuous output value for a given
Nearest Neighbor input
Bayesian Classification Major method for prediction: regression
Artificial Neural Networks Many variants of regression analysis in statistics literature;
Support Vector Machines (SVMs) not covered in this class
Prediction Neural network and k-NN can do regression out-of-
Accuracy and Error Measures the-box
Ensemble Methods SVMs for regression exist
What about trees?

210 211

Regression Trees and Model Trees Classification and Prediction Overview


Regression tree: proposed in CART system (Breiman et Introduction
al. 1984) Decision Trees
CART: Classification And Regression Trees Statistical Decision Theory
Each leaf stores a continuous-valued prediction Nearest Neighbor
Average output value for the training records that reach the leaf Bayesian Classification
Model tree: proposed by Quinlan (1992) Artificial Neural Networks
Each leaf holds a regression model, a multivariate linear
equation Prediction
Accuracy and Error Measures
Training: like for classification trees, but uses variance Ensemble Methods
instead of purity measure for selecting split predicates

212 213
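A regression-tree sketch, assuming scikit-learn: DecisionTreeRegressor splits by variance (squared-error) reduction and stores the mean training output in each leaf, which matches the CART-style regression tree described above.

```python
# Sketch of a regression tree on a noisy continuous target.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(criterion="squared_error", max_depth=4).fit(X, y)
print(tree.predict([[2.5], [7.0]]))   # each prediction is the mean y of the leaf's training records
```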

Classifier Accuracy Measures Precision and Recall


Confusion matrix (true class vs. predicted class):
                          predicted buy_computer = yes   predicted buy_computer = no   total
  true buy_computer = yes   6954                           46                            7000
  true buy_computer = no     412                         2588                            3000
  total                     7366                         2634                           10000
Accuracy of a classifier M, acc(M): percentage of test records that are correctly classified by M
Error rate (misclassification rate) of M = 1 - acc(M)
Given m classes, CM[i,j], an entry in a confusion matrix, indicates # of records in class i that are labeled by the classifier as class j
  For two classes:   predicted C1        predicted C2
    true C1          true positive       false negative
    true C2          false positive      true negative
Precision: measure of exactness = t-pos / (t-pos + f-pos)
Recall: measure of completeness = t-pos / (t-pos + f-neg)
F-measure: combination of precision and recall = 2 * precision * recall / (precision + recall)
Note: Accuracy = (t-pos + t-neg) / (t-pos + t-neg + f-pos + f-neg)
214 215
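Plugging the buy_computer confusion matrix from this slide into the definitions above:

```python
# Accuracy, precision, recall, and F-measure for the buy_computer example (class "yes" is positive).
t_pos, f_neg = 6954, 46      # true class yes: predicted yes / predicted no
f_pos, t_neg = 412, 2588     # true class no:  predicted yes / predicted no

accuracy  = (t_pos + t_neg) / (t_pos + t_neg + f_pos + f_neg)
precision = t_pos / (t_pos + f_pos)
recall    = t_pos / (t_pos + f_neg)
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} F={f_measure:.4f}")
# accuracy=0.9542 precision=0.9441 recall=0.9934 F=0.9681
```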

32
Limitation of Accuracy Cost-Sensitive Measures: Cost Matrix
Consider a 2-class problem
  Number of Class 0 examples = 9990
  Number of Class 1 examples = 10
If model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
  Accuracy is misleading because model does not detect any class 1 example
Always predicting the majority class defines the baseline
  A good classifier should do better than baseline
Cost matrix C(i|j), with rows = ACTUAL CLASS and columns = PREDICTED CLASS:
                      predicted Class=Yes     predicted Class=No
  actual Class=Yes    C(Yes|Yes)              C(No|Yes)
  actual Class=No     C(Yes|No)               C(No|No)

216 217

Computing Cost of Classification Prediction Error Measures


Cost matrix (rows = actual class, columns = predicted class):
               predicted +    predicted -
  actual +     -1             100
  actual -      1               0
Model M1:
               predicted +    predicted -
  actual +     150            40
  actual -      60            250
  Accuracy = 80%, Cost = 3910
Model M2:
               predicted +    predicted -
  actual +     250            45
  actual -       5            200
  Accuracy = 90%, Cost = 4255
Continuous output: it matters how far off the prediction is from the true value
Loss function: distance between y and predicted value y'
  Absolute error: |y - y'|
  Squared error: (y - y')²
Test error (generalization error): average loss over the test set
  Mean absolute error: $\frac{1}{n}\sum_{i=1}^{n} |y(i) - y'(i)|$
  Mean squared error: $\frac{1}{n}\sum_{i=1}^{n} (y(i) - y'(i))^2$
  Relative absolute error: $\frac{\sum_{i=1}^{n} |y(i) - y'(i)|}{\sum_{i=1}^{n} |y(i) - \bar{y}|}$
  Relative squared error: $\frac{\sum_{i=1}^{n} (y(i) - y'(i))^2}{\sum_{i=1}^{n} (y(i) - \bar{y})^2}$
Squared error exaggerates the presence of outliers
218 219
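The cost figures above can be reproduced directly: total cost is the cell-wise product of the confusion matrix and the cost matrix, summed.

```python
# Cost of classification for models M1 and M2 under the cost matrix above.
import numpy as np

cost = np.array([[-1, 100],    # actual +: C(+|+), C(-|+)
                 [ 1,   0]])   # actual -: C(+|-), C(-|-)

m1 = np.array([[150, 40],      # actual +: predicted +, predicted -
               [ 60, 250]])    # actual -: predicted +, predicted -
m2 = np.array([[250, 45],
               [  5, 200]])

for name, cm in (("M1", m1), ("M2", m2)):
    acc = cm.trace() / cm.sum()
    print(name, "accuracy =", acc, "cost =", int((cm * cost).sum()))
# M1 accuracy = 0.8 cost = 3910; M2 accuracy = 0.9 cost = 4255
```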

Evaluating a Classifier or Predictor Learning Curve


Holdout method Accuracy versus
sample size
The given data set is randomly partitioned into two sets
Training set (e.g., 2/3) for model construction
Effect of small
sample size:
Test set (e.g., 1/3) for accuracy estimation
Bias in estimate
Can repeat holdout multiple times Variance of
Accuracy = avg. of the accuracies obtained estimate
Helps determine how
Cross-validation (k-fold, where k = 10 is most popular) much training data is
needed
Randomly partition data into k mutually exclusive subsets, Still need to have
each approximately equal size enough test and
In i-th iteration, use Di as test set and others as training set validation data to
be representative
Leave-one-out: k folds where k = # of records of distribution
Expensive, often results in high variance of performance metric

220 221
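A k-fold cross-validation sketch, assuming scikit-learn; the data set and classifier are placeholders, the point is the partition/train/test/average loop.

```python
# 10-fold cross-validation: partition into k folds, train on k-1, test on the held-out fold, average.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    clf = DecisionTreeClassifier().fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))

print("10-fold CV accuracy: %.3f +/- %.3f" % (np.mean(accuracies), np.std(accuracies)))
```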

33
ROC (Receiver Operating
ROC Curve
Characteristic)
Developed in 1950s for signal detection theory to 1-dimensional data set containing 2 classes (positive and negative)
Any point located at x > t is classified as positive
analyze noisy signals
Characterizes trade-off between positive hits and false
alarms
ROC curve plots T-Pos rate (y-axis) against F-Pos
rate (x-axis)
Performance of each classifier is represented as a
point on the ROC curve
Changing the threshold of the algorithm, sample
distribution or cost matrix changes the location of the
point
At threshold t:
TPR=0.5, FPR=0.12
222 223

ROC Curve Diagonal Line for Random Guessing


(TPR, FPR):
(0,0): declare everything to Classify a record as positive with fixed probability
be negative class p, irrespective of attribute values
(1,1): declare everything to Consider test set with a positive and b negative
be positive class records
(1,0): ideal True positives: p*a, hence true positive rate =
(p*a)/a = p
Diagonal line: False positives: p*b, hence false positive rate =
Random guessing (p*b)/b = p
For every value 0 ≤ p ≤ 1, we get point (p,p) on ROC
curve

224 225

Using ROC for Model Comparison How to Construct an ROC curve


Neither model
record P(+|x) True Class
Use classifier that produces
consistently posterior probability P(+|x)
outperforms the 1 0.95 +
other 2 0.93 +
for each test record x
M1 better for small 3 0.87 - Sort records according to
FPR
4 0.85 - P(+|x) in decreasing order
M2 better for large
FPR 5 0.85 - Apply threshold at each
6 0.85 + unique value of P(+|x)
Area under the ROC 7 0.76 - Count number of TP, FP, TN, FN
curve 8 0.53 + at each threshold
Ideal: area = 1 9 0.43 - TP rate, TPR = TP/(TP+FN)
Random guess: 10 0.25 + FP rate, FPR = FP/(FP+TN)
area = 0.5

226 227

34
How To Construct An ROC Curve Test of Significance
Sorted test records and counts at each threshold (the three records with P(+|x) = 0.85 share one threshold column):
  Class:        +     -     +     -     +     -     -     -     +     +
  P(+|x):       0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95
  Threshold >=  0.25  0.43  0.53  0.76  0.85  0.87  0.93  0.95  1.00
  TP            5     4     4     3     3     2     2     1     0
  FP            5     5     4     4     3     1     0     0     0
  TN            0     0     1     1     2     4     5     5     5
  FN            0     1     1     2     2     3     3     4     5
  TPR           1     0.8   0.8   0.6   0.6   0.4   0.4   0.2   0
  FPR           1     1     0.8   0.8   0.6   0.2   0     0     0
(figure: ROC curve through these (FPR, TPR) points, true positive rate on the y-axis versus false positive rate on the x-axis)
Given two models:
  Model M1: accuracy = 85%, tested on 30 instances
  Model M2: accuracy = 75%, tested on 5000 instances
Can we say M1 is better than M2?
  How much confidence can we place on accuracy of M1 and M2?
  Can the difference in accuracy be explained as a result of random fluctuations in the test set?
228 229
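The table above can be reproduced from the ten scored records; a small numpy sketch that sweeps the threshold and prints (TPR, FPR) at each cut:

```python
# ROC points from the 10 test records on the previous slides.
import numpy as np

scores = np.array([0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25])
labels = np.array(['+',  '+',  '-',  '-',  '-',  '+',  '-',  '+',  '-',  '+'])
P, N = np.sum(labels == '+'), np.sum(labels == '-')

for t in sorted(set(scores)) + [1.00]:          # unique thresholds, plus 1.00
    pred_pos = scores >= t
    tp = np.sum(pred_pos & (labels == '+'))
    fp = np.sum(pred_pos & (labels == '-'))
    print(f"threshold >= {t:.2f}: TPR={tp / P:.1f}, FPR={fp / N:.1f}")
```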

Confidence Interval for Accuracy Confidence Interval for Accuracy


Classification can be regarded as a Bernoulli trial
  A Bernoulli trial has 2 possible outcomes, correct or wrong for classification
  Collection of Bernoulli trials has a Binomial distribution
Probability of getting c correct predictions if model accuracy is p (= probability to get a single prediction right): $\binom{n}{c} p^c (1-p)^{n-c}$
Given c, or equivalently, ACC = c / n and n (#test records), can we predict p, the true accuracy of the model?
Binomial distribution for X = number of correctly classified test records out of n: E(X) = pn, Var(X) = p(1-p)n
Accuracy = X / n: E(ACC) = p, Var(ACC) = p(1-p) / n
For large test sets (n > 30), the Binomial distribution is closely approximated by a normal distribution with the same mean and variance
ACC has a normal distribution with mean = p, variance = p(1-p)/n
  (figure: standard normal density with area 1-α between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$)
  $P\left(Z_{\alpha/2} \le \frac{ACC - p}{\sqrt{p(1-p)/n}} \le Z_{1-\alpha/2}\right) = 1 - \alpha$
Confidence interval for p: $p = \frac{2n \cdot ACC + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4n \cdot ACC - 4n \cdot ACC^2}}{2(n + Z_{\alpha/2}^2)}$
230 231

Comparing Performance of Two


Confidence Interval for Accuracy
Models
Consider a model that produces an accuracy of 80% when evaluated on 100 test instances
  n = 100, ACC = 0.8
  Let 1-α = 0.95 (95% confidence)
  From probability table, $Z_{\alpha/2}$ = 1.96
    (1-α: 0.99 → Z = 2.58; 0.98 → 2.33; 0.95 → 1.96; 0.90 → 1.65)
  $p = \frac{2n \cdot ACC + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4n \cdot ACC - 4n \cdot ACC^2}}{2(n + Z_{\alpha/2}^2)}$
    N:        50     100    500    1000   5000
    p(lower): 0.670  0.711  0.763  0.774  0.789
    p(upper): 0.888  0.866  0.833  0.824  0.811
Given two models M1 and M2, which is better?
  M1 is tested on D1 (size = n1), found error rate = e1
  M2 is tested on D2 (size = n2), found error rate = e2
  Assume D1 and D2 are independent
If n1 and n2 are sufficiently large, then err1 ~ N(μ1, σ1) and err2 ~ N(μ2, σ2)
Estimate: $\mu_i \approx e_i$ and $\sigma_i^2 \approx \frac{e_i(1-e_i)}{n_i}$
232 233
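A sketch that evaluates the interval formula above for ACC = 0.8 and the test-set sizes in the table (plain Python, no extra libraries); the output essentially reproduces the p(lower)/p(upper) rows.

```python
# Confidence interval for accuracy: ACC = 0.8, 95% confidence, Z_{alpha/2} = 1.96.
import math

def acc_confidence_interval(acc, n, z=1.96):
    center = 2 * n * acc + z**2
    half   = z * math.sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom  = 2 * (n + z**2)
    return (center - half) / denom, (center + half) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = acc_confidence_interval(0.8, n)
    print(f"n={n:5d}: p in [{lo:.3f}, {hi:.3f}]")   # e.g., n=100 gives roughly [0.711, 0.867]
```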

35
Testing Significance of Accuracy
An Illustrative Example
Difference
Consider random variable d = err1 - err2
Since err1, err2 are normally distributed, so is their difference
Hence d ~ N(dt, σt) where dt is the true difference
Estimator for dt: E[d] = E[err1 - err2] = E[err1] - E[err2] ≈ e1 - e2
Since D1 and D2 are independent, variance adds up:
  $\sigma_t^2 = \sigma_1^2 + \sigma_2^2 \approx \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}$
At (1-α) confidence level, $d_t = E[d] \pm Z_{\alpha/2}\, \sigma_t$
Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25
E[d] = |e1 - e2| = 0.1
2-sided test: dt = 0 versus dt ≠ 0
$\sigma_t^2 \approx \frac{0.15(1-0.15)}{30} + \frac{0.25(1-0.25)}{5000} = 0.0043$
At 95% confidence level, $Z_{\alpha/2}$ = 1.96
$d_t = 0.100 \pm 1.96\sqrt{0.0043} = 0.100 \pm 0.128$
Interval contains zero, hence the difference may not be statistically significant
But: may reject null hypothesis (dt = 0) at lower confidence level

234 235
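The example above in code (plain Python):

```python
# Significance test: M1 (e1=0.15 on n1=30) vs. M2 (e2=0.25 on n2=5000), 95% confidence.
import math

n1, e1 = 30, 0.15
n2, e2 = 5000, 0.25

d = abs(e1 - e2)                                   # observed difference, 0.1
var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2      # variance of the difference, ~0.0043
half_width = 1.96 * math.sqrt(var)                 # Z_{alpha/2} = 1.96

print(f"d_t = {d:.3f} +/- {half_width:.3f}")       # 0.100 +/- 0.128
print("interval contains 0:", d - half_width <= 0 <= d + half_width)
```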

Significance Test for K-Fold Cross-


Classification and Prediction Overview
Validation
Each learning algorithm produces k models:
  L1 produces M11, M12, ..., M1k
  L2 produces M21, M22, ..., M2k
Both models are tested on the same test sets D1, D2, ..., Dk
For each test set, compute dj = e1,j - e2,j
For large enough k, dj is normally distributed with mean dt and variance σt
Estimate: $\sigma_t^2 = \frac{\sum_{j=1}^{k} (d_j - \bar{d})^2}{k(k-1)}$
t-distribution: get t coefficient $t_{1-\alpha,k-1}$ from table by looking up confidence level (1-α) and degrees of freedom (k-1)
$d_t = \bar{d} \pm t_{1-\alpha,k-1}\, \sigma_t$
Introduction
Decision Trees
Statistical Decision Theory
Nearest Neighbor
Bayesian Classification
Artificial Neural Networks
Support Vector Machines (SVMs)
Prediction
Accuracy and Error Measures
Ensemble Methods
236 237
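A sketch of the paired test above; the per-fold error rates are made-up numbers for illustration, and scipy is assumed for the t coefficient.

```python
# Significance test for k-fold cross-validation: paired differences d_j = e1_j - e2_j across k folds.
import numpy as np
from scipy import stats

# Hypothetical per-fold error rates for two learning algorithms (k = 10 folds)
e1 = np.array([0.12, 0.15, 0.10, 0.14, 0.13, 0.16, 0.11, 0.15, 0.12, 0.14])
e2 = np.array([0.14, 0.16, 0.13, 0.15, 0.15, 0.17, 0.12, 0.18, 0.14, 0.15])

k = len(e1)
d = e1 - e2
d_bar = d.mean()
sigma_t = np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))
t_coef = stats.t.ppf(0.95, df=k - 1)   # t_{1-alpha, k-1} for confidence level 1-alpha = 0.95

print(f"d_t = {d_bar:.4f} +/- {t_coef * sigma_t:.4f}")
```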

Ensemble Methods General Idea


Construct a set of classifiers from the training D
Original
Training data

data
Step 1:
Create Multiple D1 D2 .... Dt-1 Dt
Data Sets
Predict class label of previously unseen
records by aggregating predictions made by Step 2:
Build Multiple C1 C2 Ct -1 Ct
multiple classifiers Classifiers

Step 3:
Combine C*
Classifiers

238 239

36
Why Does It Work? Base Classifier vs. Ensemble Error
Consider 2-class problem
Suppose there are 25 base classifiers
Each classifier has error rate = 0.35
Assume the classifiers are independent
Return majority vote of the 25 classifiers
Probability that the ensemble classifier makes a
wrong prediction: $\sum_{i=13}^{25} \binom{25}{i} \varepsilon^i (1-\varepsilon)^{25-i} \approx 0.06$

240 241
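The 0.06 figure can be reproduced with a one-line binomial sum (Python 3.8+ for math.comb):

```python
# 25 independent base classifiers, each with error rate 0.35, combined by majority vote:
# the ensemble errs only if at least 13 of them are wrong.
from math import comb

eps = 0.35
p_ensemble_error = sum(comb(25, i) * eps**i * (1 - eps)**(25 - i) for i in range(13, 26))
print(round(p_ensemble_error, 3))   # about 0.06
```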

Model Averaging and Bias-Variance


Bagging: Bootstrap Aggregation
Tradeoff
Single model: lowering bias will usually increase Given training set with n records, sample n
variance records randomly with replacement
Smoother model has lower variance but might not Original Data 1 2 3 4 5 6 7 8 9 10
model function well enough Bagging (Round 1) 7 8 10 8 2 5 10 10 5 9
Bagging (Round 2) 1 4 9 1 2 3 2 7 3 2
Ensembles can overcome this problem Bagging (Round 3) 1 8 5 10 5 5 9 6 3 7

1. Let models overfit Train classifier for each bootstrap sample


Low bias, high variance
2. Take care of the variance problem by averaging Note: each training record has probability
many of these models 1 - (1 - 1/n)^n of being selected at least once in
This is the basic idea behind bagging a sample of size n
242 243
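A bootstrap-sampling sketch in numpy, mirroring the rounds above; the record ids drawn are random, so they will not match the table, but the "selected at least once" probability is the point.

```python
# Bootstrap sampling for bagging: draw n record ids with replacement per round.
import numpy as np

rng = np.random.default_rng(0)
n, rounds = 10, 3

for r in range(rounds):
    sample = rng.integers(1, n + 1, size=n)   # record ids drawn with replacement
    print(f"Bagging (Round {r + 1}): {sample}, distinct records: {len(set(sample))}")

# Probability a given record appears at least once in a sample of size n
print("expected fraction selected at least once:", 1 - (1 - 1 / n) ** n)   # ~0.651 for n=10
```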

Bagged Trees Typical Result


Create k trees from training data
Bootstrap sample, grow large trees
Design goal: independent models, high
variability between models
Ensemble prediction = average of individual
tree predictions (or majority vote)
Works the same way for other classifiers

(1/k) + (1/k) ++ (1/k)

244 245

37
Typical Result Typical Result

246 247

Bagging Challenges Additive Grove


Ideal case: all models independent of each other Ensemble technique for predicting continuous output
Train on independent data samples Instead of individual trees, train additive models
Prediction of single Grove model = sum of tree predictions
Problem: limited amount of training data Prediction of ensemble = average of individual Grove predictions
Training set needs to be representative of data distribution
Combines large trees and additive models
Bootstrap sampling allows creation of many almost Challenge: how to train the additive models without having the first
independent training sets trees fit the training data too well
Diversify models, because similar sample might result Next tree is trained on residuals of previously trained trees in same Grove
model
in similar tree If previously trained trees capture training data too well, next tree is mostly
Random Forest: limit choice of split attributes to small trained on noise
random subset of attributes (new selection of subset for
each node) when training tree
Use different model types in same ensemble: tree, ANN, (1/k) ++ + (1/k) ++ ++ (1/k) ++
SVM, regression models
248 249
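A Random Forest sketch, assuming scikit-learn: bootstrapped trees plus a random attribute subset per split, as described in the list above.

```python
# Random Forest: bagged trees with a random subset of attributes considered at each split.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100,      # number of bootstrapped trees
                                max_features="sqrt",   # random attribute subset per split
                                random_state=0)
print(cross_val_score(forest, X, y, cv=10).mean())
```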

Training Groves Typical Grove Performance


(figure, Training Groves: schematic of Grove models being built up gradually, starting from a few small trees and adding more and larger trees step by step)
(figure, Typical Grove Performance: grid of root mean squared error, lower is better; horizontal axis: tree size, given as the fraction of training data at which splitting stops, from 0.5 down to 0; vertical axis: number of trees in each single Grove model, 1 to 10; 100 bagging iterations)

250 251

38
Boosting Boosting
Records that are wrongly classified will have their
weights increased
Iterative procedure to Records that are classified correctly will have
adaptively change distribution their weights decreased
of training data by focusing Original Data 1 2 3 4 5 6 7 8 9 10
more on previously Boosting (Round 1) 7 3 2 8 7 9 4 10 6 3
Boosting (Round 2) 5 4 9 4 2 5 1 7 4 2
misclassified records Boosting (Round 3) 4 4 8 10 4 5 4 6 3 4

Initially, all n records are


assigned equal weights Assume record 4 is hard to classify
Record weights may change at Its weight is increased, therefore it is more likely
the end of each boosting round to be chosen again in subsequent rounds

252 253

Example: AdaBoost AdaBoost Details



Base classifiers: C1, C2, ..., CT
Error rate of classifier Ci (n training records, wj are weights that sum to 1):
  $\varepsilon_i = \sum_{j=1}^{n} w_j \,\delta\!\left(C_i(x_j) \ne y_j\right)$
Importance of a classifier:
  $\alpha_i = \frac{1}{2} \ln\!\left(\frac{1-\varepsilon_i}{\varepsilon_i}\right)$
Weight update:
  $w_j^{(i+1)} = \frac{w_j^{(i)}}{Z_i} \times \begin{cases} e^{-\alpha_i} & \text{if } C_i(x_j) = y_j \\ e^{\alpha_i} & \text{if } C_i(x_j) \ne y_j \end{cases}$
  where $Z_i$ is the normalization factor
Weights initialized to 1/n
$Z_i$ ensures that weights add to 1
If any intermediate rounds produce error rate higher than 50%, the weights are reverted back to 1/n and the resampling procedure is repeated
Final classification:
  $C^*(x) = \arg\max_y \sum_{i=1}^{T} \alpha_i \,\delta\!\left(C_i(x) = y\right)$
254 255
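A minimal AdaBoost sketch following the formulas above. Assumptions: scikit-learn decision stumps as base classifiers (the slides do not fix the base learner), labels encoded as -1/+1, and reweighting via sample_weight instead of the resampling procedure mentioned on the previous slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=10):
    """Train T weighted base classifiers; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # weights initialized to 1/n
    classifiers, alphas = [], []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        eps = np.sum(w[pred != y])          # weighted error rate
        if eps >= 0.5:                      # revert weights if error rate exceeds 50%
            w = np.full(n, 1.0 / n)
            continue
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
        w = w * np.exp(-alpha * y * pred)   # decrease correct records, increase wrong ones
        w = w / w.sum()                     # Z_i: renormalize so weights add to 1
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def predict(classifiers, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of base predictions."""
    votes = sum(a * c.predict(X) for a, c in zip(alphas, classifiers))
    return np.sign(votes)
```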

Illustrating AdaBoost Illustrating AdaBoost


(figures: B1, B2, B3 show the data after boosting rounds 1-3; all records start with weight 0.1, weights of misclassified records grow (e.g., 0.0094 vs. 0.4623 after round 1), and the round classifiers receive importances α = 1.9459, 2.9323, and 3.8744; the overall classifier is their weighted combination)
Note: The numbers appear to be wrong, but they convey the right idea
256 257

39
Bagging vs. Boosting Classification/Prediction Summary
Analogy Forms of data analysis that can be used to train models
Bagging: diagnosis based on multiple doctors' majority vote
Boosting: weighted vote, based on doctors' previous diagnosis accuracy from data and then make predictions for new records
Sampling procedure Effective and scalable methods have been developed
Bagging: records have same weight; easy to train in parallel
Boosting: weights a record higher if the model predicts it wrong; inherently
for decision tree induction, Naive Bayesian
sequential process classification, Bayesian networks, rule-based classifiers,
Overfitting Backpropagation, Support Vector Machines (SVM),
Bagging robust against overfitting
Boosting susceptible to overfitting: make sure individual models do not overfit nearest neighbor classifiers, and many other
Accuracy usually significantly better than a single classifier classification methods
Best boosted model often better than best bagged model
Additive Grove
Regression models are popular for prediction.
Combines strengths of bagging and boosting (additive models) Regression trees, model trees, and ANNs are also used
Shown empirically to make better predictions on many data sets for prediction.
Training more tricky, especially when data is very noisy

258 259

Classification/Prediction Summary
K-fold cross-validation is a popular method for accuracy estimation,
but determining accuracy on large test set is equally accepted
If test sets are large enough, a significance test for finding the best
model is not necessary
Area under ROC curve and many other common performance
measures exist
Ensemble methods like bagging and boosting can be used to
increase overall accuracy by learning and combining a series of
individual models
Often state-of-the-art in prediction quality, but expensive to train,
store, use
No single method is superior over all others for all data sets
Issues such as accuracy, training and prediction time, robustness,
interpretability, and scalability must be considered and can involve
trade-offs

260

40
