Data Mining Techniques: Classification and Prediction

Mirek Riedewald
Some slides based on presentations by Han/Kamber, Tan/Steinbach/Kumar, and Andrew Moore

Classification and Prediction Overview
- Introduction
- Decision Trees
- Statistical Decision Theory
- Nearest Neighbor
- Bayesian Classification
- Artificial Neural Networks
- Support Vector Machines (SVMs)
- Prediction
- Accuracy and Error Measures
- Ensemble Methods
Example of a Decision Tree

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

One tree that fits this data (splitting attributes: Refund, MarSt, TaxInc):
- Refund = Yes: predict NO
- Refund = No:
  - MarSt = Married: predict NO
  - MarSt = Single or Divorced:
    - TaxInc < 80K: predict NO
    - TaxInc > 80K: predict YES

Another Example of a Decision Tree

A different tree that fits the same data:
- MarSt = Married: predict NO
- MarSt = Single or Divorced:
  - Refund = Yes: predict NO
  - Refund = No:
    - TaxInc < 80K: predict NO
    - TaxInc > 80K: predict YES

There could be more than one tree that fits the same data.
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income given, Cheat = ?

Walk the record down the tree from the root: Refund = No leads to the MarSt test, and MarSt = Married leads to the leaf NO. Assign Cheat to "No". (The Taxable Income value is never examined along this path.)
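To make this concrete, here is a minimal scikit-learn sketch (an illustration, not part of the original slides): it fits a tree to the ten training records above and applies it to a test record. The one-hot encoding and the test record's income value (80K) are assumptions of this sketch.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Training data from the example (Taxable Income in thousands)
    train = pd.DataFrame({
        "Refund":  ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
        "Marital": ["Single","Married","Single","Married","Divorced","Married",
                    "Divorced","Single","Married","Single"],
        "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],
        "Cheat":   ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
    })

    X = pd.get_dummies(train[["Refund", "Marital", "Income"]])    # one-hot encode categoricals
    y = train["Cheat"]
    tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)  # information-gain-style splits

    # Apply the model to a test record: Refund = No, Married, some income value (80K assumed here)
    test = pd.DataFrame({"Refund": ["No"], "Marital": ["Married"], "Income": [80]})
    test_enc = pd.get_dummies(test).reindex(columns=X.columns, fill_value=0)
    print(tree.predict(test_enc))   # expected: ['No']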
Split attributes are selected based on a heuristic or statistical measure (e.g., information gain).

(Figure: decision boundary of a decision tree in a two-dimensional attribute space; axis-parallel splits such as X2 < 0.33? partition the space into rectangular regions.)
Splitting Ordinal Attributes

- Multi-way split: use as many partitions as there are distinct values, e.g., Size = Small, Medium, Large.
- Binary split: group the values into two subsets, e.g., {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.

Splitting Continuous Attributes

Different options:
- Discretization to form an ordinal categorical attribute
  - Static: discretize once at the beginning
  - Dynamic: ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
(Figure: example class distributions at a node, contrasting a high degree of impurity with a low degree of impurity.)

Information gained by splitting data set D on attribute A:

Gain_A(D) = Info(D) - Info_A(D)
Information Gain Example

Predict if somebody will buy a computer.
- Class P: buys_computer = yes
- Class N: buys_computer = no

Given data set:

Age    Income  Student  Credit_rating  Buys_computer
<=30   High    No       Bad            No
<=30   High    No       Good           No
31-40  High    No       Bad            Yes
>40    Medium  No       Bad            Yes
>40    Low     Yes      Bad            Yes
>40    Low     Yes      Good           No
31-40  Low     Yes      Good           Yes
<=30   Medium  No       Bad            No
<=30   Low     Yes      Bad            Yes
>40    Medium  Yes      Bad            Yes
<=30   Medium  Yes      Good           Yes
31-40  Medium  No       Good           Yes
31-40  High    Yes      Bad            Yes
>40    Medium  No       Good           No

Overall class distribution (9 yes, 5 no):

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Splitting on age:

Age    #yes  #no  I(#yes, #no)
<=30   2     3    0.971
31-40  4     0    0
>40    3     2    0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means that age <= 30 covers 5 out of 14 samples, with 2 yeses and 3 nos; the other terms are computed similarly. Hence

Gain_age(D) = Info(D) - Info_age(D) = 0.246

Similarly,

Gain_income(D) = 0.029
Gain_student(D) = 0.151
Gain_credit_rating(D) = 0.048

Therefore we choose age as the splitting attribute.
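To make the computation concrete, a small Python sketch (an illustration, not part of the slides) that reproduces Info(D), Info_age(D), and Gain_age(D) from the table above:

    import math
    from collections import Counter

    def entropy(labels):
        """Info(D): entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values() if c > 0)

    # buys_computer labels grouped by the age attribute (from the example table)
    age_groups = {
        "<=30":  ["no", "no", "no", "yes", "yes"],
        "31-40": ["yes", "yes", "yes", "yes"],
        ">40":   ["yes", "yes", "no", "yes", "no"],
    }
    all_labels = [y for group in age_groups.values() for y in group]

    info_d = entropy(all_labels)                                                     # 0.940
    info_age = sum(len(g) / len(all_labels) * entropy(g) for g in age_groups.values())  # 0.694
    gain_age = info_d - info_age                                                     # 0.246
    print(round(info_d, 3), round(info_age, 3), round(gain_age, 3))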
Gain Ratio

Information gain tends to favor attributes with many values. Use the gain ratio to normalize information gain:

GainRatio_A(D) = Gain_A(D) / SplitInfo_A(D)

where, for a split of D on A into v subsets D1,..., Dv,

SplitInfo_A(D) = - sum_{j=1}^{v} (|Dj| / |D|) log2(|Dj| / |D|)

Gini Index

If data set D is split on A into v subsets D1,..., Dv, the gini index gini_A(D) is defined as

gini_A(D) = sum_{j=1}^{v} (|Dj| / |D|) gini(Dj)

where gini(Dj) = 1 - sum_i p_i^2, with p_i the fraction of class i records in Dj.
How Good is the Model?

- Training set error: compare the prediction for each training record with its true value.
  - Not a good measure for the error on unseen data. (Discussed soon.)
- Test set error: for records that were not used for training, compare the model's prediction with the true value.
  - Use holdout data from the available data set.

Training versus Test Set Error

We'll create a training dataset:
- Five inputs, all bits, generated in all 32 possible combinations (32 records).
- Output y = copy of e, except a random 25% of the records have y set to the opposite of e.

a b c d e y
0 0 0 0 0 0
0 0 0 0 1 0
0 0 0 1 0 0
0 0 0 1 1 1
0 0 1 0 0 1
: : : : : :
1 1 1 1 1 1

Testing the Tree with the Test Set

What's This Example Shown Us?
Suppose We Had Less Data

Same 32-record data set as before: output y = copy of e, except a random 25% of the records have y set to the opposite of e. Now suppose the tree is learned without access to the irrelevant bits a, b, c, d.

Tree Learned Without Access to the Irrelevant Bits

The learned tree has a single split at the root, on e:
- e = 0: in about 12 of the 16 records in this node the output will be 0, so the leaf predicts 0.
- e = 1: in about 12 of the 16 records in this node the output will be 1, so the leaf predicts 1.

Expected test set behavior for this tree:

                                     almost certainly none of the     almost certainly all
                                     tree nodes are corrupted          are fine
1/4 of the test set records          n/a                               1/4 of the test set will be
are corrupted                                                          wrongly predicted because
                                                                       the test record is corrupted
3/4 are fine                         n/a                               3/4 of the test predictions
                                                                       will be fine
Avoiding Overfitting

General idea: make the tree smaller.
- Addresses all three reasons for overfitting.

Prepruning: halt tree construction early.
- Do not split a node if this would result in the goodness measure falling below a threshold.
- Difficult to choose an appropriate threshold, e.g., tree for XOR.

Postpruning: remove branches from a fully grown tree.
- Use a set of data different from the training data to decide when to stop pruning.
- Validation data: train the tree on training data, prune on validation data, then test on test data.

Minimum Description Length (MDL)

Alternative to using validation data.
- Motivation: data mining is about finding regular patterns in data; regularity can be used to compress the data; the method that achieves the greatest compression has found the most regularity and hence is best.
- Minimize Cost(Model, Data) = Cost(Model) + Cost(Data | Model)
  - Cost is the number of bits needed for encoding.
  - Cost(Data | Model) encodes the misclassification errors.
  - Cost(Model) uses node encoding plus splitting condition encoding.

(Figure: the model, a small decision tree with tests A?, B?, C?, compared against an explicit list of the labels for records X1,..., Xn.)
(Figure: error plotted against tree size, from small to large, with the best tree size marked.)
Classify Instances

New record: Tid = 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

(Table of class counts by marital status, used to handle the missing Marital Status value; for Class = No the counts are Married 3, Single 1, Divorced 0, Total 4, with a corresponding row for Class = Yes.)

Tree Cost Analysis

- Finding an optimal decision tree is NP-complete.
- Optimization goal: minimize the expected number of binary tests to uniquely identify any record from a given finite set.
But we can still use a tree for such functions in practice; it just cannot accurately represent the true function.

(Figure: example decision tree for the buys_computer data - age? with branches <=30, 31..40, and >40; the <=30 branch tests student?, the 31..40 branch predicts yes, and the >40 branch tests credit rating?)
Tree Conclusions

- Very popular data mining tool
  - Easy to understand
  - Easy to implement
  - Easy to use: little tuning, handles all attribute types and missing values
  - Computationally cheap
- Overfitting problem
- Focused on classification, but easy to extend to prediction (future lecture)

Classification and Prediction Overview
- Introduction
- Decision Trees
- Statistical Decision Theory
- Nearest Neighbor
- Bayesian Classification
- Artificial Neural Networks
- Support Vector Machines (SVMs)
- Prediction
- Accuracy and Error Measures
- Ensemble Methods
Optimal Model f(X) (cont.)

- The choice of f(X) does not affect E_Y[ (Y - E_Y[Y|X])^2 | X ], but (E_Y[Y|X] - f(X))^2 is minimized by choosing f(X) = E_Y[Y|X].
- Note that E_{X,Y}[ (Y - f(X))^2 ] = E_X[ E_Y[ (Y - f(X))^2 | X ] ]. Hence

  E_{X,Y}[ (Y - f(X))^2 ] = E_X[ E_Y[ (Y - E_Y[Y|X])^2 | X ] + (E_Y[Y|X] - f(X))^2 ]

- Hence the squared error is minimized by choosing f(X) = E_Y[Y|X] for every X.
- (Notice that for minimizing the absolute error E_{X,Y}| Y - f(X) |, one can show that the best model is f(X) = median(Y | X).)

Implications for Trees

- The best prediction for input X = x is the mean of the Y-values of all training records (x(i), y(i)) with x(i) = x.
- What about classification?
  - Two classes: encode them as 0 and 1 and use squared error as before.
    Get f(X) = E[Y | X=x] = 1*Pr(Y=1 | X=x) + 0*Pr(Y=0 | X=x) = Pr(Y=1 | X=x).
  - K classes: one can show that for 0-1 loss (error = 0 if the correct class is predicted, error = 1 if a wrong class is predicted) the optimal choice is to return the majority class for a given input X = x. This is called the Bayes classifier.
- Problem: how can we estimate E[Y | X=x] or the majority class for X = x from the training data?
  - Often there is just one or no training record for a given X = x.
- Solution: approximate it.
  - Use the Y-values from training records in a neighborhood around X = x.
  - Tree: a leaf defines the neighborhood in the data space; make sure there are enough records in the leaf to obtain a reliable estimate of the correct answer.
Bias-Variance Tradeoff

Goal: from a training set D, construct a function f(X; D) that returns a good approximation of E[Y | X] for future inputs X. The expected squared error decomposes as

E_D[ (f(X; D) - E[Y | X])^2 ] = E_D[ (f(X; D) - E_D[f(X; D)])^2 ] + (E_D[f(X; D)] - E[Y | X])^2

i.e., variance plus squared bias. In addition, E_Y[ (Y - E[Y | X])^2 | X ] is the irreducible error: it does not depend on f and is simply the variance of Y given X.

- Option 1: f(X; D) = E[Y | X, D]
  - Bias: since E_D[ E[Y | X, D] ] = E[Y | X], the bias is zero.
  - Variance: (E[Y | X, D] - E_D[E[Y | X, D]])^2 = (E[Y | X, D] - E[Y | X])^2 can be very large, since E[Y | X, D] depends heavily on D.
  - Might overfit!
- Option 2: f(X; D) = X (or some other function independent of D)
  - Variance: (X - E_D[X])^2 = (X - X)^2 = 0.
  - Bias: (E_D[X] - E[Y | X])^2 = (X - E[Y | X])^2 can be large, because E[Y | X] might be completely different from X.
  - Might underfit!
- Find the best compromise between fitting the training data too closely (option 1) and completely ignoring it (option 2).

For trees:
- A larger tree can fit the training data better.
- Variance increases as the tree becomes larger: sample variance affects the predictions of a larger tree more.
- Find the right tradeoff as discussed earlier:
  - Validation data to find the best pruned tree
  - MDL principle
Classification and Prediction Overview
- Introduction
- Decision Trees
- Statistical Decision Theory
- Nearest Neighbor
- Bayesian Classification
- Artificial Neural Networks
- Support Vector Machines (SVMs)
- Prediction
- Accuracy and Error Measures
- Ensemble Methods

Lazy vs. Eager Learning

- Lazy learning: simply stores the training data (or does only minor processing) and waits until it is given a test record.
- Eager learning: given a training set, constructs a classification model before receiving new (test) data to classify.
- General trend: lazy = faster training, slower predictions.
- Accuracy: not clear which one is better!
  - Lazy method: typically driven by local decisions.
  - Eager method: driven by global and local decisions.
Nearest-Neighbor Classifiers

Recall the problem: estimate E[Y | X=x] or the majority class for X = x from the training data. The solution was to approximate it: use the Y-values from training records in a neighborhood around X = x.

- Parameter k: the number of nearest neighbors to retrieve.
- To classify a record:
  - Find its k nearest neighbors.
  - Determine the output based on the (distance-weighted) average of the neighbors' output.

(Figures: the 1-, 2-, and 3-nearest neighbors of a record x, and the Voronoi diagram induced by the 1-nearest-neighbor rule.)

Nearest Neighbor Classification: Effect of Changing k

Choosing the value of k:
- k too small: sensitive to noise points
- k too large: the neighborhood may include points from other classes
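A minimal k-nearest-neighbor classifier in Python/NumPy (an illustration, not from the slides; the toy 2-D points and the Euclidean distance are assumptions):

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k=3):
        """Classify x by majority vote among its k nearest training records."""
        dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every record
        nearest = np.argsort(dists)[:k]                   # indices of the k closest records
        votes = Counter(y_train[i] for i in nearest)
        return votes.most_common(1)[0][0]

    # Toy 2-D data set with two classes
    X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9], [3.8, 4.0]])
    y_train = np.array(["A", "A", "A", "B", "B", "B"])

    print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # -> "A"
    print(knn_predict(X_train, y_train, np.array([3.9, 4.1]), k=3))  # -> "B"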
Classification and Prediction Overview
- Introduction
- Decision Trees
- Statistical Decision Theory
- Nearest Neighbor
- Bayesian Classification
- Artificial Neural Networks
- Support Vector Machines (SVMs)
- Prediction
- Accuracy and Error Measures
- Ensemble Methods

Bayesian Classification

- Performs probabilistic prediction, i.e., predicts class membership probabilities.
- Based on Bayes' Theorem.
- Incremental training:
  - Update probabilities as new training records arrive.
  - Can combine prior knowledge with observed data.
- Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Example: Computing P(X=x | Ci) and P(Ci)

- P(buys_computer = yes) = 9/14
- P(buys_computer = no) = 5/14
- P(age > 40, income = low, student = no, credit_rating = bad | buys_computer = yes) = 0 ?
  (No training record of class "yes" matches this exact attribute combination in the 14-record buys_computer data set shown earlier.)

Conditional Independence

- X, Y, Z random variables.
- X is conditionally independent of Y, given Z, if P(X | Y, Z) = P(X | Z).
  - Equivalent to: P(X, Y | Z) = P(X | Z) * P(Y | Z).
- Example: people with longer arms read better.
  - Confounding factor: age.
  - A young child has shorter arms and lacks the reading skills of an adult.
  - If age is fixed, the observed relationship between arm length and reading skills disappears.
Each P(Xk=xk | Ci) can be estimated robustly:
- If Xk is a categorical attribute:
  - P(Xk=xk | Ci) = #records in Ci that have value xk for Xk, divided by #records of class Ci in the training data set.
- If Xk is continuous, we could discretize it.
  - Problem: interval selection.
    - Too many intervals: too few training cases per interval.
    - Too few intervals: limited choices for decision boundary.
- If Xk is continuous, we can instead assume a Gaussian distribution per class:
  - P(Xk=xk | Ci) = g(xk; mu_{k,Ci}, sigma_{k,Ci})
  - Estimate mu_{k,Ci} from the sample mean of attribute Xk over all training records of class Ci.
  - Estimate sigma_{k,Ci} similarly from the sample standard deviation.
Zero-Probability Problem

Naive Bayesian prediction requires each conditional probability to be non-zero (why?):

P(X = (x1,..., xd) | Ci) = prod_{k=1}^{d} P(Xk=xk | Ci) = P(X1=x1 | Ci) * P(X2=x2 | Ci) * ... * P(Xd=xd | Ci)

A single zero factor makes the entire product zero, no matter how well the other attributes match.

- Example: 1000 records for buys_computer = yes with income = low (0 records), income = medium (990), and income = high (10).
  - For an input with income = low, the conditional probability is zero.
- Use the Laplacian correction (or Laplace estimator) by adding 1 dummy record to each income level:
  - Prob(income = low) = 1/1003
  - Prob(income = medium) = 991/1003
  - Prob(income = high) = 11/1003
  - The corrected probability estimates are close to their uncorrected counterparts, but none is zero.

Naive Bayesian Classifier: Comments

- Easy to implement.
- Good results obtained in many cases:
  - Robust to isolated noise points.
  - Handles missing values by ignoring the instance during probability estimate calculations.
  - Robust to irrelevant attributes.
- Disadvantages:
  - Assumption: class conditional independence, therefore loss of accuracy.
  - Practically, dependencies exist among variables.
- How to deal with these dependencies? (Bayesian networks, discussed next.)
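For the income example just given, a minimal Python sketch (an illustration, not part of the slides) of the Laplacian correction:

    from collections import Counter

    # Class buys_computer = yes has 1000 records with these income counts
    # (the level "low" never occurs, which would make its estimate zero):
    income_counts = Counter({"low": 0, "medium": 990, "high": 10})

    # Laplacian correction: add 1 dummy record to each income level.
    total = sum(income_counts.values()) + len(income_counts)   # 1000 + 3
    probs = {level: (count + 1) / total for level, count in income_counts.items()}

    print(probs)  # low = 1/1003, medium = 991/1003, high = 11/1003; none is zero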
Probabilities

Summary of elementary probability facts we have used already and/or will need soon:
- Let X be a random variable as usual.
- Let A be some predicate over its possible values.
  - A is true for some values of X, false for others.
  - E.g., X is the outcome of the throw of a die; A could be "value is greater than 4".
- P(A) is the fraction of possible worlds in which A is true.
  - P(die value is greater than 4) = 2/6 = 1/3

Axioms

- 0 <= P(A) <= 1
- P(True) = 1
- P(False) = 0
- P(A or B) = P(A) + P(B) - P(A and B)
Definition of Conditional Probability

P(A | B) = P(A and B) / P(B)

Corollary (the Chain Rule): P(A and B) = P(A | B) * P(B)

Multivalued Random Variables

- Suppose X can take on more than 2 values.
- X is a random variable with arity k if it can take on exactly one value out of {v1, v2,..., vk}.
- Thus P(X=vi and X=vj) = 0 if i != j, and P(X=v1 or X=v2 or ... or X=vk) = 1.
- We can prove that

  P(X=v1 or X=v2 or ... or X=vi) = sum_{j=1}^{i} P(X=vj)

  and therefore sum_{j=1}^{k} P(X=vj) = 1.
- Similarly, conditioning on any event B, sum_{j=1}^{k} P(X=vj | B) = 1.
The Joint Distribution

Recipe for making a joint distribution of d variables:
1. Make a truth table listing all combinations of values of your variables (has 2^d rows for d Boolean variables).
2. For each combination of values, say how probable it is.

Example: Boolean variables A, B, C

A  B  C  Prob
0  0  0  0.30
0  0  1  0.05
0  1  0  0.10
0  1  1  0.05
1  0  0  0.05
1  0  1  0.10
1  1  0  0.25
1  1  1  0.10

(Figure: the same distribution drawn as a Venn-style diagram over A, B, C with the probability mass of each region.)
Using the Joint Distribution

Once you have the joint distribution, you can compute the probability of any event E by summing the probabilities of the rows that match it:

P(E) = sum over rows matching E of P(row)

Example: P(Poor) = 0.7604.

Inference with the Joint Distribution

P(E1 | E2) = P(E1 and E2) / P(E2) = (sum over rows matching E1 and E2 of P(row)) / (sum over rows matching E2 of P(row))

Joint Distributions

- Good news: once you have a joint distribution, you can answer important questions that involve uncertainty.
- Bad news: it is impossible to create a joint distribution for more than about ten attributes, because there are so many numbers needed when you build it.
Naive Bayes as a Bayesian network: the class variable is the only parent of every attribute node.

(Figure: Bayesian belief network example with nodes such as LungCancer and Emphysema, and a conditional probability table for LungCancer whose ~LC row contains the entries 0.2, 0.5, 0.3, 0.9.)
Creating a Bayes Network

Variables:
- T: the lecture started on time
- L: the lecturer arrives late
- R: the lecture concerns data mining
- M: the lecturer is Mike
- S: it is snowing

Network structure: S and M are parents of L, M is the parent of R, and L is the parent of T.

Conditional probability tables:
- P(S) = 0.3, P(M) = 0.6
- P(L | M ^ S) = 0.05, P(L | M ^ ~S) = 0.1, P(L | ~M ^ S) = 0.1, P(L | ~M ^ ~S) = 0.2
- P(R | M) = 0.3, P(R | ~M) = 0.6
- P(T | L) = 0.3, P(T | ~L) = 0.8

Computing with the Bayes Net

P(T ^ ~R ^ L ^ ~M ^ S)
  = P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
  = P(T | L) * P(~R ^ L ^ ~M ^ S)
  = P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S)

Each conditioning set shrinks to the node's parents because every variable is conditionally independent of its non-descendants given its parents.
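Plugging the CPT values above into the factored expression, a short Python sketch (an illustration, not from the slides):

    # CPT values from the example network
    P_S = 0.3                      # P(S)
    P_M = 0.6                      # P(M)
    P_L_given_notM_S = 0.1         # P(L | ~M ^ S)
    P_R_given_notM = 0.6           # P(R | ~M)
    P_T_given_L = 0.3              # P(T | L)

    # P(T ^ ~R ^ L ^ ~M ^ S) = P(T|L) * P(~R|~M) * P(L|~M^S) * P(~M) * P(S)
    p = P_T_given_L * (1 - P_R_given_notM) * P_L_given_notM_S * (1 - P_M) * P_S
    print(p)   # 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144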
Basic Building Block: Perceptron

- Input: {(x1, x2, y), ...}
- Output: a classification function f(x)
  - f(x) > 0: return +1
  - f(x) <= 0: return -1

f(x) = sign( b + sum_{i=1}^{d} w_i * x_i )

b is called the bias.

(Figure: the input vector x = (x1,..., xd) is combined with the weight vector w = (w1,..., wd) by a weighted sum, which is passed through the activation function to produce the output y.)

Perceptron Decision Hyperplane

- Decision hyperplane: b + w.x = 0 (in two dimensions: b + w1*x1 + w2*x2 = 0).
- Note: b + w.x > 0 if and only if sum_{i=1}^{d} w_i*x_i > -b, so b represents a threshold for when the perceptron fires.
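A minimal perceptron decision function in Python (an illustration, not from the slides; the weights, bias, and inputs are arbitrary example values):

    import numpy as np

    def perceptron_predict(x, w, b):
        """f(x) = sign(b + w.x): return +1 if the weighted sum exceeds the threshold -b, else -1."""
        return 1 if b + np.dot(w, x) > 0 else -1

    w = np.array([0.5, -0.2])   # example weight vector
    b = -0.1                    # example bias
    print(perceptron_predict(np.array([1.0, 0.5]), w, b))   # 0.5 - 0.1 - 0.1 = 0.3 > 0  -> +1
    print(perceptron_predict(np.array([0.1, 1.0]), w, b))   # 0.05 - 0.2 - 0.1 = -0.25 <= 0 -> -1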
Gradient Descent Summary

- Epoch updating (aka batch mode)
  - Do until satisfied with the model:
    - Compute the gradient over the entire training set.
    - Update all weights based on the gradient.
- Case updating (aka incremental mode, stochastic gradient descent)
  - Do until satisfied with the model:
    - For each training record: compute the gradient for this single training record and update all weights based on it.
- Case updating can approximate epoch updating arbitrarily closely if the learning rate is made small enough.

Multilayer Feedforward Networks

- Use another perceptron to combine the output of a lower layer (hidden layer feeding an output layer).
- What about linear units only? Can only construct linear functions!
- Need a nonlinear component:
  - sign function: not differentiable (a problem for gradient descent!)
  - Use the sigmoid instead: sigma(x) = 1 / (1 + e^(-x))

(Figure: plot of the sigmoid 1/(1+exp(-x)), rising smoothly from 0 to 1.)
Neural Network Decision Boundary

(Figure: nonlinear decision boundaries learned by a neural network. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)

Backpropagation Remarks

- Computational cost
  - Each iteration costs O(|D| * |w|), with |D| training records and |w| weights.
  - The number of iterations can be exponential in n, the number of inputs (in practice often tens of thousands).
- Local minima can trap the gradient descent algorithm.
  - Convergence is guaranteed only to a local minimum, not the global one.
- Backpropagation is highly effective in practice.
  - Many variants deal with the local minima issue; e.g., case updating might avoid a local minimum.
SVM - Support Vector Machines

- A newer and very popular classification method.
- Uses a nonlinear mapping to transform the original training data into a higher dimension.
- Searches for the optimal separating hyperplane (i.e., decision boundary) in the new dimension.
- SVM finds this hyperplane using support vectors (essential training records) and margins (defined by the support vectors).

SVM - History and Applications

- Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s.
- Training can be slow but accuracy is high.
  - Ability to model complex nonlinear decision boundaries (margin maximization).
- Used both for classification and prediction.
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests.
Linear Classifiers

f(x, w, b) = sign(w.x + b)

(Figure: a two-class training set, one symbol denoting +1 and another denoting -1, with several candidate separating lines.)

Any of these separating hyperplanes would be fine... but which is best?

Classifier Margin

Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a data record.
Computing Margin Width

(Figure: the plus-plane and minus-plane of a linear classifier, with x+ and x- the closest points on each plane; M denotes the margin width.)
What Are the SVM Constraints?

- Consider n training records (x(k), y(k)), where y(k) = +/-1.
- Margin width: M = 2 / sqrt(w.w)
- What is the quadratic optimization criterion? Minimize w.w.
- How many constraints will we have? n. What should they be?
  For each 1 <= k <= n:
  - w.x(k) + b >= 1, if y(k) = 1
  - w.x(k) + b <= -1, if y(k) = -1

Problem: Classes Not Linearly Separable

- The inequalities for the training records are not satisfiable by any w and b.

Solution 1?

- Find minimum w.w, while also minimizing the number of training set errors.
  - Not a well-defined optimization problem (cannot optimize two things at the same time).

Solution 2?

- Minimize w.w + C * (#trainSetErrors), where C is a tradeoff parameter.
- Problems:
  - Cannot be expressed as a QP, hence finding the solution might be slow.
  - Does not distinguish between disastrous errors and near misses.
Facts About the New Problem Formulation

- The original QP formulation had d+1 variables: w1, w2,..., wd and b.
- The new QP formulation has d+1+n variables: w1, w2,..., wd, b, and the slack variables eps1, eps2,..., epsn.
- C is a new parameter that needs to be set for the SVM.
  - It controls the tradeoff between paying attention to margin size versus misclassifications.

Effect of Parameter C

(Figure: decision boundaries for different values of C. Source: Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning.)
Harder To Separate

(Figure: a one-dimensional two-class data set along X that no single threshold can separate.)

What can be done about this? Non-linear basis functions:
- Original data: (X, Y)
- Transformed: (X, X^2, Y)
- Think of X^2 as a new attribute, e.g., X'

(Figure: in the (X, X' = X^2) plane the same data becomes linearly separable; the region below the minus-plane contains only one class.)
Common SVM Basis Functions

- Polynomial of attributes X1,..., Xd of a certain max degree, e.g., X2 + X1*X3 + X4^2
- Radial basis function: symmetric around a center, i.e., KernelFunction(|X - c| / kernelWidth)
- Sigmoid function of X, e.g., hyperbolic tangent
- Let Phi(x) be the transformed input record.
  - Previous example: Phi(x) = (x, x^2)

Quadratic Basis Functions

Phi(x) = ( 1,                                            constant term
           sqrt(2)*x1, sqrt(2)*x2,..., sqrt(2)*xd,       linear terms
           x1^2, x2^2,..., xd^2,                         pure quadratic terms
           sqrt(2)*x1*x2, sqrt(2)*x1*x3,..., sqrt(2)*x1*xd,
           sqrt(2)*x2*x3,..., sqrt(2)*x_{d-1}*xd )       quadratic cross-terms

Number of terms (assuming d input attributes): (d+2)-choose-2 = (d+2)(d+1)/2, roughly d^2/2.

Why did we choose this specific transformation, with the sqrt(2) factors?
Dual QP With Basis Functions

In the dual formulation of the quadratic program, the training records enter only through dot products Phi(x(i)).Phi(x(j)) of transformed records; the same holds at prediction time.

Computation Challenge

- Writing out Phi(a).Phi(b) for the quadratic basis functions means summing over the constant term, the linear terms, the pure quadratic terms, and the quadratic cross-terms: about d^2/2 multiplications per dot product.
- Now consider another function of a and b:

  (a.b + 1)^2 = (a.b)^2 + 2(a.b) + 1
              = (sum_i a_i b_i)^2 + 2 sum_i a_i b_i + 1
              = sum_i sum_j a_i b_i a_j b_j + 2 sum_i a_i b_i + 1
              = sum_i (a_i b_i)^2 + 2 sum_i sum_{j>i} a_i b_i a_j b_j + 2 sum_i a_i b_i + 1
              = Phi(a).Phi(b)

- Term by term, this is exactly the dot product of the quadratic basis expansions (that is why the sqrt(2) factors were chosen), yet it costs only O(d) operations to evaluate.
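A quick numerical check of this identity with NumPy (an illustration, not from the slides; phi implements the quadratic basis expansion defined above):

    import numpy as np

    def phi(x):
        """Quadratic basis expansion: constant, linear, pure quadratic, and cross-terms."""
        d = len(x)
        cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
        return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=5), rng.normal(size=5)

    lhs = phi(a) @ phi(b)          # explicit dot product in the transformed space (~d^2/2 terms)
    rhs = (a @ b + 1) ** 2         # kernel evaluation in the original space (O(d) operations)
    print(np.isclose(lhs, rhs))    # True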
SVM Kernel Functions

For which transformations, called kernels, does the same trick work?
- Polynomial: K(a, b) = (a.b + 1)^q
- Radial-basis-style (RBF): K(a, b) = exp( -(a - b)^2 / (2*sigma^2) )
- Neural-net-style sigmoidal: K(a, b) = tanh( kappa * a.b - delta )

Here q, sigma, kappa, and delta are "magic" parameters that must be chosen by a model selection method.

Overfitting

- With the right kernel function, computation in the high-dimensional transformed space is no problem.
- But what about overfitting? There seem to be so many parameters...
- Usually not a problem, due to the maximum-margin approach:
  - Only the support vectors determine the model, hence SVM complexity depends on the number of support vectors, not on the dimensionality (still, in higher dimensions there might be more support vectors).
  - Minimizing w.w discourages extremely large weights, which smooths the function (recall weight decay for neural networks!).
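In practice the kernel and the C parameter are passed directly to an SVM library. A minimal scikit-learn sketch on a toy data set (an illustration, not from the slides; the data set and parameter values are arbitrary):

    import numpy as np
    from sklearn.svm import SVC

    # Toy two-class data: the class depends nonlinearly on the inputs
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # points outside the unit circle are class 1

    # RBF kernel K(a,b) = exp(-gamma * |a-b|^2); C controls the margin-vs-error tradeoff
    clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

    print(clf.score(X, y))                 # training accuracy
    print(len(clf.support_vectors_))       # model complexity ~ number of support vectors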
Classification and Prediction Overview
- Introduction
- Decision Trees
- Statistical Decision Theory
- Nearest Neighbor
- Bayesian Classification
- Artificial Neural Networks
- Support Vector Machines (SVMs)
- Prediction
- Accuracy and Error Measures
- Ensemble Methods

What Is Prediction?

- Essentially the same as classification, but the output is continuous, not discrete.
  - Construct a model.
  - Use the model to predict a continuous output value for a given input.
- Major method for prediction: regression.
  - Many variants of regression analysis in the statistics literature; not covered in this class.
- Neural networks and k-NN can do regression out-of-the-box.
- SVMs for regression exist.
- What about trees?
Limitation of Accuracy

- Consider a 2-class problem:
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
- A model that always predicts Class 0 has accuracy 9990/10000 = 99.9%, yet it never detects a single Class 1 example.

Cost-Sensitive Measures: Cost Matrix

               PREDICTED CLASS
C(i|j)         Class=Yes   Class=No
ACTUAL  Yes
CLASS   No

C(i|j) is the cost of predicting class i when the true class is j.

Example confusion matrices for two classifiers:

          PREDICTED                    PREDICTED
ACTUAL    +     -          ACTUAL      +     -
   +     150    40            +       250    45
   -      60   250            -         5   200

Relative error measures for continuous (prediction) output:

Relative absolute error = sum_{i=1}^{n} | y(i) - y'(i) |  /  sum_{i=1}^{n} | y(i) - y_bar |

Relative squared error  = sum_{i=1}^{n} ( y(i) - y'(i) )^2  /  sum_{i=1}^{n} ( y(i) - y_bar )^2
ROC (Receiver Operating Characteristic)

- Developed in the 1950s for signal detection theory to analyze noisy signals.
  - Characterizes the trade-off between positive hits and false alarms.
- The ROC curve plots the TP rate (y-axis) against the FP rate (x-axis).
- The performance of each classifier is represented as a point on the ROC curve.
  - Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

ROC Curve

Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive.
- At threshold t: TPR = 0.5, FPR = 0.12.
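A small sketch of the threshold sweep that produces an ROC curve (an illustration, not from the slides; the scores and labels are arbitrary example values, and tied scores are not merged in this simple version). scikit-learn's sklearn.metrics.roc_curve performs the same computation.

    import numpy as np

    def roc_points(scores, labels):
        """Sweep a threshold over the classifier scores and return (FPR, TPR) pairs.
        labels are 1 for positive, 0 for negative."""
        order = np.argsort(-scores)                 # sort records by decreasing score
        labels = labels[order]
        P, N = labels.sum(), len(labels) - labels.sum()
        tp = np.cumsum(labels)                      # positives above each threshold
        fp = np.cumsum(1 - labels)                  # negatives above each threshold
        return np.concatenate(([0.0], fp / N)), np.concatenate(([0.0], tp / P))

    scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4])
    labels = np.array([1,   1,   0,    1,   0,    0  ])
    fpr, tpr = roc_points(scores, labels)
    print(np.column_stack((fpr, tpr)))              # ROC curve points, from (0, 0) to (1, 1)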
How To Construct An ROC Curve

Use a classifier that produces a score (e.g., a posterior probability of the positive class) for each test record, then sweep a threshold over the scores:

Class:        +    -    +    -    +    -    -    -    +    +
Threshold >=  0.25 0.43 0.53 0.76 0.85 0.85 0.85 0.87 0.93 0.95 1.00

For every threshold value, count TP, FP, TN, and FN among the test records, compute TPR = TP / (TP + FN) and FPR = FP / (FP + TN), and plot the point (FPR, TPR). Connecting the points for all thresholds yields the ROC curve.

Test of Significance

Given two models:
- Model M1: accuracy = 85%, tested on 30 instances
- Model M2: accuracy = 75%, tested on 5000 instances

Can we say M1 is better than M2, or could the observed difference be a random fluctuation of the test sets?
Testing Significance of Accuracy Difference

- Consider the random variable d = err1 - err2.
- Since err1 and err2 are normally distributed, so is their difference.
- Hence d ~ N(d_t, sigma_t), where d_t is the true difference.
- Estimator for d_t: E[d] = E[err1 - err2] = E[err1] - E[err2] ~ e1 - e2.
- Since the test sets D1 and D2 are independent, the variances add up:

  sigma_t^2 ~ sigma_1^2 + sigma_2^2 = e1(1 - e1)/n1 + e2(1 - e2)/n2

- At the (1 - alpha) confidence level, d_t = E[d] +/- Z_{alpha/2} * sigma_t.

An Illustrative Example

- Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25.
- E[d] = |e1 - e2| = 0.1
- 2-sided test: d_t = 0 versus d_t != 0.
- sigma_t^2 ~ 0.15(1 - 0.15)/30 + 0.25(1 - 0.25)/5000 = 0.0043
- At the 95% confidence level, Z_{alpha/2} = 1.96:

  d_t = 0.100 +/- 1.96 * sqrt(0.0043) = 0.100 +/- 0.128

- The interval contains zero, hence the difference may not be statistically significant.
- But: we may reject the null hypothesis (d_t = 0) at a lower confidence level.
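The interval from the example, computed in a few lines of Python (an illustration, not from the slides):

    from math import sqrt

    n1, e1 = 30, 0.15      # model M1: test set size and error rate
    n2, e2 = 5000, 0.25    # model M2

    d = abs(e1 - e2)                                   # observed error difference
    var = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2      # variance of the difference
    z = 1.96                                           # z-value for 95% confidence
    lo, hi = d - z * sqrt(var), d + z * sqrt(var)
    print(round(var, 4), round(lo, 3), round(hi, 3))   # 0.0043, -0.028, 0.228 -> interval contains 0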
When the error difference is estimated with k-fold cross-validation, estimate the variance from the k per-fold differences d_j:

sigma_t^2 = sum_{j=1}^{k} (d_j - d_bar)^2 / ( k (k - 1) )

t-distribution: get the t coefficient t_{1-alpha, k-1} from a table by looking up the confidence level (1 - alpha) and the degrees of freedom (k - 1). Then

d_t = d_bar +/- t_{1-alpha, k-1} * sigma_t

Classification and Prediction Overview
- Introduction
- Decision Trees
- Statistical Decision Theory
- Nearest Neighbor
- Bayesian Classification
- Artificial Neural Networks
- Support Vector Machines (SVMs)
- Prediction
- Accuracy and Error Measures
- Ensemble Methods
General Idea

Predict the class label of previously unseen records by aggregating the predictions made by multiple classifiers.

(Figure: Step 1: create multiple data sets D1, D2,..., Dt-1, Dt from the original data. Step 2: build multiple classifiers C1, C2,..., Ct-1, Ct. Step 3: combine the classifiers into C*.)
Why Does It Work?

- Consider a 2-class problem.
- Suppose there are 25 base classifiers, each with error rate eps = 0.35, and assume the classifiers are independent.
- Return the majority vote of the 25 classifiers.
- Probability that the ensemble classifier makes a wrong prediction (at least 13 of the 25 are wrong):

  sum_{i=13}^{25} C(25, i) * eps^i * (1 - eps)^(25 - i) = 0.06

Base Classifier vs. Ensemble Error

(Figure: ensemble error as a function of the base classifier error rate.)
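A short Python check of the 0.06 figure (an illustration, not from the slides):

    from math import comb

    eps = 0.35          # error rate of each independent base classifier
    n = 25              # number of base classifiers

    # The majority vote is wrong when 13 or more of the 25 classifiers are wrong
    p_ensemble_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
    print(round(p_ensemble_wrong, 3))   # ~0.06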
Typical Result

(Figures: plots of typical experimental results for these ensemble methods.)
Boosting

- Iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records:
  - Records that are wrongly classified will have their weights increased.
  - Records that are classified correctly will have their weights decreased.
- Example of how the sampled records change across boosting rounds:

  Original Data       1  2  3  4  5  6  7  8  9  10
  Boosting (Round 1)  7  3  2  8  7  9  4  10 6  3
  Boosting (Round 2)  5  4  9  4  2  5  1  7  4  2
  Boosting (Round 3)  4  4  8  10 4  5  4  6  3  4

Boosting Example

- Base classifiers B1 and B3 are built in successive rounds; after each round the record weights are updated (e.g., new weights 0.0094, 0.0094, 0.4623, 0.0276, 0.1819, 0.0038, ...).
- Round 1 prediction: + + + - - - - - - -, with classifier weight alpha = 1.9459
- Round 2 prediction: - - - - - - - - + +, with alpha = 2.9323
- Round 3 prediction: + + + + + + + + + +, with alpha = 3.8744
- Overall (weighted) prediction: + + + - - - - - + +
- Note: the numbers appear to be wrong, but they convey the right idea.
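A sketch of one boosting round's weight update (an illustration, not from the slides; it follows the standard AdaBoost formulas rather than the specific numbers above, and the labels and predictions are arbitrary example values):

    import numpy as np

    def adaboost_update(weights, y_true, y_pred):
        """One AdaBoost round: compute the classifier weight alpha and reweight the records.
        Misclassified records get larger weights, correctly classified records smaller ones."""
        err = np.sum(weights * (y_true != y_pred)) / np.sum(weights)   # weighted error rate
        alpha = 0.5 * np.log((1 - err) / err)                          # importance of this classifier
        new_w = weights * np.exp(-alpha * y_true * y_pred)             # y values in {-1, +1}
        return alpha, new_w / new_w.sum()                              # renormalize

    w = np.full(10, 0.1)                                    # 10 records, equal initial weights
    y_true = np.array([ 1,  1,  1, -1, -1, -1, -1, -1,  1,  1])
    y_pred = np.array([ 1,  1,  1, -1, -1, -1, -1, -1, -1, -1])   # last two records misclassified
    alpha, w = adaboost_update(w, y_true, y_pred)
    print(round(alpha, 3), np.round(w, 3))   # misclassified records now carry more weight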
Bagging vs. Boosting

- Analogy
  - Bagging: diagnosis based on multiple doctors' majority vote.
  - Boosting: weighted vote, based on the doctors' previous diagnosis accuracy.
- Sampling procedure
  - Bagging: records have the same weight; easy to train in parallel.
  - Boosting: weights a record higher if the models predict it wrong; inherently sequential process.
- Overfitting
  - Bagging: robust against overfitting.
  - Boosting: susceptible to overfitting; make sure the individual models do not overfit.
- Accuracy usually significantly better than a single classifier.
  - The best boosted model is often better than the best bagged model.
- Additive Grove
  - Combines strengths of bagging and boosting (additive models).
  - Shown empirically to make better predictions on many data sets.
  - Training is more tricky, especially when the data is very noisy.

Classification/Prediction Summary

- Forms of data analysis that can be used to train models from data and then make predictions for new records.
- Effective and scalable methods have been developed for decision tree induction, Naive Bayesian classification, Bayesian networks, rule-based classifiers, Backpropagation, Support Vector Machines (SVMs), nearest neighbor classifiers, and many other classification methods.
- Regression models are popular for prediction; regression trees, model trees, and ANNs are also used for prediction.
Classification/Prediction Summary (cont.)

- K-fold cross-validation is a popular method for accuracy estimation, but determining accuracy on a large test set is equally accepted.
  - If test sets are large enough, a significance test for finding the best model is not necessary.
- Area under the ROC curve and many other common performance measures exist.
- Ensemble methods like bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
  - Often state-of-the-art in prediction quality, but expensive to train, store, and use.
- No single method is superior over all others for all data sets.
- Issues such as accuracy, training and prediction time, robustness, interpretability, and scalability must be considered and can involve trade-offs.