Classification Algorithms
Credits: Padhraic Smyth
Outline
• What is supervised classification
• Classification algorithms
– Decision trees
– k-NN
– Linear discriminant analysis
– Naïve Bayes
– Logistic regression
– SVM
• Model evaluation and model selection
– Performance measures (accuracy, ROC, precision, recall…)
– Validation techniques (k-fold CV, LOO)
• Ensemble learning
– Random Forests
Notation
• Variables X, Y with values x, y (lower case)
• Vectors indicated by X
• Y is called the target variable
[Figure: two-dimensional feature space (Feature 1 vs. Feature 2) with a decision boundary separating two classes]
Classes of classifiers
• Discriminative models, focus on locating optimal decision
boundaries
– Decision trees: “swiss army knife”, often effective in high dimensions
– Nearest neighbor: simple, can scale poorly in high dimensions
– Linear discriminants : simple, sometimes effective
– Support vector machines: generalization of linear discriminants, can
be quite effective, computational complexity is an issue
• Class-conditional/probabilistic, based on p( x | ck ),
– Naïve Bayes (simple, but often effective in high dimensions)
• Regression-based: model p( ck | x ) directly
– Logistic regression: simple, linear in “odds” space
– Neural network: non-linear extension of logistic, can be difficult to
work with
Decision Tree Classifiers
• Widely used in practice
– Can handle both real-valued and nominal inputs (unusual)
– Good with high-dimensional data
IG(Patrons) = 1 – [2/12*I(0,1) + 4/12*I(1,0) + 6/12*I(2/6, 4/6)] = 0.541 bits
IG(Type) = 1 – [2/12*I(1/2,1/2) + 2/12*I(1/2,1/2) + 4/12*I(2/4,2/4) + 4/12*I(2/4,2/4)] = 0 bits
Patrons has the highest IG of all attributes and so is chosen by the DTL
algorithm as the root
Given Patrons as root node, the next attribute chosen is Hungry?, with
IG(Hungry?) = I(1/3, 2/3) – ( 2/3*1 + 1/3*0) = 0.252
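To make the arithmetic concrete, here is a minimal Python sketch (not from the lecture) of the entropy and information-gain computation used above; the split counts are the same 12 training examples partitioned by Patrons and by Type:

```python
import math

def entropy(*probs):
    # I(p1, p2, ...) in bits; terms with p = 0 contribute nothing
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(parent_counts, splits):
    # IG = I(parent) - sum_k (n_k / n) * I(split_k)
    n = sum(parent_counts)
    remainder = sum(
        sum(s) / n * entropy(*[c / sum(s) for c in s]) for s in splits
    )
    return entropy(*[c / n for c in parent_counts]) - remainder

# 12 examples overall: 6 positive, 6 negative.
# Patrons splits them as None=(0+,2-), Some=(4+,0-), Full=(2+,4-).
print(information_gain([6, 6], [[0, 2], [4, 0], [2, 4]]))          # ~0.541 bits
# Type splits them into four groups, each half positive and half negative.
print(information_gain([6, 6], [[1, 1], [1, 1], [2, 2], [2, 2]]))  # 0.0 bits
```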
Final decision tree induced by the 12-example training set
Broadening the Applicability of Decision Trees
• Continuous or integer-valued attributes
• Missing data
• Continuous output attributes (regression!!!)
Splitting on a nominal attribute
• Nominal attribute with m values
– e.g., the name of a state or a city in marketing data
• Handling missing data
– One approach: treat “missing” as a distinct attribute value (useful if missing values
are correlated with the class)
How to Choose the Right-Sized Tree?
[Figure: predictive error as a function of tree size, with an ideal range for tree size marked]
Choosing a Good Tree for Prediction
• General idea
– grow a large tree
– prune it back to create a family of subtrees
• “weakest link” pruning
– score the subtrees and pick the best one (a sketch follows below)
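As an illustration, a minimal scikit-learn sketch of cost-complexity (“weakest link”) pruning; the synthetic dataset is a placeholder, and scoring each subtree by cross-validation is one reasonable way to “pick the best one”:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Grow a large (unpruned) tree and get the sequence of pruning alphas;
# each alpha corresponds to one subtree in the pruned family.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Score each subtree by cross-validation and keep the best one.
best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
    score = cross_val_score(clf, X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print(f"best ccp_alpha={best_alpha:.4f}, CV accuracy={best_score:.3f}")
```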
• Error rates (email classification data)
– Training: 3056 emails, Testing: 1536 emails
– Decision tree: error = 8.7%
– Logistic regression: error = 7.6%
– Naïve Bayes: error = 10% (typically)
Why Trees are widely used in Practice
• Can handle high-dimensional data
– builds a model using one dimension at a time
• High Variance
– trees can be “unstable” as a function of the sample
• e.g., small change in the data -> completely different tree
– causes two problems
• 1. High variance contributes to prediction error
• 2. High variance reduces interpretability
– Trees are good candidates for model combining
• Often used with boosting and bagging (see the sketch below)
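A minimal sketch, on assumed synthetic data, of combining unstable trees by bagging using scikit-learn's RandomForestClassifier (the dataset and parameter values are illustrative only); this is the Random Forest idea listed in the outline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# A single tree has high variance; a forest averages many trees,
# each grown on a bootstrap sample of the training data.
single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree  :", cross_val_score(single_tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```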
Nearest Neighbor (k-NN) Classifiers
• Comments
– Virtually assumption-free
– Interesting theoretical properties: e.g., asymptotically the 1-NN error rate is at most twice the Bayes error rate
• Disadvantages
– Can scale poorly with dimensionality: sensitive to distance metric
– Requires fast lookup at run-time to do classification with large n
– Does not provide any interpretable “model”
• Improvements:
– Weighting examples from the neighborhood
– Finding “close” examples in a large training set quickly
– Assign a score for each class as the fraction of that class among the k nearest
examples (a “probabilistic” version; see the sketch below)
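A minimal sketch of the “probabilistic” k-NN scoring described above, using toy data and plain Euclidean distance (all names and values here are illustrative):

```python
import numpy as np

def knn_predict_proba(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)         # distance to every training point
    nearest = y_train[np.argsort(dists)[:k]]            # labels of the k closest examples
    classes = np.unique(y_train)
    return {c: np.mean(nearest == c) for c in classes}  # class -> fraction among the k

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict_proba(X_train, y_train, np.array([1.1, 1.0]), k=3))
# e.g. {0: 0.667, 1: 0.333} -> predict class 0
```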
[Figure: the effect of the value of K on class boundaries]
Linear Discriminant Classifiers
• Decision rule: a linear function of the inputs defines the decision boundary (a hyperplane)

Naïve Bayes Classifier
• Conditional independence assumption:
P(x1, x2, …, xn | cj) = Π_i P(xi | cj)
Question: For the day <sunny, cool, high, strong>, what’s the
play prediction?
Based on the examples in the table, classify the following case x:
x=(Outlook=Sunny, Temp=Cool, Humidity=High, Wind=strong)
• That means: c = Play tennis or not?
c* = arg max_{c ∈ {yes, no}} P(c) P(Outlook=sunny | c) P(Temp=cool | c) P(Humidity=high | c) P(Wind=strong | c)
• Working through the numbers gives the answer: PlayTennis(x) = no
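A sketch of this calculation in Python; the priors and conditional probabilities below are the ones usually quoted for the standard 14-example play-tennis table, which is assumed here since the table itself is not reproduced above:

```python
priors = {"yes": 9/14, "no": 5/14}
cond = {
    "yes": {"Outlook=sunny": 2/9, "Temp=cool": 3/9, "Humidity=high": 3/9, "Wind=strong": 3/9},
    "no":  {"Outlook=sunny": 3/5, "Temp=cool": 1/5, "Humidity=high": 4/5, "Wind=strong": 3/5},
}
x = ["Outlook=sunny", "Temp=cool", "Humidity=high", "Wind=strong"]

scores = {}
for c in priors:
    score = priors[c]
    for feature in x:
        score *= cond[c][feature]        # naive Bayes: multiply the class-conditionals
    scores[c] = score

print(scores)                            # yes ~ 0.0053, no ~ 0.0206
print(max(scores, key=scores.get))       # -> "no", matching PlayTennis(x) = no
```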
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
• The class with the highest final un-normalized log-probability score is still the most probable.
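A minimal sketch of the log-space trick, using made-up probability values chosen to force the underflow:

```python
import math

probs = [1e-80] * 10          # product would be 1e-800, below the smallest double

naive_product = 1.0
for p in probs:
    naive_product *= p        # multiplies raw probabilities

log_score = sum(math.log(p) for p in probs)   # sums log-probabilities instead

print(naive_product)   # 0.0  (underflow)
print(log_score)       # ~ -1842.07; the class with the largest log score still wins
```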
Naïve Bayes: Pros and Cons
• Pros
– Incrementality: with each training example, the prior
and the likelihood can be updated dynamically
– Robustness: flexible and robust to errors.
– Probabilistic hypothesis: outputs not only a
classification, but a probability distribution over all
classes
• Cons
– Performance: the naïve Bayes assumption is, in many cases, not realistic, which
reduces performance
– Probability estimates: the estimated class probabilities are often unrealistic
(due to the naïve Bayes assumption)
Evaluating Classification Results (in general)
• Summary statistics:
– empirical estimate of the score function on test data, e.g., error rate
• Decision trees
– Good for high-dimensional problems with different data types
Decision Tree Classifier
Task: Classification
Models/Parameters: Tree
Naïve Bayes Classifier
Task: Classification
Models/Parameters: Conditional probability tables
Logistic Regression
Task: Classification

Nearest Neighbor Classifier
Task: Classification
Representation: Memory-based
Search/Optimization: None
Models/Parameters: None
Support Vector Machines
Task: Classification
Representation: Hyperplanes
Search/Optimization: Convex optimization (quadratic programming)
Models/Parameters: None
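For reference, a minimal scikit-learn sketch (on assumed synthetic data) of fitting a linear SVM; the quadratic-programming step behind the separating hyperplane is handled inside the library:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)   # parameters of the learned hyperplane
print(clf.score(X, y))             # training accuracy
```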
Software (same as for Regression)
• MATLAB
– Many free “toolboxes” on the Web for regression and prediction
– e.g., see http://lib.stat.cmu.edu/matlab/, and in particular the CompStats toolbox
• R
– General-purpose statistical computing environment (successor to S)
– Free (!)
– Widely used by statisticians, has a huge library of functions and visualization tools
• Commercial tools
– SAS, other statistical packages
– Data mining packages
– Often are not programmable: offer a fixed menu of items