
Data Mining

Classification Algorithms
Credits: Padhraic Smyth
Outline
• What is supervised classification
• Classification algorithms
– Decision trees
– k-NN
– Linear discriminant analysis
– Naïve Bayes
– Logistic regression
– SVM
• Model evaluation and model selection
– Performance measures (accuracy, ROC, precision, recall…)
– Validation techniques (k-fold CV, LOO)
• Ensemble learning
– Random Forests
Notation
• Variables X, Y with values x, y (lower case)
• Vectors indicated by X
• Y is called the target variable

• Components of X indicated by Xj with values xj

• “Matrix” data set D with n rows and p columns


– jth column contains values for variable Xj
– ith row contains a vector of measurements on object i, indicated by
x(i)
– The jth measurement value for the ith object is xj(i)

• Unknown parameter for a model = θ
– Vector of parameters = θ
Decision Region Terminology
[Figure: two-class data in a two-dimensional feature space (Feature 1 vs. Feature 2), showing Decision Region 1, Decision Region 2, and the Decision Boundary separating them]
Classes of classifiers
• Discriminative models, focus on locating optimal decision
boundaries
– Decision trees: “swiss army knife”, often effective in high dimensions
– Nearest neighbor: simple, can scale poorly in high dimensions
– Linear discriminants : simple, sometimes effective
– Support vector machines: generalization of linear discriminants, can
be quite effective, computational complexity is an issue

• Class-conditional/probabilistic, based on p( x | ck ),
– Naïve Bayes (simple, but often effective in high dimensions)

• Regression-based, p( ck | x ) directly
– Logistic regression: simple, linear in “odds” space
– Neural network: non-linear extension of logistic, can be difficult to
work with
Decision Tree Classifiers
– Widely used in practice
• Can handle both real-valued and nominal inputs (unusual)
• Good with high-dimensional data
– Historically, developed both in statistics and computer science
Decision tree learning
• Aim: find a small tree consistent with the training examples
• Idea: (recursively) choose "most significant" attribute as root of (sub)tree
Learning decision trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
1. Alternate: is there an alternative restaurant nearby?
2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Attribute-based representations
• Examples described by attribute values (Boolean, discrete, continuous)
• E.g., situations where I will/won't wait for a table:

• Classification of examples is positive (T) or negative (F)


• The set of examples used for learning is called the training set.
Decision trees
• One possible representation for hypotheses
• E.g., here is the “true” tree for deciding whether to
wait:
Choosing an attribute
• Idea: a good attribute splits the examples into subsets that are
(ideally) "all positive" or "all negative"

• Patrons? is a better choice


Example contd.
• Decision tree learned from the 12 examples:

• Substantially simpler than the “true” tree: a more complex hypothesis isn’t justified by the small amount of data
Using information theory
• To implement Choose-Attribute in the DTL
algorithm
• Information Content (Entropy):
I(P(v1), …, P(vn)) = Σ_{i=1..n} -P(vi) log2 P(vi)
• For a training set containing p positive examples and n negative examples:
I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Information gain
• A chosen attribute A divides the training set E into subsets E1,
… , Ev according to their values for A, where A has v distinct
values.
remainder(A) = Σ_{i=1..v} (pi + ni)/(p + n) · I(pi/(pi+ni), ni/(pi+ni))
• Information Gain (IG) or reduction in entropy from the
attribute test:
IG(A) = I(p/(p+n), n/(p+n)) - remainder(A)
• Choose the attribute with the largest IG
Information gain
For the training set, p = n = 6, I(6/12, 6/12) = 1 bit

Consider the attributes Patrons and Type (and others too):

IG(Patrons) = 1 - [2/12 · I(0,1) + 4/12 · I(1,0) + 6/12 · I(2/6, 4/6)] ≈ 0.541 bits
IG(Type) = 1 - [2/12 · I(1/2,1/2) + 2/12 · I(1/2,1/2) + 4/12 · I(2/4,2/4) + 4/12 · I(2/4,2/4)] = 0 bits
Patrons has the highest IG of all attributes and so is chosen by the DTL
algorithm as the root

Given Patrons as root node, the next attribute chosen is Hungry?, with
IG(Hungry?) = I(1/3, 2/3) - (2/3 · 1 + 1/3 · 0) ≈ 0.252
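As a sanity check on these numbers, here is a minimal Python sketch of the entropy and information-gain computations (the function names and hard-coded counts are mine, taken from the restaurant example above):

import math

def entropy(p, n):
    # I(p/(p+n), n/(p+n)) for a set with p positive and n negative examples
    total = p + n
    return -sum((c / total) * math.log2(c / total) for c in (p, n) if c > 0)

def information_gain(p, n, subsets):
    # IG(A) = I(p, n) - sum_i (pi + ni)/(p + n) * I(pi, ni), with subsets = [(pi, ni), ...]
    remainder = sum((pi + ni) / (p + n) * entropy(pi, ni) for pi, ni in subsets)
    return entropy(p, n) - remainder

# Restaurant data: 6 positive and 6 negative examples overall
print(information_gain(6, 6, [(0, 2), (4, 0), (2, 4)]))          # Patrons -> ~0.541 bits
print(information_gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))  # Type    -> 0.0 bits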
Final decision tree induced by the 12-example training set
Broadening the applicability of Decision Trees
• Continuous or integer-valued attributes
• Missing data
• Continuous output attributes (regression!)
Splitting on a nominal attribute
• Nominal attribute with m values
– e.g., the name of a state or a city in marketing data

• 2^(m-1) possible subsets => exhaustive search is O(2^(m-1))
– For small m, a simple approach is to branch on specific values
– But for large m this may not work well

• Neat trick for the 2-class problem:


– For each predictor value calculate the proportion of class 1’s
– Order the m values according to these proportions
– Now treat as an ordinal variable and select the best split (linear in m)
– This gives the optimal split for the Gini index, among all possible 2^(m-1) splits (a sketch follows below)
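A minimal sketch of this ordering trick for a two-class target, assuming pandas is available (the column names and helper functions are hypothetical):

import pandas as pd

def gini(labels):
    # Gini impurity of a set of class labels
    p = labels.value_counts(normalize=True)
    return 1.0 - (p ** 2).sum()

def best_binary_split(df, attr, target, positive_class):
    # Order the m attribute values by their proportion of the positive class, then
    # evaluate only the m-1 "prefix" splits of that ordering (optimal for the Gini index)
    proportions = df.groupby(attr)[target].apply(lambda s: (s == positive_class).mean()).sort_values()
    ordered = list(proportions.index)
    best_score, best_subset = float("inf"), None
    for i in range(1, len(ordered)):
        left_values = set(ordered[:i])
        mask = df[attr].isin(left_values)
        score = (mask.sum() * gini(df.loc[mask, target]) +
                 (~mask).sum() * gini(df.loc[~mask, target])) / len(df)
        if score < best_score:
            best_score, best_subset = score, left_values
    return best_subset, best_score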
Splitting on integer or continuous attributes
• A numeric attribute A is tested for two outcomes:
A ≤ split_point and A > split_point
• split_point is a value returned by choose-attribute
function.
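A minimal sketch of one common way to choose split_point: scan the midpoints between consecutive sorted values of the attribute and keep the threshold with the lowest weighted entropy (the function names and the toy values, echoing the Income split shown below, are mine):

import numpy as np

def class_entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split_point(values, labels):
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_score, best_split = np.inf, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue                                  # no threshold between equal values
        split = (values[i] + values[i - 1]) / 2.0     # candidate test: A <= split vs A > split
        score = (i * class_entropy(labels[:i]) +
                 (len(labels) - i) * class_entropy(labels[i:])) / len(labels)
        if score < best_score:
            best_score, best_split = score, split
    return best_split

# e.g. best_split_point([30000, 38000, 45000, 60000], ["no", "no", "yes", "yes"]) -> 41500.0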
[Figure: example split on Income with branches “≤ 42,000” and “> 42,000”]
Treating Missing Data in Trees
• Missing values are common in practice

• Approaches to handling missing values


– During training
• Ignore rows with missing values (inefficient)
– During testing
• Send the example being classified down both branches and
average predictions
– Replace missing values with an “imputed value” (can be
suboptimal)

• Other approaches
– Treat “missing” as a unique value (useful if missing values
are correlated with the class)
How to Choose the Right-Sized Tree?

[Figure: predictive error vs. size of decision tree, showing error on training data, error on test data, and the ideal range for tree size where test error is lowest]
Choosing a Good Tree for Prediction
• General idea
– grow a large tree
– prune it back to create a family of subtrees
• “weakest link” pruning
– score the subtrees and pick the best one

• Massive data sizes (e.g., n ~ 100k data points)


– use training data set to fit a set of trees
– use a validation data set to score the subtrees

• Smaller data sizes (e.g., n ~1k or less)


– use cross-validation
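A rough sketch of this grow-then-prune idea using scikit-learn's cost-complexity (“weakest link”) pruning, assuming scikit-learn is available; the dataset here is just a stand-in, not the lecture's example:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow a large tree, then enumerate its weakest-link pruned subtrees via their ccp_alpha values
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

# Score each candidate subtree with cross-validation and refit the best one
cv_scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
             for a in alphas]
best_tree = DecisionTreeClassifier(ccp_alpha=alphas[int(np.argmax(cv_scores))], random_state=0).fit(X, y)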
Example: Spam Email Classification
• Data Set: (from the UCI Machine Learning Archive)
– 4601 email messages from 1999
– Manually labeled as spam (60%), non-spam (40%)
– 54 features: percentage of words matching a specific word/character
• Business, address, internet, free, george, !, $, etc
– Average/longest/sum lengths of uninterrupted sequences of CAPS

• Error Rates
– Training: 3056 emails, Testing: 1536 emails
– Decision tree = 8.7%
– Logistic regression: error = 7.6%
– Naïve Bayes = 10% (typically)
Why Trees are widely used in Practice
• Can handle high dimensional data
– builds a model using one dimension at a time

• Can handle any type of input variables


– categorical, real-valued, etc
– most other methods require data of a single type (e.g.,
only real-valued)

• Trees are (somewhat) interpretable


– domain expert can “read off” the tree’s logic

• Tree algorithms are relatively easy to code and test


Limitations of Classification Trees
• Representational Bias
– piecewise linear boundaries, parallel to axes

• High Variance
– trees can be “unstable” as a function of the sample
• e.g., small change in the data -> completely different tree
– causes two problems
• 1. High variance contributes to prediction error
• 2. High variance reduces interpretability
– Trees are good candidates for model combining
• Often used with boosting and bagging

• Trees do not scale well to massive data sets (e.g., N in millions)
– repeated random access of subsets of the data
Nearest Neighbor Classifiers
• kNN: select the k nearest neighbors to x from the training data
and select the majority class from these neighbors
• Training method:
– Save the training examples
• At prediction time:
– Use a distance measure (for example Euclidean) and calculate the
distance of a test example x to all training examples
– Find the k training examples (x1,y1),…(xk,yk) that are closest to the test
example x
– Predict the most frequent class among those yi’s.
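A minimal Python sketch of this prediction step, using Euclidean distance and a majority vote (function and variable names are mine):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # Distance from the test example to every saved training example
    distances = np.linalg.norm(np.asarray(X_train) - np.asarray(x_test), axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(distances)[:k]
    # Predict the most frequent class among their labels
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]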
Nearest Neighbor Classifiers
• k is a parameter:
– Small k: “noisier” estimates, Large k: “smoother” estimates
– Best value of k often chosen by cross-validation
– Important case k = 1 (1NN)

• Comments
– Virtually assumption free
– Interesting theoretical properties: e.g., with enough data, the error rate of the 1NN classifier is at most twice the Bayes-optimal error rate

• Disadvantages
– Can scale poorly with dimensionality: sensitive to distance metric
– Requires fast lookup at run-time to do classification with large n
– Does not provide any interpretable “model”

• Improvements:
– Weighting examples from the neighborhood
– Finding “close” examples in a large training set quickly
– Assign a score for class prediction as the fraction of predicted class among k nearest
examples (‘probabilistic version’)
[Figure: the effect of the value of K on class boundaries]
Linear Discriminant Classifiers

• Linear Discriminant Analysis (LDA)


– One of the earliest classifiers (R.A. Fisher, 1936)
– Find a projection onto a vector such that means for each class (2 classes) are separated
as much as possible (with variances taken into account appropriately)
– Reduces to a special case of parametric Gaussian classifier in certain situations
– Many subsequent variations on this basic theme (e.g., regularized LDA)

• Other linear discriminants


– Decision boundary = (p-1) dimensional hyperplane in p dimensions
– Perceptron learning algorithms (pre-dated neural networks)
• Simple “error correction” based learning algorithms
– SVMs: use a sophisticated “margin” idea for selecting the hyperplane
Fisher Linear Discriminant
• For each projection vector w compute for
each example w·xi
• Find w that minimizes within class variance
and maximizes between class variance
Fisher Linear Discriminant
• Solution:
w ∝ (Σ1 + Σ2)^(-1) (μ1 - μ2)
where μ1 and μ2 are the means of the class 1 and class 2 vectors, and Σ1 and Σ2 are the covariances of class 1 and class 2 respectively.
• At test time just use the simple decision rule:
if w·xi > threshold, predict class 1; else predict class 2
– The threshold may be chosen so that accuracy on the training set is maximal
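A minimal numpy sketch of this solution and decision rule; it assumes the closed form w = (Σ1 + Σ2)^(-1)(μ1 - μ2) given above and, for simplicity, uses the midpoint of the projected class means as the threshold rather than searching for the accuracy-maximizing value:

import numpy as np

def fisher_lda(X1, X2):
    # X1, X2: arrays of shape (n1, p) and (n2, p) holding the class 1 and class 2 examples
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1, S2 = np.cov(X1, rowvar=False), np.cov(X2, rowvar=False)
    w = np.linalg.solve(S1 + S2, mu1 - mu2)      # w = (S1 + S2)^(-1) (mu1 - mu2)
    threshold = 0.5 * (w @ mu1 + w @ mu2)        # midpoint of the projected class means
    return w, threshold

def lda_predict(X, w, threshold):
    # class 1 if the projection w·x exceeds the threshold, else class 2
    return np.where(X @ w > threshold, 1, 2)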
Support Vector Machines
(will be discussed again later)

• Support vector machines


– Use a different loss function, the “margin”
• Results in a convex optimization problem, solvable by quadratic programming
– Decision boundary represented by examples in
training data
– Linear version: clever placement of the hyperplane
– Non-linear version: “kernel trick” for high-dimensional
problems
– Computational complexity can be O(N^3) without speedups
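As a usage-level sketch only (the optimization details come later), a linear and a kernelised SVM fitted with scikit-learn, assumed available, on tiny illustrative data:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])   # toy data, purely illustrative
y = np.array([0, 0, 1, 1])

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)   # clever placement of a separating hyperplane
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)         # "kernel trick" for non-linear boundaries
print(linear_svm.support_vectors_)                   # the boundary is represented by training examples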
Naïve Bayes Classifiers
• Bayes theorem:
P(A | B) = P(B | A) P(A) / P(B)

• Theorem of total probability: if events Ai are mutually exclusive and their probabilities sum to 1,
P(B) = Σ_{i=1..n} P(B | Ai) P(Ai)
Naïve Bayes Classifiers
• Probabilistic model based on Bayes rule:
P(C | X) = P(X | C) P(C) / P(X)

• with a conditional independence assumption on P(x | ck), i.e.
P(X | C) = P(x1, x2, …, xn | C) = Π_{i=1..n} P(xi | C)

• Decision rule:
c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)
Parameter estimation
• P(cj)
– Can be estimated from the frequency of classes in the training
examples.
• P(x1, x2, …, xn | cj)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large number of training examples was available.
• Independence Assumption: attribute values are conditionally
independent given the target value: naïve Bayes.

P(x1, x2, …, xn | cj) = Π_i P(xi | cj)

c_NB = argmax_{cj ∈ C} P(cj) Π_i P(xi | cj)
Properties
• Estimating P(xi | cj) instead of P(x1, x2, …, xn | cj) greatly reduces the number of parameters (and the data sparseness).
• The learning step in Naïve Bayes consists of estimating P(cj) and P(xi | cj) based on the frequencies in the training data.
• An unseen instance is classified by computing the class that maximizes the posterior.
• When the conditional independence assumption is satisfied, Naïve Bayes corresponds to MAP classification.
Example. ‘Play Tennis’ data
Day    Outlook   Temperature  Humidity  Wind    PlayTennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No

Question: For the day <sunny, cool, high, strong>, what’s the
play prediction?
Based on the examples in the table, classify the following case x:
x=(Outlook=Sunny, Temp=Cool, Humidity=High, Wind=strong)
• That means: c = Play tennis or not?

c_NB = argmax_{c ∈ {yes, no}} P(c) P(x | c) = argmax_{c ∈ {yes, no}} P(c) Π_t P(a_t | c)
     = argmax_{c ∈ {yes, no}} P(c) P(Outlook = sunny | c) P(Temp = cool | c) P(Humidity = high | c) P(Wind = strong | c)

• Working:

P(PlayTennis = yes) = 9/14 ≈ 0.64
P(PlayTennis = no) = 5/14 ≈ 0.36
P(Wind = strong | PlayTennis = yes) = 3/9 ≈ 0.33
P(Wind = strong | PlayTennis = no) = 3/5 = 0.60
etc.
P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes) ≈ 0.0053
P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no) ≈ 0.0206

=> answer: PlayTennis(x) = no
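A minimal Python sketch that reproduces this working directly from the table above (the data tuples are copied from the table; function and variable names are mine, and no smoothing is applied, exactly as in the hand calculation):

from collections import Counter

# (Outlook, Temperature, Humidity, Wind) -> PlayTennis, copied from the table
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),          (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"),      (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),       (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"), (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"),      (("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"),    (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"),    (("Rain", "Mild", "High", "Strong"), "No"),
]

class_counts = Counter(c for _, c in data)
cond_counts = Counter((j, x[j], c) for x, c in data for j in range(4))   # (attribute index, value, class)

def nb_score(x, c):
    # Un-normalised posterior P(c) * prod_j P(x_j | c), estimated from frequencies
    p = class_counts[c] / len(data)
    for j, v in enumerate(x):
        p *= cond_counts[(j, v, c)] / class_counts[c]
    return p

x = ("Sunny", "Cool", "High", "Strong")
print({c: round(nb_score(x, c), 4) for c in class_counts})   # {'No': 0.0206, 'Yes': 0.0053} -> predict No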
Underflow Prevention
• Multiplying lots of probabilities, which are
between 0 and 1 by definition, can result in
floating-point underflow.
• Since log(xy) = log(x) + log(y), it is better to
perform all computations by summing logs of
probabilities rather than multiplying
probabilities.
• Class with highest final un-normalized log
probability score is still the most probable.

c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
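Continuing the Play Tennis sketch above, the same decision computed by summing logs (note that a zero count would need smoothing, e.g. Laplace, before taking logs; that case does not arise for this query):

import math

def nb_log_score(x, c):
    # Same argmax as nb_score above, but summing logs to avoid floating-point underflow
    logp = math.log(class_counts[c] / len(data))
    for j, v in enumerate(x):
        logp += math.log(cond_counts[(j, v, c)] / class_counts[c])   # counts assumed non-zero here
    return logp

prediction = max(class_counts, key=lambda c: nb_log_score(x, c))     # -> "No", as before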
Properties of Naïve Bayes Classifier

• Pros
– Incrementality: with each training example, the prior
and the likelihood can be updated dynamically
– Robustness: flexible and robust to errors.
– Probabilistic hypothesis: outputs not only a
classification, but a probability distribution over all
classes
• Cons
– Performance: the Naïve Bayes assumption is, in many cases, not realistic, reducing performance
– Probability estimates: the class probability estimates are often unrealistic (due to the Naïve Bayes assumption)
Evaluating Classification Results (in general)
• Summary statistics:
– empirical estimate of score function on test data, e.g., error rate

• More detailed breakdown


– E.g., “confusion matrices”
– Can be quite useful in detecting systematic errors

• Detection v. false-alarm plots (2 classes)


– Binary classifier with real-valued output for each example,
where higher means more likely to be class 1
– For each possible threshold, calculate
• Detection rate = fraction of class 1 detected
• False alarm rate = fraction of class 2 detected
– Plot y (detection rate) versus x (false alarm rate)
– This detection vs. false-alarm plot is the ROC curve; closely related plots: precision-recall and sensitivity/specificity
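A minimal numpy sketch of computing such a plot's coordinates: sweep a threshold over the classifier's real-valued outputs and record the detection rate against the false-alarm rate (here class 1 is labelled 1 and class 2 is labelled 0; names are mine):

import numpy as np

def detection_false_alarm_curve(scores, labels):
    # One (false-alarm rate, detection rate) point per threshold, sweeping from high to low scores
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    detected = np.cumsum(labels == 1)         # class 1 examples detected so far
    false_alarms = np.cumsum(labels == 0)     # class 2 examples falsely flagged so far
    n1, n2 = max(detected[-1], 1), max(false_alarms[-1], 1)
    return false_alarms / n2, detected / n1   # x = false-alarm rate, y = detection rate

# Hypothetical scores and labels, just to show the call
far, dr = detection_false_alarm_curve([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])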
Bagging for Combining Classifiers
• Training data sets of size N

• Generate B “bootstrap” sampled data sets of size N


– Bootstrap sample = sample with replacement
– e.g. B = 100

• Build B models (e.g., trees), one for each bootstrap sample


– Intuition is that the bootstrapping “perturbs” the data enough to make the models more
resistant to true variability

• For prediction, combine the predictions from the B models


– E.g., for classification p(c | x) = fraction of B models that predict c

– Plus: generally improves accuracy on models such as trees


– Negative: lose interpretability
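A rough sketch of this procedure with tree models, assuming scikit-learn is available for the base trees (in practice sklearn.ensemble.BaggingClassifier or RandomForestClassifier package up the same idea):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, B=100, seed=0):
    # Fit B trees, each on a bootstrap sample of size N drawn with replacement
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap sample indices
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict_proba(models, x, c):
    # p(c | x) estimated as the fraction of the B models that predict class c
    votes = [m.predict(np.asarray(x).reshape(1, -1))[0] for m in models]
    return votes.count(c) / len(votes)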
[Figure: bagging results: green = majority vote, purple = averaging the probabilities (from Hastie, Tibshirani, and Friedman, 2001)]
Illustration of Boosting:
Color of points = class label
Diameter of points = weight at each iteration
Dashed line: single stage classifier. Green line: combined, boosted classifier
Dotted blue in last two: bagging
(from G. Rätsch, PhD thesis, 2001)
Summary on Classifiers
• Simple models (but can be effective)
– Logistic regression
– Naïve Bayes
– K nearest-neighbors

• Decision trees
– Good for high-dimensional problems with different data types

• State of the art:


– Support vector machines
– Boosted trees

• Many tradeoffs in interpretability, score functions, etc


Decision Tree Classifiers

Task: Classification
Representation: Decision boundaries = hierarchy of axis-parallel boundaries
Score Function: Cross-validated error
Search/Optimization: Greedy search in tree space
Data Management: None specified
Models, Parameters: Tree
Naïve Bayes Classifier

Task: Classification
Representation: Conditional independence probability model
Score Function: Likelihood
Search/Optimization: Closed-form probability estimates
Data Management: None specified
Models, Parameters: Conditional probability tables
Logistic Regression

Task: Classification
Representation: Log-odds(C) = linear function of X’s
Score Function: Log-likelihood
Search/Optimization: Iterative (Newton) method
Data Management: None specified
Models, Parameters: Logistic weights
Nearest Neighbor Classifier

Task: Classification
Representation: Memory-based
Score Function: Cross-validated error (for selecting k)
Search/Optimization: None
Data Management: None specified
Models, Parameters: None
Support Vector Machines

Task: Classification
Representation: Hyperplanes
Score Function: “Margin”
Search/Optimization: Convex optimization (quadratic programming)
Data Management: None specified
Models, Parameters: None
Software (same as for Regression)

• MATLAB
– Many free “toolboxes” on the Web for regression and prediction
– e.g., see http://lib.stat.cmu.edu/matlab/ and in particular the CompStats toolbox

• R
– General-purpose statistical computing environment (successor to S)
– Free (!)
– Widely used by statisticians, has a huge library of functions and visualization tools

• Commercial tools
– SAS, other statistical packages
– Data mining packages
– Often not programmable: offer a fixed menu of items
