
Machine Learning

in practice (at Yahoo)


common pitfalls, and debugging tricks
(Kilian Weinberger)

(thanks to Rob Schapire, Andrew Ng)


Overview

Machine Learning Setup

Algorithm Debugging

Data Debugging
Machine Learning Setup
[Diagram: Goal + Data + Idea → Miracle Learning Algorithm → Amazing results!!! Fame, Glory, Rock'n Roll!]
1. Learning Problem

What is my data?
What am I trying to learn?
What would be ideal conditions?

QUIZ: What would be some answers for email spam filtering?
Example:

What is my data? Email content / Meta Data


What am I trying to learn? User’s spam/ham labels
What would be ideal conditions? Only Y! Employees
2. Train / Test split
[Timeline: Train Data → Test Data → Real World data, ordered by time]

1. How much data do I need? (More is more - See John’s talk.)

2. How do you split into train / test? (Always by time!)

3. Training data should be just like test data!!
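
A minimal sketch of such a time-based split (plain numpy arrays; the function and variable names are illustrative, not from this deck):

    import numpy as np

    def time_based_split(X, y, timestamps, test_fraction=0.2):
        """Train on the oldest examples, test on the newest - never a random split."""
        order = np.argsort(timestamps)                  # oldest first
        cutoff = int(len(order) * (1 - test_fraction))
        train_idx, test_idx = order[:cutoff], order[cutoff:]
        return X[train_idx], y[train_idx], X[test_idx], y[test_idx]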


Data set overfitting
[Same timeline: many runs against Train Data / Test Data, but only one run against Real World data]

By evaluating on the same data set over and over, you will overfit
Overfitting bounded by:  O( √( log(#trials) / #examples ) )    (see John's talk)

Kishore’s rule of thumb: subtract 1% accuracy for every time you have
tested on a data set

Ideally: Create a second train / test split!


3. Data Representation:
data (email) → feature vector:

  0           "viagra"
  1           "hello"
  0           "cheap"
  1           "$"
  1           "Microsoft"
  ...         ...
  0           From a YID?
  1           IP known?
  2.3423E+12  Sent time in s since 1/1/1970
  12323       Email size
  0           Attachment size
  ...         ...
  0.232       Percentile in email length
  0.1         Percentile in token likelihood
  ...         ...
Data Representation:
feature vector:

bag of word features (sparse):
  3           "viagra"
  1           "hello"
  0           "cheap"
  1           "$"
  1           "Microsoft"
  ...         ...

meta features (sparse / dense):
  0           From a YID?
  1           IP known?
  2.3423E+12  Sent time in s since 1/1/1970
  12323       Email size
  0           Attachment size
  ...         ...

aggregate statistics (dense real):
  0.232       Percentile in email length
  0.1         Percentile in token likelihood
  ...         ...

Pitfall #1: Aggregate statistics should not be over test data!
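
A hedged sketch of assembling such a feature vector with scikit-learn: sparse bag-of-word features stacked next to a dense aggregate statistic, where the vocabulary and the length percentiles come from the training emails only (per Pitfall #1). The tiny example emails and names are illustrative:

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import CountVectorizer

    train_emails = ["cheap viagra $$$", "hello, meeting at noon?", "your Microsoft invoice"]
    test_emails = ["hello, cheap tickets $"]

    # sparse bag-of-word features: vocabulary learned from the TRAINING emails only
    vectorizer = CountVectorizer()
    bow_train = vectorizer.fit_transform(train_emails)
    bow_test = vectorizer.transform(test_emails)

    # dense aggregate statistic: length percentile, ranked against TRAIN lengths only
    train_lengths = np.array([len(e) for e in train_emails])
    def length_percentile(emails):
        return np.array([[np.mean(train_lengths <= len(e))] for e in emails])

    X_train = hstack([bow_train, csr_matrix(length_percentile(train_emails))])
    X_test = hstack([bow_test, csr_matrix(length_percentile(test_emails))])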


Pitfall #2: Feature scaling   f_i → (f_i + a_i) ∗ b_i

With linear classifiers / kernels, features should have similar scale
(e.g. range [0,1])

Must use the same scaling constants for test data!!! (most likely the
test data will not be in a clean [0,1] interval)

Dense features should be down-weighted when combined with sparse features

(Scale does not matter for decision trees.)
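
A minimal sketch of this pitfall handled correctly with scikit-learn's MinMaxScaler: the scaling constants are estimated on the training data and reused unchanged on the test data (the random arrays are just stand-ins):

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # illustrative dense features for 1000 train / 200 test emails
    X_train = np.random.rand(1000, 5) * 100
    X_test = np.random.rand(200, 5) * 120        # test range differs slightly

    scaler = MinMaxScaler(feature_range=(0, 1))
    X_train_scaled = scaler.fit_transform(X_train)   # estimate the (a_i, b_i) constants on train only
    X_test_scaled = scaler.transform(X_test)         # reuse the SAME constants on test;
                                                     # values may fall outside [0, 1], and that is fine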


Pitfall #3: Over-condensing of features

[Diagram: a long raw data vector being condensed into a short feature vector]

Features do not need to be semantically meaningful

Just add them: Redundancy is (generally) not a problem

Let the learning algorithm decide what's useful!
Example: Thought reading

fMRI scan
Nobody knows what the features are
But it works!!!
[Mitchell et al. 2008]
3. Training Signal

1. How reliable is my labeling source? (Even editors only agree 33% of the time.)
2. Does the signal have high coverage?
3. Is the signal derived independently of the features?!
4. Could the signal shift after deployment?
Quiz: Spam filtering

The spammer with IP e.v.i.l has sent 10M spam emails over the last 10 days
- use all emails with this IP as spam examples        (answer: potentially stale data)

Use user's spam / not-spam votes as signal            (answer: too noisy)

Use Yahoo employees' votes                            (answer: low coverage)
Example: Spam filtering

[Diagram: incoming email → spam filter → Inbox or Junk; the user gives feedback: SPAM / NOT-SPAM]
Example: Spam filtering

[Diagram: incoming email → old spam filter → Inbox or Junk; the user's SPAM / NOT-SPAM feedback annotates emails as training data for a new ML spam filter]

QUIZ: What is wrong with this setup?


Example: Spam filtering

[Same diagram: the user's feedback on the old filter's decisions is used to train the new ML filter]

Problem: Users only vote when classifier is wrong

New filter learns to exactly invert the old classifier

Possible solution: Occasionally let emails through filter to avoid bias


Example: Trusted votes
Goal: Classify email votes as trusted / untrusted

Signal conjecture:

[Vote timeline for one email: an early burst of "good" votes (evil spammer community), followed later by "bad" votes]
Searching for signal

The good news: We found that exact pattern A LOT!!

[Vote timeline as above: early "good" votes from a suspected spammer community, later "bad" votes]
Searching for signal

The good news: We found that exact pattern A LOT!!

The bad news: We found other patterns just as often

[Vote timeline with "good" and "bad" votes in a different order]
Searching for signal

The good news: We found that exact pattern A LOT!!

The bad news: We found other patterns just as often

[Vote timeline alternating: "good", "bad", "good", "bad", "good", ...]

Moral: Given enough data you'll find anything!

You need to be very very careful that you learn the right thing!
4. Learning Method

• Classification / Regression / Ranking?


• Do you want probabilities?
• Do you have skewed classes / weighted examples?
• Best off-the-shelf: SVMs or boosted decision trees
• Generally: Try out several algorithms
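
A hedged sketch of "try out several algorithms" with off-the-shelf scikit-learn models; the synthetic data and the random split are stand-ins (on real data, split by time as discussed earlier):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "linear SVM": LinearSVC(),
        "boosted trees": GradientBoostingClassifier(),
    }
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(name, "test accuracy:", model.score(X_te, y_te))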
Method Complexity (KISS)

Common pitfall: Using an overly complicated learning algorithm

ALWAYS try the simplest algorithm first!!!

Move to more complex systems after the simple one works

Rule of diminishing returns!!

(Scientific papers exaggerate the benefit of complex theory.)

QUIZ: What would you use for spam?


Ready-Made Packages
Weka 3
http://www.cs.waikato.ac.nz/~ml/index.html

Vowpal Wabbit (very large scale)


http://hunch.net/~vw/

Machine Learning Open Source Software Project
http://mloss.org/software

MALLET: Machine Learning for Language Toolkit
http://mallet.cs.umass.edu/index.php/Main_Page

LIBSVM (powerful SVM implementation)
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

SVM Light
http://svmlight.joachims.org/svm_struct.html

SVM Lin (very fast linear SVM)
http://people.cs.uchicago.edu/~vikass/svmlin.html

Internal: Alex Smola's LDA implementation
smola@yahoo-inc.com

Internal: Boosted Decision Tree Implementation
http://twiki.corp.yahoo.com/view/Ysti18n/MLRModelingPackage

Internal: Data Mining Platform
http://twiki.corp.yahoo.com/view/Yst/Clue
Model Selection
(parameter setting with cross validation)

[Split the training data again: Train → Train' + Val]

Do not trust default parameters!!!!

Grid Search over parameters

Most importantly: Learning rate!!

Often easy to use hod-farm on hadoop (Jerry Ye)

Pick the best parameters on Val
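
A minimal sketch of this kind of grid search: carve a Val set out of the training data, sweep the learning rate and tree count of a boosted model, and keep whatever does best on Val (data, model, and grid values are illustrative):

    import itertools
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import GradientBoostingClassifier

    # carve a validation set out of the training data (Train -> Train' + Val)
    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

    param_grid = {"learning_rate": [0.01, 0.1, 0.3], "n_estimators": [100, 300]}
    best_score, best_params = -1.0, None
    for values in itertools.product(*param_grid.values()):
        params = dict(zip(param_grid, values))
        score = GradientBoostingClassifier(**params).fit(X_tr, y_tr).score(X_val, y_val)
        if score > best_score:                   # keep the parameters that do best on Val
            best_score, best_params = score, params
    print("best parameters:", best_params, "val accuracy:", best_score)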


5. Experimental Setup

1. Automate everything (one button setup)

• pre-processing / training / testing / evaluation

• Let’s you reproduce results easily

• Fewer errors!!

2. Parallelize your experiments (use hod-farm)
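
A hedged sketch of point 1, a one-button experiment where preprocessing, training, testing, and evaluation run from a single call and every run writes its metrics to disk (the data loading and model here are stand-ins):

    import json
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    def run_experiment(seed=0, results_path="results.json"):
        """One button: data preparation, training, testing, and evaluation in a single call."""
        # stand-in for the real pre-processing / feature extraction step
        X, y = make_classification(n_samples=2000, n_features=30, random_state=seed)
        cut = int(0.8 * len(X))                      # stand-in for the time-ordered split
        model = LogisticRegression(max_iter=1000).fit(X[:cut], y[:cut])
        metrics = {"test_accuracy": model.score(X[cut:], y[cut:]), "seed": seed}
        with open(results_path, "w") as f:
            json.dump(metrics, f)                    # every run leaves a reproducible record
        return metrics

    if __name__ == "__main__":
        print(run_experiment())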


Quiz

T/F: Condensing features with domain expertise improves learning? FALSE

T/F: Feature scaling is irrelevant for boosted decision trees. TRUE

To avoid data overfitting benchmark on a second train/test data set.

T/F: Ideally, derive your signal directly from the features. FALSE

You cannot create train/test split when your data changes over time. FALSE

T/F: Always compute aggregate statistics over the entire corpus. FALSE
Debugging ML algorithms
Problem: Spam filtering

You implemented logistic regression with regularization.
Problem: Your test error is too high (12%)!

QUIZ: What can you do to fix it?


Fixing attempts:
Get more training data

Get more features

Select fewer features

Feature engineering (e.g. meta features, header information)

Run gradient descent longer

Use Newton’s Method for optimization

Change regularization

Use SVMs instead of logistic regression

But: which one should we try out?


Possible problems

Diagnostics:
Underfitting: Training error almost as high as test error
Overfitting: Training error much lower than test error
Wrong Algorithm: Other methods do better

Optimizer: Loss function is not minimized


Underfitting / Overfitting

Diagnostics: overfitting
• test error still decreasing with more data
• large gap between train and test error

[Learning curve (error vs. training set size): testing error above the desired error, training error below it, with a large gap between the two]


Diagnostics: underfitting
• even the training error is too high
• small gap between train and test error

[Learning curve (error vs. training set size): training and testing errors close together, both above the desired error]
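
A hedged sketch of producing these learning-curve diagnostics with scikit-learn's learning_curve helper (synthetic data and model are illustrative):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import learning_curve
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=3000, n_features=40, random_state=0)
    sizes, train_scores, test_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

    for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
        # large train/test gap -> overfitting; both errors high -> underfitting
        print(f"n={n}: train error={1 - tr:.3f}, test error={1 - te:.3f}")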


Convergence problem vs. wrong loss function?

[Plot of the loss function L(w), with a single global minimum - where on this curve are you?]


Diagnostics

Your loss function is L(w)   [ w = parameters ]

Train with various loss functions (e.g. use WEKA)

If all algorithms perform worse on the test set, you might need a
more powerful function class (kernels, decision trees?)

Otherwise: compute your training loss L(w')   [ w' = parameters obtained with the other loss ]
Diagnostics

Case 1: w' has lower test error and higher loss.

→ Your loss function is bad

[Plot of loss vs. iterations: L(w') stays above L(w)]
Diagnostics

Case 2: w' has lower test error and lower loss.

→ Your optimizer is bad

[Plot of loss vs. iterations: L(w') ends up below L(w)]
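
A hedged sketch of this diagnostic: train with "your" loss (logistic) and with another loss, then compare both test errors and both training losses evaluated under your loss (models and data are illustrative stand-ins):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.metrics import log_loss

    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    own = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)         # "your" loss -> w
    other = SGDClassifier(loss="modified_huber").fit(X_tr, y_tr)    # another loss -> w'

    L_w = log_loss(y_tr, own.predict_proba(X_tr))                   # L(w)
    L_w_prime = log_loss(y_tr, other.predict_proba(X_tr))           # L(w') under YOUR loss
    print("test error w :", 1 - own.score(X_te, y_te), " L(w) :", L_w)
    print("test error w':", 1 - other.score(X_te, y_te), " L(w'):", L_w_prime)
    # w' better on test and L(w') > L(w)  ->  your loss function is bad (Case 1)
    # w' better on test and L(w') < L(w)  ->  your optimizer is bad (Case 2)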
Quiz: Which problem does each fix help against?
(overfitting / underfitting / bad optimizer / bad loss)

Get more training data                → overfitting
Get more features                     → underfitting
Select fewer features                 → overfitting
Feature engineering                   → underfitting
Run gradient descent longer           → bad optimizer
Use Newton's Method for optimization  → bad optimizer
Change regularization                 → underfitting / overfitting
Use SVMs instead                      → bad loss function
Debugging Data

Problem: Online error > Test Error

[Learning curve (error vs. training set size): online error above the testing error, which sits above the desired and training errors]


Analytics: train/test vs. online

Suspicion: Online data is distributed differently

Construct a new binary classification problem: online vs. train+test

If you can learn this (error < 50%), you have a distribution problem!!

You do not need any labels for this!!

(Alex Smola (YR) is a world expert in covariate shift.)
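
A minimal sketch of that check: label offline (train+test) rows 0 and online rows 1, and see whether a classifier can tell them apart; the random arrays are stand-ins for your real feature matrices:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # X_offline: features of your train+test data; X_online: features seen in production.
    # Illustrative stand-ins below - note that no labels are needed for this check.
    X_offline = np.random.randn(2000, 20)
    X_online = np.random.randn(2000, 20) + 0.5        # simulate a shifted distribution

    X = np.vstack([X_offline, X_online])
    z = np.concatenate([np.zeros(len(X_offline)), np.ones(len(X_online))])

    acc = cross_val_score(LogisticRegression(max_iter=1000), X, z, cv=5).mean()
    print("offline-vs-online accuracy:", acc)   # clearly above 0.5 (error < 50%) => distribution problem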
Suspicion: Temporal distribution drift

[Split by time:   Train | Test → 12% error]
[Shuffled split:  Train | Test →  1% error]

If E(shuffle)<E(train/test) then you have temporal distribution drift

Cures: Retrain frequently / online learning
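
A hedged sketch of the shuffle test: compare the error of a split-by-time against a shuffled split on the same data (the drifting synthetic data is only a stand-in):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def split_error(X, y, shuffle):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, shuffle=shuffle, random_state=0)
        return 1 - LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

    # X, y assumed sorted by time; illustrative drifting data below
    t = np.linspace(0, 3, 5000)
    X = np.random.randn(5000, 10) + t[:, None]              # feature distribution drifts over time
    y = (X[:, 0] - t + 0.3 * np.random.randn(5000) > 0).astype(int)

    print("split-by-time error:", split_error(X, y, shuffle=False))
    print("shuffled split error:", split_error(X, y, shuffle=True))
    # shuffled error clearly lower here => temporal distribution drift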


Problem: You are "too good" on your setup ...

[Plot (error vs. iterations): training and testing error drop below the desired error, while the online error stays high]
Possible Problems

Is the label included in the data set?

Does the training set contain test data?

Famous example in 2007: Caltech 101

[Bar chart: Caltech 101 test accuracy climbing from 2005 to 2007]

[Caltech 101 example images, 2007 vs. 2009]
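
A minimal sketch of a check for the second problem, counting test rows that also appear verbatim in the training matrix (arrays are illustrative stand-ins):

    import numpy as np

    def leaked_rows(X_train, X_test):
        """Count test rows that also appear verbatim in the training set."""
        train_rows = {row.tobytes() for row in np.ascontiguousarray(X_train)}
        return sum(row.tobytes() in train_rows for row in np.ascontiguousarray(X_test))

    X_train = np.random.rand(1000, 10)
    X_test = np.vstack([np.random.rand(195, 10), X_train[:5]])   # 5 accidentally leaked rows
    print("test rows also in train:", leaked_rows(X_train, X_test))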
Final Quiz
Increasing your training set size increases the training error.

Temporal drift can be detected through shuffling the training/test sets.

Increasing your feature set size decreases the training error.

T/F: Better and more features always decreases the test error? False

T/F: Very low test error always indicates you are doing well. False

When an algorithm overfits there is a big gap between train and test error.

T/F: Underfitting can be cured with more powerful learners. True

T/F: The test error is (almost) never below the training error. True
Summary

Marty: “Machine learning is only sexy when it works.”


ML algorithms deserve a careful setup
Debugging is just like any other code
1. Carefully rule out possible causes
2. Apply appropriate fix
Resources

Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

Y. LeCun, L. Bottou, G. Orr and K. Muller: Efficient BackProp, in Orr, G. and Muller
K. (Eds), Neural Networks: Tricks of the trade, Springer, 1998.

Pattern Recognition and Machine Learning by Christopher M. Bishop

Andrew Ng’s ML course: http://www.youtube.com/watch?v=UzxYlbK2c7E
