Machine Learning Setup
Goal
Idea → Data → Miracle Learning Algorithm → Amazing results!!! → Fame, Glory, Rock’n Roll!
1. Learning Problem
What is my data?
What am I trying to learn?
What would be ideal conditions?
By evaluating on the same data set over and over, you will overfit.
Overfitting bounded by: O( sqrt( log(#trials) / #examples ) )   (see John’s talk)
Kishore’s rule of thumb: subtract 1% accuracy for every time you have tested on a data set.
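The bound above can be turned into a back-of-the-envelope check. This is a sketch only: the constants hidden inside the O(·) are omitted, so the number is an order of magnitude, not a guarantee.

```python
import math

def overfitting_bound(num_trials: int, num_examples: int) -> float:
    """Rough order-of-magnitude estimate of how far the measured accuracy
    can drift above the true accuracy after repeatedly evaluating on the
    same test set (O-notation constants omitted)."""
    return math.sqrt(math.log(num_trials) / num_examples)

# Evaluating 100 model variants on a 10,000-example test set:
print(overfitting_bound(100, 10_000))  # roughly 0.021, i.e. about 2% optimism
```

Note how the bound grows only logarithmically in the number of trials but shrinks with the square root of the test-set size: a bigger test set buys far more safety than testing less often.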
bag-of-word features:
  0           “viagra”
  1           “hello”
  0           “cheap”
  1           “$”
  1           “Microsoft”
  ...         ...
meta features:
  0           From a YID?
  1           IP known?
  2.3423E+12  Sent time in s since 1/1/1970
  12323       Email size
  0           Attachment size
  ...         ...
bag-of-word features (sparse):
  3           “viagra”
  1           “hello”
  0           “cheap”
  1           “$”
  1           “Microsoft”
  ...         ...
meta features (sparse / dense):
  0           From a YID?
  1           IP known?
  2.3423E+12  Sent time in s since 1/1/1970
  12323       Email size
  0           Attachment size
  ...         ...
dense real features:
  0.1         Percentile in email length
  ...         Percentile in token likelihood
  ...         ...
Must use the same scaling constants for the test data!!! (most likely the test data will not fall in a clean [0,1] interval)
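A minimal sketch of that rule: fit the scaling constants on the training column only, store them, and reuse them at test time. The function names here are illustrative, not from the talk.

```python
import numpy as np

def fit_scaling(train_col):
    """Compute scaling constants (min, range) from the TRAINING data only."""
    lo, hi = float(np.min(train_col)), float(np.max(train_col))
    return lo, (hi - lo) or 1.0  # guard against a constant column

def apply_scaling(col, lo, span):
    """Apply the stored training constants. Test values may land outside
    [0, 1] -- that is expected; never re-fit the constants on test data."""
    return (np.asarray(col, dtype=float) - lo) / span

lo, span = fit_scaling([120, 480, 960])      # e.g. email sizes seen in training
scaled = apply_scaling([120, 2000], lo, span)
print(scaled)                                # second value exceeds 1.0
```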
condensed feature vector:
  [0, 1, 1, 1.2, 1, 2.3, 2.3423E+12, 12323, 5.3, 0, 12.1, 0.1, ...]
Over-condensing of features ... just add them: redundancy is (generally) not a problem.
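Concatenating the feature groups from the earlier slides might look like the sketch below. The feature names and values are illustrative only; the point is a fixed ordering so every email maps to the same vector layout.

```python
import numpy as np

# Hypothetical feature groups for one email (values from the slides).
bag_of_words = {"viagra": 3, "hello": 1, "cheap": 0, "$": 1, "Microsoft": 1}
meta = {"from_yid": 0, "ip_known": 1, "sent_time_s": 2.3423e12,
        "email_size": 12323, "attachment_size": 0}
dense_real = {"pct_email_length": 0.1}

# Fix the feature ordering once, then reuse it for every example.
vocab = sorted(bag_of_words)
meta_keys = sorted(meta)
dense_keys = sorted(dense_real)

# "Just add them": concatenate all groups into one condensed vector.
x = np.array([bag_of_words[w] for w in vocab]
             + [meta[k] for k in meta_keys]
             + [dense_real[k] for k in dense_keys])
print(x.shape)
```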
Example: Thought reading from fMRI scans.
Nobody knows what the features are — but it works!!! [Mitchell et al., 2008]
3. Training Signal
Example: Spam filtering
[Diagram: incoming email → spam filter → Inbox or Junk; the user annotates email, providing SPAM / NOT-SPAM feedback.]
[Diagram: the new ML spam filter runs alongside the old spam filter on incoming email; the user’s SPAM / NOT-SPAM feedback reaches both filters.]
Signal conjecture:
[Plot: votes over time — messages are voted “good” at first, then voted “bad” once the evil spammer community joins in.]
Searching for signal
The good news: We found that exact pattern A LOT!!
[Diagram: split Train into Train’ and Val.]
Do not trust default parameters!!!!
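One way to act on that advice is to carve a validation set out of the training data and tune parameters there, never on the test set. A minimal sketch (split sizes and seed are arbitrary assumptions):

```python
import numpy as np

def split_train_val(X, y, val_fraction=0.2, seed=0):
    """Split Train into Train' and Val so hyperparameters (e.g. the
    regularization strength) are tuned on Val, never on the test set."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val, tr = idx[:n_val], idx[n_val:]
    return X[tr], y[tr], X[val], y[val]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
X_tr, y_tr, X_val, y_val = split_train_val(X, y)
print(len(X_tr), len(X_val))  # 80 20
```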
T/F: Ideally, derive your signal directly from the features. FALSE
T/F: You cannot create a train/test split when your data changes over time. FALSE
T/F: Always compute aggregate statistics over the entire corpus. FALSE
Debugging ML algorithms
Problem: Spam filtering. You implemented logistic regression with regularization, but your test error is too high (12%)!
Change regularization?
Diagnostics:
Underfitting: training error almost as high as test error
Overfitting: training error much lower than test error
Wrong algorithm: other methods do better
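The first two diagnostics reduce to comparing three numbers. A sketch of that decision rule (the gap threshold is an illustrative assumption, not from the talk):

```python
def diagnose(train_err, test_err, desired_err, gap_tol=0.02):
    """Classify a too-high test error using the diagnostics above:
    underfitting -> training error almost as high as the test error,
    overfitting  -> training error much lower than the test error."""
    if test_err <= desired_err:
        return "ok"
    if test_err - train_err <= gap_tol:
        return "underfitting"
    return "overfitting"

print(diagnose(train_err=0.11, test_err=0.12, desired_err=0.05))  # underfitting
print(diagnose(train_err=0.01, test_err=0.12, desired_err=0.05))  # overfitting
```

Underfitting suggests a richer model or less regularization; overfitting suggests more data or more regularization, so the two cases call for opposite fixes.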
[Plot: overfitting — testing error far above the desired error, training error well below it.]
[Plot: underfitting — training and testing error close together, both above the desired error.]
Diagnostics: compare your solution w against an alternative w’ that achieves lower test error.
[Plot: training loss over iterations — L(w’) stays above L(w).]
Case 1: w’ has lower test error but higher loss — you are optimizing the wrong loss.
[Plot: training loss over iterations — L(w) stays above L(w’).]
Case 2: w’ has lower test error and lower loss — your optimizer is failing to minimize the loss.
Quiz: helps against: (overfitting / underfitting / bad optimizer / bad loss)
[Plot: testing error, desired error, and training error curves.]
[Plot: error over time when the data is processed online (in temporal order) vs. shuffled, against the desired error — is the data set changing over time?]
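When the data drifts over time, a shuffled split silently trains on the future. A minimal sketch of a split that respects temporal order (function name is illustrative):

```python
import numpy as np

def temporal_split(timestamps, test_fraction=0.2):
    """Split by time so the test set is strictly NEWER than the training
    set; a shuffled split on drifting data leaks the future into training."""
    order = np.argsort(timestamps)
    cut = int(len(order) * (1 - test_fraction))
    return order[:cut], order[cut:]

ts = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)
train_idx, test_idx = temporal_split(ts)
print(ts[train_idx].max() <= ts[test_idx].min())  # True: no future leakage
```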
Does the training set contain test data?
[Bar chart: 67.5 vs. 45.0, comparing 2007 and 2009.]
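A quick sanity check for that failure mode is to look for literal overlap between the two sets. This is a hypothetical sketch (exact-duplicate matching only; near-duplicates need fuzzier hashing):

```python
def leaked_examples(train_set, test_set):
    """Return test examples that also appear verbatim in the training set.
    Any hit means the test numbers are inflated by train/test leakage."""
    train_hashes = {hash(x) for x in train_set}
    return [x for x in test_set if hash(x) in train_hashes]

train = ["cheap viagra now", "hello from Microsoft", "meeting at 5"]
test = ["free $$$", "meeting at 5"]
print(leaked_examples(train, test))  # ['meeting at 5']
```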
Final Quiz
T/F: Increasing your training set size increases the training error. True
T/F: Better and more features always decrease the test error. False
T/F: Very low test error always indicates you are doing well. False
T/F: When an algorithm overfits there is a big gap between train and test error. True
T/F: The test error is (almost) never below the training error. True
Summary