
Machine Learning Tutorial

CB, GS, REC

Section 1

Machine Learning basic concepts

Machine Learning Tutorial for the UKP Lab, June 10, 2011

This ppt includes some slides / slide parts / text taken from online materials created by the following people: Greg Grudic, Alexander Vezhnevets, Hal Daumé III

What is Machine Learning?

"The goal of machine learning is to build computer systems that can adapt and learn from their experience." (Tom Dietterich)


A Generic System

[Figure: a generic system mapping inputs $x_1, \dots, x_N$ through hidden variables $h_1, \dots, h_K$ to outputs $y_1, \dots, y_M$.]

Input variables: $x = (x_1, x_2, \dots, x_N)$
Hidden variables: $h = (h_1, h_2, \dots, h_K)$
Output variables: $y = (y_1, y_2, \dots, y_M)$



When are ML algorithms NOT needed? When the relationships between all system variables (input, output, and hidden) are completely understood! This is NOT the case for almost any real system!


The Sub-Fields of ML

Supervised Learning Reinforcement Learning Unsupervised Learning


Supervised Learning

Given: training examples
$\{ (x_1, f(x_1)), (x_2, f(x_2)), \dots, (x_P, f(x_P)) \}$
for some unknown function (system) $y = f(x)$.

Find $f(x)$.
Predict $y = f(x)$ for an $x$ that is not in the training set.
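As a concrete illustration, here is a minimal supervised-learning sketch in Python with scikit-learn (not the Weka tool used later on these slides); the toy data and feature meanings are made up purely for illustration.

```python
# Minimal supervised learning sketch (illustrative only; assumes scikit-learn).
# Training examples (x_i, f(x_i)) are given; the learner approximates the unknown f.
from sklearn.tree import DecisionTreeClassifier

# Toy training data: x = (income_range, debt_load), y = credit decision (0 = bad, 1 = good)
X_train = [[3, 200], [5, 50], [1, 400], [4, 120], [2, 350]]
y_train = [0, 1, 0, 1, 0]

model = DecisionTreeClassifier()        # pick some hypothesis class
model.fit(X_train, y_train)             # "find f(x)" from the training examples

# Predict y = f(x) for an x that is not in the training set
print(model.predict([[4, 80]]))
```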


Model, model quality

Definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)

Learned hypothesis: model of the problem/task T
Model quality: accuracy/performance measured by P


Data / Examples / Sample / Instances


Data: experience E in the form of examples / instances, characteristic of the whole input space
  representative sample, independent and identically distributed (no bias in selection / observations)

Good example
  1000 abstracts chosen randomly out of 20M PubMed entries (abstracts): probably i.i.d.; representative?
  if annotation is involved, it is always a question of compromises

Definitely bad example
  all abstracts that have John Smith as an author

Instances have to be comparable to each other



Data / Examples / Sample / Instances


Example: a set of queries and a set of top retrieved documents for each (characterized via tf, idf, tf*idf, PRank, BM25 scores): try predicting relevance for reranking!
  the top retrieved set is dependent on the underlying IR system! issues with representativeness, but for reranking this is fine
  the characterization is dependent on the query (exc. PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (c.f. independent examples for the same Q)
  we have to normalize the features per query to have the same mean/variance
  we have to form pairs and compare e.g. the difference of feature values

Toy example:
  Q = learning:    rank 1: tf = 15, rank 100: tf = 2
  Q = overfitting: rank 1: tf = 2,  rank 10:  tf = 0
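A sketch of the per-query normalization mentioned above, assuming pandas; the column names and the data frame below are made up for illustration.

```python
# Per-query feature normalization for reranking (illustrative sketch).
# Each feature is standardized within its query so that scores from
# different queries become comparable.
import pandas as pd

docs = pd.DataFrame({
    "query": ["learning", "learning", "overfitting", "overfitting"],
    "tf":    [15, 2, 2, 0],
    "bm25":  [12.3, 4.1, 6.7, 1.2],
})

def zscore_per_query(group):
    return (group - group.mean()) / (group.std(ddof=0) + 1e-9)  # avoid division by zero

norm = docs.groupby("query")[["tf", "bm25"]].transform(zscore_per_query)
docs["tf_norm"] = norm["tf"]
docs["bm25_norm"] = norm["bm25"]
print(docs)
```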

Features

The available examples (experience) have to be described to the algorithm in a consumable format.
Here: examples are represented as vectors of pre-defined features. E.g. for credit risk assessment, typical features can be: income range, debt load, employment history, real estate properties, criminal record, city of residence, etc.

Common feature types
  binary   (criminal record, Y/N)
  nominal  (city of residence, X)
  ordinal  (income range, 0-10K, 10-20K, ...)
  numeric  (debt load, $)
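A small sketch of how these four feature types can be turned into a numeric vector, assuming pandas; the feature names and values are made up for illustration.

```python
# Turning mixed-type features into numeric columns (illustrative sketch).
import pandas as pd

applicants = pd.DataFrame({
    "criminal_record": ["N", "Y", "N"],             # binary
    "city": ["Darmstadt", "Berlin", "Darmstadt"],    # nominal
    "income_range": ["0-10K", "10-20K", "20-30K"],   # ordinal
    "debt_load": [1200.0, 300.0, 0.0],               # numeric
})

applicants["criminal_record"] = (applicants["criminal_record"] == "Y").astype(int)
applicants["income_range"] = applicants["income_range"].map(
    {"0-10K": 0, "10-20K": 1, "20-30K": 2})          # keep the order for the ordinal feature
features = pd.get_dummies(applicants, columns=["city"])  # one-hot encode the nominal feature
print(features)
```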


Machine Learning Tutorial


CB, GS, REC

Section 2

Experimental practice
By now you've learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features; the algorithm then learns a model of the problem based on the examples (usually some kind of optimization is performed in the background); and as a result, improvement is observed in terms of some performance measure.


Model parameters
Two kinds of parameters:

Hyperparameters: set by the user for the training procedure in advance
  the degree of the polynomial to fit in regression
  the number/size of hidden layers in a neural network
  the number of instances per leaf in a decision tree

Parameters: what actually gets optimized during training
  regression coefficients
  network weights
  size/depth of the decision tree (in Weka; other implementations might allow controlling that)

We usually do not talk about the latter, but refer to hyperparameters as parameters.

Hyperparameters
The fewer the algorithm has, the better
  Is Naive Bayes the best then? It has no hyperparameters!
  Usually, algorithms with better discriminative power are not parameter-free.

Hyperparameters are typically set to optimize performance (on a validation set, or through cross-validation)
  manual search, grid search, simulated annealing, gradient descent, etc.

Common pitfall:
  selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results
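A sketch of hyperparameter selection by grid search over cross-validation, with the final score reported on a held-out test set rather than on the CV used for tuning (avoiding the pitfall above). This uses scikit-learn and an example parameter grid, not the Weka setup from the slides.

```python
# Hyperparameter tuning sketch: grid search with CV on the training portion only,
# final evaluation on a held-out test set. Illustrative; assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},  # the hyperparameter being tuned
    cv=10,                                                # 10-fold CV on the training data
)
grid.fit(X_train, y_train)
print("best hyperparameters:", grid.best_params_)
print("blind test accuracy: ", grid.score(X_test, y_test))
```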

Cross-Validation: Illustration

[Figure: the sample $X_k = \{x_1, \dots, x_k\}$ is split into folds $X_1, X_2, X_3, X_4, X_5$; in each iteration one fold is held out as the test set and the remaining folds are used for training.]

The result is an average over all iterations.
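The same idea as a minimal loop; scikit-learn's KFold is assumed just for the splitting, and the classifier and dataset are arbitrary placeholders.

```python
# 5-fold cross-validation as in the figure: each fold serves once as the test
# set, the rest as training data, and the result is averaged over all iterations.
# Illustrative sketch; assumes scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("per-fold accuracy:", np.round(scores, 3))
print("average accuracy: ", np.mean(scores))
```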


Cross-Validation

n-fold CV: common practice for making (hyper)parameter estimation more robust
  round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
  typical: random splits, without replacement (each instance is tested exactly once)
  the other way: random subsampling cross-validation

n-fold CV: common practice to report average performance, deviation, etc.
  "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (Bengio and Grandvalet, 2004): bad practice?
  problem: training sets largely overlap, so test errors are also dependent
  this tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
  5x2 CV is a better option: do 2-fold CV, repeat 5 times and calculate the average; there is less overlap in the training sets

Folding via natural units of processing for the given task
  typically, document boundaries
  best practice is doing the folding yourself (see the sketch below)! The ML package / CSV representation is not aware of e.g. document boundaries! The PPI case.
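One way to fold by natural units (here, hypothetical document IDs) yourself instead of letting the package split rows blindly; a sketch using scikit-learn's GroupKFold, with all data made up.

```python
# Folding by natural units (documents): instances from the same document never
# end up in both the training and the test fold. Illustrative sketch;
# assumes scikit-learn, document IDs and labels are made up.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(10, 2)                      # 10 instances, 2 dummy features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
doc_ids = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 4])    # which document each instance came from

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=doc_ids):
    assert not set(doc_ids[train_idx]) & set(doc_ids[test_idx])  # no document spans both sides
    print("test documents:", sorted(set(doc_ids[test_idx])))
```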


Cross-Validation

Ideally, the valid settings are:

  take off-the-shelf algorithms, avoid parameter tuning and compare results, e.g. via cross-validation
    n.b. you probably do the folding yourself, trying to minimize biases!

  do parameter tuning (n.b. selecting/tuning your features is also tuning!), but then you normally have to have a blind set (from the beginning)
    e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn experimental best practice and to align with the predefined standards (you might even benefit from comparative results, etc.)

You might want to do something different: be aware of these settings & the consequences.

The ML workflow

Common ML experimenting pipeline (see the sketch below):

1. Define the task
   instance, target variable/labels; collect and label/annotate data
   credit risk assessment: 1 instance = 1 credit request; label = good/bad credit; data = credits that ran out in the previous year

2. Define and collect/calculate features; define train / validation (development) / test (evaluation) data

3. Pick a learning algorithm (e.g. decision tree), train the model
   train on the training set
   optimize/set the model hyperparameters (e.g. number of instances per leaf, use of pruning, ...) according to performance on the validation data
   cross-validation: use all training data as validation data
   test model accuracy on the (blind) test set

4. Ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set.
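The same four steps as a compact sketch: split into train / validation / test, tune one hyperparameter on the validation set, then measure accuracy once on the blind test set. Uses scikit-learn and a built-in toy dataset rather than actual credit data.

```python
# ML workflow sketch (illustrative; assumes scikit-learn, toy data instead of
# real credit requests). Covers steps 2-4 of the pipeline above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 2: define train / validation / test data
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 3: pick an algorithm, tune a hyperparameter on the validation set
best_leaf, best_acc = None, 0.0
for leaf in [1, 2, 5, 10, 20]:
    model = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_leaf, best_acc = leaf, acc

# Retrain with the chosen hyperparameter and test once on the blind test set
final_model = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=0)
final_model.fit(X_train, y_train)
print("chosen min_samples_leaf:", best_leaf)
print("blind test accuracy:    ", final_model.score(X_test, y_test))

# Step 4: the model is now ready to predict unseen instances
```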

Try this in Weka

=== Run information ===
Relation:   segment
Instances:  1500
Attributes: 20
Test mode:  split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances     290    96.6667 %
Incorrectly Classified Instances    10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances     281    93.6667 %
Incorrectly Classified Instances    19     6.3333 %

Model complexity

Fitting a polynomial regression:

$$a(x) = \sum_{n=0}^{M} \omega_n x^n$$

By, for instance, least squares:

$$\boldsymbol{\omega} = \arg\min_{\boldsymbol{\omega}} \sum_{j} \left( y_j - \sum_{n=0}^{M} \omega_n x_j^n \right)^2$$

[Figure: four panels fitting the same data points (t vs. x on [0, 1]) with polynomials of degree M = 0, M = 1, M = 3 and M = 9.]
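A sketch reproducing the idea of the four panels with numpy's polyfit: the same small noisy sample fitted with polynomials of degree 0, 1, 3, and 9. The data is generated here and is not the slide's data.

```python
# Polynomial regression of increasing degree M on the same small noisy sample
# (illustrative numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy targets

for M in [0, 1, 3, 9]:
    coeffs = np.polyfit(x, t, deg=M)          # least-squares fit of the coefficients
    fit = np.polyval(coeffs, x)
    train_rmse = np.sqrt(np.mean((fit - t) ** 2))
    print(f"M = {M}: training RMSE = {train_rmse:.3f}")
# M = 9 drives the training error to ~0, but the curve oscillates wildly
# between the points: a classic overfit.
```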


Data size and model complexity

Important concept: the discriminative power of the algorithm
  linear vs. nonlinear models
  some theoretical aspects: a 1-hidden-layer NN with an unlimited number of hidden nodes can perfectly model any smooth function/surface


Data size and model complexity


Overfitting: the model learns to classify the training data perfectly, but has no (or bad) generalization ability
  results in high test error (useless model)
  typical for small sample sizes and powerful models

Underfitting: the model is not capable of learning the (complex) patterns in the training set

Reasons for underfitting and overfitting:
  lack of discriminative power
  small sample size
  noise in the data (labels or features)
  the generalization ability of the algorithm has to be chosen w.r.t. the sample size

The size (complexity) of the learnt model grows with the data size
  if the data is consistent, this is OK

Predictions: the confusion matrix

TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n

Good predictions: TP + TN
Errors: FP (false alarm) + FN (miss)
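A small sketch of reading TP/FP/TN/FN off a confusion matrix, assuming scikit-learn; the label vectors are made up for illustration.

```python
# Confusion matrix sketch (illustrative; assumes scikit-learn). Labels: 1 = p, 0 = n.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TP =", tp, " FP =", fp, " TN =", tn, " FN =", fn)
print("good predictions:", tp + tn, " errors:", fp + fn)
```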

Evaluation measures
Accuracy
The rate of correct predictions made by the model over a data set (cf. coverage): (TP+TN) / (TP+FP+TN+FN)

Error rate
The rate of incorrect predictions made by the model over a data set: (FP+FN) / (TP+FP+TN+FN)

[Root]?[Mean|Absolute][Squared]?Error
The difference between the predicted and actual values, e.g.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2}$$

Algorithms (e.g. those in Weka) typically optimize these measures
  there might be a mismatch between the optimization objective and the actual evaluation measure
  optimizing different measures is research on its own (e.g. in ML for IR, a.k.a. learning to rank)
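The three measures computed directly from their definitions, as a small numpy sketch on made-up predictions.

```python
# Accuracy, error rate and RMSE computed from their definitions
# (numpy sketch, made-up predictions for illustration).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_pred == y_true)          # (TP + TN) / all
error_rate = 1.0 - accuracy                   # (FP + FN) / all

scores = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # f(x), e.g. predicted probabilities
rmse = np.sqrt(np.mean((scores - y_true) ** 2))

print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}, RMSE = {rmse:.3f}")
```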


Evaluation measures
Precision
Fraction of correctly predicted positives over all predicted positives: TP/(TP+FP)

TP: p classified as p; FP: n classified as p; TN: n classified as n; FN: p classified as n

Recall
Fraction of correctly predicted positives over all actual positives: TP/(TP+FN)

F measure
weighted harmonic mean of Precision and Recall (usually equally weighted, β = 1)

$$F_\beta = \frac{(1+\beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$$

Only makes sense for a subset of classes (usually measured for a single class)
For all classes, it equals the accuracy
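The formula above, evaluated directly from TP/FP/FN counts in a short sketch; the counts are made up for illustration.

```python
# Precision, recall and the F measure from TP / FP / FN counts
# (sketch; the counts are made up for illustration).
def f_measure(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(f_measure(tp=8, fp=2, fn=4))          # balanced F1
print(f_measure(tp=8, fp=2, fn=4, beta=2))  # beta = 2 weights recall higher
```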

Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.

A sequence of tokens with the same label is treated as a single instance:
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Why? We need complete phrases to be identified correctly.
How? With an external evaluation script, e.g. conlleval for NER.

Example tagging:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.

Multiple penalty:
  3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5
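Below is a small sketch of how such phrase-level counting works, in the spirit of conlleval but not the official script; the example labels follow the slide, and the helper function is illustrative Python.

```python
# Sketch of phrase-level (sequence) evaluation: consecutive tokens with the same
# non-O label form one phrase; only exact phrase matches count as true positives.
# Sentence: "John studied at the Johns Hopkins University before joining IBM"
def phrases(labels):
    spans, start = set(), None
    for i, lab in enumerate(labels + ["O"]):          # sentinel closes the last phrase
        if start is not None and (lab == "O" or lab != labels[start]):
            spans.add((start, i, labels[start]))       # (begin, end, label)
            start = None
        if lab != "O" and start is None:
            start = i
    return spans

gold = ["PER", "O", "O", "O", "ORG", "ORG", "ORG", "O", "O", "ORG"]
pred = ["PER", "O", "O", "O", "PER", "PER", "ORG", "O", "O", "ORG"]

g, p = phrases(gold), phrases(pred)
tp, fp, fn = len(g & p), len(p - g), len(g - p)
print("TP =", tp, " FP =", fp, " FN =", fn)            # TP = 2, FP = 2, FN = 1, as on the slide
```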

Loss types

1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.
3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), F-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4) become dysfunctional when you are optimizing them!
  phrase P/R/F, e.g. in NER
  readability measures

Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.

Gold standard:
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.

Example tagging 1:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
  3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5

Example tagging 2:
John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
  3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  0 FPs
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 1.0, F(ORG) = 0.67

Optimizing phrase-F can encourage / prefer systems that do not mark entities!
  Most likely, this is bad!

ROC curve

ROC: Receiver Operating Characteristic curve
A curve that depicts the relation between recall (sensitivity) and the false positive rate (1 - specificity).

[Figure: ROC plot with curves labelled "Best case" and "Worst case"; y-axis: sensitivity (recall), x-axis: false positives, FP / (FP + TN).]

Evaluation measures

Area under the ROC curve (AUC)
As you vary the decision threshold, you can plot recall vs. the false positive rate. The area under the curve measures how accurately your model separates positives from negatives.
  perfect ranking: AUC = 1.0
  random decision: AUC = 0.5

Similarly (e.g. in IR): area under the P/R curve
  used when there are too many (true) negatives
  correctly identifying negatives is not interesting anyway
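A short AUC sketch, assuming scikit-learn; the labels and scores are made up for illustration.

```python
# AUC sketch (illustrative; assumes scikit-learn): measures how well positives
# are ranked above negatives as the decision threshold varies.
from sklearn.metrics import roc_auc_score

y_true   = [1, 1, 1, 0, 0, 0, 0, 0]
y_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]   # e.g. predicted probabilities

print("AUC =", roc_auc_score(y_true, y_scores))   # 1.0 = perfect ranking, 0.5 = random
```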


Evaluation measures (Ranking)

Precision@K
  number of true positives in the top K predictions / ranks

MAP
  the average of precisions computed at the rank of each of the positives in the ranked list (P = 0 for positives not ranked at all)

NDCG
  for graded relevance / ranking: highly relevant documents appearing lower in a search result list are penalized, as the graded relevance value is reduced logarithmically, proportional to the position of the result
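Precision@K and average precision written out from their definitions for a single ranked list (MAP is then the mean of AP over queries); the relevance labels below are made up for illustration.

```python
# Precision@K and average precision for a single ranked list
# (sketch from the definitions; the relevance labels are made up).
def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k

def average_precision(relevance, n_relevant):
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i            # precision at the rank of this positive
    return total / n_relevant            # positives not retrieved at all contribute 0

ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # relevance of the documents, in ranked order
print("P@5 =", precision_at_k(ranked, 5))
print("AP  =", average_precision(ranked, n_relevant=4))  # one relevant doc was never retrieved
```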


Learning curve

Measures how the accuracy / error of the model changes with
  sample size
  iteration number

Smaller sample:
  worse accuracy
  more likely bias in the estimate (representative sample?)
  more variance in the estimate

Typical learning curve: accuracy improves with sample size. If it looks different:
  you are plotting error vs. size/iteration
  you are doing something wrong!
  overfitting (iteration, not sample size)!
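A sketch of estimating such a curve, assuming scikit-learn's learning_curve helper and a built-in toy dataset; any classifier could stand in for the decision tree.

```python
# Learning curve sketch (illustrative; assumes scikit-learn): accuracy of the
# same model trained on increasing fractions of the data, estimated with CV.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, acc in zip(sizes, test_scores.mean(axis=1)):
    print(f"training size {n:4d}: held-out accuracy = {acc:.3f}")
# Accuracy should generally increase with sample size; if it does not,
# check what you are plotting (error vs. accuracy) or look for overfitting.
```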

Data or Algorithm?

Compare the accuracy of various machine learning algorithms with a varying amount of training data (Banko & Brill, 2001):
  Winnow, perceptron, naive Bayes, memory-based learner

Features:
  bag of words: words within a window of the target word
  collocations containing specific words and/or parts of speech

Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)

Take home messages (up until now)


Supervised learning: based on a set of labeled examples (x, f(x)) learn the p g p ( , ( )) input-output mapping, i.e. f(x) 3 factors of successful machine learning models
much data good features well-suited learning algorithm

ML workflow
1. 2. 2 3. 4. problem definition feature engineering; experimental setup /train validation test / /train, validation, / selection of learning algorithm, (hyper)parameter tuning, training a final model predict unseen examples & fill tables / draw figures for the paper - test

Careful ith C f l with


data representation (i.i.d, comparability, ) experimental setup (cross-validation, blind testing, ) data size and algorithm selection (+ overfitting underfitting ) overfitting, underfitting, ) evaluation measures
