
Machine Learning Tutorial

CB, GS, REC

Section 1

Machine Learning basic concepts

Machine Learning Tutorial for the UKP Lab, June 10, 2011

This ppt includes some slides / slide parts / text taken from online materials created by the following people: Greg Grudic, Alexander Vezhnevets, Hal Daumé III

What is Machine Learning?

"The goal of machine learning is to build computer systems that can adapt and learn from their experience." (Tom Dietterich)


A Generic System

[Figure: a generic system mapping inputs $x_1, \dots, x_N$ through hidden variables $h_1, \dots, h_K$ to outputs $y_1, \dots, y_M$.]

Input variables: $x = (x_1, x_2, \dots, x_N)$
Hidden variables: $h = (h_1, h_2, \dots, h_K)$
Output variables: $y = (y_1, y_2, \dots, y_M)$



When are ML algorithms NOT needed? When the relationships between all system variables (input, output, and hidden) are completely understood! This is NOT the case for almost any real system!


The Sub-Fields of ML

Supervised Learning Reinforcement Learning Unsupervised Learning


Supervised Learning

Given: training examples
$\{ (x_1, f(x_1)), (x_2, f(x_2)), \dots, (x_P, f(x_P)) \}$
for some unknown function (system) $y = f(x)$.

Find $f(x)$.
Predict $y = f(x)$ for an $x$ that is not in the training set.
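As a concrete illustration, here is a minimal supervised-learning sketch in Python with scikit-learn (not the Weka tool used later on these slides); the toy data and feature meanings are made up purely for illustration.

```python
# Minimal supervised learning sketch (illustrative only; assumes scikit-learn).
# Training examples (x_i, f(x_i)) are given; the learner approximates the unknown f.
from sklearn.tree import DecisionTreeClassifier

# Toy training data: x = (income_range, debt_load), y = credit decision (0 = bad, 1 = good)
X_train = [[3, 200], [5, 50], [1, 400], [4, 120], [2, 350]]
y_train = [0, 1, 0, 1, 0]

model = DecisionTreeClassifier()        # pick some hypothesis class
model.fit(X_train, y_train)             # "find f(x)" from the training examples

# Predict y = f(x) for an x that is not in the training set
print(model.predict([[4, 80]]))
```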


Model, model quality

Definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Tom Mitchell)

Learned hypothesis: model of the problem/task T
Model quality: accuracy/performance measured by P


Data / Examples / Sample / Instances


Data: experience E in the form of examples / instances, characteristic of the whole input space
  representative sample, independent and identically distributed (no bias in selection / observations)

Good example
  1000 abstracts chosen randomly out of 20M PubMed entries (abstracts): probably i.i.d.; representative?
  if annotation is involved, it is always a question of compromises

Definitely bad example
  all abstracts that have John Smith as an author

Instances have to be comparable to each other



Data / Examples / Sample / Instances


Example: a set of queries and a set of top retrieved documents for each (characterized via tf, idf, tf*idf, PRank, BM25 scores): try predicting relevance for reranking!
  the top retrieved set is dependent on the underlying IR system! issues with representativeness, but for reranking this is fine
  the characterization is dependent on the query (exc. PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (c.f. independent examples for the same Q)
  we have to normalize the features per query to have the same mean/variance
  we have to form pairs and compare e.g. the difference of feature values

Toy example:
  Q = learning:    rank 1: tf = 15, rank 100: tf = 2
  Q = overfitting: rank 1: tf = 2,  rank 10:  tf = 0
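A sketch of the per-query normalization mentioned above, assuming pandas; the column names and the data frame below are made up for illustration.

```python
# Per-query feature normalization for reranking (illustrative sketch).
# Each feature is standardized within its query so that scores from
# different queries become comparable.
import pandas as pd

docs = pd.DataFrame({
    "query": ["learning", "learning", "overfitting", "overfitting"],
    "tf":    [15, 2, 2, 0],
    "bm25":  [12.3, 4.1, 6.7, 1.2],
})

def zscore_per_query(group):
    return (group - group.mean()) / (group.std(ddof=0) + 1e-9)  # avoid division by zero

norm = docs.groupby("query")[["tf", "bm25"]].transform(zscore_per_query)
docs["tf_norm"] = norm["tf"]
docs["bm25_norm"] = norm["bm25"]
print(docs)
```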

Features

The available examples (experience) have to be described to the algorithm in a consumable format.
Here: examples are represented as vectors of pre-defined features. E.g. for credit risk assessment, typical features can be: income range, debt load, employment history, real estate properties, criminal record, city of residence, etc.

Common feature types
  binary   (criminal record, Y/N)
  nominal  (city of residence, X)
  ordinal  (income range, 0-10K, 10-20K, ...)
  numeric  (debt load, $)
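A small sketch of how these four feature types can be turned into a numeric vector, assuming pandas; the feature names and values are made up for illustration.

```python
# Turning mixed-type features into numeric columns (illustrative sketch).
import pandas as pd

applicants = pd.DataFrame({
    "criminal_record": ["N", "Y", "N"],             # binary
    "city": ["Darmstadt", "Berlin", "Darmstadt"],    # nominal
    "income_range": ["0-10K", "10-20K", "20-30K"],   # ordinal
    "debt_load": [1200.0, 300.0, 0.0],               # numeric
})

applicants["criminal_record"] = (applicants["criminal_record"] == "Y").astype(int)
applicants["income_range"] = applicants["income_range"].map(
    {"0-10K": 0, "10-20K": 1, "20-30K": 2})          # keep the order for the ordinal feature
features = pd.get_dummies(applicants, columns=["city"])  # one-hot encode the nominal feature
print(features)
```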


Machine Learning Tutorial


CB, GS, REC

Section 2

Experimental practice
By now you've learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features; the algorithm then learns a model of the problem based on the examples (usually some kind of optimization is performed in the background); and as a result, improvement is observed in terms of some performance measure.


Model parameters
Two kinds of parameters:

Hyperparameters: set by the user for the training procedure in advance
  the degree of the polynomial to fit in regression
  the number/size of hidden layers in a neural network
  the number of instances per leaf in a decision tree

Parameters: what actually gets optimized during training
  regression coefficients
  network weights
  size/depth of the decision tree (in Weka; other implementations might allow controlling that)

We usually do not talk about the latter, but refer to hyperparameters as parameters.

Hyperparameters
The fewer the algorithm has, the better
  Is Naive Bayes the best then? It has no hyperparameters!
  Usually, algorithms with better discriminative power are not parameter-free.

Hyperparameters are typically set to optimize performance (on a validation set, or through cross-validation)
  manual search, grid search, simulated annealing, gradient descent, etc.

Common pitfall:
  selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results
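A sketch of hyperparameter selection by grid search over cross-validation, with the final score reported on a held-out test set rather than on the CV used for tuning (avoiding the pitfall above). This uses scikit-learn and an example parameter grid, not the Weka setup from the slides.

```python
# Hyperparameter tuning sketch: grid search with CV on the training portion only,
# final evaluation on a held-out test set. Illustrative; assumes scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},  # the hyperparameter being tuned
    cv=10,                                                # 10-fold CV on the training data
)
grid.fit(X_train, y_train)
print("best hyperparameters:", grid.best_params_)
print("blind test accuracy: ", grid.score(X_test, y_test))
```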

Cross-Validation: Illustration

[Figure: the sample $X_k = \{x_1, \dots, x_k\}$ is split into folds $X_1, X_2, X_3, X_4, X_5$; in each iteration one fold is held out as the test set and the remaining folds are used for training.]

The result is an average over all iterations.
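The same idea as a minimal loop; scikit-learn's KFold is assumed just for the splitting, and the classifier and dataset are arbitrary placeholders.

```python
# 5-fold cross-validation as in the figure: each fold serves once as the test
# set, the rest as training data, and the result is averaged over all iterations.
# Illustrative sketch; assumes scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("per-fold accuracy:", np.round(scores, 3))
print("average accuracy: ", np.mean(scores))
```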


Cross-Validation

n-fold CV: common practice for making (hyper)parameter estimation more robust
  round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
  typical: random splits, without replacement (each instance is tested exactly once)
  the other way: random subsampling cross-validation

n-fold CV: common practice to report average performance, deviation, etc.
  "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (Bengio and Grandvalet, 2004): bad practice?
  problem: training sets largely overlap, so test errors are also dependent
  this tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution)
  5x2 CV is a better option: do 2-fold CV, repeat 5 times and calculate the average; there is less overlap in the training sets

Folding via natural units of processing for the given task
  typically, document boundaries
  best practice is doing the folding yourself (see the sketch below)! The ML package / CSV representation is not aware of e.g. document boundaries! The PPI case.
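One way to fold by natural units (here, hypothetical document IDs) yourself instead of letting the package split rows blindly; a sketch using scikit-learn's GroupKFold, with all data made up.

```python
# Folding by natural units (documents): instances from the same document never
# end up in both the training and the test fold. Illustrative sketch;
# assumes scikit-learn, document IDs and labels are made up.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(20).reshape(10, 2)                      # 10 instances, 2 dummy features
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
doc_ids = np.array([0, 0, 1, 1, 1, 2, 2, 3, 3, 4])    # which document each instance came from

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=doc_ids):
    assert not set(doc_ids[train_idx]) & set(doc_ids[test_idx])  # no document spans both sides
    print("test documents:", sorted(set(doc_ids[test_idx])))
```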


Cross-Validation

Ideally, the valid settings are:

  take off-the-shelf algorithms, avoid parameter tuning and compare results, e.g. via cross-validation
    n.b. you probably do the folding yourself, trying to minimize biases!

  do parameter tuning (n.b. selecting/tuning your features is also tuning!), but then you normally have to have a blind set (from the beginning)
    e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn experimental best practice and to align with the predefined standards (you might even benefit from comparative results, etc.)

You might want to do something different: be aware of these settings & the consequences.

The ML workflow

Common ML experimenting pipeline (see the sketch below):

1. Define the task
   instance, target variable/labels; collect and label/annotate data
   credit risk assessment: 1 instance = 1 credit request; label = good/bad credit; data = credits that ran out in the previous year

2. Define and collect/calculate features; define train / validation (development) / test (evaluation) data

3. Pick a learning algorithm (e.g. decision tree), train the model
   train on the training set
   optimize/set the model hyperparameters (e.g. number of instances per leaf, use of pruning, ...) according to performance on the validation data
   cross-validation: use all training data as validation data
   test model accuracy on the (blind) test set

4. Ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set.
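The same four steps as a compact sketch: split into train / validation / test, tune one hyperparameter on the validation set, then measure accuracy once on the blind test set. Uses scikit-learn and a built-in toy dataset rather than actual credit data.

```python
# ML workflow sketch (illustrative; assumes scikit-learn, toy data instead of
# real credit requests). Covers steps 2-4 of the pipeline above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 2: define train / validation / test data
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Step 3: pick an algorithm, tune a hyperparameter on the validation set
best_leaf, best_acc = None, 0.0
for leaf in [1, 2, 5, 10, 20]:
    model = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_leaf, best_acc = leaf, acc

# Retrain with the chosen hyperparameter and test once on the blind test set
final_model = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=0)
final_model.fit(X_train, y_train)
print("chosen min_samples_leaf:", best_leaf)
print("blind test accuracy:    ", final_model.score(X_test, y_test))

# Step 4: the model is now ready to predict unseen instances
```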

Try this in Weka

=== Run information ===
Relation:   segment
Instances:  1500
Attributes: 20
Test mode:  split 80.0% train, remainder test

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances     290    96.6667 %
Incorrectly Classified Instances    10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances     281    93.6667 %
Incorrectly Classified Instances    19     6.3333 %

Model complexity

Fitting a polynomial regression:

$$a(x) = \sum_{n=0}^{M} \omega_n x^n$$

By, for instance, least squares:

$$\boldsymbol{\omega} = \arg\min_{\boldsymbol{\omega}} \sum_{j} \left( y_j - \sum_{n=0}^{M} \omega_n x_j^n \right)^2$$

[Figure: four panels fitting the same data points (t vs. x on [0, 1]) with polynomials of degree M = 0, M = 1, M = 3 and M = 9.]
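A sketch reproducing the idea of the four panels with numpy's polyfit: the same small noisy sample fitted with polynomials of degree 0, 1, 3, and 9. The data is generated here and is not the slide's data.

```python
# Polynomial regression of increasing degree M on the same small noisy sample
# (illustrative numpy sketch).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)  # noisy targets

for M in [0, 1, 3, 9]:
    coeffs = np.polyfit(x, t, deg=M)          # least-squares fit of the coefficients
    fit = np.polyval(coeffs, x)
    train_rmse = np.sqrt(np.mean((fit - t) ** 2))
    print(f"M = {M}: training RMSE = {train_rmse:.3f}")
# M = 9 drives the training error to ~0, but the curve oscillates wildly
# between the points: a classic overfit.
```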


Data size and model complexity

Important concept: the discriminative power of the algorithm
  linear vs. nonlinear models
  some theoretical aspects: a 1-hidden-layer NN with an unlimited number of hidden nodes can perfectly model any smooth function/surface


Data size and model complexity


Overfitting: the model learns to classify the training data perfectly, but has no (or bad) generalization ability
  results in high test error (useless model)
  typical for small sample sizes and powerful models

Underfitting: the model is not capable of learning the (complex) patterns in the training set

Reasons for underfitting and overfitting:
  lack of discriminative power
  small sample size
  noise in the data (labels or features)
  the generalization ability of the algorithm has to be chosen w.r.t. the sample size

The size (complexity) of the learnt model grows with the data size
  if the data is consistent, this is OK

Predictions: the confusion matrix

TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n

Good predictions: TP + TN
Errors: FP (false alarm) + FN (miss)
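A small sketch of reading TP/FP/TN/FN off a confusion matrix, assuming scikit-learn; the label vectors are made up for illustration.

```python
# Confusion matrix sketch (illustrative; assumes scikit-learn). Labels: 1 = p, 0 = n.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TP =", tp, " FP =", fp, " TN =", tn, " FN =", fn)
print("good predictions:", tp + tn, " errors:", fp + fn)
```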

Evaluation measures
Accuracy
The rate of correct predictions made by the model over a data set (cf. coverage): (TP+TN) / (TP+FP+TN+FN)

Error rate
The rate of incorrect predictions made by the model over a data set: (FP+FN) / (TP+FP+TN+FN)

[Root]?[Mean|Absolute][Squared]?Error
The difference between the predicted and actual values, e.g.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( f(x_i) - y_i \right)^2}$$

Algorithms (e.g. those in Weka) typically optimize these measures
  there might be a mismatch between the optimization objective and the actual evaluation measure
  optimizing different measures is research on its own (e.g. in ML for IR, a.k.a. learning to rank)
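The three measures computed directly from their definitions, as a small numpy sketch on made-up predictions.

```python
# Accuracy, error rate and RMSE computed from their definitions
# (numpy sketch, made-up predictions for illustration).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

accuracy = np.mean(y_pred == y_true)          # (TP + TN) / all
error_rate = 1.0 - accuracy                   # (FP + FN) / all

scores = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])  # f(x), e.g. predicted probabilities
rmse = np.sqrt(np.mean((scores - y_true) ** 2))

print(f"accuracy = {accuracy:.3f}, error rate = {error_rate:.3f}, RMSE = {rmse:.3f}")
```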


Evaluation measures
Precision
Fraction of correctly predicted positives over all predicted positives: TP/(TP+FP)

TP: p classified as p; FP: n classified as p; TN: n classified as n; FN: p classified as n

Recall
Fraction of correctly predicted positives over all actual positives: TP/(TP+FN)

F measure
weighted harmonic mean of Precision and Recall (usually equally weighted, β = 1)

$$F_\beta = \frac{(1+\beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$$

Only makes sense for a subset of classes (usually measured for a single class)
For all classes, it equals the accuracy
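The formula above, evaluated directly from TP/FP/FN counts in a short sketch; the counts are made up for illustration.

```python
# Precision, recall and the F measure from TP / FP / FN counts
# (sketch; the counts are made up for illustration).
def f_measure(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(f_measure(tp=8, fp=2, fn=4))          # balanced F1
print(f_measure(tp=8, fp=2, fn=4, beta=2))  # beta = 2 weights recall higher
```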

Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.

A sequence of tokens with the same label is treated as a single instance:
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Why? We need complete phrases to be identified correctly.
How? With an external evaluation script, e.g. conlleval for NER.

Example tagging:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.

Multiple penalty:
  3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5
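Below is a small sketch of how such phrase-level counting works, in the spirit of conlleval but not the official script; the example labels follow the slide, and the helper function is illustrative Python.

```python
# Sketch of phrase-level (sequence) evaluation: consecutive tokens with the same
# non-O label form one phrase; only exact phrase matches count as true positives.
# Sentence: "John studied at the Johns Hopkins University before joining IBM"
def phrases(labels):
    spans, start = set(), None
    for i, lab in enumerate(labels + ["O"]):          # sentinel closes the last phrase
        if start is not None and (lab == "O" or lab != labels[start]):
            spans.add((start, i, labels[start]))       # (begin, end, label)
            start = None
        if lab != "O" and start is None:
            start = i
    return spans

gold = ["PER", "O", "O", "O", "ORG", "ORG", "ORG", "O", "O", "ORG"]
pred = ["PER", "O", "O", "O", "PER", "PER", "ORG", "O", "O", "ORG"]

g, p = phrases(gold), phrases(pred)
tp, fp, fn = len(g & p), len(p - g), len(g - p)
print("TP =", tp, " FP =", fp, " FN =", fn)            # TP = 2, FP = 2, FN = 1, as on the slide
```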

Loss types

1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.
3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), F-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

Be careful what you are optimizing! Some measures (typically of Type 4) become dysfunctional when you are optimizing them!
  phrase P/R/F, e.g. in NER
  readability measures

Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.

Gold standard:
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.

Example tagging 1:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
  3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  2 FPs: Johns Hopkins (PER) and University (ORG)
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 0.67, F(ORG) = 0.5

Example tagging 2:
John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
  3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
  0 FPs
  1 FN: Johns Hopkins University (ORG)
  F(PER) = 1.0, F(ORG) = 0.67

Optimizing phrase-F can encourage / prefer systems that do not mark entities!
  Most likely, this is bad!

ROC curve

ROC: Receiver Operating Characteristic curve
A curve that depicts the relation between recall (sensitivity) and the false positive rate (1 - specificity).

[Figure: ROC plot with curves labelled "Best case" and "Worst case"; y-axis: sensitivity (recall), x-axis: false positives, FP / (FP + TN).]

Evaluation measures

Area under the ROC curve (AUC)
As you vary the decision threshold, you can plot recall vs. the false positive rate. The area under the curve measures how accurately your model separates positives from negatives.
  perfect ranking: AUC = 1.0
  random decision: AUC = 0.5

Similarly (e.g. in IR): area under the P/R curve
  used when there are too many (true) negatives
  correctly identifying negatives is not interesting anyway
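A short AUC sketch, assuming scikit-learn; the labels and scores are made up for illustration.

```python
# AUC sketch (illustrative; assumes scikit-learn): measures how well positives
# are ranked above negatives as the decision threshold varies.
from sklearn.metrics import roc_auc_score

y_true   = [1, 1, 1, 0, 0, 0, 0, 0]
y_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]   # e.g. predicted probabilities

print("AUC =", roc_auc_score(y_true, y_scores))   # 1.0 = perfect ranking, 0.5 = random
```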


Evaluation measures (Ranking)

Precision@K
  number of true positives in the top K predictions / ranks

MAP
  the average of precisions computed at the rank of each of the positives in the ranked list (P = 0 for positives not ranked at all)

NDCG
  for graded relevance / ranking: highly relevant documents appearing lower in a search result list are penalized, as the graded relevance value is reduced logarithmically, proportional to the position of the result
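Precision@K and average precision written out from their definitions for a single ranked list (MAP is then the mean of AP over queries); the relevance labels below are made up for illustration.

```python
# Precision@K and average precision for a single ranked list
# (sketch from the definitions; the relevance labels are made up).
def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k

def average_precision(relevance, n_relevant):
    hits, total = 0, 0.0
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / i            # precision at the rank of this positive
    return total / n_relevant            # positives not retrieved at all contribute 0

ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]  # relevance of the documents, in ranked order
print("P@5 =", precision_at_k(ranked, 5))
print("AP  =", average_precision(ranked, n_relevant=4))  # one relevant doc was never retrieved
```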


Learning curve

Measures how the accuracy / error of the model changes with
  sample size
  iteration number

Smaller sample:
  worse accuracy
  more likely bias in the estimate (representative sample?)
  more variance in the estimate

Typical learning curve: accuracy improves with sample size. If it looks different:
  you are plotting error vs. size/iteration
  you are doing something wrong!
  overfitting (iteration, not sample size)!
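A sketch of estimating such a curve, assuming scikit-learn's learning_curve helper and a built-in toy dataset; any classifier could stand in for the decision tree.

```python
# Learning curve sketch (illustrative; assumes scikit-learn): accuracy of the
# same model trained on increasing fractions of the data, estimated with CV.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, acc in zip(sizes, test_scores.mean(axis=1)):
    print(f"training size {n:4d}: held-out accuracy = {acc:.3f}")
# Accuracy should generally increase with sample size; if it does not,
# check what you are plotting (error vs. accuracy) or look for overfitting.
```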

Data or Algorithm?

Compare the accuracy of various machine learning algorithms with a varying amount of training data (Banko & Brill, 2001):
  Winnow, perceptron, naive Bayes, memory-based learner

Features:
  bag of words: words within a window of the target word
  collocations containing specific words and/or parts of speech

Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)

Take home messages (up until now)


Supervised learning: based on a set of labeled examples (x, f(x)) learn the p g p ( , ( )) input-output mapping, i.e. f(x) 3 factors of successful machine learning models
much data good features well-suited learning algorithm

ML workflow
1. 2. 2 3. 4. problem definition feature engineering; experimental setup /train validation test / /train, validation, / selection of learning algorithm, (hyper)parameter tuning, training a final model predict unseen examples & fill tables / draw figures for the paper - test

Careful ith C f l with


data representation (i.i.d, comparability, ) experimental setup (cross-validation, blind testing, ) data size and algorithm selection (+ overfitting underfitting ) overfitting, underfitting, ) evaluation measures
