Section 1
Machine Learning Tutorial for the UKP lab, June 10, 2011
This ppt includes some slides/slide-parts/text taken from online materials created by the following people: Greg Grudic, Alexander Vezhnevets, Hal Daumé III
What is Machine Learning? The goal of machine learning is to build computer systems that can adapt and learn from their experience.
Tom Dietterich
A Generic System
[Figure: a generic system with inputs x1, x2, ..., xN, internal (hidden) variables h1, h2, ..., hK, and outputs y1, y2, ..., yM]
When are ML algorithms NOT needed? When the relationships between all system variables (input, output, and hidden) are completely understood! This is NOT the case for almost any real system!
Given training examples
{ (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_P, f(x_P)) }
for some unknown function (system) y = f(x):
Find f(x)
Predict y' = f(x'), where x' is not in the training set
Good example
1000 abstracts chosen randomly out of 20M PubMed entries (abstracts): probably i.i.d., but is it representative?
if annotation is involved, it is always a question of compromises
Toy example:
Q = learning: rank 1: tf = 15, rank 100: tf = 2
Q = overfitting: rank 1: tf = 2, rank 10: tf = 0
Features
The available examples (experience) have to be described to the algorithm in a consumable format
Here: examples are represented as vectors of pre-defined features. E.g. for credit risk assessment, typical features can be: income range, debt load, employment history, real estate properties, criminal record, city of residence, etc.
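As a minimal sketch (not from the slides; the feature names and values below are hypothetical), an example can be turned into a fixed-order feature vector like this:

```python
# Minimal sketch: turning a credit-risk example into a feature vector.
# Feature names and values are hypothetical, for illustration only.
FEATURES = ["income_range", "debt_load", "years_employed",
            "owns_real_estate", "has_criminal_record"]

def to_vector(example):
    """Map a dict of named features to a fixed-order numeric vector."""
    return [float(example[name]) for name in FEATURES]

applicant = {"income_range": 3,        # e.g. ordinal bucket 0-5
             "debt_load": 12000,
             "years_employed": 7,
             "owns_real_estate": 1,    # boolean encoded as 0/1
             "has_criminal_record": 0}

print(to_vector(applicant))            # [3.0, 12000.0, 7.0, 1.0, 0.0]
```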
Section 2
Experimental practice
by now youve learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features; the algorithm then learns a model of the problem based on the examples (usually some kind of optimization is performed in the background); and as a result result, improvement is observed in terms of some performance measure
Model parameters
Two kinds of parameters: those the algorithm learns from the data during training (model parameters), and those the user sets for the training procedure in advance (hyperparameters). Examples of hyperparameters:
the degree of the polynomial to fit in regression, the number/size of hidden layers in a neural network, the number of instances per leaf in a decision tree
we usually do not talk about the learned model parameters, and simply refer to hyperparameters as parameters
Hyperparameters
the fewer the algorithm has, the better
Is Naive Bayes the best, then? It has no parameters! Usually, algorithms with better discriminative power are not parameter-free
hyperparameters are typically set to optimize performance (on a validation set, or through cross-validation)
manual, grid search, simulated annealing, gradient descent, etc.
common pitfall:
selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results as the final performance
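A minimal sketch of the safe setup, assuming scikit-learn and its built-in toy dataset purely for illustration (the slides themselves use Weka): hyperparameters are selected by 10-fold CV on the training portion only, and the reported number comes from a held-out test set.

```python
# Sketch: tune hyperparameters with CV on the training data only,
# then report performance on a held-out test set (avoids the pitfall above).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid={"min_samples_leaf": [2, 4, 8, 16]},
                    cv=10)                                   # 10-fold CV, used for selection only
grid.fit(X_train, y_train)

print("best hyperparameters:", grid.best_params_)
print("CV score (selection): ", grid.best_score_)            # optimistic
print("held-out test score:  ", grid.score(X_test, y_test))  # report this
```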
[Figure: n-fold cross-validation: the data is split into folds X1-X5; in each round one fold serves as the test set while the remaining folds form the training set]
n-fold CV: common practice is to report average performance, deviation, etc.
"No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (Bengio and Grandvalet, 2004): is this bad practice? The problem: the training sets largely overlap, so the test errors are also dependent
this tends to underestimate the real variance of CV (thus e.g. confidence intervals are to be treated with extreme caution). 5x2 CV is a better option: do 2-fold CV, repeat it 5 times and calculate the average; there is less overlap in the training sets
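A sketch of 5x2 CV, again assuming scikit-learn and a toy dataset for illustration:

```python
# Sketch of 5x2 CV: 2-fold CV repeated 5 times with different splits,
# then average the 10 scores (less training-set overlap than 10-fold CV).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedKFold(n_splits=2, n_repeats=5, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print("mean accuracy: %.3f (std %.3f over %d runs)"
      % (scores.mean(), scores.std(), len(scores)))
```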
Folding via natural units of processing for the given task
typically, document boundaries; best practice is to do the folding yourself!
The ML package / CSV representation is not aware of, e.g., document boundaries! (The PPI case.)
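A sketch of document-aware folding, assuming scikit-learn's GroupKFold and hypothetical document ids:

```python
# Sketch: folding along document boundaries so that instances from the
# same document never end up in both the training and the test fold.
# doc_ids is hypothetical; in practice it comes from your corpus reader.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(12, 4)                                  # 12 instances, 4 features
y = np.random.randint(0, 2, size=12)                       # binary labels
doc_ids = np.repeat(["doc1", "doc2", "doc3", "doc4"], 3)   # 3 instances per document

for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=doc_ids):
    assert set(doc_ids[train_idx]).isdisjoint(doc_ids[test_idx])
    # ... train on X[train_idx], evaluate on X[test_idx] ...
```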
do parameter tuning (n.b. selecting/tuning your features is also tuning!), but then normally you need a blind test set (held out from the beginning)
e.g. have a look at shared tasks such as CoNLL: a practical way to learn experimental best practice and to align with predefined standards (you might even benefit from comparative results, etc.)
The ML workflow
Common ML experimentation pipeline
1. define the task
instance, target variable/labels; collect and label/annotate data. Credit risk assessment: one instance = one credit request, label = good/bad credit in the previous year
2. define and collect/calculate features; define train / validation (development) ((test!)) / test (evaluation) data
3. pick a learning algorithm (e.g. decision tree), train the model
train on the training set; optimize/set the model hyperparameters (e.g. number of instances per leaf, use of pruning, ...) according to performance on the validation data
cross validation: use all training data as validation data
4. ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set
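A compact sketch of steps 2-4, assuming scikit-learn and a toy dataset for illustration (the slides use Weka's J48; a decision tree stands in for it here):

```python
# Sketch of the pipeline: train on the training set, pick hyperparameters on
# the validation set, report once on the test (evaluation) set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_model, best_score = None, -1.0
for min_leaf in (2, 4, 8, 16):                          # hyperparameter candidates
    model = DecisionTreeClassifier(min_samples_leaf=min_leaf,
                                   random_state=0).fit(X_train, y_train)
    score = model.score(X_val, y_val)                   # validation accuracy
    if score > best_score:
        best_model, best_score = model, score

print("validation accuracy of chosen model:", best_score)
print("expected accuracy on unseen data:   ", best_model.score(X_test, y_test))
```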
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances    290    96.6667 %
Incorrectly Classified Instances   10     3.3333 %

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances    281    93.6667 %
Incorrectly Classified Instances   19     6.3333 %
Model complexity
Fitting a polynomial regression:

y(x) = \sum_{n=0}^{M} a_n x^n

a = \arg\min_{a} \sum_{j=1}^{P} \Big( y_j - \sum_{n=0}^{M} a_n x_j^n \Big)^2

[Figure: polynomial fits of degree M = 0, 1, 3 and 9 to the same data points, t plotted against x on [0, 1]]
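A sketch of the same experiment, assuming NumPy and noisy samples of sin(2*pi*x) as the unknown function:

```python
# Sketch: fit polynomials of increasing degree M to noisy samples of sin(2*pi*x);
# low M underfits, very high M overfits the 10 training points.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # noisy targets

x_test = np.linspace(0.0, 1.0, 100)
y_true = np.sin(2 * np.pi * x_test)

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, y, deg=M)            # least-squares fit of degree M
    y_pred = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    print("M=%d  test RMSE=%.3f" % (M, rmse))
```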
Data size and model complexity Important concept: discriminative power of the algorithm
linear vs. nonlinear models; some theoretical aspects: a neural network with one hidden layer and an unlimited number of hidden nodes can approximate any smooth function/surface arbitrarily well
Underfitting: the model is not capable of learning the (complex) patterns in the training set. Reasons for underfitting and overfitting:
lack of discriminative power; small sample size; noise in the data (labels or features); the generalization ability of the algorithm has to be chosen w.r.t. the sample size
Predictions
Confusion matrix:
TP: p classified as p
FP: n classified as p
TN: n classified as n
FN: p classified as n
Good predictions: TP + TN
Errors: FP (false alarm) + FN (miss)
Evaluation measures
Accuracy
The rate of correct predictions made by the model over a data set (cf. coverage): (TP+TN) / (TP+FN+FP+TN)
Error rate
The rate of incorrect predictions made by the model over a data set: (FP+FN) / (TP+FN+FP+TN)
[Root]?[Mean|Absolute][Squared]?Error
The difference between the predicted and actual values, e.g.
RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \big( f(x_i) - y_i \big)^2 }
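A small sketch computing accuracy, error rate and RMSE from hypothetical predictions:

```python
# Sketch: accuracy, error rate and RMSE from (hypothetical) predictions.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])      # hypothetical gold labels
y_pred = np.array([1, 0, 0, 1, 1, 1])      # hypothetical predictions
accuracy = np.mean(y_pred == y_true)       # (TP+TN) / all
error_rate = 1.0 - accuracy                # (FP+FN) / all
print(accuracy, error_rate)                # 0.667, 0.333

f_x   = np.array([2.5, 0.0, 2.1, 7.8])     # regression predictions f(x)
y_reg = np.array([3.0, -0.5, 2.0, 7.0])    # true values
rmse = np.sqrt(np.mean((f_x - y_reg) ** 2))
print(rmse)                                # ~0.54
```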
Evaluation measures
Precision
Fraction of correctly predicted positives among all predicted positives: TP / (TP+FP)
Recall
Fraction of correctly predicted positives among all actual positives: TP / (TP+FN)
F measure
weighted harmonic mean of Precision and Recall (usually equally weighted, β = 1)
Only makes sense for a subset of classes (usually measured for a single class)
For all classes, it equals the accuracy
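A small sketch computing precision, recall and F from confusion counts (the two calls reproduce the per-class F values of the NER example on the following slides):

```python
# Sketch: precision, recall and F from confusion-matrix counts of one class.
def prf(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

print(prf(tp=1, fp=1, fn=0))   # PER class of example tagging 1 below: F = 0.67
print(prf(tp=1, fp=1, fn=1))   # ORG class of example tagging 1 below: F = 0.5
```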
Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
A sequence of tokens with the same label is treated as a single instance:
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Why? We need complete phrases to be identified correctly.
How? With an external evaluation script, e.g. conlleval for NER.
Example tagging:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
Multiple penalty:
3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5
Loss types
1. The real loss function given to us by the world. Typically involves notions of money saved, time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance assessments, etc. We can perform these evaluations, but they are slow and costly. They require humans in the loop.
3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), F-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.
Be careful what you are optimizing! Some measures (typically of Type 4) become dysfunctional when you are optimizing them!
phrase P/R/F, e.g. in NER; readability measures
Evaluation measures
Sequence P/R/F, e.g. in Named Entity Recognition, Chunking, etc.
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Example tagging 1:
John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5
Example tagging 2:
John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
3 positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
0 FPs
1 FN: Johns Hopkins University (ORG)
F(PER) = 1.0, F(ORG) = 0.67
Optimizing phrase-F can encourage / prefer systems that do not mark entities!
Most likely, this is bad!!
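A sketch of phrase-level P/R/F for the example above; spans are hypothetical (start, end, label) token offsets rather than the conlleval format actually used in practice:

```python
# Sketch: phrase-level P/R/F for the NER example above.  Phrases are
# (start, end, label) spans; a prediction counts as a TP only if the exact
# span and label match the gold annotation.
def phrase_prf(gold, pred, label):
    gold_l = {s for s in gold if s[2] == label}
    pred_l = {s for s in pred if s[2] == label}
    tp = len(gold_l & pred_l)
    fp = len(pred_l - gold_l)
    fn = len(gold_l - pred_l)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {(0, 1, "PER"), (4, 7, "ORG"), (9, 10, "ORG")}   # John, Johns Hopkins University, IBM
tagging1 = {(0, 1, "PER"), (4, 6, "PER"), (6, 7, "ORG"), (9, 10, "ORG")}
tagging2 = {(0, 1, "PER"), (9, 10, "ORG")}
print(phrase_prf(gold, tagging1, "ORG"))   # F(ORG) = 0.5
print(phrase_prf(gold, tagging2, "ORG"))   # F(ORG) = 0.67
```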
ROC curve
ROC: Receiver Operating Characteristic curve
A curve that depicts the relation between recall (sensitivity) and the false positive rate (1 - specificity)
[Figure: ROC curve: recall plotted against the false positive rate FP / (FP+TN), with the best-case and worst-case curves indicated]
Evaluation measures
Area under the ROC curve (AUC). As you vary the decision threshold, you can plot the recall vs. the false positive rate. The area under the curve measures how accurately your model separates positives from negatives.
perfect ranking: AUC = 1.0; random decision: AUC = 0.5
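A sketch, assuming scikit-learn's metrics module, of computing AUC and the ROC points from classifier scores:

```python
# Sketch: ROC AUC and the ROC curve points from (hypothetical) scores.
from sklearn.metrics import roc_auc_score, roc_curve

y_true  = [1, 1, 0, 1, 0, 0, 1, 0]                      # hypothetical gold labels
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]      # model confidence for class 1

print("AUC =", roc_auc_score(y_true, y_score))
fpr, tpr, thresholds = roc_curve(y_true, y_score)       # points of the ROC curve
print(list(zip(fpr, tpr)))
```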
MAP
The average of the precisions computed at the position of each positive in the ranked list (P = 0 for positives not ranked at all)
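A sketch of average precision for a single ranked list (MAP averages this value over queries); the relevance vector is hypothetical:

```python
# Sketch: average precision for one ranked list; MAP is the mean of this
# value over all queries.  rel marks which ranked items are positives.
def average_precision(rel, n_positives):
    hits, precisions = 0, []
    for i, r in enumerate(rel, start=1):      # i = rank, 1-based
        if r:
            hits += 1
            precisions.append(hits / i)       # precision at this positive
    # positives never retrieved contribute P = 0
    return sum(precisions) / n_positives if n_positives else 0.0

ranked_relevance = [1, 0, 1, 0, 0, 1]         # hypothetical ranking of 6 results
print(average_precision(ranked_relevance, n_positives=4))   # one positive not ranked
```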
NDCG
For graded relevance / ranking. Highly relevant documents appearing lower in a search result list are penalized: the graded relevance value is reduced logarithmically, proportionally to the position of the result.
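A sketch of NDCG with a logarithmic position discount, using linear gains (one common formulation; other variants use 2^relevance - 1):

```python
# Sketch: NDCG with the log2 position discount described above.
import math

def dcg(gains):
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))   # best possible ordering
    return dcg(gains) / ideal if ideal else 0.0

graded = [3, 2, 3, 0, 1, 2]                    # hypothetical relevance grades, ranked order
print(ndcg(graded))
```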
Learning curve
Measures how the accuracy / error changes as a function of the training sample size.
Smaller sample:
worse accuracy
more likely bias in the estimate (is the sample representative?)
higher variance in the estimate
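A sketch of a learning curve, assuming scikit-learn's learning_curve helper and a toy dataset:

```python
# Sketch: a learning curve - accuracy as a function of the amount of
# training data, estimated with cross-validation.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, score in zip(sizes, test_scores.mean(axis=1)):
    print("%4d training examples -> CV accuracy %.3f" % (n, score))
```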
Data or Algorithm?
Compare the accuracy of various machine learning algorithms with a varying amount of training data (Banko & Brill, 2001):
Winnow, perceptron, naive Bayes, memory-based learner
Features:
bag of words: words within a window of the target word; collocations containing specific words and/or parts of speech
Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)
ML workflow
1. problem definition
2. feature engineering; experimental setup (train, validation, test)
3. selection of learning algorithm, (hyper)parameter tuning, training a final model
4. predict unseen examples & fill tables / draw figures for the paper - test