Agenda
Classification
Problems: Pima Indians diabetes, handwritten digit recognition
Algorithms: Neural Networks, Decision Trees, Support Vector Machines
Evaluation criteria
Using Experimenter for batch experiments
Building committee machines
Mini-project
(C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Machine Classification
Sorting fish on a conveyor belt: salmon vs. sea bass. Set up a camera, take images, and use physical differences (length, lightness, width, fin shape, mouth position, etc.) to discriminate between the two.
Concept of Classification
<Notations>
n = number of training examples
x = input variables (features or attributes)
y = output variable / target variable
(x, y) = a training example
The i-th training example = (x(i), y(i))
[Diagram] Training Set → Learning Algorithm → hypothesis f
Input features (e.g., pixels in a picture of a handwritten digit) → f → output / prediction (e.g., 3 or 8)
f(x) = w0 + w1*x1 + ... + wn*xn
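This linear hypothesis can be sketched in a few lines of Python (an illustrative sketch, not Weka code; the weight and input values below are made up):

```python
def predict(weights, x):
    """Linear hypothesis f(x) = w0 + w1*x1 + ... + wn*xn.
    weights[0] is the bias w0; weights[1:] pair with the inputs x1..xn."""
    return weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))

# Example: f(x) = 1 + 2*x1 + 3*x2 evaluated at x = (4, 5)
print(predict([1, 2, 3], [4, 5]))  # -> 24
```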
Terminology
Features or Attributes
Features are the individual measurable properties of the phenomenon being observed. Choosing discriminating and independent features is key to the success of any pattern recognition algorithm.
Training set: a set of examples used for learning, i.e., to fit the parameters (weights) of the classifier.
Test set: a set of examples used only to assess the performance (generalization) of a fully specified classifier.
Introduction to Weka
Weka: Data Mining Software in Java
Weka is a collection of machine learning algorithms for data mining and machine learning tasks. What can you do with Weka? Data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization. Weka is open-source software issued under the GNU General Public License. How to get it? http://www.cs.waikato.ac.nz/ml/weka/ or just search for Weka on Google.
Pima Indians have the highest prevalence of diabetes in the world. We will build classification models that diagnose whether a patient shows signs of diabetes. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
768 instances
8 attributes: age, number of times pregnant, results of medical tests/analyses; all numeric (integer or real-valued). A discretized set will also be provided.
Class value = 1 (positive example), interpreted as "tested positive for diabetes": 268 instances
Class value = 0 (negative example): 500 instances
The MNIST database of handwritten digits contains digits written by office workers and students We will build a recognition model based on classifiers with the reduced set of MNIST http://yann.lecun.com/exdb/mnist/
Attributes: pixel values in gray level in a 28x28 image; 784 attributes (all integers in 0~255)
Full MNIST set — training set: 60,000 examples; test set: 10,000 examples
For our practice, a reduced set with 800 examples is used
Class value: 0~9, representing the digits 0 to 9
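The 784 attributes enumerate the 28x28 pixel grid; assuming row-major order (worth verifying against the actual ARFF attribute list), the mapping between attribute index and pixel coordinates is:

```python
# Map between a 784-long attribute vector and 28x28 pixel coordinates.
# Row-major order is assumed here; check the actual ARFF attribute order.
def attr_index(row, col, width=28):
    return row * width + col

def pixel_coords(index, width=28):
    return divmod(index, width)

print(attr_index(1, 3))   # -> 31
print(pixel_coords(31))   # -> (1, 3)
```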
Multilayer Perceptron
In Weka, Classifiers-functions-MultilayerPerceptron
Reviews on MLPs
In Weka, classifiers-trees-J48
In Weka, classifiers-functions-SMO
Practice: Basic
MultilayerPerceptron vs. J48 vs. SVM Checking the trained model (structure & parameter)
Tuning parameters to get better models Understanding Test options & Classifier output in Weka
Building committee machines using meta algorithms for classification
Preprocessing / data manipulation by applying Filters
Batch experiments with Experimenter
Designing & running a batch process with KnowledgeFlow
Advanced
Training/test pair: mnist_reduced_training.arff, mnist_reduced_test.arff (800 & 200 instances, respectively)
Total set (1,000 instances): mnist_reduced_total.arff — can be used for cross-validation
Header
1. Load a file that contains the training data by clicking the Open file button (ARFF and CSV formats are readable).
2. Click the Classify tab, click the Choose button, and select weka - functions - MultilayerPerceptron.
3. Click MultilayerPerceptron to set its parameters, set the Test options, and click Start to begin learning.
Tuning an MLP requires considerable experience and many trials; you may get worse results if you are unlucky.
Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed
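The roles of learningRate and momentum can be illustrated with a single gradient-descent weight update (a generic sketch of the update rule, not Weka's internal code; the default values 0.3 and 0.2 mirror Weka's MLP defaults, and the gradient value below is made up):

```python
def update_weight(w, grad, prev_delta, learning_rate=0.3, momentum=0.2):
    """One gradient-descent step with momentum:
    delta = -learning_rate * grad + momentum * prev_delta.
    learning_rate scales the step; momentum carries over part of the last step."""
    delta = -learning_rate * grad + momentum * prev_delta
    return w + delta, delta

w, delta = 0.5, 0.0
for _ in range(3):               # three updates on a constant gradient of 1.0
    w, delta = update_weight(w, grad=1.0, prev_delta=delta)
print(round(w, 4))               # -> -0.532
```

With momentum > 0 each step grows while the gradient keeps the same sign, which speeds up descent along consistent directions.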
J48
Main parameters: unpruned, numFolds, minNumObj
Many parameters control the size of the resulting tree, e.g., confidenceFactor and the pruning options
SMO (SVM)
How to Evaluate the Performance? (1/2)
Usually, build a confusion matrix from the given data.
Evaluation metrics: accuracy (percent correct), precision, recall, and many others such as F-measure and the Kappa score.
Confusion matrix:

                       Predicted positive   Predicted negative
All with disease             TP                   FN
All without disease          FP                   TN

Accuracy  = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)

As recall increases, precision tends to decrease (and vice versa).
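The three metrics can be computed directly from the confusion-matrix counts; a minimal sketch (the counts below are hypothetical, not from the Pima experiments):

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Of the examples predicted positive, the fraction that truly are."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of the truly positive examples, the fraction found by the classifier."""
    return tp / (tp + fn)

# Hypothetical counts for a diabetes classifier
tp, fp, tn, fn = 40, 10, 35, 15
print(accuracy(tp, fp, tn, fn))  # -> 0.75
print(precision(tp, fp))         # -> 0.8
print(recall(tp, fn))            # 40/55, about 0.727
```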
The data set is randomly divided into k subsets. One of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. This is repeated k times so that each subset serves once as the test set, and the k error estimates are averaged:

Error = (1/k) * sum over i = 1..k of Error_i

[Figure: 6-fold cross-validation — the 768 Pima instances are split into six folds D1~D6 of 128 instances each; in each of the six runs one fold is held out for testing and the other five are used for training.]
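The 6-fold split can be sketched as follows (a generic illustration of the fold assignment, not Weka's implementation; Weka also stratifies folds by class, which this sketch omits):

```python
import random

def k_fold_indices(n, k, seed=1):
    """Shuffle indices 0..n-1 and split them into k equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    size = n // k
    return [idx[i * size:(i + 1) * size] for i in range(k)]

folds = k_fold_indices(768, 6)          # the Pima set: 6 folds of 128
for test_fold in folds:
    train = [j for f in folds if f is not test_fold for j in f]
    # ...train on `train`, evaluate on `test_fold`, then average the 6 errors

print([len(f) for f in folds])  # -> [128, 128, 128, 128, 128, 128]
```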
Filters — attribute filters (selection, discretization) and instance filters (re-sampling, selecting specified folds)
Select the Run tab and click Start. If it finishes successfully, click the Analyse tab and see the summary.
References
Weka Wiki: http://weka.wikispaces.com/
Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Textbooks
Tom Mitchell (1997). Machine Learning. McGraw-Hill.
Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
Richard O. Duda, Peter E. Hart, David G. Stork (2001). Pattern Classification (2nd ed.). Wiley, New York.
Mini-project
Make an arff file
1. Make a CSV file with MS Excel.
2. Open the CSV file with Weka.
3. Save the CSV file as an ARFF file.
4. With any text editor, modify the class attribute's type to a discrete (nominal) value set.
5. Save the ARFF file.
6. Reload the ARFF file with Weka.
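After these steps, the ARFF header should declare the class as a nominal attribute, i.e., a value set in braces rather than `numeric`. A minimal example (the relation and attribute names here are illustrative, not from the course data sets):

```
@relation my_dataset

@attribute feature1 numeric
@attribute feature2 numeric
@attribute class {0,1}

@data
5.1,3.5,0
4.9,3.0,1
```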
Mini-project
1. Load a file that contains the training data by clicking the Open file button (ARFF and CSV formats are readable).
2. Click the Classify tab, click the Choose button, and select weka - functions - MultilayerPerceptron.
3. Click MultilayerPerceptron to set its parameters, set the Test options, and click Start to begin learning.
Mini-project
Parameter setting of MLPs
Mini-project
Make an MLP by yourself with the GUI option.
You can construct the hidden layers yourself. Clicking the More button gives a detailed explanation of the GUI.
Mini-project
J48
Mini-project
Experiments
Experiments
Mini-project
Classification problem with Weka
Data sets: 3 different data sets. You should include at least one set from the UCI ML repository (http://archive.ics.uci.edu/ml/) and the MNIST set.
Classification methods:
MLP: iterations, learning rate, momentum, # of hidden nodes
SVM: will be addressed next time
J48: default options only
Mini term-project
Contents in the report
Compare the results of various parameter settings for MLPs.
Find the optimal parameter setting for the MLP and report the classification performance with that setting on all data sets.
Compare the best MLP result to the J48 result on the three data sets (classification performance and time).
Include discussions.
At most four A4 pages.
Due date: 24th Nov. 2011 (302-314-1)