
Classification using Weka


(Brain, Computation, and Neural Learning)
Jung-Woo Ha

Agenda
Classification
  General concept and terminology

Introduction to Weka

Classification practice with Weka
  Problems: Pima Indians diabetes, handwritten digit recognition
  Algorithms: neural networks, decision trees, support vector machines
  Evaluation criteria
  Using Experimenter for batch experiments
  Building committee machines

Mini-project
(C) 2010, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Machine Classification

Example: sorting fish on a conveyor belt, salmon vs. sea bass.
Set up a camera, take images, and use physical differences (length, lightness, width, fin shape, mouth position, etc.) to separate the two classes.


Concept of Classification
<Notations>
  n = number of training examples
  x = input variables (features or attributes)
  y = output variable / target variable
  (x, y) = a training example
  (x(i), y(i)) = the i-th training example

Training set -> Learning algorithm -> hypothesis f
The input features x (e.g. the pixels of a handwritten-digit image) are fed to the hypothesis f, which outputs a prediction y (e.g. "3" or "8").

Linear hypothesis: f(x) = w_0 + w_1 x_1 + ... + w_n x_n
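For concreteness, here is a tiny made-up example of such a linear hypothesis with two features (the weights and inputs below are purely illustrative, not taken from any dataset):

  f(x) = -1.0 + 0.8 x_1 + 0.3 x_2
  For x = (2, 1):     f(x) = -1.0 + 1.6 + 0.3  =  0.9  > 0  ->  predict the positive class
  For x = (0.5, 0.5): f(x) = -1.0 + 0.4 + 0.15 = -0.45 < 0  ->  predict the negative class

In practice the learning algorithm chooses the weights w_0, ..., w_n from the training set.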


Terminology
Features or Attributes

Features are the individual measurable properties of the phenomena being observed.
Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification.

Training set / Test set

Training set: a set of examples used for learning, that is, to fit the parameters (i.e., the weights) of the classifier.
Test set: a set of examples used only to assess the performance (generalization) of a fully specified classifier.


Introduction to Weka
Weka: Data Mining Software in Java

Weka is a collection of machine learning algorithms for data mining and machine learning tasks.
What can you do with Weka? Data pre-processing, feature selection, classification, regression, clustering, association rules, and visualization.
Weka is open source software issued under the GNU General Public License.
How to get it: http://www.cs.waikato.ac.nz/ml/weka/ or just type "Weka" into Google.
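Everything in the Explorer GUI is also available from the Java API. As a minimal sketch, assuming weka.jar is on the classpath and using the pima_diabetes.arff file from this practice, a dataset can be loaded and inspected like this:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadData {
    public static void main(String[] args) throws Exception {
        // DataSource picks a loader based on the file extension (ARFF, CSV, ...)
        Instances data = DataSource.read("pima_diabetes.arff");
        // By convention the last attribute is the class attribute
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println("Instances:  " + data.numInstances());
        System.out.println("Attributes: " + data.numAttributes());
        System.out.println("Class:      " + data.classAttribute().name());
    }
}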


Dataset #1: Pima Indians Diabetes


Description

Pima Indians have the highest prevalence of diabetes in the world.
We will build classification models that diagnose whether a patient shows signs of diabetes.
http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes

Configuration of the data set


768 instances
8 attributes: age, number of times pregnant, results of medical tests/analyses; all numeric (integer or real-valued)
A discretized version of the set will also be provided
Class value = 1 (positive example), interpreted as "tested positive for diabetes": 268 instances
Class value = 0 (negative example): 500 instances

Dataset #2: Handwritten Digits (MNIST)


Description

The MNIST database of handwritten digits contains digits written by office workers and students.
We will build a recognition model based on classifiers, using a reduced subset of MNIST.
http://yann.lecun.com/exdb/mnist/

Configuration of the data set

Attributes: gray-level pixel values of a 28x28 image; 784 attributes (all integers in 0~255)
Full MNIST set: 60,000 training examples, 10,000 test examples
For our practice, a reduced set with 800 examples is used
Class value: 0~9, representing the digits 0 to 9

Artificial Neural Networks: MLP (Multilayer Perceptron)

In Weka, Classifiers-functions-MultilayerPerceptron
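The same classifier can also be trained from the Java API. A minimal sketch, assuming the reduced MNIST training/test ARFF files used in this practice; the hidden-layer size and the other values are only illustrative (0.3 / 0.2 / 500 are Weka's defaults):

import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("mnist_reduced_training.arff");
        Instances test  = DataSource.read("mnist_reduced_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("20");     // one hidden layer with 20 nodes (illustrative choice)
        mlp.setLearningRate(0.3);
        mlp.setMomentum(0.2);
        mlp.setTrainingTime(500);      // number of training epochs
        mlp.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(mlp, test);
        System.out.println(eval.toSummaryString());
    }
}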


Artificial Neural Networks: Review of the BP Algorithm

Four main parameters for learning MLPs:
  The number of iterations (epochs)
  The number of hidden layers and hidden nodes
  Learning rate
  Momentum


Review of MLPs

Expressive power of MLPs


Decision Trees: J48 (Java implementation of C4.5)

In Weka, classifiers-trees-J48
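As with the MLP, J48 can be used directly from the Java API. A minimal sketch on the Pima diabetes ARFF used in this practice; printing the model shows the learned decision tree:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Example {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pima_diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);   // pruning confidence (Weka's default)
        tree.setMinNumObj(2);              // minimum instances per leaf (Weka's default)
        tree.buildClassifier(data);
        System.out.println(tree);          // prints the learned tree

        // 10-fold cross-validation on the same data
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}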


Support Vector Machines: SMO (Sequential Minimal Optimization) for training SVMs

In Weka, classifiers-functions-SMO
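A minimal sketch of using SMO from the Java API, again on the Pima ARFF; the kernel and the complexity constant C are the main things to vary (the values below are illustrative):

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SmoExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pima_diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();
        svm.setC(1.0);                 // complexity parameter

        PolyKernel poly = new PolyKernel();
        poly.setExponent(1.0);         // exponent 1.0 gives a linear kernel
        svm.setKernel(poly);
        // Alternative: an RBF kernel via weka.classifiers.functions.supportVector.RBFKernel
        // (its main parameter is gamma, set with setGamma).

        svm.buildClassifier(data);
        System.out.println(svm);       // prints the learned SVM model(s)
    }
}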


Practice

Basic
  Comparing the performance of algorithms: MultilayerPerceptron vs. J48 vs. SVM
  Checking the trained model (structure & parameters)
  Tuning parameters to get better models
  Understanding Test options & Classifier output in Weka

Advanced
  Building committee machines using meta-algorithms for classification
  Preprocessing / data manipulation by applying Filters
  Batch experiments with Experimenter
  Designing & running a batch process with KnowledgeFlow

Dataset for Practice with Weka


Pima Indians diabetes
  Original data: pima_diabetes.arff
  Discretized data: pima_diabetes_supervised_discretized.arff

Handwritten digits (MNIST)
  Training/test pair: mnist_reduced_training.arff and mnist_reduced_test.arff (800 and 200 instances, respectively)
  Total set (1,000 instances): mnist_reduced_total.arff; can be used for cross-validation


Data format for Weka (.ARFF)


Header:
  @relation heart-disease-simplified
  @attribute age numeric
  @attribute sex { female, male }
  @attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
  @attribute cholesterol numeric
  @attribute exercise_induced_angina { no, yes }
  @attribute class { present, not_present }

Data (CSV format):
  @data
  63,male,typ_angina,233,no,not_present
  67,male,asympt,286,yes,present
  67,male,asympt,229,yes,present
  38,female,non_anginal,?,no,not_present

Note: You can easily generate an ARFF file by adding a header to an ordinary CSV text file.

Neural Networks in Weka


In the Explorer:
1. Load a file that contains the training data by clicking the Open file button (ARFF and CSV formats are readable).
2. Click the Classify tab.
3. Click the Choose button and select weka > functions > MultilayerPerceptron.
4. Click MultilayerPerceptron to set the parameters of the MLP.
5. Set the parameters under Test options.
6. Click Start to begin learning.


Some Notes on the Parameter Setting


Parameter setting is like car tuning: it takes a lot of experience or many trials, and you may get worse results if you are unlucky.

Multilayer Perceptron (MLP)
  Main parameters for learning: hiddenLayers, learningRate, momentum, trainingTime (epochs), seed

J48
  Main parameters: unpruned, numFolds, minNumObj
  Many parameters control the size of the resulting tree, e.g. confidenceFactor and the pruning options

SMO (SVM)
  Main parameters: c (the complexity parameter), the kernel, and the kernel parameters
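In the GUI, these parameters also appear as a command-line option string at the top of each classifier's configuration dialog. A sketch of setting them programmatically through option strings; the values are only illustrative, and the flags shown are believed to match the usual Weka options, but verify them against the More/Capabilities help in your version:

import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.J48;
import weka.core.Utils;

public class OptionExample {
    public static void main(String[] args) throws Exception {
        // MLP: -L learning rate, -M momentum, -N epochs, -H hidden-layer spec, -S seed
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setOptions(Utils.splitOptions("-L 0.3 -M 0.2 -N 500 -H a -S 0"));

        // J48: -C pruning confidence, -M minimum instances per leaf (-U for unpruned)
        J48 tree = new J48();
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 2"));

        // SMO: -C complexity constant
        SMO svm = new SMO();
        svm.setOptions(Utils.splitOptions("-C 1.0"));

        // The current option string of any classifier can be inspected:
        System.out.println(Utils.joinOptions(mlp.getOptions()));
    }
}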



Test Options and Classifier Output

Setting the data set used for evaluation

There are various metrics for evaluation


How to Evaluate the Performance? (1/2)

Usually, a confusion matrix is built from the given data.

Evaluation metrics
  Accuracy (percent correct)
  Precision
  Recall
  Many other metrics: F-measure, Kappa statistic, etc.

For fair evaluation, a cross-validation scheme is used.



How to Evaluate the Performance? (2/2)

Confusion matrix:

                        Real positive            Real negative
  Predicted positive    TP                       FP                        all with a positive test
  Predicted negative    FN                       TN                        all with a negative test
                        all with the disease     all without the disease   everyone

Accuracy  = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)

There is a trade-off: as recall goes up, precision tends to go down, and conversely, as precision goes up, recall tends to go down.
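Weka's Evaluation class reports all of these after a model has been evaluated. A sketch, assuming a classifier and test set prepared as in the earlier examples; note that precision, recall, and F-measure are reported per class (class index 0 here):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetricsExample {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("mnist_reduced_training.arff");
        Instances test  = DataSource.read("mnist_reduced_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println(eval.toMatrixString());              // confusion matrix
        System.out.println("Accuracy:  " + eval.pctCorrect() + " %");
        System.out.println("Precision: " + eval.precision(0));  // for class index 0
        System.out.println("Recall:    " + eval.recall(0));
        System.out.println("F-measure: " + eval.fMeasure(0));
    }
}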


Evaluation Method: K-fold Cross-Validation

The data set is randomly divided into k subsets. One of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. This is repeated k times, so that each subset is used as the test set exactly once, and the per-fold errors are averaged:

  Error = (1/k) * sum_{i=1}^{k} Error_i

Figure: 6-fold cross-validation on 768 examples, split into folds D1..D6 of 128 examples each; in each of the six rounds a different fold is held out as the test set and the remaining five folds form the training set.
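In the Weka API the folds can be generated explicitly with Instances.trainCV/testCV, which mirrors the figure above. A minimal sketch, assuming mnist_reduced_total.arff (the 1,000-instance set mentioned earlier):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("mnist_reduced_total.arff");
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int k = 10;
        double errorSum = 0;
        for (int i = 0; i < k; i++) {
            Instances train = data.trainCV(k, i);   // k-1 folds for training
            Instances test  = data.testCV(k, i);    // 1 fold for testing

            J48 tree = new J48();
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            errorSum += eval.errorRate();
        }
        // Error = (1/k) * sum of the per-fold errors
        System.out.println("Mean CV error rate: " + errorSum / k);
    }
}

Equivalently, Evaluation.crossValidateModel(classifier, data, k, new Random(1)) performs the same loop in a single call; this is what the Cross-validation test option in the Explorer does.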

Committee Machine in Weka


Using committee machine / ensemble learning in Weka

Boosting: AdaBoostM1
Voting committee: Vote
Bagging
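These meta-classifiers wrap a base classifier. A sketch of boosting J48 with AdaBoostM1 through the API; the choice of base learner and number of iterations is only illustrative:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostingExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pima_diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new J48());     // base (weak) learner
        boost.setNumIterations(10);         // number of boosting rounds

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(boost, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}

Bagging (weka.classifiers.meta.Bagging) is used the same way, while Vote takes an array of different classifiers via setClassifiers.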


Data Manipulation with Filter in Weka


Attribute filters
  Attribute selection, discretization

Instance filters
  Re-sampling, selecting specified folds
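Filters can also be applied programmatically with Filter.useFilter. A sketch using supervised discretization on the Pima data, which is presumably how the discretized ARFF provided for the practice could be produced:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class FilterExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("pima_diabetes.arff");
        data.setClassIndex(data.numAttributes() - 1);   // supervised filters need the class set

        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);                // must be called before useFilter
        Instances discretized = Filter.useFilter(data, discretize);

        System.out.println(discretized.numAttributes() + " attributes after discretization");
    }
}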


Using Experimenter in Weka


Tool for batch experiments:
1. Click New.
2. Set the experiment type and iteration control.
3. Set the datasets and algorithms.
4. Select the Run tab and click Start.
5. If it has finished successfully, click the Analyse tab and see the summary.


KnowledgeFlow for Analysis Process Design

(similar to the process flow diagrams of SAS Enterprise Miner)



References
Weka Wiki: http://weka.wikispaces.com/
Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html

Textbooks
  Tom Mitchell (1997). Machine Learning. McGraw-Hill.
  Christopher M. Bishop (2006). Pattern Recognition and Machine Learning. Springer.
  Richard O. Duda, Peter E. Hart, David G. Stork (2001). Pattern Classification (2nd edition). Wiley, New York.


Mini-project
Make an ARFF file:
1. Make a CSV file with MS Excel.
2. Open the CSV file with Weka.
3. Save the CSV file as an ARFF file.
4. Modify the property values of the class attribute to a discrete value set with any text editor program.
5. Save the ARFF file.
6. Reload the ARFF file with Weka.
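The same CSV-to-ARFF conversion can also be scripted with Weka's converters. A sketch, with hypothetical file names:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load the CSV file (hypothetical name); the first row is taken as the attribute names
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("mydata.csv"));
        Instances data = loader.getDataSet();

        // Save it as an ARFF file
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("mydata.arff"));
        saver.writeBatch();
    }
}

If a numeric column should become the discrete class, it can then be edited as in step 4 above, or converted with the NumericToNominal filter.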


Mini-project
In the Explorer:
1. Load a file that contains the training data by clicking the Open file button (ARFF and CSV formats are readable).
2. Click the Classify tab.
3. Click the Choose button and select weka > functions > MultilayerPerceptron.
4. Click MultilayerPerceptron to set the parameters of the MLP.
5. Set the parameters under Test options.
6. Click Start to begin learning.


Mini-project
Parameter setting of MLPs

More explanations on the parameters


Test Options and Classifier Output

Setting the data set used for evaluation

There are various metrics for evaluation


Mini-project
Build an MLP yourself with the GUI option

With the GUI option enabled you can construct the hidden layers yourself. Clicking the More button gives a detailed explanation of the GUI option.


Mini-project
J48


Mini-project
Experiments

Convenient comparison across datasets and methods


Experiments


Mini-project
Classification problem with Weka

Data sets
  3 different data sets
  You should include at least one set from the UCI ML repository (http://archive.ics.uci.edu/ml/) and the MNIST set

Classification methods
  MLP: iterations, learning rate, momentum, number of hidden nodes
  SVM: will be addressed next time
  J48: default options only

Mini term-project
Contents of the report


You should compare the results of various parameter settings for MLPs.
Find the optimal parameter setting for the MLP and report the classification performance with that setting on all data sets.
Compare the best MLP result with the result of J48 on the three data sets (classification performance and time).
Include discussions.
At most four A4 pages.
Due date: 24 Nov. 2011 (302-314-1).