
Term Paper on WEKA

This paper describes two data analysis techniques (clustering and classification) using WEKA.

Amrit Kumar (10BM60010)


4/14/2012

WEKA

WEKA is a product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997. It is distributed under the GNU General Public License (GPL). The software is written in Java and provides a GUI for interacting with data files and producing visual results (think tables and curves).
Install WEKA
Figure 1. WEKA startup screen

When WEKA starts, the GUI Chooser pops up and offers four ways of working with WEKA and your data. We will use only the Explorer option to walk through the two examples described below.

Clustering

Car manufacturers need to be able to appraise the current market to determine the likely competition for their vehicles. If cars can be grouped according to available data, this task can be largely automated using cluster analysis.
Information on various makes and models of motor vehicles is contained in car_sales.sav. Our objective is to group the automobiles according to their prices and physical properties. This example illustrates k-means clustering with WEKA. The WEKA SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between instances and cluster centroids.
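Concretely, for an instance x and a centroid c described by n numeric attributes, this distance is

    d(x, c) = sqrt( (x1 - c1)^2 + (x2 - c2)^2 + ... + (xn - cn)^2 )

Note that WEKA's EuclideanDistance normalizes numeric attribute values by default, so attributes measured on large scales (such as price) do not automatically dominate the distance.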

Steps to perform clustering

1) Load the CSV file into WEKA: click "Open file" and choose the input file car_sales_weka_refined.csv.

2) To perform clustering, select the "Cluster" tab in the Explorer and click the "Choose" button. This brings up a drop-down list of available clustering algorithms; in this case we select "SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop-up window shown in the figure below, where the clustering parameters can be edited.

3) In the pop-up window we enter 6 as the number of clusters (instead of the default value of 2) and leave the value of "seed" as is. Click OK. In the "Cluster mode" panel, the "Use training set" option is selected. We restrict clustering to a limited number of attributes by clicking the "Ignore attributes" button; in this case the attributes used are price, engine_s, horsepow, wheelbase, width, length, curb_wgt, and fuel_cap.

4) Now we click "Start". When the run finishes, we can right-click the result set in the "Result list" panel and view the results of clustering in a separate window.

5) The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to each cluster. Cluster centroids are the mean vectors for each cluster, so they can be used to characterize the clusters. For example, the centroid for cluster 1 shows that this is a segment of cars with the following characteristics:

Price in thousands: 42.0 (about $42,000)
Engine size: 3.5
Horsepower: 200
Wheelbase: 113.0
Width: 74.4
Length: 176.0
Curb weight: 3.85
Fuel capacity: 18.0

After clustering, then, we know that 19 of the 157 cars (12%) fall in this cluster. Hence, if a new company enters the market wishing to manufacture a car with similar specifications, it will need to compete with these 19 car models. The snapshot below gives the output of the clustering run.

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 6 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     car_sales_weka_refined
Instances:    157
Attributes:   14
              price
              engine_s
              horsepow
              wheelbas
              width
              length
              curb_wgt
              fuel_cap
              Ignored:
              manufact
              model
              sales
              resale
              type
              mpg
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 4
Within cluster sum of squared errors: 1078.0
Missing values globally replaced with mean/mode

Cluster centroids:
                                   Cluster#
Attribute    Full Data         0        1        2        3        4        5
               (157)         (89)     (19)     (18)     (12)     (16)      (3)
==============================================================================
price                                42.0    16.535   38.9     21.975   46.225
engine_s         2.0          2.0     3.5      2.4      4.0      3.0      5.7
horsepow       150.0        115.0   200.0    150.0    190.0    210.0    255.0
wheelbas       112.2        112.2   113.0    107.0    112.2    108.0    117.5
width           66.7         66.7    74.4     68.5     69.4     73.0     77.0
length         186.3        192.0   176.0    186.3    194.8    186.0    201.2
curb_wgt         2.998        2.998   3.85     3.051    3.876    3.368    5.572
fuel_cap        18.5         18.5    18.0     15.0     20.0     16.0     30.0

Time taken to build model (full training data) : 0.01 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       89 ( 57%)
1       19 ( 12%)
2       18 ( 11%)
3       12 (  8%)
4       16 ( 10%)
5        3 (  2%)
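The same run can be reproduced outside the Explorer through WEKA's Java API. Below is a minimal sketch: the file name and the 6-cluster configuration come from the steps above, while the attribute index range passed to the Remove filter is an assumption that must be adjusted to the actual column order of the CSV file.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class CarClustering {
    public static void main(String[] args) throws Exception {
        // Load the CSV file; WEKA infers attribute types from the data
        Instances data = DataSource.read("car_sales_weka_refined.csv");

        // Keep only price and the physical properties, mirroring the
        // "Ignore attributes" step. The index range is hypothetical --
        // adjust it to wherever price..fuel_cap sit in your file.
        Remove remove = new Remove();
        remove.setAttributeIndices("5-12");
        remove.setInvertSelection(true);   // keep the listed range, drop the rest
        remove.setInputFormat(data);
        Instances filtered = Filter.useFilter(data, remove);

        // SimpleKMeans with 6 clusters and the default seed, as in step 3
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);
        kmeans.setSeed(10);
        kmeans.buildClusterer(filtered);

        // Prints the iteration count, within-cluster SSE, and centroid table
        System.out.println(kmeans);
    }
}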

One can choose the cluster number and any of the other attributes for each of the three dimensions available (x-axis, y-axis, and colour). Different combinations of choices result in a visual rendering of different relationships within each cluster. In our example, we choose the cluster number as the x-axis, the instance number (assigned by WEKA) as the y-axis, and the "fuel capacity" attribute as the colour dimension. This gives a visualization of the distribution of fuel capacity across the cars in each cluster.

Finally, save the resulting data set, which includes each instance along with its assigned cluster. To do so, we click the "Save" button in the visualization window and save the result as the file car_sale_output.arff, which can then be converted to Excel for further analysis.
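This export step can also be scripted. The fragment below continues the Java sketch above: WEKA's AddCluster filter appends each instance's cluster assignment as a new attribute (training the supplied clusterer on the data it filters), and CSVSaver writes a CSV file that Excel opens directly. The output file name simply mirrors the one used in the Explorer.

// Additional imports: java.io.File, weka.core.converters.CSVSaver,
// weka.filters.unsupervised.attribute.AddCluster

SimpleKMeans km = new SimpleKMeans();
km.setNumClusters(6);

AddCluster ac = new AddCluster();
ac.setClusterer(km);                 // trained on "filtered" during filtering
ac.setInputFormat(filtered);
Instances withClusters = Filter.useFilter(filtered, ac);

CSVSaver saver = new CSVSaver();
saver.setInstances(withClusters);
saver.setFile(new java.io.File("car_sale_output.csv"));
saver.writeBatch();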


Classification

Classification (also known as classification trees or decision trees) is a data mining algorithm that creates a step-by-step guide for determining the output of a new data instance. The tree it creates is exactly that: a tree in which each node represents a spot where a decision must be made based on the input, and you move to the next node and the next until you reach a leaf that tells you the predicted output.
The data set we are using for our classification example comes from a BMW dealership. The dealership is starting a promotional campaign in which it is trying to push a two-year extended warranty to its past customers. The dealership has done this before and has gathered 4,500 data points from past sales of extended warranties. The attributes in the data set are:

- Income bracket [0=$0-$30k, 1=$31k-$40k, 2=$41k-$60k, 3=$61k-$75k, 4=$76k-$100k, 5=$101k-$150k, 6=$151k-$500k, 7=$501k+]
- Year/month first BMW bought
- Year/month most recent BMW bought
- Whether they responded to the extended warranty offer in the past
@relation bmwreponses

@attribute IncomeBracket {0,1,2,3,4,5,6,7}
@attribute FirstPurchase numeric
@attribute LastPurchase numeric
@attribute responded {1,0}

@data
4,200210,200601,0
5,200301,200601,1
6,200411,200601,0
5,199609,200603,0

Input file is attached for reference

1) Load the file

2) We select the Classify tab, then click "Choose" and select J48 under the trees node.

3) Click "Start" and let WEKA run. The output from this model is shown below.

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     bmwreponses
Instances:    3000
Attributes:   4
              IncomeBracket
              FirstPurchase
              LastPurchase
              responded
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------

FirstPurchase <= 200011
|   IncomeBracket = 0: 1 (271.0/114.0)
|   IncomeBracket = 1
|   |   LastPurchase <= 200512: 0 (69.0/21.0)
|   |   LastPurchase > 200512: 1 (69.0/27.0)
|   IncomeBracket = 2: 1 (194.0/84.0)
|   IncomeBracket = 3: 1 (109.0/38.0)
|   IncomeBracket = 4
|   |   LastPurchase <= 200511: 0 (54.0/22.0)
|   |   LastPurchase > 200511: 1 (105.0/40.0)
|   IncomeBracket = 5
|   |   LastPurchase <= 200505
|   |   |   LastPurchase <= 200504: 0 (8.0)
|   |   |   LastPurchase > 200504
|   |   |   |   FirstPurchase <= 199712: 1 (2.0)
|   |   |   |   FirstPurchase > 199712: 0 (3.0)
|   |   LastPurchase > 200505: 1 (185.0/78.0)
|   IncomeBracket = 6
|   |   LastPurchase <= 200507
|   |   |   FirstPurchase <= 199812: 0 (8.0)
|   |   |   FirstPurchase > 199812
|   |   |   |   FirstPurchase <= 200001: 1 (4.0/1.0)
|   |   |   |   FirstPurchase > 200001: 0 (3.0)
|   |   LastPurchase > 200507: 1 (107.0/43.0)
|   IncomeBracket = 7: 1 (115.0/40.0)
FirstPurchase > 200011
|   IncomeBracket = 0
|   |   FirstPurchase <= 200412: 1 (297.0/135.0)
|   |   FirstPurchase > 200412: 0 (113.0/41.0)
|   IncomeBracket = 1: 0 (122.0/51.0)
|   IncomeBracket = 2: 0 (196.0/79.0)
|   IncomeBracket = 3: 1 (139.0/69.0)
|   IncomeBracket = 4: 0 (221.0/98.0)
|   IncomeBracket = 5
|   |   LastPurchase <= 200512: 0 (177.0/77.0)
|   |   LastPurchase > 200512
|   |   |   FirstPurchase <= 200306: 0 (46.0/17.0)
|   |   |   FirstPurchase > 200306: 1 (88.0/30.0)
|   IncomeBracket = 6: 0 (143.0/59.0)
|   IncomeBracket = 7
|   |   LastPurchase <= 200508: 1 (34.0/11.0)
|   |   LastPurchase > 200508: 0 (118.0/51.0)

Number of Leaves  :     28

Size of the tree :      43

Time taken to build model: 0.03 seconds

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances        1774               59.1333 %
Incorrectly Classified Instances      1226               40.8667 %
Kappa statistic                          0.1807
Mean absolute error                      0.4773
Root mean squared error                  0.4885
Relative absolute error                 95.4768 %
Root relative squared error             97.7122 %
Coverage of cases (0.95 level)         100      %
Mean rel. region size (0.95 level)      99.6    %
Total Number of Instances             3000

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.662     0.481     0.587       0.662    0.622       0.616      1
               0.519     0.338     0.597       0.519    0.555       0.616      0
Weighted Avg.  0.591     0.411     0.592       0.591    0.589       0.616

=== Confusion Matrix ===

    a    b   <-- classified as
 1009  516 |    a = 1
  710  765 |    b = 0
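As a sanity check, the per-class numbers in the "Detailed Accuracy By Class" table follow directly from this confusion matrix. For class 1, taken as the positive class:

    Precision = 1009 / (1009 + 710) ≈ 0.587
    Recall    = 1009 / (1009 + 516) ≈ 0.662

which matches the first row of the table.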

Important analysis of the results

The important numbers here are the "Correctly Classified Instances" (59.1% = (1009 + 765) / (1009 + 516 + 710 + 765)) and the "Incorrectly Classified Instances" (40.9%). The "Confusion Matrix" shows the number of false positives and false negatives. Taking class 1 (responded) as the positive class, the false positives number 710 and the false negatives 516. A false positive is a data instance where the model predicts a positive value but the actual value is negative. Conversely, a false negative is a data instance where the model predicts a negative value but the actual value is positive. These errors indicate we have some problems in our model, as the model is incorrectly classifying some of the data. While some incorrect classifications are to be expected, it is up to the model creator to determine what an acceptable percentage of errors is; for some models this threshold is very low, while for others it can be higher. Since the percentage of incorrectly classified instances here is fairly high, this model could likely be improved with better data. One can see the tree by right-clicking on the newly created model in the result list and selecting "Visualize tree" from the pop-up menu; the visualized tree corresponds to the pruned-tree listing shown above.

We need to validate our classification tree, and this can be done by running our test set through the model. The test set is attached below.

To do this, in "Test options", select the "Supplied test set" radio button and click "Set...", then choose the file bmw-test.arff attached above. When we click "Start" this time, WEKA runs the test data set through the model we already created and reports how the model did; the output of the test run is shown below. Comparing the "Correctly Classified Instances" from this test set (55.7%) with the "Correctly Classified Instances" from the training set (59.1%), we see that the accuracy of the model is quite close across the two sets, which indicates that the model will not break down when unknown or future data is applied to it.
=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     bmwreponses
Instances:    3000
Attributes:   4
              IncomeBracket
              FirstPurchase
              LastPurchase
              responded
Test mode:    user supplied test set: size unknown (reading incrementally)

=== Classifier model (full training set) ===

[J48 pruned tree omitted: identical to the tree listed in the training run above, with 28 leaves and a tree size of 43.]

Time taken to build model: 0.03 seconds

=== Evaluation on test set ===

=== Summary ===

Correctly Classified Instances         835               55.6667 %
Incorrectly Classified Instances       665               44.3333 %
Kappa statistic                          0.1156
Mean absolute error                      0.4891
Root mean squared error                  0.5
Relative absolute error                 97.79   %
Root relative squared error             99.9582 %
Coverage of cases (0.95 level)          99.6    %
Mean rel. region size (0.95 level)      99.6    %
Total Number of Instances             1500

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.622     0.506     0.541       0.622    0.579       0.564      1
               0.494     0.378     0.576       0.494    0.532       0.564      0
Weighted Avg.  0.557     0.441     0.559       0.557    0.555       0.564

=== Confusion Matrix ===

   a   b   <-- classified as
 457 278 |   a = 1
 387 378 |   b = 0
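The supplied-test-set evaluation can be scripted the same way. The fragment below continues the BmwClassifier sketch above, loading bmw-test.arff and evaluating the already-trained tree on it rather than on the training data.

// Continues the BmwClassifier sketch above
Instances test = DataSource.read("bmw-test.arff");
test.setClassIndex(test.numAttributes() - 1);

Evaluation testEval = new Evaluation(train);   // training data supplies class priors
testEval.evaluateModel(tree, test);
System.out.println(testEval.toSummaryString());
System.out.println(testEval.toMatrixString());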
