
University of Mumbai

DATA MINING

Software Requirement: WEKA

WEKA is written in Java and runs on a wide range of operating systems. Its native data format is ARFF (Attribute-Relation File Format). The data for these practicals is in an MS Excel sheet, bank-data.xls, so the first step is to convert the Excel file into comma-separated format (.csv).

The data contains the following fields:

id            a unique identification number
age           age of customer in years (numeric)
sex           MALE / FEMALE
region        inner_city / rural / suburban / town
income        income of customer (numeric)
married       is the customer married (YES/NO)
children      number of children (numeric)
car           does the customer own a car (YES/NO)
save_acct     does the customer have a savings account (YES/NO)
current_acct  does the customer have a current account (YES/NO)
mortgage      does the customer have a mortgage (YES/NO)
pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

Practical 1: Perform the steps of Data Preprocessing in WEKA


Step 1: Convert the Excel file into comma-separated format (.csv). Open the Excel file and select Save As from the File pull-down menu. In the ensuing dialog box, select CSV as the file type and save the file.

Step 2: The .csv file can now be opened in a text editor.

Step 3: Start WEKA.

Step 4: Loading the data. In addition to its native ARFF format, WEKA can also read .csv files. Assuming WEKA is installed properly, click on the Explorer button. In the Preprocess tab, click on Open file... and open the data file (.csv or .arff).

Step 5: Filtering attributes. In our data set, each customer has a unique id. This attribute has to be removed before the data mining step. First tick the check box next to the id attribute, which is attribute number 1. Then click on the Remove button at the bottom left of the attribute list. This removes the id attribute and all its values.

Step 6: Saving the new data set. Now save this data set by clicking on Save and naming the file bank-data2.arff or bank-data2.csv.
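
For readers who want to check the result of removing the id attribute outside the GUI, the same operation can be sketched in a few lines of Python. This is only an illustration using the standard csv module, not what WEKA does internally; the function name drop_column and the two sample rows are made up for the example.

```python
import csv
import io

def drop_column(csv_text, column):
    """Return csv_text with the named column removed from the header and every row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    idx = rows[0].index(column)  # raises ValueError if the column is missing
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in rows:
        writer.writerow(row[:idx] + row[idx + 1:])
    return out.getvalue()

# Two made-up rows in the bank-data layout (values are illustrative only).
sample = (
    "id,age,sex,region,income\n"
    "ID00001,40,MALE,town,30000.0\n"
    "ID00002,25,FEMALE,rural,15000.0\n"
)
print(drop_column(sample, "id"))
```

To apply it to the real file, read bank-data.csv into a string, pass it through drop_column, and write the result to bank-data2.csv.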

Step 7: Now open the new file bank-data2.arff or bank-data2.csv in a text editor (WordPad). As can be seen, the attribute id and its corresponding values have been removed.

Note the line @relation bank-data-weka.filters.unsupervised.attribute.Remove-R1. This line simply records the operations that have been applied to the data set so far. As can be seen, attributes can be of either numeric or nominal type.
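
To see at a glance which attributes of an ARFF file are numeric and which are nominal, the header can be scanned with a short script. The sketch below is a simplified reader written for this handout, not WEKA's own parser: it assumes unquoted attribute names without spaces, and the sample header is an abbreviated, illustrative one.

```python
def attribute_types(arff_text):
    """Map each attribute name to its type keyword or to its list of nominal values."""
    types = {}
    for line in arff_text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        if not line.lower().startswith("@attribute"):
            continue
        _, name, spec = line.split(None, 2)
        if spec.startswith("{"):
            # Nominal attribute: the allowed values are listed inside braces.
            types[name] = [v.strip() for v in spec.strip("{}").split(",")]
        else:
            # Otherwise keep the declared type keyword, e.g. "numeric".
            types[name] = spec.lower()
    return types

# Abbreviated, illustrative header (assumption: simple unquoted names).
header = """@relation bank-data
@attribute age numeric
@attribute sex {FEMALE,MALE}
@attribute region {inner_city,rural,suburban,town}
@attribute income numeric
@data
"""
for name, kind in attribute_types(header).items():
    print(name, kind)
```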

Step 8: Discretization. Techniques such as association rule mining can only be performed on categorical data, so numeric or continuous attributes must first be discretized. There are three such attributes in our data set: age, children and income. For children, edit the attribute declaration in the text editor so that it reads children {0,1,2,3}.

This removes the keyword numeric from the children attribute and replaces it with a set of nominal values. Now save the file in WordPad; this may produce a warning message, which can be dismissed by clicking OK.

Step 9: Discretization in WEKA. To discretize the attributes age and income, divide each of them into 3 intervals. Open bank-data2.arff or bank-data2.csv using the Open file... button, then click "Choose" in the Filter panel and select weka.filters.unsupervised.attribute.Discretize. The text box next to the "Choose" button will show something like Discretize -B 10 -M -1.0 -R first-last.

Step 10: Discretization in WEKA (continued). Click on the text box to open the Discretize filter dialog box. Enter 1,4 in the text box corresponding to attributeIndices (age and income are now the first and fourth attributes). Enter 3 as the number of intervals (bins). As this is simple binning, all the other options remain False.

Click on OK and then Apply. This results in a new working relation in which the two selected attributes are each partitioned into 3 intervals (bins). To examine the result, save the new working relation in the file bank-data3.arff or bank-data3.csv.
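
Simple (equal-width) binning splits the range between the smallest and largest value into intervals of the same width. The sketch below shows the arithmetic under stated assumptions: the list of ages is hypothetical, chosen so that the minimum is 18 and the maximum is 67, which happens to reproduce the two age cut points that appear in the interval labels of Step 11. The function names are made up for the example.

```python
def equal_width_cut_points(values, bins):
    """Return the bins-1 cut points that split [min, max] into equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]

def bin_index(value, cut_points):
    """Return the 0-based bin of value; intervals are closed on the right, as in (a-b]."""
    for i, cut in enumerate(cut_points):
        if value <= cut:
            return i
    return len(cut_points)

ages = [18, 23, 35, 48, 51, 67]  # hypothetical values, not taken from bank-data
cuts = equal_width_cut_points(ages, 3)
print([round(c, 6) for c in cuts])         # [34.333333, 50.666667]
print([bin_index(a, cuts) for a in ages])  # [0, 0, 1, 1, 2, 2]
```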

Step 11: Examining the discretized data.

For example, the lower range of the attribute age is labeled (-inf-34.333333] and the middle range (34.333333-50.666667]. Now replace the following labels using the Replace option of WordPad:

Attribute age
(-inf-34.333333]               0_34
(34.333333-50.666667]          35_51
(50.666667-inf)                52_max

Attribute income
(-inf-24388.173333]            0_24386
(24388.173333-43758.136667]    24386_43758
(43758.136667-inf)             43759_max

After replacing the values, save the changes as bank-data-final.arff or bank-data-final.csv.


Practical 2: Perform Association Rule Mining with WEKA


Step 1: Open the file bank-data-final.arff or bank-data-final.csv

Step 2: Clicking on the "Associate" tab brings up the interface for the association rule algorithms. Apriori, the algorithm used here, is selected by default. Click on the text box immediately to the right of the "Choose" button to edit its parameters. Choose lift as the criterion and enter 1.5 as its minimum value.

Lift is computed as the confidence of the rule divided by the support of the right-hand side (RHS). In a simplified form, given a rule L => R, lift is the ratio of the probability that L and R occur together to the product of the two individual probabilities for L and R, i.e., lift = Pr(L,R) / (Pr(L) * Pr(R)). If this value is 1, then L and R are independent. The higher this value, the more likely it is that the existence of L and R together in a transaction is not just a random occurrence but is due to some relationship between them.
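
The lift formula can be checked by hand on a tiny example. The sketch below counts occurrences in five made-up transactions (they are not rows of the bank data) and computes lift for the rule married=NO => pep=YES; the function name and the item strings are assumptions for illustration. It does not guard against a left-hand or right-hand side that never occurs.

```python
def lift(transactions, lhs, rhs):
    """lift(L => R) = Pr(L,R) / (Pr(L) * Pr(R)), estimated from item sets."""
    n = len(transactions)
    lhs, rhs = set(lhs), set(rhs)
    p_l = sum(lhs <= t for t in transactions) / n            # Pr(L)
    p_r = sum(rhs <= t for t in transactions) / n            # Pr(R)
    p_lr = sum((lhs | rhs) <= t for t in transactions) / n   # Pr(L,R)
    return p_lr / (p_l * p_r)

# Five hypothetical transactions, each a set of attribute=value items.
transactions = [
    {"married=NO", "pep=YES"},
    {"married=NO", "pep=YES"},
    {"married=YES", "pep=NO"},
    {"married=YES", "pep=NO"},
    {"married=NO", "pep=NO"},
]
print(round(lift(transactions, {"married=NO"}, {"pep=YES"}), 4))  # 1.6667
```

Here Pr(L) = 3/5, Pr(R) = 2/5 and Pr(L,R) = 2/5, so lift = 0.4 / (0.6 * 0.4) = 1.67. The same number results from dividing the confidence (2/3) by the support of the RHS (2/5), and it would pass the minimum lift of 1.5.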

Here, change the default number of rules (10) to 100; this means that the program will report no more than the top 100 rules. The upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%). Apriori in WEKA starts with the upper bound for support and incrementally decreases it. The algorithm halts when either the specified number of rules has been generated or the lower bound for minimum support is reached.
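
The halting rule described above can be sketched as a small loop. This is a simplified illustration, not WEKA's actual implementation: the step size of 0.05 is an assumption, and count_rules is a hypothetical callback standing in for "run Apriori at this minimum support and count the rules that satisfy the chosen metric".

```python
def lower_support_until(count_rules, num_rules, upper=1.0, lower=0.1, delta=0.05):
    """Lower the minimum support step by step; return (support, rules_found) at the halt."""
    steps = 0
    while True:
        # Rounding avoids floating-point drift after repeated subtraction.
        support = max(lower, round(upper - steps * delta, 10))
        found = count_rules(support)
        # Halt when enough rules exist or the lower bound has been reached.
        if found >= num_rules or support <= lower:
            return support, found
        steps += 1

# Hypothetical rule counts: plenty of rules only once support drops to 30%.
def fake_count(support):
    return 120 if support <= 0.3 else 5

print(lower_support_until(fake_count, num_rules=100))
```

With the hypothetical counts above the loop stops at a minimum support of 0.3; if no support level ever yields 100 rules, it stops at the lower bound of 0.1 instead.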

Step 3: Once the parameters have been set, the command line text box will show the new command line. Now click on "Start" to run the program.

Step 4: The panel on the left ("Result list") now shows an item indicating the algorithm that was run and the time of the run. Clicking on one of the results in this list will bring up the details of the run, including the discovered rules in the right panel. In addition, right-clicking on the result set allows us to save the result buffer into a separate file. Now save the output in the file bank-data-ar1.txt

Practical 3: Classification via decision tree


Step 1: WEKA has implementations of numerous classification and prediction algorithms. The basic ideas behind using all of these are similar. This practical uses the file "bank.arff".

Step 2: Next, select the "Classify" tab and click the "Choose" button to select the J48 classifier.

Various parameters can be specified by clicking on the text box to the right of the "Choose" button.

Step 3: Under "Test options" in the main panel, select 10-fold cross-validation as the evaluation approach. Now click "Start" to generate the model. The ASCII version of the tree, as well as the evaluation statistics, will appear in the right panel when the model construction is completed.
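
In k-fold cross-validation the data is split into k parts; each part is used once for testing while the remaining parts are used for training. The sketch below shows only how the instance indices could be divided. It is a plain, unstratified split written for illustration and makes no claim to match how WEKA assigns instances to folds; the instance count of 25 and the seed are arbitrary.

```python
import random

def k_fold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# 25 is an arbitrary instance count used only for this illustration.
for train, test in k_fold_indices(25, k=10):
    print(len(train), len(test))
```

Every instance lands in exactly one test fold, so each one is predicted exactly once by a model that was not trained on it.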

To view this information in a separate window, right-click the last result set (inside the "Result list" panel on the left) and select "View in separate window" from the pop-up menu.

Step 4: To view a graphical rendition of the classification tree, right-click the last result set and select "Visualize tree" from the pop-up menu.

Note that in the file of new instances ("bank-new.arff"), the attribute section is identical to that of the training data. However, in the data section, the value of the "pep" attribute is "?" (unknown).

Step 5: In the main panel, under "Test options" click the "Supplied test set" radio button, and then click the "Set..." button. This will pop up a window which allows you to open the file containing test instances.

Open the file "bank-new.arff" and, upon returning to the main window, click the "Start" button. This once again generates the model from our training data, but this time it applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute.

Step 6: The next two steps create a file containing all the new instances along with their predicted class values resulting from the application of the model. First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up menu, select the item "Visualize classifier errors".

Step 7: Save the classification results from which the graph is generated. In the new window, click on the "Save" button and save the result as the file "bankpredicted.arff".

Practical 4: K-Means Clustering


Step 1: The sample data set used for this example is based on "bank-data.csv". This document assumes that appropriate data preprocessing has been performed. In this case, a version of the initial data set has been created in which the id field has been removed and the "children" attribute has been converted to categorical.

To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This brings up a drop-down list of available clustering algorithms. In this case, select "SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop-up window for editing the clustering parameters.

Step 2: In the pop-up window, enter 6 as the number of clusters and leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters. Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click "Start". We can right click the result set in the "Result list" panel and view the results of clustering in a separate window.
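
The role of the number of clusters and the seed can be seen in a bare-bones version of the k-means procedure. The sketch below is a teaching illustration only, under several assumptions: it handles numeric attributes alone, uses squared Euclidean distance, and uses the seed to pick k instances as the starting centroids. It is not WEKA's SimpleKMeans, and the six one-dimensional points are made up.

```python
import random

def k_means(points, k, seed=10, max_iter=100):
    """Cluster numeric tuples into k groups; return (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # the seed decides the starting centroids
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no instance changed cluster: converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Six hypothetical one-dimensional instances forming two obvious groups.
points = [(20.0,), (22.0,), (24.0,), (60.0,), (62.0,), (64.0,)]
centroids, assignment = k_means(points, k=2)
print(sorted(centroids))  # [(22.0,), (62.0,)]
```

On data this cleanly separated every seed leads to the same two centroids; on real data, different seeds can give different clusterings, which is why the seed is exposed as a parameter.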

The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to the different clusters. Another way of understanding the characteristics of each cluster is through visualization. Do this by right-clicking the result set in the left "Result list" panel and selecting "Visualize cluster assignments".

You can choose the cluster number and any of the other attributes for each of the three dimensions available (x-axis, y-axis, and color). Different combinations of choices result in a visual rendering of different relationships within each cluster. In this example, we have chosen the cluster number as the x-axis, the instance number (assigned by WEKA) as the y-axis, and the "sex" attribute as the color dimension. This results in a visualization of the distribution of males and females in each cluster. For instance, you can note that clusters 2 and 3 are dominated by males, while clusters 4 and 5 are dominated by females. By changing the color dimension to other attributes, you can see their distribution within each of the clusters.

Step 3: Now, click the "Save" button in the visualization window and save the result as the file "bank-kmeans.arff"

----END----
