
Ex no 4: Study of Weka Machine Learning toolkit

R.SIDDARTH, 10i345

9.3.13

Aim:

To learn to use the Weka Machine Learning toolkit.

Questions:

Introduction:

1. What options are available on the main panel of Weka?
The Explorer window provides six tabs: Preprocess, Classify, Cluster, Associate, Select attributes and Visualize.

2. What is the purpose of the following in Weka?
Explorer: An environment for exploring data with WEKA.
Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
KnowledgeFlow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
SimpleCLI: Provides a simple command-line interface that allows direct execution of WEKA commands on operating systems that do not provide their own command-line interface.

3. Describe the ARFF file format.
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.

The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in the data), and their types. An example header on the standard IRIS dataset looks like this:

    % 1. Title: Iris Plants Database
    %
    % 2. Sources:
    %      (a) Creator: R.A. Fisher
    %      (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    %      (c) Date: July, 1988
    %
    @RELATION iris

    @ATTRIBUTE sepallength NUMERIC
    @ATTRIBUTE sepalwidth NUMERIC
    @ATTRIBUTE petallength NUMERIC
    @ATTRIBUTE petalwidth NUMERIC
    @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

The Data of the ARFF file looks like the following:

    @DATA
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
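To make the two-section layout concrete, the following is a minimal Python sketch that splits an ARFF string into its header and data parts. The function name `parse_arff` is hypothetical and not part of Weka. It only handles the simple dense form shown above (no quoted names, sparse rows or date attributes), so it should be read as an illustration of the format rather than as a full reader.

```python
def parse_arff(text):
    """Parse a minimal dense ARFF string into (relation, attributes, rows).

    Assumption: names contain no spaces or quotes; values are comma-separated.
    """
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if in_data:
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True  # everything after this line is an instance
    return relation, attributes, rows


sample = """% 1. Title: Iris Plants Database
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
"""

relation, attributes, rows = parse_arff(sample)
print(relation, len(attributes), len(rows))
```

The header lines end at `@DATA`; every later non-comment line is treated as one instance whose values follow the attribute order declared in the header.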

4. What is meant by filtering in Weka? Which panel is used for filtering a dataset?
In WEKA, filters are used to preprocess the data. They are found in the package weka.filters and are applied from the Preprocess panel of the Explorer.

5. What are the two main types of filters in Weka?
Each filter falls into one of the following two categories:
supervised: The filter requires a class attribute to be set.
unsupervised: A class attribute is not required to be present.
And into one of the two sub-categories:
attribute-based: Columns are processed, e.g., added or removed.
instance-based: Rows are processed, e.g., added or deleted.

6. What is the difference between the two types of filters? What is the difference between an attribute filter and an instance filter?
Supervised filters make use of the class attribute, whereas unsupervised filters do not need one. Attribute filters operate on columns, whereas instance filters operate on rows. Examples:
weka.filters.supervised.attribute
Discretize is used to discretize numeric attributes into nominal ones, based on the class information, via Fayyad & Irani's MDL method, or optionally with Kononenko's MDL method. Some learning schemes or classifiers can only process nominal data.
weka.filters.supervised.instance
Resample creates a stratified subsample of the given dataset. This means that overall class distributions are approximately retained within the sample.
weka.filters.unsupervised
Classes below weka.filters.unsupervised in the class hierarchy are for unsupervised filtering, e.g. the non-stratified version of Resample. A class attribute should not be assigned here.
weka.filters.unsupervised.attribute
StringToWordVector transforms string attributes into word vectors, i.e. it creates one attribute for each word, which encodes either presence or word count (-C) within the string. -W can be used to set an approximate limit on the number of words. When a class is assigned, the limit applies to each class separately. This filter is useful for text mining.

7. Load the iris dataset and answer the following questions.
a. How many instances (tuples) are there in the dataset?
150
b. State the names of the attributes along with their types and values.


8. Load the weather dataset and perform the following tasks.
a. Use the unsupervised filter RemoveWithValues to remove all instances where the attribute humidity has the value high.
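The effect of this instance filter can be sketched in plain Python. This is not Weka's implementation; the helper name `remove_with_value` is hypothetical, and the four rows are illustrative examples in the style of the weather data rather than the full dataset.

```python
# Illustrative rows only (assumption: not the complete weather dataset).
rows = [
    {"outlook": "sunny", "humidity": "high", "play": "no"},
    {"outlook": "overcast", "humidity": "high", "play": "yes"},
    {"outlook": "rainy", "humidity": "normal", "play": "yes"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
]


def remove_with_value(instances, attribute, value):
    """Return the instances whose `attribute` is not equal to `value`."""
    return [inst for inst in instances if inst[attribute] != value]


kept = remove_with_value(rows, "humidity", "high")
for inst in kept:
    print(inst)
```

Because the filter is instance-based, the number of rows changes while the set of attributes stays the same.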


9. Create an ARFF file for the following dataset and load the dataset in Weka.

Department   Status   Count
Sales        Senior   30
Sales        Junior   40
Sales        Junior   40
Systems      Junior   20
Systems      Senior   5
Systems      Junior   3
Systems      Senior   3
Marketing    Senior   10
Marketing    Junior   4
Secretary    Senior   4
Secretary    Junior   6
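One possible way to produce the ARFF text for this table is sketched below in Python. The relation name `department` and the helper names are assumptions chosen for illustration; Department and Status are declared as nominal attributes and Count as numeric, which is one reasonable reading of the table.

```python
rows = [
    ("Sales", "Senior", 30), ("Sales", "Junior", 40), ("Sales", "Junior", 40),
    ("Systems", "Junior", 20), ("Systems", "Senior", 5), ("Systems", "Junior", 3),
    ("Systems", "Senior", 3), ("Marketing", "Senior", 10), ("Marketing", "Junior", 4),
    ("Secretary", "Senior", 4), ("Secretary", "Junior", 6),
]


def unique_in_order(values):
    """Distinct values in first-seen order, used for nominal value lists."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


def to_arff(relation, table):
    """Build ARFF text with two nominal attributes and one numeric attribute."""
    departments = unique_in_order(r[0] for r in table)
    statuses = unique_in_order(r[1] for r in table)
    lines = [
        f"@RELATION {relation}",
        "",
        "@ATTRIBUTE department {" + ",".join(departments) + "}",
        "@ATTRIBUTE status {" + ",".join(statuses) + "}",
        "@ATTRIBUTE count NUMERIC",
        "",
        "@DATA",
    ]
    lines += [f"{dept},{status},{count}" for dept, status, count in table]
    return "\n".join(lines) + "\n"


arff_text = to_arff("department", rows)  # relation name is an assumption
print(arff_text)
```

Saving the printed text with an .arff extension should give a file that can be opened from the Preprocess panel.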


Association Analysis:

1. State the purpose of the Apriori algorithm.
To find the frequent itemsets in a dataset, i.e. the itemsets whose support meets a minimum threshold, from which association rules are then generated.

2. Perform the following tasks:
a. Load the vote.arff dataset.
b. Apply the Apriori association rule algorithm.
c. What is the support threshold used? What is the confidence threshold used?
Minimum support: 0.45 (196 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 11
d. Write down the top 6 rules along with the support and confidence values.
1. adoption-of-the-budget-resolution=y physician-fee-freeze=n 219 ==> Class=democrat 219 conf:(1)
2. adoption-of-the-budget-resolution=y physician-fee-freeze=n aid-to-nicaraguan-contras=y 198 ==> Class=democrat 198 conf:(1)
3. physician-fee-freeze=n aid-to-nicaraguan-contras=y 211 ==> Class=democrat 210 conf:(1)
4. physician-fee-freeze=n education-spending=n 202 ==> Class=democrat 201 conf:(1)
5. physician-fee-freeze=n 247 ==> Class=democrat 245 conf:(0.99)
6. el-salvador-aid=n Class=democrat 200 ==> aid-to-nicaraguan-contras=y 197 conf:(0.99)

3. Create the dataset to represent the following data.
T1 {F,A,D,B}
T2 {D,A,C,E,B}
T3 {C,A,B,E}
T4 {B,A,D}
a. Apply the Apriori algorithm. List the top 10 rules along with support and confidence values.
1. B=YES 4 ==> A=YES 4 conf:(1)
2. A=YES 4 ==> B=YES 4 conf:(1)
3. D=YES 3 ==> A=YES 3 conf:(1)
4. E=YES 3 ==> A=YES 3 conf:(1)
5. F=NO 3 ==> A=YES 3 conf:(1)
6. D=YES 3 ==> B=YES 3 conf:(1)
7. E=YES 3 ==> B=YES 3 conf:(1)
8. F=NO 3 ==> B=YES 3 conf:(1)
9. B=YES D=YES 3 ==> A=YES 3 conf:(1)
10. A=YES D=YES 3 ==> B=YES 3 conf:(1)
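The support counts and confidence values in rules such as B ==> A can be reproduced by hand. The Python sketch below computes them directly from the four transactions T1 to T4. It is a direct count, not the Apriori algorithm itself, and it only models items that are present in a transaction; the Weka run above also treats absent items (for example F=NO) as attribute values, which this sketch does not attempt. The function names are hypothetical.

```python
# Transactions T1..T4 from question 3, as sets of the items present.
transactions = [
    {"F", "A", "D", "B"},
    {"D", "A", "C", "E", "B"},
    {"C", "A", "B", "E"},
    {"B", "A", "D"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(antecedent, consequent, baskets):
    """conf(X ==> Y) = count(X and Y together) / count(X)."""
    both = support_count(antecedent | consequent, baskets)
    return both / support_count(antecedent, baskets)


print(support_count({"B"}, transactions))        # baskets containing B
print(confidence({"B"}, {"A"}, transactions))    # rule B ==> A
print(confidence({"D"}, {"A"}, transactions))    # rule D ==> A
print(confidence({"A"}, {"D"}, transactions))    # rule A ==> D
```

A rule is reported only when its confidence reaches the chosen minimum (0.9 in the runs here), which is why A ==> D, with three of four baskets, would not qualify.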

b. List the frequent itemsets generated at each level.
Size of set of large itemsets L(1): 5
Size of set of large itemsets L(2): 7
Size of set of large itemsets L(3): 3
c. Change the values of support and confidence and test them.
Minimum support: 0.85 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 23
Generated sets of large itemsets:
Size of set of large itemsets L(1): 5
Size of set of large itemsets L(2): 7
Size of set of large itemsets L(3): 3
Best rules found:
1. B=YES 4 ==> A=YES 4 conf:(1)
2. A=YES 4 ==> B=YES 4 conf:(1)
3. D=YES 3 ==> A=YES 3 conf:(1)
4. E=YES 3 ==> A=YES 3 conf:(1)
5. F=NO 3 ==> A=YES 3 conf:(1)
6. D=YES 3 ==> B=YES 3 conf:(1)
7. E=YES 3 ==> B=YES 3 conf:(1)
8. F=NO 3 ==> B=YES 3 conf:(1)
9. B=YES D=YES 3 ==> A=YES 3 conf:(1)
10. A=YES D=YES 3 ==> B=YES 3 conf:(1)

Data Clustering:

Perform the following tasks:
1. Load the bank.arff data set in Weka.
2. Write down the following details regarding the attributes: a. names b. types c. values.

NAME       TYPE
Age        Numeric
Sex        Nominal
Region     Nominal
Income     Numeric
Married    Nominal
Children   Nominal
car        Nominal
Mortgage   Nominal
pep        Nominal


3. Run the SimpleKMeans clustering algorithm on the dataset.
4. How many clusters are created?
2
5. What are the number of instances and percentage figures in each cluster?
Clustered Instances
0    172 (57%)
1    128 (43%)

6. What is the number of iterations that were required?
3
7. What is the sum of squared errors? What does it represent?
Within cluster sum of squared errors: 775.1756576878267
It is the sum, over all instances, of the squared distance between each instance and the centroid of the cluster it is assigned to. A smaller value indicates more compact clusters.
Missing values globally replaced with mean/mode
8. Tabulate the characteristics of the centroid of each cluster.
Cluster centroids:
                            Cluster#
Attribute    Full Data      0             1
             (300)          (172)         (128)
==============================================
age          42.57          39.6744       46.4609
sex          MALE           MALE          FEMALE
region       TOWN           INNER_CITY    INNER_CITY
income       27655.4981     25693.731     30291.6227
married      YES            YES           NO
children     NO             NO            YES
car          NO             YES           NO
mortgage     NO             NO            YES
pep          NO             NO            YES
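The within-cluster sum of squared errors reported in question 7 can be illustrated with a small Python sketch. The points, assignments and centroids below are made-up illustrative values, not the bank data, and the function name is hypothetical. Weka's own figure also accounts for nominal attributes and attribute scaling, which is not modelled here, so the sketch will not reproduce the value 775.17.

```python
def within_cluster_sse(points, assignments, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for point, cluster in zip(points, assignments):
        centroid = centroids[cluster]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total


# Illustrative numeric data only (assumption: two clusters, two attributes).
points = [(1.0, 1.0), (1.0, 3.0), (5.0, 5.0), (7.0, 5.0)]
assignments = [0, 0, 1, 1]               # cluster index for each point
centroids = [(1.0, 2.0), (6.0, 5.0)]     # mean of each cluster's points

sse = within_cluster_sse(points, assignments, centroids)
print(sse)
```

Each point here lies at distance 1 from its centroid, so every point contributes 1 to the total. Moving a point further from its centroid increases the figure, which is why the value is used as a measure of cluster compactness.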

9. Visualize the results of this clustering (let the X-axis represent the cluster name, and the Y-axis represent the instance number).

[Figure: cluster visualization for the iris table]

[Figure: cluster visualization for the bank table]


10. Is there a significant variation in age between clusters?
No.
11. Which clusters are predominated by males and which clusters are predominated by females?
Cluster 0 is predominated by males and Cluster 1 is predominated by females.
12. What can be said about the values of the region attribute in each cluster?
Cluster 0 has more people in the inner city and Cluster 1 has more people in town.
13. What can be said about the variation of income between clusters?
The variation of income is more or less the same in both clusters, but in the upper range the variation in Cluster 1 is slightly greater than in Cluster 0.
14. Which clusters are dominated by married people and which clusters are dominated by unmarried people?
Cluster 0 is dominated by married people and Cluster 1 is dominated by unmarried people.
15. How do the clusters differ with respect to the number of children?
Cluster 1 has children and Cluster 0 does not.
16. Which cluster has the highest number of people with cars?
Cluster 0
17. What can be said about the variation of mortgage holdings between clusters?
There are no mortgage holdings in either cluster.

18. Which clusters comprise mostly people who buy the PEP product and which ones comprise people who do not buy the PEP product?
Cluster 1 comprises mostly people who buy the PEP product and Cluster 0 comprises people who do not buy the PEP product.

Data Classification:

1. Load the weather.nominal.arff dataset into Weka and run the Id3 classification algorithm. Answer the following questions.
a. Draw the decision tree generated by the classifier.

    outlook = sunny
    |  humidity = high: no
    |  humidity = normal: yes
    outlook = overcast: yes
    outlook = rainy
    |  windy = TRUE: no
    |  windy = FALSE: yes

b. Draw the confusion matrix. What information does the confusion matrix provide?

    === Confusion Matrix ===
     a b   <-- classified as
     9 0 | a = yes
     0 5 | b = no

Each row of the confusion matrix is an actual class and each column is a predicted class, so the matrix shows how many instances of each class were classified correctly and how many were confused with another class. Here all 9 "yes" instances and all 5 "no" instances lie on the diagonal, so none are misclassified.

RESULT
Thus the study of the Weka Machine Learning toolkit is successfully completed.
