
Experiment No. 1
Installation and Configuration of Hadoop

Theory:

Hadoop-1.2.1 Installation Steps for a Single-Node Cluster (on Ubuntu 12.04)

Download and install VMware Player depending on your host OS (32-bit or 64-bit).
Download the .iso image file of Ubuntu 12.04 LTS (32-bit or 64-bit, depending on your requirements).
Install Ubuntu from the image in VMware. For efficient use, configure the virtual machine to have at least 2 GB of RAM (4 GB preferred) and at least 2 processor cores.

----------------JAVA INSTALLATION---------------

 sudo mkdir -p /usr/local/java

 cd ~/Downloads
 sudo cp -r jdk-8-linux-i586.tar.gz /usr/local/java
 sudo cp -r jre-8-linux-i586.tar.gz /usr/local/java
 cd /usr/local/java
 sudo tar xvzf jdk-8-linux-i586.tar.gz
 sudo tar xvzf jre-8-linux-i586.tar.gz
 ls -a
jdk1.8.0 jre1.8.0 jdk-8-linux-i586.tar.gz jre-8-linux-i586.tar.gz
 sudo gedit /etc/profile
Append the following lines at the end of /etc/profile:
JAVA_HOME=/usr/local/java/jdk1.8.0
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
JRE_HOME=/usr/local/java/jdk1.8.0/jre
PATH=$PATH:$HOME/bin:$JRE_HOME/bin
HADOOP_HOME=/home/hadoop/hadoop-1.2.1
PATH=$PATH:$HADOOP_HOME/bin
export JAVA_HOME
export JRE_HOME
export PATH
 sudo update-alternatives --install "/usr/bin/java" "java" "/usr/local/java/jdk1.8.0/jre/bin/java" 1
 sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/local/java/jdk1.8.0/bin/javac" 1
 sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/local/java/jdk1.8.0/bin/javaws" 1
 sudo update-alternatives --set java /usr/local/java/jdk1.8.0/jre/bin/java
 sudo update-alternatives --set javac /usr/local/java/jdk1.8.0/bin/javac
 sudo update-alternatives --set javaws /usr/local/java/jdk1.8.0/bin/javaws
 . /etc/profile
 java -version

java version "1.8.0"


Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) Client VM (build 25.0-b70, mixed mode)
---------------------HADOOP INSTALLATION------------------

 Open the Home folder and create a directory named hadoop.
 Copy hadoop-1.2.1.tar.gz from Downloads into the hadoop directory.
 Right-click on hadoop-1.2.1.tar.gz and select Extract Here, or extract it with the tar command below.
 cd hadoop/
 ls -a
 tar -xvzf hadoop-1.2.1.tar.gz
 Edit the file conf/hadoop-env.sh and set:
export JAVA_HOME=/usr/local/java/jdk1.8.0
 cd hadoop-1.2.1

------------------STANDALONE OPERATION----------------

 mkdir input
 cp conf/*.xml input
 bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
 cat output/*

----------------PSEUDO DISTRIBUTED OPERATION---------------//WORDCOUNT

 conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

 conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
 conf/mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>

 ssh localhost
 ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
 cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
 bin/hadoop namenode -format
 bin/start-all.sh

Run the following command to verify that the Hadoop services are running:

$ jps

If everything is successful, you should see the following services running:

2361 NameNode
2583 DataNode
2840 SecondaryNameNode
2970 JobTracker
3177 TaskTracker
3461 Jps

Conclusion:

Thus, we have studied how to install and configure Hadoop on Ubuntu operating system.
Experiment No. 2
Create an application (Ex: Word Count) using Hadoop Map/Reduce.

Theory:

The MapReduce Model

Traditional parallel computing algorithms were developed for systems with a small
number of processors, dozens rather than thousands. So it was safe to assume that
processors would not fail during a computation. At significantly larger scales this
assumption breaks down, as was experienced at Google in the course of having to carry
out many large-scale computations similar to the one in our word counting example. The
MapReduce parallel programming abstraction was developed in response to these needs,
so that it could be used by many different parallel applications while leveraging a
common underlying fault-tolerant implementation that was transparent to application
developers. MapReduce can be illustrated using the word counting example, where we need to count the occurrences of each word in a collection of documents.

MapReduce proceeds in two phases, a distributed map operation followed by a distributed reduce operation; in each phase a configurable number of M mapper processors and R reducer processors are assigned to work on the problem (for example, M = 3 and R = 2). The computation is coordinated by a single master process.

A MapReduce implementation of the word counting task proceeds as follows: in the map phase, each mapper reads approximately 1/M-th of the input (in this case documents) from the global file system, using locations given to it by the master. Each mapper then performs a map operation to compute word frequencies for its subset of documents. These frequencies are sorted by the words they represent and written to the local file system of the mapper.

In the next phase, each reducer is assigned a subset of words; for example, the first reducer is assigned w1 and w2 while the second one handles w3 and w4. In fact,
during the map phase itself each mapper writes one file per reducer, based on the words
assigned to each reducer, and keeps the master informed of these file locations. The
master in turn informs the reducers where the partial counts for their words have been
stored on the local files of respective mappers; the reducers then make remote procedure
call requests to the mappers to fetch these. Each reducer performs a reduce operation that
sums up the frequencies for each word, which are finally written back to the GFS file
system.
The MapReduce programming model generalizes the computational structure of the
above example. Each map operation consists of transforming one set of key-value
pairs to another:

Map: (k1, v1) → [(k2, v2)]

In our example, each map operation takes a document indexed by its id and emits a list of word-count pairs indexed by word: (dk, [w1 . . . wn]) → [(wi, ci)]. The reduce operation groups the results of the map step using the same key k2 and performs a function f on the list of values that correspond to each key:

Reduce: (k2, [v2]) → (k2, f ([v2]))

In our example, each reduce operation sums the frequency counts for each word: (wi, [ci]) → (wi, Σi ci)
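
To make these two transformations concrete, the following minimal, self-contained Java sketch (an illustration of the data flow only, not the Hadoop program itself) applies the map and reduce steps to two small hypothetical documents:

import java.util.*;

public class MapReduceSketch {
    // Map: (docId, text) -> [(word, 1)]
    static List<Map.Entry<String, Integer>> map(String docId, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : text.split("\\s+")) {
            pairs.add(new AbstractMap.SimpleEntry<>(w, 1));
        }
        return pairs;
    }

    // Reduce: (word, [counts]) -> sum of counts
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        // Hypothetical input documents (for illustration only)
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("d1", "to be or not to be");
        docs.put("d2", "to do or not to do");

        // Map phase: emit (word, 1) pairs, then group them by key (the shuffle)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> d : docs.entrySet()) {
            for (Map.Entry<String, Integer> kv : map(d.getKey(), d.getValue())) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }

        // Reduce phase: sum the grouped counts for each word
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " " + reduce(e.getKey(), e.getValue()));
        }
    }
}

Running this prints each distinct word with its total count, mirroring the (wi, Σi ci) output of the reduce phase.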
The implementation also generalizes. Each mapper is assigned an input-key range (set of
values for k1) on which map operations need to be performed. The mapper writes results of
its map operations to its local disk in R partitions, each corresponding to the output-key
range (values of k2) assigned to a particular reducer, and informs the master of these
locations. Next each reducer fetches these pairs from the respective mappers and performs
reduce operations for each key k2 assigned to it. If a processor fails during the execution,
the master detects this through regular heartbeat communications it maintains with each
worker, wherein updates are also exchanged regarding the status of tasks assigned to
workers.

If a mapper fails, then the master reassigns the key-range designated to it to another
working node for re-execution. Note that re-execution is required even if the mapper had
completed some of its map operations, because the results were written to local disk rather
than the GFS. On the other hand, if a reducer fails only its remaining tasks (values k2) are
reassigned to another node, since the completed tasks would already have been written to
the GFS.

Finally, heartbeat failure detection can be fooled by a wounded task that has a heartbeat but is making no progress. Therefore, the master also tracks the overall progress of the computation, and if results from the last few processors in either phase are excessively delayed, these tasks are duplicated and assigned to processors that have already completed their work. The master declares the task completed when any one of the duplicate workers completes. Such a fault-tolerant implementation of the MapReduce model
has been implemented and is widely used within Google; more importantly from an
enterprise perspective, it is also available as an open source implementation through the
Hadoop project along with the HDFS distributed file system.

The MapReduce model is widely applicable to a number of parallel computations, including database-oriented tasks, which we cover later. Finally, we describe one more example, that of indexing a large collection of documents, or, for that matter, any data including database records. The map task consists of emitting a word-document/record id pair for each word:
(dk, [w1 . . . wn]) → [(wi, dk)].
The reduce step groups the pairs by word and creates an index entry for each word:
[(wi, dk)] → (wi, [di1 . . .dim]).

Indexing large collections is not only important in web search, but also a critical
aspect of handling structured data; so it is important to know that it can be executed
efficiently in parallel using MapReduce. Traditional parallel databases focus on rapid query
execution against data warehouses that are updated infrequently; as a result, these systems
often do not parallelize index creation sufficiently well.
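
A sketch of how such an indexing job could be written against Hadoop's Java API is given below; it assumes the newer org.apache.hadoop.mapreduce API and uses the input file name, obtained from the FileSplit, as the document id. This is an illustration, not code from the text:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexSketch {

  // Map: (offset, line) -> [(word, docId)]
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text docId = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Use the name of the input file as the document/record id
      docId.set(((FileSplit) context.getInputSplit()).getPath().getName());
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, docId);
      }
    }
  }

  // Reduce: (word, [docIds]) -> (word, comma-separated list of docIds)
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      // Duplicate ids (a word repeated within one document) are not removed in this sketch
      StringBuilder postings = new StringBuilder();
      for (Text docId : values) {
        if (postings.length() > 0) postings.append(",");
        postings.append(docId.toString());
      }
      context.write(key, new Text(postings.toString()));
    }
  }
}

A driver similar to the word-count one, with Text as both the output key and value class, would complete the job.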
Open in any browser:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

In hadoop/hadoop-1.2.1, create a text document, type something in it, and save it as test.txt.
 bin/hadoop fs -ls /

Found 1 items
drwxr-xr-x - vishal supergroup 0 2014-04-15 01:13 /tmp

 bin/hadoop fs -mkdir example


 bin/hadoop fs -ls /user/vishal/

Found 1 items
drwxr-xr-x - vishal supergroup /user/vishal/example
 bin/hadoop fs -copyFromLocal test.txt /user/vishal/example
 bin/hadoop jar hadoop-examples-1.2.1.jar wordcount /user/vishal/example/test.txt /hello

package com.wordcount.Example;

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WCE

{
public static class Map extends MapReduceBase
implements Mapper<LongWritable, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector <Text, IntWritable> output,
Reporter reporter) throws IOException
{
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens())
{
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException
{
int sum = 0;
while (values.hasNext())
{
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception
{
JobConf conf = new JobConf(WCE.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
FileInputFormat.addInputPath(conf, new Path(args[0]));
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}

Run the Application:


$ bin/hadoop jar wc.jar com.wordcount.Example.WCE /user/joe/wordcount/input /user/joe/wordcount/output

Conclusion: Hence we have implemented a MapReduce example, the Word Count program, which counts the number of times each word occurs in the given file.
Experiment No. 3
Write a more complete and advanced version of word count which has the following
additional features.

 Demonstrates how applications can access configuration parameters in the setup method of
the Mapper (and Reducer) implementations.

 Demonstrates how the DistributedCache can be used to distribute read-only data needed by the jobs. Here it
allows the user to specify word-patterns to skip while counting.

 Demonstrates the utility of the GenericOptionsParser to handle generic Hadoop command-line options.

 Demonstrates how applications can use Counters and how they can set application-specific status information
passed to the map (and reduce) method.

The following implementation adds the above mentioned features.


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.StringUtils;

public class WordCount2 {

public static class TokenizerMapper


extends Mapper<Object, Text, Text, IntWritable>{

static enum CountersEnum { INPUT_WORDS }

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();

private boolean caseSensitive;


private Set<String> patternsToSkip = new HashSet<String>();

private Configuration conf;


private BufferedReader fis;
@Override
public void setup(Context context) throws IOException,
InterruptedException {
conf = context.getConfiguration();
caseSensitive = conf.getBoolean("wordcount.case.sensitive", true);
if (conf.getBoolean("wordcount.skip.patterns", true)) {
URI[] patternsURIs = Job.getInstance(conf).getCacheFiles();
for (URI patternsURI : patternsURIs) {
Path patternsPath = new Path(patternsURI.getPath());
String patternsFileName = patternsPath.getName().toString();
parseSkipFile(patternsFileName);
}
}
}

private void parseSkipFile(String fileName) {


try {
fis = new BufferedReader(new FileReader(fileName));
String pattern = null;
while ((pattern = fis.readLine()) != null) {
patternsToSkip.add(pattern);
}
} catch (IOException ioe) {
System.err.println("Caught exception while parsing the cached file '"
+ StringUtils.stringifyException(ioe));
}
}

@Override
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
String line = (caseSensitive) ?
value.toString() : value.toString().toLowerCase();
for (String pattern : patternsToSkip) {
line = line.replaceAll(pattern, "");
}
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
Counter counter = context.getCounter(CountersEnum.class.getName(),
CountersEnum.INPUT_WORDS.toString());
counter.increment(1);
}
}
}

public static class IntSumReducer


extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values,


Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
GenericOptionsParser optionParser = new GenericOptionsParser(conf, args);
String[] remainingArgs = optionParser.getRemainingArgs();
if (remainingArgs.length != 2 && remainingArgs.length != 4) {
System.err.println("Usage: wordcount <in> <out> [-skip skipPatternFile]");
System.exit(2);
}
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount2.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

List<String> otherArgs = new ArrayList<String>();


for (int i=0; i < remainingArgs.length; ++i) {
if ("-skip".equals(remainingArgs[i])) {
job.addCacheFile(new Path(remainingArgs[++i]).toUri());
job.getConfiguration().setBoolean("wordcount.skip.patterns", true);
} else {
otherArgs.add(remainingArgs[i]);
}
}
FileInputFormat.addInputPath(job, new Path(otherArgs.get(0)));
FileOutputFormat.setOutputPath(job, new Path(otherArgs.get(1)));

System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Sample text files as input:


$ bin/hadoop fs -ls /user/joe/wordcount/input/
/user/joe/wordcount/input/file01
/user/joe/wordcount/input/file02

$ bin/hadoop fs -cat /user/joe/wordcount/input/file01


Hello World, Bye World!

$ bin/hadoop fs -cat /user/joe/wordcount/input/file02


Hello Hadoop, Goodbye to hadoop.

Run the application:


$ bin/hadoop jar wc.jar WordCount2 /user/joe/wordcount/input /user/joe/wordcount/output

Output:
$ bin/hadoop fs -cat /user/joe/wordcount/output/part-r-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1
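
The skip-pattern and case-sensitivity options handled in main() can also be exercised from the command line; for example (the patterns file path is illustrative and the file would first have to be copied into HDFS):

$ bin/hadoop jar wc.jar WordCount2 -Dwordcount.case.sensitive=false /user/joe/wordcount/input /user/joe/wordcount/output -skip /user/joe/wordcount/patterns.txt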
Experiment No.4
Installation of Hortonworks Sandbox 2.1

1. Open VMware Workstation 12.

2. Import the required virtual machine of HortonWorks Sandbox 2.1.


3. Power on the virtual machine and change the memory to 2GB.

4. Hortonworks Sandbox 2.1 is powered on and ready for use.


Experiment No.5
Introduction to Hadoop MapReduce

MapReduce is a framework using which we can write applications to process huge amounts of data, in
parallel, on large clusters of commodity hardware in a reliable manner.
MapReduce is a processing technique and a programming model for distributed computing based on Java.
The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set
of data and converts it into another set of data, where individual elements are broken down into
tuples (key/value pairs). The Reduce task takes the output from a map as an input and combines
those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is
always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers
and reducers. Decomposing a data processing application into mappers and reducers is sometimes
nontrivial. But, once we write an application in the MapReduce form, scaling the application to run
over hundreds, thousands, or even tens of thousands of machines in a cluster is merely a
configuration change. This simple scalability is what has attracted many programmers to use the
MapReduce model.

The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
o Map stage: The map or mapper’s job is to process the input data. Generally the input
data is in the form of file or directory and is stored in the Hadoop file system
(HDFS). The input file is passed to the mapper function line by line. The mapper
processes the data and creates several small chunks of data.
o Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage. The
Reducer's job is to process the data that comes from the mapper. After processing, it produces a
new set of output, which will be stored in the HDFS.
 During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers
in the cluster.
 The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
 Most of the computing takes place on nodes with data on local disks that reduces the network
traffic.
 After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
1. Input Phase: This is the input data / file to be processed.
2. Split Phase: Hadoop splits the incoming data into smaller pieces called "splits".
3. Map Phase: In this step, MapReduce processes each split according to the logic defined in
map() function. Each mapper works on each split at a time. Each mapper is treated as a task
and multiple tasks are executed across different TaskTrackers and coordinated by the
JobTracker.
4. Combine Phase: This is an optional step and is used to improve the performance by reducing
the amount of data transferred across the network. Combiner is the same as the reduce step
and is used for aggregating the output of the map() function before it is passed to the
subsequent steps.
5. Shuffle & Sort Phase: In this step, outputs from all the mappers is shuffled, sorted to put
them in order, and grouped before sending them to the next step.
6. Reduce Phase: This step is used to aggregate the outputs of mappers using the reduce()
function. Output of reducer is sent to the next and final step. Each reducer is treated as a task
and multiple tasks are executed across different TaskTrackers and coordinated by the
JobTracker.
7. Output Phase: Finally, the output of the reduce step is written to a file in HDFS.
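
As a small illustration of how the shuffle routes map output keys to reducers, the sketch below mimics Hadoop's default hash partitioning (partition = hash of the key modulo the number of reducers); the keys are made up for illustration and this is not framework code:

import java.util.*;

public class ShufflePartitionSketch {
    // Mimics the default hash partitioner: partition = (hashCode & MAX_VALUE) % numReducers
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 2; // two reduce tasks
        String[] mapOutputKeys = {"hello", "hadoop", "world", "bye"}; // hypothetical map output keys

        // Group the keys by the reducer that would receive them during the shuffle
        Map<Integer, List<String>> perReducer = new TreeMap<>();
        for (String key : mapOutputKeys) {
            perReducer.computeIfAbsent(partitionFor(key, numReducers), r -> new ArrayList<>()).add(key);
        }
        for (Map.Entry<Integer, List<String>> e : perReducer.entrySet()) {
            System.out.println("reducer " + e.getKey() + " receives keys " + e.getValue());
        }
    }
}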
Experiment No.6

Demonstration of preprocessing on dataset student.arff


Aim: This experiment illustrates some of the basic data preprocessing operations that can be
performed using WEKA-Explorer. The sample dataset used for this example is the student data
available in arff format.

Step1: Loading the data. We can load the dataset into weka by clicking on open button in
preprocessing interface and selecting the appropriate file.

Step 2: Once the data is loaded, Weka will recognize the attributes and, during the scan of the data,
will compute some basic statistics on each attribute. The left panel shows the list of recognized
attributes, while the top panel indicates the names of the base relation or table and the current
working relation (which are the same initially).

Step 3: Clicking on an attribute in the left panel will show the basic statistics on that attribute. For
categorical attributes the frequency of each attribute value is shown, while for continuous attributes
we can obtain the min, max, mean, standard deviation, etc.

Step 4: The visualization in the right-hand panel is in the form of a cross-tabulation across two
attributes.

Note: we can select another attribute using the dropdown list.

Step 5: Selecting or filtering attributes

Removing an attribute: When we need to remove an attribute, we can do this by using the attribute
filters in Weka. In the Filter panel, click on the Choose button. This will show a popup window
with a list of available filters.

Scroll down the list and select the “weka.filters.unsupervised.attribute.Remove” filter.

Step 6: a) Next, click the text box immediately to the right of the Choose button. In the resulting
dialog box enter the index of the attribute to be filtered out.

b) Make sure that the invertSelection option is set to false. Then click OK. Now in the filter box you
will see “Remove -R 7”.

c) Click the Apply button to apply the filter to this data. This will remove the attribute and create a new
working relation.

d) Save the new working relation as an ARFF file by clicking the Save button on the top panel.
(student.arff)
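
The same attribute removal can also be scripted with Weka's Java API; a minimal sketch, assuming weka.jar is on the classpath and using the attribute index from the GUI step above (the index is illustrative):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttribute {
    public static void main(String[] args) throws Exception {
        // Load the dataset (equivalent to the Open file step in the Explorer)
        Instances data = DataSource.read("student.arff");

        // Configure the Remove filter: drop attribute index 7, invert selection off
        Remove remove = new Remove();
        remove.setAttributeIndices("7");
        remove.setInvertSelection(false);
        remove.setInputFormat(data);

        // Apply the filter to create the new working relation
        Instances newData = Filter.useFilter(data, remove);
        System.out.println(newData.numAttributes() + " attributes remain");
    }
}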
Discretization

1) Sometimes association rule mining can only be performed on categorical data. This requires
performing discretization on numeric or continuous attributes. In the following example let us
discretize the age attribute.

Let us divide the values of the age attribute into three bins (intervals).

First load the dataset into Weka (student.arff).

Select the age attribute.

Activate the filter dialog box and select “weka.filters.unsupervised.attribute.Discretize” from the list.

To change the defaults for the filter, click on the box immediately to the right of the Choose button.

We enter the index of the attribute to be discretized. In this case the attribute is age, so we must
enter ‘1’, corresponding to the age attribute.

Enter ‘3’ as the number of bins. Leave the remaining field values as they are.

Click the OK button.

Click Apply in the filter panel. This will result in a new working relation with the selected
attribute partitioned into 3 bins.

Save the new working relation in a file called student-data-discretized.arff.

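The discretization steps above can also be scripted with Weka's Java API. A minimal sketch follows, assuming weka.jar is on the classpath and a version of the dataset in which age is a numeric attribute (the unsupervised Discretize filter leaves nominal attributes unchanged):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeStudent {
    public static void main(String[] args) throws Exception {
        // Load the dataset (equivalent to the Open file step in the Explorer)
        Instances data = DataSource.read("student.arff");

        // Configure the unsupervised Discretize filter: attribute index 1, 3 bins
        Discretize discretize = new Discretize();
        discretize.setAttributeIndices("1");
        discretize.setBins(3);
        discretize.setInputFormat(data);

        // Apply the filter to obtain the new working relation
        Instances discretized = Filter.useFilter(data, discretize);

        // Save the result (equivalent to the Save button in the Explorer)
        ArffSaver saver = new ArffSaver();
        saver.setInstances(discretized);
        saver.setFile(new File("student-data-discretized.arff"));
        saver.writeBatch();
    }
}
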
Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low, medium, high}

@attribute student {yes, no}

@attribute credit-rating {fair, excellent}

@attribute buyspc {yes, no}

@data

%
<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

The following screenshot shows the effect of discretization.


Experiment No.7

Demonstration of classification rule process on dataset employee.arff using naïve bayes


algorithm

Aim: This experiment illustrates the use of the naïve Bayes classifier in Weka. The sample data set
used in this experiment is the "employee" data available in ARFF format. This document assumes that
appropriate data preprocessing has been performed.

Steps involved in this experiment:

Step 1: We begin the experiment by loading the data (employee.arff) into Weka.

Step 2: Next we select the "Classify" tab and click the "Choose" button to select the "NaiveBayes" classifier.

Step 3: Now we specify the various parameters. These can be specified by clicking in the text box
to the right of the Choose button. In this example, we accept the default values.

Step 4: Under the "Test options" in the main panel, we select 10-fold cross-validation as our
evaluation approach. Since we do not have a separate evaluation data set, this is necessary to get a
reasonable idea of the accuracy of the generated model.

Step 5: We now click "Start" to generate the model. The textual form of the model as well as the
evaluation statistics will appear in the right panel when the model construction is complete.

Step 6: Note that the classification accuracy of the model is about 69%. This indicates that more
work may be needed (either in preprocessing or in selecting better parameters for the classification).

Step 7: Weka also lets us view the results graphically. This can be done by right-clicking the last
result set and selecting a visualization option (for example, "Visualize classifier errors") from the
pop-up menu.

Step 8: We will use our model to classify new instances.

Step 9: In the main panel, under "Test options", click the "Supplied test set" radio button and then
click the "Set" button. This will show a pop-up window which will allow you to open the file
containing the test instances.
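
The same experiment can also be reproduced with Weka's Java API; a minimal sketch, assuming weka.jar is on the classpath, employee.arff is in the working directory, and the last attribute (performance) is the class:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesEmployee {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute (performance) as the class
        Instances data = DataSource.read("employee.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Build the naive Bayes model on the full training data
        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);

        // 10-fold cross-validation, as selected under Test options in the Explorer
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(nb);
        System.out.println(eval.toSummaryString());
    }
}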
Data set employee.arff:

@relation employee

@attribute age {25, 27, 28, 29, 30, 35, 48}

@attribute salary {10k, 15k, 17k, 20k, 25k, 30k, 32k, 34k, 35k}

@attribute performance {good, avg, poor}

@data

25, 10k, poor

27, 15k, poor

27, 17k, poor

28, 17k, poor

29, 20k, avg

30, 25k, avg

29, 25k, avg

30, 20k, avg

35, 32k, good

48, 34k, good

48, 32k, good

The following screenshot shows the output generated when the naive Bayes algorithm is applied to
the given dataset.
Experiment No.8
Demonstration of clustering rule process on dataset iris.arff using simple k-means

Aim: This experiment illustrates the use of simple k-means clustering with the Weka explorer. The
sample data set used for this example is based on the iris data available in ARFF format. This
document assumes that appropriate preprocessing has been performed. The iris dataset includes
150 instances.

Steps involved in this Experiment

Step 1: Run the Weka explorer and load the data file iris.arff in preprocessing interface.

Step 2: In order to perform clustering, select the 'Cluster' tab in the explorer and click on the
Choose button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case we select 'SimpleKMeans'.

Step 4: Next click the text box to the right of the Choose button to get the popup window shown in
the screenshots. In this window we enter six as the number of clusters and we leave the seed value
as it is. The seed value is used in generating a random number which is used for making the
internal assignments of instances to clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode'
panel we must make sure that the 'Use training set' option is selected, and then we click the 'Start'
button. This process and the resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics on the number
and the percentage of instances assigned to the different clusters. Here each cluster centroid is a
mean vector for that cluster, and it can be used to characterize the cluster. For example, the centroid
of cluster 1 shows that for the class Iris-versicolor the mean value of sepal length is 5.4706, sepal
width 2.4765, petal width 1.1294 and petal length 3.7941.

Step 7: Another way of understanding the characteristics of each cluster is through visualization.
To do this, right-click the result set in the result list panel and select 'Visualize cluster
assignments'.

The following screenshot shows the clustering results that were generated when the simple k-means
algorithm is applied on the given dataset.
Interpretation of the above visualization

From the above visualization, we can understand the distribution of sepal length and petal length
in each cluster. For instance, each cluster is dominated by petal length. In this case, by changing
the colour dimension to other attributes we can see their distribution within each of the clusters.

Step 8: We can save the resulting dataset, which includes each instance along with its assigned
cluster. To do so, we click the Save button in the visualization window and save the result as
iris-k-means. The top portion of this file is shown in the following figure.
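
These clustering steps can also be scripted with Weka's Java API; a minimal sketch, assuming weka.jar is on the classpath and iris.arff is in the working directory:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansIris {
    public static void main(String[] args) throws Exception {
        // Load the dataset; no class index is set, so all attributes are used for clustering
        Instances data = DataSource.read("iris.arff");

        // Configure simple k-means: six clusters, default seed, as entered in the Explorer
        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(6);
        kmeans.setSeed(10);
        kmeans.buildClusterer(data);

        // Print the cluster centroids and evaluate the clusterer on the training set
        System.out.println(kmeans);
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(kmeans);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}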
Experiment No.9

Demonstration of clustering rule process on dataset student.arff using simple k-means

Aim: This experiment illustrates the use of simple k-means clustering with the Weka explorer. The
sample data set used for this example is based on the student data available in ARFF format. This
document assumes that appropriate preprocessing has been performed. The student dataset
includes 14 instances.

Steps involved in this Experiment

Step 1: Run the Weka explorer and load the data file student.arff in preprocessing interface.

Step 2: In order to perform clustering, select the 'Cluster' tab in the explorer and click on the
Choose button. This step results in a dropdown list of available clustering algorithms.

Step 3: In this case we select 'SimpleKMeans'.

Step 4: Next click the text box to the right of the Choose button to get the popup window shown in
the screenshots. In this window we enter six as the number of clusters and we leave the seed value
as it is. The seed value is used in generating a random number which is used for making the
internal assignments of instances to clusters.

Step 5: Once the options have been specified, we run the clustering algorithm. In the 'Cluster mode'
panel we must make sure that the 'Use training set' option is selected, and then we click the 'Start'
button. This process and the resulting window are shown in the following screenshots.

Step 6: The result window shows the centroid of each cluster as well as statistics on the number
and the percentage of instances assigned to the different clusters. Here each cluster centroid is a
mean vector for that cluster, and it can be used to characterize the cluster.

Step 7: Another way of understanding the characteristics of each cluster is through visualization.
To do this, right-click the result set in the result list panel and select 'Visualize cluster
assignments'.

Interpretation of the above visualization

From the above visualization, we can understand the distribution of age and instance number in
each cluster. For instance, each cluster is dominated by age. In this case, by changing the colour
dimension to other attributes we can see their distribution within each of the clusters.

Step 8: We can save the resulting dataset, which includes each instance along with its assigned
cluster. To do so, we click the Save button in the visualization window and save the result as
student-k-means. The top portion of this file is shown in the following figure.
Dataset student.arff

@relation student

@attribute age {<30,30-40,>40}

@attribute income {low,medium,high}

@attribute student {yes,no}

@attribute credit-rating {fair,excellent}

@attribute buyspc {yes,no}

@data

<30, high, no, fair, no

<30, high, no, excellent, no

30-40, high, no, fair, yes

>40, medium, no, fair, yes

>40, low, yes, fair, yes

>40, low, yes, excellent, no

30-40, low, yes, excellent, yes

<30, medium, no, fair, no

<30, low, yes, fair, no

>40, medium, yes, fair, yes

<30, medium, yes, excellent, yes

30-40, medium, no, excellent, yes

30-40, high, yes, fair, yes

>40, medium, no, excellent, no

The following screenshot shows the clustering results that were generated when the simple k-means
algorithm is applied on the given dataset.
