
To build the class and create the jar:
javac -classpath ~/hadoop/hadoop-mapreduce-client-core-2.5.2.jar:commons-cli-1.2.jar -d classes MarketRatings.java && jar -cvf MarketRating.jar -C classes/ .

Error: ssh permissions are too open
chmod 400 ~/.ssh/id_rsa

-------------Prepare to Start the Hadoop Cluster:------------------
Unpack the downloaded Hadoop distribution. In the distribution, edit the file etc/hadoop/hadoop-env.sh with the following command:
sudo vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
and define the parameters as follows:
# set to the root of your Java installation
export JAVA_HOME=/usr/java/latest
# Assuming your installation directory is /usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
-------------Setting Global Variables (for both users)--------------------
sudo vi ~/.bashrc
Add the following at the end of the file:
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
#export JAVA_HOME=/usr/java/default
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
After adding the above lines, reload the settings using: source ~/.bashrc
Try the following command:
$ hadoop
---------------------------------Standalone Operation:---------------------------
(might not run if Hadoop has already been set up for pseudo-distributed mode, i.e. a "Connection refused" error)
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
$ mkdir input
$ cp /usr/local/hadoop/etc/hadoop/*.xml input

$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
$ cat output/*
or
$ hdfs dfs -cat /user/hduser/output/part-r-00000
---------------------------------Pseudo-Distributed Operation:---------------------------
Hadoop can also be run on a single node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.
Configuration
Use the following:
etc/hadoop/core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
-----------------------------------------Execution:----------------
The following instructions are to run a MapReduce job locally. If you want to execute a job on YARN, see YARN on Single Node.
1. Format the filesystem:
$ hdfs namenode -format
2. Start NameNode daemon and DataNode daemon:
$ start-dfs.sh
use 'jps' to check the running daemons
The Hadoop daemon log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
3. Browse the web interface for the NameNode; by default it is available at:
NameNode - http://localhost:50070/
4. Make the HDFS directories required to execute MapReduce jobs:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
5. Copy the input files into the distributed filesystem:
$ hdfs dfs -put /usr/local/hadoop/etc/hadoop input
6. Run some of the examples provided:
$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar grep input output 'dfs[a-z.]+'
7. Examine the output files:
Copy the output files from the distributed filesystem to the local filesystem and examine them:
$ hdfs dfs -get output output
$ cat output/*
or
View the output files on the distributed filesystem:
$ hdfs dfs -cat output/*
8. When you're done, stop the daemons with:

$ stop-dfs.sh
-----------------------YARN on Single Node:--------------------------------
You can run a MapReduce job on YARN in a pseudo-distributed mode by setting a few parameters and additionally running the ResourceManager daemon and the NodeManager daemon.
The following instructions assume that steps 1-4 of the above instructions have already been executed.
1. Configure parameters as follows:
etc/hadoop/mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
etc/hadoop/yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
2. Start ResourceManager daemon and NodeManager daemon:
$ start-yarn.sh
use 'jps' to check the running daemons
3. Browse the web interface for the ResourceManager; by default it is available at:
ResourceManager - http://localhost:8088/
4. Run a MapReduce job.
5. When you're done, stop the daemons with:
$ stop-yarn.sh
Inputs and Outputs for the WordCount example:
The MapReduce framework operates exclusively on <key, value> pairs, that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
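As an illustration of these requirements (this sketch is not part of the WordCount example; the class name WordKey and its single field are assumptions made purely for illustration), a custom key class would have to implement both serialization and comparison roughly like this:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical custom key type: serializable (Writable) and sortable (Comparable)
public class WordKey implements WritableComparable<WordKey> {
  private String word = "";

  public void set(String w) { this.word = w; }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(word);                 // serialize the key for the framework
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    word = in.readUTF();                // deserialize the key
  }

  @Override
  public int compareTo(WordKey other) {
    return word.compareTo(other.word);  // sort order used between map and reduce
  }

  @Override
  public int hashCode() {
    return word.hashCode();             // used by the default HashPartitioner
  }

  @Override
  public boolean equals(Object o) {
    return (o instanceof WordKey) && word.equals(((WordKey) o).word);
  }

  @Override
  public String toString() {
    return word;
  }
}

For WordCount itself no custom key class is needed: the built-in Text and IntWritable types already implement these interfaces.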
----------------Example: WordCount v1.0------------------------------
Before we jump into the details, let's walk through an example MapReduce application to get a flavour for how they work.
WordCount is a simple application that counts the number of occurrences of each word in a given input set.
This works with a local-standalone, pseudo-distributed or fully-distributed Hadoop installation (Single Node Setup).
SOURCE CODE:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {

  // Mapper: emits (word, 1) for every token in each input line
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // args[0]: HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // args[1]: HDFS output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Compile WordCount.java and create a jar (put WordCount.java in the default directory, i.e. /home/hduser):
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf wc.jar WordCount*.class
Put the newly created wc.jar in /usr/local/hadoop/
Assuming that:
/user/hduser/input - input directory in HDFS
/user/hduser/output - output directory in HDFS
Make the input directories:
$ hdfs dfs -mkdir /user
$ hdfs dfs -mkdir /user/hduser
$ hdfs dfs -mkdir /user/hduser/input
$ hdfs dfs -mkdir /user/hduser/output (optional to create, might give an "already exists" error)
(the job doesn't run if the output folder already exists in HDFS)
Let's copy file01 and file02 from /home/hduser/ to the distributed storage:
$ hdfs dfs -put /home/hduser/file01 /user/hduser/input/file01
$ hdfs dfs -put /home/hduser/file02 /user/hduser/input/file02
or, using relative paths (these land directly in the HDFS home directory /user/hduser rather than in input/):
$ hdfs dfs -put /home/hduser/file01 file01
$ hdfs dfs -put /home/hduser/file02 file02
To check the HDFS files:
$ hdfs dfs -ls /user/hduser
To remove files or directories from HDFS, run:
$ hdfs dfs -rm -r <directory/files>
Sample text-files as input:
$ hdfs dfs -ls /user/hduser/input/
/user/hduser/input/file01
/user/hduser/input/file02
$ hdfs dfs -cat /user/hduser/input/file01
Hello World Bye World
$ hdfs dfs -cat /user/hduser/input/file02
Hello Hadoop Goodbye Hadoop
Run the application:
$ hadoop jar wc.jar WordCount /user/hduser/input /user/hduser/output
Output (without copying the output file to the local system):
$ hdfs dfs -cat /user/hduser/output/part-r-00000
Bye 1
Goodbye 1
Hadoop 2
Hello 2
World 2
Let's copy the output file from distributed storage to the local disk:
$ hadoop fs -copyToLocal /user/hduser/output/ /home/hduser/hadoopoutputfolder
When you're done, stop the daemons with:
$ stop-dfs.sh
$ stop-yarn.sh
