Prerequisites
Sun Java 6. Verify Java as below:
# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
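The output below comes from generating an SSH key for root. The generating command itself is not reproduced in this copy; it is typically:
$ ssh-keygen -t rsa -P ""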
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 root@ubuntu
The key's randomart image is:
[...snipp...]
The -P "" option creates an RSA key pair with an empty password.
Second, you have to enable SSH access to your local machine with this newly created key.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine as the root user. This step is also
needed to save your local machine's host key fingerprint to the root user's known_hosts file.
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
# ssh localhost
Last login: Mon Dec 3 21:36:02 2012 from localhost.localdomain
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user root. If you use a shell other than bash, you should of course
update its appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/hadoop/hadoop
(Change the directory according to your installation.)
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/hadoop/jdk1.6.0_17
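It can also be convenient to put the Hadoop scripts on the PATH (an assumed convenience addition; the rest of this guide spells out full paths):
# Add the Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin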
conf/*-site.xml
Now we create the directory and set the required ownerships and permissions:
$ mkdir -p /app/hadoop/tmp
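The ownership and permission commands are not shown above; since this guide runs Hadoop as root, a minimal sketch that restricts the directory to the owning user would be (adjust the owner if you run Hadoop as a different user):
$ chown root:root /app/hadoop/tmp
$ chmod 750 /app/hadoop/tmp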
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system </description>
</property>
In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs at.</description>
</property>
In file conf/hdfs-site.xml:
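The snippet for this file is not reproduced above; for a single-node setup it typically just sets the block replication factor to 1 (a minimal sketch using the standard dfs.replication property):
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>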
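The log excerpt below comes from formatting the HDFS filesystem via the NameNode. The command that produces it is not shown in this copy; it is typically run once, before starting the cluster for the first time:
$ /hadoop/hadoop/bin/hadoop namenode -format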
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$
hduser@ubuntu:~$ /hadoop/hadoop/bin/start-all.sh
This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker on your machine.
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun's Java since v1.5.0).
hduser@ubuntu:$ jps
(if jps is not found, go to /Software/JDK/bin and execute ./jps)
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
You can also check with netstat if Hadoop is listening on the configured ports.
hduser@ubuntu:~$ netstat -plten | grep java
[netstat output truncated in this copy: each line shows a tcp socket owned by one of the Hadoop Java processes, listening on its configured port]
hduser@ubuntu:~$
If there are any errors, examine the log files in the /logs/ directory.
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Lab 1 - HDFS
Lab objectives
In this lab you will practice the HDFS command-line interface.
Lab instructions
This lab has been developed as a tutorial. Simply execute the commands provided, and analyze the
results.
> hadoop fs -ls /
> hadoop fs -ls /app
2. There are many commands you can run within the Hadoop filesystem. For example, to make the
directory test you can issue the following command:
> hadoop fs -mkdir test
> hadoop fs -ls /
> hadoop fs -ls /user/root
You will notice that the test directory got created under the /user/root directory. This is because
as the root user, your default path is /user/root and thus if you don't specify an absolute path all
HDFS commands work out of /user/root (this will be your default working directory).
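For example, because /user/root is the default working directory, the following two commands refer to the same HDFS directory:
> hadoop fs -ls test
> hadoop fs -ls /user/root/test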
3. You should be aware that you can pipe (using the | character) the output of any HDFS command
to the Linux shell. For example, you can easily use grep with HDFS by doing the following:
> hadoop fs -ls /user/root
> hadoop fs -ls /user/root | grep test
As you can see, the grep command only returned the lines which had test in them (thus
removing the "Found x items" line and the oozie-root directory from the listing).
4. In order to move files between your regular Linux filesystem and HDFS you will likely use the put
and get commands. First, move a single file to the Hadoop filesystem.
Copy pg20417.txt from the software folder to /hadoop, then put it into HDFS:
> hadoop fs -put /hadoop/pg20417.txt pg20417.txt
> hadoop fs -ls /user/root
You should now see a new file called /user/root/pg20417.txt listed. In order to view the contents of
this file we will use the -cat command as follows:
> hadoop fs -cat pg20417.txt
You should see the output of the pg20417.txt file (that is stored in HDFS). We can also use the Linux
diff command to see if the file we put on HDFS is actually the same as the original on the local
filesystem. You can do this as follows:
> hadoop fs -cat pg20417.txt | diff - /hadoop/pg20417.txt
Since the diff command produces no output we know that the files are the same (the diff
command prints all the lines in the files that differ).
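The get command works in the opposite direction of put; a small sketch (the local target path is just an example):
> hadoop fs -get pg20417.txt /tmp/pg20417-copy.txt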
In order to use HDFS commands recursively you generally add an "r" to the HDFS command (in
the Linux shell this is generally done with the "-R" argument). For example, to do a recursive
listing we'll use the -lsr command rather than just -ls. Try this:
> hadoop fs -ls /user
> hadoop fs -lsr /user
2. In order to find the size of files you need to use the -du or -dus commands. Keep in mind that
these commands return the file size in bytes. To find the size of the pg20417.txt file use the
following command:
> hadoop fs -du /user/root/pg20417.txt
To find the size of all files individually in the /user/root directory use the following command:
> hadoop fs -du /user/root
To find the total size of all files in the /user/root directory use the following command:
> hadoop fs -dus /user/root
3. If you would like to get more information about a given command, invoke -help as follows:
> hadoop fs -help
For example, to get help on the dus command you'd do the following:
> hadoop fs -help dus
------ This is the end of this lab -----
System -> Network Config -> DNS
Hostname - Activate
/etc/sysconfig/network
Go to log folder:
# cd /hadoop/hadoop/logs
# ls
# vi hadoop.txt
hadoop fsck /
hadoop version
Fair Scheduler
Installation
To run the fair scheduler in your Hadoop installation, you need to put it on the CLASSPATH.
Copy hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar from
/hadoop/hadoop-2.0.0-mr1-cdh4.1.0/contrib/fairscheduler to $HADOOP_HOME/lib using the
following command:
cp /hadoop/hadoop-2.0.0-mr1-cdh4.1.0/contrib/fairscheduler/hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar $HADOOP_HOME/lib
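Putting the jar on the classpath is not quite enough on its own; the JobTracker also has to be told to use the fair scheduler in conf/mapred-site.xml. A minimal sketch, using the property documented for the MR1 fair scheduler:
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>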
Once you restart the cluster, you can check that the fair scheduler is running by going to
http://<jobtracker URL>/scheduler on the JobTracker's web UI.
A "job scheduler administration" page should be visible there.
http://192.168.1.5:50030/scheduler
Run the MapReduce job from two telnet windows and observe the scheduler page.
Create two different folders for input and output.
Run the jobs with different input and output folders.
These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give
them a try.
Pig Installation
# tar xzf pig-0.10.0.tar.gz
#export PIG_HOME=/hadoop/pig-0.10.0
#export PATH=$PATH:$PIG_HOME/bin
#bash
$ cd /hadoop/pig-0.10.0/
Set the following value in conf/pig.properties: exectype=local
$ bin/pig -x local
Enter the following commands in the Grunt shell:
log = LOAD '/hadoop/pig-0.10.0/tutorial/data/excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';
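If you just want to see the grouped counts on the console instead of storing them, DUMP works as well:
DUMP cntd;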
# quit
file:///hadoop/pig-0.10.0/tutorial/data/output
Results:
Hbase - Tutorial
3) Edit conf/hbase-site.xml so that it looks like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///hadoop/hbase</value>
</property>
</configuration>
Edit conf/hbase-env.sh, uncommenting the JAVA_HOME line and pointing it to your Java install.
Start hadoop
Start HBase
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
Shell Exercises
Connect to your running HBase via the shell.
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
hbase(main):001:0>
Create a table named test with a single column family named cf. Verify its creation by listing all tables and then insert
some values.
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
Above we inserted three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in
HBase are comprised of a column family prefix -- cf in this example -- followed by a colon and then a column qualifier
suffix (a in this case).
Scan the table to verify the inserts; the scan output lists row1, row2, and row3 under ROW together with their COLUMN+CELL values.
Get a single row; for row1 the output shows COLUMN cf:a with CELL timestamp=1288380727188, value=value1.
Now, disable and drop your table. This will clean up all done above.
hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds
Stopping HBase
Stop your hbase instance by running the stop script.
$ ./bin/stop-hbase.sh
stopping hbase...............
Hive - Tutorial
Installing Hive
$ tar -xzvf hive-0.9.0.tar.gz -C /hadoop
Set the environment variable HIVE_HOME in $HOME/.bashrc to point to the installation directory:
$ export HIVE_HOME=/hadoop/hive-0.9.0
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
Running Hive
Hive uses Hadoop, which means HDFS must be running and the following directories must exist and
be group-writable:
$HADOOP_HOME/bin/hadoop fs -mkdir     /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir     /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Error log:
/tmp/<user.name>/hive.log
To start the Hive shell, type:
#bash
#hive
Create a table called pokes with two columns, the first being an integer and the other a string:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Create a table called invites with the same two columns plus a partition column ds:
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
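To verify that the tables were created, the usual HiveQL statements can be used (a small sketch):
hive> SHOW TABLES;
hive> DESCRIBE invites;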
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
Dropping tables:
hive> DROP TABLE invites;
# quit;
Hive - Bucket
Input Types
Create one Java project, InputTypes, and create the following classes.
package com.hp.types;
// == JobBuilder
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
public class JobBuilder {
private final Class<?> driverClass;
private final Job job;
private final int extraArgCount;
private final String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException {
this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
this.driverClass = driverClass;
this.extraArgCount = extraArgCount;
this.job = new Job();
this.job.setJarByClass(driverClass);
this.extrArgsUsage = extrArgsUsage;
}
// vv JobBuilder
public static Job parseInputAndOutput(Tool tool, Configuration conf,
String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>");
return null;
}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
}
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException {
Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 || otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) {
overwrite = true;
index++;
}
Path input = new Path(otherArgs[index++]);
Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
}
if (overwrite) {
output.getFileSystem(conf).delete(output, true);
}
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
return this;
}
public Job build() {
return job;
}
public String[] getExtraArgs() {
return extraArgs;
}
}
package com.hp.types;
// cc SmallFilesToSequenceFileConverter A MapReduce program for packaging a collection of small files as a single SequenceFile
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
//vv SmallFilesToSequenceFileConverter
public class SmallFilesToSequenceFileConverter extends Configured
implements Tool {
static class SequenceFileMapper
extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
private Text filenameKey;
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
InputSplit split = context.getInputSplit();
Path path = ((FileSplit) split).getPath();
filenameKey = new Text(path.toString());
}
@Override
protected void map(NullWritable key, BytesWritable value, Context context)
throws IOException, InterruptedException {
context.write(filenameKey, value);
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(WholeFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BytesWritable.class);
job.setMapperClass(SequenceFileMapper.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
args = new String[2];
args[0]="input";
args[1]="output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
System.exit(exitCode);
}
}
// ^^ SmallFilesToSequenceFileConverter
package com.hp.types;
// cc WholeFileInputFormat An InputFormat for reading a whole file as a record
import java.io.IOException;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.*;
//vv WholeFileInputFormat
public class WholeFileInputFormat
extends FileInputFormat<NullWritable, BytesWritable> {
@Override
protected boolean isSplitable(JobContext context, Path file) {
return false;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException,
InterruptedException {
WholeFileRecordReader reader = new WholeFileRecordReader();
reader.initialize(split, context);
return reader;
}
}
//^^ WholeFileInputFormat
package com.hp.types;
// cc WholeFileRecordReader The RecordReader used by WholeFileInputFormat for reading a whole file as a record
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
//vv WholeFileRecordReader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
private FileSplit fileSplit;
private Configuration conf;
private BytesWritable value = new BytesWritable();
private boolean processed = false;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
this.fileSplit = (FileSplit) split;
this.conf = context.getConfiguration();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
@Override
public BytesWritable getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
// do nothing
}
}
//^^ WholeFileRecordReader
Verify the result as follows, using the HDFS view and the JobTracker web UI.
From two single-node clusters to a multi-node cluster
We will build a multi-node cluster using two Red Hat boxes in this tutorial. The best way to do this for starters is to
install, configure and test a local Hadoop setup for each of the two RH boxes, and in a second step to merge these two
single-node clusters into one multi-node cluster in which one RH box will become the designated master (but also act as
a slave with regard to data storage and processing), and the other box will become only a slave. It's much easier to track
down any problems you might encounter due to the reduced complexity of doing a single-node cluster setup first on
each machine.
Prerequisites
Configuring single-node clusters first in both the VM
Use the earlier tutorial.
Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one RH
box the master (which will also act as a slave) and the other RH box a slave.
We will call the designated master machine just the master from now on and the slave-only machine the
slave. We will also give the two machines these respective hostnames in their networking setup, most notably
in /etc/hosts. If the hostnames of your machines are different (e.g. node01) then you must adapt the
settings in this tutorial as appropriate.
Shut down each single-node cluster with bin/stop-all.sh before continuing if you haven't done so already.
Update /etc/hosts
Networking
Both machines must be able to reach each other over the network.
Update /etc/hosts on both machines with the following lines:
# vi /etc/hosts (for master AND slave)
10.72.47.42    master
10.72.47.27    slave
SSH access
The root user on the master must be able to connect
a) to its own user account on the master, i.e. ssh localhost, and
b) to the root user account on the slave via a password-less SSH login.
You have to add root@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to
the authorized_keys file of root@slave (in this user's $HOME/.ssh/authorized_keys).
You can do this manually or use the following SSH command:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave
This command will prompt you for the login password for user root on slave, then copy the public SSH key for you,
creating the correct directory and fixing the permissions as necessary.
The final step is to test the SSH setup by connecting with user root from the master to the user account root on
the slave. The step is also needed to save the slave's host key fingerprint to root@master's known_hosts file.
So, connecting from master to master:
$ ssh master
Hadoop Cluster Overview
The master node will run the master daemons for each layer: NameNode for the HDFS storage layer, and JobTracker
for the MapReduce processing layer. Both machines will run the slave daemons: DataNode for the HDFS layer, and
TaskTracker for MapReduce processing layer. Basically, the master daemons are responsible for coordination and
management of the slave daemons while the latter will do the actual data storage and data processing work.
Configuration
conf/masters (master only)
On master, update conf/masters so that it looks like this:
master
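conf/slaves (master only)
The contents of this file are not reproduced here; in this two-box setup it typically lists every machine that should run DataNode and TaskTracker daemons, i.e. both boxes (a sketch, assuming the hostnames used above):
master
slave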
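conf/core-site.xml (fs.default.name)
First, and on ALL machines, the fs.default.name variable in conf/core-site.xml, which specifies the NameNode (the HDFS master) host and port, has to point at the master; the "Second, ..." step below refers back to this change. A minimal sketch, mirroring the single-node snippet earlier in this guide:
<!-- In: conf/core-site.xml (ALL machines) -->
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system </description>
</property>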
conf/mapred-site.xml (mapred.job.tracker)
Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies
the JobTracker (MapReduce master) host and port. Again, this is the master in our case.
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs at </description>
</property>
Background: The HDFS name table is stored on the NameNode's (here: master) local filesystem in the directory
specified by dfs.name.dir. The name table is used by the NameNode to store tracking and coordination information
for the DataNodes.
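The HDFS daemons are started first; the output pages are not reproduced in this copy, but as with the stop command later in this guide, start-dfs.sh is run on master, bringing up the NameNode on master and DataNodes on the machines listed in conf/slaves:
$ bin/start-dfs.sh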
On slave, you can examine the success or failure of this command by inspecting the log file logs/. Exemplary
output:
As you can see in the slave's output above, it will automatically format its storage directory (specified
by dfs.data.dir) if it is not formatted already. It will also create the directory if it does not exist yet.
At this point, the following Java processes should run on master
# jps
ww.hpottech.com
23
ww.hpottech.com
24
MapReduce daemons
Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up
the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/start-mapred.sh on master:
$ bin/start-mapred.sh
On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-
MapReduce daemons
Run the command bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce
cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-mapred.sh on master:
$ bin/stop-mapred.sh
(Note: The output above might suggest that the JobTracker was running and stopped on slave, but you can be assured
that the JobTracker ran on master.)
At this point, the following Java processes should run on master
$ jps
HDFS daemons
Run the command bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the
NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in
the conf/slaves file.
In our case, we will run bin/stop-dfs.sh on master:
$ bin/stop-dfs.sh
At this point, only the following Java processes should run on master
$ jps
If you want to inspect the job's output data, just retrieve the job result from HDFS to your local filesystem.
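For example, with hadoop fs -get (the directory names here are only placeholders for your actual job output):
$ bin/hadoop fs -get output /tmp/output
$ ls /tmp/output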
Joins
package com.hp.join;
// == JobBuilder
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
public class JobBuilder {
private final Class<?> driverClass;
private final Job job;
private final int extraArgCount;
private final String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException {
this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
this.driverClass = driverClass;
this.extraArgCount = extraArgCount;
this.job = new Job();
this.job.setJarByClass(driverClass);
this.extrArgsUsage = extrArgsUsage;
}
// vv JobBuilder
public static Job parseInputAndOutput(Tool tool, Configuration conf,
String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>");
return null;
}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
}
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException {
Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 || otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) {
overwrite = true;
index++;
}
Path input = new Path(otherArgs[index++]);
Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
}
if (overwrite) {
output.getFileSystem(conf).delete(output, true);
}
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
return this;
}
public Job build() {
return job;
}
public String[] getExtraArgs() {
return extraArgs;
}
}
package com.hp.join;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class JoinRecordMapper extends MapReduceBase
implements Mapper<LongWritable, Text, TextPair, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();
public void map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter)
throws IOException {
parser.parse(value);
output.collect(new TextPair(parser.getStationId(), "1"), value);
}
}
package com.hp.join;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.util.*;
@SuppressWarnings("deprecation")
public class JoinRecordWithStationName extends Configured implements Tool {
public static class KeyPartitioner implements Partitioner<TextPair, Text> {
@Override
public void configure(JobConf job) {}
@Override
public int getPartition(TextPair key, Text value, int numPartitions) {
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 3) {
JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
return -1;
}
JobConf conf = new JobConf(getConf(), getClass());
conf.setJobName("Join record with station name");
Path ncdcInputPath = new Path(args[0]);
Path stationInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(conf, ncdcInputPath,
TextInputFormat.class, JoinRecordMapper.class);
MultipleInputs.addInputPath(conf, stationInputPath,
TextInputFormat.class, JoinStationMapper.class);
FileOutputFormat.setOutputPath(conf, outputPath);
conf.setPartitionerClass(KeyPartitioner.class);
conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class);
conf.setMapOutputKeyClass(TextPair.class);
conf.setReducerClass(JoinReducer.class);
conf.setOutputKeyClass(Text.class);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
args = new String[3];
args[0] = "inputncdc";
args[1] = "inputstation";
args[2] = "output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
System.exit(exitCode);
}
}
package com.hp.join;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class JoinReducer extends MapReduceBase implements
Reducer<TextPair, Text, Text, Text> {
public void reduce(TextPair key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
Text stationName = new Text(values.next());
while (values.hasNext()) {
Text record = values.next();
Text outValue = new Text(stationName.toString() + "\t" + record.toString());
output.collect(key.getFirst(), outValue);
}
}
}
package com.hp.join;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class JoinStationMapper extends MapReduceBase
implements Mapper<LongWritable, Text, TextPair, Text> {
private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
public void map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter)
throws IOException {
if (parser.parse(value)) {
output.collect(new TextPair(parser.getStationId(), "0"),
new Text(parser.getStationName()));
}
}
}
package com.hp.join;
import java.math.*;
import org.apache.hadoop.io.Text;
public class MetOfficeRecordParser {
private String year;
private String airTemperatureString;
private int airTemperature;
private boolean airTemperatureValid;
public void parse(String record) {
if (record.length() < 18) {
return;
}
year = record.substring(3, 7);
if (isValidRecord(year)) {
airTemperatureString = record.substring(13, 18);
if (!airTemperatureString.trim().equals("---")) {
BigDecimal temp = new BigDecimal(airTemperatureString.trim());
temp = temp.multiply(new BigDecimal(BigInteger.TEN));
airTemperature = temp.intValueExact();
airTemperatureValid = true;
}
}
}
private boolean isValidRecord(String year) {
try {
Integer.parseInt(year);
return true;
} catch (NumberFormatException e) {
return false;
}
}
public void parse(Text record) {
parse(record.toString());
}
public String getYear() {
return year;
}
public int getAirTemperature() {
return airTemperature;
}
public boolean isValidTemperature() {
return airTemperatureValid;
}
}
package com.hp.join;
import java.text.*;
import java.util.Date;
import org.apache.hadoop.io.Text;
public class NcdcRecordParser {
private static final int MISSING_TEMPERATURE = 9999;
private static final DateFormat DATE_FORMAT =
new SimpleDateFormat("yyyyMMddHHmm");
private String stationId;
private String observationDateString;
private String year;
private String airTemperatureString;
private int airTemperature;
private boolean airTemperatureMalformed;
private String quality;
public void parse(String record) {
stationId = record.substring(4, 10) + "-" + record.substring(10, 15);
observationDateString = record.substring(15, 27);
year = record.substring(15, 19);
airTemperatureMalformed = false;
// Remove leading plus sign as parseInt doesn't like them
if (record.charAt(87) == '+') {
airTemperatureString = record.substring(88, 92);
airTemperature = Integer.parseInt(airTemperatureString);
} else if (record.charAt(87) == '-') {
airTemperatureString = record.substring(87, 92);
airTemperature = Integer.parseInt(airTemperatureString);
} else {
airTemperatureMalformed = true;
}
quality = record.substring(92, 93);
}
public void parse(Text record) {
parse(record.toString());
}
public String getStationId() {
return stationId;
}
public String getYear() {
return year;
}
public int getAirTemperature() {
return airTemperature;
}
public String getAirTemperatureString() {
return airTemperatureString;
}
public String getQuality() {
return quality;
}
}
package com.hp.join;
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.IOUtils;
public class NcdcStationMetadata {
private Map<String, String> stationIdToName = new HashMap<String, String>();
public void initialize(File file) throws IOException {
BufferedReader in = null;
try {
in = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
String line;
while ((line = in.readLine()) != null) {
if (parser.parse(line)) {
stationIdToName.put(parser.getStationId(), parser.getStationName());
}
}
} finally {
IOUtils.closeStream(in);
}
}
public String getStationName(String stationId) {
String stationName = stationIdToName.get(stationId);
if (stationName == null || stationName.trim().length() == 0) {
return stationId; // no match: fall back to ID
}
return stationName;
}
public Map<String, String> getStationIdToNameMap() {
return Collections.unmodifiableMap(stationIdToName);
}
}
package com.hp.join;
import org.apache.hadoop.io.Text;
public class NcdcStationMetadataParser {
private String stationId;
private String stationName;
public boolean parse(String record) {
if (record.length() < 42) { // header
return false;
}
String usaf = record.substring(0, 6);
String wban = record.substring(7, 12);
stationId = usaf + "-" + wban;
stationName = record.substring(13, 42);
try {
Integer.parseInt(usaf); // USAF identifiers are numeric
return true;
} catch (NumberFormatException e) {
return false;
}
}
public boolean parse(Text record) {
return parse(record.toString());
}
public String getStationId() {
return stationId;
}
public String getStationName() {
return stationName;
}
}
package com.hp.join;
// cc TextPair A Writable implementation that stores a pair of Text objects
// cc TextPairComparator A RawComparator for comparing TextPair byte representations
// cc TextPairFirstComparator A custom RawComparator for comparing the first field of TextPair byte representations
// vv TextPair
import java.io.*;
import org.apache.hadoop.io.*;
public class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair() {
set(new Text(), new Text());
}
public TextPair(String first, String second) {
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second) {
set(first, second);
}
public void set(Text first, Text second) {
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public Text getSecond() {
return second;
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o) {
if (o instanceof TextPair) {
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second;
}
@Override
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public int compareTo(TextPair tp) {
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
// vv TextPairComparator
public static class Comparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public Comparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
if (cmp != 0) {
return cmp;
}
return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
b2, s2 + firstL2, l2 - firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
}
static {
WritableComparator.define(TextPair.class, new Comparator());
}
// ^^ TextPairComparator
// vv TextPairFirstComparator
public static class FirstComparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
if (a instanceof TextPair && b instanceof TextPair) {
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
// ^^ TextPairFirstComparator
// vv TextPair
}
// ^^ TextPair
[directory listing: pg20417.txt (674566 bytes, Feb 3 10:17), pg4300.txt (Feb 3 10:18), pg5000.txt (Feb 3 10:18)]
This command will read all the files in the HDFS directory /user/root/in, process them, and store the result in the
HDFS directory /user/root/out.
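The command itself is not reproduced above; it is the stock WordCount example shipped with Hadoop, typically invoked along these lines (the examples jar name varies by release):
$ bin/hadoop jar hadoop-examples-*.jar wordcount /user/root/in /user/root/out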
These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give
them a try.
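With the default Hadoop 1.x/MR1 port settings these interfaces are typically reachable at:
http://localhost:50070/ - NameNode web UI (HDFS health, browse the filesystem)
http://localhost:50030/ - JobTracker web UI (job status)
http://localhost:50060/ - TaskTracker web UI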
MR Unit Testing
Start Eclipse.
Create a new Java project: MRUnitTest.
Unzip the MRUnit jar in /hadoop/mrunit.
Include the MRUnit jar in the project.
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
public class MaxTemperatureMapper
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text values,Context context) throws IOException,
InterruptedException{
String line = values.toString();
String year = line.substring(15, 19);
int airTemperature = Integer.parseInt(line.substring(87, 92));
System.out.println("-----"+ year +"="+ airTemperature );
context.write(new Text(year), new IntWritable(airTemperature));
}
}
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
package com.hp.test;
// cc MaxTemperatureMapperTestV1 Unit test for MaxTemperatureMapper
// == MaxTemperatureMapperTestV1Missing
// vv MaxTemperatureMapperTestV1
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;
import com.hp.hadoop.MaxTemperatureMapper;
public class MaxTemperatureMapperTest {
@Test
public void processesValidRecord() throws IOException, InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9-00111+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputKey(new LongWritable(123))
.withInputValue(value)
.withOutput(new Text("1950"), new IntWritable(-11))
.runTest();
}
// ^^ MaxTemperatureMapperTestV1
//@Ignore //
// vv MaxTemperatureMapperTestV1Missing
@Test
public void ignoresMissingTemperatureRecord() throws IOException,
InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9+99991+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputKey(new LongWritable(123))
.withInputValue(value)
.runTest();
}
// ^^ MaxTemperatureMapperTestV1Missing
@Test
public void processesMalformedTemperatureRecord() throws IOException,
InterruptedException {
Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
// Year ^^^^
"RJSN V02011359003150070356999999433201957010100005+353");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputValue(value)
.withInputKey(new LongWritable(123))
.withOutput(new Text("1957"), new IntWritable(1957))
.runTest();
}
// vv MaxTemperatureMapperTestV1
}
// ^^ MaxTemperatureMapperTestV1
package com.hp.test;
// == MaxTemperatureReducerTestV1
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.*;
import com.hp.hadoop.MaxTemperatureReducer;
public class MaxTemperatureReducerTest {
//vv MaxTemperatureReducerTestV1
@Test
public void returnsMaximumIntegerInValues() throws IOException,
InterruptedException {
new ReduceDriver<Text, IntWritable, Text, IntWritable>()
.withReducer(new MaxTemperatureReducer())
.withInputKey(new Text("1950"))
.withInputValues(Arrays.asList(new IntWritable(10), new IntWritable(5)))
.withOutput(new Text("1950"), new IntWritable(10))
.runTest();
}
//^^ MaxTemperatureReducerTestV1
}
Run as JUnit.
Start Eclipse.
Create one Java project, Partitioner, and create the following Java classes:
package com.hp.partitioner;
// cc MaxTemperatureMapper Mapper for maximum temperature example
// vv MaxTemperatureMapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
// ^^ MaxTemperatureMapper
package com.hp.partitioner;
// cc MaxTemperatureReducer Reducer for maximum temperature example
// vv MaxTemperatureReducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
// ^^ MaxTemperatureReducer
package com.hp.partitioner;
// cc MaxTemperatureWithCombiner Application to find the maximum temperature, using a combiner function for efficiency
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// vv MaxTemperatureWithCombiner
public class MaxTemperatureWithCombiner {
public static void main(String[] args) throws Exception {
/*if (args.length != 2) {
System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
"<output path>");
System.exit(-1);
}*/
Job job = new Job();
job.setJarByClass(MaxTemperatureWithCombiner.class);
job.setJobName("Max temperature");
// (reconstructed to match the other drivers in this document: hard-coded
// input/output paths and the reducer reused as the combiner)
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out" + System.currentTimeMillis()));
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
// ^^ MaxTemperatureWithCombiner
Output:
Pig UDF
Start Eclipse.
Create a Java project: PigUDF.
Include the Hadoop library in the Java Build Path.
Create and include a Pig user library (available in the Pig installation folder).
package com.hp.hadoop.pig;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
// (class body reconstructed from the imports above and the surviving tail below;
// this is the standard IsGoodQuality-style filter UDF that accepts an integer quality code)
public class IsGoodQuality extends FilterFunc {
@Override
public Boolean exec(Tuple tuple) throws IOException {
if (tuple == null || tuple.size() == 0) {
return false;
}
try {
Object object = tuple.get(0);
if (object == null) {
return false;
}
int i = (Integer) object;
return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
} catch (ExecException e) {
throw new IOException(e);
}
}
@Override
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
funcSpecs.add(new FuncSpec(this.getClass().getName(),
new Schema(new Schema.FieldSchema(null, DataType.INTEGER))));
return funcSpecs;
}
}
Sqoop
Run Sqoop:
$ sqoop
Install MySQL
Verify whether MySQL is already installed on the system.
#rpm -i MySQL-client-5.5.25-1.rhel5.i386.rpm
Verify the installation: mysql -V
Start MySQL:
Follow the steps below to stop and start MySQL; the service prints OK when it stops and starts successfully.
type:
# mysql
# mysql hadoopguide
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
-> widget_name VARCHAR(64) NOT NULL,
-> price DECIMAL(10,2),
-> design_date DATE,
-> version INT,
-> design_comment VARCHAR(100));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10', 1, 'Connects two gizmos');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4, NULL);
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
-> 13, 'Our flagship product');
#cd /hadoop
Extract the following archive:
# tar -xvf mysql-connector-java-5.1.20.tar.gz -C /hadoop
# cd /hadoop/mysql-connector-java-5.1.20
# cp mysql-connector-java-5.1.20-bin.jar $SQOOP_HOME/lib
This places the library Sqoop needs in order to connect to MySQL.
Start HDFS.
Connect using Sqoop:
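The import command itself appears in the screenshots that follow; a typical invocation for the widgets table created above looks like this (the connect string and mapper count are assumptions to adjust for your setup):
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1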
Additional Notes:
PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:
/usr/bin/mysqladmin -u root password 'new-password'
Alternatively you can run:
/usr/bin/mysql_secure_installation
which will also give you the option of removing the test databases and anonymous user created by
default. This is strongly recommended for production servers.
Map Reducing
Goals: You will be able to write a MapReduce program using the Eclipse IDE.
IDE Set Up:
1) Untar the eclipse-jee-juno-linux-gtk.tar.gz
package com.hp.hadoop;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
/**
* @param args
*/
public static void main(String[] args) throws Exception{
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"+System.currentTimeMillis()));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
public class MaxTemperatureMapper
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text values,Context context) throws IOException,
InterruptedException{
String line = values.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
Output:
13/01/05 15:14:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where
applicable
13/01/05 15:14:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the
same.
13/01/05 15:14:25 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/01/05 15:14:25 INFO input.FileInputFormat: Total input paths to process : 2
13/01/05 15:14:26 INFO mapred.JobClient: Running job: job_local_0001
13/01/05 15:14:26 INFO util.ProcessTree: setsid exited with exit code 0
13/01/05 15:14:26 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ab444
13/01/05 15:14:26 INFO mapred.MapTask: io.sort.mb = 100
13/01/05 15:14:27 INFO mapred.JobClient: map 0% reduce 0%
13/01/05 15:14:30 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/05 15:14:30 INFO mapred.MapTask: record buffer = 262144/327680
13/01/05 15:14:31 INFO mapred.MapTask: Starting flush of map output
13/01/05 15:14:31 INFO mapred.MapTask: Finished spill 0
13/01/05 15:14:31 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/05 15:14:33 INFO mapred.LocalJobRunner:
13/01/05 15:14:33 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/05 15:14:33 INFO mapred.JobClient: map 100% reduce 0%
13/01/05 15:14:33 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ff8c74
13/01/05 15:14:33 INFO mapred.MapTask: io.sort.mb = 100
13/01/05 15:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/05 15:14:33 INFO mapred.MapTask: record buffer = 262144/327680
13/01/05 15:14:33 INFO mapred.MapTask: Starting flush of map output
13/01/05 15:14:33 INFO mapred.MapTask: Finished spill 0
13/01/05 15:14:33 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
13/01/05 15:14:37 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@13c6641
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Merger: Merging 2 sorted segments
13/01/05 15:14:37 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 144423 bytes
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/01/05 15:14:37 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to out1357379065119
13/01/05 15:14:40 INFO mapred.LocalJobRunner: reduce > reduce
13/01/05 15:14:40 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/01/05 15:14:41 INFO mapred.JobClient: map 100% reduce 100%
Execute job:
#bin/hadoop jar /hadoop/hadoop/mymapreduce.jar com.hp.hadoop.MaxTemperatureDriver intemp outtemp