Prerequisites
Sun Java 6. Verify Java as below:
# java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
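The output below comes from generating an SSH key for root. The generating command itself is not reproduced in this copy; it is typically:
$ ssh-keygen -t rsa -P ""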
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 root@ubuntu
The key's randomart image is:
[...snipp...]
The -P "" option creates an RSA key pair with an empty password.
Second, you have to enable SSH access to your local machine with this newly created key.
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine as the root user. This step is also
needed to save your local machine's host key fingerprint to the root user's known_hosts file.
$ ssh localhost
The authenticity of host 'localhost (::1)' can't be established.
RSA key fingerprint is d7:87:25:47:ae:02:00:eb:1d:75:4f:bb:44:f9:36:26.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Linux ubuntu 2.6.32-22-generic #33-Ubuntu SMP Wed Apr 28 13:27:30 UTC 2010 i686 GNU/Linux
Ubuntu 10.04 LTS
[...snipp...]
# ssh localhost
Last login: Mon Dec 3 21:36:02 2012 from localhost.localdomain
Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user root. If you use a shell other than bash, you should of course
update its appropriate configuration files instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/hadoop/hadoop
(Change the directory according to your installation.)
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/hadoop/jdk1.6.0_17
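It can also be convenient to put the Hadoop scripts on the PATH (an assumed convenience addition; the rest of this guide spells out full paths):
# Add the Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin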
conf/*-site.xml
Now we create the directory and set the required ownerships and permissions:
$ mkdir -p /app/hadoop/tmp
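The ownership and permission commands are not shown above; since this guide runs Hadoop as root, a minimal sketch that restricts the directory to the owning user would be (adjust the owner if you run Hadoop as a different user):
$ chown root:root /app/hadoop/tmp
$ chmod 750 /app/hadoop/tmp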
Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
<!-- In: conf/core-site.xml -->
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>The name of the default file system </description>
</property>
In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
<description>The host and port that the MapReduce job tracker runs at.</description>
</property>
In file conf/hdfs-site.xml:
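The snippet for this file is not reproduced above; for a single-node setup it typically just sets the block replication factor to 1 (a minimal sketch using the standard dfs.replication property):
<!-- In: conf/hdfs-site.xml -->
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.</description>
</property>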
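The log excerpt below comes from formatting the HDFS filesystem via the NameNode. The command that produces it is not shown in this copy; it is typically run once, before starting the cluster for the first time:
$ /hadoop/hadoop/bin/hadoop namenode -format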
************************************************************/
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
hduser@ubuntu:/usr/local/hadoop$
hduser@ubuntu:~$ /hadoop/hadoop/bin/start-all.sh
This will start up a NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker on your machine.
The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/start-all.sh
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-hduser-tasktracker-ubuntu.out
hduser@ubuntu:/usr/local/hadoop$
A nifty tool for checking whether the expected Hadoop processes are running is jps (part of Sun's Java since v1.5.0).
hduser@ubuntu:$ jps
(if jps is not found, go to /Software/JDK/bin and execute ./jps)
2287 TaskTracker
2149 JobTracker
1938 DataNode
2085 SecondaryNameNode
2349 Jps
1788 NameNode
You can also check with netstat if Hadoop is listening on the configured ports.
hduser@ubuntu:~$ netstat -plten | grep java
[netstat output truncated in this copy: each line shows a tcp socket owned by one of the Hadoop Java processes, listening on its configured port]
hduser@ubuntu:~$
If there are any errors, examine the log files in the /logs/ directory.
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh
hduser@ubuntu:/usr/local/hadoop$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
Lab 1 - HDFS
Lab objectives
In this lab you will practice the HDFS command-line interface.
Lab instructions
This lab has been developed as a tutorial. Simply execute the commands provided, and analyze the
results.
> hadoop fs -ls /
> hadoop fs -ls /app
2. There are many commands you can run within the Hadoop filesystem. For example, to make the
directory test you can issue the following command:
> hadoop fs -mkdir test
> hadoop fs -ls /
> hadoop fs -ls /user/root
You will notice that the test directory got created under the /user/root directory. This is because
as the root user, your default path is /user/root and thus if you don't specify an absolute path all
HDFS commands work out of /user/root (this will be your default working directory).
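For example, because /user/root is the default working directory, the following two commands refer to the same HDFS directory:
> hadoop fs -ls test
> hadoop fs -ls /user/root/test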
3. You should be aware that you can pipe (using the | character) the output of any HDFS command
to the Linux shell. For example, you can easily use grep with HDFS by doing the following:
> hadoop fs -ls /user/root
> hadoop fs -ls /user/root | grep test
As you can see, the grep command only returned the lines which had test in them (thus
removing the "Found x items" line and the oozie-root directory from the listing).
4. In order to move files between your regular Linux filesystem and HDFS you will likely use the put
and get commands. First, move a single file to the Hadoop filesystem.
Copy pg20417.txt from the software folder to /hadoop, then put it into HDFS:
> hadoop fs -put /hadoop/pg20417.txt pg20417.txt
> hadoop fs -ls /user/root
You should now see a new file called /user/root/pg20417.txt listed. In order to view the contents of
this file we will use the -cat command as follows:
> hadoop fs -cat pg20417.txt
You should see the output of the pg20417.txt file (that is stored in HDFS). We can also use the Linux
diff command to see if the file we put on HDFS is actually the same as the original on the local
filesystem. You can do this as follows:
> hadoop fs -cat pg20417.txt | diff - /hadoop/pg20417.txt
Since the diff command produces no output we know that the files are the same (the diff
command prints all the lines in the files that differ).
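The get command works in the opposite direction of put; a small sketch (the local target path is just an example):
> hadoop fs -get pg20417.txt /tmp/pg20417-copy.txt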
In order to use HDFS commands recursively you generally add an "r" to the HDFS command (in
the Linux shell this is generally done with the "-R" argument). For example, to do a recursive
listing we'll use the -lsr command rather than just -ls. Try this:
> hadoop fs -ls /user
> hadoop fs -lsr /user
2. In order to find the size of files you need to use the -du or -dus commands. Keep in mind that
these commands return the file size in bytes. To find the size of the pg20417.txt file use the
following command:
> hadoop fs -du /user/root/pg20417.txt
To find the size of all files individually in the /user/root directory use the following command:
> hadoop fs -du /user/root
To find the total size of all files in the /user/root directory use the following command:
> hadoop fs -dus /user/root
3. If you would like to get more information about a given command, invoke -help as follows:
> hadoop fs -help
For example, to get help on the dus command you'd do the following:
> hadoop fs -help dus
------ This is the end of this lab -----
System -> Network Config -> DNS
Hostname - Activate
/etc/sysconfig/network
Go to log folder:
# cd /hadoop/hadoop/logs
# ls
# vi hadoop.txt
hadoop fsck /
hadoop version
Fair Scheduler
Installation
To run the fair scheduler in your Hadoop installation, you need to put it on the CLASSPATH.
Copy hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar from
/hadoop/hadoop-2.0.0-mr1-cdh4.1.0/contrib/fairscheduler to $HADOOP_HOME/lib using the
following command:
cp /hadoop/hadoop-2.0.0-mr1-cdh4.1.0/contrib/fairscheduler/hadoop-fairscheduler-2.0.0-mr1-cdh4.1.0.jar $HADOOP_HOME/lib
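Putting the jar on the classpath is not quite enough on its own; the JobTracker also has to be told to use the fair scheduler in conf/mapred-site.xml. A minimal sketch, using the property documented for the MR1 fair scheduler:
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>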
Once you restart the cluster, you can check that the fair scheduler is running by going to
http://<jobtracker URL>/scheduler on the JobTracker's web UI.
A "job scheduler administration" page should be visible there.
http://192.168.1.5:50030/scheduler
Run the MapReduce job from two telnet windows and observe the scheduler page.
Create two different folders for input and output.
Run the jobs with different input and output folders.
These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give
them a try.
Pig Installation
# tar xzf pig-0.10.0.tar.gz
#export PIG_HOME=/hadoop/pig-0.10.0
#export PATH=$PATH:$PIG_HOME/bin
#bash
$ cd /hadoop/pig-0.10.0/
Set the following value in conf/pig.properties: exectype=local
$ bin/pig -x local
Enter the following commands in the Grunt shell:
log = LOAD '/hadoop/pig-0.10.0/tutorial/data/excite-small.log' AS (user, timestamp, query);
grpd = GROUP log BY user;
cntd = FOREACH grpd GENERATE group, COUNT(log);
STORE cntd INTO 'output';
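If you just want to see the grouped counts on the console instead of storing them, DUMP works as well:
DUMP cntd;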
# quit
file:///hadoop/pig-0.10.0/tutorial/data/output
Results:
Hbase - Tutorial
3) Edit conf/hbase-site.xml so that it looks like this:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///hadoop/hbase</value>
</property>
</configuration>
Edit conf/hbase-env.sh, uncommenting the JAVA_HOME line and pointing it to your Java install.
Start hadoop
Start HBase
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
Shell Exercises
Connect to your running HBase via the shell.
$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
hbase(main):001:0>
Create a table named test with a single column family named cf. Verify its creation by listing all tables and then insert
some values.
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
Above we inserted three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in
HBase are comprised of a column family prefix -- cf in this example -- followed by a colon and then a column qualifier
suffix (a in this case).
Scan the table to verify the inserts; the scan output lists row1, row2, and row3 under ROW together with their COLUMN+CELL values.
Get a single row; for row1 the output shows COLUMN cf:a with CELL timestamp=1288380727188, value=value1.
Now, disable and drop your table. This will clean up all done above.
hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds
Stopping HBase
Stop your hbase instance by running the stop script.
$ ./bin/stop-hbase.sh
stopping hbase...............
Hive - Tutorial
Installing Hive
$ tar -xzvf hive-0.9.0.tar.gz -C /hadoop
Set the environment variable HIVE_HOME in $HOME/.bashrc to point to the installation directory:
$ export HIVE_HOME=/hadoop/hive-0.9.0
Finally, add $HIVE_HOME/bin to your PATH:
$ export PATH=$HIVE_HOME/bin:$PATH
Running Hive
Hive uses Hadoop, which means HDFS must be running and the following directories must exist and
be group-writable:
$HADOOP_HOME/bin/hadoop fs -mkdir     /tmp
$HADOOP_HOME/bin/hadoop fs -mkdir     /user/hive/warehouse
$HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Error log:
/tmp/<user.name>/hive.log
To start the Hive shell, type:
#bash
#hive
Create a table called pokes with two columns, the first being an integer and the other a string:
hive> CREATE TABLE pokes (foo INT, bar STRING);
Create a table called invites with the same two columns plus a partition column ds:
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
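To verify that the tables were created, the usual HiveQL statements can be used (a small sketch):
hive> SHOW TABLES;
hive> DESCRIBE invites;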
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='2008-08-15';
Dropping tables:
hive> DROP TABLE invites;
# quit;
Hive - Bucket
Input Types
Create one Java project, InputTypes, and create the following classes.
package com.hp.types;
// == JobBuilder
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
public class JobBuilder {
private final Class<?> driverClass;
private final Job job;
private final int extraArgCount;
private final String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException {
this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
this.driverClass = driverClass;
this.extraArgCount = extraArgCount;
this.job = new Job();
this.job.setJarByClass(driverClass);
this.extrArgsUsage = extrArgsUsage;
}
// vv JobBuilder
public static Job parseInputAndOutput(Tool tool, Configuration conf,
String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>");
return null;
}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
}
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException {
Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 || otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) {
overwrite = true;
index++;
}
Path input = new Path(otherArgs[index++]);
Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
}
if (overwrite) {
output.getFileSystem(conf).delete(output, true);
}
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
return this;
}
public Job build() {
return job;
}
public String[] getExtraArgs() {
return extraArgs;
}
}
package com.hp.types;
// cc SmallFilesToSequenceFileConverter A MapReduce program for packaging a collection of small files as a single SequenceFile
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
//vv SmallFilesToSequenceFileConverter
public class SmallFilesToSequenceFileConverter extends Configured
implements Tool {
static class SequenceFileMapper
extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
private Text filenameKey;
@Override
protected void setup(Context context) throws IOException,
InterruptedException {
InputSplit split = context.getInputSplit();
Path path = ((FileSplit) split).getPath();
filenameKey = new Text(path.toString());
}
@Override
protected void map(NullWritable key, BytesWritable value, Context context)
throws IOException, InterruptedException {
context.write(filenameKey, value);
}
}
@Override
public int run(String[] args) throws Exception {
Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (job == null) {
return -1;
}
job.setInputFormatClass(WholeFileInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(BytesWritable.class);
job.setMapperClass(SequenceFileMapper.class);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
args = new String[2];
args[0]="input";
args[1]="output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
System.exit(exitCode);
}
}
// ^^ SmallFilesToSequenceFileConverter
package com.hp.types;
// cc WholeFileInputFormat An InputFormat for reading a whole file as a record
import java.io.IOException;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.*;
//vv WholeFileInputFormat
public class WholeFileInputFormat
extends FileInputFormat<NullWritable, BytesWritable> {
@Override
protected boolean isSplitable(JobContext context, Path file) {
return false;
}
@Override
public RecordReader<NullWritable, BytesWritable> createRecordReader(
InputSplit split, TaskAttemptContext context) throws IOException,
InterruptedException {
WholeFileRecordReader reader = new WholeFileRecordReader();
reader.initialize(split, context);
return reader;
}
}
//^^ WholeFileInputFormat
package com.hp.types;
// cc WholeFileRecordReader The RecordReader used by WholeFileInputFormat for reading a whole file as a record
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
//vv WholeFileRecordReader
class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
private FileSplit fileSplit;
private Configuration conf;
private BytesWritable value = new BytesWritable();
private boolean processed = false;
@Override
public void initialize(InputSplit split, TaskAttemptContext context)
throws IOException, InterruptedException {
this.fileSplit = (FileSplit) split;
this.conf = context.getConfiguration();
}
@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
@Override
public NullWritable getCurrentKey() throws IOException, InterruptedException {
return NullWritable.get();
}
@Override
public BytesWritable getCurrentValue() throws IOException, InterruptedException {
return value;
}
@Override
public float getProgress() throws IOException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
// do nothing
}
}
//^^ WholeFileRecordReader
Verify the result as follows, using the HDFS view and the JobTracker web UI.
From two single-node clusters to a multi-node cluster
We will build a multi-node cluster using two Red Hat boxes in this tutorial. The best way to do this for starters is to
install, configure and test a local Hadoop setup for each of the two RH boxes, and in a second step to merge these two
single-node clusters into one multi-node cluster in which one RH box will become the designated master (but also act as
a slave with regard to data storage and processing), and the other box will become only a slave. It's much easier to track
down any problems you might encounter due to the reduced complexity of doing a single-node cluster setup first on
each machine.
Prerequisites
Configuring single-node clusters first in both the VM
Use the earlier tutorial.
Now that you have two single-node clusters up and running, we will modify the Hadoop configuration to make one RH
box the master (which will also act as a slave) and the other RH box a slave.
We will call the designated master machine just the master from now on and the slave-only machine the
slave. We will also give the two machines these respective hostnames in their networking setup, most notably
in /etc/hosts. If the hostnames of your machines are different (e.g. node01) then you must adapt the
settings in this tutorial as appropriate.
Shut down each single-node cluster with bin/stop-all.sh before continuing if you haven't done so already.
Update /etc/hosts
Networking
Both machines must be able to reach each other over the network.
Update /etc/hosts on both machines with the following lines:
# vi /etc/hosts (for master AND slave)
10.72.47.42    master
10.72.47.27    slave
SSH access
The root user on the master must be able to connect
a) to its own user account on the master, i.e. ssh localhost, and
b) to the root user account on the slave via a password-less SSH login.
You have to add root@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to
the authorized_keys file of root@slave (in this user's $HOME/.ssh/authorized_keys).
You can do this manually or use the following SSH command:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub root@slave
This command will prompt you for the login password for user root on slave, then copy the public SSH key for you,
creating the correct directory and fixing the permissions as necessary.
The final step is to test the SSH setup by connecting with user root from the master to the user account root on
the slave. The step is also needed to save the slave's host key fingerprint to root@master's known_hosts file.
So, connecting from master to master:
$ ssh master
Hadoop Cluster Overview
The master node will run the master daemons for each layer: NameNode for the HDFS storage layer, and JobTracker
for the MapReduce processing layer. Both machines will run the slave daemons: DataNode for the HDFS layer, and
TaskTracker for MapReduce processing layer. Basically, the master daemons are responsible for coordination and
management of the slave daemons while the latter will do the actual data storage and data processing work.
Configuration
conf/masters (master only)
On master, update conf/masters so that it looks like this:
master
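conf/slaves (master only)
The contents of this file are not reproduced here; in this two-box setup it typically lists every machine that should run DataNode and TaskTracker daemons, i.e. both boxes (a sketch, assuming the hostnames used above):
master
slave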
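conf/core-site.xml (fs.default.name)
First, and on ALL machines, the fs.default.name variable in conf/core-site.xml, which specifies the NameNode (the HDFS master) host and port, has to point at the master; the "Second, ..." step below refers back to this change. A minimal sketch, mirroring the single-node snippet earlier in this guide:
<!-- In: conf/core-site.xml (ALL machines) -->
<property>
<name>fs.default.name</name>
<value>hdfs://master:9000</value>
<description>The name of the default file system </description>
</property>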
conf/mapred-site.xml (mapred.job.tracker)
Second, we have to change the mapred.job.tracker variable (in conf/mapred-site.xml), which specifies
the JobTracker (MapReduce master) host and port. Again, this is the master in our case.
<!-- In: conf/mapred-site.xml -->
<property>
<name>mapred.job.tracker</name>
<value>master:9001</value>
<description>The host and port that the MapReduce job tracker runs at </description>
</property>
Background: The HDFS name table is stored on the NameNode's (here: master) local filesystem in the directory
specified by dfs.name.dir. The name table is used by the NameNode to store tracking and coordination information
for the DataNodes.
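The HDFS daemons are started first; the output pages are not reproduced in this copy, but as with the stop command later in this guide, start-dfs.sh is run on master, bringing up the NameNode on master and DataNodes on the machines listed in conf/slaves:
$ bin/start-dfs.sh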
On slave, you can examine the success or failure of this command by inspecting the log file logs/. Exemplary
output:
As you can see in the slave's output above, it will automatically format its storage directory (specified
by dfs.data.dir) if it is not formatted already. It will also create the directory if it does not exist yet.
At this point, the following Java processes should run on master
# jps
ww.hpottech.com
23
ww.hpottech.com
24
MapReduce daemons
Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up
the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/start-mapred.sh on master:
$ bin/start-mapred.sh
On slave, you can examine the success or failure of this command by inspecting the log file logs/hadoop-
MapReduce daemons
Run the command bin/stop-mapred.sh on the JobTracker machine. This will shut down the MapReduce
cluster by stopping the JobTracker daemon running on the machine you ran the previous command on, and TaskTrackers
on the machines listed in the conf/slaves file.
In our case, we will run bin/stop-mapred.sh on master:
$ bin/stop-mapred.sh
(Note: The output above might suggest that the JobTracker was running and stopped on slave, but you can be assured
that the JobTracker ran on master.)
At this point, the following Java processes should run on master
$ jps
HDFS daemons
Run the command bin/stop-dfs.sh on the NameNode machine. This will shut down HDFS by stopping the
NameNode daemon running on the machine you ran the previous command on, and DataNodes on the machines listed in
the conf/slaves file.
In our case, we will run bin/stop-dfs.sh on master:
$ bin/stop-dfs.sh
At this point, only the following Java processes should run on master
$ jps
If you want to inspect the job's output data, just retrieve the job result from HDFS to your local filesystem.
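For example, with hadoop fs -get (the directory names here are only placeholders for your actual job output):
$ bin/hadoop fs -get output /tmp/output
$ ls /tmp/output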
Joins
package com.hp.join;
// == JobBuilder
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
public class JobBuilder {
private final Class<?> driverClass;
private final Job job;
private final int extraArgCount;
private final String extrArgsUsage;
private String[] extraArgs;
public JobBuilder(Class<?> driverClass) throws IOException {
this(driverClass, 0, "");
}
public JobBuilder(Class<?> driverClass, int extraArgCount, String extrArgsUsage) throws IOException {
this.driverClass = driverClass;
this.extraArgCount = extraArgCount;
this.job = new Job();
this.job.setJarByClass(driverClass);
this.extrArgsUsage = extrArgsUsage;
}
// vv JobBuilder
public static Job parseInputAndOutput(Tool tool, Configuration conf,
String[] args) throws IOException {
if (args.length != 2) {
printUsage(tool, "<input> <output>");
return null;
}
Job job = new Job(conf);
job.setJarByClass(tool.getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job;
}
public static void printUsage(Tool tool, String extraArgsUsage) {
System.err.printf("Usage: %s [genericOptions] %s\n\n",
tool.getClass().getSimpleName(), extraArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
}
// ^^ JobBuilder
public JobBuilder withCommandLineArgs(String... args) throws IOException {
Configuration conf = job.getConfiguration();
GenericOptionsParser parser = new GenericOptionsParser(conf, args);
String[] otherArgs = parser.getRemainingArgs();
if (otherArgs.length < 2 || otherArgs.length > 3 + extraArgCount) {
System.err.printf("Usage: %s [genericOptions] [-overwrite] <input path> <output path> %s\n\n",
driverClass.getSimpleName(), extrArgsUsage);
GenericOptionsParser.printGenericCommandUsage(System.err);
System.exit(-1);
}
int index = 0;
boolean overwrite = false;
if (otherArgs[index].equals("-overwrite")) {
overwrite = true;
index++;
}
Path input = new Path(otherArgs[index++]);
Path output = new Path(otherArgs[index++]);
if (index < otherArgs.length) {
extraArgs = new String[otherArgs.length - index];
System.arraycopy(otherArgs, index, extraArgs, 0, otherArgs.length - index);
}
if (overwrite) {
output.getFileSystem(conf).delete(output, true);
}
FileInputFormat.addInputPath(job, input);
FileOutputFormat.setOutputPath(job, output);
return this;
}
public Job build() {
return job;
}
public String[] getExtraArgs() {
return extraArgs;
}
}
package com.hp.join;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class JoinRecordMapper extends MapReduceBase
implements Mapper<LongWritable, Text, TextPair, Text> {
private NcdcRecordParser parser = new NcdcRecordParser();
public void map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter)
throws IOException {
parser.parse(value);
output.collect(new TextPair(parser.getStationId(), "1"), value);
}
}
package com.hp.join;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;
import org.apache.hadoop.util.*;
@SuppressWarnings("deprecation")
public class JoinRecordWithStationName extends Configured implements Tool {
public static class KeyPartitioner implements Partitioner<TextPair, Text> {
@Override
public void configure(JobConf job) {}
@Override
public int getPartition(TextPair key, Text value, int numPartitions) {
return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
}
}
@Override
public int run(String[] args) throws Exception {
if (args.length != 3) {
JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
return -1;
}
JobConf conf = new JobConf(getConf(), getClass());
conf.setJobName("Join record with station name");
Path ncdcInputPath = new Path(args[0]);
Path stationInputPath = new Path(args[1]);
Path outputPath = new Path(args[2]);
MultipleInputs.addInputPath(conf, ncdcInputPath,
TextInputFormat.class, JoinRecordMapper.class);
MultipleInputs.addInputPath(conf, stationInputPath,
TextInputFormat.class, JoinStationMapper.class);
FileOutputFormat.setOutputPath(conf, outputPath);
conf.setPartitionerClass(KeyPartitioner.class);
conf.setOutputValueGroupingComparator(TextPair.FirstComparator.class);
conf.setMapOutputKeyClass(TextPair.class);
conf.setReducerClass(JoinReducer.class);
conf.setOutputKeyClass(Text.class);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
args = new String[3];
args[0] = "inputncdc";
args[1] = "inputstation";
args[2] = "output"+System.currentTimeMillis();
int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
System.exit(exitCode);
}
}
package com.hp.join;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
public class JoinReducer extends MapReduceBase implements
Reducer<TextPair, Text, Text, Text> {
public void reduce(TextPair key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter)
throws IOException {
Text stationName = new Text(values.next());
while (values.hasNext()) {
Text record = values.next();
Text outValue = new Text(stationName.toString() + "\t" + record.toString());
output.collect(key.getFirst(), outValue);
}
}
}
package com.hp.join;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class JoinStationMapper extends MapReduceBase
implements Mapper<LongWritable, Text, TextPair, Text> {
private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
public void map(LongWritable key, Text value,
OutputCollector<TextPair, Text> output, Reporter reporter)
throws IOException {
if (parser.parse(value)) {
output.collect(new TextPair(parser.getStationId(), "0"),
new Text(parser.getStationName()));
}
}
}
package com.hp.join;
import java.math.*;
import org.apache.hadoop.io.Text;
public class MetOfficeRecordParser {
private String year;
private String airTemperatureString;
private int airTemperature;
private boolean airTemperatureValid;
public void parse(String record) {
if (record.length() < 18) {
return;
}
year = record.substring(3, 7);
if (isValidRecord(year)) {
airTemperatureString = record.substring(13, 18);
if (!airTemperatureString.trim().equals("---")) {
BigDecimal temp = new BigDecimal(airTemperatureString.trim());
temp = temp.multiply(new BigDecimal(BigInteger.TEN));
airTemperature = temp.intValueExact();
airTemperatureValid = true;
}
}
}
private boolean isValidRecord(String year) {
try {
Integer.parseInt(year);
return true;
} catch (NumberFormatException e) {
return false;
}
}
public void parse(Text record) {
parse(record.toString());
}
public String getYear() {
return year;
}
public int getAirTemperature() {
return airTemperature;
}
public boolean isValidTemperature() {
return airTemperatureValid;
}
}
package com.hp.join;
import java.text.*;
import java.util.Date;
import org.apache.hadoop.io.Text;
public class NcdcRecordParser {
private static final int MISSING_TEMPERATURE = 9999;
private static final DateFormat DATE_FORMAT =
new SimpleDateFormat("yyyyMMddHHmm");
private String stationId;
private String observationDateString;
private String year;
private String airTemperatureString;
private int airTemperature;
private boolean airTemperatureMalformed;
private String quality;
public void parse(String record) {
stationId = record.substring(4, 10) + "-" + record.substring(10, 15);
observationDateString = record.substring(15, 27);
year = record.substring(15, 19);
airTemperatureMalformed = false;
// Remove leading plus sign as parseInt doesn't like them
if (record.charAt(87) == '+') {
airTemperatureString = record.substring(88, 92);
airTemperature = Integer.parseInt(airTemperatureString);
} else if (record.charAt(87) == '-') {
airTemperatureString = record.substring(87, 92);
airTemperature = Integer.parseInt(airTemperatureString);
} else {
airTemperatureMalformed = true;
}
quality = record.substring(92, 93);
}
public void parse(Text record) {
parse(record.toString());
}
public String getStationId() {
return stationId;
}
public String getYear() {
return year;
}
public int getAirTemperature() {
return airTemperature;
}
public String getAirTemperatureString() {
return airTemperatureString;
}
public String getQuality() {
return quality;
}
}
package com.hp.join;
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.IOUtils;
public class NcdcStationMetadata {
private Map<String, String> stationIdToName = new HashMap<String, String>();
public void initialize(File file) throws IOException {
BufferedReader in = null;
try {
in = new BufferedReader(new InputStreamReader(new FileInputStream(file)));
NcdcStationMetadataParser parser = new NcdcStationMetadataParser();
String line;
while ((line = in.readLine()) != null) {
if (parser.parse(line)) {
stationIdToName.put(parser.getStationId(), parser.getStationName());
}
}
} finally {
IOUtils.closeStream(in);
}
}
public String getStationName(String stationId) {
String stationName = stationIdToName.get(stationId);
if (stationName == null || stationName.trim().length() == 0) {
return stationId; // no match: fall back to ID
}
return stationName;
}
public Map<String, String> getStationIdToNameMap() {
return Collections.unmodifiableMap(stationIdToName);
}
}
package com.hp.join;
import org.apache.hadoop.io.Text;
public class NcdcStationMetadataParser {
private String stationId;
private String stationName;
public boolean parse(String record) {
if (record.length() < 42) { // header
return false;
}
String usaf = record.substring(0, 6);
String wban = record.substring(7, 12);
stationId = usaf + "-" + wban;
stationName = record.substring(13, 42);
try {
Integer.parseInt(usaf); // USAF identifiers are numeric
return true;
} catch (NumberFormatException e) {
return false;
}
}
public boolean parse(Text record) {
return parse(record.toString());
}
public String getStationId() {
return stationId;
}
public String getStationName() {
return stationName;
}
}
package com.hp.join;
// cc TextPair A Writable implementation that stores a pair of Text objects
// cc TextPairComparator A RawComparator for comparing TextPair byte representations
// cc TextPairFirstComparator A custom RawComparator for comparing the first field of TextPair byte representations
// vv TextPair
import java.io.*;
import org.apache.hadoop.io.*;
public class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair() {
set(new Text(), new Text());
}
public TextPair(String first, String second) {
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second) {
set(first, second);
}
public void set(Text first, Text second) {
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public Text getSecond() {
return second;
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o) {
if (o instanceof TextPair) {
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second;
}
@Override
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public int compareTo(TextPair tp) {
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
// vv TextPairComparator
public static class Comparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public Comparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
if (cmp != 0) {
return cmp;
}
return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
b2, s2 + firstL2, l2 - firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
}
static {
WritableComparator.define(TextPair.class, new Comparator());
}
// ^^ TextPairComparator
// vv TextPairFirstComparator
public static class FirstComparator extends WritableComparator {
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2) {
try {
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
} catch (IOException e) {
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b) {
if (a instanceof TextPair && b instanceof TextPair) {
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
// ^^ TextPairFirstComparator
// vv TextPair
}
// ^^ TextPair
[directory listing: pg20417.txt (674566 bytes, Feb 3 10:17), pg4300.txt (Feb 3 10:18), pg5000.txt (Feb 3 10:18)]
This command will read all the files in the HDFS directory /user/root/in, process them, and store the result in the
HDFS directory /user/root/out.
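The command itself is not reproduced above; it is the stock WordCount example shipped with Hadoop, typically invoked along these lines (the examples jar name varies by release):
$ bin/hadoop jar hadoop-examples-*.jar wordcount /user/root/in /user/root/out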
These web interfaces provide concise information about what's happening in your Hadoop cluster. You might want to give
them a try.
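With the default Hadoop 1.x/MR1 port settings these interfaces are typically reachable at:
http://localhost:50070/ - NameNode web UI (HDFS health, browse the filesystem)
http://localhost:50030/ - JobTracker web UI (job status)
http://localhost:50060/ - TaskTracker web UI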
MR Unit Testing
Start Eclipse.
Create a new Java project: MRUnitTest.
Unzip the MRUnit jar in /hadoop/mrunit.
Include the MRUnit jar in the project.
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
public class MaxTemperatureMapper
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text values,Context context) throws IOException,
InterruptedException{
String line = values.toString();
String year = line.substring(15, 19);
int airTemperature = Integer.parseInt(line.substring(87, 92));
System.out.println("-----"+ year +"="+ airTemperature );
context.write(new Text(year), new IntWritable(airTemperature));
}
}
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
package com.hp.test;
// cc MaxTemperatureMapperTestV1 Unit test for MaxTemperatureMapper
// == MaxTemperatureMapperTestV1Missing
// vv MaxTemperatureMapperTestV1
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;
import com.hp.hadoop.MaxTemperatureMapper;
public class MaxTemperatureMapperTest {
@Test
public void processesValidRecord() throws IOException, InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9-00111+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputKey(new LongWritable(123))
.withInputValue(value)
.withOutput(new Text("1950"), new IntWritable(-11))
.runTest();
}
// ^^ MaxTemperatureMapperTestV1
//@Ignore //
// vv MaxTemperatureMapperTestV1Missing
@Test
public void ignoresMissingTemperatureRecord() throws IOException,
InterruptedException {
Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
// Year ^^^^
"99999V0203201N00261220001CN9999999N9+99991+99999999999");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputKey(new LongWritable(123))
.withInputValue(value)
.runTest();
}
// ^^ MaxTemperatureMapperTestV1Missing
@Test
public void processesMalformedTemperatureRecord() throws IOException,
InterruptedException {
Text value = new Text("0335999999433181957042302005+37950+139117SAO +0004" +
// Year ^^^^
"RJSN V02011359003150070356999999433201957010100005+353");
// Temperature ^^^^^
new MapDriver<LongWritable, Text, Text, IntWritable>()
.withMapper(new MaxTemperatureMapper())
.withInputValue(value)
.withInputKey(new LongWritable(123))
.withOutput(new Text("1957"), new IntWritable(1957))
.runTest();
}
// vv MaxTemperatureMapperTestV1
}
// ^^ MaxTemperatureMapperTestV1
package com.hp.test;
// == MaxTemperatureReducerTestV1
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.*;
import com.hp.hadoop.MaxTemperatureReducer;
public class MaxTemperatureReducerTest {
//vv MaxTemperatureReducerTestV1
@Test
public void returnsMaximumIntegerInValues() throws IOException,
InterruptedException {
new ReduceDriver<Text, IntWritable, Text, IntWritable>()
.withReducer(new MaxTemperatureReducer())
.withInputKey(new Text("1950"))
.withInputValues(Arrays.asList(new IntWritable(10), new IntWritable(5)))
.withOutput(new Text("1950"), new IntWritable(10))
.runTest();
}
//^^ MaxTemperatureReducerTestV1
}
Run as JUnit.
Start Eclipse.
Create one Java project, Partitioner, and create the following Java classes:
package com.hp.partitioner;
// cc MaxTemperatureMapper Mapper for maximum temperature example
// vv MaxTemperatureMapper
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
// ^^ MaxTemperatureMapper
package com.hp.partitioner;
// cc MaxTemperatureReducer Reducer for maximum temperature example
// vv MaxTemperatureReducer
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
// ^^ MaxTemperatureReducer
package com.hp.partitioner;
// cc MaxTemperatureWithCombiner Application to find the maximum temperature, using a combiner function for efficiency
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// vv MaxTemperatureWithCombiner
public class MaxTemperatureWithCombiner {
public static void main(String[] args) throws Exception {
/*if (args.length != 2) {
System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
"<output path>");
System.exit(-1);
}*/
Job job = new Job();
job.setJarByClass(MaxTemperatureWithCombiner.class);
job.setJobName("Max temperature");
// (reconstructed to match the other drivers in this document: hard-coded
// input/output paths and the reducer reused as the combiner)
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out" + System.currentTimeMillis()));
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
// ^^ MaxTemperatureWithCombiner
Output:
Pig UDF
Start Eclipse.
Create a Java project: PigUDF.
Include the Hadoop library in the Java Build Path.
Create and include a Pig user library (available in the Pig installation folder).
package com.hp.hadoop.pig;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.pig.FilterFunc;
import org.apache.pig.FuncSpec;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;
// (class body reconstructed from the imports above and the surviving tail below;
// this is the standard IsGoodQuality-style filter UDF that accepts an integer quality code)
public class IsGoodQuality extends FilterFunc {
@Override
public Boolean exec(Tuple tuple) throws IOException {
if (tuple == null || tuple.size() == 0) {
return false;
}
try {
Object object = tuple.get(0);
if (object == null) {
return false;
}
int i = (Integer) object;
return i == 0 || i == 1 || i == 4 || i == 5 || i == 9;
} catch (ExecException e) {
throw new IOException(e);
}
}
@Override
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
List<FuncSpec> funcSpecs = new ArrayList<FuncSpec>();
funcSpecs.add(new FuncSpec(this.getClass().getName(),
new Schema(new Schema.FieldSchema(null, DataType.INTEGER))));
return funcSpecs;
}
}
Sqoop
Run Sqoop:
$ sqoop
Install MySQL
Verify whether MySQL is already installed on the system.
#rpm -i MySQL-client-5.5.25-1.rhel5.i386.rpm
Verify the installation: mysql -V
Start MySQL:
Follow the steps below to stop and start MySQL; the service prints OK when it stops and starts successfully.
type:
# mysql
# mysql hadoopguide
mysql> CREATE TABLE widgets(id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
-> widget_name VARCHAR(64) NOT NULL,
-> price DECIMAL(10,2),
-> design_date DATE,
-> version INT,
-> design_comment VARCHAR(100));
Query OK, 0 rows affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'sprocket', 0.25, '2010-02-10', 1, 'Connects two gizmos');
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gizmo', 4.00, '2009-11-30', 4, NULL);
Query OK, 1 row affected (0.00 sec)
mysql> INSERT INTO widgets VALUES (NULL, 'gadget', 99.99, '1983-08-13',
-> 13, 'Our flagship product');
#cd /hadoop
Extract the following archive:
# tar -xvf mysql-connector-java-5.1.20.tar.gz -C /hadoop
# cd /hadoop/mysql-connector-java-5.1.20
# cp mysql-connector-java-5.1.20-bin.jar $SQOOP_HOME/lib
This places the library Sqoop needs in order to connect to MySQL.
Start HDFS.
Connect using Sqoop:
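The import command itself appears in the screenshots that follow; a typical invocation for the widgets table created above looks like this (the connect string and mapper count are assumptions to adjust for your setup):
$ sqoop import --connect jdbc:mysql://localhost/hadoopguide --table widgets -m 1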
Additional Notes:
PLEASE REMEMBER TO SET A PASSWORD FOR THE MySQL root USER !
To do so, start the server, then issue the following commands:
/usr/bin/mysqladmin -u root password 'new-password'
Alternatively you can run:
/usr/bin/mysql_secure_installation
which will also give you the option of removing the test databases and anonymous user created by
default. This is strongly recommended for production servers.
Map Reducing
Goals: You will be able to write a MapReduce program using the Eclipse IDE.
IDE Set Up:
1) Untar the eclipse-jee-juno-linux-gtk.tar.gz
package com.hp.hadoop;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
/**
* @param args
*/
public static void main(String[] args) throws Exception{
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"+System.currentTimeMillis()));
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
public class MaxTemperatureMapper
extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text values,Context context) throws IOException,
InterruptedException{
String line = values.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
airTemperature = Integer.parseInt(line.substring(88, 92));
} else {
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}
package com.hp.hadoop;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}
Output:
13/01/05 15:14:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where
applicable
13/01/05 15:14:25 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the
same.
13/01/05 15:14:25 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/01/05 15:14:25 INFO input.FileInputFormat: Total input paths to process : 2
13/01/05 15:14:26 INFO mapred.JobClient: Running job: job_local_0001
13/01/05 15:14:26 INFO util.ProcessTree: setsid exited with exit code 0
13/01/05 15:14:26 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ab444
13/01/05 15:14:26 INFO mapred.MapTask: io.sort.mb = 100
13/01/05 15:14:27 INFO mapred.JobClient: map 0% reduce 0%
13/01/05 15:14:30 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/05 15:14:30 INFO mapred.MapTask: record buffer = 262144/327680
13/01/05 15:14:31 INFO mapred.MapTask: Starting flush of map output
13/01/05 15:14:31 INFO mapred.MapTask: Finished spill 0
13/01/05 15:14:31 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
13/01/05 15:14:33 INFO mapred.LocalJobRunner:
13/01/05 15:14:33 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
13/01/05 15:14:33 INFO mapred.JobClient: map 100% reduce 0%
13/01/05 15:14:33 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ff8c74
13/01/05 15:14:33 INFO mapred.MapTask: io.sort.mb = 100
13/01/05 15:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
13/01/05 15:14:33 INFO mapred.MapTask: record buffer = 262144/327680
13/01/05 15:14:33 INFO mapred.MapTask: Starting flush of map output
13/01/05 15:14:33 INFO mapred.MapTask: Finished spill 0
13/01/05 15:14:33 INFO mapred.Task: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task 'attempt_local_0001_m_000001_0' done.
13/01/05 15:14:37 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@13c6641
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Merger: Merging 2 sorted segments
13/01/05 15:14:37 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 144423 bytes
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
13/01/05 15:14:37 INFO mapred.LocalJobRunner:
13/01/05 15:14:37 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
13/01/05 15:14:37 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to out1357379065119
13/01/05 15:14:40 INFO mapred.LocalJobRunner: reduce > reduce
13/01/05 15:14:40 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
13/01/05 15:14:41 INFO mapred.JobClient: map 100% reduce 100%
Execute job:
#bin/hadoop jar /hadoop/hadoop/mymapreduce.jar com.hp.hadoop.MaxTemperatureDriver intemp outtemp