
Index

1. Why Hadoop
2. Hadoop Basic Concepts
3. Introduction to the HDFS
4. Introduction to MapReduce
5. MapReduce API
6. Advanced Hadoop API
7. More Advanced MapReduce Programming
8. Joining Data Sets in MapReduce Jobs
9. Hive
10. Pig
11. HBase

Why Hadoop
Simply put, Hadoop can transform the way you store and process data throughout your enterprise.
According to analysts, about 80% of the data in the world is unstructured, and until Hadoop, it was
essentially unusable in any systematic way. With Hadoop, for the first time you can combine all your
data and look at it as one.

Make All Your Data Profitable


Hadoop enables you to gain insight from all the data you already have; to ingest the data flowing into
your systems 24/7 and leverage it to make optimizations that were impossible before; to make decisions
based on hard data, not hunches; to look at complete data, not samples; to look at years of transactions,
not days or weeks. In short, Hadoop will change the way you run your organization.

Leverage All Types of Data, From All Types of Systems


Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures,
audio files, communications records, email: just about anything you can think of. Even when different
types of data have been stored in unrelated systems, you can dump it all into your Hadoop cluster
before you even know how you might take advantage of it in the future.

Scale beyond Anything You Have Today


The largest social network in the world is built on the same open-source technology as Hadoop, and now
exceeds 100 petabytes. It's unlikely your organization has that much data. As you need more capacity,
you just add more commodity servers and Hadoop automatically incorporates the new storage and
compute capacity.

E-tailing
Recommendation engines: increase average order size by recommending complementary
products based on predictive analysis for cross-selling.
Cross-channel analytics: sales attribution, average order value, lifetime value (e.g., how many
in-store purchases resulted from a particular recommendation, advertisement or promotion).
Event analytics: what series of steps (the "golden path") led to a desired outcome (e.g., purchase,
registration).

Financial Services
Compliance and regulatory reporting.
Risk analysis and management.
Fraud detection and security analytics.
CRM and customer loyalty programs.
Credit scoring and analysis.
Trade surveillance.

Government
Fraud detection and cybersecurity.
Compliance and regulatory analysis.
Energy consumption and carbon footprint management.

Health & Life Sciences


Campaign and sales program optimization.
Brand management.
Patient care quality and program analysis.
Supply-chain management.
Drug discovery and development analysis.

Retail/CPG
Merchandizing and market basket analysis.
Campaign management and customer loyalty programs.
Supply-chain management and analytics.
Event- and behavior-based targeting.
Market and consumer segmentations.

Telecommunications
Revenue assurance and price optimization.
Customer churn prevention.
Campaign management and customer loyalty.
Call Detail Record (CDR) analysis.
Network performance and optimization.

Web & Digital Media Services


Large-scale clickstream analytics.
Ad targeting, analysis, forecasting and optimization.
Abuse and click-fraud prevention.
Social graph analysis and profile segmentation.
Campaign management and loyalty programs.

Hadoop Basic Concepts


Apache Hadoop
Apache Hadoop is a software framework for the distributed processing of large datasets.
Hadoop provides a distributed file system (HDFS) and a MapReduce implementation.
Apache Hadoop can be used to filter and aggregate data; a typical use case is the
analysis of web server log files to find the most visited pages.

HDFS (Hadoop Distributed File System)


HDFS is an Apache Software Foundation project and a subproject of the Apache Hadoop project.
HDFS is fault tolerant and provides high-throughput access to large data sets.

Overview of HDFS
HDFS has many similarities with other distributed file systems, but is different in several respects. One
noticeable difference is HDFS's write-once-read-many model that relaxes concurrency control
requirements, simplifies data coherency, and enables high-throughput access.
HDFS has many goals. Here are some of the most notable:
Scalability to reliably store and process large amounts of data.
Economy by distributing data and processing across clusters of commodity personal computers.
Efficiency by distributing data and logic to process it in parallel on nodes where data is located.
Reliability by automatically maintaining multiple copies of data and automatically redeploying
processing logic in the event of failures.
Hadoop Multi-node Architecture
The diagram below gives a simplified view of the Hadoop architecture. The MapReduce engine sits on top
of a distributed file system. Arrows represent data access. Large enclosing rectangles represent
the master and slave nodes. The small rectangles represent functional units.
The file system layer can be any virtualized distributed file system. Hadoop performs best when
coupled with the Hadoop Distributed File System because the physical data node, being
location/rack aware, can be placed closer to the task tracker that will access this data.

Fig : Hadoop Multi-Node cluster Architecture.


Definitions /Acronyms
DataNode:
A DataNode stores data in the [Hadoop Filesystem]. A functional file system has more than one
DataNode, with data replicated across them.
NameNode:
NameNode serves as both the directory namespace manager and the "inode table" for the Hadoop DFS. There
is a single NameNode running in any HDFS deployment.
MapReduce:
Hadoop MapReduce is a programming model and software framework for writing applications that
rapidly process vast amounts of data in parallel on large clusters of compute nodes.
Secondary NameNode:
The Secondary Namenode regularly connects with the Primary Namenode and builds snapshots of the
Primary Namenode's directory information, which is then saved to local/remote directories.

JobTracker:
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the
cluster, ideally the nodes that have the data, or at least are in the same rack.
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough,
they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do
then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid,
and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs
are halted.
TaskTracker:
1. A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker.
2. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can
accept.
3. When the JobTracker tries to find somewhere to schedule a task within the MapReduce
operations, it first looks for an empty slot on the same server that hosts the DataNode
containing the data, and if not, it looks for an empty slot on a machine in the same rack.

Hadoop Architecture:

Fig : An architecture that explains how HDFS works


The following are some of the key points to remember about the HDFS:
In the above diagram, there is one NameNode, and multiple DataNodes (servers). b1, b2,
indicates data blocks.
When you dump a file (or data) into HDFS, it is stored in blocks on the various nodes in
the Hadoop cluster. HDFS creates several replicas of the data blocks and distributes them
across the cluster in a way that is reliable and allows fast retrieval. A typical HDFS
block size is 128 MB. Each data block is replicated to multiple nodes across the cluster.
Hadoop internally makes sure that a node failure never results in data loss.
There will be one NameNode that manages the file system metadata.
There will be multiple DataNodes (These are the real cheap commodity servers) that will store
the data blocks.
When you execute a query from a client, it first reaches out to the NameNode to get the file
metadata, and then to the DataNodes to read the actual data blocks (a short code sketch of this
read path follows this list).
Hadoop provides a command line interface for administrators to work on HDFS.
The NameNode comes with an in-built web server from where you can browse the HDFS
filesystem and view some basic cluster statistics.
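
To make the read path above concrete, here is a minimal sketch (not from the original text) of a client reading a file through the Java FileSystem API; the file path is illustrative. The client only names the file; the FileSystem layer asks the NameNode for the block locations and then streams the blocks from the DataNodes.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // loads the cluster configuration resources
    FileSystem fs = FileSystem.get(conf);            // talks to the NameNode behind the scenes
    Path path = new Path("/user/demo/input.txt");    // illustrative path
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = reader.readLine()) != null) {     // the bytes are streamed from the DataNodes
      System.out.println(line);
    }
    reader.close();
  }
}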
How MapReduce Works
The whole process is illustrated in Figure 1. At the highest level, there are four independent entities:
The client, which submits the MapReduce job.
The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main
class is JobTracker.
The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java
applications whose main class is TaskTracker.
The distributed filesystem , which is used for sharing job files between the other entities.

Figure 1. How Hadoop runs a MapReduce job

Other Hadoop Ecosystem Components

Figure : Hadoop Ecosystem Components



Hive:
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query and analysis.
Using Hadoop directly was not easy for end users, especially for those who were not familiar with the
MapReduce framework. End users had to write map/reduce programs for simple tasks like getting raw
counts or averages. Hive was created to make it possible for analysts with strong SQL skills (but meager
Java programming skills) to run queries on huge volumes of data to extract patterns and meaningful
information. It provides an SQL-like language called HiveQL while maintaining full support for
map/reduce. In short, a Hive query is converted to MapReduce tasks.
The main building blocks of Hive are:
1. Metastore: stores the system catalog and metadata about tables, columns, partitions, etc.
2. Driver: manages the lifecycle of a HiveQL statement as it moves through Hive.
3. Query Compiler: compiles HiveQL into a directed acyclic graph of MapReduce tasks.
4. Execution Engine: executes the tasks produced by the compiler in proper dependency order.
5. HiveServer: provides a Thrift interface and a JDBC/ODBC server (see the short JDBC sketch below).
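
As a hedged illustration of the HiveServer JDBC interface mentioned in item 5, the sketch below runs a HiveQL query from Java; the host, port, table and column names are assumptions, and the driver class shown is the classic HiveServer1-era one.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");   // HiveServer1 JDBC driver
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // Hive compiles this HiveQL into one or more MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}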

HBase:
HBase is the Hadoop application to use when you require real-time read/write random-access to
very large datasets.
It is a distributed column-oriented database built on top of HDFS.
HBase is not relational and does not support SQL, but given the proper problem space, it is able to do
what an RDBMS cannot: host very large, sparsely populated tables on clusters made from
commodity hardware.

Mahout:
Mahout is an open source machine learning library from Apache.
It is highly scalable.
Mahout aims to be the machine learning tool of choice when the collection of data to be
processed is very large, perhaps far too large for a single machine. At the moment, it primarily
implements recommender engines (collaborative filtering), clustering, and classification.

Sqoop:
Loading bulk data into Hadoop from production systems, or accessing it from MapReduce applications
running on large clusters, can be a challenging task. Transferring data using scripts is inefficient and time-consuming.
How do we efficiently move data from an external storage into HDFS or Hive or HBase? Meet Apache
Sqoop. Sqoop allows easy import and export of data from structured data stores such as relational
databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced up
into different partitions and a map-only job is launched with individual mappers responsible for
transferring a slice of this dataset.

ZooKeeper:
ZooKeeper is a distributed, open-source coordination service for distributed applications.
It exposes a simple set of primitives that distributed applications can build upon to implement
higher level services for synchronization, configuration maintenance, and groups and naming.
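
A minimal sketch (assumed, not from the original text) of those primitives using the ZooKeeper Java client: connect, create a znode holding a small piece of configuration, and read it back. The connection string and znode path are illustrative.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
      public void process(WatchedEvent event) { /* session events arrive here */ }
    });
    String path = "/demo-config";                      // illustrative znode
    if (zk.exists(path, false) == null) {
      zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }
    byte[] data = zk.getData(path, false, null);       // no watch, no Stat needed
    System.out.println(new String(data));
    zk.close();
  }
}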


An introduction to the Hadoop Distributed File System


HDFS is an Apache Software Foundation project and a subproject of the Apache Hadoop project. Hadoop
is ideal for storing large amounts of data, like terabytes and petabytes, and uses HDFS as its storage
system. HDFS lets you connect nodes (commodity personal computers) contained within clusters over
which data files are distributed. You can then access and store the data files as one seamless file system.
Access to data files is handled in a streaming manner, meaning that applications or commands are
executed directly using the MapReduce processing model.
HDFS is fault tolerant and provides high-throughput access to large data sets. This article explores the
primary features of HDFS and provides a high-level view of the HDFS architecture.

Overview of HDFS
HDFS has many similarities with other distributed file systems, but is different in several
respects. One noticeable difference is HDFS's write-once-read-many model that relaxes concurrency
control requirements, simplifies data coherency, and enables high-throughput access.
Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near
the data rather than moving the data to the application space.
HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end of a
stream, and byte streams are guaranteed to be stored in the order written.
HDFS has many goals. Here are some of the most notable:
Fault tolerance by detecting faults and applying quick, automatic recovery
Data access via MapReduce streaming
Simple and robust coherency model
Processing logic close to the data, rather than the data close to the processing logic
Portability across heterogeneous commodity hardware and operating systems
Scalability to reliably store and process large amounts of data
Economy by distributing data and processing across clusters of commodity personal computers
Efficiency by distributing data and logic to process it in parallel on nodes where data is located
Reliability by automatically maintaining multiple copies of data and automatically redeploying
processing logic in the event of failures
HDFS provides interfaces for applications to move them closer to where the data is located, as described
in the following section.


Application interfaces into HDFS


You can access HDFS in many different ways. HDFS provides a native Java application programming
interface (API) and a native C-language wrapper for the Java API. In addition, you can use a web browser
to browse HDFS files.
The applications described in Table 1 are also available to interface with HDFS.

Table 1. Applications that can interface with HDFS


FileSystem (FS) shell: A command-line interface similar to common Linux and UNIX shells (bash, csh,
etc.) that allows interaction with HDFS data.
DFSAdmin: A command set that you can use to administer an HDFS cluster.
fsck: A subcommand of the Hadoop command/application. You can use the fsck command to check for
inconsistencies with files, such as missing blocks, but you cannot use the fsck command to correct
these inconsistencies.
Name nodes and data nodes: These have built-in web servers that let administrators check the current
status of a cluster.

HDFS has an extraordinary feature set with high expectations thanks to its simple, yet powerful,
architecture.

HDFS architecture
HDFS is comprised of interconnected clusters of nodes where files and directories reside. An HDFS
cluster consists of a single node, known as a NameNode, that manages the file system namespace and
regulates client access to files. In addition, data nodes (DataNodes) store data as blocks within files.
Name nodes and data nodes
Within HDFS, a given name node manages file system namespace operations like opening, closing, and
renaming files and directories. A name node also maps data blocks to data nodes, which handle read
and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks
according to instructions from the governing name node.


Figure 1 illustrates the high-level architecture of HDFS.

Figure 1. The HDFS architecture

As Figure 1 illustrates, each cluster contains one name node. This design facilitates a simplified model
for managing each namespace and arbitrating data distribution.
Relationships between name nodes and data nodes
Name nodes and data nodes are software components designed to run in a decoupled manner on
commodity machines across heterogeneous operating systems. HDFS is built using the Java
programming language; therefore, any machine that supports the Java programming language can run
HDFS. A typical installation cluster has a dedicated machine that runs a name node and possibly one
data node. Each of the other machines in the cluster runs one data node.

Communications protocols
All HDFS communication protocols build on the TCP/IP protocol. HDFS clients connect to a Transmission
Control Protocol (TCP) port opened on the name node, and then communicate with the name node
using a proprietary Remote Procedure Call (RPC)-based protocol. Data nodes talk to the name node
using a proprietary block-based protocol.
Data nodes continuously loop, asking the name node for instructions. A name node can't connect
directly to a data node; it simply returns values from functions invoked by a data node. Each data node
maintains an open server socket so that client code or other data nodes can read or write data. The host
or port for this server socket is known by the name node, which provides the information to interested
clients or other data nodes. See the Communications protocols sidebar for more about communication
between data nodes, name nodes, and clients.
The name node maintains and administers changes to the file system namespace.
File system namespace
HDFS supports a traditional hierarchical file organization in which a user or an application can create
directories and store files inside them. The file system namespace hierarchy is similar to most other
existing file systems; you can create, rename, relocate, and remove files.
HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3).

Data replication
HDFS replicates file blocks for fault tolerance. An application can specify the number of replicas of a file
at the time it is created, and this number can be changed any time after that. The name node makes all
decisions concerning block replication.

Rack awareness
Typically, large HDFS clusters are arranged across multiple installations (racks). Network traffic between
different nodes within the same installation is more efficient than network traffic across installations. A
name node tries to place replicas of a block on multiple installations for improved fault tolerance.
However, HDFS allows administrators to decide on which installation a node belongs. Therefore, each
node knows its rack ID, making it rack aware.
HDFS uses an intelligent replica placement model for reliability and performance. Optimizing replica
placement makes HDFS unique among most other distributed file systems, and is facilitated by a
rack-aware replica placement policy that uses network bandwidth efficiently.
Large HDFS environments typically operate across multiple installations of computers. Communication
between two data nodes in different installations is typically slower than data nodes within the same
installation. Therefore, the name node attempts to optimize communications between data nodes. The
name node identifies the location of data nodes by their rack IDs.

Data organization
One of the main goals of HDFS is to support large files. The size of a typical HDFS block is 64MB.
Therefore, each HDFS file consists of one or more 64MB blocks. HDFS tries to place each block on
separate data nodes.
File creation process

Manipulating files on HDFS is similar to the processes used with other file systems. However, because
HDFS is a multi-machine system that appears as a single disk, all code that manipulates files on HDFS
uses a subclass of the org.apache.hadoop.fs.FileSystem class.
The code shown in Listing 1 illustrates a typical file creation process on HDFS.

Listing 1. Typical file creation process on HDFS


byte[] fileData = retrieveFileDataFromSomewhere();
String filePath = retrieveFilePathStringFromSomewhere();

// The Configuration object automatically loads the default Hadoop
// configuration resources (e.g. hadoop-default.xml and hadoop-site.xml).
Configuration config = new Configuration();

org.apache.hadoop.fs.FileSystem hdfs = org.apache.hadoop.fs.FileSystem.get(config);
org.apache.hadoop.fs.Path path = new org.apache.hadoop.fs.Path(filePath);
org.apache.hadoop.fs.FSDataOutputStream outputStream = hdfs.create(path);

outputStream.write(fileData, 0, fileData.length);
outputStream.close();   // flush and close the stream so the file is committed

Staging to commit
When a client creates a file in HDFS, it first caches the data into a temporary local file. It then redirects
subsequent writes to the temporary file. When the temporary file accumulates enough data to fill an
HDFS block, the client reports this to the name node, which converts the file to a permanent data node.
The client then closes the temporary file and flushes any remaining data to the newly created data node.
The name node then commits the data node to disk.

Replication pipelining
When a client accumulates a full block of user data, it retrieves a list of data nodes that contains a
replica of that block from the name node. The client then flushes the full data block to the first data
node specified in the replica list. As the node receives chunks of data, it writes them to disk and
transfers copies to the next data node in the list. The next data node does the same. This pipelining
process is repeated until the replication factor is satisfied.


Data storage reliability


One important objective of HDFS is to store data reliably, even when failures occur within name nodes,
data nodes, or network partitions.
Detection is the first step HDFS takes to overcome failures. HDFS uses heartbeat messages to detect
connectivity between name and data nodes.

HDFS heartbeats
Several things can cause loss of connectivity between name and data nodes. Therefore, each data node
sends periodic heartbeat messages to its name node, so the latter can detect loss of connectivity if it
stops receiving them. The name node marks data nodes that do not respond to heartbeats as dead and
refrains from sending further requests to them. Data stored on a dead node is no longer available to an
HDFS client from that node, which is effectively removed from the system. If the death of a node causes
the replication factor of data blocks to drop below their minimum value, the name node initiates
additional replication to bring the replication factor back to a normalized state.
Figure 2 illustrates the HDFS process of sending heartbeat messages.

Figure 2. The HDFS heartbeat process

Data block rebalancing


HDFS data blocks might not always be placed uniformly across data nodes, meaning that the used space
for one or more data nodes can be underutilized. Therefore, HDFS supports rebalancing data blocks
using various models. One model might move data blocks from one data node to another automatically
if the free space on a data node falls too low. Another model might dynamically create additional
replicas and rebalance other data blocks in a cluster if a sudden increase in demand for a given file
occurs. HDFS also provides the hadoop balancer command for manual rebalancing tasks.
One common reason to rebalance is the addition of new data nodes to a cluster. When placing new
blocks, name nodes consider various parameters before choosing the data nodes to receive them. Some
of the considerations are:
Block-replica writing policies
Prevention of data loss due to installation or rack failure
Reduction of cross-installation network I/O
Uniform data spread across data nodes in a cluster
The cluster-rebalancing feature of HDFS is just one mechanism it uses to sustain the integrity of its data.
Other mechanisms are discussed next.

Data integrity
HDFS goes to great lengths to ensure the integrity of data across clusters. It uses checksum validation on
the contents of HDFS files by storing computed checksums in separate, hidden files in the same
namespace as the actual data. When a client retrieves file data, it can verify that the data received
matches the checksum stored in the associated file.
The HDFS namespace is stored using a transaction log kept by each name node. The file system
namespace, along with file block mappings and file system properties, is stored in a file called FsImage.
When a name node is initialized, it reads the FsImage file along with other files, and applies the
transactions and state information found in these files.

Synchronous metadata updating


A name node uses a log file known as the EditLog to persistently record every transaction that occurs to
HDFS file system metadata. If the EditLog or FsImage files become corrupted, the HDFS instance to
which they belong ceases to function. Therefore, a name node supports multiple copies of the FsImage
and EditLog files. With multiple copies of these files in place, any change to either file propagates
synchronously to all of the copies. When a name node restarts, it uses the latest consistent version of
FsImage and EditLog to initialize itself.

HDFS permissions for users, files, and directories


HDFS implements a permissions model for files and directories that has a lot in common with the
Portable Operating System Interface (POSIX) model; for example, every file and directory is associated
with an owner and a group. The HDFS permissions model supports read (r), write (w), and execute (x).
Because there is no concept of file execution within HDFS, the x permission takes on a different
meaning. Simply put, the x attribute indicates permission for accessing a child directory of a given parent
directory. The owner of a file or directory is the identity of the client process that created it. The group is
the group of the parent directory.
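
A short hedged sketch of applying this permissions model from Java (the path, owner and group names are illustrative, and changing ownership normally requires superuser rights):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsAction;
import org.apache.hadoop.fs.permission.FsPermission;

public class HdfsPermissionsExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path report = new Path("/user/demo/report.txt");            // illustrative file
    fs.setPermission(report, new FsPermission(FsAction.ALL,     // owner: rwx
                                              FsAction.READ,    // group: r--
                                              FsAction.NONE));  // others: ---
    fs.setOwner(report, "alice", "analysts");                   // assumed owner and group
  }
}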


Introduction to MapReduce
Introduction
MapReduce is a programming model designed for processing large volumes of data in parallel by
dividing the work into a set of independent tasks.
MapReduce programs are written in a particular style influenced by functional programming
constructs, specifically idioms for processing lists of data.
This module explains the nature of this programming model and how it can be used to write
programs which run in the Hadoop environment.
Goals for this Module:
Understand functional programming as it applies to MapReduce
Understand the MapReduce program flow
Understand how to write programs for Hadoop MapReduce
Learn about additional features of Hadoop designed to aid software development.
MapReduce Basics
Functional Programming Concepts
MapReduce programs are designed to compute large volumes of data in a parallel fashion. This
requires dividing the workload across a large number of machines.
This model would not scale to large clusters (hundreds or thousands of nodes) if the
components were allowed to share data arbitrarily.
The communication overhead required to keep the data on the nodes synchronized at all times
would prevent the system from performing reliably or efficiently at large scale.
Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated.
If in a mapping task you change an input (key, value) pair, it does not get reflected back in the
input files; communication occurs only by generating new output (key, value) pairs which are
then forwarded by the Hadoop system into the next phase of execution.
List Processing
Conceptually, MapReduce programs transform lists of input data elements into lists of output
data elements.
A MapReduce program will do this twice, using two different list processing idioms: map, and
reduce. These terms are taken from several list processing languages such as LISP, Scheme, or
ML.
Mapping Lists
The first phase of a MapReduce program is called mapping. A list of data elements are provided,
one at a time, to a function called the Mapper, which transforms each element individually to an
output data element.

As an example of the utility of map: Suppose you had a function toUpper(str) which returns an
uppercase version of its input string. You could use this function with map to turn a list of strings
into a list of uppercase strings.
Note that we are not modifying the input string here: we are returning a new string that will
form part of a new output list.
Reducing Lists
Reducing lets you aggregate values together. A reducer function receives an iterator of input
values from an input list. It then combines these values together, returning a single output
value.
Reducing is often used to produce "summary" data, turning a large volume of data into a smaller
summary of itself. For example, "+" can be used as a reducing function, to return the sum of a
list of input values.
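
The two idioms can be illustrated with ordinary Java collections before any Hadoop machinery is involved; the following is plain Java, not Hadoop code.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ListIdioms {
  public static void main(String[] args) {
    // Mapping: apply toUpperCase to every element, producing a new list.
    List<String> words = Arrays.asList("sweet", "foo", "bar");
    List<String> upper = words.stream()
                              .map(String::toUpperCase)
                              .collect(Collectors.toList());
    System.out.println(upper);                        // [SWEET, FOO, BAR]

    // Reducing: fold a list of numbers into a single value with "+".
    List<Integer> values = Arrays.asList(65, 50, 40, 25);
    int sum = values.stream().reduce(0, Integer::sum);
    System.out.println(sum);                          // 180
  }
}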
Putting Them Together in MapReduce:
The Hadoop MapReduce framework takes these concepts and uses them to process large volumes of
information. A MapReduce program has two components: one that implements the mapper, and
another that implements the reducer. The Mapper and Reducer idioms described above are extended
slightly to work in this environment, but the basic principles are the same.
Keys and values: In MapReduce, no value stands on its own. Every value has a key associated with it.
Keys identify related values. For example, a log of time-coded speedometer readings from multiple cars
could be keyed by license-plate number; it would look like:
AAA-123
ZZZ-789
AAA-123
CCC-456
...

65mph, 12:00pm
50mph, 12:02pm
40mph, 12:05pm
25mph, 12:15pm

The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of
these functions is the same: both a key and a value must be emitted to the next list in the data flow.
MapReduce is also less strict than other languages about how the Mapper and Reducer work.
In more formal functional mapping and reducing settings, a mapper must produce exactly one
output element for each input element, and a reducer must produce exactly one output
element for each input list.
In MapReduce, an arbitrary number of values can be output from each phase; a mapper may
map one input into zero, one, or one hundred outputs.
A reducer may compute over an input list and emit one or a dozen different outputs.
Keys divide the reduce space: A reducing function turns a large list of values into one (or a few) output
values. In MapReduce, all of the output values are not usually reduced together. All of the values with
the same key are presented to a single reducer together. This is performed independently of any reduce
operations occurring on other lists of values, with different keys attached.

An Example Application: Word Count


A simple MapReduce program can be written to determine how many times different words appear in a
set of files. For example, if we had the files:
foo.txt: Sweet, this is the foo file
bar.txt: This is the bar file
We would expect the output to be:
sweet 1
this 2
is 2
the 2
foo 1
bar 1
file 2
Naturally, we can write a program in MapReduce to compute this output. The high-level structure would
look like this:
mapper (filename, file-contents):
for each word in file-contents:
emit (word, 1)
reducer (word, values):
sum = 0
for each value in values:
sum = sum + value
emit (word, sum)
Listing 1: High-Level MapReduce Word Count
Several instances of the mapper function are created on the different machines in our cluster. Each
instance receives a different input file (it is assumed that we have many such files). The mappers output
(word, 1) pairs which are then forwarded to the reducers. Several instances of the reducer method are
also instantiated on the different machines. Each reducer is responsible for processing the list of values
associated with a different word. The list of values will be a list of 1's; the reducer sums up those ones
into a final count associated with a single word. The reducer then emits the final (word, count) output
which is written to an output file.


We can write a very similar program to this in Hadoop MapReduce; it is included in the Hadoop
distribution in src/examples/org/apache/hadoop/examples/WordCount.java. It is partially reproduced
below:

public static class MapClass extends MapReduceBase


implements Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer itr = new StringTokenizer(line);
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, one);
}
}
}
/**
* A reducer class that just emits the sum of the input values.
*/
public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Listing 2: Hadoop MapReduce Word Count Source
There are some minor differences between this actual Java implementation and the pseudo-code shown
above.
First, Java has no native emit keyword; the OutputCollector object you are given as an input will
receive values to emit to the next stage of execution.

Second, the default input format used by Hadoop presents each line of an input file as a
separate input to the mapper function, not the entire file at a time. It also uses a StringTokenizer
object to break up the line into words. This does not perform any normalization of the input, so
"cat", "Cat" and "cat," are all regarded as different strings.
Note that the class-variable word is reused each time the mapper outputs another (word, 1)
pairing; this saves time by not allocating a new variable for each output.
The output.collect() method will copy the values it receives as input, so you are free to overwrite
the variables you use.
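
If normalization is wanted, the map() method of MapClass above could be adapted as sketched below. This is an assumed variation, not part of the Hadoop example; it reuses the word and one fields from Listing 2, and the regular expression is an illustrative choice.

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
  String line = value.toString().toLowerCase();          // fold case so "Cat" matches "cat"
  StringTokenizer itr = new StringTokenizer(line);
  while (itr.hasMoreTokens()) {
    // strip anything that is not a letter or digit, so "cat," becomes "cat"
    String token = itr.nextToken().replaceAll("[^a-z0-9]", "");
    if (!token.isEmpty()) {
      word.set(token);
      output.collect(word, one);
    }
  }
}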

The Driver Method


There is one final component of a Hadoop MapReduce program, called the Driver. The driver initializes
the job and instructs the Hadoop platform to execute your code on a set of input files, and controls
where the output files are placed. A cleaned-up version of the driver from the example Java
implementation that comes with Hadoop is presented below:
public void run(String inputPath, String outputPath) throws Exception {
JobConf conf = new JobConf (WordCount.class);
conf.setJobName("wordcount");
// the keys are words (strings)
conf.setOutputKeyClass(Text.class);
// the values are counts (ints)
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(MapClass.class);
conf.setReducerClass(Reduce.class);
FileInputFormat.addInputPath(conf, new Path(inputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));
JobClient.runJob(conf);
}
Listing 3: Hadoop MapReduce Word Count Driver
This method sets up a job to execute the word count program across all the files in a given input
directory (the inputPath argument).
The output from the reducers is written into files in the directory identified by outputPath.
The configuration information to run the job is captured in the JobConf object.
The mapping and reducing functions are identified by the setMapperClass() and
setReducerClass() methods.
The data types emitted by the reducer are identified by setOutputKeyClass() and
setOutputValueClass(). By default, it is assumed that these are the output types of the mapper
as well. If this is not the case, the setMapOutputKeyClass() and
setMapOutputValueClass() methods of the JobConf class will override these.
The input types fed to the mapper are controlled by the InputFormat used.
The default input format, "TextInputFormat," will load data in as (LongWritable, Text) pairs.
The long value is the byte offset of the line in the file. The Text object holds the string contents
of the line of the file.
The call to JobClient.runJob(conf) will submit the job to MapReduce. This call will block until the
job completes. If the job fails, it will throw an IOException. JobClient also provides a non-blocking version called submitJob().
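
For reference, a hedged sketch of that non-blocking alternative: submit the job with JobClient.submitJob() and poll the returned RunningJob for progress. This would sit inside a method, such as run() above, that is allowed to throw exceptions.

JobClient client = new JobClient(conf);
RunningJob running = client.submitJob(conf);          // returns immediately
while (!running.isComplete()) {
  System.out.printf("map %.0f%% reduce %.0f%%%n",
      running.mapProgress() * 100, running.reduceProgress() * 100);
  Thread.sleep(5000);                                 // poll every few seconds
}
if (!running.isSuccessful()) {
  throw new IOException("Job failed: " + running.getID());
}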
MapReduce Data Flow
Now that we have seen the components that make up a basic MapReduce job, we can see how
everything works together at a higher level:
MapReduce inputs typically come from input files loaded onto our processing cluster in HDFS.
These files are evenly distributed across all our nodes.
Running a MapReduce program involves running mapping tasks on many or all of the nodes in
our cluster. Each of these mapping tasks is equivalent: no mappers have particular "identities"
associated with them. Therefore, any mapper can process any input file.
Each mapper loads the set of files local to that machine and processes them.
When the mapping phase has completed, the intermediate (key, value) pairs must be exchanged
between machines to send all values with the same key to a single reducer.
The reduce tasks are spread across the same nodes in the cluster as the mappers. This is the
only communication step in MapReduce.
Individual map tasks do not exchange information with one another, nor are they aware of one
another's existence.
Similarly, different reduce tasks do not communicate with one another.
The user never explicitly marshals information from one machine to another; all data transfer is
handled by the Hadoop MapReduce platform itself, guided implicitly by the different keys
associated with values.
This is a fundamental element of Hadoop MapReduce's reliability. If nodes in the cluster fail,
tasks must be able to be restarted. If they have been performing side-effects, e.g.,
communicating with the outside world, then the shared state must be restored in a restarted
task. By eliminating communication and side-effects, restarts can be handled more gracefully.
Input files: This is where the data for a MapReduce task is initially stored. While this does not need to be
the case, the input files typically reside in HDFS. The format of these files is arbitrary; while line-based
log files can be used, we could also use a binary format, multi-line input records, or something else
entirely. It is typical for these input files to be very large -- tens of gigabytes or more.
InputFormat: How these input files are split up and read is defined by the InputFormat. An InputFormat
is a class that provides the following functionality:
Selects the files or other objects that should be used for input
Defines the InputSplits that break a file into tasks
Provides a factory for RecordReader objects that read the file


TextInputFormat: the default format; reads lines of text files. Key: the byte offset of the line.
Value: the line contents.
KeyValueInputFormat: parses lines into (key, value) pairs. Key: everything up to the first tab
character. Value: the remainder of the line.
SequenceFileInputFormat: a Hadoop-specific high-performance binary format. Key: user-defined.
Value: user-defined.

Table 1: InputFormats provided by MapReduce


The default InputFormat is the TextInputFormat. This treats each line of each input file as a
separate record, and performs no parsing. This is useful for unformatted data or line-based
records like log files.
A more interesting input format is the KeyValueInputFormat. This format also treats each line of
input as a separate record. While the TextInputFormat treats the entire line as the value, the
KeyValueInputFormat breaks the line itself into the key and value by searching for a tab
character. This is particularly useful for reading the output of one MapReduce job as the input to
another, as the default OutputFormat (described in more detail below) formats its results in this
manner.
Finally, the SequenceFileInputFormat reads special binary files that are specific to Hadoop.
These files include many features designed to allow data to be rapidly read into Hadoop
mappers. Sequence files are block-compressed and provide direct serialization and
deserialization of several arbitrary data types (not just text). Sequence files can be generated as
the output of other MapReduce tasks and are an efficient intermediate representation for data
that is passing from one MapReduce job to another.
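
As a hedged example of choosing a non-default format, the old-API driver can ask for key/value parsing of tab-separated input. In the org.apache.hadoop.mapred package the class implementing the key/value behaviour described above is named KeyValueTextInputFormat; the input path below is illustrative.

JobConf conf = new JobConf(WordCount.class);
// With this format the mapper receives Text keys (up to the first tab) and Text values.
conf.setInputFormat(org.apache.hadoop.mapred.KeyValueTextInputFormat.class);
FileInputFormat.addInputPath(conf, new Path("/some/previous/job/output"));  // illustrative path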
InputSplits:
An InputSplit describes a unit of work that comprises a single map task in a MapReduce
program.
A MapReduce program applied to a data set, collectively referred to as a Job, is made up of
several (possibly several hundred) tasks.
Map tasks may involve reading a whole file; they often involve reading only part of a file. By
default, the FileInputFormat and its descendants break a file up into 64 MB chunks (the same
size as blocks in HDFS).
RecordReader:
The InputSplit has defined a slice of work, but does not describe how to access it.
The RecordReader class actually loads the data from its source and converts it into (key, value)
pairs suitable for reading by the Mapper.
The RecordReader instance is defined by the InputFormat.
The default InputFormat, TextInputFormat, provides a LineRecordReader, which treats each line
of the input file as a new value.

The key associated with each line is its byte offset in the file.
The RecordReader is invoked repeatedly on the input until the entire InputSplit has been
consumed.
Each invocation of the RecordReader leads to another call to the map() method of the Mapper.
Mapper:
The Mapper performs the interesting user-defined work of the first phase of the MapReduce
program.
Given a key and a value, the map() method emits (key, value) pair(s) which are forwarded to the
Reducers.
Partition & Shuffle:
The process of moving map outputs to the reducers is known as shuffling.
A different subset of the intermediate key space is assigned to each reduce node; these subsets
(known as "partitions") are the inputs to the reduce tasks.
Each map task may emit (key, value) pairs to any partition; all values for the same key are always
reduced together regardless of which mapper is its origin.
Therefore, the map nodes must all agree on where to send the different pieces of the
intermediate data. The Partitioner class determines which partition a given (key, value) pair will
go to. The default partitioner computes a hash value for the key and assigns the partition based
on this result.
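
A hedged sketch of a custom old-API Partitioner that reproduces this default hash behaviour; the class name is illustrative, and it would be registered with conf.setPartitionerClass(HashLikePartitioner.class).

public static class HashLikePartitioner
    implements org.apache.hadoop.mapred.Partitioner<Text, IntWritable> {

  public void configure(JobConf job) {
    // no per-job configuration needed for this partitioner
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // mask off the sign bit so the result is never negative, then spread by key hash
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}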
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys.
The set of intermediate keys on a single node is automatically sorted by Hadoop before they are
presented to the Reducer.
Reduce:
A Reducer instance is created for each reduce task.
This is an instance of user-provided code that performs the second important phase of job-specific work.
For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called
once.
This receives a key as well as an iterator over all the values associated with the key.
The values associated with a key are returned by the iterator in an undefined order.
The Reducer also receives as parameters OutputCollector and Reporter objects; they are used in
the same manner as in the map() method.
OutputFormat:
The (key, value) pairs provided to this OutputCollector are then written to output files. The way
they are written is governed by the OutputFormat.
The OutputFormat functions much like the InputFormat class described earlier.
The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS;
they all inherit from a common FileOutputFormat.

Each Reducer writes a separate file in a common output directory.


These files will typically be named part-nnnnn, where nnnnn is the partition id associated with
the reduce task.
The output directory is set by the FileOutputFormat.setOutputPath() method.
You can control which particular OutputFormat is used by calling the setOutputFormat() method
of the JobConf object that defines your MapReduce job.
A table of provided OutputFormats is given below.

TextOutputFormat: the default; writes lines in "key \t value" form.
SequenceFileOutputFormat: writes binary files suitable for reading into subsequent MapReduce jobs.
NullOutputFormat: disregards its inputs.

Table 2: OutputFormats provided by Hadoop
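
For instance, a job whose output feeds another MapReduce job could be configured to write a sequence file. This is a hedged sketch against the old JobConf API used earlier, and the output path is illustrative.

conf.setOutputFormat(org.apache.hadoop.mapred.SequenceFileOutputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
// the next job can read this directory with SequenceFileInputFormat
FileOutputFormat.setOutputPath(conf, new Path("/intermediate/wordcounts"));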
RecordWriter: Much like how the InputFormat actually reads individual records through the
RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects; these are
used to write the individual records to the files as directed by the OutputFormat.
The output files written by the Reducers are then left in HDFS for your use, either by another
MapReduce job, by a separate program, or for human inspection.
Hadoop Streaming
Whereas Pipes is an API that provides close coupling between C++ application code and Hadoop,
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop
Mapper and Reducer implementations.
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a
MapReduce job. Both Mappers and Reducers receive their input on stdin and emit output (key, value)
pairs on stdout.
Input and output are always represented textually in Streaming. The input (key, value) pairs are written
to stdin for a Mapper or Reducer, with a 'tab' character separating the key from the value. The
Streaming programs should split the input on the first tab character on the line to recover the key and
the value. Streaming programs write their output to stdout in the same format: key \t value \n.
The inputs to the reducer are sorted so that while each line contains only a single (key, value) pair, all
the values for the same key are adjacent to one another.
Provided it can handle its input in the text format described above, any Linux program or tool can be
used as the mapper or reducer in Streaming. You can also write your own scripts in bash, python, perl,
or another language of your choice, provided that the necessary interpreter is present on all nodes in
your cluster.
Running a Streaming Job: To run a job with Hadoop Streaming, use the following command:
$ bin/hadoop jar contrib/streaming/hadoop-version-streaming.jar

The command as shown, with no arguments, will print some usage information. An example of how to
run real commands is given below:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
myMapProgram -reducer myReduceProgram -input /some/dfs/path \
-output /some/other/dfs/path
This assumes that myMapProgram and myReduceProgram are present on all nodes in the system ahead
of time. If this is not the case, but they are present on the node launching the job, then they can be
"shipped" to the other nodes with the -file option:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
myMapProgram -reducer myReduceProgram -file \
myMapProgram -file myReduceProgram -input some/dfs/path \
-output some/other/dfs/path
Any other support files necessary to run your program can be shipped in this manner as well.


MapReduce API
Package org.apache.hadoop.mapreduce
Interface Summary
Counter: A named counter that tracks the progress of a map/reduce job.
CounterGroup: A group of Counters that logically belong together.
JobContext: A read-only view of the job that is provided to the tasks while they are running.
MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>: The context that is given to the Mapper.
ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>: The context passed to the Reducer.
TaskAttemptContext: The context for task attempts.
TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>: A context object that allows input and output from the task.

Class Summary
Cluster: Provides a way to access information about the map/reduce cluster.
ClusterMetrics: Status information on the current state of the MapReduce cluster.
Counters: Counters holds per-job/task counters, defined either by the Map-Reduce framework or applications.
ID: A general identifier, which internally stores the id as an integer.
InputFormat<K,V>: InputFormat describes the input specification for a Map-Reduce job.
InputSplit: InputSplit represents the data to be processed by an individual Mapper.
Job: The job submitter's view of the Job.
JobID: JobID represents the immutable and unique identifier for the job.
JobStatus: Describes the current status of a job.
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>: Maps input key/value pairs to a set of intermediate key/value pairs.
MarkableIterator<VALUE>: MarkableIterator is a wrapper iterator class that implements the MarkableIteratorInterface.
OutputCommitter: OutputCommitter describes the commit of task output for a Map-Reduce job.
OutputFormat<K,V>: OutputFormat describes the output specification for a Map-Reduce job.
Partitioner<KEY,VALUE>: Partitions the key space.
QueueAclsInfo: Class to encapsulate Queue ACLs for a particular user.
QueueInfo: Class that contains the information regarding the Job Queues which are maintained by the Hadoop Map/Reduce framework.
RecordReader<KEYIN,VALUEIN>: The record reader breaks the data into key/value pairs for input to the Mapper.
RecordWriter<K,V>: RecordWriter writes the output <key, value> pairs to an output file.
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>: Reduces a set of intermediate values which share a key to a smaller set of values.
TaskAttemptID: TaskAttemptID represents the immutable and unique identifier for a task attempt.
TaskCompletionEvent: This is used to track task completion events on the job tracker.
TaskID: TaskID represents the immutable and unique identifier for a Map or Reduce Task.
TaskTrackerInfo: Information about a TaskTracker.

Enum Summary
JobCounter
JobPriority: Used to describe the priority of the running job.
QueueState: Enum representing queue state.
TaskCompletionEvent.Status
TaskCounter
TaskType: Enum for map, reduce, job-setup, job-cleanup, task-cleanup task types.

Mapper
Constructor Detail
Mapper
public Mapper()
Method Detail
setup
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Called once at the beginning of the task.
Throws:
IOException
InterruptedException

map
protected void map(KEYIN key,
VALUEIN value,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Called once for each key/value pair in the input split. Most applications should override this, but
the default is the identity function.
Throws:
IOException
InterruptedException

cleanup
protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Called once at the end of the task.
Throws:
IOException
InterruptedException


run
public void run(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Expert users can override this method for more complete control over the execution of the
Mapper.
Parameters:
context
Throws:
IOException
InterruptedException

RecordReader
Constructor Detail
RecordReader
public RecordReader()
Method Detail
initialize
public abstract void initialize(InputSplit split, TaskAttemptContext context)
throws IOException,InterruptedException
Called once at initialization.
Parameters:
split - the split that defines the range of records to read
context - the information about the task
Throws:
IOException
InterruptedException

nextKeyValue
public abstract boolean nextKeyValue()
throws IOException,InterruptedException
Read the next key, value pair.
Returns:
true if a key/value pair was read
Throws:
IOException
InterruptedException


getCurrentKey
public abstract KEYIN getCurrentKey()
throws IOException, InterruptedException
Get the current key
Returns:
the current key or null if there is no current key
Throws:
IOException
InterruptedException

getCurrentValue
public abstract VALUEIN getCurrentValue()
throws IOException, InterruptedException
Get the current value.
Returns:
the object that was read
Throws:
IOException
InterruptedException

getProgress
public abstract float getProgress()
throws IOException, InterruptedException
The current progress of the record reader through its data.
Returns:
a number between 0.0 and 1.0 that is the fraction of the data read
Throws:
IOException
InterruptedException

close
public abstract void close()
throws IOException
Close the record reader.
Specified by:
close in interface Closeable
Throws:
IOException


Reducer
Constructor Detail
Reducer
public Reducer()
Method Detail
setup
protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException
Called once at the start of the task.
Throws:
IOException
InterruptedException

reduce
protected void reduce(KEYIN key,
Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException
This method is called once for each key. Most applications will define their reduce class by
overriding this method. The default implementation is an identity function.
Throws:
IOException
InterruptedException

cleanup
protected void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException,InterruptedException
Called once at the end of the task.
Throws:
IOException
InterruptedException


run
public void run(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException
Advanced application writers can use the run(org.apache.hadoop.mapreduce.Reducer.Context)
method to control how the reduce task works.
Throws:
IOException
InterruptedException

Prior to Hadoop 0.20.x, a Map class had to extend a MapReduceBase and implement a Mapper as such:
public static class Map extends MapReduceBase implements Mapper {
...
}
and similarly, a map function had to use an OutputCollector and a Reporter object to emit (key, value)
pairs and send progress updates to the main program. A typical map function looked like:
public void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
    throws IOException {
  ...
  output.collect(key, value);
}
With the new Hadoop API, a mapper or reducer has to extend classes from the package
org.apache.hadoop.mapreduce.* and there is no need to implement an interface anymore. Here is how
a Map class is defined in the new API:
public class MapClass extends Mapper { ...
}
and a map function uses Context objects to emit records and send progress updates. A typical map
function is now defined as:
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException { ...
context.write(key,value);
}
All of the changes described above for a Mapper apply in the same way to a Reducer (a short sketch of a new-API reducer follows).
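
For completeness, here is a hedged sketch of the word-count reducer written against the new API (the class name is illustrative):

public static class SumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {   // the new API passes the values as an Iterable
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}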
Another major change has been done in the way a job is configured and controlled. Earlier, a map
reduce job was configured through a JobConf object and the job control was done using an instance of
JobClient. The main body of a driver class used to look like:
JobConf conf = new JobConf(Driver.class);
conf.setPropertyX(..);
conf.setPropertyY(..);
...
...
JobClient.runJob(conf);
In the new Hadoop API, the same functionality is achieved as follows:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Driver.class);
job.setPropertyX(..);
job.setPropertyY(..);
job.waitForCompletion(true);


Advance Hadoop API

Combiner:
The primary goal of combiners is to optimize/minimize the number of key/value pairs that
will be shuffled across the network between mappers and reducers, and thus to save as
much bandwidth as possible.
E.g., take the word count example on a text containing the word "the" one million times. Without
a combiner the mapper will send one million key/value pairs of the form <the,1>. With
combiners, it will potentially send far fewer key/value pairs of the form <the,N>, with N a
number potentially much bigger than 1. That is just the intuition (see the references at the
end for more details).
Simply speaking, a combiner can be considered as a mini reducer that is applied,
potentially several times, during the map phase before sending the new (hopefully
reduced) set of key/value pairs to the reducer(s). This is why a combiner must implement
the Reducer interface (or extend the Reducer class as of Hadoop 0.20).
conf.setCombinerClass(Reduce.class);

A combiner cannot be used with every reduce function, however. Suppose the reducer computes
the mean of the values for a given key k, and 5 key/value pairs are emitted from the mapper for k:
<k,40>, <k,30>, <k,20>, <k,2>, <k,8>. Without a combiner, the reducer receives the list
<k,{40,30,20,2,8}> and the mean output is 20. But if a combiner computing the mean were applied
first to the two sets (<k,40>, <k,30>, <k,20>) and (<k,2>, <k,8>) separately, the reducer would
receive the list <k,{30,5}> and the output would be different (17.5), which is incorrect behavior.
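A combiner is safe whenever the reduce function is commutative and associative, as in word count. A minimal sketch in the new API, assuming a word-count style job (IntSumReducer is the stock summing reducer shipped in org.apache.hadoop.mapreduce.lib.reduce):
job.setCombinerClass(IntSumReducer.class);   // partial sums are computed during the map phase
job.setReducerClass(IntSumReducer.class);    // final sums are computed in the reduce phase
// For the mean example above, you would instead have the combiner emit partial
// (sum, count) pairs and perform the division only in the reducer.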


Performance Measurement:
Local Execution Mode using LocalJobRunner from Hadoop
Hadoop's LocalJobRunner can execute the same MapReduce physical plans locally. So we
compile the logical plan into a MapReduce physical plan and create the jobcontrol object
corresponding to the mapred plan. We just need to write a separate launcher which will submit
the job to the LocalJobRunner instead of submitting it to an external JobTracker.
Pros
Code Reuse
No need to write and maintain
Different operators
Different logical-to-physical translators
Different launchers
The current framework does not have any progress reporting. With this approach we
will have it at no extra cost.
Cons
Not sure how stable LocalJobRunner is.

Some bugs were found in hadoop-15 which made it practically useless for us at the time;
these have, however, been fixed in hadoop-16.
Not sure how this will affect the example generator.

1) Will the invocation of LocalJobRunner have some latency?

Definitely it does. As measured in hadoop-15, it has about 5 seconds of startup latency. Whether this
matters depends on how and where we are using LocalJobRunner. If we strictly use it only when
the user asks for local execution mode, it should not matter. Also, if the size of the data is at
least in the tens of MBs, the LocalJobRunner performs better than streaming tuples through the
plan of local operators.

The Configuration API


Components in Hadoop are configured using Hadoop's own configuration API.
An instance of the Configuration class (found in the org.apache.hadoop.conf package)
represents a collection of configuration properties and their values.
Each property is named by a String, and the value may be one of several types,
including Java primitives such as boolean, int, and float.
Configurations read their properties from resources: XML files with a simple structure
for defining name-value pairs.
Example. A simple configuration file, configuration-1.xml
<?xml version="1.0"?>
<configuration>
<property>
<name>color</name>
<value>yellow</value>
<description>Color</description>
</property>
<property>
<name>size</name>
<value>10</value>
<description>Size</description>
</property>
<property>
<name>weight</name>
<value>heavy</value>
<final>true</final>
<description>Weight</description>
</property>
<property>
<name>size-weight</name>
<value>${size},${weight}</value>
<description>Size and weight</description>
</property>
</configuration>

Assuming this configuration file is in a file called configuration-1.xml, we can access its
properties using a piece of code like this:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in-order
from the classpath:

1. core-default.xml: read-only defaults for Hadoop.
2. core-site.xml: site-specific configuration for a given Hadoop installation.

Partitioner
A Partitioner is responsible for performing the partitioning.
In Hadoop, the default partitioner is HashPartitioner.
The number of partitions is equal to the number of reduce tasks for the job.
Why is it important?
First, it has a direct impact on the overall performance of your job: a poorly designed
partitioning function will not evenly distribute the load over the reducers, potentially losing
much of the benefit of the distributed map/reduce infrastructure.

Example

In this example, the tokens are correctly ordered by number of occurrences on each reducer
(which is what Hadoop guarantees by default), but this is not what you need! You'd rather expect
the tokens to be totally ordered over the reducers, for example from 1 to 30 occurrences on the
first reducer and from 31 to 14620 on the second. This would happen as a result of a correct
partitioning function: all the tokens having a number of occurrences up to N (here 30) are sent
to reducer 1 and the others are sent to reducer 2, resulting in two partitions. Since the tokens
are sorted on each partition, you get the expected total order on the number of occurrences.
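A minimal sketch of such a partitioning function in the new API. The key/value types, the class name, and the threshold of 30 are assumptions for this example; the job would also need two reduce tasks and job.setPartitionerClass(OccurrencePartitioner.class):
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class OccurrencePartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable occurrences, Text token, int numPartitions) {
        // Tokens with up to 30 occurrences go to reducer 0, the rest to reducer 1.
        return (occurrences.get() <= 30) ? 0 : 1;
    }
}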

Conclusion
Partitioning in map/reduce is a fairly simple concept, but it is important to get it right. Most of the
time the default partitioning based on a hash function is sufficient. But as illustrated in this
example, you will sometimes need to modify the default behavior and customize the partitioning
to suit your needs.

HDFS Accessibility
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a FileSystem
Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an
HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose
HDFS through the WebDAV protocol.

FS Shell

HDFS allows user data to be organized in the form of files and directories. It provides a commandline
interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command set is
similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample
action/command pairs:

Action                                               Command
Create a directory named /tmp                        bin/hadoop dfs -mkdir /tmp
Remove a directory named /tmp                        bin/hadoop dfs -rmr /tmp
View the contents of a file named /tmp/myfile.txt    bin/hadoop dfs -cat /tmp/myfile.txt
List the contents of the directory /tmp in HDFS      bin/hadoop dfs -ls /tmp/
Copy a file into HDFS                                bin/hadoop dfs -copyFromLocal <local path> <HDFS path>
Copy a file from HDFS                                bin/hadoop dfs -copyToLocal <HDFS path> <local path>

FS shell is targeted for applications that need a scripting language to interact with the stored data.

DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These are commands that
are used only by an HDFS administrator. Here are some sample action/command pairs:

Action                                        Command
Put the cluster in Safemode                   bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes                  bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)      bin/hadoop dfsadmin -refreshNodes

Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a
configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of
its files using a web browser.

Using Hadoop's DistributedCache


While working with MapReduce applications, there are times when we need to share files
globally with all nodes in the cluster. This can be a shared library to be accessed by each task.
Hadoop's MapReduce framework provides this functionality through a distributed cache. The
distributed cache can be configured with the job, and provides read-only data to the application
across all machines.
This provides a service for copying files and archives to the task nodes in time for the
tasks to use them when they run.
To save network bandwidth, files are normally copied to any particular node once per
job.
Distributing files is pretty straightforward. To cache a file addToCache.txt on HDFS, one can
setup the job as
Job job = new Job(conf);
job.addCacheFile(new URI("/user/local/hadoop/addToCache.txt"));
Other URI schemes can also be specified.
Now, in the Mapper/Reducer, one can access the file as:
Path[] cacheFiles = context.getLocalCacheFiles();
FileInputStream fileStream = new FileInputStream(cacheFiles[0].toString());

HIVE Basics
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis

Features of Hive
Hive supports indexing to accelerate queries.
Support for different storage types.
Hive stores metadata in an RDBMS, which significantly reduces the time needed to perform
semantic checks during query execution.
Hive can operate on compressed data stored in the Hadoop ecosystem.
Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data types. If none
serves our needs, we can create our own UDFs.
Hive supports SQL-like queries (HiveQL), which are implicitly converted into MapReduce
jobs.

HiveQL
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers
extensions not in SQL
**Detail will be provided Later

PIG Basics
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language
for this platform is called Pig Latin
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig
Latin, which has the following key properties:


Ease of programming: It is trivial to achieve parallel execution of simple, "embarrassingly
parallel" data analysis tasks. Complex tasks composed of multiple interrelated data
transformations are explicitly encoded as data flow sequences, making them easy to
write, understand, and maintain.
Optimization opportunities: The way in which tasks are encoded permits the system to
optimize their execution automatically, allowing the user to focus on semantics rather
than efficiency.
Extensibility: Users can create their own functions to do special-purpose processing.

Practical Development

Counters
A named counter that tracks the progress of a map/reduce job.
Counters represent global counters, defined either by the Map-Reduce framework or
applications. Each Counter is named by an Enum and has a long for the value.
Counters are a useful channel for gathering statistics about the job. In addition to counter
values being much easier to retrieve than log output for large distributed jobs, you get a
record of the number of times that a condition occurred, which is more work to obtain
from a set of logfiles.

Types of Counter
Built-in Counters
Hadoop maintains some built-in counters for every job, which report various metrics
for your job.
Eg. MapReduce Task Counters , Filesystem Counters

Task Counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job. Task counters are maintained by
each task attempt and periodically sent to the tasktracker and then to the jobtracker,
so they can be globally aggregated.
E.g., Map input records, Map skipped records

Job counters
Job counters are maintained by the jobtracker. They measure job-level statistics, not
values that change while a task is running. For example, TOTAL_LAUNCHED_MAPS
counts the number of map tasks that were launched over the course of a job (including
ones that failed).
Eg. Launched map tasks, Launched reduce tasks

User-Defined Java Counters


MapReduce allows user code to define a set of counters, which are then incremented as
desired in the mapper or reducer. Counters are defined by a Java enum, which serves to
group related counters.
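A minimal sketch of a user-defined counter; the enum name, its fields, and the condition being counted are illustrative assumptions:
// Counters are grouped by the enum they belong to.
enum Temperature {
    MISSING,
    MALFORMED
}

// Inside map() or reduce(), using the task Context:
context.getCounter(Temperature.MISSING).increment(1);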

Determining the Optimal Number of Reducers


The optimal number of reducers is related to the total number of available reducer slots in
your cluster. The total number of slots is found by multiplying the number of nodes in the
cluster by the number of slots per node (which is determined by the
mapred.tasktracker.reduce.tasks.maximum property; by default it is 2).
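A common rule of thumb, sketched below, is to set the number of reducers to slightly fewer than the total number of reduce slots, leaving headroom for failures and speculative execution. The node count, slot count, and 0.95 factor are illustrative assumptions:
int nodes = 10;                                   // worker nodes in the cluster (assumption)
int slotsPerNode = 2;                             // mapred.tasktracker.reduce.tasks.maximum
int totalReduceSlots = nodes * slotsPerNode;
job.setNumReduceTasks((int) (totalReduceSlots * 0.95));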

ChainMapper
The ChainMapper class allows you to use multiple Mapper classes within a single map task.
The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes
the input of the second, and so on until the last Mapper; the output of the last Mapper will be
written to the task's output.

The key functionality of this feature is that the Mappers in the chain do not need to be aware
that they are executed in a chain. This enables having reusable specialized Mappers that can be
combined to perform composite operations within a single task.
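A sketch of how such a chain might be configured with the old (mapred) API; the two mapper classes and their key/value types are hypothetical placeholders:
JobConf conf = new JobConf(ChainDriver.class);               // hypothetical driver class
JobConf firstMapperConf = new JobConf(false);
ChainMapper.addMapper(conf, TokenizerMapper.class,           // hypothetical first mapper
    LongWritable.class, Text.class, Text.class, IntWritable.class,
    true, firstMapperConf);
JobConf secondMapperConf = new JobConf(false);
ChainMapper.addMapper(conf, FilterMapper.class,              // hypothetical second mapper
    Text.class, IntWritable.class, Text.class, IntWritable.class,
    true, secondMapperConf);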


More Advanced Map-Reduce Programming

The Writable Interface


Any key or value type in the Hadoop Map-Reduce framework implements this interface.
The Writable interface defines two methods:
1. writing its state to a DataOutput binary stream, and
2. reading its state from a DataInput binary stream.
package org.apache.hadoop.io;
import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable {
void write(DataOutput out) throws IOException;
void readFields(DataInput in) throws IOException;
}

Let's look at a particular Writable to see what we can do with it. We will use IntWritable, a
wrapper for a Java int. We can create one and set its value using the set() method:
IntWritable writable = new IntWritable();
writable.set(163);
Equivalently, we can use the constructor that takes the integer value:
IntWritable writable = new IntWritable(163);
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package,
which together form an extensive class hierarchy.


Custom Writable and Writable Comparable


Implementing a Custom Writable
Hadoop comes with a useful set of Writable implementations that serve most purposes; however, on
occasion, you may need to write your own custom implementation. With a custom Writable, you have
full control over the binary representation and the sort order. Because Writables are at the heart of the
MapReduce data path, tuning the binary representation can have a significant effect on performance.
The stock Writable implementations that come with Hadoop are well-tuned, but for more elaborate
structures, it is often better to create a new Writable type, rather than compose the stock types.
To demonstrate how to create a custom Writable, we shall write an implementation that represents a
pair of strings, called TextPair.
Example . A Writable implementation that stores a pair of Text objects
import java.io.*;
import org.apache.hadoop.io.*;
public class TextPair implements WritableComparable<TextPair> {
private Text first;
private Text second;
public TextPair() {
set(new Text(), new Text());
}
public TextPair(String first, String second) {
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second) {
set(first, second);
}
public void set(Text first, Text second) {
this.first = first;
this.second = second;
}
public Text getFirst() {
return first;
}
public Text getSecond() {
return second;
}
@Override
public void write(DataOutput out) throws IOException {


first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
}
@Override
public boolean equals(Object o) {
if (o instanceof TextPair) {
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second;
}
@Override
public int compareTo(TextPair tp) {
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
}
The first part of the implementation is straightforward: there are two Text instance variables,
first and second, and associated constructors, getters, and setters. All Writable implementations
must have a default constructor so that the MapReduce framework can instantiate them, then
populate their fields by calling readFields().
TextPair's write() method serializes each Text object in turn to the output stream, by delegating
to the Text objects themselves. Similarly, readFields() deserializes the bytes from the input
stream by delegating to each Text object. The DataOutput and DataInput interfaces have a rich
set of methods for serializing and deserializing Java primitives, so, in general, you have complete
control over the wire format of your Writable object.


TextPair is an implementation of WritableComparable, so it provides an implementation of the


compareTo() method that imposes the ordering you would expect: it sorts by the first string
followed by the second.

WritableComparable and comparators


IntWritable implements the WritableComparable interface, which is just a subinterface of the Writable
and java.lang.Comparable interfaces:
package org.apache.hadoop.io;
public interface WritableComparable<T> extends Writable, Comparable<T> {
}
Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are
compared with one another. One optimization that Hadoop provides is the RawComparator extension of
Java's Comparator:
package org.apache.hadoop.io;
import java.util.Comparator;
public interface RawComparator<T> extends Comparator<T> {
public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);
}
This interface permits implementors to compare records read from a stream without deserializing them
into objects, thereby avoiding any overhead of object creation. For example, the comparator for
IntWritables implements the raw compare() method by reading an integer from each of the byte arrays
b1 and b2 and comparing them directly, from the given start positions (s1 and s2) and lengths (l1 and
l2).
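For example, the registered comparator for IntWritable can be obtained and used directly; a short sketch (assertThat and greaterThan are the JUnit/Hamcrest helpers used in the Configuration example earlier):
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));
// The same comparator can also compare the raw serialized forms without deserializing:
// comparator.compare(b1, 0, b1.length, b2, 0, b2.length), where b1 and b2 hold the
// serialized bytes of w1 and w2.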

Avro
Apache Avro is a language-neutral data serialization system.
Avro data is described using a language-independent schema.
Avro schemas are usually written in JSON, and data is usually encoded using a binary format, but
there are other options, too. There is a higher-level language called Avro IDL, for writing
schemas in a C-like language that is more familiar to developers. There is also a JSON-based data
encoder, which, being human-readable, is useful for prototyping and debugging Avro data.
Avro specifies an object container format for sequences of objects, similar to Hadoop's
sequence file. An Avro data file has a metadata section where the schema is stored, which


makes the file self-describing. Avro data files support compression and are splittable, which is
crucial for a MapReduce data input format.
Avro provides APIs for serialization and deserialization, which are useful when you want to
integrate Avro with an existing system, such as a messaging system where the framing format is
already defined. In other cases, consider using Avro's data file format.
Let's write a Java program to read and write Avro data to and from streams. We'll start with a simple
Avro schema for representing a pair of strings as a record:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings.",
"fields": [
{"name": "left", "type": "string"},
{"name": "right", "type": "string"}
]
}

If this schema is saved in a file on the classpath called StringPair.avsc (.avsc is the conventional
extension for an Avro schema), then we can load it using the following two lines of code:
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getClass().getResourceAsStream("StringPair.avsc"));

We can create an instance of an Avro record using the generic API as follows:
GenericRecord datum = new GenericData.Record(schema);
datum.put("left", "L");
datum.put("right", "R");

Next, we serialize the record to an output stream:


ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
encoder.flush();
out.close();

There are two important objects here: the DatumWriter and the Encoder. A DatumWriter
translates data objects into the types understood by an Encoder, which the latter writes to the
output stream. Here we are using a GenericDatumWriter, which passes the fields of
GenericRecord to the Encoder. We pass a null to the encoder factory since we are not reusing a
previously constructed encoder here.

Avro data files


Avro's object container file format is for storing sequences of Avro objects. It is very similar in
design to Hadoop's sequence files. A data file has a header containing metadata, including the
Avro schema and a sync marker, followed by a series of (optionally compressed) blocks
containing the serialized Avro objects.
Writing Avro objects to a data file is similar to writing to a stream. We use a DatumWriter, as
before, but instead of using an Encoder, we create a DataFileWriter instance with the
DatumWriter. Then we can create a new data file (which, by convention, has a .avro extension)
and append objects to it:
File file = new File("data.avro");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();

The objects that we write to the data file must conform to the file's schema; otherwise, an
exception will be thrown when we call append().
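Reading the objects back is symmetrical. A minimal sketch, reusing the file from above: a DataFileReader obtains the schema from the file's metadata, so none needs to be supplied:
DataFileReader<GenericRecord> dataFileReader =
    new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
for (GenericRecord record : dataFileReader) {
    System.out.println(record.get("left") + "," + record.get("right"));
}
dataFileReader.close();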

Writing a SequenceFile
SequenceFile provides Writer, Reader and SequenceFile.Sorter classes for writing, reading and
sorting respectively.
Hadoop has ways of splitting sequence files for doing jobs in parallel, even if they are
compressed, making them a convenient way of storing your data without making your own
format.
Hadoop provides two file formats for grouping multiple entries in a single file:
SequenceFile: A flat file which stores binary key/value pairs. The output of Map/Reduce
tasks is usually written into a SequenceFile.
MapFile: Consists of two SequenceFiles. The data file is identical to the SequenceFile
and contains the data stored as binary key/value pairs. The second file is an index file,
which contains a key/value map with seek positions inside the data file to quickly access
the data.

We started using the SequenceFile format to store log messages. It turned out that, while this
format seems to be well suited for storing log messages and processing them with Map/Reduce
jobs, direct access to specific log messages is very slow. The API to read data from a
SequenceFile is iterator-based, so it is necessary to jump from entry to entry until the
target entry is reached.
Since one of our most important use cases is searching for log messages in real time, slow
random access performance is a show stopper.

MapFiles use two files: the index file stores seek positions for every n-th key in the data file. The
data file stores the data as binary key/value pairs.
Therefore we moved to MapFiles. MapFiles have the disadvantage that a random access needs
to read from 2 separate files. This seems to be slow, but the indexes which store the seek
positions for our log entries are small enough to be cached in memory. Once the seek position
is identified, only relevant portions of the data file are read. Overall this leads to a nice
performance gain.
To create a SequenceFile, use one of its createWriter() static methods, which returns a
SequenceFile.Writer instance.

Write Sequence File in Hadoop


import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;

public class SequenceFileCreator
{
public static void main(String args[]) throws Exception
{
System.out.println("Sequence File Creator");
String uri = args[0];
String filePath = args[1];
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri),conf);
Path path = new Path(uri);
SequenceFile.Writer writer = null;
org.apache.hadoop.io.Text key = new org.apache.hadoop.io.Text();
BufferedReader buffer = new BufferedReader(new FileReader(filePath));
String line = null;
org.apache.hadoop.io.Text value = new org.apache.hadoop.io.Text();
try
{
writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
while((line = buffer.readLine()) != null)
{
key.set(line);
value.set(line);
writer.append(key, value);
}
}
finally
{
IOUtils.closeStream(writer);
}
}
}

Read Sequence File in Hadoop


import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReader
{
public static void main(String args[]) throws Exception
{
System.out.println("Sequence File Reader");
String uri = args[0]; // Input should be a sequence file
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uri),conf);
Path path = new Path(uri);
SequenceFile.Reader reader = null;
try
{
reader = new SequenceFile.Reader(fs,path,conf);
Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
long position = reader.getPosition();
while(reader.next(key,value))
{
String syncSeen = reader.syncSeen() ? "*" : "";
System.out.printf("[%s%s]\t%s\t%s\n", position , syncSeen , key , value);
position = reader.getPosition();
}
}catch(Exception e)
{
e.printStackTrace();
}
finally
{
IOUtils.closeStream(reader);
}
}
}

Creating InputFormats and OutputFormats


The input types are determined by the input format, which defaults to TextInputFormat and has
LongWritable keys and Text values
Properties for configuring types:
mapred.input.format.class              setInputFormat()
mapred.mapoutput.key.class             setMapOutputKeyClass()
mapred.mapoutput.value.class           setMapOutputValueClass()
mapred.output.key.class                setOutputKeyClass()
mapred.output.value.class              setOutputValueClass()
mapred.output.format.class             setOutputFormat()
Properties that must be consistent with the types:
mapred.mapper.class                    setMapperClass()
mapred.map.runner.class                setMapRunnerClass()
mapred.combiner.class                  setCombinerClass()
mapred.partitioner.class               setPartitionerClass()
mapred.output.key.comparator.class     setOutputKeyComparatorClass()
mapred.output.value.groupfn.class      setOutputValueGroupingComparator()
mapred.reducer.class                   setReducerClass()
mapred.output.format.class             setOutputFormat()
A minimal MapReduce driver, with the defaults explicitly set
public class MinimalMapReduceWithDefaults extends Configured implements Tool {
public int run(String[] args) throws IOException {
JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
if (conf == null) {
return -1;
}
conf.setInputFormat(TextInputFormat.class);
conf.setNumMapTasks(1);
conf.setMapperClass(IdentityMapper.class);
conf.setMapRunnerClass(MapRunner.class);
conf.setMapOutputKeyClass(LongWritable.class);
conf.setMapOutputValueClass(Text.class);
conf.setPartitionerClass(HashPartitioner.class);
conf.setNumReduceTasks(1);
conf.setReducerClass(IdentityReducer.class);
conf.setOutputKeyClass(LongWritable.class);
conf.setOutputValueClass(Text.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
System.exit(exitCode);
}
}

Input Formats
The InputFormat defines how to read data from a file into the Mapper instances. Hadoop comes with
several implementations of InputFormat; some work with text files and describe different ways in which
the text files can be interpreted. Others, like SequenceFileInputFormat, are purpose-built for reading
particular binary file formats.
Input Splits and Records
An input split is a chunk of the input that is processed by a single map. Each map processes a single split.
Each split is divided into records, and the map processes each record (a key-value pair) in turn.
FileInputFormat- use files as their data source
FileInputFormat input paths- The input to a job is specified as a collection of paths
public static void addInputPath(JobConf conf, Path path)
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
public static void setInputPaths(JobConf conf, Path... inputPaths)
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)
FileInputFormat input splits- FileInputFormat splits only large files. Here large means larger
than an HDFS block
Small files and CombineFileInputFormat- Hadoop works better with a small number of large files
than a large number of small files. One reason for this is that FileInputFormat generates splits in
such a way that each split is all or part of a single file


Text Input
Hadoop excels at processing unstructured text
TextInputFormat- TextInputFormat is the default InputFormat
KeyValueTextInputFormat- It is common for each line in a file to be a key-value pair, separated
by a delimiter such as a tab character
NLineInputFormat- N refers to the number of lines of input that each mapper receives

XML
Most XML parsers operate on whole XML documents, so they cannot easily process a large XML
document that has been broken into multiple input splits. Using StreamXmlRecordReader, the page
elements can be interpreted as records for processing by a mapper.

Binary Input
SequenceFileInputFormat- Hadoop's sequence file format stores sequences of binary key-value
pairs.
SequenceFileAsTextInputFormat- a variant of SequenceFileInputFormat that converts the
sequence file's keys and values to Text objects.
SequenceFileAsBinaryInputFormat- a variant of SequenceFileInputFormat that retrieves the
sequence file's keys and values as opaque binary objects.
Multiple Inputs
Although the input to a MapReduce job may consist of multiple input files (constructed by a
combination of file globs, filters, and plain paths), all of the input is interpreted by a single InputFormat
and a single Mapper.
MultipleInputs.addInputPath(conf,InputPath,TextInputFormat.class, Mapper.class)
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database, using JDBC.
The corresponding output format is DBOutputFormat, which is useful for dumping job outputs (of
modest size) into a database.


Output Formats
Text Output -The default output format, TextOutputFormat, writes records as lines of text.
Binary Output
SequenceFileOutputFormat -As the name indicates, SequenceFileOutputFormat writes sequence
files for its output.
SequenceFileAsBinaryOutputFormat- SequenceFileAsBinaryOutputFormat is the counterpart to
SequenceFileAsBinaryInputFormat, and it writes keys and values in raw binary format into a
SequenceFile container.
MapFileOutputFormat- MapFileOutputFormat writes MapFiles as output.
Multiple Outputs
There are two special cases when it does make sense to allow the application to set the number of
partitions (or equivalently, the number of reducers):

Zero reducers
This is a vacuous case: there are no partitions, as the application needs to run only map tasks.

One reducer
It can be convenient to run small jobs to combine the output of previous jobs into a single file. This
should only be attempted when the amount of data is small enough to be processed comfortably by one
reducer.
MultipleOutputFormat- allows you to write data to multiple files whose names are derived from the
output keys and values.


Joining Data Sets in MapReduce Jobs

Joins
MapReduce can perform joins between large datasets.
Example: inner join of two data sets.

Stations
station_id   station_loc
2            Pune
7            Mumbai

Records
st_id   st_name    temp
7       atlanta    111
7       atlanta    78
2       richmond   0
2       richmond   22
2       richmond   -11

JOIN result
station_id   station_loc   st_name    temp
2            Pune          richmond   0
2            Pune          richmond   22
2            Pune          richmond   -11
7            Mumbai        atlanta    111
7            Mumbai        atlanta    78

If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by
the reducer it is called a reduce-side join.
If both datasets are too large for either to be copied to each node in the cluster, then we can
still join them using MapReduce with a map-side or reduce-side join, depending on how the
data is structured.


Side Data Distribution


Side data is the additional data needed by the job to process the main dataset. The critical part
is to make this side data available to all the map or reduce tasks running in the cluster. It is
possible to cache the side data in memory in a static field, so that tasks running successively
in a tasktracker will share the data.
Caching of side data can be done in two ways:
Job Configuration: Using the Job Configuration object's setter methods we can set key-value pairs,
and the same can be retrieved in the map or reduce tasks. We should be careful not to share a
huge amount of data this way, since the configuration is read by the JobTracker, the TaskTracker,
and the child JVMs, and every time the configuration is loaded into memory.
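A minimal sketch of this approach; the property name is an arbitrary example:
// In the driver, before submitting the job:
conf.set("myjob.side.color", "yellow");                          // hypothetical property
// In the mapper or reducer, via the task Context:
String color = context.getConfiguration().get("myjob.side.color");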

Distributed Cache

Side data can also be shared using Hadoop's distributed cache mechanism: we can copy files
and archives to the task nodes in time for the tasks to use them when they run. This is usually
preferable to the job configuration approach.
If both datasets are too large, however, we cannot copy either of them to each node in
the cluster as we did with side data distribution.

Map-Side Joins
A map-side join between large inputs works by performing the join before the data reaches the map
function. The inputs to each map must be partitioned and sorted: each input dataset must be divided
into the same number of partitions, and it must be sorted by the same key (the join key) in each source.
Use CompositeInputFormat from the org.apache.hadoop.mapred.join package to run a map-side join.
The join expression for the CompositeInputFormat can be set using, for example,
inner(tbl(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class,
"hdfs://localhost:8000/usr/data"),
tbl(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class,
"hdfs://localhost:8000/usr/activity"))


We can achieve the following kinds of joins using map-side techniques:


1) Inner join
2) Outer join
3) Override: MultiFilter for a given key, preferring values from the rightmost source

Reduce-Side Joins

Reduce-side joins are simpler than map-side joins since the input datasets need not be
structured. But they are less efficient, as both datasets have to go through the MapReduce shuffle
phase. The records with the same key are brought together in the reducer. We can also use the
secondary sort technique to control the order of the records.


Secondary Sort
The MapReduce framework sorts the records by key before they reach the reducers. For any
particular key, however, the values are not sorted.
It is possible to impose an order on the values by sorting and grouping the keys in a particular
way.
To illustrate the idea, consider the MapReduce program for calculating the maximum
temperature for each year. If we arranged for the values (temperatures) to be sorted in
descending order, we wouldn't have to iterate through them to find the maximum; we could
take the first for each year and ignore the rest. (This approach isn't the most efficient way to
solve this particular problem, but it illustrates how secondary sort works in general.)
To achieve this, we change our keys to be composite: a combination of year and temperature.
1901 35C
1900 35C
1900 34C
....
1900 34C
1901 36C
We want the sort order for keys to be by year (ascending) and then by temperature
(descending):

To summarize, there is a recipe here to get the effect of sorting by value:


Make the key a composite of the natural key and the natural value.
The sort comparator should order by the composite key, that is, the natural key and
natural value.
The partitioner and grouping comparator for the composite key should consider only
the natural key for partitioning and grouping.
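At the driver level, this recipe maps onto the following settings in the new API; the three class names are placeholders for the comparators and partitioner you would implement for the composite year/temperature key:
job.setPartitionerClass(FirstPartitioner.class);         // partition on the year (natural key) only
job.setSortComparatorClass(KeyComparator.class);         // sort by year ascending, temperature descending
job.setGroupingComparatorClass(GroupComparator.class);   // group reducer input by year only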


HIVE
Hive runs on your workstation and converts your SQL query into a series of MapReduce
jobs for execution on a Hadoop cluster.
Hive organizes data into tables, which provide a means for attaching structure to data
stored in HDFS.
Metadata, such as table schemas, is stored in a database called the metastore.

Manipulating data with Hive


The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL.
HiveQL is Hive's query language, heavily influenced by MySQL.
Listing Tables in hive :
hive> SHOW TABLES;

Query Execution
Input data file: sample.txt
1950    34    1
1950    22    2
1950    11    2
1949    18    1
1949    42    1

CREATE TABLE records (year STRING, temperature INT, quality INT)


ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
Each row in the data file is tab-delimited text, with fields separated by tabs, and rows by
newlines.

Loading data
LOAD DATA LOCAL INPATH '/home/hadoop/Documents/hivedata/sample.txt'
OVERWRITE INTO TABLE records;

Running this command tells Hive to put the specified local file in its warehouse directory.
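Once the table is loaded, an aggregate query can be run against it. A sketch of a query over the sample data above, with the output that would follow from those rows:
hive> SELECT year, MAX(temperature)
    > FROM records
    > GROUP BY year;
1949    42
1950    34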

Example of running a Query from the command line


Command: $HIVE_HOME/bin/hive -e 'select a.col from table a'


Example of dumping data out from a query into a file using silent mode
You can suppress informational messages, such as the time taken to run a query, using the -S (silent) option at launch time.
Command : $HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt

output file quality.txt


1
2
2
1
1

Example of running a script non-interactively


Command : $HIVE_HOME/bin/hive -f /home/hadoop/Documents/hivedata/hive-script.sql


The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided into two
pieces:
a service
the backing store for the data.

Using an embedded metastore is a simple way to get started with Hive; however, only one
embedded Derby database can access the database files on disk at any one time, which means
you can only have one Hive session open at a time that shares the same metastore. Trying to
start a second session gives the error:
Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore.
The solution to supporting multiple sessions (and therefore multiple users) is to use a
standalone database. This configuration is referred to as a local metastore, since the metastore
service still runs in the same process as the Hive service, but connects to a database running in
a separate process, either on the same machine or on a remote machine.
MySQL is a popular choice for the standalone metastore. In this case,
javax.jdo.option.ConnectionURL is set to jdbc:mysql://host/dbname?createDatabaseIf
NotExist=true, and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver.
(The user name and password should be set, too, of course.) The JDBC driver JAR file for MySQL
(Connector/J) must be on Hive's classpath, which is simply achieved by placing it in Hive's lib
directory.
Going a step further, there's another metastore configuration called a remote metastore,
where one or more metastore servers run in separate processes to the Hive service. This brings
better manageability and security, since the database tier can be completely firewalled off, and
the clients no longer need the database credentials.
A Hive service is configured to use a remote metastore by setting hive.metastore.local to false,
and hive.metastore.uris to the metastore server URIs, separated by commas if there is more
than one. Metastore server URIs are of the form thrift://host:port, where the port corresponds
to the one set by METASTORE_PORT when starting the metastore server.

Partitions and Buckets


Hive organizes tables into partitions, a way of dividing a table into coarse-grained parts based
on the value of a partition column.


Using partitions can make it faster to do queries on slices of the data. Tables or partitions may
further be subdivided into buckets, to give extra structure to the data that may be used for
more efficient queries.

Partitions
The advantage to this scheme is that queries that are restricted to a particular date or set of
dates can be answered much more efficiently since they only need to scan the files in the
partitions that the query pertains to.
CREATE TABLE logs (ts INT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

Load data into a partitioned table


LOAD DATA LOCAL INPATH '/home/hadoop/Documents/hivedata/data'
INTO TABLE logs
PARTITION (dt='2006-01-02', country='ind');
After loading a few more files into the logs table, the directory structure might look
like this:
/user/hive/warehouse/logs/dt=2010-01-01/country=GB/file1
/file2
/country=US/file3
/dt=2010-01-02/country=GB/file4
/country=US/file5
/file6
Other example:
SELECT ts, dt, line
FROM logs
WHERE country='GB';
will only scan file1, file2, and file4. Notice, too, that the query returns the values of the dt
partition column, which Hive reads from the directory names since they are not in the data files.
You can ask Hive for the partitions in a table using SHOW PARTITIONS:
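For the directory layout shown above, the output would look something like this:
hive> SHOW PARTITIONS logs;
dt=2010-01-01/country=GB
dt=2010-01-01/country=US
dt=2010-01-02/country=GB
dt=2010-01-02/country=US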


Buckets
Bucketing imposes extra structure on the table, which Hive can take advantage of when
performing certain queries.
The CLUSTERED BY clause to specify the columns to bucket on and the number of buckets
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket (which Hive does by hashing the value
and reducing modulo the number of buckets), so any particular bucket will effectively have a
random set of users in it.


Hive's SerDe
Internally, Hive uses a SerDe called LazySimpleSerDe for this delimited format, along with the
line-oriented MapReduce text input and output formats.

Hive-json-serde
This SerDe can be used to read data in JSON format. For example, if your JSON files had the
following contents:
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
The following steps can be used to read this data:

1. Build this project using ant clean build


2. Copy hive-json-serde.jar to the Hive server
3. Inside the Hive client, run
ADD JAR /path-to/hive-json-serde.jar;
4. Create a table that uses files where each line is JSON object
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (
field1 string, field2 int, field3 string, field4 double
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/path-to/my_table/';
5. Copy your JSON files to /path-to/my_table/. You can now select data using normal
SELECT statements
SELECT * FROM my_table LIMIT 10;
If the table has a column that does not exist in the JSON object, it will have a NULL value. If the
JSON file contains fields that are not columns in the table, they will be ignored and not visible to
the table.

JOINS
Inner joins
An inner join is one where each match in the input tables results in a row in the output.

Sales table
Joe     2
Hank    4
Ali     0
Eve     3
Hank    2

Things table
2    Tie
4    Coat
3    Hat
1    Scarf


hive> SELECT sales.*, things.*
    > FROM sales JOIN things ON (sales.id = things.id);

Joe     2    2    Tie
Hank    2    2    Tie
Eve     3    3    Hat
Hank    4    4    Coat

Outer joins
Outer joins allow you to find nonmatches in the tables being joined

LEFT OUTER JOIN


The query will return a row for every row in the left table (sales), even if there is no
corresponding row in the table it is being joined to (things)
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
The columns from the things table are NULL, since there is no match.

Right Outer Join


In this case, all items from the things table are included, even those that weren't
purchased by anyone.
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);


PIG

Apache Pig is a high-level procedural language for querying large semi-structured data sets
using Hadoop and the MapReduce Platform.
Pig simplifies the use of Hadoop by allowing SQL-like queries to a distributed dataset. Explore
the language behind Pig and discover its use in a simple Hadoop cluster.
The Pig tutorial shows you how to run two Pig scripts in local mode and mapreduce mode.
Local Mode: To run the scripts in local mode, no Hadoop or HDFS installation is required. All
files are installed and run from your local host and file system.
Mapreduce Mode: To run the scripts in mapreduce mode, you need access to a Hadoop
cluster and HDFS installation.
Why Pig?
Programming Map and Reduce applications is not overly complex; doing so does require
some experience with software development.
Apache Pig changes this by creating a simpler procedural language abstraction over
MapReduce to expose a more Structured Query Language (SQL)-like interface for
Hadoop applications. So instead of writing a separate MapReduce application, you can
write a single script in Pig Latin that is automatically parallelized and distributed across a
cluster.

Pig Latin
Pig Latin is a relatively simple language that executes statements.
A statement is an operation that takes input (such as a bag, which represents a set of
tuples) and emits another bag as its output.
A bag is a relation, similar to table, that you'll find in a relational database (where tuples
represent the rows, and individual tuples are made up of fields).
A script in Pig Latin often follows a specific format in which data is read from the file
system, a number of operations are performed on the data (transforming it in one or more
ways), and then the resulting relation is written back to the file system.
Pig has a rich set of data types, supporting not only high-level concepts like bags, tuples,
and maps, but also simple data types such as ints, longs, floats, doubles, chararrays, and
bytearrays.


Pig provides a range of arithmetic operators (such as add, subtract, multiply, divide,
and modulo) in addition to a conditional operator called bincond that operates similarly
to the C ternary operator. As you'd expect, there is also a full suite of comparison operators,
including rich pattern matching using regular expressions.
A simple Pig Latin script
messages = LOAD 'messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
STORE warns INTO 'warnings';
The above Pig Latin Script shows the simplicity of this process in Pig. Given the three lines
shown, only one is the actual search. The first line simply reads the test data set (the messages
log) into a bag that represents a collection of tuples. You filter this data (the only entry in the
tuple, represented as $0, or field 1) with a regular expression, looking for the character
sequence WARN. Finally, you store this bag, which now represents all of those tuples from
messages that contain WARN into a new file called warnings in the host file system.
List of Pig Latin relational operators
Operator    Description
FILTER      Select a set of tuples from a relation based on a condition.
FOREACH     Iterate the tuples of a relation, generating a data transformation.
GROUP       Group the data in one or more relations.
JOIN        Join two or more relations (inner or outer join).
LOAD        Load data from the file system.
ORDER       Sort a relation based on one or more fields.
SPLIT       Partition a relation into two or more relations.
STORE       Store data in the file system.

Check Your Setup


Check your run-time environment and do the following preliminary tasks:
1. Make sure the JAVA_HOME environment variable is set to the root of your Java installation.
2. Make sure your PATH includes bin/pig (this enables you to run the tutorials using the "pig"
command).
$ export PATH=/<my-path-to-pig>/pig-0.8.1/bin:$PATH

3. Set the PIG_HOME environment variable:


$ export PIG_HOME=/<my-path-to-pig>/pig-0.8.1
4. Create the pigtutorial.tar.gz file:
Move to the Pig tutorial directory (.../pig-0.8.1/tutorial).
Edit the build.xml file in the tutorial directory.
Change this: <property name="pigjar" value="../pig.jar" />
To this: <property name="pigjar" value="../pig-0.8.1-core.jar"
/>
Run the "ant" command from the tutorial directory. This will create the
pigtutorial.tar.gz file.
5. Copy the pigtutorial.tar.gz file from the Pig tutorial directory to your local directory.
6. Unzip the pigtutorial.tar.gz file.
$ tar -xzf pigtutorial.tar.gz
7. A new directory named pigtmp is created. This directory contains the Pig tutorial files.
These files work with Hadoop 0.20.2 and include everything you need to run the Pig scripts.
Pig in Local mode
For Local mode, simply start Pig and specify Local mode with the exectype option. Doing so
brings you to the Grunt shell, which allows you to interactively enter Pig statements:
$ pig -x local
...
grunt>
From here, you can interactively code your Pig Latin script, seeing the result after each
operator. Return to Listing 1 and try this script out (see Listing 2). Note in this case that instead
of storing your data to a file, you simply dump it as a set of relations. You'll note in the modified
output that each log line (that matches the search criteria defined by the FILTER) is itself a
relation (bounded by parentheses [()]).

Listing 2. Using Pig interactively in Local mode


grunt> messages = LOAD '/var/log/messages';
grunt> warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
grunt> DUMP warns
...
(Dec 10 03:56:43 localhost NetworkManager: <WARN> nm_generic_enable_loopback(): error ...
(Dec 10 06:10:18 localhost NetworkManager: <WARN> check_one_route(): (eth0) error ...
grunt>


If you had specified the STORE operator, it would have generated your data within a directory
of the name specified (not a simple regular file).

Pig in Mapreduce mode


For Mapreduce mode, you must first ensure that Hadoop is running. The easiest way to do that
is to perform a file list operation on the root of the Hadoop file system tree, as in Listing 3.
Listing 3. Testing Hadoop availability
$ hadoop dfs -ls /
Found 3 items
drwxrwxrwx   - hue    supergroup    0 2011-12-08 05:20 /tmp
drwxr-xr-x   - hue    supergroup    0 2011-12-08 05:20 /user
drwxr-xr-x   - mapred supergroup    0 2011-12-08 05:20 /var

As shown, this code will result in a listing of one or more files, if Hadoop is running successfully.
Now, let's test Pig. Begin by starting Pig, and then changing the directory to your hdfs root to
determine whether you can see what you saw externally in HDFS (see Listing 4).
Listing 4. Testing Pig
$ pig
2011-12-10 06:39:44,276 [main] INFO org.apache.pig.Main - Logging error messages to...
2011-12-10 06:39:44,601 [main] INFO org.apache.pig.... Connecting to hadoop file \
system at: hdfs://0.0.0.0:8020
2011-12-10 06:39:44,988 [main] INFO org.apache.pig.... connecting to map-reduce \
job tracker at: 0.0.0.0:8021
grunt> cd hdfs:///
grunt> ls
hdfs://0.0.0.0/tmp <dir>
hdfs://0.0.0.0/user <dir>
hdfs://0.0.0.0/var <dir>
grunt>
So far, so good. You can see your Hadoop file system from within Pig, so now, try to read some
data into it from your local host file system. Copy a file from local to HDFS through Pig (see
Listing 5).


Listing 5. Getting some test data


grunt> mkdir test
grunt> cd test
grunt> copyFromLocal /etc/passwd passwd
grunt> ls
hdfs://0.0.0.0/test/passwd<r 1> 1728
Next, with your test data now safely within Hadoop's file system, you can try another script.
Note that you can cat the file within Pig to see its contents (just to see if it's there). In this
particular example, identify the number of shells specified for users within the passwd file (the
last column within passwd).
To begin, you need to load your passwd file from HDFS into a Pig relation. You do this using
the LOAD operator, but in this case, you want to parse the fields of the password file
down to their individual fields. In this example, specify the PigStorage function, which allows
you to indicate the delimiter character for the file (in this case, a colon [:] character). You also
specify the individual fields (or the schema) with the AS keyword, including their individual
types (see Listing 6).
Listing 6. Reading your file into a relation
grunt> passwd = LOAD '/etc/passwd' USING PigStorage(':') AS (user:chararray, \
passwd:chararray, uid:int, gid:int, userinfo:chararray, home:chararray, \
shell:chararray);
grunt> DUMP passwd;
(root,x,0,0,root,/root,/bin/bash)
(bin,x,1,1,bin,/bin,/sbin/nologin)
...
(cloudera,x,500,500,,/home/cloudera,/bin/bash)
grunt>
Next, use the GROUP operator to group the tuples in this relation based on their shell (see
Listing 7). Dump this again, just to illustrate the result of the GROUP operator. Note here that
you have tuples grouped (as an inner bag) under their particular shell being used (with the shell
specified at the beginning).
Listing 7. Grouping the tuples as a function of their shell
grunt> grp_shell = GROUP passwd BY shell;
grunt> DUMP grp_shell;
(/bin/bash,{(cloudera,x,500,500,,/home/cloudera,/bin/bash),(root,x,0,0,...), ...})
(/bin/sync,{(sync,x,5,0,sync,/sbin,/bin/sync)})
(/sbin/shutdown,{(shutdown,x,6,0,shutdown,/sbin,/sbin/shutdown)})
grunt>
But your desire is a count of the unique shells specified within the passwd file. So, you use the
FOREACH operator to iterate each tuple in your group to COUNT the number that appear (see
Listing 8).

Listing 8. Grouping the results with counts for each shell


grunt> counts = FOREACH grp_shell GENERATE group, COUNT(passwd);
grunt> DUMP counts;
...
(/bin/bash,5)
(/bin/sync,1)
(/bin/false,1)
(/bin/halt,1)
(/bin/nologin,27)
(/bin/shutdown,1)
grunt>

Note: To execute this code as a script, simply type your script into a file, and then execute it as
pig myscript.pig.
Important Points
Pig has several built-in data types (chararray, float, integer)
PigStorage can parse standard line oriented text files.
Pig can be extended with custom load types written in Java.
Pig doesn't read any data until triggered by a DUMP or STORE.
Use FOREACH..GENERATE to pick off specific fields or to generate new fields. This is also referred to as a
projection.
GROUP will create a new record with the group name and a bag of the tuples in each group
You can reference a specific field in a bag with <bag>.field (i.e. a models.model)
You can use aggregate functions like COUNT, MAX, etc on a bag.
Nothing really happens until a DUMP or STORE is performed.
Use FILTER and FOREACH early to remove unneeded columns or rows to reduce temporary
output.
Use PARALLEL keyword on GROUP operations to run more reduce tasks.
A quick word on writing UDFs in Pig
import java.io.IOException;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

public class ComputeAverage extends EvalFunc<Integer> {
public Integer exec(Tuple input) throws IOException {


if (input == null || input.size() == 0)
return null;
int averageRatingPercent = 0;
try {
int count = 0;
int sum = 0;
DataBag bag = (DataBag) input.get(0);
Iterator<Tuple> it = bag.iterator();
while (it.hasNext()) {
Tuple t = it.next();
count++;
// just insert your String to int conversion logic here;
sum += stringToInt(t.get(0));
}
if (count > 0) {
averageRatingPercent = sum / count;
}
} catch (Exception e) {
System.err.println(
"Failed to process input: "+ e.getMessage());
}
return averageRatingPercent;
}
@Override
public Schema outputSchema(Schema input) {
return new Schema(new Schema.FieldSchema(null, DataType.INTEGER));
}
}

And this is how you call it (r1,r2,r3,r4 are just columns/fields from another variable)
grunt> B = foreach A generate id, hid, com.pfalabs.test.ComputeAverage(r1,r2,r3,r4);

Just make sure you pack this into a jar and run this first:
grunt> register /path/to/your/jar/my-udfs.jar;


The number one question here is, how do you iterate through the values you can receive? I can
obviously push more fields into this function.
If it's a one to one function (for one value of input you get one value of output) you can look at
pig-release-0.5.0/tutorial/src/org/apache/pig/tutorial/ExtractHour.java:
String timestamp = (String)input.get(0);
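
In that style, a complete one-to-one UDF stays very small. The following is a minimal sketch in
the same spirit (the class name UpperCase and its upper-casing logic are hypothetical, not
taken from the tutorial): it reads the single field of the input tuple and returns one
transformed value. You would register its jar and call it from FOREACH..GENERATE exactly like
ComputeAverage above.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UpperCase extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0)
            return null;
        // One value in, one value out.
        String value = (String) input.get(0);
        return value == null ? null : value.toUpperCase();
    }
}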

If it's a many-to-one function (just like my use case):


...
DataBag bag = (DataBag) input.get(0);
Iterator<Tuple> it = bag.iterator();
while (it.hasNext()) {
    Tuple t = it.next();
    String actualValue = (String) t.get(0);
}
...

What do we have here? The first element of the input is a DataBag; iterating over it yields
Tuples, each of which carries your value as its first element.
Next, we have the one-to-many functions. Luckily, we can use pig-release-0.5.0/tutorial/src/org/apache/pig/tutorial/NGramGenerator.java as a reference.
...
// take the input value
String query = (String) input.get(0);
// generate the output and push it into the return value
DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
// it's a DataBag, so feel free to fill it up
// (the ngrams collection is computed from 'query' in the elided tutorial code)
for (String ngram : ngrams) {
    Tuple t = DefaultTupleFactory.getInstance().newTuple(1);
    t.set(0, ngram);
    output.add(t);
}
return output;
...


HBase

HBase is a distributed, column-oriented database built on top of HDFS. HBase is the Hadoop
application to use when you require real-time read/write random access to very large datasets.
HBase comes at the scaling problem from the opposite direction to a traditional RDBMS: it is
built from the ground up to scale linearly just by adding nodes. HBase is not relational and
does not support SQL, but given the proper problem space, it is able to do what an RDBMS
cannot: host very large, sparsely populated tables on clusters made from commodity hardware.

In Other Words
HBase is a key/value store. Specifically, it is a Sparse, Consistent, Distributed,
Multidimensional, Sorted map.

Map
HBase maintains maps of Keys to Values (key -> value). Each of these mappings is called a
"KeyValue" or a "Cell". You can find a value by its key... That's it.

Sorted
These cells are sorted by the key. This is a very important property as it allows for searching
("give me all values for which the key is between X and Y"), rather than just retrieving a value
for a known key.
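
To make the range idea concrete, here is a minimal, hedged sketch using the HBase Java client
API (the 0.90-era API used elsewhere in this document). The table name and row keys are
assumptions for illustration; the scan returns every row whose key sorts between the start row
(inclusive) and the stop row (exclusive).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class RangeScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");              // hypothetical table
        // "Give me all values for which the key is between X and Y":
        Scan scan = new Scan(Bytes.toBytes("row1"), Bytes.toBytes("row3"));
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(result);                   // one Result per row
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}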

Multidimensional
The key itself has structure. Each key consists of the following parts: row-key, column family,
column, and time-stamp. So the mapping is actually:
(rowkey, column family, column, timestamp) -> value
rowkey and value are just bytes (column family needs to be printable), so you can store
anything that you can serialize into a byte[] into a cell.
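
As a hedged sketch of what writing to one of these coordinates looks like from the Java client
API: the table name "test" and family "cf" mirror the shell example later in this chapter, and
the timestamp is an arbitrary value passed explicitly only to show the fourth part of the key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutCellSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");
        Put put = new Put(Bytes.toBytes("row1"));      // rowkey
        put.add(Bytes.toBytes("cf"),                   // column family
                Bytes.toBytes("a"),                    // column
                1288380727188L,                        // timestamp (explicit here)
                Bytes.toBytes("value1"));              // value
        table.put(put);
        table.close();
    }
}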


Sparse
This follows from the fact that HBase stores key -> value mappings and that a "row" is nothing
more than a grouping of these mappings (identified by the rowkey mentioned above). Unlike
NULL in most relational databases, no storage is needed for absent information; there is simply
no cell for a column that has no value. It also means that every value carries all of its
coordinates with it.

Distributed
One key feature of HBase is that the data can be spread over 100s or 1000s of machines and
reach billions of cells. HBase manages the load balancing automatically.

Consistent
HBase makes two guarantees: all changes with the same rowkey (see Multidimensional above)
are applied atomically, and a reader will always read the last written (and committed) values.

HBase Architecture


HBase Characteristics
HBase uses the Hadoop filesystem (HDFS) as its data storage engine.
The advantage of this approach is that HBase doesn't need to worry about data
replication.
The downside is that it is also constrained by the characteristics of HDFS, which is not
optimized for random read access.
Data is stored in a farm of Region Servers.
A "key-to-server" mapping is needed to locate the corresponding server, and this
mapping is stored as a table, similar to any other user data table.
The HBase architecture also includes a special machine playing the role of master, which
monitors and coordinates the activities of all region servers (the heavy-duty worker nodes). At
the time of writing, the master node is a single point of failure.

HBase Data Storage


Regions
Tables are automatically partitioned horizontally by HBase into regions.
Each region comprises a subset of a table's rows.
A region is denoted by the table it belongs to, its first row (inclusive), and its last row
(exclusive).
Initially, a table comprises a single region, but as the region grows and crosses a
configurable size threshold, it splits at a row boundary into two new regions of
approximately equal size.

Locking
Row updates are atomic, no matter how many of the row's columns are involved in the
row-level transaction. This keeps the locking model simple.
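
As an illustration of that guarantee, a single Put carrying several columns of the same row is
applied atomically: a reader sees all of the new cells or none of them. The sketch below uses
the 0.90-era Java client API; the table, family, row key, and values are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AtomicRowUpdateSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");
        // Several columns of the same row, bundled into one Put...
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
        // ...are written atomically: both cells become visible together.
        table.put(put);
        table.close();
    }
}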

Implementation


HBase depends on ZooKeeper and, by default, it manages a ZooKeeper instance as the
authority on cluster state.
Assignment of regions is mediated via ZooKeeper in case participating servers crash
mid-assignment.
The client navigates the ZooKeeper hierarchy to learn cluster attributes such as server
locations.
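
On the client side this means an application only needs to be told where the ZooKeeper quorum
lives (normally via hbase-site.xml on the classpath); everything else, such as the master and
region locations, is discovered through ZooKeeper. A minimal sketch, with placeholder host
names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientConfigSketch {
    public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Usually set in hbase-site.xml; shown programmatically for clarity.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        System.out.println("quorum = " + conf.get("hbase.zookeeper.quorum"));
    }
}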

Installation
Download a stable release from an Apache Download Mirror and unpack it on your local
filesystem. For example:
% tar xzf hbase-x.y.z.tar.gz

** Make sure Java is installed and that its path is configured (e.g., JAVA_HOME).


% export HBASE_HOME=/home/hbase/hbase-x.y.z
% export PATH=$PATH:$HBASE_HOME/bin
To get the list of HBase options, type:

% hbase
Usage: hbase <command>
where <command> is one of:
  shell            run the HBase shell
  master           run an HBase HMaster node
  regionserver     run an HBase HRegionServer node
  zookeeper        run a Zookeeper server
  rest             run an HBase REST server
  thrift           run an HBase Thrift server
  avro             run an HBase Avro server
  migrate          upgrade an hbase.rootdir
  hbck             run the hbase 'fsck' tool

Getting Started

Start HBase
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out

Connect to your running HBase via the shell


$ ./bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.0, r1001068, Fri Sep 24 13:55:42 PDT 2010
hbase(main):001:0>

Create a table named test with a single column family named cf. Verify its
creation by listing all tables, and then insert some values.
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds

Verify the data insert.


hbase(main):007:0> scan 'test'
ROW                   COLUMN+CELL
 row1                 column=cf:a, timestamp=1288380727188, value=value1
 row2                 column=cf:b, timestamp=1288380738440, value=value2
 row3                 column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds

Get a single row as follows


hbase(main):008:0> get 'test', 'row1'
COLUMN                CELL
 cf:a                 timestamp=1288380727188, value=value1
1 row(s) in 0.0400 seconds

Now, disable and drop your table. This cleans up everything done above.
hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds

Exit the shell by typing exit.


hbase(main):014:0> exit

Stopping HBase
$ ./bin/stop-hbase.sh
stopping hbase...............
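
For comparison, here is a hedged sketch of the same create / put / get / disable / drop
sequence using the 0.90-era HBase Java client API; the table name, column family, row, and
values mirror the shell walkthrough above. The shell itself is JRuby wrapping this same Java
client, so either route is a fine way to experiment.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ShellWalkthroughSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // create 'test', 'cf'
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor("test");
        desc.addFamily(new HColumnDescriptor("cf"));
        admin.createTable(desc);

        // put 'test', 'row1', 'cf:a', 'value1'
        HTable table = new HTable(conf, "test");
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value1"));
        table.put(put);

        // get 'test', 'row1'
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println("cf:a = " + Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"))));
        table.close();

        // disable 'test' followed by drop 'test'
        admin.disableTable("test");
        admin.deleteTable("test");
    }
}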

The HBase Shell

The HBase Shell is (J)Ruby's IRB with some HBase particular commands added. Anything
you can do in IRB, you should be able to do in the HBase Shell.
To run the HBase shell, do as follows:


$ ./bin/hbase shell

Type help and then <RETURN> to see a listing of shell commands and options. Browse at
least the paragraphs at the end of the help output for the gist of how variables and
command arguments are entered into the HBase shell; in particular, note how table
names, rows, columns, etc. must be quoted.
See Section 1.2.3, Shell Exercises, for examples of basic shell operations.

Scripting
For examples of scripting HBase, look in the HBase bin directory at the files ending in
*.rb. To run one of these files, do as follows:
$ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT

Shell Tricks
irbrc

Create an .irbrc file for yourself in your home directory and add customizations. A useful one
is command history, so commands are saved across shell invocations:
$ more .irbrc
require 'irb/ext/save-history'
IRB.conf[:SAVE_HISTORY] = 100
IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history"

See the Ruby documentation of .irbrc to learn about other possible configurations.

LOG data to timestamp


To convert the date '08/08/16 20:56:29' from an HBase log into a timestamp, do:
hbase(main):021:0> import java.text.SimpleDateFormat
hbase(main):022:0> import java.text.ParsePosition
hbase(main):023:0> SimpleDateFormat.new("yy/MM/dd HH:mm:ss").parse("08/08/16 20:56:29", ParsePosition.new(0)).getTime()
=> 1218920189000


To go the other direction:


hbase(main):021:0> import java.util.Date
hbase(main):022:0> Date.new(1218920189000).toString() => "Sat Aug 16 20:56:29 UTC 2008"

Producing output in exactly the HBase log format takes a little more work with
SimpleDateFormat.
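
For reference, the same conversion can be done in plain Java with SimpleDateFormat; the sketch
below parses a log-style date into epoch milliseconds and formats it back again (the exact
millisecond value depends on your time zone).

import java.text.SimpleDateFormat;
import java.util.Date;

public class LogDateSketch {
    public static void main(String[] args) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yy/MM/dd HH:mm:ss");
        // Parse the log-style date into epoch milliseconds...
        long millis = fmt.parse("08/08/16 20:56:29").getTime();
        System.out.println(millis);
        // ...and format the milliseconds back into the same layout.
        System.out.println(fmt.format(new Date(millis)));
    }
}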

Debug
Shell debug switch
You can set a debug switch in the shell to see more output -- e.g. more of the stack trace
on exception -- when you run a command:
hbase> debug <RETURN>

DEBUG log level


To enable DEBUG level logging in the shell, launch it with the -d option.
$ ./bin/hbase shell -d

Overview
NoSQL?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't
an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL
databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a
distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base"
because it lacks many of the features you find in an RDBMS, such as typed columns, secondary
indexes, triggers, and advanced query languages, etc.
However, HBase has many features which support both linear and modular scaling. HBase
clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster
expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and
processing capacity. An RDBMS can scale well, but only up to a point - specifically, the size of
a single database server - and the best performance requires specialized hardware and
storage devices. HBase features of note are:


Strongly consistent reads/writes: HBase is not an "eventually consistent" data store. This
makes it very suitable for tasks such as high-speed counter aggregation.
Automatic sharding: HBase tables are distributed on the cluster via regions, and regions
are automatically split and re-distributed as your data grows.
Automatic RegionServer failover.
Hadoop/HDFS integration: HBase supports HDFS out of the box as its distributed file
system.
MapReduce: HBase supports massively parallelized processing via MapReduce, using
HBase as both source and sink (see the sketch after this list).
Java Client API: HBase supports an easy-to-use Java API for programmatic access.
Thrift/REST API: HBase also supports Thrift and REST for non-Java front ends.
Block Cache and Bloom Filters: HBase supports a block cache and Bloom filters for high-volume
query optimization.
Operational management: HBase provides built-in web pages for operational insight as
well as JMX metrics.
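
To illustrate the MapReduce point from the list above, here is a hedged sketch of a
row-counting job that uses an HBase table as its MapReduce source via TableMapReduceUtil. The
table name "test" is an assumption; the job writes no output and only increments a counter.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountSketch {

    static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // One counter increment per row read from the table.
            context.getCounter("hbase", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hbase-row-count");
        job.setJarByClass(RowCountSketch.class);
        // Use the HBase table as the MapReduce source.
        TableMapReduceUtil.initTableMapperJob(
                "test", new Scan(), RowCountMapper.class,
                NullWritable.class, NullWritable.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}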

When Should I Use HBase?


HBase isn't suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase
is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might
be a better choice due to the fact that all of your data might wind up on a single node (or two) and the
rest of the cluster may be sitting idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed
columns, secondary indexes, transactions, advanced query languages, etc.) An application built against
an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving
from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5
DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development
configuration only.


What Is The Difference Between HBase and Hadoop/HDFS?


HDFS is a distributed file system that is well suited to the storage of large files. Its documentation
states that it is not, however, a general-purpose file system, and it does not provide fast individual record
lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and
updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts
your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups.

