
History

In the early 2000s, Google developed the Google File System to support large distributed data-intensive applications, for example processing crawled documents, web request logs, etc. to produce inverted indices, statistics, and so on.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005; Cutting was working at Yahoo! at the time.
It was originally developed to support distribution for the Nutch search engine project.
Hadoop's MapReduce and HDFS components originally
derived respectively from Google's MapReduce and
Google File System (GFS) papers.

Big Data
Gartner definition : "Big data is high volume, high
velocity, and/or high variety information assets that
require new forms of processing to enable enhanced
decision making, insight discovery and process
optimization."
Big data is the term for a collection of data sets so
large and complex.
The data difficult to process using on-hand database
management tools or traditional data processing
applications.
Like, Millions of files are scanned, reported and
analyzed each day.

Large-Scale Data

HDFS vs GFS

Platform: HDFS is cross-platform (Java); GFS runs on Linux (C/C++).
License: HDFS is open source (Apache 2.0); GFS is proprietary.
Chunk size: HDFS uses a 128 MB default, user-configurable per file; GFS uses a 64 MB default, user-configurable per file.
Developer(s): HDFS was developed by Yahoo! and the open source community; GFS by Google.

Database vs Hadoop

Traditional database: specialized software; Hadoop: open source software.
Traditional database: structured data only; Hadoop: any type of data can be stored.
Traditional database: difficult to scale; Hadoop: provides linear scalability.
Traditional database: online processing and analysis of data; Hadoop: offline (batch) processing and analysis of data.
Traditional database: expensive; Hadoop: low cost.

What is Hadoop?
It is open-source software for reliable, scalable, distributed computing.
It is a reliable shared storage and analysis system.
It is designed to scale up from single servers to thousands of machines.
Large datasets: terabytes or petabytes of data.
Large clusters: hundreds or thousands of nodes.

Why Hadoop?
Reading from a single huge drive (say, 100 TB) is slow, and writing is even slower; one way to reduce the time is to use multiple drives (10 x 10 TB) working in parallel.
It brings several hundred gigabytes of data together and provides the tools to analyze it.
The chance of one drive failing is fairly high; the common way to avoid data loss is to keep replicas of the data in the system, so that in the event of a failure another copy is available.

Who uses Hadoop?

A wide variety of companies and organizations use Hadoop for both research and production. For example:
Facebook currently uses two major clusters:
(i) an 1100-machine cluster with 8800 cores and about 12 PB of raw storage
(ii) a 300-machine cluster with 2400 cores and about 3 PB of raw storage
Yahoo! has more than 100,000 CPUs in ~20,000 computers running Hadoop; its biggest cluster has 2000 nodes (2 x 4-CPU boxes with 4 TB of disk each), used to support research for Ad Systems and Web Search.

What is Hadoop used for?

Search: Yahoo, Amazon, Zvents
Log processing: Facebook, Yahoo, ContextWeb, Joost, Last.fm
Recommendation systems: Facebook
Data warehouse: Facebook, AOL
Video and image analysis: New York Times, Eyealike

The Hadoop framework is composed of the following modules:

Hadoop Common: contains libraries and utilities needed by the other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling users' applications.
Hadoop MapReduce: a programming model for large-scale data processing.

Components
1) HDFS (Hadoop Distributed File System): provides storage.
Chunk servers:
The file is split into contiguous chunks, typically 16-64 MB each.
Each chunk is replicated (usually 2x or 3x).
The system tries to keep replicas in different racks.
Master node:
Stores metadata and might itself be replicated.
Keeps track of the namespace and metadata about items.
Keeps track of the MapReduce jobs in the system.
2) MapReduce: for analysis.

Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave, shared-nothing architecture:
a master node (single node) and many slave nodes.

NameNode (The Master)
The namenode manages the filesystem
namespace.
It maintains the filesystem tree and the
metadata for all the files and directories in the
tree.
This information is stored persistently on the
local disk in the form of two files: the
namespace image and the edit log.
It does not store block locations persistently,
because this information is reconstructed from
datanodes when the system starts.

Data Node
Datanodes are the workhorses of the
filesystem.
They store and retrieve blocks, and they
report back to the namenode periodically
with lists of blocks that they are storing.

HDFS
HDFS has two main layers.
(i) Namespace: consists of directories, files and blocks, and supports operations such as create, delete, modify and list files and directories.
The Namenodes are independent and don't require coordination with each other.

HDFS
(ii) Block Storage Service (has two parts):
(a) Block Management (in the Namenode)
Provides datanode cluster membership by handling registrations and periodic heartbeats.
Processes block reports and maintains the locations of blocks.

HDFS
Supports block-related operations such as create, delete, modify and get block location.
Manages replica placement, replicates under-replicated blocks, and deletes blocks that are over-replicated.

(b) Storage
Provided by datanodes, which store blocks on the local file system and allow read/write access.

Multiple Namenodes/Namespaces
Scale the name service horizontally.

Block Pool: a Block Pool is the set of blocks that belong to a single namespace. Datanodes store blocks for all the block pools in the cluster.
ClusterID: a new identifier that identifies all the nodes in the cluster. When the cluster is formatted, a new ID is auto-generated.

Multiple Namenodes/Namespaces
Scale the name service horizontally.
The Namenodes are independent and don't require coordination with each other.
The datanodes are used as common storage for blocks by all the Namenodes.
Each datanode registers with all the Namenodes in the cluster.
Datanodes send periodic heartbeats and block reports to tell the Namenodes that they are alive, and they handle commands from the Namenodes.

Multiple Namenodes/Namespaces

Filesystem Metadata
The HDFS namespace is stored by the Namenode.
The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata, for example creating a new file or changing the replication factor of a file. The EditLog is stored in the Namenode's local filesystem.
The entire filesystem namespace, including the mapping of blocks to files and the file system properties, is stored in a file called FsImage, also kept in the Namenode's local filesystem.

Safemode Startup

On startup the Namenode enters Safemode. Replication of data blocks does not occur in Safemode.
Each DataNode checks in with a Heartbeat and a BlockReport.
The Namenode verifies that each block has an acceptable number of replicas.
After a configurable percentage of safely replicated blocks have checked in with the Namenode, the Namenode exits Safemode.
It then makes a list of the blocks that still need to be replicated and proceeds to replicate them on other Datanodes.

Key Benefits
Moving computation is cheaper than moving data.

1. Namespace scalability: large deployments, or deployments using a lot of small files, benefit from scaling the namespace by adding more Namenodes.
2. Performance: scales the file system read/write throughput.
3. Isolation: different categories of applications and users can be isolated in different namespaces.

What is Map/Reduce

A MapReduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner.
MapReduce can take advantage of locality of data, processing data on or near the storage assets to decrease the transmission of data.
Each node is expected to report back periodically with completed work and status updates.

MapReduce

Client Computer

A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes.
The client presents a filesystem interface, so the user code does not need to know about the namenode and datanodes to function.

Job Client
The Job Client prepares a job for execution when you submit a MapReduce job. It:
1. Validates the job configuration.
2. Generates the input splits.
3. Copies the job resources to a shared location, such as an HDFS directory, where they are accessible to the Job Tracker and Task Trackers.
4. Submits the job to the Job Tracker.
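
A minimal Java sketch of these four steps using the org.apache.hadoop.mapreduce Job API. The pass-through (identity) map/reduce and the command-line input/output paths are illustrative assumptions; waitForCompletion() is what copies the job resources to the shared staging location and submits the job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJobExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Step 1: the job configuration is validated here and again at submit time.
    Job job = Job.getInstance(conf, "pass-through example");
    job.setJarByClass(SubmitJobExample.class);
    // The framework's default (identity) Mapper and Reducer are used, so the job
    // simply copies <offset, line> records; a real job would set its own classes.
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    // Step 2: input splits are generated from this path at submission time.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    // Steps 3-4: copy job resources to the shared staging directory, submit the job,
    // then wait for it to finish.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}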

Job Tracker
The Job Tracker is responsible for scheduling jobs, dividing a job into map and reduce tasks, distributing map and reduce tasks among worker nodes, recovering from task failures, and tracking the job status. It:
1. Fetches input splits from the shared location where the Job Client placed the information.
2. Creates a map task for each split.
3. Assigns each map task to a Task Tracker (worker node).

Job Tracker (contd.)

The Job Tracker monitors the health of the Task Trackers and the progress of the job. As map tasks complete and results become available, the Job Tracker:
1. Creates reduce tasks, up to the maximum enabled by the job configuration.
2. Assigns each map result partition to a reduce task.
3. Assigns each reduce task to a Task Tracker.

Task Tracker
A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker. A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. It:
1. Fetches job resources locally.
2. Reports status to the Job Tracker.

MapReduce
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.

MapReduce
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.

Map function

The Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. For example, in a word-count problem the function breaks each line into words and outputs a key/value pair for each word, where the word is the key and the number of occurrences of that word in the line is the value.

map(key1, value1) -> list<key2, value2>

Reduce function
The framework sorts the outputs of the
maps, which are then input to the reduce
tasks.
The framework calls the application's
Reduce function once for each unique key in
the sorted order.
The Reduce function can iterate through the values that are associated with that key and produce zero or more outputs.
reduce(key2, list<value2>) -> list<value3>
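
A minimal Java sketch of these two functions for the word-count example, written against Hadoop's Mapper and Reducer base classes. The class names and the whitespace tokenization are illustrative choices, not the only way to write it.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // map(key1, value1) -> list<key2, value2>
  // key1 is the byte offset of the line, value1 is the line text.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);  // emit (word, 1)
        }
      }
    }
  }

  // reduce(key2, list<value2>) -> list<value3>
  // Called once per unique word, with all of that word's counts.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);  // emit (word, total count)
    }
  }
}

In a real job these classes are wired into a driver, for example with job.setMapperClass(WordCount.TokenizerMapper.class) and job.setReducerClass(WordCount.IntSumReducer.class).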

What is Map/Reduce (contd.)

The MapReduce framework operates exclusively on <key, value> pairs.
Input and output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)


What is Map/Reduce (contd.)

map(key=url, val=contents):
For each word w in contents, emit (w, 1)

reduce(key=word, values=uniq_counts):
Sum all 1s in values list
Emit result (word, sum)

Example input: "see bob run" and "see spot throw"
Map output: (see, 1), (bob, 1), (run, 1), (see, 1), (spot, 1), (throw, 1)
Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)

Hadoop NextGen MapReduce (YARN)

YARN splits the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons.
The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).

Hadoop NextGen MapReduce (YARN)

The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

Hadoop NextGen MapReduce (YARN)

The ResourceManager has two main components:
1. Scheduler: responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc.
2. ApplicationsManager: responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure.

Data Flow-File Read

Step 1: The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.
Step 2: DistributedFileSystem calls the namenode, using RPC, to determine the locations of the first few blocks in the file. The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from.

Data Flow-File Read

Step 3: The client then calls read() on the stream. DFSInputStream has stored the datanode addresses for the first few blocks in the file.
Step 4: Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.

Data Flow-File Read

Step 5: When the end of a block is reached, DFSInputStream closes the connection to that datanode, then finds the best datanode for the next block.
Step 6: It calls the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.

Data Flow-File Read

If the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block.
It will also remember datanodes that have failed, so that it doesn't needlessly retry them for later blocks.
The client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block.
This allows HDFS to scale to a large number of concurrent clients; the traffic is spread across all the datanodes in the cluster.
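
As a rough illustration of this read path, the sketch below uses the public FileSystem API. The input path is a placeholder, and the cluster configuration is assumed to come from the usual core-site.xml and hdfs-site.xml files on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);       // DistributedFileSystem when fs.defaultFS is hdfs://
    Path path = new Path(args[0]);              // e.g. /user/alice/input.txt (placeholder)

    // open() asks the namenode for block locations (steps 1-2);
    // read() then streams the data from the datanodes (steps 3-6).
    try (FSDataInputStream in = fs.open(path);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }  // close() is called automatically here (step 6)
  }
}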

Data Flow-File Write

Step 1: The client creates the file by calling create() on DistributedFileSystem.
Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it.

Data Flow-File Write

Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue.
The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.

Data Flow-File Write

Step 4: The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode forwards it to the third (and last) datanode.
Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue.
A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.

Data Flow-File Write

Step 6: When the client has finished writing data, it calls close() on the stream.
Step 7: This flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.
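
A corresponding write-path sketch, again using the public FileSystem API with a placeholder path. create() and close() correspond to the steps above, while the packet pipeline (data queue, DataStreamer, ack queue) runs inside the returned output stream.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);  // e.g. /user/alice/output.txt (placeholder)

    // create() issues the RPC that adds the file to the namespace (steps 1-2).
    // Writes are buffered into packets and pushed down the datanode pipeline
    // behind this stream (steps 3-5).
    try (FSDataOutputStream out = fs.create(path, true /* overwrite */)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }  // close() flushes the remaining packets and tells the namenode
       // that the file is complete (steps 6-7)
  }
}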

Data Flow-File Write

If a datanode fails while data is being written to it, the following actions are taken:
1. The pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes downstream from the failed node will not miss any packets.
2. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on.

Data Flow-File Write

3. The failed datanode is removed from the pipeline, and the remainder of the block's data is written to the two good datanodes in the pipeline.
4. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node.
5. Subsequent blocks are then treated as normal.

Image and Journal

The image is the file system metadata that describes the organization of application data as directories and files.
A persistent record of the image written to disk is called a checkpoint.
The journal is a write-ahead commit log for changes to the file system that must be persistent.

CheckpointNode and BackupNode

A NameNode can alternatively be run as a CheckpointNode or a BackupNode.
The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal.
A BackupNode acts like a shadow of the NameNode and keeps an up-to-date copy of the image in memory.

Node-to-Node Communication

Hadoop uses its own RPC protocol.
All communication begins in the slave nodes:
(i) this prevents circular-wait deadlock;
(ii) slaves periodically poll for status messages.
Classes must provide explicit serialization.

Node-to-Node Communication

Communication protocols are layered on top of the TCP/IP protocol.
The NameNode never initiates any RPCs; it only responds to RPC requests issued by DataNodes or clients.
A client establishes a connection to a configurable TCP port on the NameNode machine.
The Remote Procedure Call (RPC) abstraction wraps both the Client Protocol and the DataNode Protocol.

Space Reclamation
When a file is deleted by a client, HDFS renames it to a file in the /trash directory for a configurable amount of time.
A client can request an undelete within this allowed time.
After the specified time the file is deleted and the space is reclaimed.
When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.
The next heartbeat transfers this information to the Datanode, which then clears the blocks and frees the space.

Data Integrity

Consider the situation where a block of data fetched from a Datanode arrives corrupted.
This corruption may occur because of faults in a storage device, network faults, or buggy software.
An HDFS client computes a checksum of every block of its file and stores the checksums in hidden files in the HDFS namespace.
When a client retrieves the contents of a file, it verifies that the corresponding checksums match.
If they do not match, the client can retrieve the block from another replica.
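
The sketch below is only a conceptual illustration of this checksum-on-write, verify-on-read idea, using java.util.zip.CRC32 over a byte array; it is not HDFS's internal checksum code.

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChecksumConceptDemo {
  // Compute a CRC32 checksum over a block of bytes.
  static long checksum(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data);
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] block = "some block contents".getBytes(StandardCharsets.UTF_8);
    long storedChecksum = checksum(block);  // recorded when the block is written

    // Later, when the block is read back, recompute and compare.
    byte[] fetched = block.clone();
    // fetched[0] ^= 0x01;                  // uncomment to simulate corruption
    if (checksum(fetched) != storedChecksum) {
      System.out.println("Checksum mismatch: fetch the block from another replica");
    } else {
      System.out.println("Checksum OK");
    }
  }
}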

MapReduce Fault Tolerance
Worker failure: The master pings every worker periodically. If no response is received from a worker in a certain amount of time, the master marks the worker as failed.
Master failure: It is easy to have the master write periodic checkpoints of its data structures. If the master task dies, a new copy can be started from the last checkpointed state.

Re-replication

The necessity for re-replication may arise because:
A Datanode may become unavailable,
A replica may become corrupted,
A hard disk on a Datanode may fail, or
The replication factor of the block may be increased.

Re-replication: Default Placement Strategy

The first replica is placed on the same node as the client.
The second replica is placed on a different rack from the first (off-rack), chosen at random.
The third replica is placed on the same rack as the second, but on a different node chosen at random.
Further replicas are placed on random nodes in the cluster, although the system tries to avoid placing too many replicas on the same rack.

Cluster Rebalancing
HDFS architecture is compatible with data
rebalancing schemes.
A scheme might move data from one
Datanode to another if the free space on a
Datanode falls below a certain threshold.
In the event of a sudden high demand for a
particular file, a scheme might dynamically
create additional replicas and rebalance
other data in the cluster.

Hadoop and Hive at Facebook

Hive
The Apache Hive data warehouse software
facilitates querying and managing large datasets
residing in distributed storage.
Hive provides a mechanism to project structure
onto this data and query the data using a SQL-like
language called HiveQL.

Use cases
Producing daily and hourly summaries over
large amounts of data.
Running ad hoc jobs over historical data.
These analyses help answer questions from
our product groups and executive team.
Looking up log events by specific attributes, which is used to maintain the integrity of the site and protect users.

Data architecture
Scribe: an open source log collection service developed at Facebook that deposits hundreds of log datasets, with a daily volume in the tens of terabytes, onto a handful of NFS servers.
HDFS: a large fraction of this log data is copied into one central HDFS instance. Dimension data is also scraped from our internal MySQL databases and copied over into HDFS daily.

Data architecture

Hive/Hadoop: log data from Scribe and dimension data from the MySQL tier are made available as tables with logical partitions.
A SQL-like query language provided by Hive is used in conjunction with MapReduce to create and publish a variety of summaries and reports, as well as to perform historical analysis over these tables.

Data architecture
Tools: Browser-based interfaces built on top
of Hive allow users to compose and launch
Hive queries.
Traditional RDBMS: Oracle and MySQL databases are used to publish the summaries. The volume of data is relatively small, but the query rate is high and needs real-time responses.

Data architecture

DataBee: in-house Extract, Transform, Load (ETL) workflow software that provides a common framework for reliable batch processing across all data processing jobs.
Data from the NFS tier storing Scribe data is continuously replicated to the HDFS cluster by copier jobs. The NFS devices are mounted on the Hadoop tier, and the copier processes run as map-only jobs on the Hadoop cluster.

Advertiser insights and performance
One of the most common uses of Hadoop is to produce summaries from large volumes of data.
This is very typical of large ad networks, such as the Facebook ad network, Google AdSense, and many others.
Advertisers can embed information from a user's network of friends; for example, a Nike ad may refer to a friend of the user who recently fanned Nike and shared that information with her friends on Facebook.

Conclusions
MapReduce provides a general-purpose model that simplifies large-scale computation.
The Hadoop Distributed File System is a peta-scale file system for handling big data sets.
Together they allow users to focus on the problem without worrying about the details of distribution, greatly simplifying large-scale computations.

