Big Data
Gartner definition: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."
Big data is the term for a collection of data sets so large and complex that they are difficult to process using on-hand database management tools or traditional data processing applications. For example, millions of files may be scanned, reported, and analyzed each day.
Large-Scale Data
HDFS vs GFS

               Hadoop Distributed File System               Google File System
Platform       Cross-platform (Java)                        Linux (C/C++)
License        Open source (Apache License)                 Proprietary
Chunk size     64 MB default, user configurable per file    64 MB
Developer(s)   Apache Software Foundation                   Google
Database vs Hadoop

               Traditional Database       Hadoop
Software       Specialized software       Open-source software on commodity hardware
Data           Structured databases       Structured and unstructured data
Scalability    Difficult scalability      Scales out horizontally
Cost           Expensive                  Low cost
What is Hadoop?
It is an open-source framework for the distributed storage and processing of very large data sets on clusters of commodity hardware.
Why Hadoop?
Reading 1 TB from a single disk at roughly 100 MB/s takes hours; reading from hundreds of disks in parallel takes minutes. Hadoop spreads data and computation across many commodity machines to exploit this parallelism.
Yahoo!
More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2,000 nodes (2*4-CPU boxes with 4 TB of disk each); used to support research for Ad Systems and Web Search.
Facebook, AOL
Log processing, recommendation systems, and data warehousing.
Components
1) HDFS (Hadoop Distributed File System): provides storage.
   Chunk servers store the data blocks.
   The master node stores metadata and might be replicated; it keeps track of the namespace and metadata about items, and keeps track of MapReduce jobs in the system.
2) MapReduce: provides distributed processing.
Hadoop Master/Slave
Architecture
Hadoop is designed as a master-slave, shared-nothing architecture.
NameNode (the Master)
The namenode manages the filesystem
namespace.
It maintains the filesystem tree and the
metadata for all the files and directories in the
tree.
This information is stored persistently on the
local disk in the form of two files: the
namespace image and the edit log.
It does not store block locations persistently,
because this information is reconstructed from
datanodes when the system starts.
Data Node
Datanodes are the workhorses of the
filesystem.
They store and retrieve blocks, and they
report back to the namenode periodically
with lists of blocks that they are storing.
HDFS
HDFS has two main layers
(i) Namespace: consists of directories, files and blocks, and supports operations such as create, delete, modify and list files and directories (these operations are exercised through the Java FileSystem API, sketched after this section).
(ii) Block Storage Service (two parts)
(a) Block Management (in the Namenode)
Provides datanode cluster membership by handling registrations and periodic heartbeats.
Processes block reports and maintains the location of blocks.
Supports block-related operations such as create, delete, modify and get block location.
Manages replica placement, replicates under-replicated blocks, and deletes blocks that are over-replicated.
(b) Storage
Provided by datanodes, which store blocks on the local file system and allow read/write access.
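As a concrete illustration of the namespace operations above, here is a minimal sketch using Hadoop's Java FileSystem API. It assumes a reachable cluster configured through the usual core-site.xml; the /demo paths are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);           // namespace calls go to the Namenode

    Path dir = new Path("/demo");                   // hypothetical directory
    fs.mkdirs(dir);                                 // create a directory
    fs.create(new Path(dir, "data.txt")).close();   // create an (empty) file
    for (FileStatus s : fs.listStatus(dir)) {       // list directory contents
      System.out.println(s.getPath());
    }
    fs.delete(dir, true);                           // delete recursively
    fs.close();
  }
}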
Multiple Namenodes/Namespaces
Scale the name service horizontally.
The Namenodes are independent and don't require coordination with each other.
The datanodes are used as common storage for blocks by all the Namenodes.
Each datanode registers with all the Namenodes in the cluster.
Datanodes send periodic heartbeats and block reports, and handle commands from the Namenodes; the heartbeats tell the Namenodes that the datanode is alive.
Filesystem Metadata
The HDFS namespace is stored by the Namenode.
The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata.
Safemode Startup
On startup, the Namenode enters a special state called Safemode in which blocks are not replicated or deleted; it waits for Datanodes to report their blocks, and leaves Safemode once a configurable percentage of blocks are safely replicated.
Key Benefits
Moving Computation is Cheaper than Moving Data
What is Map/Reduce?
MapReduce is a programming model for processing large data sets in parallel across a cluster.
Job Client
The Job Client prepares a job for execution when a MapReduce job is submitted (a code sketch follows the list):
1. Validates the job configuration.
2. Generates the input splits.
3. Copies the job resources to a shared
location, such as an HDFS directory, where
it is accessible to the Job Tracker and Task
Trackers.
4. Submits the job to the Job Tracker.
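A rough sketch of the application side of this, using the org.apache.hadoop.mapreduce API; the /input and /output paths are made up, and the mapper and reducer are left at their identity defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "example-job");
    job.setJarByClass(SubmitJob.class);
    FileInputFormat.addInputPath(job, new Path("/input"));    // hypothetical paths
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    // submit() performs the steps listed above: it validates the
    // configuration, computes input splits, copies the job jar and
    // resources to a shared staging directory, and hands the job off
    // to the scheduler.
    job.submit();
    System.out.println("Tracking: " + job.getTrackingURL());
  }
}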
Job Tracker
The Job Tracker is responsible for scheduling jobs, dividing a job into map and reduce tasks, distributing map and reduce tasks among worker nodes, recovering from task failures, and tracking the job status.
1. Fetches input splits from the shared location where the Job Client placed the information.
2. Creates a map task for each input split.
3. Assigns each map task to a Task Tracker (worker node).
Task Tracker
A Task Tracker manages the tasks of one worker node and reports status to the Job Tracker. A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.
1. Fetches job resources locally.
2. Reports status to the Job Tracker.
MapReduce
1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
6. A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
Map function
The framework calls the application's Map function once for each input record; the Map function emits intermediate key/value pairs.
map(key1, value1) -> list<key2, value2>
Reduce function
The framework sorts the outputs of the
maps, which are then input to the reduce
tasks.
The framework calls the application's
Reduce function once for each unique key in
the sorted order.
The Reduce can iterate through the values
that are associated with that key and
produce zero or more outputs.
reduce(key2, list<value2>) -> list<value3>
Example: Word Count
map(key=url, val=contents):
    for each word w in contents, emit (w, 1)

reduce(key=word, values=uniq_counts):
    sum all 1s in values list
    emit result (word, sum)
Input:              Map output:        Reduce output:
see bob run         see   -> 1         bob    1
see spot throw      bob   -> 1         run    1
                    run   -> 1         see    2
                    see   -> 1         spot   1
                    spot  -> 1         throw  1
                    throw -> 1
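For comparison with the pseudocode above, here is essentially the canonical WordCount program from the Apache Hadoop MapReduce tutorial, written against the org.apache.hadoop.mapreduce Java API; input and output paths are taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each word w in the input line, emit (w, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce: called once per unique word; sum the 1s and emit (word, sum).
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of (w, 1) pairs
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}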
CheckpointNode and BackupNode
The CheckpointNode periodically creates checkpoints of the namespace: it downloads the namespace image and edit log from the Namenode, merges them locally, and uploads the new image back to the Namenode.
The BackupNode provides the same checkpointing function and additionally maintains an in-memory, up-to-date copy of the filesystem namespace that is always synchronized with the Namenode state.
Node-to-Node Communication
All HDFS communication protocols are layered on top of TCP/IP. Clients talk to the Namenode using the ClientProtocol, and Datanodes talk to the Namenode using the DatanodeProtocol; a Remote Procedure Call (RPC) abstraction wraps both. By design, the Namenode never initiates an RPC; it only responds to requests from clients and Datanodes.
Space Reclamation
When a file is deleted by a client, HDFS first renames it to a file in the /trash directory, where it stays for a configurable amount of time.
A client can request an undelete within this window.
After the specified time the file is deleted and the space is reclaimed.
When the replication factor is reduced, the Namenode selects excess replicas that can be deleted.
The next heartbeat transfers this information to the Datanode, which then removes the corresponding blocks and frees the space.
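A minimal client-side sketch of both mechanisms, assuming the Java API; the paths are made up, and org.apache.hadoop.fs.Trash is the helper class behind the shell's trash behavior:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class ReclaimDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.trash.interval", "60");        // keep trashed files for 60 minutes
    FileSystem fs = FileSystem.get(conf);

    // Deleting via the trash: the file is renamed into the trash directory
    // and can be restored (undeleted) until the interval expires.
    new Trash(fs, conf).moveToTrash(new Path("/demo/old.log"));   // hypothetical path

    // Reducing the replication factor: the Namenode picks excess replicas,
    // and Datanodes remove them after the information arrives on a heartbeat.
    fs.setReplication(new Path("/demo/big.dat"), (short) 2);      // hypothetical path
    fs.close();
  }
}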
Data Integrity
When a client creates an HDFS file, it computes a checksum of each block and stores the checksums in a separate hidden file in the same namespace. When a client retrieves file contents, it verifies the data against these checksums; if verification fails, the client can fetch the block from another Datanode that holds a replica.
MapReduce: Fault Tolerance
Worker failure: The master pings every
worker periodically. If no response is
received from a worker in a certain
amount of time, the master marks the
worker as failed.
Master Failure: It is easy to make the
master write periodic checkpoints of the
master data structures described above. If
the master task dies, a new copy can be
started from the last checkpointed state.
Re-replication
The Namenode initiates re-replication of blocks whenever necessary: a Datanode may become unavailable, a replica may become corrupted, a disk on a Datanode may fail, or the replication factor of a file may be increased.
Cluster Rebalancing
HDFS architecture is compatible with data
rebalancing schemes.
A scheme might move data from one
Datanode to another if the free space on a
Datanode falls below a certain threshold.
In the event of a sudden high demand for a
particular file, a scheme might dynamically
create additional replicas and rebalance
other data in the cluster.
Hive
The Apache Hive data warehouse software
facilitates querying and managing large datasets
residing in distributed storage.
Hive provides a mechanism to project structure
onto this data and query the data using a SQL-like
language called HiveQL.
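For a concrete feel, here is a minimal sketch of issuing a HiveQL query from Java through Hive's JDBC driver (HiveServer2). The host, port, credentials, and the logs table with its dt date column are assumptions made up for the example:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDailySummary {
  public static void main(String[] args) throws Exception {
    // Connect to a HiveServer2 instance (endpoint and credentials assumed).
    Connection conn = DriverManager.getConnection(
        "jdbc:hive2://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();
    // HiveQL looks like SQL; Hive compiles the query into MapReduce jobs.
    ResultSet rs = stmt.executeQuery(
        "SELECT dt, COUNT(*) AS events FROM logs GROUP BY dt");
    while (rs.next()) {
      System.out.println(rs.getString("dt") + "\t" + rs.getLong("events"));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}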
Use cases
Producing daily and hourly summaries over
large amounts of data.
Running ad hoc jobs over historical data; these analyses help answer questions from product groups and the executive team.
Looking up log events by specific attributes, which is used to maintain the integrity of the site and protect users.
Data architecture
Scribe: an open source log collection service developed at Facebook that deposits hundreds of log datasets, with a daily volume in the tens of terabytes, into a handful of NFS servers.
HDFS: a large fraction of this log data is copied into one central HDFS instance. Dimension data is also scraped from internal MySQL databases and copied into HDFS daily.
Tools: Browser-based interfaces built on top
of Hive allow users to compose and launch
Hive queries.
Traditional RDBMS: Oracle and MySQL databases are used to publish summaries. The volume of data is relatively small, but the query rate is high and needs real-time response.
Conclusions
MapReduce provides a general-purpose model that simplifies large-scale computation.
HDFS (Hadoop Distributed File System) is a petabyte-scale file system for handling big data sets.
Together they allow users to focus on the problem without worrying about the details of distributed systems, greatly simplifying large-scale computations.