Big data: data is called big data when it is beyond the storage capacity and beyond the processing power of conventional systems.
Eg: NCDC weather data, social networks, CCTV camera footage, airline data, hospital data, etc.
Characteristics of big data
Volume-big data comes at a large scale (TB, PB, ...).
Variety-big data spans structured, semi-structured, and unstructured data.
Velocity-the data flows continuously, like a stream.
Structured-RDBMS data, MS Excel, enterprise data, etc.
Semi-structured-XML files, log files, etc.
Unstructured-e-mail, Word documents, PDFs, images, audio, video, etc.
Hadoop
Apache Hadoop is a framework that allows for the distributed storage and processing of large data sets across clusters of commodity (low-cost) computers using a simple programming model called MapReduce.
It is basically open-source data management with scale-out storage and distributed processing.
Clusters: Hadoop gets its power from distributing data across the nodes of a cluster. A cluster can be as big as 10,000 nodes, where your data is stored and retrieved in parallel.
Hadoop core components (2)
HDFS
MapReduce
Hadoop ecosystem (architecture)
NameNode
Hadoop data types (Java type -> Writable class)
Integer/int -> IntWritable
Long/long -> LongWritable
Float/float -> FloatWritable
Byte/byte -> ByteWritable
String -> Text
Double/double -> DoubleWritable
Classes (Java class -> Writable equivalent)
String -> Text
byte[] -> BytesWritable
Object -> ObjectWritable
null -> NullWritable
Java collection -> Writable implementation
Array -> ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable
Map -> MapWritable
SortedMap -> SortedMapWritable
EnumSet -> EnumSetWritable
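A minimal sketch of how these wrapper types are used, assuming the standard org.apache.hadoop.io package (the class name WritableDemo is illustrative):

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    public class WritableDemo {
        public static void main(String[] args) {
            // Wrap plain Java values in their Writable counterparts
            IntWritable count = new IntWritable(42);
            Text word = new Text("hadoop");

            // MapWritable stores Writable keys mapped to Writable values
            MapWritable map = new MapWritable();
            map.put(word, count);

            // NullWritable is a singleton placeholder used when no value is needed
            NullWritable none = NullWritable.get();

            System.out.println(word + " -> " + map.get(word)); // prints: hadoop -> 42
            System.out.println(none);                          // prints: (null)
        }
    }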
MapReduce
A programming model used by Google: a combination of the map and reduce models, with an associated implementation, used for processing and generating large datasets.
Map
The map task in MapReduce is performed using the map() function, and this part of MapReduce is responsible for processing the input.
Reduce
The next component of MapReduce programming is the reduce() function. This part is responsible for consolidating the results produced by each of the map() functions, as sketched below.
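As an illustration of the two functions, here is a sketch of the classic word-count example, assuming the newer org.apache.hadoop.mapreduce API (the class names WordCountMapper and WordCountReducer are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map(): emits (word, 1) for every word in its input split
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // reduce(): consolidates the counts emitted for each word by all mappers
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }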
How MapReduce works
input -> split -> map -> partition and combine -> shuffle and sort -> reduce -> output
This shows the logical flow of the MapReduce programming model.
Input-this is the input data/file to be processed.
Split-Hadoop splits the incoming data into smaller pieces called splits.
Map-MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers (slave nodes) and coordinated by the JobTracker (master node).
Partition and combine-this is an optional step, used to improve performance by reducing the amount of data transferred across the network. The combiner is essentially a reducer applied to the output of each map() function before it is passed on to the subsequent steps.
Shuffle and sort-in this step, the output from all the mappers is shuffled, sorted to put it in order, and grouped before being sent to the next step.
Reduce-this step aggregates the output of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
Output-finally, the output of the reduce step is written to a file in HDFS. A driver program wiring these steps together is sketched below.
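One way to wire these steps together is a driver program; this sketch reuses the illustrative WordCountMapper and WordCountReducer classes from the earlier sketch, and the setCombinerClass() line corresponds to the optional partition-and-combine step:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);
            // Optional combiner: a reduce-like aggregation of each mapper's output
            job.setCombinerClass(WordCountReducer.class);
            job.setReducerClass(WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input file/directory to split and process; output directory in HDFS
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Submit the job and wait; the framework handles scheduling and retries
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }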
Hadoop MapReduce
-MapReduce works in a master-slave fashion. The JobTracker acts as the master and the TaskTrackers act as slaves.
-MapReduce has two major phases: a map phase and a reduce phase. The map phase processes parts of the input data using mappers, based on the logic defined in the map() function. The reduce phase aggregates the data using reducers, based on the logic defined in the reduce() function.
-depending upon the problem at hand, we can use a single reduce task, multiple reduce tasks, or no reduce task.
-MapReduce takes care of distributing the data across the various nodes, assigning the tasks to each of the nodes, getting the results back from each node, re-running tasks in case of any node failures, consolidating the results, etc.
-MapReduce processes the data in the form of (key, value) pairs. Hence we need to fit our business problem into a key-value arrangement, as in the trace below.
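For example, tracing the (key, value) pairs through a word count of the line "to be or not to be":

    map() emits: (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
    shuffle and sort groups them into: (be, [1, 1]), (not, [1]), (or, [1]), (to, [1, 1])
    reduce() emits: (be, 2), (not, 1), (or, 1), (to, 2)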
YARN concept in MapReduce
MR1-classic
MR2-NextGen
MapReduce1 (classic) has three main core components:
API-for user-level programming of MR applications.
Framework-runtime services for running map and reduce processes, shuffling and sorting, etc.
Resource management-infrastructure to monitor nodes, allocate resources, and schedule jobs.
MapReduce2 (NextGen) moves resource management into YARN.
YARN is a new framework created to manage resources.
RM-Resource Manager
NM-Node Manager
-it provides daemons and APIs.
-it handles and schedules requests from applications.
-it supervises execution of the requests.
-examples of resources are memory and CPU.
-in earlier versions, resource management was built into the MapReduce component.
-MapReduce was handling both resource management and data processing.
-now that YARN has taken over resource management, MR2 is a distributed application that runs the MapReduce framework on top of YARN.
Hadoop MapReduce architecture
Map-Reduce master-JobTracker
-accepts MR jobs submitted by users.
-assigns map and reduce tasks to TaskTrackers.
-monitors task and TaskTracker status; re-executes tasks upon failure.
Map-Reduce slave-TaskTracker
-runs map and reduce tasks upon instruction from the JobTracker.
-manages storage and transmission of intermediate output.
Anatomy of MapReduce
MapReduce at a high level
The master node runs a JobTracker instance, which accepts job requests from clients.
TaskTracker instances run on the slave nodes.
The reducer produces the final output. Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase.
Let us take a closer look at MapReduce. Suppose we have two slave nodes.