
HADOOP

Big data: When data grows beyond the storage capacity and beyond the processing power of a traditional system, it is called big data.
Eg: NCDC weather data, social networks, CCTV camera footage, airline data, hospital data, etc.
Characteristics of big data
Volume-big data comes at a large scale (TB, PB, ...).
Variety-big data spans structured, semi-structured, and unstructured data.
Velocity-the flow of data is continuous, like a streaming flow.
Structured-RDBMS data, MS Excel, enterprise data, etc.
Semi-structured-XML files, log files, etc.
Unstructured-e-mail, Word documents, PDFs, images, audio, video, etc.
Hadoop
Apache Hadoop is a framework that allows distributed storage and
processing of large data sets across clusters of commodity (low
cost) computers using a simple programming model called MapReduce.
It is basically open-source data management with scale-out storage
and distributed processing.
Clusters: Hadoop gets its power by distributing data across nodes.
A cluster can be as large as 10,000 nodes, where your data is stored and
retrieved in parallel.
Hadoop core components (2)
HDFS
MapReduce
Hadoop ecosystem (architecture)

- in HDFS we follow a master-slave architecture.
- the master is called the name node.
- a slave node is called a data node.
- the name node acts as the master node.
The name node (web UI at localhost:50070) stores all the metadata about the Hadoop file
system. There will be one name node per HDFS cluster.
- a cluster is a single logical unit consisting of multiple computers
that are linked over a LAN.
- name node generally knows all information about allocated and
replicated blocks in the cluster.
- a block is nothing but a small piece of a large file; large files are broken into blocks.
- name node also knows about free blocks that have to be allocated
next.
- it keeps track of newly added, modified, and removed data on the
data nodes.
- name node executes file system operations like opening, closing,
and renaming files and directories.
- the default block size is 64MB (128MB in newer versions).
- blocks are stored in data nodes.
- data nodes do not have any knowledge of the HDFS file system.
- a data node stores each block in a separate file.
- data blocks are replicated across multiple data nodes and accessed
and managed via the name node.
- since data blocks are replicated across several data nodes, failure
of one system does not affect the process, so you get fault
tolerance.
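The replication factor behind this fault tolerance is just a configuration value. A minimal sketch using the HDFS Java API (assuming Hadoop's client libraries are on the classpath; the path and the value 2 are only illustrative, and dfs.replication defaults to 3):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 3);   // cluster-wide default for newly created files
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of one existing file to 2 replicas.
        fs.setReplication(new Path("/user/hadoop/example.txt"), (short) 2);
        fs.close();
    }
}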
Block replication: [diagram - blocks replicated across multiple data nodes]

Data nodes

Main tasks of data node:

1. Stores blocks in HDFS.
2. Acts as a platform for running a job.
3. Performs block creation, deletion, and replication upon
instruction from the name node.
Initialisation:
- the data node handshakes with the name node.


- it checks for the correct namespace ID; if it matches, the data node connects to the name
node, otherwise the connection is refused.
- each data node maintains the current status of the blocks in its node
and generates a block report.
- every hour the data node sends the block report to the name node, so the
name node always has up-to-date information about the data nodes.
- the data node also sends a heartbeat signal to the name node at regular intervals (every few seconds by default).
- because of this, the name node knows which data nodes are functioning. If the
name node does not receive heartbeats from a data node for about 10 minutes, it
assumes the data node is lost and creates replicas of the blocks that were stored on it.
HDFS architecture

Writing a file into HDFS


The client asks the name node that it wants to write a file. The name node
tells it which blocks are free, so the client writes into the free blocks on the
data nodes. That is how writing a file to HDFS is performed.
Reading a file from HDFS
The client interacts with the name node to get the block locations. The name
node sends the addresses of the blocks that hold the data, along with the
addresses of the replicated blocks. After getting this information, the client
reads the data from the data nodes.
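Both operations can be seen through the HDFS Java FileSystem API. A minimal sketch (the path and contents are only illustrative; it assumes a running HDFS configured in core-site.xml/hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle backed by the name node
        Path file = new Path("/user/hadoop/example.txt");

        // Write: the name node picks free blocks, the client streams data to data nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Read: the name node returns block locations, the client reads from data nodes.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}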

Commands

mkdir: e.g. hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
copyFromLocal: hadoop fs -copyFromLocal /root/abc.txt /user/hadoop-1.0.3/test
copyToLocal: hadoop fs -copyToLocal
hdfs://localhost:54310/user/hadoop-inst/hadoop-1.0.3/test/training/pgm1 /home/geetha/pgm/test
Delete a file/dir from HDFS:
hadoop fs -rmr /user/hadoop-inst/hadoop-1.0.3/test
Safe mode leave commands
1. hadoop dfsadmin -safemode leave
2. hdfs dfsadmin -safemode leave
3. hadoop dfsadmin -refreshNodes

Hadoop primitive data types


Hadoop has its own set of primitive datatypes for representing its
variables for efficient serialization over the network.
Java primitive data type     Hadoop data type
int                          IntWritable
long                         LongWritable
float                        FloatWritable
byte                         ByteWritable
String                       Text
double                       DoubleWritable

Classes                      Hadoop class
String                       Text
byte[]                       BytesWritable
Object                       ObjectWritable
null                         NullWritable

Java collection              Writable implementation
Array                        ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable
Map                          MapWritable
SortedMap                    SortedMapWritable
Enum set                     EnumSetWritable
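A small sketch (using classes from org.apache.hadoop.io) of how these Writable wrappers are created and unwrapped; they exist so that keys and values can be serialized efficiently over the network:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        // Wrap plain Java values in their Hadoop Writable equivalents.
        IntWritable count = new IntWritable(42);
        Text word = new Text("hadoop");

        // get()/toString() unwrap them back into plain Java values.
        int n = count.get();
        String s = word.toString();

        // NullWritable is a singleton placeholder for "no value".
        NullWritable nothing = NullWritable.get();

        System.out.println(s + " -> " + n + ", " + nothing);
    }
}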

MapReduce
A programming model used by Google: a combination of the map and
reduce model with an associated implementation, used for processing
and generating large data sets.
Map
The map task in MapReduce is performed using the map() function; this
part of MapReduce is responsible for processing the input.
Reduce
The next component of MapReduce programming is the reduce() function.
This part is responsible for consolidating the results produced by each
of the map() functions.
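As a concrete example, a minimal word-count map() and reduce() written against the org.apache.hadoop.mapreduce API (the class names are only illustrative):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for each input line, emit (word, 1).
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer: for each word, add up all the 1s emitted by the mappers.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}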
How MapReduce works
input -> split -> map -> partition and combine -> shuffle and sort -> reduce -> output
This shows the logical flow of the MapReduce programming model.
Input-this is the input data/file to be processed.
Split-Hadoop splits the incoming data into smaller pieces called
splits.
Map-MapReduce processes each split according to the logic defined in
the map() function. Each mapper works on one split at a time. Each
mapper is treated as a task, and multiple tasks are executed across
different TaskTrackers (slave nodes) and coordinated by the
JobTracker (master node).
Partition and combine-this is an optional step used to
improve performance by reducing the amount of data transferred
across the network. The combiner applies reduce-like logic to the output
of the map() function before it is passed to the subsequent steps.
Shuffle and sort-in this step the output from all the mappers is
shuffled, sorted to put it in order, and grouped before being sent
to the next step.
Reduce-this step aggregates the output of the mappers using
the reduce() function. The output of the reducers is sent to the next and
final step. Each reducer is treated as a task and multiple tasks are
executed across different TaskTrackers and coordinated by the
JobTracker.
Output-finally, the output of the reduce step is written to a file in
HDFS.
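A sketch of a driver that wires these stages together end to end (it reuses the illustrative WordCountMapper/WordCountReducer classes from above; the input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setInputFormatClass(TextInputFormat.class);   // split + RecordReader (the default anyway)
        job.setMapperClass(WordCountMapper.class);         // map phase
        job.setCombinerClass(WordCountReducer.class);      // optional local "mini reduce"
        job.setPartitionerClass(HashPartitioner.class);    // default: hash of key -> reducer
        job.setReducerClass(WordCountReducer.class);       // reduce phase
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output written to HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}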

Hadoop MapReduce
-MapReduce works in a master-slave fashion. The JobTracker acts as the
master and the TaskTrackers act as slaves.
-MapReduce has two major phases-a map phase and a reduce phase. The map
phase processes parts of the input data using mappers based on the logic
defined in the map() function. The reduce phase aggregates the data
using reducers based on the logic defined in the reduce() function.
-depending upon the problem at hand, we can use one reduce task,
multiple reduce tasks, or no reduce task.
-MapReduce takes care of distributing the data across various nodes,
assigning the tasks to each of the nodes, getting the results back
from each node, re-running tasks in case of node failures,
consolidating the results, etc.
-MapReduce processes the data in the form of (key, value) pairs.
Hence we need to fit our business problem into a key-value arrangement.
YARN concept in MapReduce
MR1-classic
MR2-NextGen
MapReduce1 (classic) has three main core components.
API-for user level programming of MR applications.
Framework-runtime services for running map and reduce processes,
shuffling and sorting etc.
Resource management-infrastructure to monitor nodes, allocate
resources, and schedule jobs.
MapReduce2-(NextGen) moves resource management into YARN.
YARN-is a new framework created to manage resources.

RM-resource manager
NM-node manager
-it provides daemons and APIs.
-it handles and schedules requests from applications.
-it supervises execution of the requests.
-examples of resources are memory and CPU.
-in earlier versions, resource management was built into the
MapReduce component.
-MapReduce was handling both resource management and data
processing.
-now YARN has taken over resource management; MR2 is a distributed
application that runs the MapReduce framework on top of YARN.
Hadoop MapReduce architecture
Map-Reduce master: JobTracker
-accepts MR jobs submitted by users.
-assigns map and reduce tasks to TaskTrackers.
-monitors task and TaskTracker status, re-executes tasks upon
failure.
Map-Reduce slave: TaskTracker
-runs map and reduce tasks upon instruction from the JobTracker.
-manages storage and transmission of intermediate output.
Anatomy of MapReduce
MapReduce at high level

Suppose a client is submitting a MapReduce job to the Hadoop cluster:
first it interacts with the JobTracker (master node).
The JobTracker is in continuous contact with the TaskTrackers (slave
nodes).
TaskInstance-a task instance can be a mapper task or a reducer task; it
depends on our program (requirement).
MapReduce terminology
Job-a full program-an execution of mappers and reducers across a data
set.
Task-an execution of a mapper or a reducer on a slice of data (also
called a task-in-progress).
Task attempt-a particular instance of an attempt to execute a task
on a node.
Suppose that while processing the data a machine crashes, or a JVM
crashes, or there is some other issue; the framework reschedules the
task on some other node. How many task attempts are there? The default
number of task attempts is 4. It cannot be infinite: if your mapper fails
4 times, then your job is considered a failed job. We can also increase the
task attempts on a bigger cluster (for example from 4 to 6), as sketched below.
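A sketch of raising those limits from the driver (assuming the Hadoop 2.x property names mapreduce.map.maxattempts and mapreduce.reduce.maxattempts, whose default is 4):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class AttemptConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default is 4 attempts per task; allow a couple more on a large cluster.
        conf.setInt("mapreduce.map.maxattempts", 6);
        conf.setInt("mapreduce.reduce.maxattempts", 6);
        Job job = Job.getInstance(conf, "job with extra task attempts");
        // ... set mapper/reducer/paths as usual ...
    }
}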
MapReduce architecture

The master node runs a JobTracker instance, which accepts job requests from
clients.
TaskTracker instances run on the slave nodes.

Anatomy of MapReduce

1. In MapReduce, chunks are processed in isolation by tasks
called mappers.
2. The output from the mappers is denoted as intermediate output
and is brought into a second set of tasks called reducers.
3. The process of bringing the intermediate outputs together into a set
of reducers is known as the shuffle and sort process.

The reducers produce the final output. Overall, MapReduce breaks
the data flow into two phases: a map phase and a reduce phase.
Let us take a closer look at MapReduce. Suppose I have two slave
nodes.

The default input format that we use is the text input format (TextInputFormat).

A split is the logical representation of a block.
The data is divided into blocks by HDFS; that is the physical partition
(physical data). The logical representation of the same block is a
split.
By default, split = block.
An input split defines the basic unit of work that comprises a single
map task.
A split is given as input to the mapper; one mapper processes one
split at a time.
The mapper uses the input format internally.
The next component is the RecordReader, the main class that actually loads
or reads the data from the source and gives it as input to the mapper.
The RecordReader is the class that converts the data into (key, value) pairs,
because the mapper understands only the key, value format.
The RecordReader is defined by the input format. The default input format is
the text input format, and its RecordReader is the LineRecordReader.
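Concretely, with the default TextInputFormat the LineRecordReader hands each mapper the byte offset of a line as the key and the line itself as the value. A minimal sketch (the class name is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// With TextInputFormat, the key is the byte offset of the line in the split
// and the value is the line of text produced by the LineRecordReader.
public class LineEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(offset, line);   // simply pass the (offset, line) pair through
    }
}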
The mapper processes the data and writes its output to the local disk of the
node (not to HDFS), but there is another component called the partitioner.
A specific key will go to a specific reducer. The output of all mappers
goes as input to the partitioner.
The intermediate (k, v) pair output goes as input to the partitioner.
The responsibility of the partitioner class is to partition the data
into different partitions (parts). The partitioner class defines
which partition will go to which reducer. It computes a hash value of the
key and assigns the partition based on the result, as sketched below.
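A sketch of that hash-based logic written as a custom Partitioner for the word-count types used above (this mirrors what the built-in HashPartitioner does):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash the key and map it onto one of the numReduceTasks partitions,
// so every occurrence of a given key lands on the same reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}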
Shuffling is a costly process, but we cannot ignore it. This is
the place where a large volume of data travels across the network.
The volume can be high: if I have petabytes of data, then a huge
volume of data is travelling across the network.
We can compress the data, i.e., the output of the map/partition step;
the compressed output then travels across the network and is uncompressed
when shuffling is complete. There are further optimizations available
as well, such as the combiner.
The next component is sorting, which sorts all the data based on the
key; so sorting is based on the key. The MapReduce job output at each
reducer will therefore be in sorted key order.
After the sort, the data is given as input to the reducer. The reducer is the
second stage, the second place where I can write business logic. After the
reduce, the data is written out through the output format as the
final (k, v) pairs.
A combiner can be viewed as a mini reducer running in the map phase.
The partitioner determines which reducer is responsible for a particular
key.
If you do not code this logic yourself, the default partitioner is
used, i.e., the hash partitioner, which computes a hash value of the key
and distributes the data accordingly.
How to write combiner and partitioner logic: e.g., in word count the mapper
and reducer remain the same; only the partitioner and combiner are
customized, as in the sketch below.
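A sketch of plugging a custom combiner and partitioner into the word-count driver (WordCountMapper, WordCountReducer and WordPartitioner are the illustrative classes from the earlier sketches; the reducer can double as the combiner here because summing counts is associative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CustomizedWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count with custom combiner/partitioner");
        job.setMapperClass(WordCountMapper.class);      // unchanged
        job.setReducerClass(WordCountReducer.class);    // unchanged
        job.setCombinerClass(WordCountReducer.class);   // mini reduce on each mapper's output
        job.setPartitionerClass(WordPartitioner.class); // custom key -> reducer routing
        // ... input/output paths and key/value classes as in the earlier driver ...
    }
}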
