
UNIT-II

MAPREDUCE

Richa
Assistant Professor
CSE Dept
Chandigarh University

University Institute of Engineering (UIE)


CONTENTS

Introduction to MapReduce
Architecture / Data Flow in MapReduce
Difference between Input Split and Block
Role of Record Reader



INTRODUCTION
•It is the processing component of Apache Hadoop.
•It may also be defined as the component used to process huge amounts of data.
•MapReduce processes data in parallel.
•The MapReduce framework consists of a single JobTracker and multiple TaskTrackers (one per DataNode).
•It is the heart of the Hadoop system.
•By using MapReduce, we can process the Big Data that is present on Hadoop, i.e., that is stored on HDFS.



NEED FOR MAPREDUCE?
• These days data is not stored in the traditional way.
• Data gets divided into chunks, which are stored on different DataNodes.
• The complete data is no longer stored in a single place, so a new framework was required that has the capability to process the huge amount of data stored on different DataNodes.
• We also need a framework that goes to the place where the data actually resides, processes the data right there, and returns the results.



MAPREDUCE CAN BE USED IN:
➢ Index and Search
➢ Classification
➢ Recommendation (e.g., product recommendations at Amazon, Flipkart)
➢ Analytics



MAPREDUCE FEATURES
➢ Programming model
➢ Parallel Processing
➢ Large Scale Distributed Model



MapReduce is used by Apache Hadoop
for processing data with the
following:
• HDFS
• Pig
• Hive
• HBase



MAPREDUCE FUNCTIONS
➢ MAP
It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples, i.e., key/value pairs.
➢ REDUCE
It takes the output of the Map function as input and combines those tuples into a smaller set of tuples.
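A minimal sketch of these two functions in the Hadoop Java API (the class names MyMapper and MyReducer are illustrative choices): the Map and Reduce functions are written by overriding map() and reduce() of the framework's Mapper and Reducer classes.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// MAP: takes one input record (key, value) and emits zero or more (key, value) tuples.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // break the input value into elements and emit each as a (key, value) tuple
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}

// REDUCE: takes the Map output, grouped by key, and combines it into a smaller set of tuples.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}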






MAPREDUCE FUNCTIONS
• Every input is processed by the respective Map().
• After that, the result is sent to the Reduce() for aggregation, and the final result is given as the answer.
• MapReduce tasks work on key/value pairs.



MAPREDUCE EXAMPLE



MAPREDUCE ARCHITECTURE /
DATA FLOW IN MAPREDUCE



Daemons of MapReduce
1. JobTracker
• manages how data is processed
• coordinates jobs in the cluster
• tracks the overall progress of each job
2. TaskTracker
• runs the tasks that do the actual processing of data
• sends progress reports to the JobTracker



MAPREDUCE
• It works on "divide and process": the data is divided into tasks, and the tasks run across the cluster.
• What is a JOB?
• A job is the basic unit of work; when you write a MapReduce program, it is converted into a Java object called a Job.
• When you run a MapReduce program, a JobSubmitter instance is created.
• The JobSubmitter asks the JobTracker for a new job ID by calling getNewJobId() on the JobTracker.
• The JobTracker returns a new job ID (e.g., job_1) that uniquely identifies the job on the cluster.



• The JobSubmitter then computes the input splits.
• The input splits form a list that contains references to the underlying HDFS blocks.
• By default, the size of one split is equal to the size of one block.
• After computing the input splits, all resources required to run the job are copied to the cluster.
• It copies the input splits, the job JAR file, and the configuration file.
• After copying these resources, the JobSubmitter tells the JobTracker that the job is ready for execution by calling submitJob() on the JobTracker.
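A hedged sketch of the driver code that triggers this path (Hadoop Java API; the class name WordCountDriver and the input/output paths are our own choices). Calling waitForCompletion() internally creates the JobSubmitter, which computes the splits, copies the resources, and submits the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");  // the Job object described above
        job.setJarByClass(WordCountDriver.class);       // the JAR file copied to the cluster
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input to be split
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path in HDFS
        // submits the job and then waits for its progress/completion
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}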



• After submitting the job, the JobSubmitter waits for the progress of the job.
• Next comes the Job Scheduler.
• JobScheduler: tasks are created here.
• It first reads the input splits, and for each split a map task is created.
• The number of reduce tasks is specified in the MapReduce program (see the snippet after this list).
• Map tasks are assigned to TaskTrackers such that each task runs as close as possible to its input split.
• e.g., if the first split of the file is on DataNode1, the first map task is assigned to the TaskTracker running on DataNode1.
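For example (a fragment continuing the hypothetical driver sketch above), the number of reduce tasks is chosen by the program, while the number of map tasks simply follows from the number of input splits:

// in the driver, on the Job object from the sketch above:
job.setNumReduceTasks(2);  // reduce task count is set by the program
// the map task count is not set here -- it equals the number of input splits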



MAPREDUCE COMPONENTS
INPUT FORMAT

RECORD READER

MAPPER

PARTITIONER

REDUCER

RECORD WRITER

OUTPUT FORMAT
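Each of these components corresponds to a class that the driver wires into the Job. A hedged sketch, continuing the hypothetical WordCountDriver above and assuming the standard Hadoop text-file classes (WordCountMapper and WordCountReducer are sketched later in this unit):

// TextInputFormat, TextOutputFormat and HashPartitioner live under org.apache.hadoop.mapreduce.lib.*
job.setInputFormatClass(TextInputFormat.class);    // INPUT FORMAT (supplies the RECORD READER)
job.setMapperClass(WordCountMapper.class);         // MAPPER
job.setPartitionerClass(HashPartitioner.class);    // PARTITIONER (this is the default anyway)
job.setReducerClass(WordCountReducer.class);       // REDUCER
job.setOutputFormatClass(TextOutputFormat.class);  // OUTPUT FORMAT (supplies the RECORD WRITER)
job.setOutputKeyClass(Text.class);                 // key/value types written by the RECORD WRITER
job.setOutputValueClass(IntWritable.class);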



ARCHITECTURE/ DATA FLOW USING
WORD COUNT EXAMPLE



WORD COUNT EXAMPLE
Input (key & value pairs):
What is Hadoop
What is HDFS
What is HBase
What is Hive

-> MAPREDUCE WORDCOUNT PROGRAM ->

Output (key & value pairs):
Hadoop, 1
HBase, 1
HDFS, 1
Hive, 1
is, 4
What, 4


WORD COUNT EXAMPLE

• The foremost thing in MapReduce is that everything is in the form of keys and values.
• The MapReduce framework provides several components, such as:
• Mapper
• Reducer
• Partitioner
• Combiner
• Sort and Shuffle
• Input splits



• First of all, the Mapper function takes keys and values as input parameters.
• The output of the Map function is also keys and values.
• The Reducer's input as well as output parameters are again keys and values.
• Hence, we can say that in the MapReduce framework data must be expressed as keys and values.
• Whenever a developer is solving a problem using MapReduce, the developer needs to decide what should be the key and what should be the value.



The diagram for the data flow in the word count example is in a separate PDF file.



DESCRIPTION

• To process a file/data using MapReduce, first of all one needs to specify the input file that has to be processed.
• If a directory is specified, all the files in the directory will be taken as input data.
• Also, there is a need to specify the input data format (InputFormat) in the MapReduce program.
• The JobSubmitter uses this InputFormat to compute the input splits of the file.
• All the code to split the file is available in the subclasses of InputFormat.



• An input split is just a reference to blocks of HDFS.
• We choose the word as the key and the count as the value.
• The input file is divided into 2 splits.
• 2 map tasks will be created, 1 map task for each split.
• All map tasks run in parallel on different DataNodes.
• Now we need to read the input split, and this is where the Record Reader comes in.
• The InputFormat defines the RecordReader used to read the data from the input split.



• The Record Reader reads the input split and passes the input data to the Mapper function.
• The Mapper function is defined in the MapReduce program.
• The Mapper function processes the input data and emits keys (words) and values (counts); see the sketch after this list.
• The Mapper emits a numeric value for each word it encounters.
• In our example the word "What" is encountered first, so the key is What and the value is 1.
• Similarly for the word "is", and so on.
• After that, the Map output is partitioned.
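A hedged sketch of such a Mapper for the word count example (Hadoop Java API; the class name WordCountMapper is our own choice):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // the Record Reader hands over one line at a time; emit (word, 1) for every word
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);  // e.g. ("What", 1), ("is", 1), ("Hadoop", 1)
        }
    }
}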



• The number of reducers configured in our MapReduce job is taken into account to create the partitions (sketched below).
• The number of partitions is always equal to the number of reducers.
• After the Map output is partitioned, each partition is sorted independently (always by key, never by value).
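A hedged sketch of how such a partition is chosen; this mirrors Hadoop's default HashPartitioner, and the class name WordPartitioner is our own:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Each (key, value) pair goes to partition number hash(key) mod numReduceTasks,
// which is why the number of partitions always equals the number of reducers.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}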



• The Combiner, if configured, acts as a local reducer: it merges the values of identical keys on the map side before the data is shuffled to the reducers.
• Partitioning and shuffling make sure that all the values of a particular key go to the same Reducer (sketched after this list).
• The Reducers compute the word counts and store the final result in HDFS.
• The output path is also configured in the MapReduce program.
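A hedged sketch of the corresponding Reducer (the class name WordCountReducer is our own choice). Because word counting only sums values, the same class could also be registered as the combiner with job.setCombinerClass(WordCountReducer.class):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // all values of one key arrive together after partitioning, shuffle and sort
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));  // e.g. ("is", 4), ("What", 4)
    }
}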



Phases of Map Reducer
• Input Phase − Here we have a Record Reader that translates each
record in an input file and sends the parsed data to the mapper in the
form of key-value pairs.
• Map − Map is a user-defined function, which takes a series of key-value
pairs and processes each one of them to generate zero or more key-value
pairs.
• Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.
• Combiner − A combiner is a type of local Reducer that groups similar
data from the map phase into identifiable sets. It takes the intermediate
keys from the mapper as input and applies a user-defined code to
aggregate the values in a small scope of one mapper. It is not a part of
the main MapReduce algorithm; it is optional.



Phases of Map Reducer
• Shuffle and Sort − The Reducer task starts with the Shuffle and Sort
step. It downloads the grouped key-value pairs onto the local machine,
where the Reducer is running. The individual key-value pairs are sorted
by key into a larger data list. The data list groups the equivalent keys
together so that their values can be iterated easily in the Reducer task.
• Reducer − The Reducer takes the grouped key-value paired data as
input and runs a Reducer function on each one of them. Here, the data
can be aggregated, filtered, and combined in a number of ways, and it
requires a wide range of processing. Once the execution is over, it gives
zero or more key-value pairs to the final step.
• Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.



Input Split Vs Block

InputSplit is not the same as the block.


• A block is a hard division of data at the block size. So if the block
size in the cluster is 128 MB, each block for the dataset will be 128
MB except for the last block which could be less than the block
size if the file size is not entirely divisible by the block size. So a
block is a hard cut at the block size and blocks can end even
before a logical record ends.
• Consider that the block size in your cluster is 128 MB and each logical record in your file is about 100 MB (yes, huge records).



Input Split Vs Block
• So the first record fits perfectly in block 1, since the record size of 100 MB is well within the block size of 128 MB. However, the 2nd record cannot fit in block 1, so record number 2 will start in block 1 and end in block 2.
• If you assign a mapper to block 1 in this case, the Mapper cannot process record 2 because block 1 does not contain the complete record 2. That is exactly the problem the InputSplit solves. In this case, InputSplit 1 will have both record 1 and record 2. InputSplit 2 does not start with record 2, since record 2 is already included in InputSplit 1, so InputSplit 2 will have only record 3. As you can see, record 3 is divided between blocks 2 and 3, but InputSplit 2 will still have the whole of record 3.
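The numbers above can be replayed with a small sketch (plain Java, not the Hadoop API; the byte arithmetic below simply checks which blocks each 100 MB record touches when blocks are 128 MB):

public class RecordVsBlock {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // 128 MB blocks
        long recordSize = 100L * 1024 * 1024;  // 100 MB logical records
        for (int r = 1; r <= 3; r++) {
            long start = (r - 1) * recordSize;        // first byte of record r
            long end = r * recordSize - 1;            // last byte of record r
            long startBlock = start / blockSize + 1;  // 1-based block numbers
            long endBlock = end / blockSize + 1;
            System.out.printf("Record %d spans blocks %d to %d%n", r, startBlock, endBlock);
        }
        // Output: record 1 fits in block 1, record 2 spans blocks 1-2,
        // record 3 spans blocks 2-3 -- matching the description above.
    }
}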



Input Split Vs Block

• Blocks are physical chunks of data stored on disks, whereas an InputSplit is not a physical chunk of data. It is a Java class with pointers to start and end locations in blocks (see the sketch below). So when the Mapper tries to read the data, it clearly knows where to start reading and where to stop reading. The start location of an InputSplit can be in one block and its end location in another block.
• InputSplits respect logical record boundaries, and that is why they become very important. During MapReduce execution, Hadoop scans through the blocks, creates InputSplits, and assigns each InputSplit to an individual mapper for processing.
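A minimal sketch of this idea using Hadoop's FileSplit (one concrete InputSplit class); the file name, offsets, and host below are illustrative. The object holds only a path, a start offset, a length, and the preferred hosts, never the data itself:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitIsJustAReference {
    public static void main(String[] args) {
        FileSplit split = new FileSplit(
                new Path("/data/input.txt"),   // hypothetical HDFS file
                0L,                            // start offset within the file
                128L * 1024 * 1024,            // length of the split (about one block)
                new String[] {"datanode1"});   // hosts holding the underlying blocks
        System.out.println(split.getPath() + " start=" + split.getStart()
                + " length=" + split.getLength());
    }
}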



Input Split Vs Block
• InputSplit is the logical representation of data whereas blocks are
the physical representation of data.
• The size of an input split is not fixed, but the size of a block is fixed.
• An InputSplit is a reference to the blocks, but the blocks store the actual data.



Role of Record Reader
• The RecordReader of the 1st map task opens the file using FSDataInputStream and then moves the read pointer to its start offset, i.e., 0.
• So it starts reading the file from offset 0.
• Similarly, the RecordReader of map task 2 also opens the file using FSDataInputStream, moves the read pointer to its respective start offset, and starts reading from that offset.
• The RecordReader reads a line and generates the key and value.
• It is the responsibility of the RecordReader to read the complete line even if part of it lies in the next split.



Role of Record Reader

• The RecordReader uses the FSDataInputStream (which has all the information about the blocks of the file) to read the data.
• The RecordReader of the 2nd map task checks its offset; if it is non-zero, it moves the start pointer forward until it encounters an end of line, so that it skips the partial line that has already been read by the previous RecordReader (sketched below).
• In this way, the RecordReaders of all the map tasks make sure that no record is skipped and all the records are read exactly once.
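A minimal sketch of this skipping logic, modelled on Hadoop's own line reading (the file name and split offsets are illustrative, and a real RecordReader is driven by the framework rather than by a main() method):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class SplitLineReading {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/data/input.txt");   // hypothetical input file
        long start = 128L * 1024 * 1024;           // start offset of the 2nd split
        long end = start + 128L * 1024 * 1024;     // end offset of the 2nd split

        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);      // stream that knows the file's blocks
        in.seek(start);                            // move the read pointer to the split start
        LineReader reader = new LineReader(in, conf);
        Text line = new Text();
        long pos = start;
        if (start != 0) {
            // non-zero offset: skip the partial line already handled by the previous RecordReader
            pos += reader.readLine(line);
        }
        // read whole lines; the last one may run past 'end' into the next split/block
        while (pos <= end) {
            int bytesRead = reader.readLine(line);
            if (bytesRead == 0) {
                break;                             // end of file
            }
            pos += bytesRead;
            // hand (offset, line) to the Mapper as the (key, value) pair
        }
        reader.close();
    }
}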



THANK YOU

