
By: Shrey

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure that performs filtering and sorting, and a Reduce() procedure that performs a summary operation.
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
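
In the Hadoop Java API, these signatures correspond to the generic type parameters of the Mapper and Reducer base classes. A minimal sketch (the class names MyMapper and MyReducer are placeholders, not framework classes):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(k1, v1) -> list(k2, v2): each call to map() may emit zero or more (k2, v2) pairs.
class MyMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
    @Override
    protected void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // context.write(k2, v2);
    }
}

// Reduce(k2, list(v2)) -> list(v3): reduce() sees one key together with all of its values.
class MyReducer<K2, V2, K3, V3> extends Reducer<K2, V2, K3, V3> {
    @Override
    protected void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // context.write(k3, v3);
    }
}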

Prepare the Map() input: the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.

Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.

"Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.

Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.

Produce the final output: the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.

[Diagram: YARN execution of a Map-Reduce application. The client submits the Map-Reduce application to the Resource Manager, which hands it to Node Managers (Node Manager 1, 2, 3); each Node Manager executes an individual Map-Reduce application container on a slave machine that also runs a DataNode holding HDFS blocks (Block 1 to Block 4).]

MapReduce data flow: Input Reader → Map Phase → Combiner Function → Partition Function → Reduce Phase → Output Writer.

InputSplit presents a byte-oriented view of the input.


It is the responsibility of RecordReader to process and present a record-oriented
view.
FileSplit is the default InputSplit.

InputFormat describes the input-specification for a Map-Reduce job.


Validate the input-specification of the job.

Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical
InputSplit for processing by the Mapper.
In some cases, the application has to implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.

The RecordReader breaks the data into key/value pairs for input to the Mapper.
It provides the following methods to get the key and value from the input:
getCurrentKey()
Get the current key.
getCurrentValue()
Get the current value.
getProgress()
The fraction of the data that has been read, as a number between 0.0 and 1.0.
initialize(InputSplit split, TaskAttemptContext context)
Called once at initialization.

nextKeyValue()
Read the next key, value pair.
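
As a rough illustration (not framework code), a custom RecordReader can be written by extending the abstract RecordReader class and overriding exactly these methods. The sketch below delegates to the built-in LineRecordReader; the class name UpperCaseLineRecordReader and the upper-casing of each line are hypothetical, chosen only to show where each method fits:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class UpperCaseLineRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);   // called once per InputSplit
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!delegate.nextKeyValue()) {
            return false;                      // no more records in this split
        }
        value.set(delegate.getCurrentValue().toString().toUpperCase());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();       // byte offset of the line in the file
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();         // fraction of the split consumed so far
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}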

The Mapper maps the input dataset, processed as <key, value> pairs, to a set of intermediate <key, value> records.
The structure and number of intermediate output records may be the same as or different from those of the input chunk.
These Mappers run on different chunks of data available on different data nodes and produce the output result for that chunk.

The number of maps to be executed by the MapReduce application depends on the total size of the input and the total number of blocks of the input file.
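
For example, the Mapper of the classic word count job emits an intermediate <word, 1> pair for every token in its chunk; a sketch following the standard Hadoop example (class name TokenizerMapper):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input record: <byte offset, line of text>; intermediate output: <word, 1>.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit <word, 1>
        }
    }
}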

The Combiner is a "mini-reduce" process which operates only on data generated by one
machine.
The Combiner receives the data emitted by all Mappers on a node.

Input and Output type of combiner must be same as Mapper output.


A Combiner can only be used for functions that are commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c).

Example: Word Count
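
For word count, summing the per-word counts is both commutative and associative, so the word count Reducer (IntSumReducer, shown in a later sketch) can also be registered as the Combiner. A one-line fragment from the job driver:

job.setCombinerClass(IntSumReducer.class);   // pre-aggregates <word, 1> pairs on each node before the shuffle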

The Partitioner partitions the key space.


It controls the partitioning of keys of intermediate map-outputs.

The key (or a subset of it) is used to derive the partition, usually by a hash function.
The total number of partitions is the same as the number of reduce tasks.
HashPartitioner is the default Partitioner.
It controls which of the m reduce tasks each intermediate key (and hence the record) from the Mapper is sent to for reduction.
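
The default HashPartitioner simply hashes the key modulo the number of reduce tasks. A hypothetical custom Partitioner with the same logic (the class name WordPartitioner is an illustration, not a Hadoop class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each intermediate <word, count> key to one of numReduceTasks partitions.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask the sign bit so the result is non-negative, then take the modulo
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}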

The Reducer reduces the intermediate values produced by the Mappers.


It has 3 phases:
Shuffle: In this phase, the sorted output from the Mapper is collected over HTTP.

Sort: The inputs from different Mappers are merged and sorted so that records with the same key from different Mappers are grouped together.

Reduce: In this phase, reduce method is called for each <key, (list of values)> pair in the
grouped inputs.

The shuffle and sort phases occur simultaneously; as the Mapper output is collected, it is also sorted.
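
Continuing the word count example, the Reducer receives each <word, (list of counts)> pair from the grouped inputs and sums the counts; a sketch following the standard Hadoop example (class name IntSumReducer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: <word, list of counts> from the shuffle and sort phases; output: <word, total count>.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit <word, sum>
    }
}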

The Reporter lets the MapReduce application report progress, update Counters, and set application-level status messages.

Mappers and Reducers use Reporters to report progress and to indicate that they are alive.

The OutputCollector is the facility through which the MapReduce framework collects the data output by the Mapper and the Reducer.
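
Reporter and OutputCollector belong to the older org.apache.hadoop.mapred API; in the newer org.apache.hadoop.mapreduce API the Context object passed to map() and reduce() plays both roles. A fragment that could sit inside the TokenizerMapper.map() method sketched earlier (the counter group and names are made up for illustration):

context.write(word, ONE);                                  // collect output (OutputCollector role)
context.getCounter("WordCount", "TOKENS").increment(1);    // update a counter (Reporter role)
context.setStatus("tokenizing offset " + key);             // application-level status message
context.progress();                                        // tell the framework the task is alive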

Example: a 163 MB file with a 64 MB HDFS block size is divided into three input splits: Split 0 (64 MB), Split 1 (64 MB), and Split 2 (35 MB).

setInputFormatClass

Define the Input Format for the job files.

setMapperClass

Set the custom Mapper class for the job.

setReducerClass

Set the custom Reducer class for the job.

setJarByClass

Set the main class for the job; the framework uses it to locate the job's jar.

setJobName

Set a name for the job.

setOutputKeyClass

Set the key class for the job output data.

setOutputValueClass

Set the value class for job outputs.

setPartitionerClass

Set the Partitioner for the job.

setCombinerClass

Set the combiner class for the job.
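
Putting these calls together, a driver for the word count job might look like the sketch below; it reuses the TokenizerMapper, IntSumReducer and WordPartitioner classes from the earlier sketches, and the input/output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJobName("word count");                     // setJobName
        job.setJarByClass(WordCountDriver.class);         // setJarByClass
        job.setInputFormatClass(TextInputFormat.class);   // setInputFormatClass
        job.setMapperClass(TokenizerMapper.class);        // setMapperClass
        job.setCombinerClass(IntSumReducer.class);        // setCombinerClass
        job.setReducerClass(IntSumReducer.class);         // setReducerClass
        job.setPartitionerClass(WordPartitioner.class);   // setPartitionerClass (HashPartitioner is the default)
        job.setOutputKeyClass(Text.class);                // setOutputKeyClass
        job.setOutputValueClass(IntWritable.class);       // setOutputValueClass
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}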

Facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by
applications.
Distribute application-specific large, read-only files efficiently.

DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);


DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
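
The fragment after the # (for example #lookup.dat above) makes the cached file available to each task as a symlink in its working directory (on older releases this may additionally require enabling symlinks with DistributedCache.createSymlink(conf)). A rough sketch of reading it in a Mapper's setup() method; the class name LookupMapper and the tab-separated file format are assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // lookup.dat was distributed via DistributedCache and symlinked into the task directory
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.dat"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumed format: key <TAB> value
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String original = value.toString().trim();
        String replacement = lookup.getOrDefault(original, original);
        context.write(new Text(original), new Text(replacement));   // emit <input value, looked-up value>
    }
}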

Default compression codec: non-splittable

LZO compression: splittable

Snappy compression: non-splittable

Sequence files: splittable

Enable for a job by setting the following property:
conf.set("io.compression.codecs", "org.apache.hadoop.io.compress.GzipCodec");
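
As a related fragment (property names as used in Hadoop 2.x, and meant to sit in a driver like the one sketched earlier), intermediate map output and the final job output can also be compressed:

conf.setBoolean("mapreduce.map.output.compress", true);                                        // compress map output to cut shuffle traffic
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");

FileOutputFormat.setCompressOutput(job, true);                                                 // compress the final job output
FileOutputFormat.setOutputCompressorClass(job, org.apache.hadoop.io.compress.GzipCodec.class);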

Source: O'Reilly, Hadoop: The Definitive Guide
