
By: Shrey

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a Map() procedure that performs filtering and sorting, and a Reduce() procedure that performs a summary operation.
Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
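
In the Hadoop Java API, these signatures correspond to the generic type parameters of the Mapper and Reducer base classes. A minimal sketch (the class names MyMapper and MyReducer are placeholders, not framework classes):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(k1, v1) -> list(k2, v2): each call to map() may emit zero or more (k2, v2) pairs.
class MyMapper<K1, V1, K2, V2> extends Mapper<K1, V1, K2, V2> {
    @Override
    protected void map(K1 key, V1 value, Context context)
            throws IOException, InterruptedException {
        // context.write(k2, v2);
    }
}

// Reduce(k2, list(v2)) -> list(v3): reduce() sees one key together with all of its values.
class MyReducer<K2, V2, K3, V3> extends Reducer<K2, V2, K3, V3> {
    @Override
    protected void reduce(K2 key, Iterable<V2> values, Context context)
            throws IOException, InterruptedException {
        // context.write(k3, v3);
    }
}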

Prepare the Map() input: the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor would work on, and provides that processor with all the input data associated with that key value.

Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.

"Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key value each processor should work on, and provides that processor with all the Map-generated data associated with that key value.

Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.

Produce the final output: the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final outcome.

[Diagram: YARN execution of a Map-Reduce application. The client submits the Map-Reduce application to the Resource Manager, which hands it to Node Managers (Node Manager 1, 2, 3); each Node Manager executes an individual Map-Reduce application container on a slave machine that also runs a DataNode holding HDFS blocks (Block 1 to Block 4).]

MapReduce data flow: Input Reader → Map Phase → Combiner Function → Partition Function → Reduce Phase → Output Writer.

InputSplit presents a byte-oriented view of the input.


It is the responsibility of RecordReader to process and present a record-oriented
view.
FileSplit is the default InputSplit.

InputFormat describes the input-specification for a Map-Reduce job.


Validate the input-specification of the job.

Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
Provide the RecordReader implementation to be used to glean input records from the logical
InputSplit for processing by the Mapper.
In some cases, the application has to implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task.

The RecordReader breaks the data into key/value pairs for input to the Mapper.
It provides the following methods to get the key and value from the input:
getCurrentKey()
Get the current key.
getCurrentValue()
Get the current value.
getProgress()
The fraction of the data that has been read, as a number between 0.0 and 1.0.
initialize(InputSplit split, TaskAttemptContext context)
Called once at initialization.

nextKeyValue()
Read the next key, value pair.
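
As a rough illustration (not framework code), a custom RecordReader can be written by extending the abstract RecordReader class and overriding exactly these methods. The sketch below delegates to the built-in LineRecordReader; the class name UpperCaseLineRecordReader and the upper-casing of each line are hypothetical, chosen only to show where each method fits:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class UpperCaseLineRecordReader extends RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate = new LineRecordReader();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        delegate.initialize(split, context);   // called once per InputSplit
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!delegate.nextKeyValue()) {
            return false;                      // no more records in this split
        }
        value.set(delegate.getCurrentValue().toString().toUpperCase());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();       // byte offset of the line in the file
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();         // fraction of the split consumed so far
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}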

The Mapper maps the input dataset, processed as <key, value> pairs, to a set of intermediate <key, value> records.
The structure and number of intermediate output records may be the same as or different from those of the input chunk.
These Mappers run on different chunks of data available on different data nodes and produce the output result for that chunk.

The number of maps to be executed by the MapReduce application depends on the total size of the input and the total number of blocks of the input file.
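
For example, the Mapper of the classic word count job emits an intermediate <word, 1> pair for every token in its chunk; a sketch following the standard Hadoop example (class name TokenizerMapper):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input record: <byte offset, line of text>; intermediate output: <word, 1>.
public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit <word, 1>
        }
    }
}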

The Combiner is a "mini-reduce" process which operates only on data generated by one
machine.
The Combiner receives the data emitted by all Mappers on a node.

Input and Output type of combiner must be same as Mapper output.


A Combiner can only be used for functions that are commutative (a.b = b.a) and associative (a.(b.c) = (a.b).c).

Example: Word Count
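
For word count, summing the per-word counts is both commutative and associative, so the word count Reducer (IntSumReducer, shown in a later sketch) can also be registered as the Combiner. A one-line fragment from the job driver:

job.setCombinerClass(IntSumReducer.class);   // pre-aggregates <word, 1> pairs on each node before the shuffle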

The Partitioner partitions the key space.


It controls the partitioning of keys of intermediate map-outputs.

The key (or a subset of it) is used to derive the partition, usually by a hash function.
The total number of partitions is the same as the number of reduce tasks.
HashPartitioner is the default Partitioner.
It controls which of the m reduce tasks each intermediate key (and hence the record) from the Mapper is sent to for reduction.
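
The default HashPartitioner simply hashes the key modulo the number of reduce tasks. A hypothetical custom Partitioner with the same logic (the class name WordPartitioner is an illustration, not a Hadoop class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each intermediate <word, count> key to one of numReduceTasks partitions.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // mask the sign bit so the result is non-negative, then take the modulo
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}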

The Reducer reduces the intermediate values produced by the Mappers.


It has 3 phases:
Shuffle: In this phase, the sorted output from the Mapper is collected over HTTP.

Sort: The inputs from different Mappers are merged and sorted so that records with the same key from different Mappers are grouped together.

Reduce: In this phase, reduce method is called for each <key, (list of values)> pair in the
grouped inputs.

The shuffle and sort phases occur simultaneously; as the Mapper output is collected, it is also sorted.
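
Continuing the word count example, the Reducer receives each <word, (list of counts)> pair from the grouped inputs and sums the counts; a sketch following the standard Hadoop example (class name IntSumReducer):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: <word, list of counts> from the shuffle and sort phases; output: <word, total count>.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);   // emit <word, sum>
    }
}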

The Reporter lets the MapReduce application report progress, update Counters, and set application-level status messages.

Mappers and Reducers use Reporters to report progress and to indicate that they are alive.

The OutputCollector is the facility through which the MapReduce framework collects the data output by the Mapper and the Reducer.
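
Reporter and OutputCollector belong to the older org.apache.hadoop.mapred API; in the newer org.apache.hadoop.mapreduce API the Context object passed to map() and reduce() plays both roles. A fragment that could sit inside the TokenizerMapper.map() method sketched earlier (the counter group and names are made up for illustration):

context.write(word, ONE);                                  // collect output (OutputCollector role)
context.getCounter("WordCount", "TOKENS").increment(1);    // update a counter (Reporter role)
context.setStatus("tokenizing offset " + key);             // application-level status message
context.progress();                                        // tell the framework the task is alive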

Example: a 163 MB file with a 64 MB HDFS block size is divided into three input splits: Split 0 (64 MB), Split 1 (64 MB), and Split 2 (35 MB).

setInputFormatClass

Define the Input Format for the job files.

setMapperClass

Set the custom Mapper class for the job.

setReducerClass

Set the custom Reducer class for the job.

setJarByClass

Set the main class for the job; the framework uses it to locate the job's jar.

setJobName

Set a name for the job.

setOutputKeyClass

Set the key class for the job output data.

setOutputValueClass

Set the value class for job outputs.

setPartitionerClass

Set the Partitioner for the job.

setCombinerClass

Set the combiner class for the job.
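
Putting these calls together, a driver for the word count job might look like the sketch below; it reuses the TokenizerMapper, IntSumReducer and WordPartitioner classes from the earlier sketches, and the input/output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJobName("word count");                     // setJobName
        job.setJarByClass(WordCountDriver.class);         // setJarByClass
        job.setInputFormatClass(TextInputFormat.class);   // setInputFormatClass
        job.setMapperClass(TokenizerMapper.class);        // setMapperClass
        job.setCombinerClass(IntSumReducer.class);        // setCombinerClass
        job.setReducerClass(IntSumReducer.class);         // setReducerClass
        job.setPartitionerClass(WordPartitioner.class);   // setPartitionerClass (HashPartitioner is the default)
        job.setOutputKeyClass(Text.class);                // setOutputKeyClass
        job.setOutputValueClass(IntWritable.class);       // setOutputValueClass
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}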

Facility provided by the Map-Reduce framework to cache files (text, archives, jars etc.) needed by
applications.
Distribute application-specific large, read-only files efficiently.

DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);


DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);
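
The fragment after the # (for example #lookup.dat above) makes the cached file available to each task as a symlink in its working directory (on older releases this may additionally require enabling symlinks with DistributedCache.createSymlink(conf)). A rough sketch of reading it in a Mapper's setup() method; the class name LookupMapper and the tab-separated file format are assumptions for illustration:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> lookup = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // lookup.dat was distributed via DistributedCache and symlinked into the task directory
        try (BufferedReader reader = new BufferedReader(new FileReader("lookup.dat"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);   // assumed format: key <TAB> value
                if (parts.length == 2) {
                    lookup.put(parts[0], parts[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String original = value.toString().trim();
        String replacement = lookup.getOrDefault(original, original);
        context.write(new Text(original), new Text(replacement));   // emit <input value, looked-up value>
    }
}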

Default compression codec: non-splittable

LZO compression: splittable

Snappy compression: non-splittable

Sequence files: splittable

Enable for a job by setting the following property:
conf.set("io.compression.codecs", "org.apache.hadoop.io.compress.GzipCodec");
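
As a related fragment (property names as used in Hadoop 2.x, and meant to sit in a driver like the one sketched earlier), intermediate map output and the final job output can also be compressed:

conf.setBoolean("mapreduce.map.output.compress", true);                                        // compress map output to cut shuffle traffic
conf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");

FileOutputFormat.setCompressOutput(job, true);                                                 // compress the final job output
FileOutputFormat.setOutputCompressorClass(job, org.apache.hadoop.io.compress.GzipCodec.class);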

Source: O'Reilly, Hadoop: The Definitive Guide
