1. Prepare the Map() input: the "MapReduce system" designates Map processors, assigns the input key value K1 that each processor will work on, and provides that processor with all the input data associated with that key value.
2. Run the user-provided Map() code: Map() is run exactly once for each K1 key value, generating output organized by key values K2.
3. "Shuffle" the Map output to the Reduce processors: the MapReduce system designates Reduce processors, assigns the K2 key value each processor will work on, and provides that processor with all the Map-generated data associated with that key value.
4. Run the user-provided Reduce() code: Reduce() is run exactly once for each K2 key value produced by the Map step.
5. Produce the final output: the MapReduce system collects all the Reduce output and sorts it by K2 to produce the final result.
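The five steps above can be sketched end-to-end in plain Python, with word count as the running example (this is a simulation of the flow, not the Hadoop API; K1 is a document name, K2 is a word):

```python
from collections import defaultdict

def map_fn(k1, text):
    # Step 2: Map is run once per K1, emitting intermediate (K2, value) pairs.
    return [(word, 1) for word in text.split()]

def reduce_fn(k2, values):
    # Step 4: Reduce is run once per K2, over all values shuffled to it.
    return (k2, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    # Step 1: each (K1, data) input is handed to one Map call.
    intermediate = []
    for k1, data in inputs.items():
        intermediate.extend(map_fn(k1, data))
    # Step 3: "shuffle" - group all Map output by K2.
    groups = defaultdict(list)
    for k2, v in intermediate:
        groups[k2].append(v)
    # Steps 4-5: reduce each group, then sort the output by K2.
    return sorted(reduce_fn(k2, vs) for k2, vs in groups.items())
```

For example, `map_reduce({"doc1": "to be or", "doc2": "to be"}, map_fn, reduce_fn)` returns `[("be", 2), ("or", 1), ("to", 2)]`.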
[Diagram: a Map-Reduce application arrives from the client at the Resource Manager, which launches individual Map-Reduce application containers on Node Manager 1, Node Manager 2, and Node Manager 3; each slave machine (m/c) runs a Node Manager alongside a Data Node holding HDFS Blocks 1-4.]
[Diagram: the MapReduce pipeline - Input Reader → Map Phase → Combiner Function → Partition Function → Reduce Phase → Output Writer.]
Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual
Mapper.
Provide the RecordReader implementation to be used to extract input records from the logical
InputSplit for processing by the Mapper.
In some cases, the application has to implement a RecordReader, which is responsible for
respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the
individual task.
The RecordReader breaks the data into key/value pairs for input to the Mapper.
It provides the following methods to get the key and value from the input:
getCurrentKey() - returns the current key.
getCurrentValue() - returns the current value.
getProgress() - returns a number between 0.0 and 1.0 giving the fraction of the data read.
initialize(InputSplit split, TaskAttemptContext context) - called once at initialization.
nextKeyValue() - reads the next key/value pair.
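A minimal sketch of such a record reader in plain Python, loosely modelled on Hadoop's line-oriented reader (the class and the in-memory "split" are illustrative, not the Hadoop API): the key is the byte offset of a line within the split, the value is the line itself.

```python
class LineRecordReader:
    def __init__(self, data: str):
        self.data = data    # contents of the logical InputSplit
        self.pos = 0        # current offset within the split
        self.key = None     # offset of the current line
        self.value = None   # text of the current line

    def next_key_value(self) -> bool:
        """Advance to the next record; return False at end of split."""
        if self.pos >= len(self.data):
            return False
        end = self.data.find("\n", self.pos)
        if end == -1:
            end = len(self.data)
        self.key, self.value = self.pos, self.data[self.pos:end]
        self.pos = end + 1
        return True

    def get_current_key(self):
        return self.key

    def get_current_value(self):
        return self.value

    def get_progress(self) -> float:
        # Fraction of the split consumed, clamped to 1.0.
        return min(self.pos / len(self.data), 1.0) if self.data else 1.0
```

Iterating a reader over `"foo\nbar"` yields the pairs `(0, "foo")` and `(4, "bar")`, respecting record (line) boundaries.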
It maps the input dataset, processed as <key, value> pairs, to a set of intermediate <key, value>
records.
The structure and number of the intermediate output records may be the same as or different from
those of the input chunk.
These Mappers run on different chunks of data available on different data nodes, each producing
the output for its own chunk.
The number of map tasks executed by the map-reduce application depends on the total size of the
input and the total number of blocks of the input file.
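A sketch of a map function for word count in plain Python (not the Hadoop Mapper API): each input <offset, line> pair is mapped to intermediate <word, 1> pairs, so the number of output records differs from the number of input records.

```python
def word_count_map(key: int, value: str):
    """Emit an intermediate (word, 1) pair for every word in the input line."""
    for word in value.split():
        yield (word.lower(), 1)
```

For the line `"to be or not to be"` this emits six intermediate pairs, including duplicates such as `("to", 1)` twice; duplicates are resolved later by the Combiner and Reducer.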
The Combiner is a "mini-reduce" process which operates only on data generated by one
machine.
The Combiner receives the data emitted by all Mappers on a node.
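A sketch of that local aggregation in plain Python (not the Hadoop API): the combiner sums the counts emitted by the Mappers on one node, so only one <word, local_count> pair per word crosses the network to the Reducers.

```python
from collections import defaultdict

def combine(pairs):
    """Locally aggregate (word, count) pairs from one node's Mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())
```

For example, `combine([("be", 1), ("to", 1), ("be", 1)])` returns `[("be", 2), ("to", 1)]`.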
The key, or a subset of it, is used to derive the partition, usually by a hash function.
The total number of partitions is the same as the number of reduce tasks.
HashPartitioner is the default Partitioner.
It controls to which of the m Reduce tasks each intermediate key from the Mapper is sent.
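A plain-Python sketch of hash partitioning (modelled on, but not identical to, Hadoop's HashPartitioner, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`): the returned partition number decides which Reduce task receives the key.

```python
def get_partition(key: str, num_reduce_tasks: int) -> int:
    """Map a key to one of num_reduce_tasks partitions via its hash."""
    # Mask to a non-negative value before taking the modulus, as Hadoop does.
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks
```

Note that Python randomizes string hashing per process, so the concrete partition number varies between runs; the guarantees that matter are that the result always lies in `[0, num_reduce_tasks)` and that equal keys always land in the same partition within a job.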
Sort: the inputs from the different Mappers are sorted and grouped by key.
Reduce: in this phase, the reduce method is called once for each <key, (list of values)> pair in the
grouped input.
The shuffle and sort phases occur simultaneously: as the Mapper output is collected, it is also
sorted.
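The sort-then-reduce step can be sketched in plain Python (a simulation, not the Hadoop API): intermediate pairs are sorted by key, grouped, and a reduce function is called once per <key, (list of values)> group.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_sort_reduce(pairs, reduce_fn):
    """Sort intermediate pairs by key, group them, reduce each group."""
    ordered = sorted(pairs, key=itemgetter(0))           # sort by key
    return [reduce_fn(key, [v for _, v in group])        # one call per key
            for key, group in groupby(ordered, key=itemgetter(0))]

def sum_reduce(key, values):
    return (key, sum(values))
```

For example, `shuffle_sort_reduce([("to", 1), ("be", 1), ("to", 1)], sum_reduce)` returns `[("be", 1), ("to", 2)]`.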
The Reporter allows the MapReduce application to report progress, update counters, and set
application-level status messages.
Mappers and Reducers use Reporters to report progress and to indicate that they are alive.
The OutputCollector allows the MapReduce framework to collect the data output by the Mapper and
the Reducer.
[Diagram: a 163 MB input file divided into three InputSplits - Split 0 (64 MB), Split 1 (64 MB), and Split 2 (35 MB).]
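The split sizes in the example above follow directly from the file size and the block size (64 MB assumed, as in the diagram); a small sketch:

```python
import math

def compute_splits(file_size_mb: int, block_size_mb: int = 64):
    """Return the sizes (in MB) of the InputSplits for a file."""
    num_splits = math.ceil(file_size_mb / block_size_mb)
    # Every split is a full block except possibly the last one.
    return [min(block_size_mb, file_size_mb - i * block_size_mb)
            for i in range(num_splits)]
```

`compute_splits(163)` returns `[64, 64, 35]`, matching the three splits in the diagram; the number of splits also bounds the number of map tasks for the job.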
The MapReduce Job is configured through methods such as:
setInputFormatClass()
setMapperClass()
setReducerClass()
setJarByClass()
setJobName()
setOutputKeyClass()
setOutputValueClass()
setPartitionerClass()
setCombinerClass()
A facility provided by the Map-Reduce framework to cache files (text, archives, jars, etc.) needed by
applications.
It distributes application-specific, large, read-only files efficiently.