

Session 3 - MapReduce

- Introduction
- Dealing with the MapReduce workflow
- Combiner, Partitioner
- Introduction to MapReduce programming
- Use case explanation


Introduction to MapReduce

- Mapper maps input key/value pairs to a set of intermediate key/value pairs.
- Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records, and a given input pair may map to zero or many output pairs.
- The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
- The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
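As a concrete illustration of the Mapper contract just described, here is a minimal word-count Mapper sketch in the style of the classic Hadoop example (the class and field names are illustrative): it receives (byte offset, line of text) pairs from TextInputFormat and emits one (word, 1) pair per token, so one input record may yield zero or many output records.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps (byte offset, line) -> (word, 1) for every token in the line.
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE); // one input line may emit many pairs
    }
  }
}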
- Reducer reduces a set of intermediate values which share a key to a smaller set of values.
- The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
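The matching Reducer sketch: after shuffle and sort it receives each word together with the list of 1s emitted for it and writes out the total count (again a minimal illustration, not the deck's own code).

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduces (word, [1, 1, ...]) -> (word, total count).
public class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable result = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get(); // combine all intermediate values sharing this key
    }
    result.set(sum);
    context.write(key, result);
  }
}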
- Shuffle phase: the input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers via HTTP.
- Sort phase: the framework groups Reducer inputs by key in this stage (since different mappers may have output the same key).
- The shuffle and sort phases occur simultaneously: while map outputs are being fetched, they are merged.
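To tie the phases together, here is a minimal driver sketch using the two classes above. The job name "word count" and the two reduce tasks are arbitrary illustrative choices; the map count is derived by the framework from the InputSplits, as described earlier.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Reduce count is set explicitly; map count comes from the InputSplits.
    job.setNumReduceTasks(2);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}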


MapReduce Workflow

MapReduce with Partitioner


MapReduce with Custom Partitioner
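The partitioner decides which reduce task receives each intermediate key; by default Hadoop's HashPartitioner assigns key.hashCode() modulo the number of reduce tasks. As a hypothetical custom partitioner for the word-count job (the class name AlphabetPartitioner and the a-m split are illustrative assumptions, not from the original slide), the sketch below routes words starting with a-m to reducer 0 and everything else to reducer 1.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words beginning with a-m go to reducer 0,
// all other keys go to reducer 1 (assumes the job runs 2 reduce tasks).
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString();
    if (numPartitions < 2 || word.isEmpty()) {
      return 0; // single reducer (or empty key): everything lands in partition 0
    }
    char first = Character.toLowerCase(word.charAt(0));
    return (first >= 'a' && first <= 'm') ? 0 : 1;
  }
}

It would be registered in the driver with job.setPartitionerClass(AlphabetPartitioner.class) alongside job.setNumReduceTasks(2).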


MapReduce with Combiner
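A combiner pre-aggregates map output on each map node before the shuffle, shrinking the data sent to the reducers. For word count, the reducer itself can double as the combiner because partial sums can safely be summed again (the operation is associative and commutative). A minimal sketch, reusing the classes from the earlier examples:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
  // Configures the word-count job with a combiner. IntSumReducer is
  // reused as the combiner since integer addition is associative and
  // commutative, so per-node partial sums remain correct.
  static void configure(Job job) {
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // runs locally on map output
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}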


MapReduce Workflow

Map(k1, v1) → list(k2, v2)
Reduce(k2, list(v2)) → list(v3)
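Tracing word count through these signatures for the single input line "hello world hello":

Map(0, "hello world hello") → [(hello, 1), (world, 1), (hello, 1)]
shuffle/sort → (hello, [1, 1]), (world, [1])
Reduce → [(hello, 2), (world, 1)]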


5 Daemons

Master nodes:
- NameNode (central organizer): holds the metadata (information about the data).
- Secondary NameNode: keeps a checkpoint of the NameNode metadata, serving as a backup of that metadata (not a standby NameNode).
- JobTracker: initiates the job for your MapReduce program.

Slave nodes:
- TaskTracker: works on the data to produce the processing results.
- DataNode: where the data itself is stored.


Anatomy of an HDFS Write


Anatomy of an HDFS Read


Rack Awareness


NameNode and DataNode Communication


JobTracker & TaskTracker


Re-Replicating Missing Replicas


Secondary NameNode


Data Processing in Map & Reduce


Cluster Balancing & Unbalancing


Thank You

Queries: support@3clouds.com
Images via: SNIA, Google.
