Class 3 - MapReduce Advanced

Topic 1
Designing MapReduce Implementations Part 1

AGENDA

What is Big Data?
Hadoop Distributed File System
MapReduce
Understanding Hadoop Ecosystem
Setting up a Hadoop Cluster
HDFS Hands On
MapReduce Hands On

MapReduce
Thinking MapReduce!!!
Are the input files independent of each other?
Can the problem be broken into smaller tasks?
Can the partial results of processing the smaller tasks be aggregated or consolidated?
Identify the individual entity on which processing happens.
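
The questions above can be tried on a toy problem. The sketch below is illustrative only: the word-count task, the `chunks` data, and the function names are assumptions, not part of the course material. Each chunk is processed independently, each yields a partial result, and the partial results are merged at the end. The individual entity being processed is a single line.

```python
from collections import Counter

def process_chunk(lines):
    """Independent small task: count words in one chunk of lines."""
    partial = Counter()
    for line in lines:  # the individual entity is one line
        partial.update(line.lower().split())
    return partial

def aggregate(partials):
    """Consolidate partial results; the merge order does not matter."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# Hypothetical input, already split into independent chunks.
chunks = [
    ["map reduce map", "reduce"],
    ["map shuffle"],
]

partials = [process_chunk(chunk) for chunk in chunks]
totals = aggregate(partials)
print(dict(totals))
```

If the chunks depended on each other, or the partial counts could not be added together, the problem would be a poor fit for this style of processing.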

MapReduce
MapReduce Design Patterns
A design pattern is a template for solving a common, general data manipulation problem with MapReduce.
Summarization Patterns
Filtering Patterns
Join Patterns
Job Chaining Patterns

MapReduce
MapReduce Design Patterns
Description of the Design Pattern
Examples of the Design Pattern
Structure of the Design Pattern
Detailed explanation of one example
MapReduce feasibility
High Level MapReduce design
Detailed design
Java MapReduce code

MapReduce
Summarization Pattern
Numerical Summarization
Description

Date or numerical data with attributes or dimensions
Perform aggregations: Sum, Minimum, Maximum, Average, Count, etc.
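
As a rough sketch of numerical summarization (not the course's Java code), the Python below groups hypothetical `(key, value)` records and computes the aggregations listed above. Carrying `(min, max, sum, count)` tuples instead of a running average is a design assumption made here so that partial summaries can be merged in any order.

```python
def summarize(value):
    """Summary of a single value: (min, max, sum, count)."""
    return (value, value, value, 1)

def merge(a, b):
    """Combine two partial summaries; associative and commutative."""
    return (min(a[0], b[0]), max(a[1], b[1]), a[2] + b[2], a[3] + b[3])

def numerical_summary(records):
    """Group records by key and fold each group's values into one summary."""
    groups = {}
    for key, value in records:
        s = summarize(value)
        groups[key] = merge(groups[key], s) if key in groups else s
    return {
        key: {"min": lo, "max": hi, "sum": total, "count": n, "avg": total / n}
        for key, (lo, hi, total, n) in groups.items()
    }

# Hypothetical records: (dimension, numeric value).
records = [("a", 4), ("b", 10), ("a", 6), ("b", 2), ("a", 8)]
result = numerical_summary(records)
print(result["a"])
```

An average cannot be rebuilt from other averages alone, which is why the sum and count are kept separate until the final step.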

MapReduce
Numerical Summarization Pattern
Examples

Web log analysis
- Summarize user logins by hour
- Plot a histogram
- Analyze usage patterns to see when the website is most active

Digital Marketing
- Summarize ads by type and time of day
- Plot a histogram
- Analyze ad effectiveness
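
To illustrate the web log example, here is a minimal sketch that summarizes logins by hour and prints a text histogram. The log line format (`timestamp user action`) and the sample lines are invented for illustration; a real web log would need its own parsing.

```python
from collections import Counter
from datetime import datetime

def login_hour(line):
    """Map step: return the hour of a login event, or None for other events."""
    timestamp, _user, action = line.split()
    if action != "login":
        return None
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S").hour

def logins_by_hour(lines):
    """Reduce step: count login events per hour of day."""
    hours = (login_hour(line) for line in lines)
    return Counter(h for h in hours if h is not None)

# Hypothetical log lines: ISO timestamp, user, action.
log = [
    "2012-01-10T09:05:00 alice login",
    "2012-01-10T09:40:12 bob login",
    "2012-01-10T09:55:30 alice logout",
    "2012-01-10T14:02:45 carol login",
]

counts = logins_by_hour(log)
for hour in sorted(counts):
    print(f"{hour:02d}:00 {'#' * counts[hour]}")
```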

MapReduce
Summarization Pattern General Structure
[Diagram: data flow of a summarization job, from HDFS input to HDFS output]
HDFS Input: Block 1, Block 2, Block 3, Block 4
InputFormat: I/P Split 1, I/P Split 2, I/P Split 3, I/P Split 4 (one per block)
RecordReader: <K1, V1>, <K2, V2>, <K4, V4>, <K3, V3>, <K5, V5>, <K6, V6>, <K7, V7>, <K8, V8>, <K5, V5>, <K6, V6>, <K9, V9>, <K1, V1>
Map(): Map Output <K1, Summary Fld>, <K2, Summary Fld>, <K4, Summary Fld>, <K3, Summary Fld>, <K6, Summary Fld>, <K7, Summary Fld>, <K8, Summary Fld>, <K6, Summary Fld>, <K9, Summary Fld>, <K1, Summary Fld>
Partitioner, grouped by key: <K1, (...)>, <K2, (...)>, <K3, (...)>, <K4, (...)>, <K6, (...)>, <K7, (...)>, <K8, (...)>, <K9, (...)>
Reduce(): Reduce Output <group 1, Summary>, <group 2, Summary>, <group 3, Summary>, <group 6, Summary>, <group 7, Summary>, <group 8, Summary>
OutputFormat / RecordWriter: HDFS Output part-r-00000, part-r-00001
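
The flow in the diagram can be imitated in a few lines of plain Python. This is only a single-process sketch under stated assumptions, not Hadoop code: `map_fn`, `partition`, `reduce_fn`, `run_job`, the two reducers, the CRC32-based partitioning, and the sample splits are all hypothetical. The output names follow Hadoop's `part-r-NNNNN` convention for reducer output files.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 2  # assumption: two reduce tasks, hence two part files

def map_fn(key, value):
    """Map: emit (group key, summary field) for one input record."""
    yield key, value

def partition(key, num_reducers):
    """Partitioner: deterministic hash, so a key always reaches the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_fn(key, values):
    """Reduce: collapse all summary fields of one group into one summary."""
    return key, sum(values)

def run_job(splits):
    # Shuffle buffers: one {key: [values]} dictionary per reducer.
    buffers = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for split in splits:              # one map task per input split
        for key, value in split:      # records as a RecordReader would supply them
            for out_key, out_value in map_fn(key, value):
                target = partition(out_key, NUM_REDUCERS)
                buffers[target][out_key].append(out_value)
    # Each reducer processes its keys in sorted order and writes one part file.
    output = {}
    for i, buffer in enumerate(buffers):
        output[f"part-r-{i:05d}"] = [reduce_fn(k, buffer[k]) for k in sorted(buffer)]
    return output

# Hypothetical input splits holding (key, value) records.
splits = [
    [("K1", 1), ("K2", 2)],
    [("K3", 3), ("K1", 4)],
    [("K2", 5), ("K4", 6)],
]

parts = run_job(splits)
for name, rows in parts.items():
    print(name, rows)
```

Because the partitioner depends only on the key, both `K1` records end up at the same reducer, which is what lets the reducer produce one complete summary per group.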

MapReduce
Example Dataset
stackexchange.com

https://archive.org/details/stackexchange
posts.xml
comments.xml
users.xml

MapReduce
Example Dataset
posts.xml
<row
FavoriteCount="4"
CommentCount="0"
AnswerCount="5"
Tags="<tells><online>"
Title="How to detect tells online?"
LastActivityDate="2012-07-02T19:20:08.690"
LastEditDate="2012-01-10T19:40:36.453"
LastEditorUserId="39"
OwnerUserId="36"
Body="<p>I know there ara some tips to detect tells while playing live, but how can you do it while playing poker online? What should I look for?</p> "
ViewCount="411"
Score="13"
CreationDate="2012-01-10T19:35:18.413"
AcceptedAnswerId="50"
PostTypeId="1"
Id="18"/>
Users:
- post questions on the site.
- post answers to the questions.
- up-vote or down-vote questions and answers.
- specify tags to categorize the post.
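
Before any summarization, each `<row .../>` line has to be turned into fields. A minimal sketch of that step is shown below, using Python's standard `xml.etree.ElementTree`. The `ROW` string is a shortened copy of the sample above, and `parse_post` and its output field names are hypothetical. The sketch assumes that markup inside attribute values is entity-escaped (`&lt;`, `&gt;`), as well-formed XML requires.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened copy of the sample row; markup inside attributes is escaped.
ROW = (
    '<row Id="18" PostTypeId="1" OwnerUserId="36" Score="13" '
    'ViewCount="411" AnswerCount="5" CommentCount="0" '
    'CreationDate="2012-01-10T19:35:18.413" '
    'Tags="&lt;tells&gt;&lt;online&gt;" '
    'Title="How to detect tells online?"/>'
)

def parse_post(line):
    """Turn one <row .../> line into a dictionary of typed fields."""
    attrs = ET.fromstring(line).attrib
    return {
        "id": int(attrs["Id"]),
        "owner": attrs.get("OwnerUserId"),  # treated as optional in this sketch
        "score": int(attrs["Score"]),
        "created": datetime.strptime(attrs["CreationDate"], "%Y-%m-%dT%H:%M:%S.%f"),
        # "<tells><online>" becomes ["tells", "online"]
        "tags": attrs.get("Tags", "").strip("<>").split("><"),
    }

post = parse_post(ROW)
print(post["id"], post["score"], post["created"].hour, post["tags"])
```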

MapReduce
Example Dataset
posts.xml - Metadata
Attribute            Description
FavoriteCount
CommentCount
AnswerCount
Tags
Title
LastActivityDate

MapReduce
Example Dataset
comments.xml
<row
UserId="35"
CreationDate="2012-01-10T19:33:35.373"
Text="What were the suits of the cards? It matters in particular because each player had a hole card that might play, and they can't both be of the same suit."
Score="1"
PostId="6"
Id="6"/>
Comments are follow-up questions or suggestions on posts (questions or
answers).

MapReduce
Example Dataset
comments.xml - Metadata
Attribute            Description

MapReduce
Example Dataset
users.xml
<row
AccountId="407388"
Age="26"
DownVotes="0"
UpVotes="1"
Views="1"
AboutMe="<p>Symfony 2 developer</p> <p>Fancy front-end HTML5/JS enthusiast</p> <p>Amateur Poker Player</p> "
Location="Slovakia"
LastAccessDate="2012-01-24T21:41:56.360"
DisplayName="Teo.sk"
CreationDate="2012-01-10T19:14:15.500"
Reputation="101"
Id="15"
WebsiteUrl="http://teo.sk"/>
Account details provided on the site.

MapReduce
Example Dataset
users.xml - Metadata
Attribute            Description

RECAP
How to approach a MapReduce problem?
Numerical Summarization Pattern Part 1
