
A Hadoop MapReduce Performance Prediction Method

Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China

Background
Hadoop MapReduce
[Figure: MapReduce dataflow — input data is split and processed by Map tasks, which emit (key, value) pairs into partitions (Partition 1, Partition 2); Reduce tasks pull the partitions and write their results to HDFS]

Background
Hadoop
There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.
[Figure: steps of a Map task — READ → MAP → SORT → MERGE → OUTPUT]

Motivation
Problems
Scheduling: two CPU-intensive tasks may end up scheduled together, because the scheduler takes no account of execution time or of the different types of resources consumed.

Hadoop parameter tuning: Hadoop exposes numerous configuration parameters, and the default values are not optimal for every job.

[Figure: jobs submitted to Hadoop running under the default configuration]

Motivation
Solution
Scheduling: no consideration of execution time or of the different types of resources consumed.
Hadoop parameter tuning: numerous parameters, and the default values are not optimal.
→ Solution: predict the performance of Hadoop jobs.

Related Work
Existing Prediction Method 1 - Black Box Based
Approach: feed job features into statistical or machine-learning models that output an execution time.

Job features → statistic/learning models → execution time

Drawbacks: the right model is hard to choose, and the approach involves no analysis of Hadoop itself.
6

Related Work
Existing Prediction Method 2 - Cost Model Based
Approach: decompose the Map and Reduce phases into steps and model the cost of each:

F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)

Job features → cost model → execution time

Drawbacks: Hadoop runs many concurrent processes, so the stages are hard to divide cleanly and accuracy is difficult to ensure.
7

Related Work
A Brief Summary about Existing Prediction Method
Black Box
- Advantages: simple and effective; high accuracy; high isomorphism.
- Shortcomings: lack of job feature extraction; lack of analysis of Hadoop internals; hard to divide the work by step and by resource.

Cost Model
- Advantages: detailed analysis of Hadoop processing; flexible division (by stage, by resource); multiple predictions.
- Shortcomings: lack of job feature extraction; many concurrent processes that are hard to model; better suited to theoretical analysis than to prediction.

Common shortcomings: prediction stays simple, and neither analyzes the job itself (jar package + data).
8

Goal
Design a Hadoop MapReduce performance prediction system to:
- predict a job's consumption of various types of resources (CPU, disk I/O, network)
- predict the execution time of the Map phase and the Reduce phase

Job → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }
9

Prediction System

Design - 1
Cost Model

Job → Cost Model → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }
10

Cost Model [1]


Analysis of the Map phase: model the consumption of each resource (CPU, disk, network); each stage involves only one type of resource.

[Figure: Map-task timeline split into CPU, disk, and network rows — initiation, read data, network transfer, create object, map function, sort in memory, merge sort, serialization, read/write disk, write disk]

11
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.

Cost Model [1]


Cost Function Parameters Analysis

Type One — constants: Hadoop system consumption, initialization consumption.
Type Two — job-related parameters: Map function computational complexity, number of Map input records.
Type Three — parameters defined by the cost model: sorting coefficient, complexity factor.

12
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.

Parameters Collection
Type One and Type Three
Type One: run empty map tasks and compute the system overhead from the logs.
Type Three: extract the sort code from the Hadoop source and time it sorting a known number of records.
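For illustration, a minimal sketch of the Type Three measurement, with java.util.Arrays.sort standing in for the sort code extracted from Hadoop (the class name and record counts are ours, not the authors'):

```java
import java.util.Arrays;
import java.util.Random;

// Estimate a sorting coefficient c such that sort time ≈ c · N·log2(N).
public class SortCoefficient {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int n : new int[]{100_000, 1_000_000, 10_000_000}) {
            long[] records = rnd.longs(n).toArray();   // synthetic records
            long start = System.nanoTime();
            Arrays.sort(records);                      // stand-in for Hadoop's sort
            long elapsed = System.nanoTime() - start;
            double c = elapsed / (n * (Math.log(n) / Math.log(2)));
            System.out.printf("N=%,d  time=%.1f ms  c=%.2f ns per N·log2(N)%n",
                    n, elapsed / 1e6, c);
        }
    }
}
```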

Type Two
Option 1: run the job once and analyze its logs → high latency, large overhead.
Option 2: sample the input data and analyze only the behavior of the map and reduce functions → almost no latency, very low extra overhead.
The second option is what the Job Analyzer implements.
13

Job Analyzer - Implementation


Job Analyzer Implementation
Runs inside a Hadoop virtual execution environment. It accepts the job's jar file and input data, then:
- Sampling Module: samples the input data at a fixed rate (less than 5%).
- MR Module: instantiates the user's job classes via Java reflection and runs them over the sampled records.
- Analyze Module: derives the job features — input data (amount & number of records), relative computational complexity, and data conversion rate (output/input).

Jar file + input data → Sampling Module → MR Module → Analyze Module → job features
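As a rough sketch of what the MR Module's reflection step might look like, the fragment below loads a user Mapper from the job jar and replays sampled records through it to measure the output/input conversion rate. It assumes the classic org.apache.hadoop.mapred API and text input; the actual module's interface is not shown in the talk:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapSampler {
    // Returns the map data conversion rate: output records / input records.
    public static double conversionRate(String jarPath, String mapperClass,
                                        List<String> sampledLines) throws Exception {
        URLClassLoader loader = new URLClassLoader(
                new URL[]{new URL("file:" + jarPath)}, MapSampler.class.getClassLoader());
        // Instantiate the user's map class via Java reflection.
        @SuppressWarnings("unchecked")
        Mapper<LongWritable, Text, Text, Text> mapper =
                (Mapper<LongWritable, Text, Text, Text>)
                        loader.loadClass(mapperClass).getDeclaredConstructor().newInstance();
        mapper.configure(new JobConf());

        final long[] out = {0};
        OutputCollector<Text, Text> collector = (k, v) -> out[0]++; // count outputs

        long offset = 0;
        for (String line : sampledLines) {
            mapper.map(new LongWritable(offset), new Text(line), collector, Reporter.NULL);
            offset += line.length() + 1;
        }
        mapper.close();
        return (double) out[0] / sampledLines.size();
    }
}
```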

14

Job Analyzer - Feasibility


Data similarity: log records share a uniform format.
Execution similarity: every record is processed by the same map and reduce functions.

[Figure: input data split across Map tasks, feeding into Reduce]

15

Design - 2
Parameters Collection
Job Analyzer: collects the Type Two parameters.
Static Parameters Collection Module: collects the Type One and Type Three parameters.

Job → Cost Model → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }
16

Prediction Model
Problem Analysis
Many steps run concurrently, so the total time cannot be obtained by simply adding up the time of each part.

[Figure: Map-task timeline split into CPU, disk, and network rows — initiation, read data, network transfer, create object, map function, sort in memory, merge sort, serialization, read/write disk, write disk — with several steps overlapping in time]

17

Prediction Model
Main Factors (according to the performance model)
- Map Stage
Main factors, read off the stages of the model:
- the amount of input data (read data)
- the number of input records, N (create object, map function)
- N·log(N) (sort in memory, merge sort)
- the complexity of the Map function
- the conversion rate of the Map data (serialization, write disk)

A linear model over these factors (see the sketch below):
T_map = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data)
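Read as code, the model is just a dot product between a fitted coefficient vector and the job's feature vector; a minimal sketch (the β values would come from the regression, none are given here):

```java
// Hypothetical evaluation of the fitted Map-time model.
public class MapTimeModel {
    static double predictMapTime(double[] beta, double mapInputBytes, double n,
                                 double mapComplexity, double conversionRate) {
        double[] x = {1.0, mapInputBytes, n, n * Math.log(n),
                      mapComplexity, conversionRate};   // [1, MapInput, N, N·logN, C, R]
        double t = 0;
        for (int i = 0; i < x.length; i++) {
            t += beta[i] * x[i];                        // β0 + β1·x1 + ... + β5·x5
        }
        return t;
    }
}
```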

18

Prediction Model
Experimental Analysis
Tested 4 kinds of jobs (0–10,000 records), extracted the features, fitted a linear regression, and computed the coefficient of determination (R²):

Jobs       R²
Dedup      0.9982
WordCount  0.9992
Project    0.9991
Grep       0.9949
Total      0.6157
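R² here is the standard coefficient of determination; for reference, a minimal sketch of how it is computed from a model's predictions:

```java
// R^2 = 1 - SS_res / SS_tot over actual vs. predicted execution times.
public class RSquared {
    static double rSquared(double[] actual, double[] predicted) {
        double mean = 0;
        for (double a : actual) mean += a;
        mean /= actual.length;
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < actual.length; i++) {
            ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
            ssTot += (actual[i] - mean) * (actual[i] - mean);
        }
        return 1 - ssRes / ssTot;
    }
}
```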

19

Prediction Model
[Chart: Map execution time vs. number of records (0–9,000) for Dedup, Grep, Project, and WordCount]

- Very good linear relationship within the same kind of job.
- But no linear relationship across different kinds of jobs.
20

Find the nearest jobs!


Instance-Based Linear Regression
- Find the samples nearest to the job to be predicted in the history logs (nearest → similar jobs; take the top K nearest, with K = 10%–15%).
- Fit a linear regression to the samples found.
- Compute the predicted value.

"Nearest" means the weighted distance over the job features (weights w), as sketched below:
- high contribution to job classification: map/reduce complexity, map/reduce data conversion rate;
- low contribution to job classification: data amount, number of records.
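A minimal sketch of the neighbor search under these assumptions (the feature order and the concrete weight values are illustrative; the talk only says classification-heavy features get larger weights):

```java
import java.util.Arrays;
import java.util.Comparator;

public class NearestJobs {
    // Feature order: [map complexity, data conversion rate, data amount, number of records]
    static final double[] W = {1.0, 1.0, 0.1, 0.1};   // illustrative weights

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += W[i] * diff * diff;                  // weighted squared difference
        }
        return Math.sqrt(d);
    }

    // Keep the top K nearest history jobs, K as a ratio of the log size (e.g. 0.12).
    static double[][] topK(double[][] history, double[] query, double ratio) {
        double[][] sorted = history.clone();
        Arrays.sort(sorted, Comparator.comparingDouble(h -> distance(h, query)));
        int k = Math.max(1, (int) (history.length * ratio));
        return Arrays.copyOfRange(sorted, 0, k);
    }
}
```

The linear regression from the previous slide is then fitted only on the rows topK returns.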
21

Prediction Module
Procedure
1. The Job Analyzer extracts the job features.
2. The cost model supplies the main factors, i.e. the terms of
   T_map = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data).
3. Search the history logs for the nearest samples.
4. Fit the prediction function on those samples and output the prediction results.
22

Prediction Module
Procedure
Cost model + training set → Find-Neighbor Module → prediction function → prediction results

23

Design - 3
Parameters Collection
Job Analyzer: collects the Type Two parameters.
Static Parameters Collection Module: collects the Type One and Type Three parameters.

Job → Cost Model → Prediction Module → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }

24

Experiments
Task Execution Time (Error Rate)

[Charts: error rate (%) of the predicted execution time for 40 map tasks (left) and 40 reduce tasks (right), over 4 kinds of jobs with input sizes from 64 MB to 8 GB, comparing three settings: K=12% with a distinct weight w per feature, K=12% with the same w for every feature, and K=25% with a distinct w per feature]
25

Conclusion
Job Analyzer :
Analyzes the job (jar + input file) and collects the parameters.

Prediction Module:
Finds the main factors, proposes a linear equation, classifies jobs by similarity, and predicts multiple targets (execution time and resource occupation).
26

Thank you! Questions?

27

Cost Model [1]


Analysis of the Reduce phase: model the consumption of each resource (CPU, disk, network); each stage involves only one type of resource.

[Figure: Reduce-task timeline split into CPU, disk, and network rows — initiation, read data, network transfer, merge sort, read/write disk, serialization, deserialization, create object, reduce function, write disk, network]
28

Prediction Model
Main Factors (according to the performance model)
- Reduce Stage
Main factors, read off the stages of the model:
- the amount of input data (read data, network transfer)
- the number of input records, N (create object, reduce function)
- N·log(N) (merge sort)
- the complexity of the Reduce function
- the conversion rate of the Map data
- the conversion rate of the Reduce data (write disk, network)

T_reduce = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the reduce function) + β5·(conversion rate of the map data) + β6·(conversion rate of the reduce data)

29
