
A Hadoop MapReduce Performance Prediction Method

Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China

Background
Hadoop MapReduce
[Figure: MapReduce dataflow — input data is split and processed by Map tasks, which emit (key, value) pairs into partitions (Partition 1, Partition 2); Reduce tasks pull the partitions and write their results to HDFS]

Background
Hadoop
There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.
[Figure: steps of a Map task — READ → MAP → SORT → MERGE → OUTPUT]

Motivation
Problems
Scheduling: two CPU-intensive tasks may end up scheduled together, because the scheduler takes no account of execution time or of the different types of resources consumed.

Hadoop parameter tuning: Hadoop exposes numerous configuration parameters, and the default values are not optimal for every job.

[Figure: jobs submitted to Hadoop running under the default configuration]

Motivation
Solution
Scheduling: no consideration of execution time or of the different types of resources consumed.
Hadoop parameter tuning: numerous parameters, and the default values are not optimal.
→ Solution: predict the performance of Hadoop jobs.

Related Work
Existing Prediction Method 1 - Black Box Based
Approach: feed job features into statistical or machine-learning models that output an execution time.

Job features → statistic/learning models → execution time

Drawbacks: the right model is hard to choose, and the approach involves no analysis of Hadoop itself.
6

Related Work
Existing Prediction Method 2 - Cost Model Based
Approach: decompose the Map and Reduce phases into steps and model the cost of each:

F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)

Job features → cost model → execution time

Drawbacks: Hadoop runs many concurrent processes, so the stages are hard to divide cleanly and accuracy is difficult to ensure.
7

Related Work
A Brief Summary about Existing Prediction Method
Black Box
- Advantages: simple and effective; high accuracy; high isomorphism.
- Shortcomings: lack of job feature extraction; lack of analysis of Hadoop internals; hard to divide the work by step and by resource.

Cost Model
- Advantages: detailed analysis of Hadoop processing; flexible division (by stage, by resource); multiple predictions.
- Shortcomings: lack of job feature extraction; many concurrent processes that are hard to model; better suited to theoretical analysis than to prediction.

Common shortcomings: prediction stays simple, and neither analyzes the job itself (jar package + data).
8

Goal
Design a Hadoop MapReduce performance prediction system to:
- predict a job's consumption of various types of resources (CPU, disk I/O, network)
- predict the execution time of the Map phase and the Reduce phase

Job → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }
9

Prediction System

Design - 1
Cost Model

Job → Cost Model → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }
10

Cost Model [1]


Analysis of the Map phase: model the consumption of each resource (CPU, disk, network); each stage involves only one type of resource.

[Figure: Map-task timeline split into CPU, disk, and network rows — initiation, read data, network transfer, create object, map function, sort in memory, merge sort, serialization, read/write disk, write disk]

11
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.

Cost Model [1]


Cost Function Parameters Analysis

Type One — constants: Hadoop system consumption, initialization consumption.
Type Two — job-related parameters: Map function computational complexity, number of Map input records.
Type Three — parameters defined by the cost model: sorting coefficient, complexity factor.

12
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231–239.

Parameters Collection
Type One and Type Three
Type One: run empty map tasks and compute the system overhead from the logs.
Type Three: extract the sort code from the Hadoop source and time it sorting a known number of records.
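For illustration, a minimal sketch of the Type Three measurement, with java.util.Arrays.sort standing in for the sort code extracted from Hadoop (the class name and record counts are ours, not the authors'):

```java
import java.util.Arrays;
import java.util.Random;

// Estimate a sorting coefficient c such that sort time ≈ c · N·log2(N).
public class SortCoefficient {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        for (int n : new int[]{100_000, 1_000_000, 10_000_000}) {
            long[] records = rnd.longs(n).toArray();   // synthetic records
            long start = System.nanoTime();
            Arrays.sort(records);                      // stand-in for Hadoop's sort
            long elapsed = System.nanoTime() - start;
            double c = elapsed / (n * (Math.log(n) / Math.log(2)));
            System.out.printf("N=%,d  time=%.1f ms  c=%.2f ns per N·log2(N)%n",
                    n, elapsed / 1e6, c);
        }
    }
}
```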

Type Two
Option 1: run the job once and analyze its logs → high latency, large overhead.
Option 2: sample the input data and analyze only the behavior of the map and reduce functions → almost no latency, very low extra overhead.
The second option is what the Job Analyzer implements.
13

Job Analyzer - Implementation


Job Analyzer Implementation
Runs inside a Hadoop virtual execution environment. It accepts the job's jar file and input data, then:
- Sampling Module: samples the input data at a fixed rate (less than 5%).
- MR Module: instantiates the user's job classes via Java reflection and runs them over the sampled records.
- Analyze Module: derives the job features — input data (amount & number of records), relative computational complexity, and data conversion rate (output/input).

Jar file + input data → Sampling Module → MR Module → Analyze Module → job features
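As a rough sketch of what the MR Module's reflection step might look like, the fragment below loads a user Mapper from the job jar and replays sampled records through it to measure the output/input conversion rate. It assumes the classic org.apache.hadoop.mapred API and text input; the actual module's interface is not shown in the talk:

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapSampler {
    // Returns the map data conversion rate: output records / input records.
    public static double conversionRate(String jarPath, String mapperClass,
                                        List<String> sampledLines) throws Exception {
        URLClassLoader loader = new URLClassLoader(
                new URL[]{new URL("file:" + jarPath)}, MapSampler.class.getClassLoader());
        // Instantiate the user's map class via Java reflection.
        @SuppressWarnings("unchecked")
        Mapper<LongWritable, Text, Text, Text> mapper =
                (Mapper<LongWritable, Text, Text, Text>)
                        loader.loadClass(mapperClass).getDeclaredConstructor().newInstance();
        mapper.configure(new JobConf());

        final long[] out = {0};
        OutputCollector<Text, Text> collector = (k, v) -> out[0]++; // count outputs

        long offset = 0;
        for (String line : sampledLines) {
            mapper.map(new LongWritable(offset), new Text(line), collector, Reporter.NULL);
            offset += line.length() + 1;
        }
        mapper.close();
        return (double) out[0] / sampledLines.size();
    }
}
```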

14

Job Analyzer - Feasibility


Data similarity: log records share a uniform format.
Execution similarity: every record is processed by the same map and reduce functions.

[Figure: input data split across Map tasks, feeding into Reduce]

15

Design - 2
Parameters Collection
Job Analyzer: collects the Type Two parameters.
Static Parameters Collection Module: collects the Type One and Type Three parameters.

Job → Cost Model → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }
16

Prediction Model
Problem Analysis
Many steps run concurrently, so the total time cannot be obtained by simply adding up the time of each part.

[Figure: Map-task timeline split into CPU, disk, and network rows — initiation, read data, network transfer, create object, map function, sort in memory, merge sort, serialization, read/write disk, write disk — with several steps overlapping in time]

17

Prediction Model
Main Factors (according to the performance model)
- Map Stage
Main factors, read off the stages of the model:
- the amount of input data (read data)
- the number of input records, N (create object, map function)
- N·log(N) (sort in memory, merge sort)
- the complexity of the Map function
- the conversion rate of the Map data (serialization, write disk)

A linear model over these factors (see the sketch below):
T_map = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data)
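Read as code, the model is just a dot product between a fitted coefficient vector and the job's feature vector; a minimal sketch (the β values would come from the regression, none are given here):

```java
// Hypothetical evaluation of the fitted Map-time model.
public class MapTimeModel {
    static double predictMapTime(double[] beta, double mapInputBytes, double n,
                                 double mapComplexity, double conversionRate) {
        double[] x = {1.0, mapInputBytes, n, n * Math.log(n),
                      mapComplexity, conversionRate};   // [1, MapInput, N, N·logN, C, R]
        double t = 0;
        for (int i = 0; i < x.length; i++) {
            t += beta[i] * x[i];                        // β0 + β1·x1 + ... + β5·x5
        }
        return t;
    }
}
```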

18

Prediction Model
Experimental Analysis
Tested 4 kinds of jobs (0–10,000 records), extracted the features, fitted a linear regression, and computed the coefficient of determination (R²):

Jobs       R²
Dedup      0.9982
WordCount  0.9992
Project    0.9991
Grep       0.9949
Total      0.6157
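R² here is the standard coefficient of determination; for reference, a minimal sketch of how it is computed from a model's predictions:

```java
// R^2 = 1 - SS_res / SS_tot over actual vs. predicted execution times.
public class RSquared {
    static double rSquared(double[] actual, double[] predicted) {
        double mean = 0;
        for (double a : actual) mean += a;
        mean /= actual.length;
        double ssRes = 0, ssTot = 0;
        for (int i = 0; i < actual.length; i++) {
            ssRes += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
            ssTot += (actual[i] - mean) * (actual[i] - mean);
        }
        return 1 - ssRes / ssTot;
    }
}
```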

19

Prediction Model
[Chart: Map execution time vs. number of records (0–9,000) for Dedup, Grep, Project, and WordCount]

- Very good linear relationship within the same kind of job.
- But no linear relationship across different kinds of jobs.
20

Find the nearest jobs!


Instance-Based Linear Regression
- Find the samples nearest to the job to be predicted in the history logs (nearest → similar jobs; take the top K nearest, with K = 10%–15%).
- Fit a linear regression to the samples found.
- Compute the predicted value.

"Nearest" means the weighted distance over the job features (weights w), as sketched below:
- high contribution to job classification: map/reduce complexity, map/reduce data conversion rate;
- low contribution to job classification: data amount, number of records.
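A minimal sketch of the neighbor search under these assumptions (the feature order and the concrete weight values are illustrative; the talk only says classification-heavy features get larger weights):

```java
import java.util.Arrays;
import java.util.Comparator;

public class NearestJobs {
    // Feature order: [map complexity, data conversion rate, data amount, number of records]
    static final double[] W = {1.0, 1.0, 0.1, 0.1};   // illustrative weights

    static double distance(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += W[i] * diff * diff;                  // weighted squared difference
        }
        return Math.sqrt(d);
    }

    // Keep the top K nearest history jobs, K as a ratio of the log size (e.g. 0.12).
    static double[][] topK(double[][] history, double[] query, double ratio) {
        double[][] sorted = history.clone();
        Arrays.sort(sorted, Comparator.comparingDouble(h -> distance(h, query)));
        int k = Math.max(1, (int) (history.length * ratio));
        return Arrays.copyOfRange(sorted, 0, k);
    }
}
```

The linear regression from the previous slide is then fitted only on the rows topK returns.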
21

Prediction Module
Procedure
1. The Job Analyzer extracts the job features.
2. The cost model supplies the main factors, i.e. the terms of
   T_map = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data).
3. Search the history logs for the nearest samples.
4. Fit the prediction function on those samples and output the prediction results.
22

Prediction Module
Procedure
Cost model + training set → Find-Neighbor Module → prediction function → prediction results

23

Design - 3
Parameters Collection
Job Analyzer: collects the Type Two parameters.
Static Parameters Collection Module: collects the Type One and Type Three parameters.

Job → Cost Model → Prediction Module → { Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time }

24

Experiments
Task Execution Time (Error Rate)

[Charts: error rate (%) of the predicted execution time for 40 map tasks (left) and 40 reduce tasks (right), over 4 kinds of jobs with input sizes from 64 MB to 8 GB, comparing three settings: K=12% with a distinct weight w per feature, K=12% with the same w for every feature, and K=25% with a distinct w per feature]
25

Conclusion
Job Analyzer :
Analyzes the job (jar + input file) and collects the parameters.

Prediction Module:
Finds the main factors, proposes a linear equation, classifies jobs by similarity, and predicts multiple targets (execution time and resource occupation).
26

Thank you! Questions?

27

Cost Model [1]


Analysis of the Reduce phase: model the consumption of each resource (CPU, disk, network); each stage involves only one type of resource.

[Figure: Reduce-task timeline split into CPU, disk, and network rows — initiation, read data, network transfer, merge sort, read/write disk, serialization, deserialization, create object, reduce function, write disk, network]
28

Prediction Model
Main Factors (according to the performance model)
- Reduce Stage
Main factors, read off the stages of the model:
- the amount of input data (read data, network transfer)
- the number of input records, N (create object, reduce function)
- N·log(N) (merge sort)
- the complexity of the Reduce function
- the conversion rate of the Map data
- the conversion rate of the Reduce data (write disk, network)

T_reduce = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the reduce function) + β5·(conversion rate of the map data) + β6·(conversion rate of the reduce data)

29
