A Hadoop MapReduce Performance Prediction Method
Ge Song*+, Zide Meng*, Fabrice Huet*, Frederic Magoules+, Lei Yu# and Xuelian Lin#
* University of Nice Sophia Antipolis, CNRS, I3S, UMR 7271, France
+ Ecole Centrale de Paris, France
# Beihang University, Beijing, China
Background
Hadoop MapReduce
[Diagram: a job consists of Map and Reduce. The input data is divided into splits; each split is processed by a Map task, which emits (key, value) pairs into partitions (Partition 1, Partition 2, ...); the partitions are then consumed by Reduce tasks.]
Background
Hadoop
There are many steps within the Map stage and the Reduce stage, and different steps may consume different types of resources.
[Diagram: the steps of a Map task: Read, Map, Sort, Merge, Output.]
Motivation
Problems
Scheduling
The default scheduler takes no account of a job's execution time or of the different types of resources it consumes; for example, it may co-schedule two CPU-intensive tasks on the same node.
[Diagram: default Hadoop placing two CPU-intensive jobs together.]
Motivation
Solution
Scheduling takes no account of execution time or of the different types of resources consumed → solution: predict the performance of Hadoop jobs.
Hadoop parameter tuning: numerous parameters, whose default values are not optimal.
Related Work
Existing Prediction Method 1 - Black Box Based
[Diagram: Hadoop treated as a black box: job features → statistical/learning models (hard to choose) → execution time.]
Related Work
Existing Prediction Method 2 - Cost Model Based
[Diagram: Hadoop processing opened up into steps: read → map → output, then read → reduce → output.]
Job features are plugged into per-step cost functions to obtain the execution time:
F(map) = f(read, map, sort, spill, merge, write)
F(reduce) = f(read, write, merge, reduce, write)
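To make the cost-model idea concrete, a predictor of this family sums per-step costs into a phase estimate. Below is a minimal Java sketch with hypothetical step costs (a real model derives each term from measured parameters); note that a plain sum assumes the steps run back to back, which is exactly the modeling difficulty raised in the summary below.

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class CostModelSketch {
        // Sum per-step costs into a phase estimate, mirroring
        // F(map) = f(read, map, sort, spill, merge, write).
        static double phaseCost(Map<String, Double> stepCosts) {
            return stepCosts.values().stream().mapToDouble(Double::doubleValue).sum();
        }

        public static void main(String[] args) {
            Map<String, Double> mapSteps = new LinkedHashMap<>();
            mapSteps.put("read", 1.2);   // hypothetical per-step costs, in seconds
            mapSteps.put("map", 3.4);
            mapSteps.put("sort", 0.8);
            mapSteps.put("spill", 0.5);
            mapSteps.put("merge", 0.6);
            mapSteps.put("write", 0.9);
            System.out.printf("F(map) = %.1f s%n", phaseCost(mapSteps));
        }
    }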
Related Work
A Brief Summary about Existing Prediction Method
Black Box
- Advantage: simple prediction
- Shortcoming: lacks job feature extraction; lacks analysis of the job itself (jar package + data)

Cost Model
- Advantage: detailed analysis of Hadoop processing; flexible division (by stage or by resource); multiple predictions
- Shortcoming: many concurrent steps are hard to model; better suited to theoretical analysis than to practical prediction
Goal
Design a Hadoop MapReduce performance prediction system to:
- Predict a job's consumption of various types of resources (CPU, disk I/O, network)
- Predict the execution time of the Map phase and the Reduce phase
[Diagram: a job enters the system, which outputs five predictions: Map execution time, Reduce execution time, CPU occupation time, disk occupation time, network occupation time.]
Prediction System
Design - 1
Cost Model
[Diagram: system design, step 1: a job is submitted to the cost model.]
Map
[Diagram: steps of a Map task in the cost model: Initiation, Map Function, Sort in Memory, Merge Sort, Read/Write Disk.]
[1] X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," in CLUSTER Workshops, 2012, pp. 231-239.
Type One: constants, e.g. Hadoop system consumption and initialization consumption.
Parameters Collection
Type One and Type Three
Type One: run empty map tasks, then calculate the system consumption from the logs.
Type Three: extract the sort code from the Hadoop source and time it sorting a certain number of records (see the sketch below).
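A minimal Java sketch of the Type Three measurement, using the JDK's Arrays.sort as a stand-in for the sort code extracted from the Hadoop source:

    import java.util.Arrays;
    import java.util.Random;

    public class SortCostProbe {
        // Time an in-memory sort of n synthetic records.
        static double sortMillis(int n) {
            String[] records = new String[n];
            Random rnd = new Random(42); // fixed seed for repeatability
            for (int i = 0; i < n; i++) {
                records[i] = Long.toHexString(rnd.nextLong());
            }
            long start = System.nanoTime();
            Arrays.sort(records);
            return (System.nanoTime() - start) / 1e6;
        }

        public static void main(String[] args) {
            // Growing n exposes the N*log(N) trend the cost model relies on.
            for (int n : new int[]{10_000, 100_000, 1_000_000}) {
                System.out.printf("n=%d: %.1f ms%n", n, sortMillis(n));
            }
        }
    }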
Type Two
Naive approach: run a new job and analyze its logs → high latency, large overhead.
Our approach: sample the input data and analyze only the behavior of the map and reduce functions → almost no latency, very low extra overhead.
Job Analyzer
Sampling Module
- Sample the input data by a certain percentage (less than 5%).
- Instantiate the user's job classes using Java reflection (see the sketch below).
Extracted job features: input data (amount and number of records), relative computational complexity, data conversion rate (output/input).
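A minimal Java sketch of the reflection idea; the class name and the map(String) signature are hypothetical stand-ins for the Hadoop Mapper classes the real Job Analyzer instantiates:

    import java.lang.reflect.Method;
    import java.util.List;

    public class JobAnalyzerSketch {
        // Instantiate the user's job class by name and run its map function
        // on a small sample, measuring the data conversion rate (output/input)
        // and a rough per-record time as a complexity proxy.
        static void analyze(String userClassName, List<String> sample) throws Exception {
            Object job = Class.forName(userClassName).getDeclaredConstructor().newInstance();
            Method map = job.getClass().getMethod("map", String.class); // hypothetical signature
            long bytesIn = 0, bytesOut = 0, recordsOut = 0;
            long start = System.nanoTime();
            for (String record : sample) {
                bytesIn += record.length();
                @SuppressWarnings("unchecked")
                List<String> out = (List<String>) map.invoke(job, record);
                recordsOut += out.size();
                for (String o : out) {
                    bytesOut += o.length();
                }
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("conversion rate (bytes out/in): %.2f%n", (double) bytesOut / bytesIn);
            System.out.printf("conversion rate (records out/in): %.2f%n", (double) recordsOut / sample.size());
            System.out.printf("time per record: %.2f us%n", elapsed / 1e3 / sample.size());
        }
    }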
[Diagram: Job Analyzer pipeline: Sampling Module → MR Module → Analyze Module → job features.]
Reduce
[Diagram: steps of a Reduce task; detailed on the backup slides at the end.]
Design - 2
Parameters Collection
Job Analyzer: collect the Type Two parameters
[Diagram: system design, step 2: the Job Analyzer feeds the Type Two parameters into the cost model.]
Prediction Model
Problem Analysis
- Many steps run concurrently, so the total time cannot be obtained by adding up the time of each step; for example, a map computation that overlaps a disk read contributes less wall-clock time than the sum of the two.
[Diagram: timeline of a Map task across resources (CPU, disk, network): Read Data, Initiation, Map Function, Sort in Memory, Serialization, Merge Sort, Read/Write Disk, and Write Disk overlap in time.]
Prediction Model
Main Factors (according to the performance model)
- Map Stage
[Diagram: main factors annotated on the Map steps: Read Data (input size), Initiation (constant), Map Function (N, complexity of the map function), Sort in Memory (N·log(N)).]
Tmap = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data)
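Once the β coefficients have been fitted, the prediction itself is a dot product with the feature vector. A minimal Java sketch with placeholder coefficients (real values come from the regression on the sampled runs):

    public class MapTimePredictor {
        // Feature vector: [1, MapInput, N, N*log(N), complexity, conversion rate].
        static double predictTmap(double[] beta, double mapInputBytes, double n,
                                  double mapComplexity, double conversionRate) {
            double[] x = {1.0, mapInputBytes, n, n * Math.log(n), mapComplexity, conversionRate};
            double t = 0.0;
            for (int i = 0; i < x.length; i++) {
                t += beta[i] * x[i]; // Tmap = sum of beta_i * x_i
            }
            return t;
        }

        public static void main(String[] args) {
            double[] beta = {0.5, 2e-8, 1e-6, 2e-7, 0.3, 0.1}; // placeholder coefficients
            System.out.printf("Tmap = %.2f s%n", predictTmap(beta, 64e6, 500_000, 1.0, 1.2));
        }
    }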
Prediction Model
Experimental Analysis
Test 4 kinds of jobs (0-10,000 records). Extract the features and fit a linear regression. Calculate the coefficient of determination R² (see the sketch below the table).
Jobs        R²
Dedup       0.9982
WordCount   0.9992
Project     0.9991
Grep        0.9949
Total       0.6157
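For reference, a minimal Java sketch of the R² computation behind the table (the values in main are illustrative, not the experimental data):

    public class RSquared {
        // R^2 = 1 - SS_res / SS_tot for predictions against observations.
        static double rSquared(double[] observed, double[] predicted) {
            double mean = 0.0;
            for (double y : observed) {
                mean += y;
            }
            mean /= observed.length;
            double ssRes = 0.0, ssTot = 0.0;
            for (int i = 0; i < observed.length; i++) {
                ssRes += (observed[i] - predicted[i]) * (observed[i] - predicted[i]);
                ssTot += (observed[i] - mean) * (observed[i] - mean);
            }
            return 1.0 - ssRes / ssTot;
        }

        public static void main(String[] args) {
            double[] y    = {1.0, 2.0, 3.0, 4.0};
            double[] yHat = {1.1, 1.9, 3.2, 3.9};
            System.out.printf("R^2 = %.4f%n", rSquared(y, yHat));
        }
    }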
Prediction Model
[Plot: map time vs. number of records (0-9000) for the four kinds of jobs.]
- Very good linear relationship within the same kind of job.
- But no linear relationship across different kinds of jobs.
Nearest Samples
Use the weighted distance between job features (with weights w); the features with the highest contribution to job classification are the map/reduce complexity and the map/reduce data conversion rates (see the sketch below).
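A minimal Java sketch of the weighted nearest-neighbor search; the feature layout and weight values are assumptions for illustration:

    import java.util.Comparator;
    import java.util.List;
    import java.util.stream.Collectors;

    public class NearestSamples {
        // Weighted Euclidean distance between two job-feature vectors; the
        // weights w let high-contribution features (map/reduce complexity,
        // data conversion rates) dominate the classification.
        static double distance(double[] a, double[] b, double[] w) {
            double d = 0.0;
            for (int i = 0; i < a.length; i++) {
                double diff = a[i] - b[i];
                d += w[i] * diff * diff;
            }
            return Math.sqrt(d);
        }

        // Return the k stored samples whose features are closest to the new job.
        static List<double[]> nearest(List<double[]> samples, double[] job,
                                      double[] w, int k) {
            return samples.stream()
                    .sorted(Comparator.comparingDouble(s -> distance(s, job, w)))
                    .limit(k)
                    .collect(Collectors.toList());
        }
    }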
Prediction Module
Procedure
[Diagram: (3) use the job features to search for the nearest samples; (4) fit the cost model over the main factors: Tmap = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the map function) + β5·(conversion rate of the map data).]
Prediction Module
Procedure
[Diagram: the fitted cost model produces the prediction results.]
Design - 3
Parameters Collection
Job Analyzer: collect the Type Two parameters
[Diagram: system design, step 3: the Job Analyzer feeds the cost model, and the Prediction Module produces the final predictions.]
Experiments
Task Execution Time (Error Rate)
[Plots: error rate (%) of the predicted execution time per Job ID, for Map tasks and Reduce tasks; 4 kinds of jobs, input sizes 64 MB-8 GB; settings compared: k=12% with a different weight w per feature, k=12% with the same w for every feature, and k=25% with a different w per feature.]
Conclusion
Job Analyzer:
- Analyzes the job (jar package + input file)
- Collects the parameters
Prediction Module:
- Finds the main factors
- Proposes a linear equation
- Classifies jobs
- Makes multiple predictions
Reduce
[Diagram: steps of a Reduce task in the cost model: Initiation, Create Object, Reduce Function.]
Prediction Model
Main Factors (according to the performance model)
- Reduce Stage
[Diagram: main factors annotated on the Reduce steps: Initiation, Read Data, Network Transfer, Merge Sort, Read/Write Disk, Serialization, Deserialization, Create Object, and the Reduce Function, annotated with N·log(N) and the complexity of the reduce function.]
Treduce = β0 + β1·MapInput + β2·N + β3·N·log(N) + β4·(complexity of the reduce function) + β5·(conversion rate of the map data) + β6·(conversion rate of the reduce data)