Contents
MapReduce
Apache Spark
Spark framework
RDD
What is MapReduce?
A programming model and an associated implementation for processing and generating large data sets.
Users specify two functions:
o Map function: processes a key/value pair to generate a set of intermediate key/value pairs.
o Reduce function: merges all intermediate values associated with the same intermediate key.
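As a sketch of the model (not the actual Hadoop API), word count can be expressed as these two user-supplied functions:
// Word count as map and reduce functions (illustrative sketch).
// map: one input record -> a set of intermediate (key, value) pairs
def map(key: String, line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq
// reduce: one intermediate key plus all its values -> merged result
def reduce(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)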
Limitations of MapReduce
MapReduce is suitable only for batch processing jobs; implementing interactive jobs and models is impractical.
Applications that involve repeated precomputation over the dataset lose the advantages of MapReduce.
Implementing iterative MapReduce jobs is expensive due to the huge space consumed by each job.
Problems that cannot be trivially partitioned and recombined are a hard limitation of MapReduce problem solving, for instance the Travelling Salesman Problem.
Due to the fixed startup cost incurred by each submitted MapReduce job, applications that require low latency or random access to a large data set are infeasible.
Tasks that depend on each other cannot be parallelized, and therefore cannot be expressed in MapReduce.
When you have OLTP needs: MapReduce is not suitable for a large number of short online transactions.
Complex algorithms, such as some machine learning algorithms like SVM.
When the map phase generates too many keys, sorting takes forever.
When your processing requires a lot of data to be shuffled over the network.
Spark
Other frameworks:
Pregel
F1
MillWheel
Storm
Impala
GraphLab
Why Spark?
Relying on many specialized frameworks brings problems:
o Work duplication
o Composition
o Limited scope
o Resource sharing
o Management and administration
Apache Spark
A fast and general engine for large-scale data processing.
Created at UC Berkeley's AMPLab; now developed by Databricks.
Written in Scala.
Licensed under the Apache License.
Example use case: spam detection.
Spark Framework
The Spark stack, from the framework diagram:
o Programming: Scala, Java, Python, R
o Library: Spark SQL, Spark Streaming, Spark GraphX (graph computation), SparkR (R on Spark), Spark MLlib (machine learning)
o Engine: Spark Scheduler
o Management: YARN, Mesos, Local
o Storage: HDFS, S3, RDBMS, NoSQL
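A minimal driver sketch showing how these layers meet in code (the app name and input path below are placeholders): the master URL selects the management layer, and the input URI selects the storage layer.
import org.apache.spark.{SparkConf, SparkContext}
// Master URL picks the cluster manager: "local[*]", "yarn", or "mesos://host:port".
val conf = new SparkConf().setAppName("StackDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
// The input URI picks the storage layer, e.g. hdfs:// or s3:// (placeholder path).
val lines = sc.textFile("hdfs:///path/to/input")
println(lines.count())
sc.stop()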
Immutability
Immutability means that once a value is created, it never changes.
Big data is by default immutable in nature.
Immutability helps with:
o Parallelization
o Caching
Immutability in action
const int a = 0; // immutable
int b = 5;       // mutable
Update:
b++;       // in place
c = a + 1; // creates a new value; a itself never changes
Immutability is about the value, not about the reference.
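The same idea in Scala, the language used by the examples below:
val a = 0     // immutable binding: reassigning a will not compile
var b = 5     // mutable binding
b += 1        // updated in place
val c = a + 1 // an "update" to an immutable value creates a new value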
Immutability in collections
Mutable: update each element in place
for (i = 0; i < collection.length; i++)
  collection[i] += 1
Immutable: build a new collection instead
val newCollection = collection.map(value => value + 1)
Multiple transformations
Mutable: one in-place pass per update
for (i = 0; i < collection.length; i++)
  collection[i] += 1
for (i = 0; i < collection.length; i++)
  collection[i] += 2
Immutable: each transformation yields a new collection
val c1 = collection.map(value => value + 1)
val c2 = c1.map(value => value + 2)
Laziness
Laziness means not computing a transformation until it is needed.
Laziness defers evaluation.
Laziness allows separating execution from evaluation.
Laziness in action
val c1 = collection.map(value => value + 1)
val c2 = c1.map(value => value + 1)
print(c2) // the transformations execute only now
Multiple transformations in one:
val c2 = collection.map(value => {
  var result = value + 1
  result = result + 2
  result
})
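Spark RDDs behave the same way: map only records the transformation, and nothing executes until an action such as collect or count runs. A minimal sketch, assuming a SparkContext named sc as created earlier:
val rdd = sc.parallelize(Seq(1, 2, 3, 5))
val r1 = rdd.map(_ + 1) // lazy: nothing runs yet
val r2 = r1.map(_ + 1)  // still lazy: only the lineage grows
println(r2.collect().mkString(",")) // action: both maps execute now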
Challenges of Laziness
Laziness poses a challenge in terms of data types.
If laziness defers execution, determining the data type of a variable becomes challenging.
Running a big data program only to discover a type error is not fun.
Type inference
Type inference is the part of the compiler that determines data types.
As all the transformations are side-effect free, we can determine the type from the operator.
Every transformation has a specific return type.
Having type inference relieves you from thinking about the representation for many transformations.
val collection = sc.parallelize(Seq(1, 2, 3, 5))
val c1 = collection.map(v => v + 1) // inferred as RDD[Int]
val c2 = c1.map(v => v + 1)         // inferred as RDD[Int]
val c3 = c2.count()                 // inferred as Long
val c4 = c3.map(v => v + 1)         // error: Long has no map method
Caching
Immutable data allows you to cache data for a long time.
Lazy transformations make it possible to recreate data on failure.
Transformations can also be saved.
Caching data also improves the performance of the engine.
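In Spark this is exposed through cache() (or persist() with an explicit storage level). A small sketch, assuming a SparkContext named sc and a placeholder input path:
val data = sc.textFile("hdfs:///path/to/log").map(_.toLowerCase)
data.cache() // keep the computed partitions in memory after the first action
println(data.count()) // first action: computes the RDD and fills the cache
println(data.filter(_.contains("error")).count()) // reuses cached partitions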