
Introduction to Apache Spark

Contents

MapReduce
Apache Spark
Spark framework
RDD

What is MapReduce?
A programming model and an associated implementation for processing and generating large data sets.
Users specify two functions (sketched below for word count):
o Map function: processes a key/value pair to generate a set of intermediate key/value pairs.
o Reduce function: merges all intermediate values associated with the same intermediate key.

Automatically parallelized and executed on a large cluster of commodity machines.
The framework manages the following:
o partitioning the input data,
o scheduling the program's execution across a set of machines,
o handling machine failures,
o managing the required inter-machine communication.
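As a rough illustration of the two user-supplied functions, here is a word-count sketch in Scala. The names mapFn, reduceFn, and emit are placeholders invented for this example (emit stands in for the framework's output collector); they are not the API of any particular MapReduce implementation.

object WordCountFunctions {
  // Placeholder for the framework's output collector.
  def emit(key: String, value: Int): Unit = println(s"$key\t$value")

  // Map function: processes one (offset, line) pair and emits (word, 1) pairs.
  def mapFn(offset: Long, line: String): Unit =
    line.split("\\s+").filter(_.nonEmpty).foreach(word => emit(word, 1))

  // Reduce function: merges all intermediate counts for one word.
  def reduceFn(word: String, counts: Iterable[Int]): Unit =
    emit(word, counts.sum)
}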

Execution steps of MapReduce

Limitations of MapReduce

MapReduce is suitable only for batch processing jobs; implementing interactive jobs and models is effectively impossible.
Applications that involve precomputation on the dataset lose the advantages of MapReduce.
Implementing iterative MapReduce jobs is expensive due to the huge space consumed by each job.
Problems that cannot be trivially partitioned and recombined are a poor fit for MapReduce, for instance the Travelling Salesman Problem.
Due to the fixed cost incurred by each submitted MapReduce job, applications that require low latency or random access to a large data set are infeasible.
Tasks that depend on each other cannot be parallelized, so such workloads are not possible through MapReduce.
OLTP needs: MapReduce is not suitable for a large number of short online transactions.
Complex algorithms, such as some machine learning algorithms like SVM.
When the map phase generates too many keys, sorting takes forever.
When your processing requires a lot of data to be shuffled over the network.

Word count example: MapReduce vs. Spark
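For comparison with the MapReduce functions sketched earlier, here is the same word count as a short Spark job in Scala. It assumes an existing SparkContext named sc, and the input and output paths are illustrative only.

val counts = sc.textFile("/tmp/input.txt")
  .flatMap(line => line.split("\\s+"))   // split lines into words
  .map(word => (word, 1))                // emit (word, 1) pairs
  .reduceByKey(_ + _)                    // sum the counts per word

counts.saveAsTextFile("/tmp/word-counts")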

Other frameworks

Pregel
F1
MillWheel
Storm
Impala
GraphLab

Why Spark?

Work duplication
Composition
Limited scope
Resource sharing
Management and administration

Apache Spark
A fast and general engine for large-scale data processing.
Created at UC Berkeley's AMPLab; now developed by Databricks and the community.
Written in Scala.
Licensed under the Apache License.

Unified model for Big Data

Why unification matters?

Good for developers: one platform to learn.
Good for users: take apps everywhere.
Good for distributors: more apps (for example, spam detection).

Spark Framework

Programming: Scala, Java, Python, R
Library: Spark SQL, Spark Streaming, Spark GraphX (graph computation), Spark MLlib (machine learning), Spark R (R on Spark)
Engine: Spark Core Engine
Management: YARN, Mesos, Spark Scheduler, Local
Storage: HDFS, S3, RDBMS, NoSQL
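To make the layered stack concrete, here is a minimal Scala sketch in which the core RDD API and Spark SQL share the same SparkContext. The application name and values are invented for the example, and it assumes a Spark 1.x-style local setup.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// One engine, several libraries sharing the same context.
val conf = new SparkConf().setAppName("unified-example").setMaster("local[*]")
val sc   = new SparkContext(conf)
val sqlContext = new SQLContext(sc)          // Spark SQL layered on the same core engine

val rdd = sc.parallelize(Seq(1, 2, 3, 4))                                // core RDD API
val df  = sqlContext.createDataFrame(rdd.map(Tuple1(_))).toDF("value")   // DataFrame API
df.show()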

RDD - Resilient Distributed Dataset

A big collection of data with the following properties:
o Immutable
o Lazy evaluation
o Cacheable
o Type inferred
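As a minimal illustration of how an RDD is obtained in practice, the Scala sketch below creates one RDD from a local collection and one from a text file; the application name and file path are invented for the example.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
val sc   = new SparkContext(conf)

// From a local collection: distributes the data across the cluster.
val fromCollection = sc.parallelize(List(1, 2, 3, 4))

// From a text file: one element per line (path is illustrative).
val fromFile = sc.textFile("/tmp/input.txt")

// Nothing is computed until an action such as count() is called.
println(fromCollection.count())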

Immutability
Immutability means that once created, the data never changes.
Big data is by default immutable in nature.
Immutability helps with:
o Parallelism
o Caching

Immutability in action
const int a = 0;  // immutable
int b = 5;        // mutable
Update:
b++               // in place
c = a + 1         // a cannot change, so a new value c is created
Immutability is about the value, not about the reference.

Immutability in collections

Mutable:
var collection = [1, 2, 3, 4]
for (i = 0; i < collection.length; i++)
  collection[i] += 1
Uses a loop for update; the collection is updated in place.

Immutable:
val collection = [1, 2, 3, 4]
val newCollection = collection.map(value => value + 1)
Uses a transformation for change; creates a new copy of the data and leaves the original collection intact.
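The immutable column above is pseudocode; in plain Scala the same idea looks like this:

val collection = List(1, 2, 3, 4)
val newCollection = collection.map(value => value + 1)  // List(2, 3, 4, 5)
println(collection)  // List(1, 2, 3, 4) -- the original collection is untouched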

Multiple transformations

Mutable:
var collection = [1, 2, 3, 4]
for (i = 0; i < collection.length; i++)
  collection[i] += 1
for (i = 0; i < collection.length; i++)
  collection[i] += 2
The collection is updated in place.

Immutable:
val collection = [1, 2, 3, 4]
val newCollection = collection.map(value => value + 1)
val newCollection1 = newCollection.map(value => value + 2)
Creates three copies of the data and leaves the original collection intact.

Challenges of immutability

Immutability is good for parallelism but not good for space.
Doing multiple transformations results in:
o Multiple copies of the data
o Multiple passes over the data
In big data, multiple copies of the data and multiple passes have poor performance characteristics.

Laziness
Laziness means not computing a transformation until it is needed.
Laziness defers the evaluation.
Laziness allows separating execution from evaluation.

Laziness in action
val c1 = collection.map(value => value + 1)
val c2 = c1.map(value => value + 1)
print(c2)   // only now do the transformations run

Multiple transformations combined into one:
val c2 = collection.map(value => {
  var result = value + 1
  result = result + 2
  result
})
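The same behaviour in actual Spark code, assuming an existing SparkContext named sc: the transformations only build a lineage graph, and nothing runs until the count() action is called.

val numbers = sc.parallelize(List(1, 2, 3, 4))

val c1 = numbers.map(value => value + 1)   // nothing computed yet
val c2 = c1.map(value => value + 1)        // still nothing computed

println(c2.count())   // action: both map steps run now, in a single pass over the data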

Laziness and Immutability

You can be lazy only if the underlying data is immutable.
You cannot combine transformations if the transformations have side effects.
Combining laziness and immutability gives good performance and distributed processing.

Challenges of Laziness
Laziness poses a challenge in terms of data types.
If laziness defers the execution, determining the data type of a variable becomes challenging.
If we do not determine the right type, semantic issues can slip in.
Running a big data program only to hit a semantic issue is not fun.

Type inference
Type inference is the part of the compiler that determines the data type.
As all the transformations are side-effect free, we can determine the type from the operator.
Every transformation has a specific return type.
Having type inference relieves you from thinking about the representation for many transformations.

Type inference in action

val collection = [1, 2, 3, 5]
val c1 = collection.map(v => v + 1)
val c2 = c1.map(v => v + 1)
val c3 = c2.count()
val c4 = c3.map(v => v + 1)  // error: c3 is a number, not a collection
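The same example written against the Spark RDD API in Scala (assuming an existing SparkContext named sc) shows how the types are inferred step by step:

val collection = sc.parallelize(List(1, 2, 3, 5))
val c1 = collection.map(v => v + 1)  // inferred as RDD[Int]
val c2 = c1.map(v => v + 1)          // inferred as RDD[Int]
val c3 = c2.count()                  // count() is an action; c3 is inferred as Long
// val c4 = c3.map(v => v + 1)       // does not compile: Long has no `map` method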

Caching
Immutable data allows you to cache the data for a long time.
Lazy transformations allow the data to be recreated on failure.
Transformations can also be saved.
Caching the data also improves the performance of the engine.
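A small Scala sketch of caching an RDD that is reused by several actions; it assumes an existing SparkContext named sc, and the log file path is illustrative.

val logs   = sc.textFile("/tmp/app.log")
val errors = logs.filter(line => line.contains("ERROR"))

errors.cache()   // equivalently: errors.persist()

println(errors.count())                                 // first action: computes and caches the RDD
println(errors.filter(_.contains("timeout")).count())   // later actions reuse the cached data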
