Contents
MapReduce
Apache Spark
Spark framework
RDD
What is MapReduce?
A programming model and an associated implementation for processing and generating large data sets.
Users specify two functions:
o Map function: processes a key/value pair to generate a set of intermediate key/value pairs.
o Reduce function: merges all intermediate values associated with the same intermediate key.
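As a sketch of the model (not the actual Hadoop API), word count can be expressed as these two user-supplied functions:
// Word count as map and reduce functions (illustrative sketch).
// map: one input record -> a set of intermediate (key, value) pairs
def map(key: String, line: String): Seq[(String, Int)] =
  line.split("\\s+").map(word => (word, 1)).toSeq
// reduce: one intermediate key plus all its values -> merged result
def reduce(word: String, counts: Seq[Int]): (String, Int) =
  (word, counts.sum)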
Limitations of MapReduce
MapReduce is suitable only for batch processing jobs; implementing interactive jobs and models is impractical.
Applications that involve repeated precomputation over the dataset lose the advantages of MapReduce.
Implementing iterative MapReduce jobs is expensive due to the huge space consumed by each job.
Problems that cannot be trivially partitioned and recombined are a hard limitation of MapReduce problem solving, for instance the Travelling Salesman Problem.
Due to the fixed startup cost incurred by each submitted MapReduce job, applications that require low latency or random access to a large data set are infeasible.
Tasks that depend on each other cannot be parallelized, and therefore cannot be expressed in MapReduce.
When you have OLTP needs: MapReduce is not suitable for a large number of short online transactions.
Complex algorithms, such as some machine learning algorithms like SVM.
When the map phase generates too many keys, sorting takes forever.
When your processing requires a lot of data to be shuffled over the network.
Spark
Other frameworks:
Pregel
F1
MillWheel
Storm
Impala
GraphLab
Why Spark?
Relying on many specialized frameworks brings problems:
o Work duplication
o Composition
o Limited scope
o Resource sharing
o Management and administration
Apache Spark
A fast and general engine for large-scale data processing.
Created at UC Berkeley's AMPLab; now developed by Databricks.
Written in Scala.
Licensed under the Apache License.
Example use case: spam detection.
Spark Framework
The Spark stack, from the framework diagram:
o Programming: Scala, Java, Python, R
o Library: Spark SQL, Spark Streaming, Spark GraphX (graph computation), SparkR (R on Spark), Spark MLlib (machine learning)
o Engine: Spark Scheduler
o Management: YARN, Mesos, Local
o Storage: HDFS, S3, RDBMS, NoSQL
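A minimal driver sketch showing how these layers meet in code (the app name and input path below are placeholders): the master URL selects the management layer, and the input URI selects the storage layer.
import org.apache.spark.{SparkConf, SparkContext}
// Master URL picks the cluster manager: "local[*]", "yarn", or "mesos://host:port".
val conf = new SparkConf().setAppName("StackDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
// The input URI picks the storage layer, e.g. hdfs:// or s3:// (placeholder path).
val lines = sc.textFile("hdfs:///path/to/input")
println(lines.count())
sc.stop()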
Immutability
Immutability means that once a value is created, it never changes.
Big data is by default immutable in nature.
Immutability helps with:
o Parallelization
o Caching
Immutability in action
const int a = 0; // immutable
int b = 5;       // mutable
Update:
b++;       // in place
c = a + 1; // creates a new value; a itself never changes
Immutability is about the value, not about the reference.
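The same idea in Scala, the language used by the examples below:
val a = 0     // immutable binding: reassigning a will not compile
var b = 5     // mutable binding
b += 1        // updated in place
val c = a + 1 // an "update" to an immutable value creates a new value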
Immutability in collections
Mutable: update each element in place
for (i = 0; i < collection.length; i++)
  collection[i] += 1
Immutable: build a new collection instead
val newCollection = collection.map(value => value + 1)
Multiple transformations
Mutable: one in-place pass per update
for (i = 0; i < collection.length; i++)
  collection[i] += 1
for (i = 0; i < collection.length; i++)
  collection[i] += 2
Immutable: each transformation yields a new collection
val c1 = collection.map(value => value + 1)
val c2 = c1.map(value => value + 2)
Laziness
Laziness means not computing a transformation until it is needed.
Laziness defers evaluation.
Laziness allows separating execution from evaluation.
Laziness in action
val c1 = collection.map(value => value + 1)
val c2 = c1.map(value => value + 1)
print(c2) // the transformations execute only now
Multiple transformations in one:
val c2 = collection.map(value => {
  var result = value + 1
  result = result + 2
  result
})
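Spark RDDs behave the same way: map only records the transformation, and nothing executes until an action such as collect or count runs. A minimal sketch, assuming a SparkContext named sc as created earlier:
val rdd = sc.parallelize(Seq(1, 2, 3, 5))
val r1 = rdd.map(_ + 1) // lazy: nothing runs yet
val r2 = r1.map(_ + 1)  // still lazy: only the lineage grows
println(r2.collect().mkString(",")) // action: both maps execute now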
Challenges of Laziness
Laziness poses a challenge in terms of data types.
If laziness defers execution, determining the data type of a variable becomes challenging.
Running a big data program only to discover a type error is not fun.
Type inference
Type inference is the part of the compiler that determines data types.
As all the transformations are side-effect free, we can determine the type from the operator.
Every transformation has a specific return type.
Having type inference relieves you from thinking about the representation for many transformations.
val collection = sc.parallelize(Seq(1, 2, 3, 5))
val c1 = collection.map(v => v + 1) // inferred as RDD[Int]
val c2 = c1.map(v => v + 1)         // inferred as RDD[Int]
val c3 = c2.count()                 // inferred as Long
val c4 = c3.map(v => v + 1)         // error: Long has no map method
Caching
Immutable data allows you to cache data for a long time.
Lazy transformations make it possible to recreate data on failure.
Transformations can also be saved.
Caching data also improves the performance of the engine.
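In Spark this is exposed through cache() (or persist() with an explicit storage level). A small sketch, assuming a SparkContext named sc and a placeholder input path:
val data = sc.textFile("hdfs:///path/to/log").map(_.toLowerCase)
data.cache() // keep the computed partitions in memory after the first action
println(data.count()) // first action: computes the RDD and fills the cache
println(data.filter(_.contains("error")).count()) // reuses cached partitions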