www.spark-project.org
UC BERKELEY
Project Goals

Extend the MapReduce model to better support two common classes of analytics apps:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining

Enhance programmability:
- Integrate into Scala programming language
- Allow interactive use from Scala interpreter
Motivation

Most current cluster programming models are based on acyclic data flow from stable storage to stable storage.

[Figure: Input -> Map -> Reduce -> Output]

Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures.
Motivation

Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query.
Outline
- Spark programming model
- Implementation
- Demo
- User applications
Programming Model

Resilient distributed datasets (RDDs):
- Immutable, partitioned collections of objects
- Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
- Can be cached for efficient reuse

Actions on RDDs:
- count, reduce, collect, save, ...
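As a rough sketch (an illustration, not taken from the slides), the shape of this API can be mimicked with plain Scala collections, which share method names with RDDs; no cluster is involved here:

```scala
// Plain Scala collections share method names with Spark's RDD API,
// so the programming model can be sketched locally (no cluster needed).
val lines = List("ERROR disk full", "INFO ok", "ERROR timeout")

// Transformations define new collections (lazy on RDDs, eager here):
val errors = lines.filter(_.startsWith("ERROR"))
val words  = errors.flatMap(_.split(" "))

// Actions return a value to the driver program:
val numErrors = errors.size   // analogous to errors.count() on an RDD
```

On a real RDD one would additionally call `cache()` to keep the dataset in memory for reuse.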
errors = lines.filter(_.startsWith("ERROR"))

[Figure: the Driver ships filter tasks to Workers; each Worker scans its blocks (Block 1-3) and caches results in memory (Cache 1, Cache 2)]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
[Chart: Hadoop vs. Spark iteration time — Hadoop: 127 s / iteration]
Spark Applications
- In-memory data mining on Hive data (Conviva)
- Predictive analytics (Quantifind)
- City traffic prediction (Mobile Millennium)
- Twitter spam classification (Monarch)
- Collaborative filtering via matrix factorization
[Chart: Conviva GeoReport, time in hours — Hive: 20, Spark: 0.5]
Implementation

Runs on Apache Mesos to share resources with Hadoop & other apps.

Can read from any Hadoop input source (e.g. HDFS).

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across cluster nodes]
Spark Scheduler
- Dryad-like DAGs
- Pipelines functions within a stage
- Cache-aware work reuse & locality
- Partitioning-aware to avoid shuffles

[Figure: DAG of RDDs A-G split into Stages 1-3 by groupBy, map, union, and join; shaded boxes mark cached data partitions]
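Pipelining within a stage can be sketched with a plain Scala iterator (an analogy for illustration, not Spark's implementation): consecutive narrow transformations are applied element by element in one pass, with no intermediate collection materialized:

```scala
// Fused map+filter over an iterator: each element flows through both
// functions before the next is read, mirroring in-stage pipelining.
val input  = Iterator(1, 2, 3, 4)
val piped  = input.map(_ * 2).filter(_ > 4)
val result = piped.toList   // List(6, 8)
```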
Interactive Spark

Modified the Scala interpreter to allow Spark to be used interactively from the command line.

Required two changes:
- Modified wrapper code generation so that each line typed has references to objects for its dependencies
- Distribute generated classes over the network
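The wrapper scheme can be sketched roughly as follows (a simplified assumption for illustration, not the interpreter's actual generated code): each typed line becomes a class whose instance holds references to the objects it depends on, so closures shipped to workers carry exactly what they need:

```scala
// Hypothetical wrappers for two interpreter lines:
//   line 1:  val x = 5
//   line 2:  val y = x * 2
class Line1 { val x = 5 }
class Line2(val deps: Line1) { val y = deps.x * 2 }

val l1 = new Line1
val l2 = new Line2(l1)   // l2 carries a reference to its dependency
```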
Demo
Conclusion

Spark provides a simple, efficient, and powerful programming model for a wide range of apps.

Download our open source release: www.spark-project.org

matei@berkeley.edu
Related Work
- DryadLINQ, FlumeJava: similar distributed collection API, but cannot reuse datasets efficiently across queries
- Relational databases: lineage/provenance, logical logging, materialized views
- Fine-grained writes similar to distributed shared memory
- Implicit data sharing for a fixed computation pattern
- Store data in files, no explicit control over what is cached
Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
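A couple of the key-value operations above can be sketched on plain Scala pairs (local stand-ins for illustration; on RDDs the same names operate on distributed data):

```scala
val pairs = List(("a", 1), ("b", 2), ("a", 3))

// groupByKey-style: gather all values for each key
val grouped = pairs.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2)) }
// grouped("a") == List(1, 3)

// reduceByKey-style: combine the values for each key with a function
val reduced = grouped.map { case (k, vs) => (k, vs.sum) }
// reduced("a") == 4
```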