Reza Zadeh
Wide deployment
» More common than MPI, especially “near” data
[Diagram: with MapReduce, iterative jobs re-read the Input on every iteration (iter. 1, iter. 2, …) and each interactive query (query 2 → result 2, query 3 → result 3) re-scans the Input; an iterative ranking job carries (id, rank) pairs through iterations 1–3 to the Result.]
While MapReduce is simple, it can require
asymptotically more communication or I/O
Spark computing engine
Spark Computing Engine
Extends a programming language with a
distributed collection data structure
» “Resilient distributed datasets” (RDD)
// Java:
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("error");
  }
}).count();
Fault Tolerance
RDDs track lineage info to rebuild lost data
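A minimal pure-Python sketch of the lineage idea (an illustration, not Spark's implementation): each "RDD" remembers its parent and the transformation that produced it, so a lost partition can be recomputed from lineage rather than restored from a replica. The class and variable names here are made up.

```python
class RDD:
    """Toy RDD: holds either cached data or a (parent, fn) lineage record."""

    def __init__(self, parent, fn, data=None):
        self.parent, self.fn, self._partition = parent, fn, data

    def compute(self):
        # Use the cached partition if present; otherwise rebuild from lineage.
        if self._partition is None:
            self._partition = [self.fn(x) for x in self.parent.compute()]
        return self._partition


base = RDD(None, None, data=[1, 2, 3])
doubled = RDD(base, lambda x: x * 2)

out = doubled.compute()      # [2, 4, 6], now cached
doubled._partition = None    # simulate losing the partition
rebuilt = doubled.compute()  # recomputed from lineage, same result
```

The point is that only the lineage record (parent + function) needs to survive a failure, not the data itself.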
file.map(lambda rec: (rec.type, 1))
    .reduceByKey(lambda x, y: x + y)
    .filter(lambda (type, count): count > 10)
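The same map → reduceByKey → filter chain can be mimicked locally with plain Python, which makes the data flow explicit (the records and the threshold here are made-up stand-ins):

```python
from collections import defaultdict

# Hypothetical log records: (type, payload)
records = [("error", "disk"), ("warn", "io"), ("error", "net"), ("error", "mem")]

# map: emit (rec.type, 1) per record
pairs = [(rec[0], 1) for rec in records]

# reduceByKey: sum the 1s per key
counts = defaultdict(int)
for key, value in pairs:
    counts[key] += value

# filter: keep keys seen more than twice
frequent = {k: c for k, c in counts.items() if c > 2}
```

In Spark the same three steps run partitioned across machines, with the lineage graph recording each stage.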
[Lineage diagram: Input file → map → reduce → filter]
Machine Learning example
Logistic Regression
data = spark.textFile(...).map(readPoint).cache()

w = numpy.random.rand(D)

for i in range(iterations):
    gradient = data.map(lambda p:
        (1 / (1 + exp(-p.y * w.dot(p.x)))) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient

print "Final w: %s" % w
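The per-point term from the slide can be checked locally with plain lists instead of RDDs and numpy. The points, dimension, and iteration count below are made-up assumptions; only the update expression mirrors the example.

```python
import math

# Hypothetical labeled points: (x vector, y label in {-1, +1})
points = [((1.0, 0.5), 1.0), ((-1.0, -0.3), -1.0)]
D = 2
w = [0.0] * D

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

for i in range(5):
    # Same per-point term as the slide: (1 / (1 + exp(-y * w.x))) * y * x
    terms = [[(1.0 / (1.0 + math.exp(-y * dot(w, x)))) * y * xj for xj in x]
             for x, y in points]
    # reduce(lambda a, b: a + b) becomes an element-wise sum
    gradient = [sum(t[j] for t in terms) for j in range(D)]
    w = [wj - gj for wj, gj in zip(w, gradient)]
```

In Spark, `cache()` keeps `data` in memory across iterations, which is why only the first iteration pays the load cost.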
Logistic Regression Results
[Chart: Running Time (s) vs. Number of Iterations (1–30), Hadoop vs. Spark — Hadoop: 110 s / iteration; Spark: first iteration 80 s, further iterations 1 s.]
[Chart: Iteration time (s) vs. % of working set in memory — iteration time falls from 58.1 s through 40.7 s and 29.7 s down to 11.5 s as more of the working set is cached.]
Benefit for Users
Same engine performs data extraction, model
training, and interactive queries
Separate engines:
[Diagram: parse → query → train, with a DFS read and DFS write around every step (DFS read, DFS write, DFS read, DFS write, …)]

Spark:
[Diagram: one DFS read, then parse, query, and train all inside Spark, with a single DFS write at the end]
State of the Spark ecosystem
Spark Community
Most active open source community in big data
200+ developers, 50+ companies contributing
[Charts: Project Activity — Commits and Lines of Code Changed — Spark leads Giraph, Storm, HDFS, MapReduce, and YARN. source: ohloh.net]
Built-in libraries
Standard Library for Big Data
Big data apps lack libraries
of common algorithms

Spark's generality + support
for multiple languages make it
suitable to offer this

[Diagram: Python, Scala, Java, and R APIs over SQL, ML, graph, … libraries, all on the Spark Core]
[Diagram: Spark SQL (structured), Spark Streaming (real-time), MLlib (machine learning), and GraphX (graph) — all built on Spark Core]
Machine Learning Library (MLlib)
40 contributors in the past year
MLlib algorithms
» classification: logistic regression, linear SVM, naïve Bayes, classification tree
» regression: generalized linear models (GLMs), regression tree
» collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
» clustering: k-means||
» decomposition: SVD, PCA
» optimization: stochastic gradient descent, L-BFGS
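As a local illustration of the last item, stochastic gradient descent can be sketched in a few lines of plain Python (a toy one-parameter least-squares fit, not MLlib's implementation; the data and learning rate are assumptions):

```python
import random

random.seed(0)

# Toy data generated from y = 3x, so the true slope is 3.0
data = [(float(x), 3.0 * float(x)) for x in range(1, 11)]

w, lr = 0.0, 0.01
for epoch in range(30):
    random.shuffle(data)            # "stochastic": visit examples in random order
    for x, y in data:
        err = w * x - y
        w -= lr * err * x           # gradient of 0.5 * (w*x - y)^2 w.r.t. w
```

Each update uses one example's gradient, which is what makes SGD cheap enough to run over distributed data one partition at a time.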
GraphX
General graph processing library
Build graph using RDDs of nodes and edges
Large library of graph algorithms with
composable steps
GraphX Algorithms

Collaborative Filtering
» Alternating Least Squares
» Stochastic Gradient Descent
» Tensor Factorization

Structured Prediction
» Loopy Belief Propagation
» Max-Product Linear Programs
» Gibbs Sampling

Semi-supervised ML
» Graph SSL
» CoEM

Community Detection
» Triangle-Counting
» K-core Decomposition
» K-Truss

Graph Analytics
» PageRank
» Personalized PageRank
» Shortest Path
» Graph Coloring

Classification
» Neural Networks
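To make one of these concrete, PageRank over an edge list fits in a few lines of plain Python (a toy single-machine sketch with a made-up 3-node graph and the conventional 0.85 damping factor, not GraphX's implementation):

```python
# Hypothetical directed edge list
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
nodes = {n for e in edges for n in e}
out_deg = {n: sum(1 for s, _ in edges if s == n) for n in nodes}

# Start with uniform rank, then iterate the power method
rank = {n: 1.0 / len(nodes) for n in nodes}
for _ in range(20):
    contrib = {n: 0.0 for n in nodes}
    for src, dst in edges:
        contrib[dst] += rank[src] / out_deg[src]   # spread rank along out-edges
    rank = {n: 0.15 / len(nodes) + 0.85 * contrib[n] for n in nodes}
```

GraphX expresses the same iteration over RDDs of nodes and edges, so the graph can be partitioned across machines.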
Spark Streaming
Run a streaming computation as a series
of very small, deterministic batch jobs
• Chop up the live stream into batches of X seconds
• Spark treats each batch of data as RDDs and processes them using RDD operations
• Finally, the processed results of the RDD operations are returned in batches

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
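The chopping step above can be sketched in pure Python (a local analogy for the micro-batch idea, not the Spark Streaming API; batching here is by item count rather than by wall-clock seconds):

```python
def batches(stream, size):
    """Chop an incoming stream into fixed-size batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:               # flush the final partial batch
        yield batch

# Run the same "batch job" (here: a sum) on every micro-batch
stream = iter(range(7))
results = [sum(b) for b in batches(stream, 3)]
```

Because each batch is processed by the same deterministic batch code, a failed batch can simply be recomputed, giving streaming the same fault-tolerance story as batch jobs.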
Spark Streaming
• Batch sizes as low as ½ second, latency ~1 second
• Potential for combining batch processing and streaming processing in the same system

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
Spark SQL
// Run SQL statements
val teenagers = context.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

// The results of SQL queries are RDDs of Row objects
val names = teenagers.map(t => "Name: " + t(0)).collect()
Spark SQL
Enables loading & querying structured data in Spark
From Hive:
c = HiveContext(sc)
rows = c.sql("select text, year from hivetable")
rows.filter(lambda r: r.year > 2013).collect()
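The "rows as objects you can filter with ordinary lambdas" idea can be mimicked locally with named tuples (the table contents below are made up for illustration):

```python
from collections import namedtuple

# Stand-in for SQL result rows with text and year columns
Row = namedtuple("Row", ["text", "year"])
rows = [Row("a", 2012), Row("b", 2014), Row("c", 2015)]

# Same shape as the RDD filter: keep rows with year > 2013
recent = [r for r in rows if r.year > 2013]
```

This mixing of declarative SQL and functional transformations on the same result set is what the single-engine design makes possible.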