
Spark

Fast, Interactive, Language-Integrated Cluster Computing


Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

www.spark-project.org

UC BERKELEY

Project Goals
- Extend the MapReduce model to better support two common classes of analytics apps:
  - Iterative algorithms (machine learning, graphs)
  - Interactive data mining
- Enhance programmability:
  - Integrate into Scala programming language
  - Allow interactive use from Scala interpreter

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Diagram: Input → Map / Map / Map → Reduce / Reduce → Output]
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures

Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query

Solution: Resilient Distributed Datasets (RDDs)


- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce:
  - Fault tolerance, data locality, scalability
- Support a wide range of applications

Outline
- Spark programming model
- Implementation
- Demo
- User applications

Programming Model
- Resilient distributed datasets (RDDs)
  - Immutable, partitioned collections of objects
  - Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
  - Can be cached for efficient reuse
- Actions on RDDs
  - Count, reduce, collect, save, ...

Example: Log Mining


Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads its block of the log (Blocks 1-3), builds the base RDD (lines) and transformed RDD (messages), keeps its partition of cachedMsgs in memory (Cache 1-3), and returns results to the driver when an action runs]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance


RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Lineage diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
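To make the recovery idea concrete, here is a minimal sketch in plain Scala (not Spark's internals) of how a lost partition of the mapped RDD could be rebuilt by replaying its lineage on the backing input block; the block contents are illustrative.

// Conceptual sketch only: a lost partition is rebuilt by re-applying the
// recorded transformations (filter, then map) to the block that backs it.
val hdfsBlock = Seq(                      // illustrative contents of one input block
  "INFO\t10:02\tstarting job",
  "ERROR\t10:03\tmysql: connection refused",
  "ERROR\t10:07\tphp: dying for unknown reasons"
)

val recoveredPartition = hdfsBlock.filter(_.startsWith("ERROR"))
                                  .map(_.split('\t')(2))
// recoveredPartition == Seq("mysql: connection refused", "php: dying for unknown reasons"),
// i.e. exactly the lost partition of "messages"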

Example: Logistic Regression


Goal: find best line separating two sets of points

[Figure: two sets of points; a random initial line is adjusted over iterations toward the target separating line]

Example: Logistic Regression


val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
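The slide omits the pieces this snippet relies on (readPoint, the Vector type with dot and scalar operations, D, ITERATIONS). A minimal, self-contained sketch of what they might look like follows; the names, the input format, and the vector representation are illustrative, not Spark's actual utility classes.

import scala.util.Random

// Illustrative helpers assumed by the logistic regression example.
object LogisticRegressionHelpers {
  case class Vector(elements: Array[Double]) {
    def dot(other: Vector): Double =
      elements.zip(other.elements).map { case (a, b) => a * b }.sum
    def +(other: Vector): Vector =
      Vector(elements.zip(other.elements).map { case (a, b) => a + b })
    def -(other: Vector): Vector =
      Vector(elements.zip(other.elements).map { case (a, b) => a - b })
    def *(scale: Double): Vector = Vector(elements.map(_ * scale))
    override def toString: String = elements.mkString("(", ", ", ")")
  }

  object Vector {
    def random(d: Int): Vector = Vector(Array.fill(d)(2 * Random.nextDouble() - 1))
  }

  // Lets the slide's expression put a scalar on the left of a vector (scalar * p.x).
  implicit class Scalar(val s: Double) extends AnyVal {
    def *(v: Vector): Vector = v * s
  }

  case class Point(x: Vector, y: Double)   // y is the label, +1.0 or -1.0

  val D = 10                // number of features (illustrative)
  val ITERATIONS = 10

  // Parse one text line of the form "label f1 f2 ... fD" into a Point.
  def readPoint(line: String): Point = {
    val nums = line.trim.split(' ').map(_.toDouble)
    Point(Vector(nums.tail), nums.head)
  }
}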

Logistic Regression Performance


[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s

Spark Applications
- In-memory data mining on Hive data (Conviva)
- Predictive analytics (Quantifind)
- City traffic prediction (Mobile Millennium)
- Twitter spam classification (Monarch)
- Collaborative filtering via matrix factorization

Conviva GeoReport
[Chart: report time (hours) — Hive: 20, Spark: 0.5]

Aggregations on many keys w/ same WHERE clause (query pattern sketched below)

40× gain comes from:
- Not re-reading unused columns or filtered records
- Avoiding repeated decompression
- In-memory storage of deserialized objects
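To give a sense of that query pattern, here is a hedged sketch: the filtered, column-pruned records are cached once and then many per-key aggregations run against the in-memory copy. The record fields, the parse helper, and the filter predicate are illustrative, not Conviva's actual code; only the RDD operations (textFile, map, filter, cache, reduceByKey) come from this deck.

// Illustrative GeoReport-style workload: cache the filtered records once,
// then run many aggregations over them in memory.
case class View(country: String, city: String, site: String, bufferingSecs: Double)

// Hypothetical parser for one log line: "country,city,site,bufferingSecs"
def parseView(line: String): View = {
  val f = line.split(',')
  View(f(0), f(1), f(2), f(3).toDouble)
}

// Shared WHERE clause applied once; the result is kept in memory.
val views = spark.textFile("hdfs://logs/views")
  .map(parseView)
  .filter(_.site == "example.com")   // same predicate for every report
  .cache()

// Many aggregations, each keyed differently, reuse the cached data.
val byCountry = views.map(v => (v.country, v.bufferingSecs)).reduceByKey(_ + _)
val byCity    = views.map(v => (v.city,    v.bufferingSecs)).reduceByKey(_ + _)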

Frameworks Built on Spark


Pregel on Spark (Bagel)
- Google message passing model for graph computation (one superstep sketched below)
- 200 lines of code

Hive on Spark (Shark)
- 3000 lines of code
- Compatible with Apache Hive
- ML operators in Scala
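To ground the message-passing model, here is a conceptual sketch of a single Pregel-style superstep written over plain Scala collections (PageRank-flavored). In Bagel the same structure would be expressed over RDDs of (vertexId, state) and (vertexId, message) pairs using operations like groupByKey, join, mapValues, and flatMap; the types, names, and update rule below are illustrative and this is not the actual Bagel API.

// Conceptual sketch of one Pregel-style superstep (not the Bagel API).
object PregelSuperstepSketch {
  type VertexId = String
  case class Vertex(rank: Double, outEdges: Seq[VertexId])
  case class Message(rankShare: Double)

  def superstep(vertices: Map[VertexId, Vertex],
                messages: Seq[(VertexId, Message)]
               ): (Map[VertexId, Vertex], Seq[(VertexId, Message)]) = {
    // 1. Group incoming messages by destination vertex (a groupByKey on an RDD).
    val grouped = messages.groupBy(_._1).map { case (dest, ms) => dest -> ms.map(_._2) }

    // 2. Each vertex combines its state with its messages (a join + mapValues).
    val newVertices = vertices.map { case (id, v) =>
      val incoming = grouped.getOrElse(id, Seq.empty)
      id -> v.copy(rank = 0.15 + 0.85 * incoming.map(_.rankShare).sum)
    }

    // 3. Each vertex emits messages along its out-edges (a flatMap).
    val outgoing = newVertices.toSeq.flatMap { case (_, v) =>
      v.outEdges.map(dest => dest -> Message(v.rank / v.outEdges.size))
    }
    (newVertices, outgoing)
  }
}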

Implementation
- Runs on Apache Mesos to share resources with Hadoop & other apps
- Can read from any Hadoop input source (e.g. HDFS)

[Diagram: Mesos layer spanning cluster nodes, with Spark, Hadoop, and MPI running on top]

No changes to Scala compiler

Spark Scheduler
- Dryad-like DAGs
- Pipelines functions within a stage (stage boundaries sketched below)
- Cache-aware work reuse & locality
- Partitioning-aware to avoid shuffles

[Diagram: example job DAG over RDDs A-G split into Stages 1-3; map and union are pipelined within a stage, while groupBy and join mark stage boundaries; shaded dots denote cached data partitions]
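As a rough illustration of where stage boundaries fall, consider a job built only from operations listed later in this deck: the narrow map and filter steps can be pipelined into one stage, while groupByKey needs all values for a key together and so starts a new stage after a shuffle. The dataset path and log format are illustrative.

// Hypothetical job showing pipelining vs. shuffle boundaries.
val lines = spark.textFile("hdfs://logs/requests")

// map and filter are narrow: each output partition depends on one input
// partition, so the scheduler can pipeline them into a single stage.
val hits = lines.map(_.split('\t'))
                .filter(fields => fields(2) == "200")

// groupByKey requires all values for a key together, so it introduces a
// shuffle: the pipeline above runs as one stage, the grouping starts another.
val byUrl = hits.map(fields => (fields(1), fields(3).toLong))
                .groupByKey()

// Caching lets later stages reuse these partitions instead of recomputing
// the whole pipeline above them.
val cachedByUrl = byUrl.cache()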

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line

Required two changes (an example session follows below):
- Modified wrapper code generation so that each line typed has references to objects for its dependencies
- Distribute generated classes over the network
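For instance, a session like this sketch (the file path and threshold are illustrative) is what those two changes enable: the value defined on one line is captured by the closure typed on the next, so the wrapper class generated for that line, and the objects it references, must be shipped to the workers that run the filter.

scala> val lines = spark.textFile("hdfs://logs/app.log")   // illustrative path
scala> val threshold = 500                                  // defined on the driver
scala> lines.filter(_.length > threshold).count
// the closure references threshold, so the generated class holding it is
// distributed over the network along with the filter tasks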

Demo

Conclusion
Spark provides a simple, efficient, and powerful programming model for a wide range of apps

Download our open source release:

www.spark-project.org

matei@berkeley.edu

Related Work
- DryadLINQ, FlumeJava
  - Similar distributed collection API, but cannot reuse datasets efficiently across queries
- Relational databases
  - Lineage/provenance, logical logging, materialized views
- GraphLab, Piccolo, BigTable, RAMCloud
  - Fine-grained writes similar to distributed shared memory
- Iterative MapReduce (e.g. Twister, HaLoop)
  - Implicit data sharing for a fixed computation pattern
- Caching systems (e.g. Nectar)
  - Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM


[Chart: iteration time (s) vs % of working set in memory — cache disabled: 68.8, 25%: 58.1, 50%: 40.7, 75%: 29.7, fully cached: 11.5]

Fault Recovery Results


[Chart: iteration time (s) per iteration, comparing no failure vs a failure in the 6th iteration — most iterations take roughly 56-59 s; the iteration with the failure takes 119 s while lost partitions are recomputed, after which times return to ~57 s]

Spark Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program):
  collect, reduce, count, save, lookupKey

A small example combining several of these follows below.
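As a quick illustration combining several of the listed operations (the input path is illustrative), a word count might look like:

// Hypothetical word count using flatMap, map, reduceByKey, and collect.
val file = spark.textFile("hdfs://data/docs")
val counts = file.flatMap(line => line.split(" "))   // transformation
                 .map(word => (word, 1))             // transformation
                 .reduceByKey(_ + _)                 // transformation (shuffles by key)
counts.collect()                                     // action: returns results to the driver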
