
Spark

Fast, Interactive, Language-Integrated Cluster Computing


Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica

www.spark-project.org

UC BERKELEY

Project Goals
- Extend the MapReduce model to better support two common classes of analytics apps:
  - Iterative algorithms (machine learning, graphs)
  - Interactive data mining
- Enhance programmability:
  - Integrate into Scala programming language
  - Allow interactive use from Scala interpreter

Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

[Diagram: Input → Map / Map / Map → Reduce / Reduce → Output]
Motivation
Most current cluster programming models are based on acyclic data flow from stable storage to stable storage

Benefits of data flow: runtime can decide where to run tasks and can automatically recover from failures

Motivation
Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data:
- Iterative algorithms (machine learning, graphs)
- Interactive data mining tools (R, Excel, Python)

With current frameworks, apps reload data from stable storage on each query

Solution: Resilient Distributed Datasets (RDDs)


- Allow apps to keep working sets in memory for efficient reuse
- Retain the attractive properties of MapReduce:
  - Fault tolerance, data locality, scalability
- Support a wide range of applications

Outline
- Spark programming model
- Implementation
- Demo
- User applications

Programming Model
- Resilient distributed datasets (RDDs)
  - Immutable, partitioned collections of objects
  - Created through parallel transformations (map, filter, groupBy, join, ...) on data in stable storage
  - Can be cached for efficient reuse
- Actions on RDDs
  - Count, reduce, collect, save, ...

Example: Log Mining


Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver sends tasks to workers; each worker reads its block of the log (Blocks 1-3), builds the base RDD (lines) and transformed RDD (messages), keeps its partition of cachedMsgs in memory (Cache 1-3), and returns results to the driver when an action runs]
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance


RDDs maintain lineage information that can be used to reconstruct lost partitions

Ex:
messages = textFile(...).filter(_.startsWith("ERROR"))
                        .map(_.split('\t')(2))

[Lineage diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
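To make the recovery idea concrete, here is a minimal sketch in plain Scala (not Spark's internals) of how a lost partition of the mapped RDD could be rebuilt by replaying its lineage on the backing input block; the block contents are illustrative.

// Conceptual sketch only: a lost partition is rebuilt by re-applying the
// recorded transformations (filter, then map) to the block that backs it.
val hdfsBlock = Seq(                      // illustrative contents of one input block
  "INFO\t10:02\tstarting job",
  "ERROR\t10:03\tmysql: connection refused",
  "ERROR\t10:07\tphp: dying for unknown reasons"
)

val recoveredPartition = hdfsBlock.filter(_.startsWith("ERROR"))
                                  .map(_.split('\t')(2))
// recoveredPartition == Seq("mysql: connection refused", "php: dying for unknown reasons"),
// i.e. exactly the lost partition of "messages"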

Example: Logistic Regression


Goal: find best line separating two sets of points

[Figure: two sets of points; a random initial line is adjusted over iterations toward the target separating line]

Example: Logistic Regression


val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
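The slide omits the pieces this snippet relies on (readPoint, the Vector type with dot and scalar operations, D, ITERATIONS). A minimal, self-contained sketch of what they might look like follows; the names, the input format, and the vector representation are illustrative, not Spark's actual utility classes.

import scala.util.Random

// Illustrative helpers assumed by the logistic regression example.
object LogisticRegressionHelpers {
  case class Vector(elements: Array[Double]) {
    def dot(other: Vector): Double =
      elements.zip(other.elements).map { case (a, b) => a * b }.sum
    def +(other: Vector): Vector =
      Vector(elements.zip(other.elements).map { case (a, b) => a + b })
    def -(other: Vector): Vector =
      Vector(elements.zip(other.elements).map { case (a, b) => a - b })
    def *(scale: Double): Vector = Vector(elements.map(_ * scale))
    override def toString: String = elements.mkString("(", ", ", ")")
  }

  object Vector {
    def random(d: Int): Vector = Vector(Array.fill(d)(2 * Random.nextDouble() - 1))
  }

  // Lets the slide's expression put a scalar on the left of a vector (scalar * p.x).
  implicit class Scalar(val s: Double) extends AnyVal {
    def *(v: Vector): Vector = v * s
  }

  case class Point(x: Vector, y: Double)   // y is the label, +1.0 or -1.0

  val D = 10                // number of features (illustrative)
  val ITERATIONS = 10

  // Parse one text line of the form "label f1 f2 ... fD" into a Point.
  def readPoint(line: String): Point = {
    val nums = line.trim.split(' ').map(_.toDouble)
    Point(Vector(nums.tail), nums.head)
  }
}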

Logistic Regression Performance


[Chart: running time (s) vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s / iteration
Spark: first iteration 174 s, further iterations 6 s

Spark Applications
- In-memory data mining on Hive data (Conviva)
- Predictive analytics (Quantifind)
- City traffic prediction (Mobile Millennium)
- Twitter spam classification (Monarch)
- Collaborative filtering via matrix factorization

Conviva GeoReport
[Chart: report time (hours) — Hive: 20, Spark: 0.5]

Aggregations on many keys w/ same WHERE clause (query pattern sketched below)

40× gain comes from:
- Not re-reading unused columns or filtered records
- Avoiding repeated decompression
- In-memory storage of deserialized objects
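To give a sense of that query pattern, here is a hedged sketch: the filtered, column-pruned records are cached once and then many per-key aggregations run against the in-memory copy. The record fields, the parse helper, and the filter predicate are illustrative, not Conviva's actual code; only the RDD operations (textFile, map, filter, cache, reduceByKey) come from this deck.

// Illustrative GeoReport-style workload: cache the filtered records once,
// then run many aggregations over them in memory.
case class View(country: String, city: String, site: String, bufferingSecs: Double)

// Hypothetical parser for one log line: "country,city,site,bufferingSecs"
def parseView(line: String): View = {
  val f = line.split(',')
  View(f(0), f(1), f(2), f(3).toDouble)
}

// Shared WHERE clause applied once; the result is kept in memory.
val views = spark.textFile("hdfs://logs/views")
  .map(parseView)
  .filter(_.site == "example.com")   // same predicate for every report
  .cache()

// Many aggregations, each keyed differently, reuse the cached data.
val byCountry = views.map(v => (v.country, v.bufferingSecs)).reduceByKey(_ + _)
val byCity    = views.map(v => (v.city,    v.bufferingSecs)).reduceByKey(_ + _)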

Frameworks Built on Spark


Pregel on Spark (Bagel)
- Google message passing model for graph computation (one superstep sketched below)
- 200 lines of code

Hive on Spark (Shark)
- 3000 lines of code
- Compatible with Apache Hive
- ML operators in Scala
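To ground the message-passing model, here is a conceptual sketch of a single Pregel-style superstep written over plain Scala collections (PageRank-flavored). In Bagel the same structure would be expressed over RDDs of (vertexId, state) and (vertexId, message) pairs using operations like groupByKey, join, mapValues, and flatMap; the types, names, and update rule below are illustrative and this is not the actual Bagel API.

// Conceptual sketch of one Pregel-style superstep (not the Bagel API).
object PregelSuperstepSketch {
  type VertexId = String
  case class Vertex(rank: Double, outEdges: Seq[VertexId])
  case class Message(rankShare: Double)

  def superstep(vertices: Map[VertexId, Vertex],
                messages: Seq[(VertexId, Message)]
               ): (Map[VertexId, Vertex], Seq[(VertexId, Message)]) = {
    // 1. Group incoming messages by destination vertex (a groupByKey on an RDD).
    val grouped = messages.groupBy(_._1).map { case (dest, ms) => dest -> ms.map(_._2) }

    // 2. Each vertex combines its state with its messages (a join + mapValues).
    val newVertices = vertices.map { case (id, v) =>
      val incoming = grouped.getOrElse(id, Seq.empty)
      id -> v.copy(rank = 0.15 + 0.85 * incoming.map(_.rankShare).sum)
    }

    // 3. Each vertex emits messages along its out-edges (a flatMap).
    val outgoing = newVertices.toSeq.flatMap { case (_, v) =>
      v.outEdges.map(dest => dest -> Message(v.rank / v.outEdges.size))
    }
    (newVertices, outgoing)
  }
}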

Implementation
- Runs on Apache Mesos to share resources with Hadoop & other apps
- Can read from any Hadoop input source (e.g. HDFS)

[Diagram: Mesos layer spanning cluster nodes, with Spark, Hadoop, and MPI running on top]

No changes to Scala compiler

Spark Scheduler
- Dryad-like DAGs
- Pipelines functions within a stage (stage boundaries sketched below)
- Cache-aware work reuse & locality
- Partitioning-aware to avoid shuffles

[Diagram: example job DAG over RDDs A-G split into Stages 1-3; map and union are pipelined within a stage, while groupBy and join mark stage boundaries; shaded dots denote cached data partitions]
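As a rough illustration of where stage boundaries fall, consider a job built only from operations listed later in this deck: the narrow map and filter steps can be pipelined into one stage, while groupByKey needs all values for a key together and so starts a new stage after a shuffle. The dataset path and log format are illustrative.

// Hypothetical job showing pipelining vs. shuffle boundaries.
val lines = spark.textFile("hdfs://logs/requests")

// map and filter are narrow: each output partition depends on one input
// partition, so the scheduler can pipeline them into a single stage.
val hits = lines.map(_.split('\t'))
                .filter(fields => fields(2) == "200")

// groupByKey requires all values for a key together, so it introduces a
// shuffle: the pipeline above runs as one stage, the grouping starts another.
val byUrl = hits.map(fields => (fields(1), fields(3).toLong))
                .groupByKey()

// Caching lets later stages reuse these partitions instead of recomputing
// the whole pipeline above them.
val cachedByUrl = byUrl.cache()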

Interactive Spark
Modified Scala interpreter to allow Spark to be used interactively from the command line

Required two changes (an example session follows below):
- Modified wrapper code generation so that each line typed has references to objects for its dependencies
- Distribute generated classes over the network
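For instance, a session like this sketch (the file path and threshold are illustrative) is what those two changes enable: the value defined on one line is captured by the closure typed on the next, so the wrapper class generated for that line, and the objects it references, must be shipped to the workers that run the filter.

scala> val lines = spark.textFile("hdfs://logs/app.log")   // illustrative path
scala> val threshold = 500                                  // defined on the driver
scala> lines.filter(_.length > threshold).count
// the closure references threshold, so the generated class holding it is
// distributed over the network along with the filter tasks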

Demo

Conclusion
Spark provides a simple, efficient, and powerful programming model for a wide range of apps

Download our open source release:

www.spark-project.org

matei@berkeley.edu

Related Work
- DryadLINQ, FlumeJava
  - Similar distributed collection API, but cannot reuse datasets efficiently across queries
- Relational databases
  - Lineage/provenance, logical logging, materialized views
- GraphLab, Piccolo, BigTable, RAMCloud
  - Fine-grained writes similar to distributed shared memory
- Iterative MapReduce (e.g. Twister, HaLoop)
  - Implicit data sharing for a fixed computation pattern
- Caching systems (e.g. Nectar)
  - Store data in files, no explicit control over what is cached

Behavior with Not Enough RAM


[Chart: iteration time (s) vs % of working set in memory — cache disabled: 68.8, 25%: 58.1, 50%: 40.7, 75%: 29.7, fully cached: 11.5]

Fault Recovery Results


[Chart: iteration time (s) per iteration, comparing no failure vs a failure in the 6th iteration — most iterations take roughly 56-59 s; the iteration with the failure takes 119 s while lost partitions are recomputed, after which times return to ~57 s]

Spark Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues

Actions (return a result to driver program):
  collect, reduce, count, save, lookupKey

A small example combining several of these follows below.
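As a quick illustration combining several of the listed operations (the input path is illustrative), a word count might look like:

// Hypothetical word count using flatMap, map, reduceByKey, and collect.
val file = spark.textFile("hdfs://data/docs")
val counts = file.flatMap(line => line.split(" "))   // transformation
                 .map(word => (word, 1))             // transformation
                 .reduceByKey(_ + _)                 // transformation (shuffles by key)
counts.collect()                                     // action: returns results to the driver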
