
Using Apache Spark
Pat McDonough - Databricks

Apache Spark
spark.incubator.apache.org
github.com/apache/incubator-spark
user@spark.incubator.apache.org

The Spark Community
+ You!

INTRODUCTION TO APACHE
SPARK

What is Spark?
Fast and expressive cluster computing system compatible with Apache Hadoop

Efficient
General execution graphs
In-memory storage
Up to 10× faster on disk, 100× in memory

Usable
Rich APIs in Java, Scala, Python
Interactive shell
2-5× less code

Key Concepts
Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
Collections of objects spread across a cluster, stored in RAM or on disk
Built through parallel transformations
Automatically rebuilt on failure

Operations
Transformations (e.g. map, filter, groupBy)
Actions (e.g. count, collect, save)

Working With RDDs

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

linesWithSpark.count()   # 74
linesWithSpark.first()   # "# Apache Spark"

[Diagram: transformations build a chain of RDDs; an action then returns a value to the driver.]

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()

messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
...

[Diagram: the driver sends tasks to workers; each worker processes a block of the file, caches its partition of messages (Cache 1-3 over Blocks 1-3), and sends results back to the driver.]

Full-text search of Wikipedia: 60 GB on 20 EC2 machines; 0.5 sec from cache vs. 20 s for on-disk data

Scaling Down
[Bar chart: execution time (s) drops from 69 s to 58, 41, 30, and 12 s as a larger % of the working set is cached.]

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:

msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])

[Lineage diagram: HDFS File -> filter(func = startswith(...)) -> Filtered RDD -> map(func = split(...)) -> Mapped RDD]
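A quick way to see the lineage Spark is tracking is to print it from the shell. This is a minimal sketch, not from the original slides; it assumes a PySpark shell where sc is defined and RDD.toDebugString() is available, and the HDFS path is illustrative.

textFile = sc.textFile("hdfs://.../log.txt")
msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
               .map(lambda s: s.split("\t")[2])

# Prints the chain of parent RDDs (the lineage) used to rebuild lost partitions.
print(msgs.toDebugString())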

Language Support

Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()

Scala
val lines = sc.textFile(...)
lines.filter(x => x.contains("ERROR")).count()

Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) {
    return s.contains("error");
  }
}).count();

Standalone Programs
Python, Scala, & Java

Interactive Shells
Python & Scala

Performance
Java & Scala are faster due to static typing, but Python is often fine

Interactive Shell
The fastest way to learn Spark
Available in Python and Scala
Runs as an application on an existing Spark cluster, or can run locally

Administrative GUIs
http://<Standalone Master>:8080 (by default)

JOB EXECUTION

Software Components
Spark runs as a library in your program (1 instance per app)
Runs tasks locally or on a cluster: Mesos, YARN, or standalone mode
Accesses storage systems via the Hadoop InputFormat API; can use HBase, HDFS, S3, ...

[Diagram: your application holds a SparkContext, which talks to a cluster manager (or local threads); workers run Spark executors and read from HDFS or other storage.]

Task Scheduler
Supports general task graphs
Automatically pipelines functions
Data locality aware
Partitioning aware to avoid shuffles

[Diagram: an example DAG over RDDs A-F using groupBy, map, filter, and join, split into Stages 1-3; boxes are RDDs, shaded boxes are cached partitions.]
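As a rough illustration of those stage boundaries (a sketch with a hypothetical input path, not from the original slides): narrow operations like map and filter are pipelined into a single stage, while a wide operation like groupByKey forces a shuffle and starts a new stage.

pairs = sc.textFile("hdfs://.../events.log") \
          .map(lambda line: (line.split("\t")[0], 1)) \
          .filter(lambda kv: kv[0] != "")        # pipelined with map, same stage

grouped = pairs.groupByKey()                     # wide dependency -> stage boundary
grouped.cache()                                  # later jobs can reuse these partitions

grouped.count()                                  # action: triggers the staged execution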

Advanced Features
Controllable partitioning
Speed up joins against a dataset

Controllable storage formats
Keep data serialized for efficiency, replicate to multiple nodes, cache on disk

Shared variables: broadcasts, accumulators

See online docs for details! (A short sketch of these features follows below.)
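A minimal sketch of the features above, using illustrative data; the exact storage-level names can vary across Spark versions.

from pyspark import StorageLevel

visits = sc.parallelize([("index.html", "1.2.3.4"), ("about.html", "3.4.5.6")])
names  = sc.parallelize([("index.html", "Home"), ("about.html", "About")])

# Controllable partitioning: hash-partition both RDDs the same way so the join
# can reuse the existing partitioning instead of reshuffling.
visits_p = visits.partitionBy(8).cache()
names_p  = names.partitionBy(8).cache()
joined   = visits_p.join(names_p)

# Controllable storage: spill to disk and keep a second replica of each partition.
joined.persist(StorageLevel.MEMORY_AND_DISK_2)

# Shared variables: an accumulator aggregates values from tasks back to the driver.
bad_records = sc.accumulator(0)
def check(pair):
    if pair[1] is None:
        bad_records.add(1)
    return pair
joined.map(check).count()
print(bad_records.value)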

WORKING WITH SPARK

Using the Shell

Launching:
spark-shell
pyspark (IPYTHON=1)

Modes:
MASTER=local ./spark-shell               # local, 1 thread
MASTER=local[2] ./spark-shell            # local, 2 threads
MASTER=spark://host:port ./spark-shell   # cluster

SparkContext
Main entry point to Spark functionality
Available in the shell as the variable sc
In standalone programs, you'd make your own (see later for details)

Creating RDDs
# Turn a Python collection into an RDD
> sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
> sc.textFile("file.txt")
> sc.textFile("directory/*.txt")
> sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)
> sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations
> nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
> squares = nums.map(lambda x: x*x)   # => {1, 4, 9}

# Keep elements passing a predicate
> even = squares.filter(lambda x: x % 2 == 0)   # => {4}

# Map each element to zero or more others
> nums.flatMap(lambda x: range(x))
> # => {0, 0, 1, 0, 1, 2}   (range(x) is the sequence 0, 1, ..., x-1)

Basic Actions
> nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
> nums.collect()   # => [1, 2, 3]

# Return first K elements
> nums.take(2)     # => [1, 2]

# Count number of elements
> nums.count()     # => 3

# Merge elements with an associative function
> nums.reduce(lambda x, y: x + y)   # => 6

# Write elements to a text file
> nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs

Python:
pair = (a, b)
pair[0]   # => a
pair[1]   # => b

Scala:
val pair = (a, b)
pair._1   // => a
pair._2   // => b

Java:
Tuple2 pair = new Tuple2(a, b);
pair._1   // => a
pair._2   // => b

Some Key-Value Operations
> pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])
> pets.reduceByKey(lambda x, y: x + y)   # => {(cat, 3), (dog, 1)}
> pets.groupByKey()                      # => {(cat, [1, 2]), (dog, [1])}
> pets.sortByKey()                       # => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side (see the sketch below)
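As a rough illustration of why that map-side combining matters (a sketch, not from the original slides): both lines below compute the same per-key sums over the pets RDD above, but reduceByKey pre-aggregates within each partition before the shuffle, while groupByKey ships every individual value across the network.

sums_combined = pets.reduceByKey(lambda x, y: x + y)              # partial sums per partition, then shuffle
sums_grouped  = pets.groupByKey().mapValues(lambda vs: sum(vs))   # all values shuffled, summed afterwards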

Example: Word Count
> lines = sc.textFile("hamlet.txt")
> counts = lines.flatMap(lambda line: line.split(" ")) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(lambda x, y: x + y)

"to be or not to be"
  -> flatMap:     to, be, or, not, to, be
  -> map:         (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1)
  -> reduceByKey: (be, 2), (not, 1), (or, 1), (to, 2)

Other Key-Value Operations
> visits = sc.parallelize([ ("index.html", "1.2.3.4"),
                            ("about.html", "3.4.5.6"),
                            ("index.html", "1.3.3.1") ])

> pageNames = sc.parallelize([ ("index.html", "Home"),
                               ("about.html", "About") ])

> visits.join(pageNames)
# ("index.html", ("1.2.3.4", "Home"))
# ("index.html", ("1.3.3.1", "Home"))
# ("about.html", ("3.4.5.6", "About"))

> visits.cogroup(pageNames)
# ("index.html", (["1.2.3.4", "1.3.3.1"], ["Home"]))
# ("about.html", (["3.4.5.6"], ["About"]))

Setting the Level of Parallelism
All the pair RDD operations take an optional second parameter for the number of tasks:
> words.reduceByKey(lambda x, y: x + y, 5)
> words.groupByKey(5)
> visits.join(pageViews, 5)

Using Local Variables
Any external variables you use in a closure will automatically be shipped to the cluster:
> query = sys.stdin.readline()
> pages.filter(lambda x: query in x).count()

Some caveats:
Each task gets a new copy (updates aren't sent back)
Variable must be Serializable / Pickle-able
Don't use fields of an outer object (ships all of it!)
For large read-only values, a broadcast variable (sketched below) avoids shipping a copy with every task
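A minimal sketch of that broadcast pattern (not from the original slides), assuming a lookup table the tasks only read:

# Without broadcast, `lookup` would be pickled into every task's closure.
# sc.broadcast ships it once per node; tasks read it via .value.
lookup = {"index.html": "Home", "about.html": "About"}   # imagine this is large
lookup_bc = sc.broadcast(lookup)

pages = sc.parallelize(["index.html", "about.html", "index.html"])
titles = pages.map(lambda url: lookup_bc.value.get(url, "Unknown"))
print(titles.collect())   # => ['Home', 'About', 'Home']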

Closure Mishap Example
This is a problem:

class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param)
       .reduce(...)
  }
}
// NotSerializableException: MyCoolRddApp (or Log)

How to get around it:

class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param
    rdd.map(x => x + param_)
       .reduce(...)
  }
}
// References only the local variable param_ instead of this.param
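The same pitfall exists in PySpark (a hedged sketch, not from the original slides): referencing self.param inside a lambda causes the whole object to be pickled with the task, so copy the field into a local variable first.

# Hypothetical PySpark analogue of the fix above: capture just the value, not self.
class MyCoolRddApp(object):
    def __init__(self):
        self.param = 3.14
        self.log = open("/tmp/app.log", "w")   # not picklable, like Log above

    def work(self, rdd):
        param_ = self.param                    # copy the field to a local variable
        return rdd.map(lambda x: x + param_).reduce(lambda a, b: a + b)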

More RDD Operators
map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin, reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip, sample, take, first, partitionBy, mapWith, pipe, save, ...

CREATING SPARK
APPLICATIONS

Add Spark to Your Project
Scala / Java: add a Maven dependency on
  groupId: org.spark-project
  artifactId: spark-core_2.9.3
  version: 0.8.0

Python: run program with our pyspark script

Create a SparkContext

Scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sc = new SparkContext("url", "name", "sparkHome", Seq("app.jar"))
// "url":          Spark cluster URL, or local / local[N]
// "name":         app name
// "sparkHome":    Spark install path on the cluster
// Seq("app.jar"): list of JARs with app code (to ship)

Java:
import org.apache.spark.api.java.JavaSparkContext;
JavaSparkContext sc = new JavaSparkContext(
  "masterUrl", "name", "sparkHome", new String[] {"app.jar"});

Python:
from pyspark import SparkContext
sc = SparkContext("masterUrl", "name", "sparkHome", ["library.py"])

Complete App
import sys
from pyspark import SparkContext

if __name__ == "__main__":
    sc = SparkContext("local", "WordCount", sys.argv[0], None)
    lines = sc.textFile(sys.argv[1])
    counts = lines.flatMap(lambda s: s.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda x, y: x + y)
    counts.saveAsTextFile(sys.argv[2])

EXAMPLE APPLICATION:
PAGERANK

Example: PageRank
Good example of a more complex algorithm: multiple stages of map & reduce
Benefits from Spark's in-memory caching: multiple iterations over the same data

Basic Idea
Give pages ranks (scores) based on links to them
Links from many pages -> high rank
Link from a high-rank page -> high rank

en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs

[Animated diagram: a four-node link graph whose ranks start at 1.0 each and converge over successive iterations to a final state of 1.44, 1.37, 0.73, and 0.46.]

Scala Implementation
val links = // load RDD of (url, neighbors) pairs
var ranks = // load RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
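For readers following the Python examples, here is a rough PySpark equivalent of the loop above. This is a sketch with an assumed input format ("url neighbor1 neighbor2 ..." per line) and hypothetical file paths; the original slides only show the Scala version.

links = sc.textFile("hdfs://.../links.txt") \
          .map(lambda line: line.split()) \
          .map(lambda parts: (parts[0], parts[1:])) \
          .cache()                               # reused every iteration
ranks = links.mapValues(lambda _: 1.0)           # start each page at rank 1

ITERATIONS = 10
for i in range(ITERATIONS):
    # Each page sends rank / #neighbors to its neighbors, then ranks are recombined.
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dest, kv[1][1] / len(kv[1][0])) for dest in kv[1][0]])
    ranks = contribs.reduceByKey(lambda x, y: x + y) \
                    .mapValues(lambda rank: 0.15 + 0.85 * rank)

ranks.saveAsTextFile("hdfs://.../ranks")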

PageRank Performance
[Bar chart: iteration time (s) vs. number of machines - 30 machines: Hadoop 171 s vs. Spark 23 s; 60 machines: Hadoop 80 s vs. Spark 14 s]

Other Iterative Algorithms
[Bar chart: time per iteration (s) - K-Means Clustering: Hadoop 155 s vs. Spark 4.1 s; Logistic Regression: Hadoop 110 s vs. Spark 0.96 s]

CONCLUSION

Conclusion
Spark offers a rich API to make data analytics fast: both fast to write and fast to run
Achieves 100× speedups in real applications
Growing community with 25+ companies contributing

Get Started
Up and Running in a Few
Steps
Download
Unzip
Shell
Project Resources
Examples on the Project Site
Examples in the Distribution
Documentation

http://spark.incubator.apache.org