Table of Contents
Introduction 1.1
Exercises 1.2
Caching 1.2.1
ex-pair-rdd 1.2.2
Pair RDD - Transformations 1.2.3
Parallelism 1.2.4
Encoders 1.2.5
Problems 1.3
Techniques & Best Practices 1.4
SparkSQL 1.5
RDD to Dataset 1.5.1
Use Cases 1.6
Exploratory Data Analysis 1.6.1
References 1.7
Introduction
Learning Spark
Spark in Action
Spark official documentation
StackOverflow questions
Coursera courses
My own exercises
Other internet sources
Scala concepts used throughout these notes (a short tour follows this list):
Higher-order functions
Pattern matching
Implicit conversions
Tuples
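A quick, illustrative tour of these features (toy examples; paste them into a Scala REPL or spark-shell):

val double: Int => Int = _ * 2
List(1, 2, 3).map(double)               // higher-order function: map takes a function

val pair = ("spark", 2)                 // a tuple
pair match {                            // pattern matching destructures it
  case (name, version) => println(s"$name $version")
}

implicit class Exclaim(s: String) {     // implicit conversion: adds a method to String
  def shout: String = s.toUpperCase + "!"
}
println("hello".shout)                  // prints HELLO!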
Spark shells
1. spark-shell
2. pyspark
3. sparkR
4. spark-sql
Exercises
I suggest turning on only ERROR logging in spark-shell, so that the focus stays on your commands and data:
scala> import org.apache.log4j.{Logger, Level}
scala> Logger.getRootLogger().setLevel(Level.ERROR)
If you want to turn on the DEBUG logs from the code provided in these code katas, run:
scala> Logger.getLogger("mylog").setLevel(Level.DEBUG)
and to turn them back off:
scala> Logger.getLogger("mylog").setLevel(Level.ERROR)
Datasets
Summary of the Star Wars Kid video dataset:
Released in 2003
Downloaded 1.1 million times in the first 2 weeks
Has since been downloaded 40 million times after it was moved to YouTube
On April 29 2003, Andy Baio renamed it to Star_Wars_Kid.wmv and posted it
on his site
In 2008, Andy Baio shared the Apache logs for the first 6 months from his site
The file is 1.6 GB unzipped and is available via BitTorrent
Data:
University of Florida
http://www.stat.ufl.edu/~winner/datasets.html
This seems to be a common problem; here's what my research into it has turned up. In practice, a lot of people ask this question here and on other forums, so I would think the default behavior is that Spark's driver does NOT push jars around a cluster, just the classpath. The point of confusion that I, along with other newcomers, commonly suffer from is this:
The driver does NOT push your jars to the cluster. The master in the cluster DOES push your jars to the workers. In theory. I see various things in the docs for Spark that seem to contradict the idea that if you pass an assembly jar and/or dependencies to spark-submit with --jars, these jars are pushed out to the cluster. For example, on the plus side:
"When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster." That's from http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management
So far so good, right? But that is only talking about what happens once your master has the jars: it can push them to workers.
There's a line in the docs for spark-submit about the main parameter, application-jar. Quoting another answer I wrote:
"I assume they mean most people will get the classpath out through a driver
config option. I know most of the docs for spark-submit make it look like the script
handles moving your code around the cluster, but it only moves the classpath
around for you. The driver does not load the jars to the master. For example in this
line from Launching Applications with spark-submit explicitly says you have to
move the jars yourself or make them "globally available":
So I suspect we are seeing a "bug" in the docs. I would explicitly put the jars on
each node, or "globally available" via NFS or HDFS, for another reason from the
docs:
local: - a URI starting with local:/ is expected to exist as a local file on each worker
node. This means that no network IO will be incurred, and works well for large
files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
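For illustration, here's a hedged spark-submit invocation along those lines (the class name and jar paths are made up; the master URL echoes the one that appears in the logs later in these notes). Jars referenced with local:/ must already exist at that path on every worker:

(~/spark-2.1.0-bin-hadoop2.7) ./bin/spark-submit \
  --class com.example.Main \
  --master spark://aravind-VirtualBox:7077 \
  --jars local:/opt/libs/dep1.jar,local:/opt/libs/dep2.jar \
  ./my-app-assembly.jar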
Bottom line, I am coming to realize that to truly understand what Spark is doing,
we need to bypass the docs and read the source code. Even the Java docs don't
list all the parameters for things like parallelize().
Caching
We can use persist() to tell Spark to store an RDD's partitions
On failures of nodes that persist data, lost partitions are recomputed by Spark
In Scala and Java, the default persist() stores the data on the heap as unserialized objects
We can also replicate the data onto multiple nodes, to handle node failures without a slowdown
Persistence is at the partition level, i.e. there should be enough memory to cache all the data of a partition. Partial caching isn't supported as of 1.5.2
An LRU policy is used to evict cached partitions to make room for new ones. When evicting,
MEMORY_ONLY partitions are recomputed the next time they are accessed
MEMORY_AND_DISK partitions are written to disk
A short sketch of these options follows.
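A small sketch (the path is illustrative; note that an RDD's storage level can be set only once):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("access.log")         // illustrative path
lines.persist(StorageLevel.MEMORY_AND_DISK)   // evicted partitions spill to disk
// To instead replicate each cached partition on two nodes, use
// StorageLevel.MEMORY_ONLY_2 (the _2 suffix means two replicas).
lines.count()                                 // the first action materializes the cache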
TODO
ex-pair-rdd
Pair RDD
Exercise 1
Goal
Problem(s)
Problem 1: Create a Pair RDD from this collection: Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
Answer(s)
Answer 1
In general, we can extract a key from the data by running a map() transformation on it. Pairs are represented as tuples: (key, value).
A better way is to use the keyBy() method, so that the intent is explicit.
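The code block itself appears to be missing here; a sketch that reproduces the output below, keying each word by its first letter:

scala> val words = sc.parallelize(Seq("the", "quick", "brown", "fox",
     |   "jumps", "over", "the", "lazy", "dog"))
scala> val wordPair = words.map(w => (w.substring(0, 1), w))   // (key, value) tuples
scala> val wordPair2 = words.keyBy(_.substring(0, 1))          // same pairs, clearer intent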
scala> wordPair.foreach(println)
(o,over)
(l,lazy)
(d,dog)
(t,the)
(j,jumps)
(t,the)
(f,fox)
(b,brown)
(q,quick)
Answer 2
Some methods on SparkContext produce pair RDDs by default, for example when reading files in certain Hadoop formats, as sketched below.
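For example (a sketch; the paths are illustrative):

scala> val files = sc.wholeTextFiles("data/")                  // RDD[(fileName, contents)]
scala> val pairs = sc.sequenceFile[String, Int]("counts.seq")  // Hadoop SequenceFile of (key, value)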
TODO
Pair RDD - Transformations
Exercise 1
Goal
Problem(s)
Problem 1: Calculate page views by day for the Star Wars Kid video (download link).
Answer(s)
Answer 1
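The solution code is missing here; below is a hedged sketch that assumes the standard Apache access-log timestamp format, [dd/MMM/yyyy:..., and builds dd-MM-yyyy keys to match the output that follows:

scala> val months = Map("Jan" -> "01", "Feb" -> "02", "Mar" -> "03", "Apr" -> "04",
     |   "May" -> "05", "Jun" -> "06", "Jul" -> "07", "Aug" -> "08",
     |   "Sep" -> "09", "Oct" -> "10", "Nov" -> "11", "Dec" -> "12")
scala> val dates = sc.textFile("star_wars_kid.log")   // illustrative path
     |   .flatMap { line =>
     |     "\\[(\\d{2})/(\\w{3})/(\\d{4})".r.findFirstMatchIn(line)
     |       .map(m => s"${m.group(1)}-${months(m.group(2))}-${m.group(3)}")
     |   }
scala> val viewsByDate = dates.map(d => (d, 1)).reduceByKey(_ + _)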
scala> viewsByDate.foreach(println)
(03-02-2003,1)
(05-04-2003,1)
(12-04-2003,1913)
(02-03-2003,1)
(26-02-2003,2)
(07-02-2003,1)
(05-01-2003,1)
(22-03-2003,1)
(19-02-2003,1)
(02-01-2003,2)
(11-04-2003,4162)
(22-01-2003,1)
(21-03-2003,1)
(10-04-2003,3902)
(01-03-2003,1)
(05-02-2003,1)
(18-03-2003,2)
(25-02-2003,1)
(06-02-2003,1)
(09-04-2003,1)
(20-02-2003,1)
(23-03-2003,1)
(06-04-2003,1)
Answer 2
Exercise 2
Goal
Problem 1: Calculate page views by day for the Star Wars Kid video (download link).
Answer(s)
Answer 1
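The code block is again missing. One hedged alternative to Exercise 1's reduceByKey is to group first and then count, reusing the dates RDD from the previous sketch (less efficient, since every value is shuffled):

scala> val viewsByDate = dates.groupBy(identity).mapValues(_.size)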
scala> viewsByDate.foreach(println)
(03-02-2003,1)
(26-02-2003,2)
(05-04-2003,1)
(05-01-2003,1)
(19-02-2003,1)
(11-04-2003,4162)
(21-03-2003,1)
(01-03-2003,1)
(12-04-2003,1913)
(18-03-2003,2)
(02-03-2003,1)
(06-02-2003,1)
(20-02-2003,1)
(07-02-2003,1)
(23-03-2003,1)
(22-03-2003,1)
(06-04-2003,1)
(02-01-2003,2)
(22-01-2003,1)
(10-04-2003,3902)
(05-02-2003,1)
(25-02-2003,1)
(09-04-2003,1)
Exercise 3
Goal
Answer(s)
Answer 1
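The setup code isn't shown; a sketch that produces the collection dumped below:

scala> val c = sc.parallelize(Seq((1, "dog"), (1, "cat"), (2, "gnu"),
     |   (2, "salmon"), (2, "rabbit"), (1, "turkey"), (2, "wolf"),
     |   (2, "bear"), (2, "bee")), 3)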
scala> c.collect
res78: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2
,salmon), (2,rabbit), (1,turkey), (2,wolf), (2,bear), (2,bee))
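The grouping step is presumably groupByKey(), given the result below (element order within a group may vary across runs):

scala> val result = c.groupByKey().mapValues(_.toList)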
scala> result.collect
res80: Array[(Int, List[String])] = Array((1,List(turkey, cat, d
og)), (2,List(bee, bear, wolf, rabbit, salmon, gnu)))
Parallelism
Every RDD has a fixed number of partitions that determines the degree of parallelism to use when executing operations on the RDD
Tuning
When performing aggregation and grouping operations, you can tell Spark to use a specific number of partitions, e.g. reduceByKey(func, numPartitions)
If you want to change the partitioning of an RDD outside of these grouping or aggregation operations, you can use repartition() or coalesce()
repartition() reshuffles the data across the network to create a new set of partitions
coalesce() avoids reshuffling data, but ONLY when decreasing the number of existing RDD partitions (use rdd.partitions.size to see the current count)
Watch out for performance-intensive operations like collect(), groupByKey() and reduceByKey()
See this post for a small example of how coalesce() works; a sketch of these knobs follows this list
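A sketch of those knobs (all the numbers are illustrative):

scala> val rdd = sc.parallelize(1 to 1000000, 32)
scala> val pairs = rdd.map(n => (n % 10, 1))
scala> val counts = pairs.reduceByKey(_ + _, 16)  // aggregate into 16 partitions
scala> rdd.partitions.size                        // current partition count: 32
scala> val wider = rdd.repartition(64)            // full shuffle into 64 partitions
scala> val fewer = rdd.coalesce(8)                // merge down to 8, avoiding a shuffle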
We can run multiple jobs within a single SparkContext in parallel by using threads; see the sketch after this quote. Look at the answers from Patrick Liu and yuanbosoft in this post.
Within each Spark application, multiple “jobs” (Spark actions) may be
running concurrently if they were submitted by different threads. This is
common if your application is serving requests over the network.
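A minimal sketch of thread-based job submission using Scala Futures (the file path is illustrative):

scala> import scala.concurrent.{Await, Future}
scala> import scala.concurrent.ExecutionContext.Implicits.global
scala> import scala.concurrent.duration._
scala> val jobA = Future { sc.parallelize(1 to 1000000).sum() }
scala> val jobB = Future { sc.textFile("access.log").count() }
scala> Await.result(jobA, 10.minutes)   // the two jobs run concurrently in one SparkContext
scala> Await.result(jobB, 10.minutes)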
Improving Garbage Collection
Before JDK 1.7 update 4:
CMS - Concurrent Mark and Sweep
focus is on low latency (doesn't do compaction)
hence use it for real-time streaming
Parallel GC
focus is on higher throughput, but results in higher pause times (performs whole-heap compaction)
hence use it for batch jobs
Encoders
Problem: Case classes in Scala 2.10 can support only up to 22 fields. To work
around this limit, you can use custom classes that implement the Product
interface. Provide an example
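A minimal sketch of such a class, trimmed to three fields for readability (a real one would carry the 23+ fields that motivate the workaround):

class WideRecord(val id: Long, val name: String, val score: Double)
    extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
  def productArity: Int = 3
  def productElement(n: Int): Any = n match {
    case 0 => id
    case 1 => name
    case 2 => score
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}

Spark SQL's reflection accepts any Product, so an RDD of WideRecord can then be turned into a DataFrame with createDataFrame, much like an RDD of a case class.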
Problems
Input
val data="""
user date item1 item2
1 2015-12-01 14 5.6
1 2015-12-01 10 0.6
1 2015-12-02 8 9.4
1 2015-12-02 90 1.3
2 2015-12-01 30 0.3
2 2015-12-01 89 1.2
2 2015-12-30 70 1.9
2 2015-12-31 20 2.5
3 2015-12-01 19 9.3
3 2015-12-01 40 2.3
3 2015-12-02 13 1.4
3 2015-12-02 50 1.0
3 2015-12-02 19 7.8
"""
Expected output
For user 1, the latest date is 2015-12-02 and he has two records for that
particular date.
For user 2, the latest date is 2015-12-31 and he has two records for that
particular date.
For user 3, the latest date is 2015-12-02 and he has three records for that
particular date.
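One hedged way to compute this with pair-RDD operations (a sketch, not necessarily the intended solution; it leans on the fact that ISO dates sort lexically):

scala> val rows = data.trim.split("\n").drop(1).map(_.split("\\s+"))
scala> val userDates = sc.parallelize(rows.map(r => (r(0), r(1))))  // (user, date)
scala> val latest = userDates
     |   .map { case (user, date) => ((user, date), 1) }
     |   .reduceByKey(_ + _)                                // records per (user, date)
     |   .map { case ((user, date), n) => (user, (date, n)) }
     |   .reduceByKey((a, b) => if (a._1 > b._1) a else b)  // keep the latest date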
Input sample
{
  "id": "572692378957430785",
  "user": "Srkian_nishu :)",
  "text": "@always_nidhi @YouTube no i dnt understand bt i loved the music nd their dance awesome all the song of this mve is rocking",
  "place": "Orissa",
  "country": "India"
}
{
  "id": "572575240615796737",
  "user": "TagineDiningGlobal",
  "text": "@OnlyDancers Bellydancing this Friday with live music call 646-373-6265 http://t.co/BxZLdiIVM0",
  "place": "Manhattan",
  "country": "United States"
}
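The question for this dataset isn't shown; as a starting point, tweets like these load via the JSON datasource (a sketch assuming one JSON object per line, which is what the json() reader in Spark 2.1 expects; the path is illustrative):

scala> val tweets = spark.read.json("tweets.json")
scala> tweets.printSchema()
scala> tweets.groupBy("country").count().show()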
Techniques & Best Practices
General
Set spark.app.id when running on YARN. It enables monitoring. More details.
Follow best practices for Scala
Minimize network data shuffling by using shared variables and by favoring reduce operations (which reduce the data locally before shuffling it across the network) over grouping operations (which shuffle all the data across the network)
Properly configure your temp work dir cleanups
Serialization
Closure serialization
Every task sent from the driver to a worker gets serialized
The closure and its environment get serialized and shipped to a worker every time it is used, not just once per node. This can slow things down considerably, especially when you have large variables in the environment. This is where broadcast variables can help.
The default Java serialization mechanism is used to serialize closures.
Result serialization: every result from every task gets serialized at some point
In general, tasks larger than about 20 KB are probably worth optimizing
If your objects are large, you may also need to increase the spark.kryoserializer.buffer config. This value needs to be large enough to hold the largest object you will serialize (see the sketch below).
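Two hedged sketches of the remedies mentioned above: a broadcast variable for large closure state, and Kryo with an enlarged buffer (the record class and all values are illustrative):

// Broadcast large lookup state once per executor instead of once per task:
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val resolved = sc.parallelize(Seq("a", "b")).map(k => lookup.value(k))

// Kryo with a bigger buffer, set in the driver before the context is created:
import org.apache.spark.SparkConf

case class SomeRecord(id: Long, payload: Array[Byte])    // illustrative class
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64k")             // initial buffer size
  .set("spark.kryoserializer.buffer.max", "64m")         // must fit your largest object
  .registerKryoClasses(Array(classOf[SomeRecord]))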
Logging
A Spark executor writes all of its logs (including INFO and DEBUG) to stderr; stdout stays empty
spark-shell
Default behavior:
(~/spark-2.1.0-bin-hadoop2.7) spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR
, use setLogLevel(newLevel).
17/02/01 18:00:18 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:00:21 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 18:00:21 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:02:11 WARN ObjectStore: Failed to get database globa
l_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = loc
al-1485990033491).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
scala>
(~/spark-2.1.0-bin-hadoop2.7) spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR
, use setLogLevel(newLevel).
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = loc
al-1485991564309).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
scala>
info.log
spark-submit
command
default behavior
(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample-local.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
17/02/01 18:13:45 INFO SparkContext: Running Spark version 2.1.0
17/02/01 18:13:47 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:13:48 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 18:13:48 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:13:48 INFO SecurityManager: Changing view acls to: a
ravind
17/02/01 18:13:48 INFO SecurityManager: Changing modify acls to:
aravind
17/02/01 18:13:48 INFO SecurityManager: Changing view acls group
s to:
17/02/01 18:13:48 INFO SecurityManager: Changing modify acls gro
ups to:
17/02/01 18:13:48 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); groups with view permissions: Set(); users with
modify permissions: Set(aravind); groups with modify permission
s: Set()
17/02/01 18:13:50 INFO Utils: Successfully started service 'spar
kDriver' on port 35848.
17/02/01 18:13:51 INFO SparkEnv: Registering MapOutputTracker
17/02/01 18:13:51 INFO SparkEnv: Registering BlockManagerMaster
rkPi.scala:38
17/02/01 18:13:55 INFO DAGScheduler: Got job 0 (reduce at SparkP
i.scala:38) with 10 output partitions
17/02/01 18:13:55 INFO DAGScheduler: Final stage: ResultStage 0
(reduce at SparkPi.scala:38)
17/02/01 18:13:55 INFO DAGScheduler: Parents of final stage: Lis
t()
17/02/01 18:13:55 INFO DAGScheduler: Missing parents: List()
17/02/01 18:13:55 INFO DAGScheduler: Submitting ResultStage 0 (M
apPartitionsRDD[1] at map at SparkPi.scala:34), which has no mis
sing parents
17/02/01 18:13:56 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 1832.0 B, free 413.9 MB)
17/02/01 18:13:56 INFO MemoryStore: Block broadcast_0_piece0 sto
red as bytes in memory (estimated size 1172.0 B, free 413.9 MB)
17/02/01 18:13:56 INFO BlockManagerInfo: Added broadcast_0_piece
0 in memory on 10.0.2.15:33106 (size: 1172.0 B, free: 413.9 MB)
17/02/01 18:13:56 INFO SparkContext: Created broadcast 0 from br
oadcast at DAGScheduler.scala:996
17/02/01 18:13:56 INFO DAGScheduler: Submitting 10 missing tasks
from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala
:34)
17/02/01 18:13:56 INFO TaskSchedulerImpl: Adding task set 0.0 wi
th 10 tasks
17/02/01 18:13:56 INFO TaskSetManager: Starting task 0.0 in stag
e 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:56 INFO TaskSetManager: Starting task 1.0 in stag
e 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:56 INFO Executor: Running task 0.0 in stage 0.0 (
TID 0)
17/02/01 18:13:56 INFO Executor: Running task 1.0 in stage 0.0 (
TID 1)
17/02/01 18:13:56 INFO Executor: Fetching spark://10.0.2.15:3584
8/jars/spark-examples_2.11-2.1.0.jar with timestamp 148599083266
0
17/02/01 18:13:56 INFO TransportClientFactory: Successfully crea
ted connection to /10.0.2.15:35848 after 149 ms (0 ms spent in b
ootstraps)
(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample-local.sh
Pi is roughly 3.1403831403831406
info.log
Cluster
default behavior
both the master and the worker create files under the logs directory (one .out file each) and redirect their output to those files
uses the default log profile, printing `Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties`
Master
(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home
/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache
.spark.deploy.master.Master-1-aravind-VirtualBox.out
Slave
(~/spark-2.1.0-bin-hadoop2.7/logs) ll
total 16
drwxrwxr-x 2 aravind aravind 4096 Feb 1 18:50 ./
drwxr-xr-x 16 aravind aravind 4096 Feb 1 18:13 ../
-rw-rw-r-- 1 aravind aravind 2377 Feb 1 18:50 spark-aravind-or
g.apache.spark.deploy.master.Master-1-aravind-VirtualBox.out
-rw-rw-r-- 1 aravind aravind 2448 Feb 1 18:50 spark-aravind-or
g.apache.spark.deploy.worker.Worker-1-aravind-VirtualBox.out
(~/spark-2.1.0-bin-hadoop2.7/logs)
ups to:
17/02/01 18:50:55 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); g
roups with view permissions: Set(); users with modify permissio
ns: Set(aravind); groups with modify permissions: Set()
17/02/01 18:50:56 INFO Utils: Successfully started service 'spar
kWorker' on port 43135.
17/02/01 18:50:56 INFO Worker: Starting Spark worker 10.0.2.15:4
3135 with 2 cores, 1024.0 MB RAM
17/02/01 18:50:56 INFO Worker: Running Spark version 2.1.0
17/02/01 18:50:56 INFO Worker: Spark home: /home/aravind/spark-2
.1.0-bin-hadoop2.7
17/02/01 18:50:56 INFO Utils: Successfully started service 'Work
erUI' on port 8081.
17/02/01 18:50:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0
, and started at http://10.0.2.15:8081
17/02/01 18:50:56 INFO Worker: Connecting to master aravind-Virt
ualBox:7077...
17/02/01 18:50:57 INFO TransportClientFactory: Successfully crea
ted connection to aravind-VirtualBox/127.0.1.1:7077 after 72 ms
(0 ms spent in bo
otstraps)
17/02/01 18:50:57 INFO Worker: Successfully registered with mast
er spark://aravind-VirtualBox:7077
(~/spark-2.1.0-bin-hadoop2.7/logs)
spark-submit to master
command
output
(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
17/02/01 19:44:16 INFO SparkContext: Running Spark version 2.1.0
17/02/01 19:44:29 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 19:44:31 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 19:44:31 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 19:44:32 INFO SecurityManager: Changing view acls to: a
ravind
17/02/01 19:44:32 INFO SecurityManager: Changing modify acls to:
aravind
17/02/01 19:44:32 INFO SecurityManager: Changing view acls group
s to:
17/02/01 19:44:32 INFO SecurityManager: Changing modify acls gro
ups to:
17/02/01 19:44:32 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); groups with view permissions: Set(); users with
modify permissions: Set(aravind); groups with modify permission
s: Set()
17/02/01 19:44:42 INFO Utils: Successfully started service 'spar
kDriver' on port 41120.
17/02/01 19:44:43 INFO SparkEnv: Registering MapOutputTracker
17/02/01 19:44:43 INFO SparkEnv: Registering BlockManagerMaster
17/02/01 19:44:43 INFO BlockManagerMasterEndpoint: Using org.apa
che.spark.storage.DefaultTopologyMapper for getting topology inf
ormation
17/02/01 19:44:43 INFO BlockManagerMasterEndpoint: BlockManagerM
asterEndpoint up
17/02/01 19:44:43 INFO DiskBlockManager: Created local directory
at /tmp/blockmgr-6220cf77-5149-4f51-9d01-fa222f23ba65
17/02/01 19:44:43 INFO MemoryStore: MemoryStore started with cap
acity 413.9 MB
17/02/01 19:44:44 INFO SparkEnv: Registering OutputCommitCoordin
ator
17/02/01 19:44:49 INFO Utils: Successfully started service 'Spar
kUI' on port 4040.
17/02/01 19:44:49 INFO SparkUI: Bound SparkUI to 0.0.0.0, and st
arted at http://10.0.2.15:4040
17/02/01 19:44:49 INFO SparkContext: Added JAR file:/home/aravin
d/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.
1.0.jar at spark://10.0.2.15:41120/jars/spark-examples_2.11-2.1.
0.jar with timestamp 1485996289545
17/02/01 19:44:51 INFO StandaloneAppClient$ClientEndpoint: Conne
cting to master spark://aravind-VirtualBox:7077...
17/02/01 19:44:51 INFO TransportClientFactory: Successfully crea
ted connection to aravind-VirtualBox/127.0.1.1:7077 after 303 ms
(0 ms spent in bootstraps)
17/02/01 19:45:11 INFO StandaloneAppClient$ClientEndpoint: Conne
cting to master spark://aravind-VirtualBox:7077...
17/02/01 19:45:16 INFO StandaloneSchedulerBackend: Connected to
Spark cluster with app ID app-20170201194515-0000
17/02/01 19:45:16 INFO Utils: Successfully started service 'org.
apache.spark.network.netty.NettyBlockTransferService' on port 36
697.
17/02/01 19:45:16 INFO NettyBlockTransferService: Server created
on 10.0.2.15:36697
17/02/01 19:45:16 INFO BlockManager: Using org.apache.spark.stor
age.RandomBlockReplicationPolicy for block replication policy
17/02/01 19:45:16 INFO BlockManagerMaster: Registering BlockMana
ger BlockManagerId(driver, 10.0.2.15, 36697, None)
17/02/01 19:45:16 INFO BlockManagerMasterEndpoint: Registering b
lock manager 10.0.2.15:36697 with 413.9 MB RAM, BlockManagerId(d
river, 10.0.2.15, 36697, None)
17/02/01 19:45:16 INFO BlockManagerMaster: Registered BlockManag
er BlockManagerId(driver, 10.0.2.15, 36697, None)
17/02/01 19:45:16 INFO BlockManager: Initialized BlockManager: B
lockManagerId(driver, 10.0.2.15, 36697, None)
17/02/01 19:45:17 INFO StandaloneAppClient$ClientEndpoint: Execu
tor added: app-20170201194515-0000/0 on worker-20170201185056-10
.0.2.15-43135 (10.0.2.15:43135) with 1 cores
17/02/01 19:45:17 INFO StandaloneSchedulerBackend: Granted execu
tor ID app-20170201194515-0000/0 on hostPort 10.0.2.15:43135 wit
h 1 cores, 512.0 MB RAM
(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home
/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache
.spark.deploy.master.Master-1-aravind-VirtualBox.out
(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample.sh
Pi is roughly 3.1427031427031427
info.log
letContextHandler@2e495ee7{/,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@52a2fb5d{/json,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4ac2d8{/static,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@35c9f9bf{/app/kill,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@51132787{/driver/kill,null,AVAILABLE}
17/02/01 20:15:54 INFO ServerConnector.M(L): Started ServerConne
ctor@7dfc76f8{HTTP/1.1}{0.0.0.0:8080}
17/02/01 20:15:54 INFO Server.M(L): Started @12778ms
17/02/01 20:15:54 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2e5cf022{/,null,AVAILABLE}
17/02/01 20:15:54 INFO ServerConnector.M(L): Started ServerConne
ctor@57ef1427{HTTP/1.1}{aravind-VirtualBox:6066}
17/02/01 20:15:54 INFO Server.M(L): Started @12895ms
17/02/01 20:15:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@6e7baffa{/metrics/master/json,null,AVAILABLE}
17/02/01 20:15:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@40ecf3f6{/metrics/applications/json,null,AVAIL
ABLE}
17/02/01 20:16:32 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on interface enp0s3)
17/02/01 20:16:32 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 20:16:33 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicable
17/02/01 20:16:34 INFO log.M(L): Logging initialized @2814ms
17/02/01 20:16:34 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@768d6913{/logPage,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@500e5e0c{/logPage/json,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3727270a{/,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@e7cc804{/json,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@13a4ed29{/static,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@31783f6f{/log,null,AVAILABLE}
17/02/01 20:16:34 INFO ServerConnector.M(L): Started ServerConne
ctor@1e448d11{HTTP/1.1}{0.0.0.0:8081}
17/02/01 20:16:34 INFO Server.M(L): Started @2946ms
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2c5b237e{/metrics/json,null,AVAILABLE}
17/02/01 20:18:06 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicable
17/02/01 20:18:06 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on interface enp0s3)
17/02/01 20:18:06 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 20:18:08 INFO log.M(L): Logging initialized @3872ms
17/02/01 20:18:08 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7bb3a9fe{/jobs,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7cbee484{/jobs/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7f811d00{/jobs/job,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@62923ee6{/jobs/job/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4089713{/stages,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@f19c9d2{/stages/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7807ac2c{/stages/stage,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@b91d8c4{/stages/stage/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4b6166aa{/stages/pool,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a77614d{/stages/pool/json,null,AVAILABLE}
Shuffling
The spark.serializer setting configures the serializer used not only for shuffling data between worker nodes but also for serializing RDDs to disk
REST interface
Publish metrics to aggregation systems like Graphite and Ganglia, or expose them via JMX
Scheduling
By default, FIFO scheduling is used
Can be configured to use the FAIR scheduler by setting spark.scheduler.mode (see the sketch below)
Spark assigns tasks between jobs in a "round robin" fashion in FAIR mode
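A sketch of opting into FAIR mode (set in the driver before the context is created):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-example")   // illustrative name
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)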
SparkSQL
Introduction
as a library
data processing tasks are expressed as SQL, HiveQL or language-integrated queries in a Scala app
3 key abstractions
SQLContext
HiveContext
DataFrame
as a distributed SQL execution engine
allows multiple users to share a single spark cluster
allows centralized caching across all jobs and across multiple data
stores
DataFrame
is an untyped Dataset
represents a distributed collection of rows organized into named columns
executing SQL queries gives us a DataFrame as the result
is schema-aware, i.e. knows the names and types of its columns
provides methods for processing data
can easily be converted to a regular RDD (see the sketch below)
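A sketch tying those bullets together (the file and view names are illustrative):

val df = spark.read.json("tweets.json")   // a DataFrame, i.e. an untyped Dataset[Row]
df.printSchema()                          // schema-aware: column names and types
df.createOrReplaceTempView("tweets")
val india = spark.sql("SELECT user, text FROM tweets WHERE country = 'India'")
val asRdd = df.rdd                        // easily converted back to a regular RDD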
Encoder/Decoder
Better than the Java and Kryo serialization mechanisms
Allows Spark to perform many operations like sorting, filtering and hashing without deserializing the bytes back into objects
Code is generated dynamically for encoding/decoding
Reduced disk IO
Skip rows
if a data store (e.g. Parquet, ORC) maintains statistical info about its data (the min and max of a particular column in a group), then SparkSQL can take advantage of it to skip those groups
Skip columns (using Parquet's columnar support)
skip non-required partitions
Predicate push-down, i.e. pushing filtering predicates to run natively on data stores like Cassandra, a database, etc. (see the sketch below)
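A sketch of push-down in action (the path and columns are illustrative); the physical plan printed by explain() lists the filters handed to the Parquet scan under PushedFilters:

import spark.implicits._

val events = spark.read.parquet("events.parquet")
events.filter($"country" === "India").select("user", "text").explain(true)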
Dataset
Datasource
Imports
RDD to Dataset
Method 1: Let Spark infer the schema using reflection. This results in more concise code.
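A sketch of Method 1 (all names are illustrative):

import spark.implicits._

case class Person(name: String, age: Int)
val peopleRDD = sc.parallelize(Seq(Person("alice", 29), Person("bob", 31)))
val peopleDS = peopleRDD.toDS()   // schema inferred from the case class via reflection
peopleDS.printSchema()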
Method 2: When you do not know the columns and their types (i.e. the schema) beforehand, build the schema programmatically.
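A sketch of Method 2, building the schema at runtime with StructType (again illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rowRDD = sc.parallelize(Seq(Row("alice", 29), Row("bob", 31)))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = false)))
val peopleDF = spark.createDataFrame(rowRDD, schema)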
Exploratory Data Analysis
After learning about the data using EDA, we can determine which tests and
descriptive statistics are appropriate for the data.
References
https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/
https://ogirardot.wordpress.com/2014/12/11/apache-spark-memory-management-and-graceful-degradation/