
Table of Contents
Introduction 1.1
Exercises 1.2
Caching 1.2.1
ex-pair-rdd 1.2.2
Pair RDD - Transformations 1.2.3
Parallelism 1.2.4
Encoders 1.2.5
Problems 1.3
Solutions 1.4
Techniques & Best Practices 1.5
SparkSQL 1.6
RDD to Dataset 1.6.1
Usecases 1.7
Exploratory Data Analysis 1.7.1
References 1.8

Introduction

Apache Spark Notes


These notes were prepared for quick future reference from the following
sources

Learning Spark
Spark in Action
Spark official documentation
StackOverflow questions
Coursera courses
My own exercises
Other internet sources

It would be good to have a general understanding of the following functional programming concepts (Scala is preferred)

Higher-order functions
Pattern matching
Implicit conversions
Tuples

Spark shells

1. spark-shell
2. pyspark
3. sparkR
4. spark-sql


Exercises
I'd suggest turning ON only ERROR logging in spark-shell so that the focus
stays on your commands and data

scala> import org.apache.log4j.{Level, Logger}
import org.apache.log4j.{Level, Logger}

scala> Logger.getRootLogger().setLevel(Level.ERROR)

If you want to turn ON the DEBUG logs from the code provided in these code
katas, then run the following command

scala> Logger.getLogger("mylog").setLevel(Level.DEBUG)

To turn them OFF

scala> Logger.getLogger("mylog").setLevel(Level.ERROR)

Datasets

Star Wars Kid


Read the story from Andy Baio's blog post here.

Video: Original, Mix

Summary:

Released in 2003
Downloaded 1.1 million times in the first 2 weeks
Downloaded 40 million times since it was moved to YouTube
On April 29, 2003, Andy Baio renamed it to Star_Wars_Kid.wmv and posted it
on his site


In 2008, Andy Baio shared the Apache logs for the first 6 months from his site
The file is 1.6 GB unzipped and available via BitTorrent

Data:

Sample: swk_small.log TODO - add link from github


Original: Torrent link on this website

Univ of Florida
http://www.stat.ufl.edu/~winner/datasets.html

Submitting applications to Spark


From http://stackoverflow.com/questions/34391977/spark-submit-does-automatically-upload-the-jar-to-cluster

This seems to be a common problem, here's what my research into this has
turned up. In practice it seems that a lot of people ask this question here and on
other forums, so I would think the default behavior is that Spark's driver does NOT
push jars around a cluster, just the classpath. The point of confusion that I, along
with other newcomers, commonly suffer from is this:

The driver does NOT push your jars to the cluster. The master in the cluster
DOES push your jars to the workers. In theory. I see various things in the docs for
Spark that seem to contradict the idea that if you pass an assembly jar and/or
dependencies to spark-submit with --jars, that these jars are pushed out to the
cluster. For example, on the plus side,

"When using spark-submit, the application jar along with any jars included with the
--jars option will be automatically transferred to the cluster. " That's from
http://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management

So far so good right? But that is only talking about once your master has the jars,
it can push them to workers.

There's a line on the docs for spark-submit, for the main parameter application-jar.
Quoting another answer I wrote:


"I assume they mean most people will get the classpath out through a driver
config option. I know most of the docs for spark-submit make it look like the script
handles moving your code around the cluster, but it only moves the classpath
around for you. The driver does not load the jars to the master. For example, this
line from Launching Applications with spark-submit explicitly says you have to
move the jars yourself or make them "globally available":

application-jar: Path to a bundled jar including your application and all
dependencies. The URL must be globally visible inside of your cluster, for
instance, an hdfs:// path or a file:// path that is present on all nodes." From an
answer I wrote on Spark workers unable to find JAR on EC2 cluster

So I suspect we are seeing a "bug" in the docs. I would explicitly put the jars on
each node, or "globally available" via NFS or HDFS, for another reason from the
docs:

From Advanced Dependency Management, it seems to present the best of both
worlds, but also a great reason for manually pushing your jars out to all nodes:

local: - a URI starting with local:/ is expected to exist as a local file on each worker
node. This means that no network IO will be incurred, and works well for large
files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.

Bottom line, I am coming to realize that to truly understand what Spark is doing,
we need to bypass the docs and read the source code. Even the Java docs don't
list all the parameters for things like parallelize().


Caching
We can use persist() to tell Spark to store an RDD's partitions
When a node that holds persisted data fails, the lost partitions are recomputed by Spark
In Scala and Java, the default persist() stores the data on the heap as deserialized objects
We can also replicate the data onto multiple nodes to handle node failures
without a slowdown
Persistence is at the partition level, i.e. there should be enough memory to cache
all the data of a partition. Partial caching isn't supported as of 1.5.2
An LRU policy is used to evict cached partitions to make room for new ones.
When evicting,
MEMORY_ONLY partitions are recomputed the next time they are accessed
MEMORY_AND_DISK partitions are written to disk

TODO

add examples of various persist mechanisms (a first sketch follows below)
example for Tachyon
Tungsten
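
A minimal sketch of the persist mechanisms mentioned above (assuming a SparkContext sc is available; the RDDs are only illustrative):

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)
nums.persist(StorageLevel.MEMORY_ONLY)            // same as cache(): deserialized objects on the heap
val pairs = nums.map(n => (n % 10, n))
pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, evicted partitions spill to disk
// MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. additionally replicate each partition on a second node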


Pair RDD

Exercise 1
Goal

Understand different ways of creating pair RDDs

Problem(s)

Problem 1: Create a pair RDD from this collection: Seq("the", "quick", "brown", "fox",
"jumps", "over", "the", "lazy", "dog")

Problem 2: Create pair RDDs from external storage.

Answer(s)

Answer 1

In general, we can extract a key from the data by running a map() transformation
on it. Pairs are represented using tuples (key, value).

A better way is to use the keyBy method so that the intent is explicit (see the sketch after the transcript below).


scala> val words = sc.parallelize(Seq("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
words: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[45] at parallelize at <console>:27

scala> val wordPair = words.map(w => (w.charAt(0), w))
wordPair: org.apache.spark.rdd.RDD[(Char, String)] = MapPartitionsRDD[46] at map at <console>:29

scala> wordPair.foreach(println)
(o,over)
(l,lazy)
(d,dog)
(t,the)
(j,jumps)
(t,the)
(f,fox)
(b,brown)
(q,quick)
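
The same pairs can be built with keyBy, as mentioned above (a sketch reusing the words RDD from the transcript):

val wordPairKeyBy = words.keyBy(w => w.charAt(0))   // RDD[(Char, String)], same pairs as the map() version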

Answer 2

Some methods on SparkContext produce pair RDDs by default when reading files
in certain Hadoop formats.

TODO
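
A couple of sketches for the TODO above (the paths are placeholders): wholeTextFiles() and sequenceFile() both return pair RDDs.

val filesByName = sc.wholeTextFiles("hdfs:///data/logs")   // RDD[(String, String)]: (file path, file content)
val counts = sc.sequenceFile("hdfs:///data/counts",
  classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.IntWritable])   // RDD[(Text, IntWritable)]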


Pair RDD - Transformations


Reference: http://spark.apache.org/docs/latest/programming-guide.html#transformations

Exercise 1
Goal

Understand the usage of reduceByKey() transformation.

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs,
returns a dataset of (K, V) pairs where the values for each key are aggregated
using the given reduce function func, which must be of type (V,V) => V. Like in
groupByKey, the number of reduce tasks is configurable through an optional
second argument.

Problem(s)

Problem 1: Calculate page views by day for the star wars video (download link).

Problem 2: Word count

Answer(s)

Answer 1
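
The transcripts below call a helper parseAccessLog that is not defined in these notes. A minimal sketch of what it could look like; the field name accessDate and the dd-MM-yyyy format are assumptions chosen to match the output shown below:

import scala.collection.mutable

def parseAccessLog(line: String): mutable.Map[String, String] = {
  // e.g. host - - [12/Apr/2003:00:00:01 -0500] "GET /star_wars_kid.wmv HTTP/1.1" 200 1234
  val fields = mutable.Map[String, String]()
  val datePattern = """\[(\d{2})/(\w{3})/(\d{4}):""".r
  val months = Map("Jan" -> "01", "Feb" -> "02", "Mar" -> "03", "Apr" -> "04", "May" -> "05", "Jun" -> "06",
                   "Jul" -> "07", "Aug" -> "08", "Sep" -> "09", "Oct" -> "10", "Nov" -> "11", "Dec" -> "12")
  datePattern.findFirstMatchIn(line).foreach { m =>
    fields("accessDate") = s"${m.group(1)}-${months(m.group(2))}-${m.group(3)}"   // dd-MM-yyyy
  }
  fields
}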

scala> val logs = sc.textFile("file:///c:/_home/swk_small.log")
logs: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27

scala> val parsedLogs = logs.map(line => parseAccessLog(line))
parsedLogs: org.apache.spark.rdd.RDD[scala.collection.mutable.Map[String,String]] = MapPartitionsRDD[51] at map at <console>:31

scala> // create a pair RDD and do reduceByKey() transformation

scala> val viewsByDate = parsedLogs.map(m => (m.getOrElse("accessDate","unknown"), 1)).reduceByKey((x,y) => x+y)
viewsByDate: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[55] at reduceByKey at <console>:33

scala> viewsByDate.foreach(println)
(03-02-2003,1)
(05-04-2003,1)
(12-04-2003,1913)
(02-03-2003,1)
(26-02-2003,2)
(07-02-2003,1)
(05-01-2003,1)
(22-03-2003,1)
(19-02-2003,1)
(02-01-2003,2)
(11-04-2003,4162)
(22-01-2003,1)
(21-03-2003,1)
(10-04-2003,3902)
(01-03-2003,1)
(05-02-2003,1)
(18-03-2003,2)
(25-02-2003,1)
(06-02-2003,1)
(09-04-2003,1)
(20-02-2003,1)
(23-03-2003,1)
(06-04-2003,1)

Answer 2
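
The notes leave Answer 2 blank; a minimal word-count sketch with reduceByKey (reusing the logs RDD, though any RDD of text lines works):

val wordCounts = logs.flatMap(_.split("\\s+")).map(w => (w, 1)).reduceByKey(_ + _)
wordCounts.take(10).foreach(println)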

Exercise 2
Goal

Understand the usage of foldByKey() transformation.


Problem 1: Calculate page views by day for the star wars video (download link).

Answer(s)

Answer 1

scala> val logs = sc.textFile("file:///c:/_home/swk_small.log")
logs: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[74] at textFile at <console>:27

scala> val parsedLogs = logs.map(line => parseAccessLog(line))
parsedLogs: org.apache.spark.rdd.RDD[scala.collection.mutable.Map[String,String]] = MapPartitionsRDD[75] at map at <console>:31

scala> val viewDates = parsedLogs.map(m => (m.getOrElse("accessDate","unknown"), 1))
viewDates: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[77] at map at <console>:33

scala> val viewsByDate = viewDates.foldByKey(0)((x,y) => x+y)
viewsByDate: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[78] at foldByKey at <console>:35

scala> viewsByDate.foreach(println)
(03-02-2003,1)
(26-02-2003,2)
(05-04-2003,1)
(05-01-2003,1)
(19-02-2003,1)
(11-04-2003,4162)
(21-03-2003,1)
(01-03-2003,1)
(12-04-2003,1913)
(18-03-2003,2)
(02-03-2003,1)
(06-02-2003,1)
(20-02-2003,1)
(07-02-2003,1)
(23-03-2003,1)
(22-03-2003,1)
(06-04-2003,1)
(02-01-2003,2)
(22-01-2003,1)
(10-04-2003,3902)
(05-02-2003,1)
(25-02-2003,1)
(09-04-2003,1)

Exercise 3
Goal

Understand the usage of combineByKey() transformation.

def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

Problem 1: From here

Answer(s)

Answer 1

scala> val a = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[87] at parallelize at <console>:27

scala> val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[88] at parallelize at <console>:27

scala> val c = b.zip(a)
c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[89] at zip at <console>:31

scala> c.collect
res78: Array[(Int, String)] = Array((1,dog), (1,cat), (2,gnu), (2,salmon), (2,rabbit), (1,turkey), (2,wolf), (2,bear), (2,bee))


scala> def createCombiner(v: String): List[String] = List[String](v)
createCombiner: (v: String)List[String]

scala> def mergeValue(acc: List[String], value: String): List[String] = value :: acc
mergeValue: (acc: List[String], value: String)List[String]

scala> def mergeCombiners(acc1: List[String], acc2: List[String]): List[String] = acc2 ::: acc1
mergeCombiners: (acc1: List[String], acc2: List[String])List[String]

scala> val result = c.combineByKey(createCombiner, mergeValue, mergeCombiners)
result: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[91] at combineByKey at <console>:39

scala> result.collect
res80: Array[(Int, List[String])] = Array((1,List(turkey, cat, dog)), (2,List(bee, bear, wolf, rabbit, salmon, gnu)))


Parallelism
Every RDD has a fixed number of partitions that determines the degree of
parallelism to use when executing operations on the RDD

Tuning
When performing aggregations and grouping operations, you can tell Spark to
use a specific number of partitions, e.g. reduceByKey(func,
customParallelismOrPartitionsToUse)
If you want to change the partitioning of an RDD outside of these grouping or
aggregation operations, you can use repartition() or coalesce() (see the sketch after this list)
repartition() reshuffles the data across the network to create a new set of
partitions
coalesce() avoids reshuffling data ONLY if we are decreasing the number of
existing RDD partitions (use rdd.partitions.size to see the current number of
partitions)
Watch out for performance-intensive operations like collect(), groupByKey() and
reduceByKey()
See this post for a small example of how coalesce() works
We can run multiple jobs within a single SparkContext in parallel by using
threads. Look at the answers from Patrick Liu and yuanbosoft in this post
Within each Spark application, multiple “jobs” (Spark actions) may be
running concurrently if they were submitted by different threads. This is
common if your application is serving requests over the network.
Improving Garbage Collection
Below JDK 1.7 update 4
CMS - Concurrent Mark and Sweep
focus is on low latency (doesn't do compaction)
hence use it for real-time streaming
Parallel GC
focus is on higher throughput but results in higher pause times
(performs whole-heap-only compaction)
hence use it for batch


Above or equal to JDK 1.7 update 4
G1
Primarily meant for server-side, multi-core machines with large memory
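
A minimal sketch of the partitioning knobs mentioned above (assuming a SparkContext sc; the numbers are illustrative):

val pairs = sc.parallelize(1 to 1000).map(n => (n % 10, n))
pairs.partitions.size                        // current number of partitions
val counts = pairs.reduceByKey(_ + _, 4)     // aggregate using exactly 4 partitions
val wider  = pairs.repartition(8)            // full shuffle into 8 partitions
val fewer  = pairs.coalesce(2)               // shrink to 2 partitions, avoiding a full shuffle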

Encoders

Problem: Case classes in Scala 2.10 can support only up to 22 fields. To work
around this limit, you can use custom classes that implement the Product
interface. Provide an example.
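
A minimal sketch of such a class (the class name, field names and field count are illustrative; a real 23+ field class follows the same pattern):

class WideRecord(val userId: String, val date: String, val item1: Int, val item2: Double)
  extends Product with Serializable {
  // Product requires canEqual, productArity and productElement
  def canEqual(that: Any): Boolean = that.isInstanceOf[WideRecord]
  def productArity: Int = 4
  def productElement(n: Int): Any = n match {
    case 0 => userId
    case 1 => date
    case 2 => item1
    case 3 => item2
    case _ => throw new IndexOutOfBoundsException(n.toString)
  }
}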


Problems

Problem 1: The following is data for a few users.
It contains four columns (userid, date, item1 and
item2). Find each user's latest day's records.

Input

val data="""
user date item1 item2
1 2015-12-01 14 5.6
1 2015-12-01 10 0.6
1 2015-12-02 8 9.4
1 2015-12-02 90 1.3
2 2015-12-01 30 0.3
2 2015-12-01 89 1.2
2 2015-12-30 70 1.9
2 2015-12-31 20 2.5
3 2015-12-01 19 9.3
3 2015-12-01 40 2.3
3 2015-12-02 13 1.4
3 2015-12-02 50 1.0
3 2015-12-02 19 7.8
"""

Expected output
For user 1, the latest date is 2015-12-02 and he has two records for that
particular date.
For user 2, the latest date is 2015-12-31 and he has two records for that
particular date.
For user 3, the latest date is 2015-12-02 and he has three records for that
particular date.
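
A minimal sketch of one way to solve this with pair RDDs (not an official solution from these notes; it reuses the data string and sc, and relies on ISO dates comparing correctly as strings):

val rows = data.trim.split("\n").drop(1).map(_.trim.split("\\s+"))        // drop the header row
val records = sc.parallelize(rows.map(r => (r(0), (r(1), r(2), r(3)))))    // (userid, (date, item1, item2))
val latest = records.mapValues(_._1).reduceByKey((a, b) => if (a > b) a else b)   // latest date per user
val result = records.map { case (u, (d, i1, i2)) => ((u, d), (i1, i2)) }
  .join(latest.map { case (u, d) => ((u, d), ()) })                        // keep only rows on the latest date
  .map { case ((u, d), ((i1, i2), _)) => (u, d, i1, i2) }
result.collect().foreach(println)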


Problem 2: From the tweet data set here, find
the following (this is my own solution version
of the excellent article Getting started with Spark
in practice)
all the tweets by user
how many tweets each user has
all the persons mentioned on tweets
Count how many times each person is mentioned
Find the 10 most mentioned persons
Find all the hashtags mentioned on a tweet
Count how many times each hashtag is mentioned
Find the 10 most popular Hashtags

Input sample

{
"id":"572692378957430785",
"user":"Srkian_nishu :)",
"text":"@always_nidhi @YouTube no i dnt understand bt i loved
the music nd their dance awesome all the song of this mve is ro
cking",
"place":"Orissa",
"country":"India"
}
{
"id":"572575240615796737",
"user":"TagineDiningGlobal",
"text":"@OnlyDancers Bellydancing this Friday with live music
call 646-373-6265 http://t.co/BxZLdiIVM0",
"place":"Manhattan",
"country":"United States"
}
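
A minimal sketch for a couple of the queries above, using the DataFrame API (the file path is a placeholder, and it assumes one JSON object per line):

import org.apache.spark.sql.functions._

val tweets = spark.read.json("tweets.json")
// how many tweets each user has
tweets.groupBy("user").count().show()
// the 10 most popular hashtags
tweets.select(explode(split(col("text"), "\\s+")).as("word"))
  .filter(col("word").startsWith("#"))
  .groupBy("word").count()
  .orderBy(desc("count"))
  .show(10)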


Problem 3: Demonstrate the data virtualization
capabilities of SparkSQL by joining data across
different data stores, i.e. RDBMS, Parquet, Avro

Solutions

Techniques & Best Practices

General
Set spark.app.id when running on YARN. It enables monitoring. More details.
Follow best practices for Scala
Minimize network data shuffling by using shared variables, and favoring
reduce operations (which reduce the data locally before shuffling across the
network) over grouping (which shuffles all the data across the network)
Properly configure your temp work dir cleanups

Serialization
Closure serialization
Every task run from Driver to Worker gets serialized
The closure and its environment get serialized to the worker every time it
is used, not just once per node. This might slow things down
considerably, especially when you have large variables in the environment.
This is where broadcast variables can help.
The default Java serialization mechanism is used to serialize closures.
Result serialization: Every result from every task gets serialized at some point
In general, tasks larger than about 20 KB are probably worth optimizing
If your objects are large, you may also need to increase the
spark.kryoserializer.buffer config. This value needs to be large
enough to hold the largest object you will serialize.
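
A minimal sketch of switching to Kryo and raising the buffer, as suggested above (the app name and the 64m value are only illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64m")   // large enough for the biggest object you serialize
val sc = new SparkContext(conf)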

Broadcast and Accumulator


Both are tied to a single SparkContext and hence can't be shared across
contexts
According to a study, broadcasting data of around 1GB to larger workers (40
max) reduces the performance considerably. Refer to this paper.


Logging
A Spark Executor writes all its logs (including INFO, DEBUG) to stderr. The
stdout is empty

spark-shell
Default behavior of spark-shell


(~/spark-2.1.0-bin-hadoop2.7) spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR
, use setLogLevel(newLevel).
17/02/01 18:00:18 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:00:21 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 18:00:21 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:02:11 WARN ObjectStore: Failed to get database globa
l_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = loc
al-1485990033491).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

with custom log4j


(~/spark-2.1.0-bin-hadoop2.7) spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR
, use setLogLevel(newLevel).
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = loc
al-1485991564309).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

info.log

(~/spark-2.1.0-bin-hadoop2.7/logs) cat info.log | more
17/02/01 18:26:02 WARN NativeCodeLoader.M(L): Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/01 18:26:03 WARN Utils.M(L): Your hostname, aravind-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
17/02/01 18:26:03 WARN Utils.M(L): Set SPARK_LOCAL_IP if you need to bind to another address
17/02/01 18:26:30 WARN ObjectStore.M(L): Failed to get database global_temp, returning NoSuchObjectException

spark-submit


command

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --executor-memory 512MB \
  --total-executor-cores 1 \
  examples/jars/spark-examples_2.11-2.1.0.jar \
  10

default behavior

(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample-local.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
17/02/01 18:13:45 INFO SparkContext: Running Spark version 2.1.0
17/02/01 18:13:47 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:13:48 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 18:13:48 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:13:48 INFO SecurityManager: Changing view acls to: a
ravind
17/02/01 18:13:48 INFO SecurityManager: Changing modify acls to:
aravind
17/02/01 18:13:48 INFO SecurityManager: Changing view acls group
s to:
17/02/01 18:13:48 INFO SecurityManager: Changing modify acls gro
ups to:
17/02/01 18:13:48 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); groups with view permissions: Set(); users with
modify permissions: Set(aravind); groups with modify permission
s: Set()
17/02/01 18:13:50 INFO Utils: Successfully started service 'spar
kDriver' on port 35848.
17/02/01 18:13:51 INFO SparkEnv: Registering MapOutputTracker
17/02/01 18:13:51 INFO SparkEnv: Registering BlockManagerMaster

17/02/01 18:13:51 INFO BlockManagerMasterEndpoint: Using org.apa
che.spark.storage.DefaultTopologyMapper for getting topology inf
ormation
17/02/01 18:13:51 INFO BlockManagerMasterEndpoint: BlockManagerM
asterEndpoint up
17/02/01 18:13:51 INFO DiskBlockManager: Created local directory
at /tmp/blockmgr-dd0f99cf-b0a4-4793-bf19-d400d5305456
17/02/01 18:13:51 INFO MemoryStore: MemoryStore started with cap
acity 413.9 MB
17/02/01 18:13:51 INFO SparkEnv: Registering OutputCommitCoordin
ator
17/02/01 18:13:52 INFO Utils: Successfully started service 'Spar
kUI' on port 4040.
17/02/01 18:13:52 INFO SparkUI: Bound SparkUI to 0.0.0.0, and st
arted at http://10.0.2.15:4040
17/02/01 18:13:52 INFO SparkContext: Added JAR file:/home/aravin
d/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.
1.0.jar at spark://10.0.2.15:35848/jars/spark-examples_2.11-2.1.
0.jar with timestamp 1485990832660
17/02/01 18:13:52 INFO Executor: Starting executor ID driver on
host localhost
17/02/01 18:13:52 INFO Utils: Successfully started service 'org.
apache.spark.network.netty.NettyBlockTransferService' on port 33
106.
17/02/01 18:13:52 INFO NettyBlockTransferService: Server created
on 10.0.2.15:33106
17/02/01 18:13:52 INFO BlockManager: Using org.apache.spark.stor
age.RandomBlockReplicationPolicy for block replication policy
17/02/01 18:13:52 INFO BlockManagerMaster: Registering BlockMana
ger BlockManagerId(driver, 10.0.2.15, 33106, None)
17/02/01 18:13:52 INFO BlockManagerMasterEndpoint: Registering b
lock manager 10.0.2.15:33106 with 413.9 MB RAM, BlockManagerId(d
river, 10.0.2.15, 33106, None)
17/02/01 18:13:52 INFO BlockManagerMaster: Registered BlockManag
er BlockManagerId(driver, 10.0.2.15, 33106, None)
17/02/01 18:13:52 INFO BlockManager: Initialized BlockManager: B
lockManagerId(driver, 10.0.2.15, 33106, None)
17/02/01 18:13:53 INFO SharedState: Warehouse path is 'file:/hom
e/aravind/spark-2.1.0-bin-hadoop2.7/spark-warehouse/'.
17/02/01 18:13:55 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
17/02/01 18:13:55 INFO DAGScheduler: Got job 0 (reduce at SparkP
i.scala:38) with 10 output partitions
17/02/01 18:13:55 INFO DAGScheduler: Final stage: ResultStage 0
(reduce at SparkPi.scala:38)
17/02/01 18:13:55 INFO DAGScheduler: Parents of final stage: Lis
t()
17/02/01 18:13:55 INFO DAGScheduler: Missing parents: List()
17/02/01 18:13:55 INFO DAGScheduler: Submitting ResultStage 0 (M
apPartitionsRDD[1] at map at SparkPi.scala:34), which has no mis
sing parents
17/02/01 18:13:56 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 1832.0 B, free 413.9 MB)
17/02/01 18:13:56 INFO MemoryStore: Block broadcast_0_piece0 sto
red as bytes in memory (estimated size 1172.0 B, free 413.9 MB)
17/02/01 18:13:56 INFO BlockManagerInfo: Added broadcast_0_piece
0 in memory on 10.0.2.15:33106 (size: 1172.0 B, free: 413.9 MB)
17/02/01 18:13:56 INFO SparkContext: Created broadcast 0 from br
oadcast at DAGScheduler.scala:996
17/02/01 18:13:56 INFO DAGScheduler: Submitting 10 missing tasks
from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala
:34)
17/02/01 18:13:56 INFO TaskSchedulerImpl: Adding task set 0.0 wi
th 10 tasks
17/02/01 18:13:56 INFO TaskSetManager: Starting task 0.0 in stag
e 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:56 INFO TaskSetManager: Starting task 1.0 in stag
e 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:56 INFO Executor: Running task 0.0 in stage 0.0 (
TID 0)
17/02/01 18:13:56 INFO Executor: Running task 1.0 in stage 0.0 (
TID 1)
17/02/01 18:13:56 INFO Executor: Fetching spark://10.0.2.15:3584
8/jars/spark-examples_2.11-2.1.0.jar with timestamp 148599083266
0
17/02/01 18:13:56 INFO TransportClientFactory: Successfully crea
ted connection to /10.0.2.15:35848 after 149 ms (0 ms spent in b
ootstraps)

17/02/01 18:13:56 INFO Utils: Fetching spark://10.0.2.15:35848/j
ars/spark-examples_2.11-2.1.0.jar to /tmp/spark-aa3139da-fd4b-47
3b-9e0c-42858ae78db5/userFiles-38c779f8-e12f-420e-a4d3-ac20a561a
919/fetchFileTemp2346557375955678751.tmp
17/02/01 18:13:57 INFO Executor: Adding file:/tmp/spark-aa3139da
-fd4b-473b-9e0c-42858ae78db5/userFiles-38c779f8-e12f-420e-a4d3-a
c20a561a919/spark-examples_2.11-2.1.0.jar to class loader
17/02/01 18:13:57 INFO Executor: Finished task 0.0 in stage 0.0
(TID 0). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO Executor: Finished task 1.0 in stage 0.0
(TID 1). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 2.0 in stag
e 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 2.0 in stage 0.0 (
TID 2)
17/02/01 18:13:57 INFO TaskSetManager: Starting task 3.0 in stag
e 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 3.0 in stage 0.0 (
TID 3)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 0.0 in stag
e 0.0 (TID 0) in 996 ms on localhost (executor driver) (1/10)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 1.0 in stag
e 0.0 (TID 1) in 940 ms on localhost (executor driver) (2/10)
17/02/01 18:13:57 INFO Executor: Finished task 3.0 in stage 0.0
(TID 3). 1041 bytes result sent to driver
17/02/01 18:13:57 INFO Executor: Finished task 2.0 in stage 0.0
(TID 2). 1041 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 4.0 in stag
e 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 4.0 in stage 0.0 (
TID 4)
17/02/01 18:13:57 INFO TaskSetManager: Starting task 5.0 in stag
e 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 5.0 in stage 0.0 (
TID 5)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 192 ms on localhost (executor driver) (3/10)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 3.0 in stag
e 0.0 (TID 3) in 224 ms on localhost (executor driver) (4/10)
17/02/01 18:13:57 INFO Executor: Finished task 4.0 in stage 0.0
(TID 4). 1201 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 6.0 in stag
e 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 6.0 in stage 0.0 (
TID 6)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 4.0 in stag
e 0.0 (TID 4) in 110 ms on localhost (executor driver) (5/10)
17/02/01 18:13:57 INFO Executor: Finished task 5.0 in stage 0.0
(TID 5). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 7.0 in stag
e 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 5.0 in stag
e 0.0 (TID 5) in 112 ms on localhost (executor driver) (6/10)
17/02/01 18:13:57 INFO Executor: Running task 7.0 in stage 0.0 (
TID 7)
17/02/01 18:13:57 INFO Executor: Finished task 6.0 in stage 0.0
(TID 6). 1128 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 8.0 in stag
e 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 6.0 in stag
e 0.0 (TID 6) in 68 ms on localhost (executor driver) (7/10)
17/02/01 18:13:57 INFO Executor: Running task 8.0 in stage 0.0 (
TID 8)
17/02/01 18:13:57 INFO Executor: Finished task 7.0 in stage 0.0
(TID 7). 1041 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 9.0 in stag
e 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 7.0 in stag
e 0.0 (TID 7) in 54 ms on localhost (executor driver) (8/10)
17/02/01 18:13:57 INFO Executor: Running task 9.0 in stage 0.0 (
TID 9)
17/02/01 18:13:57 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Finished task 8.0 in stag
e 0.0 (TID 8) in 112 ms on localhost (executor driver) (9/10)
17/02/01 18:13:57 INFO Executor: Finished task 9.0 in stage 0.0
(TID 9). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Finished task 9.0 in stag
e 0.0 (TID 9) in 123 ms on localhost (executor driver) (10/10)
17/02/01 18:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, w
hose tasks have all completed, from pool
17/02/01 18:13:57 INFO DAGScheduler: ResultStage 0 (reduce at Sp
arkPi.scala:38) finished in 1.376 s
17/02/01 18:13:57 INFO DAGScheduler: Job 0 finished: reduce at S
parkPi.scala:38, took 2.439870 s
Pi is roughly 3.141711141711142
17/02/01 18:13:57 INFO SparkUI: Stopped Spark web UI at http://1
0.0.2.15:4040
17/02/01 18:13:57 INFO MapOutputTrackerMasterEndpoint: MapOutput
TrackerMasterEndpoint stopped!
17/02/01 18:13:57 INFO MemoryStore: MemoryStore cleared
17/02/01 18:13:57 INFO BlockManager: BlockManager stopped
17/02/01 18:13:57 INFO BlockManagerMaster: BlockManagerMaster st
opped
17/02/01 18:13:57 INFO OutputCommitCoordinator$OutputCommitCoord
inatorEndpoint: OutputCommitCoordinator stopped!
17/02/01 18:13:57 INFO SparkContext: Successfully stopped SparkC
ontext
17/02/01 18:13:57 INFO ShutdownHookManager: Shutdown hook called
17/02/01 18:13:57 INFO ShutdownHookManager: Deleting directory /
tmp/spark-aa3139da-fd4b-473b-9e0c-42858ae78db5
(~/spark-2.1.0-bin-hadoop2.7)

with custom log4j

(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample-local.sh
Pi is roughly 3.1403831403831406

info.log

(~/spark-2.1.0-bin-hadoop2.7/logs) cat info.log | more

17/02/01 18:26:02 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicab
le
17/02/01 18:26:03 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on inter
face enp0s3)
17/02/01 18:26:03 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 18:26:30 WARN ObjectStore.M(L): Failed to get database
global_temp, returning NoSuchObjectException
17/02/01 18:35:50 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicab
le
17/02/01 18:35:51 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on inter
face enp0s3)
17/02/01 18:35:51 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 18:35:54 INFO log.M(L): Logging initialized @10835ms
17/02/01 18:35:55 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7bb3a9fe{/jobs,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7cbee484{/jobs/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7f811d00{/jobs/job,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@62923ee6{/jobs/job/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4089713{/stages,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@f19c9d2{/stages/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7807ac2c{/stages/stage,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@b91d8c4{/stages/stage/json,null,AVAILABLE}

17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4b6166aa{/stages/pool,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a77614d{/stages/pool/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4fd4cae3{/storage,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4a067c25{/storage/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a1217f9{/storage/rdd,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3bde62ff{/storage/rdd/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@523424b5{/environment,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2baa8d82{/environment/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@319dead1{/executors,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@791cbf87{/executors/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a7e2d9d{/executors/threadDump,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@754777cd{/executors/threadDump/json,null,AVAIL
ABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2b52c0d6{/static,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@372ea2bc{/,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4cc76301{/api,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2f08c4b{/jobs/job/kill,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3f19b8b3{/stages/stage/kill,null,AVAILABLE}
17/02/01 18:35:55 INFO ServerConnector.M(L): Started ServerConne
ctor@4efcf8a{HTTP/1.1}{0.0.0.0:4040}
17/02/01 18:35:55 INFO Server.M(L): Started @11454ms
17/02/01 18:35:56 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@778ca8ef{/metrics/json,null,AVAILABLE}

17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4c060c8f{/SQL,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@383f3558{/SQL/json,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@6ea94d6a{/SQL/execution,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4d7e7435{/SQL/execution/json,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@294bdeb4{/static/sql,null,AVAILABLE}
17/02/01 18:36:00 INFO ServerConnector.M(L): Stopped ServerConne
ctor@4efcf8a{HTTP/1.1}{0.0.0.0:4040}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@3f19b8b3{/stages/stage/kill,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2f08c4b{/jobs/job/kill,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4cc76301{/api,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@372ea2bc{/,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2b52c0d6{/static,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@754777cd{/executors/threadDump/json,null,UNAVA
ILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a7e2d9d{/executors/threadDump,null,UNAVAILABLE
}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@791cbf87{/executors/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@319dead1{/executors,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2baa8d82{/environment/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@523424b5{/environment,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@3bde62ff{/storage/rdd/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a1217f9{/storage/rdd,null,UNAVAILABLE}

17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4a067c25{/storage/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4fd4cae3{/storage,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a77614d{/stages/pool/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4b6166aa{/stages/pool,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@b91d8c4{/stages/stage/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7807ac2c{/stages/stage,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@f19c9d2{/stages/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4089713{/stages,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@62923ee6{/jobs/job/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7f811d00{/jobs/job,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7cbee484{/jobs/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7bb3a9fe{/jobs,null,UNAVAILABLE}
(~/spark-2.1.0-bin-hadoop2.7/logs)

Cluster
Master

(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache.spark.deploy.master.Master-1-aravind-VirtualBox.out

Slave

(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-slave.sh spark://aravind-VirtualBox:7077
starting org.apache.spark.deploy.worker.Worker, logging to /home/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache.spark.deploy.worker.Worker-1-aravind-VirtualBox.out

Shuffling
The spark.serializer setting configures the serializer used not only for shuffling data between
worker nodes but also for serializing RDDs to disk

Monitoring Spark Apps


Web UI
Spark launches an embedded web app by default to monitor a running
application; it is launched on the Driver
It serves the monitoring UI on port 4040 by default
If multiple apps are running simultaneously, they bind to successive ports
REST interface
Publish metrics to aggregation systems like Graphite, Ganglia and
JMX

Scheduling
By default FIFO is used
Can be configured to use FAIR scheduler by changing spark.scheduler.mode.
Spark assigns tasks between jobs in a “round robin” fashion in FAIR mode
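
A minimal sketch of switching the scheduler mode (set before the SparkContext is created; the app name is illustrative and FIFO remains the default):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fair-scheduling-example")
  .set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)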


Techniques & Best Practices

General
Set spark.app.id when running on YARN. It enables monitoring. More details.
Follow best practices for Scala
Minimize network data shuffling by using shared variables, and favoring
reduce operations (which reduce the data locally before shuffling across the
network) over grouping (which shuffles all the data across the network)
Properly configure your temp work dir cleanups

Serialization
Closure serialization
Every task run from Driver to Worker gets serialized
The closure and its environment get serialized to the worker every time it
is used, not just once per node. This might slow things down
considerably, especially when you have large variables in the environment.
This is where broadcast variables can help.
The default Java serialization mechanism is used to serialize closures.
Result serialization: Every result from every task gets serialized at some point
In general, tasks larger than about 20 KB are probably worth optimizing
If your objects are large, you may also need to increase the
spark.kryoserializer.buffer config. This value needs to be large
enough to hold the largest object you will serialize.

Broadcast and Accumulator


Both are tied to a single SparkContext and hence can't be shared across
contexts
Broadcast
Broadcasted variables are sent to the worker node only once, even though
each node might be handling more than one task (that is dependent on this
variable) of the same job
Broadcasted variables are cached in-memory of the nodes, ready to be
used during execution of the program
Gossip (BitTorrent like) protocol is used to share data among nodes so
that the burden isn't always on the driver node to distribute the variables
According to a study, broadcasting data of around 1GB to larger workers
(40 max) reduces the performance considerably. Refer to this paper.
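
A minimal sketch of using a broadcast variable (the lookup table and RDD are only illustrative):

val statusNames = sc.broadcast(Map("200" -> "OK", "404" -> "Not Found", "500" -> "Server Error"))
val codes = sc.parallelize(Seq("200", "404", "200", "500"))
codes.map(c => statusNames.value.getOrElse(c, "other")).collect().foreach(println)   // lookup shipped once per executor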

Logging
A Spark Executor writes all its logs (including INFO, DEBUG) to stderr. The
stdout is empty

spark-shell
Default behavior of spark-shell


(~/spark-2.1.0-bin-hadoop2.7) spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR
, use setLogLevel(newLevel).
17/02/01 18:00:18 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:00:21 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 18:00:21 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:02:11 WARN ObjectStore: Failed to get database globa
l_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = loc
al-1485990033491).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

with custom log4j


(~/spark-2.1.0-bin-hadoop2.7) spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR
, use setLogLevel(newLevel).
Spark context Web UI available at http://10.0.2.15:4040
Spark context available as 'sc' (master = local[*], app id = loc
al-1485991564309).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

info.log

(~/spark-2.1.0-bin-hadoop2.7/logs) cat info.log | more
17/02/01 18:26:02 WARN NativeCodeLoader.M(L): Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/02/01 18:26:03 WARN Utils.M(L): Your hostname, aravind-VirtualBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
17/02/01 18:26:03 WARN Utils.M(L): Set SPARK_LOCAL_IP if you need to bind to another address
17/02/01 18:26:30 WARN ObjectStore.M(L): Failed to get database global_temp, returning NoSuchObjectException

spark-submit


command

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --executor-memory 512MB \
  --total-executor-cores 1 \
  examples/jars/spark-examples_2.11-2.1.0.jar \
  10

default behavior

(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample-local.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
17/02/01 18:13:45 INFO SparkContext: Running Spark version 2.1.0
17/02/01 18:13:47 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:13:48 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 18:13:48 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:13:48 INFO SecurityManager: Changing view acls to: a
ravind
17/02/01 18:13:48 INFO SecurityManager: Changing modify acls to:
aravind
17/02/01 18:13:48 INFO SecurityManager: Changing view acls group
s to:
17/02/01 18:13:48 INFO SecurityManager: Changing modify acls gro
ups to:
17/02/01 18:13:48 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); groups with view permissions: Set(); users with
modify permissions: Set(aravind); groups with modify permission
s: Set()
17/02/01 18:13:50 INFO Utils: Successfully started service 'spar
kDriver' on port 35848.
17/02/01 18:13:51 INFO SparkEnv: Registering MapOutputTracker
17/02/01 18:13:51 INFO SparkEnv: Registering BlockManagerMaster

17/02/01 18:13:51 INFO BlockManagerMasterEndpoint: Using org.apa
che.spark.storage.DefaultTopologyMapper for getting topology inf
ormation
17/02/01 18:13:51 INFO BlockManagerMasterEndpoint: BlockManagerM
asterEndpoint up
17/02/01 18:13:51 INFO DiskBlockManager: Created local directory
at /tmp/blockmgr-dd0f99cf-b0a4-4793-bf19-d400d5305456
17/02/01 18:13:51 INFO MemoryStore: MemoryStore started with cap
acity 413.9 MB
17/02/01 18:13:51 INFO SparkEnv: Registering OutputCommitCoordin
ator
17/02/01 18:13:52 INFO Utils: Successfully started service 'Spar
kUI' on port 4040.
17/02/01 18:13:52 INFO SparkUI: Bound SparkUI to 0.0.0.0, and st
arted at http://10.0.2.15:4040
17/02/01 18:13:52 INFO SparkContext: Added JAR file:/home/aravin
d/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.
1.0.jar at spark://10.0.2.15:35848/jars/spark-examples_2.11-2.1.
0.jar with timestamp 1485990832660
17/02/01 18:13:52 INFO Executor: Starting executor ID driver on
host localhost
17/02/01 18:13:52 INFO Utils: Successfully started service 'org.
apache.spark.network.netty.NettyBlockTransferService' on port 33
106.
17/02/01 18:13:52 INFO NettyBlockTransferService: Server created
on 10.0.2.15:33106
17/02/01 18:13:52 INFO BlockManager: Using org.apache.spark.stor
age.RandomBlockReplicationPolicy for block replication policy
17/02/01 18:13:52 INFO BlockManagerMaster: Registering BlockMana
ger BlockManagerId(driver, 10.0.2.15, 33106, None)
17/02/01 18:13:52 INFO BlockManagerMasterEndpoint: Registering b
lock manager 10.0.2.15:33106 with 413.9 MB RAM, BlockManagerId(d
river, 10.0.2.15, 33106, None)
17/02/01 18:13:52 INFO BlockManagerMaster: Registered BlockManag
er BlockManagerId(driver, 10.0.2.15, 33106, None)
17/02/01 18:13:52 INFO BlockManager: Initialized BlockManager: B
lockManagerId(driver, 10.0.2.15, 33106, None)
17/02/01 18:13:53 INFO SharedState: Warehouse path is 'file:/hom
e/aravind/spark-2.1.0-bin-hadoop2.7/spark-warehouse/'.
17/02/01 18:13:55 INFO SparkContext: Starting job: reduce at SparkPi.scala:38
17/02/01 18:13:55 INFO DAGScheduler: Got job 0 (reduce at SparkP
i.scala:38) with 10 output partitions
17/02/01 18:13:55 INFO DAGScheduler: Final stage: ResultStage 0
(reduce at SparkPi.scala:38)
17/02/01 18:13:55 INFO DAGScheduler: Parents of final stage: Lis
t()
17/02/01 18:13:55 INFO DAGScheduler: Missing parents: List()
17/02/01 18:13:55 INFO DAGScheduler: Submitting ResultStage 0 (M
apPartitionsRDD[1] at map at SparkPi.scala:34), which has no mis
sing parents
17/02/01 18:13:56 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 1832.0 B, free 413.9 MB)
17/02/01 18:13:56 INFO MemoryStore: Block broadcast_0_piece0 sto
red as bytes in memory (estimated size 1172.0 B, free 413.9 MB)
17/02/01 18:13:56 INFO BlockManagerInfo: Added broadcast_0_piece
0 in memory on 10.0.2.15:33106 (size: 1172.0 B, free: 413.9 MB)
17/02/01 18:13:56 INFO SparkContext: Created broadcast 0 from br
oadcast at DAGScheduler.scala:996
17/02/01 18:13:56 INFO DAGScheduler: Submitting 10 missing tasks
from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala
:34)
17/02/01 18:13:56 INFO TaskSchedulerImpl: Adding task set 0.0 wi
th 10 tasks
17/02/01 18:13:56 INFO TaskSetManager: Starting task 0.0 in stag
e 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:56 INFO TaskSetManager: Starting task 1.0 in stag
e 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:56 INFO Executor: Running task 0.0 in stage 0.0 (
TID 0)
17/02/01 18:13:56 INFO Executor: Running task 1.0 in stage 0.0 (
TID 1)
17/02/01 18:13:56 INFO Executor: Fetching spark://10.0.2.15:3584
8/jars/spark-examples_2.11-2.1.0.jar with timestamp 148599083266
0
17/02/01 18:13:56 INFO TransportClientFactory: Successfully crea
ted connection to /10.0.2.15:35848 after 149 ms (0 ms spent in b
ootstraps)

17/02/01 18:13:56 INFO Utils: Fetching spark://10.0.2.15:35848/j
ars/spark-examples_2.11-2.1.0.jar to /tmp/spark-aa3139da-fd4b-47
3b-9e0c-42858ae78db5/userFiles-38c779f8-e12f-420e-a4d3-ac20a561a
919/fetchFileTemp2346557375955678751.tmp
17/02/01 18:13:57 INFO Executor: Adding file:/tmp/spark-aa3139da
-fd4b-473b-9e0c-42858ae78db5/userFiles-38c779f8-e12f-420e-a4d3-a
c20a561a919/spark-examples_2.11-2.1.0.jar to class loader
17/02/01 18:13:57 INFO Executor: Finished task 0.0 in stage 0.0
(TID 0). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO Executor: Finished task 1.0 in stage 0.0
(TID 1). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 2.0 in stag
e 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 2.0 in stage 0.0 (
TID 2)
17/02/01 18:13:57 INFO TaskSetManager: Starting task 3.0 in stag
e 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 3.0 in stage 0.0 (
TID 3)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 0.0 in stag
e 0.0 (TID 0) in 996 ms on localhost (executor driver) (1/10)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 1.0 in stag
e 0.0 (TID 1) in 940 ms on localhost (executor driver) (2/10)
17/02/01 18:13:57 INFO Executor: Finished task 3.0 in stage 0.0
(TID 3). 1041 bytes result sent to driver
17/02/01 18:13:57 INFO Executor: Finished task 2.0 in stage 0.0
(TID 2). 1041 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 4.0 in stag
e 0.0 (TID 4, localhost, executor driver, partition 4, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 4.0 in stage 0.0 (
TID 4)
17/02/01 18:13:57 INFO TaskSetManager: Starting task 5.0 in stag
e 0.0 (TID 5, localhost, executor driver, partition 5, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 5.0 in stage 0.0 (
TID 5)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 192 ms on localhost (executor driver) (3/10)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 3.0 in stag
e 0.0 (TID 3) in 224 ms on localhost (executor driver) (4/10)
17/02/01 18:13:57 INFO Executor: Finished task 4.0 in stage 0.0
(TID 4). 1201 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 6.0 in stag
e 0.0 (TID 6, localhost, executor driver, partition 6, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO Executor: Running task 6.0 in stage 0.0 (
TID 6)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 4.0 in stag
e 0.0 (TID 4) in 110 ms on localhost (executor driver) (5/10)
17/02/01 18:13:57 INFO Executor: Finished task 5.0 in stage 0.0
(TID 5). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 7.0 in stag
e 0.0 (TID 7, localhost, executor driver, partition 7, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 5.0 in stag
e 0.0 (TID 5) in 112 ms on localhost (executor driver) (6/10)
17/02/01 18:13:57 INFO Executor: Running task 7.0 in stage 0.0 (
TID 7)
17/02/01 18:13:57 INFO Executor: Finished task 6.0 in stage 0.0
(TID 6). 1128 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 8.0 in stag
e 0.0 (TID 8, localhost, executor driver, partition 8, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 6.0 in stag
e 0.0 (TID 6) in 68 ms on localhost (executor driver) (7/10)
17/02/01 18:13:57 INFO Executor: Running task 8.0 in stage 0.0 (
TID 8)
17/02/01 18:13:57 INFO Executor: Finished task 7.0 in stage 0.0
(TID 7). 1041 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Starting task 9.0 in stag
e 0.0 (TID 9, localhost, executor driver, partition 9, PROCESS_L
OCAL, 6021 bytes)
17/02/01 18:13:57 INFO TaskSetManager: Finished task 7.0 in stag
e 0.0 (TID 7) in 54 ms on localhost (executor driver) (8/10)
17/02/01 18:13:57 INFO Executor: Running task 9.0 in stage 0.0 (
TID 9)
17/02/01 18:13:57 INFO Executor: Finished task 8.0 in stage 0.0 (TID 8). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Finished task 8.0 in stag
e 0.0 (TID 8) in 112 ms on localhost (executor driver) (9/10)
17/02/01 18:13:57 INFO Executor: Finished task 9.0 in stage 0.0
(TID 9). 1114 bytes result sent to driver
17/02/01 18:13:57 INFO TaskSetManager: Finished task 9.0 in stag
e 0.0 (TID 9) in 123 ms on localhost (executor driver) (10/10)
17/02/01 18:13:57 INFO TaskSchedulerImpl: Removed TaskSet 0.0, w
hose tasks have all completed, from pool
17/02/01 18:13:57 INFO DAGScheduler: ResultStage 0 (reduce at Sp
arkPi.scala:38) finished in 1.376 s
17/02/01 18:13:57 INFO DAGScheduler: Job 0 finished: reduce at S
parkPi.scala:38, took 2.439870 s
Pi is roughly 3.141711141711142
17/02/01 18:13:57 INFO SparkUI: Stopped Spark web UI at http://1
0.0.2.15:4040
17/02/01 18:13:57 INFO MapOutputTrackerMasterEndpoint: MapOutput
TrackerMasterEndpoint stopped!
17/02/01 18:13:57 INFO MemoryStore: MemoryStore cleared
17/02/01 18:13:57 INFO BlockManager: BlockManager stopped
17/02/01 18:13:57 INFO BlockManagerMaster: BlockManagerMaster st
opped
17/02/01 18:13:57 INFO OutputCommitCoordinator$OutputCommitCoord
inatorEndpoint: OutputCommitCoordinator stopped!
17/02/01 18:13:57 INFO SparkContext: Successfully stopped SparkC
ontext
17/02/01 18:13:57 INFO ShutdownHookManager: Shutdown hook called
17/02/01 18:13:57 INFO ShutdownHookManager: Deleting directory /
tmp/spark-aa3139da-fd4b-473b-9e0c-42858ae78db5
(~/spark-2.1.0-bin-hadoop2.7)

with custom log4j

(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample-local.sh
Pi is roughly 3.1403831403831406

info.log

(~/spark-2.1.0-bin-hadoop2.7/logs) cat info.log | more

17/02/01 18:26:02 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicab
le
17/02/01 18:26:03 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on inter
face enp0s3)
17/02/01 18:26:03 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 18:26:30 WARN ObjectStore.M(L): Failed to get database
global_temp, returning NoSuchObjectException
17/02/01 18:35:50 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicab
le
17/02/01 18:35:51 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on inter
face enp0s3)
17/02/01 18:35:51 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 18:35:54 INFO log.M(L): Logging initialized @10835ms
17/02/01 18:35:55 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7bb3a9fe{/jobs,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7cbee484{/jobs/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7f811d00{/jobs/job,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@62923ee6{/jobs/job/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4089713{/stages,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@f19c9d2{/stages/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7807ac2c{/stages/stage,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@b91d8c4{/stages/stage/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4b6166aa{/stages/pool,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a77614d{/stages/pool/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4fd4cae3{/storage,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4a067c25{/storage/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a1217f9{/storage/rdd,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3bde62ff{/storage/rdd/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@523424b5{/environment,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2baa8d82{/environment/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@319dead1{/executors,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@791cbf87{/executors/json,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a7e2d9d{/executors/threadDump,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@754777cd{/executors/threadDump/json,null,AVAIL
ABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2b52c0d6{/static,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@372ea2bc{/,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4cc76301{/api,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2f08c4b{/jobs/job/kill,null,AVAILABLE}
17/02/01 18:35:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3f19b8b3{/stages/stage/kill,null,AVAILABLE}
17/02/01 18:35:55 INFO ServerConnector.M(L): Started ServerConne
ctor@4efcf8a{HTTP/1.1}{0.0.0.0:4040}
17/02/01 18:35:55 INFO Server.M(L): Started @11454ms
17/02/01 18:35:56 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@778ca8ef{/metrics/json,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4c060c8f{/SQL,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@383f3558{/SQL/json,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@6ea94d6a{/SQL/execution,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4d7e7435{/SQL/execution/json,null,AVAILABLE}
17/02/01 18:35:57 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@294bdeb4{/static/sql,null,AVAILABLE}
17/02/01 18:36:00 INFO ServerConnector.M(L): Stopped ServerConne
ctor@4efcf8a{HTTP/1.1}{0.0.0.0:4040}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@3f19b8b3{/stages/stage/kill,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2f08c4b{/jobs/job/kill,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4cc76301{/api,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@372ea2bc{/,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2b52c0d6{/static,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@754777cd{/executors/threadDump/json,null,UNAVA
ILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a7e2d9d{/executors/threadDump,null,UNAVAILABLE
}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@791cbf87{/executors/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@319dead1{/executors,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2baa8d82{/environment/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@523424b5{/environment,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@3bde62ff{/storage/rdd/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a1217f9{/storage/rdd,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4a067c25{/storage/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4fd4cae3{/storage,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a77614d{/stages/pool/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4b6166aa{/stages/pool,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@b91d8c4{/stages/stage/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7807ac2c{/stages/stage,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@f19c9d2{/stages/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4089713{/stages,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@62923ee6{/jobs/job/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7f811d00{/jobs/job,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7cbee484{/jobs/json,null,UNAVAILABLE}
17/02/01 18:36:00 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7bb3a9fe{/jobs,null,UNAVAILABLE}
(~/spark-2.1.0-bin-hadoop2.7/logs)

Cluster
default behavior

both master and worker create two files under the logs directory and redirect their
output to those files

uses the default log profile, printing `Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties`

Master

(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home
/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache
.spark.deploy.master.Master-1-aravind-VirtualBox.out

Slave

(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-slave.sh spark://aravind-VirtualBox:7077
starting org.apache.spark.deploy.worker.Worker, logging to /home
/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache
.spark.deploy.worker.Worker-1-aravind-VirtualBox.out

Master LOG output

(~/spark-2.1.0-bin-hadoop2.7/logs) cat spark-aravind-org.apache.spark.deploy.master.Master-1-aravind-VirtualBox.out
Spark Command: /usr/lib/jvm/java-8-oracle/bin/java -cp /home/ara
vind/spark-2.1.0-bin-hadoop2.7/conf/:/home/aravind/spark-2.1.0-b
in-hadoop2.7/jars/* -Xmx1g org.apache.spark.deploy.master.Master
--host aravind-VirtualBox --port 7077 --webui-port 8080
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
17/02/01 18:49:02 INFO Master: Started daemon with process name:
4332@aravind-VirtualBox
17/02/01 18:49:02 INFO SignalUtils: Registered signal handler fo
r TERM
17/02/01 18:49:02 INFO SignalUtils: Registered signal handler fo
r HUP
17/02/01 18:49:02 INFO SignalUtils: Registered signal handler fo
r INT
17/02/01 18:49:02 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 18:49:02 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:49:02 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:49:02 INFO SecurityManager: Changing view acls to: a
ravind
17/02/01 18:49:02 INFO SecurityManager: Changing modify acls to:
aravind
17/02/01 18:49:02 INFO SecurityManager: Changing view acls group
s to:
17/02/01 18:49:02 INFO SecurityManager: Changing modify acls gro
ups to:
17/02/01 18:49:02 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); groups with view permissions: Set(); users with
modify permissions: Set(aravind); groups with modify permission
s: Set()
17/02/01 18:49:03 INFO Utils: Successfully started service 'spar
kMaster' on port 7077.
17/02/01 18:49:03 INFO Master: Starting Spark master at spark://
aravind-VirtualBox:7077
17/02/01 18:49:03 INFO Master: Running Spark version 2.1.0
17/02/01 18:49:03 INFO Utils: Successfully started service 'Mast
erUI' on port 8080.
17/02/01 18:49:03 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0
, and started at http://10.0.2.15:8080
17/02/01 18:49:03 INFO Utils: Successfully started service on po
rt 6066.
17/02/01 18:49:03 INFO StandaloneRestServer: Started REST server
for submitting applications on port 6066
17/02/01 18:49:03 INFO Master: I have been elected leader! New s
tate: ALIVE
17/02/01 18:50:57 INFO Master: Registering worker 10.0.2.15:4313
5 with 2 cores, 1024.0 MB RAM

(~/spark-2.1.0-bin-hadoop2.7/logs) ll
total 16
drwxrwxr-x 2 aravind aravind 4096 Feb 1 18:50 ./
drwxr-xr-x 16 aravind aravind 4096 Feb 1 18:13 ../
-rw-rw-r-- 1 aravind aravind 2377 Feb 1 18:50 spark-aravind-or
g.apache.spark.deploy.master.Master-1-aravind-VirtualBox.out
-rw-rw-r-- 1 aravind aravind 2448 Feb 1 18:50 spark-aravind-or
g.apache.spark.deploy.worker.Worker-1-aravind-VirtualBox.out
(~/spark-2.1.0-bin-hadoop2.7/logs)

Worker LOG output

(~/spark-2.1.0-bin-hadoop2.7/logs) cat spark-aravind-org.apache.spark.deploy.worker.Worker-1-aravind-VirtualBox.out | more
Spark Command: /usr/lib/jvm/java-8-oracle/bin/java -cp /home/ara
vind/spark-2.1.0-bin-hadoop2.7/conf/:/home/aravind/spark-2.1.0-b
in-hadoop2.7/jars
/* -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 808
1 spark://aravind-VirtualBox:7077
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
17/02/01 18:50:55 INFO Worker: Started daemon with process name:
4423@aravind-VirtualBox
17/02/01 18:50:55 INFO SignalUtils: Registered signal handler fo
r TERM
17/02/01 18:50:55 INFO SignalUtils: Registered signal handler fo
r HUP
17/02/01 18:50:55 INFO SignalUtils: Registered signal handler fo
r INT
17/02/01 18:50:55 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface
enp0s3)
17/02/01 18:50:55 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 18:50:55 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 18:50:55 INFO SecurityManager: Changing view acls to: a
ravind
17/02/01 18:50:55 INFO SecurityManager: Changing modify acls to:
aravind
17/02/01 18:50:55 INFO SecurityManager: Changing view acls group
s to:
17/02/01 18:50:55 INFO SecurityManager: Changing modify acls gro
ups to:
17/02/01 18:50:55 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); g
roups with view permissions: Set(); users with modify permissio
ns: Set(aravind); groups with modify permissions: Set()
17/02/01 18:50:56 INFO Utils: Successfully started service 'spar
kWorker' on port 43135.
17/02/01 18:50:56 INFO Worker: Starting Spark worker 10.0.2.15:4
3135 with 2 cores, 1024.0 MB RAM
17/02/01 18:50:56 INFO Worker: Running Spark version 2.1.0
17/02/01 18:50:56 INFO Worker: Spark home: /home/aravind/spark-2
.1.0-bin-hadoop2.7
17/02/01 18:50:56 INFO Utils: Successfully started service 'Work
erUI' on port 8081.
17/02/01 18:50:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0
, and started at http://10.0.2.15:8081
17/02/01 18:50:56 INFO Worker: Connecting to master aravind-Virt
ualBox:7077...
17/02/01 18:50:57 INFO TransportClientFactory: Successfully crea
ted connection to aravind-VirtualBox/127.0.1.1:7077 after 72 ms
(0 ms spent in bo
otstraps)
17/02/01 18:50:57 INFO Worker: Successfully registered with mast
er spark://aravind-VirtualBox:7077
(~/spark-2.1.0-bin-hadoop2.7/logs)

spark-submit to master

command

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
--master spark://aravind-VirtualBox:7077 \
--executor-memory 512MB \
--total-executor-cores 1 \
examples/jars/spark-examples_2.11-2.1.0.jar \
10

output

(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample.sh
Using Spark's default log4j profile: org/apache/spark/log4j-defa
ults.properties
17/02/01 19:44:16 INFO SparkContext: Running Spark version 2.1.0
17/02/01 19:44:29 WARN NativeCodeLoader: Unable to load native-h
adoop library for your platform... using builtin-java classes wh
ere applicable
17/02/01 19:44:31 WARN Utils: Your hostname, aravind-VirtualBox
resolves to a loopback address: 127.0.1.1; using 10.0.2.15 inste
ad (on interface enp0s3)
17/02/01 19:44:31 WARN Utils: Set SPARK_LOCAL_IP if you need to
bind to another address
17/02/01 19:44:32 INFO SecurityManager: Changing view acls to: a
ravind
17/02/01 19:44:32 INFO SecurityManager: Changing modify acls to:
aravind
17/02/01 19:44:32 INFO SecurityManager: Changing view acls group
s to:
17/02/01 19:44:32 INFO SecurityManager: Changing modify acls gro
ups to:
17/02/01 19:44:32 INFO SecurityManager: SecurityManager: authent
ication disabled; ui acls disabled; users with view permissions
: Set(aravind); groups with view permissions: Set(); users with
modify permissions: Set(aravind); groups with modify permission
s: Set()
17/02/01 19:44:42 INFO Utils: Successfully started service 'spar
kDriver' on port 41120.
17/02/01 19:44:43 INFO SparkEnv: Registering MapOutputTracker
17/02/01 19:44:43 INFO SparkEnv: Registering BlockManagerMaster
17/02/01 19:44:43 INFO BlockManagerMasterEndpoint: Using org.apa
che.spark.storage.DefaultTopologyMapper for getting topology inf
ormation
17/02/01 19:44:43 INFO BlockManagerMasterEndpoint: BlockManagerM
asterEndpoint up
17/02/01 19:44:43 INFO DiskBlockManager: Created local directory
at /tmp/blockmgr-6220cf77-5149-4f51-9d01-fa222f23ba65
17/02/01 19:44:43 INFO MemoryStore: MemoryStore started with cap
acity 413.9 MB
17/02/01 19:44:44 INFO SparkEnv: Registering OutputCommitCoordin
ator
17/02/01 19:44:49 INFO Utils: Successfully started service 'Spar
kUI' on port 4040.
17/02/01 19:44:49 INFO SparkUI: Bound SparkUI to 0.0.0.0, and st
arted at http://10.0.2.15:4040
17/02/01 19:44:49 INFO SparkContext: Added JAR file:/home/aravin
d/spark-2.1.0-bin-hadoop2.7/examples/jars/spark-examples_2.11-2.
1.0.jar at spark://10.0.2.15:41120/jars/spark-examples_2.11-2.1.
0.jar with timestamp 1485996289545
17/02/01 19:44:51 INFO StandaloneAppClient$ClientEndpoint: Conne
cting to master spark://aravind-VirtualBox:7077...
17/02/01 19:44:51 INFO TransportClientFactory: Successfully crea
ted connection to aravind-VirtualBox/127.0.1.1:7077 after 303 ms
(0 ms spent in bootstraps)
17/02/01 19:45:11 INFO StandaloneAppClient$ClientEndpoint: Conne
cting to master spark://aravind-VirtualBox:7077...
17/02/01 19:45:16 INFO StandaloneSchedulerBackend: Connected to
Spark cluster with app ID app-20170201194515-0000
17/02/01 19:45:16 INFO Utils: Successfully started service 'org.
apache.spark.network.netty.NettyBlockTransferService' on port 36
697.
17/02/01 19:45:16 INFO NettyBlockTransferService: Server created
on 10.0.2.15:36697
17/02/01 19:45:16 INFO BlockManager: Using org.apache.spark.stor
age.RandomBlockReplicationPolicy for block replication policy
17/02/01 19:45:16 INFO BlockManagerMaster: Registering BlockMana
ger BlockManagerId(driver, 10.0.2.15, 36697, None)
17/02/01 19:45:16 INFO BlockManagerMasterEndpoint: Registering b
lock manager 10.0.2.15:36697 with 413.9 MB RAM, BlockManagerId(d
river, 10.0.2.15, 36697, None)
17/02/01 19:45:16 INFO BlockManagerMaster: Registered BlockManag
er BlockManagerId(driver, 10.0.2.15, 36697, None)
17/02/01 19:45:16 INFO BlockManager: Initialized BlockManager: B
lockManagerId(driver, 10.0.2.15, 36697, None)
17/02/01 19:45:17 INFO StandaloneAppClient$ClientEndpoint: Execu
tor added: app-20170201194515-0000/0 on worker-20170201185056-10
.0.2.15-43135 (10.0.2.15:43135) with 1 cores
17/02/01 19:45:17 INFO StandaloneSchedulerBackend: Granted execu
tor ID app-20170201194515-0000/0 on hostPort 10.0.2.15:43135 wit
h 1 cores, 512.0 MB RAM
17/02/01 19:45:19 INFO StandaloneSchedulerBackend: SchedulerBack
end is ready for scheduling beginning after reached minRegistere
dResourcesRatio: 0.0
17/02/01 19:45:20 INFO SharedState: Warehouse path is 'file:/hom
e/aravind/spark-2.1.0-bin-hadoop2.7/spark-warehouse/'.
17/02/01 19:45:25 INFO SparkContext: Starting job: reduce at Spa
rkPi.scala:38
17/02/01 19:45:26 INFO DAGScheduler: Got job 0 (reduce at SparkP
i.scala:38) with 10 output partitions
17/02/01 19:45:26 INFO DAGScheduler: Final stage: ResultStage 0
(reduce at SparkPi.scala:38)
17/02/01 19:45:26 INFO DAGScheduler: Parents of final stage: Lis
t()
17/02/01 19:45:26 INFO DAGScheduler: Missing parents: List()
17/02/01 19:45:26 INFO DAGScheduler: Submitting ResultStage 0 (M
apPartitionsRDD[1] at map at SparkPi.scala:34), which has no mis
sing parents
17/02/01 19:45:28 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 1832.0 B, free 413.9 MB)
17/02/01 19:45:29 INFO MemoryStore: Block broadcast_0_piece0 sto
red as bytes in memory (estimated size 1172.0 B, free 413.9 MB)
17/02/01 19:45:29 INFO BlockManagerInfo: Added broadcast_0_piece
0 in memory on 10.0.2.15:36697 (size: 1172.0 B, free: 413.9 MB)
17/02/01 19:45:29 INFO SparkContext: Created broadcast 0 from br
oadcast at DAGScheduler.scala:996
17/02/01 19:45:29 INFO DAGScheduler: Submitting 10 missing tasks
from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala
:34)
17/02/01 19:45:29 INFO TaskSchedulerImpl: Adding task set 0.0 wi
th 10 tasks
17/02/01 19:45:36 INFO StandaloneAppClient$ClientEndpoint: Execu
tor updated: app-20170201194515-0000/0 is now RUNNING
17/02/01 19:45:44 WARN TaskSchedulerImpl: Initial job has not ac
cepted any resources; check your cluster UI to ensure that worke
rs are registered and have sufficient resources
17/02/01 19:45:45 INFO CoarseGrainedSchedulerBackend$DriverEndpo
int: Registered executor NettyRpcEndpointRef(null) (10.0.2.15:32
846) with ID 0
17/02/01 19:45:46 INFO BlockManagerMasterEndpoint: Registering b
lock manager 10.0.2.15:42332 with 117.0 MB RAM, BlockManagerId(0
, 10.0.2.15, 42332, None)
17/02/01 19:45:47 INFO TaskSetManager: Starting task 0.0 in stag
e 0.0 (TID 0, 10.0.2.15, executor 0, partition 0, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:45:54 INFO BlockManagerInfo: Added broadcast_0_piece
0 in memory on 10.0.2.15:42332 (size: 1172.0 B, free: 117.0 MB)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 1.0 in stag
e 0.0 (TID 1, 10.0.2.15, executor 0, partition 1, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 2.0 in stag
e 0.0 (TID 2, 10.0.2.15, executor 0, partition 2, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 0.0 in stag
e 0.0 (TID 0) in 13900 ms on 10.0.2.15 (executor 0) (1/10)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 1.0 in stag
e 0.0 (TID 1) in 375 ms on 10.0.2.15 (executor 0) (2/10)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 3.0 in stag
e 0.0 (TID 3, 10.0.2.15, executor 0, partition 3, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 2.0 in stag
e 0.0 (TID 2) in 191 ms on 10.0.2.15 (executor 0) (3/10)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 4.0 in stag
e 0.0 (TID 4, 10.0.2.15, executor 0, partition 4, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 3.0 in stag
e 0.0 (TID 3) in 72 ms on 10.0.2.15 (executor 0) (4/10)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 5.0 in stag
e 0.0 (TID 5, 10.0.2.15, executor 0, partition 5, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 4.0 in stag
e 0.0 (TID 4) in 67 ms on 10.0.2.15 (executor 0) (5/10)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 6.0 in stag
e 0.0 (TID 6, 10.0.2.15, executor 0, partition 6, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 5.0 in stag
e 0.0 (TID 5) in 89 ms on 10.0.2.15 (executor 0) (6/10)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 7.0 in stag
e 0.0 (TID 7, 10.0.2.15, executor 0, partition 7, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 6.0 in stag
e 0.0 (TID 6) in 72 ms on 10.0.2.15 (executor 0) (7/10)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 8.0 in stag
e 0.0 (TID 8, 10.0.2.15, executor 0, partition 8, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 7.0 in stag
e 0.0 (TID 7) in 64 ms on 10.0.2.15 (executor 0) (8/10)
17/02/01 19:46:00 INFO TaskSetManager: Starting task 9.0 in stag
e 0.0 (TID 9, 10.0.2.15, executor 0, partition 9, PROCESS_LOCAL,
6025 bytes)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 8.0 in stag
e 0.0 (TID 8) in 38 ms on 10.0.2.15 (executor 0) (9/10)
17/02/01 19:46:00 INFO TaskSetManager: Finished task 9.0 in stag
e 0.0 (TID 9) in 144 ms on 10.0.2.15 (executor 0) (10/10)
17/02/01 19:46:00 INFO DAGScheduler: ResultStage 0 (reduce at Sp
arkPi.scala:38) finished in 31.483 s
17/02/01 19:46:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, w
hose tasks have all completed, from pool
17/02/01 19:46:01 INFO DAGScheduler: Job 0 finished: reduce at S
parkPi.scala:38, took 35.773000 s
Pi is roughly 3.1426511426511428
17/02/01 19:46:03 INFO SparkUI: Stopped Spark web UI at http://1
0.0.2.15:4040
17/02/01 19:46:04 INFO StandaloneSchedulerBackend: Shutting down
all executors
17/02/01 19:46:04 INFO CoarseGrainedSchedulerBackend$DriverEndpo
int: Asking each executor to shut down
17/02/01 19:46:05 INFO MapOutputTrackerMasterEndpoint: MapOutput
TrackerMasterEndpoint stopped!
17/02/01 19:46:06 INFO MemoryStore: MemoryStore cleared
17/02/01 19:46:06 INFO BlockManager: BlockManager stopped
17/02/01 19:46:06 INFO BlockManagerMaster: BlockManagerMaster st
opped
17/02/01 19:46:06 INFO OutputCommitCoordinator$OutputCommitCoord
inatorEndpoint: OutputCommitCoordinator stopped!
17/02/01 19:46:06 INFO SparkContext: Successfully stopped SparkC
ontext
17/02/01 19:46:06 INFO ShutdownHookManager: Shutdown hook called
17/02/01 19:46:06 INFO ShutdownHookManager: Deleting directory /
tmp/spark-e528b271-c091-46f3-ac7c-f345cee3cc4c
(~/spark-2.1.0-bin-hadoop2.7)

with custom log4j

the conf/log4j.properties is honored, redirecting the output of master, worker and
spark-submit to info.log
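
A minimal sketch of what such a conf/log4j.properties could look like. The author's actual file is not shown in these notes, so the appender name, file path and conversion pattern below are assumptions inferred from the info.log output:

    # Route all Spark logging at INFO and above to a file instead of the console
    log4j.rootCategory=INFO, file

    log4j.appender.file=org.apache.log4j.FileAppender
    log4j.appender.file.File=/home/aravind/spark-2.1.0-bin-hadoop2.7/logs/info.log
    log4j.appender.file.layout=org.apache.log4j.PatternLayout
    log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}.M(L): %m%n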

Master, Worker and spark-submit output

(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /home
/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache
.spark.deploy.master.Master-1-aravind-VirtualBox.out

(~/spark-2.1.0-bin-hadoop2.7) ./sbin/start-slave.sh spark://aravind-VirtualBox:7077
starting org.apache.spark.deploy.worker.Worker, logging to /home
/aravind/spark-2.1.0-bin-hadoop2.7/logs/spark-aravind-org.apache
.spark.deploy.worker.Worker-1-aravind-VirtualBox.out

(~/spark-2.1.0-bin-hadoop2.7) sh ./runexample.sh
Pi is roughly 3.1427031427031427

info.log

(~/spark-2.1.0-bin-hadoop2.7/logs) cat info.log
17/02/01 20:15:47 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on interface enp0s3)
17/02/01 20:15:47 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 20:15:49 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicable
17/02/01 20:15:53 INFO log.M(L): Logging initialized @12103ms
17/02/01 20:15:54 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@5470d948{/app,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@591d8573{/app/json,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2e495ee7{/,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@52a2fb5d{/json,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4ac2d8{/static,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@35c9f9bf{/app/kill,null,AVAILABLE}
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@51132787{/driver/kill,null,AVAILABLE}
17/02/01 20:15:54 INFO ServerConnector.M(L): Started ServerConne
ctor@7dfc76f8{HTTP/1.1}{0.0.0.0:8080}
17/02/01 20:15:54 INFO Server.M(L): Started @12778ms
17/02/01 20:15:54 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 20:15:54 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2e5cf022{/,null,AVAILABLE}
17/02/01 20:15:54 INFO ServerConnector.M(L): Started ServerConne
ctor@57ef1427{HTTP/1.1}{aravind-VirtualBox:6066}
17/02/01 20:15:54 INFO Server.M(L): Started @12895ms
17/02/01 20:15:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@6e7baffa{/metrics/master/json,null,AVAILABLE}
17/02/01 20:15:55 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@40ecf3f6{/metrics/applications/json,null,AVAIL
ABLE}
17/02/01 20:16:32 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on interface enp0s3)
17/02/01 20:16:32 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 20:16:33 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicable
17/02/01 20:16:34 INFO log.M(L): Logging initialized @2814ms
17/02/01 20:16:34 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@768d6913{/logPage,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@500e5e0c{/logPage/json,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3727270a{/,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@e7cc804{/json,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@13a4ed29{/static,null,AVAILABLE}
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@31783f6f{/log,null,AVAILABLE}
17/02/01 20:16:34 INFO ServerConnector.M(L): Started ServerConne
ctor@1e448d11{HTTP/1.1}{0.0.0.0:8081}
17/02/01 20:16:34 INFO Server.M(L): Started @2946ms
17/02/01 20:16:34 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2c5b237e{/metrics/json,null,AVAILABLE}
17/02/01 20:18:06 WARN NativeCodeLoader.M(L): Unable to load nat
ive-hadoop library for your platform... using builtin-java class
es where applicable
17/02/01 20:18:06 WARN Utils.M(L): Your hostname, aravind-Virtua
lBox resolves to a loopback address: 127.0.1.1; using 10.0.2.15
instead (on interface enp0s3)
17/02/01 20:18:06 WARN Utils.M(L): Set SPARK_LOCAL_IP if you nee
d to bind to another address
17/02/01 20:18:08 INFO log.M(L): Logging initialized @3872ms
17/02/01 20:18:08 INFO Server.M(L): jetty-9.2.z-SNAPSHOT
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7bb3a9fe{/jobs,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7cbee484{/jobs/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7f811d00{/jobs/job,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@62923ee6{/jobs/job/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4089713{/stages,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@f19c9d2{/stages/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@7807ac2c{/stages/stage,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@b91d8c4{/stages/stage/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4b6166aa{/stages/pool,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a77614d{/stages/pool/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4fd4cae3{/storage,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4a067c25{/storage/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a1217f9{/storage/rdd,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3bde62ff{/storage/rdd/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@523424b5{/environment,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2baa8d82{/environment/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@319dead1{/executors,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@791cbf87{/executors/json,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a7e2d9d{/executors/threadDump,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@754777cd{/executors/threadDump/json,null,AVAIL
ABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2b52c0d6{/static,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@372ea2bc{/,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@4cc76301{/api,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@2f08c4b{/jobs/job/kill,null,AVAILABLE}
17/02/01 20:18:08 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3f19b8b3{/stages/stage/kill,null,AVAILABLE}
17/02/01 20:18:08 INFO ServerConnector.M(L): Started ServerConne
ctor@1cd0a3ab{HTTP/1.1}{0.0.0.0:4040}
17/02/01 20:18:08 INFO Server.M(L): Started @4080ms
17/02/01 20:18:09 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@3569edd5{/metrics/json,null,AVAILABLE}
17/02/01 20:18:10 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@6622a690{/SQL,null,AVAILABLE}
17/02/01 20:18:10 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@497570fb{/SQL/json,null,AVAILABLE}
17/02/01 20:18:10 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@a8a8b75{/SQL/execution,null,AVAILABLE}
17/02/01 20:18:10 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@72be135f{/SQL/execution/json,null,AVAILABLE}
17/02/01 20:18:10 INFO ContextHandler.M(L): Started o.s.j.s.Serv
letContextHandler@deb3b60{/static/sql,null,AVAILABLE}
17/02/01 20:18:19 INFO ServerConnector.M(L): Stopped ServerConne
ctor@1cd0a3ab{HTTP/1.1}{0.0.0.0:4040}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@3f19b8b3{/stages/stage/kill,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2f08c4b{/jobs/job/kill,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4cc76301{/api,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@372ea2bc{/,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2b52c0d6{/static,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@754777cd{/executors/threadDump/json,null,UNAVA
ILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a7e2d9d{/executors/threadDump,null,UNAVAILABLE
}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@791cbf87{/executors/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@319dead1{/executors,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@2baa8d82{/environment/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@523424b5{/environment,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@3bde62ff{/storage/rdd/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a1217f9{/storage/rdd,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4a067c25{/storage/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4fd4cae3{/storage,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@a77614d{/stages/pool/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4b6166aa{/stages/pool,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@b91d8c4{/stages/stage/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7807ac2c{/stages/stage,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@f19c9d2{/stages/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@4089713{/stages,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@62923ee6{/jobs/job/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7f811d00{/jobs/job,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7cbee484{/jobs/json,null,UNAVAILABLE}
17/02/01 20:18:19 INFO ContextHandler.M(L): Stopped o.s.j.s.Serv
letContextHandler@7bb3a9fe{/jobs,null,UNAVAILABLE}
17/02/01 20:18:20 WARN Master.M(L): Got status update for unknow
n executor app-20170201201809-0000/0
(~/spark-2.1.0-bin-hadoop2.7/logs)

Shuffling
The spark.serializer setting configures the serializer used not only for shuffling
data between worker nodes but also when serializing RDDs to disk
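
A minimal sketch of switching to Kryo; spark.serializer and registerKryoClasses are standard Spark APIs, while MyRecord is just a placeholder class for illustration:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Placeholder class for records that get shuffled or cached in serialized form
    case class MyRecord(id: Long, name: String)

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      // Use Kryo instead of the default Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes up front avoids writing full class names with every record
      .registerKryoClasses(Array(classOf[MyRecord]))

    val spark = SparkSession.builder().config(conf).getOrCreate()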

Monitoring Spark Apps


Web UI
Spark launches an embedded web app by default to monitor a running
application. This is launched on the driver.
It launches the monitoring UI on port 4040 by default.
If multiple apps are running simultaneously then they bind to
successive ports
REST interface
Publish metrics to aggregation systems like Graphite, Ganglia and
JMX

Scheduling
By default FIFO is used
Can be configured to use the FAIR scheduler by changing spark.scheduler.mode.
Spark assigns tasks between jobs in a “round robin” fashion in FAIR mode
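
A minimal sketch of enabling FAIR scheduling; the allocation file path and pool name are assumptions for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("fair-scheduling-demo")
      // Switch intra-application scheduling from the default FIFO to FAIR
      .config("spark.scheduler.mode", "FAIR")
      // Optional: pools with weights/minShare can be defined in an XML file
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()

    // Jobs submitted from this thread are placed in the named pool
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")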

SparkSQL
Introduction

The interfaces provided by SparkSQL give Spark more information about the
structure of both the data and the computation being performed.
This extra information allows Spark to perform more optimizations
You interact with SparkSQL using SQL and the Dataset API
You can mix and match different APIs and languages, but the underlying
execution engine is the same and hence the same optimization techniques are
applied

Exposes 3 processing interfaces: SQL, HiveQL and language-integrated queries

Can be used in 2 modes

as a library
data processing tasks are expressed as SQL, HiveQL or integrated
queries in a Scala app
3 key abstractions
SQLContext
HiveContext
DataFrame
as a distributed SQL execution engine
allows multiple users to share a single spark cluster
allows centralized caching across all jobs and across multiple data
stores

DataFrame

Is an untyped Dataset (Dataset[Row])
represents a distributed collection of rows organized into named columns
Executing SQL queries returns a DataFrame as the result
is schema-aware, i.e. knows the names and types of the columns
provides methods for processing data
can be easily converted to a regular RDD
can be registered as a temporary table to run SQL/HiveQL
2 ways to create a DF (sketches follow below and in the RDD to Dataset section)
from an existing RDD
toDF if we can infer the schema using a case class
createDataFrame if you want to pass the schema explicitly
(StructType, StructField)
from external DataSources
single unified interface to create a DF, either from data stores
(MySQL, PostgreSQL, Oracle, Cassandra) or from files
(JSON, Parquet, ORC, CSV, HDFS, local, S3)
built-in support for JSON, Parquet, ORC, JDBC, Hive
uses DataFrameReader to read data from external data sources
can specify partitioning, format and data source specific options
SQL/HiveContext has a factory method called read() that returns an
instance of DataFrameReader
uses DataFrameWriter to write data to external data sources
can specify partitioning, format and data source specific options
built-in functions
optimized for faster execution through code generation
import org.apache.spark.sql.functions._
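
A minimal sketch of creating a DataFrame with the unified read() interface and querying it as a temporary table; the JSON path points at the sample file shipped in the Spark examples directory and is only for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()

    // DataFrameReader: spark.read is the single entry point for external data sources
    val peopleDF = spark.read.format("json")
      .load("examples/src/main/resources/people.json")

    peopleDF.printSchema()

    // Register as a temporary table/view and run SQL against it
    peopleDF.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 20").show()
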
Optimization techniques used

Encoder/Decoder
Better than Java and Kryo serialization mechanics
Allows Spark to perform many operations like sorting, filtering,
hashing etc without deserializing the bytes back to objects
Code is generated dynamically for encode/decode
Reduced Disk IO
Skip rows
if a data store (e.g. Parquet, ORC) maintains statistical info
about the data (min, max of a particular column in a group) then
SparkSQL can take advantage of it to skip these groups
Skip columns (using the columnar support of formats like Parquet)
skip non-required partitions
Predicate Push down i.e. pushing filtering predicates to run natively
on data stores like Cassandra, DB etc.
In-memory columnar caching (see the sketch after this list)
cache only required columns from any data store
compress the cached columns (using snappy) to minimize memory
usage and GC
SparkSQL can automatically select a compression codec for a
column based on data type
columnar format enables using of efficient compression techniques
run length encoding
delta encoding
dictionary encoding
Query optimization
uses both cost-based (in the physical planning phase) and rule-
based (in the logical optimization phase) optimization
can even optimize across functions
code generation
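
A minimal sketch of the in-memory columnar cache mentioned above, continuing with the peopleDF and spark values from the earlier read sketch; the table name is an assumption:

    // Cache the table in SparkSQL's compressed, in-memory columnar format
    peopleDF.createOrReplaceTempView("people")
    spark.catalog.cacheTable("people")        // or: spark.sql("CACHE TABLE people")

    // Queries now read only the required columns from the cached columnar data
    spark.sql("SELECT age, count(*) FROM people GROUP BY age").show()

    spark.catalog.uncacheTable("people")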

Dataset

Typed data and transformations: leverage static typing
Uses special Encoders/Decoders, instead of Java serialization or Kryo, to
serialize objects for processing and transmitting data over the network
Encoders
are code generated dynamically
allow Spark to perform operations like filtering, sorting, hashing etc.
without deserializing the bytes back to objects
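
A minimal sketch of a typed Dataset backed by the generated case-class Encoder; the Person class and values are for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
    // Brings the implicit Encoders for case classes and common types into scope
    import spark.implicits._

    case class Person(name: String, age: Int)

    // A typed Dataset: transformations are checked against Person's fields at compile time
    val people = Seq(Person("Ann", 32), Person("Bob", 41)).toDS()

    // filter/map operate on Person objects while Spark keeps the data in encoded form
    people.filter(_.age >= 18).map(_.name.toUpperCase).show()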

Datasource

Are used by Spark to load and save data
parquet is registered as the default data source
Default data source can be customized using
spark.sql.sources.default property
Save
different SaveMode values: overwrite, ignore, error, append
Persistent tables
can be saved as a Hive table (even if Hive isn't installed). Use
saveAsTable
will create a managed table
Schema merging for parquet, avro etc. files is supported
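
A minimal sketch of the save paths described above; the output locations and table name are assumptions for illustration:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().appName("write-demo").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("Ann", 32), ("Bob", 41)).toDF("name", "age")

    // Parquet is the default data source; overwrite any existing output
    df.write.mode(SaveMode.Overwrite).parquet("/tmp/people_parquet")

    // Choose a different format, partition the output and append to it
    df.write.mode(SaveMode.Append).partitionBy("age").format("json").save("/tmp/people_json")

    // Persistent managed table in the warehouse directory (works even without Hive)
    df.write.mode(SaveMode.Ignore).saveAsTable("people")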

Imports

// For implicit conversions like converting RDDs to DataFrames, to use the $-notation,
// and to use the default Encoders provided for most common types
import spark.implicits._

// Contains a library of functions to be used in Dataset and DataFrame methods like
// select, filter etc., in addition to column references like select($"name") and
// expressions like select($"age" + 1)
import org.apache.spark.sql.functions._

RDD to Dataset

Two methods for converting an RDD to a Dataset

Method 1: When you already know the schema

Let Spark infer the schema using reflection. This results in more concise code.

Method 2: When you do not know the columns and their types (i.e. the schema) beforehand

Create the schema programmatically and apply it to an existing RDD at runtime.
This is more flexible.
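
A minimal sketch of both methods; the Person class and the sample data are for illustration:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val spark = SparkSession.builder().appName("rdd-to-ds").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(("Ann", 32), ("Bob", 41)))

    // Method 1: let Spark infer the schema by reflection over a case class
    case class Person(name: String, age: Int)
    val ds = rdd.map { case (name, age) => Person(name, age) }.toDS()

    // Method 2: build the schema programmatically and apply it to an RDD of Rows
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))
    val rowRDD = rdd.map { case (name, age) => Row(name, age) }
    val df = spark.createDataFrame(rowRDD, schema)

    ds.printSchema()
    df.printSchema()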

Exploratory Data Analysis

What data do I have?

In EDA, the intent is to get familiar with the data before performing any statistical
analysis. We will look into different ways of answering the question "What data do
I have?" using Scala & Spark.

1. How big is your data?
2. How many variables/features are in your data set?
3. What are the data types?
i. Helps us determine the graphs to visualise the data
4. What does the data look like?
i. Possible Trends
ii. Possible Interactions
5. Is there anything odd about the data? Outliers?
i. Helps

After learning about the data using EDA, we can determine which tests and
descriptive statistics are appropriate for the data.
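
A minimal sketch of answering these questions in spark-shell; the CSV path and read options are assumptions, and any tabular dataset works the same way:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("eda").master("local[*]").getOrCreate()

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/data.csv")

    df.count()            // 1. How big is the data?
    df.columns.length     // 2. How many variables/features?
    df.printSchema()      // 3. What are the data types?
    df.show(5)            // 4. What does the data look like?
    df.describe().show()  // 5. Summary statistics help spot odd values and outliers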


References
https://ogirardot.wordpress.com/2015/01/09/changing-sparks-default-java-serialization-to-kryo/
https://ogirardot.wordpress.com/2014/12/11/apache-spark-memory-management-and-graceful-degradation/
