
Spark Basics
Chapter 10
201509
Course Chapters

Course Introduction
  1. Introduction
Introduction to Hadoop
  2. Introduction to Hadoop and the Hadoop Ecosystem
  3. Hadoop Architecture and HDFS
  4. Importing Relational Data with Apache Sqoop
Importing and Modeling Structured Data
  5. Introduction to Impala and Hive
  6. Modeling and Managing Data with Impala and Hive
  7. Data Formats
  8. Data File Partitioning
Ingesting Streaming Data
  9. Capturing Data with Apache Flume
Distributed Data Processing with Spark
  10. Spark Basics (this chapter)
  11. Working with RDDs in Spark
  12. Aggregating Data with Pair RDDs
  13. Writing and Deploying Spark Applications
  14. Parallel Processing in Spark
  15. Spark RDD Persistence
  16. Common Patterns in Spark
  17. Spark SQL and DataFrames
Course Conclusion
  18. Conclusion
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Spark Basics

In this chapter you will learn:
- How to start the Spark Shell
- About the SparkContext
- Key concepts of Resilient Distributed Datasets (RDDs):
  - What are they?
  - How do you create them?
  - What operations can you perform with them?
- How Spark uses the principles of functional programming

Chapter Topics

Course section: Distributed Data Processing with Spark
Current chapter: Spark Basics

- What is Apache Spark?
- Using the Spark Shell
- RDDs (Resilient Distributed Datasets)
- Functional Programming in Spark
- Conclusion
- Homework Assignments

What is Apache Spark?

- Apache Spark is a fast and general engine for large-scale data processing
- Written in Scala
  - A functional programming language that runs in a JVM
- Spark Shell
  - Interactive, for learning or data exploration
  - Python or Scala
- Spark Applications
  - For large-scale data processing
  - Python, Scala, or Java

Spark Shell

- The Spark Shell provides interactive data exploration (REPL: Read/Evaluate/Print Loop)
- Writing Spark applications without the shell will be covered later

Python Shell: pyspark

    $ pyspark
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/ '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.3.0
          /_/

    Using Python version 2.7.8 (default, Aug 27 2015 05:23:36)
    SparkContext available as sc, HiveContext available as sqlCtx.
    >>>

Scala Shell: spark-shell

    $ spark-shell
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/ '_/
       /___/ .__/\_,_/_/ /_/\_\   version 1.3.0
          /_/

    Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
    Created spark context..
    Spark context available as sc.
    SQL context available as sqlContext.
    scala>

Spark Context

- Every Spark application requires a Spark Context
  - The main entry point to the Spark API
- The Spark Shell provides a preconfigured Spark Context called sc

Python:

    Using Python version 2.7.8 (default, Aug 27 2015 05:23:36)
    SparkContext available as sc, HiveContext available as sqlCtx.
    >>> sc.appName
    u'PySparkShell'

Scala:

    Spark context available as sc.
    SQL context available as sqlContext.
    scala> sc.appName
    res0: String = Spark shell

RDD (Resilient Distributed Dataset)

- Resilient: if data in memory is lost, it can be recreated
- Distributed: processed across the cluster
- Dataset: initial data can come from a file or be created programmatically
- RDDs are the fundamental unit of data in Spark
- Most Spark programming consists of performing operations on RDDs

Creating an RDD

Three ways to create an RDD:
- From a file or set of files
- From data in memory
- From another RDD

Example: A File-based RDD

File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

> val mydata = sc.textFile("purplecow.txt")
15/01/29 06:20:37 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 151.4 KB, free 296.8 MB)

> mydata.count()
15/01/29 06:27:37 INFO spark.SparkContext: Job finished: take at <stdin>:1, took 0.160482078 s
4

RDD: mydata holds the four lines of the file.

RDD Operations

Two types of RDD operations:
- Actions return values (RDD → value)
- Transformations define a new RDD based on the current one(s) (Base RDD → New RDD)

Pop quiz: Which type of operation is count()?

RDD Operations: Actions

Some common actions:
- count(): return the number of elements
- take(n): return an array of the first n elements
- collect(): return an array of all elements
- saveAsTextFile(file): save to text file(s)

Python:

    > mydata = sc.textFile("purplecow.txt")
    > mydata.count()
    4
    > for line in mydata.take(2):
          print line
    I've never seen a purple cow.
    I never hope to see one;

Scala:

    > val mydata = sc.textFile("purplecow.txt")
    > mydata.count()
    4
    > for (line <- mydata.take(2))
          println(line)
    I've never seen a purple cow.
    I never hope to see one;

RDD Operations: Transformations

- Transformations create a new RDD from an existing one (Base RDD → New RDD)
- RDDs are immutable
  - Data in an RDD is never changed
  - Transform in sequence to modify the data as needed
- Some common transformations:
  - map(function): creates a new RDD by performing a function on each record in the base RDD
  - filter(function): creates a new RDD by including or excluding each record in the base RDD according to a boolean function
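The record-by-record semantics of map and filter can be previewed with plain Python on the poem's four lines. This is a local analogy, not Spark: the list and variable names are illustrative, and Spark would apply the same per-record logic in parallel across the cluster.

```python
# Local Python analogy of the map/filter transformations described above.
# Plain lists stand in for RDDs; each step produces a new collection and
# never modifies the previous one (mirroring RDD immutability).
lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

upper = [line.upper() for line in lines]                      # like map()
starts_i = [line for line in upper if line.startswith("I")]   # like filter()

print(len(starts_i))  # 3 records survive the filter
```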

Example: map and filter Transformations

Input:
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

map (Python: lambda line: line.upper(); Scala: line => line.toUpperCase):
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    BUT I CAN TELL YOU, ANYHOW,
    I'D RATHER SEE THAN BE ONE.

filter (Python: lambda line: line.startswith('I'); Scala: line => line.startsWith("I")):
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.

Lazy Execution (1)

Data in RDDs is not processed until an action is performed.

File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

>

Lazy Execution (2)

Data in RDDs is not processed until an action is performed.

File: purplecow.txt (contents as above)

> val mydata = sc.textFile("purplecow.txt")

RDD: mydata (defined, but no data has been read yet)

Lazy Execution (3)

Data in RDDs is not processed until an action is performed.

File: purplecow.txt (contents as above)

> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())

RDD: mydata → RDD: mydata_uc (both defined, still empty)

Lazy Execution (4)

Data in RDDs is not processed until an action is performed.

File: purplecow.txt (contents as above)

> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))

RDD: mydata → RDD: mydata_uc → RDD: mydata_filt (still no data processed)

Lazy Execution (5)

Data in RDDs is not processed until an action is performed.

File: purplecow.txt (contents as above)

> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.count()
3

The action finally triggers execution:
RDD: mydata: the four original lines
RDD: mydata_uc: the four lines uppercased
RDD: mydata_filt: the three lines starting with "I"
    I'VE NEVER SEEN A PURPLE COW.
    I NEVER HOPE TO SEE ONE;
    I'D RATHER SEE THAN BE ONE.
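Lazy execution resembles Python generator pipelines, where nothing runs until a terminal operation consumes the chain. The sketch below is a local analogy, not Spark; the trace list and read_lines function are illustrative devices for observing when work actually happens.

```python
# Local analogy of lazy execution: generator expressions build a pipeline
# but process nothing until an "action" (here, sum) consumes it.
trace = []  # records each time a line is actually read

def read_lines():
    for line in ["I've never seen a purple cow.",
                 "I never hope to see one;",
                 "But I can tell you, anyhow,",
                 "I'd rather see than be one."]:
        trace.append("read")
        yield line

upper = (line.upper() for line in read_lines())              # like map: lazy
filtered = (line for line in upper if line.startswith("I"))  # like filter: lazy

assert trace == []                  # nothing has been read yet
count = sum(1 for _ in filtered)    # the "action" triggers execution
print(count)                        # 3
```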

Chaining Transformations (Scala)

Transformations may be chained together:

    > val mydata = sc.textFile("purplecow.txt")
    > val mydata_uc = mydata.map(line => line.toUpperCase())
    > val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
    > mydata_filt.count()
    3

is exactly equivalent to:

    > sc.textFile("purplecow.txt").map(line => line.toUpperCase()).
          filter(line => line.startsWith("I")).count()
    3

Chaining Transformations (Python)

The same example in Python:

    > mydata = sc.textFile("purplecow.txt")
    > mydata_uc = mydata.map(lambda s: s.upper())
    > mydata_filt = mydata_uc.filter(lambda s: s.startswith('I'))
    > mydata_filt.count()
    3

is exactly equivalent to:

    > sc.textFile("purplecow.txt").map(lambda line: line.upper()) \
          .filter(lambda line: line.startswith('I')).count()
    3
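The equivalence of the step-by-step and chained forms holds because each step is a pure function of the previous collection. A quick local check with plain Python (no Spark; the list is the poem used throughout this chapter):

```python
# Verify that chaining and naming intermediate results give identical answers.
lines = ["I've never seen a purple cow.", "I never hope to see one;",
         "But I can tell you, anyhow,", "I'd rather see than be one."]

# Step-by-step, with named intermediates
upper = [s.upper() for s in lines]
filtered = [s for s in upper if s.startswith('I')]
step_count = len(filtered)

# The same logic chained into a single expression
chained_count = len([s for s in (t.upper() for t in lines)
                     if s.startswith('I')])

print(step_count == chained_count == 3)  # True
```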

RDD Lineage and toDebugString (Scala)

- Spark maintains each RDD's lineage: the previous RDDs on which it depends
- Use toDebugString to view the lineage of an RDD

    > val mydata_filt = sc.textFile("purplecow.txt").
          map(line => line.toUpperCase()).
          filter(line => line.startsWith("I"))
    > mydata_filt.toDebugString
    (2) FilteredRDD[7] at filter
     |  MappedRDD[6] at map
     |  purplecow.txt MappedRDD[5]
     |  purplecow.txt HadoopRDD[4]

The lineage chain: HadoopRDD[4] → MappedRDD[5] (textFile) → MappedRDD[6] (map) → FilteredRDD[7] (filter).

RDD Lineage and toDebugString (Python)

toDebugString output is not displayed as nicely in Python:

    > mydata_filt.toDebugString()
    (1) PythonRDD[8] at RDD at \n | purplecow.txt MappedRDD[7] at textFile
    at []\n | purplecow.txt HadoopRDD[6] at textFile at []

Use print for prettier output:

    > print mydata_filt.toDebugString()
    (1) PythonRDD[8] at RDD at
     |  purplecow.txt MappedRDD[7] at textFile at
     |  purplecow.txt HadoopRDD[6] at textFile at

Pipelining

When possible, Spark performs sequences of transformations row by row, so no intermediate data is stored.

File: purplecow.txt
    I've never seen a purple cow.
    I never hope to see one;
    But I can tell you, anyhow,
    I'd rather see than be one.

> val mydata = sc.textFile("purplecow.txt")
> val mydata_uc = mydata.map(line => line.toUpperCase())
> val mydata_filt = mydata_uc.filter(line => line.startsWith("I"))
> mydata_filt.take(2)
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;

Each row flows through the entire pipeline before the next row is read: the first line is read, uppercased, passes the filter, and is emitted as the first result; the second line follows the same path and becomes the second result. Because take(2) then has its two rows, the remaining lines of the file are never processed.
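Row-at-a-time pipelining with early termination can be mimicked locally with Python generators and itertools.islice, which stops pulling rows once two results have been produced. A local analogy only, not Spark; the trace list and read_lines function are illustrative.

```python
from itertools import islice

trace = []  # records which lines are actually read from the "file"

def read_lines():
    for line in ["I've never seen a purple cow.",
                 "I never hope to see one;",
                 "But I can tell you, anyhow,",
                 "I'd rather see than be one."]:
        trace.append(line[:5])
        yield line

# Build the lazy pipeline: each row flows through both stages in turn.
pipeline = (line.upper() for line in read_lines())
pipeline = (line for line in pipeline if line.startswith("I"))

first_two = list(islice(pipeline, 2))  # like take(2)
print(first_two)
print(trace)  # only the first two lines were ever read
```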

Functional Programming in Spark

- Spark depends heavily on the concepts of functional programming:
  - Functions are the fundamental unit of programming
  - Functions have input and output only
    - No state or side effects
- Key concepts:
  - Passing functions as input to other functions
  - Anonymous functions

Passing Functions as Parameters

- Many RDD operations take functions as parameters
- Pseudocode for the RDD map operation, which applies function fn to each record in the RDD:

    RDD {
      map(fn(x)) {
        foreach record in rdd
          emit fn(record)
      }
    }
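The pseudocode above can be sketched as a plain-Python higher-order function. The TinyRDD class is hypothetical, a local stand-in that only illustrates the calling convention of passing fn into map, not Spark's implementation.

```python
class TinyRDD:
    """Hypothetical in-memory stand-in for an RDD (illustration only)."""
    def __init__(self, records):
        self.records = list(records)

    def map(self, fn):
        # Apply fn to each record and "emit" the results as a new TinyRDD,
        # mirroring the pseudocode: foreach record in rdd, emit fn(record).
        return TinyRDD(fn(record) for record in self.records)

    def collect(self):
        return list(self.records)

rdd = TinyRDD(["I've never seen a purple cow.", "I never hope to see one;"])
print(rdd.map(str.upper).collect())
```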

Example: Passing Named Functions

Python:

    > def toUpper(s):
          return s.upper()
    > mydata = sc.textFile("purplecow.txt")
    > mydata.map(toUpper).take(2)

Scala:

    > def toUpper(s: String): String = { s.toUpperCase }
    > val mydata = sc.textFile("purplecow.txt")
    > mydata.map(toUpper).take(2)

Anonymous Functions

- Functions defined in-line, without an identifier
- Best for short, one-off functions
- Supported in many programming languages:
  - Python: lambda x: ...
  - Scala: x => ...
  - Java 8: x -> ...
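A quick local check (plain Python, no Spark) that an anonymous function is interchangeable with the equivalent named function:

```python
def to_upper(s):            # named function
    return s.upper()

anonymous = lambda s: s.upper()   # the same logic, defined in-line

line = "I've never seen a purple cow."
print(to_upper(line) == anonymous(line))  # True: identical behavior
```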

Example: Passing Anonymous Functions

Python:

    > mydata.map(lambda line: line.upper()).take(2)

Scala:

    > mydata.map(line => line.toUpperCase()).take(2)

or, using Scala's anonymous parameter notation (the underscore, _):

    > mydata.map(_.toUpperCase()).take(2)

Example: Java

Java 7 (anonymous inner class):

    ...
    JavaRDD<String> lines = sc.textFile("file");
    JavaRDD<String> lines_uc = lines.map(
        new Function<String, String>() {
            public String call(String line) {
                return line.toUpperCase();
            }
        });
    ...

Java 8 (lambda expression):

    ...
    JavaRDD<String> lines = sc.textFile("file");
    JavaRDD<String> lines_uc = lines.map(
        line -> line.toUpperCase());
    ...

Essential Points

- Spark can be used interactively via the Spark Shell
  - Python or Scala
  - Writing non-interactive Spark applications will be covered later
- RDDs (Resilient Distributed Datasets) are a key concept in Spark
- RDD operations
  - Transformations create a new RDD based on an existing one
  - Actions return a value from an RDD
- Lazy execution
  - Transformations are not executed until required by an action
- Spark uses functional programming
  - Passing functions as parameters
  - Anonymous functions in supported languages (Python and Scala)

Spark Homework: Pick Your Language

- Your choice: Python or Scala
  - For the Spark-based homework assignments in this course, you may choose to work with either Python or Scala
- Conventions:
  - .pyspark: Python shell commands
  - .scalaspark: Scala shell commands
  - .py: Python Spark applications
  - .scala: Scala Spark applications

Spark Homework Assignments

There are three homework assignments for this chapter:

1. View the Spark Documentation
   - Familiarize yourself with the Spark documentation; you will refer to it frequently during the course
2. Explore RDDs Using the Spark Shell
   - Follow the instructions for either the Python or Scala shell
3. Use RDDs to Transform a Dataset
   - Explore Loudacre web log files

Please refer to the Homework description.
