Professional Documents
Culture Documents
Basics
Chapter
10
201509
Course
Chapters
10
Spark
Basics
11
Working
with
RDDs
in
Spark
12
AggregaGng
Data
with
Pair
RDDs
13
WriGng
and
Deploying
Spark
ApplicaGons
Distributed
Data
Processing
with
14
Parallel
Processing
in
Spark
Spark
15
Spark
RDD
Persistence
16
Common
PaBerns
in
Spark
17
Spark
SQL
and
DataFrames
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-2
Spark
Basics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-3
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-4
What
is
Apache
Spark?
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-5
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-6
Spark
Shell
Welcome to
Welcome to
____ __
/ __/__ ___ _____/ /__ ____ __
_\ \/ _ \/ _ `/ __/ '_/ / __/__ ___ _____/ /__
/__ / .__/\_,_/_/ /_/\_\ version 1.3.0 _\ \/ _ \/ _ `/ __/ '_/
/_/ /___/ .__/\_,_/_/ /_/\_\ version 1.3.0
/_/
Using Python version 2.7.8 (default, Aug 27
2015 05:23:36)
SparkContext available as sc, HiveContext Using Scala version 2.10.4 (Java HotSpot(TM)
available as sqlCtx. 64-Bit Server VM, Java 1.7.0_67)
>>> Created spark context..
Spark context available as sc.
SQL context available as sqlContext.
scala>
REPL:
Read/Evaluate/Print
Loop
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-7
Spark
Context
Spark context available as sc.
SQL context available as sqlContext.
Scala
scala> sc.appName
res0: String = Spark shell
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-8
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-9
RDD
(Resilient
Distributed
Dataset)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-10
CreaGng
an
RDD
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-11
Example:
A
File-based
RDD
File: purplecow.txt
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-12
RDD
OperaGons
value
AcGons
return
values
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-13
RDD
OperaGons:
AcGons
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-14
RDD
OperaGons:
TransformaGons
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-15
Example:
map
and
filter
TransformaGons
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-16
Lazy
ExecuGon
(1)
File:
purplecow.txt
Data
in
RDDs
is
not
processed
unHl
I've never seen a purple cow.
an
ac#on
is
performed
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
>
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-17
Lazy
ExecuGon
(2)
File:
purplecow.txt
Data
in
RDDs
is
not
processed
unHl
I've never seen a purple cow.
an
ac#on
is
performed
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD: mydata
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-18
Lazy
ExecuGon
(3)
File:
purplecow.txt
Data
in
RDDs
is
not
processed
unHl
I've never seen a purple cow.
an
ac#on
is
performed
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD: mydata
RDD: mydata_uc
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-19
Lazy
ExecuGon
(4)
File:
purplecow.txt
Data
in
RDDs
is
not
processed
unHl
I've never seen a purple cow.
an
ac#on
is
performed
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD: mydata
RDD: mydata_lt
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-20
Lazy
ExecuGon
(5)
File:
purplecow.txt
Data
in
RDDs
is
not
processed
unHl
I've never seen a purple cow.
an
ac#on
is
performed
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
RDD:
mydata
I've never seen a purple cow.
> val mydata = sc.textFile("purplecow.txt") I never hope to see one;
> val mydata_uc = mydata.map(line => But I can tell you, anyhow,
line.toUpperCase()) I'd rather see than be one.
> val mydata_filt = mydata_uc.filter(line
RDD:
mydata_uc
=> line.startsWith("I")) I'VE NEVER SEEN A PURPLE COW.
> mydata_filt.count() I NEVER HOPE TO SEE ONE;
3 BUT I CAN TELL YOU, ANYHOW,
I'D RATHER SEE THAN BE ONE.
RDD:
mydata_lt
I'VE NEVER SEEN A PURPLE COW.
I NEVER HOPE TO SEE ONE;
I'D RATHER SEE THAN BE ONE.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-21
Chaining
TransformaGons
(Scala)
is exactly equivalent to
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-22
Chaining
TransformaGons
(Python)
is exactly equivalent to
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-23
RDD
Lineage
and
toDebugString
(Scala)
File:
purplecow.txt
Spark
maintains
each
RDDs
lineage
I've never seen a purple cow.
the
previous
RDDs
on
which
it
I never hope to see one;
But I can tell you, anyhow,
depends
I'd rather see than be one.
lineage of an RDD
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-24
RDD
Lineage
and
toDebugString
(Python)
> mydata_filt.toDebugString()
(1) PythonRDD[8] at RDD at \n | purplecow.txt MappedRDD[7] at textFile
at []\n | purplecow.txt HadoopRDD[6] at textFile at []
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-25
Pipelining
(1)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-26
Pipelining
(2)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-27
Pipelining
(3)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-28
Pipelining
(4)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-29
Pipelining
(5)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-30
Pipelining
(6)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-31
Pipelining
(7)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-32
Pipelining
(8)
File:
purplecow.txt
When
possible,
Spark
will
perform
I've never seen a purple cow.
sequences
of
transformaHons
by
I never hope to see one;
But I can tell you, anyhow,
row
so
no
data
is
stored
I'd rather see than be one.
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-33
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-34
FuncGonal
Programming
in
Spark
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-35
Passing
FuncGons
as
Parameters
RDD {
map(fn(x)) {
foreach record in rdd
emit fn(record)
}
}
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-36
Example:
Passing
Named
FuncGons
Python
Scala
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-37
Anonymous
FuncGons
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-38
Example:
Passing
Anonymous
FuncGons
Python:
Scala:
OR
> mydata.map(_.toUpperCase()).take(2)
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-39
Example:
Java
...
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
new MapFunction<String, String>() {
Java
7
public String call(String line) {
return line.toUpperCase();
}
});
...
...
Java
8
JavaRDD<String> lines = sc.textFile("file");
JavaRDD<String> lines_uc = lines.map(
line -> line.toUpperCase());
...
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-40
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-41
EssenGal
Points
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-42
Chapter
Topics
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-43
Spark
Homework:
Pick
Your
Language
Copyright
2010-2015
Cloudera.
All
rights
reserved.
Not
to
be
reproduced
or
shared
without
prior
wriBen
consent
from
Cloudera.
10-44
Spark
Homework
Assignments
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior wriBen consent from Cloudera. 10-45