201509
Course Chapters
Distributed Data Processing with Spark
10 Spark Basics
11 Working with RDDs in Spark
12 Aggregating Data with Pair RDDs
13 Writing and Deploying Spark Applications
14 Parallel Processing in Spark
15 Spark RDD Persistence
16 Common Patterns in Spark Data Processing
17 Spark SQL and DataFrames
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Working With RDDs
Chapter Topics
Creating RDDs
Other General RDD Operations
Conclusion
Homework: Process Data Files with Spark
RDDs
Creating RDDs From Collections
Useful when
  Testing
  Generating data programmatically
  Integrating
Creating RDDs from Files (1)
Creating RDDs from Files (2)
I've never seen a purple cow.\n    I've never seen a purple cow.
I never hope to see one;\n         I never hope to see one;
But I can tell you, anyhow,\n      But I can tell you, anyhow,
I'd rather see than be one.\n      I'd rather see than be one.
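The two columns above suggest that each newline-terminated line of the file becomes one RDD element. The following is a minimal plain-Python sketch of that mapping, not Spark code; the names `poem` and `elements` are illustrative assumptions rather than part of the course material.

```python
# Plain-Python illustration (no Spark required): split newline-delimited
# text into one element per line, mirroring the file-to-RDD picture above.
poem = (
    "I've never seen a purple cow.\n"
    "I never hope to see one;\n"
    "But I can tell you, anyhow,\n"
    "I'd rather see than be one.\n"
)

# splitlines() drops the trailing "\n" from each line.
elements = poem.splitlines()

for element in elements:
    print(element)
```

Each printed line corresponds to one element in the right-hand column.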
Input and Output Formats (1)
Input and Output Formats (2)
Whole File-Based RDDs (1)
sc.wholeTextFiles(directory)
  Maps entire contents of each file in a directory to a single RDD element
  Works only for small files (element must fit in memory)
file2.json
{
"firstName":"Barney",
"lastName":"Rubble",
"userid":"234"
}
(file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"} )
(file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"} )
(file3.xml, )
(file4.xml, )
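The (file name, contents) pairs above can be imitated without a cluster. The sketch below is plain Python, not Spark: `whole_text_files` is a hypothetical helper written for this illustration, and it keys each pair by bare file name to match the slide, whereas Spark keys each pair by the file's path. The sample JSON is taken from the pairs shown above.

```python
import json
import os
import tempfile


def whole_text_files(directory):
    """Hypothetical helper: one (file name, entire contents) pair per file."""
    pairs = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            # The whole file is read into memory, hence "small files only".
            with open(path, encoding="utf-8") as handle:
                pairs.append((name, handle.read()))
    return pairs


with tempfile.TemporaryDirectory() as directory:
    samples = {
        "file1.json": '{"firstName":"Fred","lastName":"Flintstone","userid":"123"}',
        "file2.json": '{"firstName":"Barney","lastName":"Rubble","userid":"234"}',
    }
    for name, text in samples.items():
        with open(os.path.join(directory, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    pairs = whole_text_files(directory)

# Because each element holds a complete document, it can be parsed as a unit.
records = [json.loads(contents) for _, contents in pairs]
print([record["firstName"] for record in records])
```

Reading each file as a single element is what makes whole-document formats such as JSON or XML parseable, since no record is split across elements.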
Whole File-Based RDDs (2)
Some Other General RDD Operations
Single-RDD Transformations
  flatMap: maps one element in the base RDD to multiple elements
  distinct: filter out duplicates
  sortBy: use provided function to sort
Multi-RDD Transformations
  intersection: create a new RDD with all elements in both original RDDs
  union: add all elements of two RDDs into a single new RDD
  zip: pair each element of the first RDD with the corresponding element of the second
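As a rough illustration of two of the operations listed above, here is a plain-Python sketch of what `sortBy` and `intersection` compute. It uses ordinary lists in place of RDDs, so it shows the semantics only. The list contents and variable names are assumptions chosen for the example.

```python
# Ordinary lists stand in for RDDs; this is not Spark code.
rdd1 = ["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"]
rdd2 = ["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"]

# intersection: keep only elements that appear in both collections.
# Sorting here just makes the printed result deterministic.
in_both = sorted(set(rdd1) & set(rdd2))

# sortBy: order elements by the value a provided function returns,
# in this case the length of each city name.
by_length = sorted(rdd1, key=len)

print(in_both)
print(by_length)
```

On a real RDD the order of the `intersection` result is not something to rely on, which is why the sketch sorts it before printing.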
Example: flatMap and distinct
Python
> sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .distinct()
Scala
> sc.textFile(file).
    flatMap(line => line.split(' ')).
    distinct()
Input file:
I've never seen a purple cow.
I never hope to see one;
But I can tell you, anyhow,
I'd rather see than be one.
After flatMap: Ive, never, seen, a, purple, cow, I, never, hope, to, …
After distinct: Ive, never, seen, a, purple, cow, I, hope, to, …
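The same pipeline can be traced in plain Python to see what each step contributes. This is a sketch of the semantics only, using a list of lines in place of the RDD that `sc.textFile` would return; the names `lines`, `words`, and `unique_words` are illustrative. Unlike the simplified words in the slide, splitting on whitespace keeps punctuation attached (for example `cow.`).

```python
from itertools import chain

# Stand-in for the RDD of lines that sc.textFile(file) would produce.
lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

# flatMap: each line maps to several words, flattened into one sequence.
words = list(chain.from_iterable(line.split() for line in lines))

# distinct: drop duplicates. dict.fromkeys keeps first-seen order here,
# an ordering that a distributed distinct does not promise.
unique_words = list(dict.fromkeys(words))

print(len(words), len(unique_words))
```

The repeated words `never`, `I`, and `see` each survive only once after the distinct step.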
Examples: Multi-RDD Transformations
rdd1: Chicago, Boston, Paris, San Francisco, Tokyo
rdd2: San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station
rdd1.union(rdd2): Chicago, Boston, Paris, San Francisco, Tokyo, San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station
rdd1.subtract(rdd2): Tokyo, Paris, Chicago
rdd1.zip(rdd2): (Chicago,San Francisco), (Boston,Boston), (Paris,Amsterdam), (San Francisco,Mumbai), (Tokyo,McMurdo Station)
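The results above can be reproduced with ordinary Python lists, which may help when checking expectations before running on a cluster. This sketch is not Spark code and only mirrors the semantics; the variable names are illustrative.

```python
# Ordinary lists stand in for the two RDDs from the example.
rdd1 = ["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"]
rdd2 = ["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"]

# union: all elements of both collections, duplicates retained.
union = rdd1 + rdd2

# subtract: elements of rdd1 that do not occur in rdd2.
exclude = set(rdd2)
subtract = [city for city in rdd1 if city not in exclude]

# zip: pair elements by position.
zipped = list(zip(rdd1, rdd2))

print(union)
print(subtract)
print(zipped)
```

Two differences from Spark are worth keeping in mind: the order of a `subtract` result on a real RDD is not guaranteed (the slide lists Tokyo, Paris, Chicago), and Python's `zip` silently truncates to the shorter input, whereas Spark's `zip` expects both RDDs to have the same number of partitions and the same number of elements in each partition.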
Some Other General RDD Operations
Essential Points
Homework: Process Data Files with Spark