
Working With RDDs in Spark

Chapter 11

201509
Course Chapters

Course Introduction
  1  Introduction

Introduction to Hadoop
  2  Introduction to Hadoop and the Hadoop Ecosystem
  3  Hadoop Architecture and HDFS

Importing and Modeling Structured Data
  4  Importing Relational Data with Apache Sqoop
  5  Introduction to Impala and Hive
  6  Modeling and Managing Data with Impala and Hive
  7  Data Formats
  8  Data File Partitioning

Ingesting Streaming Data
  9  Capturing Data with Apache Flume

Distributed Data Processing with Spark
  10 Spark Basics
  11 Working with RDDs in Spark
  12 Aggregating Data with Pair RDDs
  13 Writing and Deploying Spark Applications
  14 Parallel Processing in Spark
  15 Spark RDD Persistence
  16 Common Patterns in Spark Data Processing
  17 Spark SQL and DataFrames

Course Conclusion
  18 Conclusion
Copyright 2010-2015 Cloudera. All rights reserved. Not to be reproduced or shared without prior written consent from Cloudera.
Working With RDDs

In this chapter you will learn
- How RDDs are created from files or data in memory
- How to handle file formats with multi-line records
- How to use some additional operations on RDDs
Chapter Topics

Working With RDDs in Spark (Distributed Data Processing with Spark)

- Creating RDDs
- Other General RDD Operations
- Conclusion
- Homework: Process Data Files with Spark
RDDs

RDDs can hold any type of element
- Primitive types: integers, characters, booleans, etc.
- Sequence types: strings, lists, arrays, tuples, dicts, etc. (including nested data types)
- Scala/Java objects (if serializable)
- Mixed types

Some types of RDDs have additional functionality
- Pair RDDs: RDDs consisting of key-value pairs
- Double RDDs: RDDs consisting of numeric data
Creating RDDs From Collections

You can create RDDs from collections instead of files
- sc.parallelize(collection)

> myData = ["Alice","Carlos","Frank","Barbara"]
> myRdd = sc.parallelize(myData)
> myRdd.take(2)
['Alice', 'Carlos']

Useful when
- Testing
- Generating data programmatically
- Integrating
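The chunking that parallelize performs can be sketched without Spark. The helper below (split_into_partitions, a hypothetical name, not part of the Spark API) shows the idea of dividing a local collection into roughly equal partitions:

```python
# Spark-free sketch of how parallelize might chunk a local collection
# into partitions; real Spark does this internally per executor.
def split_into_partitions(data, num_partitions):
    """Divide data into num_partitions roughly equal slices."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions get one additional element
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

myData = ["Alice", "Carlos", "Frank", "Barbara"]
parts = split_into_partitions(myData, 2)
# Two partitions of two names each; take(2) reads from the first
```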

Creating RDDs from Files (1)

For file-based RDDs, use SparkContext.textFile
- Accepts a single file, a wildcard list of files, or a comma-separated list of files
- Examples
    sc.textFile("myfile.txt")
    sc.textFile("mydata/*.log")
    sc.textFile("myfile1.txt,myfile2.txt")
- Each line in the file(s) is a separate record in the RDD

Files are referenced by absolute or relative URI
- Absolute URI:
    file:/home/training/myfile.txt
    hdfs://localhost/loudacre/myfile.txt
- Relative URI (uses default file system): myfile.txt

Creating RDDs from Files (2)

textFile maps each line in a file to a separate RDD element

File contents:                     Resulting RDD elements:
I've never seen a purple cow.\n    I've never seen a purple cow.
I never hope to see one;\n         I never hope to see one;
But I can tell you, anyhow,\n      But I can tell you, anyhow,
I'd rather see than be one.\n      I'd rather see than be one.
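The per-line mapping above can be mimicked locally: splitting the file contents on newlines yields exactly the records textFile would produce (a Spark-free sketch):

```python
# Spark-free sketch: textFile treats each newline-delimited line
# as one RDD record, with the trailing newline stripped.
content = ("I've never seen a purple cow.\n"
           "I never hope to see one;\n"
           "But I can tell you, anyhow,\n"
           "I'd rather see than be one.\n")

records = content.splitlines()  # one element per line
# records[0] == "I've never seen a purple cow."
```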

textFile only works with line-delimited text files


What about other formats?

Input and Output Formats (1)

Spark uses Hadoop InputFormat and OutputFormat Java classes

Some examples from core Hadoop
- TextInputFormat / TextOutputFormat: newline-delimited text files
- SequenceFileInputFormat / SequenceFileOutputFormat
- FixedLengthInputFormat

Many implementations available in additional libraries
- e.g. AvroInputFormat / AvroOutputFormat in the Avro library

Input and Output Formats (2)

Specify any input format using sc.hadoopFile
- or newAPIHadoopFile for New API classes

Specify any output format using rdd.saveAsHadoopFile
- or saveAsNewAPIHadoopFile for New API classes

textFile and saveAsTextFile are convenience functions
- textFile just calls hadoopFile, specifying TextInputFormat
- saveAsTextFile calls saveAsHadoopFile, specifying TextOutputFormat

Whole File-Based RDDs (1)

sc.textFile maps each line in a file to a separate RDD element
- What about files with a multi-line input format, e.g. XML or JSON?

sc.wholeTextFiles(directory)
- Maps the entire contents of each file in a directory to a single RDD element
- Works only for small files (each element must fit in memory)

Example input files:

file1.json                      file2.json
{                               {
  "firstName":"Fred",            "firstName":"Barney",
  "lastName":"Flintstone",       "lastName":"Rubble",
  "userid":"123"                 "userid":"234"
}                               }

Resulting RDD of (filename, contents) pairs:
(file1.json,{"firstName":"Fred","lastName":"Flintstone","userid":"123"})
(file2.json,{"firstName":"Barney","lastName":"Rubble","userid":"234"})
(file3.xml, )
(file4.xml, )

Whole File-Based RDDs (2)

Python:
> import json
> myrdd1 = sc.wholeTextFiles(mydir)
> myrdd2 = myrdd1.map(lambda pair: json.loads(pair[1]))
> for record in myrdd2.take(2):
>     print(record["firstName"])

Output:
Fred
Barney

Scala:
> import scala.util.parsing.json.JSON
> val myrdd1 = sc.wholeTextFiles(mydir)
> val myrdd2 = myrdd1.map(pair =>
    JSON.parseFull(pair._2).get.asInstanceOf[Map[String,String]])
> for (record <- myrdd2.take(2))
    println(record.getOrElse("firstName", null))
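The mapping step relies only on json.loads, so it can be checked locally on (filename, contents) pairs shaped like the wholeTextFiles output, using the slide's sample data:

```python
import json

# (filename, contents) pairs shaped like sc.wholeTextFiles output
pairs = [
    ("file1.json", '{"firstName":"Fred","lastName":"Flintstone","userid":"123"}'),
    ("file2.json", '{"firstName":"Barney","lastName":"Rubble","userid":"234"}'),
]

# Same transformation as the map() above, as a list comprehension:
# discard the filename, parse the contents as JSON
records = [json.loads(contents) for _, contents in pairs]
first_names = [r["firstName"] for r in records]
# first_names == ["Fred", "Barney"]
```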

Chapter Topics

Working With RDDs in Spark (Distributed Data Processing with Spark)

- Creating RDDs
- Other General RDD Operations
- Conclusion
- Homework: Process Data Files with Spark

Some Other General RDD Operations

Single-RDD Transformations
- flatMap: maps one element in the base RDD to multiple elements
- distinct: filters out duplicates
- sortBy: uses a provided function to sort

Multi-RDD Transformations
- intersection: creates a new RDD with all elements found in both original RDDs
- union: adds all elements of two RDDs into a single new RDD
- zip: pairs each element of the first RDD with the corresponding element of the second

Example: flatMap and distinct

Python:
> sc.textFile(file) \
    .flatMap(lambda line: line.split()) \
    .distinct()

Scala:
> sc.textFile(file).
    flatMap(line => line.split(' ')).
    distinct()

Input:                           After flatMap:    After distinct:
I've never seen a purple cow.    I've              I've
I never hope to see one;         never             never
But I can tell you, anyhow,      seen              seen
I'd rather see than be one.      a                 a
                                 purple            purple
                                 cow.              cow.
                                 I                 I
                                 never             hope
                                 hope              to
                                 to                ...
                                 ...
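The same pipeline can be traced in plain Python: a nested list comprehension stands in for flatMap, and dict.fromkeys stands in for distinct (Spark's distinct does not guarantee element order, so only counts and membership should be compared):

```python
lines = [
    "I've never seen a purple cow.",
    "I never hope to see one;",
    "But I can tell you, anyhow,",
    "I'd rather see than be one.",
]

# flatMap: one line -> many words, flattened into a single list
words = [w for line in lines for w in line.split()]

# distinct: drop duplicates ("never", "I", and "see" repeat above)
distinct_words = list(dict.fromkeys(words))
# len(words) == 24, len(distinct_words) == 21
```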

Examples: Multi-RDD Transformations

rdd1:            rdd2:
Chicago          San Francisco
Boston           Boston
Paris            Amsterdam
San Francisco    Mumbai
Tokyo            McMurdo Station

rdd1.union(rdd2):
Chicago, Boston, Paris, San Francisco, Tokyo,
San Francisco, Boston, Amsterdam, Mumbai, McMurdo Station

rdd1.subtract(rdd2):
Tokyo, Paris, Chicago

rdd1.zip(rdd2):
(Chicago,San Francisco)
(Boston,Boston)
(Paris,Amsterdam)
(San Francisco,Mumbai)
(Tokyo,McMurdo Station)
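These three transformations have direct list analogues in plain Python. Note that Spark's union and subtract do not guarantee element order across partitions; the order here simply follows the lists:

```python
rdd1 = ["Chicago", "Boston", "Paris", "San Francisco", "Tokyo"]
rdd2 = ["San Francisco", "Boston", "Amsterdam", "Mumbai", "McMurdo Station"]

# union: all elements of both RDDs, duplicates kept
union = rdd1 + rdd2

# subtract: elements of rdd1 that do not appear in rdd2
subtract = [city for city in rdd1 if city not in set(rdd2)]

# zip: pair elements positionally (the RDDs must have the same
# number of elements and partitions in real Spark)
zipped = list(zip(rdd1, rdd2))
# zipped[0] == ("Chicago", "San Francisco")
```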

Some Other General RDD Operations

Other RDD operations
- first: returns the first element of the RDD
- foreach: applies a function to each element in an RDD
- top(n): returns the largest n elements using natural ordering

Sampling operations
- sample: creates a new RDD with a sampling of elements
- takeSample: returns an array of sampled elements

Double RDD operations
- Statistical functions, e.g. mean, sum, variance, stdev
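The double-RDD statistics map onto Python's statistics module in a Spark-free sketch; as an assumption worth checking against your Spark version, variance and stdev are treated here as the population (not sample) versions:

```python
import random
import statistics

nums = [1.0, 2.0, 3.0, 4.0]

total = sum(nums)             # like rdd.sum()
mean = statistics.mean(nums)  # like rdd.mean()
var = statistics.pvariance(nums)  # population variance, like rdd.variance()
std = statistics.pstdev(nums)     # population stdev, like rdd.stdev()

# Sampling analogue of takeSample(withReplacement=False, num=2):
sampled = random.sample(nums, 2)
```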

Chapter Topics

Working With RDDs in Spark (Distributed Data Processing with Spark)

- Creating RDDs
- Other General RDD Operations
- Conclusion
- Homework: Process Data Files with Spark

Essential Points

- RDDs can be created from files, parallelized data in memory, or other RDDs
- sc.textFile reads newline-delimited text, one line per RDD record
- sc.wholeTextFiles reads entire files into single RDD records
- Generic RDDs can consist of any type of data
- Generic RDDs provide a wide range of transformation operations

Chapter Topics

Working With RDDs in Spark (Distributed Data Processing with Spark)

- Creating RDDs
- Other General RDD Operations
- Conclusion
- Homework: Process Data Files with Spark

Homework: Process Data Files with Spark

In this homework assignment you will
- Process a set of XML files using wholeTextFiles
- Reformat a dataset to standardize its format (bonus)
- Please refer to the Homework description
