
Spark essentials

Alexey Filanovskiy
Cloudera certified developer

Copyright 2014 Oracle and/or its affiliates. All rights reserved. |

Architecture


Architecture

Architecture of the processing engine over HDFS:

Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL

Resource Management: YARN, cgroups

Storage Layer: Filesystem (HDFS), NoSQL databases (Oracle NoSQL DB, HBase)


Architecture

Spark consists of:

- Spark Core. The main engine for processing data
- MLlib. An extension of Spark Core for machine learning
- GraphX. An extension of Spark Core for graph processing
- Spark SQL. An extension of Spark Core that lets you write programs with SQL queries
- Spark Streaming. An engine for processing real-time data streams


Architecture

Languages for writing programs for Spark:


Architecture
Spark elects one node of the cluster to run the Driver Program (the main coordinator).
The driver manages all of the processing distributed across the other nodes.
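A minimal sketch (not from the slides) of what a standalone driver program looks like; in the spark-shell used throughout the examples, the driver and the SparkContext sc are created for you. The object name and app name are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object ExampleDriver {                                   // hypothetical name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("example")     // hypothetical app name
    val sc = new SparkContext(conf)                      // the driver coordinates work on the other nodes
    println(sc.textFile("hdfs://localhost:8020/tmp/testrdd/").count())
    sc.stop()
  }
}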


RDD


RDD. Definition
An RDD (Resilient Distributed Dataset) in Spark is simply an immutable distributed collection of
objects. Each RDD is split into multiple partitions, which may be computed on different nodes
of the cluster.
In other words, an RDD is the input to your Spark jobs.
Spark provides two ways to create RDDs:
- loading an external dataset
- parallelizing a collection in your driver program


RDD. Terminology

Example input: a 1 TB file, hdfs://namespace-ns/tmp/example-data/file.dat


{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-0701:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-0701:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-0701:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-0701:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-0701:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-0701:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-0701:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-0701:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-0701:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-0701:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-0701:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-0701:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-0701:00:01:39","recommended":"Y","activity":7}}

The file is stored in HDFS as blocks (B1, B2, B3, ...). Those blocks are its partitions (the unit of parallelism), and taken together they form its RDD (the input data).
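A minimal sketch (assuming the spark-shell and the HDFS path above) of how the same file becomes an RDD whose partitions correspond to the HDFS blocks; the variable name is hypothetical:

scala> val activityRDD = sc.textFile("hdfs://namespace-ns/tmp/example-data/file.dat")
scala> activityRDD.partitions.size   // roughly one partition per HDFS block of the 1 TB file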

RDD. Load dataset


An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into
multiple partitions, which may be computed on different nodes of the cluster.
Spark provides two ways to create RDDs:
- Loading an external dataset
Example:
[cloudera@quickstart ~]$ hadoop fs -cat hdfs://localhost:8020/tmp/testrdd/test.file
1 2
3 4 5
6 7 8 9
10

scala> val inputRDD = sc.textFile("hdfs://localhost:8020/tmp/testrdd/");


scala> println(inputRDD.collect().mkString(" , "))

1 2 , 3 4 5 , 6 7 8 9 , 10


RDD. Define in program


An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into
multiple partitions, which may be computed on different nodes of the cluster.
Spark provides two ways to create RDDs:
- Loading an external dataset
- Parallelizing a collection in your driver program
Example:
scala> val inputRDD = sc.parallelize(List("1 2","3 4 5","6 7 8 9","10"))
scala> println(inputRDD.collect().mkString(" , "))
..
1 2 , 3 4 5 , 6 7 8 9 , 10
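As a side note (an assumption, not shown on the slide), parallelize() also accepts the desired number of partitions as a second argument:

scala> val inputRDD = sc.parallelize(List("1 2", "3 4 5", "6 7 8 9", "10"), 4)   // split the collection into 4 partitions
scala> inputRDD.partitions.size   // returns 4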


RDD transformation



RDD transformation

Key concepts:
- Transformations are operations on RDDs that return a new RDD
- Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action

Example:
scala> val weblog = sc.textFile("hdfs://localhost:8020/user/hive/warehouse/weblogs")
scala> val NewRDDwithFilter = weblog.filter(line => line.contains("adidas"))
scala> NewRDDwithFilter.count()
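One way to see the laziness (a small sketch, assuming the same spark-shell session as above): toDebugString prints the planned lineage without running anything; only the count() action triggers the actual job.

scala> println(NewRDDwithFilter.toDebugString)   // shows the RDD lineage; nothing has executed yet
scala> NewRDDwithFilter.count()                  // execution happens here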


RDD. Common pattern

A common pattern is using a single RDD multiple times:

scala> val input = sc.parallelize(List(1, 2, 3, 4))


scala> val result1 = input.map(x => x * x)
scala> val result2 = input.filter(x => x!=1);
.
scala> println(result1.collect().mkString(","))
1,4,9,16
scala> println(result2.collect().mkString(","))
2,3,4


RDD. Common pattern. Caching

scala> import org.apache.spark.storage.StorageLevel
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> input.persist(StorageLevel.MEMORY_ONLY)
scala> val result1 = input.map(x => x * x)
scala> val result2 = input.filter(x => x!=1);
.
scala> println(result1.collect().mkString(","))
1,4,9,16
scala> println(result2.collect().mkString(","))
2,3,4
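Two related calls, shown here as a short sketch: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and unpersist() releases the cached blocks once the RDD is no longer reused.

scala> input.cache()        // same as persist(StorageLevel.MEMORY_ONLY)
scala> input.unpersist()    // drop the cached data when done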


RDD. Useful transformations (Map phase) for single RDD


1) map() - Apply a function to each element in the RDD and return an RDD of the result.
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result = input.map(x => x * x)
scala> println(result.collect().mkString(","))
1,4,9,16
2) flatMap() - Apply a function to each element in the RDD and return an RDD of the contents of
the iterators returned.
scala> val input = sc.parallelize(List("one", "one two", "one two three", "one two three four"))
scala> val result = input.flatMap(x => x.split(" "))
scala> println(result.collect().mkString(","))

one,one,two,one,two,three,one,two,three,four
3) filter() - Return an RDD consisting of only the elements that pass the condition passed to filter()
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result = input.filter(line => line != 1)
scala> println(result.collect().mkString(","))

2,3,4

RDD. Map in details


map() function in detail.
Its input RDD (from a file):
3
4
5
6

map(x => x + 1)
This function adds 1 to each element:
map(x => 3 + 1) => 4
Output:
4
5
6
7

map(x => x + x)
This function adds each element to itself:
map(x => 3 + 3) => 6
Output:
6
8
10
12

map(x => x * x)
This function multiplies each element by itself:
map(x => 3 * 3) => 9
Output:
9
16
25
36
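The same walk-through as a spark-shell sketch (the input collection mirrors the slide's file; the variable name is illustrative):

scala> val nums = sc.parallelize(List(3, 4, 5, 6))
scala> println(nums.map(x => x + 1).collect().mkString(","))   // 4,5,6,7
scala> println(nums.map(x => x + x).collect().mkString(","))   // 6,8,10,12
scala> println(nums.map(x => x * x).collect().mkString(","))   // 9,16,25,36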


RDD. Useful transformations (Map phase) for multiple RDD


1) union() - Produce an RDD containing elements from both RDDs.
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.union(input2).collect().mkString(","));
..
1,2,3,4,3,4,5,6
2) intersection() - Produce an RDD containing only the elements found in both RDDs.
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.intersection(input2).collect().mkString(","));
..
4,3
3) cartesian() - Cartesian product with the other RDD
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.cartesian(input2).collect().mkString(","));

(1,3),(1,4),(1,5),(1,6),(2,3),(2,4),(2,5),(2,6),(3,3),(3,4),(3,5),(3,6),(4,3),(4,4),(4,5),(4,6)

RDD actions



RDD. Useful actions (Reduce phase) for a single RDD


1) count() - Number of elements in the RDD.
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.count());
6
2) countByValue() - Number of times each element occurs in the RDD
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.countByValue());

Map(4 -> 2, 1 -> 1, 3 -> 2, 2 -> 1)


3) reduce(func) - Combine the elements of the RDD together in parallel
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.reduce((x,y) => x + y));
17
scala> println(inputRDD.reduce((x,y) => x * y));
288
scala> println(inputRDD.reduce((x,y) => x - y));
-15


RDD. Reduce in details


reduce(func) function in detail.
Its input file:
3
4
5
6

reduce((x,y) => x + y)
It walks from the first element to the last. Example:
Initially x=3, y=4
3 + 4 = 7
Then x=7, y=5
7 + 5 = 12
Then x=12, y=6
12 + 6 = 18 is the result

reduce((x,y) => x - y)
It walks from the first element to the last. Example:
Initially x=3, y=4
3 - 4 = -1
Then x=-1, y=5
-1 - 5 = -6
Then x=-6, y=6
-6 - 6 = -12 is the result
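The same walk-through as a spark-shell sketch. One caveat worth hedging: with more than one partition, Spark applies reduce() within each partition first, so a non-associative function like subtraction can give a different result; a single partition reproduces the left-to-right order shown above.

scala> val nums = sc.parallelize(List(3, 4, 5, 6), 1)   // one partition, so the order above is preserved
scala> println(nums.reduce((x, y) => x + y))            // 18
scala> println(nums.reduce((x, y) => x - y))            // -12 (may differ with more partitions)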

Pair RDD



Pair RDD
Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs.
They are very useful for group-by-key types of operations.
Creating a pair RDD, example:
scala> val inputRDD = sc.parallelize(List("first string word some other", "second string hello"))
scala> val pairs = inputRDD.map(x => (x.split(" ")(0), x))

scala> println(pairs.collect().mkString(","));
(first,first string word some other),(second,second string hello)


Transformation of Pair RDD (over single RDD)


1) reduceByKey() - Combine values with the same key.
scala> val inputRDD = sc.parallelize(List((1,8),(1,4),(2,5),(2,1)))
scala> println(inputRDD.reduceByKey(_+_).collect().mkString(","))
(1,12),(2,6)
2) groupByKey() - Group values with the same key.
scala> val inputRDD = sc.parallelize(List((1,8),(1,4),(2,5),(2,1)))
scala> println(inputRDD.groupByKey().collect().mkString(","));
(1,CompactBuffer(8, 4)),(2,CompactBuffer(5, 1))
Note: avoid this function; it always shuffles the data without a local reduce.
3) mapValues(func) - Apply a function to each value of a pair RDD without changing the key
scala> val inputRDD = sc.parallelize(List((1,8),(1,4),(2,5),(2,1)))
scala> println(inputRDD.mapValues(x => x * x).collect().mkString(","));
(1,64),(1,16),(2,25),(2,1)
4) sortByKey() - Return an RDD sorted by the key
scala> val inputRDD = sc.parallelize(List((1,8),(2,4),(1,5),(2,1)))
scala> println(inputRDD.sortByKey().collect().mkString(","));
(1,8),(1,5),(2,4),(2,1)


Transformation of Pair RDD (over multiple RDDs)


1) join()
scala> val inputRDD1 = sc.parallelize(List((1,8),(2,4),(3,5),(4,1)))
scala> val inputRDD2 = sc.parallelize(List((4,7)))
scala> println(inputRDD1.join(inputRDD2).collect().mkString(","));
.
(4,(1,7))
2) leftOuterJoin()
scala> val inputRDD1 = sc.parallelize(List((1,8),(2,4),(3,5),(4,1)))
scala> val inputRDD2 = sc.parallelize(List((4,7)))
scala> println(inputRDD1.leftOuterJoin(inputRDD2).collect().mkString(","));
............
(4,(1,Some(7))),(1,(8,None)),(3,(5,None)),(2,(4,None))
3) rightOuterJoin()
scala> val inputRDD1 = sc.parallelize(List((1,8),(2,4),(3,5),(4,1)))
scala> val inputRDD2 = sc.parallelize(List((4,7)))
scala> println(inputRDD1.rightOuterJoin(inputRDD2).collect().mkString(","));
............
(4,(Some(1),7))


Average by key example


scala> val inputRDD = sc.parallelize(List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))   // define some RDD
scala> val kvRDD = inputRDD.mapValues(x => (x, 1))   // create a (value, count) structure for each value
scala> val sumRDD = kvRDD.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // sum the values and the counts, grouped by key
scala> println(sumRDD.collect().mkString(","))
.
(panda,(1,2)),(pirate,(3,1)),(pink,(7,2))

scala> println(sumRDD.mapValues(x => x._1/x._2.toFloat).collect().mkString(","));   // divide the sum by the count

(panda,0.5),(pirate,3.0),(pink,3.5) - the result: the average value for each key
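A hedged alternative for the same per-key average: combineByKey() builds the (sum, count) pair in a single pass instead of the separate mapValues() step. The variable name is hypothetical.

scala> val avgRDD = inputRDD.combineByKey(
     |   (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key
     |   (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value in
     |   (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge partial results
     | ).mapValues(x => x._1 / x._2.toFloat)
scala> println(avgRDD.collect().mkString(","))
(panda,0.5),(pirate,3.0),(pink,3.5)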



ReduceByKey. How it works.
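A short sketch of the idea: reduceByKey() first combines values locally within each partition (a map-side reduce), then shuffles only those partial results and merges them per key, which is why it is preferred over groupByKey(). The example data below is illustrative.

scala> val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3), ("a", 4)), 2)
scala> println(pairs.reduceByKey(_ + _).collect().mkString(","))   // partial sums per partition are merged into (a,7) and (b,3)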





Parallel execution



Parallel execution
Key concepts:
1) Every RDD has a fixed number of partitions that determines the degree of parallelism
- To find out how many partitions a given RDD contains, run:
scala> bigRDD.partitions.size;
.
res115: Int = 102
2) By default the number of partitions equals the number of HDFS blocks:
[cloudera@quickstart ~]$ hdfs fsck /user/hive/warehouse/weblogs/ | grep "Total blocks"

Total blocks (validated): 102 (avg. block size 39593868 B)

3) To change the number of partitions, use the repartition() function
scala> val lotPartitions = bigRDD.repartition(1000);
scala> val bitPartitions = bigRDD.repartition(10);
4) If you reduce the number of partitions, use coalesce(), the optimized version of repartition()
scala> val bitPartitions2 = bigRDD.coalesce(10);
5) For operations that shuffle data you can specify the degree of parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y, 10)

Spark Partitioning



Data partitioning. Problem


Case:
- We need to join two datasets periodically (every 10 minutes, for example)
- userData is a large, immutable dataset
- events is a relatively small dataset that is new for each join operation
By default, every join shuffles both datasets (the big one and the small one) across the network.


Data partitioning. Solution


Solution:
- Fix a distribution across servers for the large immutable dataset
- Redistribute only the small dataset across the network, according to the distribution of the big one, for each query
To do this, partition the large dataset once:
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
  .partitionBy(new HashPartitioner(100))
  .persist()
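A hedged sketch of the periodic join itself (UserID and UserInfo follow the slide; LinkInfo and the function name are hypothetical): because userData was partitioned and persisted above, only the small events dataset is shuffled to the nodes that already hold the matching userData partitions.

def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events)   // userData's partitioning is reused, so it is not re-shuffled
  println("matched events: " + joined.count())
}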


Data partitioning. Key concepts


The trick explained above is called Spark partitioning:
- Spark partitioning is available on all RDDs of key/value pairs
- Spark does not give explicit control of which worker node each key goes to
- The program ensures that a set of keys will appear together on some node
- If a given RDD is scanned only once, there is no point in partitioning it in advance
- It is useful only when a dataset is reused multiple times in key-oriented operations such as joins
Example:
scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
scala> pairs.partitioner
res132: Option[org.apache.spark.Partitioner] = None
scala> import org.apache.spark.HashPartitioner
scala> val partitioned = pairs.partitionBy(new HashPartitioner(2))
scala> partitioned.partitioner
res133: Option[org.apache.spark.Partitioner] =
Some(org.apache.spark.HashPartitioner@2)


