
Spark essentials

Alexey Filanovskiy
Cloudera certified developer

Copyright 2014 Oracle and/or its affiliates. All rights reserved. |

Architecture


Architecture

Architecture of the processing engine over HDFS:

Processing Layer: MapReduce and Hive, Spark, Impala, Search, Big Data SQL

Resource Management: YARN, cgroups

Storage Layer: Filesystem (HDFS), NoSQL databases (Oracle NoSQL DB, HBase)


Architecture

Spark consists of:

- Spark Core. The main engine for processing data
- MLlib. An extension of Spark Core for machine learning
- GraphX. An extension of Spark Core for graph processing
- Spark SQL. An extension of Spark Core that lets you write programs with SQL queries
- Spark Streaming. An engine for processing real-time data streams


Architecture

Languages for writing programs for Spark:


Architecture
Spark elects one node of the cluster to run the Driver Program (the main coordinator).
The driver manages all of the processing distributed across the other nodes.
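A minimal sketch (not from the slides) of what a standalone driver program looks like; in the spark-shell used throughout the examples, the driver and the SparkContext sc are created for you. The object name and app name are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object ExampleDriver {                                   // hypothetical name
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("example")     // hypothetical app name
    val sc = new SparkContext(conf)                      // the driver coordinates work on the other nodes
    println(sc.textFile("hdfs://localhost:8020/tmp/testrdd/").count())
    sc.stop()
  }
}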


RDD


RDD. Definition
An RDD (Resilient Distributed Dataset) in Spark is simply an immutable distributed collection of
objects. Each RDD is split into multiple partitions, which may be computed on different nodes
of the cluster.
In other words, an RDD is the input to your Spark jobs.
Spark provides two ways to create RDDs:
- loading an external dataset
- parallelizing a collection in your driver program


RDD. Terminology

Example input: a 1 TB file, hdfs://namespace-ns/tmp/example-data/file.dat


{"custId":1185972,"movieId":null,"genreId":null,"time":"2012-0701:00:00:07","recommended":null,"activity":8}
{"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-0701:00:00:22","recommended":"N","activity":7}
{"custId":1083711,"movieId":null,"genreId":null,"time":"2012-0701:00:00:26","recommended":null,"activity":9}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-0701:00:00:32","recommended":"Y","activity":7}
{"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-0701:00:00:42","recommended":"Y","activity":6}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-0701:00:00:43","recommended":null,"activity":8}
{"custId":1253676,"movieId":null,"genreId":null,"time":"2012-0701:00:00:50","recommended":null,"activity":9}
{"custId":1351777,"movieId":608,"genreId":6,"time":"2012-0701:00:01:03","recommended":"N","activity":7}
{"custId":1143971,"movieId":null,"genreId":null,"time":"2012-0701:00:01:07","recommended":null,"activity":9}
{"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-0701:00:01:18","recommended":"Y","activity":7}
{"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-0701:00:01:26","recommended":"Y","activity":7}
{"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-0701:00:01:35","recommended":"N","activity":7}
{"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-0701:00:01:39","recommended":"Y","activity":7}}

The file is stored in HDFS as blocks (B1, B2, B3, ...). Those blocks are its partitions (the unit of parallelism), and taken together they form its RDD (the input data).
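A minimal sketch (assuming the spark-shell and the HDFS path above) of how the same file becomes an RDD whose partitions correspond to the HDFS blocks; the variable name is hypothetical:

scala> val activityRDD = sc.textFile("hdfs://namespace-ns/tmp/example-data/file.dat")
scala> activityRDD.partitions.size   // roughly one partition per HDFS block of the 1 TB file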

RDD. Load dataset


An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into
multiple partitions, which may be computed on different nodes of the cluster.
Spark provides two ways to create RDDs:
- Loading an external dataset
Example:
[cloudera@quickstart ~]$ hadoop fs -cat hdfs://localhost:8020/tmp/testrdd/test.file
1 2
3 4 5
6 7 8 9
10

scala> val inputRDD = sc.textFile("hdfs://localhost:8020/tmp/testrdd/");


scala> println(inputRDD.collect().mkString(" , "))

1 2 , 3 4 5 , 6 7 8 9 , 10


RDD. Define in program


An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into
multiple partitions, which may be computed on different nodes of the cluster.
Spark provides two ways to create RDDs:
- Loading an external dataset
- Parallelizing a collection in your driver program
Example:
scala> val inputRDD = sc.parallelize(List("1 2","3 4 5","6 7 8 9","10"))
scala> println(inputRDD.collect().mkString(" , "))
..
1 2 , 3 4 5 , 6 7 8 9 , 10
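As a side note (an assumption, not shown on the slide), parallelize() also accepts the desired number of partitions as a second argument:

scala> val inputRDD = sc.parallelize(List("1 2", "3 4 5", "6 7 8 9", "10"), 4)   // split the collection into 4 partitions
scala> inputRDD.partitions.size   // returns 4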


RDD transformation



RDD transformation

Key concepts:
- Transformations are operations on RDDs that return a new RDD
- Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute until it sees an action

Example:
scala> val weblog = sc.textFile("hdfs://localhost:8020/user/hive/warehouse/weblogs")
scala> val NewRDDwithFilter = weblog.filter(line => line.contains("adidas"))
scala> NewRDDwithFilter.count()
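One way to see the laziness (a small sketch, assuming the same spark-shell session as above): toDebugString prints the planned lineage without running anything; only the count() action triggers the actual job.

scala> println(NewRDDwithFilter.toDebugString)   // shows the RDD lineage; nothing has executed yet
scala> NewRDDwithFilter.count()                  // execution happens here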


RDD. Common pattern

A common pattern is using a single RDD multiple times:

scala> val input = sc.parallelize(List(1, 2, 3, 4))


scala> val result1 = input.map(x => x * x)
scala> val result2 = input.filter(x => x!=1);
.
scala> println(result1.collect().mkString(","))
1,4,9,16
scala> println(result2.collect().mkString(","))
2,3,4


RDD. Common pattern. Caching

scala> import org.apache.spark.storage.StorageLevel
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> input.persist(StorageLevel.MEMORY_ONLY)
scala> val result1 = input.map(x => x * x)
scala> val result2 = input.filter(x => x!=1);
.
scala> println(result1.collect().mkString(","))
1,4,9,16
scala> println(result2.collect().mkString(","))
2,3,4
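Two related calls, shown here as a short sketch: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and unpersist() releases the cached blocks once the RDD is no longer reused.

scala> input.cache()        // same as persist(StorageLevel.MEMORY_ONLY)
scala> input.unpersist()    // drop the cached data when done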


RDD. Useful transformations (Map phase) for single RDD


1) map() - Apply a function to each element in the RDD and return an RDD of the result.
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result = input.map(x => x * x)
scala> println(result.collect().mkString(","))
1,4,9,16
2) flatMap() - Apply a function to each element in the RDD and return an RDD of the contents of
the iterators returned.
scala> val input = sc.parallelize(List("one", "one two", "one two three", "one two three four"))
scala> val result = input.flatMap(x => x.split(" "))
scala> println(result.collect().mkString(","))

one,one,two,one,two,three,one,two,three,four
3) filter() - Return an RDD consisting of only the elements that pass the condition passed to filter()
scala> val input = sc.parallelize(List(1, 2, 3, 4))
scala> val result = input.filter(line => line != 1)
scala> println(result.collect().mkString(","))

2,3,4

RDD. Map in details


map() function in detail.
Its input RDD (from a file):
3
4
5
6

map(x => x + 1)
This function adds 1 to each element:
map(x => 3 + 1) => 4
Output:
4
5
6
7

map(x => x + x)
This function adds each element to itself:
map(x => 3 + 3) => 6
Output:
6
8
10
12

map(x => x * x)
This function multiplies each element by itself:
map(x => 3 * 3) => 9
Output:
9
16
25
36
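The same walk-through as a spark-shell sketch (the input collection mirrors the slide's file; the variable name is illustrative):

scala> val nums = sc.parallelize(List(3, 4, 5, 6))
scala> println(nums.map(x => x + 1).collect().mkString(","))   // 4,5,6,7
scala> println(nums.map(x => x + x).collect().mkString(","))   // 6,8,10,12
scala> println(nums.map(x => x * x).collect().mkString(","))   // 9,16,25,36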


RDD. Useful transformations (Map phase) for multiple RDD


1) union() - Produce an RDD containing elements from both RDDs.
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.union(input2).collect().mkString(","));
..
1,2,3,4,3,4,5,6
2) intersection() - Produce an RDD containing only the elements found in both RDDs.
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.intersection(input2).collect().mkString(","));
..
4,3
3) cartesian() - Cartesian product with the other RDD
scala> val input1 = sc.parallelize(List(1, 2, 3, 4))
scala> val input2 = sc.parallelize(List(3, 4, 5, 6))
scala> println(input1.cartesian(input2).collect().mkString(","));

(1,3),(1,4),(1,5),(1,6),(2,3),(2,4),(2,5),(2,6),(3,3),(3,4),(3,5),(3,6),(4,3),(4,4),(4,5),(4,6)

RDD actions



RDD. Useful actions (Reduce phase) for a single RDD


1) count() - Number of elements in the RDD.
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.count());
6
2) countByValue() - Number of times each element occurs in the RDD
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.countByValue());

Map(4 -> 2, 1 -> 1, 3 -> 2, 2 -> 1)


3) reduce(func) - Combine the elements of the RDD together in parallel
scala> val inputRDD = sc.parallelize(List(1, 2, 3, 4, 3, 4))
scala> println(inputRDD.reduce((x,y) => x + y));
17
scala> println(inputRDD.reduce((x,y) => x * y));
288
scala> println(inputRDD.reduce((x,y) => x - y));
-15


RDD. Reduce in details


reduce(func) function in detail.
Its input file:
3
4
5
6

reduce((x,y) => x + y)
It walks from the first element to the last. Example:
Initially x=3, y=4
3 + 4 = 7
Then x=7, y=5
7 + 5 = 12
Then x=12, y=6
12 + 6 = 18 is the result

reduce((x,y) => x - y)
It walks from the first element to the last. Example:
Initially x=3, y=4
3 - 4 = -1
Then x=-1, y=5
-1 - 5 = -6
Then x=-6, y=6
-6 - 6 = -12 is the result
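The same walk-through as a spark-shell sketch. One caveat worth hedging: with more than one partition, Spark applies reduce() within each partition first, so a non-associative function like subtraction can give a different result; a single partition reproduces the left-to-right order shown above.

scala> val nums = sc.parallelize(List(3, 4, 5, 6), 1)   // one partition, so the order above is preserved
scala> println(nums.reduce((x, y) => x + y))            // 18
scala> println(nums.reduce((x, y) => x - y))            // -12 (may differ with more partitions)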

Pair RDD



Pair RDD
Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs.
They are very useful for group-by-key types of operations.
Creating a pair RDD, example:
scala> val inputRDD = sc.parallelize(List("first string word some other", "second string hello"))
scala> val pairs = inputRDD.map(x => (x.split(" ")(0), x))

scala> println(pairs.collect().mkString(","));
(first,first string word some other),(second,second string hello)


Transformation of Pair RDD (over single RDD)


1) reduceByKey() - Combine values with the same key.
scala> val inputRDD = sc.parallelize(List((1,8),(1,4),(2,5),(2,1)))
scala> println(inputRDD.reduceByKey(_+_).collect().mkString(","))
(1,12),(2,6)
2) groupByKey() - Group values with the same key.
scala> val inputRDD = sc.parallelize(List((1,8),(1,4),(2,5),(2,1)))
scala> println(inputRDD.groupByKey().collect().mkString(","));
(1,CompactBuffer(8, 4)),(2,CompactBuffer(5, 1))
Note: avoid this function; it always shuffles the data without a local reduce.
3) mapValues(func) - Apply a function to each value of a pair RDD without changing the key
scala> val inputRDD = sc.parallelize(List((1,8),(1,4),(2,5),(2,1)))
scala> println(inputRDD.mapValues(x => x * x).collect().mkString(","));
(1,64),(1,16),(2,25),(2,1)
4) sortByKey() - Return an RDD sorted by the key
scala> val inputRDD = sc.parallelize(List((1,8),(2,4),(1,5),(2,1)))
scala> println(inputRDD.sortByKey().collect().mkString(","));
(1,8),(1,5),(2,4),(2,1)


Transformation of Pair RDD (over multiple RDDs)


1) join()
scala> val inputRDD1 = sc.parallelize(List((1,8),(2,4),(3,5),(4,1)))
scala> val inputRDD2 = sc.parallelize(List((4,7)))
scala> println(inputRDD1.join(inputRDD2).collect().mkString(","));
.
(4,(1,7))
2) leftOuterJoin()
scala> val inputRDD1 = sc.parallelize(List((1,8),(2,4),(3,5),(4,1)))
scala> val inputRDD2 = sc.parallelize(List((4,7)))
scala> println(inputRDD1.leftOuterJoin(inputRDD2).collect().mkString(","));
............
(4,(1,Some(7))),(1,(8,None)),(3,(5,None)),(2,(4,None))
3) rightOuterJoin()
scala> val inputRDD1 = sc.parallelize(List((1,8),(2,4),(3,5),(4,1)))
scala> val inputRDD2 = sc.parallelize(List((4,7)))
scala> println(inputRDD1.rightOuterJoin(inputRDD2).collect().mkString(","));
............
(4,(Some(1),7))


Average by key example


scala> val inputRDD = sc.parallelize(List(("panda",0),("pink",3),("pirate",3),("panda",1),("pink",4)))   // define some RDD
scala> val kvRDD = inputRDD.mapValues(x => (x, 1))   // create a (value, count) structure for each value
scala> val sumRDD = kvRDD.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))   // sum the values and the counts, grouped by key
scala> println(sumRDD.collect().mkString(","))
.
(panda,(1,2)),(pirate,(3,1)),(pink,(7,2))

scala> println(sumRDD.mapValues(x => x._1/x._2.toFloat).collect().mkString(","));   // divide the sum by the count

(panda,0.5),(pirate,3.0),(pink,3.5) - the result: the average value for each key
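A hedged alternative for the same per-key average: combineByKey() builds the (sum, count) pair in a single pass instead of the separate mapValues() step. The variable name is hypothetical.

scala> val avgRDD = inputRDD.combineByKey(
     |   (v: Int) => (v, 1),                                           // createCombiner: first value seen for a key
     |   (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: fold another value in
     |   (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: merge partial results
     | ).mapValues(x => x._1 / x._2.toFloat)
scala> println(avgRDD.collect().mkString(","))
(panda,0.5),(pirate,3.0),(pink,3.5)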



ReduceByKey. How it works.
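A short sketch of the idea: reduceByKey() first combines values locally within each partition (a map-side reduce), then shuffles only those partial results and merges them per key, which is why it is preferred over groupByKey(). The example data below is illustrative.

scala> val pairs = sc.parallelize(List(("a", 1), ("a", 2), ("b", 3), ("a", 4)), 2)
scala> println(pairs.reduceByKey(_ + _).collect().mkString(","))   // partial sums per partition are merged into (a,7) and (b,3)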





Parallel execution



Parallel execution
Key concepts:
1) Every RDD has a fixed number of partitions that determines the degree of parallelism
- To find out how many partitions a given RDD contains, run:
scala> bigRDD.partitions.size;
.
res115: Int = 102
2) By default the number of partitions equals the number of HDFS blocks:
[cloudera@quickstart ~]$ hdfs fsck /user/hive/warehouse/weblogs/ | grep "Total blocks"

Total blocks (validated): 102 (avg. block size 39593868 B)

3) To change the number of partitions, use the repartition() function
scala> val lotPartitions = bigRDD.repartition(1000);
scala> val bitPartitions = bigRDD.repartition(10);
4) If you reduce the number of partitions, use coalesce(), the optimized version of repartition()
scala> val bitPartitions2 = bigRDD.coalesce(10);
5) For operations that shuffle data you can specify the degree of parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y, 10)

Spark Partitioning



Data partitioning. Problem


Case:
- We need to join two datasets periodically (every 10 minutes, for example)
- userData is a large, immutable dataset
- events is a relatively small dataset that is new for each join operation
By default, every join shuffles both datasets (the big one and the small one) across the network.


Data partitioning. Solution


Solution:
- Fix a distribution across servers for the large immutable dataset
- Redistribute only the small dataset across the network, according to the distribution of the big one, for each query
To do this, partition the large dataset once:
val userData = sc.sequenceFile[UserID, UserInfo]("hdfs://...")
  .partitionBy(new HashPartitioner(100))
  .persist()
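A hedged sketch of the periodic join itself (UserID and UserInfo follow the slide; LinkInfo and the function name are hypothetical): because userData was partitioned and persisted above, only the small events dataset is shuffled to the nodes that already hold the matching userData partitions.

def processNewLogs(logFileName: String): Unit = {
  val events = sc.sequenceFile[UserID, LinkInfo](logFileName)
  val joined = userData.join(events)   // userData's partitioning is reused, so it is not re-shuffled
  println("matched events: " + joined.count())
}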


Data partitioning. Key concepts


The trick explained above is called Spark partitioning:
- Spark partitioning is available on all RDDs of key/value pairs
- Spark does not give explicit control of which worker node each key goes to
- The program ensures that a set of keys will appear together on some node
- If a given RDD is scanned only once, there is no point in partitioning it in advance
- It is useful only when a dataset is reused multiple times in key-oriented operations such as joins
Example:
scala> val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
scala> pairs.partitioner
res132: Option[org.apache.spark.Partitioner] = None
scala> import org.apache.spark.HashPartitioner
scala> val partitioned = pairs.partitionBy(new HashPartitioner(2))
scala> partitioned.partitioner
res133: Option[org.apache.spark.Partitioner] =
Some(org.apache.spark.HashPartitioner@2)


