
Intro to PySpark

Jason White, Shopify

What is PySpark?

Python interface to Apache Spark

Map/Reduce style distributed computing

Natively Scala

Interfaces to Python, R are well-maintained

Uses Py4J for the Python <-> JVM (Java/Scala) interface

PySpark Basics

Distributed Computing basic premise:

Data is big

Program to process data is relatively small

Send program to where the data lives

Leave data on multiple nodes, scale horizontally

PySpark Basics

Driver: e.g. my laptop

Cluster Manager: YARN, Mesos, etc

Workers: Containers spun up by Cluster Manager
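
A minimal sketch of how these pieces meet in code (the app name is made up; in the pyspark shell a SparkContext already exists as sc, and on a real cluster the master would point at YARN or Mesos instead of local[*]):

from pyspark import SparkConf, SparkContext

# The driver builds the job description; the cluster manager hands out
# worker containers. 'local[*]' keeps everything on the driver machine.
conf = SparkConf().setAppName('intro-to-pyspark').setMaster('local[*]')  # hypothetical app name
sc = SparkContext(conf=conf)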

PySpark Basics

RDD: Resilient Distributed Dataset

Interface to parallelized data in the cluster

Map, Filter, Reduce functions sent from driver, executed by workers on chunks of data in parallel
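
For instance, a tiny end-to-end run (a sketch, assuming a live SparkContext named sc):

# The driver only describes the computation; each worker applies the
# functions to its own partition of the data.
numbers = sc.parallelize(range(1, 11))        # distribute a small dataset
evens = numbers.filter(lambda n: n % 2 == 0)  # filter runs on the workers
squares = evens.map(lambda n: n * n)          # so does map
total = squares.reduce(lambda a, b: a + b)    # partial sums are combined on the driver
print(total)  # 220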

PySpark: Hello World

Classic Word Count problem

How many times does each word appear in a given text?

Approach: each worker computes word counts independently, then the results are aggregated together

PySpark: Hello World


[Diagram: word count as Map -> Shuffle -> Reduce -> Collect. Each worker maps its words ("The", "brown", "dog", "brown", "dog") to pairs like (The, 1), (brown, 1), (dog, 1); the shuffle groups identical words on the same node; reduce sums the counts to (The, 1), (dog, 2), (brown, 2); collect returns the final counts to the driver.]
Demo
# example 1: word count over an in-memory string
text = "the brown dog jumped over the other brown dog"
text_rdd = sc.parallelize(text.split(' '))
text_rdd.map(lambda word: (word, 1)) \
    .reduceByKey(lambda left, right: left + right) \
    .collect()

# example 2: word count over a text file in HDFS
import string
time_machine = sc.textFile('/user/jasonwhite/time_machine')
time_machine_tuples = time_machine.flatMap(lambda line: line.lower().split(' ')) \
    .map(lambda word: ''.join(ch for ch in word if ch in string.ascii_letters)) \
    .filter(lambda word: word != '') \
    .map(lambda word: (word, 1))
word_counts = time_machine_tuples.reduceByKey(lambda left, right: left + right)

Monoids

Monoids are combinations of:

a set of values; and

an associative, commutative combining function

Very efficient in M/R, strongly preferred

Examples:

addition of integers

min/max of records by timestamp
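
A quick illustration of why the combining function needs these properties (a sketch, assuming a live SparkContext named sc): addition gives the same answer however Spark groups the partial results, while something non-associative like subtraction does not.

numbers = sc.parallelize([1, 2, 3, 4], 2)  # assumes sc from the pyspark shell

# Addition is associative and commutative: always 10, no matter how the
# partitions' partial results are combined.
numbers.reduce(lambda a, b: a + b)

# Subtraction is neither: the result changes with partitioning and combine
# order, so it is not safe as a distributed reduce.
numbers.reduce(lambda a, b: a - b)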

Demo
# example 3: averages via a monoid of (sum, count) pairs
dataset = sc.parallelize([
    {'id': 1, 'value': 1},
    {'id': 2, 'value': 2},
    {'id': 2, 'value': 6}
])

def add_tuples(left, right):
    left_sum, left_count = left
    right_sum, right_count = right
    return (left_sum + right_sum, left_count + right_count)

# (sum, count) over the whole dataset; divide to get the overall average
averages = dataset.map(lambda d: (d['value'], 1)) \
    .reduce(add_tuples)

# per-id (sum, count), then divide to get the average per id
averages_by_key = dataset.map(lambda d: (d['id'], (d['value'], 1))) \
    .reduceByKey(add_tuples) \
    .map(lambda kv: (kv[0], kv[1][0] * 1.0 / kv[1][1]))

Demo
# example 4
from datetime import date

dataset = sc.parallelize([
    {'id': 1, 'group_id': 10, 'timestamp': date(1978, 3, 2)},
    {'id': 2, 'group_id': 10, 'timestamp': date(1984, 3, 24)},
    {'id': 3, 'group_id': 10, 'timestamp': date(1986, 5, 19)},
    {'id': 4, 'group_id': 11, 'timestamp': date(1956, 6, 5)},
    {'id': 5, 'group_id': 11, 'timestamp': date(1953, 2, 21)}
])

def calculate_age(d):
    # age in days relative to today (timedelta.days is an attribute)
    d['age'] = (date.today() - d['timestamp']).days
    return d

def calculate_group_stats(left, right):
    # combine two partial summaries for the same group_id
    earliest = min(left['earliest'], right['earliest'])
    latest = max(left['latest'], right['latest'])
    total_age = left['total_age'] + right['total_age']
    count = left['count'] + right['count']
    return {
        'earliest': earliest,
        'latest': latest,
        'total_age': total_age,
        'count': count
    }

group_stats = dataset.map(calculate_age) \
    .map(lambda d: (d['group_id'], {'earliest': d['timestamp'],
                                    'latest': d['timestamp'],
                                    'total_age': d['age'],
                                    'count': 1})) \
    .reduceByKey(calculate_group_stats)

Joining RDDs

Like many RDD operations, works on (k, v) pairs

Each side shuffled using common keys

Each node builds its part of the joined dataset

Joining RDDs
[Diagram: join by key. Each record is keyed by id, so {id: 1, field1: foo} and {id: 2, field1: bar} become (1, {id: 1, field1: foo}) and (2, {id: 2, field1: bar}); likewise {id: 1, field2: baz} and {id: 2, field2: baz} become (1, {id: 1, field2: baz}) and (2, {id: 2, field2: baz}). After the shuffle, matching keys meet on the same node and the join emits (1, ({id: 1, field1: foo}, {id: 1, field2: baz})) and (2, ({id: 2, field1: bar}, {id: 2, field2: baz})).]

Demo
# example 5: inner join of two keyed RDDs
first_dataset = sc.parallelize([
    {'id': 1, 'field1': 'foo'},
    {'id': 2, 'field1': 'bar'},
    {'id': 2, 'field1': 'baz'},
    {'id': 3, 'field1': 'foo'}
])
first_dataset = first_dataset.map(lambda d: (d['id'], d))

second_dataset = sc.parallelize([
    {'id': 1, 'field2': 'abc'},
    {'id': 2, 'field2': 'def'}
])
second_dataset = second_dataset.map(lambda d: (d['id'], d))

output = first_dataset.join(second_dataset)

Key Skew

Achilles heel of M/R: key skew

Shuffle phase distributes like keys to like nodes

If billions of rows are shuffled to the same node, it may cause slight memory issues
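
One way to check for skew before you shuffle (a sketch; pairs stands in for whatever keyed (k, v) RDD you are about to join):

# 'pairs' is a hypothetical keyed RDD of (key, value) tuples.
# countByKey brings {key: row count} back to the driver; a few keys
# owning most of the rows is the classic skew signature.
counts = pairs.countByKey()
heaviest = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:10]
print(heaviest)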

Joining RDDs w/ Skew

When joining to a small RDD, an alternative is to broadcast the RDD

Instead of shuffling, the entire RDD is sent to each worker

Now each worker has all data needed

Each join is now just a map. No shuffle needed!

Demo
# example 6: broadcast (map-side) join -- no shuffle needed
first_dataset = sc.parallelize([
    {'id': 1, 'field1': 'foo'},
    {'id': 2, 'field1': 'bar'},
    {'id': 2, 'field1': 'baz'},
    {'id': 3, 'field1': 'foo'}
])
first_dataset = first_dataset.map(lambda d: (d['id'], d))

second_dataset = sc.parallelize([
    {'id': 1, 'field2': 'abc'},
    {'id': 2, 'field2': 'def'}
])
second_dataset = second_dataset.map(lambda d: (d['id'], d))

# ship the small side to every worker as a plain dict
second_dict = sc.broadcast(second_dataset.collectAsMap())

def join_records(key_record):
    key, record = key_record
    if key in second_dict.value:
        yield (key, (record, second_dict.value[key]))

output = first_dataset.flatMap(join_records)

Ordering

Row order isn't guaranteed unless you explicitly sort the RDD

But: sometimes you need to process events in order!

Solution: repartitionAndSortWithinPartitions

Ordering
[Diagram: unsorted records {id: 1, value: 10}, {id: 2, value: 10}, {id: 3, value: 20}, {id: 1, value: 12}, {id: 1, value: 5}, {id: 2, value: 15} go through Shuffle & Sort so each partition holds its ids' records in order, e.g. {id: 1, value: 5}, {id: 1, value: 10}, {id: 1, value: 12}, {id: 3, value: 20} on one partition and {id: 2, value: 10}, {id: 2, value: 15} on another. MapPartitions then walks each partition to emit the interval between consecutive events: {id: 1, interval: 5}, {id: 1, interval: 2}, {id: 2, interval: 5}.]
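
A minimal sketch of the flow above, assuming a live SparkContext named sc: key each record by (id, value), let repartitionAndSortWithinPartitions shuffle and sort so every id's records arrive in value order on a single partition, then walk each partition with mapPartitions to emit the interval between consecutive records.

events = sc.parallelize([  # assumes sc from the pyspark shell
    {'id': 1, 'value': 10}, {'id': 2, 'value': 10}, {'id': 3, 'value': 20},
    {'id': 1, 'value': 12}, {'id': 1, 'value': 5}, {'id': 2, 'value': 15}
])

# Composite key: partition on id, sort within each partition by (id, value)
keyed = events.map(lambda d: ((d['id'], d['value']), d))
sorted_events = keyed.repartitionAndSortWithinPartitions(
    numPartitions=2, partitionFunc=lambda key: key[0])

def intervals(partition):
    # Records for a given id arrive in value order, so consecutive
    # records with the same id can be differenced directly.
    previous = None
    for key, record in partition:
        if previous is not None and previous['id'] == record['id']:
            yield {'id': record['id'], 'interval': record['value'] - previous['value']}
        previous = record

output = sorted_events.mapPartitions(intervals)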

Thanks!
