A - Parviz Deyhim Amazon

Making
Fly
Parviz Deyhim
http://bit.ly/sparkemr
Obsession
Scalable, Highly-available and Secure
Obsession
Scalable, Highly-available and Secure
Amazon Elastic MapReduce
Hadoop-as-a-service
Map-Reduce engine
Integrated with tools
What is EMR?
Massively parallel
Integrated to AWS services

Cost effective AWS wrapper
In
te
gr
at
io
n
HDF
S
Amazon EMR
In
te
gr
at
io
n
HDF
S
Amazon EMR
Amazon S3
Amazon
DynamoDB
In
te
gr
at
io
n
Data management
Analytics languages
HDF
S
Amazon EMR
Amazon S3
Amazon
DynamoDB
In
te
gr
at
io
n
Data management
Analytics languages
HDF
S
Amazon EMR
Amazon S3
Amazon
DynamoDB
Amazon
RDS
In
te
gr
at
io
n
Data management
Analytics languages
HDF
S
Amazon EMR
Amazon
RedShift
Data Pipeline
Amazon S3
Amazon
DynamoDB
Amazon
RDS
In
te
gr
at
io
n
Data management
Analytics languages
HDF
S
Amazon EMR
Amazon
RedShift
Data Pipeline
Amazon S3
Amazon
DynamoDB
Amazon
RDS
Amazon EMR Concepts

Master Node
Core Nodes
Task Nodes
Core Nodes
Amazon EMR cluster
Master instance group
Maste
r
Node
DataNode (HDFS)
HDFS
HDFS
HDFS
HDFS
Core instance group
Core Nodes
Amazon EMR cluster
Maste
r
Node
Can Add Core

Nodes:
More CPU
More Memory
More HDFS Space
HDFS
HDFS
HDFS
HDFS
Core instance group
HDFS
HDFS
Core Nodes
Amazon EMR cluster
Maste
r
Node
Cant remove core

nodes:
HDFS corruption
HDFS
HDFS
HDFS
HDFS
Core instance group
HDFS
HDFS
Task Nodes
Amazon EMR cluster
Maste
r
Node
No HDFS
Provides compute
resources:
CPU
Memory
HDFS
HDFS
HDFS
HDFS
Core instance group
Task Nodes
Amazon EMR cluster
Maste
r
Node
Can add and

remove task
nodes
HDFS
HDFS
HDFS
HDFS
Core instance group
Spark On Amazon EMR
Bootstrap Actions
Ability to run or install additional
packages/software on EMR nodes
Simple bash script stored on S3
Script gets executed during node/instance
boot time
Script gets executed on every node that gets
added to the cluster
Spark on Amazon EMR

Bootstrap action installs Spark on
EMR nodes
Currently on Spark 0.73 & upgrading
to 0.8 very soon
http://bit.ly/sparkemr
Why Spark on Amazon EMR?

Deploy small and large Spark
clusters in minutes
EMR handles node recover in case of
failures
Integration with EC2 Spot Market,
Amazon Redshift, Amazon Data
pipeline, Amazon Cloudwatch and etc
Why Spark on Amazon EMR?

Shipping Spark logs to S3 for
debugging
Define S3 bucket at cluster deploy time
Spark on EC2 Spot Market

Bid on un-used EC2 capacity
Spark is memory hungry.
Bid on large memory instances with
the fraction of the cost

1TB Memory Cluster
Instance
Type
# of Onnode demand
s
Cost
Spot Cost
Cost/GB Of Memory
Using Spot
M1.xlarge
63
$31/h
$4.41/h
0.44c/GB/h
CC2.8xlarge
16
$39/h
$4.64/h
0.46c/GB/h
M2.4xlarge
15
$24/h
$2.25/h
0.22c/GB/h

Amazon EMR cluster
Maste
r
Node
Launch initial
Spark cluster
with core
nodes
HDFS to store
and checkpoint
RDDs
HDFS
HDFS
HDFS
HDFS
32GB Memory

Amazon EMR cluster
Maste
r
Node
Add Task nodes

in spot market
to increase
memory
capacity
HDFS
HDFS
HDFS
HDFS
32GB Memory
256GB Memory

Create RDDs from
HDFS or Amazon
S3 with:
Amazon EMR cluster

Maste
r
Node
sc.textFile
OR
sc.sequenceFile
HDFS
HDFS
Run Computation
on RDDs
HDFS
HDFS
32GB Memory
256GB Memory
Amazon S3

Amazon EMR cluster
Maste
r
Node
Save the resulting

RDDs to HDFS or
S3 with:
rdd.saveAsSequenc
eFile
OR
HDFS
HDFS
HDFS
HDFS
32GB Memory
rdd.saveAsObjectFil
e
256GB Memory
saveAsObjectF
ile
Amazon S3

Amazon EMR cluster
Maste
r
Node
Shutdown
TaskNodes
when your job
is done
HDFS
HDFS
HDFS
HDFS
32GB Memory
Elastic Spark With Amazon

EMR
Autoscaling Spark
Amazon EMR cluster
Master
Node
HDFS
HDFS
32GB Memory
Autoscaling Spark
Amazon EMR cluster
Master
Node
HDFS
HDFS
32GB Memory
256GB Memory
Elastic Spark
When to Scale?
Depends on your job
CPU bounded or Memory intensive?

Probably both for Spark jobs
Use CPU/Memory util. metrics to decide when

to scale
Amazon EMR Cloudwatch

metrics
EMR integrates with Cloudwatch
Provides many metrics.
Examples:
Load Metrics
HDFS Metrics
S3 Metrics
Amazon EMR Cloudwatch

metrics
Basics on Cloudwatch
Metrics
Pick any Cloudwatch metrics
Pick a threshold that you like to be notified if its
breached
Setup Cloudwatch Alarms based on your thresholds
Receive SNS notification in forms of:
Email
SNS
HTTP API Call
Metrics
Monitor With
Cloudwatch
Receive
Email
Notification
Take Manual
Action Such As
Adding More
Task Nodes
Metrics
Monitor With
Cloudwatch
HTTP API
Calls
Take Automated
Actions
Spark Autoscaling Based on

Load
Setup Cloudwatch alarm on EMR
TotalLoad metric
Receive Email/SNS/HTTP notification
Add more worker nodes by adding
EMR task nodes
Spark Autoscaling Based

Memory
Spark needs memory
Lost of it!!
How to scale based on the memory

usage?
Spark Metrics
Spark 0.8 provides cluster metrics
Source and Sink topology
Spark Metrics
Spark Metric Sources
(metrics.properties):
worker.source.jvm.class=org.apache.spark.metrics.sour
ce.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.sourc
e.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.so
urce.JvmSource
Spark Metrics
Spark Metric Sinks
Package:
ConsoleSink
org.apache.spark.metrics.sink
JmxSink
CsvSink
GangliaSink
Spark Metrics
Spark Metric Sinks

CloudwatchSink
Spark Metrics & Cloudwatch
Spark Metrics & Cloudwatch

Monitor Spark metrics with Cloudwatch
Setup Cloudwatch alarms and get notified
if any metrics reached your threshold.
Example: if JvmHeapUsed > 20G
Receive notification and take manual or
automated actions
Spark Streaming and

Amazon Kinesis
Amazon Kinesis
Kinesis
Amazon Kinesis
CreateStream
Creates a new Data Stream within the Kinesis Service
PutRecord
Adds new records to a Kinesis Stream
DescribeStream
Provides metadata about the Stream, including name,
status, Shards, etc.
GetNextRecord
Fetches next record for processing by user business logic
MergeShard / SplitShard
Scales Stream up/ down
DeleteStream
Deletes the Stream
Amazon Kinesis
Kinesis
Spark Streaming and Amazon

Kinesis
SparkStreaming Kinesis
Receiver
Extends NetworkReceiver
Creates a single Receiver per shard
and reads from Kinesis
Misc.
New AWS instances provide
enhanced networking in VPC
C3
I2
Higher PPS
Less Jitter
Great CPU Power
Suitable for Spark: Serialization and
Shuffle
What Do You Like To See On Spark By

Amazon?
Send Feedbacks To:
Parviz Deyhim
parvizd@amazon.com
@pdeyhim

A - Parviz Deyhim Amazon

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

A - Parviz Deyhim Amazon

Uploaded by

Copyright:

Available Formats

Making

Amazon Elastic MapReduce

Integrated with tools

Integrated to AWS services

Amazon EMR Concepts

Core instance group

Can Add Core

Core instance group

Cant remove core

Core instance group

Core instance group

Can add and

Core instance group

Spark On Amazon EMR

Spark on Amazon EMR

Why Spark on Amazon EMR?

Why Spark on Amazon EMR?

Spark on EC2 Spot Market

Spark on EC2 Spot Market

Spark on EC2 Spot Market

Spark on EC2 Spot Market

Add Task nodes

Spark on EC2 Spot Market

Amazon EMR cluster

Spark on EC2 Spot Market

Save the resulting

Spark on EC2 Spot Market

Elastic Spark With Amazon

CPU bounded or Memory intensive?

Use CPU/Memory util. metrics to decide when

Amazon EMR Cloudwatch

Amazon EMR Cloudwatch

Spark Autoscaling Based on

Spark Autoscaling Based

How to scale based on the memory

Spark Metric Sinks

Spark Metrics & Cloudwatch

Spark Metrics & Cloudwatch

Spark Streaming and

Spark Streaming and Amazon

What Do You Like To See On Spark By

You might also like