Fly
Parviz Deyhim
http://bit.ly/sparkemr
Obsession
Scalable, Highly-available and Secure
What is EMR?
Hadoop-as-a-service
Map-Reduce engine
Massively parallel
Integration
Amazon EMR
HDFS
Amazon S3
Amazon DynamoDB
Amazon RDS
Amazon Redshift
Data Pipeline
Data management
Analytics languages
Core Nodes
Amazon EMR cluster
Master instance group
Master Node
DataNode (HDFS)
Task Nodes
Amazon EMR cluster
Master instance group
Master Node
No HDFS
Provides compute resources: CPU, Memory
Bootstrap Actions
Ability to run or install additional packages/software on EMR nodes
Simple bash script stored on S3
Script gets executed during node/instance boot time
Script gets executed on every node that gets added to the cluster
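As a rough illustration of how a bootstrap action is attached to a cluster, the sketch below builds the dictionary shape that boto3's EMR client accepts for its `BootstrapActions` parameter. The action name, bucket, script path, and argument are hypothetical placeholders, and the slides do not say which tooling was actually used to launch the cluster.

```python
def bootstrap_action(name, script_s3_path, args=()):
    """Describe one bootstrap action in the dict shape boto3's EMR client accepts."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,  # bash script stored on S3
            "Args": list(args),      # arguments handed to the script on every node
        },
    }


# Hypothetical script location and argument, for illustration only.
actions = [
    bootstrap_action("Install Spark", "s3://my-bucket/install-spark.sh", ["--verbose"])
]
# The list would be passed as BootstrapActions=actions to boto3's
# emr.run_job_flow(...), alongside the rest of the cluster definition.
```

Because the script runs on every node that joins the cluster, task nodes added later pick up the same software without extra steps.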
http://bit.ly/sparkemr
Instance type   # of nodes   On-demand cost   Spot cost   Cost/GB of memory using Spot
M1.xlarge       63           $31/h            $4.41/h     0.44c/GB/h
CC2.8xlarge     16           $39/h            $4.64/h     0.46c/GB/h
M2.4xlarge      15           $24/h            $2.25/h     0.22c/GB/h
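The per-GB column can be reproduced with simple arithmetic. The sketch below assumes each fleet is sized to roughly 1,000 GB of total memory. That figure is an inference from the numbers in the table, not something the slides state, so treat it as an assumption.

```python
# Spot cost in $/h for each whole fleet, copied from the table above.
SPOT_COST_PER_HOUR = {"M1.xlarge": 4.41, "CC2.8xlarge": 4.64, "M2.4xlarge": 2.25}
# On-demand cost in $/h for the same fleets, also from the table.
ON_DEMAND_COST_PER_HOUR = {"M1.xlarge": 31.0, "CC2.8xlarge": 39.0, "M2.4xlarge": 24.0}

# ASSUMPTION: every fleet provides about 1,000 GB of memory in total.
ASSUMED_TOTAL_MEMORY_GB = 1000


def cents_per_gb_hour(spot_cost_per_hour, total_memory_gb=ASSUMED_TOTAL_MEMORY_GB):
    """Convert a fleet's hourly spot cost into cents per GB of memory per hour."""
    return spot_cost_per_hour / total_memory_gb * 100


def spot_discount(instance_type):
    """Fraction of the on-demand price saved by running the fleet on Spot."""
    return 1 - SPOT_COST_PER_HOUR[instance_type] / ON_DEMAND_COST_PER_HOUR[instance_type]


def cheapest_fleet():
    """Instance type with the lowest hourly spot cost for the whole fleet."""
    return min(SPOT_COST_PER_HOUR, key=SPOT_COST_PER_HOUR.get)
```

Under that assumption the M2.4xlarge fleet comes out cheapest per GB of memory, which matches the last column of the table.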
Launch initial Spark cluster with core nodes
HDFS to store and checkpoint RDDs
Core nodes (HDFS): 32GB Memory
Task nodes: 256GB Memory
Load data from Amazon S3: sc.textFile OR sc.sequenceFile
Run Computation on RDDs
Save results to Amazon S3: rdd.saveAsObjectFile
Shutdown TaskNodes when your job is done
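The load, compute, and save steps above might look roughly like the following in PySpark. This is a sketch under several assumptions: the paths and the "ERROR" filter are hypothetical, and because `saveAsObjectFile` belongs to the Scala/Java RDD API, the Python version swaps in `saveAsPickleFile` as the closest equivalent.

```python
def run_job(sc, input_path, checkpoint_dir, output_path):
    """Load text from S3, compute on the RDD, and write the result back to S3.

    `sc` is an existing SparkContext. The paths are supplied by the caller.
    """
    # Checkpoints go to HDFS on the core nodes, which stay up for the whole job.
    sc.setCheckpointDir(checkpoint_dir)

    lines = sc.textFile(input_path)  # e.g. an S3 URI

    # Hypothetical computation: keep only lines that mention an error.
    errors = lines.filter(lambda line: "ERROR" in line)
    errors.checkpoint()      # must be requested before the first action runs
    count = errors.count()   # first action; also materializes the checkpoint

    # PySpark stand-in for rdd.saveAsObjectFile in the Scala API.
    errors.saveAsPickleFile(output_path)
    return count
```

On a cluster this would be called with a real context, for example `run_job(SparkContext(appName="demo"), "s3://my-bucket/in", "hdfs:///checkpoints", "s3://my-bucket/out")`, where the bucket and directory names are placeholders. Once the output is in S3, the task nodes hold nothing that needs to survive, which is why they can be shut down.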
Autoscaling Spark
Amazon EMR cluster
Master Node
HDFS
32GB Memory
256GB Memory
Elastic Spark
When to Scale?
Depends on your job
Basics on CloudWatch Metrics
Pick any CloudWatch metric
Pick a threshold at which you want to be notified when it is breached
Set up CloudWatch alarms based on your thresholds
Receive SNS notifications in the form of:
Email
SMS
HTTP API Call
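A minimal sketch of the alarm setup, using the parameter names of boto3's CloudWatch `put_metric_alarm` call. The metric (`HDFSUtilization`), the period, the number of evaluation periods, and the alarm name are illustrative assumptions. As noted earlier, the right metric and threshold depend on your job.

```python
def build_alarm_params(cluster_id, sns_topic_arn, threshold_percent):
    """Build keyword arguments for CloudWatch put_metric_alarm on an EMR cluster."""
    return {
        "AlarmName": f"emr-{cluster_id}-hdfs-utilization",  # hypothetical naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "HDFSUtilization",                    # assumed metric choice
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,                                      # seconds per data point
        "EvaluationPeriods": 2,                             # breaches needed in a row
        "Threshold": float(threshold_percent),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],                    # SNS topic to notify
    }


def create_alarm(params):
    """Register the alarm. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

The SNS topic named in `AlarmActions` is where the email, SMS, or HTTP subscriptions are attached.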
Monitor with CloudWatch -> Receive email notification -> Take manual action, such as adding more task nodes
Monitor with CloudWatch -> HTTP API calls -> Take automated actions
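The automated path could be sketched as a small handler behind the HTTP endpoint that SNS calls. The function names, the group name, and the decision rule (scale out whenever the alarm enters the ALARM state) are assumptions for illustration. A production endpoint would also need to confirm the SNS subscription and verify message signatures.

```python
import json


def should_scale_out(sns_body):
    """Return True when an SNS HTTP delivery carries a CloudWatch alarm in ALARM state."""
    envelope = json.loads(sns_body)
    if envelope.get("Type") != "Notification":
        # e.g. SubscriptionConfirmation, which needs separate handling
        return False
    alarm = json.loads(envelope["Message"])  # CloudWatch puts the alarm JSON here
    return alarm.get("NewStateValue") == "ALARM"


def add_task_nodes(cluster_id, instance_type, count):
    """Add a task instance group. Needs boto3 and AWS credentials, so it is not run here."""
    import boto3

    emr = boto3.client("emr")
    return emr.add_instance_groups(
        JobFlowId=cluster_id,
        InstanceGroups=[
            {
                "Name": "spark-task-nodes",  # hypothetical group name
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": count,
            }
        ],
    )
```

Keeping the decision (`should_scale_out`) separate from the action (`add_task_nodes`) makes the rule easy to test without touching a live cluster.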
Spark Metrics
Spark 0.8 provides cluster metrics
Source and Sink topology
Spark Metric Sources (metrics.properties):
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
Spark Metric Sinks (metrics.properties):
Package: org.apache.spark.metrics.sink
ConsoleSink
JmxSink
CsvSink
GangliaSink
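To show how sources and sinks fit together in one file, here is a small helper that renders a `metrics.properties` body, for instance from a bootstrap action. The helper itself is hypothetical. The property keys follow the `*.sink.<name>.class` and `<instance>.source.jvm.class` patterns used in Spark's `metrics.properties` template, and the console sink's period and unit values are example settings.

```python
SINK_PACKAGE = "org.apache.spark.metrics.sink"
JVM_SOURCE = "org.apache.spark.metrics.source.JvmSource"


def render_metrics_properties(sinks, instances=("worker", "driver", "executor")):
    """Render metrics.properties text.

    `sinks` maps a sink name to (class name, options dict),
    e.g. {"console": ("ConsoleSink", {"period": 10, "unit": "seconds"})}.
    """
    lines = []
    # Enable the JVM source on each listed Spark instance.
    for instance in instances:
        lines.append(f"{instance}.source.jvm.class={JVM_SOURCE}")
    # Attach each sink to all instances using the "*" wildcard prefix.
    for name, (class_name, options) in sinks.items():
        lines.append(f"*.sink.{name}.class={SINK_PACKAGE}.{class_name}")
        for key, value in options.items():
            lines.append(f"*.sink.{name}.{key}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Example settings only: report to the console every 10 seconds.
    print(render_metrics_properties(
        {"console": ("ConsoleSink", {"period": 10, "unit": "seconds"})}
    ))
```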
Amazon Kinesis
CreateStream: Creates a new Data Stream within the Kinesis Service
PutRecord: Adds new records to a Kinesis Stream
DescribeStream: Provides metadata about the Stream, including name, status, Shards, etc.
GetNextRecord: Fetches next record for processing by user business logic
MergeShard / SplitShard: Scales Stream up/down
DeleteStream: Deletes the Stream
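A short sketch of the producer side of these operations, written against boto3's Kinesis client method names (`put_record`, `describe_stream`). The client is passed in as a parameter so the functions can be exercised without AWS access. The stream name and partition keys are up to the caller. Note that the operation names in current SDKs differ slightly from the list above: reading is done with `get_shard_iterator` plus `get_records`, and merging with `merge_shards`.

```python
def stream_status(kinesis, stream_name):
    """Return the stream's status string (DescribeStream), e.g. 'ACTIVE'."""
    description = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]
    return description["StreamStatus"]


def put_events(kinesis, stream_name, events):
    """Send (partition_key, payload_bytes) pairs with PutRecord.

    Returns the shard ID each record was written to.
    """
    shard_ids = []
    for partition_key, payload in events:
        response = kinesis.put_record(
            StreamName=stream_name,
            Data=payload,
            PartitionKey=partition_key,  # decides which shard receives the record
        )
        shard_ids.append(response["ShardId"])
    return shard_ids
```

With real credentials, `kinesis` would be `boto3.client("kinesis")`. A newly created stream has to reach the ACTIVE status before records can be written to it.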
Spark Streaming Kinesis Receiver
Extends NetworkReceiver
Creates a single Receiver per shard and reads from Kinesis
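The one-reader-per-shard idea can be illustrated in plain Python with boto3-style calls. This is not the Spark Streaming receiver from the slides, only a sketch of the per-shard read pattern it is built on. It reads shards one after another rather than in parallel, starts from the oldest available record (`TRIM_HORIZON`), fetches a single batch per shard, and ignores pagination of the shard list.

```python
def read_each_shard(kinesis, stream_name, limit=100):
    """Fetch one batch of records from every shard, keyed by shard ID."""
    description = kinesis.describe_stream(StreamName=stream_name)["StreamDescription"]
    records_by_shard = {}
    for shard in description["Shards"]:
        shard_id = shard["ShardId"]
        # Each shard gets its own iterator, mirroring one receiver per shard.
        iterator = kinesis.get_shard_iterator(
            StreamName=stream_name,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )["ShardIterator"]
        batch = kinesis.get_records(ShardIterator=iterator, Limit=limit)
        records_by_shard[shard_id] = [record["Data"] for record in batch["Records"]]
    return records_by_shard
```

A long-running reader would keep calling `get_records` with the `NextShardIterator` value from each response instead of stopping after one batch.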
Misc.
New AWS instances provide enhanced networking in VPC
C3
I2
Higher PPS
Less Jitter
Great CPU Power
Suitable for Spark: Serialization and Shuffle