Vignesh Sukumar SVCC 2012

Building massive-scale, fault-tolerant job processing
systems with the Scala Akka framework
Vignesh Sukumar
SVCC 2012
About me
Storage group, Backend Engineering
at Box
Love enterprise software!
Interested in Big Data and building
distributed systems in the cloud
About Box
Leader in enterprise cloud
collaboration and storage
Cutting-edge work in backend,
frontend, platform and engineering
services
A really fun place to work: we have
a long slide!
Talk outline
Job processing requirements
Traditional & new models for job
processing
Akka actors framework
Achieving and controlling high IO
throughput
Fine-grained fault tolerance
Typical architecture in a
cloud storage environment
Practical realities
Storage nodes usually have varying
configurations (OS, processing power,
storage capacity, etc.), mainly because of
rapid evolution in provisioning operations
Some nodes are more overworked than
others (e.g., those accepting live
uploads)
Billions of files; petabytes
Job processing
requirements
Iterate over all files (billions, petabyte
scale): e.g., check consistency of all
files
High throughput
Fault tolerant
Secure
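The slides do not say how consistency is checked. As a minimal sketch, assuming that "consistent" means the on-disk bytes still match a SHA-256 checksum recorded in the metadata store, a per-file check could look like the following. The names `ConsistencyCheck`, `sha256Hex`, and `isConsistent` are hypothetical.

```scala
import java.nio.file.{Files, Path}
import java.security.MessageDigest

// Hypothetical helper: the talk does not show the real consistency check.
object ConsistencyCheck {
  /** Stream the file through SHA-256 and return the digest as lowercase hex. */
  def sha256Hex(path: Path): String = {
    val digest = MessageDigest.getInstance("SHA-256")
    val in = Files.newInputStream(path)
    try {
      val buf = new Array[Byte](64 * 1024)
      var n = in.read(buf)
      while (n != -1) {
        digest.update(buf, 0, n)
        n = in.read(buf)
      }
    } finally in.close()
    digest.digest().map(b => f"${b & 0xff}%02x").mkString
  }

  /** Assumption: the expected checksum comes from the metadata store. */
  def isConsistent(path: Path, expectedSha256: String): Boolean =
    sha256Hex(path).equalsIgnoreCase(expectedSha256)
}
```

Because the file is read in place, a check like this fits the requirement that files never leave the storage node.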
Traditional job
processing model
Why traditional models fail
in cloud storage
environments
Not scalable: petabyte scale, billions
of files
Insecure: cannot move files out of
storage nodes
No performance control: easy to
overwhelm any storage node
No fine-grained fault tolerance
Compute on Storage
Move job computation directly to
storage nodes
Utilize abundant CPU on storage
nodes
Metadata store still stays in a highly
available system like an RDBMS
Results from operations on a file are
completely independent
Master slave
architecture
Benefits
High IO throughput: direct access; no
transfer of files over a network
Secure: files do not leave storage nodes
Better performance control: compute
can easily monitor system load and
back off
Better fault tolerance handling: finer-grained
handling of errors
Master node
Responsible for accepting job
submissions and splitting them into
tasks for slave nodes
Stateful: keeps durable copy of jobs
and tasks in Zookeeper
Horizontally scalable: service can be
run on multiple nodes
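As an illustration of how a master might split a job into tasks, here is a minimal sketch. It assumes that each file lives on exactly one storage node and that a lookup function (standing in for the metadata store) maps a file to its node. The types `Job` and `Task`, the object `JobSplitter`, and the `batchSize` parameter are all hypothetical and do not come from the talk.

```scala
// Hypothetical types: the slides do not show the real job/task model.
final case class Job(id: String, files: Seq[String])
final case class Task(jobId: String, node: String, files: Seq[String])

object JobSplitter {
  /** Split a job into per-node tasks of at most `batchSize` files each.
    * `nodeOf` stands in for the metadata-store lookup (file -> storage node). */
  def split(job: Job, nodeOf: String => String, batchSize: Int): Seq[Task] = {
    require(batchSize > 0, "batchSize must be positive")
    job.files
      .groupBy(nodeOf)  // each task only touches files on one node
      .toSeq
      .sortBy(_._1)     // deterministic task order, by node name
      .flatMap { case (node, files) =>
        files.grouped(batchSize).map(batch => Task(job.id, node, batch))
      }
  }
}
```

Grouping by node first keeps every task local to one agent, which matches the compute-on-storage idea of never moving files off their node.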
Agent
Runs directly on the storage nodes in
a machine-independent JVM
container
Stateless: no task state is maintained
Monitors system load with back-off
Reports results directly to master
without synchronizing with other
agents
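The slides do not describe the back-off policy itself. The sketch below shows one possible shape, assuming the agent reads the JVM's one-minute system load average and waits longer as load per core rises. The object `LoadBackoff`, the 0.5 and 1.0 thresholds, and the 30-second maximum are illustrative assumptions.

```scala
import java.lang.management.ManagementFactory

// Hypothetical policy: thresholds and delays are assumptions, not values from the talk.
object LoadBackoff {
  /** Delay before starting the next task, given the load average and core count. */
  def delayMillis(loadAvg: Double, cores: Int, maxDelayMs: Long = 30000L): Long = {
    require(cores > 0, "cores must be positive")
    val perCore = loadAvg / cores
    if (perCore <= 0.5) 0L                            // lightly loaded: no back-off
    else if (perCore >= 1.0) maxDelayMs               // saturated: wait the maximum
    else ((perCore - 0.5) / 0.5 * maxDelayMs).toLong  // scale linearly in between
  }

  /** One-minute load average from the JVM; negative where the platform does not report it. */
  def currentLoad(): Double =
    ManagementFactory.getOperatingSystemMXBean.getSystemLoadAverage

  /** Delay for the current machine; an unavailable (negative) load yields no delay. */
  def currentDelayMillis(): Long =
    delayMillis(currentLoad(), Runtime.getRuntime.availableProcessors())
}
```

An agent could call `LoadBackoff.currentDelayMillis()` between tasks and sleep (or schedule its next message) for that long.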
Implementation
with the
Scala Akka
Actor
framework
Actors
Concurrent threads abstraction with
no shared state
Exchange messages
Asynchronous, non-blocking
Multiple actors can map to a single OS
thread
Parent-children hierarchical
relationship
Actors and messages
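import akka.actor.{Actor, ActorSystem, Props}

case object MsgType1 // message type handled below

class MyActor extends Actor {
  def receive = {
    case MsgType1 => // do something
  }
}

// instantiation and sending messages
val system = ActorSystem("example")
val actorRef = system.actorOf(Props(new MyActor))
actorRef ! MsgType1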
Agent Actor System
Achieving high IO
throughput
Parallel, asynchronous IO through
Futures
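import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global

val fileIOResult = Future {
  // issue high-latency tasks like file IO
}
val networkIOResult = Future {
  // read from network
}
// wait (up to <wait time>) for the combined result
Await.ready(Future.sequence(Seq(fileIOResult, networkIOResult)), <wait time>)
fileIOResult.foreach { _ => /* do something */ }
networkIOResult.failed.foreach { _ => /* retry */ }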
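The "retry" step is only hinted at on the slide. A minimal sketch of one way to retry a failed asynchronous operation with standard-library Futures follows. The helper `FutureRetry.withRetry` is hypothetical, and the sketch assumes the operation reports failure through the returned Future rather than by throwing synchronously.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical helper, not part of Akka or the talk.
object FutureRetry {
  /** Run `op`; if the resulting Future fails, run it again, up to `retries` more times. */
  def withRetry[T](retries: Int)(op: => Future[T])(implicit ec: ExecutionContext): Future[T] =
    op.recoverWith {
      // `op` is by-name, so each retry starts a fresh attempt
      case _ if retries > 0 => withRetry(retries - 1)(op)
    }
}
```

For example, `FutureRetry.withRetry(3) { Future { /* read from network */ } }` would make at most four attempts before the failure is propagated to the caller.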
Controlling system
throughput
The problem: agents need to throttle
themselves as storage nodes serve
live traffic
Adjust number of parallel workers
dynamically through a monitoring
service
Controlling throughput:
Examples
Parallelism parameters can be fetched
from a separate configuration service
on a per-node basis
Some machines can be sped up
and others slowed down this way
The configuration can be updated on
a cron schedule to speed up during
weekends
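To make the per-node idea concrete, here is a small sketch of how an agent might pick its worker count. It assumes the configuration service returns a weekday and a weekend setting per node; the type `NodeParallelism`, the object `Parallelism`, and the fallback of one worker for unknown nodes are all assumptions for illustration.

```scala
import java.time.DayOfWeek

// Hypothetical per-node settings, as a configuration service might return them.
final case class NodeParallelism(weekdayWorkers: Int, weekendWorkers: Int)

object Parallelism {
  private val weekend = Set(DayOfWeek.SATURDAY, DayOfWeek.SUNDAY)

  /** Number of parallel workers an agent on `node` should run on `day`.
    * Unknown nodes fall back to a single worker, the most conservative choice. */
  def workersFor(config: Map[String, NodeParallelism], node: String, day: DayOfWeek): Int =
    config.get(node) match {
      case Some(p) if weekend(day) => p.weekendWorkers
      case Some(p)                 => p.weekdayWorkers
      case None                    => 1
    }
}
```

With a configuration of `Map("storage-01" -> NodeParallelism(2, 8))`, the agent on `storage-01` would run 2 workers on a Monday and 8 on a Saturday.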
Fine-grained fault tolerance
with Supervisors
Parents of child actors can define
specific fault-handling strategies for
each failure scenario in their children
Components can fail gracefully
without affecting the entire system
Supervision strategy:
Examples
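import java.io.IOException
import java.sql.SQLException
import akka.actor.{Actor, OneForOneStrategy}
import akka.actor.SupervisorStrategy.{Restart, Resume, Stop}

class TaskActor extends Actor {
  // create child workers
  override val supervisorStrategy =
    OneForOneStrategy(maxNrOfRetries = 3) {
      case _: SQLException => Resume // retry the same file
      case _: FileCorruptionException => Stop // don't clobber it! (application-defined exception)
      case _: IOException => Restart // report and move on
    }
  // receive omitted, as on the original slide
}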
Unit testing
Scalatra test framework: very easy to read!
TaskActorTest.receive(BadFileMsg) must throwA[FileNotFoundException]
Mocks for network and database calls
val mockHttp = mock[HttpExecutor]
TaskActorTest ! doHttpPost
there was atLeastOne(mockHttp).POST
Extensive testing of failure injection
scenarios
Takeaways
Keep your architecture simple by modeling
actor message flow along the same paths as
parent-child actor hierarchy (i.e., no
message exchange between peer child
actors)
Design and implement for component
failures
Write unit tests extensively: we did not have
any breakage in fundamental
functionality
Box Engineering is awesome!
