
The Art of Infrastructure Elasticity

April 28th, 2012

Cloud Developer Conference 2012, Bangalore

Harish Ganesan, CTO and Co-Founder, 8KMiles

Agenda
Problem
Challenges
Requirements
Solution Architecture
Q&A



What is the problem scenario?

Big Sales Promotion every quarter by the Enterprise



Massive online Concurrent Visitors

Limited processing capacity of the Booking Engine (~3k requests/sec)

Unhappy visitors
More booking opportunities lost

Solution (Step 1): Create a Queuing App in front of the Booking Engine
Efficiently queue the concurrent visitors

Solution (Step 2): Moderate and move the visitors waiting in the Queuing App to the Booking Engine

What are the challenges?

Concurrency
HTTP/AJAX/REST requests
Total: 500+ million requests in 6 hours
Average: 23K+ requests/sec
Peak: 80K+ requests/sec
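
The average follows directly from the total and the window length. A quick check, using only the figures quoted on this slide, gives roughly 23,148 requests/sec, which is consistent with the 23K+ stated above:

```python
# Figures quoted on the slide.
TOTAL_REQUESTS = 500_000_000   # "500+ million requests"
WINDOW_HOURS = 6               # "in 6 hours"
PEAK_RPS = 80_000              # "Peak: 80K+ requests/sec"


def average_rps(total_requests: int, hours: float) -> float:
    """Average request rate over the promo window."""
    return total_requests / (hours * 3600)


avg = average_rps(TOTAL_REQUESTS, WINDOW_HOURS)
print(f"average: {avg:,.0f} requests/sec")
print(f"peak / average: {PEAK_RPS / avg:.2f}x")
```

The peak is about 3.5 times the average, so sizing for the average alone would leave the infrastructure short during bursts.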


Queue efficiency
Allot unique queue numbers to visitors
Allot queue numbers on a fair basis (as far as possible)
Reduce the wait time in the queue number allotment process
Reduce the overall queue wait time for the visitor
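
The uniqueness and fairness goals can be illustrated with a small sketch. It assumes a single process and a hypothetical `QueueNumberAllocator` class; it is not the deck's implementation, which spreads this work over RabbitMQ and background processors per sector. Numbers are handed out in arrival order, and a lock guarantees that no two visitors get the same one.

```python
import itertools
import threading


class QueueNumberAllocator:
    """Hands out unique, strictly increasing queue numbers in arrival order."""

    def __init__(self, start: int = 1) -> None:
        self._counter = itertools.count(start)
        self._lock = threading.Lock()

    def allot(self) -> int:
        # The lock makes "take the next number" a single step, so two
        # concurrent visitors can never receive the same number.
        with self._lock:
            return next(self._counter)


allocator = QueueNumberAllocator()
results = []


def visitor() -> None:
    results.append(allocator.allot())


# Simulate 100 visitors arriving concurrently.
threads = [threading.Thread(target=visitor) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(set(results)), "unique numbers allotted")
```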

Load Volatility
[Chart: compute utilization over a year. Peak utilization during promos; wasted capacity and complete underutilization of the infra at other times.]
Massive utilization and underutilization pattern


IP Whitelisting
Public cloud: the IP addresses of the source EC2 instances need to be whitelisted in the 3rd-party services gateway
3rd-party services: the Booking Engine needs EC2 IP whitelisting for security
A consecutive IP range is needed
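
A gateway-side whitelist check can be sketched with Python's standard `ipaddress` module. The range below is a documentation-only block (RFC 5737) chosen for illustration; the deck does not state the real addresses, and the helper names are hypothetical.

```python
import ipaddress

# Hypothetical range standing in for the consecutive block whitelisted
# at the 3rd-party gateway. A /28 covers 203.0.113.0 to 203.0.113.15.
WHITELISTED_RANGE = ipaddress.ip_network("203.0.113.0/28")


def is_whitelisted(ip: str) -> bool:
    """True if the address falls inside the whitelisted block."""
    return ipaddress.ip_address(ip) in WHITELISTED_RANGE


def is_consecutive(ips) -> bool:
    """True if the addresses form one unbroken run with no gaps."""
    values = sorted(int(ipaddress.ip_address(ip)) for ip in ips)
    return all(b - a == 1 for a, b in zip(values, values[1:]))


print(is_whitelisted("203.0.113.5"))
print(is_consecutive(["203.0.113.4", "203.0.113.5", "203.0.113.6"]))
```

A single consecutive block keeps the gateway's rule list to one entry, however many instances sit behind it.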


Variety of OS / Software
Red Hat OS for the Load Balancer, NoSQL and Queue layers
Apache Tomcat (Java) for the Web/App layer
CentOS for the processing programs
MySQL for result storage
Hadoop for analytics

What are the requirements from the enterprise?


Requirements
Elastic infrastructure
Create the infrastructure 2 hrs before the promo
Tear down the infrastructure 2 hrs after the promo
Elastically expand the infra during the promo
Highly scalable and available
Log analytics
Complete infrastructure automation

Solution Architecture


Solution Architecture
Option 1: Single Queue (initial thought)
[Diagram: concurrent visitors -> Queuing Application (single queue) -> Booking Engine]

Solution Architecture
Option 2: Parallel Queue (recommended)
[Diagram: concurrent visitors -> Queuing Application (parallel queues) -> Booking Engine]

Request types
A customer visit is an HTTP request to the Queuing Application
The visitor's current queue position is fetched with an AJAX call to the Queuing Application every X seconds
More wait ~ more calls
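
The "more wait ~ more calls" relationship can be made concrete with a little arithmetic. The deck never gives a value for X, so the 10-second interval and the wait times below are assumptions for illustration only.

```python
def calls_per_visitor(wait_seconds: float, poll_interval_seconds: float) -> int:
    """One HTTP request for the visit, plus one AJAX call per elapsed interval."""
    ajax_calls = int(wait_seconds // poll_interval_seconds)
    return 1 + ajax_calls


# Hypothetical numbers: X = 10 seconds, waits of 10 and 20 minutes.
for wait in (600, 1200):
    print(wait, "seconds waiting ->", calls_per_visitor(wait, 10), "requests")
```

Doubling the wait roughly doubles the requests a single visitor generates, which is why shortening the queue wait also lowers the load on the Queuing Application.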


Solution Step 1: The Cloud?
Amazon Web Services
We had 4+ years of architecture experience on AWS
It satisfied many of the customer requirements and challenges in this use case

Solution Step 2: R53/NW
[Diagram: users -> Amazon Route 53 -> EC2 instances in an Amazon Virtual Private Cloud, with VPC Subnet 1 in Availability Zone 1 and VPC Subnet 2 in Availability Zone 2]
Amazon VPC with Multi-AZ subnet configuration (HA)
Amazon Route 53 for managed DNS
DNS round-robin (RR) algorithm at Route 53
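
DNS round robin simply rotates through the registered addresses, one per lookup. The sketch below is a toy model of that rotation and does not call Route 53; the addresses are hypothetical, taken from the RFC 5737 documentation range.

```python
from itertools import cycle

# Hypothetical load balancer addresses registered under one DNS name.
records = ["203.0.113.10", "203.0.113.11", "203.0.113.12"]


def round_robin(addresses):
    """Return an endless iterator that yields the addresses in rotation."""
    return cycle(addresses)


resolver = round_robin(records)
answers = [next(resolver) for _ in range(6)]
print(answers)
```

Each address receives the same share of lookups, which is what lets several load balancers sit behind a single name.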


Solution Step 3: Load Balancing
[Diagram: users -> Amazon Route 53 -> HAProxy EC2 Instance 1 and HAProxy EC2 Instance 2 in VPC Subnet 1, Availability Zone 1, inside an Amazon Virtual Private Cloud. Each HAProxy instance is an m1.large with EBS volumes and an Elastic IP, and uses the round-robin algorithm.]


Solution Step 3: Load Balancing
HAProxy vs Amazon ELB
Custom programs to auto scale HAProxy
HAProxy elasticity -> attach / detach from Route 53
HAProxy IP whitelisting in the 3rd-party gateway
16 HAProxy instances, 2 AZs, 2 subnets
RR load balancing algorithm

Solution Step 4: Web/App Servers
[Diagram: users -> Amazon Route 53 -> HAProxy EC2 Instance 1 (round-robin algorithm) -> Web/App EC2 Instances 1, 2 and 3 in VPC Subnet 1, Availability Zone 1, inside an Amazon Virtual Private Cloud. Each Web/App instance is a c1.xlarge with EBS volumes and an Elastic IP.]


Solution Step 4: Web/App Servers
3 Web/App instances under every HAProxy
c1.xlarge instance type for Web/App instances
Custom programs to auto scale the c1.xlarge instances
Automatic attach / detach from HAProxy
Every Web/App instance has an EIP for IP whitelisting
48 Web/App EC2 instances spread across 2 AZs
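
One way to picture the automatic attach / detach is a program that regenerates the HAProxy backend section whenever the set of Web/App instances changes. The sketch below is an assumption about how that could look, not the deck's Java code: the function name, server labels, private IPs and port 8080 are all hypothetical, while `backend`, `balance roundrobin` and `server ... check` are standard HAProxy configuration keywords.

```python
def render_backend(name, servers, port=8080):
    """Render an HAProxy backend stanza that round-robins over `servers`.

    `servers` maps a server label to its IP address.
    """
    lines = [f"backend {name}", "    balance roundrobin"]
    for label, ip in sorted(servers.items()):
        lines.append(f"    server {label} {ip}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical Web/App instances under one HAProxy.
servers = {
    "webapp1": "10.0.1.11",
    "webapp2": "10.0.1.12",
    "webapp3": "10.0.1.13",
}

servers.pop("webapp2")             # detach an instance
servers["webapp4"] = "10.0.1.14"   # attach its replacement

config = render_backend("webapp_sector1", servers)
print(config)
```

Writing the file and reloading HAProxy are left out; the point is only that attach and detach reduce to editing one mapping and re-rendering.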

Solution Step 5: Queue Servers
[Diagram: users -> Amazon Route 53 -> HAProxy EC2 Instance 1 (round-robin algorithm) -> Web/App 1, 2 and 3 -> RabbitMQ (m1.large with EBS volumes) in VPC Subnet 1, Availability Zone 1, inside an Amazon Virtual Private Cloud]


Solution Step 5: Queue Servers
RabbitMQ vs Amazon SQS
FIFO / concurrency / no duplicate messages
1 RabbitMQ instance for queuing in every sector
m1.large instance type
16 RabbitMQ instances overall


Solution Step 6: Processors/Redis
Single Sector View
[Diagram: Amazon Route 53 -> HAProxy (round-robin algorithm) -> Web/App 1, 2 and 3 -> RabbitMQ -> Processors -> Redis Master and Redis Slave -> Processors -> Booking Engine]

Components of a single sector
1. One HAProxy
2. Three Web/App
3. One RabbitMQ
4. Two BG Processor nodes
5. Two Redis
Sector is not an AWS term; it is 8KMiles' term for the logical EC2 instance groups in this use case

Solution Step 6: Redis
Redis vs Amazon DynamoDB
Redis: NoSQL key-value data store
Visitors are shown their current queue position every X seconds from Redis
1 Redis master-slave pair for every sector
m1.large instance type for Redis
32 Redis instances overall


Solution Step 6: Processors
BG Processors: Java programs to
RabbitMQ -> Redis: allot queue numbers to visitor requests and insert them into Redis
Redis -> Booking Engine: moderate the movement of queued visitors from Redis to the Booking Engine
Process the response status / booking status / inactive visitors / timeouts
2 BG Processor nodes per sector
CPU intensive: c1.xlarge instance type
32 BG Processor instances overall
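
The two processor roles can be sketched in memory. This is a simplified stand-in under stated assumptions: a `deque` plays the part of RabbitMQ, a dict plays the part of Redis, and the `SectorQueue` class and its method names are hypothetical. It ignores inactive visitors and timeouts, which the real processors also handle.

```python
from collections import deque


class SectorQueue:
    """In-memory stand-in for one sector's RabbitMQ -> Redis -> Booking Engine flow."""

    def __init__(self):
        self.incoming = deque()   # stands in for RabbitMQ (FIFO)
        self.positions = {}       # stands in for Redis: visitor -> queue number
        self.next_number = 1
        self.released = 0         # visitors already moved to the Booking Engine

    def enqueue(self, visitor_id):
        self.incoming.append(visitor_id)

    def allot_numbers(self):
        """RabbitMQ -> Redis: give each waiting request the next queue number."""
        while self.incoming:
            visitor = self.incoming.popleft()
            self.positions[visitor] = self.next_number
            self.next_number += 1

    def release(self, batch_size):
        """Redis -> Booking Engine: move at most `batch_size` visitors, lowest number first."""
        ready = sorted(self.positions, key=self.positions.get)[:batch_size]
        for visitor in ready:
            del self.positions[visitor]
        self.released += len(ready)
        return ready

    def position(self, visitor_id):
        """Current place in line, counting this visitor."""
        return self.positions[visitor_id] - self.released


queue = SectorQueue()
for visitor in ["a", "b", "c", "d", "e"]:
    queue.enqueue(visitor)
queue.allot_numbers()
print(queue.release(2), queue.position("d"))
```

Capping `batch_size` is what keeps the flow into the Booking Engine within its limited processing capacity, while the stored queue number lets each visitor's position be reported on every poll.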

Overall Solution Architecture
[Diagram: Amazon Route 53 -> Sector 1 .. Sector 16 -> Booking Engine. Each sector contains HAProxy, Web/App, RabbitMQ, Redis and BG programs.]
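
The per-tier totals quoted on the earlier slides follow from multiplying the per-sector counts by the 16 sectors. The check below takes two BG Processor nodes per sector, as the Processors slide states; the grand total of 144 is derived here and is not a figure from the deck.

```python
SECTORS = 16

# Instances per sector, as listed on the step slides.
PER_SECTOR = {
    "HAProxy": 1,
    "Web/App": 3,
    "RabbitMQ": 1,
    "Redis": 2,          # master + slave
    "BG Processor": 2,   # per the Processors slide
}

totals = {role: count * SECTORS for role, count in PER_SECTOR.items()}
grand_total = sum(totals.values())

for role, count in totals.items():
    print(f"{role}: {count}")
print("all sectors:", grand_total)
```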

Scalability
[Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud spanning AZ-1 and AZ-2. Sectors 1 to 4 are groups of EC2 instances placed in VPC Subnet 1 (Availability Zone 1) and VPC Subnet 2 (Availability Zone 2).]

Scalability
New sectors containing the LB, Web, Queue, NoSQL and BG stack are created automatically depending on the load
The same AZ or multiple AZs can be specified for the creation
CloudWatch custom parameters are used
Automated Java programs were used for sector creation
No manual intervention needed
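
The scale-out decision can be reduced to a capacity calculation. The sketch below is an assumption, not the deck's Java program: the per-sector capacity of 5,000 requests/sec and the 20% headroom are hypothetical numbers, and the function names are made up for illustration.

```python
import math


def sectors_needed(current_rps, sector_capacity_rps, headroom=0.2):
    """Sectors required to serve `current_rps` with some spare headroom."""
    required = current_rps * (1 + headroom) / sector_capacity_rps
    return max(1, math.ceil(required))


def sectors_to_create(current_rps, sector_capacity_rps, running_sectors):
    """How many new sectors to launch; never negative."""
    needed = sectors_needed(current_rps, sector_capacity_rps)
    return max(0, needed - running_sectors)


# Hypothetical capacity of 5,000 requests/sec per sector.
print(sectors_needed(23_000, 5_000))
print(sectors_to_create(80_000, 5_000, running_sectors=16))
```

In practice the load figure would come from a monitoring metric, such as the CloudWatch custom parameters mentioned above, rather than a constant.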


High Availability @ Instance level
[Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud spanning AZ-1 and AZ-2, with groups of EC2 instances in VPC Subnet 1 (Availability Zone 1) and VPC Subnet 2 (Availability Zone 2)]

High Availability @ Instance level
HA built at the Web/App, Redis and BG Processor instances
Any failed / non-responsive EC2 instances are automatically detected and replaced by Java programs
No manual intervention needed
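
The detect-and-replace loop can be sketched as a single pass over the fleet. The health check and the launcher are passed in as plain functions so the example stays self-contained; both are stubs here, and the names are hypothetical rather than taken from the deck's Java programs.

```python
def replace_failed(instances, is_healthy, launch_replacement):
    """Return the fleet with every unhealthy instance swapped for a new one."""
    fleet = []
    for instance in instances:
        if is_healthy(instance):
            fleet.append(instance)
        else:
            # A real program would terminate the old instance here as well.
            fleet.append(launch_replacement(instance))
    return fleet


# Stubbed health check and launcher for illustration.
down = {"webapp2"}
fleet = replace_failed(
    ["webapp1", "webapp2", "webapp3"],
    is_healthy=lambda name: name not in down,
    launch_replacement=lambda name: name + "-replacement",
)
print(fleet)
```

Run on a timer, a pass like this keeps the fleet at full strength without anyone having to step in.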


High Availability @ Sector level
[Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud spanning AZ-1 and AZ-2. Sectors 1 to 6 are groups of EC2 instances spread across VPC Subnet 1 (Availability Zone 1) and VPC Subnet 2 (Availability Zone 2).]

High Availability @ Sector level
Any failed / non-responsive instances inside sectors are automatically detected and replaced by Java programs
If sector 3 fails, the other sectors remain active and can take requests
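
Sector-level failover amounts to leaving the failed sector out of the rotation. A minimal sketch, with hypothetical sector names and a made-up `route` helper:

```python
def route(request_ids, sectors, failed):
    """Spread requests round-robin over the sectors that are still healthy."""
    healthy = [s for s in sectors if s not in failed]
    if not healthy:
        raise RuntimeError("no healthy sectors left")
    return {req: healthy[i % len(healthy)] for i, req in enumerate(request_ids)}


sectors = ["sector-1", "sector-2", "sector-3", "sector-4"]
assignment = route(range(6), sectors, failed={"sector-3"})
print(assignment)
```

The remaining sectors each take a slightly larger share until the failed one is repaired or replaced.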


High Availability @ AZ level
[Diagram: Amazon Route 53 in front of an Amazon Virtual Private Cloud spanning AZ-1 and AZ-2, with groups of EC2 instances in VPC Subnet 1 (Availability Zone 1) and VPC Subnet 2 (Availability Zone 2)]

High Availability @ AZ level
If the entire AZ-2 fails, the load is balanced to the instances in AZ-1
Automated programs create new sectors inside AZ-1 to handle the load


Log Analytics
[Diagram: EC2 instances -> S3 bucket with logs -> Elastic MapReduce jobs (HDFS cluster) -> RDS MySQL]
Redis, Web/App, HAProxy and RabbitMQ logs are synced to S3
Elastic MapReduce jobs process / analyze the logs
Processed results are moved to RDS MySQL for reports / visualizations
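
The shape of such a log job can be shown with a tiny map and reduce over a few lines. The log layout, paths and timestamps below are hypothetical, and the code runs locally rather than on Elastic MapReduce; it only illustrates the map-then-aggregate pattern.

```python
from collections import Counter

# Hypothetical access-log lines in a simplified "<timestamp> <path> <status>" layout.
LOG_LINES = [
    "2012-01-01T10:00:01 /queue/join 200",
    "2012-01-01T10:00:01 /queue/position 200",
    "2012-01-01T10:00:02 /queue/position 200",
    "2012-01-01T10:00:02 /queue/position 503",
]


def map_line(line):
    """Map step: emit a ((path, status), 1) pair for one log line."""
    _timestamp, path, status = line.split()
    return (path, status), 1


def reduce_counts(pairs):
    """Reduce step: sum the counts for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


report = reduce_counts(map_line(line) for line in LOG_LINES)
for (path, status), count in sorted(report.items()):
    print(path, status, count)
```

The resulting rows are the kind of compact summary that fits a relational table for reporting.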

Monitoring
Nagios + Puppet (combined) for monitoring and deployment of the auto-scaled infra

CloudWatch custom metrics / Tomcat Valve / automated Java programs for EC2


Backup
No backups -> only syncs to S3
Golden AMI snapshots to S3
Periodic sync of data between EC2 and S3
Periodic log sync from Web/App to S3

Infrastructure
Amazon Route 53
Amazon VPC with public and private subnets
150+ EC2 instances, 2 AZs, 1 region
70+ Elastic IPs
200+ EBS volumes
S3 buckets
Suite of monitoring tools
1 Puppet server
Amazon CloudWatch
Amazon CloudFront


Infrastructure Elasticity
Entire infra created 2 hrs before the promo
Infra torn down 2 hrs after the promo
~30 mins to launch the infra in AWS
~45 mins to tear down
Automated failure detection / rectification
Automated programs for infra creation
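
These timings can be turned into a simple schedule. The sketch assumes that creation starts 2 hrs before the promo and teardown starts 2 hrs after it ends; the slide does not say whether those marks are start or finish times. The promo start and end used here are hypothetical, with the 6-hour length borrowed from the Concurrency slide.

```python
from datetime import datetime, timedelta

LEAD = timedelta(hours=2)               # infra created 2 hrs before the promo
LAG = timedelta(hours=2)                # torn down 2 hrs after the promo
LAUNCH_TIME = timedelta(minutes=30)     # ~30 mins to launch
TEARDOWN_TIME = timedelta(minutes=45)   # ~45 mins to tear down


def schedule(promo_start, promo_end):
    """Key timestamps for one promo, under the assumptions stated above."""
    create_at = promo_start - LEAD
    teardown_at = promo_end + LAG
    return {
        "create_at": create_at,
        "ready_by": create_at + LAUNCH_TIME,
        "teardown_at": teardown_at,
        "gone_by": teardown_at + TEARDOWN_TIME,
    }


def hours_running(plan):
    """Total hours from the start of creation to the end of teardown."""
    return (plan["gone_by"] - plan["create_at"]).total_seconds() / 3600


# Hypothetical 6-hour promo.
plan = schedule(datetime(2012, 1, 1, 10, 0), datetime(2012, 1, 1, 16, 0))
for name, when in plan.items():
    print(name, when.isoformat())
print("hours running:", hours_running(plan))
```

Under these assumptions the infrastructure exists for under 11 hours per promo, rather than sitting idle between promos.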


Infrastructure Cost
~10K USD per promo
Not inclusive of data charges
Unthinkable savings
Visitor experience was good
More bookings per promo

The power of elasticity is simply priceless
AWS is AWSome


Need help architecting highly elastic solutions on AWS?

Leave it to the experts; we will handle it

Cloud Architecture Consulting
Cloud Application Development
Cloud Migration & Implementation
Cloud Adoption Strategy

Let's get the job done

Q&A
8KMiles
harish@8kmiles.com
http://in.linkedin.com/in/harishganesan
www.twitter.com/harish11g
harish11g@gmail.com
http://harish11g.blogspot.com

Amazon Web Services
aws.amazon.com
aws.amazon.com/contact-us/aws-sales
