
Big Data = Big Maths
Laurence Liew
General Manager, APAC

Who we are
Leading provider of commercial analytics platform based on open source R statistical computing language

Our Software Delivers
- Power: Distributed, scalable high performance advanced analytics
- Productivity: Easier to build and deploy analytic applications
- Enterprise Readiness: Multi-platform

Our Services Deliver
- Knowledge: Our experts enable you to be experts
- Time-to-Value: Our QuickStart projects give you a jumpstart
- Guidance: Our customer support team is here to help you

Our Philosophy
- Customer-centric innovation
- Easy to do business with

Customers
- 200+ Global 2000

Global Presence
- North America / EMEA / APAC

Global Industries Served
- Financial Services
- Digital Media
- Health & Life Sciences
- Government
- High Tech
- Manufacturing
- Retail
- Telco

Revolution Confidential

200 Corporate Customers and Growing


Finance & Insurance

Academic & Govt

Healthcare & Life Sciences

Consumer & Info Svcs

Manuf & Tech

Centre of Excellence (COE)
- Partner with iLEs to create new IPs in big data analytics in Singapore
- Big data analytics training/workshops
- Our data scientists and developers will work alongside our collaboration partners

Centre of Attachment (COA)
- To accelerate the formation of a data science team within an organization
- Analytics/statistics skills
- Big data infrastructure skills such as Hadoop and HPC clusters

Why Big Data Now?

THE PERFECT STORM


CONVERGENCE OF

Backdrop - Massive Data Volumes
[Figure: data volumes on a scale from Gigabytes through Terabytes and Petabytes to Exabytes. Data sources shown: 3D/4D seismic; systems logs; ERP cost records; realtime telemetry; vehicle monitoring; logistics; summary operating statistics; machine sensors; geospatial (ESRI); incidents and alarms; daily activity reports; communication logs; video and imagery; text instructions, workorders and reports.]
Increasing Volume, Variety and Velocity
Source: Decision Management Solutions, 2013

What's big data?
Volume, Variety, Velocity

Next Generation Big Data Analytics Players

???
ANALYTICS
HDD -> SSD -> In-Memory
INFRASTRUCTURE AND DATABASES

What is R (Video)

http://www.youtube.com/watch?feature=player_embedded&v=TR2bHSJ_eck


R = Language + Analytics
- Statistical data analysis programming language
- Huge library of algorithms for data access, manipulation, analysis & graphics

Data Analytics Workflow
INGEST -> DISTILL & ANALYZE -> CONSUME

R is open source and drives analytic innovation, but has some limitations for Enterprises

Area | Open source R | Revolution R Enterprise
Big Data | In memory bound | Disk based scalability
Speed of Analysis | Single threaded | Parallel threading
Enterprise Readiness | Community support | Commercial support
Analytic Breadth & Depth | 4500+ innovative analytic packages | Leverage open source packages plus Big Data ready packages
Commercial Viability | Risk of deployment of open source | Commercial License
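The "in memory bound" versus "disk based scalability" contrast can be illustrated with a small chunked computation. The sketch below is plain Python and is not Revolution R Enterprise code; the function names, the `delay` column and the sample values are hypothetical. It computes a column mean while holding only one chunk of rows in memory at a time, which is the general idea behind processing data that does not fit in RAM.

```python
import csv
import io
from itertools import islice


def read_chunks(handle, chunk_rows):
    """Yield lists of at most chunk_rows parsed rows from a CSV file handle."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_rows))
        if not chunk:
            return
        yield chunk


def chunked_mean(handle, column, chunk_rows=2):
    """Mean of one numeric column, keeping only one chunk in memory at a time."""
    count, total = 0, 0.0
    for chunk in read_chunks(handle, chunk_rows):
        count += len(chunk)
        total += sum(float(row[column]) for row in chunk)
    return total / count


# Hypothetical sample data; a real run would pass an open file on disk instead.
sample = io.StringIO("delay\n10\n20\n30\n40\n50\n")
mean_delay = chunked_mean(sample, "delay")
print(mean_delay)
```

Because only running totals survive between chunks, memory use depends on the chunk size rather than on the size of the file.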

Big Data Speed @ Scale with Revolution R Enterprise
In-Hadoop Execution
In-Database Execution
Parallelized User Code
Parallelized Algorithms
Multi-Core Processing
Multi-Threaded Execution
Memory Management
Fast Math Libraries


Revolution R Enterprise ScaleR Performance and Capacity


SAS HPA Benchmarking comparison*
Logistic Regression

Metric | SAS HPA | Comparison | Revolution R Enterprise
Platform | 32 nodes appliance, ~ $2.5M | | 5 nodes Linux HPC cluster, ~ $30K
Rows of data | 1 billion | | 1 billion
Parameters | just a few | Double | 7
Time | 80 seconds | 45% | 44 seconds
Data location | In memory | | On disk
Nodes | 32 | 1/6th | 5
Cores | 384 | 5% | 20
RAM | 1,536 GB | 5% | 80 GB

Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM.

Revolution R Enterprise Delivers Performance at 2% of the Cost

*As published by SAS in HPC Wire, April 21, 2011
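The ratios quoted on this slide can be checked with a few lines of arithmetic. The sketch below is illustrative Python; the dictionary keys are made-up names, and the numbers are the approximate figures shown on the slide, not independent measurements.

```python
# Figures copied from the benchmark slide; key names are illustrative only.
sas = {"time_s": 80, "nodes": 32, "cores": 384, "ram_gb": 1536, "cost_usd": 2_500_000}
rre = {"time_s": 44, "nodes": 5, "cores": 20, "ram_gb": 80, "cost_usd": 30_000}


def ratios(new, old):
    """Return new/old for every metric the two systems share."""
    return {key: new[key] / old[key] for key in old}


r = ratios(rre, sas)
for name, value in r.items():
    print(f"{name}: {value:.1%} of the SAS HPA figure")
```

On these inputs the run time comes out at 55% of the SAS figure, which matches the slide's 45% if that number is read as time saved. Cores and RAM are each about 5%, nodes about 1/6th, and the hardware cost about 1.2%, in line with the slide's rounded "2% of the cost".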

Benchmarks: RevoR vs legacy tool


Airline data set: 123,534,969 rows and 29 columns in its original state.
All tests were run on a laptop: 16GB RAM, SSD, and i7-3632QM CPU@2.2GHz.


Allstate compares SAS and R for Big Data Insurance Models
150 million observations and 70 degrees of freedom.

Approach | Platform | Time to fit
1: SAS | 16-core Sun Server | 5 hours
2: R | 250 GB Server | Impossible (> 3 days)
3: RRE | 5-node (4 cores / node) LSF cluster | 5.7 minutes

So what have we learned:
- SAS works, but is slow.
- The data is too big for open-source R, even on a very large server.
- Revolution R Enterprise gets the same results as SAS, but about 50x faster.

"It's difficult to be productive on a tight schedule if it takes over 5 hours to fit one candidate model!"
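The "about 50x faster" figure follows from the two fit times reported for SAS and RRE. A minimal check in Python, with the function and variable names chosen only for illustration:

```python
def speedup(baseline_minutes, candidate_minutes):
    """How many times faster the candidate is than the baseline."""
    return baseline_minutes / candidate_minutes


sas_minutes = 5 * 60  # "5 hours" on the 16-core Sun server
rre_minutes = 5.7     # "5.7 minutes" on the 5-node LSF cluster
factor = speedup(sas_minutes, rre_minutes)
print(f"RRE fit time is roughly {factor:.1f}x shorter than SAS")
```

The quotient is a little under 53, which the slide rounds to "about 50x". The comparison is between different hardware platforms, so it describes the reported end-to-end fit times rather than a like-for-like software benchmark.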

Write Once. Deploy Anywhere.
DESIGNED FOR SCALE, PORTABILITY & PERFORMANCE
- Hadoop: Hortonworks, Cloudera, Intel
- EDW: Teradata
- Clustered Systems: Linux HPC, Windows HPC
- Workstations & Servers: Desktop, Server, Linux

In the Cloud

CloudR

DeployR
ConnectR
ScaleR
DistributedR


From Laptops, Workstations and Servers


To High Performance Compute Clusters
Storage Node
- 2-way or NAS
- external SCSI OR
- external SAN
- FAST HDD
- Cluster FS
Supercomputing Network
- High Bandwidth (>250MB/s)
- Low Latency (1.2-8us)
- Cost effective: GE
- Performance:
  - Infiniband
  - 1GE or 10 GE
  - NumaConnect

Frontend
- 2-way or 4-way
- Cluster Management
- Fast HDD
- Lots of RAM

Admin Network
- Good Bandwidth
- Route admin traffic
- Typically : GE

Compute Node
- 2-way or 4-way
- Computations
- Fast CPU
- Fast HDD
- Lots of RAM

To Hadoop-scale on-disk analytics
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers.
1 node = 12TB
10 nodes = 120TB
100 nodes = 1.2PB
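The node counts above scale linearly at 12TB per node. The Python sketch below reproduces that arithmetic. It also shows, as an assumption that is not on the slide, how much space would remain under a block replication factor of 3 (the HDFS default) if the 12TB figure is raw disk. All function names are illustrative.

```python
TB_PER_NODE = 12  # storage per node, the figure quoted on the slide


def raw_capacity_tb(nodes, tb_per_node=TB_PER_NODE):
    """Total capacity in TB, assuming identical nodes."""
    return nodes * tb_per_node


def usable_capacity_tb(nodes, replication=3, tb_per_node=TB_PER_NODE):
    """Capacity left after replication (assumption: every block stored 3 times)."""
    return raw_capacity_tb(nodes, tb_per_node) / replication


def pretty(tb):
    """Format a TB figure, switching to PB from 1000 TB upward."""
    return f"{tb / 1000:g}PB" if tb >= 1000 else f"{tb:g}TB"


for n in (1, 10, 100):
    raw = pretty(raw_capacity_tb(n))
    usable = pretty(usable_capacity_tb(n))
    print(f"{n} nodes: {raw} total, about {usable} after 3x replication")
```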

To in-database


To massive in memory SMP clusters
SMP Server specs:
- 8 nodes
- 16 CPUs
- 256 cores
- 1TB RAM
- One Linux instance!

"We first became interested in shared memory to simplify the programming paradigm... I think we are going to see a lot more people looking at this type of environment."
- William W. Thigpen, Chief, Engineering Branch, NASA Advanced Supercomputing (NAS) Division

To Cloud


Write Once. Deploy Anywhere.


Hadoop + R


Hadoop

Dell PowerEdge Servers



Linear Regression in Java for MapReduce

Data setup

Mapper

Reducer

Total: ~ 100 lines of Java code (excluding setup)
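The Java listing from this slide did not survive as text, so the following is only a conceptual sketch of the mapper/reducer split, written in Python rather than Java. It fits a one-predictor linear regression by having each mapper emit partial sums for its chunk and a reducer add them up. The function names and toy data are hypothetical and are not taken from the presentation.

```python
from functools import reduce


def mapper(chunk):
    """Emit partial sums (n, Sx, Sy, Sxx, Sxy) for one chunk of (x, y) rows."""
    n = sx = sy = sxx = sxy = 0.0
    for x, y in chunk:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    return (n, sx, sy, sxx, sxy)


def reducer(a, b):
    """Combine two partial-sum tuples element by element."""
    return tuple(u + v for u, v in zip(a, b))


def solve(stats):
    """Least-squares intercept and slope from the combined sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Toy data lying on y = 2x + 1, split into chunks that stand in for HDFS blocks.
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
    [(4.0, 9.0)],
]
intercept, slope = solve(reduce(reducer, map(mapper, chunks)))
print(intercept, slope)
```

Only the five running sums travel between the map and reduce steps, so the data itself never has to be gathered in one place. A real MapReduce job adds job configuration, input parsing and serialization around this core, which is where most of the line count goes.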

Linear Regression with RevoR on a Hadoop Cluster!

Total: ~ 2 lines of R code, 50 times the productivity

RevoR with Hadoop

Complex & Basic analytics


Big Analytics on Big Data in Hadoop

100% R on Hadoop
- Full Skill Transfer - No Java needed.
- Use 4500+ CRAN Packages 100% R.
- Blend / Combine R & Other Tools / Methods

100% Portability
- Build Once Deploy Many
- Track Evolution of Hadoop Portability.
- Protect Against Platform Uncertainty
- Avoid Platform Lock-ins

Hadoop Performance & Scale
- Leverage Hadoop Parallelism Easily

[Diagram labels: Analytics Applications; Hadoop; Big Data Scale; Scalable Compute; Hive; Data; HBase; HDFS; Parallel Storage]

Analyze Data Without Moving It

RRE V7 inside Hadoop
[Architecture diagram. Labels: Analytics Applications; Edge Node; DeployR; Hadoop; MapReduce; Other MapReduce Jobs; Revolution R Enterprise (ScaleR Algorithms, DistributedR Framework, ConnectR: HBase, HDFS, ODBC & High-Speed Connectors); Data: DB, EDW, M2M; HDFS; HBase]

So how do I start?


www.bigdatastarterkit.com

www.bigdataconsumerkit.com


Q & A
Revolution Analytics is the leading commercial provider of software and support for the popular open source R statistics language.
E: Laurence.liew@revolutionanalytics.com
W: www.revolutionanalytics.com